How to install the web crawler tool Scrapy on Ubuntu 14.04 LTS
Add Date : 2018-11-21
Scrapy is an open source tool for extracting data from websites. The Scrapy framework is written in Python, which makes crawling quick, easy, and extensible. For this tutorial we created a virtual machine (VM) in VirtualBox and installed Ubuntu 14.04 LTS on it.

Installing Scrapy

Scrapy depends on Python, the Python development libraries, and pip. The latest version of Python comes pre-installed on Ubuntu, so we only need to install pip and the Python development libraries before installing Scrapy.

pip is a replacement for easy_install as the Python package installer; it is used to install and manage Python packages. Install the pip package with:

sudo apt-get install python-pip


We must also install the Python development libraries with the following command. If this package is missing, the Scrapy installation will fail with an error about the python.h header file.

sudo apt-get install python-dev
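If you are unsure whether the development headers ended up where your interpreter expects them, the standard-library sysconfig module reports the include directory that should contain Python.h. This quick check is not part of the original tutorial, just a convenience:

```python
import os
import sysconfig

# Directory where this interpreter looks for its C headers (Python.h).
include_dir = sysconfig.get_paths()["include"]
print(include_dir)

# After installing python-dev, Python.h should exist in that directory.
print(os.path.exists(os.path.join(include_dir, "Python.h")))
```

If the second line prints False, the python-dev package (for this interpreter version) is not installed correctly.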


The Scrapy framework can be installed either from a deb package or from source. Here we install it with pip (the Python package manager):

sudo pip install scrapy


The Scrapy installation takes some time to complete.

Using the Scrapy framework to extract data

Basic tutorial

We will use Scrapy to extract store names (of shops selling gift cards) from fatwallet.com. First, create a new Scrapy project named store_name with the following command:

$ sudo scrapy startproject store_name


The command above creates a "store_name" directory in the current path. List the files and directories under the main project folder with:

$ sudo ls -lR store_name


Each file/folder is summarized below:

scrapy.cfg: the project configuration file
store_name/: a second store_name folder under the project root; it holds the project's Python source
store_name/items.py: defines the items the spider will scrape
store_name/pipelines.py: the pipelines file
store_name/settings.py: the project's settings file
store_name/spiders/: contains the spiders used for crawling

Since we want to extract store names from fatwallet.com, we modify the file as follows (LCTT annotation: the original does not say which file; the translator assumes it is items.py).

import scrapy

class StoreNameItem(scrapy.Item):
    name = scrapy.Field()  # holds the store name taken from the card shop listing
Next we write a new spider in the project's store_name/spiders/ folder. A spider is a Python class that must have the following attributes:

The spider's name (name)
The start URLs for crawling (start_urls)
A parse method that extracts the desired content from the response, using regular expressions where needed. The parse method is the most important part of a crawler.
We created a spider named "storename.py" in the store_name/spiders/ directory and added the following code to extract store names from fatwallet.com. The spider writes its output to a file (StoreName.txt).

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
import re

class StoreNameSpider(BaseSpider):
    name = "storename"
    allowed_domains = ["fatwallet.com"]
    start_urls = ["http://fatwallet.com/cash-back-shopping/"]

    def parse(self, response):
        output = open('StoreName.txt', 'w')
        resp = Selector(response)
        # Select every variant of the store listing rows.
        tags = resp.xpath('//tr[@class="storeListRow"] | \
            //tr[@class="storeListRow even"] | \
            //tr[@class="storeListRow even last"] | \
            //tr[@class="storeListRow last"]').extract()
        for i in tags:
            i = i.encode('utf-8', 'ignore').strip()
            store_name = ''
            if re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S):
                # Narrow the row down to the store-name cell, then to the
                # text between the '>' and '<' tag boundaries.
                store_name = re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S).group()
                store_name = re.search(r'>.*?<', store_name, re.I | re.S).group()
                store_name = re.sub(r'>', '', re.sub(r'<', '', store_name))
                # Decode the HTML-escaped ampersand.
                store_name = re.sub(r'&amp;', '&', store_name)
                # print store_name
                output.write(store_name + "\n")

The output of the spider (the extracted store names) ends up in StoreName.txt.
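The chain of regular expressions in the parse method above can be tried outside Scrapy. The snippet below runs the same cleanup steps on a made-up storeListRow fragment (the HTML row and the store name are invented for illustration; only the standard-library re module is needed):

```python
import re

# A made-up table row in the shape the spider's XPath would return.
row = '<tr class="storeListRow"><td class="storeListStoreName">Acme &amp; Co</td></tr>'

store_name = ''
if re.search(r'class="storeListStoreName">.*?<', row, re.I | re.S):
    # Grab from the class attribute up to the next tag boundary...
    store_name = re.search(r'class="storeListStoreName">.*?<', row, re.I | re.S).group()
    # ...keep only the text between '>' and '<'...
    store_name = re.search(r'>.*?<', store_name, re.I | re.S).group()
    store_name = re.sub(r'>', '', re.sub(r'<', '', store_name))
    # ...and decode the HTML-escaped ampersand.
    store_name = re.sub(r'&amp;', '&', store_name)

print(store_name)  # → Acme & Co
```

Because the first pattern is non-greedy, it only works when the store name directly follows the storeListStoreName attribute; nested tags would need a proper HTML parser instead.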

Note: this tutorial is only intended as an introduction to the Scrapy framework.
  CopyRight 2002-2020 newfreesoft.com, All Rights Reserved.