Scrapy is an open-source tool for extracting data from websites. Built with Python, the Scrapy framework makes writing crawlers quick, easy, and scalable. For this tutorial, we created a virtual machine (VM) in VirtualBox and installed Ubuntu 14.04 LTS on it.
Scrapy depends on Python, the Python development libraries, and pip. The latest version of Python comes pre-installed on Ubuntu, so we only need to install pip and the Python development libraries before installing Scrapy.
pip is an alternative to easy_install for installing and managing Python packages. Install pip with:
sudo apt-get install python-pip
Next, install the Python development libraries with the following command. If this package is missing, the Scrapy installation will fail with an error about the python.h header file.
sudo apt-get install python-dev
Python development libraries
Scrapy can be installed from a deb package or from source; here we use pip (the Python package manager) to install it.
sudo pip install scrapy
Installing Scrapy takes a little time to complete.
Successfully installed Scrapy framework
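Before moving on, it is worth verifying that Scrapy can be imported; a quick sanity check is to print its version tuple from Python (the exact numbers depend on the release pip installed):
$ python -c "import scrapy; print scrapy.version_info"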
Using the Scrapy framework to extract data
We will extract the names of stores (shops that sell cards) from fatwallet.com using Scrapy. First, we create a new Scrapy project named "store_name" with the following command:
$ sudo scrapy startproject store_name
A new Scrapy project
The above command creates a "store_name" directory in the current path. The files and directories under the main project folder are shown in Figure 6.
$ sudo ls -lR store_name
Contents of the store_name project
A summary of each file/folder:
scrapy.cfg: the project configuration file
store_name/: another folder inside the main directory; it contains the project's Python source code
store_name/items.py: defines the items the spider will crawl
store_name/pipelines.py: the pipeline file
store_name/settings.py: the project's settings file
store_name/spiders/: contains the spiders used for crawling
Since we want to extract store names from fatwallet.com, we modify the file as follows (LCTT translator's note: the original does not say which file to modify; it should be items.py):
name = scrapy.Field()  # the store name to extract
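For context, a complete items.py wraps this field in an item class. A minimal sketch could look like the following (the class name StoreNameItem is our own choice, not taken from the original tutorial):

import scrapy

class StoreNameItem(scrapy.Item):
    # the store name the spider will extract from each row
    name = scrapy.Field()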
Next, we write a new spider in the project's store_name/spiders/ folder. A spider is a Python class that must define several attributes:
The spider's name (name)
The URLs where crawling starts (start_urls)
A parse method that uses regular expressions (or selectors) to extract the desired content from the response; the parse method does the core work of a crawler.
We created "storename.py" reptiles at storename / spiders / directory, and add the following code to extract its name from the fatwallet.com. Output crawler written to a file (StoreName.txt) as shown in Figure 7.
import re

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest

class StoreNameSpider(BaseSpider):
    name = "storename"
    allowed_domains = ["fatwallet.com"]
    start_urls = ["http://fatwallet.com/cash-back-shopping/"]

    def parse(self, response):
        output = open('StoreName.txt', 'w')
        resp = Selector(response)
        # Select every variant of the store-list table row
        tags = resp.xpath('//tr[@class="storeListRow"] | \
            //tr[@class="storeListRow even"] | \
            //tr[@class="storeListRow even last"] | \
            //tr[@class="storeListRow last"]').extract()
        for i in tags:
            i = i.encode('utf-8', 'ignore').strip()
            store_name = ''
            if re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S):
                store_name = re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S).group()
                store_name = re.search(r'>.*?<', store_name, re.I | re.S).group()
                store_name = re.sub(r'[<>]', "", store_name)    # strip the tag brackets
                store_name = re.sub(r'&amp;', "&", store_name)  # decode HTML-escaped ampersands
                output.write(store_name + "\n")
        output.close()
Output of the Spider code.
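The tutorial does not show it, but the standard way to run the spider is Scrapy's crawl command, executed from the project's root directory with the spider name defined above; StoreName.txt is then written to the directory the command is run from:
$ scrapy crawl storename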
Note: this tutorial is intended only to help you understand the Scrapy framework.