
Part 2 - Web Scraping with Scrapy - Pagination, Debugging Spiders, Building Crawlers - Demystified

Welcome back to another blog on Scrapy. The first part of the blog can be found here. In this blog, we will see how to use Scrapy to scrape multiple web pages. We will also see how to debug spiders, and then build crawlers using Scrapy. The Scrapy documentation can be found here. Let's get started.


Scraping Multiple Pages:

To scrape multiple pages, we will use the tinydeal.com website, using its archived copy at https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html. Before scraping, we need to understand how the website behaves, how the site functions without JavaScript, and whether we are allowed to scrape it. For that last check, open the robots.txt file at the base of the site and search for the path we want (here, specials); if it is not disallowed, we can scrape it. Sometimes, even if a path is not restricted in robots.txt, we may still get blocked; this can be worked around with a proxy-rotating service. Robots.txt cannot be trusted fully either way, and whether scraping is permitted also depends on the country's laws regarding web scraping.
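As a side note, this robots.txt check can also be scripted. Below is a minimal sketch using Python's standard urllib.robotparser; the live tinydeal.com URLs are placeholders for illustration only, since in this post we actually scrape the archived copy.

from urllib.robotparser import RobotFileParser

# robots.txt always lives at the base of the site
robots = RobotFileParser()
robots.set_url("https://www.tinydeal.com/robots.txt")
robots.read()

# True means the path is not disallowed for our user agent ("*")
print(robots.can_fetch("*", "https://www.tinydeal.com/specials.html"))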


When we disable JavaScript, the images do not load, but we are not scraping images here anyway.

As discussed in the previous blog, switch in the shell to the folder where the new project needs to be created. Run "scrapy startproject tinydeal", then "cd tinydeal", then "scrapy genspider special_offers web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html". Scrapy uses the HTTP protocol by default in the generated URL. Open the tinydeal folder created under the projects folder in VSCode.
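For reference, the basic template generates a skeleton roughly like the one below; the exact allowed_domains and start_urls values usually need a small manual cleanup when the URL contains a path, which is what the code in the next step assumes.

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        # genspider leaves the parsing logic to us
        pass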


1. First, let's scrape the first page only. We will scrape the product's title, URL, discounted price, and original price.

settings.py: Add this line at the end: FEED_EXPORT_ENCODING = 'utf-8' # fixes encoding issue

Execute using: scrapy crawl special_offers -o tinydeal_dataset.json

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        # Each product is an <li> under the product listing <ul>
        for product in response.xpath("//ul[@class='productlisting-ul']/div/li"):
            yield {
                'product': product.xpath(".//a[@class='p_box_title']/text()").get(),
                'url': response.urljoin(product.xpath(".//a[@class='p_box_title']/@href").get()),
                'discounted_price': product.xpath(".//div[@class='p_box_price']/span[1]/text()").get(),
                'original_price': product.xpath(".//div[@class='p_box_price']/span[2]/text()").get()
            }

2. Now, let's deal with pagination - handling multiple pages.

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        for product in response.xpath("//ul[@class='productlisting-ul']/div/li"):
            yield {
                'product': product.xpath(".//a[@class='p_box_title']/text()").get(),
                'url': response.urljoin(product.xpath(".//a[@class='p_box_title']/@href").get()),
                'discounted_price': product.xpath(".//div[@class='p_box_price']/span[1]/text()").get(),
                'original_price': product.xpath(".//div[@class='p_box_price']/span[2]/text()").get()
            }

        next_page = response.xpath("//a[@class='nextPage']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)


3. Next, let's check and override the User-Agent request header.

When we open the Scrapy shell on this URL, the list of available objects includes request.

Type request.headers or response.request.headers --> the User-Agent header shows who actually sent the request: Scrapy, a browser, a script, etc. Some websites may block requests whose user agent is Scrapy's default. We can override this.

As can be seen below, the user agent here is Scrapy's default.
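(Indicative Scrapy shell output, not a screenshot from the post; the exact version string depends on the installed Scrapy version.)

>>> response.request.headers['User-Agent']
b'Scrapy/2.8.0 (+https://scrapy.org)'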

In the browser, go to Developer tools --> Network tab, make sure "All" is selected, press Ctrl+R to reload the page, and click a request; under the request headers we can see a Mozilla-style user agent, as seen below.

To change the user agent, go to settings.py --> uncomment the USER_AGENT setting and replace its value with the one copied from the browser. This changes it for every request, but it is not considered the best approach. Instead, we can add the same 'User-Agent' key inside the {} of DEFAULT_REQUEST_HEADERS in the settings.py file.
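A minimal settings.py sketch of both options (the user agent string is just an example copied from a browser):

# Option 1: override the user agent for the whole project
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'

# Option 2: set it (along with any other headers) in DEFAULT_REQUEST_HEADERS
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}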


Instead, for the starting request and every subsequent request in the spider, we can pass the User-Agent header explicitly, as shown below.

With the following code, the user agent is also yielded as part of each item, so we can verify which one was actually sent:

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    # start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def start_requests(self):
        yield scrapy.Request(
            url='https://web.archive.org/web/20190324163700/http://www.tinydeal.com/specials.html',
            callback=self.parse,
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'})

    def parse(self, response):
        for product in response.xpath("//ul[@class='productlisting-ul']/div/li"):
            yield {
                'product': product.xpath(".//a[@class='p_box_title']/text()").get(),
                'url': response.urljoin(product.xpath(".//a[@class='p_box_title']/@href").get()),
                'discounted_price': product.xpath(".//div[@class='p_box_price']/span[1]/text()").get(),
                'original_price': product.xpath(".//div[@class='p_box_price']/span[2]/text()").get(),
                'User_Agent': response.request.headers['User-Agent']
            }

        next_page = response.xpath("//a[@class='nextPage']/@href").get()
        if next_page:
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'})


Try Scraping Exercise at this URL: https://www.glassesshop.com/bestsellers

Solution: here


Debugging Crawlers:

Things that can go wrong when using Scrapy include an incorrect XPath or CSS selector, for which searching online resources may not help much. Bugs can be logical or syntactical, so be careful with indentation in Python. We will see how to execute the spider step by step and keep track of the changes happening, to catch any logical error. The official guide helps too --> https://docs.scrapy.org/en/latest/topics/debug.html.

We will see different techniques to debug:

  1. Parse command: $ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>

-d here represents depth. With depth set to 2, the spider follows requests only two levels deep from the given URL and does not go any further down the chain of links.

Taking the worldometers example from the previous blog, when we execute:

scrapy parse --spider=countries -c parse_country https://www.worldometers.info/world-population/india-population/

-c here specifies the callback function. We get a KeyError for country_name, because country_name is normally passed through the request meta in the spider and is not available when parse_country is invoked this way. We can fix this by passing the country_name meta argument, as below:

scrapy parse --spider=countries -c parse_country --meta='{\"country_name\":\"India\"}' https://www.worldometers.info/world-population/india-population/

We get data scraped for Country India as below:




2. Scrapy Shell:

We can invoke the Scrapy shell from the spider itself. Add from scrapy.shell import inspect_response at the top, and then in the parse_country method use only this line: inspect_response(response, self).
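In context, the change is just this (a partial sketch; the rest of the CountriesSpider from the previous blog stays unchanged):

from scrapy.shell import inspect_response

# ... inside CountriesSpider ...
    def parse_country(self, response):
        # Pause the crawl here and open an interactive Scrapy shell
        # preloaded with this response and the spider instance (self)
        inspect_response(response, self)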

In the terminal, run "scrapy crawl countries". When the crawl pauses and the shell opens, type response.body to inspect the raw HTML, or view(response) to open the response in the browser.


3. Open in browser:

import scrapy
from scrapy.utils.response import open_in_browser


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        yield response.follow(
            url="https://www.worldometers.info/world-population/india-population/",
            callback=self.parse_country,
            meta={'country_name': "India"})

    def parse_country(self, response):
        # logging.info(response.url)
        open_in_browser(response)

When we run "scrapy crawl countries", the response will be opened in the browser.


4. Logging module:

import logging

In the parse_country method, add:

logging.info(response.status)

logging.warning(response.status)
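A partial sketch of the method with both calls in place (the rest of the spider stays as before):

import logging

# ... inside CountriesSpider ...
    def parse_country(self, response):
        # Log the HTTP status code of the response at two different levels
        logging.info(response.status)
        logging.warning(response.status)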

When we run the spider with these modifications, the log messages show up in the console output.

5. Manual way:

Create a file runner.py under the project root directory.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from worldometers.spiders.countries import CountriesSpider

process = CrawlerProcess(settings=get_project_settings())
process.crawl(CountriesSpider)
process.start()


Set a breakpoint in countries.py at the start of the parse_country method.

With runner.py open in VSCode, go to Debug --> Start Debugging --> select Python File --> debugging starts.

Execution pauses at the breakpoint in countries.py, and we can inspect the current variable values at that point.

We can also add variables to the WATCH section.


Why's and When's of Web Scraping:

Why Web Scraping:

  • Web Scraping is used in Data Analysis, which depends heavily on large datasets. The more data you have, the more accurate your analysis can be.

  • Machine Learning also requires huge amounts of data. The more data you have, the more your system can learn!

  • Lead Generation is where you scrape contact details such as phone numbers and email addresses and sell them online.

  • Real estate listings, built by scraping real estate data.

  • Price monitoring, where we can get notified when a price drops to a specific amount.

  • Stock market tracking

  • Drop shipping - scraping products from e-commerce websites like Amazon and eBay, then selling them online without holding inventory.

When to use (or not use) Web Scraping:

  • Terms of services & the Robots.txt

  • Does the website have a public API?

  • Does the API have any limitations?

  • Does the API provide all the data you want?

  • Is the API free or paid? If it is paid, web scraping can be the alternative.

Web Scraping Challenges:

The stability of a spider is tightly coupled to, and entirely dependent on, the website we are going to scrape. Website changes can break XPath and CSS selectors.

For example, when the spider was first created, the site may not have used JavaScript; if the site later starts rendering content with JavaScript, the spider breaks because we did not use Splash or Selenium.


The spider you write today has a high chance of not working tomorrow. Websites are updated frequently and can also shut down their service entirely.


PROJECT:

For the project, we will use a CrawlSpider on the Wayback Machine URL http://web.archive.org/web/20200715000935if_/https://www.imdb.com/search/title/?groups=top_250&sort=user_rating instead of https://www.imdb.com. Not all web pages are archived, though.


By default, the genspider command uses the basic template, which subclasses scrapy.Spider. This gives us name, allowed_domains, start_urls, start_requests(self), and parse(self, response), among others. "scrapy genspider -l" lists the available templates: basic, crawl, csvfeed, and xmlfeed.


Create a new project:

scrapy startproject imdb

cd imdb
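The post jumps straight to best_movies.py, so the spider-generation step is implied; a sketch of the commands that would create it from the crawl template (the spider name and domain are taken from the code below):

scrapy genspider -l                     # lists templates: basic, crawl, csvfeed, xmlfeed
scrapy genspider -t crawl best_movies web.archive.org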


Open imdb folder in VSCode.

best_movies.py: The Spider extends CrawlSpider now.

rules = (
    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
)

rules is a tuple, which is immutable, and has one Rule to start with. The Rule's callback should not be named parse (CrawlSpider uses parse internally), which is why the template names it 'parse_item'. The follow parameter in the Rule tells the spider whether to keep following links extracted from the pages this Rule matches.

LinkExtractor(allow=r'Items/') --> allow takes a regular expression; only links matching 'Items/' are extracted.

Other options: deny, and restrict_xpaths=("//a[@class='active']") --> adding /@href to the XPath is optional, since LinkExtractor extracts the href from the matched <a> elements itself.

Using CSS selectors: restrict_css=(list of CSS selectors).

rules can have any number of Rule objects, and each Rule can be responsible for following a certain set of links, as sketched below.
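Putting those options together, a hedged sketch of a hypothetical CrawlSpider (the selectors and domain here are illustrative only, not part of the IMDb project below):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # allow: follow links whose URL matches this regular expression
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        # restrict_xpaths: only extract links found inside the matched elements
        Rule(LinkExtractor(restrict_xpaths="//a[@class='active']"), callback='parse_item'),
        # restrict_css: the same idea using CSS selectors
        Rule(LinkExtractor(restrict_css=("a.next",)), follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}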


Step 1: scrapy crawl best_movies

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['web.archive.org']
    start_urls = ['http://web.archive.org/web/20200715000935if_/https://www.imdb.com/search/title/?groups=top_250&sort=user_rating']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)

Step 2: yield the scraped data in the parse_item method.

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='title_wrapper']/h1/text()").get().strip('\xa0'),
            'year': response.xpath("//span[@id='titleYear']/a/text()").get(),
            'duration': response.xpath("normalize-space((//time)[1]/text())").get(),
            'genre': response.xpath("//div[@class='subtext']/a[1]/text()").get(),
            'rating': response.xpath("//span[@itemprop='ratingValue']/text()").get(),
            'movie_url': response.url
        }

Step 3: Pagination: to handle multiple pages, add this Rule to the rules tuple.

Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"), process_request='set_user_agent')

Note: follow=True is the default value, and the order of Rule objects matters. When the spider visits the next page, the first Rule is applied again automatically, so this pagination Rule does not need a callback function.


Spoofing Request Headers: change the request headers, specifically the User-Agent.

Search for "my user agent" on Google to get your local user agent value.


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['web.archive.org']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

    def start_requests(self):
        yield scrapy.Request(
            url='http://web.archive.org/web/20200715000935if_/https://www.imdb.com/search/title/?groups=top_250&sort=user_rating',
            headers={'User-Agent': self.user_agent})

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"),
             callback='parse_item', follow=True, process_request='set_user_agent'),
        Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"),
             process_request='set_user_agent')
    )

    def set_user_agent(self, request):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='title_wrapper']/h1/text()").get(),
            'year': response.xpath("//span[@id='titleYear']/a/text()").get(),
            'duration': response.xpath("normalize-space((//time)[1]/text())").get(),
            'genre': response.xpath("//div[@class='subtext']/a[1]/text()").get(),
            'rating': response.xpath("//span[@itemprop='ratingValue']/text()").get(),
            'movie_url': response.url
        }

Try the exercise on 'http://books.toscrape.com/'.

Solution: here

Conclusion:

In this blog we have seen how to use Scrapy to scrape multiple web pages, how to debug spiders, and how to build crawlers with Scrapy. Crawlers break occasionally owing to site changes; to handle this to some extent, especially when a site starts rendering content with JavaScript, we can use Splash or Selenium.


Happy Learning Web Scraping!!!
