We have seen Scrapy and Splash for Web Scraping so far; Part 3 of the blog can be found here. In this blog, we will see how to use Selenium, and Selenium with Scrapy, for Web Scraping. We will also see how to use pipelines to store scraped data in MongoDB and SQLite3 databases, and then move on to API scraping. Also covered are examples of logging in to websites using Scrapy and Splash, and bypassing Cloudflare protection.
Like Splash, Selenium is used to scrape JavaScript-enabled pages. Selenium is beginner friendly and easy to understand, so many people prefer it over Splash for Web Scraping, even though Selenium is primarily a testing/automation tool rather than a dedicated scraping tool.
Steps to work with Selenium for Web Scraping:
1. Install Selenium: pip install selenium OR conda install selenium
2. Download the Chrome web driver (chromedriver) matching your installed Chrome version.
3. Create a folder selenium_basics and open it in VS Code. Extract the Chrome web driver and place it in this folder.
4. Create a file basics.py.
from shutil import which
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_path = which("chromedriver")

# Use the chromedriver found on PATH; alternatively: webdriver.Chrome('./chromedriver')
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

driver.get("https://duckduckgo.com")

search_input = driver.find_element(By.XPATH, "(//input[contains(@class, 'js-search-input')])[1]")
search_input.send_keys("My User Agent")

search_btn = driver.find_element(By.ID, "search_button_homepage")
# search_btn.click()
search_btn.send_keys(Keys.ENTER)

print(driver.page_source)
driver.close()
Execute file in terminal using: python .\basics.py
Methods available to locate elements:
driver.find_element_by_id("search_form_input_homepage")
driver.find_element_by_class_name()
driver.find_element_by_tag_name("h1")
driver.find_element_by_xpath()
driver.find_element_by_css_selector()
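Note that in Selenium 4 these find_element_by_* helpers are deprecated (and removed in recent releases) in favour of find_element with a By locator, which is what the code above already uses. A rough mapping (my sketch, with placeholder selectors) looks like this:

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the helpers listed above (driver is the webdriver instance):
driver.find_element(By.ID, "search_form_input_homepage")
driver.find_element(By.CLASS_NAME, "some-class-name")          # placeholder class name
driver.find_element(By.TAG_NAME, "h1")
driver.find_element(By.XPATH, "//input[@id='search_form_input_homepage']")
driver.find_element(By.CSS_SELECTOR, "#search_form_input_homepage")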
Selenium Web Scraping:
When using Selenium inside a Scrapy spider, we don't have the usual callback/response mechanism, so we wrap the Selenium page source in a Scrapy Selector object and use that in place of the response.
driver.set_window_size(1920, 1080) # helps to get all items on the page scraped
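As an alternative to resizing the window, an explicit wait can make sure the coin rows are actually present before we grab page_source. Below is a minimal sketch (my addition, not from the original post) that reuses the archive URL and the filterPanelItem___2z5Gb class name from the spider that follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome("./chromedriver")
driver.set_window_size(1920, 1080)
driver.get("https://web.archive.org/web/20200116052415/https://www.livecoin.net/en")

# Wait up to 10 seconds for the filter tabs to be present before reading the page source.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "filterPanelItem___2z5Gb"))
)
html = driver.page_source
driver.close()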
CODE:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy_splash import SplashRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium.webdriver.common.by import By
from scrapy.selector import Selector

class CoinSpiderSelenium(scrapy.Spider):
    name = 'coin_selenium'
    allowed_domains = ['web.archive.org']
    start_urls = [
        'https://web.archive.org/web/20200116052415/https://www.livecoin.net/en'
    ]

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_path = which("chromedriver")
        # driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
        driver = webdriver.Chrome("./chromedriver")
        driver.set_window_size(1920, 1080)
        driver.get("https://web.archive.org/web/20200116052415/https://www.livecoin.net/en")
        rur_tab = driver.find_elements(By.CLASS_NAME, "filterPanelItem___2z5Gb")
        rur_tab[4].click()
        self.html = driver.page_source
        driver.close()

    def parse(self, response):
        resp = Selector(text=self.html)
        for currency in resp.xpath("//div[contains(@class, 'ReactVirtualized__Table__row tableRow___3EtiS ')]"):
            print("HelloHERE", currency.xpath(".//div[1]/div/text()").get())
            print("Hello123", currency.xpath(".//div[2]/span/text()").get())
            yield {
                'currency pair': currency.xpath(".//div[1]/div/text()").get(),
                'volume(24h)': currency.xpath(".//div[2]/span/text()").get()
            }

Pagination:
To handle pagination, we will use the scrapy-selenium package: pip install scrapy-selenium
For this, we will use website https://slickdeals.net/computer-deals/
Create a Scrapy project named slickdeals, with a spider for that site.
Copy configuration settings from here https://github.com/clemfromspace/scrapy-selenium.
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # use '-headless' if using firefox (geckodriver) instead of chrome
DOWNLOADER_MIDDLEWARES = {
'scrapy_selenium.SeleniumMiddleware': 800
}
example.py: create the example spider with scrapy genspider example example.com
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
class ExampleSpider(scrapy.Spider):
name = 'example'
def start_requests(self):
yield SeleniumRequest(
url='https://duckduckgo.com',
wait_time=3,
screenshot=True,
callback=self.parse
)
def parse(self, response):
# img = response.meta['screenshot']
# with open('screenshot.png', 'wb') as f:
# f.write(img) # gets only initial response screenshot
driver = response.meta['driver']
search_input = driver.find_element(By.XPATH, "(//input[contains(@class, 'js-search-input')])[1]")
search_input.send_keys('Hello World')
# driver.save_screenshot('after_filling_input.png')
search_input.send_keys(Keys.ENTER)
html = driver.page_source
response_obj = Selector(text=html)
links = response_obj.xpath("//div[@class='result__extras__url']/a")
for link in links:
yield {
'URL': link.xpath(".//@href").get()
}
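scrapy-selenium's SeleniumRequest also accepts a wait_until argument, so instead of sleeping for a fixed wait_time the driver can wait for a specific element. A hypothetical variant of the spider above (my sketch; the XPath is the same search-input selector):

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class ExampleWaitSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate wait_until.
    name = 'example_wait'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://duckduckgo.com',
            wait_time=10,
            # Return control as soon as the search input is present instead of always waiting 10 seconds.
            wait_until=EC.presence_of_element_located(
                (By.XPATH, "(//input[contains(@class, 'js-search-input')])[1]")
            ),
            callback=self.parse
        )

    def parse(self, response):
        self.log(response.url)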
Next, we handle pagination in the spider below:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_selenium import SeleniumRequest

class ComputerdealsSpider(scrapy.Spider):
    name = 'computerdeals'

    def remove_characters(self, value):
        return value.strip('\xa0')

    def start_requests(self):
        yield SeleniumRequest(
            url='https://slickdeals.net/computer-deals/',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("//ul[@class='bp-p-filterGrid_items']/li")
        for product in products:
            yield {
                'name': product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                'link': product.xpath(".//a[@class='bp-c-card_title bp-c-link']/@href").get(),
                'store_name': self.remove_characters(product.xpath("normalize-space(.//span[@class='bp-c-card_subtitle']/text())").get()),
                'price': product.xpath("normalize-space(.//div[@class='bp-c-card_content']/span/text())").get()
            }

        next_page = response.xpath("//a[@data-page='next']/@href").get()
        if next_page:
            absolute_url = f"https://slickdeals.net{next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse
            )
Working with Pipelines:
To export scraped data to a database, we write code in the pipelines.py file.
A pipeline class can define open_spider(self, spider) and close_spider(self, spider), each called once when the spider starts and finishes respectively, and process_item(self, item, spider), which is called for every item scraped.
settings.py : uncomment the following lines
ITEM_PIPELINES = { "slickdeals.pipelines.SlickdealsPipeline": 300, }
MONGO_URI="Hello World"
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import logging

class SlickdealsPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Runs before open_spider; gives access to settings such as MONGO_URI
        logging.warning(crawler.settings.get("MONGO_URI"))
        return cls()

    def open_spider(self, spider):
        logging.warning("SPIDER OPENED from PIPELINE")

    def close_spider(self, spider):
        logging.warning("SPIDER CLOSED from PIPELINE")

    def process_item(self, item, spider):
        return item

When we run the spider, the MONGO_URI value set in settings.py is logged before the open_spider message, since from_crawler runs first.
Using the cloud option (MongoDB Atlas), sign in to MongoDB, then create a cluster --> Build a Cluster --> choose M0 Sandbox (free cluster) --> Create Cluster.
In Terminal, conda install pymongo dnspython -y
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import logging
import pymongo

class MongodbPipeline(object):

    collection_name = "example"

    def open_spider(self, spider):
        # replace <password> with the real database user password
        self.client = pymongo.MongoClient("mongodb+srv://vanisuruvu:<password>@vanicluster.xkm7tbs.mongodb.net/?retryWrites=true&w=majority")
        self.db = self.client["SlickDeals"]  # database is created on first write

    def close_spider(self, spider):
        # close connection to database
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))  # insert() is removed in newer pymongo versions
        return item

Give the cluster a database user with a password, and grant that user permission to read and write data to the database.
In the cluster's Security settings, under Network Access, add the IP address 0.0.0.0/0 (allow access from anywhere).
To connect to the cluster: select the cluster created --> Connect --> Connect your Application --> select the Python driver and version --> copy the connection string and paste it into pipelines.py. Then change settings.py:
ITEM_PIPELINES = { "slickdeals.pipelines.MongodbPipeline": 300, }
When we run the crawler with scrapy crawl computerdeals, we can see the data populated in the MongoDB collection. Do not use special characters such as @ in the MongoDB password (or URL-encode them in the connection string).
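After a crawl, we can quickly read a few documents back to confirm the pipeline worked. A small sketch (my addition), assuming the same connection string, the SlickDeals database and the example collection used in the pipeline above:

import pymongo

# Replace <password> with the real database user password (same URI as in the pipeline).
client = pymongo.MongoClient(
    "mongodb+srv://vanisuruvu:<password>@vanicluster.xkm7tbs.mongodb.net/?retryWrites=true&w=majority"
)
db = client["SlickDeals"]

# Print the first few stored items and a total count.
for doc in db["example"].find().limit(5):
    print(doc)
print("total items:", db["example"].count_documents({}))

client.close()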
To connect to SQLite3, we don't need to install any extra package; the sqlite3 module ships with Python.
Install the SQLite extension in VS Code to view the database file that gets created.
import sqlite3
class SQLitePipeline(object):

    collection_name = "example"

    # @classmethod
    # def from_crawler(cls, crawler):
    #     logging.warning(crawler.settings.get("MONGO_URI"))
    #     return crawler

    def open_spider(self, spider):
        self.connection = sqlite3.connect("slackdeals.db")
        self.c = self.connection.cursor()
        try:
            self.c.execute('''
                CREATE TABLE best_movies(
                    link TEXT,
                    name TEXT,
                    price TEXT,
                    store_name TEXT
                )
            ''')
            self.connection.commit()
        except sqlite3.OperationalError:
            # table already exists
            pass

    def close_spider(self, spider):
        # close connection to database
        self.connection.close()

    def process_item(self, item, spider):
        # note: the values tuple must be passed as the second argument of execute()
        self.c.execute('''
            INSERT INTO best_movies(link, name, price, store_name)
            VALUES(?,?,?,?)
        ''', (
            item.get('link'),
            item.get('name'),
            item.get('price'),
            item.get('store_name')
        ))
        self.connection.commit()
        return item
Update ITEM_PIPELINES in settings.py with "slickdeals.pipelines.SQLitePipeline": 300. After installing the extension, open the database file via the SQLite: Open Database command; in the SQLite Explorer panel we can then browse the database and see the table that was created.
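Besides the VS Code extension, a few lines of Python can verify that the rows were inserted. A quick sketch (my addition), assuming the slackdeals.db file and the best_movies table created by the pipeline above:

import sqlite3

connection = sqlite3.connect("slackdeals.db")
cursor = connection.cursor()

# Show how many rows were stored and a small sample.
cursor.execute("SELECT COUNT(*) FROM best_movies")
print("rows stored:", cursor.fetchone()[0])

cursor.execute("SELECT name, price, store_name FROM best_movies LIMIT 5")
for row in cursor.fetchall():
    print(row)

connection.close()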
Scraping APIs:
We won't scrape HTML markup here, so no XPath or CSS selectors; instead, we parse the JSON object returned by the site's API into Python.
We will scrape http://quotes.toscrape.com/scroll. The page is dynamically loaded using JavaScript: there is no pagination, but more quotes load as we scroll down. Even so, we don't have to use Selenium or Splash to handle the JavaScript.
Open Developer Tools (Ctrl+Shift+I), go to the Network tab, apply the XHR filter and refresh the page (Ctrl+R). Open the first request: the Headers tab shows the Request URL, and the Preview tab shows a JSON object of key-value pairs, where the quotes key holds the actual data. Note that the API URL is different from the actual website URL.
Create project in Scrapy: scrapy startproject demo_api, cd demo_api, scrapy genspider quotes quotes.toscrape.com , code . (opens VSCode from command prompt)
Open folder in VSCode.
Copy Request URL from browser: http://quotes.toscrape.com/api/quotes?page=1
We can use has_next field to track if there are more pages.
# -*- coding: utf-8 -*-
import scrapy
import json
class QuotesSpider(scrapy.Spider):
name = 'quotes'
allowed_domains = ['quotes.toscrape.com']
start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']
def parse(self, response):
resp = json.loads(response.body)
quotes = resp.get('quotes')
for quote in quotes:
yield {
'author': quote.get('author').get('name'),
'tags': quote.get('tags'),
'quote_text': quote.get('text')
}
has_next = resp.get('has_next')
if has_next:
next_page_number = resp.get('page') + 1
yield scrapy.Request(
url=f'http://quotes.toscrape.com/api/quotes?page={next_page_number}',
callback=self.parse
)
Always explore the JSON object's structure first, and only then start scraping it.
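One quick way to do that exploration before touching the spider is to fetch the endpoint directly and print its keys. A small sketch (my addition; it uses the requests library, which is an extra dependency and not part of the Scrapy project):

import json
import requests

resp = requests.get('http://quotes.toscrape.com/api/quotes?page=1')
data = json.loads(resp.text)

print(data.keys())                  # top-level keys, e.g. 'quotes', 'has_next', 'page'
print(data['quotes'][0].keys())     # keys of a single quote object
print(data['quotes'][0]['author'])  # nested author object containing 'name'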
Next, we will scrape the https://openlibrary.org/subjects/picture_books website. As we scroll the list of books to the left, more books are loaded.
scrapy genspider ebooks "openlibrary.org/subjects/picture_books.json?limit=12&offset=12"
Here, we wrap the URL in quotes because it contains special characters (? and &).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.exceptions import CloseSpider
import json
class EbooksSpider(scrapy.Spider):
name = 'ebooks'
INCREMENTED_BY = 12
offset = 0
allowed_domains = ['openlibrary.org']
start_urls = ['https://openlibrary.org/subjects/picture_books.json?limit=12']
handle_httpstatus_list = [500]  # let 500 responses reach parse() so the CloseSpider check below can run
def parse(self, response):
if response.status == 500:
raise CloseSpider('Reached last page...')
resp = json.loads(response.body)
ebooks = resp.get('works')
for ebook in ebooks:
yield {
'title': ebook.get('title'),
'subject': ebook.get('subject')
}
self.offset += self.INCREMENTED_BY
yield scrapy.Request(
url=f'https://openlibrary.org/subjects/picture_books.json?limit=12&offset={self.offset}',
callback=self.parse
)
Here, as seen in the browser developer tools, the offset is incremented by 12 on each request.
When we set the offset to a very large value, the API returns a 500 Internal Server Error, which is what the spider uses to detect the last page and close itself.
Login to Websites using Scrapy:
We will use the https://quotes.toscrape.com/login page to log in using Scrapy. This website doesn't require JavaScript: if we disable JavaScript in developer tools, the UI does not change.
In developer tools, go to the Network tab, set the filter to All, and check Preserve log (so the login request is kept across the redirect).
Any username / password works for this site, as it exists for scraping practice.
Decoding the login request (from developer tools):
HTTP status code: 302 ==> redirected to another URL.
Form Data: this is the form data we need to send when logging in. Besides the username and password fields, there is a csrf_token, which is dynamically generated; its value changes every time the page is reloaded.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest
class QuotesLoginSpider(scrapy.Spider):
name = 'quotes_login'
allowed_domains = ['quotes.toscrape.com']
start_urls = ['https://quotes.toscrape.com/login']
def parse(self, response):
csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
yield FormRequest.from_response(
response,
formxpath='//form',
formdata={
'csrf_token': csrf_token,
'username': 'admin',
'password': 'admin'
},
callback=self.after_login
)
def after_login(self, response):
if response.xpath("//a[@href='/logout']/text()").get():
print('logged in')
We can use formcss, formname, or formxpath to locate the form element. If the login succeeds, we get the 'logged in' message in the output, as programmed.
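For example, the same login request could locate the form with a CSS selector via formcss instead of formxpath. A hypothetical variant of the spider above (my sketch; the login page has a single form):

import scrapy
from scrapy import FormRequest

class QuotesLoginCssSpider(scrapy.Spider):
    # Hypothetical variant, shown only to illustrate formcss.
    name = 'quotes_login_css'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        yield FormRequest.from_response(
            response,
            formcss='form',  # locate the form with a CSS selector instead of an XPath
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if response.xpath("//a[@href='/logout']/text()").get():
            print('logged in')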
Another Example:
To check logging in to the Open Library website, first create an account there. (Logging out requires JavaScript.)
# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest
class OpenlibraryLoginSpider(scrapy.Spider):
name = 'openlibrary_login'
allowed_domains = ['openlibrary.org']
start_urls = ['https://openlibrary.org/account/login']
def parse(self, response):
yield FormRequest.from_response(
response,
formid='register',
formdata={
'username': 'abcdefgh@gmail.com',
'password': 'test123',
'redirect': '/',
'debug_token': '',
'login': 'Log In'
},
callback=self.after_login
)
def after_login(self, response):
print('logged in...')
With the above program, we see 'logged in...' printed once the login form has been submitted.
If the form does require JavaScript, we can't use the FormRequest class. As an alternative, we use another class called SplashFormRequest, which has the same methods and takes the same arguments as FormRequest.
# Splash should be running on the background
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest
class QuotesLoginSpider(scrapy.Spider):
name = 'quotes_login'
allowed_domains = ['quotes.toscrape.com']
script = '''
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(0.5))
return splash:html()
end
'''
def start_requests(self):
yield SplashRequest(
url='https://quotes.toscrape.com/login',
endpoint='execute',
args = {
'lua_source': self.script
},
callback=self.parse
)
def parse(self, response):
csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
yield SplashFormRequest.from_response(
response,
formxpath='//form',
formdata={
'csrf_token': csrf_token,
'username': 'admin',
'password': 'admin'
},
callback=self.after_login
)
def after_login(self, response):
if response.xpath("//a[@href='/logout']/text()").get():
print('logged in')
PROJECT:
Use the archived URL (see start_urls in the spider below) instead of https://coinmarketcap.com.
Cloudflare (DNS, SSL, CDN/Proxy) is a protection service that websites may use against scraping. Cloudflare is a Content Delivery Network (CDN) that acts as a watchdog: it inspects incoming requests and protects the site from bots, spiders, crawlers, etc. If requests arrive too frequently, the source of the requests is blocked from making any more. We can bypass this extra layer of security, and we can check whether a website is protected by Cloudflare here.
Project: scrapy genspider -t crawl coins coinmarketcap.com
The LinkExtractor will automatically extract the links matched by the restrict_xpaths expression below.
When running the spider below, we may get blocked; if so, wait about 10 minutes and try again.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class CoinsSpider(CrawlSpider):
name = 'coins'
allowed_domains = ['web.archive.org']
start_urls = ['https://web.archive.org/web/20190101085451/https://coinmarketcap.com/']
rules = (
Rule(LinkExtractor(restrict_xpaths="//a[@class='currency-name-container link-secondary']"), callback='parse_item', follow=True),
)
def parse_item(self, response):
yield {
'name': response.xpath("normalize-space((//h1[@class='details-panel-item--name']/text())[2])").get(),
'rank': response.xpath("//span[@class='label label-success']/text()").get(),
'price(USD)': response.xpath("//span[@class='h2 text-semi-bold details-panel-item--price__value']/text()").get()
}
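Before reaching for a middleware, a few standard Scrapy throttling settings in settings.py can already reduce the chance of being rate-limited (my sketch, not part of the original project):

# settings.py - optional politeness settings (standard Scrapy options)
DOWNLOAD_DELAY = 1            # wait at least 1 second between requests
AUTOTHROTTLE_ENABLED = True   # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10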
To overcome the Cloudflare protection and bypass the 429 (Too Many Requests) error, use the scrapy-cloudflare-middleware package:
pip install scrapy-cloudflare-middleware. By default, the middleware only handles the 503 HTTP response status code, not the 429 response code.
DOWNLOADER_MIDDLEWARES = {
# The priority of 560 is important, because we want this middleware to kick in just before the scrapy built-in RetryMiddleware.
'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560
}
Add the line below to coins.py (only so we can navigate to the middleware source):
from scrapy_cloudflare_middleware.middlewares import CloudFlareMiddleware
Note that CloudFlareMiddleware only works with CrawlSpider, not with the plain scrapy.Spider class.
Ctrl+click on CloudFlareMiddleware to open its source (the import above can be deleted afterwards) and change the check
response.status == 503
to
response.status == 503 or response.status == 429
Deactivate DuplicateFilter in settings.py:
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
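As a simpler fallback (my sketch, not from the original post), Scrapy's built-in RetryMiddleware can also be told explicitly to retry 429 and 503 responses instead of patching the Cloudflare middleware:

# settings.py - retry rate-limited and unavailable responses with the built-in RetryMiddleware
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 503]
RETRY_TIMES = 5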
Conclusion:
We have seen how to use Selenium, and Selenium with Scrapy, for Web Scraping. We have also seen how to use pipelines to store scraped data in MongoDB and SQLite3 databases, and then looked at API scraping. Also discussed were logging in to websites using Scrapy / Splash, and bypassing Cloudflare protection.
Hope you enjoyed Scraping!!! More power to you!!! Happy Learning!!! Thank you!!!