
Web Scraping - Part 4 - with Selenium Python, Pipelines, API Scraping & more



We have seen Scrapy and Splash for web scraping so far, most recently in Part 3 of this series. In this post, we will use Selenium, both on its own and together with Scrapy. We will also use pipelines to store data in MongoDB and SQLite3 databases, and then look at API scraping. Finally, we cover logging in to websites using Scrapy and Splash, and bypassing Cloudflare protection.


Like Splash, Selenium can be used to scrape pages that rely on JavaScript. Selenium is beginner friendly and easy to understand, so many people prefer it over Splash for web scraping. Keep in mind, though, that Selenium is primarily a testing and browser-automation tool rather than a dedicated scraping tool.


Steps to work with Selenium for Web Scraping:

1. Install Selenium: pip install selenium OR conda install selenium

2. Get Chrome Web Driver.

3. Create a folder selenium_basics, open this in VSCode. Extract Chrome web driver and place in this folder.

4. Create a file basics.py.

from shutil import which

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Run Chrome without opening a browser window.
chrome_options = Options()
chrome_options.add_argument("--headless")

# Locate chromedriver on PATH; fall back to the copy placed in this folder.
chrome_path = which("chromedriver") or "./chromedriver"

# Selenium 4 takes the driver path through a Service object
# (the executable_path argument was deprecated and later removed).
driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)
driver.get("https://duckduckgo.com")

search_input = driver.find_element(By.XPATH, "(//input[contains(@class, 'js-search-input')])[1]")
search_input.send_keys("My User Agent")

search_btn = driver.find_element(By.ID, "search_button_homepage")
# search_btn.click()
search_btn.send_keys(Keys.ENTER)

print(driver.page_source)
driver.quit()

Execute file in terminal using: python .\basics.py

Methods available to locate elements:

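# The find_element_by_* helpers were removed in Selenium 4.3;
# use find_element(By.<strategy>, value) instead.
driver.find_element(By.ID, "search_form_input_homepage")
driver.find_element(By.CLASS_NAME, class_name)
driver.find_element(By.TAG_NAME, "h1")
driver.find_element(By.XPATH, xpath)
driver.find_element(By.CSS_SELECTOR, css_selector)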


Selenium Web Scraping:

When the page is loaded by Selenium, Scrapy's callback does not receive the rendered page, so we wrap the Selenium page source in a Selector object and query it the same way we would query a response.

driver.set_window_size(1920, 1080) # helps to get all items on the page scraped

CODE:

# -*- coding: utf-8 -*-
from shutil import which

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class CoinSpiderSelenium(scrapy.Spider):
    name = 'coin_selenium'
    allowed_domains = ['web.archive.org']
    start_urls = [
        'https://web.archive.org/web/20200116052415/https://www.livecoin.net/en'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_path = which("chromedriver") or "./chromedriver"

        driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)
        driver.set_window_size(1920, 1080)
        driver.get("https://web.archive.org/web/20200116052415/https://www.livecoin.net/en")

        # Click the fifth filter tab, then keep the rendered HTML for parse().
        rur_tab = driver.find_elements(By.CLASS_NAME, "filterPanelItem___2z5Gb")
        rur_tab[4].click()

        self.html = driver.page_source
        driver.quit()

    def parse(self, response):
        # Query the Selenium-rendered HTML instead of Scrapy's own response.
        resp = Selector(text=self.html)
        for currency in resp.xpath("//div[contains(@class, 'ReactVirtualized__Table__row tableRow___3EtiS ')]"):
            yield {
                'currency pair': currency.xpath(".//div[1]/div/text()").get(),
                'volume(24h)': currency.xpath(".//div[2]/span/text()").get()
            }

Pagination:

To handle pagination with Selenium inside Scrapy, install scrapy-selenium: pip install scrapy-selenium

For this, we will use the website https://slickdeals.net/computer-deals/

Create a Scrapy project named slickdeals and generate a spider in it.

Copy the configuration settings from https://github.com/clemfromspace/scrapy-selenium into settings.py.

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'  # must match the driver executable below
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # '-headless' if using firefox instead of chrome

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

Create an example spider with scrapy genspider example example.com, then edit example.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://duckduckgo.com',
            wait_time=3,
            screenshot=True,
            callback=self.parse
        )

    def parse(self, response):
        # img = response.meta['screenshot']
        # with open('screenshot.png', 'wb') as f:
        #     f.write(img)  # gets only initial response screenshot

        driver = response.meta['driver']
        search_input = driver.find_element(By.XPATH, "(//input[contains(@class, 'js-search-input')])[1]")
        search_input.send_keys('Hello World')
        # driver.save_screenshot('after_filling_input.png')
        search_input.send_keys(Keys.ENTER)

        html = driver.page_source
        response_obj = Selector(text=html)

        links = response_obj.xpath("//div[@class='result__extras__url']/a")
        for link in links:
            yield {
                'URL': link.xpath(".//@href").get()
            }


We will do pagination in the below code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_selenium import SeleniumRequest


class ComputerdealsSpider(scrapy.Spider):
    name = 'computerdeals'

    def remove_characters(self, value):
        # Strip non-breaking spaces from both ends of the string.
        return value.strip('\xa0')

    def start_requests(self):
        yield SeleniumRequest(
            url='https://slickdeals.net/computer-deals/',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("//ul[@class='bp-p-filterGrid_items']/li")
        for product in products:
            yield {
                'name': product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                'link': product.xpath(".//a[@class='bp-c-card_title bp-c-link']/@href").get(),
                'store_name': self.remove_characters(product.xpath("normalize-space(.//span[@class='bp-c-card_subtitle']/text())").get()),
                'price': product.xpath("normalize-space(.//div[@class='bp-c-card_content']/span/text())").get()
            }

        # Follow the "next" link until there are no more pages.
        next_page = response.xpath("//a[@data-page='next']/@href").get()
        if next_page:
            absolute_url = f"https://slickdeals.net{next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse
            )
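The spider builds the absolute URL by prefixing the site root to the relative href. As a small aside, the same step can be sketched with the standard library's urljoin, which also copes with hrefs that are already absolute. The helper names and sample href values here are hypothetical and only serve as illustration; inside a spider, response.urljoin(next_page) does the equivalent.

```python
from urllib.parse import urljoin

BASE_URL = "https://slickdeals.net/computer-deals/"


def absolute_next_page(base_url, next_href):
    """Return an absolute URL for a 'next page' href, or None when there is none."""
    if not next_href:
        return None
    return urljoin(base_url, next_href)


def clean_store_name(value):
    """Strip non-breaking spaces and ordinary whitespace from a scraped string."""
    return value.strip("\xa0 \t\n") if value else value


# Hypothetical href values, for illustration only.
next_url = absolute_next_page(BASE_URL, "/computer-deals/?page=2")
no_next = absolute_next_page(BASE_URL, None)
store = clean_store_name("\xa0Amazon\xa0")

print(next_url)
print(no_next)
print(repr(store))
```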

Working with Pipelines:

To export scraped data to a database, we write the code in the project's pipelines.py file.

A pipeline class can define open_spider(self, spider), called once when the spider starts; close_spider(self, spider), called once when the spider finishes; and process_item(self, item, spider), called for every item scraped.
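The order in which these hooks run can be imitated without Scrapy. The sketch below is only a stand-in for what the Scrapy engine does: CountingPipeline and run_pipeline are hypothetical names, and the items are made up for illustration.

```python
class CountingPipeline:
    """Records the order in which the lifecycle hooks are called."""

    def __init__(self):
        self.events = []

    def open_spider(self, spider):
        self.events.append("open")

    def process_item(self, item, spider):
        self.events.append(f"item:{item['name']}")
        return item

    def close_spider(self, spider):
        self.events.append("close")


def run_pipeline(pipeline, items, spider=None):
    """Stand-in for the engine: open once, process each item, close once."""
    pipeline.open_spider(spider)
    processed = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return processed


pipeline = CountingPipeline()
processed = run_pipeline(pipeline, [{"name": "laptop"}, {"name": "monitor"}])
print(pipeline.events)
```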


settings.py : uncomment the following lines

ITEM_PIPELINES = { "slickdeals.pipelines.SlickdealsPipeline": 300, }

MONGO_URI="Hello World"

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import logging


class SlickdealsPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Runs when the pipeline is created, before open_spider.
        logging.warning(crawler.settings.get("MONGO_URI"))
        # Must return a pipeline instance, not the crawler.
        return cls()

    def open_spider(self, spider):
        logging.warning("SPIDER OPENED from PIPELINE")

    def close_spider(self, spider):
        logging.warning("SPIDER CLOSED from PIPELINE")

    def process_item(self, item, spider):
        return item

When we run the spider, the MONGO_URI value from settings.py is logged before the open_spider message.
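A from_crawler hook is usually used to read a setting and hand it to the pipeline's constructor. The following is a minimal sketch of that pattern, assuming a fake crawler object: SettingsAwarePipeline and fake_crawler are hypothetical, and only the crawler.settings.get() call is imitated.

```python
from types import SimpleNamespace


class SettingsAwarePipeline:
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # Read the setting and return a pipeline instance (not the crawler).
        return cls(mongo_uri=crawler.settings.get("MONGO_URI"))


# Hypothetical stand-in for Scrapy's crawler: a plain dict plays the settings.
fake_crawler = SimpleNamespace(settings={"MONGO_URI": "Hello World"})
pipeline = SettingsAwarePipeline.from_crawler(fake_crawler)
print(pipeline.mongo_uri)
```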

To store data in MongoDB, either download and install MongoDB Server locally or use the cloud service.

Using the cloud: sign in to MongoDB, create a cluster --> Build a Cluster --> choose M0 Sandbox (free cluster) --> Create Cluster.

In the terminal: conda install pymongo dnspython -y

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import logging
import pymongo


class MongodbPipeline(object):
    collection_name = "example"

    def open_spider(self, spider):
        # Replace <password> with the database user's password.
        self.client = pymongo.MongoClient("mongodb+srv://vanisuruvu:<password>@vanicluster.xkm7tbs.mongodb.net/?retryWrites=true&w=majority")
        self.db = self.client["SlickDeals"]  # created later

    def close_spider(self, spider):
        # close connection to database
        self.client.close()

    def process_item(self, item, spider):
        # insert() was removed in PyMongo 4; insert_one() replaces it.
        self.db[self.collection_name].insert_one(dict(item))
        return item

Give the cluster's database user a password, and give that user permission to read and write data.

Under Network Access for the cluster, add the IP address 0.0.0.0/0 (allow access from anywhere).

Connect Cluster --> select the cluster created --> Connect your Application --> select the Python driver and version --> copy the connection string and paste it into pipelines.py. Then change settings.py -->

ITEM_PIPELINES = { "slickdeals.pipelines.MongodbPipeline": 300, }

When we run the crawler with scrapy crawl computerdeals, we can see the data populated in the MongoDB collection. Avoid special characters such as @ in the MongoDB password, because they would have to be percent-encoded in the connection string.
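If a password does contain special characters, one option is to percent-encode it before placing it in the connection string. The sketch below uses the standard library's quote_plus; the user name, password, and host are placeholders invented for illustration, not real credentials.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host):
    """Percent-encode the credentials and assemble a mongodb+srv URI."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Placeholder values, for illustration only.
uri = build_mongo_uri("scraper", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
```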


To connect with SQLite3, we don't need to install any package.

Install the SQLite extension in VSCode to view the database that gets created.

import sqlite3


class SQLitePipeline(object):

    def open_spider(self, spider):
        self.connection = sqlite3.connect("slackdeals.db")
        self.c = self.connection.cursor()
        try:
            self.c.execute('''
                CREATE TABLE best_movies(
                    link TEXT,
                    name TEXT,
                    price TEXT,
                    store_name TEXT
                )
            ''')
            self.connection.commit()
        except sqlite3.OperationalError:
            # The table already exists from a previous run.
            pass

    def close_spider(self, spider):
        # close connection to database
        self.connection.close()

    def process_item(self, item, spider):
        # The values tuple must be the second argument to execute().
        self.c.execute('''
            INSERT INTO best_movies(link, name, price, store_name) VALUES(?,?,?,?)
        ''', (
            item.get('link'),
            item.get('name'),
            item.get('price'),
            item.get('store_name')
        ))
        self.connection.commit()
        return item


Update settings.py with "slickdeals.pipelines.SQLitePipeline": 300. After installing the extension, open the database in the SQLite Explorer to see the table that was created.
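The stored rows can also be checked from Python rather than from the editor. Here is a minimal sketch of the same insert-then-query flow against an in-memory database; the table name deals, the store_items helper, and the sample items are hypothetical and chosen only for illustration.

```python
import sqlite3


def store_items(connection, items):
    """Create the table if needed and insert one row per scraped item."""
    cursor = connection.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS deals(
            link TEXT, name TEXT, price TEXT, store_name TEXT
        )
    """)
    cursor.executemany(
        "INSERT INTO deals(link, name, price, store_name) VALUES(?,?,?,?)",
        [
            (i.get("link"), i.get("name"), i.get("price"), i.get("store_name"))
            for i in items
        ],
    )
    connection.commit()


# ":memory:" keeps the example self-contained; a real run would use a file.
connection = sqlite3.connect(":memory:")
store_items(connection, [
    {"link": "/f/1", "name": "Monitor", "price": "$129", "store_name": "Store A"},
    {"link": "/f/2", "name": "Laptop", "price": "$499", "store_name": "Store B"},
])
rows = connection.execute("SELECT name, price FROM deals ORDER BY name").fetchall()
print(rows)
connection.close()
```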

Scraping APIs:

Here we do not scrape HTML markup, so there are no XPath or CSS selectors. Instead, the API returns a JSON object that we parse in Python.

We will scrape http://quotes.toscrape.com/scroll. The page has no pagination links; JavaScript loads more quotes dynamically as we scroll down. We don't need Selenium or Splash to handle the JavaScript, though, because we can call the underlying API directly.


Open Developer Tools (Ctrl+Shift+I), open the Network tab, select the XHR filter, and press Ctrl+R to refresh the page. Open the first request. In the Headers tab, check the Request URL; in the Preview tab, we see a JSON object with key-value pairs, where the quotes key holds the data. Note that the API URL is different from the website URL.


Create the project in Scrapy: scrapy startproject demo_api, cd demo_api, scrapy genspider quotes quotes.toscrape.com, then code . (opens VSCode from the command prompt)

Open folder in VSCode.

Copy Request URL from browser: http://quotes.toscrape.com/api/quotes?page=1

We can use the has_next field to check whether there are more pages.

# -*- coding: utf-8 -*-
import scrapy
import json


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        resp = json.loads(response.body)
        quotes = resp.get('quotes')
        for quote in quotes:
            yield {
                'author': quote.get('author').get('name'),
                'tags': quote.get('tags'),
                'quote_text': quote.get('text')
            }

        # Request the next page only while the API reports more data.
        has_next = resp.get('has_next')
        if has_next:
            next_page_number = resp.get('page') + 1
            yield scrapy.Request(
                url=f'http://quotes.toscrape.com/api/quotes?page={next_page_number}',
                callback=self.parse
            )


Explore JSON object structure first, and then start scraping the object.
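One way to explore an unfamiliar JSON payload is to print each key path with the type of its value. The sketch below does this for a small hand-written sample shaped like the quotes response; the sample content and the describe helper are assumptions made for illustration, not the API's real output.

```python
import json

# Hand-written sample, shaped like the response described above.
SAMPLE = """
{"has_next": true, "page": 1,
 "quotes": [{"author": {"name": "Jane Doe"}, "tags": ["life"], "text": "An example quote."}]}
"""


def describe(value, prefix=""):
    """Return 'path: type' lines for every key reachable from value."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.append(f"{path}: {type(child).__name__}")
            lines.extend(describe(child, path))
    elif isinstance(value, list) and value:
        # Only inspect the first element, assuming the list is homogeneous.
        lines.extend(describe(value[0], f"{prefix}[0]"))
    return lines


lines = describe(json.loads(SAMPLE))
print("\n".join(lines))
```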


Next, we will scrape https://openlibrary.org/subjects/picture_books website. As we scroll the list of books to the left, we get more books.

scrapy genspider ebooks "openlibrary.org/subjects/picture_books.json?limit=12&offset=12"

We wrap the URL in double quotes because it contains special characters (? and &).

# -*- coding: utf-8 -*-
import scrapy
from scrapy.exceptions import CloseSpider
import json


class EbooksSpider(scrapy.Spider):
    name = 'ebooks'

    INCREMENTED_BY = 12
    offset = 0

    allowed_domains = ['openlibrary.org']
    start_urls = ['https://openlibrary.org/subjects/picture_books.json?limit=12']

    # Let 500 responses reach parse(); by default Scrapy filters out
    # non-2xx responses before they get to the callback.
    handle_httpstatus_list = [500]

    def parse(self, response):
        if response.status == 500:
            raise CloseSpider('Reached last page...')

        resp = json.loads(response.body)
        ebooks = resp.get('works')
        for ebook in ebooks:
            yield {
                'title': ebook.get('title'),
                'subject': ebook.get('subject')
            }

        # Move to the next batch of 12 books.
        self.offset += self.INCREMENTED_BY
        yield scrapy.Request(
            url=f'https://openlibrary.org/subjects/picture_books.json?limit=12&offset={self.offset}',
            callback=self.parse
        )

Here, as seen in the browser's Developer Tools, the offset increases by 12 on each request.

When the offset is set to a very large value, the server returns a 500 Internal Server Error.
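The offset arithmetic can be sketched on its own with the standard library. The page_urls helper below is hypothetical; it only shows how each request URL follows from the limit and the page index, with no offset parameter on the first page.

```python
from urllib.parse import urlencode

BASE = "https://openlibrary.org/subjects/picture_books.json"


def page_urls(limit=12, pages=3):
    """Yield the URL for each page, advancing the offset by limit each time."""
    for page in range(pages):
        query = {"limit": limit}
        if page:
            query["offset"] = page * limit
        yield f"{BASE}?{urlencode(query)}"


urls = list(page_urls())
for url in urls:
    print(url)
```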


Login to Websites using Scrapy:

We will log in to https://quotes.toscrape.com/login using Scrapy. This site doesn't require JavaScript: disabling JavaScript in Developer Tools makes no visible difference to the UI.

In Developer Tools, go to the Network tab, set the filter to All, and check Preserve log so that the login request is captured.

Any username / password works for this site, as it is meant for scraping practice.


Decoding the login request (from Developer Tools):

HTTP status code: 302 ==> redirected to another URL.

Form Data: we need to send the form data when logging in. It contains the username and password fields plus a csrf_token, which is generated dynamically and changes every time the page is reloaded.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest


class QuotesLoginSpider(scrapy.Spider):
    name = 'quotes_login'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/login']

    def parse(self, response):
        # The token changes on every page load, so read it from the form.
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        yield FormRequest.from_response(
            response,
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # A logout link is only present when the login succeeded.
        if response.xpath("//a[@href='/logout']/text()").get():
            print('logged in')


We can use formcss, formname, or formxpath to select the form element. If the login succeeds, the spider prints 'logged in' in the output.
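Scrapy's FormRequest.from_response pre-populates the request with the fields already present in the form, including hidden ones such as csrf_token, and formdata overrides them. A rough standard-library sketch of that merging idea follows; FormFieldParser, build_login_body, and the sample form are hypothetical, and this is not how Scrapy implements it internally.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_body(html, overrides):
    """Merge the form's existing fields with our values and URL-encode them."""
    parser = FormFieldParser()
    parser.feed(html)
    return urlencode({**parser.fields, **overrides})


# Hand-written sample form, for illustration only.
SAMPLE_FORM = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

body = build_login_body(SAMPLE_FORM, {"username": "admin", "password": "admin"})
print(body)
```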


Another Example:

To try logging in to the Open Library website, first create an account there. Note that logging out requires JavaScript.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import FormRequest


class OpenlibraryLoginSpider(scrapy.Spider):
    name = 'openlibrary_login'
    allowed_domains = ['openlibrary.org']
    start_urls = ['https://openlibrary.org/account/login']

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formid='register',
            formdata={
                'username': 'abcdefgh@gmail.com',
                'password': 'test123',
                'redirect': '/',
                'debug_token': '',
                'login': 'Log In'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        print('logged in...')


With the above program, 'logged in...' is printed once the login form has been submitted. Note that after_login prints unconditionally; it does not check whether the login actually succeeded.


If the form requires JavaScript, FormRequest cannot be used. The alternative is the SplashFormRequest class, which has the same methods as FormRequest and takes the same arguments.

# Splash should be running in the background
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class QuotesLoginSpider(scrapy.Spider):
    name = 'quotes_login'
    allowed_domains = ['quotes.toscrape.com']

    script = '''
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url='https://quotes.toscrape.com/login',
            endpoint='execute',
            args={
                'lua_source': self.script
            },
            callback=self.parse
        )

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        yield SplashFormRequest.from_response(
            response,
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if response.xpath("//a[@href='/logout']/text()").get():
            print('logged in')


PROJECT:

Use the archived copy at https://web.archive.org/web/20190101085451/https://coinmarketcap.com/ instead of https://coinmarketcap.com

Websites can be protected by Cloudflare (DNS, SSL, CDN/proxy). Cloudflare is a Content Delivery Network (CDN) that acts like a watchdog, inspecting incoming requests and protecting the site from bots, spiders, crawlers, and so on. If requests arrive too frequently, their source is blocked from making further requests. This extra layer of security can be bypassed. Online checker tools can tell us whether a website is protected by Cloudflare.


Project: scrapy genspider -t crawl coins coinmarketcap.com

LinkExtractor will automatically extract the links matched by restrict_xpaths below.

When running the spider below, we may get blocked; if so, try again after about 10 minutes.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CoinsSpider(CrawlSpider):
    name = 'coins'
    allowed_domains = ['web.archive.org']
    start_urls = ['https://web.archive.org/web/20190101085451/https://coinmarketcap.com/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='currency-name-container link-secondary']"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath("normalize-space((//h1[@class='details-panel-item--name']/text())[2])").get(),
            'rank': response.xpath("//span[@class='label label-success']/text()").get(),
            'price(USD)': response.xpath("//span[@class='h2 text-semi-bold details-panel-item--price__value']/text()").get()
        }


To overcome Cloudflare protection and get past the 429 (Too Many Requests) error, install:

pip install scrapy-cloudflare-middleware. By default, the middleware handles the 503 HTTP response status code but not the 429 response code.

DOWNLOADER_MIDDLEWARES = {
    # The priority of 560 is important, because we want this middleware to kick in just before the scrapy built-in `RetryMiddleware`.
    'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560
}

Add the line below to coins.py:

from scrapy_cloudflare_middleware.middlewares import CloudFlareMiddleware

CloudFlareMiddleware only works with CrawlSpider, not with scrapy.Spider class.


Click through on CloudFlareMiddleware to open its source and change the status check below (the import above can be deleted afterwards):

response.status == 503 to

response.status == 503 or response.status == 429
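The change amounts to treating 429 the same way as 503. A minimal sketch of that check as a standalone helper, where CHALLENGE_STATUSES and is_challenge_status are hypothetical names and the code is not the middleware's actual source:

```python
# Status codes that should trigger the challenge handling.
CHALLENGE_STATUSES = frozenset({503, 429})


def is_challenge_status(status):
    """Return True when a response status should be handled as a challenge."""
    return status in CHALLENGE_STATUSES


for status in (200, 429, 503):
    print(status, is_challenge_status(status))
```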


Deactivate DuplicateFilter in settings.py:

DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"


Conclusion:

We have seen how to use Selenium, on its own and with Scrapy, for web scraping. We have also seen how to use pipelines to store scraped data in MongoDB and SQLite3 databases. Then we looked at API scraping. Finally, we discussed logging in to websites using Scrapy / Splash, and bypassing Cloudflare.


Hope you enjoyed Scraping!!! More power to you!!! Happy Learning!!! Thank you!!!


To learn more about pipelines: part1, part2
