This blog is a continuation of Part 2 of the Web Scraping blog series, found here. In this blog, we will see how to work with Splash, and how to use Splash with Scrapy.
When doing web scraping, we face challenges with websites that use JavaScript to render content, because JavaScript executes in the browser, on the client side.
We will scrape the livecoin.net/en website. When we scroll to the bottom of the page, a "Show more" button loads more data. If we disable JavaScript, this button no longer works. Scrapy doesn't have a built-in browser - an engine that can interpret JavaScript - so here we can use Splash or Selenium.
Splash is a lightweight browser. We interact with it by writing code that Splash can understand, rather than by clicking icons as in Chrome, for example. Splash is meant to be used with Scrapy.
Selenium is a test/automation tool and is beginner friendly, but it is not built specifically for web scraping.
JavaScript needs an engine to execute code, and each browser has its own:
Chrome --> V8 Engine
Firefox --> Spider Monkey
Safari --> Apple WebKit (the same engine used by Splash)
Microsoft Edge --> Chakra
Setting up Splash:
Windows Professional / Mac (a 64-bit machine is required):
Download "Docker Desktop" and install it.
Search for Docker Desktop and hit Enter to launch it.
Create an account at hub.docker.com (sign up).
Right-click the Docker icon and enter the sign-in details.
In Docker Desktop, check that you are signed in.
Go to Settings --> Resources --> Network --> check the "Manual DNS configuration" option.
Make sure the DNS server is 8.8.8.8.
Restart Docker, wait for it to complete, and close all windows/popups.
Open the command prompt (cmd) and run: docker pull scrapinghub/splash (about a 500 MB download)
Run the image: docker run -it -p 8050:8050 scrapinghub/splash
Wait and check for the message "Server listening on http://0.0.0.0:8050".
Open Chrome and go to localhost:8050 --> the Splash page can be seen.
Tip: in cmd, press Ctrl+C to stop Splash.
Alternatively, go to Docker Desktop --> Dashboard --> click on the image and the play button to start it; click on it to see the logs.
Windows Home (a 64-bit operating system is required):
Search for "Turn Windows features on or off", open it, and check Hyper-V. Reboot the PC.
Go to https://github.com/docker-archive/toolbox/releases, download the exe file, and install it.
While installing Docker Toolbox, check the "Install VirtualBox with NDIS5 driver [default NDIS6]" option.
Open Kitematic from the desktop shortcut; choose VirtualBox if Docker is not already running.
Search for Splash (the image from scrapinghub) and click Create, if it is not already installed.
Once Splash is installed, click Splash in the left-hand menu and click Start.
On the right-hand side, we get an access URL with a default port. Each time we run, a new port is assigned. To make the port static, click the gear icon, enter Docker port 8050, and save.
Stop Splash and start it again.
Copy the static URL 192.168.99.100:8050 into the browser. The Splash page can be seen.
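Once Splash is up (via either path), we can also sanity-check it from Python. Below is a minimal sketch using the requests library against Splash's render.html endpoint, assuming Splash is reachable at localhost:8050 (substitute the Docker Toolbox URL if that is your setup):

import requests

# render.html returns the page HTML after JavaScript has executed
response = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://duckduckgo.com/', 'wait': 1}
)
print(response.status_code)  # 200 means Splash rendered the page
print(response.text[:200])   # start of the rendered HTML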
Splash scripts are written in the Lua language. We can learn Lua from here.
Tasks: We will be using the website https://duckduckgo.com/. We will use Splash to open the webpage,
send text to the input box (as when filling in a form), perform mouse clicks on buttons, send keyboard keys such as Enter instead of a mouse click, use different request methods, and change or send custom request headers.
In the browser, go to the static URL created earlier, give the URL as https://duckduckgo.com/, and start writing the script below in Lua.
function main(splash, args)
  url = args.url
  splash:go(url)
  return {
    html = splash:html(),
    image = splash:png()
  }
end
Here, the rendered HTML and a screenshot (a PNG image) are returned as output.
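As an aside, the same script can also be run outside the UI through Splash's HTTP API. Below is a minimal sketch, assuming Splash is listening on localhost:8050; the /execute endpoint takes the script as lua_source, and the PNG comes back base64-encoded in the JSON response:

import base64
import requests

script = '''
function main(splash, args)
  url = args.url
  splash:go(url)
  return {
    html = splash:html(),
    image = splash:png()
  }
end
'''

# /execute runs the Lua script; extra keys (like url) are exposed via args
response = requests.post('http://localhost:8050/execute',
                         json={'lua_source': script, 'url': 'https://duckduckgo.com/'})
data = response.json()
print(data['html'][:200])                     # rendered HTML
with open('page.png', 'wb') as f:
    f.write(base64.b64decode(data['image']))  # decode the screenshot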
If the URL is invalid, Splash still executes the script after it. To stop execution in case of an error, we can use assert, as below.
assert(splash:go(url))
Edit & execute: click the Script button, edit the script, and click the Render button.
On JavaScript-heavy websites, it's always better to wait before returning anything, as the website can be slow. The code below waits 1 second after loading the page.
function main(splash, args)
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  return {
    html = splash:html(),
    image = splash:png()
  }
end
Next, we will see how to enter text and press Enter or click the search button. We cannot use XPath in Splash, so we use CSS selectors. The id attribute is preferred over class for getting a web element.
splash:select() returns a single element; splash:select_all() returns multiple elements, like findElements in Selenium.
To enter text, we first have to focus on the element, then send the text.
Below is the code to enter text and click the search button.
function main(splash, args)
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  input_box = assert(splash:select("#search_form_input_homepage"))
  input_box:focus()
  input_box:send_text("my user agent")
  assert(splash:wait(0.5))
  btn = assert(splash:select("#search_button_homepage"))
  btn:mouse_click()
  assert(splash:wait(5))
  return {
    html = splash:html(),
    image = splash:png()
  }
end
Comments: single-line comments in Lua start with --, and block comments are written as --[[ content --]]
splash:set_viewport_full() --> a full-page screenshot can be seen. The image shows the user agent as Splash.
input_box:send_keys("<Enter>") --> sends the Enter key
{} --> creates a table, Lua's object type
Changing / Spoofing Request Headers:
splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36") --> changes the User-Agent request header
To change other request headers, we need the splash:set_custom_headers() function.
Alternatively, we can register a callback function that will be called on each request, using splash:on_request().
function main(splash, args)
  -- Option 1: change only the user agent
  --splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36")
  -- Option 2: set any custom headers
  --[[
  headers = {
    ['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
  }
  splash:set_custom_headers(headers)
  --]]
  -- Option 3: register a callback that runs on every request
  splash:on_request(function(request)
    request:set_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36')
  end)
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  input_box = assert(splash:select("#search_form_input_homepage"))
  input_box:focus()
  input_box:send_text("my user agent")
  assert(splash:wait(0.5))
  --[[ btn = assert(splash:select("#search_button_homepage"))
  btn:mouse_click() --]]
  input_box:send_keys("<Enter>")
  assert(splash:wait(5))
  splash:set_viewport_full()
  return {
    html = splash:html(),
    image = splash:png()
  }
end
PROJECT:
We will be using https://web.archive.org/web/20200116052415/https://www.livecoin.net/en/ instead of the live https://www.livecoin.net/en/ website. If we disable JavaScript, we won't be able to scrape the data, and the URL doesn't change when we move between the various currencies, so Splash is needed to handle the JavaScript.
The scrapy-splash package can be found here.
Splash is built using Apple's WebKit engine, the same one used in Safari. Initially the page returns data related to both BTC and RUR. Splash works in private mode by default, which is similar to incognito mode in Chrome/Firefox, although it does not behave exactly the same. To disable private mode: splash.private_mode_enabled = false --> now we get only the RUR (5th filter element) results.
function main(splash, args)
  splash.private_mode_enabled = false
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  rur_rub = assert(splash:select_all(".filterPanelItem___2x5Gb"))
  rur_rub[5]:mouse_click()
  assert(splash:wait(1))
  splash:set_viewport_full()
  return {
    html = splash:html(),
    image = splash:png()
  }
end
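Before moving this into Scrapy, it can help to sanity-check our selectors on the HTML that Splash returns. Below is a minimal sketch using the parsel library (the selector engine Scrapy itself uses); the row class name is the one from the livecoin markup used later in this post:

import requests
from parsel import Selector

# fetch the JavaScript-rendered HTML through Splash
response = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://web.archive.org/web/20200116052415/https://www.livecoin.net/en/',
            'wait': 2}
)
selector = Selector(text=response.text)
rows = selector.xpath("//div[contains(@class, 'ReactVirtualized__Table__row')]")
print(len(rows))  # number of currency rows found in the rendered page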
Using Splash with Scrapy:
In the terminal, inside the scrapy_env virtual environment created earlier, type the commands below:
scrapy startproject livecoin
cd livecoin
scrapy genspider coin web.archive.org/web/20200116052415/https://www.livecoin.net/en/
Open the livecoin project folder in VS Code. In the terminal, install scrapy-splash using the command below:
pip install scrapy-splash
Follow steps to configure: https://github.com/scrapy-plugins/scrapy-splash
settings.py: SPLASH_URL = 'http://localhost:8050' (or the static Docker Toolbox URL from earlier, e.g. 'http://192.168.99.100:8050')
Add DOWNLOADER_MIDDLEWARES, SPIDER_MIDDLEWARES, and DUPEFILTER_CLASS (which prevents duplicate requests):
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
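The scrapy-splash README also suggests a Splash-aware cache storage, in case Scrapy's HTTP cache is enabled:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'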
Copy the script above into coin.py and make the modifications in the start_requests(self) method, as below:
import scrapy
from scrapy_splash import SplashRequest

class CoinSpider(scrapy.Spider):
    name = 'coin'
    allowed_domains = ['web.archive.org']

    script = '''
        function main(splash, args)
          splash.private_mode_enabled = false
          url = args.url
          assert(splash:go(url))
          assert(splash:wait(1))
          rur_rub = assert(splash:select_all(".filterPanelItem___2x5Gb"))
          rur_rub[5]:mouse_click()
          assert(splash:wait(1))
          splash:set_viewport_full()
          return {
            html = splash:html(),
            image = splash:png()
          }
        end
    '''

    def start_requests(self):
        # the execute endpoint runs the Lua script passed as lua_source
        yield SplashRequest(
            url="https://web.archive.org/web/20200116052415/https://www.livecoin.net/en/",
            callback=self.parse,
            endpoint="execute",
            args={'lua_source': self.script}
        )

    def parse(self, response):
        print(response.body)
Open the terminal and execute "scrapy crawl coin".
Parsing Bad HTML Markup:
To get the values scraped for each currency, override the parse(self, response) method as below:
def parse(self, response):
    for currency in response.xpath("//div[contains(@class, 'ReactVirtualized__Table__row tableRow___3EtiS ')]"):
        yield {
            'currency pair': currency.xpath(".//div[1]/div/text()").get(),
            'volume(24h)': currency.xpath(".//div[2]/span/text()").get()
        }
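To save the items to a file instead of printing them, we can use Scrapy's feed export, for example:
scrapy crawl coin -o currencies.json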
Exercise: try scraping 'http://quotes.toscrape.com/js'.
Solution: here
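For reference, below is a minimal sketch of one possible approach, assuming Splash is running on localhost:8050 and scrapy-splash is configured as above; the CSS class names come from the quotes.toscrape.com markup:

import scrapy
from scrapy_splash import SplashRequest

class QuotesJsSpider(scrapy.Spider):
    name = 'quotes_js'
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        # render.html is SplashRequest's default endpoint; wait lets the JS finish
        yield SplashRequest(
            url='http://quotes.toscrape.com/js',
            callback=self.parse,
            args={'wait': 1}
        )

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }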
Conclusion:
We know it's challenging to scrape JavaScript-rendered pages with Scrapy alone. Splash comes to the rescue for working with JavaScript-loaded pages. We have seen an example of how to work with Splash, and we also saw how to use Scrapy with Splash.
Hope you enjoyed the Splash!!! Thank you!!!