Web scraping refers to the extraction of data from a website. The data is collected and then exported into a format that is more useful for the user, such as a spreadsheet or an API.
Although web scraping can be done manually, automated tools are usually preferred for scraping web data: they are less costly and work at a faster rate.
Websites Used:
Installation
Python, Selenium, ChromeDriver
Quickstart
Once you have downloaded both Chrome and ChromeDriver and installed the Selenium package, you should be ready to start the browser.
This launches Chrome in headful mode (like regular Chrome, but controlled by your Python code). You should see a banner stating that the browser is being controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), for example on a server, configure it as follows:
Here are two other interesting Web Driver properties:
driver.title gets the page's title
driver.current_url gets the current URL (useful when the website redirects and you need the final URL)
Locating Elements
There are many methods available in the Selenium API to select elements on the page. You can use:
Tag name
Class name
IDs
XPath
CSS selectors
find_element
There are many ways to locate an element in Selenium. Let's say that we want to locate the tr tag in this HTML:
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements; here are the most useful:
Accessing the text of the element with the property element.text
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with element.send_keys('mypassword')
There are other useful methods, like is_displayed(), which returns True if the element is visible to the user.
Executing JavaScript
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
Exporting the Data to JSON and CSV
Observe the output in JSON format and CSV.
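A sketch of the export step using only the standard library; the rows here are hypothetical scraped records, and the output filenames are placeholders.

```python
import csv
import json

# Hypothetical scraped rows collected into a list of dicts:
rows = [
    {"title": "Page one", "url": "https://example.com/1"},
    {"title": "Page two", "url": "https://example.com/2"},
]

# JSON output
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

# CSV output (one header row, then one row per record)
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```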
The End