Web Scraping

Web scraping refers to the extraction of data from a website. The information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API.


Although web scraping can be done manually, automated tools are usually preferred, as they are less costly and work at a faster rate.


Websites Used:


https://www.glassdoor.com

https://www.covidfacts.in


Installation


Python, Selenium, ChromeDriver



Quickstart:

Once you have downloaded both Chrome and ChromeDriver and installed the Selenium package, you should be ready to start the browser.
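A minimal launch script might look like this (assuming Selenium 4, with ChromeDriver on your PATH or managed automatically by Selenium Manager; the URL is just a placeholder):

```python
from selenium import webdriver

# Launch a regular (headful) Chrome window controlled by Selenium
driver = webdriver.Chrome()

# Navigate to a page (placeholder URL)
driver.get("https://www.example.com")

driver.quit()  # always close the browser when you are done
```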



This will launch Chrome in headful mode (like regular Chrome, but controlled by your Python code). You should see a message stating that the browser is controlled by automated software.


To run Chrome in headless mode (without any graphical user interface), for example on a server, see the following example:
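A sketch of a headless launch, assuming Selenium 4 and a recent Chrome (the `--headless=new` flag replaces the older `--headless` on current Chrome versions):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible browser window

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")  # placeholder URL
print(driver.title)  # the page loads even though nothing is displayed
driver.quit()
```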




Here are two other interesting Web Driver properties:

  • driver.title gets the page's title

  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
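Both properties can be checked in a short self-contained script (the URL below is only an illustration of a redirect):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://github.com")  # redirects to the https version of the site

print(driver.title)        # title of the final page
print(driver.current_url)  # final URL after the redirect

driver.quit()
```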

Locating Elements

There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name

  • Class name

  • IDs

  • XPath

  • CSS selectors

find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the tr tag in this HTML:
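The original HTML sample is not reproduced here; assuming a simple table like the one sketched in the comments below, here are the main locator strategies with Selenium 4's `find_element`:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical page structure:
# <table id="prices">
#   <tr class="row"> ... </tr>
# </table>

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL

row = driver.find_element(By.TAG_NAME, "tr")                    # tag name
row = driver.find_element(By.CLASS_NAME, "row")                 # class name
row = driver.find_element(By.CSS_SELECTOR, "#prices tr")        # CSS selector
row = driver.find_element(By.XPATH, "//table[@id='prices']//tr")  # XPath

driver.quit()
```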




WebElement

A Web Element is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements, here are the most useful:

  • Accessing the text of the element with the property element.text

  • Clicking on the element with element.click()

  • Accessing an attribute with element.get_attribute('class')

  • Sending text to an input with element.send_keys('mypassword')

There are some other interesting methods, like is_displayed(), which returns True if an element is visible to the user.
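Putting those actions together, a hypothetical login flow could look like this (the element IDs and URL are assumptions, not taken from a real site):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com/login")  # hypothetical login page

# Hypothetical element IDs
driver.find_element(By.ID, "password").send_keys("mypassword")

button = driver.find_element(By.ID, "submit")
print(button.text)                    # visible text of the element
print(button.get_attribute("class"))  # value of the class attribute
if button.is_displayed():             # only click what the user can see
    button.click()

driver.quit()
```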


Executing JavaScript
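Selenium can run arbitrary JavaScript in the page through `driver.execute_script`; a common use is scrolling to trigger lazy-loaded content. A small sketch:

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Scripts can also return values back to Python
height = driver.execute_script("return document.body.scrollHeight;")
print(height)

driver.quit()
```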


Blocking images and JavaScript

With Selenium, by using the correct Chrome options, you can block some requests from being made.


This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.

To do this, you need to launch Chrome with the options below:
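One way to do this is through Chrome's content-settings preferences; the preference keys below are Chrome-specific and may change between versions, so treat this as a sketch:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# 2 = block; these keys are Chrome-internal preference names
options.add_experimental_option(
    "prefs",
    {
        "profile.managed_default_content_settings.images": 2,
        "profile.managed_default_content_settings.javascript": 2,
    },
)

driver = webdriver.Chrome(options=options)
```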



Printing to a Data Dictionary, JSON, and Excel



Observe the output in JSON and CSV format:
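A minimal sketch of the export step, assuming the scraped records have already been collected into a list of dictionaries (the field names and values are placeholders):

```python
import csv
import json

# Placeholder scraped records
rows = [
    {"company": "Acme", "rating": 4.2},
    {"company": "Globex", "rating": 3.8},
]

# Export to JSON
with open("output.json", "w") as f:
    json.dump(rows, f, indent=2)

# Export to CSV, which opens directly in Excel
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["company", "rating"])
    writer.writeheader()
    writer.writerows(rows)
```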




The End
