Web scraping is the automated process of extracting large amounts of data from websites. Web scraping software accesses web content directly using HTTP (HyperText Transfer Protocol) or through a web browser. The extracted data can then be stored in analysable formats such as JSON or CSV.
Alongside its usefulness, web scraping comes with challenges, including legal and ethical issues around collecting, storing, analyzing, and sharing the scraped data.
Only those who will risk going too far can possibly find out how far one can go!
Ever wondered why we take on the challenge of navigating through sites to gather data despite the risks? Well, web scraping has some well-established applications: price comparison, aggregating job listings, and social media scraping to understand what’s trending.
Note that just because you can log in to a page through your browser doesn’t mean you’ll be able to scrape it with your Python script: pages behind a login require your script to authenticate too, so you’ll need an account before you can scrape anything from them.
In Python, Beautiful Soup 4 (the bs4 module) is a web scraping and HTML parsing library that extracts information from HTML documents and lets you modify it as required. Using bs4 for web scraping typically means reading an HTML file, navigating and modifying its content with the module, and optionally writing out a new HTML file.
Following are the steps to scrape desired data from a website:
1. Get the URL to scrape the desired data.
2. Inspect the page the URL generates.
3. Identify the data to be extracted.
4. Develop code in Python using bs4 module.
5. Execute the code and extract the data.
6. Store the resultant data in the desired format.
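The steps above can be sketched in a few lines. This is a minimal sketch on a stand-in HTML string; in a real run the markup would come from `requests.get(url).text`, and the tag to extract depends on the page you inspected:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for source.text fetched with requests.get(url) (steps 1-3)
html = "<html><head><title>Sample Product</title></head></html>"

soup = BeautifulSoup(html, "html.parser")   # step 4: parse with bs4
title = soup.title.text                     # step 5: extract the data

with open("scraped.json", "w") as f:        # step 6: store in an analysable format
    json.dump({"title": title}, f)
```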
Before any of this code can run, the bs4 module must be installed. In a terminal or command prompt, run
pip install beautifulsoup4
That downloads and installs the bs4 library, from which the BeautifulSoup class can be imported. Following is the syntax:
Object = BeautifulSoup(string/file handle, parser)
Here the string is HTML markup (for example, the text of a downloaded page) and the file handle is a local file from which the HTML content is read. The parser can be Python’s built-in html.parser, or the faster C-based lxml parser for HTML and XML.
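A quick sketch of the two forms (the file name below is purely illustrative):

```python
from bs4 import BeautifulSoup

# From a string of HTML markup:
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
print(soup.b.text)  # world

# From a file handle on a local HTML file:
# with open("page.html") as fh:
#     soup = BeautifulSoup(fh, "html.parser")
```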
We also need to install Python’s requests module to fetch pages from the web:
python -m pip install requests
Tip: Please ensure you have a thorough understanding of HTML basics, particularly the tags.
For example, we shall code a simple program that scrapes the price of a monitor from the given URL and prints it.
To begin, we first need to inspect the page to find the parent tags (class names and containers) around the price: right-click on the page with the cursor on the price and select Inspect.
With the help of above tag names, we begin to code as follows.
Executing the above program, we get the following output.
Let us run through the code now:
from bs4 import BeautifulSoup
import requests
imports the two basic libraries necessary for web scraping: requests and bs4.
requests pulls the HTML data from the website, and BeautifulSoup parses it for us so we can access the content we need.
source = requests.get("https://www.newegg.ca/p/N82E16824011452?Item=N82E16824011452")
accesses the given URL and saves the result in the source variable as a Response object.
source.raise_for_status()
raises an HTTPError if an error occurred during the request, and does nothing otherwise.
soup = BeautifulSoup(source.text, "html.parser")
parses the HTML content held in source.text using Python’s built-in HTML parser.
If we print out the object ‘soup’ directly, the markup can be hard to read, especially when the scrape yields many details, such as all the details of the above monitor including the model number, manufacturer, dimensions, ports etc. In such cases it’s recommended to use the prettify() method, as in
print(soup.prettify())
This returns a string with the HTML content nicely formatted with proper indentation.
prices = soup.find_all(string='$')
This statement finds every text node on the page that is exactly a ‘$’ symbol, since we need to capture the price. find_all() returns a list of all the occurrences.
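A small offline illustration, using made-up markup with two prices, that shows find_all() returning one list entry per matching text node:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two prices
soup = BeautifulSoup("<p>$<b>10</b></p><p>$<b>20</b></p>", "html.parser")

matches = soup.find_all(string='$')  # one entry per '$' text node
print(len(matches))  # 2
```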
dollars = prices[0].parent.find("strong")
.find() returns the first occurrence of the specified tag. prices[0] holds the first dollar symbol on the page, since the price we are looking for starts with a ‘$’ symbol. Its .parent attribute gives the enclosing tag, which contains the whole price markup where the 209 occurs. Finally, we find our 209 under the ‘strong’ tag and we grab it!! Definitely not as complicated as explained 😊
Can’t leave the cents behind !!!
cents = prices[0].parent.find("sup")
This just grabs the cents, which sit under the ‘sup’ tag.
print('$' + dollars.text, cents.text)
The amount is finally printed.
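Putting the extraction steps together, here is the same technique run offline on a static snippet that mimics the price markup (the class name and exact structure are illustrative; the live page may differ):

```python
from bs4 import BeautifulSoup

html = '<li class="price-current">$<strong>209</strong><sup>.99</sup></li>'
soup = BeautifulSoup(html, "html.parser")

prices = soup.find_all(string='$')          # text nodes that are exactly '$'
dollars = prices[0].parent.find("strong")   # the dollars inside <strong>
cents = prices[0].parent.find("sup")        # the cents inside <sup>
print('$' + dollars.text, cents.text)       # $209 .99
```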
Finally, here are a couple of challenges to keep in mind while you get ready to scrape the sites out there.
1. Code durability: Websites are dynamic and change constantly. Code written to run automatically and gather information may surprise you the next time you run it, so be prepared for a traceback!!
2. Diverse websites: Each website is different, static or dynamic, and at times we may need to change our approach to extract the relevant information. Think of the difference between extracting data from a static online bookstore and a dynamic social media site!!
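For the durability point, one defensive pattern is to check that an element was actually found before using it, so a layout change produces a clear message instead of an AttributeError traceback. A hedged sketch:

```python
from bs4 import BeautifulSoup

# Simulate a page whose layout changed and no longer has the price tag
soup = BeautifulSoup("<div>no price here</div>", "html.parser")

price_tag = soup.find("strong")   # find() returns None when nothing matches
if price_tag is None:
    print("Price element not found - the page layout may have changed.")
else:
    print(price_tag.text)
```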
At the End of the Day, Everything is Achievable. Happy scraping !!