Note: This method works only if the website's API can be accessed to fetch records; many websites secure their APIs. In that case you can go for other tools like Selenium or Scrapy.
I have worked on scraping projects before, but this one was very challenging for me for a few reasons. The first is that I was scraping with Beautiful Soup for the first time, and Python is new to me. I have used Selenium with Python to scrape before, but this time I decided to use Beautiful Soup.
So what is Beautiful Soup?
Beautiful Soup is a Python library used for web scraping: it pulls data out of HTML and XML files. It creates a parse tree from the page source that can be used to extract data in a hierarchical and more readable manner. The package is called bs4, and once it is installed you just need an import statement at the beginning of your Python code:
import bs4
and you are set to use Beautiful Soup!
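To make the idea of a parse tree concrete, here is a minimal sketch. The HTML snippet is made up for illustration, and it uses Python's built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = """
<html>
  <body>
    <h1>Quarterly report</h1>
    <p class="note">First paragraph</p>
    <p class="note">Second paragraph</p>
  </body>
</html>
"""

# "html.parser" ships with Python, so nothing else needs installing.
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree by tag name, or search it by tag and class.
heading = soup.h1.get_text()
notes = [p.get_text() for p in soup.find_all("p", class_="note")]

print(heading)
print(notes)
```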
I was trying to scrape some data out of a website. I started with Selenium for the basic actions such as dropdowns and clicks, but I used BeautifulSoup to parse the elements and store the data, and then used that data to populate the values into the dropdowns. It was so good to see how fast the dropdown options were getting selected! As long as you can see the values in the page source, inside the HTML body, you are good to use BeautifulSoup.
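As a rough sketch of reading dropdown options out of the page source: the markup, the id "region" and the option values below are all invented. With Selenium, the page_source string would presumably come from the driver's page_source attribute instead of a literal.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; the id "region" and the option values are invented.
page_source = """
<select id="region">
  <option value="">Select a region</option>
  <option value="10">North</option>
  <option value="20">South</option>
</select>
"""

soup = BeautifulSoup(page_source, "html.parser")
dropdown = soup.find("select", id="region")

# Map each visible label to its value, skipping the empty placeholder option.
options = {
    opt.get_text(strip=True): opt["value"]
    for opt in dropdown.find_all("option")
    if opt.get("value")
}

print(options)
```

The resulting dictionary can then drive whatever selects the dropdown entries, without looking up each option in the browser one by one.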
But what if you can't see the values?
Imagine a search shows a table of data, but when you right-click on the page and choose View Page Source, you don't see the table. How do you use BeautifulSoup then? On its own you can't, because BeautifulSoup is just an HTML parser and does not run scripts. If a table is visible in the browser but missing from the HTML body, it is being generated dynamically through scripts, and in those scenarios Selenium is the usual way to pull the content.
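The limitation is easy to reproduce with a small, made-up page in which the table only exists inside a script. BeautifulSoup sees the empty container and the script text, but no table element:

```python
from bs4 import BeautifulSoup

# Made-up page source: the table is built by JavaScript after the page loads,
# so the HTML sent by the server only contains an empty container.
page_source = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML =
        "<table><tr><td>42</td></tr></table>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(page_source, "html.parser")

# The script is never executed, so no <table> element exists in the tree.
table = soup.find("table")
container = soup.find("div", id="results")

print(table)                           # None
print(repr(container.get_text(strip=True)))  # ''
```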
I was determined to find other ways to pull the data without using Selenium. Since this was a data-related website, one possible approach was to find the API behind the website and pull the data from it directly. If that is possible, you can still use BeautifulSoup, because the data comes back as markup and all you need to do is parse it.
So how do you find out whether the website has an API that can be accessed? These are the steps:
Open the website you intend to scrape and, in your browser (Chrome in my case), click on the three vertical dots in the upper right corner, then go to More tools > Developer tools.
A window will open on the side. Go to the Network tab if it is not already selected, click on Fetch/XHR next to the All option, and press Ctrl+R to reload the page.
The panel will start recording the network events. All this while your website stays open on the side.
Now let's perform some actions on the website, like selecting from the dropdowns and clicking on the Search button.
I select some values from the dropdowns and click on the Search button. I now see a table of data below. As soon as I click the Search button, I also see a new request appear in the side window.
If you click on that request, you will get information about the request and response headers, including the Request URL. If you scroll down you can also view the form data. You can preview the data returned and also check the response. If the status code is 200, we can be sure that the data has been fetched. But if the API is secured and requires authentication, it is better to go with Selenium or Scrapy for scraping dynamic content.
Now go to your code. Right below the logic that clicks the Search button, write the code to fetch the data from the Request URL using the form data.
You can get the values of the currently selected dropdowns from the BeautifulSoup content. Store these values, because we will pass them to the Request URL to fetch the data. For example, if I want to see the table data for particular fields, I store those field values in variables and pass them as form data fields. In my code, rID and sID hold those values, and they go into a payload variable that is sent as JSON input.
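A minimal sketch of that step follows. The names rID and sID come from the description above, but the markup, the option values and the selected_value helper are hypothetical. It also assumes the selected attribute is actually present in the page source you parse, which depends on how the site marks the current selection.

```python
import json

from bs4 import BeautifulSoup

# Invented markup: two dropdowns whose ids happen to match the payload keys.
page_source = """
<select id="rID">
  <option value="1">Region A</option>
  <option value="2" selected>Region B</option>
</select>
<select id="sID">
  <option value="7" selected>Sector X</option>
  <option value="8">Sector Y</option>
</select>
"""


def selected_value(soup, select_id):
    """Return the value of the selected option, or None if there is none."""
    dropdown = soup.find("select", id=select_id)
    if dropdown is None:
        return None
    # selected=True matches any <option> that carries a "selected" attribute.
    chosen = dropdown.find("option", selected=True)
    return chosen["value"] if chosen else None


soup = BeautifulSoup(page_source, "html.parser")

# Collect the current selections into the payload that will be sent.
payload = {
    "rID": selected_value(soup, "rID"),
    "sID": selected_value(soup, "sID"),
}

print(json.dumps(payload))
```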
For the request, I imported the urllib3 package and used its request function. You pass this function the HTTP method, which is POST here, along with the URL and the form fields.
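One way this request could look with urllib3 is sketched below. The URL is a placeholder to be replaced with the Request URL from the developer tools, fetch_results is a hypothetical helper, and sending the payload as a JSON body is an assumption; whether the endpoint expects JSON or form fields has to be checked in the recorded request.

```python
import json

import urllib3

# Placeholder: copy the real Request URL from the Headers tab in DevTools.
REQUEST_URL = "https://example.com/api/search"


def fetch_results(payload, url=REQUEST_URL, http=None):
    """POST the payload as JSON and return the raw response body as bytes."""
    # A PoolManager handles connection pooling; callers may pass their own.
    http = http or urllib3.PoolManager()
    response = http.request(
        "POST",
        url,
        body=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Anything other than 200 means the data was not fetched as expected.
    if response.status != 200:
        raise RuntimeError(f"Request failed with status {response.status}")
    return response.data
```

Calling fetch_results({"rID": "2", "sID": "7"}) would then return the response body, ready to hand to BeautifulSoup. If the recorded request shows form data rather than JSON, urllib3 also accepts a fields dictionary in place of body.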
If the request is successful, it returns the data that matches the form fields sent. That entire response can be loaded into BeautifulSoup, which is what I do right after the request in my code. I am parsing the content as XML. You can also use html.parser or html5lib.
Once you have it as BeautifulSoup content, you can simply parse the entire data and fetch the records within a matter of seconds. On this website the data is in the form of tables, so we walk the content by its tr and td tags, fetch the data and store it as a dictionary. I was able to fetch 700+ records in a matter of minutes for all dropdown values. The beauty of this approach is that the response already contains all the records, so there is no need to paginate or click a Next button.
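The tr and td walk can be sketched roughly as follows. The response body, the column names and the table_to_records helper are all invented, and the sketch uses html.parser rather than the XML parser mentioned above so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Invented response body standing in for what the API returned.
response_body = b"""
<table>
  <tr><th>Name</th><th>City</th><th>Score</th></tr>
  <tr><td>Asha</td><td>Pune</td><td>81</td></tr>
  <tr><td>Ravi</td><td>Delhi</td><td>77</td></tr>
</table>
"""


def table_to_records(markup):
    """Turn each data row of the table into a dict keyed by the header row."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return []
    # The first row supplies the keys; it may use <th> or <td> cells.
    headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip rows whose cell count does not line up with the header.
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(response_body)
for record in records:
    print(record)
```

Storing one dictionary per row keeps the column names attached to the values, which makes it easy to write the records to CSV or JSON later.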
Just to summarize, I used BeautifulSoup to parse the contents of the page, used the API to fetch the data, and parsed that data to create a dictionary from it. To close, here is how the tools that can be used for scraping with Python compare. Each of them has its advantages and disadvantages.
In a nutshell, go with BeautifulSoup if you want to speed up development or if you just want to familiarize yourself with Python and web scraping. With Scrapy, demanding web scraping applications can be implemented in Python, provided you have the appropriate know-how. Use Selenium if your primary goal is to scrape dynamic content with Python.