Mission Humane: Scraping the Java Way

Introduction

By definition, web scraping is the process of extracting large amounts of information from a website using scripts or programs. Such scripts let you extract data from a website, store it, and present it however you choose. The collected data can also feed into a larger project that uses it as input. With web scraping, you can not only automate the process but also scale it to handle as many websites as your computing resources allow.

In this blog, we will explore web scraping using the Java language. I assume you are familiar with the basics of Java.

Why Web Scraping?

Web scraping offers several advantages, including:

  • The time required to extract information from a particular source is significantly reduced as compared to manually copying and pasting the data.

  • The extracted data is more accurate and uniformly formatted, ensuring consistency.

  • A web scraper can be integrated into a system and feed data directly into it, enhancing automation.

  • Some websites and organizations provide no API for the information on their sites. APIs make data extraction easier, since they are simple to consume from within other applications. In their absence, we can use web scraping to extract the information.

What to Use:

Before you continue, ensure you have the following installed on your computer:

  • Java

  • Maven: to manage our project in terms of generation, packaging, dependency management, testing, and other operations.

  • An IDE or Text Editor of your choice (IntelliJ, Eclipse, VS Code or Sublime Text)

  • Libraries used:

Selenium: used to automate web browsers.

TestNG: a testing framework designed to cover all categories of tests: unit, functional, end-to-end, integration, and more.

Once Maven is installed and set up, add all the dependencies required to scrape the data.
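The dependency section of the pom.xml might look roughly like this (the version numbers here are my assumption; pick whatever recent versions suit your project):

```xml
<dependencies>
    <!-- Browser automation -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>3.141.59</version>
    </dependency>
    <!-- Test framework -->
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>7.4.0</version>
    </dependency>
    <!-- JSON serialization (ObjectMapper, used later) -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.12.3</version>
    </dependency>
</dependencies>
```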


Implementation:

Finally — let’s write some code!

Step 1: Create the TestNG class, launch the browser, and navigate to the URL (https://covidpune.com).
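A minimal skeleton for this step might look as follows (the class and method names are my choice, and it assumes chromedriver is available on your PATH):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

public class CovidPuneScraperTest {

    private WebDriver driver;

    @BeforeClass
    public void setUp() {
        // Assumes chromedriver is on the PATH
        // (otherwise set the webdriver.chrome.driver system property)
        driver = new ChromeDriver();
    }

    @Test
    public void scrapeBedAvailability() {
        driver.get("https://covidpune.com");
        // Extraction logic from the following steps goes here
    }

    @AfterClass
    public void tearDown() {
        driver.quit();
    }
}
```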

Step 2: Here we extract data for the beds category; each result should be formatted as below.

{
    "description": <some description of the availability> (required),
    "category": <category> (required, one of: ["Bed", "Blood Plasma", "Oxygen", "Remdesivir", "Fabiflu", "Tocilizumab"]),
    "state": <state> (required),
    "district": <district/city/area> (not obligatory, but preferred),
    "phoneNumber": [<ph1>, <ph2>] (list of phone numbers, at least one required),
    "addedOn": <Unix timestamp in seconds> (required, if it is not an update),
    "modifiedOn": <Unix timestamp in seconds> (required if addedOn is not provided)
}

Step 3: Let's inspect the page. With Selenium, you can locate an element or button by name, ID, tag name, CSS selector, or XPath. After inspecting the elements, I found that this page is a dynamic web table.

XPath for the rows:

rows = driver.findElements(By.xpath("//tbody/tr"));

XPath for the columns:

cols = driver.findElements(By.xpath("//tbody/tr/td"));

XPath for the button that loads the next page:

next20Button = driver.findElement(By.xpath("//*[@id=\"root\"]/div/div/div[2]/div[3]/div/button"));

Use the tag name locator to identify all hospital links:

buttons = driver.findElements(By.tagName("strong"));

Step 4: Go through all the rows on the current page, extract the relevant data from each, then move to the next page and repeat until every page has been processed.
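One way to sketch that loop, assuming the locators above and a fixed page count (the exact number of pages and the pagination behavior are assumptions you should verify against the live site):

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

// Inside the test method, after navigating to the page:
int pageCount = 5; // assumption: adjust to the actual number of pages
for (int page = 0; page < pageCount; page++) {
    // Re-locate the rows on each page to avoid stale element references
    List<WebElement> rows = driver.findElements(By.xpath("//tbody/tr"));
    for (WebElement row : rows) {
        List<WebElement> cells = row.findElements(By.tagName("td"));
        for (WebElement cell : cells) {
            // Collect the cell text for later processing
            System.out.println(cell.getText());
        }
    }
    // Load the next page of results
    WebElement next20Button = driver.findElement(
            By.xpath("//*[@id=\"root\"]/div/div/div[2]/div[3]/div/button"));
    next20Button.click();
}
```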



Step 5: After extracting the data, I need to save it in a structured format. For this I used Java's HashMap. Before adding the data to the map, I created a class named DataCovidPune with the fields String description, String category, String phone, String pincode, String address, and String addedTime, and used this class as the value type of the HashMap.
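A plain data class along those lines might look like this (the constructor and getter names are my assumption):

```java
// Simple value class holding one scraped record
class DataCovidPune {
    private final String description;
    private final String category;
    private final String phone;
    private final String pincode;
    private final String address;
    private final String addedTime;

    DataCovidPune(String description, String category, String phone,
                  String pincode, String address, String addedTime) {
        this.description = description;
        this.category = category;
        this.phone = phone;
        this.pincode = pincode;
        this.address = address;
        this.addedTime = addedTime;
    }

    String getDescription() { return description; }
    String getCategory() { return category; }
    String getPhone() { return phone; }
    String getPincode() { return pincode; }
    String getAddress() { return address; }
    String getAddedTime() { return addedTime; }
}
```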

HashMap<String, DataCovidPune> hmap = new HashMap<>();

Step 6: Add the extracted data to the HashMap and convert it to JSON text using Jackson's ObjectMapper.

hmap.put(hospitalName, dataCovidPune);

String json = new ObjectMapper().writeValueAsString(hmap);

Step 7: Finally, I used Java's FileWriter class to write this JSON string to a text file. Here is the output file.
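A minimal sketch of that last step, using try-with-resources so the file is closed properly (the helper and file names are my choice):

```java
import java.io.FileWriter;
import java.io.IOException;

class JsonFileUtil {
    // Writes the given JSON string to a text file at the given path
    static void writeJson(String json, String path) throws IOException {
        try (FileWriter writer = new FileWriter(path)) {
            writer.write(json);
        }
    }
}
```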

Closing remarks

I hope this blog has given you the confidence to start web scraping with Selenium using Java.



