What is Web Scraping?
Web Scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user.
How does Web Scraping work?
The process of scraping a website for relevant data can be broken down into a few steps.
Step 1. Web scraping for any target website starts with sending an HTTP request. This request is the same as the one a web browser sends to a website in order to load it.
Step 2. The website servers respond to this initial HTTP request by sending the HTML content of the target web page. This is the page we will scrape to extract the required data.
Step 3. The HTML content of the loaded webpage is used to extract the required data by identifying relevant elements on the page. These elements can be headings, sub-headings, paragraphs, tables, hyperlinks, etc., depending on your requirements.
Step 4. Now that the data is extracted, we store it in a structured format that can be processed to make more informed decisions. Depending on the use case, data can be stored locally in a spreadsheet or CSV or JSON or on a server using databases.
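Translated to code, the four steps above can be sketched using nothing but the JDK standard library. This is a simplified illustration, not production code: the tag name and sample HTML are made up, a hardcoded string stands in for the server response so the sketch runs offline, and real pages need a proper HTML parser (shown later with jsoup) rather than a regex.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {

    // Step 1 + 2: send an HTTP GET request and receive the page's HTML.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 3: pull the text of every <tag>...</tag> element out of the HTML.
    // A regex is enough for this sketch; real-world pages need an HTML parser.
    static List<String> extractAll(String html, String tag) {
        Pattern p = Pattern.compile("<" + tag + "[^>]*>(.*?)</" + tag + ">");
        Matcher m = p.matcher(html);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    // Step 4: store the extracted values, here as one CSV line per record.
    static String toCsv(List<String> values) {
        return String.join("\n", values);
    }

    public static void main(String[] args) {
        // A hardcoded response stands in for a fetch() call so the sketch
        // runs without network access.
        String html = "<ul><li>Recipe One</li><li>Recipe Two</li></ul>";
        System.out.println(toCsv(extractAll(html, "li")));
    }
}
```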
Web Scraping using JSOUP:
Step 1. Create a simple Maven project and add the dependencies below to the POM. (Choose the framework as per your need.)
For JSOUP
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
For Excel
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>
Step 2. I used the TestNG framework.
In the base class, I initialize the driver and read the input data from the config reader. Partial code:
if (browser.equalsIgnoreCase("firefox")) {
    WebDriverManager.firefoxdriver().setup();
    driver = new FirefoxDriver();
} else if (browser.equalsIgnoreCase("chrome")) {
    WebDriverManager.chromedriver().setup();
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--remote-allow-origins=*");
    options.addArguments("--headless");
    options.addArguments("--disable-popup-blocking");
    options.addArguments("--disable-notifications");
    options.addArguments("--disable-extensions");
    // Skip image downloads to speed up page loads.
    options.addArguments("--blink-settings=imagesEnabled=false");
    driver = new ChromeDriver(options);
}
These Chrome options disable notifications, pop-ups, and image loading, and run the browser in headless mode, which speeds up scraping significantly.
Step 3. The method used to scrape data from the target site.
jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string; find and extract data using DOM traversal or CSS selectors; and manipulate HTML elements, attributes, and text.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public void module1(String type) throws IOException {
    String url = ConfigReader.getApplicationUrl();
    PageFactory.initElements(driver, this);
    int pagenumber = Integer.parseInt(pagenumberWeb.getText());
    for (int i = 1; i <= pagenumber; i++) {
        String pageclick1 = "pageLink"; // pageclick1 contains the URL of the page to scrape
        Document doc = Jsoup.connect(pageclick1).get();
        Elements recipeCard = doc.select("span.rcc_recipename");
        Elements recipeId = doc.select("span.rcc_recipename > a");
        for (int j = 0; j < recipeId.size(); j++) {
            String recipeTitle = recipeCard.get(j).text();
            String recipeNumber = recipeId.get(j).text();
            // Write the data into the Excel file
            Baseutil.WriteExcel("outputfileLocation", recipeTitle, recipeNumber);
        }
    }
}
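The CSS-selector extraction used in the method above can be tried out against a small in-memory HTML fragment. The markup below is made-up sample data that mirrors the rcc_recipename elements; parsing from a string avoids any network call.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {

    // Parse an HTML fragment and return the text of every element
    // matching the given CSS selector.
    static List<String> selectText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            out.add(el.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up markup mimicking the recipe cards scraped above.
        String html = "<div>"
                + "<span class=\"rcc_recipename\"><a href=\"/recipe/101\">Pasta</a></span>"
                + "<span class=\"rcc_recipename\"><a href=\"/recipe/102\">Salad</a></span>"
                + "</div>";
        System.out.println(selectText(html, "span.rcc_recipename > a"));
    }
}
```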
Excel Write: Write the scraped data into an Excel sheet.
Code snippet: WriteExcel()
if (ExcelSheetName.equals("OUTPUTEXCEL")) {
    if (rows == 0) {
        sheet.createRow(0);
        sheet.getColumnStyle(0).setFont(font);
        sheet.getRow(0).createCell(0).setCellValue("RecipeId");
        sheet.getRow(0).createCell(1).setCellValue("Recipe Name");
    }
}
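For completeness, here is a minimal sketch of what such a WriteExcel helper could look like with Apache POI. The sheet name and column headers come from the snippet above; the method shape (static, appending one row per call) and the file path in main are assumptions, not the article's actual Baseutil code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelWriter {

    // Append one (recipeId, recipeName) row to the OUTPUTEXCEL sheet,
    // creating the workbook and the header row on first use.
    public static void writeExcel(String filePath, String recipeId, String recipeName)
            throws IOException {
        File file = new File(filePath);
        Workbook workbook;
        if (file.exists()) {
            try (FileInputStream in = new FileInputStream(file)) {
                workbook = new XSSFWorkbook(in);
            }
        } else {
            workbook = new XSSFWorkbook();
            Sheet sheet = workbook.createSheet("OUTPUTEXCEL");
            Row header = sheet.createRow(0);
            header.createCell(0).setCellValue("RecipeId");
            header.createCell(1).setCellValue("Recipe Name");
        }
        Sheet sheet = workbook.getSheet("OUTPUTEXCEL");
        Row row = sheet.createRow(sheet.getLastRowNum() + 1);
        row.createCell(0).setCellValue(recipeId);
        row.createCell(1).setCellValue(recipeName);
        try (FileOutputStream out = new FileOutputStream(file)) {
            workbook.write(out);
        }
        workbook.close();
    }

    public static void main(String[] args) throws IOException {
        // Assumed output path for the sake of the example.
        writeExcel("recipes.xlsx", "101", "Pasta");
        writeExcel("recipes.xlsx", "102", "Salad");
    }
}
```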
Thanks. Keep learning!