Extracting Data from Webpages in Java with the Help of HtmlUnit

The World Wide Web (WWW) is an information system of inter-connected hypertext documents, usually called webpages, which can be accessed through the internet. A webpage may contain any kind of data, from text to multimedia. According to worldwidewebsize.com, the web contains about 4.56 billion indexed webpages.

The internet plays a major role in communication. It is the primary data source for almost 90% of applications, and it holds about 672 exabytes of accessible data (672,000,000,000 gigabytes). The following picture shows the data generated per second on the internet.

Data generated per second on the internet.

While it is amply clear that enormous amounts of data are created on the internet, webpages follow no standard structure; each one has its own layout, so getting that data into our applications is always a herculean task.

Secondly, while working on market intelligence projects, supplementing predictive analytics models and/or segmentation models with secondary research data on competitors' market share offers a significant client impact.

But as we all know, secondary research involves sifting and collating data across competitors and across multiple webpages, and presenting the information in an analysis-ready form. Given that the data is stored in thousands of webpages, a manual copy-paste effort would not be a prudent investment of time. So to get the data, we have to iterate over those webpages dynamically and extract it programmatically.

Fortunately, there is an API that allows us to process webpages dynamically in Java. In this blog I explain how to get (grab) data from websites in Java with the HtmlUnit API.

HtmlUnit:

HtmlUnit is a Java API that simulates a browser. Using this API, a Java program can invoke pages, fill out forms and click links, just like a normal browser (a short sketch after the feature list below illustrates this). HtmlUnit offers the following features:

  • Support for the HTTP and HTTPS protocols
  • Support for cookies
  • Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
  • Support for submit methods POST and GET (as well as HEAD, DELETE)
  • Ability to customize the request headers being sent to the server
  • Support for HTML responses
    • Wrapper for HTML pages that provides easy access to all information contained inside them
    • Support for submitting forms
    • Support for clicking links
    • Support for walking the DOM model of the HTML document
  • Proxy server support
  • Support for basic and NTLM authentication
  • Excellent JavaScript support
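
For instance, the form-filling and form-submitting features above take only a few lines. The sketch below is illustrative rather than definitive: the URL, the form name searchForm and the input names query and submitBtn are hypothetical placeholders, and the imports assume the classic com.gargoylesoftware.htmlunit packages (recent HtmlUnit releases moved to org.htmlunit) and a version where WebClient is AutoCloseable.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class FormDemo {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Load a page that contains a search form (placeholder URL)
            HtmlPage page = webClient.getPage("https://example.com/search");

            // Locate the form and its fields by name (names are hypothetical)
            HtmlForm form = page.getFormByName("searchForm");
            HtmlTextInput query = form.getInputByName("query");
            HtmlSubmitInput submit = form.getInputByName("submitBtn");

            // Fill in the text box and submit, just like a user would
            query.setValueAttribute("HtmlUnit");
            HtmlPage results = submit.click();

            System.out.println(results.getTitleText());
        }
    }
}
```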

How To Guide:

Step 1:

  • Create a new Java project in Eclipse

Extracting data from webpages - diagram 2

Step 2:

Extracting data from webpages - diagram 3

Step 3:

  • Create a Java class. For this example I have created a class named 'googleRes' (a minimal skeleton is sketched below the screenshot).

Extracting data from webpages - diagram 4
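
At this point the class is just an empty shell. A minimal sketch of what it might look like (the main method is an assumption; the original screenshot may differ):

```java
public class googleRes {

    public static void main(String[] args) throws Exception {
        // The HtmlUnit calls from the next steps will go here
    }
}
```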

  • Create and initialize a WebClient object: WebClient is the root class of HtmlUnit and is used to imitate a client browser. It has both parameterised and non-parameterised constructors. Here I have used the single-parameter constructor, passing the BrowserVersion.CHROME constant as the argument, so that the new WebClient object imitates a Chrome browser (see the sketch after the screenshot below).

Extracting data from webpages - diagram 5
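
Inside the main method of the googleRes skeleton above, this step is a single line (the classes live in com.gargoylesoftware.htmlunit in classic HtmlUnit releases; newer releases use org.htmlunit):

```java
// Imitate a Chrome browser; requires imports for WebClient and BrowserVersion
WebClient webClient = new WebClient(BrowserVersion.CHROME);
```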

  • Create an object for the page: The WebClient class has a method called getPage() that is used to fetch a webpage. The return type of getPage() is HtmlPage, so create an HtmlPage object and assign it the result of calling webClient.getPage(). The getPage() method takes one argument: the URL of the webpage you want to fetch (see the sketch after the screenshot below).

Extracting data from webpages - diagram 6
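
Continuing inside main, right after the WebClient is created (the URL is just an example):

```java
// Fetch the webpage; getPage() can throw IOException and
// FailingHttpStatusCodeException, so main declares throws Exception
HtmlPage page = webClient.getPage("https://www.google.com");
```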

An HtmlPage object has now been created that contains all the data stored in the webpage whose URL you passed to getPage(). Now you can play with the HtmlPage object and pull whatever content you want out of the webpage.

Handling the JavaScript

Nowadays almost every website contains JavaScript, which is used to call background functions when some event occurs or to show dynamic advertisements on the page. So if you don't need to do anything with the JavaScript, it is better to turn it off.

Otherwise the JavaScript is executed in the background, and if it makes further internal calls those are executed too; this keeps going, which reduces the performance of your program and makes fetching the websites take a lot of time. The WebClient class offers methods to solve this issue, letting you enable or disable JavaScript for the page (see the sketch after the screenshot below).

Extracting data from webpages - diagram 7
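
A sketch of that step, placed right after the WebClient is created. setJavaScriptEnabled() is the switch described above; the other two options are commonly used alongside it and are my additions, not taken from the original screenshot:

```java
// Turn off JavaScript so background scripts are never executed
webClient.getOptions().setJavaScriptEnabled(false);

// Optional extras: skip CSS processing and don't fail on script errors
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
```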

  • Getting the page contents: The HtmlPage class offers two handy methods, asText() and asXml(), which return the page's text content without tags or with tags respectively (see the sketch after the screenshot below).

Extracting data from webpages - diagram 8
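
Putting the steps together, a complete minimal sketch might look like the following. The URL is only an example, the package names assume a classic com.gargoylesoftware.htmlunit release (newer releases moved to org.htmlunit, where asText() has been renamed asNormalizedText()), and try-with-resources assumes a version where WebClient is AutoCloseable:

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class googleRes {

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Skip JavaScript execution to keep the fetch fast
            webClient.getOptions().setJavaScriptEnabled(false);

            // Fetch the page (example URL)
            HtmlPage page = webClient.getPage("https://www.google.com");

            // asText() strips the HTML tags; asXml() would keep them
            String pageText = page.asText();
            System.out.println(pageText);
        }
    }
}
```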

Here I have used the asText() method, which gets all the text content from the webpage without any HTML tags and stores it in a String object. When you print the string you will get the following result.

Result:

Extracting data from webpages - diagram 9

Sample program to find the number of results for a list of programming languages:

Extracting data from webpages - diagram 10
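
Since the original program is shown only as a screenshot, the sketch below reconstructs the idea under stated assumptions: for each language it fetches the Google search results page and prints the result-count line. The list of languages, the query URL format and the "result-stats" element id are assumptions about the original code and about Google's page structure (Google has used ids such as resultStats and result-stats and may change them again), so treat this as a sketch rather than the author's exact program:

```java
import java.net.URLEncoder;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class googleRes {

    public static void main(String[] args) throws Exception {
        // Illustrative list; the original list of languages is not shown in the text
        String[] languages = {"Java", "Python", "C", "C++", "JavaScript"};

        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            for (String language : languages) {
                // Build the search URL for this language (query format assumed)
                String url = "https://www.google.com/search?q="
                        + URLEncoder.encode(language + " programming language", "UTF-8");

                HtmlPage page = webClient.getPage(url);

                // "result-stats" is assumed to be the id of the "About X results"
                // line; fall back gracefully if it is not present
                DomElement stats = page.getElementById("result-stats");
                String count = (stats != null) ? stats.asText() : "not found";

                System.out.println(language + " : " + count);
            }
        }
    }
}
```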

Result:


Extracting data from webpages - diagram 11
Note: The chart above was created in Excel from the program's results.

You can also access this blog on the BRIDGEi2i GitHub here.
