The World Wide Web (WWW) is an information system which inter-connects hypertext documents which is usually called as webpages and can be accessed through the internet. The webpage may contain any kind of data from text to multimedia. As per worldwidewebsize.com the web contains about 4.56million indexed web pages.
Internet plays major role in communication. It is the primary data source for almost 90% of applications and it has about 672 Exabytes of accessible data (672,000,000,000 Gigabytes). Following picture shows the data generated per second in internet.
While it is amply clear that enormous data is created on the internet, there is no standard structure being followed in webpages, each one has its own structure so getting data into our application is always a herculean task.
Secondly, while working on market intelligence projects, supplementing predictive analytics models and/or segmentation models with secondary research data on competitor’s market share offers a significant client impact.
But as we all know, secondary research involves sifting and collating data across competitors and across multiple webpages and presenting information in an analysis-ready form. Given that data is stored in thousands of webpages a manual copy-paste effort wouldn’t be a prudent investment of time. So to get the data, we have to dynamically iterate and extract data from those webpages.
Fortunately there is an API which allows us to dynamically process the webpages in java. In this blog I have explained about how to get (grab) data from websites in java with HtmlUnit API.
HtmlUnit is an API for java which can simulate a browser. Using this API with java program one can invoke pages, fill out forms, click links, this will work just like a normal browser. HtmlUnit offers the following features
- Support for the HTTP and HTTPS protocols
- Support for cookies
- Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
- Support for submit methods POST and GET (as well as HEAD, DELETE)
- Ability to customize the request headers being sent to the server
- Support for HTML responses
- Wrapper for HTML pages that provides easy access to all information contained inside them
- Support for submitting forms
- Support for clicking links
- Support for walking the DOM model of the HTML document
- Proxy server support
- Support for basic and NTLM authentication
How To Guide:
- Create a new java project in eclipse
- Download the HtmlUnit API from https://sourceforge.net/projects/htmlunit/files/htmlunit
- Add the HtmlUnit jar files into project’s build path
- Create a java class, Here for example I have created ‘googleRes’ class
- Create and initialize an object for WebClient: WebClient is root for HtmlUnit which is used to imitate a client browser. It has a parameterised and non-parameterised constructor. Here I have used single parameter constructor to create a new object by passing BrowserVersion.CHROME constant as an argument. By which a new WebClient object has been created to imitate a chrome browser.
- Create object for page: The WebClient class contains a method called getPage() which is used to fetch a webpage, Return type of getPage() method is HtmlPage. So create an object for HtmlPage and assign it by calling webClient.getPage().The getPage() method requires one argument which is the URL of the webpage you want to fetch.
- Create object for page: The WebClient class contains a method called getPage() which is used to fetch a webpage, Return type of getPage() method is HtmlPage. So create an object for HtmlPage and assign it by calling webClient.getPage().
The getPage() method requires one argument which is the URL of the webpage you want to fetch.
HtmlPage object has been created which contains all the data stored in the webpage which you send as URL argument for getPage() method. Now you can play with the HtmlPage object and you can get whatever content your want from the webpage.
- Getting page contents: HtmlPage class offers two unique methods called asText() & asXml() by which you can get the page’s text content without tag or with tag respectively.
Here I have used asText() method which will get all the text content from the webpage without any html tag and stored it in string object. When you are printing the string you will get the following result.
Sample Program to find no.of results for list of programming languages: