Java Web Scraper using JSoup – Part I

In this tutorial, we will be using JSoup, a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

I am going to use Eclipse as the IDE for my JSoup tutorials.

JSoup can be downloaded here.

Eclipse can be downloaded here.

First, we’ll create our own HTML document to try out the programs we are going to develop. Then, we’ll run the same program against a live webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you finish creating the page, create a new project in Eclipse, add the JSoup jar as an external library, create a new class called GrabLinks, and enter the code below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        // Change this path to wherever you saved sample1.html.
        File input = new File("C:\\Users\\allmankind\\Documents\\sample1.html");
        // Parse the file into a Document, reading it as UTF-8.
        Document doc = Jsoup.parse(input, "UTF-8");

        // "a" is a CSS selector that matches every anchor element.
        Elements links = doc.select("a");

        // Print each link's visible text followed by its href attribute.
        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}

Initially, we imported the classes we need. Then, we parsed the file into a Document named doc, which holds the parsed HTML. After that, we declared an Elements variable named links, which holds all the links we read from the document.

To fill links, we called the select method on the Document we created earlier. It takes a CSS selector and returns every element that matches it; here, "a" matches all anchor tags. When that finishes, we run a loop to output each link’s text and its href.

You can use any kind of loop, by the way. If you want to know how many elements you have read, call links.size() (the method belongs to Elements, not to Document) and store the result in an int variable.
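To make the counting and looping concrete, here is a minimal sketch. It assumes the sample1.html markup is inlined as a string and parsed with Jsoup.parse(String), so no file is needed. The class name CountLinks, the constant SAMPLE_HTML, and the helper countMatches are made up for this illustration and are not part of JSoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CountLinks {

    // Same markup as sample1.html, inlined so the sketch needs no file on disk.
    static final String SAMPLE_HTML =
        "<!DOCTYPE HTML><html lang=\"en-US\"><head><meta charset=\"UTF-8\">"
        + "<title>Sample 1</title></head><body>"
        + "<div id=\"wrapper\">"
        + "<a href=\"link1.html\">This is link1</a>"
        + "<a href=\"link2.html\">This is link2</a>"
        + "<div><a href=\"link3.html\">This is link3</a></div>"
        + "</div></body></html>";

    // Hypothetical helper: how many elements in the HTML match the CSS selector.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        Elements links = doc.select("a");

        // Elements behaves like a list, so size() gives the number of matches.
        int count = links.size();
        System.out.println("Found " + count + " links");

        // An index-based loop works just as well as the for-each loop.
        for (int i = 0; i < count; i++) {
            Element link = links.get(i);
            System.out.println(i + ": " + link.attr("href"));
        }

        // A narrower selector: only anchors that are direct children of #wrapper.
        System.out.println(countMatches(SAMPLE_HTML, "#wrapper > a"));
    }
}
```

With the sample page, "a" matches all three links, while "#wrapper > a" matches only the first two, because link3 sits inside a nested div.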

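Now let’s test our program with a live page, for example an article on Wikipedia. We’ll use this link: http://en.wikipedia.org/wiki/Language. Only two things change in the code: the java.io.File import is no longer needed, and the document is fetched with Jsoup.connect(...).get() instead of being parsed from a file. See below.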


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        String input = "http://en.wikipedia.org/wiki/Language";
        // Fetch the page over HTTP and parse the response into a Document.
        Document doc = Jsoup.connect(input).get();

        // "a" is a CSS selector that matches every anchor element.
        Elements links = doc.select("a");

        // Print each link's visible text followed by its href attribute.
        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}
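
On a real site, many href values are relative (for example /wiki/Linguistics), so the output above is not always a usable URL. One way to handle this, sketched below, is absUrl("href"), which resolves the attribute against the document's base URI. To keep the sketch runnable offline, it parses an inline string with Jsoup.parse(html, baseUri) instead of fetching a page. The class name AbsoluteLinks, the helper absoluteLinks, the example.com base URI, and the sample markup are all assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinks {

    // Hypothetical helper: parse the HTML against a base URI and
    // return every link's href resolved to an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        // "a[href]" skips anchors that have no href attribute at all.
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up markup with one root-relative and one page-relative link.
        String html = "<a href=\"/wiki/Linguistics\">Linguistics</a>"
                    + "<a href=\"link1.html\">This is link1</a>";
        for (String url : absoluteLinks(html, "http://example.com/docs/")) {
            System.out.println(url);
        }
    }
}
```

When the document comes from Jsoup.connect(url).get(), its base URI is already set to the fetched URL, so link.absUrl("href") can be used directly inside the loop of GrabLinks.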

Thank you for reading. Don’t forget to share or leave a comment.