Java Web Scraper using JSoup – Part I

In this tutorial, we will be using JSoup. JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

I am going to use Eclipse as the IDE for my JSoup tutorials.

JSoup can be downloaded here.

Eclipse can be downloaded here.

First, we’ll create our own HTML document to try out the programs we are going to develop. Then, we’ll try the program with a real webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you finish creating the page, create a new project in Eclipse, add JSoup as an external library, create a new class called GrabLinks, and type in the code below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        // Load the sample page from disk; change the path to wherever you saved sample1.html.
        File input = new File("C:\\Users\\allmankind\\Documents\\sample1.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // Select every <a> element in the document.
        Elements links = doc.select("a");

        // Print each link's text followed by the value of its href attribute.
        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}

First, we imported the classes we need. Then, we parsed the file into a Document named doc, which holds the HTML document. Next, we declared an Elements collection named links, which holds all the links we read from the document.

To fill that collection, we called the select method on doc, which returns every element matching the CSS selector you pass in. Here, the selector "a" matches every anchor tag. Finally, we loop over the results to print each link’s text and its href.
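To make the "depending on what you are looking for" part more concrete, here is a minimal sketch of a few other selectors run against the same markup as sample1.html. The class name SelectorSketch, the helper countMatches, and the inlined SAMPLE_HTML string are made up for illustration; only Jsoup.parse, select, and size come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // A trimmed copy of the body of sample1.html, inlined so no file is needed.
    public static final String SAMPLE_HTML =
        "<div id=\"wrapper\">"
        + "<a href=\"link1.html\">This is link1</a>"
        + "<a href=\"link2.html\">This is link2</a>"
        + "<div><a href=\"link3.html\">This is link3</a></div>"
        + "</div>";

    // Hypothetical helper: counts the elements in html that match cssSelector.
    public static int countMatches(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssSelector);
        return matches.size();
    }

    public static void main(String[] args) {
        // Every anchor in the document.
        System.out.println(countMatches(SAMPLE_HTML, "a"));              // 3
        // Only anchors that are direct children of the element with id "wrapper".
        System.out.println(countMatches(SAMPLE_HTML, "#wrapper > a"));   // 2
        // Only anchors nested inside a div that is itself inside "wrapper".
        System.out.println(countMatches(SAMPLE_HTML, "#wrapper div a")); // 1
    }
}
```

The selector syntax is the same one you would use in a stylesheet, so narrowing a scrape usually means changing the selector string rather than the loop.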

You can use any kind of loop, by the way. If you want to know how many elements you have read, call links.size() and store the result in an int variable.
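As a rough sketch of both points, the snippet below counts the matched links with size() and walks them with an indexed for loop instead of the for-each loop. The class name LinkCountSketch and the helper describeLinks are hypothetical names chosen for this example, and the HTML is passed in as a string so the sketch does not depend on a file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkCountSketch {

    // Hypothetical helper: returns one description line per link found in html.
    public static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a");

        // Elements is a list, so size() gives the number of matched elements.
        int count = links.size();

        List<String> lines = new ArrayList<>();
        // An indexed loop works as well as for-each and gives us the position.
        for (int i = 0; i < count; i++) {
            Element link = links.get(i);
            lines.add((i + 1) + " of " + count + ": \"" + link.text()
                + "\" links to " + link.attr("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<a href=\"link1.html\">This is link1</a>"
            + "<a href=\"link2.html\">This is link2</a>";
        for (String line : describeLinks(html)) {
            System.out.println(line);
        }
    }
}
```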

Now let’s test our program with a Wikipedia article, for example. We’ll use this link: http://en.wikipedia.org/wiki/Language. The code changes only slightly; see below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the page over HTTP instead of reading a local file.
        String input = "http://en.wikipedia.org/wiki/Language";
        Document doc = Jsoup.connect(input).get();

        // Select every <a> element in the document.
        Elements links = doc.select("a");

        // Print each link's text followed by the value of its href attribute.
        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}
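One thing you may notice on a real page is that many href values are relative, such as /wiki/Linguistics. The sketch below shows one way to turn them into full URLs with absUrl. To keep it runnable offline, it parses a small inline snippet with an explicit base URI instead of connecting to Wikipedia; the class name AbsoluteLinkSketch, the helper absoluteLinks, and the snippet itself are assumptions made for this example. A Document fetched with Jsoup.connect(...).get() already carries the page address as its base URI, so absUrl should behave the same way there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinkSketch {

    // Hypothetical helper: resolves every href in html against baseUri.
    public static List<String> absoluteLinks(String html, String baseUri) {
        // The second argument tells JSoup what relative URLs are relative to.
        Document doc = Jsoup.parse(html, baseUri);

        List<String> urls = new ArrayList<>();
        // "a[href]" skips anchors that have no href attribute at all.
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Linguistics\">Linguistics</a>";
        String base = "http://en.wikipedia.org/wiki/Language";
        // Prints http://en.wikipedia.org/wiki/Linguistics
        System.out.println(absoluteLinks(html, base).get(0));
    }
}
```

In the GrabLinks program, the equivalent change would be replacing link.attr("href") with link.absUrl("href") inside the loop.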

Thank you for reading. Don’t forget to share or leave a comment.
