Java Web Scraper using JSoup – Part IV

In this tutorial we’ll be scraping a webpage that contains a list of items. It is aimed at beginners in web scraping. If you are expecting more advanced material, I will be posting those tutorials soon, but for now you can read through or just skip this part.

The address that we’ll be using here is http://bemorewithless.com/my-100-thing-challenge/

Now, we’ll be fetching the 100 items that are listed on the site.

Here’s our code.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class FetchList {

    public static void main(String[] args) throws IOException {
        String input = "http://bemorewithless.com/my-100-thing-challenge/";
        Document doc = Jsoup.connect(input).get();

        // Every <li> inside the ordered list within the div with class "postarea"
        Elements items = doc.select("div.postarea ol li");

        for (Element item : items) {
            System.out.println(item.text());
        }
    }
}

Run the program and the output should be the 100 items listed on the website. We used the same library as before; we only changed the URL that the input variable holds, and we fetch it with Jsoup.connect() instead of parsing a local file.
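As a side note, some sites respond slowly or reject requests that don’t look like they come from a browser. JSoup’s Connection lets you set a user agent and a timeout before calling get(). The sketch below could stand in for the Jsoup.connect(input).get() line above; the user-agent string and the timeout value are only illustrative examples.

// Sketch only: the user-agent string and 10-second timeout are example values
Document doc = Jsoup.connect(input)
        .userAgent("Mozilla/5.0")
        .timeout(10000) // milliseconds
        .get();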

You may be wondering why I put div.postarea before ol and li. It’s because the list is contained within a div tag with the class name postarea. You can see those names if you view the page source.
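If you want to see that scoping more explicitly, here is a small sketch that could replace the select call and loop in the main method above: it grabs the ordered list inside the postarea div first, then walks its list items and prints them with a running number.

// Sketch: select the <ol> inside div.postarea, then iterate over its <li> children
Element list = doc.select("div.postarea ol").first();
if (list != null) {
    int count = 1;
    for (Element item : list.select("li")) {
        System.out.println(count++ + ". " + item.text());
    }
}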

Thank you for reading. Don’t forget to leave a comment and share.

Java Web Scraper using JSoup – Part III

In this tutorial, I will show you how to read data from tables. Sometimes you have to develop a program that reads data from a table within an HTML page, for example, reading jokes and their authors from a site. Here’s a sample HTML page named sample3.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 3</title>
</head>
<body>
  <table>
    <tr>
      <td>Author1</td>
      <td>Joke1</td>
    </tr>
    <tr>
      <td>Author2</td>
      <td>Joke2</td>
    </tr>
  </table>
</body>
</html>

Once you’ve created that HTML file, create a new class in Eclipse, name it GrabJokes, and type the code below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabJokes {

    public static void main(String[] args) throws IOException {
        File input = new File("C:\\Users\\allmankind\\Documents\\sample3.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // The first <td> of each row is the author, the last <td> is the joke
        Elements authors = doc.select("tr td:first-child");
        Elements jokes = doc.select("tr td:last-child");

        for (int i = 0; i < authors.size(); i++) {
            System.out.println(authors.get(i).text() + " - " + jokes.get(i).text());
        }
    }
}

As in the other tutorials, we initialized the Document and the Elements and gave them names, then ran a loop and printed the output to the console. The difference lies in doc.select("tr td:first-child") and doc.select("tr td:last-child"). The first selector picks the first td within each table row, which in our case is the author. The second selector picks the last td within each table row, which is the joke itself.
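An equivalent approach, if you prefer to keep each author explicitly paired with their joke, is to walk the table row by row and read the two cells of each row. This sketch could stand in for the two select calls and the loop in GrabJokes above:

// Sketch: iterate over every table row and read its <td> cells in order
for (Element row : doc.select("table tr")) {
    Elements cells = row.select("td");
    if (cells.size() == 2) {
        System.out.println(cells.get(0).text() + " - " + cells.get(1).text());
    }
}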

Thank you for reading. Don’t forget to share and leave a comment.

Java Web Scraper using JSoup – Part II

In this tutorial, we’ll be selecting the text inside <p> and <div> tags from an HTML page and, as a bonus, saving it to a text file. First, we create our HTML document and name it sample2.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 2</title>
</head>
<body>
	<div>First div - some more text here</div>
	<p>Paragraph1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
	<div>Second div - some more text here</div>
	<p>Paragraph2 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
</body>
</html>

When you’ve finished coding the HTML document, create another class in Eclipse and name it GrabElements. Here is the code:

package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabElements {

    public static void main(String[] args) throws IOException {
        File input = new File("C:\\Users\\allmankind\\Documents\\sample2.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // All <div> elements in the document
        Elements divs = doc.select("div");

        System.out.println("The div tags are: ");
        for (Element div : divs) {
            System.out.println(div.text());
        }

        // All <p> elements in the document
        Elements ps = doc.select("p");

        System.out.println("\nThe p tags are: ");
        for (Element p : ps) {
            System.out.println(p.text());
        }
    }
}

This works the same way as the Part I tutorial. We initialized the Document, which contains the HTML page, but this time we initialized two Elements objects, divs and ps. After selecting each set of elements, we again run a loop to print what we need to the console.
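As a side note, JSoup also understands comma-separated group selectors, so if you wanted the divs and paragraphs together in document order rather than grouped by tag, a single select call would do. A minimal sketch:

// Sketch: "div, p" matches both tag types, returned in document order
for (Element el : doc.select("div, p")) {
    System.out.println(el.tagName() + ": " + el.text());
}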

Like I said, we are also going to save the data to a text file. Just make the few changes shown below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class GrabElements {

    public static void main(String[] args) throws IOException {
        File input = new File("C:\\Users\\allmankind\\Documents\\sample2.html");
        Document doc = Jsoup.parse(input, "UTF-8");
        BufferedWriter out = new BufferedWriter(
                new FileWriter("C:\\Users\\allmankind\\Documents\\sample2.txt"));

        Elements divs = doc.select("div");

        out.write("The div tags are: ");
        for (Element div : divs) {
            out.newLine();
            out.write(div.text());
        }

        Elements ps = doc.select("p");

        out.newLine();
        out.newLine();
        out.write("The p tags are: ");
        for (Element p : ps) {
            out.newLine();
            out.write(p.text());
        }

        out.close();
        System.out.println("Process Completed.");
    }
}

There you have it.
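If you prefer, the writer can also be opened in a try-with-resources block so the file is closed automatically even if something goes wrong while writing. A sketch of what the div-writing part would look like with that change:

// Sketch: try-with-resources closes the BufferedWriter automatically
try (BufferedWriter out = new BufferedWriter(
        new FileWriter("C:\\Users\\allmankind\\Documents\\sample2.txt"))) {
    out.write("The div tags are: ");
    for (Element div : doc.select("div")) {
        out.newLine();
        out.write(div.text());
    }
    // ... write the <p> elements the same way ...
}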

Thank you for reading. Don’t forget to share and leave a comment.

Java Web Scraper using JSoup – Part I

In this tutorial, we will be using JSoup. JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

I am going to use Eclipse as the IDE with my JSoup tutorials.

JSoup can be downloaded here.

Eclipse can be downloaded here.

First, we’ll create our own HTML document to try out the programs we are going to develop. Then, we’ll try the program with a real webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you finish creating the page, create a new project in Eclipse, add JSoup as an external library, create a new class called GrabLinks, and type the code below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        File input = new File("C:\\Users\\allmankind\\Documents\\sample1.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // All <a> elements in the document
        Elements links = doc.select("a");

        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}

First, we imported the classes we need. Then we initialized a Document named doc, which holds the HTML document, and an Elements object named links, which contains all the links read from the document.

After that, we used the select method of the Document we initialized earlier; it returns every element matching the CSS query you pass in, so "a" here matches every anchor tag. Once that is done, we run a loop to print each link’s text and its href attribute.
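For example, if you only wanted the links that are direct children of the wrapper div (link1 and link2, but not link3, which sits inside a nested div), you could narrow the selector with the child combinator. A quick sketch:

// Sketch: "div#wrapper > a" matches only <a> tags that are direct children of the wrapper div
Elements directLinks = doc.select("div#wrapper > a");
for (Element link : directLinks) {
    System.out.println(link.text() + " links to " + link.attr("href"));
}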

You can use any kind of loop, by the way. If you want to know how many elements you have read, use links.size() and store the result in an integer variable, as in the sketch below.
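For instance, an index-based loop using links.size():

// Sketch: an index-based loop; links.size() tells you how many elements were matched
System.out.println("Found " + links.size() + " links");
for (int i = 0; i < links.size(); i++) {
    Element link = links.get(i);
    System.out.println("\"" + link.text() + "\" links to " + link.attr("href"));
}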

Now let’s test our program with a Wikipedia article. We’ll use this link: http://en.wikipedia.org/wiki/Language. There are only slight changes to the code; see below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GrabLinks {

    public static void main(String[] args) throws IOException {
        String input = "http://en.wikipedia.org/wiki/Language";
        Document doc = Jsoup.connect(input).get();

        Elements links = doc.select("a");

        for (Element link : links) {
            System.out.print("\"" + link.text() + "\"");
            System.out.println(" links to " + link.attr("href"));
        }
    }
}
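One thing you’ll notice in the output is that most Wikipedia links are relative (paths starting with /wiki/). Because the document was loaded from a URL, JSoup knows its base URI, so you can ask for the resolved absolute form instead. A small sketch that could replace the loop above:

// Sketch: the "abs:" prefix resolves the href against the document's base URI
for (Element link : links) {
    System.out.println("\"" + link.text() + "\" links to " + link.attr("abs:href"));
}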

Thank you for reading. Don’t forget to share or leave a comment.