Java Web Scraper using JSoup – Part II

In this tutorial, we'll select the text inside <p> and <div> tags from an HTML page and, as a bonus, save it to a text file. First, create the HTML document and name it sample2.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 2</title>
</head>
<body>
	<div>First div - some more text here</div>
	<p>Paragraph1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
	<div>Second div - some more text here</div>
	<p>Paragraph2 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
</body>
</html>

When you have finished the HTML document, create another class in Eclipse and name it GrabElements. Here is the code:

package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class GrabElements {

    public static void main(String[] args) throws IOException {
        // Load the HTML file from disk and parse it as UTF-8.
        File input = new File("C:\\Users\\allmankind\\Documents\\sample2.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // Select every <div> element and print its text.
        Elements divs = doc.select("div");

        System.out.println("The div tags are: ");
        for (Element div : divs) {
            System.out.println(div.text());
        }

        // Select every <p> element and print its text.
        Elements ps = doc.select("p");

        System.out.println("\nThe p tags are: ");
        for (Element p : ps) {
            System.out.println(p.text());
        }
    }
}

This works the same way as in the Part I tutorial. We load the HTML page into a Document, but this time we select two sets of Elements, divs and ps. We then loop over each set and print the text of every element to the console.
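If you would rather see the text in the order it appears on the page, one option is to select both tags at once. The sketch below is only an illustration: the class name GrabInOrder and the helper grab are made up for this example, and it parses an HTML string instead of the file so it can run on its own. It relies on jsoup's comma selector ("div, p"), which, as far as I know, returns matches in document order.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class GrabInOrder {

    // Returns "tag: text" for every <div> and <p> found in the given HTML.
    static List<String> grab(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        // "div, p" matches elements that are either a <div> or a <p>.
        for (Element el : doc.select("div, p")) {
            lines.add(el.tagName() + ": " + el.text());
        }
        return lines;
    }

    public static void main(String[] args) {
        // A shortened stand-in for sample2.html.
        String html = "<div>First div</div><p>Paragraph1</p>"
                    + "<div>Second div</div><p>Paragraph2</p>";
        for (String line : grab(html)) {
            System.out.println(line);
        }
    }
}
```

The tag name in front of each line makes it easy to tell the divs and paragraphs apart once they are mixed together.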

As promised, we are also going to save the data to a text file. Only a few lines need to change, as shown below.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class GrabElements {

    public static void main(String[] args) throws IOException {
        // Load the HTML file from disk and parse it as UTF-8.
        File input = new File("C:\\Users\\allmankind\\Documents\\sample2.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // try-with-resources closes the writer even if a write fails.
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter("C:\\Users\\allmankind\\Documents\\sample2.txt"))) {

            // Write the text of every <div>, one per line.
            Elements divs = doc.select("div");

            out.write("The div tags are: ");
            for (Element div : divs) {
                out.newLine();
                out.write(div.text());
            }

            // Leave a blank line, then write the text of every <p>.
            Elements ps = doc.select("p");

            out.newLine();
            out.newLine();
            out.write("The p tags are: ");
            for (Element p : ps) {
                out.newLine();
                out.write(p.text());
            }
        }

        System.out.println("Process Completed.");
    }
}
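The two writing loops are nearly identical, so they could be folded into a small helper. This is a rough sketch rather than part of the tutorial's code: the class name SaveSection and the method writeSection are invented here, the lists stand in for the text you would collect from div.text() and p.text(), and the output goes to a temporary file instead of the hard-coded path. It also opens the writer with an explicit UTF-8 charset, which is an assumption on my part about the encoding you want.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class SaveSection {

    // Writes a heading, then one line per item, then a blank line.
    static void writeSection(BufferedWriter out, String heading, List<String> items)
            throws IOException {
        out.write(heading);
        out.newLine();
        for (String item : items) {
            out.write(item);
            out.newLine();
        }
        out.newLine();
    }

    public static void main(String[] args) throws IOException {
        // Assumed output location: a temp file rather than a fixed Windows path.
        Path target = Files.createTempFile("sample2", ".txt");

        // The writer is closed automatically when the try block ends.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writeSection(out, "The div tags are: ", Arrays.asList("First div", "Second div"));
            writeSection(out, "The p tags are: ", Arrays.asList("Paragraph1", "Paragraph2"));
        }

        System.out.println("Saved to " + target);
    }
}
```

In the real program you would build each list by looping over the Elements and adding the text of every element before calling the helper.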

There you have it.

Thank you for reading. Don’t forget to share and leave a comment.
