Java Web Scraper using JSoup – Part IV

In this tutorial we’ll scrape a webpage that contains a list of items. It is aimed at beginners in web scraping. If you are looking for more advanced material, I will be posting those tutorials soon; for now you can read through or skip this part.

The address that we’ll be using here is http://bemorewithless.com/my-100-thing-challenge/

We’ll fetch the 100 items that are listed on that page.

Here’s our code.


package org.soup.examples;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

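import java.io.IOException;

public class FetchList {

    public static void main(String[] args) throws IOException {
        // Page that contains the ordered list we want to scrape
        String input = "http://bemorewithless.com/my-100-thing-challenge/";
        Document doc = Jsoup.connect(input).get();

        // Select only the <li> items of the <ol> inside <div class="postarea">
        Elements items = doc.select("div.postarea ol li");

        // Print the text of each list item
        for (Element item : items) {
            System.out.println(item.text());
        }
    }
}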

Run the program and the output should be the 100 items listed on the website. We used the same library as before; what changed is the URL stored in the input variable and the selector passed to select.

You may be wondering why I put div.postarea before ol and li. It’s because the list we want sits inside a div tag with the class name postarea, and scoping the selector to that div keeps list items from other parts of the page out of the result. You can see those names if you view the page source.
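To see the effect of that scoping without hitting the network, here is a minimal sketch. It does not use JSoup; it uses the XPath support that ships with the JDK, and the HTML fragment, the sidebar class and the class name ScopedSelectorDemo are made up for illustration. The XPath //div[@class='postarea']//ol//li plays roughly the same role as the CSS selector div.postarea ol li, assuming the class attribute is exactly postarea.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScopedSelectorDemo {

    // Hypothetical, well-formed stand-in for the real page (assumption, not the actual markup).
    static final String PAGE =
        "<html><body>"
        + "<div class='sidebar'><ol><li>Menu entry</li></ol></div>"
        + "<div class='postarea'><ol><li>Item one</li><li>Item two</li></ol></div>"
        + "</body></html>";

    // Returns the text of every node matched by the given XPath expression.
    static List<String> textOf(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unscoped: also picks up the sidebar list
        System.out.println(textOf("//ol/li"));
        // Scoped to the postarea div: only the items we want
        System.out.println(textOf("//div[@class='postarea']//ol//li"));
    }
}
```

The unscoped query returns three items, including the sidebar entry, while the scoped one returns only the two items inside the postarea div.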

Thank you for reading. Don’t forget to leave a comment and share.

VB.Net Web Scraper Using HTMLAgilityPack – Part III

In this tutorial, I will show you how to read data from tables. Sometimes you have to develop a program that reads data from a table within an HTML page, for example reading jokes and their authors from a site. Here’s a sample HTML page named sample3.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 3</title>
</head>
<body>
<table>
<tr>
<td>Author1</td>
<td>Joke1</td>
</tr>
<tr>
<td>Author2</td>
<td>Joke2</td>
</tr>
</table>
</body>
</html>

When you are finished with the HTML page, create a new VB.Net project in Visual Studio. Select Console Application and name it GrabJokes. Type the code below.


Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim doc As New HtmlDocument
        ' Path to the sample page (VB.Net string literals need no backslash escaping)
        doc.Load("C:\Users\allmankind\Documents\sample3.html")
        ' Each <tr> holds an author in its first <td> and a joke in its last <td>
        For Each row As HtmlNode In doc.DocumentNode.SelectNodes("//tr")
            Console.Write(row.Elements("td").First().InnerText)
            Console.Write(" - ")
            Console.WriteLine(row.Elements("td").Last().InnerText)
        Next
        Console.ReadKey()
    End Sub

End Module

As in the other tutorials in this series, the code is similar. The difference is inside the For Each loop: it selects every tr, prints the text of the row’s first td, prints a hyphen, and then prints the text of the row’s last td.
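For readers following the Java tutorials in this series, the same first-cell and last-cell idea can be sketched with the JDK’s built-in XPath support. This is only an illustration, not HtmlAgilityPack: the class name TableRowsDemo is made up, and the table is a compact copy of sample3.html embedded as a string. Here td[1] and td[last()] stand in for First() and Last().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableRowsDemo {

    // Compact copy of the table in sample3.html
    static final String PAGE =
        "<html><body><table>"
        + "<tr><td>Author1</td><td>Joke1</td></tr>"
        + "<tr><td>Author2</td><td>Joke2</td></tr>"
        + "</table></body></html>";

    // Returns one "author - joke" line per table row.
    static List<String> rows() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList trs = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < trs.getLength(); i++) {
                Node row = trs.item(i);
                // Relative paths are evaluated against the current row only
                String first = xpath.evaluate("td[1]", row);
                String last = xpath.evaluate("td[last()]", row);
                lines.add(first + " - " + last);
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : rows()) {
            System.out.println(line);
        }
    }
}
```

With the two sample rows this prints Author1 - Joke1 and Author2 - Joke2, one per line.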

Thank you for reading. Don’t forget to share and leave a comment.

VB.Net Web Scraper using HTMLAgilityPack – Part II

In this tutorial, we’ll select the text inside <p> and <div> tags from an HTML page and, as a bonus, save it to a text file. First, we create our HTML document and name it sample2.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
	<div>First div - some more text here</div>
	<p>Paragraph1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
	<div>Second div - some more text here</div>
	<p>Paragraph2 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
</body>
</html>

When you are finished with the HTML page, create a new VB.Net project in Visual Studio. Select Console Application and name it GrabElements. Type the code below.


Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim doc As New HtmlDocument
        ' Path to the sample page (VB.Net string literals need no backslash escaping)
        doc.Load("C:\Users\allmankind\Documents\sample2.html")

        ' Print the text of every <div>
        Console.WriteLine("The div tags are: ")
        For Each div As HtmlNode In doc.DocumentNode.SelectNodes("//div")
            Console.WriteLine(div.InnerText)
        Next

        ' Print the text of every <p>
        Console.WriteLine()
        Console.WriteLine("The p tags are:")
        For Each p As HtmlNode In doc.DocumentNode.SelectNodes("//p")
            Console.WriteLine(p.InnerText)
        Next

        Console.ReadKey()
    End Sub

End Module

When you create a new VB.Net console application in Visual Studio, some code is generated for you. Libraries are imported at the top of the file with the Imports keyword, and the main module (Module1) with its Sub Main is generated as well.

Add the .dll from the HTMLAgilityPack folder to your project via Add Reference under the Project menu. Then add the line that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document and load the page from its local path on your computer. Finally, we run two loops: one reads every div element and the other every p element, and each prints the element’s text to the console.

Now, as I did in Part II of the Java tutorial, I am going to save the elements to a text file.

Create a new VB.Net Console Application in Visual Studio and name it GrabElementsTxt. Type the code below.


Imports HtmlAgilityPack
Imports System.IO

Module Module1

    Sub Main()
        Dim doc As New HtmlDocument
        doc.Load("C:\Users\allmankind\Documents\sample2.html")

        ' Using closes the file even if an error occurs while writing
        Using file As New StreamWriter("C:\Users\allmankind\Documents\sample2.txt")
            file.WriteLine("The div tags are: ")
            For Each div As HtmlNode In doc.DocumentNode.SelectNodes("//div")
                file.WriteLine(div.InnerText)
            Next

            file.WriteLine()
            file.WriteLine("The p tags are:")
            For Each p As HtmlNode In doc.DocumentNode.SelectNodes("//p")
                file.WriteLine(p.InnerText)
            Next
        End Using
    End Sub

End Module
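Closing the writer matters here: buffered text may not reach the disk until the file is closed or flushed. As a cross-language illustration for readers of the Java tutorials, here is a small sketch of the same write-then-close pattern using try-with-resources. The class name SaveLinesDemo and the use of a temporary file (instead of the Documents path above) are my own assumptions so that the example can run anywhere.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class SaveLinesDemo {

    // Writes the lines to a temporary file, then reads them back to confirm they were saved.
    static List<String> saveAndReadBack(List<String> lines) {
        try {
            Path out = Files.createTempFile("sample2", ".txt");
            // try-with-resources closes (and flushes) the writer even if a write fails
            try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                for (String line : lines) {
                    writer.write(line);
                    writer.newLine();
                }
            }
            List<String> readBack = Files.readAllLines(out, StandardCharsets.UTF_8);
            Files.delete(out);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "The div tags are: ",
            "First div - some more text here",
            "",
            "The p tags are:",
            "Paragraph1 - Lorem ipsum");
        System.out.println(saveAndReadBack(lines));
    }
}
```

In a real scraper the lines would come from the parsed document rather than a hard-coded list.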

Thank you for reading. Don’t forget to share and leave a comment.

VB.Net Web Scraper using HTMLAgilityPack – Part I

In this tutorial, we will develop a simple web scraping program that extracts the text and href of every link in an HTML page. For this series of tutorials, I will be using Visual Studio 2010 for the VB.Net language and a library called HtmlAgilityPack.

You can download HTMLAgilityPack here http://htmlagilitypack.codeplex.com/releases/view/90925.

First, we’ll create our own HTML document to try out the programs we are going to develop. Then, we’ll try the program with a real webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you are finished with the HTML page, create a new VB.Net project in Visual Studio. Select Console Application and name it GetLinks. Type the code below.


Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim doc As New HtmlDocument
        ' Path to the sample page (VB.Net string literals need no backslash escaping)
        doc.Load("C:\Users\allmankind\Documents\sample1.html")
        ' Print "link text - href" for every <a> element
        For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a")
            Console.Write(link.InnerText)
            Console.Write(" - ")
            Console.WriteLine(link.Attributes("href").Value)
        Next
        Console.ReadKey()
    End Sub

End Module

When you create a new VB.Net console application in Visual Studio, some code is generated for you. Libraries are imported at the top of the file with the Imports keyword, and the main module (Module1) with its Sub Main is generated as well.

Add the .dll from the HTMLAgilityPack folder to your project via Add Reference under the Project menu. Then add the line that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document and load the page from its local path on your computer. Finally, we run a loop that reads every link in the document and prints its text and href to the console.

Now let’s test our program with a Wikipedia article. We’ll use this link: http://en.wikipedia.org/wiki/Language. The main change is that the page is loaded over HTTP with HtmlWeb instead of from a local file; see below.


Imports HtmlAgilityPack

Module Module1

    Sub Main()
        ' HtmlWeb downloads the page over HTTP and parses it
        Dim web As New HtmlWeb
        Dim doc As HtmlDocument = web.Load("http://en.wikipedia.org/wiki/Language")
        ' //a[@href] skips anchors that have no href attribute
        For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
            Console.Write(link.InnerText)
            Console.Write(" - ")
            Console.WriteLine(link.Attributes("href").Value)
        Next
        Console.ReadKey()
    End Sub

End Module
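One thing to watch for on real pages: not every <a> element has an href, and reading the value of a missing attribute can fail at run time. An XPath predicate such as //a[@href] keeps only the anchors that have one. The sketch below shows the difference using the JDK’s XPath support rather than HtmlAgilityPack; the class name LinksWithHrefDemo and the HTML fragment are made up for illustration. Note that the JDK’s DOM returns an empty string for a missing attribute, which is not necessarily how other libraries behave.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinksWithHrefDemo {

    // Hypothetical fragment: the middle anchor has a name but no href.
    static final String PAGE =
        "<html><body>"
        + "<a href='link1.html'>This is link1</a>"
        + "<a name='top'>Anchor without href</a>"
        + "<a href='link2.html'>This is link2</a>"
        + "</body></html>";

    // Returns "text - href" for every element matched by the XPath expression.
    static List<String> links(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element a = (Element) nodes.item(i);
                // getAttribute returns "" when the attribute is absent
                result.add(a.getTextContent() + " - " + a.getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(links("//a").size());        // all anchors, with or without href
        System.out.println(links("//a[@href]").size()); // only anchors that have an href
        System.out.println(links("//a[@href]"));
    }
}
```

Filtering in the query keeps the loop body simple, since every node it receives is guaranteed to have the attribute.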

Thank you for reading. Don’t forget to share and leave a comment.

C# Web Scraper Using HTMLAgilityPack – Part III

In this tutorial, I will show you how to read data from tables. Sometimes you have to develop a program that reads data from a table within an HTML page, for example reading jokes and their authors from a site. Here’s a sample HTML page named sample3.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 3</title>
</head>
<body>
<table>
<tr>
<td>Author1</td>
<td>Joke1</td>
</tr>
<tr>
<td>Author2</td>
<td>Joke2</td>
</tr>
</table>
</body>
</html>

When you are finished with the HTML page, create a new C# project in Visual Studio. Select Console Application and name it GrabJokes. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GrabJokes
{
class Program
{
static void Main(string[] args)
{
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample3.html");
            // Each <tr> holds an author in its first <td> and a joke in its last <td>
            foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
            {
                Console.Write(row.Elements("td").First().InnerText);
                Console.Write(" - ");
                Console.WriteLine(row.Elements("td").Last().InnerText);
            }
            Console.ReadKey();

}
}
}

As in the other tutorials in this series, the code is similar. The difference is inside the foreach loop: it selects every tr, prints the text of the row’s first td, prints a hyphen, and then prints the text of the row’s last td.

Thank you for reading. Don’t forget to share and leave a comment.

C# Web Scraper using HTMLAgilityPack – Part II

In this tutorial, we’ll select the text inside <p> and <div> tags from an HTML page and, as a bonus, save it to a text file. First, we create our HTML document and name it sample2.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
	<div>First div - some more text here</div>
	<p>Paragraph1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
	<div>Second div - some more text here</div>
	<p>Paragraph2 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
</body>
</html>

When you are finished with the HTML page, create a new C# project in Visual Studio. Select Console Application and name it GrabElements. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GrabElements
{
class Program
{
static void Main(string[] args)
{
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample2.html");

            // Print the text of every <div>
            Console.WriteLine("The div tags are: ");
            foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
            {
                Console.WriteLine(div.InnerText);
            }

            // Print the text of every <p>
            Console.WriteLine("\nThe p tags are:");
            foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//p"))
            {
                Console.WriteLine(p.InnerText);
            }
            Console.ReadKey();

}
}
}

When you create a new C# console application in Visual Studio, some code is generated for you, such as the libraries imported at the top with the using keyword. The Program class with its Main method is also generated.

Add the .dll from the HTMLAgilityPack folder to your project via Add Reference under the Project menu. Then add the line that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document and load the page from its local path on your computer. Finally, we run two loops: one reads every div element and the other every p element, and each prints the element’s text to the console.

Now, as I did in Part II of the Java tutorial, I am going to save the elements to a text file.

Create a new C# Console Application in Visual Studio and name it GrabElementsTxt. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.IO;

namespace GrabElementsTxt
{
class Program
{
static void Main(string[] args)
{
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample2.html");

            // using closes the file even if an error occurs while writing
            using (StreamWriter file = new StreamWriter("C:\\Users\\allmankind\\Documents\\sample2.txt"))
            {
                file.WriteLine("The div tags are: ");
                foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
                {
                    file.WriteLine(div.InnerText);
                }

                file.WriteLine("\nThe p tags are:");
                foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//p"))
                {
                    file.WriteLine(p.InnerText);
                }
            }
}
}
}

Thank you for reading. Don’t forget to share and leave a comment.

C# Web Scraper using HTMLAgilityPack – Part I

In this tutorial, we will develop a simple web scraping program that extracts the text and href of every link in an HTML page. For this series of tutorials, I will be using Visual Studio 2010 for the C# language and a library called HtmlAgilityPack.

You can download HTMLAgilityPack here http://htmlagilitypack.codeplex.com/releases/view/90925.

First, we’ll create our own HTML document to try out the programs we are going to develop. Then, we’ll try the program with a real webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you are finished with the HTML page, create a new C# project in Visual Studio. Select Console Application and name it GetLinks. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GetLinks
{
class Program
{
static void Main(string[] args)
{
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample1.html");
            // Print the text and the href of every <a> element
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a"))
            {
                Console.WriteLine(link.InnerText);
                Console.WriteLine(link.Attributes["href"].Value);
            }
            Console.ReadKey();

}
}
}

When you create a new C# console application in Visual Studio, some code is generated for you, such as the libraries imported at the top with the using keyword. The Program class with its Main method is also generated.

Add the .dll from the HTMLAgilityPack folder to your project via Add Reference under the Project menu. Then add the line that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document and load the page from its local path on your computer. Finally, we run a loop that reads every link in the document and prints its text and href to the console.

Now let’s test our program with a Wikipedia article. We’ll use this link: http://en.wikipedia.org/wiki/Language. The main change is that the page is loaded over HTTP with HtmlWeb instead of from a local file; see below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GetLinks
{
class Program
{
static void Main(string[] args)
{
            // HtmlWeb downloads the page over HTTP and parses it
            HtmlDocument doc = new HtmlWeb().Load("http://en.wikipedia.org/wiki/Language");
            // //a[@href] skips anchors that have no href attribute
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                Console.WriteLine(link.InnerText);
                Console.WriteLine(link.Attributes["href"].Value);
            }
            Console.ReadKey();

}
}
}
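When you run this against a live page, many of the href values you get back are likely to be relative (for example a path starting with / or a bare #fragment) rather than full URLs. If you want absolute links, each href has to be resolved against the address of the page it came from. Here is a minimal sketch of that step in Java, for readers following the Java tutorials, using only java.net.URI; the class name ResolveHrefDemo and the sample hrefs are hypothetical, not values taken from the actual page.

```java
import java.net.URI;

class ResolveHrefDemo {

    // Resolves an href found on a page against that page's own URL.
    static String absolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://en.wikipedia.org/wiki/Language";

        // Root-relative path: keeps the scheme and host of the page
        System.out.println(absolute(page, "/wiki/Linguistics"));
        // Fragment only: points back into the same page
        System.out.println(absolute(page, "#History"));
        // Already absolute: returned unchanged
        System.out.println(absolute(page, "http://example.com/other.html"));
    }
}
```

URI.create throws an IllegalArgumentException for strings that are not valid URIs, so hrefs scraped from arbitrary pages may need to be cleaned or skipped first.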

Thank you for reading. Don’t forget to share and leave a comment.