VB.Net Web Scraper using HTMLAgilityPack – Part I

In this tutorial, we will develop a simple web scraping program that extracts the text and href of every link in an HTML page. For this series of tutorials, I will be using Visual Studio 2010 with VB.Net and a library called HtmlAgilityPack.

You can download HtmlAgilityPack here: http://htmlagilitypack.codeplex.com/releases/view/90925.

First, we’ll create our own HTML document to try out the program we are going to develop. Then, we’ll run the program against a live webpage URL.

Here is our sample HTML document named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
  <a href="link1.html">This is link1</a>
  <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you are finished with the HTML page, create a new VB.Net project in Visual Studio. Select Console Application and name it GetLinks. Type in the code below.


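Imports HtmlAgilityPack

Module Module1

    Sub Main()
        ' Holds the parsed HTML document.
        Dim doc As New HtmlDocument
        ' VB.Net strings have no backslash escapes, so single backslashes are enough.
        doc.Load("C:\Users\allmankind\Documents\sample1.html")
        ' "//a" is an XPath expression that selects every <a> element in the page.
        For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a")
            Console.Write(link.InnerText)
            Console.Write(" - ")
            Console.WriteLine(link.Attributes("href").Value)
        Next
        ' Keep the console window open until a key is pressed.
        Console.ReadKey()
    End Sub

End Module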

When you create a new VB.Net console application in Visual Studio, some code is generated for you: the main module, Module1, with an empty Sub Main. Libraries are imported at the top of the file with the Imports keyword.

Add the .dll from the HtmlAgilityPack folder to your project via Add Reference under the Project menu. Then add the Imports line shown in the code above. Next, we create an HtmlDocument named doc to hold the HTML document and load the page from its local path on your computer. Finally, we run a loop that reads every link in the document and writes its text and href to the console.
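
The loop above assumes every link has an href and that the page contains at least one link. Here is a minimal sketch of a more defensive version, assuming the same HtmlAgilityPack reference. The module name LinkListing and the helper ListLinks are my own names for illustration and are not part of the library. It parses HTML held in a string with LoadHtml, selects only anchors that carry an href, and handles the case where SelectNodes finds nothing.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module LinkListing

    ' Hypothetical helper: returns "text - href" for every <a> that has an href.
    Function ListLinks(html As String) As List(Of String)
        Dim results As New List(Of String)
        Dim doc As New HtmlDocument
        ' Parse markup held in memory instead of reading a file.
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when the XPath matches no elements.
        Dim anchors = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return results

        For Each link As HtmlNode In anchors
            ' GetAttributeValue falls back to the given default if the attribute is missing.
            results.Add(link.InnerText & " - " & link.GetAttributeValue("href", ""))
        Next
        Return results
    End Function

    Sub Main()
        Dim html = "<div><a href=""link1.html"">This is link1</a><a name=""top"">no href</a></div>"
        For Each entry In ListLinks(html)
            Console.WriteLine(entry)
        Next
    End Sub

End Module
```

With the two anchors in this sample string, only the first one is listed, because the second has no href.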

Now let’s test our program with a live page, for example an article on Wikipedia. We’ll use this link: http://en.wikipedia.org/wiki/Language. There are slight changes to the code; see below.


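Imports HtmlAgilityPack

Module Module1

    Sub Main()
        ' HtmlWeb downloads the page and returns it as a parsed HtmlDocument.
        Dim web As New HtmlWeb
        Dim doc As HtmlDocument = web.Load("http://en.wikipedia.org/wiki/Language")
        ' Select only anchors that have an href, so .Value is never read from a missing attribute.
        For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
            Console.Write(link.InnerText)
            Console.Write(" - ")
            ' WriteLine puts each link on its own line.
            Console.WriteLine(link.Attributes("href").Value)
        Next
        ' Keep the console window open until a key is pressed.
        Console.ReadKey()
    End Sub

End Module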
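
Many of the href values printed for a live page are relative. On Wikipedia they typically start with /wiki/ or are just a # fragment. If you want full addresses, one possible approach is to resolve each href against the URL of the page using the .NET Uri class. The sketch below is only an illustration. The module name AbsoluteLinks and the helper ToAbsolute are hypothetical names of mine, and the two sample href values are made up for the demonstration.

```vbnet
Imports System

Module AbsoluteLinks

    ' Hypothetical helper: resolves an href against the address of the page it came from.
    Function ToAbsolute(pageUrl As String, href As String) As String
        Dim resolved As Uri = Nothing
        If Uri.TryCreate(New Uri(pageUrl), href, resolved) Then
            Return resolved.AbsoluteUri
        End If
        ' Leave anything that cannot be resolved unchanged.
        Return href
    End Function

    Sub Main()
        Dim page = "http://en.wikipedia.org/wiki/Language"
        Console.WriteLine(ToAbsolute(page, "/wiki/Linguistics"))
        Console.WriteLine(ToAbsolute(page, "#History"))
    End Sub

End Module
```

In the scraper, you would pass the value of link.Attributes("href").Value as the second argument.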

Thank you for reading. Don’t forget to share and leave a comment.

2 thoughts on “VB.Net Web Scraper using HTMLAgilityPack – Part I”

  1. Pingback: FAQ

  2. Thanks! Your information helped me to use this pack, just 1 thing i got to change: Dim doc As New HtmlAgilityPack.HtmlDocument to get it running 🙂
