C# Web Scraper using HTMLAgilityPack – Part II

In this tutorial, we'll select the text inside <p> and <div> tags from an HTML page and, as a bonus, save it to a text file. First, create the HTML document and name it sample2.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
	<div>First div - some more text here</div>
	<p>Paragraph1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
	<div>Second div - some more text here</div>
	<p>Paragraph2 - Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ornare velit vel ipsum consectetur facilisis. In iaculis tempor elit a porttitor. Etiam nisl eros, rutrum a purus a, placerat fringilla ante. </p>
</body>
</html>

When you are finished with the HTML page, create a new C# project in Visual Studio. Select Console Application and name it GrabElements. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GrabElements
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the local HTML file (adjust the path for your machine).
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample2.html");

            // Print the inner text of every <div> element.
            Console.WriteLine("The div tags are: ");
            foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
            {
                Console.WriteLine(div.InnerText);
            }

            // Print the inner text of every <p> element.
            Console.WriteLine("\nThe p tags are:");
            foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//p"))
            {
                Console.WriteLine(p.InnerText);
            }

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}

When you create a new C# console application in Visual Studio, some code is generated for you, such as the using directives at the top that import namespaces. The Program class and its Main method are also generated.

Add the .dll from the HTMLAgilityPack folder to your project via Add Reference under the Project menu. Then add the using directive that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document, and load the HTML page from its local path on your computer. Finally, we run two loops: the first selects every <div> element and the second every <p> element, and each writes the element's inner text to the console.
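
The select-and-loop step can also be wrapped in a small helper. This is only a minimal sketch and not part of the project above: the NodeText class and its Collect method are hypothetical names, the HTML string is made up for the demo, and it assumes SelectNodes returns null when nothing matches, which is how the HtmlAgilityPack versions I have used behave.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class NodeText
{
    // Returns the trimmed inner text of every node matching the XPath,
    // or an empty list when nothing matches.
    public static List<string> Collect(HtmlDocument doc, string xpath)
    {
        var results = new List<string>();

        // Assumption: SelectNodes returns null (not an empty collection)
        // when no node matches, so guard before looping.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize converts entities such as &amp; back to plain characters.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}

class NodeTextDemo
{
    static void Main()
    {
        // LoadHtml parses a string instead of a file, which is handy for quick tests.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div>First div</div><p>Tom &amp; Jerry </p></body></html>");

        foreach (string text in NodeText.Collect(doc, "//p"))
        {
            Console.WriteLine(text);
        }

        // No <span> exists in the sample, so this loop simply does not run.
        foreach (string text in NodeText.Collect(doc, "//span"))
        {
            Console.WriteLine(text);
        }
    }
}
```

With a helper like this, a page that happens to contain no <div> or <p> elements produces an empty list instead of a NullReferenceException in the foreach loop.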

Now, as I did in Part 2 of the Java tutorial, I am going to save the elements to a text file.

Create a new C# Console Application in Visual Studio and name it GrabElementsTxt. Type the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.IO;

namespace GrabElementsTxt
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the local HTML file (adjust the path for your machine).
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample2.html");

            // The using block closes the file even if an exception is thrown.
            using (StreamWriter file = new StreamWriter("C:\\Users\\allmankind\\Documents\\sample2.txt"))
            {
                // Write the inner text of every <div> element.
                file.WriteLine("The div tags are: ");
                foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
                {
                    file.WriteLine(div.InnerText);
                }

                // Write the inner text of every <p> element.
                file.WriteLine("\nThe p tags are:");
                foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//p"))
                {
                    file.WriteLine(p.InnerText);
                }
            }
        }
    }
}
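
Both programs hard-code a path under one particular user's Documents folder. As a rough sketch of one way around that, the path can be built at run time. The DocumentPaths class and InDocuments method are hypothetical names of mine, not part of the tutorial projects, and the UTF-8 encoding is simply my choice for the demo.

```csharp
using System;
using System.IO;
using System.Text;

static class DocumentPaths
{
    // Builds a path to fileName inside the current user's Documents folder.
    public static string InDocuments(string fileName)
    {
        string documents = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
        return Path.Combine(documents, fileName);
    }
}

class PathDemo
{
    static void Main()
    {
        string outputPath = DocumentPaths.InDocuments("sample2.txt");

        // false = overwrite rather than append; the using block closes the file.
        using (var file = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            file.WriteLine("The div tags are:");
        }

        Console.WriteLine("Wrote " + outputPath);
    }
}
```

The same InDocuments call could supply the sample2.html path passed to doc.Load, so the program would not need editing for each user name.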

Thank you for reading. Don’t forget to share and leave a comment.

4 thoughts on “C# Web Scraper using HTMLAgilityPack – Part II”

  3. Good tutorial. This crawls a single page; what if we want to crawl the whole website? What would the logic behind that be?

    • One common way is to parse the links from the site,
      normally by analyzing the header menu and footer menu.
      It's a good thing developers name these elements after what they actually are (e.g. header, head, menu, foot, footer).
      That's a plus for web scrapers, because you can set yours to find those tokens and extract all the links from them,
      save the links somewhere,
      then scrape each of those links in turn.

      I hope this comment helped you. Thank you. 🙂
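
The approach described in this reply could be sketched roughly as follows. This is an illustration under my own assumptions, not code from the tutorial: the Crawler class, the page limit, and the https://example.com/ start address are all hypothetical, it follows every <a href> link rather than only those in header or footer menus, and it does no error handling for failed requests.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class Crawler
{
    // Returns the absolute http/https URLs of all <a href="..."> links in the page.
    public static List<Uri> ExtractLinks(HtmlDocument doc, Uri pageUrl)
    {
        var links = new List<Uri>();

        // Assumption: SelectNodes returns null when no node matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            string href = HtmlEntity.DeEntitize(a.GetAttributeValue("href", ""));
            Uri absolute;
            // Resolve relative links such as "/about" against the page URL,
            // and skip non-web schemes such as mailto:.
            if (Uri.TryCreate(pageUrl, href, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                // Drop the #fragment so the same page is not queued twice.
                links.Add(new Uri(absolute.GetLeftPart(UriPartial.Query)));
            }
        }
        return links;
    }

    // Breadth-first crawl that stays on the start URL's host and stops after maxPages.
    public static void Crawl(Uri start, int maxPages)
    {
        var seen = new HashSet<string> { start.AbsoluteUri };
        var queue = new Queue<Uri>();
        queue.Enqueue(start);
        var web = new HtmlWeb();
        int visited = 0;

        while (queue.Count > 0 && visited < maxPages)
        {
            Uri current = queue.Dequeue();
            HtmlDocument doc = web.Load(current.AbsoluteUri);
            visited++;
            Console.WriteLine(current.AbsoluteUri);

            foreach (Uri link in ExtractLinks(doc, current))
            {
                // HashSet.Add returns false for links that were already saved.
                if (link.Host == start.Host && seen.Add(link.AbsoluteUri))
                {
                    queue.Enqueue(link);
                }
            }
        }
    }
}

class CrawlDemo
{
    static void Main()
    {
        // Hypothetical start page; replace with a site you are allowed to crawl.
        Crawler.Crawl(new Uri("https://example.com/"), 20);
    }
}
```

To restrict the crawl to menu links as the reply suggests, the XPath in ExtractLinks could be narrowed, for example to //header//a[@href], assuming the target site uses such elements.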
