C# Web Scraper using HTMLAgilityPack – Part I

In this tutorial, we will develop a simple web scraping program that prints the text and the href of every link in an HTML page. For this series of tutorials, I will be using Visual Studio 2010 for C# and a library called HtmlAgilityPack.

You can download HtmlAgilityPack here: http://htmlagilitypack.codeplex.com/releases/view/90925.

First, we’ll create our own HTML document to try out the program. Then we’ll run the program against the URL of a live web page.

Here is our sample HTML document, named sample1.html.


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>Sample 1</title>
</head>
<body>
  <div id="wrapper">
    <a href="link1.html">This is link1</a>
    <a href="link2.html">This is link2</a>
    <div>
      <a href="link3.html">This is link3</a>
    </div>
  </div>
</body>
</html>

When you are finished with the HTML page, create a new C# project in Visual Studio. Select Console Application and name it GetLinks. Then type in the code below.

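using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GetLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the sample page from disk; change the path to wherever you saved it.
            HtmlDocument doc = new HtmlDocument();
            doc.Load("C:\\Users\\allmankind\\Documents\\sample1.html");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    Console.WriteLine(link.InnerText);
                    // Prints an empty line instead of throwing if an <a> has no href.
                    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
                }
            }

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}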

When you create a new C# console application in Visual Studio, some code is generated for you: the using directives at the top, which import namespaces, and the Program class with its Main method.

Add the .dll from the HtmlAgilityPack folder to your project via Add Reference under the Project menu. Then add the using directive that imports the library, as in the code above. Next, we create an HtmlDocument named doc to hold the HTML document, and load the page from its local path on your computer. Finally, we loop over every link (every a element) in the document and print its text and its href value to the console.
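The same steps can also be wrapped in a small helper that parses HTML from a string rather than a file, which makes it easy to try out without sample1.html on disk. This is only a minimal sketch: the names LinkExtractor, ExtractLinks, and Demo are made up for illustration and are not part of HtmlAgilityPack. It assumes you only care about a elements that actually carry an href attribute, so it uses the XPath //a[@href].

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
static class LinkExtractor
{
    // Returns (link text, href) pairs for every <a> element that has an href.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // SelectNodes returns null when the XPath matches nothing.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return results;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string text = anchor.InnerText.Trim();
            string href = anchor.GetAttributeValue("href", string.Empty);
            results.Add(new KeyValuePair<string, string>(text, href));
        }
        return results;
    }
}

static class Demo
{
    static void Main()
    {
        // A fragment of the sample page, inlined as a string.
        string html = "<div id=\"wrapper\"><a href=\"link1.html\">This is link1</a>" +
                      "<a href=\"link2.html\">This is link2</a></div>";

        foreach (KeyValuePair<string, string> pair in LinkExtractor.ExtractLinks(html))
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }
}
```

Returning a list instead of printing directly keeps the parsing separate from the output, so the same helper could later feed a file or a database instead of the console.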

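Now let’s test our program on a live page, for example a Wikipedia article. We’ll use this link: http://en.wikipedia.org/wiki/Language. Only the way the document is loaded changes: instead of calling Load on an HtmlDocument with a file path, we call Load on an HtmlWeb object with the URL. See below.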

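using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace GetLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            // HtmlWeb downloads the page and returns it as a parsed HtmlDocument.
            HtmlDocument doc = new HtmlWeb().Load("http://en.wikipedia.org/wiki/Language");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    Console.WriteLine(link.InnerText);
                    // Some <a> elements on real pages have no href; print an empty line for those.
                    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
                }
            }

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}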
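When you run this against a live page, many of the href values you see are likely to be relative (on Wikipedia, for instance, internal links look like /wiki/Linguistics). If you want full URLs, one option is to resolve each href against the page address with the standard System.Uri class. The sketch below assumes that approach; UrlResolver, ToAbsolute, and ResolverDemo are hypothetical names used only for this example.

```csharp
using System;

// Hypothetical helper for illustration; uses only the standard System.Uri class.
static class UrlResolver
{
    // Resolves href against pageUrl and returns the absolute URL,
    // or null if the combination cannot be parsed as a URI.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;
        if (Uri.TryCreate(baseUri, href, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        return null;
    }
}

static class ResolverDemo
{
    static void Main()
    {
        string page = "http://en.wikipedia.org/wiki/Language";

        // A site-relative link is resolved against the host of the page.
        Console.WriteLine(UrlResolver.ToAbsolute(page, "/wiki/Linguistics"));

        // A link that is already absolute is returned as-is.
        Console.WriteLine(UrlResolver.ToAbsolute(page, "http://example.org/page"));
    }
}
```

Inside the foreach loop of the program above, you could pass the page URL and each href value to a helper like this before printing it.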

Thank you for reading. Don’t forget to share and leave a comment.
