Jul 25, 2023

Web scraping using C#, HTTPClient, and HTML Agility Pack

Author: Kenji Elzerman

Web scraping is a technique for extracting information from online sources, which makes it a good way to combine several sources into one. To build a web scraper you need a few things: online sources, code that can access the internet, and a way to present the results, such as a GUI. In this article, I am going to show you how web scraping works in C#.

Goals of this article

In this article, I will show you how you can scrape information from the internet by using C# and .NET. We will use HttpClient for the online connection and the HTML Agility Pack to read the information in the HttpClient response.

By the end of this article, you will know how to retrieve information from a website and put that information in C# objects for you to use.

I will be using a console application called WebScraping.Console in a solution called WebScraping.

What is web scraping?

I love to travel and I get my information from several sources. Some websites, some social media, and more. It can be a hassle to keep up to date with all these different online sources, so why not combine them all in one? And that is what web scraping can do.

I built a small tool that reads different online pages, gets new information from those pages, and stores it in a database. When I open the GUI, I see all the new articles from 14 different online sources on one page. This saves me a lot of time browsing through different sources.

But before I show you how to get the information from a website, I want to clear up two points that often cause confusion:

Web scraping is legal, as long as the information you are requesting is public. If you need to register somewhere and then get the data, that information is not really public. Just be careful with that.

We will make requests to a website using HttpClient. Don’t make requests every 2 seconds: that can get your IP address blacklisted, and you won’t be able to get the information you want. Such a ban is usually temporary, but sometimes it’s permanent. I usually use intervals of 3 to 6 hours.
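
One way to enforce such an interval in code is to remember when each URL was last requested. The sketch below is only an illustration: the RequestThrottle class and its TryAcquire method are hypothetical names that are not part of this article's project, and the 3-hour interval is simply the lower bound mentioned above.

```csharp
using System;
using System.Collections.Generic;

RequestThrottle throttle = new(TimeSpan.FromHours(3));
DateTimeOffset start = DateTimeOffset.UtcNow;
string url = "https://kenslearningcurve.com/free-tutorials/";

Console.WriteLine(throttle.TryAcquire(url, start));                // True: first request
Console.WriteLine(throttle.TryAcquire(url, start.AddMinutes(10))); // False: too soon
Console.WriteLine(throttle.TryAcquire(url, start.AddHours(4)));    // True: interval has passed

// Hypothetical helper: remembers when each URL was last fetched and only
// allows a new request after the minimum interval has passed.
public sealed class RequestThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Dictionary<string, DateTimeOffset> _lastFetch = new();

    public RequestThrottle(TimeSpan minInterval) => _minInterval = minInterval;

    public bool TryAcquire(string url, DateTimeOffset now)
    {
        if (_lastFetch.TryGetValue(url, out DateTimeOffset last) && now - last < _minInterval)
            return false; // requested too recently, skip this round

        _lastFetch[url] = now;
        return true;
    }
}
```

Passing the current time in as a parameter keeps the helper easy to test. In a real tool you would probably persist the timestamps, for example in the same database as the articles.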

For this article, I am going to scrape my own website and the Microsoft blogs on C# to get an overview of newly posted articles.

A simple class

I want to scrape the tutorial page of my website and the Microsoft page showing the latest articles on C Sharp Programming. Each article on both websites has a thumbnail, title, and short description. I want to use that information in my own application.

To store this information I need a class in my application. A simple class that has the same properties as listed above. I add this class to the root of my console application.

namespace WebScraping.Console;

public class Article
{
    public string Title { get; set; }
    public string Description { get; set; }
    public Uri ThumbnailUri { get; set; }
    public Uri Link { get; set; }
}

Retrieving the HTML

The very first thing we need to do is to grab the HTML from the online source we want to scrape. I will start with the page that shows all the tutorials on my own website: https://kenslearningcurve.com/free-tutorials/.

Getting the HTML is pretty simple: use HttpClient to send a GET request to the page and store the returned HTML in a variable. To do this I create a new method called GetHTMLAsync(string url). My Program.cs looks like this:

string url = "https://kenslearningcurve.com/free-tutorials/";
string html = await GetHTMLAsync(url);

// Sends a GET request to the given URL and returns the response body as a string.
static async Task<string> GetHTMLAsync(string url)
{
    using HttpClient client = new();
    using HttpRequestMessage request = new(HttpMethod.Get, url);

    using HttpResponseMessage response = await client.SendAsync(request);

    if (response.IsSuccessStatusCode)
        return await response.Content.ReadAsStringAsync();

    // Include the status code so a failed request is easier to diagnose.
    throw new HttpRequestException($"Request to {url} failed with status code {(int)response.StatusCode}.");
}

In the method, I initialize the HttpClient and create a new HttpRequestMessage, where I set the request method to GET and specify the URL. Then I call SendAsync on the client and store the resulting HttpResponseMessage. If the request succeeds, I read the content of the response and return it. If it fails, I throw an exception.
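
Creating a new HttpClient for every request works for a small demo, but HttpClient is designed to be created once and reused. The following is a minimal sketch of that variation, not the code used in the rest of this article: the HtmlFetcher class, the 30-second timeout, and the User-Agent value are assumptions chosen for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HtmlFetcher
{
    // One shared client for the lifetime of the application.
    private static readonly HttpClient Client = CreateClient();

    private static HttpClient CreateClient()
    {
        HttpClient client = new() { Timeout = TimeSpan.FromSeconds(30) };
        // Hypothetical User-Agent so the site owner can identify the scraper.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("WebScrapingSample/1.0");
        return client;
    }

    public static async Task<string> GetHtmlAsync(string url)
    {
        using HttpResponseMessage response = await Client.GetAsync(url);
        // Throws an HttpRequestException for non-success status codes.
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

You would call it with string html = await HtmlFetcher.GetHtmlAsync(url); from anywhere in the application.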

If you run the application, nothing is shown yet. That’s why I put a breakpoint in my code, so I can check the contents of the html variable in the Text Visualizer.

Text Visualizer

This looks a lot like HTML! Now we want to grab the articles only.

Html Agility Pack

There are several ways to extract only the part of the HTML we need. You could do it with regular expressions, but there is a third-party library that will make your life a lot easier. Meet Html Agility Pack, or HAP for short.

HAP can read the HTML and with XPath, you can select nodes you want to grab and manage. Let’s install HAP.

Install-Package HtmlAgilityPack

Initializing

To make HAP read the HTML, we need to create an HtmlDocument and load the HTML into it. HtmlDocument is a class that parses the HTML and exposes the result through the DocumentNode property. From there we can walk the nodes in the HTML to find the ones we want to retrieve.

HtmlDocument htmlDoc = new();
htmlDoc.LoadHtml(html);

After we create the HtmlDocument, we can load the HTML into it. I am using LoadHtml because that method accepts the HTML as a string. You can also use the Load method, which has overloads that accept the HTML from other sources, such as a file path, a stream, or a TextReader.
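
To make the difference concrete, here is a small sketch that parses the same markup once with LoadHtml and once with the Load overload that takes a TextReader. The HTML string is made up for the example.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

string html = "<html><body><h1>Hello</h1></body></html>";

// LoadHtml takes the markup itself as a string.
HtmlDocument fromString = new();
fromString.LoadHtml(html);

// Load reads the markup from a source, here a TextReader over the same string.
using StringReader reader = new(html);
HtmlDocument fromReader = new();
fromReader.Load(reader);

Console.WriteLine(fromString.DocumentNode.SelectSingleNode("//h1").InnerText); // Hello
Console.WriteLine(fromReader.DocumentNode.SelectSingleNode("//h1").InnerText); // Hello
```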

Node selection

Now we want to get all the nodes that represent a tutorial listed on the webpage. We need to find a recurring HTML tag that we can fetch as a list. Each tag will have the thumbnail, title, post date, and excerpt. If we inspect the HTML of the page we can see a recurring piece of HTML.

HTML

Each listed tutorial is a list item (LI) in the HTML. Now I am going to use HAP to grab those list items. To do this I need a path to these list items, which is called XPath. We need to be sure the path is unique so we don’t get the wrong items on our list.

An ID is the best anchor because IDs are unique, but I don’t see an ID I can use. I do see some classes, and the unordered list seems promising because of its multiple CSS classes: penci-wrapper-data penci-grid penci-shortcode-render. I can use those CSS classes to get the UL and then the list items.

I will get multiple list items, which are nodes in HAP. So I want to select nodes, not a single node. The htmlDoc variable holds the whole HTML page. Its DocumentNode property is the root node, and to select nodes from the root node we use the SelectNodes method. It takes an XPath expression as a parameter, which points to the list items.

HtmlNodeCollection items = htmlDoc.DocumentNode.SelectNodes("//ul[@class='penci-wrapper-data penci-grid penci-shortcode-render']/li");

The // means the search looks anywhere in the document rather than only directly under the root. The ul means I am looking for an unordered list. The […] filters on attributes, in this case a class attribute whose value is exactly the string of CSS classes shown in the example. Finally, /li selects the list items directly inside that UL.

If I removed the /li, the query would return only the UL, but I want the items.
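
The sketch below runs that kind of XPath against a small, made-up HTML fragment. It also shows two things worth knowing: an exact match on @class breaks as soon as a class is added or reordered, which is why the standard XPath contains() function is sometimes used instead, and SelectNodes returns null (not an empty collection) when nothing matches.

```csharp
#nullable enable
using System;
using HtmlAgilityPack;

// Made-up fragment that mimics the structure described above.
string html =
    "<ul class='penci-wrapper-data penci-grid penci-shortcode-render'>" +
    "<li>First</li><li>Second</li></ul>" +
    "<ul class='penci-grid extra'><li>Other</li></ul>";

HtmlDocument doc = new();
doc.LoadHtml(html);

// Exact match on the full class attribute value.
HtmlNodeCollection? exact = doc.DocumentNode.SelectNodes(
    "//ul[@class='penci-wrapper-data penci-grid penci-shortcode-render']/li");

// Looser match: any UL whose class attribute contains 'penci-grid'.
HtmlNodeCollection? loose = doc.DocumentNode.SelectNodes(
    "//ul[contains(@class, 'penci-grid')]/li");

// No match at all: the result is null.
HtmlNodeCollection? none = doc.DocumentNode.SelectNodes("//ul[@class='missing']/li");

Console.WriteLine(exact?.Count ?? 0); // 2
Console.WriteLine(loose?.Count ?? 0); // 3
Console.WriteLine(none is null);      // True
```

Because of that null result, it is a good habit to check the collection before handing it to a foreach.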

Parsing the title

Next, we want to get the information from the list items and place them in a List<Article> so that we can use it in our application. For that, we need to iterate through the list of nodes, stored in the variable items. We can use the foreach for this.

List<Article> articles = new();

foreach (HtmlNode node in items)
{
    
}

Inside the foreach, we need to grab a specific part of each node. The HtmlNode holds only a small part of the complete HTML: the HTML of one list item. We can use HAP to explore that piece of HTML further.

Let’s start with the title of an article. The title is stored inside the following path: article -> second div -> h2 -> a. This is actually the XPath we are going to use.

The second div can be located by its CSS class (recommended) or by an index. I usually use the CSS class because it is less likely to change than the position of the element.

foreach (HtmlNode node in items)
{
    Console.WriteLine(node.SelectSingleNode("article/div[@class='grid-header-box']/h2/a").InnerHtml);
}

To get the content of the a-element we use the InnerHtml property, which returns the markup inside the element. You can use InnerText instead if you only want the text without any nested tags. SelectSingleNode works like SelectNodes, except that it returns one node. If multiple nodes match, only the first one is returned.
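
Here is a short sketch of the difference between the two properties, using a made-up list item with a nested tag inside the anchor. It also checks for null, because SelectSingleNode returns null when the path does not match anything.

```csharp
#nullable enable
using System;
using HtmlAgilityPack;

HtmlDocument doc = new();
doc.LoadHtml(
    "<ul><li><article><div class='grid-header-box'>" +
    "<h2><a href='/post'>Learn <b>C#</b></a></h2>" +
    "</div></article></li></ul>");

HtmlNode item = doc.DocumentNode.SelectSingleNode("//li");

HtmlNode? link = item.SelectSingleNode("article/div[@class='grid-header-box']/h2/a");
if (link is not null)
{
    Console.WriteLine(link.InnerHtml); // Learn <b>C#</b>
    Console.WriteLine(link.InnerText); // Learn C#
}

// A path that does not exist yields null instead of throwing.
HtmlNode? missing = item.SelectSingleNode("article/div[@class='does-not-exist']/h2/a");
Console.WriteLine(missing is null); // True
```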

You can run the application and you will see the following result:

Select node result

It’s pretty easy to place this information into the List<Article>. But before I show you how, try to get the description of each article yourself and put it in the right property of the Article class.

Done?

List<Article> articles = new();

foreach (HtmlNode node in items)
{
    articles.Add(new()
    {
        Title = node.SelectSingleNode("article/div[@class='grid-header-box']/h2/a").InnerHtml,
        Description = node.SelectSingleNode("article/div[@class='item-content entry-content']/p").InnerHtml
    });
}

Get values of attributes

The link to the actual article and the thumbnail per article are a bit different. The link is embedded in an anchor (<a…></a>) and the thumbnail is also in that same anchor.


The link is in the href attribute, which means we need to grab the single node, get the attribute, and read the value. Easy!

Link = new Uri(node.SelectSingleNode("article/div/a").GetAttributeValue("href", string.Empty));

First I get the anchor (<a …>…</a>). Then I read the value of the href attribute with the GetAttributeValue method. Its second argument is the default value that is returned when the attribute is not found. I chose string.Empty because I have no idea what else I would want to see if the href is missing. Be aware that new Uri(string.Empty) throws a UriFormatException, so a missing href will stop the scraper.
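
If you would rather skip a broken link than crash, Uri.TryCreate is an alternative to the Uri constructor. The sketch below is one possible approach and not the code used in the rest of this article: ReadUri is a hypothetical helper, the HTML is made up, and returning null means the Article property would have to be nullable.

```csharp
#nullable enable
using System;
using HtmlAgilityPack;

HtmlDocument doc = new();
doc.LoadHtml("<div><a href='https://example.com/post'>With link</a><a>No link</a></div>");

foreach (HtmlNode anchor in doc.DocumentNode.SelectNodes("//a"))
{
    Uri? link = ReadUri(anchor, "href");
    Console.WriteLine(link is null ? "no valid href" : link.ToString());
}

// Hypothetical helper: returns null instead of throwing when the attribute
// is missing or does not contain an absolute URI.
static Uri? ReadUri(HtmlNode node, string attributeName)
{
    string value = node.GetAttributeValue(attributeName, string.Empty);
    return Uri.TryCreate(value, UriKind.Absolute, out Uri? uri) ? uri : null;
}
```

The first anchor prints its URL and the second prints "no valid href".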

The link to the thumbnail works the same way, but we need to read the value of the attribute data-bgset. I know you are capable of making this one yourself. The final code will be shown later in this article.

Using the data

Now that we have a list of articles, we can show them on the screen. You can build a whole GUI around it, with WPF, MAUI, or a website, but I am just going to show them in a console application to keep it simple.

The complete code looks like this:

using HtmlAgilityPack;
using WebScraping.Console;

string url = "https://kenslearningcurve.com/free-tutorials/";
string html = await GetHTMLAsync(url);

HtmlDocument htmlDoc = new();
htmlDoc.LoadHtml(html);

HtmlNodeCollection items = htmlDoc.DocumentNode.SelectNodes("//ul[@class='penci-wrapper-data penci-grid penci-shortcode-render']/li");

List<Article> articles = new();

foreach (HtmlNode node in items)
{
    articles.Add(new()
    {
        Title = node.SelectSingleNode("article/div[@class='grid-header-box']/h2/a").InnerHtml,
        Description = node.SelectSingleNode("article/div[@class='item-content entry-content']/p").InnerHtml,
        Link = new Uri(node.SelectSingleNode("article/div/a").GetAttributeValue("href", string.Empty)),
        ThumbnailUri = new Uri(node.SelectSingleNode("article/div/a").GetAttributeValue("data-bgset", string.Empty))
    });
}

foreach (Article article in articles)
{
    Console.WriteLine(article.Title);
    Console.WriteLine(article.Description);
    Console.WriteLine(article.Link);
    Console.WriteLine("--------------------------------------");
}

static async Task<string> GetHTMLAsync(string url)
{
    using HttpClient client = new();
    HttpRequestMessage request = new(HttpMethod.Get, url);

    HttpResponseMessage response = await client.SendAsync(request);

    if (response.IsSuccessStatusCode)
        return await response.Content.ReadAsStringAsync();

    throw new Exception($"Request to {url} failed.");
}

The output of the console application looks like this:

HTML output

Another source

Now I have scraped my own website. A web scraper for a single source doesn’t add much, so let’s add another one. Let’s scrape the Microsoft page, more specifically: https://devblogs.microsoft.com/dotnet/category/csharp/

.NET Blog

Scrape classes

This webpage lists its articles in a different design (and HTML) than my website, so we need to change the code a bit. But if we change the code, we can’t scrape my tutorials page anymore. It’s better to make a class per scraper. So I will place the current code in a class called ‘KensLearningCurverScraper’. To keep the architecture easy to understand, I place the scraper in a folder called Scrapers.

We will need the GetHTMLAsync() method for the Microsoft page too, so I place that in a base class and inherit from it in the KensLearningCurverScraper.

The KensLearningCurverScraper looks like this:

public class KensLearningCurverScraper: ScraperBase
{
    public async Task<List<Article>> Scrape()
    {
        string url = "https://kenslearningcurve.com/free-tutorials/";
        string html = await GetHTMLAsync(url);

        HtmlDocument htmlDoc = new();
        htmlDoc.LoadHtml(html);

        HtmlNodeCollection items = htmlDoc.DocumentNode.SelectNodes("//ul[@class='penci-wrapper-data penci-grid penci-shortcode-render']/li");

        List<Article> articles = new();

        foreach (HtmlNode node in items)
        {
            articles.Add(new()
            {
                Title = node.SelectSingleNode("article/div[@class='grid-header-box']/h2/a").InnerHtml,
                Description = node.SelectSingleNode("article/div[@class='item-content entry-content']/p").InnerHtml,
                Link = new Uri(node.SelectSingleNode("article/div/a").GetAttributeValue("href", string.Empty)),
                ThumbnailUri = new Uri(node.SelectSingleNode("article/div/a").GetAttributeValue("data-bgset", string.Empty))
            });
        }

        return articles;
    }
}

And this is the ScraperBase:

public class ScraperBase
{
    public async Task<string> GetHTMLAsync(string url)
    {
        using HttpClient client = new();
        HttpRequestMessage request = new(HttpMethod.Get, url);

        HttpResponseMessage response = await client.SendAsync(request);

        if (response.IsSuccessStatusCode)
            return await response.Content.ReadAsStringAsync();

        throw new Exception($"Request to {url} failed.");
    }
}

Adding new web scraper

Let’s add the web scraper for Microsoft. First I will create a new class called MicrosoftScraper. I will also add a method Scrape(), which starts with the same code as in KensLearningCurverScraper, but with a different URL:

public class MicrosoftScraper : ScraperBase
{
    public async Task<List<Article>> Scrape()
    {
        string url = "https://devblogs.microsoft.com/dotnet/category/csharp/";
        string html = await GetHTMLAsync(url);

        HtmlDocument htmlDoc = new();
        htmlDoc.LoadHtml(html);

        return null;
    }
}

From this point on I need to get the correct data from the web page HTML. That means I need to find the list of articles in the HTML and the class names of the Microsoft page. All the articles are listed inside the div with the CSS classes col-md-12 content-area, but the div also has an id: primary.

Each article is inside the HTML tag article and they have a lot of CSS classes. If I want to fill the HtmlNodeCollection I will do that with the following code:

HtmlNodeCollection items = htmlDoc.DocumentNode.SelectNodes("//div[@class='col-md-12 content-area']/article");
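
Since the div also has the id primary, selecting on the id is an alternative that does not depend on the exact list of CSS classes. The sketch below compares both queries on a made-up fragment that only mimics the structure described above. It is not the real Microsoft page.

```csharp
#nullable enable
using System;
using HtmlAgilityPack;

HtmlDocument doc = new();
doc.LoadHtml(
    "<div id='primary' class='col-md-12 content-area'>" +
    "<article>One</article><article>Two</article></div>");

// Selecting on the exact class attribute value, as in the article.
HtmlNodeCollection? byClass = doc.DocumentNode.SelectNodes(
    "//div[@class='col-md-12 content-area']/article");

// Selecting on the id, which keeps working if the classes change.
HtmlNodeCollection? byId = doc.DocumentNode.SelectNodes("//div[@id='primary']/article");

Console.WriteLine(byClass?.Count ?? 0); // 2
Console.WriteLine(byId?.Count ?? 0);    // 2
```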

Next up is getting the required information and sending it back:

using System.Text.RegularExpressions;
using HtmlAgilityPack;

public partial class MicrosoftScraper : ScraperBase
{
    public async Task<List<Article>> Scrape()
    {
        string url = "https://devblogs.microsoft.com/dotnet/category/csharp/";
        string html = await GetHTMLAsync(url);

        HtmlDocument htmlDoc = new();
        htmlDoc.LoadHtml(html);

        HtmlNodeCollection items = htmlDoc.DocumentNode.SelectNodes("//div[@class='col-md-12 content-area']/article");
        List<Article> articles = new();

        foreach (HtmlNode node in items)
        {
            string description = DescriptionRegEx()
                .Matches(node.SelectSingleNode("div/div/div[@class='entry-content col-md-8']")
                .InnerHtml)[0]
                .Groups[1]
                .Value
                .Trim();

            articles.Add(new()
            {
                Title = node.SelectSingleNode("div/div/div[@class='entry-content col-md-8']/header/h2/a").InnerHtml,
                Description = description,
                Link = new Uri(node.SelectSingleNode("div/div/div[@class='entry-content col-md-8']/header/h2/a").GetAttributeValue("href", string.Empty)),
                ThumbnailUri = new Uri(node.SelectSingleNode("div/div/div/img").GetAttributeValue("data-src", string.Empty))
            });
        }

        return articles;
    }

    [GeneratedRegex("<!-- .entry-header -->(.*)<footer class=\"entry-footer\">", RegexOptions.Multiline | RegexOptions.Singleline)]
    private static partial Regex DescriptionRegEx();
}

The description needs some explanation. The description of an article is not inside an element of its own. Well, it is, but together with a header and a footer. To extract the description I use a regular expression that captures the text between the header comment and the footer in a group, which I then read from the first match. The captured text starts and ends with line breaks, which are removed with the Trim() method.
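
To see the regular expression at work in isolation, here is a sketch that runs a similar pattern on a made-up string. It differs from the scraper in three ways that are my own choices: it uses a regular Regex instance instead of the source generator, a lazy quantifier (.*?) so the match stops at the first footer, and a check on Match.Success so a page without the expected markers gives an empty description instead of an exception.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up inner HTML that mimics the header / description / footer layout.
string innerHtml =
    "<header><h2>Title</h2></header><!-- .entry-header -->\n" +
    "   A short summary of the post.\n" +
    "<footer class=\"entry-footer\">...</footer>";

// Singleline lets '.' match line breaks, so the group can span several lines.
Regex descriptionRegex = new(
    "<!-- .entry-header -->(.*?)<footer class=\"entry-footer\">",
    RegexOptions.Singleline);

Match match = descriptionRegex.Match(innerHtml);

// Group 1 holds the text between the two markers. Trim removes the
// surrounding whitespace and line breaks.
string description = match.Success ? match.Groups[1].Value.Trim() : string.Empty;

Console.WriteLine(description); // A short summary of the post.
```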

Because extracting the description takes a bit more code than usual, I store it in a variable to keep the code readable. It is normal for some elements to need extra attention or code. My advice is to handle them above the object initializer (the articles.Add call in this case).

The Program.cs

Let’s put it all together in the Program.cs now:

KensLearningCurverScraper kensLearningCurverScraper = new();
MicrosoftScraper microsoftScraper = new();

List<Article> articles = await kensLearningCurverScraper.Scrape();
articles.AddRange(await microsoftScraper.Scrape());

foreach (Article article in articles)
{
    Console.WriteLine(article.Title);
    Console.WriteLine(article.Description);
    Console.WriteLine(article.Link);
    Console.WriteLine("");
    Console.WriteLine("--------------------------------------");
    Console.WriteLine("");
}

When you run the application now, you will see the articles from Kens Learning Curve first and then the articles from Microsoft.

Articles from Kens Learning Curve

Conclusion

I have shown you how you can create scrapers for different sources. Both sources offer the same kind of information, but with different HTML and styling. We have combined the two sources into one format that we can use in our own applications. It isn’t hard to add more sources.

The Microsoft blog has different categories. You can use the Microsoft scraper to also scrape the blogs with the category F# (https://devblogs.microsoft.com/dotnet/category/fsharp/) because the style and HTML are the same. To avoid a lot of initialization of different scrapers, I would highly suggest using the strategy pattern and letting the code decide which scraper to use for a specific source.
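
A minimal sketch of that idea could look like the code below. Everything in it is hypothetical: the IScraper interface, the CanHandle method, and the StubScraper class are names I made up, and the stub returns a fixed title instead of scraping anything, so the selection logic can be shown without network access.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

List<IScraper> scrapers = new()
{
    new StubScraper("kenslearningcurve.com"),
    new StubScraper("devblogs.microsoft.com")
};

Uri source = new("https://devblogs.microsoft.com/dotnet/category/fsharp/");

// Let the code pick the first scraper that says it can handle this source.
IScraper? scraper = scrapers.FirstOrDefault(s => s.CanHandle(source));

if (scraper is not null)
{
    foreach (string title in await scraper.Scrape(source))
        Console.WriteLine(title); // Article from devblogs.microsoft.com
}

// Hypothetical strategy interface: one implementation per website layout.
public interface IScraper
{
    bool CanHandle(Uri source);
    Task<List<string>> Scrape(Uri source);
}

// Stand-in implementation that matches on the host name and returns fixed data.
public sealed class StubScraper : IScraper
{
    private readonly string _host;

    public StubScraper(string host) => _host = host;

    public bool CanHandle(Uri source) =>
        string.Equals(source.Host, _host, StringComparison.OrdinalIgnoreCase);

    public Task<List<string>> Scrape(Uri source) =>
        Task.FromResult(new List<string> { $"Article from {_host}" });
}
```

In the real application, the two scraper classes from this article would implement the interface and return List<Article> instead of a list of titles.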
