In English, the word "scraping" has more than one definition, but they all revolve around the same idea. One dictionary defines it as "to remove (an outer layer, adhering matter, etc.) in this way: to scrape the paint and varnish from a table." Another defines it as "the act of removing the surface from something using a sharp edge or something rough."
What we are really interested in, however, is what Web Scraping means in software.
In software, Web Scraping is the process of extracting information from a web resource through its user interface rather than through its legitimate APIs. In other words, instead of calling a website's REST API to get the data, you retrieve the page the same way a browser does, parse the HTML, and extract the data rendered into it.
For static websites, where the data is already rendered into the HTML served by the server, you can follow exactly the steps described above.
For dynamic websites, however, the data is not in the initial HTML; it is loaded later by JavaScript libraries and frameworks (like Angular, React, Vue, ...), so you need a different approach, as the sketch below illustrates.
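To make the difference concrete, here is a minimal sketch (the URL and the marker string are placeholders, not taken from the examples in this article). A plain HTTP GET returns only what the server sends; no JavaScript runs on our side, so anything a framework renders in the browser will be missing from the downloaded string:
using System;
using System.Net.Http;
using System.Threading.Tasks;
class StaticVsDynamicCheck
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // Download the HTML exactly as the server returns it. No scripts are executed.
        var html = await client.GetStringAsync("https://example.com");
        // On a static site the data is already here; on a dynamic site a
        // JavaScript-rendered marker (hypothetical string below) will be absent.
        Console.WriteLine(html.Contains("rendered-by-javascript")
            ? "Found in the initial HTML: the page is statically rendered."
            : "Not in the initial HTML: it is probably rendered by JavaScript.");
    }
}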
First, let's start by trying to scrape some data from a static website. In this example, we are going to scrape my own GitHub profile.
We will try to get the list of pinned repositories on the profile. Each entry will consist of the repository's name and its description.
At the time of writing this article, this is how my GitHub profile looks:
(Screenshot: the pinned repositories section of my GitHub profile.)
Here is what I did:
1. Installed the NuGet package HtmlAgilityPack and added the using HtmlAgilityPack; directive.
2. Defined the method private static Task<string> GetHtml() to get the HTML.
3. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;
namespace WebScraper
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var html = await GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }
        // Downloads the raw HTML of the profile page, exactly as the server returns it.
        private static async Task<string> GetHtml()
        {
            using var client = new HttpClient();
            return await client.GetStringAsync("https://github.com/AhmedTarekHasan");
        }
        // Parses the HTML and extracts the pinned repositories as (name, description) pairs.
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);
            // Select the container node of each pinned repository entry.
            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");
            List<(string RepositoryName, string Description)> data = new();
            foreach (var repo in repositories)
            {
                // Relative XPath inside each entry: the anchor holds the name,
                // the <p> element holds the description.
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }
            return data;
        }
    }
}
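If you want to actually see the scraped data, a minimal variation of the Main method above could print it (this assumes an extra using System; directive for Console):
static async Task Main(string[] args)
{
    var html = await GetHtml();
    var data = ParseHtmlUsingHtmlAgilityPack(html);
    // Print each pinned repository as "Name: Description".
    foreach (var (repositoryName, description) in data)
    {
        Console.WriteLine($"{repositoryName}: {description}");
    }
}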
As you can see, using HttpClient and HtmlAgilityPack is easy. All you need is to get familiar with their APIs, and then it becomes a straightforward job.
However, this only works when the data is already in the HTML returned by the server. On a dynamic website, the initial HTML arrives nearly empty and the data is filled in afterwards, sometimes through additional API calls (which you could still handle with HttpClient or any other library that performs HTTP calls), and sometimes only through JavaScript running inside a browser. For that last case you need a tool that actually renders the page, which is where Selenium comes in.
Therefore, once again, in this example we are going to scrape my own GitHub profile.
Here is what I did:
1. Installed the NuGet packages HtmlAgilityPack, Selenium.WebDriver, and Selenium.WebDriver.ChromeDriver, and added the using HtmlAgilityPack; and using OpenQA.Selenium.Chrome; directives.
2. Defined the method private static string GetHtml() to get the HTML.
3. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;
namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            var html = GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }
        // Uses a headless Chrome instance so that any JavaScript on the page
        // runs before we read the rendered HTML.
        private static string GetHtml()
        {
            var options = new ChromeOptions
            {
                // Adjust this path to wherever Chrome is installed on your machine.
                BinaryLocation = @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            };
            // Run Chrome without a visible window.
            options.AddArguments("headless");
            var chrome = new ChromeDriver(options);
            try
            {
                chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");
                // PageSource returns the DOM after the browser has processed the page,
                // not just the raw server response.
                return chrome.PageSource;
            }
            finally
            {
                // Always shut the browser down, even if navigation fails.
                chrome.Quit();
            }
        }
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);
            // Same parsing logic as in the static example.
            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");
            List<(string RepositoryName, string Description)> data = new();
            foreach (var repo in repositories)
            {
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }
            return data;
        }
    }
}
Again, as you can see, using Selenium.WebDriver and Selenium.WebDriver.ChromeDriver is just as easy.
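One caveat worth mentioning: the example above reads PageSource immediately after navigating, which works here because GitHub renders the pinned repositories on the server. For a page that really builds its content with JavaScript, you would typically wait until the element you need exists before grabbing the HTML. Here is a minimal sketch using Selenium's WebDriverWait (from the Selenium.Support NuGet package); GetHtmlWithWait is a hypothetical variant of the GetHtml method above, with the ChromeOptions setup omitted for brevity:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
private static string GetHtmlWithWait()
{
    var chrome = new ChromeDriver();
    try
    {
        chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");
        // Poll for up to 10 seconds until the pinned-items container exists in the DOM,
        // i.e. until the JavaScript that renders it has finished running.
        var wait = new WebDriverWait(chrome, TimeSpan.FromSeconds(10));
        wait.Until(driver =>
            driver.FindElements(By.XPath("//div[@class='js-pinned-items-reorder-container']")).Count > 0);
        return chrome.PageSource;
    }
    finally
    {
        chrome.Quit();
    }
}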