In this post, we will learn web scraping Google with Node JS using some of the in-demand web scraping and web parsing libraries present in Node JS.
This article will be helpful to beginners who want to make their career in web scraping, data extraction, or mining in Javascript. Web Scraping can open many opportunities to developers around the world. As we say, “Data is the new oil.”
So, let us end the introduction here and get started with our long tutorial on scraping Google with Node JS.
Before we start with the tutorial, let me explain HTTP headers and their importance in scraping.
Headers are an important part of an HTTP request or response that provides some additional meta-information about the request or response. Headers are case-insensitive, and the header name and its value are usually separated by a single colon in a text string format.
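To make the case-insensitivity concrete, here is a minimal sketch in plain Node JS (the helper name `getHeader` is just for illustration) of a lookup that ignores the casing of header names:

```javascript
// Header names are case-insensitive, so normalize to lowercase before lookup.
function getHeader(headers, name) {
  const target = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === target) return value;
  }
  return undefined;
}

const headers = { "Content-Type": "text/html", "User-Agent": "Mozilla/5.0" };
console.log(getHeader(headers, "content-type")); // "text/html"
console.log(getHeader(headers, "USER-AGENT"));   // "Mozilla/5.0"
```

HTTP libraries generally do this normalization for you, but it matters when you build header objects by hand, as we do throughout this tutorial.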
Headers play an important role in web scraping. When website owners know that their data can be extracted in many different ways, they implement tools and strategies to protect their websites from being scraped by bots.
Scrapers with non-optimized headers fail to scrape these types of websites. But when you pass the correct headers, your bot not only mimics a real user but can also successfully scrape quality data from the website. Thus, scrapers with optimized headers can save your IPs from being blocked by these websites.
Headers can be classified into four different categories:
Request headers are the headers sent by the client while fetching data from the server. They use the same key-value text-string format as other headers. The information in request headers helps identify the sender of the request.
The below example shows some of the request headers:
authority: accounts.google.com
method: GET
accept-language: en-US
origin: https://www.geeksforgeeks.org
referer: https://www.geeksforgeeks.org/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36
As the example shows, an HTTP request header can carry information such as the authority, the request method, the accepted language, the origin, the referer, and the user agent.
The headers sent back by the server after successfully receiving the request headers from the user are known as Response Headers. It contains information like the date and time and the type of file sent back by the server. It also consists of information about the server that generated the response.
The below example shows some of the response headers:
content-length: 27098
content-type: text/html
date: Fri, 16 Sep 2022 19:49:16 GMT
server: nginx
cache-control: max-age=21584
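Raw response header lines like the ones above can be parsed into a plain object for easy lookup. A minimal sketch in plain Node JS; the helper name `parseRawHeaders` is illustrative:

```javascript
// Parse raw "name: value" header lines into an object,
// lowercasing names since header names are case-insensitive.
function parseRawHeaders(text) {
  const headers = {};
  for (const line of text.trim().split("\n")) {
    const index = line.indexOf(":");
    if (index === -1) continue;
    const name = line.slice(0, index).trim().toLowerCase();
    const value = line.slice(index + 1).trim();
    headers[name] = value;
  }
  return headers;
}

const parsed = parseRawHeaders("content-length: 27098\ncontent-type: text/html");
console.log(parsed["content-type"]); // "text/html"
```

Most HTTP clients expose headers already parsed, but this is handy when inspecting raw responses while debugging a scraper.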
Content-Encoding lists the encodings applied to the representation. Content-Length is the size of the resource received by the user, in bytes. The Content-Type header indicates the media type of the resource.
The representation headers describe the type of resource sent in an HTTP message body. The data transferred can be in any format, such as JSON, XML, or HTML. These headers, such as Content-Type, Content-Encoding, Content-Language, and Content-Location, tell the client about the format of the data they received.
Payload headers are easier to understand once you know what a payload is, so let us define that first.
What is Payload?
When the data is transferred from a server to a recipient, the message content or the data expected by the server that the recipient will receive is known as payload. The Payload Header is the HTTP header that consists of the payload information about the original resource representation. They consist of information about the content length and range of the message, any encoding present in the transfer of the message, etc.
The below example shows some of the payload headers:
content-length: 27098
content-range: bytes 200-1000/67589
trailer: Expires
transfer-encoding: chunked
The Content-Range header indicates where a partial message belongs in the full message body. The Transfer-Encoding header specifies the type of encoding used to safely transfer the payload body to the user.
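To make the Content-Range format concrete, here is a small sketch (the helper name `parseContentRange` is mine) that splits a value like `bytes 200-1000/67589` into its parts:

```javascript
// Parse a Content-Range value such as "bytes 200-1000/67589"
// into { unit, start, end, total }. A total of "*" means unknown.
function parseContentRange(value) {
  const match = /^(\w+) (\d+)-(\d+)\/(\d+|\*)$/.exec(value);
  if (!match) return null;
  return {
    unit: match[1],
    start: Number(match[2]),
    end: Number(match[3]),
    total: match[4] === "*" ? null : Number(match[4]),
  };
}

console.log(parseContentRange("bytes 200-1000/67589"));
// → { unit: 'bytes', start: 200, end: 1000, total: 67589 }
```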
The User-Agent header is used to identify the application, operating system, vendor, and version of the requesting user agent. It helps us mimic a real user, thus saving our IP from being blocked by Google. It is one of the main headers we use while scraping Google Search Results.
It looks like this:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
In this section, we will learn about the Node JS libraries that will help us scrape Google Search Results. We will discuss the need for each library and the disadvantages associated with it.
Unirest is a lightweight HTTP library available in many languages, including Java, Python, PHP, and .Net. Kong currently maintains Unirest JS, and it ranks among the most popular Javascript web scraping libraries. It helps us make all types of HTTP requests to scrape the precious data on the requested page.
Let us take an example of how we can scrape Google using this library. First, install Unirest JS by running the below command in your terminal:
npm i unirest
Then we will make a request on our target URL:
const unirest = require("unirest")
function getData() {
const url = "https://www.google.com/search?q=javascript&gl=us&hl=en"
let header = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Viewer/96.9.4688.89"
}
return unirest.get(url).headers(header).then((response) => {
console.log(response.body);
})
}
getData();
Step-by-step explanation after the header declaration:
- get() is used to make a GET request to our target URL.
- headers() is used to pass HTTP request headers along with the request.
This block of code will return an HTML file and will look like this:
Unreadable, right? Don’t worry. We will be discussing a web parsing library in a bit.
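Before moving on, note that the query parameters in the target URL above (q for the search term, gl for the country, and hl for the interface language) can also be assembled with Node's built-in URL class instead of string concatenation. A small sketch; the helper name `buildSearchUrl` is mine:

```javascript
// Build a Google search URL from its query parameters
// using the built-in URL and URLSearchParams classes.
function buildSearchUrl(query, country = "us", language = "en") {
  const url = new URL("https://www.google.com/search");
  url.searchParams.set("q", query);   // search term
  url.searchParams.set("gl", country);  // country of the results
  url.searchParams.set("hl", language); // interface language
  return url.toString();
}

console.log(buildSearchUrl("javascript"));
// https://www.google.com/search?q=javascript&gl=us&hl=en
```

This also takes care of URL-encoding multi-word queries for you.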
As we know, Google can block our request if we send the same User-Agent with each request. So, if you want to rotate User-Agents on each request, let us define a function that returns a random User-Agent string from a User-Agent array.
const selectRandom = () => {
const userAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
];
const randomNumber = Math.floor(Math.random() * userAgents.length);
return userAgents[randomNumber];
}
let user_agent = selectRandom();
let header = {
"User-Agent": `${user_agent}`
}
This logic will ensure we don’t have to use the same User-Agents each time.
Axios is a promise-based HTTP client for Node JS and browsers and one of the most popular and powerful Javascript libraries. It can make XMLHttpRequests from the browser and HTTP requests from Node JS. It also has client-side support for protection against CSRF.
Let us take an example of how we can use Axios for web scraping. First, install Axios by running the below command in your terminal:
npm i axios
The below block of code will return the same HTML file we saw in the Unirest section.
const axios = require('axios');
let headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Viewer/96.9.4688.89"
}
axios.get('https://www.google.com/search?q=javascript&gl=us&hl=en', { headers }).then((response) => {
    console.log(response.data);
}).catch((e) => {
    console.log(e);
});
Cheerio is a web parsing library that can parse any HTML and XML document. It implements a subset of jQuery, which is why its syntax is quite similar to jQuery's.
Manipulating and rendering the markup can be done very fast with the help of Cheerio. It doesn’t produce a visual rendering, apply CSS, load external resources, or execute Javascript.
Let us take a small example of how we can use Cheerio to parse the Google ads search results. You can install Cheerio by running the below command in your terminal.
npm i cheerio
Now, we will prepare our parser by finding the CSS selectors for the target elements, which you can do with a browser extension such as SelectorGadget.
Let us first scrape the HTML with the help of unirest and make a cheerio instance for parsing the HTML.
const cheerio = require("cheerio");
const unirest = require("unirest");
const getData = async () => {
try {
const url = "https://www.google.com/search?q=life+insurance";
const response = await unirest.get(url).headers({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
});
const $ = cheerio.load(response.body);
In the last line, we just created a constant and loaded the scraped HTML into it. If you look at the bottom right of the page, the ad results are under the tag .uEierd. We will scrape the ad's title, snippet, link, displayed link, and sitelinks.
Look at the bottom of the image for the tag of the title. Similarly, for the snippet:
Let us find the tag for the displayed link:
And if you inspect the title, you will find the tag for the link to be a.sVXRqc. After finding all the tags, our code will look like this:
let ads = [];
$("#tads .uEierd").each((i, el) => {
ads[i] = {
title: $(el).find(".v0nnCb span").text(),
snippet: $(el).find(".lyLwlc").text(),
displayed_link: $(el).find(".qzEoUe").text(),
link: $(el).find("a.sVXRqc").attr("href"),
}
})
Now, similarly, if we follow the above process to find the tags for the sitelink titles, snippets, and links, our code will look like this:
let sitelinks = [];
if ($(el).find(".UBEOKe").length) {
$(el).find(".MhgNwc").each((i, el) => {
sitelinks.push({
title: $(el).find("h3").text(),
link: $(el).find("a").attr("href"),
snippet: $(el).find(".lyLwlc").text()
})
})
ads[i].sitelinks = sitelinks
}
And our results:
Complete Code:
const cheerio = require("cheerio");
const unirest = require("unirest");
const getData = async () => {
try {
const url = "https://www.google.com/search?q=life+insurance";
const response = await unirest.get(url).headers({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
});
const $ = cheerio.load(response.body);
let ads = [];
$("#tads .uEierd").each((i, el) => {
let sitelinks = [];
ads[i] = {
title: $(el).find(".v0nnCb span").text(),
snippet: $(el).find(".lyLwlc").text(),
displayed_link: $(el).find(".qzEoUe").text(),
link: $(el).find("a.sVXRqc").attr("href"),
}
if ($(el).find(".UBEOKe").length) {
$(el).find(".MhgNwc").each((i, el) => {
sitelinks.push({
title: $(el).find("h3").text(),
link: $(el).find("a").attr("href"),
snippet: $(el).find(".lyLwlc").text()
})
})
ads[i].sitelinks = sitelinks
}
})
console.log(ads)
} catch (e) {
console.log(e);
}
}
getData();
You can see how easy it is to use Cheerio JS for parsing HTML. Similarly, we can use Cheerio with other web scraping libraries like Axios, Puppeteer, Playwright, etc.
If you want to learn more about scraping websites with Cheerio, you can consider my blogs where I have used Cheerio as a web parser:
Gone are the days when websites were built with only HTML and CSS. Nowadays, interaction on modern websites can be handled entirely by Javascript. In particular, SPAs (single-page applications), built on frameworks like React, Next, and Angular, rely heavily on Javascript for rendering dynamic content.
But when doing web scraping, the content we require is sometimes rendered by Javascript, which is not accessible from the HTML response we get from the server.
And that’s where the headless browser comes into play. Let’s discuss some of the Javascript libraries which use headless browsers for web automation and scraping.
Puppeteer is a Google-designed Node JS library that provides a high-level API enabling you to control Chrome or Chromium browsers.
Here are some features associated with Puppeteer JS:
It can be used to crawl single-page applications and can generate pre-rendered content, i.e., server-side rendering.
It works in the background and performs actions as directed by the API.
It can generate screenshots of web pages.
It can generate PDFs of web pages.
Let us take an example of how we can scrape Google Books Results using Puppeteer JS. We will scrape the book title, image, description, and writer.
First, install puppeteer by running the below command in your project terminal:
npm i puppeteer
Now, let us create a web crawler by launching Puppeteer in non-headless mode.
const puppeteer = require("puppeteer");

const url = "https://www.google.com/search?q=merchant+of+venice&gl=us&tbm=bks";
const browser = await puppeteer.launch({
headless: false,
args: ["--disable-setuid-sandbox", "--no-sandbox"],
});
const page = await browser.newPage();
await page.setExtraHTTPHeaders({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88",
});
await page.goto(url, {
waitUntil: "domcontentloaded"
});
What each line of code does:
- puppeteer.launch() launches the Chrome browser in non-headless mode.
- browser.newPage() opens a new tab in the browser.
- page.setExtraHTTPHeaders() allows us to set headers for requests to our target URL.
- page.goto() navigates to our target URL page.
Now, let us find the CSS selector for the book title.
As you can see at the bottom of the page, we get the CSS selector of our title. We will paste it into our code:
let books_results = [];
books_results = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
return {
title: el.querySelector(".DKV0Md")?.textContent
}
})
});
Here I have used the page.evaluate() function to evaluate the page's context and return the result. Then I selected the parent handler of the title, which is also the parent handler of the other things we want to scrape (image, writer, description, etc., as stated above), using the document.querySelectorAll() method. And finally, we selected the title from the elements present in the parent handler container with the help of querySelector(). The textContent property allows us to grab the text inside the selected element.
We will select the other elements just in the same way as we selected the title. Now, let us find the tag for the writer.
books_results = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
return {
title: el.querySelector(".DKV0Md")?.textContent,
writers: el.querySelector(".N96wpd")?.textContent,
}
})
});
Let us find the tag for our description as well.
let books_results = [];
books_results = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
return {
title: el.querySelector(".DKV0Md")?.textContent,
writers: el.querySelector(".N96wpd")?.textContent,
description: el.querySelector(".cmlJmd")?.textContent,
}
})
});
And finally for the image:
let books_results = [];
books_results = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
return {
title: el.querySelector(".DKV0Md")?.textContent,
writers: el.querySelector(".N96wpd")?.textContent,
description: el.querySelector(".cmlJmd")?.textContent,
thumbnail: el.querySelector("img").getAttribute("src"),
}
})
});
console.log(books_results);
await browser.close();
We don't need to find a tag for the image, as it is the only image in the container, so we just used the img element for reference. Don't forget to close the browser. Now, let us run our program to check the results.
The long URL you see as a thumbnail value is nothing but a base64 image URL. So, we got the results we wanted.
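As a side note, such a base64 thumbnail is a data URL of the form data:&lt;mime type&gt;;base64,&lt;encoded bytes&gt;, and the encoded part can be decoded with Node's built-in Buffer. A small sketch using a made-up text payload instead of real image bytes:

```javascript
// A data URL embeds the resource bytes directly, base64-encoded.
// Here we use a tiny made-up text payload instead of real image bytes.
const payload = Buffer.from("hello").toString("base64"); // "aGVsbG8="
const dataUrl = `data:text/plain;base64,${payload}`;

// Split off the "data:<mime>;base64," prefix and decode the rest.
const encoded = dataUrl.split(",")[1];
const decoded = Buffer.from(encoded, "base64").toString("utf8");
console.log(decoded); // "hello"
```

For a real thumbnail you would write the decoded Buffer to an image file instead of converting it to a string.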
Complete Code:
const puppeteer = require("puppeteer");
const getBooksData = async () => {
const url = "https://www.google.com/search?q=merchant+of+venice&gl=us&tbm=bks";
const browser = await puppeteer.launch({
headless: true,
args: ["--disable-setuid-sandbox", "--no-sandbox"],
});
const page = await browser.newPage();
await page.setExtraHTTPHeaders({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88",
});
await page.goto(url, {
waitUntil: "domcontentloaded"
});
let books_results = [];
books_results = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
return {
title: el.querySelector(".DKV0Md")?.textContent,
writers: el.querySelector(".N96wpd")?.textContent,
description: el.querySelector(".cmlJmd")?.textContent,
thumbnail: el.querySelector("img").getAttribute("src"),
}
})
});
console.log(books_results);
await browser.close();
};
getBooksData();
So, we now have a basic understanding of Puppeteer JS. Let us now look at Playwright JS.
Playwright JS is a test automation framework used by developers around the world to automate web browsers. It was developed by the same team that previously worked on Puppeteer JS. You will find the syntax of Playwright JS similar to that of Puppeteer JS, and the API methods in both cases are nearly identical, but the two libraries have some differences. Let's discuss them:
Playwright supports multiple languages like C#, .NET, and Javascript, while Puppeteer supports only Javascript.
Playwright JS is still a newer library with limited community support, unlike Puppeteer JS, which has good community support.
Playwright supports browsers like Chromium, Firefox, and Webkit, while Puppeteer's main focus is Chrome and Chromium, with limited support for Firefox.
Let us take an example of how we can use Playwright JS to scrape Top Stories from Google Search Results. First, install playwright by running the below command in your terminal:
npm i playwright
Now, let's create our scraper by launching the chromium browser at our target URL.
const playwright = require("playwright");

const browser = await playwright['chromium'].launch({
headless: false,
args: ['--no-sandbox']
});
const context = await browser.newContext();
const page = await context.newPage();
await page.goto("https://www.google.com/search?q=india&gl=us&hl=en");
Step-by-step explanation:
- playwright['chromium'].launch() launches the Chromium browser.
- browser.newContext() creates a new browser context.
- context.newPage() opens a new page in that context.
- page.goto() navigates to our target URL.
Now, let us search for the tags for these single stories.
As you can see, every single story comes under the .WlydOe tag. The page.$$(".WlydOe") method will find all elements matching the specified selector within the page and return an array containing all these elements.
Look for tags of the title, date, and thumbnail, with the same approach as we have done in the Puppeteer section. After finding the tags push the data in our top_stories array and close the browser.
const single_stories = await page.$$(".WlydOe");
let top_stories = [];
for (let single_story of single_stories) {
top_stories.push({
title: await single_story.$eval(".mCBkyc", el => el.textContent.replace('\n', '')),
link: await single_story.getAttribute("href"),
date: await single_story.$eval(".eGGgIf", el => el.textContent),
thumbnail: await single_story.$eval("img", el => el.getAttribute("src"))
})
}
console.log(top_stories)
await browser.close();
The $eval method will find the specified element inside the parent element we declared above in the single_stories array. textContent returns the text inside the specified element, and getAttribute returns the value of the specified element's attribute. Our result will look like this:
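One caveat about the text cleanup above: calling replace with a string pattern, as in replace('\n', ''), removes only the first newline in the text. A global regex (or replaceAll) removes them all, as this quick sketch shows:

```javascript
// replace with a string pattern only removes the first match.
const raw = "Top\nStory\nTitle";
console.log(raw.replace("\n", ""));      // "TopStory\nTitle"

// A global regex removes every newline; here we swap them for spaces.
console.log(raw.replace(/\n/g, " "));    // "Top Story Title"
```

If a scraped title keeps stray line breaks, switching to the global form is the fix.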
Here is the complete code:
const playwright = require("playwright");
const getTopStories = async () => {
try {
const browser = await playwright['chromium'].launch({
headless: false,
args: ['--no-sandbox']
});
const context = await browser.newContext();
const page = await context.newPage();
await page.goto("https://www.google.com/search?q=football&gl=us&hl=en");
const single_stories = await page.$$(".WlydOe");
let top_stories = [];
for (let single_story of single_stories) {
top_stories.push({
title: await single_story.$eval(".mCBkyc", el => el.textContent.replace('\n', '')),
link: await single_story.getAttribute("href"),
date: await single_story.$eval(".eGGgIf", el => el.textContent),
thumbnail: await single_story.$eval("img", el => el.getAttribute("src"))
})
}
console.log(top_stories)
await browser.close();
} catch (e) {
console.log(e);
}
};
getTopStories();
The above sections taught us to scrape and parse Google Search Results with various Javascript libraries. We also saw how we can combine Unirest with Cheerio, or Axios with Cheerio, to extract data from Google.
In this section, we will discuss some of the alternatives to the above-discussed libraries.
Nightmare JS is a web automation library designed for websites that don't offer APIs, letting you automate browsing tasks. Developers mostly use Nightmare JS for UI testing and crawling. It can also mimic user actions (like goto, type, and click) with an API that feels synchronous for each block of scripting. Let us take an example of how we can use Nightmare JS to scrape Twitter results from Google Search.
Install the Nightmare JS by running this command:
npm i nightmare
As you can see in the above image, each Twitter result is under the tag .dHOsHb. So, this makes our code look like this:
const Nightmare = require("nightmare");
const nightmare = Nightmare();

nightmare.goto("https://www.google.com/search?q=cristiano+ronaldo&gl=us").wait(".dHOsHb").evaluate(() => {
let twitter_results = [];
const results = document.querySelectorAll(".dHOsHb");
results.forEach((result) => {
let row = {
"tweet": result.innerText,
}
twitter_results.push(row)
})
return twitter_results;
}).end().then((result) => {
result.forEach((r) => {
console.log(r.tweet);
})
}).catch((error) => {
console.log(error)
})
Step-by-step explanation:
- We used goto() to navigate to our target URL.
- We used wait() to wait for the selected tag of the Twitter result. You can also pass a time value as a parameter to wait for a specific period.
- Then we used evaluate(), which invokes functions on the page; in our case, it is querySelectorAll().
- We used the forEach() function to iterate over the results array and fill each element with its text content.
- Finally, we used end() to stop the crawler and return the scraped values.
Here are our results:
Node Fetch is a lightweight module that brings the Fetch API to Node JS, enabling the use of fetch() functionality in Node JS.
One of its features is that it stays consistent with the window.fetch API. To use Node Fetch, run this command in your project terminal:
npm i node-fetch@2
Let us take a simple example to request our target URL:
const fetch = require("node-fetch");
const getData = async () => {
const response = await fetch("https://google.com/search?q=web+scraping&gl=us", {
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88"
    }
});
const body = await response.text();
console.log(body);
}
getData();
There are also dedicated HTML/XML parsers for Node JS that can serve as lightweight alternatives to Cheerio for parsing scraped pages.
In this tutorial, we discussed eight Javascript libraries that can be used for scraping Google Search Results, and we walked through examples of scraping search results with them. Each of these libraries has unique features and advantages; some are new, and some have been updated and adapted to developer needs. Thus, you now know which library to choose for your circumstances.
If you have any questions about the tutorial, please feel free to reach out. If you think I have not covered some topics, please feel free to let me know.
1. Which Javascript library is best for web scraping?
When selecting a library for web scraping, choose one that is easy to use, has good community support, and can handle large amounts of data.
2. From where should I start learning scraping Google?
This tutorial is designed to give beginners a basic understanding of scraping Google.
3. Is web scraping Google hard?
Web scraping Google is pretty easy! Even a developer with decent knowledge can kickstart a career in web scraping with the right guidance.
4. Is web scraping legal?
Scraping publicly available data is generally considered legal, but you should always review a website's terms of service and the applicable laws before scraping.