From a technical marketer's perspective, scraping and automation libraries are extremely important to learn. Here's an introduction to two of the most widely used web scraping libraries in Node.js.
Recently I had the chance to work with Puppeteer and Cheerio, and to switch between the two, so here's a marketer's perspective on when to use each of them.
The main use case of Puppeteer is automation.
It's not always simple to scrape data. Take, for example, my Product Hunt scraper. In Product Hunt, upvoter information is not readily available in the page's HTML when you first load it. Before you can access the full upvoter list, you have to:
1. Click the upvoters panel.
2. Scroll all the way to the end of the list.
To do so, you need a tool that can automate actions in the browser – that's what Puppeteer is for. Use Puppeteer when you need to log in to get data, or when you need to perform automated actions in the browser.
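The two steps above can be sketched with Puppeteer. This is a minimal sketch, not the exact scraper from Hunt: the selectors (`.upvoters-panel`, `.upvoter`) are placeholders for whatever the real page uses, and the flow is written as a function of a Puppeteer-style `page` object so each step is easy to follow.

```javascript
// Sketch of the flow described above. Assumes hypothetical selectors
// ".upvoters-panel" and ".upvoter" — the real ones may differ.
async function collectUpvoters(page) {
  // 1. Click the upvoters panel to open the list.
  await page.click(".upvoters-panel");

  // 2. Scroll to the bottom repeatedly until no new upvoters load.
  let previousCount = 0;
  for (;;) {
    const count = await page.evaluate(
      () => document.querySelectorAll(".upvoter").length
    );
    if (count === previousCount) break; // nothing new loaded; we're done
    previousCount = count;
    await page.evaluate(() =>
      window.scrollTo(0, document.body.scrollHeight)
    );
  }

  // 3. Extract the upvoter names from the fully loaded list.
  return page.evaluate(() =>
    [...document.querySelectorAll(".upvoter")].map(el => el.textContent)
  );
}

// Usage (assumes `npm install puppeteer` and a real post URL):
// const puppeteer = require("puppeteer");
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// await page.goto("https://www.producthunt.com/posts/SOME-POST");
// const upvoters = await collectUpvoters(page);
// await browser.close();
```

Writing the logic against the `page` object keeps the browser-specific parts (launching, navigating) separate from the scraping flow itself.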
const axios = require("axios");
const cheerio = require("cheerio");

// This function uses axios to fetch the HTML of a given URL. It's also
// possible to do the same with fetch.
const getHtml = async url => {
  const response = await axios.get(url);
  return response.data;
};

// This Node module uses the function above and Cheerio to extract Twitter
// data from a list of user tags (used in the backend of Hunt).
module.exports = async function run(userList) {
  const enrichedUsers = [];
  for (const user of userList) {
    try {
      const $ = cheerio.load(await getHtml(`https://twitter.com/${user.tag}`));
      // Extract the relevant information from each Twitter page:
      // follower count, description, and Twitter URL.
      const followers = $(
        ".ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value"
      ).text();
      const description = $(".ProfileHeaderCard-bio").text();
      const url = `https://twitter.com/${user.tag}`;
      // Create a user object combining the existing info with the new
      // info we've extracted from Twitter.
      const enrichedUser = {
        tag: user.tag,
        name: user.name,
        profile: user.profile,
        twitterDescription: description,
        twitterFollowers: followers,
        pageUrl: url,
        messagedAndFollowed: false
      };
      // Push the new user object into the enrichedUsers array.
      enrichedUsers.push(enrichedUser);
    } catch (e) {
      // Not a good way to handle errors, but I didn't want to get into
      // error handling here, and it works for the sake of this tutorial:
      // just skip any user whose page fails to load or parse.
      continue;
    }
  }
  return enrichedUsers;
};
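One practical refactor worth mentioning: the object-shaping step inside the loop can be pulled out into a pure helper, which makes that part of the module testable without any network calls. This is a sketch, not part of the original module; the field names mirror the object built in the loop above.

```javascript
// Pure helper: builds the enriched user object from the existing user
// info plus the fields scraped from Twitter. No network access, so it
// is trivial to unit-test. (Sketch only — assumes the same field names
// as the module above.)
function enrichUser(user, twitterData) {
  return {
    tag: user.tag,
    name: user.name,
    profile: user.profile,
    twitterDescription: twitterData.description,
    twitterFollowers: twitterData.followers,
    pageUrl: `https://twitter.com/${user.tag}`,
    messagedAndFollowed: false
  };
}
```

Inside the loop you would then call `enrichedUsers.push(enrichUser(user, { followers, description }))`, keeping the scraping and the data shaping separate.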
While working on Hunt, I built two scrapers – one for Product Hunt and one for Twitter. I initially built both with Puppeteer, and I noticed a lot of performance issues when trying to scrape a large list of users from Twitter (including memory errors on the Heroku server) – it took Puppeteer about 10 minutes to finish scraping 1,000 upvoters. I then rewrote the Twitter bot in Cheerio (as described above) and saw a performance boost of around 5x: the new code took about two minutes (or less) to finish scraping.
Both tools let you use Node.js for automation and scraping in ways that marketers usually attribute to Python. They're another example of how learning JavaScript might be a pain in the ass, but can eventually give you a more profound and holistic knowledge of web development.
As a marketer, you can probably think of many ways to use both, and I recommend you go for it. If you're learning something new, you might as well create something useful!