Setting up this project was extremely easy. All you have to do is install Puppeteer with the command npm install puppeteer. There was one confusing issue I ran into during setup, however: the puppeteer package was not unzipped correctly when I initially installed it. I found this out while running the initial example in the article. If you get an error that says Failed to launch browser process or something similar, follow these steps:
1. Unzip chrome-win from node_modules/puppeteer/.local-chromium/.
2. Move the chrome-win folder into the win64 folder in that same .local-chromium folder.
3. Make sure chrome.exe is in this path: node_modules/puppeteer/.local-chromium/win64-818858/chrome-win/chrome.exe
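If you want to double-check where Puppeteer expects the Chromium binary to be on your machine, a quick sanity check like this (just a sketch, assuming Puppeteer is installed locally) prints the path it will try to launch:
const puppeteer = require('puppeteer');
// Prints the path to the bundled Chromium that puppeteer.launch() will use.
console.log(puppeteer.executablePath());
If that path doesn't match the one above, the launch error is usually pointing at the missing executable.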
First example
The first example didn't work for me. To fix the problem, I assigned the async function to a variable and then invoked that variable after the function definition. I'm not sure this is the best way to handle the issue but hey, it works. Here is the code:
const puppeteer = require('puppeteer');
const takeScreenShot = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.stem-effect.com/');
await page.screenshot({path: 'output.png'});
await browser.close();
};
takeScreenShot();
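Another common way to run a one-off async function like this is an immediately invoked function expression (IIFE); this sketch does the same thing without the extra variable:
const puppeteer = require('puppeteer');

// Same screenshot logic, wrapped in an async IIFE instead of a named constant.
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.stem-effect.com/');
  await page.screenshot({ path: 'output.png' });
  await browser.close();
})();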
Wikipedia scraper
I also had an issue with the Wikipedia scraper code. For some reason, I was getting null values for the country names, which screwed up all of the data in the JSON file I was creating. Also, the scraper was 'scraping' every table on the Wikipedia page. I didn't want that. I only wanted the first table, with the total number of cases and deaths caused by COVID-19. Here is the modified code I used:
const puppeteer = require('puppeteer');
const fs = require('fs')
const scrape = async () =>{
const browser = await puppeteer.launch({headless : false}); //browser initiate
const page = await browser.newPage(); // opening a new blank page
await page.goto('https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory', {waitUntil : 'domcontentloaded'}) // navigate to the URL and wait until the DOM has loaded
// Selected table by aria-label instead of div id
const recordList = await page.$$eval('[aria-label="COVID-19 pandemic by country and territory table"] table#thetable tbody tr',(trows)=>{
let rowList = []
trows.forEach(row => {
let record = {'country' : '','cases' :'', 'death' : '', 'recovered':''}
record.country = row.querySelector('a').innerText; // (tr < th < a) anchor tag text contains country name
const tdList = Array.from(row.querySelectorAll('td'), column => column.innerText); // getting the text value of each column of the row and adding them to a list
record.cases = tdList[0];
record.death = tdList[1];
record.recovered = tdList[2];
if(tdList.length >= 3){
rowList.push(record)
}
});
return rowList;
})
console.log(recordList)
// Commented out screen shot here
// await page.screenshot({ path: 'screenshots/wikipedia.png' }); //screenshot
await browser.close();
// Store output
fs.writeFile('covid-19.json',JSON.stringify(recordList, null, 2),(err)=>{
if(err){console.log(err)}
else{console.log('Saved Successfully!')}
})
};
scrape();
First, instead of identifying the table I wanted to use by the div#covid19-container, I pinpointed the table with the aria-label. This was a little more precise. Originally, the reason the code was scraping all of the tables on the page was that the IDs were the same (I know, not a good practice. That's what classes are for, right?). Identifying the table via aria-label helped ensure that I only scraped the exact table I wanted, at least in this scenario.
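If you want to sanity-check a selector like this before wiring it into $$eval, you can paste something like the following into the DevTools console on the Wikipedia page (just a quick check, and it assumes the table's aria-label hasn't changed since then):
// Run in the DevTools console: counts the rows the scraper's selector matches.
document.querySelectorAll(
  '[aria-label="COVID-19 pandemic by country and territory table"] table#thetable tbody tr'
).length;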
Second, I commented out the screenshot command. It broke the code for some reason, and I didn't see the need for it since we were just trying to create a JSON object from the table data.
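If you do want the screenshot back, one thing worth checking (my assumption about why it failed, not something I verified against the original project) is that the screenshots folder exists before page.screenshot tries to write into it, since Puppeteer generally won't create missing directories for you. A minimal sketch:
const fs = require('fs');
// Make sure the target folder exists before asking Puppeteer to write into it.
fs.mkdirSync('screenshots', { recursive: true });
// (inside the async scrape function, before browser.close())
await page.screenshot({ path: 'screenshots/wikipedia.png' });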
Lastly, after I obtained the data from the correct table, I wanted to actually use it in a chart. I created an HTML file and displayed the data using Google Charts. You can see the full project if you are curious. Fair warning, I got down and dirty (very hacky) putting this part together, but at the end of the day, I just wanted an easier way to consume the data that I had just mined. There could be a whole separate article on the amount of refactoring that can be done on my HTML page.
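For the display side, a rough sketch of that approach looks like this (the element id, the chart type, and the assumption that covid-19.json is served next to the page are all mine, not the exact code from the project):
<script src="https://www.gstatic.com/charts/loader.js"></script>
<div id="chart"></div>
<script>
  google.charts.load('current', { packages: ['corechart'] });
  google.charts.setOnLoadCallback(() => {
    // Fetch the JSON file the scraper wrote and turn it into a DataTable.
    fetch('covid-19.json')
      .then(res => res.json())
      .then(records => {
        const data = new google.visualization.DataTable();
        data.addColumn('string', 'Country');
        data.addColumn('number', 'Cases');
        records.forEach(r => {
          // The scraped values are strings like "1,234", so strip commas before parsing.
          data.addRow([r.country, parseInt(String(r.cases).replace(/,/g, ''), 10) || 0]);
        });
        const chart = new google.visualization.BarChart(document.getElementById('chart'));
        chart.draw(data, { title: 'COVID-19 cases by country' });
      });
  });
</script>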