Note: Other websites use JavaScript to inject the table's content into the page, making it harder for us to access the data. That said, we'll be dealing with those kinds of tables in a later entry.
At first glance, every table is made up of two major elements: columns and rows. However, the underlying HTML structure is a little more complex than that.
<table>
  <tr>
    <th>Month</th>
    <th>Savings</th>
  </tr>
  <tr>
    <td>January</td>
    <td>$100</td>
  </tr>
  <tr>
    <td>February</td>
    <td>$80</td>
  </tr>
</table>
Real-world HTML is usually a lot messier than this example, but despite the mess, it still respects the <table>, <tr>, and <td> structure we discussed above.
Do the same thing with a few more elements to be sure. The page source is the HTML file before any rendering happens, so you can see the initial state of the page. If the element is not there, the data is being injected from elsewhere and you'll need to find another solution to scrape it.

The second thing we want to test before coding our scraper is our selectors. For this, we can use the browser's console to select elements with the .querySelectorAll() method, passing in the element and class we want to scrape.
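For example, to confirm that the standings table itself is reachable, you can run a quick check in the console first, using the same gs-o-table class we'll target later in the script:

document.querySelectorAll("table.gs-o-table")

If the console returns at least one node, the selector works and we can keep drilling down.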
Finally, we’ll select all the <tr>
elements in the table: document.querySelectorAll(“table > tbody > tr”)
Awesome, 20 nodes! It matches the number of rows we want to scrape, so we now know how to select them with our scraper.
Note: Remember that when we have a node list, the count starts at 0 instead of 1.
The only thing missing is the position of each cell within the row. Spoiler alert: it goes from 2 to 10.
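If you want to double-check this yourself, a quick sketch you can paste into the console grabs the first row and prints every cell next to its index:

const firstRow = document.querySelectorAll("table > tbody > tr")[0];
firstRow.querySelectorAll("td").forEach((td, i) => console.log(i, td.textContent));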
Awesome, now we’re finally ready to go to our code editor.
To begin the project, create a new directory/folder and open it in VS Code or your preferred code editor. With Node.js installed, open your terminal and start your project with the command npm init -y.
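Then install the three packages the script depends on: Axios to request the page, Cheerio to parse it, and objects-to-csv to export the results:

npm install axios cheerio objects-to-csv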
(async function () {
  const response = await axios('https://www.bbc.com/sport/football/tables');
  console.log(response.status)
})();
const html = await response.data;
const $ = cheerio.load(html)
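Before we can loop through the rows, we need to select them. This is the same selector used in the full script further down: the table's gs-o-table class, its tbody, and every <tr> inside it.

const allRows = $("table.gs-o-table > tbody.gel-long-primer > tr");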
allRows.each((index, element) => {
  const tds = $(element).find('td');
  const team = $(tds[2]).text();
  const played = $(tds[3]).text();
  const won = $(tds[4]).text();
  const drawn = $(tds[5]).text();
  const lost = $(tds[6]).text();
  const gf = $(tds[7]).text();
  const against = $(tds[8]).text();
  const gd = $(tds[9]).text();
  const points = $(tds[10]).text();
Notice that we first store all the <td> elements in a tds variable, which we can then use to pick each cell by its index position.
allRows.each((index, element) => {
  const tds = $(element).find('td');
  const team = $(tds[2]).text();
  console.log(team)
});
1- Create an empty array outside of the main function.
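In our case that's a single line, matching the declaration at the top of the full script below:

const premierLeagueTable = [];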
2- Next, push each element scraped into the array using the .push() method and use a descriptive name to label them. You want them to match the headers from the table you’re scraping.
premierLeagueTable.push({
  'Team': team,
  'Played': played,
  'Won': won,
  'Drawn': drawn,
  'Lost': lost,
  'Goals For': gf,
  'Goals Against': against,
  'Goals Difference': gd,
  'Points': points,
})
3- Use ObjectsToCsv to create a new CSV file and save it to your machine with the .toDisk() method, passing the file path and name of the file.
const csv = new ObjectsToCsv(premierLeagueTable);
await csv.toDisk('./footballData.csv')
We added a few console.log() statements for testing purposes, but other than that, you should end up with the same code as the one below:
const axios = require("axios");
const cheerio = require("cheerio");
const ObjectsToCsv = require("objects-to-csv");

const premierLeagueTable = [];

(async function () {
  // Request the page and confirm we got a valid response
  const response = await axios('https://www.bbc.com/sport/football/tables');
  console.log('Loading tables')
  console.log(response.status)

  // Load the raw HTML into Cheerio so we can query it like the DOM
  const html = await response.data;
  const $ = cheerio.load(html);

  // Grab every row of the standings table
  const allRows = $("table.gs-o-table > tbody.gel-long-primer > tr");
  console.log('Going through rows')

  allRows.each((index, element) => {
    // Pick each cell by its position in the row
    const tds = $(element).find('td');
    const team = $(tds[2]).text();
    const played = $(tds[3]).text();
    const won = $(tds[4]).text();
    const drawn = $(tds[5]).text();
    const lost = $(tds[6]).text();
    const gf = $(tds[7]).text();
    const against = $(tds[8]).text();
    const gd = $(tds[9]).text();
    const points = $(tds[10]).text();

    premierLeagueTable.push({
      'Team': team,
      'Played': played,
      'Won': won,
      'Drawn': drawn,
      'Lost': lost,
      'Goals For': gf,
      'Goals Against': against,
      'Goals Difference': gd,
      'Points': points,
    })
  });

  // Write the scraped data to a CSV file on disk
  console.log('Saving data to CSV');
  const csv = new ObjectsToCsv(premierLeagueTable);
  await csv.toDisk('./footballData.csv')
  console.log('Saved to csv');
})();
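To run the scraper, execute the file with Node from the project folder (the file name below is just an example; use whatever you named yours):

node index.js

Once it finishes, you should find footballData.csv sitting in the same folder.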
You can use the same process to scrape virtually any HTML table you want and grow a huge football dataset for analytics, result forecasting, and more.
As you can see, all we need to do is replace the URL in the example with our target URL, and ScraperAPI will do the rest.
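As a rough sketch, and assuming ScraperAPI's standard GET endpoint (the api.scraperapi.com URL and api_key parameter below are assumptions; check your dashboard for the exact request format), the only change to the script would be the axios call:

// Assumption: ScraperAPI's standard endpoint; YOUR_API_KEY is a placeholder
const response = await axios('http://api.scraperapi.com/', {
  params: {
    api_key: 'YOUR_API_KEY',
    url: 'https://www.bbc.com/sport/football/tables',
  },
});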