Puppeteer is a great choice for this task. It lets you control a headless browser, which is essential for scraping dynamic websites: many sites now load their content with JavaScript, and Puppeteer can render and interact with that content before you extract it.
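For example, here is a minimal sketch of that workflow, assuming a page whose heading may be rendered client-side (https://example.com/ is used purely as a stand-in for any target site):

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto("https://example.com/");
  // Wait until the (possibly JavaScript-rendered) element exists in the DOM
  await page.waitForSelector("h1");
  const heading = await page.$eval("h1", (el) => el.innerText);
  console.log(heading);
  await browser.close();
})();

To follow along, create a new project directory and initialize a Node.js project: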
mkdir web-scraper
cd web-scraper
npm init -y
This will create a package.json file for you with the following output:
Wrote to /home/user/web-scraper/package.json:
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}
Next, install the project dependencies:
npm install puppeteer pg dotenv
1. Log in to your Aptible account and create a new environment.
2. Navigate to the new environment and click Databases > New Database.
3. From the dropdown menu, select PostgreSQL 16.
4. Enter a Database Handle and click New Database to create the database.
5. You can view the database by clicking the Databases tab in the left menu.
6. Navigate to the database you created and click Endpoints > New Endpoint.
7. Add your IP address to the IP Allowlist field so you can connect to the database from your local machine.
8. After the endpoint has finished provisioning, click the Connection URL tab.
9. Click the Show button and copy the connection URL; you will need it to connect to the database.
Create a new file named scraper.js in the root of the project directory and add the following code to it:
const puppeteer = require("puppeteer");
const { Client } = require("pg");
require("dotenv").config();
The code above imports the necessary packages and loads the environment variables from the .env file.
- puppeteer: a headless browser automation library for Node.js, used here for web scraping.
- Client: part of the pg library, a PostgreSQL client for Node.js.
- dotenv: a library for loading environment variables from a file, used here to load the database credentials.
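As a quick illustration of what dotenv does (GREETING is a made-up variable, not one this tutorial uses):

// .env contains: GREETING=hello
require("dotenv").config();
console.log(process.env.GREETING); // prints "hello"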
Next, add the following code to the scraper.js file:
async function run() {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Navigate to the website you want to scrape
  await page.goto("https://netflix.com/");

  // Extract data from the page
  const data = await page.evaluate(() => {
    const title = document.querySelector("h1").innerText;
    const url = window.location.href;
    return { title, url };
  });

  // Close the browser
  await browser.close();
  return data;
}
The code above defines an asynchronous function named run, which scrapes data from the Netflix website. It launches a headless browser with Puppeteer, opens a new page, navigates to https://netflix.com/, extracts data (the page title and URL) using a function passed to page.evaluate, and then closes the browser.
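If you need more than a single element, page.evaluate can return any JSON-serializable value. Here is a hedged sketch of that idea (scrapeHeadings and its selector are illustrative, not part of this tutorial's script):

const puppeteer = require("puppeteer");

// Collect the text of every h2 element on a page
async function scrapeHeadings(url) {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto(url);
  const headings = await page.evaluate(() =>
    Array.from(document.querySelectorAll("h2"), (el) => el.innerText)
  );
  await browser.close();
  return headings;
}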
Next, add the following code to the scraper.js file:
async function saveToDatabase(data) {
  const client = new Client({
    connectionString: process.env.DATABASE_URL,
    ssl: { rejectUnauthorized: false },
  });
  try {
    await client.connect();

    // Check if the table exists, and create it if not
    const createTableQuery = `
      CREATE TABLE IF NOT EXISTS scraped_data (
        id SERIAL PRIMARY KEY,
        title TEXT,
        url TEXT
      );
    `;
    await client.query(createTableQuery);

    // Insert data into the database
    const insertQuery = "INSERT INTO scraped_data (title, url) VALUES ($1, $2)";
    const values = [data.title, data.url];
    await client.query(insertQuery, values);
  } finally {
    await client.end();
  }
}
The code above defines an asynchronous function named saveToDatabase, which saves the scraped data to the PostgreSQL database. It connects using the credentials in the DATABASE_URL environment variable, creates the scraped_data table if it doesn't already exist, and then inserts the scraped title and URL using a parameterized query before closing the connection.
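The parameterized query ($1, $2) keeps the scraped strings from being interpreted as SQL. For completeness, here is a hedged sketch of reading the rows back with the same client setup (readAll is a hypothetical helper, not part of scraper.js):

const { Client } = require("pg");
require("dotenv").config();

// Fetch every row previously saved by the scraper
async function readAll() {
  const client = new Client({
    connectionString: process.env.DATABASE_URL,
    ssl: { rejectUnauthorized: false },
  });
  try {
    await client.connect();
    const result = await client.query("SELECT * FROM scraped_data");
    return result.rows;
  } finally {
    await client.end();
  }
}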
Next, add the following code to the scraper.js file:
(async () => {
  try {
    const scrapedData = await run();
    console.log("Scraped Data:", scrapedData);

    // Save data to the database
    await saveToDatabase(scrapedData);
    console.log("Data saved to the database.");
  } catch (error) {
    console.error("Error during scraping:", error);
  }
})();
The code above is an immediately invoked async function that calls run to scrape data from the Netflix website, then calls saveToDatabase to save the scraped data to the PostgreSQL database, logging any errors along the way.
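The async wrapper is needed because CommonJS scripts cannot use await at the top level. As an aside, if you converted the script to an ES module (a hypothetical variation, e.g. renaming it scraper.mjs and switching require to import), top-level await would let you drop the wrapper:

// Hypothetical scraper.mjs ending, assuming run and saveToDatabase are defined above
try {
  const scrapedData = await run();
  console.log("Scraped Data:", scrapedData);
  await saveToDatabase(scrapedData);
  console.log("Data saved to the database.");
} catch (error) {
  console.error("Error during scraping:", error);
}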
All together, the scraper.js file should look like this:
const puppeteer = require("puppeteer");
const { Client } = require("pg");
require("dotenv").config();

async function run() {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Navigate to the website you want to scrape
  await page.goto("https://netflix.com/");

  // Extract data from the page
  const data = await page.evaluate(() => {
    const title = document.querySelector("h1").innerText;
    const url = window.location.href;
    return { title, url };
  });

  // Close the browser
  await browser.close();
  return data;
}

async function saveToDatabase(data) {
  const client = new Client({
    connectionString: process.env.DATABASE_URL,
    ssl: { rejectUnauthorized: false },
  });
  try {
    await client.connect();

    // Check if the table exists, and create it if not
    const createTableQuery = `
      CREATE TABLE IF NOT EXISTS scraped_data (
        id SERIAL PRIMARY KEY,
        title TEXT,
        url TEXT
      );
    `;
    await client.query(createTableQuery);

    // Insert data into the database
    const insertQuery = "INSERT INTO scraped_data (title, url) VALUES ($1, $2)";
    const values = [data.title, data.url];
    await client.query(insertQuery, values);
  } finally {
    await client.end();
  }
}

(async () => {
  try {
    const scrapedData = await run();
    console.log("Scraped Data:", scrapedData);

    // Save data to the database
    await saveToDatabase(scrapedData);
    console.log("Data saved to the database.");
  } catch (error) {
    console.error("Error during scraping:", error);
  }
})();
Create the .env file
The .env file will contain the database credentials. Create a new file named .env in the root of the project directory and add the following to it:
DATABASE_URL=postgres://<username>:<password>@<host>:<port>/<database>
Note: Replace the placeholders with the database credentials from the database endpoint you created earlier.
Run the script:
node scraper.js
You should see output similar to the following:
Scraped Data: {
  title: 'Unlimited movies, TV shows, and more',
  url: 'https://www.netflix.com/'
}
Data saved to the database.
To verify that the data was saved, connect to the database with psql:
psql <connection_url>
Note: Replace <connection_url> with the connection URL from the database endpoint.
List the tables in the database:
\dt
You should see the scraped_data table listed in the output:
        List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+---------
 public | scraped_data | table | aptible
(1 row)
You can view the data in the scraped_data table by running the following query:
SELECT * FROM scraped_data;
 id |                title                 |            url
----+--------------------------------------+---------------------------
  1 | Unlimited movies, TV shows, and more | https://www.netflix.com/
(1 row)