Fortunately, a technique called web scraping is at your service. You have probably heard about it before, but if you have not: it is a way to extract data from almost any webpage on the Internet and convert it into a computer-readable format such as a pandas dataframe. Many Python libraries allow you to process the HTML behind a webpage and pull out the data you want; the most notable examples are BeautifulSoup and Scrapy. It is also worth learning the basics of the requests library, as well as what the GET and POST methods mean in web data transfer.
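To get a taste of what BeautifulSoup does before diving into the full script, here is a minimal sketch. The HTML snippet is invented for illustration and stands in for a page you would normally fetch with requests.get:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a page fetched with requests.get
html = """
<div data-hook="review">
  <a data-hook="review-title">Great headphones</a>
  <span data-hook="review-date">Reviewed in the United States on June 1, 2022</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find a tag by name and attribute, then read its text
title = soup.find("a", {"data-hook": "review-title"}).text.strip()
print(title)  # Great headphones
```

The same find-by-tag-and-attribute pattern is what the full scraper below relies on.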
import re
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil import parser


def parse_html(url):
    """
    Parses HTML and returns soup.

    Parameters
    ----------
    url : str
        URL to parse.

    Returns
    -------
    bs4.BeautifulSoup
        Parsed HTML content.
    """
    # Headers to send to the server
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0",
        "Accept-Language": "en-US, en;q=0.5",
    }
    # Get a response from the server
    r = requests.get(url, headers=HEADERS)
    # Parse HTML
    soup = BeautifulSoup(r.text, "html.parser")
    return soup


# List to store review data
reviewlist = []

# Pattern to match a country
country_pat = r"(?<=Reviewed\sin\s)(.*)(?=\son)"


def extract_date(string):
    """
    Extracts a date from a string and returns it as YYYY/MM/DD.
    """
    # Parse the string and extract a datetime object
    dt = parser.parse(string, fuzzy=True)
    return datetime.strftime(dt, "%Y/%m/%d")


def scrape_reviews(soup):
    """
    Scrapes information about each review.

    Parameters
    ----------
    soup : bs4.BeautifulSoup
        Parsed HTML content.

    Returns
    -------
    None
        Appends each review as a dictionary to the pre-defined list reviewlist.
    """
    reviews = soup.find_all("div", {"data-hook": "review"})
    for review in reviews:
        try:
            review = {
                "title": review.find("a", {"data-hook": "review-title"}).text.strip(),
                "rating": float(
                    review.find("i", {"data-hook": "review-star-rating"})
                    .text.replace("out of 5 stars", "")
                    .strip()
                ),
                "country": re.search(
                    country_pat,
                    review.find("span", {"data-hook": "review-date"}).text.strip(),
                ).group(1),
                "date": extract_date(
                    review.find("span", {"data-hook": "review-date"}).text.strip()
                ),
                "review": review.find(
                    "span", {"data-hook": "review-body"}
                ).text.strip(),
                "num_helpful": int(
                    review.find("span", {"data-hook": "helpful-vote-statement"})
                    .text.replace(" people found this helpful", "")
                    .strip()
                ),
            }
            reviewlist.append(review)
        except AttributeError:
            # Skip reviews with missing fields (e.g. no helpful votes)
            continue


# Scrape Amazon reviews from multiple pages
for i in range(1, 145):
    url = f"https://www.amazon.com/Wireless-Bluetooth-Headphones-Foldable-Headsets/product-reviews/B08VNFD8FS/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={i}"
    print(f"Scraping page {i}")
    # Parse HTML content
    soup = parse_html(url)
    # Extract the reviews from this page
    scrape_reviews(soup)

# Save review data to a csv file
pd.DataFrame(reviewlist).to_csv("amazon_reviews.csv", index=False)
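Before launching the full 144-page scrape, it can help to sanity-check the date extraction on a sample review-date string. The string below is invented, but it follows Amazon's usual format:

```python
from datetime import datetime
from dateutil import parser

# An invented review-date string in Amazon's usual format
sample = "Reviewed in the United States on September 14, 2021"

# fuzzy=True lets dateutil skip the words around the actual date
dt = parser.parse(sample, fuzzy=True)
date = datetime.strftime(dt, "%Y/%m/%d")
print(date)  # 2021/09/14
```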
Insert the URL and click "Save". Octoparse will automatically load the page and propose the data to scrape. It was pretty good at figuring out what data I wanted, although I needed to clean it a bit.
So, before creating a workflow, let's remove the columns we do not need: hover the cursor over a column name and click the trash-bin icon.
Now it is time to rename the fields. Double-click on the column name you want to rename and insert the new title.
Now, create a workflow! Every task in Octoparse is basically a workflow. In this case, the algorithm automatically figured out that the website has multiple pages, so we need to paginate and extract the data from each one. You can see the workflow on the right-hand side of the window.
Great! However, we still need to clean the data a little; for example, the ratings should be extracted as numbers. Hover the mouse cursor over the right side of this column and you will see three dots; click on them. A menu will show up, where you have to click on "Clean data". We can now add the different steps involved in cleaning. In this case, we just have to match the pattern "digit.digit" (like 5.0) using RegEx. Click on "Add step", and then "Replace with Regular Expression". A new window will pop up, where we insert the pattern \d\.\d (the dot is escaped so it matches a literal decimal point), which you can evaluate to see if it works correctly by clicking the "Evaluate" button. As you can see below, it worked! Thus, click "Confirm". Once the window closes, click on "Apply" on the right-hand side.
You now see that the column values changed to match the pattern. To practice, perform a similar task with the number of people who found this review helpful.
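The same cleaning step can be reproduced in Python. The raw rating string below is invented, but it follows Amazon's "x.y out of 5 stars" format:

```python
import re

# An invented raw rating string in Amazon's "x.y out of 5 stars" format
raw = "4.5 out of 5 stars"

# \d\.\d matches a digit, a literal dot, and another digit (e.g. 4.5)
rating = float(re.search(r"\d\.\d", raw).group())
print(rating)  # 4.5
```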
To extract the country, you again have to match a RegEx pattern, but this time it is a bit more involved: (?<=Reviewed\sin\s)(.*)(?=\son).
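The lookaround pattern can be unpacked piece by piece; here is a small sketch using an invented review-date string:

```python
import re

# An invented review-date string
sample = "Reviewed in Germany on 2 January 2022"

# (?<=Reviewed\sin\s)  lookbehind: the match must be preceded by "Reviewed in "
# (.*)                 capture group: the country name itself
# (?=\son)             lookahead: the match must be followed by " on"
pattern = r"(?<=Reviewed\sin\s)(.*)(?=\son)"

country = re.search(pattern, sample).group(1)
print(country)  # Germany
```

Because the lookbehind and lookahead are not part of the match itself, the capture group contains only the country name, with no trimming needed afterwards.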
On the dashboard, you will see the running task. You can now safely close the program and even switch off the computer. Once the task is completed, you can export the data into a format like csv
and start investigating it with your preferred programming language and tools.
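Whichever route produced the file, loading it back for analysis is one line with pandas. The sample below uses a tiny invented buffer in place of the exported file:

```python
import io

import pandas as pd

# A tiny invented sample standing in for the exported file; with a real
# export you would pass the filename, e.g. pd.read_csv("amazon_reviews.csv")
csv_data = io.StringIO(
    "title,rating,country,date,review,num_helpful\n"
    "Great sound,4.5,the United States,2021/09/14,Love them,12\n"
)

df = pd.read_csv(csv_data)
print(df.shape)  # (1, 6)
```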