This tutorial walks you through the basic steps of scraping Amazon product information using Python and BeautifulSoup.
Scraping product information from Amazon can generate valuable insights for many use cases, whether you are monitoring prices, running a business-intelligence project, or keeping an eye on your competition.
Python is well suited for this task, since its syntax is very easy to read and the language offers great libraries for networking (requests) and data extraction (BeautifulSoup).
pip3 install requests beautifulsoup4 lxml
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.com/FEICE-Stainless-Leathers-Waterproof-Business/dp/B074MWWTVL'
response = requests.get(url, headers=headers)
The retrieved response can be inspected by running:
print(response.text)
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, features="lxml")
title = soup.select("#productTitle")[0].get_text().strip()
Since the select() function returns a list, even if only one element was matched, we pick the first element with [0].
We are only interested in the text inside the #productTitle element and do not care about the HTML tags wrapping it, so we append get_text() to the command. Furthermore, the text is surrounded by a lot of whitespace that we want to get rid of; strip() does the job for us.
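The behaviour of select(), get_text() and strip() is easy to verify on a tiny standalone snippet (the HTML below is made up for illustration; the built-in parser is enough here):

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; the real product page is far larger
html = '<span id="productTitle">   FEICE Mechanical Watch\n  </span>'
soup = BeautifulSoup(html, "html.parser")

# select() returns a list even for a single match, hence the [0]
title = soup.select("#productTitle")[0].get_text().strip()
print(title)  # surrounding whitespace is removed
```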
The number of related categories will vary from product to product. This is where the findAll() function comes in handy:
categories = []
for li in soup.select("#wayfinding-breadcrumbs_container ul.a-unordered-list")[0].findAll("li"):
    categories.append(li.get_text().strip())
features = []
for li in soup.select("#feature-bullets ul.a-unordered-list")[0].findAll('li'):
    features.append(li.get_text().strip())
price = soup.select("#priceblock_saleprice")[0].get_text()
Note: The retrieved value is a string containing the dollar sign and the price of the product. If this tutorial were not for demonstration purposes only, we would detect the contained currency and store the price in a separate float variable.
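As a rough sketch of what that post-processing could look like, a small helper that splits off the leading currency symbol and converts the rest to a float (the helper name and the handling of thousands separators are our own choice, not part of the original code):

```python
# Hypothetical helper; assumes a single leading currency symbol
def parse_price(price_text):
    """Split a string such as '$1,299.99' into currency symbol and float value."""
    currency = price_text[0]                        # e.g. '$'
    value = float(price_text[1:].replace(',', ''))  # drop thousands separators
    return currency, value

print(parse_price('$64.99'))  # ('$', 64.99)
```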
# The review count may contain thousands separators, e.g. "1,042 ratings"
review_count = int(soup.select("#acrCustomerReviewText")[0].get_text().split()[0].replace(',', ''))
import json

jsonObject = {'title': title, 'categories': categories, 'features': features, 'price': price, 'review_count': review_count}
print(json.dumps(jsonObject, indent=2))
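Each of the select(...)[0] calls above raises an IndexError when the element is missing, which happens whenever Amazon changes its markup or serves a CAPTCHA page. A small helper of our own making lets the lookups fail soft instead (demonstrated on a made-up snippet):

```python
from bs4 import BeautifulSoup

def select_text(soup, selector, default=None):
    """Return the stripped text of the first match, or a default when absent."""
    matches = soup.select(selector)
    return matches[0].get_text().strip() if matches else default

# Tiny made-up document to demonstrate the fallback behaviour
demo = BeautifulSoup('<span id="productTitle"> Demo </span>', "html.parser")
print(select_text(demo, "#productTitle"))                     # found element
print(select_text(demo, "#priceblock_saleprice", "not available"))  # missing
```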
Proxy Server
As stated above, Amazon is very sensitive when it comes to scraping. Even if you implement measures like slow scraping, sleep periods and user-agent rotation, Amazon will stop your script at some point. A way around this is to route your requests through a proxy server or to use a scraper API. The following link provides a good overview of available products:
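Those mitigation measures can be sketched in a few lines. The user agents and delay bounds below are arbitrary examples, and, as noted above, rotation alone will not keep Amazon from blocking the script eventually:

```python
import random
import time

# Example user agents only; a real rotation list should be larger and current
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/16.0 Safari/605.1.15',
]

def rotating_headers():
    """Pick a different user agent for each request."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def random_pause(min_s=2.0, max_s=6.0):
    """Sleep a random interval between requests (slow scraping)."""
    time.sleep(random.uniform(min_s, max_s))
```

A scraping loop would then call `requests.get(url, headers=rotating_headers())` followed by `random_pause()` between products.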