In this article, we will build a program that scrapes, or grabs, data from a website with a Python script. This method of gathering data is called web scraping.
Web scraping means using Python or any other programming language to programmatically download, clean, and use the data from a web page. Most websites don't want you scraping all of their data, so to find out what a site permits, look at its robots.txt file, a dedicated page that lists the endpoints crawlers are and are not allowed to access.
Append /robots.txt to the root URL of a website to see these rules. For example, let's use the YCombinator news site. The result should look like the text file below:
The screenshot states which endpoints we are and are not allowed to scrape on the YCombinator website. A crawl delay is the pause a program should leave between requests, so that constant scraping does not overload the servers and slow down the website.
In this exercise, we scrape the home page of the news section, which the rules for our user agent allow.
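To make the robots.txt rules concrete, here is a minimal sketch that checks them with Python's built-in urllib.robotparser module. The rules in SAMPLE_ROBOTS_TXT and the example.com URLs are made up for illustration and are not the real YCombinator rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only (not the real YCombinator file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_text):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether any user agent ("*") may fetch a given URL.
print(parser.can_fetch("*", "https://example.com/news"))          # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False

# The pause, in seconds, the site asks for between requests.
print(parser.crawl_delay("*"))  # 30
```

To check a live site instead of a sample string, RobotFileParser can also download the file itself with its set_url() and read() methods.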
Getting Started
The Python web scraper requires two modules for scraping the data:
Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML files. It uses a parser to turn the raw markup into a document that can be searched and navigated, which saves programmers hours of manual and repetitive work.
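As a quick illustration of what parsing gives us, the sketch below feeds a small HTML string to Beautiful Soup and reads two values back out. The HTML snippet and the headline class name are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html>
  <body>
    <h1 class="headline">Hello, scraper</h1>
    <a href="https://example.com/docs">Read the docs</a>
  </body>
</html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element.
heading = soup.find("h1", class_="headline")
link = soup.find("a")

heading_text = heading.get_text()  # text inside the <h1> tag
link_target = link.get("href")     # value of the href attribute

print(heading_text)  # Hello, scraper
print(link_target)   # https://example.com/docs
```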
Requests
The requests HTTP library downloads the HTML of a page from the link to the website with the .get() function.
Creating a Web Scraper
Now to the nitty-gritty of this project. Create a new directory, and in there, a file named app.py that will contain the code for the web scraper program.
Copy and paste the following code:
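# app.py
import requests

# Download the YCombinator news home page (the URL needs the https:// scheme)
response = requests.get('https://news.ycombinator.com/news')
# The HTML of the page as a string
yc_web_page = response.text
print(yc_web_page)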
The code above does the following:
- Imports the requests module
- Downloads the HTML of the page at the given link with the .get() function of requests and stores the result in the response variable
- Reads the content of the web page with .text
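The same steps can be wrapped in a small function that also sets a timeout and stops on HTTP errors. This is only a sketch: the fetch_html name and its get parameter are our own additions, not part of the requests library. The get parameter exists so the function can be tried without a network connection.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML as a string.

    `get` defaults to requests.get; passing a stand-in function makes it
    possible to exercise this helper without touching the network.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://news.ycombinator.com/news') should return the same text that response.text holds in app.py, as long as the request succeeds.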
If you run app.py with the command python app.py and it fails with a ModuleNotFoundError instead of printing the page, the modules used in this article still need to be installed.
Run the following commands to install the modules.
pip3 install requests
pip3 install beautifulsoup4
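To confirm that the installation worked before running the scraper, a short check like the sketch below can help. The missing_modules helper is our own name. Note that the beautifulsoup4 package is imported under the name bs4.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# beautifulsoup4 is installed under the import name "bs4".
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All required modules are installed.")
```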
The output of app.py should look like this:
Next, let's update the app.py file with the rest of the code using Beautiful Soup:
# app.py
import requests
from bs4 import BeautifulSoup  # add this

response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text

# add this
# Parse the HTML, then read the first title link and the first score
soup = BeautifulSoup(yc_web_page, 'html.parser')
article_tag = soup.find(name="a", class_='titlelink')
article_title = article_tag.get_text()
article_link = article_tag.get('href')
article_upvote = soup.find(name="span", class_="score").get_text()

result = {
    "title": article_title,
    "link": article_link,
    "point": article_upvote
}
print(result)
The code snippet above does the following:
- Imports the BeautifulSoup class from the bs4 module
- Uses the variable soup to hold the document that BeautifulSoup parses from yc_web_page with html.parser, Python's built-in HTML parser
Before going over the rest of the code, let's open our web browser with the link provided in .get().
Next, right-click on the page, and click Inspect to view the Elements tab of the YCombinator news page.
Our web page should look like this:
With Beautiful Soup, we can target specific elements on the page with their class names:
- Assign the article_tag variable using the find() function with the element's tag name, a, and its class name passed as class_ with a trailing underscore. The underscore is needed because class is a reserved keyword in Python
- Extract the title of the link in article_tag using the .get_text() function
- Extract the link of article_tag from its href attribute with the .get() function
- The same applies to the article_upvote variable, where the tag name, <span>, and the class name are used to extract the points. Because find() returns only the first match, the result describes the first article on the page
- Create a variable result that holds the extracted data as a dictionary of key and value pairs
- Print out the final result
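The steps above read only the first article. As a sketch of how the same idea could extend to every article, the code below uses find_all() instead of find(). The SAMPLE_HTML markup and the extract_articles function are made up for illustration. Pairing titles with scores by position is also an assumption that can break on a real page where some entries have no score.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from app.py.
SAMPLE_HTML = """
<table>
  <tr><td><a class="titlelink" href="https://example.com/a">First story</a></td></tr>
  <tr><td><span class="score">120 points</span></td></tr>
  <tr><td><a class="titlelink" href="https://example.com/b">Second story</a></td></tr>
  <tr><td><span class="score">45 points</span></td></tr>
</table>
"""


def extract_articles(html):
    """Return one dictionary per article found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns every match, while find() returns only the first.
    title_tags = soup.find_all(name="a", class_="titlelink")
    score_tags = soup.find_all(name="span", class_="score")

    articles = []
    # Assumption: the nth title belongs to the nth score.
    for title_tag, score_tag in zip(title_tags, score_tags):
        articles.append({
            "title": title_tag.get_text(),
            "link": title_tag.get("href"),
            "point": score_tag.get_text(),
        })
    return articles


for article in extract_articles(SAMPLE_HTML):
    print(article)
```

When nothing matches, find_all() returns an empty list, so the function returns an empty list instead of raising an error.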
With the whole script written, running app.py should scrape the data from the news home page of YCombinator and print a result that looks like this:
Conclusion
This article taught you how to use a Python web scraper to extract data from a web page. A web scraper also saves time and effort, because it produces large data sets much faster than collecting the data manually.