In this article, you will learn the basics of web scraping in Python, and we will build a demo project that scrapes quotes from a website.
Web scraping is the programmatic extraction of data from a website. With it you can extract the text inside HTML tags, download images and files, and do almost anything you would otherwise do by copying and pasting manually, only much faster.
$ pip install requests
$ pip install beautifulsoup4
We are going to use the requests library in our demo project to send a GET request to the website and obtain its HTML source code.
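As a rough sketch of that step, the helper below wraps the GET request. The name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the demo project.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of `url` as text (hypothetical helper)."""
    # Send the GET request; the timeout keeps the call from hanging forever
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html with a full URL, scheme included, should return the page source as a string that can then be handed to BeautifulSoup.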
Let's use the BeautifulSoup library to extract data from the HTML file below, sample.html.
<!DOCTYPE html>
<html>
<head>
<title>Document</title>
</head>
<body>
<div id = 'quotes'>
<p id = 'normal'>Time the time before the time times you</p>
<p id = 'normal'>The Future is now </p>
<p id = 'special'>Be who you wanted to be when you're younger</p>
<p id = 'special'>The world is a reflection of who you're</p>
</div>
<div>
<p id = 'Languages'>Programming Languages</p>
<ul>
<li>Python</li>
<li>C+++</li>
<li>Javascript</li>
<li>Golang</li>
</ul>
</div>
</body>
</html>
Extracting all paragraphs in HTML
Let's extract all paragraphs from the sample.html shown above using BeautifulSoup:
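from bs4 import BeautifulSoup
# Read the HTML file and build a parse tree from it
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
# Print the text of every <p> tag in the document
for paragraph in soup.find_all('p'):
    print(paragraph.text)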
Output
When you run the program above, it produces the following result:
$ python app.py
Time the time before the time times you
The Future is now
Be who you wanted to be when you're younger
The world is a reflection of who you're
Programming Languages
Code Explanation
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
The lines above import BeautifulSoup, read sample.html, and create a BeautifulSoup object ready for parsing.
for paragraph in soup.find_all('p'):
    print(paragraph.text)
We used BeautifulSoup's find_all() method to extract all the paragraphs in the HTML file. It accepts the name of an HTML tag as a parameter, searches the parsed document for every matching tag, and returns them all.
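To make the return value of find_all() more concrete, here is a small sketch. The inline HTML string is a shortened, hypothetical stand-in for sample.html so the snippet runs without the file.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for sample.html
html = """
<div>
  <p id="normal">The Future is now </p>
  <p id="special">Be who you wanted to be</p>
  <ul><li>Python</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")            # list-like collection of Tag objects
first_only = soup.find_all("p", limit=1)   # stop searching after the first match
missing = soup.find_all("table")           # no match gives an empty result, not an error

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

Because the result behaves like a list, it can be indexed, sliced, counted with len(), or fed into a list comprehension as above.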
To extract the list elements instead of the paragraphs, we pass the tag name li instead of p to the find_all() method, as shown below:
app.py
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
# Print the text of every <li> tag in the document
for item in soup.find_all('li'):
    print(item.text)
Output
$ python app.py
Python
C+++
Javascript
Golang
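If the page held more than one list, a plain find_all('li') would return the items of all of them. The sketch below, using a hypothetical inline snippet, narrows the search to a single ul first and collects the text into a Python list instead of printing it.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two lists; only the <ul> holds the languages
html = "<ol><li>first</li></ol><ul><li>Python</li><li>Golang</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")   # items from both lists

# find() returns the first matching tag; calling find_all() on that tag
# searches only inside it
language_list = soup.find("ul")
languages = [li.get_text(strip=True) for li in language_list.find_all("li")]
print(languages)
```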
Extracting paragraphs with a specific id
Apart from returning every tag with a given name, we can also check the attributes of those tags to extract only specific ones, as shown below:
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)
Output
$ python app.py
Time the time before the time times you
The Future is now
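As an alternative to checking the id inside the loop, find_all() can filter on attributes itself. The following is a minimal sketch on a hypothetical inline snippet, not the article's sample.html.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the last paragraph deliberately has no id
html = """
<p id="normal">The Future is now</p>
<p id="special">Be who you wanted to be</p>
<p>No id here</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments are matched against tag attributes
normal = [p.get_text(strip=True) for p in soup.find_all("p", id="normal")]

# The attrs dictionary is an equivalent form, useful for attribute names
# that are not valid Python keyword arguments
special = [p.get_text(strip=True) for p in soup.find_all("p", attrs={"id": "special"})]

print(normal, special)
```

Paragraphs without an id are simply not matched, so no extra check is needed.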
Quotes spider
In this project, we are going to implement a web scraper that scrapes quotations from a website at a given URL. We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.
Website of Interest (WOI)
In our demo project, we are going to scrape the quotes from quotes.toscrape.com.
Demo project source code
In the source code of our demo project, not much has changed, except that this time we obtain the HTML source code from a website using the requests module instead of reading it from a file.
import requests
from bs4 import BeautifulSoup
# The URL needs an explicit scheme; requests rejects scheme-less URLs
html = requests.get('https://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('span'):
    # .string is None for spans that mix text with nested tags, so those are skipped
    if paragraph.string:
        print(paragraph.string)
Output
$ python scraper.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
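Printing every span with a string works, but it depends on the quotes being the only such spans. A more targeted sketch is shown below. It assumes each quote sits in a div with class quote that contains a span with class text and a small tag with class author; those class names are an assumption about the site's markup, and parse_quotes is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Return a list of {'text', 'author'} dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    # class_ is used because `class` is a reserved word in Python
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        if text is None or author is None:
            continue  # skip blocks that do not match the assumed layout
        quotes.append({
            "text": text.get_text(strip=True),
            "author": author.get_text(strip=True),
        })
    return quotes


# Made-up markup that follows the assumed structure
sample = """
<div class="quote">
  <span class="text">Example quote</span>
  <span>by <small class="author">Example Author</small></span>
</div>
"""
quotes = parse_quotes(sample)
print(quotes)
```

The same function could be given the text of a response fetched with requests, provided the live page really follows the assumed structure.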