Step 1: Understanding the task
First, it is important to understand the task clearly. The client wants us to scrape data and save it to a CSV file, where each attribute listed above gets its own column.
Here is the relevant part of the project description:

"We can provide an initial set of 100 DOT numbers to make sure the information above can be scraped easily; from there we can provide all DOT numbers we would like to scrape on a daily basis."

Report pages can be accessed by DOT numbers (or by other name IDs); each page has its own ID (the DOT). These DOTs live in an Excel file, so we have to read that file, extract the DOT numbers, and pass each one into the URL to access its report page.
Step 2: Creating our environment and installing dependencies
Now that we know what the client wants, let's create our virtual environment and then inspect the elements we are going to crawl. To create a virtualenv, run the following command in your terminal:

virtualenv env
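Then activate it before installing anything (assuming the default env name above; the Windows path differs):

source env/bin/activate   # macOS/Linux
env\Scripts\activate      # Windows

With the environment active, install the two libraries we need: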
pip install beautifulsoup4 xlrd
Step 3: Crawling Data
Alright devs! Let's start by opening the example URL from the project description so we can see the fields. Near the top of the page you will find a line like:

The information below reflects the content of the FMCSA management information systems as of 01/01/2020.

The truth is we can't crawl this element by a specific class name or id; unfortunately, the markup of these report pages is a mess.
But we can see that this element is in bold, and there are many other elements in bold as well. So we can crawl all of them as text and use RegEx to extract the data we need.

A regular expression (RegEx) is a special sequence of characters that helps you match or find strings or sets of strings, using a specialized syntax held in a pattern. The date sits between "The information below reflects the content of the FMCSA management information systems as of" and the period that follows it, so the regex will look between these two strings.
import re  # regex
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def crawl_data(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    # Find all bold texts
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            # Look between these strings
            date = re.search(
                'The information below reflects the content of the FMCSA management information systems as of(.*).',
                b.get_text(strip=True, separator=' ')
            ).group(1).strip()
            # If the match runs past the date, keep the string before the first dot
            if len(date) > 11:
                date = date.split(".", 1)[0]
            print(date)
        except AttributeError:
            pass
To see what re.search() returns, here is a quick example:

import re

# We need to extract "coderasha" from the string
data = "Hello my name is coderasha."
name = re.search('Hello my name is (.*).', data)
print(name)
# Output: <_sre.SRE_Match object; span=(0, 27), match='Hello my name is coderasha.'>
re.search() returns a match object, so we call group(1) to pull out the text captured by the parentheses:

import re

# We need to extract "coderasha" from the string
data = "Hello my name is coderasha."
name = re.search('Hello my name is (.*).', data).group(1)
print(name)
# Output: coderasha
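The rest of the fields live in one big table, and get_text() flattens such a table into a single string. A minimal sketch with a hypothetical two-cell row shows the idea (the HTML is made up for illustration, not real SAFER markup):

from bs4 import BeautifulSoup

html = '<center><table><tr><td>Legal Name:</td><td>EXAMPLE TRUCKING LLC</td></tr></table></center>'
soup = BeautifulSoup(html, 'html.parser')
# separator=' ' joins the cell texts with spaces; strip=True trims each piece
print(soup.find('center').get_text(strip=True, separator=' '))
# Output: Legal Name: EXAMPLE TRUCKING LLC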
Back in crawl_data(), we grab all the text inside the page's main center block and slice each field out using the label before it and the label after it:

    # Get all text inside the table
    information = bs.find('center').get_text(strip=True, separator=' ')
    # Find the fields using RegEx, anchored on the labels around each value
    operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
    legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
    physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
    mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
    usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
    power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
    drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()
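To make the label-to-label pattern concrete, here is a minimal sketch on a made-up snippet of flattened table text (the string is illustrative, not real SAFER output):

import re

# Hypothetical flattened text mimicking the structure of the SAFER page
information = 'Operating Status: AUTHORIZED FOR Property Out of Service Date: None Power Units: 25 Drivers: 30 MCS-150 Form Date: 01/01/2020'
print(re.search('Operating Status:(.*)Out', information).group(1).strip())  # AUTHORIZED FOR Property
print(re.search('Power Units:(.*)Drivers', information).group(1).strip())   # 25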
Step 4: Write data in CSV
Once the data is crawled, it is time to create a new CSV file and write the data into it. I prefer to create another function to handle this:

import csv
def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers):
    with open(usdot_address + '.csv', mode='w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({'Date': date, 'Operating Status': operating, 'Legal_Name': legal_name,
                         'Physical Address': physical_address, 'Mailing Address': mailing_address,
                         'Power Units': power_units, 'Drivers': drivers})
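As a quick sanity check, you can call it with dummy values (every value below is made up for illustration):

write_csv('01/01/2020', 'AUTHORIZED', 'EXAMPLE TRUCKING LLC', '1 Main St', 'PO Box 1', '123456', '10', '12')
# Creates 123456.csv with a header row and one data row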
The CSV name must be unique, so I named each file with usdot_address (or another name ID of the report page) from the crawled data.

Step 5: Read the Excel file to crawl data for each DOT
The final step is to read the Excel file and pass each DOT to the end of the URL to access the report pages. We can use xlrd to read the Excel file:

import xlrd
dots = []

def read_excel_file():
    loc = "dots.xls"
    wb = xlrd.open_workbook(loc)
    sheet = wb.sheet_by_index(0)
    # First few DOTs in the Excel file (rows 1-4, skipping the header row)
    for i in range(1, 5):
        # Convert floats to strings and strip the trailing .0
        dot = str(sheet.cell_value(i, 0)).replace('.0', '')
        dots.append(dot)
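Note that recent xlrd releases (2.0+) only read legacy .xls files. If your DOT list arrives as .xlsx, a minimal sketch with openpyxl would work instead (the file name and one-column-with-header layout are assumptions):

from openpyxl import load_workbook

wb = load_workbook('dots.xlsx')  # hypothetical .xlsx version of the file
sheet = wb.active
for row in sheet.iter_rows(min_row=2, values_only=True):  # skip the header row
    dots.append(str(int(row[0])))  # casting to int avoids the trailing-.0 cleanup entirely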
xlrd reads numbers as floats, so the simplest fix is to convert them to strings and use the replace() method to remove the .0 from the end of the string. Then we pass each DOT into the URL:

for dot in dots:
    crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot)
    # Sleep 5 seconds between requests to avoid errors
    time.sleep(5)
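For a daily batch over the full DOT list, you may also want the loop to survive a single bad page. A minimal sketch (the try/except is my addition, not part of the original script):

import time

for dot in dots:
    try:
        crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY'
                   '&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot)
    except Exception as e:
        # Log the failure and keep going with the remaining DOTs
        print('Failed for DOT', dot, '-', e)
    time.sleep(5)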
Here is the full code:

import re
import csv
import time
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import xlrd

dots = []

def read_excel_file():
    loc = "dots.xls"
    wb = xlrd.open_workbook(loc)
    sheet = wb.sheet_by_index(0)
    # Rows 1-4: the first few DOTs, skipping the header row
    for i in range(1, 5):
        dot = str(sheet.cell_value(i, 0)).replace('.0', '')
        dots.append(dot)

def crawl_data(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    date = ''  # default in case no bold text matches
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            date = re.search(
                'The information below reflects the content of the FMCSA management information systems as of(.*).',
                b.get_text(strip=True, separator=' ')
            ).group(1).strip()
            if len(date) > 11:
                date = date.split(".", 1)[0]
            print(date)
        except AttributeError:
            pass
    information = bs.find('center').get_text(strip=True, separator=' ')
    operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
    legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
    physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
    mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
    usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
    power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
    drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()
    write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers)

def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers):
    with open(usdot_address + '.csv', mode='w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({'Date': date, 'Operating Status': operating, 'Legal_Name': legal_name,
                         'Physical Address': physical_address, 'Mailing Address': mailing_address,
                         'Power Units': power_units, 'Drivers': drivers})

read_excel_file()
print(dots)
for dot in dots:
    crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot)
    # Wait 5 seconds between requests to be polite to the server
    time.sleep(5)