1. Choose Your Target Keywords
Now that we know our main goal, it’s time to pick the keywords we want to scrape to support it. To pick your target keywords, think of the terms consumers could be searching to find your offering, and identify your competitors. In this example, we’ll target four keywords:

“asana reviews”
“clickup reviews”
“best project management software”
“best project management software for small teams”

We could add many more keywords to this list, but for this scraper tutorial, they’ll be more than enough. Also, notice that the first two queries are related to direct competitors, while the last two will help us identify other competitors and get an initial sense of the state of the industry.
2. Set Up Your Development Environment
The next step is to get our machine ready to develop our Google scraper. For this, we’ll need a few things:

Python version 3 or later
Pip – to install Scrapy and any other packages we might need
ScraperAPI

Your machine may have a pre-installed Python version. Enter

python --version

into your command prompt to see if that’s the case.
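If you’d rather check from inside Python itself, a quick sketch works too (nothing here is specific to this tutorial):

import sys

# Prints the interpreter version, e.g. "3.10.12"; any 3.x release should work here
print('.'.join(str(n) for n in sys.version_info[:3]))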
If you need to install everything from scratch, follow our Python and Scrapy tutorial. We’ll be using the same setup, so get that done and come back.

Note: something to keep in mind is that the team behind Scrapy recommends installing Scrapy in a virtual environment (VE) instead of globally on your PC or laptop. If you’re unfamiliar, the Python and Scrapy tutorial mentioned above shows you how to create the VE and install all dependencies.

In this tutorial, we’re also going to be using ScraperAPI to avoid any IP bans or repercussions. Google doesn’t really want us to scrape their SERPs – especially for free. As such, they have implemented advanced anti-scraping techniques that’ll quickly identify any bots trying to extract data automatically.

To get around this, ScraperAPI is a complex system that utilizes third-party proxies, machine learning, huge browser farms, and years of statistical data to ensure that our scraper won’t get blocked from any site, rotating our IP address for every request, setting wait times between requests, and handling CAPTCHAs.

In other words, by just adding a few lines of code, ScraperAPI will supercharge our scraper, saving us headaches and hours of work.

All we need for this tutorial is our API key from ScraperAPI. To get it, just create a free ScraperAPI account to redeem 5,000 free API requests.
3. Create Your Project’s Folder
After installing Scrapy in your VE, enter this snippet into your terminal to create the necessary folders:

scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com
Scrapy will first create a new project folder called “google_scraper,” which also happens to be the project’s name. Next, go into this folder and run the genspider command to create a web scraper named “google”.

We now have many configuration files, a “spiders” folder containing our scraper, and a Python modules folder containing package files.
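For reference, the generated layout should look roughly like this (a typical Scrapy scaffold; file names can vary slightly between Scrapy versions):

google_scraper/
├── scrapy.cfg
└── google_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── google.py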
4. Import All Necessary Dependencies to Your google.py File
The next step is to build a few components that will make our script as efficient as possible. To do so, we’ll need to make our dependencies available to our scraper by adding them at the top of our google.py file:

import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_API_KEY'
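As a side note, hard-coding the key is fine for a quick test, but if you plan to share or commit this project, a common alternative is reading it from an environment variable. A minimal sketch, assuming a hypothetical SCRAPER_API_KEY variable:

import os

# Assumption: an environment variable named SCRAPER_API_KEY holds your ScraperAPI key
API_KEY = os.environ.get('SCRAPER_API_KEY', 'YOUR_API_KEY')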
5. Construct the Google Search Query
Google employs a standard and query-able URL structure. You just need to know the URL parameters for the data you need, and you can generate a URL to query Google with.

That said, the following makes up the base URL structure for all Google search queries:

http://www.google.com/search

There are several standard parameters that make up Google search queries:

q represents the search keyword parameter. http://www.google.com/search?q=tshirt, for example, will look for results containing the keyword “tshirt.”
The offset point is specified by the start parameter. http://www.google.com/search?q=tshirt&start=100 is an example.
hl is the language parameter. http://www.google.com/search?q=tshirt&hl=en is a good example.
The as_sitesearch parameter allows you to search within a domain (or website). http://www.google.com/search?q=tshirt&as_sitesearch=amazon.com is one example.
The number of results per page (maximum is 100) is specified by the num parameter. http://www.google.com/search?q=tshirt&num=50 is an example.
The safe parameter generates only “safe” results. http://www.google.com/search?q=tshirt&safe=active is a good example.

Note: a reference list of Google’s search URL parameters is incredibly useful in building a query-able URL. Bookmark one for more complex scraping projects in the future.

Alright, let’s define a method to construct our Google URL using this information:

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100}
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)
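To sanity-check the helper, you can call it directly. For one of our keywords, urlencode should produce something like this:

print(create_google_url('asana reviews'))
# http://www.google.com/search?q=asana+reviews&num=100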
6. Define the ScraperAPI Method
To use ScraperAPI, all we need to do is send our request through ScraperAPI’s server by appending our query URL to the proxy URL provided by ScraperAPI, using a payload and urlencode. The code looks like this:

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
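Calling it on one of our Google URLs gives a feel for what the spider will actually request (a sketch; your key and the exact parameter order may differ):

print(get_url('http://www.google.com/search?q=asana+reviews&num=100'))
# http://api.scraperapi.com/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dasana%2Breviews%26num%3D100&autoparse=true&country_code=us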
7. Write the Spider Class
In Scrapy, we can create different classes, called spiders, to scrape specific pages or groups of sites. Thanks to this feature, we can build different spiders inside the same project, making it much easier to scale and maintain.

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
                       'RETRY_TIMES': 5}
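As a design note, custom_settings only overrides these values for this one spider. If you’d rather apply them project-wide, the same keys can live in the settings.py file Scrapy generated for us, for example:

# settings.py – equivalent project-wide overrides (same values as custom_settings above)
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'INFO'
CONCURRENT_REQUESTS_PER_DOMAIN = 10
RETRY_TIMES = 5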
8. Send the Initial Request
It’s finally time to send our HTTP request. It is very simple to do this with the start_requests(self) method, looping over each query and sending it through ScraperAPI:

def start_requests(self):
    queries = ['asana+reviews',
               'clickup+reviews',
               'best+project+management+software',
               'best+project+management+software+for+small+teams']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
9. Write the Parse Function
Thanks to ScraperAPI’s auto-parsing functionality, our scraper should be returning a JSON file as a response to our request. Make sure it does by enabling the 'autoparse': 'true' parameter in the get_url function.

Next, we’ll load the complete JSON response and cycle through each result, taking the data and combining it into a new item that we can use later.

This procedure also checks whether another page of results is available. If an additional page is present, the request is invoked again, repeating until there are no additional pages.

def parse(self, response):
    di = json.loads(response.text)
    pos = response.meta['pos']
    dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for result in di['organic_results']:
        title = result['title']
        snippet = result['snippet']
        link = result['link']
        item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
        pos += 1
        yield item
    next_page = di['pagination']['nextPageUrl']
    if next_page:
        yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
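Each yielded item is a plain dictionary, so a single scraped result should look roughly like this (illustrative values only; real titles, snippets, and links depend on what Google returns):

{
    'title': 'Asana Reviews | Example Review Site',
    'snippet': 'Read real user reviews of Asana...',
    'link': 'https://example.com/asana-reviews',
    'position': 0,
    'date': '2024-01-01 12:00:00'
}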
10. Run the Spider
Congratulations, we built our first Google scraper! Remember, our code can always be changed to add functionality we discover is missing, but for now we have a functional scraper. If you’ve been following along, your google.py file should look like this by now:

import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_API_KEY'


def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100}
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
                       'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['asana+reviews', 'clickup+reviews', 'best+project+management+software', 'best+project+management+software+for+small+teams']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
To run the spider and export the results to a CSV file, enter the following into your terminal from the project’s folder:

scrapy crawl google -o serps.csv
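If you’d rather launch the spider from a Python script instead of the command line, Scrapy’s CrawlerProcess can do the same job. A minimal sketch, assuming Scrapy 2.1 or later (for the FEEDS setting) and the default module path genspider created:

from scrapy.crawler import CrawlerProcess

# Assumption: run from the project root, where google_scraper.spiders.google is importable
from google_scraper.spiders.google import GoogleSpider

process = CrawlerProcess(settings={
    'FEEDS': {'serps.csv': {'format': 'csv'}},  # same output file as the -o flag above
})
process.crawl(GoogleSpider)
process.start()  # blocks here until the crawl finishes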
This post was originally published on .