How To Make a Web Crawler Using Python - Beginner's Guide
Ștefan Răcila on Apr 11 2023
Web crawling is the process of automatically visiting web pages and extracting useful information from them. A web crawler, also known as a spider or bot, is a program that performs this task. In this article, we will be discussing how to create a web crawler using the Python programming language. Specifically, we will be making two web crawlers.
We will build a simple web crawler from scratch in Python using the Requests and BeautifulSoup libraries. After that, we will talk about the advantages of using a web crawling framework like Scrapy. And lastly, we will build an example crawler with Scrapy to collect data from all baby products from Amazon. We will also see how Scrapy scales to websites with several million pages.
Prerequisites
Before following this article, you will need to have a basic understanding of Python and have it installed on your computer. Additionally, you will need to install the Requests and BeautifulSoup modules. This can be done by running the following command in your command prompt or terminal:
$ pip install requests bs4
For the second part of this article, where we will build an example web crawler using Scrapy, you will need to install the Scrapy framework. The creators of this framework strongly recommend that you install Scrapy in a dedicated virtual environment, to avoid conflicting with your system packages.
I suggest you install virtualenv and virtualenvwrapper to create an isolated Python environment. Please note that there is a Windows version of virtualenvwrapper named virtualenvwrapper-win.
You will also need pipx, which you can install via pip and then use to install virtualenv:
$ python -m pip install --user pipx
$ python -m pipx ensurepath
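With pipx available, you can then install virtualenv and create an isolated environment. As an example, assuming a Unix-like shell and an environment folder named `venv` (the folder name is just an example):
$ pipx install virtualenv
$ virtualenv venv
$ source venv/bin/activate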
After you have created an isolated Python environment, you can install Scrapy with the following command:
$ pip install Scrapy
You can find the full installation guide for Scrapy in the official documentation.
What is a web crawler?
Web crawling and web scraping are related but distinct concepts. Web scraping is the overall process of extracting data from a website. Web crawling is the specific task of automatically navigating through web pages to find the URLs that need to be scraped.
A web crawler begins with a list of URLs to navigate to, known as the seed. As it navigates through each URL, it searches the HTML for links and filters them based on specific criteria. Any new links found are added to a queue for future processing. The extracted HTML or specified information is then passed on to another pipeline for further processing.
When creating a web crawler, keep in mind that not all pages of a website will be visited. How many pages are visited depends on the crawler's budget, its maximum crawl depth, or the time allocated for execution.
Many websites have a robots.txt file that indicates which parts of the website may be crawled and which should be avoided. Additionally, some websites have a sitemap.xml file, which is more explicit than robots.txt: it specifically tells bots which pages should be crawled and provides additional metadata for each URL.
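Python's standard library can already check robots.txt rules for you. Here is a minimal sketch using urllib.robotparser; the user agent string and URLs are just examples:
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
robots = RobotFileParser()
robots.set_url('https://amazon.com/robots.txt')
robots.read()

# Check whether a given URL may be crawled by our (example) user agent
url = 'https://amazon.com/s?k=baby+products'
print(robots.can_fetch('my-crawler/1.0', url))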
Web crawlers are commonly used for a variety of purposes:
- SEO analytics tools collect not only the HTML but also metadata, such as response time and response status, to detect broken pages, and they follow links between different domains to collect backlinks.
- Price monitoring tools crawl e-commerce websites to find product pages and extract metadata, specifically prices. Product pages are then revisited periodically.
- Search engines, like Googlebot, Bingbot, and Yandex Bot, collect all the HTML for a significant portion of the web and use the data to make it searchable.
Later in this article, we will compare two different approaches to building a web crawler in Python. The first approach is using the Requests library for making HTTP requests and BeautifulSoup for parsing HTML content. And the second approach is using a web crawling framework. We will be using Scrapy.
Using Requests and BeautifulSoup libraries
The requests module in Python is a powerful tool for making HTTP requests. To use it for web crawling, you can start by importing the module and making a request to a specific URL. For example:
import requests

url = 'https://amazon.com/s?k=baby+products'
response = requests.get(url)
Once you have the response, you can extract all the links from the HTML content using BeautifulSoup. For example:
import json
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = response.text
links = []
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    path = link.get('href')
    if path and path.startswith('/'):
        path = urljoin(url, path)
    links.append(path)

print(json.dumps(links, sort_keys=True, indent=2))
You can then iterate through the links and request each one, repeating the process until you have visited all the pages you want to crawl. Below is a recursive function that does exactly that:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import logging

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)

url = 'https://amazon.com/s?k=baby+products'
visited = []

def crawl(url):
    logging.info(f'Crawling: {url}')
    visited.append(url)
    html = ''
    try:
        html = requests.get(url).text
    except Exception:
        logging.exception(f'Failed to crawl: {url}')
        return

    soup = BeautifulSoup(html, 'html.parser')
    # here you can extract and store useful data from the page
    for link in soup.find_all('a'):
        path = link.get('href')
        if path and path.startswith('/'):
            path = urljoin(url, path)
        if path not in visited:
            crawl(path)

crawl(url)
The function logs one line for each URL that is visited.
2023-01-16 09:20:51,681 INFO:Crawling: https://amazon.com/s?k=baby+products
2023-01-16 09:20:53,053 INFO:Crawling: https://amazon.com/ref=cs_503_logo
2023-01-16 09:20:54,195 INFO:Crawling: https://amazon.com/ref=cs_503_link
2023-01-16 09:20:55,131 INFO:Crawling: https://amazon.com/dogsofamazon/ref=cs_503_d
2023-01-16 09:20:56,549 INFO:Crawling: https://www.amazon.com/ref=nodl_?nodl_android
2023-01-16 09:20:57,071 INFO:Crawling: https://www.amazon.com/ref=cs_503_logo
2023-01-16 09:20:57,690 INFO:Crawling: https://www.amazon.com/ref=cs_503_link
2023-01-16 09:20:57,943 INFO:Crawling: https://www.amazon.com/dogsofamazon/ref=cs_503_d
2023-01-16 09:20:58,413 INFO:Crawling: https://www.amazon.com.au/ref=nodl_&nodl_android
2023-01-16 09:20:59,555 INFO:Crawling: None
2023-01-16 09:20:59,557 ERROR:Failed to crawl: None
While the code for a basic web crawler may seem simple, there are many challenges that must be overcome in order to successfully crawl an entire website. These include issues such as:
- The download logic lacks a retry mechanism, and the way URLs are tracked (a plain list plus recursion, rather than a proper queue) does not scale to a high number of URLs.
- The crawler does not identify itself and ignores the robots.txt file.
- The crawler is slow and does not support parallelism. Each URL takes about one second to crawl, and the crawler waits for a response before proceeding to the next URL.
- The link extraction logic does not standardize URLs by removing query string parameters, does not handle relative anchor/fragment URLs (such as href="#anchor"), and does not support filtering URLs by domain or filtering out requests to static files (see the sketch after this list).
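As an illustration of the first and last points above, here is a minimal sketch of how the crawl loop could use a queue instead of recursion and normalize links before enqueuing them. The `normalize` helper, the `fetch_links` callback, and the `allowed_domain` value are hypothetical names introduced only for this example:
from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag

allowed_domain = 'amazon.com'  # hypothetical: only follow links on this domain

def normalize(base_url, href):
    # Resolve relative links, drop the #fragment part, and ignore non-HTTP(S) schemes
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ('http', 'https'):
        return None
    if not parsed.netloc.endswith(allowed_domain):
        return None
    return absolute

def crawl_iteratively(seed_url, fetch_links):
    # fetch_links(url) is assumed to return the hrefs found on that page
    queue = deque([seed_url])
    visited = set()
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for href in fetch_links(url):
            normalized = normalize(url, href)
            if normalized and normalized not in visited:
                queue.append(normalized)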
In the next section, we will see how Scrapy addresses these issues and makes it easy to extend the functionality of the web crawler for custom use cases.
How to make a web crawler in Python using the Scrapy framework
Scrapy is a powerful framework for creating web crawlers in Python. It provides a built-in way to follow links and extract information from web pages. You will need to create a new Scrapy project and a spider to define the behavior of your crawler.
Before starting to crawl a website like Amazon, it is important to check the website's robots.txt file to see which URL paths are allowed. Scrapy automatically reads this file and follows it when the ROBOTSTXT_OBEY setting is set to True, which is the default for projects created with the Scrapy `startproject` command.
To create a new Scrapy project you need to run the following command:
$ scrapy startproject amazon_crawler
This command will generate a project with the following structure:
amazon_crawler/
├── scrapy.cfg
└── amazon_crawler
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
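The generated `settings.py` is where crawler-wide options such as ROBOTSTXT_OBEY live. A minimal excerpt might look like the following; the USER_AGENT value is only a hypothetical example of how the crawler could identify itself:
# amazon_crawler/settings.py (excerpt)
BOT_NAME = 'amazon_crawler'

SPIDER_MODULES = ['amazon_crawler.spiders']
NEWSPIDER_MODULE = 'amazon_crawler.spiders'

# Identify the crawler to the websites it visits (hypothetical contact URL)
USER_AGENT = 'amazon_crawler (+https://www.example.com/contact)'

# Obey robots.txt rules (the default for projects created with startproject)
ROBOTSTXT_OBEY = True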
To create a spider use the `genspider` command from Scrapy’s CLI. The command has the following definition:
$ scrapy genspider [options] <name> <domain>
To generate a spider for this crawler we can run:
$ cd amazon_crawler
$ scrapy genspider baby_products amazon.com
It should create a file named `baby_products.py` inside the `spiders` folder, containing generated code similar to this:
import scrapy

class BabyProductsSpider(scrapy.Spider):
    name = 'baby_products'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com']

    def parse(self, response):
        pass
Scrapy also offers a variety of pre-built spider classes, such as CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider. The CrawlSpider class, which is built on top of the base Spider class, includes an extra "rules" attribute to define how to navigate through a website. Each rule utilizes a LinkExtractor to determine which links should be extracted from each page.
For our use case we should inherit our Spider class from CrawlSpider. We will also need to make a LinkExtractor rule that tells the crawler to extract links only from Amazon’s pagination. Remember that our goal was to collect data from all baby products from Amazon, so we don’t actually want to follow all the links we find on the page.
Then we need to add two more methods to our class, `parse_item` and `parse_product`. `parse_item` will be given as a callback function to our LinkExtractor rule, and it will be called with each extracted link. `parse_product` will extract the data from each individual product found on the page.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class BabyProductsSpider(CrawlSpider):
    name = 'baby_products'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/s?k=baby+products']

    rules = (
        Rule(
            LinkExtractor(
                restrict_css='.s-pagination-strip'
            ),
            callback='parse_item',
            follow=True),
    )

    def parse_item(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        products = soup.select('div[data-component-type="s-search-result"]')
        data = []
        for product in products:
            parsed_product = self.parse_product(product)
            if parsed_product != 'error':
                data.append(parsed_product)
        return {
            'url': response.url,
            'data': data
        }

    def parse_product(self, product):
        try:
            link = product.select_one('a.a-text-normal')
            price = product.select_one('span.a-price > span.a-offscreen').text
            return {
                'product_url': link['href'],
                'name': link.text,
                'price': price
            }
        except Exception:
            return 'error'
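As a side note, Scrapy ships with its own CSS selectors, so the same extraction could be done without BeautifulSoup. The following is only a sketch under the same selector assumptions as above, and it yields one item per product instead of returning a single dictionary:
# Hypothetical alternative to parse_item using Scrapy's built-in selectors
def parse_item(self, response):
    for product in response.css('div[data-component-type="s-search-result"]'):
        link = product.css('a.a-text-normal')
        yield {
            'product_url': link.attrib.get('href'),
            'name': ' '.join(link.css('::text').getall()).strip(),
            'price': product.css('span.a-price > span.a-offscreen::text').get(),
        }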
To start the crawler you can run:
$ scrapy crawl baby_products
You will see lots of logs in the console (you can specify a log file with `--logfile [log_file_name]`).
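If you also want to save the scraped items, Scrapy's feed exports can write them to a file; for example (the output file name is just an example):
$ scrapy crawl baby_products -o baby_products.json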
I used Amazon Search as an example to demonstrate the basics of creating a web crawler in Python. However, the crawler does not find many links to follow and is not tailored to a specific use case for the data. If you are looking to extract specific data from Amazon Search, you can consider using our Amazon Product Data API. We created custom parsers for the Amazon Search, Product, and Category pages, and the API returns data in JSON format, ready to be used in your application.
Why it is better to use a professional scraping service than to build your own crawler
While web crawling can be a useful tool for extracting data from the internet, it can also be time-consuming and complex to set up. Additionally, web scraping can be against the terms of service of some websites and can result in your IP being blocked or even legal action being taken against you.
On the other hand, professional scraping services use advanced techniques and technologies to bypass anti-scraping measures and extract data without being detected. They also handle the maintenance and scaling of the scraping infrastructure, allowing you to focus on analyzing and using the data. Finally, they provide a higher level of data accuracy and completeness, as they can handle more advanced data extraction use cases and large-scale scraping jobs.
Summary
In conclusion, web crawling is a useful tool for extracting data from the internet, but as we have seen, building and running your own crawler can be time-consuming and complex, and it can run afoul of a website's terms of service, leading to IP blocking or even legal action. For more advanced and large-scale scraping jobs, it is therefore better to use a professional scraping service.
If you're looking for an alternative to crawling on your own, consider using WebScrapingAPI. WebScrapingAPI is a professional web scraping service that allows you to easily extract data from websites without the need to build and maintain your own web scraper.
It's a fast, reliable, and cost-effective solution that is suitable for businesses of all sizes.
Why don't you give it a try today? We offer a 14-day free trial, with no card required.