Top 3 Python HTTP Clients for Web Scraping
Mihnea-Octavian Manolache on Dec 02 2022
When it comes to web scraping, the variety of Python HTTP clients available makes Python one of the most popular choices for the job. But what exactly are these HTTP clients and how can you use them to build a web scraper? Well, in today’s article we will discuss exactly this topic. By the end of this article, you should have a solid understanding of:
- What is an HTTP client in general
- What are the best Python HTTP clients in 2022
- Why Python makes a great choice for web scraping
- How to actually create a web scraper using HTTP clients
What are Python HTTP Clients and How to Use Them
In order to gain a deeper understanding of how the internet communicates, one should get familiar with the Hypertext Transfer Protocol (HTTP). However, our main focus for today is Python HTTP clients, so I will assume that you are already familiar with HTTP.
Generally speaking, an HTTP client refers to an instance or a program that facilitates communication with a server. For example, a web browser can be considered an HTTP client. However, as programmers, we rarely use an actual browser when building an application, except when we’re working on a web scraper or when we’re doing our research.
This being said, when we refer to HTTP clients in a more programmatic manner, we usually mean a method or an instance of a class used to execute HTTP requests. Since Python is undoubtedly one of the most popular programming languages (and also my personal favorite), today we’ll discuss the best Python HTTP clients and also how to implement them in a real project.
Understanding the HTTP Protocol
Before moving forward, even though I recommend checking out the HTTP documentation, let me quickly go over some of the basic HTTP concepts. First of all, HTTP is arguably one of the most widely used internet protocols. We use it every day to exchange information between clients and servers.
To make this possible, HTTP uses request methods. These methods indicate the action a client wants to perform on a server. For example, if you want to obtain some information from a server, you would use GET. If you want to send something to the server, you would use POST. Here is a list of the most common HTTP request methods:
- GET - retrieve data from the server
- HEAD - retrieve only the headers, without the body (the actual data)
- POST - send some information over to the server
- PUT - send information to the server and replace all current representations of the resource
- PATCH - send information to the server and partially modify the resource
- DELETE - delete the resource from the server
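To make this a little more concrete, here is a quick, illustrative sketch of how these methods map to helper functions in the requests library, which we will cover in detail later (httpbin.org is just a public test server used for demonstration):

import requests

# Each HTTP method has a matching helper in requests
requests.get("https://httpbin.org/get")                          # retrieve data
requests.head("https://httpbin.org/get")                         # headers only
requests.post("https://httpbin.org/post", data={"foo": "bar"})   # send data
requests.put("https://httpbin.org/put", data={"foo": "bar"})     # replace a resource
requests.patch("https://httpbin.org/patch", data={"foo": "baz"}) # partially modify a resource
requests.delete("https://httpbin.org/delete")                    # delete a resource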
Why Python for HTTP Requests
First of all, Python has a great syntax and an even greater community. Hence, it is perfect for learning. I myself chose Python when I first started programming. As a matter of fact, Python HTTP clients were among the first technologies I came across. But that is another topic.
My goal for today’s article is to make sure you leave not only with a basic theoretical understanding, but also with an overview of the practical implementation.
And Python is great for both, for a number of reasons. Just to list a few:
- Syntax - Writing Python is a lot like writing English. So reading a Python script will help you relate theoretical concepts with their actual implementation.
- Support - Python has a very large community. Most of the time, if you’re stuck, a simple question on StackOverflow will reveal the answer to your problem.
- Availability - Python’s package ecosystem is among the most extensive. When it comes to HTTP clients alone, there are over a dozen packages. But we’ll focus on the most popular ones for today.
3(+1) Best Python HTTP Clients
Ranking packages to come up with the top 3 Python HTTP clients is a matter of both functionality and personal preference. So it is fairer to say that the following represents my top 3 HTTP client libraries for Python, rather than a general ranking.
1. Requests - Powerful Simplicity
Requests is probably one of the most preferred HTTP clients in the Python community, and I am no exception. Whenever I test a new web scraper, I use Python with Requests. It is as easy as calling `.get()` and as powerful as an actual web browser.
Among other things, the requests library offers:
- SSL verification
- Proxy support for HTTPS
- Cookie persistence and sessions
- Keep alive feature
- Custom authentication
And these are just a few. You can check out the full list of features here. Now let me show you how to work with requests:
import requests
r = requests.get("http://google.com")
print(r.text)
As you can see, with only 3 lines of code, the requests library helps us collect the raw HTML from a server. In the example above, we’re making a GET request to the server and printing the result. But like I said, this library can do much more. Let’s build a more complex example that uses features such as proxies and POST requests:
import requests

def get_params(object):
    params = ''
    for key, value in object.items():
        if list(object).index(key) < len(object) - 1:
            params += f"{key}={value}."
        else:
            params += f"{key}={value}"
    return params

API_KEY = '<YOUR_API_KEY>'
TARGET_URL = 'https://httpbin.org/post'
DATA = {"foo":"bar"}

PARAMETERS = {
    "proxy_type":"datacenter",
    "device":"desktop"
}

PROXY = {
    "http": f"http://webscrapingapi.{ get_params(PARAMETERS) }:{ API_KEY }@proxy.webscrapingapi.com:80",
    "https": f"https://webscrapingapi.{ get_params(PARAMETERS) }:{ API_KEY }@proxy.webscrapingapi.com:8000"
}

response = requests.post(
    url=TARGET_URL,
    data=DATA,
    proxies=PROXY,
    verify=False
)

print(response.text)
Let’s see what we are doing here:
- We are defining the `get_params` function, which takes a dictionary and joins its entries into a URL-style parameter string.
- We are defining our variables:
  - `API_KEY` - the API key used to authenticate with the proxy
  - `TARGET_URL` - the URL we’re sending the request to
  - `DATA` - the payload of our POST request
  - `PARAMETERS` - the options we’re passing to the proxy
  - `PROXY` - the proxy configuration, built from the parameters and the API key
- We’re using the `post` method from Requests to send an HTTP POST request.
- We’re printing the response body
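Speaking of the features listed earlier, cookie persistence and custom authentication deserve a quick illustration as well. Here is a minimal sketch, using placeholder credentials and httpbin.org as a test server (none of this is tied to the proxy example above):

import requests
from requests.auth import HTTPBasicAuth

# A Session object persists cookies and connection settings across requests
session = requests.Session()
session.auth = HTTPBasicAuth('user', 'passwd')  # placeholder credentials

# httpbin echoes the cookies it receives, so we can verify persistence
session.get('https://httpbin.org/cookies/set/session_id/12345')
response = session.get('https://httpbin.org/cookies')

print(response.text)  # should contain the session_id cookie set above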
2. HTTPX - Requests Reinvented
HTTPX is relatively new to the scene. However, in a very short time, it became one of the most recommended Python HTTP clients. For example, Flask (one of the biggest web frameworks for Python) recommends using HTTPX in its official documentation.
When I said HTTPX is requests reinvented, it was because the two libraries are very similar in syntax. In fact, HTTPX aims for full compatibility with requests. There are only a few minor design differences between the two, which are highlighted here.
Here is what a basic POST request looks like in HTTPX:
import httpx
TARGET_URL = 'https://httpbin.org/post'
DATA = {"foo":"bar"}
r = httpx.post(
    url=TARGET_URL,
    data=DATA,
)
print(r.text)
As you can see, compared to the Requests example, it’s mainly the package name that changes. And since they are so similar, the question remains: why choose HTTPX over Requests? Well, for one, HTTPX is one of the few Python HTTP clients that offer asynchronous support. To sum everything up, HTTPX is a great choice if you want to refactor your Requests-based code.
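As a quick, non-exhaustive sketch of that asynchronous support, here is what the same POST request might look like with `httpx.AsyncClient` (still targeting httpbin.org):

import asyncio
import httpx

TARGET_URL = 'https://httpbin.org/post'
DATA = {"foo": "bar"}

async def main():
    # AsyncClient plays the same role as the top-level httpx functions,
    # but lets us await requests and run several of them concurrently
    async with httpx.AsyncClient() as client:
        r = await client.post(TARGET_URL, data=DATA)
        print(r.text)

asyncio.run(main())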
3. urllib3 - Thread Safe Connections
Python has a couple of ‘urllibs’, which usually confuses new programmers. The main difference between urllib, urllib2 and urllib3 lies in each package’s feature set. urllib was Python’s original HTTP client, included in the Python 1.2 standard library. urllib2 was the upgraded version, introduced in Python 1.6 and intended to replace the original urllib.
urllib3, however, is actually a third-party Python HTTP client. Despite its name, this library is unrelated to the two ‘predecessors’. Moreover, word in the Python community is that there is no intention of including urllib3 in the standard library, at least not in the near future.
Even though this package is not part of the Python standard library, many developers use it because it offers:
- Thread safety
- Client side SSL / TLS verification
- Proxy support for HTTP and SOCKS
- Complete test coverage
Now that we’ve covered the theoretical part, let us check the implementation example:
import urllib3, json
TARGET_URL = 'https://httpbin.org/post'
DATA = {"foo":"bar"}
http = urllib3.PoolManager()
encoded_data = json.dumps(DATA)
r = http.request('POST', TARGET_URL, body=encoded_data)
print(r.data.decode('utf-8'))
Let’s discuss the differences identified in urllib3, as opposed to Requests:
- `http` - an instance of the `PoolManager` class, which handles the details of thread safety and connection pooling
- `encoded_data` - a converted JSON string, holding the payload we’re sending
- `r` - the actual POST request we’re making with the help of urllib3. Here, we’re using the `request` method of the `PoolManager` instance.
And then in the end, we have to decode the data we’re receiving back from our request. As you can see, there are a couple of things that we’re doing differently than with Requests.
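Since the feature list above also mentions proxy support, here is a minimal sketch of how that might look with urllib3’s `ProxyManager` (the proxy address is a placeholder, not a real endpoint):

import urllib3

# ProxyManager behaves like PoolManager, but routes requests through a proxy
proxy = urllib3.ProxyManager('http://proxy.example.com:8080')  # placeholder proxy

r = proxy.request('GET', 'https://httpbin.org/ip')
print(r.data.decode('utf-8'))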
Honorable Mention: http.client - Traditional Python HTTP Client
http.client is also part of the standard Python library. Traditionally, it is not used directly by programmers. For example, urllib actually uses it as a dependency in order to handle HTTP and HTTPS requests. I’ve included it in our ranking because I think that, as programmers, it’s good to know the ‘bones’ of the packages we’re using.
So even though you might not build an actual project with http.client, here is an implementation example that I’m sure will help you better understand how Python HTTP clients work:
import http.client

TARGET_URL = 'www.httpbin.org'

# Using a dedicated variable name so we don't shadow the http module
conn = http.client.HTTPSConnection(TARGET_URL)
conn.request("GET", "/get")

r = conn.getresponse()
print(r.read().decode('utf-8'))
The `HTTPSConnection` constructor takes a couple of parameters, which you can check here. The `request` method is where we define the HTTP `method` and the `url` (or, more accurately, the endpoint). Also, similarly to urllib3, http.client returns an encoded response, so we need to decode it before printing.
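To show a couple of those extra parameters in action, here is a hedged POST sketch that passes a body and custom headers to `request` (again using httpbin.org purely for demonstration):

import http.client
import json

conn = http.client.HTTPSConnection('www.httpbin.org')

# request() also accepts a body and a headers dictionary
payload = json.dumps({"foo": "bar"})
headers = {"Content-Type": "application/json"}

conn.request("POST", "/post", body=payload, headers=headers)
r = conn.getresponse()
print(r.read().decode('utf-8'))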
Use Case: Building a Scraper With Requests
Now that we know how to use HTTP clients, let’s assign ourselves a small project. It will be helpful not only to apply what you’ve learned, but also to add some value to your own programming portfolio.
Since Python HTTP clients are commonly used to collect information from servers, the most common use of these technologies is to create a web scraper. So moving forward, we will focus on how to create a web scraper using HTTP clients in Python. Because I have a personal favorite - requests - I will be using it for this project. However, you can use it as a starting point and even tweak it to use some of the other technologies we discussed. Without further ado, let’s start coding:
1. Project Set Up
Let’s start by creating a new directory in which we will hold the files for our web scraper. Now open up a new terminal window and `cd` into that directory. Here, we want to initialize a new virtual environment. If you’re on a UNIX-like OS, you can use:
~ » python3 -m venv env && source env/bin/activate
Now simply create a new Python file which will hold our logic and open it in your desired IDE. If you want to use the terminal, simply paste the following command:
~ » touch scraper.py && code .
2. Installing Dependencies
We will use pip to install the packages we need for this project. For now, we have established that we’re going to use Requests, but that is not enough for a web scraper. A web scraper also implies handling the data, which means we need to parse the HTML collected from the servers. Luckily, Python’s package ecosystem offers a great variety of options. For this project though, we will use BeautifulSoup. To install the packages, simply paste the following command:
~ » python3 -m pip install requests beautifulsoup4
3. Writing The Logic
We will split our code in two sections: one for data extraction and one for data manipulation. The first part is covered by the Requests package, while the second part is covered by BeautifulSoup. Without further ado, let’s jump into coding, starting with the extraction part:
import requests

def scrape( url = None ):
    # If there is no URL, there is no need to use Python HTTP clients
    # We will print a message and stop execution
    if url == None:
        print('[!] Please add a target!')
        return

    response = requests.get( url )

    return response
In this section, we are defining a function with only one parameter: the targeted URL. If the URL is not provided, we print a message and stop the execution. Otherwise, we use Requests’ get method and return the response. Now, we know that Python HTTP clients cover more methods, so let’s add a conditional parameter:
import requests

def scrape( method = 'get', url = None, data = None ):
    # If there is no URL, there is no need to use Python HTTP clients
    # We will print a message and stop execution
    if url == None:
        print('[!] Please add a target!')
        return

    if method.lower() == 'get':
        response = requests.get( url )
    elif method.lower() == 'post':
        if data == None:
            print('[!] Please add a payload to your POST request!')
            return
        response = requests.post( url, data )

    return response
As you can see, we added a couple more parameters to our function. The `method` parameter specifies which method should be used for our request. The `data` represents the payload we are sending with the POST request. By default, the method is GET, hence the `method` parameter is not required.
Challenge: Add more methods to this function and enrich our scraper’s capabilities. Not only is it fun, but it’s also a good learning approach. Plus, you get to make the code your own so you can add it to your portfolio.
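As a starting point for that challenge, here is one possible (and by no means definitive) way to add PUT and DELETE support, sketched under the same conventions as the function above:

import requests

def scrape( method = 'get', url = None, data = None ):
    # Same guard clause as before: no URL, no request
    if url == None:
        print('[!] Please add a target!')
        return

    method = method.lower()

    if method == 'get':
        response = requests.get( url )
    elif method in ('post', 'put'):
        # Both POST and PUT carry a payload
        if data == None:
            print(f'[!] Please add a payload to your {method.upper()} request!')
            return
        response = requests.post( url, data ) if method == 'post' else requests.put( url, data )
    elif method == 'delete':
        response = requests.delete( url )
    else:
        print(f'[!] Method {method.upper()} is not supported yet!')
        return

    return response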
So far we’ve covered the data extraction. Let’s parse the HTML and do something with it:
from bs4 import BeautifulSoup

def extract_elements(data = None, el = None):
    if data == None:
        print('[!] Please add some data!')
        return

    if el == None:
        print('[!] Please specify which elements you are targeting!')
        return

    soup = BeautifulSoup(data.text, 'html.parser')
    elements = soup.find_all(el)

    return elements
But a web scraper should be able to extract more specific data. For example, it should be able to locate and return elements based on their attributes, such as a class or an id. So let’s add the logic that handles this part:
from bs4 import BeautifulSoup

def extract_elements(data = None, el = None, attr = None, attr_value = None):
    if data == None:
        print('[!] Please add some data!')
        return

    if el == None:
        print('[!] Please specify which elements you are targeting!')
        return

    soup = BeautifulSoup(data.text, 'html.parser')
    elements = soup.find_all(el, { attr : attr_value })

    return elements
BeautifulSoup allows us to extract specific data based on their attributes. So here we’ve added two new parameters that will help us locate and extract elements based on their attributes.
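For instance, assuming `data` holds a response returned by our scrape function (we’ll wire that up in a moment) and the target page has div elements with a hypothetical product-card class, a call might look like this:

cards = extract_elements(data, 'div', 'class', 'product-card')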
We now have everything we need. All that is left to do is to combine the two sections and we have our web scraper. Once you assemble your code, simply:
- Create a new variable that will hold the data extracted with Requests
- Print the elements returned by BeautifulSoup
Here are the two missing pieces of your code:
data = scrape('GET', 'https://webscrapingapi.com')
print( extract_elements(data, 'ul') )
I am sure you already figured out what everything does, so there is no need for a walkthrough at this point. Just as with the scrape function, I challenge you to play around with `extract_elements` and make it do more than simply return elements.
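To get you started on that second challenge, here is a small, hedged variation that returns only the text content of the matched elements instead of the full tags:

from bs4 import BeautifulSoup

def extract_text(data = None, el = None, attr = None, attr_value = None):
    # Same guards as extract_elements
    if data == None:
        print('[!] Please add some data!')
        return

    if el == None:
        print('[!] Please specify which elements you are targeting!')
        return

    soup = BeautifulSoup(data.text, 'html.parser')
    elements = soup.find_all(el, { attr : attr_value })

    # get_text() strips the tags and returns only the readable content
    return [element.get_text(strip=True) for element in elements]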
Conclusion
When learning a new programming concept, I think it’s best to experiment with the different technologies available. When it comes to actually building the infrastructure for a wider project though, it is better to learn about the strengths and weaknesses of each technology before choosing one.
I hope this article helped you either way and that you now have a solid understanding of how Python HTTP clients work. I also encourage you to play around, as I am sure you will discover the right package for you.