Now that we have our environment set up and have a basic understanding of HTML tables, we can start building a web scraper to extract data from an HTML table. In this section, we will walk through the steps of building a simple scraper that can extract data from a table and store it in a structured format.
The first step is to use the requests library to send an HTTP request to the webpage that contains the HTML table that we want to scrape.
You can install it with pip, like any other Python package:
$ pip install requests
This library allows us to retrieve the HTML content of a web page as a string:
import requests
url = 'https://www.w3schools.com/html/html_tables.asp'
html = requests.get(url).text
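The one-liner above returns whatever body the server sends, including the body of an error page, and without a timeout the call can wait indefinitely for a response. A slightly more defensive sketch might look like the following; the helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    # timeout (in seconds) keeps the call from blocking forever
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above would then yield the same HTML string, or raise a `requests.exceptions.RequestException` subclass if the request fails.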
Next, we will use the BeautifulSoup library to parse the HTML content and extract the data from the table. BeautifulSoup provides a variety of methods and attributes that make it easy to navigate and extract data from an HTML document. Here is an example of how to use it to find the table element and extract the data from the cells:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Find the first table element on the page
table = soup.find('table')
# Extract the data from the cells
data = []
for row in table.find_all('tr'):
    cols = row.find_all('td')
    # Header rows contain <th> cells instead of <td> cells
    if len(cols) == 0:
        cols = row.find_all('th')
    cols = [ele.text.strip() for ele in cols]
    # Drop empty values (note: this shifts columns in rows that have blank cells)
    data.append([ele for ele in cols if ele])
print(data)
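The loop above discards empty cells, which works for this page but would shift the remaining values to the left in any row that has a blank cell. If column alignment matters, one possible variation is to keep empty strings in place. The sketch below runs on a small inline table rather than the live page; the sample markup and the helper name `table_to_rows` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second data row has an empty middle cell
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Acme</td><td>Jane Doe</td><td>Germany</td></tr>
  <tr><td>Globex</td><td></td><td>Mexico</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of cell strings per <tr>, keeping empty cells."""
    rows = []
    for tr in table.find_all("tr"):
        # Match header and data cells in a single pass, in document order
        cells = tr.find_all(["th", "td"])
        if cells:  # skip rows that contain no cells at all
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
print(rows)
```

Every row keeps three entries here, so the header row still lines up with the data rows when the result is loaded into a table structure.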
The data list now holds one list per table row, starting with the header row. To make it easier to read, we can load it into a pandas DataFrame:
import pandas as pd
# The first row holds the headers; pop it off the list so that
# pandas uses it for column names instead of treating it as data
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
print(df)
Once you have extracted the data from the table, you can use it for a variety of purposes, such as data analysis, machine learning, or storing it in a database. You can also modify the code to scrape multiple tables from the same web page or from multiple web pages.
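As one illustration of the database option, the sketch below writes scraped rows into SQLite using Python's built-in sqlite3 module. The rows are made-up sample values shaped like the scraper's output (header row first), and the table name `companies` is an assumption chosen for the example.

```python
import sqlite3

# Hypothetical rows shaped like the scraper's output: header first, then data
data = [
    ["Company", "Contact", "Country"],
    ["Acme", "Jane Doe", "Germany"],
    ["Globex", "John Roe", "Mexico"],
]
headers, rows = data[0], data[1:]

# ":memory:" keeps the example self-contained; pass a file path to persist
conn = sqlite3.connect(":memory:")
# Quote column names because they come from the page, not from our code
columns = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in headers)
conn.execute("CREATE TABLE companies ({})".format(columns))
# Use placeholders so cell values are passed as parameters, not spliced into SQL
placeholders = ", ".join("?" for _ in headers)
conn.executemany("INSERT INTO companies VALUES ({})".format(placeholders), rows)
conn.commit()

stored = conn.execute("SELECT * FROM companies ORDER BY rowid").fetchall()
print(stored)
```

This assumes every data row has the same number of cells as the header row; rows of a different length would make the insert fail.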
Please keep in mind that not every website is this easy to scrape. Many sites use protection measures designed to prevent scraping, such as CAPTCHAs and IP address blocking. Third-party services such as WebScrapingAPI offer IP rotation and CAPTCHA bypass, which can help you scrape those targets.
I hope this section gave you a helpful overview of scraping data from an HTML table with Python. In the next section, we will discuss ways to improve this process, along with web scraping best practices.