Parsel: How to Extract Text From HTML in Python
Mihai Maxim on Jan 31 2023
Introduction
Web scraping is the automated process of collecting data from websites by using a script or program. It's used to extract information such as text, images, and other types of data that can be useful for different purposes like research, data analysis, or market analysis.
Nowadays, there are a ton of solutions when it comes to web scraping with Python. Selenium and Scrapy are some of the most widely used and popular libraries. While these tools are great for complicated scraping tasks, they can be a bit overwhelming for casual use.
Enter Parsel, the little scraping library. This lightweight and easy-to-learn library is perfect for small projects and is great for those who are new to web scraping. It is able to parse HTML and extract data using CSS and XPath selectors, making it a great tool for any data lover looking for a fast and easy way to collect information from the web.
Buckle up and get ready to learn how to use this library as you join me on this adventure of automated data collection. Let's get scraping!
Getting Started With Parsel
You can install the Parsel library with:
pip install parsel
Now let’s dive straight into an example project and scrape all the country data from this simple website: https://www.scrapethissite.com/pages/simple/.
To get the HTML from the website, you will need to make an HTTP GET request.
We will be making HTTP requests with the “requests” Python library, so make sure you install it with:
pip install requests
Now you can fetch the HTML and write it to a file:
import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")

with open("out.html", "w", encoding="utf-8") as f:
    f.write(response.text)
Open the file and examine the structure. Our data is stored in blocks similar to this:
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
Afghanistan
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Kabul</span><br>
<strong>Population:</strong> <span class="country-population">29121286</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br>
</div>
</div><!--.col-->
In order to write selectors, you’ll need to pass the raw HTML to Parsel:
import parsel
import requests
response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)
Now we’re ready to write some selectors.
Extracting Text Using CSS Selectors
You can print the first country capital with:
parsel_dom = parsel.Selector(text=raw_html)
first_capital = parsel_dom.css(".country-capital::text").get()
print(first_capital)
# Output
Andorra la Vella
parsel_dom.css(".country-capital::text").get() will select the inner text of the first element that has the country-capital class.
You can print all the country names with:
countries_names = filter(lambda line: line.strip() != "", parsel_dom.css(".country-name::text").getall())

for country_name in countries_names:
    print(country_name.strip())
# Output
Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
. . .
parsel_dom.css(".country-name::text").getall() will select the inner texts of all the elements that have the "country-name" class.
Notice that we had to clean up the output a bit. That's because every element with the "country-name" class also has an <i> tag nested inside it, and the country name itself is surrounded by leading and trailing whitespace.
<h3 class="country-name">
    <i class="flag-icon flag-icon-ae"></i> <!-- this is picked up as an empty string -->
    United Arab Emirates <!-- this is picked up as "  United Arab Emirates  " -->
</h3>
Now let’s write a script to extract all the data with CSS selectors:
import parsel
import requests
response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)
countries = parsel_dom.css(".country")
countries_data = []
for country in countries:
    country_name = country.css(".country-name::text").getall()[1].strip()
    country_capital = country.css(".country-capital::text").get()
    country_population = country.css(".country-population::text").get()
    country_area = country.css(".country-area::text").get()
    countries_data.append({
        "name": country_name,
        "capital": country_capital,
        "population": country_population,
        "area": country_area
    })

for country_data in countries_data:
    print(country_data)
# Output
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
...
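Keep in mind that everything Parsel extracts is a string. If you plan to sort, filter, or aggregate the results, it's worth converting the numeric fields first; a short sketch on dictionaries shaped like the output above:

```python
# Convert numeric fields; the dictionaries mirror those produced by the script above
countries_data = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'},
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'},
]

for country in countries_data:
    country["population"] = int(country["population"])
    country["area"] = float(country["area"])

# Now numeric comparisons work as expected
print(max(countries_data, key=lambda c: c["population"])["name"])  # Afghanistan
```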
Extracting Text Using XPath Selectors
XPath is a query language for selecting nodes from an XML document. It stands for XML Path Language, and it uses a path notation similar to that of URLs to navigate through the elements and attributes of an XML document. XPath expressions can be used to select a single element, a set of elements, or a specific attribute of an element. XPath is primarily used in XSLT, but it can also be used to navigate through the Document Object Model (DOM) of any XML-like language document, such as HTML or SVG.
XPath can seem intimidating at first, but it is actually quite easy to get started with once you understand the basic concepts and syntax. One resource that can come in handy is our XPath selectors guide at https://www.webscrapingapi.com/the-ultimate-xpath-cheat-sheet.
Now let’s try some selectors:
Here is how you can print the first capital:
parsel_dom = parsel.Selector(text=raw_html)
first_capital = parsel_dom.xpath('//*[@class="country-capital"]/text()').get()
print(first_capital)
# Output
Andorra la Vella
And all the country names:
countries_names = filter(lambda line: line.strip() != "",
                         parsel_dom.xpath('//*[@class="country-name"]//text()').getall())

for country_name in countries_names:
    print(country_name.strip())

# Output
Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
...
Let’s reimplement the script with XPath selectors:
import parsel
import requests
response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)
countries = parsel_dom.xpath('//div[contains(@class,"country")][not(contains(@class,"country-"))]')
countries_data = []
for country in countries:
    country_name = country.xpath(".//h3/text()").getall()[1].strip()
    spans = country.xpath(".//span/text()").getall()
    country_capital = spans[0]
    country_population = spans[1]
    country_area = spans[2]
    countries_data.append({
        "name": country_name,
        "capital": country_capital,
        "population": country_population,
        "area": country_area
    })

for country_data in countries_data:
    print(country_data)
# Output
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
...
Removing elements
Removing elements is simple. Just call the drop() method on a selector:
selector.css(".my_class").drop()
Let’s showcase this functionality by writing a script that removes the “population” field from each country:
import parsel
import requests
response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)
countries = parsel_dom.css(".country")
for country in countries:
    country.css(".country-population").drop()
    country.xpath(".//strong")[1].drop()
    country.xpath(".//br")[1].drop()

countries_without_population_html = parsel_dom.get()
with open("out.html", "w", encoding="utf-8") as f:
    f.write(countries_without_population_html)
Exporting the data
When you've finished scraping the data, it's important to think about how you want to save it. Two common formats for storing this kind of data are .json and .csv. Choose the one that best fits your project's needs.
Exporting the data to .json
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is often used for exchanging data between a web application and a server, or between different parts of a web application. JSON is similar to a Python dictionary in that it stores data as key-value pairs and can represent the same kinds of nested structures.
Exporting a list of Python dictionaries to .json can be done with the built-in json library:
import json
countries_dictionaries = [
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'},
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
]
json_data = json.dumps(countries_dictionaries, indent=4)
with open("data.json", "w") as outfile:
    outfile.write(json_data)
# data.json
[
{
"name": "Andorra",
"capital": "Andorra la Vella",
"population": "84000",
"area": "468.0"
},
{
"name": "United Arab Emirates",
"capital": "Abu Dhabi",
"population": "4975593",
"area": "82880.0"
}
]
Exporting the data to .csv
A CSV is a simple way to store data in a text file, where each line represents a row and each value is separated by a comma. It's often used by spreadsheet and database programs. Python has great built-in support for working with CSV files through its csv module. One of the most powerful features of the csv module is the DictWriter class, which lets you write a Python dictionary to a CSV file in a simple way. The keys of the dictionary are used as the column headers in the CSV file, and the values are written as the corresponding data in the rows.
Here is how you can use the csv module to export a list of Python dictionaries to .csv:
import csv

people_dictionaries = [
    {"name": "John Smith", "age": 35, "city": "New York"},
    {"name": "Jane Doe", "age": 28, "city": "San Francisco"}
]

# newline="" prevents blank lines between rows on Windows
with open("data.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=people_dictionaries[0].keys())
    writer.writeheader()
    for row in people_dictionaries:
        writer.writerow(row)
# data.csv
name,age,city
John Smith,35,New York
Jane Doe,28,San Francisco
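To load the file back later (or to sanity-check the export), csv.DictReader maps the header row onto each data row. A short sketch using an in-memory string instead of a file:

```python
import csv
import io

# The same content data.csv would hold after the export above
csv_text = "name,age,city\nJohn Smith,35,New York\nJane Doe,28,San Francisco\n"

# DictReader turns each data row into a dict keyed by the header row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(rows[0]["city"])  # New York
print(len(rows))        # 2
```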
Wrapping up
In this article, we've explored the use of the Parsel library in Python. We've seen how easy it is to use the CSS and XPath selectors that Parsel provides to extract data from web pages. Overall, Parsel provides an efficient and versatile solution for web scraping. If you're interested in automating data collection, you should definitely give it a try.
Do you want to learn more about web scraping? Check out our product, WebScrapingAPI, and discover how you can bring your data extraction skills to the next level. Our powerful API is specifically designed to help you conquer the most common challenges of web scraping, like avoiding IP bans or rendering JavaScript. And the best part? You can try it for free!