So, we have an HTML document, but what we want is data, which means we need to parse the previous response and pull out the pieces of information we care about.
Starting with baby steps, let’s extract the title of the website. A remarkable fact about Ruby is that everything is an object with very few exceptions, meaning that even a simple string can have attributes and methods.
Therefore, we can simply read the website’s title by calling the title method on the parsed_page object.
puts parsed_page.title
Moving forward, let’s extract all the links from the website. For this, we will use a more general method, css, which returns every element that matches a CSS selector.
links = parsed_page.css('a')
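hrefs = links.map { |element| element["href"] }.compact
puts hrefs
The map method turns each <a> element into the value of its href attribute, and compact drops the nil entries left by links that have no href. Note that map returns a new array rather than modifying links, so we store the result in hrefs and print that.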
Let’s take a more realistic example. We need to extract the articles from the blog, along with each one’s title, address, and meta description.
If you inspect one of the article cards, you can see that we can get the address and the article’s title through the link’s attributes. Also, the meta description is under a <div> tag with a specific class name.
Of course, there are many ways to perform this search. The one we’ll use looks for all the <div> tags with the td_module_10 class name and then iterates through each of them to extract the <a> tags and the inner <div> tags with the td-excerpt class name.
article_cards = parsed_page.xpath("//div[contains(@class, 'td_module_10')]")
article_cards.each do |card|
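  # Attribute nodes (@title, @href) yield their value through text
  title = card.xpath("div[@class='td-module-thumb']/a/@title").text
  link = card.xpath("div[@class='td-module-thumb']/a/@href").text
  meta = card.xpath("div[@class='item-details']/div[@class='td-excerpt']").text.strip
  puts "#{title} (#{link}): #{meta}"
end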
Yes, as you may have already guessed, XPath expressions do the trick here, because we are locating HTML elements by their class names and by their position under their ancestors.