Parsing schema.org metadata means extracting the structured data that web pages embed using the shared schema.org vocabulary. The schema.org community maintains this vocabulary and promotes its use for structured data on the web.
Parsing schema metadata is useful in several situations: tracking up-to-date information on events, gathering data for research studies, or powering websites that aggregate data such as real-estate listings, job postings, and weather forecasts.
Schema.org data can be embedded in a page in three formats: JSON-LD, RDFa, and Microdata.
JSON-LD (JavaScript Object Notation for Linked Data) is a format for encoding linked data using JSON. The design of this standard makes it easy for humans to read and write and for machines to parse and generate.
Here’s how JSON-LD would look for a web page about a book:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Book",
"name": "The Adventures of Tom Sawyer",
"author": "Mark Twain",
"datePublished": "1876-12-01",
"description": "The Adventures of Tom Sawyer is a novel about a young boy growing up along the Mississippi River in the mid-1800s. It is a classic of American literature and has been loved by generations of readers.",
"publisher": "Penguin Books",
"image": "https://www.example.com/images/tom_sawyer.jpg"
}
</script>
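To make the extraction step concrete, here is a minimal Python sketch that pulls JSON-LD blocks out of a page using only the standard library. The names JsonLdExtractor, extract_json_ld, and SAMPLE_HTML are hypothetical and chosen for illustration; the sketch assumes the JSON-LD sits in script tags with type application/ld+json, as in the example above, and simply skips blocks that are not valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects the parsed content of JSON-LD script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: malformed blocks are ignored, not reported


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD block in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Shortened version of the book example, used only as sample input.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Book",
 "name": "The Adventures of Tom Sawyer", "author": "Mark Twain"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for item in extract_json_ld(SAMPLE_HTML):
        print(item["@type"], "-", item["name"])
```

Because JSON-LD lives in its own script tag rather than being woven into the markup, extraction reduces to finding the tag and calling a JSON parser, which is why this format is usually the simplest of the three to parse.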
RDFa (Resource Description Framework in Attributes) is a World Wide Web Consortium (W3C) recommendation for embedding RDF statements in XML and HTML.
The example below shows RDFa inside an HTML page. Notice how tag attributes such as typeof and property carry the extra data.
<!DOCTYPE html>
<html>
<head>
<title>RDFa Example</title>
</head>
<body>
<div about="http://example.com/books/the-great-gatsby" typeof="schema:Book">
<h1 property="schema:name">The Great Gatsby</h1>
<div property="schema:author" typeof="schema:Person">
<span property="schema:name">F. Scott Fitzgerald</span>
</div>
<div property="schema:review" typeof="schema:Review">
<span property="schema:author" typeof="schema:Person">
<span property="schema:name">John Doe</span>
</span>
<span property="schema:reviewBody">
A classic novel that explores themes of wealth, love, and the decline of the American Dream.
</span>
<span property="schema:reviewRating" typeof="schema:Rating">
<span property="schema:ratingValue">4.5</span>
</span>
</div>
</div>
</body>
</html>
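As a rough illustration of how those attributes can be read back, the Python sketch below walks the markup and builds nested dictionaries from typeof, property, and about. It is a deliberately simplified reading, not a conforming RDFa processor: it assumes well-formed markup, keeps prefixed names such as schema:name as plain strings, and lets a repeated property overwrite the earlier value. RdfaLiteParser, extract_rdfa, and SAMPLE_HTML are hypothetical names.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RdfaLiteParser(HTMLParser):
    """Hypothetical helper: nested dicts from typeof/property attributes."""

    def __init__(self):
        super().__init__()
        self.items = []    # top-level typed items
        self._items = []   # stack of items whose element is still open
        self._open = []    # one (kind, payload) entry per open element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop, typeof = a.get("property"), a.get("typeof")
        entry = (None, None)
        if typeof is not None:
            # A typed element starts a new item, nested if it has a property.
            item = {"@type": typeof}
            if "about" in a:
                item["@id"] = a["about"]
            if prop and self._items:
                self._items[-1][prop] = item
            else:
                self.items.append(item)
            if tag not in VOID_TAGS:
                self._items.append(item)
                entry = ("item", item)
        elif prop and self._items:
            if "content" in a:
                # An explicit content attribute replaces the element text.
                self._items[-1][prop] = a["content"]
            elif tag not in VOID_TAGS:
                entry = ("text", (self._items[-1], prop, []))
        if tag not in VOID_TAGS:
            self._open.append(entry)

    def handle_data(self, data):
        # Text belongs to every open element that is capturing a literal.
        for kind, payload in self._open:
            if kind == "text":
                payload[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        kind, payload = self._open.pop()
        if kind == "item":
            self._items.pop()
        elif kind == "text":
            owner, prop, chunks = payload
            owner[prop] = " ".join("".join(chunks).split())


def extract_rdfa(html):
    parser = RdfaLiteParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Shortened version of the book example, used only as sample input.
SAMPLE_HTML = """
<div about="http://example.com/books/the-great-gatsby" typeof="schema:Book">
<h1 property="schema:name">The Great Gatsby</h1>
<div property="schema:author" typeof="schema:Person">
<span property="schema:name">F. Scott Fitzgerald</span>
</div>
</div>
"""

if __name__ == "__main__":
    book = extract_rdfa(SAMPLE_HTML)[0]
    print(book["schema:name"], "by", book["schema:author"]["schema:name"])
```

A full RDFa processor also resolves prefixes to IRIs and handles attributes such as vocab, resource, and href, so a dedicated library is the safer choice for real pages.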
Microdata is a WHATWG HTML specification that is used to nest metadata inside existing content on web pages and can use schema.org or custom vocabularies.
Here is an example of Microdata in HTML:
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">Shiny new gadget</span>
<img itemprop="image" src="shinygadget.jpg" alt="A shiny new gadget" />
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="USD" />$<span itemprop="price">19.99</span>
<link itemprop="availability" href="http://schema.org/InStock" />
</div>
</div>
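The same idea applies to Microdata. The Python sketch below is a simplified reader, under the assumption of well-formed markup: itemscope starts an item, itemprop names a property, and the value comes from content on meta, from href or src on a, link, and img, or from the element text otherwise. MicrodataLiteParser, extract_microdata, and SAMPLE_HTML are hypothetical names, and the sample input uses the schema.org properties offers, price, and priceCurrency.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose property value is a URL attribute rather than text.
URL_ATTRS = {"a": "href", "link": "href", "img": "src"}


class MicrodataLiteParser(HTMLParser):
    """Hypothetical helper: nested dicts from itemscope/itemprop attributes."""

    def __init__(self):
        super().__init__()
        self.items = []    # top-level items
        self._items = []   # stack of items whose element is still open
        self._open = []    # one (kind, payload) entry per open element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("itemprop")
        entry = (None, None)
        if "itemscope" in a:
            item = {"type": a.get("itemtype"), "properties": {}}
            if prop and self._items:
                self._items[-1]["properties"][prop] = item
            else:
                self.items.append(item)
            if tag not in VOID_TAGS:
                self._items.append(item)
                entry = ("item", item)
        elif prop and self._items:
            props = self._items[-1]["properties"]
            if tag == "meta":
                props[prop] = a.get("content")
            elif tag in URL_ATTRS:
                props[prop] = a.get(URL_ATTRS[tag])
            elif tag not in VOID_TAGS:
                entry = ("text", (props, prop, []))
        if tag not in VOID_TAGS:
            self._open.append(entry)

    def handle_data(self, data):
        for kind, payload in self._open:
            if kind == "text":
                payload[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        kind, payload = self._open.pop()
        if kind == "item":
            self._items.pop()
        elif kind == "text":
            props, prop, chunks = payload
            props[prop] = " ".join("".join(chunks).split())


def extract_microdata(html):
    parser = MicrodataLiteParser()
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">Shiny new gadget</span>
<img itemprop="image" src="shinygadget.jpg" alt="A shiny new gadget" />
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="USD" />$<span itemprop="price">19.99</span>
<link itemprop="availability" href="http://schema.org/InStock" />
</div>
</div>
"""

if __name__ == "__main__":
    product = extract_microdata(SAMPLE_HTML)[0]
    offer = product["properties"]["offers"]["properties"]
    print(product["properties"]["name"], offer["price"], offer["priceCurrency"])
```

This sketch leaves out parts of the Microdata model such as itemref, itemid, and repeated properties, which a production parser would need to handle.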
Tools for parsing schema metadata exist in many programming languages. In Python, for example, Zyte's extruct and the RDFLib library make it easy to extract structured data from web pages.