A good understanding of web scraping starts with some knowledge of how the Internet works. Let's go through a brief introduction to the terms you'll need.
HTTP, or HyperText Transfer Protocol, is the foundation of data exchange on the web. As the name suggests, HTTP is a client-server protocol. An HTTP client, such as a web browser, opens a connection to an HTTP server and sends a message along the lines of: "Hey! What's up? Do you mind passing me those images?". The server typically returns a response, such as the page's HTML code, and closes the connection.
Let's say you want to visit Google. If you type the address into the web browser and hit Enter, the HTTP client (the browser) will send a message like the following to the server:
GET / HTTP/1.1
Host: google.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch, br
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
The first line of the message contains the request method (GET), the path we made the request to (in our case just '/', because we only accessed www.google.com), and the version of the HTTP protocol. The lines that follow are headers, like Host or User-Agent.
Let's talk about the most important header fields for the process:
- Host: The domain name of the server the request is addressed to, taken from the address you typed in the web browser.
- User-Agent: Here we can see details about the client that made the request. I use a MacBook, as you can see from the **(Macintosh; Intel Mac OS X 10_11_6)** part, and Chrome as the web browser **(Chrome/56.0.2924.87)**.
- Accept: With this header, the client tells the server which types of responses it can handle, such as application/json or text/plain.
- Referer: This header field contains the address of the page making the request (the name is a historical misspelling of "referrer"). Websites use this header to change their content based on where the visitor comes from.
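To make this concrete, here is a minimal sketch of how such a request can be built in Python using only the standard library. No network call is made; we just construct the request object and inspect the fields discussed above (the header values are illustrative, not the exact ones a real browser would send):

```python
# Build an HTTP request with custom headers and inspect it,
# without actually sending it over the network.
from urllib.request import Request

req = Request(
    "https://www.google.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)",
        "Accept": "text/html,application/xhtml+xml",
        "Referer": "https://www.example.com/",
    },
)

print(req.get_method())          # GET   (the request method)
print(req.host)                  # www.google.com   (the Host header)
print(req.selector)              # /   (the path we request)
print(req.get_header("Accept"))  # text/html,application/xhtml+xml
```

Passing this object to `urllib.request.urlopen` would send the request for real; here we only look at how the pieces of the message map to the header fields above.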
A server response can look like this:
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8
<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>
<body>The content of the document</body>
</html>
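We can reproduce this whole request-response cycle locally with Python's standard library. The sketch below serves a small HTML page (modeled on the one above) from a throwaway local server and fetches it, so we can inspect the status code, headers, and body without hitting a real website:

```python
# Minimal HTTP round trip: serve a page locally, then fetch it.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

PAGE = (b"<!DOCTYPE html><html><head><title>Title of the document</title>"
        b"</head><body>The content of the document</body></html>")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)  # produces the "HTTP/1.1 200 OK" status line
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

with urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    status = resp.status                      # 200
    ctype = resp.headers["Content-Type"]      # text/html; charset=utf-8
    body = resp.read().decode("utf-8")

server.shutdown()
print(status, ctype)
```

The response object exposes exactly the parts we saw in the raw message: the status code, the headers, and the HTML body.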
As you can see, the first line contains the HTTP response code: **200 OK**. This means the request was successful.
Now, if we had sent the request using a web browser, it would have parsed the HTML code, fetched all the other assets (CSS, JavaScript files, images), and rendered the final version of the web page. In the steps below, we are going to automate parts of this process.
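As a first taste of that automation, here is a sketch of the "parse the HTML" step using the standard library's `html.parser`. It extracts the page title and any image URLs from an HTML string (the HTML fed in is a made-up example, not a real response):

```python
# Extract the <title> text and <img> sources from an HTML document.
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            # collect the src attribute of every image tag
            self.images.extend(value for name, value in attrs if name == "src")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = PageParser()
parser.feed('<html><head><title>Title of the document</title></head>'
            '<body><img src="/logo.png">The content of the document</body></html>')
print(parser.title)   # Title of the document
print(parser.images)  # ['/logo.png']
```

A real scraper would feed the body of an HTTP response into such a parser; dedicated libraries like Beautiful Soup make this step far more convenient, but the principle is the same.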