Web scraping in R
Learn how to extract data directly from a web page, referred to as web scraping, in R through a real-life example
--
This post has been written in collaboration with Pietro Zanotta.
Introduction
Almost everyone is familiar with web pages (otherwise you would not be here), but what if we told you that how you see a site differs from how Google or your browser sees it?
In fact, when you type a site address into your browser, the browser downloads and renders the page for you. To render the page, however, it needs some instructions.
There are three types of instructions:
- HTML: describes a web page’s infrastructure;
- CSS: defines the appearance of a site;
- JavaScript: decides the behavior of the page.
Web scraping is the art of extracting information from the HTML, CSS and JavaScript lines of code. The term usually refers to an automated process, which is less error-prone and faster than gathering data by hand.
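To give a first taste of what this looks like in practice, here is a minimal sketch using the rvest package (assumed installed; the toy HTML document below is illustrative, not from a real site):

```r
# Minimal web scraping sketch with rvest (illustrative toy HTML)
library(rvest)

# read_html() also accepts a URL,
# e.g. read_html("https://en.wikipedia.org/wiki/Formula_One")
page <- read_html("<html><body><h1>Formula 1 drivers</h1></body></html>")

# Select the <h1> element with a CSS selector and extract its text
title <- page |> html_element("h1") |> html_text2()
title
# "Formula 1 drivers"
```

The same two steps, reading the page and selecting elements by CSS selector, underlie the Wikipedia example later in this article.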
It is important to note that web scraping can raise ethical concerns, as it involves accessing and using data from websites without the explicit permission of the website owner. It is a good practice to respect the terms of use for a website, and to seek written permission before scraping large amounts of data.
This article aims to cover the basics of how to do web scraping in R. We will conclude by creating a database on Formula 1 drivers from Wikipedia.
Note that this article is not intended to be exhaustive on the topic. To learn more, see the resources at the end of this article.
HTML and CSS
Before starting, it is important to have a basic knowledge of HTML and CSS. This section briefly explains how HTML and CSS work; to learn more, see the resources at the bottom of this article.
Feel free to skip this section if you are already familiar with these topics.
Starting with HTML, an HTML file looks like the following piece of code.
<!DOCTYPE html>
<html lang="en">
<body>
<h1…