Web scraping in R

Learn how to extract data directly from a web page, referred to as web scraping, in R through a real-life example

Antoine Soetewey

Photo by Florian Olivo

This post has been written in collaboration with Pietro Zanotta.

Introduction

Almost everyone is familiar with web pages (otherwise you would not be here), but what if we told you that the way you see a site differs from the way Google or your browser sees it?

In fact, when you type a site address into your browser, it downloads the page and renders it for you, but to render the page it needs some instructions.

There are 3 types of instructions:

  • HTML: describes the structure and content of a web page;
  • CSS: defines the appearance of a site;
  • JavaScript: defines the behavior of the page.
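To make these three layers concrete, here is a minimal sketch in R, assuming the rvest package is installed. The page is a made-up string (not a real site) and the variable names are ours. It only illustrates that the HTML, CSS and JavaScript instructions sit in different parts of the same source.

```r
library(rvest)

# Made-up page source (an assumption for illustration): one <style> block
# (CSS), one <script> block (JavaScript), and the remaining markup (HTML)
page_source <- '<!DOCTYPE html>
<html lang="en">
<head>
  <title>Example page</title>
  <style>h1 { color: blue; }</style>
  <script>console.log("page loaded");</script>
</head>
<body>
  <h1>Hello</h1>
  <p>A first paragraph.</p>
</body>
</html>'

# Parse the string into a document that R can navigate
page <- read_html(page_source)

# CSS: the raw text inside <style>
css_code <- html_text(html_element(page, "style"))

# JavaScript: the raw text inside <script>
js_code <- html_text(html_element(page, "script"))

# HTML: names of the elements nested directly inside <body>
body_tags <- html_name(html_children(html_element(page, "body")))
```

Printing `css_code`, `js_code` and `body_tags` shows each layer separately. A real page will usually load most of its CSS and JavaScript from external files rather than inline blocks like these.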

Web scraping is the art of extracting information from the HTML, CSS and JavaScript code of a page. The term usually refers to an automated process, which is less error-prone and faster than gathering data by hand.
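As a first taste of what "automated" means here, the following is a minimal sketch, again assuming the rvest package. The table, its class name `drivers` and its values are placeholders invented for illustration; they are not real Formula 1 data and not the Wikipedia page used later in this article.

```r
library(rvest)

# Made-up HTML containing a small table (placeholder names and numbers)
page_source <- '<html><body>
<table class="drivers">
  <tr><th>Driver</th><th>Starts</th></tr>
  <tr><td>Driver A</td><td>10</td></tr>
  <tr><td>Driver B</td><td>25</td></tr>
</table>
</body></html>'

page <- read_html(page_source)

# Select the table with a CSS selector, then convert it to a data frame
drivers <- html_table(html_element(page, "table.drivers"))

drivers
```

The same two steps (select an element, then convert it) would apply to a page downloaded from a URL, which is why scraping scales better than copying values by hand.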

It is important to note that web scraping can raise ethical concerns, as it often involves accessing and using data from websites without the explicit permission of the website owner. It is good practice to respect a website's terms of use and to seek written permission before scraping large amounts of data.

This article aims to cover the basics of how to do web scraping in R. We will conclude by creating a database on Formula 1 drivers from Wikipedia.

Note that this article does not aim to be exhaustive on the topic. To learn more, see this section.

HTML and CSS

Before starting, it is important to have a basic knowledge of HTML and CSS. This section briefly explains how HTML and CSS work; to learn more, see the resources at the bottom of this article.

Feel free to skip this section if you are already familiar with the topic.

Starting with HTML, an HTML file looks like the following piece of code.

<!DOCTYPE html>
<html lang="en">
<body>
<h1…

Antoine Soetewey

PhD researcher and teaching assistant in statistics at UCLouvain. Interested in statistics, R, and making them accessible to everyone. Author of statsandr.com.