July 2nd 2021

Welcome to fantastic data and where to scrape them

This workshop is for beginner R users and aims to introduce you to the world of scraping.

Topics covered:

  1. What is scraping?
  2. Ethical considerations
  3. Webpage structures and HTML
  4. Inspect
  5. Scraping data in R
  6. Case study

Why do I care about scraping?

My current research project aims to analyse real-world treatment patterns across multiple cancer types.

It uses Victoria-wide linked datasets, comprising cancer registry, hospital administrative, and PBS/MBS data.

Costing the pharmaceuticals and services subsidised by the Australian Government is key, but there is no database resource that links drug/service/cost information, including historical prices.

So we scrape it all and build a database that is used to
1. find and retrieve drug information within R/Excel
2. link a drug/service item to its current and historical price
3. facilitate patterns of care analyses

1. S-C-R-A-P-I-N-G-!

What is scraping?

The internet is often the first place you look for information and data!

Web scraping is one of the most robust and reliable ways of getting data from the web.

Scraping performs automated information extraction from websites by parsing the page source code to retrieve programmatically specified elements.

Do you really need scraping?

Remember to choose the easiest tool for the job!

  • Can you easily copy and paste data from a site into Excel?
  • Is there an export/download feature?
  • Is there an API to extract structured data via R?
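
If an API exists, querying it from R is usually simpler and more stable than scraping. A minimal sketch with httr and jsonlite, where the endpoint URL is a placeholder, not a real API:

library(httr)
library(jsonlite)

# Placeholder endpoint: substitute the API documented by the website
resp <- GET("https://example.com/api/v1/items")
items <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))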

From the web to data

So you need data that’s found on a website that you can only browse…

2. Some ethical considerations

Think before you scrape

Consider the following before going ahead with scraping:

  • Is the data free? –> The terms of use are a good thing to check first.

  • Are there restrictions on what I can do with this data? –> Look for any copyright statements.

  • Is there a risk of overloading the website’s server? –> Some websites have access rules, so check the site's robots.txt file.

! The polite package can be used for responsible scraping etiquette
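
A minimal sketch with polite: bow() introduces you to the host and reads its robots.txt, then scrape() fetches the page within those rules (using the PBS site from the case study below):

library(polite)

session <- bow("https://www.pbs.gov.au")  # reads robots.txt, sets a rate limit
page <- scrape(session)                   # a polite counterpart to read_html()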

3. Webpage structures and HTML

Websites

(are made of these)

What’s in a website?

  1. Structure of content: Hypertext Markup Language (HTML)
  2. Styling of content: Cascading Style Sheets (CSS)
  3. Adding complex behaviour: JavaScript


> If the underlying structure is well organised, then scraping is usually straightforward

HTML basics

The basic unit of an HTML document is the element:

- Delimited by tags: <tag> </tag>
- May carry attributes: <tag attribute="cool"> </tag>
- Contains content: <tag attribute="cool"> content </tag>

HTML’s anatomy

        <!DOCTYPE html>
        <html>
          <head>
            <title> must have a title </title>
            <!-- container for metadata / CSS -->
          </head>
          <body>
            <!-- container for the actual content, with various elements:
                 for grouping: <div> <span>
                 for headings: <h1> to <h6>
                 for paragraphs: <p> <br>
                 for lists / tables: (<ul> or <ol>) + <li> / <table> + <tr> <td>
                 attributes on elements: id class title href -->
          </body>
        </html>

4. Inspect

Inspecting a website

Most browsers have a built-in inspect tool that allows you to explore a web page.

Inspect element shortcuts:

Browser   Windows            macOS
Firefox   Ctrl + Shift + C   Cmd + Shift + C
Chrome    Ctrl + Shift + J   Cmd + Option + J
Safari    n/a                Cmd + Option + I

Or simply right-click any part of a website and select Inspect or Inspect element

Safari users may have to enable the inspector first:
Preferences -> Advanced -> enable 'Show Develop menu in menu bar'

5. Scraping data in R

rvest for scraping

Function          Description
read_html()       read HTML from a character string or URL
html_nodes()      select pieces of the HTML document using CSS selectors
html_text()       extract text content
html_text2()      extract text content AND handle white space properly
html_elements()   select elements; the rvest 1.0 equivalent of html_nodes()
html_table()      parse an HTML table into a data frame
html_name()       extract tag names
html_attr()       extract the value of a named attribute
html_attrs()      extract all attributes and their values
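
Putting a few of these together; the URL and selectors below are generic examples, not tied to any particular site:

library(rvest)

page <- read_html("https://www.r-project.org/")

page %>% html_nodes("h1") %>% html_text2()      # text of all <h1> headings
page %>% html_nodes("a") %>% html_attr("href")  # targets of all links
page %>% html_nodes("table") %>% html_table()   # any tables, as data frames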

CSS selectors

CSS is used for styling…

But it also includes a miniature language for selecting elements in an HTML document.

-> CSS selectors define patterns for locating HTML elements

Combining the inspect tool with CSS selectors gives you a path to the data you want to extract!
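
A few common selector patterns, assuming page is the result of read_html() (the class and id names are illustrative):

html_nodes(page, "p")                # every <p> element
html_nodes(page, ".price")           # elements with class="price"
html_nodes(page, "#content")         # the element with id="content"
html_nodes(page, "table td a")       # <a> inside <td> inside <table>
html_nodes(page, "tr:nth-child(2)")  # the second <tr> of each parent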

6. Case study: scraping the Pharmaceutical Benefits Scheme (PBS)

The Pharmaceutical Benefits Scheme

The PBS is similar to the British National Formulary

Let’s look at the website together:

  1. The A-Z drug listing is split across different indexes
  2. Each index has a list of drugs
  3. Each drug link has a list of sub-links

Step 1: collect links
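
A hedged sketch of this step: read the A-Z browse page and collect its index links. The URL path and the broad "a" selector are illustrative assumptions, not the workshop's exact code:

library(rvest)

az_page <- read_html("https://www.pbs.gov.au/browse/medicine-listing")  # assumed path
index_links <- az_page %>%
  html_nodes("a") %>%                  # narrow this with the selector found via Inspect
  html_attr("href") %>%
  paste0("https://www.pbs.gov.au", .)  # hrefs are assumed to be relative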

Step 2: Get all links from each index

Step 3: extract the data

Please do not run this code during the workshop

library(rvest)

## Get all sub-links within one drug link
find_all_sublinks <- function(all_pbs) {
  all_subpbs <- c()
  for (link in all_pbs) {
    # If a page fails to download or parse, keep NA instead of stopping
    tmp_sub <- tryCatch(
      link %>%
        read_html() %>%
        html_nodes('#content > div > div > div:nth-child(4) > div > table >
                    tbody > tr:nth-child(1) > td > ul > li > a') %>%
        html_attr('href'),
      error = function(e) NA
    )
    all_subpbs <- c(all_subpbs, tmp_sub)
  }
  return(all_subpbs)
}

sub_pbs <- find_all_sublinks(all_pbs) %>% paste0('https://www.pbs.gov.au', .)
head(sub_pbs, 5)

Step 4: retrieve the tables and produce db

Please do not run the GitHub R code during the workshop

  • Parse nodes/columns within the main table
    CSS selector: html_nodes("#medicine-item")
  • Scrape the first complete row
    CSS selector: html_nodes("tr:nth-child(2) > td.align-top")
    Extraction: html_text()
  • Build the data frame

Each node contains a column of interest. One row is one drug item. Merge all rows from all links into one RDS/CSV file.
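
A minimal sketch of this step; the helper name is ours, and it assumes every drug page yields the same set of cells:

library(rvest)

# Scrape the first complete row of the main table for one drug page
scrape_drug_row <- function(link) {
  cells <- link %>%
    read_html() %>%
    html_nodes("tr:nth-child(2) > td.align-top") %>%
    html_text()
  as.data.frame(t(cells))  # one drug item becomes one row
}

rows <- lapply(sub_pbs, scrape_drug_row)
pbs_db <- do.call(rbind, rows)  # assumes the same columns on every page
saveRDS(pbs_db, "pbs_db.rds")   # or write.csv(pbs_db, "pbs_db.csv")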

PBS scraped, then further linked to scraped MBS data and historical prices

Amazing resources