This workshop is for beginner R users and aims to introduce you to the world of scraping.
Topics covered:
- What is scraping?
- Ethical considerations
- Webpage structures and HTML
- Inspect
- Scraping data in R
- Case study
July 2nd 2021
Our current research project analyses real-world treatment patterns across multiple cancer types.
We work with Victoria-wide linked datasets comprising the cancer registry, hospital administrative data, and PBS/MBS data.
Costing the pharmaceuticals and services subsidised by the Australian Government is key, but no database resource links drugs, services, and cost information, including historical prices.
So we scrape it all and build a database that is used to
1. find and retrieve drug information within R/Excel
2. link a drug/service item to its current and historical prices
3. facilitate patterns-of-care analyses
The internet is the first place to look for information and data!
Web scraping is one of the most robust and reliable ways of getting data from the web.
Scraping performs automated information extraction from websites by parsing the page source code to retrieve programmatically specified elements.
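As a taste of what this looks like in rvest (the package we use later), here is a minimal sketch; the URL and the h1 selector are illustrative only:

```r
## A minimal sketch of the idea with rvest (URL and selector are illustrative)
library(rvest)  # also provides the %>% pipe

page <- read_html("https://example.com")  # parse the page source code
page %>%
  html_nodes("h1") %>%  # select the elements you specified
  html_text()           # retrieve their content
```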
Remember to choose the easiest tool for the job!
So you need data that’s found on a website that you can only browse…
Consider the following before going ahead with scraping:
- check the website's robots.txt file
! The polite package can be used for responsible scraping etiquette
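For illustration, a minimal polite sketch (the user agent string below is a placeholder):

```r
## A minimal sketch with polite (user agent string is a placeholder)
library(polite)

session <- bow("https://www.pbs.gov.au",
               user_agent = "workshop demo")  # reads robots.txt, sets a delay
session                  # prints whether scraping is permitted and the crawl delay
page <- scrape(session)  # fetches only if allowed, respecting the rate limit
```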
What’s in a website?
> If the underlying structure is well organised, then scraping is usually straightforward
The basic unit of an HTML document: the element
- Contains tags: <tag> </tag>
- Contains attributes: <tag attribute="cool"> </tag>
- Contains content: <tag attribute="cool"> content </tag>
```html
<!DOCTYPE html>
<html>
  <head>
    <title> must have a title </title>
    <!-- container for metadata / CSS -->
  </head>
  <body>
    <!-- container for the actual content, with various elements:
           for grouping:     <div> <span>
           for headings:     <h1> to <h6>
           for paragraphs:   <p> <br>
           for lists/tables: (<ul> or <ol>) + <li>  /  <table> + <tr> <td>
         attributes to elements: id, class, title, href -->
  </body>
</html>
```
Most browsers have a built-in inspect tool that allows you to explore a web page.
| Inspect Element | Windows | macOS |
|---|---|---|
| Mozilla Firefox | Ctrl + Shift + C | Cmd + Shift + C |
| Chrome | Ctrl + Shift + J | Cmd + Option + J |
| Safari | x | Cmd + Option + I |
Or simply right-click any part of a website and select Inspect or Inspect element
Safari users may have to enable the option first: Preferences -> Advanced -> enable 'Show Develop menu in menu bar'
| Function | Description |
|---|---|
| read_html() | read HTML from a character string or URL |
| html_nodes() | select specified pieces from the HTML document using CSS selectors |
| html_text() | extract content |
| html_text2() | extract content AND properly parse white space |
| html_elements() | extract the variables from each observation of a specific element |
| html_table() | parse an HTML table into a data frame |
| html_name() | extract tag names |
| html_attr() | extract the value of a specified attribute |
| html_attrs() | extract all attributes and their values |
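A self-contained sketch of these functions in action, on a made-up inline HTML snippet (the table content is invented for illustration):

```r
## Demo of the functions above on an inline HTML string (content is invented)
library(rvest)

doc <- read_html('
  <table id="drugs">
    <tr><th>Drug</th><th>Price</th></tr>
    <tr><td><a href="/abacavir">abacavir</a></td><td class="price">$10.50</td></tr>
  </table>')

doc %>% html_nodes("a") %>% html_text()        # "abacavir"
doc %>% html_nodes("a") %>% html_attr("href")  # "/abacavir"
doc %>% html_nodes("a") %>% html_name()        # "a"
doc %>% html_nodes("#drugs") %>% html_table()  # a list holding one data frame
```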
CSS is used for styling…
But it also includes a miniature language for selecting elements in an HTML document: CSS selectors define patterns for locating HTML elements.
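Some common selector patterns, reusing the doc object from the sketch above (these selectors are illustrative, not taken from the PBS site):

```r
## Common CSS selector patterns (illustrative, run against doc from above)
doc %>% html_nodes("table")            # by tag name
doc %>% html_nodes("#drugs")           # by id (the # prefix)
doc %>% html_nodes(".price")           # by class (the . prefix)
doc %>% html_nodes("table td a")       # descendant: <a> inside <td> inside <table>
doc %>% html_nodes("tr:nth-child(2)")  # positional: the second <tr>
```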
The PBS (Pharmaceutical Benefits Scheme) is similar to the British National Formulary.
Let’s look at the website together:
```r
## Gather all index links on PBS
library(rvest)  # also provides the %>% pipe used below

url_pbs <- "https://www.pbs.gov.au/browse/medicine-listing?initial="
pages_pbs <- paste0(url_pbs, letters)  # one index page per letter a-z
head(pages_pbs, 3)
```
## [1] "https://www.pbs.gov.au/browse/medicine-listing?initial=a" ## [2] "https://www.pbs.gov.au/browse/medicine-listing?initial=b" ## [3] "https://www.pbs.gov.au/browse/medicine-listing?initial=c"
```r
## Get all links from each index (i.e. all drug links)
find_all_links <- function(pages_pbs) {
  res_pbs <- c()
  for (link in pages_pbs) {
    tmp <- link %>%
      read_html() %>%
      html_nodes('#medicine-item > tbody > tr:nth-child(n) > td > a') %>%
      html_attr('href') %>%
      paste('https://www.pbs.gov.au', ., sep = '')
    res_pbs <- c(res_pbs, tmp)
  }
  return(res_pbs)
}

all_pbs <- find_all_links(pages_pbs)
head(all_pbs, 3)
```
## [1] "https://www.pbs.gov.au/pbs/search?term=abacavir&analyse=false&search-type=medicines" ## [2] "https://www.pbs.gov.au/pbs/search?term=abacavir%20%2B%20lamivudine&analyse=false&search-type=medicines" ## [3] "https://www.pbs.gov.au/pbs/search?term=abacavir%20%2B%20lamivudine%20%2B%20zidovudine&analyse=false&search-type=medicines"
Please do not run this code during the workshop.
```r
## Get all sub-links within one drug link
find_all_sublinks <- function(all_pbs) {
  all_subpbs <- c()
  for (link in all_pbs) {
    # assign the tryCatch() result so a failed page yields NA
    # (rather than silently reusing the previous iteration's value)
    tmp_sub <- tryCatch(
      link %>%
        read_html() %>%
        html_nodes('#content > div > div > div:nth-child(4) > div > table > tbody > tr:nth-child(1) > td > ul > li:nth-child(n) > a') %>%
        html_attr('href'),
      error = function(e) NA
    )
    all_subpbs <- c(all_subpbs, tmp_sub)
  }
  return(all_subpbs)
}

sub_pbs <- find_all_sublinks(all_pbs) %>% paste('https://www.pbs.gov.au', ., sep = '')
head(sub_pbs, 5)
```
Please do not run the GitHub R code during the workshop.
CSS selector: html_nodes("#medicine-item")
CSS selector: html_nodes("tr:nth-child(2) > td.align-top") Extraction: html_text()
- Each node contains a column of interest
- One row is one drug item
- Merge all rows from all links into one RDS/CSV file
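The assembly step is described but not coded on the slide; here is a hedged sketch under the assumption that each drug page yields one row via the selector above (the helper name, column handling, and file names are illustrative):

```r
## Hedged sketch of the final assembly (helper and file names are illustrative)
library(rvest)

scrape_drug <- function(link) {
  cells <- link %>%
    read_html() %>%
    html_nodes("tr:nth-child(2) > td.align-top") %>%  # selector from the slide
    html_text2()
  # assumes every page yields the same number of cells, in the same order
  as.data.frame(t(cells), stringsAsFactors = FALSE)
}

pbs_table <- do.call(rbind, lapply(sub_pbs, scrape_drug))  # one row per drug
saveRDS(pbs_table, "pbs_drugs.rds")                        # for reuse within R
write.csv(pbs_table, "pbs_drugs.csv", row.names = FALSE)   # for Excel
```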
Want to learn more?
HTML basics : introduction to web structure @ Mozilla
HTML elements : complement to the above @ W3Schools
Scraping with R : rvest package homepage
Scraping etiquette : polite package homepage
CSS selectors : CSS selectors @ Interneting is hard
Scraping JavaScript : RSelenium, for when rvest is not enough