I am not very conversant with web scraping, but I understand the importance of the technique given that a lot of very useful data is embedded in HTML pages. Hence I was very excited when I came across this blog post on the RStudio site, which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.
As an exercise in scraping web pages, I set out to get all the Exchange Traded Fund (ETF) data from the London Stock Exchange website.
First things first: load up the rvest package and set the base URL and a download location where the HTML will be saved. You can do this without having to download the file, but there were some proxy settings in the environment I was working in which prevented me from doing so. So I opted to download the HTML, process it, and then delete it.
library("rvest") url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html" download_folder <- "C:/R/" etf_table <- data.frame()
The next thing to determine is how many pages there are in the ETF table. If you visit the URL, you will find that just above the table where the ETFs are displayed there is a string which tells us how many pages there are. It was 38 when I was writing the script. If you look at the source HTML, this string appears in a paragraph tag whose class is floatsx.
Time to call html_node to get the part of the HTML containing the paragraph with class floatsx, and then run html_text to get the actual string. Then it's a matter of taking a substring of the complete string to get the number of pages.
#find how many pages there are
download.file(url, paste(download_folder, "ETFs.html", sep=""))
html <- html(paste(download_folder, "ETFs.html", sep=""))
#grab the paragraph with class floatsx and extract its text
pages <- html_text(html_node(html, "p.floatsx"))
#the page count is the last two characters of the string
pages <- as.numeric(substr(pages, nchar(pages)-1, nchar(pages)))
#delete the downloaded file now that it has been processed
file.remove(paste(download_folder, "ETFs.html", sep=""))
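As an aside, if you do not face the proxy issue I mentioned, the intermediate file can be skipped altogether. Here is a minimal sketch of the same step reading the page straight from the URL; it assumes a later version of rvest, where read_html() superseded html():

#read the page directly from the url, no intermediate file
html <- read_html(url)
pages <- html_text(html_node(html, "p.floatsx"))
#as before, the page count is the last two characters of the string
pages <- as.numeric(substr(pages, nchar(pages)-1, nchar(pages)))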
Now that we know how many pages there are, we want to iterate over each page and get the ETF values from its table. Again we load up the HTML and call html_nodes, but this time we are looking for all the tables. On this page there is just one table, which displays all the ETF rates, so we are only interested in the first one.
#for each page
for (p in 1:pages) {
  cur_url <- paste(url, "?&page=", p, sep="")
  #download the file
  download.file(cur_url, paste(download_folder, p, ".html", sep=""))
  #create html object
  html <- html(paste(download_folder, p, ".html", sep=""))
  #look for tables on the page and get the first one
  table <- html_table(html_nodes(html, "table")[[1]])
  #only the first 6 columns contain information that we need
  table <- table[1:6]
  #stick a timestamp at the end
  table["Timestamp"] <- Sys.time()
  #add into the final results table
  etf_table <- rbind(etf_table, table)
  #remove the originally downloaded file
  file.remove(paste(download_folder, p, ".html", sep=""))
}
#summary of the combined table (outside the loop so the result actually prints)
summary(etf_table)
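Once the loop completes, etf_table holds the rows from every page along with the timestamp column. If you want to keep the data around, here is a minimal sketch of inspecting and saving it (the file name is just an illustrative choice):

#quick look at the combined results
head(etf_table)
#write the scraped rates out as CSV for later use
write.csv(etf_table, paste(download_folder, "etf_table.csv", sep=""), row.names=FALSE)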
As you can see, rvest makes scraping web data extremely simple, so give it a try. The R Markdown file and knitted HTML are available at the GitHub link below if you want to run it in your own environment.
Github link