HTML Tables in R
A few weeks ago I saw Hadley Wickham tweet about trying rvest. He mentioned that it would be a good fit for those who like Beautiful Soup in Python. It caught my attention because I spent a month over the summer on a project that used Beautiful Soup for some large-scale web scraping automation. With rvest, the R ecosystem now has a robust toolset for web scraping.
This post isn’t about the entire package, just one function: html_table(). It turns an HTML table into an R data frame, which is useful for any number of reasons. Sure, you could write your own function to do that, but I thought it was great that the package provides it out of the box. In this post I grab data from the npm homepage, specifically the download counts for Node packaged modules over the last day, week, and month, and plot it using ggvis.
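To make the idea concrete, here is a minimal sketch of html_table() at work. It assumes rvest 1.0 or later (for html_element()) and parses a small inline HTML string rather than the live npm page. The table id, column names, and numbers are made up for illustration and are not the real npm markup or data.

```r
library(rvest)

# Hypothetical stand-in for a scraped page; the real npm markup may differ.
page <- read_html('
<table id="downloads">
  <tr><th>period</th><th>downloads</th></tr>
  <tr><td>day</td><td>1 234 567</td></tr>
  <tr><td>week</td><td>8 765 432</td></tr>
  <tr><td>month</td><td>34 567 890</td></tr>
</table>')

# Select the table with a CSS selector, then convert it to a data frame.
downloads <- page %>%
  html_element("#downloads") %>%
  html_table()

print(downloads)
```

For a real page you would pass a URL to read_html() instead of a string and swap in whatever selector matches the table you care about.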
I grabbed the CSS selector for the table of interest and called html_table() on the selected node. The counts came back as strings rather than integers because of the way the counter is implemented on the npm homepage. I removed the white space, converted the column to a numeric type, and was on my way. From there, plotting in ggvis was a piece of cake.
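The cleanup step looks roughly like the sketch below. It uses base R only, and the data frame is a hypothetical stand-in for the scraped result; the column names and values are assumptions, not the actual npm figures.

```r
# Hypothetical values standing in for the scraped counts.
downloads <- data.frame(
  period = c("day", "week", "month"),
  count  = c("1 234 567", "8 765 432", "34 567 890"),
  stringsAsFactors = FALSE
)

# Remove every whitespace character, then convert the strings to numbers.
downloads$count <- as.numeric(gsub("\\s+", "", downloads$count))

str(downloads)
```

Once the column is numeric, the data frame can go straight into a plotting library such as ggvis.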
I’m looking forward to grabbing more table data this way. Dealing with white space and other string issues is mostly trivial in R, and this opens up a good set of data that is easy to get at when working in R. Furthermore, there is so much else in this package that I hope to cover in the coming weeks.