XML and Web pages

We will use the XML package to do the parsing.

Install package

install.packages("XML", dependencies = TRUE)
library(XML)

Load XML tree

library(XML)

url <- "http://www.w3schools.com/xml/simple.xml"

document <- xmlTreeParse(url, useInternalNodes = TRUE)

Get element name

rootNode <- xmlRoot(document)
xmlName(rootNode)

Access first element.

rootNode[[1]]

Get an element at an exact position (descending through subelements).

rootNode[[1]][[1]]

Apply a function to each child of a node to extract values from the XML. Here that function is xmlValue.

xmlSApply(rootNode, xmlValue)
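
As a rough illustration of what xmlSApply returns, the sketch below parses a small inline document instead of the remote file. The menu content and the names menuText, values and fields are made up for this example; only xmlTreeParse, xmlRoot, xmlSApply and xmlValue come from the XML package.

```r
library(XML)

# Hypothetical inline document so the example runs without network access.
menuText <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>"
)

document <- xmlTreeParse(menuText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(document)

# xmlValue concatenates all text below a node, so each <food> element
# yields one string that contains both its name and its price.
values <- xmlSApply(rootNode, xmlValue)

# Applying xmlSApply one level deeper keeps the fields separate:
# one column per <food>, one row per subelement.
fields <- xmlSApply(rootNode, function(food) xmlSApply(food, xmlValue))
```

The nested call is one possible way to avoid the concatenated strings; XPath is usually the more direct route when only some fields are needed.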

Using XPath.

xpathSApply(rootNode, "//name", xmlValue)
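
A few more XPath expressions may help show how selection works. This is a minimal sketch on a made-up inline document (menuText and the result variable names are assumptions for the example, not part of the original code).

```r
library(XML)

# Hypothetical inline document so the example runs without network access.
menuText <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>"
)

document <- xmlTreeParse(menuText, asText = TRUE, useInternalNodes = TRUE)

# "//name" matches every <name> element anywhere in the document.
foodNames <- xpathSApply(document, "//name", xmlValue)

# An absolute path with a position selects the price of the second <food>.
secondPrice <- xpathSApply(document, "/breakfast_menu/food[2]/price", xmlValue)

# A predicate filters nodes by content, here the food named "French Toast".
toastPrice <- xpathSApply(document, "//food[name='French Toast']/price", xmlValue)
```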

Read a table from HTML

library(XML)

url <- "http://www.drugs.com/top200_2003.html"

html.table.data <- readHTMLTable(url, which = 2, skip.rows = 1)

View(html.table.data)

Get a page using the httr package

library(httr)
library(XML)

url <- "http://www.drugs.com/top200_2003.html"

response <- GET(url)
pageText <- content(response, as = "text")
parsedHtml <- htmlParse(pageText, asText = TRUE)

xpathSApply(parsedHtml, "//title", xmlValue)
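
Once the page text is parsed, the same XPath helpers apply to HTML. The sketch below uses a made-up page string in place of the downloaded text, so pageText and its contents are assumptions for illustration; htmlParse, xpathSApply, xmlValue and xmlGetAttr are functions of the XML package.

```r
library(XML)

# Hypothetical page text standing in for content(response, as = "text").
pageText <- paste0(
  "<html><head><title>Example page</title></head>",
  "<body><a href='/first'>First</a><a href='/second'>Second</a></body></html>"
)

parsedHtml <- htmlParse(pageText, asText = TRUE)

# Text of the <title> element.
pageTitle <- xpathSApply(parsedHtml, "//title", xmlValue)

# Visible text and href attribute of every link on the page.
linkText <- xpathSApply(parsedHtml, "//a", xmlValue)
linkTargets <- xpathSApply(parsedHtml, "//a", xmlGetAttr, "href")
```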

Authenticate with the httr package

We can use the authenticate function to access a password-protected page.

library(httr)
library(XML)

url <- "http://httpbin.org/basic-auth/user/passwd"

GET(url, authenticate("user", "passwd"))

The code above returns the following response.

Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2014-12-30 16:30
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
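
The body of this response is JSON, so its fields can be read once it is parsed. The sketch below parses the body text shown above with the jsonlite package, which is an assumption here since the original code does not load it; for a live response, content(response, as = "parsed") in httr is the likely equivalent.

```r
library(jsonlite)

# Response body copied from the output above.
bodyText <- '{
  "authenticated": true,
  "user": "user"
}'

body <- fromJSON(bodyText)

# Individual fields are available by name.
isAuthenticated <- body$authenticated
userName <- body$user
```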

Use the handle function to access several pages during one authenticated session.
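
A minimal sketch of how a handle might be used, assuming the httpbin.org test endpoint from the previous example. The helper getWithSession is hypothetical and not part of httr; handle, GET and authenticate are. The credentials are passed on every call, since it is safer not to assume the handle remembers them; what the handle does share between requests is the connection and any cookies the server sets.

```r
library(httr)

# One handle per host; requests made through it share the connection
# and any cookies the server sets.
httpbin <- handle("http://httpbin.org")

# Hypothetical helper: request a path on the handle's host with basic auth.
getWithSession <- function(h, path, user, password) {
  GET(handle = h, path = path, config = authenticate(user, password))
}

# Usage (requires network access):
# first  <- getWithSession(httpbin, "basic-auth/user/passwd", "user", "passwd")
# second <- getWithSession(httpbin, "basic-auth/user/passwd", "user", "passwd")
# status_code(first)
# status_code(second)
```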
