Data Table

Facts about data.table.

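  • Inherits from data.frame, so functions that accept a data.frame also accept a data.table. The [ operator is the main exception: it has its own, richer syntax.

  • Its performance-critical code is written in C, which makes it fast.

  • Much faster than data.frame at subsetting, grouping and updating.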
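
The inheritance claim above can be checked directly. This is a minimal sketch, assuming the data.table package is installed; the table dt and its values are made up for illustration.

```r
library(data.table)

# Hypothetical example table; the values are illustrative only.
dt <- data.table(id = 1:3, value = c(10, 20, 30))

# A data.table carries both classes, so code written for
# data.frame accepts it.
dt_class <- class(dt)
is_df <- inherits(dt, "data.frame")

# Base functions such as nrow() and names() work unchanged.
n_rows <- nrow(dt)
col_names <- names(dt)
```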

Hello world

install.packages("data.table")  # only needed once
library(data.table)

years <- c(2012, 2013)
average <- c(250, 275)
table.values <- data.table(year = years, averageBeerConsumption = average)

List all data.tables currently in memory.

tables()

Subsetting rows.

Access rows by index.

table.values[2]
table.values[c(1,2)]

Access rows that fulfil a condition.

table.values[table.values$year==2012]
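
Inside the brackets, columns can also be referred to by name, which shortens conditions like the one above. The sketch below rebuilds the same two-row table so it runs on its own; the threshold 260 is an arbitrary value chosen for illustration.

```r
library(data.table)

table.values <- data.table(year = c(2012, 2013),
                           averageBeerConsumption = c(250, 275))

# Column names are visible inside [ ], so the table.values$ prefix
# is not needed.
rows.2012 <- table.values[year == 2012]

# Conditions combine with & and | as usual.
rows.high <- table.values[year >= 2012 & averageBeerConsumption > 260]
```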

Calculate values from columns

table.values[, sum(averageBeerConsumption)]

table.values[, list(mean(year), sum(averageBeerConsumption))]
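
The list form above returns columns with default names (V1, V2). A small sketch of naming the results instead, using .() as shorthand for list(); the names meanYear and totalConsumption are hypothetical.

```r
library(data.table)

table.values <- data.table(year = c(2012, 2013),
                           averageBeerConsumption = c(250, 275))

# .() is an alias for list() inside [ ]; naming its elements
# names the columns of the result.
summary.values <- table.values[, .(meanYear = mean(year),
                                   totalConsumption = sum(averageBeerConsumption))]
```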

Return a frequency table of the values in a column

table.values[, table(year)]

Add new column

table.values[, volume:=averageBeerConsumption*0.5]
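
The := operator changes the table in place rather than returning a modified copy. The sketch below illustrates the consequence, assuming the same two-row table; alias.values and copied.values are hypothetical names.

```r
library(data.table)

table.values <- data.table(year = c(2012, 2013),
                           averageBeerConsumption = c(250, 275))

# Plain assignment does not copy a data.table: both names
# refer to the same object.
alias.values <- table.values

# copy() creates an independent table.
copied.values <- copy(table.values)

# Add the column by reference.
table.values[, volume := averageBeerConsumption * 0.5]

# The alias sees the new column, the explicit copy does not.
alias.has.volume <- "volume" %in% names(alias.values)
copy.has.volume <- "volume" %in% names(copied.values)
```

Use copy() whenever the original table must stay unchanged.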

Multiple operations. Wrap the statements in braces; the value of the last one is assigned to the new column.

table.values[, x := {
    temp <- averageBeerConsumption * year
    log2(temp)
}]

plyr-like operations: add a column computed from a condition

table.values[, y := year < 2013]

Grouping with by. Combined with :=, the group total is written to every row of its group.

table.values[, sum := sum(averageBeerConsumption), by = year]

Count the number of rows per group

table.values[, .N, by=year]
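
With only one row per year, the grouping examples above are hard to tell apart. The sketch below uses a hypothetical table beer with several rows per year (all values invented) to contrast aggregating by group with adding a grouped column.

```r
library(data.table)

# Hypothetical table with several rows per year.
beer <- data.table(year = c(2012, 2012, 2013, 2013, 2013),
                   litres = c(100, 150, 90, 110, 75))

# Without :=, by returns one row per group.
per.year <- beer[, .(total = sum(litres), n = .N), by = year]

# With :=, the group result is written back to every row of the group.
beer[, yearTotal := sum(litres), by = year]
```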

Keys

Setting a key sorts the table by that column, which makes subsetting and joining on it faster.

setkey(table.values, year)
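
A short sketch of what setting a key does and how a keyed lookup might look. The table is rebuilt here in reverse year order (an assumption made only so the sorting is visible).

```r
library(data.table)

# Rows deliberately start out of order.
table.values <- data.table(year = c(2013, 2012),
                           averageBeerConsumption = c(275, 250))

# setkey() sorts the table by year, in place.
setkey(table.values, year)

# key() reports the current key column(s).
current.key <- key(table.values)

# With a key set, .(value) looks rows up by the key column.
row.2012 <- table.values[.(2012)]
```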

Tables that share a key can then be joined. merge uses the shared key columns by default.

setkey(table1.values, year)
setkey(table2.values, year)

merge(table1.values, table2.values)
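
The join above uses two tables that are not defined in the text. A runnable sketch, assuming two small hypothetical tables; the averageWineConsumption column and all values are invented for illustration.

```r
library(data.table)

# Hypothetical tables sharing a year column.
table1.values <- data.table(year = c(2012, 2013),
                            averageBeerConsumption = c(250, 275))
table2.values <- data.table(year = c(2013, 2014),
                            averageWineConsumption = c(40, 45))

setkey(table1.values, year)
setkey(table2.values, year)

# Default merge keeps only years present in both tables.
inner <- merge(table1.values, table2.values)

# all = TRUE keeps every year and fills the gaps with NA.
outer <- merge(table1.values, table2.values, all = TRUE)
```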

Fast reading

First we create a file that we can use to test speed of reading.

big.file <- data.frame(x=rnorm(1E6), y=rnorm(1E6))

file <- tempfile()

write.table(big.file, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)

Slow approach using the read.table function.

system.time(read.table(file, header=TRUE, sep="\t"))

Faster approach using the fread function.

system.time(fread(file))
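
fread also has a writing counterpart, fwrite. The sketch below is a small round trip on a hypothetical three-row table, meant to show that fread works out the separator and header on its own; it is not a benchmark.

```r
library(data.table)

# Hypothetical small table; values are illustrative only.
small <- data.table(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

path <- tempfile(fileext = ".tsv")

# Write a tab-separated file with a header row.
fwrite(small, path, sep = "\t")

# fread detects the separator and the header automatically
# and returns a data.table.
loaded <- fread(path)

# Remove the temporary file.
unlink(path)
```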
