We might want to create new variables because we want to
add extra value for missing values, e.g. NA
create quantitavive values, for example from numbers
Get some data to summarize
We will use data from the following page: https://data.baltimorecity.gov/Culture-Arts/Restaurants/k5ry-ef3g
Run this code in order to download sample data for this tutorial.
if (!file.exists("./data")) { dir.create("./data") }
url <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(url, destfile="./data/restaurants.csv", method="curl")
data <- read.csv("./data/restaurants.csv")
How to create a sequence
Sometimes, we might need to create a sequence.
> seq(1, 10, by=2)
[1] 1 3 5 7 9
> seq(1, 10, length=3)
[1] 1.0 5.5 10.0
Create new column by subsetting.
> data$nearMe = data$neighborhood %in% c("Roland Park", "Homeland")
> table(data$nearMe)
FALSE TRUE
1314 13
Create binary column
> data$wrongZip = ifelse(data$zipCode < 0, TRUE, FALSE)
> table(data$wrongZip)
FALSE TRUE
1326 1
Create categorical values
The following code will create categorize zipCode values into percentile 0-25, 25-50, 50-75 and 75-100.
data$zipGroups = cut(data$zipCode,breaks=quantile(data$zipCode))
table(data$zipGroups)
table(data$zipGroups, data$zipCode)
Easier way to do that using Hmisc package.
install.packages("Hmisc")
library(Hmisc)
data$zipGroups = cut2(data$zipCode, g=4)
table(data$zipGroups)
Create factor values
zipCode is integer and we might want to turn it into factor type.
> data$zipCodeFactor <- factor(data$zipCode)
> class(data$zipCodeFactor)
[1] "factor"
Levels of factor variables
> options <- sample(c("yes", "no"), size=10, replace=TRUE)
> options
[1] "yes" "no" "no" "yes" "yes" "no" "yes" "no" "yes" "no"
> optionsFactor = factor(options, levels=c("yes", "no"))
> optionsFactor
[1] yes no no yes yes no yes no yes no
Levels: yes no
relevel(optionsFactor, ref="yes")
as.numeric(optionsFactor)
Mutate function
Mutate function can be used to add variable to a new table.
install.packages("Hmisc")
library(Hmisc)
install.packages("plyr")
library(plyr)
newData = mutate(data, zipGroups=cut2(zipCode, g=4))
View(newData)
Transformations
# absolute value
abs(x)
# square root
sqrt(x)
# ceiling(3.4) is 4
ceiling(x)
# floor(3.4) is 3
floor(x)
# rounding
round(x, digits=n)
# signif(3.475, digits=2) is 3.5
signif(x, digits=n)
cos(x), sin(x) # etc.
# natural logarithm
log(x)
# other logaritmhs
log2(x), log10(x)
# exponentiating x
exp(x)
More here functions here: http://www.statmethods.net/management/functions.html