Summarizing
Get some data to summarize
We will use data from the following page: https://data.baltimorecity.gov/Culture-Arts/Restaurants/k5ry-ef3g
Run this code in order to download sample data for this tutorial.
if (!file.exists("./data")) { dir.create("./data") }
url <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(url, destfile="./data/restaurants.csv", method="curl")
data <- read.csv("./data/restaurants.csv")
View data to make sure there is at least something.
Check size in bytes.
object.size(data)
print(object.size(data), units="Mb")
View data in table.
View(data)
View first few rows.
head(data, n=3)
View last few rows.
tail(data, n=3)
Get summary for each attribute.
summary(data)
Return types and more info about data.
str(data)
See quantile of values in specific column.
quantile(data$councilDistrict, na.rm=TRUE)
Get quantile for different percentiles.
quantile(data$councilDistrict, probs = c(0.5, 0.75, 0.9))
Create table of values from a column. Option ifany
will enable the table to show missing values.
table(data$zipCode, useNA = "ifany")
Make two dimensional table.
table(data$councilDistrict, data$zipCode)
Check for missing values. The following command returns number of NA values.
sum(is.na(data$zipCode))
The same as above but this returns true or false.
any(is.na(data$zipCode))
Take all values and check if all fulfil a condition.
all(data$zipCode < 0)
Check all columns and get count of NA values for each column.
colSums(is.na(data))
Covert the command above to single command that returns true or false instead of counts.
all(colSums(is.na(data)) == 0)
table(data$zipCode %in% c("21212"))
Find all values that full fill a condition.
table(data$zipCode %in% c("21212"))
The same as above but with multiple values. There is OR condition between the values in the condition.
table(data$zipCode %in% c("21212", "21213"))
Filter out specific rows based on a condition and return values (not just numbers/counts).
data[data$zipCode %in% c("21212", "21213"),]
Cross tabs. The following example is nonsense. But it will break down the data by policeDistrict and counsilDistrict and create sums from zipCodes values.
xtabs = xtabs(zipCode ~ policeDistrict + councilDistrict, data=data)
-- to make data compact and easier to view... if possible.
ftable(xtabs)
Last updated
Was this helpful?