Tidy data set
It is often said that about 80% of data analysis is spent cleaning messy data and structuring datasets to facilitate analysis.
A dataset is a collection of values, usually numbers, strings, or both. A number is a quantitative value; a string is a qualitative one.
Tidy data sets are easy to manipulate, model and visualise because they have a specific structure. That structure corresponds to Codd's 3rd normal form:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
What is untidy data then? There are many ways data can be untidy. Wickham’s top five are as follows:
Column names represent data values instead of variable names
A single column contains data on multiple variables instead of a single variable
Variables are contained in both rows and columns instead of just columns
A single table contains more than one observational unit
Data about an observational unit is spread across multiple data sets
We might prefer one big data set, say one big table, to multiple connected smaller data sets like those in a relational database. It is easier to perform analysis on one data set than on multiple data sets.
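As a minimal sketch of combining connected data sets into one table, assuming pandas is available: the tables `cities` and `measurements`, their columns and their values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical lookup table: one row per city.
cities = pd.DataFrame({
    "city_id": [1, 2],
    "city": ["New York", "San Francisco"],
})

# Hypothetical measurements that refer to cities by id.
measurements = pd.DataFrame({
    "city_id": [1, 1, 2],
    "year": [2020, 2021, 2020],
    "value": [10.0, 12.5, 7.0],
})

# Join the two tables on the shared key to get one table for analysis.
# validate="many_to_one" raises an error if city_id is not unique in `cities`.
combined = measurements.merge(
    cities, on="city_id", how="left", validate="many_to_one"
)
print(combined)
```

A left join keeps every measurement even if its city is missing from the lookup table, which makes gaps visible instead of silently dropping rows.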
Any data set that does not fulfil the points above we will call messy data.
Let's explore how to get a tidy data set from raw and messy data.
Raw and messy data
Raw data is the original, unmodified data, and it is often difficult to use directly. Raw data can be, for example, videos, images, binary files, text files, unformatted Excel files with multiple worksheets, handwritten data, JSON files, XML files and so on.
To check whether we are really working with raw data, we can go through the following points:
We ran NO software to modify the data
We did NOT manipulate any numbers
We did NOT remove any data
We did NOT summarize the data
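One possible way to support these checks is to record a checksum when a raw file is received and compare it before each analysis. This only shows that the file has not changed since the checksum was taken, not that it was raw to begin with. The sketch below uses Python's standard `hashlib`; the helper name `file_sha256` and the throwaway file are illustrative assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a throwaway file standing in for a raw data file.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw.csv"
    raw_file.write_bytes(b"city,population\nNew York,2500\n")
    checksum_at_receipt = file_sha256(raw_file)

    # Later: recompute and compare before starting the analysis.
    unchanged = file_sha256(raw_file) == checksum_at_receipt

    # Any modification, even a single character, changes the digest.
    raw_file.write_bytes(b"city,population\nNew York,2501\n")
    changed = file_sha256(raw_file) != checksum_at_receipt

print(unchanged, changed)
```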
The analysis we run should include a data cleaning step that prepares the raw data for analysis.
Cleaning data
Clean data is data that is ready for analysis. To clean data we might merge, transform, extract, order or perform similar operations.
All cleaning steps should be properly documented.
We might also want to order the columns so that dimensions or fixed values come first, followed by the measured values.
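A minimal sketch of such a column ordering with pandas follows. The data set and the choice of which columns count as fixed are hypothetical; in practice that choice depends on the data.

```python
import pandas as pd

# Hypothetical data set with fixed and measured columns interleaved.
df = pd.DataFrame({
    "population": [2500, 1000],
    "city": ["New York", "San Francisco"],
    "area": [10.5, 4.2],
    "country": ["US", "US"],
})

# Dimensions (fixed values), listed by hand in the order we want them.
fixed = ["country", "city"]
# Everything else is treated as measured and keeps its original order.
measured = [c for c in df.columns if c not in fixed]

ordered = df[fixed + measured]
print(list(ordered.columns))
```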
Common problems with messy data sets:
Column names contain values
Multiple values in one column
Values are stored in both rows and columns
Multiple document types are mixed in a single document
A single document type is stored in multiple documents
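To illustrate the second problem in the list, here is a minimal pandas sketch that splits one column holding two variables into two columns. The column `group`, its `sex_age` encoding and the values are made up for this example.

```python
import pandas as pd

# Hypothetical column that packs two variables (sex and age range) into one string.
df = pd.DataFrame({
    "group": ["m_0-14", "f_0-14", "m_15-24"],
    "cases": [3, 5, 8],
})

# Split on the first underscore into two new columns.
df[["sex", "age_range"]] = df["group"].str.split("_", n=1, expand=True)

# Drop the packed column and put the dimensions first.
tidy = df.drop(columns="group")[["sex", "age_range", "cases"]]
print(tidy)
```

Splitting with `n=1` keeps any later underscores inside the second part, which is a safer default when the encoding is not fully known.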
For some industries, there might be standards for processing raw data.
Melting dataset
When column names contain values, we might need to melt the dataset.
For example, assume we have the following dataset, where the column names are income ranges.
City | < $10 | >= $10 |
New York | 2500 | 5000 |
San Francisco | 1000 | 2000 |
To melt it, we turn the column names into the values of a new column, for example with the following dataset structure.
City | Income range | Population |
New York | < $10 | 2500 |
New York | >= $10 | 5000 |
San Francisco | < $10 | 1000 |
San Francisco | >= $10 | 2000 |
See the Reshaping chapter to learn how to melt the data.
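As a brief preview, the example above could be melted with pandas roughly as follows. This is a sketch that assumes pandas and writes the upper income range as `>= $10`; it is not necessarily the approach the Reshaping chapter takes.

```python
import pandas as pd

# The wide table from the example: income ranges are stored as column names.
wide = pd.DataFrame({
    "City": ["New York", "San Francisco"],
    "< $10": [2500, 1000],
    ">= $10": [5000, 2000],
})

# Turn the income-range column names into values of a new column.
melted = wide.melt(
    id_vars="City", var_name="Income range", value_name="Population"
)

# melt() groups rows by former column; a stable sort groups them by city instead.
melted = melted.sort_values("City", kind="stable").reset_index(drop=True)
print(melted)
```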
Tidy data
Tidy data is a collection of four things:
Raw data
Tidy data set
Code book describing each variable (for example, a column in a table) and its values in the tidy data set. The code book is sometimes also called metadata.
The exact steps, formulas and recipes we used to clean the raw data set.
Hints:
Always include units in metadata.
Include how you collected the data.
Write down the steps needed to reproduce the tidy data from the raw data.
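The metadata can also be kept in a machine-readable form so that missing descriptions or units are caught automatically. The sketch below is one possible layout, assuming pandas; the data set, the dictionary structure and the descriptions are hypothetical.

```python
import pandas as pd

# Hypothetical tidy data set.
tidy = pd.DataFrame({
    "city": ["New York", "San Francisco"],
    "population": [2500, 1000],
})

# Hypothetical metadata: one entry per column, always with a unit.
metadata = {
    "city": {"description": "Name of the city", "unit": "none (text)"},
    "population": {"description": "Number of people counted", "unit": "persons"},
}

# Check that every column is documented and every entry states a unit.
undocumented = [c for c in tidy.columns if c not in metadata]
missing_units = [c for c, entry in metadata.items() if not entry.get("unit")]
print(undocumented, missing_units)
```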
Recommended reading about tidy data.