Handbook of Hidden Data Scientist (Python)
  • Introduction
  • Machine Learning
    • Supervised Learning
      • Features and Labels
      • Linear Decision Surface
      • Naive Bayes
      • Support Vector Machine
      • Decision Trees
      • Regressions
  • Python
  • CSV with pandas
    • Reading CSV
    • Math Operations on Column
    • Joining CSVs
    • Plot and Normalize CSV Data
  • NumPy
    • Using NumPy from pandas DataFrame
    • Create NDArray
    • Working with NDArray
    • Timing operations
  • Statistical Analysis
    • Global Statistics
    • Rolling Statistics
    • Daily Returns
    • Cumulative Returns
  • Incomplete Data
    • Pandas fillna()
  • Histograms and Scatter Plots
    • Histogram
    • Two Histograms
    • Scatter Plot
  • Visualization
    • pyplot
Powered by GitBook
On this page

Was this helpful?

  1. CSV with pandas

Joining CSVs

PreviousMath Operations on ColumnNextPlot and Normalize CSV Data

Last updated 5 years ago

Was this helpful?

Create two CSV files with the following content and name it "AAPL.csv" and "GOOG.csv". It contains stock information for few days. You can get more data on

AAPL.csv

Date,Open,High,Low,Close,Volume,Adj Close
2017-01-20,120.449997,120.449997,119.730003,120.00,29479900,120.00
2017-01-19,119.400002,120.089996,119.370003,119.779999,25295700,119.779999
2017-01-18,120.00,120.50,119.709999,119.989998,23644700,119.989998
2017-01-17,118.339996,120.239998,118.220001,120.00,34078600,120.00
2017-01-13,119.110001,119.620003,118.809998,119.040001,25938300,119.040001
2017-01-12,118.900002,119.300003,118.209999,119.25,27002400,119.25
2017-01-11,118.739998,119.93,118.599998,119.75,27418600,119.75
GOOG.csv

Date,Open,High,Low,Close,Volume,Adj Close
2017-01-20,806.909973,806.909973,801.690002,805.02002,1645000,805.02002
2017-01-19,805.119995,809.47998,801.799988,802.174988,912000,802.174988
2017-01-18,805.809998,806.205017,800.98999,806.070007,1293300,806.070007
2017-01-17,807.080017,807.140015,800.369995,804.609985,1362100,804.609985
2017-01-13,807.47998,811.223999,806.690002,807.880005,1090100,807.880005
2017-01-12,807.140015,807.390015,799.169983,806.359985,1351000,806.359985
2017-01-11,805.00,808.150024,801.369995,807.909973,1057900,807.909973

Then we rename 'Adj Close' column, so it is not in conflict when joining other CSV that contains column of the same name.

Finally, join the APPL CSV into our new data frame df1. Drop rows that contain just NaN values. And then we could iterate through other CSV files and join them into df1 data frame.

import pandas as pd

def test_run():
    start_date = '2017-01-01'
    end_date = '2017-01-30'

    dates = pd.date_range(start_date, end_date)

    df1 = pd.DataFrame(index=dates)

    dfAPPL = pd.read_csv("data/APPL.csv",
                        index_col='Date',
                        parse_dates=True,
                        usecols=['Date', 'Adj Close'],
                        na_values=['nan']
                        )
    dfAPPL = dfAPPL.rename(columns={'Adj Close': 'APPL'})

    df1 = df1.join(dfAPPL, how='inner') # left join by default
    df1 = df1.dropna() # drop rows with NaN values

    symbols = ['GOOG', 'IBM']
    for symbol in symbols:
        df_temp = pd.read_csv("data/{}.csv".format(symbol),
                              index_col='Date',
                              parse_dates=True,
                              usecols=['Date', 'Adj Close'],
                              na_values='nan')
        df_temp = df_temp.rename(columns={'Adj Close': symbol})
        df1 = df1.join(df_temp)
    print df1

if __name__ == "__main__":
    test_run()

First, load data only certain data range. That is done by defining what column should be used for indexing, see index_col parameter in read_csv function. Then we provide other parameters for read_csv function (check to find out what are these parameters doing).

https://finance.yahoo.com/quote/AAPL/history?p=AAPL.
read_csv documentation