2.3.7 Loading Data
Data sets often contain different types of data, and may have names associated with the rows or columns. For these reasons, they typically are best accommodated using a data frame. We can think of a data frame as a sequence of arrays of identical length; these are the columns. Entries in the different arrays can be combined to form a row. The pandas library can be used to create and work with data frame objects.
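As a minimal sketch of this idea, a data frame can be built from a dictionary of equal-length columns (the values below are made up for illustration):

```python
import pandas as pd

# A data frame as a sequence of equal-length columns: each dictionary
# value is one column, and entries across columns combine to form rows.
df = pd.DataFrame({'mpg': [18.0, 15.0, 18.0],
                   'cylinders': [8, 8, 8],
                   'name': ['chevelle', 'skylark', 'satellite']})
print(df.shape)   # (3, 3): three rows, three columns
```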
Reading in a Data Set
The first step of most analyses involves importing a data set into Python. Before attempting to load a data set, we must make sure that Python knows where to find the file containing it. If the file is in the same location as this notebook file, then we are all set. Otherwise, the command os.chdir() can be used to change directory. (You will need to call import os before calling os.chdir().)
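A short sketch of checking and changing the working directory; here a temporary folder stands in for wherever the data file actually lives:

```python
import os
import tempfile

# Where does Python currently look for files?
print(os.getcwd())

# Change to the folder containing the data (a temporary directory
# is used here as a hypothetical stand-in for that folder).
target = tempfile.mkdtemp()
os.chdir(target)
print(os.getcwd())
```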
We will begin by reading in Auto.csv, available on the book website. This is a comma-separated file, and can be read in using pd.read_csv():
In [73]: import pandas as pd
Auto = pd.read_csv('Auto.csv')
Auto
The book website also has a whitespace-delimited version of this data, called Auto.data. This can be read in as follows:
In [74]: Auto = pd.read_csv('Auto.data', delim_whitespace=True)
Both Auto.csv and Auto.data are simply text files. Before loading data into Python, it is a good idea to view it using a text editor or other software, such as Microsoft Excel.
We now take a look at the column of Auto corresponding to the variable horsepower:
In [75]: Auto['horsepower']
Out[75]: 0 130.0
1 165.0
2 150.0
3 150.0
4 140.0
...
392 86.0
393 52.0
394 84.0
395 79.0
396 82.0
Name: horsepower, Length: 397, dtype: object
We see that the dtype of this column is object. It turns out that all values of the horsepower column were interpreted as strings when reading in the data. We can find out why by looking at the unique values.
In [76]: import numpy as np
np.unique(Auto['horsepower'])
To save space, we have omitted the output of the previous code block. We see the culprit is the value ?, which is being used to encode missing values.
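To see the effect on a small scale, here is a toy object-dtype column (made-up values) in which a single ? sentinel forces every entry to be read as a string:

```python
import numpy as np
import pandas as pd

# One '?' among the numbers means pandas cannot parse the column as
# numeric, so every entry becomes a string and the dtype is object.
col = pd.Series(['130.0', '165.0', '?', '150.0'])
print(np.unique(col))   # the '?' sentinel appears among the values
```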
To fix the problem, we must provide pd.read_csv() with an argument called na_values. Now, each instance of ? in the file is replaced with the value np.nan, which stands for "not a number":
In [77]: Auto = pd.read_csv('Auto.data',
                            na_values=['?'],
                            delim_whitespace=True)
Auto['horsepower'].sum()
Out[77]: 40952.0
The Auto.shape attribute tells us that the data has 397 observations, or rows, and nine variables, or columns.
In [78]: Auto.shape
Out[78]: (397, 9)
There are various ways to deal with missing data. In this case, since only five of the rows contain missing observations, we choose to use the Auto.dropna() method to simply remove these rows.
In [79]: Auto_new = Auto.dropna()
Auto_new.shape
Out[79]: (392, 9)
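On a toy frame (made-up numbers), dropping incomplete rows and imputing a value, e.g. with a column mean, can be compared side by side:

```python
import numpy as np
import pandas as pd

# A small frame with one missing entry.
df = pd.DataFrame({'mpg': [18.0, np.nan, 16.0],
                   'horsepower': [130.0, 165.0, 150.0]})
dropped = df.dropna()                          # remove incomplete rows
filled = df.fillna({'mpg': df['mpg'].mean()})  # or impute the column mean
print(dropped.shape)        # (2, 2)
print(filled.loc[1, 'mpg']) # 17.0, the mean of 18.0 and 16.0
```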
Basics of Selecting Rows and Columns
We can use Auto.columns to check the variable names.
In [80]: Auto = Auto_new  # overwrite the previous value
Auto.columns
Out[80]: Index(['mpg', 'cylinders', 'displacement', 'horsepower',
        'weight', 'acceleration', 'year', 'origin', 'name'],
       dtype='object')
Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. Recall that the first argument to the [] method is always applied to the rows of the array. Similarly, passing in a slice to the [] method creates a data frame whose rows are determined by the slice:
In [81]: Auto[:3]
Out[81]:     mpg  cylinders  displacement  horsepower  weight  ...
         0  18.0          8         307.0       130.0  3504.0  ...
         1  15.0          8         350.0       165.0  3693.0  ...
         2  18.0          8         318.0       150.0  3436.0  ...
Similarly, an array of Booleans can be used to subset the rows:
In [82]: idx_80 = Auto['year'] > 80
Auto[idx_80]
However, if we pass in a list of strings to the [] method, then we obtain a data frame containing the corresponding set of columns.
In [83]: Auto[['mpg', 'horsepower']]
Out[83]: mpg horsepower
0 18.0 130.0
1 15.0 165.0
2 18.0 150.0
3 16.0 150.0
4 17.0 140.0
.. ... ...
392 27.0 86.0
393 44.0 52.0
394 32.0 84.0
395 28.0 79.0
396 31.0 82.0
[392 rows x 2 columns]
Since we did not specify an index column when we loaded our data frame, the rows are labeled using integers 0 to 396.
In [84]: Auto.index
Out[84]: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
387, 388, 389, 390, 391, 392, 393, 394, 395, 396],
dtype='int64', length=392)
We can use the set_index() method to re-name the rows using the contents of Auto['name'].
In [85]: Auto_re = Auto.set_index('name')
Auto_re
Out[85]: mpg cylinders displacement ...
name ...
chevrolet chevelle malibu 18.0 8 307.0 ...
buick skylark 320 15.0 8 350.0 ...
plymouth satellite 18.0 8 318.0 ...
amc rebel sst 16.0 8 304.0 ...
In [86]: Auto_re.columns
Out[86]: Index(['mpg', 'cylinders', 'displacement', 'horsepower',
'weight', 'acceleration', 'year', 'origin'],
dtype='object')
We see that the column 'name' is no longer there.
Now that the index has been set to name, we can access rows of the data frame by name using the loc[] method of Auto:
In [87]: rows = ['amc rebel sst', 'ford torino']
Auto_re.loc[rows]
Out[87]: mpg cylinders displacement horsepower ...
name ...
amc rebel sst 16.0 8 304.0 150.0 ...
ford torino 17.0 8 302.0 140.0 ...
As an alternative to using the index name, we could retrieve the 4th and 5th rows of Auto_re using the iloc[] method:
In [88]: Auto_re.iloc[[3, 4]]
We can also use it to retrieve the 1st, 3rd, and 4th columns of Auto_re:
In [89]: Auto_re.iloc[:, [0, 2, 3]]
We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using a single call to iloc[]:
In [90]: Auto_re.iloc[[3, 4], [0, 2, 3]]
Out[90]: mpg displacement horsepower
name
amc rebel sst 16.0 304.0 150.0
ford torino 17.0 302.0 140.0
Index entries need not be unique: there are several cars in the data frame named ford galaxie 500.
In [91]: Auto_re.loc['ford galaxie 500', ['mpg', 'origin']]
Out[91]: mpg origin
name
ford galaxie 500 15.0 1
ford galaxie 500 14.0 1
ford galaxie 500 14.0 1
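This behavior is easy to confirm on a toy frame (invented values) whose index repeats a label:

```python
import pandas as pd

# Duplicate index labels: loc[] returns every row that matches.
df = pd.DataFrame({'mpg': [15.0, 14.0, 14.0], 'origin': [1, 1, 1]},
                  index=['ford galaxie 500'] * 3)
print(df.loc['ford galaxie 500'].shape)   # (3, 2): all three rows
```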
More on Selecting Rows and Columns
Suppose now that we want to create a data frame consisting of the weight and origin of the subset of cars with year greater than 80 — i.e. those built after 1980. To do this, we first create a Boolean array that indexes the rows. The loc[] method allows for Boolean entries as well as strings:
In [92]: idx_80 = Auto_re['year'] > 80
Auto_re.loc[idx_80, ['weight', 'origin']]
To do this more concisely, we can use an anonymous function called a lambda:
In [93]: Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']]
The lambda call creates a function that takes a single argument, here df, and returns df['year'] > 80. Since it is created inside the loc[] method for the dataframe Auto_re, that dataframe will be the argument supplied. As another example of using a lambda, suppose that we want all cars built after 1980 that achieve greater than 30 miles per gallon:
In [94]: Auto_re.loc[lambda df: (df['year'] > 80) & (df['mpg'] > 30),
['weight', 'origin']
]
The symbol & computes an element-wise and operation. As another example, suppose that we want to retrieve all Ford and Datsun cars with displacement less than 300. We check whether each name entry contains either the string ford or datsun using the str.contains() method of the index attribute of the dataframe:
In [95]: Auto_re.loc[lambda df: (df['displacement'] < 300)
& (df.index.str.contains('ford')
| df.index.str.contains('datsun')),
['weight', 'origin']
]
Here, the symbol | computes an element-wise or operation.
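The element-wise behavior of str.contains() and | can be seen in isolation on a small hand-built index:

```python
import pandas as pd

# str.contains() tests each index entry for a substring; the results
# are Boolean arrays that | combines element-wise.
idx = pd.Index(['ford torino', 'datsun pl510', 'toyota corolla'])
mask = idx.str.contains('ford') | idx.str.contains('datsun')
print([bool(x) for x in mask])   # [True, True, False]
```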
In summary, a powerful set of operations is available to index the rows and columns of data frames. For integer-based queries, use the iloc[] method. For string and Boolean selections, use the loc[] method. For functional queries that filter rows, use the loc[] method with a function (typically a lambda) in the rows argument.
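These three access styles can be contrasted on a toy two-row frame (hypothetical values), each reaching the same cell:

```python
import pandas as pd

df = pd.DataFrame({'mpg': [16.0, 17.0], 'weight': [3433, 3449]},
                  index=['amc rebel sst', 'ford torino'])

by_position = df.iloc[1, 0]                           # integer-based
by_label = df.loc['ford torino', 'mpg']               # label-based
by_filter = df.loc[lambda d: d['mpg'] > 16.5, 'mpg']  # functional
print(by_position, by_label, by_filter.iloc[0])       # all 17.0
```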