Working with Python Pandas data frame
In the last article of this series I talked about how we can create python pandas data frames. Now in this article I will discuss about how we can do various operations on pandas data frames. Let me mention that Python pandas has two type of data structures — pandas series and pandas data frames. The first ones are one dimensional objects like vectors (with subtle differences) and the second are two dimensional objects — one dimension we can identify with the rows and another with the columns. In the article we will learn the followings:
1. Extract columns from data frame.
2. Drop illegal entries.
3. Extract rows on the basis of some conditions.
4. Copy some of the columns to a new data frame.
5. Extract a set of rows.
6. Add an extra column.
7. Iterate over the rows.
8. Drop selected rows or columns.
9. Modify a column.
10. Change the order of the columns.
11. Apply some operation on a column.
12. Change the type of all or some columns.
For this tutorial I will use a data frame with three columns — ‘Class’, ‘Boys’ and ‘Girls’ with six rows representing six classes. I create this data frame with the following commands:
>>> import pandas as pd
>>> df=pd.DataFrame(columns=[‘Class’,’Boys’,’Girls’])
>>> df[‘Class’]=[1,2,3,4,5,6]
>>> df[‘Boys’]=[23,12,13,32,32,-1]
>>> df[‘Girls’]=[43,12,””,32,11,0]
>>> dfClass Boys Girls
0 1 23 43
1 2 12 12
2 3 13
3 4 32 32
4 5 32 11
5 6 -1 0
>>>
Note that I have deliberately put one of the entries -1 (Boys in class 5) and another blank (Girls in class 2).
Two of the most important properties of a pandas data frame are its shape and the name of the columns. In order to explore and manipulate a data frame we need these properties and we can get these with the following commands:
>>> df.shape
(6, 3)
>>> df.columns
Index([‘Class’, ‘Boys’, ‘Girls’], dtype=’object’)
>>>
The first one gives the shape (number of rows, number of columns) of the data frame and the second one gives the name of the columns. Note that ‘df.columns’ is a python list so we can apply any operation a list supports on it.
1. Extracting a column with given column name from a data frame :
We can extract a data frame in many different data structures such as -
a) Python list
>>> x = df[‘Boys’].to_list()
>>> x
[23, 12, 13, 32, 32, -1]
>>> type(x)
<class ‘list’>
>>>
b) Numpy array
>>> x = df[‘Boys’].to_numpy()
>>> x
array([23, 12, 13, 32, 32, -1])
>>> type(x)
<class ‘numpy.ndarray’>
>>>
c) Pandas series
>>> x = df[‘Boys’]
>>> x
0 23
1 12
2 13
3 32
4 32
5 -1
Name: Boys, dtype: int64
>>> type(x)
<class ‘pandas.core.series.Series’>
>>>
2. Dropping ‘nan’:
One of the problems with many data frames is that they have a lot of invalid entries and we can either drop the rows with ‘nan’ or replace ‘nan’ with something. Since our data frame does not have ‘nan’ so let us make one of the elements — number of boys in class ‘2’ as nan.
>>> df.at[2,’Girls’] = np.nan
>>> df
Class Boys Girls
0 1 23 43
1 2 12 12
2 3 13 NaN
3 4 32 32
4 5 32 11
5 6 -1 0
>>>Now we can drop the column with ‘NaN’df = df.dropna()>>> df=df.dropna()
>>> df
Class Boys Girls
0 1 23 43
1 2 12 12
3 4 32 32
4 5 32 11
5 6 -1
Note that ’nan’ is defined in ‘Numpy’ so before you try this you must import numpy as:
import numpy as np
You can see that the column with ‘NaN’ entry is gone.
3. Get the rows which satisfy some condition.
For example if we want to get only those rows (in a data frame) from a data frame which have ‘Boys’ positive then we can do:
>>> df = df[df[‘Boys’] > 0]
>>> df
Class Boys Girls
0 1 23 43.0
1 2 12 12.0
2 3 13 NaN
3 4 32 32.0
4 5 32 11.0
>>> df = df[df[‘Girls’] > 0]
>>> dfClass Boys Girls
0 1 23 43.0
1 2 12 12.03
4 32 32.0
4 5 32 11.0
>>>
4. Copying some column to another data frame:
>>> df
Class Boys Girls
0 1 23 43.0
1 2 12 12.0
3 4 32 32.0
4 5 32 11.0>>> df1=df[[‘Class’,’Boys’]].copy()
>>> df1
Class Boys
0 1 23
1 2 12
3 4 32
4 5 32
>>>
5. Select only some rows
>>> df
Class Boys Girls
0 1 23 43.0
1 2 12 12.0
3 4 32 32.0
4 5 32 11.0>>> df2 = df.iloc[2:4]
>>> df2
Class Boys Girls
3 4 32 32.0
4 5 32 11.0
>>>
6. Add extra columns with data from existing columns:
>>> df
Class Boys Girls
0 1 23 43.0
1 2 12 12.0
3 4 32 32.0
4 5 32 11.0>>> df[‘Total’] = df[‘Boys’] + df[‘Girls’]
>>> df
Class Boys Girls Total
0 1 23 43.0 66.0
1 2 12 12.0 24.0
3 4 32 32.0 64.0
4 5 32 11.0 43.0>>>
7. Iterating over rows :
>>> for index, row in df.iterrows():
print(index, row[‘Class’], row[‘Boys’])0 1.0 23.0
1 2.0 12.0
3 4.0 32.0
4 5.0 32.0
>>>
8. Dropping selected rows or columns:
>>> df
Class Boys Girls Total
0 1 23 43.0 66.0
1 2 12 12.0 24.0
3 4 32 32.0 64.0
4 5 32 11.0 43.0Drop the row with index ‘1’>>> df1=df.drop(1, axis=0)
>>> df1
Class Boys Girls Total
0 1 23 43.0 66.0
3 4 32 32.0 64.0
4 5 32 11.0 43.0Drop the column with name ‘Boys’>>> df2=df.drop(‘Boys’, axis=1)
>>> df2Class Girls Total
0 1 43.0 66.0
1 2 12.0 24.0
3 4 32.0 64.0
4 5 11.0 43.0>>>
9. Add fixed number to a column :
>>> df[‘Boys’] = df[‘Boys’] + 10
>>> dfClass Boys Girls Total
0 1 33 43.0 66.0
1 2 22 12.0 24.0
3 4 42 32.0 64.0
4 5 42 11.0 43.0>>>
10. Change the order of the columns:
>>> df1=df[[‘Class’,’Girls’,’Boys’,’Total’]]
>>> df1Class Girls Boys Total
0 1 43.0 33 66.0
1 2 12.0 22 24.0
3 4 32.0 42 64.0
4 5 11.0 42 43.0>>> dfClass Boys Girls Total
0 1 33 43.0 66.0
1 2 22 12.0 24.0
3 4 42 32.0 64.0
4 5 42 11.0 43.0>>>
11. Apply some operation on all the rows of a column:
>>> df1[‘Boys’] = df1[‘Boys’].apply(lambda x : x*x+100)
>>> df1
Class Girls Boys Total
0 1 43.0 1189 66.0
1 2 12.0 584 24.0
3 4 32.0 1864 64.0
4 5 11.0 1864 43.0>>>
12. Change the type of all the elements of a data frame :
>>> df1
Class Girls Boys Total
0 1 43.0 34.481879 66.0
1 2 12.0 24.166092 24.0
3 4 32.0 43.174066 64.0
4 5 11.0 43.174066 43.0
>>> df1 = df1.astype(int)
>>> df1
Class Girls Boys Total
0 1 43 34 66
1 2 12 24 24
3 4 32 43 64
4 5 11 43 43
>>>
In this article I have discussed some common tasks I generally do with python pandas data frames. In the next article of the series I will discuss more complex operations on data frames.
If you find the article useful, please like, share and comment.