Data Frames
6 min readSep 3, 2018
In my previous blog, I talked about using the Series Data Structure in Pandas. Feel free to have a look at that blog.
This blog post will be a brief walk through of Data Frames. Data Frame is one of the most widely used tool in python by Data Scientists/Analyst.
A DataFrame can be considered as a data table. Data Frame provides a wide range of functionality like
- filtering
- grouping
- sorting
- joins
- merging
Let’s start with creating a data frame
#importing pandas and DataFrame
import pandas as pd
from pandas import DataFrame#constructor to create a data frame
#df=DataFrame( data, index, columns, dtype, copy)
#Lists, dict, Series, Numpy ndarrays, Another DataFrame#creating an empty data frame
df=DataFrame()#creating a data frame form an array
df=DataFrame([1,2,3,4,5,5])
print dfOutput:0
0 1
1 2
2 3
3 4
4 5
5 5# creating a dataframe with an array of array
# each inner array represents a row
d2=[[1,2,3,4],
['a','b','c','d']]
df= DataFrame(d2)
print dfOutput:0 1 2 3
0 1 2 3 4
1 a b c d#creating a datframe from a dict
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print dfOutput:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN# creating a df from a list of dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print dfOutput:a b c
0 1 2 NaN
1 5 10 20.0
Now that we know how to create a dataframe. Let’s try some functions and statistics with Data Frames
#create a data frame
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print dfOutput:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN#understand the data
#gives brief statistics about the data like mean,count,standard deviation, percentile limits
print df.describe()Output:Age
count 4.000000
mean 33.250000
std 6.396614
min 28.000000
25% 28.750000
50% 31.500000
75% 36.000000
max 42.000000#gives information about the data frame like numbr of columns,adta types memory usage
print df.info()
Output:<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Age 4 non-null int64
Name 4 non-null object
a 0 non-null object
dtypes: int64(1), object(2)
memory usage: 168.0+ bytes
None#to get all the columns
print df.columns
print "---------"
# to print all the values
print df.values
print "---------"
# to print the index
print df.indexOutput:Index([u'Age', u'Name', u'a'], dtype='object')
---------
[[28 'Tom' nan]
[34 'Jack' nan]
[29 'Steve' nan]
[42 'Ricky' nan]]
---------
RangeIndex(start=0, stop=4, step=1)
Indexing in DataFrame
# selecting a column from the dataframe
print df['Name']
print "----------"#selecting multiple columns
print df[['Name','Age']]
print "-----------"# adding a new column
df['Gender']=['M','M','M','M']
print df
print "------------"#deleting a column from data frame
del df['Gender']
# or df.pop('a')Output:0 Tom
1 Jack
2 Steve
3 Ricky
Name: Name, dtype: object
----------
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
-----------
Age Name a Gender
0 28 Tom NaN M
1 34 Jack NaN M
2 29 Steve NaN M
3 42 Ricky NaN M
------------
For selecting values from a DataFrame we have 3 functions
- iloc = For accessing the data frame with integer indexing
- loc = For accessing the dataframe with label indexing
- ix = For accessing teh dataframe with mixture of integer and label indexing
# selecting a row by integer index
df.iloc[1,1]
Output:'Jack'# selecting a row by label index
df.loc[1,"Name"]
Output:'Jack'# selecting a row by integer index
df.ix[3,1]
Output:'Ricky'
Iterating over the Data Frame Rows
for row in df.iterrows():
#returns the tuples with (label,row) mapping
#check the data type of tuple keys
print "return type of iterrows ",type(row[0])
print "values of tupe are of type ",type(row[1])
break
for row in df.iterrows():
print "Name is : ",row[1]['Name']
print "Age is : ",row[1]['Age']Output:return type of iterrows <type 'numpy.int64'>
values of tupe are of type <class 'pandas.core.series.Series'>
Name is : Tom
Age is : 28
Name is : Jack
Age is : 34
Name is : Steve
Age is : 29
Name is : Ricky
Age is : 42# returns the columns with values in columns
for row in df.iteritems():
print row[0]
print row[1]Output:Age
rank1 28
rank3 34
rank2 29
rank4 42
Name: Age, dtype: int64
Name
rank1 Tom
rank3 Jack
rank2 Steve
rank4 Ricky
Name: Name, dtype: object
a
rank1 NaN
rank3 NaN
rank2 NaN
rank4 NaN
Name: a, dtype: object# gives rows as tuples
for row in df.itertuples():
print rowOutput:Pandas(Index='rank1', Age=28, Name='Tom', a=nan)
Pandas(Index='rank3', Age=34, Name='Jack', a=nan)
Pandas(Index='rank2', Age=29, Name='Steve', a=nan)
Pandas(Index='rank4', Age=42, Name='Ricky', a=nan)
Reindexing a DataFrame
Reindexing a dataframe creates a new Dataframe with the desired index from the orignal dataframe.
print df
df1=df.reindex(index=[1,2],columns=['Name','Age'])
df1Output:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN#changing the index values of a Data Frame
df.index=['rank1','rank3','rank2','rank4']
Sorting a DataFrame
Sorting can be done in 2 ways
- Sort by indexes
- Sort y values
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
#sorting can be done in
#sorting based on index
print unsorted_df.sort_index()
print "-----------------------------"
print unsorted_df.sort_index(ascending=False)Output:col2 col1
0 0.991140 -1.203227
1 -1.335899 1.494410
2 0.182436 0.167360
3 0.379386 -0.028920
4 -0.076693 1.179620
5 0.418442 -0.282703
6 -0.136885 0.812825
7 0.808891 0.336993
8 2.177074 1.805119
9 1.240525 0.104993
-----------------------------
col2 col1
9 1.240525 0.104993
8 2.177074 1.805119
7 0.808891 0.336993
6 -0.136885 0.812825
5 0.418442 -0.282703
4 -0.076693 1.179620
3 0.379386 -0.028920
2 0.182436 0.167360
1 -1.335899 1.494410
0 0.991140 -1.203227#sorting based on values
#meathod takes a column name or list of columns as input
unsorted_df.sort_values('col1')
unsorted_df.sort_values(['col1','col2'])#### Statistical Functions with DataFrame#calculate the percent change on rolling window of 1
print unsorted_df.pct_change(1)
print "---------------"
print "correlation matrix"
print unsorted_df.corr()Output:col2 col1
1 NaN NaN
4 -0.942591 -0.210645
6 0.784835 -0.310944
2 -2.332776 -0.794100
3 1.079551 -1.172798
5 0.102945 8.775505
9 1.964630 -1.371389
8 0.754962 16.192776
0 -0.544738 -1.666564
7 -0.183878 -1.280075
---------------
correlation matrix
col2 col1
col2 1.000000 -0.157285
col1 -0.157285 1.000000#rolling functions are window based functions which can be applied on a a set of rows
# calclulate the mean based on every 4 rows in a data frameprint df.rolling(window=2).mean()
print df.rolling(window=2).sum()#applying custome function to adta frame
def fun(s):
print s
return s[0]
print df.rolling(window=2).agg(fun)Output:Age Name a
rank1 NaN Tom NaN
rank3 31.0 Jack NaN
rank2 31.5 Steve NaN
rank4 35.5 Ricky NaN
Age Name a
rank1 NaN Tom NaN
rank3 62.0 Jack NaN
rank2 63.0 Steve NaN
rank4 71.0 Ricky NaN
[28. 34.]
[34. 29.]
[29. 42.]
Age Name a
rank1 NaN Tom NaN
rank3 28.0 Jack NaN
rank2 34.0 Steve NaN
rank4 29.0 Ricky NaN
Grouping a DataFrame
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
#group by creates groups of the data frame rows
print df.groupby('Team').groups#printing the group names
for name,group in df.groupby('Team'):
print name
print groupOutput:{'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Devils
Points Rank Team Year
2 863 2 Devils 2014
3 673 3 Devils 2015
Kings
Points Rank Team Year
4 741 3 Kings 2014
6 756 1 Kings 2016
7 788 1 Kings 2017
Riders
Points Rank Team Year
0 876 1 Riders 2014
1 789 2 Riders 2015
8 694 2 Riders 2016
11 690 2 Riders 2017
Royals
Points Rank Team Year
9 701 4 Royals 2014
10 804 1 Royals 2015
kings
Points Rank Team Year
5 812 4 kings 2015#applying sum function in a dataframe
print df.groupby('Team').sum()#applying sum function on a column
print df.groupby('Team')['Points'].sum()
#or
print df.groupby('Team')['Points'].agg(np.sum)Output:Points Rank Year
Team
Devils 1536 5 4029
Kings 2285 5 6047
Riders 3049 7 8062
Royals 1505 5 4029
kings 812 4 2015
Team
Devils 1536
Kings 2285
Riders 3049
Royals 1505
kings 812
Name: Points, dtype: int64
Team
Devils 1536
Kings 2285
Riders 3049
Royals 1505
kings 812
Name: Points, dtype: int64# applying multiple functions
print df.groupby(['Rank'])['Points'].agg([np.sum, np.mean, np.std])
print df.groupby(['Rank','Team'])['Points'].agg([np.sum, np.mean, np.std])Output:sum mean std
Rank
1 3224 806.0 50.754310
2 3036 759.0 83.070251
3 1414 707.0 48.083261
4 1513 756.5 78.488853
sum mean std
Rank Team
1 Kings 1544 772.000000 22.627417
Riders 876 876.000000 NaN
Royals 804 804.000000 NaN
2 Devils 863 863.000000 NaN
Riders 2173 724.333333 56.038677
3 Devils 673 673.000000 NaN
Kings 741 741.000000 NaN
4 Royals 701 701.000000 NaN
kings 812 812.000000 NaN