pandas will be a major tool of interest throughout much of the rest of the
book. It contains data structures and data manipulation tools designed to
make data cleaning and analysis fast and easy in Python. pandas is often
used in tandem with numerical computing tools like NumPy and SciPy,
analytical libraries like statsmodels and scikit-learn, and data
visualization libraries like matplotlib. pandas adopts significant parts of
NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without
for loops.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
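To make the contrast concrete, here is a minimal sketch (the variable names are mine, not from this chapter) showing how a NumPy array coerces mixed inputs to a single dtype, while a DataFrame keeps a separate dtype per column:

```python
import numpy as np
import pandas as pd

# NumPy is homogeneous: mixing an int, a string, and a float
# coerces everything to a single (string) dtype
arr = np.array([1, "a", 2.5])

# pandas is heterogeneous: each column retains its own dtype
df = pd.DataFrame({"num": [1, 2], "label": ["a", "b"]})
```

Inspecting `arr.dtype` versus `df.dtypes` shows the difference directly.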
Since becoming an open source project in 2010, pandas has matured into a quite large library that’s applicable in a broad set of real-world use cases. The developer community has grown to over 800 distinct contributors, who’ve been helping build the project as they’ve used it to solve their day-to-day data problems.
Throughout the rest of the book, I use the following import convention for pandas:
In [1]: import pandas as pd
Thus, whenever you see pd. in code,
it’s referring to pandas. You may also find it easier to import Series and
DataFrame into the local namespace since they are so frequently used:
In [2]: from pandas import Series, DataFrame
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
In [11]: obj = pd.Series([4, 7, -5, 3])

In [12]: obj
Out[12]: 
0    4
1    7
2   -5
3    3
dtype: int64
The string representation of a Series displayed interactively
shows the index on the left and the values on the right. Since we did
not specify an index for the data, a default one consisting of the
integers 0 through N - 1 (where
N is the length of the data) is created. You can get
the array representation and index object of the Series via its values
and index attributes, respectively:
In [13]: obj.values
Out[13]: array([ 4,  7, -5,  3])

In [14]: obj.index  # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)
Often it will be desirable to create a Series with an index identifying each data point with a label:
In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [16]: obj2
Out[16]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
In [18]: obj2['a']
Out[18]: -5

In [19]: obj2['d'] = 6

In [20]: obj2[['c', 'a', 'd']]
Out[20]: 
c    3
a   -5
d    6
dtype: int64
Here ['c', 'a', 'd'] is interpreted as a list
of indices, even though it contains strings instead of integers.
Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
In [21]: obj2[obj2 > 0]
Out[21]: 
d    6
b    7
c    3
dtype: int64

In [22]: obj2 * 2
Out[22]: 
d    12
b    14
a   -10
c     6
dtype: int64

In [23]: np.exp(obj2)
Out[23]: 
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:
In [24]: 'b' in obj2
Out[24]: True

In [25]: 'e' in obj2
Out[25]: False
Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [27]: obj3 = pd.Series(sdata)

In [28]: obj3
Out[28]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
When you are only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order. You can override this by passing the dict keys in the order you want them to appear in the resulting Series:
In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [30]: obj4 = pd.Series(sdata, index=states)

In [31]: obj4
Out[31]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
Here, three values found in sdata were placed in the appropriate
locations, but since no value for 'California' was found, it appears as NaN (not a number),
which is considered in pandas to mark missing or NA
values. Since 'Utah' was not included in
states, it is excluded from the resulting
object.
I will use the terms “missing” or “NA” interchangeably to refer to
missing data. The isnull and notnull functions
in pandas should be used to detect missing data:
In [32]: pd.isnull(obj4)
Out[32]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [33]: pd.notnull(obj4)
Out[33]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
Series also has these as instance methods:
In [34]: obj4.isnull()
Out[34]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
I discuss working with missing data in more detail in Chapter 7.
A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:
In [35]: obj3
Out[35]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [36]: obj4
Out[36]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [37]: obj3 + obj4
Out[37]: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation.
Both the Series object itself and its index have a name attribute, which
integrates with other key areas of pandas functionality:
In [38]: obj4.name = 'population'

In [39]: obj4.index.name = 'state'

In [40]: obj4
Out[40]: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
A Series’s index can be altered in-place by assignment:
In [41]: obj
Out[41]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [42]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [43]: obj
Out[43]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.
While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas.
There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order:
In [45]: frame
Out[45]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003
If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.
For large DataFrames, the head method
selects only the first five rows:
In [46]: frame.head()
Out[46]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                       index=['one', 'two', 'three', 'four',
   ....:                              'five', 'six'])

In [49]: frame2
Out[49]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
In [51]: frame2['state']
Out[51]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [52]: frame2.year
Out[52]: 
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
Attribute-like access (e.g., frame2.year) and
tab completion of column names in IPython are provided as a
convenience.
frame2[column] works for any column name, but
frame2.column only works when the column name is a
valid Python variable name.
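For example (the column names here are my own illustration), a column whose name contains a space can only be reached with bracket notation:

```python
import pandas as pd

frame = pd.DataFrame({"valid_name": [1, 2], "not valid": [3, 4]})

frame.valid_name       # attribute access works for a valid identifier
frame["not valid"]     # bracket notation is the only option here
```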
Note that the returned Series have the same index as the
DataFrame, and their name attribute has been
appropriately set.
Rows can also be retrieved by position or name with the
special loc attribute
(much more on this later):
In [53]: frame2.loc['three']
Out[53]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
Columns can be modified by assignment. For example, the empty
'debt' column could be assigned a
scalar value or an array of values:
In [54]: frame2['debt'] = 16.5

In [55]: frame2
Out[55]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

In [56]: frame2['debt'] = np.arange(6.)

In [57]: frame2
Out[57]: 
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [59]: frame2['debt'] = val

In [60]: frame2
Out[60]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
Assigning a column that doesn’t exist will create a new column.
The del keyword will delete columns as with a dict.
As an example of del, I first add a new column
of boolean values where the state column equals
'Ohio':
In [61]: frame2['eastern'] = frame2.state == 'Ohio'

In [62]: frame2
Out[62]: 
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
New columns cannot be created with the
frame2.eastern syntax.
The del method can then be used to remove this column:
In [63]: del frame2['eastern']

In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
The column returned from indexing a DataFrame is a
view on the underlying data, not a copy. Thus,
any in-place modifications to the Series will be reflected in the
DataFrame. The column can be explicitly copied with the Series’s copy
method.
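The safe pattern implied here can be sketched as follows (the names are mine; this shows only that an explicit copy owns its own data, since the exact view semantics depend on the pandas version):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Copy the column explicitly before modifying the DataFrame;
# the copy is unaffected by later changes to frame
copied = frame["a"].copy()
frame.loc[0, "a"] = -99.0
```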
Another common form of data is a nested dict of dicts:
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:
In [66]: frame3 = pd.DataFrame(pop)

In [67]: frame3
Out[67]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:
In [68]: frame3.T
Out[68]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The keys in the inner dicts are combined and sorted to form the index in the result. This isn’t true if an explicit index is specified:
In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[69]: 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
Dicts of Series are treated in much the same way:
In [70]: pdata = {'Ohio': frame3['Ohio'][:-1],
   ....:          'Nevada': frame3['Nevada'][:2]}

In [71]: pd.DataFrame(pdata)
Out[71]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
For a complete list of things you can pass the DataFrame constructor, see Table 5-1.
If a DataFrame’s index and
columns have their name attributes set, these will also be
displayed:
In [72]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [73]: frame3
Out[73]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
As with Series, the values
attribute returns the data contained in the DataFrame as a
two-dimensional ndarray:
In [74]: frame3.values
Out[74]: 
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:
In [75]: frame2.values
Out[75]: 
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:
In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [77]: index = obj.index

In [78]: index
Out[78]: Index(['a', 'b', 'c'], dtype='object')

In [79]: index[1:]
Out[79]: Index(['b', 'c'], dtype='object')
Index objects are immutable and thus can’t be modified by the user:
index[1] = 'd'  # TypeError
Immutability makes it safer to share Index objects among data structures:
In [80]: labels = pd.Index(np.arange(3))

In [81]: labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')

In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [83]: obj2
Out[83]: 
0    1.5
1   -2.5
2    0.0
dtype: float64

In [84]: obj2.index is labels
Out[84]: True
Some users will not often take advantage of the capabilities provided by indexes, but because some operations will yield results containing indexed data, it’s important to understand how they work.
In addition to being array-like, an Index also behaves like a fixed-size set:
In [85]: frame3
Out[85]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [86]: frame3.columns
Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [87]: 'Ohio' in frame3.columns
Out[87]: True

In [88]: 2003 in frame3.index
Out[88]: False
Unlike Python sets, a pandas Index can contain duplicate labels:
In [89]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
Selections with duplicate labels will select all occurrences of that label.
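For instance, a quick sketch (my own data): indexing a Series whose index repeats a label returns a Series containing every matching row, not a scalar:

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=["foo", "foo", "bar", "bar"])

# Selecting 'foo' returns both rows labeled 'foo'
hits = obj["foo"]
```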
Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2.
This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we’ll focus on the most important features, leaving the less common (i.e., more esoteric) things for you to explore on your own.
An important method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index. Consider an
example:
In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [92]: obj
Out[92]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
Calling reindex on this Series
rearranges the data according to the new index, introducing missing
values if any index values were not already present:
In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [94]: obj2
Out[94]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
For ordered data like time series, it may be desirable to do some
interpolation or filling of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:
In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [96]: obj3
Out[96]: 
0      blue
2    purple
4    yellow
dtype: object

In [97]: obj3.reindex(range(6), method='ffill')
Out[97]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
With DataFrame, reindex can
alter either the (row) index, columns, or both. When passed only a
sequence, it reindexes the rows in the result:
In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
   ....:                      index=['a', 'c', 'd'],
   ....:                      columns=['Ohio', 'Texas', 'California'])

In [99]: frame
Out[99]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [101]: frame2
Out[101]: 
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
The columns can be reindexed with the columns keyword:
In [102]: states = ['Texas', 'Utah', 'California']

In [103]: frame.reindex(columns=states)
Out[103]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
See Table 5-3 for more about the
arguments to reindex.
As we’ll explore in more detail, you can reindex more
succinctly by label-indexing with loc, and many users prefer to use it
exclusively:
In [104]: frame.loc[['a', 'b', 'c', 'd'], states]
Out[104]: 
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
Dropping one or more entries from an axis is easy if you already have an index
array or list without those entries. As that can require a bit of
munging and set logic, the drop method will
return a new object with the indicated value or values deleted from an
axis:
In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [106]: obj
Out[106]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [107]: new_obj = obj.drop('c')

In [108]: new_obj
Out[108]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [109]: obj.drop(['d', 'c'])
Out[109]: 
a    0.0
b    1.0
e    4.0
dtype: float64
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])

In [111]: data
Out[111]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Calling drop with a sequence of labels will
drop values from the row labels (axis 0):
In [112]: data.drop(['Colorado', 'Ohio'])
Out[112]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
You can drop values from the columns by passing
axis=1 or axis='columns':
In [113]: data.drop('two', axis=1)
Out[113]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [114]: data.drop(['two', 'four'], axis='columns')
Out[114]: 
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
Many functions, like drop, which modify the
size or shape of a Series or DataFrame, can manipulate an object
in-place without returning a new object:
In [115]: obj.drop('c', inplace=True)

In [116]: obj
Out[116]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
Be careful with inplace, as it destroys any
data that is dropped.
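When in doubt, a cautious sketch (names are mine) is to rely on the default behavior, which returns a new object and leaves the original untouched:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=list("abcde"))

# Default drop: returns a new Series; obj keeps all five entries
new_obj = obj.drop("c")
```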
Series indexing (obj[...])
works analogously to NumPy array indexing, except you can use
the Series’s index values instead of only integers. Here are some
examples of this:
In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [118]: obj
Out[118]: 
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [119]: obj['b']
Out[119]: 1.0

In [120]: obj[1]
Out[120]: 1.0

In [121]: obj[2:4]
Out[121]: 
c    2.0
d    3.0
dtype: float64

In [122]: obj[['b', 'a', 'd']]
Out[122]: 
b    1.0
a    0.0
d    3.0
dtype: float64

In [123]: obj[[1, 3]]
Out[123]: 
b    1.0
d    3.0
dtype: float64

In [124]: obj[obj < 2]
Out[124]: 
a    0.0
b    1.0
dtype: float64
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
In [125]: obj['b':'c']
Out[125]: 
b    1.0
c    2.0
dtype: float64
Setting using these methods modifies the corresponding section of the Series:
In [126]: obj['b':'c'] = 5

In [127]: obj
Out[127]: 
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:
In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])

In [129]: data
Out[129]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [130]: data['two']
Out[130]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [131]: data[['three', 'one']]
Out[131]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:
In [132]: data[:2]
Out[132]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [133]: data[data['three'] > 5]
Out[133]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
The row selection syntax data[:2] is provided
as a convenience. Passing a single element or a list to the
[] operator selects columns.
Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:
In [134]: data < 5
Out[134]: 
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [135]: data[data < 5] = 0

In [136]: data
Out[136]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
For DataFrame label-indexing on the rows, I introduce the
special indexing operators loc and
iloc. They enable you to select a subset of the
rows and columns from a DataFrame with NumPy-like notation using
either axis labels (loc) or integers
(iloc).
As a preliminary example, let’s select a single row and multiple columns by label:
In [137]: data.loc['Colorado', ['two', 'three']]
Out[137]: 
two      5
three    6
Name: Colorado, dtype: int64
We’ll then perform some similar selections with integers using
iloc:
In [138]: data.iloc[2, [3, 0, 1]]
Out[138]: 
four    11
one      8
two      9
Name: Utah, dtype: int64

In [139]: data.iloc[2]
Out[139]: 
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [140]: data.iloc[[1, 2], [3, 0, 1]]
Out[140]: 
          four  one  two
Colorado     7    0    5
Utah        11    8    9
Both indexing functions work with slices in addition to single labels or lists of labels:
In [141]: data.loc[:'Utah', 'two']
Out[141]: 
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [142]: data.iloc[:, :3][data.three > 5]
Out[142]: 
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you’ll see later, there are a number of additional options for working with hierarchical indexes.
When originally designing pandas, I felt that having to type
frame[:, col] to select a column
was too verbose (and error-prone), since column selection is one of
the most common operations. I made the design trade-off to push all
of the fancy indexing behavior (both labels and integers) into the
ix operator. In practice, this
led to many edge cases in data with integer axis labels, so the
pandas team decided to create the loc and
iloc operators to deal with strictly label-based
and integer-based indexing, respectively.
The ix indexing operator still exists, but
it is deprecated. I do not recommend using it.
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
ser = pd.Series(np.arange(3.))
ser
ser[-1]
In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:
In [144]: ser
Out[144]: 
0    0.0
1    1.0
2    2.0
dtype: float64
On the other hand, with a non-integer index, there is no potential for ambiguity:
In [145]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [146]: ser2[-1]
Out[146]: 2.0
To keep things consistent, if you have an axis index containing
integers, data selection will always be label-oriented. For more precise
handling, use loc (for labels) or
iloc (for integers):
In [147]: ser[:1]
Out[147]: 
0    0.0
dtype: float64

In [148]: ser.loc[:1]
Out[148]: 
0    0.0
1    1.0
dtype: float64

In [149]: ser.iloc[:1]
Out[149]: 
0    0.0
dtype: float64
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:
In [150]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [151]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
   .....:                index=['a', 'c', 'e', 'f', 'g'])

In [152]: s1
Out[152]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [153]: s2
Out[153]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
Adding these together yields:
In [154]: s1 + s2
Out[154]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
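A small sketch of that propagation (my own toy data): once a non-overlapping label produces NaN, every subsequent operation on the result keeps it NaN:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0], index=["a"])

total = s1 + s2     # 'b' has no match in s2, so total['b'] is NaN
doubled = total * 2  # the NaN propagates through further arithmetic
```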
In the case of DataFrame, alignment is performed on both the rows and the columns:
In [155]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   .....:                    index=['Ohio', 'Texas', 'Colorado'])

In [156]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [157]: df1
Out[157]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [158]: df2
Out[158]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
In [159]: df1 + df2
Out[159]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
Since the 'c' and 'e'
columns are not found in both DataFrame objects, they appear as all
missing in the result. The same holds for the rows whose labels are not
common to both objects.
If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
In [160]: df1 = pd.DataFrame({'A': [1, 2]})

In [161]: df2 = pd.DataFrame({'B': [3, 4]})

In [162]: df1
Out[162]: 
   A
0  1
1  2

In [163]: df2
Out[163]: 
   B
0  3
1  4

In [164]: df1 - df2
Out[164]: 
    A   B
0 NaN NaN
1 NaN NaN
In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
In [165]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
   .....:                    columns=list('abcd'))

In [166]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
   .....:                    columns=list('abcde'))

In [167]: df2.loc[1, 'b'] = np.nan

In [168]: df1
Out[168]: 
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

In [169]: df2
Out[169]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
Adding these together results in NA values in the locations that don’t overlap:
In [170]: df1 + df2
Out[170]: 
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
Using the add method on
df1, I pass df2 and an argument to fill_value:
In [171]: df1.add(df2, fill_value=0)
Out[171]: 
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
See Table 5-5 for a listing of Series
and DataFrame methods for arithmetic. Each of them has a counterpart,
starting with the letter r, that has arguments
flipped. So these two statements are equivalent:
In [172]: 1 / df1
Out[172]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909

In [173]: df1.rdiv(1)
Out[173]: 
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
In [174]: df1.reindex(columns=df2.columns, fill_value=0)
Out[174]: 
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
In [175]: arr = np.arange(12.).reshape((3, 4))

In [176]: arr
Out[176]: 
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [177]: arr[0]
Out[177]: array([0., 1., 2., 3.])

In [178]: arr - arr[0]
Out[178]: 
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
When we subtract arr[0] from
arr, the subtraction is performed once for each
row. This is referred to as broadcasting and is
explained in more detail as it relates to general NumPy arrays in
Appendix A. Operations between a DataFrame and
a Series are similar:
In [179]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
   .....:                      columns=list('bde'),
   .....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [180]: series = frame.iloc[0]

In [181]: frame
Out[181]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [182]: series
Out[182]: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:
In [183]: frame - series
Out[183]: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:
In [184]: series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [185]: frame + series2
Out[185]: 
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
In [186]: series3 = frame['d']

In [187]: frame
Out[187]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [188]: series3
Out[188]: 
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [189]: frame.sub(series3, axis='index')
Out[189]: 
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
The axis number that you pass is the axis to match
on. In this case we mean to match on the DataFrame’s row
index (axis='index' or axis=0)
and broadcast across.
NumPy ufuncs (element-wise array methods) also work with pandas objects:
In [190]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
   .....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [191]: frame
Out[191]: 
               b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221

In [192]: np.abs(frame)
Out[192]: 
               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221
Another frequent operation is applying a function on
one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:
In [193]: f = lambda x: x.max() - x.min()

In [194]: frame.apply(f)
Out[194]: 
b    1.802165
d    1.684034
e    2.689627
dtype: float64
Here the function f, which computes the
difference between the maximum and minimum of a Series, is invoked once
on each column in frame. The result is a Series
having the columns of frame as its index.
If you pass axis='columns' to
apply, the function will be invoked once per row
instead:
In [195]: frame.apply(f, axis='columns')
Out[195]: 
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
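For example, a minimal sketch (with a small DataFrame of my own rather than the one above): the built-in reductions compute column-wise results directly, and max() - min() reproduces what the apply example computed:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape((3, 2)), columns=["a", "b"])

col_sums = frame.sum()                  # reduces over the rows by default
row_means = frame.mean(axis="columns")  # reduces over the columns
spread = frame.max() - frame.min()      # same as apply(lambda x: x.max() - x.min())
```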
The function passed to apply
need not return a scalar value; it can also return a Series with
multiple values:
In [196]: def f(x):
   .....:     return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [197]: frame.apply(f)
Out[197]: 
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406
Element-wise Python functions can be used, too. Suppose you wanted
to compute a formatted string from each floating-point value in frame. You can do this with applymap:
In [198]: format = lambda x: '%.2f' % x

In [199]: frame.applymap(format)
Out[199]: 
            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [200]: frame['e'].map(format)
Out[200]: 
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
In [201]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [202]: obj.sort_index()
Out[202]: 
a    1
b    2
c    3
d    0
dtype: int64
With a DataFrame, you can sort by index on either axis:
In [203]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
   .....:                      index=['three', 'one'],
   .....:                      columns=['d', 'a', 'b', 'c'])

In [204]: frame.sort_index()
Out[204]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [205]: frame.sort_index(axis=1)
Out[205]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
The data is sorted in ascending order by default, but can be sorted in descending order, too:
In [206]: frame.sort_index(axis=1, ascending=False)
Out[206]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
To sort a Series by its values, use its sort_values
method:
In [207]: obj = pd.Series([4, 7, -3, 2])

In [208]: obj.sort_values()
Out[208]:
2   -3
3    2
0    4
1    7
dtype: int64
Any missing values are sorted to the end of the Series by default:
In [209]: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [210]: obj.sort_values()
Out[210]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
When sorting a DataFrame, you can use the data in one or more
columns as the sort keys. To do so, pass one or more column names to the
by option of
sort_values:
In [211]: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [212]: frame
Out[212]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [213]: frame.sort_values(by='b')
Out[213]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
To sort by multiple columns, pass a list of names:
In [214]: frame.sort_values(by=['a', 'b'])
Out[214]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
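The ascending option also accepts a list of booleans, one per sort key, so you can mix sort directions across columns. A small sketch with the same frame:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort ascending on 'a' but descending on 'b' within each 'a' group
result = frame.sort_values(by=['a', 'b'], ascending=[True, False])
```

Here the a=0 rows come first (with b descending within them), followed by the a=1 rows.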
Ranking assigns ranks from one through the
number of valid data points in an array. The rank methods for
Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the
mean rank:
In [215]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [216]: obj.rank()
Out[216]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
Ranks can also be assigned according to the order in which they’re observed in the data:
In [217]: obj.rank(method='first')
Out[217]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
Here, instead of using the average rank 6.5 for the entries 0 and 2, they instead have been set to 6 and 7 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
# Assign tie values the maximum rank in the group
In [218]: obj.rank(ascending=False, method='max')
Out[218]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
See Table 5-6 for a list of the available tie-breaking methods.
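To illustrate two more of the tie-breaking methods on the same Series: 'min' gives every member of a tied group the smallest rank in the group, while 'dense' is like 'min' except that ranks increase by one between groups rather than by the group size:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

min_ranks = obj.rank(method='min')      # ties share the group's lowest rank
dense_ranks = obj.rank(method='dense')  # like 'min', but with no gaps in ranks
```

With method='min', the two 7s both get rank 6 (leaving no rank 7); with method='dense', they get rank 5 because only five distinct values occur.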
DataFrame can compute ranks over the rows or the columns:
In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
   .....:                       'c': [-2, 5, 8, -2.5]})

In [220]: frame
Out[220]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5

In [221]: frame.rank(axis='columns')
Out[221]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
Up until now all of the examples we’ve looked at have had unique axis labels (index values).
While many pandas functions (like reindex) require that the labels be unique,
it’s not mandatory. Let’s consider a small Series with duplicate indices:
In [222]: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [223]: obj
Out[223]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
The index’s is_unique property
can tell you whether its labels are unique or not:
In [224]: obj.index.is_unique
Out[224]: False
Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:
In [225]: obj['a']
Out[225]:
a    0
a    1
dtype: int64

In [226]: obj['c']
Out[226]: 4
This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.
The same logic extends to indexing rows in a DataFrame:
In [227]: df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [228]: df
Out[228]:
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228

In [229]: df.loc['b']
Out[229]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
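One defensive pattern when labels may repeat is to index with a list of labels, which always returns a Series (or DataFrame) regardless of how many times a label occurs. A small sketch:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# Selecting with a list keeps the output type stable:
single = obj[['c']]    # a one-element Series, not a scalar
repeated = obj[['a']]  # a two-element Series
```

This avoids the scalar-or-Series ambiguity at the cost of an extra unwrapping step when you do want a scalar.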
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:
In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
   .....:                    [np.nan, np.nan], [0.75, -1.3]],
   .....:                   index=['a', 'b', 'c', 'd'],
   .....:                   columns=['one', 'two'])

In [231]: df
Out[231]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Calling DataFrame’s sum method
returns a Series containing column sums:
In [232]: df.sum()
Out[232]:
one    9.25
two   -5.80
dtype: float64
Passing axis='columns' or
axis=1 sums across the columns
instead:
In [233]: df.sum(axis='columns')
Out[233]:
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64
NA values are excluded unless the entire slice (row or column in
this case) is NA. This can be disabled with the skipna option:
In [234]: df.mean(axis='columns', skipna=False)
Out[234]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
See Table 5-7 for a list of common options for each reduction method.
Some methods, like idxmin and
idxmax, return indirect statistics like
the index value where the minimum or maximum values are attained:
In [235]: df.idxmax()
Out[235]:
one    b
two    d
dtype: object
Other methods are accumulations:
In [236]: df.cumsum()
Out[236]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
Another type of method is neither a reduction nor an accumulation.
describe is one such example, producing
multiple summary statistics in one shot:
In [237]: df.describe()
Out[237]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
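If you want quantiles other than the default quartiles, describe accepts a percentiles option (the median is always included). A sketch on a tiny hypothetical Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Request the 10th and 90th percentiles instead of the quartiles
stats = s.describe(percentiles=[0.1, 0.9])
```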
On non-numeric data, describe
produces alternative summary statistics:
In [238]: obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [239]: obj.describe()
Out[239]:
count     16
unique     3
top        a
freq       8
dtype: object
See Table 5-8 for a full list of summary statistics and related methods.
Some summary statistics, like correlation and covariance, are computed from pairs
of arguments. Let’s consider some DataFrames of stock prices and volumes
obtained from Yahoo! Finance using the add-on
pandas-datareader package. If you don’t have it installed already, it can be obtained
via conda or pip:
conda install pandas-datareader
I use the pandas_datareader module to download
some data for a few stock tickers:
import pandas_datareader.data as web

all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})
It’s possible by the time you are reading this that Yahoo! Finance no longer exists since Yahoo! was acquired by Verizon in 2017. Refer to the pandas-datareader documentation online for the latest functionality.
I now compute percent changes of the prices, a time series operation which will be explored further in Chapter 11:
In [242]: returns = price.pct_change()

In [243]: returns.tail()
Out[243]:
                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
The corr method of Series computes the correlation of the overlapping,
non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:
In [244]: returns['MSFT'].corr(returns['IBM'])
Out[244]: 0.49976361144151144

In [245]: returns['MSFT'].cov(returns['IBM'])
Out[245]: 8.8706554797035462e-05
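The relationship between the two is simple: correlation is the covariance normalized by the product of the two standard deviations. A deterministic sketch with made-up data (rather than the downloaded returns):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 1.0, 4.0, 3.0])

# corr(x, y) == cov(x, y) / (std(x) * std(y))
corr = s1.corr(s2)
check = s1.cov(s2) / (s1.std() * s2.std())
```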
Since MSFT is a valid Python attribute, we can also select these columns using
more concise syntax:
In [246]: returns.MSFT.corr(returns.IBM)
Out[246]: 0.49976361144151144
DataFrame’s corr and cov methods, on the other hand, return a full correlation or covariance
matrix as a DataFrame, respectively:
In [247]: returns.corr()
Out[247]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

In [248]: returns.cov()
Out[248]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
Using DataFrame’s corrwith
method, you can compute pairwise correlations between a
DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each
column:
In [249]: returns.corrwith(returns.IBM)
Out[249]:
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
Passing a DataFrame computes the correlations of matching column names. Here I compute correlations of percent changes with volume:
In [250]: returns.corrwith(volume)
Out[250]:
AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
Passing axis='columns' does
things row-by-row instead. In all cases, the data points are aligned by
label before the correlation is computed.
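With deterministic inputs, the per-column behavior of corrwith is easy to verify. In this sketch (hypothetical data, not the stock returns above), one column moves perfectly with the passed Series and the other moves perfectly against it:

```python
import pandas as pd

x = pd.DataFrame({'up': [1.0, 2.0, 3.0, 4.0],
                  'down': [4.0, 3.0, 2.0, 1.0]})
y = pd.Series([1.0, 2.0, 3.0, 4.0])

# One correlation value per column of x
result = x.corrwith(y)
```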
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:
In [251]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
The first function is unique,
which gives you an array of the unique values in a
Series:
In [252]: uniques = obj.unique()

In [253]: uniques
Out[253]: array(['c', 'a', 'd', 'b'], dtype=object)
The unique values are not necessarily returned in sorted order, but could be
sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing
value frequencies:
In [254]: obj.value_counts()
Out[254]:
c    3
a    3
b    2
d    1
dtype: int64
The Series is sorted by value in descending order as a
convenience. value_counts is also
available as a top-level pandas method that can be used with any array
or sequence:
In [255]: pd.value_counts(obj.values, sort=False)
Out[255]:
b    2
a    3
c    3
d    1
dtype: int64
isin performs a vectorized set membership check and can be useful in
filtering a dataset down to a subset of values in a Series or column in
a DataFrame:
In [256]: obj
Out[256]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [257]: mask = obj.isin(['b', 'c'])

In [258]: mask
Out[258]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [259]: obj[mask]
Out[259]:
0    c
5    b
6    b
7    c
8    c
dtype: object
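The same mask-based filtering works on a DataFrame column, which is the more common use in practice. A small sketch with a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({'key': ['c', 'a', 'd', 'a'],
                   'value': [1, 2, 3, 4]})

# Keep only the rows whose 'key' is in the allowed set
subset = df[df['key'].isin(['a', 'c'])]
```

The boolean Series returned by isin selects whole rows, so the 'value' column comes along with the matching keys.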
Related to isin is the
Index.get_indexer method, which gives you an index array from an array of possibly
non-distinct values into another array of distinct values:
In [260]: to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [261]: unique_vals = pd.Series(['c', 'b', 'a'])

In [262]: pd.Index(unique_vals).get_indexer(to_match)
Out[262]: array([0, 2, 1, 1, 0, 2])
See Table 5-9 for a reference on these methods.
In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:
In [263]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
   .....:                      'Qu2': [2, 3, 1, 2, 3],
   .....:                      'Qu3': [1, 5, 2, 4, 4]})

In [264]: data
Out[264]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
Passing pandas.value_counts to
this DataFrame’s apply function
gives:
In [265]: result = data.apply(pd.value_counts).fillna(0)

In [266]: result
Out[266]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.
In the next chapter, we’ll discuss tools for reading (or loading) and writing datasets with pandas. After that, we’ll dig deeper into data cleaning, wrangling, analysis, and visualization tools using pandas.