In many applications, data may be spread across a number of files or databases or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine, join, and rearrange data.
First, I introduce the concept of hierarchical indexing in pandas, which is used extensively in some of these operations. I then dig into the particular data manipulations. You can see various applied usages of these tools in Chapter 14.
Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example; create a Series with a list of lists (or arrays) as the index:
In [9]: data = pd.Series(np.random.randn(9),
   ...:                  index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
   ...:                         [1, 2, 3, 1, 3, 1, 2, 2, 3]])

In [10]: data
Out[10]: 
a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
d  2    0.281746
   3    0.769023
dtype: float64
What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index
display mean “use the label directly above”:
In [11]: data.index
Out[11]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3],
                   [0, 1, 2, 0, 2, 0, 1, 1, 2]])
With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:
In [12]: data['b']
Out[12]: 
1   -0.555730
3    1.965781
dtype: float64

In [13]: data['b':'c']
Out[13]: 
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
dtype: float64

In [14]: data.loc[['b', 'd']]
Out[14]: 
b  1   -0.555730
   3    1.965781
d  2    0.281746
   3    0.769023
dtype: float64
Selection is even possible from an “inner” level:
In [15]: data.loc[:, 2]
Out[15]: 
a    0.478943
c    0.092908
d    0.281746
dtype: float64
Hierarchical indexing plays an important role in reshaping data and
group-based operations like forming a pivot table. For example, you could rearrange the data
into a DataFrame using its unstack method:
In [16]: data.unstack()
Out[16]: 
          1         2         3
a -0.204708  0.478943 -0.519439
b -0.555730       NaN  1.965781
c  1.393406  0.092908       NaN
d       NaN  0.281746  0.769023
The inverse operation of unstack
is stack:
In [17]: data.unstack().stack()
Out[17]: 
a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
d  2    0.281746
   3    0.769023
dtype: float64
stack and unstack will be explored in more detail later in
this chapter.
With a DataFrame, either axis can have a hierarchical index:
In [18]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
   ....:                      index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
   ....:                      columns=[['Ohio', 'Ohio', 'Colorado'],
   ....:                               ['Green', 'Red', 'Green']])

In [19]: frame
Out[19]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:
In [20]: frame.index.names = ['key1', 'key2']

In [21]: frame.columns.names = ['state', 'color']

In [22]: frame
Out[22]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
Be careful to distinguish the index names
'state' and 'color' from the row
labels.
With partial column indexing you can similarly select groups of columns:
In [23]: frame['Ohio']
Out[23]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
A MultiIndex can be created by
itself and then reused; the columns in the preceding DataFrame with level
names could be created like this:
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                        ['Green', 'Red', 'Green']],
                       names=['state', 'color'])
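A MultiIndex built this way can be passed straight to a DataFrame constructor and reused across objects; a minimal runnable sketch (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

# Build the MultiIndex once, then reuse it as the columns of a DataFrame
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=columns)
print(frame['Ohio'])  # partial column indexing works as before
```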
At times you will need to rearrange the order of the levels on an axis or sort the
data by the values in one specific level. The swaplevel method takes two level numbers or names and
returns a new object with the levels interchanged (but the data is
otherwise unaltered):
In [24]: frame.swaplevel('key1', 'key2')
Out[24]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
sort_index, on the other hand,
sorts the data using only the values in a single level. When swapping
levels, it’s not uncommon to also use sort_index so that the result is
lexicographically sorted by the indicated level:
In [25]: frame.sort_index(level=1)
Out[25]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [26]: frame.swaplevel(0, 1).sort_index(level=0)
Out[26]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
Data selection performance is much better on hierarchically
indexed objects if the index is lexicographically sorted starting with
the outermost level—that is, the result of calling sort_index(level=0) or sort_index().
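As a small sketch of why this matters (assuming a reasonably recent pandas, where partial range slicing on an unsorted MultiIndex is restricted), sorting first makes label slicing both legal and fast:

```python
import numpy as np
import pandas as pd

# An unsorted MultiIndex, then the lexicographically sorted version
s = pd.Series(np.arange(4.0),
              index=pd.MultiIndex.from_tuples(
                  [('b', 1), ('a', 2), ('b', 2), ('a', 1)]))
sorted_s = s.sort_index()

print(sorted_s.index.is_monotonic_increasing)  # True once sorted
print(sorted_s.loc['a':'a'])                   # range slicing is now safe
```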
Many descriptive and summary statistics on DataFrame and Series have a level option with which you can specify the
level you want to aggregate by on a particular axis. Consider the
DataFrame above; we can aggregate by level on either the rows or columns like
so:
In [27]: frame.sum(level='key2')
Out[27]: 
state  Ohio     Colorado
color Green Red    Green
key2                    
1         6   8       10
2        12  14       16

In [28]: frame.sum(level='color', axis=1)
Out[28]: 
color      Green  Red
key1 key2            
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
Under the hood, this utilizes pandas’s groupby machinery, which will be discussed in more detail later in the book.
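A sketch of that equivalence (and the groupby spelling itself, which is the one required in recent pandas versions, where the level option of sum was removed):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=pd.MultiIndex.from_arrays(
                         [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                         names=['key1', 'key2']),
                     columns=pd.MultiIndex.from_arrays(
                         [['Ohio', 'Ohio', 'Colorado'],
                          ['Green', 'Red', 'Green']],
                         names=['state', 'color']))

# Row-wise: the same result as frame.sum(level='key2')
by_key2 = frame.groupby(level='key2').sum()

# Column-wise: transpose, group by the column level, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```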
It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:
In [29]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
   ....:                       'c': ['one', 'one', 'one', 'two', 'two',
   ....:                             'two', 'two'],
   ....:                       'd': [0, 1, 2, 0, 1, 2, 3]})

In [30]: frame
Out[30]: 
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
DataFrame’s set_index function
will create a new DataFrame using one or more of its columns as the
index:
In [31]: frame2 = frame.set_index(['c', 'd'])

In [32]: frame2
Out[32]: 
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default the columns are removed from the DataFrame, though you can leave them in:
In [33]: frame.set_index(['c', 'd'], drop=False)
Out[33]: 
       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index, on the other hand,
does the opposite of set_index; the
hierarchical index levels are moved into the columns:
In [34]: frame2.reset_index()
Out[34]: 
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
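A quick sketch of the round trip: column order aside, reset_index after set_index recovers the original data:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(3), 'c': ['one', 'two', 'two']})
round_trip = frame.set_index('c').reset_index()

# reset_index puts the former index column first, so reorder to compare
assert round_trip[['a', 'c']].equals(frame[['a', 'c']])
print(round_trip)
```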
Data contained in pandas objects can be combined together in a number of ways:
pandas.merge connects rows in DataFrames based on one or more keys. This will
be familiar to users of SQL or other relational databases, as it
implements database join operations.
pandas.concat concatenates
or “stacks” together objects along an axis.
The combine_first
instance method enables splicing together overlapping data to
fill in missing values in one object with values from another.
I will address each of these and give a number of examples. They’ll be utilized in examples throughout the rest of the book.
Merge or join operations
combine datasets by linking rows using one or more
keys. These operations are central to relational
databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these
algorithms on your data.
Let’s start with a simple example:
In [35]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ....:                     'data1': range(7)})

In [36]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
   ....:                     'data2': range(3)})

In [37]: df1
Out[37]: 
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

In [38]: df2
Out[38]: 
   data2 key
0      0   a
1      1   b
2      2   d
This is an example of a many-to-one join; the data in
df1 has multiple rows labeled
a and b, whereas df2 has only one row for each value in the
key column. Calling merge with these objects we obtain:
In [39]: pd.merge(df1, df2)
Out[39]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
Note that I didn’t specify which column to join on. If that information is not
specified, merge uses the overlapping
column names as the keys. It’s a good practice to specify explicitly,
though:
In [40]: pd.merge(df1, df2, on='key')
Out[40]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
If the column names are different in each object, you can specify them separately:
In [41]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ....:                     'data1': range(7)})

In [42]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
   ....:                     'data2': range(3)})

In [43]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[43]: 
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a
You may notice that the 'c' and
'd' values and associated data are
missing from the result. By default merge does an 'inner' join; the keys in the result are the
intersection, or the common set found in both tables. Other possible
options are 'left', 'right', and 'outer'. The outer join takes the union of the
keys, combining the effect of applying both left and right joins:
In [44]: pd.merge(df1, df2, how='outer')
Out[44]: 
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
See Table 8-1 for a summary of
the options for how.
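The effect of each option can be summarized by which keys survive; a compact sketch using the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# inner: keys in both; left/right: all keys from one side; outer: union
sizes = {how: len(pd.merge(df1, df2, how=how))
         for how in ['inner', 'left', 'right', 'outer']}
print(sizes)
```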
Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here’s an example:
In [45]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   ....:                     'data1': range(6)})

In [46]: df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
   ....:                     'data2': range(5)})

In [47]: df1
Out[47]: 
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b

In [48]: df2
Out[48]: 
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d

In [49]: pd.merge(df1, df2, on='key', how='left')
Out[49]: 
    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0
Many-to-many joins form the Cartesian product of the rows. Since there were three
'b' rows in the left DataFrame and
two in the right one, there are six 'b' rows in the result. The join method only
affects the distinct key values appearing in the result:
In [50]: pd.merge(df1, df2, how='inner')
Out[50]: 
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2
To merge with multiple keys, pass a list of column names:
In [51]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
   ....:                      'key2': ['one', 'two', 'one'],
   ....:                      'lval': [1, 2, 3]})

In [52]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
   ....:                       'key2': ['one', 'one', 'one', 'two'],
   ....:                       'rval': [4, 5, 6, 7]})

In [53]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[53]: 
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key (even though it’s not actually implemented that way).
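That tuple intuition can be checked directly (a sketch only; pandas does not literally materialize these tuples):

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

merged = pd.merge(left, right, on=['key1', 'key2'], how='inner')

# The surviving key combinations are exactly the intersection of the
# (key1, key2) tuples present on each side
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))
assert set(zip(merged['key1'], merged['key2'])) == left_keys & right_keys
print(merged)
```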
When you’re joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.
A last issue to consider in merge operations is the treatment of
overlapping column names. While you can address the overlap manually
(see the earlier section on renaming axis labels), merge has a suffixes option for specifying strings to
append to overlapping names in the left and right DataFrame
objects:
In [54]: pd.merge(left, right, on='key1')
Out[54]: 
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

In [55]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[55]: 
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
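One more merge argument worth a quick sketch: indicator=True adds a _merge column recording each row's origin (left_only, right_only, or both), which is handy when auditing outer joins:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': range(3)})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'data2': range(3)})

# The extra `_merge` column says whether each row came from the left
# frame only, the right frame only, or matched in both
out = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(out)
```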
See Table 8-2 for an argument
reference on merge. Joining using the
DataFrame’s row index is the subject of the next section.
In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that
the index should be used as the merge key:
In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
   ....:                       'value': range(6)})

In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [58]: left1
Out[58]: 
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [59]: right1
Out[59]: 
   group_val
a        3.5
b        7.0

In [60]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[60]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:
In [61]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[61]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
With hierarchically indexed data, things are more complicated, as joining on index is implicitly a multiple-key merge:
In [62]: lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
   ....:                                'Nevada', 'Nevada'],
   ....:                       'key2': [2000, 2001, 2002, 2001, 2002],
   ....:                       'data': np.arange(5.)})

In [63]: righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
   ....:                       index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
   ....:                               'Ohio', 'Ohio'],
   ....:                              [2001, 2000, 2000, 2000, 2001, 2002]],
   ....:                       columns=['event1', 'event2'])

In [64]: lefth
Out[64]: 
   data    key1  key2
0   0.0    Ohio  2000
1   1.0    Ohio  2001
2   2.0    Ohio  2002
3   3.0  Nevada  2001
4   4.0  Nevada  2002

In [65]: righth
Out[65]: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
In this case, you have to indicate multiple columns to merge on as
a list (note the handling of duplicate index values with
how='outer'):
In [66]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[66]: 
   data    key1  key2  event1  event2
0   0.0    Ohio  2000       4       5
0   0.0    Ohio  2000       6       7
1   1.0    Ohio  2001       8       9
2   2.0    Ohio  2002      10      11
3   3.0  Nevada  2001       0       1

In [67]: pd.merge(lefth, righth, left_on=['key1', 'key2'],
   ....:          right_index=True, how='outer')
Out[67]: 
   data    key1  key2  event1  event2
0   0.0    Ohio  2000     4.0     5.0
0   0.0    Ohio  2000     6.0     7.0
1   1.0    Ohio  2001     8.0     9.0
2   2.0    Ohio  2002    10.0    11.0
3   3.0  Nevada  2001     0.0     1.0
4   4.0  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000     2.0     3.0
Using the indexes of both sides of the merge is also possible:
In [68]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
   ....:                      index=['a', 'c', 'e'],
   ....:                      columns=['Ohio', 'Nevada'])

In [69]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
   ....:                       index=['b', 'c', 'd', 'e'],
   ....:                       columns=['Missouri', 'Alabama'])

In [70]: left2
Out[70]: 
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

In [71]: right2
Out[71]: 
   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

In [72]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[72]: 
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
DataFrame has a convenient join
instance method for merging by index. It can also be used to combine together
many DataFrame objects having the same or similar indexes but
non-overlapping columns. In the prior example, we could have
written:
In [73]: left2.join(right2, how='outer')
Out[73]: 
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
In part for legacy reasons (i.e., much earlier versions of
pandas), DataFrame’s join method
performs a left join on the join keys, exactly preserving the left
frame’s row index. It also supports joining the index of the passed
DataFrame on one of the columns of the calling DataFrame:
In [74]: left1.join(right1, on='key')
Out[74]: 
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
Lastly, for simple index-on-index merges, you can pass a list of
DataFrames to join as an alternative
to using the more general concat
function described in the next section:
In [75]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
   ....:                        index=['a', 'c', 'e', 'f'],
   ....:                        columns=['New York', 'Oregon'])

In [76]: another
Out[76]: 
   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0

In [77]: left2.join([right2, another])
Out[77]: 
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0

In [78]: left2.join([right2, another], how='outer')
Out[78]: 
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
b   NaN     NaN       7.0      8.0       NaN     NaN
c   3.0     4.0       9.0     10.0       9.0    10.0
d   NaN     NaN      11.0     12.0       NaN     NaN
e   5.0     6.0      13.0     14.0      11.0    12.0
f   NaN     NaN       NaN      NaN      16.0    17.0
Another kind of data combination operation is referred to
interchangeably as concatenation, binding, or stacking. NumPy’s concatenate function can do this with NumPy arrays:
In [79]: arr = np.arange(12).reshape((3, 4))

In [80]: arr
Out[80]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [81]: np.concatenate([arr, arr], axis=1)
Out[81]: 
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
In the context of pandas objects such as Series and DataFrame, having labeled axes enables you to further generalize array concatenation. In particular, you have a number of additional things to think about:
If the objects are indexed differently on the other axes, should we combine the distinct elements in these axes or use only the shared values (the intersection)?
Do the concatenated chunks of data need to be identifiable in the resulting object?
Does the “concatenation axis” contain data that needs to be preserved? In many cases, the default integer labels in a DataFrame are best discarded during concatenation.
The concat function in pandas provides a consistent way to address each of
these concerns. I’ll give a number of examples to illustrate how it
works. Suppose we have three Series with no index overlap:
In [82]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [83]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [84]: s3 = pd.Series([5, 6], index=['f', 'g'])
Calling concat with these
objects in a list glues together the values and indexes:
In [85]: pd.concat([s1, s2, s3])
Out[85]: 
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
By default concat works along
axis=0, producing another Series. If
you pass axis=1, the result will
instead be a DataFrame (axis=1 is the
columns):
In [86]: pd.concat([s1, s2, s3], axis=1)
Out[86]: 
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
In this case there is no overlap on the other axis, which as you
can see is the sorted union (the 'outer' join) of the indexes. You can instead
intersect them by passing join='inner':
In [87]: s4 = pd.concat([s1, s3])

In [88]: s4
Out[88]: 
a    0
b    1
f    5
g    6
dtype: int64

In [89]: pd.concat([s1, s4], axis=1)
Out[89]: 
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6

In [90]: pd.concat([s1, s4], axis=1, join='inner')
Out[90]: 
   0  1
a  0  0
b  1  1
In this last example, the 'f' and
'g' labels disappeared because of the
join='inner' option.
You can even specify the axes to be used on the other axes with
join_axes:
In [91]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[91]: 
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN
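Note that join_axes was removed in pandas 1.0; in newer versions the same result comes from reindexing after the concat (a sketch):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s4 = pd.concat([s1, pd.Series([5, 6], index=['f', 'g'])])

# Equivalent to join_axes=[['a', 'c', 'b', 'e']] in older pandas
out = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(out)
```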
A potential issue is that the concatenated pieces are not
identifiable in the result. Suppose instead you wanted to create a
hierarchical index on the concatenation axis. To do this, use the
keys argument:
In [92]: result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

In [93]: result
Out[93]: 
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [94]: result.unstack()
Out[94]: 
         a    b    f    g
one    0.0  1.0  NaN  NaN
two    0.0  1.0  NaN  NaN
three  NaN  NaN  5.0  6.0
In the case of combining Series along axis=1, the keys become the DataFrame column
headers:
In [95]: pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[95]: 
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
The same logic extends to DataFrame objects:
In [96]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
   ....:                    columns=['one', 'two'])

In [97]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
   ....:                    columns=['three', 'four'])

In [98]: df1
Out[98]: 
   one  two
a    0    1
b    2    3
c    4    5

In [99]: df2
Out[99]: 
   three  four
a      5     6
c      7     8

In [100]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[100]: 
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
If you pass a dict of objects instead of a list, the dict’s keys
will be used for the keys
option:
In [101]: pd.concat({'level1': df1, 'level2': df2}, axis=1)
Out[101]: 
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
There are additional arguments governing how the hierarchical
index is created (see Table 8-3). For
example, we can name the created axis levels with the
names argument:
In [102]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
   .....:           names=['upper', 'lower'])
Out[102]: 
upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
A last consideration concerns DataFrames in which the row index does not contain any relevant data:
In [103]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [105]: df1
Out[105]: 
          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741

In [106]: df2
Out[106]: 
          b         d         a
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
In this case, you can pass ignore_index=True:
In [107]: pd.concat([df1, df2], ignore_index=True)
Out[107]: 
          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741
3 -1.021228  0.476985       NaN  3.248944
4  0.302614 -0.577087       NaN  0.124121
There is another data combination situation that can’t be expressed as either a
merge or concatenation operation. You may have two datasets whose
indexes overlap in full or part. As a motivating example, consider
NumPy’s where function,
which performs the array-oriented equivalent of an if-else
expression:
In [108]: a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
   .....:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [109]: b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
   .....:               index=['a', 'b', 'c', 'd', 'e', 'f'])

In [110]: a
Out[110]: 
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [111]: b
Out[111]: 
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64

In [112]: np.where(pd.isnull(a), b, a)
Out[112]: array([ 0. ,  2.5,  0. ,  3.5,  4.5,  5. ])
Series has a combine_first
method, which performs the equivalent of this operation along with
pandas’s usual data alignment logic:
In [113]: b.combine_first(a)
Out[113]: 
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64
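The key difference from np.where is that combine_first aligns by label before combining; a small sketch with deliberately scrambled indexes makes the contrast explicit:

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5], index=['x', 'y'])
b = pd.Series([10.0, 20.0], index=['y', 'x'])

# combine_first matches on labels: 'x' -> 20.0, 'y' -> 10.0
aligned = b.combine_first(a)

# np.where works purely by position instead
positional = np.where(pd.isnull(a), b, a)

print(aligned)
print(positional)
```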
With DataFrames, combine_first
does the same thing column by column, so you can think of it as
“patching” missing data in the calling object with data from the object
you pass:
In [114]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
   .....:                     'b': [np.nan, 2., np.nan, 6.],
   .....:                     'c': range(2, 18, 4)})

In [115]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
   .....:                     'b': [np.nan, 3., 4., 6., 8.]})

In [116]: df1
Out[116]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [117]: df2
Out[117]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [118]: df1.combine_first(df2)
Out[118]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
There are a number of basic operations for rearranging tabular data. These are variously referred to as reshape or pivot operations.
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack
This “rotates” or pivots from the columns in the data to the rows

unstack
This pivots from the rows into the columns
I’ll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
In [119]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
   .....:                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
   .....:                     columns=pd.Index(['one', 'two', 'three'],
   .....:                                      name='number'))

In [120]: data
Out[120]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
Using the stack method
on this data pivots the columns into the rows, producing a
Series:
In [121]: result = data.stack()

In [122]: result
Out[122]: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
From a hierarchically indexed Series, you can rearrange the data
back into a DataFrame with unstack:
In [123]: result.unstack()
Out[123]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
By default the innermost level is unstacked (same with stack). You can unstack a different level by
passing a level number or name:
In [124]: result.unstack(0)
Out[124]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

In [125]: result.unstack('state')
Out[125]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:
In [126]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

In [127]: s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

In [128]: data2 = pd.concat([s1, s2], keys=['one', 'two'])

In [129]: data2
Out[129]: 
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [130]: data2.unstack()
Out[130]: 
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
Stacking filters out missing data by default, so the operation is more easily invertible:
In [131]: data2.unstack()
Out[131]: 
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

In [132]: data2.unstack().stack()
Out[132]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [133]: data2.unstack().stack(dropna=False)
Out[133]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
In [134]: df = pd.DataFrame({'left': result, 'right': result + 5},
   .....:                   columns=pd.Index(['left', 'right'], name='side'))

In [135]: df
Out[135]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [136]: df.unstack('state')
Out[136]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
When calling stack, we can indicate the name of
the axis to stack:
In [137]: df.unstack('state').stack('side')
Out[137]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
A common way to store multiple time series in databases and CSV is in so-called long or stacked format. Let’s load some example data and do a small amount of time series wrangling and other data cleaning:
In [138]: data = pd.read_csv('examples/macrodata.csv')

In [139]: data.head()
Out[139]: 
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi  \
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54

      m1  tbilrate  unemp      pop  infl  realint
0  139.7      2.82    5.8  177.146  0.00     0.00
1  141.7      3.08    5.1  177.830  2.34     0.74
2  140.5      3.82    5.3  178.657  2.74     1.09
3  140.0      4.33    5.6  179.386  0.27     4.06
4  139.6      3.50    5.2  180.007  2.31     1.19

In [140]: periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
   .....:                          name='date')

In [141]: columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')

In [142]: data = data.reindex(columns=columns)

In [143]: data.index = periods.to_timestamp('D', 'end')

In [144]: ldata = data.stack().reset_index().rename(columns={0: 'value'})
We will look at PeriodIndex a bit more closely
in Chapter 11. In short, it combines the
year and quarter columns to create
a kind of time interval type.
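A tiny runnable PeriodIndex sketch, using the string form of quarterly periods (the year/quarter keyword constructor above also works, though newer pandas versions steer you toward other spellings):

```python
import pandas as pd

# Each entry is a quarter-long time interval; 'Q-DEC' means quarters
# ending in December (calendar quarters)
periods = pd.PeriodIndex(['1959Q1', '1959Q2'], freq='Q-DEC', name='date')

# Convert each period to the timestamp at its end
stamps = periods.to_timestamp('D', 'end')
print(periods)
print(stamps)
```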
Now, ldata looks like:
In [145]: ldata[:10]
Out[145]: 
        date     item     value
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340
5 1959-06-30    unemp     5.100
6 1959-09-30  realgdp  2775.488
7 1959-09-30     infl     2.740
8 1959-09-30    unemp     5.300
9 1959-12-31  realgdp  2785.204
This is the so-called long format for multiple time series, or other observational data with two or more keys (here, our keys are date and item). Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a fixed schema (column names and
data types) allows the number of distinct values in the item column to change as data is added to the
table. In the previous example, date
and item would usually be the primary
keys (in relational database parlance), offering both relational
integrity and easier joins. In some cases, the data may be more
difficult to work with in this format; you might prefer to have a
DataFrame containing one column per distinct item value indexed by timestamps in the
date column. DataFrame’s pivot method performs exactly this transformation:
In [146]: pivoted = ldata.pivot('date', 'item', 'value')

In [147]: pivoted
Out[147]: 
item        infl    realgdp  unemp
date                              
1959-03-31  0.00   2710.349    5.8
1959-06-30  2.34   2778.801    5.1
1959-09-30  2.74   2775.488    5.3
1959-12-31  0.27   2785.204    5.6
1960-03-31  2.31   2847.699    5.2
1960-06-30  0.14   2834.390    5.2
1960-09-30  2.70   2839.022    5.6
1960-12-31  1.21   2802.616    6.3
1961-03-31 -0.40   2819.264    6.8
1961-06-30  1.47   2872.005    7.0
...          ...        ...    ...
2007-06-30  2.75  13203.977    4.5
2007-09-30  3.45  13321.109    4.7
2007-12-31  6.38  13391.249    4.8
2008-03-31  2.82  13366.865    4.9
2008-06-30  8.53  13415.266    5.4
2008-09-30 -3.16  13324.600    6.0
2008-12-31 -8.79  13141.920    6.9
2009-03-31  0.94  12925.410    8.1
2009-06-30  3.37  12901.504    9.2
2009-09-30  3.56  12990.341    9.6

[203 rows x 3 columns]
The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
In [148]: ldata['value2'] = np.random.randn(len(ldata))

In [149]: ldata[:10]
Out[149]: 
        date     item     value    value2
0 1959-03-31  realgdp  2710.349  0.523772
1 1959-03-31     infl     0.000  0.000940
2 1959-03-31    unemp     5.800  1.343810
3 1959-06-30  realgdp  2778.801 -0.713544
4 1959-06-30     infl     2.340 -0.831154
5 1959-06-30    unemp     5.100 -2.370232
6 1959-09-30  realgdp  2775.488 -1.860761
7 1959-09-30     infl     2.740 -0.860757
8 1959-09-30    unemp     5.300  0.560145
9 1959-12-31  realgdp  2785.204 -1.265934
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
In [150]: pivoted = ldata.pivot('date', 'item')

In [151]: pivoted[:5]
Out[151]: 
           value                    value2                    
item        infl   realgdp unemp      infl   realgdp     unemp
date                                                          
1959-03-31  0.00  2710.349   5.8  0.000940  0.523772  1.343810
1959-06-30  2.34  2778.801   5.1 -0.831154 -0.713544 -2.370232
1959-09-30  2.74  2775.488   5.3 -0.860757 -1.860761  0.560145
1959-12-31  0.27  2785.204   5.6  0.119827 -1.265934 -1.063512
1960-03-31  2.31  2847.699   5.2 -2.359419  0.332883 -0.199543

In [152]: pivoted['value'][:5]
Out[152]: 
item        infl   realgdp  unemp
date                             
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2
Note that pivot is equivalent
to creating a hierarchical index using set_index
followed by a call to unstack:
In [153]: unstacked = ldata.set_index(['date', 'item']).unstack('item')

In [154]: unstacked[:7]
Out[154]: 
           value                    value2                    
item        infl   realgdp unemp      infl   realgdp     unemp
date                                                          
1959-03-31  0.00  2710.349   5.8  0.000940  0.523772  1.343810
1959-06-30  2.34  2778.801   5.1 -0.831154 -0.713544 -2.370232
1959-09-30  2.74  2775.488   5.3 -0.860757 -1.860761  0.560145
1959-12-31  0.27  2785.204   5.6  0.119827 -1.265934 -1.063512
1960-03-31  2.31  2847.699   5.2 -2.359419  0.332883 -0.199543
1960-06-30  0.14  2834.390   5.2 -0.970736 -1.541996 -1.307030
1960-09-30  2.70  2839.022   5.6  0.377984  0.286350 -0.753887
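That equivalence is easy to verify on a tiny long-format frame (a sketch; note that in recent pandas, pivot takes keyword arguments):

```python
import pandas as pd

ldata = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                      'item': ['x', 'y', 'x', 'y'],
                      'value': [1.0, 2.0, 3.0, 4.0]})

pivoted = ldata.pivot(index='date', columns='item', values='value')
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

# pivot and set_index + unstack produce the same wide-format frame
assert pivoted.equals(unstacked)
print(pivoted)
```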
An inverse operation to pivot for DataFrames is
pandas.melt. Rather than transforming one column into many in a new DataFrame,
it merges multiple columns into one, producing a DataFrame that is
longer than the input. Let’s look at an example:
In [156]: df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
   .....:                    'A': [1, 2, 3],
   .....:                    'B': [4, 5, 6],
   .....:                    'C': [7, 8, 9]})

In [157]: df
Out[157]: 
   A  B  C  key
0  1  4  7  foo
1  2  5  8  bar
2  3  6  9  baz
The 'key' column may be a group indicator, and
the other columns are data values. When using
pandas.melt, we must indicate which columns (if any)
are group indicators. Let’s use 'key' as the only
group indicator here:
In [158]: melted = pd.melt(df, ['key'])

In [159]: melted
Out[159]: 
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
Using pivot, we can reshape back to the
original layout:
In [160]: reshaped = melted.pivot('key', 'variable', 'value')

In [161]: reshaped
Out[161]: 
variable  A  B  C
key              
bar       2  5  8
baz       3  6  9
foo       1  4  7
Since the result of pivot creates an index from
the column used as the row labels, we may want to use
reset_index to move the data back into a column:
In [162]: reshaped.reset_index()
Out[162]: 
variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
You can also specify a subset of columns to use as value columns:
In [163]: pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
Out[163]: 
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
pandas.melt can be used without any group
identifiers, too:
In [164]: pd.melt(df, value_vars=['A', 'B', 'C'])
Out[164]: 
  variable value
0        A     1
1        A     2
2        A     3
3        B     4
4        B     5
5        B     6
6        C     7
7        C     8
8        C     9

In [165]: pd.melt(df, value_vars=['key', 'A', 'B'])
Out[165]: 
  variable value
0      key   foo
1      key   bar
2      key   baz
3        A     1
4        A     2
5        A     3
6        B     4
7        B     5
8        B     6
Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advanced analytics.