The preceding chapters have focused on introducing different types of data wrangling workflows and features of NumPy, pandas, and other libraries. Over time, pandas has developed a depth of features for power users. This chapter digs into a few more advanced feature areas to help you deepen your expertise as a pandas user.
This section introduces the pandas Categorical type. I will show how
you can achieve better performance and memory use in some pandas
operations by using it. I also introduce some tools for using categorical
data in statistics and machine learning applications.
Frequently, a column in a table may contain repeated instances of
a smaller set of distinct values. We have already seen functions
like unique and
value_counts, which enable us to extract the distinct
values from an array and compute their frequencies, respectively:
In [10]: import numpy as np; import pandas as pd

In [11]: values = pd.Series(['apple', 'orange', 'apple',
   ....:                     'apple'] * 2)

In [12]: values
Out[12]:
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [13]: pd.unique(values)
Out[13]: array(['apple', 'orange'], dtype=object)

In [14]: pd.value_counts(values)
Out[14]:
apple     6
orange    2
dtype: int64
Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values and storing the primary observations as integer keys referencing the dimension table:
In [15]: values = pd.Series([0, 1, 0, 0] * 2)

In [16]: dim = pd.Series(['apple', 'orange'])

In [17]: values
Out[17]:
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [18]: dim
Out[18]:
0     apple
1    orange
dtype: object
We can use the take method to restore the original Series of strings:
In [19]: dim.take(values)
Out[19]:
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost (sketched after this list) are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
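As a minimal sketch of both operations, the Categorical methods rename_categories and add_categories return new categoricals while leaving the underlying integer codes untouched:

fruits_cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# Renaming: the labels change but the underlying codes do not
fruits_cat.rename_categories(['APPLE', 'ORANGE'])

# Appending: 'banana' is added after the existing categories,
# so the existing codes remain valid
fruits_cat.add_categories(['banana'])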
pandas has a special Categorical type for
holding data that uses the integer-based categorical representation or
encoding. Let’s consider the example Series from
before:
In [20]: fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [21]: N = len(fruits)

In [22]: df = pd.DataFrame({'fruit': fruits,
   ....:                    'basket_id': np.arange(N),
   ....:                    'count': np.random.randint(3, 15, size=N),
   ....:                    'weight': np.random.uniform(0, 4, size=N)},
   ....:                   columns=['basket_id', 'fruit', 'count', 'weight'])

In [23]: df
Out[23]:
   basket_id   fruit  count    weight
0          0   apple      5  3.858058
1          1  orange      8  2.612708
2          2   apple      4  2.995627
3          3   apple      7  2.614279
4          4   apple     12  2.990859
5          5  orange      8  3.845227
6          6   apple      5  0.033553
7          7   apple      4  0.425778
Here, df['fruit'] is an array of Python string
objects. We can convert it to categorical by calling:
In [24]: fruit_cat = df['fruit'].astype('category')

In [25]: fruit_cat
Out[25]:
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
The values for fruit_cat are not a NumPy array,
but an instance of pandas.Categorical:
In [26]: c = fruit_cat.values

In [27]: type(c)
Out[27]: pandas.core.categorical.Categorical
The Categorical object has
categories and codes
attributes:
In [28]: c.categories
Out[28]: Index(['apple', 'orange'], dtype='object')

In [29]: c.codes
Out[29]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
You can convert a DataFrame column to categorical by assigning the converted result:
In [30]: df['fruit'] = df['fruit'].astype('category')

In [31]: df.fruit
Out[31]:
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
You can also create pandas.Categorical directly
from other types of Python sequences:
In [32]: my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [33]: my_categories
Out[33]:
[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]
If you have obtained categorical encoded data from another source,
you can use the alternative from_codes
constructor:
In [34]: categories = ['foo', 'bar', 'baz']

In [35]: codes = [0, 1, 2, 0, 0, 1]

In [36]: my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [37]: my_cats_2
Out[37]:
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
Unless explicitly specified, categorical conversions assume no
specific ordering of the categories. So the
categories array may be in a different order
depending on the ordering of the input data. When using
from_codes or any of the other constructors, you can
indicate that the categories have a meaningful ordering:
In [38]: ordered_cat = pd.Categorical.from_codes(codes, categories,
   ....:                                         ordered=True)

In [39]: ordered_cat
Out[39]:
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
The output [foo < bar < baz] indicates
that 'foo' precedes 'bar' in the
ordering, and so on. An unordered categorical instance can be made
ordered with as_ordered:
In [40]: my_cats_2.as_ordered()
Out[40]:
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
As a last note, categorical data need not be strings, even though I have shown only string examples. A categorical array can consist of any immutable value type.
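For example, here is a minimal sketch using tuples as the category values:

# Tuples are immutable, so they are valid category values
tuple_cat = pd.Categorical([(0, 1), (2, 3), (0, 1)] * 2)
tuple_cat.categories  # an object Index holding the two distinct tuples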
Categorical data in pandas generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.
Let’s consider some random numeric data and the pandas.qcut binning function, which returns pandas.Categorical; we used pandas.cut earlier in the book but glossed over the details of how categoricals work:
In [41]: np.random.seed(12345)

In [42]: draws = np.random.randn(1000)

In [43]: draws[:5]
Out[43]: array([-0.2047,  0.4789, -0.5194, -0.5557,  1.9658])
Let’s compute a quartile binning of this data and extract some statistics:
In [44]: bins = pd.qcut(draws, 4)

In [45]: bins
Out[45]:
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
While useful, the exact sample quartiles may be less useful for
producing a report than quartile names. We can achieve this with the
labels argument to qcut:
In [46]: bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

In [47]: bins
Out[47]:
[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [48]: bins.codes[:10]
Out[48]: array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)
The labeled bins categorical does not contain
information about the bin edges in the data, so we can use
groupby to extract some summary statistics:
In [49]: bins = pd.Series(bins, name='quartile')

In [50]: results = (pd.Series(draws)
   ....:            .groupby(bins)
   ....:            .agg(['count', 'min', 'max'])
   ....:            .reset_index())

In [51]: results
Out[51]:
  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528
The 'quartile' column in the result retains the
original categorical information, including ordering, from
bins:
In [52]: results['quartile']
Out[52]:
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. Let’s consider some Series with 10 million elements and a small number of distinct categories:
In [53]: N = 10000000

In [54]: draws = pd.Series(np.random.randn(N))

In [55]: labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
Now we convert labels to categorical:
In [56]: categories = labels.astype('category')
Now we note that labels uses significantly
more memory than categories:
In [57]: labels.memory_usage()
Out[57]: 80000080

In [58]: categories.memory_usage()
Out[58]: 10000272
The conversion to category is not free, of course, but it is a one-time cost:
In [59]: %time _ = labels.astype('category')
CPU times: user 360 ms, sys: 140 ms, total: 500 ms
Wall time: 502 ms
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
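As a rough sanity check, you can compare the two yourself with IPython's %timeit using the 10-million-element labels and categories objects from above (exact timings are machine- and version-dependent, so none are shown here):

# Grouping by string labels hashes millions of Python strings;
# grouping by the categorical version uses the compact integer codes
%timeit draws.groupby(labels).mean()
%timeit draws.groupby(categories).mean()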
Series containing categorical data have several special methods similar to the
Series.str specialized string methods. This also
provides convenient access to the categories and codes. Consider the
Series:
In [60]: s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [61]: cat_s = s.astype('category')

In [62]: cat_s
Out[62]:
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]
The special attribute cat provides access to
categorical methods:
In [63]: cat_s.cat.codes
Out[63]:
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [64]: cat_s.cat.categories
Out[64]: Index(['a', 'b', 'c', 'd'], dtype='object')
Suppose that we know the actual set of categories for this data
extends beyond the four values observed in the data. We can use the
set_categories method to change them:
In [65]: actual_categories = ['a', 'b', 'c', 'd', 'e']

In [66]: cat_s2 = cat_s.cat.set_categories(actual_categories)

In [67]: cat_s2
Out[67]:
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]
While it appears that the data is unchanged, the new categories
will be reflected in operations that use them. For example,
value_counts respects the categories, if
present:
In [68]: cat_s.value_counts()
Out[68]:
d    2
c    2
b    2
a    2
dtype: int64

In [69]: cat_s2.value_counts()
Out[69]:
d    2
c    2
b    2
a    2
e    0
dtype: int64
In large datasets, categoricals are often used as a convenient
tool for memory savings and better performance. After you filter a large
DataFrame or Series, many of the categories may not appear in the data.
To help with this, we can use the
remove_unused_categories method to trim unobserved
categories:
In [70]: cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [71]: cat_s3
Out[71]:
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [72]: cat_s3.cat.remove_unused_categories()
Out[72]:
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]
See Table 12-1 for a listing of available categorical methods.
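To give a flavor of a few of the methods in that table, here is a brief sketch using cat_s from above (a sample, not the complete listing):

cat_s.cat.as_unordered()                            # drop the ordered flag, if set
cat_s.cat.remove_categories(['d'])                  # removed values become NaN
cat_s.cat.reorder_categories(['d', 'c', 'b', 'a'])  # same categories, new order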
When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.
Consider the previous example:
In [73]: cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
As mentioned previously in Chapter 7, the pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables:
In [74]: pd.get_dummies(cat_s)
Out[74]:
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
While we’ve already discussed using the groupby method for
Series and DataFrame in depth in Chapter 10, there are
some additional techniques that you may find of use.
In Chapter 10 we looked at the apply method in grouped operations
for performing transformations. There is another built-in method
called transform, which is similar to
apply but imposes more constraints on the kind of
function you can use:
It can produce a scalar value to be broadcast to the shape of the group
It can produce an object of the same shape as the input group
It must not mutate its input
Let’s consider a simple example for illustration:
In [75]: df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
   ....:                    'value': np.arange(12.)})

In [76]: df
Out[76]:
   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0
Here are the group means by key:
In [77]: g = df.groupby('key').value

In [78]: g.mean()
Out[78]:
key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64
Suppose instead we wanted to produce a Series of the same shape as
df['value'] but with values replaced by the average
grouped by 'key'. We can pass the function
lambda x: x.mean() to
transform:
In [79]: g.transform(lambda x: x.mean())
Out[79]:
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
For built-in aggregation functions, we can pass a string alias as
with the GroupBy agg method:
In [80]: g.transform('mean')
Out[80]:
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
Like apply, transform works
with functions that return Series, but the result must be the same size
as the input. For example, we can multiply each group by 2 using a
lambda function:
In [81]: g.transform(lambda x: x * 2)
Out[81]:
0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64
As a more complicated example, we can compute the ranks in descending order for each group:
In [82]: g.transform(lambda x: x.rank(ascending=False))
Out[82]:
0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64
Consider a group transformation function composed from simple aggregations:
def normalize(x):
    return (x - x.mean()) / x.std()
We can obtain equivalent results in this case either using
transform or apply:
In [84]: g.transform(normalize)
Out[84]:
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [85]: g.apply(normalize)
Out[85]:
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function. These also have a "fast path" when used with transform. This allows us to perform a so-called unwrapped group operation:
In [86]: g.transform('mean')
Out[86]:
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [87]: normalized = (df['value'] - g.transform('mean')) / g.transform('std')

In [88]: normalized
Out[88]:
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
While an unwrapped group operation may involve multiple group aggregations, the overall benefit of vectorized operations often outweighs this.
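As a hedged check (timings depend heavily on data size, machine, and pandas version; on this 12-row example the difference is negligible), you can compare the two approaches with IPython's %timeit:

# A general function passed to transform versus the unwrapped version
%timeit g.transform(normalize)
%timeit (df['value'] - g.transform('mean')) / g.transform('std')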
For time series data, the resample method is semantically a group operation based on time intervals. Here’s a small example table:
In [89]: N = 15

In [90]: times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

In [91]: df = pd.DataFrame({'time': times,
   ....:                    'value': np.arange(N)})

In [92]: df
Out[92]:
                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14
Here, we can index by 'time' and then
resample:
In [93]: df.set_index('time').resample('5min').count()
Out[93]:
                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
Suppose that a DataFrame contains multiple time series, marked by an additional group key column:
In [94]: df2 = pd.DataFrame({'time': times.repeat(3),
   ....:                     'key': np.tile(['a', 'b', 'c'], N),
   ....:                     'value': np.arange(N * 3.)})

In [95]: df2[:7]
Out[95]:
  key                time  value
0   a 2017-05-20 00:00:00    0.0
1   b 2017-05-20 00:00:00    1.0
2   c 2017-05-20 00:00:00    2.0
3   a 2017-05-20 00:01:00    3.0
4   b 2017-05-20 00:01:00    4.0
5   c 2017-05-20 00:01:00    5.0
6   a 2017-05-20 00:02:00    6.0
To do the same resampling for each value of
'key', we introduce the pandas.TimeGrouper object:
In [96]: time_key = pd.TimeGrouper('5min')
We can then set the time index, group by 'key'
and time_key, and aggregate:
In [97]: resampled = (df2.set_index('time')
   ....:              .groupby(['key', time_key])
   ....:              .sum())

In [98]: resampled
Out[98]:
                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0

In [99]: resampled.reset_index()
Out[99]:
  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0
One constraint with using TimeGrouper is that
the time must be the index of the Series or DataFrame.
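As an aside, more recent pandas versions provide pandas.Grouper, which accepts a key argument naming a column; a minimal sketch assuming that API, which avoids the set_index step:

# Group on the 'time' column directly instead of the index
time_key2 = pd.Grouper(key='time', freq='5min')
resampled2 = df2.groupby(['key', time_key2]).sum()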
When applying a sequence of transformations to a dataset, you may find yourself creating numerous temporary variables that are never used in your analysis. Consider this example, for instance:
df = load_data()
df2 = df[df['col2'] < 0]
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
result = df2.groupby('key').col1_demeaned.std()
While we’re not using any real data here, this example highlights
some new methods. First, the DataFrame.assign method
is a functional alternative to column
assignments of the form df[k] = v. Rather than
modifying the object in-place, it returns a new DataFrame with the
indicated modifications. So these statements are equivalent:
# Usual non-functional way
df2 = df.copy()
df2['k'] = v

# Functional assign way
df2 = df.assign(k=v)
Assigning in-place may execute faster than using
assign, but assign enables easier
method chaining:
result = (df2.assign(col1_demeaned=df2.col1 - df2.col1.mean())
          .groupby('key')
          .col1_demeaned.std())
I used the outer parentheses to make it more convenient to add line breaks.
One thing to keep in mind when doing method chaining is that you may
need to refer to temporary objects. In the preceding example, we cannot
refer to the result of load_data until it has been
assigned to the temporary variable df. To help with
this, assign and many other pandas functions accept
function-like arguments, also known as
callables.
To show callables in action, consider a fragment of the example from before:
df = load_data()
df2 = df[df['col2'] < 0]
This can be rewritten as:
df = (load_data()
      [lambda x: x['col2'] < 0])
Here, the result of load_data is not assigned to
a variable, so the function passed into [] is then
bound to the object at that stage of the method
chain.
We can continue, then, and write the entire sequence as a single chained expression:
result = (load_data()
          [lambda x: x.col2 < 0]
          .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
          .groupby('key')
          .col1_demeaned.std())
Whether you prefer to write code in this style is a matter of taste, and splitting up the expression into multiple steps may make your code more readable.
You can accomplish a lot with built-in pandas functions and the approaches to
method chaining with callables that we just looked at. However,
sometimes you need to use your own functions or functions from
third-party libraries. This is where the pipe method
comes in.
Consider a sequence of function calls:
a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)
When using functions that accept and return Series or DataFrame
objects, you can rewrite this using calls to
pipe:
result = (df.pipe(f, arg1=v1)
          .pipe(g, v2, arg3=v3)
          .pipe(h, arg4=v4))
The statements f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier.
A potentially useful pattern for pipe is to generalize sequences of operations into reusable functions. As an example, let’s consider subtracting group means from a column:
g = df.groupby(['key1', 'key2'])
df['col1'] = df['col1'] - g['col1'].transform('mean')
Suppose that you wanted to be able to demean more than one column and easily change the group keys. Additionally, you might want to perform this transformation in a method chain. Here is an example implementation:
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result
result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
pandas, like many open source software projects, is still changing and acquiring new and improved functionality. As elsewhere in this book, the focus here has been on the most stable functionality that is less likely to change over the next several years.
To deepen your expertise as a pandas user, I encourage you to explore the documentation and read the release notes as the development team makes new open source releases. We also invite you to join in on pandas development: fixing bugs, building new features, and improving the documentation.