During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.
If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere in the pandas library, feel free to share your use case on one of the Python mailing lists or on the pandas GitHub site. Indeed, much of the design and implementation of pandas has been driven by the needs of real-world applications.
In this chapter I discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
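For example, here is a minimal sketch of that behavior (not part of the original session; it assumes the usual imports):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
s.sum()              # 4.0 -- the NaN is skipped rather than propagated
s.mean()             # 2.0 -- the mean of the two non-missing values
s.sum(skipna=False)  # nan -- pass skipna=False to propagate missing values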
The way that missing data is represented in pandas objects is
somewhat imperfect, but it is functional for a lot of users. For numeric
data, pandas uses the floating-point value NaN (Not a Number)
to represent missing data. We call this a sentinel value that can be easily
detected:
In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [11]: string_data
Out[11]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [12]: string_data.isnull()
Out[12]:
0    False
1    False
2     True
3    False
dtype: bool
In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
The built-in Python None value
is also treated as NA in object arrays:
In [13]: string_data[0] = None

In [14]: string_data.isnull()
Out[14]:
0     True
1    False
2     True
3    False
dtype: bool
There is work ongoing in the pandas project to improve the internal
details of how missing data is handled, but the user API functions, like
pandas.isnull, abstract away many of the annoying details. See Table 7-1 for a list of some functions related to
missing data handling.
There are a few ways to filter out missing data. While you always have the option
to do it by hand using pandas.isnull and boolean
indexing, dropna can be helpful. On a Series, it
returns the Series with only the non-null data and index values:
In [15]: from numpy import nan as NA

In [16]: data = pd.Series([1, NA, 3.5, NA, 7])

In [17]: data.dropna()
Out[17]:
0    1.0
2    3.5
4    7.0
dtype: float64
This is equivalent to:
In [18]: data[data.notnull()]
Out[18]:
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may
want to drop rows or columns that are all NA or only those containing
any NAs. dropna by default drops any row containing a missing value:
In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])

In [20]: cleaned = data.dropna()

In [21]: data
Out[21]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [22]: cleaned
Out[22]:
     0    1    2
0  1.0  6.5  3.0
Passing how='all' will only
drop rows that are all NA:
In [23]: data.dropna(how='all')
Out[23]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
To drop columns in the same way, pass
axis=1:
In [24]: data[4] = NA

In [25]: data
Out[25]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [26]: data.dropna(axis=1, how='all')
Out[26]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
A related way to filter out DataFrame rows tends to concern time
series data. Suppose you want to keep only rows containing a certain
number of observations. You can indicate this with the thresh argument:
In [27]: df = pd.DataFrame(np.random.randn(7, 3))

In [28]: df.iloc[:4, 1] = NA

In [29]: df.iloc[:2, 2] = NA

In [30]: df
Out[30]:
          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In [31]: df.dropna()
Out[31]:
          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In [32]: df.dropna(thresh=2)
Out[32]:
          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For
most purposes, the fillna method is
the workhorse function to use. Calling fillna with a constant replaces missing values
with that value:
In [33]: df.fillna(0)
Out[33]:
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
Calling fillna with a dict, you
can use a different fill value for each column:
In [34]: df.fillna({1: 0.5, 2: 0})
Out[34]:
          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
fillna returns a new object,
but you can modify the existing object in-place:
In [35]: _ = df.fillna(0, inplace=True)

In [36]: df
Out[36]:
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
The same interpolation methods available for reindexing can be
used with fillna:
In [37]: df = pd.DataFrame(np.random.randn(6, 3))

In [38]: df.iloc[2:, 1] = NA

In [39]: df.iloc[4:, 2] = NA

In [40]: df
Out[40]:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN

In [41]: df.fillna(method='ffill')
Out[41]:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232

In [42]: df.fillna(method='ffill', limit=2)
Out[42]:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232
With fillna you can do lots of
other things with a little creativity. For example, you might pass the
mean or median value of a Series:
In [43]: data = pd.Series([1., NA, 3.5, NA, 7])

In [44]: data.fillna(data.mean())
Out[44]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
See Table 7-2 for a reference on
fillna.
So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations.
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [46]: data
Out[46]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The DataFrame method duplicated
returns a boolean Series indicating whether each row is a
duplicate (has been observed in a previous row) or not:
In [47]: data.duplicated()
Out[47]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
Relatedly, drop_duplicates
returns a DataFrame where the duplicated array is False:
In [48]: data.drop_duplicates()
Out[48]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
Both of these methods by default consider all of the columns;
alternatively, you can specify any subset of them to detect duplicates.
Suppose we had an additional column of values and wanted to filter
duplicates only based on the 'k1'
column:
In [49]: data['v1'] = range(7)

In [50]: data.drop_duplicates(['k1'])
Out[50]:
    k1  k2  v1
0  one   1   0
1  two   1   1
duplicated and drop_duplicates by default keep the first
observed value combination. Passing keep='last' will return the last one:
In [51]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[51]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:
In [52]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
   ....:                               'Pastrami', 'corned beef', 'Bacon',
   ....:                               'pastrami', 'honey ham', 'nova lox'],
   ....:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [53]: data
Out[53]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a
mapping, but here we have a small problem in that some of the meats are
capitalized and others are not. Thus, we need to convert each value to
lowercase using the str.lower Series method:
In [55]: lowercased = data['food'].str.lower()

In [56]: lowercased
Out[56]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [57]: data['animal'] = lowercased.map(meat_to_animal)

In [58]: data
Out[58]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
We could also have passed a function that does all the work:
In [59]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[59]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
Using map is a convenient way
to perform element-wise transformations and other data cleaning–related
operations.
Filling in missing data with the fillna
method is a special case of more general value replacement. As you’ve
already seen, map can be used to
modify a subset of values in an object but replace provides a simpler and more flexible way to do so. Let’s consider
this Series:
In [60]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [61]: data
Out[61]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
The -999 values might be
sentinel values for missing data. To replace these with NA values that
pandas understands, we can use replace, producing a new Series (unless you
pass inplace=True):
In [62]: data.replace(-999, np.nan)
Out[62]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
If you want to replace multiple values at once, you instead pass a list and then the substitute value:
In [63]: data.replace([-999, -1000], np.nan)
Out[63]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
In [64]: data.replace([-999, -1000], [np.nan, 0])
Out[64]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
The argument passed can also be a dict:
In [65]: data.replace({-999: np.nan, -1000: 0})
Out[65]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
The data.replace method is distinct from
data.str.replace, which performs string
substitution element-wise. We look at these string methods on Series
later in the chapter.
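To make the distinction concrete, here is a small illustrative sketch (not from the original session; the values are made up):

s = pd.Series(['foo', 'bar', 'baz'])
s.replace('bar', 'qux')     # replaces whole values equal to 'bar'
s.str.replace('ba', 'BA')   # substitutes the substring 'ba' inside each string: foo, BAr, BAz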
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example:
In [66]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....:                     index=['Ohio', 'Colorado', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])
Like a Series, the axis indexes have a map
method:
In [67]: transform = lambda x: x[:4].upper()

In [68]: data.index.map(transform)
Out[68]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
You can assign to index,
modifying the DataFrame in-place:
In [69]: data.index = data.index.map(transform)

In [70]: data
Out[70]:
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
If you want to create a transformed version of a dataset without modifying the
original, a useful method is rename:
In [71]: data.rename(index=str.title, columns=str.upper)
Out[71]:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
Notably, rename can be used in
conjunction with a dict-like object providing new values for a subset of
the axis labels:
In [72]: data.rename(index={'OHIO': 'INDIANA'},
   ....:             columns={'three': 'peekaboo'})
Out[72]:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
rename saves you from the chore
of copying the DataFrame manually and assigning to its index and columns attributes. Should you wish to modify
a dataset in-place, pass inplace=True:
In [73]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [74]: data
Out[74]:
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:
In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and
finally 61 and older. To do so, you have to use cut, a function in pandas:
In [76]: bins = [18, 25, 35, 60, 100]

In [77]: cats = pd.cut(ages, bins)

In [78]: cats
Out[78]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical
object. The output you see describes the bins computed by
pandas.cut. You can treat it like an array of strings
indicating the bin name; internally it contains a categories array specifying the distinct
category names along with a labeling for the ages data in the codes attribute:
In [79]: cats.codes
Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [80]: cats.categories
Out[80]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [81]: pd.value_counts(cats)
Out[81]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
Note that pd.value_counts(cats) gives the bin counts for the result of
pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis
means that the side is open, while the square
bracket means it is closed (inclusive). You can change which side
is closed by passing right=False:
In [82]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[82]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names by passing a list or array to
the labels option:
In [83]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [84]: pd.cut(ages, bins, labels=group_names)
Out[84]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
If you pass an integer number of bins to cut
instead of explicit bin edges, it will compute
equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into
fourths:
In [85]: data = np.random.rand(20)

In [86]: pd.cut(data, 4, precision=2)
Out[86]:
[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]
The precision=2 option limits the decimal
precision to two digits.
A closely related function, qcut, bins the data based on sample quantiles. Depending on the
distribution of the data, using cut
will not usually result in each bin having the same number of data
points. Since qcut uses sample
quantiles instead, by definition you will obtain roughly equal-size
bins:
In [87]: data = np.random.randn(1000)  # Normally distributed

In [88]: cats = pd.qcut(data, 4)  # Cut into quartiles

In [89]: cats
Out[89]:
[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62], ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.95, -0.68], (0.62, 3.928], (-0.68, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.68] < (-0.68, -0.0265] < (-0.0265, 0.62] < (0.62, 3.928]]

In [90]: pd.value_counts(cats)
Out[90]:
(0.62, 3.928]       250
(-0.0265, 0.62]     250
(-0.68, -0.0265]    250
(-2.95, -0.68]      250
dtype: int64
Similar to cut you can pass
your own quantiles (numbers between 0 and 1, inclusive):
In [91]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[91]:
[(-0.0265, 1.286], (-0.0265, 1.286], (-1.187, -0.0265], (-0.0265, 1.286], (-0.0265, 1.286], ..., (-1.187, -0.0265], (-1.187, -0.0265], (-2.95, -1.187], (-0.0265, 1.286], (-1.187, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -1.187] < (-1.187, -0.0265] < (-0.0265, 1.286] < (1.286, 3.928]]
We’ll return to cut and
qcut later in the chapter during our
discussion of aggregation and group operations, as these discretization
functions are especially useful for quantile and group analysis.
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
In [92]: data = pd.DataFrame(np.random.randn(1000, 4))

In [93]: data.describe()
Out[93]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.049091     0.026112    -0.002544    -0.051827
std       0.996947     1.007458     0.995232     0.998311
min      -3.645860    -3.184377    -3.745356    -3.428254
25%      -0.599807    -0.612162    -0.687373    -0.747478
50%       0.047101    -0.013609    -0.022158    -0.088274
75%       0.756646     0.695298     0.699046     0.623331
max       2.653656     3.525865     2.735527     3.366626
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
In [94]: col = data[2]

In [95]: col[np.abs(col) > 3]
Out[95]:
41    -3.399312
136   -3.745356
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or –3, you can use
the any method on a
boolean DataFrame:
In [96]: data[(np.abs(data) > 3).any(1)]
Out[96]:
            0         1         2         3
41   0.457246 -0.025907 -3.399312 -0.974657
60   1.951312  3.260383  0.963301  1.201206
136  0.508391 -0.196713 -3.745356 -1.520113
235 -0.242459 -3.056990  1.918403 -0.578828
258  0.682841  0.326045  0.425384 -3.428254
322  1.179227 -3.184377  1.369891 -1.074833
544 -3.548824  1.553205 -2.186301  1.277104
635 -0.578093  0.193299  1.397822  3.366626
782 -0.207434  3.525865  0.283070  0.544635
803 -3.645860  0.255475 -0.549574 -1.907459
Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:
In [97]: data[np.abs(data) > 3] = np.sign(data) * 3

In [98]: data.describe()
Out[98]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.050286     0.025567    -0.001399    -0.051765
std       0.992920     1.004214     0.991414     0.995761
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.599807    -0.612162    -0.687373    -0.747478
50%       0.047101    -0.013609    -0.022158    -0.088274
75%       0.756646     0.695298     0.699046     0.623331
max       2.653656     3.000000     2.735527     3.000000
The statement np.sign(data) produces 1 and –1 values based on whether the values in
data are positive or negative:
In [99]: np.sign(data).head()
Out[99]:
     0    1    2    3
0 -1.0  1.0 -1.0  1.0
1  1.0 -1.0  1.0 -1.0
2  1.0  1.0  1.0 -1.0
3 -1.0 -1.0  1.0 -1.0
4 -1.0  1.0 -1.0 -1.0
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function.
Calling permutation
with the length of the axis you want to permute produces an array of
integers indicating the new ordering:
In [100]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [101]: sampler = np.random.permutation(5)

In [102]: sampler
Out[102]: array([3, 1, 4, 2, 0])
That array can then be used in iloc-based indexing or the equivalent take
function:
In [103]: df
Out[103]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [104]: df.take(sampler)
Out[104]:
    0   1   2   3
3  12  13  14  15
1   4   5   6   7
4  16  17  18  19
2   8   9  10  11
0   0   1   2   3
To select a random subset without replacement, you can use
the sample method on Series and
DataFrame:
In [105]: df.sample(n=3)
Out[105]:
    0   1   2   3
3  12  13  14  15
4  16  17  18  19
2   8   9  10  11
To generate a sample with replacement (to
allow repeat choices), pass replace=True to
sample:
In [106]: choices = pd.Series([5, 7, -1, 6, 4])

In [107]: draws = choices.sample(n=10, replace=True)

In [108]: draws
Out[108]:
4    4
1    7
4    4
2   -1
0    5
3    6
1    7
4    4
0    5
4    4
dtype: int64
Another type of transformation for statistical modeling or machine
learning applications is converting a categorical variable into a
“dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix
or DataFrame with k columns
containing all 1s and 0s. pandas has a get_dummies function
for doing this, though devising one yourself is not difficult. Let’s
return to an earlier example DataFrame:
In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})

In [110]: pd.get_dummies(df['key'])
Out[110]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
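As a rough illustration of devising such a function yourself, here is a minimal sketch (not from the original text; it ignores missing values and may not match get_dummies’s exact dtypes):

def simple_dummies(values):
    # one 0/1 indicator column per distinct value, in sorted order
    categories = sorted(set(values))
    return pd.DataFrame({cat: [int(v == cat) for v in values]
                         for cat in categories})

simple_dummies(df['key'])  # same 0/1 pattern as pd.get_dummies(df['key'])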
In some cases, you may want to add a prefix to the columns in the
indicator DataFrame, which can then be merged with the other data.
get_dummies has a prefix argument for
doing this:
In [111]: dummies = pd.get_dummies(df['key'], prefix='key')

In [112]: df_with_dummy = df[['data1']].join(dummies)

In [113]: df_with_dummy
Out[113]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let’s look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:
In [114]: mnames = ['movie_id', 'title', 'genres']

In [115]: movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
   .....:                        header=None, names=mnames)

In [116]: movies[:10]
Out[116]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset:
In [117]: all_genres = []

In [118]: for x in movies.genres:
   .....:     all_genres.extend(x.split('|'))

In [119]: genres = pd.unique(all_genres)
Now we have:
In [120]: genres
Out[120]:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
In [121]: zero_matrix = np.zeros((len(movies), len(genres)))

In [122]: dummies = pd.DataFrame(zero_matrix, columns=genres)
Now, iterate through each movie and set entries in each row of
dummies to 1. To do this, we use dummies.columns to compute the column indices for
each genre:
In [123]: gen = movies.genres[0]

In [124]: gen.split('|')
Out[124]: ['Animation', "Children's", 'Comedy']

In [125]: dummies.columns.get_indexer(gen.split('|'))
Out[125]: array([0, 1, 2])
Then, we can use .iloc to set values based on
these indices:
In [126]: for i, gen in enumerate(movies.genres):
   .....:     indices = dummies.columns.get_indexer(gen.split('|'))
   .....:     dummies.iloc[i, indices] = 1
   .....:
Then, as before, you can combine this with movies:
In [127]: movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [128]: movies_windic.iloc[0]
Out[128]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
                                ...
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, Length: 21, dtype: object
For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
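One possible sketch of that lower-level approach (not from the original text; the function name and structure are my own) is to precompute a genre-to-column mapping and fill a preallocated NumPy array directly:

def genre_dummies(genre_strings, genres):
    # map each genre name to its column position once, up front
    indexer = {genre: i for i, genre in enumerate(genres)}
    # write 0/1 membership flags straight into a preallocated array
    out = np.zeros((len(genre_strings), len(genres)), dtype=np.uint8)
    for row, combo in enumerate(genre_strings):
        for genre in combo.split('|'):
            out[row, indexer[genre]] = 1
    return pd.DataFrame(out, columns=genres)

dummies = genre_dummies(movies.genres, genres)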
A useful recipe for statistical applications is to combine
get_dummies with a discretization
function like cut:
In [129]: np.random.seed(12345)

In [130]: values = np.random.rand(10)

In [131]: values
Out[131]:
array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
       0.7489, 0.6536])

In [132]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [133]: pd.get_dummies(pd.cut(values, bins))
Out[133]:
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
We set the random seed with numpy.random.seed
to make the example deterministic. We will look again at
pandas.get_dummies later in the book.
Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
In many string munging and scripting applications, built-in string methods are
sufficient. As an example, a comma-separated string can be broken into
pieces with split:
In [134]: val = 'a,b, guido'

In [135]: val.split(',')
Out[135]: ['a', 'b', ' guido']
split is often combined
with strip to trim
whitespace (including line breaks):
In [136]: pieces = [x.strip() for x in val.split(',')]

In [137]: pieces
Out[137]: ['a', 'b', 'guido']
These substrings could be concatenated together with a two-colon delimiter using addition:
In [138]: first, second, third = pieces

In [139]: first + '::' + second + '::' + third
Out[139]: 'a::b::guido'
But this isn’t a practical generic method. A faster and more
Pythonic way is to pass a list or tuple to the join method on the string '::':
In [140]: '::'.join(pieces)
Out[140]: 'a::b::guido'
Other methods are concerned with locating substrings. Using
Python’s in keyword is the best way to detect a substring, though index and
find can also be used:
In [141]: 'guido' in val
Out[141]: True

In [142]: val.index(',')
Out[142]: 1

In [143]: val.find(':')
Out[143]: -1
Note the difference between find and index is that index raises an exception if the string isn’t
found (versus returning –1):
In [144]: val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-144-2c016e7367ac> in <module>()
----> 1 val.index(':')
ValueError: substring not found
Relatedly, count returns
the number of occurrences of a particular
substring:
In [145]: val.count(',')
Out[145]: 2
replace will substitute occurrences of one pattern for another. It is
commonly used to delete patterns, too, by passing an empty
string:
In [146]: val.replace(',', '::')
Out[146]: 'a::b:: guido'

In [147]: val.replace(',', '')
Out[147]: 'ab guido'
See Table 7-3 for a listing of some of Python’s string methods.
Regular expressions can also be used with many of these operations, as you’ll see.
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a
regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular expressions to
strings; I’ll give a number of examples of its use here.
The art of writing regular expressions could be a chapter of its own and thus is outside the book’s scope. There are many excellent tutorials and references available on the internet and in other books.
The re module functions fall
into three categories: pattern matching, substitution, and splitting.
Naturally these are all related; a regex describes a pattern to locate
in the text, which can then be used for many purposes. Let’s look at a
simple example: suppose we wanted to split a string with a variable
number of whitespace characters (tabs, spaces, and newlines). The regex
describing one or more whitespace characters is \s+:
In [148]: import re

In [149]: text = "foo bar\tbaz\tqux"

In [150]: re.split('\s+', text)
Out[150]: ['foo', 'bar', 'baz', 'qux']
When you call re.split('\s+',
text), the regular expression is first
compiled, and then its split method is called on the passed text. You
can compile the regex yourself with re.compile,
forming a reusable regex object:
In [151]: regex = re.compile('\s+')

In [152]: regex.split(text)
Out[152]: ['foo', 'bar', 'baz', 'qux']
If, instead, you wanted to get a list of all patterns matching the
regex, you can use the findall
method:
In [153]: regex.findall(text)
Out[153]: [' ', '\t', '\t']
To avoid unwanted escaping with \ in a regular expression, use
raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
Creating a regex object with re.compile is highly recommended if you intend
to apply the same expression to many strings; doing so will save CPU
cycles.
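As a small sketch of that reuse pattern (not from the original session; the variable names are made up):

ws_regex = re.compile(r'\s+')   # compile the pattern once
lines = ['foo bar', 'baz\tqux']
tokens = [ws_regex.split(line) for line in lines]
# [['foo', 'bar'], ['baz', 'qux']]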
match and search are closely related to findall. While findall returns all matches in a string,
search returns only the first match.
More rigidly, match
only matches at the beginning of the string. As a less trivial
example, let’s consider a block of text and a regular expression capable
of identifying most email addresses:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text
produces a list of the email addresses:
In [155]: regex.findall(text)
Out[155]:
['dave@google.com',
 'steve@gmail.com',
 'rob@gmail.com',
 'ryan@yahoo.com']
search returns a special match
object for the first email address in the text. For the preceding regex,
the match object can only tell us the start and end position of the
pattern in the string:
In [156]: m = regex.search(text)

In [157]: m
Out[157]: <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [158]: text[m.start():m.end()]
Out[158]: 'dave@google.com'
regex.match returns None, as it will match only if the pattern
occurs at the start of the string:
In [159]: print(regex.match(text))
None
Relatedly, sub will return a new string with occurrences of the pattern
replaced by a new string:
In [160]: print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
In [161]: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [162]: regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of
the pattern components with its groups
method:
In [163]: m = regex.match('wesm@bright.net')

In [164]: m.groups()
Out[164]: ('wesm', 'bright', 'net')
findall returns a list of
tuples when the pattern has groups:
In [165]: regex.findall(text)
Out[165]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
sub also has access to groups
in each match using special symbols like \1 and \2. The symbol
\1 corresponds to the first matched group, \2 corresponds to the second,
and so forth:
In [166]: print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
There is much more to regular expressions in Python, most of which is outside the book’s scope. Table 7-4 provides a brief summary.
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:
In [167]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   .....:         'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [168]: data = pd.Series(data)

In [169]: data
Out[169]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [170]: data.isnull()
Out[170]:
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool
String and regular expression methods can be applied (passing a
lambda or other function) to each
value using data.map, but it will
fail on the NA (null) values. To cope with this, Series has
array-oriented methods for string operations that skip NA values. These
are accessed through Series’s str
attribute; for example, we could check whether each email address has
'gmail' in it with str.contains:
In [171]: data.str.contains('gmail')
Out[171]:
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
Regular expressions can be used, too, along with any re options like IGNORECASE:
In [172]: pattern
Out[172]: '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [173]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[173]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object
There are a couple of ways to do vectorized element retrieval.
Either use str.get or index into the
str attribute:
In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [175]: matches
Out[175]:
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object
To access elements in the embedded lists, we can pass an index to either of these functions:
In [176]: matches.str.get(1)
Out[176]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [177]: matches.str[0]
Out[177]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64
You can similarly slice strings using this syntax:
In [178]: data.str[:5]
Out[178]:
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
See Table 7-5 for more pandas string methods.
Effective data preparation can significantly improve productivity by enabling you to spend more time analyzing data and less time getting it ready for analysis. We have explored a number of tools in this chapter, but the coverage here is by no means comprehensive. In the next chapter, we will explore pandas’s joining and grouping functionality.