Data wrangling is a broad term used, often informally, to describe the process of transforming raw data to a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is an important step.
The most common data structure used to “wrangle” data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a data frame created from data about passengers on the Titanic:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data as a dataframe
dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.00 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.00 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.00 | male | 0 | 0 |
| 3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.00 | female | 0 | 1 |
| 4 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
There are three important things to notice in this data frame.
First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, was female, and survived the disaster.
Second, each column contains a name (e.g., Name, PClass, Age) and each row contains an index number (e.g., 0 for the lucky Miss Elisabeth Walton Allen). We will use these to select and manipulate observations and features.
Third, two columns, Sex and SexCode, contain the same information in different formats. In Sex, a woman is indicated by the string female, while in SexCode, a woman is indicated by using the integer 1. We will want all our features to be unique, and therefore we will need to remove one of these columns.
In this chapter, we will cover a wide variety of techniques to manipulate data frames using the pandas library with the goal of creating a clean, well-structured set of observations for further preprocessing.
You want to create a new data frame.
pandas has many methods of creating a new DataFrame object. One easy method is to create an empty data frame using DataFrame and then define each column separately:
```python
# Load library
import pandas as pd

# Create DataFrame
dataframe = pd.DataFrame()

# Add columns
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

# Show DataFrame
dataframe
```

|   | Name | Age | Driver |
|---|---|---|---|
| 0 | Jacky Jackson | 38 | True |
| 1 | Steven Stevenson | 25 | False |
Alternatively, once we have created a DataFrame object, we can append new rows to the bottom:
```python
# Create row
new_person = pd.Series(['Molly Mooney', 40, True], index=['Name', 'Age', 'Driver'])

# Append row
dataframe.append(new_person, ignore_index=True)
```

|   | Name | Age | Driver |
|---|---|---|---|
| 0 | Jacky Jackson | 38 | True |
| 1 | Steven Stevenson | 25 | False |
| 2 | Molly Mooney | 40 | True |
pandas offers what can feel like an infinite number of ways to create a DataFrame. In the real world, creating an empty DataFrame and then populating it will almost never happen. Instead, our DataFrames will be created from real data we have loaded from other sources (e.g., a CSV file or database).
One of the easiest things we can do after loading the data is view the first few rows using head:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Show two rows
dataframe.head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
We can also take a look at the number of rows and columns:
```python
# Show dimensions
dataframe.shape
```

```
(1313, 6)
```
Additionally, we can get descriptive statistics for any numeric columns using describe:
```python
# Show statistics
dataframe.describe()
```

|   | Age | Survived | SexCode |
|---|---|---|---|
| count | 756.000000 | 1313.000000 | 1313.000000 |
| mean | 30.397989 | 0.342727 | 0.351866 |
| std | 14.259049 | 0.474802 | 0.477734 |
| min | 0.170000 | 0.000000 | 0.000000 |
| 25% | 21.000000 | 0.000000 | 0.000000 |
| 50% | 28.000000 | 0.000000 | 0.000000 |
| 75% | 39.000000 | 1.000000 | 1.000000 |
| max | 71.000000 | 1.000000 | 1.000000 |
After we load some data, it is a good idea to understand how it is structured and what kind of information it contains. Ideally, we would view the full data directly. But in most real-world cases, the data could have thousands, hundreds of thousands, or even millions of rows and columns. Instead, we have to rely on viewing small slices of the data and calculating summary statistics.
In our solution, we are using a toy dataset of the passengers of
the Titanic on her last voyage. Using head we can take a look at the first few rows (five by default) of the data. Alternatively, we can use tail to view the last few rows. With shape we can see how many rows and columns our DataFrame contains. And finally, with describe we can see some basic descriptive statistics for any numerical column.
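For instance, a quick look at the end of the data instead of the beginning, using the same dataframe as above (output omitted here), would look like this:

```python
# Show last two rows
dataframe.tail(2)
```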
It is worth noting that summary statistics do not always tell the full
story. For example, pandas treats the columns Survived and SexCode
as numeric columns because they contain 1s and 0s. However, in this
case the numerical values represent categories. For example, if
Survived equals 1, it indicates that the passenger survived the
disaster. For this reason, some of the summary statistics provided don’t make sense, such as the standard deviation of the SexCode column (an
indicator of the passenger’s gender).
This can be easily done in pandas. For example, if we wanted to select all the women on the Titanic:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Show top two rows where column 'sex' is 'female'
dataframe[dataframe['Sex'] == 'female'].head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
Take a second and look at the format of this solution.
dataframe['Sex'] == 'female' is our conditional statement; by wrapping
that in dataframe[] we are telling pandas to “select all the rows in the
DataFrame where the value of dataframe['Sex'] is 'female'.”
Multiple conditions are easy as well. For example, here we select all the rows where the passenger is a female 65 or older:
```python
# Filter rows
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 65)]
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 73 | Crosby, Mrs Edward Gifford (Catherine Elizabet... | 1st | 69.0 | female | 1 | 1 |
Conditionally selecting and filtering data is one of the most common tasks in data wrangling. You rarely want all the raw data from the source; instead, you are interested in only some subsection of it. For example, you might only be interested in stores in certain states or the records of patients over a certain age.
pandas’ replace is an easy way to find and replace values. For example,
we can replace any instance of "female" in the Sex column with
"Woman":
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)
```

```
0    Woman
1    Woman
Name: Sex, dtype: object
```
We can also replace multiple values at the same time:
# Replace "female" and "male with "Woman" and "Man"dataframe['Sex'].replace(["female","male"],["Woman","Man"]).head(5)
0 Woman 1 Woman 2 Man 3 Woman 4 Man Name: Sex, dtype: object
We can also find and replace across the entire DataFrame object by specifying the whole data frame instead of a single column:
```python
# Replace values, show two rows
dataframe.replace(1, "One").head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29 | female | One | One |
| 1 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | One |
replace also accepts regular expressions:
# Replace values, show two rowsdataframe.replace(r"1st","First",regex=True).head(2)
| Name | PClass | Age | Sex | Survived | SexCode | |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | First | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | First | 2.0 | female | 0 | 1 |
Rename columns using the rename method:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Rename column, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class'}).head(2)
```

|   | Name | Passenger Class | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
Notice that the rename method can accept a dictionary as a parameter. We
can use the dictionary to change multiple column names at once:
```python
# Rename columns, show two rows
dataframe.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)
```

|   | Name | Passenger Class | Age | Gender | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
Using rename with a dictionary as an argument to the columns
parameter is my preferred way to rename columns because it works with
any number of columns. If we want to rename all columns at once, this
helpful snippet of code creates a dictionary with the old column names
as keys and empty strings as values:
```python
# Load library
import collections

# Create dictionary
column_names = collections.defaultdict(str)

# Create keys
for name in dataframe.columns:
    column_names[name]

# Show dictionary
column_names
```

```
defaultdict(str,
            {'Age': '',
             'Name': '',
             'PClass': '',
             'Sex': '',
             'SexCode': '',
             'Survived': ''})
```
pandas comes with some built-in methods for commonly used descriptive statistics:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Calculate statistics
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())
```

```
Maximum: 71.0
Minimum: 0.17
Mean: 30.397989417989415
Sum: 22980.879999999997
Count: 756
```
In addition to the statistics used in the solution, pandas offers
variance (var), standard deviation (std), kurtosis (kurt),
skewness (skew), standard error of the mean (sem), mode (mode),
median (median), and a number of others.
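As a quick illustration, a few of these could be called on the same Age column (a minimal sketch; output omitted):

```python
# Calculate additional statistics on the 'Age' column
print('Variance:', dataframe['Age'].var())
print('Standard deviation:', dataframe['Age'].std())
print('Skewness:', dataframe['Age'].skew())
print('Median:', dataframe['Age'].median())
```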
Furthermore, we can also apply these methods to the whole DataFrame:
```python
# Show counts
dataframe.count()
```

```
Name        1313
PClass      1313
Age          756
Sex         1313
Survived    1313
SexCode     1313
dtype: int64
```
Use unique to view an array of all unique values in a column:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Select unique values
dataframe['Sex'].unique()
```

```
array(['female', 'male'], dtype=object)
```
Alternatively, value_counts will display all unique values with the
number of times each value appears:
```python
# Show counts
dataframe['Sex'].value_counts()
```

```
male      851
female    462
Name: Sex, dtype: int64
```
Both unique and value_counts are useful for manipulating and
exploring categorical columns. Very often in categorical columns there
will be classes that need to be handled in the data wrangling phase. For example, in the Titanic dataset, PClass is a column indicating the class of a passenger’s ticket. There were three classes on the Titanic; however, if we use value_counts we can see a problem:
```python
# Show counts
dataframe['PClass'].value_counts()
```

```
3rd    711
1st    322
2nd    279
*        1
Name: PClass, dtype: int64
```
While almost all passengers belong to one of three classes as expected,
a single passenger has the class *. There are a number of strategies
for handling this type of issue, which we will address in Chapter 5, but for now just realize that “extra” classes are common in categorical data and should not be ignored.
Finally, if we simply want to count the number of unique values, we
can use nunique:
```python
# Show number of unique values
dataframe['PClass'].nunique()
```

```
4
```
isnull and notnull return booleans indicating whether a value is
missing:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Select missing values, show two rows
dataframe[dataframe['Age'].isnull()].head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 12 | Aubert, Mrs Leontine Pauline | 1st | NaN | female | 1 | 1 |
| 13 | Barkworth, Mr Algernon H | 1st | NaN | male | 1 | 0 |
Missing values are a ubiquitous problem in data wrangling, yet many
underestimate the difficulty of working with missing data. pandas uses NumPy’s NaN (“Not A Number”) value to denote missing values, but it is important to note that NaN is not fully implemented natively in pandas. For example, if we try to replace all instances of the string male with missing values, we get an error:
```python
# Attempt to replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', NaN)
```

```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-5682d714f87d> in <module>()
      1 # Attempt to replace values with NaN
----> 2 dataframe['Sex'] = dataframe['Sex'].replace('male', NaN)

NameError: name 'NaN' is not defined
---------------------------------------------------------------------------
```
To have full functionality with NaN we need to import the NumPy
library first:
```python
# Load library
import numpy as np

# Replace values with NaN
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)
```
Oftentimes a dataset uses a specific value to denote a missing
observation, such as NONE, -999, or .. pandas’ read_csv includes
a parameter allowing us to specify the values used to indicate missing
values:
```python
# Load data, set missing values
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999])
```
The best way to delete a column is to use drop with the parameter
axis=1 (i.e., the column axis):
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Delete column
dataframe.drop('Age', axis=1).head(2)
```

|   | Name | PClass | Sex | Survived | SexCode |
|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | female | 0 | 1 |
You can also use a list of column names as the main argument to drop multiple columns at once:
```python
# Drop columns
dataframe.drop(['Age', 'Sex'], axis=1).head(2)
```

|   | Name | PClass | Survived | SexCode |
|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 0 | 1 |
If a column does not have a name (which can sometimes happen), you can
drop it by its column index using dataframe.columns:
```python
# Drop column
dataframe.drop(dataframe.columns[1], axis=1).head(2)
```

|   | Name | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 2.0 | female | 0 | 1 |
drop is the idiomatic method of deleting a column. An alternative
method is del dataframe['Age'], which works most of the time but is
not recommended because of how it is called within pandas (the details
of which are outside the scope of this book).
One habit I recommend learning is to never use pandas’ inplace=True
argument. Many pandas methods include an inplace parameter, which when
True edits the DataFrame directly. This can lead to problems in more
complex data processing pipelines because we are treating the DataFrames
as mutable objects (which they technically are). I recommend treating
DataFrames as immutable objects. For example:
```python
# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)
```
In this example, we are not mutating the DataFrame dataframe but
instead are making a new DataFrame that is an altered version of
dataframe called dataframe_name_dropped. If you treat your
DataFrames as immutable objects, you will save yourself a lot of
headaches down the road.
Use a boolean condition to create a new DataFrame excluding the rows you want to delete:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Delete rows, show first two rows of output
dataframe[dataframe['Sex'] != 'male'].head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
While technically you can use the drop method (for example, df.drop([0, 1], axis=0) to drop the first two rows), a more practical method is simply to wrap a boolean condition inside df[]. The reason is because we can use the power of conditionals to delete either a single row or (far more likely) many rows at once.
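For completeness, the label-based approach mentioned above would look like this (a sketch; in practice the boolean approach below is usually more flexible):

```python
# Drop the rows with index labels 0 and 1, show first two rows of output
dataframe.drop([0, 1], axis=0).head(2)
```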
We can use boolean conditions to easily delete single rows by matching a unique value:
```python
# Delete row, show first two rows of output
dataframe[dataframe['Name'] != 'Allison, Miss Helen Loraine'].head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
And we can even use it to delete a single row by row index:
```python
# Delete row, show first two rows of output
dataframe[dataframe.index != 0].head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
Use drop_duplicates, but be mindful of the parameters:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Drop duplicates, show first two rows of output
dataframe.drop_duplicates().head(2)
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
A keen reader will notice that the solution didn’t actually drop any rows:
# Show number of rows("Number Of Rows In The Original DataFrame:",len(dataframe))("Number Of Rows After Deduping:",len(dataframe.drop_duplicates()))
Number Of Rows In The Original DataFrame: 1313 Number Of Rows After Deduping: 1313
The reason is because drop_duplicates defaults to only dropping rows
that match perfectly across all columns. Under this condition, every row
in our DataFrame, dataframe, is actually unique. However, often we
want to consider only a subset of columns to check for duplicate rows.
We can accomplish this using the subset parameter:
```python
# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'])
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
Take a close look at the preceding output: we told drop_duplicates to only consider any two rows with the same value for Sex to be duplicates and to drop them. Now we are left with a DataFrame of only two rows: one man and one woman. You might be asking why drop_duplicates decided to keep these two rows instead of two different rows. The answer is that drop_duplicates defaults to keeping the first occurrence of a duplicated row and dropping the rest. We can control this behavior using the keep parameter:
```python
# Drop duplicates
dataframe.drop_duplicates(subset=['Sex'], keep='last')
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 1307 | Zabour, Miss Tamini | 3rd | NaN | female | 0 | 1 |
| 1312 | Zimmerman, Leo | 3rd | 29.0 | male | 0 | 0 |
A related method is duplicated, which returns a boolean series denoting
if a row is a duplicate or not. This is a good option if you don’t want
to simply drop duplicates.
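For example, a sketch of using duplicated to flag (rather than remove) rows whose Sex value has already appeared earlier in the DataFrame might look like this:

```python
# Flag rows whose 'Sex' value duplicates an earlier row; show first three flags
dataframe.duplicated(subset=['Sex']).head(3)
```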
groupby is one of the most powerful features in pandas:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Group rows by the values of the column 'Sex', calculate mean
# of each group
dataframe.groupby('Sex').mean()
```

|   | Age | Survived | SexCode |
|---|---|---|---|
| Sex | |||
| female | 29.396424 | 0.666667 | 1.0 |
| male | 31.014338 | 0.166863 | 0.0 |
groupby is where data wrangling really starts to take shape. It is
very common to have a DataFrame where each row is a person or an event
and we want to group them according to some criterion and then calculate
a statistic. For example, you can imagine a DataFrame where each row is
an individual sale at a national restaurant chain and we want the total
sales per restaurant. We can accomplish this by grouping rows by
individual restaurants and then calculating the sum of each group.
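A small sketch of that restaurant example (the DataFrame sales and its columns restaurant and sale_amount are hypothetical, invented for illustration):

```python
import pandas as pd

# Hypothetical sales data: each row is one individual sale
sales = pd.DataFrame({
    'restaurant': ['Boston', 'Boston', 'Austin', 'Austin', 'Austin'],
    'sale_amount': [12.50, 8.00, 20.00, 5.25, 9.75]
})

# Group rows by restaurant, then sum the sales of each group
sales.groupby('restaurant')['sale_amount'].sum()
```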
Users new to groupby often write a line like this and are confused by
what is returned:
```python
# Group rows
dataframe.groupby('Sex')
```

```
<pandas.core.groupby.DataFrameGroupBy object at 0x10efacf28>
```
Why didn’t it return something more useful? The reason is because
groupby needs to be paired with some operation we want to apply to
each group, such as calculating an aggregate statistic (e.g., mean,
median, sum). When talking about grouping we often use shorthand and say “group by gender,” but that is incomplete. For grouping to be useful, we need to group by something and then apply a function to each of those groups:
```python
# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()
```

```
Survived
0    863
1    450
Name: Name, dtype: int64
```
Notice Name added after the groupby? That is because particular
summary statistics are only meaningful to certain types of data. For
example, while calculating the average age by gender makes sense,
calculating the total age by gender does not. In this case we group the
data into survived or not, then count the number of names (i.e.,
passengers) in each group.
We can also group by a first column, then group that grouping by a second column:
```python
# Group rows, calculate mean
dataframe.groupby(['Sex', 'Survived'])['Age'].mean()
```

```
Sex     Survived
female  0           24.901408
        1           30.867143
male    0           32.320780
        1           25.951875
Name: Age, dtype: float64
```
You need to group individual rows by time periods.
Use resample to group rows by chunks of time:
```python
# Load libraries
import pandas as pd
import numpy as np

# Create date range
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S')

# Create DataFrame
dataframe = pd.DataFrame(index=time_index)

# Create column of random values
dataframe['Sale_Amount'] = np.random.randint(1, 10, 100000)

# Group rows by week, calculate sum per week
dataframe.resample('W').sum()
```

|   | Sale_Amount |
|---|---|
| 2017-06-11 | 86423 |
| 2017-06-18 | 101045 |
| 2017-06-25 | 100867 |
| 2017-07-02 | 100894 |
| 2017-07-09 | 100438 |
| 2017-07-16 | 10297 |
Our standard Titanic dataset does not contain a datetime column, so for this recipe we have generated a simple DataFrame where each row is an individual sale. For each sale we know its date and time and its dollar amount (this data isn’t realistic because every sale takes place precisely 30 seconds apart and is an exact dollar amount, but for the sake of simplicity let us pretend).
The raw data looks like this:
```python
# Show three rows
dataframe.head(3)
```

|   | Sale_Amount |
|---|---|
| 2017-06-06 00:00:00 | 7 |
| 2017-06-06 00:00:30 | 2 |
| 2017-06-06 00:01:00 | 7 |
Notice that the date and time of each sale is the index of the
DataFrame; this is because resample requires the index to be
datetime-like values.
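If the timestamps were stored in an ordinary column instead, one way to prepare the data would be to convert that column to datetimes and move it into the index first (a minimal sketch assuming a hypothetical column named date):

```python
# Convert a hypothetical 'date' column to datetimes and use it as the index
dataframe['date'] = pd.to_datetime(dataframe['date'])
dataframe = dataframe.set_index('date')
```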
Using resample we can group the rows by a wide array of time periods
(offsets) and then we can calculate some statistic on each time group:
```python
# Group by two weeks, calculate mean
dataframe.resample('2W').mean()
```

|   | Sale_Amount |
|---|---|
| 2017-06-11 | 5.001331 |
| 2017-06-25 | 5.007738 |
| 2017-07-09 | 4.993353 |
| 2017-07-23 | 4.950481 |
```python
# Group by month, count rows
dataframe.resample('M').count()
```

|   | Sale_Amount |
|---|---|
| 2017-06-30 | 72000 |
| 2017-07-31 | 28000 |
You might notice that in the two outputs the datetime index is a date despite the fact that we are grouping by weeks and months, respectively. The reason is because by default resample returns the
label of the right “edge” (the last label) of the time group. We can control this behavior using the label parameter:
```python
# Group by month, count rows
dataframe.resample('M', label='left').count()
```

|   | Sale_Amount |
|---|---|
| 2017-05-31 | 72000 |
| 2017-06-30 | 28000 |
You can treat a pandas column like any other sequence in Python:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Print first two names uppercased
for name in dataframe['Name'][0:2]:
    print(name.upper())
```

```
ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE
```
In addition to loops (often called for loops), we can also use list
comprehensions:
```python
# Show first two names uppercased
[name.upper() for name in dataframe['Name'][0:2]]
```

```
['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']
```
Despite the temptation to fall back on for loops, a more Pythonic
solution would use pandas’ apply method, described in the next recipe.
Use apply to apply a built-in or custom function on every element in a column:
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Create function
def uppercase(x):
    return x.upper()

# Apply function, show two rows
dataframe['Name'].apply(uppercase)[0:2]
```

```
0    ALLEN, MISS ELISABETH WALTON
1    ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object
```
apply is a great way to do data cleaning and wrangling. It is common
to write a function to perform some useful operation (separate
first and last names, convert strings to floats, etc.) and then map
that function to every element in a column.
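For instance, a small sketch of that idea, pulling each passenger's surname out of the Name column (output omitted):

```python
# Apply a custom function that keeps only the text before the comma
dataframe['Name'].apply(lambda name: name.split(',')[0])[0:2]
```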
```python
# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load data
dataframe = pd.read_csv(url)

# Group rows, apply function to groups
dataframe.groupby('Sex').apply(lambda x: x.count())
```

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| Sex | ||||||
| female | 462 | 462 | 288 | 462 | 462 | 462 |
| male | 851 | 851 | 468 | 851 | 851 | 851 |
In Recipe 3.16 I mentioned apply. apply is particularly useful when you want to apply a function to groups. By combining groupby and apply we can calculate custom statistics or apply any function to each group separately.
Use concat with axis=0 to concatenate along the row axis:
```python
# Load library
import pandas as pd

# Create DataFrame
data_a = {'id': ['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Ackerman', 'Ali']}
dataframe_a = pd.DataFrame(data_a, columns=['id', 'first', 'last'])

# Create DataFrame
data_b = {'id': ['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}
dataframe_b = pd.DataFrame(data_b, columns=['id', 'first', 'last'])

# Concatenate DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)
```

|   | id | first | last |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
You can use axis=1 to concatenate along the column axis:
```python
# Concatenate DataFrames by columns
pd.concat([dataframe_a, dataframe_b], axis=1)
```

|   | id | first | last | id | first | last |
|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
Concatenating is not a word you hear much outside of computer science
and programming, so if you have not heard it before, do not worry.
The informal definition of concatenate is to glue two objects together. In the solution we glued together two small DataFrames using the axis parameter to indicate whether we wanted to stack the two DataFrames on top of each other or place them side by side.
Alternatively we can use append to add a new row to a DataFrame:
```python
# Create row
row = pd.Series([10, 'Chris', 'Chillon'], index=['id', 'first', 'last'])

# Append row
dataframe_a.append(row, ignore_index=True)
```

|   | id | first | last |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 10 | Chris | Chillon |
To inner join, use merge with the on parameter to specify the column to merge on:
```python
# Load library
import pandas as pd

# Create DataFrame
employee_data = {'employee_id': ['1', '2', '3', '4'],
                 'name': ['Amy Jones', 'Allen Keys', 'Alice Bees', 'Tim Horton']}
dataframe_employees = pd.DataFrame(employee_data, columns=['employee_id', 'name'])

# Create DataFrame
sales_data = {'employee_id': ['3', '4', '5', '6'],
              'total_sales': [23456, 2512, 2345, 1455]}
dataframe_sales = pd.DataFrame(sales_data, columns=['employee_id', 'total_sales'])

# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id')
```

|   | employee_id | name | total_sales |
|---|---|---|---|
| 0 | 3 | Alice Bees | 23456 |
| 1 | 4 | Tim Horton | 2512 |
merge defaults to inner joins. If we want to do an outer join, we can
specify that with the how parameter:
```python
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer')
```

|   | employee_id | name | total_sales |
|---|---|---|---|
| 0 | 1 | Amy Jones | NaN |
| 1 | 2 | Allen Keys | NaN |
| 2 | 3 | Alice Bees | 23456.0 |
| 3 | 4 | Tim Horton | 2512.0 |
| 4 | 5 | NaN | 2345.0 |
| 5 | 6 | NaN | 1455.0 |
The same parameter can be used to specify left and right joins:
```python
# Merge DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='left')
```

|   | employee_id | name | total_sales |
|---|---|---|---|
| 0 | 1 | Amy Jones | NaN |
| 1 | 2 | Allen Keys | NaN |
| 2 | 3 | Alice Bees | 23456.0 |
| 3 | 4 | Tim Horton | 2512.0 |
We can also specify the column name in each DataFrame to merge on:
```python
# Merge DataFrames
pd.merge(dataframe_employees,
         dataframe_sales,
         left_on='employee_id',
         right_on='employee_id')
```

|   | employee_id | name | total_sales |
|---|---|---|---|
| 0 | 3 | Alice Bees | 23456 |
| 1 | 4 | Tim Horton | 2512 |
If instead of merging on two columns we want to merge on the indexes of
each DataFrame, we can replace the left_on and right_on parameters
with right_index=True and left_index=True.
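A sketch of merging on the indexes of both DataFrames (here both happen to use the default integer index) would look like this:

```python
# Merge on the index of each DataFrame instead of a column
pd.merge(dataframe_employees, dataframe_sales, left_index=True, right_index=True)
```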
Oftentimes, the data we need to use is complex; it doesn’t always come in one piece. Instead in the real world, we’re usually faced with disparate datasets, from multiple database queries or files. To get all that data into one place, we can load each data query or data file into pandas as individual DataFrames and then merge them together into a single DataFrame.
This process might be familiar to anyone who has used SQL, a popular language for doing merging operations (called joins). While the exact parameters used by pandas will be different, they follow the same general patterns used by other software languages and tools.
There are three aspects to specify with any merge operation. First, we have to specify the two DataFrames we want to merge together. In the
solution we named them dataframe_employees and dataframe_sales. Second, we have to specify the name(s) of the columns to merge on—that is, the columns whose values are shared between the two DataFrames. For example, in our solution both DataFrames have a column named employee_id. To merge the two DataFrames we will match up the values in each DataFrame’s employee_id column with each other. If these two columns use the same name, we can use the on parameter. However, if they have different names we can use left_on and right_on.
What is the left and right DataFrame? The simple answer is that the left
DataFrame is the first one we specified in merge and the right
DataFrame is the second one. This language comes up again in the next
sets of parameters we will need.
The last aspect, and most difficult for some people to grasp, is the
type of merge operation we want to conduct. This is specified by the
how parameter. merge supports the four main types of joins:
- Inner: Return only the rows that match in both DataFrames (e.g., return any row with an employee_id value appearing in both dataframe_employees and dataframe_sales).
- Outer: Return all rows in both DataFrames. If a row exists in one DataFrame but not in the other DataFrame, fill NaN values for the missing values (e.g., return all rows in both dataframe_employees and dataframe_sales).
- Left: Return all rows from the left DataFrame but only rows from the right DataFrame that matched with the left DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_employees but only rows from dataframe_sales that have a value for employee_id that appears in dataframe_employees).
- Right: Return all rows from the right DataFrame but only rows from the left DataFrame that matched with the right DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_sales but only rows from dataframe_employees that have a value for employee_id that appears in dataframe_sales).
If you did not understand all of that right now, I encourage you to play around with the how parameter in your code and see how it affects what merge returns.
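For example, a right join on the same DataFrames from the solution would look like this (a sketch; only employees that also appear in dataframe_sales keep their names, and the unmatched sales rows get NaN names):

```python
# Merge DataFrames, keeping all rows from the right DataFrame
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='right')
```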