Dates and times (datetimes) are frequently encountered during preprocessing for machine learning, whether the time of a particular sale or the year of some public health statistic. In this chapter, we will build a toolbox of strategies for handling time series data including tackling time zones and creating lagged time features. Specifically, we will focus on the time series tools in the pandas library, which centralizes the functionality of many other libraries.
Use pandas’ to_datetime with the format of the date and/or time
specified in the format parameter:
```python
# Load libraries
import numpy as np
import pandas as pd

# Create strings
date_strings = np.array(['03-04-2005 11:35 PM',
                         '23-05-2010 12:01 AM',
                         '04-09-2009 09:09 PM'])

# Convert to datetimes
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]
```
[Timestamp('2005-04-03 23:35:00'),
Timestamp('2010-05-23 00:01:00'),
Timestamp('2009-09-04 21:09:00')]
We might also want to add an argument to the errors parameter to
handle problems:
```python
# Convert to datetimes
[pd.to_datetime(date, format="%d-%m-%Y %I:%M %p", errors="coerce")
 for date in date_strings]
```
[Timestamp('2005-04-03 23:35:00'),
Timestamp('2010-05-23 00:01:00'),
Timestamp('2009-09-04 21:09:00')]
If errors="coerce", then any problem that occurs will not raise an
error (the default behavior) but instead will set the value causing the
error to NaT (i.e., a missing value).
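For example, a minimal sketch (using a hypothetical, deliberately malformed string) of how coerce behaves:

```python
# Attempt to convert an invalid date string (hypothetical example)
pd.to_datetime("31-31-2005 11:35 PM", format="%d-%m-%Y %I:%M %p", errors="coerce")
```

Because there is no month 31, the result is NaT rather than an exception.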
When dates and times come as strings, we need to convert them into a
data type Python can understand. While there are a number of Python
tools for converting strings to datetimes, following our use of pandas
in other recipes we can use to_datetime to conduct the transformation. One obstacle to strings representing dates and times is that the format of the strings can vary significantly between data sources. For example, one vector of dates might represent March 23rd, 2015 as “03-23-15” while another might use “3|23|2015”. We can use the format parameter to specify the exact format of the string. Here are some common date and time formatting codes:
| Code | Description | Example |
|---|---|---|
| %Y | Full year | 2001 |
| %m | Month w/ zero padding | 04 |
| %d | Day of the month w/ zero padding | 09 |
| %I | Hour (12hr clock) w/ zero padding | 02 |
| %p | AM or PM | AM |
| %M | Minute w/ zero padding | 05 |
| %S | Second w/ zero padding | 09 |
If not specified, pandas objects have no time zone. However, we can add
a time zone using tz during creation:
```python
# Load library
import pandas as pd

# Create datetime
pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')
```
Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')
We can add a time zone to a previously created datetime using
tz_localize:
```python
# Create datetime
date = pd.Timestamp('2017-05-01 06:00:00')

# Set time zone
date_in_london = date.tz_localize('Europe/London')

# Show datetime
date_in_london
```
Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')
We can also convert to a different time zone:
```python
# Change time zone
date_in_london.tz_convert('Africa/Abidjan')
```
Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')
Finally, pandas’ Series objects can apply tz_localize
and tz_convert to every element:
```python
# Create three dates
dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))

# Set time zone
dates.dt.tz_localize('Africa/Abidjan')
```
0   2002-02-28 00:00:00+00:00
1   2002-03-31 00:00:00+00:00
2   2002-04-30 00:00:00+00:00
dtype: datetime64[ns, Africa/Abidjan]
pandas supports two sets of strings representing time zones; however, I suggest using the pytz library’s strings. We can see all the strings used to represent time zones by importing all_timezones:
```python
# Load library
from pytz import all_timezones

# Show two time zones
all_timezones[0:2]
```
['Africa/Abidjan', 'Africa/Accra']
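As a side note, tz_localize and tz_convert should also accept pytz timezone objects directly rather than strings; a brief sketch:

```python
# Load library
from pytz import timezone

# Localize using a pytz timezone object instead of a string
pd.Timestamp('2017-05-01 06:00:00').tz_localize(timezone('Europe/London'))
```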
Use two boolean conditions as the start and end dates:
```python
# Load library
import pandas as pd

# Create data frame
dataframe = pd.DataFrame()

# Create datetimes
dataframe['date'] = pd.date_range('1/1/2001', periods=100000, freq='H')

# Select observations between two datetimes
dataframe[(dataframe['date'] > '2002-1-1 01:00:00') &
          (dataframe['date'] <= '2002-1-1 04:00:00')]
```
| | date |
|---|---|
| 8762 | 2002-01-01 02:00:00 |
| 8763 | 2002-01-01 03:00:00 |
| 8764 | 2002-01-01 04:00:00 |
Alternatively, we can set the date column as the DataFrame’s index and
then slice using loc:
```python
# Set index
dataframe = dataframe.set_index(dataframe['date'])

# Select observations between two datetimes
dataframe.loc['2002-1-1 01:00:00':'2002-1-1 04:00:00']
```
| | date |
|---|---|
| date | |
| 2002-01-01 01:00:00 | 2002-01-01 01:00:00 |
| 2002-01-01 02:00:00 | 2002-01-01 02:00:00 |
| 2002-01-01 03:00:00 | 2002-01-01 03:00:00 |
| 2002-01-01 04:00:00 | 2002-01-01 04:00:00 |
Whether we use boolean conditions or index slicing is situation dependent. If we wanted to do some complex time series manipulation, it might be worth the overhead of setting the date column as the index of the DataFrame, but if we wanted to do some simple data wrangling, the boolean conditions might be easier.
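A third option, not covered in this recipe, is the Series between method, which keeps the boolean style without chained comparisons; a sketch assuming the same dataframe:

```python
# Select observations between two datetimes using between
# Note: between is inclusive on both ends by default, so unlike the
# first solution this also keeps the 01:00:00 observation
dataframe[dataframe['date'].between('2002-1-1 01:00:00', '2002-1-1 04:00:00')]
```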
Use pandas Series.dt’s time properties:
```python
# Load library
import pandas as pd

# Create data frame
dataframe = pd.DataFrame()

# Create 150 weekly dates
dataframe['date'] = pd.date_range('1/1/2001', periods=150, freq='W')

# Create features for year, month, day, hour, and minute
dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute

# Show three rows
dataframe.head(3)
```
| | date | year | month | day | hour | minute |
|---|---|---|---|---|---|---|
| 0 | 2001-01-07 | 2001 | 1 | 7 | 0 | 0 |
| 1 | 2001-01-14 | 2001 | 1 | 14 | 0 | 0 |
| 2 | 2001-01-21 | 2001 | 1 | 21 | 0 | 0 |
Subtract the two date features using pandas:
```python
# Load library
import pandas as pd

# Create data frame
dataframe = pd.DataFrame()

# Create two datetime features
dataframe['Arrived'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-04-2017')]
dataframe['Left'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-06-2017')]

# Calculate duration between features
dataframe['Left'] - dataframe['Arrived']
```
0   0 days
1   2 days
dtype: timedelta64[ns]
Often we will want to remove the days output and keep only the
numerical value:
```python
# Calculate duration between features
pd.Series(delta.days for delta in (dataframe['Left'] - dataframe['Arrived']))
```
0    0
1    2
dtype: int64
There are times when the feature we want is the change (delta) between
two points in time. For example, we might have the dates a customer
checks in and checks out of a hotel, but the feature we want is the
duration of their stay. pandas makes this calculation easy using its
Timedelta data type.
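As far as I’m aware, the same integer output can also be obtained with the dt.days accessor on the timedelta Series, avoiding the comprehension:

```python
# Calculate duration in days using the dt accessor
(dataframe['Left'] - dataframe['Arrived']).dt.days
```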
Use the pandas Series.dt method day_name (in pandas versions prior to 1.0 this was available as the weekday_name property):
```python
# Load library
import pandas as pd

# Create dates
dates = pd.Series(pd.date_range("2/2/2002", periods=3, freq="M"))

# Show days of the week
dates.dt.day_name()
```
0    Thursday
1      Sunday
2     Tuesday
dtype: object
If we want the output to be a numerical value and therefore more
usable as a machine learning feature, we can use weekday where the
days of the week are represented as an integer (Monday is 0):
```python
# Show days of the week
dates.dt.weekday
```
0    3
1    6
2    1
dtype: int64
Knowing the weekday can be helpful if, for instance, we wanted to compare total sales on Sundays for the past three years. pandas makes creating a feature vector containing weekday information easy.
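For example, a rough sketch of that Sunday comparison (sales_df and its columns are hypothetical names, with placeholder values):

```python
# Hypothetical data frame of roughly three years of daily sales
sales_df = pd.DataFrame()
sales_df["date"] = pd.date_range("1/1/2020", periods=1096, freq="D")
sales_df["sales"] = range(1096)  # placeholder values

# Total sales by day of the week (Monday is 0, Sunday is 6)
sales_by_weekday = sales_df.groupby(sales_df["date"].dt.weekday)["sales"].sum()

# Total sales on Sundays
sales_by_weekday[6]
```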
```python
# Load library
import pandas as pd

# Create data frame
dataframe = pd.DataFrame()

# Create data
dataframe["dates"] = pd.date_range("1/1/2001", periods=5, freq="D")
dataframe["stock_price"] = [1.1, 2.2, 3.3, 4.4, 5.5]

# Lagged values by one row
dataframe["previous_days_stock_price"] = dataframe["stock_price"].shift(1)

# Show data frame
dataframe
```
| | dates | stock_price | previous_days_stock_price |
|---|---|---|---|
| 0 | 2001-01-01 | 1.1 | NaN |
| 1 | 2001-01-02 | 2.2 | 1.1 |
| 2 | 2001-01-03 | 3.3 | 2.2 |
| 3 | 2001-01-04 | 4.4 | 3.3 |
| 4 | 2001-01-05 | 5.5 | 4.4 |
Very often data is based on regularly spaced time periods (e.g., every
day, every hour, every three hours) and we are interested in using values in the past to make predictions (this is often called lagging a feature). For example, we might want to predict a stock’s price using
the price it was the day before. With pandas we can use shift to lag values by one row, creating a new feature containing past values.
In our solution, the first row for previous_days_stock_price is a
missing value because there is no previous stock_price value.
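shift is not limited to a single row; a lag of two rows, or even a lead (a negative shift), follows the same pattern. A quick sketch with hypothetical column names:

```python
# Lag values by two rows
dataframe["price_two_days_ago"] = dataframe["stock_price"].shift(2)

# Lead values by one row (the next day's price)
dataframe["next_days_stock_price"] = dataframe["stock_price"].shift(-1)
```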
```python
# Load library
import pandas as pd

# Create datetimes
time_index = pd.date_range("01/01/2010", periods=5, freq="M")

# Create data frame, set index
dataframe = pd.DataFrame(index=time_index)

# Create feature
dataframe["Stock_Price"] = [1, 2, 3, 4, 5]

# Calculate rolling mean
dataframe.rolling(window=2).mean()
```
| | Stock_Price |
|---|---|
| 2010-01-31 | NaN |
| 2010-02-28 | 1.5 |
| 2010-03-31 | 2.5 |
| 2010-04-30 | 3.5 |
| 2010-05-31 | 4.5 |
Rolling (also called moving) time windows are conceptually simple but can be difficult to understand at first. Imagine we have monthly observations for a stock’s price. It is often useful to have a time window of a certain number of months and then move over the observations calculating a statistic for all observations in the time window.
For example, if we have a time window of three months and we want a rolling mean, we would calculate:
mean(January, February, March)
mean(February, March, April)
mean(March, April, May)
etc.
Another way to put it: our three-month time window “walks” over the observations, calculating the window’s mean at each step.
pandas’ rolling allows us to specify the size of the window using
window and then quickly calculate some common statistics, including
the max value (max()), mean value (mean()), count of values
(count()), and rolling correlation (corr()).
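For instance, a sketch of one of those statistics on the same data frame, this time with a three-observation window:

```python
# Calculate rolling max with a window of three observations
# (the first two rows are NaN because the window is not yet full)
dataframe.rolling(window=3).max()
```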
Rolling means are often used to smooth out time series data because using the mean of the entire time window dampens the effect of short-term fluctuations.
In addition to the missing data strategies previously discussed, when we have time series data we can use interpolation to fill in gaps caused by missing values:
```python
# Load libraries
import pandas as pd
import numpy as np

# Create date
time_index = pd.date_range("01/01/2010", periods=5, freq="M")

# Create data frame, set index
dataframe = pd.DataFrame(index=time_index)

# Create feature with a gap of missing values
dataframe["Sales"] = [1.0, 2.0, np.nan, np.nan, 5.0]

# Interpolate missing values
dataframe.interpolate()
```
| | Sales |
|---|---|
| 2010-01-31 | 1.0 |
| 2010-02-28 | 2.0 |
| 2010-03-31 | 3.0 |
| 2010-04-30 | 4.0 |
| 2010-05-31 | 5.0 |
Alternatively, we can replace missing values with the last known value (i.e., forward-filling):
```python
# Forward-fill
dataframe.ffill()
```
| | Sales |
|---|---|
| 2010-01-31 | 1.0 |
| 2010-02-28 | 2.0 |
| 2010-03-31 | 2.0 |
| 2010-04-30 | 2.0 |
| 2010-05-31 | 5.0 |
We can also replace missing values with the latest known value (i.e., back-filling):
```python
# Back-fill
dataframe.bfill()
```
| | Sales |
|---|---|
| 2010-01-31 | 1.0 |
| 2010-02-28 | 2.0 |
| 2010-03-31 | 5.0 |
| 2010-04-30 | 5.0 |
| 2010-05-31 | 5.0 |
Interpolation is a technique for filling in gaps caused by missing values by, in effect, drawing a line or curve between the known values bordering the gap and using that line or curve to predict reasonable values. Interpolation can be particularly useful when the time intervals between known values are constant, the data is not prone to noisy
fluctuations, and the gaps caused by missing values are small. For
example, in our solution a gap of two missing values was bordered by
2.0 and 5.0. By fitting a line starting at 2.0 and ending at
5.0, we can make reasonable guesses for the two missing values in between of 3.0 and 4.0.
If we believe the line between the two known points is nonlinear, we
can use interpolate’s method to specify the interpolation method:
```python
# Interpolate missing values
dataframe.interpolate(method="quadratic")
```
| | Sales |
|---|---|
| 2010-01-31 | 1.000000 |
| 2010-02-28 | 2.000000 |
| 2010-03-31 | 3.059808 |
| 2010-04-30 | 4.038069 |
| 2010-05-31 | 5.000000 |
Finally, there might be cases when we have large gaps of missing values
and do not want to interpolate values across the entire gap. In these
cases we can use limit to restrict the number of interpolated values
and limit_direction to set whether to interpolate values forward from
the last known value before the gap or vice versa:
```python
# Interpolate missing values
dataframe.interpolate(limit=1, limit_direction="forward")
```
| | Sales |
|---|---|
| 2010-01-31 | 1.0 |
| 2010-02-28 | 2.0 |
| 2010-03-31 | 3.0 |
| 2010-04-30 | NaN |
| 2010-05-31 | 5.0 |
Back-filling and forward-filling can be thought of as a form of naive interpolation, where we draw a flat line from a known value and use it to fill in missing values. One (minor) advantage back- and forward-filling have over interpolation is that they do not require known values on both sides of the missing value(s).