While traditional programming languages use loops, R has traditionally
encouraged using vectorized operations and the apply family of
functions to crunch data in batches, greatly streamlining the
calculations. There is nothing to prevent you from writing loops in R
that break your data into whatever chunks you want and then do an
operation on each chunk. However, using vectorized functions can, in many
cases, increase speed, readability, and maintainability of your code.
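To make the contrast concrete, here is a minimal sketch that does the same conversion with a loop and with a vectorized expression. The temperature data and variable names are hypothetical, chosen only for illustration.

```r
# Hypothetical data: five temperatures in Celsius
celsius <- c(-5, 0, 12.5, 30, 100)

# Loop version: preallocate the result, then fill one element at a time
fahrenheit_loop <- numeric(length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit_loop[i] <- celsius[i] * 9 / 5 + 32
}

# Vectorized version: the arithmetic applies to the whole vector at once
fahrenheit_vec <- celsius * 9 / 5 + 32

# Both approaches give the same answer
all.equal(fahrenheit_loop, fahrenheit_vec)
```

The vectorized form is one line and has no index bookkeeping, which is where most of the readability gain comes from.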
More recently, the Tidyverse, specifically the purrr and
dplyr packages, has introduced new idioms into R that make these
concepts easier to learn and slightly more consistent. The name purrr
comes from a play on the phrase “Pure R.” A “pure function” is a
function where the result of the function is only determined by its
inputs, and which does not produce any side effects. This is a
functional programming concept which you need not understand in order to
get great value from purrr. All most users need to know is that purrr
contains functions to help us operate “chunk by chunk” on our data in a
way that meshes well with other Tidyverse packages such as dplyr.
Base R has many apply functions: apply, lapply, sapply, tapply,
mapply; and their cousins, by and split. These are solid functions
that have been workhorses in Base R for years. The authors have
struggled a bit with how much to focus on the Base R apply functions and
how much to focus on the newer “tidy” approach. After much debate we’ve
chosen to try and illustrate the purrr approach and to acknowledge
Base R approaches and, in a few places, to illustrate both. The
interface to purrr and dplyr is very clean and, we believe, in most
cases, more intuitive.
You have a list, and you want to apply a function to each element of the list.
We can use map to apply the function to every element of a list:
library(tidyverse)

lst %>% map(fun)
Let’s look at a specific example of taking the average of all the numbers in each element of a list:
library(tidyverse)

lst <- list(
  a = c(1, 2, 3),
  b = c(4, 5, 6)
)
lst %>% map(mean)
#> $a
#> [1] 2
#>
#> $b
#> [1] 5
These functions will call your function once for every element in your
list. Your function should expect one argument, an element from the
list. The map functions will collect the returned values and return
them in a list.
The purrr package contains a whole family of map functions that take
a list or a vector then return an object with the same number of
elements as the input. The type of object they return varies based on
which map function is used. See the help file for map for a complete
list, but a few of the most common are as follows:
map() : always returns a list, and the elements of the list may be of
different types. This is quite similar to the Base R function lapply.
map_chr() : returns a character vector
map_int() : returns an integer vector
map_dbl() : returns a floating point numeric vector
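As a quick sketch of how the typed variants pair with functions that return a known type, the example below applies three base R functions to a small list. The list `scores` is hypothetical, made up for illustration.

```r
library(purrr)

# Hypothetical list of numeric vectors of different lengths
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6, 7), c = 10)

map_int(scores, length)  # length() returns an integer, so map_int fits
map_dbl(scores, mean)    # mean() returns a double
map_chr(scores, class)   # class() returns a single string for these vectors
```

Each call returns a named atomic vector with one element per list element, rather than a list.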
Let’s take a quick look at a contrived situation where we have a function that could result in a character or an integer result:
fun <- function(x) {
  if (x > 1) {
    1
  } else {
    "Less Than 1"
  }
}
fun(5)
#> [1] 1
fun(0.5)
#> [1] "Less Than 1"
Let’s create a list of elements which we can map fun to and look at
how some of the map variants behave:
lst <- list(.5, 1.5, .9, 2)
map(lst, fun)
#> [[1]]
#> [1] "Less Than 1"
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] "Less Than 1"
#>
#> [[4]]
#> [1] 1
You can see that map produced a list and it is of mixed data types.
And map_chr will produce a character vector and coerce the numbers
into characters.
map_chr(lst, fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"

## or using pipes
lst %>% map_chr(fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"
map_dbl, however, will try to coerce a character string into a double and
die trying:
map_dbl(lst, fun)
#> Error: Can't coerce element 1 from a character to a double
As mentioned above, the Base R lapply function acts very much like
map. The Base R sapply function is more like the other map functions
mentioned above in that the function tries to simplify the results into
a vector or matrix.
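The difference between the two Base R functions can be seen in a short sketch. The list reuses the values from the earlier example, and the variable names `as_list` and `as_vector` are chosen here for illustration.

```r
lst <- list(a = c(1, 2, 3), b = c(4, 5, 6))

as_list <- lapply(lst, mean)    # always a list, like map()
as_vector <- sapply(lst, mean)  # simplified to a named numeric vector

is.list(as_list)
is.numeric(as_vector)

# When every result has length 2, sapply simplifies to a matrix instead
sapply(lst, range)
```

Because sapply chooses its return type based on the results, the purrr typed variants are often preferred when you need a predictable output type.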
See Recipe X-X.
You have a function and you want to apply it to every row in a data frame.
The mutate function will create a new variable based on a vector of
values. We can use one of the pmap functions (in this case pmap_dbl)
to operate on every row and return a vector. The pmap functions that
have an underscore (_) following the pmap return data in a vector of
the type described after the _. So pmap_dbl returns a vector of
doubles, while pmap_chr would coerce the output into a vector of
characters.
fun <- function(a, b, c) {
  # calculate the sum of a sequence from a to b by c
  sum(seq(a, b, c))
}

df <- data.frame(
  mn = c(1, 2, 3),
  mx = c(8, 13, 18),
  rng = c(1, 2, 3)
)

df %>%
  mutate(output = pmap_dbl(list(a = mn, b = mx, c = rng), fun))
#>   mn mx rng output
#> 1  1  8   1     36
#> 2  2 13   2     42
#> 3  3 18   3     63
pmap returns a list, so we could use it to map our function to each
data frame row then return the results into a list, if we prefer:
pmap(list(a = df$mn, b = df$mx, c = df$rng), fun)
#> [[1]]
#> [1] 36
#>
#> [[2]]
#> [1] 42
#>
#> [[3]]
#> [1] 63
The pmap family of functions takes a list of inputs and a function,
then applies the function to the corresponding elements of those inputs, one set at a time. In our example
above we wrap list() around the columns we are interested in using in
our function, fun. The list function turns the columns we want to
operate on into a list. Within the same operation we name the columns to
match the names our function is looking for. So we set a = mn for
example. This names the mn column in our data frame to a in the
resulting list, which is one of the inputs our function is expecting.
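The role of the names can be checked with a small sketch. It reuses fun and df from the example above, while `by_name` and `by_position` are names introduced here for illustration. With a named list, the order of the elements does not matter. With an unnamed list, the elements are matched to the arguments by position.

```r
library(purrr)

fun <- function(a, b, c) {
  # calculate the sum of a sequence from a to b by c
  sum(seq(a, b, c))
}

df <- data.frame(mn = c(1, 2, 3), mx = c(8, 13, 18), rng = c(1, 2, 3))

# Named list: elements are matched to fun's arguments by name,
# so the list elements can appear in any order
by_name <- pmap_dbl(list(c = df$rng, a = df$mn, b = df$mx), fun)

# Unnamed list: elements are matched by position, so order matters
by_position <- pmap_dbl(list(df$mn, df$mx, df$rng), fun)

all.equal(by_name, by_position)
```

Naming the list elements is the safer habit, since a reordered unnamed list would silently feed the wrong column to each argument.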
You have a matrix. You want to apply a function to every row, calculating the function result for each row.
Use the apply function. Set the second argument to 1 to indicate
row-by-row application of a function:
results <- apply(mat, 1, fun)  # mat is a matrix, fun is a function
The apply function will call fun once for each row of the matrix,
assemble the returned values into a vector, and then return that vector.
You may notice that we only show the use of the Base R apply function
here while other recipes illustrate purrr alternatives. As of this
writing, matrix operations are out of scope for purrr so we use the
very solid Base R apply function.
Suppose your matrix long is longitudinal data, so each row contains
data for one subject and the columns contain the repeated observations
over time:
long <- matrix(1:15, 3, 5)
long
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    4    7   10   13
#> [2,]    2    5    8   11   14
#> [3,]    3    6    9   12   15
You could calculate the average observation for each subject by applying
the mean function to each row. The result is a vector:
apply(long, 1, mean)
#> [1] 7 8 9
If your matrix has row names, apply uses them to identify the elements
of the resulting vector, which is handy.
rownames(long) <- c("Moe", "Larry", "Curly")
apply(long, 1, mean)
#>   Moe Larry Curly
#>     7     8     9
The function being called should expect one argument, a vector, which
will be one row from the matrix. The function can return a scalar or a
vector. In the vector case, apply assembles the results into a matrix.
The range function returns a vector of two elements, the minimum and
the maximum, so applying it to long produces a matrix:
apply(long, 1, range)
#>      Moe Larry Curly
#> [1,]   1     2     3
#> [2,]  13    14    15
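Note that each row's result ends up as a column of the output. If you would rather have one row per subject, a possible follow-up is to transpose the result. This is a sketch, and the column labels `min` and `max` are added here for readability.

```r
long <- matrix(1:15, 3, 5)
rownames(long) <- c("Moe", "Larry", "Curly")

# Each row's result becomes a column of the output: 2 rows, 3 columns
ranges <- apply(long, 1, range)
dim(ranges)

# Transpose to get one row per subject, then label the columns
ranges_by_subject <- t(ranges)
colnames(ranges_by_subject) <- c("min", "max")
ranges_by_subject
```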
You can employ this recipe on data frames as well. It works if the data frame is homogeneous; that is, either all numbers or all character strings. When the data frame has columns of different types, extracting vectors from the rows isn’t sensible because vectors must be homogeneous.
You have a matrix or data frame, and you want to apply a function to every column.
For a matrix, use the apply function. Set the second argument to 2,
which indicates column-by-column application of the function. So if our
matrix or data frame was named mat and we wanted to apply a function
named fun to every column, it would look like this:
apply(mat, 2, fun)
Let’s look at an example with real numbers and apply the mean function
to every column of a matrix:
mat <- matrix(c(1, 3, 2, 5, 4, 6), 2, 3)
colnames(mat) <- c("t1", "t2", "t3")
mat
#>      t1 t2 t3
#> [1,]  1  2  4
#> [2,]  3  5  6
apply(mat, 2, mean)  # Compute the mean of every column
#>  t1  t2  t3
#> 2.0 3.5 5.0
In Base R, the apply function is intended for processing a matrix or
data frame. The second argument of apply determines the direction:
1 means process row by row.
2 means process column by column.
This is more mnemonic than it looks. We speak of matrices in “rows and columns”, so rows are first and columns second; 1 and 2, respectively.
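A small sketch shows both directions on the same matrix. The matrix `m` is hypothetical, and the closing calls to rowSums and colSums are Base R shortcuts that apply only to sums.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

apply(m, 1, sum)  # 1 = rows: one result per row, so 2 values
apply(m, 2, sum)  # 2 = columns: one result per column, so 3 values

# For sums, base R also provides dedicated functions
rowSums(m)
colSums(m)
```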
A data frame is a more complicated data structure than a matrix, so
there are more options. You can simply use apply, in which case R will
convert your data frame to a matrix and then apply your function. That
will work if your data frame contains only one type of data but will
likely not do what you want if some columns are numeric and some are
character. In that case, R will force all columns to have identical
types, likely performing an unwanted conversion as a result.
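The conversion is easy to see in a sketch with two hypothetical data frames, `nums` and `mixed`, invented for this illustration. In the mixed case, the numeric column has been turned into character strings by the time the function sees it.

```r
# Hypothetical all-numeric data frame: apply behaves as it does for a matrix
nums <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
apply(nums, 2, mean)

# Hypothetical mixed data frame: converting to a matrix turns
# every value, including the numbers in y, into a character string
mixed <- data.frame(id = c("a", "b", "c"), y = c(4, 5, 6))
apply(mixed, 2, class)
```

Calling mean on the columns of `mixed` through apply would therefore not give a numeric answer, even for the `y` column.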
Fortunately, there are multiple alternatives. Recall that a data frame is
a kind of list: it is a list of the columns of the data frame. purrr
has a whole family of map functions that return different types of
objects. Of particular interest here is map_df, which returns a
data frame (hence the df in the name):
df2 <- map_df(df, fun)  # Returns a data.frame
The function fun should expect one argument: a column from the data
frame.
This is a common recipe to check the types of columns in data frames.
The batch column of this data frame, at quick glance, seems to contain
numbers:
load("./data/batches.rdata")
head(batches)
#>   batch clinic dosage shrinkage
#> 1     3     KY     IL    -0.307
#> 2     3     IL     IL    -1.781
#> 3     1     KY     IL    -0.172
#> 4     3     KY     IL     1.215
#> 5     2     IL     IL     1.895
#> 6     2     NJ     IL    -0.430
But printing the classes of the columns reveals batch to be a factor
instead:
map_df(batches, class)
#> # A tibble: 1 x 4
#>   batch  clinic dosage shrinkage
#>   <chr>  <chr>  <chr>  <chr>
#> 1 factor factor factor numeric
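If a named vector is more convenient than a one-row tibble, a typed variant is one possible alternative. The sketch below uses a small hypothetical stand-in called `toy` rather than the batches data. It assumes every column has a single class, since map_chr expects exactly one string per column.

```r
library(purrr)

# Hypothetical stand-in for the batches data
toy <- data.frame(
  batch = factor(c(3, 3, 1)),
  shrinkage = c(-0.307, -1.781, -0.172)
)

map_chr(toy, class)  # named character vector, one entry per column
sapply(toy, class)   # Base R equivalent
```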
You have a function that takes multiple arguments. You want to apply the function element-wise to vectors and obtain a vector result. Unfortunately, the function is not vectorized; that is, it works on scalars but not on vectors.
Use one of the map or pmap functions from the tidyverse core
package purrr. The most general solution is to put your vectors in a
list, then use pmap:
lst <- list(v1, v2, v3)
pmap(lst, fun)
pmap will take the elements of lst and pass them as the inputs to
fun.
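As a concrete sketch of the three-vector case, consider a hypothetical scalar-only function `clamp`, written here purely for illustration. Its if statements work on single values, so pmap_dbl is used to apply it element by element.

```r
library(purrr)

# Hypothetical scalar-only function: restrict x to the interval [lo, hi]
clamp <- function(x, lo, hi) {
  if (x < lo) {
    lo
  } else if (x > hi) {
    hi
  } else {
    x
  }
}

v1 <- c(-2, 5, 12)   # values to clamp
v2 <- c(0, 0, 0)     # lower bounds
v3 <- c(10, 10, 10)  # upper bounds

# The unnamed list is matched to clamp's arguments by position
pmap_dbl(list(v1, v2, v3), clamp)
#> [1]  0  5 10
```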
If you only have two vectors you are passing as inputs to your function,
the map2_* family of functions is convenient and saves you the step of
putting your vectors in a list first. map2 will return a list, while
the typed variants (map2_chr, map2_dbl, etc.) return vectors of the
type their name implies:
map2(v1, v2, fun)
or if fun returns only a double:
map2_dbl(v1, v2, fun)
The type in the name of each typed purrr variant refers to the output type
expected from the function. All the typed variants return vectors of
their respective type while the untyped variants return lists which
allow mixing of types.
The basic operators of R, such as x + y, are vectorized; this means that they compute their result element-by-element and return a vector of results. Also, many R functions are vectorized.
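A few lines are enough to illustrate what element-by-element means. The vectors `x` and `y` are hypothetical.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

x + y       # element-by-element sum
x * y       # element-by-element product
sqrt(x)     # many functions are vectorized too
pmax(x, 2)  # element-wise maximum; the scalar 2 is recycled
```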
Not all functions are vectorized, however, and those that are not
work only on scalars. Using vector arguments produces errors at best and
meaningless results at worst. In such cases, the map functions from
purrr can effectively vectorize the function for you.
Consider the gcd function from Recipe X-X, which takes two arguments:
gcd <- function(a, b) {
  if (b == 0) {
    return(a)
  } else {
    return(gcd(b, a %% b))
  }
}
If we apply gcd to two vectors, the result is wrong answers and a pile
of warning messages:
gcd(c(1, 2, 3), c(9, 6, 3))
#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used
#> [1] 1 2 0
The function is not vectorized, but we can use map to “vectorize” it.
In this case, since we have two inputs we’re mapping over, we should use
the map2 function. This gives the element-wise GCDs between two
vectors.
a <- c(1, 2, 3)
b <- c(9, 6, 3)
my_gcds <- map2(a, b, gcd)
my_gcds
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
Notice that map2 returns a list. If we wanted the output in a
vector, we could use unlist on the result, or use one of the typed
variants:
unlist(my_gcds)
#> [1] 1 2 3
The map family of purrr functions gives you a series of variations
that return specific types of output. The suffixes on the function names
communicate the type of vector that they will return. While map and
map2 return lists, since the type specific variants are returning
objects guaranteed to be the same type, they can be put in atomic
vectors. For example, we could use the map_chr function to ask R to
coerce the results into character output or map2_dbl to ensure the
results are doubles:
map2_chr(a, b, gcd)
#> [1] "1.000000" "2.000000" "3.000000"
map2_dbl(a, b, gcd)
#> [1] 1 2 3
If our data has more than two vectors, or the data is already in a list,
we can use the pmap family of functions which take a list as an input.
lst <- list(a, b)
pmap(lst, gcd)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
Or if we want a typed vector as output:
lst <- list(a, b)
pmap_dbl(lst, gcd)
#> [1] 1 2 3
With the purrr functions, remember that the pmap family are parallel
mappers that take a list of inputs, while the map2 functions take
two, and only two, vectors as inputs.
This is really just a special case of our very first recipe in this
chapter: “Applying a Function to Each List Element”. See that recipe for more discussion of
map variants. In addition, Jenny Bryan has a great collection of
purrr tutorials on her GitHub site:
https://jennybc.github.io/purrr-tutorial/
Your data elements occur in groups. You want to process the data by groups—for example, summing by group or averaging by group.
The easiest way to do grouping is with the dplyr function group_by
in conjunction with summarize. If our data frame is df and we want to
apply the function fun to the variable value_var for every combination
of the grouping variables v1 and v2, we can do
that with group_by:
df %>%
  group_by(v1, v2) %>%
  summarize(result_var = fun(value_var))
Let’s look at a specific example where our input data frame, df,
contains a variable my_group, which we want to group by, and a field
named values, which we would like to calculate some statistics on:
df <- tibble(
  my_group = c("A", "B", "A", "B", "A", "B"),
  values = 1:6
)

df %>%
  group_by(my_group) %>%
  summarize(
    avg_values = mean(values),
    tot_values = sum(values),
    count_values = n()
  )
#> # A tibble: 2 x 4
#>   my_group avg_values tot_values count_values
#>   <chr>         <dbl>      <int>        <int>
#> 1 A                 3          9            3
#> 2 B                 4         12            3
The output has one record per grouping along with calculated values for the three summary fields we defined.
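Grouping by two variables works the same way. The sketch below uses a hypothetical `sales` tibble invented for illustration. It also passes `.groups = "drop"` to summarize (available in dplyr 1.0.0 and later) so that the result comes back ungrouped.

```r
library(dplyr)

# Hypothetical data with two grouping variables
sales <- tibble(
  region = c("East", "East", "East", "West", "West", "West"),
  product = c("A", "A", "B", "A", "B", "B"),
  amount = c(10, 20, 5, 7, 3, 9)
)

by_group <- sales %>%
  group_by(region, product) %>%
  summarize(total = sum(amount), n = n(), .groups = "drop")

by_group
```

The result has one row for each combination of region and product that actually occurs in the data, four rows in this case.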
See this chapter’s “Introduction” for more about grouping factors.