You can get pretty far in R just using vectors. That’s what Chapter 2 is all about. This chapter moves beyond vectors to recipes for matrices, lists, factors, data frames, and tibbles (which are a special case of data frames). If you have preconceptions about data structures, I suggest you put them aside. R does data structures differently than many other languages.
If you want to study the technical aspects of R’s data structures, I suggest reading R in a Nutshell (O’Reilly) and the R Language Definition. The notes here are more informal. These are things we wish we’d known when we started using R.
Here are some key properties of vectors:
Vectors are homogeneous: all elements of a vector must have the same type or, in R terminology, the same mode.
Vectors can be indexed by position, so v[2] refers to the second element of v.
Vectors can be indexed by multiple positions, so v[c(2,3)] is a subvector of v that consists of the second and third elements.
Vector elements can have names. Vectors have a names property, the same length as the vector itself, that gives names to the elements:
v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")
print(v)
#>   Moe Larry Curly
#>    10    20    30
If vector elements have names, you can select them by name. Continuing the previous example:
v["Larry"]
#> Larry
#>    20
Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.
Lists can be indexed by position, so lst[[2]] refers to the second element of lst. Note the double square brackets. Double brackets mean that R returns the element itself, as whatever type of element it is.
Lists let you extract sublists, so lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets. Single brackets mean that R returns the selected items wrapped in a list. If you pull a single element with single brackets, as in lst[2], R returns a list of length 1 whose first item is the desired element.
List elements can have names: both lst[["Moe"]] and lst$Moe refer to the element named “Moe”.
Since lists are heterogeneous and since their elements can be retrieved by name, a list is like a dictionary or hash or lookup table in other programming languages (“Building a Name/Value Association List”). What’s surprising (and cool) is that in R, unlike most of those other programming languages, lists can also be indexed by position.
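To make the dictionary analogy concrete, here is a minimal sketch. The list stooges and its contents are made up for illustration; the point is only that the same element can be reached by name or by position, and that single brackets return a list rather than the element.

```r
# Hypothetical list used only for illustration
stooges <- list(Moe = 10, Larry = "twenty", Curly = c(30, 31))

by_name <- stooges[["Larry"]]   # lookup by name, like a dictionary
by_pos  <- stooges[[2]]         # the same element, by position
sub     <- stooges[2]           # single brackets: a list of length 1

print(by_name)
print(identical(by_name, by_pos))
print(class(sub))
```

Note that sub keeps the name Larry on its single element, because it is still a list.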
In R, every object has a mode, which indicates how it is stored in memory: as a number, as a character string, as a list of pointers to other objects, as a function, and so forth:
| Object | Example | Mode |
|---|---|---|
| Number | 3.1415 | numeric |
| Vector of numbers | c(2.7182, 3.1415) | numeric |
| Character string | "Moe" | character |
| Vector of character strings | | character |
| Factor | | numeric |
| List | list("Moe", "Larry", "Curly") | list |
| Data frame | | list |
| Function | | function |
The mode function gives us this information:
mode(3.1415) # Mode of a number
#> [1] "numeric"
mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
#> [1] "numeric"
mode("Moe") # Mode of a character string
#> [1] "character"
mode(list("Moe", "Larry", "Curly")) # Mode of a list
#> [1] "list"
A critical difference between a vector and a list can be summed up this way:
In a vector, all elements must have the same mode.
In a list, the elements can have different modes.
In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of “numeric” because they are stored as a number; but they could have different classes to indicate their interpretation.
For example, a Date object consists of a single number:
d <- as.Date("2010-03-15")
mode(d)
#> [1] "numeric"
length(d)
#> [1] 1
But it has a class of Date, telling us how to interpret that number;
namely, as the number of days since January 1, 1970:
class(d)
#> [1] "Date"
R uses an object’s class to decide how to process the object. For
example, the generic function print has specialized versions (called
methods) for printing objects according to their class: data.frame,
Date, lm, and so forth. When you print an object, R calls the
appropriate print function according to the object’s class.
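The dispatch described above can be sketched with a made-up class. The class name celsius and the method print.celsius are assumptions for illustration only; they are not part of base R. The sketch shows that the mode stays numeric while the class selects which print method runs.

```r
# A plain number tagged with a made-up class, "celsius" (hypothetical)
temp <- structure(21.5, class = "celsius")

# A print method for that class; R finds it by the name print.<class>
print.celsius <- function(x, ...) {
  cat(format(unclass(x)), "degrees Celsius\n")
  invisible(x)
}

mode(temp)    # still stored as a number
class(temp)   # but interpreted as "celsius"
print(temp)   # dispatches to print.celsius
```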
The quirky thing about scalars is their relationship to vectors. In some software, scalars and vectors are two different things. In R, they are the same thing: a scalar is simply a vector that contains exactly one element. In this book I often use the term “scalar”, but that’s just shorthand for “vector with one element.”
Consider the built-in constant pi. It is a scalar:
pi
#> [1] 3.14
Since a scalar is a one-element vector, you can use vector functions on pi:
length(pi)
#> [1] 1
You can index it. The first (and only) element is π, of course:
pi[1]
#> [1] 3.14
If you ask for the second element, there is none:
pi[2]
#> [1] NA
In R, a matrix is just a vector that has dimensions. It may seem strange at first, but you can transform a vector into a matrix simply by giving it dimensions.
A vector has an attribute called dim, which is initially NULL, as
shown here:
A <- 1:6
dim(A)
#> NULL
print(A)
#> [1] 1 2 3 4 5 6
We give dimensions to the vector when we set its dim attribute. Watch
what happens when we set our vector dimensions to 2 × 3 and print it:
dim(A) <- c(2, 3)
print(A)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
Voilà! The vector was reshaped into a 2 × 3 matrix.
A matrix can be created from a list, too. Like a vector, a list has a
dim attribute, which is initially NULL:
B <- list(1, 2, 3, 4, 5, 6)
dim(B)
#> NULL
If we set the dim attribute, it gives the list a shape:
dim(B) <- c(2, 3)
print(B)
#>      [,1] [,2] [,3]
#> [1,] 1    3    5
#> [2,] 2    4    6
Voilà! We have turned this list into a 2 × 3 matrix.
The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:
D <- 1:12
dim(D) <- c(2, 3, 2)
print(D)
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    7    9   11
#> [2,]    8   10   12
Note that R prints one “slice” of the structure at a time, since it’s not possible to print a 3-dimensional structure on a 2-dimensional medium.
It strikes us as very odd that we can turn a list into a matrix just by
giving the list a dim attribute. But wait; it gets stranger.
Recall that a list can be heterogeneous (mixed modes). We can start with a heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This code snippet creates a matrix that is a mix of numeric and character data:
C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
#>      [,1] [,2] [,3]
#> [1,] 1    3    "Y"
#> [2,] 2    "X"  "Z"
To me this is strange because I ordinarily assume a matrix is purely numeric, not mixed. R is not that restrictive.
The possibility of a heterogeneous matrix may seem powerful and strangely fascinating. However, it creates problems when you are doing normal, day-to-day stuff with matrices. For example, what happens when the matrix C (above) is used in matrix multiplication? What happens if it is converted to a data frame? The answer is that odd things happen.
In this book, I generally ignore the pathological case of a heterogeneous matrix. I assume you’ve got simple, vanilla matrices. Some recipes involving matrices may work oddly (or not at all) if your matrix contains mixed data. Converting such a matrix to a vector or data frame, for instance, can be problematic (“Converting One Structured Data Type into Another”).
A factor looks like a character vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.
There are two key uses for factors:
Categorical variables: a factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.
Grouping: a factor can label or tag your data items according to their group. See the Introduction to Data Transformations.
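As a small sketch of the grouping use, the following tags six made-up measurements with a factor and computes one mean per group with tapply. The vectors heights and group are hypothetical data, not from the text.

```r
# Hypothetical measurements and the group each one belongs to
heights <- c(150, 160, 170, 155, 165, 175)
group   <- factor(c("a", "b", "c", "a", "b", "c"))

levels(group)                                # the distinct groups

# tapply splits heights by the levels of group and applies mean to each part
group_means <- tapply(heights, group, mean)
group_means
```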
A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.
A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:
The elements of the list are vectors and/or factors.1
Those vectors and factors are the columns of the data frame.
The vectors and factors must all have the same length; in other words, all columns must have the same height.
The equal-height columns give a rectangular shape to the data frame.
The columns must have names.
Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:
You can use list operators to extract columns from a data frame, such
as df[i], df[[i]], or df$name.
You can use matrix-like notation, such as df[i,j], df[i,], or
df[,j].
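The two access paradigms can be compared side by side. This is a minimal sketch on a made-up data frame; the column names and values are illustrative only.

```r
# A small made-up data frame (names and values are illustrative only)
df <- data.frame(name = c("Moe", "Larry", "Curly"),
                 age  = c(40, 38, 36))

# List-style access
col_list <- df["age"]      # single brackets: a one-column data frame
col_vec  <- df[["age"]]    # double brackets: the column itself
same_vec <- df$age         # $ is shorthand for [[ ]] with a name

# Matrix-style access
cell <- df[2, "age"]       # row 2, column "age"
row2 <- df[2, ]            # all columns of row 2 (a data frame)
col2 <- df[, 2]            # all rows of column 2 (a vector)
```

The list-style forms follow the same single-bracket versus double-bracket distinction as ordinary lists.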
Your perception of a data frame likely depends on your background:
To a statistician: A data frame is a table of observations. Each row contains one observation. Each observation must contain the same variables. These variables are called columns, and you can refer to them by name. You can also refer to the contents by row number and column number, just as with a matrix.
To a SQL programmer: A data frame is a table. The table resides entirely in memory, but you can save it to a flat file and restore it later. You needn’t declare the column types because R figures that out for you.
To an Excel user: A data frame is like a worksheet, or perhaps a range within a worksheet. It is more restrictive, however, in that each column has a type.
To an SAS user: A data frame is like a SAS dataset for which all the data resides in memory. R can read and write the data frame to disk, but the data frame must be in memory while R is processing it.
To an R programmer: A data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.
To a computer scientist: A data frame is a rectangular data structure. The columns are strongly typed, and each column must be numeric values, character strings, or a factor. Columns must have labels; rows may have labels. The table can be indexed by position, column name, and/or row name. It can also be accessed by list operators, in which case R treats the data frame as a list whose elements are the columns of the data frame.
To an executive: You can put names and numbers into a data frame. It’s easy! A data frame is like a little database. Your staff will enjoy using data frames.
A tibble is a modern reimagining of the data frame, introduced by Hadley Wickham in his Tidyverse packages. Most of the common functions you would use with data frames also work with tibbles. However, tibbles typically do less than data frames and complain more. This idea of complaining and doing less may remind you of your least favorite coworker, but we think tibbles will be one of your favorite data structures. Doing less and complaining more can be a feature, not a bug.
Tibbles differ from data frames in several ways:
Tibbles do not give you row numbers by default.
Tibbles do not coerce column names and surprise you with names different than you expected.
Tibbles don’t coerce your data into factors without you explicitly asking for that.
Tibbles only recycle vectors of length 1.
In addition to basic data frame functionality, tibbles also do this:
Tibbles print only the first 10 rows of a long table, plus a bit of metadata, by default.
Tibbles always return a tibble when subsetting.
Tibbles never do partial matching: if you want a column from a tibble you have to ask for it using its full name.
Tibbles complain more by giving you more warnings and chatty messages to make sure you understand what the software is doing.
All these extras are designed to give you fewer surprises and help you be more productive.
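To see the kind of surprise being described, here is a base R sketch of two data frame behaviors from the lists above: column-name coercion and partial matching. It deliberately uses only base R, so it shows what a plain data frame does rather than what a tibble does; the column names are made up.

```r
# Base R data frame behaviors, shown with made-up column names
df <- data.frame(`my value` = 1:3, other = c("a", "b", "c"))

names(df)   # the space in "my value" was replaced by a dot
df$oth      # partial matching quietly returns the "other" column
```

Passing check.names = FALSE to data.frame keeps the original name, which is closer to the tibble behavior the text describes.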
You want to append additional data items to a vector.
Use the vector constructor (c) to construct a vector with the
additional data items:
v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
v <- c(v, newItems)
v
#> [1] 1 2 3 6 7 8
For a single item, you can also assign the new item to the next vector element. R will automatically extend the vector:
v[length(v) + 1] <- 42
v
#> [1]  1  2  3  6  7  8 42
If you ask us about appending a data item to a vector, we will likely suggest that maybe you shouldn’t.
R works best when you think about entire vectors, not single data items. Are you repeatedly appending items to a vector? If so, then you are probably working inside a loop. That’s OK for small vectors, but for large vectors your program will run slowly. The memory management in R works poorly when you repeatedly extend a vector by one element. Try to replace that loop with vector-level operations. You’ll write less code, and R will run much faster.
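As a sketch of the advice above, the following builds the same vector two ways: by appending inside a loop and by a single vector-level expression. The task (squaring the numbers 1 through n) and the variable names are chosen only for illustration.

```r
n <- 1000

# Element-at-a-time: grows the vector on every pass through the loop
squares_loop <- c()
for (i in 1:n) {
  squares_loop <- c(squares_loop, i^2)
}

# Vector-level: one expression, no growing
squares_vec <- (1:n)^2

identical(squares_loop, squares_vec)
```

Both forms produce the same values; the second simply avoids the repeated extension that the text warns about.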
Nonetheless, one does occasionally need to append data to vectors. Our
experiments show that the most efficient way is to create a new vector
using the vector constructor (c) to join the old and new data. This
works for appending single elements or multiple elements:
v <- c(1, 2, 3)
v <- c(v, 4) # Append a single value to v
v
#> [1] 1 2 3 4
w <- c(5, 6, 7, 8)
v <- c(v, w) # Append an entire vector to v
v
#> [1] 1 2 3 4 5 6 7 8
You can also append an item by assigning it to the position past the end of the vector, as shown in the Solution. In fact, R is very liberal about extending vectors. You can assign to any element and R will expand the vector to accommodate your request:
v <- c(1, 2, 3) # Create a vector of three elements
v[10] <- 10 # Assign to the 10th element
v # R extends the vector automatically
#>  [1]  1  2  3 NA NA NA NA NA NA 10
Note that R did not complain about the out-of-bounds subscript. It just extended the vector to the needed length, filling with NA.
R includes an append function that creates a new vector by appending
items to an existing vector. However, our experiments show that this
function runs more slowly than both the vector constructor and the
element assignment.
You want to insert one or more data items into a vector.
Despite its name, the append function inserts data into a vector by
using the after parameter, which gives the insertion point for the new
item or items:
v
#>  [1]  1  2  3 NA NA NA NA NA NA 10
newvalues <- c(100, 101)
n <- 2
append(v, newvalues, after = n)
#>  [1]   1   2 100 101   3  NA  NA  NA  NA  NA  NA  10
The new items will be inserted at the position given by after. This
example inserts 99 into the middle of a sequence:
append(1:10, 99, after = 5)
#>  [1]  1  2  3  4  5 99  6  7  8  9 10
The special value of after=0 means insert the new items at the head of
the vector:
append(1:10, 99, after = 0)
#>  [1] 99  1  2  3  4  5  6  7  8  9 10
The comments in “Appending Data to a Vector” apply here, too. If you are inserting single items into a vector, you might be working at the element level when working at the vector level would be easier to code and faster to run.
You want to understand the mysterious Recycling Rule that governs how R handles vectors of unequal length.
When you do vector arithmetic, R performs element-by-element operations. That works well when both vectors have the same length: R pairs the elements of the vectors and applies the operation to those pairs.
But what happens when the vectors have unequal lengths?
In that case, R invokes the Recycling Rule. It processes the vector elements in pairs, starting at the first elements of both vectors. At a certain point, the shorter vector is exhausted while the longer vector still has unprocessed elements. R returns to the beginning of the shorter vector, “recycling” its elements; continues taking elements from the longer vector; and completes the operation. It will recycle the shorter-vector elements as often as necessary until the operation is complete.
It’s useful to visualize the Recycling Rule. Here is a diagram of two vectors, 1:6 and 1:3:
1:6    1:3
-----  -----
1      1
2      2
3      3
4
5
6
Obviously, the 1:6 vector is longer than the 1:3 vector. If we try to add the vectors using (1:6) + (1:3), it appears that 1:3 has too few elements. However, R recycles the elements of 1:3, pairing the two vectors like this and producing a six-element vector:
1:6    1:3    (1:6) + (1:3)
-----  -----  ---------------
1      1      2
2      2      4
3      3      6
4      1      5
5      2      7
6      3      9
Here is what you see in the R console:
(1:6) + (1:3)
#> [1] 2 4 6 5 7 9
It’s not only vector operations that invoke the Recycling Rule; functions can, too. The cbind function can create column vectors, such as the following column vectors of 1:6 and 1:3. The two columns have different heights, of course:
cbind(1:6)
cbind(1:3)
If we try binding these column vectors together into a two-column
matrix, the lengths are mismatched. The 1:3 vector is too short, so
cbind invokes the Recycling Rule and recycles the elements of 1:3:
cbind(1:6, 1:3)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    3
#> [4,]    4    1
#> [5,]    5    2
#> [6,]    6    3
If the longer vector’s length is not a multiple of the shorter vector’s length, R gives a warning. That’s good, since the operation is highly suspect and there is likely a bug in your logic:
(1:6) + (1:5) # Oops! 1:5 is one element too short
#> Warning in (1:6) + (1:5): longer object length is not a multiple of shorter
#> object length
#> [1]  2  4  6  8 10  7
Once you understand the Recycling Rule, you will realize that operations between a vector and a scalar are simply applications of that rule. In this example, the 10 is recycled repeatedly until the vector addition is complete:
(1:6) + 10
#> [1] 11 12 13 14 15 16
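One way to convince yourself of what the Recycling Rule does is to recycle the shorter vector by hand and compare. This sketch uses the base function rep_len to stretch 1:3 to the length of 1:6; the variable names are illustrative.

```r
long  <- 1:6
short <- 1:3

# Repeat the short vector until it is as long as the long one
recycled <- rep_len(short, length(long))
recycled

# Adding the hand-recycled vector gives the same result as letting R recycle
identical(long + short, long + recycled)
```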
You have a vector of character strings or integers. You want R to treat them as a factor, which is R’s term for a categorical variable.
The factor function encodes your vector of discrete values into a
factor:
v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v) # v can be a vector of strings or integers
f
#> [1] dog   cat   mouse rat   dog
#> Levels: cat dog mouse rat
str(f)
#>  Factor w/ 4 levels "cat","dog","mouse",..: 2 1 3 4 2
If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor:
v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v, levels = c("dog", "cat", "mouse", "rat", "horse"))
f
#> [1] dog   cat   mouse rat   dog
#> Levels: dog cat mouse rat horse
str(f)
#>  Factor w/ 5 levels "dog","cat","mouse",..: 1 2 3 4 1
In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models.
Most of the time, converting your categorical data into a factor is a
simple matter of calling the factor function, which identifies the
distinct levels of the categorical data and packs them into a factor:
f <- factor(c("Win", "Win", "Lose", "Tie", "Win", "Lose"))
f
#> [1] Win  Win  Lose Tie  Win  Lose
#> Levels: Lose Tie Win
Notice that when we printed the factor, f, R did not put quotes around
the values. They are levels, not strings. Also notice that when we
printed the factor, R also displayed the distinct levels below the
factor.
If your vector contains only a subset of all the possible levels, then R
will have an incomplete picture of the possible levels. Suppose you have
a string-valued variable wday that gives the day of the week on which
your data was observed:
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday)
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Thu Tue Wed
R thinks that Monday, Thursday, Tuesday, and Wednesday are the only
possible levels. Friday is not listed. Apparently, the lab staff never
made observations on Friday, so R does not know that Friday is a
possible value. Hence you need to list the possible levels of wday
explicitly:
f <- factor(wday, c("Mon", "Tue", "Wed", "Thu", "Fri"))
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Tue Wed Thu Fri
Now R understands that f is a factor with five possible levels. It
knows their correct order, too. It originally put Thursday before
Tuesday because it assumes alphabetical order by default.2 The explicit
second argument defines the correct order.
In many situations it is not necessary to call factor explicitly. When
an R function requires a factor, it usually converts your data to a
factor automatically. The table function, for instance, works only on
factors, so it routinely converts its inputs to factors without asking.
You must explicitly create a factor variable when you want to specify
the full set of levels or when you want to control the ordering of
levels.
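The difference between implicit and explicit levels shows up clearly in a frequency table. This sketch reuses the wday values from the example above and compares table on the raw strings with table on a factor whose levels are spelled out.

```r
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")

table(wday)   # implicit conversion: only the observed days appear

f <- factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
counts <- table(f)   # explicit levels: Fri appears, with a count of zero
counts
```

The explicit factor also fixes the order of the table's columns, which otherwise would be alphabetical.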
When creating a data frame using base R functions like data.frame, the default behavior for text fields before R 4.0.0 was to turn them into factors. This has caused grief and consternation for many R users over the years, as we often expect text fields to be imported simply as text, not factors. Tibbles, part of the Tidyverse of tools, on the other hand, never convert text to factors by default.
See Recipe X-X to create a factor from continuous data.
You have several groups of data, with one vector for each group. You want to combine the vectors into one large vector and simultaneously create a parallel factor that identifies each value’s original group.
Create a list that contains the vectors. Use the stack function to
combine the list into a two-column data frame:
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
v3 <- c(7, 8, 9)
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
comb
#>   values ind
#> 1      1  v1
#> 2      2  v1
#> 3      3  v1
#> 4      4  v2
#> 5      5  v2
#> 6      6  v2
#> 7      7  v3
#> 8      8  v3
#> 9      9  v3
The data frame’s columns are called values and ind. The first column
contains the data, and the second column contains the parallel factor.
Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format.
Suppose you survey freshmen, sophomores, and juniors regarding their
confidence level (“What percentage of the time do you feel confident in
school?”). Now you have three vectors, called freshmen, sophomores,
and juniors. You want to perform an ANOVA analysis of the differences
between the groups. The ANOVA function, aov, requires one vector with
the survey results as well as a parallel factor that identifies the
group. You can combine the groups using the stack function:
set.seed(2)
n <- 5
freshmen <- sample(1:5, n, replace = TRUE, prob = c(.6, .2, .1, .05, .05))
sophomores <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .6, .1, .05))
juniors <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .55, .15, .05))
comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))
print(comb)
#>    values   ind
#> 1       1 fresh
#> 2       2 fresh
#> 3       1 fresh
#> 4       1 fresh
#> 5       5 fresh
#> 6       5  soph
#> 7       3  soph
#> 8       4  soph
#> 9       3  soph
#> 10      3  soph
#> 11      2   jrs
#> 12      3   jrs
#> 13      4   jrs
#> 14      3   jrs
#> 15      3   jrs
Now you can perform the ANOVA analysis on the two columns:
aov(values ~ ind, data = comb)
#> Call:
#>    aov(formula = values ~ ind, data = comb)
#>
#> Terms:
#>                   ind Residuals
#> Sum of Squares   6.53     17.20
#> Deg. of Freedom     2        12
#>
#> Residual standard error: 1.2
#> Estimated effects may be unbalanced
When building the list we must provide tags for the list elements (the
tags are fresh, soph, and jrs in this example). Those tags are
required because stack uses them as the levels of the parallel factor.
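To see what stack is doing with the tags, it can help to build the same two columns by hand. This is a sketch, not how stack is implemented: it concatenates two small made-up vectors and repeats each tag once per element.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

# Built by hand: concatenate the data, then repeat each tag per element
values <- c(v1, v2)
ind    <- factor(rep(c("v1", "v2"), times = c(length(v1), length(v2))))
by_hand <- data.frame(values = values, ind = ind)

# Built with stack, using the same tags as list element names
stacked <- stack(list(v1 = v1, v2 = v2))

by_hand
stacked
```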
You want to create and populate a list.
To create a list from individual data items, use the list function:
x <- c("a", "b", "c")
y <- c(1, 2, 3)
z <- "why be normal?"
lst <- list(x, y, z)
lst
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] 1 2 3
#>
#> [[3]]
#> [1] "why be normal?"
Lists can be quite simple, such as this list of three numbers:
lst <- list(0.5, 0.841, 0.977)
lst
#> [[1]]
#> [1] 0.5
#>
#> [[2]]
#> [1] 0.841
#>
#> [[3]]
#> [1] 0.977
When R prints the list, it identifies each list element by its position
([[1]], [[2]], [[3]]) and prints the element’s value (e.g.,
[1] 0.5) under its position.
More usefully, lists can, unlike vectors, contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:
lst <- list(3.14, "Moe", c(1, 1, 2, 3), mean)
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>
You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:
lst <- list()
lst[[1]] <- 3.14
lst[[2]] <- "Moe"
lst[[3]] <- c(1, 1, 2, 3)
lst[[4]] <- mean
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>
List elements can be named. The list function lets you supply a name
for every element:
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977
See the “Introduction” to this chapter for more about lists; see “Building a Name/Value Association List” for more about building and using lists with named elements.
You want to access list elements by position.
Use one of these ways. Here, lst is a list variable:
lst[[n]]: Selects the _n_th element from the list.
lst[c(n1, n2, ..., nk)]: Returns a list of elements, selected by their positions.
Note that the first form returns a single element and the second returns a list.
Suppose we have a list of four integers, called years:
years <- list(1960, 1964, 1976, 1994)
years
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964
#>
#> [[3]]
#> [1] 1976
#>
#> [[4]]
#> [1] 1994
We can access single elements using the double-square-bracket syntax:
years[[1]]
We can extract sublists using the single-square-bracket syntax:
years[c(1, 2)]
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964
This syntax can be confusing because of a subtlety: there is an
important difference between lst[[n]] and lst[n]. They are not the
same thing:
lst[[n]]: This is an element, not a list. It is the _n_th element of lst.
lst[n]: This is a list, not an element. The list contains one element, taken from the _n_th element of lst. This is a special case of lst[c(n1, n2, ..., nk)] in which we eliminated the c(…) construct because there is only one n.
The difference becomes apparent when we inspect the structure of the result—one is a number; the other is a list:
class(years[[1]])
#> [1] "numeric"
class(years[1])
#> [1] "list"
The difference becomes annoyingly apparent when we cat the value.
Recall that cat can print atomic values or vectors but complains about
printing structured objects:
cat(years[[1]], "\n")
#> 1960
cat(years[1], "\n")
#> Error in cat(years[1], "\n"): argument 1 (type 'list') cannot be handled by 'cat'
We got lucky here because R alerted us to the problem. In other contexts, you might work long and hard to figure out that you accessed a sublist when you wanted an element, or vice versa.
You want to access list elements by their names.
Use one of these forms. Here, lst is a list variable:
lst[["name"]]: Selects the element called name. Returns NULL if no element has that name.
lst$name: Same as previous, just different syntax.
lst[c(name1, name2, ..., namek)]: Returns a list built from the indicated elements of lst.
Note that the first two forms return an element whereas the third form returns a list.
Each element of a list can have a name. If named, the element can be selected by its name. This assignment creates a list of four named integers:
years<-list(Kennedy=1960,Johnson=1964,Carter=1976,Clinton=1994)
These next two expressions return the same value—namely, the element that is named “Kennedy”:
years[["Kennedy"]]
#> [1] 1960
years$Kennedy
#> [1] 1960
The following two expressions return sublists extracted from years:
years[c("Kennedy", "Johnson")]
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964
years["Carter"]
#> $Carter
#> [1] 1976
Just as with selecting list elements by position
(“Selecting List Elements by Position”), there is an
important difference between lst[["name"]] and lst["name"]. They are
not the same:
lst[["name"]]: This is an element, not a list.
lst["name"]: This is a list, not an element. This is a special case of lst[c(name1, name2, ..., namek)] in which we don’t need the c(…) construct because there is only one name.
See “Selecting List Elements by Position” to access elements by position rather than by name.
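The NULL result for a missing name is easy to overlook, so here is a short sketch of it. The name "Nixon" is simply a name that is not in the list; the variable names missing_elem and missing_sub are illustrative.

```r
years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)

missing_elem <- years[["Nixon"]]   # no such name: NULL, not an error
is.null(missing_elem)

missing_sub <- years["Nixon"]      # single brackets: a list holding one NULL
length(missing_sub)

"Nixon" %in% names(years)          # an explicit membership test
```

Because a missing name does not raise an error, testing membership with %in% before the lookup can make the intent clearer.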
You want to create a list that associates names and values — as would a dictionary, hash, or lookup table in another programming language.
The list function lets you give names to elements, creating an
association between each name and its value:
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977
If you have parallel vectors of names and values, you can create an empty list and then populate the list by using a vectorized assignment statement:
values <- c(1, 2, 3)
names <- c("a", "b", "c")
lst <- list()
lst[names] <- values
lst
#> $a
#> [1] 1
#>
#> $b
#> [1] 2
#>
#> $c
#> [1] 3
Each element of a list can be named, and you can retrieve list elements by name. This gives you a basic programming tool: the ability to associate names with values.
You can assign element names when you build the list. The list
function allows arguments of the form name=value:
lst <- list(far.left = 0.023, left = 0.159, mid = 0.500, right = 0.841, far.right = 0.977)
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977
One way to name the elements is to create an empty list and then populate it via assignment statements:
lst <- list()
lst$far.left <- 0.023
lst$left <- 0.159
lst$mid <- 0.500
lst$right <- 0.841
lst$far.right <- 0.977
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977
Sometimes you have a vector of names and a vector of corresponding values:
values <- pnorm(-2:2)
names <- c("far.left", "left", "mid", "right", "far.right")
You can associate the names and the values by creating an empty list and then populating it with a vectorized assignment statement:
lst <- list()
lst[names] <- values
Once the association is made, the list can “translate” names into values through a simple list lookup:
cat("The left limit is", lst[["left"]], "\n")
#> The left limit is 0.159
cat("The right limit is", lst[["right"]], "\n")
#> The right limit is 0.841
for (nm in names(lst)) cat("The", nm, "limit is", lst[[nm]], "\n")
#> The far.left limit is 0.0228
#> The left limit is 0.159
#> The mid limit is 0.5
#> The right limit is 0.841
#> The far.right limit is 0.977
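As an aside, the same association can be built in one step by naming the vector and converting it to a list. This is an alternative sketch, not the recipe's method; it uses the base functions setNames and as.list, and the variable nms is used here in place of names to avoid masking the names function.

```r
values <- pnorm(-2:2)
nms <- c("far.left", "left", "mid", "right", "far.right")

# Populate an empty list with a vectorized assignment
lst1 <- list()
lst1[nms] <- values

# One-step alternative: name the vector, then convert it to a list
lst2 <- as.list(setNames(values, nms))

all.equal(lst1, lst2)
lst2[["mid"]]
```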
You want to remove an element from a list.
Assign NULL to the element. R will remove it from the list.
To remove a list element, select it by position or by name, and then
assign NULL to the selected element:
years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)
years
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994
years[["Johnson"]] <- NULL # Remove the element labeled "Johnson"
years
#> $Kennedy
#> [1] 1960
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994
You can remove multiple elements this way, too:
years[c("Carter", "Clinton")] <- NULL # Remove two elements
years
#> $Kennedy
#> [1] 1960
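Two related cases may be worth a quick sketch: removing by position rather than by name, and deliberately storing a NULL without removing the element. The second uses list(NULL) on the right-hand side of a single-bracket assignment. The data is the same made-up years list, shortened.

```r
years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976)

years[[2]] <- NULL              # removes the second element (Johnson)
names(years)

years["Carter"] <- list(NULL)   # keeps the element, but its value is now NULL
length(years)
is.null(years[["Carter"]])
```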
You want to flatten all the elements of a list into a vector.
Use the unlist function.
There are many contexts that require a vector. Basic statistical
functions work on vectors but not on lists, for example. If iq.scores
is a list of numbers, then we cannot directly compute their mean:
iq.scores <- list(rnorm(5, 100, 15))
iq.scores
#> [[1]]
#> [1] 115.8  88.7  78.4  95.7  84.5
mean(iq.scores)
#> Warning in mean.default(iq.scores): argument is not numeric or logical:
#> returning NA
#> [1] NA
Instead, we must flatten the list into a vector using unlist and then
compute the mean of the result:
mean(unlist(iq.scores))
#> [1] 92.6
Here is another example. We can cat scalars and vectors, but we cannot
cat a list:
cat(iq.scores, "\n")
#> Error in cat(iq.scores, "\n"): argument 1 (type 'list') cannot be handled by 'cat'
One solution is to flatten the list into a vector before printing:
cat("IQ Scores:", unlist(iq.scores), "\n")
#> IQ Scores: 116 88.7 78.4 95.7 84.5
Conversions such as this are discussed more fully in “Converting One Structured Data Type into Another”.
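One detail worth seeing in action is what unlist does with names and nesting. The following is a minimal sketch using a made-up nested list (scores is a hypothetical name): unlist flattens recursively and builds compound names from the nesting, and use.names = FALSE drops the names.

```r
# A hypothetical nested list: one number plus a sublist of two numbers.
scores <- list(a = 1, b = list(c = 2, d = 3))

flat <- unlist(scores)
names(flat)   # names are built from the nesting: "a", "b.c", "b.d"
mean(flat)    # now an ordinary numeric vector, so mean works

# Drop the names if they are not needed.
flat_bare <- unlist(scores, use.names = FALSE)
flat_bare
```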
Your list contains NULL values. You want to remove them.
Suppose lst is a list some of whose elements are NULL. This
expression will remove the NULL elements:
lst <- list(1, NULL, 2, 3, NULL, 4)
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 3
#>
#> [[5]]
#> NULL
#>
#> [[6]]
#> [1] 4
lst[sapply(lst, is.null)] <- NULL
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4
Finding and removing NULL elements from a list is surprisingly tricky.
The recipe above was written by one of the authors in a fit of
frustration after trying many other solutions that didn’t work. Here’s
how it works:
R calls sapply to apply the is.null function to every element of
the list.
sapply returns a vector of logical values that are TRUE wherever
the corresponding list element is NULL.
R selects values from the list according to that vector.
R assigns NULL to the selected items, removing them from the list.
The curious reader may be wondering how a list can contain NULL
elements, given that we remove elements by setting them to NULL
(“Removing an Element from a List”). The answer is
that we can create a list containing NULL elements:
lst <- list("Moe", NULL, "Curly") # Create list with NULL element
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] "Curly"
lst[sapply(lst, is.null)] <- NULL # Remove NULL element from list
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> [1] "Curly"
In practice we might end up with NULL items in a list because of the results of a function we wrote to do something else.
See “Removing an Element from a List” for how to remove list elements.
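For comparison, here is a sketch of two alternative spellings of the same idea using base R. These are not the recipe's method; Filter, Negate, and vapply are standard base functions, and the result names cleaned and cleaned2 are hypothetical.

```r
lst <- list("Moe", NULL, "Curly")

# Keep only the elements for which is.null is FALSE.
cleaned <- Filter(Negate(is.null), lst)

# Equivalent subsetting form; vapply guarantees a logical vector result.
is_null <- vapply(lst, is.null, logical(1))
cleaned2 <- lst[!is_null]

length(cleaned)
```

Both forms return a new list and leave lst unchanged, which can be handy when you want to keep the original.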
You want to remove elements from a list according to a conditional test, such as removing elements that are negative or smaller than some threshold.
Build a logical vector based on the condition. Use the vector to select
list elements and then assign NULL to those elements. This assignment,
for example, removes all negative values from lst:
lst <- as.list(rnorm(7))
lst
#> [[1]]
#> [1] -0.0281
#>
#> [[2]]
#> [1] -0.366
#>
#> [[3]]
#> [1] -1.12
#>
#> [[4]]
#> [1] -0.976
#>
#> [[5]]
#> [1] 1.12
#>
#> [[6]]
#> [1] 0.324
#>
#> [[7]]
#> [1] -0.568
lst[lst < 0] <- NULL
lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324
It’s worth noting that in the above example we use as.list instead of
list to create a list from the 7 random values created by rnorm(7).
The reason for this is that as.list will turn each element of a vector
into a list item. On the other hand, list would have given us a list
of length 1 where the first element was a vector containing 7 numbers:
list(rnorm(7))
#> [[1]]
#> [1] -1.034 -0.533 -0.981  0.823 -0.388  0.879 -2.178
This recipe is based on two useful features of R. First, a list can be
indexed by a logical vector. Wherever the vector element is TRUE, the
corresponding list element is selected. Second, you can remove a list
element by assigning NULL to it.
Suppose we want to remove elements from lst whose value is zero. We
construct a logical vector which identifies the unwanted values
(lst == 0). Then we select those elements from the list and assign
NULL to them:
lst[lst==0]<-NULL
This expression will remove NA values from the list:
lst[is.na(lst)]<-NULL
So far, so good. The problems arise when you cannot easily build the
logical vector. That often happens when you want to use a function that
cannot handle a list. Suppose you want to remove list elements whose
absolute value is less than 1. The abs function will not handle a
list, unfortunately:
lst[abs(lst) < 1] <- NULL
#> Error in abs(lst): non-numeric argument to mathematical function
The simplest solution is flattening the list into a vector by calling
unlist and then testing the vector:
lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324
lst[abs(unlist(lst)) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.12
A more elegant solution uses lapply (the list apply function) to apply
the function to every element of the list:
lst <- as.list(rnorm(5))
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 0.885
#>
#> [[3]]
#> [1] 2.29
#>
#> [[4]]
#> [1] 0.554
#>
#> [[5]]
#> [1] 1.21
lst[lapply(lst, abs) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 2.29
#>
#> [[3]]
#> [1] 1.21
Lists can hold complex objects, too, not just atomic values. Suppose
that result_list is a list of linear models created by the lm function.
This expression will remove any model whose R² value is less than
0.70:
x <- 1:10
y1 <- 2 * x + rnorm(10, 0, 1)
y2 <- 3 * x + rnorm(10, 0, 8)
result_list <- list(lm(x ~ y1), lm(x ~ y2))
result_list[sapply(result_list, function(m) summary(m)$r.squared < 0.7)] <- NULL
If we wanted to simply see the R² values for each model, we could do the following:
sapply(result_list, function(m) summary(m)$r.squared)
#> [1] 0.990 0.708
Using sapply (simple apply) will return a vector of results. If we had
used lapply we would have received a list in return:
lapply(result_list, function(m) summary(m)$r.squared)
#> [[1]]
#> [1] 0.99
#>
#> [[2]]
#> [1] 0.708
It’s worth noting that if you face a situation like the one above, you might also explore the package called broom on CRAN. Broom is designed to take output of models and put the results in a tidy format that fits better in a tidy-style workflow.
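The conditional-removal idea in this recipe can also be sketched with vapply, which insists that the test return exactly one logical value per element. The data values below are fixed, made-up numbers so the result is reproducible; keep and big are hypothetical names.

```r
# Fixed sample values instead of rnorm, so the outcome is predictable.
lst <- list(1.47, -0.885, 2.29, 0.554, -1.21)

# Build the logical vector: TRUE for elements we want to keep.
keep <- vapply(lst, function(x) abs(x) >= 1, logical(1))

# Subset with the logical vector; the original list is not modified.
big <- lst[keep]
unlist(big)
```

Selecting the elements to keep is the mirror image of assigning NULL to the elements to drop; which reads better is largely a matter of taste.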
You want to create a matrix and initialize it from given values.
Capture the data in a vector or list, and then use the matrix function
to shape the data into a matrix. This example shapes a vector into a 2 ×
3 matrix (i.e., two rows and three columns):
vec <- 1:6
matrix(vec, 2, 3)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
The first argument of matrix is the data, the second argument is the
number of rows, and the third argument is the number of columns. Observe
that the matrix was filled column by column, not row by row.
It’s common to initialize an entire matrix to one value such as zero or
NA. If the first argument of matrix is a single value, then R will
apply the Recycling Rule and automatically replicate the value to fill
the entire matrix:
matrix(0, 2, 3) # Create an all-zeros matrix
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    0
matrix(NA, 2, 3) # Create a matrix populated with NA
#>      [,1] [,2] [,3]
#> [1,]   NA   NA   NA
#> [2,]   NA   NA   NA
You can create a matrix with a one-liner, of course, but it becomes difficult to read:
mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3
A common idiom in R is typing the data itself in a rectangular shape that reveals the matrix structure:
theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3
)
mat <- matrix(theData, 2, 3, byrow = TRUE)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.2  1.3
#> [2,]  2.1  2.2  2.3
Setting byrow=TRUE tells matrix that the data is row-by-row and not
column-by-column (which is the default). In condensed form, that
becomes:
mat <- matrix(c(1.1, 1.2, 1.3,
                2.1, 2.2, 2.3),
              2, 3, byrow = TRUE)
Expressed this way, the reader quickly sees the two rows and three columns of data.
There is a quick-and-dirty way to turn a vector into a matrix: just assign dimensions to the vector. This was discussed in the “Introduction”. The following example creates a vanilla vector and then shapes it into a 2 × 3 matrix:
v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
dim(v) <- c(2, 3)
v
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3
Personally, I find this more opaque than using matrix, especially
since there is no byrow option here.
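If you do want row-by-row filling with the dim approach, one possible workaround is to assign the transposed dimensions and then transpose. This is a sketch, not something the recipe prescribes; by_row and m are hypothetical names.

```r
v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)

# Reference result: the byrow = TRUE form from the recipe.
by_row <- matrix(v, 2, 3, byrow = TRUE)

# dim() fills column by column, so give it 3 rows and 2 columns:
# each column then holds one intended row.
m <- v
dim(m) <- c(3, 2)

# Transposing turns those columns into rows.
m <- t(m)

all.equal(m, by_row)
```

The extra transpose arguably makes this even more opaque than plain dim, which supports the preference for matrix with byrow=TRUE.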
You want to perform matrix operations such as transpose, matrix inversion, matrix multiplication, or constructing an identity matrix.
t(A): Matrix transposition of A
solve(A): Matrix inverse of A
A %*% B: Matrix multiplication of A and B
diag(n): An n-by-n diagonal (identity) matrix
Recall that A*B is element-wise multiplication whereas A %*% B
is matrix multiplication.
All these functions return a matrix. Their arguments can be either matrices or data frames. If they are data frames then R will first convert them to matrices (although this is useful only if the data frame contains exclusively numeric values).
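A short worked example may help tie these operations together. The matrix A below is made up for illustration; any invertible square matrix would do.

```r
# A small invertible matrix (filled column by column).
A <- matrix(c(2, 1, 1, 3), 2, 2)

t(A)              # transpose (A is symmetric, so this equals A)
A_inv <- solve(A) # matrix inverse
I2 <- diag(2)     # 2-by-2 identity matrix

# A times its inverse should be the identity, up to rounding error.
all.equal(A %*% A_inv, I2)

A * A     # element-wise product: squares each entry
A %*% A   # matrix product: a different result
```

Using all.equal rather than == for the identity check allows for the small floating-point error that matrix inversion usually introduces.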
You want to assign descriptive names to the rows or columns of a matrix.
Every matrix has a rownames attribute and a colnames attribute.
Assign a vector of character strings to the appropriate attribute:
theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3,
  3.1, 3.2, 3.3
)
mat <- matrix(theData, 3, 3, byrow = TRUE)
rownames(mat) <- c("rowname1", "rowname2", "rowname3")
colnames(mat) <- c("colname1", "colname2", "colname3")
mat
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
#> rowname2      2.1      2.2      2.3
#> rowname3      3.1      3.2      3.3
R lets you assign names to the rows and columns of a matrix, which is
useful for printing the matrix. R will display the names if they are
defined, enhancing the readability of your output. Below we use the
quantmod library to pull stock prices for three tech stocks. Then we
calculate daily returns and create a correlation matrix of the daily
returns of Apple, Microsoft, and Google stock. No need to worry about
the details here, unless stocks are your thing. We’re just creating some
real-world data for illustration:
library("quantmod")
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
#>
#> Attaching package: 'xts'
#> The following objects are masked from 'package:dplyr':
#>
#>     first, last
#> Loading required package: TTR
#> Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols(c("AAPL", "MSFT", "GOOG"), auto.assign = TRUE)
#> 'getSymbols' currently uses auto.assign=TRUE by default, but will
#> use auto.assign=FALSE in 0.5-0. You will still be able to use
#> 'loadSymbols' to automatically load data. getOption("getSymbols.env")
#> and getOption("getSymbols.auto.assign") will still be checked for
#> alternate defaults.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
#>
#> WARNING: There have been significant changes to Yahoo Finance data.
#> Please see the Warning section of '?getSymbols.yahoo' for details.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.yahoo.warning"=FALSE).
#> [1] "AAPL" "MSFT" "GOOG"

cor_mat <- cor(cbind(
  periodReturn(AAPL, period = "daily", subset = "2017"),
  periodReturn(MSFT, period = "daily", subset = "2017"),
  periodReturn(GOOG, period = "daily", subset = "2017")
))
cor_mat
#>                 daily.returns daily.returns.1 daily.returns.2
#> daily.returns           1.000           0.438           0.489
#> daily.returns.1         0.438           1.000           0.619
#> daily.returns.2         0.489           0.619           1.000
In this form, the matrix output’s interpretation is not self-evident. The
columns are named daily.returns.X because before we bound the columns
together with cbind they were each named daily.returns. R then
helped us manage the naming clash by appending .1 to the second column
and .2 to the third.
The default naming does not tell us which column came from which stock. So we’ll define names for the rows and columns, then R will annotate the matrix output with the names:
colnames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
rownames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
cor_mat
#>       AAPL  MSFT  GOOG
#> AAPL 1.000 0.438 0.489
#> MSFT 0.438 1.000 0.619
#> GOOG 0.489 0.619 1.000
Now the reader knows at a glance which rows and columns apply to which stocks.
Another advantage of naming rows and columns is that you can refer to matrix elements by those names:
cor_mat["MSFT", "GOOG"] # What is the correlation between MSFT and GOOG?
#> [1] 0.619
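Row and column names can also be set together through the dimnames attribute, which is a list holding the row names first and the column names second. Here is a minimal sketch on a small made-up matrix, so it does not depend on downloading stock data.

```r
mat <- matrix(1:6, 2, 3)

# dimnames is a list: row names first, then column names.
dimnames(mat) <- list(c("r1", "r2"), c("a", "b", "c"))
mat

rownames(mat)    # the same names, read back individually
colnames(mat)

mat["r2", "b"]   # index by name instead of by position
```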
You want to select a single row or a single column from a matrix.
The solution depends on what you want. If you want the result to be a simple vector, just use normal indexing:
mat[1, ] # First row
#> colname1 colname2 colname3
#>      1.1      1.2      1.3
mat[, 3] # Third column
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3
If you want the result to be a one-row matrix or a one-column matrix,
then include the drop=FALSE argument:
mat[1, , drop = FALSE] # First row in a one-row matrix
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
mat[, 3, drop = FALSE] # Third column in a one-column matrix
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3
Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:
mat[1, ]
#> colname1 colname2 colname3
#>      1.1      1.2      1.3
mat[, 3]
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3
When you include the drop=FALSE argument, however, R retains the
dimensions. In that case, selecting a row returns a row vector (a 1 ×
n matrix):
mat[1, , drop = FALSE]
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
Likewise, selecting a column with drop=FALSE returns a column vector
(an n × 1 matrix):
mat[, 3, drop = FALSE]
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3
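One way to confirm what drop=FALSE changes is to inspect the dimensions directly. The sketch below uses a small made-up matrix; row_vec, row_mat, and col_mat are hypothetical names.

```r
mat <- matrix(1:9, 3, 3)

row_vec <- mat[1, ]                 # dimensions are dropped
row_mat <- mat[1, , drop = FALSE]   # dimensions are kept
col_mat <- mat[, 3, drop = FALSE]

dim(row_vec)   # NULL, because a plain vector has no dimensions
dim(row_mat)   # one row, three columns
dim(col_mat)   # three rows, one column
```

Keeping the dimensions matters mostly when the result feeds into matrix code, such as %*%, that cares whether it receives a row or a column.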
Your data is organized by columns, and you want to assemble it into a data frame.
If your data is captured in several vectors and/or factors, use the
data.frame function to assemble them into a data frame:
v1 <- 1:5
v2 <- 6:10
v3 <- c("A", "B", "C", "D", "E")
f1 <- factor(c("a", "a", "a", "b", "b"))
df <- data.frame(v1, v2, v3, f1)
df
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b
If your data is captured in a list that contains vectors and/or
factors, use instead as.data.frame:
list.of.vectors <- list(v1 = v1, v2 = v2, v3 = v3, f1 = f1)
df2 <- as.data.frame(list.of.vectors)
df2
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b
A data frame is a collection of columns, each of which corresponds to an observed variable (in the statistical sense, not the programming sense). If your data is already organized into columns, then it’s easy to build a data frame.
The data.frame function can construct a data frame from vectors, where
each vector is one observed variable. Suppose you have two numeric
predictor variables, one categorical predictor variable, and one
response variable. The data.frame function can create a data frame
from your vectors.
pred1 <- rnorm(10)
pred2 <- rnorm(10, 1, 2)
pred3 <- sample(c("AM", "PM"), 10, replace = TRUE)
resp <- 2.1 + pred1 * .3 + pred2 * .9
df <- data.frame(pred1, pred2, pred3, resp)
df
#>     pred1   pred2 pred3 resp
#> 1  -0.117 -0.0196    AM 2.05
#> 2  -1.133  0.1529    AM 1.90
#> 3   0.632  3.8004    AM 5.71
#> 4   0.188  4.5922    AM 6.29
#> 5   0.892  1.8556    AM 4.04
#> 6  -1.224  2.8140    PM 4.27
#> 7   0.174  0.4908    AM 2.59
#> 8  -0.689 -0.1335    PM 1.77
#> 9   1.204 -0.0482    AM 2.42
#> 10  0.697  2.2268    PM 4.31
Notice that data.frame takes the column names from your program
variables. You can override that default by supplying explicit column
names:
df <- data.frame(p1 = pred1, p2 = pred2, p3 = pred3, r = resp)
head(df, 3)
#>       p1      p2 p3    r
#> 1 -0.117 -0.0196 AM 2.05
#> 2 -1.133  0.1529 AM 1.90
#> 3  0.632  3.8004 AM 5.71
As illustrated above, your data may be organized into vectors but those
vectors are held in a list, not individual program variables. Use the
as.data.frame function to create a data frame from the list of
vectors.
If you’d rather have a tibble (a.k.a. a tidy data frame) instead of a data
frame, then use the function as_tibble instead of data.frame.
However, note that as_tibble is designed to operate on a list, matrix,
data.frame, or table. So we can just wrap our vectors in a list
function before we call as_tibble:
tib <- as_tibble(list(p1 = pred1, p2 = pred2, p3 = pred3, r = resp))
tib
#> # A tibble: 10 x 4
#>       p1      p2 p3        r
#>    <dbl>   <dbl> <chr> <dbl>
#> 1 -0.117 -0.0196 AM     2.05
#> 2 -1.13   0.153  AM     1.90
#> 3  0.632  3.80   AM     5.71
#> 4  0.188  4.59   AM     6.29
#> 5  0.892  1.86   AM     4.04
#> 6 -1.22   2.81   PM     4.27
#> # ... with 4 more rows
One subtle difference between a data.frame object and a tibble is
that, in versions of R before 4.0.0, the data.frame function coerces
character values into factors by default (R 4.0.0 changed the default to
stringsAsFactors=FALSE). On the other hand, as_tibble never converts
characters to factors. In the last two code examples above, column p3 is
of type chr in the tibble example, whereas in a data frame built with the
older default it is a factor. This difference is something you should be
aware of, as it can be maddeningly frustrating to debug an issue caused by it.
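Because the default has differed between R versions, one defensive habit is to state stringsAsFactors explicitly and then check the column classes. The sketch below uses a few made-up values; df_chr and df_fct are hypothetical names.

```r
pred3 <- c("AM", "PM", "AM")
resp <- c(2.05, 1.90, 5.71)

# Being explicit removes any dependence on the R version's default.
df_chr <- data.frame(pred3, resp, stringsAsFactors = FALSE)
df_fct <- data.frame(pred3, resp, stringsAsFactors = TRUE)

sapply(df_chr, class)   # pred3 stays character
sapply(df_fct, class)   # pred3 becomes a factor
```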
Your data is organized by rows, and you want to assemble it into a data frame.
Store each row in a one-row data frame. Store the one-row data frames in
a list. Use rbind and do.call to bind the rows into one, large data
frame:
r1 <- data.frame(a = 1, b = 2, c = "a")
r2 <- data.frame(a = 3, b = 4, c = "b")
r3 <- data.frame(a = 5, b = 6, c = "c")
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c
Here, obs is a list of one-row data frames. But notice that column c
is a factor, not a character.
Data often arrives as a collection of observations. Each observation is a record or tuple that contains several values, one for each observed variable. The lines of a flat file are usually like that: each line is one record, each record contains several columns, and each column is a different variable (see “Reading Files with a Complex Structure”). Such data is organized by observation, not by variable. In other words, you are given rows one at a time rather than columns one at a time.
Each such row might be stored in several ways. One obvious way is as a vector. If you have purely numerical data, use a vector.
However, many datasets are a mixture of numeric, character, and categorical data, in which case a vector won’t work. I recommend storing each such heterogeneous row in a one-row data frame. (You could store each row in a list, but this recipe gets a little more complicated.)
We need to bind together those rows into a data frame. That’s what the
rbind function does. It binds its arguments in such a way that each
argument becomes one row in the result. If we rbind the first two
observations, for example, we get a two-row data frame:
rbind(obs[[1]], obs[[2]])
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
We want to bind together every observation, not just the first two, so
we tap into the vector processing of R. The do.call function will
expand obs into one, long argument list and call rbind with that
long argument list:
do.call(rbind, obs)
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c
The result is a data frame built from our rows of data.
Sometimes, for reasons beyond your control, the rows of your data are
stored in lists rather than one-row data frames. You may be dealing with
rows returned by a database package, for example. In that case, obs
will be a list of lists, not a list of data frames. We first transform
the rows into data frames using the Map function and then apply this
recipe:
l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c
This recipe also works if your observations are stored in vectors rather than one-row data frames. With vectors, however, all elements have to be of the same data type, although R will happily coerce integers into floats on the fly. Note that binding plain vectors produces a matrix rather than a data frame, as the output shows:
r1 <- 1:3
r2 <- 6:8
r3 <- rnorm(3)
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>        [,1]   [,2] [,3]
#> [1,]  1.000  2.000  3.0
#> [2,]  6.000  7.000  8.0
#> [3,] -0.945 -0.547  1.6
Note the factor trap mentioned in the example above. If you would rather
get characters instead of factors, you have a couple of options. One is
to set the stringsAsFactors parameter to FALSE when data.frame is
called:
data.frame(a = 1, b = 2, c = "a", stringsAsFactors = FALSE)
#>   a b c
#> 1 1 2 a
Of course if you inherited your data and it’s already in a data frame
with factors, you can convert all factors in a data.frame to
characters using this bonus recipe:
## same setup as in the previous examples
l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
# yes, you could use stringsAsFactors=FALSE above, but we're assuming the data.frame
# came to you with factors already
i <- sapply(df, is.factor) ## determine which columns are factors
df[i] <- lapply(df[i], as.character) ## turn only the factors to characters
df
Keep in mind that if you use a tibble instead of a data.frame then
characters will not be forced into factors by default.
See “Initializing a Data Frame from Column Data” if your data is organized by columns, not
rows.
See Recipe X-X to learn more about do.call.
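To make the role of do.call concrete, here is a sketch that builds the data frame from a list of row-lists and compares the result with calling rbind by hand. It passes stringsAsFactors=FALSE through lapply so the character column stays character regardless of R version; rows, row_dfs, and by_hand are hypothetical names.

```r
rows <- list(
  list(a = 1, b = 2, c = "a"),
  list(a = 3, b = 4, c = "b"),
  list(a = 5, b = 6, c = "c")
)

# Convert each row-list into a one-row data frame.
row_dfs <- lapply(rows, as.data.frame, stringsAsFactors = FALSE)

# do.call spreads the list out as the arguments of a single rbind call...
df <- do.call(rbind, row_dfs)

# ...which should match writing out every argument yourself.
by_hand <- rbind(row_dfs[[1]], row_dfs[[2]], row_dfs[[3]])
all.equal(df, by_hand)

str(df)
```

The advantage of do.call is that it works for a list of any length, whereas the by-hand call has to name every element.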
You want to append one or more new rows to a data frame.
Create a second, temporary data frame containing the new rows. Then use
the rbind function to append the temporary data frame to the original
data frame.
Suppose we want to append a new row to our data frame of Chicago-area cities. First, we create a one-row data frame with the new data:
newRow<-data.frame(city="West Dundee",county="Kane",state="IL",pop=5428)
Next, we use the rbind function to append that one-row data frame to
our existing data frame:
library(tidyverse)
suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )
suburbs2 <- rbind(suburbs, newRow)
suburbs2
#> # A tibble: 18 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 12 more rows
The rbind function tells R that we are appending a new row to
suburbs, not a new column. It may be obvious to you that newRow is a
row and not a column, but it is not obvious to R. (Use the cbind
function to append a column.)
One word of caution. The new row must use the same column names as the
data frame. Otherwise, rbind will fail.
We can combine these two steps into one, of course:
suburbs3 <- rbind(
  suburbs,
  data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428)
)
We can even extend this technique to multiple new rows because rbind
allows multiple arguments:
suburbs4 <- rbind(
  suburbs,
  data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428),
  data.frame(city = "East Dundee", county = "Kane", state = "IL", pop = 2955)
)
It’s worth noting that in the examples above we seamlessly commingled
tibbles and data frames because we used the tidy function read_csv,
which produces tibbles. And note that the data frames contain factors
while the tibbles do not:
str(suburbs)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  4 variables:
#>  $ city  : chr  "Chicago" "Kenosha" "Aurora" "Elgin" ...
#>  $ county: chr  "Cook" "Kenosha" "Kane" "Kane" ...
#>  $ state : chr  "IL" "WI" "IL" "IL" ...
#>  $ pop   : num  2853114 90352 171782 94487 102746 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   city = col_character(),
#>   ..   county = col_character(),
#>   ..   state = col_character(),
#>   ..   pop = col_double()
#>   .. )
str(newRow)
#> 'data.frame':    1 obs. of  4 variables:
#>  $ city  : Factor w/ 1 level "West Dundee": 1
#>  $ county: Factor w/ 1 level "Kane": 1
#>  $ state : Factor w/ 1 level "IL": 1
#>  $ pop   : num 5428
When the inputs to rbind are a mix of data.frame objects and
tibble objects, the result will be the type of object passed to the
first argument of rbind. So this would produce a tibble:
rbind(some_tibble,some_data.frame)
While this would produce a data.frame:
rbind(some_data.frame,some_tibble)
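The caution about matching column names is easy to demonstrate with plain data frames, which avoids needing the suburbs file. In this sketch cities, good_row, and bad_row are hypothetical names, and the populations are taken from the examples above; tryCatch captures the error message instead of stopping the script.

```r
cities <- data.frame(
  city = c("Chicago", "Elgin"),
  pop = c(2853114, 94487),
  stringsAsFactors = FALSE
)

# Same column names as cities: the append succeeds.
good_row <- data.frame(city = "West Dundee", pop = 5428,
                       stringsAsFactors = FALSE)
cities2 <- rbind(cities, good_row)
nrow(cities2)

# A misnamed column ("town" instead of "city"): rbind signals an error.
bad_row <- data.frame(town = "West Dundee", pop = 5428,
                      stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(cities, bad_row),
  error = function(e) conditionMessage(e)
)
result
```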
You are building a data frame, row by row. You want to preallocate the space instead of appending rows incrementally.
Create a data frame from generic vectors and factors using the functions
numeric(n) and character(n):
n <- 5
df <- data.frame(colname1 = numeric(n), colname2 = character(n))
Here, n is the number of rows needed for the data frame.
Theoretically, you can build a data frame by appending new rows, one by one. That’s OK for small data frames, but building a large data frame in that way can be tortuous. The memory manager in R works poorly when one new row is repeatedly appended to a large data structure. Hence your R code will run very slowly.
One solution is to preallocate the data frame, assuming you know the required number of rows. By preallocating the data frame once and for all, you sidestep problems with the memory manager.
Suppose you want to create a data frame with 1,000,000 rows and three
columns: two numeric and one character. Use the numeric and
character functions to preallocate the columns; then join them
together using data.frame:
n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = character(n),
  response = numeric(n),
  stringsAsFactors = FALSE
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : chr  "" "" "" "" ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...
Now you have a data frame with the correct dimensions, 1,000,000 × 3, waiting to receive its contents.
Notice in the example above we set stringsAsFactors=FALSE so that R
would not coerce the character field into factors. Data frames can
contain factors, but preallocating a factor is a little trickier. You
can’t simply call factor(n). You need to specify the factor’s levels
because you are creating it. Continuing our example, suppose you want
the lab column to be a factor, not a character string, and that the
possible levels are NJ, IL, and CA. Include the levels in the
column specification, like this:
n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = factor(n, levels = c("NJ", "IL", "CA")),
  response = numeric(n)
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : Factor w/ 3 levels "NJ","IL","CA": NA NA NA NA NA NA NA NA NA NA ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...
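Here is a sketch of the step that follows preallocation: filling the rows in place. It also spells the factor column out as n missing values with declared levels; in the example above, factor(n, levels=...) yields a single NA that data.frame recycles to n rows, which gives the same result less explicitly. The incoming values (new_dosage, new_lab, new_response) are made up, and n is kept tiny for illustration.

```r
n <- 3   # small n for illustration only
lab_levels <- c("NJ", "IL", "CA")

df <- data.frame(
  dosage = numeric(n),
  lab = factor(rep(NA_character_, n), levels = lab_levels),
  response = numeric(n)
)

# Hypothetical incoming data, one value per row.
new_dosage <- c(10, 20, 30)
new_lab <- c("NJ", "CA", "IL")
new_response <- c(0.5, 1.0, 1.5)

# Overwrite each preallocated row; the data frame never grows.
for (i in seq_len(n)) {
  df$dosage[i] <- new_dosage[i]
  df$lab[i] <- new_lab[i]        # must be one of the declared levels
  df$response[i] <- new_response[i]
}
str(df)
```

Assigning a value that is not among the declared levels would produce NA with a warning, so the levels need to cover everything you expect to store.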
You want to select columns from a data frame according to their position.
To select a single column, use this list operator:
df[[n]]: Returns one column, specifically the nth column of df.
To select one or more columns and package them in a data frame, use the following sublist expressions:
df[n]: Returns a data frame consisting solely of the nth column of df.
df[c(n1, n2, ..., nk)]: Returns a data frame built from the columns in positions n1, n2, …, nk of df.
You can use matrix-style subscripting to select one or more columns:
df[, n]: Returns the nth column (assuming that n contains exactly one value).
df[, c(n1, n2, ..., nk)]: Returns a data frame built from the columns in positions n1, n2, …, nk.
Note that the matrix-style subscripting can return two different data types (either column or data frame) depending upon whether you select one column or multiple columns.
Or you can use the dplyr package from the Tidyverse and pass column
numbers to the select function to get back a tibble.
df %>% select(n1, n2, ..., nk)
There are a bewildering number of ways to select columns from a data frame. The choices can be confusing until you understand the logic behind the alternatives. As you read this explanation, notice how a slight change in syntax—a comma here, a double-bracket there—changes the meaning of the expression.
Let’s play with the population data for 17 cities in the Chicago metropolitan area:
suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )
suburbs
#> # A tibble: 17 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 11 more rows
So right off the bat we can see this is a tibble. Subsetting and selecting in tibbles works very much like base R data frames. So the recipes below can work on either data structure.
Use simple list notation to select exactly one column, such as the first column:
suburbs[[1]]
#>  [1] "Chicago"           "Kenosha"           "Aurora"
#>  [4] "Elgin"             "Gary"              "Joliet"
#>  [7] "Naperville"        "Arlington Heights" "Bolingbrook"
#> [10] "Cicero"            "Evanston"          "Hammond"
#> [13] "Palatine"          "Schaumburg"        "Skokie"
#> [16] "Waukegan"          "West Dundee"
The first column of suburbs is a vector, so that’s what suburbs[[1]]
returns: a vector. If the first column were a factor, we’d get a factor.
The result differs when you use the single-bracket notation, as in
suburbs[1] or suburbs[c(1,3)]. You still get the requested columns,
but R wraps them in a data frame. This example returns the first column
wrapped in a data frame:
suburbs[1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows
Another option, using the dplyr package from the Tidyverse, is to pipe
the data into a select statement:
suburbs %>%
  dplyr::select(1)
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows
You can, of course, use select from the dplyr package to pull more
than one column:
suburbs %>%
  dplyr::select(1, 4)
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows
The next example returns the first and third columns as a data frame:
suburbs[c(1, 3)]
#> # A tibble: 17 x 2
#>   city    state
#>   <chr>   <chr>
#> 1 Chicago IL
#> 2 Kenosha WI
#> 3 Aurora  IL
#> 4 Elgin   IL
#> 5 Gary    IN
#> 6 Joliet  IL
#> # ... with 11 more rows
A major source of confusion is that suburbs[[1]] and suburbs[1] look
similar but produce very different results:
suburbs[[1]]: This returns one column.
suburbs[1]: This returns a data frame, and the data frame contains exactly one
column. This is a special case of df[c(n1,n2, ..., nk)]. We don’t
need the c(...) construct because there is only one n.
The point here is that “one column” is different from “a data frame that contains one column.” The first expression returns a column, so it’s a vector or a factor. The second expression returns a data frame, which is different.
R lets you use matrix notation to select columns, as shown in the Solution. But an odd quirk can bite you: you might get a column or you might get a data frame, depending upon how many subscripts you use. With a base R data frame, the simple case of one index gives you a column (a vector). A tibble behaves differently and always returns a tibble, as the output for suburbs shows:
suburbs[, 1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows
But using the same matrix-style syntax with multiple indexes returns a data frame:
suburbs[, c(1, 4)]
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows
This creates a problem. Suppose you see this expression in some old R script:
df[,vec]
Quick, does that return a column or a data frame? Well, it depends. If
vec contains one value then you get a column; otherwise, you get a
data frame. You cannot tell from the syntax alone.
To avoid this problem, you can include drop=FALSE in the subscripts;
this forces R to return a data frame:
df[,vec,drop=FALSE]
Now there is no ambiguity about the returned data structure. It’s a data frame.
When all is said and done, using matrix notation to select columns from
data frames is not the best procedure. It’s a good idea to instead use
the list operators described previously. They just seem clearer. Or you
can use the functions in dplyr and know that you will get back a
tibble.
See “Selecting One Row or Column from a Matrix” for more about using drop=FALSE.
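The differences among these notations are easiest to see by checking the class of each result. The sketch below uses a small base R data frame (not a tibble) with made-up contents, so the matrix-style quirk is visible.

```r
df <- data.frame(
  city = c("Chicago", "Elgin"),
  pop = c(2853114, 94487),
  stringsAsFactors = FALSE
)

class(df[[1]])                # the column itself: a character vector
class(df[1])                  # a data frame with one column
class(df[, 1])                # matrix-style, one index: a bare column
class(df[, 1, drop = FALSE])  # matrix-style with drop=FALSE: a data frame
class(df[, c(1, 2)])          # matrix-style, two indexes: a data frame
```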
You want to select columns from a data frame according to their name.
To select a single column, use one of these list expressions:
df[["name"]]: Returns one column, the column called name.
df$name: Same as previous, just different syntax.
To select one or more columns and package them in a data frame, use these list expressions:
df["name"]: Selects one column and packages it inside a data frame object.
df[c("name1", "name2", ..., "namek")]: Selects several columns and packages them in a data frame.
You can use matrix-style subscripting to select one or more columns:
df[, "name"]: Returns the named column.
df[, c("name1", "name2", ..., "namek")]: Selects several columns and packages them in a data frame.
Once again, the matrix-style subscripting can return two different data types (column or data frame) depending upon whether you select one column or multiple columns.
Or you can use the dplyr package from the Tidyverse and pass column
names to the select function to get back a tibble.
df %>% select(name1, name2, ..., namek)
All columns in a data frame must have names. If you know the name, it’s usually more convenient and readable to select by name, not by position.
The solutions just described are similar to those for “Selecting Data Frame Columns by Position”, where we selected columns by position. The only difference is that here we use column names instead of column numbers. All the observations made in “Selecting Data Frame Columns by Position” apply here:
df[["name"]] returns one column, not a data frame.
df[c("name1", "name2", ..., "namek")] returns a data frame, not a
column.
df["name"] is a special case of the previous expression and so
returns a data frame, not a column.
The matrix-style subscripting can return either a column or a data
frame, so be careful how many names you supply. See
“Selecting Data Frame Columns by Position” for a
discussion of this “gotcha” and using drop=FALSE.
There is one new addition:
df$name
This is identical in effect to df[["name"]], but it’s easier to type
and to read.
Note that if you use select from dplyr, you don’t put the column
names in quotes:
df %>% select(name1, name2, ..., namek)
Unquoted column names are a Tidyverse feature and help make Tidyverse functions fast and easy to type interactively.
See “Selecting Data Frame Columns by Position” to understand these ways to select columns.
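A short sketch of the by-name forms, using a made-up data frame whose column names (`name1`, `name2`) are placeholders. It also shows one practical difference that we are adding here as an aside: when the column name is stored in a variable, `[[ ]]` evaluates the variable, whereas `$` looks for a column literally called by the variable's name.

```r
# Hypothetical data frame for illustration only
df <- data.frame(name1 = 1:3, name2 = c("a", "b", "c"))

by_bracket <- df[["name1"]]   # one column, as a vector
by_dollar  <- df$name1        # same column, different syntax
identical(by_bracket, by_dollar)
#> [1] TRUE

as_df <- df["name1"]          # one column, still wrapped in a data frame
is.data.frame(as_df)
#> [1] TRUE

# Column name held in a variable
wanted <- "name2"
df[[wanted]]                  # evaluates `wanted`, returns column name2
df$wanted                     # looks for a column called "wanted": NULL
```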
You want an easier way to select rows and columns from a data frame or matrix.
Use the subset function. The select argument is a column name, or a
vector of column names, to be selected:
subset(df, select = colname)
subset(df, select = c(colname1, ..., colnameN))
Note that you do not quote the column names.
The subset argument is a logical expression that selects rows. Inside
the expression, you can refer to the column names as part of the logical
expression. In this example, pop is a column in the data frame, and
we are selecting rows with a pop over 100,000:
subset(suburbs, subset = (pop > 100000))
#> # A tibble: 5 x 4
#>   city       county   state     pop
#>   <chr>      <chr>    <chr>   <dbl>
#> 1 Chicago    Cook     IL    2853114
#> 2 Aurora     Kane     IL     171782
#> 3 Gary       Lake(IN) IN     102746
#> 4 Joliet     Kendall  IL     106221
#> 5 Naperville DuPage   IL     147779
subset is most useful when you combine the select and subset
arguments:
subset(suburbs, select = c(city, state, pop), subset = (pop > 100000))
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779
The Tidyverse alternative is to use dplyr and string together a
select statement with a filter statement:
suburbs %>%
  dplyr::select(city, state, pop) %>%
  filter(pop > 100000)
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779
Indexing is the “official” Base R way to select rows and columns from a data frame, as described in the preceding recipes on selecting columns by position and by name. However, indexing is cumbersome when the index expressions become complicated.
The subset function provides a more convenient and readable way to
select rows and columns. Its beauty is that you can refer to the
columns of the data frame right inside the expressions for selecting
columns and rows.
Combining select and filter from dplyr along with pipes makes the
steps even easier to both read and write.
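For comparison, here is a sketch of the same selection written both ways in Base R. The small `towns` data frame is hypothetical (three rows borrowed from the suburbs output above), and the point is only that indexing forces us to repeat the data frame name while subset does not:

```r
# Hypothetical three-row data frame for illustration
towns <- data.frame(
  city  = c("Chicago", "Kenosha", "Aurora"),
  state = c("IL", "WI", "IL"),
  pop   = c(2853114, 90352, 171782)
)

# Indexing: the data frame name is repeated inside the brackets
by_index <- towns[towns$pop > 100000, c("city", "pop")]

# subset: columns are referenced directly by name
by_subset <- subset(towns, select = c(city, pop), subset = (pop > 100000))

all.equal(by_index, by_subset)
#> [1] TRUE
```

One caveat we are fairly sure of but that the recipe does not discuss: if the row condition evaluates to NA for some rows, subset drops those rows, whereas plain indexing returns rows filled with NA.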
Here are some examples using the Cars93 dataset in the MASS package.
The dataset includes columns for Manufacturer, Model, MPG.city,
MPG.highway, Min.Price, and Max.Price:
Select the model name for cars that can exceed 30 miles per gallon (MPG) in the city:
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
my_subset <- subset(Cars93, select = Model, subset = (MPG.city > 30))
head(my_subset)
#>      Model
#> 31 Festiva
#> 39   Metro
#> 42   Civic
#> 73  LeMans
#> 80   Justy
#> 83   Swift
Or, using dplyr:
Cars93 %>%
  filter(MPG.city > 30) %>%
  select(Model) %>%
  head()
#> Error in select(., Model): unused argument (Model)
Wait… what? Why did this not work? select worked just fine in an
earlier example! Well, we left this in the book as an example of a bad
surprise. We loaded the Tidyverse package at the beginning of the
chapter, and then we just now loaded the MASS package. It turns out that
MASS has a function named select, too. The package loaded last masks
same-named functions from the packages loaded before it. So we have two
options: (1) we can unload the packages and then load MASS before dplyr
or tidyverse, or (2) we can disambiguate which select function we are
calling. Let’s go with option 2 because it’s easy to illustrate:
Cars93 %>%
  filter(MPG.city > 30) %>%
  dplyr::select(Model) %>%
  head()
#>     Model
#> 1 Festiva
#> 2   Metro
#> 3   Civic
#> 4  LeMans
#> 5   Justy
#> 6   Swift
By using dplyr::select we tell R, “Hey, R, only use the select
function from dplyr.” And R follows suit.
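The same masking mechanism can be reproduced without loading any extra packages. In this sketch, which is our own illustration and not part of the Cars93 example, we define a deliberately unhelpful function named `sd` that masks the real one from the stats package, then reach past it with the `package::function` syntax:

```r
# A made-up function that masks stats::sd for the rest of this session
sd <- function(x) "masked!"

masked_result <- sd(c(1, 2, 3))          # R finds our function first
real_result   <- stats::sd(c(1, 2, 3))   # explicitly ask for the stats version

masked_result
#> [1] "masked!"
real_result
#> [1] 1

rm(sd)                                   # remove the mask
restored_result <- sd(c(1, 2, 3))        # plain sd() is the stats version again
```

Qualifying the call with the package name works regardless of the order in which packages were loaded, which is why it is the more robust of the two options.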
Now let’s select the model name and price range for four-cylinder cars made in the United States:
my_cars <- subset(Cars93,
                  select = c(Model, Min.Price, Max.Price),
                  subset = (Cylinders == 4 & Origin == "USA"))
head(my_cars)
#>       Model Min.Price Max.Price
#> 6   Century      14.2      17.3
#> 12 Cavalier       8.5      18.3
#> 13  Corsica      11.4      11.4
#> 15   Lumina      13.4      18.4
#> 21  LeBaron      14.5      17.1
#> 23     Colt       7.9      10.6
Or, using our unambiguous dplyr functions:
Cars93 %>%
  filter(Cylinders == 4 & Origin == "USA") %>%
  dplyr::select(Model, Min.Price, Max.Price) %>%
  head()
#>      Model Min.Price Max.Price
#> 1  Century      14.2      17.3
#> 2 Cavalier       8.5      18.3
#> 3  Corsica      11.4      11.4
#> 4   Lumina      13.4      18.4
#> 5  LeBaron      14.5      17.1
#> 6     Colt       7.9      10.6
Notice that in the above example we put the filter statement before the
select statement. Commands connected by pipes are sequential, and if we
had selected only our three fields before we filtered on Cylinders and
Origin, then the Cylinders and Origin fields would no longer be in
the data and we’d get an error.
Now we’ll select the manufacturer’s name and the model name for all cars whose highway MPG value is above the median:
my_cars <- subset(Cars93,
                  select = c(Manufacturer, Model),
                  subset = (MPG.highway > median(MPG.highway)))
head(my_cars)
#>    Manufacturer    Model
#> 1         Acura  Integra
#> 5           BMW     535i
#> 6         Buick  Century
#> 12    Chevrolet Cavalier
#> 13    Chevrolet  Corsica
#> 15    Chevrolet   Lumina
The subset function is actually more powerful than this recipe
implies. It can select from lists and vectors, too. See the help page
for details.
Or, using dplyr:
Cars93 %>%
  filter(MPG.highway > median(MPG.highway)) %>%
  dplyr::select(Manufacturer, Model) %>%
  head()
#>   Manufacturer    Model
#> 1        Acura  Integra
#> 2          BMW     535i
#> 3        Buick  Century
#> 4    Chevrolet Cavalier
#> 5    Chevrolet  Corsica
#> 6    Chevrolet   Lumina
Remember in the above examples the only reason we use the full
dplyr::select name is because we have a conflict with MASS::select.
In your code you will likely only need to use select after you load
dplyr.
Just to keep us from frustrating naming clashes, let’s detach the
MASS package:
detach("package:MASS",unload=TRUE)
You converted a matrix or list into a data frame. R gave names to the columns, but the names are at best uninformative and at worst bizarre.
Data frames have a colnames attribute that is a vector of column
names. You can update individual names or the entire vector:
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df
#>   V1 V2 V3
#> 1  1  4  7
#> 2  2  5  8
#> 3  3  6  9
colnames(df) <- c("tom", "dick", "harry") # a vector of character strings
df
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9
Or, using dplyr from the Tidyverse:
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df %>%
  rename(tom = V1, dick = V2, harry = V3)
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9
Notice that with the rename function in dplyr there’s no need to use
quotes around the column names, as is typical with Tidyverse functions.
Also note that the argument order is new_name=old_name.
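The Solution mentions that individual names can be updated, too. Here is a minimal Base R sketch of that, changing one name by position and another by matching its current name; the replacement names are arbitrary:

```r
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

colnames(df)[2] <- "dick"                       # rename by position
colnames(df)[colnames(df) == "V3"] <- "harry"   # rename by current name

colnames(df)
#> [1] "V1"    "dick"  "harry"
```

Matching on the current name is a little more typing, but it keeps working if the column order changes.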
The columns of data frames (and tibbles) must have names. If you convert
a vanilla matrix into a data frame, R will synthesize names that are
reasonable but boring — for example, V1, V2, V3, and so forth:
mat <- matrix(rnorm(9), nrow = 3, ncol = 3)
mat
#>       [,1]    [,2]   [,3]
#> [1,] 0.701  0.0976  0.821
#> [2,] 0.388 -1.2755 -1.086
#> [3,] 1.968  1.2544  0.111
as.data.frame(mat)
#>      V1      V2     V3
#> 1 0.701  0.0976  0.821
#> 2 0.388 -1.2755 -1.086
#> 3 1.968  1.2544  0.111
If the matrix had column names defined, R would have used those names instead of synthesizing new ones.
However, converting a list into a data frame produces some strange synthetic names:
lst <- list(1:3, c("a", "b", "c"), round(rnorm(3), 3))
lst
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "a" "b" "c"
#>
#> [[3]]
#> [1] 0.181 0.773 0.983
as.data.frame(lst)
#>   X1.3 c..a....b....c.. c.0.181..0.773..0.983.
#> 1    1                a                  0.181
#> 2    2                b                  0.773
#> 3    3                c                  0.983
Again, if the list elements had names then R would have used them.
Fortunately, you can overwrite the synthetic names with names of your
own by setting the colnames attribute:
df <- as.data.frame(lst)
colnames(df) <- c("patient", "treatment", "value")
df
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983
You can do renaming by position using rename from dplyr… but it’s
not really pretty. Actually it’s quite horrible and we considered
omitting it from this book.
df <- as.data.frame(lst)
df %>%
  rename(
    "patient" = !!names(.[1]),
    "treatment" = !!names(.[2]),
    "value" = !!names(.[3])
  )
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983
The reason this is so ugly is that the Tidyverse is designed around
using names, not positions, when referring to columns. And in this
example the names are pretty miserable to type and get right. While you
could use the above recipe, we recommend using the Base R colnames()
method if you really must rename by position number.
Of course, we could have made this all a lot easier by simply giving the list elements names before we converted it to a data frame:
names(lst) <- c("patient", "treatment", "value")
as.data.frame(lst)
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983
Your data frame contains NA values, which is creating problems for you.
Use na.omit to remove rows that contain any NA values.
df <- data.frame(my_data = c(NA, 1, NA, 2, NA, 3))
df
#>   my_data
#> 1      NA
#> 2       1
#> 3      NA
#> 4       2
#> 5      NA
#> 6       3
clean_df <- na.omit(df)
clean_df
#>   my_data
#> 2       1
#> 4       2
#> 6       3
We frequently stumble upon situations where just a few NA values in a
data frame cause everything to fall apart. One solution is simply to
remove all rows that contain any NAs. That’s what na.omit does.
Here we can see cumsum fail because the input contains NA values:
df <- data.frame(
  x = c(NA, rnorm(4)),
  y = c(rnorm(2), NA, rnorm(2))
)
df
#>        x      y
#> 1     NA -0.836
#> 2  0.670 -0.922
#> 3 -1.421     NA
#> 4 -0.236 -1.123
#> 5 -0.975  0.372
cumsum(df)
#>    x      y
#> 1 NA -0.836
#> 2 NA -1.759
#> 3 NA     NA
#> 4 NA     NA
#> 5 NA     NA
If we remove the NA values, cumsum can complete its summations:
cumsum(na.omit(df))
#>        x      y
#> 2  0.670 -0.922
#> 4  0.434 -2.046
#> 5 -0.541 -1.674
This recipe works for vectors and matrices, too, but not for lists.
The obvious danger here is that simply dropping observations from your
data could render the results computationally or statistically
meaningless. Make sure that omitting data makes sense in your context.
Remember that na.omit will remove entire rows, not just the NA values,
which could eliminate a lot of useful information.
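Before dropping anything, it can help to see how much data is at stake. This sketch uses a small made-up data frame and the Base R function complete.cases, which flags the rows that na.omit would keep:

```r
# Hypothetical data with one NA in each column
df <- data.frame(x = c(NA, 1, 2, 3, 4), y = c(10, 20, NA, 40, 50))

keep <- complete.cases(df)   # TRUE for rows with no NA values
sum(!keep)                   # number of rows na.omit would remove
#> [1] 2

colSums(is.na(df))           # NA count per column
#> x y
#> 1 1

clean_df <- na.omit(df)
nrow(clean_df)
#> [1] 3
```

Here two NA values cost us two of five rows; in a wide data frame with scattered NAs, the loss can be far larger.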
You want to exclude a column from a data frame using its name.
Use the subset function with a negated argument for the select
parameter:
df <- data.frame(good = rnorm(3), meh = rnorm(3), bad = rnorm(3))
df
#>     good     meh    bad
#> 1  1.911 -0.7045 -1.575
#> 2  0.912  0.0608 -2.238
#> 3 -0.819  0.4424 -0.807
subset(df, select = -bad) # All columns except bad
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424
Or we can use select from dplyr to accomplish the same thing:
df %>%
  dplyr::select(-bad)
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424
We can exclude a column by position (e.g., df[-1]), but how do we
exclude a column by name? The subset function can exclude columns from a
data frame. The select parameter is normally a list of columns to
include, but prefixing a minus sign (-) to the name causes the column
to be excluded instead.
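If the names to exclude are held in a character vector, the minus-sign trick does not apply directly, because a character string cannot be negated. As a sketch of one alternative (our own addition, using plain indexing on a made-up data frame), we can compare against names(df) instead:

```r
# Hypothetical data frame for illustration
df <- data.frame(good = 1:3, meh = 4:6, bad = 7:9)

drop_names <- c("bad")
kept <- df[, !(names(df) %in% drop_names), drop = FALSE]
names(kept)
#> [1] "good" "meh"

# setdiff gives the same effect with list-style indexing
kept2 <- df[setdiff(names(df), c("meh", "bad"))]
names(kept2)
#> [1] "good"
```

The drop = FALSE in the first form guards against the result collapsing to a vector when only one column survives.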
We often encounter this problem when calculating the correlation matrix of a data frame and we want to exclude nondata columns such as labels. Let’s set up some dummy data:
id <- 1:10
pre <- rnorm(10)
dosage <- rnorm(10) + .3 * pre
post <- dosage * .5 * pre
patient_data <- data.frame(id = id, pre = pre, dosage = dosage, post = post)
cor(patient_data)
#>             id     pre  dosage    post
#> id      1.0000 -0.6934 -0.5075  0.0672
#> pre    -0.6934  1.0000  0.5830 -0.0919
#> dosage -0.5075  0.5830  1.0000  0.0878
#> post    0.0672 -0.0919  0.0878  1.0000
This correlation matrix includes the meaningless “correlation” between id and other variables, which is annoying. We can exclude the id column to clean up the output:
cor(subset(patient_data, select = -id))
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000
or with dplyr:
patient_data %>%
  dplyr::select(-id) %>%
  cor()
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000
We can exclude multiple columns by giving a vector of negated names:
cor(subset(patient_data, select = c(-id, -dosage)))
or with dplyr:
patient_data %>%
  dplyr::select(-id, -dosage) %>%
  cor()
#>          pre    post
#> pre   1.0000 -0.0919
#> post -0.0919  1.0000
Note that with dplyr we don’t wrap the column names in c().
See “Selecting Rows and Columns More Easily” for more about the subset function.
You want to combine the contents of two data frames into one data frame.
To combine the columns of two data frames side by side, use cbind
(column bind):
df1 <- data_frame(a = rnorm(5))
df2 <- data_frame(b = rnorm(5))
all <- cbind(df1, df2)
all
#>         a       b
#> 1 -1.6357  1.3669
#> 2 -0.3662 -0.5432
#> 3  0.4445 -0.0158
#> 4  0.4945 -0.6960
#> 5  0.0934 -0.7334
To “stack” the rows of two data frames, use rbind (row bind):
df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440
df2 <- data_frame(x = rep("b", 2), y = rnorm(2))
df2
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 b     2.35
#> 2 b     0.188
rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440
#> 3 b     2.35
#> 4 b     0.188
You can combine data frames in one of two ways: either by putting the
columns side by side to create a wider data frame; or by “stacking” the
rows to create a taller data frame. The cbind function will combine
data frames side by side. You would normally combine columns with the
same height (number of rows). Technically speaking, however, cbind
does not require matching heights. If one data frame is short, it will
invoke the Recycling Rule to extend the short columns as necessary
(“Understanding the Recycling Rule”), which may or may
not be what you want. The recycling works only when the taller height
is a whole multiple of the shorter one; otherwise R stops with an error.
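Here is a small sketch of that recycling behavior with plain data frames. The data frames `tall`, `short`, and `uneven` are hypothetical, and the tryCatch wrapper is only there so the example keeps running after the error:

```r
tall  <- data.frame(a = 1:4)
short <- data.frame(b = c(10, 20))

# 4 is a multiple of 2, so column b is recycled to fill four rows
recycled <- cbind(tall, short)
recycled$b
#> [1] 10 20 10 20

# 4 is not a multiple of 3, so cbind refuses
uneven <- data.frame(c = 1:3)
result <- tryCatch(cbind(tall, uneven), error = function(e) "error")
result
#> [1] "error"
```

Silent recycling is easy to overlook, so it is worth checking nrow on both data frames before binding if there is any doubt.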
The rbind function will “stack” the rows of two data frames. The
rbind function requires that the data frames have the same width: same
number of columns and same column names. The columns need not be in the
same order, however; rbind will sort that out:
df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478
df2 <- data_frame(y = 1:2, x = c("b", "b"))
df2
#> # A tibble: 2 x 2
#>       y x
#>   <int> <chr>
#> 1     1 b
#> 2     2 b
rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478
#> 3 b      1
#> 4 b      2
Finally, this recipe is slightly more general than the title implies.
First, you can combine more than two data frames because both rbind
and cbind accept multiple arguments. Second, you can apply this recipe
to other data types because rbind and cbind work also with vectors,
lists, and matrices.
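Both generalizations can be sketched briefly. The quarterly data frames below are invented for the example; do.call is a common Base R idiom for handing a whole list of data frames to rbind at once:

```r
# Three hypothetical data frames with identical columns
q1 <- data.frame(quarter = "Q1", sales = 10)
q2 <- data.frame(quarter = "Q2", sales = 12)
q3 <- data.frame(quarter = "Q3", sales = 9)

stacked <- rbind(q1, q2, q3)          # more than two arguments is fine
nrow(stacked)
#> [1] 3

# When the pieces are already in a list, pass the list via do.call
pieces   <- list(q1, q2, q3)
stacked2 <- do.call(rbind, pieces)
all.equal(stacked, stacked2)
#> [1] TRUE

# rbind on plain vectors builds a matrix, one row per vector
m <- rbind(1:3, 4:6)
dim(m)
#> [1] 2 3
```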
The merge function can combine data frames that are otherwise
incompatible owing to missing or different columns. In addition, dplyr
and tidyr from the Tidyverse include some powerful functions for
slicing, dicing, and recombining data frames.
You have two data frames that share a common column. You want to merge or join their rows into one data frame by matching on the common column.
Use the merge function to join the data frames into one new data frame
based on the common column:
df1 <- data.frame(index = letters[1:5], val1 = rnorm(5))
df2 <- data.frame(index = letters[1:5], val2 = rnorm(5))
m <- merge(df1, df2, by = "index")
m
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149
Here index is the name of the column that is common to data frames
df1 and df2.
The alternative dplyr way of doing this is with inner_join:
df1 %>%
  inner_join(df2)
#> Joining, by = "index"
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149
Suppose you have two data frames, born and died, that each contain a
column called name:
born <- data.frame(
  name = c("Moe", "Larry", "Curly", "Harry"),
  year.born = c(1887, 1902, 1903, 1964),
  place.born = c("Bensonhurst", "Philadelphia", "Brooklyn", "Moscow")
)
died <- data.frame(
  name = c("Curly", "Moe", "Larry"),
  year.died = c(1952, 1975, 1975)
)
We can merge them into one data frame by using name to combine matched
rows:
merge(born, died, by = "name")
#>    name year.born   place.born year.died
#> 1 Curly      1903     Brooklyn      1952
#> 2 Larry      1902 Philadelphia      1975
#> 3   Moe      1887  Bensonhurst      1975
Notice that merge does not require the rows to be sorted or even to
occur in the same order. It found the matching rows for Curly even
though they occur in different positions. It also discards rows that
appear in only one data frame or the other.
In SQL terms, the merge function essentially performs a join operation
on the two data frames. It has many options for controlling that join
operation, all of which are described on the help page for merge.
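One of those options is worth a quick sketch. By default merge keeps only the matched rows, but the all.x, all.y, and all arguments turn the operation into a left, right, or full outer join. The data below repeat part of the born and died example:

```r
born <- data.frame(
  name = c("Moe", "Larry", "Curly", "Harry"),
  year.born = c(1887, 1902, 1903, 1964)
)
died <- data.frame(
  name = c("Curly", "Moe", "Larry"),
  year.died = c(1952, 1975, 1975)
)

inner <- merge(born, died, by = "name")                 # matched rows only
left  <- merge(born, died, by = "name", all.x = TRUE)   # keep every row of born

nrow(inner)
#> [1] 3
nrow(left)
#> [1] 4
left$year.died[left$name == "Harry"]   # no match in died, so NA
#> [1] NA
```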
Because of the similarity with SQL, dplyr uses similar terms:
born %>%
  inner_join(died)
#> Joining, by = "name"
#> Warning: Column `name` joining factors with different levels, coercing to
#> character vector
#>    name year.born   place.born year.died
#> 1   Moe      1887  Bensonhurst      1975
#> 2 Larry      1902 Philadelphia      1975
#> 3 Curly      1903     Brooklyn      1952
Because we used data.frame to create the data frames, the name column
was turned into a factor (the default behavior of data.frame in versions
of R before 4.0). dplyr, and most of the Tidyverse packages, really
prefer characters, so the name column was coerced into character
and we get a chatty notification in R. This is the sort of verbose
feedback that is common in the Tidyverse. There are multiple types of
joins in dplyr, including inner, left, right, and full. For a complete
list, see the join documentation by typing ?dplyr::join.
See “Combining Two Data Frames” for other ways to combine data frames.
Your data is stored in a data frame. You are getting tired of repeatedly typing the data frame name and want to access the columns more easily.
For quick, one-off expressions, use the with function to expose the
column names:
with(dataframe,expr)
Inside expr, you can refer to the columns of dataframe by their names as if they were simple variables.
If you’re working with Tidyverse functions and pipes (%>%) this is not
very useful as in a piped workflow you are always dealing with whatever
input data was sent via the pipe.
A data frame is a great way to store your data, but accessing individual
columns can become tedious. For a data frame called suburbs that
contains a column called pop, here is the naïve way to calculate the
z-scores of pop:
z <- (suburbs$pop - mean(suburbs$pop)) / sd(suburbs$pop)
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364
Call us lazy, but all that typing gets tedious. The with function lets
you expose the columns of a data frame as distinct variables. It takes
two arguments, a data frame and an expression to be evaluated. Inside
the expression, you can refer to the data frame columns by their names:
z <- with(suburbs, (pop - mean(pop)) / sd(pop))
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364
When using dplyr you can accomplish the same logic with mutate:
suburbs %>%
  mutate(z = (pop - mean(pop)) / sd(pop))
#> # A tibble: 17 x 5
#>   city    county   state     pop      z
#>   <chr>   <chr>    <chr>   <dbl>  <dbl>
#> 1 Chicago Cook     IL    2853114  3.88
#> 2 Kenosha Kenosha  WI      90352 -0.237
#> 3 Aurora  Kane     IL     171782 -0.116
#> 4 Elgin   Kane     IL      94487 -0.231
#> 5 Gary    Lake(IN) IN     102746 -0.219
#> 6 Joliet  Kendall  IL     106221 -0.214
#> # ... with 11 more rows
As you can see, mutate helpfully returns the data frame with the
column we just created added to it.
You have a data value which has an atomic data type: character, complex, double, integer, or logical. You want to convert this value into one of the other atomic data types.
For each atomic data type, there is a function for converting values to that type. The conversion functions for atomic types include:
as.character(x)
as.complex(x)
as.numeric(x) or as.double(x)
as.integer(x)
as.logical(x)
Converting one atomic type into another is usually pretty simple. If the conversion works, you get what you would expect. If it does not work, you get NA:
as.numeric(" 3.14 ")
#> [1] 3.14
as.integer(3.14)
#> [1] 3
as.numeric("foo")
#> Warning: NAs introduced by coercion
#> [1] NA
as.character(101)
#> [1] "101"
If you have a vector of atomic types, these functions apply themselves to every value. So the preceding examples of converting scalars generalize easily to converting entire vectors:
as.numeric(c("1", "2.718", "7.389", "20.086"))
#> [1]  1.00  2.72  7.39 20.09
as.numeric(c("1", "2.718", "7.389", "20.086", "etc."))
#> Warning: NAs introduced by coercion
#> [1]  1.00  2.72  7.39 20.09    NA
as.character(101:105)
#> [1] "101" "102" "103" "104" "105"
When converting logical values into numeric values, R converts FALSE
to 0 and TRUE to 1:
as.numeric(FALSE)
#> [1] 0
as.numeric(TRUE)
#> [1] 1
This behavior is useful when you are counting occurrences of TRUE in
vectors of logical values. If logvec is a vector of logical values,
then sum(logvec) does an implicit conversion from logical to integer
and returns the number of TRUE values:
logvec <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
sum(logvec) ## num true
#> [1] 4
length(logvec) - sum(logvec) ## num not true
#> [1] 2
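A few more conversions are worth trying at the console. This sketch is our own addition; it shows that as.integer truncates toward zero rather than rounding, that as.logical recognizes only a handful of spellings, and that mean on a logical vector yields the proportion of TRUE values:

```r
trunc_val <- as.integer(-3.9)      # truncates toward zero, does not round
trunc_val
#> [1] -3

parsed <- as.logical(c("TRUE", "false", "T", "yes"))
parsed                             # "yes" is not a recognized spelling
#> [1]  TRUE FALSE  TRUE    NA

from_num <- as.logical(0:2)        # zero is FALSE, any other number is TRUE
from_num
#> [1] FALSE  TRUE  TRUE

logvec <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
prop_true <- mean(logvec)          # proportion of TRUE values (4 out of 6)
```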
You want to convert a variable from one structured data type to another—for example, converting a vector into a list or a matrix into a data frame.
These functions convert their argument into the corresponding structured data type:
as.data.frame(x)
as.list(x)
as.matrix(x)
as.vector(x)
Some of these conversions may surprise you, however. I suggest you review the conversion table that follows.
Converting between structured data types can be tricky. Some conversions behave as you’d expect. If you convert a matrix into a data frame, for instance, the rows and columns of the matrix become the rows and columns of the data frame. No sweat.
| Conversion | How | Notes |
|---|---|---|
| Vector→List | | Don’t use |
| Vector→Matrix | To create a 1-column matrix: | |
| | To create a 1-row matrix: | |
| | To create an n × m matrix: | |
| Vector→Data frame | To create a 1-column data frame: | |
| | To create a 1-row data frame: | |
| List→Vector | | Use |
| List→Matrix | To create a 1-column matrix: | |
| | To create a 1-row matrix: | |
| | To create an n × m matrix: | |
| List→Data frame | If the list elements are columns of data: | |
| | If the list elements are rows of data: “Initializing a Data Frame from Row Data” | |
| Matrix→Vector | | Returns all matrix elements in a vector. |
| Matrix→List | | Returns all matrix elements in a list. |
| Matrix→Data frame | | |
| Data frame→Vector | To convert a 1-row data frame: | See Note 2. |
| | To convert a 1-column data frame: | |
| Data frame→List | | See Note 3. |
| Data frame→Matrix | | See Note 4. |
In other cases, the results might surprise you. The preceding table summarizes some noteworthy examples. The following Notes are cited in that table:
1. When you convert a list into a vector, the conversion works cleanly if your list contains atomic values that are all of the same mode. Things become complicated if either (a) your list contains mixed modes (e.g., numeric and character), in which case everything is converted to characters; or (b) your list contains other structured data types, such as sublists or data frames, in which case very odd things happen, so don’t do that.
2. Converting a data frame into a vector makes sense only if the data
frame contains one row or one column. To extract all its elements into
one long vector, use as.vector(as.matrix(df)). But even that makes
sense only if the data frame is all-numeric or all-character; if not,
everything is first converted to character strings.
3. Converting a data frame into a list may seem odd in that a data
frame is already a list (i.e., a list of columns). Using as.list
essentially removes the class (data.frame) and thereby exposes the
underlying list. That is useful when you want R to treat your data
structure as a list (say, for printing).
4. Be careful when converting a data frame into a matrix. If the data frame contains only numeric values then you get a numeric matrix. If it contains only character values, you get a character matrix. But if the data frame is a mix of numbers, characters, and/or factors, then all values are first converted to characters. The result is a matrix of character strings.
The matrix conversions detailed here assume that your matrix is homogeneous: all elements have the same mode (e.g., all numeric or all character). A matrix can be heterogeneous, too, when the matrix is built from a list. If so, conversions become messy. For example, when you convert a mixed-mode matrix to a data frame, the data frame’s columns are actually lists (to accommodate the mixed data).
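A few of the behaviors described in the Notes can be checked directly. The sketch below is our own illustration, not a reconstruction of the table: it uses only standard Base R functions (as.list, list, unlist, as.matrix, as.vector), and all the sample values are made up.

```r
# Vector to list: as.list splits the vector; list() merely wraps it
vec <- c(a = 1, b = 2, c = 3)
split_up <- as.list(vec)
wrapped  <- list(vec)
length(split_up)
#> [1] 3
length(wrapped)
#> [1] 1

# List to vector: mixed modes are all coerced to character (Note 1)
lst  <- list(1, "two", 3)
flat <- unlist(lst)
flat
#> [1] "1"   "two" "3"

# Data frame to matrix: mixed columns give a character matrix (Note 4)
df <- data.frame(n = 1:2, s = c("x", "y"), stringsAsFactors = FALSE)
m  <- as.matrix(df)
is.character(m)
#> [1] TRUE

# Data frame to one long vector (Note 2)
all_values <- as.vector(as.matrix(df))
all_values
#> [1] "1" "2" "x" "y"

# Data frame to list: exposes the underlying list of columns (Note 3)
cols <- as.list(df)
class(cols)
#> [1] "list"
```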
See “Converting One Atomic Value into Another” for converting atomic data types; see the “Introduction” to this chapter for remarks on problematic conversions.
1 A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.
2 More precisely, it orders the names according to your Locale.