R Cookbook

Second Edition

Proven Recipes for Data Analysis, Statistics, and Graphics

J.D. Long and Paul Teetor

R Cookbook

by J.D. Long and Paul Teetor

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editor: Nicole Tache
  • Production Editor: Kristen Brown
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • May 2019: Second Edition

Revision History for the Second Edition

  • 2019-01-02: First Early Release
  • 2019-01-25: Second Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492040682 for release details.

Chapter 1. Getting Started and Getting Help

Introduction

This chapter sets the groundwork for the other chapters. It explains how to download, install, and run R.

More importantly, it also explains how to get answers to your questions. The R community provides a wealth of documentation and help. You are not alone. Here are some common sources of help:

Local, installed documentation

When you install R on your computer, a mass of documentation is also installed. You can browse the local documentation (“Viewing the Supplied Documentation”) and search it (“Searching the Supplied Documentation”). We are amazed how often we search the Web for an answer only to discover it was already available in the installed documentation.

Task views (http://cran.r-project.org/web/views)

A task view describes packages that are specific to one area of statistical work, such as econometrics, medical imaging, psychometrics, or spatial statistics. Each task view is written and maintained by an expert in the field. There are more than 35 such task views, so there is likely to be one or more for your areas of interest. We recommend that every beginner find and read at least one task view in order to gain a sense of R’s possibilities (“Finding Relevant Functions and Packages”).

Package documentation

Most packages include useful documentation. Many also include overviews and tutorials, called “vignettes” in the R community. The documentation is kept with the packages in package repositories, such as CRAN (http://cran.r-project.org/), and it is automatically installed on your machine when you install a package.

Question and answer (Q&A) websites

On a Q&A site, anyone can post a question, and knowledgeable people can respond. Readers vote on the answers, so the best answers tend to emerge over time. All this information is tagged and archived for searching. These sites are a cross between a mailing list and a social network; “Stack Overflow” (http://stackoverflow.com/) is the canonical example.

The Web

The Web is loaded with information about R, and there are R-specific tools for searching it (“Searching the Web for Help”). The Web is a moving target, so be on the lookout for new, improved ways to organize and search information regarding R.

Mailing lists

Volunteers have generously donated many hours of time to answer beginners’ questions that are posted to the R mailing lists. The lists are archived, so you can search the archives for answers to your questions (“Searching the Mailing Lists”).

Downloading and Installing R

Problem

You want to install R on your computer.

Solution

Windows and OS X users can download R from CRAN, the Comprehensive R Archive Network. Linux and Unix users can install R packages using their package management tool:

Windows

  1. Open http://www.r-project.org/ in your browser.

  2. Click on “CRAN”. You’ll see a list of mirror sites, organized by country.

  3. Select a site near you, or the top one listed as “0-Cloud”, which tends to work well for most locations (https://cloud.r-project.org/).

  4. Click on “Download R for Windows” under “Download and Install R”.

  5. Click on “base”.

  6. Click on the link for downloading the latest version of R (an .exe file).

  7. When the download completes, double-click on the .exe file and answer the usual questions.

OS X

  1. Open http://www.r-project.org/ in your browser.

  2. Click on “CRAN”. You’ll see a list of mirror sites, organized by country.

  3. Select a site near you, or the top one listed as “0-Cloud”, which tends to work well for most locations.

  4. Click on “Download R for (Mac) OS X”.

  5. Click on the .pkg file for the latest version of R, under “Latest release:”, to download it.

  6. When the download completes, double-click on the .pkg file and answer the usual questions.

Linux or Unix

The major Linux distributions have packages for installing R. Here are some examples:

Table 1-1. Linux distributions

Distribution        Package name
Ubuntu or Debian    r-base
Red Hat or Fedora   R.i386
Suse                R-base

Use the system’s package manager to download and install the package. Normally, you will need the root password or sudo privileges; otherwise, ask a system administrator to perform the installation.

Discussion

Installing R on Windows or OS X is straightforward because there are prebuilt binaries (compiled programs) for those platforms. You need only follow the preceding instructions. The CRAN Web pages also contain links to installation-related resources, such as frequently asked questions (FAQs) and tips for special situations (“Does R run under Windows Vista/7/8/Server 2008?”) that you may find useful.

The best way to install R on Linux or Unix is by using your Linux distribution package manager to install R as a package. The distribution packages greatly streamline both the initial installation and subsequent updates.

On Ubuntu or Debian, use apt-get to download and install R. Run under sudo to have the necessary privileges:

$ sudo apt-get install r-base

On Red Hat or Fedora, use yum:

$ sudo yum install R.i386

Most Linux platforms also have graphical package managers, which you might find more convenient.

Beyond the base packages, we recommend installing the documentation packages, too. We like to install r-base-html (because we like browsing the hyperlinked documentation) as well as r-doc-html, which installs the important R manuals locally:

$ sudo apt-get install r-base-html r-doc-html

Some Linux repositories also include prebuilt copies of R packages available on CRAN. We don’t use them because we’d rather get software directly from CRAN itself, which usually has the freshest versions.

In rare cases, you may need to build R from scratch. You might have an obscure, unsupported version of Unix; or you might have special considerations regarding performance or configuration. The build procedure on Linux or Unix is quite standard. Download the tarball from the home page of your CRAN mirror; it’s called something like R-3.5.1.tar.gz, except the “3.5.1” will be replaced by the latest version. Unpack the tarball, look for a file called INSTALL, and follow the directions.

See Also

R in a Nutshell (http://oreilly.com/catalog/9780596801717) (O’Reilly) contains more details of downloading and installing R, including instructions for building the Windows and OS X versions. Perhaps the ultimate guide is the one entitled “R Installation and Administration” (http://cran.r-project.org/doc/manuals/R-admin.html), available on CRAN, which describes building and installing R on a variety of platforms.

This recipe is about installing the base package. See “Installing Packages from CRAN” for installing add-on packages from CRAN.

Installing R Studio

Problem

You want a more comprehensive Integrated Development Environment (IDE) than the R default. In other words, you want to install R Studio Desktop.

Solution

Over the past few years R Studio has become the most widely used IDE for R. We are of the opinion that almost all R work should be done in the R Studio Desktop IDE unless there is a compelling reason to do otherwise. R Studio makes multiple products, including R Studio Desktop, R Studio Server, and R Studio Shiny Server, just to name a few. In this book we will use the term R Studio to mean R Studio Desktop, though most concepts apply to R Studio Server as well.

To install R Studio, download the latest installer for your platform from the R Studio website: https://www.rstudio.com/products/rstudio/download/

The R Studio Desktop Open Source License version is free to download and use.

Discussion

This book was written and built using R Studio version 1.2.x and R versions 3.5.x. New versions of R Studio are released every few months, so be sure to update regularly. Note that R Studio works with whichever version of R you have installed, so updating to the latest version of R Studio does not upgrade your version of R. R must be upgraded separately.
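
If you are not sure which version of R your R Studio session is using, you can ask R directly. The version shown in this output is only an illustration; yours will likely differ:

R.version.string
#> [1] "R version 3.5.2 (2018-12-20)"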

Interacting with R is slightly different in R Studio than in the built-in R user interface. For this book, we’ve elected to use R Studio for all examples.

Starting R Studio

Problem

You want to run R Studio on your computer.

Solution

A common point of confusion for new users of R and R Studio is to accidentally start R when they intended to start R Studio. The easiest way to ensure you’re actually starting R Studio is to search for RStudio on your desktop OS. Then use whatever method your OS provides for pinning the icon somewhere easy to find later.

Windows

Click on the Start menu in the lower-left corner of the screen. In the search box, type RStudio.

OS X

Look in your Launchpad for the R Studio app, or press Command-space and type RStudio to search using Spotlight Search.

Ubuntu

Press Alt + F1 and type RStudio to search for R Studio.

Discussion

Confusion between R and R Studio can easily happen because, as you can see in Figure 1-1, the icons look similar.

Figure 1-1. R and R Studio icons in OS X

If you click on the R icon, you’ll be greeted by something like Figure 1-2, which is the base R interface on a Mac, but certainly not R Studio.

Figure 1-2. The R console in OS X

When you start R Studio, its default behavior is to reopen the last project you were working on.

Entering Commands

Problem

You’ve started R Studio. Now what?

Solution

When you start R Studio, the main window on the left is an R session. From there you can enter commands interactively, directly to R.

Discussion

R prompts you with “>”. To get started, just treat R like a big calculator: enter an expression, and R will evaluate the expression and print the result:

1 + 1
#> [1] 2

The computer adds one and one, giving two, and displays the result.

The [1] before the 2 might be confusing. To R, the result is a vector, even though it has only one element. R labels the value with [1] to signify that this is the first element of the vector… which is not surprising, since it’s the only element of the vector.
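
The [1] label earns its keep when the output spans more than one line: each line begins with the index of its first element. Here is a small illustration (the exact line breaks depend on the width of your console):

print(1:30)
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30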

R will prompt you for input until you type a complete expression. The expression max(1,3,5) is a complete expression, so R stops reading input and evaluates what it’s got:

max(1, 3, 5)
#> [1] 5

In contrast, “max(1,3,” is an incomplete expression, so R prompts you for more input. The prompt changes from greater-than (>) to plus (+), letting you know that R expects more:

max(1, 3,
+ 5)
#> [1] 5

It’s easy to mistype commands, and retyping them is tedious and frustrating. So R includes command-line editing to make life easier. It defines single keystrokes that let you easily recall, correct, and reexecute your commands. A typical command-line interaction goes like this:

  1. You enter an R expression with a typo.

  2. R complains about your mistake.

  3. You press the up-arrow key to recall the mistaken line.

  4. You use the left and right arrow keys to move the cursor back to the error.

  5. You use the Delete key to delete the offending characters.

  6. You type the corrected characters, which inserts them into the command line.

  7. You press Enter to reexecute the corrected command.

That’s just the basics. R supports the usual keystrokes for recalling and editing command lines, as listed in Table 1-2.

Table 1-2. Keystrokes for command-line editing

Labeled key    Ctrl-key combination   Effect
Up arrow       Ctrl-P                 Recall previous command by moving backward through the history of commands.
Down arrow     Ctrl-N                 Move forward through the history of commands.
Backspace      Ctrl-H                 Delete the character to the left of cursor.
Delete (Del)   Ctrl-D                 Delete the character to the right of cursor.
Home           Ctrl-A                 Move cursor to the start of the line.
End            Ctrl-E                 Move cursor to the end of the line.
Right arrow    Ctrl-F                 Move cursor right (forward) one character.
Left arrow     Ctrl-B                 Move cursor left (back) one character.
               Ctrl-K                 Delete everything from the cursor position to the end of the line.
               Ctrl-U                 Clear the whole darn line and start over.
Tab                                   Name completion (on some platforms).

On Windows and OS X, you can also use the mouse to highlight commands and then use the usual copy and paste commands to paste text into a new command line.

See Also

See “Typing Less and Accomplishing More”. From the Windows main menu, follow Help → Console for a complete list of keystrokes useful for command-line editing.

Exiting from R Studio

Problem

You want to exit from R Studio.

Solution

Windows

Select File → Quit Session from the main menu; or click on the X in the upper-right corner of the window frame.

OS X

Press CMD-q (apple-q); or click on the red X in the upper-left corner of the window frame.

Linux or Unix

At the command prompt, press Ctrl-D.

On all platforms, you can also use the q function (as in quit) to terminate the program.

q()

Note the empty parentheses, which are necessary to call the function.

Discussion

Whenever you exit, R typically asks if you want to save your workspace. You have three choices:

  • Save your workspace and exit.

  • Don’t save your workspace, but exit anyway.

  • Cancel, returning to the command prompt rather than exiting.

If you save your workspace, then R writes it to a file called .RData in the current working directory. Saving the workspace saves any R objects you have created. The next time you start R in the same directory, the workspace will be loaded automatically. Saving your workspace will overwrite the previously saved workspace, if any, so don’t save if you don’t like the changes to your workspace (e.g., if you have accidentally erased critical data).

We recommend never saving your workspace when you exit, and instead always explicitly saving your projects, scripts, and data. We also recommend that you turn off the prompt to save and the automatic restoring of the workspace in R Studio, using the Global Options found in the menu Tools → Global Options and shown in Figure 1-3. This way, when you exit R and R Studio you will not be prompted to save your workspace. But keep in mind that any objects created but not saved to disk will be lost.
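
If you follow this advice, one simple way to keep specific objects across sessions is to save them to files explicitly and read them back later. Here is a minimal sketch; the object and filename are just illustrations:

my_results <- data.frame(x = 1:3, y = c(2.5, 3.7, 4.1))
saveRDS(my_results, "my_results.rds")    # write one object to a file
my_results <- readRDS("my_results.rds")  # read it back in a later session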

Figure 1-3. Save Workspace options

See Also

See “Getting and Setting the Working Directory” for more about the current working directory and “Saving Your Workspace” for more about saving your workspace. See Chapter 2 of R in a Nutshell (http://oreilly.com/catalog/9780596801717).

Interrupting R

Problem

You want to interrupt a long-running computation and return to the command prompt without exiting R Studio.

Solution

Press the Esc key on your keyboard, or click on the Session menu in R Studio and select “Interrupt R”.

Discussion

Interrupting R means telling R to stop running the current command without deleting variables from memory or completely closing R Studio. However, interrupting R can leave your variables in an indeterminate state, depending upon how far the computation had progressed, so check your workspace after interrupting.

Viewing the Supplied Documentation

Problem

You want to read the documentation supplied with R.

Solution

Use the help.start function to see the documentation’s table of contents:

help.start()

From there, links are available to all the installed documentation. In R Studio the help will show up in the Help pane, which by default is on the righthand side of the screen.

In R Studio you can also click Help → R Help to get a listing of help options for both R and R Studio.

Discussion

The base distribution of R includes a wealth of documentation—literally thousands of pages. When you install additional packages, those packages contain documentation that is also installed on your machine.

It is easy to browse this documentation via the help.start function, which opens on the top-level table of contents. Figure 1-4 shows how help.start() appears inside the Help pane in R Studio.

Figure 1-4. R Studio help.start

The two links in the Base R Reference section are especially useful:

Packages

Click here to see a list of all the installed packages, both in the base packages and the additional, installed packages. Click on a package name to see a list of its functions and datasets.

Search Engine & Keywords

Click here to access a simple search engine, which allows you to search the documentation by keyword or phrase. There is also a list of common keywords, organized by topic; click one to see the associated pages.

The base R documentation shown by typing help.start() is loaded onto your computer when you install R. The R Studio help, which you get by using the menu option Help → R Help, presents a page with links to R Studio’s website, so you will need Internet access to use the R Studio help links.

See Also

The local documentation is copied from the R Project website, which may have updated documents.

Getting Help on a Function

Problem

You want to know more about a function that is installed on your machine.

Solution

Use help to display the documentation for the function:

help(functionname)

Use args for a quick reminder of the function arguments:

args(functionname)

Use example to see examples of using the function:

example(functionname)

Discussion

We present many R functions in this book. Every R function has more bells and whistles than we can possibly describe. If a function catches your interest, we strongly suggest reading the help page for that function. One of its bells or whistles might be very useful to you.

Suppose you want to know more about the mean function. Use the help function like this:

help(mean)

This will open the help page for the mean function in the help pane in R Studio. A shortcut for the help command is to simply type ? followed by the function name:

?mean

Sometimes you just want a quick reminder of the arguments to a function: What are they, and in what order do they occur? Use the args function:

args(mean)
#> function (x, ...)
#> NULL
args(sd)
#> function (x, na.rm = FALSE)
#> NULL

The first line of output from args is a synopsis of the function call. For mean, the synopsis shows one argument, x, which is a vector of numbers. For sd, the synopsis shows the same vector, x, and an optional argument called na.rm. (You can ignore the second line of output, which is often just NULL.) In R Studio you will see the args output as a floating tooltip over your cursor when you type a function name, as shown in Figure 1-5.

Figure 1-5. R Studio tooltip

Most documentation for functions includes example code near the end of the document. A cool feature of R is that you can request that it execute the examples, giving you a little demonstration of the function’s capabilities. The documentation for the mean function, for instance, contains examples, but you don’t need to type them yourself. Just use the example function to watch them run:

example(mean)
#>
#> mean> x <- c(0:10, 50)
#>
#> mean> xm <- mean(x)
#>
#> mean> c(xm, mean(x, trim = 0.10))
#> [1] 8.75 5.50

The user typed example(mean). Everything else was produced by R, which executed the examples from the help page and displayed the results.

See Also

See “Searching the Supplied Documentation” for searching for functions and “Displaying Loaded Packages via the Search Path” for more about the search path.

Searching the Supplied Documentation

Problem

You want to know more about a function that is installed on your machine, but the help function reports that it cannot find documentation for any such function.

Alternatively, you want to search the installed documentation for a keyword.

Solution

Use help.search to search the R documentation on your computer:

help.search("pattern")

A typical pattern is a function name or keyword. Notice that it must be enclosed in quotation marks.

For your convenience, you can also invoke a search by using two question marks (in which case the quotes are not required). Note that searching for a function by name uses one question mark while searching for a text pattern uses two:

??pattern
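
For example (the function and phrase here are just illustrations), one question mark looks up a function by its exact name, while two question marks search the installed documentation:

?sd                      # open the help page for the sd function
??"standard deviation"   # search the installed documentation for a phrase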

Discussion

You may occasionally request help on a function only to be told R knows nothing about it:

help(adf.test)
#> No documentation for 'adf.test' in specified packages and libraries:
#> you could try '??adf.test'

This can be frustrating if you know the function is installed on your machine. Here the problem is that the function’s package is not currently loaded, and you don’t know which package contains the function. It’s a kind of catch-22 (the error message indicates the package is not currently in your search path, so R cannot find the help file; see “Displaying Loaded Packages via the Search Path” for more details).

The solution is to search all your installed packages for the function. Just use the help.search function, as suggested in the error message:

help.search("adf.test")

The search will produce a listing of all packages that contain the function:

Help files with alias or concept or title matching 'adf.test' using
regular expression matching:

tseries::adf.test       Augmented Dickey-Fuller Test

Type '?PKG::FOO' to inspect entry 'PKG::FOO TITLE'.

The output above indicates that the tseries package contains the adf.test function. You can see its documentation by explicitly telling help which package contains the function:

help(adf.test, package = "tseries")

or you can use the double colon operator to tell R to look in a specific package:

?tseries::adf.test

You can broaden your search by using keywords. R will then find any installed documentation that contains the keywords. Suppose you want to find all functions that mention the Augmented Dickey–Fuller (ADF) test. You could search on a likely pattern:

help.search("dickey-fuller")

On our machine, the result looks like this because we’ve installed two additional packages (fUnitRoots and urca) that implement the ADF test:

Help files with alias or concept or title matching 'dickey-fuller' using
fuzzy matching:

fUnitRoots::DickeyFullerPValues
                         Dickey-Fuller p Values
tseries::adf.test        Augmented Dickey-Fuller Test
urca::ur.df              Augmented-Dickey-Fuller Unit Root Test

Type '?PKG::FOO' to inspect entry 'PKG::FOO TITLE'.

See Also

You can also access the local search engine through the documentation browser; see “Viewing the Supplied Documentation” for how this is done. See “Displaying Loaded Packages via the Search Path” for more about the search path and “Listing Files” for getting help on functions.

Getting Help on a Package

Problem

You want to learn more about a package installed on your computer.

Solution

Use the help function and specify a package name (without a function name):

help(package = "packagename")

Discussion

Sometimes you want to know the contents of a package (the functions and datasets). This is especially true after you download and install a new package, for example. The help function can provide the contents plus other information once you specify the package name.

This call to help will display the information for the tseries package (assuming you have it installed):

help(package = "tseries")

The information begins with a description and continues with an index of functions and datasets. In R Studio, the HTML-formatted help page will open in the Help pane of the IDE.

Some packages also include vignettes, which are additional documents such as introductions, tutorials, or reference cards. They are installed on your computer as part of the package documentation when you install the package. The help page for a package includes a list of its vignettes near the bottom.

You can see a list of all vignettes on your computer by using the vignette function:

vignette()

In R Studio this will open a new tab listing every installed package on your computer that includes vignettes, along with the vignette names and descriptions.

You can see the vignettes for a particular package by including its name:

vignette(package = "packagename")

Each vignette has a name, which you use to view the vignette:

vignette("vignettename")

See Also

See “Getting Help on a Function” for getting help on a particular function in a package.

Searching the Web for Help

Problem

You want to search the Web for information and answers regarding R.

Solution

Inside R, use the RSiteSearch function to search by keyword or phrase:

RSiteSearch("key phrase")

Inside your browser, try using these sites for searching:

RSeek: http://rseek.org

This is a Google custom search that is focused on R-specific websites.

Stack Overflow: http://stackoverflow.com/

Stack Overflow is a searchable Q&A site from Stack Exchange oriented toward programming issues such as data structures, coding, and graphics.

Cross Validated: http://stats.stackexchange.com/

Cross Validated is a Stack Exchange site focused on statistics, machine learning, and data analysis rather than programming. Cross Validated is a good place for questions about what statistical method to use.

Discussion

The RSiteSearch function will open a browser window and direct it to the search engine on the R Project website (http://search.r-project.org/). There you will see an initial search that you can refine. For example, this call would start a search for “canonical correlation”:

RSiteSearch("canonical correlation")

This is quite handy for doing quick web searches without leaving R. However, the search scope is limited to R documentation and the mailing-list archives.

The rseek.org site provides a wider search. Its virtue is that it harnesses the power of the Google search engine while focusing on sites relevant to R. That eliminates the extraneous results of a generic Google search. The beauty of rseek.org is that it organizes the results in a useful way.

Figure 1-6 shows the results of visiting rseek.org and searching for “canonical correlation”. The left side of the page shows general results from searching R sites. The right side is a tabbed display that organizes the search results into several categories:

  • Introductions

  • Task Views

  • Support Lists

  • Functions

  • Books

  • Blogs

  • Related Tools

Figure 1-6. RSeek

If you click on the Introductions tab, for example, you’ll find tutorial material. The Task Views tab will show any Task View that mentions your search term. Likewise, clicking on Functions will show links to relevant R functions. This is a good way to zero in on search results.

Stack Overflow (http://stackoverflow.com/) is a Q&A site, which means that anyone can submit a question and experienced users will supply answers—often there are multiple answers to each question. Readers vote on the answers, so good answers tend to rise to the top. This creates a rich database of Q&A dialogs, which you can search. Stack Overflow is strongly problem oriented, and the topics lean toward the programming side of R.

Stack Overflow hosts questions for many programming languages; therefore, when entering a term into their search box, prefix it with [r] to focus the search on questions tagged for R. For example, searching via [r] standard error will select only the questions tagged for R and will avoid the Python and C++ questions.

Stack Overflow also includes a wiki about the R language that is an excellent community-curated list of online R resources: https://stackoverflow.com/tags/r/info

Stack Exchange (parent company of Stack Overflow) has a Q&A area for statistical analysis called Cross Validated: https://stats.stackexchange.com/. This area is more focused on statistics than programming, so use this site when seeking answers that are more concerned with statistics in general and less with R in particular.

See Also

If your search reveals a useful package, use “Installing Packages from CRAN” to install it on your machine.

Finding Relevant Functions and Packages

Problem

Of the 10,000+ packages for R, you have no idea which ones would be useful to you.

Solution

Visit the list of CRAN task views at http://cran.r-project.org/web/views/ and find a task view for your area of interest; it describes and links to relevant packages. Alternatively, search at http://rseek.org and check the Task Views tab of the results (see “Searching the Web for Help”).

Discussion

This problem is especially vexing for beginners. You think R can solve your problems, but you have no idea which packages and functions would be useful. A common question on the mailing lists is: “Is there a package to solve problem X?” That is the silent scream of someone drowning in R.

As of this writing, there are more than 10,000 packages available for free download from CRAN. Each package has a summary page with a short description and links to the package documentation. Once you’ve located a potentially interesting package, you would typically click on the “Reference manual” link to view the PDF documentation with full details. (The summary page also contains download links for installing the package, but you’ll rarely install the package that way; see “Installing Packages from CRAN”.)

Sometimes you simply have a generic interest—such as Bayesian analysis, econometrics, optimization, or graphics. CRAN contains a set of task view pages describing packages that may be useful. A task view is a great place to start since you get an overview of what’s available. You can see the list of task view pages at CRAN Task Views (http://cran.r-project.org/web/views/) or search for them as described in the Solution. Task Views on CRAN list a number of broad fields and show packages that are used in each field. For example, there are Task Views for high performance computing, genetics, time series, and social science, just to name a few.

Suppose you happen to know the name of a useful package—say, by seeing it mentioned online. A complete, alphabetical list of packages is available at CRAN (http://cran.r-project.org/web/packages/) with links to the package summary pages.

See Also

You can download and install an R package called sos that provides other powerful ways to search for packages; see the vignette at SOS (http://cran.r-project.org/web/packages/sos/vignettes/sos.pdf).

Searching the Mailing Lists

Problem

You have a question, and you want to search the archives of the mailing lists to see whether your question was answered previously.

Solution

Use the RSiteSearch function described in “Searching the Web for Help”; its search scope includes the R mailing list archives. CRAN also lists additional resources for searching the lists at CRAN Search (http://cran.r-project.org/search.html).

Discussion

This recipe is really just an application of “Searching the Web for Help”. But it’s an important application because you should search the mailing list archives before submitting a new question to the list. Your question has probably been answered before.

See Also

CRAN has a list of additional resources for searching the Web; see CRAN Search (http://cran.r-project.org/search.html).

Submitting Questions to Stack Overflow or Elsewhere in the Community

Problem

You have a question you can’t find the answer to online. So you want to submit a question to the R community.

Solution

The first step in asking a question online is to create a reproducible example. Having example code that someone can run and see exactly your problem is the most critical part of asking for help online. A question with a good reproducible example has three components:

  1. Example data: This can be simulated data or some real data that you provide.

  2. Example code: This code shows what you have tried or an error you are having.

  3. Written description: This is where you explain what you have, what you’d like to have, and what you have tried that didn’t work.

The details of writing a reproducible example are covered in the Discussion below. Once you have a reproducible example, you can post your question on Stack Overflow via https://stackoverflow.com/questions/ask. Be sure to include the r tag in the Tags section of the ask page.

Or, if your discussion is more general or related to concepts instead of specific syntax, R Studio runs a discussion forum called R Studio Community at https://community.rstudio.com/. Note that the site is broken into multiple topics, so pick the topic category that best fits your question.

Or you may submit your question to the R mailing lists (but don’t submit to multiple sites, such as the mailing lists and Stack Overflow, as that’s considered rude cross-posting):

The Mailing Lists (http://www.r-project.org/mail.html) page contains general information and instructions for using the R-help mailing list. Here is the general process:

  1. Subscribe to the R-help list at the “Main R Mailing List” (https://stat.ethz.ch/mailman/listinfo/r-help).

  2. Write your question carefully and correctly, and include your reproducible example.

  3. Mail your question to r-help@r-project.org.

Discussion

The R mailing list, Stack Overflow, and the R Studio Community site are great resources, but please treat them as a last resort. Read the help pages, read the documentation, search the help list archives, and search the Web. It is most likely that your question has already been answered. Don’t kid yourself: very few questions are unique. If you’ve exhausted all other options, maybe it’s time to create a good question.

The reproducible example is the crux of a good help request. The first step is example data. A good way to get example data is to simulate it using a few R functions. The following example creates a data frame called example_df that has three columns, each of a different data type:

set.seed(42)
n <- 4
example_df <- data.frame(
  some_reals = rnorm(n),
  some_letters = sample(LETTERS, n, replace = TRUE),
  some_ints = sample(1:10, n, replace = TRUE)
)
example_df
#>   some_reals some_letters some_ints
#> 1      1.371            R        10
#> 2     -0.565            S         3
#> 3      0.363            L         5
#> 4      0.633            S        10

Note that this example uses the command set.seed() at the beginning. That ensures that every time this code is run, the results will be the same. The n value is the number of rows of example data you would like to create. Make your example data as simple as possible to illustrate your question.

An alternative to creating simulated data is to use example data that comes with R. For example, the dataset mtcars contains a data frame with 32 records about different car models:

data(mtcars)
head(mtcars)
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

If your example is reproducible only with a bit of your own data, you can use dput() to create a text representation of a small piece of your data that you can paste into your example. We’ll illustrate that approach using two rows from the mtcars data:

dput(head(mtcars, 2))
#> structure(list(mpg = c(21, 21), cyl = c(6, 6), disp = c(160,
#> 160), hp = c(110, 110), drat = c(3.9, 3.9), wt = c(2.62, 2.875
#> ), qsec = c(16.46, 17.02), vs = c(0, 0), am = c(1, 1), gear = c(4,
#> 4), carb = c(4, 4)), row.names = c("Mazda RX4", "Mazda RX4 Wag"
#> ), class = "data.frame")

You can put the resulting structure() directly in your question:

example_df <- structure(list(mpg = c(21, 21), cyl = c(6, 6), disp = c(160,
160), hp = c(110, 110), drat = c(3.9, 3.9), wt = c(2.62, 2.875
), qsec = c(16.46, 17.02), vs = c(0, 0), am = c(1, 1), gear = c(4,
4), carb = c(4, 4)), row.names = c("Mazda RX4", "Mazda RX4 Wag"
), class = "data.frame")

example_df
#>               mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.88 17.0  0  1    4    4

The second part of a good reproducible example is the minimal example code. The code example should be as simple as possible and illustrate what you are trying to do or have already tried. It should not be a big block of code with many different things going on. Boil your example down to only the minimal amount of code needed. If you use any packages, be sure to include the library() call at the beginning of your code. Also, don’t include anything in your question that will harm the state of someone running your code, such as rm(list = ls()), which would delete all R objects in memory. Have empathy for the person trying to help you: they are volunteering their time, and they may run your code on the same machine where they do their own work.
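
Putting the pieces together, the code portion of a question might be shaped like the following minimal sketch (the dplyr package, the column names, and the summary step are all hypothetical; substitute whatever your question is actually about):

library(dplyr)  # declare any packages your example needs up front

# Example data: small and simulated
example_df <- data.frame(
  group = c("a", "a", "b"),
  value = c(1, 2, 3)
)

# Example code: the smallest amount of code that shows what you tried
example_df %>%
  group_by(group) %>%
  summarize(total = sum(value))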

To test your example, open a new R session and try running it. Once you have edited your code, it’s time to give just a bit more information to your potential answerers. In the plain text of the question, describe what you were trying to do, what you’ve tried, and your question. Be as concise as possible. Much as with the example code, your objective is to communicate as efficiently as possible with the person reading your question. You may find it helpful to include in your description which version of R you are running as well as which platform (Windows, Mac, Linux). You can get that information easily with the sessionInfo() command.

If you are going to submit your question to the R mailing lists, you should know there are actually several mailing lists. R-help is the main list for general questions. There are also many special interest group (SIG) mailing lists dedicated to particular domains such as genetics, finance, R development, and even R jobs. You can see the full list at https://stat.ethz.ch/mailman/listinfo. If your question is specific to one such domain, you’ll get a better answer by selecting the appropriate list. As with R-help, however, carefully search the SIG list archives before submitting your question.

See Also

An excellent essay by Eric Raymond and Rick Moen is entitled “How to Ask Questions the Smart Way” (http://www.catb.org/~esr/faqs/smart-questions.html). We suggest that you read it before submitting any question. Seriously. Read it.

Stack Overflow has an excellent question that includes details about producing a reproducible example. You can find it here: https://stackoverflow.com/q/5963269/37751

Jenny Bryan has a great R package called reprex that helps with the creation of a good reproducible example; the package includes helper functions that write the Markdown text for sites like Stack Overflow. You can find the package on her GitHub page: https://github.com/tidyverse/reprex

Chapter 2. Some Basics

Introduction

The recipes in this chapter lie somewhere between problem-solving ideas and tutorials. Yes, they solve common problems, but the Solutions showcase common techniques and idioms used in most R code, including the code in this Cookbook. If you are new to R, we suggest skimming this chapter to acquaint yourself with these idioms.

Printing Something to the Screen

Problem

You want to display the value of a variable or expression.

Solution

If you simply enter the variable name or expression at the command prompt, R will print its value. Use the print function for generic printing of any object. Use the cat function for producing custom formatted output.

Discussion

It’s very easy to ask R to print something: just enter it at the command prompt:

pi
#> [1] 3.14
sqrt(2)
#> [1] 1.41

When you enter expressions like that, R evaluates the expression and then implicitly calls the print function. So the previous example is identical to this:

print(pi)
#> [1] 3.14
print(sqrt(2))
#> [1] 1.41

The beauty of print is that it knows how to format any R value for printing, including structured values such as matrices and lists:

print(matrix(c(1, 2, 3, 4), 2, 2))
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
print(list("a", "b", "c"))
#> [[1]]
#> [1] "a"
#>
#> [[2]]
#> [1] "b"
#>
#> [[3]]
#> [1] "c"

This is useful because you can always view your data: just print it. You need not write special printing logic, even for complicated data structures.

The print function has a significant limitation, however: it prints only one object at a time. Trying to print multiple items gives this mind-numbing error message:

print("The zero occurs at", 2 * pi, "radians.")
#> Error in print.default("The zero occurs at", 2 * pi, "radians."): invalid 'quote' argument

The only way to print multiple items is to print them one at a time, which probably isn’t what you want:

print("The zero occurs at")
#> [1] "The zero occurs at"
print(2 * pi)
#> [1] 6.28
print("radians")
#> [1] "radians"

The cat function is an alternative to print that lets you concatenate multiple items into a continuous output:

cat("The zero occurs at", 2 * pi, "radians.", "\n")
#> The zero occurs at 6.28 radians.

Notice that cat puts a space between each item by default. You must provide a newline character (\n) to terminate the line.

The cat function can print simple vectors, too:

fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
cat("The first few Fibonacci numbers are:", fib, "...\n")
#> The first few Fibonacci numbers are: 0 1 1 2 3 5 8 13 21 34 ...

Using cat gives you more control over your output, which makes it especially useful in R scripts that generate output consumed by others. A serious limitation, however, is that it cannot print compound data structures such as matrices and lists. Trying to cat them only produces another mind-numbing message:

cat(list("a", "b", "c"))
#> Error in cat(list("a", "b", "c")): argument 1 (type 'list') cannot be handled by 'cat'

See Also

See “Printing Fewer Digits (or More Digits)” for controlling output format.

Setting Variables

Problem

You want to save a value in a variable.

Solution

Use the assignment operator (<-). There is no need to declare your variable first:

x <- 3

Discussion

Using R in “calculator mode” gets old pretty fast. Soon you will want to define variables and save values in them. This reduces typing, saves time, and clarifies your work.

There is no need to declare or explicitly create variables in R. Just assign a value to the name and R will create the variable:

x <- 3
y <- 4
z <- sqrt(x^2 + y^2)
print(z)
#> [1] 5

Notice that the assignment operator is formed from a less-than character (<) and a hyphen (-) with no space between them.

When you define a variable at the command prompt like this, the variable is held in your workspace. The workspace is held in the computer’s main memory but can be saved to disk. The variable definition remains in the workspace until you remove it.

R is a dynamically typed language, which means that we can change a variable’s data type at will. We could set x to be numeric, as just shown, and then turn around and immediately overwrite that with (say) a vector of character strings. R will not complain:

x <- 3
print(x)
#> [1] 3

x <- c("fee", "fie", "foe", "fum")
print(x)
#> [1] "fee" "fie" "foe" "fum"

In some R functions you will see assignment statements that use the strange-looking assignment operator <<-:

x <<- 3

That forces the assignment to a global variable rather than a local variable. Scoping is a bit, well, out of scope for this discussion, however.
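
Here is a minimal sketch of the difference (the counter variable and functions are hypothetical). Inside a function, <- creates a local variable, while <<- modifies the variable in the enclosing environment, which here is the global workspace:

count <- 0
bump_local <- function() {
  count <- count + 1   # creates a local copy; the global count is untouched
}
bump_global <- function() {
  count <<- count + 1  # modifies the global count
}
bump_local()
count
#> [1] 0
bump_global()
count
#> [1] 1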

In the spirit of full disclosure, we will reveal that R also supports two other forms of assignment statements. A single equal sign (=) can be used as an assignment operator. A rightward assignment operator (->) can be used anywhere the leftward assignment operator (<-) can be used (but with the arguments reversed):

foo <- 3
print(foo)
#> [1] 3
5 -> fum
print(fum)
#> [1] 5

We recommend that you avoid these as well. The equals-sign assignment is easily confused with the test for equality. The rightward assignment can be useful in certain contexts, but it can be confusing to those not used to seeing it.

See Also

See Recipes , , and . See also the help page for the assign function.

Creating a Pipeline of Function Calls

Problem

You’re getting tired of creating temporary, intermediate variables when doing analysis. The alternative, nesting R functions, seems nearly unreadable.

Solution

You can use the pipe operator (%>%) to make your data flow easier to read and understand. It passes data from one step to another function without having to name an intermediate variable.

library(tidyverse)

mpg %>%
  head %>%
  print
#> # A tibble: 6 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#> 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#> 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#> 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#> 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#> 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

It is identical to

print(head(mpg))
#> # A tibble: 6 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#> 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#> 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#> 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#> 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#> 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

Both code fragments start with the mpg dataset, select the head of the dataset, and print it.

Discussion

The pipe operator (%>%), created by Stefan Bache and found in the magrittr package, is used extensively in the tidyverse and works analogously to the Unix pipe operator (|). It doesn’t provide any new functionality to R, but it can greatly improve the readability of code.

The pipe operator takes the value on the left side of the operator and passes it as the first argument of the function on the right. These two lines of code are identical:

x %>% head

head(x)

For example, the Solution code

mpg %>%
  head %>%
  print

has the same effect as this code, which uses an intermediate variable:

x <- head(mpg)
print(x)

This approach is fairly readable but creates intermediate data frames and requires the reader to keep track of them, putting a cognitive load on the reader.

The following code also has the same effect as the Solution by using nested function calls:

print(head(mpg))

While this is very concise, since it’s only one line, it requires much more attention to read and understand what’s going on. Code that is difficult to parse mentally can introduce potential for error and also makes future maintenance of the code harder.
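
The difference becomes more pronounced as the number of steps grows. Here is a quick illustration using base R functions and the pipe loaded in the Solution (the calculation itself is just a toy example). The nested form must be read from the inside out, while the piped form reads left to right:

# Nested: read from the innermost call outward
round(sqrt(sum(1:10)), 1)
#> [1] 7.4

# Piped: read left to right, one step at a time
1:10 %>%
  sum() %>%
  sqrt() %>%
  round(1)
#> [1] 7.4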

The function on the righthand side of the %>% can include additional arguments, and they will be included after the piped-in value. These two lines of code are identical, for example:

iris %>% head(10)

head(iris, 10)

Sometimes you don’t want the piped value to be the first argument. In those cases, use the dot expression (.) to indicate the desired position. These two lines of code, for example, are identical:

10 %>% head(x, .)

head(x, 10)

This is handy for functions where the first argument is not the principal input.
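
Modeling functions are a common case: lm, for example, takes a formula as its first argument and the data as a later argument, so the dot is needed to pipe a data frame into it (this sketch assumes the tidyverse is loaded, as in the Solution):

mpg %>%
  lm(hwy ~ displ, data = .)   # same as lm(hwy ~ displ, data = mpg)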

Listing Variables

Problem

You want to know what variables and functions are defined in your workspace.

Solution

Use the ls function. Use ls.str for more details about each variable.

Discussion

The ls function displays the names of objects in your workspace:

x <- 10
y <- 50
z <- c("three", "blind", "mice")
f <- function(n, p) sqrt(p * (1 - p) / n)
ls()
#> [1] "f" "x" "y" "z"

Notice that ls returns a vector of character strings in which each string is the name of one variable or function. When your workspace is empty, ls returns an empty vector, which produces this puzzling output:

ls()
#> character(0)

That is R’s quaint way of saying that ls returned a zero-length vector of strings; that is, it returned an empty vector because nothing is defined in your workspace.

If you want more than just a list of names, try ls.str; this will also tell you something about each variable:

x <- 10
y <- 50
z <- c("three", "blind", "mice")
f <- function(n, p) sqrt(p * (1 - p) / n)
ls.str()
#> f : function (n, p)
#> x :  num 10
#> y :  num 50
#> z :  chr [1:3] "three" "blind" "mice"

The function is called ls.str because it is both listing your variables and applying the str function to them, showing their structure (see “Revealing the Structure of an Object”).

Ordinarily, ls does not return any name that begins with a dot (.). Such names are considered hidden and are not normally of interest to users. (This mirrors the Unix convention of not listing files whose names begin with dot.) You can force ls to list everything by setting the all.names argument to TRUE:

ls()
#> [1] "f" "x" "y" "z"
ls(all.names = TRUE)
#> [1] ".Random.seed" "f"            "x"            "y"
#> [5] "z"

See Also

See “Deleting Variables” for deleting variables and Recipe X-X for inspecting your variables.

Deleting Variables

Problem

You want to remove unneeded variables or functions from your workspace or to erase its contents completely.

Solution

Use the rm function.

Discussion

Your workspace can get cluttered quickly. The rm function removes, permanently, one or more objects from the workspace:

x <- 2 * pi
x
#> [1] 6.28
rm(x)
x
#> Error in eval(expr, envir, enclos): object 'x' not found

There is no “undo”; once the variable is gone, it’s gone.

You can remove several variables at once:

rm(x, y, z)

You can even erase your entire workspace at once. The rm function has a list argument consisting of a vector of names of variables to remove. Recall that the ls function returns a vector of variable names; hence you can combine rm and ls to erase everything:

ls()
#> [1] "f" "x" "y" "z"
rm(list = ls())
ls()
#> character(0)

Alternatively, you could click the broom icon at the top of the Environment pane in R Studio, shown in Figure 2-1.

Figure 2-1. Environment pane in R Studio

Never put rm(list=ls()) into code you share with others, such as a library function or sample code sent to a mailing list or Stack Overflow. Deleting all the variables in someone else’s workspace is worse than rude and will make you extremely unpopular.

See Also

See “Listing Variables”.

Creating a Vector

Problem

You want to create a vector.

Solution

Use the c(...) operator to construct a vector from given values.

Discussion

Vectors are a central component of R, not just another data structure. A vector can contain either numbers, strings, or logical values but not a mixture.

The c(...) operator can construct a vector from simple elements:

c(1, 1, 2, 3, 5, 8, 13, 21)
#> [1]  1  1  2  3  5  8 13 21
c(1 * pi, 2 * pi, 3 * pi, 4 * pi)
#> [1]  3.14  6.28  9.42 12.57
c("My", "twitter", "handle", "is", "@cmastication")
#> [1] "My"            "twitter"       "handle"        "is"
#> [5] "@cmastication"
c(TRUE, TRUE, FALSE, TRUE)
#> [1]  TRUE  TRUE FALSE  TRUE

If the arguments to c(...) are themselves vectors, it flattens them and combines them into one single vector:

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
c(v1, v2)
#> [1] 1 2 3 4 5 6

Vectors cannot contain a mix of data types, such as numbers and strings. If you create a vector from mixed elements, R will try to accommodate you by converting one of them:

v1 <- c(1, 2, 3)
v3 <- c("A", "B", "C")
c(v1, v3)
#> [1] "1" "2" "3" "A" "B" "C"

Here, the user tried to create a vector from both numbers and strings. R converted all the numbers to strings before creating the vector, thereby making the data elements compatible. Note that R does this without warning or complaint.

Technically speaking, two data elements can coexist in a vector only if they have the same mode. The modes of 3.1415 and "foo" are numeric and character, respectively:

mode(3.1415)
#> [1] "numeric"
mode("foo")
#> [1] "character"

Those modes are incompatible. To make a vector from them, R converts 3.1415 to character mode so it will be compatible with "foo":

c(3.1415, "foo")
#> [1] "3.1415" "foo"
mode(c(3.1415, "foo"))
#> [1] "character"

Warning

c is a generic operator, which means that it works with many datatypes and not just vectors. However, it might not do exactly what you expect, so check its behavior before applying it to other datatypes and objects.
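
For a quick illustration of that warning, consider applying c to two lists: the result is a longer list, not a simple vector.

c(list(1, 2), list(3))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3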

See Also

See the “Introduction” to Chapter 5 for more about vectors and other data structures.

Computing Basic Statistics

Problem

You want to calculate basic statistics: mean, median, standard deviation, variance, correlation, or covariance.

Solution

Use whichever of these functions applies, assuming that x and y are vectors:

  • mean(x)

  • median(x)

  • sd(x)

  • var(x)

  • cor(x, y)

  • cov(x, y)

Discussion

When you first use R, you might open the documentation and begin searching for material entitled “Procedures for Calculating Standard Deviation.” It seems that such an important topic would likely require a whole chapter.

It’s not that complicated.

Standard deviation and other basic statistics are calculated by simple functions. Ordinarily, the function argument is a vector of numbers and the function returns the calculated statistic:

x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
mean(x)
#> [1] 8.8
median(x)
#> [1] 4
sd(x)
#> [1] 11
var(x)
#> [1] 122

The sd function calculates the sample standard deviation, and var calculates the sample variance.

The cor and cov functions can calculate the correlation and covariance, respectively, between two vectors:

x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)
cor(x, y)
#> [1] 0.907
cov(x, y)
#> [1] 11.5

All these functions are picky about values that are not available (NA). Even one NA value in the vector argument causes any of these functions to return NA or even halt altogether with a cryptic error:

x <- c(0, 1, 1, 2, 3, NA)
mean(x)
#> [1] NA
sd(x)
#> [1] NA

It’s annoying when R is that cautious, but it is the right thing to do. You must think carefully about your situation. Does an NA in your data invalidate the statistic? If yes, then R is doing the right thing. If not, you can override this behavior by setting na.rm=TRUE, which tells R to ignore the NA values:

x <- c(0, 1, 1, 2, 3, NA)
sd(x, na.rm = TRUE)
#> [1] 1.14

In older versions of R, mean and sd were smart about data frames. They understood that each column of the data frame is a different variable, so they calculated their statistic for each column individually. This is no longer the case and, as a result, you may read confusing comments online or in older books (like the first edition of this book). To apply the functions to each column of a data frame, we now need to use a helper function. The tidyverse family of helper functions for this sort of thing is in the purrr package. As with other tidyverse packages, it gets loaded when you run library(tidyverse). The function we’ll use to apply another function to each column of a data frame is map_dbl:

data(cars)

map_dbl(cars, mean)
#> speed  dist
#>  15.4  43.0
map_dbl(cars, sd)
#> speed  dist
#>  5.29 25.77
map_dbl(cars, median)
#> speed  dist
#>    15    36

Notice that using map_dbl to apply mean or sd returns two values, one for each column of the data frame. (Technically, it returns a two-element vector whose names attribute is taken from the columns of the data frame.)
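
Because the result is a named vector, you can pull out the statistic for a single column by name; a quick illustration:

col_means <- map_dbl(cars, mean)
col_means["dist"]
#> dist
#>   43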

The var function understands data frames without the help of a mapping function. It calculates the covariance between the columns of the data frame and returns the covariance matrix:

var(cars)
#>       speed dist
#> speed    28  110
#> dist    110  664

Likewise, if x is either a data frame or a matrix, then cor(x) returns the correlation matrix and cov(x) returns the covariance matrix:

cor(cars)
#>       speed  dist
#> speed 1.000 0.807
#> dist  0.807 1.000
cov(cars)
#>       speed dist
#> speed    28  110
#> dist    110  664

Creating Sequences

Problem

You want to create a sequence of numbers.

Solution

Use an n:m expression to create the simple sequence n, n+1, n+2, …, m:

1:5
#> [1] 1 2 3 4 5

Use the seq function for sequences with an increment other than 1:

seq(from = 1, to = 5, by = 2)
#> [1] 1 3 5

Use the rep function to create a series of repeated values:

rep(1, times = 5)
#> [1] 1 1 1 1 1

Discussion

The colon operator (n:m) creates a vector containing the sequence n, n+1, n+2, …, m:

0:9
#>  [1] 0 1 2 3 4 5 6 7 8 9
10:19
#>  [1] 10 11 12 13 14 15 16 17 18 19
9:0
#>  [1] 9 8 7 6 5 4 3 2 1 0

Observe that R was clever with the last expression (9:0). Because 9 is larger than 0, it counts backward from the starting value to the ending value. You can also use the colon operator directly with the pipe to pass data to another function:

10:20 %>% mean()

The colon operator works for sequences that grow by 1 only. The seq function also builds sequences but supports an optional third argument, which is the increment:

seq(from = 0, to = 20)
#>  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
seq(from = 0, to = 20, by = 2)
#>  [1]  0  2  4  6  8 10 12 14 16 18 20
seq(from = 0, to = 20, by = 5)
#> [1]  0  5 10 15 20

Alternatively, you can specify a length for the output sequence and then R will calculate the necessary increment:

seq(from = 0, to = 20, length.out = 5)
#> [1]  0  5 10 15 20
seq(from = 0, to = 100, length.out = 5)
#> [1]   0  25  50  75 100

The increment need not be an integer. R can create sequences with fractional increments, too:

seq(from = 1.0, to = 2.0, length.out = 5)
#> [1] 1.00 1.25 1.50 1.75 2.00

For the special case of a “sequence” that is simply a repeated value you should use the rep function, which repeats its first argument:

rep(pi, times = 5)
#> [1] 3.14 3.14 3.14 3.14 3.14

See Also

See “Creating a Sequence of Dates” for creating a sequence of Date objects.

Comparing Vectors

Problem

You want to compare two vectors or you want to compare an entire vector against a scalar.

Solution

The comparison operators (==, !=, <, >, <=, >=) can perform an element-by-element comparison of two vectors. They can also compare a vector’s element against a scalar. The result is a vector of logical values in which each value is the result of one element-wise comparison.

Discussion

R has two logical values, TRUE and FALSE. These are often called Boolean values in other programming languages.

The comparison operators compare two values and return TRUE or FALSE, depending upon the result of the comparison:

a <- 3
a == pi # Test for equality
#> [1] FALSE
a != pi # Test for inequality
#> [1] TRUE
a < pi
#> [1] TRUE
a > pi
#> [1] FALSE
a <= pi
#> [1] TRUE
a >= pi
#> [1] FALSE

You can experience the power of R by comparing entire vectors at once. R will perform an element-by-element comparison and return a vector of logical values, one for each comparison:

v <- c(3, pi, 4)
w <- c(pi, pi, pi)
v == w # Compare two 3-element vectors
#> [1] FALSE  TRUE FALSE
v != w
#> [1]  TRUE FALSE  TRUE
v < w
#> [1]  TRUE FALSE FALSE
v <= w
#> [1]  TRUE  TRUE FALSE
v > w
#> [1] FALSE FALSE  TRUE
v >= w
#> [1] FALSE  TRUE  TRUE

You can also compare a vector against a single scalar, in which case R will expand the scalar to the vector’s length and then perform the element-wise comparison. The previous example can be simplified in this way:

v <- c(3, pi, 4)
v == pi # Compare a 3-element vector against one number
#> [1] FALSE  TRUE FALSE
v != pi
#> [1]  TRUE FALSE  TRUE

(This is an application of the Recycling Rule, “Understanding the Recycling Rule”.)

After comparing two vectors, you often want to know whether any of the comparisons were true or whether all the comparisons were true. The any and all functions handle those tests. They both test a logical vector. The any function returns TRUE if any element of the vector is TRUE. The all function returns TRUE if all elements of the vector are TRUE:

v <- c(3, pi, 4)
any(v == pi) # Return TRUE if any element of v equals pi
#> [1] TRUE
all(v == 0) # Return TRUE if all elements of v are zero
#> [1] FALSE

Selecting Vector Elements

Problem

You want to extract one or more elements from a vector.

Solution

Select the indexing technique appropriate for your problem:

  • Use square brackets to select vector elements by their position, such as v[3] for the third element of v.

  • Use negative indexes to exclude elements.

  • Use a vector of indexes to select multiple values.

  • Use a logical vector to select elements based on a condition.

  • Use names to access named elements.

Discussion

Selecting elements from vectors is another powerful feature of R. Basic selection is handled just as in many other programming languages—use square brackets and a simple index:

fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
fib
#>  [1]  0  1  1  2  3  5  8 13 21 34
fib[1]
#> [1] 0
fib[2]
#> [1] 1
fib[3]
#> [1] 1
fib[4]
#> [1] 2
fib[5]
#> [1] 3

Notice that the first element has an index of 1, not 0 as in some other programming languages.

A cool feature of vector indexing is that you can select multiple elements at once. The index itself can be a vector, and each element of that indexing vector selects an element from the data vector:

fib[1:3] # Select elements 1 through 3
#> [1] 0 1 1
fib[4:9] # Select elements 4 through 9
#> [1]  2  3  5  8 13 21

An index of 1:3 means select elements 1, 2, and 3, as just shown. The indexing vector needn’t be a simple sequence, however. You can select elements anywhere within the data vector—as in this example, which selects elements 1, 2, 4, and 8:

fib[c(1, 2, 4, 8)]
#> [1]  0  1  2 13

R interprets negative indexes to mean exclude a value. An index of −1, for instance, means exclude the first value and return all other values:

fib[-1] # Ignore first element
#> [1]  1  1  2  3  5  8 13 21 34

This method can be extended to exclude whole slices by using an indexing vector of negative indexes:

fib[1:3] # As before
#> [1] 0 1 1
fib[-(1:3)] # Invert sign of index to exclude instead of select
#> [1]  2  3  5  8 13 21 34

Another indexing technique uses a logical vector to select elements from the data vector. Everywhere that the logical vector is TRUE, an element is selected:

fib < 10 # This vector is TRUE wherever fib is less than 10
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
fib[fib < 10] # Use that vector to select elements less than 10
#> [1] 0 1 1 2 3 5 8
fib %% 2 == 0 # This vector is TRUE wherever fib is even
#>  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
fib[fib %% 2 == 0] # Use that vector to select the even elements
#> [1]  0  2  8 34

Ordinarily, the logical vector should be the same length as the data vector so you are clearly either including or excluding each element. (If the lengths differ then you need to understand the Recycling Rule, “Understanding the Recycling Rule”.)

By combining vector comparisons, logical operators, and vector indexing, you can perform powerful selections with very little R code:

Select all elements greater than the median

v <- c(3, 6, 1, 9, 11, 16, 0, 3, 1, 45, 2, 8, 9, 6, -4)
v[ v > median(v)]
#> [1]  9 11 16 45  8  9

Select all elements in the lower and upper 5%

v[ (v < quantile(v, 0.05)) | (v > quantile(v, 0.95)) ]
#> [1] 45 -4

The example above uses the | operator, which means “or” when indexing. If you want “and”, use the & operator instead.
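
For example, continuing with the same v, we can keep only the values strictly between 5 and 10:

v[(v > 5) & (v < 10)]
#> [1] 6 9 8 9 6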

Select all elements that exceed ±1 standard deviations from the mean

v[ abs(v - mean(v)) > sd(v)]
#> [1] 45 -4

Select all elements that are neither NA nor NULL

v <- c(1, 2, 3, NA, 5)
v[!is.na(v) & !is.null(v)]
#> [1] 1 2 3 5

One final indexing feature lets you select elements by name. It assumes that the vector has a names attribute, defining a name for each element. This can be done by assigning a vector of character strings to the attribute:

years <- c(1960, 1964, 1976, 1994)
names(years) <- c("Kennedy", "Johnson", "Carter", "Clinton")
years
#> Kennedy Johnson  Carter Clinton
#>    1960    1964    1976    1994

Once the names are defined, you can refer to individual elements by name:

years["Carter"]
#> Carter
#>   1976
years["Clinton"]
#> Clinton
#>    1994

This generalizes to allow indexing by vectors of names: R returns every element named in the index:

years[c("Carter", "Clinton")]
#>  Carter Clinton
#>    1976    1994

See Also

See “Understanding the Recycling Rule” for more about the Recycling Rule.

Performing Vector Arithmetic

Problem

You want to operate on an entire vector at once.

Solution

The usual arithmetic operators can perform element-wise operations on entire vectors. Many functions operate on entire vectors, too, and return a vector result.

Discussion

Vector operations are one of R’s great strengths. All the basic arithmetic operators can be applied to pairs of vectors. They operate in an element-wise manner; that is, the operator is applied to corresponding elements from both vectors:

v <- c(11, 12, 13, 14, 15)
w <- c(1, 2, 3, 4, 5)
v + w
#> [1] 12 14 16 18 20
v - w
#> [1] 10 10 10 10 10
v * w
#> [1] 11 24 39 56 75
v / w
#> [1] 11.00  6.00  4.33  3.50  3.00
w^v
#> [1] 1.00e+00 4.10e+03 1.59e+06 2.68e+08 3.05e+10

Observe that the length of the result here is equal to the length of the original vectors. The reason is that each element comes from a pair of corresponding values in the input vectors.

If one operand is a vector and the other is a scalar, then the operation is performed between every vector element and the scalar:

w
#> [1] 1 2 3 4 5
w + 2
#> [1] 3 4 5 6 7
w - 2
#> [1] -1  0  1  2  3
w * 2
#> [1]  2  4  6  8 10
w / 2
#> [1] 0.5 1.0 1.5 2.0 2.5
2^w
#> [1]  2  4  8 16 32

For example, you can recenter an entire vector in one expression simply by subtracting the mean of its contents:

w
#> [1] 1 2 3 4 5
mean(w)
#> [1] 3
w - mean(w)
#> [1] -2 -1  0  1  2

Likewise, you can calculate the z-score of a vector in one expression: subtract the mean and divide by the standard deviation:

w
#> [1] 1 2 3 4 5
sd(w)
#> [1] 1.58
(w - mean(w)) / sd(w)
#> [1] -1.265 -0.632  0.000  0.632  1.265

Yet the implementation of vector-level operations goes far beyond elementary arithmetic. It pervades the language, and many functions operate on entire vectors. The functions sqrt and log, for example, apply themselves to every element of a vector and return a vector of results:

w <- 1:5
w
#> [1] 1 2 3 4 5
sqrt(w)
#> [1] 1.00 1.41 1.73 2.00 2.24
log(w)
#> [1] 0.000 0.693 1.099 1.386 1.609
sin(w)
#> [1]  0.841  0.909  0.141 -0.757 -0.959

There are two great advantages to vector operations. The first and most obvious is convenience. Operations that require looping in other languages are one-liners in R. The second is speed. Most vectorized operations are implemented directly in C code, so they are substantially faster than the equivalent R code you could write.
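
If you are curious, a quick and unscientific way to see the speed difference on your own machine is to time an explicit loop against the equivalent vectorized call (exact timings will vary):

x <- rnorm(1000000)
system.time({
  total <- 0
  for (xi in x) total <- total + xi    # explicit loop, written in pure R
})
system.time(sum(x))                    # vectorized sum, implemented in C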

See Also

Performing an operation between a vector and a scalar is actually a special case of the Recycling Rule; see “Understanding the Recycling Rule”.

Getting Operator Precedence Right

Problem

Your R expression is producing a curious result, and you wonder if operator precedence is causing problems.

Solution

The full list of operators is shown in Table 2-1, listed in order of precedence from highest to lowest. Operators of equal precedence are evaluated from left to right except where indicated.

Table 2-1. Operator precedence

Operator                Meaning                                          See also
[ [[                    Indexing                                         “Selecting Vector Elements”
:: :::                  Access variables in a namespace (environment)
$ @                     Component extraction, slot extraction
^                       Exponentiation (right to left)
- +                     Unary minus and plus
:                       Sequence creation                                “Creating Sequences”
%any% (including %>%)   Special operators                                Discussion
* /                     Multiplication, division                         Discussion
+ -                     Addition, subtraction
== != < > <= >=         Comparison                                       “Comparing Vectors”
!                       Logical negation
& &&                    Logical “and”, short-circuit “and”
| ||                    Logical “or”, short-circuit “or”
~                       Formula                                          “Performing Simple Linear Regression”
-> ->>                  Rightward assignment                             “Setting Variables”
=                       Assignment (right to left)                       “Setting Variables”
<- <<-                  Assignment (right to left)                       “Setting Variables”
?                       Help                                             “Getting Help on a Function”

It’s not important that you know what every one of these operators does or what it means. The list is here simply to expose you to the idea that different operators have different precedences.

Discussion

Getting your operator precedence wrong in R is a common problem. It certainly happens to the authors a lot. We unthinkingly expect that the expression 0:n−1 will create a sequence of integers from 0 to n − 1 but it does not:

n <- 10
0:n - 1
#>  [1] -1  0  1  2  3  4  5  6  7  8  9

It creates the sequence from −1 to n − 1 because R interprets it as (0:n)−1.

You might not recognize the notation %any% in the table. R interprets any text between two percent signs (%...%) as a binary operator. Several such operators have predefined meanings:

%%

Modulo operator

%/%

Integer division

%*%

Matrix multiplication

%in%

Returns TRUE if the left operand occurs in its right operand; FALSE otherwise

%>%

Pipe that passes results from the left to a function on the right
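
A few quick examples of the predefined operators:

7 %% 3              # remainder of 7 divided by 3
#> [1] 1
7 %/% 3             # integer division
#> [1] 2
3 %in% c(1, 2, 3)   # is 3 among the values on the right?
#> [1] TRUE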

You can also define new binary operators using the %...% notation; see “Defining Your Own Binary Operators”. The point here is that all such operators have the same precedence.
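
As a small sketch of what that looks like (the operator name %+-% is our own invention for illustration):

`%+-%` <- function(x, margin) c(x - margin, x + margin)
100 %+-% 10
#> [1]  90 110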

See Also

See “Performing Vector Arithmetic” for more about vector operations, “Performing Matrix Operations” for more about matrix operations, and Recipe X-X to define your own operators. See the Arithmetic and Syntax topics in the R help pages as well as Chapters 5 and 6 of R in a Nutshell (O’Reilly).

Typing Less and Accomplishing More

Problem

You are getting tired of typing long sequences of commands and especially tired of typing the same ones over and over.

Solution

Open an editor window and accumulate your reusable blocks of R commands there. Then, execute those blocks directly from that window. Reserve the command line for typing brief or one-off commands.

When you are done, you can save the accumulated code blocks in a script file for later use.

Discussion

The typical beginner to R types an expression in the console window and sees what happens. As he gets more comfortable, he types increasingly complicated expressions. Then he begins typing multiline expressions. Soon, he is typing the same multiline expressions over and over, perhaps with small variations, in order to perform his increasingly complicated calculations.

The experienced user does not often retype a complex expression. She may type the same expression once or twice, but when she realizes it is useful and reusable she will cut-and-paste it into an editor window. To execute the snippet thereafter, she selects the snippet in the editor window and tells R to execute it, rather than retyping it. This technique is especially powerful as her snippets evolve into long blocks of code.

In R Studio, a few features of the IDE facilitate this style of working. Windows and Linux machines have slightly different keys than Mac machines: Windows/Linux uses the Ctrl and Alt modifiers, whereas the Mac uses Cmd and Opt.

To open an editor window

From the main menu, select File → New File then select the type of file you want to create, in this case, an R Script.

To execute one line of the editor window

Position the cursor on the line and then press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to execute it.

To execute several lines of the editor window

Highlight the lines using your mouse; then press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to execute them.

To execute the entire contents of the editor window

Press Ctrl+Alt+R (Windows) or Cmd+Opt+R (Mac) to execute the whole editor window. Or, from the menu, click Code → Run Region → Run All.

These keyboard shortcuts and dozens more can be found within R Studio by clicking the menu: Tools → Keyboard Shortcuts Help.

Copying lines from the console window to the editor window is simply a matter of copy and paste. When you exit R Studio, it will ask if you want to save the new script. You can either save it for future reuse or discard it.

Creating a Pipeline of Function Calls

Problem

Creating many intermediate variables in your code is tedious and overly verbose, while nesting R functions seems nearly unreadable.

Solution

Use the pipe operator (%>%) to make your expressions easier to read and write. The pipe operator was created by Stefan Bache and is found in the magrittr package; it is also used extensively in many tidyverse functions.

Use the pipe operator to combine multiple functions together into a “pipeline” of functions without intermediate variables:

library(tidyverse)
data(mpg)

mpg %>%
  filter(cty > 21) %>%
  head(3) %>%
  print()
#> # A tibble: 3 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 chevrolet    mali~   2.4  2008     4 auto~ f        22    30 r     mids~
#> 2 honda        civic   1.6  1999     4 manu~ f        28    33 r     subc~
#> 3 honda        civic   1.6  1999     4 auto~ f        24    32 r     subc~

The pipe is much cleaner and easier to read than using intermediate temporary variables:

temp1 <- filter(mpg, cty > 21)
temp2 <- head(temp1, 3)
print(temp2)
#> # A tibble: 3 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 chevrolet    mali~   2.4  2008     4 auto~ f        22    30 r     mids~
#> 2 honda        civic   1.6  1999     4 manu~ f        28    33 r     subc~
#> 3 honda        civic   1.6  1999     4 auto~ f        24    32 r     subc~

Discussion

The pipe operator does not provide any new functionality to R, but it can greatly improve readability of code. The pipe operator takes the output of the function or object on the left of the operator and passes it as the first argument of the function on the right.

Writing this:

x %>% head()

is functionally the same as writing this:

head(x)

In both cases x is the argument to head. We can supply additional arguments, but x is always the first argument. These two lines are functionally identical:

x %>% head(n = 10)

head(x, n = 10)

This difference may seem small, but with a more complicated example the benefits begin to accumulate. If we had a workflow where we wanted to use filter to limit our data to certain rows, then select to keep only certain variables, followed by ggplot to create a simple plot, we could use intermediate variables:

library(tidyverse)

filtered_mpg <- filter(mpg, cty > 21)
selected_mpg <- select(filtered_mpg, cty, hwy)
ggplot(selected_mpg, aes(cty, hwy)) + geom_point()

This incremental approach is fairly readable, but it creates a number of intermediate data frames and requires the user to keep track of the state of many objects, which adds cognitive load.

Another alternative is to nest the functions together:

ggplot(select(filter(mpg, cty > 21), cty, hwy), aes(cty, hwy)) + geom_point()

While this is very concise since it’s only one line, this code requires much more attention to read and understand what’s going on. Code that is difficult to parse mentally introduces potential for error and makes future maintenance harder. The pipe gives us a third option, which reads top to bottom in the order the operations happen:

mpg %>%
  filter(cty > 21) %>%
  select(cty, hwy) %>%
  ggplot(aes(cty, hwy)) + geom_point()
Figure 2-2. Plotting with pipes example

The above code starts with the mpg dataset and pipes it to the filter function, which keeps only records where the city mpg (cty) is greater than 21. Those results are piped into the select command, which keeps only the listed variables cty and hwy, and those are piped into the ggplot command, which produces the point plot shown in Figure 2-2.

If you want the argument going into your target (right hand side) function to be somewhere other than the first argument, use the dot (.) operator:

iris %>% head(3)

is the same as:

iris %>% head(3, x = .)

In the second example, however, we passed the iris data frame to the named argument x using the dot operator. This can be handy for functions where the input data frame goes in a position other than the first argument.

Throughout this book we use pipes to hold together data transformations with multiple steps. We typically format the code with a line break after each pipe and indent the code on the following lines. This makes the code easily identifiable as parts of the same data pipeline.

Avoiding Some Common Mistakes

Problem

You want to avoid some of the common mistakes made by beginning users—and also by experienced users, for that matter.

Discussion

Here are some easy ways to make trouble for yourself:

Forgetting the parentheses after a function invocation:
You call an R function by putting parentheses after the name. For instance, this line invokes the ls function:

ls()

However, if you omit the parentheses then R does not execute the function. Instead, it shows the function definition, which is almost never what you want:

ls

#> function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
#>     pattern, sorted = TRUE)
#> {
#>     if (!missing(name)) {
#>         pos <- tryCatch(name, error = function(e) e)
#>         if (inherits(pos, "error")) {
#>             name <- substitute(name)
#>             if (!is.character(name))
#>                 name <- deparse(name)
#> etc...

Forgetting to double up backslashes in Windows file paths
This function call appears to read a Windows file called F:\research\bio\assay.csv, but it does not:

tbl <- read.csv("F:\research\bio\assay.csv")

Backslashes (\) inside character strings have a special meaning and therefore need to be doubled up. Here R interprets \r, \b, and \a as escape sequences (carriage return, backspace, and bell, respectively), so the resulting file name is mangled and is certainly not what the user wanted. See “Dealing with “Cannot Open File” in Windows” for possible solutions.

Mistyping “<-” as “< (blank) -”
The assignment operator is <-, with no space between the < and the -:

x <- pi # Set x to 3.1415926...

If you accidentally insert a space between < and -, the meaning changes completely:

x < -pi # Oops! We are comparing x instead of setting it!
#> [1] FALSE

This is now a comparison (<) between x and negative π (-pi). It does not change x. If you are lucky, x is undefined and R will complain, alerting you that something is fishy:

x < -pi
#> Error in eval(expr, envir, enclos): object 'x' not found

If x is defined, R will perform the comparison and print a logical value, TRUE or FALSE. That should alert you that something is wrong: an assignment does not normally print anything:

x <- 0 # Initialize x to zero
x < -pi # Oops!
#> [1] FALSE

Incorrectly continuing an expression across lines
R reads your typing until you finish a complete expression, no matter how many lines of input that requires. It prompts you for additional input using the + prompt until it is satisfied. This example splits an expression across two lines:

total <- 1 + 2 + 3 + # Continued on the next line
  4 + 5
print(total)
#> [1] 15

Problems begin when you accidentally finish the expression prematurely, which can easily happen:

total <- 1 + 2 + 3 # Oops! R sees a complete expression
+4 + 5 # This is a new expression; R prints its value
#> [1] 9
print(total)
#> [1] 6

There are two clues that something is amiss: R prompted you with a normal prompt (>), not the continuation prompt (+); and it printed the value of 4 + 5.

This common mistake is a headache for the casual user. It is a nightmare for programmers, however, because it can introduce hard-to-find bugs into R scripts.

Using = instead of ==
Use the double-equal operator (==) for comparisons. If you accidentally use the single-equal operator (=), you will irreversibly overwrite your variable:

v <- 1 # Assign 1 to v
v == 0 # Compare v against zero
#> [1] FALSE
v = 0 # Oops! Assigns 0 to v, overwriting previous contents

Writing 1:n+1 when you mean 1:(n+1)
You might think that 1:n+1 is the sequence of numbers 1, 2, …, n, n + 1. It’s not. It is the sequence 1, 2, …, n with 1 added to every element, giving 2, 3, …, n, n + 1. This happens because R interprets 1:n+1 as (1:n)+1. Use parentheses to get exactly what you want:

n <- 5
1:n + 1
#> [1] 2 3 4 5 6
1:(n + 1)
#> [1] 1 2 3 4 5 6

Getting bitten by the Recycling Rule
Vector arithmetic and vector comparisons work well when both vectors have the same length. However, the results can be baffling when the operands are vectors of differing lengths. Guard against this possibility by understanding and remembering the Recycling Rule, “Understanding the Recycling Rule”.

Installing a package but not loading it with library() or require()
Installing a package is the first step toward using it, but one more step is required. Use library or require to load the package into your search path. Until you do so, R will not recognize the functions or datasets in the package. See “Accessing the Functions in a Package”:

x <- rnorm(100)
n <- 5
truehist(x, n)
#> Error in truehist(x, n): could not find function "truehist"

However, if we load the package first, then the code runs and we get the chart shown in Figure 2-3.

library(MASS) # Load the MASS package into R
truehist(x, n)
Figure 2-3. Example truehist

We typically use library() instead of require(). The reason is that if you create an R script that uses library() and the desired package is not already installed, R will return an error. require(), in contrast, will simply return FALSE if the package is not installed.

Writing aList[i] when you mean aList[[i]], or vice versa
If the variable lst contains a list, it can be indexed in two ways: lst[[n]] is the nth element of the list, whereas lst[n] is a list whose only element is the nth element of lst. That’s a big difference. See “Selecting List Elements by Position”.
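
A quick sketch of the difference (the list and its names are made up for illustration):

lst <- list(alpha = 1, bravo = 2, charlie = 3)
lst[[2]]    # the second element itself
#> [1] 2
lst[2]      # a one-element list containing the second element
#> $bravo
#> [1] 2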

Using & instead of &&, or vice versa; same for | and ||
Use & and | in logical expressions involving the logical values TRUE and FALSE. See “Selecting Vector Elements”.

Use && and || for the flow-of-control expressions inside if and while statements.

Programmers accustomed to other programming languages may reflexively use && and || everywhere because “they are faster.” But those operators give peculiar results when applied to vectors of logical values, so avoid them unless you are sure that they do what you want.
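
For example, & works element-wise on logical vectors, whereas && looks only at single values:

c(TRUE, FALSE) & c(TRUE, TRUE)    # element-wise "and"
#> [1]  TRUE FALSE
TRUE && FALSE                     # single "and", used in if conditions
#> [1] FALSE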

Passing multiple arguments to a single-argument function
What do you think is the value of mean(9, 10, 11)? No, it’s not 10. It’s 9. The mean function computes the mean of its first argument; the second and third values are quietly interpreted as the trim and na.rm arguments. To pass multiple items into a single argument, put them in a vector with the c operator. mean(c(9, 10, 11)) returns 10, as you might expect.

Some functions, such as mean, take one data argument. Other functions, such as max and min, take multiple arguments and apply themselves across all of them. Be sure you know which is which.
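
To see the difference:

mean(9, 10, 11)      # averages only its first argument
#> [1] 9
mean(c(9, 10, 11))   # averages all three values
#> [1] 10
max(9, 10, 11)       # max looks at all of its arguments
#> [1] 11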

Thinking that max behaves like pmax, or that min behaves like pmin
The max and min functions have multiple arguments and return one value: the maximum or minimum of all their arguments.

The pmax and pmin functions have multiple arguments but return a vector with values taken element-wise from the arguments. See “Finding Pairwise Minimums or Maximums”.
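
A quick illustration of the difference:

max(1:5, 10:14)     # one value: the largest of everything
#> [1] 14
pmax(1:5, 10:14)    # element-wise maximums of the two vectors
#> [1] 10 11 12 13 14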

Misusing a function that does not understand data frames
Some functions are quite clever regarding data frames. They apply themselves to the individual columns of the data frame, computing their result for each individual column. Sadly, not all functions are that clever. This includes the mean, median, max, and min functions. They will lump together every value from every column and compute their result from the lump or possibly just return an error. Be aware of which functions are savvy to data frames and which are not.
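
For example, using the cars data frame from earlier (warning messages omitted for brevity; map_dbl is the purrr helper shown earlier in this chapter):

mean(cars)            # warns "argument is not numeric or logical" and returns NA
#> [1] NA
max(cars)             # lumps every value from every column together
#> [1] 120
map_dbl(cars, mean)   # applies mean to each column separately
#> speed  dist
#>  15.4  43.0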

Using a single backslash (\) in Windows paths
If you are using R on Windows, it is common to copy and paste a file path into your R script. Windows File Explorer will show you that your path is C:\temp\my_file.csv, but if you try to tell R to read that file, you’ll get a cryptic message:

Error: '\m' is an unrecognized escape in character string starting "'.\temp\m"

This is because R sees backslashes as special characters. You can get around this either by using forward slashes (/) or by doubling the backslashes (\\):

read_csv("./temp/my_file.csv")
read_csv(".\\temp\\my_file.csv")

This is only an issue on Windows, because both Mac and Linux use forward slashes as path separators.

Posting a question to Stack Overflow or the mailing list before searching for the answer
Don’t waste your time. Don’t waste other people’s time. Before you post a question to a mailing list or to Stack Overflow, do your homework and search the archives. Odds are, someone has already answered your question. If so, you’ll see the answer in the discussion thread for the question. See “Searching the Mailing Lists”.

See Also

See Recipes , , , and .

Chapter 4. Input and Output

Introduction

All statistical work begins with data, and most data is stuck inside files and databases. Dealing with input is probably the first step of implementing any significant statistical project.

All statistical work ends with reporting numbers back to a client, even if you are the client. Formatting and producing output is probably the climax of your project.

Casual R users can solve their input problems by using basic functions such as read.csv to read CSV files and read.table to read more complicated, tabular data. They can use print, cat, and format to produce simple reports.

Users with heavy-duty input/output (I/O) needs are strongly encouraged to read the R Data Import/Export guide, available on CRAN at http://cran.r-project.org/doc/manuals/R-data.pdf. This manual includes important information on reading data from sources such as spreadsheets, binary files, other statistical systems, and relational databases.

Entering Data from the Keyboard

Problem

You have a small amount of data, too small to justify the overhead of creating an input file. You just want to enter the data directly into your workspace.

Solution

For very small datasets, enter the data as literals using the c() constructor for vectors:

scores <- c(61, 66, 90, 88, 100)

Discussion

When working on a simple problem, you may not want the hassle of creating and then reading a data file outside of R. You may just want to enter the data into R. The easiest way is by using the c() constructor for vectors, as shown in the Solution.

This approach works for data frames, too, by entering each variable (column) as a vector:

points <- data.frame(
  label = c("Low", "Mid", "High"),
  lbound = c(0, 0.67,   1.64),
  ubound = c(0.67, 1.64,   2.33)
)

See Also

See Recipe X-X for more about using the built-in data editor.

For cutting and pasting data from another application into R, be sure to look at datapasta, a package that provides R Studio addins that make pasting data into your scripts easier: https://github.com/MilesMcBain/datapasta

Printing Fewer Digits (or More Digits)

Problem

Your output contains too many digits or too few digits. You want to print fewer or more.

Solution

For print, the digits parameter can control the number of printed digits.

For cat, use the format function (which also has a digits parameter) to alter the formatting of numbers.

Discussion

R normally formats floating-point output to have seven digits:

pi
#> [1] 3.141593
100 * pi
#> [1] 314.1593

This works well most of the time but can become annoying when you have lots of numbers to print in a small space. It gets downright misleading when there are only a few significant digits in your numbers and R still prints seven.

The print function lets you vary the number of printed digits using the digits parameter:

print(pi, digits = 4)
#> [1] 3.142
print(100 * pi, digits = 4)
#> [1] 314.2

The cat function does not give you direct control over formatting. Instead, use the format function to format your numbers before calling cat:

cat(pi, "\n")
#> 3.141593
cat(format(pi, digits = 4), "\n")
#> 3.142

This is R, so both print and format will format entire vectors at once:

pnorm(-3:3)
#> [1] 0.001349898 0.022750132 0.158655254 0.500000000 0.841344746 0.977249868
#> [7] 0.998650102
print(pnorm(-3:3), digits = 3)
#> [1] 0.00135 0.02275 0.15866 0.50000 0.84134 0.97725 0.99865

Notice that print formats the vector elements consistently: it finds the number of digits necessary to format the smallest number and then formats all numbers to have the same width (though not necessarily the same number of digits). This is extremely useful for formatting an entire table:

q <- seq(from = 0, to = 3, by = 0.5)
tbl <- data.frame(Quant = q,
                  Lower = pnorm(-q),
                  Upper = pnorm(q))
tbl                                # Unformatted print
#>   Quant       Lower     Upper
#> 1   0.0 0.500000000 0.5000000
#> 2   0.5 0.308537539 0.6914625
#> 3   1.0 0.158655254 0.8413447
#> 4   1.5 0.066807201 0.9331928
#> 5   2.0 0.022750132 0.9772499
#> 6   2.5 0.006209665 0.9937903
#> 7   3.0 0.001349898 0.9986501
print(tbl, digits = 2)             # Formatted print: fewer digits
#>   Quant  Lower Upper
#> 1   0.0 0.5000  0.50
#> 2   0.5 0.3085  0.69
#> 3   1.0 0.1587  0.84
#> 4   1.5 0.0668  0.93
#> 5   2.0 0.0228  0.98
#> 6   2.5 0.0062  0.99
#> 7   3.0 0.0013  1.00

You can also alter the format of all output by using the options function to change the default for digits:

pi
#> [1] 3.141593
options(digits = 15)
pi
#> [1] 3.14159265358979

But this is a poor choice in our experience, since it also alters the output from R’s built-in functions, and that alteration is likely to be unpleasant.
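
If you do change the option, it is worth knowing that options returns the previous settings, so you can restore them when you are done (a small sketch):

old_opts <- options(digits = 15)    # save the old settings while changing digits
pi
#> [1] 3.14159265358979
options(old_opts)                   # restore the previous settings
pi
#> [1] 3.141593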

See Also

Other functions for formatting numbers include sprintf and formatC; see their help pages for details.

Redirecting Output to a File

Problem

You want to redirect the output from R into a file instead of your console.

Solution

You can redirect the output of the cat function by using its file argument:

cat("The answer is", answer, "\n", file = "filename.txt")

Use the sink function to redirect all output from both print and cat. Call sink with a filename argument to begin redirecting console output to that file. When you are done, use sink with no argument to close the file and resume output to the console:

sink("filename")          # Begin writing output to file

# ... other session work ...

sink()                    # Resume writing output to console

Discussion

The print and cat functions normally write their output to your console. The cat function writes to a file if you supply a file argument, which can be either a filename or a connection. The print function cannot redirect its output, but the sink function can force all output to a file. A common use for sink is to capture the output of an R script:

sink("script_output.txt")   # Redirect output to file
source("script.R")          # Run the script, capturing its output
sink()                      # Resume writing output to console

If you are repeatedly cat’ing items to one file, be sure to set append=TRUE. Otherwise, each call to cat will simply overwrite the file’s contents:

cat(data, file = "analysisReport.out")
cat(results, file = "analysisRepart.out", append = TRUE)
cat(conclusion, file = "analysisReport.out", append = TRUE)

Hard-coding file names like this is tedious and error prone. Did you notice that the filename is misspelled in the second line? Instead of hard-coding the filename repeatedly, we suggest opening a connection to the file and writing your output to the connection:

con <- file("analysisReport.out", "w")
cat(data, file = con)
cat(results, file = con)
cat(conclusion, file = con)
close(con)

(You don’t need append=TRUE when writing to a connection because append is the default with connections.) This technique is especially valuable inside R scripts because it makes your code more reliable and more maintainable.

Listing Files

Problem

You want an R vector that is a listing of the files in your working directory.

Solution

The list.files function shows the contents of your working directory:

list.files()
#>  [1] "_book"                            "_bookdown_files"
#>  [3] "_bookdown_files.old"              "_bookdown.yml"
#>  [5] "_common.R"                        "_main.rds"
#>  [7] "_output.yaml"                     "01_GettingStarted_cache"
#>  [9] "01_GettingStarted.md"             "01_GettingStarted.Rmd"
etc ...

Discussion

This function is terribly handy for grabbing the names of all files in a directory. You can use it to refresh your memory of your file names or, more likely, as input into another process, like importing data files.

You can pass list.files a path and a pattern argument to show only the files in a specific path that match a regular expression pattern:

list.files(path = 'data/') # show files in a directory
#>  [1] "ac.rdata"               "adf.rdata"
#>  [3] "anova.rdata"            "anova2.rdata"
#>  [5] "bad.rdata"              "batches.rdata"
#>  [7] "bnd_cmty.Rdata"         "compositePerf-2010.csv"
#>  [9] "conf.rdata"             "daily.prod.rdata"
#> [11] "data1.csv"              "data2.csv"
#> [13] "datafile_missing.tsv"   "datafile.csv"
#> [15] "datafile.fwf"           "datafile.qsv"
#> [17] "datafile.ssv"           "datafile.tsv"
#> [19] "df_decay.rdata"         "df_squared.rdata"
#> [21] "diffs.rdata"            "example1_headless.csv"
#> [23] "example1.csv"           "excel_table_data.xlsx"
#> [25] "get_USDA_NASS_data.R"   "ibm.rdata"
#> [27] "iris_excel.xlsx"        "lab_df.rdata"
#> [29] "movies.sas7bdat"        "nacho_data.csv"
#> [31] "NearestPoint.R"         "not_a_csv.txt"
#> [33] "opt.rdata"              "outcome.rdata"
#> [35] "pca.rdata"              "pred.rdata"
#> [37] "pred2.rdata"            "sat.rdata"
#> [39] "singles.txt"            "state_corn_yield.rds"
#> [41] "student_data.rdata"     "suburbs.txt"
#> [43] "tab1.csv"               "tls.rdata"
#> [45] "triples.txt"            "ts_acf.rdata"
#> [47] "workers.rdata"          "world_series.csv"
#> [49] "xy.rdata"               "yield.Rdata"
#> [51] "z.RData"
list.files(path = 'data/', pattern = '\\.csv')
#> [1] "compositePerf-2010.csv" "data1.csv"
#> [3] "data2.csv"              "datafile.csv"
#> [5] "example1_headless.csv"  "example1.csv"
#> [7] "nacho_data.csv"         "tab1.csv"
#> [9] "world_series.csv"

To see all the files in your subdirectories, too, use:

list.files(recursive = TRUE)

A possible “gotcha” of list.files is that it ignores hidden files—typically, any file whose name begins with a period. If you don’t see the file you expected to see, try setting all.files=TRUE:

list.files(path = 'data/', all.files = TRUE)
#>  [1] "."                      ".."
#>  [3] ".DS_Store"              ".hidden_file.txt"
#>  [5] "ac.rdata"               "adf.rdata"
#>  [7] "anova.rdata"            "anova2.rdata"
#>  [9] "bad.rdata"              "batches.rdata"
#> [11] "bnd_cmty.Rdata"         "compositePerf-2010.csv"
#> [13] "conf.rdata"             "daily.prod.rdata"
#> [15] "data1.csv"              "data2.csv"
#> [17] "datafile_missing.tsv"   "datafile.csv"
#> [19] "datafile.fwf"           "datafile.qsv"
#> [21] "datafile.ssv"           "datafile.tsv"
#> [23] "df_decay.rdata"         "df_squared.rdata"
#> [25] "diffs.rdata"            "example1_headless.csv"
#> [27] "example1.csv"           "excel_table_data.xlsx"
#> [29] "get_USDA_NASS_data.R"   "ibm.rdata"
#> [31] "iris_excel.xlsx"        "lab_df.rdata"
#> [33] "movies.sas7bdat"        "nacho_data.csv"
#> [35] "NearestPoint.R"         "not_a_csv.txt"
#> [37] "opt.rdata"              "outcome.rdata"
#> [39] "pca.rdata"              "pred.rdata"
#> [41] "pred2.rdata"            "sat.rdata"
#> [43] "singles.txt"            "state_corn_yield.rds"
#> [45] "student_data.rdata"     "suburbs.txt"
#> [47] "tab1.csv"               "tls.rdata"
#> [49] "triples.txt"            "ts_acf.rdata"
#> [51] "workers.rdata"          "world_series.csv"
#> [53] "xy.rdata"               "yield.Rdata"
#> [55] "z.RData"

If you just want to see which files are in a directory, and not use the file names in a procedure, the easiest way is to open the Files pane in the lower-right corner of RStudio. Keep in mind, however, that the RStudio Files pane hides files whose names start with a period.

See Also

R has other handy functions for working with files; see help(files).

Dealing with “Cannot Open File” in Windows

Problem

You are running R on Windows, and you are using file names such as C:\data\sample.txt. R says it cannot open the file, but you know the file does exist.

Solution

The backslashes in the file path are causing trouble. You can solve this problem in one of two ways:

  • Change the backslashes to forward slashes: "C:/data/sample.txt".

  • Double the backslashes: "C:\\data\\sample.txt".

Discussion

When you open a file in R, you give the file name as a character string. Problems arise when the name contains backslashes (\) because backslashes have a special meaning inside strings. You’ll probably get something like this:

samp <- read_csv("C:\Data\sample-data.csv")
#> Error: '\D' is an unrecognized escape in character string starting ""C:\D"

R sees each backslash as the start of an escape sequence. If the sequence is not a valid escape (such as \D here), R reports an error; if it happens to be valid, your path is silently mangled into something meaningless. Either way, you do not get the file you wanted.

The simple solution is to use forward slashes instead of backslashes. R leaves the forward slashes alone, and Windows treats them just like backslashes. Problem solved:

samp <- read_csv("C:/Data/sample-data.csv")

An alternative solution is to double the backslashes, since R replaces two consecutive backslashes with a single backslash:

samp <- read_csv("C:\\Data\\sample-data.csv")

Reading Fixed-Width Records

Problem

You are reading data from a file of fixed-width records: records whose data items occur at fixed boundaries.

Solution

Use the read_fwf function from the readr package (which is part of the tidyverse). The main arguments are the file name and the description of the fields:

library(tidyverse)
records <- read_fwf("./data/datafile.fwf",
                    fwf_cols(
                      last = 10,
                      first = 10,
                      birth = 5,
                      death = 5
                    ))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
records
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

Discussion

For reading data into R, we highly recommend the readr package. There are base R functions for reading text files, but readr improves on these base functions with faster performance, better defaults, and more flexibility.

Suppose we want to read an entire file of fixed-width records, such as datafile.fwf, shown here:

Fisher    R.A.      1890 1962
Pearson   Karl      1857 1936
Cox       Gertrude  1900 1978
Yates     Frank     1902 1994
Smith     Kirstine  1878 1939

We need to know the column widths. In this case the columns are:

  • Last name, 10 characters

  • First name, 10 characters

  • Year of birth, 5 characters

  • Year of death, 5 characters

There are five different ways to define the columns using read_fwf. Pick the one that’s easiest to use (or remember) in your situation:

  1. read_fwf can try to guess your column widths if there is empty space between the columns, using the fwf_empty option:

file <- "./data/datafile.fwf"
t1 <- read_fwf(file, fwf_empty(file, col_names = c("last", "first", "birth", "death")))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  2. You can define each column by a vector of widths followed by a vector of names with fwf_widths:

t2 <- read_fwf(file, fwf_widths(c(10, 10, 5, 4),
                                c("last", "first", "birth", "death")))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  3. The columns can be defined with fwf_cols, which takes a series of column names followed by the column widths:

t3 <-
  read_fwf("./data/datafile.fwf",
           fwf_cols(
             last = 10,
             first = 10,
             birth = 5,
             death = 5
           ))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  4. Each column can be defined by a beginning position and an ending position with fwf_cols:

t4 <- read_fwf(file, fwf_cols(
  last = c(1, 10),
  first = c(11, 20),
  birth = c(21, 25),
  death = c(26, 30)
))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  5. You can also define the columns with a vector of starting positions, a vector of ending positions, and a vector of column names with fwf_positions:

t5 <- read_fwf(file, fwf_positions(
  c(1, 11, 21, 26),
  c(10, 20, 25, 30),
  c("first", "last", "birth", "death")
))
#> Parsed with column specification:
#> cols(
#>   first = col_character(),
#>   last = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )

read_fwf returns a tibble, which is a tidyverse object very similar to a data frame. As is common with tidyverse packages, read_fwf has a good selection of default assumptions that make it less tricky to use than some base R functions for importing data. For example, read_fwf will, by default, import character fields as characters, not factors, which prevents much pain and consternation for users.

See Also

See “Reading Tabular Data Files” for more discussion of reading text files.

Reading Tabular Data Files

Problem

You want to read a text file that contains a table of white-space delimited data.

Solution

Use the read_table2 function from the readr package, which returns a tibble:

library(tidyverse)

tab1 <- read_table2("./data/datafile.tsv")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
tab1
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

Discussion

Tabular data files are quite common. They are text files with a simple format:

  • Each line contains one record.

  • Within each record, fields (items) are separated by a white space delimiter, such as a space or tab.

  • Each record contains the same number of fields.

This format is more free-form than the fixed-width format because fields needn’t be aligned by position. Here is the data file from “Reading Fixed-Width Records” in tabular format, using a tab character between fields:

last    first   birth   death
Fisher  R.A.    1890    1962
Pearson Karl    1857    1936
Cox Gertrude    1900    1978
Yates   Frank   1902    1994
Smith   Kirstine    1878    1939

The read_table2 function is designed to make some good guesses about your data. It assumes your data has column names in the first row, it guesses your delimiter, and it imputes your column types based on the first 1000 records in your data set. Below is an example with space-delimited data:

t <- read_table2("./data/datafile.ssv")
#> Parsed with column specification:
#> cols(
#>   `#The` = col_character(),
#>   following = col_character(),
#>   is = col_character(),
#>   a = col_character(),
#>   list = col_character(),
#>   of = col_character(),
#>   statisticians = col_character()
#> )
#> Warning: 6 parsing failures.
#> row col  expected    actual                  file
#>   1  -- 7 columns 4 columns './data/datafile.ssv'
#>   2  -- 7 columns 4 columns './data/datafile.ssv'
#>   3  -- 7 columns 4 columns './data/datafile.ssv'
#>   4  -- 7 columns 4 columns './data/datafile.ssv'
#>   5  -- 7 columns 4 columns './data/datafile.ssv'
#> ... ... ......... ......... .....................
#> See problems(...) for more details.
print(t)
#> # A tibble: 6 x 7
#>   `#The`  following is    a     list  of    statisticians
#>   <chr>   <chr>     <chr> <chr> <chr> <chr> <chr>
#> 1 last    first     birth death <NA>  <NA>  <NA>
#> 2 Fisher  R.A.      1890  1962  <NA>  <NA>  <NA>
#> 3 Pearson Karl      1857  1936  <NA>  <NA>  <NA>
#> 4 Cox     Gertrude  1900  1978  <NA>  <NA>  <NA>
#> 5 Yates   Frank     1902  1994  <NA>  <NA>  <NA>
#> 6 Smith   Kirstine  1878  1939  <NA>  <NA>  <NA>

In the example above, the first line of the file is a comment, which confused the parser; we’ll deal with that shortly using the comment parameter. read_table2 often guesses correctly, but as with other readr import functions, you can override the defaults with explicit parameters:

t <-
  read_table2(
    "./data/datafile.tsv",
    col_types = c(
      col_character(),
      col_character(),
      col_integer(),
      col_integer()
    )
  )

If any field contains the string “NA”, then read_table2 assumes that the value is missing and converts it to NA. Your data file might employ a different string to signal missing values, in which case use the na parameter. The SAS convention, for example, is that missing values are signaled by a single period (.). We can read such text files using the na="." option. If we have a file named datafile_missing.tsv that has a missing value indicated with a . in the last row:

last    first   birth   death
Fisher  R.A.    1890    1962
Pearson Karl    1857    1936
Cox Gertrude    1900    1978
Yates   Frank   1902    1994
Smith   Kirstine    1878    1939
Cox David 1924 .

we can import it like so:

t <- read_table2("./data/datafile_missing.tsv", na = ".")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
t
#> # A tibble: 6 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939
#> 6 Cox     David     1924    NA

We’re huge fans of self-describing data: data files that describe their own contents. (A computer scientist would say the file contains its own metadata.) The read_table2 function makes the default assumption that the first line of your file contains a header line with column names. If your file does not have column names, you can turn this off with the parameter col_names = FALSE.
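
A minimal sketch of reading a headerless file (the file name here is hypothetical):

t <- read_table2("./data/no_header.txt", col_names = FALSE)
head(t)   # columns are named X1, X2, ... automatically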

An additional type of metadata supported by read_table2 is comment lines. Using the comment parameter you can tell read_table2 which character distinguishes comment lines. The following file has a comment line at the top that starts with #.

# The following is a list of statisticians
last first birth death
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939

so we can import this file as follows:

t <- read_table2("./data/datafile.ssv", comment = '#')
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
t
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

read_table2 has many parameters for controlling how it reads and interprets the input file. See the help page (?read_table2) or the readr vignette (vignette("readr")) for more details. If you’re curious about the difference between read_table and read_table2, it’s in the help file… but the short answer is that read_table is slightly less forgiving in file structure and line length.

See Also

If your data items are separated by commas, see “Reading from CSV Files” for reading a CSV file.

Reading from CSV Files

Problem

You want to read data from a comma-separated values (CSV) file.

Solution

The read_csv function from the readr package is a fast (and, according to the documentation, fun) way to read CSV files. If your CSV file has a header line, use this:

library(tidyverse)

tbl <- read_csv("./data/datafile.csv")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )

If your CSV file does not contain a header line, set the col_names option to FALSE:

tbl <- read_csv("./data/datafile.csv",  col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_character(),
#>   X4 = col_character()
#> )

Discussion

The CSV file format is popular because many programs can import and export data in that format. This includes R, Excel, other spreadsheet programs, many database managers, and most statistical packages. It is a flat file of tabular data, where each line in the file is a row of data, and each row contains data items separated by commas. Here is a very simple CSV file with three rows and three columns (the first line is a header line that contains the column names, also separated by commas):

label,lbound,ubound
low,0,0.674
mid,0.674,1.64
high,1.64,2.33

The read_csv function reads the data and creates a tibble, which is a special type of data frame used in tidyverse packages and a common representation for tabular data. The function assumes that your file has a header line unless told otherwise:

tbl <- read_csv("./data/example1.csv")
#> Parsed with column specification:
#> cols(
#>   label = col_character(),
#>   lbound = col_double(),
#>   ubound = col_double()
#> )
tbl
#> # A tibble: 3 x 3
#>   label lbound ubound
#>   <chr>  <dbl>  <dbl>
#> 1 low    0      0.674
#> 2 mid    0.674  1.64
#> 3 high   1.64   2.33

Observe that read_csv took the column names from the header line for the tibble. If the file did not contain a header, then we would specify col_names=FALSE and R would synthesize column names for us (X1, X2, and X3 in this case):

tbl <- read_csv("./data/example1.csv", col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_character()
#> )
tbl
#> # A tibble: 4 x 3
#>   X1    X2     X3
#>   <chr> <chr>  <chr>
#> 1 label lbound ubound
#> 2 low   0      0.674
#> 3 mid   0.674  1.64
#> 4 high  1.64   2.33

Sometimes it’s convenient to put metadata in files. If this metadata starts with a common character, such as a pound sign (#), we can use the comment parameter to ignore metadata lines.
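
For example, assuming a CSV file whose metadata lines begin with # (the file name is hypothetical):

tbl <- read_csv("./data/file_with_metadata.csv", comment = "#")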

The read_csv function has many useful bells and whistles. A few of these options and their default values include:

  • na = c("", "NA"): Indicate what values represent missing or NA values

  • comment = "": which lines to ignore as comments or metadata

  • trim_ws = TRUE: Whether to drop white space at the beginning and/or end of fields

  • skip = 0: Number of rows to skip at the beginning of the file

  • guess_max = min(1000, n_max): Number of rows to consider when imputing column types

See the R help page, help(read_csv), for more details on all the available options.
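
As a sketch of how several of these options combine (the file name and settings are made up for illustration):

tbl <- read_csv("./data/messy_export.csv",
                na = c("", "NA", "."),   # also treat periods as missing
                skip = 2,                # skip two lines of preamble
                trim_ws = TRUE)          # strip stray whitespace around fields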

If you have a data file that uses semicolons (;) for separators and commas for the decimal mark, as is common outside of North America, then you should use the function read_csv2, which is built for that very situation.
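
For example (the file name is hypothetical):

tbl <- read_csv2("./data/european_format.csv")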

See Also

See “Writing to CSV Files”. See also the vignette for the readr package: vignette("readr").

Writing to CSV Files

Problem

You want to save a matrix or data frame in a file using the comma-separated values format.

Solution

The write_csv function from the tidyverse readr package can write a CSV file:

library(tidyverse)

write_csv(tab1, path = "./data/tab1.csv")

Discussion

The write_csv function writes tabular data to an ASCII file in CSV format. Each row of data creates one line in the file, with data items separated by commas (,):

library(tidyverse)

print(tab1)
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939
write_csv(tab1, "./data/tab1.csv")

This example creates a file called tab1.csv in the data directory, which is a subdirectory of the working directory. The file looks like this:

last,first,birth,death
Fisher,R.A.,1890,1962
Pearson,Karl,1857,1936
Cox,Gertrude,1900,1978
Yates,Frank,1902,1994
Smith,Kirstine,1878,1939

write_csv has a number of parameters with typically very good defaults. Should you want to adjust the output, here are a few parameters you can change, along with their defaults (a short sketch follows this list):

na = "NA"

: The string used in the output file to represent missing (NA) values

append = FALSE

: Whether to append to an existing file rather than overwrite it

col_names = !append

: Whether to write the column names as the first line of the file
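
For instance, a minimal sketch that writes empty fields instead of the string "NA" for missing values, reusing the tab1 tibble from above:

write_csv(tab1, "./data/tab1.csv", na = "")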

See Also

See “Getting and Setting the Working Directory” for more about the current working directory and “Saving and Transporting Objects” for other ways to save data to files. For more info on reading and writing text files, see the readr vignette: vignette("readr").

Reading Tabular or CSV Data from the Web

Problem

You want to read data directly from the Web into your R workspace.

Solution

Use the read_csv or read_table2 functions from the readr package, using a URL instead of a file name. The functions will read directly from the remote server:

library(tidyverse)

berkley <- read_csv('http://bit.ly/barkley18', comment = '#')
#> Parsed with column specification:
#> cols(
#>   Name = col_character(),
#>   Location = col_character(),
#>   Time = col_time(format = "")
#> )

You can also open a connection using the URL and then read from the connection, which may be preferable for complicated files.

Discussion

The Web is a gold mine of data. You could download the data into a file and then read the file into R, but it’s more convenient to read directly from the Web. Give the URL to read_csv, read_table2, or other read function in readr (depending upon the format of the data), and the data will be downloaded and parsed for you. No fuss, no muss.

Aside from using a URL, this recipe is just like reading from a CSV file (“Reading from CSV Files”) or a complex file (“Reading Files with a Complex Structure”), so all the comments in those recipes apply here, too.

Remember that URLs work for FTP servers, not just HTTP servers. This means that R can also read data from FTP sites using URLs:

tbl <- read_table2("ftp://ftp.example.com/download/data.txt")

See Also

See the recipes referenced above, “Reading from CSV Files” and “Reading Files with a Complex Structure”.

Reading Data From Excel

Problem

You want to read data in from an Excel file.

Solution

The openxlsx package makes reading Excel files easy.

library(openxlsx)

df1 <- read.xlsx(xlsxFile = "data/iris_excel.xlsx",
                 sheet = 'iris_data')
head(df1, 3)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa

Discussion

The package openxlsx is a good choice for both reading and writing Excel files with R. If we’re reading in an entire sheet then passing a file name and a sheet name to the read.xlsx function is a simple option. But openxlsx supports more complex workflows.

A common pattern is to read a named table out of an Excel file and into an R data frame. This is trickier because the sheet we’re reading from may have values outside of the named table and we want to only read in the named table range. We can use the functions in openxlsx to get the location of a table, then read that range of cells into a data frame.

First we load the workbook into R:

library(openxlsx)
wb <- loadWorkbook("data/excel_table_data.xlsx")

Then we can use the getTables function to get the names and ranges of all the Excel Tables in the input_data sheet and select the one table we want. In this example the Excel Table we are after is named example_table:

tables <- getTables(wb, 'input_data')
table_range_str <- names(tables[tables == 'example_table'])
table_range_refs <- strsplit(table_range_str, ':')[[1]]

# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)

# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)

Now table_range_col_num contains the column numbers of our named table, while table_range_row_num contains the row numbers. We can then use the read.xlsx function to pull in only the rows and columns we are after.

df <- read.xlsx(
  xlsxFile = "data/excel_table_data.xlsx",
  sheet = 'input_data',
  cols = table_range_col_num[1]:table_range_col_num[2],
  rows = table_range_row_num[1]:table_range_row_num[2]
)

See Also

See the vignette for openxlsx by installing openxlsx and running: vignette('Introduction', package = 'openxlsx')

The readxl package is part of the Tidyverse and provides fast, simple reading of Excel files: https://readxl.tidyverse.org/

The writexl package is a fast and lightweight (no dependencies) package for writing Excel files: https://cran.r-project.org/web/packages/writexl/index.html

“Writing a Data Frame to Excel”

Writing a Data Frame to Excel

Problem

You want to write an R data frame to an Excel file.

Solution

The openxlsx package makes writing to Excel files relatively easy. While there are lots of options in openxlsx, a typical pattern is to specify an Excel file name and a sheet name:

library(openxlsx)

write.xlsx(x = iris,
           sheetName = 'iris_data',
           file = "data/iris_excel.xlsx")

Discussion

The openxlsx package has a huge number of options for controlling many aspects of the Excel object model. We can use it to set cell colors, define named ranges, and set cell outlines, for example. But it has a few helper functions like write.xlsx which make simple tasks easier.

When businesses work with Excel, it’s a good practice to keep all input data in an Excel file in a named Excel Table, which makes accessing the data easier and less error-prone. However, if you use openxlsx to overwrite an Excel Table in one of the sheets, you run the risk that the new data may contain fewer rows than the Excel Table it replaces. That could cause errors, because you would end up with old data and new data in contiguous rows. The solution is to first delete the existing Excel Table, then write the new data back into the same location and assign it to a named Excel Table. To do this we need to use the more advanced Excel manipulation features of openxlsx.

First we use loadWorkbook to read the Excel workbook into R in its entirety:

library(openxlsx)

wb <- loadWorkbook("data/excel_table_data.xlsx")

Before we delete the table, we extract its starting row and column.

tables <- getTables(wb, 'input_data')
table_range_str <- names(tables[tables == 'example_table'])
table_range_refs <- strsplit(table_range_str, ':')[[1]]

# use a regex to extract out the starting row number
table_row_num <- gsub("[^0-9.]", "", table_range_refs)[[1]]

# extract out the starting column number
table_col_num <- convertFromExcelRef(table_range_refs)[[1]]

Then we can use the removeTable function to remove the existing named Excel Table:

## remove the existing Excel Table
removeTable(wb = wb,
            sheet = 'input_data',
            table = 'example_table')

Then we can use writeDataTable to write the iris data frame (which comes with R) back into our workbook object in R:

writeDataTable(
  wb = wb,
  sheet = 'input_data',
  x = iris,
  startCol = table_col_num,
  startRow = table_row_num,
  tableStyle = "TableStyleLight9",
  tableName = 'example_table'
)

At this point we could save the workbook and our Table would be updated. However, it’s a good idea to save some metadata in the workbook to let others know exactly when the data was refreshed. We can do this with the writeData function: we’ll put the text in cell B5, then save the workbook back to the file, overwriting the original.

writeData(
  wb = wb,
  sheet = 'input_data',
  x = paste('example_table data refreshed on:', Sys.time()),
  startCol = 2,
  startRow = 5
)

## then save the workbook
saveWorkbook(wb = wb,
             file = "data/excel_table_data.xlsx",
             overwrite = T)

The resulting Excel sheet is shown in Figure 4-1.

Figure 4-1. Excel Table and Caption

See Also

See the vignette for openxlsx by installing openxlsx and running: vignette('Introduction', package = 'openxlsx')

The readxl package is part of the Tidyverse and provides fast, simple reading of Excel files: https://readxl.tidyverse.org/

The writexl package is a fast and lightweight (no dependencies) package for writing Excel files: https://cran.r-project.org/web/packages/writexl/index.html

“Reading Data From Excel”

Reading Data from a SAS file

Problem

You want to read a SAS data set into an R data frame.

Solution

The haven package supports reading SAS sas7bdat files into R:

library(haven)

sas_movie_data <- read_sas("data/movies.sas7bdat")

Discussion

SAS V7 and beyond all support the sas7bdat file format. The read_sas function in haven reads the sas7bdat file format, including variable labels. If your SAS file has variable labels, they will be stored in the label attribute of each data frame column when imported into R. These labels are not printed by default. You can see them by opening the data frame in RStudio, or by calling the base R attributes function on each column:

sapply(sas_movie_data, attributes)
#> $Movie
#> $Movie$label
#> [1] "Movie"
#>
#>
#> $Type
#> $Type$label
#> [1] "Type"
#>
#>
#> $Rating
#> $Rating$label
#> [1] "Rating"
#>
#>
#> $Year
#> $Year$label
#> [1] "Year"
#>
#>
#> $Domestic__
#> $Domestic__$label
#> [1] "Domestic $"
#>
#> $Domestic__$format.sas
#> [1] "F"
#>
#>
#> $Worldwide__
#> $Worldwide__$label
#> [1] "Worldwide $"
#>
#> $Worldwide__$format.sas
#> [1] "F"
#>
#>
#> $Director
#> $Director$label
#> [1] "Director"

See Also

The sas7bdat package is much slower on large files than haven, but it has more elaborate support for file attributes. If the SAS metadata is important to you then you should investigate sas7bdat::read.sas7bdat.
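
A minimal sketch, assuming the same movies.sas7bdat file as above:

# slower than haven::read_sas, but preserves more of the SAS metadata
sas_movie_data_2 <- sas7bdat::read.sas7bdat("data/movies.sas7bdat")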

Reading Data from HTML Tables

Problem

You want to read data from an HTML table on the Web.

Solution

Use the read_html and html_table functions in the rvest package. To read all tables on the page, do the following:

library(rvest)
library(magrittr)

all_tables <-
  read_html("https://en.wikipedia.org/wiki/Aviation_accidents_and_incidents") %>%
  html_table(fill = TRUE, header = TRUE)

read_html reads the page, and html_table extracts every HTML table it finds into a list. To pull a single table from that list, you can use the extract2 function from the magrittr package:

out_table <-
  read_html("https://en.wikipedia.org/wiki/Aviation_accidents_and_incidents") %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(2)

head(out_table)
#>   Year Deaths[52] # of incidents[53]
#> 1 2017        399           101 [54]
#> 2 2016        629                102
#> 3 2015        898                123
#> 4 2014      1,328                122
#> 5 2013        459                138
#> 6 2012        800                156

Note that the rvest and magrittr packages are both installed when you run install.packages('tidyverse'). They are not core tidyverse packages, however, so you must explicitly load them, as shown here.

Discussion

Web pages can contain several HTML tables. Calling read_html(url) and then piping the result to html_table() reads all the tables on the page and returns them in a list. This can be useful for exploring a page, but it’s annoying if you want just one specific table. In that case, use extract2(n) to select the nth table.

Two common parameters for the html_table function are fill = TRUE, which fills in missing values with NA, and header = TRUE, which indicates that the first row contains the header names.

The following example loads all tables from the Wikipedia page entitled “World population”:

url <- 'http://en.wikipedia.org/wiki/World_population'
tbls <- read_html(url) %>%
  html_table(fill = TRUE, header = TRUE)

As it turns out, that page contains 23 tables (or things that html_table thinks might be tables):

length(tbls)
#> [1] 23

In this example we care only about the second table (which lists the largest populations by country), so we can either access that element using double brackets, tbls[[2]], or we can pipe it into the extract2 function from the magrittr package:

library(magrittr)
url <- 'http://en.wikipedia.org/wiki/World_population'
tbl <- read_html(url) %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(2)

head(tbl, 2)
#>   World population (millions, UN estimates)[10]
#> 1                                             #
#> 2                                             1
#>   World population (millions, UN estimates)[10]
#> 1               Top ten most populous countries
#> 2                                        China*
#>   World population (millions, UN estimates)[10]
#> 1                                          2000
#> 2                                         1,270
#>   World population (millions, UN estimates)[10]
#> 1                                          2015
#> 2                                         1,376
#>   World population (millions, UN estimates)[10]
#> 1                                         2030*
#> 2                                         1,416

In that table, columns 2 and 3 contain the country name and population, respectively:

tbl[, c(2, 3)]
#>                          World population (millions, UN estimates)[10]
#> 1                                      Top ten most populous countries
#> 2                                                               China*
#> 3                                                                India
#> 4                                                        United States
#> 5                                                            Indonesia
#> 6                                                             Pakistan
#> 7                                                               Brazil
#> 8                                                              Nigeria
#> 9                                                           Bangladesh
#> 10                                                              Russia
#> 11                                                              Mexico
#> 12                                                         World total
#> 13 Notes:\nChina = excludes Hong Kong and Macau\n2030 = Medium variant
#>                        World population (millions, UN estimates)[10].1
#> 1                                                                 2000
#> 2                                                                1,270
#> 3                                                                1,053
#> 4                                                                  283
#> 5                                                                  212
#> 6                                                                  136
#> 7                                                                  176
#> 8                                                                  123
#> 9                                                                  131
#> 10                                                                 146
#> 11                                                                 103
#> 12                                                               6,127
#> 13 Notes:\nChina = excludes Hong Kong and Macau\n2030 = Medium variant

Right away, we can see problems with the data: the second row of the data has info that really belongs with the header. And China has * appended to its name. On the Wikipedia website, that was a footnote reference, but now it’s just a bit of unwanted text. Adding insult to injury, the population numbers have embedded commas, so you cannot easily convert them to raw numbers. All these problems can be solved by some string processing, but each problem adds at least one more step to the process.

This illustrates the main obstacle to reading HTML tables. HTML was designed for presenting information to people, not to computers. When you “scrape” information off an HTML page, you get stuff that’s useful to people but annoying to computers. If you ever have a choice, choose instead a computer-oriented data representation such as XML, JSON, or CSV.

The read_html(url) and html_table() functions are part of the rvest package, which (by necessity) is large and complex. Any time you pull data from a site designed for human readers rather than machines, expect to do some postprocessing to clean up the messy bits and pieces that scraping leaves behind.

See Also

See “Installing Packages from CRAN” for downloading and installing packages such as the rvest package.

Reading Files with a Complex Structure

Problem

You are reading data from a file that has a complex or irregular structure.

Solution

  • Use the readLines function to read individual lines; then process them as strings to extract data items.

  • Alternatively, use the scan function to read individual tokens and use the argument what to describe the stream of tokens in your file. The function can convert tokens into data and then assemble the data into records.

Discussion

Life would be simple and beautiful if all our data files were organized into neat tables with cleanly delimited data. We could read those files using one of the functions in the readr package and get on with living.

Unfortunately, we don’t live in a land of rainbows and unicorn kisses.

You will eventually encounter a funky file format, and your job (suck it up, buttercup) is to read the file contents into R.

The read.table and read.csv functions are line-oriented and probably won’t help. However, the readLines and scan functions are useful here because they let you process the individual lines and even tokens of the file.

The readLines function is pretty simple. It reads lines from a file and returns them as a list of character strings:

lines <- readLines("input.txt")

You can limit the number of lines by using the n parameter, which gives the maximum number of lines to be read:

lines <- readLines("input.txt", n = 10)       # Read 10 lines and stop

The scan function is much richer. It reads one token at a time and handles it according to your instructions. The first argument is either a filename or a connection (more on connections later). The second argument is called what, and it describes the tokens that scan should expect in the input file. The description is cryptic but quite clever:

what=numeric(0)

Interpret the next token as a number.

what=integer(0)

Interpret the next token as an integer.

what=complex(0)

Interpret the next token as a complex number.

what=character(0)

Interpret the next token as a character string.

what=logical(0)

Interpret the next token as a logical value.

The scan function will apply the given pattern repeatedly until all data is read.

Suppose your file is simply a sequence of numbers, like this:

2355.09 2246.73 1738.74 1841.01 2027.85

Use what=numeric(0) to say, “My file is a sequence of tokens, each of which is a number”:

singles <- scan("./data/singles.txt", what = numeric(0))
singles
#> [1] 2355.09 2246.73 1738.74 1841.01 2027.85

A key feature of scan is that the what can be a list containing several token types. The scan function will assume your file is a repeating sequence of those types. Suppose your file contains triplets of data, like this:

15-Oct-87 2439.78 2345.63 16-Oct-87 2396.21 2207.73
19-Oct-87 2164.16 1677.55 20-Oct-87 2067.47 1616.21
21-Oct-87 2081.07 1951.76

Use a list to tell scan that it should expect a repeating, three-token sequence:

triples <-
  scan("./data/triples.txt",
       what = list(character(0), numeric(0), numeric(0)))
triples
#> [[1]]
#> [1] "15-Oct-87" "16-Oct-87" "19-Oct-87" "20-Oct-87" "21-Oct-87"
#>
#> [[2]]
#> [1] 2439.78 2396.21 2164.16 2067.47 2081.07
#>
#> [[3]]
#> [1] 2345.63 2207.73 1677.55 1616.21 1951.76

Give names to the list elements, and scan will assign those names to the data:

triples <- scan("./data/triples.txt",
                what = list(
                  date = character(0),
                  high = numeric(0),
                  low = numeric(0)
                ))
triples
#> $date
#> [1] "15-Oct-87" "16-Oct-87" "19-Oct-87" "20-Oct-87" "21-Oct-87"
#>
#> $high
#> [1] 2439.78 2396.21 2164.16 2067.47 2081.07
#>
#> $low
#> [1] 2345.63 2207.73 1677.55 1616.21 1951.76

This can easily be turned into a data frame with the data.frame command:

df_triples <- data.frame(triples)
df_triples
#>        date    high     low
#> 1 15-Oct-87 2439.78 2345.63
#> 2 16-Oct-87 2396.21 2207.73
#> 3 19-Oct-87 2164.16 1677.55
#> 4 20-Oct-87 2067.47 1616.21
#> 5 21-Oct-87 2081.07 1951.76

The scan function has many bells and whistles, but the following are especially useful:

n=number

Stop after reading this many tokens. (Default: stop at end of file.)

nlines=number

Stop after reading this many input lines. (Default: stop at end of file.)

skip=number

Number of input lines to skip before reading data.

na.strings=list

A list of strings to be interpreted as NA.
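
For instance, here is a minimal sketch (the file name is hypothetical) that skips two comment lines at the top of a file of numbers and treats a lone period as a missing value:

values <- scan("./data/readings.txt",
               what = numeric(0),
               skip = 2,
               na.strings = ".")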

An Example

Let’s use this recipe to read a dataset from StatLib, the repository of statistical data and software maintained by Carnegie Mellon University. Jeff Witmer contributed a dataset called wseries that shows the pattern of wins and losses for every World Series since 1903. The dataset is stored in an ASCII file with 35 lines of comments followed by 23 lines of data. The data itself looks like this:

1903  LWLlwwwW    1927  wwWW      1950  wwWW      1973  WLwllWW
1905  wLwWW       1928  WWww      1951  LWlwwW    1974  wlWWW
1906  wLwLwW      1929  wwLWW     1952  lwLWLww   1975  lwWLWlw
1907  WWww        1930  WWllwW    1953  WWllwW    1976  WWww
1908  wWLww       1931  LWwlwLW   1954  WWww      1977  WLwwlW

.
. (etc.)
.

The data is encoded as follows: L = loss at home, l = loss on the road, W = win at home, w = win on the road. The data appears in column order, not row order, which complicates our lives a bit.

Here is the R code for reading the raw data:

# Read the wseries dataset:
#     - Skip the first 35 lines
#     - Then read 23 lines of data
#     - The data occurs in pairs: a year and a pattern (char string)
#
world.series <- scan(
  "http://lib.stat.cmu.edu/datasets/wseries",
  skip = 35,
  nlines = 23,
  what = list(year = integer(0),
              pattern = character(0))
)

The scan function returns a list, so we get a list with two elements: year and pattern. The function reads from left to right, but the dataset is organized by columns and so the years appear in a strange order:

world.series$year
#>  [1] 1903 1927 1950 1973 1905 1928 1951 1974 1906 1929 1952 1975 1907 1930
#> [15] 1953 1976 1908 1931 1954 1977 1909 1932 1955 1978 1910 1933 1956 1979
#> [29] 1911 1934 1957 1980 1912 1935 1958 1981 1913 1936 1959 1982 1914 1937
#> [43] 1960 1983 1915 1938 1961 1984 1916 1939 1962 1985 1917 1940 1963 1986
#> [57] 1918 1941 1964 1987 1919 1942 1965 1988 1920 1943 1966 1989 1921 1944
#> [71] 1967 1990 1922 1945 1968 1991 1923 1946 1969 1992 1924 1947 1970 1993
#> [85] 1925 1948 1971 1926 1949 1972

We can fix that by sorting the list elements according to year:

perm <- order(world.series$year)
world.series <- list(year    = world.series$year[perm],
                     pattern = world.series$pattern[perm])

Now the data appears in chronological order:

world.series$year
#>  [1] 1903 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917
#> [15] 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931
#> [29] 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
#> [43] 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
#> [57] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
#> [71] 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
#> [85] 1988 1989 1990 1991 1992 1993

world.series$pattern
#>  [1] "LWLlwwwW" "wLwWW"    "wLwLwW"   "WWww"     "wWLww"    "WLwlWlw"
#>  [7] "WWwlw"    "lWwWlW"   "wLwWlLW"  "wLwWw"    "wwWW"     "lwWWw"
#> [13] "WWlwW"    "WWllWw"   "wlwWLW"   "WWlwwLLw" "wllWWWW"  "LlWwLwWw"
#> [19] "WWwW"     "LwLwWw"   "LWlwlWW"  "LWllwWW"  "lwWLLww"  "wwWW"
#> [25] "WWww"     "wwLWW"    "WWllwW"   "LWwlwLW"  "WWww"     "WWlww"
#> [31] "wlWLLww"  "LWwwlW"   "lwWWLw"   "WWwlw"    "wwWW"     "WWww"
#> [37] "LWlwlWW"  "WLwww"    "LWwww"    "WLWww"    "LWlwwW"   "LWLwwlw"
#> [43] "LWlwlww"  "WWllwLW"  "lwWWLw"   "WLwww"    "wwWW"     "LWlwwW"
#> [49] "lwLWLww"  "WWllwW"   "WWww"     "llWWWlw"  "llWWWlw"  "lwLWWlw"
#> [55] "llWLWww"  "lwWWLw"   "WLlwwLW"  "WLwww"    "wlWLWlw"  "wwWW"
#> [61] "WLlwwLW"  "llWWWlw"  "wwWW"     "wlWWLlw"  "lwLLWww"  "lwWWW"
#> [67] "wwWLW"    "llWWWlw"  "wwLWLlw"  "WLwllWW"  "wlWWW"    "lwWLWlw"
#> [73] "WWww"     "WLwwlW"   "llWWWw"   "lwLLWww"  "WWllwW"   "llWWWw"
#> [79] "LWwllWW"  "LWwww"    "wlWWW"    "LLwlwWW"  "LLwwlWW"  "WWlllWW"
#> [85] "WWlww"    "WWww"     "WWww"     "WWlllWW"  "lwWWLw"   "WLwwlW"

Reading from MySQL Databases

Problem

You want access to data stored in a MySQL database.

Solution

  1. Install the RMySQL package on your computer.

  2. Open a database connection using the DBI::dbConnect function.

  3. Use dbGetQuery to initiate a SELECT and return the result sets.

  4. Use dbDisconnect to terminate the database connection when you are done.

Discussion

This recipe requires that the RMySQL package be installed on your computer. That package requires, in turn, the MySQL client software. If the MySQL client software is not already installed and configured, consult the MySQL documentation or your system administrator.

Use the dbConnect function to establish a connection to the MySQL database. It returns a connection object which is used in subsequent calls to RMySQL functions:

library(RMySQL)

con <- dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "your_db_name",
    host = "your.host.com",
    username = "userid",
    password = "pwd"
  )

The username, password, and host parameters are the same parameters used for accessing MySQL through the mysql client program. The example given here shows them hard-coded into the dbConnect call. Actually, that is an ill-advised practice. It puts your password in a plain-text document, creating a security problem. It also creates a major headache whenever your password or host changes, requiring you to hunt down the hard-coded values. We strongly recommend using the security mechanism of MySQL instead. Put those three parameters into your MySQL configuration file, which is $HOME/.my.cnf on Unix and C:\my.cnf on Windows. Make sure the file is unreadable by anyone except you. The file is delimited into sections with markers such as [client]. Put the parameters into the [client] section, so that your config file will contain something like this:

[client]
user = userid
password = password
host = hostname

Once the parameters are defined in the config file, you no longer need to supply them in the dbConnect call, which then becomes much simpler:

con <- dbConnect(
  drv = RMySQL::MySQL(),
  dbname = "your_db_name"
)

Use the dbGetQuery function to submit your SQL to the database and read the result sets. Doing so requires an open database connection:

sql <- "SELECT * from SurveyResults WHERE City = 'Chicago'"
rows <- dbGetQuery(con, sql)

You are not restricted to SELECT statements. Any SQL that generates a result set is OK. It is common to use CALL statements, for example, if your SQL is encapsulated in stored procedures and those stored procedures contain embedded SELECT statements.
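
For instance, a brief sketch that calls a hypothetical stored procedure which returns a result set:

rows <- dbGetQuery(con, "CALL GetChicagoResults()")   # hypothetical stored procedure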

Using dbGetQuery is convenient because it packages the result set into a data frame and returns the data frame. This is the perfect representation of an SQL result set. The result set is a tabular data structure of rows and columns, and so is a data frame. The result set’s columns have names given by the SQL SELECT statement, and R uses them for naming the columns of the data frame.

After the first result set of data, MySQL can return a second result set containing status information. You can choose to inspect the status or ignore it, but you must read it. Otherwise, MySQL will complain that there are unprocessed result sets and then halt. So call dbNextResult if necessary:

if (dbMoreResults(con)) dbNextResult(con)

Call dbGetQuery repeatedly to perform multiple queries, checking for the result status after each call (and reading it, if necessary). When you are done, close the database connection using dbDisconnect:

dbDisconnect(con)

Here is a complete session that reads and prints three rows from a database of stock prices. The query selects the price of IBM stock for the last three days of 2008. It assumes that the username, password, and host are defined in the my.cnf file:

con <- dbConnect(MySQL(), client.flag = CLIENT_MULTI_RESULTS)
sql <- paste(
  "select * from DailyBar where Symbol = 'IBM'",
  "and Day between '2008-12-29' and '2008-12-31'"
)
rows <- dbGetQuery(con, sql)
if (dbMoreResults(con)) {
  dbNextResult(con)
}
dbDisconnect(con)
print(rows)

The output looks like this:

  Symbol        Day       Next OpenPx HighPx LowPx ClosePx AdjClosePx
1    IBM 2008-12-29 2008-12-30  81.72  81.72 79.68   81.25      81.25
2    IBM 2008-12-30 2008-12-31  81.83  83.64 81.52   83.55      83.55
3    IBM 2008-12-31 2009-01-02  83.50  85.00 83.50   84.16      84.16
  HistClosePx  Volume OpenInt
1       81.25 6062600      NA
2       83.55 5774400      NA
3       84.16 6667700      NA

See Also

See “Installing Packages from CRAN” and the documentation for RMySQL, which contains more details about configuring and using the package.

See “Accessing a Database with dbplyr” for information about how to get data from a SQL database without actually writing SQL yourself.

R can read from several other RDBMS systems, including Oracle, Sybase, PostgreSQL, and SQLite. For more information, see the R Data Import/Export guide, which is supplied with the base distribution (“Viewing the Supplied Documentation”) and is also available on CRAN at http://cran.r-project.org/doc/manuals/R-data.pdf.

Accessing a Database with dbplyr

Problem

You want to access a database, but you’d rather not write SQL code in order to manipulate data and return results to R.

Solution

In addition to being a grammar of data manipulation, the tidyverse package dplyr can, in connection with the dbplyr package, turn dplyr commands into SQL for you.

Let’s set up an example database using RSQLite and then we’ll connect to it and use dplyr and the dbplyr backend to extract data.

Set up the example table by loading the msleep example data into an in-memory SQLite database:

library(tidyverse)   # dplyr provides copy_to and tbl; ggplot2 (loaded via tidyverse) provides msleep

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
sleep_db <- copy_to(con, msleep, "sleep")

Now that we have a table in our database, we can create a reference to it from R

sleep_table <- tbl(con, "sleep")

The sleep_table object is a type of pointer or alias to the table on the database. However, dplyr will treat it like a regular tidyverse tibble or data frame, so you can operate on it using dplyr and other R commands. Let’s select all animals from the data who sleep less than 3 hours:

little_sleep <- sleep_table %>%
  select(name, genus, order, sleep_total) %>%
  filter(sleep_total < 3)

The dbplyr backend does not fetch the data when we run these commands; it only builds the query and gets ready. To see the query built by dplyr, you can use show_query:

show_query(little_sleep)
#> <SQL>
#> SELECT *
#> FROM (SELECT `name`, `genus`, `order`, `sleep_total`
#> FROM `sleep`)
#> WHERE (`sleep_total` < 3.0)

Then, to bring the data back to your local machine, use collect:

local_little_sleep <- collect(little_sleep)
local_little_sleep
#> # A tibble: 3 x 4
#>   name        genus         order          sleep_total
#>   <chr>       <chr>         <chr>                <dbl>
#> 1 Horse       Equus         Perissodactyla         2.9
#> 2 Giraffe     Giraffa       Artiodactyla           1.9
#> 3 Pilot whale Globicephalus Cetacea                2.7

Discussion

By using dplyr commands to access SQL databases, you can be more productive: you don’t have to switch from one language to another and back again. The alternative is to have large chunks of SQL code stored as text strings in the middle of an R script, or to keep the SQL in separate files that are read in by R.

By allowing dplyr to transparently create the SQL in the background, you are freed from having to maintain separate SQL code to extract data.

The dbplyr package uses DBI to connect to your database, so you’ll need a DBI backend package for whichever database you want to access.

Some commonly used backend DBI packages are:

odbc

Uses the open database connectivity protocol to connect to many different databases. This is typically the best choice when connecting to Microsoft SQL Server. ODBC is generally straightforward on Windows machines but may require considerable effort to get working on Linux or macOS.

RPostgreSQL

For connecting to Postgres and Redshift.

RMySQL

For MySQL and MariaDB.

RSQLite

Connecting to SQLite databases on disk or in memory.

bigrquery

For connections to Google’s BigQuery.

Each of the DBI backend packages listed above is available on CRAN and can be installed with the usual install.packages('packagename') command.
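
For example, here is a sketch of connecting through the odbc backend instead of RSQLite; the data source name and table name are hypothetical:

con <- DBI::dbConnect(odbc::odbc(), dsn = "my_datasource")   # a DSN configured on your machine
sales_table <- tbl(con, "sales")                             # then use dplyr just as shown above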

See Also

For more information about connecting to databases with R and RStudio, see https://db.rstudio.com/

For more detail on SQL translation in dbplyr, see the sql-translation vignette at vignette("sql-translation") or http://dbplyr.tidyverse.org/articles/sql-translation.html

Saving and Transporting Objects

Problem

You want to store one or more R objects in a file for later use, or you want to copy an R object from one machine to another.

Solution

Write the objects to a file using the save function:

save(tbl, t, file = "myData.RData")

Read them back using the load function, either on your computer or on any platform that supports R:

load("myData.RData")

The save function writes binary data. To save in an ASCII format, use dput or dump instead:

dput(tbl, file = "myData.txt")
dump("tbl", file = "myData.txt")    # Note quotes around variable name

Discussion

We’ve found ourselves with a large, complicated data object that we want to load into other workspaces, or we may want to move R objects between a Linux box and a Windows box. The load and save functions let us do all this: save will store the object in a file that is portable across machines, and load can read those files.

When you run load, it does not return your data per se; rather, it creates variables in your workspace, loads your data into those variables, and then returns the names of the variables (in a vector). The first time you run load, you might be tempted to do this:

myData <- load("myData.RData")     # Achtung! Might not do what you think

Let’s look at what myData actually contains:

myData
#> [1] "tbl" "t"
str(myData)
#>  chr [1:2] "tbl" "t"

This might be puzzling, because myData does not contain your data at all; it holds only the names of the variables that load created and populated. It can be perplexing and frustrating the first time.

The save function writes in a binary format to keep the file small. Sometimes you want an ASCII format instead. When you submit a question to a mailing list or to Stack Overflow, for example, including an ASCII dump of the data lets others re-create your problem. In such cases use dput or dump, which write an ASCII representation.
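
For example, a small sketch (the file path is arbitrary): dput writes an ASCII representation of an object, and dget re-creates the object from that file:

dput(head(iris, 3), file = "./data/iris3.txt")   # write an ASCII representation
iris3 <- dget("./data/iris3.txt")                # re-create the object from the file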

Be careful when you save and load objects created by a particular R package. When you load the objects, R does not automatically load the required packages, too, so it will not “understand” the object unless you previously loaded the package yourself. For instance, suppose you have an object called z created by the zoo package, and suppose you save the object in a file called z.RData. The following sequence of functions will create some confusion:

load("./data/z.RData")   # Create and populate the z variable
plot(z)                  # Does not plot as expected: zoo pkg not loaded

We should have loaded the zoo package before printing or plotting any zoo objects, like this:

library(zoo)                  # Load the zoo package into memory
load("./data/z.RData") # Create and populate the z variable
plot(z)                       # Ahhh. Now plotting works correctly
Figure 4-2. Plotting with zoo

And you can see the resulting plot in Figure 4-2.

Chapter 5. Data Structures

Introduction

You can get pretty far in R just using vectors. That’s what Chapter 2 is all about. This chapter moves beyond vectors to recipes for matrices, lists, factors, data frames, and Tibbles (which are a special case of data frames). If you have preconceptions about data structures, I suggest you put them aside. R does data structures differently than many other languages.

If you want to study the technical aspects of R’s data structures, I suggest reading R in a Nutshell (O’Reilly) and the R Language Definition. The notes here are more informal. These are things we wish we’d known when we started using R.

Vectors

Here are some key properties of vectors:

Vectors are homogeneous

All elements of a vector must have the same type or, in R terminology, the same mode.

Vectors can be indexed by position

So v[2] refers to the second element of v.

Vectors can be indexed by multiple positions, returning a subvector

So v[c(2,3)] is a subvector of v that consists of the second and third elements.

Vector elements can have names

Vectors have a names property, the same length as the vector itself, that gives names to the elements:

v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")
print(v)
#>   Moe Larry Curly
#>    10    20    30
If vector elements have names, then you can select them by name

Continuing the previous example:

v["Larry"]
#> Larry
#>    20
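
To make the indexing properties listed above concrete, here is a brief sketch that continues with the named vector v:

v[2]            # select the second element by position
#> Larry
#>    20
v[c(2, 3)]      # a subvector consisting of the second and third elements
#> Larry Curly
#>    20    30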

Lists

Lists are heterogeneous

Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.

Lists can be indexed by position

So lst[[2]] refers to the second element of lst. Note the double square brackets: they mean that R returns the element itself, in whatever type it happens to be.

Lists let you extract sublists

So lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets: they mean that R returns the selected elements wrapped in a list. If you pull a single element with single brackets, as in lst[2], R returns a list of length 1 whose only item is the desired element (a short sketch at the end of this section illustrates the difference).

List elements can have names

Both lst[["Moe"]] and lst$Moe refer to the element named “Moe”.

Since lists are heterogeneous and since their elements can be retrieved by name, a list is like a dictionary or hash or lookup table in other programming languages (“Building a Name/Value Association List”). What’s surprising (and cool) is that in R, unlike most of those other programming languages, lists can also be indexed by position.
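
Here is a brief sketch of these access patterns; the list and its values are made up for illustration:

lst <- list(Moe = "oldest", Larry = 1902, Curly = TRUE)   # heterogeneous elements
lst[[2]]    # double brackets return the element itself
#> [1] 1902
lst[2]      # single brackets return a list of length 1 containing the element
#> $Larry
#> [1] 1902
lst$Moe     # elements can be selected by name
#> [1] "oldest"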

Mode: Physical Type

In R, every object has a mode, which indicates how it is stored in memory: as a number, as a character string, as a list of pointers to other objects, as a function, and so forth:

Object                        Example                                     Mode
Number                        3.1415                                      numeric
Vector of numbers             c(2.7182, 3.1415)                           numeric
Character string              "Moe"                                       character
Vector of character strings   c("Moe", "Larry", "Curly")                  character
Factor                        factor(c("NY", "CA", "IL"))                 numeric
List                          list("Moe", "Larry", "Curly")               list
Data frame                    data.frame(x=1:3, y=c("NY", "CA", "IL"))    list
Function                      print                                       function

The mode function gives us this information:

mode(3.1415)                        # Mode of a number
#> [1] "numeric"
mode(c(2.7182, 3.1415))             # Mode of a vector of numbers
#> [1] "numeric"
mode("Moe")                         # Mode of a character string
#> [1] "character"
mode(list("Moe", "Larry", "Curly")) # Mode of a list
#> [1] "list"

A critical difference between a vector and a list can be summed up this way:

  • In a vector, all elements must have the same mode.

  • In a list, the elements can have different modes.

Class: Abstract Type

In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of “numeric” because they are stored as a number; but they could have different classes to indicate their interpretation.

For example, a Date object consists of a single number:

d <- as.Date("2010-03-15")
mode(d)
#> [1] "numeric"
length(d)
#> [1] 1

But it has a class of Date, telling us how to interpret that number; namely, as the number of days since January 1, 1970:

class(d)
#> [1] "Date"

R uses an object’s class to decide how to process the object. For example, the generic function print has specialized versions (called methods) for printing objects according to their class: data.frame, Date, lm, and so forth. When you print an object, R calls the appropriate print function according to the object’s class.
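
For example, here is a small illustration: stripping the Date class with unclass exposes the underlying number, and print treats the two objects very differently:

d <- as.Date("2010-03-15")
print(d)            # print dispatches on the Date class and formats a date
#> [1] "2010-03-15"
print(unclass(d))   # without the class, it is just the number of days since 1970-01-01
#> [1] 14683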

Scalars

The quirky thing about scalars is their relationship to vectors. In some software, scalars and vectors are two different things. In R, they are the same thing: a scalar is simply a vector that contains exactly one element. In this book I often use the term “scalar”, but that’s just shorthand for “vector with one element.”

Consider the built-in constant pi. It is a scalar:

pi
#> [1] 3.14

Since a scalar is a one-element vector, you can use vector functions on pi:

length(pi)
#> [1] 1

You can index it. The first (and only) element is π, of course:

pi[1]
#> [1] 3.14

If you ask for the second element, there is none:

pi[2]
#> [1] NA

Matrices

In R, a matrix is just a vector that has dimensions. It may seem strange at first, but you can transform a vector into a matrix simply by giving it dimensions.

A vector has an attribute called dim, which is initially NULL, as shown here:

A <- 1:6
dim(A)
#> NULL
print(A)
#> [1] 1 2 3 4 5 6

We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:

dim(A) <- c(2, 3)
print(A)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

Voilà! The vector was reshaped into a 2 × 3 matrix.

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

B <- list(1, 2, 3, 4, 5, 6)
dim(B)
#> NULL

If we set the dim attribute, it gives the list a shape:

dim(B) <- c(2, 3)
print(B)
#>      [,1] [,2] [,3]
#> [1,] 1    3    5
#> [2,] 2    4    6

Voilà! We have turned this list into a 2 × 3 matrix.

Arrays

The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:

D <- 1:12
dim(D) <- c(2, 3, 2)
print(D)
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    7    9   11
#> [2,]    8   10   12

Note that R prints one “slice” of the structure at a time, since it’s not possible to print a 3-dimensional structure on a 2-dimensional medium.

It strikes us as very odd that we can turn a list into a matrix just by giving the list a dim attribute. But wait; it gets stranger.

Recall that a list can be heterogeneous (mixed modes). We can start with a heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This code snippet creates a matrix that is a mix of numeric and character data:

C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
#>      [,1] [,2] [,3]
#> [1,] 1    3    "Y"
#> [2,] 2    "X"  "Z"

To me this is strange because I ordinarily assume a matrix is purely numeric, not mixed. R is not that restrictive.

The possibility of a heterogeneous matrix may seem powerful and strangely fascinating. However, it creates problems when you are doing normal, day-to-day stuff with matrices. For example, what happens when the matrix C (above) is used in matrix multiplication? What happens if it is converted to a data frame? The answer is that odd things happen.

In this book, I generally ignore the pathological case of a heterogeneous matrix. I assume you’ve got simple, vanilla matrices. Some recipes involving matrices may work oddly (or not at all) if your matrix contains mixed data. Converting such a matrix to a vector or data frame, for instance, can be problematic (“Converting One Structured Data Type into Another”).

Factors

A factor looks like a character vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.

There are two key uses for factors:

Categorical variables

A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.

Grouping

This is a technique for labeling or tagging your data items according to their group. See the Introduction to Data Transformations.

Data Frames

A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.

A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:

  • The elements of the list are vectors and/or factors.1

  • Those vectors and factors are the columns of the data frame.

  • The vectors and factors must all have the same length; in other words, all columns must have the same height.

  • The equal-height columns give a rectangular shape to the data frame.

  • The columns must have names.

Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:

  • You can use list operators to extract columns from a data frame, such as df[i], df[[i]], or df$name.

  • You can use matrix-like notation, such as df[i,j], df[i,], or df[,j].
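
Here is a brief sketch of both paradigms on a small made-up data frame:

df <- data.frame(x = 1:3, y = c(10, 20, 30))
df$y          # list-style access: the y column as a vector
#> [1] 10 20 30
df[[1]]       # list-style access: the first column, also a vector
#> [1] 1 2 3
df[3, ]       # matrix-style access: the third row
#>   x  y
#> 3 3 30
df[, "y"]     # matrix-style access: the y column
#> [1] 10 20 30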

Your perception of a data frame likely depends on your background:

To a statistician

A data frame is a table of observations. Each row contains one observation. Each observation must contain the same variables. These variables are called columns, and you can refer to them by name. You can also refer to the contents by row number and column number, just as with a matrix.

To a SQL programmer

A data frame is a table. The table resides entirely in memory, but you can save it to a flat file and restore it later. You needn’t declare the column types because R figures that out for you.

To an Excel user

A data frame is like a worksheet, or perhaps a range within a worksheet. It is more restrictive, however, in that each column has a type.

To a SAS user

A data frame is like a SAS dataset for which all the data resides in memory. R can read and write the data frame to disk, but the data frame must be in memory while R is processing it.

To an R programmer

A data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.

To a computer scientist

A data frame is a rectangular data structure. The columns are strongly typed, and each column must be numeric values, character strings, or a factor. Columns must have labels; rows may have labels. The table can be indexed by position, column name, and/or row name. It can also be accessed by list operators, in which case R treats the data frame as a list whose elements are the columns of the data frame.

To an executive

You can put names and numbers into a data frame. It’s easy! A data frame is like a little database. Your staff will enjoy using data frames.

Tibbles

A tibble is a modern reimagining of the data frame, introduced by Hadley Wickham as part of the Tidyverse. Most of the common functions you would use with data frames also work with tibbles. However, tibbles typically do less than data frames and complain more. This idea of complaining and doing less may remind you of your least favorite coworker, but we think tibbles will become one of your favorite data structures. Doing less and complaining more can be a feature, not a bug.

Unlike data frames, tibbles:

  • Do not give you row names by default.

  • Do not coerce column names into syntactically valid R names, surprising you with names different from what you expected.

  • Do not coerce your data into factors unless you explicitly ask for that.

  • Recycle only vectors of length 1.

In addition to basic data frame functionality, tibbles:

  • Print only the first few rows and a bit of metadata by default.

  • Always return a tibble when subsetting.

  • Never do partial matching: if you want a column from a tibble, you have to ask for it using its full name.

  • Complain more, giving you warnings and chatty messages to make sure you understand what the software is doing.

All these extras are designed to give you fewer surprises and help you be more productive.

Appending Data to a Vector

Problem

You want to append additional data items to a vector.

Solution

Use the vector constructor (c) to construct a vector with the additional data items:

v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
v <- c(v, newItems)
v
#> [1] 1 2 3 6 7 8

For a single item, you can also assign the new item to the next vector element. R will automatically extend the vector:

v[length(v) + 1] <- 42
v
#> [1]  1  2  3  6  7  8 42

Discussion

If you ask us about appending a data item to a vector, we will likely suggest that maybe you shouldn’t.

Warning

R works best when you think about entire vectors, not single data items. Are you repeatedly appending items to a vector? If so, then you are probably working inside a loop. That’s OK for small vectors, but for large vectors your program will run slowly. The memory management in R works poorly when you repeatedly extend a vector by one element. Try to replace that loop with vector-level operations. You’ll write less code, and R will run much faster.
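
For example, here is a sketch of the slow pattern and its vectorized replacement:

# Slow: growing the vector one element at a time inside a loop
v <- numeric(0)
for (i in 1:10000) {
  v <- c(v, sqrt(i))
}

# Fast: compute the whole vector in one vectorized operation
v <- sqrt(1:10000)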

Nonetheless, one does occasionally need to append data to vectors. Our experiments show that the most efficient way is to create a new vector using the vector constructor (c) to join the old and new data. This works for appending single elements or multiple elements:

v <- c(1, 2, 3)
v <- c(v, 4) # Append a single value to v
v
#> [1] 1 2 3 4

w <- c(5, 6, 7, 8)
v <- c(v, w) # Append an entire vector to v
v
#> [1] 1 2 3 4 5 6 7 8

You can also append an item by assigning it to the position past the end of the vector, as shown in the Solution. In fact, R is very liberal about extending vectors. You can assign to any element and R will expand the vector to accommodate your request:

v <- c(1, 2, 3) # Create a vector of three elements
v[10] <- 10 # Assign to the 10th element
v # R extends the vector automatically
#>  [1]  1  2  3 NA NA NA NA NA NA 10

Note that R did not complain about the out-of-bounds subscript. It just extended the vector to the needed length, filling with NA.

R includes an append function that creates a new vector by appending items to an existing vector. However, our experiments show that this function runs more slowly than both the vector constructor and the element assignment.

Inserting Data into a Vector

Problem

You want to insert one or more data items into a vector.

Solution

Despite its name, the append function inserts data into a vector by using the after parameter, which gives the insertion point for the new item or items:

v
#>  [1]  1  2  3 NA NA NA NA NA NA 10
newvalues <- c(100, 101)
n <- 2
append(v, newvalues, after = n)
#>  [1]   1   2 100 101   3  NA  NA  NA  NA  NA  NA  10

Discussion

The new items will be inserted after the position given by after. This example inserts 99 into the middle of a sequence:

append(1:10, 99, after = 5)
#>  [1]  1  2  3  4  5 99  6  7  8  9 10

The special value of after=0 means insert the new items at the head of the vector:

append(1:10, 99, after = 0)
#>  [1] 99  1  2  3  4  5  6  7  8  9 10

The comments in “Appending Data to a Vector” apply here, too. If you are inserting single items into a vector, you might be working at the element level when working at the vector level would be easier to code and faster to run.

Understanding the Recycling Rule

Problem

You want to understand the mysterious Recycling Rule that governs how R handles vectors of unequal length.

Discussion

When you do vector arithmetic, R performs element-by-element operations. That works well when both vectors have the same length: R pairs the elements of the vectors and applies the operation to those pairs.

But what happens when the vectors have unequal lengths?

In that case, R invokes the Recycling Rule. It processes the vector elements in pairs, starting at the first elements of both vectors. At a certain point, the shorter vector is exhausted while the longer vector still has unprocessed elements. R returns to the beginning of the shorter vector, “recycling” its elements; continues taking elements from the longer vector; and completes the operation. It will recycle the shorter-vector elements as often as necessary until the operation is complete.

It’s useful to visualize the Recycling Rule. Here is a diagram of two vectors, 1:6 and 1:3:

   1:6   1:3
  ----- -----
    1     1
    2     2
    3     3
    4
    5
    6

Obviously, the 1:6 vector is longer than the 1:3 vector. If we try to add the vectors using (1:6) + (1:3), it appears that 1:3 has too few elements. However, R recycles the elements of 1:3, pairing the two vectors like this and producing a six-element vector:

   1:6   1:3   (1:6) + (1:3)
  ----- ----- ---------------
    1     1         2
    2     2         4
    3     3         6
    4               5
    5               7
    6               9

Here is what you see in the R console:

(1:6) + (1:3)
#> [1] 2 4 6 5 7 9

It’s not only vector operations that invoke the Recycling Rule; functions can, too. The cbind function can create column vectors, such as the following column vectors of 1:6 and 1:3. The two columns have different heights, of course:

cbind(1:6)

cbind(1:3)

If we try binding these column vectors together into a two-column matrix, the lengths are mismatched. The 1:3 vector is too short, so cbind invokes the Recycling Rule and recycles the elements of 1:3:

cbind(1:6, 1:3)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    3
#> [4,]    4    1
#> [5,]    5    2
#> [6,]    6    3

If the longer vector’s length is not a multiple of the shorter vector’s length, R gives a warning. That’s good, since the operation is highly suspect and there is likely a bug in your logic:

(1:6) + (1:5) # Oops! 1:5 is one element too short
#> Warning in (1:6) + (1:5): longer object length is not a multiple of shorter
#> object length
#> [1]  2  4  6  8 10  7

Once you understand the Recycling Rule, you will realize that operations between a vector and a scalar are simply applications of that rule. In this example, the 10 is recycled repeatedly until the vector addition is complete:

(1:6) + 10
#> [1] 11 12 13 14 15 16

Creating a Factor (Categorical Variable)

Problem

You have a vector of character strings or integers. You want R to treat them as a factor, which is R’s term for a categorical variable.

Solution

The factor function encodes your vector of discrete values into a factor:

v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v) # v can be a vector of strings or integers
f
#> [1] dog   cat   mouse rat   dog
#> Levels: cat dog mouse rat
str(f)
#>  Factor w/ 4 levels "cat","dog","mouse",..: 2 1 3 4 2

If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor:

v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v, levels = c("dog", "cat", "mouse", "rat", "horse"))
f
#> [1] dog   cat   mouse rat   dog
#> Levels: dog cat mouse rat horse
str(f)
#>  Factor w/ 5 levels "dog","cat","mouse",..: 1 2 3 4 1

Discussion

In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models.

Most of the time, converting your categorical data into a factor is a simple matter of calling the factor function, which identifies the distinct levels of the categorical data and packs them into a factor:

f <- factor(c("Win", "Win", "Lose", "Tie", "Win", "Lose"))
f
#> [1] Win  Win  Lose Tie  Win  Lose
#> Levels: Lose Tie Win

Notice that when we printed the factor, f, R did not put quotes around the values. They are levels, not strings. Also notice that when we printed the factor, R also displayed the distinct levels below the factor.

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels. Suppose you have a string-valued variable wday that gives the day of the week on which your data was observed:

wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday)
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Thu Tue Wed

R thinks that Monday, Thursday, Tuesday, and Wednesday are the only possible levels. Friday is not listed. Apparently, the lab staff never made observations on Friday, so R does not know that Friday is a possible value. Hence you need to list the possible levels of wday explicitly:

f <- factor(wday, c("Mon", "Tue", "Wed", "Thu", "Fri"))
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Tue Wed Thu Fri

Now R understands that f is a factor with five possible levels. It knows their correct order, too. It originally put Thursday before Tuesday because it assumes alphabetical order by default. The explicit second argument defines the correct order.

In many situations it is not necessary to call factor explicitly. When an R function requires a factor, it usually converts your data to a factor automatically. The table function, for instance, works only on factors, so it routinely converts its inputs to factors without asking. You must explicitly create a factor variable when you want to specify the full set of levels or when you want to control the ordering of levels.
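
For example, here is a small sketch reusing the wday values from above; table converts the character vector to a factor internally, with no explicit call to factor on our part:

wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
table(wday) # table() coerces wday to a factor before tabulating
#> wday
#> Mon Thu Tue Wed
#>   1   5   2   2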

When creating a data frame using base R functions like data.frame, the default behavior is to turn text fields into factors. This has caused grief and consternation for many R users over the years, as we often expect text fields to be imported simply as text, not factors. Tibbles, part of the Tidyverse of tools, on the other hand, never convert text to factors by default.
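
Here is a minimal sketch of that difference. It assumes the tibble package is installed and the pre-R 4.0 default of stringsAsFactors = TRUE, which was the current default when this was written:

str(data.frame(x = c("dog", "cat"))$x) # base R: characters become a factor
#>  Factor w/ 2 levels "cat","dog": 2 1
str(tibble::tibble(x = c("dog", "cat"))$x) # tibble: characters stay characters
#>  chr [1:2] "dog" "cat"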

See Also

See Recipe X-X to create a factor from continuous data.

Combining Multiple Vectors into One Vector and a Factor

Problem

You have several groups of data, with one vector for each group. You want to combine the vectors into one large vector and simultaneously create a parallel factor that identifies each value’s original group.

Solution

Create a list that contains the vectors. Use the stack function to combine the list into a two-column data frame:

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
v3 <- c(7, 8, 9)
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
comb
#>   values ind
#> 1      1  v1
#> 2      2  v1
#> 3      3  v1
#> 4      4  v2
#> 5      5  v2
#> 6      6  v2
#> 7      7  v3
#> 8      8  v3
#> 9      9  v3

The data frame’s columns are called values and ind. The first column contains the data, and the second column contains the parallel factor.

Discussion

Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format.

Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA analysis of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:

set.seed(2)
n <- 5
freshmen <- sample(1:5, n, replace = TRUE, prob = c(.6, .2, .1, .05, .05))
sophomores <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .6, .1, .05))
juniors <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .55, .15, .05))

comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))
print(comb)
#>    values   ind
#> 1       1 fresh
#> 2       2 fresh
#> 3       1 fresh
#> 4       1 fresh
#> 5       5 fresh
#> 6       5  soph
#> 7       3  soph
#> 8       4  soph
#> 9       3  soph
#> 10      3  soph
#> 11      2   jrs
#> 12      3   jrs
#> 13      4   jrs
#> 14      3   jrs
#> 15      3   jrs

Now you can perform the ANOVA analysis on the two columns:

aov(values ~ ind, data = comb)
#> Call:
#>    aov(formula = values ~ ind, data = comb)
#>
#> Terms:
#>                   ind Residuals
#> Sum of Squares   6.53     17.20
#> Deg. of Freedom     2        12
#>
#> Residual standard error: 1.2
#> Estimated effects may be unbalanced

When building the list we must provide tags for the list elements (the tags are fresh, soph, and jrs in this example). Those tags are required because stack uses them as the levels of the parallel factor.

Creating a List

Problem

You want to create and populate a list.

Solution

To create a list from individual data items, use the list function:

x <- c("a", "b", "c")
y <- c(1, 2, 3)
z <- "why be normal?"
lst <- list(x, y, z)
lst
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] 1 2 3
#>
#> [[3]]
#> [1] "why be normal?"

Discussion

Lists can be quite simple, such as this list of three numbers:

lst <- list(0.5, 0.841, 0.977)
lst
#> [[1]]
#> [1] 0.5
#>
#> [[2]]
#> [1] 0.841
#>
#> [[3]]
#> [1] 0.977

When R prints the list, it identifies each list element by its position ([[1]], [[2]], [[3]]) and prints the element’s value (e.g., [1] 0.5) under its position.

More usefully, lists can, unlike vectors, contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:

lst <- list(3.14, "Moe", c(1, 1, 2, 3), mean)
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>

You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:

lst <- list()
lst[[1]] <- 3.14
lst[[2]] <- "Moe"
lst[[3]] <- c(1, 1, 2, 3)
lst[[4]] <- mean
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>

List elements can be named. The list function lets you supply a name for every element:

lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

See Also

See the “Introduction” to this chapter for more about lists; see “Building a Name/Value Association List” for more about building and using lists with named elements.

Selecting List Elements by Position

Problem

You want to access list elements by position.

Solution

Use one of these ways. Here, lst is a list variable:

lst[[n]]

Select the _n_th element from the list.

lst[c(n1, n2, ..., nk)]

Returns a list of elements, selected by their positions.

Note that the first form returns a single element and the second returns a list.

Discussion

Suppose we have a list of four integers, called years:

years <- list(1960, 1964, 1976, 1994)
years
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964
#>
#> [[3]]
#> [1] 1976
#>
#> [[4]]
#> [1] 1994

We can access single elements using the double-square-bracket syntax:

years[[1]]
#> [1] 1960

We can extract sublists using the single-square-bracket syntax:

years[c(1, 2)]
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964

This syntax can be confusing because of a subtlety: there is an important difference between lst[[n]] and lst[n]. They are not the same thing:

lst[[n]]

This is an element, not a list. It is the _n_th element of lst.

lst[n]

This is a list, not an element. The list contains one element, taken from the _n_th element of lst. This is a special case of lst[c(n1, n2, ..., nk)] in which we eliminated the c() construct because there is only one n.

The difference becomes apparent when we inspect the structure of the result—one is a number; the other is a list:

class(years[[1]])
#> [1] "numeric"

class(years[1])
#> [1] "list"

The difference becomes annoyingly apparent when we cat the value. Recall that cat can print atomic values or vectors but complains about printing structured objects:

cat(years[[1]], "\n")
#> 1960

cat(years[1], "\n")
#> Error in cat(years[1], "\n"): argument 1 (type 'list') cannot be handled by 'cat'

We got lucky here because R alerted us to the problem. In other contexts, you might work long and hard to figure out that you accessed a sublist when you wanted an element, or vice versa.

Selecting List Elements by Name

Problem

You want to access list elements by their names.

Solution

Use one of these forms. Here, lst is a list variable:

lst[["name"]]

Selects the element called name. Returns NULL if no element has that name.

lst$name

Same as previous, just different syntax.

lst[c("name1", "name2", ..., "namek")]

Returns a list built from the indicated elements of lst.

Note that the first two forms return an element whereas the third form returns a list.

Discussion

Each element of a list can have a name. If named, the element can be selected by its name. This assignment creates a list of four named integers:

years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)

These next two expressions return the same value—namely, the element that is named “Kennedy”:

years[["Kennedy"]]
#> [1] 1960
years$Kennedy
#> [1] 1960

The following two expressions return sublists extracted from years:

years[c("Kennedy", "Johnson")]
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964

years["Carter"]
#> $Carter
#> [1] 1976

Just as with selecting list elements by position (“Selecting List Elements by Position”), there is an important difference between lst[["name"]] and lst["name"]. They are not the same:

lst[["name"]]

This is an element, not a list.

lst["name"]

This is a list, not an element. This is a special case of lst[c("name1", "name2", ..., "namek")] in which we don’t need the c() construct because there is only one name.

See Also

See “Selecting List Elements by Position” to access elements by position rather than by name.

Building a Name/Value Association List

Problem

You want to create a list that associates names and values — as would a dictionary, hash, or lookup table in another programming language.

Solution

The list function lets you give names to elements, creating an association between each name and its value:

lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

If you have parallel vectors of names and values, you can create an empty list and then populate the list by using a vectorized assignment statement:

values <- c(1, 2, 3)
names <- c("a", "b", "c")
lst <- list()
lst[names] <- values
lst
#> $a
#> [1] 1
#>
#> $b
#> [1] 2
#>
#> $c
#> [1] 3

Discussion

Each element of a list can be named, and you can retrieve list elements by name. This gives you a basic programming tool: the ability to associate names with values.

You can assign element names when you build the list. The list function allows arguments of the form name=value:

lst <- list(
  far.left = 0.023,
  left = 0.159,
  mid = 0.500,
  right = 0.841,
  far.right = 0.977
)
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

One way to name the elements is to create an empty list and then populate it via assignment statements:

lst <- list()
lst$far.left <- 0.023
lst$left <- 0.159
lst$mid <- 0.500
lst$right <- 0.841
lst$far.right <- 0.977
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

Sometimes you have a vector of names and a vector of corresponding values:

values <- pnorm(-2:2)
names <- c("far.left", "left", "mid", "right", "far.right")

You can associate the names and the values by creating an empty list and then populating it with a vectorized assignment statement:

lst <- list()
lst[names] <- values

Once the association is made, the list can “translate” names into values through a simple list lookup:

cat("The left limit is", lst[["left"]], "\n")
#> The left limit is 0.159
cat("The right limit is", lst[["right"]], "\n")
#> The right limit is 0.841

for (nm in names(lst)) cat("The", nm, "limit is", lst[[nm]], "\n")
#> The far.left limit is 0.0228
#> The left limit is 0.159
#> The mid limit is 0.5
#> The right limit is 0.841
#> The far.right limit is 0.977

Removing an Element from a List

Problem

You want to remove an element from a list.

Solution

Assign NULL to the element. R will remove it from the list.

Discussion

To remove a list element, select it by position or by name, and then assign NULL to the selected element:

years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)
years
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994
years[["Johnson"]] <- NULL # Remove the element labeled "Johnson"
years
#> $Kennedy
#> [1] 1960
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994

You can remove multiple elements this way, too:

years[c("Carter", "Clinton")] <- NULL # Remove two elements
years
#> $Kennedy
#> [1] 1960

Flatten a List into a Vector

Problem

You want to flatten all the elements of a list into a vector.

Solution

Use the unlist function.

Discussion

There are many contexts that require a vector. Basic statistical functions work on vectors but not on lists, for example. If iq.scores is a list of numbers, then we cannot directly compute their mean:

iq.scores <- list(rnorm(5, 100, 15))
iq.scores
#> [[1]]
#> [1] 115.8  88.7  78.4  95.7  84.5
mean(iq.scores)
#> Warning in mean.default(iq.scores): argument is not numeric or logical:
#> returning NA
#> [1] NA

Instead, we must flatten the list into a vector using unlist and then compute the mean of the result:

mean(unlist(iq.scores))
#> [1] 92.6

Here is another example. We can cat scalars and vectors, but we cannot cat a list:

cat(iq.scores, "\n")
#> Error in cat(iq.scores, "\n"): argument 1 (type 'list') cannot be handled by 'cat'

One solution is to flatten the list into a vector before printing:

cat("IQ Scores:", unlist(iq.scores), "\n")
#> IQ Scores: 116 88.7 78.4 95.7 84.5

See Also

Conversions such as this are discussed more fully in “Converting One Structured Data Type into Another”.

Removing NULL Elements from a List

Problem

Your list contains NULL values. You want to remove them.

Solution

Suppose lst is a list some of whose elements are NULL. This expression will remove the NULL elements:

lst <- list(1, NULL, 2, 3, NULL, 4)
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 3
#>
#> [[5]]
#> NULL
#>
#> [[6]]
#> [1] 4
lst[sapply(lst, is.null)] <- NULL
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4

Discussion

Finding and removing NULL elements from a list is surprisingly tricky. The recipe above was written by one of the authors in a fit of frustration after trying many other solutions that didn’t work. Here’s how it works (a short sketch after the steps shows the intermediate logical vector):

  1. R calls sapply to apply the is.null function to every element of the list.

  2. sapply returns a vector of logical values that are TRUE wherever the corresponding list element is NULL.

  3. R selects values from the list according to that vector.

  4. R assigns NULL to the selected items, removing them from the list.
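
Here is a short sketch of that intermediate step, using the same list as above; the logical vector returned by sapply is exactly the index used to select, and then remove, the NULL elements:

lst <- list(1, NULL, 2, 3, NULL, 4)
sapply(lst, is.null)
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE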

The curious reader may be wondering how a list can contain NULL elements, given that we remove elements by setting them to NULL (“Removing an Element from a List”). The answer is that we can create a list containing NULL elements:

lst <- list("Moe", NULL, "Curly") # Create list with NULL element
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] "Curly"

lst[sapply(lst, is.null)] <- NULL # Remove NULL element from list
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> [1] "Curly"

In practice, we might end up with NULL items in a list because a function we wrote for some other purpose returned NULL for some of its results.

See Also

See “Removing an Element from a List” for how to remove list elements.

Removing List Elements Using a Condition

Problem

You want to remove elements from a list according to a conditional test, such as removing elements that are negative or smaller than some threshold.

Solution

Build a logical vector based on the condition. Use the vector to select list elements and then assign NULL to those elements. This assignment, for example, removes all negative value from lst:

lst <- as.list(rnorm(7))
lst
#> [[1]]
#> [1] -0.0281
#>
#> [[2]]
#> [1] -0.366
#>
#> [[3]]
#> [1] -1.12
#>
#> [[4]]
#> [1] -0.976
#>
#> [[5]]
#> [1] 1.12
#>
#> [[6]]
#> [1] 0.324
#>
#> [[7]]
#> [1] -0.568

lst[lst < 0] <- NULL
lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324

It’s worth noting that in the above example we use as.list instead of list to create a list from the 7 random values created by rnorm(7). The reason for this is that as.list will turn each element of a vector into a list item. On the other hand, list would have given us a list of length 1 where the first element was a vector containing 7 numbers:

list(rnorm(7))
#> [[1]]
#> [1] -1.034 -0.533 -0.981  0.823 -0.388  0.879 -2.178

Discussion

This recipe is based on two useful features of R. First, a list can be indexed by a logical vector. Wherever the vector element is TRUE, the corresponding list element is selected. Second, you can remove a list element by assigning NULL to it.

Suppose we want to remove elements from lst whose value is zero. We construct a logical vector which identifies the unwanted values (lst == 0). Then we select those elements from the list and assign NULL to them:

lst[lst == 0] <- NULL

This expression will remove NA values from the list:

lst[is.na(lst)] <- NULL

So far, so good. The problems arise when you cannot easily build the logical vector. That often happens when you want to use a function that cannot handle a list. Suppose you want to remove list elements whose absolute value is less than 1. The abs function will not handle a list, unfortunately:

lst[abs(lst) < 1] <- NULL
#> Error in abs(lst): non-numeric argument to mathematical function

The simplest solution is flattening the list into a vector by calling unlist and then testing the vector:

lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324
lst[abs(unlist(lst)) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.12

A more elegant solution uses lapply (the list apply function) to apply the function to every element of the list:

lst <- as.list(rnorm(5))
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 0.885
#>
#> [[3]]
#> [1] 2.29
#>
#> [[4]]
#> [1] 0.554
#>
#> [[5]]
#> [1] 1.21
lst[lapply(lst, abs) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 2.29
#>
#> [[3]]
#> [1] 1.21

Lists can hold complex objects, too, not just atomic values. Suppose that result_list is a list of linear models created by the lm function. This expression will remove any model whose R² value is less than 0.70:

x <- 1:10
y1 <- 2 * x + rnorm(10, 0, 1)
y2 <- 3 * x + rnorm(10, 0, 8)

result_list <- list(lm(x ~ y1), lm(x ~ y2))

result_list[sapply(result_list, function(m) summary(m)$r.squared < 0.7)] <- NULL

If we wanted to simply see the R2 values for each model, we could do the following:

sapply(result_list, function(m) summary(m)$r.squared)
#> [1] 0.990 0.708

Using sapply (simple apply) will return a vector of results. If we had used lapply we would have received a list in return:

lapply(result_list, function(m) summary(m)$r.squared)
#> [[1]]
#> [1] 0.99
#>
#> [[2]]
#> [1] 0.708

It’s worth noting that if you face a situation like the one above, you might also explore the package called broom on CRAN. Broom is designed to take output of models and put the results in a tidy format that fits better in a tidy-style workflow.
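
If you want to explore that route, here is a hedged sketch; it assumes the broom package is installed and reuses the result_list of lm fits from above:

library(broom)

glance(result_list[[1]]) # one row of fit statistics, including r.squared

# Pull out just the R-squared values, mirroring the sapply example above
sapply(result_list, function(m) glance(m)$r.squared)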

See Also

See Recipes , , , , , and .

Initializing a Matrix

Problem

You want to create a matrix and initialize it from given values.

Solution

Capture the data in a vector or list, and then use the matrix function to shape the data into a matrix. This example shapes a vector into a 2 × 3 matrix (i.e., two rows and three columns):

vec <- 1:6
matrix(vec, 2, 3)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

Discussion

The first argument of matrix is the data, the second argument is the number of rows, and the third argument is the number of columns. Observe that the matrix was filled column by column, not row by row.

It’s common to initialize an entire matrix to one value such as zero or NA. If the first argument of matrix is a single value, then R will apply the Recycling Rule and automatically replicate the value to fill the entire matrix:

matrix(0, 2, 3) # Create an all-zeros matrix
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    0

matrix(NA, 2, 3) # Create a matrix populated with NA
#>      [,1] [,2] [,3]
#> [1,]   NA   NA   NA
#> [2,]   NA   NA   NA

You can create a matrix with a one-liner, of course, but it becomes difficult to read:

mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3

A common idiom in R is typing the data itself in a rectangular shape that reveals the matrix structure:

theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3
)
mat <- matrix(theData, 2, 3, byrow = TRUE)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.2  1.3
#> [2,]  2.1  2.2  2.3

Setting byrow=TRUE tells matrix that the data is row-by-row and not column-by-column (which is the default). In condensed form, that becomes:

mat <- matrix(c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3
),
2, 3,
byrow = TRUE
)

Expressed this way, the reader quickly sees the two rows and three columns of data.

There is a quick-and-dirty way to turn a vector into a matrix: just assign dimensions to the vector. This was discussed in the “Introduction”. The following example creates a vanilla vector and then shapes it into a 2 × 3 matrix:

v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
dim(v) <- c(2, 3)
v
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3

Personally, I find this more opaque than using matrix, especially since there is no byrow option here.
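
If you do want row-by-row filling with the dim approach, one workaround (sketched here) is to fill the transposed shape and then transpose the result:

v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
dim(v) <- c(3, 2) # fill column by column into a 3 x 2 matrix
t(v) # transpose to recover the intended 2 x 3, row-by-row layout
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.2  1.3
#> [2,]  2.1  2.2  2.3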

Performing Matrix Operations

Problem

You want to perform matrix operations such as transpose, matrix inversion, matrix multiplication, or constructing an identity matrix.

Solution

t(A)

Matrix transposition of A

solve(A)

Matrix inverse of A

A %*% B

Matrix multiplication of A and B

diag(n)

An n-by-n diagonal (identity) matrix

Discussion

Recall that A*B is element-wise multiplication whereas A %*% B is matrix multiplication.

All these functions return a matrix. Their arguments can be either matrices or data frames. If they are data frames then R will first convert them to matrices (although this is useful only if the data frame contains exclusively numeric values).
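
Here is a brief sketch that pulls those operations together on a small matrix, including the contrast between * and %*%:

A <- matrix(1:4, 2, 2)
B <- diag(2) # the 2-by-2 identity matrix

A * B # element-wise multiplication
#>      [,1] [,2]
#> [1,]    1    0
#> [2,]    0    4

A %*% B # matrix multiplication; the identity leaves A unchanged
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4

t(A)     # transpose of A
solve(A) # inverse of A; A %*% solve(A) returns diag(2)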

Giving Descriptive Names to the Rows and Columns of a Matrix

Problem

You want to assign descriptive names to the rows or columns of a matrix.

Solution

Every matrix has a rownames attribute and a colnames attribute. Assign a vector of character strings to the appropriate attribute:

theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3,
  3.1, 3.2, 3.3
)
mat <- matrix(theData, 3, 3, byrow = TRUE)

rownames(mat) <- c("rowname1", "rowname2", "rowname3")
colnames(mat) <- c("colname1", "colname2", "colname3")
mat
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
#> rowname2      2.1      2.2      2.3
#> rowname3      3.1      3.2      3.3

Discussion

R lets you assign names to the rows and columns of a matrix, which is useful for printing the matrix. R will display the names if they are defined, enhancing the readability of your output. Below we use the quantmod library to pull stock prices for three tech stocks. Then we calculate daily returns and create a correlation matrix of the daily returns of Apple, Microsoft, and Google stock. No need to worry about the details here, unless stocks are your thing. We’re just creating some real-world data for illustration:

library("quantmod")
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
#>
#> Attaching package: 'xts'
#> The following objects are masked from 'package:dplyr':
#>
#>     first, last
#> Loading required package: TTR
#> Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols(c("AAPL", "MSFT", "GOOG"), auto.assign = TRUE)
#> 'getSymbols' currently uses auto.assign=TRUE by default, but will
#> use auto.assign=FALSE in 0.5-0. You will still be able to use
#> 'loadSymbols' to automatically load data. getOption("getSymbols.env")
#> and getOption("getSymbols.auto.assign") will still be checked for
#> alternate defaults.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
#>
#> WARNING: There have been significant changes to Yahoo Finance data.
#> Please see the Warning section of '?getSymbols.yahoo' for details.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.yahoo.warning"=FALSE).
#> [1] "AAPL" "MSFT" "GOOG"
cor_mat <- cor(cbind(
  periodReturn(AAPL, period = "daily", subset = "2017"),
  periodReturn(MSFT, period = "daily", subset = "2017"),
  periodReturn(GOOG, period = "daily", subset = "2017")
))
cor_mat
#>                 daily.returns daily.returns.1 daily.returns.2
#> daily.returns           1.000           0.438           0.489
#> daily.returns.1         0.438           1.000           0.619
#> daily.returns.2         0.489           0.619           1.000

In this form, the matrix output’s interpretation is not self-evident. The columns are named daily.returns.X because, before we bound the columns together with cbind, they were each named daily.returns. R then helped us manage the naming clash by appending .1 to the second column and .2 to the third.

The default naming does not tell us which column came from which stock. So we’ll define names for the rows and columns, then R will annotate the matrix output with the names:

colnames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
rownames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
cor_mat
#>       AAPL  MSFT  GOOG
#> AAPL 1.000 0.438 0.489
#> MSFT 0.438 1.000 0.619
#> GOOG 0.489 0.619 1.000

Now the reader knows at a glance which rows and columns apply to which stocks.

Another advantage of naming rows and columns is that you can refer to matrix elements by those names:

cor_mat["MSFT", "GOOG"] # What is the correlation between MSFT and GOOG?
#> [1] 0.619

Selecting One Row or Column from a Matrix

Problem

You want to select a single row or a single column from a matrix.

Solution

The solution depends on what you want. If you want the result to be a simple vector, just use normal indexing:

mat[1, ] # First row
#> colname1 colname2 colname3
#>      1.1      1.2      1.3
mat[, 3] # Third column
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3

If you want the result to be a one-row matrix or a one-column matrix, then include the drop=FALSE argument:

mat[1, , drop = FALSE] # First row in a one-row matrix
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
mat[, 3, drop = FALSE] # Third column in a one-column matrix
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3

Discussion

Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:

mat[1, ]
#> colname1 colname2 colname3
#>      1.1      1.2      1.3

mat[, 3]
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3

When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):

mat[1, , drop = FALSE]
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3

Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):

mat[, 3, drop = FALSE]
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3

Initializing a Data Frame from Column Data

Problem

Your data is organized by columns, and you want to assemble it into a data frame.

Solution

If your data is captured in several vectors and/or factors, use the data.frame function to assemble them into a data frame:

v1 <- 1:5
v2 <- 6:10
v3 <- c("A", "B", "C", "D", "E")
f1 <- factor(c("a", "a", "a", "b", "b"))
df <- data.frame(v1, v2, v3, f1)
df
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b

If your data is captured in a list that contains vectors and/or factors, use instead as.data.frame:

list.of.vectors <- list(v1 = v1, v2 = v2, v3 = v3, f1 = f1)
df2 <- as.data.frame(list.of.vectors)
df2
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b

Discussion

A data frame is a collection of columns, each of which corresponds to an observed variable (in the statistical sense, not the programming sense). If your data is already organized into columns, then it’s easy to build a data frame.

The data.frame function can construct a data frame from vectors, where each vector is one observed variable. Suppose you have two numeric predictor variables, one categorical predictor variable, and one response variable. The data.frame function can create a data frame from your vectors.

pred1 <- rnorm(10)
pred2 <- rnorm(10, 1, 2)
pred3 <- sample(c("AM", "PM"), 10, replace = TRUE)
resp <- 2.1 + pred1 * .3 + pred2 * .9
df <- data.frame(pred1, pred2, pred3, resp)
df
#>     pred1   pred2 pred3 resp
#> 1  -0.117 -0.0196    AM 2.05
#> 2  -1.133  0.1529    AM 1.90
#> 3   0.632  3.8004    AM 5.71
#> 4   0.188  4.5922    AM 6.29
#> 5   0.892  1.8556    AM 4.04
#> 6  -1.224  2.8140    PM 4.27
#> 7   0.174  0.4908    AM 2.59
#> 8  -0.689 -0.1335    PM 1.77
#> 9   1.204 -0.0482    AM 2.42
#> 10  0.697  2.2268    PM 4.31

Notice that data.frame takes the column names from your program variables. You can override that default by supplying explicit column names:

df <- data.frame(p1 = pred1, p2 = pred2, p3 = pred3, r = resp)
head(df, 3)
#>       p1      p2 p3    r
#> 1 -0.117 -0.0196 AM 2.05
#> 2 -1.133  0.1529 AM 1.90
#> 3  0.632  3.8004 AM 5.71

As illustrated above, your data may be organized into vectors but those vectors are held in a list, not individual program variables. Use the as.data.frame function to create a data frame from the list of vectors.

If you’d rather have a tibble (a.k.a. a tidy data frame) instead of a data frame, then use the function as_tibble instead of data.frame. However, note that as_tibble is designed to operate on a list, matrix, data.frame, or table, so we can just wrap our vectors in the list function before we call as_tibble:

tib <- as_tibble(list(p1 = pred1, p2 = pred2, p3 = pred3, r = resp))
tib
#> # A tibble: 10 x 4
#>       p1      p2 p3        r
#>    <dbl>   <dbl> <chr> <dbl>
#> 1 -0.117 -0.0196 AM     2.05
#> 2 -1.13   0.153  AM     1.90
#> 3  0.632  3.80   AM     5.71
#> 4  0.188  4.59   AM     6.29
#> 5  0.892  1.86   AM     4.04
#> 6 -1.22   2.81   PM     4.27
#> # ... with 4 more rows

One subtle difference between a data.frame object and a tibble is that when you use the data.frame function to create a data.frame, R will coerce character values into factors by default. On the other hand, as_tibble does not convert characters to factors. If you look at the last two code examples above, you’ll see that column p3 is of type chr in the tibble example and a factor in the data.frame example. Keep this difference in mind, because it can be maddeningly frustrating to debug an issue caused by it.

Initializing a Data Frame from Row Data

Problem

Your data is organized by rows, and you want to assemble it into a data frame.

Solution

Store each row in a one-row data frame. Store the one-row data frames in a list. Use rbind and do.call to bind the rows into one, large data frame:

r1 <- data.frame(a = 1, b = 2, c = "a")
r2 <- data.frame(a = 3, b = 4, c = "b")
r3 <- data.frame(a = 5, b = 6, c = "c")
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

Here, obs is a list of one-row data frames. But notice that column c is a factor, not a character.

Discussion

Data often arrives as a collection of observations. Each observation is a record or tuple that contains several values, one for each observed variable. The lines of a flat file are usually like that: each line is one record, each record contains several columns, and each column is a different variable (see “Reading Files with a Complex Structure”). Such data is organized by observation, not by variable. In other words, you are given rows one at a time rather than columns one at a time.

Each such row might be stored in several ways. One obvious way is as a vector. If you have purely numerical data, use a vector.

However, many datasets are a mixture of numeric, character, and categorical data, in which case a vector won’t work. I recommend storing each such heterogeneous row in a one-row data frame. (You could store each row in a list, but this recipe gets a little more complicated.)

We need to bind together those rows into a data frame. That’s what the rbind function does. It binds its arguments in such a way that each argument becomes one row in the result. If we rbind the first two observations, for example, we get a two-row data frame:

rbind(obs[[1]], obs[[2]])
#>   a b c
#> 1 1 2 a
#> 2 3 4 b

We want to bind together every observation, not just the first two, so we tap into the vector processing of R. The do.call function will expand obs into one, long argument list and call rbind with that long argument list:

do.call(rbind, obs)
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

The result is a data frame built from our rows of data.

Sometimes, for reasons beyond your control, the rows of your data are stored in lists rather than one-row data frames. You may be dealing with rows returned by a database package, for example. In that case, obs will be a list of lists, not a list of data frames. We first transform the rows into data frames using the Map function and then apply this recipe:

l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

This recipe also works if your observations are stored in vectors rather than one-row data frames. With vectors, however, all elements have to be of the same data type, though R will happily coerce integers into doubles on the fly:

r1 <- 1:3
r2 <- 6:8
r3 <- rnorm(3)
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>        [,1]   [,2] [,3]
#> [1,]  1.000  2.000  3.0
#> [2,]  6.000  7.000  8.0
#> [3,] -0.945 -0.547  1.6

Note the factor trap mentioned in the example above. If you would rather get characters instead of factors, you have a couple of options. One is to set the stringsAsFactors parameter to FALSE when data.frame is called:

data.frame(a = 1, b = 2, c = "a", stringsAsFactors = FALSE)
#>   a b c
#> 1 1 2 a

Of course if you inherited your data and it’s already in a data frame with factors, you can convert all factors in a data.frame to characters using this bonus recipe:

## same setup as in the previous examples
l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
# yes, you could use stringsAsFactors = FALSE above, but we're assuming the
# data.frame came to you with factors already

i <- sapply(df, is.factor)           # determine which columns are factors
df[i] <- lapply(df[i], as.character) # turn only the factors into characters
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

Keep in mind that if you use a tibble instead of a data.frame then characters will not be forced into factors by default.

See Also

See “Initializing a Data Frame from Column Data” if your data is organized by columns, not rows.
See Recipe X-X to learn more about do.call.

Appending Rows to a Data Frame

Problem

You want to append one or more new rows to a data frame.

Solution

Create a second, temporary data frame containing the new rows. Then use the rbind function to append the temporary data frame to the original data frame.

Discussion

Suppose we want to append a new row to our data frame of Chicago-area cities. First, we create a one-row data frame with the new data:

newRow <- data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428)

Next, we load our existing data frame of suburbs and use the rbind function to append the new one-row data frame to it:

library(tidyverse)
suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )

suburbs2 <- rbind(suburbs, newRow)
suburbs2
#> # A tibble: 18 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 12 more rows

The rbind function tells R that we are appending a new row to suburbs, not a new column. It may be obvious to you that newRow is a row and not a column, but it is not obvious to R. (Use the cbind function to append a column.)

One word of caution. The new row must use the same column names as the data frame. Otherwise, rbind will fail.
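
For example, this hypothetical row uses town instead of city, so attempting to bind it would fail (the exact error text depends on your version of R):

badRow <- data.frame(town = "West Dundee", county = "Kane", state = "IL", pop = 5428)
# rbind(suburbs, badRow) # error: the column names do not match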

We can combine these two steps into one, of course:

suburbs3 <- rbind(suburbs, data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428))

We can even extend this technique to multiple new rows because rbind allows multiple arguments:

suburbs4 <- rbind(
  suburbs,
  data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428),
  data.frame(city = "East Dundee", county = "Kane", state = "IL", pop = 2955)
)

It’s worth noting that in the examples above we seamlessly commingled tibbles and data frames because we used the tidyverse function read_csv, which produces tibbles. And note that the data frames contain factors while the tibbles do not:

str(suburbs)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  4 variables:
#>  $ city  : chr  "Chicago" "Kenosha" "Aurora" "Elgin" ...
#>  $ county: chr  "Cook" "Kenosha" "Kane" "Kane" ...
#>  $ state : chr  "IL" "WI" "IL" "IL" ...
#>  $ pop   : num  2853114 90352 171782 94487 102746 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   city = col_character(),
#>   ..   county = col_character(),
#>   ..   state = col_character(),
#>   ..   pop = col_double()
#>   .. )
str(newRow)
#> 'data.frame':    1 obs. of  4 variables:
#>  $ city  : Factor w/ 1 level "West Dundee": 1
#>  $ county: Factor w/ 1 level "Kane": 1
#>  $ state : Factor w/ 1 level "IL": 1
#>  $ pop   : num 5428

When the inputs to rbind are a mix of data.frame objects and tibble objects, the result will be the type of object passed to the first argument of rbind. So this would produce a tibble:

rbind(some_tibble, some_data.frame)

While this would produce a data.frame:

rbind(some_data.frame, some_tibble)

Warning

Do not use this recipe to append many rows to a large data frame. That would force R to reallocate a large data structure repeatedly, which is a very slow process. Build your data frame using more efficient means, such as those in Recipes or .

Preallocating a Data Frame

Problem

You are building a data frame, row by row. You want to preallocate the space instead of appending rows incrementally.

Solution

Create a data frame from generic vectors and factors using the functions numeric(n) and character(n):

n <- 5
df <- data.frame(colname1 = numeric(n), colname2 = character(n))

Here, n is the number of rows needed for the data frame.

Discussion

Theoretically, you can build a data frame by appending new rows, one by one. That’s OK for small data frames, but building a large data frame in that way can be tortuous. The memory manager in R works poorly when one new row is repeatedly appended to a large data structure. Hence your R code will run very slowly.

One solution is to preallocate the data frame, assuming you know the required number of rows. By preallocating the data frame once and for all, you sidestep problems with the memory manager.

Suppose you want to create a data frame with 1,000,000 rows and three columns: two numeric and one character. Use the numeric and character functions to preallocate the columns; then join them together using data.frame:

n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = character(n),
  response = numeric(n),
  stringsAsFactors = FALSE
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : chr  "" "" "" "" ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...

Now you have a data frame with the correct dimensions, 1,000,000 × 3, waiting to receive its contents.

Notice in the example above we set stringsAsFactors=FALSE so that R would not coerce the character field into factors. Data frames can contain factors, but preallocating a factor is a little trickier. You can’t simply call factor(n). You need to specify the factor’s levels because you are creating it. Continuing our example, suppose you want the lab column to be a factor, not a character string, and that the possible levels are NJ, IL, and CA. Include the levels in the column specification, like this:

n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = factor(n, levels = c("NJ", "IL", "CA")),
  response = numeric(n)
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : Factor w/ 3 levels "NJ","IL","CA": NA NA NA NA NA NA NA NA NA NA ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...

Selecting Data Frame Columns by Position

Problem

You want to select columns from a data frame according to their position.

Solution

To select a single column, use this list operator:

df[[n]]

Returns one column—specifically, the nth column of df.

To select one or more columns and package them in a data frame, use the following sublist expressions:

df[n]

Returns a data frame consisting solely of the nth column of df.

df[c(n1, n2, ..., nk)]

Returns a data frame built from the columns in positions n1, n2, …, nk of df.

You can use matrix-style subscripting to select one or more columns:

df[, n]

Returns the nth column (assuming that n contains exactly one value).

df[, c(n1, n2, ..., nk)]

Returns a data frame built from the columns in positions n1, n2, …, nk.

Note that the matrix-style subscripting can return two different data types (either column or data frame) depending upon whether you select one column or multiple columns.

Or you can use the dplyr package from the Tidyverse and pass column numbers to the select function to get back a tibble.

df %>% select(n1, n2, ..., nk)

Discussion

There are a bewildering number of ways to select columns from a data frame. The choices can be confusing until you understand the logic behind the alternatives. As you read this explanation, notice how a slight change in syntax—a comma here, a double-bracket there—changes the meaning of the expression.

Let’s play with the population data for the largest cities in the Chicago metropolitan area:

suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )
suburbs
#> # A tibble: 17 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 11 more rows

So right off the bat we can see this is a tibble. Subsetting and selecting work very much the same in tibbles and in base R data frames, so the recipes below work on either data structure.

Use simple list notation to select exactly one column, such as the first column:

suburbs[[1]]
#>  [1] "Chicago"           "Kenosha"           "Aurora"
#>  [4] "Elgin"             "Gary"              "Joliet"
#>  [7] "Naperville"        "Arlington Heights" "Bolingbrook"
#> [10] "Cicero"            "Evanston"          "Hammond"
#> [13] "Palatine"          "Schaumburg"        "Skokie"
#> [16] "Waukegan"          "West Dundee"

The first column of suburbs is a vector, so that’s what suburbs[[1]] returns: a vector. If the first column were a factor, we’d get a factor.

The result differs when you use the single-bracket notation, as in suburbs[1] or suburbs[c(1,3)]. You still get the requested columns, but R wraps them in a data frame. This example returns the first column wrapped in a data frame:

suburbs[1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

Another option, using the dplyr package from the Tidyverse, is to pipe the data into a select statement:

suburbs %>%
  dplyr::select(1)
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

You can, of course, use select from the dplyr package to pull more than one column:

suburbs %>%
  dplyr::select(1, 4)
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows

The next example returns the first and third columns as a data frame:

suburbs[c(1, 3)]
#> # A tibble: 17 x 2
#>   city    state
#>   <chr>   <chr>
#> 1 Chicago IL
#> 2 Kenosha WI
#> 3 Aurora  IL
#> 4 Elgin   IL
#> 5 Gary    IN
#> 6 Joliet  IL
#> # ... with 11 more rows

A major source of confusion is that suburbs[[1]] and suburbs[1] look similar but produce very different results:

suburbs[[1]]

This returns one column.

suburbs[1]

This returns a data frame, and the data frame contains exactly one column. This is a special case of df[c(n1,n2, ..., nk)]. We don’t need the c(...) construct because there is only one n.

The point here is that “one column” is different from “a data frame that contains one column.” The first expression returns a column, so it’s a vector or a factor. The second expression returns a data frame, which is different.

R lets you use matrix notation to select columns, as shown in the Solution. But an odd quirk can bite you: with a base R data frame, you might get a column or you might get a data frame, depending upon how many subscripts you use. (A tibble such as suburbs sidesteps the quirk: single-bracket indexing always returns a tibble, as the output below shows.) In the simple case of one index, a base R data frame would hand back just the column; our tibble returns a one-column tibble:

suburbs[, 1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

But using the same matrix-style syntax with multiple indexes returns a data frame:

suburbs[, c(1, 4)]
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows

This creates a problem. Suppose you see this expression in some old R script:

df[, vec]

Quick, does that return a column or a data frame? Well, it depends. If vec contains one value then you get a column; otherwise, you get a data frame. You cannot tell from the syntax alone.

To avoid this problem, you can include drop=FALSE in the subscripts; this forces R to return a data frame:

df[, vec, drop = FALSE]

Now there is no ambiguity about the returned data structure. It’s a data frame.
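
To make the quirk concrete, here is a small sketch using a made-up base R data frame (a tibble, as noted above, would return a tibble in every case):

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df[, "a"] # one subscript: a plain vector
#> [1] 1 2 3
df[, c("a", "b")] # two subscripts: a data frame
#>   a b
#> 1 1 4
#> 2 2 5
#> 3 3 6
df[, "a", drop = FALSE] # drop = FALSE: always a data frame
#>   a
#> 1 1
#> 2 2
#> 3 3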

When all is said and done, using matrix notation to select columns from data frames is not the best procedure. It’s a good idea to instead use the list operators described previously. They just seem clearer. Or you can use the functions in dplyr and know that you will get back a tibble.

See Also

See “Selecting One Row or Column from a Matrix” for more about using drop=FALSE.

Selecting Data Frame Columns by Name

Problem

You want to select columns from a data frame according to their name.

Solution

To select a single column, use one of these list expressions:

df[["name"]]

Returns one column, the column called name.

df$name

Same as previous, just different syntax.

To select one or more columns and package them in a data frame, use these list expressions:

df["name"]

Selects one column and packages it inside a data frame object.

df[c("name1", "name2", ..., "namek")]

Selects several columns and packages them in a data frame.

You can use matrix-style subscripting to select one or more columns:

df[, "name"]

Returns the named column.

df[, c("name1", "name2", ..., "namek")]

Selects several columns and packages them in a data frame.

Once again, the matrix-style subscripting can return two different data types (column or data frame) depending upon whether you select one column or multiple columns.

Or you can use the dplyr package from the Tidyverse and pass column names to the select function to get back a tibble.

df %>% select(name1, name2, ..., namek)

Discussion

All columns in a data frame must have names. If you know the name, it’s usually more convenient and readable to select by name, not by position.

The solutions just described are similar to those for “Selecting Data Frame Columns by Position”, where we selected columns by position. The only difference is that here we use column names instead of column numbers. All the observations made in “Selecting Data Frame Columns by Position” apply here:

  • df[["name"]] returns one column, not a data frame.

  • df[c("name1", "name2", ..., "namek")] returns a data frame, not a column.

  • df["name"] is a special case of the previous expression and so returns a data frame, not a column.

  • The matrix-style subscripting can return either a column or a data frame, so be careful how many names you supply. See “Selecting Data Frame Columns by Position” for a discussion of this “gotcha” and using drop=FALSE.

There is one new addition:

df$name

This is identical in effect to df[["name"]], but it’s easier to type and to read.
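
For instance, using the suburbs data loaded earlier in this chapter, the two forms return the identical column:

identical(suburbs$pop, suburbs[["pop"]])
#> [1] TRUE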

Note that if you use select from dplyr, you don’t put the column names in quotes:

df %>% select(name1, name2, ..., namek)

Unquoted column names are a Tidyverse feature and help make Tidyverse functions fast and easy to type interactively.

See Also

See “Selecting Data Frame Columns by Position” to understand these ways to select columns.

Selecting Rows and Columns More Easily

Problem

You want an easier way to select rows and columns from a data frame or matrix.

Solution

Use the subset function. The select argument is a column name, or a vector of column names, to be selected:

subset(df, select = colname)
subset(df, select = c(colname1, ..., colnameN))

Note that you do not quote the column names.

The subset argument is a logical expression that selects rows. Inside the expression, you can refer to the column names as part of the logical expression. In this example, pop is a column in the data frame, and we are selecting rows with a pop over 100,000:

subset(suburbs, subset = (pop > 100000))
#> # A tibble: 5 x 4
#>   city       county   state     pop
#>   <chr>      <chr>    <chr>   <dbl>
#> 1 Chicago    Cook     IL    2853114
#> 2 Aurora     Kane     IL     171782
#> 3 Gary       Lake(IN) IN     102746
#> 4 Joliet     Kendall  IL     106221
#> 5 Naperville DuPage   IL     147779

subset is most useful when you combine the select and subset arguments:

subset(suburbs, select = c(city, state, pop), subset = (pop > 100000))
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779

The Tidyverse alternative is to use dplyr and string together a select statement with a filter statement:

suburbs %>%
  dplyr::select(city, state, pop) %>%
  filter(pop > 100000)
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779

Discussion

Indexing is the “official” Base R way to select rows and columns from a data frame, as described in Recipes and . However, indexing is cumbersome when the index expressions become complicated.

The subset function provides a more convenient and readable way to select rows and columns. Its beauty is that you can refer to the columns of the data frame right inside the expressions for selecting columns and rows.

Combining select and filter from dplyr along with pipes makes the steps even easier to both read and write.

Here are some examples using the Cars93 dataset in the MASS package. The dataset includes columns for Manufacturer, Model, MPG.city, MPG.highway, Min.Price, and Max.Price:

Select the model name for cars that can exceed 30 miles per gallon (MPG) in the city:

library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
my_subset <- subset(Cars93, select = Model, subset = (MPG.city > 30))
head(my_subset)
#>      Model
#> 31 Festiva
#> 39   Metro
#> 42   Civic
#> 73  LeMans
#> 80   Justy
#> 83   Swift

Or, using dplyr:

Cars93 %>%
  filter(MPG.city > 30) %>%
  select(Model) %>%
  head()
#> Error in select(., Model): unused argument (Model)

Warning

Wait… what? Why did this not work? select worked just fine in an earlier example! Well, we left this in the book as an example of a bad surprise. We loaded the Tidyverse package at the beginning of the chapter, and then we just loaded the MASS package. It turns out that MASS has a function named select, too, and the package loaded last is the one that stomps on top of the others. So we have two options: (1) unload packages and then load MASS before dplyr or the tidyverse, or (2) disambiguate which select statement we are calling. Let’s go with option 2 because it’s easy to illustrate:

Cars93 %>%
  filter(MPG.city > 30) %>%
  dplyr::select(Model) %>%
  head()
#>     Model
#> 1 Festiva
#> 2   Metro
#> 3   Civic
#> 4  LeMans
#> 5   Justy
#> 6   Swift

By using dplyr::select we tell R, “Hey, R, only use the select function from dplyr.” And R typically follows suit.

Now let’s select the model name and price range for four-cylinder cars made in the United States:

my_cars <- subset(Cars93,
  select = c(Model, Min.Price, Max.Price),
  subset = (Cylinders == 4 & Origin == "USA")
)
head(my_cars)
#>       Model Min.Price Max.Price
#> 6   Century      14.2      17.3
#> 12 Cavalier       8.5      18.3
#> 13  Corsica      11.4      11.4
#> 15   Lumina      13.4      18.4
#> 21  LeBaron      14.5      17.1
#> 23     Colt       7.9      10.6

Or, using our unambiguous dplyr functions:

Cars93 %>%
  filter(Cylinders == 4 & Origin == "USA") %>%
  dplyr::select(Model, Min.Price, Max.Price) %>%
  head()
#>      Model Min.Price Max.Price
#> 1  Century      14.2      17.3
#> 2 Cavalier       8.5      18.3
#> 3  Corsica      11.4      11.4
#> 4   Lumina      13.4      18.4
#> 5  LeBaron      14.5      17.1
#> 6     Colt       7.9      10.6

Notice that in the above example we put the filter statement above the select statement. Commands connected by pipes are sequential: if we selected only our three fields before we filtered on Cylinders and Origin, then the Cylinders and Origin fields would no longer be in the data and we’d get an error.

Now we’ll select the manufacturer’s name and the model name for all cars whose highway MPG value is above the median:

my_cars <- subset(Cars93,
  select = c(Manufacturer, Model),
  subset = (MPG.highway > median(MPG.highway))
)
head(my_cars)
#>    Manufacturer    Model
#> 1         Acura  Integra
#> 5           BMW     535i
#> 6         Buick  Century
#> 12    Chevrolet Cavalier
#> 13    Chevrolet  Corsica
#> 15    Chevrolet   Lumina

The subset function is actually more powerful than this recipe implies. It can select from lists and vectors, too. See the help page for details.

Or, using dplyr:

Cars93 %>%
  filter(MPG.highway > median(MPG.highway)) %>%
  dplyr::select(Manufacturer, Model) %>%
  head()
#>   Manufacturer    Model
#> 1        Acura  Integra
#> 2          BMW     535i
#> 3        Buick  Century
#> 4    Chevrolet Cavalier
#> 5    Chevrolet  Corsica
#> 6    Chevrolet   Lumina

Remember, in the above examples the only reason we use the fully qualified dplyr::select name is that we have a conflict with MASS::select. In your own code you will likely need only select after you load dplyr.

To save ourselves from further frustrating naming clashes, let’s detach the MASS package:

detach("package:MASS", unload = TRUE)

Changing the Names of Data Frame Columns

Problem

You converted a matrix or list into a data frame. R gave names to the columns, but the names are at best uninformative and at worst bizarre.

Solution

Data frames have a colnames attribute that is a vector of column names. You can update individual names or the entire vector:

df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df
#>   V1 V2 V3
#> 1  1  4  7
#> 2  2  5  8
#> 3  3  6  9
colnames(df) <- c("tom", "dick", "harry") # a vector of character strings
df
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9

Or, using dplyr from the Tidyverse:

df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df %>%
  rename(tom = V1, dick = V2, harry = V3)
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9

Notice that with the rename function in dplyr there’s no need to use quotes around the column names, as is typical with Tidyverse functions. Also note that the argument order is new_name=old_name.

Discussion

The columns of data frames (and tibbles) must have names. If you convert a vanilla matrix into a data frame, R will synthesize names that are reasonable but boring — for example, V1, V2, V3, and so forth:

mat <- matrix(rnorm(9), nrow = 3, ncol = 3)
mat
#>       [,1]    [,2]   [,3]
#> [1,] 0.701  0.0976  0.821
#> [2,] 0.388 -1.2755 -1.086
#> [3,] 1.968  1.2544  0.111
as.data.frame(mat)
#>      V1      V2     V3
#> 1 0.701  0.0976  0.821
#> 2 0.388 -1.2755 -1.086
#> 3 1.968  1.2544  0.111

If the matrix had column names defined, R would have used those names instead of synthesizing new ones.
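
For example, here is a minimal sketch; the column names are just illustrative:

mat2 <- matrix(1:6, nrow = 2, ncol = 3)
colnames(mat2) <- c("low", "mid", "high")   # give the matrix its own column names
as.data.frame(mat2)                         # the names carry over to the data frame
#>   low mid high
#> 1   1   3    5
#> 2   2   4    6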

However, converting a list into a data frame produces some strange synthetic names:

lst <- list(1:3, c("a", "b", "c"), round(rnorm(3), 3))
lst
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "a" "b" "c"
#>
#> [[3]]
#> [1] 0.181 0.773 0.983
as.data.frame(lst)
#>   X1.3 c..a....b....c.. c.0.181..0.773..0.983.
#> 1    1                a                  0.181
#> 2    2                b                  0.773
#> 3    3                c                  0.983

Again, if the list elements had names then R would have used them.

Fortunately, you can overwrite the synthetic names with names of your own by setting the colnames attribute:

df <- as.data.frame(lst)
colnames(df) <- c("patient", "treatment", "value")
df
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

You can do renaming by position using rename from dplyr… but it’s not really pretty. Actually it’s quite horrible and we considered omitting it from this book.

df <- as.data.frame(lst)
df %>%
  rename(
    "patient" = !!names(.[1]),
    "treatment" = !!names(.[2]),
    "value" = !!names(.[3])
  )
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

The reason this is so ugly is that the Tidyverse is designed around using names, not positions, when referring to columns. And in this example the names are pretty miserable to type and get right. While you could use the above recipe, we recommend using the Base R colnames() method if you really must rename by position number.

Of course, we could have made this all a lot easier by simply giving the list elements names before we converted it to a data frame:

names(lst) <- c("patient", "treatment", "value")
as.data.frame(lst)
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

Removing NAs from a Data Frame

Problem

Your data frame contains NA values, which is creating problems for you.

Solution

Use na.omit to remove rows that contain any NA values.

df <- data.frame(my_data = c(NA, 1, NA, 2, NA, 3))
df
#>   my_data
#> 1      NA
#> 2       1
#> 3      NA
#> 4       2
#> 5      NA
#> 6       3
clean_df <- na.omit(df)
clean_df
#>   my_data
#> 2       1
#> 4       2
#> 6       3

Discussion

We frequently stumble upon situations where just a few NA values in a data frame cause everything to fall apart. One solution is simply to remove all rows that contain any NAs. That’s what na.omit does.

Here we can see cumsum fail because the input contains NA values:

df <- data.frame(
  x = c(NA, rnorm(4)),
  y = c(rnorm(2), NA, rnorm(2))
)
df
#>        x      y
#> 1     NA -0.836
#> 2  0.670 -0.922
#> 3 -1.421     NA
#> 4 -0.236 -1.123
#> 5 -0.975  0.372
cumsum(df)
#>    x      y
#> 1 NA -0.836
#> 2 NA -1.759
#> 3 NA     NA
#> 4 NA     NA
#> 5 NA     NA

If we remove the NA values, cumsum can complete its summations:

cumsum(na.omit(df))
#>        x      y
#> 2  0.670 -0.922
#> 4  0.434 -2.046
#> 5 -0.541 -1.674

This recipe works for vectors and matrices, too, but not for lists.
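
For example, here is a quick sketch of the same idea applied to a plain vector:

v <- c(1, NA, 3, NA, 5)
sum(v)            # the NA values poison the sum
#> [1] NA
sum(na.omit(v))   # drop the NAs first, then sum
#> [1] 9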

The obvious danger here is that simply dropping observations from your data could render the results computationally or statistically meaningless. Make sure that omitting data makes sense in your context. Remember that na.omit will remove entire rows, not just the NA values, which could eliminate a lot of useful information.

Excluding Columns by Name

Problem

You want to exclude a column from a data frame using its name.

Solution

Use the subset function with a negated argument for the select parameter:

df <- data.frame(good = rnorm(3), meh = rnorm(3), bad = rnorm(3))
df
#>     good     meh    bad
#> 1  1.911 -0.7045 -1.575
#> 2  0.912  0.0608 -2.238
#> 3 -0.819  0.4424 -0.807
subset(df, select = -bad) # All columns except bad
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424

Or we can use select from dplyr to accomplish the same thing:

df %>%
  dplyr::select(-bad)
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424

Discussion

We can exclude a column by position (e.g., df[-1]), but how do we exclude a column by name? The subset function can exclude columns from a data frame. The select parameter is normally a list of columns to include, but prefixing a minus sign (-) to the name causes the column to be excluded instead.

We often encounter this problem when calculating the correlation matrix of a data frame and we want to exclude nondata columns such as labels. Let’s set up some dummy data:

id <- 1:10
pre <- rnorm(10)
dosage <- rnorm(10) + .3 * pre
post <- dosage * .5 * pre
patient_data <- data.frame(id = id, pre = pre, dosage = dosage, post = post)

cor(patient_data)
#>             id     pre  dosage    post
#> id      1.0000 -0.6934 -0.5075  0.0672
#> pre    -0.6934  1.0000  0.5830 -0.0919
#> dosage -0.5075  0.5830  1.0000  0.0878
#> post    0.0672 -0.0919  0.0878  1.0000

This correlation matrix includes the meaningless “correlation” between id and other variables, which is annoying. We can exclude the id column to clean up the output:

cor(subset(patient_data, select = -id))
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000

or with dplyr:

patient_data %>%
  dplyr::select(-id) %>%
  cor()
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000

We can exclude multiple columns by giving a vector of negated names:

cor(subset(patient_data, select = c(-id, -dosage)))

or with dplyr:

patient_data %>%
  dplyr::select(-id, -dosage) %>%
  cor()
#>          pre    post
#> pre   1.0000 -0.0919
#> post -0.0919  1.0000

Note that with dplyr we don’t wrap the column names in c().

See Also

See “Selecting Rows and Columns More Easily” for more about the subset function.

Combining Two Data Frames

Problem

You want to combine the contents of two data frames into one data frame.

Solution

To combine the columns of two data frames side by side, use cbind (column bind):

df1 <- data_frame(a = rnorm(5))
df2 <- data_frame(b = rnorm(5))

all <- cbind(df1, df2)
all
#>         a       b
#> 1 -1.6357  1.3669
#> 2 -0.3662 -0.5432
#> 3  0.4445 -0.0158
#> 4  0.4945 -0.6960
#> 5  0.0934 -0.7334

To “stack” the rows of two data frames, use rbind (row bind):

df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440

df2 <- data_frame(x = rep("b", 2), y = rnorm(2))
df2
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 b     2.35
#> 2 b     0.188

rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440
#> 3 b     2.35
#> 4 b     0.188

Discussion

You can combine data frames in one of two ways: either by putting the columns side by side to create a wider data frame; or by “stacking” the rows to create a taller data frame. The cbind function will combine data frames side by side. You would normally combine columns with the same height (number of rows). Technically speaking, however, cbind does not require matching heights. If one data frame is short, it will invoke the Recycling Rule to extend the short columns as necessary (“Understanding the Recycling Rule”), which may or may not be what you want.
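
Here is a minimal sketch of that recycling behavior; the data frames are just illustrative. Recycling works only when the shorter height divides the taller height evenly; otherwise cbind reports an error:

df_tall <- data.frame(a = 1:4)
df_short <- data.frame(b = c("x", "y"))
cbind(df_tall, df_short)    # the 2-row data frame is recycled to 4 rows
#>   a b
#> 1 1 x
#> 2 2 y
#> 3 3 x
#> 4 4 y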

The rbind function will “stack” the rows of two data frames. The rbind function requires that the data frames have the same width: same number of columns and same column names. The columns need not be in the same order, however; rbind will sort that out:

df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478

df2 <- data_frame(y = 1:2, x = c("b", "b"))
df2
#> # A tibble: 2 x 2
#>       y x
#>   <int> <chr>
#> 1     1 b
#> 2     2 b

rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478
#> 3 b      1
#> 4 b      2

Finally, this recipe is slightly more general than the title implies. First, you can combine more than two data frames because both rbind and cbind accept multiple arguments. Second, you can apply this recipe to other data types because rbind and cbind work also with vectors, lists, and matrices.

See Also

The merge function can combine data frames that are otherwise incompatible owing to missing or different columns. In addition, dplyr and tidyr from the Tidyverse include some powerful functions for slicing, dicing, and recombining data frames.

Merging Data Frames by Common Column

Problem

You have two data frames that share a common column. You want to merge or join their rows into one data frame by matching on the common column.

Solution

Use the merge function to join the data frames into one new data frame based on the common column:

df1 <- data.frame(index = letters[1:5], val1 = rnorm(5))
df2 <- data.frame(index = letters[1:5], val2 = rnorm(5))

m <- merge(df1, df2, by = "index")
m
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149

Here index is the name of the column that is common to data frames df1 and df2.

The alternative dplyr way of doing this is with inner_join:

df1 %>%
  inner_join(df2)
#> Joining, by = "index"
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149

Discussion

Suppose you have two data frames, born and died, that each contain a column called name:

born <- data.frame(
  name = c("Moe", "Larry", "Curly", "Harry"),
  year.born = c(1887, 1902, 1903, 1964),
  place.born = c("Bensonhurst", "Philadelphia", "Brooklyn", "Moscow")
)
died <- data.frame(
  name = c("Curly", "Moe", "Larry"),
  year.died = c(1952, 1975, 1975)
)

We can merge them into one data frame by using name to combine matched rows:

merge(born, died, by = "name")
#>    name year.born   place.born year.died
#> 1 Curly      1903     Brooklyn      1952
#> 2 Larry      1902 Philadelphia      1975
#> 3   Moe      1887  Bensonhurst      1975

Notice that merge does not require the rows to be sorted or even to occur in the same order. It found the matching rows for Curly even though they occur in different positions. It also discards rows that appear in only one data frame or the other.

In SQL terms, the merge function essentially performs a join operation on the two data frames. It has many options for controlling that join operation, all of which are described on the help page for merge.
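
For example, the all, all.x, and all.y arguments of merge keep unmatched rows (an outer join in SQL terms), filling the missing values with NA. Here is a sketch using the same born and died data frames; Harry, who has no match in died, is now retained:

merge(born, died, by = "name", all = TRUE)
#>    name year.born   place.born year.died
#> 1 Curly      1903     Brooklyn      1952
#> 2 Harry      1964       Moscow        NA
#> 3 Larry      1902 Philadelphia      1975
#> 4   Moe      1887  Bensonhurst      1975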

Because of the similarity with SQL, dplyr uses similar terms:

born %>%
  inner_join(died)
#> Joining, by = "name"
#> Warning: Column `name` joining factors with different levels, coercing to
#> character vector
#>    name year.born   place.born year.died
#> 1   Moe      1887  Bensonhurst      1975
#> 2 Larry      1902 Philadelphia      1975
#> 3 Curly      1903     Brooklyn      1952

Because we used data.frame to create the data frames, the name columns were turned into factors. dplyr, and most of the Tidyverse packages, really prefer characters, so the column was coerced into character and we get a chatty notification from R. This is the sort of verbose feedback that is common in the Tidyverse. There are multiple types of joins in dplyr, including inner, left, right, and full. For a complete list, see the join documentation by typing ?dplyr::join.
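
For instance, a left join keeps every row of the lefthand data frame whether or not it has a match. Here is a quick sketch with the same data (you may see the same factor-coercion note as above); Harry is kept, with NA for year.died:

born %>%
  left_join(died, by = "name")
#>    name year.born   place.born year.died
#> 1   Moe      1887  Bensonhurst      1975
#> 2 Larry      1902 Philadelphia      1975
#> 3 Curly      1903     Brooklyn      1952
#> 4 Harry      1964       Moscow        NA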

See Also

See “Combining Two Data Frames” for other ways to combine data frames.

Accessing Data Frame Contents More Easily

Problem

Your data is stored in a data frame. You are getting tired of repeatedly typing the data frame name and want to access the columns more easily.

Solution

For quick, one-off expressions, use the with function to expose the column names:

with(dataframe, expr)

Inside expr, you can refer to the columns of dataframe by their names as if they were simple variables.

If you’re working with Tidyverse functions and pipes (%>%), this is less useful: in a piped workflow you are always operating on whatever data was sent down the pipe.

Discussion

A data frame is a great way to store your data, but accessing individual columns can become tedious. For a data frame called suburbs that contains a column called pop, here is the naïve way to calculate the z-scores of pop:

z <- (suburbs$pop - mean(suburbs$pop)) / sd(suburbs$pop)
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364

Call us lazy, but all that typing gets tedious. The with function lets you expose the columns of a data frame as distinct variables. It takes two arguments, a data frame and an expression to be evaluated. Inside the expression, you can refer to the data frame columns by their names:

z <- with(suburbs, (pop - mean(pop)) / sd(pop))
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364

When using dplyr you can accomplish the same logic with mutate:

suburbs %>%
  mutate(z = (pop - mean(pop)) / sd(pop))
#> # A tibble: 17 x 5
#>   city    county   state     pop      z
#>   <chr>   <chr>    <chr>   <dbl>  <dbl>
#> 1 Chicago Cook     IL    2853114  3.88
#> 2 Kenosha Kenosha  WI      90352 -0.237
#> 3 Aurora  Kane     IL     171782 -0.116
#> 4 Elgin   Kane     IL      94487 -0.231
#> 5 Gary    Lake(IN) IN     102746 -0.219
#> 6 Joliet  Kendall  IL     106221 -0.214
#> # ... with 11 more rows

As you can see, mutate helpfully mutates the data frame by adding the column we just created.

Converting One Atomic Value into Another

Problem

You have a data value which has an atomic data type: character, complex, double, integer, or logical. You want to convert this value into one of the other atomic data types.

Solution

For each atomic data type, there is a function for converting values to that type. The conversion functions for atomic types include:

  • as.character(x)

  • as.complex(x)

  • as.numeric(x) or as.double(x)

  • as.integer(x)

  • as.logical(x)

Discussion

Converting one atomic type into another is usually pretty simple. If the conversion works, you get what you would expect. If it does not work, you get NA:

as.numeric(" 3.14 ")
#> [1] 3.14
as.integer(3.14)
#> [1] 3
as.numeric("foo")
#> Warning: NAs introduced by coercion
#> [1] NA
as.character(101)
#> [1] "101"

If you have a vector of atomic types, these functions apply themselves to every value. So the preceding examples of converting scalars generalize easily to converting entire vectors:

as.numeric(c("1", "2.718", "7.389", "20.086"))
#> [1]  1.00  2.72  7.39 20.09
as.numeric(c("1", "2.718", "7.389", "20.086", "etc."))
#> Warning: NAs introduced by coercion
#> [1]  1.00  2.72  7.39 20.09    NA
as.character(101:105)
#> [1] "101" "102" "103" "104" "105"

When converting logical values into numeric values, R converts FALSE to 0 and TRUE to 1:

as.numeric(FALSE)
#> [1] 0
as.numeric(TRUE)
#> [1] 1

This behavior is useful when you are counting occurrences of TRUE in vectors of logical values. If logvec is a vector of logical values, then sum(logvec) does an implicit conversion from logical to integer and returns the number of `TRUE`s:

logvec <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
sum(logvec) ## num true
#> [1] 4
length(logvec) - sum(logvec) ## num not true
#> [1] 2

Converting One Structured Data Type into Another

Problem

You want to convert a variable from one structured data type to another—for example, converting a vector into a list or a matrix into a data frame.

Solution

These functions convert their argument into the corresponding structured data type:

  • as.data.frame(x)

  • as.list(x)

  • as.matrix(x)

  • as.vector(x)

Some of these conversions may surprise you, however. We suggest you review Table 5-1.

Discussion

Converting between structured data types can be tricky. Some conversions behave as you’d expect. If you convert a matrix into a data frame, for instance, the rows and columns of the matrix become the rows and columns of the data frame. No sweat.

Table 5-1. Data conversions

Vector→List
  How: as.list(vec)
  Notes: Don’t use list(vec); that creates a 1-element list whose only element is a copy of vec.

Vector→Matrix
  How: To create a 1-column matrix: cbind(vec) or as.matrix(vec). To create a 1-row matrix: rbind(vec). To create an n × m matrix: matrix(vec,n,m).
  Notes: See “Initializing a Matrix”.

Vector→Data frame
  How: To create a 1-column data frame: as.data.frame(vec). To create a 1-row data frame: as.data.frame(rbind(vec)).

List→Vector
  How: unlist(lst)
  Notes: Use unlist rather than as.vector; see Note 1 and “Flatten a List into a Vector”.

List→Matrix
  How: To create a 1-column matrix: as.matrix(lst). To create a 1-row matrix: as.matrix(rbind(lst)). To create an n × m matrix: matrix(lst,n,m).

List→Data frame
  How: If the list elements are columns of data: as.data.frame(lst). If the list elements are rows of data, see “Initializing a Data Frame from Row Data”.

Matrix→Vector
  How: as.vector(mat)
  Notes: Returns all matrix elements in a vector.

Matrix→List
  How: as.list(mat)
  Notes: Returns all matrix elements in a list.

Matrix→Data frame
  How: as.data.frame(mat)

Data frame→Vector
  How: To convert a 1-row data frame: df[1,]. To convert a 1-column data frame: df[,1] or df[[1]].
  Notes: See Note 2.

Data frame→List
  How: as.list(df)
  Notes: See Note 3.

Data frame→Matrix
  How: as.matrix(df)
  Notes: See Note 4.

In other cases, the results might surprise you. Table 5-1 summarizes some noteworthy examples. The following Notes are cited in that table:

  1. When you convert a list into a vector, the conversion works cleanly if your list contains atomic values that are all of the same mode. Things become complicated if either (a) your list contains mixed modes (e.g., numeric and character), in which case everything is converted to characters; or (b) your list contains other structured data types, such as sublists or data frames—in which case very odd things happen, so don’t do that.

  2. Converting a data frame into a vector makes sense only if the data frame contains one row or one column. To extract all its elements into one, long vector, use as.vector(as.matrix(df)). But even that makes sense only if the data frame is all-numeric or all-character; if not, everything is first converted to character strings.

  3. Converting a data frame into a list may seem odd in that a data frame is already a list (i.e., a list of columns). Using as.list essentially removes the class (data.frame) and thereby exposes the underlying list. That is useful when you want R to treat your data structure as a list—say, for printing.

  4. Be careful when converting a data frame into a matrix. If the data frame contains only numeric values then you get a numeric matrix. If it contains only character values, you get a character matrix. But if the data frame is a mix of numbers, characters, and/or factors, then all values are first converted to characters. The result is a matrix of character strings.
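
Here is a minimal sketch of the Note 4 behavior; the data frame is just illustrative:

dfrm <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
as.matrix(dfrm)    # mixed columns force everything to character strings
#>      num chr
#> [1,] "1" "a"
#> [2,] "2" "b"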

Problems with matrices

The matrix conversions detailed here assume that your matrix is homogeneous: all elements have the same mode (e.g., all numeric or all character). A matrix can be heterogeneous, too, when the matrix is built from a list. If so, conversions become messy. For example, when you convert a mixed-mode matrix to a data frame, the data frame’s columns are actually lists (to accommodate the mixed data).

See Also

See “Converting One Atomic Value into Another” for converting atomic data types; see the “Introduction” to this chapter for remarks on problematic conversions.

1 A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.

2 More precisely, it orders the names according to your Locale.

Chapter 6. Data Transformations

Introduction

While traditional programming languages use loops, R has traditionally encouraged using vectorized operations and the apply family of functions to crunch data in batches, greatly streamlining the calculations. There is nothing to prevent you from writing loops in R that break your data into whatever chunks you want and then perform an operation on each chunk. However, using vectorized functions can, in many cases, increase the speed, readability, and maintainability of your code.

In recent history, however, the Tidyverse, specifically the purrr and dplyr packages, has introduced new idioms into R that make these concepts easier to learn and slightly more consistent. The name purrr comes from a play on the phrase “Pure R.” A “pure function” is a function whose result is determined only by its inputs and which does not produce any side effects. This is a functional programming concept that you need not understand in order to get great value from purrr. All most users need to know is that purrr contains functions to help us operate “chunk by chunk” on our data in a way that meshes well with other Tidyverse packages such as dplyr.

Base R has many apply functions: apply, lapply, sapply, tapply, and mapply, along with their cousins by and split. These are solid functions that have been workhorses in Base R for years. The authors have struggled a bit with how much to focus on the Base R apply functions and how much to focus on the newer “tidy” approach. After much debate, we’ve chosen to illustrate the purrr approach, to acknowledge the Base R approaches, and, in a few places, to illustrate both. The interfaces of purrr and dplyr are very clean and, we believe, in most cases more intuitive.

Applying a Function to Each List Element

Problem

You have a list, and you want to apply a function to each element of the list.

Solution

We can use map to apply the function to every element of a list:

library(tidyverse)

lst %>%
  map(fun)

Discussion

Let’s look at a specific example of taking the average of all the numbers in each element of a list:

library(tidyverse)

lst <- list(
  a = c(1,2,3),
  b = c(4,5,6)
)
lst %>%
  map(mean)
#> $a
#> [1] 2
#>
#> $b
#> [1] 5

These functions will call your function once for every element in your list. Your function should expect one argument, an element from the list. The map functions will collect the returned values and return them in a list.

The purrr package contains a whole family of map functions that take a list or a vector and then return an object with the same number of elements as the input. The type of object they return varies based on which map function is used. See the help file for map for a complete list, but a few of the most common are as follows:

map() : always returns a list, and the elements of the list may be of different types. This is quite similar to the Base R function lapply.

map_chr() : returns a character vector

map_int() : returns an integer vector

map_dbl() : returns a floating point numeric vector

Let’s take a quick look at a contrived situation where we have a function that could result in a character or an integer result:

fun <- function(x) {
  if (x > 1) {
    1
  } else {
    "Less Than 1"
  }
}

fun(5)
#> [1] 1
fun(0.5)
#> [1] "Less Than 1"

Let’s create a list of elements to which we can map fun and look at how some of the map variants behave:

lst <- list(.5, 1.5, .9, 2)

map(lst, fun)
#> [[1]]
#> [1] "Less Than 1"
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] "Less Than 1"
#>
#> [[4]]
#> [1] 1

You can see that map produced a list and it is of mixed data types.

And map_chr will produce a character vector, coercing the numbers into characters:

map_chr(lst, fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"

## or using pipes
lst %>%
  map_chr(fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"

Meanwhile, map_dbl will try to coerce the character strings into doubles and die trying:

map_dbl(lst, fun)
#> Error: Can't coerce element 1 from a character to a double

As mentioned above, the Base R lapply function acts very much like map. The Base R sapply function is more like the other map functions mentioned above in that the function tries to simplify the results into a vector or matrix.

See Also

See Recipe X-X.

Applying a Function to Every Row of a Data Frame

Problem

You have a function and you want to apply it to every row in a data frame.

Solution

The mutate function will create a new variable based on a vector of values. We can use one of the pmap functions (in this case pmap_dbl) to operate on every row and return a vector. The pmap functions that have an underscore (_) in their names return data in a vector of the type described after the underscore. So pmap_dbl returns a vector of doubles, while pmap_chr would coerce the output into a vector of characters.

fun <- function(a, b, c) {
  # calculate the sum of a sequence from a to b by c
  sum(seq(a, b, c))
}

df <- data.frame(mn = c(1, 2, 3),
                 mx = c(8, 13, 18),
                 rng = c(1, 2, 3))

df %>%
  mutate(output =
           pmap_dbl(list(a = mn, b = mx, c = rng), fun))
#>   mn mx rng output
#> 1  1  8   1     36
#> 2  2 13   2     42
#> 3  3 18   3     63

pmap returns a list, so we could use it to map our function to each data frame row then return the results into a list, if we prefer:

pmap(list(a = df$mn, b = df$mx, c = df$rng), fun)
#> [[1]]
#> [1] 36
#>
#> [[2]]
#> [1] 42
#>
#> [[3]]
#> [1] 63

Discussion

The pmap family of functions takes a list of inputs and a function, then applies the function across the elements of the list in parallel. In our example above, we wrap list() around the columns we are interested in using in our function, fun. The list function turns the columns we want to operate on into a list. Within the same operation we name the columns to match the names our function is looking for. So we set a = mn, for example. This renames the mn column from our data frame to a in the resulting list, which is one of the inputs our function is expecting.

Applying a Function to Every Row of a Matrix

Problem

You have a matrix. You want to apply a function to every row, calculating the function result for each row.

Solution

Use the apply function. Set the second argument to 1 to indicate row-by-row application of a function:

results <- apply(mat, 1, fun)    # mat is a matrix, fun is a function

The apply function will call fun once for each row of the matrix, assemble the returned values into a vector, and then return that vector.

Discussion

You may notice that we show only the Base R apply function here, while other recipes illustrate purrr alternatives. As of this writing, matrix operations are out of scope for purrr, so we use the very solid Base R apply function.

Suppose your matrix long is longitudinal data, so each row contains data for one subject and the columns contain the repeated observations over time:

long <- matrix(1:15, 3, 5)
long
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    4    7   10   13
#> [2,]    2    5    8   11   14
#> [3,]    3    6    9   12   15

You could calculate the average observation for each subject by applying the mean function to each row. The result is a vector:

apply(long, 1, mean)
#> [1] 7 8 9

If your matrix has row names, apply uses them to identify the elements of the resulting vector, which is handy.

rownames(long) <- c("Moe", "Larry", "Curly")
apply(long, 1, mean)
#>   Moe Larry Curly
#>     7     8     9

The function being called should expect one argument, a vector, which will be one row from the matrix. The function can return a scalar or a vector. In the vector case, apply assembles the results into a matrix. The range function returns a vector of two elements, the minimum and the maximum, so applying it to long produces a matrix:

apply(long, 1, range)
#>      Moe Larry Curly
#> [1,]   1     2     3
#> [2,]  13    14    15

You can employ this recipe on data frames as well. It works if the data frame is homogeneous; that is, either all numbers or all character strings. When the data frame has columns of different types, extracting vectors from the rows isn’t sensible because vectors must be homogeneous.
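
For example, here is a quick sketch that reuses the all-numeric long matrix from above as a data frame:

long_df <- as.data.frame(long)
apply(long_df, 1, mean)    # row means, just as with the matrix
#>   Moe Larry Curly
#>     7     8     9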

Applying a Function to Every Column

Problem

You have a matrix or data frame, and you want to apply a function to every column.

Solution

For a matrix, use the apply function. Set the second argument to 2, which indicates column-by-column application of the function. So if our matrix or data frame was named mat and we wanted to apply a function named fun to every column, it would look like this:

apply(mat, 2, fun)

Discussion

Let’s look at an example with real numbers and apply the mean function to every column of a matrix:

mat <- matrix(c(1, 3, 2, 5, 4, 6), 2, 3)
colnames(mat) <- c("t1", "t2", "t3")
mat
#>      t1 t2 t3
#> [1,]  1  2  4
#> [2,]  3  5  6

apply(mat, 2, mean)  # Compute the mean of every column
#>  t1  t2  t3
#> 2.0 3.5 5.0

In Base R, the apply function is intended for processing a matrix or data frame. The second argument of apply determines the direction:

  • 1 means process row by row.

  • 2 means process column by column.

This is more mnemonic than it looks. We speak of matrices in “rows and columns”, so rows are first and columns second; 1 and 2, respectively.

A data frame is a more complicated data structure than a matrix, so there are more options. You can simply use apply, in which case R will convert your data frame to a matrix and then apply your function. That will work if your data frame contains only one type of data but will likely not do what you want if some columns are numeric and some are character. In that case, R will force all columns to have identical types, likely performing an unwanted conversion as a result.

Fortunately, there are multiple alternatives. Recall that a data frame is a kind of list: it is a list of the columns of the data frame. purrr has a whole family of map functions that return different types of objects. Of particular interest here is map_df, which returns a data frame (thus the df in the name).

df2 <- map_df(df, fun) # Returns a data.frame

The function fun should expect one argument: a column from the data frame.

A common use of this recipe is checking the types of the columns in a data frame. The batch column of this data frame, at a quick glance, seems to contain numbers:

load("./data/batches.rdata")
head(batches)
#>   batch clinic dosage shrinkage
#> 1     3     KY     IL    -0.307
#> 2     3     IL     IL    -1.781
#> 3     1     KY     IL    -0.172
#> 4     3     KY     IL     1.215
#> 5     2     IL     IL     1.895
#> 6     2     NJ     IL    -0.430

But printing the classes of the columns reveals batch to be a factor instead:

map_df(batches, class)
#> # A tibble: 1 x 4
#>   batch  clinic dosage shrinkage
#>   <chr>  <chr>  <chr>  <chr>
#> 1 factor factor factor numeric

See Also

See Recipes , , and .

Applying a Function to Parallel Vectors or Lists

Problem

You have a function that takes multiple arguments. You want to apply the function element-wise to vectors and obtain a vector result. Unfortunately, the function is not vectorized; that is, it works on scalars but not on vectors.

Solution

Use one of the map or pmap functions from the core tidyverse package purrr. The most general solution is to put your vectors in a list, then use pmap:

lst <- list(v1, v2, v3)
pmap(lst, fun)

pmap will take the elements of lst and pass them as the inputs to fun.

If you only have two vectors you are passing as inputs to your function, the map2_* family of functions is convenient and saves you the step of putting your vectors in a list first. map2 will return a list, while the typed variants (map2_chr, map2_dbl, etc. ) return vectors of the type their name implies:

map2(v1, v2, fun)

or if fun returns only a double:

map2_dbl(v1, v2, fun)

The typed variants of the purrr functions refer to the output type expected from the function. All the typed variants return vectors of their respective type, while the untyped variants return lists, which allow mixing of types.

Discussion

The basic operators of R, such as x + y, are vectorized; this means that they compute their result element-by-element and return a vector of results. Also, many R functions are vectorized.
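
A quick illustration of that element-by-element behavior:

c(1, 2, 3) + c(10, 20, 30)   # the + operator works element by element
#> [1] 11 22 33
sqrt(c(1, 4, 9))             # many functions, such as sqrt, are vectorized too
#> [1] 1 2 3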

Not all functions are vectorized, however, and those that are not work only on scalars. Using vector arguments produces errors at best and meaningless results at worst. In such cases, the map functions from purrr can effectively vectorize the function for you.

Consider the gcd function from Recipe X-X, which takes two arguments:

gcd <- function(a, b) {
  if (b == 0) {
    return(a)
  } else {
    return(gcd(b, a %% b))
  }
}

If we apply gcd to two vectors, the result is wrong answers and a pile of error messages:

gcd(c(1, 2, 3), c(9, 6, 3))
#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used

#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used

#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used
#> [1] 1 2 0

The function is not vectorized, but we can use map to “vectorize” it. In this case, since we have two inputs we’re mapping over, we should use the map2 function. This gives the element-wise GCDs between two vectors.

a <- c(1, 2, 3)
b <- c(9, 6, 3)
my_gcds <- map2(a, b, gcd)
my_gcds
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3

Notice that map2 returns a list of results. If we wanted the output in a vector, we could use unlist on the result, or use one of the typed variants:

unlist(my_gcds)
#> [1] 1 2 3

The map family of purrr functions gives you a series of variants that return specific types of output. The suffixes on the function names communicate the type of vector they will return. While map and map2 return lists, the type-specific variants return objects guaranteed to be the same type, so the results can be put in atomic vectors. For example, we could use the map2_chr function to ask R to coerce the results into character output, or map2_dbl to ensure the results are doubles:

map2_chr(a, b, gcd)
#> [1] "1.000000" "2.000000" "3.000000"
map2_dbl(a, b, gcd)
#> [1] 1 2 3

If our data has more than two vectors, or the data is already in a list, we can use the pmap family of functions, which take a list as input.

lst <- list(a,b)
pmap(lst, gcd)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3

Or if we want a typed vector as output:

lst <- list(a,b)
pmap_dbl(lst, gcd)
#> [1] 1 2 3

With the purrr functions, remember that the pmap family are parallel mappers that take a list as input, while the map2 functions take two, and only two, vectors as inputs.

See Also

This is really just a special case of our very first recipe in this chapter: “Applying a Function to Each List Element”. See that recipe for more discussion of map variants. In addition, Jenny Bryan has a great collection of purrr tutorials on her GitHub site: https://jennybc.github.io/purrr-tutorial/

Applying a Function to Groups of Rows

Problem

Your data elements occur in groups. You want to process the data by groups—for example, summing by group or averaging by group.

Solution

The easiest way to do grouping is with the dplyr function group_by in conjunction with summarize. If our data frame is df, the variables we want to group by are v1 and v2, and we want to apply the function fun to the column value_var within each group, we can do that with group_by and summarize:

df %>%
  group_by(v1, v2) %>%
  summarize(
    result_var = fun(value_var)
  )

Discussion

Let’s look at a specific example where our input data frame, df, contains a variable my_group, which we want to group by, and a field named values, on which we would like to calculate some statistics:

df <- tibble(
  my_group = c("A", "B","A", "B","A", "B"),
  values = 1:6
)

df %>%
  group_by(my_group) %>%
  summarize(
    avg_values = mean(values),
    tot_values = sum(values),
    count_values = n()
  )
#> # A tibble: 2 x 4
#>   my_group avg_values tot_values count_values
#>   <chr>         <dbl>      <int>        <int>
#> 1 A                 3          9            3
#> 2 B                 4         12            3

The output has one record per group, along with calculated values for the three summary fields we defined.

See Also

See this chapter’s “Introduction” for more about grouping factors.

Chapter 7. Strings and Dates

Introduction

Strings? Dates? In a statistical programming package?

As soon as you read files or print reports, you need strings. When you work with real-world problems, you need dates.

R has facilities for both strings and dates. They are clumsy compared to string-oriented languages such as Perl, but then it’s a matter of the right tool for the job. We wouldn’t want to perform logistic regression in Perl.

Some of this clunkiness with strings and dates has been improved by the tidyverse packages stringr and lubridate. As with other chapters in this book, the examples below pull from both Base R and add-on packages that make life easier, faster, and more convenient.

Classes for Dates and Times

R has a variety of classes for working with dates and times, which is nice if you prefer having a choice but annoying if you prefer living simply. There is a critical distinction among the classes: some are date-only classes, and some are datetime classes. All classes can handle calendar dates (e.g., March 15, 2019), but not all can represent a datetime (11:45 AM on March 1, 2019).

The following classes are included in the base distribution of R:

Date

The Date class can represent a calendar date but not a clock time. It is a solid, general-purpose class for working with dates, including conversions, formatting, basic date arithmetic, and time-zone handling. Most of the date-related recipes in this book are built on the Date class.

POSIXct

This is a datetime class, and it can represent a moment in time with an accuracy of one second. Internally, the datetime is stored as the number of seconds since January 1, 1970, and so is a very compact representation. This class is recommended for storing datetime information (e.g., in data frames).

POSIXlt

This is also a datetime class, but the representation is stored in a nine-element list that includes the year, month, day, hour, minute, and second. That representation makes it easy to extract date parts, such as the month or hour. Obviously, this representation is much less compact than the POSIXct class; hence it is normally used for intermediate processing and not for storing data.

The base distribution also provides functions for easily converting between representations: as.Date, as.POSIXct, and as.POSIXlt.
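
Here is a minimal sketch of those conversions; the timestamp is just illustrative:

dt_ct <- as.POSIXct("2019-03-01 11:45:00", tz = "UTC")   # compact datetime
dt_lt <- as.POSIXlt(dt_ct)                               # list-like datetime
dt_lt$hour                                               # date parts are easy to extract
#> [1] 11
as.Date(dt_ct)                                           # keep only the calendar date
#> [1] "2019-03-01"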

The following helpful packages are available for downloading from CRAN:

chron

The chron package can represent both dates and times but without the added complexities of handling time zones and daylight savings time. It’s therefore easier to use than Date but less powerful than POSIXct and POSIXlt. It would be useful for work in econometrics or time series analysis.

lubridate

Lubridate is designed to make working with dates and times easier while keeping the important bells and whistles such as time zones. It’s especially clever regarding datetime arithmetic. This package introduces some helpful constructs like durations, periods, and intervals. Lubridate is part of the tidyverse, so it is installed when you run install.packages('tidyverse'). However, it is not part of the “core tidyverse,” so it is not loaded when you run library(tidyverse); you must load it explicitly with library(lubridate).
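
For example, here is a minimal sketch of lubridate’s date parsers and period arithmetic; the dates are just illustrative:

library(lubridate)

ymd("2019-03-15")              # parse a year-month-day string
#> [1] "2019-03-15"
mdy("03/15/2019")              # parse an American month/day/year string
#> [1] "2019-03-15"
ymd("2019-03-15") + days(30)   # add a period of 30 days
#> [1] "2019-04-14"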

mondate

This is a specialized package for handling dates in units of months in addition to days and years. Such needs arise in accounting and actuarial work, for example, where month-by-month calculations are needed.

timeDate

This is a high-powered package with well-thought-out facilities for handling dates and times, including date arithmetic, business days, holidays, conversions, and generalized handling of time zones. It was originally part of the Rmetrics software for financial modeling, where precision in dates and times is critical. If you have a demanding need for date facilities, consider this package.

Which class should you select? The article “Date and Time Classes in R” by Grothendieck and Petzoldt offers this general advice:

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

See Also

See help(DateTimeClasses) for more details regarding the built-in facilities. See the June 2004 article “Date and Time Classes in R” by Gabor Grothendieck and Thomas Petzoldt for a great introduction to the date and time facilities. The June 2001 article “Date-Time Classes” by Brian Ripley and Kurt Hornik discusses the two POSIX classes in particular. The “Dates and times” chapter of the book R for Data Science by Garrett Grolemund and Hadley Wickham provides a great introduction to lubridate.

Getting the Length of a String

Problem

You want to know the length of a string.

Solution

Use the nchar function, not the length function.

Discussion

The nchar function takes a string and returns the number of characters in the string:

nchar("Moe")
#> [1] 3
nchar("Curly")
#> [1] 5

If you apply nchar to a vector of strings, it returns the length of each string:

s <- c("Moe", "Larry", "Curly")
nchar(s)
#> [1] 3 5 5

You might think the length function returns the length of a string. Nope. It returns the length of a vector. When you apply the length function to a single string, R returns the value 1 because it views that string as a singleton vector—a vector with one element:

length("Moe")
#> [1] 1
length(c("Moe", "Larry", "Curly"))
#> [1] 3

Concatenating Strings

Problem

You want to join together two or more strings into one string.

Solution

Use the paste function.

Discussion

The paste function concatenates several strings together. In other words, it creates a new string by joining the given strings end to end:

paste("Everybody", "loves", "stats.")
#> [1] "Everybody loves stats."

By default, paste inserts a single space between pairs of strings, which is handy if that’s what you want and annoying otherwise. The sep argument lets you specify a different separator. Use an empty string ("") to run the strings together without separation:

paste("Everybody", "loves", "stats.", sep = "-")
#> [1] "Everybody-loves-stats."
paste("Everybody", "loves", "stats.", sep = "")
#> [1] "Everybodylovesstats."

It’s a common idiom to concatenate strings together with no separator at all, so there is a convenience function, paste0, for exactly that:

paste0("Everybody", "loves", "stats.")
#> [1] "Everybodylovesstats."

The function is very forgiving about nonstring arguments. It tries to convert them to strings using the as.character function:

paste("The square root of twice pi is approximately", sqrt(2 * pi))
#> [1] "The square root of twice pi is approximately 2.506628274631"

If one or more arguments are vectors of strings, paste will generate all combinations of the arguments (because of recycling):

stooges <- c("Moe", "Larry", "Curly")
paste(stooges, "loves", "stats.")
#> [1] "Moe loves stats."   "Larry loves stats." "Curly loves stats."

Sometimes you want to join even those combinations into one, big string. The collapse parameter lets you define a top-level separator and instructs paste to concatenate the generated strings using that separator:

paste(stooges, "loves", "stats", collapse = ", and ")
#> [1] "Moe loves stats, and Larry loves stats, and Curly loves stats"

Extracting Substrings

Problem

You want to extract a portion of a string according to position.

Solution

Use substr(string,start,end) to extract the substring that begins at start and ends at end.

Discussion

The substr function takes a string, a starting point, and an ending point. It returns the substring between the starting to ending points:

substr("Statistics", 1, 4) # Extract first 4 characters
#> [1] "Stat"
substr("Statistics", 7, 10) # Extract last 4 characters
#> [1] "tics"

Just like many R functions, substr lets the first argument be a vector of strings. In that case, it applies itself to every string and returns a vector of substrings:

ss <- c("Moe", "Larry", "Curly")
substr(ss, 1, 3) # Extract first 3 characters of each string
#> [1] "Moe" "Lar" "Cur"

In fact, all the arguments can be vectors, in which case substr will treat them as parallel vectors. From each string, it extracts the substring delimited by the corresponding entries in the starting and ending points. This can facilitate some useful tricks. For example, the following code snippet extracts the last two characters from each string; each substring starts on the penultimate character of the original string and ends on the final character:

cities <- c("New York, NY", "Los Angeles, CA", "Peoria, IL")
substr(cities, nchar(cities) - 1, nchar(cities))
#> [1] "NY" "CA" "IL"

You can extend this trick into mind-numbing territory by exploiting the Recycling Rule, but we suggest you avoid the temptation.

Splitting a String According to a Delimiter

Problem

You want to split a string into substrings. The substrings are separated by a delimiter.

Solution

Use strsplit, which takes two arguments: the string and the delimiter of the substrings:

strsplit(string, delimiter)

The `delimiter` can be either a simple string or a regular expression.

Discussion

It is common for a string to contain multiple substrings separated by the same delimiter. One example is a file path, whose components are separated by slashes (/):

path <- "/home/mike/data/trials.csv"

We can split that path into its components by using strsplit with a delimiter of /:

strsplit(path, "/")
#> [[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"

Notice that the first “component” is actually an empty string because nothing preceded the first slash.

Also notice that strsplit returns a list and that each element of the list is a vector of substrings. This two-level structure is necessary because the first argument can be a vector of strings. Each string is split into its substrings (a vector); then those vectors are returned in a list.

If you are only operating on a single string, you can pop out the first element like this:

strsplit(path, "/")[[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"

This example splits three file paths and returns a three-element list:

paths <- c(
  "/home/mike/data/trials.csv",
  "/home/mike/data/errors.csv",
  "/home/mike/corr/reject.doc"
)
strsplit(paths, "/")
#> [[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"
#>
#> [[2]]
#> [1] ""           "home"       "mike"       "data"       "errors.csv"
#>
#> [[3]]
#> [1] ""           "home"       "mike"       "corr"       "reject.doc"

The second argument of strsplit (the `delimiter` argument) is actually much more powerful than these examples indicate. It can be a regular expression, letting you match patterns far more complicated than a simple string. In fact, to turn off the regular expression feature (and its interpretation of special characters) you must include the fixed=TRUE argument.
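
For example, to split on a literal period you need fixed=TRUE, because in a regular expression the period means “any character”; the filename is just illustrative:

strsplit("trials.2019.csv", ".", fixed = TRUE)   # treat "." as a plain period
#> [[1]]
#> [1] "trials" "2019"   "csv"

# Without fixed = TRUE, the "." would match every character,
# so the pieces would not be what you intended.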

See Also

To learn more about regular expressions in R, see the help page for regexp. See O’Reilly’s Mastering Regular Expressions, by Jeffrey E.F. Friedl to learn more about regular expressions in general.

Replacing Substrings

Problem

Within a string, you want to replace one substring with another.

Solution

Use sub to replace the first instance of a substring:

sub(old, new, string)

Use gsub to replace all instances of a substring:

gsub(old, new, string)

Discussion

The sub function finds the first instance of the old substring within string and replaces it with the new substring:

str <- "Curly is the smart one. Curly is funny, too."
sub("Curly", "Moe", str)
#> [1] "Moe is the smart one. Curly is funny, too."

gsub does the same thing, but it replaces all instances of the substring (a global replace), not just the first:

gsub("Curly", "Moe", str)
#> [1] "Moe is the smart one. Moe is funny, too."

To remove a substring altogether, simply set the new substring to be empty:

sub(" and SAS", "", "For really tough problems, you need R and SAS.")
#> [1] "For really tough problems, you need R."

The old argument can be a regular expression, which allows you to match patterns much more complicated than a simple string. Regular expression interpretation is actually the default, so you must set the fixed=TRUE argument if you don’t want sub and gsub to interpret old as a regular expression.
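
For example, a character class in a regular expression lets you replace a whole set of characters at once, while fixed=TRUE restricts the match to the literal string:

gsub("[aeiou]", "_", "Curly is the smart one.")           # regex: replace every vowel
#> [1] "C_rly _s th_ sm_rt _n_."
gsub(".", "!", "Curly is the smart one.", fixed = TRUE)   # replace only the literal period
#> [1] "Curly is the smart one!"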

See Also

To learn more about regular expressions in R, see the help page for regexp. See Mastering Regular Expressions to learn more about regular expressions in general.

Generating All Pairwise Combinations of Strings

Problem

You have two sets of strings, and you want to generate all combinations from those two sets (their Cartesian product).

Solution

Use the outer and paste functions together to generate the matrix of all possible combinations:

m <- outer(strings1, strings2, paste, sep = "")

Discussion

The outer function is intended to form the outer product. However, it allows a third argument to replace simple multiplication with any function. In this recipe we replace multiplication with string concatenation (paste), and the result is all combinations of strings.

Suppose you have four test sites and three treatments:

locations <- c("NY", "LA", "CHI", "HOU")
treatments <- c("T1", "T2", "T3")

We can apply outer and paste to generate all combinations of test sites and treatments:

outer(locations, treatments, paste, sep = "-")
#>      [,1]     [,2]     [,3]
#> [1,] "NY-T1"  "NY-T2"  "NY-T3"
#> [2,] "LA-T1"  "LA-T2"  "LA-T3"
#> [3,] "CHI-T1" "CHI-T2" "CHI-T3"
#> [4,] "HOU-T1" "HOU-T2" "HOU-T3"

The fourth argument of outer is passed to paste. In this case, we passed sep="-" in order to define a hyphen as the separator between the strings.

The result of outer is a matrix. If you want the combinations in a vector instead, flatten the matrix using the as.vector function.

In the special case when you are combining a set with itself and order does not matter, the result will be duplicate combinations:

outer(treatments, treatments, paste, sep = "-")
#>      [,1]    [,2]    [,3]
#> [1,] "T1-T1" "T1-T2" "T1-T3"
#> [2,] "T2-T1" "T2-T2" "T2-T3"
#> [3,] "T3-T1" "T3-T2" "T3-T3"

Or we can use expand.grid to get a pair of vectors representing all combinations:

expand.grid(treatments, treatments)
#>   Var1 Var2
#> 1   T1   T1
#> 2   T2   T1
#> 3   T3   T1
#> 4   T1   T2
#> 5   T2   T2
#> 6   T3   T2
#> 7   T1   T3
#> 8   T2   T3
#> 9   T3   T3

But suppose we want all unique pairwise combinations of treatments. We can eliminate the duplicates by removing the lower triangle (or upper triangle). The lower.tri function identifies that triangle, so inverting it identifies all elements outside the lower triangle:

m <- outer(treatments, treatments, paste, sep = "-")
m[!lower.tri(m)]
#> [1] "T1-T1" "T1-T2" "T2-T2" "T1-T3" "T2-T3" "T3-T3"

See Also

See “Concatenating Strings” for using paste to generate combinations of strings. The gtools package on CRAN (https://cran.r-project.org/web/packages/gtools/index.html) has functions combinations and permutation which may be of help with related tasks.

Getting the Current Date

Problem

You need to know today’s date.

Solution

The Sys.Date function returns the current date:

Sys.Date()
#> [1] "2019-01-07"

Discussion

The Sys.Date function returns a Date object. In the preceding example it seems to return a string because the result is printed inside double quotes. What really happened, however, is that Sys.Date returned a Date object and then R converted that object into a string for printing purposes. You can see this by checking the class of the result from Sys.Date:

class(Sys.Date())
#> [1] "Date"

Converting a String into a Date

Problem

You have the string representation of a date, such as “2018-12-31”, and you want to convert that into a Date object.

Solution

You can use as.Date, but you must know the format of the string. By default, as.Date assumes the string looks like yyyy-mm-dd. To handle other formats, you must specify the format parameter of as.Date. Use format="%m/%d/%Y" if the date is in American style, for instance.

Discussion

This example shows the default format assumed by as.Date, which is the ISO 8601 standard format of yyyy-mm-dd:

as.Date("2018-12-31")
#> [1] "2018-12-31"

The as.Date function returns a Date object that (as in the prior recipe) is here being converted back to a string for printing; this explains the double quotes around the output.

The string can be in other formats, but you must provide a format argument so that as.Date can interpret your string. See the help page for the strftime function for details about allowed formats.

Being simple Americans, we often mistakenly try to convert the usual American date format (mm/dd/yyyy) into a Date object, with these unhappy results:

as.Date("12/31/2018")
#> Error in charToDate(x): character string is not in a standard unambiguous format

Here is the correct way to convert an American-style date:

as.Date("12/31/2018", format = "%m/%d/%Y")
#> [1] "2018-12-31"

Observe that the Y in the format string is capitalized to indicate a 4-digit year. If you’re using 2-digit years, specify a lowercase y.

Converting a Date into a String

Problem

You want to convert a Date object into a character string, usually because you want to print the date.

Solution

Use either format or as.character:

format(Sys.Date())
#> [1] "2019-01-07"
as.character(Sys.Date())
#> [1] "2019-01-07"

Both functions allow a format argument that controls the formatting. Use format="%m/%d/%Y" to get American-style dates, for example:

format(Sys.Date(), format = "%m/%d/%Y")
#> [1] "01/07/2019"

Discussion

The format argument defines the appearance of the resulting string. Normal characters, such as slash (/) or hyphen (-) are simply copied to the output string. Each two-letter combination of a percent sign (%) followed by another character has special meaning. Some common ones are:

%b

Abbreviated month name (“Jan”)

%B

Full month name (“January”)

%d

Day as a two-digit number

%m

Month as a two-digit number

%y

Year without century (00–99)

%Y

Year with century
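
For example, here is a quick sketch that combines several of these codes (the month name depends on your locale):

format(as.Date("2019-01-07"), format = "%B %d, %Y")
#> [1] "January 07, 2019"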

See the help page for the strftime function for a complete list of formatting codes.

Converting Year, Month, and Day into a Date

Problem

You have a date represented by its year, month, and day in different variables. You want to merge these elements into a single Date object representation.

Solution

Use the ISOdate function:

ISOdate(year, month, day)

The result is a POSIXct object that you can convert into a Date object:

year <- 2018
month <- 12
day <- 31
as.Date(ISOdate(year, month, day))
#> [1] "2018-12-31"

Discussion

It is common for input data to contain dates encoded as three numbers: year, month, and day. The ISOdate function can combine them into a POSIXct object:

ISOdate(2020, 2, 29)
#> [1] "2020-02-29 12:00:00 GMT"

You can keep your date in the POSIXct format. However, when working with pure dates (not dates and times), we often convert to a Date object and truncate the unused time information:

as.Date(ISOdate(2020, 2, 29))
#> [1] "2020-02-29"

Trying to convert an invalid date results in NA:

ISOdate(2013, 2, 29) # Oops! 2013 is not a leap year
#> [1] NA

ISOdate can process entire vectors of years, months, and days, which is quite handy for mass conversion of input data. The following example starts with the year/month/day numbers for an early-January date in each of several years and then combines them all into Date objects:

years <- 2010:2014
months <- rep(1, 5)
days <- 5:9
ISOdate(years, months, days)
#> [1] "2010-01-05 12:00:00 GMT" "2011-01-06 12:00:00 GMT"
#> [3] "2012-01-07 12:00:00 GMT" "2013-01-08 12:00:00 GMT"
#> [5] "2014-01-09 12:00:00 GMT"
as.Date(ISOdate(years, months, days))
#> [1] "2010-01-05" "2011-01-06" "2012-01-07" "2013-01-08" "2014-01-09"

Purists will note that the vector of months is redundant and that the last expression can therefore be further simplified by invoking the Recycling Rule:

as.Date(ISOdate(years, 1, days))
#> [1] "2010-01-05" "2011-01-06" "2012-01-07" "2013-01-08" "2014-01-09"

This recipe can also be extended to handle year, month, day, hour, minute, and second data by using the ISOdatetime function (see the help page for details):

ISOdatetime(year, month, day, hour, minute, second)
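
For instance, a minimal sketch (we pass the optional tz = "GMT" argument so the printed time zone does not depend on your machine):

ISOdatetime(2020, 2, 29, 14, 30, 0, tz = "GMT")
#> [1] "2020-02-29 14:30:00 GMT"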

Getting the Julian Date

Problem

Given a Date object, you want to extract the Julian date—which is, in R, the number of days since January 1, 1970.

Solution

Either convert the Date object to an integer or use the julian function:

d <- as.Date("2019-03-15")
as.integer(d)
#> [1] 17970
jd <- julian(d)
jd
#> [1] 17970
#> attr(,"origin")
#> [1] "1970-01-01"
attr(jd, "origin")
#> [1] "1970-01-01"

Discussion

A Julian “date” is simply the number of days since a more-or-less arbitrary starting point. In the case of R, that starting point is January 1, 1970, the same starting point as Unix systems. So the Julian date for January 1, 1970 is zero, as shown here:

as.integer(as.Date("1970-01-01"))
#> [1] 0
as.integer(as.Date("1970-01-02"))
#> [1] 1
as.integer(as.Date("1970-01-03"))
#> [1] 2
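
The conversion also works in reverse; as a quick sketch, give as.Date a day count and the origin, and it returns the corresponding Date object:

as.Date(17970, origin = "1970-01-01")
#> [1] "2019-03-15"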

Extracting the Parts of a Date

Problem

Given a Date object, you want to extract a date part such as the day of the week, the day of the year, the calendar day, the calendar month, or the calendar year.

Solution

Convert the Date object to a POSIXlt object, which is a list of date parts. Then extract the desired part from that list:

d <- as.Date("2019-03-15")
p <- as.POSIXlt(d)
p$mday        # Day of the month
#> [1] 15
p$mon         # Month (0 = January)
#> [1] 2
p$year + 1900 # Year
#> [1] 2019

Discussion

The POSIXlt object represents a date as a list of date parts. Convert your Date object to POSIXlt by using the as.POSIXlt function, which will give you a list with these members:

sec

Seconds (0–61)

min

Minutes (0–59)

hour

Hours (0–23)

mday

Day of the month (1–31)

mon

Month (0–11)

year

Years since 1900

wday

Day of the week (0–6, 0 = Sunday)

yday

Day of the year (0–365)

isdst

Daylight Saving Time flag

Using these date parts, we can learn that April 2, 2020, is a Thursday (wday = 4) and the 93rd day of the year (because yday = 0 on January 1):

d <- as.Date("2020-04-02")
as.POSIXlt(d)$wday
#> [1] 4
as.POSIXlt(d)$yday
#> [1] 92

A common mistake is failing to add 1900 to the year, giving the impression you are living a long, long time ago:

as.POSIXlt(d)$year # Oops!
#> [1] 120
as.POSIXlt(d)$year + 1900
#> [1] 2020

Creating a Sequence of Dates

Problem

You want to create a sequence of dates, such as a sequence of daily, monthly, or annual dates.

Solution

The seq function is a generic function that has a version for Date objects. It can create a Date sequence similarly to the way it creates a sequence of numbers.

Discussion

A typical use of seq specifies a starting date (from), ending date (to), and increment (by). An increment of 1 indicates daily dates:

s <- as.Date("2019-01-01")
e <- as.Date("2019-02-01")
seq(from = s, to = e, by = 1) # One month of dates
#>  [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#>  [6] "2019-01-06" "2019-01-07" "2019-01-08" "2019-01-09" "2019-01-10"
#> [11] "2019-01-11" "2019-01-12" "2019-01-13" "2019-01-14" "2019-01-15"
#> [16] "2019-01-16" "2019-01-17" "2019-01-18" "2019-01-19" "2019-01-20"
#> [21] "2019-01-21" "2019-01-22" "2019-01-23" "2019-01-24" "2019-01-25"
#> [26] "2019-01-26" "2019-01-27" "2019-01-28" "2019-01-29" "2019-01-30"
#> [31] "2019-01-31" "2019-02-01"

Another typical use specifies a starting date (from), increment (by), and number of dates (length.out):

seq(from = s, by = 1, length.out = 7) # One week of daily dates
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06" "2019-01-07"

The increment (by) is flexible and can be specified in days, weeks, months, or years:

seq(from = s, by = "month", length.out = 12)   # First of the month for one year
#>  [1] "2019-01-01" "2019-02-01" "2019-03-01" "2019-04-01" "2019-05-01"
#>  [6] "2019-06-01" "2019-07-01" "2019-08-01" "2019-09-01" "2019-10-01"
#> [11] "2019-11-01" "2019-12-01"
seq(from = s, by = "3 months", length.out = 4) # Quarterly dates for one year
#> [1] "2019-01-01" "2019-04-01" "2019-07-01" "2019-10-01"
seq(from = s, by = "year", length.out = 10)    # Year-start dates for one decade
#>  [1] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01" "2023-01-01"
#>  [6] "2024-01-01" "2025-01-01" "2026-01-01" "2027-01-01" "2028-01-01"

Be careful with by="month" near month-end. In this example, February 29 does not exist in 2019, so the sequence overflows into March, which is probably not what you wanted:

seq(as.Date("2019-01-29"), by = "month", len = 3)
#> [1] "2019-01-29" "2019-03-01" "2019-03-29"

Chapter 8. Probability

Introduction

Probability theory is the foundation of statistics, and R has plenty of machinery for working with probability, probability distributions, and random variables. The recipes in this chapter show you how to calculate probabilities from quantiles, calculate quantiles from probabilities, generate random variables drawn from distributions, plot distributions, and so forth.

Names of Distributions

R has an abbreviated name for every probability distribution. This name is used to identify the functions associated with the distribution. For example, the name of the Normal distribution is “norm”, which is the root of these function names:

Function Purpose

dnorm

Normal density

pnorm

Normal distribution function

qnorm

Normal quantile function

rnorm

Normal random variates

Table 8-1 describes some common discrete distributions, and Table 8-2 describes several common continuous distributions.

Table 8-1. Common Discrete Distributions
Discrete distribution R name Parameters

Binomial

binom

n = number of trials; p = probability of success for one trial

Geometric

geom

p = probability of success for one trial

Hypergeometric

hyper

m = number of white balls in urn; n = number of black balls in urn; k = number of balls drawn from urn

Negative binomial (NegBinomial)

nbinom

size = number of successful trials; either prob = probability of successful trial or mu = mean

Poisson

pois

lambda = mean

Table 8-2. Common Continuous Distributions
Continuous distribution R name Parameters

Beta

beta

shape1; shape2

Cauchy

cauchy

location; scale

Chi-squared (Chisquare)

chisq

df = degrees of freedom

Exponential

exp

rate

F

f

df1 and df2 = degrees of freedom

Gamma

gamma

shape; either rate or scale

Log-normal (Lognormal)

lnorm

meanlog = mean on logarithmic scale;

sdlog = standard deviation on logarithmic scale

Logistic

logis

location; scale

Normal

norm

mean; sd = standard deviation

Student’s t (TDist)

t

df = degrees of freedom

Uniform

unif

min = lower limit; max = upper limit

Weibull

weibull

shape; scale

Wilcoxon

wilcox

m = number of observations in first sample;

n = number of observations in second sample

Warning

All distribution-related functions require distributional parameters, such as size and prob for the binomial or prob for the geometric. The big “gotcha” is that the distributional parameters may not be what you expect. For example, I would expect the parameter of an exponential distribution to be β, the mean. The R convention, however, is for the exponential distribution to be defined by the rate = 1/β, so I often supply the wrong value. The moral is, study the help page before you use a function related to a distribution. Be sure you’ve got the parameters right.
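
As a quick sanity check (a sketch whose output we omit because it varies from run to run), an exponential distribution with mean 40 must be given rate = 1/40:

mean(rexp(10000, rate = 1 / 40))   # roughly 40: the parameter is the rate 1/β, not the mean β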

Getting Help on Probability Distributions

To see the R functions related to a particular probability distribution, use the help command and the full name of the distribution. For example, this will show the functions related to the Normal distribution:

?Normal

Some distributions have names that don’t work well with the help command, such as “Student’s t”. They have special help names, as noted in Table 8-1 and Table 8-2: NegBinomial, Chisquare, Lognormal, and TDist. Thus, to get help on the Student’s t distribution, use this:

?TDist

See Also

There are many other distributions implemented in downloadable packages; see the CRAN task view devoted to probability distributions. The SuppDists package, available on CRAN, includes ten supplemental distributions. The MASS package, which ships with the standard R distribution, provides additional support for distributions, such as maximum-likelihood fitting for some common distributions as well as sampling from a multivariate Normal distribution.

Counting the Number of Combinations

Problem

You want to calculate the number of combinations of n items taken k at a time.

Solution

Use the choose function:

n <- 10
k <- 2
choose(n, k)
#> [1] 45

Discussion

A common problem in computing probabilities of discrete variables is counting combinations: the number of distinct subsets of size k that can be created from n items. The number is given by n!/(k!(n − k)!), but it’s much more convenient to use the choose function—especially as n and k grow larger:

choose(5, 3)   # How many ways can we select 3 items from 5 items?
#> [1] 10
choose(50, 3)  # How many ways can we select 3 items from 50 items?
#> [1] 19600
choose(50, 30) # How many ways can we select 30 items from 50 items?
#> [1] 4.71e+13

These numbers are also known as binomial coefficients.

See Also

This recipe merely counts the combinations; see “Generating Combinations” to actually generate them.

Generating Combinations

Problem

You want to generate all combinations of n items taken k at a time.

Solution

Use the combn function:

items <- 2:5
k <- 2
combn(items, k)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    2    2    2    3    3    4
#> [2,]    3    4    5    4    5    5

Discussion

We can use combn(1:5,3) to generate all combinations of the numbers 1 through 5 taken three at a time:

combn(1:5, 3)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]    1    1    1    1    1    1    2    2    2     3
#> [2,]    2    2    2    3    3    4    3    3    4     4
#> [3,]    3    4    5    4    5    5    4    5    5     5

The function is not restricted to numbers. We can generate combinations of strings, too. Here are all combinations of five treatments taken three at a time:

combn(c("T1", "T2", "T3", "T4", "T5"), 3)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "T1" "T1" "T1" "T1" "T1" "T1" "T2" "T2" "T2" "T3"
#> [2,] "T2" "T2" "T2" "T3" "T3" "T4" "T3" "T3" "T4" "T4"
#> [3,] "T3" "T4" "T5" "T4" "T5" "T5" "T4" "T5" "T5" "T5"
Warning

As the number of items, n, increases, the number of combinations can explode—especially when k is not near 1 or n.

See Also

See “Counting the Number of Combinations” to count the number of possible combinations before you generate a huge set.

Generating Random Numbers

Problem

You want to generate random numbers.

Solution

The simple case of generating a uniform random number between 0 and 1 is handled by the runif function. This example generates one uniform random number:

runif(1)
#> [1] 0.915
Note

If you are saying runif out loud (or even in your head), you should pronounce it “are unif” instead of “run if.” The term runif is a portmanteau of “random uniform,” so it should not sound like a flow control function.

R can generate random variates from other distributions as well. For a given distribution, the name of the random number generator is “r” prefixed to the distribution’s abbreviated name (e.g., rnorm for the Normal distribution’s random number generator). This example generates one random value from the standard normal distribution:

rnorm(1)
#> [1] 1.53

Discussion

Most programming languages have a wimpy random number generator that generates one random number, uniformly distributed between 0.0 and 1.0, and that’s all. Not R.

R can generate random numbers from many probability distributions other than the uniform distribution. The simple case of generating uniform random numbers between 0 and 1 is handled by the runif function:

runif(1)
#> [1] 0.83

The argument of runif is the number of random values to be generated. Generating a vector of 10 such values is as easy as generating one:

runif(10)
#>  [1] 0.642 0.519 0.737 0.135 0.657 0.705 0.458 0.719 0.935 0.255

There are random number generators for all built-in distributions. Simply prefix the distribution name with “r” and you have the name of the corresponding random number generator. Here are some common ones:

set.seed(42)
runif(1, min = -3, max = 3)      # One uniform variate between -3 and +3
#> [1] 2.49
rnorm(1)                         # One standard Normal variate
#> [1] 1.53
rnorm(1, mean = 100, sd = 15)    # One Normal variate, mean 100 and SD 15
#> [1] 114
rbinom(1, size = 10, prob = 0.5) # One binomial variate
#> [1] 5
rpois(1, lambda = 10)            # One Poisson variate
#> [1] 12
rexp(1, rate = 0.1)              # One exponential variate
#> [1] 3.14
rgamma(1, shape = 2, rate = 0.1) # One gamma variate
#> [1] 22.3

As with runif, the first argument is the number of random values to be generated. Subsequent arguments are the parameters of the distribution, such as mean and sd for the Normal distribution or size and prob for the binomial. See the function’s R help page for details.

The examples given so far use simple scalars for distributional parameters. Yet the parameters can also be vectors, in which case R will cycle through the vector while generating random values. The following example generates three normal random values drawn from distributions with means of −10, 0, and +10, respectively (all distributions have a standard deviation of 1.0):

rnorm(3, mean = c(-10, 0, +10), sd = 1)
#> [1] -9.420 -0.658 11.555

That is a powerful capability in such cases as hierarchical models, where the parameters are themselves random. The next example calculates 30 draws of a normal variate whose mean is itself randomly distributed and with hyperparameters of μ = 0 and σ = 0.2:

means <- rnorm(30, mean = 0, sd = 0.2)
rnorm(30, mean = means, sd = 1)
#>  [1] -0.5549 -2.9232 -1.2203  0.6962  0.1673 -1.0779 -0.3138 -3.3165
#>  [9]  1.5952  0.8184 -0.1251  0.3601 -0.8142  0.1050  2.1264  0.6943
#> [17] -2.7771  0.9026  0.0389  0.2280 -0.5599  0.9572  0.1972  0.2602
#> [25] -0.4423  1.9707  0.4553  0.0467  1.5229  0.3176

If you are generating many random values and the vector of parameters is too short, R will apply the Recycling Rule to the parameter vector.
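
Here is a brief sketch (output omitted because it is random): with six draws and only two means, the means are recycled:

rnorm(6, mean = c(-10, 10), sd = 1)   # means recycle as -10, 10, -10, 10, -10, 10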

See Also

See the “Introduction” to this chapter.

Generating Reproducible Random Numbers

Problem

You want to generate a sequence of random numbers, but you want to reproduce the same sequence every time your program runs.

Solution

Before running your R code, call the set.seed function to initialize the random number generator to a known state:

set.seed(42) # Or use any other positive integer...

Discussion

After generating random numbers, you may often want to reproduce the same sequence of “random” numbers every time your program executes. That way, you get the same results from run to run. One of the authors (Paul) once supported a complicated Monte Carlo analysis of a huge portfolio of securities. The users complained about getting slightly different results each time the program ran. No kidding! The analysis was driven entirely by random numbers, so of course there was randomness in the output. The solution was to set the random number generator to a known state at the beginning of the program. That way, it would generate the same (quasi-)random numbers each time and thus yield consistent, reproducible results.

In R, the set.seed function sets the random number generator to a known state. The function takes one argument, an integer. Any positive integer will work, but you must use the same one in order to get the same initial state.

The function returns nothing. It works behind the scenes, initializing (or reinitializing) the random number generator. The key here is that using the same seed restarts the random number generator back at the same place:

set.seed(165)   # Initialize generator to known state
runif(10)       # Generate ten random numbers
#>  [1] 0.116 0.450 0.996 0.611 0.616 0.426 0.666 0.168 0.788 0.442

set.seed(165)   # Reinitialize to the same known state
runif(10)       # Generate the same ten "random" numbers
#>  [1] 0.116 0.450 0.996 0.611 0.616 0.426 0.666 0.168 0.788 0.442
Warning

When you set the seed value and freeze your sequence of random numbers, you are eliminating a source of randomness that may be critical to algorithms such as Monte Carlo simulations. Before you call set.seed in your application, ask yourself: Am I undercutting the value of my program or perhaps even damaging its logic?

See Also

See “Generating Random Numbers” for more about generating random numbers.

Generating a Random Sample

Problem

You want to sample a dataset randomly.

Solution

The sample function will randomly select n items from a set:

sample(set, n)

Discussion

Suppose your World Series data contains a vector of years when the Series was played. You can select 10 years at random using sample:

world_series <- read_csv("./data/world_series.csv")
sample(world_series$year, 10)
#>  [1] 2010 1961 1906 1992 1982 1948 1910 1973 1967 1931

The items are randomly selected, so running sample again (usually) produces a different result:

sample(world_series$year, 10)
#>  [1] 1941 1973 1921 1958 1979 1946 1932 1919 1971 1974

The sample function normally samples without replacement, meaning it will not select the same item twice. Some statistical procedures (especially the bootstrap) require sampling with replacement, which means that one item can appear multiple times in the sample. Specify replace=TRUE to sample with replacement.

It’s easy to implement a simple bootstrap using sampling with replacement. Suppose we have a vector, x, of 1,000 random numbers, drawn from a normal distribution with mean 4 and standard deviation 10.

set.seed(42)
x <- rnorm(1000, 4, 10)

This code fragment samples 1,000 times from x and calculates the median of each sample:

medians <- numeric(1000)   # empty vector of 1000 numbers
for (i in 1:1000) {
  medians[i] <- median(sample(x, replace = TRUE))
}

From the bootstrap estimates, we can estimate the confidence interval for the median:

ci <- quantile(medians, c(0.025, 0.975))
cat("95% confidence interval is (", ci, ")\n")
#> 95% confidence interval is ( 3.16 4.49 )

We know that x was created from a normal distribution with a mean of 4 and, hence, the population median is 4 as well. (In a symmetrical distribution such as the normal, the mean and the median are the same.) Our confidence interval easily contains that value.

See Also

See “Randomly Permuting a Vector” for randomly permuting a vector and Recipe X-X for more about bootstrapping. “Generating Reproducible Random Numbers” discusses setting seeds for quasi-random numbers.

Generating Random Sequences

Problem

You want to generate a random sequence, such as a series of simulated coin tosses or a simulated sequence of Bernoulli trials.

Solution

Use the sample function. Sample n draws from the set of possible values, and set replace=TRUE:

sample(set, n, replace = TRUE)

Discussion

The sample function randomly selects items from a set. It normally samples without replacement, which means that it will not select the same item twice and will return an error if you try to sample more items than exist in the set. With replace=TRUE, however, sample can select items over and over; this allows you to generate long, random sequences of items.

The following example generates a random sequence of 10 simulated flips of a coin:

sample(c("H", "T"), 10, replace = TRUE)
#>  [1] "H" "T" "H" "T" "T" "T" "H" "T" "T" "H"

The next example generates a sequence of 20 Bernoulli trials—random successes or failures. We use TRUE to signify a success:

sample(c(FALSE, TRUE), 20, replace = TRUE)
#>  [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
#> [12]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

By default, sample will choose equally among the set elements and so the probability of selecting either TRUE or FALSE is 0.5. With a Bernoulli trial, the probability p of success is not necessarily 0.5. You can bias the sample by using the prob argument of sample; this argument is a vector of probabilities, one for each set element. Suppose we want to generate 20 Bernoulli trials with a probability of success p = 0.8. We set the probability of FALSE to be 0.2 and the probability of TRUE to 0.8:

sample(c(FALSE, TRUE), 20, replace = TRUE, prob = c(0.2, 0.8))
#>  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [12]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

The resulting sequence is clearly biased toward TRUE. I chose this example because it’s a simple demonstration of a general technique. For the special case of a binary-valued sequence you can use rbinom, the random generator for binomial variates:

rbinom(10, 1, 0.8)
#>  [1] 1 0 1 1 1 1 1 0 1 1

Randomly Permuting a Vector

Problem

You want to generate a random permutation of a vector.

Solution

If v is your vector, then sample(v) returns a random permutation.

Discussion

We typically think of the sample function for sampling from large datasets. However, the default parameters enable you to create a random rearrangement of the dataset. The function call sample(v) is equivalent to:

sample(v, size = length(v), replace = FALSE)

which means “select all the elements of v in random order while using each element exactly once.” That is a random permutation. Here is a random permutation of 1, …, 10:

sample(1:10)
#>  [1]  7  3  6  1  5  2  4  8 10  9

See Also

See “Generating a Random Sample” for more about sample.

Calculating Probabilities for Discrete Distributions

Problem

You want to calculate either the simple or the cumulative probability associated with a discrete random variable.

Solution

For a simple probability, P(X = x), use the density function. All built-in probability distributions have a density function whose name is “d” prefixed to the distribution name. For example, dbinom for the binomial distribution.

For a cumulative probability, P(X ≤ x), use the distribution function. All built-in probability distributions have a distribution function whose name is “p” prefixed to the distribution name; thus, pbinom is the distribution function for the binomial distribution.

Discussion

Suppose we have a binomial random variable X over 10 trials, where each trial has a success probability of 1/2. Then we can calculate the probability of observing x = 7 by calling dbinom:

dbinom(7, size = 10, prob = 0.5)
#> [1] 0.117

That calculates a probability of about 0.117. R calls dbinom the density function. Some textbooks call it the probability mass function or the probability function. Calling it a density function keeps the terminology consistent between discrete and continuous distributions (“Calculating Probabilities for Continuous Distributions”).

The cumulative probability, P(X ≤ x), is given by the distribution function, which is sometimes called the cumulative probability function. The distribution function for the binomial distribution is pbinom. Here is the cumulative probability for x = 7 (i.e., P(X ≤ 7)):

pbinom(7, size = 10, prob = 0.5)
#> [1] 0.945

It appears the probability of observing X ≤ 7 is about 0.945.

The density functions and distribution functions for some common discrete distributions are shown in Table 8-3.

Table 8-3. Discrete Distributions
Distribution Density function: P(X = x) Distribution function: P(X ≤ x)

Binomial

dbinom(x, size, prob)

pbinom(x, size, prob)

Geometric

dgeom(x, prob)

pgeom(x, prob)

Poisson

dpois(x, lambda)

ppois(x, lambda)

The complement of the cumulative probability is the survival function, P(X > x). All of the distribution functions let you find this right-tail probability simply by specifying lower.tail=FALSE:

pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)
#> [1] 0.0547

Thus we see that the probability of observing X > 7 is about 0.055.

The interval probability, P(x1 < X ≤ x2), is the probability of observing X between the limits x1 and x2. It is calculated as the difference between two cumulative probabilities: P(X ≤ x2) − P(X ≤ x1). Here is P(3 < X ≤ 7) for our binomial variable:

pbinom(7, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)
#> [1] 0.773

R lets you specify multiple values of x for these functions and will return a vector of the corresponding probabilities. Here we calculate two cumulative probabilities, P(X ≤ 3) and P(X ≤ 7), in one call to pbinom:

pbinom(c(3, 7), size = 10, prob = 0.5)
#> [1] 0.172 0.945

This leads to a one-liner for calculating interval probabilities. The diff function calculates the difference between successive elements of a vector. We apply it to the output of pbinom to obtain the difference in cumulative probabilities—in other words, the interval probability:

diff(pbinom(c(3, 7), size = 10, prob = 0.5))
#> [1] 0.773

See Also

See this chapter’s “Introduction” for more about the built-in probability distributions.

Calculating Probabilities for Continuous Distributions

Problem

You want to calculate the distribution function (DF) or cumulative distribution function (CDF) for a continuous random variable.

Solution

Use the distribution function, which calculates P(X ≤ x). All built-in probability distributions have a distribution function whose name is “p” prefixed to the distribution’s abbreviated name—for instance, pnorm for the Normal distribution.

For example, what is the probability that a draw from the standard normal distribution is below 0.8?

pnorm(q = .8, mean = 0, sd = 1)
#> [1] 0.788

Discussion

The R functions for probability distributions follow a consistent pattern, so the solution to this recipe is essentially identical to the solution for discrete random variables (“Calculating Probabilities for Discrete Distributions”). The significant difference is that continuous variables have no “probability” at a single point, P(X = x). Instead, they have a density at a point.

Given that consistency, the discussion of distribution functions in “Calculating Probabilities for Discrete Distributions” is applicable here, too. Table 8-4 gives the distribution functions for several continuous distributions.

Table 8-4. Continuous Distributions
Distribution Distribution function: P(X ≤ x)

Normal

pnorm(x, mean, sd)

Student’s t

pt(x, df)

Exponential

pexp(x, rate)

Gamma

pgamma(x, shape, rate)

Chi-squared (χ2)

pchisq(x, df)

We can use pnorm to calculate the probability that a man is shorter than 66 inches, assuming that men’s heights are normally distributed with a mean of 70 inches and a standard deviation of 3 inches. Mathematically speaking, we want P(X ≤ 66) given that X ~ N(70, 3):

pnorm(66, mean = 70, sd = 3)
#> [1] 0.0912

Likewise, we can use pexp to calculate the probability that an exponential variable with a mean of 40 could be less than 20:

pexp(20, rate = 1 / 40)
#> [1] 0.393

Just as for discrete probabilities, the functions for continuous probabilities use lower.tail=FALSE to specify the survival function, P(X > x). This call to pexp gives the probability that the same exponential variable could be greater than 50:

pexp(50, rate = 1 / 40, lower.tail = FALSE)
#> [1] 0.287

Also like discrete probabilities, the interval probability for a continuous variable, P(x1 < X < x2), is computed as the difference between two cumulative probabilities, P(X < x2) − P(X < x1). For the same exponential variable, here is P(20 < X < 50), the probability that it could fall between 20 and 50:

pexp(50, rate = 1 / 40) - pexp(20, rate = 1 / 40)
#> [1] 0.32

See Also

See this chapter’s “Introduction” for more about the built-in probability distributions.

Converting Probabilities to Quantiles

Problem

Given a probability p and a distribution, you want to determine the corresponding quantile for p: the value x such that P(X ≤ x) = p.

Solution

Every built-in distribution includes a quantile function that converts probabilities to quantiles. The function’s name is “q” prefixed to the distribution name; thus, for instance, qnorm is the quantile function for the Normal distribution.

The first argument of the quantile function is the probability. The remaining arguments are the distribution’s parameters, such as mean, shape, or rate:

qnorm(0.05, mean = 100, sd = 15)
#> [1] 75.3

Discussion

A common example of computing quantiles is when we compute the limits of a confidence interval. If we want to know the 95% confidence interval (α = 0.05) of a standard normal variable, then we need the quantiles with probabilities of α/2 = 0.025 and 1 − α/2 = 0.975:

qnorm(0.025)
#> [1] -1.96
qnorm(0.975)
#> [1] 1.96

In the true spirit of R, the first argument of the quantile functions can be a vector of probabilities, in which case we get a vector of quantiles. We can simplify this example into a one-liner:

qnorm(c(0.025, 0.975))
#> [1] -1.96  1.96

All the built-in probability distributions provide a quantile function. Table 8-5 shows the quantile functions for some common discrete distributions.

Table 8-5. Discrete Quantile Distributions
Distribution Quantile function

Binomial

qbinom(p, size, prob)

Geometric

qgeom(p, prob)

Poisson

qpois(p, lambda)

Table 8-6 shows the quantile functions for common continuous distributions.

Table 8-6. Continuous Quantile Distributions
Distribution Quantile function

Normal

qnorm(p, mean, sd)

Student’s t

qt(p, df)

Exponential

qexp(p, rate)

Gamma

qgamma(p, shape, rate=rate) or qgamma(p, shape, scale=scale)

Chi-squared (χ2)

qchisq(p, df)
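
As a quick check (a sketch reusing the exponential distribution with mean 40 from earlier recipes), the quantile function inverts the corresponding distribution function:

qexp(0.95, rate = 1 / 40)    # 95th percentile of the exponential with mean 40
#> [1] 120
pexp(120, rate = 1 / 40)     # and pexp maps it back to (roughly) the probability
#> [1] 0.95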

See Also

Determining the quantiles of a data set is different from determining the quantiles of a distribution—see “Calculating Quantiles (and Quartiles) of a Dataset”.

Plotting a Density Function

Problem

You want to plot the density function of a probability distribution.

Solution

Define a vector x of points over the domain you care about plotting. Calculate the density at those points using one of the d_____ density functions, such as dnorm for the normal distribution or dlnorm for the lognormal, and then plot the result:

dens <- data.frame(x = x,
                   y = d_____(x))
ggplot(dens, aes(x, y)) + geom_line()

Here is a specific example that plots the standard normal distribution for the interval -3 to +3:

library(ggplot2)

x <- seq(-3, +3, 0.1)
dens <- data.frame(x = x, y = dnorm(x))

ggplot(dens, aes(x, y)) + geom_line()
Figure 8-1. Standard Normal

Figure 8-1 shows the smooth density function.

Discussion

All the built-in probability distributions include a density function. For a particular density, the function name is “d” prepended to the density name. The density function for the Normal distribution is dnorm, the density for the gamma distribution is dgamma, and so forth.

If the first argument of the density function is a vector, then the function calculates the density at each point and returns the vector of densities.

The following code creates a 2 × 2 plot of four densities:

x <- seq(from = 0, to = 6, length.out = 100) # Define the density domains
ylim <- c(0, 0.6)

# Make a data.frame with densities of several distributions
df <- rbind(
  data.frame(x = x, dist_name = "Uniform"    , y = dunif(x, min   = 2, max = 4)),
  data.frame(x = x, dist_name = "Normal"     , y = dnorm(x, mean  = 3, sd = 1)),
  data.frame(x = x, dist_name = "Exponential", y = dexp(x, rate  = 1 / 2)),
  data.frame(x = x, dist_name = "Gamma"      , y = dgamma(x, shape = 2, rate = 1)) )

# Make a line plot like before, but use facet_wrap to create the grid
ggplot(data = df, aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~dist_name)   # facet and wrap by the variable dist_name
Figure 8-2. Multiple Density Plots

Figure 8-2 shows four density plots. However, a raw density plot is rarely useful or interesting by itself, and we often shade a region of interest.

Figure 8-3. Standard Normal with Shading

Figure 8-3 is a normal distribution with shading from the 75th percentile to the 95th percentile.

We create the plot by first plotting the density and then creating a shaded region with the geom_ribbon function from ggplot2.

First, we create some data and draw the density curve shown in Figure 8-4:

x <- seq(from = -3, to = 3, length.out = 100)
df <- data.frame(x = x, y = dnorm(x, mean = 0, sd = 1))

p <- ggplot(df, aes(x, y)) +
  geom_line() +
  labs(
    title = "Standard Normal Distribution",
    y = "Density",
    x = "Quantile"
  )
p
Figure 8-4. Density Plot

Next, we define the region of interest by using qnorm to calculate the x values of the quantiles we’re interested in. Finally, we add a geom_ribbon that colors the subset of our original data lying between those quantiles. The resulting plot is shown here:

q75 <- qnorm(0.75)   # 75th percentile of the standard normal
q95 <- qnorm(0.95)   # 95th percentile of the standard normal

p +
  geom_ribbon(
    data = subset(df, x > q75 & x < q95),
    aes(ymax = y),
    ymin = 0,
    fill = "blue",
    colour = NA,
    alpha = 0.5
  )

Chapter 9. General Statistics

Introduction

Any significant application of R includes statistics or models or graphics. This chapter addresses the statistics. Some recipes simply describe how to calculate a statistic, such as relative frequency. Most recipes involve statistical tests or confidence intervals. The statistical tests let you choose between two competing hypotheses; that paradigm is described next. Confidence intervals reflect the likely range of a population parameter and are calculated based on your data sample.

Null Hypotheses, Alternative Hypotheses, and p-Values

Many of the statistical tests in this chapter use a time-tested paradigm of statistical inference. In the paradigm, we have one or two data samples. We also have two competing hypotheses, either of which could reasonably be true.

One hypothesis, called the null hypothesis, is that nothing happened: the mean was unchanged; the treatment had no effect; you got the expected answer; the model did not improve; and so forth.

The other hypothesis, called the alternative hypothesis, is that something happened: the mean rose; the treatment improved the patients’ health; you got an unexpected answer; the model fit better; and so forth.

We want to determine which hypothesis is more likely in light of the data:

  1. To begin, we assume that the null hypothesis is true.

  2. We calculate a test statistic. It could be something simple, such as the mean of the sample, or it could be quite complex. The critical requirement is that we must know the statistic’s distribution. We might know the distribution of the sample mean, for example, by invoking the Central Limit Theorem.

  3. From the statistic and its distribution we can calculate a p-value, the probability of a test statistic value as extreme or more extreme than the one we observed, while assuming that the null hypothesis is true.

  4. If the p-value is too small, we have strong evidence against the null hypothesis. This is called rejecting the null hypothesis.

  5. If the p-value is not small then we have no such evidence. This is called failing to reject the null hypothesis.

There is one necessary decision here: When is a p-value “too small”?

Note

In this book, we follow the common convention that we reject the null hypothesis when p < 0.05 and fail to reject it when p > 0.05. In statistical terminology, we chose a significance level of α = 0.05 to define the border between strong evidence and insufficient evidence against the null hypothesis.

But the real answer is, “it depends”. Your chosen significance level depends on your problem domain. The conventional limit of p < 0.05 works for many problems. In our work, the data are especially noisy and so we are often satisfied with p < 0.10. For someone working in high-risk areas, p < 0.01 or p < 0.001 might be necessary.

In the recipes, we mention which tests include a p-value so that you can compare the p-value against your chosen significance level of α. We worded the recipes to help you interpret the comparison. Here is the wording from “Testing Categorical Variables for Independence”, a test for the independence of two factors:

Example 9-1.

Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent whereas a p-value exceeding 0.05 fails to provide any such evidence.

This is a compact way of saying:

  • The null hypothesis is that the variables are independent.

  • The alternative hypothesis is that the variables are not independent.

  • For α = 0.05, if p < 0.05 then we reject the null hypothesis, giving strong evidence that the variables are not independent; if p > 0.05, we fail to reject the null hypothesis.

  • You are free to choose your own α, of course, in which case your decision to reject or fail to reject might be different.

Remember, the recipe states the informal interpretation of the test results, not the rigorous mathematical interpretation. We use colloquial language in the hope that it will guide you toward a practical understanding and application of the test. If the precise semantics of hypothesis testing is critical for your work, we urge you to consult the reference cited under See Also or one of the other fine textbooks on mathematical statistics.

Confidence Intervals

Hypothesis testing is a well-understood mathematical procedure, but it can be frustrating. First, the semantics is tricky. The test does not reach a definite, useful conclusion. You might get strong evidence against the null hypothesis, but that’s all you’ll get. Second, it does not give you a number, only evidence.

If you want numbers then use confidence intervals, which bound the estimate of a population parameter at a given level of confidence. Recipes in this chapter can calculate confidence intervals for means, medians, and proportions of a population.

For example, “Forming a Confidence Interval for a Mean” calculates a 95% confidence interval for the population mean based on sample data. The interval is 97.16 < μ < 103.98, which means there is a 95% probability that the population’s mean, μ, is between 97.16 and 103.98.

See Also

Statistical terminology and conventions can vary. This book generally follows the conventions of Mathematical Statistics with Applications, 6th ed., by Wackerly et al. (Duxbury Press). We recommend this book also for learning more about the statistical tests described in this chapter.

Summarizing Your Data

Problem

You want a basic statistical summary of your data.

Solution

The summary function gives some useful statistics for vectors, matrices, factors, and data frames:

summary(vec)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     0.0     0.5     1.0     1.6     1.9    33.0

Discussion

The Solution exhibits the summary of a vector. The 1st Qu. and 3rd Qu. are the first and third quartile, respectively. Having both the median and mean is useful because you can quickly detect skew. The Solution above, for example, shows a mean that is larger than the median; this indicates a possible skew to the right, as one would expect from a lognormal distribution.

The summary of a matrix works column by column. Here we see the summary of a matrix, mat, with three columns named Samp1, Samp2, and Samp3:

summary(mat)
#>      Samp1           Samp2            Samp3
#>  Min.   :  1.0   Min.   :-2.943   Min.   : 0.04
#>  1st Qu.: 25.8   1st Qu.:-0.774   1st Qu.: 0.39
#>  Median : 50.5   Median :-0.052   Median : 0.85
#>  Mean   : 50.5   Mean   :-0.067   Mean   : 1.60
#>  3rd Qu.: 75.2   3rd Qu.: 0.684   3rd Qu.: 2.12
#>  Max.   :100.0   Max.   : 2.150   Max.   :13.18

The summary of a factor gives counts:

summary(fac)
#> Maybe    No   Yes
#>    38    32    30

The summary of a character vector is pretty useless, just the vector length:

summary(char)
#>    Length     Class      Mode
#>       100 character character

The summary of a data frame incorporates all these features. It works column by column, giving an appropriate summary according to the column type. Numeric values receive a statistical summary and factors are counted (character strings are not summarized):

suburbs <- read_csv("./data/suburbs.txt")
summary(suburbs)
#>      city              county             state
#>  Length:17          Length:17          Length:17
#>  Class :character   Class :character   Class :character
#>  Mode  :character   Mode  :character   Mode  :character
#>
#>
#>
#>       pop
#>  Min.   :   5428
#>  1st Qu.:  72616
#>  Median :  83048
#>  Mean   : 249770
#>  3rd Qu.: 102746
#>  Max.   :2853114

The “summary” of a list is pretty funky: just the data type of each list member. Here is a summary of a list of vectors:

summary(vec_list)
#>   Length Class  Mode
#> x 100    -none- numeric
#> y 100    -none- numeric
#> z 100    -none- character

To summarize the data inside a list of vectors, map summary to each list element:

library(purrr)
map(vec_list, summary)
#> $x
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  -2.572  -0.686  -0.084  -0.043   0.660   2.413
#>
#> $y
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  -1.752  -0.589   0.045   0.079   0.769   2.293
#>
#> $z
#>    Length     Class      Mode
#>       100 character character

Unfortunately, the summary function does not compute any measure of variability, such as standard deviation or median absolute deviation. This is a serious shortcoming, so we usually call sd or mad right after calling summary.

See Also

See “Computing Basic Statistics”.

Calculating Relative Frequencies

Problem

You want to count the relative frequency of certain observations in your sample.

Solution

Identify the interesting observations by using a logical expression; then use the mean function to calculate the fraction of observations it identifies. For example, given a vector x, you can find the relative frequency of values greater than 3 in this way:

mean(x > 3)
#> [1] 0.12

Discussion

A logical expression, such as x > 3, produces a vector of logical values (TRUE and FALSE), one for each element of x. The mean function converts those values to 1s and 0s, respectively, and computes the average. This gives the fraction of values that are TRUE—in other words, the relative frequency of the interesting values. In the Solution, for example, that’s the relative frequency of values greater than 3.

The concept here is pretty simple. The tricky part is dreaming up a suitable logical expression. Here are some examples:

mean(lab == "NJ")

Fraction of lab values that are New Jersey

mean(after > before)

Fraction of observations for which the effect increases

mean(abs(x-mean(x)) > 2*sd(x))

Fraction of observations that exceed two standard deviations from the mean

mean(diff(ts) > 0)

Fraction of observations in a time series that are larger than the previous observation

Tabulating Factors and Creating Contingency Tables

Problem

You want to tabulate one factor or to build a contingency table from multiple factors.

Solution

The table function produces counts of one factor:

table(f1)
#> f1
#>  a  b  c  d  e
#> 14 23 24 21 18

It can also produce contingency tables (cross-tabulations) from two or more factors:

table(f1, f2)
#>    f2
#> f1   f  g  h
#>   a  6  4  4
#>   b  7  9  7
#>   c  4 11  9
#>   d  7  8  6
#>   e  5 10  3

table works for characters, too, not only factors:

t1 <- sample(letters[9:11], 100, replace = TRUE)
table(t1)
#> t1
#>  i  j  k
#> 20 40 40

Discussion

The table function counts the levels of one factor or characters, such as these counts of initial and outcome (which are factors):

set.seed(42)
initial <- factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))
outcome <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))

table(initial)
#> initial
#> Maybe    No   Yes
#>    39    31    30

table(outcome)
#> outcome
#> Fail Pass
#>   56   44

The greater power of table is in producing contingency tables, also known as cross-tabulations. Each cell in a contingency table counts how many times that row–column combination occurred:

table(initial, outcome)
#>        outcome
#> initial Fail Pass
#>   Maybe   23   16
#>   No      20   11
#>   Yes     13   17

This table shows that the combination of initial = Yes and outcome = Fail occurred 13 times, the combination of initial = Yes and outcome = Pass occurred 17 times, and so forth.

See Also

The xtabs function can also produce a contingency table. It has a formula interface, which some people prefer.
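
A minimal sketch using the initial and outcome factors from this recipe (output omitted; the counts match the table above):

xtabs(~ initial + outcome)   # formula interface; variables found in the calling environment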

Testing Categorical Variables for Independence

Problem

You have two categorical variables that are represented by factors. You want to test them for independence using the chi-squared test.

Solution

Use the table function to produce a contingency table from the two factors. Then use the summary function to perform a chi-squared test of the contingency table. In the example below we have two vectors of factor values which we created in the prior recipe:

summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#>  Chisq = 3, df = 2, p-value = 0.2

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent whereas a p-value exceeding 0.05 fails to provide any such evidence.

Discussion

This example performs a chi-squared test on the contingency table of “Tabulating Factors and Creating Contingency Tables” and yields a p-value of 0.2225:

summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#>  Chisq = 3, df = 2, p-value = 0.2

The large p-value indicates that the two factors, initial and outcome, are probably independent. Practically speaking, we conclude there is no connection between the variables. This makes sense, since the example data was created by simply drawing random values with the sample function in the prior recipe.

See Also

The chisq.test function can also perform this test.
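
For example, a brief sketch using the same factors (output omitted; the statistic and p-value should match the summary above):

chisq.test(initial, outcome)   # Pearson's chi-squared test of independence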

Calculating Quantiles (and Quartiles) of a Dataset

Problem

Given a fraction f, you want to know the corresponding quantile of your data. That is, you seek the observation x such that the fraction of observations below x is f.

Solution

Use the quantile function. The second argument is the fraction, f:

quantile(vec, 0.95)
#>  95%
#> 1.43

For quartiles, simply omit the second argument altogether:

quantile(vec)
#>      0%     25%     50%     75%    100%
#> -2.0247 -0.5915 -0.0693  0.4618  2.7019

Discussion

Suppose vec contains 1,000 observations between 0 and 1. The quantile function can tell you which observation delimits the lower 5% of the data:

vec <- runif(1000)
quantile(vec, .05)
#>     5%
#> 0.0451

The quantile documentation refers to the second argument as a “probability”, which is natural when we think of probability as meaning relative frequency.

In true R style, the second argument can be a vector of probabilities; in this case, quantile returns a vector of corresponding quantiles, one for each probability:

quantile(vec, c(.05, .95))
#>     5%    95%
#> 0.0451 0.9363

That is a handy way to identify the middle 90% (in this case) of the observations.

If you omit the probabilities altogether then R assumes you want the probabilities 0, 0.25, 0.50, 0.75, and 1.0—in other words, the quartiles:

quantile(vec)
#>       0%      25%      50%      75%     100%
#> 0.000405 0.235529 0.479543 0.737619 0.999379

Amazingly, the quantile function implements nine (yes, nine) different algorithms for computing quantiles. Study the help page before assuming that the default algorithm is the best one for you.

Inverting a Quantile

Problem

Given an observation x from your data, you want to know its corresponding quantile. That is, you want to know what fraction of the data is less than x.

Solution

Assuming your data is in a vector vec, compare the data against the observation x and then use mean to compute the relative frequency of values less than x. In this example, x is 1.6:

mean(vec < 1.6)
#> [1] 0.948

Discussion

The expression vec < x compares every element of vec against x and returns a vector of logical values, where the nth logical value is TRUE if vec[n] < x. The mean function converts those logical values to 0 and 1: 0 for FALSE and 1 for TRUE. The average of all those 1s and 0s is the fraction of vec that is less than x, or the inverse quantile of x.

See Also

This is an application of the general approach described in “Calculating Relative Frequencies”.

Converting Data to Z-Scores

Problem

You have a dataset, and you want to calculate the corresponding z-scores for all data elements. (This is sometimes called normalizing the data.)

Solution

Use the scale function:

scale(x)
#>          [,1]
#>  [1,]  0.8701
#>  [2,] -0.7133
#>  [3,] -1.0503
#>  [4,]  0.5790
#>  [5,] -0.6324
#>  [6,]  0.0991
#>  [7,]  2.1495
#>  [8,]  0.2481
#>  [9,] -0.8155
#> [10,] -0.7341
#> attr(,"scaled:center")
#> [1] 2.42
#> attr(,"scaled:scale")
#> [1] 2.11

This works for vectors, matrices, and data frames. In the case of a vector, scale returns the vector of normalized values. In the case of matrices and data frames, scale normalizes each column independently and returns columns of normalized values in a matrix.

Discussion

You might also want to normalize a single value y relative to a dataset x. That can be done by using vectorized operations as follows:

(y - mean(x)) / sd(x)
#> [1] -0.633

Testing the Mean of a Sample (t Test)

Problem

You have a sample from a population. Given this sample, you want to know if the mean of the population could reasonably be a particular value m.

Solution

Apply the t.test function to the sample x with the argument mu=m:

t.test(x, mu = m)

The output includes a p-value. Conventionally, if p < 0.05 then the population mean is unlikely to be m whereas p > 0.05 provides no such evidence.

If your sample size n is small, then the underlying population must be normally distributed in order to derive meaningful results from the t test. A good rule of thumb is that “small” means n < 30.

Discussion

The t test is a workhorse of statistics, and this is one of its basic uses: making inferences about a population mean from a sample. The following example simulates sampling from a normal population with mean μ = 100. It uses the t test to ask if the population mean could be 95, and t.test reports a p-value of 0.005055:

x <- rnorm(75, mean = 100, sd = 15)
t.test(x, mu = 95)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 3, df = 70, p-value = 0.005
#> alternative hypothesis: true mean is not equal to 95
#> 95 percent confidence interval:
#>   96.5 103.0
#> sample estimates:
#> mean of x
#>      99.7

The p-value is small and so it’s unlikely (based on the sample data) that 95 could be the mean of the population.

Informally, we could interpret the low p-value as follows. If the population mean were really 95, then the probability of observing our test statistic (t = 2.8898 or something more extreme) would be only 0.005055. That is very improbable, yet that is the value we observed. Hence we conclude that the null hypothesis is wrong; therefore, the sample data does not support the claim that the population mean is 95.

In sharp contrast, testing for a mean of 100 gives a p-value of 0.8606:

t.test(x, mu = 100)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = -0.2, df = 70, p-value = 0.9
#> alternative hypothesis: true mean is not equal to 100
#> 95 percent confidence interval:
#>   96.5 103.0
#> sample estimates:
#> mean of x
#>      99.7

The large p-value indicates that the sample is consistent with assuming a population mean μ of 100. In statistical terms, the data does not provide evidence against the true mean being 100.

A common case is testing for a mean of zero. If you omit the mu argument, it defaults to zero.

See Also

The t.test function is a many-splendored thing. See “Forming a Confidence Interval for a Mean” and “Comparing the Means of Two Samples” for other uses.

Forming a Confidence Interval for a Mean

Problem

You have a sample from a population. Given that sample, you want to determine a confidence interval for the population’s mean.

Solution

Apply the t.test function to your sample x:

t.test(x)

The output includes a confidence interval at the 95% confidence level. To see intervals at other levels, use the conf.level argument.

As in “Testing the Mean of a Sample (t Test)”, if your sample size n is small then the underlying population must be normally distributed for there to be a meaningful confidence interval. Again, a good rule of thumb is that “small” means n < 30.

Discussion

Applying the t.test function to a vector yields a lot of output. Buried in the output is a confidence interval:

t.test(x)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 50, df = 50, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>   94.2 101.5
#> sample estimates:
#> mean of x
#>      97.9

In this example, the confidence interval is approximately 94.16 < μ < 101.55, which is sometimes written simply as (94.16, 101.55).

We can raise the confidence level to 99% by setting conf.level=0.99:

t.test(x, conf.level = 0.99)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 50, df = 50, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 99 percent confidence interval:
#>   92.9 102.8
#> sample estimates:
#> mean of x
#>      97.9

That change widens the confidence interval to 92.93 < μ < 102.78.

Forming a Confidence Interval for a Median

Problem

You have a data sample, and you want to know the confidence interval for the median.

Solution

Use the wilcox.test function, setting conf.int=TRUE:

wilcox.test(x, conf.int = TRUE)

The output will contain a confidence interval for the median.

Discussion

The procedure for calculating the confidence interval of a mean is well-defined and widely known. The same is not true for the median, unfortunately. There are several procedures for calculating the median’s confidence interval. None of them is “the” procedure, but the Wilcoxon signed rank test is pretty standard.

The wilcox.test function implements that procedure. Buried in the output is the 95% confidence interval, which is approximately (-0.102, 0.646) in this case:

wilcox.test(x, conf.int = TRUE)
#>
#>  Wilcoxon signed rank test
#>
#> data:  x
#> V = 200, p-value = 0.1
#> alternative hypothesis: true location is not equal to 0
#> 95 percent confidence interval:
#>  -0.102  0.646
#> sample estimates:
#> (pseudo)median
#>          0.311

You can change the confidence level by setting the conf.level argument, such as conf.level=0.99.
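
For example, this minimal sketch requests a 99% confidence interval for the same sample (output omitted):

wilcox.test(x, conf.int = TRUE, conf.level = 0.99)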

The output also includes something called the pseudomedian, which is defined on the help page. Don’t assume it equals the median; they are different:

median(x)
#> [1] 0.314

See Also

The bootstrap procedure is also useful for estimating the median’s confidence interval; see Recipes and Recipe X-X.

Testing a Sample Proportion

Problem

You have a sample of values from a population consisting of successes and failures. You believe the true proportion of successes is p, and you want to test that hypothesis using the sample data.

Solution

Use the prop.test function. Suppose the sample size is n and the sample contains x successes:

prop.test(x, n, p)

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the true proportion is unlikely to be p whereas a p-value exceeding 0.05 fails to provide such evidence.

Discussion

Suppose you encounter some loudmouthed fan of the Chicago Cubs early in the baseball season. The Cubs have played 20 games and won 11 of them, or 55% of their games. Based on that evidence, the fan is “very confident” that the Cubs will win more than half of their games this year. Should he be that confident?

The prop.test function can evaluate the fan’s logic. Here, the number of observations is n = 20, the number of successes is x = 11, and p is the true probability of winning a game. We want to know whether it is reasonable to conclude, based on the data, that p > 0.5. Normally, prop.test would check for p ≠ 0.5, but we can check for p > 0.5 instead by setting alternative="greater":

prop.test(11, 20, 0.5, alternative = "greater")
#>
#>  1-sample proportions test with continuity correction
#>
#> data:  11 out of 20, null probability 0.5
#> X-squared = 0.05, df = 1, p-value = 0.4
#> alternative hypothesis: true p is greater than 0.5
#> 95 percent confidence interval:
#>  0.35 1.00
#> sample estimates:
#>    p
#> 0.55

The prop.test output shows a large p-value, 0.4115, so we cannot reject the null hypothesis; that is, we cannot reasonably conclude that p is greater than 1/2. The Cubs fan is being overly confident based on too little data. No surprise there.

Forming a Confidence Interval for a Proportion

Problem

You have a sample of values from a population consisting of successes and failures. Based on the sample data, you want to form a confidence interval for the population’s proportion of successes.

Solution

Use the prop.test function. Suppose the sample size is n and the sample contains x successes:

prop.test(x, n)

The function output includes the confidence interval for p.

Discussion

We subscribe to a stock market newsletter that is well written, but includes a section purporting to identify stocks that are likely to rise. It does this by looking for a certain pattern in the stock price. It recently reported, for example, that a certain stock was following the pattern. It also reported that the stock rose after six of the last nine occurrences of that pattern. The writers concluded that the probability of the stock rising again was therefore 6/9, or 66.7%.

Using prop.test, we can obtain the confidence interval for the true proportion of times the stock rises after the pattern. Here, the number of observations is n = 9 and the number of successes is x = 6. The output shows a confidence interval of (0.309, 0.910) at the 95% confidence level:

prop.test(6, 9)
#> Warning in prop.test(6, 9): Chi-squared approximation may be incorrect
#>
#>  1-sample proportions test with continuity correction
#>
#> data:  6 out of 9, null probability 0.5
#> X-squared = 0.4, df = 1, p-value = 0.5
#> alternative hypothesis: true p is not equal to 0.5
#> 95 percent confidence interval:
#>  0.309 0.910
#> sample estimates:
#>     p
#> 0.667

The writers are pretty foolish to say the probability of rising is 66.7%. They could be leading their readers into a very bad bet.

By default, prop.test calculates a confidence interval at the 95% confidence level. Use the conf.level argument for other confidence levels:

prop.test(x, n, p, conf.level = 0.99)   # 99% confidence level

Testing for Normality

Problem

You want a statistical test to determine whether your data sample is from a normally distributed population.

Solution

Use the shapiro.test function:

shapiro.test(x)

The output includes a p-value. Conventionally, p < 0.05 indicates that the population is likely not normally distributed whereas p > 0.05 provides no such evidence.

Discussion

This example reports a p-value of 0.7765 for x:

shapiro.test(x)
#>
#>  Shapiro-Wilk normality test
#>
#> data:  x
#> W = 1, p-value = 0.05

The large p-value suggests the underlying population could be normally distributed. The next example reports a very small p-value for y, so it is unlikely that this sample came from a normal population:

shapiro.test(y)
#>
#>  Shapiro-Wilk normality test
#>
#> data:  y
#> W = 0.7, p-value = 9e-12

We have highlighted the Shapiro–Wilk test because it is a standard R function. You can also install the package nortest, which is dedicated entirely to tests for normality. This package includes:

  • Anderson–Darling test (ad.test)

  • Cramer–von Mises test (cvm.test)

  • Lilliefors test (lillie.test)

  • Pearson chi-squared test for the composite hypothesis of normality (pearson.test)

  • Shapiro–Francia test (sf.test)
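
As a hedged sketch (assuming the nortest package has been installed), each of these functions takes a numeric vector, just as shapiro.test does:

library(nortest)
x <- rnorm(100)   # hypothetical sample; substitute your own data
ad.test(x)        # Anderson–Darling test; reports a p-value interpreted as above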

The problem with all these tests is their null hypothesis: they all assume that the population is normally distributed until proven otherwise. As a result, the population must be decidedly nonnormal before the test reports a small p-value and you can reject that null hypothesis. That makes the tests quite conservative, tending to err on the side of normality.

Instead of depending solely upon a statistical test, we suggest also using histograms (“Creating a Histogram”) and quantile-quantile plots (“Creating a Normal Quantile-Quantile (Q-Q) Plot”) to evaluate the normality of any data. Are the tails too fat? Is the peak too peaked? Your judgment is likely better than a single statistical test.
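
As a quick sketch using plain base graphics (not the ggplot-based recipes referenced above), a histogram and a Q-Q plot take one line each:

x <- rnorm(100)   # hypothetical sample; use your own data
hist(x)           # look for rough symmetry and a single peak
qqnorm(x)         # points near a straight line suggest normality
qqline(x)         # adds the reference line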

See Also

See “Installing Packages from CRAN” for how to install the nortest package.

Testing for Runs

Problem

Your data is a sequence of binary values: yes–no, 0–1, true–false, or other two-valued data. You want to know: Is the sequence random?

Solution

The tseries package contains the runs.test function, which checks a sequence for randomness. The sequence should be a factor with two levels:

library(tseries)
runs.test(as.factor(s))

The runs.test function reports a p-value. Conventionally, a p-value of less than 0.05 indicates that the sequence is likely not random whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

A run is a subsequence composed of identical values, such as all 1s or all 0s. A random sequence should be properly jumbled up, without too many runs, but it shouldn’t contain too few runs, either. A sequence of perfectly alternating values (0, 1, 0, 1, 0, 1, …) splits into the maximum possible number of runs, each only one element long, but would you say that it’s random?

The runs.test function checks the number of runs in your sequence. If there are too many or too few, it reports a small p-value.

This first example generates a random sequence of 0s and 1s and then tests the sequence for runs. Not surprisingly, runs.test reports a large p-value, indicating the sequence is likely random:

s <- sample(c(0, 1), 100, replace = TRUE)
runs.test(as.factor(s))
#>
#>  Runs Test
#>
#> data:  as.factor(s)
#> Standard Normal = 0.1, p-value = 0.9
#> alternative hypothesis: two.sided

This next sequence, however, consists of three runs and so the reported p-value is quite low:

s <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0)
runs.test(as.factor(s))
#>
#>  Runs Test
#>
#> data:  as.factor(s)
#> Standard Normal = -2, p-value = 0.02
#> alternative hypothesis: two.sided

See Also

See Recipes and .

Comparing the Means of Two Samples

Problem

You have one sample each from two populations. You want to know if the two populations could have the same mean.

Solution

Perform a t test by calling the t.test function:

t.test(x, y)

By default, t.test assumes that your data are not paired. If the observations are paired (i.e., if each xi is paired with one yi), then specify paired=TRUE:

t.test(x, y, paired = TRUE)

In either case, t.test will compute a p-value. Conventionally, if p < 0.05 then the means are likely different whereas p > 0.05 provides no such evidence:

  • If either sample size is small, then the populations must be normally distributed. Here, “small” means fewer than 20 data points.

  • If the two populations have the same variance, specify var.equal=TRUE to obtain a less conservative test.

Discussion

We often use the t test to get a quick sense of the difference between two population means. It requires that the samples be large enough (both samples have 20 or more observations) or that the underlying populations be normally distributed. We don’t take the “normally distributed” part too literally. Being bell-shaped and reasonably symmetrical should be good enough.

A key distinction here is whether or not your data contains paired observations, since the results may differ in the two cases. Suppose we want to know if coffee in the morning improves scores on SAT tests. We could run the experiment two ways:

  1. Randomly select one group of people. Give them the SAT test twice, once with morning coffee and once without morning coffee. For each person, we will have two SAT scores. These are paired observations.

  2. Randomly select two groups of people. One group has a cup of morning coffee and takes the SAT test. The other group just takes the test. We have a score for each person, but the scores are not paired in any way.

Statistically, these experiments are quite different. In experiment 1, there are two observations for each person (one with coffee and one without), and they are not statistically independent. In experiment 2, the data are independent.

If you have paired observations (experiment 1) and erroneously analyze them as unpaired observations (experiment 2), then you could get a result like this one, with a large p-value:

load("./data/sat.rdata")
t.test(x, y)
#>
#>  Welch Two Sample t-test
#>
#> data:  x and y
#> t = -1, df = 200, p-value = 0.3
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -46.4  16.2
#> sample estimates:
#> mean of x mean of y
#>      1054      1069

The large p-value forces you to conclude there is no difference between the groups. Contrast that result with the one that follows from analyzing the same data but correctly identifying it as paired:

t.test(x, y, paired = TRUE)
#>
#>  Paired t-test
#>
#> data:  x and y
#> t = -20, df = 100, p-value <2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -16.8 -13.5
#> sample estimates:
#> mean of the differences
#>                   -15.1

The p-value plummets to below 2e-16, and we reach exactly the opposite conclusion.

See Also

If the populations are not normally distributed (bell-shaped) and either sample is small, consider using the Wilcoxon–Mann–Whitney test described in “Comparing the Locations of Two Samples Nonparametrically”.

Comparing the Locations of Two Samples Nonparametrically

Problem

You have samples from two populations. You don’t know the distribution of the populations, but you know they have similar shapes. You want to know: Is one population shifted to the left or right compared with the other?

Solution

You can use a nonparametric test, the Wilcoxon–Mann–Whitney test, which is implemented by the wilcox.test function. For paired observations (every xi is paired with yi), set paired=TRUE:

wilcox.test(x, y, paired = TRUE)

For unpaired observations, let paired default to FALSE:

wilcox.test(x, y)

The test output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the second population is likely shifted left or right with respect to the first population whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

When we stop making assumptions regarding the distributions of populations, we enter the world of nonparametric statistics. The Wilcoxon–Mann–Whitney test is nonparametric and so can be applied to more datasets than the t test, which requires that the data be normally distributed (for small samples). This test’s only assumption is that the two populations have the same shape.

In this recipe, we are asking: Is the second population shifted left or right with respect to the first? This is similar to asking whether the average of the second population is smaller or larger than the first. However, the Wilcoxon–Mann–Whitney test answers a different question: it tells us whether the central locations of the two populations are significantly different or, equivalently, whether their relative frequencies are different.

Suppose we randomly select a group of employees and ask each one to complete the same task under two different circumstances: under favorable conditions and under unfavorable conditions, such as a noisy environment. We measure their completion times under both conditions, so we have two measurements for each employee. We want to know if the two times are significantly different, but we can’t assume they are normally distributed.

The data are paired, so we must set paired=TRUE:

load(file = "./data/workers.rdata")
wilcox.test(fav, unfav, paired = TRUE)
#>
#>  Wilcoxon signed rank test
#>
#> data:  fav and unfav
#> V = 10, p-value = 1e-04
#> alternative hypothesis: true location shift is not equal to 0

The p-value is essentially zero. Statistically speaking, we reject the assumption that the completion times were equal. Practically speaking, it’s reasonable to conclude that the times were different.

In this example, setting paired=TRUE is critical. Treating the data as unpaired would be wrong because the observations are not independent; and this, in turn, would produce bogus results. Running the example with paired=FALSE produces a p-value of 0.1022, which leads to the wrong conclusion.

See Also

See “Comparing the Means of Two Samples” for the parametric test.

Testing a Correlation for Significance

Problem

You calculated the correlation between two variables, but you don’t know if the correlation is statistically significant.

Solution

The cor.test function can calculate both the p-value and the confidence interval of the correlation. If the variables came from normally distributed populations then use the default measure of correlation, which is the Pearson method:

cor.test(x, y)

For nonnormal populations, use the Spearman method instead:

cor.test(x, y, method = "spearman")

The function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.

Discussion

In our experience, people often fail to check a correlation for significance. In fact, many people are unaware that a correlation can be insignificant. They jam their data into a computer, calculate the correlation, and blindly believe the result. However, they should ask themselves: Was there enough data? Is the magnitude of the correlation large enough? Fortunately, the cor.test function answers those questions.

Suppose we have two vectors, x and y, with values from normal populations. We might be very pleased that their correlation is greater than 0.75:

cor(x, y)
#> [1] 0.751

But that is naïve. If we run cor.test, it reports a relatively large p-value of 0.085:

cor.test(x, y)
#>
#>  Pearson's product-moment correlation
#>
#> data:  x and y
#> t = 2, df = 4, p-value = 0.09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.155  0.971
#> sample estimates:
#>   cor
#> 0.751

The p-value is above the conventional threshold of 0.05, so we conclude that the correlation is unlikely to be significant.

You can also check the correlation by using the confidence interval. In this example, the confidence interval is (−0.155, 0.971). The interval contains zero, so it is possible that the correlation is zero, in which case there would be no correlation. Again, you could not be confident that the reported correlation is significant.

The cor.test output also includes the point estimate reported by cor (at the bottom, labeled “sample estimates”), saving you the additional step of running cor.

By default, cor.test calculates the Pearson correlation, which assumes that the underlying populations are normally distributed. The Spearman method makes no such assumption because it is nonparametric. Use method="spearman" when working with nonnormal data.

See Also

See “Computing Basic Statistics” for calculating simple correlations.

Testing Groups for Equal Proportions

Problem

You have samples from two or more groups. The groups’ elements are binary-valued: either success or failure. You want to know if the groups have equal proportions of successes.

Solution

Use the prop.test function with two vector arguments:

ns <- c(48, 64)
nt <- c(100, 100)
prop.test(ns, nt)
#>
#>  2-sample test for equality of proportions with continuity
#>  correction
#>
#> data:  ns out of nt
#> X-squared = 5, df = 1, p-value = 0.03
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.3058 -0.0142
#> sample estimates:
#> prop 1 prop 2
#>   0.48   0.64

These are parallel vectors. The first vector, ns, gives the number of successes in each group. The second vector, nt, gives the size of the corresponding group (often called the number of trials).

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that it is likely the groups’ proportions are different whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

In “Testing a Sample Proportion” we tested a proportion based on one sample. Here, we have samples from several groups and want to compare the proportions in the underlying groups.

One of the authors recently taught statistics to 38 students and awarded a grade of A to 14 of them. A colleague taught the same class to 40 students and awarded an A to only 10. We wanted to know: Is the author fostering grade inflation by awarding significantly more A grades than the other teacher did?

We used prop.test. “Success” means awarding an A, so the vector of successes contains two elements: the number awarded by the author and the number awarded by the colleague:

successes <- c(14, 10)

The number of trials is the number of students in the corresponding class:

trials <- c(38, 40)

The prop.test output yields a p-value of 0.3749:

prop.test(successes, trials)
#>
#>  2-sample test for equality of proportions with continuity
#>  correction
#>
#> data:  successes out of trials
#> X-squared = 0.8, df = 1, p-value = 0.4
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.111  0.348
#> sample estimates:
#> prop 1 prop 2
#>  0.368  0.250

The relatively large p-value means that we cannot reject the null hypothesis: the evidence does not suggest any difference between the teachers’ grading.

Performing Pairwise Comparisons Between Group Means

Problem

You have several samples, and you want to perform a pairwise comparison between the sample means. That is, you want to compare the mean of every sample against the mean of every other sample.

Solution

Place all data into one vector and create a parallel factor to identify the groups. Use pairwise.t.test to perform the pairwise comparison of means:

pairwise.t.test(x, f)   # x is the data, f is the grouping factor

The output contains a table of p-values, one for each pair of groups. Conventionally, if p < 0.05 then the two groups likely have different means whereas p > 0.05 provides no such evidence.

Discussion

This is more complicated than “Comparing the Means of Two Samples”, where we compared the means of two samples. Here we have several samples and want to compare the mean of every sample against the mean of every other sample.

Statistically speaking, pairwise comparisons are tricky. It is not the same as simply performing a t test on every possible pair. The p-values must be adjusted, for otherwise you will get an overly optimistic result. The help pages for pairwise.t.test and p.adjust describe the adjustment algorithms available in R. Anyone doing serious pairwise comparisons is urged to review the help pages and consult a good textbook on the subject.

Suppose we are using a larger sample of the data from “Combining Multiple Vectors into One Vector and a Factor”, where we combined data for freshmen, sophomores, and juniors into a data frame called comb. The data frame has two columns: the data in a column called values, and the grouping factor in a column called ind. We can use pairwise.t.test to perform pairwise comparisons between the groups:

pairwise.t.test(comb$values, comb$ind)
#>
#>  Pairwise comparisons using t tests with pooled SD
#>
#> data:  comb$values and comb$ind
#>
#>      fresh soph
#> soph 0.001 -
#> jrs  3e-04 0.592
#>
#> P value adjustment method: holm

Notice the table of p-values. The comparisons of sophomores versus freshmen and of juniors versus freshmen produced small p-values: 0.0011 and 0.0003, respectively. We can conclude there are significant differences between those groups. However, the comparison of sophomores versus juniors produced a (relatively) large p-value of 0.5922, so they are not significantly different.
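
If you prefer a different adjustment than the default (Holm), the p.adjust.method argument selects it. Here is a minimal sketch using the same comb data frame; the choice of "bonferroni" is just for illustration:

pairwise.t.test(comb$values, comb$ind, p.adjust.method = "bonferroni")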

See Also

See Recipes and .

Testing Two Samples for the Same Distribution

Problem

You have two samples, and you are wondering: Did they come from the same distribution?

Solution

The Kolmogorov–Smirnov test compares two samples and tests them for being drawn from the same distribution. The ks.test function implements that test:

ks.test(x, y)

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the two samples (x and y) were drawn from different distributions whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

The Kolmogorov–Smirnov test is wonderful for two reasons. First, it is a nonparametric test and so you needn’t make any assumptions regarding the underlying distributions: it works for all distributions. Second, it checks the location, dispersion, and shape of the populations, based on the samples. If these characteristics disagree then the test will detect that, allowing us to conclude that the underlying distributions are different.

Suppose we suspect that the vectors x and y come from differing distributions. Here, ks.test reports a p-value of 0.03663:

ks.test(x, y)
#>
#>  Two-sample Kolmogorov-Smirnov test
#>
#> data:  x and y
#> D = 0.2, p-value = 0.04
#> alternative hypothesis: two-sided

From the small p-value we can conclude that the samples are from different distributions. However, when we test x against another sample, z, the p-value is much larger (0.5806); this suggests that x and z could have the same underlying distribution:

z <- rnorm(100, mean = 4, sd = 6)
ks.test(x, z)
#>
#>  Two-sample Kolmogorov-Smirnov test
#>
#> data:  x and z
#> D = 0.1, p-value = 0.6
#> alternative hypothesis: two-sided

Chapter 10. Graphics

Introduction

Graphics is a great strength of R. The graphics package is part of the standard distribution and contains many useful functions for creating a variety of graphic displays. The base functionality has been expanded and made easier with ggplot2, part of the tidyverse of packages. In this chapter we will focus on examples using ggplot2, and we will occasionally suggest other packages. In this chapter’s See Also sections we mention functions in other packages that do the same job in a different way. We suggest that you explore those alternatives if you are dissatisfied with what’s offered by ggplot2 or base graphics.

Graphics is a vast subject, and we can only scratch the surface here. Winston Chang’s R Graphics Cookbook, 2nd Edition is part of the O’Reilly Cookbook series and walks through many useful recipes with a focus on ggplot2. If you want to delve deeper, we recommend R Graphics by Paul Murrell (Chapman & Hall, 2006). That book discusses the paradigms behind R graphics, explains how to use the graphics functions, and contains numerous examples—including the code to recreate them. Some of the examples are pretty amazing.

The Illustrations

The graphs in this chapter are mostly plain and unadorned. We did that intentionally. When you call the ggplot function, as in:

library(tidyverse)
df <- data.frame(x = 1:5, y = 1:5)
ggplot(df, aes(x, y)) +
  geom_point()
Figure 10-1. Simple Plot

you get a plain, graphical representation of x and y as shown in Figure 10-1. You could adorn the graph with colors, a title, labels, a legend, text, and so forth, but then the call to ggplot becomes more and more crowded, obscuring the basic intention.

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    title = "Simple Plot Example",
    subtitle = "with a subtitle",
    x = "x values",
    y = "y values"
  ) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Figure 10-2. Complicated Plot

The resulting plot is shown in Figure 10-2. We want to keep the recipes clean, so we emphasize the basic plot and then show later (as in “Adding a Title and Labels”) how to add adornments.

Notes on ggplot2 basics

While the package is called ggplot2, the primary plotting function in the package is called ggplot. It is important to understand the basic pieces of a ggplot2 graph. In the examples above you can see that we pass data into ggplot and then define how the graph is created by stacking together small phrases that each describe some aspect of the plot. This stacking together of phrases is part of the “grammar of graphics” ethos (that’s where the gg comes from). To learn more, you can read “The Layered Grammar of Graphics” by ggplot2 author Hadley Wickham (http://vita.had.co.nz/papers/layered-grammar.pdf). The grammar of graphics concept originated with Leland Wilkinson, who articulated the idea of building graphics up from a set of primitives (i.e., verbs and nouns). With ggplot, the underlying data need not be fundamentally reshaped for each type of graphical representation. In general, the data stays the same and the user changes the syntax slightly to illustrate the data differently. This is significantly more consistent than base graphics, which often requires reshaping the data in order to change the way it is visualized.

As we talk about ggplot graphics it’s worth defining the things that make up a ggplot graph:

geometric object functions

These are geometric objects that describe the type of graph being created. These start with geom_ and examples include geom_line, geom_boxplot, and geom_point along with dozens more.

aesthetics

The aesthetics, or aesthetic mappings, communicate to ggplot which fields in the source data get mapped to which visual elements in the graphic. This is the aes() line in a ggplot call.

stats

Stats are statistical transformations that are done before displaying the data. Not all graphs will have stats, but a few common stats are stat_ecdf (the empirical cumulative distribution function) and stat_identity which tells ggplot to pass the data without doing any stats at all.

facet functions

Facets are subplots where each small plot represents a subgroup of the data. The faceting functions include facet_wrap and facet_grid.

themes

Themes are the visual elements of the plot that are not tied to data. These might include titles, margins, legend placement, or font choices.

layer

A layer is a combination of data, aesthetics, a geometric object, a stat, and other options to produce a visual layer in the ggplot graphic.
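
To make those terms concrete, here is a minimal sketch using the built-in mtcars data, with each piece labeled in a comment:

library(ggplot2)

ggplot(mtcars, aes(x = hp, y = mpg)) +   # aesthetics: map hp and mpg to the x and y axes
  geom_point() +                         # geometric object: one point per row; this call adds a layer,
                                         # and geom_point's default stat is stat_identity (no transformation)
  facet_wrap(~ cyl) +                    # facet function: one subplot per number of cylinders
  theme_bw()                             # theme: visual styling that is not tied to the data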

“Long” vs. “Wide” data with ggplot

One of the first confusions new users of ggplot often face is that they are inclined to reshape their data to be “wide” before plotting it. “Wide” here means that every variable they are plotting is its own column in the underlying data frame.

ggplot works most easily with “long” data, where additional variables are added as rows in the data frame rather than columns. The great side effect of adding additional measurements as rows is that any properly constructed ggplot graph will automatically update to reflect the new data without changing the ggplot code. If each additional variable were added as a column, then the plotting code would have to be changed to introduce the additional variables. This idea of “long” versus “wide” data will become more obvious in the examples in the rest of this chapter.
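
As a hedged sketch of the idea (the data frame and column names are made up for illustration), reshaping “wide” data to “long” with tidyr lets one ggplot call handle any number of measurement columns:

library(tidyverse)

# hypothetical "wide" data: one measurement per column
wide <- data.frame(x = 1:5,
                   series_a = (1:5)^2,
                   series_b = (1:5)^3)

# reshape to "long": one row per combination of x, series, and value
long <- wide %>%
  gather(key = "series", value = "value", series_a, series_b)

# a single aesthetic mapping now covers every series; adding another
# series adds rows to the data, not lines to the plotting code
ggplot(long, aes(x, value, color = series)) +
  geom_line()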

Graphics in Other Packages

R is highly programmable, and many people have extended its graphics machinery with additional features. Quite often, packages include specialized functions for plotting their results and objects. The zoo package, for example, implements a time series object. If you create a zoo object z and call plot(z), then the zoo package does the plotting; it creates a graphic that is customized for displaying a time series. The zoo package uses base graphics, so the resulting graph will not be a ggplot graphic.
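
A minimal sketch of that behavior, assuming the zoo package is installed (the dates and values are made up):

library(zoo)
z <- zoo(rnorm(12), order.by = as.Date("2019-01-01") + 0:11)
plot(z)   # dispatches to zoo's own plot method, which uses base graphics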

There are even entire packages devoted to extending R with new graphics paradigms. The lattice package is an alternative to base graphics that predates ggplot2. It uses a powerful graphics paradigm that enables you to create informative graphics more easily. It was implemented by Deepayan Sarkar, who also wrote Lattice: Multivariate Data Visualization with R (Springer, 2008), which explains the package and how to use it. The lattice package is also described in R in a Nutshell (O’Reilly).

There are two chapters in Hadley Wickham’s excellent book R for Data Science that deal with graphics. The first, “Exploratory Data Analysis”, focuses on exploring data with ggplot2, while “Graphics for Communication” explores communicating with graphics to others. R for Data Science is available in a printed version from O’Reilly Media or online at http://r4ds.had.co.nz/graphics-for-communication.html.

Creating a Scatter Plot

Problem

You have paired observations: (x1, y1), (x2, y2), …, (xn, yn). You want to create a scatter plot of the pairs.

Solution

We can plot the data by calling ggplot, passing in the data frame, and invoking a geometric point function:

ggplot(df, aes(x, y)) +
  geom_point()

In this example, the data frame is called df and the x and y data are in fields named x and y which we pass to the aesthetic in the call aes(x, y).

Discussion

A scatter plot is a common first attack on a new dataset. It’s a quick way to see the relationship, if any, between x and y.

Plotting with ggplot requires telling ggplot what data frame to use, what type of graph to create, and which aesthetic mapping (aes) to use. The aes in this case defines which field from df goes into which axis on the plot. Then the command geom_point communicates that you want a point graph, as opposed to a line or other type of graphic.

We can use the built-in mtcars dataset to illustrate, plotting horsepower hp on the x-axis and fuel economy mpg on the y-axis:

ggplot(mtcars, aes(hp, mpg)) +
  geom_point()
Figure 10-3. Scatter Plot Example

The resulting plot is shown in Figure 10-3.

See Also

See “Adding a Title and Labels” for adding a title and labels; see “Adding (or Removing) a Grid” and “Adding (or Removing) a Legend” for adding a grid and a legend (respectively). See “Plotting All Variables Against All Other Variables” for plotting multiple variables.

Adding a Title and Labels

Problem

You want to add a title to your plot or add labels for the axes.

Solution

With ggplot we add a labs element, which controls the labels for the title and the axes.

When calling labs in ggplot:

  • title: the desired title text

  • x: the x-axis label

  • y: the y-axis label

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(title = "The Title",
       x = "X-axis Label",
       y = "Y-axis Label")

Discussion

The graph created in “Creating a Scatter Plot” is quite plain. A title and better labels will make it more interesting and easier to interpret.

Note that in ggplot you build up the elements of the graph by connecting the parts with the plus sign +. So we add additional graphical elements by stringing together phrases. You can see this in the following code, which uses the built-in mtcars dataset and plots horsepower vs. fuel economy in a scatter plot, shown in Figure 10-4:

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(title = "Cars: Horsepower vs. Fuel Economy",
       x = "HP",
       y = "Economy (miles per gallon)")
Figure 10-4. Labeled Axis and Title

Adding (or Removing) a Grid

Problem

You want to change the background grid of your graphic.

Solution

With ggplot, background grids come as a default, as you have seen in other recipes. However, we can alter the background grid using the theme function or by applying a prepackaged theme to our graph.

We can use theme to alter the background panel of our graphic:

ggplot(df) +
  geom_point(aes(x, y)) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Figure 10-5. White background

Discussion

ggplot fills in the background with a grey grid by default. So you may find yourself wanting to remove that grid completely or change it to something else. Let’s create a ggplot graphic and then incrementally change the background style.

We can add or change aspects of our graphic by creating a ggplot object, then calling the object and using + to add to it. The background shading in a ggplot graphic is actually made up of three different graph elements:

panel.grid.major:

These are white by default and heavy

panel.grid.minor:

These are white by default and light

panel.background:

This is the background that is grey by default

You can see these elements if you look carefully at the background of Figure 10-4.

If we set the background to element_blank(), then the major and minor grids are still there, but they are white on white, so we can’t see them:

g1 <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(title = "Cars: Horsepower vs. Fuel Economy",
       x = "HP",
       y = "Economy (miles per gallon)") +
  theme(panel.background = element_blank())
g1


Notice in the code above we put the ggplot graph into a variable called g1. Then we printed the graphic by just calling g1. By having the graph inside of g1 we can then add additional graphical components without rebuilding the graph again.

But if we want to show the background grid in some bright colors for illustration, it’s as easy as setting the grids to a color and setting a line type:

g2 <- g1 + theme(panel.grid.major =
                   element_line(color = "red", linetype = 3)) +
  # linetype = 3 is dotted
  theme(panel.grid.minor =
          element_line(color = "blue", linetype = 4))
  # linetype = 4 is dot dash
g2


The result lacks visual appeal, but you can clearly see that the red lines make up the major grid and the blue lines are the minor grid.

Or we could do something less garish and take the ggplot object g1 from above and add grey gridlines to the white background, shown in Figure 10-6.

g1 +
  theme(panel.grid.major = element_line(colour = "grey"))
Figure 10-6. Grey Major Gridlines

Creating a Scatter Plot of Multiple Groups

Problem

You have data in a data frame with three observations per record: x, y, and a factor f that indicates the group. You want to create a scatter plot of x and y that distinguishes among the groups.

Solution

With ggplot we control the mapping of shapes to the factor f by passing shape = f to the aes.

ggplot(df, aes(x, y, shape = f)) +
  geom_point()

Discussion

Plotting multiple groups in one scatter plot creates an uninformative mess unless we distinguish one group from another. This distinction is done in ggplot by setting the shape parameter of the aes function.

The built-in iris dataset contains paired measures of Petal.Length and Petal.Width. Each measurement also has a Species property indicating the species of the flower that was measured. If we plot all the data at once, we just get one undifferentiated scatter plot:

ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width)) +
  geom_point()


The graphic would be far more informative if we distinguished the points by species. In addition to distinguishing species by shape, we can also differentiate by color. We add shape = Species and color = Species to our aes call to give each species a different shape and color.

ggplot(data = iris,
       aes(
         x = Petal.Length,
         y = Petal.Width,
         shape = Species,
         color = Species
       )) +
  geom_point()


ggplot conveniently sets up a legend for you as well.

See Also

See “Adding (or Removing) a Legend” to add a legend.

Adding (or Removing) a Legend

Problem

You want your plot to include a legend, the little box that decodes the graphic for the viewer.

Solution

In most cases ggplot will add the legends automatically, as you can see in the previous recipe. If you do not have explicit grouping in the aes then ggplot will not show a legend by default. If we want to force ggplot to show a legend we can set the shape or linetype of our graph to a constant. ggplot will then show a legend with one group. We then use guides to guide ggplot in how to label the legend.

This can be illustrated with our iris scatterplot:

g <- ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width,
           shape="Point Name")) +
  geom_point()  +
  guides(shape=guide_legend(title="Legend Title"))
g
Figure 10-7. Legend Added

Figure 10-7 illustrates the result of setting the shape to a string value then relabeling the legend using guides.

More commonly, you may want to turn legends off, which can be done by setting legend.position = "none" in the theme. We can use the iris plot from the prior recipe and add the theme call, as shown in Figure 10-8:

g <- ggplot(data = iris,
            aes(
              x = Petal.Length,
              y = Petal.Width,
              shape = Species,
              color = Species
            )) +
  geom_point() +
  theme(legend.position = "none")
g
Figure 10-8. Legend Removed

Discussion

Adding legends to ggplot when there is no grouping is an exercise in tricking ggplot into showing the legend by passing a string to a grouping parameter in aes. This will not change the grouping, since there is only one group, but it will result in a legend being shown with a name.

Then we can use guides to alter the legend title. It’s worth noting that we are not changing anything about the data, just exploiting settings in order to coerce ggplot into showing a legend when it typically would not.

One of the huge benefits of ggplot is its very good defaults. Getting positions and correspondence between labels and their point types is done automatically, but can be overridden if needed. To remove a legend totally, we set theme parameters with theme(legend.position = "none"). In addition to "none" you can set the legend.position to be "left", "right", "bottom", "top", or a two-element numeric vector. Use a two-element numeric vector in order to pass ggplot specific coordinates of where you want the legend. If using coordinate positions, the values passed are between 0 and 1 for the x and y positions, respectively.

An example of a legend positioned at the bottom is in Figure 10-9, created with this adjustment to legend.position:

g + theme(legend.position = "bottom")
Figure 10-9. Legend on the Bottom

Or we could use the two-element numeric vector to put the legend in a specific location as in Figure 10-10. The example puts the center of the legend at 80% to the right and 20% up from the bottom.

g + theme(legend.position = c(.8, .2))
Figure 10-10. Legend at a Point

In many aspects beyond legends, ggplot uses sane defaults with the flexibility to override them and tweak the details. More details on ggplot options related to legends can be found in the help for theme by typing ?theme or by looking in the ggplot online reference material.

Plotting the Regression Line of a Scatter Plot

Problem

You are plotting pairs of data points, and you want to add a line that illustrates their linear regression.

Solution

Using ggplot there is no need to calculate the linear model first using the R lm function. We can instead use the geom_smooth function to calculate the linear regression inside of our ggplot call.

If our data is in a data frame df and the x and y data are in columns x and y we plot the regression line like this:

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE)

The se = FALSE parameter tells ggplot not to plot the standard error bands around our regression line.

Discussion

Suppose we are modeling the strongx dataset found in the faraway package. We can create a linear model using the built-in lm function in R to predict the variable crossx as a linear function of energy. First, let’s look at a simple scatter plot of our data:

library(faraway)
data(strongx)

ggplot(strongx, aes(energy, crossx)) +
  geom_point()
Figure 10-11. Strongx Scatter Plot

ggplot can calculate a linear model on the fly and then plot the regression line along with our data:

g <- ggplot(strongx, aes(energy, crossx)) +
  geom_point()

g + geom_smooth(method = "lm",
                formula = y ~ x,
                se = FALSE)

We can turn the confidence bands on by omitting the se = FALSE option:

g + geom_smooth(method = "lm",
                formula = y ~ x)


Notice that in the geom_smooth we use x and y rather than the variable names. ggplot has set the x and y inside the plot based on the aesthetic. Multiple smoothing methods are supported by geom_smooth. You can explore those, and other options in the help by typing ?geom_smooth.

If we had a line we wanted to plot that was stored in another R object, we could use geom_abline to plot the line on our graph. In the following example we pull the intercept term and the slope from the regression model m and add those to our graph:

m <- lm(crossx ~ energy, data = strongx)

ggplot(strongx, aes(energy, crossx)) +
  geom_point() +
  geom_abline(
    intercept = m$coefficients[1],
    slope = m$coefficients[2]
  )

This produces a plot very similar to the previous one. The geom_abline method can be handy if you are plotting a line from a source other than a simple linear model.

See Also

See the chapter on Linear Regression and ANOVA for more about linear regression and the lm function.

Plotting All Variables Against All Other Variables

Problem

Your dataset contains multiple numeric variables. You want to see scatter plots for all pairs of variables.

Solution

ggplot does not have any built-in method to create pairs plots; however, the GGally package provides the functionality with the ggpairs function:

library(GGally)
ggpairs(df)

Discussion

When you have a large number of variables, finding interrelationships between them is difficult. One useful technique is looking at scatter plots of all pairs of variables. This would be quite tedious if coded pair-by-pair, but the ggpairs function from the package GGally provides an easy way to produce all those scatter plots at once.

The iris dataset contains four numeric variables and one categorical variable:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

What is the relationship, if any, between the columns? Plotting the columns with ggpairs produces multiple scatter plots.

library(GGally)
ggpairs(iris)
Figure 10-12. ggpairs Plot of Iris Data

The ggpairs function is pretty, but not particularly fast. If you’re just doing interactive work and want a quick peek at the data, the base R plot function provides faster output and is shown in Figure 10-13.

plot(iris)
Figure 10-13. Base plot() Pairs Plot

While the ggpairs function is not as fast to plot as the Base R plot function, it produces density graphs on the diagonal and reports correlation in the upper triangle of the graph. When factors or character columns are present, ggpairs produces histograms on the lower triangle of the graph and boxplots on the upper triangle. These are nice additions to understanding relationships in your data.
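
If the full pairs plot is too busy, ggpairs can be restricted to a subset of columns via its columns argument; this minimal sketch uses only the four numeric columns of iris:

library(GGally)
ggpairs(iris, columns = 1:4)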

Creating One Scatter Plot for Each Factor Level

Problem

Your dataset contains (at least) two numeric variables and a factor. You want to create several scatter plots for the numeric variables, with one scatter plot for each level of the factor.

Solution

This kind of plot is called a conditioning plot, which is produced in ggplot by adding facet_wrap to our plot. In this example we use the data frame df which contains three columns: x, y, and f with f being a factor (or a character).

ggplot(df, aes(x, y)) +
  geom_point() +
  facet_wrap( ~ f)

Discussion

Conditioning plots (coplots) are another way to explore and illustrate the effect of a factor or to compare different groups to each other.

The Cars93 dataset contains 27 variables describing 93 car models as of 1993. Two numeric variables are MPG.city, the miles per gallon in the city, and Horsepower, the engine horsepower. One categorical variable is Origin, which can be USA or non-USA according to where the model was built.

Exploring the relationship between MPG and horsepower, we might ask: Is there a different relationship for USA models and non-USA models?

Let’s examine this as a facet plot:

data(Cars93, package = "MASS")
ggplot(data = Cars93, aes(MPG.city, Horsepower)) +
  geom_point() +
  facet_wrap( ~ Origin)
Figure 10-14. Cars Data with Facet

The resulting plot in Figure 10-14 reveals a few insights. If we really crave that 300-horsepower monster then we’ll have to buy a car built in the USA; but if we want high MPG, we have more choices among non-USA models. These insights could be teased out of a statistical analysis, but the visual presentation reveals them much more quickly.

Note that using facets results in subplots with the same x- and y-axis ranges. This helps ensure that visual inspection of the data is not misleading because of differing axis ranges.

See Also

The coplot function in base graphics can produce very similar conditioning plots without ggplot2.

Creating a Bar Chart

Problem

You want to create a bar chart.

Solution

A common situation is to have a column of data that represents a group and then another column that represents a measure about that group. This format is “long” data because the data runs vertically instead of having a column for each group.

Using the geom_bar function in ggplot we can plot the heights as bars. If the data is already aggregated, we add stat = "identity" so that ggplot knows it needs to do no aggregation on the groups of values before plotting.

ggplot(data = df, aes(x, y)) +
  geom_bar(stat = "identity")

Discussion

Let’s use the cars made by Ford in the Cars93 data in an example:

ford_cars <- Cars93 %>%
  filter(Manufacturer == "Ford")

ggplot(ford_cars, aes(Model, Horsepower)) +
  geom_bar(stat = "identity")
Figure 10-15. Ford Cars Bar Chart

Figure 10-15 shows the resulting bar chart.

The example above uses stat = "identity", which assumes that the heights of your bars are conveniently stored as a value in one field, with only one record per bar. That is not always the case, however. Often you have a vector of numeric data and a parallel factor or character field that groups the data, and you want to produce a bar chart of the group means or the group totals.

Let’s work up an example using the built-in airquality dataset, which contains daily temperature data for a single location for five months. The data frame has a numeric Temp column and Month and Day columns. If we want to plot the mean temperature by month using ggplot, we don’t need to precompute the mean; instead we can have ggplot do that in the plot command logic. To tell ggplot to calculate the mean, we pass stat = "summary", fun.y = "mean" to the geom_bar command. We can also turn the month numbers into month names using the built-in constant month.abb, which contains the abbreviations for the months.

ggplot(airquality, aes(month.abb[Month], Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-16. Bar Chart: Temp by Month

Figure 10-16 shows the resulting plot. But you might notice the sort order on the months is alphabetical, which is not how we typically like to see months sorted.

We can fix the sorting issue using a few functions from dplyr combined with fct_inorder from the forcats Tidyverse package. To get the months in the correct order we can sort the data frame by Month which is the month number, then we can apply fct_inorder which will arrange our factors in the order they appear in the data. You can see in Figure 10-17 that the bars are now sorted properly.

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-17. Bar Chart Properly Sorted

See Also

See “Adding Confidence Intervals to a Bar Chart” for adding confidence intervals and “Coloring a Bar Chart” for adding color.

?geom_bar for help with bar charts in ggplot

barplot for Base R bar charts or the barchart function in the lattice package.

Adding Confidence Intervals to a Bar Chart

Problem

You want to augment a bar chart with confidence intervals.

Solution

Suppose you have a data frame df with columns group which are group names, stat which is a column of statistics, and lower and upper which represent the corresponding limits for the confidence intervals. We can display a bar chart of stat for each group and its confidence intervals using the geom_bar combined with geom_errorbar.

ggplot(df, aes(group, stat)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .2)


The result is a bar chart with error bars marking the confidence intervals.

Discussion

Most bar charts display point estimates, which are shown by the heights of the bars, but rarely do they include confidence intervals. Our inner statisticians dislike this intensely. The point estimate is only half of the story; the confidence interval gives the full story.

Fortunately, we can plot the error bars using ggplot. The hard part is calculating the intervals. In the Solution above, the interval limits were already supplied in the lower and upper columns. However, in “Creating a Bar Chart” we had ggplot calculate the group means before plotting them. If we let ggplot do the calculations for us, we can use the built-in mean_se along with the stat_summary function to get the standard errors of the mean measures.

Let’s use the airquality data we used previously. First we’ll do the sorted factor procedure (from the prior recipe) to get the month names in the desired order:

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

Now we can plot the bars along with the associated standard errors as in the following:

ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "cornflowerblue") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")

Sometimes you’ll want to sort your columns in your bar chart in descending order based on their height. This can be a little bit confusing when using summary stats in ggplot but the secret is to use mean in the reorder statement to sort the factor by the mean of the temp. Note that the reference to mean in reorder is not quoted, while the reference to mean in geom_bar is quoted:

ggplot(aq_data, aes(reorder(month_abb, -Temp, mean), Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "tomato") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-18. Mean Temp By Month Descending Order

You may look at this example and the result in Figure 10-18 and wonder, “Why didn’t they just use reorder(month_abb, Month) in the first example instead of that sorting business with forcats::fct_inorder to get the months in the right order?” Well, we could have. But sorting using fct_inorder is a design pattern that provides flexibility for more complicated things. Plus it’s quite easy to read in a script. Using reorder inside the aes is a bit more dense and hard to read later. But either approach is reasonable.

See Also

See “Forming a Confidence Interval for a Mean” for more about t.test.

Coloring a Bar Chart

Problem

You want to color or shade the bars of a bar chart.

Solution

With ggplot we add fill = to our aes and let ggplot pick the colors for us:

ggplot(df, aes(x, y, fill = group))

Discussion

In ggplot we can use the fill parameter in aes to tell ggplot what field to base the colors on. If we pass a numeric field to ggplot we will get a continuous gradient of colors and if we pass a factor or character field to fill we will get contrasting colors for each group. Below we pass the character name of each month to the fill parameter:

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

ggplot(data = aq_data, aes(month_abb, Temp, fill = month_abb)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)") +
  scale_fill_brewer(palette = "Paired")
Figure 10-19. Colored Monthly Temp Bar Chart

The colors in the resulting Figure 10-19 are set by calling scale_fill_brewer(palette = "Paired"). The "Paired" color palette, along with many other palettes, comes from the RColorBrewer package.

If we wanted to change the color of each bar based on the temperature, we can’t just set fill = Temp, as might seem intuitive, because ggplot would not understand that we want the mean temperature after grouping by month. The way we get around this is to access a special field inside our graph called ..y.., which is the calculated value on the y axis. But we don’t want the legend labeled ..y.., so we add fill = "Temp" to our labs call in order to change the name of the legend. The result is a bar chart whose bars are shaded by mean temperature:

ggplot(airquality, aes(month.abb[Month], Temp, fill = ..y..)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")


If we want to reverse the color scale, we can simply put a minus sign in front of the field we are filling by: fill = -..y.., for example.
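
For example, here is the previous chart with the scale reversed (the same code as above, with only the minus sign added):

ggplot(airquality, aes(month.abb[Month], Temp, fill = -..y..)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")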

See Also

See “Creating a Bar Chart” for creating a bar chart.

Plotting a Line from x and y Points

Problem

You have paired observations in a data frame: (x1, y1), (x2, y2), …, (xn, yn). You want to plot a series of line segments that connect the data points.

Solution

With ggplot we can use geom_point to plot the points:

ggplot(df, aes(x, y)) +
  geom_point()

Since ggplot graphics are built up, element by element, we can have both a point and a line in the same graphic very easily by having two geoms:

ggplot(df, aes(x , y)) +
  geom_point() +
  geom_line()

Discussion

To illustrate, let’s look at some example US economic data that comes with ggplot2. This example data frame has a column called date which we’ll plot on the x axis and a field unemploy which is the number of unemployed people.

ggplot(economics, aes(date , unemploy)) +
  geom_point() +
  geom_line()
Figure 10-20. Line Chart Example

Figure 10-20 shows the resulting chart which contains both lines and points because we used both geoms.

Changing the Type, Width, or Color of a Line

Problem

You are plotting a line. You want to change the type, width, or color of the line.

Solution

ggplot uses the linetype parameter for controlling the appearance of lines:

  • linetype="solid" or linetype=1 (default)

  • linetype="dashed" or linetype=2

  • linetype="dotted" or linetype=3

  • linetype="dotdash" or linetype=4

  • linetype="longdash" or linetype=5

  • linetype="twodash" or linetype=6

  • linetype="blank" or linetype=0 (inhibits drawing)

You can change the line characteristics by passing linetype, col, and/or size as parameters to geom_line. For example, to make the line dashed, red, and heavy:

ggplot(df, aes(x, y)) +
  geom_line(linetype = 2,
            size = 2,
            col = "red")

Discussion

The example syntax above shows how to draw one line and specify its style, width, or color. A common scenario involves drawing multiple lines, each with its own style, width, or color.

Let’s set up some example data:

x <- 1:10
y1 <- x**1.5
y2 <- x**2
y3 <- x**2.5
df <- data.frame(x, y1, y2, y3)

In ggplot this can be a conundrum for many users. The challenge is that ggplot works best with “long” data instead of “wide” data as was mentioned in the introduction to this chapter. Our example data frame has 4 columns of wide data:

head(df, 3)
#>   x   y1 y2    y3
#> 1 1 1.00  1  1.00
#> 2 2 2.83  4  5.66
#> 3 3 5.20  9 15.59

We can make our wide data long by using the gather function from the core tidyverse package tidyr. In the example below, we use gather to create a new column named bucket holding the former column names and a new column y holding the values, while keeping our x variable:

df_long <- gather(df, bucket, y, -x)
head(df_long, 3)
#>   x bucket    y
#> 1 1     y1 1.00
#> 2 2     y1 2.83
#> 3 3     y1 5.20
tail(df_long, 3)
#>     x bucket   y
#> 28  8     y3 181
#> 29  9     y3 243
#> 30 10     y3 316

Now we can pass bucket to the col parameter and get multiple lines, each a different color:

ggplot(df_long, aes(x, y, col = bucket)) +
  geom_line()

It’s straightforward to vary the line weight by passing a numeric variable to size:

ggplot(df, aes(x, y1, size = y2)) +
  geom_line() +
  scale_size(name = "Thickness based on y2")
Figure 10-21. Thickness as a Function of y2

The result of varying the line thickness with the value of y2 is shown in Figure 10-21.

See Also

See “Plotting a Line from x and y Points” for plotting a basic line.

Plotting Multiple Datasets

Problem

You want to show multiple datasets in one plot.

Solution

We could combine the data into one data frame before plotting, using one of the join functions from dplyr. However, below we will create two separate data frames and then add each of them to a ggplot graph.

First let’s set up our example data frames, df1 and df2:

# example data
n <- 20

x1 <- 1:n
y1 <- rnorm(n, 0, .5)
df1 <- data.frame(x1, y1)

x2 <- (.5 * n):((1.5 * n) - 1)
y2 <- rnorm(n, 1, .5)
df2 <- data.frame(x2, y2)

Typically we would pass the data frame directly into the ggplot function call. Since we want two geoms with two different data sources, we will initiate a plot with ggplot() and then add in two calls to geom_line each with its own data source.

ggplot() +
  geom_line(data = df1, aes(x = x1, y = y1), color = "darkblue") +
  geom_line(data = df2, aes(x = x2, y = y2), linetype = "dashed")
Figure 10-22. Two Lines One Plot

Discussion

ggplot allows us to make multiple calls to different geom_ functions, each with its own data source, if desired. ggplot will then look at all the data we are plotting and adjust the ranges to accommodate all of it.

Even with good defaults, sometimes we want our plot to show a different range. We can do that by setting xlim and ylim in our ggplot:

ggplot() +
  geom_line(data = df1, aes(x = x1, y = y1), color = "darkblue") +
  geom_line(data = df2, aes(x = x2, y = y2), linetype = "dashed") +
  xlim(0, 35) +
  ylim(-2, 2)
Figure 10-23. Two Lines Larger Limits

The graph with expanded limits is in Figure 10-23.

Adding Vertical or Horizontal Lines

Problem

You want to add a vertical or horizontal line to your plot, such as an axis through the origin or pointing out a threshold.

Solution

The ggplot functions geom_vline and geom_hline add vertical and horizontal lines, respectively. The functions can also take color, linetype, and size parameters to set the line style:

# using the data.frame df1 from the prior recipe
ggplot(df1) +
  aes(x = x1, y = y1) +
  geom_point() +
  geom_vline(
    xintercept = 10,
    color = "red",
    linetype = "dashed",
    size = 1.5
  ) +
  geom_hline(yintercept = 0, color = "blue")
Figure 10-24. Vertical and Horizontal Lines

Figure 10-24 shows the resulting plot with added horizontal and vertical lines.

Discussion

A typical use is drawing regularly spaced reference lines. Suppose we have a sample of points, samp. First, we plot them with a solid line through the mean. Then we calculate and draw dotted lines at ±1 and ±2 standard deviations away from the mean. We add the lines to our plot with geom_hline:

samp <- rnorm(1000)
samp_df <- data.frame(samp, x = 1:length(samp))

mean_line <- mean(samp_df$samp)
sd_lines <- mean_line + c(-2, -1, +1, +2) * sd(samp_df$samp)

ggplot(samp_df) +
  aes(x = x, y = samp) +
  geom_point() +
  geom_hline(yintercept = mean_line, color = "darkblue") +
  geom_hline(yintercept = sd_lines, linetype = "dotted")
Figure 10-25. Mean and SD Bands in a Plot

Figure 10-25 shows the sampled data along with the mean and standard deviation lines.

See Also

See “Changing the Type, Width, or Color of a Line” for more about changing line types.

Creating a Box Plot

Problem

You want to create a box plot of your data.

Solution

Use geom_boxplot from ggplot to add a box plot geom to a ggplot graphic. Using the samp_df data frame from the prior recipe, we can create a box plot of the values in the samp column. The resulting graph is shown in Figure 10-26.

ggplot(samp_df) +
  aes(y = samp) +
  geom_boxplot()
Figure 10-26. Single Boxplot

Discussion

A box plot provides a quick and easy visual summary of a dataset.

  • The thick line in the middle is the median.

  • The box surrounding the median identifies the first and third quartiles; the bottom of the box is Q1, and the top is Q3.

  • The “whiskers” above and below the box show the range of the data, excluding outliers.

  • The circles identify outliers. By default, an outlier is defined as any value that is farther than 1.5 × IQR away from the box. (IQR is the interquartile range, or Q3 − Q1.) In this example, there are a few outliers on the high side.

We can rotate the boxplot by flipping the coordinates. There are some situations where this makes a more appealing graphic. This is shown in Figure 10-27.

ggplot(samp_df) +
  aes(y = samp) +
  geom_boxplot() +
  coord_flip()
Figure 10-27. Single Boxplot, Rotated

See Also

One box plot alone is pretty boring. See “Creating One Box Plot for Each Factor Level” for creating multiple box plots.

Creating One Box Plot for Each Factor Level

Problem

Your dataset contains a numeric variable and a factor (or other categorical variable). You want to create several box plots of the numeric variable, broken out by the factor levels.

Solution

With ggplot we pass the name of the categorical variable to the x parameter in the aes call. The resulting boxplot will then be grouped by the values in the categorical variable:

ggplot(df) +
  aes(x = factor, y = values) +
  geom_boxplot()

Discussion

This recipe is another great way to explore and illustrate the relationship between two variables. In this case, we want to know whether the numeric variable changes according to the level of a category.

The UScereal dataset from the MASS package contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is the shelf position (counting from the floor). Cereal manufacturers can negotiate for shelf position, placing their product for the best sales potential. We wonder: Where do they put the high-sugar cereals? We can produce Figure 10-28 and explore that question by creating one box plot per shelf:

data(UScereal, package = "MASS")

ggplot(UScereal) +
  aes(x = as.factor(shelf), y = sugars) +
  geom_boxplot() +
  labs(
    title = "Sugar Content by Shelf",
    x = "Shelf",
    y = "Sugar (grams per portion)"
  )
Figure 10-28. Boxplot by Shelf Number

The box plots suggest that shelf #2 has the most high-sugar cereals. Could it be that this shelf is at eye level for young children who can influence their parents’ choice of cereals?

Note that in the aes call we had to tell ggplot to treat the shelf number as a factor. Otherwise, ggplot would not treat shelf as a grouping variable and would print only a single box plot.

See Also

See “Creating a Box Plot” for creating a basic box plot.

Creating a Histogram

Problem

You want to create a histogram of your data.

Solution

Use geom_histogram, and set x to a vector of numeric values.

Discussion

Figure 10-29 is a histogram of the MPG.city column taken from the Cars93 dataset:

data(Cars93, package = "MASS")

ggplot(Cars93) +
  geom_histogram(aes(x = MPG.city))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 10-29. Histogram of Counts by MPG

The geom_histogram function must decide how many cells (bins) to create for binning the data. In this example, the default algorithm chose 30 bins. If we wanted fewer bins, we would include the bins parameter to tell geom_histogram how many bins we want:

ggplot(Cars93) +
  geom_histogram(aes(x = MPG.city), bins = 13)
Figure 10-30. Histogram of Counts by MPG with Fewer Bins

Figure 10-30 shows the histogram with 13 bins.

See Also

The Base R function hist provides much of the same functionality, as does the histogram function of the lattice package.

Adding a Density Estimate to a Histogram

Problem

You have a histogram of your data sample, and you want to add a curve to illustrate the apparent density.

Solution

Use the geom_density function to approximate the sample density as shown in Figure 10-31:

ggplot(Cars93) +
  aes(x = MPG.city) +
  geom_histogram(aes(y = ..density..), bins = 21) +
  geom_density()
Figure 10-31. Histogram with Density Plot

Discussion

A histogram suggests the density function of your data, but it is rough. A smoother estimate could help you better visualize the underlying distribution. A kernel density estimate (KDE) is a smoother representation of univariate data.

In ggplot we tell the geom_histogram function to use the density function by passing it aes(y = ..density..).

The following example takes a sample from a gamma distribution and then plots the histogram and the estimated density as shown in Figure 10-32.

samp <- rgamma(500, 2, 2)

ggplot() +
  aes(x = samp) +
  geom_histogram(aes(y = ..density..), bins = 10) +
  geom_density()
Figure 10-32. Histogram and Density: Gamma Distribution

See Also

The density function approximates the shape of the density nonparametrically. If you know the actual underlying distribution, use instead “Plotting a Density Function” to plot the density function.

Creating a Normal Quantile-Quantile (Q-Q) Plot

Problem

You want to create a quantile-quantile (Q-Q) plot of your data, typically because you want to know how the data differs from a normal distribution.

Solution

With ggplot we can use the stat_qq and stat_qq_line functions to create a Q-Q plot that shows both the observed points as well as the Q-Q Line. Figure 10-33 shows the resulting plot.

df <- data.frame(x = rnorm(100))

ggplot(df, aes(sample = x)) +
  stat_qq() +
  stat_qq_line()
Figure 10-33. Q-Q Plot

Discussion

Sometimes it’s important to know if your data is normally distributed. A quantile-quantile (Q-Q) plot is a good first check.

The Cars93 dataset contains a Price column. Is it normally distributed? This code snippet creates a Q-Q plot of Price shown in Figure 10-34:

ggplot(Cars93, aes(sample = Price)) +
  stat_qq() +
  stat_qq_line()
Figure 10-34. Q-Q Plot of Car Prices

If the data had a perfect normal distribution, then the points would fall exactly on the diagonal line. Many points are close, especially in the middle section, but the points in the tails are pretty far off. Too many points are above the line, indicating a general skew to the right.

The rightward skew might be cured by a logarithmic transformation. We can plot log(Price), which yields Figure 10-35:

ggplot(Cars93, aes(sample = log(Price))) +
  stat_qq() +
  stat_qq_line()
Figure 10-35. Q-Q Plot of Log Car Prices

Notice that the points in the new plot are much better behaved, staying close to the line except in the extreme left tail. It appears that log(Price) is approximately Normal.

See Also

See “Creating Other Quantile-Quantile Plots” for creating Q-Q plots for other distributions. See Recipe X-X for an application of Normal Q-Q plots to diagnosing linear regression.

Creating Other Quantile-Quantile Plots

Problem

You want to view a quantile-quantile plot for your data, but the data is not normally distributed.

Solution

For this recipe, you must have some idea of the underlying distribution, of course. The solution is built from the following steps:

  • Use the ppoints function to generate a sequence of points between 0 and 1.

  • Transform those points into quantiles, using the quantile function for the assumed distribution.

  • Sort your sample data.

  • Plot the sorted data against the computed quantiles.

  • Use abline to plot the diagonal line.

Here is an example that assumes your data, y, has a Student’s t distribution with 5 degrees of freedom. Recall that the quantile function for Student’s t is qt and that its second argument is the degrees of freedom.
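
As a point of comparison, here is a minimal Base R sketch of those steps, assuming y holds your data and a t distribution with 5 degrees of freedom (the variable names are ours, for illustration only):

y <- rt(100, df = 5)                      # example data
plot(qt(ppoints(y), df = 5), sort(y),     # theoretical vs. sample quantiles
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(a = 0, b = 1)                      # reference line

With ggplot, however, the work reduces to estimating the distribution’s parameters and passing them to geom_qq and stat_qq_line, which is the approach we take below.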

First let’s make some example data:

df_t <- data.frame(y = rt(100, 5))

In order to plot the Q-Q plot, we need to estimate the parameters of the distribution we want to plot. Since this is a Student’s t distribution, we only need to estimate one parameter, the degrees of freedom. Of course we know the actual degrees of freedom is 5, but in most situations we’ll need to calculate the value. So we’ll use the MASS::fitdistr function to estimate the degrees of freedom:

est_df <- as.list(MASS::fitdistr(df_t$y, "t")$estimate)[["df"]]
#> Warning in log(s): NaNs produced

#> Warning in log(s): NaNs produced

#> Warning in log(s): NaNs produced
est_df
#> [1] 19.5

The estimate is noticeably higher than the 5 degrees of freedom we used to generate the simulated data; with only 100 observations, the degrees of freedom parameter is difficult to estimate precisely. Still, let’s pass the estimated degrees of freedom to the Q-Q functions and create Figure 10-36:

ggplot(df_t) +
  aes(sample = y) +
  geom_qq(distribution = qt, dparams = est_df) +
  stat_qq_line(distribution = qt, dparams = est_df)
Figure 10-36. Student’s t Distribution Q-Q Plot

Discussion

The solution looks complicated, but the gist of it is picking a distribution, fitting the parameters, and then passing those parameters to the Q-Q functions in ggplot.

We can illustrate this recipe by taking a random sample from an exponential distribution with a mean of 10 (or, equivalently, a rate of 1/10):

rate <- 1 / 10
n <- 1000
df_exp <- data.frame(y = rexp(n, rate = rate))
est_exp <- as.list(MASS::fitdistr(df_exp$y, "exponential")$estimate)[["rate"]]
est_exp
#> [1] 0.101

Notice that for an exponential distribution the parameter we estimate is called rate as opposed to df which was the parameter in the t distribution.

ggplot(df_exp) +
  aes(sample = y) +
  geom_qq(distribution = qexp, dparams = est_exp) +
  stat_qq_line(distribution = qexp, dparams = est_exp)
Figure 10-37. Exponential Distribution Q-Q Plot

The quantile function for the exponential distribution is qexp, which takes the rate argument. Figure 10-37 shows the resulting Q-Q plot using a theoretical exponential distribution.

Plotting a Variable in Multiple Colors

Problem

You want to plot your data in multiple colors, typically to make the plot more informative, readable, or interesting.

Solution

We can pass a color to a geom_ function in order to produce colored output:

df <- data.frame(x = rnorm(200), y = rnorm(200))

ggplot(df) +
  aes(x = x, y = y) +
  geom_point(color = "blue")
Figure 10-38. Point Data in Color

The value of color can be:

  • One color, in which case all data points are that color.

  • A vector of colors, the same length as x, in which case each value of x is colored with its corresponding color.

  • A short vector, in which case the vector of colors is recycled.

Discussion

The default color in ggplot is black. While it’s not very exciting, black is high contrast and easy for most anyone to see.

However, it is much more useful (and interesting) to vary the color in a way that illuminates the data. Let’s illustrate this by plotting a graphic two ways, once in black and white and once with simple shading.

This produces the basic black-and-white graphic in Figure 10-39:

df <- data.frame(
  x = 1:100,
  y = rnorm(100)
)

ggplot(df) +
  aes(x, y) +
  geom_point()
Figure 10-39. Simple Point Plot

Now we can make it more interesting by creating a vector of "gray" and "black" values according to the sign of y, and then plotting the points using those colors, as shown in Figure 10-40:

shade <- if_else(df$y >= 0, "black", "gray")

ggplot(df) +
  aes(x, y) +
  geom_point(color = shade)
Figure 10-40. Color Shaded Point Plot

The negative values are now plotted in gray because the corresponding element of shade is "gray".

See Also

See “Understanding the Recycling Rule” regarding the Recycling Rule. Execute colors() to see a list of available colors, and use geom_segment in ggplot to plot line segments in multiple colors.

Graphing a Function

Problem

You want to graph the value of a function.

Solution

The ggplot function stat_function will graph a function across a range. In Figure 10-41 we plot a sine wave across the range -3 to 3.

ggplot(data.frame(x = c(-3, 3))) +
  aes(x) +
  stat_function(fun = sin)
Figure 10-41. Sine Wave Plot

Discussion

It’s pretty common to want to plot a statistical function, such as a normal density, across a given range. The stat_function in ggplot allows us to do this. We need only supply a data frame with the x value limits, and stat_function will calculate the y values and plot the results:

ggplot(data.frame(x = c(-3.5, 3.5))) +
  aes(x) +
  stat_function(fun = dnorm) +
  ggtitle("Std. Normal Density")

Notice that in the chart above we use ggtitle to set the title. If we are setting multiple text elements in a ggplot, we use labs; when just adding a title, ggtitle is more concise than labs(title = 'Std. Normal Density'), although they accomplish the same thing. See ?labs for more discussion of labels in ggplot.

stat_function can graph any function that takes one argument and returns one value. Let’s create a function and then plot it. Our function is a dampened sine wave: a sine wave that loses amplitude as it moves away from 0:

f <- function(x) exp(-abs(x)) * sin(2 * pi * x)
ggplot(data.frame(x = c(-3.5, 3.5))) +
  aes(x) +
  stat_function(fun = f) +
  ggtitle("Dampened Sine Wave")

See Also

See Recipe X-X for how to define a function.

Pausing Between Plots

Problem

You are creating several plots, and each plot is overwriting the previous one. You want R to pause between plots so you can view each one before it’s overwritten.

Solution

There is a global graphics option called ask. Set it to TRUE, and R will pause before each new plot. We turn on this option by passing it to the par function, which sets graphical parameters:

par(ask = TRUE)

When you are tired of R pausing between plots, set it to FALSE:

par(ask = FALSE)

Discussion

When ask is TRUE, R will print this message immediately before starting a new plot:

Hit <Return> to see next plot:

When you are ready, hit the return or enter key and R will begin the next plot.

This is a Base R graphics setting, but you can use it with ggplot if you wrap your plot object in a print statement in order to get prompted. Below is an example of a loop that prints a random set of points five times. If you run this loop in RStudio, you will be prompted between each graphic. Notice how we wrap g inside a print call:

par(ask = TRUE)

for (i in (11:15)) {
  g <- ggplot(data.frame(x = rnorm(i), y = 1:i)) +
    aes(x, y) +
    geom_point()
  print(g)
}

# don't forget to turn ask off after you're done
par(ask = FALSE)

See Also

If one graph is overwriting another, consider using “Displaying Several Figures on One Page” to plot multiple graphs in one frame. See Recipe X-X for more about changing graphical parameters.


Displaying Several Figures on One Page

Problem

You want to display several plots side by side on one page.

Solution

# example data
z <- rnorm(1000)
y <- runif(1000)

# plot elements
p1 <- ggplot() +
  geom_point(aes(x = 1:1000, y = z))
p2 <- ggplot() +
  geom_point(aes(x = 1:1000, y = y))
p3 <- ggplot() +
  geom_density(aes(z))
p4 <- ggplot() +
  geom_density(aes(y))

There are a number of ways to put ggplot graphics into a grid, but one of the easiest to use and understand is patchwork by Thomas Lin Pedersen. When this book was written, patchwork was not available on CRAN, but it can be installed using devtools:

devtools::install_github("thomasp85/patchwork")

After installing the package, we can use it to plot multiple ggplot objects by putting a + between the objects, followed by a call to plot_layout to arrange the images into a grid, as shown in Figure 10-42:

library(patchwork)
p1 + p2 + p3 + p4
Figure 10-42. A Patchwork Plot

patchwork supports grouping with parentheses and using / to put groupings under other elements, as illustrated in Figure 10-43.

p3 / (p1 + p2 + p4)
Figure 10-43. A Patchwork 1 / 2 Plot

Discussion

Let’s use a multifigure plot to display four different beta distributions. Using ggplot and the patchwork package, we can create a 2 x 2 layout by creating four graphics objects and then printing them using the + notation from patchwork:

library(patchwork)


df <- data.frame(x = c(0, 1))

g1 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 2, 4)
  ) +
  ggtitle("First")

g2 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 4, 1)
  ) +
  ggtitle("Second")

g3 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 1, 1)
  ) +
  ggtitle("Third")

g4 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, .5, .5)
  ) +
  ggtitle("Fourth")

g1 + g2 + g3 + g4 + plot_layout(ncol = 2, byrow = TRUE)

To lay the images out in column order, we pass byrow = FALSE to plot_layout:

g1 + g2 + g3 + g4 + plot_layout(ncol = 2, byrow = FALSE)

See Also

“Plotting a Density Function” discusses plotting of density functions as we do above.

The grid package and the lattice package contain additional tools for multifigure layouts with Base Graphics.

Writing Your Plot to a File

Problem

You want to save your graphics in a file, such as a PNG, JPEG, or PostScript file.

Solution

With ggplot figures we can use ggsave to save a displayed image to a file. ggsave will make some default assumptions about size and file type for you, allowing you to only specify a filename:

ggsave("filename.jpg")

The file type is derived from the extension you use in the filename you pass to ggsave. You can control details of size, filetype, and scale by passing parameters to ggsave. See ?ggsave for specific details.
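
For example, here is a sketch that saves the most recent plot as a PNG with a specific size and resolution (the filename and dimensions are placeholders; adjust them to your needs):

# Save the most recent plot as a 7 x 5 inch PNG at 300 dpi
ggsave("my_plot.png", width = 7, height = 5, units = "in", dpi = 300)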

Discussion

In RStudio, a shortcut is to click on Export in the Plots window and then click on Save as Image, Save as PDF, or Copy to Clipboard. The save options will prompt you for a file type and a file name before writing the file. The Copy to Clipboard option can be handy if you are manually copying and pasting your graphics into a presentation or word processor.

Remember that the file will be written to your current working directory (unless you use an absolute file path), so be certain you know which directory is your working directory before calling ggsave.

In a non-interactive script using ggplot you can pass plot objects directly to ggsave so they need not be displayed before saving. In the prior recipe we created a plot object called g1. We can save it to a file like this:

ggsave("g1.png", plot = g1, units = "in", width = 5, height = 4)

Note that the units for height and width in ggsave are specified with the units parameter. In this case we used in for inches, but ggsave also supports mm and cm for the more metrically inclined.

See Also

See “Getting and Setting the Working Directory” for more about the current working directory.

Chapter 11. Linear Regression and ANOVA

Introduction

In statistics, modeling is where we get down to business. Models quantify the relationships between our variables. Models let us make predictions.

A simple linear regression is the most basic model. It’s just two variables and is modeled as a linear relationship with an error term:

  • yi = β0 + β1xi + εi

We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for β0 and β1 (“Performing Simple Linear Regression”).

That generalizes naturally to multiple linear regression, where we have multiple variables on the righthand side of the relationship (“Performing Multiple Linear Regression”):

  • yi = β0 + β1ui + β2vi + β3wi + εi

Statisticians call u, v, and w the predictors and y the response. Obviously, the model is useful only if there is a fairly linear relationship between the predictors and the response, but that requirement is much less restrictive than you might think. “Regressing on Transformed Data” discusses transforming your variables into a (more) linear relationship so that you can use the well-developed machinery of linear regression.

The beauty of R is that anyone can build these linear models. The models are built by a function, lm, which returns a model object. From the model object, we get the coefficients (βi) and regression statistics. It’s easy. Really!

The horror of R is that anyone can build these models. Nothing requires you to check that the model is reasonable, much less statistically significant. Before you blindly believe a model, check it. Most of the information you need is in the regression summary (“Understanding the Regression Summary”):

Is the model statistically significant?

Check the F statistic at the bottom of the summary.

Are the coefficients significant?

Check the coefficient’s t statistics and p-values in the summary, or check their confidence intervals (“Forming Confidence Intervals for Regression Coefficients”).

Is the model useful?

Check the R2 near the bottom of the summary.

Does the model fit the data well?

Plot the residuals and check the regression diagnostics (“Plotting Regression Residuals” and “Diagnosing a Linear Regression”).

Does the data satisfy the assumptions behind linear regression?

Check whether the diagnostics confirm that a linear model is reasonable for your data (“Diagnosing a Linear Regression”).

ANOVA

Analysis of variance (ANOVA) is a powerful statistical technique. First-year graduate students in statistics are taught ANOVA almost immediately because of its importance, both theoretical and practical. We are often amazed, however, at the extent to which people outside the field are unaware of its purpose and value.

Regression creates a model, and ANOVA is one method of evaluating such models. The mathematics of ANOVA are intertwined with the mathematics of regression, so statisticians usually present them together; we follow that tradition here.

ANOVA is actually a family of techniques that are connected by a common mathematical analysis. This chapter mentions several applications:

One-way ANOVA

This is the simplest application of ANOVA. Suppose you have data samples from several populations and are wondering whether the populations have different means. One-way ANOVA answers that question. If the populations have normal distributions, use the oneway.test function (“Performing One-Way ANOVA”); otherwise, use the nonparametric version, the kruskal.test function (“Performing Robust ANOVA (Kruskal–Wallis Test)”).

Model comparison

When you add or delete a predictor variable from a linear regression, you want to know whether that change did or did not improve the model. The anova function compares two regression models and reports whether they are significantly different (“Comparing Models by Using ANOVA”).

ANOVA table

The anova function can also construct the ANOVA table of a linear regression model, which includes the F statistic needed to gauge the model’s statistical significance (“Getting Regression Statistics”). This important table is discussed in nearly every textbook on regression.

The See Also section below contains more about the mathematics of ANOVA.

Example Data

In many of the examples in this chapter, we start by creating example data using R’s pseudorandom number generation capabilities. So at the beginning of each recipe you may see something like the following:

set.seed(42)
x <- rnorm(100)
e <- rnorm(100, mean=0, sd=5)
y <- 5 + 15 * x + e

We use set.seed to set the random number generation seed so that if you run the example code on your machine you will get the same answer. In the above example, x is a vector of 100 draws from a standard normal (mean=0, sd=1) distribution. Then we create a little random noise called e from a normal distribution with mean=0 and sd=5. y is then calculated as 5 + 15 * x + e. The idea behind creating example data rather than using “real world” data is that with simulated “toy” data you can change the coefficients and parameters in the example data and see how the change impacts the resulting model. For example, you could increase the standard deviation of e in the example data and see what impact that has on the R^2 of your model.
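
As a quick sketch of that idea (the object names y_low_noise and y_high_noise are ours, chosen for illustration), compare the R^2 of a low-noise and a high-noise version of the same model:

set.seed(42)
x <- rnorm(100)

y_low_noise  <- 5 + 15 * x + rnorm(100, mean = 0, sd = 5)
y_high_noise <- 5 + 15 * x + rnorm(100, mean = 0, sd = 20)

summary(lm(y_low_noise ~ x))$r.squared    # most of the variance explained
summary(lm(y_high_noise ~ x))$r.squared   # noticeably lower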

See Also

There are many good texts on linear regression. One of our favorites is Applied Linear Regression Models (4th ed.) by Kutner, Nachtsheim, and Neter (McGraw-Hill/Irwin). We generally follow their terminology and conventions in this chapter.

We also like Linear Models with R by Julian Faraway (Chapman & Hall), because it illustrates regression using R and is quite readable. Earlier versions of Faraway’s work are available free online, too (e.g., http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf).

Performing Simple Linear Regression

Problem

You have two vectors, x and y, that hold paired observations: (x1, y1), (x2, y2), …, (xn, yn). You believe there is a linear relationship between x and y, and you want to create a regression model of the relationship.

Solution

The lm function performs a linear regression and reports the coefficients:

set.seed(42)
x <- rnorm(100)
e <- rnorm(100, mean = 0, sd = 5)
y <- 5 + 15 * x + e

lm(y ~ x)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept)            x
#>        4.56        15.14

Discussion

Simple linear regression involves two variables: a predictor (or independent) variable, often called x; and a response (or dependent) variable, often called y. The regression uses the ordinary least-squares (OLS) algorithm to fit the linear model:

  • yi = β0 + β1xi + εi

where β0 and β1 are the regression coefficients and the εi are the error terms.

The lm function can perform linear regression. The main argument is a model formula, such as y ~ x. The formula has the response variable on the left of the tilde character (~) and the predictor variable on the right. The function estimates the regression coefficients, β0 and β1, and reports them as the intercept and the coefficient of x, respectively:

Coefficients:
(Intercept)            x
      4.558       15.136

In this case, the regression equation is:

  • yi = 4.558 + 15.136xi + εi

It is quite common for data to be captured inside a data frame, in which case you want to perform a regression between two data frame columns. Here, x and y are columns of a data frame df:

df <- data.frame(x, y)
head(df)
#>        x     y
#> 1  1.371 31.57
#> 2 -0.565  1.75
#> 3  0.363  5.43
#> 4  0.633 23.74
#> 5  0.404  7.73
#> 6 -0.106  3.94

The lm function lets you specify a data frame by using the data parameter. If you do, the function will take the variables from the data frame and not from your workspace:

lm(y ~ x, data = df)          # Take x and y from df
#>
#> Call:
#> lm(formula = y ~ x, data = df)
#>
#> Coefficients:
#> (Intercept)            x
#>        4.56        15.14

Performing Multiple Linear Regression

Problem

You have several predictor variables (e.g., u, v, and w) and a response variable y. You believe there is a linear relationship between the predictors and the response, and you want to perform a linear regression on the data.

Solution

Use the lm function. Specify the multiple predictors on the righthand side of the formula, separated by plus signs (+):

lm(y ~ u + v + w)

Discussion

Multiple linear regression is the obvious generalization of simple linear regression. It allows multiple predictor variables instead of one predictor variable and still uses OLS to compute the coefficients of a linear equation. The three-variable regression just given corresponds to this linear model:

  • yi = β0 + β1ui + β2vi + β3wi + εi

R uses the lm function for both simple and multiple linear regression. You simply add more variables to the righthand side of the model formula. The output then shows the coefficients of the fitted model:

set.seed(42)
u <- rnorm(100)
v <- rnorm(100, mean = 3,  sd = 2)
w <- rnorm(100, mean = -3, sd = 1)
e <- rnorm(100, mean = 0,  sd = 3)

y <- 5 + 4 * u + 3 * v + 2 * w + e

lm(y ~ u + v + w)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

The data parameter of lm is especially valuable when the number of variables increases, since it’s much easier to keep your data in one data frame than in many separate variables. Suppose your data is captured in a data frame, such as the df variable shown here:

df <- data.frame(y, u, v, w)
head(df)
#>       y      u     v     w
#> 1 16.67  1.371 5.402 -5.00
#> 2 14.96 -0.565 5.090 -2.67
#> 3  5.89  0.363 0.994 -1.83
#> 4 27.95  0.633 6.697 -0.94
#> 5  2.42  0.404 1.666 -4.38
#> 6  5.73 -0.106 3.211 -4.15

When we supply df to the data parameter of lm, R looks for the regression variables in the columns of the data frame:

lm(y ~ u + v + w, data = df)
#>
#> Call:
#> lm(formula = y ~ u + v + w, data = df)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

See Also

See “Performing Simple Linear Regression” for simple linear regression.

Getting Regression Statistics

Problem

You want the critical statistics and information regarding your regression, such as R2, the F statistic, confidence intervals for the coefficients, residuals, the ANOVA table, and so forth.

Solution

Save the regression model in a variable, say m:

m <- lm(y ~ u + v + w)

Then use functions to extract regression statistics and information from the model:

anova(m)

ANOVA table

coefficients(m)

Model coefficients

coef(m)

Same as coefficients(m)

confint(m)

Confidence intervals for the regression coefficients

deviance(m)

Residual sum of squares

effects(m)

Vector of orthogonal effects

fitted(m)

Vector of fitted y values

residuals(m)

Model residuals

resid(m)

Same as residuals(m)

summary(m)

Key statistics, such as R2, the F statistic, and the residual standard error (σ)

vcov(m)

Variance–covariance matrix of the main parameters

Discussion

When we started using R, the documentation said use the lm function to perform linear regression. So we did something like this, getting the output shown in “Performing Multiple Linear Regression”:

lm(y ~ u + v + w)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

How disappointing! The output was nothing compared to other statistics packages such as SAS. Where is R2? Where are the confidence intervals for the coefficients? Where is the F statistic, its p-value, and the ANOVA table?

Of course, all that information is available—you just have to ask for it. Other statistics systems dump everything and let you wade through it. R is more minimalist. It prints a bare-bones output and lets you request what more you want.

The lm function returns a model object that you can assign to a variable:

m <- lm(y ~ u + v + w)

From the model object, you can extract important information using specialized functions. The most important function is summary:

summary(m)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -5.383 -1.760 -0.312  1.856  6.984
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    4.770      0.969    4.92  3.5e-06 ***
#> u              4.173      0.260   16.07  < 2e-16 ***
#> v              3.013      0.148   20.31  < 2e-16 ***
#> w              1.905      0.266    7.15  1.7e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.66 on 96 degrees of freedom
#> Multiple R-squared:  0.885,  Adjusted R-squared:  0.882
#> F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

The summary shows the estimated coefficients. It shows the critical statistics, such as R2 and the F statistic. It shows an estimate of σ, the standard error of the residuals. The summary is so important that there is an entire recipe devoted to understanding it (“Understanding the Regression Summary”).

There are specialized extractor functions for other important information:

Model coefficients (point estimates)
    coef(m)
#> (Intercept)           u           v           w
#>        4.77        4.17        3.01        1.91
Confidence intervals for model coefficients
    confint(m)
#>             2.5 % 97.5 %
#> (Intercept)  2.85   6.69
#> u            3.66   4.69
#> v            2.72   3.31
#> w            1.38   2.43
Model residuals
    resid(m)
#>       1       2       3       4       5       6       7       8       9
#> -0.5675  2.2880  0.0972  2.1474 -0.7169 -0.3617  1.0350  2.8040 -4.2496
#>      10      11      12      13      14      15      16      17      18
#> -0.2048 -0.6467 -2.5772 -2.9339 -1.9330  1.7800 -1.4400 -2.3989  0.9245
#>      19      20      21      22      23      24      25      26      27
#> -3.3663  2.6890 -1.4190  0.7871  0.0355 -0.3806  5.0459 -2.5011  3.4516
#>      28      29      30      31      32      33      34      35      36
#>  0.3371 -2.7099 -0.0761  2.0261 -1.3902 -2.7041  0.3953  2.7201 -0.0254
#>      37      38      39      40      41      42      43      44      45
#> -3.9887 -3.9011 -1.9458 -1.7701 -0.2614  2.0977 -1.3986 -3.1910  1.8439
#>      46      47      48      49      50      51      52      53      54
#>  0.8218  3.6273 -5.3832  0.2905  3.7878  1.9194 -2.4106  1.6855 -2.7964
#>      55      56      57      58      59      60      61      62      63
#> -1.3348  3.3549 -1.1525  2.4012 -0.5320 -4.9434 -2.4899 -3.2718 -1.6161
#>      64      65      66      67      68      69      70      71      72
#> -1.5119 -0.4493 -0.9869  5.6273 -4.4626 -1.7568  0.8099  5.0320  0.1689
#>      73      74      75      76      77      78      79      80      81
#>  3.5761 -4.8668  4.2781 -2.1386 -0.9739 -3.6380  0.5788  5.5664  6.9840
#>      82      83      84      85      86      87      88      89      90
#> -3.5119  1.2842  4.1445 -0.4630 -0.7867 -0.7565  1.6384  3.7578  1.8942
#>      91      92      93      94      95      96      97      98      99
#>  0.5542 -0.8662  1.2041 -1.7401 -0.7261  3.2701  1.4012  0.9476 -0.9140
#>     100
#>  2.4278
Residual sum of squares
    deviance(m)
#> [1] 679
ANOVA table
    anova(m)
#> Analysis of Variance Table
#>
#> Response: y
#>           Df Sum Sq Mean Sq F value  Pr(>F)
#> u          1   1776    1776   251.0 < 2e-16 ***
#> v          1   3097    3097   437.7 < 2e-16 ***
#> w          1    362     362    51.1 1.7e-10 ***
#> Residuals 96    679       7
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you find it annoying to save the model in a variable, you are welcome to use one-liners such as this:

summary(lm(y ~ u + v + w))

Or you can use magrittr pipes:

lm(y ~ u + v + w) %>%
  summary

See Also

See “Understanding the Regression Summary”. See “Identifying Influential Observations” for regression statistics specific to model diagnostics.

Understanding the Regression Summary

Problem

You created a linear regression model, m. However, you are confused by the output from summary(m).

Discussion

The model summary is important because it links you to the most critical regression statistics. Here is the model summary from “Getting Regression Statistics”:

summary(m)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -5.383 -1.760 -0.312  1.856  6.984
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    4.770      0.969    4.92  3.5e-06 ***
#> u              4.173      0.260   16.07  < 2e-16 ***
#> v              3.013      0.148   20.31  < 2e-16 ***
#> w              1.905      0.266    7.15  1.7e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.66 on 96 degrees of freedom
#> Multiple R-squared:  0.885,  Adjusted R-squared:  0.882
#> F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

Let’s dissect this summary by section. We’ll read it from top to bottom—even though the most important statistic, the F statistic, appears at the end:

Call
    summary(m)$call

This shows how lm was called when it created the model, which is important for putting this summary into the proper context.

Residuals statistics
    # Residuals:
    #     Min      1Q  Median      3Q     Max
    # -5.3832 -1.7601 -0.3115  1.8565  6.9840

Ideally, the regression residuals would have a perfect, normal distribution. These statistics help you identify possible deviations from normality. The OLS algorithm is mathematically guaranteed to produce residuals with a mean of zero.[^1] Hence the sign of the median indicates the skew’s direction, and the magnitude of the median indicates the extent. In this case the median is negative, which suggests some skew to the right (a longer right tail).

If the residuals have a nice, bell-shaped distribution, then the first quartile (1Q) and third quartile (3Q) should have about the same magnitude. In this example, the slightly larger magnitude of 3Q versus 1Q (1.8565 versus 1.7601) points the same way, indicating a slight skew to the right in our data.

The Min and Max residuals offer a quick way to detect extreme outliers in the data, since extreme outliers (in the response variable) produce large residuals.
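
If you would rather compute these residual statistics yourself than read them off the summary, here is a minimal sketch using the model m from the prior recipe:

summary(resid(m))    # five-number summary plus the mean of the residuals
boxplot(resid(m))    # quick visual check for skew and outliers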

Coefficients
summary(m)$coefficients
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)     4.77      0.969    4.92 3.55e-06
#> u               4.17      0.260   16.07 5.76e-29
#> v               3.01      0.148   20.31 1.58e-36
#> w               1.91      0.266    7.15 1.71e-10

The column labeled Estimate contains the estimated regression coefficients as calculated by ordinary least squares.

Theoretically, if a variable’s coefficient is zero then the variable is worthless; it adds nothing to the model. Yet the coefficients shown here are only estimates, and they will never be exactly zero. We therefore ask: Statistically speaking, how likely is it that the true coefficient is zero? That is the purpose of the t statistics and the p-values, which in the summary are labeled (respectively) t value and Pr(>|t|).

The p-value is a probability. It gauges the likelihood that the coefficient is not significant, so smaller is better. Big is bad because it indicates a high likelihood of insignificance. In this example, the p-value for the u coefficient is a minuscule 5.76e-29, so u is almost certainly significant. The p-values for v and w are also far below the conventional limit of 0.05, so they are likely significant, too.[^2] Variables with large p-values are candidates for elimination.

A handy feature is that R flags the significant variables for quick identification. Do you notice the extreme righthand column of significance flags, such as the triple asterisks (***) shown here? That column highlights the significant variables. The line labeled "Signif. codes" at the bottom gives a cryptic guide to the flags’ meanings:

  • ***: p-value between 0 and 0.001

  • **: p-value between 0.001 and 0.01

  • *: p-value between 0.01 and 0.05

  • .: p-value between 0.05 and 0.1

  • (blank): p-value between 0.1 and 1.0

The column labeled Std. Error is the standard error of the estimated coefficient. The column labeled t value is the t statistic from which the p-value was calculated.

Residual standard error
    # Residual standard error: 2.66 on 96 degrees of freedom

This reports the standard error of the residuals (σ), that is, the sample standard deviation of ε.

R2 (coefficient of determination)
    # Multiple R-squared:  0.885,  Adjusted R-squared:  0.882

R2 is a measure of the model’s quality. Bigger is better. Mathematically, it is the fraction of the variance of y that is explained by the regression model. The remaining variance is not explained by the model, so it must be due to other factors (i.e., unknown variables or sampling variability). In this case, the model explains 0.885 (88.5%) of the variance of y, and the remaining 0.115 (11.5%) is unexplained.

That being said, we strongly suggest using the adjusted rather than the basic R2. The adjusted value accounts for the number of variables in your model and so is a more realistic assessment of its effectiveness. In this case, then, we would use 0.882, not 0.885.

F statistic
    # F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

The F statistic tells you whether the model is significant or insignificant. The model is significant if any of the coefficients are nonzero (i.e., if βi ≠ 0 for some i). It is insignificant if all coefficients are zero (β1 = β2 = … = βn = 0).

Conventionally, a p-value of less than 0.05 indicates that the model is likely significant (one or more βi are nonzero), whereas values exceeding 0.05 indicate that the model is likely not significant. Here, the p-value is less than 2e-16, so it is extremely unlikely that our model is insignificant. That’s good.

Most people look at the R2 statistic first. The statistician wisely starts with the F statistic, for if the model is not significant then nothing else matters.
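
If you want to pull these numbers out of the summary programmatically rather than read them from the printed output, here is a minimal sketch using the model m from the prior recipe:

s <- summary(m)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$sigma           # residual standard error
s$fstatistic      # F statistic with its degrees of freedom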

See Also

See “Getting Regression Statistics” for more on extracting statistics and information from the model object.

Performing Linear Regression Without an Intercept

Problem

You want to perform a linear regression, but you want to force the intercept to be zero.

Solution

Add "+ 0" to the righthand side of your regression formula. That will force lm to fit the model with a zero intercept:

lm(y ~ x + 0)

The corresponding regression equation is:

  • yi = βxi + εi

Discussion

Linear regression ordinarily includes an intercept term, so that is the default in R. In rare cases, however, you may want to fit the data while assuming that the intercept is zero. In doing so you make a modeling assumption: when x is zero, y should be zero.

When you force a zero intercept, the lm output includes a coefficient for x but no intercept for y, as shown here:

lm(y ~ x + 0)
#>
#> Call:
#> lm(formula = y ~ x + 0)
#>
#> Coefficients:
#>   x
#> 4.3

We strongly suggest you check that modeling assumption before proceeding. Perform a regression with an intercept; then see if the intercept could plausibly be zero. Check the intercept’s confidence interval. In this example, the confidence interval is (6.26, 8.84):

confint(lm(y ~ x))
#>             2.5 % 97.5 %
#> (Intercept)  6.26   8.84
#> x            2.82   5.31

Because the confidence interval does not contain zero, it is not statistically plausible that the intercept could be zero. So in this case, it is not reasonable to rerun the regression while forcing a zero intercept.

Regressing Only Variables That Highly Correlate with Your Dependent Variable

Problem

You have a data frame with many variables and you want to build a multiple linear regression using only the variables that are highly correlated to your response (dependent) variable.

Solution

If df is our data frame containing both our response (dependent) variable and all our predictor (independent) variables, and dep_var is our response variable, we can figure out our best predictors and then use them in a linear regression. If we want the top 4 predictor variables, we can use this recipe:

best_pred <- df %>%
  select(-dep_var) %>%
  map_dbl(cor, y = df$dep_var) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  df[.]

mod <- lm(df$dep_var ~ as.matrix(best_pred))

This recipe is a combination of many different pieces of logic used elsewhere in this book. We will describe each step here, then walk through it in the Discussion using some example data.

First we drop the response variable out of our pipe chain so that we have only our predictor variables in our data flow:

df %>%
  select(-dep_var)

Then we use map_dbl from purrr to perform a pairwise correlation of each column relative to the response variable:

  map_dbl(cor, y = df$dep_var) %>%

We then take the resulting correlations and sort them in decreasing order:

  sort(decreasing = TRUE) %>%

We want only the top 4 correlated variables, so we select the top 4 records in the resulting vector:

  .[1:4] %>%

And we don’t need the correlation values, only the names of the rows, which are the variable names from our original data frame df:

  names %>%

Then we can pass those names into our subsetting brackets to select only the columns with names matching the ones we want:

  df[.]

Our pipe chain assigns the resulting data frame into best_pred. We can then use best_pred as the predictor variables in our regression, and df$dep_var as the response variable:

mod <- lm(df$dep_var ~ as.matrix(best_pred))

Discussion

We can combine the mapping functions discussed in “Applying a Function to Every Column” to create a recipe that removes low-correlation variables from a set of predictors and uses the high-correlation predictors in a regression.

We have an example data frame that contains 6 predictor variables named pred1 through pred6. The response variable is named resp. Let’s walk that data frame through our logic and see how it works.

Loading the data and dropping the resp variable is pretty straightforward. So let’s look at the result of mapping the cor function:

# loads the pred data frame
load("./data/pred.rdata")

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp)
#> pred1 pred2 pred3 pred4 pred5 pred6
#> 0.573 0.279 0.753 0.799 0.322 0.607

The output is a named vector of values where the names are the variable names and the values are the pairwise correlations between each predictor variable and resp, the response variable.

If we sort this vector, we get the correlations in decreasing order:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE)
#> pred4 pred3 pred6 pred1 pred5 pred2
#> 0.799 0.753 0.607 0.573 0.322 0.279

Using subsetting allows us to select the top 4 records. The . operator is a special operator that tells the pipe where to put the result of the prior step.

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4]
#> pred4 pred3 pred6 pred1
#> 0.799 0.753 0.607 0.573

We then use the names function to extract the names from our vector. The names are the names of the columns we ultimately want to use as our independent variables:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names
#> [1] "pred4" "pred3" "pred6" "pred1"

When we pass the vector of names into pred[.], the names are used to select columns from the pred data frame. We then use head to show only the first six rows for easier illustration:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  pred[.] %>%
  head
#>    pred4   pred3  pred6  pred1
#> 1  7.252  1.5127  0.560  0.206
#> 2  2.076  0.2579 -0.124 -0.361
#> 3 -0.649  0.0884  0.657  0.758
#> 4  1.365 -0.1209  0.122 -0.727
#> 5 -5.444 -1.1943 -0.391 -1.368
#> 6  2.554  0.6120  1.273  0.433

Now let’s bring it all together and pass the resulting data into the regression:

best_pred <- pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  pred[.]

mod <- lm(pred$resp ~ as.matrix(best_pred))
summary(mod)
#>
#> Call:
#> lm(formula = pred$resp ~ as.matrix(best_pred))
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -1.485 -0.619  0.189  0.562  1.398
#>
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                  1.117      0.340    3.28   0.0051 **
#> as.matrix(best_pred)pred4    0.523      0.207    2.53   0.0231 *
#> as.matrix(best_pred)pred3   -0.693      0.870   -0.80   0.4382
#> as.matrix(best_pred)pred6    1.160      0.682    1.70   0.1095
#> as.matrix(best_pred)pred1    0.343      0.359    0.95   0.3549
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.927 on 15 degrees of freedom
#> Multiple R-squared:  0.838,  Adjusted R-squared:  0.795
#> F-statistic: 19.4 on 4 and 15 DF,  p-value: 8.59e-06

Performing Linear Regression with Interaction Terms

Problem

You want to include an interaction term in your regression.

Solution

The R syntax for regression formulas lets you specify interaction terms. The interaction of two variables, u and v, is indicated by separating their names with an asterisk (*):

lm(y ~ u*v)

This corresponds to the model yi = β0 + β1ui + β2vi + β3uivi + εi, which includes the first-order interaction term β3uivi.

Discussion

In regression, an interaction occurs when the product of two predictor variables is also a significant predictor (i.e., in addition to the predictor variables themselves). Suppose we have two predictors, u and v, and want to include their interaction in the regression. This is expressed by the following equation:

  • yi = β0 + β1ui + β2vi + β3uivi + εi

Here the product term, β3uivi, is called the interaction term. The R formula for that equation is:

y ~ u * v

When you write y ~ u*v, R automatically includes u, v, and their product in the model. This is for a good reason. If a model includes an interaction term, such as β3uivi, then regression theory tells us the model should also contain the constituent variables ui and vi.

Likewise, if you have three predictors (u, v, and w) and want to include all their interactions, separate them by asterisks:

y ~ u * v * w

This corresponds to the regression equation:

  • yi = β0 + β1ui + β2vi + β3wi + β4uivi + β5uiwi + β6viwi + β7uiviwi + εi

Now we have all the first-order interactions and a second-order interaction (β7uiviwi).

Sometimes, however, you may not want every possible interaction. You can explicitly specify a single product by using the colon operator (:). For example, u:v:w denotes the product term β4uiviwi but without all possible interactions. So the R formula:

y ~ u + v + w + u:v:w

corresponds to the regression equation:

  • yi = β0 + β1ui + β2vi + β3wi + β4uiviwi + εi

It might seem odd that colon (:) means pure multiplication while asterisk (*) means both multiplication and inclusion of constituent terms. Again, this is because we normally incorporate the constituents when we include their interaction, so making that the default for asterisk makes sense.

There is some additional syntax for easily specifying many interactions:

(u + v + ... + w)^2

: Include all variables (u, v, …, w) and all their first-order interactions.

(u + v + ... + w)^3

: Include all variables, all their first-order interactions, and all their second-order interactions.

(u + v + ... + w)^4

: And so forth.

Both the asterisk (*) and the colon (:) follow a “distributive law”, so the following notations are also allowed:

x*(u + v + ... + w)

: Same as x*u + x*v + ... + x*w (which is the same as x + u + v + ... + w + x:u + x:v + ... + x:w).

x:(u + v + ... + w)

: Same as x:u + x:v + ... + x:w.

All this syntax gives you some flexibility in writing your formula. For example, these three formulas are equivalent:

y ~ u * v
y ~ u + v + u:v
y ~ (u + v) ^ 2

They all define the same regression equation, yi = β0 + β1ui + β2vi + β3uivi + εi.
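
To convince yourself of the equivalence, you can fit all three forms on a small simulated dataset and compare the coefficients. (The data here is our own throwaway illustration, not one of the book’s example files.)

# simulate two predictors and a response with a true interaction
set.seed(1)
u <- rnorm(30)
v <- rnorm(30)
y <- 1 + 2 * u - v + 0.5 * u * v + rnorm(30)

coef(lm(y ~ u * v))
coef(lm(y ~ u + v + u:v))
coef(lm(y ~ (u + v) ^ 2))

All three calls return the same four coefficients: the intercept, u, v, and the u:v interaction.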

See Also

The full syntax for formulas is richer than described here. See R in a Nutshell (O’Reilly) or the R Language Definition for more details.

Selecting the Best Regression Variables

Problem

You are creating a new regression model or improving an existing model. You have the luxury of many regression variables, and you want to select the best subset of those variables.

Solution

The step function can perform stepwise regression, either forward or backward. Backward stepwise regression starts with many variables and removes the underperformers:

full.model <- lm(y ~ x1 + x2 + x3 + x4)
reduced.model <- step(full.model, direction = "backward")

Forward stepwise regression starts with a few variables and adds new ones to improve the model until it cannot be improved further:

min.model <- lm(y ~ 1)
fwd.model <-
  step(min.model,
       direction = "forward",
       scope = (~ x1 + x2 + x3 + x4))

Discussion

When you have many predictors, it can be quite difficult to choose the best subset. Adding and removing individual variables affects the overall mix, so the search for “the best” can become tedious.

The step function automates that search. Backward stepwise regression is the easiest approach. Start with a model that includes all the predictors. We call that the full model. The model summary, shown here, indicates that not all predictors are statistically significant:

# example data
set.seed(4)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n, 1, 2)
x3 <- rnorm(n, 3, 1)
x4 <- rnorm(n,-2, 2)
e <- rnorm(n, 0, 3)
y <- 4 + x1 + 5 * x3 + e

# build the model
full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x4)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.032 -1.774  0.158  2.032  6.626
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.40224    0.80767    4.21  4.4e-05 ***
#> x1           0.53937    0.25935    2.08    0.039 *
#> x2           0.16831    0.12291    1.37    0.173
#> x3           5.17410    0.23983   21.57  < 2e-16 ***
#> x4          -0.00982    0.12954   -0.08    0.940
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 145 degrees of freedom
#> Multiple R-squared:  0.77,   Adjusted R-squared:  0.763
#> F-statistic:  121 on 4 and 145 DF,  p-value: <2e-16

We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:

reduced.model <- step(full.model, direction="backward")
#> Start:  AIC=327
#> y ~ x1 + x2 + x3 + x4
#>
#>        Df Sum of Sq  RSS AIC
#> - x4    1         0 1240 325
#> - x2    1        16 1256 327
#> <none>              1240 327
#> - x1    1        37 1277 329
#> - x3    1      3979 5219 540
#>
#> Step:  AIC=325
#> y ~ x1 + x2 + x3
#>
#>        Df Sum of Sq  RSS AIC
#> - x2    1        16 1256 325
#> <none>              1240 325
#> - x1    1        37 1277 327
#> - x3    1      3988 5228 539
#>
#> Step:  AIC=325
#> y ~ x1 + x3
#>
#>        Df Sum of Sq  RSS AIC
#> <none>              1256 325
#> - x1    1        44 1300 328
#> - x3    1      3974 5230 537

The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:

summary(reduced.model)
#>
#> Call:
#> lm(formula = y ~ x1 + x3)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.148 -1.850 -0.055  2.026  6.550
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    3.648      0.751    4.86    3e-06 ***
#> x1             0.582      0.255    2.28    0.024 *
#> x3             5.147      0.239   21.57   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 147 degrees of freedom
#> Multiple R-squared:  0.767,  Adjusted R-squared:  0.763
#> F-statistic:  241 on 2 and 147 DF,  p-value: <2e-16

Backward stepwise regression is easy, but sometimes it’s not feasible to start with “everything” because you have too many candidate variables. In that case use forward stepwise regression, which will start with nothing and incrementally add variables that improve the regression. It stops when no further improvement is possible.

A model that “starts with nothing” may look odd at first:

min.model <- lm(y ~ 1)

This is a model with a response variable (y) but no predictor variables. (All the fitted values for y are simply the mean of y, which is what you would guess if no predictors were available.)

We must tell step which candidate variables are available for inclusion in the model. That is the purpose of the scope argument. The scope is a formula with nothing on the lefthand side of the tilde (~) and candidate variables on the righthand side:

fwd.model <- step(
  min.model,
  direction = "forward",
  scope = (~ x1 + x2 + x3 + x4),
  trace = 0
)

Here we see that x1, x2, x3, and x4 are all candidates for inclusion. (We also included trace=0 to inhibit the voluminous output from step.) The resulting model has two significant predictors and no insignificant predictors:

summary(fwd.model)
#>
#> Call:
#> lm(formula = y ~ x3 + x1)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.148 -1.850 -0.055  2.026  6.550
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    3.648      0.751    4.86    3e-06 ***
#> x3             5.147      0.239   21.57   <2e-16 ***
#> x1             0.582      0.255    2.28    0.024 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 147 degrees of freedom
#> Multiple R-squared:  0.767,  Adjusted R-squared:  0.763
#> F-statistic:  241 on 2 and 147 DF,  p-value: <2e-16

The step-forward algorithm reached the same model as the step-backward model by including x1 and x3 but excluding x2 and x4. This is a toy example, so that is not surprising. In real applications, we suggest trying both the forward and the backward regression and then comparing the results. You might be surprised.

Finally, don’t get carried away by stepwise regression. It is not a panacea, it cannot turn junk into gold, and it is definitely not a substitute for choosing predictors carefully and wisely. You might think: “Oh boy! I can generate every possible interaction term for my model, then let step choose the best ones! What a model I’ll get!” You’d be thinking of something like this, which starts with all possible interactions then tries to reduce the model:

full.model <- lm(y ~ (x1 + x2 + x3 + x4) ^ 4)
reduced.model <- step(full.model, direction = "backward")
#> Start:  AIC=337
#> y ~ (x1 + x2 + x3 + x4)^4
#>
#>               Df Sum of Sq  RSS AIC
#> - x1:x2:x3:x4  1    0.0321 1145 335
#> <none>                     1145 337
#>
#> Step:  AIC=335
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4 + x1:x3:x4 + x2:x3:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x2:x3:x4  1      0.76 1146 333
#> - x1:x3:x4  1      8.37 1154 334
#> <none>                  1145 335
#> - x1:x2:x4  1     20.95 1166 336
#> - x1:x2:x3  1     25.18 1170 336
#>
#> Step:  AIC=333
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4 + x1:x3:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x1:x3:x4  1      8.74 1155 332
#> <none>                  1146 333
#> - x1:x2:x4  1     21.72 1168 334
#> - x1:x2:x3  1     26.51 1172 334
#>
#> Step:  AIC=332
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x3:x4     1      0.29 1155 330
#> <none>                  1155 332
#> - x1:x2:x4  1     23.24 1178 333
#> - x1:x2:x3  1     31.11 1186 334
#>
#> Step:  AIC=330
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x1:x2:x3 + x1:x2:x4
#>
#>            Df Sum of Sq  RSS AIC
#> <none>                  1155 330
#> - x1:x2:x4  1      23.4 1178 331
#> - x1:x2:x3  1      31.5 1187 332

This does not work well. Most of the interaction terms are meaningless. The step function becomes overwhelmed, and you are left with many insignificant terms.

Regressing on a Subset of Your Data

Problem

You want to fit a linear model to a subset of your data, not to the entire dataset.

Solution

The lm function has a subset parameter that specifies which data elements should be used for fitting. The parameter’s value can be any index expression that could index your data. This shows a fitting that uses only the first 100 observations:

lm(y ~ x1, subset = 1:100)        # Use only the first 100 observations

Discussion

You will often want to regress only a subset of your data. This can happen, for example, when using in-sample data to create the model and out-of-sample data to test it.

The lm function has a parameter, subset, that selects the observations used for fitting. The value of subset is a vector. It can be a vector of index values, in which case lm selects only the indicated observations from your data. It can also be a logical vector, the same length as your data, in which case lm selects the observations with a corresponding TRUE.

Suppose you have 1,000 observations of (x, y) pairs and want to fit your model using only the first half of those observations. Use a subset parameter of 1:500, indicating lm should use observations 1 through 500:

## example data
n <- 1000
x <- rnorm(n)
e <- rnorm(n, 0, .5)
y <- 3 + 2 * x + e
lm(y ~ x, subset = 1:500)
#>
#> Call:
#> lm(formula = y ~ x, subset = 1:500)
#>
#> Coefficients:
#> (Intercept)            x
#>           3            2

More generally, you can use the expression 1:floor(length(x)/2) to select the first half of your data, regardless of size:

lm(y ~ x, subset = 1:floor(length(x) / 2))
#>
#> Call:
#> lm(formula = y ~ x, subset = 1:floor(length(x)/2))
#>
#> Coefficients:
#> (Intercept)            x
#>           3            2

Let’s say your data was collected in several labs and you have a factor, lab, that identifies the lab of origin. You can limit your regression to observations collected in New Jersey by using a logical vector that is TRUE only for those observations:

load('./data/lab_df.rdata')
lm(y ~ x, subset = (lab == "NJ"), data = lab_df)
#>
#> Call:
#> lm(formula = y ~ x, data = lab_df, subset = (lab == "NJ"))
#>
#> Coefficients:
#> (Intercept)            x
#>        2.58         5.03
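
The subset parameter also makes a quick in-sample/out-of-sample check easy. The following sketch reuses the simulated x and y from above: it fits the model on the first half of the observations and then measures prediction error on the held-out second half. (The RMSE calculation is our own illustration, not part of the recipe.)

in_sample <- 1:floor(length(x) / 2)              # first half of the data
m <- lm(y ~ x, subset = in_sample)

out_sample <- setdiff(seq_along(x), in_sample)   # the remaining half
preds <- predict(m, newdata = data.frame(x = x[out_sample]))
sqrt(mean((y[out_sample] - preds) ^ 2))          # out-of-sample RMSE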

Using an Expression Inside a Regression Formula

Problem

You want to regress on calculated values, not simple variables, but the syntax of a regression formula seems to forbid that.

Solution

Embed the expressions for the calculated values inside the I(...) operator. That will force R to calculate the expression and use the calculated value for the regression.

Discussion

If you want to regress on the sum of u and v, then this is your regression equation:

  • yi = β0 + β1(ui + vi) + εi

How do you write that equation as a regression formula? This won’t work:

lm(y ~ u + v)    # Not quite right

Here R will interpret u and v as two separate predictors, each with its own regression coefficient. Likewise, suppose your regression equation is:

  • yi = β0 + β1ui + β2ui2 + εi

This won’t work:

lm(y ~ u + u ^ 2)  # That's an interaction, not a quadratic term

R will interpret u^2 as an interaction term (“Performing Linear Regression with Interaction Terms”) and not as the square of u.

The solution is to surround the expressions by the I(...) operator, which inhibits the expressions from being interpreted as a regression formula. Instead, it forces R to calculate the expression’s value and then incorporate that value directly into the regression. Thus the first example becomes:

lm(y ~ I(u + v))

In response to that command, R computes u + v and then regresses y on the sum.

For the second example we use:

lm(y ~ u + I(u ^ 2))

Here R computes the square of u and then regresses y on u and u2.

All the basic binary operators (+, -, *, /, ^) have special meanings inside a regression formula. For this reason, you must use the I(...) operator whenever you incorporate calculated values into a regression.

A beautiful aspect of these embedded transformations is that R remembers the transformations and applies them when you make predictions from the model. Consider the quadratic model described by the second example. It uses u and u^2, but we supply the value of u only and R does the heavy lifting. We don’t need to calculate the square of u ourselves:

load('./data/df_squared.rdata')
m <- lm(y ~ u + I(u ^ 2), data = df_squared)
predict(m, newdata = data.frame(u = 13.4))
#>   1
#> 877

See Also

See “Regressing on a Polynomial” for the special case of regression on a polynomial. See “Regressing on Transformed Data” for incorporating other data transformations into the regression.

Regressing on a Polynomial

Problem

You want to regress y on a polynomial of x.

Solution

Use the poly(x,n) function in your regression formula to regress on an n-degree polynomial of x. This example models y as a cubic function of x:

lm(y ~ poly(x, 3, raw = TRUE))

The example’s formula corresponds to the following cubic regression equation:

  • yi = β0 + β1xi + β2xi2 + β3xi3 + εi

Discussion

When a person first uses a polynomial model in R, they often do something clunky like this:

x_sq <- x ^ 2
x_cub <- x ^ 3
m <- lm(y ~ x + x_sq + x_cub)

Obviously, this is quite annoying, and it litters your workspace with extra variables.

It’s much easier to write:

m <- lm(y ~ poly(x, 3, raw = TRUE))

The raw=TRUE is necessary. Without it, the poly function computes orthogonal polynomials instead of simple polynomials.

Beyond the convenience, a huge advantage is that R will calculate all those powers of x when you make predictions from the model (“Predicting New Values”). Without that, you are stuck calculating x2 and x3 yourself every time you employ the model.

Here is another good reason to use poly. You cannot write your regression formula in this way:

lm(y ~ x + x^2 + x^3)     # Does not do what you think!

R will interpret x^2 and x^3 as interaction terms, not as powers of x. The resulting model is a one-term linear regression, completely unlike your expectation. You could write the regression formula like this:

lm(y ~ x + I(x ^ 2) + I(x ^ 3))

But that’s getting pretty verbose. Just use poly.

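A small runnable sketch, using simulated data of our own rather than one of the book’s data files, confirms that poly with raw = TRUE fits exactly the same model as the verbose I() formulation:

# simulate data from a cubic relationship
set.seed(42)
x <- seq(-3, 3, length.out = 50)
y <- 2 + x - 0.5 * x ^ 2 + 0.25 * x ^ 3 + rnorm(50, sd = 0.5)

m_poly <- lm(y ~ poly(x, 3, raw = TRUE))
m_long <- lm(y ~ x + I(x ^ 2) + I(x ^ 3))

all.equal(unname(coef(m_poly)), unname(coef(m_long)))   # TRUE: identical fits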

See Also

See “Performing Linear Regression with Interaction Terms” for more about interaction terms. See “Regressing on Transformed Data” for other transformations on regression data.

Regressing on Transformed Data

Problem

You want to build a regression model for x and y, but they do not have a linear relationship.

Solution

You can embed the needed transformation inside the regression formula. If, for example, y must be transformed into log(y), then the regression formula becomes:

lm(log(y) ~ x)

Discussion

A critical assumption behind the lm function for regression is that the variables have a linear relationship. To the extent this assumption is false, the resulting regression becomes meaningless.

Fortunately, many datasets can be transformed into a linear relationship before applying lm.

Figure 12-1. Example of a Data Transform

Figure 12-1 shows an example of exponential decay. The left panel shows the original data, z. The dotted line shows a linear regression on the original data; clearly, it’s a lousy fit. If the data is really exponential, then a possible model is:

  • z = exp[β0 + β1t + ε]

where t is time and exp[⋅] is the exponential function (ex). This is not linear, of course, but we can linearize it by taking logarithms:

  • log(z) = β0 + β1t + ε

In R, that regression is simple because we can embed the log transform directly into the regression formula:

# read in our example data
load(file = './data/df_decay.rdata')
z <- df_decay$z
t <- df_decay$time

# transform and model
m <- lm(log(z) ~ t)
summary(m)
#>
#> Call:
#> lm(formula = log(z) ~ t)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.4479 -0.0993  0.0049  0.0978  0.2802
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.6887     0.0306    22.5   <2e-16 ***
#> t            -2.0118     0.0351   -57.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.148 on 98 degrees of freedom
#> Multiple R-squared:  0.971,  Adjusted R-squared:  0.971
#> F-statistic: 3.28e+03 on 1 and 98 DF,  p-value: <2e-16

The right panel of Figure 12-1 shows the plot of log(z) versus time. Superimposed on that plot is the regression line. The fit appears to be much better; this is confirmed by an R2 of 0.97, compared with 0.82 for the linear regression on the original data.

You can embed other functions inside your formula. If you thought the relationship was quadratic, you could use a square-root transformation:

lm(sqrt(y) ~ month)

You can apply transformations to variables on both sides of the formula, of course. This formula regresses y on the square root of x:

lm(y ~ sqrt(x))

This regression is for a log-log relationship between x and y:

lm(log(y) ~ log(x))

Finding the Best Power Transformation (Box–Cox Procedure)

Problem

You want to improve your linear model by applying a power transformation to the response variable.

Solution

Use the Box–Cox procedure, which is implemented by the boxcox function of the MASS package. The procedure will identify a power, λ, such that transforming y into yλ will improve the fit of your model:

library(MASS)
m <- lm(y ~ x)
boxcox(m)

Discussion

To illustrate the Box–Cox transformation, let’s create some artificial data using the equation y^(−1.5) = x + ε, where ε is an error term:

set.seed(9)
x <- 10:100
eps <- rnorm(length(x), sd = 5)
y <- (x + eps) ^ (-1 / 1.5)

Then we will (mistakenly) model the data using a simple linear regression and derive an adjusted R2 of 0.6374:

m <- lm(y ~ x)
summary(m)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.04032 -0.01633 -0.00792  0.00996  0.14516
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.166885   0.007078    23.6   <2e-16 ***
#> x           -0.001465   0.000116   -12.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0291 on 89 degrees of freedom
#> Multiple R-squared:  0.641,  Adjusted R-squared:  0.637
#> F-statistic:  159 on 1 and 89 DF,  p-value: <2e-16

When plotting the residuals against the fitted values, we get a clue that something is wrong:

plot(m, which = 1)       # Plot only the fitted vs residuals
Figure 12-2. Fitted Values vs Residuals

We used the Base R plot function to plot the residuals vs the fitted values in Figure 12-2. We can see this plot has a clear parabolic shape. A possible fix is a power transformation on y, so we run the Box–Cox procedure:

library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
bc <- boxcox(m)
Figure 12-3. Output of boxcox on the Model (m)

The boxcox function plots values of λ against the log-likelihood of the resulting model as shown in Figure 12-3. We want to maximize that log-likelihood, so the function draws a line at the best value and also draws lines at the limits of its confidence interval. In this case, it looks like the best value is around −1.5, with a confidence interval of about (−1.75, −1.25).

Oddly, the boxcox function does not return the best value of λ. Rather, it returns the (x, y) pairs displayed in the plot. It’s pretty easy to find the value of λ that yields the largest log-likelihood. We use the which.max function to locate its position:

which.max(bc$y)
#> [1] 13

Then this gives us the corresponding λ:

lambda <- bc$x[which.max(bc$y)]
lambda
#> [1] -1.52

The function reports that the best λ is −1.515. In an actual application, we would urge you to interpret this number and choose the power that makes sense to you, rather than blindly accepting this “best” value. Use the graph to assist you in that interpretation. Here, we’ll go with −1.515.

We can apply the power transform to y and then fit the revised model; this gives a much better R2 of 0.9668:

z <- y ^ lambda
m2 <- lm(z ~ x)
summary(m2)
#>
#> Call:
#> lm(formula = z ~ x)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -13.459  -3.711  -0.228   2.206  14.188
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.6426     1.2517   -0.51     0.61
#> x             1.0514     0.0205   51.20   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.15 on 89 degrees of freedom
#> Multiple R-squared:  0.967,  Adjusted R-squared:  0.967
#> F-statistic: 2.62e+03 on 1 and 89 DF,  p-value: <2e-16

For those who prefer one-liners, the transformation can be embedded right into the revised regression formula:

m2 <- lm(I(y ^ lambda) ~ x)

By default, boxcox searches for values of λ in the range −2 to +2. You can change that via the lambda argument; see the help page for details.

We suggest viewing the Box–Cox result as a starting point, not as a definitive answer. If the confidence interval for λ includes 1.0, it may be that no power transformation is actually helpful. As always, inspect the residuals before and after the transformation. Did they really improve?

Forming Confidence Intervals for Regression Coefficients

Problem

You are performing linear regression and you need the confidence intervals for the regression coefficients.

Solution

Save the regression model in an object; then use the confint function to extract confidence intervals:

load(file = './data/conf.rdata')
m <- lm(y ~ x1 + x2)
confint(m)
#>             2.5 % 97.5 %
#> (Intercept) -3.90   6.47
#> x1          -2.58   6.24
#> x2           4.67   5.17

Discussion

The Solution uses the model yi = β0 + β1(x1)i + β2(x2)i + εi. The confint function returns the confidence intervals for the intercept (β0), the coefficient of x1 (β1), and the coefficient of x2 (β2):

confint(m)
#>             2.5 % 97.5 %
#> (Intercept) -3.90   6.47
#> x1          -2.58   6.24
#> x2           4.67   5.17

By default, confint uses a confidence level of 95%. Use the level parameter to select a different level:

confint(m, level = 0.99)
#>             0.5 % 99.5 %
#> (Intercept) -5.72   8.28
#> x1          -4.12   7.79
#> x2           4.58   5.26

See Also

The coefplot function of the arm package can plot confidence intervals for regression coefficients.
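
If you have arm installed, plotting those intervals is a one-liner. This is just a sketch; the exact appearance and defaults depend on your version of the package:

library(arm)
coefplot(m)     # plot each coefficient estimate with interval bars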

Plotting Regression Residuals

Problem

You want a visual display of your regression residuals.

Solution

You can plot the model object by selecting the residuals plot from the available plots:

m <- lm(y ~ x1 + x2)
plot(m, which = 1)
Figure 12-4. Model Residual Plot

The output is shown in Figure 12-4.

Discussion

Normally, plotting a regression model object produces several diagnostic plots. You can select just the residuals plot by specifying which=1.

The graph above shows a plot of the residuals from “Performing Simple Linear Regression”. R draws a smoothed line through the residuals as a visual aid to finding significant patterns—for example, a slope or a parabolic shape.

See Also

See “Diagnosing a Linear Regression”, which contains examples of residuals plots and other diagnostic plots.

Diagnosing a Linear Regression

Problem

You have performed a linear regression. Now you want to verify the model’s quality by running diagnostic checks.

Solution

Start by plotting the model object, which will produce several diagnostic plots:

m <- lm(y ~ x1 + x2)
plot(m)

Next, identify possible outliers either by looking at the diagnostic plot of the residuals or by using the outlierTest function of the car package:

library(car)
#> Loading required package: carData
#>
#> Attaching package: 'car'
#> The following object is masked from 'package:dplyr':
#>
#>     recode
#> The following object is masked from 'package:purrr':
#>
#>     some
outlierTest(m)
#> No Studentized residuals with Bonferonni p < 0.05
#> Largest |rstudent|:
#>   rstudent unadjusted p-value Bonferonni p
#> 2     2.27             0.0319        0.956

Finally, identify any overly influential observations (“Identifying Influential Observations”).

Discussion

R fosters the impression that linear regression is easy: just use the lm function. Yet fitting the data is only the beginning. It’s your job to decide whether the fitted model actually works and works well.

Before anything else, you must have a statistically significant model. Check the F statistic from the model summary (“Understanding the Regression Summary”) and be sure that the p-value is small enough for your purposes. Conventionally, it should be less than 0.05 or else your model is likely not very meaningful.
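
If you would rather extract that overall p-value programmatically than read it off the printed summary, one approach (our own sketch, using only base R) is to recompute it from the F statistic stored in the summary object:

f <- summary(m)$fstatistic                          # value, numdf, dendf
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)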

Simply plotting the model object produces several useful diagnostic plots, shown in Figure 12-5:

length(x1)
#> [1] 30
length(x2)
#> [1] 30
length(y)
#> [1] 30

m <- lm(y ~ x1 + x2)
par(mfrow = c(2, 2))   # this gives us a 2x2 plot
plot(m)
Figure 12-5. Diagnostics of a Good Fit

Figure 12-5 shows diagnostic plots for a pretty good regression:

  • The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.

  • The points in the Normal Q–Q plot are more-or-less on the line, indicating that the residuals follow a normal distribution.

  • In both the Scale–Location plot and the Residuals vs Leverage plots, the points are in a group with none too far from the center.

In contrast, the series of graphs shown in Figure 12-6 show the diagnostics for a not-so-good regression:

load(file = './data/bad.rdata')
m <- lm(y2 ~ x3 + x4)
par(mfrow = c(2, 2))        # this gives us a 2x2 plot
plot(m)
Figure 12-6. Diagnostics of a Poor Fit

Observe that the Residuals vs Fitted plot has a definite parabolic shape. This tells us that the model is incomplete: a quadratic factor is missing that could explain more variation in y. Other patterns in residuals would be suggestive of additional problems: a cone shape, for example, may indicate nonconstant variance in y. Interpreting those patterns is a bit of an art, so we suggest reviewing a good book on linear regression while evaluating the plot of residuals.

There are other problems with the not-so-good diagnostics above. The Normal Q–Q plot has more points off the line than it does for the good regression. Both the Scale–Location and Residuals vs Leverage plots show points scattered away from the center, which suggests that some points have excessive leverage.

Another pattern is that point number 28 sticks out in every plot. This warns us that something is odd with that observation. The point could be an outlier, for example. We can check that hunch with the outlierTest function of the car package:

outlierTest(m)
#>    rstudent unadjusted p-value Bonferonni p
#> 28     4.46           7.76e-05       0.0031

The outlierTest identifies the model’s most outlying observation. In this case, it identified observation number 28 and so confirmed that it could be an outlier.

See Also

See recipes “Understanding the Regression Summary” and “Identifying Influential Observations”. The car package is not part of the standard distribution of R; see “Installing Packages from CRAN”.

Identifying Influential Observations

Problem

You want to identify the observations that are having the most influence on the regression model. This is useful for diagnosing possible problems with the data.

Solution

The influence.measures function reports several useful statistics for identifying influential observations, and it flags the significant ones with an asterisk (*). Its main argument is the model object from your regression:

influence.measures(m)

Discussion

The title of this recipe could be “Identifying Overly Influential Observations”, but that would be redundant. All observations influence the regression model, even if only a little. When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them.

The influence.measures function reports several statistics: DFBETAS, DFFITS, covariance ratio, Cook’s distance, and hat matrix values. If any of these measures indicate that an observation is influential, the function flags that observation with an asterisk (*) along the righthand side:

influence.measures(m)
#> Influence measures of
#>   lm(formula = y2 ~ x3 + x4) :
#>
#>      dfb.1_   dfb.x3   dfb.x4    dffit cov.r   cook.d    hat inf
#> 1  -0.18784  0.15174  0.07081 -0.22344 1.059 1.67e-02 0.0506
#> 2   0.27637 -0.04367 -0.39042  0.45416 1.027 6.71e-02 0.0964
#> 3  -0.01775 -0.02786  0.01088 -0.03876 1.175 5.15e-04 0.0772
#> 4   0.15922 -0.14322  0.25615  0.35766 1.133 4.27e-02 0.1156
#> 5  -0.10537  0.00814 -0.06368 -0.13175 1.078 5.87e-03 0.0335
#> 6   0.16942  0.07465  0.42467  0.48572 1.034 7.66e-02 0.1062
etc ...

This is the model from “Diagnosing a Linear Regression”, where we suspected that observation 28 was an outlier. An asterisk is flagging that observation, confirming that it’s overly influential.

This recipe can identify influential observations, but you shouldn’t reflexively delete them. Some judgment is required here. Are those observations improving your model or damaging it?
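
When the full influence table is long, you can ask R to show only the flagged rows. A short sketch, using the same model m as above:

infl <- influence.measures(m)
summary(infl)                         # print only the potentially influential observations
which(apply(infl$is.inf, 1, any))     # row numbers of the flagged observations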

See Also

See “Diagnosing a Linear Regression”. Use help(influence.measures) to get a list of influence measures and some related functions. See a regression textbook for interpretations of the various influence measures.

Testing Residuals for Autocorrelation (Durbin–Watson Test)

Problem

You have performed a linear regression and want to check the residuals for autocorrelation.

Solution

The Durbin–Watson test can check the residuals for autocorrelation. The test is implemented by the dwtest function of the lmtest package:

library(lmtest)
m <- lm(y ~ x)           # Create a model object
dwtest(m)                # Test the model residuals

The output includes a p-value. Conventionally, if p < 0.05 then the residuals are significantly correlated whereas p > 0.05 provides no evidence of correlation.

You can perform a visual check for autocorrelation by graphing the autocorrelation function (ACF) of the residuals:

acf(m)                   # Plot the ACF of the model residuals

Discussion

The Durbin–Watson test is often used in time series analysis, but it was originally created for diagnosing autocorrelation in regression residuals. Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.

This first example builds a simple regression model and then tests the residuals for autocorrelation. The test returns a p-value well above zero, which indicates that there is no significant autocorrelation:

library(lmtest)
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
load(file = './data/ac.rdata')
m <- lm(y1 ~ x)
dwtest(m)
#>
#>  Durbin-Watson test
#>
#> data:  m
#> DW = 2, p-value = 0.4
#> alternative hypothesis: true autocorrelation is greater than 0

This second example exhibits autocorrelation in the residuals. The p-value is near 0, so the autocorrelation is likely positive:

m <- lm(y2 ~ x)
dwtest(m)
#>
#>  Durbin-Watson test
#>
#> data:  m
#> DW = 2, p-value = 0.01
#> alternative hypothesis: true autocorrelation is greater than 0

By default, dwtest performs a one-sided test and answers this question: Is the autocorrelation of the residuals greater than zero? If your model could exhibit negative autocorrelation (yes, that is possible), then you should use the alternative option to perform a two-sided test:

dwtest(m, alternative = "two.sided")

The Durbin–Watson test is also implemented by the durbinWatsonTest function of the car package. We suggested the dwtest function primarily because we think the output is easier to read.

See Also

Neither the lmtest package nor the car package is included in the standard distribution of R; see recipes @ref(recipe-id013) “Accessing the Functions in a Package” and @ref(recipe-id012) “Installing Packages from CRAN”. See recipes @ref(recipe-id082) X-X and X-X for more regarding tests of autocorrelation.

Predicting New Values

Problem

You want to predict new values from your regression model.

Solution

Save the predictor data in a data frame. Use the predict function, setting the newdata parameter to the data frame:

load(file = './data/pred2.rdata')

m <- lm(y ~ u + v + w)
preds <- data.frame(u = 3.1, v = 4.0, w = 5.5)
predict(m, newdata = preds)
#>  1
#> 45

Discussion

Once you have a linear model, making predictions is quite easy because the predict function does all the heavy lifting. The only annoyance is arranging for a data frame to contain your data.

The predict function returns a vector of predicted values with one prediction for every row in the data. The example in the Solution contains one row, so predict returned one value.

If your predictor data contains several rows, you get one prediction per row:

preds <- data.frame(
  u = c(3.0, 3.1, 3.2, 3.3),
  v = c(3.9, 4.0, 4.1, 4.2),
  w = c(5.3, 5.5, 5.7, 5.9)
)
predict(m, newdata = preds)
#>    1    2    3    4
#> 43.8 45.0 46.3 47.5

In case it’s not obvious: the new data needn’t contain values for response variables, only predictor variables. After all, you are trying to calculate the response, so it would be unreasonable of R to expect you to supply it.

See Also

These are just the point estimates of the predictions. See “Forming Prediction Intervals” for the confidence intervals.

Forming Prediction Intervals

Problem

You are making predictions using a linear regression model. You want to know the prediction intervals: the range of the distribution of the prediction.

Solution

Use the predict function and specify interval="prediction":

predict(m, newdata = preds, interval = "prediction")

Discussion

This is a continuation of “Predicting New Values”, which described packaging your data into a data frame for the predict function. We are adding interval="prediction" to obtain prediction intervals.

Here is the example from “Predicting New Values”, now with prediction intervals. The new lwr and upr columns are the lower and upper limits, respectively, for the interval:

predict(m, newdata = preds, interval = "prediction")
#>    fit  lwr  upr
#> 1 43.8 38.2 49.4
#> 2 45.0 39.4 50.7
#> 3 46.3 40.6 51.9
#> 4 47.5 41.8 53.2

By default, predict uses a confidence level of 0.95. You can change this via the level argument.
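
For example, you could widen the intervals to a 99% level like this:

predict(m, newdata = preds, interval = "prediction", level = 0.99)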

A word of caution: these prediction intervals are extremely sensitive to deviations from normality. If you suspect that your response variable is not normally distributed, consider a nonparametric technique, such as the bootstrap (Recipe X-X), for prediction intervals.

Performing One-Way ANOVA

Problem

Your data is divided into groups, and the groups are normally distributed. You want to know if the groups have significantly different means.

Solution

Use a factor to define the groups. Then apply the oneway.test function:

oneway.test(x ~ f)

Here, x is a vector of numeric values and f is a factor that identifies the groups. The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that two or more groups have significantly different means whereas a value exceeding 0.05 provides no such evidence.

Discussion

Comparing the means of groups is a common task. One-way ANOVA performs that comparison and computes the probability that they are statistically identical. A small p-value indicates that two or more groups likely have different means. (It does not indicate that all groups have different means.)

The basic ANOVA test assumes that your data has a normal distribution or that, at least, it is pretty close to bell-shaped. If not, use the Kruskal–Wallis test instead (“Performing Robust ANOVA (Kruskal–Wallis Test)”).

We can illustrate ANOVA with stock market historical data. Is the stock market more profitable in some months than in others? For instance, a common folk myth says that October is a bad month for stock market investors.1 We explored this question by creating a data frame GSPC_df containing two columns, r and mon. The column r holds the daily returns of the Standard & Poor’s 500 index, a broad measure of stock market performance. The factor mon indicates the calendar month in which each return occurred: Jan, Feb, Mar, and so forth. The data covers the period 1950 through 2009.

The one-way ANOVA shows a p-value of 0.03347:

load(file = './data/anova.rdata')
oneway.test(r ~ mon, data = GSPC_df)
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and mon
#> F = 2, num df = 10, denom df = 7000, p-value = 0.03

We can conclude that stock market changes varied significantly according to the calendar month.

Before you run to your broker and start flipping your portfolio monthly, however, we should check something: did the pattern change recently? We can limit the analysis to recent data by specifying a subset parameter. This works for oneway.test just as it does for the lm function. The subset contains the indexes of observations to be analyzed; all other observations are ignored. Here, we give the indexes of the 2,500 most recent observations, which is about 10 years of data:

oneway.test(r ~ mon, data = GSPC_df, subset = tail(seq_along(r), 2500))
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and mon
#> F = 0.7, num df = 10, denom df = 1000, p-value = 0.8

Uh-oh! Those monthly differences evaporated during the past 10 years. The large p-value, 0.7608, indicates that changes have not recently varied according to calendar month. Apparently, those differences are a thing of the past.

Notice that the oneway.test output says “(not assuming equal variances)”. If you know the groups have equal variances, you’ll get a less conservative test by specifying var.equal=TRUE:

oneway.test(x ~ f, var.equal = TRUE)

You can also perform one-way ANOVA by using the aov function like this:

m <- aov(x ~ f)
summary(m)

However, the aov function always assumes equal variances and so is somewhat less flexible than oneway.test.

See Also

If the means are significantly different, use “Finding Differences Between Means of Groups” to see the actual differences. Use “Performing Robust ANOVA (Kruskal–Wallis Test)” if your data is not normally distributed, as required by ANOVA.

Creating an Interaction Plot

Problem

You are performing multiway ANOVA: using two or more categorical variables as predictors. You want a visual check of possible interaction between the predictors.

Solution

Use the interaction.plot function:

interaction.plot(pred1, pred2, resp)

Here, pred1 and pred2 are two categorical predictors and resp is the response variable.

Discussion

ANOVA is a form of linear regression, so ideally there is a linear relationship between every predictor and the response variable. One source of nonlinearity is an interaction between two predictors: as one predictor changes value, the other predictor changes its relationship to the response variable. Checking for interaction between predictors is a basic diagnostic.

The faraway package contains a dataset called rats. In it, treat and poison are categorical variables and time is the response variable. When plotting poison against time, we are looking for straight, parallel lines, which indicate a linear relationship. However, using the interaction.plot function produces Figure 12-7, which reveals that something is not right:

library(faraway)
data(rats)
interaction.plot(rats$poison, rats$treat, rats$time)
Figure 12-7. Interaction Plot Example

Each line graphs time against poison. The difference between lines is that each line is for a different value of treat. The lines should be parallel, but the top two are not exactly parallel. Evidently, varying the value of treat “warped” the lines, introducing a nonlinearity into the relationship between poison and time.

This signals a possible interaction that we should check. For this data it just so happens that yes, there is an interaction but no, it is not statistically significant. The moral is clear: the visual check is useful, but it’s not foolproof. Follow up with a statistical check.
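
One such statistical check is to fit the two-way model with an interaction term and look at whether the interaction is significant. This sketch uses aov on the same rats data; it is our suggestion for a follow-up, not a step prescribed by the recipe:

m <- aov(time ~ poison * treat, data = rats)
summary(m)     # the poison:treat row tests the interaction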

Finding Differences Between Means of Groups

Problem

Your data is divided into groups, and an ANOVA test indicates that the groups have significantly different means. You want to know the differences between those means for all groups.

Solution

Perform the ANOVA test using the aov function, which returns a model object. Then apply the TukeyHSD function to the model object:

m <- aov(x ~ f)
TukeyHSD(m)

Here, x is your data and f is the grouping factor. You can plot the TukeyHSD result to obtain a graphical display of the differences:

plot(TukeyHSD(m))

Discussion

The ANOVA test is important because it tells you whether or not the groups’ means are different. But the test does not identify which groups are different, and it does not report their differences.

The TukeyHSD function can calculate those differences and help you identify the largest ones. It uses the “honest significant differences” method invented by John Tukey.

We’ll illustrate TukeyHSD by continuing the example from “Performing One-Way ANOVA”, which grouped daily stock market changes by month. Here, we group them by weekday instead, using a factor called wday that identifies the day of the week (Mon, …, Fri) on which the change occurred. We’ll use the first 2,500 observations, which roughly cover the period from 1950 to 1960:

load(file = './data/anova.rdata')
oneway.test(r ~ wday, subset = 1:2500, data = GSPC_df)
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and wday
#> F = 10, num df = 4, denom df = 1000, p-value = 5e-10

The p-value is essentially zero, indicating that average changes varied significantly depending on the weekday. To use the TukeyHSD function, we first perform the ANOVA test using the aov function, which returns a model object, and then apply the TukeyHSD function to the object:

m <- aov(r ~ wday, subset = 1:2500, data = GSPC_df)
TukeyHSD(m)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = r ~ wday, data = GSPC_df, subset = 1:2500)
#>
#> $wday
#>              diff       lwr       upr p adj
#> Mon-Fri -0.003153 -4.40e-03 -0.001911 0.000
#> Thu-Fri -0.000934 -2.17e-03  0.000304 0.238
#> Tue-Fri -0.001855 -3.09e-03 -0.000618 0.000
#> Wed-Fri -0.000783 -2.01e-03  0.000448 0.412
#> Thu-Mon  0.002219  9.79e-04  0.003460 0.000
#> Tue-Mon  0.001299  5.85e-05  0.002538 0.035
#> Wed-Mon  0.002370  1.14e-03  0.003605 0.000
#> Tue-Thu -0.000921 -2.16e-03  0.000314 0.249
#> Wed-Thu  0.000151 -1.08e-03  0.001380 0.997
#> Wed-Tue  0.001072 -1.57e-04  0.002300 0.121

Each line in the output table includes the difference between the means of two groups (diff) as well as the lower and upper bounds of the confidence interval (lwr and upr) for the difference. The first line in the table, for example, compares the Mon group and the Fri group: the difference of their means is −0.0032, with a confidence interval of (−0.0044, −0.0019).

Scanning the table, we see that the Wed-Mon comparison had the largest difference, which was 0.00237.

A cool feature of TukeyHSD is that it can display these differences visually, too. Simply plot the function’s return value to get output as is shown in Figure 12-8.

plot(TukeyHSD(m))
Figure 12-8. TukeyHSD Plot

The horizontal lines plot the confidence intervals for each pair. With this visual representation you can quickly see that several confidence intervals cross over zero, indicating that the difference is not necessarily significant. You can also see that the Wed-Mon pair has the largest difference because their confidence interval is farthest to the right.

Performing Robust ANOVA (Kruskal–Wallis Test)

Problem

Your data is divided into groups. The groups are not normally distributed, but their distributions have similar shapes. You want to perform a test similar to ANOVA—you want to know if the group medians are significantly different.

Solution

Create a factor that defines the groups of your data. Use the kruskal.test function, which implements the Kruskal–Wallis test. Unlike the ANOVA test, this test does not depend upon the normality of the data:

kruskal.test(x ~ f)

Here, x is a vector of data and f is a grouping factor. The output includes a p-value. Conventionally, p < 0.05 indicates that there is a significant difference between the medians of two or more groups whereas p > 0.05 provides no such evidence.

Discussion

Regular ANOVA assumes that your data has a Normal distribution. It can tolerate some deviation from normality, but extreme deviations will produce meaningless p-values.

The Kruskal–Wallis test is a nonparametric version of ANOVA, which means that it does not assume normality. However, it does assume same-shaped distributions. You should use the Kruskal–Wallis test whenever your data distribution is nonnormal or simply unknown.

The null hypothesis is that all groups have the same median. Rejecting the null hypothesis (with p < 0.05) does not indicate that all groups are different, but it does suggest that two or more groups are different.

One year, Paul taught Business Statistics to 94 undergraduate students. The class included a midterm examination, and there were four homework assignments prior to the exam. He wanted to know: What is the relationship between completing the homework and doing well on the exam? If there is no relation, then the homework is irrelevant and needs rethinking.

He created a vector of grades, one per student, and he also created a parallel factor that captured the number of homework assignments completed by that student. The data are in a data frame named student_data:

load(file = './data/student_data.rdata')
head(student_data)
#> # A tibble: 6 x 4
#>   att.fact hw.mean midterm hw
#>   <fct>      <dbl>   <dbl> <fct>
#> 1 3          0.808   0.818 4
#> 2 3          0.830   0.682 4
#> 3 3          0.444   0.511 2
#> 4 3          0.663   0.670 3
#> 5 2          0.9     0.682 4
#> 6 3          0.948   0.954 4

Notice that the hw variable—although it appears to be numeric—is actually a factor. It assigns each midterm grade to one of five groups depending upon how many homework assignments the student completed.

The distribution of exam grades is definitely not Normal: the students have a wide range of math skills, so there are an unusual number of A and F grades. Hence regular ANOVA would not be appropriate. Instead we used the Kruskal–Wallis test and obtained a p-value of essentially zero (about 4 × 10−5):

kruskal.test(midterm ~ hw, data = student_data)
#>
#>  Kruskal-Wallis rank sum test
#>
#> data:  midterm by hw
#> Kruskal-Wallis chi-squared = 30, df = 4, p-value = 4e-05

Obviously, there is a significant performance difference between students who complete their homework and those who do not. But what could Paul actually conclude? At first, Paul was pleased that the homework appeared so effective. Then it dawned on him that this was a classic error in statistical reasoning: he assumed that correlation implied causality. It does not, of course. Perhaps strongly motivated students do well on both homework and exams whereas lazy students do not. In that case, the causal factor is degree of motivation, not the brilliance of his homework selection. In the end, he could only conclude something very simple: students who complete the homework will likely do well on the midterm exam, but he still doesn’t really know why.

Comparing Models by Using ANOVA

Problem

You have two models of the same data, and you want to know whether they produce different results.

Solution

The anova function can compare two models and report if they are significantly different:

anova(m1, m2)

Here, m1 and m2 are both model objects returned by lm. The output from anova includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the models are significantly different whereas a value exceeding 0.05 provides no such evidence.

Discussion

In “Getting Regression Statistics”, we used the anova function to print the ANOVA table for one regression model. Now we are using the two-argument form to compare two models.

The anova function has one strong requirement when comparing two models: one model must be contained within the other. That is, all the terms of the smaller model must appear in the larger model. Otherwise, the comparison is impossible.

The ANOVA analysis performs an F test that is similar to the F test for a linear regression. The difference is that this test is between two models whereas the regression F test is between using the regression model and using no model.

Suppose we build three models of y, adding terms as we go:

load(file = './data/anova2.rdata')
m1 <- lm(y ~ u)
m2 <- lm(y ~ u + v)
m3 <- lm(y ~ u + v + w)

Is m2 really different from m1? We can use anova to compare them, and the result is a p-value of 0.009066:

anova(m1, m2)
#> Analysis of Variance Table
#>
#> Model 1: y ~ u
#> Model 2: y ~ u + v
#>   Res.Df RSS Df Sum of Sq    F Pr(>F)
#> 1     18 197
#> 2     17 130  1      66.4 8.67 0.0091 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates that the models are significantly different. Comparing m2 and m3, however, yields a p-value of 0.05527:

anova(m2, m3)
#> Analysis of Variance Table
#>
#> Model 1: y ~ u + v
#> Model 2: y ~ u + v + w
#>   Res.Df RSS Df Sum of Sq    F Pr(>F)
#> 1     17 130
#> 2     16 103  1      27.5 4.27  0.055 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is right on the edge. Strictly speaking, it does not pass our requirement to be smaller than 0.05; however, it’s close enough that you might judge the models to be “different enough.”

This example is a bit contrived, so it does not show the larger power of anova. We use anova when we are experimenting with complicated models, adding and deleting multiple terms, and need to know whether or not the new model is really different from the original one. In other words: if we add terms and the new model is essentially unchanged, then the extra terms are not worth the additional complications.

1 In the words of Mark Twain, “October: This is one of the peculiarly dangerous months to speculate in stocks in. The others are July, January, September, April, November, May, March, June, December, August and February.”

About the Authors

J.D. Long is a misplaced southern agricultural economist currently working for Renaissance Re in New York City. J.D. is an avid user of Python, R, AWS and colorful metaphors, and is a frequent presenter at R conferences as well as the founder of the Chicago R User Group. He lives in Jersey City, NJ with his wife, a recovering trial lawyer, and his 11-year-old circuit bending daughter.

Paul Teetor is a quantitative developer with Masters degrees in statistics and computer science. He specializes in analytics and software engineering for investment management, securities trading, and risk management. He works with hedge funds, market makers, and portfolio managers in the greater Chicago area.


Chapter 1. Getting Started and Getting Help

Introduction

This chapter sets the groundwork for the other chapters. It explains how to download, install, and run R.

More importantly, it also explains how to get answers to your questions. The R community provides a wealth of documentation and help. You are not alone. Here are some common sources of help:

Local, installed documentation

When you install R on your computer, a mass of documentation is also installed. You can browse the local documentation (“Viewing the Supplied Documentation”) and search it (“Searching the Supplied Documentation”). We are amazed how often we search the Web for an answer only to discover it was already available in the installed documentation.

Task views: (http://cran.r-project.org/web/views)

A task view describes packages that are specific to one area of statistical work, such as econometrics, medical imaging, psychometrics, or spatial statistics. Each task view is written and maintained by an expert in the field. There are more than 35 such task views, so there is likely to be one or more for your areas of interest. We recommend that every beginner find and read at least one task view in order to gain a sense of R’s possibilities (“Finding Relevant Functions and Packages”).

Package documentation

Most packages include useful documentation. Many also include overviews and tutorials, called “vignettes” in the R community. The documentation is kept with the packages in package repositories, such as CRAN (http://cran.r-project.org/), and it is automatically installed on your machine when you install a package.

Question and answer (Q&A) websites

On a Q&A site, anyone can post a question, and knowledgeable people can respond. Readers vote on the answers, so the best answers tend to emerge over time. All this information is tagged and archived for searching. These sites are a cross between a mailing list and a social network; “Stack Overflow” (http://stackoverflow.com/) is the canonical example.

The Web

The Web is loaded with information about R, and there are R-specific tools for searching it (“Searching the Web for Help”). The Web is a moving target, so be on the lookout for new, improved ways to organize and search information regarding R.

Mailing lists

Volunteers have generously donated many hours of time to answer beginners’ questions that are posted to the R mailing lists. The lists are archived, so you can search the archives for answers to your questions (“Searching the Mailing Lists”).

Downloading and Installing R

Problem

You want to install R on your computer.

Solution

Windows and OS X users can download R from CRAN, the Comprehensive R Archive Network. Linux and Unix users can install R using their package management tool:

Windows

  1. Open http://www.r-project.org/ in your browser.

  2. Click on “CRAN”. You’ll see a list of mirror sites, organized by country.

  3. Select a site near you, or the top one listed as “0-Cloud”, which tends to work well for most locations (https://cloud.r-project.org/).

  4. Click on “Download R for Windows” under “Download and Install R”.

  5. Click on “base”.

  6. Click on the link for downloading the latest version of R (an .exe file).

  7. When the download completes, double-click on the .exe file and answer the usual questions.

OS X

  1. Open http://www.r-project.org/ in your browser.

  2. Click on “CRAN”. You’ll see a list of mirror sites, organized by country.

  3. Select a site near you, or the top one listed as “0-Cloud”, which tends to work well for most locations.

  4. Click on “Download R for (Mac) OS X”.

  5. Click on the .pkg file for the latest version of R, under “Latest release:”, to download it.

  6. When the download completes, double-click on the .pkg file and answer the usual questions.

Linux or Unix

The major Linux distributions have packages for installing R. Here are some examples:

Table 1-1. Linux Distributions

Distribution         Package name
Ubuntu or Debian     r-base
Red Hat or Fedora    R.i386
Suse                 R-base

Use the system’s package manager to download and install the package. Normally, you will need the root password or sudo privileges; otherwise, ask a system administrator to perform the installation.

Discussion

Installing R on Windows or OS X is straightforward because there are prebuilt binaries (compiled programs) for those platforms. You need only follow the preceding instructions. The CRAN Web pages also contain links to installation-related resources, such as frequently asked questions (FAQs) and tips for special situations (“Does R run under Windows Vista/7/8/Server 2008?”) that you may find useful.

The best way to install R on Linux or Unix is by using your Linux distribution package manager to install R as a package. The distribution packages greatly streamline both the initial installation and subsequent updates.

On Ubuntu or Debian, use apt-get to download and install R. Run under sudo to have the necessary privileges:

$ sudo apt-get install r-base

On Red Hat or Fedora, use yum:

$ sudo yum install R.i386

Most Linux platforms also have graphical package managers, which you might find more convenient.

Beyond the base packages, we recommend installing the documentation packages, too. We like to install r-base-html (because we like browsing the hyperlinked documentation) as well as r-doc-html, which installs the important R manuals locally:

$ sudo apt-get install r-base-html r-doc-html

Some Linux repositories also include prebuilt copies of R packages available on CRAN. We don’t use them because we’d rather get software directly from CRAN itself, which usually has the freshest versions.

In rare cases, you may need to build R from scratch. You might have an obscure, unsupported version of Unix; or you might have special considerations regarding performance or configuration. The build procedure on Linux or Unix is quite standard. Download the tarball from the home page of your CRAN mirror; it’s called something like R-3.5.1.tar.gz, except the “3.5.1” will be replaced by the latest version. Unpack the tarball, look for a file called INSTALL, and follow the directions.

See Also

R in a Nutshell (http://oreilly.com/catalog/9780596801717) (O’Reilly) contains more details of downloading and installing R, including instructions for building the Windows and OS X versions. Perhaps the ultimate guide is the one entitled “R Installation and Administration” (http://cran.r-project.org/doc/manuals/R-admin.html), available on CRAN, which describes building and installing R on a variety of platforms.

This recipe is about installing the base package. See “Installing Packages from CRAN” for installing add-on packages from CRAN.

Installing R Studio

Problem

You want a more comprehensive Integrated Development Environment (IDE) than the R default. In other words, you want to install R Studio Desktop.

Solution

Over the past few years R Studio has become the most widely used IDE for R. We are of the opinion that almost all R work should be done in the R Studio Desktop IDE unless there is a compelling reason to do otherwise. R Studio makes multiple products, including R Studio Desktop, R Studio Server, and R Studio Shiny Server, just to name a few. For this book we will use the term R Studio to mean R Studio Desktop, though most concepts apply to R Studio Server as well.

To install R Studio, download the latest installer for your platform from the R Studio website: https://www.rstudio.com/products/rstudio/download/

The R Studio Desktop Open Source License version is free to download and use.

Discussion

This book was written and built using R Studio version 1.2.x and R versions 3.5.x. New versions of R Studio are released every few months, so be sure to update regularly. Note that R Studio works with whichever version of R you have installed, so updating to the latest version of R Studio does not upgrade your version of R. R must be upgraded separately.
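
If you are ever unsure which version of R a given R Studio session is running, you can ask R itself; the version string shown here is only an example and will reflect whatever you have installed:

R.version.string
#> [1] "R version 3.5.2 (2018-12-20)"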

Interacting with R is slightly different in R Studio than in the built-in R user interface. For this book, we’ve elected to use R Studio for all examples.

Starting R Studio

Problem

You want to run R Studio on your computer.

Solution

A common point of confusion for new users of R and R Studio is to accidentally start R when they intended to start R Studio. The easiest way to ensure you’re actually starting R Studio is to search for RStudio on your desktop OS. Then use whatever method your OS provides for pinning the icon somewhere easy to find later.

Windows

Click on the Start menu in the lower left corner of the screen. In the search box, type RStudio.

OS X

Look in your Launchpad for the R Studio app, or press Command-space and type RStudio to search using Spotlight Search.

Ubuntu

Press Alt + F1 and type RStudio to search for R Studio.

Discussion

Confusion between R and R Studio can easily happen because, as you can see in Figure 1-1, the icons look similar.

Figure 1-1. R and R Studio icons in OS X

If you click on the R icon you’ll be greeted by something like Figure 1-2, which is the Base R interface on a Mac, but certainly not R Studio.

Figure 1-2. The R Console in OS X

When you start R Studio, the default behavior is that R Studio will reopen the last project you were working on in R Studio.

Entering Commands

Problem

You’ve started R Studio. Now what?

Solution

When you start R Studio, the main window on the left is an R session. From there you can enter commands interactively, directly to R.

Discussion

R prompts you with “>”. To get started, just treat R like a big calculator: enter an expression, and R will evaluate the expression and print the result:

1 + 1
#> [1] 2

The computer adds one and one, giving two, and displays the result.

The [1] before the 2 might be confusing. To R, the result is a vector, even though it has only one element. R labels the value with [1] to signify that this is the first element of the vector… which is not surprising, since it’s the only element of the vector.

R will prompt you for input until you type a complete expression. The expression max(1,3,5) is a complete expression, so R stops reading input and evaluates what it’s got:

max(1, 3, 5)
#> [1] 5

In contrast, “max(1,3,” is an incomplete expression, so R prompts you for more input. The prompt changes from greater-than (>) to plus (+), letting you know that R expects more:

max(1, 3,
+ 5)
#> [1] 5

It’s easy to mistype commands, and retyping them is tedious and frustrating. So R includes command-line editing to make life easier. It defines single keystrokes that let you easily recall, correct, and reexecute your commands. My own typical command-line interaction goes like this:

  1. I enter an R expression with a typo.

  2. R complains about my mistake.

  3. I press the up-arrow key to recall my mistaken line.

  4. I use the left and right arrow keys to move the cursor back to the error.

  5. I use the Delete key to delete the offending characters.

  6. I type the corrected characters, which inserts them into the command line.

  7. I press Enter to reexecute the corrected command.

That’s just the basics. R supports the usual keystrokes for recalling and editing command lines, as listed in Table 1-2.

Table 1-2. R command shortcuts (keystrokes for command-line editing)

Labeled key     Ctrl-key combination   Effect
Up arrow        Ctrl-P                 Recall previous command by moving backward through the history of commands.
Down arrow      Ctrl-N                 Move forward through the history of commands.
Backspace       Ctrl-H                 Delete the character to the left of cursor.
Delete (Del)    Ctrl-D                 Delete the character to the right of cursor.
Home            Ctrl-A                 Move cursor to the start of the line.
End             Ctrl-E                 Move cursor to the end of the line.
Right arrow     Ctrl-F                 Move cursor right (forward) one character.
Left arrow      Ctrl-B                 Move cursor left (back) one character.
                Ctrl-K                 Delete everything from the cursor position to the end of the line.
                Ctrl-U                 Clear the whole darn line and start over.
Tab                                    Name completion (on some platforms).

On Windows and OS X, you can also use the mouse to highlight commands and then use the usual copy and paste commands to paste text into a new command line.

See Also

See “Typing Less and Accomplishing More”. From the Windows main menu, follow Help → Console for a complete list of keystrokes useful for command-line editing.

Exiting from R Studio

Problem

You want to exit from R Studio.

Solution

Windows

Select File → Quit Session from the main menu; or click on the X in the upper-right corner of the window frame.

OS X

Press CMD-q (apple-q); or click on the red X in the upper-left corner of the window frame.

Linux or Unix

At the command prompt, press Ctrl-D.

On all platforms, you can also use the q function (as in quit) to terminate the program.

q()

Note the empty parentheses, which are necessary to call the function.

Discussion

Whenever you exit, R typically asks if you want to save your workspace. You have three choices:

  • Save your workspace and exit.

  • Don’t save your workspace, but exit anyway.

  • Cancel, returning to the command prompt rather than exiting.

If you save your workspace, then R writes it to a file called .RData in the current working directory. Saving the workspace saves any R objects you have created. The next time you start R in the same directory, the workspace will automatically load. Saving your workspace will overwrite the previously saved workspace, if any, so don’t save if you don’t like the changes to your workspace (e.g., if you have accidentally erased critical data).

We recommend never saving your workspace when you exit, and instead always explicitly saving your project, scripts, and data. We also recommend that you turn off the prompt to save and the automatic restoring of the workspace in R Studio, using the Global Options found in the menu Tools → Global Options and shown in Figure 1-3. This way, when you exit R and R Studio, you will not be prompted to save your workspace. But keep in mind that any objects created but not saved to disk will be lost.

Figure 1-3. Save Workspace Options
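
If you are exiting from a plain R session outside of R Studio, you can also skip the prompt entirely by telling q what to do with the workspace:

q(save = "no")  # quit without saving the workspace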

See Also

See “Getting and Setting the Working Directory” for more about the current working directory and “Saving Your Workspace” for more about saving your workspace. See Chapter 2 of R in a Nutshell (http://oreilly.com/catalog/9780596801717).

Interrupting R

Problem

You want to interrupt a long-running computation and return to the command prompt without exiting R Studio.

Solution

Press the Esc key on your keyboard, or click on the Session menu in R Studio and select “Interrupt R”.

Discussion

Interrupting R means telling R to stop running the current command, but without deleting variables from memory or completely closing R Studio. That said, interrupting R can leave your variables in an indeterminate state, depending upon how far the computation had progressed. Check your workspace after interrupting.

Viewing the Supplied Documentation

Problem

You want to read the documentation supplied with R.

Solution

Use the help.start function to see the documentation’s table of contents:

help.start()

From there, links are available to all the installed documentation. In R Studio, the help will show up in the help pane, which by default is on the right-hand side of the screen.

In R Studio you can also click Help → R Help to get a listing with help options for both R and R Studio.

Discussion

The base distribution of R includes a wealth of documentation—literally thousands of pages. When you install additional packages, those packages contain documentation that is also installed on your machine.

It is easy to browse this documentation via the help.start function, which opens the top-level table of contents. Figure 1-4 shows how help.start() appears inside the help pane in R Studio.

Figure 1-4. R Studio help.start

The two links in the Base R Reference section are especially useful:

Packages

Click here to see a list of all the installed packages, both in the base packages and the additional, installed packages. Click on a package name to see a list of its functions and datasets.

Search Engine & Keywords

Click here to access a simple search engine, which allows you to search the documentation by keyword or phrase. There is also a list of common keywords, organized by topic; click one to see the associated pages.

The Base R documentation shown by typing help.start() is loaded on your computer when you install R. The R Studio help, which you get by using the menu option Help → R Help, presents a page with links to R Studio’s website, so you will need internet access to use the R Studio help links.

See Also

The local documentation is copied from the R Project website, which may have updated documents.

Getting Help on a Function

Problem

You want to know more about a function that is installed on your machine.

Solution

Use help to display the documentation for the function:

help(functionname)

Use args for a quick reminder of the function arguments:

args(functionname)

Use example to see examples of using the function:

example(functionname)

Discussion

We present many R functions in this book. Every R function has more bells and whistles than we can possibly describe. If a function catches your interest, we strongly suggest reading the help page for that function. One of its bells or whistles might be very useful to you.

Suppose you want to know more about the mean function. Use the help function like this:

help(mean)

This will open the help page for the mean function in the help pane in R Studio. A shortcut for the help command is to simply type ? followed by the function name:

?mean

Sometimes you just want a quick reminder of the arguments to a function: What are they, and in what order do they occur? Use the args function:

args(mean)
#> function (x, ...)
#> NULL
args(sd)
#> function (x, na.rm = FALSE)
#> NULL

The first line of output from args is a synopsis of the function call. For mean, the synopsis shows one argument, x, which is a vector of numbers. For sd, the synopsis shows the same vector, x, and an optional argument called na.rm. (You can ignore the second line of output, which is often just NULL.) In R Studio you will see the args output as a floating tooltip over your cursor when you type a function name, as shown in Figure 1-5.

Figure 1-5. R Studio Tooltip

Most documentation for functions includes example code near the end of the document. A cool feature of R is that you can request that it execute the examples, giving you a little demonstration of the function’s capabilities. The documentation for the mean function, for instance, contains examples, but you don’t need to type them yourself. Just use the example function to watch them run:

example(mean)
#>
#> mean> x <- c(0:10, 50)
#>
#> mean> xm <- mean(x)
#>
#> mean> c(xm, mean(x, trim = 0.10))
#> [1] 8.75 5.50

The user typed example(mean). Everything else was produced by R, which executed the examples from the help page and displayed the results.

See Also

See “Searching the Supplied Documentation” for searching for functions and “Displaying Loaded Packages via the Search Path” for more about the search path.

Searching the Supplied Documentation

Problem

You want to know more about a function that is installed on your machine, but the help function reports that it cannot find documentation for any such function.

Alternatively, you want to search the installed documentation for a keyword.

Solution

Use help.search to search the R documentation on your computer:

help.search("pattern")

A typical pattern is a function name or keyword. Notice that it must be enclosed in quotation marks.

For your convenience, you can also invoke a search by using two question marks (in which case the quotes are not required). Note that searching for a function by name uses one question mark while searching for a text pattern uses two:

??pattern

Discussion

You may occasionally request help on a function only to be told R knows nothing about it:

help(adf.test)
#> No documentation for 'adf.test' in specified packages and libraries:
#> you could try '??adf.test'

This can be frustrating if you know the function is installed on your machine. Here the problem is that the function’s package is not currently loaded, and you don’t know which package contains the function. It’s a kind of catch-22 (the error message indicates the package is not currently in your search path, so R cannot find the help file; see “Displaying Loaded Packages via the Search Path” for more details).

The solution is to search all your installed packages for the function. Just use the help.search function, as suggested in the error message:

help.search("adf.test")

The search will produce a listing of all packages that contain the function:

Help files with alias or concept or title matching 'adf.test' using
regular expression matching:

tseries::adf.test       Augmented Dickey-Fuller Test

Type '?PKG::FOO' to inspect entry 'PKG::FOO TITLE'.

The output above indicates that the tseries package contains the adf.test function. You can see its documentation by explicitly telling help which package contains the function:

help(adf.test, package = "tseries")

or you can use the double colon operator to tell R to look in a specific package:

?tseries::adf.test

You can broaden your search by using keywords. R will then find any installed documentation that contains the keywords. Suppose you want to find all functions that mention the Augmented Dickey–Fuller (ADF) test. You could search on a likely pattern:

help.search("dickey-fuller")

On my machine, the result looks like this because I’ve installed two additional packages (fUnitRoots and urca) that implement the ADF test:

Help files with alias or concept or title matching 'dickey-fuller' using
fuzzy matching:

fUnitRoots::DickeyFullerPValues
                         Dickey-Fuller p Values
tseries::adf.test        Augmented Dickey-Fuller Test
urca::ur.df              Augmented-Dickey-Fuller Unit Root Test

Type '?PKG::FOO' to inspect entry 'PKG::FOO TITLE'.

See Also

You can also access the local search engine through the documentation browser; see “Viewing the Supplied Documentation” for how this is done. See “Displaying Loaded Packages via the Search Path” for more about the search path and “Listing Files” for getting help on functions.

Getting Help on a Package

Problem

You want to learn more about a package installed on your computer.

Solution

Use the help function and specify a package name (without a function name):

help(package = "packagename")

Discussion

Sometimes you want to know the contents of a package (the functions and datasets). This is especially true after you download and install a new package, for example. The help function can provide the contents plus other information once you specify the package name.

This call to help will display the information for the tseries package, a popular package for time series analysis:

help(package = "tseries")

The information begins with a description and continues with an index of functions and datasets. In R Studio, the HTML formatted help page will open in the help window of the IDE.

Some packages also include vignettes, which are additional documents such as introductions, tutorials, or reference cards. They are installed on your computer as part of the package documentation when you install the package. The help page for a package includes a list of its vignettes near the bottom.

You can see a list of all vignettes on your computer by using the vignette function:

vignette()

In R Studio this will open a new tab and list every package installed on your computer which includes vignettes and a list of vignette names and descriptions.

You can see the vignettes for a particular package by including its name:

vignette(package = "packagename")

Each vignette has a name, which you use to view the vignette:

vignette("vignettename")

See Also

See “Getting Help on a Function” for getting help on a particular function in a package.

Searching the Web for Help

Problem

You want to search the Web for information and answers regarding R.

Solution

Inside R, use the RSiteSearch function to search by keyword or phrase:

RSiteSearch("key phrase")

Inside your browser, try using these sites for searching:

RSeek: http://rseek.org

This is a Google custom search that is focused on R-specific websites.

Stack Overflow: http://stackoverflow.com/

Stack Overflow is a searchable Q&A site from Stack Exchange oriented toward programming issues such as data structures, coding, and graphics.

Cross Validated: http://stats.stackexchange.com/

Cross Validated is a Stack Exchange site focused on statistics, machine learning, and data analysis rather than programming. Cross Validated is a good place for questions about what statistical method to use.

Discussion

The RSiteSearch function will open a browser window and direct it to the search engine on the R Project website (http://search.r-project.org/). There you will see an initial search that you can refine. For example, this call would start a search for “canonical correlation”:

RSiteSearch("canonical correlation")

This is quite handy for doing quick web searches without leaving R. However, the search scope is limited to R documentation and the mailing-list archives.

The rseek.org site provides a wider search. Its virtue is that it harnesses the power of the Google search engine while focusing on sites relevant to R. That eliminates the extraneous results of a generic Google search. The beauty of rseek.org is that it organizes the results in a useful way.

Figure 1-6 shows the results of visiting rseek.org and searching for “canonical correlation”. The left side of the page shows general search results from R sites. The right side is a tabbed display that organizes the search results into several categories:

  • Introductions

  • Task Views

  • Support Lists

  • Functions

  • Books

  • Blogs

  • Related Tools

Figure 1-6. RSeek

If you click on the Introductions tab, for example, you’ll find tutorial material. The Task Views tab will show any Task View that mentions your search term. Likewise, clicking on Functions will show links to relevant R functions. This is a good way to zero in on search results.

Stack Overflow (http://stackoverflow.com/) is a Q&A site, which means that anyone can submit a question and experienced users will supply answers—often there are multiple answers to each question. Readers vote on the answers, so good answers tend to rise to the top. This creates a rich database of Q&A dialogs, which you can search. Stack Overflow is strongly problem oriented, and the topics lean toward the programming side of R.

Stack Overflow hosts questions for many programming languages; therefore, when entering a term into their search box, prefix it with [r] to focus the search on questions tagged for R. For example, searching via [r] standard error will select only the questions tagged for R and will avoid the Python and C++ questions.

Stack Overflow also includes a wiki about the R language that is an excellent community-curated list of online R resources: https://stackoverflow.com/tags/r/info

Stack Exchange (parent company of Stack Overflow) has a Q&A area for statistical analysis called Cross Validated: https://stats.stackexchange.com/. This area is more focused on statistics than programming, so use this site when seeking answers that are more concerned with statistics in general and less with R in particular.

See Also

If your search reveals a useful package, use “Installing Packages from CRAN” to install it on your machine.

Finding Relevant Functions and Packages

Problem

Of the 10,000+ packages for R, you have no idea which ones would be useful to you.

Solution

Visit the CRAN Task Views page (http://cran.r-project.org/web/views/) and find a task view for your area of interest. It will give you links to and descriptions of relevant packages.

Discussion

This problem is especially vexing for beginners. You think R can solve your problems, but you have no idea which packages and functions would be useful. A common question on the mailing lists is: “Is there a package to solve problem X?” That is the silent scream of someone drowning in R.

As of this writing, there are more than 10,000 packages available for free download from CRAN. Each package has a summary page with a short description and links to the package documentation. Once you’ve located a potentially interesting package, you would typically click on the “Reference manual” link to view the PDF documentation with full details. (The summary page also contains download links for installing the package, but you’ll rarely install the package that way; see “Installing Packages from CRAN”.)

Sometimes you simply have a generic interest—such as Bayesian analysis, econometrics, optimization, or graphics. CRAN contains a set of task view pages describing packages that may be useful. A task view is a great place to start since you get an overview of what’s available. You can see the list of task view pages at CRAN Task Views (http://cran.r-project.org/web/views/) or search for them as described in the Solution. Task Views on CRAN list a number of broad fields and show packages that are used in each field. For example, there are Task Views for high performance computing, genetics, time series, and social science, just to name a few.

Suppose you happen to know the name of a useful package—say, by seeing it mentioned online. A complete, alphabetical list of packages is available at CRAN (http://cran.r-project.org/web/packages/) with links to the package summary pages.

See Also

You can download and install an R package called sos that provides other powerful ways to search for packages; see the vignette at SOS (http://cran.r-project.org/web/packages/sos/vignettes/sos.pdf).

Searching the Mailing Lists

Problem

You have a question, and you want to search the archives of the mailing lists to see whether your question was answered previously.

Solution

Search the archives with a site such as RSeek (http://rseek.org); its Support Lists tab narrows the results to the R mailing lists.

Discussion

This recipe is really just an application of “Searching the Web for Help”. But it’s an important application because you should search the mailing list archives before submitting a new question to the list. Your question has probably been answered before.

See Also

CRAN has a list of additional resources for searching the Web; see CRAN Search (http://cran.r-project.org/search.html).

Submitting Questions to Stack Overflow or Elsewhere in the Community

Problem

You have a question you can’t find the answer to online. So you want to submit a question to the R community.

Solution

The first step to asking a question online is to create a reproducible example. Having example code that someone can run and see exactly your problem is the most critical part of asking for help online. A question with a good reproducible example has three components:

  1. Example Data - This can be simulated data or some real data that you provide.

  2. Example Code - This code shows what you have tried or an error you are having.

  3. Written Description - This is where you explain what you have, what you’d like to have, and what you have tried that didn’t work.

The details of writing a reproducible example are below in the Discussion. Once you have a reproducible example, you can post your question on Stack Overflow via https://stackoverflow.com/questions/ask. Be sure to include the r tag in the Tags section of the ask page.

Or if your discussion is more general or related to concepts instead of specific syntax, R Studio runs an R Studio Community discussion forum at https://community.rstudio.com/. Note that the site is broken into multiple topics, so pick the topic category that best fits your question.

Or you may submit your question to the R mailing lists (but don’t submit to multiple sites, such as both the mailing lists and Stack Overflow, as cross-posting is considered rude):

The Mailing Lists (http://www.r-project.org/mail.html) page contains general information and instructions for using the R-help mailing list. Here is the general process:

  1. Subscribe to the R-help list at the “Main R Mailing List” (https://stat.ethz.ch/mailman/listinfo/r-help).

  2. Write your question carefully and correctly, and include your reproducible example.

  3. Mail your question to r-help@r-project.org.

Discussion

The R mailing list, Stack Overflow, and the R Studio Community site are great resources, but please treat them as a last resort. Read the help pages, read the documentation, search the help list archives, and search the Web. It is most likely that your question has already been answered. Don’t kid yourself: very few questions are unique. If you’ve exhausted all other options, maybe it’s time to create a good question.

The reproducible example is the crux of a good help request. The first step is example data. A good way to get example data is to simulate it using a few R functions. The following example creates a data frame called example_df that has three columns, each of a different data type:

set.seed(42)
n <- 4
example_df <- data.frame(
  some_reals = rnorm(n),
  some_letters = sample(LETTERS, n, replace = TRUE),
  some_ints = sample(1:10, n, replace = TRUE)
)
example_df
#>   some_reals some_letters some_ints
#> 1      1.371            R        10
#> 2     -0.565            S         3
#> 3      0.363            L         5
#> 4      0.633            S        10

Note that this example uses the command set.seed() at the beginning. This ensures that every time this code is run the answers will be the same. The n value is the number of rows of example data you would like to create. Make your example data as simple as possible to illustrate your question.
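
You can see what set.seed does by drawing a few random numbers, resetting the seed, and drawing again; with the same seed, the two draws are identical:

set.seed(42)
rnorm(3)
#> [1]  1.371 -0.565  0.363
set.seed(42)
rnorm(3)
#> [1]  1.371 -0.565  0.363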

An alternative to creating simulated data is to use example data that comes with R. For example, the dataset mtcars contains a data frame with 32 records about different car models:

data(mtcars)
head(mtcars)
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

If your example is only reproducible with a bit of your own data, you can use dput() to turn a small piece of your data into a string that you can paste into your example. We’ll illustrate that using two rows from the mtcars data:

dput(head(mtcars, 2))
#> structure(list(mpg = c(21, 21), cyl = c(6, 6), disp = c(160,
#> 160), hp = c(110, 110), drat = c(3.9, 3.9), wt = c(2.62, 2.875
#> ), qsec = c(16.46, 17.02), vs = c(0, 0), am = c(1, 1), gear = c(4,
#> 4), carb = c(4, 4)), row.names = c("Mazda RX4", "Mazda RX4 Wag"
#> ), class = "data.frame")

You can put the resulting structure() directly in your question:

example_df <- structure(list(mpg = c(21, 21), cyl = c(6, 6), disp = c(160,
160), hp = c(110, 110), drat = c(3.9, 3.9), wt = c(2.62, 2.875
), qsec = c(16.46, 17.02), vs = c(0, 0), am = c(1, 1), gear = c(4,
4), carb = c(4, 4)), row.names = c("Mazda RX4", "Mazda RX4 Wag"
), class = "data.frame")

example_df
#>               mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.88 17.0  0  1    4    4

The second part of a good reproducible example is the minimal example code. The code example should be as simple as possible and illustrate what you are trying to do or have already tried. It should not be a big block of code with many different things going on. Boil your example down to only the minimal amount of code needed. If you use any packages, be sure to include the library() call at the beginning of your code. Also, don’t include anything in your question that will harm the state of someone running your code, such as rm(list = ls()), which would delete all R objects in memory. Have empathy for the person trying to help you, and realize that they are volunteering their time and may run your code on the same machine they do their own work.
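
Putting the pieces together, the body of a question might contain a small, self-contained block like this sketch (the dplyr functions and column names here are purely illustrative, not part of any particular question):

library(dplyr)

example_df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 5)
)

# What I tried; I expected one row per group with its mean:
example_df %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))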

To test your example, open a new R session and try running it. Once you have edited your code, it’s time to give just a bit more information to your potential question answerer. In the plain text of the question, describe what you were trying to do, what you’ve tried, and your question. Be as concise as possible. Much like with the example code, your objective is to communicate as efficiently as possible with the person reading your question. You may find it helpful to include in your description which version of R you are running as well as which platform (Windows, Mac, Linux). You can get that information easily with the sessionInfo() command.

If you are going to submit your question to the R mailing lists, you should know there are actually several mailing lists. R-help is the main list for general questions. There are also many special interest group (SIG) mailing lists dedicated to particular domains such as genetics, finance, R development, and even R jobs. You can see the full list at https://stat.ethz.ch/mailman/listinfo. If your question is specific to one such domain, you’ll get a better answer by selecting the appropriate list. As with R-help, however, carefully search the SIG list archives before submitting your question.

See Also

An excellent essay by Eric Raymond and Rick Moen is entitled “How to Ask Questions the Smart Way” (http://www.catb.org/~esr/faqs/smart-questions.html). We suggest that you read it before submitting any question. Seriously. Read it.

Stack Overflow has an excellent question that includes details about producing a reproducible example. You can find it here: https://stackoverflow.com/q/5963269/37751

Jenny Bryan has a great R package called reprex that helps with the creation of a good reproducible example, and it includes helper functions for writing the Markdown text for sites like Stack Overflow. You can find that package on her GitHub page: https://github.com/tidyverse/reprex

Chapter 2. Some Basics

Introduction

The recipes in this chapter lie somewhere between problem-solving ideas and tutorials. Yes, they solve common problems, but the Solutions showcase common techniques and idioms used in most R code, including the code in this Cookbook. If you are new to R, we suggest skimming this chapter to acquaint yourself with these idioms.

Printing Something to the Screen

Problem

You want to display the value of a variable or expression.

Solution

If you simply enter the variable name or expression at the command prompt, R will print its value. Use the print function for generic printing of any object. Use the cat function for producing custom formatted output.

Discussion

It’s very easy to ask R to print something: just enter it at the command prompt:

pi
#> [1] 3.14
sqrt(2)
#> [1] 1.41

When you enter expressions like that, R evaluates the expression and then implicitly calls the print function. So the previous example is identical to this:

print(pi)
#> [1] 3.14
print(sqrt(2))
#> [1] 1.41

The beauty of print is that it knows how to format any R value for printing, including structured values such as matrices and lists:

print(matrix(c(1, 2, 3, 4), 2, 2))
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
print(list("a", "b", "c"))
#> [[1]]
#> [1] "a"
#>
#> [[2]]
#> [1] "b"
#>
#> [[3]]
#> [1] "c"

This is useful because you can always view your data: just print it. You need not write special printing logic, even for complicated data structures.

The print function has a significant limitation, however: it prints only one object at a time. Trying to print multiple items gives this mind-numbing error message:

print("The zero occurs at", 2 * pi, "radians.")
#> Error in print.default("The zero occurs at", 2 * pi, "radians."): invalid 'quote' argument

The only way to print multiple items is to print them one at a time, which probably isn’t what you want:

print("The zero occurs at")
#> [1] "The zero occurs at"
print(2 * pi)
#> [1] 6.28
print("radians")
#> [1] "radians"

The cat function is an alternative to print that lets you concatenate multiple items into a continuous output:

cat("The zero occurs at", 2 * pi, "radians.", "\n")
#> The zero occurs at 6.28 radians.

Notice that cat puts a space between each item by default. You must provide a newline character (\n) to terminate the line.
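
cat also accepts a sep argument if you want something other than a space between the items:

cat("fee", "fie", "foe", "fum", sep = "-")
#> fee-fie-foe-fum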

The cat function can print simple vectors, too:

fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
cat("The first few Fibonacci numbers are:", fib, "...\n")
#> The first few Fibonacci numbers are: 0 1 1 2 3 5 8 13 21 34 ...

Using cat gives you more control over your output, which makes it especially useful in R scripts that generate output consumed by others. A serious limitation, however, is that it cannot print compound data structures such as matrices and lists. Trying to cat them only produces another mind-numbing message:

cat(list("a", "b", "c"))
#> Error in cat(list("a", "b", "c")): argument 1 (type 'list') cannot be handled by 'cat'

See Also

See “Printing Fewer Digits (or More Digits)” for controlling output format.

Setting Variables

Problem

You want to save a value in a variable.

Solution

Use the assignment operator (<-). There is no need to declare your variable first:

x <- 3

Discussion

Using R in “calculator mode” gets old pretty fast. Soon you will want to define variables and save values in them. This reduces typing, saves time, and clarifies your work.

There is no need to declare or explicitly create variables in R. Just assign a value to the name and R will create the variable:

x <- 3
y <- 4
z <- sqrt(x^2 + y^2)
print(z)
#> [1] 5

Notice that the assignment operator is formed from a less-than character (<) and a hyphen (-) with no space between them.

When you define a variable at the command prompt like this, the variable is held in your workspace. The workspace is held in the computer’s main memory but can be saved to disk. The variable definition remains in the workspace until you remove it.

R is a dynamically typed language, which means that we can change a variable’s data type at will. We could set x to be numeric, as just shown, and then turn around and immediately overwrite that with (say) a vector of character strings. R will not complain:

x <- 3
print(x)
#> [1] 3

x <- c("fee", "fie", "foe", "fum")
print(x)
#> [1] "fee" "fie" "foe" "fum"

In some R functions you will see assignment statements that use the strange-looking assignment operator <<-:

x <<- 3

That forces the assignment to a global variable rather than a local variable. Scoping is a bit, well, out of scope for this discussion, however.
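
Here is a small sketch, just to make the difference concrete:

x <- 1
f <- function() {
  x <- 2 # ordinary assignment: creates a local x; the outer x is untouched
}
f()
x
#> [1] 1

g <- function() {
  x <<- 2 # global assignment: changes the x defined outside the function
}
g()
x
#> [1] 2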

In the spirit of full disclosure, we will reveal that R also supports two other forms of assignment statements. A single equal sign (=) can be used as an assignment operator. A rightward assignment operator (->) can be used anywhere the leftward assignment operator (<-) can be used (but with the arguments reversed):

foo <- 3
print(foo)
#> [1] 3
5 -> fum
print(fum)
#> [1] 5

We recommend that you avoid these as well. The equals-sign assignment is easily confused with the test for equality. The rightward assignment can be useful in certain contexts, but it can be confusing to those not used to seeing it.

See Also

See also the help page for the assign function.

Creating a Pipeline of Function Calls

Problem

You’re getting tired of creating temporary, intermediate variables when doing analysis. The alternative, nesting R functions, seems nearly unreadable.

Solution

You can use the pipe operator (%>%) to make your data flow easier to read and understand. It passes data from one function to the next without having to name an intermediate variable.

library(tidyverse)

mpg %>%
  head %>%
  print
#> # A tibble: 6 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#> 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#> 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#> 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#> 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#> 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

It is identical to

print(head(mpg))
#> # A tibble: 6 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#> 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#> 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#> 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#> 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#> 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

Both code fragments start with the mpg dataset, select the head of the dataset, and print it.

Discussion

The pipe operator (%>%), created by Stefan Bache and found in the magrittr package, is used extensively in the tidyverse and works analogously to the Unix pipe operator (|). It doesn’t provide any new functionality to R, but it can greatly improve the readability of code.

The pipe operator takes the value on the left side of the operator and passes it as the first argument of the function on the right. These two lines of code are identical.

x %>% head

head(x)

For example, the Solution code

mpg %>%
  head %>%
  print

has the same effect as this code, which uses an intermediate variable:

x <- head(mpg)
print(x)

This approach is fairly readable but creates intermediate data frames and requires the reader to keep track of them, putting a cognitive load on the reader.

The following code also has the same effect as the Solution by using nested function calls:

print(head(mpg))

While this is very concise, since it’s only one line, this code requires much more attention to read and to understand what’s going on. Code that is difficult for the reader to parse mentally can introduce potential for error, and it also makes maintenance of the code harder in the future.

The function on the right-hand side of the %>% can include additional arguments, and they will be included after the piped-in value. These two lines of code are identical, for example.

iris %>% head(10)

head(iris, 10)

Sometimes you don’t want the piped value to be the first argument. In those cases, use the dot expression (.) to indicate the desired position. These two lines of code, for example, are identical.

10 %>% head(x, .)

head(x, 10)

This is handy for functions where the first argument is not the principal input.
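
For instance, in base R’s gsub function the string being modified is the third argument, so the dot is needed to pipe data into that position:

"hello world" %>% gsub("o", "0", .)
#> [1] "hell0 w0rld"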

Listing Variables

Problem

You want to know what variables and functions are defined in your workspace.

Solution

Use the ls function. Use ls.str for more details about each variable.

Discussion

The ls function displays the names of objects in your workspace:

x <- 10
y <- 50
z <- c("three", "blind", "mice")
f <- function(n, p) sqrt(p * (1 - p) / n)
ls()
#> [1] "f" "x" "y" "z"

Notice that ls returns a vector of character strings in which each string is the name of one variable or function. When your workspace is empty, ls returns an empty vector, which produces this puzzling output:

ls()
#> character(0)

That is R’s quaint way of saying that ls returned a zero-length vector of strings; that is, it returned an empty vector because nothing is defined in your workspace.

If you want more than just a list of names, try ls.str; this will also tell you something about each variable:

x <- 10
y <- 50
z <- c("three", "blind", "mice")
f <- function(n, p) sqrt(p * (1 - p) / n)
ls.str()
#> f : function (n, p)
#> x :  num 10
#> y :  num 50
#> z :  chr [1:3] "three" "blind" "mice"

The function is called ls.str because it is both listing your variables and applying the str function to them, showing their structure (Revealing the Structure of an Object).

Ordinarily, ls does not return any name that begins with a dot (.). Such names are considered hidden and are not normally of interest to users. (This mirrors the Unix convention of not listing files whose names begin with dot.) You can force ls to list everything by setting the all.names argument to TRUE:

ls()
#> [1] "f" "x" "y" "z"
ls(all.names = TRUE)
#> [1] ".Random.seed" "f"            "x"            "y"
#> [5] "z"

See Also

See “Deleting Variables” for deleting variables and Recipe X-X for inspecting your variables.

Deleting Variables

Problem

You want to remove unneeded variables or functions from your workspace or to erase its contents completely.

Solution

Use the rm function.

Discussion

Your workspace can get cluttered quickly. The rm function removes, permanently, one or more objects from the workspace:

x <- 2 * pi
x
#> [1] 6.28
rm(x)
x
#> Error in eval(expr, envir, enclos): object 'x' not found

There is no “undo”; once the variable is gone, it’s gone.

You can remove several variables at once:

rm(x, y, z)

You can even erase your entire workspace at once. The rm function has a list argument consisting of a vector of names of variables to remove. Recall that the ls function returns a vector of variables names; hence you can combine rm and ls to erase everything:

ls()
#> [1] "f" "x" "y" "z"
rm(list = ls())
ls()
#> character(0)

Alternatively, you could click the broom icon at the top of the Environment pane in R Studio, shown in Figure 2-1.

Figure 2-1. Environment Panel in R Studio

Never put rm(list=ls()) into code you share with others, such as a library function or sample code sent to a mailing list or Stack Overflow. Deleting all the variables in someone else’s workspace is worse than rude and will make you extremely unpopular.

See Also

See “Listing Variables”.

Creating a Vector

Problem

You want to create a vector.

Solution

Use the c(...) operator to construct a vector from given values.

Discussion

Vectors are a central component of R, not just another data structure. A vector can contain either numbers, strings, or logical values but not a mixture.

The c(...) operator can construct a vector from simple elements:

c(1, 1, 2, 3, 5, 8, 13, 21)
#> [1]  1  1  2  3  5  8 13 21
c(1 * pi, 2 * pi, 3 * pi, 4 * pi)
#> [1]  3.14  6.28  9.42 12.57
c("My", "twitter", "handle", "is", "@cmastication")
#> [1] "My"            "twitter"       "handle"        "is"
#> [5] "@cmastication"
c(TRUE, TRUE, FALSE, TRUE)
#> [1]  TRUE  TRUE FALSE  TRUE

If the arguments to c(...) are themselves vectors, it flattens them and combines them into one single vector:

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
c(v1, v2)
#> [1] 1 2 3 4 5 6

Vectors cannot contain a mix of data types, such as numbers and strings. If you create a vector from mixed elements, R will try to accommodate you by converting one of them:

v1 <- c(1, 2, 3)
v3 <- c("A", "B", "C")
c(v1, v3)
#> [1] "1" "2" "3" "A" "B" "C"

Here, the user tried to create a vector from both numbers and strings. R converted all the numbers to strings before creating the vector, thereby making the data elements compatible. Note that R does this without warning or complaint.

Technically speaking, two data elements can coexist in a vector only if they have the same mode. The modes of 3.1415 and "foo" are numeric and character, respectively:

mode(3.1415)
#> [1] "numeric"
mode("foo")
#> [1] "character"

Those modes are incompatible. To make a vector from them, R converts 3.1415 to character mode so it will be compatible with "foo":

c(3.1415, "foo")
#> [1] "3.1415" "foo"
mode(c(3.1415, "foo"))
#> [1] "character"
Warning

c is a generic operator, which means that it works with many datatypes and not just vectors. However, it might not do exactly what you expect, so check its behavior before applying it to other datatypes and objects.
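
For example, if one of the arguments is a list, the result is a longer list, not a simple vector:

lst <- list(1, 2)
c(lst, 3)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3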

See Also

See the “Introduction” to the Chapter 5 chapter for more about vectors and other data structures.

Computing Basic Statistics

Problem

You want to calculate basic statistics: mean, median, standard deviation, variance, correlation, or covariance.

Solution

Use whichever of these functions applies, assuming that x and y are vectors:

  • mean(x)

  • median(x)

  • sd(x)

  • var(x)

  • cor(x, y)

  • cov(x, y)

Discussion

When you first use R, you might open the documentation and begin searching for material entitled “Procedures for Calculating Standard Deviation.” It seems that such an important topic would likely require a whole chapter.

It’s not that complicated.

Standard deviation and other basic statistics are calculated by simple functions. Ordinarily, the function argument is a vector of numbers and the function returns the calculated statistic:

x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
mean(x)
#> [1] 8.8
median(x)
#> [1] 4
sd(x)
#> [1] 11
var(x)
#> [1] 122

The sd function calculates the sample standard deviation, and var calculates the sample variance.

The cor and cov functions can calculate the correlation and covariance, respectively, between two vectors:

x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)
cor(x, y)
#> [1] 0.907
cov(x, y)
#> [1] 11.5

All these functions are picky about values that are not available (NA). Even one NA value in the vector argument causes any of these functions to return NA or even halt altogether with a cryptic error:

x <- c(0, 1, 1, 2, 3, NA)
mean(x)
#> [1] NA
sd(x)
#> [1] NA

It’s annoying when R is that cautious, but it is the right thing to do. You must think carefully about your situation. Does an NA in your data invalidate the statistic? If yes, then R is doing the right thing. If not, you can override this behavior by setting na.rm=TRUE, which tells R to ignore the NA values:

x <- c(0, 1, 1, 2, 3, NA)
sd(x, na.rm = TRUE)
#> [1] 1.14

In older versions of R, mean and sd were smart about data frames. They understood that each column of the data frame is a different variable, so they calculated their statistic for each column individually. This is no longer the case and, as a result, you may read confusing comments online or in older books (like the first edition of this book). In order to apply the functions to each column of a data frame, we now need to use a helper function. The tidyverse family of helper functions for this sort of thing is in the purrr package. As with other tidyverse packages, purrr gets loaded when you run library(tidyverse). The function we’ll use to apply a function to each column of a data frame is map_dbl:

data(cars)

map_dbl(cars, mean)
#> speed  dist
#>  15.4  43.0
map_dbl(cars, sd)
#> speed  dist
#>  5.29 25.77
map_dbl(cars, median)
#> speed  dist
#>    15    36

Notice that using map_dbl to apply mean, sd, or median returns two values, one for each column of the data frame. (Technically, each call returns a two-element vector whose names attribute is taken from the columns of the data frame.)

The var function understands data frames without the help of a mapping function. It calculates the covariance between the columns of the data frame and returns the covariance matrix:

var(cars)
#>       speed dist
#> speed    28  110
#> dist    110  664

Likewise, if x is either a data frame or a matrix, then cor(x) returns the correlation matrix and cov(x) returns the covariance matrix:

cor(cars)
#>       speed  dist
#> speed 1.000 0.807
#> dist  0.807 1.000
cov(cars)
#>       speed dist
#> speed    28  110
#> dist    110  664

Creating Sequences

Problem

You want to create a sequence of numbers.

Solution

Use an n:m expression to create the simple sequence n, n+1, n+2, …, m:

1:5
#> [1] 1 2 3 4 5

Use the seq function for sequences with an increment other than 1:

seq(from = 1, to = 5, by = 2)
#> [1] 1 3 5

Use the rep function to create a series of repeated values:

rep(1, times = 5)
#> [1] 1 1 1 1 1

Discussion

The colon operator (n:m) creates a vector containing the sequence n, n+1, n+2, …, m:

0:9
#>  [1] 0 1 2 3 4 5 6 7 8 9
10:19
#>  [1] 10 11 12 13 14 15 16 17 18 19
9:0
#>  [1] 9 8 7 6 5 4 3 2 1 0

Observe that R was clever with the last expression (9:0). Because 9 is larger than 0, it counts backward from the starting to ending value. You can also use the colon operator directly with the pipe to pass data to another function:

10:20 %>% mean()

The colon operator works for sequences that grow by 1 only. The seq function also builds sequences but supports an optional third argument, which is the increment:

seq(from = 0, to = 20)
#>  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
seq(from = 0, to = 20, by = 2)
#>  [1]  0  2  4  6  8 10 12 14 16 18 20
seq(from = 0, to = 20, by = 5)
#> [1]  0  5 10 15 20

Alternatively, you can specify a length for the output sequence and then R will calculate the necessary increment:

seq(from = 0, to = 20, length.out = 5)
#> [1]  0  5 10 15 20
seq(from = 0, to = 100, length.out = 5)
#> [1]   0  25  50  75 100

The increment need not be an integer. R can create sequences with fractional increments, too:

seq(from = 1.0, to = 2.0, length.out = 5)
#> [1] 1.00 1.25 1.50 1.75 2.00

For the special case of a “sequence” that is simply a repeated value you should use the rep function, which repeats its first argument:

rep(pi, times = 5)
#> [1] 3.14 3.14 3.14 3.14 3.14

See Also

See “Creating a Sequence of Dates” for creating a sequence of Date objects.

Comparing Vectors

Problem

You want to compare two vectors or you want to compare an entire vector against a scalar.

Solution

The comparison operators (==, !=, <, >, <=, >=) can perform an element-by-element comparison of two vectors. They can also compare a vector’s element against a scalar. The result is a vector of logical values in which each value is the result of one element-wise comparison.

Discussion

R has two logical values, TRUE and FALSE. These are often called Boolean values in other programming languages.

The comparison operators compare two values and return TRUE or FALSE, depending upon the result of the comparison:

a <- 3
a == pi # Test for equality
#> [1] FALSE
a != pi # Test for inequality
#> [1] TRUE
a < pi
#> [1] TRUE
a > pi
#> [1] FALSE
a <= pi
#> [1] TRUE
a >= pi
#> [1] FALSE

You can experience the power of R by comparing entire vectors at once. R will perform an element-by-element comparison and return a vector of logical values, one for each comparison:

v <- c(3, pi, 4)
w <- c(pi, pi, pi)
v == w # Compare two 3-element vectors
#> [1] FALSE  TRUE FALSE
v != w
#> [1]  TRUE FALSE  TRUE
v < w
#> [1]  TRUE FALSE FALSE
v <= w
#> [1]  TRUE  TRUE FALSE
v > w
#> [1] FALSE FALSE  TRUE
v >= w
#> [1] FALSE  TRUE  TRUE

You can also compare a vector against a single scalar, in which case R will expand the scalar to the vector’s length and then perform the element-wise comparison. The previous example can be simplified in this way:

v <- c(3, pi, 4)
v == pi # Compare a 3-element vector against one number
#> [1] FALSE  TRUE FALSE
v != pi
#> [1]  TRUE FALSE  TRUE

(This is an application of the Recycling Rule, “Understanding the Recycling Rule”.)

After comparing two vectors, you often want to know whether any of the comparisons were true or whether all the comparisons were true. The any and all functions handle those tests. They both test a logical vector. The any function returns TRUE if any element of the vector is TRUE. The all function returns TRUE if all elements of the vector are TRUE:

v <- c(3, pi, 4)
any(v == pi) # Return TRUE if any element of v equals pi
#> [1] TRUE
all(v == 0) # Return TRUE if all elements of v are zero
#> [1] FALSE

Selecting Vector Elements

Problem

You want to extract one or more elements from a vector.

Solution

Select the indexing technique appropriate for your problem:

  • Use square brackets to select vector elements by their position, such as v[3] for the third element of v.

  • Use negative indexes to exclude elements.

  • Use a vector of indexes to select multiple values.

  • Use a logical vector to select elements based on a condition.

  • Use names to access named elements.

Discussion

Selecting elements from vectors is another powerful feature of R. Basic selection is handled just as in many other programming languages—use square brackets and a simple index:

fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
fib
#>  [1]  0  1  1  2  3  5  8 13 21 34
fib[1]
#> [1] 0
fib[2]
#> [1] 1
fib[3]
#> [1] 1
fib[4]
#> [1] 2
fib[5]
#> [1] 3

Notice that the first element has an index of 1, not 0 as in some other programming languages.

A cool feature of vector indexing is that you can select multiple elements at once. The index itself can be a vector, and each element of that indexing vector selects an element from the data vector:

fib[1:3] # Select elements 1 through 3
#> [1] 0 1 1
fib[4:9] # Select elements 4 through 9
#> [1]  2  3  5  8 13 21

An index of 1:3 means select elements 1, 2, and 3, as just shown. The indexing vector needn’t be a simple sequence, however. You can select elements anywhere within the data vector—as in this example, which selects elements 1, 2, 4, and 8:

fib[c(1, 2, 4, 8)]
#> [1]  0  1  2 13

R interprets negative indexes to mean exclude a value. An index of −1, for instance, means exclude the first value and return all other values:

fib[-1] # Ignore first element
#> [1]  1  1  2  3  5  8 13 21 34

This method can be extended to exclude whole slices by using an indexing vector of negative indexes:

fib[1:3] # As before
#> [1] 0 1 1
fib[-(1:3)] # Invert sign of index to exclude instead of select
#> [1]  2  3  5  8 13 21 34

Another indexing technique uses a logical vector to select elements from the data vector. Everywhere that the logical vector is TRUE, an element is selected:

fib < 10 # This vector is TRUE wherever fib is less than 10
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
fib[fib < 10] # Use that vector to select elements less than 10
#> [1] 0 1 1 2 3 5 8
fib %% 2 == 0 # This vector is TRUE wherever fib is even
#>  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
fib[fib %% 2 == 0] # Use that vector to select the even elements
#> [1]  0  2  8 34

Ordinarily, the logical vector should be the same length as the data vector so you are clearly either including or excluding each element. (If the lengths differ then you need to understand the Recycling Rule, “Understanding the Recycling Rule”.)

By combining vector comparisons, logical operators, and vector indexing, you can perform powerful selections with very little R code:

Select all elements greater than the median

v <- c(3, 6, 1, 9, 11, 16, 0, 3, 1, 45, 2, 8, 9, 6, -4)
v[ v > median(v)]
#> [1]  9 11 16 45  8  9

Select all elements in the lower and upper 5%

v[ (v < quantile(v, 0.05)) | (v > quantile(v, 0.95)) ]
#> [1] 45 -4

The example above uses the | operator, which means “or” when indexing. If you want “and”, use the & operator.

Select all elements that exceed ±1 standard deviations from the mean

v[ abs(v - mean(v)) > sd(v)]
#> [1] 45 -4

Select all elements that are neither NA nor NULL

v <- c(1, 2, 3, NA, 5)
v[!is.na(v) & !is.null(v)]
#> [1] 1 2 3 5

One final indexing feature lets you select elements by name. It assumes that the vector has a names attribute, defining a name for each element. This can be done by assigning a vector of character strings to the attribute:

years <- c(1960, 1964, 1976, 1994)
names(years) <- c("Kennedy", "Johnson", "Carter", "Clinton")
years
#> Kennedy Johnson  Carter Clinton
#>    1960    1964    1976    1994

Once the names are defined, you can refer to individual elements by name:

years["Carter"]
#> Carter
#>   1976
years["Clinton"]
#> Clinton
#>    1994

This generalizes to allow indexing by vectors of names: R returns every element named in the index:

years[c("Carter", "Clinton")]
#>  Carter Clinton
#>    1976    1994

See Also

See “Understanding the Recycling Rule” for more about the Recycling Rule.

Performing Vector Arithmetic

Problem

You want to operate on an entire vector at once.

Solution

The usual arithmetic operators can perform element-wise operations on entire vectors. Many functions operate on entire vectors, too, and return a vector result.

Discussion

Vector operations are one of R’s great strengths. All the basic arithmetic operators can be applied to pairs of vectors. They operate in an element-wise manner; that is, the operator is applied to corresponding elements from both vectors:

v <- c(11, 12, 13, 14, 15)
w <- c(1, 2, 3, 4, 5)
v + w
#> [1] 12 14 16 18 20
v - w
#> [1] 10 10 10 10 10
v * w
#> [1] 11 24 39 56 75
v / w
#> [1] 11.00  6.00  4.33  3.50  3.00
w^v
#> [1] 1.00e+00 4.10e+03 1.59e+06 2.68e+08 3.05e+10

Observe that the length of the result here is equal to the length of the original vectors. The reason is that each element comes from a pair of corresponding values in the input vectors.

If one operand is a vector and the other is a scalar, then the operation is performed between every vector element and the scalar:

w
#> [1] 1 2 3 4 5
w + 2
#> [1] 3 4 5 6 7
w - 2
#> [1] -1  0  1  2  3
w * 2
#> [1]  2  4  6  8 10
w / 2
#> [1] 0.5 1.0 1.5 2.0 2.5
2^w
#> [1]  2  4  8 16 32

For example, you can recenter an entire vector in one expression simply by subtracting the mean of its contents:

w
#> [1] 1 2 3 4 5
mean(w)
#> [1] 3
w - mean(w)
#> [1] -2 -1  0  1  2

Likewise, you can calculate the z-score of a vector in one expression: subtract the mean and divide by the standard deviation:

w
#> [1] 1 2 3 4 5
sd(w)
#> [1] 1.58
(w - mean(w)) / sd(w)
#> [1] -1.265 -0.632  0.000  0.632  1.265

Yet the implementation of vector-level operations goes far beyond elementary arithmetic. It pervades the language, and many functions operate on entire vectors. The functions sqrt and log, for example, apply themselves to every element of a vector and return a vector of results:

w <- 1:5
w
#> [1] 1 2 3 4 5
sqrt(w)
#> [1] 1.00 1.41 1.73 2.00 2.24
log(w)
#> [1] 0.000 0.693 1.099 1.386 1.609
sin(w)
#> [1]  0.841  0.909  0.141 -0.757 -0.959

There are two great advantages to vector operations. The first and most obvious is convenience. Operations that require looping in other languages are one-liners in R. The second is speed. Most vectorized operations are implemented directly in C code, so they are substantially faster than the equivalent R code you could write.

See Also

Performing an operation between a vector and a scalar is actually a special case of the Recycling Rule; see “Understanding the Recycling Rule”.

Getting Operator Precedence Right

Problem

Your R expression is producing a curious result, and you wonder if operator precedence is causing problems.

Solution

The full list of operators is shown in Table 2-1, listed in order of precedence from highest to lowest. Operators of equal precedence are evaluated from left to right except where indicated.

Table 2-1. Operator precedence

Operator                Meaning                                           See also
[ [[                    Indexing                                          “Selecting Vector Elements”
:: :::                  Access variables in a name space (environment)
$ @                     Component extraction, slot extraction
^                       Exponentiation (right to left)
- +                     Unary minus and plus
:                       Sequence creation                                 Recipes #recipe-id021, #recipe-id047
%any% (including %>%)   Special operators                                 Discussion
* /                     Multiplication, division                          Discussion
+ -                     Addition, subtraction
== != < > <= >=         Comparison                                        “Comparing Vectors”
!                       Logical negation
& &&                    Logical “and”, short-circuit “and”
| ||                    Logical “or”, short-circuit “or”
~                       Formula                                           “Performing Simple Linear Regression”
-> ->>                  Rightward assignment                              “Setting Variables”
=                       Assignment (right to left)                        “Setting Variables”
<- <<-                  Assignment (right to left)                        “Setting Variables”
?                       Help                                              “Getting Help on a Function”

It’s not important that you know what every one of these operators does or what it means. The list here is simply to expose you to the idea that different operators have different precedence.

Discussion

Getting your operator precedence wrong in R is a common problem. It certainly happens to the authors a lot. We unthinkingly expect that the expression 0:n−1 will create a sequence of integers from 0 to n − 1 but it does not:

n <- 10
0:n - 1
#>  [1] -1  0  1  2  3  4  5  6  7  8  9

It creates the sequence from −1 to n − 1 because R interprets it as (0:n)−1.

You might not recognize the notation %any% in the table. R interprets any text between two percent signs (%...%) as a binary operator. Several such operators have predefined meanings:

%%

Modulo operator

%/%

Integer division

%*%

Matrix multiplication

%in%

Returns TRUE if the left operand occurs in its right operand; FALSE otherwise

%>%

Pipe that passes results from the left to a function on the right

You can also define new binary operators using the %...% notation; see “Defining Your Own Binary Operators”. The point here is that all such operators have the same precedence.
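
To see a few of these in action, and to sketch how you might define your own, here is a brief example (the operator %+-% is our own invented name, purely for illustration):

7 %% 3              # Modulo: remainder of 7 / 3
#> [1] 1
7 %/% 3             # Integer division
#> [1] 2
2 %in% c(1, 2, 3)   # Membership test
#> [1] TRUE

# Define a custom binary operator that returns the pair (x - y, x + y)
`%+-%` <- function(x, y) c(x - y, x + y)
100 %+-% 1.96
#> [1]  98.04 101.96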

See Also

See “Performing Vector Arithmetic” for more about vector operations, “Performing Matrix Operations” for more about matrix operations, and Recipe X-X to define your own operators. See the Arithmetic and Syntax topics in the R help pages as well as Chapters 5 and 6 of R in a Nutshell (O’Reilly).

Typing Less and Accomplishing More

Problem

You are getting tired of typing long sequences of commands and especially tired of typing the same ones over and over.

Solution

Open an editor window and accumulate your reusable blocks of R commands there. Then, execute those blocks directly from that window. Reserve the command line for typing brief or one-off commands.

When you are done, you can save the accumulated code blocks in a script file for later use.

Discussion

The typical beginner to R types an expression in the console window and sees what happens. As he gets more comfortable, he types increasingly complicated expressions. Then he begins typing multiline expressions. Soon, he is typing the same multiline expressions over and over, perhaps with small variations, in order to perform his increasingly complicated calculations.

The experienced user does not often retype a complex expression. She may type the same expression once or twice, but when she realizes it is useful and reusable she will cut-and-paste it into an editor window. To execute the snippet thereafter, she selects the snippet in the editor window and tells R to execute it, rather than retyping it. This technique is especially powerful as her snippets evolve into long blocks of code.

In R Studio, a few features of the IDE facilitate this workstyle. Windows and Linux machines have slightly different keys than Mac machines: Windows/Linux uses the Ctrl and Alt modifiers, whereas the Mac uses Cmd and Opt.

To open an editor window

From the main menu, select File → New File then select the type of file you want to create, in this case, an R Script.

To execute one line of the editor window

Position the cursor on the line and then press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to execute it.

To execute several lines of the editor window

Highlight the lines using your mouse; then press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to execute them.

To execute the entire contents of the editor window

Press Ctrl+Alt+R (Windows) or Cmd+Opt+R (Mac) to execute the whole editor window. Or, from the menu, click Code → Run Region → Run All.

These keyboard shortcuts and dozens more can be found within R Studio by clicking the menu: Tools → Keyboard Shortcuts Help.

Copying lines from the console window to the editor window is simply a matter of copy and paste. When you exit R Studio, it will ask if you want to save the new script. You can either save it for future reuse or discard it.

Creating a Pipeline of Function Calls

Problem

Creating many intermediate variables in your code is tedious and overly verbose, while nesting R functions seems nearly unreadable.

Solution

Use the pipe operator (%>%) to make your expressions easier to read and write. The pipe operator was created by Stefan Bache; it is found in the magrittr package and is used extensively in many tidyverse functions as well.

Use the pipe operator to combine multiple functions into a “pipeline” of functions without intermediate variables:

library(tidyverse)
data(mpg)

mpg %>%
  filter(cty > 21) %>%
  head(3) %>%
  print()
#> # A tibble: 3 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 chevrolet    mali~   2.4  2008     4 auto~ f        22    30 r     mids~
#> 2 honda        civic   1.6  1999     4 manu~ f        28    33 r     subc~
#> 3 honda        civic   1.6  1999     4 auto~ f        24    32 r     subc~

The pipe is much cleaner and easier to read than using intermediate temporary variables:

temp1 <- filter(mpg, cty > 21)
temp2 <- head(temp1, 3)
print(temp2)
#> # A tibble: 3 x 11
#>   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 chevrolet    mali~   2.4  2008     4 auto~ f        22    30 r     mids~
#> 2 honda        civic   1.6  1999     4 manu~ f        28    33 r     subc~
#> 3 honda        civic   1.6  1999     4 auto~ f        24    32 r     subc~

Discussion

The pipe operator does not provide any new functionality to R, but it can greatly improve readability of code. The pipe operator takes the output of the function or object on the left of the operator and passes it as the first argument of the function on the right.

Writing this:

x %>% head()

is functionally the same as writing this:

head(x)

In both cases x is the argument to head. We can supply additional arguments, but x is always the first argument. These two lines are functionally identical:

x %>% head(n = 10)

head(x, n = 10)

This difference may seem small, but with a more complicated example the benefits begin to accumulate. Suppose we had a workflow where we wanted to use filter to limit our data to certain values, then select to keep only certain variables, followed by ggplot to create a simple plot. We could use intermediate variables:

library(tidyverse)

filtered_mpg <- filter(mpg, cty > 21)
selected_mpg <- select(filtered_mpg, cty, hwy)
ggplot(selected_mpg, aes(cty, hwy)) + geom_point()

This incremental approach is fairly readable but creates a number of intermediate data frames and requires the user to keep track of the state of many objects, which adds cognitive load.

Another alternative is to nest the functions together:

ggplot(select(filter(mpg, cty > 21), cty, hwy), aes(cty, hwy)) + geom_point()

While this is very concise since it’s only one line, this code requires much more attention to read and understand. Code that is difficult to parse mentally can introduce potential for error and also makes future maintenance harder. Using the pipe, we can write the same workflow as a readable sequence of steps:

mpg %>%
  filter(cty > 21) %>%
  select(cty, hwy) %>%
  ggplot(aes(cty, hwy)) + geom_point()
Figure 2-2. Plotting with pipes example

The code above starts with the mpg dataset and pipes it to the filter function, which keeps only records where the city mpg (cty) is greater than 21. Those results are piped into the select command, which keeps only the listed variables cty and hwy, and those are piped into the ggplot command, which produces the point plot shown in Figure 2-2.

If you want the argument going into your target (right hand side) function to be somewhere other than the first argument, use the dot (.) operator:

iris %>% head(3)

is the same as:

iris %>% head(3, x = .)

However, in the second example we passed the iris data frame to the x argument by name, using the dot (.) operator. This can be handy for functions where the input data frame goes in a position other than the first argument.

Throughout this book we use pipes to hold together data transformations with multiple steps. We typically format the code with a line break after each pipe and then indent the code on the following lines. This makes the code easily identifiable as parts of the same data pipeline.

Avoiding Some Common Mistakes

Problem

You want to avoid some of the common mistakes made by beginning users—and also by experienced users, for that matter.

Discussion

Here are some easy ways to make trouble for yourself:

Forgetting the parentheses after a function invocation:
You call an R function by putting parentheses after the name. For instance, this line invokes the ls function:

ls()

However, if you omit the parentheses then R does not execute the function. Instead, it shows the function definition, which is almost never what you want:

ls

# > function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
# >     pattern, sorted = TRUE)
# > {
# >     if (!missing(name)) {
# >         pos <- tryCatch(name, error = function(e) e)
# >         if (inherits(pos, "error")) {
# >             name <- substitute(name)
# >             if (!is.character(name))
# >                 name <- deparse(name)
# > etc...

Forgetting to double up backslashes in Windows file paths
This function call appears to read a Windows file called F:\research\bio\assay.csv, but it does not:

tbl <- read.csv("F:\research\bio\assay.csv")

Backslashes (\) inside character strings have a special meaning and therefore need to be doubled up. R will interpret this file name as F:researchbioassay.csv, for example, which is not what the user wanted. See “Dealing with “Cannot Open File” in Windows” for possible solutions.

Mistyping “<-” as “<” (blank) “-”
The assignment operator is <-, with no space between the < and the -:

x <- pi # Set x to 3.1415926...

If you accidentally insert a space between < and -, the meaning changes completely:

x < -pi # Oops! We are comparing x instead of setting it!
#> [1] FALSE

This is now a comparison (<) between x and negative π (-pi). It does not change x. If you are lucky, x is undefined and R will complain, alerting you that something is fishy:

x < -pi
#> Error in eval(expr, envir, enclos): object 'x' not found

If x is defined, R will perform the comparison and print a logical value, TRUE or FALSE. That should alert you that something is wrong: an assignment does not normally print anything:

x <- 0 # Initialize x to zero
x < -pi # Oops!
#> [1] FALSE

Incorrectly continuing an expression across lines
R reads your typing until you finish a complete expression, no matter how many lines of input that requires. It prompts you for additional input using the + prompt until it is satisfied. This example splits an expression across two lines:

total <- 1 + 2 + 3 + # Continued on the next line
  4 + 5
print(total)
#> [1] 15

Problems begin when you accidentally finish the expression prematurely, which can easily happen:

total <- 1 + 2 + 3 # Oops! R sees a complete expression
+4 + 5 # This is a new expression; R prints its value
#> [1] 9
print(total)
#> [1] 6

There are two clues that something is amiss: R prompted you with a normal prompt (>), not the continuation prompt (+); and it printed the value of 4 + 5.

This common mistake is a headache for the casual user. It is a nightmare for programmers, however, because it can introduce hard-to-find bugs into R scripts.

Using = instead of ==
Use the double-equal operator (==) for comparisons. If you accidentally use the single-equal operator (=), you will irreversibly overwrite your variable:

v <- 1 # Assign 1 to v
v == 0 # Compare v against zero
#> [1] FALSE
v <- 0 # Assign 0 to v, overwriting previous contents

Writing 1:n+1 when you mean 1:(n+1)
You might think that 1:n+1 is the sequence of numbers 1, 2, …, n, n + 1. It’s not. It is the sequence 1, 2, …, n with 1 added to every element, giving 2, 3, …, n, n + 1. This happens because R interprets 1:n+1 as (1:n)+1. Use parentheses to get exactly what you want:

n <- 5
1:n + 1
#> [1] 2 3 4 5 6
1:(n + 1)
#> [1] 1 2 3 4 5 6

Getting bitten by the Recycling Rule
Vector arithmetic and vector comparisons work well when both vectors have the same length. However, the results can be baffling when the operands are vectors of differing lengths. Guard against this possibility by understanding and remembering the Recycling Rule, “Understanding the Recycling Rule”.
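
As a small illustrative sketch of the kind of surprise this can cause, the shorter vector here is silently recycled:

c(1, 2, 3, 4) + c(1, 2)   # The shorter vector is recycled as 1, 2, 1, 2
#> [1] 2 4 4 6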

Installing a package but not loading it with library() or require()
Installing a package is the first step toward using it, but one more step is required. Use library or require to load the package into your search path. Until you do so, R will not recognize the functions or datasets in the package. See “Accessing the Functions in a Package”:

x <- rnorm(100)
n <- 5
truehist(x, n)
#> Error in truehist(x, n): could not find function "truehist"

However if we load the library first, then the code runs and we get the chart shown in Figure 2-3.

library(MASS) # Load the MASS package into R
truehist(x, n)
Figure 2-3. Example truehist

We typically use library() instead of require(). The reason is that if you create an R script that uses library() and the desired package is not already installed, R will return an error. require(), in contrast, will simply return FALSE if the package is not installed.

Writing aList[i] when you mean aList[[i]], or vice versa
If the variable lst contains a list, it can be indexed in two ways: lst[[n]] is the _n_th element of the list; whereas lst[n] is a list whose only element is the _n_th element of lst. That’s a big difference. See “Selecting List Elements by Position”.
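
A small sketch makes the difference concrete (the list lst here is our own toy example):

lst <- list(a = 1, b = 2, c = 3)
lst[[2]]   # The second element itself
#> [1] 2
lst[2]     # A one-element list containing the second element
#> $b
#> [1] 2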

Using & instead of &&, or vice versa; same for | and ||
Use & and | in logical expressions involving the logical values TRUE and FALSE. See “Selecting Vector Elements”.

Use && and || for the flow-of-control expressions inside if and while statements.

Programmers accustomed to other programming languages may reflexively use && and || everywhere because “they are faster.” But those operators give peculiar results when applied to vectors of logical values, so avoid them unless you are sure that they do what you want.
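
For instance, the single & and | operators work element-wise on logical vectors, which is usually what you want when indexing (a small illustrative sketch):

c(TRUE, FALSE) & c(TRUE, TRUE)     # Element-wise "and"
#> [1]  TRUE FALSE
c(TRUE, FALSE) | c(FALSE, FALSE)   # Element-wise "or"
#> [1]  TRUE FALSE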

Passing multiple arguments to a single-argument function
What do you think is the value of mean(9, 10, 11)? No, it’s not 10. It’s 9. The mean function computes the mean of its first argument; the second and third values are matched positionally to mean’s other parameters. To pass multiple items into a single argument, put them in a vector with the c operator. mean(c(9, 10, 11)) returns 10, as you might expect.
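
A quick demonstration of the difference:

mean(9, 10, 11)      # 10 and 11 are matched to mean's other parameters, not averaged
#> [1] 9
mean(c(9, 10, 11))   # Pass all three values as one vector
#> [1] 10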

Some functions, such as mean, take one data argument. Other functions, such as max and min, take multiple arguments and apply themselves across all of them. Be sure you know which is which.

Thinking that max behaves like pmax, or that min behaves like pmin
The max and min functions have multiple arguments and return one value: the maximum or minimum of all their arguments.

The pmax and pmin functions have multiple arguments but return a vector with values taken element-wise from the arguments. See “Finding Pairwise Minimums or Maximums”.
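
For example:

max(1:5, 6:10)    # One value: the maximum across all arguments
#> [1] 10
pmax(1:5, 6:10)   # Element-wise maximums
#> [1]  6  7  8  9 10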

Misusing a function that does not understand data frames
Some functions are quite clever regarding data frames. They apply themselves to the individual columns of the data frame, computing their result for each individual column. Sadly, not all functions are that clever. This includes the mean, median, max, and min functions. They will lump together every value from every column and compute their result from the lump or possibly just return an error. Be aware of which functions are savvy to data frames and which are not.
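
For instance, with the built-in cars data frame, max lumps both columns together rather than returning a per-column result; a brief sketch, using the map_dbl helper introduced earlier for a column-wise alternative:

max(cars)            # One value: the maximum over ALL columns
#> [1] 120
map_dbl(cars, max)   # Column-wise maximums
#> speed  dist
#>    25   120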

Using a single backslash (\) in Windows paths
If you are using R on Windows, it is common to copy and paste a file path into your R script. Windows File Explorer will show you that your path is C:\temp\my_file.csv, but if you try to tell R to read that file, you’ll get a cryptic message:

Error: '\m' is an unrecognized escape in character string starting "'.\temp\m"

This is because R sees backslashes as special characters. You can get around this either by using forward slashes (/) or by doubling the backslashes (\\):

read_csv("./temp/my_file.csv")
read_csv(".\\temp\\my_file.csv")

This is only an issue on Windows, because both Mac and Linux use forward slashes as path separators.

Posting a question to Stack Overflow or the mailing list before searching for the answer
Don’t waste your time. Don’t waste other people’s time. Before you post a question to a mailing list or to Stack Overflow, do your homework and search the archives. Odds are, someone has already answered your question. If so, you’ll see the answer in the discussion thread for the question. See “Searching the Mailing Lists”.

See Also

See Recipes , , , and .

Chapter 4. Input and Output

***todo: add base R options at end of tidy recipes?

Introduction

All statistical work begins with data, and most data is stuck inside files and databases. Dealing with input is probably the first step of implementing any significant statistical project.

All statistical work ends with reporting numbers back to a client, even if you are the client. Formatting and producing output is probably the climax of your project.

Casual R users can solve their input problems by using basic functions such as read.csv to read CSV files and read.table to read more complicated, tabular data. They can use print, cat, and format to produce simple reports.

Users with heavy-duty input/output (I/O) needs are strongly encouraged to read the R Data Import/Export guide, available on CRAN at http://cran.r-project.org/doc/manuals/R-data.pdf. This manual includes important information on reading data from sources such as spreadsheets, binary files, other statistical systems, and relational databases.

Entering Data from the Keyboard

Problem

You have a small amount of data, too small to justify the overhead of creating an input file. You just want to enter the data directly into your workspace.

Solution

For very small datasets, enter the data as literals using the c() constructor for vectors:

scores <- c(61, 66, 90, 88, 100)

Discussion

When working on a simple problem, you may not want the hassle of creating and then reading a data file outside of R. You may just want to enter the data into R. The easiest way is by using the c() constructor for vectors, as shown in the Solution.

This approach works for data frames, too, by entering each variable (column) as a vector:

points <- data.frame(
  label = c("Low", "Mid", "High"),
  lbound = c(0, 0.67,   1.64),
  ubound = c(0.67, 1.64,   2.33)
)

See Also

See Recipe X-X for more about using R’s built-in data editor.

For cutting and pasting data from another application into R, be sure to look at datapasta, a package that provides RStudio add-ins that make pasting data into your scripts easier: https://github.com/MilesMcBain/datapasta

Printing Fewer Digits (or More Digits)

Problem

Your output contains too many digits or too few digits. You want to print fewer or more.

Solution

For print, the digits parameter can control the number of printed digits.

For cat, use the format function (which also has a digits parameter) to alter the formatting of numbers.

Discussion

R normally formats floating-point output to have seven digits:

pi
#> [1] 3.14
100 * pi
#> [1] 314

This works well most of the time but can become annoying when you have lots of numbers to print in a small space. It gets downright misleading when there are only a few significant digits in your numbers and R still prints seven.

The print function lets you vary the number of printed digits using the digits parameter:

print(pi, digits = 4)
#> [1] 3.142
print(100 * pi, digits = 4)
#> [1] 314.2

The cat function does not give you direct control over formatting. Instead, use the format function to format your numbers before calling cat:

cat(pi, "\n")
#> 3.14
cat(format(pi, digits = 4), "\n")
#> 3.142

This is R, so both print and format will format entire vectors at once:

pnorm(-3:3)
#> [1] 0.00135 0.02275 0.15866 0.50000 0.84134 0.97725 0.99865
print(pnorm(-3:3), digits = 3)
#> [1] 0.00135 0.02275 0.15866 0.50000 0.84134 0.97725 0.99865

Notice that print formats the vector elements consistently: it finds the number of digits necessary to format the smallest number and then formats all numbers to have the same width (though not necessarily the same number of digits). This is extremely useful for formatting an entire table:

q <- seq(from = 0, to = 3, by = 0.5)
tbl <- data.frame(Quant = q,
                  Lower = pnorm(-q),
                  Upper = pnorm(q))
tbl                                # Unformatted print
#>   Quant   Lower Upper
#> 1   0.0 0.50000 0.500
#> 2   0.5 0.30854 0.691
#> 3   1.0 0.15866 0.841
#> 4   1.5 0.06681 0.933
#> 5   2.0 0.02275 0.977
#> 6   2.5 0.00621 0.994
#> 7   3.0 0.00135 0.999
print(tbl, digits = 2)             # Formatted print: fewer digits
#>   Quant  Lower Upper
#> 1   0.0 0.5000  0.50
#> 2   0.5 0.3085  0.69
#> 3   1.0 0.1587  0.84
#> 4   1.5 0.0668  0.93
#> 5   2.0 0.0228  0.98
#> 6   2.5 0.0062  0.99
#> 7   3.0 0.0013  1.00

You can also alter the format of all output by using the options function to change the default for digits:

pi
#> [1] 3.14
options(digits = 15)
pi
#> [1] 3.14159265358979

But this is a poor choice in our experience, since it also alters the output from R’s built-in functions, and that alteration is likely to be unpleasant.

See Also

Other functions for formatting numbers include sprintf and formatC; see their help pages for details.
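
As a brief taste of those alternatives (a hedged sketch, not a full tour of their options):

sprintf("%.4f", pi)                     # C-style format string
#> [1] "3.1416"
formatC(pi, digits = 4, format = "f")   # Similar control via formatC
#> [1] "3.1416"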

Redirecting Output to a File

Problem

You want to redirect the output from R into a file instead of your console.

Solution

You can redirect the output of the cat function by using its file argument:

cat("The answer is", answer, "\n", file = "filename.txt")

Use the sink function to redirect all output from both print and cat. Call sink with a filename argument to begin redirecting console output to that file. When you are done, use sink with no argument to close the file and resume output to the console:

sink("filename")          # Begin writing output to file

# ... other session work ...

sink()                    # Resume writing output to console

Discussion

The print and cat functions normally write their output to your console. The cat function writes to a file if you supply a file argument, which can be either a filename or a connection. The print function cannot redirect its output, but the sink function can force all output to a file. A common use for sink is to capture the output of an R script:

sink("script_output.txt")   # Redirect output to file
source("script.R")          # Run the script, capturing its output
sink()                      # Resume writing output to console

If you are repeatedly cat-ing items to one file, be sure to set append=TRUE. Otherwise, each call to cat will simply overwrite the file’s contents:

cat(data, file = "analysisReport.out")
cat(results, file = "analysisRepart.out", append = TRUE)
cat(conclusion, file = "analysisReport.out", append = TRUE)

Hard-coding file names like this is a tedious and error-prone process. Did you notice that the filename is misspelled in the second line? Instead of hard-coding the filename repeatedly, I suggest opening a connection to the file and writing your output to the connection:

con <- file("analysisReport.out", "w")
cat(data, file = con)
cat(results, file = con)
cat(conclusion, file = con)
close(con)

(You don’t need append=TRUE when writing to a connection because append is the default with connections.) This technique is especially valuable inside R scripts because it makes your code more reliable and more maintainable.

Listing Files

Problem

You want an R vector that is a listing of the files in your working directory.

Solution

The list.files function shows the contents of your working directory:

list.files()
#>  [1] "_book"                            "_bookdown_files"
#>  [3] "_bookdown_files.old"              "_bookdown.yml"
#>  [5] "_common.R"                        "_main.rds"
#>  [7] "_output.yaml"                     "01_GettingStarted_cache"
#>  [9] "01_GettingStarted.md"             "01_GettingStarted.Rmd"
etc ...

Discussion

This function is terribly handy to grab the names of all files in a subdirectory. You can use it to refresh your memory of your file names or, more likely, as input into another process, like importing data files.

You can pass list.files a path and a pattern to show only the files in a specific path that match a regular expression pattern:

list.files(path = 'data/') # show files in a directory
#>  [1] "ac.rdata"               "adf.rdata"
#>  [3] "anova.rdata"            "anova2.rdata"
#>  [5] "bad.rdata"              "batches.rdata"
#>  [7] "bnd_cmty.Rdata"         "compositePerf-2010.csv"
#>  [9] "conf.rdata"             "daily.prod.rdata"
#> [11] "data1.csv"              "data2.csv"
#> [13] "datafile_missing.tsv"   "datafile.csv"
#> [15] "datafile.fwf"           "datafile.qsv"
#> [17] "datafile.ssv"           "datafile.tsv"
#> [19] "df_decay.rdata"         "df_squared.rdata"
#> [21] "diffs.rdata"            "example1_headless.csv"
#> [23] "example1.csv"           "excel_table_data.xlsx"
#> [25] "get_USDA_NASS_data.R"   "ibm.rdata"
#> [27] "iris_excel.xlsx"        "lab_df.rdata"
#> [29] "movies.sas7bdat"        "nacho_data.csv"
#> [31] "NearestPoint.R"         "not_a_csv.txt"
#> [33] "opt.rdata"              "outcome.rdata"
#> [35] "pca.rdata"              "pred.rdata"
#> [37] "pred2.rdata"            "sat.rdata"
#> [39] "singles.txt"            "state_corn_yield.rds"
#> [41] "student_data.rdata"     "suburbs.txt"
#> [43] "tab1.csv"               "tls.rdata"
#> [45] "triples.txt"            "ts_acf.rdata"
#> [47] "workers.rdata"          "world_series.csv"
#> [49] "xy.rdata"               "yield.Rdata"
#> [51] "z.RData"
list.files(path = 'data/', pattern = '\\.csv')
#> [1] "compositePerf-2010.csv" "data1.csv"
#> [3] "data2.csv"              "datafile.csv"
#> [5] "example1_headless.csv"  "example1.csv"
#> [7] "nacho_data.csv"         "tab1.csv"
#> [9] "world_series.csv"

To see all the files in your subdirectories, too, use

list.files(recursive = T)

A possible “gotcha” of list.files is that it ignores hidden files—typically, any file whose name begins with a period. If you don’t see the file you expected to see, try setting all.files=TRUE:

list.files(path = 'data/', all.files = TRUE)
#>  [1] "."                      ".."
#>  [3] ".DS_Store"              ".hidden_file.txt"
#>  [5] "ac.rdata"               "adf.rdata"
#>  [7] "anova.rdata"            "anova2.rdata"
#>  [9] "bad.rdata"              "batches.rdata"
#> [11] "bnd_cmty.Rdata"         "compositePerf-2010.csv"
#> [13] "conf.rdata"             "daily.prod.rdata"
#> [15] "data1.csv"              "data2.csv"
#> [17] "datafile_missing.tsv"   "datafile.csv"
#> [19] "datafile.fwf"           "datafile.qsv"
#> [21] "datafile.ssv"           "datafile.tsv"
#> [23] "df_decay.rdata"         "df_squared.rdata"
#> [25] "diffs.rdata"            "example1_headless.csv"
#> [27] "example1.csv"           "excel_table_data.xlsx"
#> [29] "get_USDA_NASS_data.R"   "ibm.rdata"
#> [31] "iris_excel.xlsx"        "lab_df.rdata"
#> [33] "movies.sas7bdat"        "nacho_data.csv"
#> [35] "NearestPoint.R"         "not_a_csv.txt"
#> [37] "opt.rdata"              "outcome.rdata"
#> [39] "pca.rdata"              "pred.rdata"
#> [41] "pred2.rdata"            "sat.rdata"
#> [43] "singles.txt"            "state_corn_yield.rds"
#> [45] "student_data.rdata"     "suburbs.txt"
#> [47] "tab1.csv"               "tls.rdata"
#> [49] "triples.txt"            "ts_acf.rdata"
#> [51] "workers.rdata"          "world_series.csv"
#> [53] "xy.rdata"               "yield.Rdata"
#> [55] "z.RData"

If you just want to see which files are in a directory, and not use the file names in a procedure, the easiest way is to open the Files pane in the lower-right corner of RStudio. But keep in mind that the RStudio Files pane hides files whose names start with a period.

See Also

R has other handy functions for working with files; see help(files).

Dealing with “Cannot Open File” in Windows

Problem

You are running R on Windows, and you are using file names such as C:\data\sample.txt. R says it cannot open the file, but you know the file does exist.

Solution

The backslashes in the file path are causing trouble. You can solve this problem in one of two ways:

  • Change the backslashes to forward slashes: "C:/data/sample.txt".

  • Double the backslashes: "C:\\data\\sample.txt".

Discussion

When you open a file in R, you give the file name as a character string. Problems arise when the name contains backslashes (\) because backslashes have a special meaning inside strings. You’ll probably get something like this:

samp <- read_csv("C:\Data\sample-data.csv")
#> Error: '\D' is an unrecognized escape in character string starting ""C:\D"

R escapes every character that follows a backslash and then removes the backslashes. That leaves a meaningless file path, such as C:Datasample-data.csv in this example.

The simple solution is to use forward slashes instead of backslashes. R leaves the forward slashes alone, and Windows treats them just like backslashes. Problem solved:

samp <- read_csv("C:/Data/sample-data.csv")

An alternative solution is to double the backslashes, since R replaces two consecutive backslashes with a single backslash:

samp <- read_csv("C:\\Data\\sample-data.csv")

Reading Fixed-Width Records

Problem

You are reading data from a file of fixed-width records: records whose data items occur at fixed boundaries.

Solution

Use the read_fwf from the readr package (which is part of the tidyverse). The main arguments are the file name and the description of the fields:

library(tidyverse)
records <- read_fwf("./data/datafile.fwf",
                    fwf_cols(
                      last = 10,
                      first = 10,
                      birth = 5,
                      death = 5
                    ))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
records
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

Discussion

For reading in data into R, we highly recommend the readr package. There are base R functions for reading in text files, but readr improves on these base functions with faster performance, better defaults, and more flexibility.

Suppose we want to read an entire file of fixed-width records, such as fixed-width.txt, shown here:

Fisher    R.A.      1890 1962
Pearson   Karl      1857 1936
Cox       Gertrude  1900 1978
Yates     Frank     1902 1994
Smith     Kirstine  1878 1939

We need to know the column widths. In this case the columns are:

  • Last name, 10 characters

  • First name, 10 characters

  • Year of birth, 5 characters

  • Year of death, 5 characters

There are 5 different ways to define the columns using read_fwf. Pick the one that’s easiest to use (or remember) in your situation:

  1. read_fwf can try to guess your column widths if there is empty space between the columns, using the fwf_empty option:

file <- "./data/datafile.fwf"
t1 <- read_fwf(file, fwf_empty(file, col_names = c("last", "first", "birth", "death")))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  2. You can define each column by a vector of widths followed by a vector of names with fwf_widths:

t2 <- read_fwf(file, fwf_widths(c(10, 10, 5, 4),
                                c("last", "first", "birth", "death")))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  3. The columns can be defined with fwf_cols, which takes a series of column names followed by the column widths:

t3 <-
  read_fwf("./data/datafile.fwf",
           fwf_cols(
             last = 10,
             first = 10,
             birth = 5,
             death = 5
           ))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  4. Each column can be defined by a beginning position and an ending position with fwf_cols:

t4 <- read_fwf(file, fwf_cols(
  last = c(1, 10),
  first = c(11, 20),
  birth = c(21, 25),
  death = c(26, 30)
))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
  5. You can also define the columns with a vector of starting positions, a vector of ending positions, and a vector of column names with fwf_positions:

t5 <- read_fwf(file, fwf_positions(
  c(1, 11, 21, 26),
  c(10, 20, 25, 30),
  c("last", "first", "birth", "death")
))
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )

read_fwf returns a tibble, which is a tidyverse object very similar to a data frame. As is common with tidyverse packages, read_fwf has a good selection of default assumptions that make it less tricky to use than some base R functions for importing data. For example, read_fwf will, by default, import character fields as characters, not factors, which prevents much pain and consternation for users.

See Also

See “Reading Tabular Data Files” for more discussion of reading text files.

Reading Tabular Data Files

Problem

You want to read a text file that contains a table of white-space delimited data.

Solution

Use the read_table2 function from the readr package, which returns a tibble:

library(tidyverse)

tab1 <- read_table2("./data/datafile.tsv")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
tab1
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

Discussion

Tabular data files are quite common. They are text files with a simple format:

  • Each line contains one record.

  • Within each record, fields (items) are separated by a white space delimiter, such as a space or tab.

  • Each record contains the same number of fields.

This format is more free-form than the fixed-width format because fields needn’t be aligned by position. Here is the data file from “Reading Fixed-Width Records” in tabular format, using a tab character between fields:

last    first   birth   death
Fisher  R.A.    1890    1962
Pearson Karl    1857    1936
Cox Gertrude    1900    1978
Yates   Frank   1902    1994
Smith   Kirstine    1878    1939

The read_table2 function is designed to make some good guesses about your data. It assumes your data has column names in the first row, it guesses your delimiter, and it imputes your column types based on the first 1000 records in your dataset. Below is an example with space-delimited data:

t <- read_table2("./data/datafile.ssv")
#> Parsed with column specification:
#> cols(
#>   `#The` = col_character(),
#>   following = col_character(),
#>   is = col_character(),
#>   a = col_character(),
#>   list = col_character(),
#>   of = col_character(),
#>   statisticians = col_character()
#> )
#> Warning: 6 parsing failures.
#> row col  expected    actual                  file
#>   1  -- 7 columns 4 columns './data/datafile.ssv'
#>   2  -- 7 columns 4 columns './data/datafile.ssv'
#>   3  -- 7 columns 4 columns './data/datafile.ssv'
#>   4  -- 7 columns 4 columns './data/datafile.ssv'
#>   5  -- 7 columns 4 columns './data/datafile.ssv'
#> ... ... ......... ......... .....................
#> See problems(...) for more details.
print(t)
#> # A tibble: 6 x 7
#>   `#The`  following is    a     list  of    statisticians
#>   <chr>   <chr>     <chr> <chr> <chr> <chr> <chr>
#> 1 last    first     birth death <NA>  <NA>  <NA>
#> 2 Fisher  R.A.      1890  1962  <NA>  <NA>  <NA>
#> 3 Pearson Karl      1857  1936  <NA>  <NA>  <NA>
#> 4 Cox     Gertrude  1900  1978  <NA>  <NA>  <NA>
#> 5 Yates   Frank     1902  1994  <NA>  <NA>  <NA>
#> 6 Smith   Kirstine  1878  1939  <NA>  <NA>  <NA>

read_table2 often guesses correctly. But as with other readr import functions, you can override the defaults with explicit parameters:

t <-
  read_table2(
    "./data/datafile.tsv",
    col_types = c(
      col_character(),
      col_character(),
      col_integer(),
      col_integer()
    )
  )

If any field contains the string “NA”, then read_table2 assumes that the value is missing and converts it to NA. Your data file might employ a different string to signal missing values, in which case use the na parameter. The SAS convention, for example, is that missing values are signaled by a single period (.). We can read such text files using the na="." option. If we have a file named datafile_missing.tsv that has a missing value indicated with a . in the last row:

last    first   birth   death
Fisher  R.A.    1890    1962
Pearson Karl    1857    1936
Cox Gertrude    1900    1978
Yates   Frank   1902    1994
Smith   Kirstine    1878    1939
Cox David 1924 .

we can import it like so

t <- read_table2("./data/datafile_missing.tsv", na = ".")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
t
#> # A tibble: 6 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939
#> 6 Cox     David     1924    NA

We’re huge fans of self-describing data: data files that describe their own contents. (A computer scientist would say the file contains its own metadata.) The read_table2 function makes the default assumption that the first line of your file contains a header line with column names. If your file does not have column names, you can turn this off with the parameter col_names = FALSE.
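
For example, a minimal sketch, assuming a headerless file named datafile_noheader.tsv (a hypothetical file name); read_table2 will synthesize column names such as X1, X2, and so on:

# Read a file that has no header row (hypothetical file name)
t <- read_table2("./data/datafile_noheader.tsv", col_names = FALSE)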

An additional type of metadata supported by read_table2 is comment lines. Using the comment parameter you can tell read_table2 which character distinguishes comment lines. The following file has a comment line at the top that starts with #.

# The following is a list of statisticians
last first birth death
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939

so we can import this file as follows:

t <- read_table2("./data/datafile.ssv", comment = '#')
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )
t
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

read_table2 has many parameters for controlling how it reads and interprets the input file. See the help page (?read_table2) or the readr vignette (vignette("readr")) for more details. If you’re curious about the difference between read_table and read_table2, it’s in the help file… but the short answer is that read_table is slightly less forgiving in file structure and line length.

See Also

If your data items are separated by commas, see “Reading from CSV Files” for reading a CSV file.

Reading from CSV Files

Problem

You want to read data from a comma-separated values (CSV) file.

Solution

The read_csv function from the readr package is a fast (and, according to the documentation, fun) way to read CSV files. If your CSV file has a header line, use this:

library(tidyverse)

tbl <- read_csv("./data/datafile.csv")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )

If your CSV file does not contain a header line, set the col_names option to FALSE:

tbl <- read_csv("./data/datafile.csv",  col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_character(),
#>   X4 = col_character()
#> )

Discussion

The CSV file format is popular because many programs can import and export data in that format. This includes R, Excel, other spreadsheet programs, many database managers, and most statistical packages. It is a flat file of tabular data, where each line in the file is a row of data, and each row contains data items separated by commas. Here is a very simple CSV file with three rows and three columns (the first line is a header line that contains the column names, also separated by commas):

label,lbound,ubound
low,0,0.674
mid,0.674,1.64
high,1.64,2.33

The read_csv function reads the data and creates a tibble, which is a special type of data frame used in tidyverse packages and a common representation for tabular data. The function assumes that your file has a header line unless told otherwise:

tbl <- read_csv("./data/example1.csv")
#> Parsed with column specification:
#> cols(
#>   label = col_character(),
#>   lbound = col_double(),
#>   ubound = col_double()
#> )
tbl
#> # A tibble: 3 x 3
#>   label lbound ubound
#>   <chr>  <dbl>  <dbl>
#> 1 low    0      0.674
#> 2 mid    0.674  1.64
#> 3 high   1.64   2.33

Observe that read_csv took the column names from the header line for the tibble. If the file did not contain a header, then we would specify col_names=FALSE and R would synthesize column names for us (X1, X2, and X3 in this case):

tbl <- read_csv("./data/example1.csv", col_names = FALSE)
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_character()
#> )
tbl
#> # A tibble: 4 x 3
#>   X1    X2     X3
#>   <chr> <chr>  <chr>
#> 1 label lbound ubound
#> 2 low   0      0.674
#> 3 mid   0.674  1.64
#> 4 high  1.64   2.33

Sometimes it’s convenient to put metadata in files. If the metadata lines start with a common character, such as a pound sign (#), we can use the comment parameter (for example, comment = "#") to ignore them.

The read_csv function has many useful bells and whistles. A few of these options and their default values include:

  • na = c("", "NA"): Indicate what values represent missing or NA values

  • comment = "": which lines to ignore as comments or metadata

  • trim_ws = TRUE: Whether to drop white space at the beginning and/or end of fields

  • skip = 0: Number of rows to skip at the beginning of the file

  • guess_max = min(1000, n_max): Number of rows to consider when imputing column types

See the R help page, help(read_csv), for more details on all the available options.

If you have a data file that uses semicolons (;) for separators and commas for the decimal mark, as is common outside of North America, then you should use the function read_csv2, which is built for that very situation.
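
A minimal sketch, assuming a semicolon-delimited file named data_eu.csv (a hypothetical file name):

# Semicolon separators and comma decimal marks are handled automatically
tbl <- read_csv2("./data/data_eu.csv")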

See Also

See “Writing to CSV Files”. See also the vignette for the readr package: vignette("readr").

Writing to CSV Files

Problem

You want to save a matrix or data frame in a file using the comma-separated values format.

Solution

The write_csv function from the tidyverse readr package can write a CSV file:

library(tidyverse)

write_csv(tab1, path = "./data/tab1.csv")

Discussion

The write_csv function writes tabular data to an ASCII file in CSV format. Each row of data creates one line in the file, with data items separated by commas (,):

library(tidyverse)

print(tab1)
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939
write_csv(tab1, "./data/tab1.csv")

This example creates a file called tab1.csv in the data directory which is a subdirectory of the working directory. The file looks like this:

last,first,birth,death
Fisher,R.A.,1890,1962
Pearson,Karl,1857,1936
Cox,Gertrude,1900,1978
Yates,Frank,1902,1994
Smith,Kirstine,1878,1939

write_csv has only a handful of parameters, and the defaults are usually what you want. Should you want to adjust the output, here are the parameters you can change, along with their defaults:

na = "NA"

: String used in the output file to represent missing (NA) values

append = FALSE

: Whether to append to an existing file instead of overwriting it

col_names = !append

: Whether to write the column names as the first line of the file
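
For example, here is a minimal sketch (with a hypothetical output file name) that writes missing values as empty fields instead of the string "NA":

write_csv(tab1, path = "./data/tab1_blank_na.csv", na = "")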

See Also

See “Getting and Setting the Working Directory” for more about the current working directory and “Saving and Transporting Objects” for other ways to save data to files. For more info on reading and writing text files, see the readr vignette: vignette("readr").

Reading Tabular or CSV Data from the Web

Problem

You want to read data directly from the Web into your R workspace.

Solution

Use the read_csv or read_table2 functions from the readr package, giving a URL instead of a file name. The functions will read directly from the remote server:

library(tidyverse)

berkley <- read_csv('http://bit.ly/barkley18', comment = '#')
#> Parsed with column specification:
#> cols(
#>   Name = col_character(),
#>   Location = col_character(),
#>   Time = col_time(format = "")
#> )

You can also open a connection using the URL and then read from the connection, which may be preferable for complicated files.
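
As a minimal sketch of the connection approach (the URL is hypothetical), open the connection yourself with base R’s url function and hand it to a reader:

conn <- url("http://www.example.com/download/data.csv")   # hypothetical URL
tbl <- read.csv(conn)           # base R reader; opens and closes the connection for you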

Discussion

The Web is a gold mine of data. You could download the data into a file and then read the file into R, but it’s more convenient to read directly from the Web. Give the URL to read_csv, read_table2, or another read function in readr (depending on the format of the data), and the data will be downloaded and parsed for you. No fuss, no muss.

Aside from using a URL, this recipe is just like reading from a CSV file (“Reading from CSV Files”) or a complex file (“Reading Files with a Complex Structure”), so all the comments in those recipes apply here, too.

Remember that URLs work for FTP servers, not just HTTP servers. This means that R can also read data from FTP sites using URLs:

tbl <- read_table2("ftp://ftp.example.com/download/data.txt")

See Also

See “Reading from CSV Files” and “Reading Files with a Complex Structure”.

Reading Data From Excel

Problem

You want to read data in from an Excel file.

Solution

The openxlsx package makes reading Excel files easy.

library(openxlsx)

df1 <- read.xlsx(xlsxFile = "data/iris_excel.xlsx",
                 sheet = 'iris_data')
head(df1, 3)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa

Discussion

The package openxlsx is a good choice for both reading and writing Excel files with R. If we’re reading in an entire sheet then passing a file name and a sheet name to the read.xlsx function is a simple option. But openxlsx supports more complex workflows.

A common pattern is to read a named table out of an Excel file and into an R data frame. This is trickier because the sheet we’re reading from may have values outside of the named table and we want to only read in the named table range. We can use the functions in openxlsx to get the location of a table, then read that range of cells into a data frame.

First we load the workbook into R:

library(openxlsx)
wb <- loadWorkbook("data/excel_table_data.xlsx")

Then we can use the getTables function to get the names and ranges of all the Excel Tables in the input_data sheet and select the one table we want. In this example the Excel Table we are after is named example_table:

tables <- getTables(wb, 'input_data')
table_range_str <- names(tables[tables == 'example_table'])
table_range_refs <- strsplit(table_range_str, ':')[[1]]

# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)

# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)

Now the vector table_range_col_num contains the column numbers of our named table, while table_range_row_num contains the row numbers. We can then use the read.xlsx function to pull in only the rows and columns we are after:

df <- read.xlsx(
  xlsxFile = "data/excel_table_data.xlsx",
  sheet = 'input_data',
  cols = table_range_col_num[1]:table_range_col_num[2],
  rows = table_range_row_num[1]:table_range_row_num[2]
)

See Also

See the vignette for openxlsx by installing openxlsx and running: vignette('Introduction', package = 'openxlsx')

The readxl package is part of the Tidyverse and provides fast, simple reading of Excel files: https://readxl.tidyverse.org/

The writexl package is a fast and lightweight (no dependencies) package for writing Excel files: https://cran.r-project.org/web/packages/writexl/index.html

“Writing a Data Frame to Excel”

Writing a Data Frame to Excel

Problem

You want to write an R data frame to an Excel file.

Solution

The openxlsx package makes writing to Excel files relatively easy. While there are lots of options in openxlsx, a typical pattern is to specify an Excel file name and a sheet name:

library(openxlsx)

write.xlsx(x = iris,
           sheetName = 'iris_data',
           file = "data/iris_excel.xlsx")

Discussion

The openxlsx package has a huge number of options for controlling many aspects of the Excel object model. We can use it to set cell colors, define named ranges, and set cell outlines, for example. But it has a few helper functions like write.xlsx which make simple tasks easier.

When businesses work with Excel, it’s good practice to keep all input data in a named Excel Table, which makes accessing the data easier and less error prone. However, if you use openxlsx to overwrite an Excel Table in one of the sheets, you run the risk that the new data may contain fewer rows than the Excel Table it replaces. That could cause errors, since you would end up with old data and new data in contiguous rows. The solution is to first delete the existing Excel Table, then write the new data back into the same location and assign it to a named Excel Table. To do this we need to use the more advanced Excel manipulation features of openxlsx.

First we use loadWorkbook to read the Excel workbook into R in its entirety:

library(openxlsx)

wb <- loadWorkbook("data/excel_table_data.xlsx")

Before we delete the table, we extract its starting row and column:

tables <- getTables(wb, 'input_data')
table_range_str <- names(tables[tables == 'example_table'])
table_range_refs <- strsplit(table_range_str, ':')[[1]]

# use a regex to extract out the starting row number
table_row_num <- gsub("[^0-9.]", "", table_range_refs)[[1]]

# extract out the starting column number
table_col_num <- convertFromExcelRef(table_range_refs)[[1]]

Then we can use the removeTable function to remove the existing named Excel Table:

## remove the existing Excel Table
removeTable(wb = wb,
            sheet = 'input_data',
            table = 'example_table')

Then we can use writeDataTable to write the iris data frame (which comes with R) back into our workbook object in R:

writeDataTable(
  wb = wb,
  sheet = 'input_data',
  x = iris,
  startCol = table_col_num,
  startRow = table_row_num,
  tableStyle = "TableStyleLight9",
  tableName = 'example_table'
)

At this point we could save the workbook and our Table would be updated. However, it’s a good idea to save some metadata in the workbook to let others know exactly when the data was refreshed. We can do this with the writeData function, putting the text in cell B5, and then save the workbook back to a file, overwriting the original:

writeData(
  wb = wb,
  sheet = 'input_data',
  x = paste('example_table data refreshed on:', Sys.time()),
  startCol = 2,
  startRow = 5
)

## then save the workbook
saveWorkbook(wb = wb,
             file = "data/excel_table_data.xlsx",
             overwrite = T)

The resulting Excel sheet is shown in Figure 4-1.

Figure 4-1. Excel Table and Caption

See Also

See the vignette for openxlsx by installing openxlsx and running: vignette('Introduction', package = 'openxlsx')

The readxl package is part of the Tidyverse and provides fast, simple reading of Excel files: https://readxl.tidyverse.org/

The writexl package is a fast and lightweight (no dependencies) package for writing Excel files: https://cran.r-project.org/web/packages/writexl/index.html

“Reading Data From Excel”

Reading Data from a SAS file

Problem

You want to read a SAS data set into an R data frame.

Solution

The haven package supports reading SAS sas7bdat files into R:

library(haven)

sas_movie_data <- read_sas("data/movies.sas7bdat")

Discussion

SAS versions 7 and later all support the sas7bdat file format. The read_sas function in haven reads that format, including variable labels. If your SAS file has variable labels, they are imported and stored in the label attributes of the data frame’s columns. These labels are not printed by default. You can see them by opening the data frame in RStudio, or by calling the base R function attributes on each column:

sapply(sas_movie_data, attributes)
#> $Movie
#> $Movie$label
#> [1] "Movie"
#>
#>
#> $Type
#> $Type$label
#> [1] "Type"
#>
#>
#> $Rating
#> $Rating$label
#> [1] "Rating"
#>
#>
#> $Year
#> $Year$label
#> [1] "Year"
#>
#>
#> $Domestic__
#> $Domestic__$label
#> [1] "Domestic $"
#>
#> $Domestic__$format.sas
#> [1] "F"
#>
#>
#> $Worldwide__
#> $Worldwide__$label
#> [1] "Worldwide $"
#>
#> $Worldwide__$format.sas
#> [1] "F"
#>
#>
#> $Director
#> $Director$label
#> [1] "Director"

See Also

The sas7bdat package is much slower on large files than haven, but it has more elaborate support for file attributes. If the SAS metadata is important to you then you should investigate sas7bdat::read.sas7bdat.
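
A minimal sketch of that alternative, assuming the same movies file used above:

library(sas7bdat)

sas_movie_data_2 <- read.sas7bdat("data/movies.sas7bdat")   # slower, but richer metadata support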

Reading Data from HTML Tables

Problem

You want to read data from an HTML table on the Web.

Solution

Use the read_html and html_table functions in the rvest package. To read all tables on the page, do the following:

library(rvest)
library(magrittr)

all_tables <-
  read_html("https://en.wikipedia.org/wiki/Aviation_accidents_and_incidents") %>%
  html_table(fill = TRUE, header = TRUE)

The html_table function returns a list containing every table found in the HTML document. To pull a single table from that list, you can use the function extract2 from the magrittr package:

out_table <-
  read_html("https://en.wikipedia.org/wiki/Aviation_accidents_and_incidents") %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(2)

head(out_table)
#>   Year Deaths[52] # of incidents[53]
#> 1 2017        399           101 [54]
#> 2 2016        629                102
#> 3 2015        898                123
#> 4 2014      1,328                122
#> 5 2013        459                138
#> 6 2012        800                156

Note that the rvest and magrittr packages are both installed when you run install.packages('tidyverse'). They are not core tidyverse packages, however, so you must explicitly load them, as shown here.

Discussion

Web pages can contain several HTML tables. Calling read_html(url) and then piping the result to html_table() reads all tables on the page and returns them in a list. This can be useful for exploring a page, but it’s annoying if you want just one specific table. In that case, use extract2(n) to select the nth table.

Two common parameters for the html_table function are fill = TRUE, which fills in missing values with NA, and header = TRUE, which indicates that the first row contains the header names.

The following example loads all tables from the Wikipedia page entitled “World population”:

url <- 'http://en.wikipedia.org/wiki/World_population'
tbls <- read_html(url) %>%
  html_table(fill = TRUE, header = TRUE)

As it turns out, that page contains 23 tables (or things that html_table thinks might be tables):

length(tbls)
#> [1] 23

In this example we care only about the second table (which lists the world’s most populous countries), so we can either access that element using brackets, tbls[[2]], or pipe it into the extract2 function from the magrittr package:

library(magrittr)
url <- 'http://en.wikipedia.org/wiki/World_population'
tbl <- read_html(url) %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(2)

head(tbl, 2)
#>   World population (millions, UN estimates)[10]
#> 1                                             #
#> 2                                             1
#>   World population (millions, UN estimates)[10]
#> 1               Top ten most populous countries
#> 2                                        China*
#>   World population (millions, UN estimates)[10]
#> 1                                          2000
#> 2                                         1,270
#>   World population (millions, UN estimates)[10]
#> 1                                          2015
#> 2                                         1,376
#>   World population (millions, UN estimates)[10]
#> 1                                         2030*
#> 2                                         1,416

In that table, columns 2 and 3 contain the country name and population, respectively:

tbl[, c(2, 3)]
#>                          World population (millions, UN estimates)[10]
#> 1                                      Top ten most populous countries
#> 2                                                               China*
#> 3                                                                India
#> 4                                                        United States
#> 5                                                            Indonesia
#> 6                                                             Pakistan
#> 7                                                               Brazil
#> 8                                                              Nigeria
#> 9                                                           Bangladesh
#> 10                                                              Russia
#> 11                                                              Mexico
#> 12                                                         World total
#> 13 Notes:\nChina = excludes Hong Kong and Macau\n2030 = Medium variant
#>                        World population (millions, UN estimates)[10].1
#> 1                                                                 2000
#> 2                                                                1,270
#> 3                                                                1,053
#> 4                                                                  283
#> 5                                                                  212
#> 6                                                                  136
#> 7                                                                  176
#> 8                                                                  123
#> 9                                                                  131
#> 10                                                                 146
#> 11                                                                 103
#> 12                                                               6,127
#> 13 Notes:\nChina = excludes Hong Kong and Macau\n2030 = Medium variant

Right away, we can see problems with the data: the second row of the data has info that really belongs with the header. And China has * appended to its name. On the Wikipedia website, that was a footnote reference, but now it’s just a bit of unwanted text. Adding insult to injury, the population numbers have embedded commas, so you cannot easily convert them to raw numbers. All these problems can be solved by some string processing, but each problem adds at least one more step to the process.
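
As an illustrative sketch of that kind of cleanup (the new column names are our own, and we keep only the ten country rows shown above):

pop <- tbl[2:11, c(2, 3)]                                 # keep only the ten country rows
names(pop) <- c("country", "pop_2000")                    # hypothetical, simpler column names
pop$country <- gsub("\\*", "", pop$country)               # drop the footnote asterisk from "China*"
pop$pop_2000 <- as.numeric(gsub(",", "", pop$pop_2000))   # strip commas, then convert to numeric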

This illustrates the main obstacle to reading HTML tables. HTML was designed for presenting information to people, not to computers. When you “scrape” information off an HTML page, you get stuff that’s useful to people but annoying to computers. If you ever have a choice, choose instead a computer-oriented data representation such as XML, JSON, or CSV.

The read_html(url) and html_table() functions are part of the rvest package, which (by necessity) is large and complex. Any time you pull data from a site designed for human readers, not machines, expect that you will have to do post processing to clean up the bits and pieces left messy by the machine.

See Also

See “Installing Packages from CRAN” for downloading and installing packages such as the rvest package.

Reading Files with a Complex Structure

Problem

You are reading data from a file that has a complex or irregular structure.

Solution

  • Use the readLines function to read individual lines; then process them as strings to extract data items.

  • Alternatively, use the scan function to read individual tokens and use the argument what to describe the stream of tokens in your file. The function can convert tokens into data and then assemble the data into records.

Discussion

Life would be simple and beautiful if all our data files were organized into neat tables with cleanly delimited data. We could read those files using one of the functions in the readr package and get on with living.

Unfortunately, we don’t live in a land of rainbows and unicorn kisses.

You will eventually encounter a funky file format, and your job (suck it up, buttercup) is to read the file contents into R.

The read.table and read.csv functions are line-oriented and probably won’t help. However, the readLines and scan functions are useful here because they let you process the individual lines and even tokens of the file.

The readLines function is pretty simple. It reads lines from a file and returns them as a list of character strings:

lines <- readLines("input.txt")

You can limit the number of lines by using the n parameter, which gives the maximum number of lines to be read:

lines <- readLines("input.txt", n = 10)       # Read 10 lines and stop

The scan function is much richer. It reads one token at a time and handles it according to your instructions. The first argument is either a filename or a connection (more on connections later). The second argument is called what, and it describes the tokens that scan should expect in the input file. The description is cryptic but quite clever:

what=numeric(0)

Interpret the next token as a number.

what=integer(0)

Interpret the next token as an integer.

what=complex(0)

Interpret the next token as a complex number.

what=character(0)

Interpret the next token as a character string.

what=logical(0)

Interpret the next token as a logical value.

The scan function will apply the given pattern repeatedly until all data is read.

Suppose your file is simply a sequence of numbers, like this:

2355.09 2246.73 1738.74 1841.01 2027.85

Use what=numeric(0) to say, “My file is a sequence of tokens, each of which is a number”:

singles <- scan("./data/singles.txt", what = numeric(0))
singles
#> [1] 2355.09 2246.73 1738.74 1841.01 2027.85

A key feature of scan is that the what can be a list containing several token types. The scan function will assume your file is a repeating sequence of those types. Suppose your file contains triplets of data, like this:

15-Oct-87 2439.78 2345.63 16-Oct-87 2396.21 2207.73
19-Oct-87 2164.16 1677.55 20-Oct-87 2067.47 1616.21
21-Oct-87 2081.07 1951.76

Use a list to tell scan that it should expect a repeating, three-token sequence:

triples <-
  scan("./data/triples.txt",
       what = list(character(0), numeric(0), numeric(0)))
triples
#> [[1]]
#> [1] "15-Oct-87" "16-Oct-87" "19-Oct-87" "20-Oct-87" "21-Oct-87"
#>
#> [[2]]
#> [1] 2439.78 2396.21 2164.16 2067.47 2081.07
#>
#> [[3]]
#> [1] 2345.63 2207.73 1677.55 1616.21 1951.76

Give names to the list elements, and scan will assign those names to the data:

triples <- scan("./data/triples.txt",
                what = list(
                  date = character(0),
                  high = numeric(0),
                  low = numeric(0)
                ))
triples
#> $date
#> [1] "15-Oct-87" "16-Oct-87" "19-Oct-87" "20-Oct-87" "21-Oct-87"
#>
#> $high
#> [1] 2439.78 2396.21 2164.16 2067.47 2081.07
#>
#> $low
#> [1] 2345.63 2207.73 1677.55 1616.21 1951.76

This can easily be turned into a data frame with the data.frame command:

df_triples <- data.frame(triples)
df_triples
#>        date    high     low
#> 1 15-Oct-87 2439.78 2345.63
#> 2 16-Oct-87 2396.21 2207.73
#> 3 19-Oct-87 2164.16 1677.55
#> 4 20-Oct-87 2067.47 1616.21
#> 5 21-Oct-87 2081.07 1951.76

The scan function has many bells and whistles, but the following are especially useful:

n=number

Stop after reading this many tokens. (Default: stop at end of file.)

nlines=number

Stop after reading this many input lines. (Default: stop at end of file.)

skip=number

Number of input lines to skip before reading data.

na.strings=list

A list of strings to be interpreted as NA.

An Example

Let’s use this recipe to read a dataset from StatLib, the repository of statistical data and software maintained by Carnegie Mellon University. Jeff Witmer contributed a dataset called wseries that shows the pattern of wins and losses for every World Series since 1903. The dataset is stored in an ASCII file with 35 lines of comments followed by 23 lines of data. The data itself looks like this:

1903  LWLlwwwW    1927  wwWW      1950  wwWW      1973  WLwllWW
1905  wLwWW       1928  WWww      1951  LWlwwW    1974  wlWWW
1906  wLwLwW      1929  wwLWW     1952  lwLWLww   1975  lwWLWlw
1907  WWww        1930  WWllwW    1953  WWllwW    1976  WWww
1908  wWLww       1931  LWwlwLW   1954  WWww      1977  WLwwlW

.
. (etc.)
.

The data is encoded as follows: L = loss at home, l = loss on the road, W = win at home, w = win on the road. The data appears in column order, not row order, which complicates our lives a bit.

Here is the R code for reading the raw data:

# Read the wseries dataset:
#     - Skip the first 35 lines
#     - Then read 23 lines of data
#     - The data occurs in pairs: a year and a pattern (char string)
#
world.series <- scan(
  "http://lib.stat.cmu.edu/datasets/wseries",
  skip = 35,
  nlines = 23,
  what = list(year = integer(0),
              pattern = character(0))
)

The scan function returns a list, so we get a list with two elements: year and pattern. The function reads from left to right, but the dataset is organized by columns and so the years appear in a strange order:

world.series$year
#>  [1] 1903 1927 1950 1973 1905 1928 1951 1974 1906 1929 1952 1975 1907 1930
#> [15] 1953 1976 1908 1931 1954 1977 1909 1932 1955 1978 1910 1933 1956 1979
#> [29] 1911 1934 1957 1980 1912 1935 1958 1981 1913 1936 1959 1982 1914 1937
#> [43] 1960 1983 1915 1938 1961 1984 1916 1939 1962 1985 1917 1940 1963 1986
#> [57] 1918 1941 1964 1987 1919 1942 1965 1988 1920 1943 1966 1989 1921 1944
#> [71] 1967 1990 1922 1945 1968 1991 1923 1946 1969 1992 1924 1947 1970 1993
#> [85] 1925 1948 1971 1926 1949 1972

We can fix that by sorting the list elements according to year:

perm <- order(world.series$year)
world.series <- list(year    = world.series$year[perm],
                     pattern = world.series$pattern[perm])

Now the data appears in chronological order:

world.series$year
#>  [1] 1903 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917
#> [15] 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931
#> [29] 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
#> [43] 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
#> [57] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
#> [71] 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
#> [85] 1988 1989 1990 1991 1992 1993

world.series$pattern
#>  [1] "LWLlwwwW" "wLwWW"    "wLwLwW"   "WWww"     "wWLww"    "WLwlWlw"
#>  [7] "WWwlw"    "lWwWlW"   "wLwWlLW"  "wLwWw"    "wwWW"     "lwWWw"
#> [13] "WWlwW"    "WWllWw"   "wlwWLW"   "WWlwwLLw" "wllWWWW"  "LlWwLwWw"
#> [19] "WWwW"     "LwLwWw"   "LWlwlWW"  "LWllwWW"  "lwWLLww"  "wwWW"
#> [25] "WWww"     "wwLWW"    "WWllwW"   "LWwlwLW"  "WWww"     "WWlww"
#> [31] "wlWLLww"  "LWwwlW"   "lwWWLw"   "WWwlw"    "wwWW"     "WWww"
#> [37] "LWlwlWW"  "WLwww"    "LWwww"    "WLWww"    "LWlwwW"   "LWLwwlw"
#> [43] "LWlwlww"  "WWllwLW"  "lwWWLw"   "WLwww"    "wwWW"     "LWlwwW"
#> [49] "lwLWLww"  "WWllwW"   "WWww"     "llWWWlw"  "llWWWlw"  "lwLWWlw"
#> [55] "llWLWww"  "lwWWLw"   "WLlwwLW"  "WLwww"    "wlWLWlw"  "wwWW"
#> [61] "WLlwwLW"  "llWWWlw"  "wwWW"     "wlWWLlw"  "lwLLWww"  "lwWWW"
#> [67] "wwWLW"    "llWWWlw"  "wwLWLlw"  "WLwllWW"  "wlWWW"    "lwWLWlw"
#> [73] "WWww"     "WLwwlW"   "llWWWw"   "lwLLWww"  "WWllwW"   "llWWWw"
#> [79] "LWwllWW"  "LWwww"    "wlWWW"    "LLwlwWW"  "LLwwlWW"  "WWlllWW"
#> [85] "WWlww"    "WWww"     "WWww"     "WWlllWW"  "lwWWLw"   "WLwwlW"

Reading from MySQL Databases

Problem

You want access to data stored in a MySQL database.

Solution

  1. Install the RMySQL package on your computer.

  2. Open a database connection using the DBI::dbConnect function.

  3. Use dbGetQuery to initiate a SELECT and return the result sets.

  4. Use dbDisconnect to terminate the database connection when you are done.

Discussion

This recipe requires that the RMySQL package be installed on your computer. That package requires, in turn, the MySQL client software. If the MySQL client software is not already installed and configured, consult the MySQL documentation or your system administrator.

Use the dbConnect function to establish a connection to the MySQL database. It returns a connection object which is used in subsequent calls to RMySQL functions:

library(RMySQL)

con <- dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "your_db_name",
    host = "your.host.com",
    username = "userid",
    password = "pwd"
  )

The username, password, and host parameters are the same parameters used for accessing MySQL through the mysql client program. The example given here shows them hard-coded into the dbConnect call. Actually, that is an ill-advised practice. It puts your password in a plain-text document, creating a security problem. It also creates a major headache whenever your password or host changes, requiring you to hunt down the hard-coded values. We strongly recommend using the security mechanism of MySQL instead. Put those three parameters into your MySQL configuration file, which is $HOME/.my.cnf on Unix and C:\my.cnf on Windows. Make sure the file is unreadable by anyone except you. The file is delimited into sections with markers such as [client]. Put the parameters into the [client] section, so that your config file will contain something like this:

[client]
user = userid
password = password
host = hostname

Once the parameters are defined in the config file, you no longer need to supply them in the dbConnect call, which then becomes much simpler:

con <- dbConnect(
  drv = RMySQL::MySQL(),
  dbname = "your_db_name"
)

Use the dbGetQuery function to submit your SQL to the database and read the result sets. Doing so requires an open database connection:

sql <- "SELECT * from SurveyResults WHERE City = 'Chicago'"
rows <- dbGetQuery(con, sql)

You are not restricted to SELECT statements. Any SQL that generates a result set is OK. It is common to use CALL statements, for example, if your SQL is encapsulated in stored procedures and those stored procedures contain embedded SELECT statements.

Using dbGetQuery is convenient because it packages the result set into a data frame and returns the data frame. This is the perfect representation of an SQL result set. The result set is a tabular data structure of rows and columns, and so is a data frame. The result set’s columns have names given by the SQL SELECT statement, and R uses them for naming the columns of the data frame.

After the first result set of data, MySQL can return a second result set containing status information. You can choose to inspect the status or ignore it, but you must read it. Otherwise, MySQL will complain that there are unprocessed result sets and then halt. So call dbNextResult if necessary:

if (dbMoreResults(con)) dbNextResult(con)

Call dbGetQuery repeatedly to perform multiple queries, checking for the result status after each call (and reading it, if necessary). When you are done, close the database connection using dbDisconnect:

dbDisconnect(con)

Here is a complete session that reads and prints three rows from a database of stock prices. The query selects the price of IBM stock for the last three days of 2008. It assumes that the username, password, and host are defined in the my.cnf file:

con <- dbConnect(MySQL(), client.flag = CLIENT_MULTI_RESULTS)
sql <- paste(
  "select * from DailyBar where Symbol = 'IBM'",
  "and Day between '2008-12-29' and '2008-12-31'"
)
rows <- dbGetQuery(con, sql)
if (dbMoreResults(con)) {
  dbNextResult(con)
}
dbDisconnect(con)
print(rows)

#>   Symbol        Day       Next OpenPx HighPx LowPx ClosePx AdjClosePx
#> 1    IBM 2008-12-29 2008-12-30  81.72  81.72 79.68   81.25      81.25
#> 2    IBM 2008-12-30 2008-12-31  81.83  83.64 81.52   83.55      83.55
#> 3    IBM 2008-12-31 2009-01-02  83.50  85.00 83.50   84.16      84.16
#>   HistClosePx  Volume OpenInt
#> 1       81.25 6062600      NA
#> 2       83.55 5774400      NA
#> 3       84.16 6667700      NA

See Also

See “Installing Packages from CRAN” and the documentation for RMySQL, which contains more details about configuring and using the package.

See “Accessing a Database with dbplyr” for information about how to get data from a SQL database without actually writing SQL yourself.

R can read from several other RDBMS systems, including Oracle, Sybase, PostgreSQL, and SQLite. For more information, see the R Data Import/Export guide, which is supplied with the base distribution (“Viewing the Supplied Documentation”) and is also available on CRAN at http://cran.r-project.org/doc/manuals/R-data.pdf.

Accessing a Database with dbplyr

Problem

You want to access a database, but you’d rather not write SQL code in order to manipulate data and return results to R.

Solution

In addition to being a grammar of data manipulation, the tidyverse package dplyr can, in conjunction with the dbplyr package, turn your dplyr commands into SQL for you.

Let’s set up an example database using RSQLite, then connect to it and use dplyr with the dbplyr backend to extract data.

Set up the example table by loading the msleep example data into an in-memory SQLite database:

library(tidyverse)   # provides dplyr (copy_to) and the msleep example data

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
sleep_db <- copy_to(con, msleep, "sleep")

Now that we have a table in our database, we can create a reference to it from R:

sleep_table <- tbl(con, "sleep")

The sleep_table object is a kind of pointer or alias to the table in the database. However, dplyr treats it like a regular tibble or data frame, so you can operate on it using dplyr and other R commands. Let’s select all animals in the data that sleep less than 3 hours:

little_sleep <- sleep_table %>%
  select(name, genus, order, sleep_total) %>%
  filter(sleep_total < 3)

The dbplyr backend does not fetch any data when we run these commands; it only builds the query. To see the query built by dplyr, use show_query:

show_query(little_sleep)
#> <SQL>
#> SELECT *
#> FROM (SELECT `name`, `genus`, `order`, `sleep_total`
#> FROM `sleep`)
#> WHERE (`sleep_total` < 3.0)

Then to bring the data back to your local machine use collect:

local_little_sleep <- collect(little_sleep)
local_little_sleep
#> # A tibble: 3 x 4
#>   name        genus         order          sleep_total
#>   <chr>       <chr>         <chr>                <dbl>
#> 1 Horse       Equus         Perissodactyla         2.9
#> 2 Giraffe     Giraffa       Artiodactyla           1.9
#> 3 Pilot whale Globicephalus Cetacea                2.7

Discussion

Accessing SQL databases by writing only dplyr commands makes you more productive because you don’t have to switch back and forth between languages. The alternative is to store large chunks of SQL code as text strings in the middle of an R script, or to keep the SQL in separate files that are read in by R.

By letting dplyr transparently create the SQL in the background, you are freed from having to maintain separate SQL code to extract data.

The dbplyr package uses DBI to connect to your database, so you’ll need a DBI backend package for whichever database you want to access.

Some commonly used backend DBI packages are:

odbc

Uses the Open Database Connectivity (ODBC) protocol to connect to many different databases. This is typically the best choice when connecting to Microsoft SQL Server. ODBC is usually straightforward on Windows machines but may require considerable effort to get working on Linux or macOS.

RPostgreSQL

For connecting to Postgres and Redshift.

RMySQL

For MySQL and MariaDB

RSQLite

Connecting to SQLite databases on disk or in memory.

bigrquery

For connections to Google’s BigQuery.

Each DBI backend package listed above is available on CRAN and can be installed with the usual install.packages('packagename') command.
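
Connecting through any of these backends follows the same DBI pattern. Here is a hedged sketch using the odbc package; the DSN name is hypothetical and assumes an ODBC data source is already configured on your machine:

library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "my_datasource")   # "my_datasource" is a hypothetical DSN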

See Also

For more information about connecting to databases with R and RStudio, see https://db.rstudio.com/.

For more detail on SQL translation in dbplyr, see the sql-translation vignette at vignette("sql-translation") or http://dbplyr.tidyverse.org/articles/sql-translation.html

Saving and Transporting Objects

Problem

You want to store one or more R objects in a file for later use, or you want to copy an R object from one machine to another.

Solution

Write the objects to a file using the save function:

save(tbl, t, file = "myData.RData")

Read them back using the load function, either on your computer or on any platform that supports R:

load("myData.RData")

The save function writes binary data. To save in an ASCII format, use dput or dump instead:

dput(tbl, file = "myData.txt")
dump("tbl", file = "myData.txt")    # Note quotes around variable name

Discussion

We’ve found ourselves with a large, complicated data object that we want to load into other workspaces, or we may want to move R objects between a Linux box and a Windows box. The load and save functions let us do all this: save will store the object in a file that is portable across machines, and load can read those files.

When you run load, it does not return your data per se; rather, it creates variables in your workspace, loads your data into those variables, and then returns the names of the variables (in a vector). The first time you run load, you might be tempted to do this:

myData <- load("myData.RData")     # Achtung! Might not do what you think

Let’s look at what myData is above:

myData
#> [1] "tbl" "t"
str(myData)
#>  chr [1:2] "tbl" "t"

This might be puzzling, because myData does not contain your data at all: it holds only the names of the variables that load created in your workspace. This catches many people the first time.

The save function writes in a binary format to keep the file small. Sometimes you want an ASCII format instead. When you submit a question to a mailing list or to Stack Overflow, for example, including an ASCII dump of the data lets others re-create your problem. In such cases use dput or dump, which write an ASCII representation.
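
A quick sketch of the round trip (file names are illustrative): dput output can be read back with dget, and dump output can be re-executed with source:

dput(tbl, file = "myData.txt")
tbl2 <- dget("myData.txt")       # re-create the object from the ASCII representation

dump("tbl", file = "myData.R")
source("myData.R")               # re-creates a variable named tbl in the workspace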

Be careful when you save and load objects created by a particular R package. When you load the objects, R does not automatically load the required packages, too, so it will not “understand” the object unless you previously loaded the package yourself. For instance, suppose you have an object called z created by the zoo package, and suppose we save the object in a file called z.RData. The following sequence of functions will create some confusion:

load("./data/z.RData")   # Create and populate the z variable
plot(z)                  # Does not plot as expected: zoo pkg not loaded

We should have loaded the zoo package before printing or plotting any zoo objects, like this:

library(zoo)                  # Load the zoo package into memory
load("./data/z.RData") # Create and populate the z variable
plot(z)                       # Ahhh. Now plotting works correctly
Figure 4-2. Plotting with zoo

And you can see the resulting plot in Figure 4-2.

Chapter 5. Data Structures

Introduction

You can get pretty far in R just using vectors. That’s what Chapter 2 is all about. This chapter moves beyond vectors to recipes for matrices, lists, factors, data frames, and tibbles (which are a special case of data frames). If you have preconceptions about data structures, we suggest you put them aside. R does data structures differently than many other languages.

If you want to study the technical aspects of R’s data structures, we suggest reading R in a Nutshell (O’Reilly) and the R Language Definition. The notes here are more informal. These are things we wish we’d known when we started using R.

Vectors

Here are some key properties of vectors:

Vectors are homogeneous

All elements of a vector must have the same type or, in R terminology, the same mode.

Vectors can be indexed by position

So v[2] refers to the second element of v.

Vectors can be indexed by multiple positions, returning a subvector

So v[c(2,3)] is a subvector of v that consists of the second and third elements.

Vector elements can have names

Vectors have a names property, the same length as the vector itself, that gives names to the elements:

v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")
print(v)
#>   Moe Larry Curly
#>    10    20    30
If vector elements have names then you can select them by name

Continuing the previous example:

v["Larry"]
#> Larry
#>    20

Lists

Lists are heterogeneous

Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.

Lists can be indexed by position

So lst[[2]] refers to the second element of lst. Note the double square brackets: double brackets mean that R returns the element itself, whatever type it is.

Lists let you extract sublists

So lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets: single brackets mean that R returns the selected items wrapped in a list. If you pull a single element with single brackets, such as lst[2], R returns a list of length 1 whose only item is the desired element.

List elements can have names

Both lst[["Moe"]] and lst$Moe refer to the element named “Moe”.

Since lists are heterogeneous and since their elements can be retrieved by name, a list is like a dictionary or hash or lookup table in other programming languages (“Building a Name/Value Association List”). What’s surprising (and cool) is that in R, unlike most of those other programming languages, lists can also be indexed by position.
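
Here is a small illustration of those indexing rules, using a hypothetical list:

lst <- list(Moe = 75, Larry = 72, Curly = 71)   # hypothetical values
lst[[2]]          # the element itself: a single number
#> [1] 72
lst[c(2, 3)]      # a sublist containing the second and third elements
#> $Larry
#> [1] 72
#>
#> $Curly
#> [1] 71
lst[["Moe"]]      # select by name
#> [1] 75
lst$Moe           # same thing, using dollar-sign notation
#> [1] 75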

Mode: Physical Type

In R, every object has a mode, which indicates how it is stored in memory: as a number, as a character string, as a list of pointers to other objects, as a function, and so forth:

Object                       Example                                    Mode
---------------------------  -----------------------------------------  ---------
Number                       3.1415                                     numeric
Vector of numbers            c(2.7182, 3.1415)                          numeric
Character string             "Moe"                                      character
Vector of character strings  c("Moe", "Larry", "Curly")                 character
Factor                       factor(c("NY", "CA", "IL"))                numeric
List                         list("Moe", "Larry", "Curly")              list
Data frame                   data.frame(x=1:3, y=c("NY", "CA", "IL"))   list
Function                     print                                      function

The mode function gives us this information:

mode(3.1415)                        # Mode of a number
#> [1] "numeric"
mode(c(2.7182, 3.1415))             # Mode of a vector of numbers
#> [1] "numeric"
mode("Moe")                         # Mode of a character string
#> [1] "character"
mode(list("Moe", "Larry", "Curly")) # Mode of a list
#> [1] "list"

A critical difference between a vector and a list can be summed up this way:

  • In a vector, all elements must have the same mode.

  • In a list, the elements can have different modes.

Class: Abstract Type

In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of “numeric” because they are stored as a number; but they could have different classes to indicate their interpretation.

For example, a Date object consists of a single number:

d <- as.Date("2010-03-15")
mode(d)
#> [1] "numeric"
length(d)
#> [1] 1

But it has a class of Date, telling us how to interpret that number; namely, as the number of days since January 1, 1970:

class(d)
#> [1] "Date"

R uses an object’s class to decide how to process the object. For example, the generic function print has specialized versions (called methods) for printing objects according to their class: data.frame, Date, lm, and so forth. When you print an object, R calls the appropriate print function according to the object’s class.
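
As a small illustration using the Date object d from above, printing the object and printing its unclassed value produce different results because R dispatches on the class:

d <- as.Date("2010-03-15")
print(d)            # dispatches to the Date method, so the number is formatted as a date
#> [1] "2010-03-15"
print(unclass(d))   # strip the class, and R prints the raw number of days since 1970-01-01
#> [1] 14683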

Scalars

The quirky thing about scalars is their relationship to vectors. In some software, scalars and vectors are two different things. In R, they are the same thing: a scalar is simply a vector that contains exactly one element. In this book we often use the term “scalar”, but that’s just shorthand for “vector with one element.”

Consider the built-in constant pi. It is a scalar:

pi
#> [1] 3.14

Since a scalar is a one-element vector, you can use vector functions on pi:

length(pi)
#> [1] 1

You can index it. The first (and only) element is π, of course:

pi[1]
#> [1] 3.14

If you ask for the second element, there is none:

pi[2]
#> [1] NA

Matrices

In R, a matrix is just a vector that has dimensions. It may seem strange at first, but you can transform a vector into a matrix simply by giving it dimensions.

A vector has an attribute called dim, which is initially NULL, as shown here:

A <- 1:6
dim(A)
#> NULL
print(A)
#> [1] 1 2 3 4 5 6

We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:

dim(A) <- c(2, 3)
print(A)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

Voilà! The vector was reshaped into a 2 × 3 matrix.

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

B <- list(1, 2, 3, 4, 5, 6)
dim(B)
#> NULL

If we set the dim attribute, it gives the list a shape:

dim(B) <- c(2, 3)
print(B)
#>      [,1] [,2] [,3]
#> [1,] 1    3    5
#> [2,] 2    4    6

Voilà! We have turned this list into a 2 × 3 matrix.

Arrays

The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:

D <- 1:12
dim(D) <- c(2, 3, 2)
print(D)
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    7    9   11
#> [2,]    8   10   12

Note that R prints one “slice” of the structure at a time, since it’s not possible to print a 3-dimensional structure on a 2-dimensional medium.

It strikes us as very odd that we can turn a list into a matrix just by giving the list a dim attribute. But wait; it gets stranger.

Recall that a list can be heterogeneous (mixed modes). We can start with a heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This code snippet creates a matrix that is a mix of numeric and character data:

C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
#>      [,1] [,2] [,3]
#> [1,] 1    3    "Y"
#> [2,] 2    "X"  "Z"

To us this is strange because we ordinarily assume a matrix is purely numeric, not mixed. R is not that restrictive.

The possibility of a heterogeneous matrix may seem powerful and strangely fascinating. However, it creates problems when you are doing normal, day-to-day stuff with matrices. For example, what happens when the matrix C (above) is used in matrix multiplication? What happens if it is converted to a data frame? The answer is that odd things happen.

In this book, we generally ignore the pathological case of a heterogeneous matrix. We assume you’ve got simple, vanilla matrices. Some recipes involving matrices may work oddly (or not at all) if your matrix contains mixed data. Converting such a matrix to a vector or data frame, for instance, can be problematic (“Converting One Structured Data Type into Another”).

Factors

A factor looks like a character vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.

There are two key uses for factors:

Categorical variables

A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.

Grouping

This is a technique for labeling or tagging your data items according to their group. See the Introduction to Data Transformations.

Data Frames

A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.

A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:

  • The elements of the list are vectors and/or factors.1

  • Those vectors and factors are the columns of the data frame.

  • The vectors and factors must all have the same length; in other words, all columns must have the same height.

  • The equal-height columns give a rectangular shape to the data frame.

  • The columns must have names.

Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:

  • You can use list operators to extract columns from a data frame, such as df[i], df[[i]], or df$name.

  • You can use matrix-like notation, such as df[i,j], df[i,], or df[,j].

Your perception of a data frame likely depends on your background:

To a statistician

A data frame is a table of observations. Each row contains one observation. Each observation must contain the same variables. These variables are called columns, and you can refer to them by name. You can also refer to the contents by row number and column number, just as with a matrix.

To a SQL programmer

A data frame is a table. The table resides entirely in memory, but you can save it to a flat file and restore it later. You needn’t declare the column types because R figures that out for you.

To an Excel user

A data frame is like a worksheet, or perhaps a range within a worksheet. It is more restrictive, however, in that each column has a type.

To an SAS user

A data frame is like a SAS dataset for which all the data resides in memory. R can read and write the data frame to disk, but the data frame must be in memory while R is processing it.

To an R programmer

A data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.

To a computer scientist

A data frame is a rectangular data structure. The columns are strongly typed, and each column must be numeric values, character strings, or a factor. Columns must have labels; rows may have labels. The table can be indexed by position, column name, and/or row name. It can also be accessed by list operators, in which case R treats the data frame as a list whose elements are the columns of the data frame.

To an executive

You can put names and numbers into a data frame. It’s easy! A data frame is like a little database. Your staff will enjoy using data frames.

Tibbles

A tibble is a modern reimagining of the data frame, introduced by Hadley Wickham in his Tidyverse packages. Most of the common functions you would use with data frames also work with tibbles. However, tibbles typically do less than data frames and complain more. This idea of doing less and complaining more may remind you of your least favorite coworker, but we think tibbles will become one of your favorite data structures. Doing less and complaining more can be a feature, not a bug.

Unlike data frames, tibbles:

  • Do not give you row names by default.

  • Do not change column names, so you aren’t surprised by names different from the ones you supplied.

  • Do not coerce your data into factors unless you explicitly ask for that.

  • Recycle only vectors of length 1.

In addition to basic data frame functionality, tibbles also:

  • Print only the first few rows and a bit of metadata by default.

  • Always return a tibble when subsetting.

  • Never do partial matching: if you want a column from a tibble, you have to ask for it using its full name.

  • Complain more, giving you warnings and chatty messages to make sure you understand what the software is doing.

All these extras are designed to give you fewer surprises and help you be more productive.
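
Here is a quick sketch of the subsetting difference between a data frame and a tibble:

library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

df[, "x"]    # a data frame drops a single column to a plain vector
#> [1] 1 2 3
tb[, "x"]    # a tibble stays a tibble
#> # A tibble: 3 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3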

Appending Data to a Vector

Problem

You want to append additional data items to a vector.

Solution

Use the vector constructor (c) to construct a vector with the additional data items:

v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
v <- c(v, newItems)
v
#> [1] 1 2 3 6 7 8

For a single item, you can also assign the new item to the next vector element. R will automatically extend the vector:

v[length(v) + 1] <- 42
v
#> [1]  1  2  3  6  7  8 42

Discussion

If you ask us about appending a data item to a vector, we will likely suggest that maybe you shouldn’t.

Warning

R works best when you think about entire vectors, not single data items. Are you repeatedly appending items to a vector? If so, then you are probably working inside a loop. That’s OK for small vectors, but for large vectors your program will run slowly. The memory management in R works poorly when you repeatedly extend a vector by one element. Try to replace that loop with vector-level operations. You’ll write less code, and R will run much faster.
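
For instance, here is a minimal sketch of the difference; both versions produce the same vector, but the second is idiomatic R and much faster for large vectors:

# Slow: grows the vector one element at a time inside a loop
v <- numeric(0)
for (i in 1:10000) {
  v <- c(v, sqrt(i))
}

# Fast: one vectorized call builds the whole result at once
v <- sqrt(1:10000)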

Nonetheless, one does occasionally need to append data to vectors. Our experiments show that the most efficient way is to create a new vector using the vector constructor (c) to join the old and new data. This works for appending single elements or multiple elements:

v <- c(1, 2, 3)
v <- c(v, 4) # Append a single value to v
v
#> [1] 1 2 3 4

w <- c(5, 6, 7, 8)
v <- c(v, w) # Append an entire vector to v
v
#> [1] 1 2 3 4 5 6 7 8

You can also append an item by assigning it to the position past the end of the vector, as shown in the Solution. In fact, R is very liberal about extending vectors. You can assign to any element and R will expand the vector to accommodate your request:

v <- c(1, 2, 3) # Create a vector of three elements
v[10] <- 10 # Assign to the 10th element
v # R extends the vector automatically
#>  [1]  1  2  3 NA NA NA NA NA NA 10

Note that R did not complain about the out-of-bounds subscript. It just extended the vector to the needed length, filling with NA.

R includes an append function that creates a new vector by appending items to an existing vector. However, our experiments show that this function runs more slowly than both the vector constructor and the element assignment.

Inserting Data into a Vector

Problem

You want to insert one or more data items into a vector.

Solution

Despite its name, the append function inserts data into a vector by using the after parameter, which gives the insertion point for the new item or items:

v
#>  [1]  1  2  3 NA NA NA NA NA NA 10
newvalues <- c(100, 101)
n <- 2
append(v, newvalues, after = n)
#>  [1]   1   2 100 101   3  NA  NA  NA  NA  NA  NA  10

Discussion

The new items will be inserted at the position given by after. This example inserts 99 into the middle of a sequence:

append(1:10, 99, after = 5)
#>  [1]  1  2  3  4  5 99  6  7  8  9 10

The special value of after=0 means insert the new items at the head of the vector:

append(1:10, 99, after = 0)
#>  [1] 99  1  2  3  4  5  6  7  8  9 10

The comments in “Appending Data to a Vector” apply here, too. If you are inserting single items into a vector, you might be working at the element level when working at the vector level would be easier to code and faster to run.

Understanding the Recycling Rule

Problem

You want to understand the mysterious Recycling Rule that governs how R handles vectors of unequal length.

Discussion

When you do vector arithmetic, R performs element-by-element operations. That works well when both vectors have the same length: R pairs the elements of the vectors and applies the operation to those pairs.

But what happens when the vectors have unequal lengths?

In that case, R invokes the Recycling Rule. It processes the vector elements in pairs, starting at the first elements of both vectors. At a certain point, the shorter vector is exhausted while the longer vector still has unprocessed elements. R returns to the beginning of the shorter vector, “recycling” its elements; continues taking elements from the longer vector; and completes the operation. It will recycle the shorter-vector elements as often as necessary until the operation is complete.

It’s useful to visualize the Recycling Rule. Here is a diagram of two vectors, 1:6 and 1:3:

   1:6   1:3
  ----- -----
    1     1
    2     2
    3     3
    4
    5
    6

Obviously, the 1:6 vector is longer than the 1:3 vector. If we try to add the vectors using (1:6) + (1:3), it appears that 1:3 has too few elements. However, R recycles the elements of 1:3, pairing the two vectors like this and producing a six-element vector:

   1:6   1:3   (1:6) + (1:3)
  ----- ----- ---------------
    1     1         2
    2     2         4
    3     3         6
    4               5
    5               7
    6               9

Here is what you see in the R console:

(1:6) + (1:3)
#> [1] 2 4 6 5 7 9

It’s not only vector operations that invoke the Recycling Rule; functions can, too. The cbind function can create column vectors, such as the following column vectors of 1:6 and 1:3. The two columns have different heights, of course:

cbind(1:6)

cbind(1:3)

If we try binding these column vectors together into a two-column matrix, the lengths are mismatched. The 1:3 vector is too short, so cbind invokes the Recycling Rule and recycles the elements of 1:3:

cbind(1:6, 1:3)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    3
#> [4,]    4    1
#> [5,]    5    2
#> [6,]    6    3

If the longer vector’s length is not a multiple of the shorter vector’s length, R gives a warning. That’s good, since the operation is highly suspect and there is likely a bug in your logic:

(1:6) + (1:5) # Oops! 1:5 is one element too short
#> Warning in (1:6) + (1:5): longer object length is not a multiple of shorter
#> object length
#> [1]  2  4  6  8 10  7

Once you understand the Recycling Rule, you will realize that operations between a vector and a scalar are simply applications of that rule. In this example, the 10 is recycled repeatedly until the vector addition is complete:

(1:6) + 10
#> [1] 11 12 13 14 15 16

Creating a Factor (Categorical Variable)

Problem

You have a vector of character strings or integers. You want R to treat them as a factor, which is R’s term for a categorical variable.

Solution

The factor function encodes your vector of discrete values into a factor:

v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v) # v can be a vector of strings or integers
f
#> [1] dog   cat   mouse rat   dog
#> Levels: cat dog mouse rat
str(f)
#>  Factor w/ 4 levels "cat","dog","mouse",..: 2 1 3 4 2

If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor:

v <- c("dog", "cat", "mouse", "rat", "dog")
f <- factor(v, levels = c("dog", "cat", "mouse", "rat", "horse"))
f
#> [1] dog   cat   mouse rat   dog
#> Levels: dog cat mouse rat horse
str(f)
#>  Factor w/ 5 levels "dog","cat","mouse",..: 1 2 3 4 1

Discussion

In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models.

Most of the time, converting your categorical data into a factor is a simple matter of calling the factor function, which identifies the distinct levels of the categorical data and packs them into a factor:

f <- factor(c("Win", "Win", "Lose", "Tie", "Win", "Lose"))
f
#> [1] Win  Win  Lose Tie  Win  Lose
#> Levels: Lose Tie Win

Notice that when we printed the factor, f, R did not put quotes around the values. They are levels, not strings. Also notice that when we printed the factor, R also displayed the distinct levels below the factor.

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels. Suppose you have a string-valued variable wday that gives the day of the week on which your data was observed:

wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday)
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Thu Tue Wed

R thinks that Monday, Thursday, Tuesday, and Wednesday are the only possible levels. Friday is not listed. Apparently, the lab staff never made observations on Friday, so R does not know that Friday is a possible value. Hence you need to list the possible levels of wday explicitly:

f <- factor(wday, c("Mon", "Tue", "Wed", "Thu", "Fri"))
f
#>  [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
#> Levels: Mon Tue Wed Thu Fri

Now R understands that f is a factor with five possible levels. It knows their correct order, too. It originally put Thursday before Tuesday because it assumes alphabetical order by default. The explicit second argument defines the correct order.

In many situations it is not necessary to call factor explicitly. When an R function requires a factor, it usually converts your data to a factor automatically. The table function, for instance, works only on factors, so it routinely converts its inputs to factors without asking. You must explicitly create a factor variable when you want to specify the full set of levels or when you want to control the ordering of levels.
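
One payoff of supplying the full set of levels is that downstream functions such as table then report every possible level, even the unobserved ones. A quick sketch using the factor we just built:

table(f)
#> f
#> Mon Tue Wed Thu Fri
#>   1   2   2   5   0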

When creating a data frame using base R functions like data.frame, the default behavior is to turn text fields into factors. This has caused grief and consternation for many R users over the years, since we often expect text fields to be imported simply as text, not factors. Tibbles, part of the Tidyverse of tools, on the other hand, never convert text to factors by default.

See Also

See Recipe X-X to create a factor from continuous data.

Combining Multiple Vectors into One Vector and a Factor

Problem

You have several groups of data, with one vector for each group. You want to combine the vectors into one large vector and simultaneously create a parallel factor that identifies each value’s original group.

Solution

Create a list that contains the vectors. Use the stack function to combine the list into a two-column data frame:

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
v3 <- c(7, 8, 9)
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
comb
#>   values ind
#> 1      1  v1
#> 2      2  v1
#> 3      3  v1
#> 4      4  v2
#> 5      5  v2
#> 6      6  v2
#> 7      7  v3
#> 8      8  v3
#> 9      9  v3

The data frame’s columns are called values and ind. The first column contains the data, and the second column contains the parallel factor.

Discussion

Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format.

Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA analysis of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:

set.seed(2)
n <- 5
freshmen <- sample(1:5, n, replace = TRUE, prob = c(.6, .2, .1, .05, .05))
sophomores <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .6, .1, .05))
juniors <- sample(1:5, n, replace = TRUE, prob = c(.05, .2, .55, .15, .05))

comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))
print(comb)
#>    values   ind
#> 1       1 fresh
#> 2       2 fresh
#> 3       1 fresh
#> 4       1 fresh
#> 5       5 fresh
#> 6       5  soph
#> 7       3  soph
#> 8       4  soph
#> 9       3  soph
#> 10      3  soph
#> 11      2   jrs
#> 12      3   jrs
#> 13      4   jrs
#> 14      3   jrs
#> 15      3   jrs

Now you can perform the ANOVA analysis on the two columns:

aov(values ~ ind, data = comb)
#> Call:
#>    aov(formula = values ~ ind, data = comb)
#>
#> Terms:
#>                   ind Residuals
#> Sum of Squares   6.53     17.20
#> Deg. of Freedom     2        12
#>
#> Residual standard error: 1.2
#> Estimated effects may be unbalanced

When building the list we must provide tags for the list elements (the tags are fresh, soph, and jrs in this example). Those tags are required because stack uses them as the levels of the parallel factor.

Creating a List

Problem

You want to create and populate a list.

Solution

To create a list from individual data items, use the list function:

x <- c("a", "b", "c")
y <- c(1, 2, 3)
z <- "why be normal?"
lst <- list(x, y, z)
lst
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] 1 2 3
#>
#> [[3]]
#> [1] "why be normal?"

Discussion

Lists can be quite simple, such as this list of three numbers:

lst <- list(0.5, 0.841, 0.977)
lst
#> [[1]]
#> [1] 0.5
#>
#> [[2]]
#> [1] 0.841
#>
#> [[3]]
#> [1] 0.977

When R prints the list, it identifies each list element by its position ([[1]], [[2]], [[3]]) and prints the element’s value (e.g., [1] 0.5) under its position.

More usefully, lists can, unlike vectors, contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:

lst <- list(3.14, "Moe", c(1, 1, 2, 3), mean)
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>

You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:

lst <- list()
lst[[1]] <- 3.14
lst[[2]] <- "Moe"
lst[[3]] <- c(1, 1, 2, 3)
lst[[4]] <- mean
lst
#> [[1]]
#> [1] 3.14
#>
#> [[2]]
#> [1] "Moe"
#>
#> [[3]]
#> [1] 1 1 2 3
#>
#> [[4]]
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x7f8f0457ff88>
#> <environment: namespace:base>

List elements can be named. The list function lets you supply a name for every element:

lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

See Also

See the “Introduction” to this chapter for more about lists; see “Building a Name/Value Association List” for more about building and using lists with named elements.

Selecting List Elements by Position

Problem

You want to access list elements by position.

Solution

Use one of these ways. Here, lst is a list variable:

lst[[n]]

Select the nth element from the list.

lst[c(n1, n2, ..., nk)]

Returns a list of elements, selected by their positions.

Note that the first form returns a single element and the second returns a list.

Discussion

Suppose we have a list of four integers, called years:

years <- list(1960, 1964, 1976, 1994)
years
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964
#>
#> [[3]]
#> [1] 1976
#>
#> [[4]]
#> [1] 1994

We can access single elements using the double-square-bracket syntax:

years[[1]]

We can extract sublists using the single-square-bracket syntax:

years[c(1, 2)]
#> [[1]]
#> [1] 1960
#>
#> [[2]]
#> [1] 1964

This syntax can be confusing because of a subtlety: there is an important difference between lst[[n]] and lst[n]. They are not the same thing:

lst[[n]]

This is an element, not a list. It is the nth element of lst.

lst[n]

This is a list, not an element. The list contains one element, taken from the nth element of lst. This is a special case of lst[c(n1, n2, ..., nk)] in which we eliminated the c() construct because there is only one n.

The difference becomes apparent when we inspect the structure of the result—one is a number; the other is a list:

class(years[[1]])
#> [1] "numeric"

class(years[1])
#> [1] "list"

The difference becomes annoyingly apparent when we cat the value. Recall that cat can print atomic values or vectors but complains about printing structured objects:

cat(years[[1]], "\n")
#> 1960

cat(years[1], "\n")
#> Error in cat(years[1], "\n"): argument 1 (type 'list') cannot be handled by 'cat'

We got lucky here because R alerted us to the problem. In other contexts, you might work long and hard to figure out that you accessed a sublist when you wanted an element, or vice versa.

Selecting List Elements by Name

Problem

You want to access list elements by their names.

Solution

Use one of these forms. Here, lst is a list variable:

lst[["name"]]

Selects the element called name. Returns NULL if no element has that name.

lst$name

Same as previous, just different syntax.

lst[c("name1", "name2", ..., "namek")]

Returns a list built from the indicated elements of lst.

Note that the first two forms return an element whereas the third form returns a list.

Discussion

Each element of a list can have a name. If named, the element can be selected by its name. This assignment creates a list of four named integers:

years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)

These next two expressions return the same value—namely, the element that is named “Kennedy”:

years[["Kennedy"]]
#> [1] 1960
years$Kennedy
#> [1] 1960

The following two expressions return sublists extracted from years:

years[c("Kennedy", "Johnson")]
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964

years["Carter"]
#> $Carter
#> [1] 1976

Just as with selecting list elements by position (“Selecting List Elements by Position”), there is an important difference between lst[["name"]] and lst["name"]. They are not the same:

lst[["name"]]

This is an element, not a list.

lst["name"]

This is a list, not an element. This is a special case of lst[c("name1", "name2", ..., "namek")] in which we don’t need the c() construct because there is only one name.
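
As noted in the Solution, asking for a name that does not exist is not an error; R simply returns NULL. A quick sketch using the years list from above:

years[["Nixon"]]
#> NULL
years$Nixon
#> NULL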

See Also

See “Selecting List Elements by Position” to access elements by position rather than by name.

Building a Name/Value Association List

Problem

You want to create a list that associates names and values — as would a dictionary, hash, or lookup table in another programming language.

Solution

The list function lets you give names to elements, creating an association between each name and its value:

lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

If you have parallel vectors of names and values, you can create an empty list and then populate the list by using a vectorized assignment statement:

values <- c(1, 2, 3)
names <- c("a", "b", "c")
lst <- list()
lst[names] <- values
lst
#> $a
#> [1] 1
#>
#> $b
#> [1] 2
#>
#> $c
#> [1] 3

Discussion

Each element of a list can be named, and you can retrieve list elements by name. This gives you a basic programming tool: the ability to associate names with values.

You can assign element names when you build the list. The list function allows arguments of the form name=value:

lst <- list(
  far.left = 0.023,
  left = 0.159,
  mid = 0.500,
  right = 0.841,
  far.right = 0.977
)
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

One way to name the elements is to create an empty list and then populate it via assignment statements:

lst <- list()
lst$far.left <- 0.023
lst$left <- 0.159
lst$mid <- 0.500
lst$right <- 0.841
lst$far.right <- 0.977
lst
#> $far.left
#> [1] 0.023
#>
#> $left
#> [1] 0.159
#>
#> $mid
#> [1] 0.5
#>
#> $right
#> [1] 0.841
#>
#> $far.right
#> [1] 0.977

Sometimes you have a vector of names and a vector of corresponding values:

values <- pnorm(-2:2)
names <- c("far.left", "left", "mid", "right", "far.right")

You can associate the names and the values by creating an empty list and then populating it with a vectorized assignment statement:

lst <- list()
lst[names] <- values

Once the association is made, the list can “translate” names into values through a simple list lookup:

cat("The left limit is", lst[["left"]], "\n")
#> The left limit is 0.159
cat("The right limit is", lst[["right"]], "\n")
#> The right limit is 0.841

for (nm in names(lst)) cat("The", nm, "limit is", lst[[nm]], "\n")
#> The far.left limit is 0.0228
#> The left limit is 0.159
#> The mid limit is 0.5
#> The right limit is 0.841
#> The far.right limit is 0.977

Removing an Element from a List

Problem

You want to remove an element from a list.

Solution

Assign NULL to the element. R will remove it from the list.

Discussion

To remove a list element, select it by position or by name, and then assign NULL to the selected element:

years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)
years
#> $Kennedy
#> [1] 1960
#>
#> $Johnson
#> [1] 1964
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994
years[["Johnson"]] <- NULL # Remove the element labeled "Johnson"
years
#> $Kennedy
#> [1] 1960
#>
#> $Carter
#> [1] 1976
#>
#> $Clinton
#> [1] 1994

You can remove multiple elements this way, too:

years[c("Carter", "Clinton")] <- NULL # Remove two elements
years
#> $Kennedy
#> [1] 1960

Flattening a List into a Vector

Problem

You want to flatten all the elements of a list into a vector.

Solution

Use the unlist function.

Discussion

There are many contexts that require a vector. Basic statistical functions work on vectors but not on lists, for example. If iq.scores is a list of numbers, then we cannot directly compute their mean:

iq.scores <- list(rnorm(5, 100, 15))
iq.scores
#> [[1]]
#> [1] 115.8  88.7  78.4  95.7  84.5
mean(iq.scores)
#> Warning in mean.default(iq.scores): argument is not numeric or logical:
#> returning NA
#> [1] NA

Instead, we must flatten the list into a vector using unlist and then compute the mean of the result:

mean(unlist(iq.scores))
#> [1] 92.6

Here is another example. We can cat scalars and vectors, but we cannot cat a list:

cat(iq.scores, "\n")
#> Error in cat(iq.scores, "\n"): argument 1 (type 'list') cannot be handled by 'cat'

One solution is to flatten the list into a vector before printing:

cat("IQ Scores:", unlist(iq.scores), "\n")
#> IQ Scores: 116 88.7 78.4 95.7 84.5

See Also

Conversions such as this are discussed more fully in “Converting One Structured Data Type into Another”.

Removing NULL Elements from a List

Problem

Your list contains NULL values. You want to remove them.

Solution

Suppose lst is a list some of whose elements are NULL. This expression will remove the NULL elements:

lst <- list(1, NULL, 2, 3, NULL, 4)
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 3
#>
#> [[5]]
#> NULL
#>
#> [[6]]
#> [1] 4
lst[sapply(lst, is.null)] <- NULL
lst
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4

Discussion

Finding and removing NULL elements from a list is surprisingly tricky. The recipe above was written by one of the authors in a fit of frustration after trying many other solutions that didn’t work. Here’s how it works:

  1. R calls sapply to apply the is.null function to every element of the list.

  2. sapply returns a vector of logical values that are TRUE wherever the corresponding list element is NULL.

  3. R selects values from the list according to that vector.

  4. R assigns NULL to the selected items, removing them from the list.

The curious reader may be wondering how a list can contain NULL elements, given that we remove elements by setting them to NULL (“Removing an Element from a List”). The answer is that we can create a list containing NULL elements:

lst <- list("Moe", NULL, "Curly") # Create list with NULL element
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] "Curly"

lst[sapply(lst, is.null)] <- NULL # Remove NULL element from list
lst
#> [[1]]
#> [1] "Moe"
#>
#> [[2]]
#> [1] "Curly"

In practice, we might end up with NULL items in a list when a function we wrote returns NULL for some of its inputs.
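
Here is a small sketch of how that can happen. The safe_sqrt helper below is hypothetical; it stands in for any function that returns NULL when it cannot produce a result:

safe_sqrt <- function(x) if (x >= 0) sqrt(x) else NULL # hypothetical helper
lst <- lapply(c(4, -1, 9), safe_sqrt)
lst
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> [1] 3
lst[sapply(lst, is.null)] <- NULL # the recipe above cleans out the NULL
lst
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3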

See Also

See “Removing an Element from a List” for how to remove list elements.

Removing List Elements Using a Condition

Problem

You want to remove elements from a list according to a conditional test, such as removing elements that are negative or smaller than some threshold.

Solution

Build a logical vector based on the condition. Use the vector to select list elements and then assign NULL to those elements. This assignment, for example, removes all negative values from lst:

lst <- as.list(rnorm(7))
lst
#> [[1]]
#> [1] -0.0281
#>
#> [[2]]
#> [1] -0.366
#>
#> [[3]]
#> [1] -1.12
#>
#> [[4]]
#> [1] -0.976
#>
#> [[5]]
#> [1] 1.12
#>
#> [[6]]
#> [1] 0.324
#>
#> [[7]]
#> [1] -0.568

lst[lst < 0] <- NULL
lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324

It’s worth noting that in the above example we use as.list instead of list to create a list from the 7 random values created by rnorm(7). The reason for this is that as.list will turn each element of a vector into a list item. On the other hand, list would have given us a list of length 1 where the first element was a vector containing 7 numbers:

list(rnorm(7))
#> [[1]]
#> [1] -1.034 -0.533 -0.981  0.823 -0.388  0.879 -2.178

Discussion

This recipe is based on two useful features of R. First, a list can be indexed by a logical vector. Wherever the vector element is TRUE, the corresponding list element is selected. Second, you can remove a list element by assigning NULL to it.

Suppose we want to remove elements from lst whose value is zero. We construct a logical vector which identifies the unwanted values (lst == 0). Then we select those elements from the list and assign NULL to them:

lst[lst == 0] <- NULL

This expression will remove NA values from the list:

lst[is.na(lst)] <- NULL

So far, so good. The problems arise when you cannot easily build the logical vector. That often happens when you want to use a function that cannot handle a list. Suppose you want to remove list elements whose absolute value is less than 1. The abs function will not handle a list, unfortunately:

lst[abs(lst) < 1] <- NULL
#> Error in abs(lst): non-numeric argument to mathematical function

The simplest solution is flattening the list into a vector by calling unlist and then testing the vector:

lst
#> [[1]]
#> [1] 1.12
#>
#> [[2]]
#> [1] 0.324
lst[abs(unlist(lst)) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.12

A more elegant solution uses lapply (the list apply function) to apply the function to every element of the list:

lst <- as.list(rnorm(5))
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 0.885
#>
#> [[3]]
#> [1] 2.29
#>
#> [[4]]
#> [1] 0.554
#>
#> [[5]]
#> [1] 1.21
lst[lapply(lst, abs) < 1] <- NULL
lst
#> [[1]]
#> [1] 1.47
#>
#> [[2]]
#> [1] 2.29
#>
#> [[3]]
#> [1] 1.21

Lists can hold complex objects, too, not just atomic values. Suppose result_list is a list of linear models created by the lm function. This expression will remove any model whose R2 value is less than 0.70:

x <- 1:10
y1 <- 2 * x + rnorm(10, 0, 1)
y2 <- 3 * x + rnorm(10, 0, 8)

result_list <- list(lm(x ~ y1), lm(x ~ y2))

result_list[sapply(result_list, function(m) summary(m)$r.squared < 0.7)] <- NULL

If we wanted to simply see the R2 values for each model, we could do the following:

sapply(result_list, function(m) summary(m)$r.squared)
#> [1] 0.990 0.708

Using sapply (simple apply) will return a vector of results. If we had used lapply we would have received a list in return:

lapply(result_list, function(m) summary(m)$r.squared)
#> [[1]]
#> [1] 0.99
#>
#> [[2]]
#> [1] 0.708

It’s worth noting that if you face a situation like the one above, you might also explore the broom package on CRAN. The broom package is designed to take the output of models and put the results into a tidy format that fits better into a tidy-style workflow.
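
For instance, here is a minimal sketch, assuming the broom package is installed. The glance function returns one row of model-level summaries per model, including an r.squared column:

library(broom)
sapply(result_list, function(m) glance(m)$r.squared)
#> [1] 0.990 0.708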

See Also

See Recipes , , , , , and .

Initializing a Matrix

Problem

You want to create a matrix and initialize it from given values.

Solution

Capture the data in a vector or list, and then use the matrix function to shape the data into a matrix. This example shapes a vector into a 2 × 3 matrix (i.e., two rows and three columns):

vec <- 1:6
matrix(vec, 2, 3)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

Discussion

The first argument of matrix is the data, the second argument is the number of rows, and the third argument is the number of columns. Observe that the matrix was filled column by column, not row by row.

It’s common to initialize an entire matrix to one value such as zero or NA. If the first argument of matrix is a single value, then R will apply the Recycling Rule and automatically replicate the value to fill the entire matrix:

matrix(0, 2, 3) # Create an all-zeros matrix
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    0

matrix(NA, 2, 3) # Create a matrix populated with NA
#>      [,1] [,2] [,3]
#> [1,]   NA   NA   NA
#> [2,]   NA   NA   NA

You can create a matrix with a one-liner, of course, but it becomes difficult to read:

mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3

A common idiom in R is typing the data itself in a rectangular shape that reveals the matrix structure:

theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3
)
mat <- matrix(theData, 2, 3, byrow = TRUE)
mat
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.2  1.3
#> [2,]  2.1  2.2  2.3

Setting byrow=TRUE tells matrix that the data is row-by-row and not column-by-column (which is the default). In condensed form, that becomes:

mat <- matrix(c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3
),
2, 3,
byrow = TRUE
)

Expressed this way, the reader quickly sees the two rows and three columns of data.

There is a quick-and-dirty way to turn a vector into a matrix: just assign dimensions to the vector. This was discussed in the “Introduction”. The following example creates a vanilla vector and then shapes it into a 2 × 3 matrix:

v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
dim(v) <- c(2, 3)
v
#>      [,1] [,2] [,3]
#> [1,]  1.1  1.3  2.2
#> [2,]  1.2  2.1  2.3

We find this approach more opaque than using matrix, especially since there is no byrow option here.

Performing Matrix Operations

Problem

You want to perform matrix operations such as transpose, matrix inversion, matrix multiplication, or constructing an identity matrix.

Solution

t(A)

Matrix transposition of A

solve(A)

Matrix inverse of A

A %*% B

Matrix multiplication of A and B

diag(n)

An n-by-n diagonal (identity) matrix

Discussion

Recall that A*B is element-wise multiplication whereas A %*% B is matrix multiplication.

All these functions return a matrix. Their arguments can be either matrices or data frames. If they are data frames then R will first convert them to matrices (although this is useful only if the data frame contains exclusively numeric values).
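
Here is a minimal sketch that ties these operations together on a small 2 × 2 matrix; the round call on the last line is there only to suppress floating-point noise in the product:

A <- matrix(c(1, 2, 3, 4), 2, 2) # a small 2 x 2 matrix to experiment with
t(A)                             # transpose
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4
A %*% diag(2)                    # multiplying by the 2 x 2 identity leaves A unchanged
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
round(solve(A) %*% A)            # a matrix times its inverse is the identity
#>      [,1] [,2]
#> [1,]    1    0
#> [2,]    0    1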

Giving Descriptive Names to the Rows and Columns of a Matrix

Problem

You want to assign descriptive names to the rows or columns of a matrix.

Solution

Every matrix has a rownames attribute and a colnames attribute. Assign a vector of character strings to the appropriate attribute:

theData <- c(
  1.1, 1.2, 1.3,
  2.1, 2.2, 2.3,
  3.1, 3.2, 3.3
)
mat <- matrix(theData, 3, 3, byrow = TRUE)

rownames(mat) <- c("rowname1", "rowname2", "rowname3")
colnames(mat) <- c("colname1", "colname2", "colname3")
mat
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
#> rowname2      2.1      2.2      2.3
#> rowname3      3.1      3.2      3.3

Discussion

R lets you assign names to the rows and columns of a matrix, which is useful for printing the matrix. R will display the names if they are defined, enhancing the readability of your output. Below we use the quantmod library to pull stock prices for three tech stocks. Then we calculate daily returns and create a correlation matrix of the daily returns of Apple, Microsoft, and Google stock. No need to worry about the details here, unless stocks are your thing. We’re just creating some real-world data for illustration:

library("quantmod")
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
#>
#> Attaching package: 'xts'
#> The following objects are masked from 'package:dplyr':
#>
#>     first, last
#> Loading required package: TTR
#> Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols(c("AAPL", "MSFT", "GOOG"), auto.assign = TRUE)
#> 'getSymbols' currently uses auto.assign=TRUE by default, but will
#> use auto.assign=FALSE in 0.5-0. You will still be able to use
#> 'loadSymbols' to automatically load data. getOption("getSymbols.env")
#> and getOption("getSymbols.auto.assign") will still be checked for
#> alternate defaults.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
#>
#> WARNING: There have been significant changes to Yahoo Finance data.
#> Please see the Warning section of '?getSymbols.yahoo' for details.
#>
#> This message is shown once per session and may be disabled by setting
#> options("getSymbols.yahoo.warning"=FALSE).
#> [1] "AAPL" "MSFT" "GOOG"
cor_mat <- cor(cbind(
  periodReturn(AAPL, period = "daily", subset = "2017"),
  periodReturn(MSFT, period = "daily", subset = "2017"),
  periodReturn(GOOG, period = "daily", subset = "2017")
))
cor_mat
#>                 daily.returns daily.returns.1 daily.returns.2
#> daily.returns           1.000           0.438           0.489
#> daily.returns.1         0.438           1.000           0.619
#> daily.returns.2         0.489           0.619           1.000

In this form, the matrix output’s interpretation is not self-evident. The columns are named daily.returns.X because, before we bound the columns together with cbind, they were each named daily.returns. R then helped us manage the naming clash by appending .1 to the second column and .2 to the third.

The default naming does not tell us which column came from which stock. So we’ll define names for the rows and columns, then R will annotate the matrix output with the names:

colnames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
rownames(cor_mat) <- c("AAPL", "MSFT", "GOOG")
cor_mat
#>       AAPL  MSFT  GOOG
#> AAPL 1.000 0.438 0.489
#> MSFT 0.438 1.000 0.619
#> GOOG 0.489 0.619 1.000

Now the reader knows at a glance which rows and columns apply to which stocks.

Another advantage of naming rows and columns is that you can refer to matrix elements by those names:

cor_mat["MSFT", "GOOG"] # What is the correlation between MSFT and GOOG?
#> [1] 0.619

Selecting One Row or Column from a Matrix

Problem

You want to select a single row or a single column from a matrix.

Solution

The solution depends on what you want. If you want the result to be a simple vector, just use normal indexing:

mat[1, ] # First row
#> colname1 colname2 colname3
#>      1.1      1.2      1.3
mat[, 3] # Third column
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3

If you want the result to be a one-row matrix or a one-column matrix, then include the drop=FALSE argument:

mat[1, , drop = FALSE] # First row in a one-row matrix
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3
mat[, 3, drop = FALSE] # Third column in a one-column matrix
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3

Discussion

Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:

mat[1, ]
#> colname1 colname2 colname3
#>      1.1      1.2      1.3

mat[, 3]
#> rowname1 rowname2 rowname3
#>      1.3      2.3      3.3

When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):

mat[1, , drop = FALSE]
#>          colname1 colname2 colname3
#> rowname1      1.1      1.2      1.3

Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):

mat[, 3, drop = FALSE]
#>          colname3
#> rowname1      1.3
#> rowname2      2.3
#> rowname3      3.3

Initializing a Data Frame from Column Data

Problem

Your data is organized by columns, and you want to assemble it into a data frame.

Solution

If your data is captured in several vectors and/or factors, use the data.frame function to assemble them into a data frame:

v1 <- 1:5
v2 <- 6:10
v3 <- c("A", "B", "C", "D", "E")
f1 <- factor(c("a", "a", "a", "b", "b"))
df <- data.frame(v1, v2, v3, f1)
df
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b

If your data is captured in a list that contains vectors and/or factors, use instead as.data.frame:

list.of.vectors <- list(v1 = v1, v2 = v2, v3 = v3, f1 = f1)
df2 <- as.data.frame(list.of.vectors)
df2
#>   v1 v2 v3 f1
#> 1  1  6  A  a
#> 2  2  7  B  a
#> 3  3  8  C  a
#> 4  4  9  D  b
#> 5  5 10  E  b

Discussion

A data frame is a collection of columns, each of which corresponds to an observed variable (in the statistical sense, not the programming sense). If your data is already organized into columns, then it’s easy to build a data frame.

The data.frame function can construct a data frame from vectors, where each vector is one observed variable. Suppose you have two numeric predictor variables, one categorical predictor variable, and one response variable. The data.frame function can create a data frame from your vectors:

pred1 <- rnorm(10)
pred2 <- rnorm(10, 1, 2)
pred3 <- sample(c("AM", "PM"), 10, replace = TRUE)
resp <- 2.1 + pred1 * .3 + pred2 * .9
df <- data.frame(pred1, pred2, pred3, resp)
df
#>     pred1   pred2 pred3 resp
#> 1  -0.117 -0.0196    AM 2.05
#> 2  -1.133  0.1529    AM 1.90
#> 3   0.632  3.8004    AM 5.71
#> 4   0.188  4.5922    AM 6.29
#> 5   0.892  1.8556    AM 4.04
#> 6  -1.224  2.8140    PM 4.27
#> 7   0.174  0.4908    AM 2.59
#> 8  -0.689 -0.1335    PM 1.77
#> 9   1.204 -0.0482    AM 2.42
#> 10  0.697  2.2268    PM 4.31

Notice that data.frame takes the column names from your program variables. You can override that default by supplying explicit column names:

df <- data.frame(p1 = pred1, p2 = pred2, p3 = pred3, r = resp)
head(df, 3)
#>       p1      p2 p3    r
#> 1 -0.117 -0.0196 AM 2.05
#> 2 -1.133  0.1529 AM 1.90
#> 3  0.632  3.8004 AM 5.71

As illustrated above, your data may be organized into vectors but those vectors are held in a list, not individual program variables. Use the as.data.frame function to create a data frame from the list of vectors.

If you’d rather have a tibble (a.k.a. a tidy data frame) instead of a data frame, use the as_tibble function from the tibble package instead of data.frame. Note that as_tibble is designed to operate on a list, matrix, data frame, or table, so we can simply wrap our vectors in a call to list before calling as_tibble:

tib <- as_tibble(list(p1 = pred1, p2 = pred2, p3 = pred3, r = resp))
tib
#> # A tibble: 10 x 4
#>       p1      p2 p3        r
#>    <dbl>   <dbl> <chr> <dbl>
#> 1 -0.117 -0.0196 AM     2.05
#> 2 -1.13   0.153  AM     1.90
#> 3  0.632  3.80   AM     5.71
#> 4  0.188  4.59   AM     6.29
#> 5  0.892  1.86   AM     4.04
#> 6 -1.22   2.81   PM     4.27
#> # ... with 4 more rows

One subtle difference between a data.frame object and a tibble is that when using the data.frame function to create a data frame, R will coerce character values into factors by default, whereas as_tibble does not convert characters to factors. In the tibble example above, column p3 shows up as type chr; in the data.frame example, str(df) would reveal p3 as a factor. This difference is something you should be aware of, as it can be maddeningly frustrating to debug an issue caused by such a subtle distinction.
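
A quick way to confirm the difference is to check the column classes directly. A minimal sketch (note that R 4.0.0 and later changed the data.frame default to stringsAsFactors = FALSE, so the factor result below reflects the older default):

class(df$p3)  # a factor under the pre-4.0 default (stringsAsFactors = TRUE)
#> [1] "factor"
class(tib$p3) # tibbles leave character data as character
#> [1] "character"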

Initializing a Data Frame from Row Data

Problem

Your data is organized by rows, and you want to assemble it into a data frame.

Solution

Store each row in a one-row data frame. Store the one-row data frames in a list. Use rbind and do.call to bind the rows into one, large data frame:

r1 <- data.frame(a = 1, b = 2, c = "a")
r2 <- data.frame(a = 3, b = 4, c = "b")
r3 <- data.frame(a = 5, b = 6, c = "c")
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

Here, obs is a list of one-row data frames. But notice that column c is a factor, not a character.

Discussion

Data often arrives as a collection of observations. Each observation is a record or tuple that contains several values, one for each observed variable. The lines of a flat file are usually like that: each line is one record, each record contains several columns, and each column is a different variable (see “Reading Files with a Complex Structure”). Such data is organized by observation, not by variable. In other words, you are given rows one at a time rather than columns one at a time.

Each such row might be stored in several ways. One obvious way is as a vector. If you have purely numerical data, use a vector.

However, many datasets are a mixture of numeric, character, and categorical data, in which case a vector won’t work. We recommend storing each such heterogeneous row in a one-row data frame. (You could store each row in a list, but that makes this recipe a little more complicated.)

We need to bind together those rows into a data frame. That’s what the rbind function does. It binds its arguments in such a way that each argument becomes one row in the result. If we rbind the first two observations, for example, we get a two-row data frame:

rbind(obs[[1]], obs[[2]])
#>   a b c
#> 1 1 2 a
#> 2 3 4 b

We want to bind together every observation, not just the first two, so we tap into the vector processing of R. The do.call function will expand obs into one, long argument list and call rbind with that long argument list:

do.call(rbind, obs)
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

The result is a data frame built from our rows of data.

Sometimes, for reasons beyond your control, the rows of your data are stored in lists rather than one-row data frames. You may be dealing with rows returned by a database package, for example. In that case, obs will be a list of lists, not a list of data frames. We first transform the rows into data frames using the Map function and then apply this recipe:

l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
df
#>   a b c
#> 1 1 2 a
#> 2 3 4 b
#> 3 5 6 c

This recipe also works if your observations are stored in vectors rather than one-row data frames, but with vectors all elements must have the same data type (though R will happily coerce integers into floats on the fly):

r1 <- 1:3
r2 <- 6:8
r3 <- rnorm(3)
obs <- list(r1, r2, r3)
df <- do.call(rbind, obs)
df
#>        [,1]   [,2] [,3]
#> [1,]  1.000  2.000  3.0
#> [2,]  6.000  7.000  8.0
#> [3,] -0.945 -0.547  1.6

Note the factor trap mentioned in the example above. If you would rather get characters instead of factors, you have a couple of options. One is to set the stringsAsFactors parameter to FALSE when data.frame is called:

data.frame(a = 1, b = 2, c = "a", stringsAsFactors = FALSE)
#>   a b c
#> 1 1 2 a

Of course, if you inherited your data and it’s already in a data frame with factors, you can convert all the factors in a data frame to characters using this bonus recipe:

# Same setup as in the previous examples
l1 <- list(a = 1, b = 2, c = "a")
l2 <- list(a = 3, b = 4, c = "b")
l3 <- list(a = 5, b = 6, c = "c")
obs <- list(l1, l2, l3)
df <- do.call(rbind, Map(as.data.frame, obs))
# (Yes, you could use stringsAsFactors = FALSE above, but we're assuming the
# data frame came to you with factors already.)

i <- sapply(df, is.factor)           # determine which columns are factors
df[i] <- lapply(df[i], as.character) # convert only the factor columns to character
df

Keep in mind that if you use a tibble instead of a data.frame then characters will not be forced into factors by default.

See Also

See “Initializing a Data Frame from Column Data” if your data is organized by columns, not rows.
See Recipe X-X to learn more about do.call.

Appending Rows to a Data Frame

Problem

You want to append one or more new rows to a data frame.

Solution

Create a second, temporary data frame containing the new rows. Then use the rbind function to append the temporary data frame to the original data frame.

Discussion

Suppose we want to append a new row to our data frame of Chicago-area cities. First, we create a one-row data frame with the new data:

newRow <- data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428)

Next, we load our existing data frame of Chicago-area cities and use the rbind function to append the one-row data frame to it:

library(tidyverse)
suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )

suburbs2 <- rbind(suburbs, newRow)
suburbs2
#> # A tibble: 18 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 12 more rows

The rbind function tells R that we are appending a new row to suburbs, not a new column. It may be obvious to you that newRow is a row and not a column, but it is not obvious to R. (Use the cbind function to append a column.)

One word of caution. The new row must use the same column names as the data frame. Otherwise, rbind will fail.
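
For instance, here is a minimal sketch of the failure. The badRow object is a hypothetical row whose first column is misnamed town instead of city; the exact error text may vary slightly across R versions:

badRow <- data.frame(town = "West Dundee", county = "Kane", state = "IL", pop = 5428)
rbind(suburbs, badRow)
#> Error in match.names(clabs, names(xi)): names do not match previous names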

We can combine the two steps of creating and appending the row into one, of course:

suburbs3 <- rbind(suburbs, data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428))

We can even extend this technique to multiple new rows because rbind allows multiple arguments:

suburbs4 <- rbind(
  suburbs,
  data.frame(city = "West Dundee", county = "Kane", state = "IL", pop = 5428),
  data.frame(city = "East Dundee", county = "Kane", state = "IL", pop = 2955)
)

It’s worth noting that in the examples above we seamlessly commingled tibbles and data frames because we used the tidy function read_csv, which produces tibbles. Note also that the data frames contain factors, while the tibbles do not:

str(suburbs)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  4 variables:
#>  $ city  : chr  "Chicago" "Kenosha" "Aurora" "Elgin" ...
#>  $ county: chr  "Cook" "Kenosha" "Kane" "Kane" ...
#>  $ state : chr  "IL" "WI" "IL" "IL" ...
#>  $ pop   : num  2853114 90352 171782 94487 102746 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   city = col_character(),
#>   ..   county = col_character(),
#>   ..   state = col_character(),
#>   ..   pop = col_double()
#>   .. )
str(newRow)
#> 'data.frame':    1 obs. of  4 variables:
#>  $ city  : Factor w/ 1 level "West Dundee": 1
#>  $ county: Factor w/ 1 level "Kane": 1
#>  $ state : Factor w/ 1 level "IL": 1
#>  $ pop   : num 5428

When the inputs to rbind are a mix of data.frame objects and tibble objects, the result will be the type of object passed to the first argument of rbind. So this would produce a tibble:

rbind(some_tibble, some_data.frame)

While this would produce a data.frame:

rbind(some_data.frame, some_tibble)
Warning

Do not use this recipe to append many rows to a large data frame. That would force R to reallocate a large data structure repeatedly, which is a very slow process. Build your data frame using more efficient means, such as those in Recipes or .

Preallocating a Data Frame

Problem

You are building a data frame, row by row. You want to preallocate the space instead of appending rows incrementally.

Solution

Create a data frame from generic vectors and factors using the functions numeric(n) and character(n):

n <- 5
df <- data.frame(colname1 = numeric(n), colname2 = character(n))

Here, n is the number of rows needed for the data frame.

Discussion

Theoretically, you can build a data frame by appending new rows, one by one. That’s OK for small data frames, but building a large data frame in that way can be tortuous. The memory manager in R works poorly when one new row is repeatedly appended to a large data structure. Hence your R code will run very slowly.

One solution is to preallocate the data frame, assuming you know the required number of rows. By preallocating the data frame once and for all, you sidestep problems with the memory manager.

Suppose you want to create a data frame with 1,000,000 rows and three columns: two numeric and one character. Use the numeric and character functions to preallocate the columns; then join them together using data.frame:

n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = character(n),
  response = numeric(n),
  stringsAsFactors = FALSE
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : chr  "" "" "" "" ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...

Now you have a data frame with the correct dimensions, 1,000,000 × 3, waiting to receive its contents.

Notice in the example above we set stringsAsFactors = FALSE so that R would not coerce the character field into factors. Data frames can contain factors, but preallocating a factor column is a little trickier: you can’t simply call factor(n); you need to specify the factor’s levels when you create it. Continuing our example, suppose you want the lab column to be a factor, not a character string, and that the possible levels are NJ, IL, and CA. Include the levels in the column specification, like this:

n <- 1000000
df <- data.frame(
  dosage = numeric(n),
  lab = factor(n, levels = c("NJ", "IL", "CA")),
  response = numeric(n)
)
str(df)
#> 'data.frame':    1000000 obs. of  3 variables:
#>  $ dosage  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ lab     : Factor w/ 3 levels "NJ","IL","CA": NA NA NA NA NA NA NA NA NA NA ...
#>  $ response: num  0 0 0 0 0 0 0 0 0 0 ...

Selecting Data Frame Columns by Position

Problem

You want to select columns from a data frame according to their position.

Solution

To select a single column, use this list operator:

df[[n]]

Returns one column—specifically, the nth column of df.

To select one or more columns and package them in a data frame, use the following sublist expressions:

df[n]

Returns a data frame consisting solely of the nth column of df.

df[c(n1, n2, ..., nk)]

Returns a data frame built from the columns in positions n1, n2, …, nk of df.

You can use matrix-style subscripting to select one or more columns:

df[, n]

Returns the nth column (assuming that n contains exactly one value).

df[,c(n1, n2, ..., nk)]

Returns a data frame built from the columns in positions n1, n2, …, nk.

Note that the matrix-style subscripting can return two different data types (either column or data frame) depending upon whether you select one column or multiple columns.

Or you can use the dplyr package from the Tidyverse and pass column numbers to the select function to get back a tibble.

df %>% select(n1, n2, ..., nk)

Discussion

There are a bewildering number of ways to select columns from a data frame. The choices can be confusing until you understand the logic behind the alternatives. As you read this explanation, notice how a slight change in syntax—a comma here, a double-bracket there—changes the meaning of the expression.

Let’s play with the population data for the largest cities in the Chicago metropolitan area:

suburbs <- read_csv("./data/suburbs.txt")
#> Parsed with column specification:
#> cols(
#>   city = col_character(),
#>   county = col_character(),
#>   state = col_character(),
#>   pop = col_double()
#> )
suburbs
#> # A tibble: 17 x 4
#>   city    county   state     pop
#>   <chr>   <chr>    <chr>   <dbl>
#> 1 Chicago Cook     IL    2853114
#> 2 Kenosha Kenosha  WI      90352
#> 3 Aurora  Kane     IL     171782
#> 4 Elgin   Kane     IL      94487
#> 5 Gary    Lake(IN) IN     102746
#> 6 Joliet  Kendall  IL     106221
#> # ... with 11 more rows

So right off the bat we can see this is a tibble. Subsetting and selecting in tibbles work very much like they do in base R data frames, so the recipes below work on either data structure.

Use simple list notation to select exactly one column, such as the first column:

suburbs[[1]]
#>  [1] "Chicago"           "Kenosha"           "Aurora"
#>  [4] "Elgin"             "Gary"              "Joliet"
#>  [7] "Naperville"        "Arlington Heights" "Bolingbrook"
#> [10] "Cicero"            "Evanston"          "Hammond"
#> [13] "Palatine"          "Schaumburg"        "Skokie"
#> [16] "Waukegan"          "West Dundee"

The first column of suburbs is a vector, so that’s what suburbs[[1]] returns: a vector. If the first column were a factor, we’d get a factor.

The result differs when you use the single-bracket notation, as in suburbs[1] or suburbs[c(1,3)]. You still get the requested columns, but R wraps them in a data frame. This example returns the first column wrapped in a data frame:

suburbs[1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

Another option, using the dplyr package from the Tidyverse, is to pipe the data into a select statement:

suburbs %>%
  dplyr::select(1)
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

You can, of course, use select from the dplyr package to pull more than one column:

suburbs %>%
  dplyr::select(1, 4)
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows

The next example returns the first and third columns as a data frame:

suburbs[c(1, 3)]
#> # A tibble: 17 x 2
#>   city    state
#>   <chr>   <chr>
#> 1 Chicago IL
#> 2 Kenosha WI
#> 3 Aurora  IL
#> 4 Elgin   IL
#> 5 Gary    IN
#> 6 Joliet  IL
#> # ... with 11 more rows

A major source of confusion is that suburbs[[1]] and suburbs[1] look similar but produce very different results:

suburbs[[1]]

This returns one column.

suburbs[1]

This returns a data frame, and the data frame contains exactly one column. This is a special case of df[c(n1,n2, ..., nk)]. We don’t need the c(...) construct because there is only one n.

The point here is that “one column” is different from “a data frame that contains one column.” The first expression returns a column, so it’s a vector or a factor. The second expression returns a data frame, which is different.

R lets you use matrix notation to select columns, as shown in the Solution. But an odd quirk of base data frames can bite you: you might get a bare column or you might get a data frame, depending upon how many subscripts you use. With a single index, a base data frame hands back the column as a vector, although a tibble, as shown here, always returns a tibble:

suburbs[, 1]
#> # A tibble: 17 x 1
#>   city
#>   <chr>
#> 1 Chicago
#> 2 Kenosha
#> 3 Aurora
#> 4 Elgin
#> 5 Gary
#> 6 Joliet
#> # ... with 11 more rows

But using the same matrix-style syntax with multiple indexes returns a data frame:

suburbs[, c(1, 4)]
#> # A tibble: 17 x 2
#>   city        pop
#>   <chr>     <dbl>
#> 1 Chicago 2853114
#> 2 Kenosha   90352
#> 3 Aurora   171782
#> 4 Elgin     94487
#> 5 Gary     102746
#> 6 Joliet   106221
#> # ... with 11 more rows

This creates a problem. Suppose you see this expression in some old R script:

df[, vec]

Quick, does that return a column or a data frame? Well, it depends. For a base data frame, if vec contains one value then you get a column; otherwise, you get a data frame. You cannot tell from the syntax alone.

To avoid this problem, you can include drop=FALSE in the subscripts; this forces R to return a data frame:

df[, vec, drop = FALSE]

Now there is no ambiguity about the returned data structure. It’s a data frame.

When all is said and done, using matrix notation to select columns from data frames is not the best procedure. It’s a good idea to instead use the list operators described previously. They just seem clearer. Or you can use the functions in dplyr and know that you will get back a tibble.

See Also

See “Selecting One Row or Column from a Matrix” for more about using drop=FALSE.

Selecting Data Frame Columns by Name

Problem

You want to select columns from a data frame according to their name.

Solution

To select a single column, use one of these list expressions:

df[["name"]]

Returns one column, the column called name.

df$name

Same as previous, just different syntax.

To select one or more columns and package them in a data frame, use these list expressions:

df["name"]

Selects one column and packages it inside a data frame object.

df[c("name1", "name2", ..., "namek")]

Selects several columns and packages them in a data frame.

You can use matrix-style subscripting to select one or more columns:

df[, "name"]

Returns the named column.

df[, c("name1", "name2", ..., "namek")]

Selects several columns and packages them in a data frame.

Once again, the matrix-style subscripting can return two different data types (column or data frame) depending upon whether you select one column or multiple columns.

Or you can use the dplyr package from the Tidyverse and pass column names to the select function to get back a tibble.

df %>% select(name1, name2, ..., namek)

Discussion

All columns in a data frame must have names. If you know the name, it’s usually more convenient and readable to select by name, not by position.

The solutions just described are similar to those for “Selecting Data Frame Columns by Position”, where we selected columns by position. The only difference is that here we use column names instead of column numbers. All the observations made in “Selecting Data Frame Columns by Position” apply here:

  • df[["name"]] returns one column, not a data frame.

  • df[c("name1", "name2", ..., "namek")] returns a data frame, not a column.

  • df["name"] is a special case of the previous expression and so returns a data frame, not a column.

  • The matrix-style subscripting can return either a column or a data frame, so be careful how many names you supply. See “Selecting Data Frame Columns by Position” for a discussion of this “gotcha” and using drop=FALSE.

There is one new addition:

df$name

This is identical in effect to df[["name"]], but it’s easier to type and to read.
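
For example, a quick check with the suburbs data used elsewhere in this chapter confirms that the two forms give the same result:

identical(suburbs$pop, suburbs[["pop"]])
#> [1] TRUE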

Note that if you use select from dplyr, you don’t put the column names in quotes:

df %>% select(name1, name2, ..., namek)

Unquoted column names are a Tidyverse feature and help make Tidy functions fast and easy to type interactively.

See Also

See “Selecting Data Frame Columns by Position” to understand these ways to select columns.

Selecting Rows and Columns More Easily

Problem

You want an easier way to select rows and columns from a data frame or matrix.

Solution

Use the subset function. The select argument is a column name, or a vector of column names, to be selected:

subset(df, select = colname)
subset(df, select = c(colname1, ..., colnameN))

Note that you do not quote the column names.

The subset argument is a logical expression that selects rows. Inside the expression, you can refer to the column names as part of the logical expression. In this example, city is a column in the data frame, and we are selecting rows with a pop over 100,000:

subset(suburbs, subset = (pop > 100000))
#> # A tibble: 5 x 4
#>   city       county   state     pop
#>   <chr>      <chr>    <chr>   <dbl>
#> 1 Chicago    Cook     IL    2853114
#> 2 Aurora     Kane     IL     171782
#> 3 Gary       Lake(IN) IN     102746
#> 4 Joliet     Kendall  IL     106221
#> 5 Naperville DuPage   IL     147779

subset is most useful when you combine the select and subset arguments:

subset(suburbs, select = c(city, state, pop), subset = (pop > 100000))
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779

The Tidyverse alternative is to use dplyr and string together a select statement with a filter statement:

suburbs %>%
  dplyr::select(city, state, pop) %>%
  filter(pop > 100000)
#> # A tibble: 5 x 3
#>   city       state     pop
#>   <chr>      <chr>   <dbl>
#> 1 Chicago    IL    2853114
#> 2 Aurora     IL     171782
#> 3 Gary       IN     102746
#> 4 Joliet     IL     106221
#> 5 Naperville IL     147779

Discussion

Indexing is the “official” Base R way to select rows and columns from a data frame, as described in Recipes and . However, indexing is cumbersome when the index expressions become complicated.

The subset function provides a more convenient and readable way to select rows and columns. Its beauty is that you can refer to the columns of the data frame right inside the expressions for selecting columns and rows.

Combining select and filter from dplyr along with pipes makes the steps even easier to both read and write.

Here are some examples using the Cars93 dataset in the MASS package. The dataset includes columns for Manufacturer, Model, MPG.city, MPG.highway, Min.Price, and Max.Price:

Select the model name for cars that can exceed 30 miles per gallon (MPG) in the city:

library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
my_subset <- subset(Cars93, select = Model, subset = (MPG.city > 30))
head(my_subset)
#>      Model
#> 31 Festiva
#> 39   Metro
#> 42   Civic
#> 73  LeMans
#> 80   Justy
#> 83   Swift

Or, using dplyr:

Cars93 %>%
  filter(MPG.city > 30) %>%
  select(Model) %>%
  head()
#> Error in select(., Model): unused argument (Model)
Warning

Wait… what? Why did this not work? select worked just fine in an earlier example! We left this in the book as an example of a bad surprise. We loaded the tidyverse package at the beginning of the chapter, and just now we loaded the MASS package. It turns out that MASS also has a function named select, so the package loaded last is the one that stomps on top of the others. We have two options: (1) unload packages and then load MASS before dplyr or the tidyverse, or (2) disambiguate which select statement we are calling. Let’s go with option 2 because it’s easy to illustrate:

Cars93 %>%
  filter(MPG.city > 30) %>%
  dplyr::select(Model) %>%
  head()
#>     Model
#> 1 Festiva
#> 2   Metro
#> 3   Civic
#> 4  LeMans
#> 5   Justy
#> 6   Swift

By using dplyr::select we tell R, “Hey, R, use only the select function from dplyr,” and R follows suit.

Now let’s select the model name and price range for four-cylinder cars made in the United States:

my_cars <- subset(Cars93,
  select = c(Model, Min.Price, Max.Price),
  subset = (Cylinders == 4 & Origin == "USA")
)
head(my_cars)
#>       Model Min.Price Max.Price
#> 6   Century      14.2      17.3
#> 12 Cavalier       8.5      18.3
#> 13  Corsica      11.4      11.4
#> 15   Lumina      13.4      18.4
#> 21  LeBaron      14.5      17.1
#> 23     Colt       7.9      10.6

Or, using our unambiguous dplyr functions:

Cars93 %>%
  filter(Cylinders == 4 & Origin == "USA") %>%
  dplyr::select(Model, Min.Price, Max.Price) %>%
  head()
#>      Model Min.Price Max.Price
#> 1  Century      14.2      17.3
#> 2 Cavalier       8.5      18.3
#> 3  Corsica      11.4      11.4
#> 4   Lumina      13.4      18.4
#> 5  LeBaron      14.5      17.1
#> 6     Colt       7.9      10.6

Notice that in the above example we put the filter statement before the select statement. Commands connected by pipes are sequential: if we selected only our three fields before we filtered on Cylinders and Origin, then the Cylinders and Origin fields would no longer be in the data and we’d get an error.

Now we’ll select the manufacturer’s name and the model name for all cars whose highway MPG value is above the median:

my_cars <- subset(Cars93,
  select = c(Manufacturer, Model),
  subset = (MPG.highway > median(MPG.highway))
)
head(my_cars)
#>    Manufacturer    Model
#> 1         Acura  Integra
#> 5           BMW     535i
#> 6         Buick  Century
#> 12    Chevrolet Cavalier
#> 13    Chevrolet  Corsica
#> 15    Chevrolet   Lumina

The subset function is actually more powerful than this recipe implies. It can select from lists and vectors, too. See the help page for details.

Or, using dplyr:

Cars93 %>%
  filter(MPG.highway > median(MPG.highway)) %>%
  dplyr::select(Manufacturer, Model) %>%
  head()
#>   Manufacturer    Model
#> 1        Acura  Integra
#> 2          BMW     535i
#> 3        Buick  Century
#> 4    Chevrolet Cavalier
#> 5    Chevrolet  Corsica
#> 6    Chevrolet   Lumina

Remember, in the examples above the only reason we use the full dplyr::select name is that we have a conflict with MASS::select. In your own code you will likely need only select after you load dplyr.

To spare ourselves further frustrating naming clashes, let’s detach the MASS package:

detach("package:MASS", unload = TRUE)

Changing the Names of Data Frame Columns

Problem

You converted a matrix or list into a data frame. R gave names to the columns, but the names are at best uninformative and at worst bizarre.

Solution

Data frames have a colnames attribute that is a vector of column names. You can update individual names or the entire vector:

df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df
#>   V1 V2 V3
#> 1  1  4  7
#> 2  2  5  8
#> 3  3  6  9
colnames(df) <- c("tom", "dick", "harry") # a vector of character strings
df
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9

Or, using dplyr from the Tidyverse:

df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
df %>%
  rename(tom = V1, dick = V2, harry = V3)
#>   tom dick harry
#> 1   1    4     7
#> 2   2    5     8
#> 3   3    6     9

Notice that with the rename function in dplyr there’s no need to use quotes around the column names, as is typical with Tidyverse functions. Also note that the argument order is new_name=old_name.
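
If you only need to change one column, rename leaves the rest untouched. Here is a minimal sketch using the same df as above:

df %>%
  rename(tom = V1)
#>   tom V2 V3
#> 1   1  4  7
#> 2   2  5  8
#> 3   3  6  9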

Discussion

The columns of data frames (and tibbles) must have names. If you convert a vanilla matrix into a data frame, R will synthesize names that are reasonable but boring — for example, V1, V2, V3, and so forth:

mat <- matrix(rnorm(9), nrow = 3, ncol = 3)
mat
#>       [,1]    [,2]   [,3]
#> [1,] 0.701  0.0976  0.821
#> [2,] 0.388 -1.2755 -1.086
#> [3,] 1.968  1.2544  0.111
as.data.frame(mat)
#>      V1      V2     V3
#> 1 0.701  0.0976  0.821
#> 2 0.388 -1.2755 -1.086
#> 3 1.968  1.2544  0.111

If the matrix had column names defined, R would have used those names instead of synthesizing new ones.

However, converting a list into a data frame produces some strange synthetic names:

lst <- list(1:3, c("a", "b", "c"), round(rnorm(3), 3))
lst
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] "a" "b" "c"
#>
#> [[3]]
#> [1] 0.181 0.773 0.983
as.data.frame(lst)
#>   X1.3 c..a....b....c.. c.0.181..0.773..0.983.
#> 1    1                a                  0.181
#> 2    2                b                  0.773
#> 3    3                c                  0.983

Again, if the list elements had names then R would have used them.

Fortunately, you can overwrite the synthetic names with names of your own by setting the colnames attribute:

df <- as.data.frame(lst)
colnames(df) <- c("patient", "treatment", "value")
df
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

You can rename by position using rename from dplyr… but it’s not really pretty. Actually, it’s quite horrible, and we considered omitting it from this book.

df <- as.data.frame(lst)
df %>%
  rename(
    "patient" = !!names(.[1]),
    "treatment" = !!names(.[2]),
    "value" = !!names(.[3])
  )
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

The reason this is so ugly is that the Tidyverse is designed around using names, not positions, when referring to columns. And in this example the names are pretty miserable to type and get right. While you could use the above recipe, we recommend using the Base R colnames() method if you really must rename by position number.
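
For reference, here is a minimal Base R sketch of renaming by position with colnames, using the same df built from lst above:

df <- as.data.frame(lst)
colnames(df)[2] <- "treatment"   # change only the second column name, by position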

Of course, we could have made this all a lot easier by simply giving the list elements names before we converted it to a data frame:

names(lst) <- c("patient", "treatment", "value")
as.data.frame(lst)
#>   patient treatment value
#> 1       1         a 0.181
#> 2       2         b 0.773
#> 3       3         c 0.983

Removing NAs from a Data Frame

Problem

Your data frame contains NA values, which are creating problems for you.

Solution

Use na.omit to remove rows that contain any NA values.

df <- data.frame(my_data = c(NA, 1, NA, 2, NA, 3))
df
#>   my_data
#> 1      NA
#> 2       1
#> 3      NA
#> 4       2
#> 5      NA
#> 6       3
clean_df <- na.omit(df)
clean_df
#>   my_data
#> 2       1
#> 4       2
#> 6       3

Discussion

We frequently stumble upon situations where just a few NA values in a data frame cause everything to fall apart. One solution is simply to remove all rows that contain any NAs. That’s what na.omit does.

Here we can see cumsum fail because the input contains NA values:

df <- data.frame(
  x = c(NA, rnorm(4)),
  y = c(rnorm(2), NA, rnorm(2))
)
df
#>        x      y
#> 1     NA -0.836
#> 2  0.670 -0.922
#> 3 -1.421     NA
#> 4 -0.236 -1.123
#> 5 -0.975  0.372
cumsum(df)
#>    x      y
#> 1 NA -0.836
#> 2 NA -1.759
#> 3 NA     NA
#> 4 NA     NA
#> 5 NA     NA

If we remove the NA values, cumsum can complete its summations:

cumsum(na.omit(df))
#>        x      y
#> 2  0.670 -0.922
#> 4  0.434 -2.046
#> 5 -0.541 -1.674

This recipe works for vectors and matrices, too, but not for lists.

The obvious danger here is that simply dropping observations from your data could render the results computationally or statistically meaningless. Make sure that omitting data makes sense in your context. Remember that na.omit will remove entire rows, not just the NA values, which could eliminate a lot of useful information.
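
As a small, made-up illustration of that last point, a single NA in either column is enough to drop the whole row:

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
na.omit(df)
#>   x y
#> 1 1 a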

Excluding Columns by Name

Problem

You want to exclude a column from a data frame using its name.

Solution

Use the subset function with a negated argument for the select parameter:

df <- data.frame(good = rnorm(3), meh = rnorm(3), bad = rnorm(3))
df
#>     good     meh    bad
#> 1  1.911 -0.7045 -1.575
#> 2  0.912  0.0608 -2.238
#> 3 -0.819  0.4424 -0.807
subset(df, select = -bad) # All columns except bad
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424

Or we can use select from dplyr to accomplish the same thing:

df %>%
  dplyr::select(-bad)
#>     good     meh
#> 1  1.911 -0.7045
#> 2  0.912  0.0608
#> 3 -0.819  0.4424

Discussion

We can exclude a column by position (e.g., df[-1]), but how do we exclude a column by name? The subset function can exclude columns from a data frame. The select parameter is normally a list of columns to include, but prefixing a minus sign (-) to the name causes the column to be excluded instead.

We often encounter this problem when calculating the correlation matrix of a data frame and we want to exclude nondata columns such as labels. Let’s set up some dummy data:

id <- 1:10
pre <- rnorm(10)
dosage <- rnorm(10) + .3 * pre
post <- dosage * .5 * pre
patient_data <- data.frame(id = id, pre = pre, dosage = dosage, post = post)

cor(patient_data)
#>             id     pre  dosage    post
#> id      1.0000 -0.6934 -0.5075  0.0672
#> pre    -0.6934  1.0000  0.5830 -0.0919
#> dosage -0.5075  0.5830  1.0000  0.0878
#> post    0.0672 -0.0919  0.0878  1.0000

This correlation matrix includes the meaningless “correlation” between id and other variables, which is annoying. We can exclude the id column to clean up the output:

cor(subset(patient_data, select = -id))
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000

or with dplyr:

patient_data %>%
  dplyr::select(-id) %>%
  cor()
#>            pre dosage    post
#> pre     1.0000 0.5830 -0.0919
#> dosage  0.5830 1.0000  0.0878
#> post   -0.0919 0.0878  1.0000

We can exclude multiple columns by giving a vector of negated names:

cor(subset(patient_data, select = c(-id, -dosage)))

or with dplyr:

patient_data %>%
  dplyr::select(-id, -dosage) %>%
  cor()
#>          pre    post
#> pre   1.0000 -0.0919
#> post -0.0919  1.0000

Note that with dplyr we don’t wrap the column names in c().

See Also

See “Selecting Rows and Columns More Easily” for more about the subset function.

Combining Two Data Frames

Problem

You want to combine the contents of two data frames into one data frame.

Solution

To combine the columns of two data frames side by side, use cbind (column bind):

df1 <- data_frame(a = rnorm(5))
df2 <- data_frame(b = rnorm(5))

all <- cbind(df1, df2)
all
#>         a       b
#> 1 -1.6357  1.3669
#> 2 -0.3662 -0.5432
#> 3  0.4445 -0.0158
#> 4  0.4945 -0.6960
#> 5  0.0934 -0.7334

To “stack” the rows of two data frames, use rbind (row bind):

df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440

df2 <- data_frame(x = rep("b", 2), y = rnorm(2))
df2
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 b     2.35
#> 2 b     0.188

rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 a     1.90
#> 2 a     0.440
#> 3 b     2.35
#> 4 b     0.188

Discussion

You can combine data frames in one of two ways: either by putting the columns side by side to create a wider data frame; or by “stacking” the rows to create a taller data frame. The cbind function will combine data frames side by side. You would normally combine columns with the same height (number of rows). Technically speaking, however, cbind does not require matching heights. If one data frame is short, it will invoke the Recycling Rule to extend the short columns as necessary (“Understanding the Recycling Rule”), which may or may not be what you want.

The rbind function will “stack” the rows of two data frames. The rbind function requires that the data frames have the same width: same number of columns and same column names. The columns need not be in the same order, however; rbind will sort that out:

df1 <- data_frame(x = rep("a", 2), y = rnorm(2))
df1
#> # A tibble: 2 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478

df2 <- data_frame(y = 1:2, x = c("b", "b"))
df2
#> # A tibble: 2 x 2
#>       y x
#>   <int> <chr>
#> 1     1 b
#> 2     2 b

rbind(df1, df2)
#> # A tibble: 4 x 2
#>   x          y
#>   <chr>  <dbl>
#> 1 a     -0.366
#> 2 a     -0.478
#> 3 b      1
#> 4 b      2

Finally, this recipe is slightly more general than the title implies. First, you can combine more than two data frames because both rbind and cbind accept multiple arguments. Second, you can apply this recipe to other data types because rbind and cbind work also with vectors, lists, and matrices.
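
For example, here is a small sketch (with made-up one-column data frames) of binding three data frames side by side in a single call:

df1 <- data.frame(a = 1:2)
df2 <- data.frame(b = 3:4)
df3 <- data.frame(c = 5:6)
cbind(df1, df2, df3)
#>   a b c
#> 1 1 3 5
#> 2 2 4 6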

See Also

The merge function can combine data frames that are otherwise incompatible owing to missing or different columns. In addition, dplyr and tidyr from the Tidyverse include some powerful functions for slicing, dicing, and recombining data frames.

Merging Data Frames by Common Column

Problem

You have two data frames that share a common column. You want to merge or join their rows into one data frame by matching on the common column.

Solution

Use the merge function to join the data frames into one new data frame based on the common column:

df1 <- data.frame(index = letters[1:5], val1 = rnorm(5))
df2 <- data.frame(index = letters[1:5], val2 = rnorm(5))

m <- merge(df1, df2, by = "index")
m
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149

Here index is the name of the column that is common to data frames df1 and df2.

The alternative dplyr way of doing this is with inner_join:

df1 %>%
  inner_join(df2)
#> Joining, by = "index"
#>   index      val1   val2
#> 1     a -0.000837  1.178
#> 2     b -0.214967 -1.599
#> 3     c -1.399293  0.487
#> 4     d  0.010251 -1.688
#> 5     e -0.031463 -0.149

Discussion

Suppose you have two data frames, born and died, that each contain a column called name:

born <- data.frame(
  name = c("Moe", "Larry", "Curly", "Harry"),
  year.born = c(1887, 1902, 1903, 1964),
  place.born = c("Bensonhurst", "Philadelphia", "Brooklyn", "Moscow")
)
died <- data.frame(
  name = c("Curly", "Moe", "Larry"),
  year.died = c(1952, 1975, 1975)
)

We can merge them into one data frame by using name to combine matched rows:

merge(born, died, by = "name")
#>    name year.born   place.born year.died
#> 1 Curly      1903     Brooklyn      1952
#> 2 Larry      1902 Philadelphia      1975
#> 3   Moe      1887  Bensonhurst      1975

Notice that merge does not require the rows to be sorted or even to occur in the same order. It found the matching rows for Curly even though they occur in different positions. It also discards rows that appear in only one data frame or the other.

In SQL terms, the merge function essentially performs a join operation on the two data frames. It has many options for controlling that join operation, all of which are described on the help page for merge.
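
For instance, the all.x argument keeps unmatched rows from the first data frame (a left outer join, in SQL terms). Here is a sketch using the born and died data from above; Harry appears only in born, so his year.died is NA:

merge(born, died, by = "name", all.x = TRUE)
#>    name year.born   place.born year.died
#> 1 Curly      1903     Brooklyn      1952
#> 2 Harry      1964       Moscow        NA
#> 3 Larry      1902 Philadelphia      1975
#> 4   Moe      1887  Bensonhurst      1975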

Because of the similarity with SQL, dplyr uses similar terms:

born %>%
  inner_join(died)
#> Joining, by = "name"
#> Warning: Column `name` joining factors with different levels, coercing to
#> character vector
#>    name year.born   place.born year.died
#> 1   Moe      1887  Bensonhurst      1975
#> 2 Larry      1902 Philadelphia      1975
#> 3 Curly      1903     Brooklyn      1952

Because we used data.frame to create the data frames, the name column was turned into a factor. dplyr, like most Tidyverse packages, really prefers characters, so the column was coerced to character and we get a chatty notification from R. This is the sort of verbose feedback that is common in the Tidyverse. There are multiple types of joins in dplyr, including inner, left, right, and full. For a complete list, see the join documentation by typing ?dplyr::join.
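
Similarly, the dplyr counterpart of a left outer join is left_join, which keeps every row of born and fills year.died with NA where there is no match (a sketch; output omitted here):

born %>%
  left_join(died, by = "name")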

See Also

See “Combining Two Data Frames” for other ways to combine data frames.

Accessing Data Frame Contents More Easily

Problem

Your data is stored in a data frame. You are getting tired of repeatedly typing the data frame name and want to access the columns more easily.

Solution

For quick, one-off expressions, use the with function to expose the column names:

with(dataframe, expr)

Inside expr, you can refer to the columns of dataframe by their names as if they were simple variables.

If you’re working with Tidyverse functions and pipes (%>%), this is less useful, because in a piped workflow you are always dealing with whatever data was sent down the pipe.

Discussion

A data frame is a great way to store your data, but accessing individual columns can become tedious. For a data frame called suburbs that contains a column called pop, here is the naïve way to calculate the z-scores of pop:

z <- (suburbs$pop - mean(suburbs$pop)) / sd(suburbs$pop)
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364

Call us lazy, but all that typing gets tedious. The with function lets you expose the columns of a data frame as distinct variables. It takes two arguments, a data frame and an expression to be evaluated. Inside the expression, you can refer to the data frame columns by their names:

z <- with(suburbs, (pop - mean(pop)) / sd(pop))
z
#>  [1]  3.875 -0.237 -0.116 -0.231 -0.219 -0.214 -0.152 -0.259 -0.266 -0.264
#> [11] -0.261 -0.248 -0.272 -0.260 -0.277 -0.236 -0.364

When using dplyr you can accomplish the same logic with mutate:

suburbs %>%
  mutate(z = (pop - mean(pop)) / sd(pop))
#> # A tibble: 17 x 5
#>   city    county   state     pop      z
#>   <chr>   <chr>    <chr>   <dbl>  <dbl>
#> 1 Chicago Cook     IL    2853114  3.88
#> 2 Kenosha Kenosha  WI      90352 -0.237
#> 3 Aurora  Kane     IL     171782 -0.116
#> 4 Elgin   Kane     IL      94487 -0.231
#> 5 Gary    Lake(IN) IN     102746 -0.219
#> 6 Joliet  Kendall  IL     106221 -0.214
#> # ... with 11 more rows

As you can see, mutate helpfully mutates the data frame by adding the column we just created.

Converting One Atomic Value into Another

Problem

You have a data value which has an atomic data type: character, complex, double, integer, or logical. You want to convert this value into one of the other atomic data types.

Solution

For each atomic data type, there is a function for converting values to that type. The conversion functions for atomic types include:

  • as.character(x)

  • as.complex(x)

  • as.numeric(x) or as.double(x)

  • as.integer(x)

  • as.logical(x)

Discussion

Converting one atomic type into another is usually pretty simple. If the conversion works, you get what you would expect. If it does not work, you get NA:

as.numeric(" 3.14 ")
#> [1] 3.14
as.integer(3.14)
#> [1] 3
as.numeric("foo")
#> Warning: NAs introduced by coercion
#> [1] NA
as.character(101)
#> [1] "101"

If you have a vector of atomic types, these functions apply themselves to every value. So the preceding examples of converting scalars generalize easily to converting entire vectors:

as.numeric(c("1", "2.718", "7.389", "20.086"))
#> [1]  1.00  2.72  7.39 20.09
as.numeric(c("1", "2.718", "7.389", "20.086", "etc."))
#> Warning: NAs introduced by coercion
#> [1]  1.00  2.72  7.39 20.09    NA
as.character(101:105)
#> [1] "101" "102" "103" "104" "105"

When converting logical values into numeric values, R converts FALSE to 0 and TRUE to 1:

as.numeric(FALSE)
#> [1] 0
as.numeric(TRUE)
#> [1] 1

This behavior is useful when you are counting occurrences of TRUE in vectors of logical values. If logvec is a vector of logical values, then sum(logvec) does an implicit conversion from logical to integer and returns the number of `TRUE`s:

logvec <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
sum(logvec) ## num true
#> [1] 4
length(logvec) - sum(logvec) ## num not true
#> [1] 2

Converting One Structured Data Type into Another

Problem

You want to convert a variable from one structured data type to another—for example, converting a vector into a list or a matrix into a data frame.

Solution

These functions convert their argument into the corresponding structured data type:

  • as.data.frame(x)

  • as.list(x)

  • as.matrix(x)

  • as.vector(x)

Some of these conversions may surprise you, however. We suggest you review Table 5-1.

Discussion

Converting between structured data types can be tricky. Some conversions behave as you’d expect. If you convert a matrix into a data frame, for instance, the rows and columns of the matrix become the rows and columns of the data frame. No sweat.

Table 5-1. Data conversions

Vector→List: as.list(vec). Don’t use list(vec); that creates a 1-element list whose only element is a copy of vec.

Vector→Matrix: To create a 1-column matrix, use cbind(vec) or as.matrix(vec); to create a 1-row matrix, use rbind(vec); to create an n × m matrix, use matrix(vec, n, m). See “Initializing a Matrix”.

Vector→Data frame: To create a 1-column data frame, use as.data.frame(vec); to create a 1-row data frame, use as.data.frame(rbind(vec)).

List→Vector: unlist(lst). Use unlist rather than as.vector; see Note 1 and “Flatten a List into a Vector”.

List→Matrix: To create a 1-column matrix, use as.matrix(lst); to create a 1-row matrix, use as.matrix(rbind(lst)); to create an n × m matrix, use matrix(lst, n, m).

List→Data frame: If the list elements are columns of data, use as.data.frame(lst); if the list elements are rows of data, see “Initializing a Data Frame from Row Data”.

Matrix→Vector: as.vector(mat). Returns all matrix elements in a vector.

Matrix→List: as.list(mat). Returns all matrix elements in a list.

Matrix→Data frame: as.data.frame(mat)

Data frame→Vector: To convert a 1-row data frame, use df[1,]; to convert a 1-column data frame, use df[,1] or df[[1]]. See Note 2.

Data frame→List: as.list(df). See Note 3.

Data frame→Matrix: as.matrix(df). See Note 4.

In other cases, the results might surprise you. Table 5-1 summarizes some noteworthy examples. The following Notes are cited in that table:

  1. When you convert a list into a vector, the conversion works cleanly if your list contains atomic values that are all of the same mode. Things become complicated if either (a) your list contains mixed modes (e.g., numeric and character), in which case everything is converted to characters; or (b) your list contains other structured data types, such as sublists or data frames—in which case very odd things happen, so don’t do that.

  2. Converting a data frame into a vector makes sense only if the data frame contains one row or one column. To extract all its elements into one, long vector, use as.vector(as.matrix(df)). But even that makes sense only if the data frame is all-numeric or all-character; if not, everything is first converted to character strings.

  3. Converting a data frame into a list may seem odd in that a data frame is already a list (i.e., a list of columns). Using as.list essentially removes the class (data.frame) and thereby exposes the underlying list. That is useful when you want R to treat your data structure as a list—say, for printing.

  4. Be careful when converting a data frame into a matrix. If the data frame contains only numeric values then you get a numeric matrix. If it contains only character values, you get a character matrix. But if the data frame is a mix of numbers, characters, and/or factors, then all values are first converted to characters. The result is a matrix of character strings.
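
A tiny made-up example of Note 4’s surprise: mixing a numeric column with a character column yields a matrix of character strings:

df <- data.frame(x = 1:2, y = c("a", "b"))
m <- as.matrix(df)
class(m[, "x"])   # the numeric column was converted to character
#> [1] "character"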

Problems with matrices

The matrix conversions detailed here assume that your matrix is homogeneous: all elements have the same mode (e.g., all numeric or all character). A matrix can be heterogeneous, too, when the matrix is built from a list. If so, conversions become messy. For example, when you convert a mixed-mode matrix to a data frame, the data frame’s columns are actually lists (to accommodate the mixed data).

See Also

See “Converting One Atomic Value into Another” for converting atomic data types; see the “Introduction” to this chapter for remarks on problematic conversions.

1 A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.

2 More precisely, it orders the names according to your Locale.

Chapter 6. Data Transformations

Introduction

While traditional programming languages use loops, R has traditionally encouraged using vectorized operations and the apply family of functions to crunch data in batches, greatly streamlining the calculations. There is nothing to prevent you from writing loops in R that break your data into whatever chunks you want and then do an operation on each chunk. However, using vectorized functions can, in many cases, increase the speed, readability, and maintainability of your code.

In recent history, however, the Tidyverse, specifically the purrr and dplyr packages, has introduced new idioms into R that make these concepts easier to learn and slightly more consistent. The name purrr comes from a play on the phrase “Pure R.” A “pure function” is a function whose result is determined only by its inputs and which does not produce any side effects. This is a functional programming concept which you need not understand in order to get great value from purrr. All most users need to know is that purrr contains functions to help us operate “chunk by chunk” on our data in a way that meshes well with other Tidyverse packages such as dplyr.

Base R has many apply functions: apply, lapply, sapply, tapply, mapply; and their cousins, by and split. These are solid functions that have been workhorses in Base R for years. The authors have struggled a bit with how much to focus on the Base R apply functions and how much to focus on the newer “tidy” approach. After much debate we’ve chosen to try to illustrate the purrr approach, to acknowledge Base R approaches, and, in a few places, to illustrate both. The interface to purrr and dplyr is very clean and, we believe, in most cases more intuitive.

Applying a Function to Each List Element

Problem

You have a list, and you want to apply a function to each element of the list.

Solution

We can use map to apply the function to every element of a list:

library(tidyverse)

lst %>%
  map(fun)

Discussion

Let’s look at a specific example of taking the average of all the numbers in each element of a list:

library(tidyverse)

lst <- list(
  a = c(1,2,3),
  b = c(4,5,6)
)
lst %>%
  map(mean)
#> $a
#> [1] 2
#>
#> $b
#> [1] 5

These functions will call your function once for every element on your list. Your function should expect one argument, an element from the list. The map functions will collect the returned values and return them in a list.

The purrr package contains a whole family of map functions that take a list or a vector and then return an object with the same number of elements as the input. The type of object they return varies based on which map function is used. See the help file for map for a complete list, but a few of the most common are as follows:

map() : always returns a list, and the elements of the list may be of different types. This is quite similar to the Base R function lapply.

map_chr() : returns a character vector

map_int() : returns an integer vector

map_dbl() : returns a floating point numeric vector

Let’s take a quick look at a contrived situation where we have a function that could result in a character or an integer result:

fun <- function(x) {
  if (x > 1) {
    1
  } else {
    "Less Than 1"
  }
}

fun(5)
#> [1] 1
fun(0.5)
#> [1] "Less Than 1"

Let’s create a list of elements to which we can map fun, and look at how some of the map variants behave:

lst <- list(.5, 1.5, .9, 2)

map(lst, fun)
#> [[1]]
#> [1] "Less Than 1"
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] "Less Than 1"
#>
#> [[4]]
#> [1] 1

You can see that map produced a list and it is of mixed data types.

And map_chr will produce a character vector and coerce the numbers into characters.

map_chr(lst, fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"

## or using pipes
lst %>%
  map_chr(fun)
#> [1] "Less Than 1" "1.000000"    "Less Than 1" "1.000000"

Meanwhile, map_dbl will try to coerce the character strings into doubles and dies trying:

map_dbl(lst, fun)
#> Error: Can't coerce element 1 from a character to a double

As mentioned above, the Base R lapply function acts very much like map. The Base R sapply function is closer to the typed map variants in that it tries to simplify the results into a vector or matrix.
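
As a rough sketch of that comparison, here are the Base R equivalents applied to the same lst and fun defined above (outputs omitted):

lapply(lst, fun)   # like map: returns a list, possibly of mixed types
sapply(lst, fun)   # tries to simplify; here it falls back to a character vector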

See Also

See Recipe X-X.

Applying a Function to Every Row of a Data Frame

Problem

You have a function and you want to apply it to every row in a data frame.

Solution

The mutate function will create a new variable based on a vector of values. We can use one of the pmap functions (in this case pmap_dbl) to operate on every row and return a vector. The pmap functions that have a type suffix after the underscore (_) return a vector of that type, so pmap_dbl returns a vector of doubles, while pmap_chr would coerce the output into a vector of characters.

fun <- function(a, b, c) {
  # calculate the sum of a sequence from a to b by c
  sum(seq(a, b, c))
}

df <- data.frame(mn = c(1, 2, 3),
                 mx = c(8, 13, 18),
                 rng = c(1, 2, 3))

df %>%
  mutate(output =
           pmap_dbl(list(a = mn, b = mx, c = rng), fun))
#>   mn mx rng output
#> 1  1  8   1     36
#> 2  2 13   2     42
#> 3  3 18   3     63

pmap returns a list, so we could use it to map our function to each data frame row and return the results in a list, if we prefer:

pmap(list(a = df$mn, b = df$mx, c = df$rng), fun)
#> [[1]]
#> [1] 36
#>
#> [[2]]
#> [1] 42
#>
#> [[3]]
#> [1] 63

Discussion

The pmap family of functions takes a list of inputs and a function, then applies the function to each element in the list. In our example above we wrap list() around the columns we are interested in using in our function, fun. The list function turns the columns we want to operate on into a list. Within the same operation we name the columns to match the names our function is looking for. So we set a = mn, for example. This names the mn column of our data frame a in the resulting list, which is one of the arguments our function is expecting.
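
If your function’s arguments happen to be in the same order as the list elements, you can skip the names and let pmap_dbl match them by position. A sketch using the same df and fun as above:

df %>%
  mutate(output = pmap_dbl(list(mn, mx, rng), fun))
#>   mn mx rng output
#> 1  1  8   1     36
#> 2  2 13   2     42
#> 3  3 18   3     63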

Applying a Function to Every Row of a Matrix

Problem

You have a matrix. You want to apply a function to every row, calculating the function result for each row.

Solution

Use the apply function. Set the second argument to 1 to indicate row-by-row application of a function:

results <- apply(mat, 1, fun)    # mat is a matrix, fun is a function

The apply function will call fun once for each row of the matrix, assemble the returned values into a vector, and then return that vector.

Discussion

You may notice that we only show the use of the Base R apply function here while other recipes illustrate purrr alternatives. As of this writing, matrix operations are out of scope for purrr so we use the very solid Base R apply function.

Suppose your matrix long is longitudinal data, so each row contains data for one subject and the columns contain the repeated observations over time:

long <- matrix(1:15, 3, 5)
long
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    4    7   10   13
#> [2,]    2    5    8   11   14
#> [3,]    3    6    9   12   15

You could calculate the average observation for each subject by applying the mean function to each row. The result is a vector:

apply(long, 1, mean)
#> [1] 7 8 9

If your matrix has row names, apply uses them to identify the elements of the resulting vector, which is handy.

rownames(long) <- c("Moe", "Larry", "Curly")
apply(long, 1, mean)
#>   Moe Larry Curly
#>     7     8     9

The function being called should expect one argument, a vector, which will be one row from the matrix. The function can return a scalar or a vector. In the vector case, apply assembles the results into a matrix. The range function returns a vector of two elements, the minimum and the maximum, so applying it to long produces a matrix:

apply(long, 1, range)
#>      Moe Larry Curly
#> [1,]   1     2     3
#> [2,]  13    14    15

You can employ this recipe on data frames as well. It works if the data frame is homogeneous; that is, either all numbers or all character strings. When the data frame has columns of different types, extracting vectors from the rows isn’t sensible because vectors must be homogeneous.

Applying a Function to Every Column

Problem

You have a matrix or data frame, and you want to apply a function to every column.

Solution

For a matrix, use the apply function. Set the second argument to 2, which indicates column-by-column application of the function. So if our matrix or data frame was named mat and we wanted to apply a function named fun to every column, it would look like this:

apply(mat, 2, fun)

Discussion

Let’s look at an example with real numbers and apply the mean function to every column of a matrix:

mat <- matrix(c(1, 3, 2, 5, 4, 6), 2, 3)
colnames(mat) <- c("t1", "t2", "t3")
mat
#>      t1 t2 t3
#> [1,]  1  2  4
#> [2,]  3  5  6

apply(mat, 2, mean)  # Compute the mean of every column
#>  t1  t2  t3
#> 2.0 3.5 5.0

In Base R, the apply function is intended for processing a matrix or data frame. The second argument of apply determines the direction:

  • 1 means process row by row.

  • 2 means process column by column.

This is more mnemonic than it looks. We speak of matrices in “rows and columns”, so rows are first and columns second; 1 and 2, respectively.

A data frame is a more complicated data structure than a matrix, so there are more options. You can simply use apply, in which case R will convert your data frame to a matrix and then apply your function. That will work if your data frame contains only one type of data but will likely not do what you want if some columns are numeric and some are character. In that case, R will force all columns to have identical types, likely performing an unwanted conversion as a result.

Fortunately, there are multiple alternatives. Recall that a data frame is a kind of list: it is a list of the columns of the data frame. purrr has a whole family of map functions that return different types of objects. Of particular interest here is map_df, which returns a data frame (hence the df in the name).

df2 <- map_df(df, fun) # Returns a data.frame

The function fun should expect one argument: a column from the data frame.

A common use of this recipe is checking the types of the columns in a data frame. At a quick glance, the batch column of this data frame seems to contain numbers:

load("./data/batches.rdata")
head(batches)
#>   batch clinic dosage shrinkage
#> 1     3     KY     IL    -0.307
#> 2     3     IL     IL    -1.781
#> 3     1     KY     IL    -0.172
#> 4     3     KY     IL     1.215
#> 5     2     IL     IL     1.895
#> 6     2     NJ     IL    -0.430

But printing the classes of the columns reveals batch to be a factor instead:

map_df(batches, class)
#> # A tibble: 1 x 4
#>   batch  clinic dosage shrinkage
#>   <chr>  <chr>  <chr>  <chr>
#> 1 factor factor factor numeric
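
For comparison, a Base R sketch of the same check; sapply simplifies the result into a named character vector rather than a tibble (output omitted):

sapply(batches, class)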

See Also

See Recipes , , and .

Applying a Function to Parallel Vectors or Lists

Problem

You have a function that takes multiple arguments. You want to apply the function element-wise to vectors and obtain a vector result. Unfortunately, the function is not vectorized; that is, it works on scalars but not on vectors.

Solution

Use one of the map or pmap functions from the tidyverse core package purrr. The most general solution is to put your vectors in a list, then use pmap:

lst <- list(v1, v2, v3)
pmap(lst, fun)

pmap will take the elements of lst and pass them as the inputs to fun.

If you have only two vectors to pass as inputs to your function, the map2 family of functions is convenient and saves you the step of putting your vectors in a list first. map2 will return a list, while the typed variants (map2_chr, map2_dbl, etc.) return vectors of the type their name implies:

map2(v1, v2, fun)

or if fun returns only a double:

map2_dbl(v1, v2, fun)

The type suffix on these purrr functions refers to the output type expected from the function: all the typed variants return vectors of their respective type, while the untyped variants return lists, which allow mixed types.

Discussion

The basic operators of R, such as x + y, are vectorized; this means that they compute their result element-by-element and return a vector of results. Also, many R functions are vectorized.

Not all functions are vectorized, however, and those that are not work only on scalars. Using vector arguments with them produces errors at best and meaningless results at worst. In such cases, the map functions from purrr can effectively vectorize the function for you.

Consider the gcd function from Recipe X-X, which takes two arguments:

gcd <- function(a, b) {
  if (b == 0) {
    return(a)
  } else {
    return(gcd(b, a %% b))
  }
}

If we apply gcd to two vectors, the result is wrong answers and a pile of error messages:

gcd(c(1, 2, 3), c(9, 6, 3))
#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used

#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used

#> Warning in if (b == 0) {: the condition has length > 1 and only the first
#> element will be used
#> [1] 1 2 0

The function is not vectorized, but we can use map to “vectorize” it. In this case, since we have two inputs we’re mapping over, we should use the map2 function. This gives the element-wise GCDs between two vectors.

a <- c(1, 2, 3)
b <- c(9, 6, 3)
my_gcds <- map2(a, b, gcd)
my_gcds
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3

Notice that map2 returns a list. If we wanted the output in a vector, we could use unlist on the result, or use one of the typed variants:

unlist(my_gcds)
#> [1] 1 2 3

The map family of purrr functions gives you a series of variations that return specific types of output. The suffixes on the function names communicate the type of vector they will return. While map and map2 return lists, the type-specific variants return objects guaranteed to be of the same type, so the results can be put in atomic vectors. For example, we could use the map2_chr function to ask R to coerce the results into character output, or map2_dbl to ensure the results are doubles:

map2_chr(a, b, gcd)
#> [1] "1.000000" "2.000000" "3.000000"
map2_dbl(a, b, gcd)
#> [1] 1 2 3

If our data has more than two vectors, or the data is already in a list, we can use the pmap family of functions which take a list as an input.

lst <- list(a,b)
pmap(lst, gcd)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3

Or if we want a typed vector as output:

lst <- list(a,b)
pmap_dbl(lst, gcd)
#> [1] 1 2 3

With the purrr functions, remember that the pmap family are parallel mappers that take a list as input, while the map2 functions take two, and only two, vectors as inputs.

See Also

This is really just a special case of our very first recipe in this chapter: “Applying a Function to Each List Element”. See that recipe for more discussion of map variants. In addition, Jenny Bryan has a great collection of purrr tutorials on her GitHub site: https://jennybc.github.io/purrr-tutorial/

Applying a Function to Groups of Rows

Problem

Your data elements occur in groups. You want to process the data by groups—for example, summing by group or averaging by group.

Solution

The easiest way to do grouping is with the dplyr function group_by in conjunction with summarize. If our data frame is df and we want to group by the variables v1 and v2 and then apply the function fun to the column value_var within each group, we can do that with group_by and summarize:

df %>%
  group_by(v1, v2) %>%
  summarize(
    result_var = fun(value_var)
  )

Discussion

Let’s look at a specific example where our input data frame, df, contains a variable named my_group, which we want to group by, and a field named values, on which we would like to calculate some statistics:

df <- tibble(
  my_group = c("A", "B","A", "B","A", "B"),
  values = 1:6
)

df %>%
  group_by(my_group) %>%
  summarize(
    avg_values = mean(values),
    tot_values = sum(values),
    count_values = n()
  )
#> # A tibble: 2 x 4
#>   my_group avg_values tot_values count_values
#>   <chr>         <dbl>      <int>        <int>
#> 1 A                 3          9            3
#> 2 B                 4         12            3

The output has one record per grouping along with calculated values for the three summary fields we defined.
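
You can group by more than one variable by listing them all in group_by; summarize then produces one row per combination. Here is a sketch with a second, made-up grouping column named my_size (output not shown):

df2 <- tibble(
  my_group = c("A", "A", "B", "B"),
  my_size  = c("big", "small", "big", "small"),
  values   = 1:4
)

df2 %>%
  group_by(my_group, my_size) %>%
  summarize(tot_values = sum(values))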

See Also

See this chapter’s “Introduction” for more about grouping factors.

Chapter 7. Strings and Dates

Introduction

Strings? Dates? In a statistical programming package?

As soon as you read files or print reports, you need strings. When you work with real-world problems, you need dates.

R has facilities for both strings and dates. They are clumsy compared to string-oriented languages such as Perl, but then it’s a matter of the right tool for the job. We wouldn’t want to perform logistic regression in Perl.

Some of this clunkiness with strings and dates has been improved by the tidyverse packages stringr and lubridate. As with other chapters in this book, the examples below pull from Base R as well as add-on packages that make life easier, faster, and more convenient.

Classes for Dates and Times

R has a variety of classes for working with dates and times, which is nice if you prefer having a choice but annoying if you prefer living simply. There is a critical distinction among the classes: some are date-only classes, and some are datetime classes. All classes can handle calendar dates (e.g., March 15, 2019), but not all can represent a datetime (11:45 AM on March 1, 2019).

The following classes are included in the base distribution of R:

Date

The Date class can represent a calendar date but not a clock time. It is a solid, general-purpose class for working with dates, including conversions, formatting, basic date arithmetic, and time-zone handling. Most of the date-related recipes in this book are built on the Date class.

POSIXct

This is a datetime class, and it can represent a moment in time with an accuracy of one second. Internally, the datetime is stored as the number of seconds since January 1, 1970, and so is a very compact representation. This class is recommended for storing datetime information (e.g., in data frames).

POSIXlt

This is also a datetime class, but the representation is stored in a nine-element list that includes the year, month, day, hour, minute, and second. That representation makes it easy to extract date parts, such as the month or hour. Obviously, this representation is much less compact than the POSIXct class; hence it is normally used for intermediate processing and not for storing data.

The base distribution also provides functions for easily converting between representations: as.Date, as.POSIXct, and as.POSIXlt.
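
Here is a minimal sketch of moving between these classes and pulling date parts out of a POSIXlt object (the variable names and the datetime itself are made up):

dt_ct <- as.POSIXct("2019-03-01 11:45:00")   # compact datetime representation
dt_lt <- as.POSIXlt(dt_ct)                   # list-like datetime representation
dt_lt$hour
#> [1] 11
dt_lt$mon + 1    # months are 0-based in POSIXlt, so add 1
#> [1] 3
as.Date(dt_lt)   # drop the clock time, keeping only the date
#> [1] "2019-03-01"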

The following helpful packages are available for downloading from CRAN:

chron

The chron package can represent both dates and times but without the added complexities of handling time zones and daylight savings time. It’s therefore easier to use than Date but less powerful than POSIXct and POSIXlt. It would be useful for work in econometrics or time series analysis.

lubridate

Lubridate is designed to make working with dates and times easier while keeping the important bells and whistles such as time zones. It’s especially clever regarding datetime arithmetic. This package introduces some helpful constructs like durations, periods, and intervals. Lubridate is part of the tidyverse, so it is installed when you run install.packages('tidyverse'). It is not part of the “core tidyverse,” however, so it does not get loaded by library(tidyverse); you must load it explicitly by running library(lubridate).

mondate

This is a specialized package for handling dates in units of months in addition to days and years. Such needs arise in accounting and actuarial work, for example, where month-by-month calculations are needed.

timeDate

This is a high-powered package with well-thought-out facilities for handling dates and times, including date arithmetic, business days, holidays, conversions, and generalized handling of time zones. It was originally part of the Rmetrics software for financial modeling, where precision in dates and times is critical. If you have a demanding need for date facilities, consider this package.

Which class should you select? The article “Date and Time Classes in R” by Grothendieck and Petzoldt offers this general advice:

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

See Also

See help(DateTimeClasses) for more details regarding the built-in facilities. See the June 2004 article “Date and Time Classes in R” by Gabor Grothendieck and Thomas Petzoldt for a great introduction to the date and time facilities. The June 2001 article “Date-Time Classes” by Brian Ripley and Kurt Hornik discusses the two POSIX classes in particular. The “Dates and times” chapter of the book R for Data Science by Garrett Grolemund and Hadley Wickham provides a great introduction to lubridate.

Getting the Length of a String

Problem

You want to know the length of a string.

Solution

Use the nchar function, not the length function.

Discussion

The nchar function takes a string and returns the number of characters in the string:

nchar("Moe")
#> [1] 3
nchar("Curly")
#> [1] 5

If you apply nchar to a vector of strings, it returns the length of each string:

s <- c("Moe", "Larry", "Curly")
nchar(s)
#> [1] 3 5 5

You might think the length function returns the length of a string. Nope. It returns the length of a vector. When you apply the length function to a single string, R returns the value 1 because it views that string as a singleton vector—a vector with one element:

length("Moe")
#> [1] 1
length(c("Moe", "Larry", "Curly"))
#> [1] 3

Concatenating Strings

Problem

You want to join together two or more strings into one string.

Solution

Use the paste function.

Discussion

The paste function concatenates several strings together. In other words, it creates a new string by joining the given strings end to end:

paste("Everybody", "loves", "stats.")
#> [1] "Everybody loves stats."

By default, paste inserts a single space between pairs of strings, which is handy if that’s what you want and annoying otherwise. The sep argument lets you specify a different separator. Use an empty string ("") to run the strings together without separation:

paste("Everybody", "loves", "stats.", sep = "-")
#> [1] "Everybody-loves-stats."
paste("Everybody", "loves", "stats.", sep = "")
#> [1] "Everybodylovesstats."

Wanting to concatenate strings with no separator at all is a common idiom, so there is a convenience function, paste0, that does exactly that:

paste0("Everybody", "loves", "stats.")
#> [1] "Everybodylovesstats."

The paste function is very forgiving about nonstring arguments. It tries to convert them to strings using the as.character function:

paste("The square root of twice pi is approximately", sqrt(2 * pi))
#> [1] "The square root of twice pi is approximately 2.506628274631"

If one or more arguments are vectors of strings, paste will generate all combinations of the arguments (because of recycling):

stooges <- c("Moe", "Larry", "Curly")
paste(stooges, "loves", "stats.")
#> [1] "Moe loves stats."   "Larry loves stats." "Curly loves stats."

Sometimes you want to join even those combinations into one, big string. The collapse parameter lets you define a top-level separator and instructs paste to concatenate the generated strings using that separator:

paste(stooges, "loves", "stats", collapse = ", and ")
#> [1] "Moe loves stats, and Larry loves stats, and Curly loves stats"

Extracting Substrings

Problem

You want to extract a portion of a string according to position.

Solution

Use substr(string,start,end) to extract the substring that begins at start and ends at end.

Discussion

The substr function takes a string, a starting point, and an ending point. It returns the substring between the starting to ending points:

substr("Statistics", 1, 4) # Extract first 4 characters
#> [1] "Stat"
substr("Statistics", 7, 10) # Extract last 4 characters
#> [1] "tics"

Just like many R functions, substr lets the first argument be a vector of strings. In that case, it applies itself to every string and returns a vector of substrings:

ss <- c("Moe", "Larry", "Curly")
substr(ss, 1, 3) # Extract first 3 characters of each string
#> [1] "Moe" "Lar" "Cur"

In fact, all the arguments can be vectors, in which case substr will treat them as parallel vectors. From each string, it extracts the substring delimited by the corresponding entries in the starting and ending points. This can facilitate some useful tricks. For example, the following code snippet extracts the last two characters from each string; each substring starts on the penultimate character of the original string and ends on the final character:

cities <- c("New York, NY", "Los Angeles, CA", "Peoria, IL")
substr(cities, nchar(cities) - 1, nchar(cities))
#> [1] "NY" "CA" "IL"

You can extend this trick into mind-numbing territory by exploiting the Recycling Rule, but we suggest you avoid the temptation.

Splitting a String According to a Delimiter

Problem

You want to split a string into substrings. The substrings are separated by a delimiter.

Solution

Use strsplit, which takes two arguments: the string and the delimiter of the substrings:

strsplit(string, delimiter)

The `delimiter` can be either a simple string or a regular expression.

Discussion

It is common for a string to contain multiple substrings separated by the same delimiter. One example is a file path, whose components are separated by slashes (/):

path <- "/home/mike/data/trials.csv"

We can split that path into its components by using strsplit with a delimiter of /:

strsplit(path, "/")
#> [[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"

Notice that the first “component” is actually an empty string because nothing preceded the first slash.

Also notice that strsplit returns a list and that each element of the list is a vector of substrings. This two-level structure is necessary because the first argument can be a vector of strings. Each string is split into its substrings (a vector); then those vectors are returned in a list.

If you are only operating on a single string, you can pop out the first element like this:

strsplit(path, "/")[[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"

This example splits three file paths and returns a three-element list:

paths <- c(
  "/home/mike/data/trials.csv",
  "/home/mike/data/errors.csv",
  "/home/mike/corr/reject.doc"
)
strsplit(paths, "/")
#> [[1]]
#> [1] ""           "home"       "mike"       "data"       "trials.csv"
#>
#> [[2]]
#> [1] ""           "home"       "mike"       "data"       "errors.csv"
#>
#> [[3]]
#> [1] ""           "home"       "mike"       "corr"       "reject.doc"

The second argument of strsplit (the `delimiter` argument) is actually much more powerful than these examples indicate. It can be a regular expression, letting you match patterns far more complicated than a simple string. In fact, to turn off the regular expression feature (and its interpretation of special characters) you must include the fixed=TRUE argument.
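
For example, a dot is a regular expression metacharacter that matches any single character, so splitting a (made-up) filename on a literal dot requires fixed=TRUE:

strsplit("trials.csv", ".")   # "." is a regex that matches any single character
#> [[1]]
#> [1] "" "" "" "" "" "" "" "" "" ""
strsplit("trials.csv", ".", fixed = TRUE)
#> [[1]]
#> [1] "trials" "csv"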

See Also

To learn more about regular expressions in R, see the help page for regexp. See O’Reilly’s Mastering Regular Expressions, by Jeffrey E.F. Friedl to learn more about regular expressions in general.

Replacing Substrings

Problem

Within a string, you want to replace one substring with another.

Solution

Use sub to replace the first instance of a substring:

sub(old, new, string)

Use gsub to replace all instances of a substring:

gsub(old, new, string)

Discussion

The sub function finds the first instance of the old substring within string and replaces it with the new substring:

str <- "Curly is the smart one. Curly is funny, too."
sub("Curly", "Moe", str)
#> [1] "Moe is the smart one. Curly is funny, too."

gsub does the same thing, but it replaces all instances of the substring (a global replace), not just the first:

gsub("Curly", "Moe", str)
#> [1] "Moe is the smart one. Moe is funny, too."

To remove a substring altogether, simply set the new substring to be empty:

sub(" and SAS", "", "For really tough problems, you need R and SAS.")
#> [1] "For really tough problems, you need R."

The old argument can be a regular expression, which allows you to match patterns much more complicated than a simple string. This is actually assumed by default, so you must set the fixed=TRUE argument if you don’t want sub and gsub to interpret old as a regular expression.
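
For example, a character class in old replaces every digit in one call, and fixed=TRUE treats the dot as a literal dot (the strings here are made up):

gsub("[0-9]", "#", "Call 555-1234 now")
#> [1] "Call ###-#### now"
sub(".", "_", "trials.csv", fixed = TRUE)
#> [1] "trials_csv"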

See Also

To learn more about regular expressions in R, see the help page for regexp. See Mastering Regular Expressions to learn more about regular expressions in general.

Generating All Pairwise Combinations of Strings

Problem

You have two sets of strings, and you want to generate all combinations from those two sets (their Cartesian product).

Solution

Use the outer and paste functions together to generate the matrix of all possible combinations:

m <- outer(strings1, strings2, paste, sep = "")

Discussion

The outer function is intended to form the outer product. However, it allows a third argument to replace simple multiplication with any function. In this recipe we replace multiplication with string concatenation (paste), and the result is all combinations of strings.

Suppose you have four test sites and three treatments:

locations <- c("NY", "LA", "CHI", "HOU")
treatments <- c("T1", "T2", "T3")

We can apply outer and paste to generate all combinations of test sites and treatments:

outer(locations, treatments, paste, sep = "-")
#>      [,1]     [,2]     [,3]
#> [1,] "NY-T1"  "NY-T2"  "NY-T3"
#> [2,] "LA-T1"  "LA-T2"  "LA-T3"
#> [3,] "CHI-T1" "CHI-T2" "CHI-T3"
#> [4,] "HOU-T1" "HOU-T2" "HOU-T3"

The fourth argument of outer is passed to paste. In this case, we passed sep="-" in order to define a hyphen as the separator between the strings.

The result of outer is a matrix. If you want the combinations in a vector instead, flatten the matrix using the as.vector function.

In the special case when you are combining a set with itself and order does not matter, the result will be duplicate combinations:

outer(treatments, treatments, paste, sep = "-")
#>      [,1]    [,2]    [,3]
#> [1,] "T1-T1" "T1-T2" "T1-T3"
#> [2,] "T2-T1" "T2-T2" "T2-T3"
#> [3,] "T3-T1" "T3-T2" "T3-T3"

Or we can use expand.grid to get a pair of vectors representing all combinations:

expand.grid(treatments, treatments)
#>   Var1 Var2
#> 1   T1   T1
#> 2   T2   T1
#> 3   T3   T1
#> 4   T1   T2
#> 5   T2   T2
#> 6   T3   T2
#> 7   T1   T3
#> 8   T2   T3
#> 9   T3   T3

But suppose we want all unique pairwise combinations of treatments. We can eliminate the duplicates by removing the lower triangle (or upper triangle). The lower.tri function identifies that triangle, so inverting it identifies all elements outside the lower triangle:

m <- outer(treatments, treatments, paste, sep = "-")
m[!lower.tri(m)]
#> [1] "T1-T1" "T1-T2" "T2-T2" "T1-T3" "T2-T3" "T3-T3"

See Also

See “Concatenating Strings” for using paste to generate combinations of strings. The gtools package on CRAN (https://cran.r-project.org/web/packages/gtools/index.html) has the functions combinations and permutations, which may be of help with related tasks.

Getting the Current Date

Problem

You need to know today’s date.

Solution

The Sys.Date function returns the current date:

Sys.Date()
#> [1] "2019-01-07"

Discussion

The Sys.Date function returns a Date object. In the preceding example it seems to return a string because the result is printed inside double quotes. What really happened, however, is that Sys.Date returned a Date object and then R converted that object into a string for printing purposes. You can see this by checking the class of the result from Sys.Date:

class(Sys.Date())
#> [1] "Date"

Converting a String into a Date

Problem

You have the string representation of a date, such as “2018-12-31”, and you want to convert that into a Date object.

Solution

You can use as.Date, but you must know the format of the string. By default, as.Date assumes the string looks like yyyy-mm-dd. To handle other formats, you must specify the format parameter of as.Date. Use format="%m/%d/%Y" if the date is in American style, for instance.

Discussion

This example shows the default format assumed by as.Date, which is the ISO 8601 standard format of yyyy-mm-dd:

as.Date("2018-12-31")
#> [1] "2018-12-31"

The as.Date function returns a Date object that (as in the prior recipe) is here being converted back to a string for printing; this explains the double quotes around the output.

The string can be in other formats, but you must provide a format argument so that as.Date can interpret your string. See the help page for the strftime function for details about allowed formats.

Being simple Americans, we often mistakenly try to convert the usual American date format (mm/dd/yyyy) into a Date object, with these unhappy results:

as.Date("12/31/2018")
#> Error in charToDate(x): character string is not in a standard unambiguous format

Here is the correct way to convert an American-style date:

as.Date("12/31/2018", format = "%m/%d/%Y")
#> [1] "2018-12-31"

Observe that the Y in the format string is capitalized to indicate a 4-digit year. If you’re using 2-digit years, specify a lowercase y.
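
For example, here is the same American-style date written with a two-digit year. Be careful: R maps two-digit years 00 through 68 to 2000 through 2068 and 69 through 99 to 1969 through 1999, so historical dates may not convert the way you expect:

as.Date("12/31/18", format = "%m/%d/%y")
#> [1] "2018-12-31"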

Converting a Date into a String

Problem

You want to convert a Date object into a character string, usually because you want to print the date.

Solution

Use either format or as.character:

format(Sys.Date())
#> [1] "2019-01-07"
as.character(Sys.Date())
#> [1] "2019-01-07"

Both functions allow a format argument that controls the formatting. Use format="%m/%d/%Y" to get American-style dates, for example:

format(Sys.Date(), format = "%m/%d/%Y")
#> [1] "01/07/2019"

Discussion

The format argument defines the appearance of the resulting string. Normal characters, such as slash (/) or hyphen (-), are simply copied to the output string. Each two-letter combination of a percent sign (%) followed by another character has special meaning. Some common ones are:

%b

Abbreviated month name (“Jan”)

%B

Full month name (“January”)

%d

Day as a two-digit number

%m

Month as a two-digit number

%y

Year without century (00–99)

%Y

Year with century

See the help page for the strftime function for a complete list of formatting codes.

Converting Year, Month, and Day into a Date

Problem

You have a date represented by its year, month, and day in different variables. You want to merge these elements into a single Date object representation.

Solution

Use the ISOdate function:

ISOdate(year, month, day)

The result is a POSIXct object that you can convert into a Date object:

year <- 2018
month <- 12
day <- 31
as.Date(ISOdate(year, month, day))
#> [1] "2018-12-31"

Discussion

It is common for input data to contain dates encoded as three numbers: year, month, and day. The ISOdate function can combine them into a POSIXct object:

ISOdate(2020, 2, 29)
#> [1] "2020-02-29 12:00:00 GMT"

You can keep your date in the POSIXct format. However, when working with pure dates (not dates and times), we often convert to a Date object and truncate the unused time information:

as.Date(ISOdate(2020, 2, 29))
#> [1] "2020-02-29"

Trying to convert an invalid date results in NA:

ISOdate(2013, 2, 29) # Oops! 2013 is not a leap year
#> [1] NA

ISOdate can process entire vectors of years, months, and days, which is quite handy for mass conversion of input data. The following example starts with the year, month, and day numbers for an early-January date in each of several years and then combines them all into Date objects:

years <- 2010:2014
months <- rep(1, 5)
days <- 5:9
ISOdate(years, months, days)
#> [1] "2010-01-05 12:00:00 GMT" "2011-01-06 12:00:00 GMT"
#> [3] "2012-01-07 12:00:00 GMT" "2013-01-08 12:00:00 GMT"
#> [5] "2014-01-09 12:00:00 GMT"
as.Date(ISOdate(years, months, days))
#> [1] "2010-01-05" "2011-01-06" "2012-01-07" "2013-01-08" "2014-01-09"

Purists will note that the vector of months is redundant and that the last expression can therefore be further simplified by invoking the Recycling Rule:

as.Date(ISOdate(years, 1, days))
#> [1] "2010-01-05" "2011-01-06" "2012-01-07" "2013-01-08" "2014-01-09"

This recipe can also be extended to handle year, month, day, hour, minute, and second data by using the ISOdatetime function (see the help page for details):

ISOdatetime(year, month, day, hour, minute, second)
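
For example, this hedged sketch combines a date and a time. The result is a POSIXct date-time in your local time zone, so the printed value depends on where you run it:

ISOdatetime(2018, 12, 31, 23, 59, 59)   # New Year's Eve, one second before midnight, local time zone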

Getting the Julian Date

Problem

Given a Date object, you want to extract the Julian date—which is, in R, the number of days since January 1, 1970.

Solution

Either convert the Date object to an integer or use the julian function:

d <- as.Date("2019-03-15")
as.integer(d)
#> [1] 17970
jd <- julian(d)
jd
#> [1] 17970
#> attr(,"origin")
#> [1] "1970-01-01"
attr(jd, "origin")
#> [1] "1970-01-01"

Discussion

A Julian “date” is simply the number of days since a more-or-less arbitrary starting point. In the case of R, that starting point is January 1, 1970, the same starting point as Unix systems. So the Julian date for January 1, 1970 is zero, as shown here:

as.integer(as.Date("1970-01-01"))
#> [1] 0
as.integer(as.Date("1970-01-02"))
#> [1] 1
as.integer(as.Date("1970-01-03"))
#> [1] 2
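
The conversion also works in reverse. Given a day count, as.Date can recover the calendar date if you supply the origin, as in this small sketch that uses the Julian number from above:

as.Date(17970, origin = "1970-01-01")
#> [1] "2019-03-15"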

Extracting the Parts of a Date

Problem

Given a Date object, you want to extract a date part such as the day of the week, the day of the year, the calendar day, the calendar month, or the calendar year.

Solution

Convert the Date object to a POSIXlt object, which is a list of date parts. Then extract the desired part from that list:

d <- as.Date("2019-03-15")
p <- as.POSIXlt(d)
p$mday        # Day of the month
#> [1] 15
p$mon         # Month (0 = January)
#> [1] 2
p$year + 1900 # Year
#> [1] 2019

Discussion

The POSIXlt object represents a date as a list of date parts. Convert your Date object to POSIXlt by using the as.POSIXlt function, which will give you a list with these members:

sec

Seconds (0–61)

min

Minutes (0–59)

hour

Hours (0–23)

mday

Day of the month (1–31)

mon

Month (0–11)

year

Years since 1900

wday

Day of the week (0–6, 0 = Sunday)

yday

Day of the year (0–365)

isdst

Daylight saving time flag

Using these date parts, we can learn that April 2, 2020, is a Thursday (wday = 4) and the 93rd day of the year (because yday = 0 on January 1):

d <- as.Date("2020-04-02")
as.POSIXlt(d)$wday
#> [1] 4
as.POSIXlt(d)$yday
#> [1] 92

A common mistake is failing to add 1900 to the year, giving the impression you are living a long, long time ago:

as.POSIXlt(d)$year # Oops!
#> [1] 120
as.POSIXlt(d)$year + 1900
#> [1] 2020
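
If you want a readable label rather than a number, base R also provides the weekdays and months functions. The names they return depend on your locale; these results assume an English locale:

weekdays(d)   # d is as.Date("2020-04-02"), from above
#> [1] "Thursday"
months(d)
#> [1] "April"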

Creating a Sequence of Dates

Problem

You want to create a sequence of dates, such as a sequence of daily, monthly, or annual dates.

Solution

The seq function is a generic function that has a version for Date objects. It can create a Date sequence similarly to the way it creates a sequence of numbers.

Discussion

A typical use of seq specifies a starting date (from), ending date (to), and increment (by). An increment of 1 indicates daily dates:

s <- as.Date("2019-01-01")
e <- as.Date("2019-02-01")
seq(from = s, to = e, by = 1) # One month of dates
#>  [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#>  [6] "2019-01-06" "2019-01-07" "2019-01-08" "2019-01-09" "2019-01-10"
#> [11] "2019-01-11" "2019-01-12" "2019-01-13" "2019-01-14" "2019-01-15"
#> [16] "2019-01-16" "2019-01-17" "2019-01-18" "2019-01-19" "2019-01-20"
#> [21] "2019-01-21" "2019-01-22" "2019-01-23" "2019-01-24" "2019-01-25"
#> [26] "2019-01-26" "2019-01-27" "2019-01-28" "2019-01-29" "2019-01-30"
#> [31] "2019-01-31" "2019-02-01"

Another typical use specifies a starting date (from), increment (by), and number of dates (length.out):

seq(from = s, by = 1, length.out = 7) # One week of daily dates
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06" "2019-01-07"

The increment (by) is flexible and can be specified in days, weeks, months, or years:

seq(from = s, by = "month", length.out = 12)   # First of the month for one year
#>  [1] "2019-01-01" "2019-02-01" "2019-03-01" "2019-04-01" "2019-05-01"
#>  [6] "2019-06-01" "2019-07-01" "2019-08-01" "2019-09-01" "2019-10-01"
#> [11] "2019-11-01" "2019-12-01"
seq(from = s, by = "3 months", length.out = 4) # Quarterly dates for one year
#> [1] "2019-01-01" "2019-04-01" "2019-07-01" "2019-10-01"
seq(from = s, by = "year", length.out = 10)    # Year-start dates for one decade
#>  [1] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01" "2023-01-01"
#>  [6] "2024-01-01" "2025-01-01" "2026-01-01" "2027-01-01" "2028-01-01"

Be careful with by="month" near month-end. In this example, the end of February overflows into March, which is probably not what you wanted:

seq(as.Date("2019-01-29"), by = "month", len = 3)
#> [1] "2019-01-29" "2019-03-01" "2019-03-29"

Chapter 8. Probability

Introduction

Probability theory is the foundation of statistics, and R has plenty of machinery for working with probability, probability distributions, and random variables. The recipes in this chapter show you how to calculate probabilities from quantiles, calculate quantiles from probabilities, generate random variables drawn from distributions, plot distributions, and so forth.

Names of Distributions

R has an abbreviated name for every probability distribution. This name is used to identify the functions associated with the distribution. For example, the name of the Normal distribution is “norm”, which is the root of these function names:

Function Purpose

dnorm

Normal density

pnorm

Normal distribution function

qnorm

Normal quantile function

rnorm

Normal random variates

Table 8-1 describes some common discrete distributions, and Table 8-2 describes several common continuous distributions.

Table 8-1. Common Discrete Distributions
Discrete distribution R name Parameters

Binomial

binom

n = number of trials; p = probability of success for one trial

Geometric

geom

p = probability of success for one trial

Hypergeometric

hyper

m = number of white balls in urn; n = number of black balls in urn; k = number of balls drawn from urn

Negative binomial (NegBinomial)

nbinom

size = number of successful trials; either prob = probability of successful trial or mu = mean

Poisson

pois

lambda = mean

Table 8-2. Common Continuous Distributions
Continuous distribution R name Parameters

Beta

beta

shape1; shape2

Cauchy

cauchy

location; scale

Chi-squared (Chisquare)

chisq

df = degrees of freedom

Exponential

exp

rate

F

f

df1 and df2 = degrees of freedom

Gamma

gamma

shape; either rate or scale

Log-normal (Lognormal)

lnorm

meanlog = mean on logarithmic scale;

sdlog = standard deviation on logarithmic scale

Logistic

logis

location; scale

Normal

norm

mean; sd = standard deviation

Student’s t (TDist)

t

df = degrees of freedom

Uniform

unif

min = lower limit; max = upper limit

Weibull

weibull

shape; scale

Wilcoxon

wilcox

m = number of observations in first sample;

n = number of observations in second sample

Warning

All distribution-related functions require distributional parameters, such as size and prob for the binomial or prob for the geometric. The big “gotcha” is that the distributional parameters may not be what you expect. For example, I would expect the parameter of an exponential distribution to be β, the mean. The R convention, however, is for the exponential distribution to be defined by the rate = 1/β, so I often supply the wrong value. The moral is, study the help page before you use a function related to a distribution. Be sure you’ve got the parameters right.

Getting Help on Probability Distributions

To see the R functions related to a particular probability distribution, use the help command and the full name of the distribution. For example, this will show the functions related to the Normal distribution:

?Normal

Some distributions have names that don’t work well with the help command, such as “Student’s t”. They have special help names, as noted in Tables Table 8-1 and Table 8-2: NegBinomial, Chisquare, Lognormal, and TDist. Thus, to get help on the Student’s t distribution, use this:

?TDist

See Also

There are many other distributions implemented in downloadable packages; see the CRAN task view devoted to probability distributions. The SuppDists package, available on CRAN, includes ten supplemental distributions. The MASS package, a recommended package distributed with R, provides additional support for distributions, such as maximum-likelihood fitting for some common distributions as well as sampling from a multivariate Normal distribution.

Counting the Number of Combinations

Problem

You want to calculate the number of combinations of n items taken k at a time.

Solution

Use the choose function:

n <- 10
k <- 2
choose(n, k)
#> [1] 45

Discussion

A common problem in computing probabilities of discrete variables is counting combinations: the number of distinct subsets of size k that can be created from n items. The number is given by n!/(k!(n − k)!), but it’s much more convenient to use the choose function, especially as n and k grow larger:

choose(5, 3)   # How many ways can we select 3 items from 5 items?
#> [1] 10
choose(50, 3)  # How many ways can we select 3 items from 50 items?
#> [1] 19600
choose(50, 30) # How many ways can we select 30 items from 50 items?
#> [1] 4.71e+13

These numbers are also known as binomial coefficients.

See Also

This recipe merely counts the combinations; see “Generating Combinations” to actually generate them.

Generating Combinations

Problem

You want to generate all combinations of n items taken k at a time.

Solution

Use the combn function:

items <- 2:5
k <- 2
combn(items, k)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    2    2    2    3    3    4
#> [2,]    3    4    5    4    5    5

Discussion

We can use combn(1:5,3) to generate all combinations of the numbers 1 through 5 taken three at a time:

combn(1:5, 3)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]    1    1    1    1    1    1    2    2    2     3
#> [2,]    2    2    2    3    3    4    3    3    4     4
#> [3,]    3    4    5    4    5    5    4    5    5     5

The function is not restricted to numbers. We can generate combinations of strings, too. Here are all combinations of five treatments taken three at a time:

combn(c("T1", "T2", "T3", "T4", "T5"), 3)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "T1" "T1" "T1" "T1" "T1" "T1" "T2" "T2" "T2" "T3"
#> [2,] "T2" "T2" "T2" "T3" "T3" "T4" "T3" "T3" "T4" "T4"
#> [3,] "T3" "T4" "T5" "T4" "T5" "T5" "T4" "T5" "T5" "T5"
Warning

As the number of items, n, increases, the number of combinations can explode, especially if k is not near 1 or n.

See Also

See “Counting the Number of Combinations” to count the number of possible combinations before you generate a huge set.

Generating Random Numbers

Problem

You want to generate random numbers.

Solution

The simple case of generating a uniform random number between 0 and 1 is handled by the runif function. This example generates one uniform random number:

runif(1)
#> [1] 0.915
Note

If you are saying runif out loud (or even in your head), you should pronounce it “are unif” instead of “run if.” The term runif is a portmanteau of “random uniform,” so it should not sound like a flow control function.

R can generate random variates from other distributions as well. For a given distribution, the name of the random number generator is “r” prefixed to the distribution’s abbreviated name (e.g., rnorm for the Normal distribution’s random number generator). This example generates one random value from the standard normal distribution:

rnorm(1)
#> [1] 1.53

Discussion

Most programming languages have a wimpy random number generator that generates one random number, uniformly distributed between 0.0 and 1.0, and that’s all. Not R.

R can generate random numbers from many probability distributions other than the uniform distribution. The simple case of generating uniform random numbers between 0 and 1 is handled by the runif function:

runif(1)
#> [1] 0.83

The argument of runif is the number of random values to be generated. Generating a vector of 10 such values is as easy as generating one:

runif(10)
#>  [1] 0.642 0.519 0.737 0.135 0.657 0.705 0.458 0.719 0.935 0.255

There are random number generators for all built-in distributions. Simply prefix the distribution name with “r” and you have the name of the corresponding random number generator. Here are some common ones:

set.seed(42)
runif(1, min = -3, max = 3)      # One uniform variate between -3 and +3
#> [1] 2.49
rnorm(1)                         # One standard Normal variate
#> [1] 1.53
rnorm(1, mean = 100, sd = 15)    # One Normal variate, mean 100 and SD 15
#> [1] 114
rbinom(1, size = 10, prob = 0.5) # One binomial variate
#> [1] 5
rpois(1, lambda = 10)            # One Poisson variate
#> [1] 12
rexp(1, rate = 0.1)              # One exponential variate
#> [1] 3.14
rgamma(1, shape = 2, rate = 0.1) # One gamma variate
#> [1] 22.3

As with runif, the first argument is the number of random values to be generated. Subsequent arguments are the parameters of the distribution, such as mean and sd for the Normal distribution or size and prob for the binomial. See the function’s R help page for details.

The examples given so far use simple scalars for distributional parameters. Yet the parameters can also be vectors, in which case R will cycle through the vector while generating random values. The following example generates three normal random values drawn from distributions with means of −10, 0, and +10, respectively (all distributions have a standard deviation of 1.0):

rnorm(3, mean = c(-10, 0, +10), sd = 1)
#> [1] -9.420 -0.658 11.555

That is a powerful capability in such cases as hierarchical models, where the parameters are themselves random. The next example calculates 30 draws of a normal variate whose mean is itself randomly distributed and with hyperparameters of μ = 0 and σ = 0.2:

means <- rnorm(30, mean = 0, sd = 0.2)
rnorm(30, mean = means, sd = 1)
#>  [1] -0.5549 -2.9232 -1.2203  0.6962  0.1673 -1.0779 -0.3138 -3.3165
#>  [9]  1.5952  0.8184 -0.1251  0.3601 -0.8142  0.1050  2.1264  0.6943
#> [17] -2.7771  0.9026  0.0389  0.2280 -0.5599  0.9572  0.1972  0.2602
#> [25] -0.4423  1.9707  0.4553  0.0467  1.5229  0.3176

If you are generating many random values and the vector of parameters is too short, R will apply the Recycling Rule to the parameter vector.
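
Here is a small sketch of the Recycling Rule in action. With six draws and only two means, the means are recycled, so the draws alternate between the two distributions (the values themselves are random, so no output is shown):

rnorm(6, mean = c(0, 1000), sd = 1)   # draws 1, 3, 5 center near 0; draws 2, 4, 6 near 1000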

See Also

See the “Introduction” to this chapter.

Generating Reproducible Random Numbers

Problem

You want to generate a sequence of random numbers, but you want to reproduce the same sequence every time your program runs.

Solution

Before running your R code, call the set.seed function to initialize the random number generator to a known state:

set.seed(42) # Or use any other positive integer...

Discussion

After generating random numbers, you may often want to reproduce the same sequence of “random” numbers every time your program executes. That way, you get the same results from run to run. One of the authors (Paul) once supported a complicated Monte Carlo analysis of a huge portfolio of securities. The users complained about getting slightly different results each time the program ran. No kidding! The analysis was driven entirely by random numbers, so of course there was randomness in the output. The solution was to set the random number generator to a known state at the beginning of the program. That way, it would generate the same (quasi-)random numbers each time and thus yield consistent, reproducible results.

In R, the set.seed function sets the random number generator to a known state. The function takes one argument, an integer. Any positive integer will work, but you must use the same one in order to get the same initial state.

The function returns nothing. It works behind the scenes, initializing (or reinitializing) the random number generator. The key here is that using the same seed restarts the random number generator back at the same place:

set.seed(165)   # Initialize generator to known state
runif(10)       # Generate ten random numbers
#>  [1] 0.116 0.450 0.996 0.611 0.616 0.426 0.666 0.168 0.788 0.442

set.seed(165)   # Reinitialize to the same known state
runif(10)       # Generate the same ten "random" numbers
#>  [1] 0.116 0.450 0.996 0.611 0.616 0.426 0.666 0.168 0.788 0.442
Warning

When you set the seed value and freeze your sequence of random numbers, you are eliminating a source of randomness that may be critical to algorithms such as Monte Carlo simulations. Before you call set.seed in your application, ask yourself: Am I undercutting the value of my program or perhaps even damaging its logic?

See Also

See “Generating Random Numbers” for more about generating random numbers.

Generating a Random Sample

Problem

You want to sample a dataset randomly.

Solution

The sample function will randomly select n items from a set:

sample(set, n)

Discussion

Suppose your World Series data contains a vector of years when the Series was played. You can select 10 years at random using sample:

world_series <- read_csv("./data/world_series.csv")
sample(world_series$year, 10)
#>  [1] 2010 1961 1906 1992 1982 1948 1910 1973 1967 1931

The items are randomly selected, so running sample again (usually) produces a different result:

sample(world_series$year, 10)
#>  [1] 1941 1973 1921 1958 1979 1946 1932 1919 1971 1974

The sample function normally samples without replacement, meaning it will not select the same item twice. Some statistical procedures (especially the bootstrap) require sampling with replacement, which means that one item can appear multiple times in the sample. Specify replace=TRUE to sample with replacement.

It’s easy to implement a simple bootstrap using sampling with replacement. Suppose we have a vector, x, of 1,000 random numbers, drawn from a normal distribution with mean 4 and standard deviation 10.

set.seed(42)
x <- rnorm(1000, 4, 10)

This code fragment samples 1,000 times from x and calculates the median of each sample:

medians <- numeric(1000)   # empty vector of 1000 numbers
for (i in 1:1000) {
  medians[i] <- median(sample(x, replace = TRUE))
}

From the bootstrap estimates, we can estimate the confidence interval for the median:

ci <- quantile(medians, c(0.025, 0.975))
cat("95% confidence interval is (", ci, ")\n")
#> 95% confidence interval is ( 3.16 4.49 )

We know that x was created from a normal distribution with a mean of 4 and, hence, the sample median should be 4 also. (In a symmetrical distribution like the normal, the mean and the median are the same.) Our confidence interval easily contains that value.

See Also

See “Randomly Permuting a Vector” for randomly permuting a vector and Recipe X-X for more about bootstrapping. “Generating Reproducible Random Numbers” discusses setting seeds for quasi-random numbers.

Generating Random Sequences

Problem

You want to generate a random sequence, such as a series of simulated coin tosses or a simulated sequence of Bernoulli trials.

Solution

Use the sample function. Sample n draws from the set of possible values, and set replace=TRUE:

sample(set, n, replace = TRUE)

Discussion

The sample function randomly selects items from a set. It normally samples without replacement, which means that it will not select the same item twice and will return an error if you try to sample more items than exist in the set. With replace=TRUE, however, sample can select items over and over; this allows you to generate long, random sequences of items.

The following example generates a random sequence of 10 simulated flips of a coin:

sample(c("H", "T"), 10, replace = TRUE)
#>  [1] "H" "T" "H" "T" "T" "T" "H" "T" "T" "H"

The next example generates a sequence of 20 Bernoulli trials—random successes or failures. We use TRUE to signify a success:

sample(c(FALSE, TRUE), 20, replace = TRUE)
#>  [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
#> [12]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

By default, sample will choose equally among the set elements and so the probability of selecting either TRUE or FALSE is 0.5. With a Bernoulli trial, the probability p of success is not necessarily 0.5. You can bias the sample by using the prob argument of sample; this argument is a vector of probabilities, one for each set element. Suppose we want to generate 20 Bernoulli trials with a probability of success p = 0.8. We set the probability of FALSE to be 0.2 and the probability of TRUE to 0.8:

sample(c(FALSE, TRUE), 20, replace = TRUE, prob = c(0.2, 0.8))
#>  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [12]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

The resulting sequence is clearly biased toward TRUE. I chose this example because it’s a simple demonstration of a general technique. For the special case of a binary-valued sequence you can use rbinom, the random generator for binomial variates:

rbinom(10, 1, 0.8)
#>  [1] 1 0 1 1 1 1 1 0 1 1

Randomly Permuting a Vector

Problem

You want to generate a random permutation of a vector.

Solution

If v is your vector, then sample(v) returns a random permutation.

Discussion

We typically think of the sample function for sampling from large datasets. However, the default parameters enable you to create a random rearrangement of the dataset. The function call sample(v) is equivalent to:

sample(v, size = length(v), replace = FALSE)

which means “select all the elements of v in random order while using each element exactly once.” That is a random permutation. Here is a random permutation of 1, …, 10:

sample(1:10)
#>  [1]  7  3  6  1  5  2  4  8 10  9

See Also

See “Generating a Random Sample” for more about sample.

Calculating Probabilities for Discrete Distributions

Problem

You want to calculate either the simple or the cumulative probability associated with a discrete random variable.

Solution

For a simple probability, P(X = x), use the density function. All built-in probability distributions have a density function whose name is “d” prefixed to the distribution name. For example, dbinom for the binomial distribution.

For a cumulative probability, P(X ≤ x), use the distribution function. All built-in probability distributions have a distribution function whose name is “p” prefixed to the distribution name; thus, pbinom is the distribution function for the binomial distribution.

Discussion

Suppose we have a binomial random variable X over 10 trials, where each trial has a success probability of 1/2. Then we can calculate the probability of observing x = 7 by calling dbinom:

dbinom(7, size = 10, prob = 0.5)
#> [1] 0.117

That calculates a probability of about 0.117. R calls dbinom the density function. Some textbooks call it the probability mass function or the probability function. Calling it a density function keeps the terminology consistent between discrete and continuous distributions (“Calculating Probabilities for Continuous Distributions”).

The cumulative probability, P(X ≤ x), is given by the distribution function, which is sometimes called the cumulative probability function. The distribution function for the binomial distribution is pbinom. Here is the cumulative probability for x = 7 (i.e., P(X ≤ 7)):

pbinom(7, size = 10, prob = 0.5)
#> [1] 0.945

It appears the probability of observing X ≤ 7 is about 0.945.

The density functions and distribution functions for some common discrete distributions are shown in Table 8-3.

Table 8-3. Discrete Distributions
Distribution Density function: P(X = x) Distribution function: P(X ≤ x)

Binomial

dbinom(x, size, prob)

pbinom(x, size, prob)

Geometric

dgeom(x, prob)

pgeom(x, prob)

Poisson

dpois(x, lambda)

ppois(x, lambda)

The complement of the cumulative probability is the survival function, P(X > x). All of the distribution functions let you find this right-tail probability simply by specifying lower.tail=FALSE:

pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)
#> [1] 0.0547

Thus we see that the probability of observing X > 7 is about 0.055.

The interval probability, P(x1 < X ≤ x2), is the probability of observing X between the limits x1 and x2. It is calculated as the difference between two cumulative probabilities: P(X ≤ x2) − P(X ≤ x1). Here is P(3 < X ≤ 7) for our binomial variable:

pbinom(7, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)
#> [1] 0.773

R lets you specify multiple values of x for these functions and will return a vector of the corresponding probabilities. Here we calculate two cumulative probabilities, P(X ≤ 3) and P(X ≤ 7), in one call to pbinom:

pbinom(c(3, 7), size = 10, prob = 0.5)
#> [1] 0.172 0.945

This leads to a one-liner for calculating interval probabilities. The diff function calculates the difference between successive elements of a vector. We apply it to the output of pbinom to obtain the difference in cumulative probabilities—in other words, the interval probability:

diff(pbinom(c(3, 7), size = 10, prob = 0.5))
#> [1] 0.773

See Also

See this chapter’s “Introduction” for more about the built-in probability distributions.

Calculating Probabilities for Continuous Distributions

Problem

You want to calculate the distribution function (DF) or cumulative distribution function (CDF) for a continuous random variable.

Solution

Use the distribution function, which calculates P(X ≤ x). All built-in probability distributions have a distribution function whose name is “p” prefixed to the distribution’s abbreviated name; for instance, pnorm for the Normal distribution.

For example, what is the probability that a draw from a standard normal distribution is less than 0.8?

pnorm(q = .8, mean = 0, sd = 1)
#> [1] 0.788

Discussion

The R functions for probability distributions follow a consistent pattern, so the solution to this recipe is essentially identical to the solution for discrete random variables (“Calculating Probabilities for Discrete Distributions”). The significant difference is that continuous variables have no “probability” at a single point, P(X = x). Instead, they have a density at a point.

Given that consistency, the discussion of distribution functions in “Calculating Probabilities for Discrete Distributions” is applicable here, too. Table 8-4 gives the distribution functions for several continuous distributions.

Table 8-4. Continuous Distributions
Distribution Distribution function: P(X ≤ x)

Normal

pnorm(x, mean, sd)

Student’s t

pt(x, df)

Exponential

pexp(x, rate)

Gamma

pgamma(x, shape, rate)

Chi-squared (χ2)

pchisq(x, df)

We can use pnorm to calculate the probability that a man is shorter than 66 inches, assuming that men’s heights are normally distributed with a mean of 70 inches and a standard deviation of 3 inches. Mathematically speaking, we want P(X ≤ 66) given that X ~ N(70, 3):

pnorm(66, mean = 70, sd = 3)
#> [1] 0.0912

Likewise, we can use pexp to calculate the probability that an exponential variable with a mean of 40 could be less than 20:

pexp(20, rate = 1 / 40)
#> [1] 0.393

Just as for discrete probabilities, the functions for continuous probabilities use lower.tail=FALSE to specify the survival function, P(X > x). This call to pexp gives the probability that the same exponential variable could be greater than 50:

pexp(50, rate = 1 / 40, lower.tail = FALSE)
#> [1] 0.287

Also like discrete probabilities, the interval probability for a continuous variable, P(x1 < X < x2), is computed as the difference between two cumulative probabilities, P(X < x2) − P(X < x1). For the same exponential variable, here is P(20 < X < 50), the probability that it could fall between 20 and 50:

pexp(50, rate = 1 / 40) - pexp(20, rate = 1 / 40)
#> [1] 0.32

See Also

See this chapter’s “Introduction” for more about the built-in probability distributions.

Converting Probabilities to Quantiles

Problem

Given a probability p and a distribution, you want to determine the corresponding quantile for p: the value x such that P(X ≤ x) = p.

Solution

Every built-in distribution includes a quantile function that converts probabilities to quantiles. The function’s name is “q” prefixed to the distribution name; thus, for instance, qnorm is the quantile function for the Normal distribution.

The first argument of the quantile function is the probability. The remaining arguments are the distribution’s parameters, such as mean, shape, or rate:

qnorm(0.05, mean = 100, sd = 15)
#> [1] 75.3

Discussion

A common example of computing quantiles is when we compute the limits of a confidence interval. If we want to know the 95% confidence interval (α = 0.05) of a standard normal variable, then we need the quantiles with probabilities of α/2 = 0.025 and 1 − α/2 = 0.975:

qnorm(0.025)
#> [1] -1.96
qnorm(0.975)
#> [1] 1.96

In the true spirit of R, the first argument of the quantile functions can be a vector of probabilities, in which case we get a vector of quantiles. We can simplify this example into a one-liner:

qnorm(c(0.025, 0.975))
#> [1] -1.96  1.96

All the built-in probability distributions provide a quantile function. Table 8-5 shows the quantile functions for some common discrete distributions.

Table 8-5. Discrete Quantile Distributions
Distribution Quantile function

Binomial

qbinom(p, size, prob)

Geometric

qgeom(p, prob)

Poisson

qpois(p, lambda)

Table 8-6 shows the quantile functions for common continuous distributions.

Table 8-6. Continuous Quantile Distributions
Distribution Quantile function

Normal

qnorm(p, mean, sd)

Student’s t

qt(p, df)

Exponential

qexp(p, rate)

Gamma

qgamma(p, shape, rate=rate) or qgamma(p, shape, scale=scale)

Chi-squared (χ2)

qchisq(p, df)

See Also

Determining the quantiles of a data set is different from determining the quantiles of a distribution—see “Calculating Quantiles (and Quartiles) of a Dataset”.

Plotting a Density Function

Problem

You want to plot the density function of a probability distribution.

Solution

Define a vector x of points over the domain you want to plot. Apply the distribution’s density function to x, using one of the d_____ density functions such as dnorm for the normal or dlnorm for the lognormal, and then plot the result:

dens <- data.frame(x = x,
                   y = d_____(x))
ggplot(dens, aes(x, y)) + geom_line()

Here is a specific example that plots the standard normal distribution for the interval -3 to +3:

library(ggplot2)

x <- seq(-3, +3, 0.1)
dens <- data.frame(x = x, y = dnorm(x))

ggplot(dens, aes(x, y)) + geom_line()
Figure 8-1. Standard Normal

Figure 8-1 shows the smooth density function.

Discussion

All the built-in probability distributions include a density function. For a particular density, the function name is “d” prepended to the density name. The density function for the Normal distribution is dnorm, the density for the gamma distribution is dgamma, and so forth.

If the first argument of the density function is a vector, then the function calculates the density at each point and returns the vector of densities.

The following code creates a 2 × 2 plot of four densities:

x <- seq(from = 0, to = 6, length.out = 100) # Define the density domains
ylim <- c(0, 0.6)

# Make a data.frame with densities of several distributions
df <- rbind(
  data.frame(x = x, dist_name = "Uniform"    , y = dunif(x, min   = 2, max = 4)),
  data.frame(x = x, dist_name = "Normal"     , y = dnorm(x, mean  = 3, sd = 1)),
  data.frame(x = x, dist_name = "Exponential", y = dexp(x, rate  = 1 / 2)),
  data.frame(x = x, dist_name = "Gamma"      , y = dgamma(x, shape = 2, rate = 1)) )

# Make a line plot like before, but use facet_wrap to create the grid
ggplot(data = df, aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~dist_name)   # facet and wrap by the variable dist_name
Figure 8-2. Multiple Density Plots

Figure 8-2 shows four density plots. However, a raw density plot is rarely useful or interesting by itself, and we often shade a region of interest.

Figure 8-3. Standard Normal with Shading

Figure 8-3 is a normal distribution with shading from the 75th percentile to the 95th percentile.

We create the plot by first plotting the density and then creating a shaded region with the geom_ribbon function from ggplot2.

First, we create some data and draw the density curve shown in Figure 8-4:

x <- seq(from = -3, to = 3, length.out = 100)
df <- data.frame(x = x, y = dnorm(x, mean = 0, sd = 1))

p <- ggplot(df, aes(x, y)) +
  geom_line() +
  labs(
    title = "Standard Normal Distribution",
    y = "Density",
    x = "Quantile"
  )
p
Figure 8-4. Density Plot

Next, we define the region of interest by calculating the quantiles of the standard normal distribution with qnorm. Then we add a geom_ribbon layer that shades the subset of our data falling inside that region, producing the shaded plot shown in Figure 8-3:

q75 <- qnorm(0.75)   # 75th percentile of the standard normal
q95 <- qnorm(0.95)   # 95th percentile of the standard normal

p +
  geom_ribbon(
    data = subset(df, x > q75 & x < q95),
    aes(ymax = y),
    ymin = 0,
    fill = "blue",
    colour = NA,
    alpha = 0.5
  )

Chapter 9. General Statistics

Introduction

Any significant application of R includes statistics or models or graphics. This chapter addresses the statistics. Some recipes simply describe how to calculate a statistic, such as relative frequency. Most recipes involve statistical tests or confidence intervals. The statistical tests let you choose between two competing hypotheses; that paradigm is described next. Confidence intervals reflect the likely range of a population parameter and are calculated based on your data sample.

Null Hypotheses, Alternative Hypotheses, and p-Values

Many of the statistical tests in this chapter use a time-tested paradigm of statistical inference. In the paradigm, we have one or two data samples. We also have two competing hypotheses, either of which could reasonably be true.

One hypothesis, called the null hypothesis, is that nothing happened: the mean was unchanged; the treatment had no effect; you got the expected answer; the model did not improve; and so forth.

The other hypothesis, called the alternative hypothesis, is that something happened: the mean rose; the treatment improved the patients’ health; you got an unexpected answer; the model fit better; and so forth.

We want to determine which hypothesis is more likely in light of the data:

  1. To begin, we assume that the null hypothesis is true.

  2. We calculate a test statistic. It could be something simple, such as the mean of the sample, or it could be quite complex. The critical requirement is that we must know the statistic’s distribution. We might know the distribution of the sample mean, for example, by invoking the Central Limit Theorem.

  3. From the statistic and its distribution we can calculate a p-value, the probability of a test statistic value as extreme or more extreme than the one we observed, while assuming that the null hypothesis is true.

  4. If the p-value is too small, we have strong evidence against the null hypothesis. This is called rejecting the null hypothesis.

  5. If the p-value is not small then we have no such evidence. This is called failing to reject the null hypothesis.

There is one necessary decision here: When is a p-value “too small”?

Note

In this book, we follow the common convention that we reject the null hypothesis when p < 0.05 and fail to reject it when p > 0.05. In statistical terminology, we chose a significance level of α = 0.05 to define the border between strong evidence and insufficient evidence against the null hypothesis.

But the real answer is, “it depends”. Your chosen significance level depends on your problem domain. The conventional limit of p < 0.05 works for many problems. In our work, the data are especially noisy and so we are often satisfied with p < 0.10. For someone working in high-risk areas, p < 0.01 or p < 0.001 might be necessary.

In the recipes, we mention which tests include a p-value so that you can compare the p-value against your chosen significance level of α. We worded the recipes to help you interpret the comparison. Here is the wording from “Testing Categorical Variables for Independence”, a test for the independence of two factors:

Example 9-1.

Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent whereas a p-value exceeding 0.05 fails to provide any such evidence.

This is a compact way of saying:

  • The null hypothesis is that the variables are independent.

  • The alternative hypothesis is that the variables are not independent.

  • For α = 0.05, if p < 0.05 then we reject the null hypothesis, giving strong evidence that the variables are not independent; if p > 0.05, we fail to reject the null hypothesis.

  • You are free to choose your own α, of course, in which case your decision to reject or fail to reject might be different.

Remember, the recipe states the informal interpretation of the test results, not the rigorous mathematical interpretation. We use colloquial language in the hope that it will guide you toward a practical understanding and application of the test. If the precise semantics of hypothesis testing is critical for your work, we urge you to consult the reference cited under See Also or one of the other fine textbooks on mathematical statistics.

Confidence Intervals

Hypothesis testing is a well-understood mathematical procedure, but it can be frustrating. First, the semantics is tricky. The test does not reach a definite, useful conclusion. You might get strong evidence against the null hypothesis, but that’s all you’ll get. Second, it does not give you a number, only evidence.

If you want numbers then use confidence intervals, which bound the estimate of a population parameter at a given level of confidence. Recipes in this chapter can calculate confidence intervals for means, medians, and proportions of a population.

For example, “Forming a Confidence Interval for a Mean” calculates a 95% confidence interval for the population mean based on sample data. The interval is 97.16 < μ < 103.98; informally, we are 95% confident that the population mean, μ, lies between 97.16 and 103.98.

See Also

Statistical terminology and conventions can vary. This book generally follows the conventions of Mathematical Statistics with Applications, 6th ed., by Wackerly et al. (Duxbury Press). We recommend this book also for learning more about the statistical tests described in this chapter.

Summarizing Your Data

Problem

You want a basic statistical summary of your data.

Solution

The summary function gives some useful statistics for vectors, matrices, factors, and data frames:

summary(vec)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     0.0     0.5     1.0     1.6     1.9    33.0

Discussion

The Solution exhibits the summary of a vector. The 1st Qu. and 3rd Qu. are the first and third quartile, respectively. Having both the median and mean is useful because you can quickly detect skew. The Solution above, for example, shows a mean that is larger than the median; this indicates a possible skew to the right, as one would expect from a lognormal distribution.

The summary of a matrix works column by column. Here we see the summary of a matrix, mat, with three columns named Samp1, Samp2, and Samp3:

summary(mat)
#>      Samp1           Samp2            Samp3
#>  Min.   :  1.0   Min.   :-2.943   Min.   : 0.04
#>  1st Qu.: 25.8   1st Qu.:-0.774   1st Qu.: 0.39
#>  Median : 50.5   Median :-0.052   Median : 0.85
#>  Mean   : 50.5   Mean   :-0.067   Mean   : 1.60
#>  3rd Qu.: 75.2   3rd Qu.: 0.684   3rd Qu.: 2.12
#>  Max.   :100.0   Max.   : 2.150   Max.   :13.18

The summary of a factor gives counts:

summary(fac)
#> Maybe    No   Yes
#>    38    32    30

The summary of a character vector is pretty useless, just the vector length:

summary(char)
#>    Length     Class      Mode
#>       100 character character

The summary of a data frame incorporates all these features. It works column by column, giving an appropriate summary according to the column type. Numeric values receive a statistical summary and factors are counted (character strings are not summarized):

suburbs <- read_csv("./data/suburbs.txt")
summary(suburbs)
#>      city              county             state
#>  Length:17          Length:17          Length:17
#>  Class :character   Class :character   Class :character
#>  Mode  :character   Mode  :character   Mode  :character
#>
#>
#>
#>       pop
#>  Min.   :   5428
#>  1st Qu.:  72616
#>  Median :  83048
#>  Mean   : 249770
#>  3rd Qu.: 102746
#>  Max.   :2853114

The “summary” of a list is pretty funky: just the data type of each list member. Here is a summary of a list of vectors:

summary(vec_list)
#>   Length Class  Mode
#> x 100    -none- numeric
#> y 100    -none- numeric
#> z 100    -none- character

To summarize the data inside a list of vectors, map summary to each list element:

library(purrr)
map(vec_list, summary)
#> $x
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  -2.572  -0.686  -0.084  -0.043   0.660   2.413
#>
#> $y
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  -1.752  -0.589   0.045   0.079   0.769   2.293
#>
#> $z
#>    Length     Class      Mode
#>       100 character character

Unfortunately, the summary function does not compute any measure of variability, such as standard deviation or median absolute deviation. This is a serious shortcoming, so we usually call sd or mad right after calling summary.

See Also

See “Computing Basic Statistics”.

Calculating Relative Frequencies

Problem

You want to count the relative frequency of certain observations in your sample.

Solution

Identify the interesting observations by using a logical expression; then use the mean function to calculate the fraction of observations it identifies. For example, given a vector x, you can find the relative frequency of values greater than 3 in this way:

mean(x > 3)
#> [1] 0.12

Discussion

A logical expression, such as x > 3, produces a vector of logical values (TRUE and FALSE), one for each element of x. The mean function converts those values to 1s and 0s, respectively, and computes the average. This gives the fraction of values that are TRUE—in other words, the relative frequency of the interesting values. In the Solution, for example, that’s the relative frequency of values greater than 3.

The concept here is pretty simple. The tricky part is dreaming up a suitable logical expression. Here are some examples:

mean(lab == "NJ")

Fraction of lab values that are New Jersey

mean(after > before)

Fraction of observations for which the effect increases

mean(abs(x-mean(x)) > 2*sd(x))

Fraction of observations that exceed two standard deviations from the mean

mean(diff(ts) > 0)

Fraction of observations in a time series that are larger than the previous observation
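
As a runnable sketch of the two-standard-deviation example, we can simulate some purely illustrative data (the vector y below is hypothetical) and compute the fraction of observations lying more than two standard deviations from the mean. For roughly normal data the result should be near 0.05:

y <- rnorm(1000)                       # simulated data, for illustration only
mean(abs(y - mean(y)) > 2 * sd(y))     # fraction beyond two standard deviations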

Tabulating Factors and Creating Contingency Tables

Problem

You want to tabulate one factor or to build a contingency table from multiple factors.

Solution

The table function produces counts of one factor:

table(f1)
#> f1
#>  a  b  c  d  e
#> 14 23 24 21 18

It can also produce contingency tables (cross-tabulations) from two or more factors:

table(f1, f2)
#>    f2
#> f1   f  g  h
#>   a  6  4  4
#>   b  7  9  7
#>   c  4 11  9
#>   d  7  8  6
#>   e  5 10  3

table works for characters, too, not only factors:

t1 <- sample(letters[9:11], 100, replace = TRUE)
table(t1)
#> t1
#>  i  j  k
#> 20 40 40

Discussion

The table function counts the levels of one factor or characters, such as these counts of initial and outcome (which are factors):

set.seed(42)
initial <- factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))
outcome <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))

table(initial)
#> initial
#> Maybe    No   Yes
#>    39    31    30

table(outcome)
#> outcome
#> Fail Pass
#>   56   44

The greater power of table is in producing contingency tables, also known as cross-tabulations. Each cell in a contingency table counts how many times that row–column combination occurred:

table(initial, outcome)
#>        outcome
#> initial Fail Pass
#>   Maybe   23   16
#>   No      20   11
#>   Yes     13   17

This table shows that the combination of initial = Yes and outcome = Fail occurred 13 times, the combination of initial = Yes and outcome = Pass occurred 17 times, and so forth.

See Also

The xtabs function can also produce a contingency table. It has a formula interface, which some people prefer.
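
For example, using the initial and outcome factors from above, a quick sketch of the formula interface looks like this; the counts match the table output shown earlier:

xtabs(~ initial + outcome)
#>        outcome
#> initial Fail Pass
#>   Maybe   23   16
#>   No      20   11
#>   Yes     13   17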

Testing Categorical Variables for Independence

Problem

You have two categorical variables that are represented by factors. You want to test them for independence using the chi-squared test.

Solution

Use the table function to produce a contingency table from the two factors. Then use the summary function to perform a chi-squared test of the contingency table. In the example below we have two vectors of factor values which we created in the prior recipe:

summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#>  Chisq = 3, df = 2, p-value = 0.2

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the variables are likely not independent whereas a p-value exceeding 0.05 fails to provide any such evidence.

Discussion

This example performs a chi-squared test on the contingency table of “Tabulating Factors and Creating Contingency Tables” and yields a p-value of 0.2225:

summary(table(initial, outcome))
#> Number of cases in table: 100
#> Number of factors: 2
#> Test for independence of all factors:
#>  Chisq = 3, df = 2, p-value = 0.2

The large p-value indicates that the two factors, initial and outcome, are probably independent. Practically speaking, we conclude there is no connection between the variables. This makes sense as this example data was created by simply drawing random data using the sample function in the prior recipe.

See Also

The chisq.test function can also perform this test.
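
Here is a hedged sketch of the equivalent calls (output omitted). For a table larger than 2 × 2, chisq.test reports the same chi-squared statistic, degrees of freedom, and p-value as the summary approach:

chisq.test(initial, outcome)          # Pearson's chi-squared test on the two factors
chisq.test(table(initial, outcome))   # equivalently, on the contingency table itself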

Calculating Quantiles (and Quartiles) of a Dataset

Problem

Given a fraction f, you want to know the corresponding quantile of your data. That is, you seek the observation x such that the fraction of observations below x is f.

Solution

Use the quantile function. The second argument is the fraction, f:

quantile(vec, 0.95)
#>  95%
#> 1.43

For quartiles, simply omit the second argument altogether:

quantile(vec)
#>      0%     25%     50%     75%    100%
#> -2.0247 -0.5915 -0.0693  0.4618  2.7019

Discussion

Suppose vec contains 1,000 observations between 0 and 1. The quantile function can tell you which observation delimits the lower 5% of the data:

vec <- runif(1000)
quantile(vec, .05)
#>     5%
#> 0.0451

The quantile documentation refers to the second argument as a “probability”, which is natural when we think of probability as meaning relative frequency.

In true R style, the second argument can be a vector of probabilities; in this case, quantile returns a vector of corresponding quantiles, one for each probability:

quantile(vec, c(.05, .95))
#>     5%    95%
#> 0.0451 0.9363

That is a handy way to identify the middle 90% (in this case) of the observations.

If you omit the probabilities altogether then R assumes you want the probabilities 0, 0.25, 0.50, 0.75, and 1.0—in other words, the quartiles:

quantile(vec)
#>       0%      25%      50%      75%     100%
#> 0.000405 0.235529 0.479543 0.737619 0.999379

Amazingly, the quantile function implements nine (yes, nine) different algorithms for computing quantiles. Study the help page before assuming that the default algorithm is the best one for you.

Inverting a Quantile

Problem

Given an observation x from your data, you want to know its corresponding quantile. That is, you want to know what fraction of the data is less than x.

Solution

Assuming your data is in a vector vec, compare the data against the observation x and then use mean to compute the relative frequency of values less than x. In this example, x is 1.6:

mean(vec < 1.6)
#> [1] 0.948

Discussion

The expression vec < x compares every element of vec against x and returns a vector of logical values, where the nth logical value is TRUE if vec[n] < x. The mean function converts those logical values to 0 and 1: 0 for FALSE and 1 for TRUE. The average of all those 1s and 0s is the fraction of vec that is less than x, or the inverse quantile of x.

See Also

This is an application of the general approach described in “Calculating Relative Frequencies”.

Converting Data to Z-Scores

Problem

You have a dataset, and you want to calculate the corresponding z-scores for all data elements. (This is sometimes called normalizing the data.)

Solution

Use the scale function:

scale(x)
#>          [,1]
#>  [1,]  0.8701
#>  [2,] -0.7133
#>  [3,] -1.0503
#>  [4,]  0.5790
#>  [5,] -0.6324
#>  [6,]  0.0991
#>  [7,]  2.1495
#>  [8,]  0.2481
#>  [9,] -0.8155
#> [10,] -0.7341
#> attr(,"scaled:center")
#> [1] 2.42
#> attr(,"scaled:scale")
#> [1] 2.11

This works for vectors, matrices, and data frames. In the case of a vector, scale returns the vector of normalized values. In the case of matrices and data frames, scale normalizes each column independently and returns columns of normalized values in a matrix.

Discussion

You might also want to normalize a single value y relative to a dataset x. That can be done by using vectorized operations as follows:

(y - mean(x)) / sd(x)
#> [1] -0.633
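
As a quick sanity check (a small sketch), the scaled values have mean 0 and standard deviation 1:

z <- scale(x)
mean(z)   # effectively 0 (within floating-point rounding)
sd(z)     # effectively 1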

Testing the Mean of a Sample (t Test)

Problem

You have a sample from a population. Given this sample, you want to know if the mean of the population could reasonably be a particular value m.

Solution

Apply the t.test function to the sample x with the argument mu=m:

t.test(x, mu = m)

The output includes a p-value. Conventionally, if p < 0.05 then the population mean is unlikely to be m whereas p > 0.05 provides no such evidence.

If your sample size n is small, then the underlying population must be normally distributed in order to derive meaningful results from the t test. A good rule of thumb is that “small” means n < 30.

Discussion

The t test is a workhorse of statistics, and this is one of its basic uses: making inferences about a population mean from a sample. The following example simulates sampling from a normal population with mean μ = 100. It uses the t test to ask if the population mean could be 95, and t.test reports a p-value of 0.005055:

x <- rnorm(75, mean = 100, sd = 15)
t.test(x, mu = 95)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 3, df = 70, p-value = 0.005
#> alternative hypothesis: true mean is not equal to 95
#> 95 percent confidence interval:
#>   96.5 103.0
#> sample estimates:
#> mean of x
#>      99.7

The p-value is small and so it’s unlikely (based on the sample data) that 95 could be the mean of the population.

Informally, we could interpret the low p-value as follows. If the population mean were really 95, then the probability of observing our test statistic (t = 2.8898 or something more extreme) would be only 0.005055. That is very improbable, yet that is the value we observed. Hence we conclude that the null hypothesis is wrong and that the sample data does not support the claim that the population mean is 95.

In sharp contrast, testing for a mean of 100 gives a p-value of 0.8606:

t.test(x, mu = 100)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = -0.2, df = 70, p-value = 0.9
#> alternative hypothesis: true mean is not equal to 100
#> 95 percent confidence interval:
#>   96.5 103.0
#> sample estimates:
#> mean of x
#>      99.7

The large p-value indicates that the sample is consistent with assuming a population mean μ of 100. In statistical terms, the data does not provide evidence against the true mean being 100.

A common case is testing for a mean of zero. If you omit the mu argument, it defaults to zero.
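
For example, these two calls are equivalent (assuming x is a numeric sample):

t.test(x)           # tests the null hypothesis mu = 0 by default
t.test(x, mu = 0)   # the same test, with mu written out explicitly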

See Also

The t.test function is a many-splendored thing. See “Forming a Confidence Interval for a Mean” and “Comparing the Means of Two Samples” for other uses.

Forming a Confidence Interval for a Mean

Problem

You have a sample from a population. Given that sample, you want to determine a confidence interval for the population’s mean.

Solution

Apply the t.test function to your sample x:

t.test(x)

The output includes a confidence interval at the 95% confidence level. To see intervals at other levels, use the conf.level argument.

As in “Testing the Mean of a Sample (t Test)”, if your sample size n is small then the underlying population must be normally distributed for there to be a meaningful confidence interval. Again, a good rule of thumb is that “small” means n < 30.

Discussion

Applying the t.test function to a vector yields a lot of output. Buried in the output is a confidence interval:

t.test(x)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 50, df = 50, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>   94.2 101.5
#> sample estimates:
#> mean of x
#>      97.9

In this example, the confidence interval is approximately 94.16 < μ < 101.55, which is sometimes written simply as (94.16, 101.55).

We can raise the confidence level to 99% by setting conf.level=0.99:

t.test(x, conf.level = 0.99)
#>
#>  One Sample t-test
#>
#> data:  x
#> t = 50, df = 50, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 99 percent confidence interval:
#>   92.9 102.8
#> sample estimates:
#> mean of x
#>      97.9

That change widens the confidence interval to 92.93 < μ < 102.78.

Forming a Confidence Interval for a Median

Problem

You have a data sample, and you want to know the confidence interval for the median.

Solution

Use the wilcox.test function, setting conf.int=TRUE:

wilcox.test(x, conf.int = TRUE)

The output will contain a confidence interval for the median.

Discussion

The procedure for calculating the confidence interval of a mean is well-defined and widely known. The same is not true for the median, unfortunately. There are several procedures for calculating the median’s confidence interval. None of them is “the” procedure, but the Wilcoxon signed rank test is pretty standard.

The wilcox.test function implements that procedure. Buried in the output is the 95% confidence interval, which is approximately (-0.102, 0.646) in this case:

wilcox.test(x, conf.int = TRUE)
#>
#>  Wilcoxon signed rank test
#>
#> data:  x
#> V = 200, p-value = 0.1
#> alternative hypothesis: true location is not equal to 0
#> 95 percent confidence interval:
#>  -0.102  0.646
#> sample estimates:
#> (pseudo)median
#>          0.311

You can change the confidence level by setting the conf.level argument, such as conf.level=0.99.
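
For example, a 99% interval for the same sample:

wilcox.test(x, conf.int = TRUE, conf.level = 0.99)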

The output also includes something called the pseudomedian, which is defined on the help page. Don’t assume it equals the median; they are different:

median(x)
#> [1] 0.314

See Also

The bootstrap procedure is also useful for estimating the median’s confidence interval; see Recipes and Recipe X-X.

Testing a Sample Proportion

Problem

You have a sample of values from a population consisting of successes and failures. You believe the true proportion of successes is p, and you want to test that hypothesis using the sample data.

Solution

Use the prop.test function. Suppose the sample size is n and the sample contains x successes:

prop.test(x, n, p)

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the true proportion is unlikely to be p whereas a p-value exceeding 0.05 fails to provide such evidence.

Discussion

Suppose you encounter some loudmouthed fan of the Chicago Cubs early in the baseball season. The Cubs have played 20 games and won 11 of them, or 55% of their games. Based on that evidence, the fan is “very confident” that the Cubs will win more than half of their games this year. Should he be that confident?

The prop.test function can evaluate the fan’s logic. Here, the number of observations is n = 20, the number of successes is x = 11, and p is the true probability of winning a game. We want to know whether it is reasonable to conclude, based on the data, that p > 0.5. Normally, prop.test would perform a two-sided test of p ≠ 0.5, but we can check for p > 0.5 instead by setting alternative="greater":

prop.test(11, 20, 0.5, alternative = "greater")
#>
#>  1-sample proportions test with continuity correction
#>
#> data:  11 out of 20, null probability 0.5
#> X-squared = 0.05, df = 1, p-value = 0.4
#> alternative hypothesis: true p is greater than 0.5
#> 95 percent confidence interval:
#>  0.35 1.00
#> sample estimates:
#>    p
#> 0.55

The prop.test output shows a large p-value, 0.4115, so we cannot reject the null hypothesis; that is, we cannot reasonably conclude that p is greater than 1/2. The Cubs fan is being overly confident based on too little data. No surprise there.

Forming a Confidence Interval for a Proportion

Problem

You have a sample of values from a population consisting of successes and failures. Based on the sample data, you want to form a confidence interval for the population’s proportion of successes.

Solution

Use the prop.test function. Suppose the sample size is n and the sample contains x successes:

prop.test(x, n)

The function output includes the confidence interval for p.

Discussion

We subscribe to a stock market newsletter that is well written, but includes a section purporting to identify stocks that are likely to rise. It does this by looking for a certain pattern in the stock price. It recently reported, for example, that a certain stock was following the pattern. It also reported that the stock rose after six of the last nine times that pattern occurred. The writers concluded that the probability of the stock rising again was therefore 6/9 or 66.7%.

Using prop.test, we can obtain the confidence interval for the true proportion of times the stock rises after the pattern. Here, the number of observations is n = 9 and the number of successes is x = 6. The output shows a confidence interval of (0.309, 0.910) at the 95% confidence level:

prop.test(6, 9)
#> Warning in prop.test(6, 9): Chi-squared approximation may be incorrect
#>
#>  1-sample proportions test with continuity correction
#>
#> data:  6 out of 9, null probability 0.5
#> X-squared = 0.4, df = 1, p-value = 0.5
#> alternative hypothesis: true p is not equal to 0.5
#> 95 percent confidence interval:
#>  0.309 0.910
#> sample estimates:
#>     p
#> 0.667

The writers are pretty foolish to say the probability of rising is 66.7%. They could be leading their readers into a very bad bet.

By default, prop.test calculates a confidence interval at the 95% confidence level. Use the conf.level argument for other confidence levels:

prop.test(x, n, p, conf.level = 0.99)   # 99% confidence level

Testing for Normality

Problem

You want a statistical test to determine whether your data sample is from a normally distributed population.

Solution

Use the shapiro.test function:

shapiro.test(x)

The output includes a p-value. Conventionally, p < 0.05 indicates that the population is likely not normally distributed whereas p > 0.05 provides no such evidence.

Discussion

This example reports a p-value of .7765 for x:

shapiro.test(x)
#>
#>  Shapiro-Wilk normality test
#>
#> data:  x
#> W = 1, p-value = 0.05

The large p-value suggests the underlying population could be normally distributed. The next example reports a very small p-value for y, so it is unlikely that this sample came from a normal population:

shapiro.test(y)
#>
#>  Shapiro-Wilk normality test
#>
#> data:  y
#> W = 0.7, p-value = 9e-12

We have highlighted the Shapiro–Wilk test because it is a standard R function. You can also install the package nortest, which is dedicated entirely to tests for normality. This package includes:

  • Anderson–Darling test (ad.test)

  • Cramer–von Mises test (cvm.test)

  • Lilliefors test (lillie.test)

  • Pearson chi-squared test for the composite hypothesis of normality (pearson.test)

  • Shapiro–Francia test (sf.test)
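
Calling one of these tests looks much like calling shapiro.test. A minimal sketch, assuming the nortest package is already installed and x is a numeric vector:

library(nortest)
ad.test(x)   # Anderson-Darling test; reports a test statistic and p-value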

The problem with all these tests is their null hypothesis: they all assume that the population is normally distributed until proven otherwise. As a result, the population must be decidedly nonnormal before the test reports a small p-value and you can reject that null hypothesis. That makes the tests quite conservative, tending to err on the side of normality.

Instead of depending solely upon a statistical test, we suggest also using histograms (“Creating a Histogram”) and quantile-quantile plots (“Creating a Normal Quantile-Quantile (Q-Q) Plot”) to evaluate the normality of any data. Are the tails too fat? Is the peak too peaked? Your judgment is likely better than a single statistical test.

See Also

See “Installing Packages from CRAN” for how to install the nortest package.

Testing for Runs

Problem

Your data is a sequence of binary values: yes–no, 0–1, true–false, or other two-valued data. You want to know: Is the sequence random?

Solution

The tseries package contains the runs.test function, which checks a sequence for randomness. The sequence should be a factor with two levels:

library(tseries)
runs.test(as.factor(s))

The runs.test function reports a p-value. Conventionally, a p-value of less than 0.05 indicates that the sequence is likely not random whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

A run is a subsequence composed of identical values, such as all 1s or all 0s. A random sequence should be properly jumbled up, without too many runs. It shouldn’t contain too few runs, either. A sequence of perfectly alternating values (0, 1, 0, 1, 0, 1, …) consists entirely of one-element runs, the maximum possible number, but would you say that it’s random?

The runs.test function checks the number of runs in your sequence. If there are too many or too few, it reports a small p-value.

This first example generates a random sequence of 0s and 1s and then tests the sequence for runs. Not surprisingly, runs.test reports a large p-value, indicating the sequence is likely random:

s <- sample(c(0, 1), 100, replace = T)
runs.test(as.factor(s))
#>
#>  Runs Test
#>
#> data:  as.factor(s)
#> Standard Normal = 0.1, p-value = 0.9
#> alternative hypothesis: two.sided

This next sequence, however, consists of three runs and so the reported p-value is quite low:

s <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0)
runs.test(as.factor(s))
#>
#>  Runs Test
#>
#> data:  as.factor(s)
#> Standard Normal = -2, p-value = 0.02
#> alternative hypothesis: two.sided

See Also

See Recipes and .

Comparing the Means of Two Samples

Problem

You have one sample each from two populations. You want to know if the two populations could have the same mean.

Solution

Perform a t test by calling the t.test function:

t.test(x, y)

By default, t.test assumes that your data are not paired. If the observations are paired (i.e., if each xi is paired with one yi), then specify paired=TRUE:

t.test(x, y, paired = TRUE)

In either case, t.test will compute a p-value. Conventionally, if p < 0.05 then the means are likely different whereas p > 0.05 provides no such evidence:

  • If either sample size is small, then the populations must be normally distributed. Here, “small” means fewer than 20 data points.

  • If the two populations have the same variance, specify var.equal=TRUE to obtain a less conservative test.

Discussion

We often use the t test to get a quick sense of the difference between two population means. It requires that the samples be large enough (both samples have 20 or more observations) or that the underlying populations be normally distributed. We don’t take the “normally distributed” part too literally. Being bell-shaped and reasonably symmetrical should be good enough.

A key distinction here is whether or not your data contains paired observations, since the results may differ in the two cases. Suppose we want to know if coffee in the morning improves scores on SAT tests. We could run the experiment two ways:

  1. Randomly select one group of people. Give them the SAT test twice, once with morning coffee and once without morning coffee. For each person, we will have two SAT scores. These are paired observations.

  2. Randomly select two groups of people. One group has a cup of morning coffee and takes the SAT test. The other group just takes the test. We have a score for each person, but the scores are not paired in any way.

Statistically, these experiments are quite different. In experiment 1, there are two observations for each person (one with coffee and one without), and they are not statistically independent. In experiment 2, the data are independent.

If you have paired observations (experiment 1) and erroneously analyze them as unpaired observations (experiment 2), then you could get a result like this one, with a large p-value:

load("./data/sat.rdata")
t.test(x, y)
#>
#>  Welch Two Sample t-test
#>
#> data:  x and y
#> t = -1, df = 200, p-value = 0.3
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -46.4  16.2
#> sample estimates:
#> mean of x mean of y
#>      1054      1069

The large p-value forces you to conclude there is no difference between the groups. Contrast that result with the one that follows from analyzing the same data but correctly identifying it as paired:

t.test(x, y, paired = TRUE)
#>
#>  Paired t-test
#>
#> data:  x and y
#> t = -20, df = 100, p-value <2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -16.8 -13.5
#> sample estimates:
#> mean of the differences
#>                   -15.1

The p-value plummets below 2e-16, and we reach exactly the opposite conclusion.

See Also

If the populations are not normally distributed (bell-shaped) and either sample is small, consider using the Wilcoxon–Mann–Whitney test described in “Comparing the Locations of Two Samples Nonparametrically”.

Comparing the Locations of Two Samples Nonparametrically

Problem

You have samples from two populations. You don’t know the distribution of the populations, but you know they have similar shapes. You want to know: Is one population shifted to the left or right compared with the other?

Solution

You can use a nonparametric test, the Wilcoxon–Mann–Whitney test, which is implemented by the wilcox.test function. For paired observations (every xi is paired with yi), set paired=TRUE:

wilcox.test(x, y, paired = TRUE)

For unpaired observations, let paired default to FALSE:

wilcox.test(x, y)

The test output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the second population is likely shifted left or right with respect to the first population whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

When we stop making assumptions regarding the distributions of populations, we enter the world of nonparametric statistics. The Wilcoxon–Mann–Whitney test is nonparametric and so can be applied to more datasets than the t test, which requires that the data be normally distributed (for small samples). This test’s only assumption is that the two populations have the same shape.

In this recipe, we are asking: Is the second population shifted left or right with respect to the first? This is similar to asking whether the average of the second population is smaller or larger than the first. However, the Wilcoxon–Mann–Whitney test answers a different question: it tells us whether the central locations of the two populations are significantly different or, equivalently, whether their relative frequencies are different.

Suppose we randomly select a group of employees and ask each one to complete the same task under two different circumstances: under favorable conditions and under unfavorable conditions, such as a noisy environment. We measure their completion times under both conditions, so we have two measurements for each employee. We want to know if the two times are significantly different, but we can’t assume they are normally distributed.

The data are paired, so we must set paired=TRUE:

load(file = "./data/workers.rdata")
wilcox.test(fav, unfav, paired = TRUE)
#>
#>  Wilcoxon signed rank test
#>
#> data:  fav and unfav
#> V = 10, p-value = 1e-04
#> alternative hypothesis: true location shift is not equal to 0

The p-value is essentially zero. Statistically speaking, we reject the assumption that the completion times were equal. Practically speaking, it’s reasonable to conclude that the times were different.

In this example, setting paired=TRUE is critical. Treating the data as unpaired would be wrong because the observations are not independent; and this, in turn, would produce bogus results. Running the example with paired=FALSE produces a p-value of 0.1022, which leads to the wrong conclusion.
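
For comparison, the incorrect unpaired analysis mentioned above is simply the same call without paired = TRUE:

wilcox.test(fav, unfav)   # wrong here: treats the two sets of times as independent samples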

See Also

See “Comparing the Means of Two Samples” for the parametric test.

Testing a Correlation for Significance

Problem

You calculated the correlation between two variables, but you don’t know if the correlation is statistically significant.

Solution

The cor.test function can calculate both the p-value and the confidence interval of the correlation. If the variables came from normally distributed populations then use the default measure of correlation, which is the Pearson method:

cor.test(x, y)

For nonnormal populations, use the Spearman method instead:

cor.test(x, y, method = "spearman")

The function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant whereas p > 0.05 indicates it is not.

Discussion

In our experience, people often fail to check a correlation for significance. In fact, many people are unaware that a correlation can be insignificant. They jam their data into a computer, calculate the correlation, and blindly believe the result. However, they should ask themselves: Was there enough data? Is the magnitude of the correlation large enough? Fortunately, the cor.test function answers those questions.

Suppose we have two vectors, x and y, with values from normal populations. We might be very pleased that their correlation is greater than 0.75:

cor(x, y)
#> [1] 0.751

But that is naïve. If we run cor.test, it reports a relatively large p-value of 0.085:

cor.test(x, y)
#>
#>  Pearson's product-moment correlation
#>
#> data:  x and y
#> t = 2, df = 4, p-value = 0.09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.155  0.971
#> sample estimates:
#>   cor
#> 0.751

The p-value is above the conventional threshold of 0.05, so we conclude that the correlation is unlikely to be significant.

You can also check the correlation by using the confidence interval. In this example, the confidence interval is (−0.155, 0.970). The interval contains zero and so it is possible that the correlation is zero, in which case there would be no correlation. Again, you could not be confident that the reported correlation is significant.

The cor.test output also includes the point estimate reported by cor (at the bottom, labeled “sample estimates”), saving you the additional step of running cor.

By default, cor.test calculates the Pearson correlation, which assumes that the underlying populations are normally distributed. The Spearman method makes no such assumption because it is nonparametric. Use method="spearman" when working with nonnormal data.

See Also

See “Computing Basic Statistics” for calculating simple correlations.

Testing Groups for Equal Proportions

Problem

You have samples from two or more groups. The groups’ elements are binary-valued: either success or failure. You want to know if the groups have equal proportions of successes.

Solution

Use the prop.test function with two vector arguments:

ns <- c(48, 64)
nt <- c(100, 100)
prop.test(ns, nt)
#>
#>  2-sample test for equality of proportions with continuity
#>  correction
#>
#> data:  ns out of nt
#> X-squared = 5, df = 1, p-value = 0.03
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.3058 -0.0142
#> sample estimates:
#> prop 1 prop 2
#>   0.48   0.64

These are parallel vectors. The first vector, ns, gives the number of successes in each group. The second vector, nt, gives the size of the corresponding group (often called the number of trials).

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that it is likely the groups’ proportions are different whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

In “Testing a Sample Proportion” we tested a proportion based on one sample. Here, we have samples from several groups and want to compare the proportions in the underlying groups.

One of the authors recently taught statistics to 38 students and awarded a grade of A to 14 of them. A colleague taught the same class to 40 students and awarded an A to only 10. We wanted to know: Is the author fostering grade inflation by awarding significantly more A grades than the other teacher did?

We used prop.test. “Success” means awarding an A, so the vector of successes contains two elements: the number of A grades awarded by the author and the number awarded by the colleague:

successes <- c(14, 10)

The number of trials is the number of students in the corresponding class:

trials <- c(38, 40)

The prop.test output yields a p-value of 0.3749:

prop.test(successes, trials)
#>
#>  2-sample test for equality of proportions with continuity
#>  correction
#>
#> data:  successes out of trials
#> X-squared = 0.8, df = 1, p-value = 0.4
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -0.111  0.348
#> sample estimates:
#> prop 1 prop 2
#>  0.368  0.250

The relatively large p-value means that we cannot reject the null hypothesis: the evidence does not suggest any difference between the teachers’ grading.

Performing Pairwise Comparisons Between Group Means

Problem

You have several samples, and you want to perform a pairwise comparison between the sample means. That is, you want to compare the mean of every sample against the mean of every other sample.

Solution

Place all data into one vector and create a parallel factor to identify the groups. Use pairwise.t.test to perform the pairwise comparison of means:

pairwise.t.test(x, f)   # x is the data, f is the grouping factor

The output contains a table of p-values, one for each pair of groups. Conventionally, if p < 0.05 then the two groups likely have different means whereas p > 0.05 provides no such evidence.

Discussion

This is more complicated than “Comparing the Means of Two Samples”, where we compared the means of two samples. Here we have several samples and want to compare the mean of every sample against the mean of every other sample.

Statistically speaking, pairwise comparisons are tricky. It is not the same as simply performing a t test on every possible pair. The p-values must be adjusted, for otherwise you will get an overly optimistic result. The help pages for pairwise.t.test and p.adjust describe the adjustment algorithms available in R. Anyone doing serious pairwise comparisons is urged to review the help pages and consult a good textbook on the subject.
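
If you want a specific adjustment, pairwise.t.test accepts a p.adjust.method argument; a brief sketch (the default method is "holm"):

pairwise.t.test(x, f, p.adjust.method = "bonferroni")   # x is the data, f is the grouping factor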

Suppose we are using a larger sample of the data from “Combining Multiple Vectors into One Vector and a Factor”, where we combined data for freshmen, sophomores, and juniors into a data frame called comb. The data frame has two columns: the data in a column called values, and the grouping factor in a column called ind. We can use pairwise.t.test to perform pairwise comparisons between the groups:

pairwise.t.test(comb$values, comb$ind)
#>
#>  Pairwise comparisons using t tests with pooled SD
#>
#> data:  comb$values and comb$ind
#>
#>      fresh soph
#> soph 0.001 -
#> jrs  3e-04 0.592
#>
#> P value adjustment method: holm

Notice the table of p-values. The comparisons of juniors versus freshmen and of sophomores versus freshmen produced small p-values: 0.0011 and 0.0003, respectively. We can conclude there are significant differences between those groups. However, the comparison of sophomores versus juniors produced a (relatively) large p-value of 0.5922, so they are not significantly different.

See Also

See Recipes and .

Testing Two Samples for the Same Distribution

Problem

You have two samples, and you are wondering: Did they come from the same distribution?

Solution

The Kolmogorov–Smirnov test compares two samples and tests them for being drawn from the same distribution. The ks.test function implements that test:

ks.test(x, y)

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the two samples (x and y) were drawn from different distributions whereas a p-value exceeding 0.05 provides no such evidence.

Discussion

The Kolmogorov–Smirnov test is wonderful for two reasons. First, it is a nonparametric test and so you needn’t make any assumptions regarding the underlying distributions: it works for all distributions. Second, it checks the location, dispersion, and shape of the populations, based on the samples. If these characteristics disagree then the test will detect that, allowing us to conclude that the underlying distributions are different.

Suppose we suspect that the vectors x and y come from differing distributions. Here, ks.test reports a p-value of 0.03663:

ks.test(x, y)
#>
#>  Two-sample Kolmogorov-Smirnov test
#>
#> data:  x and y
#> D = 0.2, p-value = 0.04
#> alternative hypothesis: two-sided

From the small p-value we can conclude that the samples are from different distributions. However, when we test x against another sample, z, the p-value is much larger (0.5806); this suggests that x and z could have the same underlying distribution:

z <- rnorm(100, mean = 4, sd = 6)
ks.test(x, z)
#>
#>  Two-sample Kolmogorov-Smirnov test
#>
#> data:  x and z
#> D = 0.1, p-value = 0.6
#> alternative hypothesis: two-sided

Chapter 10. Graphics

Introduction

Graphics is a great strength of R. The graphics package is part of the standard distribution and contains many useful functions for creating a variety of graphic displays. The base functionality has been expanded and made easier with ggplot2, part of the tidyverse of packages. In this chapter we will focus on examples using ggplot2, and we will occasionally suggest other packages. In this chapter’s See Also sections we mention functions in other packages that do the same job in a different way. We suggest that you explore those alternatives if you are dissatisfied with what’s offered by ggplot2 or base graphics.

Graphics is a vast subject, and we can only scratch the surface here. Winston Chang’s R Graphics Cookbook, 2nd Edition is part of the O’Reilly Cookbook series and walks through many useful recipes with a focus on ggplot2. If you want to delve deeper, we recommend R Graphics by Paul Murrell (Chapman & Hall, 2006). That book discusses the paradigms behind R graphics, explains how to use the graphics functions, and contains numerous examples—including the code to recreate them. Some of the examples are pretty amazing.

The Illustrations

The graphs in this chapter are mostly plain and unadorned. We did that intentionally. When you call the ggplot function, as in:

library(tidyverse)
df <- data.frame(x = 1:5, y = 1:5)
ggplot(df, aes(x, y)) +
  geom_point()
Figure 10-1. Simple Plot

you get a plain, graphical representation of x and y as shown in Figure 10-1. You could adorn the graph with colors, a title, labels, a legend, text, and so forth, but then the call to ggplot becomes more and more crowded, obscuring the basic intention.

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    title = "Simple Plot Example",
    subtitle = "with a subtitle",
    x = "x values",
    y = "y values"
  ) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Figure 10-2. Complicated Plot

The resulting plot is shown in Figure 10-2. We want to keep the recipes clean, so we emphasize the basic plot and then show later (as in “Adding a Title and Labels”) how to add adornments.

Notes on ggplot2 basics

While the package is called ggplot2, the primary plotting function in the package is called ggplot. It is important to understand the basic pieces of a ggplot2 graph. In the examples above you can see that we pass data into ggplot and then define how the graph is created by stacking together small phrases that describe some aspect of the plot. This stacking together of phrases is part of the “grammar of graphics” ethos (that’s where the gg comes from). To learn more, you can read “A Layered Grammar of Graphics,” written by ggplot2 author Hadley Wickham (http://vita.had.co.nz/papers/layered-grammar.pdf). The grammar of graphics concept originated with Leland Wilkinson, who articulated the idea of building graphics up from a set of primitives (i.e., verbs and nouns). With ggplot, the underlying data need not be fundamentally reshaped for each type of graphical representation. In general, the data stays the same and the user changes the syntax slightly to illustrate the data differently. This is significantly more consistent than base graphics, which often require reshaping the data in order to change the way it is visualized.

As we talk about ggplot graphics it’s worth defining the things that make up a ggplot graph:

geometric object functions

These are geometric objects that describe the type of graph being created. These start with geom_ and examples include geom_line, geom_boxplot, and geom_point along with dozens more.

aesthetics

The aesthetics, or aesthetic mappings, communicate to ggplot which fields in the source data get mapped to which visual elements in the graphic. This is the aes() line in a ggplot call.

stats

Stats are statistical transformations that are done before displaying the data. Not all graphs will have stats, but a few common stats are stat_ecdf (the empirical cumulative distribution function) and stat_identity which tells ggplot to pass the data without doing any stats at all.

facet functions

Facets are subplots where each small plot represents a subgroup of the data. The faceting functions include facet_wrap and facet_grid.

themes

Themes are the visual elements of the plot that are not tied to data. These might include titles, margins, table of contents locations, or font choices.

layer

A layer is a combination of data, aesthetics, a geometric object, a stat, and other options to produce a visual layer in the ggplot graphic.

“Long” vs. “Wide” data with ggplot

One of the first confusions new ggplot users often face is that they are inclined to reshape their data to be “wide” before plotting it. “Wide” here means that every variable being plotted is its own column in the underlying data frame.

ggplot works most easily with “long” data, where additional variables are added as rows in the data frame rather than as columns. The great side effect of adding additional measurements as rows is that any properly constructed ggplot graph will automatically update to reflect the new data without changing the ggplot code. If each additional variable were added as a column, then the plotting code would have to be changed to introduce the new variables. This idea of “long” versus “wide” data will become more obvious in the examples in the rest of this chapter.
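
As a minimal sketch of the idea, here is a small made-up data frame reshaped from “wide” to “long” with tidyr’s pivot_longer (older tidyverse code uses gather for the same job):

library(tidyverse)

# "Wide": one column per measurement series
wide <- data.frame(day = 1:3,
                   series_a = c(1, 2, 3),
                   series_b = c(4, 5, 6))

# "Long": one row per observation, with a column identifying the series
long <- wide %>%
  pivot_longer(cols = c(series_a, series_b),
               names_to = "series",
               values_to = "value")

# One ggplot call now handles any number of series because they arrive as rows
ggplot(long, aes(day, value, color = series)) +
  geom_line()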

Graphics in Other Packages

R is highly programmable, and many people have extended its graphics machinery with additional features. Quite often, packages include specialized functions for plotting their results and objects. The zoo package, for example, implements a time series object. If you create a zoo object z and call plot(z), then the zoo package does the plotting; it creates a graphic that is customized for displaying a time series. Zoo uses base graphics so the resulting graph will not be a ggplot graphic.

There are even entire packages devoted to extending R with new graphics paradigms. The lattice package is an alternative to base graphics that predates ggplot2. It uses a powerful graphics paradigm that enables you to create informative graphics more easily. It was implemented by Deepayan Sarkar, who also wrote Lattice: Multivariate Data Visualization with R (Springer, 2008), which explains the package and how to use it. The lattice package is also described in R in a Nutshell (O’Reilly).

There are two chapters in Hadley Wickham’s excellent book R for Data Science which deal with graphics. The first, “Exploratory Data Analysis,” focuses on exploring data with ggplot2, while “Graphics for Communication” explores communicating to others with graphics. R for Data Science is available in a printed version from O’Reilly Media or online at http://r4ds.had.co.nz/graphics-for-communication.html.

Creating a Scatter Plot

Problem

You have paired observations: (x1, y1), (x2, y2), …, (xn, yn). You want to create a scatter plot of the pairs.

Solution

We can plot the data by calling ggplot, passing in the data frame, and invoking a geometric point function:

ggplot(df, aes(x, y)) +
  geom_point()

In this example, the data frame is called df and the x and y data are in fields named x and y which we pass to the aesthetic in the call aes(x, y).

Discussion

A scatter plot is a common first attack on a new dataset. It’s a quick way to see the relationship, if any, between x and y.

Plotting with ggplot requires telling ggplot what data frame to use, what type of graph to create, and which aesthetic mapping (aes) to use. The aes in this case defines which field from df goes into which axis on the plot. Then the command geom_point communicates that you want a point graph, as opposed to a line or other type of graphic.

We can use the built in mtcars dataset to illustrate plotting horsepower hp on the x axis and fuel economy mpg on the y:

ggplot(mtcars, aes(hp, mpg)) +
  geom_point()
Figure 10-3. Scatter Plot Example

The resulting plot is shown in Figure 10-3.

See Also

See “Adding a Title and Labels” for adding a title and labels; see “Adding (or Removing) a Grid” and “Adding (or Removing) a Legend” for adding a grid and a legend (respectively). See “Plotting All Variables Against All Other Variables” for plotting multiple variables.

Adding a Title and Labels

Problem

You want to add a title to your plot or add labels for the axes.

Solution

With ggplot we add a labs element, which controls the labels for the title and axes.

When calling labs in ggplot, we specify:

  • title: the desired title text

  • x: the x-axis label

  • y: the y-axis label

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(title = "The Title",
       x = "X-axis Label",
       y = "Y-axis Label")

Discussion

The graph created in “Creating a Scatter Plot” is quite plain. A title and better labels will make it more interesting and easier to interpret.

Note that in ggplot you build up the elements of the graph by connecting the parts with the plus sign +. So we add additional graphical elements by stringing together phrases. You can see this in the following code, which again uses the built in mtcars dataset and plots horsepower vs. fuel economy in a scatter plot, shown in Figure 10-4.

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(title = "Cars: Horsepower vs. Fuel Economy",
       x = "HP",
       y = "Economy (miles per gallon)")
Figure 10-4. Labeled Axis and Title

Adding (or Removing) a Grid

Problem

You want to change the background grid of your graphic.

Solution

With ggplot, background grids come as a default, as you have seen in other recipes. However, we can alter the background grid using the theme function or by applying a prepackaged theme to our graph.

We can use theme to alter the background panel of our graphic:

ggplot(df) +
  geom_point(aes(x, y)) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Figure 10-5. White background

Discussion

ggplot fills in the background with a grey grid by default. So you may find yourself wanting to remove that grid completely or change it to something else. Let’s create a ggplot graphic and then incrementally change the background style.

We can add or change aspects of our graphic by creating a ggplot object, then calling the object and using + to add to it. The background shading in a ggplot graphic is actually made up of three different graph elements:

panel.grid.major:

These are white by default and heavy

panel.grid.minor:

These are white by default and light

panel.background:

This is the background that is grey by default

You can see these elements if you look carefully at the background of Figure 10-4.

If we set the background to element_blank(), then the major and minor grids are still there, but they are white on white, so we can’t see them in the resulting plot:

g1 <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(title = "Cars: Horsepower vs. Fuel Economy",
       x = "HP",
       y = "Economy (miles per gallon)") +
  theme(panel.background = element_blank())
g1


Notice in the code above we put the ggplot graph into a variable called g1, then printed the graphic by just calling g1. Having the graph stored in g1 means we can later add additional graphical components without rebuilding the graph.

But if we wanted to show the background grid in some bright colors for illustration, it’s as easy as setting the grids to a color and a line type, as in the following example.

g2 <- g1 +
  theme(panel.grid.major =
          element_line(color = "red", linetype = 3)) +    # linetype = 3 is dotted
  theme(panel.grid.minor =
          element_line(color = "blue", linetype = 4))     # linetype = 4 is dotdash
g2


The result lacks visual appeal, but you can clearly see that the red lines make up the major grid and the blue lines are the minor grid.

Or we could do something less garish and take the ggplot object g1 from above and add grey gridlines to the white background, shown in Figure 10-6.

g1 +
  theme(panel.grid.major = element_line(colour = "grey"))
Figure 10-6. Grey Major Gridlines

Creating a Scatter Plot of Multiple Groups

Problem

You have data in a data frame with three observations per record: x, y, and a factor f that indicates the group. You want to create a scatter plot of x and y that distinguishes among the groups.

Solution

With ggplot we control the mapping of shapes to the factor f by passing shape = f to the aes.

ggplot(df, aes(x, y, shape = f)) +
  geom_point()

Discussion

Plotting multiple groups in one scatter plot creates an uninformative mess unless we distinguish one group from another. This distinction is done in ggplot by setting the shape parameter of the aes function.

The built in iris dataset contains paired measures of Petal.Length and Petal.Width. Each measurement also has a Species property indicating the species of the flower that was measured. If we plot all the data at once, we just get an undifferentiated scatter plot:

ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width)) +
  geom_point()


The graphic would be far more informative if we distinguished the points by species. In addition to distinguishing species by shape, we could also differentiate by color. We can add shape = Species and color = Species to our aes call to get each species plotted with a different shape and color, as shown in the resulting plot.

ggplot(data = iris,
       aes(
         x = Petal.Length,
         y = Petal.Width,
         shape = Species,
         color = Species
       )) +
  geom_point()


ggplot conveniently sets up a legend for you as well, which is handy.

See Also

See “Adding (or Removing) a Legend” to add a legend.

Adding (or Removing) a Legend

Problem

You want your plot to include a legend, the little box that decodes the graphic for the viewer.

Solution

In most cases ggplot will add the legends automatically, as you can see in the previous recipe. If you do not have explicit grouping in the aes then ggplot will not show a legend by default. If we want to force ggplot to show a legend we can set the shape or linetype of our graph to a constant. ggplot will then show a legend with one group. We then use guides to guide ggplot in how to label the legend.

This can be illustrated with our iris scatterplot:

g <- ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width,
           shape="Point Name")) +
  geom_point()  +
  guides(shape=guide_legend(title="Legend Title"))
g
Figure 10-7. Legend Added

Figure 10-7 illustrates the result of setting the shape to a string value then relabeling the legend using guides.

More commonly, you may want to turn legends off, which can be done by setting legend.position = "none" in the theme. We can use the iris plot from the prior recipe and add the theme call, as shown in Figure 10-8:

g <- ggplot(data = iris,
            aes(
              x = Petal.Length,
              y = Petal.Width,
              shape = Species,
              color = Species
            )) +
  geom_point() +
  theme(legend.position = "none")
g
Figure 10-8. Legend Removed

Discussion

Adding legends to ggplot when there is no grouping is an exercise in tricking ggplot into showing the legend by passing a string to a grouping parameter in aes. This will not change the grouping, since there is only one group, but it will result in a legend being shown with a name.

Then we can use guides to alter the legend title. It’s worth noting that we are not changing anything about the data, just exploiting settings in order to coerce ggplot into showing a legend when it typically would not.

One of the huge benefits of ggplot is its very good defaults. Getting positions and correspondence between labels and their point types is done automatically, but can be overridden if needed. To remove a legend totally, we set theme parameters with theme(legend.position = "none"). In addition to "none", you can set legend.position to "left", "right", "bottom", "top", or a two-element numeric vector. Use a two-element numeric vector to pass ggplot the specific coordinates where you want the legend. When using coordinate positions, the values passed are between 0 and 1 for the x and y positions, respectively.

An example of a legend positioned at the bottom is in Figure 10-9, created with this adjustment to legend.position:

g + theme(legend.position = "bottom")
Figure 10-9. Legend on the Bottom

Or we could use the two-element numeric vector to put the legend in a specific location as in Figure 10-10. The example puts the center of the legend at 80% to the right and 20% up from the bottom.

g + theme(legend.position = c(.8, .2))
Figure 10-10. Legend at a Point

In many aspects beyond legends, ggplot uses sane defaults while offering the flexibility to override them and tweak the details. More detail on ggplot options related to legends can be found in the help for theme (type ?theme) or in the ggplot online reference material.

Plotting the Regression Line of a Scatter Plot

Problem

You are plotting pairs of data points, and you want to add a line that illustrates their linear regression.

Solution

Using ggplot there is no need to calculate the linear model first using the R lm function. We can instead use the geom_smooth function to calculate the linear regression inside of our ggplot call.

If our data is in a data frame df and the x and y data are in columns x and y we plot the regression line like this:

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE)

The se = FALSE parameter tells ggplot not to plot the standard error bands around our regression line.

Discussion

Suppose we are modeling the strongx dataset found in the faraway package. We can create a linear model using the built in lm function in R to predict the variable crossx as a linear function of energy. First, let’s look at a simple scatter plot of our data:

library(faraway)
data(strongx)

ggplot(strongx, aes(energy, crossx)) +
  geom_point()
Figure 10-11. Strongx Scatter Plot

ggplot can calculate a linear model on the fly and then plot the regression line along with our data:

g <- ggplot(strongx, aes(energy, crossx)) +
  geom_point()

g + geom_smooth(method = "lm",
                formula = y ~ x,
                se = FALSE)

We can turn the confidence bands on by omitting the se = FALSE option, as shown in the resulting plot:

g + geom_smooth(method = "lm",
                formula = y ~ x)


Notice that in the geom_smooth we use x and y rather than the variable names. ggplot has set the x and y inside the plot based on the aesthetic. Multiple smoothing methods are supported by geom_smooth; you can explore those and other options in the help by typing ?geom_smooth.
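
For instance, swapping in a loess smoother is just a change of method; a quick sketch reusing the g object from above:

g + geom_smooth(method = "loess", formula = y ~ x)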

If we had a line we wanted to plot that was stored in another R object, we could use geom_abline to plot the line on our graph. In the following example we pull the intercept term and the slope from the regression model m and add those to our graph:

m <- lm(crossx ~ energy, data = strongx)

ggplot(strongx, aes(energy, crossx)) +
  geom_point() +
  geom_abline(
    intercept = m$coefficients[1],
    slope = m$coefficients[2]
  )

This produces a plot very similar to the earlier regression-line plot. The geom_abline method can be handy if you are plotting a line that comes from a source other than a simple linear model.

See Also

See the chapter on Linear Regression and ANOVA for more about linear regression and the lm function.

Plotting All Variables Against All Other Variables

Problem

Your dataset contains multiple numeric variables. You want to see scatter plots for all pairs of variables.

Solution

ggplot does not have a built in method to create pairs plots; however, the GGally package provides that functionality with the ggpairs function:

library(GGally)
ggpairs(df)

Discussion

When you have a large number of variables, finding interrelationships between them is difficult. One useful technique is looking at scatter plots of all pairs of variables. This would be quite tedious if coded pair-by-pair, but the ggpairs function from the package GGally provides an easy way to produce all those scatter plots at once.

The iris dataset contains four numeric variables and one categorical variable:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

What is the relationship, if any, between the columns? Plotting the columns with ggpairs produces multiple scatter plots.

library(GGally)
ggpairs(iris)
Figure 10-12. ggpairs Plot of Iris Data

The ggpairs function is pretty, but not particularly fast. If you’re just doing interactive work and want a quick peek at the data, the base R plot function provides faster output, as shown in Figure 10-13.

plot(iris)
Figure 10-13. Base plot() Pairs Plot

While the ggpairs function is not as fast to plot as the Base R plot function, it produces density graphs on the diagonal and reports correlation in the upper triangle of the graph. When factors or character columns are present, ggpairs produces histograms on the lower triangle of the graph and boxplots on the upper triangle. These are nice additions to understanding relationships in your data.
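
If you want to restrict ggpairs to particular columns, it accepts a columns argument. A minimal sketch that keeps only the four numeric columns of iris:

library(GGally)
ggpairs(iris, columns = 1:4)   # omit the Species factor column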

Creating One Scatter Plot for Each Factor Level

Problem

Your dataset contains (at least) two numeric variables and a factor. You want to create several scatter plots for the numeric variables, with one scatter plot for each level of the factor.

Solution

This kind of plot is called a conditioning plot, which is produced in ggplot by adding facet_wrap to our plot. In this example we use the data frame df which contains three columns: x, y, and f with f being a factor (or a character).

ggplot(df, aes(x, y)) +
  geom_point() +
  facet_wrap( ~ f)

Discussion

Conditioning plots (coplots) are another way to explore and illustrate the effect of a factor or to compare different groups to each other.

The Cars93 dataset contains 27 variables describing 93 car models as of 1993. Two numeric variables are MPG.city, the miles per gallon in the city, and Horsepower, the engine horsepower. One categorical variable is Origin, which can be USA or non-USA according to where the model was built.

Exploring the relationship between MPG and horsepower, we might ask: Is there a different relationship for USA models and non-USA models?

Let’s examine this as a facet plot:

data(Cars93, package = "MASS")
ggplot(data = Cars93, aes(MPG.city, Horsepower)) +
  geom_point() +
  facet_wrap( ~ Origin)
Figure 10-14. Cars Data with Facet

The resulting plot in Figure 10-14 reveals a few insights. If we really crave that 300-horsepower monster then we’ll have to buy a car built in the USA; but if we want high MPG, we have more choices among non-USA models. These insights could be teased out of a statistical analysis, but the visual presentation reveals them much more quickly.

Note that using facets results in subplots with the same x- and y-axis ranges. This helps ensure that visual inspection of the data is not misleading because of differing axis ranges.
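
If you do want each facet to use its own axis ranges, facet_wrap accepts a scales argument; a minimal sketch (use it with care, since it removes the visual comparability just described):

ggplot(data = Cars93, aes(MPG.city, Horsepower)) +
  geom_point() +
  facet_wrap( ~ Origin, scales = "free")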

See Also

The coplot function can produce very similar plots using only base graphics.

Creating a Bar Chart

Problem

You want to create a bar chart.

Solution

A common situation is to have a column of data that represents a group and then another column that represents a measure about that group. This format is “long” data because the data runs vertically instead of having a column for each group.

Using the geom_bar function in ggplot we can plot the heights as bars. If the data is already aggregated, we add stat = "identity" so that ggplot knows it needs to do no aggregation on the groups of values before plotting.

ggplot(data = df, aes(x, y)) +
  geom_bar(stat = "identity")

Discussion

Let’s use the cars made by Ford in the Cars93 data in an example:

ford_cars <- Cars93 %>%
  filter(Manufacturer == "Ford")

ggplot(ford_cars, aes(Model, Horsepower)) +
  geom_bar(stat = "identity")
Figure 10-15. Ford Cars Bar Chart

Figure 10-15 shows the resulting bar chart.

The example above uses stat = "identity", which assumes that the heights of your bars are conveniently stored as a value in one field, with only one record per group. That is not always the case, however. Often you have a vector of numeric data and a parallel factor or character field that groups the data, and you want to produce a bar chart of the group means or the group totals.

Let’s work up an example using the built-in airquality dataset, which contains daily temperature data for a single location over five months. The data frame has a numeric Temp column and Month and Day columns. If we want to plot the mean temp by month using ggplot, we don’t need to precompute the mean; instead, we can have ggplot do that in the plot command logic. To tell ggplot to calculate the mean, we pass stat = "summary", fun.y = "mean" to the geom_bar command. We can also turn the month numbers into month names using the built-in constant month.abb, which contains the abbreviations for the months.

ggplot(airquality, aes(month.abb[Month], Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-16. Bar Chart: Temp by Month

Figure 10-16 shows the resulting plot. But you might notice the sort order on the months is alphabetical, which is not how we typically like to see months sorted.

We can fix the sorting issue using a few functions from dplyr combined with fct_inorder from the forcats Tidyverse package. To get the months in the correct order we can sort the data frame by Month which is the month number, then we can apply fct_inorder which will arrange our factors in the order they appear in the data. You can see in Figure 10-17 that the bars are now sorted properly.

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-17. Bar Chart Properly Sorted

See Also

See “Adding Confidence Intervals to a Bar Chart” for adding confidence intervals and “Coloring a Bar Chart” for adding color.

?geom_bar for help with bar charts in ggplot

barplot for Base R bar charts or the barchart function in the lattice package.

Adding Confidence Intervals to a Bar Chart

Problem

You want to augment a bar chart with confidence intervals.

Solution

Suppose you have a data frame df with a group column of group names, a stat column of statistics, and lower and upper columns that give the corresponding confidence interval limits. We can display a bar chart of stat for each group, along with its confidence interval, using geom_bar combined with geom_errorbar.

ggplot(df, aes(group, stat)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .2)


The result is a bar chart in which the error bars show the confidence interval for each group.

Discussion

Most bar charts display point estimates, which are shown by the heights of the bars, but rarely do they include confidence intervals. Our inner statisticians dislike this intensely. The point estimate is only half of the story; the confidence interval gives the full story.

Fortunately, we can plot the error bars using ggplot. The hard part is calculating the intervals. In the Solution above, our data frame already contained the interval limits in its lower and upper columns (a simple −15% and +20% band around the statistic). However, in “Creating a Bar Chart”, we had ggplot calculate the group means as part of plotting. If we let ggplot do the calculations for us, we can use the built-in mean_se function along with stat_summary to get the standard errors of the mean measures.

Let’s use the airquality data we used previously. First we’ll do the sorted factor procedure (from the prior recipe) to get the month names in the desired order:

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

Now we can plot the bars along with the associated standard errors as in the following:

ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "cornflowerblue") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")

Sometimes you’ll want to sort the bars in your bar chart in descending order based on their height. This can be a little bit confusing when using summary stats in ggplot, but the secret is to use mean in the reorder statement to sort the factor by the mean of the temperature. Note that the reference to mean in reorder is not quoted, while the reference to mean in geom_bar is quoted:

ggplot(aq_data, aes(reorder(month_abb, -Temp, mean), Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "tomato") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
Figure 10-18. Mean Temp By Month Descending Order

You may look at this example and the result in Figure 10-18 and wonder, “Why didn’t they just use reorder(month_abb, Month) in the first example instead of that sorting business with forcats::fct_inorder to get the months in the right order?” Well, we could have. But sorting using fct_inorder is a design pattern that provides flexibility for more complicated things. Plus it’s quite easy to read in a script. Using reorder inside the aes is a bit more dense and hard to read later. But either approach is reasonable.
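
For reference, that more compact alternative would look something like this sketch, ordering the bars by month number inside aes:

ggplot(airquality, aes(reorder(month.abb[Month], Month), Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")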

See Also

See “Forming a Confidence Interval for a Mean” for more about t.test.

Coloring a Bar Chart

Problem

You want to color or shade the bars of a bar chart.

Solution

With ggplot we add fill = to our aes call and let ggplot pick the colors for us:

ggplot(df, aes(x, y, fill = group))
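
That call only sets up the color mapping; a geom is still needed to draw the bars. A minimal sketch, assuming a hypothetical data frame df with columns x, y, and group:

ggplot(df, aes(x, y, fill = group)) +
  geom_bar(stat = "identity")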

Discussion

In ggplot we can use the fill parameter in aes to tell ggplot which field to base the colors on. If we pass a numeric field to fill, we get a continuous gradient of colors; if we pass a factor or character field, we get contrasting colors for each group. Below we pass the factor of month names, month_abb, to the fill parameter:

aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))

ggplot(data = aq_data, aes(month_abb, Temp, fill = month_abb)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)") +
  scale_fill_brewer(palette = "Paired")
Figure 10-19. Colored Monthly Temp Bar Chart

The colors in the resulting Figure 10-19 are defined by the call to scale_fill_brewer(palette = "Paired"). The "Paired" color palette, along with many other palettes, comes from the RColorBrewer package.

If we want to change the color of each bar based on the temperature, we can’t just set fill = Temp, as might seem intuitive, because ggplot would not understand that we want the mean temperature after grouping by month. The way we get around this is to access a special field inside our graph called ..y.., which is the calculated value on the y axis. But we don’t want the legend labeled ..y.., so we add fill = "Temp" to our labs call in order to change the name of the legend. The resulting chart, with each bar shaded by its mean temperature, follows:

ggplot(airquality, aes(month.abb[Month], Temp, fill = ..y..)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")


If we want to reverse the color scale, we can just put a minus sign in front of the field we are filling by: fill = -..y.., for example.
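
For example, here is a sketch of the previous chart with the fill scale reversed; it uses the same airquality data as above:

ggplot(airquality, aes(month.abb[Month], Temp, fill = -..y..)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")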

See Also

See “Creating a Bar Chart” for creating a bar chart.

Plotting a Line from x and y Points

Problem

You have paired observations in a data frame: (x1, y1), (x2, y2), …, (xn, yn). You want to plot a series of line segments that connect the data points.

Solution

With ggplot we can use geom_point to plot the points:

ggplot(df, aes(x, y)) +
  geom_point()

Since ggplot graphics are built up element by element, we can have both points and a line in the same graphic very easily by using two geoms:

ggplot(df, aes(x , y)) +
  geom_point() +
  geom_line()

Discussion

To illustrate, let’s look at some example US economic data that comes with ggplot2. This example data frame has a column called date, which we’ll plot on the x axis, and a field unemploy, which is the number of unemployed people.

ggplot(economics, aes(date , unemploy)) +
  geom_point() +
  geom_line()
Figure 10-20. Line Chart Example

Figure 10-20 shows the resulting chart which contains both lines and points because we used both geoms.

Changing the Type, Width, or Color of a Line

Problem

You are plotting a line. You want to change the type, width, or color of the line.

Solution

ggplot uses the linetype parameter for controlling the appearance of lines:

  • linetype="solid" or linetype=1 (default)

  • linetype="dashed" or linetype=2

  • linetype="dotted" or linetype=3

  • linetype="dotdash" or linetype=4

  • linetype="longdash" or linetype=5

  • linetype="twodash" or linetype=6

  • linetype="blank" or linetype=0 (inhibits drawing)

You can change the line characteristics by passing linetype, col, and/or size as parameters to geom_line. For example, to change the line to dashed, red, and heavy, we pass all three parameters:

ggplot(df, aes(x, y)) +
  geom_line(linetype = 2,
            size = 2,
            col = "red")

Discussion

The example syntax above shows how to draw one line and specify its style, width, or color. A common scenario involves drawing multiple lines, each with its own style, width, or color.

Let’s set up some example data:

x <- 1:10
y1 <- x**1.5
y2 <- x**2
y3 <- x**2.5
df <- data.frame(x, y1, y2, y3)

In ggplot this can be a conundrum for many users. The challenge is that ggplot works best with “long” data instead of “wide” data, as mentioned in the introduction to this chapter. Our example data frame has four columns of wide data:

head(df, 3)
#>   x   y1 y2    y3
#> 1 1 1.00  1  1.00
#> 2 2 2.83  4  5.66
#> 3 3 5.20  9 15.59

We can make our wide data long by using the gather function from the core tidyverse package tidyr. In the example below, we use gather to create a new column named bucket that holds the old column names and a column y that holds their values, while keeping our x variable:

df_long <- gather(df, bucket, y, -x)
head(df_long, 3)
#>   x bucket    y
#> 1 1     y1 1.00
#> 2 2     y1 2.83
#> 3 3     y1 5.20
tail(df_long, 3)
#>     x bucket   y
#> 28  8     y3 181
#> 29  9     y3 243
#> 30 10     y3 316

Now we can pass bucket to the col parameter and get multiple lines, each a different color:

ggplot(df_long, aes(x, y, col = bucket)) +
  geom_line()

It’s straightforward to vary the line weight by passing a numeric variable to size:

ggplot(df, aes(x, y1, size = y2)) +
  geom_line() +
  scale_size(name = "Thickness based on y2")
Figure 10-21. Thickness as a Function of x

The result of varying the thickness with x is shown in Figure 10-21.

See Also

See “Plotting a Line from x and y Points” for plotting a basic line.

Plotting Multiple Datasets

Problem

You want to show multiple datasets in one plot.

Solution

We could combine the data into one data frame before plotting, using one of the join functions from dplyr. However, below we will create two separate data frames and then add each one to a ggplot graph.

First let’s set up our example data frames, df1 and df2:

# example data
n <- 20

x1 <- 1:n
y1 <- rnorm(n, 0, .5)
df1 <- data.frame(x1, y1)

x2 <- (.5 * n):((1.5 * n) - 1)
y2 <- rnorm(n, 1, .5)
df2 <- data.frame(x2, y2)

Typically we would pass a data frame directly into the ggplot function call. Since we want two geoms with two different data sources, we initiate a plot with ggplot() and then add two calls to geom_line, each with its own data source:

ggplot() +
  geom_line(data = df1, aes(x = x1, y = y1), color = "darkblue") +
  geom_line(data = df2, aes(x = x2, y = y2), linetype = "dashed")
Figure 10-22. Two Lines, One Plot

Discussion

ggplot allows us to make multiple calls to different geom_ functions, each with its own data source, if desired. ggplot then looks at all the data we are plotting and adjusts the plot ranges to accommodate all of it.

Even with good defaults, sometimes we want the plot to show a different range. We can do that by setting xlim and ylim in our ggplot:

ggplot() +
  geom_line(data = df1, aes(x = x1, y = y1), color = "darkblue") +
  geom_line(data = df2, aes(x = x2, y = y2), linetype = "dashed") +
  xlim(0, 35) +
  ylim(-2, 2)
Figure 10-23. Two Lines, Larger Limits

The graph with expanded limits is in Figure 10-23.

Adding Vertical or Horizontal Lines

Problem

You want to add a vertical or horizontal line to your plot, such as an axis through the origin or pointing out a threshold.

Solution

The ggplot functions geom_vline and geom_hline draw vertical and horizontal lines, respectively. The functions can also take color, linetype, and size parameters to set the line style:

# using the data.frame df1 from the prior recipe
ggplot(df1) +
  aes(x = x1, y = y1) +
  geom_point() +
  geom_vline(
    xintercept = 10,
    color = "red",
    linetype = "dashed",
    size = 1.5
  ) +
  geom_hline(yintercept = 0, color = "blue")
Figure 10-24. Vertical and Horizontal Lines

Figure 10-24 shows the resulting plot with added horizontal and vertical lines.

Discussion

A typical use of straight lines is drawing regularly spaced reference lines. Suppose we have a sample of points, samp. First, we plot them with a solid line through the mean. Then we calculate and draw dotted lines at ±1 and ±2 standard deviations away from the mean. We can add the lines to our plot with geom_hline:

samp <- rnorm(1000)
samp_df <- data.frame(samp, x = 1:length(samp))

mean_line <- mean(samp_df$samp)
sd_lines <- mean_line + c(-2, -1, +1, +2) * sd(samp_df$samp)

ggplot(samp_df) +
  aes(x = x, y = samp) +
  geom_point() +
  geom_hline(yintercept = mean_line, color = "darkblue") +
  geom_hline(yintercept = sd_lines, linetype = "dotted")
Figure 10-25. Mean and SD Bands in a Plot

Figure 10-25 shows the sampled data along with the mean and standard deviation lines.

See Also

See “Changing the Type, Width, or Color of a Line” for more about changing line types.

Creating a Box Plot

Problem

You want to create a box plot of your data.

Solution

Use geom_boxplot from ggplot to add a box plot geom to a ggplot graphic. Using the samp_df data frame from the prior recipe, we can create a box plot of the values in the samp column. The resulting graph is shown in Figure 10-26.

ggplot(samp_df) +
  aes(y = samp) +
  geom_boxplot()
Figure 10-26. Single Boxplot

Discussion

A box plot provides a quick and easy visual summary of a dataset.

  • The thick line in the middle is the median.

  • The box surrounding the median identifies the first and third quartiles; the bottom of the box is Q1, and the top is Q3.

  • The “whiskers” above and below the box show the range of the data, excluding outliers.

  • The circles identify outliers. By default, an outlier is defined as any value that is farther than 1.5 × IQR away from the box. (IQR is the interquartile range, or Q3 − Q1.) In this example, there are a few outliers on the high side.

We can rotate the box plot by flipping the coordinates. In some situations this makes a more appealing graphic, as shown in Figure 10-27:

ggplot(samp_df) +
  aes(y = samp) +
  geom_boxplot() +
  coord_flip()
Figure 10-27. Single Boxplot, Rotated

See Also

One box plot alone is pretty boring. See “Creating One Box Plot for Each Factor Level” for creating multiple box plots.

Creating One Box Plot for Each Factor Level

Problem

Your dataset contains a numeric variable and a factor (or other categorical variable). You want to create several box plots of the numeric variable, one for each level of the categorical variable.

Solution

With ggplot we pass the name of the categorical variable to the x parameter in the aes call. The resulting boxplot will then be grouped by the values in the categorical variable:

ggplot(df) +
  aes(x = factor, y = values) +
  geom_boxplot()

Discussion

This recipe is another great way to explore and illustrate the relationship between two variables. In this case, we want to know whether the numeric variable changes according to the level of a category.

The UScereal dataset from the MASS package contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is the shelf position (counting from the floor). Cereal manufacturers can negotiate for shelf position, placing their product for the best sales potential. We wonder: Where do they put the high-sugar cereals? We can produce Figure 10-28 and explore that question by creating one box plot per shelf:

data(UScereal, package = "MASS")

ggplot(UScereal) +
  aes(x = as.factor(shelf), y = sugars) +
  geom_boxplot() +
  labs(
    title = "Sugar Content by Shelf",
    x = "Shelf",
    y = "Sugar (grams per portion)"
  )
Figure 10-28. Boxplot by Shelf Number

The box plots suggest that shelf #2 has the most high-sugar cereals. Could it be that this shelf is at eye level for young children, who can influence their parents' choice of cereals?

Note that in the aes call we had to tell ggplot to treat the shelf number as a factor. Otherwise, ggplot would not treat shelf as a grouping variable and would print only a single box plot.

See Also

See “Creating a Box Plot” for creating a basic box plot.

Creating a Histogram

Problem

You want to create a histogram of your data.

Solution

Use geom_histogram, and set x to a vector of numeric values.
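
For instance, a minimal sketch assuming a hypothetical data frame df with a numeric column x:

ggplot(df) +
  geom_histogram(aes(x = x))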

Discussion

Figure 10-29 is a histogram of the MPG.city column taken from the Cars93 dataset:

data(Cars93, package = "MASS")

ggplot(Cars93) +
  geom_histogram(aes(x = MPG.city))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 10-29. Histogram of Counts by MPG

The geom_histogram function must decide how many cells (bins) to create for binning the data. In this example, the default algorithm chose 30 bins. If we wanted fewer bins, we would include the bins parameter to tell geom_histogram how many bins we want:

ggplot(Cars93) +
  geom_histogram(aes(x = MPG.city), bins = 13)
Figure 10-30. Histogram of Counts by MPG with Fewer Bins

Figure 10-30 shows the histogram with 13 bins.

See Also

The Base R function hist provides much of the same functionality as does the histogram function of the lattice package.

Adding a Density Estimate to a Histogram

Problem

You have a histogram of your data sample, and you want to add a curve to illustrate the apparent density.

Solution

Use the geom_density function to approximate the sample density as shown in Figure 10-31:

ggplot(Cars93) +
  aes(x = MPG.city) +
  geom_histogram(aes(y = ..density..), bins = 21) +
  geom_density()
Figure 10-31. Histogram with Density Plot

Discussion

A histogram suggests the density function of your data, but it is rough. A smoother estimate could help you better visualize the underlying distribution. A kernel density estimate (KDE) is a smoother representation of univariate data.

In ggplot we tell geom_histogram to plot density rather than counts by passing it aes(y = ..density..).

The following example takes a sample from a gamma distribution and then plots the histogram and the estimated density as shown in Figure 10-32.

samp <- rgamma(500, 2, 2)

ggplot() +
  aes(x = samp) +
  geom_histogram(aes(y = ..density..), bins = 10) +
  geom_density()
Figure 10-32. Histogram and Density: Gamma Distribution

See Also

The geom_density function approximates the shape of the density nonparametrically. If you know the actual underlying distribution, see “Plotting a Density Function” to plot the density function itself instead.

Creating a Normal Quantile-Quantile (Q-Q) Plot

Problem

You want to create a quantile-quantile (Q-Q) plot of your data, typically because you want to know how the data differs from a normal distribution.

Solution

With ggplot we can use the stat_qq and stat_qq_line functions to create a Q-Q plot that shows both the observed points and the Q-Q line. Figure 10-33 shows the resulting plot.

df <- data.frame(x = rnorm(100))

ggplot(df, aes(sample = x)) +
  stat_qq() +
  stat_qq_line()
Figure 10-33. Q-Q Plot

Discussion

Sometimes it’s important to know if your data is normally distributed. A quantile-quantile (Q-Q) plot is a good first check.

The Cars93 dataset contains a Price column. Is it normally distributed? This code snippet creates a Q-Q plot of Price shown in Figure 10-34:

ggplot(Cars93, aes(sample = Price)) +
  stat_qq() +
  stat_qq_line()
Figure 10-34. Q-Q Plot of Car Prices

If the data had a perfect normal distribution, then the points would fall exactly on the diagonal line. Many points are close, especially in the middle section, but the points in the tails are pretty far off. Too many points are above the line, indicating that the distribution has a long upper tail: most prices are moderate, while a few expensive cars stretch the distribution to the right.

That skew might be cured by a logarithmic transformation. We can plot log(Price), which yields Figure 10-35:

ggplot(Cars93, aes(sample = log(Price))) +
  stat_qq() +
  stat_qq_line()
Figure 10-35. Q-Q Plot of Log Car Prices

Notice that the points in the new plot are much better behaved, staying close to the line except in the extreme left tail. It appears that log(Price) is approximately Normal.

See Also

See “Creating Other Quantile-Quantile Plots” for creating Q-Q plots for other distributions. See Recipe X-X for an application of Normal Q-Q plots to diagnosing linear regression.

Creating Other Quantile-Quantile Plots

Problem

You want to view a quantile-quantile plot for your data, but the data is not normally distributed.

Solution

For this recipe, you must have some idea of the underlying distribution, of course. Conceptually, the solution is built from the following steps (a base R sketch follows the list):

  • Use the ppoints function to generate a sequence of points between 0 and 1.

  • Transform those points into quantiles, using the quantile function for the assumed distribution.

  • Sort your sample data.

  • Plot the sorted data against the computed quantiles.

  • Use abline to plot the diagonal line.
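
Here is a base R sketch of those steps. It assumes, for illustration, that your data y comes from a Student's t distribution with 5 degrees of freedom:

y <- rt(100, 5)               # hypothetical sample data
probs <- ppoints(length(y))   # evenly spaced probabilities in (0, 1)
quants <- qt(probs, df = 5)   # quantiles of the assumed distribution
plot(quants, sort(y))         # sorted data against theoretical quantiles
abline(a = 0, b = 1)          # diagonal reference line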

With ggplot, we don’t have to carry out those steps by hand; instead, we estimate the parameters of the assumed distribution and pass them to geom_qq and stat_qq_line. Here is an example that assumes your data, y, has a Student’s t distribution with 5 degrees of freedom. Recall that the quantile function for Student’s t is qt and that its second argument is the degrees of freedom.

First let’s make some example data:

df_t <- data.frame(y = rt(100, 5))

To create the Q-Q plot, we need to estimate the parameters of the distribution we are plotting against. Since this is a Student’s t distribution, we only need to estimate one parameter, the degrees of freedom. Of course we know the actual degrees of freedom is 5, but in most situations we’ll need to estimate that value. So we’ll use the MASS::fitdistr function to estimate the degrees of freedom:

est_df <- as.list(MASS::fitdistr(df_t$y, "t")$estimate)[["df"]]
#> Warning in log(s): NaNs produced

#> Warning in log(s): NaNs produced

#> Warning in log(s): NaNs produced
est_df
#> [1] 19.5

Now we can pass the estimated degrees of freedom to the Q-Q functions and create Figure 10-36:

ggplot(df_t) +
  aes(sample = y) +
  geom_qq(distribution = qt, dparams = est_df) +
  stat_qq_line(distribution = qt, dparams = est_df)
Figure 10-36. Student’s t Distribution Q-Q Plot

Discussion

The solution looks complicated, but the gist of it is picking a distribution, fitting the parameters, and then passing those parameters to the Q-Q functions in ggplot.

We can illustrate this recipe by taking a random sample from an exponential distribution with a mean of 10 (or, equivalently, a rate of 1/10):

rate <- 1 / 10
n <- 1000
df_exp <- data.frame(y = rexp(n, rate = rate))
est_exp <- as.list(MASS::fitdistr(df_exp$y, "exponential")$estimate)[["rate"]]
est_exp
#> [1] 0.101

Notice that for an exponential distribution the parameter we estimate is called rate, as opposed to df, which was the parameter in the t distribution:

ggplot(df_exp) +
  aes(sample = y) +
  geom_qq(distribution = qexp, dparams = est_exp) +
  stat_qq_line(distribution = qexp, dparams = est_exp)
Figure 10-37. Exponential Distribution Q-Q Plot

The quantile function for the exponential distribution is qexp, which takes the rate argument. Figure 10-37 shows the resulting Q-Q plot using a theoretical exponential distribution.

Plotting a Variable in Multiple Colors

Problem

You want to plot your data in multiple colors, typically to make the plot more informative, readable, or interesting.

Solution

We can pass a color to a geom_ function in order to produce colored output:

df <- data.frame(x = rnorm(200), y = rnorm(200))

ggplot(df) +
  aes(x = x, y = y) +
  geom_point(color = "blue")
Figure 10-38. Point Data in Color

The value of color can be:

  • One color, in which case all data points are that color.

  • A vector of colors, the same length as x, in which case each value of x is colored with its corresponding color.

  • A short vector, in which case the vector of colors is recycled.

Discussion

The default color in ggplot is black. While it’s not very exciting, black is high contrast and easy for most anyone to see.

However, it is much more useful (and interesting) to vary the color in a way that illuminates the data. Let’s illustrate this by plotting a graphic two ways, once in black and white and once with simple shading.

This produces the basic black-and-white graphic in Figure 10-39:

df <- data.frame(
  x = 1:100,
  y = rnorm(100)
)

ggplot(df) +
  aes(x, y) +
  geom_point()
Figure 10-39. Simple Point Plot

Now we can make it more interesting by creating a vector of "gray" and "black" values according to the sign of y, and then plotting the points using those colors, as shown in Figure 10-40:

shade <- if_else(df$y >= 0, "black", "gray")

ggplot(df) +
  aes(x, y) +
  geom_point(color = shade)
Figure 10-40. Color Shaded Point Plot

The negative values are now plotted in gray because the corresponding element of shade is "gray".

See Also

See “Understanding the Recycling Rule” regarding the Recycling Rule. Execute colors() to see a list of available colors, and use geom_segment in ggplot to plot line segments in multiple colors.

Graphing a Function

Problem

You want to graph the value of a function.

Solution

The ggplot function stat_function will graph a function across a range. In Figure 10-41 we plot a sine wave across the range -3 to 3.

ggplot(data.frame(x = c(-3, 3))) +
  aes(x) +
  stat_function(fun = sin)
Figure 10-41. Sine Wave Plot

Discussion

It’s pretty common to want to plot a statistical function, such as a normal density, across a given range. stat_function in ggplot allows us to do this. We need only supply a data frame with the x limits, and stat_function will calculate the y values and plot the results:

ggplot(data.frame(x = c(-3.5, 3.5))) +
  aes(x) +
  stat_function(fun = dnorm) +
  ggtitle("Std. Normal Density")

Notice that in the chart above we use ggtitle to set the title. When setting multiple text elements in a ggplot, we use labs; but when adding only a title, ggtitle is more concise than labs(title = 'Std. Normal Density'), although they accomplish the same thing. See ?labs for more discussion of labels with ggplot.

stat_function can graph any function that takes one argument and returns one value. Let’s create a function and then plot it. Our function is a damped sine wave, that is, a sine wave that loses amplitude as it moves away from 0:

f <- function(x) exp(-abs(x)) * sin(2 * pi * x)
ggplot(data.frame(x = c(-3.5, 3.5))) +
  aes(x) +
  stat_function(fun = f) +
  ggtitle("Dampened Sine Wave")

See Also

See Recipe X-X for how to define a function.

Pausing Between Plots

Problem

You are creating several plots, and each plot is overwriting the previous one. You want R to pause between plots so you can view each one before it’s overwritten.

Solution

There is a global graphics option called ask. Set it to TRUE, and R will pause before each new plot. We turn on this option by passing it to the par function, which sets graphical parameters:

par(ask = TRUE)

When you are tired of R pausing between plots, set it to FALSE:

par(ask = FALSE)

Discussion

When ask is TRUE, R will print this message immediately before starting a new plot:

Hit <Return> to see next plot:

When you are ready, hit the return or enter key and R will begin the next plot.

This is a Base R graphics setting, but it works with ggplot, too, as long as you wrap your plot in a print call so you get prompted. Below is an example of a loop that prints a random set of points five times. If you run this loop in RStudio, you will be prompted between each graphic. Notice how we wrap g inside a print call:

par(ask = TRUE)

for (i in (11:15)) {
  g <- ggplot(data.frame(x = rnorm(i), y = 1:i)) +
    aes(x, y) +
    geom_point()
  print(g)
}

# don't forget to turn ask off after you're done
par(ask = FALSE)

See Also

If one graph is overwriting another, consider using “Displaying Several Figures on One Page” to plot multiple graphs in one frame. See Recipe X-X for more about changing graphical parameters.


Displaying Several Figures on One Page

Problem

You want to display several plots side by side on one page.

Solution

# example data
z <- rnorm(1000)
y <- runif(1000)

# plot elements
p1 <- ggplot() +
  geom_point(aes(x = 1:1000, y = z))
p2 <- ggplot() +
  geom_point(aes(x = 1:1000, y = y))
p3 <- ggplot() +
  geom_density(aes(z))
p4 <- ggplot() +
  geom_density(aes(y))

There are a number of ways to put ggplot graphics into a grid, but one of the easiest to use and understand is patchwork by Thomas Lin Pedersen. When this book was written, patchwork was not available on CRAN, but it can be installed using devtools:

devtools::install_github("thomasp85/patchwork")

After installing the package, we can use it to plot multiple ggplot objects by putting a + between the objects, then calling plot_layout to arrange the images into a grid, as shown in Figure 10-42:

library(patchwork)
p1 + p2 + p3 + p4
Figure 10-42. A Patchwork Plot

patchwork supports grouping with parentheses and using / to put groupings under other elements, as illustrated in Figure 10-43:

p3 / (p1 + p2 + p4)
Figure 10-43. A Patchwork 1 / 2 Plot

Discussion

Let’s use a multifigure plot to display four different beta distributions. Using ggplot and the patchwork package, we can create a 2 x 2 layout by creating four graphics objects and then printing them using the + notation from patchwork:

library(patchwork)


df <- data.frame(x = c(0, 1))

g1 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 2, 4)
  ) +
  ggtitle("First")

g2 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 4, 1)
  ) +
  ggtitle("Second")

g3 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, 1, 1)
  ) +
  ggtitle("Third")

g4 <- ggplot(df) +
  aes(x) +
  stat_function(
    fun = function(x)
      dbeta(x, .5, .5)
  ) +
  ggtitle("Fourth")

g1 + g2 + g3 + g4 + plot_layout(ncol = 2, byrow = TRUE)

To lay the images out in column order, we pass byrow = FALSE to plot_layout:

g1 + g2 + g3 + g4 + plot_layout(ncol = 2, byrow = FALSE)

See Also

“Plotting a Density Function” discusses plotting of density functions as we do above.

The grid package and the lattice package contain additional tools for multifigure layouts with Base Graphics.

Writing Your Plot to a File

Problem

You want to save your graphics in a file, such as a PNG, JPEG, or PostScript file.

Solution

With ggplot figures we can use ggsave to save a displayed image to a file. ggsave makes some default assumptions about size and file type for you, allowing you to specify only a filename:

ggsave("filename.jpg")

The file type is derived from the extension you use in the filename you pass to ggsave. You can control details of size, filetype, and scale by passing parameters to ggsave. See ?ggsave for specific details.
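
For example, here is a sketch that sets an explicit size and resolution; the specific values are arbitrary:

ggsave("filename.png", width = 7, height = 5, units = "in", dpi = 300)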

Discussion

In RStudio, a shortcut is to click on Export in the Plots window and then click on Save as Image, Save as PDF, or Copy to Clipboard. The save options will prompt you for a file type and a file name before writing the file. The Copy to Clipboard option can be handy if you are manually copying and pasting your graphics into a presentation or word processor.

Remember that the file will be written to your current working directory (unless you use an absolute file path), so be certain you know which directory is your working directory before calling ggsave.

In a noninteractive script you can pass plot objects directly to ggsave, so they need not be displayed before saving. In the prior recipe we created a plot object called g1. We can save it to a file like this:

ggsave("g1.png", plot = g1, units = "in", width = 5, height = 4)

Note that the units for height and width in ggsave are specified with the units parameter. In this case we used in for inches, but ggsave also supports mm and cm for the more metrically inclined.

See Also

See “Getting and Setting the Working Directory” for more about the current working directory.

Chapter 11. Linear Regression and ANOVA

Introduction

In statistics, modeling is where we get down to business. Models quantify the relationships between our variables. Models let us make predictions.

A simple linear regression is the most basic model. It’s just two variables and is modeled as a linear relationship with an error term:

  • yi = β0 + β1xi + εi

We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for β0 and β1 (“Performing Simple Linear Regression”).

That generalizes naturally to multiple linear regression, where we have multiple variables on the righthand side of the relationship (“Performing Multiple Linear Regression”):

  • yi = β0 + β1ui + β2vi + β3wi + εi

Statisticians call u, v, and w the predictors and y the response. Obviously, the model is useful only if there is a fairly linear relationship between the predictors and the response, but that requirement is much less restrictive than you might think. “Regressing on Transformed Data” discusses transforming your variables into a (more) linear relationship so that you can use the well-developed machinery of linear regression.

The beauty of R is that anyone can build these linear models. The models are built by a function, lm, which returns a model object. From the model object, we get the coefficients (βi) and regression statistics. It’s easy. Really!

The horror of R is that anyone can build these models. Nothing requires you to check that the model is reasonable, much less statistically significant. Before you blindly believe a model, check it. Most of the information you need is in the regression summary (“Understanding the Regression Summary”); a short code sketch after the following checklist shows where each piece lives:

Is the model statistically significant?

Check the F statistic at the bottom of the summary.

Are the coefficients significant?

Check the coefficient’s t statistics and p-values in the summary, or check their confidence intervals (“Forming Confidence Intervals for Regression Coefficients”).

Is the model useful?

Check the R2 near the bottom of the summary.

Does the model fit the data well?

Plot the residuals and check the regression diagnostics (see the recipes later in this chapter).

Does the data satisfy the assumptions behind linear regression?

Check whether the diagnostics confirm that a linear model is reasonable for your data (“Diagnosing a Linear Regression”).
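
As a quick sketch of where those pieces live, using a hypothetical fitted model m:

m <- lm(y ~ x)        # hypothetical model built from vectors x and y
s <- summary(m)
s$fstatistic          # F statistic
s$coefficients        # estimates, t statistics, and p-values
confint(m)            # confidence intervals for the coefficients
s$r.squared           # R-squared
plot(m)               # residual and diagnostic plots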

ANOVA

Analysis of variance (ANOVA) is a powerful statistical technique. First-year graduate students in statistics are taught ANOVA almost immediately because of its importance, both theoretical and practical. We are often amazed, however, at the extent to which people outside the field are unaware of its purpose and value.

Regression creates a model, and ANOVA is one method of evaluating such models. The mathematics of ANOVA are intertwined with the mathematics of regression, so statisticians usually present them together; we follow that tradition here.

ANOVA is actually a family of techniques that are connected by a common mathematical analysis. This chapter mentions several applications:

One-way ANOVA

This is the simplest application of ANOVA. Suppose you have data samples from several populations and are wondering whether the populations have different means. One-way ANOVA answers that question. If the populations have normal distributions, use the oneway.test function (“Performing One-Way ANOVA”); otherwise, use the nonparametric version, the kruskal.test function (“Performing Robust ANOVA (Kruskal–Wallis Test)”).

Model comparison

When you add or delete a predictor variable from a linear regression, you want to know whether that change did or did not improve the model. The anova function compares two regression models and reports whether they are significantly different (“Comparing Models by Using ANOVA”).

ANOVA table

The anova function can also construct the ANOVA table of a linear regression model, which includes the F statistic needed to gauge the model’s statistical significance (“Getting Regression Statistics”). This important table is discussed in nearly every textbook on regression.

The See Also section below contains more about the mathematics of ANOVA.

Example Data

In many of the examples in this chapter, we start by creating example data using R’s pseudorandom number generation capabilities. So at the beginning of each recipe you may see something like the following:

set.seed(42)
x <- rnorm(100)
e <- rnorm(100, mean=0, sd=5)
y <- 5 + 15 * x + e

We use set.seed to set the random number generation seed so that if you run the example code on your machine you will get the same answer. In the above example, x is a vector of 100 draws from a standard normal (mean=0, sd=1) distribution. Then we create a little random noise called e from a normal distribution with mean=0 and sd=5. y is then calculated as 5 + 15 * x + e. The idea behind creating example data rather than using “real world” data is that with simulated “toy” data you can change the coefficients and parameters in the example data and see how the change impacts the resulting model. For example, you could increase the standard deviation of e in the example data and see what impact that has on the R^2 of your model.
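
For instance, here is a sketch of that experiment; the larger sd value of 10 is an arbitrary choice:

set.seed(42)
x <- rnorm(100)
e <- rnorm(100, mean = 0, sd = 10)    # noisier error term than sd = 5
y <- 5 + 15 * x + e
summary(lm(y ~ x))$r.squared          # lower R-squared than with sd = 5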

See Also

There are many good texts on linear regression. One of our favorites is Applied Linear Regression Models (4th ed.) by Kutner, Nachtsheim, and Neter (McGraw-Hill/Irwin). We generally follow their terminology and conventions in this chapter.

We also like Linear Models with R by Julian Faraway (Chapman & Hall), because it illustrates regression using R and is quite readable. Earlier versions of Faraway’s work are available free online, too (e.g., http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf).

Performing Simple Linear Regression

Problem

You have two vectors, x and y, that hold paired observations: (x1, y1), (x2, y2), …, (xn, yn). You believe there is a linear relationship between x and y, and you want to create a regression model of the relationship.

Solution

The lm function performs a linear regression and reports the coefficients:

set.seed(42)
x <- rnorm(100)
e <- rnorm(100, mean = 0, sd = 5)
y <- 5 + 15 * x + e

lm(y ~ x)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept)            x
#>        4.56        15.14

Discussion

Simple linear regression involves two variables: a predictor (or independent) variable, often called x; and a response (or dependent) variable, often called y. The regression uses the ordinary least-squares (OLS) algorithm to fit the linear model:

  • yi = β0 + β1xi + εi

where β0 and β1 are the regression coefficients and the εi are the error terms.

The lm function can perform linear regression. The main argument is a model formula, such as y ~ x. The formula has the response variable on the left of the tilde character (~) and the predictor variable on the right. The function estimates the regression coefficients, β0 and β1, and reports them as the intercept and the coefficient of x, respectively:

Coefficients:
(Intercept)            x
      4.558       15.136

In this case, the regression equation is:

  • yi = 4.558 + 15.136xi + εi

It is quite common for data to be captured inside a data frame, in which case you want to perform a regression between two data frame columns. Here, x and y are columns of a data frame df:

df <- data.frame(x, y)
head(df)
#>        x     y
#> 1  1.371 31.57
#> 2 -0.565  1.75
#> 3  0.363  5.43
#> 4  0.633 23.74
#> 5  0.404  7.73
#> 6 -0.106  3.94

The lm function lets you specify a data frame by using the data parameter. If you do, the function will take the variables from the data frame and not from your workspace:

lm(y ~ x, data = df)          # Take x and y from df
#>
#> Call:
#> lm(formula = y ~ x, data = df)
#>
#> Coefficients:
#> (Intercept)            x
#>        4.56        15.14

Performing Multiple Linear Regression

Problem

You have several predictor variables (e.g., u, v, and w) and a response variable y. You believe there is a linear relationship between the predictors and the response, and you want to perform a linear regression on the data.

Solution

Use the lm function. Specify the multiple predictors on the righthand side of the formula, separated by plus signs (+):

lm(y ~ u + v + w)

Discussion

Multiple linear regression is the obvious generalization of simple linear regression. It allows multiple predictor variables instead of one predictor variable and still uses OLS to compute the coefficients of a linear equation. The three-variable regression just given corresponds to this linear model:

  • yi = β0 + β1ui + β2vi + β3wi + εi

R uses the lm function for both simple and multiple linear regression. You simply add more variables to the righthand side of the model formula. The output then shows the coefficients of the fitted model:

set.seed(42)
u <- rnorm(100)
v <- rnorm(100, mean = 3,  sd = 2)
w <- rnorm(100, mean = -3, sd = 1)
e <- rnorm(100, mean = 0,  sd = 3)

y <- 5 + 4 * u + 3 * v + 2 * w + e

lm(y ~ u + v + w)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

The data parameter of lm is especially valuable when the number of variables increases, since it’s much easier to keep your data in one data frame than in many separate variables. Suppose your data is captured in a data frame, such as the df variable shown here:

df <- data.frame(y, u, v, w)
head(df)
#>       y      u     v     w
#> 1 16.67  1.371 5.402 -5.00
#> 2 14.96 -0.565 5.090 -2.67
#> 3  5.89  0.363 0.994 -1.83
#> 4 27.95  0.633 6.697 -0.94
#> 5  2.42  0.404 1.666 -4.38
#> 6  5.73 -0.106 3.211 -4.15

When we supply df to the data parameter of lm, R looks for the regression variables in the columns of the data frame:

lm(y ~ u + v + w, data = df)
#>
#> Call:
#> lm(formula = y ~ u + v + w, data = df)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

See Also

See “Performing Simple Linear Regression” for simple linear regression.

Getting Regression Statistics

Problem

You want the critical statistics and information regarding your regression, such as R2, the F statistic, confidence intervals for the coefficients, residuals, the ANOVA table, and so forth.

Solution

Save the regression model in a variable, say m:

m <- lm(y ~ u + v + w)

Then use functions to extract regression statistics and information from the model:

  • anova(m): ANOVA table

  • coefficients(m): Model coefficients

  • coef(m): Same as coefficients(m)

  • confint(m): Confidence intervals for the regression coefficients

  • deviance(m): Residual sum of squares

  • effects(m): Vector of orthogonal effects

  • fitted(m): Vector of fitted y values

  • residuals(m): Model residuals

  • resid(m): Same as residuals(m)

  • summary(m): Key statistics, such as R2, the F statistic, and the residual standard error (σ)

  • vcov(m): Variance–covariance matrix of the main parameters

Discussion

When we started using R, the documentation said to use the lm function to perform linear regression. So we did something like this, getting the output shown in “Performing Multiple Linear Regression”:

lm(y ~ u + v + w)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Coefficients:
#> (Intercept)            u            v            w
#>        4.77         4.17         3.01         1.91

How disappointing! The output was nothing compared to other statistics packages such as SAS. Where is R2? Where are the confidence intervals for the coefficients? Where is the F statistic, its p-value, and the ANOVA table?

Of course, all that information is available—you just have to ask for it. Other statistics systems dump everything and let you wade through it. R is more minimalist. It prints a bare-bones output and lets you request what more you want.

The lm function returns a model object that you can assign to a variable:

m <- lm(y ~ u + v + w)

From the model object, you can extract important information using specialized functions. The most important function is summary:

summary(m)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -5.383 -1.760 -0.312  1.856  6.984
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    4.770      0.969    4.92  3.5e-06 ***
#> u              4.173      0.260   16.07  < 2e-16 ***
#> v              3.013      0.148   20.31  < 2e-16 ***
#> w              1.905      0.266    7.15  1.7e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.66 on 96 degrees of freedom
#> Multiple R-squared:  0.885,  Adjusted R-squared:  0.882
#> F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

The summary shows the estimated coefficients. It shows the critical statistics, such as R2 and the F statistic. It shows an estimate of σ, the standard error of the residuals. The summary is so important that there is an entire recipe devoted to understanding it (“Understanding the Regression Summary”).

There are specialized extractor functions for other important information:

Model coefficients (point estimates)
    coef(m)
#> (Intercept)           u           v           w
#>        4.77        4.17        3.01        1.91
Confidence intervals for model coefficients
    confint(m)
#>             2.5 % 97.5 %
#> (Intercept)  2.85   6.69
#> u            3.66   4.69
#> v            2.72   3.31
#> w            1.38   2.43
Model residuals
    resid(m)
#>       1       2       3       4       5       6       7       8       9
#> -0.5675  2.2880  0.0972  2.1474 -0.7169 -0.3617  1.0350  2.8040 -4.2496
#>      10      11      12      13      14      15      16      17      18
#> -0.2048 -0.6467 -2.5772 -2.9339 -1.9330  1.7800 -1.4400 -2.3989  0.9245
#>      19      20      21      22      23      24      25      26      27
#> -3.3663  2.6890 -1.4190  0.7871  0.0355 -0.3806  5.0459 -2.5011  3.4516
#>      28      29      30      31      32      33      34      35      36
#>  0.3371 -2.7099 -0.0761  2.0261 -1.3902 -2.7041  0.3953  2.7201 -0.0254
#>      37      38      39      40      41      42      43      44      45
#> -3.9887 -3.9011 -1.9458 -1.7701 -0.2614  2.0977 -1.3986 -3.1910  1.8439
#>      46      47      48      49      50      51      52      53      54
#>  0.8218  3.6273 -5.3832  0.2905  3.7878  1.9194 -2.4106  1.6855 -2.7964
#>      55      56      57      58      59      60      61      62      63
#> -1.3348  3.3549 -1.1525  2.4012 -0.5320 -4.9434 -2.4899 -3.2718 -1.6161
#>      64      65      66      67      68      69      70      71      72
#> -1.5119 -0.4493 -0.9869  5.6273 -4.4626 -1.7568  0.8099  5.0320  0.1689
#>      73      74      75      76      77      78      79      80      81
#>  3.5761 -4.8668  4.2781 -2.1386 -0.9739 -3.6380  0.5788  5.5664  6.9840
#>      82      83      84      85      86      87      88      89      90
#> -3.5119  1.2842  4.1445 -0.4630 -0.7867 -0.7565  1.6384  3.7578  1.8942
#>      91      92      93      94      95      96      97      98      99
#>  0.5542 -0.8662  1.2041 -1.7401 -0.7261  3.2701  1.4012  0.9476 -0.9140
#>     100
#>  2.4278
Residual sum of squares
    deviance(m)
#> [1] 679
ANOVA table
    anova(m)
#> Analysis of Variance Table
#>
#> Response: y
#>           Df Sum Sq Mean Sq F value  Pr(>F)
#> u          1   1776    1776   251.0 < 2e-16 ***
#> v          1   3097    3097   437.7 < 2e-16 ***
#> w          1    362     362    51.1 1.7e-10 ***
#> Residuals 96    679       7
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you find it annoying to save the model in a variable, you are welcome to use one-liners such as this:

summary(lm(y ~ u + v + w))

Or you can use magrittr pipes:

lm(y ~ u + v + w) %>%
  summary

See Also

See “Understanding the Regression Summary”. See “Identifying Influential Observations” for regression statistics specific to model diagnostics.

Understanding the Regression Summary

Problem

You created a linear regression model, m. However, you are confused by the output from summary(m).

Discussion

The model summary is important because it links you to the most critical regression statistics. Here is the model summary from “Getting Regression Statistics”:

summary(m)
#>
#> Call:
#> lm(formula = y ~ u + v + w)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -5.383 -1.760 -0.312  1.856  6.984
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    4.770      0.969    4.92  3.5e-06 ***
#> u              4.173      0.260   16.07  < 2e-16 ***
#> v              3.013      0.148   20.31  < 2e-16 ***
#> w              1.905      0.266    7.15  1.7e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.66 on 96 degrees of freedom
#> Multiple R-squared:  0.885,  Adjusted R-squared:  0.882
#> F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

Let’s dissect this summary by section. We’ll read it from top to bottom—even though the most important statistic, the F statistic, appears at the end:

Call
    summary(m)$call

This shows how lm was called when it created the model, which is important for putting this summary into the proper context.

Residuals statistics
    # Residuals:
    #     Min      1Q  Median      3Q     Max
    # -5.3832 -1.7601 -0.3115  1.8565  6.9840

Ideally, the regression residuals would have a perfect, normal distribution. These statistics help you identify possible deviations from normality. The OLS algorithm is mathematically guaranteed to produce residuals with a mean of zero.[^1] Hence the sign of the median indicates the skew’s direction, and the magnitude of the median indicates the extent. In this case the median is negative, which suggests some skew to the left.

If the residuals have a nice, bell-shaped distribution, then the first quartile (1Q) and third quartile (3Q) should have about the same magnitude. In this example, the larger magnitude of 3Q versus 1Q (1.856 versus 1.760) indicates a slight skew to the right in our data, although the negative median makes the situation less clear-cut.

The Min and Max residuals offer a quick way to detect extreme outliers in the data, since extreme outliers (in the response variable) produce large residuals.

Coefficients
summary(m)$coefficients
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)     4.77      0.969    4.92 3.55e-06
#> u               4.17      0.260   16.07 5.76e-29
#> v               3.01      0.148   20.31 1.58e-36
#> w               1.91      0.266    7.15 1.71e-10

The column labeled Estimate contains the estimated regression coefficients as calculated by ordinary least squares.

Theoretically, if a variable’s coefficient is zero then the variable is worthless; it adds nothing to the model. Yet the coefficients shown here are only estimates, and they will never be exactly zero. We therefore ask: Statistically speaking, how likely is it that the true coefficient is zero? That is the purpose of the t statistics and the p-values, which in the summary are labeled (respectively) t value and Pr(>|t|).

The p-value is a probability. It gauges the likelihood that the coefficient is not significant, so smaller is better. Big is bad because it indicates a high likelihood of insignificance. In this example, the p-values for u, v, and w are all far below the conventional limit of 0.05, so all three coefficients are likely significant. In other regressions you will often see p-values above 0.05, which suggests that the corresponding coefficient is likely insignificant.[^2] Variables with large p-values are candidates for elimination.

A handy feature is that R flags the significant variables for quick identification. Do you notice the column on the extreme righthand side of the coefficients table, the one containing asterisks? In this example every coefficient is flagged with three asterisks (***), the strongest flag. That column highlights the significant variables. The line labeled "Signif. codes" at the bottom gives a cryptic guide to the flags’ meanings:

  • ***: p-value between 0 and 0.001

  • **: p-value between 0.001 and 0.01

  • *: p-value between 0.01 and 0.05

  • .: p-value between 0.05 and 0.1

  • (blank): p-value between 0.1 and 1.0

The column labeled Std. Error is the standard error of the estimated coefficient. The column labeled t value is the t statistic from which the p-value was calculated.

Residual standard error
    # Residual standard error: 2.66 on 96 degrees of freedom

This reports the standard error of the residuals (σ), that is, the sample standard deviation of ε.

R2 (coefficient of determination)
    # Multiple R-squared:  0.885,  Adjusted R-squared:  0.882

R2 is a measure of the model’s quality. Bigger is better. Mathematically, it is the fraction of the variance of y that is explained by the regression model. The remaining variance is not explained by the model, so it must be due to other factors (i.e., unknown variables or sampling variability). In this case, the model explains 0.885 (88.5%) of the variance of y, and the remaining 0.115 (11.5%) is unexplained.

That being said, we strongly suggest using the adjusted rather than the basic R2. The adjusted value accounts for the number of variables in your model and so is a more realistic assessment of its effectiveness. In this case, we would use 0.882, not 0.885.

F statistic
    # F-statistic:  247 on 3 and 96 DF,  p-value: <2e-16

The F statistic tells you whether the model is significant or insignificant. The model is significant if any of the coefficients are nonzero (i.e., if βi ≠ 0 for some i). It is insignificant if all coefficients are zero (β1 = β2 = … = βn = 0).

Conventionally, a p-value of less than 0.05 indicates that the model is likely significant (one or more βi are nonzero), whereas values exceeding 0.05 indicate that the model is likely not significant. Here, the p-value is essentially zero (less than 2e-16), so the model is almost certainly significant.

Most people look at the R2 statistic first. The statistician wisely starts with the F statistic, for if the model is not significant then nothing else matters.

See Also

See “Getting Regression Statistics” for more on extracting statistics and information from the model object.

Performing Linear Regression Without an Intercept

Problem

You want to perform a linear regression, but you want to force the intercept to be zero.

Solution

Add "+ 0" to the righthand side of your regression formula. That will force lm to fit the model with a zero intercept:

lm(y ~ x + 0)

The corresponding regression equation is:

  • yi = βxi + εi

Discussion

Linear regression ordinarily includes an intercept term, so that is the default in R. In rare cases, however, you may want to fit the data while assuming that the intercept is zero. In this case you make a modeling assumption: when x is zero, y should be zero.

When you force a zero intercept, the lm output includes a coefficient for x but no intercept for y, as shown here:

lm(y ~ x + 0)
#>
#> Call:
#> lm(formula = y ~ x + 0)
#>
#> Coefficients:
#>   x
#> 4.3

We strongly suggest you check that modeling assumption before proceeding. Perform a regression with an intercept; then see if the intercept could plausibly be zero. Check the intercept’s confidence interval. In this example, the confidence interval is (6.26, 8.84):

confint(lm(y ~ x))
#>             2.5 % 97.5 %
#> (Intercept)  6.26   8.84
#> x            2.82   5.31

Because the confidence interval does not contain zero, it is NOT statistically plausible that the intercept could be zero. So in this case, it is not reasonable to rerun the regression while forcing a zero intercept.

Regressing Only Variables that Highly Correlate with your Dependent Variable

Problem

You have a data frame with many variables and you want to build a multiple linear regression using only the variables that are highly correlated to your response (dependent) variable.

Solution

If df is our data frame containing both our response (dependent) and all our predictor (independent) variables, and dep_var is our response variable, we can figure out our best predictors and then use them in a linear regression. If we want the top 4 predictor variables, we can use this recipe:

best_pred <- df %>%
  select(-dep_var) %>%
  map_dbl(cor, y = df$dep_var) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  df[.]

mod <- lm(df$dep_var ~ as.matrix(best_pred))

This recipe is a combination of many different pieces of logic used elsewhere in this book. We will describe each step here and then walk through it in the Discussion using some example data.

First we drop the response variable out of our pipe chain so that we have only our predictor variables in our data flow:

df %>%
  select(-dep_var) %>%

Then we use map_dbl from purrr to perform a pairwise correlation of each column relative to the response variable:

  map_dbl(cor, y = df$dep_var) %>%

We then take the resulting correlations and sort them in decreasing order:

  sort(decreasing = TRUE) %>%

We want only the top 4 correlated variables, so we select the first 4 records in the resulting vector:

  .[1:4] %>%

We don't need the correlation values, only the names, which are the variable names from our original data frame df:

  names %>%

Then we pass those names into our subsetting brackets to select only the columns whose names match the ones we want:

  df[.]

Our pipe chain assigns the resulting data frame to best_pred. We can then use best_pred as the predictor variables in our regression, with df$dep_var as the response:

mod <- lm(df$dep_var ~ as.matrix(best_pred))

Discussion

We can combine the mapping functions discussed in the recipe “Applying a Function to Every Column” and create a recipe to remove low-correlation variables from a set of predictors, then use the high-correlation predictors in a regression.

We have an example data frame that contains 6 predictor variables named pred1 through pred6. The response variable is named resp. Let’s walk that data frame through our logic and see how it works.

Loading the data and dropping the resp variable is pretty straightforward, so let's look at the result of mapping the cor function:

# loads the pred data frame
load("./data/pred.rdata")

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp)
#> pred1 pred2 pred3 pred4 pred5 pred6
#> 0.573 0.279 0.753 0.799 0.322 0.607

The output is a named vector of values where the names are the variable names and the values are the pairwise correlations between each predictor variable and resp, the response variable.

If we sort this vector, we get the correlations in decreasing order:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE)
#> pred4 pred3 pred6 pred1 pred5 pred2
#> 0.799 0.753 0.607 0.573 0.322 0.279

Using subsetting allows us to select the top 4 records. The . is a special placeholder that tells the pipe where to put the result of the prior step:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4]
#> pred4 pred3 pred6 pred1
#> 0.799 0.753 0.607 0.573

We then use the names function to extract the names from our vector. The names are the names of the columns we ultimately want to use as our independent variables:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names
#> [1] "pred4" "pred3" "pred6" "pred1"

When we pass the vector of names into pred[.], the names are used to select columns from the pred data frame. We then use head to show only the first six rows for easier illustration:

pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  pred[.] %>%
  head
#>    pred4   pred3  pred6  pred1
#> 1  7.252  1.5127  0.560  0.206
#> 2  2.076  0.2579 -0.124 -0.361
#> 3 -0.649  0.0884  0.657  0.758
#> 4  1.365 -0.1209  0.122 -0.727
#> 5 -5.444 -1.1943 -0.391 -1.368
#> 6  2.554  0.6120  1.273  0.433

Now let’s bring it all together and pass the resulting data into the regression:

best_pred <- pred %>%
  select(-resp) %>%
  map_dbl(cor, y = pred$resp) %>%
  sort(decreasing = TRUE) %>%
  .[1:4] %>%
  names %>%
  pred[.]

mod <- lm(pred$resp ~ as.matrix(best_pred))
summary(mod)
#>
#> Call:
#> lm(formula = pred$resp ~ as.matrix(best_pred))
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -1.485 -0.619  0.189  0.562  1.398
#>
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                  1.117      0.340    3.28   0.0051 **
#> as.matrix(best_pred)pred4    0.523      0.207    2.53   0.0231 *
#> as.matrix(best_pred)pred3   -0.693      0.870   -0.80   0.4382
#> as.matrix(best_pred)pred6    1.160      0.682    1.70   0.1095
#> as.matrix(best_pred)pred1    0.343      0.359    0.95   0.3549
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.927 on 15 degrees of freedom
#> Multiple R-squared:  0.838,  Adjusted R-squared:  0.795
#> F-statistic: 19.4 on 4 and 15 DF,  p-value: 8.59e-06

Performing Linear Regression with Interaction Terms

Problem

You want to include an interaction term in your regression.

Solution

The R syntax for regression formulas lets you specify interaction terms. The interaction of two variables, u and v, is indicated by separating their names with an asterisk (*):

lm(y ~ u*v)

This corresponds to the model yi = β0 + β1ui +
β2vi + β3uivi + εi, which includes the first-order interaction term β3uivi.

Discussion

In regression, an interaction occurs when the product of two predictor variables is also a significant predictor (i.e., in addition to the predictor variables themselves). Suppose we have two predictors, u and v, and want to include their interaction in the regression. This is expressed by the following equation:

  • yi = β0 + β1ui + β2vi + β3uivi + εi

Here the product term, β3uivi, is called the interaction term. The R formula for that equation is:

y ~ u * v

When you write y ~ u*v, R automatically includes u, v, and their product in the model. This is for a good reason. If a model includes an interaction term, such as β3uivi, then regression theory tells us the model should also contain the constituent variables ui and vi.
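
To see exactly what R builds from u*v, you can look at the design matrix. Here is a minimal sketch with a made-up data frame (the names d, u, and v here are ours, just for illustration):

d <- data.frame(u = c(1, 2, 3), v = c(4, 5, 6))
model.matrix(~ u * v, data = d)
# columns: (Intercept), u, v, and u:v (the product term)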

Likewise, if you have three predictors (u, v, and w) and want to include all their interactions, separate them by asterisks:

y ~ u * v * w

This corresponds to the regression equation:

  • yi = β0 + β1ui + β2vi + β3wi + β4uivi + β5uiwi + β6viwi + β7uiviwi + εi

Now we have all the first-order interactions and a second-order interaction (β7uiviwi).

Sometimes, however, you may not want every possible interaction. You can explicitly specify a single product by using the colon operator (:). For example, u:v:w denotes just the product term uiviwi, without all the other possible interactions. So the R formula:

y ~ u + v + w + u:v:w

corresponds to the regression equation:

  • yi = β0 + β1ui + β2vi + β3wi + β4uiviwi + εi

It might seem odd that colon (:) means pure multiplication while asterisk (*) means both multiplication and inclusion of constituent terms. Again, this is because we normally incorporate the constituents when we include their interaction, so making that the default for asterisk makes sense.

There is some additional syntax for easily specifying many interactions:

(u + v + ... + w)^2

: Include all variables (u, v, …, w) and all their first-order interactions.

(u + v + ... + w)^3

: Include all variables, all their first-order interactions, and all their second-order interactions.

(u + v + ... + w)^4

: And so forth.

Both the asterisk (*) and the colon (:) follow a “distributive law”, so the following notations are also allowed:

x*(u + v + ... + w)

: Same as x*u + x*v + ... + x*w (which is the same as x + u + v + ... + w + x:u + x:v + ... + x:w).

x:(u + v + ... + w)

: Same as x:u + x:v + ... + x:w.

All this syntax gives you some flexibility in writing your formula. For example, these three formulas are equivalent:

y ~ u * v
y ~ u + v + u:v
y ~ (u + v) ^ 2

They all define the same regression equation, yi = β0 +
β1ui + β2vi + β3uivi + εi.
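
If you ever want to confirm that formulas expand to the same terms, the terms function will show you. This is just a quick sketch, not part of the recipe:

attr(terms(y ~ u * v), "term.labels")
attr(terms(y ~ u + v + u:v), "term.labels")
attr(terms(y ~ (u + v) ^ 2), "term.labels")
# each should report the same three terms: "u", "v", and "u:v"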

See Also

The full syntax for formulas is richer than described here. See R in a Nutshell (O’Reilly) or the R Language Definition for more details.

Selecting the Best Regression Variables

Problem

You are creating a new regression model or improving an existing model. You have the luxury of many regression variables, and you want to select the best subset of those variables.

Solution

The step function can perform stepwise regression, either forward or backward. Backward stepwise regression starts with many variables and removes the underperformers:

full.model <- lm(y ~ x1 + x2 + x3 + x4)
reduced.model <- step(full.model, direction = "backward")

Forward stepwise regression starts with a few variables and adds new ones to improve the model until it cannot be improved further:

min.model <- lm(y ~ 1)
fwd.model <-
  step(min.model,
       direction = "forward",
       scope = (~ x1 + x2 + x3 + x4))

Discussion

When you have many predictors, it can be quite difficult to choose the best subset. Adding and removing individual variables affects the overall mix, so the search for “the best” can become tedious.

The step function automates that search. Backward stepwise regression is the easiest approach. Start with a model that includes all the predictors. We call that the full model. The model summary, shown here, indicates that not all predictors are statistically significant:

# example data
set.seed(4)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n, 1, 2)
x3 <- rnorm(n, 3, 1)
x4 <- rnorm(n,-2, 2)
e <- rnorm(n, 0, 3)
y <- 4 + x1 + 5 * x3 + e

# build the model
full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x4)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.032 -1.774  0.158  2.032  6.626
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.40224    0.80767    4.21  4.4e-05 ***
#> x1           0.53937    0.25935    2.08    0.039 *
#> x2           0.16831    0.12291    1.37    0.173
#> x3           5.17410    0.23983   21.57  < 2e-16 ***
#> x4          -0.00982    0.12954   -0.08    0.940
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 145 degrees of freedom
#> Multiple R-squared:  0.77,   Adjusted R-squared:  0.763
#> F-statistic:  121 on 4 and 145 DF,  p-value: <2e-16

We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:

reduced.model <- step(full.model, direction="backward")
#> Start:  AIC=327
#> y ~ x1 + x2 + x3 + x4
#>
#>        Df Sum of Sq  RSS AIC
#> - x4    1         0 1240 325
#> - x2    1        16 1256 327
#> <none>              1240 327
#> - x1    1        37 1277 329
#> - x3    1      3979 5219 540
#>
#> Step:  AIC=325
#> y ~ x1 + x2 + x3
#>
#>        Df Sum of Sq  RSS AIC
#> - x2    1        16 1256 325
#> <none>              1240 325
#> - x1    1        37 1277 327
#> - x3    1      3988 5228 539
#>
#> Step:  AIC=325
#> y ~ x1 + x3
#>
#>        Df Sum of Sq  RSS AIC
#> <none>              1256 325
#> - x1    1        44 1300 328
#> - x3    1      3974 5230 537

The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:

summary(reduced.model)
#>
#> Call:
#> lm(formula = y ~ x1 + x3)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.148 -1.850 -0.055  2.026  6.550
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    3.648      0.751    4.86    3e-06 ***
#> x1             0.582      0.255    2.28    0.024 *
#> x3             5.147      0.239   21.57   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 147 degrees of freedom
#> Multiple R-squared:  0.767,  Adjusted R-squared:  0.763
#> F-statistic:  241 on 2 and 147 DF,  p-value: <2e-16

Backward stepwise regression is easy, but sometimes it’s not feasible to start with “everything” because you have too many candidate variables. In that case use forward stepwise regression, which will start with nothing and incrementally add variables that improve the regression. It stops when no further improvement is possible.

A model that “starts with nothing” may look odd at first:

min.model <- lm(y ~ 1)

This is a model with a response variable (y) but no predictor variables. (All the fitted values for y are simply the mean of y, which is what you would guess if no predictors were available.)
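
A quick sketch, using the example data built above, confirms this: the fitted intercept of the empty model equals the sample mean of y.

coef(min.model)["(Intercept)"]   # the intercept of the empty model ...
mean(y)                          # ... matches the mean of y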

We must tell step which candidate variables are available for inclusion in the model. That is the purpose of the scope argument. The scope is a formula with nothing on the lefthand side of the tilde (~) and candidate variables on the righthand side:

fwd.model <- step(
  min.model,
  direction = "forward",
  scope = (~ x1 + x2 + x3 + x4),
  trace = 0
)

Here we see that x1, x2, x3, and x4 are all candidates for inclusion. (We also included trace=0 to inhibit the voluminous output from step.) The resulting model has two significant predictors and no insignificant predictors:

summary(fwd.model)
#>
#> Call:
#> lm(formula = y ~ x3 + x1)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -8.148 -1.850 -0.055  2.026  6.550
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    3.648      0.751    4.86    3e-06 ***
#> x3             5.147      0.239   21.57   <2e-16 ***
#> x1             0.582      0.255    2.28    0.024 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.92 on 147 degrees of freedom
#> Multiple R-squared:  0.767,  Adjusted R-squared:  0.763
#> F-statistic:  241 on 2 and 147 DF,  p-value: <2e-16

The step-forward algorithm reached the same model as the step-backward model by including x1 and x3 but excluding x2 and x4. This is a toy example, so that is not surprising. In real applications, we suggest trying both the forward and the backward regression and then comparing the results. You might be surprised.
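
One simple way to compare the two results (a sketch, not part of the recipe) is to put the final formulas and their AIC values side by side:

formula(reduced.model)          # model chosen by the backward search
formula(fwd.model)              # model chosen by the forward search
AIC(reduced.model, fwd.model)   # lower AIC is better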

Finally, don’t get carried away by stepwise regression. It is not a panacea, it cannot turn junk into gold, and it is definitely not a substitute for choosing predictors carefully and wisely. You might think: “Oh boy! I can generate every possible interaction term for my model, then let step choose the best ones! What a model I’ll get!” You’d be thinking of something like this, which starts with all possible interactions then tries to reduce the model:

full.model <- lm(y ~ (x1 + x2 + x3 + x4) ^ 4)
reduced.model <- step(full.model, direction = "backward")
#> Start:  AIC=337
#> y ~ (x1 + x2 + x3 + x4)^4
#>
#>               Df Sum of Sq  RSS AIC
#> - x1:x2:x3:x4  1    0.0321 1145 335
#> <none>                     1145 337
#>
#> Step:  AIC=335
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4 + x1:x3:x4 + x2:x3:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x2:x3:x4  1      0.76 1146 333
#> - x1:x3:x4  1      8.37 1154 334
#> <none>                  1145 335
#> - x1:x2:x4  1     20.95 1166 336
#> - x1:x2:x3  1     25.18 1170 336
#>
#> Step:  AIC=333
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4 + x1:x3:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x1:x3:x4  1      8.74 1155 332
#> <none>                  1146 333
#> - x1:x2:x4  1     21.72 1168 334
#> - x1:x2:x3  1     26.51 1172 334
#>
#> Step:  AIC=332
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x3:x4 + x1:x2:x3 + x1:x2:x4
#>
#>            Df Sum of Sq  RSS AIC
#> - x3:x4     1      0.29 1155 330
#> <none>                  1155 332
#> - x1:x2:x4  1     23.24 1178 333
#> - x1:x2:x3  1     31.11 1186 334
#>
#> Step:  AIC=330
#> y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 +
#>     x1:x2:x3 + x1:x2:x4
#>
#>            Df Sum of Sq  RSS AIC
#> <none>                  1155 330
#> - x1:x2:x4  1      23.4 1178 331
#> - x1:x2:x3  1      31.5 1187 332

This does not work well. Most of the interaction terms are meaningless. The step function becomes overwhelmed, and you are left with many insignificant terms.

Regressing on a Subset of Your Data

Problem

You want to fit a linear model to a subset of your data, not to the entire dataset.

Solution

The lm function has a subset parameter that specifies which data elements should be used for fitting. The parameter’s value can be any index expression that could index your data. This shows a fitting that uses only the first 100 observations:

lm(y ~ x1, subset=1:100)          # Use only x1[1:100]

Discussion

You will often want to regress only a subset of your data. This can happen, for example, when using in-sample data to create the model and out-of-sample data to test it.

The lm function has a parameter, subset, that selects the observations used for fitting. The value of subset is a vector. It can be a vector of index values, in which case lm selects only the indicated observations from your data. It can also be a logical vector, the same length as your data, in which case lm selects the observations with a corresponding TRUE.

Suppose you have 1,000 observations of (x, y) pairs and want to fit your model using only the first half of those observations. Use a subset parameter of 1:500, indicating lm should use observations 1 through 500:

## example data
n <- 1000
x <- rnorm(n)
e <- rnorm(n, 0, .5)
y <- 3 + 2 * x + e
lm(y ~ x, subset = 1:500)
#>
#> Call:
#> lm(formula = y ~ x, subset = 1:500)
#>
#> Coefficients:
#> (Intercept)            x
#>           3            2

More generally, you can use the expression 1:floor(length(x)/2) to select the first half of your data, regardless of size:

lm(y ~ x, subset = 1:floor(length(x) / 2))
#>
#> Call:
#> lm(formula = y ~ x, subset = 1:floor(length(x)/2))
#>
#> Coefficients:
#> (Intercept)            x
#>           3            2
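
Having fit the model on the in-sample half, you might then check it on the out-of-sample half. This is only a sketch of that idea; the RMSE calculation is ours, not part of the recipe:

n_half <- floor(length(x) / 2)
m_in <- lm(y ~ x, subset = 1:n_half)                  # in-sample fit
out_idx <- (n_half + 1):length(x)                     # out-of-sample rows
preds <- predict(m_in, newdata = data.frame(x = x[out_idx]))
sqrt(mean((y[out_idx] - preds)^2))                    # out-of-sample RMSE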

Let’s say your data was collected in several labs and you have a factor, lab, that identifies the lab of origin. You can limit your regression to observations collected in New Jersey by using a logical vector that is TRUE only for those observations:

load('./data/lab_df.rdata')
lm(y ~ x, subset = (lab == "NJ"), data = lab_df)
#>
#> Call:
#> lm(formula = y ~ x, data = lab_df, subset = (lab == "NJ"))
#>
#> Coefficients:
#> (Intercept)            x
#>        2.58         5.03

Using an Expression Inside a Regression Formula

Problem

You want to regress on calculated values, not simple variables, but the syntax of a regression formula seems to forbid that.

Solution

Embed the expressions for the calculated values inside the I(...) operator. That will force R to calculate the expression and use the calculated value for the regression.

Discussion

If you want to regress on the sum of u and v, then this is your regression equation:

  • yi = β0 + β1(ui + vi) + εi

How do you write that equation as a regression formula? This won’t work:

lm(y ~ u + v)    # Not quite right

Here R will interpret u and v as two separate predictors, each with its own regression coefficient. Likewise, suppose your regression equation is:

  • yi = β0 + β1ui + β2ui2 + εi

This won’t work:

lm(y ~ u + u ^ 2)  # That's an interaction, not a quadratic term

R will interpret u^2 as an interaction term (“Performing Linear Regression with Interaction Terms”) and not as the square of u.

The solution is to surround the expressions by the I(...) operator, which inhibits the expressions from being interpreted as a regression formula. Instead, it forces R to calculate the expression’s value and then incorporate that value directly into the regression. Thus the first example becomes:

lm(y ~ I(u + v))

In response to that command, R computes u + v and then regresses y on the sum.

For the second example we use:

lm(y ~ u + I(u ^ 2))

Here R computes the square of u and then regresses y on u and u^2.

All the basic binary operators (+, -, *, /, ^) have special meanings inside a regression formula. For this reason, you must use the I(...) operator whenever you incorporate calculated values into a regression.
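
You can see the difference at the level of model terms. In this sketch, the first formula has two separate predictors while the second has a single computed predictor:

attr(terms(y ~ u + v), "term.labels")      # "u" "v": two separate predictors
attr(terms(y ~ I(u + v)), "term.labels")   # "I(u + v)": one computed predictor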

A beautiful aspect of these embedded transformations is that R remembers the transformations and applies them when you make predictions from the model. Consider the quadratic model described by the second example. It uses u and u^2, but we supply the value of u only and R does the heavy lifting. We don’t need to calculate the square of u ourselves:

load('./data/df_squared.rdata')
m <- lm(y ~ u + I(u ^ 2), data = df_squared)
predict(m, newdata = data.frame(u = 13.4))
#>   1
#> 877

See Also

See “Regressing on a Polynomial” for the special case of regression on a polynomial. See “Regressing on Transformed Data” for incorporating other data transformations into the regression.

Regressing on a Polynomial

Problem

You want to regress y on a polynomial of x.

Solution

Use the poly(x,n) function in your regression formula to regress on an n-degree polynomial of x. This example models y as a cubic function of x:

lm(y ~ poly(x, 3, raw = TRUE))

The example’s formula corresponds to the following cubic regression equation:

  • yi = β0 + β1xi + β2xi2 + β3xi3 + εi

Discussion

When a person first uses a polynomial model in R, they often do something clunky like this:

x_sq <- x ^ 2
x_cub <- x ^ 3
m <- lm(y ~ x + x_sq + x_cub)

Obviously, this is annoying, and it litters your workspace with extra variables.

It’s much easier to write:

m <- lm(y ~ poly(x, 3, raw = TRUE))

The raw = TRUE argument is necessary. Without it, the poly function computes orthogonal polynomials instead of simple powers of x.

Beyond the convenience, a huge advantage is that R will calculate all those powers of x when you make predictions from the model (“Predicting New Values”). Without that, you are stuck calculating x2 and x3 yourself every time you employ the model.

Here is another good reason to use poly. You cannot write your regression formula in this way:

lm(y ~ x + x^2 + x^3)     # Does not do what you think!

R will interpret x^2 and x^3 as interaction terms, not as powers of x. The resulting model is a one-term linear regression, completely unlike your expectation. You could write the regression formula like this:

lm(y ~ x + I(x ^ 2) + I(x ^ 3))

But that’s getting pretty verbose. Just use poly.

See Also

See “Performing Linear Regression with Interaction Terms” for more about interaction terms. See “Regressing on Transformed Data” for other transformations on regression data.

Regressing on Transformed Data

Problem

You want to build a regression model for x and y, but they do not have a linear relationship.

Solution

You can embed the needed transformation inside the regression formula. If, for example, y must be transformed into log(y), then the regression formula becomes:

lm(log(y) ~ x)

Discussion

A critical assumption behind the lm function for regression is that the variables have a linear relationship. To the extent this assumption is false, the resulting regression becomes meaningless.

Fortunately, many datasets can be transformed into a linear relationship before applying lm.

Figure 12-1. Example of a Data Transform

Figure 12-1 shows an example of exponential decay. The left panel shows the original data, z. The dotted line shows a linear regression on the original data; clearly, it’s a lousy fit. If the data is really exponential, then a possible model is:

  • z = exp[β0 + β1t + ε]

where t is time and exp[⋅] is the exponential function (ex). This is not linear, of course, but we can linearize it by taking logarithms:

  • log(z) = β0 + β1t + ε

In R, that regression is simple because we can embed the log transform directly into the regression formula:

# read in our example data
load(file = './data/df_decay.rdata')
z <- df_decay$z
t <- df_decay$time

# transform and model
m <- lm(log(z) ~ t)
summary(m)
#>
#> Call:
#> lm(formula = log(z) ~ t)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.4479 -0.0993  0.0049  0.0978  0.2802
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.6887     0.0306    22.5   <2e-16 ***
#> t            -2.0118     0.0351   -57.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.148 on 98 degrees of freedom
#> Multiple R-squared:  0.971,  Adjusted R-squared:  0.971
#> F-statistic: 3.28e+03 on 1 and 98 DF,  p-value: <2e-16

The right panel of Figure 12-1 shows the plot of log(z) versus time. Superimposed on that plot is the regression line. The fit appears to be much better; this is confirmed by the R2 of 0.97, compared with 0.82 for the linear regression on the original data.

You can embed other functions inside your formula. If you thought the relationship was quadratic, you could use a square-root transformation:

lm(sqrt(y) ~ month)

You can apply transformations to variables on both sides of the formula, of course. This formula regresses y on the square root of x:

lm(y ~ sqrt(x))

This regression is for a log-log relationship between x and y:

lm(log(y) ~ log(x))

Finding the Best Power Transformation (Box–Cox Procedure)

Problem

You want to improve your linear model by applying a power transformation to the response variable.

Solution

Use the Box–Cox procedure, which is implemented by the boxcox function of the MASS package. The procedure will identify a power, λ, such that transforming y into yλ will improve the fit of your model:

library(MASS)
m <- lm(y ~ x)
boxcox(m)

Discussion

To illustrate the Box–Cox transformation, let’s create some artificial data using the equation y^(−1.5) = x + ε, where ε is an error term:

set.seed(9)
x <- 10:100
eps <- rnorm(length(x), sd = 5)
y <- (x + eps) ^ (-1 / 1.5)

Then we will (mistakenly) model the data using a simple linear regression and derive an adjusted R2 of 0.6374:

m <- lm(y ~ x)
summary(m)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.04032 -0.01633 -0.00792  0.00996  0.14516
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.166885   0.007078    23.6   <2e-16 ***
#> x           -0.001465   0.000116   -12.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0291 on 89 degrees of freedom
#> Multiple R-squared:  0.641,  Adjusted R-squared:  0.637
#> F-statistic:  159 on 1 and 89 DF,  p-value: <2e-16

When plotting the residuals against the fitted values, we get a clue that something is wrong:

plot(m, which = 1)       # Plot only the fitted vs residuals
Figure 12-2. Fitted Values vs Residuals

We used the Base R plot function to plot the residuals vs the fitted values in Figure 12-2. We can see this plot has a clear parabolic shape. A possible fix is a power transformation on y, so we run the Box–Cox procedure:

library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
bc <- boxcox(m)
Figure 12-3. Output of boxcox on the Model (m)

The boxcox function plots values of λ against the log-likelihood of the resulting model as shown in Figure 12-3. We want to maximize that log-likelihood, so the function draws a line at the best value and also draws lines at the limits of its confidence interval. In this case, it looks like the best value is around −1.5, with a confidence interval of about (−1.75, −1.25).

Oddly, the boxcox function does not return the best value of λ. Rather, it returns the (x, y) pairs displayed in the plot. It’s pretty easy to find the value of λ that yields the largest log-likelihood. The which.max function gives us the position of the largest value in bc$y:

which.max(bc$y)
#> [1] 13

We then use that position to index into bc$x, which gives us the corresponding λ:

lambda <- bc$x[which.max(bc$y)]
lambda
#> [1] -1.52

The function reports that the best λ is −1.515. In an actual application, we would urge you to interpret this number and choose the power that makes sense to you—rather than blindly accepting this “best” value. Use the graph to assist you in that interpretation. Here, we’ll go with −1.515.
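
If you want the approximate confidence interval as numbers rather than reading it off the graph, one common approach (a sketch, using the usual likelihood-ratio cutoff that boxcox plots) is:

ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
ci    # approximate lower and upper limits for lambda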

We can apply the power transform to y and then fit the revised model; this gives a much better R2 of 0.9668:

z <- y ^ lambda
m2 <- lm(z ~ x)
summary(m2)
#>
#> Call:
#> lm(formula = z ~ x)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -13.459  -3.711  -0.228   2.206  14.188
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.6426     1.2517   -0.51     0.61
#> x             1.0514     0.0205   51.20   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.15 on 89 degrees of freedom
#> Multiple R-squared:  0.967,  Adjusted R-squared:  0.967
#> F-statistic: 2.62e+03 on 1 and 89 DF,  p-value: <2e-16

For those who prefer one-liners, the transformation can be embedded right into the revised regression formula:

m2 <- lm(I(y ^ lambda) ~ x)

By default, boxcox searches for values of λ in the range −2 to +2. You can change that via the lambda argument; see the help page for details.

We suggest viewing the Box–Cox result as a starting point, not as a definitive answer. If the confidence interval for λ includes 1.0, it may be that no power transformation is actually helpful. As always, inspect the residuals before and after the transformation. Did they really improve?

Forming Confidence Intervals for Regression Coefficients

Problem

You are performing linear regression and you need the confidence intervals for the regression coefficients.

Solution

Save the regression model in an object; then use the confint function to extract confidence intervals:

load(file = './data/conf.rdata')
m <- lm(y ~ x1 + x2)
confint(m)
#>             2.5 % 97.5 %
#> (Intercept) -3.90   6.47
#> x1          -2.58   6.24
#> x2           4.67   5.17

Discussion

The Solution uses the model yi = β0 + β1(x1)i +
β2(x2)i + εi. The confint function returns the confidence intervals for the intercept (β0), the coefficient of x1 (β1), and the coefficient of x2 (β2):

confint(m)
#>             2.5 % 97.5 %
#> (Intercept) -3.90   6.47
#> x1          -2.58   6.24
#> x2           4.67   5.17

By default, confint uses a confidence level of 95%. Use the level parameter to select a different level:

confint(m, level = 0.99)
#>             0.5 % 99.5 %
#> (Intercept) -5.72   8.28
#> x1          -4.12   7.79
#> x2           4.58   5.26

See Also

The coefplot function of the arm package can plot confidence intervals for regression coefficients.

Plotting Regression Residuals

Problem

You want a visual display of your regression residuals.

Solution

You can plot the model object by selecting the residuals plot from the available plots:

m <- lm(y ~ x1 + x2)
plot(m, which = 1)
Figure 12-4. Model Residual Plot

The output is shown in Figure 12-4.

Discussion

Normally, plotting a regression model object produces several diagnostic plots. You can select just the residuals plot by specifying which=1.

The graph above shows a plot of the residuals from “Performing Simple Linear Regression”. R draws a smoothed line through the residuals as a visual aid to finding significant patterns—for example, a slope or a parabolic shape.

See Also

See “Diagnosing a Linear Regression”, which contains examples of residuals plots and other diagnostic plots.

Diagnosing a Linear Regression

Problem

You have performed a linear regression. Now you want to verify the model’s quality by running diagnostic checks.

Solution

Start by plotting the model object, which will produce several diagnostic plots:

m <- lm(y ~ x1 + x2)
plot(m)

Next, identify possible outliers either by looking at the diagnostic plot of the residuals or by using the outlierTest function of the car package:

library(car)
#> Loading required package: carData
#>
#> Attaching package: 'car'
#> The following object is masked from 'package:dplyr':
#>
#>     recode
#> The following object is masked from 'package:purrr':
#>
#>     some
outlierTest(m)
#> No Studentized residuals with Bonferonni p < 0.05
#> Largest |rstudent|:
#>   rstudent unadjusted p-value Bonferonni p
#> 2     2.27             0.0319        0.956

Finally, identify any overly influential observations (“Identifying Influential Observations”).

Discussion

R fosters the impression that linear regression is easy: just use the lm function. Yet fitting the data is only the beginning. It’s your job to decide whether the fitted model actually works and works well.

Before anything else, you must have a statistically significant model. Check the F statistic from the model summary (“Understanding the Regression Summary”) and be sure that the p-value is small enough for your purposes. Conventionally, it should be less than 0.05 or else your model is likely not very meaningful.
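
If you want that overall p-value programmatically rather than reading it from the printed summary, a small sketch (for any fitted lm model m) is:

s <- summary(m)
f <- s$fstatistic                 # named vector: value, numdf, dendf
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)   # overall F-test p-value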

Simply plotting the model object produces several useful diagnostic plots, shown in Figure 12-5:

length(x1)
#> [1] 30
length(x2)
#> [1] 30
length(y)
#> [1] 30

m <- lm(y ~ x1 + x2)
par(mfrow = c(2, 2)) # this gives us a 2x2 plot
plot(m)
Figure 12-5. Diagnostics of a Good Fit

Figure 12-5 shows diagnostic plots for a pretty good regression:

  • The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.

  • The points in the Normal Q–Q plot are more-or-less on the line, indicating that the residuals follow a normal distribution.

  • In both the Scale–Location plot and the Residuals vs Leverage plots, the points are in a group with none too far from the center.

In contrast, the series of graphs shown in Figure 12-6 show the diagnostics for a not-so-good regression:

load(file = './data/bad.rdata')
m <- lm(y2 ~ x3 + x4)
par(mfrow = c(2, 2))      # this gives us a 2x2 plot
plot(m)
Figure 12-6. Diagnostics of a Poor Fit

Observe that the Residuals vs Fitted plot has a definite parabolic shape. This tells us that the model is incomplete: it is missing a quadratic factor that could explain more variation in y. Other patterns in residuals are suggestive of additional problems: a cone shape, for example, may indicate nonconstant variance in y. Interpreting those patterns is a bit of an art, so we suggest reviewing a good book on linear regression while evaluating the plot of residuals.

There are other problems with the not-so-good diagnostics above. The Normal Q–Q plot has more points off the line than it does for the good regression. Both the Scale–Location and Residuals vs Leverage plots show points scattered away from the center, which suggests that some points have excessive leverage.

Another pattern is that point number 28 sticks out in every plot. This warns us that something is odd with that observation. The point could be an outlier, for example. We can check that hunch with the outlierTest function of the car package:

outlierTest(m)
#>    rstudent unadjusted p-value Bonferonni p
#> 28     4.46           7.76e-05       0.0031

The outlierTest identifies the model’s most outlying observation. In this case, it identified observation number 28 and so confirmed that it could be an outlier.

See Also

See recipes “Understanding the Regression Summary” and “Identifying Influential Observations”. The car package is not part of the standard distribution of R; see “Installing Packages from CRAN”.

Identifying Influential Observations

Problem

You want to identify the observations that are having the most influence on the regression model. This is useful for diagnosing possible problems with the data.

Solution

The influence.measures function reports several useful statistics for identifying influential observations, and it flags the significant ones with an asterisk (*). Its main argument is the model object from your regression:

influence.measures(m)

Discussion

The title of this recipe could be “Identifying Overly Influential Observations”, but that would be redundant. All observations influence the regression model, even if only a little. When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them.

The influence.measures function reports several statistics: DFBETAS, DFFITS, covariance ratio, Cook’s distance, and hat matrix values. If any of these measures indicate that an observation is influential, the function flags that observation with an asterisk (*) along the righthand side:

influence.measures(m)
#> Influence measures of
#>   lm(formula = y2 ~ x3 + x4) :
#>
#>      dfb.1_   dfb.x3   dfb.x4    dffit cov.r   cook.d    hat inf
#> 1  -0.18784  0.15174  0.07081 -0.22344 1.059 1.67e-02 0.0506
#> 2   0.27637 -0.04367 -0.39042  0.45416 1.027 6.71e-02 0.0964
#> 3  -0.01775 -0.02786  0.01088 -0.03876 1.175 5.15e-04 0.0772
#> 4   0.15922 -0.14322  0.25615  0.35766 1.133 4.27e-02 0.1156
#> 5  -0.10537  0.00814 -0.06368 -0.13175 1.078 5.87e-03 0.0335
#> 6   0.16942  0.07465  0.42467  0.48572 1.034 7.66e-02 0.1062
etc ...

This is the model from “Diagnosing a Linear Regression”, where we suspected that observation 28 was an outlier. An asterisk is flagging that observation, confirming that it’s overly influential.
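
If you would rather extract the flagged observations programmatically than scan the printed table, a short sketch is:

infl <- influence.measures(m)
summary(infl)                        # prints only the potentially influential rows
which(apply(infl$is.inf, 1, any))    # row numbers of the flagged observations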

This recipe can identify influential observations, but you shouldn’t reflexively delete them. Some judgment is required here. Are those observations improving your model or damaging it?

See Also

See “Diagnosing a Linear Regression”. Use help(influence.measures) to get a list of influence measures and some related functions. See a regression textbook for interpretations of the various influence measures.

Testing Residuals for Autocorrelation (Durbin–Watson Test)

Problem

You have performed a linear regression and want to check the residuals for autocorrelation.

Solution

The Durbin–Watson test can check the residuals for autocorrelation. The test is implemented by the dwtest function of the lmtest package:

library(lmtest)
m <- lm(y ~ x)           # Create a model object
dwtest(m)                # Test the model residuals

The output includes a p-value. Conventionally, if p < 0.05 then the residuals are significantly correlated whereas p > 0.05 provides no evidence of correlation.

You can perform a visual check for autocorrelation by graphing the autocorrelation function (ACF) of the residuals:

acf(residuals(m))        # Plot the ACF of the model residuals

Discussion

The Durbin–Watson test is often used in time series analysis, but it was originally created for diagnosing autocorrelation in regression residuals. Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.

This first example builds a simple regression model and then tests the residuals for autocorrelation. The test returns a p-value well above zero, which indicates that there is no significant autocorrelation:

library(lmtest)
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
load(file = './data/ac.rdata')
m <- lm(y1 ~ x)
dwtest(m)
#>
#>  Durbin-Watson test
#>
#> data:  m
#> DW = 2, p-value = 0.4
#> alternative hypothesis: true autocorrelation is greater than 0

This second example exhibits autocorrelation in the residuals. The p-value is near 0, so the autocorrelation is likely positive:

m <- lm(y2 ~ x)
dwtest(m)
#>
#>  Durbin-Watson test
#>
#> data:  m
#> DW = 2, p-value = 0.01
#> alternative hypothesis: true autocorrelation is greater than 0

By default, dwtest performs a one-sided test and answers this question: Is the autocorrelation of the residuals greater than zero? If your model could exhibit negative autocorrelation (yes, that is possible), then you should use the alternative option to perform a two-sided test:

dwtest(m, alternative = "two.sided")

The Durbin–Watson test is also implemented by the durbinWatsonTest function of the car package. We suggested the dwtest function primarily because we think the output is easier to read.

See Also

Neither the lmtest package nor the car package is included in the standard distribution of R; see recipes @ref(recipe-id013) “Accessing the Functions in a Package” and @ref(recipe-id012) “Installing Packages from CRAN”. See recipes @ref(recipe-id082) X-X and X-X for more regarding tests of autocorrelation.

Predicting New Values

Problem

You want to predict new values from your regression model.

Solution

Save the predictor data in a data frame. Use the predict function, setting the newdata parameter to the data frame:

load(file = './data/pred2.rdata')

m <- lm(y ~ u + v + w)
preds <- data.frame(u = 3.1, v = 4.0, w = 5.5)
predict(m, newdata = preds)
#>  1
#> 45

Discussion

Once you have a linear model, making predictions is quite easy because the predict function does all the heavy lifting. The only annoyance is arranging for a data frame to contain your data.

The predict function returns a vector of predicted values with one prediction for every row in the data. The example in the Solution contains one row, so predict returned one value.

If your predictor data contains several rows, you get one prediction per row:

preds <- data.frame(
  u = c(3.0, 3.1, 3.2, 3.3),
  v = c(3.9, 4.0, 4.1, 4.2),
  w = c(5.3, 5.5, 5.7, 5.9)
)
predict(m, newdata = preds)
#>    1    2    3    4
#> 43.8 45.0 46.3 47.5

In case it’s not obvious: the new data needn’t contain values for response variables, only predictor variables. After all, you are trying to calculate the response, so it would be unreasonable of R to expect you to supply it.

See Also

These are just the point estimates of the predictions. See “Forming Prediction Intervals” for the confidence intervals.

Forming Prediction Intervals

Problem

You are making predictions using a linear regression model. You want to know the prediction intervals: the range of the distribution of the prediction.

Solution

Use the predict function and specify interval="prediction":

predict(m, newdata = preds, interval = "prediction")

Discussion

This is a continuation of “Predicting New Values”, which described packaging your data into a data frame for the predict function. We are adding interval="prediction" to obtain prediction intervals.

Here is the example from “Predicting New Values”, now with prediction intervals. The new lwr and upr columns are the lower and upper limits, respectively, for the interval:

predict(m, newdata = preds, interval = "prediction")
#>    fit  lwr  upr
#> 1 43.8 38.2 49.4
#> 2 45.0 39.4 50.7
#> 3 46.3 40.6 51.9
#> 4 47.5 41.8 53.2

By default, predict uses a confidence level of 0.95. You can change this via the level argument.
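
For example, a 90% prediction interval looks like this:

predict(m, newdata = preds, interval = "prediction", level = 0.90)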

A word of caution: these prediction intervals are extremely sensitive to deviations from normality. If you suspect that your response variable is not normally distributed, consider a nonparametric technique, such as the bootstrap (Recipe X-X), for prediction intervals.

Performing One-Way ANOVA

Problem

Your data is divided into groups, and the groups are normally distributed. You want to know if the groups have significantly different means.

Solution

Use a factor to define the groups. Then apply the oneway.test function:

oneway.test(x ~ f)

Here, x is a vector of numeric values and f is a factor that identifies the groups. The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that two or more groups have significantly different means whereas a value exceeding 0.05 provides no such evidence.

Discussion

Comparing the means of groups is a common task. One-way ANOVA performs that comparison and computes the probability that they are statistically identical. A small p-value indicates that two or more groups likely have different means. (It does not indicate that all groups have different means.)

The basic ANOVA test assumes that your data has a normal distribution or that, at least, it is pretty close to bell-shaped. If not, use the Kruskal–Wallis test instead (“Performing Robust ANOVA (Kruskal–Wallis Test)”).

We can illustrate ANOVA with stock market historical data. Is the stock market more profitable in some months than in others? For instance, a common folk myth says that October is a bad month for stock market investors.1 We explored this question by creating a data frame, GSPC_df, containing two columns, r and mon. The column r holds the daily returns of the Standard & Poor’s 500 index, a broad measure of stock market performance. The factor mon indicates the calendar month in which that return occurred: Jan, Feb, Mar, and so forth. The data covers the period 1950 through 2009.

The one-way ANOVA shows a p-value of 0.03347:

load(file = './data/anova.rdata')
oneway.test(r ~ mon, data = GSPC_df)
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and mon
#> F = 2, num df = 10, denom df = 7000, p-value = 0.03

We can conclude that stock market changes varied significantly according to the calendar month.

Before you run to your broker and start flipping your portfolio monthly, however, we should check something: did the pattern change recently? We can limit the analysis to recent data by specifying a subset parameter. This works for oneway.test just as it does for the lm function. The subset contains the indexes of observations to be analyzed; all other observations are ignored. Here, we give the indexes of the 2,500 most recent observations, which is about 10 years of data:

oneway.test(r ~ mon, data = GSPC_df, subset = tail(seq_along(r), 2500))
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and mon
#> F = 0.7, num df = 10, denom df = 1000, p-value = 0.8

Uh-oh! Those monthly differences evaporated during the past 10 years. The large p-value, 0.7608, indicates that changes have not recently varied according to calendar month. Apparently, those differences are a thing of the past.

Notice that the oneway.test output says “(not assuming equal variances)”. If you know the groups have equal variances, you’ll get a less conservative test by specifying var.equal=TRUE:

oneway.test(x ~ f, var.equal = TRUE)

You can also perform one-way ANOVA by using the aov function like this:

m <- aov(x ~ f)
summary(m)

However, the aov function always assumes equal variances and so is somewhat less flexible than oneway.test.

See Also

If the means are significantly different, use “Finding Differences Between Means of Groups” to see the actual differences. Use “Performing Robust ANOVA (Kruskal–Wallis Test)” if your data is not normally distributed, as required by ANOVA.

Creating an Interaction Plot

Problem

You are performing multiway ANOVA: using two or more categorical variables as predictors. You want a visual check of possible interaction between the predictors.

Solution

Use the interaction.plot function:

interaction.plot(pred1, pred2, resp)

Here, pred1 and pred2 are two categorical predictors and resp is the response variable.

Discussion

ANOVA is a form of linear regression, so ideally there is a linear relationship between every predictor and the response variable. One source of nonlinearity is an interaction between two predictors: as one predictor changes value, the other predictor changes its relationship to the response variable. Checking for interaction between predictors is a basic diagnostic.

The faraway package contains a dataset called rats. In it, treat and poison are categorical variables and time is the response variable. When plotting poison against time we are looking for straight, parallel lines, which indicate a linear relationship. However, using the interaction.plot function produces Figure 12-7 which reveals that something is not right:

library(faraway)
data(rats)
interaction.plot(rats$poison, rats$treat, rats$time)
Figure 12-7. Interaction Plot Example

Each line graphs time against poison. The difference between lines is that each line is for a different value of treat. The lines should be parallel, but the top two are not exactly parallel. Evidently, varying the value of treat “warped” the lines, introducing a nonlinearity into the relationship between poison and time.

This signals a possible interaction that we should check. For this data it just so happens that yes, there is an interaction but no, it is not statistically significant. The moral is clear: the visual check is useful, but it’s not foolproof. Follow up with a statistical check.
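
One such statistical check (a sketch, not part of the recipe) is to fit the two-way ANOVA with an interaction term and inspect the p-value for poison:treat:

m_int <- aov(time ~ poison * treat, data = rats)
summary(m_int)    # check the p-value on the poison:treat line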

Finding Differences Between Means of Groups

Problem

Your data is divided into groups, and an ANOVA test indicates that the groups have significantly different means. You want to know the differences between those means for all groups.

Solution

Perform the ANOVA test using the aov function, which returns a model object. Then apply the TukeyHSD function to the model object:

m <- aov(x ~ f)
TukeyHSD(m)

Here, x is your data and f is the grouping factor. You can plot the TukeyHSD result to obtain a graphical display of the differences:

plot(TukeyHSD(m))

Discussion

The ANOVA test is important because it tells you whether or not the groups’ means are different. But the test does not identify which groups are different, and it does not report their differences.

The TukeyHSD function can calculate those differences and help you identify the largest ones. It uses the “honest significant differences” method invented by John Tukey.

We’ll illustrate TukeyHSD by continuing the example from “Performing One-Way ANOVA”, which grouped daily stock market changes by month. Here, we group them by weekday instead, using a factor called wday that identifies the day of the week (Mon, …, Fri) on which the change occurred. We’ll use the first 2,500 observations, which roughly cover the period from 1950 to 1960:

load(file = './data/anova.rdata')
oneway.test(r ~ wday, subset = 1:2500, data = GSPC_df)
#>
#>  One-way analysis of means (not assuming equal variances)
#>
#> data:  r and wday
#> F = 10, num df = 4, denom df = 1000, p-value = 5e-10

The p-value is essentially zero, indicating that average changes varied significantly depending on the weekday. To use the TukeyHSD function, we first perform the ANOVA test using the aov function, which returns a model object, and then apply the TukeyHSD function to the object:

m <- aov(r ~ wday, subset = 1:2500, data = GSPC_df)
TukeyHSD(m)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = r ~ wday, data = GSPC_df, subset = 1:2500)
#>
#> $wday
#>              diff       lwr       upr p adj
#> Mon-Fri -0.003153 -4.40e-03 -0.001911 0.000
#> Thu-Fri -0.000934 -2.17e-03  0.000304 0.238
#> Tue-Fri -0.001855 -3.09e-03 -0.000618 0.000
#> Wed-Fri -0.000783 -2.01e-03  0.000448 0.412
#> Thu-Mon  0.002219  9.79e-04  0.003460 0.000
#> Tue-Mon  0.001299  5.85e-05  0.002538 0.035
#> Wed-Mon  0.002370  1.14e-03  0.003605 0.000
#> Tue-Thu -0.000921 -2.16e-03  0.000314 0.249
#> Wed-Thu  0.000151 -1.08e-03  0.001380 0.997
#> Wed-Tue  0.001072 -1.57e-04  0.002300 0.121

Each line in the output table includes the difference between the means of two groups (diff) as well as the lower and upper bounds of the confidence interval (lwr and upr) for the difference. The first line in the table, for example, compares the Mon group and the Fri group: the difference of their means is −0.003 with a confidence interval of (−0.0044, −0.0019).

Scanning the table, we see that the Wed-Mon comparison had the largest difference, which was 0.00237.

A cool feature of TukeyHSD is that it can display these differences visually, too. Simply plot the function’s return value to get the output shown in Figure 12-8:

plot(TukeyHSD(m))
Figure 12-8. TukeyHSD Plot

The horizontal lines plot the confidence intervals for each pair. With this visual representation you can quickly see that several confidence intervals cross over zero, indicating that the difference is not necessarily significant. You can also see that the Wed-Mon pair has the largest difference because their confidence interval is farthest to the right.

Performing Robust ANOVA (Kruskal–Wallis Test)

Problem

Your data is divided into groups. The groups are not normally distributed, but their distributions have similar shapes. You want to perform a test similar to ANOVA—you want to know if the group medians are significantly different.

Solution

Create a factor that defines the groups of your data. Use the kruskal.test function, which implements the Kruskal–Wallis test. Unlike the ANOVA test, this test does not depend upon the normality of the data:

kruskal.test(x ~ f)

Here, x is a vector of data and f is a grouping factor. The output includes a p-value. Conventionally, p < 0.05 indicates that there is a significant difference between the medians of two or more groups whereas p > 0.05 provides no such evidence.

Discussion

Regular ANOVA assumes that your data has a Normal distribution. It can tolerate some deviation from normality, but extreme deviations will produce meaningless p-values.

The Kruskal–Wallis test is a nonparametric version of ANOVA, which means that it does not assume normality. However, it does assume same-shaped distributions. You should use the Kruskal–Wallis test whenever your data distribution is nonnormal or simply unknown.

The null hypothesis is that all groups have the same median. Rejecting the null hypothesis (with p < 0.05) does not indicate that all groups are different, but it does suggest that two or more groups are different.

One year, Paul taught Business Statistics to 94 undergraduate students. The class included a midterm examination, and there were four homework assignments prior to the exam. He wanted to know: What is the relationship between completing the homework and doing well on the exam? If there is no relation, then the homework is irrelevant and needs rethinking.

He created a vector of grades, one per student, and a parallel factor that captures the number of homework assignments completed by each student. The data are in a data frame named student_data:

load(file = './data/student_data.rdata')
head(student_data)
#> # A tibble: 6 x 4
#>   att.fact hw.mean midterm hw
#>   <fct>      <dbl>   <dbl> <fct>
#> 1 3          0.808   0.818 4
#> 2 3          0.830   0.682 4
#> 3 3          0.444   0.511 2
#> 4 3          0.663   0.670 3
#> 5 2          0.9     0.682 4
#> 6 3          0.948   0.954 4

Notice that the hw variable—although it appears to be numeric—is actually a factor. It assigns each midterm grade to one of five groups depending upon how many homework assignments the student completed.
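
A quick check (just a sketch) confirms that:

class(student_data$hw)    # "factor", not "numeric"
levels(student_data$hw)   # should show the five groups of assignments completed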

The distribution of exam grades is definitely not Normal: the students have a wide range of math skills, so there are an unusually large number of A and F grades. Hence regular ANOVA would not be appropriate. Instead we used the Kruskal–Wallis test and obtained a p-value of essentially zero (3.99 × 10−5):

kruskal.test(midterm ~ hw, data = student_data)
#>
#>  Kruskal-Wallis rank sum test
#>
#> data:  midterm by hw
#> Kruskal-Wallis chi-squared = 30, df = 4, p-value = 4e-05

Obviously, there is a significant performance difference between students who complete their homework and those who do not. But what could Paul actually conclude? At first, Paul was pleased that the homework appeared so effective. Then it dawned on him that this was a classic error in statistical reasoning: he assumed that correlation implied causality. It does not, of course. Perhaps strongly motivated students do well on both homework and exams whereas lazy students do not. In that case, the causal factor is degree of motivation, not the brilliance of his homework selection. In the end, he could only conclude something very simple: students who complete the homework will likely do well on the midterm exam, but he still doesn’t really know why.

Comparing Models by Using ANOVA

Problem

You have two models of the same data, and you want to know whether they produce different results.

Solution

The anova function can compare two models and report if they are significantly different:

anova(m1, m2)

Here, m1 and m2 are both model objects returned by lm. The output from anova includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the models are significantly different whereas a value exceeding 0.05 provides no such evidence.

Discussion

In “Getting Regression Statistics”, we used the anova function to print the ANOVA table for one regression model. Now we are using the two-argument form to compare two models.

The anova function has one strong requirement when comparing two models: one model must be contained within the other. That is, all the terms of the smaller model must appear in the larger model. Otherwise, the comparison is impossible.

The ANOVA analysis performs an F test that is similar to the F test for a linear regression. The difference is that this test is between two models whereas the regression F test is between using the regression model and using no model.

Suppose we build three models of y, adding terms as we go:

load(file = './data/anova2.rdata')
m1 <- lm(y ~ u)
m2 <- lm(y ~ u + v)
m3 <- lm(y ~ u + v + w)

Is m2 really different from m1? We can use anova to compare them, and the result is a p-value of 0.009066:

anova(m1, m2)
#> Analysis of Variance Table
#>
#> Model 1: y ~ u
#> Model 2: y ~ u + v
#>   Res.Df RSS Df Sum of Sq    F Pr(>F)
#> 1     18 197
#> 2     17 130  1      66.4 8.67 0.0091 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates that the models are significantly different. Comparing m2 and m3, however, yields a p-value of 0.05527:

anova(m2, m3)
#> Analysis of Variance Table
#>
#> Model 1: y ~ u + v
#> Model 2: y ~ u + v + w
#>   Res.Df RSS Df Sum of Sq    F Pr(>F)
#> 1     17 130
#> 2     16 103  1      27.5 4.27  0.055 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is right on the edge. Strictly speaking, it does not pass our requirement to be smaller than 0.05; however, it’s close enough that you might judge the models to be “different enough.”

This example is a bit contrived, so it does not show the larger power of anova. We use anova when, while experimenting with complicated models by adding and deleting multiple terms, we need to know whether or not the new model is really different from the original one. In other words: if we add terms and the new model is essentially unchanged, then the extra terms are not worth the additional complications.

1 In the words of Mark Twain, “October: This is one of the peculiarly dangerous months to speculate in stocks in. The others are July, January, September, April, November, May, March, June, December, August and February.”

About the Authors

J.D. Long is a misplaced southern agricultural economist currently working for Renaissance Re in New York City. J.D. is an avid user of Python, R, AWS and colorful metaphors, and is a frequent presenter at R conferences as well as the founder of the Chicago R User Group. He lives in Jersey City, NJ with his wife, a recovering trial lawyer, and his 11-year-old circuit bending daughter.

Paul Teetor is a quantitative developer with Masters degrees in statistics and computer science. He specializes in analytics and software engineering for investment management, securities trading, and risk management. He works with hedge funds, market makers, and portfolio managers in the greater Chicago area.