F Key Points

F.1 Values and Vectors

  • Use print(expression) to print the value of a single expression.
  • Variable names may include letters, digits, ., and _, but . should be avoided, as it sometimes has special meaning.
  • R’s atomic data types include logical, integer, double (also called numeric), and character.
  • R stores collections in homogeneous vectors of atomic types, or in heterogeneous lists.
  • ‘Scalars’ in R are actually vectors of length 1.
  • Vectors are created using the function c(...), and lists using the function list(...).
  • Vector indices from 1 to length(vector) select single elements.
  • Negative indices to vectors deselect elements from the result.
  • The index 0 on its own selects no elements, creating a vector or list of length 0.
  • The expression low:high creates the vector of integers from low to high inclusive.
  • Subscripting a vector with a vector of numbers selects the elements at those locations (possibly with repeats).
  • Subscripting a vector with a vector of logicals selects elements where the indexing vector is TRUE.
  • Values from short vectors (such as ‘scalars’) are repeated to match the lengths of longer vectors.
  • The special value NA represents missing values, and (almost all) operations involving NA produce NA.
  • The special value NULL represents a nonexistent vector, which is not the same as a vector of length 0.
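
The points about vectors can be tried directly at the console. The following is a minimal sketch; the variable names answer and v are made up for illustration.

```r
# Every 'scalar' is a vector of length 1.
answer <- 42L              # the trailing L makes this an integer
typeof(answer)             # "integer"
length(answer)             # 1

# c(...) combines values into a homogeneous vector.
v <- c(10, 20, 30, 40)

v[2]                       # 20: indices start at 1
v[-1]                      # 20 30 40: negative indices deselect
v[0]                       # numeric(0): a vector of length 0
v[c(1, 1, 4)]              # 10 10 40: repeats are allowed
v[v > 15]                  # 20 30 40: logical indexing
2:4                        # 2 3 4: low:high is inclusive

# Recycling: the shorter vector is repeated to match the longer one.
v + 1                      # 11 21 31 41
v * c(1, 2)                # 10 40 30 80

# NA propagates through almost all operations.
NA + 1                     # NA
NA & FALSE                 # FALSE: one of the exceptions

# NULL is not the same as an empty vector.
is.null(NULL)              # TRUE
is.null(numeric(0))        # FALSE
```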

F.2 Indexing

  • A list is a heterogeneous vector capable of storing values of any type (including other lists).
  • Indexing with [ returns a structure of the same type as the structure being indexed (e.g., returns a list when applied to a list).
  • Indexing with [[ strips away one level of structure (i.e., returns the indicated element without any wrapping).
  • Use list('name' = value, ...) to name the elements of a list.
  • Use either L[['name']] or L$name to access an element by name (L['name'] returns a list of length 1 that contains it).
  • Use back-quotes around the name with $ notation if the name is not a legal R variable name.
  • Use matrix(values, nrow = N) to create a matrix with N rows containing the given values.
  • Use m[i, j] to get the value at the i’th row and j’th column of a matrix.
  • Use m[i,] to get a vector containing the values in the i’th row of a matrix.
  • Use m[,j] to get a vector containing the values in the j’th column of a matrix.
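
The difference between [ and [[ is easiest to see side by side. This is an illustrative sketch; the list person and the matrix m are invented for the example.

```r
person <- list('name' = 'Ada', 'year' = 1815, 'first language' = 'English')

person['name']             # still a list (of length 1)
person[['name']]           # "Ada": the element itself
person$year                # 1815
person$`first language`    # back-quotes for a name that is not a legal variable name

# matrix() fills column by column by default.
m <- matrix(1:6, nrow = 2)
m[2, 3]                    # 6
m[1, ]                     # 1 3 5
m[, 2]                     # 3 4
```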

F.3 Control Flow

  • Use for (loop_variable in collection) { ...body... } to create a loop.
  • Use if (expression) { ...body... } else if (expression) { ...body... } else { ...body... } to create conditionals.
  • The condition of an if must have length 1; use any(...) and all(...) to collapse logical vectors to single values.
  • Use function(...arguments...) { ...body... } to create a function.
  • Use variable <- function(...arguments...) { ...body... } to create a function and give it a name.
  • The body of a function can be a single expression or a block in curly braces.
  • The last expression evaluated in a function is returned as its result.
  • Use return(expression) to return a result early from a function.
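
A short sketch that combines these constructs follows. The functions sign_labels and double_it and the vector readings are hypothetical names used only for illustration.

```r
# Label each element of a numeric vector by its sign.
sign_labels <- function(values) {
  if (length(values) == 0) {
    return(character(0))   # early return for empty input
  }
  labels <- character(length(values))
  for (i in seq_along(values)) {
    if (values[i] < 0) {
      labels[i] <- 'negative'
    } else if (values[i] == 0) {
      labels[i] <- 'zero'
    } else {
      labels[i] <- 'positive'
    }
  }
  labels                   # the last expression evaluated is the result
}

sign_labels(c(-1, 0, 2))   # "negative" "zero" "positive"

# A condition must have length 1, so collapse logical vectors first.
readings <- c(3, -2, 5)
if (any(readings < 0)) {
  message('at least one reading is negative')
}
all(readings > 0)          # FALSE

# A function body can be a single expression without braces.
double_it <- function(x) x * 2
```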

F.4 The Tidyverse

  • install.packages('name') installs packages.
  • library(name) (without quoting the name) loads a package.
  • library(tidyverse) loads the entire collection of tidyverse libraries at once.
  • read_csv(filename) reads CSV files that use the string ‘NA’ to represent missing values.
  • read_csv infers each column’s data type based on the first thousand values it reads.
  • A tibble is the tidyverse’s version of a data frame, which represents tabular data.
  • head(tibble) and tail(tibble) inspect the first and last few rows of a tibble.
  • summary(tibble) displays a summary of a tibble’s structure and values.
  • tibble$column selects a column from a tibble, returning a vector as a result.
  • tibble['column'] selects a column from a tibble, returning a tibble as a result.
  • tibble[,c] selects column c from a tibble, returning a tibble as a result.
  • tibble[r,] selects row r from a tibble, returning a tibble as a result.
  • Use ranges and logical vectors as indices to select multiple rows/columns or specific rows/columns from a tibble.
  • tibble[[c]] selects column c from a tibble, returning a vector as a result.
  • min(...), mean(...), max(...), and sd(...) calculate the minimum, mean, maximum, and standard deviation of data.
  • These aggregate functions include NAs in their calculations, and so will produce NA if the input data contains any.
  • Use func(data, na.rm = TRUE) to remove NAs from data before calculations are done (but make sure this is statistically justified).
  • filter(tibble, condition) selects rows from a tibble that pass a logical test on their values.
  • arrange(tibble, column) or arrange(tibble, desc(column)) arranges rows according to values in a column (the latter in descending order).
  • select(tibble, column, column, ...) selects columns from a tibble.
  • select(tibble, -column) excludes a column from a tibble, keeping all the others.
  • mutate(tibble, name = expression, name = expression, ...) adds new columns to a tibble using values from existing columns.
  • group_by(tibble, column, column, ...) groups rows that have the same values in the specified columns.
  • summarize(tibble, name = expression, name = expression) aggregates tibble values (by groups if the rows have been grouped).
  • tibble %>% function(arguments) performs the same operation as function(tibble, arguments).
  • Use %>% to create pipelines in which the left side of each %>% becomes the first argument of the next stage.
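
The sketch below strings several of these verbs into one pipeline. It assumes the dplyr and tibble packages are installed; the measurements tibble is made-up data, not taken from any real file.

```r
library(dplyr)
library(tibble)

# Made-up data for illustration only.
measurements <- tribble(
  ~site, ~year, ~reading,
  'A',   2019,  1.5,
  'A',   2020,  NA,
  'B',   2019,  3.0,
  'B',   2020,  5.0
)

measurements$reading        # a vector
measurements['reading']     # a one-column tibble
measurements[['reading']]   # a vector again

mean(measurements$reading)                 # NA, because one value is missing
mean(measurements$reading, na.rm = TRUE)   # mean of the three remaining values

# Each %>% passes its left side as the first argument of the next stage.
by_site <- measurements %>%
  filter(!is.na(reading)) %>%
  mutate(doubled = reading * 2) %>%
  group_by(site) %>%
  summarize(mean_reading = mean(reading), n = n()) %>%
  arrange(desc(mean_reading))

by_site   # one row per site, highest mean first
```

Dropping the missing reading with filter is only a choice made for this example; whether removing NAs is justified depends on the data.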

F.5 Cleaning Up Data

  • Develop data-cleaning scripts one step at a time, checking intermediate results carefully.
  • Use read_csv to read CSV-formatted tabular data into a tibble.
  • Use the skip and na parameters of read_csv to skip rows and interpret certain values as NA.
  • Use str_replace to replace portions of strings that match patterns with new strings.
  • Use is.numeric to test if a value is a number and as.numeric to convert it to a number.
  • Use map to apply a function to every element of a vector in turn.
  • Use map_dfc and map_dfr to map a function and combine the results into a tibble by columns or by rows respectively.
  • Pre-allocate storage in a list for each result from a loop and fill it in rather than repeatedly extending the list.
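
A small cleaning script built one step at a time might look like the sketch below. It assumes readr, stringr, and purrr are installed; the file contents, column names, and the use of '-' for missing values are invented for the example.

```r
library(readr)
library(stringr)
library(purrr)

# Made-up raw file: a title row to skip, '-' for missing values,
# and percentages stored as text.
raw_path <- tempfile(fileext = '.csv')
writeLines(c('Survey results (draft)',
             'country,rate',
             'Alpha,12%',
             'Beta,-',
             'Gamma,7%'),
           raw_path)

# Step 1: read, skipping the title row and treating '-' as NA.
raw <- read_csv(raw_path, skip = 1, na = '-', col_types = 'cc')

# Step 2: strip the percent sign, then convert to numbers.
stripped <- str_replace(raw$rate, '%', '')
rates <- as.numeric(stripped)
is.numeric(rates)             # TRUE

# map returns a list with one result per element.
name_lengths <- map(raw$country, str_length)

# Pre-allocate the result list, then fill it in.
results <- vector('list', length(rates))
for (i in seq_along(rates)) {
  results[[i]] <- rates[i] / 100
}
```

Checking raw, stripped, and rates after each step is the point of the exercise: a surprise is much easier to trace when only one transformation has been added since the last check.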

F.6 Testing and Error Handling

  • Operations signal conditions in R when errors occur.
  • The three built-in levels of conditions are messages, warnings, and errors.
  • Programs can signal these themselves using the functions message, warning, and stop.
  • Operations can be placed in a call to the function try to suppress errors, but this is a bad idea.
  • Operations can be placed in a call to the function tryCatch to handle errors.
  • Use testthat to write unit tests for R.
  • Put unit tests for an R package in the tests/testthat directory.
  • Put tests in files called test_group.R and call them test_something.
  • Use test_dir to run the tests in a particular directory whose file names match a pattern.
  • Write tests for data transformation steps as well as library functions.
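
The following sketch shows the three kinds of condition, a tryCatch handler, and a testthat test. The function safe_sqrt is hypothetical, and the example assumes testthat is installed.

```r
library(testthat)

# Hypothetical function that can signal an error, a warning, and a message.
safe_sqrt <- function(x) {
  if (!is.numeric(x)) {
    stop('x must be numeric')
  }
  if (any(x < 0, na.rm = TRUE)) {
    warning('negative values replaced with NA')
    x[which(x < 0)] <- NA
  }
  message('taking square roots of ', length(x), ' value(s)')
  sqrt(x)
}

# tryCatch handles the error explicitly instead of hiding it.
result <- tryCatch(
  safe_sqrt('four'),
  error = function(e) {
    message('recovered from: ', conditionMessage(e))
    NA_real_
  }
)

# In a package, this test would live in a file under tests/testthat.
test_that('safe_sqrt handles good and bad input', {
  expect_equal(suppressMessages(safe_sqrt(c(4, 9))), c(2, 3))
  expect_error(safe_sqrt('four'), 'must be numeric')
  expect_warning(suppressMessages(safe_sqrt(-1)), 'negative')
})
```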

F.7 Non-Standard Evaluation

  • R uses lazy evaluation: expressions are evaluated when their values are needed, not before.
  • Use expr to create an expression without evaluating it.
  • Use eval to evaluate an expression in the context of some data.
  • Use enquo to create a quosure containing an unevaluated expression and its environment.
  • Use quo_get_expr to get the expression out of a quosure.
  • Use !! to splice the expression in a quosure into a function call.
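
These tools can be exercised without any data-frame library. The sketch below assumes the rlang package is installed; lazy, capture, and the data frame are invented names.

```r
library(rlang)

# Lazy evaluation: an argument that is never used is never evaluated.
lazy <- function(used, unused) used * 2
lazy(21, stop('never evaluated'))    # 42, and no error

# expr captures code without running it; eval runs it against some data.
e <- expr(a + b)
eval(e, list(a = 1, b = 2))          # 3

# enquo captures an argument's expression together with its environment.
capture <- function(x) {
  enquo(x)
}
q <- capture(height * 2)
quo_get_expr(q)                      # height * 2

# !! inserts the captured expression into a larger call.
call_with_mean <- expr(mean(!!quo_get_expr(q)))   # mean(height * 2)
data <- data.frame(height = c(1, 2, 3))
eval(call_with_mean, data)           # 4
eval_tidy(q, data)                   # 2 4 6
```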

F.8 Object-Oriented Programming

  • S3 is the most commonly used object-oriented programming system in R.
  • Every object can store metadata about itself in attributes, which are set and queried with attr.
  • The dim attribute stores the dimensions of a matrix (which is physically stored as a vector).
  • The class attribute of an object defines its class or classes (it may have several character entries).
  • When F(X, ...) is called, and X has class C, R looks for a function called F.C (the . is just a naming convention).
  • If an object has multiple classes in its class attribute, R looks for a corresponding method for each in turn.
  • Every user-defined class C should have functions new_C (to create it), validate_C (to validate its integrity), and C (to create and validate).
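
A minimal S3 sketch follows. The class name temperature, the generic describe, and the second class reading are all hypothetical.

```r
# A matrix is a vector with a dim attribute.
values <- 1:6
attr(values, 'dim') <- c(2, 3)
is.matrix(values)                 # TRUE

# Constructor, validator, and user-facing helper for a class 'temperature'.
new_temperature <- function(degrees) {
  stopifnot(is.double(degrees))
  structure(degrees, class = 'temperature')
}

validate_temperature <- function(t) {
  if (any(unclass(t) < -273.15, na.rm = TRUE)) {
    stop('temperature below absolute zero', call. = FALSE)
  }
  t
}

temperature <- function(degrees) {
  validate_temperature(new_temperature(as.double(degrees)))
}

# A generic and its methods, named generic.class by convention.
describe <- function(x, ...) UseMethod('describe')
describe.temperature <- function(x, ...) {
  paste(format(unclass(x)), 'degrees C')
}
describe.default <- function(x, ...) 'something else'

describe(temperature(21.5))       # "21.5 degrees C"
describe(1:3)                     # "something else"

# With several classes, R tries each in turn: there is no describe.reading,
# so describe.temperature is used.
tagged <- structure(3.0, class = c('reading', 'temperature'))
describe(tagged)                  # "3 degrees C"
```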

F.9 Intellectual Debt

  • Don’t use setwd.
  • The formula operator ~ delays evaluation of its operand or operands.
  • ~ was created to allow users to pass formulas into functions, but is used more generally to delay evaluation.
  • Some tidyverse functions define . to be the whole data, .x and .y to be the first and second arguments, and ..N to be the N’th argument.
  • These convenience parameters are primarily used when the data being passed to a pipelined function needs to go somewhere other than in the first parameter’s slot.
  • ‘Copy-on-modify’ means that data is aliased until something attempts to modify it, at which point it is duplicated, so that the original data always appears to be unchanged.
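
The sketch below illustrates delayed evaluation with ~, the . and .x/.y conventions, and copy-on-modify. It assumes purrr and magrittr are installed; all variable names are made up.

```r
library(purrr)
library(magrittr)

# A formula holds its operands unevaluated, so an undefined name is not an error.
f <- ~ undefined_name + 1
class(f)                                  # "formula"

# purrr turns a formula into a function; .x and .y are its first two arguments.
map_dbl(c(1, 2, 3), ~ .x * 10)            # 10 20 30
map2_dbl(c(1, 2), c(10, 20), ~ .x + .y)   # 11 22

# In a pipeline, . sends the incoming value somewhere other than the first slot.
3 %>% rep('a', times = .)                 # "a" "a" "a"

# Copy-on-modify: the alias shares data until one side is changed.
original <- c(1, 2, 3)
alias <- original        # no copy is made yet
alias[1] <- 100          # the copy happens here
original                 # 1 2 3: unchanged
```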

F.10 Projects

  • An R package can contain code, data, and documentation.
  • R code is distributed as compiled bytecode in packages, not as source.
  • R packages are almost always distributed through CRAN, the Comprehensive R Archive Network.
  • Most of a project’s metadata goes in a file called DESCRIPTION.
  • Metadata related to imports and exports goes in a file called NAMESPACE.
  • Add patterns to a file called .Rbuildignore to ignore files or directories when building a project.
  • All source code for a package must go in the R sub-directory.
  • library calls in a package’s source code will not be executed as the package is loaded after distribution.
  • Data can be included in a package by putting it in the data sub-directory.
  • Data must be in .rda format in order to be loaded as part of a package.
  • Data in other formats can be put in the inst/extdata directory, and will be installed when the package is installed.
  • Add comments starting with #' to an R file to document functions.
  • Use roxygen2 to extract these comments to create manual pages in the man directory.
  • Use @export directives in roxygen2 comment blocks to make functions visible outside a package.
  • Add required libraries to the Imports section of the DESCRIPTION file to indicate that your package depends on them.
  • Use package::function to access externally-defined functions inside a package.
  • Alternatively, add @import directives to roxygen2 comment blocks to make external functions available inside the package.
  • Import .data from rlang and use .data$column to refer to columns instead of using bare column names.
  • Create a file called R/package.R and document NULL to document the package as a whole.
  • Create a file called R/dataset.R and document the string dataset to document a dataset.
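
A documented, exported function in a package might look like the sketch below. The file name, the function summarize_readings, and the column names are hypothetical; the example assumes dplyr is listed under Imports in DESCRIPTION and that roxygen2 is used to generate the man pages and NAMESPACE.

```r
# R/summarize_readings.R in a hypothetical package

#' Summarize readings by site
#'
#' @param readings A data frame with columns `site` and `reading`.
#' @return A tibble with one row per site.
#' @importFrom rlang .data
#' @export
summarize_readings <- function(readings) {
  # package::function avoids calling library() inside package code.
  grouped <- dplyr::group_by(readings, .data$site)
  # .data$column refers to a column without using a bare name.
  dplyr::summarize(grouped,
                   mean_reading = mean(.data$reading, na.rm = TRUE))
}
```

Outside a package the function can still be sourced and called directly, which is a convenient way to try it before building.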

F.11 Web Applications with Shiny

  • Every Shiny application has a user interface, a server, and a call to shinyApp that connects them.
  • Every Shiny application must be in its own directory.
  • Images and other static assets must be in that directory’s www sub-directory.
  • The inputId and outputId attributes of UI elements are used to refer to them from the server.
  • Use input$name and output$name in the server to refer to UI elements.
  • Code placed at the top of the script outside functions is run once when the app launches.
  • Code placed inside server is run once for each user.
  • Code placed inside a handler is run once on each change.
  • A reactive variable is a function whose value changes automatically whenever anything it depends on changes.
  • Use reactive({...}) to create a reactive variable explicitly.
  • The server can change UI elements via the session variable.
  • Use uiOutput and renderUI to (re-)create UI elements as needed in order to break circular dependencies.
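
A minimal application with one input, one reactive variable, and one output is sketched below. It assumes the shiny package is installed and uses the built-in mtcars data; the IDs column and histogram are made up.

```r
library(shiny)

# Code at the top level runs once when the app launches.
choices <- c('mpg', 'hp', 'wt')

ui <- fluidPage(
  selectInput(inputId = 'column', label = 'Column', choices = choices),
  plotOutput(outputId = 'histogram')
)

server <- function(input, output, session) {
  # Code here runs once for each user session.

  # A reactive variable: recomputed whenever input$column changes.
  selected <- reactive({
    mtcars[[input$column]]
  })

  # A handler: re-run each time something it depends on changes.
  output$histogram <- renderPlot({
    hist(selected(), main = input$column)
  })
}

# shinyApp connects the user interface and the server.
app <- shinyApp(ui = ui, server = server)

# To launch the app interactively:
# runApp(app)
```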

F.12 Reticulate

  • The reticulate library allows R programs to access data in Python programs and vice versa.
  • Use py$whatever to access a top-level Python variable from R.
  • Use r.whatever to access a top-level R definition from Python.
  • R is always indexed from 1 (even in Python) and Python is always indexed from 0 (even in R).
  • Numbers in R are floating point by default, so use a trailing ‘L’ to force a value to be an integer.
  • A Python script run from an R session believes it is the main script, i.e., __name__ is '__main__' inside the Python script.
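
The R side of these points can be sketched as follows. The example assumes reticulate is installed and can find a Python interpreter; the Python names numbers, echo, and module_name are made up.

```r
library(reticulate)

# Top-level Python variables are visible in R through the py object.
py_run_string('numbers = [10, 20, 30]')
py$numbers[1]              # 10: indexed from 1 on the R side

# Numbers in R are doubles unless they have a trailing L.
typeof(2)                  # "double"
typeof(2L)                 # "integer"

# Python needs an integer repeat count here, so pass 2L rather than 2.
py_run_string('def echo(text, times): return text * times')
py$echo('ab', 2L)          # "abab"

# Python code run from the R session sees itself as the main script.
py_run_string('module_name = __name__')
py$module_name             # "__main__"
```

Passing 2 instead of 2L to echo would hand Python a floating-point number, which it refuses to use as a repeat count.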
