# Key Points

## Simple Things in Python and R

• Use print(expression) to print the value of a single expression.
• Variable names may include letters, digits, ., and _, but . should be avoided, as it sometimes has special meaning.
• R’s atomic data types include logical, integer, double (also called numeric), and character.
• R stores collections in homogeneous vectors of atomic types, or in heterogeneous lists.
• ‘Scalars’ in R are actually vectors of length 1.
• Vectors are created using the function c(...), and lists using the function list(...).
• Vector indices from 1 to length(vector) select single elements.
• Negative indices to vectors deselect elements from the result.
• The index 0 on its own selects no elements, creating a vector or list of length 0.
• The expression low:high creates the vector of integers from low to high inclusive.
• Subscripting a vector with a vector of numbers selects the elements at those locations (possibly with repeats).
• Subscripting a vector with a vector of logicals selects elements where the indexing vector is TRUE.
• Values from short vectors (such as ‘scalars’) are repeated to match the lengths of longer vectors.
• The special value NA represents missing values, and (almost all) operations involving NA produce NA.
• The special value NULL represents a nonexistent vector, which is not the same as a vector of length 0.
• Use for (loop_variable in collection){ ...body... } to create a loop.
• Use if (expression) { ...body... } else if (expression) { ...body... } else { ...body... } to create conditionals.
• Expression conditions must have length 1; use any(...) and all(...) to collapse logical vectors to single values.
• Most operators and functions in R work on corresponding elements of vectors, and should be used in preference to loops.
• Use ifelse(vector_condition, values_if_true, values_if_false) in place of conditionals inside loops.
• Use function(...arguments...) { ...body... } to create a function.
• Use variable <- function(...arguments...) { ...body... } to create a function and give it a name.
• The body of a function can be a single expression or a block in curly braces.
• The last expression evaluated in a function is returned as its result.
• Use return(expression) to return a result early from a function.
• Use sapply(vector, function) to apply a function to each value in a vector in turn, returning a vector of results.
• Use sapply(vector, function(x){ ...body... }) to perform simple operations on each element of a vector.
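
Several of the points above (indexing, recycling, NA handling, ifelse, and sapply) can be seen together in a short base R sketch. The vector `values` and the other variable names are made up for illustration and are not taken from the original lessons.

```r
# Indexing: R vectors are 1-based; negative indices drop elements.
values <- c(10, 20, 30, 40, 50)
first <- values[1]                # the first element
without_first <- values[-1]       # everything except the first element
middle <- values[2:4]             # elements 2 through 4 inclusive
large <- values[values > 25]      # logical subscript keeps the TRUE positions
empty <- values[0]                # index 0 alone gives a vector of length 0

# Recycling: the 'scalar' 2 is repeated to match the longer vector.
doubled <- values * 2

# NA propagates through most operations unless it is explicitly removed.
with_missing <- c(1, NA, 3)
total <- sum(with_missing)                      # NA
total_clean <- sum(with_missing, na.rm = TRUE)  # missing value dropped first

# Vectorized conditional in place of a loop containing if/else.
labels <- ifelse(values > 25, "high", "low")

# sapply applies a function to each element and returns a vector.
squares <- sapply(values, function(x) x^2)
```

Note that `total` is NA while `total_clean` is 4; whether dropping missing values is appropriate depends on the analysis.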

## The Tidyverse

• install.packages('name') installs packages.
• library(name) (without quoting the name) loads a package.
• library(tidyverse) loads the entire collection of tidyverse libraries at once.
• read_csv(filename) reads CSV files that use the string ‘NA’ to represent missing values.
• read_csv infers each column’s data types based on the first thousand values it reads.
• A tibble is the tidyverse’s version of a data frame, which represents tabular data.
• head(tibble) and tail(tibble) inspect the first and last few rows of a tibble.
• summary(tibble) displays a summary of a tibble’s structure and values.
• tibble$column selects a column from a tibble, returning a vector as a result.
• tibble['column'] selects a column from a tibble, returning a tibble as a result.
• tibble[,c] selects column c from a tibble, returning a tibble as a result.
• tibble[r,] selects row r from a tibble, returning a tibble as a result.
• Use ranges and logical vectors as indices to select multiple rows/columns or specific rows/columns from a tibble.
• tibble[[c]] selects column c from a tibble, returning a vector as a result.
• min(...), mean(...), max(...), and sd(...) calculate the minimum, mean, maximum, and standard deviation of data.
• These aggregate functions include NAs in their calculations, and so will produce NA if the input data contains any.
• Use func(data, na.rm = TRUE) to remove NAs from data before calculations are done (but make sure this is statistically justified).
• filter(tibble, condition) selects rows from a tibble that pass a logical test on their values.
• arrange(tibble, column) or arrange(tibble, desc(column)) arranges rows according to values in a column (the latter in descending order).
• select(tibble, column, column, ...) selects columns from a tibble.
• select(tibble, -column) excludes a column from a tibble.
• mutate(tibble, name = expression, name = expression, ...) adds new columns to a tibble using values from existing columns.
• group_by(tibble, column, column, ...) groups rows that have the same values in the specified columns.
• summarize(tibble, name = expression, name = expression) aggregates tibble values (by groups if the rows have been grouped).
• tibble %>% function(arguments) performs the same operation as function(tibble, arguments).
• Use %>% to create pipelines in which the left side of each %>% becomes the first argument of the next stage.

## Cleaning Up Data

• Develop data-cleaning scripts one step at a time, checking intermediate results carefully.
• Use read_csv to read CSV-formatted tabular data into a tibble.
• Use the skip and na parameters of read_csv to skip rows and interpret certain values as NA.
• Use str_replace to replace portions of strings that match patterns with new strings.
• Use is.numeric to test if a value is a number and as.numeric to convert it to a number.
• Use map to apply a function to every element of a vector in turn.
• Use map_dfc and map_dfr to map functions across the columns and rows of a tibble.
• Pre-allocate storage in a list for each result from a loop and fill it in rather than repeatedly extending the list.

## Projects

• An R package can contain code, data, and documentation.
• R code is distributed as compiled bytecode in packages, not as source.
• R packages are almost always distributed through CRAN, the Comprehensive R Archive Network.
• Most of a project’s metadata goes in a file called DESCRIPTION.
• Metadata related to imports and exports goes in a file called NAMESPACE.
• Add patterns to a file called .Rbuildignore to ignore files or directories when building a project.
• All source code for a package must go in the R sub-directory.
• library calls in a package’s source code will not be executed when the package is loaded after distribution.
• Data can be included in a package by putting it in the data sub-directory.
• Data must be in .rda format in order to be loaded as part of a package.
• Data in other formats can be put in the inst/extdata directory, and will be installed when the package is installed.
• Add comments starting with #' to an R file to document functions.
• Use roxygen2 to extract these comments to create manual pages in the man directory.
• Use @export directives in roxygen2 comment blocks to make functions visible outside a package.
• Add required libraries to the Imports section of the DESCRIPTION file to indicate that your package depends on them.
• Use package::function to access externally-defined functions inside a package.
• Alternatively, add @import directives to roxygen2 comment blocks to make external functions available inside the package.
• Import .data from rlang and use .data$column to refer to columns instead of using bare column names.
• Create a file called R/package.R and document NULL to document the package as a whole.
• Create a file called R/dataset.R and document the string 'dataset' to document a dataset.
• Unit tests for an R package should be written with the testthat package, and should go in the tests/testthat directory.
• Tests should go in files called test_group.R and should be called test_something.
• Use test_dir to run tests from a particular directory that match a pattern.
• Write tests for data transformation steps as well as library functions.
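
As a rough illustration of the documentation and import points above, the sketch below shows what a roxygen2-documented, exported function might look like in a file under the R sub-directory. The function `center_summary` and its parameters are hypothetical; the example assumes the `stats` package is listed under Imports in DESCRIPTION.

```r
#' Summarize the center of a numeric vector
#'
#' @param x A numeric vector.
#' @param na_rm Whether to drop missing values before summarizing.
#' @return A named numeric vector holding the mean and median of `x`.
#' @export
center_summary <- function(x, na_rm = FALSE) {
  # Refer to functions from other packages as package::function
  # rather than calling library() inside package code.
  c(
    mean = mean(x, na.rm = na_rm),
    median = stats::median(x, na.rm = na_rm)
  )
}
```

The `#'` lines are ordinary comments as far as R is concerned, so the file still runs on its own; roxygen2 reads them only when the documentation is generated.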

## Intellectual Debt

• Don’t use setwd.
• R defers evaluation of a function’s arguments until they are actually used.
• Lazy evaluation is what allows R programs to use column names as if they were variables.
• The formula operator ~ delays evaluation of its operand or operands.
• ~ was created to allow users to pass formulas into functions, but is used more generally to delay evaluation.
• Some tidyverse functions define . to be the whole data, .x and .y to be the first and second arguments, and ..N to be the N’th argument.
• These convenience parameters are primarily used when the data being passed to a pipelined function needs to go somewhere other than in the first parameter’s slot.
• ‘Copy-on-modify’ means that data is aliased until something attempts to modify it, at which point it is duplicated, so that data always appears to be unchanged.
• R uses three built-in conditions called message, warning, and error to signal problems of increasing severity.
• Use the functions message, warning, and stop to signal conditions of these kinds.
• Use the functions try and tryCatch to handle errors.
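
The points about lazy evaluation and condition handling can be sketched in base R as follows. The functions `first_only` and `safe_log` are hypothetical helpers written for this illustration, not part of any package.

```r
# Lazy evaluation: `unused` is never touched inside the function,
# so the stop() passed as its value is never run.
first_only <- function(used, unused) {
  used * 2
}
lazy_result <- first_only(21, stop("this argument is never evaluated"))

# Condition handling: signal with warning() or stop(), recover with tryCatch().
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("input must be numeric")
      if (x < 0) warning("negative input")
      log(x)
    },
    warning = function(w) {
      # Report the problem with message(), the mildest condition.
      message("recovered from warning: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("recovered from error: ", conditionMessage(e))
      NA_real_
    }
  )
}
```

Calling `safe_log(1)` returns 0, while `safe_log(-1)` and `safe_log("a")` each print a message and return NA instead of stopping the program.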

## Object-Oriented Programming

• S3 is the most commonly used object-oriented programming system in R.
• Every object can store metadata about itself in attributes, which are set and queried with attr.
• The dim attribute stores the dimensions of a matrix (which is physically stored as a vector).
• The class attribute of an object defines its class or classes (it may have several character entries).
• When F(X, ...) is called, and X has class C, R looks for a function called F.C (the . is just a naming convention).
• If an object has multiple classes in its class attribute, R looks for a corresponding method for each in turn.
• Every user-defined class C should have functions new_C (to create it), validate_C (to validate its integrity), and C (to create and validate).
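
A minimal sketch of the constructor, validator, and helper pattern described above, using a made-up class called `temperature`. The class name, its fields, and the validation rules are assumptions chosen for illustration only.

```r
# new_C: low-level constructor that sets the class attribute.
new_temperature <- function(degrees = double(), unit = "C") {
  stopifnot(is.double(degrees), is.character(unit))
  structure(list(degrees = degrees, unit = unit), class = "temperature")
}

# validate_C: checks the integrity of an existing object.
validate_temperature <- function(x) {
  if (!x$unit %in% c("C", "K")) {
    stop("unit must be 'C' or 'K'")
  }
  if (x$unit == "K" && any(x$degrees < 0)) {
    stop("kelvin values cannot be negative")
  }
  x
}

# C: user-facing helper that creates and validates in one step.
temperature <- function(degrees, unit = "C") {
  validate_temperature(new_temperature(as.double(degrees), unit))
}

# A method for the existing generic format(): because the object has
# class "temperature", format(x) dispatches to format.temperature.
format.temperature <- function(x, ...) {
  paste0(x$degrees, " ", x$unit)
}

t1 <- temperature(21.5)
formatted <- format(t1)
```

Here `attr(t1, "class")` is "temperature", and `temperature(-5, "K")` stops with an error because the validator rejects it.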
