☞   Proposed exercises

Installation: R and RStudio

R is a free software language and environment for statistical analyses.

☞ Install R from http://cran.r-project.org and choose one of these options:

  • Download R for Linux
  • Download R for (Mac) OS X
  • Download R for Windows

When the installation is finished, you should see an “R” icon on your desktop.
Clicking on this icon opens the standard R interface. However, we recommend the use of the RStudio interface:

☞ Install RStudio from http://www.rstudio.org/

The usual RStudio screen has four windows.

  1. Console (bottom left)
  2. Editor or script window (top left)
  3. Workspace and history (top right)
  4. Files, plots, packages and help (bottom right)

The first time that you use RStudio, only windows 1, 3 and 4 are shown. To open
window 2, open or create a script file: File -> New File -> R Script

You can type commands in the console window after the prompt (>) and press ENTER to execute the command line. The output appears immediately below
the command line in the console window.

However, it is usually more convenient to use R scripts and type commands in the editor window
(see the section “Using scripts” below).

☞ Read Section 2.3 “Rstudio layout” from document “A (very) short introduction to R” by P. Torfs and C. Brauer

Using scripts

An R script is a text file where you type the commands that you want to execute. The R script is where you keep a record of your work. Using R scripts is very convenient because all R commands used in a session are saved in the script file and can be executed again in a future session. Script files have extension .R, for instance, “myscript.R”.

To create a new R script go to

File -> New File -> R Script

To open an existing R script go to

File -> Open File... -> select your script

To run a single line from the script window, place the cursor anywhere on the line and click Run, or press CTRL+ENTER on Windows/Linux or Command+Enter on Mac.

You can execute part of a script (or the whole script) by selecting the corresponding lines and clicking Run or
pressing CTRL+ENTER or Command+Enter.

☞ Read Section 5 “Scripts” from document “A (very) short introduction to R”
by P. Torfs and C. Brauer

Simple computations with R

R can be used as a calculator. You can just type your equation and execute the
command:

Example: Addition of two values

3 + 6
## [1] 9
# The output is preceded by an index between brackets: [1] marks the first element of the output

Calculating the perimeter of the circumference with radius 3

2 * 3.1416 * 3
## [1] 18.8496

Variables

You can assign a number to a name with the backward arrow <-:

x <- 3

Now “x” is called a variable and it appears in the workspace window, which means
that R stores the value of “x” in its memory and it can be used later.

In general, by using the backward arrow <- , you can assign a value to an object.

If you type the name of a variable, the current value of the variable will be printed

x
## [1] 3

There are variables that are already defined in R, like variable “pi”

pi
## [1] 3.141593

Calculating the perimeter of the circumference with radius 3

2 * pi * x
## [1] 18.84956

Changing the value of radius and reusing the code

x <- 5
2 * pi * x
## [1] 31.41593

Remarks

  1. R is case sensitive
A <- 33
a <- 44
A
## [1] 33
a
## [1] 44
  2. The tag # indicates a comment

R functions

An R function is used by typing its name followed by its arguments (also called parameters) between parentheses.

Example:
seq( ) is a function for generating a sequence of numbers. Its arguments are from, the first number of the sequence; to, the last number of the sequence; and by, the increment of the sequence.

seq(10,80, 2) # generates a sequence from 10 to 80 with increment 2
##  [1] 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54
## [24] 56 58 60 62 64 66 68 70 72 74 76 78 80
The number between brackets at the beginning of each line of the output indicates the position of the first element of each row: 10 is the first element of the output and 56 is the 24th element of the output
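Arguments can be passed by position, as above, or by name. The following sketch uses only the three seq( ) arguments described above (the object names a, b and c1 are arbitrary) and checks that both styles give the same sequence:

```r
# Positional and named arguments produce the same sequence
a <- seq(10, 80, 2)
b <- seq(from = 10, to = 80, by = 2)
c1 <- seq(by = 2, to = 80, from = 10)  # named arguments can be given in any order
identical(a, b)
identical(a, c1)
```

Naming the arguments makes a call easier to read, especially for functions with many arguments.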

Help and documentation

help( ) or ? : provides information about a function or an object

Example:

help(mean)
?mean

help.search( ): provides information about a topic

Example:

help.search("logistic regression")

List and remove objects

ls() Lists objects in current environment

rm(x) Removes object x in current environment

rm(list = ls()) Removes all objects from current environment
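A short illustration of these functions, using two objects (radius and area) created only for this example:

```r
# Create two objects, list them, then remove one of them
radius <- 3
area <- pi * radius^2
ls()                   # the listing includes "area" and "radius"
rm(radius)             # removes only the object "radius"
"radius" %in% ls()     # FALSE: "radius" is no longer in the workspace
"area" %in% ls()       # TRUE: "area" is still there
```

Note that rm(list = ls()) cannot be undone, so it is best used at the beginning of a script rather than in the middle of a session.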

Working directory

It is important to specify your working directory, where your files and scripts are
stored, using the function setwd( ).

Example:

setwd("c:/mydir") # note: use / instead of \

You can check which is your current working directory:

getwd()

Data structures

Vector

A vector is a sequence of data elements of the same type, numerical, character strings or logical. You can create a vector with the function c( )

c( ): R function for creating a vector

Example:

x<-c(1, 3, -1, 2)
y<-c("a", "b", "c")
x
## [1]  1  3 -1  2
y
## [1] "a" "b" "c"

length( ): provides the number of elements of a vector

seq(from=, to=, by=): creates a vector containing a sequence of numbers

rep(x, times=): creates a vector by repeating a sequence a number of times

Examples

seq(10,50,2)
##  [1] 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
rep(c(1,0,3),4)
##  [1] 1 0 3 1 0 3 1 0 3 1 0 3
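As a quick check, length( ) can be applied to the vectors created above:

```r
x <- c(1, 3, -1, 2)
length(x)                    # 4 elements
length(seq(10, 50, 2))       # 21 elements
length(rep(c(1, 0, 3), 4))   # 3 values repeated 4 times: 12 elements
```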

☞ Perform Exercises 1 to 4, p. 4 of the document “R: A self-learn tutorial”

Accessing the elements of a vector

Elements in a vector are addressed by bracket indexing [ ]:

Examples:
Consider a vector x with 10 components

x<-c(-1, 2, 7, 3, -4, 5, 1, 0, -2, 1)
x[3] #returns the third element of vector x
## [1] 7

You can get consecutive elements of a vector by using :

x[2:5] #returns the elements in positions 2, 3, 4 and 5 of vector x
## [1]  2  7  3 -4

You can get non consecutive elements of a vector by using function c():

x[c(1,3,7)] #returns the elements in positions 1, 3 and 7 of vector x
## [1] -1  7  1

You can exclude elements of a vector with the sign - before the indices you want to exclude

x[-c(2,5,8)] #returns all the elements in x except those in positions 2, 5 and 8
## [1] -1  7  3  5  1 -2  1

☞ Read section 3 from document “R: A self-learn tutorial” and perform
Exercises 1 and 2, p. 5

Assignment of new values

You can modify the value of an element in a vector with <-:

Example:

x[3]<-5  # replaces the third element of x with 5
x  # the third element of x was 7 and is now 5
##  [1] -1  2  5  3 -4  5  1  0 -2  1

Example:
In this example we want to replace the negative elements of x with 99. To do this, we specify the condition x<0 between brackets

x[x<0]<-99  # assigns 99 to the elements of x that are <0
x  # the elements of x that were negative have been replaced with 99
##  [1] 99  2  5  3 99  5  1  0 99  1

When programming new functions you may need to create an empty vector.

x<-numeric() # This creates an empty numeric vector
x<-NULL # This creates an empty object without specifying the data type

You can create logical vectors

Example:

x<-c(1, 3, -1, 2)
x>0   # answers TRUE or FALSE to the condition x>0 for each element in x
## [1]  TRUE  TRUE FALSE  TRUE
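A logical vector can itself be used between brackets to select elements, which is what happens in expressions such as x[x<0]. A minimal sketch (the name positive is arbitrary):

```r
x <- c(1, 3, -1, 2)
positive <- x > 0   # logical vector: TRUE TRUE FALSE TRUE
x[positive]         # keeps only the elements where the condition is TRUE
sum(positive)       # TRUE counts as 1, so this is the number of positive elements
```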

☞ Read http://www.r-tutor.com/r-introduction/vector

Matrix

A matrix is a collection of data elements of the same type arranged in a two-dimensional rectangular layout.

matrix( ): R function for creating a matrix

Example:
By default, matrices are filled by columns

x<-matrix(c(1,2,3,4,5,6,7,8), ncol=4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

Example:
Matrices can also be filled by rows by specifying byrow = TRUE:

x<-matrix(c(1,2,3,4,5,6,7,8), nrow=2, byrow = TRUE)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8

Dimension of a matrix

dim( ): provides the number of rows and number of columns of a matrix

dim(x)
## [1] 2 4

Accessing elements of a matrix

Elements of a matrix are addressed with brackets by specifying the row and the column of the element: [row,column]

Examples:

x[2,4] # the element in the second row and the fourth column of matrix x
## [1] 8
x[2, ] # vector containing all the elements of the second row of matrix x
## [1] 5 6 7 8
x[ ,4] # vector containing all the elements of the fourth column of matrix x
## [1] 4 8
x[2 ,1:3] # vector containing the elements in columns 1 to 3 of the second row of matrix x
## [1] 5 6 7
x[,c(2,4)] # matrix containing the second and fourth columns of matrix x
##      [,1] [,2]
## [1,]    2    4
## [2,]    6    8
x[,-2] # matrix containing all columns in x except the second one
##      [,1] [,2] [,3]
## [1,]    1    3    4
## [2,]    5    7    8

☞ Read section 4 from document “R: A self-learn tutorial” and perform
Exercises 1 and 2, p. 8

Adding new columns or rows to a matrix

Using cbind( ) or rbind( ) you can also create a matrix by putting together
several vectors in columns or in rows, respectively

Example: Let x, y and z denote three vectors of length 5:

x<-c(1, 2, 3, 4, 5)
y<-c(0, 1, -1, 2, 1)
z<-c(-2, -3, 0, 1, 2)
cbind(x,y,z) # binds the 3 vectors as columns, the result is a matrix with 5 rows and 3 columns
##      x  y  z
## [1,] 1  0 -2
## [2,] 2  1 -3
## [3,] 3 -1  0
## [4,] 4  2  1
## [5,] 5  1  2
rbind(x,y,z) # binds the 3 vectors as rows, the result is a matrix with 3 rows and 5 columns
##   [,1] [,2] [,3] [,4] [,5]
## x    1    2    3    4    5
## y    0    1   -1    2    1
## z   -2   -3    0    1    2
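The same two functions can be used to add a column or a row to a matrix that already exists. The following is a sketch with made-up values; the names M, M2, M3 and w are used only for this illustration:

```r
M <- cbind(x = c(1, 2, 3, 4, 5), y = c(0, 1, -1, 2, 1))  # matrix with 5 rows and 2 columns
w <- c(10, 20, 30, 40, 50)      # new column: one value per row of M
M2 <- cbind(M, w)               # adds w as a third column
M3 <- rbind(M2, c(6, 0, 60))    # adds a new row: one value per column of M2
dim(M2)
dim(M3)
```

The length of the new vector should match the number of rows (for cbind) or columns (for rbind) of the matrix; otherwise R recycles the values, usually with a warning.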

Operations with matrices

A<-matrix(1:12, ncol=4) # By default, the elements of the matrix are filled by columns
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
B<-matrix(rep(1:3,4),ncol=4)
B
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
C<-matrix(1:16, ncol=4, byrow=T)  # When byrow=T, the elements of the matrix are filled by rows
C
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
A+B # sum element by element
##      [,1] [,2] [,3] [,4]
## [1,]    2    5    8   11
## [2,]    4    7   10   13
## [3,]    6    9   12   15
A+C # error: the sum requires that dim(A)=dim(C)
A*B # product element by element
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    4   10   16   22
## [3,]    9   18   27   36
A*C # error: the product element by element requires that dim(A)=dim(C)
A%*%C # matrix product: each element is the scalar product of a row of the first matrix and a column of the second matrix
##      [,1] [,2] [,3] [,4]
## [1,]  214  236  258  280
## [2,]  242  268  294  320
## [3,]  270  300  330  360
A%*%B # error: matrix product requires that ncol(A)=nrow(B)
t(A) # transposed matrix of A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
A^2 # each element of A is squared
##      [,1] [,2] [,3] [,4]
## [1,]    1   16   49  100
## [2,]    4   25   64  121
## [3,]    9   36   81  144
C%*%C # matrix product of C by C
##      [,1] [,2] [,3] [,4]
## [1,]   90  100  110  120
## [2,]  202  228  254  280
## [3,]  314  356  398  440
## [4,]  426  484  542  600

☞ Read http://www.r-tutor.com/r-introduction/matrix

Data frame

A data frame is a very important data type in R since data files for statistical analysis are read and stored in data frames (see below Reading and writing data files).

A data frame is a table where the top line of the table, called the header, contains
the column names. Each column of a data frame is a vector, thus, the elements of
each column should be of the same type but the elements of different columns can
be of different type.

data.frame( ): R function for creating a data frame

When a data frame is printed, the name of the row (or the number of the row) is
shown at the beginning of the row.

Example: The following script creates a data frame called “students” with 2
columns: name and age:

name<-c("John","Paul","Mary")
age<-c(21,23,32)
students<-data.frame(name, age)
students
##   name age
## 1 John  21
## 2 Paul  23
## 3 Mary  32

The same data frame could also be created with the following code:

students<-data.frame(name=c("John","Paul","Mary"),age=c(21,23,32))

Functions for data frames

nrow( ): number of rows of a data frame

ncol( ): number of columns of a data frame

rownames( ): vector containing the row names of a data frame

colnames( ): vector containing the column names of a data frame

head( ): shows the first rows of a data frame

tail( ): shows the last rows of a data frame
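Applied to the “students” data frame defined above (recreated here so that the example runs on its own), these functions give:

```r
students <- data.frame(name = c("John", "Paul", "Mary"), age = c(21, 23, 32))
nrow(students)       # 3 rows
ncol(students)       # 2 columns
colnames(students)   # "name" "age"
rownames(students)   # "1" "2" "3": by default the row names are the row numbers
head(students, 2)    # first 2 rows
tail(students, 1)    # last row
```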

Accessing the elements of a data frame

Elements of a data frame are addressed with [row,column]

Examples:

students[1,2]   # element in row 1 and column 2 in data frame "students"
## [1] 21
students[1,"age"]   # element in row 1 and column "age" in data frame "students"
## [1] 21

Retrieving the columns of a data frame

The columns (variables) of a data frame can be referred to in different ways:

1) name1$name2 where name1 is the name of the data frame and name2 is the name of the column

Example:

students$age # we should specify both, the name of the data frame, "students" and the name of the variable, "age"
## [1] 21 23 32

2) name1[,"name2"] where name1 is the name of the data frame and name2 is the name of the column

Example:

students[,"age"]
## [1] 21 23 32

3) name1[,num] where name1 is the name of the data frame and num is the column number

Example:

students[,2]
## [1] 21 23 32

4) name1[[num]] where name1 is the name of the data frame and num is the column number

Example:

students[[2]]
## [1] 21 23 32

The columns of a data frame can also be referred to directly by name, without specifying the data frame name, provided that the data frame has previously been attached with function attach( ).
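A minimal sketch of attaching a data frame, using the “students” example (the name mean.age is arbitrary). The function with( ), a base R alternative not covered in this text, is shown for comparison because it avoids leaving the data frame attached:

```r
students <- data.frame(name = c("John", "Paul", "Mary"), age = c(21, 23, 32))
attach(students)       # the columns can now be used directly by name
mean.age <- mean(age)  # "age" is found inside the attached data frame
detach(students)       # undo the attach when finished
mean.age
with(students, mean(age))  # evaluates the expression inside the data frame without attaching it
```

If an object with the same name as a column already exists in the workspace, the workspace object takes precedence, which is a common source of confusion when using attach( ).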

☞ Read http://www.r-tutor.com/r-introduction/data-frame

☞ More info on R data frames:  https://www.gastonsanchez.com/intro2cwd/dataframes.html

Subsetting

You can create a new data frame consisting of a subset of the data in a different
data frame by indexing the required restrictions.

Example: Creating a new data frame named “students2” which is a subset of
“students” but only includes those students younger than 30 years old.

students2<-students[students$age<30, ] # the subset is created by specifying the restriction: only those rows (individuals) with age<30 are selected
students2
##   name age
## 1 John  21
## 2 Paul  23

List

A list is a generic vector that can contain objects of different types.

list( ): R function for creating a list

Example

Let's create a list containing three objects, a character vector x, a matrix M and a numerical vector y.

x<-c("a", "b", "c")
M<-matrix(1:4, 2)
y<-c(1.3, 2.5, 5.7, 1.8, 3.1)
mylist<-list(letters=x, mymatrix=M, numbers=y)  # we assign new names to x, M and y

The objects in a list are called components; this tutorial also refers to them as attributes. For a list with named components, function attributes() provides the names of the elements contained in the list.

Example

“mylist” contains three objects or attributes:

attributes(mylist)
## $names
## [1] "letters"  "mymatrix" "numbers"

Accessing the elements in a list with simple brackets

If you use single brackets [ ] you obtain a list slice, that is, a new list

Example

mylist[2]   # This provides a new list containing one object, matrix M, the second attribute of "mylist"
## $mymatrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
mylist[3]   # This provides a new list containing vector "numbers", the third attribute of "mylist"
## $numbers
## [1] 1.3 2.5 5.7 1.8 3.1
mylist[2:3]   # This provides a new list containing two objects, the second and third attributes of "mylist"
## $mymatrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## $numbers
## [1] 1.3 2.5 5.7 1.8 3.1

Accessing the elements in a list with double brackets

If you use double brackets [[ ]] you obtain the object itself.

Example

mylist[[2]] # This provides matrix M
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Compare objects mylist[2] and mylist[[2]] using class( ), an R function that provides the type of an object

class(mylist[2]) # mylist[2] is a list containing a matrix
## [1] "list"
class(mylist[[2]]) # mylist[[2]] is a matrix (since R 4.0.0 the output is "matrix" "array")
## [1] "matrix"

If we want to retrieve the vector “numbers” we should use double brackets

mylist[[3]]  # This provides vector "numbers"
## [1] 1.3 2.5 5.7 1.8 3.1

Compare the use of simple and double brackets to obtain the mean of vector “numbers”

mean(mylist[3])  # This gives a warning and an NA (not available) result because function mean() expects a vector, not a list, as its argument
## Warning in mean.default(mylist[3]): argument is not numeric or logical:
## returning NA
## [1] NA
mean(mylist[[3]])  # This is correct because the argument mylist[[3]] of mean() is a vector
## [1] 2.88

Accessing the elements in a list with the dollar symbol

An alternative to double brackets for accessing an object in a list or data frame is the syntax name1$name2, where name1 is the name of the list and name2 is the name of the object or attribute.

Example

mylist$mymatrix # This provides matrix M
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Retrieving objects from the output of a statistical procedure

The output of most statistical procedures is stored as a list. The objects in the output can be retrieved using double brackets [[]]

Example

We want to retrieve the residuals of a linear regression model:

x<-c(1, 2, 3, 4, 5)
y<-c(1.3, 2.4, 1.6, 2.3, 2.1)
m1<-lm(y~x)
attributes(m1)  # this provides the names of the attributes of model m1
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "xlevels"       "call"          "terms"         "model"
##
## $class
## [1] "lm"

The residuals of the regression model are the second attribute of m1 and can be retrieved as follows

m1[[2]]
##     1     2     3     4     5
## -0.34  0.61 -0.34  0.21 -0.14

An alternative syntax for obtaining the residuals is

m1$residuals
##     1     2     3     4     5
## -0.34  0.61 -0.34  0.21 -0.14

☞ Read http://www.r-tutor.com/r-introduction/list

The output of some statistical procedures is stored as an S4 object. To access the slots of an S4 object you should use @ instead of $.
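To illustrate the @ operator without depending on any particular package, the following sketch defines a small S4 class. The class “Measurement” and its slots are hypothetical and exist only for this example:

```r
library(methods)  # provides setClass() and new(); usually loaded by default
# Hypothetical S4 class with two slots, defined only to illustrate @
setClass("Measurement", slots = c(value = "numeric", unit = "character"))
m <- new("Measurement", value = 21.5, unit = "kg/m2")
isS4(m)        # TRUE: m is an S4 object
slotNames(m)   # names of the slots: "value" "unit"
m@value        # slot access with @
m@unit
```

For S4 objects, slotNames( ) plays the role that the names of a list play in the previous examples.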

Data types and factor variables

Data types

There are several R data types, the most important are numeric, integer, logical and character.

☞ Read http://www.r-tutor.com/r-introduction/basic-data-types

By default, any number is treated as numeric data, even when the number only
indicates a category of a categorical variable.

Factor variables factor( )

It is recommended to convert any categorical variable into a factor
using function factor( ). This function assigns an integer 1, 2, …, k (called levels)
to each category of the original variable where k is the number of different
categories.

Example:

The following script creates a data frame called “example” with 3
variables: Y (diseased indicator), treatment (0=placebo, 1=drug) and bmi
(body mass index)

Y<-c("diseased", "not-diseased", "not-diseased", "diseased","not-diseased")
treatment<-c(0, 0, 0, 1, 1)
bmi<-c(21.52, 22.73, 21.89, 20.17, 24.13)
example <-data.frame(Y, treatment, bmi)

We can check the data types of the three variables with function class( ):

class(Y)
## [1] "character"
class(treatment)
## [1] "numeric"
class(bmi)
## [1] "numeric"

Variable treatment is numeric and would be analyzed as a continuous variable. We should transform this variable to a factor by specifying its levels (categories) and labels (explanation of each category):

example$treatment<-factor(example$treatment, levels=c(0,1), labels=c("placebo", "drug"))
example$treatment
## [1] placebo placebo placebo drug    drug
## Levels: placebo drug

Character data, such as Y, can also be transformed to a factor variable. If levels are not specified, by default, the levels of the categories are assigned according to the alphabetic or numeric order. For variable Y, level 1 will be “diseased” and level 2 will be “not-diseased”:

example$Y<-factor(Y)
example$Y
## [1] diseased     not-diseased not-diseased diseased     not-diseased
## Levels: diseased not-diseased

In most statistical procedures, the category with level=1 is taken as the reference
category. If we want to change the reference category we will specify the levels
of the variables with the first level corresponding to the reference category.

Example:

example$Y<-factor(Y, levels=c("not-diseased", "diseased"))
example$Y
## [1] diseased     not-diseased not-diseased diseased     not-diseased
## Levels: not-diseased diseased

Now, the reference category for Y is “not-diseased”
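The integer codes behind a factor can be inspected with levels( ) and as.integer( ). The sketch below also uses relevel( ), a base R function not mentioned above, as an alternative way of changing only the reference category; the names f and f2 are arbitrary:

```r
Y <- c("diseased", "not-diseased", "not-diseased", "diseased", "not-diseased")
f <- factor(Y)
levels(f)        # "diseased" "not-diseased": alphabetical order by default
as.integer(f)    # integer codes: 1 2 2 1 2
f2 <- relevel(f, ref = "not-diseased")   # makes "not-diseased" the first level
levels(f2)       # "not-diseased" "diseased"
as.integer(f2)   # the codes are now 2 1 1 2 1
```

The data values do not change; only the order of the levels, and therefore the integer code assigned to each category, is different.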

☞ Read http://www.statmethods.net/input/valuelabels.html

Reading and writing data files

You can import your data from a file into R without coding, using RStudio: File -> Import Dataset -> From … (text, Excel, SPSS, SAS or Stata).

Alternatively, you can use R coding:

 

txt files

read.table( ): this function reads a text data file and stores the values into a data frame

header=T : the first row of the data file contains the names of the columns or variables
sep="" : the values of each row are separated by white space (one or more spaces or tabs); this is the default
sep="\t" : the values of each row are separated by a tab
sep="," : the values of each row are separated by a comma
dec="." : the decimal symbol is a point

Example

example <- read.table("treatment.txt", header=F, sep="")
# This instruction reads file "treatment.txt" and creates the dataframe "example"

write.table( ): this function writes a data frame into a text file

Example

write.table(treatment, file="treatment.txt", row.names=FALSE)
# This instruction writes the dataframe "treatment" in file "treatment.txt"
# row.names=FALSE prevents R from printing the names of the rows (or just the row numbers) in the output file
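The two functions can be checked together by writing a small data frame and reading it back. This is a sketch with made-up data; the file is created with tempfile( ) so that nothing in the working directory is overwritten:

```r
df <- data.frame(id = 1:3, bmi = c(21.52, 22.73, 21.89))  # made-up data
f <- tempfile(fileext = ".txt")        # path of a temporary file
write.table(df, file = f, row.names = FALSE, sep = "\t")
df2 <- read.table(f, header = TRUE, sep = "\t", dec = ".")
df2
all.equal(df, df2)   # TRUE if the data frame was recovered unchanged
unlink(f)            # delete the temporary file
```

Because write.table( ) writes the column names in the first row by default, the file is read back with header = TRUE.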

☞ Read Section 8 “Reading and writing data files” from document “A (very) short introduction to R”
by P. Torfs and C. Brauer and do the proposed
exercise at the end of the section.

csv files

One convenient format for storing data files is csv, “comma separated values”, since csv files can be opened easily in Excel

read.csv( ): this function reads a csv file and stores the values in a data frame

Example:

Example <- read.csv("example.csv", header=T, sep=",", dec=".")

write.csv( ): this function writes a data frame into a csv file

Example

write.csv(treatment, file="treatment.csv", row.names=FALSE)
# This instruction writes the dataframe "treatment" in file "treatment.csv"
# row.names=FALSE prevents R from printing the names of the rows (or just the row numbers) in the output file

Excel files

You can read Excel files with function read.xlsx( ) from the package openxlsx. In the next section you will learn how to install this package.

Example:

library(openxlsx) # load the package (it must be installed first)
data <- read.xlsx(xlsxFile = "my_data.xlsx", sheet = "sheet2", skipEmptyRows = T)

 

Libraries and packages

All R functions and datasets are stored in packages.

Libraries/Packages

R comes with a standard set of packages. Others are available for download and
installation.

Standard packages: The standard (or base) packages are considered part of the R
source code. They contain the basic functions that allow R to work, and the
datasets and standard statistical and graphical functions. They should be
automatically available in any R installation.

Contributed packages: There are thousands of contributed packages for R,
written by many different authors. Some of these packages implement specialized
statistical methods. Some (the recommended packages) are distributed with every
binary distribution of R.

Most packages are available for download from CRAN and other repositories such as
Bioconductor, a large repository of tools for the analysis and comprehension of
high-throughput genomic data.

Once installed, a package has to be loaded into the session to be used.

Install packages from CRAN

Using an R package or library for the first time requires two steps: installing the library and loading the library with the following functions:

install.packages(): Install the package

library(): load the package

Example: How to install and load the library “survival”?

install.packages("survival") # you only need to do this once
library(survival) # load library

Install packages from Bioconductor

Bioconductor http://www.bioconductor.org/ provides tools for the analysis and
comprehension of high-throughput genomic data.

New Bioconductor (R version 3.6.0 or newer). Install a Bioconductor package:

The new Bioconductor uses BiocManager to install packages.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(): install Bioconductor packages

library(): load library

Old Bioconductor (R version older than 3.6.0). Install a Bioconductor package:

Old Bioconductor uses biocLite:

source("http://bioconductor.org/biocLite.R"): executes biocLite.R.

biocLite(): install bioconductor packages

library(): load library

Example: Installing “Biostrings” from new Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings") # install Biostrings
library(Biostrings) # load library Biostrings

Example: Installing “Biostrings” from old Bioconductor:

source("http://bioconductor.org/biocLite.R") # executes biocLite.R.
biocLite("Biostrings") # install Biostrings
library(Biostrings) # load library Biostrings

Basic statistical functions

We will describe the principal statistical functions using the following example:

Y<-c("diseased", "not-diseased", "not-diseased", "diseased","not-diseased")
treatment<-c(0, 0, 0, 1, 1)
bmi<-c(21.52, 22.73, 21.89, 20.17, 24.13)
treatment<-factor(treatment, levels=c(0,1), labels=c("placebo", "drug"))

Summary statistics

Summary statistics of a numerical variable

min(bmi) # minimum value of bmi
max(bmi) # maximum value of bmi
mean(bmi) # mean or average bmi
median(bmi) # median bmi
sd(bmi) # standard deviation of bmi
quantile(bmi, 0.25) # first quartile for bmi
quantile(bmi, 0.75) # third quartile for bmi
quantile(bmi, 0.95) # 95% quantile for bmi
IQR(bmi) # interquartile range of bmi
summary(bmi)  # provides the most important summary statistics for bmi

Summary statistics of a character or factor variable

table(treatment)  # frequency table
prop.table(table(treatment))  # Table of relative frequencies
100*prop.table(table(treatment))   # Table of percentages

Correlation coefficient between two numerical variables

Example

x<-c(2, 4, 1, 3, 6, 5)
y<-c(3, 5, 2, 2, 6, 3)
cor(x,y)  # Pearson correlation coefficient
cor(x,y, method="spearman")  # Spearman correlation coefficient

Test statistics

t-test for the equality of two means from independent samples

The t-test is appropriate under the assumption of normally distributed data

t.test(bmi~treatment)   # by default variances are assumed different
t.test(bmi~treatment, var.equal=T)   # variances are assumed equal

If we only want to retrieve the p-value of the test

t.test(bmi~treatment, var.equal=T)$p.value
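The result of t.test( ) is a list, so its other components can be retrieved in the same way as the p-value. A sketch with the example data (the name res is arbitrary):

```r
bmi <- c(21.52, 22.73, 21.89, 20.17, 24.13)
treatment <- factor(c(0, 0, 0, 1, 1), levels = c(0, 1), labels = c("placebo", "drug"))
res <- t.test(bmi ~ treatment, var.equal = TRUE)
names(res)       # names of the components stored in the result
res$statistic    # value of the t statistic
res$conf.int     # confidence interval for the difference of means
res$estimate     # mean bmi in each group
```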

F test for the equality of two variances

var.test(bmi~treatment)

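Wilcoxon test for comparing two independent samples

The Wilcoxon rank-sum test is the nonparametric alternative to the t-test when the data are not normally distributed. It compares the location of the two distributions using ranks instead of comparing the means directly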

wilcox.test(bmi~treatment)

t-test for paired samples

x1<-c(2.7, 3.4, 1.8, 3.2, 6.4, 1.5)
x2<-c(3.1, 5.3, 2.1, 2.7, 6.8, 3.1)
t.test(x1-x2, mu=0)   # one-sample t-test on the differences; equivalent to t.test(x1, x2, paired=T)

Wilcoxon test for paired samples

wilcox.test(x1,x2, paired=T)

One-factor ANOVA for testing the equality of more than two means

The ANOVA is appropriate under the assumption of normally distributed data and homoscedasticity (equal variances)

y<-c(2.7, 3.4, 1.8, 3.2, 6.4, 1.5, 3.1, 5.3, 2.1, 2.7, 6.8, 3.1)
x<-factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3))   # the grouping variable must be a factor; a numeric x would be fitted as a single regression slope
summary(aov(y~x))

Kruskal-Wallis test for comparing more than two independent samples

The Kruskal-Wallis test is the nonparametric alternative to the ANOVA when the data are not normally distributed

kruskal.test(y~x)

Regression models

Linear regression

The outcome variable is continuous

y<-c(2.7, 3.4, 1.8, 3.2, 6.4, 1.5, 3.1, 5.3, 2.1, 2.7, 6.8, 3.1)
x1<-c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
x2<-c(2, 3, 8, 2, 6, 1, 1, 3, 2, 2, 8, 1)
summary(lm(y~x1+x2))
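If the fitted model is stored in an object, individual results can be retrieved from it instead of only printing the summary. A sketch with the same data (the name fit is arbitrary):

```r
y <- c(2.7, 3.4, 1.8, 3.2, 6.4, 1.5, 3.1, 5.3, 2.1, 2.7, 6.8, 3.1)
x1 <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
x2 <- c(2, 3, 8, 2, 6, 1, 1, 3, 2, 2, 8, 1)
fit <- lm(y ~ x1 + x2)
coef(fit)                   # estimated intercept and slopes
summary(fit)$r.squared      # proportion of the variance of y explained by the model
summary(fit)$coefficients   # table with estimates, standard errors, t values and p-values
```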

Logistic regression

The outcome variable is dichotomous

y<-c(rep(1, 6), rep(0,6))
x1<-c(5.3, 2.1, 5.6, 2.1, 3.2, 1.2, 1.2, 3.1, 1.3, 2.4, 1.5, 2.6)
x2<-c(2, 3, 8, 2, 6, 1, 1, 3, 2, 2, 8, 1)
summary(glm(y~x1+x2, family=binomial))

Graphs with R

Principal graphical representations

The general function for creating graphs in R is plot( )

Useful links with examples on creating graphs from Quick-R website http://www.statmethods.net:

Histograms hist()

Density plots density()

Dot plots dotchart()

Bar plots barplot()

Line charts lines()

Pie charts pie()

Box plots boxplot()

Scatter plots plot() (or scatterplot() from package car)

Graphical parameters par()

R Graphical Parameters Cheat Sheet from Gaston Sanchez blog

Lattice graphs

ggplot2 graphs

Saving plots to a file

pdf() redirects a plot to a pdf file and dev.off() stops R from redirecting new plots to files

Example

pdf("normal_distribution.pdf")
hist(rnorm(100, mean=10, sd=3))
dev.off()
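The same pattern works for any plot. The sketch below saves a scatter plot created with plot( ); the file name is hypothetical and the file is written to R's temporary directory so that the working directory is left untouched:

```r
out <- file.path(tempdir(), "scatter_example.pdf")  # hypothetical file name
x <- c(2, 4, 1, 3, 6, 5)
y <- c(3, 5, 2, 2, 6, 3)
pdf(out, width = 6, height = 4)   # page size in inches
plot(x, y, main = "Scatter plot of y against x", xlab = "x", ylab = "y")
dev.off()                         # closes the file; without this call the pdf is incomplete
file.exists(out)
```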

Programming with R

Functions with R

You can program your own functions by specifying the function name, the
arguments or parameters between parentheses and the body of your function between curly brackets.

The body of a function may contain several lines. return() is used to specify the output of the function. If return() is not specified, R returns the value of the last evaluated expression of the function.

Basic function structure:

function.name <- function(arguments) {
  # body of the function: computations involving the arguments
}

The function is executed by function.name(arguments)

Example

The following function computes the perimeter of a circumference. The function’s
name is “Perimeter” and its argument is the radius of the circumference

Perimeter <- function(x) {
  p <- 2 * pi * x
  return(p)
}

You can use the Perimeter function to calculate the perimeter of a circumference of
any radius x

Perimeter(x = 3)
## [1] 18.84956
Perimeter(x = 5)
## [1] 31.41593

Example

Function to compute the area of a rectangle

Area <- function(x,y) {
  x*y
}

Area(10,7)
## [1] 70

Example

Function to compute the area and perimeter of a rectangle

Area.perimeter <- function(x,y) {
  area<-x*y
  perimeter<-2*x+2*y
  return(c(area,perimeter))
}

Area.perimeter(10,7)
## [1] 70 34
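When a function returns several values, a named list can make the output easier to use than an unnamed vector. The following variant is only a sketch; the name Area.perimeter2 is made up for this example:

```r
Area.perimeter2 <- function(x, y) {
  # returns both results in a named list
  list(area = x * y, perimeter = 2 * x + 2 * y)
}

res <- Area.perimeter2(10, 7)
res$area        # 70
res$perimeter   # 34
```

The results are then retrieved by name with $, as with the output of statistical procedures described earlier.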

We can specify a default value for a function argument

Example

Function to compute the power of a number. The default value is 2 if no value for the power is specified

power <- function(x,n=2) {
  x^n
}

power(10)
## [1] 100
power(10,3)
## [1] 1000

☞ Read Section 11 “Programming tools” from document “A (very) short introduction to R”
by P. Torfs and C. Brauer and perform the proposed
exercises in this section

Programming tools: if-else statement

Example

The following function contains an if-else statement and provides the absolute value of a number x

absolute.value <- function(x){
    if(x >= 0){
       x
    } else {
       -x
    }
}

absolute.value(-3.47)
## [1] 3.47

The return() statement causes a function to stop executing and returns a value to its caller immediately

Example

The following function computes the square root of a number x. It returns a message if the number is negative

sq.root <- function(x){
    if(x < 0){
       return("The number is negative, the square root does not exist. Provide a positive number")
    } else {
       sqrt(x)
    }
}

sq.root(-3.47)
## [1] "The number is negative, the square root does not exist. Provide a positive number"

There is also the function ifelse()

Example

Given a variable x, create a categorical variable with values “small” if x is smaller than or equal to the mean and “large” otherwise:

catx<- ifelse(x<=mean(x), "small", "large")
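Unlike the if-else statement, ifelse( ) evaluates the condition for every element of a vector. A self-contained sketch with made-up values for x:

```r
x <- c(2, 4, 1, 3, 6, 5)   # made-up data
mean(x)                    # 3.5
catx <- ifelse(x <= mean(x), "small", "large")
catx                       # one category per element of x
table(catx)                # frequency of each category
```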

Programming tools: for-loop

Example: The following for-loop prints the integers from 1 to 4

for (i in 1:4){   # Variable i takes values from 1 to 4 consecutively
  print(i)   # at each step of the loop the value of i is printed
}   # end of the for-loop
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Example: The following for-loop prints the names in a list

list.names<-c("Alex", "John", "Mike")
for (name in list.names){   # Variable name takes values in list.names
  print(name)   # at each step of the loop the value of name is printed
}   # end of the for-loop
## [1] "Alex"
## [1] "John"
## [1] "Mike"

Example: The following for-loop provides the sum of the first 10 positive integers

x<-0  # We initialize at 0 the variable x that will contain the sum
for (i in 1:10){   # Variable i takes values from 1 to 10 consecutively
  x<-x+i   # at each step of the loop the variable x is incremented by the value of i
}   # end of the for-loop
x   # result: x=sum of the first 10 positive integers
## [1] 55

Example: The following for-loop performs a t-test on different random generated vectors and returns the vector of p-values

set.seed(123)
pvalue.vector<-NULL # We initialize the vector that will keep the p-values. Using NULL is convenient because we don't need to specify the length or the data type of the vector
for (i in 1:5){
  x<-rnorm(n=100, mean=0, sd=1)
  p<-t.test(x, mu=0)$p.value
  pvalue.vector<-c(pvalue.vector,p) # at each step of the loop we add (concatenate with function c()) the new p-value stored in variable p to the output vector pvalue.vector
}   # end of the for-loop

pvalue.vector
## [1] 0.3243898 0.2687521 0.2076953 0.7280507 0.2872622

Example: Generate 100 values from a standard normal distribution and sum up
those values that are larger than 1.3

s<-0   # We initialize at 0 the variable s that will contain the sum
for (i in 1:100){  # Variable i takes values from 1 to 100 consecutively
  x<-rnorm(1,0,1)   # at each step of the loop a value is randomly generated from a normal distribution with mean=0 and sd=1. The random value is stored in variable x
  if (x>1.3){
    s<-s+x    # at each step of the loop, if x>1.3, the value of s is incremented by x
  }
}   # end of the for-loop
s   # result: s= sum of those random values larger than 1.3
## [1] 10.74194

For-loops combined with if statements are useful but are computationally not very efficient in R. You should always consider
whether there is alternative R code that does not require loops.

Two of the examples above could be obtained with the following code

Example: Sum of the first 10 positive integers

i<-seq(1,10,1) # generates the vector i=(1,2,3,...,10)
x<-sum(i)

Example: Generate 100 values from a standard normal distribution and sum up
those values that are larger than 1.3

x<-rnorm(100,0,1) # x is a vector of length 100 containing random values from a N(0,1) distribution
s<-sum(x[x>1.3])  # sum the values of x that are larger than 1.3
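
To check that the loop and the vectorized code compute the same quantity, both can be applied to the same vector. This is only a sketch: the seed and the names s.loop and s.vec are arbitrary choices made for illustration.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
x <- rnorm(100, 0, 1)

s.loop <- 0
for (value in x) {          # explicit loop over the elements of x
  if (value > 1.3) {
    s.loop <- s.loop + value
  }
}

s.vec <- sum(x[x > 1.3])    # vectorized version

all.equal(s.loop, s.vec)    # TRUE: both approaches give the same sum
```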

☞ More info about functions in R: https://www.gastonsanchez.com/intro2cwd/functions1.html

Avoid loops whenever possible

R is a vector-oriented language, which often allows us to avoid writing explicit
loops.

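Instead of summing two vectors a and b element by element with the following code:

s <- numeric(length(a))   # preallocate the result (the name c is avoided because c() is a base R function)
for (i in seq_along(a)) {
  s[i] <- a[i] + b[i]
}

R is designed to do all this in a single vectorized operation:

s <- a + b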

Compare the execution time of these two processes:

x <- runif(1000000)
y <- runif(1000000)
system.time(z <- x + y)
system.time(for (i in 1:length(x)) z[i] <- x[i] + y[i])

Useful functions to avoid loops

Function apply()

apply() applies the same function to every row or every column of a matrix, which often makes an explicit loop unnecessary.

This function has 3 arguments: arg1=name of the matrix, arg2=1 if the function is
to be applied to the rows or arg2=2 if the function is to be applied to the columns, arg3=function to apply

Example: Assume that x is a matrix of numbers with 4 rows and 3 columns as defined below

x<-matrix(1:12, nrow=4)
x
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
apply(x,1,mean) # computes the mean of the rows. The result is a vector of length 4 (one value per row)
## [1] 5 6 7 8
apply(x,2,mean) # computes the mean of the columns. The result is a vector of length 3 (one value per column)
## [1]  2.5  6.5 10.5
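
The third argument does not have to be a named function. As an illustrative sketch, an anonymous function can be applied to each row. For means in particular, base R also provides rowMeans() and colMeans(), which give the same results as the apply() calls above:

```r
x <- matrix(1:12, nrow = 4)

apply(x, 1, function(r) max(r) - min(r))   # range (max - min) of each row
## [1] 8 8 8 8

all.equal(apply(x, 1, mean), rowMeans(x))  # TRUE
all.equal(apply(x, 2, mean), colMeans(x))  # TRUE
```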

Function tapply()

tapply(arg1, arg2, fun) applies function “fun” to the object “arg1” separately for the different groups of indices defined by “arg2”

Example

y<-c(1, 2, 3, 4, 5, 6, 7, 8)
x<-c(1, 1, 1, 2, 2, 2, 2, 2)
tapply(y, x, sum)  # provides the sum of the elements of y for the two groups defined by x
##  1  2
##  6 30

☞ Follow: Tutorial on the Apply family of functions

Data management with R

Modifying the elements of a numeric or character variable

With <- we can modify the elements of a vector or a matrix

Example

A<-c(5, 4, 3, 2, 3)
A[A==3]<-0   # assigns a 0 to the elements of A that are equal to 3
A
## [1] 5 4 0 2 0
B<-c("a", "a", "b", "a", "b")
B[B=="a"]<-"c"    # assigns a "c" to the elements of B that are equal to "a"
B
## [1] "c" "c" "b" "c" "b"

Example

A<-c(5, 4, 3, 2, 3)
A[A<=3]<-0   # assigns a 0 to the elements of A that are smaller than or equal to 3
A
## [1] 5 4 0 0 0

Example ifelse()

A<-c(5, 4, 3, 2, 3)
A<-ifelse(A<=3,0,1)   # assigns a 0 to the elements of A that are smaller than or equal to 3 and a 1 to all the other elements
A
## [1] 1 1 0 0 0

Example

A<-c(5, 4, 3, 2, 3)
A[(A<3)|(A>4)]<- 0 #assigns a 0 to the elements of A smaller than 3 or larger than 4
A
## [1] 0 4 3 0 3

Modifying elements of a factor variable

If you try to do a similar modification on the values of a factor variable you will obtain a warning.

Example

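data<-data.frame(
  x=c(25, 23, 31, 42, 36, 24),
  y=c("male", "female", "male", "male", "female", "female"),
  stringsAsFactors=TRUE   # needed since R 4.0.0, where strings are no longer converted to factors by default
  )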

Example
We want to replace the values of y: “female” with “F” and “male” with “M”

data$y[data$y=="female"]<-"F"
## Warning in `[<-.factor`(`*tmp*`, data$y == "female", value =
## structure(c(2L, : invalid factor level, NA generated
data$y   #  The substitution has not been performed appropriately
## [1] male <NA> male male <NA> <NA>
## Levels: female male

Let’s re-create the data frame data

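data<-data.frame(
  x=c(25, 23, 31, 42, 36, 24),
  y=c("male", "female", "male", "male", "female", "female"),
  stringsAsFactors=TRUE   # needed since R 4.0.0, where strings are no longer converted to factors by default
  )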
data
##    x      y
## 1 25   male
## 2 23 female
## 3 31   male
## 4 42   male
## 5 36 female
## 6 24 female

For factor variables we cannot change the values directly; we have to change the levels of the variable:

levels(data$y)
## [1] "female" "male"
levels(data$y)<-c("F", "M")  # We assign an "F" to the first level of y and an "M" to the second level
levels(data$y)
## [1] "F" "M"
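
Assigning all levels at once depends on the order in which the levels are stored. As an alternative sketch (the standalone factor y below is made up for illustration), a single level can be renamed by name rather than by position:

```r
y <- factor(c("male", "female", "male", "male", "female", "female"))

# Rename one level at a time, selecting it by its current name
levels(y)[levels(y) == "female"] <- "F"
levels(y)[levels(y) == "male"]   <- "M"
y
## [1] M F M M F F
## Levels: F M
```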

Modifying the column names

Example
Rename the columns of data: “x” to “age” and “y” to “gender”

colnames(data)
## [1] "x" "y"
colnames(data)<-c("age","gender")
colnames(data)
## [1] "age"    "gender"

Adding and removing columns of a data frame

Example

data$z<-c(1, 2, 3, 4, 5, 6)  # Creates a new variable z in data frame "data"
data
##   age gender z
## 1  25      M 1
## 2  23      F 2
## 3  31      M 3
## 4  42      M 4
## 5  36      F 5
## 6  24      F 6
data$z<-NULL  # Removes variable z from data frame "data"
data
##   age gender
## 1  25      M
## 2  23      F
## 3  31      M
## 4  42      M
## 5  36      F
## 6  24      F

Sorting elements

sort(), order() and rank() are important functions for sorting elements

Example

x<-c(10, 100, 5)
sort(x)  # provides a new vector with the elements of x rearranged in increasing order
## [1]   5  10 100
sort(x, decreasing = TRUE)  # provides a new vector with the elements of x rearranged in decreasing order
## [1] 100  10   5
order(x)  # returns the vector of indices that puts x in increasing order
## [1] 3 1 2
x[order(x)]   # this code is equivalent to sort(x)
## [1]   5  10 100
rank(x)  # provides the ranking of the elements in x
## [1] 2 3 1

When ordering variables in a data frame it is more convenient to use the function order(), because we can retrieve the indices and apply the same permutation to the other variables in the data frame. With sort() we don’t have the information on the permutation applied to the data.

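data<-data.frame(
  x=c(25, 23, 31, 42, 36, 24),
  y=c("male", "female", "male", "male", "female", "female"),
  stringsAsFactors=TRUE   # needed since R 4.0.0, where strings are no longer converted to factors by default
  )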
data$x[order(data$x)]   # we obtain the ages in ascending order
## [1] 23 24 25 31 36 42
data$y[order(data$x)]   # we obtain the corresponding gender for each age
## [1] female female male   male   female male
## Levels: female male
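
The same permutation can be applied to all columns at once by using order() as a row index. A minimal sketch with the same data (the name sorted is chosen here only for illustration):

```r
data <- data.frame(
  x = c(25, 23, 31, 42, 36, 24),
  y = c("male", "female", "male", "male", "female", "female")
)

sorted <- data[order(data$x), ]   # reorder all rows by increasing x
sorted
##    x      y
## 2 23 female
## 6 24 female
## 1 25   male
## 3 31   male
## 5 36 female
## 4 42   male

data[order(data$x, decreasing = TRUE), ]  # same rows, by decreasing x
```

The row names (2, 6, 1, ...) show the original position of each row.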

Missing values

You can mark any value in a data set as missing by assigning NA to it (value<-NA)

Example

Replace all values equal to 9999 by NA in data frame “data”

data[data==9999]<-NA
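
A runnable sketch of this replacement, using a small made-up data frame d in which 9999 is assumed to code a missing value:

```r
# Hypothetical data frame; 9999 is used as the missing-value code
d <- data.frame(
  age    = c(25, 9999, 31),
  weight = c(70, 82, 9999)
)

d[d == 9999] <- NA   # d == 9999 is a logical matrix used as an index
d
##   age weight
## 1  25     70
## 2  NA     82
## 3  31     NA
```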

is.na() provides information about missing values

Example

data<-data.frame(
  x=c(25, NA, 31, 42, 36, NA),
  y=c("male", "female", NA, "male", "female", "female"),
  z=c(1, 2, 3, 4, 5, 6)
  )
is.na(data$x)   # provides a logical vector indicating which element is missing
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
any(is.na(data$x))  # returns TRUE when the vector contains any missing value
## [1] TRUE
sum(is.na(data$x))   # provides the number of missing values in a vector
## [1] 2
apply(data, 2, function(x) sum(is.na(x)))  # provides the number of missing values in each column of a data frame
## x y z
## 2 1 0

Example

Replace all NA by 0 in data frame “data”

data[is.na(data)]<-0

complete.cases() indicates which rows of a data frame contain no missing values. It can be used for obtaining a new data frame with only those rows with no missing value

complete.cases(data)   #   TRUE for the rows with no missing values, FALSE otherwise
## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
data[complete.cases(data),]   # new data frame with only those rows with no missing value
##    x      y z
## 1 25   male 1
## 4 42   male 4
## 5 36 female 5
na.omit(data)   # new data frame with only those rows with no missing value
##    x      y z
## 1 25   male 1
## 4 42   male 4
## 5 36 female 5

Manipulating strings

grep(pattern, x, ignore.case=FALSE, fixed=FALSE) searches for pattern in x. If fixed=FALSE then pattern is a regular expression; if fixed=TRUE then pattern is a plain text string. It returns the indices of the matching elements.

Example

x<-c("male", "male", "female", "female", "male")
grep("female", x, fixed=TRUE)
## [1] 3 4
grep("male", x, fixed=TRUE)   # all elements in x contain the expression "male"
## [1] 1 2 3 4 5
grep("^male", x)  # In regular expression syntax, ^ indicates the start of a string. This code provides the indices of those elements in x starting with "male"
## [1] 1 2 5
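
As a small sketch of two related options: the value argument of grep() returns the matching elements instead of their indices, and grepl() returns a logical vector that is convenient for counting or subsetting.

```r
x <- c("male", "male", "female", "female", "male")

grep("^male", x, value = TRUE)   # the matching elements instead of their indices
## [1] "male" "male" "male"
grepl("^male", x)                # logical vector, one value per element of x
## [1]  TRUE  TRUE FALSE FALSE  TRUE
sum(grepl("^female", x))         # number of elements starting with "female"
## [1] 2
```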

paste() concatenates characters

Example

x<-c("male", "male", "female", "female", "male")
y<-c("A", "A", "A", "B", "B")
z<-c(1, 2, 3, 4, 5)
paste(x,y,sep="-")
## [1] "male-A"   "male-A"   "female-A" "female-B" "male-B"
paste(x,z,sep=".")
## [1] "male.1"   "male.2"   "female.3" "female.4" "male.5"
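
Two further variants, sketched with shorter made-up vectors: paste0() pastes without a separator, and the collapse argument joins all the elements into one single string.

```r
x <- c("male", "female")
z <- c(1, 2)

paste0(x, z)                  # same as paste(x, z, sep = "")
## [1] "male1"   "female2"
paste(x, collapse = " + ")    # collapse joins the elements into a single string
## [1] "male + female"
```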

Data type conversion

The is.*() functions test the data type:
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()

The as.*() functions convert to a data type:
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()

x=c(25, 23, 31, 42, 36, 24)
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
as.character(x)
## [1] "25" "23" "31" "42" "36" "24"
M<-matrix(1:4, ncol=2)
vectorM<-as.vector(M)
vectorM
## [1] 1 2 3 4
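
One conversion deserves care: applying as.numeric() directly to a factor returns the internal level codes, not the values shown on screen. A short sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "10"))

as.numeric(f)                 # internal level codes, not the original values
## [1] 1 2 1
as.numeric(as.character(f))   # convert to character first to recover the values
## [1] 10 20 10
```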

☞ Follow the tutorial “Modifying Data Frames” in http://gastonsanchez.com/teaching/

☞ Read sections from http://www.statmethods.net/management/index.html

Workspace

The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions). At the end of an R
session, the user can save an image of the current workspace that is automatically
reloaded the next time R is started.

http://www.statmethods.net/interface/workspace.html

save.image()  # save the workspace to the file .RData in the current working directory (cwd)
save(object1, object2, file="myfile.RData")   # save specific objects to a file (object1 and object2 are placeholder names). If you don't specify the path, the cwd is assumed
load("myfile.RData")  # load the saved objects into the current session. If you don't specify the path, the cwd is assumed
q() # quit R. You will be prompted to save the workspace.
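
A self-contained sketch of the save() and load() cycle. The objects a and b are made up, and a temporary file is used here instead of "myfile.RData" so that nothing is written to the working directory:

```r
a <- 1:5
b <- "some text"

f <- tempfile(fileext = ".RData")  # temporary file path
save(a, b, file = f)               # save only objects a and b

rm(a, b)                           # remove them from the workspace
load(f)                            # restore a and b from the file
a
## [1] 1 2 3 4 5
```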

Miscellaneous. Useful functions

Function source()

You can execute an existing script by using the function source( )

source("scriptname.R")

Function with()

with() provides an alternative way of evaluating an R expression using the variables of a data frame

Example
Let’s consider the following data frame

example<-data.frame(
  age=c(25, 23, 31, 42, 36, 24),
  gender=c("male", "female", "male", "male", "female", "female")
  )

The usual code for obtaining the mean of variable “age” is

mean(example$age)
## [1] 30.16667

The alternative code using function with() is

with(example, mean(age))
## [1] 30.16667

The usual code for obtaining the mean of variable “age” separately by gender is

tapply(example$age, example$gender, mean)
##   female     male
## 27.66667 32.66667

The alternative code using function with() is

with(example, tapply(age, gender, mean))
##   female     male
## 27.66667 32.66667

Function which()

which() gives the indices that satisfy a logical expression

Example

age<-c(25, 23, 31, 42, 36, 24)
which(age>30)   # indices of those elements in "age" larger than 30
## [1] 3 4 5

Function list.files()

This function is useful when working with multiple data files.

list.files() returns the names of the files in a directory, by default the current working directory

list.files(path=".", pattern=NULL) The first argument, path, specifies the path to the directory to be searched, by default the current working directory, and the second argument, pattern, specifies a regular expression that the file names must match. If no pattern is specified all files are returned.

Example

In this example we store the names of all csv files in an object called “files_csv”

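files_csv<-list.files(pattern="\\.csv$")  # regular expression for file names ending in ".csv" (pattern="csv" would also match any name merely containing "csv")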
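
One possible use of such a list of file names is to read all the files and stack them into a single data frame. The sketch below is only illustrative: it creates two small made-up csv files in a temporary directory (the names dir.tmp, tables and all.data are arbitrary) and assumes that all files have the same columns.

```r
# Create a temporary directory with two small csv files (hypothetical data)
dir.tmp <- file.path(tempdir(), "csv_example")
dir.create(dir.tmp, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(dir.tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(dir.tmp, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be passed directly to read.csv()
files_csv <- list.files(path = dir.tmp, pattern = "\\.csv$", full.names = TRUE)

tables <- lapply(files_csv, read.csv)   # list with one data frame per file
all.data <- do.call(rbind, tables)      # stack them into a single data frame
all.data
```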

Function sessionInfo()

sessionInfo() provides the version of R you are running, the platform and operating system you are using, and the versions of the packages that have been loaded.

The stringsAsFactors and as.is arguments

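In R versions before 4.0.0, the default behavior when reading a data set, with for instance read.csv(), is to convert character strings into factors. In some situations this is not convenient and you can avoid this conversion by specifying the argument stringsAsFactors=FALSE within function read.csv(). Since R 4.0.0 the default is stringsAsFactors=FALSE, so strings are kept as characters unless you request the conversion.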

If you just want to prevent the conversion from string to factor of a specific column n you can specify the argument as.is=n where n is the number of the column that you want to keep as a string.

Example

data<-read.csv("file.csv", header=TRUE, stringsAsFactors = FALSE)  # prevents the conversion of all string columns to factor

data<-read.csv("file.csv", header=TRUE, as.is=3)  # prevents the conversion of column 3 from string to factor
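
A small sketch of the effect of stringsAsFactors. To keep it self-contained, the file contents are supplied through the text argument of read.csv() instead of an actual file; csv.text, d1 and d2 are made-up names.

```r
csv.text <- "id,gender\n1,male\n2,female\n3,male"   # hypothetical file contents

d1 <- read.csv(text = csv.text, stringsAsFactors = TRUE)
d2 <- read.csv(text = csv.text, stringsAsFactors = FALSE)

class(d1$gender)
## [1] "factor"
class(d2$gender)
## [1] "character"
```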

attach() and detach()

attach() adds a data frame to the search path, so that the variables in the data frame can be accessed by simply giving their names

Function detach() removes the attachment

Example

You want to calculate the mean of a variable “x” in a data frame called “data”. You would usually execute mean(data$x). Instead, you can also execute:

attach(data)
mean(x)  # Since data frame "data" is attached you don't need to specify the variable as data$x but only by x

detach(data)

Though attach() may seem to be a very useful function since it saves us from typing the name of the data frame, it should be used with caution: it may give errors, and it is easy to call it repeatedly without the corresponding call to detach(). These functions should be avoided within functions.

Documentation

Interesting references:

R Tutorial. An R Introduction to Statistics

QUICK-R

Cookbook for R

Gaston Sanchez blog

An Introduction to R from R Development Core Team

Programming in R by T.Girke

R & Bioconductor by T.Girke


This document was created with R Markdown: http://rmarkdown.rstudio.com/