1 Introduction to R

This guide is meant to be an overview of the R programming language. It is by no means comprehensive; there are many topics that deserve more explanation and others that are omitted altogether. Instead, the primary purpose of this guide is to provide you with enough knowledge of R to be able to follow the tutorials for the Econometrics for M.Sc. Students class.

Both a PDF and HTML version are provided. The PDF is better for printing, but the HTML is recommended, as it is easier to copy the code to follow along, and it looks better on computer screens.

This document will evolve, change, and be updated over time. So if something is missing in the version you have, please check out the latest version, as it may have what you’re after. Also look at the section on Mistakes and Typos for reporting errors in this guide.

1.1 Importance of this Guide for the Course (and the Exam!)

We do not expect you to be able to replicate and reproduce R code for the exam. Instead, we may show you some R code and ask you to i) interpret what the code aims to achieve, and ii) explain the results of the execution of some code. These questions will always be related to econometrics and not to programming specifically. This means that advanced knowledge of R programming will not be required.

But we do not want to teach you anything that becomes obsolete the minute you submit your final exam. The skills you learn and develop here are valuable (indeed, essential) for anyone wanting to pursue a career even slightly related to economics. At the very least, knowing how to conduct empirical work will most likely be necessary for your master’s thesis. R is a great tool for this.

1.2 R: What and Why?

R is an open-source programming language with a focus on statistical computing. Owing to its foundations in statistics, it has many useful capabilities out-of-the-box. It has a large user community that is continuously adding to its collection of external libraries and expanding the base functionality of R. New statistical estimators are usually implemented in R before they make their way into other statistical software.

Most of you come from one of the following backgrounds: i) you have no programming experience, ii) you’re familiar with Stata (or another statistical software like SAS or SPSS), iii) you’ve used other general programming languages before, or iv) you already know R! Except for those of you in iv), the rest may be wondering why we chose to use R for this course rather than (the default) Stata or something else. You may also be wondering why we even need a programming language for economics at all!

To the latter point: if you plan to work in an economics-related field after you graduate, be it in academia or in the private sector, you will need to work with data. For relatively small datasets and simple estimators, you can maybe scrape by with Excel. However, you will soon find it too slow and cumbersome for any real data work, not to mention that all your colleagues will be using something else! Even those of you who just want to create and solve models will find that you need to have a basic understanding of empirical work (even if just to check others’ work).

To the former point: there are several reasons we’ve decided to use R:

  • Free and Open source
    • Because R is freely accessible to all, it is easy to get started on your own time. This also means that new techniques and estimators are deployed regularly by dedicated users.
  • Flexibility/customize code
    • Unlike with some other statistical software (ie SPSS, Stata), R does not have a point-and-click interface. For those just beginning, that may be a little frustrating at first. But it will encourage (read: force) you to write out everything you do in your analysis. This is important when others (including your future self!) want to replicate your results. And because R has its foundations in statistics, it is possible to write clear and concise code that others will be able to understand relatively easily.
  • Follow along on your own PC/at home
    • Due to its open source nature, all of you can access, download, and install R free of charge. This makes it easy to install on your own computers and follow along at home, in the library, or in a café. With proprietary software (such as Stata), you need to buy a licence to run the software, which can be a hurdle to experimenting with the software on your own. And because learning-by-doing is especially important for programming, this openness makes learning R more enjoyable (or at least easier!).
  • Increasing popularity in industry and academia (including with economists!)
    • Perhaps owing to its foundation in open-source software, R is becoming more popular with both academics and those working in the private sector. This makes having a basic understanding of R a useful skill for your life after university. And because R has a leg in the world of statistics and a leg in the world of general-purpose programming, you can (somewhat easily) transfer your knowledge to learning Stata or some general-purpose language (eg Python), should you need to (because you’re an RA and your supervisor insists on using Stata) or want to (because you enjoy programming and find R too purpose-built).

1.3 R vs. Stata

Stata is perhaps the most natural comparison to R, at least for economists. And though the use and popularity of R is growing among economists, most empirical work in economics is still done in Stata (due to its long history with applied economists). This also makes the decision whether to use R or Stata in an introductory, graduate-level econometrics course not an easy one to make. Ideally, you would know both languages and employ the one better suited to your (or your supervisor's/co-authors'!) needs. But you need to start somewhere. And we’ve decided that R is the best starting point, for the reasons listed above.

Nonetheless, it is still useful to point out some of the main differences between the two languages:

  • Advantages of R
    • Free and open source
    • Functions are vectorized, which adds to computational efficiency
    • R can read/open multiple datasets into memory, making tasks such as merging easier (though this is now possible in the just-released Stata 16)
    • Matrix notation and algebra are simpler
    • Wider set of statistical estimators
  • Advantages of Stata
    • Supported by a company (can be good or bad)
    • Wider range of built-in estimators
    • Somewhat easier to carry out basic techniques (reg y x)
    • Still more widespread among economists

For those of you who already know Stata, you may find using R quirky at first, but the payoff in the long-run is worth it!

1.4 What is RStudio?

R is simply a programming language. It is meant to translate code into a language computers understand so they can execute programs for you. Though the language itself is essential, we also want tools to aid us in writing useful code. One can always write code in a simple text editor (eg Notepad), but it will be a very frustrating experience.

RStudio is an integrated development environment (IDE) that helps us write better code. It lets you execute code step-by-step, autocompletes code (ie offers suggestions), shows your environment (ie variables, datasets, functions), and renders plots: everything that will make our programming experience more bearable!

1.5 Learning R

As mentioned in the Introduction, this guide is far from exhaustive. This means that if you want to truly master R, you will need to find external resources. The resources recommended here are a starting point for further reading.

1.6 Understanding This Guide

All code will be surrounded with a grey box, to distinguish it from the text. For example:

5

On the other hand, the results from the execution of the code will be prefixed with “## [some_num]”, where [some_num] refers to the positional index. This will make more sense when we turn to vectors and matrices. For example, the result of executing the previous code:

## [1] 5

where 5 is the “result” of executing 5 (ie typing a number simply returns that number).

When the code and the result are provided together (as is mostly the case), they will be in the same grey box, with the result still commented out with ##:

5
## [1] 5

1.7 Terminology

It is important to make a few points of clarification regarding the terminology used when speaking about concepts in R, especially when they mean something different in normal “econometrics speak”.

1.7.1 Variable

In R, a variable refers to any object that represents “data”. Data here can be anything that is stored, be it numbers, vectors, matrices, data frames, etc. That means that a variable, \(x\), in R could be a scalar (\(x = 5\)) or a vector (\(\boldsymbol{x} = \begin{pmatrix} 3 & 6 \end{pmatrix}'\)).

Unlike how we normally think of a variable in economics or statistics, a variable in R is not random. It has a deterministic representation.

It should be clear from context whether we are talking about a variable in the R sense, or in the econometric sense (a random variable, or realizations of a random variable in a dataset, eg hourly wages).

1.7.2 Object

The term “object” is somewhat more general than “variable” in R speak (and, for that matter, any programming language). “Object” is something you will come across occasionally when reading this guide, or when reading anything about R. Put simply, everything is an object, such as data and functions. There are many other types of objects that are not important for our purposes.

Discussing objects (and Object Oriented Programming) is beyond the scope of what’s required of us. For those interested, you can get an overview at this Data Mentor article.

1.8 Installing R

R can be installed from the R-Project’s website. It will ask you to choose a mirror (ie a server) from which to download R. The easiest is to go to the “cloud” mirror, which will automatically choose the best server for you: Cloud R-Project.

You will need to choose your platform and then the release. As of Oct 10, 2019, the most current version is 3.6.1. Follow the installer’s instructions on how to install R.

1.9 Installing RStudio

Once you’ve installed R, you will need (or more accurately, want) to install RStudio. You can do that from the RStudio Homepage.

1.10 Installing Packages

To get the most out of R, you will want to install external libraries to access new functionality, be it estimators, graphing engines, etc. You can install new packages in R with the install.packages command. For example, if you wanted to install the skimr package (which we will use for descriptive statistics), you would execute:

install.packages("skimr")
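Installing a package only downloads it to your computer; to actually use its functions you must also load it into your current session with the library function. A minimal sketch (it uses the built-in stats package, which ships with R, so it runs without installing anything; after installing skimr you would load it the same way with library(skimr)):

```r
# install.packages("skimr")  # run once per computer (commented out here)
library(stats)               # load an already-installed package into the session
exists("rnorm")              # its functions are now available
## [1] TRUE
```

You only need to install a package once, but you must load it in every new R session in which you want to use it.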

1.11 Getting Started

When you open RStudio for the first time, it should find the R version you previously installed. If you are having trouble finding the correct R version, look at this RStudio Support Article. The screen will look something like in Figure 1.1.

Figure 1.1: RStudio Interface

1.11.1 Console

This is where you enter the code to be executed. For example, if you wanted to know the result of \(5 + 5\), you could type 5 + 5 into the console and hit Enter. (When executing code from the source pane, you can use the “Run” button just above it, or Ctrl/Command + Enter.) The results will also be printed to the console pane.

1.11.2 Source

Using the console works well for one-off calculations, or for interactive exploration, but when we want to execute a series of commands in a row, and keep track of what we are executing, we will type everything into a text file in the source pane. The general convention is to save this text file with the .R extension.

1.11.3 Environment

When objects are created, such as variables and functions, they will be shown here.

1.11.4 Plots/Help

Whenever you graph something, or look up help on a function with ?, the results will show up in this pane.

1.11.5 Working Directory

When working with external data sets or writing out tables and figures, it is important to tell R where everything on the computer is located. To make everything as portable as possible (ie code that I write can be executed easily on another computer), references to folders and files will be relative. In this context (or more generally, in a programming context) this means relative to a root or working directory. All code will be executed at this level, all data will be stored at this level or lower, and all figures/tables will be exported to this level or lower.

To get the current working directory, one uses the getwd function.

getwd()
## [1] "/Users/bshanks/gdrive/education/5_doctoral/teaching/r-intro"

The output of this will look different on everyone’s computer. If you want to change your working directory to a different location, you use the setwd function.1

setwd("/path/to/working/directory")
setwd("C:/path/to/working/directory")
setwd("C:\\path\\to\\working\\directory")
setwd("C:\path\to\working\directory") # won't work!

1.12 Need Help?

Are you lost even after reading this guide? Does the help file for a function leave you more confused than before? Luckily there is a variety of resources on the web that should be able to address most of your concerns.

The other resources I listed in the Learning R section are quite useful. Googling your questions is also a great way to find answers to what you need. Some packages will have vignettes, which are generally easier to understand than help files, as they give concrete examples (compare the skimr help file with its vignette).

A fantastic website for finding commonly asked questions for R is Stack Overflow, whereas Cross Validated is a great site for questions and answers to statistics questions. Often when you are Googling for a solution, results from these sites will come up.

1.13 Mistakes and/or Typos

If you come across any mistakes or typos, please send me an email at brendan.shanks@econ.lmu.de! This document will be continuously updated with corrections, clarifications, and new information.

1.14 Outline

The guide is structured as follows. The Fundamentals Section covers the main functionality of R, and goes over the various data types built into R. The section on Statistical Methods describes the essential tools for conducting empirical work. The Plots Section introduces you to graphing with R. And the last Section outlines potential avenues to expand your R skillset.

2 Fundamentals

This section covers the essential elements of R that will be required for our course. This is by no means exhaustive, and there will be plenty that is omitted in the interest of time. For those interested, I highly recommend going through the suggested readings on using R for econometrics.

2.1 Basic Operators

The basic operators in R are +, -, *, and / for addition, subtraction, multiplication, and division, respectively. When used with scalars (numeric or integer types in R), they work as expected.

1+1    # addition
## [1] 2
7-3    # subtraction
## [1] 4
2*6    # multiplication
## [1] 12
10/2   # division
## [1] 5
10/0   # careful!
## [1] Inf

2.2 Built-in Functions

R comes with lots of functions already built into the program, meaning you do not need to load a library in order to access/use them. Table 2.1 lists some of these built-in functions.

Table 2.1: (Some) Built-in Functions

  • abs(v): Absolute value of \(v\) (\(|v|\))
  • sqrt(v): Square root of \(v\) (\(\sqrt{v}\))
  • exp(v): Exponential function (\(e^{v}\))
  • log(v): Natural logarithm of \(v\) (\(\ln(v)\))
  • log(v, b): Logarithm of \(v\) to base \(b\) (\(\log_{b}(v)\))

abs(5)
## [1] 5
abs(-5)
## [1] 5
sqrt(6)
## [1] 2.44949
sqrt(-6)
## Warning in sqrt(-6): NaNs produced
## [1] NaN
exp(2)
## [1] 7.389056
log(5)
## [1] 1.609438
log(exp(2))
## [1] 2
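The two-argument form of log from Table 2.1 works analogously; for example, logarithms to base 2 and base 10:

```r
log(8, 2)     # log base 2 of 8
## [1] 3
log(100, 10)  # log base 10 of 100
## [1] 2
```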

In addition, R has many other built-in functions that we will introduce as we go. In every case, the function will be “called” (ie executed) when appended with (). Looking back at the sample of built-in functions in Table 2.1, the square root function sqrt is only called when written as sqrt(v).
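To make the distinction concrete: typing a function’s name without parentheses refers to the function object itself, while appending parentheses (with any arguments inside) actually calls it.

```r
is.function(sqrt)  # sqrt without parentheses is just the function object
## [1] TRUE
sqrt(9)            # appending (9) calls the function with the argument 9
## [1] 3
```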

For all functions, built-in and loaded from libraries, you can get help on the function by typing ?function in R.
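For example, to read the documentation for the sqrt function (either of the two equivalent forms below opens the help page in the Help pane):

```r
?sqrt         # shorthand for looking up help on sqrt
help("sqrt")  # the equivalent function-call form
```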

2.3 Variable Assignment

You assign numbers, strings (text), vectors, matrices, data frames, etc. to variables using “<-”. You can also use “=”, but this is not preferred (as “==” is used to test equality).

x <- 5
y <- 7 - 3
x
## [1] 5
y
## [1] 4
z <- x + y
z
## [1] 9
sqrt(z)
## [1] 3

2.4 Data Types

Thus far the examples given have been using the numeric data type. R understands several other data types. For our purposes, these are the most important:

  • Numeric: 5, 8.4
  • Integer: 6L (The L at the end tells R that the number is an integer)
  • Character: "text"
  • Logical: TRUE or FALSE

The distinction between a numeric and integer data type is not so important for us (it mainly concerns how much memory is needed to store the data). A character data type, also known as a string, is what we would call text. A logical data type can only take one of two values: TRUE or FALSE.

If we don’t know what type an object is, we can use the function class().

class(z)
## [1] "numeric"
class(9L)
## [1] "integer"
class("What class is this?")
## [1] "character"

It is also possible to convert from one data type to another using the as. family of functions. For example, if we had a variable in our data set that is stored as a series of 1’s and 0’s, and we wanted to convert them to logicals, we could use as.logical():

as.logical(1)
## [1] TRUE
as.logical(0)
## [1] FALSE

We can also test if an element belongs to a certain data type with the is. family of functions.

is.logical(1)
## [1] FALSE
is.numeric(0)
## [1] TRUE

2.5 Data Structures

Whereas a data type refers to one “element”, the data structure refers to how these elements are arranged. These are the objects that we will operate on when carrying out statistical analyses. The most important for our purposes are:

  • (Atomic) Vector
  • Matrix
  • List
  • Data Frame
  • Factors

2.6 Vectors

Vectors are the most important of the data structures. In fact, everything we have dealt with so far is a vector:

is.vector(z)
## [1] TRUE
is.vector(TRUE)
## [1] TRUE
is.vector("This is a vector too?!")
## [1] TRUE

This may be somewhat confusing, as we may think of a vector as being a collection of numbers or variables. But the logic still applies here. So far we can think of each “vector” as a vector of length 1 (ie a scalar2). We can see this by applying the length() function, which returns the length of the object.

length(5)
## [1] 1

We now turn to vectors that are longer than one. To create a vector in R, we use the c() operator, with each element separated by a comma:

vec <- c(7,32,4)
vec
## [1]  7 32  4
length(vec)
## [1] 3

Vectors are always flat (ie one dimensional), thus:

new_vec <- c(vec, c(6,3,1))
new_vec
## [1]  7 32  4  6  3  1

But it is important to note that vectors can only store one data type (eg logicals). This can have some unexpected results without warning!

other_vec <- c("element", "next one", vec)
other_vec
## [1] "element"  "next one" "7"        "32"       "4"
class(other_vec)
## [1] "character"

Our vec vector has been coerced to a character vector before being combined (since character types cannot generally be coerced to a numeric type).

There are other ways to make vectors as well. We can use first:last to make a vector going sequentially from first to last. We can also use the seq function, which allows for some more control.

3:40
##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## [26] 28 29 30 31 32 33 34 35 36 37 38 39 40
seq(from = 2, to = 20, by = 2)
##  [1]  2  4  6  8 10 12 14 16 18 20

2.6.1 Vector Functions and Operations

The base operators we discussed in Section 2.1 also work on vectors of length greater than 1.

vec + 6
## [1] 13 38 10
vec / 2
## [1]  3.5 16.0  2.0
vec * 10
## [1]  70 320  40

When using the base operators on two or more vectors, you must ensure they are the same length. The operators work element-wise.

m <- c(8,5,0)
n <- c(45,94,26)
o <- c(4,2)
m + n
## [1] 53 99 26
n - m
## [1] 37 89 26
n / m
## [1]  5.625 18.800    Inf
m * n
## [1] 360 470   0

Normally, if the vectors are of different lengths, R will warn you that the lengths do not match. But be careful! If you use the base operators on two vectors where the length of one is a multiple of the length of the other, R will silently “recycle” (ie repeat the elements of) the shorter vector so it has the same length as the longer one.

o + n
## Warning in o + n: longer object length is not a multiple of shorter object
## length
## [1] 49 96 30
o + new_vec
## [1] 11 34  8  8  7  3

There are also a set of vector-specific functions which make working with vectors easier. A subset of them is given in Table 2.2.

Table 2.2: Vector Functions

  • length(v): Number of elements in the vector \(\boldsymbol{v}\)
  • max(v), min(v): Maximum and minimum element of \(\boldsymbol{v}\) (\(\max(\boldsymbol{v})\), \(\min(\boldsymbol{v})\))
  • ceiling(v), floor(v): Round every element in \(\boldsymbol{v}\) to the next largest (ceiling) or smallest (floor) integer
  • round(v): Round every element in \(\boldsymbol{v}\)
  • sort(v): Arrange elements in vector \(\boldsymbol{v}\) from smallest to largest
  • sum(v): Sum of elements in \(\boldsymbol{v}\) (\(\sum_i v_i\))
  • prod(v): Product of elements in \(\boldsymbol{v}\) (\(\prod_i v_i\))
  • mean(v): Calculate the mean of the elements in \(\boldsymbol{v}\)
  • median(v): Calculate the median of the elements in \(\boldsymbol{v}\)
  • sd(v): Calculate the standard deviation of the vector \(\boldsymbol{v}\)
  • var(v): Calculate the variance of the vector \(\boldsymbol{v}\)
  • cor(v, w): Calculate the correlation between the vectors \(\boldsymbol{v}\) and \(\boldsymbol{w}\)
  • numeric(n): Vector of \(0\)’s of length \(n\) (\(\boldsymbol{0} \in \mathbb{R}^n\))
  • rep(v, n): Repeat the vector \(\boldsymbol{v}\) \(n\) times

max(m)
## [1] 8
min(m)
## [1] 0
sort(m)
## [1] 0 5 8
sum(m)
## [1] 13
prod(m)
## [1] 0
mean(m)
## [1] 4.333333
median(m)
## [1] 5
sd(m)
## [1] 4.041452
var(m)
## [1] 16.33333
sqrt(var(m))
## [1] 4.041452
cor(m, n)
## [1] 0.4055098
numeric(10)
##  [1] 0 0 0 0 0 0 0 0 0 0
rep(m, 10)
##  [1] 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0
rep(m, each = 10)
##  [1] 8 8 8 8 8 8 8 8 8 8 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0

2.6.2 Logical Operators

We can even apply logical operations to each element of a vector with a single statement. This is done with the following operators: ==, >, <, >=, <=, and != (with the last three referring to “larger than or equal to”, “smaller than or equal to”, and “not equal to” respectively).

You can use the & and | operators for “and” and “or” respectively. Prepending a statement with a ! means “not”.

m == 5  # note the use of '==' rather than '='!
## [1] FALSE  TRUE FALSE
m != 0
## [1]  TRUE  TRUE FALSE
m > 4
## [1]  TRUE  TRUE FALSE
m <= 0
## [1] FALSE FALSE  TRUE
m == 8 | m == 0
## [1]  TRUE FALSE  TRUE
m > 0 & m < 8
## [1] FALSE  TRUE FALSE
!(m > 0 & m < 8)
## [1]  TRUE FALSE  TRUE

2.6.3 Strings

When working with vectors which consist of string (text) elements, there are a few other functions that may be useful. Perhaps the most important of these is the paste function. As the name implies, it will concatenate two strings with one another. For example, given two strings:

string1 <- "The first part"
string2 <- "and the second part"
paste(string1, string2)
## [1] "The first part and the second part"

paste will even work on vectors longer than one. In this case, it will append/prepend the vector with the scalar.3

paste(m, "and text")
## [1] "8 and text" "5 and text" "0 and text"
paste(m, n)
## [1] "8 45" "5 94" "0 26"

By default, paste separates the two (or more) elements with a space (" "). This can be overridden with the sep argument. If you want there to be no separator, you can either set sep = "" or use paste0 instead.

paste("Part 1", "Part 2", sep = "")
## [1] "Part 1Part 2"
paste0("Part 1", "Part 2")
## [1] "Part 1Part 2"
paste("Part 1", "Part 2", sep = " : ")
## [1] "Part 1 : Part 2"

If you want to collapse a vector of strings into a single string, you can use the collapse argument, setting it equal to the desired separator.

paste(
  c("This is a vector", "with several", "different elements", "that I want to collapse"),
  collapse = ", "
)
## [1] "This is a vector, with several, different elements, that I want to collapse"

2.6.4 Remove Objects

After using R for a while, you may start to notice that you are collecting many objects and having difficulty keeping track of what’s what. In RStudio, you can see an overview of all objects in the environment pane (top-right by default). You can also get a list of all objects by using the ls function.

ls()
##  [1] "m"         "n"         "new_vec"   "o"         "other_vec" "string1"  
##  [7] "string2"   "vec"       "x"         "y"         "z"

Now that you know what is in your environment, you may want to start removing some obsolete objects. You can use the rm function to achieve this. For example, to remove the vector new_vec:

rm(new_vec)
ls()
##  [1] "m"         "n"         "o"         "other_vec" "string1"   "string2"  
##  [7] "vec"       "x"         "y"         "z"

You can combine rm and ls to remove all objects in memory.

rm(list = ls())
ls()
## character(0)

2.6.5 Indexing Vectors

Often we wish to extract a subset of an entire vector, or even a single element. We achieve this through indexing with []. Given a positional number, or a vector of positional numbers, we can extract specific elements.

vec <- -10:10
vec[]
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10
vec[3]
## [1] -8
vec[3:6]
## [1] -8 -7 -6 -5
vec[c(1,20)]
## [1] -10   9
vec[length(vec)]
## [1] 10

It is also possible to index a vector with another vector of logicals, where TRUE returns the element and FALSE does not. (If the logical vector is shorter than the vector being indexed, it will be recycled, as in the first example below.)

vec[c(TRUE, FALSE, TRUE)]
##  [1] -10  -8  -7  -5  -4  -2  -1   1   2   4   5   7   8  10
vec[vec == 32]
## integer(0)

Another useful function for indexing is the which function. Given a vector of logicals, it will return the position of the TRUE elements.

which(vec > 0)
##  [1] 12 13 14 15 16 17 18 19 20 21
which(vec < 2 & vec > -2)
## [1] 10 11 12

Importantly, which returns the indices, not the elements themselves. So in the above example, 10, 11, 12 refer to the positions in the vector vec where the elements are larger than -2 and smaller than 2, whereas if we wanted the elements themselves, we could use vec[vec < 2 & vec > -2], which would return -1, 0, 1.
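We can verify this difference directly:

```r
vec <- -10:10
which(vec < 2 & vec > -2)       # the positions of the matching elements
## [1] 10 11 12
vec[vec < 2 & vec > -2]         # the matching elements themselves
## [1] -1  0  1
vec[which(vec < 2 & vec > -2)]  # indexing with the positions gives the same elements
## [1] -1  0  1
```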

2.6.6 Vector Assignment

Using the indices, we can re-assign specific elements of a vector with a new element.

vec
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10
vec[2] <- -15
vec
##  [1] -10 -15  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10

If we assign an element to an index that is out of range (ie an index larger than the length of the vector), then R fills the positions in between with NA’s (“not available”, ie missing values).

vec[25] <- 100
vec
##  [1] -10 -15  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10  NA  NA  NA 100

2.6.7 Naming Vectors

Occasionally we may want to name each element in a vector in order to increase readability and understandability (and aid with indexing/assignment). We can do this with the names function.

m <- c(9, 4, 1)
names(m) <- c("#1", "#Two", "num3")
m
##   #1 #Two num3 
##    9    4    1
m["num3"] <- 6
m
##   #1 #Two num3 
##    9    4    6
names(vec) <- LETTERS[1:length(vec)]
vec
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T 
## -10 -15  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9 
##   U   V   W   X   Y 
##  10  NA  NA  NA 100

2.7 Factors

Factors are a special type of vector. They are used to represent (ordered or unordered) categorical or qualitative data. In economics, examples of this can be countries, gender, industry, Likert scales, etc. Factors are stored as a vector of integers, with each integer representing a respective outcome. However, factors are presented as strings, which can make working with them a little complicated.

We will look at two examples: one with gender, where the options will be female or male, and a simple three-option Likert scale consisting of disagree, neutral, and agree. The first example is unordered while the second one is ordered.

genders <- factor(rep(c("female", "male"), times = 6))
likert <- factor(
  rep(c("Disagree", "Neutral", "Agree"), times = 4),
  levels = c("Disagree", "Neutral", "Agree"),
  ordered = TRUE
)

Notice that in the ordered case we set the relative order of the outcomes with the levels argument. If we left this unspecified, R would order the options alphabetically.

We can investigate factors with the levels and nlevels functions.

levels(genders)
## [1] "female" "male"
nlevels(likert)
## [1] 3
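We can also use levels to see the alphabetical default mentioned above: if we had created the likert factor without specifying levels, R would have sorted the options alphabetically, putting “Agree” first.

```r
unordered <- factor(c("Disagree", "Neutral", "Agree"))
levels(unordered)  # alphabetical order, not the logical order we want
## [1] "Agree"    "Disagree" "Neutral"
```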

Also note that since factors are a special type of vector, certain operations do not work on them (since they are generally meaningless). And some functions work only on ordered factors, but not unordered factors!

genders + 1 # won't work!
min(genders) # also problematic
min(likert)
## [1] Disagree
## Levels: Disagree < Neutral < Agree

Converting from Factors

Sometimes we will require that our data is a factor and other times we will simply want text. Thus, functions for converting between the two can come in handy.

as.character(genders)
##  [1] "female" "male"   "female" "male"   "female" "male"   "female" "male"  
##  [9] "female" "male"   "female" "male"
as.numeric(likert)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

Notice how converting factors to numeric results in a series of integers? Remember that all factors are stored as a vector of integers!
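This storage detail is a common pitfall when a factor’s labels are themselves numbers: as.numeric returns the underlying integer codes, not the numbers in the labels. The usual workaround is to convert to character first. (The factor f below is a made-up example for illustration.)

```r
f <- factor(c("10", "20", "10"))
as.numeric(f)                # the integer codes, not the values!
## [1] 1 2 1
as.numeric(as.character(f))  # convert to character first to recover the values
## [1] 10 20 10
```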

As we will see, there are many other things that we can do with factors.

2.8 Random Numbers

When running simulations, such as a Monte Carlo Simulation, or simply for the purpose of better understanding an estimator, we need to generate realizations from a given distribution. In other words, we need to generate random numbers.

Before we do that, it is useful to jump back a bit and briefly discuss some common characteristics of (univariate) distribution functions. The Probability Density Function (PDF) of a given distribution maps each possible outcome to a relative likelihood of that outcome being drawn from the distribution.

The Cumulative Distribution Function (CDF) of a distribution, on the other hand, states the probability that a draw from the distribution of \(X\) will take a value of \(x\) or lower (\(P(X \leq x)\)). As the CDF is a monotonic function for the distributions that we will be working with, it is also possible to take the inverse of the function.

In R, we can do calculations with these distributions by prepending the name of the distribution with either d, p, or q.

  • d returns the “height” (density or relative likelihood) of the PDF at x
  • p returns the probability from the CDF for a given x
  • q is the inverse of the CDF and returns the quantile x for a given probability

For example, to use the CDF of the normal distribution we would use the pnorm function (norm being the name of the normal distribution in R).

For an overview of the distributions available in R (technically from the built-in “stats” package) type ?Distributions. Each distribution has various parameters that can be altered, such as the mean, standard deviation, degrees of freedom, etc.
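As a concrete illustration of the three prefixes, here they are applied to the standard normal distribution (mean 0, standard deviation 1); note how qnorm undoes pnorm:

```r
dnorm(0)            # height of the PDF at x = 0
## [1] 0.3989423
pnorm(1.96)         # P(X <= 1.96)
## [1] 0.9750021
qnorm(pnorm(1.96))  # the quantile function inverts the CDF
## [1] 1.96
```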

Figure 2.1 shows the PDF of the Normal Distribution with a visualization of how the dnorm() function works.

Figure 2.1: Probability Density of Normal Distribution

The CDF of the Normal Distribution is shown in Figure 2.2. It also highlights the use of the pnorm() and qnorm() functions.

Figure 2.2: Cumulative Distribution Function of the Normal Distribution

2.8.1 Random Draws from a Distribution

Generating draws from the distributions available in R uses the same logic as the other distribution functions; prepend the desired distribution (dist) with r (for “random”).

The main argument for the random number generating functions is the number of draws, n. The functions will return a vector of length n filled with realizations from the named distribution with the given parameters.

rnorm(10)   # Mean of 0 and SD of 1 are default
##  [1] -0.64299680  0.31746201 -0.10250011  1.13841323  0.45766026 -1.45672458
##  [7] -0.55419768  0.03137946  0.20429414  1.33667443
rnorm(10, mean = 5, sd = 2)
##  [1] 6.1647477 4.9640196 4.7688927 3.3567738 3.3358199 3.7416206 5.0932894
##  [8] 5.2060580 0.8916436 3.3277690
rchisq(10, df = 5)
##  [1] 4.3325927 2.7069218 5.2703912 1.7464303 4.2435745 7.1613996 8.0302707
##  [8] 0.3144991 2.6654725 5.0968154

2.8.2 Seed

One important facet of applied work is replicability. This means that, given your code and data, other researchers can obtain exactly the same results as you did (ideally with a single push of a button!).

But when we are dealing with simulations, where realizations of random numbers are essential, how can we expect to get exactly the same results each time we run the simulation?

That is where the seed comes in. In essence, it specifies the starting point of a sequence of random numbers. The Wikipedia article on random seeds provides a good description of what it does.

For our purposes, we can “set the seed” before we start drawing realizations from the specified distributions. This ensures we will draw the same numbers from the distribution each time we run the code.

rnorm(2)
## [1] -1.5128930 -0.7059439
set.seed(484)
rnorm(2)
## [1] -1.289618  0.678667
set.seed(484)
rnorm(2)
## [1] -1.289618  0.678667

2.9 Special Values

There are a few values that you will come across when using R that may seem a bit mysterious at first, but have precise meanings. We have already encountered some of them. A summary is given in Table 2.3.

Table 2.3: Special Values
Value Meaning Test Membership
NA Not Available (missing) is.na
NaN Not a Number is.nan, is.na
Inf, -Inf Positive/Negative Infinity is.infinite, !is.finite
z <- c(1)
z[4] <- 5
z
## [1]  1 NA NA  5
z/0
## [1] Inf  NA  NA Inf
0/0
## [1] NaN
log(0)
## [1] -Inf
log(-10)
## Warning in log(-10): NaNs produced
## [1] NaN

Some functions, like the log function, produce warnings when NaN values are produced. We can also test whether values belong to these special categories with the corresponding is.* functions, provided in the third column of Table 2.3.

is.na(z)
## [1] FALSE  TRUE  TRUE FALSE
is.nan(0/0)
## [1] TRUE
is.finite(log(0))
## [1] FALSE
is.infinite(log(0))
## [1] TRUE

2.10 Matrices

An extension of the vector data structure is the matrix. This is essentially a vector with an additional dimension.4 Whereas a vector is flat, a matrix consists of rows and columns. Just like vectors, matrices must contain elements of the same data type. Given one or more vectors, you can create matrices with the rbind and cbind functions.

a <- 1:6
b <- 7:12
c <- 13:18
m <- rbind(a, b, c)
n <- cbind(a, b, c)
m
##   [,1] [,2] [,3] [,4] [,5] [,6]
## a    1    2    3    4    5    6
## b    7    8    9   10   11   12
## c   13   14   15   16   17   18
n
##      a  b  c
## [1,] 1  7 13
## [2,] 2  8 14
## [3,] 3  9 15
## [4,] 4 10 16
## [5,] 5 11 17
## [6,] 6 12 18

The matrix function can also be used to specify the number of rows/columns of a matrix or create an empty matrix.

matrix(1:24, nrow = 2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    1    3    5    7    9   11   13   15   17    19    21    23
## [2,]    2    4    6    8   10   12   14   16   18    20    22    24
matrix(1:24, ncol = 2)
##       [,1] [,2]
##  [1,]    1   13
##  [2,]    2   14
##  [3,]    3   15
##  [4,]    4   16
##  [5,]    5   17
##  [6,]    6   18
##  [7,]    7   19
##  [8,]    8   20
##  [9,]    9   21
## [10,]   10   22
## [11,]   11   23
## [12,]   12   24
matrix(nrow = 2, ncol = 2)
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
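Note that matrix fills its elements column-wise by default; the byrow argument switches to row-wise filling:

```r
matrix(1:6, nrow = 2)                # column-wise fill (default)
matrix(1:6, nrow = 2, byrow = TRUE)  # row-wise fill
```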

You can use the nrow, ncol, and dim functions to get the dimensions of the matrix.

nrow(m)
## [1] 3
ncol(m)
## [1] 6
dim(m)
## [1] 3 6

Just like with vectors, you can name matrices. You can provide a vector of names for both the rows and for the columns.

rownames(m)
## [1] "a" "b" "c"
colnames(m)
## NULL
colnames(m) <- LETTERS[1:ncol(m)]
colnames(m)
## [1] "A" "B" "C" "D" "E" "F"

Matrices can also be indexed like vectors. Because there are two dimensions, you need to supply two indices (or at least indicate which dimension you are indexing).

m[1, 1]
## [1] 1
m[, c("A", "B")]
##    A  B
## a  1  2
## b  7  8
## c 13 14

2.10.1 Matrix Algebra

We can also use matrix algebra functions with matrices in R. Table 2.4 lists some of the functions that can be applied to matrices.

Table 2.4: Functions for Matrix Algebra
Function Description
t(M) Transpose of the matrix M (\(\boldsymbol{M}'\) or \(\boldsymbol{M}^T\))
identical(M,R) Test if matrices M and R are the same
M * R Element-wise multiplication of M and R (\(\boldsymbol{M} \odot \boldsymbol{R}\))
M %*% R Multiply matrix M with R (\(\boldsymbol{M} \boldsymbol{R}\))
crossprod(M,R) Same as t(M) %*% R (\(\boldsymbol{M}' \boldsymbol{R}\))
solve(M) Calculate the inverse of matrix M (\(\boldsymbol{M}^{-1}\))
diag(M) Returns the diagonal vector from matrix M (\(\text{diag} \boldsymbol{M}\))
rowSums(M), colSums(M) Take the sum of the elements of M row/column-wise
rowMeans(M), colMeans(M) Take the mean of the elements of M row/column-wise
M <- matrix(rnorm(16), nrow = 4)
R <- matrix(rnorm(16), nrow = 4)
t(M)
##            [,1]       [,2]      [,3]       [,4]
## [1,] -0.3975054  0.4336701 0.5274228  0.3849394
## [2,] -0.9837064 -0.7759925 0.1923013 -0.9442839
## [3,] -1.2623426 -0.2245831 0.9692689  0.4504042
## [4,]  0.5843260  1.0682363 0.3192506 -0.6585833
M %*% R
##            [,1]      [,2]        [,3]        [,4]
## [1,] -0.1103055 0.3437971 -0.09487179 -0.66672829
## [2,]  0.3029771 2.9155968 -1.97731576  0.69438809
## [3,] -0.2097555 1.3041029 -1.43760182  1.24919106
## [4,]  1.0638162 1.1222780 -0.28578469 -0.04977172
solve(M)
##           [,1]       [,2]      [,3]       [,4]
## [1,] -4.362764  4.0045264 -4.875427  0.2611986
## [2,] -1.498631  0.9198999 -1.480591 -0.5552803
## [3,]  2.288065 -2.2020343  3.442356  0.1270255
## [4,]  1.163539 -0.4843006  1.627443 -0.4827018
colMeans(R)
## [1]  0.1306180  0.8225804 -0.5405077  0.3965967
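A quick way to sanity-check these operations is to verify some identities from matrix algebra, for example that a matrix multiplied by its inverse yields the identity matrix, and that crossprod(M, R) matches t(M) %*% R (the matrices below are freshly drawn, so the numbers differ from the output above):

```r
M <- matrix(rnorm(9), nrow = 3)
R <- matrix(rnorm(9), nrow = 3)

isTRUE(all.equal(solve(M) %*% M, diag(3)))      # identity, up to floating point
isTRUE(all.equal(crossprod(M, R), t(M) %*% R))  # crossprod is a shortcut
```

Both calls should return TRUE; all.equal compares with a numerical tolerance, which is appropriate for floating-point matrix arithmetic.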

2.11 Lists

Lists are more general than both vectors and matrices. They act like containers, holding other objects (vectors, matrices, even other lists!).

Though we generally will not create our own lists, we will have to deal with lists as outputs of other functions (e.g. the output of a linear model, an OLS estimation, is a list).

To create a list you use the list command.

lst <- list(1:10, c(TRUE, FALSE), letters)
lst
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## [1]  TRUE FALSE
## 
## [[3]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

You may have noticed that unlike vectors and matrices, we can put objects of different data types into one list.

We can also name each element in a list.

names(lst) <- c("nums", "logical", "letters")
lst
## $nums
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $logical
## [1]  TRUE FALSE
## 
## $letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

One has to be a bit more careful when accessing elements in a list than with other data types. To index based on position, one must use [[]] (double square brackets). If you only use one pair of brackets, you get a list containing one object (rather than the object itself).

lst[1]
## $nums
##  [1]  1  2  3  4  5  6  7  8  9 10
lst[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10

But the easiest way to extract the desired object is by using $ followed by the name of the object.

lst$letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

You can also add another object to the list through similar means.

lst$months <- month.abb
lst
## $nums
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $logical
## [1]  TRUE FALSE
## 
## $letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
## 
## $months
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Because lists can hold various types of objects, they are useful for storing the results of estimated models, Monte Carlo simulations, etc. For example, fitting an OLS regression may return the degrees of freedom (a scalar), a residual for each observation (a vector), and the variance-covariance matrix of the estimated coefficients (a matrix).

2.12 Data Frames

Data frames are probably the structure most people will find recognizable. For those with experience in Stata, they are the closest analogue to Stata’s dataset. They are similar to matrices in that they are two-dimensional (consisting of rows and columns), but unlike matrices, they can hold data of different types. They can be thought of as a list of equal-length vectors. We can use the data.frame function to construct a new data frame, with each argument a vector, optionally named.

df <- data.frame(
  numerals = 1:10,
  roman = as.character(as.roman(1:10)),
  letters = letters[1:10],
  stringsAsFactors = FALSE
)

Importantly, we can decide if we want strings to be considered as factors, or just as characters. This will depend on how we use the variable in our analyses.

Since data frames are similar to lists, the indexing of data frames is also similar, though there are key differences.

df[1]
##    numerals
## 1         1
## 2         2
## 3         3
## 4         4
## 5         5
## 6         6
## 7         7
## 8         8
## 9         9
## 10       10
df[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
df$numerals
##  [1]  1  2  3  4  5  6  7  8  9 10
df[c("numerals", "letters")]
##    numerals letters
## 1         1       a
## 2         2       b
## 3         3       c
## 4         4       d
## 5         5       e
## 6         6       f
## 7         7       g
## 8         8       h
## 9         9       i
## 10       10       j
df[4:6,3]
## [1] "d" "e" "f"

2.12.1 Variable Creation

In empirical work we often want to create new variables from existing ones (GDP per capita from GDP and population, log wages from wages, etc.). We can do that by assigning vectors to an existing or new column.

df$capital_letters <- LETTERS[1:10]
df$roman2letter <- paste(df$roman, df$letters, sep = "_")
df$numerals <- sort(df$numerals, decreasing = TRUE)
df
##    numerals roman letters capital_letters roman2letter
## 1        10     I       a               A          I_a
## 2         9    II       b               B         II_b
## 3         8   III       c               C        III_c
## 4         7    IV       d               D         IV_d
## 5         6     V       e               E          V_e
## 6         5    VI       f               F         VI_f
## 7         4   VII       g               G        VII_g
## 8         3  VIII       h               H       VIII_h
## 9         2    IX       i               I         IX_i
## 10        1     X       j               J          X_j

In addition to creating new variables, sometimes we will want to work only with a subset of a data frame (think of Stata’s drop ... if ...). There are several ways to do this. We can even combine this with variable assignment to make conditional changes (Stata: replace ... if ...).

subset(df, numerals > 5)
##   numerals roman letters capital_letters roman2letter
## 1       10     I       a               A          I_a
## 2        9    II       b               B         II_b
## 3        8   III       c               C        III_c
## 4        7    IV       d               D         IV_d
## 5        6     V       e               E          V_e
df[df$numerals > 5, ]
##   numerals roman letters capital_letters roman2letter
## 1       10     I       a               A          I_a
## 2        9    II       b               B         II_b
## 3        8   III       c               C        III_c
## 4        7    IV       d               D         IV_d
## 5        6     V       e               E          V_e
df[df$numerals > 5, "numerals"] <- 20
df
##    numerals roman letters capital_letters roman2letter
## 1        20     I       a               A          I_a
## 2        20    II       b               B         II_b
## 3        20   III       c               C        III_c
## 4        20    IV       d               D         IV_d
## 5        20     V       e               E          V_e
## 6         5    VI       f               F         VI_f
## 7         4   VII       g               G        VII_g
## 8         3  VIII       h               H       VIII_h
## 9         2    IX       i               I         IX_i
## 10        1     X       j               J          X_j

2.13 Import/Export Data

The data that an econometrician uses comes in a wide variety of formats. Some of these can be read/written natively in R, but many of them require external libraries. Some examples of these libraries are haven, foreign, readr, and readxl, though there are many others.

For our purposes, we will stick with R’s default data storage format, .rds. For the tutorials, all datasets will be provided in this format. Though for your own work, you will want to learn how to open (and write to) files such as .csv, .xls, and .dta.

To save an R object (of any data structure) to our local computer, we can use the saveRDS function.

saveRDS(df, file = "data/data_frame.rds")

And, similarly, to open a saved object, we would use the readRDS function.

df <- readRDS(file = "data/data_frame.rds")

Remember, the path will be relative to your working directory. See the Working Directory section.

2.13.1 Built-in Data Sets

R also comes with a number of built-in datasets. These can be useful for playing around with some of the functionality of R before using real-world data. To find out which datasets are available, use the data function.

data()
data(mtcars)

2.13.2 Datasets from the Web

You can also load datasets directly from a website. For example, if we wanted to access datasets from Wooldridge’s Introductory Econometrics textbook, which come in Stata’s data format (.dta), we can use the read_dta function from the haven package, and supply the URL instead of the filepath.

require(haven)
## Loading required package: haven
airfare <- read_dta("http://fmwww.bc.edu/ec-p/data/wooldridge2k/airfare.dta")

2.14 Loops/Functionals

A common task when working with datasets is executing the same function (or performing the same operation) repeatedly for different variables or subsets of the original data. Perhaps the most straightforward way to do this is to write out the function call n times. For example, if we wanted to calculate the mean of each column (variable) in the mtcars data frame, we could execute the following functions:

mean(mtcars$mpg)
## [1] 20.09062
mean(mtcars$cyl)
## [1] 6.1875
mean(mtcars$disp)
## [1] 230.7219
# ...

However, with a larger number of operations, this becomes infeasible (this method is very difficult to scale up). This is where the for loop comes in. It executes a specific set of operations for a predetermined number of iterations. If this sounds confusing, consider the following (trivialized) example:

You wish to print the integers from 1 to 10 sequentially to the console. To achieve this in R, with a loop, you would do the following:

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

where the for indicates the start of a loop, and the elements inside the first parentheses, (i in 1:10), indicate both the iteration variable (i) and the set of values to loop over (1:10). The iteration variable can be whatever letter/name you would like it to be. The set of values to loop over can be any vector, and does not necessarily have to be numbers. The loop will execute the expression in the braces ({}) length(1:10) times.

Turning back to our example with the mtcars dataset, we could calculate the mean for columns 2 to 6 by:

for (col in 2:6) {
  mu <- mean(mtcars[,col])
  print(mu)
}
## [1] 6.1875
## [1] 230.7219
## [1] 146.6875
## [1] 3.596563
## [1] 3.21725

This is just a taste of what for loops can do. They are highly expandable.
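Often you will want to store the results of each iteration rather than just print them. A common pattern, sketched here for the same columns of mtcars, is to pre-allocate a result vector and fill it inside the loop:

```r
# Pre-allocate a numeric vector of the right length, then fill it
col_means <- numeric(5)
for (i in seq_along(col_means)) {
  col_means[i] <- mean(mtcars[, i + 1])  # columns 2 to 6, as above
}
col_means
```

Pre-allocating is preferable to growing the vector inside the loop, which forces R to copy the object on every iteration.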

2.14.1 Apply Family

Those with experience in other programming languages will be familiar with for loops. However, in R, you will most often see an apply function in place of a loop. They are generally preferred because they are more concise and make the intent (apply this function to each element) explicit; note that, despite a common belief, they are not inherently faster than a well-written loop.

The main function from this family is, somewhat unsurprisingly, apply. But we will also look at the lapply/sapply function(s) as well.5

The syntax for the two functions is similar; only with apply do we need to specify the margin (ie rows [1] or columns [2]). Returning to the example above, we could achieve the same result with the apply functions:

apply(mtcars[, 2:6], 2, mean)
##        cyl       disp         hp       drat         wt 
##   6.187500 230.721875 146.687500   3.596563   3.217250
lapply(mtcars[, 2:6], mean)
## $cyl
## [1] 6.1875
## 
## $disp
## [1] 230.7219
## 
## $hp
## [1] 146.6875
## 
## $drat
## [1] 3.596563
## 
## $wt
## [1] 3.21725
sapply(mtcars[, 2:6], mean)
##        cyl       disp         hp       drat         wt 
##   6.187500 230.721875 146.687500   3.596563   3.217250

You’ll notice that the only difference between the last two is the data structure of the returned object; lapply returns a list while sapply tries to simplify the results into a vector.

So, since a for loop, apply, lapply, and sapply all achieve the same results, which one should we use? The answer is that it depends. If we require the results from iteration i for iteration i+1, then a for loop is essential. If we are working on the columns of a data frame, lapply or sapply are the simplest. apply is useful if we want to work row-wise.
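To illustrate a case where a for loop is essential, here is a minimal sketch simulating an AR(1) process, in which each value is built from the previous one (the coefficient 0.8 and the sample size are illustrative choices):

```r
set.seed(123)
n <- 100
y <- numeric(n)  # y[1] starts at 0
for (t in 2:n) {
  y[t] <- 0.8 * y[t - 1] + rnorm(1)  # today's value depends on yesterday's
}
head(y)
```

Because y[t] requires y[t - 1], the iterations cannot be carried out independently of one another, so an apply function is not a natural fit here.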

2.15 Conditionals

When working with loops (and with custom functions, which we will see shortly), we may only want to execute specific operations when certain criteria are met.

Going back to the example of printing all integers from 1 to 10, let’s say we only wanted to print the odd numbers. We could use a for loop again, but with a conditional statement. A conditional tests an expression that evaluates to TRUE or FALSE, and executes its body only when the result is TRUE.

for (i in 1:10) {
  if (i %% 2 != 0) {
    print(paste(i, "is odd"))
  }
}
## [1] "1 is odd"
## [1] "3 is odd"
## [1] "5 is odd"
## [1] "7 is odd"
## [1] "9 is odd"

In the example above, the %% is the modulus operator (ie m %% n would return the remainder of the division of m by n. So if m is 5 and n is 2, m %% n would be 1). Since odd numbers cannot be exactly divided by 2, we know that the modulus of an odd number divided by 2 is not zero (it’s 1). Therefore, i %% 2 != 0 evaluates to TRUE only for odd numbers.

We can also test other statements as well using else if. Let’s say if i equals six we would rather print the statement i equals 6 rather than just i itself.

for (i in 1:10) {
  if (i %% 2 != 0) {
    print(paste(i, "is odd"))
  } else if (i == 6) {
    print(paste("i equals", i))
  }
}
## [1] "1 is odd"
## [1] "3 is odd"
## [1] "5 is odd"
## [1] "i equals 6"
## [1] "7 is odd"
## [1] "9 is odd"

So even though 6 fails our first test (it is not an odd number), it satisfies our second conditional (it equals six). Finally, we can use the else expression to capture everything that failed the other conditionals (we can use as many else if conditionals as we like).

for (i in 1:10) {
  if (i %% 2 != 0) {
    print(paste(i, "is odd"))
  } else if (i == 6) {
    print(paste("i equals", i))
  } else {
    print(paste(i, "is not odd or six"))
  }
}
## [1] "1 is odd"
## [1] "2 is not odd or six"
## [1] "3 is odd"
## [1] "4 is not odd or six"
## [1] "5 is odd"
## [1] "i equals 6"
## [1] "7 is odd"
## [1] "8 is not odd or six"
## [1] "9 is odd"
## [1] "10 is not odd or six"

2.15.1 Vectorized Conditional

So far, the conditionals we explored only handle a single element (ie a vector of length one). We may also want to carry out some operation on elements of a vector that satisfy certain criteria. We could loop through each element of the vector to achieve this, but that is not necessary.

Let’s say we have a vector vec containing some non-positive values, and we want to take the log of each positive element while setting the non-positive elements (negative values and zero) to NA. We could do that with the ifelse function as follows:

vec_log <- ifelse(vec > 0, log(vec), NA)
## Warning in log(vec): NaNs produced
vec_log
##         A         B         C         D         E         F         G         H 
##        NA        NA        NA        NA        NA        NA        NA        NA 
##         I         J         K         L         M         N         O         P 
##        NA        NA        NA 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 
##         Q         R         S         T         U         V         W         X 
## 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851        NA        NA        NA 
##         Y 
## 4.6051702

This is extremely useful in this context, because R would normally return a NaN for negative numbers and -Inf for zero, which can cause problems for other R functions.

2.16 Writing Functions

We can also write our own functions in addition to the functions available built-in or from loaded libraries. Functions, as in the mathematical sense, take an input (vector, matrix, data frame, even another function!) and return an output (generally a scalar, but can be other data types as well). This can be useful when we want to carry out an operation repeatedly in various contexts.

An example function could be a function that tests whether a number is a square number.

isSquareNumber <- function(x) {
  sqrt(x) %% 1 == 0
}
isSquareNumber(2)
## [1] FALSE
isSquareNumber(9)
## [1] TRUE
isSquareNumber(-9)
## Warning in sqrt(x): NaNs produced
## [1] NA
isSquareNumber("9")
## Error in sqrt(x): non-numeric argument to mathematical function

Here, x is the input and the output is the result of the single line of code. Our function works, but it is not robust! Let’s use what we learned from conditionals to handle such special cases.

isSquareNumber <- function(x) {
  if (!is.numeric(x)) {
    output <- NA
  } else if (x > 0) {
    output <- sqrt(x) %% 1 == 0
  } else {
    output <- NA
  }

  return(output)
}
isSquareNumber(9)
## [1] TRUE
isSquareNumber(-9)
## [1] NA
isSquareNumber("9")
## [1] NA

We’ve also explicitly told the function to return the object output. This is useful when functions become more complicated and execute numerous lines of code, returning different values.
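Functions can also return several values at once by bundling them into a list, the same structure lm uses for its output. A minimal sketch (the function name and the statistics chosen are illustrative):

```r
# Return several summary statistics of a vector at once, as a named list
describe_vector <- function(x) {
  list(
    n    = length(x),
    mean = mean(x),
    sd   = sd(x)
  )
}

res <- describe_vector(c(2, 4, 6, 8))
res$n     # 4
res$mean  # 5
```

The caller can then extract each component with $, exactly as we did with lists above.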

3 Basic Statistical Methods

3.1 Monte Carlo Simulations

One statistical method that we will use during the course is called the Monte Carlo method. The general idea is as follows:

  1. We specify the data generating process (DGP) for a set of random variables (ie we specify the parameters and distributions of the DGP).
  2. We draw a random sample from the specified DGP and estimate certain parameters.
  3. We do this repeatedly to get information on certain parameters of interest (and to learn more about the properties of specific estimators).

In our examples, this will mainly be done to learn about the properties of basic estimators, such as the mean estimator or the OLS estimator. But now you may be wondering: why do we go through this process when we already know the statistical properties of these estimators? The answer is that it aids in understanding these estimators; drawing random samples is more intuitive than asymptotic theory. In current research, the method is used to infer properties of estimators that have no analytical solution or are difficult to calculate.

We will go through more advanced examples in class; for now, let’s work with the simplest case. We have only one random variable, which is distributed as follows:

\[\begin{equation} Y_i \sim N(\mu, \sigma^2) \tag{3.1} \end{equation}\]

And we are interested in inferring properties of the mean estimator, defined as:

\[ \hat{\mu} = n^{-1} \sum_{i=1}^n Y_i \]

In the end, we want to know the properties of \(\hat{\mu}\), specifically its distribution, mean, and variance (\(F(\hat{\mu}), E(\hat{\mu}), \text{var}(\hat{\mu})\), where \(F\) is an undefined CDF). As we probably all know, the properties of this estimator are already well established:

\[ \hat{\mu} \sim N(\mu, \sigma^2 / n) \]

We can use a Monte Carlo simulation to verify this. There are numerous ways to do this in R, but as it involves repeatedly drawing from a distribution, we will need either a for loop or one of the apply functions. We will take the latter approach. Specifically, we will use the replicate function, which is related to the other apply functions (reminder: if you don’t know what a function does, or how to use it, type ?replicate).

The first step is to specify a DGP and draw a sample from this distribution. We will do this with a function, as we will execute it over and over. The following function takes three arguments: the sample size (n), the mean (mean), and the standard deviation of the random variable (sd). It will draw a sample of size n from the distribution specified in Equation (3.1). It will then estimate the parameter \(\mu\) with the estimator \(\hat{\mu}\), and return that value.

The function requires a value for n, but the mean and sd arguments can be left blank and will revert to their default values of 0 and 1 respectively.

mean_rnorm_sample <- function(n, mean=0, sd=1) {
  samp <- rnorm(n, mean, sd)
  mu <- mean(samp)
  return(mu)
}
mean_rnorm_sample(100, 5, 2)
## [1] 5.27703

Before we start the replications, we need to set the seed (see the section on working with a Seed). This will ensure that our results are replicable (and that we get the same results every time!).

Then, we will execute the above function 1000 times (\(r = 1000\)), each time with a sample size of 1000 (\(n = 1000\), as set by sample_size below). We will set \(\mu = 5\) and \(\sigma = 2\). We will store the estimated \(\mu\)’s in a vector called mu_hats.

set.seed(38547)
sample_size <- 1000
mean <- 5
sd <- 2
mu_hats <- replicate(1000, mean_rnorm_sample(sample_size, mean, sd))

Now that we have run the simulation, we can investigate the properties of the mean estimator. Since we also know the true values of the parameters, we can compare the empirically estimated values with the theoretical ones.

mu_hat_mean <- mean(mu_hats)
mu_hat_mean
## [1] 5.002903
mu_hat_var <- var(mu_hats)
mu_hat_var
## [1] 0.003852582
## [1] "Empirical mean of estimator is 5.0029, theoretical value is 5."
## [1] "Empirical variance of estimator is 0.0039, theoretical value is 0.004 (2^2 / 1000)."

As you can see, the Monte Carlo simulation returns what we would expect from statistical theory.

We can also plot a histogram of the estimated means and compare it with the corresponding normal density.

hist(mu_hats, breaks = 30, freq = FALSE)
curve(dnorm(x, mean = mean, sd = sd/sqrt(sample_size)), add = TRUE) # remember sd adjust.

Looks pretty close! This is not very surprising given we know the true distribution, but this would help us inspect the properties of an estimator when we do not have any theory to guide us.

3.2 Summary Stats

Because there is no single set of “definitive” summary statistics, there are many different ways to inspect datasets. You can of course look at one statistic at a time with functions such as mean, median, sd, etc., but normally you will want to calculate a set of statistics on numerous variables, all at once.

One option is the built in summary function, which will return basic statistics.

mtcars_subset <- mtcars[,1:4]
summary(mtcars_subset)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
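You can also build a custom summary table yourself with sapply and an anonymous function (the particular statistics chosen here are illustrative):

```r
# Apply the same set of statistics to each column of the data frame
sapply(mtcars[, 1:4], function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
})
```

sapply simplifies the results into a matrix with one column per variable and one row per statistic.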

For more detailed and/or comprehensive statistics, look to packages such as Hmisc, pastecs, psych, or skimr. For example, the skim function from skimr neatly summarizes the data:

require("skimr")
## Loading required package: skimr
skim(mtcars_subset)
Table 3.1: Data summary
Name mtcars_subset
Number of rows 32
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.4 15.43 19.2 22.8 33.9 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.0 4.00 6.0 8.0 8.0 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.1 120.83 196.3 326.0 472.0 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.0 96.50 123.0 180.0 335.0 ▇▇▆▃▁

3.3 Linear Regression

The workhorse estimator for economics is the OLS estimator. It is robust, adaptable, and extensible; its properties are well understood, and it can be used in a wide variety of situations. In R, we can estimate a linear model with the lm command.

Using the mtcars data, if we wanted to investigate the relationship between horsepower (hp) and mileage (mpg) we could run a regression of mileage on horsepower. The lm function requires at least a formula and a dataset.

The formula specifies the relationship you are estimating. The general format is y ~ x1 + x2 where y is an outcome variable and x1 and x2 are explanatory variables. The dataset simply refers to the data you want to use.

The lm function returns an lm object, which contains information on the estimated coefficients, fitted values, degrees of freedom, etc. We can use R’s summary command to print the most important information to the console.

lfit <- lm(mpg ~ hp, data = mtcars)
attributes(lfit) # shows the info contained lm object, accessible with "$"
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
lfit$coefficients
## (Intercept)          hp 
## 30.09886054 -0.06822828
lfit$residuals
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##         -1.59374995         -1.59374995         -0.95363068         -1.19374995 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##          0.54108812         -4.83489134          0.91706759         -1.46870730 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##         -0.81717412         -2.50678234         -3.90678234         -1.41777049 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##         -0.51777049         -2.61777049         -5.71206353         -5.02978075 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##          0.29364342          6.80420581          3.84900992          8.23597754 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##         -1.98071757         -4.36461883         -4.66461883         -0.08293241 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##          1.04108812          1.70420581          2.10991276          8.01093488 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##          3.71340487          1.54108812          7.75761261         -1.26197823
summary(lfit)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

This is just the tip of the iceberg. In class, we will also consider extensions such as inference using robust and clustered standard errors, IV estimation, and fixed effects.
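As a small preview of those extensions, robust standard errors are usually computed with the external sandwich and lmtest packages rather than base R. The sketch below assumes both packages are installed; the HC1 covariance type is just one common choice:

```r
# install.packages(c("sandwich", "lmtest"))  # run once if not yet installed
library(sandwich)
library(lmtest)

lfit <- lm(mpg ~ hp, data = mtcars)
# Re-test the coefficients using a heteroskedasticity-robust covariance matrix
coeftest(lfit, vcov = vcovHC(lfit, type = "HC1"))
```

The point estimates are identical to summary(lfit); only the standard errors (and hence t statistics and p-values) change.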

4 Plots

One of R’s standout features is graphing. There are numerous external libraries available that allow you to plot virtually any graph you desire (including interactive ones!).

I will only briefly introduce R’s built-in graphing capabilities as a starting point. A good overview can be found in the Graphs Section on Quick-R.

Once you become comfortable with plotting in base R, I would highly recommend using ggplot2 for graphing. It is the state-of-the-art library that is (probably) also the most widely used. The R for Data Science book has a good section on Data Visualization (using ggplot2).
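For a taste of the ggplot2 syntax, the base-R scatterplot shown later in this section would look roughly like this (assuming ggplot2 is installed):

```r
# install.packages("ggplot2")  # run once if not yet installed
library(ggplot2)

# A scatterplot of mileage against horsepower with labelled axes
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles per gallon")
```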

4.1 Histograms

We have already seen how to plot a histogram in the Monte Carlo Section. Since a histogram is a visualization of a one-dimensional variable, we enter one column (or vector) into the hist function. We can also customize the plot with optional arguments to hist, such as changing the number of bins.

hist(mtcars$mpg)
hist(mtcars$mpg, breaks = 25)

Type ?hist to see an overview of all optional arguments.
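Beyond breaks, common arguments include main (the plot title), xlab (the x-axis label), and col (the bar colour), for example:

```r
hist(mtcars$mpg,
     breaks = 25,
     main = "Distribution of mileage",
     xlab = "Miles per gallon",
     col = "grey")
```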

4.2 Scatterplots

Another common graph type is a scatterplot, which we achieve in R with plot(x, y), where x and y are the variables on the x- and y-axis, respectively. Since this is a two-dimensional representation of two variables, it requires at least two arguments.

plot(mtcars$hp, mtcars$mpg)

Again, look to ?plot for more options.
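A useful addition to a scatterplot is the fitted regression line, which abline can overlay on an existing plot:

```r
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon")
# Overlay the fitted line from the regression of mpg on hp
abline(lm(mpg ~ hp, data = mtcars))
```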

4.3 Other Graphs

Base R is capable of other graph types such as box plots, bar plots, and pie charts.
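For instance, boxplot accepts the same formula interface as lm, and barplot works well with the output of table:

```r
# Boxplot of mileage by number of cylinders
boxplot(mpg ~ cyl, data = mtcars)

# Bar plot of the number of cars per cylinder count
barplot(table(mtcars$cyl))
```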

But for anything more advanced than basic plotting, look to ggplot2.

5 Other Topics

This guide covers the basics you will need to get started with the tutorials. For those of you who wish to do your own empirical work, you will want to expand on this knowledge. This also means going beyond base R.

Probably the most useful set of packages is provided by The Tidyverse. dplyr is especially useful for data cleaning and manipulation while tidyr aids in reshaping data to formats that are useful for our analyses. These are not strictly necessary for empirical econometric work, but they will make your life easier when working with messy, real-world data.
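As a minimal sketch of the dplyr style (assuming the package is installed; the conversion factor from miles per gallon to kilometres per litre is approximately 0.425):

```r
# install.packages("dplyr")  # run once if not yet installed
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%              # keep only four-cylinder cars
  mutate(kpl = mpg * 0.425) %>%     # add a kilometres-per-litre column
  summarise(mean_kpl = mean(kpl))   # average across the filtered cars
```

Each step takes a data frame and returns a data frame, which is what makes these "pipelines" easy to read and extend.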

This guide will evolve and change over time as R is also an evolving programming language. In addition, your feedback and comments will be incorporated into future versions of this guide.


  1. A note of warning for Windows users. “\” is an escape character (not so important for our purposes), but it means that you need to use two of them when setting your path (“\\”), or use the forward slash (“/”) instead.↩︎

  2. This is not technically correct, just as a \(1\times1\) matrix is not technically a scalar, but it helps for understanding in this context.↩︎

  3. paste will always coerce numbers and logicals to character.↩︎

  4. A vector of length \(n\) would be \(\boldsymbol{v} \in \mathbb{R}^n\) while a matrix with \(r\) rows and \(s\) columns would be \(\boldsymbol{M} \in \mathbb{R}^{r \times s}\).↩︎

  5. Other functions from this family include tapply and mapply.↩︎