Introduction to R for Econometrics
Department of Economics, LMU Munich
2019-10-15
1 Introduction to R
This guide is meant to be an overview of the R programming language. It is by no means comprehensive or complete; there are many topics that deserve more explanation and others that are omitted altogether. Instead, the primary purpose of this guide is to provide you with enough knowledge of R to be able to follow the tutorials for the Econometrics for M.Sc. Students class.
Both a PDF and an HTML version are provided. The PDF is better for printing, but the HTML is recommended, as it is easier to copy the code to follow along, and it looks better on computer screens.
This document will evolve, change, and be updated over time. So if something is missing in the version you have, please check out the latest version, as it may have what you’re after. Also look at the section on Mistakes and Typos for reporting errors in this guide.
1.1 Importance of this Guide for the Course (and the Exam!)
We do not expect you to be able to replicate and reproduce R code for the exam. Instead, we may show you some R code and ask you to i) interpret what the code aims to achieve, and ii) explain the results of executing the code. These questions will always be related to econometrics and not to programming specifically. This means that advanced knowledge of R programming will not be required.
But we do not want to teach you anything that becomes obsolete the minute you submit your final exam. The skills you learn and develop here are valuable (indeed, essential) for anyone wanting to pursue a career even slightly related to economics. At the very least, knowing how to conduct empirical work will most likely be necessary for your master’s thesis. R is a great tool for this.
1.2 R: What and Why?
R is an open-source programming language with a focus on statistical computing. Owing to its foundations in statistics, it has many useful capabilities out-of-the-box. It has a large user community that is continuously adding to its collection of external libraries and expanding the base functionality of R. Statistical estimators are usually released in R before they make their way to other statistical software.
Most of you come from one of the following backgrounds: i) you have no programming experience, ii) you’re familiar with Stata (or another statistical software like SAS or SPSS), iii) you’ve used other general programming languages before, or iv) you already know R! Except for those of you in iv), the rest may be wondering why we chose to use R for this course rather than (the default) Stata or something else. You may also be wondering why we even need a programming language for economics at all!
To the latter point: if you plan to work in an economics-related field after you graduate, be it in academia or in the private sector, you will need to work with data. For relatively small datasets and simple estimators, you can maybe scrape by with Excel. However, you will soon find it too slow and cumbersome for any real data work, not to mention that all your colleagues will be using something else! Even those of you who just want to create and solve models will find that you need to have a basic understanding of empirical work (even if just to check others’ work).
To the former point: there are several reasons we’ve decided to use R:
- Free and Open source
- Because R is freely accessible to all, it is easy to get started on your own time. This also means that new techniques and estimators are deployed regularly by dedicated users.
- Flexibility/customize code
- Unlike some other statistical software (ie SPSS, Stata), R does not have a point-and-click interface. For those just beginning, that may be a little frustrating at first. But it will encourage (read: force) you to write out everything you do in your analysis. This is important when others (including your future self!) want to replicate your results. Because R has its foundations in statistics, it is possible to write clear and concise code from which others can understand relatively easily what you are doing.
- Follow along on your own PC/at home
- Due to its open source nature, all of you can access, download, and install R free of charge. This makes it easy to install on your own computers and follow along at home, in the library, or in a café. With proprietary software (such as Stata), you need to buy a licence to run the software, which can be a hurdle to experimenting with the software on your own. And because learning-by-doing is especially important for programming, this openness makes learning R more enjoyable (or at least easier!).
- Increasing popularity in industry and academia (including with economists!)
- Perhaps owing to its foundation in open-source software, R is becoming more popular with both academics and those working in the private sector. This makes having a basic understanding of R a useful skill for your life after university. And because R has a leg in the world of statistics, and a leg in the world of general-purpose programming, you can (somewhat easily) transfer your knowledge to learning Stata or some general-purpose language (eg Python), should you need to (because you’re an RA and your supervisor insists on using Stata) or want to (because you find you enjoy programming and you find R too purpose-built).
1.3 R vs. Stata
Stata is perhaps the most natural comparison to R, at least for economists. And though the use and popularity of R is growing among economists, most empirical work in economics is still done in Stata (due to its long history with applied economists). This also makes the decision whether to use R or Stata in an introductory, graduate-level econometrics course not an easy one to make. Ideally, you would know both languages and employ the one better suited to your (or your supervisor’s/co-authors’!) needs. But you need to start somewhere. And we’ve decided that R is the best starting point, for the reasons listed above.
Nonetheless, it is still useful to point out some of the main differences between the two languages:
- Advantages of R
- Free and open source
- Functions are vectorized, which adds to computational efficiency
- R can read/open multiple datasets into memory, making tasks such as merging easier (though this is now possible in the just-released Stata 16)
- Matrix notation and algebra are simpler
- Wider set of statistical estimators
- Advantages of Stata
- Supported by a company (can be good or bad)
- Wider range of built-in estimators
- Somewhat easier to carry out basic techniques (reg y x)
- Still more widespread among economists
For those of you who already know Stata, you may find using R quirky at first, but the payoff in the long-run is worth it!
1.4 What is RStudio?
R is simply a programming language. It is meant to translate code into a language computers understand so they can execute programs for you. Though essential, we also want tools to aid us in creating useful code. One can always write code in a simple text editor (eg Notepad), but it will be a very frustrating experience.
RStudio is an integrated development environment (IDE) that helps us write better code. It lets you execute code step-by-step, autocompletes code (ie offers suggestions), shows your environment (ie variables, datasets, functions), and renders plots; everything that will make our programming experience more bearable!
1.5 Learning R
As mentioned in the Introduction, this guide is far from exhaustive. This means that if you want to truly master R, you will need to find external resources. This list is a starting point for some recommended further readings and resources:
- General
- Econometrics
- RStudio
- R Programming (Advanced)
1.6 Understanding This Guide
All code will be surrounded with a grey box, to distinguish it from the text. For example:
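For instance, a code box containing nothing but the number 5 (whose result is discussed just below) would look like this:

```r
5
```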
On the other hand, the results from the execution of the code will be appended with “## [some_num]”, where [some_num] refers to the positional index. This will make more sense when we turn to vectors and matrices. For example, the result of executing the previous code:
## [1] 5
where 5 is the “result” of executing 5 (ie typing a number simply returns that number).
When the code and the result are provided together (as is mostly the case), they will be in the same grey box, with the result still commented out with ##:
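A minimal illustration of this convention:

```r
5 + 5
## [1] 10
```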
1.7 Terminology
It is important to make a few points of clarification regarding the terminology used when speaking about concepts in R, especially when they mean something different in normal “econometrics speak”.
1.7.1 Variable
In R, a variable refers to any object that represents “data”. Data here can be anything that is stored, be it numbers, vectors, matrices, data frames, etc. That means that a variable, \(x\), in R could be a scalar (\(x = 5\)) or a vector (\(\boldsymbol{x} = \begin{pmatrix} 3 & 6 \end{pmatrix}'\)).
Unlike how we normally think of a variable in economics or statistics, a variable in R is not random. It has a deterministic representation.
It should be clear from context whether we are talking about a variable in the R sense, or in the econometric sense (a random variable, or realizations of a random variable in a dataset, eg hourly wages).
1.7.2 Object
The term “object” is somewhat more general than “variable” in R speak (and, for that matter, any programming language). “Object” is something you will come across occasionally when reading this guide, or when reading anything about R. Put simply, everything is an object, such as data and functions. There are many other types of objects that are not important for our purposes.
Discussing objects (and Object Oriented Programming) is beyond the scope of what’s required of us. For those interested, you can get an overview at this Data Mentor article.
1.8 Installing R
R can be installed from the R-Project’s website. It will ask you to choose a mirror (ie a server) from which to download the R package. The easiest is to go to the “cloud” mirror, which will automatically choose the best server for you: Cloud R-Project.
You will need to choose your platform and then the release. As of Oct 10, 2019, the most current version is 3.6.1. Follow the installer’s instructions on how to install R.
1.9 Installing RStudio
Once you’ve installed R, you will need (or more accurately, want) to install RStudio. You can do that from the RStudio Homepage.
1.10 Installing Packages
To get the most out of R, you will want to install external libraries to access new functionality, be it estimators, graphing engines, etc. You can install new packages in R with the install.packages command. For example, if you wanted to install the skimr package (which we will use for descriptive statistics), you would execute:
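```r
install.packages("skimr")
```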
1.11 Getting Started
When you open RStudio for the first time, it should find the R version you previously installed. If you are having troubles finding the correct R version, look at this RStudio Support Article. The screen will look something like in Figure 1.1.

Figure 1.1: RStudio Interface
1.11.1 Console
This is where you enter the code to be executed. For example, if you wanted to know the result of \(5 + 5\) you could type 5 + 5 into the console and hit “Run” (just above the source pane, or Ctrl/Command + Enter). The results will also be printed to the console pane.
1.11.2 Source
Using the console works well for one-off calculations, or for interactive exploration, but when we want to execute a series of commands in a row, and keep track of what we are executing, we will type everything into a text file in the source pane. The general convention is to save this text file with the .R extension.
1.11.3 Environment
When objects are created, such as variables and functions, they will be shown here.
1.11.4 Plots/Help
Whenever you graph something, or look up help on a function with ?, the results will show up in this pane.
1.11.5 Working Directory
When working with external data sets or writing out tables and figures, it is important to tell R where everything on the computer is located. To make everything as portable as possible (ie code that I write can be executed easily on another computer), references to folders and files will be relative. In this context (or more generally, in a programming context) this means relative to a root or working directory. All code will be executed at this level, all data will be stored at this level or lower, and all figures/tables will be exported to this level or lower.
To get the current working directory, one uses the getwd function. The output of this will look different on everyone’s computer. If you want to change your working directory to a different location, you use the setwd function.1
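A short sketch (the path passed to setwd is only an illustration; replace it with a folder that exists on your computer):

```r
getwd()  # prints the current working directory, eg "/home/user"
# setwd("path/to/your/project")  # change the working directory (illustrative path)
```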
1.12 Need Help?
Are you lost even after reading this guide? Does the help file for a function leave you more confused than before? Luckily there is a variety of resources on the web that should be able to address most of your concerns.
The other resources I listed in the Learning R section are quite useful. Googling your questions is also a great way to find answers to what you need. Some packages will have vignettes, which are generally easier to understand than help files, as they give concrete examples (compare the skimr help file with its vignette).
A fantastic website for finding commonly asked questions for R is Stack Overflow, whereas Cross Validated is a great site for questions and answers to statistics questions. Often when you are Googling for a solution, results from these sites will come up.
1.13 Mistakes and/or Typos
If you come across any mistakes or typos, please send me an email at brendan.shanks@econ.lmu.de! This document will be continuously updated with corrections, clarifications, and new information.
1.14 Outline
The guide is structured as follows. The Fundamentals Section covers the main functionality of R, and goes over the various data types built into R. The section on Statistical Methods describes the essential tools for conducting empirical work. The Plots Section introduces you to graphing with R. And the last Section outlines potential avenues to expand your R skillset.
2 Fundamentals
This section covers the essential elements of R that will be required for our course. This is by no means exhaustive, and there will be plenty that is omitted in the interest of time. For those interested, I highly recommend going through the suggested readings on using R for econometrics.
2.1 Basic Operators
The basic operators in R are +, -, *, and / for addition, subtraction, multiplication, and division, respectively. When used with scalars (numeric or integer types in R), they work as expected.
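For example:

```r
5 + 2
## [1] 7
9 - 4
## [1] 5
3 * 6
## [1] 18
8 / 2
## [1] 4
```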
2.2 Built-in Functions
R comes with lots of functions already built into the program, meaning you do not need to load a library in order to access/use them. Table 2.1 lists some of these built-in functions.
R Function | Description |
---|---|
abs(v) | Absolute value of \(v\) (\(|v|\)) |
sqrt(v) | Square root of \(v\) (\(\sqrt{v}\)) |
exp(v) | Exponential function (\(e^{v}\)) |
log(v) | Natural logarithm of \(v\) (\(\ln(v)\)) |
log(v, b) | Logarithm of \(v\) to base \(b\) (\(\log_{b}(v)\)) |
abs(5)
## [1] 5
abs(-5)
## [1] 5
sqrt(6)
## [1] 2.44949
sqrt(-6)
## Warning in sqrt(-6): NaNs produced
## [1] NaN
exp(2)
## [1] 7.389056
log(5)
## [1] 1.609438
log(exp(2))
## [1] 2
In addition, R has many other built-in functions that we will introduce as we go. In every case, the function will be “called” (ie executed) when appended with (). Looking back at the sample of built-in functions in Table 2.1, the square root function sqrt is only called when written as sqrt(v).
For all functions, built-in and loaded from libraries, you can get help on a function by typing ?function in R.
2.3 Variable Assignment
You assign numbers, strings (text), vectors, matrices, data frames, etc. to variables using “<-”. You can also use “=”, but this is not preferred (as “==” is used to test equality).
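For example (the variable names here are arbitrary):

```r
x <- 5            # assign the number 5 to x
y <- "some text"  # assignment works the same way for strings
x
## [1] 5
y
## [1] "some text"
```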
2.4 Data Types
Thus far the examples given have been using the numeric data type. R understands several other data types. For our purposes, these are the most important:
- Numeric: 5, 8.4
- Integer: 6L (the L at the end tells R that the number is an integer)
- Character: "text"
- Logical: TRUE or FALSE
The distinction between a numeric and integer data type is not so important for us (it mainly concerns how much memory is needed to store the data). A character data type, also known as a string, is what we would call text. A logical data type can only take one of two values: TRUE or FALSE.
If we don’t know what type an object is, we can use the class() function.
class(z)
## [1] "numeric"
class(9L)
## [1] "integer"
class("What class is this?")
## [1] "character"
It is also possible to convert from one data type to another using the as functions. For example, if we had a variable in our data set that is stored as a series of 1’s and 0’s, and we wanted to convert them to logicals, we could use as.logical():
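A small sketch of this conversion:

```r
as.logical(c(1, 0, 0, 1))
## [1]  TRUE FALSE FALSE  TRUE
```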
We can also test if an element belongs to a certain data type with the is functions.
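For example:

```r
is.numeric(5)
## [1] TRUE
is.character(5)
## [1] FALSE
is.logical(TRUE)
## [1] TRUE
```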
2.5 Data Structures
Whereas a data type refers to one “element”, the data structure refers to how these elements are arranged. These are the objects that we will operate on when carrying out statistical analyses. The most important for our purposes are:
- (Atomic) Vector
- Matrix
- List
- Data Frame
- Factors
2.6 Vectors
Vectors are the most important of the data structures. In fact, everything we have dealt with so far is a vector:
is.vector(z)
## [1] TRUE
is.vector(TRUE)
## [1] TRUE
is.vector("This is a vector too?!")
## [1] TRUE
This may be somewhat confusing, as we may think of a vector as being a collection of numbers or variables. But the logic still applies here. So far we can think of each “vector” as a vector of length 1 (ie a scalar2). We can see this by applying the length() function, which returns the length of the object.
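For example:

```r
length(5)
## [1] 1
length("a single string also has length 1")
## [1] 1
```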
We now turn to vectors that are longer than one. To create a vector in R, we use the c() operator, with each element separated by a comma:
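For example (this vec, with elements 7, 32, and 4, is the vector used in the coercion example further down):

```r
vec <- c(7, 32, 4)
vec
## [1]  7 32  4
```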
Vectors are always flat (ie one dimensional), thus:
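Nesting c() calls does not create a nested structure; the result is flattened into a single vector:

```r
c(1, c(2, 3), 4)
## [1] 1 2 3 4
```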
But it is important to note that vectors can only store one data type (eg logicals). This can have some unexpected results without warning!
other_vec <- c("element", "next one", vec)
other_vec
## [1] "element" "next one" "7" "32" "4"
class(other_vec)
## [1] "character"
Our vec vector has been coerced to a character vector before being combined (since character types cannot generally be coerced to a numeric type).
There are other ways to make vectors as well. We can use first:last to make a vector going sequentially from first to last. We can also use the seq function, which allows for some more control.
3:40
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## [26] 28 29 30 31 32 33 34 35 36 37 38 39 40
seq(from = 2, to = 20, by = 2)
## [1] 2 4 6 8 10 12 14 16 18 20
2.6.1 Vector Functions and Operations
The base operators we discussed in Section 2.1 also work on vectors of length greater than 1.
When using the base operators on two or more vectors, you must ensure they are the same length. The operators work element-wise.
m <- c(8,5,0)
n <- c(45,94,26)
o <- c(4,2)
m + n
## [1] 53 99 26
n - m
## [1] 37 89 26
n / m
## [1] 5.625 18.800 Inf
m * n
## [1] 360 470 0
Normally, if the vectors are of a different length, R will not let you operate on them. But be careful! If you use the base operators on two vectors where the length of one is a multiple of the length of the other, R will “recycle” (ie repeat the elements in the vector) the shorter vector so it has the same length as the longer one.
o + n
## Warning in o + n: longer object length is not a multiple of shorter object
## length
## [1] 49 96 30
o + new_vec
## [1] 11 34 8 8 7 3
There are also a set of vector-specific functions which make working with vectors easier. A subset of them are given in Table 2.2
R Vector Function | Description |
---|---|
length(v) | Number of elements in the vector \(\boldsymbol{v}\) |
max(v), min(v) | Maximum and minimum element of \(\boldsymbol{v}\) (\(\max(\boldsymbol{v})\), \(\min(\boldsymbol{v})\)) |
ceiling(v), floor(v) | Round every element in \(\boldsymbol{v}\) to the next largest (ceiling) or smallest (floor) integer |
round(v) | Round every element in \(\boldsymbol{v}\) |
sort(v) | Arrange elements in vector \(\boldsymbol{v}\) from smallest to largest |
sum(v) | Sum of elements in \(\boldsymbol{v}\) (\(\sum_i v_i\)) |
prod(v) | Product of elements in \(\boldsymbol{v}\) (\(\prod_i v_i\)) |
mean(v) | Calculate the mean of the elements in \(\boldsymbol{v}\) |
median(v) | Calculate the median of the elements in \(\boldsymbol{v}\) |
sd(v) | Calculate the standard deviation of the vector \(\boldsymbol{v}\) |
var(v) | Calculate the variance of the vector \(\boldsymbol{v}\) |
cor(v, w) | Calculate the correlation between the vectors \(\boldsymbol{v}\) and \(\boldsymbol{w}\) |
numeric(n) | Vector of \(0\)’s of length \(n\) (\(\boldsymbol{0} \in \mathbb{R}^n\)) |
rep(v, n) | Repeat the vector \(\boldsymbol{v}\) \(n\) times |
max(m)
## [1] 8
min(m)
## [1] 0
sort(m)
## [1] 0 5 8
sum(m)
## [1] 13
prod(m)
## [1] 0
mean(m)
## [1] 4.333333
median(m)
## [1] 5
sd(m)
## [1] 4.041452
var(m)
## [1] 16.33333
sqrt(var(m))
## [1] 4.041452
cor(m, n)
## [1] 0.4055098
numeric(10)
## [1] 0 0 0 0 0 0 0 0 0 0
rep(m, 10)
## [1] 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0 8 5 0
rep(m, each = 10)
## [1] 8 8 8 8 8 8 8 8 8 8 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0
2.6.2 Logical Operators
We can even apply logical operations to each element of a vector with a single statement. This is done with the following operators: ==, >, <, >=, <=, and != (with the last three referring to “larger than or equal to”, “smaller than or equal to”, and “not equal to”, respectively).
You can use the & and | operators for “and” and “or” respectively. Prepending a statement with a ! means “not”.
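A brief sketch, reusing a vector like the m from above:

```r
m <- c(8, 5, 0)
m > 4
## [1]  TRUE  TRUE FALSE
m != 5
## [1]  TRUE FALSE  TRUE
m > 0 & m < 8
## [1] FALSE  TRUE FALSE
!(m >= 5)
## [1] FALSE FALSE  TRUE
```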
2.6.3 Strings
When working with vectors which consist of string (text) elements, there are a few other functions that may be useful. Perhaps the most important of these is the paste function. As the name implies, it will concatenate two strings with one another. For example, given two strings:
string1 <- "The first part"
string2 <- "and the second part"
paste(string1, string2)
## [1] "The first part and the second part"
paste will even work on vectors longer than one. In this case, it will append/prepend the vector with the scalar.3
paste(m, "and text")
## [1] "8 and text" "5 and text" "0 and text"
paste(m, n)
## [1] "8 45" "5 94" "0 26"
By default, paste separates the two (or more) elements with a space (" "). This can be overridden with the sep argument. If you want there to be no separator, you can either set sep = "" or use paste0 instead.
paste("Part 1", "Part 2", sep = "")
## [1] "Part 1Part 2"
paste0("Part 1", "Part 2")
## [1] "Part 1Part 2"
paste("Part 1", "Part 2", sep = " : ")
## [1] "Part 1 : Part 2"
If you want to convert a vector of text into a scalar, you can use the collapse argument, setting it equal to the new separator.
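For example:

```r
paste(c("one", "long", "string"), collapse = " ")
## [1] "one long string"
```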
2.6.4 Remove Objects
After using R for a while, you may start to notice that you are starting to collect many objects and are having difficulties keeping track of what’s what. In RStudio, you can have an overview of all objects in the environment pane (top-right by default). You can also get a list of all objects by using the ls
function.
Now that you know what is in your environment, you may want to start removing some obsolete objects. You can use the rm function to achieve this. For example, to remove the vector new_vector:
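The call would be (creating the object first so the sketch is self-contained):

```r
new_vector <- 1:3  # create an object ...
rm(new_vector)     # ... and remove it from the environment again
```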
You can combine rm and ls to remove all objects in memory.
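A sketch of this (careful: it wipes your entire workspace):

```r
rm(list = ls())  # ls() lists all objects; rm then removes every one of them
```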
2.6.5 Indexing Vectors
Often we wish to extract a subset of an entire vector, or even a single element. We achieve this through indexing with []. Given a positional number, or a vector of positional numbers, we can extract specific elements.
vec <- -10:10
vec[]
## [1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
## [20] 9 10
vec[3]
## [1] -8
vec[3:6]
## [1] -8 -7 -6 -5
vec[c(1,20)]
## [1] -10 9
vec[length(vec)]
## [1] 10
It is also possible to index a vector with another vector of logicals of the same length, where TRUE returns the element and FALSE does not.
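For example:

```r
short_vec <- c(10, 20, 30)
short_vec[c(TRUE, FALSE, TRUE)]
## [1] 10 30
```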
Another useful function for indexing is the which function. Given a vector of logicals, it will return the positions of the TRUE elements.
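For example, with vec <- -10:10 as above:

```r
vec <- -10:10
which(vec > -2 & vec < 2)
## [1] 10 11 12
```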
Importantly, which returns the indices, not the elements themselves. So in the above example, 10, 11, 12 refer to the positions in the vector vec where the elements are larger than -2 and smaller than 2, whereas if we wanted the elements themselves, we could use vec[vec < 2 & vec > -2], which would return -1, 0, 1.
2.6.6 Vector Assignment
Using the indices, we can re-assign specific elements of a vector with a new element.
vec
## [1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
## [20] 9 10
vec[2] <- -15
vec
## [1] -10 -15 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
## [20] 9 10
If we assign an element to an index that is out of range (ie an index larger than the length of the vector), then R fills the positions in between with NA’s. (NA means “not available” and indicates a missing value.)
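For example:

```r
short <- c(1, 2)
short[5] <- 9   # index 5 is beyond the current length of 2
short
## [1]  1  2 NA NA  9
```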
2.6.7 Naming Vectors
Occasionally we may want to name each element in a vector in order to increase readability and understandability (and aid with indexing/assignment). We can do this with the names function.
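A small sketch (the element names here are made up):

```r
grades <- c(1.3, 2.0, 1.7)
names(grades) <- c("Anna", "Ben", "Clara")  # hypothetical names
grades["Ben"]  # named elements can then be indexed by name
```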
2.7 Factors
Factors are a special type of vector. They are used to represent (ordered or unordered) categorical or qualitative data. In economics, examples of this can be countries, gender, industry, Likert scales, etc. Factors are stored as a vector of integers, with each integer representing a respective outcome. However, factors are presented as strings, which can make working with them a little complicated.
We will look at two examples: one with gender, where the options will be female or male, and a simple three-option Likert scale consisting of disagree, neutral, and agree. The first example is unordered while the second one is ordered.
genders <- factor(rep(c("female", "male"), times = 6))
likert <- factor(
rep(c("Disagree", "Neutral", "Agree"), times = 4),
levels = c("Disagree", "Neutral", "Agree"),
ordered = TRUE
)
Notice that in the ordered case we set the relative order of the levels with the levels option. If we left this unspecified, R would order the options alphabetically.
We can investigate factors with the levels and nlevels functions.
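Applied to the likert factor from above:

```r
likert <- factor(
  rep(c("Disagree", "Neutral", "Agree"), times = 4),
  levels = c("Disagree", "Neutral", "Agree"),
  ordered = TRUE
)
levels(likert)
## [1] "Disagree" "Neutral"  "Agree"
nlevels(likert)
## [1] 3
```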
Also note that since factors are a special type of vector, certain operations do not work on them (since they are generally meaningless). And some functions work only on ordered factors, but not unordered factors!
Converting from Factors
Sometimes we will require that our data is a factor and other times we will simply want text. Thus, functions for converting between the two can come in handy.
as.character(genders)
## [1] "female" "male" "female" "male" "female" "male" "female" "male"
## [9] "female" "male" "female" "male"
as.numeric(likert)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
Notice how converting factors to numeric results in a series of integers? Remember that all factors are stored as a vector of integers!
As we will see, there are many other things that we can do with factors.
2.8 Random Numbers
When running simulations, such as a Monte Carlo Simulation, or simply for the purpose of better understanding an estimator, we need to generate realizations from a given distribution. In other words, we need to generate random numbers.
Before we do that, it is useful to step back a bit and briefly discuss some common characteristics of (univariate) distribution functions. The Probability Density Function (PDF) of a given distribution maps each possible outcome to the relative likelihood of that outcome being drawn from the distribution.
The Cumulative Distribution Function (CDF) of a distribution, on the other hand, states the probability that a draw from the distribution of \(X\) will take a value of \(x\) or lower (\(P(X \leq x)\)). As the CDF is a monotonic function for the distributions that we will be working with, it is also possible to take the inverse of the function.
In R, we can do calculations with these distributions by prepending the name of the distribution with either d, p, or q.
- d returns the “height” (density or relative likelihood) of the PDF at x
- p returns the probability from the CDF for a given x
- q is the inverse of the CDF and returns the quantile x for a given probability
For example, to use the CDF of the normal distribution we would use the pnorm function (norm being the name of the normal distribution in R).
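For example:

```r
pnorm(0)       # P(X <= 0) for the standard normal
## [1] 0.5
qnorm(0.975)   # the quantile with 97.5% of the mass below it
## [1] 1.959964
dnorm(0)       # density of the standard normal at 0
## [1] 0.3989423
```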
For an overview of the distributions available in R (technically from the built-in “stats” package) type ?Distributions. Each distribution has various parameters that can be altered, such as the mean, standard deviation, degrees of freedom, etc.
Figure 2.1 shows the PDF of the Normal Distribution with a visualization of how the dnorm()
function works.

Figure 2.1: Probability Density of Normal Distribution
The CDF of the Normal Distribution is shown in Figure 2.2. It also highlights the use of the pnorm()
and qnorm()
functions.

Figure 2.2: Cumulative Density of Normal Distribution
2.8.1 Random Draws from a Distribution
Generating draws from the distributions available in R uses the same logic as the other distribution functions: prepend the desired distribution (dist) with r (for “random”).
The main argument for the random number generating functions is the number of draws, n. The functions will return a vector of length n filled with realizations from the named distribution with the given parameters.
rnorm(10) # Mean of 0 and SD of 1 are default
## [1] -0.64299680 0.31746201 -0.10250011 1.13841323 0.45766026 -1.45672458
## [7] -0.55419768 0.03137946 0.20429414 1.33667443
rnorm(10, mean = 5, sd = 2)
## [1] 6.1647477 4.9640196 4.7688927 3.3567738 3.3358199 3.7416206 5.0932894
## [8] 5.2060580 0.8916436 3.3277690
rchisq(10, df = 5)
## [1] 4.3325927 2.7069218 5.2703912 1.7464303 4.2435745 7.1613996 8.0302707
## [8] 0.3144991 2.6654725 5.0968154
2.8.2 Seed
One important facet of applied work is replicability. This means that, given your code and data, other researchers can obtain exactly the same results as you did (ideally with a single push of a button!).
But when we are dealing with simulations, where realizations of random numbers are essential, how can we expect to get exactly the same results each time we run the simulation?
That is where the seed comes in. In essence, it specifies the starting point of a sequence of random numbers. The Wikipedia article on the seed provides a good description of what it does.
For our purposes, we can “set the seed” before we start drawing realizations from specified distributions. This ensures we will draw the same numbers from the distribution each time we run the code.
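A sketch with set.seed (the seed value 123 is arbitrary):

```r
set.seed(123)
draws1 <- rnorm(3)
set.seed(123)              # reset the seed ...
draws2 <- rnorm(3)
identical(draws1, draws2)  # ... and get exactly the same draws
## [1] TRUE
```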
2.9 Special Values
There are a few values that you will come across when using R that may seem a bit mysterious at first, but have precise meanings. We have already encountered a few of them. A summary is given in Table 2.3.
Value | Meaning | Test Membership |
---|---|---|
NA | Not Available (missing) | is.na |
NaN | Not a Number | is.nan, is.na |
Inf, -Inf | Positive/Negative Infinity | is.infinite, !is.finite |
z <- c(1)
z[4] <- 5
z
## [1] 1 NA NA 5
z/0
## [1] Inf NA NA Inf
0/0
## [1] NaN
log(0)
## [1] -Inf
log(-10)
## Warning in log(-10): NaNs produced
## [1] NaN
Some functions, like the log function, produce warnings when NaN values are produced. We can also test for membership to these special values with the is functions. These are provided in the third column of Table 2.3.
2.10 Matrices
An extension of the vector data structure is the matrix. This is essentially a vector with an addition dimension.4 Whereas a vector is flat, a matrix consists of rows and columns. Just like vectors, matrices must contain elements of the same data type. Given one or more vectors, you can create matrices with the rbind
and cbind
functions.
a <- 1:6
b <- 7:12
c <- 13:18
m <- rbind(a, b, c)
n <- cbind(a, b, c)
m
## [,1] [,2] [,3] [,4] [,5] [,6]
## a 1 2 3 4 5 6
## b 7 8 9 10 11 12
## c 13 14 15 16 17 18
n
## a b c
## [1,] 1 7 13
## [2,] 2 8 14
## [3,] 3 9 15
## [4,] 4 10 16
## [5,] 5 11 17
## [6,] 6 12 18
The `matrix` function can also be used to specify the number of rows/columns of a matrix or to create an empty matrix.
matrix(1:24, nrow = 2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 1 3 5 7 9 11 13 15 17 19 21 23
## [2,] 2 4 6 8 10 12 14 16 18 20 22 24
matrix(1:24, ncol = 2)
## [,1] [,2]
## [1,] 1 13
## [2,] 2 14
## [3,] 3 15
## [4,] 4 16
## [5,] 5 17
## [6,] 6 18
## [7,] 7 19
## [8,] 8 20
## [9,] 9 21
## [10,] 10 22
## [11,] 11 23
## [12,] 12 24
matrix(nrow = 2, ncol = 2)
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
You can use the `nrow`, `ncol`, and `dim` functions to get the dimensions of a matrix.
Just like with vectors, you can name matrices. You can provide a vector of names for both the rows and for the columns.
rownames(m)
## [1] "a" "b" "c"
colnames(m)
## NULL
colnames(m) <- LETTERS[1:ncol(m)]
colnames(m)
## [1] "A" "B" "C" "D" "E" "F"
Matrices can also be indexed like vectors. Because there are two dimensions, you need to supply two indices (or at least indicate which dimension you are indexing).
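As a sketch (recreating the matrix `m` from above), indexing with two positions works as follows:

```r
m <- rbind(a = 1:6, b = 7:12, c = 13:18)

m[2, 3]   # single element: row 2, column 3
m[1, ]    # the entire first row (returned as a vector)
m[, 2]    # the entire second column
m["b", ]  # rows (and named columns) can also be indexed by name
```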
2.10.1 Matrix Algebra
We can also use matrix algebra functions with matrices in R. Table 2.4 lists some of the functions that can be applied to matrices.
Function | Description |
---|---|
`t(M)` | Transpose of the matrix M (\(\boldsymbol{M}'\) or \(\boldsymbol{M}^T\)) |
`identical(M, R)` | Test if matrices M and R are the same |
`M * R` | Element-wise multiplication of M and R (\(\boldsymbol{M} \odot \boldsymbol{R}\)) |
`M %*% R` | Matrix multiplication of M and R (\(\boldsymbol{M} \boldsymbol{R}\)) |
`crossprod(M, R)` | Same as `t(M) %*% R` (\(\boldsymbol{M}' \boldsymbol{R}\)) |
`solve(M)` | Inverse of matrix M (\(\boldsymbol{M}^{-1}\)) |
`diag(M)` | The diagonal vector of matrix M (\(\text{diag}\, \boldsymbol{M}\)) |
`rowSums(M)`, `colSums(M)` | Row-/column-wise sums of the elements of M |
`rowMeans(M)`, `colMeans(M)` | Row-/column-wise means of the elements of M |
M <- matrix(rnorm(16), nrow = 4)
R <- matrix(rnorm(16), nrow = 4)
t(M)
## [,1] [,2] [,3] [,4]
## [1,] -0.3975054 0.4336701 0.5274228 0.3849394
## [2,] -0.9837064 -0.7759925 0.1923013 -0.9442839
## [3,] -1.2623426 -0.2245831 0.9692689 0.4504042
## [4,] 0.5843260 1.0682363 0.3192506 -0.6585833
M %*% R
## [,1] [,2] [,3] [,4]
## [1,] -0.1103055 0.3437971 -0.09487179 -0.66672829
## [2,] 0.3029771 2.9155968 -1.97731576 0.69438809
## [3,] -0.2097555 1.3041029 -1.43760182 1.24919106
## [4,] 1.0638162 1.1222780 -0.28578469 -0.04977172
solve(M)
## [,1] [,2] [,3] [,4]
## [1,] -4.362764 4.0045264 -4.875427 0.2611986
## [2,] -1.498631 0.9198999 -1.480591 -0.5552803
## [3,] 2.288065 -2.2020343 3.442356 0.1270255
## [4,] 1.163539 -0.4843006 1.627443 -0.4827018
colMeans(R)
## [1] 0.1306180 0.8225804 -0.5405077 0.3965967
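As a quick sanity check on Table 2.4 (a sketch with a small random matrix), multiplying a matrix by its inverse should return the identity matrix, up to floating-point error:

```r
set.seed(1)
M <- matrix(rnorm(9), nrow = 3)

# M %*% solve(M) should be (numerically) the 3x3 identity matrix
all.equal(M %*% solve(M), diag(3))
```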
2.11 Lists
Lists are more general than both vectors and matrices. They act like containers, holding other objects (vectors, matrices, even other lists!).
Though we generally will not create our own lists, we will have to deal with lists as outputs of other functions (eg the output of a linear model, an OLS estimation, is a list). To create a list you use the `list` command.
lst <- list(1:10, c(TRUE, FALSE), letters)
lst
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] TRUE FALSE
##
## [[3]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
You may have noticed that unlike vectors and matrices, we can put objects of different data types into one list.
We can also name each element in a list.
names(lst) <- c("nums", "logical", "letters")
lst
## $nums
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $logical
## [1] TRUE FALSE
##
## $letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
One has to be a bit more careful when accessing elements in a list than with other data types. To index based on position, one must use `[[]]` (double square brackets). If you only use one pair of brackets, you get a list with one object in it (rather than the object itself). But the easiest way to extract the desired object is by using `$` followed by the name of the object.
lst$letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
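To see the difference between single and double brackets, here is a small sketch using a list like the one above:

```r
lst <- list(nums = 1:10, logical = c(TRUE, FALSE))

class(lst[1])    # single brackets: a list containing one object
class(lst[[1]])  # double brackets: the object itself (an integer vector)
identical(lst[[1]], lst$nums)  # [[ ]] and $ extract the same object
```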
You can also add another object to the list through similar means.
lst$months <- month.abb
lst
## $nums
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $logical
## [1] TRUE FALSE
##
## $letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
##
## $months
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Because lists can hold various different types of objects, they are useful for storing the results of estimated models, Monte Carlo simulations, etc. For example, fitting an OLS regression may return the degrees of freedom (a scalar), a residual for each observation (a vector), and the variance-covariance matrix of the estimated coefficients (a matrix).
2.12 Data Frames
Data frames are what most people may find recognizable. For those with experience in Stata, they are most similar to Stata’s dataset. They are similar to matrices in that they are two-dimensional (consisting of rows and columns), but unlike matrices, they can hold data of different types. They can be thought of as a list of equal-length vectors. We can use the `data.frame` function to construct a new data frame, with each argument a (possibly named) vector.
df <- data.frame(
numerals = 1:10,
roman = as.character(as.roman(1:10)),
letters = letters[1:10],
stringsAsFactors = FALSE
)
Importantly, we can decide if we want strings to be considered as factors, or just as characters. This will depend on how we use the variable in our analyses.
Since data frames are similar to lists, the indexing of data frames is also similar, though there are key differences.
df[1]
## numerals
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
df[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
df$numerals
## [1] 1 2 3 4 5 6 7 8 9 10
df[c("numerals", "letters")]
## numerals letters
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
df[4:6,3]
## [1] "d" "e" "f"
2.12.1 Variable Creation
In empirical work we often want to create new variables from the existing ones (GDP per capita from GDP and population, log wages from wages, etc.). We can do that by assigning vectors to an existing or new column.
df$capital_letters <- LETTERS[1:10]
df$roman2letter <- paste(df$roman, df$letters, sep = "_")
df$numerals <- sort(df$numerals, decreasing = TRUE)
df
## numerals roman letters capital_letters roman2letter
## 1 10 I a A I_a
## 2 9 II b B II_b
## 3 8 III c C III_c
## 4 7 IV d D IV_d
## 5 6 V e E V_e
## 6 5 VI f F VI_f
## 7 4 VII g G VII_g
## 8 3 VIII h H VIII_h
## 9 2 IX i I IX_i
## 10 1 X j J X_j
In addition to creating new variables, sometimes we will want to work only with a subset of a data frame (think of Stata’s `drop ... if ...`). There are several ways to do this. We can even combine this with variable assignment to make conditional changes (Stata: `replace ... if ...`).
subset(df, numerals > 5)
## numerals roman letters capital_letters roman2letter
## 1 10 I a A I_a
## 2 9 II b B II_b
## 3 8 III c C III_c
## 4 7 IV d D IV_d
## 5 6 V e E V_e
df[df$numerals > 5, ]
## numerals roman letters capital_letters roman2letter
## 1 10 I a A I_a
## 2 9 II b B II_b
## 3 8 III c C III_c
## 4 7 IV d D IV_d
## 5 6 V e E V_e
df[df$numerals > 5, "numerals"] <- 20
df
## numerals roman letters capital_letters roman2letter
## 1 20 I a A I_a
## 2 20 II b B II_b
## 3 20 III c C III_c
## 4 20 IV d D IV_d
## 5 20 V e E V_e
## 6 5 VI f F VI_f
## 7 4 VII g G VII_g
## 8 3 VIII h H VIII_h
## 9 2 IX i I IX_i
## 10 1 X j J X_j
2.13 Import/Export Data
The data that an econometrician uses comes in a wide variety of formats. Some of these can be read/written natively in R, but many of them require external libraries. Some examples of such libraries are `haven`, `foreign`, `readr`, and `readxl`, though there are many others.
For our purposes, we will stick with R’s default data storage format, `.rds`. For the tutorials, all datasets will be provided in this format. For your own work, though, you will want to learn how to open (and write to) files such as `.csv`, `.xls`, and `.dta`.
To save an R object (of any data structure) to our local computer, we can use the `saveRDS` function. Similarly, to open a saved object, we use the `readRDS` function. Remember, the path will be relative to your working directory; see the Working Directory section.
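A minimal sketch of this round trip (using a temporary file so the example is self-contained; in practice you would use a path such as `"data/mydata.rds"` relative to your working directory):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

path <- tempfile(fileext = ".rds")  # stand-in for a real file path
saveRDS(df, path)     # write the object to disk

df2 <- readRDS(path)  # read it back in
identical(df, df2)    # TRUE: the object is restored exactly
```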
2.13.1 Built-in Data Sets
R also comes pre-loaded with built-in datasets. These can be useful for playing around with some of the functionality of R before using real-world data. To find out which datasets are available, use the `data` function.
2.13.2 Datasets from the Web
You can also load datasets directly from a website. For example, if we wanted to access datasets from Wooldridge’s Introductory Econometrics textbook, which come in Stata’s data format (`.dta`), we could use the `read_dta` function from the `haven` package, supplying the URL instead of the file path.
2.14 Loops/Functionals
A common task when working with datasets is executing the same function (or performing the same operation) repeatedly for different variables or a subset of the original data. Perhaps the most straightforward way to do this is to write the function `n` times. For example, if we wanted to calculate the mean of each column (variable) in the `mtcars` data frame, we could execute the following functions:
mean(mtcars$mpg)
## [1] 20.09062
mean(mtcars$cyl)
## [1] 6.1875
mean(mtcars$disp)
## [1] 230.7219
# ...
However, with a larger number of operations this becomes infeasible (this method is very difficult to scale up). This is where the `for` loop comes in. It executes a specific set of operations for a pre-determined number of iterations. If this sounds confusing, consider the following (trivial) example:
You wish to print the integers from 1 to 10 sequentially to the console. To achieve this in R, with a loop, you would do the following:
for (i in 1:10) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
where `for` indicates the start of a loop, and the elements inside the first parentheses, `(i in 1:10)`, indicate both the iteration variable (`i`) and the set of values to loop over (`1:10`). The iteration variable can be whatever letter/name you would like it to be. The set of values to loop over can be any vector, and does not necessarily have to be numbers. The loop will execute the expression in the braces (`{}`) `length(1:10)` times.
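Since the set does not have to be numeric, we can, as a small sketch, loop over a character vector directly:

```r
# the iteration variable takes each string in turn
for (day in c("Mon", "Tue", "Wed")) {
  print(paste("Today is", day))
}
```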
Turning back to our example with the `mtcars` dataset, we could calculate the mean for columns 2 to 6 by:
for (col in 2:6) {
mu <- mean(mtcars[,col])
print(mu)
}
## [1] 6.1875
## [1] 230.7219
## [1] 146.6875
## [1] 3.596563
## [1] 3.21725
This is just a taste of what `for` loops can do. They are highly flexible.
2.14.1 Apply Family
Those having experience in other programming languages will be familiar with `for` loops. However, in R you will most often see an `apply` function in place of a loop. They are often preferred to loops because they are more concise and make the intent clearer.
The main function from this family is, somewhat unsurprisingly, `apply`. But we will also look at the `lapply`/`sapply` function(s) as well.5
The syntax for the two functions is similar; only with `apply` do we need to specify the `MARGIN` argument (ie rows [1] or columns [2]). Returning to the example above, we could achieve the same thing with the `apply` functions:
apply(mtcars[, 2:6], 2, mean)
## cyl disp hp drat wt
## 6.187500 230.721875 146.687500 3.596563 3.217250
lapply(mtcars[, 2:6], mean)
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
sapply(mtcars[, 2:6], mean)
## cyl disp hp drat wt
## 6.187500 230.721875 146.687500 3.596563 3.217250
You’ll notice that the only difference between the last two is the data structure of the returned object; `lapply` returns a list while `sapply` tries to simplify the result into a vector.
So since a `for` loop, `apply`, `lapply`, and `sapply` all achieve the same results, which one should we use? The answer is that it depends. If we require the results from iteration `i` for iteration `i+1`, then a `for` loop becomes essential. If we are working on the columns of a data frame, `lapply` or `sapply` are the simplest. `apply` is useful if we want to work row-wise.
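As a sketch of a case where a `for` loop is essential, consider a recursion where each value depends on the previous one, similar to simulating an AR(1) process (the coefficient 0.8 here is an arbitrary choice):

```r
set.seed(1)
n   <- 100
eps <- rnorm(n)
y   <- numeric(n)  # vector of zeros to fill in

# y[i] depends on y[i - 1], so the iterations must run in order
for (i in 2:n) {
  y[i] <- 0.8 * y[i - 1] + eps[i]
}

head(y)
```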
2.15 Conditionals
When working with loops (and with custom functions, which we will see shortly), we may only want to execute specific operations when certain criteria are met.
Going back to the example of printing all integers from 1 to 10, let’s say we only wanted to print the odd numbers. Then we could use a `for` loop again, but with a conditional statement. A conditional tests an expression that returns `TRUE` or `FALSE`, and its body is only executed if `TRUE` is returned.
for (i in 1:10) {
if (i %% 2 != 0) {
print(paste(i, "is odd"))
}
}
## [1] "1 is odd"
## [1] "3 is odd"
## [1] "5 is odd"
## [1] "7 is odd"
## [1] "9 is odd"
In the example above, `%%` is the modulus operator (ie `m %% n` returns the remainder of the division of `m` by `n`; so if `m` is 5 and `n` is 2, `m %% n` is 1). Since odd numbers cannot be exactly divided by 2, the modulus of an odd number divided by 2 is not zero (it is 1). Therefore, `i %% 2 != 0` evaluates to `TRUE` only for odd numbers.
We can also test other statements using `else if`. Let’s say that if `i` equals six, we would rather print the statement `i equals 6` than just `i` itself.
for (i in 1:10) {
if (i %% 2 != 0) {
print(paste(i, "is odd"))
} else if (i == 6) {
print(paste("i equals", i))
}
}
## [1] "1 is odd"
## [1] "3 is odd"
## [1] "5 is odd"
## [1] "i equals 6"
## [1] "7 is odd"
## [1] "9 is odd"
So even though `6` fails our first test (it is not an odd number), it satisfies our second conditional (it equals six). Finally, we can use the `else` expression to capture everything that failed the other conditionals (we can use as many `else if` conditionals as we like).
for (i in 1:10) {
if (i %% 2 != 0) {
print(paste(i, "is odd"))
} else if (i == 6) {
print(paste("i equals", i))
} else {
print(paste(i, "is not odd or six"))
}
}
## [1] "1 is odd"
## [1] "2 is not odd or six"
## [1] "3 is odd"
## [1] "4 is not odd or six"
## [1] "5 is odd"
## [1] "i equals 6"
## [1] "7 is odd"
## [1] "8 is not odd or six"
## [1] "9 is odd"
## [1] "10 is not odd or six"
2.15.1 Vectorized Conditional
So far, the conditionals we have explored only handle a single element (ie a vector of length one). We may also want to carry out some operation on the elements of a vector that satisfy a certain criterion. We could loop through each element of the vector to achieve this, but that is not necessary.
Let’s say we wanted to take the log of each element of a vector that contains negative values, and wanted to set the negative values (and zero) to `NA`. We could do that as follows:
vec_log <- ifelse(vec > 0, log(vec), NA)
## Warning in log(vec): NaNs produced
vec_log
## A B C D E F G H
## NA NA NA NA NA NA NA NA
## I J K L M N O P
## NA NA NA 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
## Q R S T U V W X
## 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851 NA NA NA
## Y
## 4.6051702
This is extremely useful in this context, because R would normally return `NaN` for negative numbers and `-Inf` for zero, which can cause problems for other R functions.
2.16 Writing Functions
We can also write our own functions in addition to the functions available built-in or from loaded libraries. Functions, as in the mathematical sense, take an input (vector, matrix, data frame, even another function!) and return an output (generally a scalar, but can be other data types as well). This can be useful when we want to carry out an operation repeatedly in various contexts.
As an example, consider a function that tests whether a number is a square number.
isSquareNumber <- function(x) {
sqrt(x) %% 1 == 0
}
isSquareNumber(2)
## [1] FALSE
isSquareNumber(9)
## [1] TRUE
isSquareNumber(-9)
## Warning in sqrt(x): NaNs produced
## [1] NA
isSquareNumber("9")
## Error in sqrt(x): non-numeric argument to mathematical function
Here `x` is the input, and the output is the result of the single line of code. Our function works, but it is not robust! Let’s use what we learned about conditionals to handle such special cases.
isSquareNumber <- function(x) {
if (!is.numeric(x)) {
output <- NA
} else if (x > 0) {
output <- sqrt(x) %% 1 == 0
} else {
output <- NA
}
return(output)
}
isSquareNumber(9)
## [1] TRUE
isSquareNumber(-9)
## [1] NA
isSquareNumber("9")
## [1] NA
We’ve also explicitly told the function to return the object `output`. This is useful when functions become more complicated and execute numerous lines of code, returning different values.
3 Basic Statistical Methods
3.1 Monte Carlo Simulations
One statistical method that we will use during the course is called the Monte Carlo method. The general idea is as follows:
- We specify the data generating process (DGP) for a set of random variables (ie we specify the parameters and distributions of the DGP).
- We draw a random sample from the specified DGP and estimate certain parameters.
- We do this repeatedly to get information on certain parameters of interest (and to learn more about the properties of specific estimators).
In our examples, this will be mainly done to learn about the properties of basic estimators, such as the mean estimator or the OLS estimator. But now you may be wondering: why do we go through this process when we know the statistical properties of these estimators already? The answer is because it aids in understanding these estimators; drawing random samples is more intuitive than asymptotic theory. In current research, it is used to infer certain properties of estimators that have no analytical solution or are difficult to calculate.
We will go through more advanced examples in class; for now let’s work with the simplest case. We have only one random variable, distributed as follows:
\[\begin{equation} Y_i \sim N(\mu, \sigma^2) \tag{3.1} \end{equation}\]
And we are interested in inferring the properties of the mean estimator, defined as:
\[ \hat{\mu} = n^{-1} \sum_{i=1}^n Y_i \] In the end, we want to know the properties of \(\hat{\mu}\), specifically its distribution, mean, and variance (\(F(\hat{\mu}), E(\hat{\mu}), var(\hat{\mu})\), where \(F\) is an unspecified CDF). Of course, the properties of this estimator are already well known:
\[ \hat{\mu} \sim N(\mu, \sigma^2 / n) \]
We can use a Monte Carlo simulation to verify this. There are numerous ways to do this in R, but as it involves repeatedly drawing from a distribution, we will need to employ either a `for` loop or some of the `apply` functions. We will take the latter approach. Specifically, we will use the `replicate` function, which is related to the other `apply` functions (reminder: if you don’t know what a function does, or how to use it, type `?replicate`).
The first step is to specify a DGP and draw a sample from this distribution. We will do this with a function, as we will execute it over and over. The following function takes three arguments: the sample size (`n`), the mean (`mean`), and the standard deviation of the random variable (`sd`). It will draw a sample of size `n` from the distribution specified in Equation (3.1). It will then estimate the parameter \(\mu\) with the estimator \(\hat{\mu}\) and return that value.
The function requires a value for `n`, but the `mean` and `sd` arguments can be left blank and will revert to their default values of 0 and 1, respectively.
mean_rnorm_sample <- function(n, mean=0, sd=1) {
samp <- rnorm(n, mean, sd)
mu <- mean(samp)
return(mu)
}
mean_rnorm_sample(100, 5, 2)
## [1] 5.27703
Before we start the replications, we need to set the seed (see the section on working with a Seed). This will ensure that our results are replicable (and that we get the same results every time!).
Then, we will execute the above function 1000 times (\(r = 1000\)), with a sample size of 1000 (\(n = 1000\)). We will set \(\mu = 5\) and \(\sigma = 2\). We will store the estimated \(\hat{\mu}\)’s in a vector called `mu_hats`.
set.seed(38547)
sample_size <- 1000
mean <- 5
sd <- 2
mu_hats <- replicate(1000, mean_rnorm_sample(sample_size, mean, sd))
Now that we have run the simulation, we can investigate the properties of the mean estimator. Since we also know the true values of the parameters, we can compare the empirically estimated values with the theoretical ones.
mu_hat_mean <- mean(mu_hats)
mu_hat_mean
## [1] 5.002903
mu_hat_var <- var(mu_hats)
mu_hat_var
## [1] 0.003852582
## [1] "Empirical mean of estimator is 5.0029, theoretical value is 5."
## [1] "Empirical variance of estimator is 0.0039, theoretical value is 0.004 (2^2 / 1000)."
As you can see, the Monte Carlo simulation returns what we would expect from statistical theory.
We can also plot a histogram of the estimated means and compare it with the normal distribution:
hist(mu_hats, breaks = 30, freq = FALSE)
curve(dnorm(x, mean = mean, sd = sd/sqrt(sample_size)), add = TRUE) # remember sd adjust.
Looks pretty close! This is not very surprising given we know the true distribution, but this would help us inspect the properties of an estimator when we do not have any theory to guide us.
3.2 Summary Stats
Because there is no single set of “definitive” summary statistics, there are many different ways to inspect datasets. You can of course look at one statistic at a time with functions such as `mean`, `median`, `sd`, etc., but normally you will want to calculate a set of statistics for numerous variables all at once.
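One way to do this (a sketch combining `sapply` from above with a small anonymous function):

```r
# mean and standard deviation for the first four variables in mtcars
sapply(mtcars[, 1:4], function(x) c(mean = mean(x), sd = sd(x)))
```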
One option is the built-in `summary` function, which returns basic statistics.
mtcars_subset <- mtcars[,1:4]
summary(mtcars_subset)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
For more detailed and/or comprehensive statistics, look to packages such as `Hmisc`, `pastecs`, `psych`, or `skimr`. For example, the `skim` function from `skimr` neatly summarizes the data:
Name | mtcars_subset |
---|---|
Number of rows | 32 |
Number of columns | 4 |
Column type frequency: numeric | 4 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.4 | 15.43 | 19.2 | 22.8 | 33.9 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.0 | 4.00 | 6.0 | 8.0 | 8.0 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.1 | 120.83 | 196.3 | 326.0 | 472.0 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.0 | 96.50 | 123.0 | 180.0 | 335.0 | ▇▇▆▃▁ |
3.3 Linear Regression
The workhorse estimator of economics is the OLS estimator. It is robust, adaptable, and expandable; its properties are well understood; and it can be used in a wide variety of situations. In R, we can estimate a linear model with the `lm` command.
Using the `mtcars` data, if we wanted to investigate the relationship between horsepower (`hp`) and mileage (`mpg`), we could run a regression of mileage on horsepower. The `lm` function requires at least a formula and a dataset.
The formula specifies the relationship you are estimating. The general format is `y ~ x1 + x2`, where `y` is an outcome variable and `x1` and `x2` are explanatory variables. The dataset simply refers to the data you want to use.
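For instance, a sketch of a formula with two explanatory variables (using `mtcars`):

```r
# regress mileage on horsepower and weight
fit <- lm(mpg ~ hp + wt, data = mtcars)
coef(fit)  # one intercept plus one slope per regressor
```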
The `lm` function returns an `lm` object, which contains information on the estimated coefficients, fitted values, degrees of freedom, etc. We can use R’s `summary` command to print the most important information to the console.
lfit <- lm(mpg ~ hp, data = mtcars)
attributes(lfit) # shows the info contained in the lm object, accessible with "$"
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
lfit$residuals
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## -1.59374995 -1.59374995 -0.95363068 -1.19374995
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0.54108812 -4.83489134 0.91706759 -1.46870730
## Merc 230 Merc 280 Merc 280C Merc 450SE
## -0.81717412 -2.50678234 -3.90678234 -1.41777049
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## -0.51777049 -2.61777049 -5.71206353 -5.02978075
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0.29364342 6.80420581 3.84900992 8.23597754
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## -1.98071757 -4.36461883 -4.66461883 -0.08293241
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 1.04108812 1.70420581 2.10991276 8.01093488
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 3.71340487 1.54108812 7.75761261 -1.26197823
summary(lfit)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
This is just the tip of the iceberg. In class, we will also consider extensions such as inference using robust and clustered standard errors, IV estimation, and fixed effects.
4 Plots
One of R’s standout features is graphing. There are numerous external libraries available that allow you to plot virtually any graph you desire (including interactive ones!).
I will only briefly introduce R’s built-in graphing capabilities as a starting point. A good overview can be found in the Graphs section on Quick-R.
Once you become comfortable with plotting in base R, I would highly recommend using ggplot2 for graphing. It is the state-of-the-art library that is (probably) also the most widely used. The R for Data Science book has a good section on Data Visualization (using ggplot2).
4.1 Histograms
We have already seen how to plot a histogram in the Monte Carlo section. Since a histogram is a visualization of a one-dimensional variable, we pass one column (or vector) to the `hist` function. We can also customize the plot with optional arguments, such as changing the number of bins.
Type `?hist` to see an overview of all optional arguments.
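As a sketch of such customization (all argument values here are arbitrary choices):

```r
hist(mtcars$mpg,
     breaks = 10,                     # suggested number of bins
     main   = "Distribution of mpg",  # plot title
     xlab   = "Miles per gallon",     # x-axis label
     col    = "grey")                 # bar colour
```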
4.2 Scatterplots
Another common graph type is the scatterplot, which we create in R with `plot(x, y)`, where `x` and `y` are the variables on the x- and y-axes respectively. Since a scatterplot is a two-dimensional representation of two variables, it requires at least two arguments.
Again, look to `?plot` for more options.
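A sketch (again with `mtcars`, and arbitrary labels):

```r
# scatterplot of horsepower against mileage
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower",
     ylab = "Miles per gallon",
     main = "Mileage vs. horsepower")
```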
4.3 Other Graphs
Base R is capable of other graphing types such as Box Plots, Bar Plots, and Pie Charts.
But for anything more advanced than basic plotting, look to ggplot2.
5 Other Topics
This guide covers the basics you will need to get started with the tutorials. For those of you who wish to do your own empirical work, you will want to expand on this knowledge. This also means going beyond base R.
Probably the most useful set of packages is provided by The Tidyverse. dplyr is especially useful for data cleaning and manipulation while tidyr aids in reshaping data to formats that are useful for our analyses. These are not strictly necessary for empirical econometric work, but they will make your life easier when working with messy, real-world data.
This guide will evolve and change over time as R is also an evolving programming language. In addition, your feedback and comments will be incorporated into future versions of this guide.
A note of warning for Windows users. “\” is an escape character (not so important for our purposes), but it means that you need to use two of them when setting your path (“\\”), or use the forward slash (“/”) instead.↩︎
This is not technically correct, just as a \(1\times1\) matrix is not technically a scalar, but it helps for understanding in this context.↩︎
`paste` will always coerce numbers and logicals to character.↩︎
A vector of length \(n\) would be \(\boldsymbol{v} \in \mathbb{R}^n\) while a matrix with \(r\) rows and \(s\) columns would be \(\boldsymbol{M} \in \mathbb{R}^{r \times s}\).↩︎
Other functions from this family include `tapply` and `mapply`.↩︎