
Data Scientist Program


Free Online Data Science Training for Complete Beginners.

No prior coding knowledge required!

Introduction to R for Stata Users

Let's set up the software

R is a programming environment

  • Robert Gentleman and Ross Ihaka developed R at the University of Auckland, New Zealand, in the 1990s; their paper introducing the language appeared in 1996.

  • They designed the language to combine the strengths of two existing languages, S and Scheme.

  • Tools are distributed as packages, which any user can download to customize the R environment.

  • Free Software.

RStudio offers a friendlier interface (similar to Stata's). It can be problematic with very large data sets.

Data Types

1. Vector

Definition: A vector is a sequence of elements that share the same data type. A vector supports logical, integer, double, character, complex, or raw data types.

Example code

# Generating a scalar (in R, a scalar is a vector of length one)
x0 <- 5 # x0 and its value are illustrative

# Generating a vector
x1 <- c(1, 2, 3)
x2 <- c(1, 2, 5.3, 6, -2, 4) # numeric vector
x3 <- c("one", "two", "three") # character vector
x4 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) # logical vector
x2[c(2, 4)] # 2nd and 4th elements of vector
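Because every element of a vector must share one type, R silently converts mixed inputs to the most general type present. The sketch below illustrates this; the object names (v_num, v_mix, v_int) are made up for the example.

```r
# All names below are illustrative, not part of the original tutorial
v_num <- c(1, 2, 3)
typeof(v_num)              # "double"

v_mix <- c(1, "a", TRUE)   # mixing numbers, text and logicals
typeof(v_mix)              # "character": everything was converted to text
v_mix                      # "1" "a" "TRUE"

v_int <- c(TRUE, 2L)       # logical combined with an integer
typeof(v_int)              # "integer": TRUE became 1
```

This matters when importing data: a single stray text value in a numeric column turns the whole column into character.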

Operations with vectors

  • x[c(position1, position2)] subsetting by position

  • rep(a, repetitions) repeated vectors

  • seq(from =, to =, by =) sequences

  • a:b patterned vectors
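The operations listed above can be tried on a small vector. This is a minimal sketch; the vector x and the result names are chosen only for illustration.

```r
# Illustrative vector; the values are arbitrary
x <- c(10, 20, 30, 40, 50)

picked   <- x[c(2, 4)]                       # 2nd and 4th elements: 20 40
repeated <- rep(c(1, 2), times = 3)          # 1 2 1 2 1 2
stepped  <- seq(from = 0, to = 1, by = 0.25) # 0.00 0.25 0.50 0.75 1.00
pattern  <- 3:7                              # 3 4 5 6 7

# Subsetting also accepts negative positions and logical conditions
dropped  <- x[-1]                            # everything except the 1st element
large    <- x[x > 25]                        # 30 40 50
```

Logical subsetting such as x[x > 25] plays roughly the role of an if condition in Stata.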

Exercise 1 - Practice with vectors

Suggested solution - 1 (Try yourself first)

2. Matrix

3. Arrays

4. Data Frame

Exercise 2 - Data frame

Suggested solution - 2 (Try yourself first)

5. List

6. Factor

Some functions to start

1. Get help

There are many blogs and help resources on the Internet. Searching for the specific function name or error message usually leads to working code.

R can also show its own documentation with the following commands:

?options # opens the help page for options (in the browser or the RStudio Help pane)

# If the exact name of the command is not known, search the help system by keyword, for example "sum"
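As a rough guide, base R offers several ways to look things up. The calls below are standard base R tools and are not necessarily the ones used in the original post; the keywords "regression" and "sum" are only examples.

```r
# Help page for a function whose name you already know
?mean
help("mean")                  # same result as ?mean

# Keyword search across installed documentation (shortcut: ??regression)
help.search("regression")

# List objects on the search path whose names contain "sum"
hits <- apropos("sum")
head(hits)
```

In Stata terms, ?name is close to help name, while help.search() is closer to search.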

2. Loops

3. Export and import

4. Merge function

5. More functions

6. Random variables

Linear Regression (Economists, such as myself, love regressions)

Let's study the demand for economics journals.

We begin with a small data set taken from Stock and Watson (2007) that provides information on the number of library subscriptions to economics journals in the United States in 2000. The data set, originally collected by Bergstrom (2001), is available in the package AER under the name Journals.

1. Upload database

We will need to install the package AER.

R has thousands of packages that people have created to run many kinds of statistical procedures. Installing packages on Windows is more straightforward than on macOS. In RStudio, I usually install packages manually.

Example code:

install.packages("AER") ## install the package (only needed once)
library(AER) ## load the package
data("Journals", package = "AER") ## load the data

Let's check the data before continuing
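A quick inspection might look like the sketch below. It assumes the AER package is already installed; the column names subs and price come from the Journals documentation, and the Stata comparisons in the comments are loose analogies.

```r
library(AER)
data("Journals", package = "AER")

dim(Journals)      # number of rows and columns
names(Journals)    # variable names
head(Journals)     # first six rows (similar to: list in 1/6)
str(Journals)      # variable types (similar to: describe)

# Summary of the two variables used in the demand example
summary(Journals[, c("subs", "price")])
```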


2. Simple graphs

3. Estimations

Exercise 3 - Wage Equation

Suggested solution - 3 (Try yourself first)

Exercise 4 - Wages and year of experience

Suggested solution - 4 (Try yourself first)

Exercise 5 - Prices and subscripts

Suggested solution - 5 (Try yourself first)

4. Dichotomous variables (Dummy variables)

5. Non-Linear regressions

6. Comparison of models

Descriptive Statistics

In Stata, we can use the command summarize to calculate the descriptive statistics of a data set. We can do the same in R with the following commands.

1. Mean, Median, and Standard Deviation

Example code:

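rm(list = ls(all = TRUE)) # remove all the objects in the memory

library(AER) # provides the CPS1985 data set
data("CPS1985", package = "AER") # load the data again after clearing the memory
levels(CPS1985$occupation)[c(2, 6)] <- c("techn", "mgmt") # shorten two occupation labels
attach(CPS1985) # so columns such as wage can be referenced directly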
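The statistics named in the heading can then be computed along these lines. This is a sketch that assumes AER is installed; it refers to the column as CPS1985$wage so that it works whether or not the data frame has been attached, and the names wage_stats and wage_by_gender are illustrative.

```r
library(AER)
data("CPS1985", package = "AER")

wage_stats <- c(mean   = mean(CPS1985$wage),
                median = median(CPS1985$wage),
                sd     = sd(CPS1985$wage))
wage_stats

# Closest single-command analogue of Stata's summarize
summary(CPS1985$wage)

# Mean wage by group, similar in spirit to: bysort gender: summarize wage
wage_by_gender <- tapply(CPS1985$wage, CPS1985$gender, mean)
wage_by_gender
```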


2. Histograms

3. More sophisticated graphs

Interactions, Separate, and Weights

  • y ~ a + x: model without interaction. Identical slopes with respect to x but different intercepts with respect to a.

  • y ~ a * x: model with interaction. In the wage example, it includes ethnicity, education, and the interaction between the two.

  • y ~ a + x + a:x: the same model written out in full. The term a:x gives the difference in slopes compared with the reference category, in other words, just the interaction.

Example code:

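library(AER) # also loads lmtest, which provides coeftest()
data("CPS1988", package = "AER")

# Model with interaction, written with the * operator
cps_int <- lm(log(wage) ~ experience + I(experience^2) +
                education * ethnicity, data = CPS1988)
# Test of coefficients
coeftest(cps_int)

# The same model with the interaction written out explicitly
cps_int <- lm(log(wage) ~ experience + I(experience^2) +
                education + ethnicity + education:ethnicity,
              data = CPS1988)
coeftest(cps_int)  ## Both models are the same.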
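The equivalence of the two interaction formulas can also be checked without any package. The sketch below uses simulated data; the variable names (y, x, a), the group labels, and the coefficients are invented purely for illustration.

```r
# Simulated data: names and parameter values are made up for this example
set.seed(1)
n <- 200
a <- factor(sample(c("g1", "g2"), n, replace = TRUE))
x <- rnorm(n)
y <- 1 + 0.5 * x + 0.8 * (a == "g2") + 0.3 * x * (a == "g2") +
  rnorm(n, sd = 0.1)
dat <- data.frame(y, x, a)

m_add  <- lm(y ~ a + x, data = dat)        # common slope, different intercepts
m_star <- lm(y ~ a * x, data = dat)        # interaction via *
m_long <- lm(y ~ a + x + a:x, data = dat)  # interaction written out

# The two interaction models have identical coefficients
all.equal(coef(m_star), coef(m_long))
names(coef(m_star))
```

The model with * has one more coefficient than the additive one: the a:x term, which measures how the slope of x differs in group g2.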

Separate regression for each level

As a further variation, it may be necessary to fit separate regressions for African-Americans and Caucasians.

  • The formula ethnicity / (experience + I(experience^2) + education) specifies that the terms within parentheses are nested within ethnicity.

  • The term - 1 removes the common intercept, so each ethnicity gets its own intercept. The coefficients can be arranged in a matrix to see the results for both ethnicities side by side.

  • anova(model1, model2) compares the two fits: the model where ethnicity interacts with every other regressor fits significantly better, at any reasonable level, than the model without any interaction term.

Example code:

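# Estimate separate coefficients for each ethnicity in one nested model
cps_sep <- lm(log(wage) ~ ethnicity /
                (experience + I(experience^2) + education) - 1,
              data = CPS1988)

# Baseline model without interactions (assumed specification; cps_lm is not defined in this section)
cps_lm <- lm(log(wage) ~ experience + I(experience^2) +
               education + ethnicity, data = CPS1988)

# Arrange the coefficients in a matrix: one row per ethnicity
cps_sep_cf <- matrix(coef(cps_sep), nrow = 2)
rownames(cps_sep_cf) <- levels(CPS1988$ethnicity)
colnames(cps_sep_cf) <- names(coef(cps_lm))[1:4]
cps_sep_cf

# To compare both models
anova(cps_sep, cps_lm)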
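To see why the nested formula amounts to separate regressions, the sketch below fits both versions on simulated data and compares the coefficients. The names (g, x, y, m_nested, sep) and the parameter values are invented for illustration only.

```r
# Simulated data with two groups that have different intercepts and slopes
set.seed(2)
n <- 200
g <- factor(rep(c("g1", "g2"), each = n / 2))
x <- rnorm(n)
y <- ifelse(g == "g1", 1 + 0.5 * x, 2 - 0.3 * x) + rnorm(n, sd = 0.1)
dat <- data.frame(y, x, g)

# One nested model: a separate intercept and slope for each level of g
m_nested  <- lm(y ~ g / x - 1, data = dat)
nested_cf <- matrix(coef(m_nested), nrow = 2,
                    dimnames = list(levels(dat$g), c("(Intercept)", "x")))

# Two separate regressions, one per group
sep <- sapply(levels(dat$g), function(l) {
  coef(lm(y ~ x, data = dat[dat$g == l, ]))
})

nested_cf
t(sep)  # same point estimates, one row per group
```

The point estimates agree, but the nested model assumes a single error variance for both groups, so its standard errors can differ from those of the two separate fits.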

Weighted least squares


  • A Modern Approach to Regression with R.

  • An Introduction to R for Quantitative Economics.

  • R for Stata Users.

  • Applied Econometrics with R.

