* 1. Setting up
log using "Regression Workshop.log", replace /* replace with your own file path */
version 16.0 /* tells Stata to run the commands below as Stata 16.0 would */
clear all /* clears all data and programs */
set linesize 80 /* sets the maximum width of output lines to 80 characters */
display "$S_DATE" /* display date */
*** Opening data set
use "Head Start Evaluation Data.dta", clear /* Since this file name has spaces, we need quotation marks around it */
* We will again be using data from a randomized impact evaluation of Head Start (a preschool program for low-income families). We will first test whether random assignment was effective at creating similar groups.
* 2. Thinking about our data before we begin.
* The research question that we will test today is: What is the effect of preschool on a child's development?
* What is the "X" variable and what is the "Y" variable?
* "Y" variable is the child's development, which we will measure with the variable ppvt_36
* "X" variable is the treatment (in this case, did the child go to preschool?)
* Generic regression equation: y = constant + beta1 * x + error term
* Applied to our question: ppvt_36 = constant + beta1 * treatment + error term
* 3. Data cleaning and creating a summary statistics table
describe /* describe the data set */
count /* how many observations are there */
sum treatment /* What kind of variable is this? What does the mean tell us? */
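* One way to answer these questions (a sketch): tabulate the variable. If it only takes the values 0 and 1, it is a dummy variable, and its mean is the share of children assigned to treatment.
tab treatment, missing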
* Every empirical paper describes its data and includes a summary statistics table. This table tells the reader who the respondents are, which helps in judging how externally valid the sample is. It is important to describe variables that are relevant to the research question and main point of the paper. In addition, some variables are usually included in analyses, such as gender, age, income, and education. Typically, a summary statistics table reports the mean, standard deviation, and number of observations for each variable.
* Let's pick a few baseline variables: gender, number of moves, whether the child has ever moved, age at enrollment, whether the child was enrolled before birth. Identify each of the variables using the lookfor command and the codebook.
* Male
lookfor male
codebook child_male
* Number of moves
lookfor moves
codebook nmoves
* If the child ever moved: no variable for this, but we can make one!
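gen evermoved = .
replace evermoved = 1 if nmoves > 0 & !missing(nmoves) /* Stata treats missing as larger than any number, so we exclude it explicitly */
replace evermoved = 0 if nmoves == 0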
label var evermoved "1=Child has ever moved"
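* A quick sanity check (a sketch, not part of the original workshop): cross-tabulate the new dummy against nmoves, including missing values, to confirm that every value of nmoves maps to the intended value of evermoved.
tab nmoves evermoved, missing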
* Age
lookfor age
codebook age_mths
* Child enrolled before birth: no variable for this, but we can make one!
tab age_mths
gen enroll_prebirth = .
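replace enroll_prebirth = 1 if age_mths < 0 /* I'm deciding that if age_mths == 0, the child was enrolled at birth */
replace enroll_prebirth = 0 if age_mths >= 0 & !missing(age_mths) /* missing counts as larger than any number, so we exclude it explicitly */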
label var enroll_prebirth "1=Child enrolled before birth"
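* The gen/replace pattern can also be written in one line. This is only a sketch: enroll_prebirth_alt is a hypothetical name used for illustration. The expression (age_mths < 0) evaluates to 1 or 0, and the if condition leaves the new variable missing whenever age_mths is missing.
gen enroll_prebirth_alt = (age_mths < 0) if !missing(age_mths)
tab enroll_prebirth enroll_prebirth_alt, missing /* compare the two versions */
drop enroll_prebirth_alt /* drop the illustration variable again */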
* Now, let's make a table of summary statistics.
sum child_male nmoves evermoved age_mths enroll_prebirth povratio
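* As a sketch of a more compact layout: tabstat can report only the statistics that a summary statistics table needs (mean, standard deviation, and number of observations), with one row per variable.
tabstat child_male nmoves evermoved age_mths enroll_prebirth povratio, statistics(mean sd n) columns(statistics)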
* 4. Testing randomization by creating a balance table.
* A key implication of randomization is that the treatment and control groups should be similar on observable and unobservable characteristics at baseline. To check this for the observable characteristics, it is often easiest to report the mean of each variable separately for the treatment and control groups. In this case, there are two groups (Control, Treatment): we compare those who won the lottery with those who did not.
* Two ways to make this table
sum child_male nmoves evermoved age_mths enroll_prebirth povratio if treatment == 1
sum child_male nmoves evermoved age_mths enroll_prebirth povratio if treatment == 0
bys treatment: sum child_male nmoves evermoved age_mths enroll_prebirth povratio
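* A third option, as a sketch: tabstat with by() puts the group means side by side in a single table. nototal suppresses the row for the full sample.
tabstat child_male nmoves evermoved age_mths enroll_prebirth povratio, by(treatment) statistics(mean n) nototal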
* We will see how to calculate the differences easily (instead of by hand) when we run t-tests in section 6.
* 5. Correlations
* We typically are interested in relationships between two variables; that is, their correlation. Correlation coefficients are a way to mathematically capture the relationship between two variables. To correlate two variables, you use the correlate command (corr).
corr child_male treatment
corr parental_distress_24 treatment
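* corr only reports the coefficients. As a sketch of an alternative, pwcorr can also show the p-value (sig) and the number of observations (obs) used for each pair of variables.
pwcorr child_male parental_distress_24 treatment, sig obs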
* 6. Testing statistical significance with dummy variables
* However, observing differences does not tell you whether these differences are statistically (or economically) significant. Let's tackle statistical significance first. One way to test whether two means are equal is the ttest command. It goes like this:
ttest child_male, by(treatment) /* dummy variable */
ttest nmoves, by(treatment) /* continuous variable */
* Note that the ttest output reports the difference between the two group means (the mean for treatment == 0 minus the mean for treatment == 1), along with its p-value.
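* To fill in the difference column of the balance table without running each test by hand, we can loop over the baseline variables. This is a sketch: it uses the results that ttest stores in r(), where r(mu_1) and r(mu_2) are the means of the first (treatment == 0) and second (treatment == 1) group and r(p) is the two-sided p-value.
foreach v of varlist child_male nmoves evermoved age_mths enroll_prebirth povratio {
    quietly ttest `v', by(treatment)
    display "`v': difference = " %7.3f (r(mu_1) - r(mu_2)) ", p-value = " %5.3f r(p)
}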
* 7. Regressions
* The Stata command for a regression is "regress", which we shorten to "reg".
reg nmoves treatment
reg ppvt_36 treatment
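* The regression of nmoves on treatment gives the same result as the ttest above! The coefficient on treatment is the difference in group means (treatment minus control, so the sign is flipped relative to the ttest output), and the constant is the control group mean.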
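* We can check the same equivalence for ppvt_36. A sketch: the two displayed numbers should have the same size and opposite signs, because ttest reports control minus treatment while the regression coefficient is treatment minus control.
quietly ttest ppvt_36, by(treatment)
display "ttest difference (control - treatment): " (r(mu_1) - r(mu_2))
quietly reg ppvt_36 treatment
display "regression coefficient (treatment - control): " (_b[treatment])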
* Often, we need to use robust standard errors when we do regressions. I won't talk about the statistical meaning behind it here.
reg nmoves treatment, robust
reg ppvt_36 treatment, robust
* Note that the standard errors change slightly. I always think that it is best practice to use robust standard errors.
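* To see the change side by side, we can store both sets of results and put them in one table. This is a sketch: ols and ols_robust are just names chosen for the stored estimates. The coefficients are identical and only the standard errors differ.
quietly reg ppvt_36 treatment
estimates store ols
quietly reg ppvt_36 treatment, robust
estimates store ols_robust
estimates table ols ols_robust, b(%9.3f) se(%9.3f)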
* We can also graph our results and add our regression line.
* Let's look at the relationship between family income (povratio) and a child's birth weight (bw_lbs).
* Let's first drop the observations where birth weight and/or income are missing
drop if missing(bw_lbs) /* missing() also catches extended missing values such as .a */
drop if missing(povratio)
graph twoway scatter bw_lbs povratio || lfit bw_lbs povratio
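* A sketch of the same graph with axis titles, saved to a file. The titles assume that bw_lbs is birth weight in pounds and that povratio is family income relative to the poverty line; check the variable labels with describe. The file name bw_povratio.png is only an example.
twoway (scatter bw_lbs povratio) (lfit bw_lbs povratio), ytitle("Birth weight (lbs)") xtitle("Income-to-poverty ratio")
graph export "bw_povratio.png", replace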
* We can use an if statement in our regression.
reg bw_lbs povratio if treatment == 1, robust
reg bw_lbs povratio if treatment == 0, robust
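* Running the regression separately by group shows two slopes but does not test whether they differ. One possible approach (a sketch) is a single regression with an interaction term. The coefficient on 1.treatment#c.povratio is the difference between the treatment and control slopes.
reg bw_lbs c.povratio##i.treatment, robust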
* Finally, let's look at a multivariable regression (multiple x variables) and output our regression.
reg ppvt_36 mom_educ_less_hs bw_lbs, robust
reg aggr_behavior_36 treatment mom_educ_less_hs mom_educ_hs_ged bw_lbs povratio child_male white black, robust
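* With several x variables, we may want to test a group of coefficients together. As a sketch, the test command runs a joint test on the most recent regression. Here the null hypothesis is that both mother's education coefficients are zero.
test mom_educ_less_hs mom_educ_hs_ged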
* Let's create a table with our regression output.
* outreg2 is a user-written command. If you do not have it installed yet, uncomment the next line and run it once:
* ssc install outreg2
reg bw_lbs povratio
outreg2 using regression_workshop.xls, replace asterisk(coef) level(95) tdec(3) bracket
* asterisk(coef) specifies that asterisks for significance levels are appended to regression coefficients rather than to t statistics or standard errors. This is the default setting, meaning you can omit it. The default setting is 3 asterisks for 1%, 2 asterisks for 5%, and 1 asterisk for 10% significance levels. We maintained those defaults.
* level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; note that this does not change the asterisk option!
* tdec(#) specifies the number of decimal places reported for t statistics or for p values if pvalue is specified.
* bracket specifies that square brackets [] be used rather than parentheses () around t statistics, standard errors, etc. The default is parentheses (which is fine).
* We can also add statistics to our regression output.
reg bw_lbs povratio, robust
sum povratio
local meaner = r(mean)
outreg2 using regression_workshop.xls, addstat(Mean of Income, `meaner') replace
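* Papers usually show several regressions side by side. A sketch: the first outreg2 call starts the file with replace, and later calls add columns with append. The file name regression_workshop_multi.xls and the column titles are examples only.
reg bw_lbs povratio, robust
outreg2 using regression_workshop_multi.xls, replace ctitle(Bivariate)
reg bw_lbs povratio child_male, robust
outreg2 using regression_workshop_multi.xls, append ctitle(With control)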
* See here for more information on outreg2: https://www.princeton.edu/~otorres/Outreg2.pdf
* Close the log file that we opened in section 1.
log close