Basic Methods for Establishing Causal Inference
Chapter 7
© 2019 McGraw-Hill Education. All rights reserved. Authorized only for instructor use in the classroom. No reproduction or distribution without the prior written consent of McGraw-Hill Education

Learning Objectives
Explain the consequences of key assumptions failing within a causal model
Explain how control variables can improve causal inference from regression analysis
Use control variables in estimating a regression equation
Explain how proxy variables can improve causal inference from regression analysis
Use proxy variables in estimating a regression equation
Explain how functional form choice can affect causal inference from regression analysis

The assumptions needed to estimate the parameters of a regression equation are:
The data-generating process for an outcome, Y, can be expressed as: Yi = α + β1X1i + … + βKXKi + Ui
{Yi, X1i, …, XKi} is a random sample
E[U] = E[U × X1] = … = E[U × XK] = 0
If these assumptions hold, we can use our regression equation estimates as “good guesses” for the parameters.
Assessing Key Assumptions within a Causal Model
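
As a minimal sketch of what "good guesses" means here, the simulation below draws a random sample from a data-generating process that satisfies all three assumptions and fits a least-squares line. The parameter values (ALPHA = 2, BETA = 3), the sample size, and the helper name ols_simple are illustrative assumptions, not values from the slides.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(0)      # fixed seed so the sketch is repeatable
ALPHA, BETA = 2.0, 3.0      # hypothetical "true" parameters

# Random sample from Y = ALPHA + BETA*X + U, where U has mean zero
# and is drawn independently of X (so E[U] = E[U*X] = 0).
x = [rng.uniform(0, 10) for _ in range(5000)]
u = [rng.gauss(0, 1) for _ in range(5000)]
y = [ALPHA + BETA * xi + ui for xi, ui in zip(x, u)]

alpha_hat, beta_hat = ols_simple(x, y)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

With the assumptions holding, the estimates should land close to the assumed parameters; they will not match exactly in any one sample.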

Assumption 1 states that: the determining function is linear in the parameters, and that other factors—in the form of the error term—are additive (they simply add on at the end)
For example:
Total Costs = Fixed Costs + f1Factor1 + … + fJFactorJ
Factorj represents a factor of production and fj its price
If we have data on Factor1 through FactorK, where K < J, the remaining factors are absorbed into the error term:
Total Costsi = α + β1Factor1i + … + βKFactorKi + Ui
Assessing Key Assumptions within a Causal Model

Assumption 2 states that our sample is random
There are many ways to collect a random sample, but all start with first defining the population
For example, we may define the population as all individuals in the United States, and then randomly draw Social Security numbers to build the sample.
When dealing with populations that span multiple periods of time, we treat what was observed for a given period as one realization from a broader set of possibilities
Random Sample

The key merit of drawing a random sample is that, on average, it should look like a smaller version of the population from which we are drawing
The information in a random sample should “represent” the population
For any given sample of data, randomness does not guarantee that it represents the population well
Random vs Representative Sample

Random vs Representative Sample
Suppose we draw a random sample of 20 customers and ask each for their age and their rating of the product
The problem with this sample is that it is not representative of the part of the population over age 40
To avoid situations like this, it is common practice to take measures to collect a representative sample

Age and Rating Data for a New Product

Representative sample: a sample whose distribution approximately matches that of the population for a subset of observed, independent variables
Constructing a representative sample:
Step 1: Choose the independent variables whose distribution you want to be representative
Step 2: Use information about the population to stratify (categorize) each of the chosen variables
Step 3: Use information about the population to pre-set the proportion of the sample that will be selected from each stratum
Step 4: Collect the sample by randomly sampling from each stratum, where the number of random draws from each stratum is set according to the proportions determined in Step 3
Random vs. Representative Sample

We are interested in how ratings depend on age, so age plays the role of the independent variable:
Step 1: With just one independent variable, this step is trivial—we want a representative sample according to age
Step 2: We need to utilize information we have about the population. We know that 30% of the population is over the age of 40. We can stratify the data into two groups: over 40 and 40 and under.
Random vs Representative Sample

Random vs Representative Sample
Step 3: We use our knowledge of the population to determine the proportion of our sample coming from these two strata: 30% should be over 40 and 70% should be 40 and under. If our sample size is N = 1,000, we will have 300 who are over 40 and 700 who are 40 and under
Step 4: We may collect a random sample larger than 1,000 to ensure there are at least 300 who are over 40 and at least 700 who are 40 and under. Then, randomly select 300 from the subgroup who are over 40, and randomly select 700 from the group who are 40 and under
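
The four steps can be sketched in code as follows. This is a simplified illustration: the population of 20,000 ages is synthetic, and the sketch stratifies the whole population directly rather than first drawing an oversized random sample as Step 4 describes. Names such as make_age, over_40, and under_or_40 are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def make_age(rng):
    """Draw one synthetic age; roughly 30% of draws are over 40."""
    return rng.randint(41, 80) if rng.random() < 0.30 else rng.randint(18, 40)

# Synthetic population (an assumption made for illustration only).
population = [make_age(rng) for _ in range(20000)]

# Step 1: the variable to be representative on is age.
# Step 2: stratify the population on age.
over_40 = [a for a in population if a > 40]
under_or_40 = [a for a in population if a <= 40]

# Step 3: pre-set the sample proportions from population knowledge.
N = 1000
n_over = round(0.30 * N)   # 300 draws from the over-40 stratum
n_under = N - n_over       # 700 draws from the 40-and-under stratum

# Step 4: sample randomly within each stratum.
sample = rng.sample(over_40, n_over) + rng.sample(under_or_40, n_under)

share_over_40 = sum(a > 40 for a in sample) / len(sample)
print(f"share over 40 in sample: {share_over_40:.2f}")
```

The share over 40 matches the target by construction, which is exactly why the resulting sample is no longer a purely random draw from the population.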

The concepts of random and representative are not mutually exclusive when it comes to data samples. A sample can be both
If we construct a representative sample, then by construction it is not truly a random sample
Constructing a representative sample ensures that we observe the pertinent range of our independent variables
Construction of a representative sample often ensures that we have substantial variation in the independent variables
Random vs. Representative Sample

Consequences of Nonrandom Samples
The construction of a representative sample generally results in a nonrandom sample
A sample that is nonrandom is also known as a selected sample
There are two fundamental ways in which a sample can be nonrandom or selected. It can be selected according to:
The independent variables (Xs)
The dependent variable (Y)

Selection by independent variable

The regression line for the data set is:
Rating = 40 + 0.5 × Age
Using just data for Age < 30 will simply limit where, along the line, we are observing data.
Using just these data points will not skew our estimates for the regression line.
Assessing Key Assumptions within a Causal Model
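
A small simulation can illustrate selection on the independent variable. The sketch below assumes a synthetic population that follows Rating = 40 + 0.5 × Age plus an independent error; the age range, error spread, and sample size are illustrative assumptions.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(2)

# Synthetic data: Rating = 40 + 0.5*Age + U, with U independent of Age.
age = [rng.uniform(18, 70) for _ in range(20000)]
rating = [40 + 0.5 * a + rng.gauss(0, 5) for a in age]

_, slope_full = ols_simple(age, rating)

# Selection on the independent variable: keep only Age < 30.
young = [(a, r) for a, r in zip(age, rating) if a < 30]
_, slope_young = ols_simple([a for a, _ in young], [r for _, r in young])

print(f"slope, full sample: {slope_full:.3f}")
print(f"slope, Age < 30 only: {slope_young:.3f}")
```

Both slopes should sit near 0.5. The restricted sample is noisier because it covers a narrower range of ages, but it is not systematically off.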

Selection by dependent variable

The sample is selected such that it contains only observations where the rating is above 60 (above the green line).
Selecting the sample on rating (the dependent variable) may cause problems when estimating the regression equation.
Selecting the sample on the dependent variable may create a situation where E[Ui] = E[Xi × Ui] = 0 holds for the full population, but E[Ui] ≠ 0 and E[Xi × Ui] ≠ 0 for the selected subset of the population.
Assessing Key Assumptions within a Causal Model

Selection by dependent variable

Selecting data points where rating is above 60 has two important consequences:
The mean value of the errors is positive for the selected subset, and
The errors and age are negatively correlated.
Assessing Key Assumptions within a Causal Model
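
Both consequences can be checked in a simulation. The sketch below assumes a synthetic population following Rating = 40 + 0.5 × Age plus an independent error, then keeps only ratings above 60; the age range, error spread, and helper names are illustrative assumptions.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

rng = random.Random(3)

# Synthetic data: Rating = 40 + 0.5*Age + U, with U independent of Age.
age = [rng.uniform(18, 70) for _ in range(20000)]
u = [rng.gauss(0, 5) for _ in age]
rating = [40 + 0.5 * a + e for a, e in zip(age, u)]

# Selection on the dependent variable: keep only Rating > 60.
kept = [(a, e, r) for a, e, r in zip(age, u, rating) if r > 60]
age_s = [a for a, _, _ in kept]
u_s = [e for _, e, _ in kept]
rating_s = [r for _, _, r in kept]

mean_error = mean(u_s)
corr_error_age = cov(u_s, age_s) / (cov(u_s, u_s) * cov(age_s, age_s)) ** 0.5
slope_selected = cov(age_s, rating_s) / cov(age_s, age_s)

print(f"mean error in selected sample: {mean_error:.2f}")
print(f"corr(error, age) in selected sample: {corr_error_age:.2f}")
print(f"estimated slope in selected sample: {slope_selected:.2f}")
```

Young customers make it into the selected sample only when their error is large and positive, which produces the positive mean error, the negative correlation with age, and an estimated slope below the 0.5 used to generate the data.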

Assumption 3 states that E[U] = E[U × X1] = … = E[U × XK] = 0. This means we assume the errors have a mean of zero and are not correlated with the treatments in the population
Violation of this assumption, meaning there exists correlation between the errors and at least one treatment, is known as an endogeneity problem
We refer to the component(s) of the error, Ui, that are correlated with one or more treatments, X, as confounding factors
No Correlation Between Errors and Treatment

Three main forms in which endogeneity problems generally materialize:
Omitted variable: Any variable contained in the error term of a data generating process, due to lack of data or simply a decision not to include it
Measurement error: When one or more of the variables in the determining function (typically at least one of the treatments) is measured with error.
Simultaneity: This can arise when one or more of the treatments is determined at the same time as the outcome; often occurs when some amount of reverse causality occurs
No Correlation Between Errors and Treatment
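
To make one of these forms concrete, the sketch below simulates measurement error in a treatment. It assumes a true slope of 2 and measurement noise with the same variance as the true treatment; all values and names are illustrative assumptions. Under these assumptions the estimated slope is pulled toward zero.

```python
import random

def slope_one(y, x):
    """Slope from a least-squares fit of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(5)
n = 20000

x_true = [rng.gauss(0, 1) for _ in range(n)]           # treatment as it really is
y = [1 + 2 * xt + rng.gauss(0, 1) for xt in x_true]    # outcome depends on x_true
x_observed = [xt + rng.gauss(0, 1) for xt in x_true]   # treatment as recorded

slope_true_x = slope_one(y, x_true)
slope_noisy_x = slope_one(y, x_observed)

print(f"slope using the true treatment: {slope_true_x:.2f}")
print(f"slope using the mismeasured treatment: {slope_noisy_x:.2f}")
```

The measurement noise ends up in the error term and is correlated with the recorded treatment, so the regression on the mismeasured variable recovers roughly half the true effect in this setup.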

Control variable: any variable included in a regression equation whose purpose is to alleviate an endogeneity problem
A control variable is typically a confounding factor that is added to the determining function
Control Variables

Yi = α + β1X1i + … + βKXKi + Ui
The variable C is a confounding factor within the data-generating process if…
C affects the outcome, Y
C is correlated with at least one treatment (Xj)
Then…
C is a good control, and its inclusion as part of the determining function can help mitigate an endogeneity problem
Criterion for a Good Control
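
The criterion can be illustrated with a simulation in which C affects Y and is correlated with the treatment X. The sketch below assumes Y = 1 + 2X + 3C + U with X = 0.8C plus noise; these coefficients and the helper names are illustrative assumptions.

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slope_one(y, x):
    """Slope from regressing y on x alone (with an intercept)."""
    dx, dy = demean(x), demean(y)
    return dot(dx, dy) / dot(dx, dx)

def slopes_two(y, x1, x2):
    """Slopes from regressing y on x1 and x2 (with an intercept)."""
    d1, d2, dy = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    s1y, s2y = dot(d1, dy), dot(d2, dy)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(4)
n = 10000
c = [rng.gauss(0, 1) for _ in range(n)]          # confounding factor
x = [0.8 * ci + rng.gauss(0, 1) for ci in c]     # treatment, correlated with C
y = [1 + 2 * xi + 3 * ci + rng.gauss(0, 1) for xi, ci in zip(x, c)]

beta_without_control = slope_one(y, x)           # C is left in the error term
beta_with_control, gamma_hat = slopes_two(y, x, c)

print(f"effect of X, C omitted:  {beta_without_control:.2f}")
print(f"effect of X, C included: {beta_with_control:.2f}")
```

When C is omitted, the estimated effect of X also picks up part of the effect of C. Adding C as a control brings the estimate back near the value of 2 used to generate the data.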

A dummy variable is a dichotomous variable (one that takes on values 0 or 1) that is used to indicate the presence or absence of a given characteristic
Typically utilized in regression equations in lieu of categorical, ordinal, or interval variables
Dummy Variables

Categorical variable
Indicates membership to one of a set of two or more mutually exclusive categories that do not have an obvious ordering
Ordinal variable
Indicates membership to one of a set of two or more mutually exclusive categories that have an obvious ordering, but the difference in values is not meaningful
Interval variable
Indicates membership to one of a set of two or more mutually exclusive categories that have an obvious ordering, and the difference in values is meaningful
Types of Variables

Suppose we have a data-generating process as:
Salesi = α + β1Commissioni + β2Locationi + Ui
We cannot regress “Sales” on “Commission” and “Location” since Location does not take on numerical values
Instead, include the dummy variables created for Location as part of the determining function, rather than the Location variable itself:
Salesi = α + β1Commissioni + β2LosAngelesi + β3Chicagoi + Ui
The base group is the excluded dummy variable among a set of dummy variables representing a categorical, ordinal, or interval variable
Dummy Variables
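
Building the dummy variables can be sketched as below. The slides name Los Angeles and Chicago as included dummies but do not name the base group, so "New York" is a hypothetical third location used only for illustration, as is the helper name make_dummies.

```python
def make_dummies(values, base_group):
    """Map a categorical column to 0/1 dummy columns, omitting the base group."""
    categories = sorted(set(values) - {base_group})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical Location column; "New York" is an assumed base group.
location = ["Los Angeles", "Chicago", "New York", "Chicago", "Los Angeles"]

dummies = make_dummies(location, base_group="New York")
for name, column in dummies.items():
    print(name, column)
```

Each observation in the base group has zeros in every dummy column, so the coefficient on each included dummy is read relative to the base group.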

Selecting Controls
The variables that theory says should affect the outcome should all be included in the regression
All these variables belong as part of the data-generating process
These variables can serve as valuable data sanity checks
A data sanity check for a regression is a comparison between the estimated coefficient for an independent variable in a regression and the value for that coefficient as predicted by theory

When Selecting Controls:
Identify variables that theoretically should or might affect the outcome
Include variables that theoretically should affect the outcome
For variables that theoretically might affect the outcome, include those that prove to affect the outcome empirically through a hypothesis test
For variables that theoretically might affect the outcome, discard those that prove irrelevant through a hypothesis test
Selecting Controls

A proxy variable is a variable used in a regression equation in order to proxy for a confounding factor, in an attempt to alleviate the endogeneity problem caused by that confounding factor
Proxy Variables

Functional form choice can affect causal inference from regression analysis
Assuming the following data-generating process:
Salesi = α + βHoursi + Ui
This implies that sales change with hours at a constant rate of β (e.g., if β is 12, then each one-unit increase in hours increases sales by 12)
Form of the Determining Function

Functional form choice can affect causal inference from regression analysis
Hours may affect Sales in a non-linear way, such that they have a large effect for the first few hours, but the effect diminishes as hours become large
A quadratic determining function might be better than the linear determining function
The causal relationship between Sales and Hours:
Salesi = α + β1Hoursi + β2Hoursi² + Ui
Form of the Determining Function

Salesi = α + β1Hoursi + β2Hoursi² + Ui
If we set Hours = X1 and Hours² = X2, this looks like a generic multiple regression equation
Form of the Determining Function
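
The substitution can be sketched in code: square Hours to create a second regressor, then run an ordinary two-variable regression. The data below are synthetic and noiseless, generated from assumed coefficients (10, 12, and -0.5) chosen only for illustration; fit_two is a hypothetical helper.

```python
def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_two(y, x1, x2):
    """Return (a, b1, b2) from a least-squares fit of y on x1 and x2."""
    d1, d2, dy = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    s1y, s2y = dot(d1, dy), dot(d2, dy)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return a, b1, b2

# Synthetic, noiseless data from assumed coefficients (illustration only).
hours = [float(h) for h in range(21)]
sales = [10 + 12 * h - 0.5 * h ** 2 for h in hours]

x1 = hours                       # X1 = Hours
x2 = [h ** 2 for h in hours]     # X2 = Hours squared

alpha_hat, beta1_hat, beta2_hat = fit_two(sales, x1, x2)
print(f"alpha = {alpha_hat:.3f}, beta1 = {beta1_hat:.3f}, beta2 = {beta2_hat:.3f}")
```

The equation is quadratic in Hours but still linear in the parameters, which is why the standard regression machinery applies unchanged.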

Consequences of using the wrong functional form:
The functional form constrains the shape of the relationship between sales and hours
If we assume it is linear, the effect is constant at β.
If we assume it is quadratic, the effect is not constant; simple calculus shows it is β1 + 2β2 × Hours.
The Weierstrass approximation theorem helps here: if a function is continuous on a closed interval, it can be approximated as closely as desired with a polynomial function
Form of the Determining Function
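
For reference, the calculus step behind the quadratic case is a single derivative. Writing the coefficient on Hours as \beta_1 and the coefficient on Hours squared as \beta_2 (a labeling assumed here), and holding the error fixed:

```latex
\text{Sales}_i = \alpha + \beta_1\,\text{Hours}_i + \beta_2\,\text{Hours}_i^2 + U_i
\quad\Longrightarrow\quad
\frac{\partial\,\text{Sales}_i}{\partial\,\text{Hours}_i} = \beta_1 + 2\beta_2\,\text{Hours}_i
```

With \beta_1 > 0 and \beta_2 < 0, the effect of an additional hour is large at low hours and shrinks as hours grow, which matches the diminishing pattern described for the quadratic form.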

Quadratic Relationship Between Y and X

This function clearly cannot be approximated by a linear or quadratic function. However, there is a polynomial that can get extremely close to this highly irregular function.
Example of a Continuous but Highly Irregular Function
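
One way to see the approximation idea at work is with Bernstein polynomials, a standard constructive route to the Weierstrass theorem. This is a swapped-in illustration rather than the method used in the slides, and the kinked target function below is an assumed stand-in for the irregular function in the figure.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f at x in [0, 1]."""
    return sum(
        f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
        for k in range(n + 1)
    )

def max_error(f, n, grid_size=201):
    """Largest gap between f and its degree-n approximation on a grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in grid)

def kinked(x):
    """A continuous target with a sharp kink at x = 0.5 (assumed example)."""
    return abs(x - 0.5)

errors = {n: max_error(kinked, n) for n in (2, 10, 50, 200)}
for n, err in errors.items():
    print(f"degree {n:3d}: max error {err:.4f}")
```

The maximum error shrinks as the degree rises, although slowly for a function with a kink. In regression practice a low-order polynomial is often used instead, trading approximation quality for fewer parameters.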

Laffer Curve

The Laffer curve is based on the idea that tax revenue will be zero both with a zero tax rate and a 100% tax rate but is positive for tax rates in between.

Interpretations of β for Different Log Functional Forms
Log-log measures elasticity: the percentage change in one variable associated with a one percent change in another
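
As a sketch of the log-log case, the example below regresses the log of quantity on the log of price for synthetic, noiseless data generated with a constant elasticity. The demand curve Q = 500 × P^(-1.5) and the variable names are assumptions made for illustration.

```python
from math import log

def ols_simple(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic constant-elasticity demand: Q = 500 * P**(-1.5) (assumed).
price = [1.0, 2.0, 4.0, 8.0, 16.0]
quantity = [500 * p ** -1.5 for p in price]

# Log-log form: ln(Q) = ln(500) - 1.5 * ln(P), so the slope is the elasticity.
log_price = [log(p) for p in price]
log_quantity = [log(q) for q in quantity]

intercept, elasticity = ols_simple(log_price, log_quantity)
print(f"estimated elasticity: {elasticity:.3f}")
```

The slope of about -1.5 reads as: a one percent increase in price is associated with roughly a 1.5 percent decrease in quantity.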
