As discussed by Hoover in “Lost Causes”, the concept of causality was dropped from econometrics for many decades, although it is currently making a resurgence. Nonetheless, the revolution launched by Judea Pearl in the 1990s has not percolated to econometrics textbooks. As a result, satisfactory definitions of key concepts required to make sense of regression models are not available. In this lecture, we will clarify many confusions about regression models using the newly developed tools of causality. Essential use is made of causality concepts as defined in the previous lecture: Causal versus Positivist Econometrics.
Brief History of Econometrics: Launched in the early 20th century by Ragnar Frisch, econometric methodology was strongly shaped by the Cowles Commission (CC) in the 1960s. The CC approach relied on structural equations, which embodied causal information known in advance to the researcher. The goal was estimation of causal effects, and not discovery or assessment of the hypothesized causal structures. The oil shock of the 1970s led to dramatic failures of macroeconomic regression models, leading to general distrust of econometric methodology. Two major critiques emerged. The Sims critique argued that hypothesized causal structures were false, and should be abandoned. We should go back to a purely descriptive analysis, looking for patterns in the numbers, without any reference to the real-world phenomena represented by these numbers. Directly opposite was the Lucas critique, which said that regression models were based on surface relationships between the numbers and ignored the deeper causal structures which drive these relationships. Regression models would fail when economic regimes underwent structural change, which is precisely when they were most needed. Neither approach has led to successful macro models. Models based on both approaches, as well as more conventional macro models, failed dramatically in the Global Financial Crisis. The fundamental problem lies in the failure of econometrics to incorporate causal inference correctly.
Autonomy & Invariance: These are two concepts which are of central importance, but cannot be understood without using causality. Autonomy of a regression equation means that the relationship continues to function even if other relationships in the economic system change. For example, a consumption function arises out of stability in the use of income for household purchases. We would expect this relationship to persist over many different kinds of economic change, though not all of them. Ragnar Frisch classified regression relationships into three types: (1) structural (=autonomous), (2) confluent, and (3) spurious. There is substantial confusion in the econometrics literature over the meaning of these terms. However, they can easily be understood within the causal framework discussed in the previous lecture.
Causal Explanation: A regression relationship Y=a+bX is autonomous if it represents a causal relationship X => Y. This causality means that there is a real-world mechanism which operates to transmit the effect of changes in X to the variable Y. It is this mechanism which guarantees the autonomy of the equation. If the world changes in ways which do not affect this mechanism, it will continue to operate, and produce the desired relationship between X and Y. A confluent relation is one where X => W => Y. Because W mediates the causal effect, shocks to W can disturb the relationship between X and Y. The relationship is genuine, but it is not invariant to systematic changes in W. A spurious relationship occurs due to a common cause. Suppose Z => X and Z => Y. The variables X and Y appear related because changes in Z are transmitted to both. If we fix Z, the two would be independent. Despite a possibly strong correlation between X and Y, both regressions (X on Y, or Y on X) are spurious.
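The common-cause case can be illustrated with a small simulation. The following is only a sketch under assumed numbers: the binary Z, the coefficients 2.0 and 3.0, the unit-variance noise, and the helper names pearson and corr_given_z are all hypothetical choices made for illustration. It generates X and Y from Z alone, with no link between X and Y, and compares the pooled correlation with the correlation once Z is held fixed.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Assumed toy structure: Z => X and Z => Y, with no arrow between X and Y.
z = [rng.choice([0, 1]) for _ in range(n)]
x = [2.0 * zi + rng.gauss(0, 1) for zi in z]
y = [3.0 * zi + rng.gauss(0, 1) for zi in z]

# Correlation over the whole sample, ignoring Z.
corr_pooled = pearson(x, y)


def corr_given_z(value):
    """Correlation of X and Y among observations where Z equals `value`."""
    xs = [xi for xi, zi in zip(x, z) if zi == value]
    ys = [yi for yi, zi in zip(y, z) if zi == value]
    return pearson(xs, ys)


corr_z0 = corr_given_z(0)
corr_z1 = corr_given_z(1)

print("pooled correlation:", round(corr_pooled, 3))
print("correlation with Z fixed at 0:", round(corr_z0, 3))
print("correlation with Z fixed at 1:", round(corr_z1, 3))
```

Under these assumptions the pooled correlation is clearly positive (about 0.59 in the population), while the correlation within each value of Z is close to zero. A regression of Y on X would fit well in the pooled data and still describe no mechanism linking X to Y.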
Invariance: Confusingly, invariance is sometimes used for concepts similar to autonomy. It is better understood in the context of attempting to QUANTIFY a causal effect. Suppose we know that X => Y. Next we want to find out the strength of this effect. This is what a regression model does. When we estimate Y=a+bX, the estimated coefficient b measures the impact of changes in X on changes in Y. Invariance refers to the constancy of this parameter b. As we will explain below, the search for this kind of invariance chases an illusion, and distracts from the real goals.
Notation: Many authors have suggested that a big source of confusion regarding causality comes from the lack of notation to express causal concepts. Accordingly, we will introduce some explicit notation for this purpose. The forward arrow X => Y indicates that X causes Y, as per the definition already given in the previous lecture. We will use back arrows, as in Y <= A + B X, to indicate a quantified causal relationship. Here, not only does X cause Y, but the impact can be quantified by the equation. Nothing prevents multiple causes from affecting Y. Let us say that Z=(Z1,Z2,…,Zn) are other variables which causally affect Y. If they are independent of X, then we can lump the combined influence of all of them into the constant A=A(Z). It is easy to imagine environmental variables E=(E1,E2,…,Ek) which could affect the strength of the causal relationship, making it stronger or weaker. The causal factors Z could also play this role. So the constant B is better modeled as a function B=B(Z,E).
Regression: Suppose A(Z)=a0+a1Z1+…+amZm, where Z1 to Zm are observed. We could lump the combined effect of the unobserved variables Z(m+1) … Z(n) into an error term e and get the equation: Y = a0 + a1 Z1 + a2 Z2 + … + am Zm + B(Z,E) X + e. If we add the assumption that B remains constant, so that B(Z,E)=B, then we get a standard regression model. In this equation, Z and X play different roles. The main causal relationship of interest is the one between X and Y. The Z’s have been put in to take care of the fluctuations of A in the causal relationship between X and Y, so as to get a better measure of the causal effect B. They are sometimes called “covariates” for this reason. However, there is no way to tell the difference between Zi and X in the regression equation. We need to improve our notation to clearly indicate this difference. We will use square brackets [] to denote a causal factor, and parentheses () to denote a coefficient within a functional form of a causal relationship. With this notation, we have Y <= (A) + (B) [X]. Here the parenthesized (A) and (B) indicate that these are coefficients within a causal relationship, while the square bracket [X] indicates a causal factor. Then the regression equation can be written as Y <= (a0 + a1 Z1 + … + am Zm) + (B) [X], to show the difference between the Z’s and the X. The Z’s may or may not be causal factors for Y. In this equation, they are added for the purpose of adjusting the constant, so as to allow a more accurate estimate of the causal effect (B). It is important to note that the idea that B is constant is unlikely to be true. The strength of the causal effect is likely to vary with many different factors, as we will see in examples given later.
Mincer’s Earnings-Education Equation: An equation developed by Jacob Mincer to quantify the effect of education on earnings has acquired central importance in labor economics. This takes the form Earnings (Ern) = a + b Education (Edu) + Other Factors. Let us take for granted the causal effect Edu => Ern; evidence in favor of this causal hypothesis is overwhelming. Then the earnings equation Ern <= (A) + (B) [Edu] is an attempt to QUANTIFY the strength of the causal effect. To illustrate the difference between covariates and causes, consider the variable Parental Education (PrE). Generally speaking, children of educated parents have more education and higher earnings, so PrE can play a role in the earnings equation. Nonetheless, it is clear that it is not a direct cause of Ern. If it affects the constants A, B (which is plausible), then we can write the regression equation as:
Ern <= (A) + (B) [Edu] = (a0 + a1 PrE) + (b0 + b1 PrE) [Edu] = (a0) + (a1 PrE) + (b0) [Edu] + (b1 PrE) [Edu]
Without the parentheses and brackets, the variables PrE and Edu would appear on par in the regression equation; both would appear to be causal factors. The mysterious cross-term (b1 PrE) [Edu] would call for interpretation. With this notation, the different roles of PrE and Edu are apparent. Edu is the only causal factor, while PrE affects the strength of the relationship, and is needed for quantifying the causal effect.
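The role of the cross-term can be made concrete with a short sketch. The function names and all coefficient values below (a0=10, a1=0.5, b0=2, b1=0.25) are hypothetical numbers chosen only for illustration, not estimates. The sketch writes the earnings equation once in causal form, (A) + (B) [Edu] with A = a0 + a1 PrE and B = b0 + b1 PrE, and once as the flat sum of four terms that a regression would estimate, and checks that the two agree.

```python
def earnings_causal_form(edu, pre, a0, a1, b0, b1):
    """Ern <= (A) + (B) [Edu], with both coefficients shifted by PrE."""
    A = a0 + a1 * pre  # (A): baseline level, adjusted by the covariate PrE
    B = b0 + b1 * pre  # (B): strength of the causal effect of [Edu]
    return A + B * edu


def earnings_expanded_form(edu, pre, a0, a1, b0, b1):
    """Same equation multiplied out into four terms, including the cross-term."""
    return a0 + a1 * pre + b0 * edu + b1 * pre * edu


def effect_of_education(pre, b0, b1):
    """Change in Ern per extra unit of Edu: the coefficient (B) at a given PrE."""
    return b0 + b1 * pre


# Hypothetical coefficient values, for illustration only.
coef = dict(a0=10.0, a1=0.5, b0=2.0, b1=0.25)

ern_causal = earnings_causal_form(edu=12, pre=8, **coef)
ern_expanded = earnings_expanded_form(edu=12, pre=8, **coef)

effect_low_pre = effect_of_education(pre=8, b0=coef["b0"], b1=coef["b1"])
effect_high_pre = effect_of_education(pre=16, b0=coef["b0"], b1=coef["b1"])

print("causal form:", ern_causal, "expanded form:", ern_expanded)
print("effect of Edu when PrE=8:", effect_low_pre)
print("effect of Edu when PrE=16:", effect_high_pre)
```

The two forms give the same earnings figure, so the cross-term is nothing more than the PrE-dependent part of (B). In this reading, the strength of the Edu effect differs across levels of PrE, while Edu remains the only causal factor.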
Focus on Mechanism: Currently, regression models search for invariance in the wrong place: they focus on trying to find ways to specify (A) and (B) such that these coefficients become constant. However, there is no reason to expect, and no need to require, that the strength of the causal effect should remain constant over large sets of environmental configurations. Although nothing remains constant over long stretches of time, it is the MECHANISM which remains stable over (short periods of) time. It is this mechanism which allows us to use current patterns to forecast future patterns, and also to assess effects of policy interventions. To illustrate how the search for mechanism shifts the focus of the research effort, think about HOW education affects earnings. One very simple factor comes to mind immediately. Some jobs require educational qualifications; you cannot apply for them without having the credentials. We will create a simple model to show how thinking about mechanisms leads to different ways to analyze the earnings-education relationship.
A Simple Model for Earnings-Education Relationship: Suppose there are three sectors in the economy: Private, Government (Public), and Informal. Suppose that wages are WP > WG > WI; the private sector offers the highest wages, followed by government, while the informal sector offers the lowest wages. Suppose that government sector jobs are restricted to the educated; application for these jobs requires a degree. Suppose that the private sector does its own skills assessment and training, and therefore does not require degrees. Suppose that the educated prefer government jobs because these offer job stability, fringe benefits, retirement and health plans, which are substantially superior to the private sector, even though the wages are lower. Suppose there are 100 jobs each in the private and informal sectors and 50 jobs in the government sector. Suppose there are 50 educated people in the population, who all get government jobs, and 200 uneducated, who split evenly between the private sector and the informal sector. If the average of WP and WI is above WG, then our earnings-education regression will show that education has a negative impact on earnings, even though it is actually quite beneficial. Attempts to find better fitting equations by adding variables and nonlinearities will be quite misleading and useless, because they do not reflect efforts to find out WHY more education leads to higher earnings. It is only a focus on the mechanism, which remains stable under change, which will lead to improved understanding, and therefore improved estimates of the strength of the causal effects.
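The arithmetic of this three-sector model can be checked directly. The sketch below uses hypothetical wage levels (WP=100, WG=60, WI=30 in the first scenario, and WP=80 in the second) that respect WP > WG > WI; the function names are likewise illustrative. It builds the population of 250 workers described above and computes the ordinary least squares slope of earnings on a 0/1 education indicator.

```python
def ols_slope(xs, ys):
    """Slope b in the least-squares fit y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


def education_slope(wp, wg, wi):
    """Regress earnings on education for the three-sector population."""
    # 50 educated workers, all in government jobs;
    # 200 uneducated workers, 100 in the private and 100 in the informal sector.
    edu = [1] * 50 + [0] * 100 + [0] * 100
    earnings = [wg] * 50 + [wp] * 100 + [wi] * 100
    return ols_slope(edu, earnings)


# Scenario A (assumed wages): average of WP and WI is 65, above WG = 60.
slope_a = education_slope(wp=100, wg=60, wi=30)

# Scenario B (assumed wages): average of WP and WI is 55, below WG = 60.
slope_b = education_slope(wp=80, wg=60, wi=30)

print("scenario A slope:", round(slope_a, 6))  # approximately -5
print("scenario B slope:", round(slope_b, 6))  # approximately +5
```

With a 0/1 regressor the slope is just the difference between mean educated and mean uneducated earnings, so it is about -5 in the first scenario and about +5 in the second. The hiring mechanism is identical in both; only the private-sector wage differs, which is why the sign of the fitted coefficient says little about how education affects earnings.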
Failure of Positivist Paradigm: At the root of the problem is the positivist paradigm, which tells us to avoid thinking about the hidden causal mechanisms of reality, and focus on the observables. Econometrics attempts to find stable relationships among the observables, but the only stable relationships occur at the level of the causal mechanisms which underlie the observations. We will illustrate this issue with a few more examples.
The Forecast Competitions M1 to M4: The International Journal of Forecasting organizes forecasting competitions as follows. In the most recent one (M4), 100,000 data series were picked, and competitors were invited to submit forecasting methods. Methods which performed well overall in terms of average forecasting error over all the 100,000 series were awarded prizes in the competition. Forecast performance depends on the match of the model with the underlying real-world process which generates the data. If we forecast the growth of a child, an S-shaped curve would do well. Forecasting oil prices would require attention to the dynamics of supply and demand of oil. If we just look at the numbers, success of forecasting is purely random, like using a crystal ball for forecasting. The results of the competition, when examined closely, validate this idea. Performance of forecasting methods varies widely across different series, and for different time periods. In any competition of random numbers, some will come out on top and others will come out at the bottom. Nothing systematic can be learnt from such competitions. To go beyond crystal ball forecasting, one needs to look beyond the observed numbers to the real-world processes which generate the numbers.
Relating Advertising Expenses to Sales: As our last example, we consider assessing the relationship between Advertising Expense (Adv) and Sales (Sls). We have good a priori reasons, as well as strong empirical evidence, for the validity of the causal relationship Adv => Sls. However, the strength of the causal effect is of great importance to business firms trying to allocate budget to advertising. This means that we want to estimate A, B in the causal regression model Sls <= (A) + (B) [Adv]. When we look at the mechanism by which advertising leads to consumer decisions to buy, we have every reason to believe that the impact coefficient B will be affected by many different variables. So there is no reason to expect to see a single stable value for B persist over a long period of time. The products, competitors and their advertising strategies, consumers, and the market all change and evolve quite rapidly. Advertising strategies effective yesterday will be less effective in the rapidly changing world of social media. Running a regression to try to pick up a stable relationship is looking for an “invariant” relation in the coefficients, which does not exist. There is some invariance when we look to consumer decisions for purchase, which are based on perceived needs, information available from different sources, and perceived advantages and disadvantages of consumption choices. When we look at the stable mechanism, and then assess the impact of advertising on different aspects of this mechanism, we may be able to come up with advertising strategies which work under rapidly changing scenarios. Without examining the mechanisms, playing with sales and advertising numbers to try to find a good regression fit is a completely useless exercise.
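A toy calculation shows what a single fitted coefficient means when B is not constant. Everything in this sketch is assumed for illustration: the two periods, the advertising levels 1 to 10, the intercept 5, and the values B=3 and B=1 are hypothetical, and the data are noise-free so that the arithmetic is transparent.

```python
def ols_slope(xs, ys):
    """Slope b in the least-squares fit y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


adv = [float(level) for level in range(1, 11)]  # assumed advertising levels

# Sls <= (A) + (B) [Adv], with an assumed shift in (B) between periods.
sales_period_1 = [5.0 + 3.0 * a for a in adv]  # B = 3 in the first period
sales_period_2 = [5.0 + 1.0 * a for a in adv]  # B = 1 in the second period

slope_1 = ols_slope(adv, sales_period_1)
slope_2 = ols_slope(adv, sales_period_2)

# One regression over both periods, as if B were a single constant.
slope_pooled = ols_slope(adv + adv, sales_period_1 + sales_period_2)

print("period 1 slope:", slope_1)
print("period 2 slope:", slope_2)
print("pooled slope:", slope_pooled)
```

The period-by-period slopes recover 3 and 1, while the pooled regression returns 2, a value that describes the effect of advertising in neither period. Which B applies at a given time is a question about the mechanism, and the pooled fit does not answer it.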
Concluding Remarks: What insights does causality provide about regression models? Given a collection of variables (X, Y, Z1, …, Zn), our first task must be to search for DIRECT causal linkages, that is, those which operate when all other variables are fixed. These are the AUTONOMOUS relations. Indirect chains are just sequences of direct links. These are the CONFLUENT relations. Common causes create the APPEARANCE of a relationship; these are SPURIOUS relations. Because there is no serious discussion or understanding of causality in the most widely used econometrics textbooks, a VAST amount of econometrics is the study of SPURIOUS relationships. But, even when a genuine causal relationship is under study, econometrics searches for regression forms which are “invariant”, that is, regressions which have stable coefficients. As we have seen, invariance occurs at the level of the underlying causal mechanism, and not at the level of the observed relationship. Thus, we endorse the Lucas critique: regression failure occurs when we study surface relationships among observed variables, instead of searching out the underlying causal relationships. However, the search for causality cannot be done by axiomatic, a priori theory, as proposed by Lucas. Instead, we need an empirical approach to look at the real-world mechanism by which one variable affects another.