**Defining Feature of Real Econometrics:** {bit.ly/RSIA10C}

We can define “Real Econometrics” as the search for causal relationships within a collection of variables. There exists an enormous amount of confusion about what exactly a causal relationship is. We will take a simple and practical approach, developed by Woodward in his book “Making Things Happen”. Given a collection of variables X,Y,Z1,Z2,…,Zn, we will say that X is a cause of Y if we can create changes in Y by changing the values of X. This is a “practical” definition in the sense that if we learn about a causal relationship, we can use it to create changes in the world around us.

**Why is there confusion about causality?**

Children (and animals) are born with the ability to learn about causal relationships, and to use them to bring about desirable changes in their external environment. Since an understanding of causation is built deeply into us, ordinary people find it difficult to understand why philosophers are so confused about causality. We need to discuss this issue because the dominant methodology of statistics and econometrics is built on the foundations of a philosophy (logical positivism) which is a source of enormous confusion. However, students should not worry if they fail to understand the source of this confusion. The point of trying to explain it is to show why conventional statistics and econometrics are wrong. It does not matter for learning real statistics.

Confusion about causation was created by David Hume, who noted that we can only observe sequences of events – B follows A – but we cannot observe the underlying causal connection that A caused B to happen. This idea of Hume's is actually a mistake, created by his misunderstanding of the sources of human knowledge. Unfortunately, his mistake became embodied in the heart of the philosophy of “logical positivism” and became widely accepted. In the early 20^{th} Century, all of the social sciences, including statistics and econometrics, were created on the false foundations of this philosophy. This is the reason why these disciplines have been singularly unsuccessful in leading to increased welfare for mankind as a whole.

At the heart of logical positivism is the idea that reliable knowledge comes only from observations and logic. On the surface, this seems like a straightforward idea. But when “observations” is restricted to observations of the external world (and not our internal experience), this leads to huge mistakes. For a more detailed discussion of these mistakes, see “The Emergence of Logical Positivism”. Causality is defined by manipulation: changing the value of X leads to changes in the value of Y. Quite often, this manipulation is not actually possible; in such cases, the definition rests on a “hypothetical” experiment, where we imagine changing the value of X and seeing how this would affect Y. Because logical positivism insists that knowledge is only of what we actually observe, and that we cannot have any knowledge of what was not observed, this kind of hypothetical experiment is ruled out. This is why the concept of causation cannot be understood by positivists.

**Causation: Complexities, Technicalities, and Subtleties**

The philosophical confusion about causation is partly because causation is itself complex. Here, we will take a simple and practical approach to the topic, as formulated by Woodward in his book on “Making Things Happen”. The key concept is that X causes Y if we can use changes in X to create changes in Y. It requires some work to make this precise.

**Deterministic & Probabilistic Cause**: First, we must distinguish between deterministic and probabilistic causes. With a deterministic cause, a change in X always produces a change in Y. With a probabilistic cause, a change in X leads to a change in the probabilities of occurrence for Y. A simple example of a probabilistic cause is COVID vaccination. Suppose that the probability of catching COVID is 60% in some reference population, and that vaccination lowers this probability to 10%. This is how vaccination is a probabilistic cause of not getting COVID. In a population of 100 people without vaccination, about 60 will get COVID while 40 will not. Among 100 vaccinated people, about 10 will get COVID while 90 will not. Probabilistic causation cannot be detected by looking at individuals. We will find all four types of people: vaccinated with COVID, vaccinated and COVID-free, unvaccinated with COVID, and unvaccinated and COVID-free. It is only the proportions of COVID-free people in the two populations which show the probabilistic effectiveness of the vaccine.
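The arithmetic above can be checked with a quick simulation. This is a minimal sketch: the 60% and 10% probabilities come from the text, while the population size, seed, and everything else are illustrative assumptions.

```python
import random

random.seed(0)

def infected_count(n, p):
    """Simulate n people, each of whom catches COVID with probability p."""
    return sum(random.random() < p for _ in range(n))

n = 100_000
unvaccinated_sick = infected_count(n, 0.60)  # assumed 60% risk without vaccine
vaccinated_sick = infected_count(n, 0.10)    # assumed 10% risk with vaccine

# All four types of people occur; only the PROPORTIONS reveal the effect.
print(f"unvaccinated infection rate: {unvaccinated_sick / n:.2f}")
print(f"vaccinated infection rate:   {vaccinated_sick / n:.2f}")
```

With only 100 people per group, as in the text, the counts would be roughly 60 and 10 but with visible random variation; the larger simulated population makes the proportions stable enough to read off the probabilistic effect.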

**Direct Cause**: Next, the concept of *direct cause* is of fundamental importance. The idea is that changes in X directly cause changes in Y, without the influence of any intervening variables. We will represent this symbolically as X => Y, or Y <= X. This requires some care to make precise. Suppose we have a collection of variables under study: X,Y,Z1, Z2, …, Zn. To assert that X is a direct cause of Y means that for some configuration of values Z1=v1, Z2=v2, …, Zn=vn, some change in X, from its observed value to some other potential value, will cause a corresponding change in Y (or in the probabilities of Y outcomes). The reason we must fix the other variables is to prevent them from interfering with the power of X to change Y. We do not require that all changes in X lead to changes in Y – only that there should be some way to change X so as to create a change in Y. We also do not require X to have the power to influence Y in all environments (as defined by values of Z1 to Zn). It may be that for some configurations of values of the Z-variables, X is powerless to change Y. To say that X is a direct cause of Y, we only need some particular configuration of the Z-variables such that changes in X can affect Y.

**Indirect Cause**: The word “indirect” is ambiguous and can have several meanings. However, we will use it in only one way: X is an indirect cause of Y if it is linked to Y by a sequence of direct causes. For example, X => Z1, Z1 => Z2, and Z2 => Y. In the late 20^{th} Century, Judea Pearl created a new approach to causality, which broke out of the mindset created by positivism. In his terminology, a direct cause (X=>Y) is a (causal) parent of Y. An indirect cause (X=>Z=>Y) is an “ancestor” of Y. Note how fixing variables allows us to differentiate between direct and indirect causes. If X=>Z=>Y and we fix the value of Z at v, then regardless of how we change X, we cannot create changes in Y. This is because Z is the only channel through which X affects Y, and if Z cannot change then X is powerless to influence Y. In such situations we say that Z screens off the effect of X on Y. After fixing Z, X and Y become independent.
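The screening-off property can be illustrated with simulated data. This is a sketch under assumed linear relationships: the coefficients, sample size, and the use of partial correlation (regressing out Z) as a stand-in for “fixing” Z are all assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical chain X => Z => Y: X influences Y only through Z.
x = rng.normal(size=n)
z = 2.0 * x + rng.normal(size=n)   # Z is directly caused by X
y = 1.5 * z + rng.normal(size=n)   # Y is directly caused by Z alone

def residual(a, b):
    """Remove the linear influence of b from a."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

marginal = np.corrcoef(x, y)[0, 1]
partial = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"correlation of X and Y:     {marginal:.2f}")  # strong
print(f"correlation after fixing Z: {partial:.2f}")   # near zero
```

Once the channel Z is held fixed, changes in X no longer carry any information about Y, which is exactly the independence the text describes.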

**The Collection of Relevant Variables**: It is important to note that these definitions depend on the specified collection of variables. If X=>W=>Y, but W is not in the set of variables under consideration, then X will appear to be a direct cause of Y; fixing all other variables will not prevent changes in X from affecting Y. Only fixing the value of W will prevent X from affecting Y, but W is not among the set of variables under consideration. This relativity of direct causation to the variable set under consideration cannot be avoided.

**The GOAL of Real Statistics**: With these technicalities defined, we can sharpen the definition of “Real Statistics”. Given a collection of variables X1, X2, … , Xn, real statistics aims to find the direct causal relationships between the variables in this set. Given all the direct causal links, we can also find all the indirect causal links, because these are just sequences of direct causal links. There are a few important points which follow from this goal.

- Each causal link represents a real-world mechanism. Changing X causes something to happen which leads to changes in Y. This is NOT a relationship between numbers (the observed values of X and Y) but a real-world relationship which is being captured by the numbers.
- Because we are after the discovery of real-world mechanisms, we will ALWAYS need to go beyond the numbers to assess causal linkages. Numbers can only reveal correlations, never causation. Correlations can often be helpful in the discovery of causes, but actual manipulation and control of Y using X leads to far more certain discovery of causes. Again, this requires actual intervention and action in the real world, not just passive observation.
- Given a collection of real variables, the number of causal hypotheses that can link them is so enormous that it is impossible to go in with an open mind and hope to discover something. Instead, we go in with some tentative causal hypotheses based on our knowledge of the real world structures generating the data. Then the data may support, or discredit, our original guess at the causal structure. If the data conflict with our original hypothesis, they will also often give us a clue about a better alternative. These alternatives may often involve expanding the data set beyond the original set of variables under consideration.

**A Real-World Example**:

We will illustrate all of these abstract concepts in the context of real-world data. The data set lists prices (per square foot) of houses (HPsqf), and also the number of convenience stores (#Shops) within walking distance. There are 415 houses, categorized according to the number of stores, which varies from 0 to 10. The following graph provides a picture of the data set:

Each blue dot represents a house. The number of convenience stores (#Shops) is listed on the X-axis and varies from 0 to 10. There is one outlier, an extremely expensive house which has only one store in the neighborhood. In general, housing price shows an increasing tendency with #Shops. That is depicted by the orange regression line, which has a positive slope. Running a regression of housing prices on #Shops leads to the following output:

This output can be summarized in the following regression equation:

HPsqf = 27.18 + 2.63 #Shops + Error

What the regression tells us is that if we increase the number of convenience stores by one, the prices of houses in the neighborhood will increase by about $2.63 per square foot on average. At least, this is what a CAUSAL interpretation of the regression model would tell us. This is what students are trained to believe, even though the regression does not actually provide us with any causal information directly. To understand this, let us first look directly at a data summary, without regression. One convenient graphical representation is given below:
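Since the original data set and regression output are not reproduced here, the sketch below fits the same kind of regression to simulated data built around the reported coefficients. Only 27.18, 2.63, and the sample size of 415 come from the text; the noise level, data generation, and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 415  # number of houses reported in the text

# Simulated stand-in for the housing data (the real data set is not shown here).
shops = rng.integers(0, 11, size=n)                  # #Shops: 0 to 10
hpsqf = 27.18 + 2.63 * shops + rng.normal(0, 10, n)  # price per sq. ft. + noise

slope, intercept = np.polyfit(shops, hpsqf, 1)
print(f"HPsqf = {intercept:.2f} + {slope:.2f} #Shops + Error")
```

The fit recovers coefficients close to the ones used to generate the data, which is all a regression can ever do: summarize the pattern in the numbers. Nothing in the fitting procedure certifies that the pattern is causal.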

Each bar corresponds to the number of convenience stores in the neighborhood, which varies from 0 to 10. The bar is the median price (per sq. ft.) of houses having that number of convenience stores within walking distance. The orange line is the total number of houses within each category. The graph shows a generally increasing trend in price per sq. ft. corresponding to increasing numbers of shops. The orange line shows that the largest number of houses (around 68) have 0 stores within walking distance, and a similar number have 5. From 5 onwards the number of houses declines. There are only around 10 houses in category 10 – which means 10 convenience stores within walking distance.

The first thing to understand is that this is the data – this is all the information provided to us by the data. There is no more. That is, the regression analysis does not magically create more information for us. In fact, all of the “additional” information provided by regression is created by the assumptions of the regression model, which are added to the data. In general, and in this particular case, these assumptions are almost surely false. So, regressions create an illusion of precision which is not actually part of the empirical evidence available to us.

Next, we consider the evidence provided by this data. It does seem to be the case that housing prices tend to increase with the number of shops. It is also of some interest that the number of houses within each category is decreasing, at least after 5 shops. The smallest number of houses have 10 shops within walking distance. The question is: is this a causal relationship? Is it true that if the number of shops increases, then housing prices per square foot will go up? Regression methodology teaches students to believe so. Coming out of a standard econometrics course, students would look at the above regression and conclude that if one additional shop opens up, the prices of houses in the neighborhood would tend to increase by about $2.63 per sq. ft. on average. This is completely false and misleading. The data do not provide us with any such evidence.

From the real statistics perspective, the data inform us of a correlation between #Shops and HPsqf. This is a clue to a possible causal relationship between the two variables. There are three possible causal sequences which could lead to such a correlation: #Shops => HPsqf, HPsqf => #Shops, or Z => #Shops and Z => HPsqf, where Z is some unknown common cause which affects both of the variables we see. How can we find out which of these possibilities (ignoring more complex ones) holds? To learn about causality, we must formulate hypotheses about structures of reality which lead to the observed correlation. Three such hypotheses are formulated below:

- People look for houses with more convenient shopping.
- Richer parts of town attract more stores.
- There is some other factor which attracts stores, and also causes higher house prices.

To give an example of the third hypothesis, suppose that the town is centered around a lake. Lakefront properties are generally more expensive. Also, tourists coming to town generally come to see the lake. As a result, there are a lot of shops on the lakeside (which serve tourists). Then the correlation between housing prices and the number of shops is accidental, due to a common cause (the lake). How can we differentiate between the three hypotheses? No amount of sophisticated analysis of the data will reveal any information about this matter. Instead, we must expend shoe leather. We could go and ask questions of three classes of people:

Real Estate Agents: What do people look for when they are shopping for houses? How much importance do they give to the number of shops in the neighborhood? Why are the more expensive houses so highly priced, relative to the others?

Home Buyers: What were the characteristics you were looking for when you purchased the house? How much importance do you place on nearby convenience stores, in terms of purchase decisions? How much more would you have been willing to pay for this house, if there were a few more shops located nearby?

Shop Keepers: What influenced you to open up a shop in this location? Were you looking for proximity to other shops, or proximity to a rich neighborhood?

Acquiring information of this sort is necessary to learn about the causal relationships which create the observed correlation. It is obvious that the data cannot answer any of the questions above, yet it is these answers which would provide us with causal information. This demonstrates how a real analysis, which searches for causes, nearly always goes beyond the data, to the real-world factors which generate the data.

**Conclusions**:

Data analysis for the purpose of discovering causal relationship differs dramatically from the regression analysis methods currently being taught the world over. Some of these differences are summarized below:

- Think intuitively, about the real world. This generates initial hypotheses about causal structures, to be tested and verified with data. Often special purpose data will have to be gathered for this purpose.
- Think about direct versus indirect effects: It is essential to identify the direct effects, since these are the building blocks for the indirect effects. Every hypothesized direct effect corresponds to a real world mechanism (not just a pattern in the data). Given a real world mechanism in operation, there are often many different ways – not just data analysis – to assess its presence and strength.
- Think causal explanation: Whenever we see a strong correlation between variables X and Y, we can look for an explanation. There are three main causal sequences which explain such correlations: X=>Y, Y=>X, or Z=>X & Z=>Y. Thinking about which of the possibilities holds is not an exercise in data manipulation. This is an exercise in thinking about mechanisms which operate in the real world, relating the variables under study.
- Think about OTHER relevant factors. Whenever we see a correlation, it may be due to a common cause. We have to apply our real-world knowledge to discover what a common cause may be. Then we may be able to gather data on the common cause, and discover whether our hypothesis of common cause holds. If conditioning on the common cause makes X and Y independent, then the hypothesis is confirmed.

This should show how a real data analysis always involves thinking about real world mechanisms, and not about how to manipulate the data, or to make fancy statistical assumptions about the error terms.

You should never make “fancy statistical assumptions” about error terms. If your causal model is good enough, then when you apply it the error terms will have certain characteristics. If they don’t have them, the model cannot be right. That is why scrutiny and testing of the properties of error terms is important. If the model is correct, the errors will be serially independent, uncorrelated with the explanatory variables, and approximately normally distributed. All of those properties should be tested. Any failure indicates a mis-specification – probably some relevant factor has been omitted. The data can never establish that a model is right, only that it is tenable for a given data set. But the data can often tell you when the model or hypothesis is wrong or inadequate.

Great post, Asad. When using regression analysis, many economists seem to take for granted that they know all the relevant variables and the functional forms they take. This was criticized already by Keynes (Tinbergen then being the target) and more forcefully in modern times by David Freedman. Usually, in an economic context, we do NOT know for sure what variables are relevant for the things we try to explain and understand. Theory (and our own more or less vague hunches) may tell us X, Y, Z, are the relevant variables to be incorporated in our models — but we NEVER actually know. To my knowledge, there has never been a single case in economics where regression equations have been able to decisively uncover causal relationships. As Asad says: “The data do not provide us with any such evidence … A real analysis, which searches for causes, nearly always goes beyond the data, to the real-world factors which generate the data.”

Thanks for the feedback Syll.