Regression vs Reality

This is a continuation of the previous post on Econometrics Versus Reality, which provided the philosophical background for the nominalist methodology of science underlying regression models. In this post, we will discuss a specific data set and illustrate the differences between a regression model and a real model (short for a model based on realist philosophy of science).

Regression: Most widely used model

Fisher introduced the idea of making an assumption that data is a random sample from a hypothetical imaginary distribution, in order to simplify data analysis. Regression extends this idea to two or more variables. It is based on LARGE number of FALSE assumptions. As we have seen, a nominalist methodology has no difficulties with false assumptions. It does not matter if Model doesn’t correspond to Reality. We only check FIT between model & data. In contrast to this, REALISM says that False assumptions lead to false results; that is our premise in this course on Real Statistics.

A Standard Course Introduces Assumptions of the regression model with minimal explanation. The goal is never to analyze or understand these assumptions, or to assess if they are true. The assumptions just provide us with the mathematical tools required to do data analysis. The standard course USES the assumptions to do complex mathematics required to setup and estimate regression models. A whole SEMESTER of work is involved in learning the MECHANICS of regression. Since nearly all regression models are false, all of this is useless. In this course, we will study Regression Models from OUTSIDE. That is, we will discuss how regression models are setup and estimated, without discussing the mechanical and mathematical details.

Causality:

As discussed earlier, causality is not observable. We can see that Y happens after X, and but not that Y happens because of X. Because of this, the positivist methodology of econometrics makes no mention of causality. Yet, causality is central to understanding regression models, and also, why they fail. We have already introduced the notation that “X => Y” reads “X causes Y”. What does this mean? Changes in X lead to changes in Y. The relationship is DIRECT.  If we have X => W => Y this is NOT equivalent. In this situation, W is a MEDIATOR – it mediates the relation between X and Y.  Other causal factors may be present. X => Y and Z => Y is possible and often the case. Other chains of causation may be present. X => Y and ALSO X => W => Y can BOTH hold. However, we will assume that there is no circular causation. We cannot have X => Y and Y => W => X. Although such situations may be possible, we ignore them for simplicity in our initial approach to causality. With this notation in place, we can now discuss the assumptions of the regression model.

Assumptions of the Regression Model

Regression starts by identifying a dependent variable Y, which we wish to “explain”. The regressors are a set of variables X1, …, Xk to be used in explaining Y. Since the notion of “explain” is never explained, the meaning of these fundamental assumptions never emerges clearly. In this lecture, we will only consider the case of a single regressor X. The key assumption is one of a causal relationship between X and Y:       X => Y. This is referred to as the Exogeneity of X, but there is no real understanding of what this means. Next we assume a LINEAR relation between Y and X, and some other factors which are not known:

Y= bX + F1 + F2 + F3 + … + Fn

If we aggregate the unknown factors into an error term, we get the regression model: Y = bX + ErrY. An important assumption is that ErrY is INDEPENDENT of X. We will use =/> to mean “does not cause”. Then the independence assumption X || Fi means X =/> Fi and also Fi =/> X, where Fi are the unknown factors which go into the error term. To run a regression, we will need multiple data points. Then, an additional assumption is that ErrY is independent across time or sector. We also have the standard Fisherian assumption: ErrY is random sample from common distribution. If we let M be the common mean of this unknown distribution, we can write the regression equation as Y = M + bX + (ErrY-M). In this equation we have a constant term, and the new error term is ErrY’ = ErrY-M. This error term has mean 0.

  1. What is the BASIS for these assumptions? NONE.
  2. Can we expect them to be roughly valid? NO!

These are all INCREDIBLE & BIZARRE assumptions – it is almost impossible to think of situations where they would hold. Edward Leamer analyzed the regression model and stated The Axiom of Correct Specification: Regressions produce good results ONLY IF ALL ASSUMPTIONS HOLD. Failure of any one of the assumptions can lead to dramatically wrong conclusions. Many examples of such failure can be demonstrated in highly regarded papers published in top journals. The biggest problem is that regressiom models create confusion about causal effects in minds of students. This makes it impossible for them to use data to arrive at sensible conclusions about reality.

Yule fitted the first regression model to data in 1896, and failed to reach any conclusions regarding the problem he attempted. Hundred years later, in a centenary article, Freedman (1996) wrote that we have tried this methodology for a century, and it has failed to produce good results. The time has come to abandon it. This is precisely our point of view. Regression models have failed, and should be abandoned. In this chapter, we will describe the methodology and its failures. Later we will develop alternative, superior methodologies, based on a REALIST approach, which does not allow to make false assumptions freely, to fit the data. With this as background, we now turn to regression models.

Regression: Fitting lines to data

For the two variable case, regression involves fitting a line to data, creating a smooth and simple relationship which approximates a complex and cluttered cluster of points on a graph of the data. In context of REAL STATISTICS – we must ask WHY?

  1. Why are we fitting a line to the data?
  2. What does the fitted line MEAN?
  3. How do we calculate which line fits best?
  4. What will be done with the regression analysis?

In real world situations, there are a number of possible DIFFERENT uses for fitting lines. The interpretation of the LINE depends greatly on real world context which generates the numbers, and the PURPOSE for which this analysis is being done.

GENERALLY SPEAKING: Regression methodology is only ONE of many possible ways of fitting lines. VISUAL FITS are USUALLY an excellent & SUPERIOR option. We will now illustrate these abstract concepts by one simple example. More examples will be given in later lectures.

Measuring Serum K levels in Blood:

The data we analyze is two ways of measuring Serum kanamycin levels in blood. Samples were drawn simultaneously from an umbilical catheter and a heel venipuncture in 20 babies – this is a real data set taken from Kaggle. The data set and a graph of the data are given below:

BabyHeelstickCatheter
12325.2
233.226
316.616.3
426.327.2
52023.2
62018.1
720.622.2
818.917.2
917.818.8
102016.4
1126.424.8
1221.826.8
1314.915.4
1417.414.9
152018.1
1613.216.3
1728.431.3
1825.931.2
1918.918
2013.815.6

We graph this data in an X-Y scatterplot:

The data and the graph show a rough correspondence between the two measures, which is what we expect to see. Both Cath and Heel are measures of the same unknown quantity K, the serum K levels in the blood of the baby.

The data is relevant to the analysis of a meaningful question. Are the two measures equivalent? Do they provide us with an accurate measure of the true Serum Kanamycin levels (K) in the baby’s blook? The unobserved real variable K generates the two measures C and H: K => C and K => H. Because both are caused by K, we expect to see high correlation between the two. BUT there is no causal relationship: C =/> H and also H =/> C. Neither measurement causes the other. A reasonable representation of the underlying structure is: C = K + ErrC and H = K + ErrH. This is A GOOD model because it matches underlying unobserved reality. The errors are meaningful – they represent the errors created by assuming that the sample is representative of baby’s blood. As we will soon see, regression analysis cannot use either of these correct structural models because the involve the unknown and unobserved K.

Before turning to a regression, we do a little common-sense analysis of the data. It is well known that given two erratic but independent measurements, an average of the two will give us a more stable and accurate measure. While K cannot be measured, if we define K* as the average of the two measures we have – K* = (C+H)/2 – then K* will provide a better approximation to the underlying true K than either of the two measures C and H. Using this idea as the basis, we construct a table of values for K* and also plot the errors – the deviations from K* – of the two measures below.

The X-axis is our estimated value K* of K. On the Y axis we plot K* and compare it with the two measures H and C. As we can see, both measures are closely matched to K*, so the data conforms to our intuitions regarding the two measures H and C, and the underlying unobserved K. There are few things we can learn from this data analysis. First we present the “errors” in the measures C and Y and measures of K*. Note that these are NOT the TRUE errors, because the true value of K is unknown. Instead, we analyse ErrC* = C-K* and ErrH* = H=K*

K*ErrHErrC
14.70.9-0.9
14.751.55-1.55
15.150.25-0.25
16.15-1.251.25
16.45-0.150.15
18.05-0.850.85
18.2-1.81.8
18.30.5-0.5
18.45-0.450.45
19.05-0.950.95
19.05-0.950.95
21.40.8-0.8
21.61.6-1.6
24.11.1-1.1
24.32.5-2.5
25.6-0.80.8
26.750.45-0.45
28.552.65-2.65
29.6-3.63.6
29.851.45-1.45

The three errors highlighted in red are the largest errors. All other errors are within 2 units of K*. From this analysis, we could conclude that the two measures are aligned, and both come within 2 units of the true K about 85% of the time. Occasionally, larger errors like 2.5 or even 4 can occur. One could go further by examining the particular cases of large errors to try to identify the source of the error. A real analysis always goes beyond the data, to try to understand the real world factors which generate the observations. Next, we turn to a regression analysis of this data set.

External Regression Analysis

A conventional course in regression analysis does the following:

  1. Learn assumptions of regression
  2. Use to create mathematical & statistical analysis
  3. Learn how to estimate regression models, & properties of estimators & test statistics
  4. Use this theory to interpret results of regression

Our point of view here is that the assumptions are nearly always false. As a result, the results are nearly always useless. So there is no point in learning all of this machinery, which take a lot of time and effort. Instead, we will teach regression from an external perspective. Running a regression involves doing the following tasks:

  1. Choose Dependent Variable Y
  2. Explanatory variables X
  3. Goal explain Y using X.
  4. Feed data to computer – we don’t investigate what happens inside the computer.
  5. Get regression output
  6. LEARN how to INTERPRET output.

A huge amount of time can be saved by learning how to drive the car, without learning the details of how the engine works. Accordingly, we proceed by running a regression analysis of the data on C and H.

We immediately run into a problem. Which of the two should be a dependent variable, and which should be the explanatory variable? Actually, the causal structure shows that both are dependent, while the independent variable is K. However, regression analysis does not allow the use of unobservables – we have no data on K. The nominalist philosophy says that we should use variables which are unobservable. Accordingly, there are only two possibilities, we can run a regression of H on C or of C on H. Both are wrong because both embody the wrong causal hypotheses, and end up giving us wrong and misleading results. We will examine the regression of H on C in greater detail. There are vast numbers of programs which we can use to run regressions; they all give similar results. Below, I present the results obtained from EXCEL:

SUMMARY OUTPUT
B on CHeelstick on Catheter
Regression Statistics
Multiple R0.832453
R Square0.692978
Adjusted R Square0.675921
Standard Error2.904264
Observations20
ANOVA
DfSSMSFSignificance F
Regression1342.684342.68440.627655.29E-06
Residual18151.82558.434749
Total19494.5095
CoefficientsStandard Errort StatP-valueLower 95%Upper 95%
Intercept4.2101122.6909181.5645630.135096-1.44339.863522
Catheter0.7869920.1234696.3739825.29E-060.5275931.046392

This output is interpreted as follows.

Correlation & R-squared: Multiple R  = 0.832453 ,R Square =  0.692978

In the nominalist theory, the most important aspect of regression is how well the model fits the data (not how well the model approximates truth). For this purpose, the correlation and its square are the most important measures in use. These measures the association between the dependent variable Y and the regressors, under stringent assumptions about the data. This measure can give HIGHLY misleading results if assumptions are violated. Two variables which have no relation to each can have very high correlation. Also, two variables which are strongly related can have zero correlation. In this particular case, the two measures H and C are linearly related, so the correlation is a suitable measure of their association. The 83% correlation is high, and shows a fairly strong relationship between H and C. The R-square is SQUARE of correlation: 69.2% = 83.2% x 83.2%. This has the following standard Interpretation: 69.2% of the VARIATION in H variable can be explained by the C variable. This concept is meaningful ONLY under very strong assumptions rarely satisfied in real world applications. This comes from ANOVA analysis due to Fisher pursuing racist goals. Fisher wanted to explain variations in children use parents genes as causal factors, and to separate the contribution of genetics from the environment. However, this methodology does not actually succeed in achieving this goal.

Central Object of Regression is to FIT A LINE:

The most important goal of regression is to approximate the dependent variable as a linear function of the independent variables. The regression output estimates that H is a linear function of the C:

H = 4.21 + 0.79 C + err

A graph of this relationship is given below. The orange is the regression line, while the blue is the actual data.

Visually, we can see that the data fits reasonably well to the line. This reflects the high correlation of 83%. However, this apparent linear pattern in data is not a good reflection of the TRUE relationship between these two variables H and C. The regression estimates don’t make sense. They have been derived under assumption that C is fixed, and H is caused by C. But this is not true.

A standard interpretation of this regression relationship would be that if we change C by one unit, then H would change by 0.79 units. This is NOT valid, because C is dependent on K and an Error ErrC. If we change ErrC – that is, the error of measurement – this will cause a change in C but not in K. Since K is the cause of H, changes in the error of measurement of C will not affect H. So the interpretation of 0.79 as the effect of changes of C on H is not correct. Similarly, none of the other statistics generated by regression make any sense, because the assumptions of the regression model are not satisfied.

The F-statistic for the Regression:

F              40.62765                              Significance F     5.29E-06

We will end this lesson with an interpretation of the overall F-statistic for the regression as a whole. This is similar to the R-squared in being a measure of the overall goodness of fit between the data and the regression line. First we recall the meaning of the p-value, which is also called the significance level.

Significance Level: Suppose our null hypothesis is that a coin is fair. We flip it 100 times and observe 70 heads. The p-value measures how much this observation is in conformity with the null hypothesis. The p-value is defined to the probability of the observed event, together with ALL equal or more extreme events. Here more extreme means having lower probability. If X is the number of heads in 100 with 50% success probability on each trial, then the p-value of the observation 70 is P(X>=70)+P(X<=30) which is 2*3.92507E-05, an extremely low probability. This means that the event is highly unlikely under the null hypothesis. Because there is a high degree of conflict between the observed event and the null hypothesis that the coin is fair. So, we can reject the null hypothesis.

Now, we go back to the interpretation of the F-statistic. This is a test for the null hypothesis that all coefficients in the regression are 0. That is a=0 and b=0, so that there is no connection between the two measures H and C. The statistic of 40 with significance of 5.3E-06 means that this is highly unlikely. So, we can reject the null hypothesis of no connection between H and C. While the inference is correct (that is, there is a strong relationship between H an C), the reasoning which leads to this conclusion is wrong. Also, the p-value is meaningless, since the assumptions on which it is based are not valid.

ENDNOTES: Writeup of this lecture in WORD:RSIA10B Lines.docx

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.