One of the key reasons for the dead end we face in econometric analysis (and also in statistics) is the idea that the analysis of numbers can be separated from the real-world meaning and context of those numbers. Positivist ideas have been absorbed by the public without conscious realization. When people say, “Just give me the facts, I don’t want your opinions”, they think they are stating a commonplace and trite truth. They do not realize that this sentence is the conclusion of a complex philosophical argument about the dual nature of knowledge, an argument which is fundamentally unsound. The “facts” (the numbers) and the “opinions” (the guessed-at causal phenomena which generated the numbers) cannot be neatly separated, and analysis of the facts REQUIRES guesses at causal structures. The previous set of posts on Simpson’s Paradox (1, 2, & 3) illustrated the importance of learning about causal structures in the context of studying discrimination against women in admissions at Berkeley. Next, we will take the SAME set of numbers, the same data, and pretend that it comes from batting averages. We will see that batting averages which generate a Simpson’s Paradox lead to different considerations regarding causal structures. The hidden and unobservable real-world causal structures which generate the observable data cannot be ignored in statistical analysis, even though it is customary to do so in standard textbooks of econometrics and statistics.
Simpson’s Paradox in Baseball Scores.
One of the central assumptions of orthodox statistical methodology is that we can analyze numbers without knowing their origins. The mean, median, and mode can be calculated for any list of numbers, BUT the meaning of these measures depends strongly on the real-world objects which the numbers measure. The orthodox model has a statistical consultant who works with a field expert. The field expert knows the causal relationships, but the statistician looks only at the numbers, with minimal knowledge of what they measure. In fact, we will show that statistical analysis requires real-world knowledge and cannot be separated from the field analysis. To illustrate this principle, we consider the same numbers used for Berkeley admissions, but give them another interpretation, in the context of batting averages of baseball players.
Consider Tom and Frank, two batters who have batting averages of 56% and 44% respectively. On the basis of these numbers, it seems clear that Tom is the better batter. At a critical moment, when the team needs a hit, the coach should send out Tom to bat, as Tom will have a higher probability of getting a hit. However, an analyst looks at the hit record more deeply, dividing the batting average according to type of pitcher: Left-Handed or Right-Handed. This leads to the following numbers: Tom averages 60% against Left-Handed Pitchers and 20% against Right-Handed Pitchers, while Frank averages 80% against Left-Handed Pitchers and 40% against Right-Handed Pitchers.
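The breakdown can be laid out in a few lines of Python. This is only a sketch: the per-pitcher averages are the ones quoted in this post, while the share of left-handed pitchers each batter faces (90% for Tom, 10% for Frank) is borrowed from the scenario described later in the post and is used here as an assumption, because it is the mix that reproduces the overall 56% and 44%. The names `AVERAGES`, `LEFT_SHARE`, and `overall_average` are made up for illustration.

```python
# Batting averages by pitcher handedness, as quoted in the text.
AVERAGES = {
    "Tom": {"left": 0.60, "right": 0.20},
    "Frank": {"left": 0.80, "right": 0.40},
}

# ASSUMPTION: share of left-handed pitchers each batter faced.
# Taken from the 90/10 scenario described later in the post.
LEFT_SHARE = {"Tom": 0.90, "Frank": 0.10}


def overall_average(batter):
    """Overall average as a mix-weighted average of the two sub-averages."""
    p_left = LEFT_SHARE[batter]
    avg = AVERAGES[batter]
    return p_left * avg["left"] + (1 - p_left) * avg["right"]


for batter in AVERAGES:
    avg = AVERAGES[batter]
    print(
        f"{batter}: vs left {avg['left']:.0%}, "
        f"vs right {avg['right']:.0%}, "
        f"overall {overall_average(batter):.0%}"
    )
```

Under these assumed mixes, Tom's overall figure works out to 0.9 × 60% + 0.1 × 20% = 56%, and Frank's to 0.1 × 80% + 0.9 × 40% = 44%, even though Frank has the higher average in each row.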
While Tom has the better overall performance, with a batting average of 56% compared to Frank’s 44%, a different picture emerges when we break it down by the pitcher’s handedness. Frank is better against Left-Handed Pitchers, averaging 80% hits in comparison to Tom’s 60%. Similarly, Frank is better against Right-Handed Pitchers, averaging 40% against Tom’s average of only 20%. Again, we have a Simpson’s Paradox. Frank is better than Tom against Left-Handed Pitchers, and also against Right-Handed Pitchers. But in general, against all pitchers, Tom is better than Frank. How can that be? This is a classic case of “confounding”. We can illustrate confounding by a causal diagram.
The batting average depends on the batter’s performance. It also depends on the mix of left- and right-handed pitchers faced by the batter. Frank’s average can vary from 80% to 40% depending on the proportion of left- and right-handed pitchers that he faces. Similarly, Tom’s average can vary between 60% and 20% depending on the mix of left- and right-handed pitchers that he faces. The batting average therefore depends on two different factors: one is the batter’s performance and the second is the mix of pitchers he faces.
To evaluate batter performance, we must eliminate the confounder. One way to do so is to condition on the confounder, that is, to hold it constant. This means that we should condition on Left-Handed Pitchers, and separately on Right-Handed Pitchers. Doing so leads to the clear conclusion that Frank is the better batter. If both players face the SAME proportion of left- and right-handed pitchers, then Frank will definitely do better than Tom. As long as the MIX of Left- and Right-Handed Pitchers is EXOGENOUS (that is, determined without reference to the variables under study), the coach should send out Frank.
HOWEVER, it is also possible that the MIX is an ENDOGENOUS variable. This can happen in the following way. Frank is an exceptional batter, with an amazing record of 80% hits against left-handed pitchers. This average declines to only 40% against right-handed pitchers. The coach of the opposing team is aware of this weakness and switches to right-handed pitchers when Frank comes to bat. Thus, the normal mix of pitchers is 90% Left-Handed and 10% Right-Handed, which is what Tom will see. But when Frank is sent to bat, the opposing coach will change pitchers to heavily favor right-handers, so that Frank will see a mix of 90% Right-Handed Pitchers and only 10% Left-Handed Pitchers. The causal diagram is now different: the choice of batter influences the pitcher mix, and both the batter and the mix influence the batting average.
With this causal sequencing, the Left/Right Mix is NO LONGER a confounder, because it is no longer an EXOGENOUS variable. The MIX is INFLUENCED by the Batter. When we want to compute the batting average, we have to take into account the direct effect of batting ability, and also the indirect effect created by the fact that the choice of batter influences the opposing coach’s choice of the MIX between left- and right-handed pitchers. Taking both into account, we see that the coach should now prefer to send Tom to bat, even though Frank is the better batter. This is because the opposing coach will respond to the choice of Frank by changing the pitchers, and with the changed pitcher mix, Frank will actually perform worse than Tom. One important lesson from this analysis is that knowing whether a variable is endogenous or exogenous depends on knowledge of real-world structures not directly observable in the numbers. We need to know whether or not the opposing coach looks at the batter to decide on the mix of pitchers he will use. Against ANY FIXED MIX of left- and right-handed pitchers, Frank will do better than Tom. However, if the mix is CHANGED and CHOSEN to face Frank with right-handed pitchers, against whom he is weak, then Frank will perform worse than Tom.
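The contrast between the two causal structures can be checked numerically. The sketch below is illustrative only: `expected_average` and `opposing_coach_left_share` are hypothetical helper names, and the coach's response rule (10% left-handers against Frank, 90% otherwise) simply encodes the scenario described above as an assumption.

```python
# Batting averages by pitcher handedness, as quoted in the text.
AVERAGES = {
    "Tom": {"left": 0.60, "right": 0.20},
    "Frank": {"left": 0.80, "right": 0.40},
}


def expected_average(batter, left_share):
    """Expected batting average when a fraction left_share of pitchers is left-handed."""
    avg = AVERAGES[batter]
    return left_share * avg["left"] + (1 - left_share) * avg["right"]


def opposing_coach_left_share(batter):
    """ASSUMED response rule: the opposing coach sees who bats, then picks the mix."""
    return 0.10 if batter == "Frank" else 0.90


# Exogenous mix: both batters face the same fixed share of left-handers.
fixed_mix_results = []
for left_share in (0.0, 0.25, 0.5, 0.75, 1.0):
    tom = expected_average("Tom", left_share)
    frank = expected_average("Frank", left_share)
    fixed_mix_results.append((left_share, tom, frank))
    print(f"fixed mix, {left_share:.0%} left: Tom {tom:.0%}, Frank {frank:.0%}")

# Endogenous mix: the mix depends on which batter is sent out.
endogenous = {
    batter: expected_average(batter, opposing_coach_left_share(batter))
    for batter in AVERAGES
}
best_choice = max(endogenous, key=endogenous.get)
for batter, value in endogenous.items():
    print(f"endogenous mix: {batter} {value:.0%}")
print("coach should send:", best_choice)
```

With any fixed mix, Frank's expected average stays 20 percentage points above Tom's. Once the mix responds to the batter as assumed here, the comparison reverses and Tom becomes the better choice. Which of the two calculations is the right one cannot be read off the numbers; it depends on whether the opposing coach actually behaves this way.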