How to Avoid Eye Tricks in Statistics?


Statistics often turns up surprising coincidences, such as the correlation between US government spending on science and the number of suicides by hanging, or between the number of Nicolas Cage movies and the number of people who drown in pools. There is nothing mystical about this. Anna Bykhovskaya, Assistant Professor of Economics at Duke University and an NES graduate, explained in a new episode of the "Economics out Loud" podcast where these senseless correlations come from and how to avoid them. She also talked about what the models designed by economists can do and when they become useless. GURU shares a summary of the episode.


What is a time series?

A time series is data arranged in time order: annual or quarterly GDP; prices in a store today, yesterday and the day before yesterday; data on stock trading or returns; interest rates; exchange rates; weather changes; the results of a sports team; data on patients with, say, COVID; etc. We can study a single time series or the connections between several of them.

In isolation, these data are very clear, but how can we use them to identify patterns, draw conclusions and make predictions? There are various methods, including recommendation systems and scoring algorithms. And there is a general principle: the less data is available, the simpler the model should be. Otherwise the model will learn to reproduce the available data perfectly but will not be able to say anything at all about new observations.
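
The principle can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the ten data points are synthetic (a line y = 2x plus small hand-picked errors), the "simple" model is a straight line fitted by least squares, and the "complex" model is a polynomial forced through every observed point.

```python
# Synthetic illustration: the "true" relationship is y = 2x, observed with
# small hand-picked measurement errors (assumed values, not real data).
xs = [float(i) for i in range(10)]
errors = [0.5, -0.3, 0.8, -0.6, 0.1, -0.9, 0.4, 0.7, -0.2, -0.5]
ys = [2.0 * x + e for x, e in zip(xs, errors)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (the simple model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree n-1 passing through every point (the complex model)."""
    def predict(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            # Lagrange basis weight of observation k at the point x.
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != k:
                    weight *= (x - xj) / (xk - xj)
            total += weight * yk
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


simple = fit_line(xs, ys)
complex_model = fit_interpolating_polynomial(xs, ys)

# New observations just beyond the sample, taken from the true relationship.
new_xs = [10.0, 11.0]
new_ys = [2.0 * x for x in new_xs]

print("in-sample MSE:     simple", mse(simple, xs, ys),
      "complex", mse(complex_model, xs, ys))
print("out-of-sample MSE: simple", mse(simple, new_xs, new_ys),
      "complex", mse(complex_model, new_xs, new_ys))
```

The complex model reproduces the ten observed points without error, yet it misses the two new points by a wide margin, while the straight line stays close to them.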

Talking about models and their power brings us to a philosophical question: what can we actually predict? This question brings to mind scientific determinism and the so-called Laplace's demon. The French mathematician Pierre-Simon Laplace (1749-1827) proposed a thought experiment that later became known as Laplace's demon. Imagine a demon who can see the movement of every atom in the universe and can therefore predict exactly what will happen in the future and tell what happened in the past. Laplace considered such a mind in relation to physics. I think that in economics the situation is more complicated, because the behavior of people is much more random than the movement of particles. We deal with approximations.

Let's say our time series is GDP values for the last 30 years. Quarterly data gives four dots on the graph per year. To create a model, we assume that these dots are the result of a random process: every period we flip a coin and, depending on whether it lands heads or tails, the economy grows or falls with a 50% probability. Now let's make the task more complicated: let each dot on the graph also depend on the past, on what has happened to the economy over the past 10-20 years. We apply a model to this time series to identify patterns and build a forecast. And here the question comes up: which models are good, which are bad, and is there a true model at all? As the British statistician George Box said, all models are wrong, but some are useful. We will not dare to claim that there is a model that accurately describes how the world works. Models are rather a way of trying to say something about the future. Therefore, the purpose of studying time series is to find the models that are useful.

In order to build a model, we have to deal with four issues:

 - identify a trend – for example, determine whether the indicator grows linearly or exponentially;

 - take into account seasonal factors. For example, every summer airline tickets become more expensive on average; construction slows down in winter; the demand for flowers and sweets increases around every February 14;

 - identify business cycles in the economy – for example, the growth or decline of GDP. Unlike seasonality, these cycles do not have a fixed duration, and it cannot be stated for sure that if the economy grows for three years, it will also shrink for three years;

 - determine persistence, that is, how long the effect of an event lasts. Let's say the interest rate jumped today: it will greatly affect the financial markets tomorrow. But will this effect still be felt after that, in a year or in 10 years? This is a very important question.
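
The last question in the list, how long the effect of an event lasts, can be made concrete with a small sketch. It assumes a first-order autoregressive model, y_t = phi * y_{t-1} + shock_t, which is one common textbook choice and not necessarily the model discussed in the podcast; the values of phi are picked purely for illustration.

```python
def impulse_response(phi, horizon):
    """Effect of a one-unit shock at time 0 in y_t = phi * y_{t-1} + shock_t.

    Returns the remaining effect after 0, 1, ..., horizon periods,
    assuming no further shocks arrive.
    """
    path = [1.0]
    for _ in range(horizon):
        path.append(phi * path[-1])
    return path


# phi = 0.5: half of the effect disappears every period.
short_memory = impulse_response(0.5, 10)
# phi = 1.0: the effect never fades (a random walk).
long_memory = impulse_response(1.0, 10)

print("phi = 0.5, effect after 10 periods:", short_memory[10])
print("phi = 1.0, effect after 10 periods:", long_memory[10])
```

With phi = 0.5 less than a thousandth of the shock is left after ten periods; with phi = 1 the shock is still fully present.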


Where do false correlations come from?

There are two reasons for their appearance. The first is simple coincidence that results from searching through data. If we sort through a great deal of data, we will eventually find two similar sequences, for example, in milk consumption and board games (or the number of sociology diplomas issued and space flights – editor). This is illustrated by the infinite monkey theorem: a monkey hitting the keys of a typewriter at random will eventually type the word "Hamlet".
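
A minimal sketch of this first mechanism, on simulated data: it generates 200 short series of pure, mutually independent noise and then searches all pairs for the strongest correlation. The number of series, their length and the random seed are arbitrary assumptions.

```python
import random


def standardize(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    n = len(series)
    mean = sum(series) / n
    sd = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [(v - mean) / sd for v in series]


def strongest_pair(n_series=200, length=10, seed=0):
    """Search all pairs of independent noise series for the largest |correlation|."""
    rng = random.Random(seed)
    data = [standardize([rng.gauss(0, 1) for _ in range(length)])
            for _ in range(n_series)]
    best = (0.0, None, None)
    for i in range(n_series):
        for j in range(i + 1, n_series):
            # For standardized series the correlation is the mean product.
            r = sum(a * b for a, b in zip(data[i], data[j])) / length
            if abs(r) > abs(best[0]):
                best = (r, i, j)
    return best


best_r, first, second = strongest_pair()
print(f"strongest correlation among unrelated series: {best_r:.2f} "
      f"(series {first} and {second})")
```

None of the series has anything to do with any other, yet with almost 20,000 pairs to choose from, the best match looks like a very strong relationship.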

The second reason is related to the nature of time series. They can be divided into stationary and nonstationary. Stationary series are the good ones: the influence of an event fades rather quickly. For example, if I didn't get enough sleep today, I will feel it tomorrow, but in a week the effect will most likely have disappeared. In nonstationary time series, the effect does not subside or subsides very slowly. For example, the choice of a degree specialization and supervisor will affect our lives for a very, very long time. And this creates the illusion of a correlation between two graphs, a spurious correlation. Let's say two friends, one in Sochi and the other in Novosibirsk, bought red houses, and they have a tradition of taking a picture in front of their houses every May 1 and sending each other the photos. In a year, in two years and later, they will most likely still be photographed in front of a red house. The color of a house is a very stable characteristic, so we will see a strong spurious correlation: the colors of the houses in Sochi and Novosibirsk coincided by accident, but they will remain the same for a very long time. If we look at the weather in those photos, it is likely to be very different, because the weather is more random in itself, so it is unlikely that we will get a strong correlation in this case.
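
This second mechanism can also be checked in a simulation. The sketch below compares pairs of independent random walks (a standard example of a nonstationary series, in which every shock stays forever) with pairs of independent white-noise series (stationary). The sample sizes and the seed are arbitrary assumptions.

```python
import random


def correlation(a, b):
    """Pearson correlation of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def random_walk(rng, length):
    """Nonstationary series: each value is the previous one plus a new shock."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def white_noise(rng, length):
    """Stationary series: each value is a fresh shock, the past is forgotten."""
    return [rng.gauss(0, 1) for _ in range(length)]


def mean_abs_correlation(generator, n_pairs=200, length=200, seed=1):
    """Average |correlation| over many pairs of independent series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += abs(correlation(generator(rng, length), generator(rng, length)))
    return total / n_pairs


walk_corr = mean_abs_correlation(random_walk)
noise_corr = mean_abs_correlation(white_noise)
print(f"independent random walks: average |correlation| = {walk_corr:.2f}")
print(f"independent white noise:  average |correlation| = {noise_corr:.2f}")
```

In both cases the two series in a pair are unrelated by construction, but only the random walks regularly produce correlations large enough to look meaningful.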

There is also the reverse situation, when two nonstationary time series really are correlated. Take income and consumption. A person consumes a fixed part of his or her income, say, 80%. Therefore, the income and consumption graphs will be similar. Such series are called cointegrated. So, income and consumption are cointegrated time series: both are nonstationary, both change a lot, and the correlation does not disappear, because they are really related to each other. British economist Clive Granger was the first to notice this when he read a paper by James Davidson and co-authors, in which the scholars tried to predict consumption growth using the difference between income and consumption as a predictor. Granger was very surprised by their conclusions, because the consumption growth rate is a stationary time series with good properties, while consumption and income in levels are nonstationary time series with bad properties. How is this possible, and why do time series with bad properties help predict a time series with good properties? It turned out that the difference between these two time series with bad properties is a time series with good properties. That is, we can take two bad time series, calculate the difference between them and get a good time series. In the example of income and consumption, the correlation is not false.
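
A minimal simulation of this idea, under stated assumptions: income is modeled in logarithms as a random walk with drift, consumption is about 80% of income with short-lived deviations, and "good" versus "bad" properties are judged informally by the lag-1 autocorrelation (close to 1 for a nonstationary series, close to 0 for noise). This is only an illustration and not a formal cointegration test such as the Engle-Granger procedure; all parameter values are invented.

```python
import math
import random


def lag1_autocorrelation(series):
    """Sample correlation between a series and its own previous value."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


rng = random.Random(2)

# Log income: a random walk with drift, so shocks never die out.
log_income = []
level = 0.0
for _ in range(400):
    level += 0.005 + rng.gauss(0, 0.02)
    log_income.append(level)

# Log consumption: about 80% of income, plus short-lived deviations.
log_consumption = [y + math.log(0.8) + rng.gauss(0, 0.02) for y in log_income]

# The difference between the two nonstationary series.
gap = [y - c for y, c in zip(log_income, log_consumption)]

print("lag-1 autocorrelation of log income:     ", lag1_autocorrelation(log_income))
print("lag-1 autocorrelation of log consumption:", lag1_autocorrelation(log_consumption))
print("lag-1 autocorrelation of their difference:", lag1_autocorrelation(gap))
```

Both series in levels are highly persistent, while their difference fluctuates around a constant (here about -log 0.8) and quickly forgets past deviations.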

Clive Granger began working with his co-author Robert Engle on the theory of cointegration, and in 2003 they received the Nobel Prize. Their methods let us check whether there is cointegration between several bad series, but these methods work only when there are not very many time series, as with income and consumption. If there are a lot of time series, for example, many countries or stocks, then the methods of Granger and Engle stop working. This is one of the tasks I am working on: new approaches that will work in a situation where we have big data or many different time series.


Examples of time series use

Time series are, of course, actively used by macroeconomists, including those at central banks, as well as in the real economy when studying big data, and in finance.

For example, they can be used to test the efficient market hypothesis. According to it, a stock's price already reflects all the available information about the stock, stocks are traded at a fair price, and so it is impossible to predict price changes in order to beat the market. The search for cointegration among stock prices is an attempt to earn money: even if individual prices are hard to predict, a good linear combination of them is much easier to predict.

One of my papers is devoted to the evolution of social networks, which are also, in a way, time series. We observe a network of friends or phone calls over time, so we can apply time series analysis methods here as well. Another area where this methodology can be applied is the study of the evolution of international trade.


How can matching be improved?

In my only paper that is not related to econometrics, I tried to extend the Gale–Shapley algorithm, which is used to find optimal pairs, for example when assigning children to schools based on the preferences of both children and schools. The idea is to create an algorithm that eliminates envy. Not envy of the kind "I envy Mary and I want to go to her school, but my grades are lower than Mary's, so her school won't accept me," but justified envy, when my grades are higher and the school still doesn't enroll me. The Gale–Shapley algorithm produces an assignment without justified envy.
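
As a rough illustration of the standard algorithm (not of the extension described in the paper), here is a minimal sketch of student-proposing deferred acceptance in Python. It assumes that every school ranks applicants by grades and has a fixed number of seats; the student names, grades, schools and capacities are hypothetical.

```python
def deferred_acceptance(prefs, grades, capacity):
    """Student-proposing Gale-Shapley with school capacities.

    prefs:    student -> list of schools, most preferred first
    grades:   student -> grade (every school prefers higher grades)
    capacity: school  -> number of seats
    """
    next_choice = {student: 0 for student in prefs}
    admitted = {school: [] for school in capacity}
    free = list(prefs)
    while free:
        student = free.pop()
        if next_choice[student] >= len(prefs[student]):
            continue  # list exhausted: the student stays unassigned
        school = prefs[student][next_choice[student]]
        next_choice[student] += 1
        admitted[school].append(student)
        if len(admitted[school]) > capacity[school]:
            # Over capacity: the school rejects its weakest tentative admit.
            weakest = min(admitted[school], key=lambda s: grades[s])
            admitted[school].remove(weakest)
            free.append(weakest)
    return {s: school for school, group in admitted.items() for s in group}


def justified_envy(assignment, prefs, grades):
    """List (student, other, school) where the student prefers the school
    that admitted another student with a lower grade."""
    cases = []
    for student, ranking in prefs.items():
        current = assignment.get(student)
        for school in ranking:
            if school == current:
                break  # schools further down the list are less preferred
            for other, other_school in assignment.items():
                if other_school == school and grades[other] < grades[student]:
                    cases.append((student, other, school))
    return cases


# Hypothetical example data.
grades = {"Ann": 90, "Bob": 80, "Cat": 70, "Dan": 60}
prefs = {
    "Ann": ["North", "South"],
    "Bob": ["North", "South"],
    "Cat": ["North", "South"],
    "Dan": ["South", "North"],
}
capacity = {"North": 2, "South": 2}

assignment = deferred_acceptance(prefs, grades, capacity)
print(assignment)
print("justified envy:", justified_envy(assignment, prefs, grades))
```

In this example Cat would prefer North but ends up at South. Her envy is not justified, because both students admitted to North have higher grades, so the list of justified-envy cases is empty.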

However, it does not take into account who goes to the same school as you. And it is very important who we study with. It is often said that people go to Ivy League universities, such as Harvard, Princeton and Yale, not to get an education, but for networking (this topic was also discussed in an episode of the "Economics out Loud" podcast about an academic career for economists – editor). My goal was to add preferences about classmates to the Gale–Shapley algorithm and see whether we can still achieve a good distribution of students across schools, so that no one envies anyone. And if someone is envious, the envy is not justified: they would not be enrolled in that school anyway because their grades are too low. This complicates the task, but under certain restrictions it is still possible to modify the algorithm and build a distribution across schools in which people get the best option available to them given their grades, and there is no justified envy.