Machine learning is almost always about analyzing historical data in a way that allows us to predict the future. However, it is easy to build a model that erroneously relies on future data. Sometimes we catch this mistake early in model development; other times, it goes unnoticed until the model is complete. I will present two cases where we should be especially careful not to use future data.
Stock performance based on company statements
Companies listed on a stock exchange publish various financial statements, such as income statements, cash flow statements, and balance sheets. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?
Let’s assume we have a database of historical financial statements. Each statement has a start date and an end date that define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database we have an annual statement for the period 2019-01-01 to 2019-12-31.
Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we do not account for this, our model will learn to rely on the newest statement at the beginning of each year. For example, on January 2, 2021 the statement for the year 2020 has not been published yet, but the model will expect it. That may lead to inaccurate results.
To correct this and properly train a model, we need to look at older data. For instance, we might only use what was in our database two weeks earlier. With January 2, 2020 as our starting point, we would look for data with an end date of December 19, 2019. We should keep asking ourselves the question: what data was available at a given point in time? That way, the same kind of data will be available at prediction time, which is usually today’s date.
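The question "what data was available at a given point in time?" can be turned into a small filter. The sketch below is only an illustration: the Statement class, the latest_available helper, the "ACME" company, and the 90-day reporting lag are all assumptions made for the example, not values taken from a real database. If your data stores an actual publication date, filtering on that date is the safer choice.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Statement:
    company: str
    period_start: date
    period_end: date


# Assumption: a fixed reporting lag. Real filings arrive at varying times,
# so a stored publication date is preferable when the database has one.
REPORTING_LAG = timedelta(days=90)


def latest_available(statements: List[Statement], as_of: date,
                     lag: timedelta = REPORTING_LAG) -> Optional[Statement]:
    """Return the most recent statement that could have been known on `as_of`."""
    known = [s for s in statements if s.period_end + lag <= as_of]
    return max(known, key=lambda s: s.period_end, default=None)


# Hypothetical company with two annual statements.
statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31)),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31)),
]

picked = latest_available(statements, as_of=date(2020, 1, 2))
print(picked.period_end)  # 2018-12-31: the 2019 statement is not published yet
```

Building every training sample through a function like this keeps training and prediction consistent, because both see only statements that had already been published.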
Backtesting portfolio performance
Markowitz portfolio theory is a well-known portfolio construction method. It adjusts the weights of given stocks within a portfolio to optimize its performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:
We want to invest in three companies:
● Mr. Cooper Group Inc (COOP)
● Educational Development Corporation (EDUC)
● Newmont Corporation (NEM)
To decide how to divide our capital, we will use data from the past three years. Using Markowitz’s theory, we get the following portfolio:
COOP: 35%, EDUC: 45%, NEM: 20%
Then we can easily check how such a portfolio would have performed had we started investing three years ago. It would have performed quite well, but only because the weights were optimized on that same period; in other words, we relied on future data.
Three years ago, the data we used to compute that portfolio wasn’t known. For the next three years other proportions will be optimal, but we don’t know them yet. The danger of unknowingly using future data is that the backtest looks better than what we could have achieved in practice, which makes us overly optimistic about future investments.
To backtest properly, we have to think in terms of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use the Markowitz portfolio based on the past three months, rebalance each month”. This way we make sure to always use only data available at the given point in time, and our backtesting is more reliable.
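A strategy of this kind can be backtested with a walk-forward loop: at each month, the weights are computed only from the months before it. The sketch below is illustrative and makes several assumptions. The returns are synthetic random numbers, not real market data; the function names are hypothetical; and, to keep the example dependency-free, simple inverse-variance weights stand in for a real Markowitz optimizer.

```python
import random
from statistics import pvariance
from typing import Callable, Dict, List

Returns = Dict[str, List[float]]  # ticker -> list of monthly returns


def inverse_variance_weights(history: Returns) -> Dict[str, float]:
    """Stand-in for a Markowitz optimizer: weight each stock by 1 / variance."""
    inv = {t: 1.0 / max(pvariance(r), 1e-12) for t, r in history.items()}
    total = sum(inv.values())
    return {t: v / total for t, v in inv.items()}


def walk_forward(returns: Returns, lookback: int,
                 weigh: Callable[[Returns], Dict[str, float]]) -> List[float]:
    """Backtest a strategy: at month t, weights come only from months before t."""
    n_months = len(next(iter(returns.values())))
    portfolio_returns = []
    for t in range(lookback, n_months):
        # Slice ends at t (exclusive), so month t itself is never seen.
        history = {tk: r[t - lookback:t] for tk, r in returns.items()}
        weights = weigh(history)
        portfolio_returns.append(
            sum(weights[tk] * returns[tk][t] for tk in returns))
    return portfolio_returns


# Synthetic monthly returns, purely illustrative (not real market data).
rng = random.Random(0)
returns = {tk: [rng.gauss(0.01, 0.05) for _ in range(36)]
           for tk in ("COOP", "EDUC", "NEM")}

result = walk_forward(returns, lookback=3, weigh=inverse_variance_weights)
print(len(result))  # 33 out-of-sample monthly returns
```

Because the weighting function is passed in as a parameter, a proper mean-variance optimizer could be swapped in without touching the loop that guards against look-ahead.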
Always think about what data was available at a given point in time
It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when time is involved in the data being used. One has to be really careful not to use future data. As this can be quite tricky, it is better to ask yourself one more time: am I using future data?