Machine learning is almost always connected with analyzing historical data in a way that will allow us to predict the future. However, it is easy to create a model by erroneously relying on future data. Sometimes we might catch this mistake towards the beginning of model development; other times, it will go unnoticed until the model is complete. I will present two cases where we should be really careful not to use future data.
Stock performance based on company statements
Companies listed on the stock exchange present various financial statements like income statements, cash flow statements, balance sheet statements, etc. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?
Let’s assume we have a database with historical financial statements. Each statement has start and end dates that define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database, we have annual statements for the period 2019-01-01 to 2019-12-31.
Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we do not account for this, our model will learn to rely on the newest statement at the very beginning of the year. For example, on January 2, 2021, the statement for the year 2020 has not been published yet, but the model will expect it. That may lead to inaccurate results once the model is used for real predictions.
To correct this and properly train a model, we need to restrict ourselves to older data. For instance, we might only use what was in our database two weeks earlier. With January 2, 2020 as our starting point, we would look in our database for data with an end date of December 19, 2019 or earlier. We should keep asking ourselves the question: what data was available at a given point in time? That way, the same kind of data will also be available at prediction time, which is usually today’s date.
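The selection rule above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the `Statement` class, the `latest_available_statement` function, the company name "ACME", and the fixed 14-day `reporting_lag` are all assumptions made for the example. If your database stores the actual publication date of each statement, filtering on that date is the safer choice.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Statement:
    # Hypothetical record layout: one row per financial statement.
    company: str
    period_start: date
    period_end: date


def latest_available_statement(statements, as_of, reporting_lag=timedelta(days=14)):
    """Return the newest statement whose period ended at least
    `reporting_lag` before `as_of`, or None if there is none.

    The lag is an assumed stand-in for the time a company needs
    to prepare and publish a statement.
    """
    cutoff = as_of - reporting_lag
    usable = [s for s in statements if s.period_end <= cutoff]
    return max(usable, key=lambda s: s.period_end, default=None)


statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31)),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31)),
]

# On January 2, 2020 the 2019 statement is treated as not yet available,
# so the training sample falls back to the 2018 statement.
picked = latest_available_statement(statements, as_of=date(2020, 1, 2))
print(picked.period_end)  # 2018-12-31
```

The same function can be used both when building training samples and at prediction time, which keeps the two consistent.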
Backtesting portfolio performance
Markowitz Portfolio Theory is a well-known portfolio construction method. It adjusts the weights of given stocks within a portfolio to optimize performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:
We want to invest in three companies:
- Mr. Cooper Group Inc (COOP)
- Educational Development Corporation (EDUC)
- Newmont Corporation (NEM)
To decide how to divide our money among different stocks, we will use information from the past three years and apply Markowitz’s Theory. Based on this method, we have determined that we should allocate 35% to COOP, 45% to EDUC, and 20% to NEM.
We can then test how this portfolio would have performed if we had started investing three years ago. However, such a backtest relies on future data: the weights were derived from the same three years that we are now testing on, and three years ago that information did not exist yet. It is no surprise that Markowitz’s Theory finds a portfolio that would have performed well in the past, but this tells us little about how it will perform in the future.
To backtest properly, we have to think of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use a Markowitz portfolio based on the past three months, rebalance each month”. This way we always use only data that was available at a given point in time, and our backtesting is more reliable.
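A strategy-based backtest of this kind might look roughly like the sketch below. Several things here are assumptions for illustration only: the function names, the asset labels "A", "B" and "C", and the made-up monthly returns (they are not real data for COOP, EDUC or NEM). To keep the example free of dependencies, the weighting step uses simple inverse-variance weights as a stand-in for a real Markowitz optimizer; any function that maps past returns to weights could be plugged in instead.

```python
from statistics import pvariance


def inverse_variance_weights(history):
    """Stand-in for a Markowitz optimizer (an assumption of this sketch):
    weight each asset by 1 / variance of its past returns.
    Requires every asset to have non-constant returns in the window.
    """
    inverse = {name: 1.0 / pvariance(rets) for name, rets in history.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def walk_forward_backtest(returns, lookback, weight_fn):
    """Rebalance every period, computing weights only from the
    `lookback` periods that precede it."""
    n_periods = len(next(iter(returns.values())))
    portfolio_returns = []
    for t in range(lookback, n_periods):
        # Slice ends at t (exclusive): period t itself is never seen here.
        history = {name: rets[t - lookback:t] for name, rets in returns.items()}
        weights = weight_fn(history)
        realized = sum(weights[name] * returns[name][t] for name in returns)
        portfolio_returns.append(realized)
    return portfolio_returns


# Made-up monthly returns, purely illustrative.
returns = {
    "A": [0.01, 0.03, -0.02, 0.02, 0.01, -0.01],
    "B": [0.02, -0.01, 0.04, -0.03, 0.05, 0.00],
    "C": [0.00, 0.01, 0.01, 0.02, -0.01, 0.01],
}

backtest = walk_forward_backtest(returns, lookback=3, weight_fn=inverse_variance_weights)
print(backtest)  # one portfolio return per month after the first three
```

The important design choice is the slice `rets[t - lookback:t]`: the weights applied in month `t` never depend on the return of month `t` or any later month.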
Always think about which data was available at a given time point
It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when time is involved in the data being used. One has to be really careful not to use future data. Because this can be quite tricky, it is worth asking yourself again and again: am I using future data?