Beware of relying on future data in machine learning

Machine learning almost always involves analyzing historical data in a way that allows us to predict the future. However, it is easy to build a model that erroneously relies on future data. Sometimes we catch this mistake early in model development; other times it goes unnoticed until the model is complete. I will present two cases where we should be especially careful not to use future data.

Stock performance based on company statements

Companies listed on a stock exchange publish various financial statements, such as income statements, cash flow statements, and balance sheets. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?
Let’s assume we have a database of historical financial statements. Each statement has start and end dates that define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database we have annual statements for the period 2019-01-01 – 2019-12-31.
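To make the setup concrete, here is a minimal sketch of how such a training sample could be filtered. It rests on an assumption: annual statements are typically published weeks or months after the period they cover ends, so a period end date before the prediction date does not mean the statement was available on that date. The `published` field, the `Statement` class, the `available_on` helper, the company name, and the publication dates below are all hypothetical; they are not part of the database described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    company: str
    period_start: date
    period_end: date
    published: date  # assumed field: the day the statement became public


def available_on(statements, as_of):
    """Return only the statements that were already public on `as_of`."""
    return [s for s in statements if s.published <= as_of]


# Illustrative records; the publication dates are made up for this example.
statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31), date(2019, 3, 15)),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31), date(2020, 3, 13)),
]

as_of = date(2020, 1, 2)

# Naive filter: the period has ended, so the statement looks usable.
naive = [s for s in statements if s.period_end <= as_of]

# Point-in-time filter: keep only what had been published by `as_of`.
safe = available_on(statements, as_of)

print(len(naive), len(safe))
```

With these made-up dates, the naive filter keeps the 2019 annual statement even though it would not have been public on January 2, 2020, while the point-in-time filter keeps only the 2018 statement.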