Beware of Relying on Future Data in Machine Learning

4 February 2021

Machine learning is almost always connected with analyzing historical data in a way that will allow us to predict the future. However, it is easy to create a model by erroneously relying on future data. Sometimes we might catch this mistake towards the beginning of model development; other times, it will go unnoticed until the model is complete. I will present two cases where we should be really careful not to use future data.

Stock performance based on company statements

Companies listed on the stock exchange present various financial statements like income statements, cash flow statements, balance sheet statements, etc. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?

Let’s assume we have a database with historical financial statements. Each statement has a start date and an end date that define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database, we have annual statements for the period 2019-01-01 to 2019-12-31.

Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we do not account for this, our model will learn to rely on the newest statement at the beginning of each year. For example, on January 2, 2021, the statement for the year 2020 has not been published yet, but the model will expect it. That may lead to inaccurate results.

To correct this and properly train a model, we need to look at earlier data. For instance, we might assume that a statement becomes available two weeks after its period ends. With January 2, 2020 as our starting point, we would look in our database for statements with an end date of December 19, 2019 or earlier. We should keep asking ourselves the question: what data was available at a given point in time? Then the same kind of data will be available at prediction time, which is usually today’s date.
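The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `Statement` class, the `latest_available_statement` helper, the company name "ACME", and the fixed two-week `REPORTING_LAG` are all assumptions made for the example. If your database stores the actual publication date of each statement, filtering on that date is likely a better choice than a fixed lag.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Statement:
    # Hypothetical record layout: one row per financial statement.
    company: str
    period_start: date
    period_end: date


# Assumed reporting lag, taken from the two-week example in the text.
# Real lags differ between companies, so treat this as a tunable parameter.
REPORTING_LAG = timedelta(days=14)


def latest_available_statement(statements, company, as_of,
                               lag=REPORTING_LAG) -> Optional[Statement]:
    """Return the newest statement whose period ended at least `lag` before `as_of`."""
    cutoff = as_of - lag
    candidates = [
        s for s in statements
        if s.company == company and s.period_end <= cutoff
    ]
    # `default=None` covers the case where nothing was available yet.
    return max(candidates, key=lambda s: s.period_end, default=None)


# Made-up example data for a hypothetical company.
statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31)),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31)),
]

picked = latest_available_statement(statements, "ACME", as_of=date(2020, 1, 2))
print(picked.period_end)  # 2018-12-31
```

On January 2, 2020 the cutoff is December 19, 2019, so the 2019 annual statement is skipped and the 2018 statement is used instead. The same function can be called both when building training samples and at prediction time, which keeps the two consistent.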

Backtesting portfolio performance

The Markowitz Portfolio Theory is a well-known portfolio construction method. It allows us to adjust the weights of given stocks within a portfolio to optimize performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:

We want to invest in three companies:

  • Mr. Cooper Group Inc (COOP)
  • Educational Development Corporation (EDUC)
  • Newmont Corporation (NEM)

To decide how to divide our money among different stocks, we will use information from the past three years and apply Markowitz’s Theory. Based on this method, we have determined that we should allocate 35% to COOP, 45% to EDUC, and 20% to NEM.

We can then test how this portfolio would have performed if we had started investing three years ago. However, the weights were derived from those same three years, so such a backtest relies on data that was not available when the investment would have started. From the point of view of that starting date, it is future data. Markowitz’s Theory tells us that this portfolio would have performed well in the past, but we must be cautious about assuming that it will perform well in the future.

To backtest properly, we have to think of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use a Markowitz portfolio based on the past three months and rebalance each month”. This way we make sure to always use only the data available at a given point in time, and our backtesting is more reliable.
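A strategy of this kind can be backtested with a simple walk-forward loop. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the returns are made-up numbers for placeholder tickers, and `inverse_variance_weights` is a deliberately simple stand-in for a real Markowitz optimizer so that the example runs with the standard library alone. The part that matters is the slicing: weights for period `t` are computed only from periods before `t`.

```python
from statistics import pvariance


def inverse_variance_weights(history):
    """Stand-in for a Markowitz optimizer: weight each asset by 1 / variance.

    `history` maps ticker -> list of past periodic returns.
    Raises ZeroDivisionError if an asset's returns are constant in the window.
    """
    inverse = {ticker: 1.0 / pvariance(r) for ticker, r in history.items()}
    total = sum(inverse.values())
    return {ticker: value / total for ticker, value in inverse.items()}


def walk_forward_backtest(returns, lookback, weight_fn=inverse_variance_weights):
    """Rebalance every period using only the `lookback` periods before it.

    `returns` maps ticker -> list of periodic returns, all of equal length.
    Returns the portfolio return for each period from `lookback` onwards.
    """
    n_periods = len(next(iter(returns.values())))
    portfolio_returns = []
    for t in range(lookback, n_periods):
        # The slice ends at t (exclusive), so period t itself is never
        # visible when the weights for period t are chosen.
        history = {ticker: r[t - lookback:t] for ticker, r in returns.items()}
        weights = weight_fn(history)
        portfolio_returns.append(
            sum(weights[ticker] * returns[ticker][t] for ticker in returns)
        )
    return portfolio_returns


# Made-up monthly returns for three placeholder tickers (not real market data).
returns = {
    "AAA": [0.02, -0.01, 0.03, 0.01, -0.02, 0.04],
    "BBB": [0.05, -0.04, 0.06, -0.03, 0.02, 0.01],
    "CCC": [0.01, 0.00, 0.02, 0.01, 0.00, 0.01],
}

result = walk_forward_backtest(returns, lookback=3)
print(result)
```

Passing the weighting rule in as `weight_fn` keeps the loop independent of the optimizer, so a proper Markowitz implementation could replace the placeholder without touching the logic that guards against future data.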

Always think about which data was available at a given time point

It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when time is involved in the data being used. One has to be really careful not to use future data. Because this can be quite tricky, it is worth asking yourself again and again: am I using future data?

Michał Olędzki
