Beware of Relying on Future Data in Machine Learning

4 February 2021

Machine learning is almost always connected with analyzing historical data in a way that will allow us to predict the future. However, it is easy to create a model by erroneously relying on future data. Sometimes we might catch this mistake towards the beginning of model development; other times, it will go unnoticed until the model is complete. I will present two cases where we should be really careful not to use future data.

Stock performance based on company statements

Companies listed on the stock exchange present various financial statements like income statements, cash flow statements, balance sheet statements, etc. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?

Let’s assume we have a database with historical financial statements. Each statement has start and end dates that define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database, we have annual statements for the period 2019-01-01 – 2019-12-31.

Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we do not account for this, our model will learn to rely on the newest statement at the beginning of the year. For example, on January 2, 2021, the statement for the year 2020 is not yet available, but the model will expect it. That may lead to inaccurate results.

To correct this and properly train a model, we need to look at older data. For instance, we might only use statements whose period ended at least two weeks earlier. With January 2, 2020 as our starting point, we would look in our database for data with an end date of December 19, 2019 or earlier. We should keep asking ourselves the question: what data was available at a given point? Then the same kind of data will be available at prediction time, which is usually today’s date.
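The availability check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Statement` class, the `revenue` field, the "ACME" company, and the fixed two-week `REPORTING_LAG` are all hypothetical. If your database stores an actual publication date, filtering on that date is likely a safer choice than a fixed lag.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: a statement becomes usable this long after its period ends.
# Two weeks mirrors the example in the text; real reporting lags vary.
REPORTING_LAG = timedelta(days=14)


@dataclass(frozen=True)
class Statement:
    company: str
    period_start: date
    period_end: date
    revenue: float  # hypothetical feature, for illustration only


def latest_available(statements: List[Statement], as_of: date,
                     lag: timedelta = REPORTING_LAG) -> Optional[Statement]:
    """Return the newest statement whose period ended at least `lag` before `as_of`."""
    cutoff = as_of - lag
    usable = [s for s in statements if s.period_end <= cutoff]
    return max(usable, key=lambda s: s.period_end, default=None)


# Made-up records, not real company data.
statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31), 100.0),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31), 120.0),
]

chosen = latest_available(statements, as_of=date(2020, 1, 2))
print(chosen.period_end)  # 2018-12-31: the 2019 statement is not usable yet
```

Building every training sample through a function like this, with `as_of` set to the sample’s date, keeps training and prediction consistent: both see only what was available at that point.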

Backtesting portfolio performance

The Markowitz Portfolio Theory is a well-known portfolio construction method. It allows us to adjust the weights of given stocks within a portfolio to optimize performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:

We want to invest in three companies:

  • Mr. Cooper Group Inc (COOP)
  • Educational Development Corporation (EDUC)
  • Newmont Corporation (NEM)

To decide how to divide our money among different stocks, we will use information from the past three years and apply Markowitz’s Theory. Based on this method, we have determined that we should allocate 35% to COOP, 45% to EDUC, and 20% to NEM.

We can then test how this portfolio would have performed if we had started investing three years ago. However, this backtest relies on future data: the weights were computed from the very three years we are now testing on, so an investor starting three years ago could not have known them. It is no surprise that a portfolio optimized on this period performs well over it, and that result tells us little about how it will perform in the future.

To backtest properly, we have to think of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use a Markowitz portfolio based on the past three months, rebalance each month”. This way we make sure to always use data available at a given point in time, and our backtesting is more reliable.
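A strategy of this kind can be backtested with a walk-forward loop, sketched below. Several things here are assumptions made for illustration: the return figures are made up (they are not real market data for these tickers), and `inverse_variance_weights` is a simple placeholder standing in for a real Markowitz optimizer, which would typically need a numerical library. The part that matters is the slicing in `walk_forward_backtest`, which never lets the weights for a month see that month’s returns.

```python
from statistics import pvariance
from typing import Callable, Dict, List

Returns = Dict[str, List[float]]  # ticker -> monthly returns, oldest first


def inverse_variance_weights(history: Returns) -> Dict[str, float]:
    """Placeholder for a Markowitz optimizer: weight each stock by 1 / variance."""
    inv = {t: 1.0 / max(pvariance(r), 1e-12) for t, r in history.items()}
    total = sum(inv.values())
    return {t: v / total for t, v in inv.items()}


def walk_forward_backtest(
    returns: Returns,
    lookback: int = 3,
    weigh: Callable[[Returns], Dict[str, float]] = inverse_variance_weights,
) -> List[float]:
    """Rebalance every month using only the `lookback` months before it."""
    n_months = len(next(iter(returns.values())))
    portfolio_returns = []
    for month in range(lookback, n_months):
        # The slice ends at `month` (exclusive): the traded month is never seen.
        history = {t: r[month - lookback:month] for t, r in returns.items()}
        weights = weigh(history)
        portfolio_returns.append(
            sum(weights[t] * returns[t][month] for t in returns)
        )
    return portfolio_returns


# Made-up monthly returns, for illustration only.
monthly_returns = {
    "COOP": [0.02, -0.01, 0.03, 0.01, -0.02, 0.04],
    "EDUC": [0.05, -0.04, 0.06, -0.03, 0.02, 0.01],
    "NEM": [0.01, 0.00, 0.02, 0.01, 0.01, -0.01],
}

result = walk_forward_backtest(monthly_returns)
print(len(result))  # 3 out-of-sample months
```

A quick sanity check for any backtest written this way: change the returns of the last month and confirm that the results for all earlier months stay exactly the same. If they move, future data is leaking into the weights.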

Always think about which data was available at a given time point

It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when the data involves time. One has to be really careful not to use future data. As this can be quite tricky, it is worth asking yourself again: am I using future data?

Michał Olędzki
