Beware of Relying on Future Data in Machine Learning

4 February 2021

Machine learning is almost always connected with analyzing historical data in a way that will allow us to predict the future. However, it is easy to create a model by erroneously relying on future data. Sometimes we might catch this mistake towards the beginning of model development; other times, it will go unnoticed until the model is complete. I will present two cases where we should be really careful not to use future data.

Stock performance based on company statements

Companies listed on the stock exchange publish various financial statements, such as income statements, cash flow statements, and balance sheets. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?

Let’s assume we have a database with historical financial statements. Each statement has a start date and an end date, which define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database, we have annual statements for the period 2019-01-01 to 2019-12-31.

Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we ignore this, our model will learn to rely on the newest statement at the beginning of each year. For example, on January 2, 2021, the statement for the year 2020 has not been published yet, but the model will expect it. That may lead to inaccurate results.

To correct this and train the model properly, we need to step back in time. For instance, we might use only what was in our database two weeks earlier. With January 2, 2020 as our starting point, we would look in our database for data with an end date of December 19, 2019 or earlier. We should keep asking ourselves the question: what data was available at a given point? Then the same kind of data will be available at prediction time, which is usually today’s date.
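The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Statement` class, the `latest_available` function, the company name "ACME", and the fixed 14-day `reporting_lag` are all hypothetical, and a real lag would have to be taken from actual publication dates.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Statement:
    """Hypothetical record for one financial statement."""
    company: str
    period_start: date
    period_end: date


def latest_available(
    statements: List[Statement],
    as_of: date,
    reporting_lag: timedelta = timedelta(days=14),  # assumed lag, not a real rule
) -> Optional[Statement]:
    """Return the newest statement whose period ended on or before the cutoff."""
    cutoff = as_of - reporting_lag
    usable = [s for s in statements if s.period_end <= cutoff]
    # default=None covers the case where nothing was available yet
    return max(usable, key=lambda s: s.period_end, default=None)


# Made-up example data: two annual statements for one company.
statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31)),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31)),
]

picked = latest_available(statements, as_of=date(2020, 1, 2))
```

With `as_of` set to January 2, 2020, the cutoff is December 19, 2019, so the 2019 annual statement is skipped and the 2018 one is used instead. The same function can then be called with today’s date at prediction time.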

Backtesting portfolio performance

The Markowitz Portfolio Theory is a well-known portfolio construction method. It allows us to adjust the weights of given stocks within a portfolio to optimize performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:

We want to invest in three companies:

  • Mr. Cooper Group Inc (COOP)
  • Educational Development Corporation (EDUC)
  • Newmont Corporation (NEM)

To decide how to divide our money among different stocks, we will use information from the past three years and apply Markowitz’s Theory. Based on this method, we have determined that we should allocate 35% to COOP, 45% to EDUC, and 20% to NEM.

We can then test how this portfolio would have performed if we had started investing three years ago. However, this backtest relies on future data: the weights were derived from the same three years on which we are testing them, so three years ago we could not have known them. Markowitz’s Theory tells us that this portfolio would have performed well over that period, but that says little about how it will perform in the future.

To backtest properly, we have to think of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use a Markowitz portfolio based on the past three months, rebalance each month”. This way we make sure to always use only data available at a given point in time, and our backtesting is more reliable.
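A strategy of this kind can be sketched as a walk-forward loop. The sketch below is a simplification under loud assumptions: it does not implement Markowitz optimization, but uses inverse-variance weighting as a stand-in so that the example stays within the Python standard library. The function names, the placeholder tickers, and the monthly returns are all invented for illustration and are not real market data.

```python
from statistics import pvariance
from typing import Callable, Dict, List

Returns = Dict[str, List[float]]  # ticker -> list of monthly returns


def inverse_variance_weights(history: Returns) -> Dict[str, float]:
    """Stand-in for a Markowitz optimizer: weight each stock by 1 / variance."""
    inv = {t: 1.0 / max(pvariance(r), 1e-12) for t, r in history.items()}
    total = sum(inv.values())
    return {t: v / total for t, v in inv.items()}


def walk_forward_backtest(
    returns: Returns,
    lookback: int,
    weight_fn: Callable[[Returns], Dict[str, float]] = inverse_variance_weights,
) -> List[float]:
    """Rebalance every period using only the `lookback` periods before it."""
    n_periods = len(next(iter(returns.values())))
    portfolio_returns = []
    for t in range(lookback, n_periods):
        # The slice [t - lookback:t] excludes period t itself, so no future data leaks in.
        history = {tk: r[t - lookback:t] for tk, r in returns.items()}
        weights = weight_fn(history)
        portfolio_returns.append(sum(weights[tk] * returns[tk][t] for tk in returns))
    return portfolio_returns


# Invented monthly returns for three placeholder tickers.
toy = {
    "AAA": [0.01, 0.02, -0.01, 0.03, 0.00, 0.02],
    "BBB": [0.05, -0.04, 0.06, -0.02, 0.03, -0.01],
    "CCC": [0.00, 0.01, 0.01, 0.02, -0.01, 0.01],
}

result = walk_forward_backtest(toy, lookback=3)
```

The important design choice is that the weights for month `t` are computed before the return of month `t` is looked at. Swapping in a real Markowitz optimizer would only mean passing a different `weight_fn`; the loop that guards against future data stays the same.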

Always think about which data was available at a given time point

It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when the data has a time dimension. One has to be really careful not to use future data. As this can be quite tricky, it is better to ask yourself once again: am I using future data?

Michał Olędzki
