Beware of relying on future data in Machine Learning

4 February 2021

Machine learning is almost always connected with analyzing historical data in a way that will allow us to predict the future. However, it is easy to create a model that erroneously relies on future data. Sometimes we catch this mistake early in model development; other times, it goes unnoticed until the model is complete. I will present two cases where we should be really careful not to use future data.

Stock performance based on company statements

Companies listed on a stock exchange publish various financial statements, such as income statements, cash flow statements, and balance sheets. One may try to predict a company’s performance using this data. Where is the risk of using future data in this case?
Let’s assume we have a database of historical financial statements. Each statement has a start date and an end date, which define the period it covers. We would like to create a training sample from this data. Let’s say we want to predict a company’s performance based on the data we have on January 2, 2020. In our database we have an annual statement for the period 2019-01-01 to 2019-12-31.

Though it seems we could make predictions from this data, we can’t. Companies need time to prepare financial statements, so a statement is not available the day after its period ends. If we do not account for this, our model will learn to look at the newest statement at the beginning of each year. For example, on January 2, 2021 the statement for the year 2020 has not been published yet, but the model will expect it. That may lead to inaccurate results.

To correct this and properly train a model, we need to look at earlier data. For instance, we might only use statements whose period ended at least two weeks before the prediction date. With January 2, 2020 as our starting point, we would look in our database for data with an end date of December 19, 2019 or earlier. We should keep asking ourselves the question: what data was available at a given point in time? Then the same data will be available at prediction time, which is usually today’s date.
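
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `Statement` class, the `latest_available` helper, the company name and the revenue figures are all hypothetical, and the 14-day lag simply mirrors the two-week example in the text (real reporting lags are often longer and are better taken from actual publication dates when they are known).

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: a statement becomes usable a fixed time after its period ends.
# 14 days mirrors the two-week example in the text.
REPORTING_LAG = timedelta(days=14)


@dataclass(frozen=True)
class Statement:
    company: str
    period_start: date
    period_end: date
    revenue: float  # hypothetical feature, made-up values below


def latest_available(
    statements: List[Statement],
    company: str,
    as_of: date,
    lag: timedelta = REPORTING_LAG,
) -> Optional[Statement]:
    """Return the newest statement that could have been known on `as_of`."""
    known = [
        s for s in statements
        if s.company == company and s.period_end + lag <= as_of
    ]
    # Among the statements already known, pick the most recent period.
    return max(known, key=lambda s: s.period_end, default=None)


statements = [
    Statement("ACME", date(2018, 1, 1), date(2018, 12, 31), 100.0),
    Statement("ACME", date(2019, 1, 1), date(2019, 12, 31), 120.0),
]

picked = latest_available(statements, "ACME", as_of=date(2020, 1, 2))
print(picked.period_end)  # 2018-12-31: the 2019 statement is not usable yet
```

Building every training row through a function like this, with `as_of` set to the date of the row, means the model sees the same kind of input during training as it will at prediction time.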

Backtesting portfolio performance

Markowitz portfolio theory is a well-known portfolio construction method. It adjusts the weights of given stocks within a portfolio to optimize its performance. If we want an idea of how well a portfolio performs, we backtest it; in other words, we check how it would have performed in the past. To show the problem of relying on future data, let’s proceed with an example:

We want to invest in three companies:
● Mr. Cooper Group Inc (COOP)
● Educational Development Corporation (EDUC)
● Newmont Corporation (NEM)

To decide how to divide our capital, we will use data from the past three years. Using Markowitz’s theory, we get the following portfolio:

COOP: 35%, EDUC: 45%, NEM: 20%

Then we can easily check how such a portfolio would have performed had we started investing three years ago. It would have performed quite well, but only because the weights were computed from those same three years, which means we relied on future data.
Three years ago, the data we used to compute that portfolio wasn’t known. For the next three years other proportions will be optimal, but we don’t know them yet. If we fail to notice that we used future data, the backtest will make us overly optimistic about future investments.
To backtest properly, we have to think of a specific strategy rather than particular portfolio proportions. Such a strategy might be: “Use a Markowitz portfolio based on the past three months, rebalance each month”. This way we always use only the data available at the given point in time, and our backtesting is more reliable.
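
A rolling backtest of such a strategy might look like the sketch below. Two things here are assumptions rather than part of the original example: the monthly returns are made-up numbers, not real data for COOP, EDUC or NEM, and inverse-variance weighting stands in for a full Markowitz optimization to keep the code short. The point is the loop: the weights for month `t` are computed only from the months before `t`.

```python
from statistics import pvariance
from typing import Callable, Dict, List

Returns = Dict[str, List[float]]


def inverse_variance_weights(history: Returns) -> Dict[str, float]:
    """Stand-in for a Markowitz optimizer: weight each asset by 1 / variance."""
    inv = {asset: 1.0 / pvariance(r) for asset, r in history.items()}
    total = sum(inv.values())
    return {asset: v / total for asset, v in inv.items()}


def walk_forward_backtest(
    returns: Returns,
    lookback: int,
    weight_fn: Callable[[Returns], Dict[str, float]] = inverse_variance_weights,
) -> float:
    """Rebalance every period using only the `lookback` periods before it."""
    n_periods = len(next(iter(returns.values())))
    value = 1.0
    for t in range(lookback, n_periods):
        # The slice ends at t (exclusive), so the return earned in period t
        # is never visible when the weights for period t are chosen.
        history = {a: r[t - lookback:t] for a, r in returns.items()}
        weights = weight_fn(history)
        period_return = sum(weights[a] * returns[a][t] for a in returns)
        value *= 1.0 + period_return
    return value


# Made-up monthly returns, for illustration only.
monthly_returns = {
    "COOP": [0.04, -0.02, 0.03, 0.01, -0.01, 0.05],
    "EDUC": [0.02, 0.06, -0.03, 0.04, 0.00, -0.02],
    "NEM": [0.01, 0.00, 0.02, -0.01, 0.03, 0.01],
}

final_value = walk_forward_backtest(monthly_returns, lookback=3)
print(f"Growth of 1 unit of capital: {final_value:.3f}")
```

Passing the weighting rule in as `weight_fn` keeps the time-slicing logic separate from the optimizer, so a real Markowitz solver could be swapped in without touching the part that guards against future data.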

Always think about what data was available at a given point in time

It is crucial to understand your data when dealing with machine learning, and data science in general. It doesn’t matter how sophisticated the model is; when you feed it garbage data, you get garbage predictions. This becomes even more important when time is involved in the data being used. One has to be really careful not to use future data. As this can be quite tricky, it is better to ask yourself one more time: am I using future data?
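
One way to keep asking that question automatically is a small guard in the data pipeline. The helper below is a hypothetical sketch: it assumes each input carries the date on which it became available, which is not something every dataset records.

```python
from datetime import date
from typing import Iterable


def assert_no_future_data(available_on: Iterable[date], prediction_date: date) -> None:
    """Raise if any input became available after the prediction is made."""
    leaked = [d for d in available_on if d > prediction_date]
    if leaked:
        raise ValueError(
            f"{len(leaked)} input(s) became available after {prediction_date}"
        )


# Passes silently: both inputs were known before the prediction date.
assert_no_future_data([date(2019, 12, 1), date(2019, 12, 19)], date(2020, 1, 2))
```

Calling a check like this while building each training row turns a silent modeling error into an immediate failure.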

Michał Olędzki
