Cost of data exploitation

If you run a mine, whether it produces gold, aluminum, copper or anything else, you measure the same thing: the cost of obtaining a kilo of the target element.

With machine learning projects the situation is similar, and the cost of data exploitation is something you should define early in the process.

I will start with the boring answer:

Define KPIs that enable you to measure the cost of exploitation

If you define the right KPIs and start measuring them, you will get several benefits.

First benefit: understand the cost of exploitation

Many machine learning initiatives start with money up front, and the sponsors want to see initial results. Once the first results arrive, we all look at them in a very positive light and ignore what it cost to obtain them.

Probably almost all machine learning projects start with a negative return. Soon you realize that obtaining results is costing too much, and everybody starts asking why the expenses are so high.

This article from McKinsey reviews the main areas where they have seen this issue. This picture is a useful snapshot:

Data-related spending breaks down into four areas. — Source: https://www.mckinsey.com/

Using data from a third party

The selection of data providers is key. Once you have selected one, changing it is expensive, so during project initiation it is worth setting up a process to define the data needs, compare at least three data providers and make the selection as sound as possible. If you are in a big organization and know your procurement rules, use them, tailored to fit your project.

Second benefit: reduce scope creep

People who are not involved in the day-to-day of a machine learning project usually do not understand the effort that goes into data preparation, consolidation, feature engineering, etc. Their view is a little simplistic: all the data is there and available to our imagination, we just have to use it, so let’s apply some models and let’s rock it!

Then the business is hungry to run tests and experiments and to work on new ways of getting better results from data exploitation. This pushes you to work in many directions at once, and scope creep arises.

If you have managed a software project, you know the next step of the lifecycle: the business unit gets frustrated because IT does not deliver.

The root of scope creep here is simple: you are trying to do too many things (scope creep has other roots too).

In data management, you have to put a “cost label” on every step of data treatment you perform. That way, when something is requested, you can break down the activities, estimate them better and communicate the result to the business. Defining clear work breakdowns and explaining them to the business is key to making everybody aware of the effort required to do anything.

You are probably thinking: “this is easier said than done”. Then you will think: “the initial estimates will probably be crap; are you going to expose yourself by giving these ballpark estimates?”

This exercise sounds ridiculous, but it helps to define the total cost of data exploitation and creates a culture of “awareness” that will help everybody engaged in the project. It gives the project manager the ability to open threads of investigation knowing that different costs have to be taken into account, without falling into the trap of “doing too many things and losing focus”.

This is not new. On business intelligence projects we have all learnt to define the owner of data quality and the levels of quality that data must reach before being accepted by the solution team, and vice versa. Here the maturity of the organization, in terms of its practices and experience with data projects, is key.
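As a rough illustration of the “cost label” idea, the sketch below attaches a ballpark estimate to each data treatment step of a request and adds them up. The step names, hours and hourly rate are all hypothetical figures chosen for the example, not recommendations; the point is only that a request becomes a visible breakdown you can discuss with the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One data treatment step with its 'cost label' (all values are ballpark)."""
    name: str
    hours: float        # estimated effort
    hourly_rate: float  # assumed blended team rate

    @property
    def cost(self) -> float:
        return self.hours * self.hourly_rate


def estimate_request(steps):
    """Return (total_cost, breakdown) for a requested piece of work."""
    breakdown = {step.name: step.cost for step in steps}
    return sum(breakdown.values()), breakdown


# Hypothetical request: "add a new data source to the model".
request = [
    Step("data extraction", hours=16, hourly_rate=50.0),
    Step("cleaning and consolidation", hours=24, hourly_rate=50.0),
    Step("feature engineering", hours=20, hourly_rate=50.0),
    Step("model retraining and evaluation", hours=12, hourly_rate=50.0),
]

total, breakdown = estimate_request(request)
for name, cost in breakdown.items():
    print(f"{name}: {cost:.0f}")
print(f"total: {total:.0f}")
```

Even if the first numbers are wrong, replacing them with measured values after each request is what turns the estimate into a KPI over time.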

Other sources of scope creep:

A member of the team wants to model and experiment with what s/he has in mind, rather than what has been explored and decided by the team. To avoid this, you have to establish clear mechanisms for deciding what is going to be done, and document each experiment: assumptions, approach and results.
Lack of communication between the data team and the applications team. Usually the machine learning results have to be shown to users in an application. When this happens, what is shown and how it is shown tends to be underestimated. At this point in a project, when you have obtained some decent results and want to show them to the world, you have to get the UX solution right. I have seen great machine learning results that users did not buy into because the presentation did not engage them. When defining a project where the results of machine learning are going to be used by end users, do not underestimate the work to be done here: engage the UX team early in the process and make sure they are on the same track. Do not accept a “We are used to handling these situations” comment; it means they are underestimating things such as response time, the complexity of the screens to be shown, the need to define sets of screens they have never built before, etc.
Not having a clear list of priorities. This is common to all projects; on machine learning projects, the exploration threads have to be prioritized and limited in number. It’s usual to have a backlog, and the size of this backlog should be balanced against the project goals and project definition.

Third benefit: you can determine if the use of machine learning makes sense or not

Many gold mines are closed because the cost of exploitation is higher than the value of the gold obtained. This can happen in a machine learning project too; it’s not unusual.

There can be many reasons:

Lack of readiness of the organization to work on an initiative like this. The maturity of the organization is key to being able to adapt to this type of project.
Lack of data, or too little data available: the amount of available data may not be enough to obtain results good enough to justify putting this technology into production to make business decisions.
You are so mature that you already have a great solution in place. For instance, in manufacturing, organizations have already made so many advances in tracking and benefiting from data that improving on them at a lower cost is sometimes not possible.
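The mine comparison can be written down as a simple break-even check, sketched below. It assumes you can express the outcome of the project as a count of “units” (for example, correct decisions or retained customers) and put a value on each; the function names and the figures are hypothetical and only illustrate the arithmetic.

```python
def cost_per_unit(total_cost: float, units: float) -> float:
    """Cost of obtaining one unit of result (the 'kilo of gold')."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def is_worth_exploiting(total_cost: float, units: float, value_per_unit: float) -> bool:
    """True when each unit costs less to obtain than it is worth."""
    return cost_per_unit(total_cost, units) < value_per_unit


# Hypothetical month: 12,000 spent, 400 units of result, each worth 25.
monthly_cost = 12_000.0
units_obtained = 400
unit_value = 25.0

unit_cost = cost_per_unit(monthly_cost, units_obtained)
worth_it = is_worth_exploiting(monthly_cost, units_obtained, unit_value)
print(f"cost per unit: {unit_cost:.2f}, worth exploiting: {worth_it}")
```

In this made-up case each unit costs 30 to obtain and is worth 25, so the mine should be closed or its cost of exploitation reduced.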

Final thoughts

Defining KPIs and measuring them is boring and tedious, but it will help you in many aspects of managing an initiative like this.

Don’t worry about failing on the initial definition; set something and adapt it to the reality you face.

Keep your innovative, venturesome mindset while you define some ground rules that will enable communication between peers and other stakeholders.