80/20 and data science

“20% of the task took 80% of the time”. If you have ever heard someone say that, you probably know about the so-called 80/20 rule, also known as the Pareto principle.

The principle is based on the observation that “for many events, roughly 80% of the effects come from 20% of the causes.” As an example, the Wikipedia article mentions how in some companies “80% of the income comes from 20% of the customers”, along with a bunch of other examples.

In contrast to this, most of the references to the 80/20 rule that I hear in the wild are variations of the (often sarcastic) statement at the beginning of this post, and it is this version that is more fun to talk about.

Estimating the finishing touch

In software development, the 80/20 rule often shows up when the “finishing touches” to a task are forgotten or complexity is underestimated. An example could be if a developer forgot to factor in the time it takes to add integration tests to a new feature or underestimated the difficulty of optimizing a piece of code for high performance.

In this context, the 80/20 rule could thus, to a certain extent, be seen as the result of bad task management, but it is worth noting that it is not always this simple. Things get in the way, like when the test suite refuses to run locally, or the optimized code cannot work without blocking the CPU and the programming language is single-threaded, forcing the developer to take a different approach to the problem (this is purely hypothetical, of course…).

Related to this, Erik Bernhardsson recently wrote an interesting treatise on the subject of why software projects “take longer than you think”, and I think it is worth sneaking in a reference. Here is the main claim from the author:

I suspect devs are actually decent at estimating the median time to complete a task. Planning is hard because they suck at the average.

Erik Bernhardsson, Why software projects take longer than you think – a statistical model

The message here resonated quite well with me (especially because of the use of graphs!). The author speaks of a “blowup factor” (actual time divided by estimated time) for projects, and if his claims are true, there could be some merit to the idea that the last 20% of a task could easily “blow up” and take 80% of the time.1
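Bernhardsson’s post models task completion times as log-normally distributed. Here is a toy simulation of that idea (the spread parameter is my own assumption, not his): the median blowup factor stays right around 1, so estimating the median looks easy, while the mean is noticeably larger.

```python
import random
import statistics

random.seed(42)  # reproducible

# Blowup factor = actual time / estimated time.
# Assume it is log-normal with median 1.0 (i.e. the "typical"
# task finishes exactly on estimate) and an assumed spread of 1.0.
sigma = 1.0
blowups = [random.lognormvariate(0.0, sigma) for _ in range(100_000)]

median_blowup = statistics.median(blowups)
mean_blowup = statistics.fmean(blowups)

print(f"median blowup: {median_blowup:.2f}")  # close to 1.0
print(f"mean blowup:   {mean_blowup:.2f}")    # noticeably larger than the median
```

With these numbers, a developer who nails the median every time still underestimates the total schedule, because a few tasks blow up badly enough to drag the average up.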

Dirty data

Sometimes, the perception of data science is that most of the time “doing data science” is spent on creating models. For some data scientists, this might be the reality, but for the majority that I have spoken to, preparing data takes up a significant amount of time, and it is not the most glorious work, if one is not prepared for it.

I recently gave an internal talk at work where I jokingly referred to this as the 80/20 rule of data science: 80% of the time is spent on data cleaning (the “boring” part), and 20% on modeling (the “fun” part).

This is not really an 80/20 rule, except if we rephrase it as “80% of the fun takes up only 20% of the time” or something like that.2

When it comes to deploying models in production, the split sometimes shifts even further. The total time spent on a project might be 1% on modeling and 99% on data cleaning and infrastructure setup, but it is the 1% (the model) that gets all the attention.
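As an illustration of the kind of work that fills the 80%, here is a minimal cleaning sketch in plain Python. The records and fields are made up for the example, but the chores are the usual ones: trimming whitespace, normalizing casing, and coercing stringly-typed or missing numbers.

```python
# Hypothetical raw records with inconsistent casing, stray whitespace,
# missing values and numbers stored as strings.
raw_records = [
    {"name": "  Alice ", "age": "34", "city": "copenhagen"},
    {"name": "Bob", "age": "", "city": " AARHUS"},
    {"name": "carol", "age": "n/a", "city": "Odense "},
]

def clean_record(record):
    """Normalize one record: trim and title-case text fields,
    and coerce age to int, or None when the value is unusable."""
    age_text = record["age"].strip().lower()
    return {
        "name": record["name"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
        "city": record["city"].strip().title(),
    }

cleaned = [clean_record(r) for r in raw_records]
print(cleaned[0])  # {'name': 'Alice', 'age': 34, 'city': 'Copenhagen'}
```

Three records take a dozen lines; a real dataset has thousands of records and far messier edge cases, which is where the hours go.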

The future of data cleaning

In the last couple of years, there have been loads of startups and tools emerging that do automatic machine learning (or “AutoML”), i.e. they automate the fun parts of data science, while sometimes also providing convenient infrastructure to explore data.

If we assume that the 80/20 rule of data science is correct, these tools are thus helping out with 20% of data science. However, the first company that solves the problem of automatically cleaning and curating dirty data is going to be a billion-dollar unicorn. Perhaps the reason that we have not seen this yet is that dealing with data is actually really difficult.

For now, I suspect that the “80/20 rule of data science” will continue to be true in many settings, but that is not necessarily a bad thing. You just gotta love the data for the data itself :-)



Note: The post below was written back in September 2014, when I was starting to feel a bit down about how poorly the sales of my Shopify app were going, especially after being featured on the front page of the app store without attracting much attention. I did not publish it back then, because I once promised myself that I would try to mostly stay away from online rants. However, I think it provides some context for my March 2015 post about what happened next during my period of complete independence. So here you go.

I often get emotionally involved in my software. For example, I feel physically uncomfortable when I find a bug in some code I have written. I should probably write better tests to ensure my well-being. Anyway, I want to talk a bit about competition.

Antecons, the recommendation engine that I am working on, has now been on the Shopify app store for about a year. When Antecons was first released, there were three other competing recommendation engines on the app store, and one of them had been added just a few weeks before Antecons. Since then, three or four more have popped up. Recommendation engines and data analysis must be hot business, because all of these competing apps seem to be doing very well and are getting nice reviews from customers. This is a bit of a letdown for me, because it does not seem like Antecons is enjoying the same success.

What I am starting to realize is that it probably does not matter that Antecons always uses SSL for increased privacy and security (unlike most of the other apps), that the Antecons JavaScript code is minified to reduce bandwidth for webshop visitors (unlike at least two of the other apps), or that I do not write fake reviews (like at least one other app). No one will pat me on the shoulder for adding what I feel is a tiny bit of extra “niceness” to the overall package. Oh well.