80/20 and data science

“20% of the task took 80% of the time”. If you have ever heard someone say that, you probably know about the so-called 80/20 rule, also known as the Pareto principle.

The principle is based on the observation that “for many events, roughly 80% of the effects come from 20% of the causes.” As an example, the Wikipedia article mentions how in some companies “80% of the income come from 20% of the customers”, along with a bunch of other examples.

In contrast to this, most of the references to the 80/20 rule that I hear in the wild are variations of the (often sarcastic) statement at the beginning of the post, and it is also this version that is more fun to talk about.

Estimating the finishing touch

In software development, the 80/20 rule often shows up when the “finishing touches” to a task are forgotten or complexity is underestimated. An example could be if a developer forgot to factor in the time it takes to add integration tests to a new feature or underestimated the difficulty of optimizing a piece of code for high performance.

In this context, the 80/20 rule could thus be seen as the result of bad task management to a certain extent, but it is worth noting that it is not always this simple. Things get in the way, like when the test suite refuses to run locally, or the optimized code cannot work without blocking the CPU, and the programming language is single-threaded, forcing the developer to take a different approach to the problem (this is purely hypothetical of course…).

Related to this, Erik Bernhardsson recently wrote an interesting treatise on the subject of why software projects “take longer than you think”, and I think it is worth sneaking in a reference. Here is the main claim from the author:

I suspect devs are actually decent at estimating the median time to complete a task. Planning is hard because they suck at the average.

Erik Bernhardsson, Why software projects take longer than you think – a statistical model

The message here resonated quite well with me (especially because of the use of graphs!). The author speaks of a “blowup factor” (actual time divided by estimated time) for projects, and if his claims are true, there could be some merit to the idea that the 20% of a task could easily “blow up” and take 80% of the time.
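To make the median-versus-average point concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that blowup factors follow a log-normal distribution with parameters mu = 0 and sigma = 1. The function names and parameter values are made up for this example and are not taken from the quoted post.

```python
import math
import random
import statistics


def lognormal_median(mu: float, sigma: float) -> float:
    """Median of a log-normal distribution: exp(mu)."""
    return math.exp(mu)


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of a log-normal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def simulate_blowups(n_tasks: int, mu: float, sigma: float, seed: int = 0) -> list[float]:
    """Draw hypothetical blowup factors (actual time / estimated time)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.lognormvariate(mu, sigma) for _ in range(n_tasks)]


# Assumed parameters: mu = 0 means the median estimate is spot on.
MU, SIGMA = 0.0, 1.0
blowups = simulate_blowups(10_000, MU, SIGMA)

print("theoretical median:", lognormal_median(MU, SIGMA))
print("theoretical mean:  ", lognormal_mean(MU, SIGMA))
print("simulated median:  ", statistics.median(blowups))
print("simulated mean:    ", statistics.mean(blowups))
```

Under these assumptions the typical task lands right on its estimate (median blowup of 1), while the average blowup is exp(0.5), roughly 1.65, because a few tasks overrun by a lot and drag the mean upwards.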

Dirty data

Sometimes, the perception of data science is that most of the time “doing data science” is spent on creating models. For some data scientists, this might be the reality, but for the majority that I have spoken to, preparing data takes up a significant amount of time, and it is not the most glamorous work if one is not prepared for it.

I recently gave an internal talk at work where I jokingly referred to this as the 80/20 rule of data science: 80% of the time is spent on data cleaning (the “boring” part), and 20% on modeling (the “fun” part).

This is not really an 80/20 rule, except if we rephrase it as “80% of the fun takes up only 20% of the time” or something like that.

When it comes to deploying models in production, the timescales sometimes shift even more. The total time spent on a project might be 1% on modeling and 99% on data cleaning and infrastructure setup, but it’s the 1% (the model) that gets all the attention.

The future of data cleaning

In the last couple of years, there have been loads of startups and tools emerging that do automatic machine learning (or “AutoML”), i.e. they automate the fun parts of data science, while sometimes also providing convenient infrastructure to explore data.

If we assume that the 80/20 rule of data science is correct, these tools are thus helping out with 20% of data science. However, the first company that solves the problem of automatically cleaning and curating dirty data is going to be a billion-dollar unicorn. Perhaps the reason that we have not seen this yet is that dealing with data is actually really difficult.

For now, I suspect that the “80/20 rule of data science” will continue to be true in many settings, but that is not necessarily a bad thing. You just gotta love the data for the data itself :-)

Tax deductions are not free money

So you deducted your loan interest in your taxes. That means the interest was essentially free, right? I could not quite figure out the answer to this question in my head recently, so I thought I would do a simple example and share it here, in case it could be useful for someone else.

First, the conclusion: an expense is not free just because it is deductible. It would have been better to not have the expense in the first place. However, if you cannot avoid the expense, then deductions are of course great!

Let’s say we pay $10 interest on a loan, our income is $100 and we pay 50% tax. The table below shows the scenario where our interest is not deductible, compared to the scenario where our interest is 100% deductible.

Description       Amount w/o deduction    Amount w/ deduction
Income            $100                    $100
Taxable income    $100                    $90
Tax               -$50                    -$45
Net income        $50                     $55
Interest          -$10                    -$10
Final income      $40                     $45

If we can deduct all of the interest, we end up with $5 extra disposable income. Thus, half of the interest in this example was “free” ($5), which matches the 50% tax rate. However, if we did not have to pay interest at all, we would of course have $50 in final income.
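The arithmetic above can be written as a small Python sketch. The function name `final_income` and the `deductible_share` parameter are made up for this illustration, and the model is as simplified as the example itself: a single flat tax rate and nothing else going on.

```python
def final_income(income: float, interest: float, tax_rate: float,
                 deductible_share: float = 0.0) -> float:
    """Income left after paying tax and interest (flat tax, simplified)."""
    # Only the deductible part of the interest lowers the taxable income.
    taxable_income = income - interest * deductible_share
    tax = taxable_income * tax_rate
    return income - tax - interest


without_deduction = final_income(100, 10, 0.5)                    # $40
with_deduction = final_income(100, 10, 0.5, deductible_share=1.0)  # $45
without_interest = final_income(100, 0, 0.5)                       # $50

print("Final income without deduction:", without_deduction)
print("Final income with deduction:   ", with_deduction)
print("Final income with no interest: ", without_interest)
print("Saved by the deduction:        ", with_deduction - without_deduction)
```

In this simplified model the deduction only ever gives back interest × tax rate × deductible share, so with a tax rate below 100% an avoided expense always beats a deducted one.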

In other words, if we can completely avoid an expense, even if it is tax deductible, that is always the best financial outcome. In practice, this is not always possible, but I think it is a good principle to keep in mind.

University is what you make of it

Being a developer in a position far removed from academia, I am often confronted with the question of whether my university degree was worth the effort or not. Or to put it more mildly: would I be where I am today without it? I usually arrive at the same conclusion: yes, it was definitely worth it for me. And here is an important thing to keep in mind about higher education: it is what you make of it.

Anecdotally, I know both sides of the education opinion spectrum very well. When I was growing up, higher education was the most important thing in the world, and people that did not go through university were frowned upon. I have also often heard the song of how companies hunger for computer science graduates, and how good it is to have a Master’s degree and not “just” a Bachelor’s degree.

On the other hand, I have met many people that told me that education is a waste of time. I also know at least a handful of professional developers that are self-taught and some of them wear that as a badge of honor — sometimes also dismissing education outright and calling it useless.

I reject the mentality of both these extremes, and at least statistics like the 2017 Stack Overflow Survey seem to indicate that the industry as a whole has a more nuanced view of education. According to the survey, 76.5% of all professional developers have a Bachelor’s degree or higher, which means that roughly one out of every four professional developers does not have a university degree. At the same time, 32% (almost a third of all developers) respond that education is not very important, but most of the responses are grouped around the middle with education being “somewhat important”.

Education or not, neither is right or wrong, and I think it is important to have a balanced view of this. However, I do not want to dismiss the feelings involved here. I would be lying if I said it did not affect me when I was a mid-twenties graduate without professional experience, and I saw much younger self-taught programmers with better business opportunities than myself. But then I realize that they probably did not build a neural network for image classification by hand, nor did they have the opportunity to discuss computer ethics with like-minded peers. And those things gave me immense joy. Likewise, I can sympathize with feelings of the opposite, although it would be disingenuous of me to presume what those feelings are.

The outcome in both cases is the same: it is easy to feel doubt and resentment. From my point of view, this comes in the shape of “why the hell did I waste time in university?” and “how come they got by without a degree?” When these feelings emerge, they have to be put to rest quickly, because they are not helpful, and most importantly, they are missing the point.

Because in the end, when it comes to professional development, like many other parts of life, there is no right or wrong path to take. Higher education is not a measure of success, but it should not be dismissed either. University can be a tremendously rewarding experience, and the outcome is what you make of it, if you want it.

… and let’s not forget the parties…

