{"id":2351,"date":"2019-11-30T09:34:49","date_gmt":"2019-11-30T08:34:49","guid":{"rendered":"https:\/\/davidlebech.com\/thoughtflow\/?p=2351"},"modified":"2019-11-30T09:34:49","modified_gmt":"2019-11-30T08:34:49","slug":"80-20-and-data-science","status":"publish","type":"post","link":"https:\/\/davidlebech.com\/thoughtflow\/80-20-and-data-science\/","title":{"rendered":"80\/20 and data science"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">&#8220;20% of the task took 80% of the time&#8221;. If you have ever heard someone say that, you probably know about the so-called 80\/20 rule, also known as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pareto_principle\">Pareto principle<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The principle is based on the observation that &#8220;for many events, roughly 80% of the effects come from 20% of the causes&#8221;. As an example, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pareto_principle\">Wikipedia article<\/a> mentions how in some companies &#8220;80% of the income come from 20% of the customers&#8221; along with a bunch of other examples.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In contrast to this, most of the references to the 80\/20 rule that I hear in the wild are variations of the (often sarcastic) statement at the beginning of the post, and it is also this version that is more fun to talk about.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Estimating the finishing touch<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In software development, the 80\/20 rule often shows up when the &#8220;finishing touches&#8221; to a task are forgotten or complexity is underestimated. 
An example could be if a developer forgot to factor in the time it takes to add integration tests to a new feature or underestimated the difficulty of optimizing a piece of code for high performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this context, the 80\/20 rule could thus be seen as the result of bad task management to a certain extent, but it is worth noting that it is not always this simple. <em>Things<\/em> get in the way, like when the test suite refuses to run locally, or the optimized code cannot work without blocking the CPU, and the programming language is single-threaded, forcing the developer to take a different approach to the problem (this is <em>purely<\/em> hypothetical of course&#8230;).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Related to this, Erik Bernhardsson recently wrote an interesting treatise on the subject of why software projects &#8220;take longer than you think&#8221;, and I think it is worth sneaking in a reference. Here is the main claim from the author:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>I suspect devs are actually decent at estimating the <em>median<\/em> time to complete a task. Planning is hard because they suck at the <em>average<\/em>.<\/p><cite>Erik Bernhardsson, <a href=\"https:\/\/erikbern.com\/2019\/04\/15\/why-software-projects-take-longer-than-you-think-a-statistical-model.html\">Why software projects take longer than you think \u2013 a statistical model<\/a><\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">The message here resonated quite well with me (especially because of the use of graphs!). 
The author speaks of a &#8220;blowup factor&#8221; (actual time divided by estimated time) for projects, and if his claims are true, there could be some merit to the idea that the 20% of a task could easily &#8220;blow up&#8221; and take 80% of the time.<span id='easy-footnote-1-2351' class='easy-footnote-margin-adjust'><\/span><span class='easy-footnote'><a href='https:\/\/davidlebech.com\/thoughtflow\/80-20-and-data-science\/#easy-footnote-bottom-1-2351' title='I&amp;#8217;m not clever enough to make a mathematical argument based on the author&amp;#8217;s post, so I am just going by intuition here :-)'><sup>1<\/sup><\/a><\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dirty data<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Sometimes, the <em>perception<\/em> of data science is that most of the time &#8220;doing data science&#8221; is spent on creating models. For some data scientists, this might be the reality, but for the majority that I have spoken to, preparing data takes up a significant amount of time, and it is not the most glorious work if one is not prepared for it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I recently gave an internal talk at work where I jokingly referred to this as the <strong>80\/20 rule of data science<\/strong>: 80% of the time is spent on data cleaning (the &#8220;boring&#8221; part), and 20% on modeling (the &#8220;fun&#8221; part).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is not really an 80\/20 rule, except if we rephrase it as &#8220;80% of the fun takes up only 20% of the time&#8221; or something like that.<span id='easy-footnote-2-2351' class='easy-footnote-margin-adjust'><\/span><span class='easy-footnote'><a href='https:\/\/davidlebech.com\/thoughtflow\/80-20-and-data-science\/#easy-footnote-bottom-2-2351' title='I will not take credit for the observation of 80\/20 time in data science, as a quick Google search reveals that other people have had similar observations in the 
past.'><sup>2<\/sup><\/a><\/span><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When it comes to deploying models in production, the timescales sometimes shift even more. The total time spent on a project might be 1% on modeling and 99% on data cleaning and infrastructure setup, but it&#8217;s the 1% (the model) that gets all the attention.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The future of data cleaning<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the last couple of years, there have been loads of startups and tools emerging that do automatic machine learning (or &#8220;AutoML&#8221;), i.e. they automate the <em>fun<\/em> parts of data science, while sometimes also providing convenient infrastructure to explore data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If we assume that the 80\/20 rule of data science is correct, these tools are thus helping out with 20% of data science. However, the first company that solves the problem of automatically cleaning and curating dirty data is going to be a billion-dollar unicorn. Perhaps the reason that we have not seen this yet is that dealing with data is actually <em>really difficult<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For now, I suspect that the &#8220;80\/20 rule of data science&#8221; will continue to be true in many settings, but that is not necessarily a bad thing. You just gotta love the data for the data itself :-)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;20% of the task took 80% of the time&#8221;. If you have ever heard someone say that, you probably know about the so-called 80\/20 rule, also known as the Pareto principle. The principle is based on the observation that &#8220;for many events, roughly 80% of the effects come from 20% of the causes&#8221;. 
As an [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2467,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[145,146],"class_list":["post-2351","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-thoughts","tag-data-science","tag-software-engineering"],"_links":{"self":[{"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/posts\/2351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/comments?post=2351"}],"version-history":[{"count":0,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/posts\/2351\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/media\/2467"}],"wp:attachment":[{"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/media?parent=2351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/categories?post=2351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/davidlebech.com\/thoughtflow\/wp-json\/wp\/v2\/tags?post=2351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}