Thought Flow

Technology and other things

Author: David

  • Complete independence

    Jump I

    I am now jobless. I am now independent. For the next few months, I am going to work on projects that do not provide any steady income, and I have no plans to take on consulting or freelance work. I am writing this post both for people who might be interested in what I am doing and to wrap my head around what this is all about.

    What is the plan

    I am taking some time off from paid jobs. Instead, I will work for myself full time. My primary focus is improving and enhancing Antecons, a recommendation engine for webshops that is available now.

    I also want to work on smaller side-projects. For example, I have an ongoing dream of creating programmatic music. Hopefully, blog posts will be more frequent in this period as well. I have a few different topics lined up but my thoughts about them have not yet materialized.

    Show me the money

    I am funding this time off with money that I saved up over the last year. Unless something unexpected happens, I can afford at least two months without any form of income but I hope to stretch the money to support me for half a year. I have very few obligations and expenses so it should be possible.

    Some of my friends suggested that if I do not have a job, I could get unemployment benefits. My answer to that is a clear no. It would be morally wrong because I am not unemployed and I do not want to take advantage of the system.

    “I have this cool idea”

    Great! I am open to good ideas and I would love to hear what you have in mind.

    Also, I have always wanted to work on open source projects or for a good cause so if you need help with a non-profit or a charity, I might be able to help you for free.

    The road ahead

    Honestly, I do not really know what I am doing. I am trying to think rationally and tell myself that I should have a clear plan or be more organized but planning is completely against my nature so for now, I am just going to wing it. We will see what happens down the line.

  • Fractals revisited

    Going through old code can be fun and educational. While updating my website, I took an extra look at some of my featured code. When I came across my simple fractal simulations on the <canvas> element, I was quite surprised to see how much I violated the Don’t Repeat Yourself (DRY) principle. The three simulations share more than 80% of the same code but they were each defined in separate files where all the code was repeated. The performance of the simulation had bothered me earlier, so I decided to take a look at the code and did the following:

    • Consolidate the three simulation files into a single file.
    • Optimize the animation loop.

    It was a fun little evening project to refactor some old code. There’s still some work that could be done, like removing the hardcoded dependency on a canvas element with a specific ID, but for a little showcase like this, I do not want to bother too much about that.

    By the way, the code is online.

  • Lucky decisions

    Most people make lots of decisions every day. Some are tougher to make than others and some are more consequential than others. One of those small everyday decisions recently made me think about the cost and how much we depend on luck to affirm that we did something right.

    A story

    Here is a little story that started my thought flow about decisions:

    I take the train to work. It takes 35 minutes when it is on time. Yesterday, there was a signal failure somewhere between my departure and destination stations so the trains could not run. In my case, I had the following options:

    1. Skip work.
    2. Wait for the trains to start running.
    3. Take the car instead.

    Option 1 was not really an option, but how to decide between options 2 and 3? Taking the car would mean getting to work with only a small delay but it would cost extra in highway tolls (the Öresund Bridge) and parking, a total of about $90. Waiting for the trains to start running would mean an unknown delay, potentially leading to lost work and thereby incurring an indirect cost, especially since I get paid by the hour and my hourly rate is slightly higher than the cost of driving.

    If we ignore the much higher health risks of driving a car compared to taking the train, the decision depended on how long I estimated the train delay would be. Since I had experienced signal failures before and they often take between two and three hours to resolve, I made an informed decision to take the car.
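
    The trade-off above can be sketched as a small break-even calculation. This is only an illustration: the $90 driving cost is from the story, while the hourly rate of $100 is a made-up number standing in for “slightly higher than the cost of driving”, and the function name is invented for this example.

```python
def cheaper_to_drive(expected_delay_hours, hourly_rate, driving_cost):
    """Return True when the income lost by waiting exceeds the cost of driving."""
    lost_income = expected_delay_hours * hourly_rate
    return lost_income > driving_cost


DRIVING_COST = 90.0   # tolls and parking, from the story
HOURLY_RATE = 100.0   # hypothetical value, not stated in the post

# Delay (in hours) at which waiting and driving cost the same.
break_even_hours = DRIVING_COST / HOURLY_RATE

# With an estimated delay of two hours, driving looks like the cheaper option.
drive_if_two_hours = cheaper_to_drive(2.0, HOURLY_RATE, DRIVING_COST)

# With a delay of half an hour, waiting would have been cheaper.
drive_if_half_hour = cheaper_to_drive(0.5, HOURLY_RATE, DRIVING_COST)
```

    Under these assumed numbers, any delay longer than about an hour favors the car, which is why the estimate of the delay matters so much.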

    It turns out, it was a bad decision. The trains started running just when I had gotten half-way across the bridge. It felt bad to be wrong and it came with a small cost.

    Luck

    Decisions and luck are often tied together. In the story above, I highlighted the words “unknown” and “estimated” because a lot of the tougher decisions in life also have unknown factors and are based on estimates. And even though we make “informed decisions”, there are a lot of outside factors that we cannot control and that have an impact on the outcome of our decisions.

    I consider these factors to be luck in both a good and bad sense, depending on the outcome. In the big picture of life, luck has a huge role to play but it is easily forgotten, ignored or overlooked.

    For example, some successful people might say that their success is well-deserved and earned entirely by hard work and being smart but they forget to factor luck into the equation of their success. Bo Peabody (internet millionaire from the nineties) wrote a book about entrepreneurship and luck called Lucky or Smart? and I think the following quote sums up his points nicely:

    Was I lucky? you bet your ass I was lucky. But I was also smart: smart enough to realize that I was getting lucky. — Bo Peabody

    For me, the fact that I am even able to write this blog post shows how lucky I am. Sometimes, I even feel like my life is just a long series of lucky streaks. Realizing this helps me sympathize with people who have had much less luck in their lives. Poor life conditions are often not self-inflicted but the consequence of decisions that suffered from bad luck.

    In the end, I think the best we can do is to try and acknowledge when we are actually being lucky and take advantage of those situations when they arise. But of course, that is easy to say for a lucky person like me.

  • Fixing an IIS deadlock

    Found the below post in my drafts section. I should have released it a year ago but better late than never. Maybe someone with a deadlock problem on IIS and .NET 4.0 will find this piece of information useful.

    At the place where I am contracting at the moment, we recently (April 2013) went live with an intranet web application that needs to handle about 1000 concurrent users. This is handled by 8 web-servers running Internet Information Services (IIS) 7.5 and ASP.NET 4.0 and in the beginning, all was well.

    After running for about a day, we started seeing deadlocks in the IIS worker processes. At its worst, it happened every five minutes. When a worker process deadlocks, it gets recycled, and the system could not handle that. A Microsoft support ticket was opened and the problem was eventually solved by adding the following code to an arbitrary .cs file in each of the assemblies (i.e. projects) of the application:

    [System.Runtime.InteropServices.StructLayout(System.Runtime.InteropServices.LayoutKind.Explicit)]
    struct WorkaroundStruct
    {
        [System.Runtime.InteropServices.FieldOffset(0)]
        [System.Runtime.InteropServices.MarshalAs(System.Runtime.InteropServices.UnmanagedType.I4)]
        public int myField1;
    } 
    

    This magically solved the deadlocks. It turns out that it is a known bug in the .NET 4.0 framework but Microsoft was not going to make a hotfix for it. Hmm.

    Deadlocks are always bad but in our case, they were extra bad. All data in the application is fetched through a Windows Communication Foundation (WCF) webservice from a different company. Instantiation of a WCF client can be slow and in our case, it could take 7-9 seconds for one or more of the WCF clients. By the way, while instantiating the client, the CPU goes nuts. Even though we cache the client for each session to improve performance, when IIS hits a deadlock and resets, all 1000 users have to instantiate a new WCF client and the servers simply could not handle the CPU load.

    It is funny what weird issues we run into with technology.

  • Product similarities and relations

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    I recently started to implement a feature that analyzes similarities and relations between products and factors this analysis into the recommendations that are created. As mentioned in the previous post, more data means better recommendations.

    In fact, adding product analysis to the equation is a huge improvement to Antecons for several reasons:

    1. Brand new shops will (probably) see recommendations immediately even if they do not have any sales yet.
    2. Some products might get very few sales or page views. Product relations help improve visibility of these products.
    3. The shop owner is indirectly influencing the recommendations with the tags that are added to a product.

    This feature is now fully rolled out but there are still some technical details that are being tweaked. If you are interested in the technicalities, read on.

    Complexity

    There are some tiny problems with finding product relations: complexity and cost. One approach is to compare each product with every other product. This requires O(n²) comparisons (where n is the number of products), which is not ideal but sounds ok since the analysis does not have to run very often.

    The first approach I tried was to create a pipeline that reads batches of products and compares each of these products to all products that “come after” that product, for a total of n(n-1)/2 comparisons. This is not a problem for a few hundred products but with a few thousand products it starts to get problematic. If we have 10,000 products, the analysis will have to perform about 50 million product entity reads. On Google App Engine (GAE), each entity fetch is 1 read for the entity and 1 read for the query that fetched the entity. Reading the products in batches of 50 would thus require about one million queries and a total of about 51 million reads. On GAE, datastore reads cost $0.06 per 100,000 operations so the price for running this analysis would be at least $30, and that is only for reading the data…
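
    As a rough check of the numbers above, here is a minimal Python sketch of the arithmetic. It assumes one read per entity plus one read per batch query, the batch size of 50 and the price of $0.06 per 100,000 reads quoted above; the function name is made up for this example.

```python
def pairwise_analysis_cost(n_products, batch_size=50, price_per_100k_reads=0.06):
    """Estimate datastore reads and cost of comparing every pair of products."""
    # Each product is compared to every product that comes after it.
    comparisons = n_products * (n_products - 1) // 2

    # One entity read per comparison, fetched in batches (ceiling division).
    entity_reads = comparisons
    queries = -(-entity_reads // batch_size)

    # Assumption: every query costs one extra read on top of the entity reads.
    total_reads = entity_reads + queries
    cost = total_reads / 100000 * price_per_100k_reads
    return comparisons, queries, total_reads, cost


comparisons, queries, total_reads, cost = pairwise_analysis_cost(10000)
```

    For 10,000 products this gives roughly 50 million comparisons, one million queries and a little over $30 in reads.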

    Needless to say, this has failed as a scalable and affordable solution and I should have done the math before going down that path but… lesson learned.

    MapReduce to the rescue?

    The second approach I tried was to let the MapReduce framework do some of the work for us. The idea would be to run through all products exactly once and map each product to key-value pairs consisting of tags and product keys. The map and reduce steps could be written something like this:

    from itertools import combinations

    from mapreduce import operation


    def product_map(product):
        # Create combinations of tags
        tag_combos = combinations(product.tags, 2)

        # Yield each combination of tags, sorted so that the same
        # pair of tags always produces the same key.
        for tag_subset in tag_combos:
            sorted_tag_subset = sorted(tag_subset)
            yield sorted_tag_subset, product


    def product_reduce(tags, products):
        # Create combinations of products.
        product_combos = combinations(products, 2)

        # Calculate the similarity and shared tags of each combination of products.
        # ProductRelation is the datastore model for a relation, defined elsewhere.
        for combo in product_combos:
            relation = ProductRelation(p1=combo[0], p2=combo[1])
            yield operation.db.Put(relation)
    

    The above code is not exactly how I did it but pretty close. The problem with this is that the number of relations that need to be stored is the same, so I am still storing (potentially) massive amounts of data.
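
    To see what the two steps produce, the following sketch runs the same idea locally in plain Python, with no MapReduce framework involved. Products are plain names with tag lists, the shuffle phase is a dictionary, and relations are returned as tuples instead of being stored; all function names and the sample products are made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def map_product(name, tags):
    """Emit (tag pair, product name) for every pair of tags on the product."""
    for pair in combinations(sorted(tags), 2):
        yield pair, name


def reduce_products(tag_pair, names):
    """Emit every pair of products that share the given tag pair."""
    for p1, p2 in combinations(sorted(names), 2):
        yield p1, p2, tag_pair


def run_locally(products):
    """Tiny in-memory stand-in for the shuffle phase of a MapReduce job."""
    groups = defaultdict(list)
    for name, tags in products.items():
        for key, value in map_product(name, tags):
            groups[key].append(value)

    relations = []
    for key in sorted(groups):
        relations.extend(reduce_products(key, groups[key]))
    return relations


products = {
    "red shirt": ["red", "shirt", "cotton"],
    "red scarf": ["red", "scarf", "cotton"],
    "blue shirt": ["blue", "shirt", "cotton"],
}
relations = run_locally(products)
```

    Note that two products sharing several tag pairs are emitted once per shared pair in this sketch, which hints at how quickly the number of relations can grow.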

    Locality-sensitive hashing and good ol’ queries

    When I started developing Antecons for Google App Engine, I minimized the number of indexed properties per entity. Since then, I have learned that it is better to focus on minimizing the number of entities, so having up to n² product relations as separate entities did not seem to be the way to go. For tag relations, indexing the tags for each product seemed to be an obvious choice so I did that. This way, it is easy to select related products based on tags with some datastore queries instead of querying separate relation entities.
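
    The idea of querying by indexed tags can be illustrated without the datastore. The sketch below uses an in-memory dictionary as a stand-in for the tag index, so it is not GAE code, and the product data and function names are made up for this example.

```python
from collections import Counter, defaultdict


def build_tag_index(products):
    """Map each tag to the set of product names that carry it."""
    index = defaultdict(set)
    for name, tags in products.items():
        for tag in tags:
            index[tag].add(name)
    return index


def related_by_tags(name, products, index):
    """Rank other products by the number of tags they share with `name`."""
    shared = Counter()
    for tag in products[name]:
        for other in index[tag]:
            if other != name:
                shared[other] += 1
    # Most shared tags first; ties are broken alphabetically.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


catalog = {
    "red shirt": {"red", "shirt", "cotton"},
    "red scarf": {"red", "scarf", "wool"},
    "blue shirt": {"blue", "shirt", "cotton"},
    "green hat": {"green", "hat"},
}
tag_index = build_tag_index(catalog)
related = related_by_tags("red shirt", catalog, tag_index)
```

    No relation entities are stored here: the related products are derived at query time from the tags alone.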

    Finding product similarities, however, was a trickier problem to solve. For example, how is it possible to find products with similar titles based on a datastore query? Can we split the title into tokens and query for each of these tokens? Should we use full-text search? What if a product uses two different spellings? What if similar products could be grouped into buckets that can be queried? Ok, now we are on to something…

    Locality-sensitive hashing is a technique that does exactly this: given a set of web documents, each document is hashed to a specific bucket such that documents in the same bucket are similar. Given a new web document, we can find similar documents by looking in the bucket that the document belongs to.

    After some testing, I ended up using an implementation of simhash. Now, every time a product is saved, three simhash buckets are calculated and these can then be used to query for similar products. In other words, we only store three extra fields per product, a very efficient and scalable solution.
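
    For illustration, here is a minimal simhash sketch in plain Python. It is not the implementation used in Antecons: the tokenization, the use of MD5 as the token hash and the way the fingerprint is split into three bucket keys are all assumptions made for this example.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a simhash fingerprint from the lowercase word tokens of `text`."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bucket_keys(fingerprint, bits=64, parts=3):
    """Split the fingerprint into `parts` bands (hypothetical bucket scheme)."""
    width = bits // parts
    mask = (1 << width) - 1
    return ["%d:%x" % (i, (fingerprint >> (i * width)) & mask)
            for i in range(parts)]


title_a = "Red cotton shirt"
title_b = "Red cotton shirt, large"
keys_a = bucket_keys(simhash(title_a))
keys_b = bucket_keys(simhash(title_b))
# May be empty; whether the titles share a bucket depends on the hashes.
shared_keys = set(keys_a) & set(keys_b)
```

    Two products become candidates for being similar when they share at least one bucket key. With three bands, fingerprints that differ in at most two of the banded bits are guaranteed to share a key.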

    Conclusion

    I am happy to have added extra recommendation data to Antecons with product relations and similarities. This is not the end of it, though, since I am already considering how I can improve the above approach so it is faster and more robust. I will continue to write on the blog when there are new improvements for Antecons.

    Thank you for reading!