More data, better recommendations

This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.

Today, we have published an improvement to the Antecons recommendation algorithm. In the beginning, recommendations were based on an analysis of a webshop's order data, which has turned out to work quite well. But more data is better. Starting today, Antecons will also analyze which products customers are looking at on the webshop. This improves the recommendations, especially for products that have recently been added to a shop and have not sold much yet.

There are many other ideas and features in the pipeline and one of them is adding similarity measures as a recommendation tool. That is, similarity in terms of common product tags and similar product titles. This is probably going to find its way into Antecons in the near future, possibly as an opt-in feature.
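To make the tag idea concrete, here is a minimal sketch of one possible similarity measure, Jaccard similarity over product tags. This is an illustration only: the post does not say which measure Antecons would use, and the function names, products and tags below are all made up.

```python
def jaccard_similarity(tags_a, tags_b):
    """Share of tags two products have in common, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged products are treated as unrelated
    return len(a & b) / len(a | b)


# Hypothetical catalogue, invented for this example.
products = {
    "red t-shirt": ["clothing", "cotton", "red"],
    "blue t-shirt": ["clothing", "cotton", "blue"],
    "coffee mug": ["kitchen", "ceramic"],
}


def most_similar(name, products):
    """Rank all other products by tag overlap with the product `name`."""
    target = products[name]
    scores = [
        (other, jaccard_similarity(target, tags))
        for other, tags in products.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar("red t-shirt", products)` ranks the blue t-shirt first because the two share two of four distinct tags. A measure like this needs no sales or view data at all, so it could also cover brand-new products.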

As an extra note, the back-end and infrastructure of Antecons are constantly improving, thanks in part to the steady stream of improvements to Google App Engine. Scalability and reliability are key elements for a high-performance app like Antecons, and GAE makes it possible to focus on the app instead of the infrastructure. This might sound like a sales pitch for GAE, but it really is one of Antecons’ secret weapons.

Experimental features

Yesterday, I found out exactly what it means when Google warns about their experimental App Engine features: Your code might eventually break. Let me be clear, I am not blaming Google. They give you fair warning:

Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce.

I have written about my usage of the MapReduce framework earlier. Yesterday, I updated the MapReduce framework to the latest version, only to see that my custom Datastore reader had suddenly stopped working and that my MapReduce pipeline was throwing exceptions. Bummer.

Long story short, I spent a day debugging the new code and finally got it working by:

  1. Digging through the MapReduce framework code. Hurray for open source!
  2. Dropping the idea of running FP-Growth on batches of entities and instead running the mapping function on each entity.

That second point probably requires some explanation. I am not sure I can explain it well, but maybe some pseudo-Python will help. The biggest change happened in the map step of the Frequent Patterns MapReduce pipeline. Basically, I went from this:

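def map_batch_of_transactions(batch):
    # Pseudo-Python: run_fp_growth is a placeholder name for the FP-Growth
    # step, assumed to return (pattern, support) pairs for the whole batch.
    frequent_patterns = run_fp_growth(batch)
    for pattern, support in frequent_patterns:
        yield pattern, support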

to this:

import itertools

def map_single_transaction(transaction):
    # Every pair of items in the transaction counts as one occurrence.
    frequent_patterns = itertools.combinations(transaction, 2)
    for p in frequent_patterns:
        yield p, 1

The MapReduce shuffler takes care of grouping together patterns with the same key, so with the new method the shuffler has more work to do because the same pattern is yielded more often. Let’s say we have the pattern:

a,b (support: 4)

Before, the shuffler would just receive:

('a,b', 4)

but now it will receive:

('a,b', 1)
('a,b', 1)
('a,b', 1)
('a,b', 1)

On the other hand, FP-Growth does not have to run, so the map step of the pipeline has more predictable performance characteristics. It remains to be seen whether the change has a significant impact on the entire MapReduce process. I am currently testing this.
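The trade-off can be reproduced outside App Engine with a small, self-contained sketch. It is a simulation under stated assumptions, not the actual pipeline: plain pair counting stands in for FP-Growth, no minimum support threshold is applied, and `map_batch`, `map_single` and `shuffle_and_reduce` are hypothetical names.

```python
import itertools
from collections import Counter, defaultdict


def map_batch(batch):
    """Old style: pre-aggregate pair counts within one batch.

    Plain pair counting stands in for FP-Growth here.
    """
    counts = Counter()
    for transaction in batch:
        counts.update(itertools.combinations(sorted(transaction), 2))
    yield from counts.items()


def map_single(transaction):
    """New style: emit every pair of one transaction with a count of 1."""
    for pair in itertools.combinations(sorted(transaction), 2):
        yield pair, 1


def shuffle_and_reduce(pairs):
    """Group values by key (the shuffle), then sum them (the reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


# Made-up transactions in which the pair (a, b) occurs four times.
transactions = [["a", "b"], ["b", "a", "c"], ["a", "b"], ["a", "b", "d"]]

batched = list(map_batch(transactions))
single = [kv for t in transactions for kv in map_single(t)]

support_batched = shuffle_and_reduce(batched)
support_single = shuffle_and_reduce(single)
```

Both variants end up with the same support counts after the reduce step, but the per-transaction mapper hands more key-value pairs to the shuffler, which is exactly the extra work described above.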

So anyway, the whole point of this post was: if a feature is experimental, watch out. Sounds obvious, right? Well…

Unix tools on Windows

The other day, I set out on a journey to get many of the wonderful Unix tools running on Windows in something that resembles a terminal. In case you did not know, you can get a very long way by installing msysgit. It includes a terminal called Git Bash and all the common Unix tools such as sed, grep, awk, perl, find and so on. It also includes an ssh client and curl. I have been using this for about a year now, and it is quite convenient when you are forced to work on a Windows machine. I know Windows has PowerShell but… I just do not like it.

Git Bash