Experimental features – Thought Flow

This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.

Yesterday, I found out exactly what it means when Google warns about their experimental App Engine features: Your code might eventually break. Let me be clear, I am not blaming Google. They give you fair warning:

Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce.

I have written about my usage of the MapReduce framework earlier. Yesterday, I updated the MapReduce framework to the latest version only to see that my custom Datastore reader suddenly had stopped working and I was seeing exceptions in my MapReduce pipeline. Bummer.

Long story short, I spent a day debugging the new code and finally got it working by:

Digging through the MapReduce framework code. Hurray for open source!
Dropping the idea of running FP-Growth on batches of entities and instead running the mapping function on each entity.

That second point probably requires some explanation to really grasp and I am not sure I will be able to but maybe some pseudo-Python will help. The biggest change happened in the map-step of the Frequent Patterns MapReduce pipeline. Basically I went from this:

def map_batch_of_transactions(batch):
    frequent_patterns = fpgrowth.run(batch)
    for p in frequent_patterns:
        yield p, p.support

to this:

def map_single_transaction(transaction):
    frequent_patterns = itertools.combinations(transaction, 2)
    for p in frequent_patterns:
        yield p, 1

The MapReduce shuffler takes care of grouping together patterns with the same key so with the new method, the shuffler will have more work to do since the same patterns will be yielded more often. Let’s say we have the pattern:

a,b (support: 4)

Before, the shuffler would just receive:

('a,b', 4)

but now it will receive:

('a,b', 1)
('a,b', 1)
('a,b', 1)
('a,b', 1)

On the other hand, FP-growth does not have to run so the map-step of the pipeline has more predictable performance characteristics. It remains to be seen if the change has significant impact on the entire MapReduce process. I am currently testing this.

So anyway, the whole point of this post was: If a feature is experimental, watch out. Sounds obvious right? Well…

Leave a comment Cancel reply