similarity – Thought Flow

This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.

I recently started to implement a feature that analyzes similarities and relations between products and factors this analysis into the recommendations that are created. As mentioned in the previous post, more data means better recommendations.

In fact, adding product analysis to the equation is a huge improvement to Antecons for several reasons:

Brand new shops will (probably) see recommendations immediately even if they do not have any sales yet.
Some products might get very few sales or page views. Product relations help improve visibility of these products.
The shop owner is indirectly influencing the recommendations with the tags that are added to a product.

This feature is now fully rolled out but there are still some technical details that are being tweaked. If you are interested in the technicalities, read on.

Complexity

There are some tiny problems with finding product relations: Complexity and cost. One approach is to compare each product with every other product. This requires O(n²) comparisons (where n is the number of products) which is not ideal but it sounds ok since the analysis does not have to run very often.

The first approach I tried was to create a pipeline that reads batches of products and compares each of these products to all products that “come after” that product for a total of n(n+1)/2 comparisons. This is not a problem for a few hundred products but with a few thousand products it starts to get problematic. If we have 10.000 products, the analysis will have to perform about 50 million product entity reads. On Google App Engine (GAE), each entity fetch is 1 read for the entity and 1 read for the query that fetched the entity. Reading the products in batches of 50 would thus require about one million queries and a total of about 51 million reads. On GAE, datastore reads cost $0,06 / 100.000 operations so the price for running this analysis would be at least $30 — and that is only reading the data…

Needless to say, this has failed as a scalable and affordable solution and I should have done the math before going down that path but… lesson learned.

MapReduce to the rescue?

The second approach I tried was to let the MapReduce framework do some of the work for us. The idea would be to run through all products exactly once and map each product to key-value pairs consisting of tags and product keys. The map and reduce steps could be written something like this:

product_map(product):
    # Create combinations of tags
    tag_combos = combinations(product.tags, 2)

    # Yield each combination of tags.
    for tag_subset in tag_combos:
        sorted_tag_subset = sorted(tag_subset)
        yield sorted_tag_subset, product

product_reduce(tags, products):
    # Create combinations of products.
    product_combos = combinations(products, 2)

    # Calculate the similarity and shared tags of each combination of products.
    for combo in product_combos:
        relation = ProductRelation(p1=combo[0], p2=combo[1])
        yield operation.db.Put(relation)

The above code is not exactly how I did it but pretty close. The problem with this is that the amount of relations that need to be stored is the same so I am still storing (potentially) massive amounts of data.

Locality-sensitive hashing and good ol’ queries

When I started developing Antecons for Google App Engine, I minimized the number of indexed properties per entity. Since then, I have learned that it is better to focus on minimizing the number of entities so having up to n² product relations as separate entities did not seem to be the way to go. For tag relations, indexing the tags for each product seemed to be an obvious choice so I did that. This way, it is easy to select related products based on tags with some datastore queries instead of querying separate relation entities.

Finding product similarities, however, was a more tricky problem to solve. For example, how is it possible to find products with similar titles based on a datastore query? Can we split the title into tokens and query for each of these tokens? Should we use full-text search? What if a product uses two different spellings? What if similar products could be grouped into buckets that can be queried? Ok, now we are on to something…

Locality-sensitive Hashing is a technique that does exactly this: Given a set of web documents, each document is hashed to a specific bucket such that documents in the same bucket are similar. Given a new web document, we can find similar documents by looking in bucket that the document belongs to.

After some testing, I ended up using an implementation of simhash. Now, every time a product is saved, three simhash buckets are calculated and these can then be used to query for similar products. In other words, we only store three extra fields per product, a very efficient and scalable solution.

Conclusion

I am happy to have added extra recommendation data to Antecons with product relations and similarities. This is not the end of it though since I am already considering how I can approve the above approach so it is faster and more robust. I will continue to write on the blog when there are new improvements for Antecons.

Thank you for reading!

Tag: similarity

Product similarities and relations

Complexity

MapReduce to the rescue?

Locality-sensitive hashing and good ol’ queries

Conclusion

More data, better recommendations