AI computing requirements

Whenever there is a new announcement or breakthrough in AI, it always strikes me how far out of reach the results are for individuals and small organizations to replicate. Machine learning algorithms, and especially deep learning with neural networks, are often so computationally expensive that they are infeasible to run without immense computing power.

As an example, OpenAI Five (OpenAI’s Dota 2-playing bot) used 128,000 CPU cores and 256 GPUs, training continuously for several months:

In total, the current version of OpenAI Five has consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months.

OpenAI blog post “How to Train Your OpenAI Five”

Running a collection of more than a hundred thousand CPU cores and hundreds of GPUs for ten months would cost several million dollars without discounts. Needless to say, a hobbyist such as myself could never replicate those results. Cutting-edge AI research like this has an implicit disclaimer: “Don’t try this at home”.

Even on a smaller scale, it is not always possible to run machine learning algorithms without certain trade-offs. I can sort a list of a million numbers in less than a second, and even re-compile a fairly complex web application in a few seconds, but training a lyrics-generating neural network on less than three thousand songs takes several hours to complete.

Although a comparison between number sorting and machine learning seems a bit silly, I wonder if we will ever see a huge reduction in computational complexity, similar to going from an algorithm like bubble sort to quicksort.

Perhaps it is not fair to expect to be able to replicate the results of a cutting edge research institution such as OpenAI. Dota 2 is a very complex game, and reinforcement learning is an area of research that is developing fast. But even OpenAI acknowledges that recent improvements to their OpenAI Five bot are primarily due to increases in available computing power:

OpenAI Five’s victories on Saturday, as compared to its losses at The International 2018, are due to a major change: 8x more training compute. In many previous phases of the project, we’d drive further progress by increasing our training scale.

OpenAI blog post “How to Train Your OpenAI Five”

It feels slightly unnerving to see that the potential AI technologies of the future are currently only within reach of a few companies with access to near-unlimited resources. On the other hand, the fact that we need to throw so many computers at mastering a game like Dota should be comforting for those with gloomy visions of the future :-)

Product similarities and relations

This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


I recently started to implement a feature that analyzes similarities and relations between products and factors this analysis into the recommendations that are created. As mentioned in the previous post, more data means better recommendations.

In fact, adding product analysis to the equation is a huge improvement to Antecons for several reasons:

  1. Brand new shops will (probably) see recommendations immediately even if they do not have any sales yet.
  2. Some products might get very few sales or page views. Product relations help improve visibility of these products.
  3. The shop owner is indirectly influencing the recommendations with the tags that are added to a product.

This feature is now fully rolled out but there are still some technical details that are being tweaked. If you are interested in the technicalities, read on.

Complexity

There are some tiny problems with finding product relations: complexity and cost. One approach is to compare each product with every other product. This requires O(n²) comparisons (where n is the number of products), which is not ideal but sounds OK since the analysis does not have to run very often.

The first approach I tried was to create a pipeline that reads batches of products and compares each product to all products that “come after” it, for a total of n(n-1)/2 comparisons. This is not a problem for a few hundred products, but with a few thousand products it starts to get problematic. If we have 10,000 products, the analysis has to perform about 50 million product entity reads. On Google App Engine (GAE), each entity fetch is 1 read for the entity and 1 read for the query that fetched the entity. Reading the products in batches of 50 would thus require about one million queries and a total of about 51 million reads. On GAE, datastore reads cost $0.06 per 100,000 operations, so the price for running this analysis would be at least $30 — and that is only reading the data…
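
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The numbers (10,000 products, batches of 50, $0.06 per 100,000 reads) are the ones stated above; the script itself is purely illustrative.

n = 10000                       # number of products
batch_size = 50
price_per_read = 0.06 / 100000  # dollars per datastore read operation

comparisons = n * (n - 1) / 2             # ~50 million pairwise comparisons
entity_reads = comparisons                # 1 entity read per fetched product
query_reads = comparisons / batch_size    # ~1 million batch queries, 1 read each
total_reads = entity_reads + query_reads  # ~51 million reads in total

print('Estimated cost: $%.2f' % (total_reads * price_per_read))  # roughly $30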

Needless to say, this failed as a scalable and affordable solution. I should have done the math before going down that path, but… lesson learned.

MapReduce to the rescue?

The second approach I tried was to let the MapReduce framework do some of the work for us. The idea would be to run through all products exactly once and map each product to key-value pairs consisting of tags and product keys. The map and reduce steps could be written something like this:

from itertools import combinations
from mapreduce import operation

def product_map(product):
    # Create all pairs of tags for this product.
    tag_combos = combinations(product.tags, 2)

    # Yield each (sorted) pair of tags as the key, with the product as the value.
    for tag_subset in tag_combos:
        sorted_tag_subset = tuple(sorted(tag_subset))
        yield sorted_tag_subset, product

def product_reduce(tags, products):
    # Create all pairs of products that share this combination of tags.
    product_combos = combinations(products, 2)

    # Store a relation entity for each pair of related products.
    for combo in product_combos:
        relation = ProductRelation(p1=combo[0], p2=combo[1])
        yield operation.db.Put(relation)

The above code is not exactly how I did it, but it is pretty close. The problem is that the number of relations that need to be stored is the same, so I am still storing (potentially) massive amounts of data.

Locality-sensitive hashing and good ol’ queries

When I started developing Antecons for Google App Engine, I minimized the number of indexed properties per entity. Since then, I have learned that it is better to focus on minimizing the number of entities, so storing up to n² product relations as separate entities did not seem to be the way to go. For tag relations, indexing the tags of each product was the obvious choice, so I did that. This way, related products can be selected based on tags with a few datastore queries instead of querying separate relation entities.
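
As a sketch of what this can look like in practice, here is how related products could be fetched through an indexed, repeated tags property with the ndb datastore API. The Product model and the related_by_tags helper are illustrative assumptions, not the actual Antecons schema.

from google.appengine.ext import ndb

class Product(ndb.Model):
    title = ndb.StringProperty()
    tags = ndb.StringProperty(repeated=True)  # indexed by default

def related_by_tags(product, limit=10):
    # Fetch products that share at least one tag with the given product.
    if not product.tags:
        return []
    candidates = Product.query(Product.tags.IN(product.tags)).fetch(limit + 1)
    # Exclude the product itself from the results.
    return [p for p in candidates if p.key != product.key][:limit]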

Finding product similarities, however, was a trickier problem to solve. For example, how is it possible to find products with similar titles based on a datastore query? Can we split the title into tokens and query for each of these tokens? Should we use full-text search? What if similar products use slightly different spellings? What if similar products could be grouped into buckets that can be queried? OK, now we are on to something…

Locality-sensitive hashing is a technique that does exactly this: given a set of web documents, each document is hashed to a bucket such that documents in the same bucket are likely to be similar. Given a new web document, we can find similar documents by looking in the bucket that the document belongs to.

After some testing, I ended up using an implementation of simhash. Now, every time a product is saved, three simhash buckets are calculated and these can then be used to query for similar products. In other words, we only store three extra fields per product, a very efficient and scalable solution.
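
For the curious, here is a minimal sketch of the bucketing idea: compute a simhash fingerprint of the product title and split it into three bands, each of which serves as a bucket key. The simhash helper and the band layout below are illustrative assumptions; Antecons uses an existing simhash implementation rather than this exact code.

import hashlib

def simhash(tokens, bits=64):
    # Classic simhash: each token votes on every bit of the fingerprint.
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def simhash_buckets(title, bands=3, bits=64):
    # Split the fingerprint into three bands; products sharing any band value
    # land in the same bucket and become candidates for similarity.
    fingerprint = simhash(title.lower().split())
    width = bits // bands
    return [(i, (fingerprint >> (i * width)) & ((1 << width) - 1))
            for i in range(bands)]

# The three (band, value) pairs can be stored as indexed fields on the product
# and queried to find products with similar titles.
print(simhash_buckets('Blue cotton t-shirt'))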

Conclusion

I am happy to have added extra recommendation data to Antecons with product relations and similarities. This is not the end of it, though, since I am already considering how I can improve the above approach so it is faster and more robust. I will continue to write on the blog when there are new improvements for Antecons.

Thank you for reading!

Frequent patterns and MapReduce

This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


The first data mining algorithm implemented for Antecons was frequent pattern growth (FP-growth), an algorithm that finds frequent patterns in transactional data such as shopping baskets. Finding frequent patterns is often a necessary step in finding association rules. An association rule is something like “if customers purchase beer, they also buy chips” or “if a voter is African American, he/she votes for Barack Obama”. This kind of information can be valuable for analyzing patterns and acting upon them, e.g. for marketing or reporting.

As mentioned earlier, the first prototype of Antecons was implemented in .NET and used FP-growth to find frequent patterns and association rules. The algorithm was executed in-memory on a single thread. This is fine for smaller datasets but it does not scale to larger datasets. When restarting Antecons, I decided to look at FP-growth again and solve the scaling issue. With the help of MapReduce and Google App Engine, the future is promising.

Before going into detail with FP-growth and MapReduce, I will briefly give an example of association rule mining. The next section can be skipped if you are familiar with the concepts.

Association rule mining — a quick example

Consider the following dataset consisting of three shopping transactions:

t1 = Beer, Chips, Soda
t2 = Chips, Soda
t3 = Chips, Soda

From this dataset, we can find frequent patterns. A frequent pattern is a set of one or more items that occur together (frequently) in the dataset. When finding frequent patterns, the support is recorded along with the pattern. The support is simply a count of how many times a pattern occurs in the entire dataset. For the above dataset, we have:

Beer: 1
Chips: 3
Soda: 3
Beer, Chips: 1
Beer, Soda: 1
Chips, Soda: 3
Beer, Chips, Soda: 1

Here, we have defined a minimum support of 1, which means that all patterns with support greater than or equal to 1 are considered frequent.

Based on the frequent patterns, we can construct association rules. As mentioned earlier, an association rule can informally be stated as “If customers purchase Chips, they also buy Soda” or a bit shorter:

Chips -> Soda

Association rules have two basic metrics:

  1. Support: how many times all the items of the rule appear together in the entire dataset.
  2. Confidence: out of the transactions that contain the left-hand side of the rule, how often the right-hand side also appears.

Based on the frequent patterns from above, here are a few of the association rules we can construct:

Chips -> Soda (support: 3, confidence: 3/3 = 1 = 100%)
Beer -> Chips (support: 1, confidence: 1/1 = 1 = 100%)
Chips -> Beer (support: 1, confidence: 1/3 = 0.33 = 33%)
Beer, Soda -> Chips (support: 1, confidence: 1/1 = 1 = 100%)
Chips, Soda -> Beer (support: 1, confidence: 1/3 = 0.33 = 33%)
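
To make the numbers above concrete, here is a tiny Python sketch that counts support by brute force (no FP-growth) and computes the confidence of one rule. The helper is purely illustrative.

from itertools import combinations

transactions = [
    {'Beer', 'Chips', 'Soda'},
    {'Chips', 'Soda'},
    {'Chips', 'Soda'},
]

def support(itemset):
    # Number of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t)

# All patterns and their support (minimum support 1).
items = {'Beer', 'Chips', 'Soda'}
for size in range(1, len(items) + 1):
    for pattern in combinations(sorted(items), size):
        print(pattern, support(set(pattern)))

# Confidence of "Chips -> Soda" = support(Chips, Soda) / support(Chips) = 3/3 = 100%.
print(support({'Chips', 'Soda'}) / float(support({'Chips'})))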

For more info, see the Wikipedia article on association rule mining.

The problem with association rule mining

Running FP-growth in-memory has certain limitations. To give an idea of the performance, I carried out numerous tests of FP-growth on the Antecons prototype (when I write FP-growth, I really mean “FP-growth plus finding association rules”, but I will just refer to the algorithm as FP-growth). A few of the results are summarized in the graph captions below.

[Graph: FP-growth running time as a function of minimum support. Each line shows the running time for 20,000 transactions with 1-10 items per transaction and either 50 or 200 unique item types.]

[Graph: Number of association rules found as a function of minimum support. Each line shows the number of rules found for 20,000 transactions with 1-10 items per transaction and either 50 or 200 unique item types.]

There are two problems here:

  1. The running time increases as the support decreases.
  2. The number of association rules decreases as the support increases.

If the goal is a low running time and many rules, the above two problems create an unwanted trade-off between running time and the number of rules found. In fact, even with 20,000 transactions and only 200 unique types of items, 0 rules are found with a minimum support of 50, and only 24 rules are found with a minimum support of 25. In other words, out of 20,000 transactions, it is not possible to find any itemset that occurs 50 or more times. This might be surprising, but it actually illustrates one of the main problems with association rule mining:

In order to find interesting association rules, the minimum support for these rules has to be set sufficiently low, but when setting the support very low, the running time and memory usage explode.

Scaling FP-growth with MapReduce

Some people say that you should not think about scale early on in a project. I agree to some extent, but since the limitations of FP-growth start showing on fairly small datasets, it is relevant to consider whether the solution can scale at all. When handling millions of transactions, it is no longer feasible to run FP-growth in-memory on a single thread. Even if we assume that the algorithm scales linearly with the size of the input, a few minutes suddenly turn into a few days, and keeping millions of rows in an in-memory data structure is probably not a good idea either.

Fortunately, there are ways to run FP-growth concurrently using a technique called MapReduce. MapReduce is a powerful method used extensively by Google, for example for search indexing. Google also provides an open source MapReduce framework for Google App Engine — very convenient for Antecons since I chose to deploy on Google App Engine.

With no prior experience with MapReduce, I was lucky that someone had already implemented FP-growth for MapReduce in Hadoop. Since the basic FP-growth algorithm was already implemented a year ago for the Antecons teaser, I just needed a basic idea of how to set up the mapper and the reducer, so I chose an approach that is conceptually the same as the one used in that Hadoop implementation.

The MapReduce pipeline

The full pipeline for creating association rules with MapReduce is actually quite simple and is currently less than 250 lines of code. The first thing that happens is that a reader reads transactions and sends them in batches to the mapper. The map function looks like this:

def fp_map_transactions(transactions):
    # Run FP-growth on this batch of transactions and emit each frequent
    # pattern together with its local support count.
    patterns = fpgrowth.fpgrowth(transactions)
    for pattern, support in patterns:
        yield pattern, support

After the map step, the shuffle step automatically groups together identical patterns. The reduce function then receives a pattern and a list of support counts for that pattern, so all we have to do is sum the support counts to find the pattern’s global support:

def fp_reduce(pattern, values):
    # Sum the local support counts from all batches to get the global support.
    support = sum([long(i) for i in values])
    # Write one "pattern:support" line to the output.
    yield '%s:%s\n' % (pattern, support)

The patterns are yielded to a MapReduce writer that writes the result to the blobstore. After all the results have been stored, another pipeline starts finding the association rules for the patterns. The mapping function for this pipeline receives a pattern-support pair and looks like this:

def fp_to_association_map(line):
    # Each input line is a "pattern:support" pair written by the previous
    # pipeline; the items in a pattern are assumed to be comma-separated.
    pattern, support = line.split(':')
    items = pattern.split(',')
    yield pattern, support  # Yield the pattern's own support, keyed by the pattern itself
    if len(items) > 1:      # Yield every subpattern with one item removed
        for item in items:
            others = [i for i in items if i != item]
            yield ','.join(others), line

Finally, the reducer puts all the things together to create the association rules.

def fp_to_association_reduce(left, values):
    # The single value without a ':' is the support of the key pattern itself,
    # i.e. the support of the rule's left-hand side; every other value is a
    # "pattern:support" line for a superpattern of the key.
    fullpattern_support = [value for value in values
            if len(value.split(':')) < 2][0]
    values.remove(fullpattern_support)
    left_items = left.split(',')
    for value in values:
        pattern, support = value.split(':')
        # The right-hand side is the superpattern minus the left-hand side items.
        right = [item for item in pattern.split(',') if item not in left_items]
        # confidence(left -> right) = support(left and right) / support(left)
        confidence = float(support) / float(fullpattern_support)
        rule = {'left': left_items, 'right': right,
                'sup': support, 'conf': confidence}
        # ...each rule is then stored or passed to an output writer (omitted here).

This works :-)

Initial observations about GAE MapReduce

The standard datastore reader for the MapReduce framework reads one entity at a time, so for step 1, I extended the built-in reader to return batches of transactions instead of single transactions. Otherwise, FP-growth would run once for each transaction in the datastore, and that would be silly. The code for this reader is simple, but the link to it is no longer valid (as of 2013-05-04).
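
Since the reader code is no longer available, here is a hedged sketch of its core idea: wrap the stream of transaction entities that the standard reader yields into fixed-size lists, so that fp_map_transactions receives a whole batch per call. The batch size of 50 is an arbitrary choice for the example, and this is an illustration of the batching, not the original reader.

def in_batches(entities, batch_size=50):
    # Group a stream of transaction entities into lists of batch_size.
    batch = []
    for entity in entities:
        batch.append(entity)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # The final batch may be smaller.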

When starting a MapReduce pipeline on Google App Engine, you can specify how many shards the computations should run on. Theoretically, if the number of shards is set to 16, the computations should run concurrently on 16 different threads. I could not reliably test this on the development server, but for 1000 transactions, the frequent pattern pipeline took 5 minutes with 4 shards and 1 minute with 16 shards. This sounded good, but the framework reported that it was only using one of the shards for all computations, so the difference in running time was strange. Later, I got the work evenly distributed over all shards and the strange behavior disappeared.

Without much optimization, it currently takes between 1 and 2 minutes to analyze 1000 transactions on 16 shards, which is not impressive compared to the in-memory version that could analyze 50,000 transactions in 40 seconds. However, I have not yet carried out any tests on much larger datasets to see how well the MapReduce framework scales. That is the next step.

The MapReduce framework for Google App Engine is certainly a great help for parallelizing heavy work, and I look forward to testing it more to find out whether it is a good option for more data analysis algorithms.

Thank you for reading.