Thought Flow

Tag: Google App Engine

  • More data, better recommendations

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    Today, we have published an improvement for the Antecons recommendation algorithm. In the beginning, recommendations were based on an analysis of order data for a webshop which has turned out to work quite well. But more data is better. Starting today, Antecons will also analyze data based on what products customers are looking at on the webshop. This improves the recommendations, especially for products that have recently been added to a shop and have not been sold so much yet.

    There are many other ideas and features in the pipeline and one of them is adding similarity measures as a recommendation tool. That is, similarity in terms of common product tags and similar product titles. This is probably going to find its way into Antecons in the near future, possibly as an opt-in feature.

    As an extra note, the back-end and infrastructure of Antecons is constantly improving, thanks in part to the constant improvements being made to Google App Engine. Scalability and reliability are key elements for a high-performance app like Antecons and GAE makes it possible to focus on the app instead of the infrastructure. This might sound like a sales pitch for GAE but actually, it is one of Antecons’ secret weapons.

  • An SMTP server for testing

    Today I needed to send some test emails from my application and I searched around for a very simple SMTP server that can save outgoing emails to a folder instead of actually sending them. This is nice for debugging and testing email functionality in an application.

    Unfortunately, it was very difficult to find something useful. In the end, I found a very simple script that did the trick. I forked it and added some command-line options so now I can easily run a simple SMTP server on my development computer in Linux, like this:

    dummy_smtp.py -p 8081
    

    In case you were wondering, this works really well with the Google App Engine mail API. Simply run the development server like this:

    dev_appserver.py --smtp_host=localhost --smtp_port=8081
    

    The source code for the server is right here:
    https://github.com/dlebech/Dummy-SMTP

  • Frequent patterns and MapReduce

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    The first data mining algorithm that was implemented for Antecons was Frequent Pattern growth (FP-growth) which is an algorithm that finds frequent patterns in transactional data such as shopping baskets. Finding frequent patterns is often a necessary step in finding association rules. An association rule is something like “if customers purchase beer, they also buy chips” or “if voter is African American, he/she votes for Barack Obama”. This information can be valuable to analyze patterns and act upon them, e.g. for marketing or reporting.

    As mentioned earlier, the first prototype of Antecons was implemented in .NET and used FP-growth to find frequent patterns and association rules. The algorithm was executed in-memory on a single thread. This is fine for smaller datasets but it does not scale to larger datasets. When restarting Antecons, I decided to look at FP-growth again and solve the scaling issue. With the help of MapReduce and Google App Engine, the future is promising.

    Before going into detail with the FP-growth and MapReduce, I will briefly give an example of association rule mining. The next section can be skipped if you are familiar with the concepts.

    Association rule mining — a quick example

    Consider the following dataset consisting of three shopping transactions:

    t1 = Beer, Chips, Soda
    t2 = Chips, Soda
    t3 = Chips, Soda
    

    From this dataset, we can find frequent patterns. A frequent pattern is a set of one or more items that occur together (frequently) in the dataset. When finding frequent patterns, the support is recorded along with the pattern. The support is simply a count of how many times a pattern occurs in the entire dataset. For the above dataset, we have:

    Beer: 1
    Chips: 3
    Soda: 3
    Beer, Chips: 1
    Beer, Soda: 1
    Chips, Soda: 3
    Beer, Chips, Soda: 1
    

    Here, we have defined a minimum support of 1 which means that all patterns with support greater than or equal to 1 are considered frequent.

    Based on the frequent patterns, we can construct association rules. As mentioned earlier, an association rule can informally be stated “If customers purchase Chips, they also buy Soda” or a bit shorter:

    Chips -> Soda
    

    Association rules have two basic metrics:

    1. Support: How many times do all the items of a rule appear together in the entire dataset.
    2. Confidence: How often does the left-hand side of the rule appear with the right-hand side.

    Based on the frequent patterns from above, here are a few of the association rules we can construct from those patterns:

    Chips -> Soda (support: 3, confidence: 3/3 = 1 = 100%)
    Beer -> Chips (support: 1, confidence: 1/1 = 1 = 100%)
    Chips -> Beer (support: 1, confidence: 1/3 = 0.33 = 33%)
    Beer, Soda -> Chips (support: 1, confidence: 1/1 = 1 = 100%)
    Chips, Soda -> Beer (support: 1, confidence: 1/3 = 0.33 = 33%)
    

    For more info on association rule mining, see the Wikipedia article for association rule mining.

    The problem with association rule mining

    Running FP-growth in-memory has certain limitations. To give an idea about the performance, numerous tests of FP-growth were carried out on the prototype of Antecons (when I write FP-growth, I really mean “FP-growth plus finding association rules” but I will just refer to the algorithm as FP-growth). A few of the results are graphed below (please read the caption under the graph).

    FP-growth running time
    FP-growth running time as a function of minimum support. Each line represents running times for 20 thousand transactions with 1-10 items in each transaction and either 50 or 200 unique types of items.
    FP-growth association rules
    Number of association rules as a function of minimum support. Each line represents rules found for 20 thousand transaction with 1-10 items for each transactions and either 50 or 200 unique types of items.

    There are two problems here:

    1. The running time increases as the support decreases.
    2. The number of association rules decreases as the support increases.

    If the goal is to achieve low running time and finding many rules, the above two problems create an unwanted trade-off between running time and number of rules found. In fact, even with 20000 transactions and only 200 unique types of items, 0 rules are found if the minimum support is 50 and only 24 rules are found with minimum support 25. In other words, out of 20000 transactions, it is not possible to find any itemset that occurs 50 or more times. This might be surprising but it actually illustrates one of the main problems with association rule mining:

    In order to find interesting association rules, the minimum support for these rules has to be set sufficiently low, but when setting the support very low, the running time and memory usage explodes.

    Scaling FP-growth with MapReduce

    Some people say that you should not think about scale early on in a project. I agree to some extent but since the limitations of FP-growth start showing on fairly small datasets, it is relevant to consider if the solution can scale at all. When handling millions of transactions, it is no longer feasible to run FP-growth in-memory and on a single thread. Even if we assume that the algorithm scales linearly on the input set, a few minutes all of a sudden turns into a few days and keeping millions of rows in an in-memory data structure is probably not a good idea either.

    Fortunately, there are ways to run FP-growth concurrently using a technique called MapReduce. MapReduce is a powerful method that is used extensively by e.g. Google for their search indexing. Google also provides an open source MapReduce framework for Google App Engine — very convenient for Antecons since I chose to deploy on the Google App Engine.

    With no prior experience with MapReduce, I was lucky that someone has already implemented FP-growth for MapReduce in Hadoop. Since the basic FP-growth algorithm was implemented a year ago for the Antecons teaser, I just needed a basic idea of how to set up the mapper and the reducer so I chose an approach that is conceptually the same as the one mentioned in the blog post that I just referenced.

    The MapReduce pipeline

    The full pipeline for creating association rules with MapReduce is actually quite simple and is currently less than 250 lines of code. The first thing that happens is that a reader reads transactions and sends them in batches to the mapper. The map function looks like this:

    def fp_map_transactions(transactions):
        patterns = fpgrowth.fpgrowth(transactions)
        for pattern, support in patterns:
            yield pattern, support
    

    After the map step, the MapReduce framework automatically groups together patterns that are the same in the shuffle step which is completely automated. The reduce function receives the pattern and a list of support counts for the pattern so all we have to do is sum the support counts to find the global support count for the pattern:

    def fp_reduce(pattern, values):
        support = sum([long(i) for i in values])
        yield '%s:%s\n' % (pattern, support)
    

    The patterns are yielded to a MapReduce writer that writes the result to the blobstore. After all the results have been stored, another pipeline starts finding the association rules for the patterns. The mapping function for this pipeline receives a pattern-support pair and looks like this:

    def fp_to_association_map(line):
        pattern, support = line.split(':')
        yield pattern, support  # Yield the full pattern
        if len(pattern) > 1:    # Yield every subpattern
            for item in pattern:
                others = [i for i in pattern if i != item]
                yield others, line
    

    Finally, the reducer puts all the things together to create the association rules.

    def fp_to_association_reduce(left, values):
        # Find value that is not a pattern-support pair
        fullpattern_support = [value for value in values 
                if len(value.split(':')) < 2][0]
        values.remove(fullpattern_support)
        for value in values:
            pattern, support = value.split(':')
            for item in left:
                pattern.remove(item)
            confidence = float(support)/float(fullpattern_support)
            rule = { 'left': left, 'right': pattern, 
                    'sup': support, 'conf': confidence }
    

    This works :-)

    Initial observations about GAE MapReduce

    The standard Datastore reader for the MapReduce framework reads one entity at a time so for step 1, I extended the built-in reader to return batches of transactions instead of just one transaction. Otherwise, FP-growth would run for each transaction in the datastore and that would be silly. The code for this reader is simple and can be found here nowhere (it is not valid as of 2013-05-04).

    When starting a MapReduce pipeline on Google App Engine, it can be specified how many shards that the computations should be run in. Theoretically, if the number of shards are set to 16, the computations should run concurrently on 16 different threads. I could not reliably test this on the development server but for 1000 transactions, it took 5 minutes to run the frequent pattern pipeline using 4 shards and 1 minute using 16 shards. This sounds good enough but the framework was reporting that it was only using one of the shards for all computations so the difference in running time is strange. Later, I got the work evenly distributed over all shards and the strange behavior disappeared.

    Without much optimization, it currently takes between 1 and 2 minutes to analyze 1000 transactions on 16 shards which is not impressive compared to the in-memory version that could analyze 50000 transactions in 40 seconds but I have not carried out any tests on much larger datasets to see if the MapReduce framework scales. That is next step.

    The MapReduce framework for Google App Engine is certainly a great help for easy parallelization of heavy work and I look forward to testing it out more to find out, if it is a good option for more data analysis algorithms.

    Thank you for reading.

  • Antecons developer diary part 3

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    On Friday, October 5, 2012, Antecons was started over from scratch and another journey begun. How long this journey will be and what will happen is uncertain. But here are some initial notes on the progress.

    Where to begin

    Starting out is often a bit slow, mostly because lots of questions pile up. Like what programming language to use and what platform to deploy on. What about project management, bug tracking and early documentation? It is easy to get caught up in those things and it happened to me as well on the first day. I was inspired when I read that Jeff Atwood does not make TODO lists. However, writing down a few ideas and making some mental notes works well for me to get a feel for what I want to do with Antecons. But TODO lists might not be useful and micromanaging a project for a one-person team in the exploratory/early phase is definitely not worth it.

    So I ditched the project management for now, I have no bug tracker and the documentation I wrote initially is already outdated. When will I ever learn. Instead of wasting more time, I made decisions. Use Python, deploy on Google App Engine and start making a simple web-based API using JSON for data representations and then take it from there.

    To REST or not to REST

    REST (Representational State Transfer) is an interesting choice of architecture for web-based APIs. REST is kind of a buzzword and most REST APIs are not actually RESTful according to Roy Fielding who coined the term. So of course, I decided that my API would be truly RESTful. Alas, the choice of JSON for data representation makes it difficult to make a hypertext-driven API since there are no widespread standards for how links should be represented in JSON, unlike e.g. the very precise definitions in the Atom Syndication Format (an example is this blog’s Atom feed ).

    There is a specification draft called JSON schema that is very promising so I decided to try and use it for the API. A very crude first alpha version of the API was actually fully hypertext-driven based on the principles outlined in JSON schema and all data representations were described by a JSON schema. Using JSON schema has two advantages:

    1. It provides a way to validate data both on the client and server side.
    2. It is cool to be RESTful.

    Ok, so while number one is useful, number two is kind of silly. And actually I think it is ok to not be truly RESTful but still use some of the REST principles to build an API. My favorite example right now is the Google Calendar API which uses resources for data representation and HTTP verbs to interact with these resources but is not hypertext-driven. According to the Richardson Maturity model, it is probably level 2 REST — if there is such a thing. With Antecons, this is the level of REST I am aiming for.

    So what about JSON schema? Well currently, I am making a JSON schema for all data representations in Antecons but I have not decided whether or not I should continue this trend. If using JSON schema means that some code can be autogenerated by the client then that is fantastic. But I do not know of any ready-made tools to do that, such as is the case for e.g. WSDL in Visual Studio.

    Progress

    After this Friday, I should have another crude alpha version ready for publishing. After it is out there, I will write about it here. The API cannot do much right now. You can just upload some simple data and view it. Basic basic stuff. But I also hope to have the first data analysis algorithms ready within the near future. Then it starts getting exciting.

  • Antecons developer diary part 1

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    It has been a year since the first announcement of Antecons. What has happened since then? Absolutely nothing. The reason for this seems to be pretty straightforward: I have been occupied with full-time consulting and I have made a deliberate choice of working a five-day 40-hour work week. That leaves no room for extra work. However, this has changed slightly. Since October 1, I have had a four-day work week. This leaves one day to do other stuff and I have decided to continue my work on Antecons and I intend to blog about the (slow) progress.

    Consider this part 1 in a series of posts about my experiences, design decisions and the current state of Antecons. First, I would like to fill in the gap in the history of Antecons from last year until now. There is more to the story than just a lack of time.

    Where we left off

    One year ago, I made a promise, a prototype and a teaser. The prototype is a recommendation engine that is still running on one webshop (as far as I know). It is written in C# for .NET, uses Microsoft SQL server as database and is deployed directly on the end-user’s server using a pre-compiled DLL that can be referenced from any .NET application. As I was developing the prototype, I already had a couple of basic objections to my own design:

    1. It required the user to use .NET and SQL Server.
    2. It required me to use Visual Studio, IIS and Windows as a development environment.
    3. It required the user to integrate the component into their webshop. This included:
      1. Setting up an SQL database and tables conforming with the Antecons data format.
      2. Synchronizing webshop purchases with the Antecons database and keeping it updated.
      3. Manually running the recommendation algorithm to pre-create recommendations.
      4. Integrating Antecons into the webshop design in the appropriate places to show the pre-created recommendations for the current product being shown.

    The first two points were easily fixed by deciding to replace the pre-compiled DLL component with a webservice. Using an API for Antecons, there would be no constraints on the user regarding their server software or webshop programming environment.

    And thus was born the teaser, a web application written in Python and running on Google App Engine. It never turned into a webservice with an API. Instead it was a website that showed what the essence of Antecons was: A recommendation engine based on association rule mining (or market basket analysis). As it turned out, this was the easy part and implementing a simple API would have been too. But the third point I outlined above, that was the real killer.

    The promise

    A quote from my company website:

    Metacircle has been developing a data analysis tool for some time. It is called Antecons.

    And from the Antecons teaser website:

    The API is currently under development…

    These quotes sum up the promise I made myself (and the world): To work on Antecons. Did I succeed? No. Do I feel bad about that? Yes. What killed the project? Time and wishful thinking.

    Time grew short when I started working full-time but this is only part of the explanation. As hinted above, I was also held back by the third design objection. Setting up Antecons for actual usage was, in my opinion, too difficult and too much an obstacle for widespread usage. How could Antecons be made so it would be super easy to setup, super easy to deploy and super easy to run. The ideal scenario looked like this:

    1. User registers for Antecons and receives a unique User ID.
    2. User inserts 10 lines of Javascript code in the header of each webshop page where Antecons should show recommendations.
    3. Antecons takes care of the rest.

    This is how e.g. Google Analytics works and it is quite amazing how much information it gathers and how useful this information is, based on such a simple installation. If Antecons could be as easy to install as Google Analytics, which is installed on millions of websites, there would be a huge potential and many users could benefit from it.

    But the difficult questions piled up fast. For example, how would Antecons gather information on user’s sales data and how would it know where to show recommendations on a user’s webshop. The answers I came up with all lead to the same conclusion: The user would have to do some work and Antecons could not be fully automated, not even partially. This shattered my overall vision for Antecons and my head has been throwing around ideas ever since. Maybe the ideal scenario was wishful thinking but in any case, this was the end of the line. Until now.

    Where we are

    It has been two weeks since I started over with the Antecons project. In the next blog post, I will elaborate on what specific design obstacles that put the project on ice. This is an important pretext for what I have started working on right now. Later, I will write more about specific design decisions and actual implementation. The API should even have some partial functionality coming up very soon. Stay tuned.