Thought Flow

Tag: Python

  • Apache Beam MongoDB reader for Python

    The Apache Beam SDK for Python is currently lacking some of the transforms found in the Java SDK. I created a very minimal example of an Apache Beam MongoDB read transform for Python that might be useful for someone else looking for an answer.

    I will update this post in the future if/when the Apache maintainers include support for MongoDB in the SDK. I know I could contribute to the project directly, but I don’t have time for it right now unfortunately :-)

  • Experimental features

    This post is about Antecons, a product recommendation engine, now part of Conversio. Antecons is no longer commercially available, but I have kept my developer diary on my website with permission.


    Yesterday, I found out exactly what it means when Google warns about their experimental App Engine features: Your code might eventually break. Let me be clear, I am not blaming Google. They give you fair warning:

    Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce.

    I have written about my usage of the MapReduce framework earlier. Yesterday, I updated the MapReduce framework to the latest version only to see that my custom Datastore reader suddenly had stopped working and I was seeing exceptions in my MapReduce pipeline. Bummer.

    Long story short, I spent a day debugging the new code and finally got it working by:

    1. Digging through the MapReduce framework code. Hurray for open source!
    2. Dropping the idea of running FP-Growth on batches of entities and instead running the mapping function on each entity.

    That second point probably needs some explanation. I am not sure I can explain it well, but maybe some pseudo-Python will help. The biggest change happened in the map step of the Frequent Patterns MapReduce pipeline. Basically, I went from this:

    def map_batch_of_transactions(batch):
        # fpgrowth is a placeholder for an FP-Growth implementation (pseudo-Python).
        frequent_patterns = fpgrowth.run(batch)
        for p in frequent_patterns:
            yield p, p.support
    

    to this:

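    import itertools

    def map_single_transaction(transaction):
        # Every pair of items in the transaction counts as one occurrence of that pattern.
        frequent_patterns = itertools.combinations(transaction, 2)
        for p in frequent_patterns:
            yield p, 1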
    

    The MapReduce shuffler takes care of grouping together patterns with the same key, so with the new method the shuffler has more work to do, since the same pattern is yielded once for every transaction that contains it. Let’s say we have the pattern:

    a,b (support: 4)
    

    Before, the shuffler would just receive:

    ('a,b', 4)
    

    but now it will receive:

    ('a,b', 1)
    ('a,b', 1)
    ('a,b', 1)
    ('a,b', 1)
    

    On the other hand, FP-Growth does not have to run, so the map step of the pipeline has more predictable performance characteristics. It remains to be seen if the change has a significant impact on the entire MapReduce process. I am currently testing this.
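    The trade-off can be tried out locally with a small simulation of the map, shuffle and reduce steps. This is only a sketch: the `shuffle` and `reduce_support` functions are hypothetical stand-ins and not part of the App Engine MapReduce framework, and the mapper sorts and de-duplicates each transaction (an assumption added here) so that the same pair always produces the same key.

```python
import itertools
from collections import defaultdict


def map_single_transaction(transaction):
    # Sort and de-duplicate so the same pair always yields the same key.
    items = sorted(set(transaction))
    for pair in itertools.combinations(items, 2):
        yield ",".join(pair), 1


def shuffle(mapped):
    # Hypothetical stand-in for the framework's shuffler: group values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_support(key, values):
    # Summing the 1s gives the support of the pattern.
    return key, sum(values)


transactions = [
    ["a", "b"],
    ["a", "b", "c"],
    ["b", "a"],
    ["a", "b", "d"],
    ["c", "d"],
]

mapped = [kv for t in transactions for kv in map_single_transaction(t)]
grouped = shuffle(mapped)
support = dict(reduce_support(k, v) for k, v in grouped.items())

print(grouped["a,b"])  # the four 1s the shuffler receives for this key
print(support["a,b"])  # their sum, i.e. the support
```

    The reducer ends up with the same support value as before; the difference is only in how many key-value pairs pass through the shuffler on the way there.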

    So anyway, the whole point of this post was: If a feature is experimental, watch out. Sounds obvious right? Well…

  • Antecons developer diary part 3

    On Friday, October 5, 2012, Antecons was started over from scratch and another journey began. How long this journey will be and what will happen is uncertain. But here are some initial notes on the progress.

    Where to begin

    Starting out is often a bit slow, mostly because lots of questions pile up. Like what programming language to use and what platform to deploy on. What about project management, bug tracking and early documentation? It is easy to get caught up in those things and it happened to me as well on the first day. I was inspired when I read that Jeff Atwood does not make TODO lists. However, writing down a few ideas and making some mental notes works well for me to get a feel for what I want to do with Antecons. But TODO lists might not be useful and micromanaging a project for a one-person team in the exploratory/early phase is definitely not worth it.

    So I ditched the project management for now, I have no bug tracker, and the documentation I wrote initially is already outdated. When will I ever learn? Instead of wasting more time, I made decisions: Use Python, deploy on Google App Engine, start making a simple web-based API using JSON for data representations, and then take it from there.

    To REST or not to REST

    REST (Representational State Transfer) is an interesting choice of architecture for web-based APIs. REST is kind of a buzzword, and most REST APIs are not actually RESTful according to Roy Fielding, who coined the term. So of course, I decided that my API would be truly RESTful. Alas, the choice of JSON for data representation makes it difficult to build a hypertext-driven API, since there are no widespread standards for how links should be represented in JSON, unlike e.g. the very precise definitions in the Atom Syndication Format (an example is this blog’s Atom feed).

    There is a specification draft called JSON schema that is very promising so I decided to try and use it for the API. A very crude first alpha version of the API was actually fully hypertext-driven based on the principles outlined in JSON schema and all data representations were described by a JSON schema. Using JSON schema has two advantages:

    1. It provides a way to validate data both on the client and server side.
    2. It is cool to be RESTful.

    Ok, so while number one is useful, number two is kind of silly. And actually, I think it is ok to not be truly RESTful but still use some of the REST principles to build an API. My favorite example right now is the Google Calendar API, which uses resources for data representation and HTTP verbs to interact with these resources but is not hypertext-driven. According to the Richardson Maturity Model, it is probably level 2 REST, if there is such a thing. With Antecons, this is the level of REST I am aiming for.
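    To make the difference concrete, here is a rough sketch of the same resource in both styles. The resource, its fields and the URL paths are hypothetical and not taken from the Antecons API, and the `links` layout is just one ad hoc convention, since JSON itself does not prescribe one.

```python
import json

# Level 2 style: a plain resource. The client must already know the URLs.
level2_item = {"id": 42, "name": "Coffee grinder"}

# Hypertext-driven style: the representation tells the client where it
# can go next. The "links" layout is an ad hoc convention (an assumption).
hypertext_item = {
    "id": 42,
    "name": "Coffee grinder",
    "links": [
        {"rel": "self", "href": "/items/42"},
        {"rel": "recommendations", "href": "/items/42/recommendations"},
    ],
}


def follow(representation, rel):
    """Return the href for a link relation, or None if it is absent."""
    for link in representation.get("links", []):
        if link["rel"] == rel:
            return link["href"]
    return None


print(json.dumps(hypertext_item, indent=2))
print(follow(hypertext_item, "recommendations"))
print(follow(level2_item, "recommendations"))
```

    With the level 2 representation, the client has to build the recommendations URL from documentation; with the hypertext-driven one, it only needs to know the name of the link relation.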

    So what about JSON schema? Well, currently I am writing a JSON schema for every data representation in Antecons, but I have not decided whether I should continue doing so. If using JSON schema means that clients can autogenerate some of their code, then that is fantastic. But I do not know of any ready-made tools for that, comparable to e.g. the WSDL support in Visual Studio.
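    As an illustration of the validation advantage, the sketch below checks a document against a schema. It is a hand-rolled validator that understands only a tiny subset of keywords (`type`, `required`, `properties`, `items`) and is not an implementation of the JSON schema draft. The transaction schema and its field names are hypothetical, and the list form of `required` is the one used in later drafts of the specification.

```python
# Maps the schema type names used below to Python types.
TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def matches_type(value, expected):
    if expected == "integer" and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, TYPES[expected])


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means valid."""
    expected = schema.get("type")
    if expected and not matches_type(instance, expected):
        return ["%s: expected %s" % (path, expected)]
    errors = []
    if expected == "object":
        for name in schema.get("required", []):
            if name not in instance:
                errors.append("%s: missing required property '%s'" % (path, name))
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], subschema, path + "." + name))
    if expected == "array" and "items" in schema:
        for i, element in enumerate(instance):
            errors.extend(validate(element, schema["items"], "%s[%d]" % (path, i)))
    return errors


# Hypothetical schema for an uploaded transaction.
transaction_schema = {
    "type": "object",
    "required": ["id", "items"],
    "properties": {
        "id": {"type": "string"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
}

print(validate({"id": "t1", "items": ["a", "b"]}, transaction_schema))
print(validate({"id": "t1", "items": ["a", 2]}, transaction_schema))
print(validate({"items": []}, transaction_schema))
```

    Because the schema is plain data, the same document could be shipped to a client and used for validation there as well.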

    Progress

    After this Friday, I should have another crude alpha version ready for publishing. After it is out there, I will write about it here. The API cannot do much right now. You can just upload some simple data and view it. Basic basic stuff. But I also hope to have the first data analysis algorithms ready within the near future. Then it starts getting exciting.

  • Antecons developer diary part 1

    It has been a year since the first announcement of Antecons. What has happened since then? Absolutely nothing. The reason for this seems to be pretty straightforward: I have been occupied with full-time consulting and I have made a deliberate choice of working a five-day 40-hour work week. That leaves no room for extra work. However, this has changed slightly. Since October 1, I have had a four-day work week. This leaves one day to do other stuff and I have decided to continue my work on Antecons and I intend to blog about the (slow) progress.

    Consider this part 1 in a series of posts about my experiences, design decisions and the current state of Antecons. First, I would like to fill in the gap in the history of Antecons from last year until now. There is more to the story than just a lack of time.

    Where we left off

    One year ago, I made a promise, a prototype and a teaser. The prototype is a recommendation engine that is still running on one webshop (as far as I know). It is written in C# for .NET, uses Microsoft SQL Server as its database and is deployed directly on the end-user’s server using a pre-compiled DLL that can be referenced from any .NET application. As I was developing the prototype, I already had a couple of basic objections to my own design:

    1. It required the user to use .NET and SQL Server.
    2. It required me to use Visual Studio, IIS and Windows as a development environment.
    3. It required the user to integrate the component into their webshop. This included:
      1. Setting up an SQL database and tables conforming with the Antecons data format.
      2. Synchronizing webshop purchases with the Antecons database and keeping it updated.
      3. Manually running the recommendation algorithm to pre-create recommendations.
      4. Integrating Antecons into the webshop design in the appropriate places to show the pre-created recommendations for the current product being shown.

    The first two points were easily fixed by deciding to replace the pre-compiled DLL component with a webservice. Using an API for Antecons, there would be no constraints on the user regarding their server software or webshop programming environment.

    And thus was born the teaser, a web application written in Python and running on Google App Engine. It never turned into a webservice with an API. Instead it was a website that showed what the essence of Antecons was: A recommendation engine based on association rule mining (or market basket analysis). As it turned out, this was the easy part and implementing a simple API would have been too. But the third point I outlined above, that was the real killer.
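    For readers unfamiliar with association rule mining, the sketch below shows the basic idea on a handful of purchases. It is not the Antecons implementation: it only considers pairs of items, the purchase data is made up, and the `min_confidence` threshold is an arbitrary assumption.

```python
import itertools
from collections import Counter


def association_rules(transactions, min_confidence=0.5):
    """Derive pairwise rules (antecedent, consequent, confidence)."""
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(itertools.combinations(items, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        # confidence(x -> y) = count(x and y together) / count(x)
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = together / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Made-up purchases, one list of items per order.
purchases = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread"],
    ["butter", "jam"],
]

for antecedent, consequent, confidence in association_rules(purchases):
    print("%s -> %s (confidence %.2f)" % (antecedent, consequent, confidence))
```

    A rule such as "jam -> butter" reads as "customers who bought jam also bought butter", which is the kind of statement a webshop can turn into a product recommendation.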

    The promise

    A quote from my company website:

    Metacircle has been developing a data analysis tool for some time. It is called Antecons.

    And from the Antecons teaser website:

    The API is currently under development…

    These quotes sum up the promise I made myself (and the world): To work on Antecons. Did I succeed? No. Do I feel bad about that? Yes. What killed the project? Time and wishful thinking.

    Time grew short when I started working full-time, but this is only part of the explanation. As hinted above, I was also held back by the third design objection. Setting up Antecons for actual usage was, in my opinion, too difficult and too much of an obstacle for widespread usage. How could Antecons be made super easy to set up, super easy to deploy and super easy to run? The ideal scenario looked like this:

    1. User registers for Antecons and receives a unique User ID.
    2. User inserts 10 lines of JavaScript code in the header of each webshop page where Antecons should show recommendations.
    3. Antecons takes care of the rest.

    This is how e.g. Google Analytics works and it is quite amazing how much information it gathers and how useful this information is, based on such a simple installation. If Antecons could be as easy to install as Google Analytics, which is installed on millions of websites, there would be a huge potential and many users could benefit from it.

    But the difficult questions piled up fast. For example, how would Antecons gather information on a user’s sales data, and how would it know where to show recommendations on a user’s webshop? The answers I came up with all led to the same conclusion: The user would have to do some work and Antecons could not be fully automated, not even partially. This shattered my overall vision for Antecons and my head has been throwing around ideas ever since. Maybe the ideal scenario was wishful thinking, but in any case, this was the end of the line. Until now.

    Where we are

    It has been two weeks since I started over with the Antecons project. In the next blog post, I will elaborate on the specific design obstacles that put the project on ice. This is important background for what I have started working on right now. Later, I will write more about specific design decisions and actual implementation. The API should even have some partial functionality coming up very soon. Stay tuned.