Thought Flow

Tag: Linq

  • Strange Linq-to-SQL performance when using Count

    Yesterday, I was writing a Linq-to-SQL query and noticed a quite remarkable difference in performance between two very similar queries (seconds versus minutes of running time). I thought it would be worth sharing.

    Basically, I have a bunch of orders that are represented by two SQL tables, called Basket and BasketItem. If an order e.g. consists of the purchase of items A, B and C, the corresponding rows in the tables could be:

    Basket:     ID = 1
    BasketItem: ID = 1   BasketID = 1   ItemID = A
    BasketItem: ID = 2   BasketID = 1   ItemID = B
    BasketItem: ID = 3   BasketID = 1   ItemID = C
    

    This is a fairly common way to represent order information in a relational database, establishing a one-to-many relationship between an order and its separate order lines.

    Now, somewhere in the code I am working on, I have an item set (for example { A, B }) and I want to figure out in how many orders these two items occur together. The items are represented in a list:

    // items is given as a parameter
    // but manually initialized here for descriptive purposes
    List<object> items = new List<object>() { "A", "B" };
    

    First version of the solution to the problem looked like this:

    int count = 0;
    foreach (Basket b in Baskets)
    {
      IEnumerable<BasketItem> bItems = b.GetBasketItems();
      if (items.All(item => bItems.Any(bItem => bItem.ItemID.Equals(item))))
        count++;
    }
    return count;
    

    If the BasketItems for a Basket contains all the given items, increment a count. This is obviously not optimal since it requires all BasketItems to be fetched from the database. After a lot of pondering and test, I came up with the following Linq query:

    int minLines = items.Count;
    var count =
      from basket in Baskets
      where
        (from basketItem in BasketItems
         where basketItem.BasketID == basket.BasketID &&
         items.Contains(basketItem.ItemID)
         select basketItem).Count() >= minLines
      select basket;
    return count.Count();
    

    If the query finds items.Count or more BasketItems for a given Basket, the basket is selected and the number of valid baskets are returned in the end. This dropped the running time from ~50 seconds to ~8 seconds for about 7000 calls to the method with the above code (for different items lists), a nice improvement.

    I then thought that the above Linq query could be further improved by using Count() directly instead of using Where clauses. It then rewrites to:

    var count =
      AnteconsBaskets.Count(basket =>
        AnteconsBasketItems.Count(basketItem =>
          basketItem.BasketID == basket.BasketID &&
          items.Contains(basketItem.ItemID)) >= minLines);
    return count;
    

    Using the above code, I had to manually halt the program after waiting more than 10 minutes for execution to complete.

    The question now is: Why did this code perform so poorly compared to the code in listing 2? I cannot offer an explanation of this but I can hint at it. The code in listing 2 translates into an SQL query where COUNT(*) is used in the sub-query for the BasketItem selection. For listing 3, Count() translates into an SQL query that uses a SELECT CASE WHEN query with three cases. Why this is so slow, I do not know but I will definitely not use Count() like this in the future.