Linq – Thought Flow

Yesterday, I was writing a Linq-to-SQL query and noticed a quite remarkable difference in performance between two very similar queries (seconds versus minutes of running time). I thought it would be worth sharing.

Basically, I have a bunch of orders that are represented by two SQL tables, called Basket and BasketItem. If an order e.g. consists of the purchase of items A, B and C, the corresponding rows in the tables could be:

Basket:     ID = 1
BasketItem: ID = 1   BasketID = 1   ItemID = A
BasketItem: ID = 2   BasketID = 1   ItemID = B
BasketItem: ID = 3   BasketID = 1   ItemID = C

This is a fairly common way to represent order information in a relational database, establishing a one-to-many relationship between an order and its separate order lines.

Now, somewhere in the code I am working on, I have an item set (for example { A, B }) and I want to figure out in how many orders these two items occur together. The items are represented in a list:

// items is given as a parameter
// but manually initialized here for descriptive purposes
List<object> items = new List<object>() { "A", "B" };

First version of the solution to the problem looked like this:

int count = 0;
foreach (Basket b in Baskets)
{
  IEnumerable<BasketItem> bItems = b.GetBasketItems();
  if (items.All(item => bItems.Any(bItem => bItem.ItemID.Equals(item))))
    count++;
}
return count;

If the BasketItems for a Basket contains all the given items, increment a count. This is obviously not optimal since it requires all BasketItems to be fetched from the database. After a lot of pondering and test, I came up with the following Linq query:

int minLines = items.Count;
var count =
  from basket in Baskets
  where
    (from basketItem in BasketItems
     where basketItem.BasketID == basket.BasketID &&
     items.Contains(basketItem.ItemID)
     select basketItem).Count() >= minLines
  select basket;
return count.Count();

If the query finds items.Count or more BasketItems for a given Basket, the basket is selected and the number of valid baskets are returned in the end. This dropped the running time from ~50 seconds to ~8 seconds for about 7000 calls to the method with the above code (for different items lists), a nice improvement.

I then thought that the above Linq query could be further improved by using Count() directly instead of using Where clauses. It then rewrites to:

var count =
  AnteconsBaskets.Count(basket =>
    AnteconsBasketItems.Count(basketItem =>
      basketItem.BasketID == basket.BasketID &&
      items.Contains(basketItem.ItemID)) >= minLines);
return count;

Using the above code, I had to manually halt the program after waiting more than 10 minutes for execution to complete.

The question now is: Why did this code perform so poorly compared to the code in listing 2? I cannot offer an explanation of this but I can hint at it. The code in listing 2 translates into an SQL query where COUNT(*) is used in the sub-query for the BasketItem selection. For listing 3, Count() translates into an SQL query that uses a SELECT CASE WHEN query with three cases. Why this is so slow, I do not know but I will definitely not use Count() like this in the future.

Tag: Linq

Strange Linq-to-SQL performance when using Count