Yesterday, I was writing a Linq-to-SQL query and noticed a quite remarkable difference in performance between two very similar queries (seconds versus minutes of running time). I thought it would be worth sharing.
Basically, I have a bunch of orders that are represented by two SQL tables, called Basket
and BasketItem
. If an order e.g. consists of the purchase of items A, B and C, the corresponding rows in the tables could be:
Basket: ID = 1 BasketItem: ID = 1 BasketID = 1 ItemID = A BasketItem: ID = 2 BasketID = 1 ItemID = B BasketItem: ID = 3 BasketID = 1 ItemID = C
This is a fairly common way to represent order information in a relational database, establishing a one-to-many relationship between an order and its separate order lines.
Now, somewhere in the code I am working on, I have an item set (for example { A, B }) and I want to figure out in how many orders these two items occur together. The items are represented in a list:
// items is given as a parameter // but manually initialized here for descriptive purposes List<object> items = new List<object>() { "A", "B" };
First version of the solution to the problem looked like this:
int count = 0; foreach (Basket b in Baskets) { IEnumerable<BasketItem> bItems = b.GetBasketItems(); if (items.All(item => bItems.Any(bItem => bItem.ItemID.Equals(item)))) count++; } return count;
If the BasketItem
s for a Basket
contains all the given items, increment a count. This is obviously not optimal since it requires all BasketItem
s to be fetched from the database. After a lot of pondering and test, I came up with the following Linq query:
int minLines = items.Count; var count = from basket in Baskets where (from basketItem in BasketItems where basketItem.BasketID == basket.BasketID && items.Contains(basketItem.ItemID) select basketItem).Count() >= minLines select basket; return count.Count();
If the query finds items.Count
or more BasketItem
s for a given Basket
, the basket is selected and the number of valid baskets are returned in the end. This dropped the running time from ~50 seconds to ~8 seconds for about 7000 calls to the method with the above code (for different items
lists), a nice improvement.
I then thought that the above Linq query could be further improved by using Count()
directly instead of using Where
clauses. It then rewrites to:
var count = AnteconsBaskets.Count(basket => AnteconsBasketItems.Count(basketItem => basketItem.BasketID == basket.BasketID && items.Contains(basketItem.ItemID)) >= minLines); return count;
Using the above code, I had to manually halt the program after waiting more than 10 minutes for execution to complete.
The question now is: Why did this code perform so poorly compared to the code in listing 2? I cannot offer an explanation of this but I can hint at it. The code in listing 2 translates into an SQL query where COUNT(*)
is used in the sub-query for the BasketItem
selection. For listing 3, Count()
translates into an SQL query that uses a SELECT CASE WHEN
query with three cases. Why this is so slow, I do not know but I will definitely not use Count()
like this in the future.