Thought Flow

Tag: SQL

  • NOT MATERIALIZED

    NOT MATERIALIZED

    I managed to reduce a PostgreSQL view’s query time by more than 80% with just the two keywords NOT MATERIALIZED and by reading the actual database manual when GPT-4 gave up.

    Below, I will explain the context of the speedup.

    I recently moved a little bit of business logic from an API endpoint and into a PostgreSQL view, because it made some things a bit easier to manage. The view uses common table expression (CTE), and one of them looks something like this (not the actual code, but very similar):

    WITH first_and_last_score AS (
      SELECT DISTINCT ON (company_id, user_id)
        company_id,
        user_id,
        LAST_VALUE(score) OVER (PARTITION BY company_id ORDER BY created_at) AS last_company_score,
        LAST_VALUE(score) OVER (PARTITION BY company_id, user_id ORDER BY created_at) AS last_user_score
      FROM scores
    )
    -- ...
    -- more CTEs and queries
    -- ...
    

    The query finds the last score for each company and the last score for each user in that company using the LAST_VALUE window function.

    I was not particularly nervous using the above code for live data, because I knew that the view would always be used with a filter on company_id, and the CTE would thus not to a full table scan of the scores table because the query optimizer would be smart enough to filter the score table first.

    Alas, when profiling the view, it was indeed making a table scan on the scores table in my view, and it was a major bottle-neck for performance.

    It took several hours with trial and error to figure out what was going on, until I finally decided to actually read the Postgresql manual on CTEs, specifically how CTEs are automatically folded into the parent query:

    “… it can be folded into the parent query, allowing joint optimization of the two query levels. By default, this happens if the parent query references the WITH query just once, but not if it references the WITH query more than once
    (Emphasis mine)

    The problem was that I referenced the scores CTE twice and thus it was materialized and did not take advantage of the filter on company ID, I was using on the view.

    In the end, adding NOT MATERIALIZED to the CTE was the fix:

    WITH first_and_last_score AS NOT MATERIALIZED (
      ...
    )
    

    Just this change made reduced query time by roughly 80% for most queries to the view, and since the view was ultimately being served to a front-end, it was the different between “this feels a bit slow” to “this feels instant”.

  • Strange Linq-to-SQL performance when using Count

    Yesterday, I was writing a Linq-to-SQL query and noticed a quite remarkable difference in performance between two very similar queries (seconds versus minutes of running time). I thought it would be worth sharing.

    Basically, I have a bunch of orders that are represented by two SQL tables, called Basket and BasketItem. If an order e.g. consists of the purchase of items A, B and C, the corresponding rows in the tables could be:

    Basket:     ID = 1
    BasketItem: ID = 1   BasketID = 1   ItemID = A
    BasketItem: ID = 2   BasketID = 1   ItemID = B
    BasketItem: ID = 3   BasketID = 1   ItemID = C
    

    This is a fairly common way to represent order information in a relational database, establishing a one-to-many relationship between an order and its separate order lines.

    Now, somewhere in the code I am working on, I have an item set (for example { A, B }) and I want to figure out in how many orders these two items occur together. The items are represented in a list:

    // items is given as a parameter
    // but manually initialized here for descriptive purposes
    List<object> items = new List<object>() { "A", "B" };
    

    First version of the solution to the problem looked like this:

    int count = 0;
    foreach (Basket b in Baskets)
    {
      IEnumerable<BasketItem> bItems = b.GetBasketItems();
      if (items.All(item => bItems.Any(bItem => bItem.ItemID.Equals(item))))
        count++;
    }
    return count;
    

    If the BasketItems for a Basket contains all the given items, increment a count. This is obviously not optimal since it requires all BasketItems to be fetched from the database. After a lot of pondering and test, I came up with the following Linq query:

    int minLines = items.Count;
    var count =
      from basket in Baskets
      where
        (from basketItem in BasketItems
         where basketItem.BasketID == basket.BasketID &&
         items.Contains(basketItem.ItemID)
         select basketItem).Count() >= minLines
      select basket;
    return count.Count();
    

    If the query finds items.Count or more BasketItems for a given Basket, the basket is selected and the number of valid baskets are returned in the end. This dropped the running time from ~50 seconds to ~8 seconds for about 7000 calls to the method with the above code (for different items lists), a nice improvement.

    I then thought that the above Linq query could be further improved by using Count() directly instead of using Where clauses. It then rewrites to:

    var count =
      AnteconsBaskets.Count(basket =>
        AnteconsBasketItems.Count(basketItem =>
          basketItem.BasketID == basket.BasketID &&
          items.Contains(basketItem.ItemID)) >= minLines);
    return count;
    

    Using the above code, I had to manually halt the program after waiting more than 10 minutes for execution to complete.

    The question now is: Why did this code perform so poorly compared to the code in listing 2? I cannot offer an explanation of this but I can hint at it. The code in listing 2 translates into an SQL query where COUNT(*) is used in the sub-query for the BasketItem selection. For listing 3, Count() translates into an SQL query that uses a SELECT CASE WHEN query with three cases. Why this is so slow, I do not know but I will definitely not use Count() like this in the future.