Monday, February 8, 2010

Scalar Subqueries

Better formatted at
Query 2
The VST diagram looks like

There are two interesting things about this diagram.
Every thing is OUTER JOINED to F_OUTER
There are correlated subqueries in the select
There are two things to check - what is the effect of the OUTER JOINS. The OUTER JOINS can easily mean that all joins into F_OUTER don't change the result set size. Let's confirm by looking at the TTJ sizes:

The only thing that bounds the number of rows returned by F_OUTER is it's self join (the line going in and out of F_OUTER) on the bottom left of F_OUTER.
What this means is that it doesn't matter what order we join tables into F_OUTER.
Now we can turn to the other interesting thing about the query. There are 4 correlated subselects in the select clause. These queries in the select clause are called "scalar subqueries." These scalar subqueries can be a bad idea or a good idea depending on how mainly on how many distinct values are used as input into the them. AFAIK, Oracle doesn't merge subqueries in the select, and certainly not the ones in this query because they are embedded in cases statements - looks like I might have spoken too soon! more reseach to do but looks like Oracle can merge these scalar subqueries even with the case statement. I will try to run some more tests . To be continued)
In the worst case scenario the scalar subqueries are going to be run 845,012 times - that is one heck of a lot of work which would be a BAD IDEA.

Looking at the above diagram, and the 4 scalar subqueries in the select clause, the top red number is how many times the scalar subquery will be run (based on the case statement in the select clause) and the orange highlight is how many distinct values will be used in the scalar subquery where clause. P3 will benefit for scalar subquery caching but F won't because there are too many distinct values. On the other hand for P1 and P2 could if there are no collisions (see CBOF p216-217) and the scalar subquery caching is actually supports 1024 values. ( see for a great write up on analysis of scalar subquery caching - the analysis on this page seems to show that caching maxes out way before 1024 but that might be because of collisions)

The subqueries in the select clause look like

WHERE code_vl = F.f1)
from F;

and could be merged into the query like:

ELSE ( X.f2)
from F , X
where code_vl(+) = F.f1;

( NOTE: the first query will break if the correlated sub query returns more than one value where as the second query will return the mulitple rows.)

The VST diagram can't tell you that this is the solution, but they can point out that the place to spend your time is the subqueries in the select.


  1. "we have 4 queries in th select clause that are going to be run 845,012 times"
    Not necessarily true as the select is embedded in a CASE and will only be run if the relevant value is not null. Also they can benefit from scalar subqeury caching.
    Finally, the first of those four queries selects an aggregate which is trickier to accommodate in the main query.

  2. Hi Gary,
    Good points - I added a 3rd diagram showing the number of times the scalar subqueries are run and the number of distinct values used in the scalar subqueries.
    For the case of 12 distinct values, node P3, the performance is better with the case statement.
    I will try and put up execution stats soon if I have time.
    Thanks for the link - great analysis and nice to have the table of of ranges for the scalar subquery caching.
    Jonathan Lewis also gives an interesting example using integer fields in CBOF p216-217 where the the cached values can have collisions (not just the size of the cache) and thus ruin the effect of the cache. Lot to keep track of. In the case above the fields are varchar2