Recursive CTEs continued ...


In this post, I will finish the discussion of recursive CTEs that I began in my last post.  I will continue to use the CTE examples from Books Online.  To run these examples, you'll need to install the Adventure Works Cycles OLTP sample database.

In my last post, I explained that all recursive queries follow the same pattern of one or more anchor sub-selects and one or more recursive sub-selects combined by a UNION ALL.  Similarly, all recursive query plans also follow the same pattern which looks like so:

  |--Index Spool(WITH STACK)
       |--Concatenation
            |--Compute Scalar(DEFINE:([Expr10XX]=(0)))
            |    |-- ... anchor sub-select plan(s) ...
            |--Assert(WHERE:(CASE WHEN [Expr10ZZ]>(100) THEN (0) ELSE NULL END))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr10YY], [Recr10XX], ...))
                      |--Compute Scalar(DEFINE:([Expr10ZZ]=[Expr10YY]+(1)))
                      |    |--Table Spool(WITH STACK)
                      |-- ... recursive sub-select plan(s) ...

Because of this basic plan shape, SQL Server cannot execute recursive queries using parallel plans.  There are two basic problems.  First, the stack spool does not support parallel execution.  Second, the nested loops join does not support parallel execution on its inner input.

Finally, let's look at how the placement of a WHERE clause can have a big impact on the performance of a recursive query.  In my last post, I dissected a recursive query that returns a list of all employees and their level within the organization.  Suppose that we want to run the same query but limit the results to those employees in the first two levels.  We could write the following query (which you'll also find in Books Online) with an extra WHERE clause to limit the results:

WITH DirectReports(ManagerID, EmployeeID, EmployeeLevel) AS
(
    SELECT ManagerID, EmployeeID, 0 AS EmployeeLevel
    FROM HumanResources.Employee
    WHERE ManagerID IS NULL
    UNION ALL
    SELECT e.ManagerID, e.EmployeeID, EmployeeLevel + 1
    FROM HumanResources.Employee e
        INNER JOIN DirectReports d
        ON e.ManagerID = d.EmployeeID
)
SELECT ManagerID, EmployeeID, EmployeeLevel
FROM DirectReports
WHERE EmployeeLevel <= 2

This query uses essentially the same plan as the query without the extra WHERE clause.  The WHERE clause simply introduces a filter operator at the root of the plan:

 Rows  Executes
 34    1     |--Filter(WHERE:([Recr1012]<=(2)))
 290   1          |--Index Spool(WITH STACK)
 290   1               |--Concatenation
 0     0                    |--Compute Scalar(DEFINE:([Expr1013]=(0)))
 0     0                    |    |--Compute Scalar(DEFINE:([Expr1003]=(0)))
 1     1                    |         |--Index Seek(OBJECT:([Employee].[IX_Employee_ManagerID]), SEEK:([ManagerID]=NULL) ...)
 289   1                    |--Assert(WHERE:(CASE WHEN [Expr1015]>(100) THEN (0) ELSE NULL END))
 289   1                         |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1015], [Recr1006], [Recr1007], [Recr1008]))
 0     0                              |--Compute Scalar(DEFINE:([Expr1015]=[Expr1014]+(1)))
 290   1                              |    |--Table Spool(WITH STACK)
 0     0                              |--Compute Scalar(DEFINE:([Expr1009]=[Recr1008]+(1)))
 289   290                                 |--Index Seek(OBJECT:([Employee].[IX_Employee_ManagerID]), SEEK:([ManagerID]=[Recr1007]) ...)

As you can see from the row counts, this query computes the level of every employee in the organization before discarding the results that we do not want.  Now, suppose that we instead move the WHERE clause into the recursive sub-select:

WITH DirectReports(ManagerID, EmployeeID, EmployeeLevel) AS
(
    SELECT ManagerID, EmployeeID, 0 AS EmployeeLevel
    FROM HumanResources.Employee
    WHERE ManagerID IS NULL
    UNION ALL
    SELECT e.ManagerID, e.EmployeeID, EmployeeLevel + 1
    FROM HumanResources.Employee e
        INNER JOIN DirectReports d
        ON e.ManagerID = d.EmployeeID 
    WHERE EmployeeLevel < 2
)
SELECT ManagerID, EmployeeID, EmployeeLevel
FROM DirectReports

This query is semantically identical to the previous query, but if we look at the plan, we can see that the filter moves into the recursive portion of the plan:

 Rows  Executes
 34    1     |--Index Spool(WITH STACK)
 34    1          |--Concatenation
 0     0               |--Compute Scalar(DEFINE:([Expr1013]=(0)))
 0     0               |    |--Compute Scalar(DEFINE:([Expr1003]=(0)))
 1     1               |         |--Index Seek(OBJECT:([Employee].[IX_Employee_ManagerID]), SEEK:([ManagerID]=NULL) ...)
 33    1               |--Assert(WHERE:(CASE WHEN [Expr1015]>(100) THEN (0) ELSE NULL END))
 33    1                    |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1015], [Recr1006], [Recr1007], [Recr1008]))
 0     0                         |--Compute Scalar(DEFINE:([Expr1015]=[Expr1014]+(1)))
 34    1                         |    |--Table Spool(WITH STACK)
 0     0                         |--Compute Scalar(DEFINE:([Expr1009]=[Recr1008]+(1)))
 33    34                             |--Nested Loops(Inner Join)
 7     34                                  |--Filter(WHERE:(STARTUP EXPR([Recr1008]<(2))))
 7     7                                   |    |--Constant Scan
 33    7                                   |--Index Seek(OBJECT:([Employee].[IX_Employee_ManagerID]), SEEK:([ManagerID]=[Recr1007]) ...)

Notice from the row counts that the new plan computes only the portion of the result set that really interests us and then terminates the recursion.

You may also observe that the plan now has a startup filter.  An ordinary filter reads rows from its input tree and then evaluates a Boolean expression on each row to determine whether to return it.  A startup filter evaluates its expression only once per execution.  The expression does not depend on the input rows.  If the expression evaluates to true, the startup filter returns all of its input rows.  If the expression evaluates to false, the startup filter returns no rows.

In this example, if the startup filter expression evaluates to true, the constant scan returns one row to the nested loops join immediately above the filter and this join then executes the index seek on its inner input.  However, if the startup filter expression evaluates to false, the filter and then the nested loops join return no rows.
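Startup filters typically appear when part of a plan only needs to run if some run-time condition holds.  Here is a sketch of my own (not from Books Online; the variable, the query shape, and the resulting plan are illustrative and may vary by version and statistics) where SQL Server will often guard the EXISTS branch with a startup filter so that the subquery is not executed at all when the variable test fails:

DECLARE @IncludeMatchesOnly INT
SET @IncludeMatchesOnly = 1

SELECT e.EmployeeID
FROM HumanResources.Employee e
WHERE @IncludeMatchesOnly = 1
    AND EXISTS (SELECT * FROM HumanResources.Employee m WHERE m.ManagerID = e.EmployeeID)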

While there are many scenarios where a startup filter is genuinely useful, this plan is slightly more complex than is really necessary.  The plan could have used an ordinary filter:

  |--Index Spool(WITH STACK)
       |--Concatenation
            |--Compute Scalar(DEFINE:([Expr1013]=(0)))
            |    |--Compute Scalar(DEFINE:([Expr1003]=(0)))
            |         |--Index Seek(OBJECT:([HumanResources].[Employee].[IX_Employee_ManagerID]), SEEK:([HumanResources].[Employee].[ManagerID]=NULL) ORDERED FORWARD)
            |--Assert(WHERE:(CASE WHEN [Expr1015]>(100) THEN (0) ELSE NULL END))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1015], [Recr1006], [Recr1007], [Recr1008]))
                      |--Filter(WHERE:([Recr1008]<(2)))
                      |    |--Compute Scalar(DEFINE:([Expr1015]=[Expr1014]+(1)))
                      |         |--Table Spool(WITH STACK)
                      |--Compute Scalar(DEFINE:([Expr1009]=[Recr1008]+(1)))
                           |--Index Seek(OBJECT:([HumanResources].[Employee].[IX_Employee_ManagerID] AS [e]), SEEK:([e].[ManagerID]=[Recr1007]) ORDERED FORWARD)

Note that this plan is artificial.  You cannot get SQL Server to generate it.


Repeatable Read Isolation Level


In my last two posts, I showed how queries running at read committed isolation level may generate unexpected results in the presence of concurrent updates.  Many but not all of these results can be avoided by running at repeatable read isolation level.  In this post, I'll explore how concurrent updates may affect queries running at repeatable read.

Unlike a read committed scan, a repeatable read scan retains locks on every row it touches until the end of the transaction.  Even rows that do not qualify for the query result remain locked.  These locks ensure that the rows touched by the query cannot be updated or deleted by a concurrent session until the current transaction completes (whether it is committed or rolled back).  These locks do not protect rows that have not yet been scanned from updates or deletes and do not prevent the insertion of new rows amid the rows that are already locked.  The following graphic illustrates this point:

Note that the capability to insert new "phantom" rows between locked rows that have already been scanned is the principal difference between the repeatable read and serializable isolation levels.  A serializable scan acquires a key range lock which prevents the insertion of any new rows anywhere within the range (as well as the update or deletion of any existing rows within the range).
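For a quick contrast, here is a minimal sketch of my own (not one of the examples below) showing a key range lock blocking a phantom insert; it assumes a throwaway table with a gap in its keys:

create table s (a int primary key, b int)
insert s values (1, 1)
insert s values (3, 3)

-- Session 1: scan the range under serializable and leave the transaction open
set transaction isolation level serializable
begin tran
select * from s where a between 1 and 3

-- Session 2: this insert falls inside the locked key range and blocks until
-- session 1 commits or rolls back; under repeatable read it would succeed
-- immediately
insert s values (2, 2)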

In the remainder of this post, I'll give a couple of examples of how we can get unexpected results even while running queries at repeatable read isolation level.  These examples are similar to the ones from my previous two posts.

Row Movement

First, let's see how we can move a row and cause a repeatable read scan to miss it.  As with all of the other examples in this series of posts, we'll need two sessions.  Begin by creating this simple table:

create table t (a int primary key, b int)
insert t values (1, 1)
insert t values (2, 2)
insert t values (3, 3)

Next, in session 1 lock the second row:

begin tran
update t set b = 2 where a = 2

Now, in session 2 run a repeatable read scan of the table:

select * from t with (repeatableread)

This scan reads the first row then blocks waiting for session 1 to release the lock it holds on the second row.  While the scan is blocked, in session 1 let's move the third row to the beginning of the table before committing the transaction and releasing the exclusive lock blocking session 2:

update t set a = 0 where a = 3
commit tran

As we expect, session 2 completely misses the third row and returns just two rows:

 a           b
 ----------- -----------
 1           1
 2           2

Note that if we change the experiment so that session 1 tries to touch the first row in the table, it will cause a deadlock with session 2 which holds a lock on this row.

Phantom Rows

Let's also take a look at how phantom rows can cause unexpected results.  This experiment is similar to the nested loops join experiment from my previous post.  Begin by creating two tables:

create table t1 (a1 int primary key, b1 int)
insert t1 values (1, 9)
insert t1 values (2, 9)

create table t2 (a2 int primary key, b2 int)

Now, in session 1 lock the second row of table t1:

begin tran
update t1 set a1 = 2 where a1 = 2

Next, in session 2 run the following outer join at repeatable read isolation level:

set transaction isolation level repeatable read
select * from t1 left outer join t2 on b1 = a2

The query plan for this join uses a nested loops join:

  |--Nested Loops(Left Outer Join, WHERE:([t1].[b1]=[t2].[a2]))
       |--Clustered Index Scan(OBJECT:([t1].[PK__t1]))
       |--Clustered Index Scan(OBJECT:([t2].[PK__t2]))

This plan scans the first row from t1, tries to join it with t2, finds there are no matching rows, and outputs a null extended row.  It then blocks waiting for session 1 to release the lock on the second row of t1.  Finally, in session 1, insert a new row into t2 and release the lock:

insert t2 values (9, 0)
commit tran

Here is the output from the outer join:

 a1          b1          a2          b2
 ----------- ----------- ----------- -----------
 1           9           NULL        NULL
 2           9           9           0

Notice that we have both a null extended and a joined row for the same join key!

Summary

As I pointed out at the conclusion of my previous post, I want to emphasize that the above results are not incorrect but rather are a side effect of running at a reduced isolation level.  SQL Server guarantees that the committed data is consistent at all times.

CLARIFICATION 8/26/2008: The above examples work as I originally described if they are executed in tempdb. However, the SELECT statements in session 2 may not block as described if the examples are executed in other databases due to an optimization where SQL Server avoids acquiring read committed locks when it knows that no data has changed on a page. If you encounter this problem, either run these examples in tempdb or change the UPDATE statements in session 1 so that they actually change the data in the updated row. For instance, for the first example try "update t set b = 12 where a = 2".


ROWCOUNT Top


If you've looked at any insert, update, or delete plans, including those used in some of my posts, you've probably noticed that nearly all such plans include a top operator.  For example, the following update statement yields this plan:

CREATE TABLE T (A INT)
INSERT T VALUES (0)
INSERT T VALUES (1)
INSERT T VALUES (2)

UPDATE T SET A = A + 1

Rows  Executes
3      1      UPDATE [T] set [A] = [A]+@1
3      1        |--Table Update(OBJECT:([T]), SET:([T].[A] = [Expr1004]))
0      0             |--Compute Scalar(DEFINE:([Expr1004]=[T].[A]+[@1]))
3      1                  |--Top(ROWCOUNT est 0)
3      1                       |--Table Scan(OBJECT:([T]))

What is that top operator doing right above the table scan?

It is a ROWCOUNT top.  It is used to implement SET ROWCOUNT functionality.  The "est 0" indicates that SET ROWCOUNT was 0 when the query was compiled.  (I suppose "est" is short for "estimate" though the value at compilation time has no impact on query optimization or execution.)  Recall that a value of 0 means return or update all rows.  Since SET ROWCOUNT was 0 at execution time as well, we can see from the STATISTICS PROFILE output that all 3 rows were updated.
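As an aside of my own, the Rows and Executes columns shown in these plans come from SET STATISTICS PROFILE; you can reproduce them by turning it on before running the statements:

SET STATISTICS PROFILE ON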

Now try the following:

SET ROWCOUNT 1
UPDATE T SET A = A + 1

Rows  Executes
1      1      UPDATE [T] set [A] = [A]+@1
1      1        |--Table Update(OBJECT:([T]), SET:([T].[A] = [Expr1004]))
0      0             |--Compute Scalar(DEFINE:([Expr1004]=[T].[A]+[@1]))
1      1                  |--Top(ROWCOUNT est 0)
1      1                       |--Table Scan(OBJECT:([T]))

Although we get the same plan including the ROWCOUNT top with the same "estimate," this time SET ROWCOUNT was 1 at execution time, the top returned only one row from the table scan, and we can see that only 1 row was updated.

If we force a recompile, we see that the value of the "estimate" changes:

SET ROWCOUNT 1
UPDATE T SET A = A + 1 OPTION (RECOMPILE)

Rows  Executes
1      1      UPDATE T SET A = A + 1 OPTION (RECOMPILE)
1      1        |--Table Update(OBJECT:([T]), SET:([T].[A] = [Expr1004]))
0      0             |--Compute Scalar(DEFINE:([Expr1004]=[T].[A]+(1)))
1      1                  |--Top(ROWCOUNT est 1)
1      1                       |--Table Scan(OBJECT:([T]))

Why doesn't SQL Server add a ROWCOUNT top to select statements?

For example, the following query plan does not include a top yet only returns 1 row:

SET ROWCOUNT 1
SELECT * FROM T

Rows  Executes
1      1      SELECT * FROM T
1      1        |--Table Scan(OBJECT:([T]))

SQL Server implements SET ROWCOUNT for select statements by simply counting and returning the correct number of rows from the root of the plan.  Although this strategy might work for a really trivial update plan such as the one above, it would not work for more complex update plans.  For instance, if we add a unique index to our table, the update plan becomes substantially more complex:

CREATE UNIQUE INDEX TA ON T(A)
UPDATE T SET A = A + 1

Rows  Executes
2      1      UPDATE [T] set [A] = [A]+@1
2      1        |--Index Update(OBJECT:([T].[TA]), SET:([Bmk10061024] = [Bmk1006],[A1025] = [T].[A]))
2      1             |--Collapse(GROUP BY:([T].[A]))
2      1                  |--Filter(WHERE:(NOT [Expr1021]))
2      1                       |--Sort(ORDER BY:([T].[A] ASC, [Act1023] ASC))
2      1                            |--Split
1      1                                 |--Table Update(OBJECT:([T]), SET:([T].[A] = [Expr1004]))
1      1                                      |--Compute Scalar(DEFINE:([Expr1021]=[Expr1021]))
0      0                                           |--Compute Scalar(DEFINE:([Expr1021]=CASE WHEN [Expr1005] THEN (1) ELSE (0) END))
0      0                                                |--Compute Scalar(DEFINE:([Expr1004]=[T].[A]+(1), [Expr1005]=CASE WHEN [T].[A] = ([T].[A]+(1)) THEN (1) ELSE (0) END))
1      1                                                     |--Top(ROWCOUNT est 1)
1      1                                                          |--Table Scan(OBJECT:([T]))

I'm not going to try in this post to explain all of the details of the above plan.  I'll save that for a future post.  However, observe that in updating 1 row, the root of this plan returns 2 rows.  Counting 1 row from the root of this plan would not achieve an accurate result.  By placing the ROWCOUNT top above the table scan, the optimizer can ensure that the server updates exactly the correct number of rows regardless of the complexity of the remainder of the plan.
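One housekeeping note if you are following along: SET ROWCOUNT stays in effect for the rest of the session, so reset it before running anything else:

SET ROWCOUNT 0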


Semi-join Transformation


In several of my prior posts, I've given examples of semi-joins.  Recall that semi-joins essentially return a row from one input if we can find at least one matching row from the other input.  Here is a simple example:

create table T1 (a int, b int)
create table T2 (a int, b int)

set nocount on
declare @i int
set @i = 0
while @i < 10000
  begin
    insert T1 values(@i, @i)
    set @i = @i + 1
  end
go

set nocount on
declare @i int
set @i = 0
while @i < 100
  begin
    insert T2 values(@i, @i)
    set @i = @i + 1
  end
go

select * from T1
where exists (select * from T2 where T2.a = T1.a)

Rows Executes
100 1   |--Hash Match(Right Semi Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
100 1        |--Table Scan(OBJECT:([T2]))
10000 1        |--Table Scan(OBJECT:([T1]))

Note that the optimizer chooses a hash join and correctly builds a hash table on the smaller input T2 which has only 100 rows and probes with the larger input T1 which has 10,000 rows.

Transformation

Now, let's suppose we want to create an index to speed up this query.  Perhaps we'd like to get a plan with an index nested loops join.  Should we create an index on T1 or T2?  Keep in mind that nested loops join only supports left semi-join, not right semi-join.  If we get a nested loops semi-join plan, T1 will be the outer table and T2 will be the inner table.  Thus, we might be tempted to create an index on T2:

create clustered index T2a on T2(a)

Unfortunately, this index does not change the plan.  The optimizer has decided that 10,000 index lookups (one for each row of T1) are still more expensive than the hash join.  Fortunately, we have another less obvious option.  We can create an index on the larger T1 table:

create clustered index T1a on T1(a)

select * from T1

where exists (select * from T2 where T2.a = T1.a)

Now we get an index nested loops join:

Rows Executes
100 1   |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[a], [Expr1009]) WITH UNORDERED PREFETCH)
100 1        |--Stream Aggregate(GROUP BY:([T2].[a]))
100 1        | |--Clustered Index Scan(OBJECT:([T2].[T2a]), ORDERED FORWARD)
100 100        |--Clustered Index Seek(OBJECT:([T1].[T1a]), SEEK:([T1].[a]=[T2].[a]) ORDERED FORWARD)

But wait a minute!  This plan has an inner join.  What happened to the semi-join?  Remember that the semi-join simply returns each row of T1 that matches at least one row of T2.  We can use an inner join to find these matches so long as we do not return any row of T1 more than once.  In this plan, we use the ordered clustered index scan of T2 and the stream aggregate to eliminate any duplicate values of T2.a.  Then, when we join these T2 rows with T1, we know that we can match each T1 row exactly once.  Note that unlike the original hash join plan which touched all 10,000 rows of T1, this plan performs only 100 distinct index lookups on T1.
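You can also express this transformation by hand.  Here is a sketch of my own (not from the original query) that removes duplicate values of T2.a before joining and returns the same result as the EXISTS query:

select T1.*
from T1 inner join (select distinct a from T2) Q
on T1.a = Q.a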

When is this transformation a good idea?

Transforming the semi-join to an inner join helps when, as in this example, we have a large number of rows on the "preserved" side of the join (T1 in this example) and the transformation enables an index seek that would have otherwise been impossible.
 
The transformation also makes sense if the number of rows on the "lookup" side of the join is very large but most of the rows are duplicates.  Since the semi-join does not depend on the duplicates, we can eliminate them.  For example, let's drop the index on T1 and create a new 10,000 row table T3 that is all duplicates:

drop index T1.T1a

create table T3 (a int, b int)

set nocount on
declare @i int
set @i = 0
while @i < 10000
  begin
    insert T3 values(0, @i)
    set @i = @i + 1
  end

select * from T1
where exists (select * from T3 where T3.a = T1.a)

Rows Executes
1 1   |--Hash Match(Inner Join, HASH:([T3].[a])=([T1].[a]), RESIDUAL:([T3].[a]=[T1].[a]))
1 1        |--Hash Match(Aggregate, HASH:([T3].[a]), RESIDUAL:([T3].[a] = [T3].[a]))
10000 1        | |--Table Scan(OBJECT:([T3]))
10000 1        |--Table Scan(OBJECT:([T1]))

Notice that even though we have no indexes and use the hash join plan, we still transform the semi-join into an inner join.  This time, since we do not have an index, we use a hash aggregate to eliminate duplicate values of T3.a.  Without the transformation, the hash join would have built a hash table on all 10,000 rows in T3.  With the transformation, it builds a hash table on a single row.

Unique index

If we add a unique index so that duplicate values of T2.a are not possible, the optimizer no longer needs to eliminate duplicates and can always perform this transformation for free.  For example:

create unique clustered index T2a on T2(a) with (drop_existing = on)

select * from T1

where exists (select * from T2 where T2.a = T1.a)

Rows Executes
100 1   |--Hash Match(Inner Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
100 1        |--Clustered Index Scan(OBJECT:([T2].[T2a]))
10000 1        |--Table Scan(OBJECT:([T1]))

Since we dropped the index on T1 (see above), we get the hash join plan again.  However, unlike the original right semi-join plan, now we get an inner join.  There is no need for a semi-join because the optimizer knows that there can be no duplicates in T2 and that the semi-join and inner join plans are equivalent.

Stay tuned …

This post will probably be my last of the year, but I will be back in January to continue writing about parallelism, partitioned tables, and other query processing topics.


Sequential Read Ahead


Balancing CPU and I/O throughput is essential to achieve good overall performance and to maximize hardware utilization.  SQL Server includes two asynchronous I/O mechanisms - sequential read ahead and random prefetching - that are designed to address this challenge.

To understand why asynchronous I/O is so important, consider the CPU to I/O performance gap.  The memory subsystem on a modern CPU can deliver data sequentially at roughly 5 Gbytes per second per socket (or, for non-NUMA machines, for all sockets sharing the same bus) and (depending on how you measure it) can fetch random memory locations at roughly 10 to 50 million accesses per second.  By comparison, a high-end 15K SAS hard drive can read only 125 Mbytes per second sequentially and can perform only 200 random I/Os per second (IOPS).  Solid state disks (SSDs) can reduce the gap between sequential and random I/O performance by eliminating the moving parts from the equation, but a performance gap remains.  In an effort to close this performance gap, it is not uncommon for servers to have a ratio of 10 or more drives for every CPU.  (It is also important to consider and balance the entire I/O subsystem, including the number and type of disk controllers, not just the drives themselves, but that is not the focus of this post.)

Unfortunately, a single CPU issuing only synchronous I/Os can keep only one spindle active at a time.  For a single CPU to exploit the available bandwidth and IOPS of multiple spindles effectively, the server must issue multiple I/Os asynchronously.  Thus, SQL Server includes the aforementioned read ahead and prefetching mechanisms.  In this post, I'll take a look at sequential read ahead.

When SQL Server performs a sequential scan of a large table, the storage engine initiates the read ahead mechanism to ensure that pages are in memory and ready to scan before they are needed by the query processor.  The read ahead mechanism tries to stay 500 pages ahead of the scan.  We can see the read ahead mechanism in action by checking the output of SET STATISTICS IO ON.  For example, I ran the following query on a 1GB scale factor TPC-H database.  The LINEITEM table has roughly 6 million rows.

SET STATISTICS IO ON

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 3, read-ahead reads 20331, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Repeating the query a second time shows that the table is now cached in the buffer pool:

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
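If you want to repeat the cold cache measurement, you can flush clean pages from the buffer pool first.  This is an aside of my own and should only be done on a test server:

CHECKPOINT
DBCC DROPCLEANBUFFERS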

For sequential I/O performance, it is important to distinguish between allocation ordered and index ordered scans.  An allocation ordered scan tries to read pages in the order in which they are physically stored on disk, while an index ordered scan reads pages according to the order in which the data on those index pages is sorted.  (Note that in many cases there are multiple levels of indirection, such as RAID devices or SANs, between the logical volumes that SQL Server sees and the physical disks.  Thus, even an allocation ordered scan may in fact not be truly optimally ordered.)  Although SQL Server tries to sort and read pages in allocation order even for an index ordered scan, an allocation ordered scan is generally going to be faster since pages are read in the order that they are written on disk with the minimum number of seeks.  Heaps have no inherent order and, thus, are always scanned in allocation order.  Indexes are scanned in allocation order only if the isolation level is read uncommitted (or the NOLOCK hint is used) and only if the query processor does not request an ordered scan.  Defragmenting indexes can help to ensure that index ordered scans perform on par with allocation ordered scans.
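Here is a rough sketch of my own (not from the original post; whether you actually get an allocation ordered scan depends on the conditions described above) showing the two knobs mentioned in this paragraph:

-- With NOLOCK (read uncommitted) and no ORDER BY, the storage engine is free
-- to scan the index in allocation order
SELECT COUNT(*) FROM LINEITEM WITH (NOLOCK)

-- Rebuilding the indexes removes fragmentation so that an index ordered scan
-- reads pages in nearly the same order as an allocation ordered scan
ALTER INDEX ALL ON LINEITEM REBUILD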

In my next post, I'll take a look at random prefetching.


Serializable vs. Snapshot Isolation Level


Both the serializable and snapshot isolation levels provide a read consistent view of the database to all transactions.  In either of these isolation levels, a transaction can only read data that has been committed.  Moreover, a transaction can read the same data multiple times without ever observing any concurrent transactions making changes to this data.  The unexpected read committed and repeatable read results that I demonstrated in my prior few posts are not possible in serializable or snapshot isolation level.

Notice that I used the phrase "without ever observing any ... changes."  This choice of words is deliberate.  In serializable isolation level, SQL Server acquires key range locks and holds them until the end of the transaction.  A key range lock ensures that, once a transaction reads data, no other transaction can alter that data - not even to insert phantom rows - until the transaction holding the lock completes.  In snapshot isolation level, SQL Server does not acquire any locks.  Thus, it is possible for a concurrent transaction to modify data that a second transaction has already read.  The second transaction simply does not observe the changes and continues to read an old copy of the data.

Serializable isolation level relies on pessimistic concurrency control.  It guarantees consistency by assuming that two transactions might try to update the same data and uses locks to ensure that they do not, but at a cost of reduced concurrency - one transaction must wait for the other to complete and two transactions can deadlock.  Snapshot isolation level relies on optimistic concurrency control.  It allows transactions to proceed without locks and with maximum concurrency, but it may need to fail and roll back a transaction if two transactions attempt to modify the same data at the same time.
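Here is a minimal sketch of my own (not from the original post) of the update conflict failure mode; it assumes snapshot isolation is already enabled on the database and uses a throwaway table:

create table conflict_demo (id int primary key, val int)
insert conflict_demo values (1, 0)

-- Session 1: update the row and leave the transaction open
set transaction isolation level snapshot
begin tran
update conflict_demo set val = 1 where id = 1

-- Session 2: this update blocks; when session 1 commits, session 2 fails with
-- error 3960 ("Snapshot isolation transaction aborted due to update conflict")
-- and is rolled back
set transaction isolation level snapshot
begin tran
update conflict_demo set val = 2 where id = 1

-- Session 1:
commit tran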

It is clear there are differences in the level of concurrency that can be achieved and in the failures (deadlocks vs. update conflicts) that are possible with the serializable and snapshot isolation levels.

What about transaction isolation?  How do serializable and snapshot differ in terms of the isolation that they confer?  Serializable is simple to understand: for the outcome of two transactions to be considered serializable, it must be possible to achieve that outcome by running the transactions one at a time in some order.

Snapshot does not guarantee this level of isolation.  A few years ago, Jim Gray shared with me the following excellent example of the difference.  Imagine that we have a bag containing a mixture of white and black marbles.  Suppose that we want to run two transactions.  One transaction turns each of the white marbles into black marbles.  The second transaction turns each of the black marbles into white marbles.  If we run these transactions under serializable isolation, we must run them one at a time.  The first transaction will leave a bag with marbles of only one color.  After that, the second transaction will change all of these marbles to the other color.  There are only two possible outcomes:  a bag with only white marbles or a bag with only black marbles.

If we run these transactions under snapshot isolation, there is a third outcome that is not possible under serializable isolation.  Each transaction can simultaneously take a snapshot of the bag of marbles as it exists before we make any changes.  Now one transaction finds the white marbles and turns them into black marbles.  At the same time, the other transaction finds the black marbles - but only those marbles that were black when we took the snapshot, not those marbles that the first transaction changed to black - and turns them into white marbles.  In the end, we still have a mixed bag of marbles with some white and some black.  In fact, we have precisely swapped the color of every marble.

The following graphic illustrates the difference:

We can demonstrate this outcome using SQL Server.  Note that snapshot isolation is available beginning with SQL Server 2005 and must be explicitly enabled on your database:

alter database database_name set allow_snapshot_isolation on
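You can confirm the setting with the sys.databases catalog view (a quick check of my own; available beginning with SQL Server 2005):

select name, snapshot_isolation_state_desc
from sys.databases
where name = 'database_name'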

Begin by creating a simple table with two rows representing two marbles:

create table marbles (id int primary key, color char(5))
insert marbles values(1, 'Black')
insert marbles values(2, 'White')

Next, in session 1 begin a snapshot transaction:

set transaction isolation level snapshot
begin tran
update marbles set color = 'White' where color = 'Black'

Now, before committing the changes, run the following in session 2:

set transaction isolation level snapshot
begin tran
update marbles set color = 'Black' where color = 'White'
commit tran

Finally, commit the transaction in session 1 and check the data in the table:

commit tran
select * from marbles

Here are the results:

 id          color
 ----------- -----
 1           White
 2           Black

As you can see, marble 1, which started out black, is now white, and marble 2, which started out white, is now black.  If you try this same experiment with serializable isolation, one transaction will wait for the other to complete and, depending on the order, both marbles will end up either white or black.


Subqueries in BETWEEN and CASE Statements


Consider the following query:

CREATE TABLE T1 (A INT, B1 INT, B2 INT)
CREATE TABLE T2 (A INT, B INT)

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) BETWEEN T1.B1 AND T1.B2

Observe that the subquery in this query only needs to be evaluated once for each row of T1.  Indeed, running on SQL Server 2000, we get the following plan (the subquery corresponds to the Compute Scalar, Stream Aggregate, Sort, and Table Scan of T2 on the outer input of the join):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1014]=0) then NULL else [Expr1015]))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1014]=COUNT_BIG([T2].[B]), [Expr1015]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1004]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1004]))
                 |--Table Scan(OBJECT:([T1]))

Now let's look at the query plan we get (with SQL Server 2005 or SQL Server 2008):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008], [Expr1014]))
       |--Merge Join(Inner Join, MERGE:([T2].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T2].[A]))
       |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
       |    |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1026]=COUNT_BIG([T2].[B]), [Expr1027]=SUM([T2].[B])))
       |    |         |--Sort(ORDER BY:([T2].[A] ASC))
       |    |              |--Table Scan(OBJECT:([T2]))
       |    |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
       |         |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1028]=COUNT_BIG([T2].[B]), [Expr1029]=SUM([T2].[B])))
       |              |--Sort(ORDER BY:([T2].[A] ASC))
       |                   |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1014]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

Notice that the subquery is actually evaluated twice.  There are two scans of T2, two sorts, and two stream aggregates: one set on each input to the merge join.

Why is SQL Server evaluating the subquery twice?  The answer is that SQL Server transforms "X BETWEEN Y AND Z" into "X >= Y AND X <= Z".  If, as in this query, X is a subquery, the subquery is repeated:

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) >= T1.B1 AND
    (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) <= T1.B2

This transformation occurs very early in the processing of the query.  Unfortunately, after that point, SQL Server 2005 and SQL Server 2008 do not realize that the subquery is a common subexpression and they evaluate it as if there were two completely different subqueries.

You may also have noticed that, instead of scanning T1 first and then evaluating the subquery for each row of T1, this plan actually scans T2 first.  This join order results from the optimizer decorrelating the subqueries.

So, how can we get SQL Server to evaluate the subquery only once?  There are actually multiple solutions and all involve rewriting the query to calculate the subquery separately from the BETWEEN clause so that when SQL Server transforms the BETWEEN clause, it does not also duplicate the subquery.  Here are some examples:

SELECT Q.A, Q.B1, Q.B2
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q
WHERE SUM_B BETWEEN Q.B1 AND Q.B2

SELECT T1.*
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q
WHERE Q.SUM_B BETWEEN T1.B1 AND T1.B2

SELECT T1.*
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A AND Q.SUM_B BETWEEN T1.B1 AND T1.B2

All three of these rewrites produce the same plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008]))
       |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1016]=COUNT_BIG([T2].[B]), [Expr1017]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
       |--Filter(WHERE:([Expr1008]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

SQL Server also transforms a CASE statement of the form:

CASE X
    WHEN Y1 THEN Z1
    WHEN Y2 THEN Z2
    ...
    ELSE ZN
END

Into:

CASE
    WHEN X = Y1 THEN Z1
    WHEN X = Y2 THEN Z2
    ...
    ELSE ZN
END

Thus, CASE statements can yield the same problematic behavior if X is a subquery.  Unfortunately, with a CASE statement, the number of times that the subquery is reevaluated depends on the number of WHEN clauses and can be quite large.  Here is a simple query that illustrates the problem:

SELECT *,
    CASE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A)
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1

Here is the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1007]=If ([Expr1004]=[T1].[B1]) then 'B1' else If ([Expr1004]=[T1].[B2]) then 'B2' else NULL))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Hash Match(Cache, HASH:([T1].[A]), RESIDUAL:([T1].[A]=[T1].[A]))
                 |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1020]=0) then NULL else [Expr1021]))
                      |--Stream Aggregate(DEFINE:([Expr1020]=COUNT_BIG([T2].[B]), [Expr1021]=SUM([T2].[B])))
                           |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

And here is the SQL Server 2005 and SQL Server 2008 plan:

  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008]=[T1].[B1] THEN 'B1' ELSE CASE WHEN [Expr1014]=[T1].[B2] THEN 'B2' ELSE NULL END END))
       |--Nested Loops(Inner Join, PASSTHRU:([Expr1008]=[T1].[B1]), OUTER REFERENCES:([T1].[A]))
            |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |    |--Table Scan(OBJECT:([T1]))
            |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1030]=(0) THEN NULL ELSE [Expr1031] END))
            |         |--Stream Aggregate(DEFINE:([Expr1030]=COUNT_BIG([T2].[B]), [Expr1031]=SUM([T2].[B])))
            |              |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))
            |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1032]=(0) THEN NULL ELSE [Expr1033] END))
                 |--Stream Aggregate(DEFINE:([Expr1032]=COUNT_BIG([T2].[B]), [Expr1033]=SUM([T2].[B])))
                       |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))

Finally, the same solutions once again apply:

SELECT Q.A, Q.B1, Q.B2,
    CASE Q.SUM_B
        WHEN Q.B1 THEN 'B1'
        WHEN Q.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A