SQL-SERVER
Support Five type of joins:
Three join strategies are available to the query processor
If
the optimizer determines that a nested-loop join is the optimal join,
it selects one table as the outer (first) table and the other as the
inner table. When executing a nested-loop join, SQL Server scans the
outer table row by row. For each row, it also scans the inner table,
looking for matching rows. This join type is very efficient if one of
the tables is small (or has been filtered so that it has only a few
qualifying rows) and if the other table has an index on the column
that joins the tables. The optimizer typically chooses the small
table as its first input because this is the fastest method of
executing the nest-loop algorithm.
For the nested-loop strategy to be effective, the inner table needs an index on the join column. This index means SQL Server doesn't need to scan the entire inner table; after accessing a row in the outer table, SQL Server can use the index on the join column to quickly find all matching rows in the inner table.
If the inner table doesn't have a useful index on the join column, the optimizer usually chooses a hash join strategy. However, if the join between the tables isn't based on equality, the optimizer chooses a nested-loop join, regardless of the tables' sizes and indexing schema.
Let's look at the following query on the tables:
SELECT C.CustomerName, O.OrderDate
FROM Customers C INNER JOIN
Orders O
ON C.CustomerID = O.CustomerID
The outer loop iterates through the rows in the Orders table. Note that because a clustered index contains the data, when the execution plan calls for a clustered index scan, the query processor scans the entire table. In the absence of a clustered index, this scan is called a table scan. For each row in the Orders table, the query processor uses clustered index seek, which scans a particular range of rows from a clustered index to seek a matching row in the Customers table because the join column CustomerID has a clustered index.
The query optimizer distinguishes among three variants of nested-loop joins.
A search that scans an entire table or index is a naïve nested-loop join.
A search that performs lookups in an index to fetch rows is an index nested-loop join.
When SQL Server builds an index as part of the query plan and destroys it when the query completes, you have a temporary index nested-loop join. If you notice the Query Analyzer using a temporary index nested-loop join for a common join, you might want to consider running the Index Tuning Wizard to determine whether you could benefit from adding an index to the base table.
SELECT Table1.Col1, Table1.Col2, Table2.Col3
FROM Table1 INNER MERGE JOIN Table2
ON Table1.Col1 = Table2.Col1
The Orders table is the best candidate for the build input because it's considerably smaller. The linked lists start with the function's possible results; in this case, there are only five. Linked to each bucket are records containing all the values the query will return from the build input table that correspond to the bucket. Applying the hash function to the join column of each build input record determines the correspondence. OrderID and OrderDate are the only columns you need from the build input (the Orders table), so these values are the only ones that stay in each record of the hash table.
Join Hints:
<join_type> <join_hint>
JOIN <right_table_name>
SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
RESIDUAL:(\[OD\].\[OrderID\]=\[O\].\[OrderID\]))
—Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[Orders\].
\[PK_Orders\] AS \[O\]), ORDERED)
—Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[OrderDetails\].
\[PK_OrderDetails\] AS \[OD\]), ORDERED)
SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900
ON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900
- Inner Join.
- Left Outer Join.
- Right Outer Join.
- Full Outer Join.
- Cross Join.
By
choosing effective indexes and join strategies, the query optimizer
builds an efficient plan to retrieve data from SQL queries.
Three join strategies are available to the query processor
- Nested Loops Joins.
- Merge Joins.
- Hash Joins.
Nested-Loop
Joins:
For the nested-loop strategy to be effective, the inner table needs an index on the join column. This index means SQL Server doesn't need to scan the entire inner table; after accessing a row in the outer table, SQL Server can use the index on the join column to quickly find all matching rows in the inner table.
If the inner table doesn't have a useful index on the join column, the optimizer usually chooses a hash join strategy. However, if the join between the tables isn't based on equality, the optimizer chooses a nested-loop join, regardless of the tables' sizes and indexing schema.
Let's look at the following query on the tables:
SELECT C.CustomerName, O.OrderDate
FROM Customers C INNER JOIN
Orders O
ON C.CustomerID = O.CustomerID
The outer loop iterates through the rows in the Orders table. Note that because a clustered index contains the data, when the execution plan calls for a clustered index scan, the query processor scans the entire table. In the absence of a clustered index, this scan is called a table scan. For each row in the Orders table, the query processor uses clustered index seek, which scans a particular range of rows from a clustered index to seek a matching row in the Customers table because the join column CustomerID has a clustered index.
The query optimizer distinguishes among three variants of nested-loop joins.
A search that scans an entire table or index is a naïve nested-loop join.
A search that performs lookups in an index to fetch rows is an index nested-loop join.
When SQL Server builds an index as part of the query plan and destroys it when the query completes, you have a temporary index nested-loop join. If you notice the Query Analyzer using a temporary index nested-loop join for a common join, you might want to consider running the Index Tuning Wizard to determine whether you could benefit from adding an index to the base table.
Outlines:
- It is very efficient if one of the tables is small (or has been filtered so that it has only a few qualifying rows) and if the other table has an index on the column that joins the tables. The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.
- The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.
Merge
Joins:
If
the optimizer decides that a merge join is optimal, the query
execution scans two sorted inputs and merges them. For a merge join,
the query processor sorts both inputs on the join column; if the two
inputs to the join are already sorted on the join column (for
example, if each has a clustered index on the join column), merge
join is a logical choice. In other cases, if adequate memory is
available, the optimizer might determine that sorting one input
before joining inputs results in the most efficient join strategy.
If
a one-to-many relationship exists between the inputs, SQL Server
follows a series of steps for an inner join executed with a merge
join strategy. First, it gets a row from each input, then compares
the join columns of the rows. If the join columns match, SQL Server
returns the requested columns from each row. If the columns don't
match, SQL Server discards the row with the lower value and gets the
next row from that input. SQL Server repeats these steps until it has
processed all rows in both inputs, scanning both inputs only once.
To
demonstrate a situation in which the optimizer will choose merge
join, we inserted 10,000 rows into the Orders table. For each Order,
we inserted five products into the OrderDetails table, resulting in
50,000 rows. Both Orders and OrderDetails have clustered indexes on
the OrderID column. Then we executed the following query against the
tables:
SELECT
O.OrderID, O.OrderDate,OD.ProductID,OD.Quantity
FROM
Orders O INNER JOIN OrderDetails
OD
ON
O.OrderID = OD.OrderID
SQL
Server scans both the Orders table and the OrderDetails table. For
each row in Orders, the matching rows in OrderDetails are merged.
Another
case of merge join might occur if a many-to-many relationship
exists between the inputs. If duplicate values exist in both
inputs, the second input must be positioned to the point in the
database where the first duplicate was found before the query
processor can process the next duplicate from the first input. A
temporary table stores the rows instead of discarding them. We can
best explain this approach with a simple example:
SELECT Table1.Col1, Table1.Col2, Table2.Col3
FROM Table1 INNER MERGE JOIN Table2
ON Table1.Col1 = Table2.Col1
Note
that in this example, we used a join hint (which we'll explain later)
to force a merge join. With these small tables, the optimizer
probably won't choose merge join.
Outlines:
- If the two inputs to the join are already sorted on the join column (for example, if each has a clustered index on the join column), merge join is a logical choice.
- Query processor sorts both inputs on the join column.
- Another case of merge join might occur if a many-to-many relationship exists between the inputs
Hash
Joins:
The
third type of join is a hash join. The main reason why the SQL
Server optimizer chooses the hash join algorithm is a lack of
adequate indexes on the join columns. If the optimizer chooses a
hash join, the execution plan involves building a hash table, which
is like building a special index on the fly. The hash join
strategy has two phases: build and probe. During the build phase, the
query processor builds a hash table by scanning each value in the
build input (usually the smaller table) and applying the hashing
algorithm to the key. The hash table consists of linked lists called
hash buckets. The hash function applied to each hash key in the build
input determines the relevant hash bucket. During the probe phase,
the query processor scans each row from the probe input (the second
table) and computes the same hash value on the hash key to find any
matches in the corresponding hash bucket.
Hash
joins perform most efficiently when the tables differ significantly
in size. While processing the query, SQL Server might determine that
it chose the wrong table for the build input. In that case, a role
reversal occurs, and the build input becomes the probe input and vice
versa.
An
example of a hash function is adding the ASCII values of the
characters in the key field and extracting only the last few digits
of the result. Another example is computing the modulo (percentage)
of the key and a prime number. An important aspect of the hash
algorithm is that, unlike regular index keys, the hash function can
produce significantly shortened versions of the original keys. The
hash function chosen generates an even distribution of the input rows
in the hash buckets. These special hash indexes are efficient for
finding exact matches (because the two join keys generate the same
hash function result), but are useless for joining tables based on
inequalities.
To
demonstrate a situation in which the optimizer usually chooses a hash
join, drop the constraints and both clustered indexes from the Orders
and OrderDetails tables. Now execute the same query we ran in the
merge join example:
SELECT
O.OrderID, O.OrderDate,OD.ProductID,
OD.Quantity
FROM
Orders O INNER JOIN OrderDetails OD
ON
O.OrderID = OD.OrderID
The Orders table is the best candidate for the build input because it's considerably smaller. The linked lists start with the function's possible results; in this case, there are only five. Linked to each bucket are records containing all the values the query will return from the build input table that correspond to the bucket. Applying the hash function to the join column of each build input record determines the correspondence. OrderID and OrderDate are the only columns you need from the build input (the Orders table), so these values are the only ones that stay in each record of the hash table.
SQL
Server performs a table scan on the smaller table, Orders, which is
the build input. Then it builds a hash table with the values that
result from applying the hash function to each scanned row. SQL
Server also performs a table scan on the probe—the OrderDetails
table—and looks for the matching values in the hash table by
applying to each scanned row the same formula that it used to build
the hash table.
In
some cases, your server might not have enough memory to hold the
entire build input. In that case, SQL Server divides the hash table
into multiple partitions and performs the join in separate steps,
loading the relevant partition as needed. The hash join algorithm
is very fast and efficient when proper indexes don't exist. This
efficiency is an advantage when you want to run ad hoc queries
without incurring the overhead of creating and dropping new indexes
for only these queries.
Outlines:
- The main reason why the SQL Server optimizer chooses the hash join algorithm is a lack of adequate indexes on the join columns.
- The execution plan involves building a hash table, which is like building a special index on the fly.
- hash join strategy has two phases: build and probe.
- During the build phase, the query processor builds a hash table by scanning each value in the build input (usually the smaller table) and applying the hashing algorithm to the key.
- During the probe phase, the query processor scans each row from the probe input (the second table) and computes the same hash value on the hash key to find any matches in the corresponding hash bucket.
Join Hints:
The
query optimizer is very sophisticated, and in most cases, it chooses
the best join strategy. However, in some situations, perhaps because
of a lack of statistics or uneven data distributions, it might not
make the right choice. When tuning query performance, experiment with
different strategies on the same query to try to find a better
strategy than the optimizer's choice. If you think that a better
strategy is available, you can use join hints. Keep in mind, though,
that after you add a join hint to the query, the query becomes
static. You lose SQL Server's ability to process queries dynamically
as a result of adding a new index, dropping an existing index,
updating statistics, or modifying data in the tables.
The
syntax for supplying join hints is
SELECT
<select_list>
FROM
<left_table_name><join_type> <join_hint>
JOIN <right_table_name>
The
<join_type> can be LOOP, MERGE, or HASH. (A fourth hint,
REMOTE, doesn't really control the join strategy as we discuss in
this article, so we won't go into detail about it.) Note that you
must use the ANSI join syntax to use a join hint. For example, to
force a certain join strategy on a query that selects all orders and
their details, use the process in Listing 3, page 40. To test the
efficiency of a certain query, you can use SET STATISTICS TIME ON to
show execution times, SET STATISTICS IO ON to show I/O costs, and SET
SHOWPLAN_TEXT ON to show the execution plan.
Let's
look at an example in which forcing a strategy that is different from
the one the query processor chose might yield better performance.
Suppose the Orders table has 10,000 orders and the OrderDetails table
has five entries for each of those orders. Run the following query,
which selects all orders, their products, and quantities:
SET
SHOWPLAN_TEXT ON
GOSELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
The
execution plan
—Merge
Join(Inner Join,
MERGE:(\[O\].\[OrderID\])=(\[OD\].\[OrderID\]),
RESIDUAL:(\[OD\].\[OrderID\]=\[O\].\[OrderID\]))
—Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[Orders\].
\[PK_Orders\] AS \[O\]), ORDERED)
—Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[OrderDetails\].
\[PK_OrderDetails\] AS \[OD\]), ORDERED)
shows
that the query processor chose a merge join. Forcing different join
strategies with join hints shows that merge join is the best choice
for this case.
If
you run the following query
SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900
the
query processor chooses merge join again. But if you try a
nested-loop join, you'll see that it yields considerably better
performance:
SELECT
O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM
Orders O INNER LOOP JOIN OrderDetails ODON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900
You
might use another hint to force the query processor to join tables in
the order in which they're listed in the FROM clause. With no hint,
the query optimizer decides the order of the joins on a cost-based
estimation, regardless of their order in the SQL statement. To force
the join order for a certain query, use the FORCE ORDER query hint.
FORCE ORDER isn't a JOIN hint because it doesn't appear in the FROM
clause. Alternatively, you can force the query processor to use your
specified join order for all queries in the session level by using
SET FORCEPLAN ON.
As
a rule, avoid interfering with the query optimizer's decisions. The
optimizer uses a variety of algorithms and in most cases, it comes up
with the best plan. The new join techniques we discussed in this
article are some of the improvements in SQL Server 7.0 that let you
keep a normalized data model, accommodate VLDBs, and get your desired
results faster.
Outlines:
- When tuning query performance, experiment with different strategies on the same query to try to find a better strategy than the optimizer's choice. If you think that a better strategy is available, you can use join hints.
- Keep in mind, though, that after you add a join hint to the query, the query becomes static.
No comments:
Post a Comment