Knowledge is no fun, unless you share it!!! :)

Wednesday, 11 September 2013

SQL-Server Join Advance Mechanism

SQL-SERVER Support Five type of joins:
  • Inner Join.
  • Left Outer Join.
  • Right Outer Join.
  • Full Outer Join.
  • Cross Join.

By choosing effective indexes and join strategies, the query optimizer builds an efficient plan to retrieve data from SQL queries.

Three join strategies are available to the query processor
  • Nested Loops Joins.
  • Merge Joins.
  • Hash Joins.

Nested-Loop Joins:
 
If the optimizer determines that a nested-loop join is the optimal join, it selects one table as the outer (first) table and the other as the inner table. When executing a nested-loop join, SQL Server scans the outer table row by row. For each row, it also scans the inner table, looking for matching rows. This join type is very efficient if one of the tables is small (or has been filtered so that it has only a few qualifying rows) and if the other table has an index on the column that joins the tables. The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.

For the nested-loop strategy to be effective, the inner table needs an index on the join column. This index means SQL Server doesn't need to scan the entire inner table; after accessing a row in the outer table, SQL Server can use the index on the join column to quickly find all matching rows in the inner table.

If the inner table doesn't have a useful index on the join column, the optimizer usually chooses a hash join strategy. However, if the join between the tables isn't based on equality, the optimizer chooses a nested-loop join, regardless of the tables' sizes and indexing schema.

Let's look at the following query on the tables:
SELECT C.CustomerName, O.OrderDate
FROM Customers C INNER JOIN
Orders O
ON C.CustomerID = O.CustomerID

The outer loop iterates through the rows in the Orders table. Note that because a clustered index contains the data, when the execution plan calls for a clustered index scan, the query processor scans the entire table. In the absence of a clustered index, this scan is called a table scan. For each row in the Orders table, the query processor uses clustered index seek, which scans a particular range of rows from a clustered index to seek a matching row in the Customers table because the join column CustomerID has a clustered index.

The query optimizer distinguishes among three variants of nested-loop joins.

A search that scans an entire table or index is a naïve nested-loop join.

A search that performs lookups in an index to fetch rows is an index nested-loop join.

When SQL Server builds an index as part of the query plan and destroys it when the query completes, you have a temporary index nested-loop join. If you notice the Query Analyzer using a temporary index nested-loop join for a common join, you might want to consider running the Index Tuning Wizard to determine whether you could benefit from adding an index to the base table.

Outlines:

  • It is very efficient if one of the tables is small (or has been filtered so that it has only a few qualifying rows) and if the other table has an index on the column that joins the tables. The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.
  •  The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.

Merge Joins:

If the optimizer decides that a merge join is optimal, the query execution scans two sorted inputs and merges them. For a merge join, the query processor sorts both inputs on the join column; if the two inputs to the join are already sorted on the join column (for example, if each has a clustered index on the join column), merge join is a logical choice. In other cases, if adequate memory is available, the optimizer might determine that sorting one input before joining inputs results in the most efficient join strategy.

If a one-to-many relationship exists between the inputs, SQL Server follows a series of steps for an inner join executed with a merge join strategy. First, it gets a row from each input, then compares the join columns of the rows. If the join columns match, SQL Server returns the requested columns from each row. If the columns don't match, SQL Server discards the row with the lower value and gets the next row from that input. SQL Server repeats these steps until it has processed all rows in both inputs, scanning both inputs only once.

To demonstrate a situation in which the optimizer will choose merge join, we inserted 10,000 rows into the Orders table. For each Order, we inserted five products into the OrderDetails table, resulting in 50,000 rows. Both Orders and OrderDetails have clustered indexes on the OrderID column. Then we executed the following query against the tables:

SELECT O.OrderID, O.OrderDate,OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID

SQL Server scans both the Orders table and the OrderDetails table. For each row in Orders, the matching rows in OrderDetails are merged.

Another case of merge join might occur if a many-to-many relationship exists between the inputs. If duplicate values exist in both inputs, the second input must be positioned to the point in the database where the first duplicate was found before the query processor can process the next duplicate from the first input. A temporary table stores the rows instead of discarding them. We can best explain this approach with a simple example:

SELECT Table1.Col1, Table1.Col2, Table2.Col3
FROM Table1 INNER MERGE JOIN Table2
ON Table1.Col1 = Table2.Col1

Note that in this example, we used a join hint (which we'll explain later) to force a merge join. With these small tables, the optimizer probably won't choose merge join.

Outlines:

  • If the two inputs to the join are already sorted on the join column (for example, if each has a clustered index on the join column), merge join is a logical choice.
  • Query processor sorts both inputs on the join column.
  • Another case of merge join might occur if a many-to-many relationship exists between the inputs

Hash Joins:

The third type of join is a hash join. The main reason why the SQL Server optimizer chooses the hash join algorithm is a lack of adequate indexes on the join columns. If the optimizer chooses a hash join, the execution plan involves building a hash table, which is like building a special index on the fly. The hash join strategy has two phases: build and probe. During the build phase, the query processor builds a hash table by scanning each value in the build input (usually the smaller table) and applying the hashing algorithm to the key. The hash table consists of linked lists called hash buckets. The hash function applied to each hash key in the build input determines the relevant hash bucket. During the probe phase, the query processor scans each row from the probe input (the second table) and computes the same hash value on the hash key to find any matches in the corresponding hash bucket.

Hash joins perform most efficiently when the tables differ significantly in size. While processing the query, SQL Server might determine that it chose the wrong table for the build input. In that case, a role reversal occurs, and the build input becomes the probe input and vice versa.

An example of a hash function is adding the ASCII values of the characters in the key field and extracting only the last few digits of the result. Another example is computing the modulo (percentage) of the key and a prime number. An important aspect of the hash algorithm is that, unlike regular index keys, the hash function can produce significantly shortened versions of the original keys. The hash function chosen generates an even distribution of the input rows in the hash buckets. These special hash indexes are efficient for finding exact matches (because the two join keys generate the same hash function result), but are useless for joining tables based on inequalities.

To demonstrate a situation in which the optimizer usually chooses a hash join, drop the constraints and both clustered indexes from the Orders and OrderDetails tables. Now execute the same query we ran in the merge join example:

SELECT O.OrderID, O.OrderDate,OD.ProductID, OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID

The Orders table is the best candidate for the build input because it's considerably smaller. The linked lists start with the function's possible results; in this case, there are only five. Linked to each bucket are records containing all the values the query will return from the build input table that correspond to the bucket. Applying the hash function to the join column of each build input record determines the correspondence. OrderID and OrderDate are the only columns you need from the build input (the Orders table), so these values are the only ones that stay in each record of the hash table.

SQL Server performs a table scan on the smaller table, Orders, which is the build input. Then it builds a hash table with the values that result from applying the hash function to each scanned row. SQL Server also performs a table scan on the probe—the OrderDetails table—and looks for the matching values in the hash table by applying to each scanned row the same formula that it used to build the hash table.

In some cases, your server might not have enough memory to hold the entire build input. In that case, SQL Server divides the hash table into multiple partitions and performs the join in separate steps, loading the relevant partition as needed. The hash join algorithm is very fast and efficient when proper indexes don't exist. This efficiency is an advantage when you want to run ad hoc queries without incurring the overhead of creating and dropping new indexes for only these queries.

Outlines:

  • The main reason why the SQL Server optimizer chooses the hash join algorithm is a lack of adequate indexes on the join columns.
  • The execution plan involves building a hash table, which is like building a special index on the fly.
  • hash join strategy has two phases: build and probe.
  • During the build phase, the query processor builds a hash table by scanning each value in the build input (usually the smaller table) and applying the hashing algorithm to the key.
  • During the probe phase, the query processor scans each row from the probe input (the second table) and computes the same hash value on the hash key to find any matches in the corresponding hash bucket.

Join Hints:

The query optimizer is very sophisticated, and in most cases, it chooses the best join strategy. However, in some situations, perhaps because of a lack of statistics or uneven data distributions, it might not make the right choice. When tuning query performance, experiment with different strategies on the same query to try to find a better strategy than the optimizer's choice. If you think that a better strategy is available, you can use join hints. Keep in mind, though, that after you add a join hint to the query, the query becomes static. You lose SQL Server's ability to process queries dynamically as a result of adding a new index, dropping an existing index, updating statistics, or modifying data in the tables.

The syntax for supplying join hints is

SELECT <select_list>
FROM <left_table_name>
<join_type> <join_hint>
JOIN <right_table_name>

The <join_type> can be LOOP, MERGE, or HASH. (A fourth hint, REMOTE, doesn't really control the join strategy as we discuss in this article, so we won't go into detail about it.) Note that you must use the ANSI join syntax to use a join hint. For example, to force a certain join strategy on a query that selects all orders and their details, use the process in Listing 3, page 40. To test the efficiency of a certain query, you can use SET STATISTICS TIME ON to show execution times, SET STATISTICS IO ON to show I/O costs, and SET SHOWPLAN_TEXT ON to show the execution plan.

Let's look at an example in which forcing a strategy that is different from the one the query processor chose might yield better performance. Suppose the Orders table has 10,000 orders and the OrderDetails table has five entries for each of those orders. Run the following query, which selects all orders, their products, and quantities:

SET SHOWPLAN_TEXT ON
GO
SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID

The execution plan

Merge Join(Inner Join,
MERGE:(\[O\].\[OrderID\])=(\[OD\].\[OrderID\]),
RESIDUAL:(\[OD\].\[OrderID\]=\[O\].\[OrderID\]))
Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[Orders\].
\[PK_Orders\] AS \[O\]), ORDERED)
Clustered Index Scan
(OBJECT:(\[testdb\].\[dbo\].\[OrderDetails\].
\[PK_OrderDetails\] AS \[OD\]), ORDERED)

shows that the query processor chose a merge join. Forcing different join strategies with join hints shows that merge join is the best choice for this case.

If you run the following query

SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900

the query processor chooses merge join again. But if you try a nested-loop join, you'll see that it yields considerably better performance:

SELECT O.OrderID, O.OrderDate, OD.ProductID,OD.Quantity
FROM Orders O INNER LOOP JOIN OrderDetails OD
ON O.OrderID = OD.OrderID
WHERE O.OrderID > 9900

You might use another hint to force the query processor to join tables in the order in which they're listed in the FROM clause. With no hint, the query optimizer decides the order of the joins on a cost-based estimation, regardless of their order in the SQL statement. To force the join order for a certain query, use the FORCE ORDER query hint. FORCE ORDER isn't a JOIN hint because it doesn't appear in the FROM clause. Alternatively, you can force the query processor to use your specified join order for all queries in the session level by using SET FORCEPLAN ON.

As a rule, avoid interfering with the query optimizer's decisions. The optimizer uses a variety of algorithms and in most cases, it comes up with the best plan. The new join techniques we discussed in this article are some of the improvements in SQL Server 7.0 that let you keep a normalized data model, accommodate VLDBs, and get your desired results faster.

Outlines:
  • When tuning query performance, experiment with different strategies on the same query to try to find a better strategy than the optimizer's choice. If you think that a better strategy is available, you can use join hints.
  • Keep in mind, though, that after you add a join hint to the query, the query becomes static.

No comments:

Post a Comment