Tuesday, August 23, 2005
Sampling Using TABLESAMPLE
With due credit to Steve Kass
SQL Server 2005 implements the TABLESAMPLE clause, which can be used to retrieve a sample of rows from a table.
Here is the syntax of the TABLESAMPLE clause.
TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]
E.g.
SELECT * FROM Person.Contact
TABLESAMPLE SYSTEM(50 ROWS)
The optional SYSTEM keyword specifies the sampling algorithm to use. In SQL Server 2005, SYSTEM is the only sampling method available and is applied by default. Other products such as DB2 have implemented additional sampling methods like BERNOULLI.
You can specify either the number of rows or the percentage of rows you want back. But remember that the sampling method returns only an approximate number of rows. So when you apply TABLESAMPLE to a small table, chances are you may get zero rows back.
When used with the REPEATABLE option and the same seed value, SQL Server returns the same sample as before (as long as the data has not changed). When specified with a different repeat_seed value, SQL Server will likely return a different sample of the rows in the table.
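For instance, running the following query twice with the same seed should return the same sample both times (a minimal sketch; the seed value 200 is arbitrary, and this assumes the table is not modified between runs):
SELECT * FROM Person.Contact
TABLESAMPLE SYSTEM(10 PERCENT)
REPEATABLE(200)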
TABLESAMPLE does not work with views or inline table-valued functions.
When I first tried TABLESAMPLE, I ran
SELECT * FROM Sales.StoreContact
TABLESAMPLE SYSTEM(10 ROWS)
And to my surprise, most of the time I was getting zero rows back, and sometimes 37 or 179 rows. I reported this issue in a beta newsgroup, and SQL Server MVP Steve Kass helped me understand it better. Here are excerpts from his reply.
<QUOTE>
As documented in Books Online ("Limiting Results Sets by Using TABLESAMPLE"),
the sampling algorithm can only return full data pages. Each page is
selected or skipped with probability [desired number of rows]/[rows in table].
The StoreContact table fits on 4 data pages. Three of those pages contain
179 rows, and one contains 37 rows. When you sample for 10 rows (1/75
of the table), each of the 4 pages is returned with probability 1/75 and
skipped with probability 74/75. The chance that no rows are returned is
about (74/75)^4, or about 87%. When rows are returned, about 3/4 of the
time you will see 179 rows, and about 1/4 of the time you will see 37
rows. Very rarely, you will see more rows, if two or more pages are
returned, but this is very unlikely.
As BOL suggests, SYSTEM sampling (which is the only choice) is not
recommended for small tables. I would add that if the table fits on
N data pages, you should not try to sample fewer than 1/N-th of the
rows, or that you should never try to sample fewer rows than fit on
at least 2 or 3 data pages.
If you were to sample roughly two data pages worth of rows, say
300 rows, the chance of seeing no rows would be about 13%. The
larger (more data pages) the table, the smaller the chance of
seeing no rows when at least a couple of pages worth are requested.
For example, if you request 300 rows from a 1,000,000 row table that
fits on 10,000 data pages, only in 5% of trials would you see no
rows, even though the request is for far less than 1% of the rows.
By choosing the REPEATABLE option, you will get the same sample
each time. For most seeds, this will be an empty sample in your
case. With other seeds, it will contain 37, 179, 216, 358, 395,
537, 574, or 753 rows, depending on which pages were selected,
with the larger numbers of rows returned for very few choices
of seed.
That said, I agree that the consequences of returning only
full data pages result in very confusing behavior!
</QUOTE>
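You can verify Steve's numbers empirically. Here is a quick test harness of my own (not from the newsgroup thread) that draws the 10-row sample 1000 times and counts the empty results; expect something near 870, i.e. (74/75)^4 of 1000:
DECLARE @i int, @empty int, @rows int
SELECT @i = 0, @empty = 0
WHILE @i < 1000
BEGIN
    SELECT @rows = COUNT(*)
    FROM Sales.StoreContact TABLESAMPLE SYSTEM(10 ROWS)
    IF @rows = 0 SET @empty = @empty + 1
    SET @i = @i + 1
END
SELECT @empty AS EmptySamples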
That makes things clear. To my question about the usability of the feature, Conor Cunningham, Development Lead of the SQL Server Query Optimization team, gave the following answer.
<QUOTE>
From a usability standpoint, we are following ANSI - we are allowed to
arbitrarily over- or under-sample using the SYSTEM keyword. For small
tables, we happen to over-sample by sampling all rows.
We believe that this provides an effective sampling algorithm for some
classes of applications.
</QUOTE>
But my concern was mostly about the zero rows case. I gave the following reply to Conor.
<QUOTE>
I agree that with the SYSTEM sampling method the implementor is
allowed to arbitrarily over/under sample any given set, and I am OK with
that.
But my cause of concern is that when I try to get a 10% sample from a table that has
over 700 rows, I get zero rows back.
To me, any sampling algorithm applied to a set that has one or more
rows MUST return one or more rows. I don't think zero rows can be a proper
sample.
</QUOTE>
I was asked to open a bug, which I did here.
Later, the issue was resolved as "By Design" by Microsoft.
Meanwhile, Steve Kass gave me the following T-SQL code to implement Bernoulli sampling.
select * from Sales.StoreContact
where rand(checksum(newid())%1000000000+CustomerID)< 0.1
-- 0.1 is the desired probability of choosing a row, change as needed.
Even though TABLESAMPLE can be confusing with small tables, you will still find it useful in some cases.
Sunday, August 07, 2005
MAX Sized datatypes
You are no longer limited to the 8000-character limit for VARCHAR variables. SQL Server 2005 enhances the VARCHAR, NVARCHAR and VARBINARY datatypes by raising the capacity to approximately 2GB. You declare the variables with the MAX specifier. Let's quickly try some VARCHAR(MAX) operations.
DECLARE @vchTest VARCHAR(MAX)
SET @vchTest = REPLICATE('AA', 5000)
SELECT LEN(@vchTest) as Length
Length
--------
8000
Of course, I expected to see 10000 as the result. I referred to REPLICATE in BOL, which clearly says:
"If character_expression is not of type varchar(max) or nvarchar(max), REPLICATE truncates the return value at 8,000 bytes. To return values greater than 8,000 bytes, character_expression must be explicitly cast to the appropriate large-value data type. "
So I tried
DECLARE @vchTest VARCHAR(MAX)
SET @vchTest = REPLICATE(CAST('AA' AS VARCHAR(MAX)), 5000)
SELECT LEN(@vchTest) as Length
Length
--------
10000
Then I tried concatenating two VARCHAR(n) values and assigning the result to a VARCHAR(MAX).
DECLARE @vchTest VARCHAR(MAX)
SET @vchTest = REPLICATE('AA',4000)+ REPLICATE('AA', 4000)
SELECT LEN(@vchTest) as Length
Length
--------
8000
This time I expected to see 16000 as the result. But I still got 8000 back!
I got help from Christian Wade's blog, which says:
<QUOTE>
The problem is that if you concatenate two varchar(8000) variables, you will be left with a varchar(8000) variable. This is the same issue as populating a float with the division of two integers.
The reason is that varchar(n) variables are physically stored in the same way - irrespective of length - whereas a varchar(MAX) is effectively a text under the covers and needs to worry about managing pointers to the relevant data pages (although this is abstracted away from the developer).
</QUOTE>
To confirm, I tried
DECLARE @S VARCHAR(9000)
which returned the error:
Msg 131, Level 15, State 3, Line 1
The size (9000) given to the type 'varchar' exceeds the maximum allowed for any data type (8000).
The workaround Christian Wade suggested is casting the two expressions individually to the large-value type and then doing the assignment.
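In code, the workaround looks like this (a sketch; casting either operand to VARCHAR(MAX) is enough to make the whole expression a large-value type):
DECLARE @vchTest VARCHAR(MAX)
SET @vchTest = CAST(REPLICATE('AA',4000) AS VARCHAR(MAX)) + REPLICATE('AA',4000)
SELECT LEN(@vchTest) as Length --returns 16000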
However, I find it easier to do an initialisation and then concatenate with the actual variable, like
DECLARE @vchTest VARCHAR(MAX)
SET @vchTest = 'Start'
SET @vchTest = @vchTest + REPLICATE('AbC2',2000) + 'Test'+ REPLICATE('AbC2',2000) + 'End'
SELECT LEN(@vchTest) as Length
Length
---------
16012
Then I tried various string functions on the variable.
SELECT LEN(LEFT(@vchTest,9000)) --OK
SELECT RIGHT(@vchTest,3)
SET @vchTest= REPLACE(@vchTest,'End','XXX')
SELECT RIGHT(@vchTest,3)
SET @vchTest = REVERSE(@vchTest)
SELECT RIGHT(@vchTest,3)
SET @vchTest = REVERSE(@vchTest)
SET @vchTest = SPACE(9000) + @vchTest
SELECT LEN(@vchTest)
SELECT CHARINDEX('Test',@vchTest) --Correct
SET @vchTest = STUFF(@vchTest,24001,4,'SQL')
SELECT CHARINDEX('SQL',@vchTest) --Correct
SELECT SUBSTRING(@vchTest,32000,3)
SET @vchTest = LOWER(UPPER(@vchTest))
SELECT LEN(@vchTest)
SET @vchTest = LTRIM(@vchTest)
SELECT LEN(@vchTest)
All seem to work fine.
The text, ntext and image data types will be removed in a future version of SQL Server. The varchar(max), nvarchar(max) and varbinary(max) data types are far superior to these old types, because you don't have to use special statements like WRITETEXT and UPDATETEXT; you can use the standard INSERT and UPDATE statements. For more information, see the topic Using Large Value Data Types in Books Online.
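For example, a plain UPDATE now works where WRITETEXT/UPDATETEXT used to be required, and the new .WRITE clause handles efficient partial updates (a sketch using a hypothetical #Doc table):
CREATE TABLE #Doc(DocId int, Body VARCHAR(MAX))
INSERT INTO #Doc VALUES(1, REPLICATE(CAST('x' AS VARCHAR(MAX)), 100000))
--Ordinary UPDATE, no special statements needed
UPDATE #Doc SET Body = Body + 'appended text' WHERE DocId = 1
--.WRITE replaces 5 characters starting at offset 0 (offsets are zero-based)
UPDATE #Doc SET Body .WRITE('START', 0, 5) WHERE DocId = 1
SELECT LEFT(Body, 10) AS Head, LEN(Body) AS Length FROM #Doc
DROP TABLE #Doc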
Saturday, August 06, 2005
New APPLY Operator
The new APPLY operator lets you reference one or more columns from the left table in the right-hand derived table or table-valued function. There are two variations of the operator: CROSS APPLY and OUTER APPLY. You can use CROSS APPLY like an INNER JOIN, except that you don't have to specify a JOIN condition and you can reference the left table's columns on the right-hand side of the operator. OUTER APPLY is like a LEFT JOIN, in that all rows from the left table are included in the result even when there is no match.
The main advantage of the CROSS APPLY operator is that it lets you pass values from the columns of the joined table to a table-valued UDF, which was not possible in previous versions of SQL Server.
Suppose you have the following table structure.
CREATE TABLE #Client(ClientId int, ClientName VARCHAR(35))
CREATE TABLE #Account(AccountId int, ClientId int, AccountValue numeric(18,2))
INSERT INTO #Client VALUES(1, 'Anil')
INSERT INTO #Client VALUES(2, 'Robin')
INSERT INTO #Account VALUES(1,1,100)
INSERT INTO #Account VALUES(2,1,200)
INSERT INTO #Account VALUES(3,1,300)
INSERT INTO #Account VALUES(4,2,100)
INSERT INTO #Account VALUES(5,2,400)
INSERT INTO #Account VALUES(6,2,500)
And you are asked to get the top 2 accounts for each client, based on AccountValue.
In SQL Server 2000 you could achieve this by writing a query like
SELECT C.ClientName, A.AccountValue
FROM #Client C
INNER JOIN #Account A
ON C.ClientId=A.ClientId
AND A.AccountID IN(
SELECT TOP 2 AccountId
FROM #Account D
WHERE D.ClientID=A.ClientId
ORDER BY D.AccountValue DESC)
That query is of course costly. See how CROSS APPLY makes life easier:
SELECT C.ClientName, A.AccountValue
FROM #Client C
CROSS APPLY
(SELECT TOP 2 AccountId, AccountValue
FROM #Account D
WHERE D.ClientID=C.ClientId
ORDER BY D.AccountValue DESC)A
Here is the result of the above two queries.
ClientName | AccountValue
Anil | 300.00
Anil | 200.00
Robin | 500.00
Robin | 400.00
To see OUTER APPLY at work, let's add a client without any accounts and try the above query with OUTER APPLY.
INSERT INTO #Client VALUES(3, 'Roji')
SELECT C.ClientName, A.AccountValue
FROM #Client C
OUTER APPLY
(SELECT TOP 2 AccountId, AccountValue
FROM #Account D
WHERE D.ClientID=C.ClientId
ORDER BY D.AccountValue DESC)A
And here is the result.
ClientName | AccountValue
Anil | 300.00
Anil | 200.00
Robin | 500.00
Robin | 400.00
Roji | NULL
You feel the real power of OUTER/CROSS APPLY when you use them with table-valued UDFs. There is a UDF named dbo.ufnGetContactInformation in the new AdventureWorks sample database, which accepts a ContactID as input and returns the contact details as a table. Let's use that function to get the contact details of all the contacts in the Sales.ContactCreditCard table.
SELECT B.FirstName, B.LastName, B.JobTitle, B.ContactType
FROM Sales.ContactCreditCard A
CROSS APPLY dbo.ufnGetContactInformation(A.ContactId) B
And here is the (abridged) result.
FirstName | LastName | JobTitle | ContactType
Catherine | Abel | Owner | Store Contact
Kim | Abercrombie | Owner | Store Contact
Humberto | Acevedo | Owner | Store Contact
Pilar | Ackerman | Owner | Store Contact
But that's just an introduction. I have seen a lot of smart SQL using CROSS APPLY in newsgroup postings recently. So stay tuned.
Thursday, August 04, 2005
DML With Output
SQL Server 2005 introduces a new feature for retrieving the rows affected by an INSERT, UPDATE or DELETE statement. It is a common requirement to get the affected rows back for further processing, auditing, or simply to return them to the client.
Those who have worked with triggers are familiar with the special inserted and deleted tables. The new OUTPUT clause works in the same manner. You can use the OUTPUT keyword to return information about the results of a Transact-SQL statement into a table variable, temporary table or permanent table, or just return it back to the client.
Using the OUTPUT Clause with INSERT
CREATE Table #OutputTest(someid int IDENTITY(1,1), sometext varchar(100))
--Table variable to collect changes
DECLARE @InsertDetails TABLE(someid int,sometext varchar(100))
INSERT INTO #OutputTest(sometext)
OUTPUT Inserted.SomeId, Inserted.someText INTO @InsertDetails
VALUES('Some Text goes Here')
SELECT * FROM @InsertDetails
Here is the result.
(1 row(s) affected)
someid sometext
----------- ----------------------------------
1 Some Text goes Here
(1 row(s) affected)
Note that we are also retrieving the just-inserted IDENTITY value.
Now let's insert a batch of rows.
--Table variable to collect changes
DECLARE @InsertDetails TABLE(someid int,sometext varchar(100))
--Inserting a batch
INSERT INTO #OutputTest(sometext)
OUTPUT Inserted.SomeId, Inserted.someText INTO @InsertDetails
SELECT 'Test 1' UNION
SELECT 'Test 2' UNION
SELECT 'Test 3'
SELECT * FROM @InsertDetails
And here is the OUTPUT.
(3 row(s) affected)
someid sometext
----------- -----------------------------------------------------------------
1 Test 1
2 Test 2
3 Test 3
(3 row(s) affected)
Using the OUTPUT Clause with UPDATE
DECLARE @UpdateDetails TABLE(id int, prevtext VARCHAR(100), newText VARCHAR(100))
UPDATE #OutputTest
SET sometext = 'XYZ'
OUTPUT deleted.someid, deleted.sometext,inserted.sometext INTO @UpdateDetails
WHERE SomeID < 3
SELECT * FROM @UpdateDetails
Here is the output.
(2 row(s) affected)
id prevtext newText
----------- ---------- ----------
1 Test 1 XYZ
2 Test 2 XYZ
(2 row(s) affected)
Note that here we are capturing both the current and previous values of the affected column.
Using the OUTPUT Clause with DELETE
DECLARE @DeleteDetails TABLE(id int, DeletedBy sysname)
DELETE FROM #OutputTest
OUTPUT deleted.someid, SUSER_NAME() INTO @DeleteDetails
WHERE SomeID=3
SELECT * FROM @DeleteDetails
Here is the result
(1 row(s) affected)
id DeletedBy
----------- --------------------------------------------------------------------------------------------------------------------------------
3 DOMAIN\rojipt
(1 row(s) affected)
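The OUTPUT clause can also return the rows straight to the client, without an INTO target. A minimal sketch against the same #OutputTest table:
UPDATE #OutputTest
SET sometext = 'ABC'
OUTPUT inserted.someid, deleted.sometext AS oldtext, inserted.sometext AS newtext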
I believe the DML OUTPUT clause will become one of the favourite features of T-SQL developers.
Wednesday, August 03, 2005
Common (TOP 10?) T-SQL Programming mistakes
1. TOP Without ORDER BY
SQL Server doesn't guarantee the order of the records returned if an explicit ORDER BY clause is not specified. Records are not returned in the order they were created. Most of the time records come back in clustered index order, but not necessarily always.
E.g.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
--Add a clustered index on ClientName
--CREATE CLUSTERED INDEX idx1 ON #Client(ClientName)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
SELECT TOP 1 * FROM #Client
--ORDER BY ClientID
DROP TABLE #Client
2. TOP Without TIES
If you are expecting ties in the result and you want all tied records (as when computing a rank), you should use the WITH TIES clause with the SELECT statement.
E.g.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 30000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
SELECT TOP 1 * FROM #Client
ORDER BY AnnualIncome DESC
DROP TABLE #Client
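The corrected query (run before the DROP TABLE above) brings back both tied clients, Binu and Shino:
SELECT TOP 1 WITH TIES * FROM #Client
ORDER BY AnnualIncome DESC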
3. Not Using an Alias
Consider the following query.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
CREATE TABLE #Account(AccountId int IDENTITY(1,1), ClientId int, AccountValue int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
--Populate the account table
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(2, 1500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(3, 2000)
SELECT * FROM #Account
WHERE AccountId IN(SELECT AccountId FROM #Client)
DROP TABLE #Client
DROP TABLE #Account
Obviously there is no column called AccountId in the #Client table. You probably expect an error, but the query returns all the records from the #Account table. What happens behind the scenes is that SQL Server looks for the AccountId column in the #Client table, and when it fails to find it there, it resolves the column against the outer table. This behaviour is referred to as fancy scoping.
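Qualifying the subquery's column with a table alias surfaces the problem immediately; run before the DROP TABLEs above, this version fails with "Invalid column name 'AccountId'" instead of silently returning everything:
SELECT * FROM #Account A
WHERE A.AccountId IN(SELECT C.AccountId FROM #Client C)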
4. Aggregation And JOINs
Be careful when aggregating over a joined resultset; chances are you may get the wrong result. Consider the following example.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
CREATE TABLE #Account(AccountId int IDENTITY(1,1), ClientId int, AccountValue int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
--Populate the account table
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 1500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(2, 1500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(3, 2000)
SELECT C.ClientId, C.ClientName,SUM(C.AnnualIncome)
FROM #Client C
INNER JOIN #Account A
ON C.ClientId = A.ClientId
GROUP BY C.ClientId, C.ClientName
ORDER BY C.ClientId
DROP TABLE #Client
DROP TABLE #Account
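Client 1 owns two accounts, so the join produces two rows for that client and SUM(C.AnnualIncome) double-counts the 10000. One fix (a sketch, run before the DROP TABLEs) is to aggregate the child table in a derived table before joining:
SELECT C.ClientId, C.ClientName, C.AnnualIncome, A.TotalAccountValue
FROM #Client C
INNER JOIN (SELECT ClientId, SUM(AccountValue) AS TotalAccountValue
            FROM #Account
            GROUP BY ClientId) A
ON C.ClientId = A.ClientId
ORDER BY C.ClientId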
5. NOT IN AND NULL
If you use the NOT IN operator with a subquery and the subquery returns any NULL values, the NOT IN predicate evaluates to UNKNOWN for every row, so the query returns no rows at all! This can be dangerous, and it is not the case if you use IN.
E.g.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
CREATE TABLE #Account(AccountId int IDENTITY(1,1), ClientId int, AccountValue int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
--Populate the account table
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(2, 1500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(NULL, 2000)
SELECT * FROM #Client
WHERE ClientID NOT IN(SELECT ClientId FROM #Account)
DROP TABLE #Client
DROP TABLE #Account
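Two safe alternatives (run before the DROP TABLEs): filter the NULLs out of the subquery, or use NOT EXISTS, which is unaffected by NULLs.
SELECT * FROM #Client
WHERE ClientID NOT IN(SELECT ClientId FROM #Account WHERE ClientId IS NOT NULL)
--or, equivalently
SELECT * FROM #Client C
WHERE NOT EXISTS(SELECT * FROM #Account A WHERE A.ClientId = C.ClientID)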
6. Aggregation And NULL
If any column you apply an aggregate function to contains NULLs, SQL Server returns a warning: "Warning: Null value is eliminated by an aggregate or other SET operation." If you are accessing the recordset using ADO, you might face problems. Either use the ISNULL() function or SET ANSI_WARNINGS OFF. Note that setting ANSI_WARNINGS inside a stored procedure will cause it to recompile every time.
E.g.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 30000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', NULL)
SELECT SUM(AnnualIncome) FROM #Client
DROP TABLE #Client
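The ISNULL() version suppresses the warning (harmless for this SUM, though substituting 0 would change the result of AVG or COUNT):
SELECT SUM(ISNULL(AnnualIncome, 0)) FROM #Client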
7. OUTER JOIN AND WHERE Condition
Consider the following query.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
CREATE TABLE #Account(AccountId int IDENTITY(1,1), ClientId int, AccountValue int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
--Populate the account table
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(2, 1500)
--To find clients who have an AnnualIncome > 10000
SELECT C.ClientName
FROM #Client C
LEFT JOIN #Account A
ON C.ClientId = A.ClientId
AND C.AnnualIncome > 10000
DROP TABLE #Client
DROP TABLE #Account
This query returns an incorrect result. For INNER JOINs, putting the filter condition in the WHERE clause or the ON clause has the same effect. But for OUTER JOINs it's a different story: you specify the join criteria in the ON clause and the filter criteria in the WHERE clause. A filter placed in the ON clause of an outer join does not reduce the number of rows returned from the preserved table; it only affects which rows match.
A SELECT statement is logically processed in the following order.
FROM / JOIN
WHERE
GROUP BY
HAVING
SELECT
DISTINCT
ORDER BY
TOP
(See this link for a detailed description by Joe Celko.)
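Moving the filter to the WHERE clause gives the intended result, only Binu and Shino:
SELECT C.ClientName
FROM #Client C
LEFT JOIN #Account A
ON C.ClientId = A.ClientId
WHERE C.AnnualIncome > 10000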
8. Concatenation And NULL
If you concatenate a field that has a NULL value, the result will be NULL. To avoid this, either use the ISNULL() function or SET CONCAT_NULL_YIELDS_NULL OFF.
E.g.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
-- Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', NULL)
SELECT * FROM #Client
SELECT ClientName + ' - '+ CONVERT(VARCHAR,AnnualIncome) AS [ClientInfo]
FROM #Client
DROP TABLE #Client
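With ISNULL() the row for Shino survives (a sketch; 'N/A' is just a placeholder):
SELECT ClientName + ' - ' + ISNULL(CONVERT(VARCHAR(12),AnnualIncome), 'N/A') AS [ClientInfo]
FROM #Client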
9. NULL valued Parameters
If the parameter you are passing has a NULL value, then even if there are records with a matching NULL, the query won't return any rows, because under the default ANSI_NULLS setting the comparison NULL = NULL evaluates to UNKNOWN.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
CREATE TABLE #Account(AccountId int IDENTITY(1,1), ClientId int, AccountValue int)
--Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 30000)
--Populate the account table
INSERT INTO #Account(ClientId, AccountValue)
VALUES(1, 500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(2, 1500)
INSERT INTO #Account(ClientId, AccountValue)
VALUES(NULL, 2000)
DECLARE @intClientId int
SET @intClientId = NULL
SELECT * FROM #Account
WHERE ClientId = @intClientId
DROP TABLE #Client
DROP TABLE #Account
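If you genuinely want a NULL parameter to match the NULL rows, test for it explicitly:
SELECT * FROM #Account
WHERE ClientId = @intClientId
OR (ClientId IS NULL AND @intClientId IS NULL)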
10. Error Handling
Consider the following example.
CREATE TABLE #Client(ClientID int IDENTITY(1,1), ClientName varchar(30), AnnualIncome int)
-- Populate the client table.
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Prasanth', 10000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Binu', 20000)
INSERT INTO #Client(ClientName, AnnualIncome)
VALUES('Shino', 3000)
--SELECT * FROM #Client
DECLARE @intVar int
SET @intVar = 0
SELECT AnnualIncome/@intVar AS Something
FROM #Client
IF @@ERROR <> 0
SELECT @@ERROR
ELSE
SELECT @@ROWCOUNT
DROP TABLE #Client
In both cases, the error-handling section will not work as expected, because @@ERROR and @@ROWCOUNT reflect only the last statement executed: by the time SELECT @@ERROR runs, the IF check itself has already reset @@ERROR to 0, and @@ROWCOUNT likewise no longer reflects the query.
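The standard idiom is to capture both values in a single SELECT immediately after the statement being checked (a sketch reusing @intVar and #Client from the batch above):
DECLARE @err int, @rc int
SELECT AnnualIncome/@intVar AS Something
FROM #Client
SELECT @err = @@ERROR, @rc = @@ROWCOUNT --capture both before they are reset
IF @err <> 0
    SELECT @err AS ErrorNumber
ELSE
    SELECT @rc AS RowsReturned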