Tuesday, August 23, 2005

 

Sampling Using TABLESAMPLE


With due credit to Steve Kass

SQL Server 2005 implements the TABLESAMPLE clause, which can be used to retrieve a sample of rows from a table.

Here is the syntax of the TABLESAMPLE clause.


TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]

For example:


SELECT * FROM Person.Contact
TABLESAMPLE SYSTEM(50 ROWS)


The optional SYSTEM keyword specifies the sampling algorithm to use. In SQL Server 2005, SYSTEM is the only sampling method available, and it is applied by default. Other products, such as DB2, have implemented additional sampling methods like BERNOULLI.
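
Since SYSTEM is the only method available and is also the default, you can leave the keyword out altogether; the query below behaves exactly like the earlier example.


-- SYSTEM is assumed when no sampling method is specified
SELECT * FROM Person.Contact
TABLESAMPLE (50 ROWS)
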

You can specify either the number of rows or the percentage of rows you want back. But remember that the sampling method returns only an approximate number of rows. So when you apply TABLESAMPLE to a small table, chances are that you will get zero rows back.
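
For example, the PERCENT form below asks for roughly ten percent of Person.Contact; the exact number of rows that comes back will differ from one execution to the next.


-- Approximately 10 percent of the table; the actual row count varies per run
SELECT * FROM Person.Contact
TABLESAMPLE SYSTEM (10 PERCENT)
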

When used with the REPEATABLE option and the same seed value, SQL Server returns the same sample each time, as long as no changes have been made to the table. When specified with a different repeat_seed value, SQL Server will likely return a different sample of the rows in the table.
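
For instance, running the query below twice with the same seed should return the same rows, provided the table has not been modified in between. The seed 123 is just an arbitrary value.


-- Same seed, same sample (as long as the data has not changed)
SELECT * FROM Person.Contact
TABLESAMPLE SYSTEM (10 PERCENT) REPEATABLE (123)
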

TABLESAMPLE does not work with views or inline table-valued functions.

When I first tried TABLESAMPLE, I ran


SELECT * FROM Sales.StoreContact
TABLESAMPLE SYSTEM(10 ROWS)


And to my surprise, most of the time I was getting zero rows back, and sometimes 37 or 179 rows. I reported this issue in a beta newsgroup, and SQL Server MVP Steve Kass helped me understand it better. Here is an excerpt from his reply.


<QUOTE>
As documented in Books Online ("Limiting Result Sets by Using TABLESAMPLE"),
the sampling algorithm can only return full data pages. Each page is
selected or skipped with probability [desired number of rows]/[rows in table]

The StoreContact table fits on 4 data pages. Three of those pages contain
179 rows, and one contains 37 rows. When you sample for 10 rows (1/75
of the table), each of the 4 pages is returned with probability 1/75 and
skipped with probability 74/75. The chance that no rows are returned is
about (74/75)^4, or about 87%. When rows are returned, about 3/4 of the
time you will see 179 rows, and about 1/4 of the time you will see 37
rows. Very rarely, you will see more rows, if two or more pages are
returned, but this is very unlikely.

As BOL suggests, SYSTEM sampling (which is the only choice) is not
recommended for small tables. I would add that if the table fits on
N data pages, you should not try to sample fewer than 1/N-th of the
rows, or that you should never try to sample fewer rows than fit on
at least 2 or 3 data pages.

If you were to sample roughly two data pages worth of rows, say
300 rows, the chance of seeing no rows would be about 13%. The
larger (more data pages) the table, the smaller the chance of
seeing no rows when at least a couple of pages worth are requested.
For example, if you request 300 rows from a 1,000,000 row table that
fits on 10,000 data pages, only in 5% of trials would you see no
rows, even though the request is for far less than 1% of the rows.

By choosing the REPEATABLE option, you will get the same sample
each time. For most seeds, this will be an empty sample in your
case. With other seeds, it will contain 37, 179, 216, 358, 395,
537, 574, or 753 rows, depending on which pages were selected,
with the larger numbers of rows returned for very few choices
of seed.

That said, I agree that the consequences of returning only
full data pages results in very confusing behavior!
</QUOTE>
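
The 5% figure in Steve's second example is easy to check: if p is the requested fraction of rows and the table spans N data pages, the chance of an empty sample is (1 - p) raised to the power N. The query below is just my own back-of-the-envelope check, not part of Steve's reply.


-- Probability of an empty sample when each of 10,000 pages is kept
-- independently with probability p = 300 / 1,000,000
SELECT POWER(1.0 - 300.0 / 1000000.0, 10000) AS prob_of_zero_rows
-- Returns roughly 0.05, i.e. about a 5% chance of seeing no rows
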


That makes things clear. To my question about the usability of the feature, Conor Cunningham, Development Lead of the SQL Server Query Optimization team, gave the following answer.


<QUOTE>
From a usability standpoint, we are following ANSI - we are allowed to
arbitrarily over- or under-sample using the SYSTEM keyword. For small
tables, we happen to over-sample by sampling all rows.

We believe that this provides an effective sampling algorithm for some
classes of applications.
</QUOTE>


But my concern was mostly about the zero-rows case. I sent the following reply to Conor.


<QUOTE>
I agree with the point that in the SYSTEM sampling method, the implementor is
allowed to arbitrarily over/under-sample any given set. And I am OK with
that.

But my cause of concern is that when I try to get a 10% sample on a table that
has over 700 rows, I get zero rows back.

To me, any sampling algorithm that is applied to a set that has one or more
rows MUST return one or more rows. I don't think zero rows can be a proper
sample.
</QUOTE>


I was asked to open a bug, which I did here.

Later, the issue was resolved as "By Design" by Microsoft.

Meanwhile, Steve Kass gave me the following T-SQL code to implement Bernoulli sampling.


SELECT * FROM Sales.StoreContact
WHERE RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerID) < 0.1
-- 0.1 is the desired probability of choosing a row; change as needed.
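
The WHERE clause makes an independent keep-or-skip decision for every row, which is what Bernoulli sampling means, so even a small table gets a sensible sample. A variation that does not depend on a key column (this one is my own, not from Steve's reply) is:


-- Keep each row independently with roughly 10% probability,
-- without referencing any particular column of the table
SELECT * FROM Sales.StoreContact
WHERE ABS(CHECKSUM(NEWID())) % 100 < 10
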


Even though TABLESAMPLE can be confusing with small tables, you will still find it useful in some cases.


