Sampling Data with SQL Server


Larson et al. from Microsoft Research have this paper in ACM SIGMOD 2007. They introduce an interesting sampling mechanism over SQL Server using materialized views. Materialized views are the same as indexed views. If you create an indexed view over a table or a bunch of tables joined together, using their technique, you have a tidy sample for your data which is kept up-to-date automatically.

What caught my eyes was that their base sampling approach is interestingly simple:


CREATE VIEW Sample_TheTableName WITH SCHEMA_BINDING AS
  SELECT ...,
        __RAND = (CAST( ROW_NUMBER()  OVER (ORDER BY RAND()) AS tinyint)
  FROM ...

The additional __RAND column gives a random distribution on data which can give us a statistically valid sample.

Advertisement

One Response to “Sampling Data with SQL Server”

  1. How to: Random row from Linq to Sql | SevenNet Says:

    […] the purpose of getting random rows is sampling, I have talked very briefly here about a nice approach from Larson et al., Microsoft Research team where they have developed a […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: