Larson et al. from Microsoft Research have this paper in ACM SIGMOD 2007. They introduce an interesting sampling mechanism over SQL Server using materialized views. Materialized views are the same as indexed views. If you create an indexed view over a table or a bunch of tables joined together, using their technique, you have a tidy sample for your data which is kept up-to-date automatically.
What caught my eyes was that their base sampling approach is interestingly simple:
CREATE VIEW Sample_TheTableName WITH SCHEMA_BINDING AS SELECT ..., __RAND = (CAST( ROW_NUMBER() OVER (ORDER BY RAND()) AS tinyint) FROM ...
The additional __RAND column gives a random distribution on data which can give us a statistically valid sample.
December 26, 2014 at 1:46 am
[…] the purpose of getting random rows is sampling, I have talked very briefly here about a nice approach from Larson et al., Microsoft Research team where they have developed a […]