Sampling data quality in distributed SQL Servers (Part 1)

Before going forward with this post, I feel I should say why sampling is so important and why sample data quality? First of all, sampling is heavily used in various query optimization techniques. The very key thing a query optimizer needs to know is selectivity of the query. Selectivity means what are the number of results after running the query against a table. This is extremely important to decide the join direction and wrong estimation on selectivity can change the actual runtime of the query massively. For example when joining three  huge tables AxBxC together, which B is the many to many relation table, the query can be planned as (AxB)xC or Ax(BxC). You may say what is the difference? but imagine the query returns only one row from table C and 1 million rows from A. Which direction do you use? Of course Ax(BxC) ensures a million less lookups.

Sampling is an expensive way to estimate selectivity, because the query should run anyway (but over a much smaller set) but could be the simplest or sometimes the only option. Other ways include complex statistical modeling and collecting forms of statistics about data like histograms.

Although in the context of a single database histograms seem more appealing, in distributed databases, they are not god for several reasons that is out of the scope of this post.

I am personally more interested in distributed databases, not as talked in literature (or federated databases) but in a more practical collaborative enterprise system.

Advertisements