Sampling data quality in distributed SQL Servers (Part 1)

Before going further with this post, I should explain why sampling is so important and why we would sample data quality in the first place. First of all, sampling is heavily used in various query optimization techniques. The key thing a query optimizer needs to know is the selectivity of a query, that is, how many rows the query returns when run against a table. This is extremely important for deciding the join order, and a wrong selectivity estimate can change the actual runtime of the query massively. For example, when joining three huge tables AxBxC together, in which B is the many-to-many relation table, the query can be planned as (AxB)xC or Ax(BxC). You may ask what the difference is, but imagine the query returns only one row from table C and a million rows from A. Which order would you use? Of course Ax(BxC) saves about a million lookups.

Sampling is an expensive way to estimate selectivity, because the query has to run anyway (although over a much smaller set), but it can be the simplest, and sometimes the only, option. Other approaches include complex statistical modeling and collecting various statistics about the data, such as histograms.
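To make the idea concrete, here is a minimal sketch of sampling-based selectivity estimation (the Row type, the predicate, and the sample size are made up for illustration; this is not how any particular optimizer implements it): run the predicate over a random sample and scale the hit ratio up to the table size.

using System;
using System.Collections.Generic;
using System.Linq;

class Row { public int Val; }

static class SelectivityDemo
{
    // Estimate the fraction of rows satisfying 'predicate' by testing
    // it against a random sample instead of the whole table.
    static double EstimateSelectivity(IList<Row> table, Func<Row, bool> predicate, int sampleSize)
    {
        var rnd = new Random();
        var hits = Enumerable.Range(0, sampleSize)
                             .Select(_ => table[rnd.Next(table.Count)])
                             .Count(predicate);
        return (double)hits / sampleSize;
    }

    static void Main()
    {
        var table = Enumerable.Range(0, 1000000)
                              .Select(i => new Row { Val = i % 100 })
                              .ToList();

        // About 1% of rows have Val == 42; the estimate should land near 10,000 rows.
        var fraction = EstimateSelectivity(table, r => r.Val == 42, 10000);
        Console.WriteLine("Estimated matching rows: {0}", fraction * table.Count);
    }
}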

Although in the context of a single database histograms seem more appealing, in distributed databases they are not a good fit, for several reasons that are out of the scope of this post.

I am personally more interested in distributed databases, not in the sense usually discussed in the literature (or in federated databases), but in the more practical setting of a collaborative enterprise system.


ReSharper: Access to the modified closure, simplified.

The following code makes ReSharper nag with “Access to the modified closure”!


var list = new int[] { 1, 2, 3, 4, 5 };
var listdoubled = new List<int>();
foreach (var item in list)
{
    listdoubled.Add(item * 2);
}
listdoubled.ForEach(n => Console.WriteLine(n));

The above code looks absolutely perfect, so why is ReSharper so concerned about it? Below is what can go wrong when you access the modified closure.


var list = new int[] { 1, 2, 3, 4, 5 };
var listdoubled = new List<Action>();
foreach (var item in list)
{
    listdoubled.Add(() => { Console.WriteLine(item * 2); });
}
listdoubled.ForEach(n => n());

The first piece of code will print this:

2
4
6
8
10

But the second piece of code will print this:

10
10
10
10
10

Why? Because the loop variable in the foreach is declared once and modified on each iteration, and every lambda captures that single variable. If you change the code as follows, you get the correct behavior.


var list = new int[] { 1, 2, 3, 4, 5 };
var listdoubled = new List<Action>();
foreach (var item in list)
{
    var itemWithNewScope = item;
    listdoubled.Add(() => { Console.WriteLine(itemWithNewScope * 2); });
}
listdoubled.ForEach(n => n());

A bit of C#

Can you say what this code prints out?


var list = new int[] { 1, 2, 3, 4, 5 };
var delegates = list.Select(i =>
{
    var c = i * 10;
    return (Func<int>)(() => { Console.WriteLine(c); return c++; });
}).ToList();

delegates.ForEach(d => d());
delegates[0]();
delegates[0]();
delegates[1]();

The answer is:
10
20
30
40
50
11
12
21

C# is sexy, isn’t it?

Covariance and contravariance

I am writing this post mainly to experiment with the spelling of covariance and contravariance. What if Shakespeare had been a computer geek and wanted to develop a new language? “To sleep, perchance to dream” looks more like PerlScript to me than English!

These terms lead to other terms like variant and invariant. However, thou shalt not be freaked out; they are simpler than they appear:

Covariance is this (a type will be automatically converted to a more general type):

object str = "string";

Contravariance is the opposite. In C#, method group to delegate conversions are covariant in return types, and contravariant in argument types.
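A small sketch of what that means in practice (the method names here are made up for illustration):

using System;

class MethodGroupVarianceDemo
{
    static string MakeGreeting() { return "hello"; }            // returns string
    static void PrintAnything(object o) { Console.WriteLine(o); }

    static void Main()
    {
        // Covariant return type: a method returning string can back a Func<object>.
        Func<object> produce = MakeGreeting;

        // Contravariant parameter type: a method taking object can back an Action<string>.
        Action<string> consume = PrintAnything;

        Console.WriteLine(produce()); // hello
        consume("world");             // world
    }
}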

Generic delegate types are always invariant in C# 3.0, which means they cannot be automatically converted between more and less derived type arguments, but C# 4.0 allows declaring covariance and contravariance on parameterized interface and delegate types.

This post describes this new feature of C# 4.0 in more detail.

In previous versions of C#, generic types are invariant. That is, for any two types GenericType<T> and GenericType<K>, in which T is a subclass of K or K is a subclass of T, GenericType<T> and GenericType<K> have no inheritance relationship whatsoever to each other. On the other hand, if T is a subclass of K and if C# supports covariance, GenericType<T> is also a subclass of GenericType<K>, or if C# supports contravariance, GenericType<T> is a superclass of GenericType<K>.

It continues:

Fortunately, C# 4.0 provides us an option: if a generic interface or generic delegate has a reference type T as its type parameter and does not have any method or member that takes in a parameter of type T, we can declare it to be covariant on T. On the other hand, if that interface or delegate does not have any method or member that returns T, we can declare it to be contravariant on T. (Notice the emphases: only interfaces and delegates have covariance and contravariance support, and the type parameter must be a reference type. On the other hand, C# arrays have supported covariance from the very beginning.)

In simple words, now you can do this:


interface ISampleGenerator<out T>
{
    IEnumerable<T> MakeSamples(int count);
}

// Assuming some implementation of the interface
ISampleGenerator<object> objects = new StringSampleGenerator<String>();
foreach (var a in objects.MakeSamples(10))
{
    // a is object here
}

Note that this covariance is possible only if the interface has absolutely no input dependency on the type T.
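For the mirror case, here is a sketch of contravariance under the same assumptions (the IValidator interface and ObjectValidator class are invented for illustration): if T appears only in input positions, the parameter can be marked with in, and a validator of a general type can stand in for a validator of a more specific one.

using System;

interface IValidator<in T>
{
    bool IsValid(T item);
}

class ObjectValidator : IValidator<object>
{
    // Accepts any object, so it can also serve as an IValidator<string>.
    public bool IsValid(object item) { return item != null; }
}

class ContravarianceDemo
{
    static void Main()
    {
        // Contravariance: IValidator<object> converts implicitly to IValidator<string>.
        IValidator<string> stringValidator = new ObjectValidator();
        Console.WriteLine(stringValidator.IsValid("hello")); // True
    }
}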

C# in Depth.

I was reading this book last night. It’s a nice one; I suggest you purchase a copy and read it. The following images are from the book, and they remind me of how nice and easy programming in C# has become over the last decade.

How to simulate bag access in Windows Azure table storage? (Part 1)

It is nice to hear that Microsoft is providing table storage; hopefully we can get it for non-Azure platforms as well. The idea is fast and scalable access to persisted objects without the limitations of the tabular world. No doubt relational databases are amazing and allow super complex queries and transactions to happen. The downside is their complexity of design and usage. It tends to be extremely hard to make relational data truly scalable while still satisfying service level agreements on response time, availability, etc.

Efforts to develop non-relational, non-schema-bound data stores are as old as databases themselves, and in the cloud era they make a lot of sense. For example, Mnesia is a lovely database designed to work with Erlang, with a LINQ-like query language. Suffice it to say it was developed in the '80s, is easy to scale, and provides 100% uptime (you get a mechanism to do hot patching). I also read about another database (RavenDB) a few days ago which is based on a similar motivation.

One important thing to remember when working with non-relational databases is that they are not relational. Thus, you don’t run SQL scripts against them, and there are no joins, no views, no foreign keys, and no primary keys; those terms make sense for tabular data. Databases like table storage are semi-structured data stores. Structured data is tabular, and relational databases store it. Semi-structured data is XML, JSON, or any other form of persisted object. Unstructured data is the web, free-form text, etc.

Mnesia (as a pioneer of table-storage-like databases) stores data in sets and bags. A set is a table in which each record has a unique key. Fair enough; we are used to working with tables with a primary key, which is the same thing. But a bag is a table in which many records can share a key, hence there might be no way to access a single row of the table because it does not have a unique key. (You may say now, WTF? What happened to my candidate keys and primary keys? And my answer is, wait a minute. We are not in the relational world, so none of these terms exist here.)

So what is the value of having a row in a table that we cannot access directly? It does have value, of course. Bearing in mind again that table storage is not relational, a good design paradigm is to NEVER query on anything except the key (and, for table storage, the partition key). Any other query (one not bounded to the partition key in table storage) is similar to a full table scan in your SQL Server database, and a full table (or index) scan is THE killer. You can never be scalable if you have a single operation doing a full table scan over your growing data.
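To make the bag idea concrete before the next part, here is a purely in-memory sketch (the BagEntry type, the GUID row id, and the List-based store are my own illustration, not the actual table storage API): many entries share a bag key, each gets a unique row id, and the only lookup we ever do is by that key.

using System;
using System.Collections.Generic;
using System.Linq;

class BagEntry
{
    public string BagKey;   // shared by many entries, like a partition key
    public string RowId;    // unique per entry, e.g. a GUID, like a row key
    public string Payload;
}

class BagSketch
{
    static void Main()
    {
        var store = new List<BagEntry>();

        // Two entries sharing the same bag key.
        store.Add(new BagEntry { BagKey = "user42", RowId = Guid.NewGuid().ToString(), Payload = "login" });
        store.Add(new BagEntry { BagKey = "user42", RowId = Guid.NewGuid().ToString(), Payload = "logout" });

        // The only query we allow ourselves: by the bag key.
        var entries = store.Where(e => e.BagKey == "user42").ToList();
        Console.WriteLine("user42 has {0} entries", entries.Count);
    }
}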

to be continued…

A framework for in-memory LINQ

I had a blog post comparing the performance of cache versus database. Unfortunately, LINQ to Objects takes a naive, straightforward approach to any query, which is a brute-force scan over everything. This should change sometime; it annoys me a lot. Be a man and start fixing this. I will catch up on it someday after I have mowed my yard; I am too busy right now and my weeds are close to 2 meters. Here is my wish list: when I write this code


var indexedCollection = myMemColl.AddIndex( i => new { i.FName, i.LName }, IndexOptions.CreateStatistics );

var topTen = indexedCollection
    .Where( i => i.FName.CompareTo("c") > 0 && i.FName.CompareTo("d") < 0 )
    .OrderBy( i => i.FName )
    .AsIndexedQueryable()
    .Take(10);

I want the indexed collection to smartly utilize my indexes and take the top 10 items for me without scanning the whole collection. Is that really a big request in the 21st century?
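Until such a framework exists, here is a minimal sketch of the underlying idea (the Person type, the sample data, and the range bounds are made up): keep a sorted structure next to the collection and answer the range query from it, stopping as soon as the upper bound is passed instead of scanning everything.

using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string FName;
    public string LName;
}

class IndexSketch
{
    static void Main()
    {
        var people = new List<Person>
        {
            new Person { FName = "alice", LName = "a" },
            new Person { FName = "carol", LName = "c" },
            new Person { FName = "cindy", LName = "x" },
            new Person { FName = "dave",  LName = "d" },
        };

        // A crude "index": the collection keyed and sorted by first name.
        var byFName = new SortedList<string, Person>(
            people.ToDictionary(p => p.FName, p => p));

        // Range query "c" < FName < "d": because the keys are sorted we can stop
        // enumerating once we pass the upper bound (a real index would also
        // binary-search to the lower bound instead of skipping linearly).
        var topTen = byFName
            .SkipWhile(kv => kv.Key.CompareTo("c") <= 0)
            .TakeWhile(kv => kv.Key.CompareTo("d") < 0)
            .Take(10)
            .Select(kv => kv.Value)
            .ToList();

        topTen.ForEach(p => Console.WriteLine(p.FName));
    }
}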

To Cache or Not To Cache? This is the Question!

How many times, when developing a business app, have you wondered whether you should cache data in your web server to reduce the number of database queries for performance?

Assuming you have unlimited memory, should you cache the whole table in memory (e.g. using the ASP.NET output cache), or should you rely on the database and use LINQ against the database directly?

I guess the answer to this question is pretty obvious. We know the limitations of both methods: LINQ to Objects has, so far, no support for indexes and performs a memory scan for any query, while LINQ to the database has all the overhead of connections, query parsing, etc.

So, let's do a little test to see what the actual limits of these methods are. We will set up a super simple DB with a table that contains two integers, then hammer the DB with queries and the cache with lookups for different cache sizes and compare the results.

Run the following query to set up our DB:

create table Smash( Id int primary key, Val int )
go

declare @i int = 0
while @i<1000000
begin
	insert into Smash (Id, Val)
	values (@i, @i)
	set @i = @i + 1
end
go

create nonclustered index [IX_val] on [dbo].[Smash]
(
	[Val] asc,
	[Id] asc
)
GO

Now that we have our DB ready, let's write the query code.


const int Cnt = 100;

private void button1_Click(object sender, EventArgs e)
{
    for (var cacheSize = 100; cacheSize < 1000000; cacheSize *= 2)
    {
        var cache = GetCacheAsBigAs(cacheSize);
        // Look up a value that is guaranteed to be inside the cached range.
        var i = cacheSize / 2;
        var time1 = QueryDBMultipleTimes(Cnt, i);
        WriteResult(time1, "DB", cacheSize);
        var time2 = QueryMemoryMultipleTimes(Cnt, i, cache);
        WriteResult(time2, "MEM", cacheSize);
    }
}

And the query methods:


private List<Smash> GetCacheAsBigAs(int cacheSize)
{
    using (var db = new TempDataContext())
    {
        return db.Smashes.Take(cacheSize).ToList();
    }
}

private TimeSpan QueryMemoryMultipleTimes(int Cnt, int lookup, IEnumerable<Smash> cache)
{
    var t = DateTime.Now;
    for (var i = 0; i < Cnt; i++)
    {
        var val = cache.Where(s => s.Val == lookup).First().Val.Value;
    }
    return DateTime.Now - t;
}

private TimeSpan QueryDBMultipleTimes(int Cnt, int lookup)
{
    using (var db = new TempDataContext())
    {
        var t = DateTime.Now;
        for (var i = 0; i < Cnt; i++)
        {
            var val = db.Smashes.Where(s => s.Val == lookup).First().Val.Value;
        }
        return DateTime.Now - t;
    }
}

Now comes the interesting part. I ran the above code and graphed the result:

The blue line represents cache access and the red one represents DB access. It can be clearly seen that the cache beats the DB dramatically until the cache size reaches about 20% of the table size. It can also be seen that DB access has a steady response time; remember, we don’t change the DB size at all during the test. However, this graph carries no new message. It is the same old story of linear versus tree access to data.

But don’t be fooled by this experiment. There are things to note. First, our data structure is super simple. On a more complex query, when lookups and joins come in, the DB would definitely outperform memory drastically, thanks to all the query optimization effort put into DBMS cores over the last 30 years. This graph also doesn’t show the overhead of populating and re-populating the cache. However, if your data is reasonably small (up to a few thousand records) and your query is fairly simple, cache it in the web server.

 
