Analyze table space usage in SQL Server

It is pretty useful to check space usage of your tables. SQL server gives a nice system stored procedure called sp_spaceused that returns what you want. You can run the following script on your database and export the results to excel.

You can come up with nice graphs showing why your data base is big and what partitioning strategy to choose. For example I ran this query for a project with a 10GB database. I could not understand why only a few million small records  occupy 10GB. The result was expectable. There was an exception log table which occupied half of the space and dropping that table saved so much time for everybody.

DECLARE @spu TABLE(NAME VARCHAR(max), [ROWS] BIGINT, reserved VARCHAR(50), DATA VARCHAR(50), index_size VARCHAR(50), unused VARCHAR(50)


SELECT FROM sys.objects AS o
WHERE o.schema_id = SCHEMA_ID('dbo')
AND o.type = 'U'




	EXEC sys.sp_spaceused @objname = @TName -- nvarchar(776)





Dive into Expression Trees

I am in the process of developing a framework for sampling query results from LINQ queries. It is going to be hard work, you may ask WTF in this world someone need sampling query results from LINQ queries? Well, this can be particularly useful if you are developing a new database that purely works with LINQ or you are having heavy data services that use various data sources then you want to sample the query result.

Unfortunately, I am not currently working on any core database system or heavy duty data services (at work), but I have to do this as part of my study. This project which is going to be an open source project called [I don’t know yet] is supposed to do the job of evaluating my algorithms that I hopefully (fingers crossed) will publish at VLDB 2011.

So if you continue to visit this blog, you will see posts about these topics:

  • LINQ Expression Trees
  • Writing Provider for LINQ
  • Sampling Techniques
  • Data quality metrics
  • and, Rule based profiling


How to shrink the log file in SQL Server 2008

All of us have wanted to get rid of the SQL Server log file at some stage. The natural way of shrinking the log file is to back it up and keep it in a safe place. Log file keeps transaction data and is essential for temporal disaster recovery.

However, it is a common request to get rid of the log file because it takes space and some people don’t need the log file at all. Beware that there is only ONE place and ONE situation that log file is a waste of space and that is Dev Environment.

SQL Server 2008 has removed the possibility of shrinking the log file using DBCC SHRINKFILE or BACKUP SHRINKONLY. But if you want to get rid of the log file in your dev environment, use the following script:

Note: Before running this script, in SQL Server Management studio, select Query -> SQL CMD mode.

:Setvar DbName **Your Database Here**

PRINT 'Droping log file for $(DbName)'
EXEC sp_configure 'show advanced options', 1;
EXEC sp_configure 'xp_cmdshell', 1
USE Master;
DECLARE @rowsPath VARCHAR(400)
SELECT @rowsPath = physical_name FROM  [$(DbName)].sys.database_files AS df where type_desc='ROWS'
SELECT @logPath = physical_name FROM  [$(DbName)].sys.database_files AS df WHERE type_desc = 'LOG'
SELECT @logFile = name FROM [$(DbName)].sys.database_files AS  df WHERE type_desc = 'LOG'
PRINT 'Renaming the log file (' + @logPath + '). You can delete it later'
EXEC master.dbo.sp_detach_db @dbname = '$(DbName)', @skipchecks = 'false'
DECLARE @CMD VARCHAR(400) = 'ren "' + @logPath + '" "' + @logFile +  REPLACE(CONVERT(VARCHAR(25), GETDATE(), 113), ':', '') + '"'
EXEC master..xp_cmdshell @CMD
DECLARE @attachSql NVARCHAR(MAX) = N'CREATE DATABASE [$(DbName)] ON ( FILENAME = ''' + @rowsPath + ''' ) FOR ATTACH '
EXECUTE (@attachSql)

Sampling data quality in distributed SQL Servers (Part 1)

Before going forward with this post, I feel I should say why sampling is so important and why sample data quality? First of all, sampling is heavily used in various query optimization techniques. The very key thing a query optimizer needs to know is selectivity of the query. Selectivity means what are the number of results after running the query against a table. This is extremely important to decide the join direction and wrong estimation on selectivity can change the actual runtime of the query massively. For example when joining three  huge tables AxBxC together, which B is the many to many relation table, the query can be planned as (AxB)xC or Ax(BxC). You may say what is the difference? but imagine the query returns only one row from table C and 1 million rows from A. Which direction do you use? Of course Ax(BxC) ensures a million less lookups.

Sampling is an expensive way to estimate selectivity, because the query should run anyway (but over a much smaller set) but could be the simplest or sometimes the only option. Other ways include complex statistical modeling and collecting forms of statistics about data like histograms.

Although in the context of a single database histograms seem more appealing, in distributed databases, they are not god for several reasons that is out of the scope of this post.

I am personally more interested in distributed databases, not as talked in literature (or federated databases) but in a more practical collaborative enterprise system.

How to simulate bag access in Windows azure table storage? (Part 1)

It is nice to hear that Microsoft is providing table storage. Hopefully we can get it for non-azure platforms as well. The idea is fast and scalable access to persisted objects without limitations of tabular world. No doubt that relational databases are amazing and let for super complex queries and transactions to happen. Downside is their complexity of design and usage. It tends to be extremely hard to provide real scalable relational data yet satisfying service level agreements on response time, availability etc.

Efforts on developing non-relational non-schema bound data sets are as old as databases, and in the cloud era, they make so much sense. For example Mnesia is a lovely database designed to work with Erlang with a LINQ-like query language. Enough to say it is developed in 80’s and is easy to scale, and provides 100% uptime (you get a mechanism to do hot patching). I also read about this database (RavenDB) a few days ago which is based on a similar motive.

One important thing to remember when working with non-relational databases, is that they are not relational. Thus, you don’t run SQL scripts against them and there is no join, no views, no foreign keys and primary keys. These terms make sense for tabular data. Databases like table storage are semi-structured data storage. Structured is tabular and relational data storage store them. Semi structure is XML, JSON, or any other form of persisted object. Unstructured is web and free-form text, etc.

Mnesia (as a pioneer of table-storage like databases) stores data in set’s and bags. A set is a table, which each record has a unique key. Fair enough, we are used to work with table with primary key which is the same. But a bag, is a table in which many records can share a key, hence there might be no way to access a single row of a table because it does not have a unique key (You may say now, WTF? what happens to my candidate keys and primary keys – and my answer is wait a minute. We are not in relational world, so non of these terms exist here).

So what is the value of having a row in a table which we can not access it directly? It of course has some value. Bearing in mind again that table storage is not relational, a good design paradigm is to NEVER query anything except the key (and of course partition key for table storage). Any other query (which is not bounded to partition key for table storage) is similar to a full table scan in you SQL Server database and full table (or index) scan is is THE killer. You can never become scalable if you have a single operation with full table scan over your growing data.

to be continued…

To Cache or Not To Cache? This is the Question!

How many times in a developing a business app you have wondered if you should cache the data in your web server to reduce the number of database queries for performance.

Assuming you have unlimited memory, should you cache the whole table into memory (e.g. using ASP.Net method output cache) or you should rely on database and use Linq to database directly?

I guess, the answer to this question is pretty obvious. We know the limitation of both methods, linq to object, till now has no support for indexes and would perform memory scan for any query, while linq to DB has all the overhead of connection and query parsing, etc.

So, lets do a little test to see what is the actual limits of these methods? We will set up a super simple DB with a table that contains two integers. Then hammer DB with queries and cache lookups for different cache size to compare the result.

Run the following query to set up our DB:

create table Smash( Id int primary key, Val int )

declare @i int = 0
while @i<1000000
	insert into Smash (Id, Val)
	values (@i, @i)
	set @i = @i + 1

create nonclustered index [IX_val] on [dbo].[Smash]
	[Val] asc,
	[Id] asc

Now that we have our DB ready, lets write query for the result.

        const int Cnt = 100;
        private void button1_Click(object sender, EventArgs e)
            var cacheSize = 1000000;
            for (cacheSize = 100; cacheSize < 1000000; cacheSize *= 2)
                var cache = GetCecheAsBigAs(cacheSize);
                var i = cacheSize / 2;
                var time1 = QueryDBMultipleTimes(Cnt, i);
                WriteResult(time1, "DB", cacheSize);
                var time2 = QueryMemoryMultipleTimes(Cnt, i, cache);
                WriteResult(time2, "MEM", cacheSize);

And the query methods:

        private List GetCecheAsBigAs(int cacheSize)
            using (var db = new TempDataContext())
                return db.Smashes.Take(cacheSize).ToList();

        private TimeSpan QueryMemoryMultipleTimes(int Cnt, int lookup, IEnumerable cache)
            var t = DateTime.Now;
            for(var i=0; i< Cnt; i++)
               cache.Where(s => s.Val == lookup).First().Val.Value;
            return DateTime.Now - t;

        private TimeSpan QueryDBMultipleTimes(int Cnt, int lookup)
            using (var db = new TempDataContext())
                var t = DateTime.Now;
                for (var i = 0; i < Cnt; i++)
                     db.Smashes.Where(s => s.Val == lookup).First().Val.Value;
                return DateTime.Now - t;

Now comes the interesting part. I ran the above code and graphed the result:

The blue line represents cache access and the red one represents DB access. It can be clearly seen that cache terribly beats DB until the cache size becomes 20% of the table size. It can also be spotted that DB access has a steady response time. Remember we don’t change the DB size at all during the test. However, this graph has got no new message. It is the same old graph of linear vs tree access to data.

But don’t be fooled with this experiment. There are things to note of. First, our data structure is super simple. On a more complex query when lookups and joins come in, DB would definitely over perform memory drastically due to all the query optimization effort made into DBMS cores in last 30 years. This graph also doesn’t show the overhead of populating and re-populating cache. However, if your data is reasonably small (up-to a few thousand records), and your query is fairly simple, cache it in the web server.