DBLP for SQL Server

DBLP is the mainstream computer science publication database. Anyone can download its dataset from here for free. It is a good benchmark dataset for many algorithmic research purposes. For example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which contains numerous data quality issues.

I have used this dataset multiple times for my own work; however, for someone like me who uses a commercial RDBMS, a little bit of work is required to utilise it. The original dataset, which you can download from the above link, is an XML file suited to a column store database (like MonetDB). Most commercial DBMSs, however, are table stores, and you need a way to make this XML suitable for your DBMS.

Although it is not particularly hard to implement, and it takes only a few hours to figure out what to do and less than an hour to implement, I have put my solution HERE, which may save you a few hours.

It basically converts dblp's XML to flat CSVs that you can import into your DBMS. A copy of a SQL Server 2008 R2 backup file is also included in the solution for your convenience.

(Update: sorry, I have removed the SQL Server backup file because it was slowing down the checkout.)

To use the code, first check out the project folder (using svn). It contains a project called Dblp_xml2csv, which is a console application; set it as the startup project and build it.

You need to have downloaded the DBLP dataset in XML format. Go to where dblp_xml2csv.exe is built and run the following command:
>dblp_xml2csv.exe dblp-data.xml

You can replace dblp-data.xml with the path to the dblp XML file.

There is no need for the companion XSD, because for performance reasons the XML dataset is treated as a plain text file.

Running the above command will create a couple of .csv files, which you can then import into SQL Server, Excel, etc.
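
If you go the SQL Server route, a minimal import sketch might look like the following. The table definition, file path, and delimiters here are my assumptions, so adjust them to whatever columns the converter actually emits.

-- Minimal sketch: import one of the generated CSVs into SQL Server.
-- Table definition, file path and delimiters are assumptions; adjust them
-- to the actual output of dblp_xml2csv.exe.
CREATE TABLE dbo.DblpPublication
(
    [Key]  VARCHAR(200)   NOT NULL,   -- dblp publication key
    Title  NVARCHAR(1000) NULL,
    Year   INT            NULL,
    Venue  NVARCHAR(400)  NULL
);

BULK INSERT dbo.DblpPublication
FROM 'C:\dblp\publications.csv'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2   -- skip the header row, if the CSV has one
);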

The additional material in the solution relates to some work I was doing on a data quality estimation technique and will probably not be of much use to you.


View Matching

View matching is a fairly old technique, first seriously utilised at the beginning of the millennium, when Larson et al. from Microsoft Research exploited it for SQL Server 2000.

The idea is to utilise existing materialised views (or indexed views, in SQL Server terms) for query optimisation. For example, if you run the query "SELECT * FROM Cars WHERE Brand='BMW' AND Price BETWEEN 40,000 AND 70,000" and there is already an indexed view over Cars called v_LuxuryCars, defined as "SELECT * FROM Cars WHERE Price > 40,000", the query optimiser can exploit v_LuxuryCars to execute the query. The view can be much smaller than the base table, so the query can run faster. The real beauty of exploiting materialised views becomes obvious in more complex queries, though.
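
To make the example concrete, here is a minimal sketch of what such an indexed view looks like in SQL Server (the Cars table and its columns are made up for illustration):

-- Sketch only: an indexed view corresponding to the v_LuxuryCars example.
-- The Cars table and its columns are assumptions for illustration.
CREATE VIEW dbo.v_LuxuryCars WITH SCHEMABINDING AS
SELECT CarId, Brand, Price
FROM dbo.Cars
WHERE Price > 40000;
GO
-- The unique clustered index is what actually materialises the view.
CREATE UNIQUE CLUSTERED INDEX IX_v_LuxuryCars ON dbo.v_LuxuryCars (CarId);

Note that real indexed views require SCHEMABINDING and an explicit column list rather than SELECT *, which is why the sketch spells the columns out.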

The challenge is how to efficiently figure out whether a view can be used to answer a query; essentially, whether the query predicate is contained in the view definition predicate (as in the example above, where Price BETWEEN 40K AND 70K is contained in Price > 40K). This is not straightforward, since predicates can be quite complicated. For example, the query may have the predicate A=2 AND B=2 while the view definition has the predicate A=B; in that case the containment relationship between the predicates is not obvious.

However, this problem is well studied for relational algebra, and even though Microsoft has been very active in developing view-matching techniques, it is not used in the SQL Server core, possibly for performance reasons.

The reason I am interested in view matching is that I have started a project for view matching over the IQueryable interface. I will keep you posted about the progress.

Sampling Data with SQL Server

Larson et al. from Microsoft Research have a paper in ACM SIGMOD 2007 that introduces an interesting sampling mechanism for SQL Server using materialized views (what SQL Server calls indexed views). If you create an indexed view over a table, or over a bunch of tables joined together, then using their technique you get a tidy sample of your data that is kept up to date automatically.

What caught my eye was that their base sampling approach is surprisingly simple:


CREATE VIEW Sample_TheTableName WITH SCHEMABINDING AS
  SELECT ...,
         __RAND = CAST(ROW_NUMBER() OVER (ORDER BY RAND()) AS tinyint)
  FROM ...

The additional __RAND column imposes a random ordering on the data, from which a statistically valid sample can be drawn.
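
For example, here is a usage sketch of my own (not taken from the paper): once the view is materialised, a roughly 10 percent sample can be read straight off it, assuming __RAND ends up roughly uniform over the tinyint range.

-- Usage sketch (my own, not from the paper): draw roughly a 10 percent
-- sample by filtering on the precomputed __RAND column.
-- Assumes __RAND is roughly uniformly distributed over 0-255.
SELECT *
FROM dbo.Sample_TheTableName
WHERE __RAND < 26;   -- 26/256 is about 10 percent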

Analyze table space usage in SQL Server

It is pretty useful to check the space usage of your tables. SQL Server provides a handy system stored procedure called sp_spaceused that returns exactly that. You can run the following script on your database and export the results to Excel.

You can come up with nice graphs showing why your database is big and what partitioning strategy to choose. For example, I ran this query for a project with a 10GB database; I could not understand why only a few million small records occupied 10GB. The result made the answer obvious: there was an exception log table occupying half of the space, and dropping that table saved everybody a lot of time.


-- Collect sp_spaceused output for every user table in the dbo schema.
DECLARE @spu TABLE
(
    NAME       VARCHAR(MAX),
    [ROWS]     BIGINT,
    reserved   VARCHAR(50),
    DATA       VARCHAR(50),
    index_size VARCHAR(50),
    unused     VARCHAR(50)
);

DECLARE @TName VARCHAR(80);

DECLARE C CURSOR FAST_FORWARD READ_ONLY FOR
    SELECT o.name
    FROM sys.objects AS o
    WHERE o.schema_id = SCHEMA_ID('dbo')
      AND o.type = 'U';

OPEN C;

FETCH NEXT FROM C INTO @TName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- sp_spaceused returns one row per table: name, rows, reserved, data, index_size, unused.
    INSERT INTO @spu
    EXEC sys.sp_spaceused @objname = @TName;

    FETCH NEXT FROM C INTO @TName;
END

CLOSE C;
DEALLOCATE C;

-- DATA comes back as text like '1024 KB', so strip the unit to sort numerically.
SELECT * FROM @spu
ORDER BY CAST(REPLACE(DATA, ' KB', '') AS BIGINT) DESC;

Dive into Expression Trees

I am in the process of developing a framework for sampling query results from LINQ queries. It is going to be hard work, and you may ask why on earth anyone would need to sample query results from LINQ queries. Well, this can be particularly useful if you are developing a new database that works purely with LINQ, or if you have heavy-duty data services drawing on various data sources and you want to sample the query results.

Unfortunately, I am not currently working on any core database system or heavy-duty data services (at work), but I have to do this as part of my study. The project, which is going to be an open source project called [I don't know yet], is supposed to do the job of evaluating the algorithms that I hopefully (fingers crossed) will publish at VLDB 2011.

So if you continue to visit this blog, you will see posts about these topics:

  • LINQ expression trees
  • Writing a LINQ provider
  • Sampling techniques
  • Data quality metrics
  • and rule-based profiling

 

How to shrink the log file in SQL Server 2008

All of us have wanted to get rid of the SQL Server log file at some stage. The natural way of shrinking the log file is to back it up and keep the backup in a safe place; the log file holds transaction data and is essential for point-in-time disaster recovery.

However, it is a common request to get rid of the log file because it takes up space and some people don't need it at all. Beware that there is only ONE place and ONE situation in which the log file is a waste of space, and that is the dev environment.

SQL Server 2008 removed the old shortcut of truncating the log with BACKUP LOG ... WITH TRUNCATE_ONLY and then shrinking it with DBCC SHRINKFILE. If you still want to get rid of the log file in your dev environment, use the following script:

Note: before running this script, in SQL Server Management Studio, select Query -> SQLCMD Mode.

:Setvar DbName **Your Database Here**

PRINT 'Dropping log file for $(DbName)'
GO
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure 'xp_cmdshell', 1
GO
RECONFIGURE
GO
USE Master;
GO
-- Find the physical paths of the data and log files of the target database.
DECLARE @logPath VARCHAR(400)
DECLARE @rowsPath VARCHAR(400)
DECLARE @logFile VARCHAR(400)
SELECT @rowsPath = physical_name FROM [$(DbName)].sys.database_files AS df WHERE type_desc = 'ROWS'
SELECT @logPath  = physical_name FROM [$(DbName)].sys.database_files AS df WHERE type_desc = 'LOG'
SELECT @logFile  = name          FROM [$(DbName)].sys.database_files AS df WHERE type_desc = 'LOG'
PRINT 'Renaming the log file (' + @logPath + '). You can delete it later'
-- Kick everyone out and detach the database so the files can be renamed.
ALTER DATABASE [$(DbName)] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
EXEC master.dbo.sp_detach_db @dbname = '$(DbName)', @skipchecks = 'false'
-- Rename the log file (via xp_cmdshell) so it is out of the way but not lost.
DECLARE @CMD VARCHAR(400) = 'ren "' + @logPath + '" "' + @logFile + REPLACE(CONVERT(VARCHAR(25), GETDATE(), 113), ':', '') + '"'
PRINT @CMD
EXEC master..xp_cmdshell @CMD
-- Reattach the data file only; SQL Server creates a fresh, small log file.
DECLARE @attachSql NVARCHAR(MAX) = N'CREATE DATABASE [$(DbName)] ON ( FILENAME = ''' + @rowsPath + ''' ) FOR ATTACH '
EXECUTE (@attachSql)
GO

Sampling data quality in distributed SQL Servers (Part 1)

Before going further with this post, I should say why sampling is so important, and why sample data quality in particular. First of all, sampling is heavily used in various query optimization techniques. The key thing a query optimizer needs to know is the selectivity of a query, that is, how many rows the query returns from a table. This is extremely important for deciding the join order, and a wrong selectivity estimate can massively change the actual runtime of the query. For example, when joining three huge tables AxBxC together, where B is the many-to-many relation table, the query can be planned as (AxB)xC or Ax(BxC). You may ask what the difference is, but imagine the query returns only one row from table C and a million rows from A. Which order do you use? Of course Ax(BxC), which saves a million lookups.

Sampling is a relatively expensive way to estimate selectivity, because the query still has to run (although over a much smaller set), but it can be the simplest, and sometimes the only, option. Other approaches involve complex statistical modelling and collecting statistics about the data, such as histograms.
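
To give a rough idea of what a sampling-based estimate looks like in SQL Server (the table and predicate below are made up for illustration), you can scan a small page sample and scale the matching fraction up:

-- Rough sketch: estimate the selectivity of a predicate from a 1 percent
-- page sample instead of scanning the whole table. Table and predicate
-- are made up for illustration.
DECLARE @sampleMatches BIGINT, @sampleRows BIGINT;

SELECT
    @sampleMatches = SUM(CASE WHEN Brand = 'BMW' THEN 1 ELSE 0 END),
    @sampleRows    = COUNT(*)
FROM dbo.Cars TABLESAMPLE SYSTEM (1 PERCENT);

SELECT CAST(@sampleMatches AS FLOAT) / NULLIF(@sampleRows, 0) AS EstimatedSelectivity;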

Although in the context of a single database histograms seem more appealing, in distributed databases they are not a good fit, for several reasons that are out of the scope of this post.

I am personally more interested in distributed databases, not in the sense usually discussed in the literature (federated databases), but in the more practical setting of a collaborative enterprise system.