DBLP for SQL Server


DBLP is the main computer science publication database. Anyone can download its dataset from here for free. It is a good benchmark dataset for many kinds of algorithmic research. For example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which contains numerous data quality issues.

I have used this dataset several times in my own work. However, for someone like me who uses a commercial RDBMS, a little work is required to make use of it. The original dataset, which you can download from the above link, is an XML file suitable for a column store database (like MonetDB). Most commercial DBMSs, however, are table stores, so you need a way to turn this XML into something your DBMS can load.

Although this is not particularly hard (it takes a few hours to figure out what to do and less than an hour to implement), I have put my solution HERE, which may save you those few hours.

It basically converts DBLP's XML to flat CSV files that you can import into your DBMS. A copy of a SQL Server 2008 R2 backup file is also included in the solution for your convenience.

(Update: Sorry, I removed the SQL Server backup file because it was slowing down the checkout.)

To use the code, first check it out from the project folder (using svn). The solution contains a project called Dblp_xml2csv, which is a console application. Set this project as the startup project and run it.

You need to have downloaded the DBLP dataset in XML format. Go to the folder where dblp_xml2csv.exe was built and run the following command:
>dblp_xml2csv.exe dblp-data.xml

Replace dblp-data.xml with the path to your DBLP XML file.

There is no need for the companion xsd because, for performance reasons, the XML dataset is treated as a plain text file rather than parsed as XML.
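To give a rough idea of what "treating the XML as text" can look like, here is a minimal Python sketch. It is not the code from the solution (which is a .NET console app); it only illustrates the approach. It assumes each element sits on its own line, handles only two record types, and uses made-up column names and a made-up split into an "object" table and a "link" table.

```python
import csv
import io
import re

# Assumed subset of record types; the real dataset has more.
RECORD_TAGS = ("article", "inproceedings")

OPEN_RE = re.compile(r'<(%s)\b[^>]*\bkey="([^"]*)"' % "|".join(RECORD_TAGS))
CLOSE_RE = re.compile(r"</(%s)>" % "|".join(RECORD_TAGS))
FIELD_RE = re.compile(r"<(author|title|year)>(.*?)</\1>")


def flatten(lines):
    """Yield ("link", (key, author)) and ("object", (key, type, title, year)).

    The input is scanned line by line as plain text; no XML parser is used,
    so entities such as &uuml; are passed through unchanged.
    """
    current = None
    for line in lines:
        opened = OPEN_RE.search(line)
        if opened:
            current = {"key": opened.group(2), "type": opened.group(1),
                       "title": "", "year": ""}
        if current is None:
            continue  # text outside any record we care about
        for name, value in FIELD_RE.findall(line):
            if name == "author":
                yield "link", (current["key"], value)
            else:
                current[name] = value
        if CLOSE_RE.search(line):
            yield "object", (current["key"], current["type"],
                             current["title"], current["year"])
            current = None


def write_csv(lines, objects_out, links_out):
    """Write the two flat tables to the given file-like objects."""
    objects = csv.writer(objects_out)
    links = csv.writer(links_out)
    objects.writerow(["key", "type", "title", "year"])  # hypothetical columns
    links.writerow(["key", "author"])
    for kind, row in flatten(lines):
        (objects if kind == "object" else links).writerow(row)


# Tiny hand-written sample in the assumed one-element-per-line layout.
SAMPLE = """<dblp>
<article mdate="2011-01-11" key="journals/x/Doe10">
<author>Jane Doe</author>
<author>John Roe</author>
<title>Cleaning Data, Quickly.</title>
<year>2010</year>
</article>
</dblp>
""".splitlines()

objects_buf, links_buf = io.StringIO(), io.StringIO()
write_csv(SAMPLE, objects_buf, links_buf)
print(objects_buf.getvalue())
print(links_buf.getvalue())
```

For a real file you would pass an open file handle instead of SAMPLE and open the two output files with newline="". Because the scan depends on the line layout, a file that is formatted differently may produce little or no output.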

Running the above command creates a couple of .csv files, which you can then import into SQL Server, Excel, etc.
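Before importing, it can be worth checking that a generated CSV file is not empty and that every row has the same number of columns. The following is a small Python sketch of such a check. The sample data and column names are invented for illustration and are not necessarily what the tool produces.

```python
import csv
import io
from collections import Counter


def summarize_csv(handle):
    """Return (header, number of data rows, Counter of columns per data row)."""
    reader = csv.reader(handle)
    header = next(reader, None)  # None if the file is completely empty
    widths = Counter()
    rows = 0
    for row in reader:
        rows += 1
        widths[len(row)] += 1
    return header, rows, widths


# Hypothetical sample; note the quoted title containing a comma.
SAMPLE_CSV = (
    "key,type,title,year\r\n"
    'journals/x/Doe10,article,"Cleaning Data, Quickly.",2010\r\n'
    "conf/y/Roe11,inproceedings,Matching Records.,2011\r\n"
)

header, rows, widths = summarize_csv(io.StringIO(SAMPLE_CSV))
print(header, rows, dict(widths))
```

A row count of zero, or more than one distinct column count, suggests the conversion did not go as expected and the file should be inspected before it is loaded into the DBMS. For a file on disk, open it with newline="" and pass the handle to summarize_csv.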

The other items in the solution relate to some work I was doing on a DQ estimation technique and will probably not be of much help to you.

5 Responses to “DBLP for SQL Server”

  1. Sahil Creation Says:

    Sir, I could not get the code here:
    https://code.google.com/p/dblp2csv/
    Where can I download the DBLP-to-CSV converter from?
    Please help.

    thanks,

    • naiem Says:

      Sahil, you have to check it out with an svn client. If you have a Subversion client installed, run the following command:
      >svn checkout http://dblp2csv.googlecode.com/svn/trunk/ dblp2csv-read-only

      You can get Subversion from http://subversion.tigris.org/

      I have also updated the post with some more info. Let me know if you have any problems.

      • Sahil Creation Says:

        Thanks, sir.

        But when I run it from the command prompt, the results are, for example:
        E:\dblp\bin\debug\dblp_xml2csv.exe dblp.xml

        writing the link File

        Writing Object Files

        E:\dblp\bin\Debug>

        It generates just 2 files of 1 KB each:
        dblp.xml.lnk 1 KB
        dblp.xml.obj 1 KB

        but my dblp.xml file is 837 MB.

        Waiting for your reply.
        Thanks

    • naiem Says:

      Sahil,
      I will look into that issue tonight. In the meantime, you can debug the application if you know .NET development. Also, try the dblp-data.xml file from the link in this post; I believe there are two versions of the DBLP dataset.
      Since I am not loading the file as XML (for performance reasons), different formatting in the file may cause problems.

