DBLP is a major computer science publication database. Anyone can download its dataset for free from the DBLP website. It is a good benchmark dataset for many kinds of algorithmic research. For example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which contains numerous data quality issues.
I have used this dataset several times in my own work. However, for someone like me who uses a commercial RDBMS, a little work is required before the dataset can be used. The original download is a single XML file, which suits a column-store database such as MonetDB. Most commercial DBMSs, however, are table stores, so you need a way to turn the XML into something your DBMS can load.
Although this is not particularly hard (it takes a few hours to figure out what to do and less than an hour to implement), I have put my solution online, which may save you those few hours.
It basically converts DBLP's XML to flat CSV files that you can import into your DBMS.
A copy of SQL Server 2008 R2 backup file is also included in the solution for your convenience.
(Update: Sorry I removed the SQL Server backup file because it was slowing down the checkout)
To use the code, you should first check out the code from the project folder (using svn). There is a project called Dblp_xml2csv, which is a Console Application. Set this project as the startup project and run the Console App.
You need to have downloaded the DBLP dataset in XML format. Go to where dblp_xml2csv.exe is built and run it from the command line with the XML file, for example dblp-data.xml, as its argument.
You can replace dblp-data.xml with the path to your DBLP XML file.
There is no need for the companion XSD because, for performance reasons, the XML dataset is treated as a plain text file.
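To illustrate what treating the XML as text can look like, here is a minimal sketch in Python. It is not the code from my solution, which is a .NET console application. It assumes each record's opening tag, closing tag and fields sit on their own lines, as they largely do in the DBLP dump, and it only extracts the key, type, title, year and authors. The function name xml_to_csv, the column layout and the sample record are all made up for this example.

```python
import csv
import html
import io
import re

# Top-level record elements used in the DBLP dump.
RECORD_TYPES = ("article", "inproceedings", "proceedings", "book",
                "incollection", "phdthesis", "mastersthesis", "www")
OPEN_RE = re.compile(r'<(%s)\b[^>]*\bkey="([^"]*)"' % "|".join(RECORD_TYPES))
CLOSE_RE = re.compile(r'</(%s)>' % "|".join(RECORD_TYPES))
FIELD_RE = re.compile(r'<(author|title|year)\b[^>]*>(.*?)</\1>')

# Invented sample record, only for demonstration.
SAMPLE = """<dblp>
<article mdate="2020-01-01" key="journals/x/Doe20">
<author>Jane Doe</author>
<author>J&ouml;rg M&uuml;ller</author>
<title>A <i>Made-Up</i> Title.</title>
<year>2020</year>
</article>
</dblp>
"""


def xml_to_csv(lines, pub_out, author_out):
    """Scan DBLP-style XML line by line and write two flat CSV files.

    Assumes at most one record starts or ends on any given line.
    """
    pubs = csv.writer(pub_out)
    authors = csv.writer(author_out)
    pubs.writerow(["key", "type", "title", "year"])
    authors.writerow(["key", "author"])
    record = None
    for line in lines:
        opened = OPEN_RE.search(line)
        if opened:
            record = {"type": opened.group(1), "key": opened.group(2),
                      "title": "", "year": ""}
        if record is None:
            continue  # outside any record, e.g. the <dblp> root line
        for name, value in FIELD_RE.findall(line):
            # Drop inline markup such as <i>, then decode entities like &uuml;.
            value = html.unescape(re.sub(r"<[^>]+>", "", value))
            if name == "author":
                authors.writerow([record["key"], value])
            else:
                record[name] = value
        if CLOSE_RE.search(line):
            pubs.writerow([record["key"], record["type"],
                           record["title"], record["year"]])
            record = None


# Example with the in-memory sample; for a real file, pass open file
# objects instead (open the CSV outputs with newline="").
pub_csv, author_csv = io.StringIO(), io.StringIO()
xml_to_csv(io.StringIO(SAMPLE), pub_csv, author_csv)
```

Because the file is read one line at a time, memory use stays flat however large the dump is. The price is that the sketch trusts the layout of the file instead of validating it.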
Running the above command will create a couple of .csv files. You can import these files into SQL Server, Excel, etc.
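SQL Server and Excel each have their own import tools, which I will not reproduce here. As a rough sketch of the general idea, the Python snippet below loads a CSV file with a header row into a table, with SQLite standing in for whatever DBMS you use. The function name load_csv and the table name are hypothetical, and every column is created without a type, so treat it as a starting point only.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create a table from the CSV header and insert every data row.

    Returns the number of rows in the table afterwards.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so names such as "key" cannot clash with SQL keywords.
    columns = ", ".join('"%s"' % name.replace('"', '""') for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, columns))
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
                     reader)
    conn.commit()
    return conn.execute('SELECT COUNT(*) FROM "%s"' % table).fetchone()[0]


# Example with an in-memory database and invented rows.
conn = sqlite3.connect(":memory:")
rows = load_csv(conn, "publication",
                io.StringIO("key,title,year\nk1,Some Title,2020\n"))
```

For a real file you would pass open("some.csv", newline="", encoding="utf-8") in place of the StringIO object.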
The other items in the solution relate to some work I was doing on a data quality (DQ) estimation technique and will probably not be of much use to you.