DBLP is the mainstream computer science publication database. Anybody can download its dataset from here for free. It is a good benchmark dataset for many kinds of algorithmic research. For example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which includes numerous data quality issues.
I have used this dataset multiple times in my own work. However, for someone like me who uses a commercial RDBMS, a little bit of work is required to make use of it. The original dataset, which you can download from the above link, is an XML file suitable for a column store database (e.g. MonetDB). Most commercial DBMSs, however, are table stores, so you have to find a way to make this XML suitable for your DBMS.
This is not particularly hard: it takes a few hours to figure out what to do and less than an hour to implement it. Still, I put my solution here, which may save you those few hours.
It basically converts DBLP's XML to flat CSV files that you can import into your DBMS. A copy of a SQL Server 2008 R2 backup file is also included in the solution for your convenience.
(Update: Sorry, I removed the SQL Server backup file because it was slowing down the checkout.)
To use the code, first check it out from the project folder (using svn). The solution contains a project called Dblp_xml2csv, which is a console application. Set this project as the startup project and build and run it.
You need to have downloaded the DBLP dataset in XML format. Go to the folder where dblp_xml2csv.exe was built and run the following command:
>dblp_xml2csv.exe dblp-data.xml
You can replace dblp-data.xml with the path of the dblp xml file.
There is no need for the companion XSD because, for performance reasons, the XML dataset is treated as a text file.
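To give an idea of what treating the XML as a plain text file can look like, here is a minimal Python sketch. It is not the .NET code from the project, only an illustration of the general approach. The function names, the choice of output columns (key, type, title, year, authors) and the regular expressions are my own assumptions, and it assumes every field element starts and ends on a single line. Character entities such as `&uuml;` are copied through unchanged.

```python
import csv
import io
import re

# Record-level element names used in the DBLP dump (assumed subset).
RECORD_TAGS = ("article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis", "www")

# Opening tag of a record, capturing its type and its key attribute.
OPEN_RE = re.compile(r'<(%s)\b[^>]*\bkey="([^"]*)"' % "|".join(RECORD_TAGS))
# Closing tag of a record.
CLOSE_RE = re.compile(r"</(%s)>" % "|".join(RECORD_TAGS))
# A field element that opens and closes on the same line.
FIELD_RE = re.compile(r"<(author|title|year)\b[^>]*>(.*?)</\1>")


def xml_lines_to_rows(lines):
    """Yield (key, type, title, year, authors) from an iterable of text lines."""
    current = None
    for line in lines:
        opened = OPEN_RE.search(line)
        if opened:
            current = {"type": opened.group(1), "key": opened.group(2),
                       "title": "", "year": "", "authors": []}
        if current is not None:
            for name, value in FIELD_RE.findall(line):
                if name == "author":
                    current["authors"].append(value)
                else:
                    current[name] = value
            if CLOSE_RE.search(line):
                yield (current["key"], current["type"], current["title"],
                       current["year"], "; ".join(current["authors"]))
                current = None


def convert(src, dst):
    """Write one CSV row per record found in src; return the record count."""
    writer = csv.writer(dst)
    writer.writerow(["key", "type", "title", "year", "authors"])
    count = 0
    for row in xml_lines_to_rows(src):
        writer.writerow(row)
        count += 1
    return count


# Tiny made-up sample in the shape of a DBLP record (not real data).
sample = io.StringIO(
    '<dblp>\n'
    '<article mdate="2011-01-11" key="journals/x/Doe10">\n'
    '<author>Jane Doe</author>\n'
    '<author>John Roe</author>\n'
    '<title>A Title.</title>\n'
    '<year>2010</year>\n'
    '</article>\n'
    '</dblp>\n'
)
out = io.StringIO()
print(convert(sample, out), "record(s) written")
print(out.getvalue())
```

For a real file you would pass open file objects instead of the in-memory strings, opening the output with `newline=""` as the `csv` module recommends. The trade-off of this approach is speed against robustness: a scan like this never builds an XML tree, but it silently skips anything that is not laid out the way the patterns expect.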
Running dblp_xml2csv.exe creates a couple of .csv files, which you can then import into SQL Server, Excel, etc.
The rest of the solution relates to some work I was doing on a DQ estimation technique and will probably not be of much help to you.
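If you just want to check that a generated CSV file loads cleanly before importing it into your own DBMS, a quick test against SQLite is one option. The sketch below is only an illustration: SQLite stands in for SQL Server, and the table name, the column names and the sample rows are hypothetical, since the actual layout depends on the files the tool produces. It assumes the first CSV row is a header.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header and bulk-insert the remaining rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so unusual header names do not break the SQL.
    quoted_table = '"%s"' % table.replace('"', '""')
    columns = ", ".join('"%s" TEXT' % name.replace('"', '""') for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (quoted_table, columns))
    conn.executemany("INSERT INTO %s VALUES (%s)" % (quoted_table, marks), reader)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM %s" % quoted_table).fetchone()[0]


# Hypothetical CSV content standing in for one of the generated files.
demo_csv = io.StringIO(
    "key,title,year\n"
    "journals/x/Doe10,A Title.,2010\n"
    "conf/y/Roe09,Another Title.,2009\n"
)
conn = sqlite3.connect(":memory:")
print(load_csv(conn, "publication", demo_csv), "rows loaded")
```

With a real file you would replace the in-memory string with `open(path, newline="")` and the `:memory:` database with a file path. Every column is loaded as text here, so any typing (years as integers, for example) is left to a later step.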
June 16, 2011 at 9:39 am
Sir, I could not get the code here:
https://code.google.com/p/dblp2csv/
Where can I download the DBLP-to-CSV converter from?
Please help.
thanks,
June 16, 2011 at 12:07 pm
Sahil, you have to check it out with an svn client. If you have a Subversion client installed, run the following command:
>svn checkout http://dblp2csv.googlecode.com/svn/trunk/ dblp2csv-read-only
You can get subversion from http://subversion.tigris.org/
I also updated the post with some more info. Let me know if you have any problems.
June 20, 2011 at 7:32 pm
Thanks, sir.
But when I run it from the command prompt, the results are, for example:
E:\dblp\bin\debug\dblp_xml2csv.exe dblp.xml
writing the link File
Writing Object Files
E:\dblp\bin\Debug>
It generates just 2 files of 1 KB each:
dblp.xml.lnk 1 KB
dblp.xml.obj 1 KB
but my dblp.xml file is 837 MB.
Waiting for your reply.
Thanks
June 20, 2011 at 11:35 pm
Sahil,
I will look into that issue tonight. In the meantime, you can debug the application if you know .NET development. Also look for dblp-data.xml from the link in this post. I believe there are two versions of the DBLP dataset.
Since, for performance reasons, I am not loading the file as XML, different formatting in the file may cause problems.
February 13, 2014 at 12:59 pm
Hi,
Is there any solution to the problem (the generated CSV output being only 1 KB)?