June 21, 2011 – Tracking Protein Data Bank

Today was spent tracking the Protein Data Bank (PDB) datasets.  As Heather had predicted, there were quite a few more hits than for the previous data repositories (670 citations).  The search terms used and the search results can be seen in this Google Spreadsheet.  Because Heather anticipated such a large number of results, she also suggested starting by importing only citations from datasets that had three or fewer citations.  I imported these 56 citations to the PDB Mendeley group, located the full text, and analyzed them for reuse.

Determining dataset reuse was trickier for this data repository, as several of the articles just listed the PDB ID number in a table alongside several other PDB IDs.  As a result, I assigned more medium and low confidence levels than I have with other data repositories.

Of the 56 citations, I was unable to access the full text for 3 of them, and another 3 had import errors.  Of the 50 articles examined so far, 36 have potential dataset reuse, 3 have ambiguous dataset reuse, and 4 cite the dataset but do not appear to reuse it.  (My numbers don’t add up: these three categories total 43, not 50, so I will reexamine the citations in the morning to make sure everything was tagged.)
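
A quick sanity check makes the size of the gap explicit.  The sketch below uses only the counts reported in this post; the variable names and category labels are my own, chosen for illustration, and are not part of the Mendeley tagging scheme.

```python
# Counts taken directly from the figures reported in this post.
imported = 56
no_full_text = 3
import_errors = 3

# Tags applied so far (labels are illustrative, not the actual Mendeley tags).
tagged = {
    "potential reuse": 36,
    "ambiguous reuse": 3,
    "cited but not reused": 4,
}

# Articles that could actually be examined.
examined = imported - no_full_text - import_errors

# Articles that received one of the three tags.
tagged_total = sum(tagged.values())

# Any difference points to articles that were examined but never tagged
# (or to a counting error somewhere upstream).
untagged = examined - tagged_total

print(f"examined: {examined}, tagged: {tagged_total}, unaccounted for: {untagged}")
# prints "examined: 50, tagged: 43, unaccounted for: 7"
```

Whether the 7 missing articles are untagged or miscounted is something only rechecking the citations can settle.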

