Today was spent tracking the Protein Data Bank (PDB) datasets. As Heather had predicted, there were quite a few more hits than for the previous data repositories (670 citations). The search terms used and the search results can be seen in this Google Spreadsheet. Because she expected such a large number of results, Heather also suggested starting by importing only citations from datasets that had 3 or fewer citations. I imported these 56 citations to the PDB Mendeley group, found the full text, and analyzed them for reuse.
Determining dataset reuse was trickier for this data repository, as several of the articles just listed the PDB ID in a table alongside several other PDB IDs. I therefore assigned more medium and low confidence levels than I have for other data repositories.
Of the 56 citations, I was unable to access the full text for 3, and another 3 had import errors. Of the 50 articles examined so far, 36 have potential dataset reuse, 3 have ambiguous dataset reuse, and 4 cite the dataset but do not appear to reuse it. (My numbers don’t seem to be adding up, so in the morning I will reexamine the citations to make sure everything was tagged.)
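
To see where the tally goes wrong, here is a minimal sketch of the arithmetic. The counts are the ones reported in this entry; the variable names and category labels are only illustrative, not the tags actually used in the Mendeley group.

```python
# Counts taken from this entry; names and labels are illustrative assumptions.
imported = 56
no_full_text = 3
import_errors = 3

# Articles that could actually be examined.
examined = imported - no_full_text - import_errors

# Articles per reuse category, as reported above.
tagged = {
    "potential reuse": 36,
    "ambiguous reuse": 3,
    "cited, not reused": 4,
}

tagged_total = sum(tagged.values())
untagged = examined - tagged_total

print(f"examined: {examined}, tagged: {tagged_total}, unaccounted for: {untagged}")
```

Run as written, this shows how many of the examined articles are missing from the three categories, which is the gap to look for when rechecking the tags.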