Today I finished collecting the Google Scholar search results for each TreeBASE accession number, imported the full text into Mendeley, and searched each article to determine whether the dataset had been reused. The search terms used and the number of hits in Google Scholar can be found in this Google Spreadsheet. The citations and the tags applied can be viewed in this Mendeley group.
At first I was pleasantly surprised to find a total of 40 documents that cited the datasets in our sample. However, on further inspection, 4 of these had a dead link from Google Scholar and too little information to track down the article. In addition, I was unable to access the full text of 7 articles. This left me with 29 articles to analyze. However, 24 of these did not actually cite the TreeBASE accession number. Many had the number and the word TreeBASE somewhere in the article, but not associated with each other. Others listed the TreeBASE Study ID, which is not the same as the Legacy Study ID we were after, although the numbers are the same (a confusing aspect of the TreeBASE database). This left a grand total of 4 datasets with evidence of reuse. Still not bad, but not nearly as many as I had originally expected from my 40 hits on Google Scholar. I think that to improve this, we should also have put quotation marks around the accession number, for example “S1332” AND “TreeBASE”, which would have weeded out some of the other articles.
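To make the quoting idea concrete, here is a minimal Python sketch. It is not the procedure I actually used: the function names `build_query` and `mentions_together` and the 100-character window are assumptions chosen for illustration. The first function builds the quoted search string, and the second screens an article's full text for cases where the accession number appears close to the word TreeBASE rather than merely somewhere in the same document.

```python
import re


def build_query(accession: str, database: str = "TreeBASE") -> str:
    """Return a search string with both terms quoted as exact phrases."""
    return f'"{accession}" AND "{database}"'


def mentions_together(text: str, accession: str, window: int = 100) -> bool:
    """Return True if the accession number occurs within `window`
    characters of the word 'TreeBASE' (case-insensitive).

    The window size is an arbitrary, hypothetical threshold.
    """
    # Word boundaries stop S1332 from matching inside S13320.
    accession_pattern = re.compile(rf"\b{re.escape(accession)}\b")
    database_pattern = re.compile(r"treebase", re.IGNORECASE)

    database_positions = [m.start() for m in database_pattern.finditer(text)]
    for match in accession_pattern.finditer(text):
        if any(abs(match.start() - pos) <= window for pos in database_positions):
            return True
    return False


if __name__ == "__main__":
    print(build_query("S1332"))
    sample = "The alignment was downloaded from TreeBASE (accession S1332)."
    print(mentions_together(sample, "S1332"))
```

A proximity check like this would only flag candidates. Each article would still need to be read to confirm that the dataset was actually reused, and it would not tell a Legacy Study ID apart from a Study ID with the same number.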