Below are some rough thoughts that I’ve hacked out on the data collection process and the results so far. They are somewhat scattered at this point, so bear with me. I’ve also included thoughts on potential graphs that could be made to display the findings to date.
Based on the preliminary searches I did on the Web of Science citations, and having completed searching and analysis for most of the accession numbers in Google Scholar, it seems that articles which cite the dataset rather than the data collection article are more likely to actually reuse the data. Most papers appear to cite the dataset directly in the text, giving the repository name or abbreviation followed by the unique identifier. For reuse of data from the GEO and ArrayExpress repositories especially, it was also common to include a table listing the unique identifiers of all datasets reused in the study.
Data repositories with a more distinctive data identifier allow a search with higher recall and precision. Repositories with a generic identifier, such as a four-digit number, require so many extra search parameters to raise precision that some potential hits may be excluded. For example, GEOROC has a 4 or 5 digit ID without an associated letter or repository identifier. We therefore had to restrict the search terms to GEOROC9022 OR “GEOROC 9022”, where 9022 is the GEOROC-assigned ID number, because the search for GEOROC 9022 without quotation marks returned far too many unrelated results to sort through. However, this restriction may have excluded genuine data reuses, since those search terms returned no hits at all for the GEOROC repository. The GEO search was much more precise: out of 165 citations collected, only 6 did not cite the dataset. This is directly related to the unique identifiers for GEO, which have the three letters GSE directly preceding the 4+ digit accession number, with no space in between.
Repositories using a DOI for each dataset were also somewhat easier to track, although each DOI had to be searched three ways: without the prefix “doi:”, with the prefix, and with the prefix followed by a space. Authors do not cite DOIs consistently, and Google does not say how the algorithm retrieves articles within Google Scholar, so creating a search string is always hit or miss until you find a combination that works.
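To make the search-string patterns above concrete, here is a minimal Python sketch. The function names `accession_query` and `doi_variants` are my own, purely illustrative, and the DOI in the usage lines is a made-up placeholder rather than a real dataset. It only builds the strings; it does not query Google Scholar.

```python
def accession_query(repository, accession):
    """Build the restricted query used for bare numeric IDs.

    Combines the joined form (e.g. GEOROC9022) with the quoted,
    space-separated form (e.g. "GEOROC 9022") using OR.
    """
    joined = f"{repository}{accession}"
    quoted = f'"{repository} {accession}"'
    return f"{joined} OR {quoted}"


def doi_variants(doi):
    """Return the three spellings of a DOI that had to be searched:
    without the prefix, with "doi:", and with "doi:" plus a space.
    """
    bare = doi.strip()
    # Strip an existing "doi:" prefix so all variants start from the bare DOI.
    if bare.lower().startswith("doi:"):
        bare = bare[4:].strip()
    return [bare, f"doi:{bare}", f"doi: {bare}"]


# Usage (the DOI below is a hypothetical placeholder)
print(accession_query("GEOROC", "9022"))
for variant in doi_variants("doi:10.1234/example.5678"):
    print(variant)
```

Each DOI variant would still need to be run as its own search, since it is not clear from my testing whether Google Scholar treats them as equivalent.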
Potential graphs so far
- Bar graph
  - x-axis: data repositories
  - y-axis: total data hits found, with each bar stacked from the bottom up into articles that reused the data, reused it as an example, cited the data but did not use it, and did not cite the data (this last category may need to be removed, as I didn’t include hits if they didn’t seem relevant)
- Multi-line graph: one line per data repository
  - x-axis: number of hits per dataset
  - y-axis: number of datasets
- Unknown type: based on the per-dataset level (within a data repository)
  - # of citations in WoS
  - # of hits from the DOI
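
As a rough sketch of how the collected hits could be tallied for the bar graph and the multi-line graph, here is some Python using only the standard library. The record layout `(repository, dataset ID, category)` and the function names are assumptions of mine, and the sample records are invented placeholders, not actual results.

```python
from collections import Counter, defaultdict

# Stacking order for the bar graph, bottom segment first.
CATEGORIES = [
    "reused",
    "reused as example",
    "cited but not used",
    "does not cite data",
]

# Hypothetical records (repository, dataset ID, category); not real findings.
hits = [
    ("RepoA", "A1", "reused"),
    ("RepoA", "A1", "reused"),
    ("RepoA", "A2", "cited but not used"),
    ("RepoB", "B1", "reused as example"),
]


def stacked_bar_counts(records):
    """Per repository, count hits in each category (in CATEGORIES order)."""
    counts = defaultdict(Counter)
    for repo, _dataset, category in records:
        counts[repo][category] += 1
    return {repo: [c[cat] for cat in CATEGORIES] for repo, c in counts.items()}


def hits_per_dataset_distribution(records):
    """Per repository, map 'hits per dataset' to 'number of datasets'."""
    per_dataset = Counter((repo, dataset) for repo, dataset, _cat in records)
    dist = defaultdict(Counter)
    for (repo, _dataset), n_hits in per_dataset.items():
        dist[repo][n_hits] += 1
    return {repo: dict(sorted(c.items())) for repo, c in dist.items()}


print(stacked_bar_counts(hits))
print(hits_per_dataset_distribution(hits))
```

The first result gives the segment heights for each repository's bar, and the second gives the (x, y) points for each repository's line. Either could then be handed to whatever plotting tool ends up being used.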