This morning Heather and I met in sunny Bellingham, WA to hack out the details of this research project. A running document can be found here.
We are starting with three repositories (Pangaea, TreeBase and GEO) so that we can put a poster abstract together in time for the ASIS&T conference poster submission deadline of July 1st.
We plan to:
- Search Google Scholar and/or PubMed Central for the DOI or accession number of each dataset
- Draw a stratified sample of 100 papers per repository (more if time allows) from the WoS "cited by" results, and read each one to determine whether the dataset was actually reused
- Do some analysis on amount of reuse, timeline of reuse, journal distribution of data collection vs reuse, and later on keywords, abstract words, corresponding author country
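To make the sampling step concrete, here is a minimal sketch of how a stratified sample of 100 citing papers could be drawn. We have not settled on what to stratify by, so the publication year used in the usage note is only an assumption, and `stratified_sample`, `key`, and the `"year"` field are hypothetical names rather than part of any tool we use.

```python
import random
from collections import defaultdict


def stratified_sample(papers, key, total=100, seed=0):
    """Draw `total` papers, giving each stratum a share proportional to its size.

    `papers` is a list of records, `key` maps a record to its stratum
    (assumed here to be something like publication year).
    """
    papers = list(papers)
    if len(papers) <= total:
        # Nothing to sample: keep everything we have.
        return papers

    # Group the papers by stratum.
    strata = defaultdict(list)
    for paper in papers:
        strata[key(paper)].append(paper)

    # Proportional allocation with largest-remainder rounding,
    # so the per-stratum counts add up to exactly `total`.
    quotas = {k: len(v) * total / len(papers) for k, v in strata.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1

    # Sample without replacement inside each stratum; the fixed seed
    # makes the draw repeatable so both of us get the same list.
    rng = random.Random(seed)
    sample = []
    for k in sorted(strata):
        sample.extend(rng.sample(strata[k], counts[k]))
    return sample
```

Calling `stratified_sample(papers, key=lambda p: p["year"])` once per repository would give the 100 papers to check by hand. Raising `total` covers the "more if time allows" case.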
Our next steps:
- Determine a system to collect the data we want: title, journal, authors, institutions of corresponding author, abstract, keywords, date, PDF of article (and HTML? or plain text?), citation context sentence, citation categorization (how sure are we that the data was actually reused), any ambiguity in the citation categorization, and the data citation in the reference list. Current possibilities are Zotero and Mendeley.
- Get the Web of Science data in our Dropbox into an easy-to-use format (currently .txt files exported from WoS)
- Figure out the process of collecting the data.
- Create new pages on the blog for definitions and questions that would benefit from outside opinions.
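For the WoS step, a rough sketch of reading the exported .txt files into something easier to work with might look like the following. It assumes the files use the tagged plain-text layout (a two-letter field tag such as AU, TI, SO, PY or DI at the start of a line, indented continuation lines, and ER closing each record). We have not checked our own exports against this yet, and `parse_wos_export` and `summarize` are hypothetical helper names.

```python
def parse_wos_export(text):
    """Split a tagged WoS plain-text export into a list of records.

    Each record is a dict mapping a two-letter field tag to the list of
    lines found under that tag (assumed layout, see the note above).
    """
    records, current, tag = [], {}, None
    for line in text.lstrip("\ufeff").splitlines():
        if not line.strip():
            continue  # blank separator between records
        head, value = line[:2], line[3:].strip()
        if head == "ER":
            # End of record: store it and start a fresh one.
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):
            continue  # file header and end-of-file markers
        elif head.strip():
            # A new field tag starts here.
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag is not None:
            # Indented line: continuation of the previous field.
            current[tag].append(value)
    return records


def summarize(record):
    """Pull out the fields we listed above into plainly named keys."""
    return {
        "title": " ".join(record.get("TI", [])),
        "journal": " ".join(record.get("SO", [])),
        "authors": record.get("AU", []),
        "year": " ".join(record.get("PY", [])),
        "doi": " ".join(record.get("DI", [])),
    }
```

With the text of one export file in hand, `[summarize(r) for r in parse_wos_export(text)]` gives one small dictionary per citing paper, which could then be written to a spreadsheet or imported into whichever reference manager we choose.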