Week 5: Finalizing data extraction, writing, and analysis

In week 5, Rob and I shifted gears from data extraction to data analysis. Our final database contains 80 articles: 40 identified from the list of NCEAS-authored publications and 40 from our Web of Science search. In total, we extracted over 500 rows of data (i.e., data sources) from these articles. Once everything was in one place, a few standardization steps were required, but we were grateful that our initial planning resulted in a database that is consistent overall across a wide range of fields related to each paper, its sources, and any resulting data products.

Once we finished extracting data from all 80 articles, we began writing and planning for data analysis. To help frame our manuscript, we revisited several potential journals, noting their word limits and organizational requirements. Drawing on this step and on several idiosyncrasies of the data-extraction process (e.g., differences between research papers that sourced data vs. papers in which sourced data was of primary interest), we further expanded and refined the documentation of our methods.

Concurrent with manuscript development, we continued developing the data citation best practices document. After two weeks of reading through data synthesis papers, we had over 20 examples of best practices for data citation and, in contrast, common pitfalls. We moved all of these items into a format that will live on the DataONE best practices webpage (https://www.dataone.org/best-practices). Here is an example of a best practice that we highlight:

Tables can help organize sources: 
It is helpful when a paper that employs numerous covariates lists these in a table (for example with descriptions and summary stats). 
Cleland et al. (2017) provide a useful table listing data sources and links to each.

Cleland, D., K. Reynolds, R. Vaughan, B. Schrader, H. Li, and L. Laing. 2017. 
Terrestrial Condition Assessment for National Forests of the USDA Forest Service in the Continental US. 
Sustainability 9(11):2144. https://doi.org/10.3390/su9112144

We also planned for the next phase of our project, data analysis. With the help of our project mentors, we set up a shared GitHub repository within the DataONE network. Through this repository we can collaborate on R code without sending a script back and forth over email. While we’d already been working collaboratively on several shared text and tabular files, this process gave us the opportunity to learn more about the versioning framework employed at DataONE and, similarly, in other networks. As we wrap up week 5, we’ll be finishing the clean-up of our data frame. We look forward to beginning our analysis in week 6.
