In the sixth week of our synthesizing science project, we jumped into analyzing the data we’d extracted, corrected, and re-tooled over the previous weeks. This process was aided by establishing a GitHub repository for data and R scripts and by employing the very useful “googlesheets” R package (by Jenny Bryan) to interface with our still-evolving data set.
We had earlier developed a set of potential metrics to describe the characteristics of data sources, data-aggregation studies (our new term for what we had previously called “data syntheses”), and the use of data repositories for study output. We refined these metrics, divided the tasks among us, and first quantified patterns of geographic focus, spatial extent, ecosystem focus, and representation in time (2015-2017). Quantifying source-level metrics, including the percentage of data sources that were cited (high, fortunately), the location of citations within data-aggregation studies (typically in Methods sections), and the percentage of working links (80%+ in all years), helped to paint a clearer picture of the practices researchers employ when aggregating data from various sources.
At the level of the data-aggregation study, we first quantified the number of data sources employed, summarizing by (1) whether a study was part of the NCEAS set or the Web of Science (WOS) set, and (2) the year of the study. Next, we summarized and tested the accessibility of sources using our four-level “distance”-to-data classification, finding that, on average, only about 10% of sources were accessible with minimal effort (e.g., direct links). Comparing the two sets, NCEAS studies cited more sources accessible with minimal effort than did WOS studies. There was little evidence of a trend toward increasing data accessibility over time; the only exception appeared when the accessibility threshold was raised to “medium”. The repeatability of data-aggregation studies (i.e., the share of studies for which all sources were accessible) was limited at minimal-effort accessibility (5%) but increased to over 50% when all accessible sources were included. On average, around half of the data sources cited per paper were obtainable from repositories, such as those maintained by government agencies (e.g., USGS). NCEAS studies employed these sources more commonly than did WOS studies, and their use increased over 2015-2017. Remaining analyses include quantifying patterns of data storage by authors of data-aggregation papers.
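To make the two study-level metrics above concrete, here is a minimal sketch in Python (our actual scripts are in R). It assumes the four “distance” levels can be coded as integers from 1 (minimal effort) to 4 (not accessible); the level names, the function names, and the example records are all hypothetical and made up for illustration, not taken from our data set.

```python
from collections import defaultdict

# Hypothetical integer coding of the four-level "distance"-to-data classes.
# Only "minimal" and "medium" are named in the text; the rest are assumptions.
MINIMAL, MEDIUM, HIGH, INACCESSIBLE = 1, 2, 3, 4

# Made-up example records: (study_id, distance level of one cited source).
sources = [
    ("study_a", MINIMAL), ("study_a", MINIMAL),
    ("study_b", MINIMAL), ("study_b", HIGH),
    ("study_c", MEDIUM), ("study_c", INACCESSIBLE),
    ("study_d", MEDIUM), ("study_d", HIGH),
]

def share_accessible(records, threshold):
    """Fraction of sources whose distance level is at or below the threshold."""
    return sum(level <= threshold for _, level in records) / len(records)

def repeatability(records, threshold):
    """Fraction of studies in which every source is at or below the threshold."""
    by_study = defaultdict(list)
    for study, level in records:
        by_study[study].append(level)
    repeatable = sum(
        all(level <= threshold for level in levels)
        for levels in by_study.values()
    )
    return repeatable / len(by_study)

print(share_accessible(sources, MINIMAL))  # share of sources at minimal effort
print(repeatability(sources, MINIMAL))     # studies fully repeatable at minimal effort
print(repeatability(sources, HIGH))        # studies repeatable using any accessible source
```

Raising the threshold from `MINIMAL` to `HIGH` mirrors the comparison described above: repeatability can only stay the same or increase as more effortful sources are counted as accessible.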
While some patterns appear clear from these summaries, discussions within our team led us to conclude that a more robust analysis (e.g., via modeling) is warranted. Approaches such as generalized mixed-effects modeling seem promising because they would allow us to address correlated patterns within (1) data source attributes such as accessibility and source form (e.g., spatial vs. tabular data) and (2) study characteristics such as year, continent, and ecosystem. We are excited to see where this leads us as we further develop our manuscript and best-practices documents.