Motivation for this project: A huge amount of data exists, and more is being produced every day. The numbers of journal articles and scientists are growing (Mabe and Amin, 2001) due to technology, and it will be increasingly difficult to manage all the data produced (Carlson, 2006; Howe, 2008). However, much of the data that researchers acquire never becomes accessible to the rest of the scientific community; this is called dark data (Heidorn, 2008). Dark data, though considered unpublishable or unusable by the researcher for a variety of reasons, can lead to new discoveries and have potentially huge impacts on the scientific community (Heidorn, 2008). Similarly, the data and dark data gathered in ecology and the environmental sciences are also growing. For these reasons, we hope that by estimating the amount of ecological data and dark data that exists, we can motivate the ecological community to be more willing to share and archive its information.
Generally, there are two types of science projects: big science projects, which involve a large network of scientists, and traditional science projects. Big science projects tend to be high profile and well curated, have higher funding and a larger network of cooperation, and generate larger amounts of data (Heidorn, 2008). Traditional science projects are generally less well known, have less funding, involve fewer investigators, produce relatively less data (Heidorn, 2008), and largely focus on a smaller scale. However, this is not to say that their contribution is negligible. Traditional science projects are more common and can contribute large amounts of data to the scientific community because of the sheer number of productive researchers conducting their own experiments and projects (Heidorn, 2008). While traditional science projects far outnumber big science projects, they are also more heterogeneous (Heidorn, 2008; Carlson, 2006; Reichman, 2011). Researchers come from many fields and therefore vary in the amount of data gathered, the type of data gathered, and the protocols and metadata for that data. While our project will focus on data gathered in ecology and the environmental sciences, there is still much heterogeneity within this field that will have to be addressed.
Additionally, there is the issue of dark data. Traditional science projects typically generate dark data because researchers gather large amounts of data intended for publication, but some datasets never make it into the final paper. The reasons range from a dataset that yielded no significant result, a so-called “failed” experiment (Goetz, 2007), to other data that, from the researcher's perspective, supersedes it in value (Heidorn, 2008).
This project will focus on traditional science projects from individual researchers. Compared with other sciences, ecology has fewer high-profile, big collaborative science projects. These projects are networks of research stations and study fields that can generate huge amounts of data. Three main networks are NEON, LTER, and OOI. However, most networks still rely on individual researchers to contribute to the larger project’s data. Furthermore, it is likely that most researchers who are part of big science projects also conduct traditional projects and count the data gathered through their big science work as part of the total amount of data gathered by their lab. Therefore, an estimate built from traditional science researchers could, in theory, also cover the large amount of data generated in these collaborative studies.
Data is generated not only by academic researchers but also by researchers employed in environmental consulting firms, government, and NGOs. A study by NORC for NSF looked at the number of employed science and engineering doctorate holders engaged in publication-related activities. It showed that in the life sciences in 1995, 91.3% of doctorate holders employed in education published, 87.2% of those employed in government published, and 71.8% of those employed in industry published (Hoffer, 2004). A similar trend existed between fields in 2001. It seems that government and industry do contribute a great amount of published data. There is also a long-standing tradition of collaboration among academia, government, and NGOs that produces large amounts of data.
In addition, 55.2% of scientists in the life sciences were employed in education, 10.2% in government, and 34.6% in industry. Our project will certainly focus on the amount of data gathered by ecologists in education/academia, but ecologists in government probably generate a lot of data as well. This study may therefore also cover government-generated data. However, that estimate would face new obstacles and different problems, and new methods would have to be developed.
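The sector figures above can be combined into a rough picture of where publishing life scientists work. The sketch below is only an illustration: it assumes, without confirmation from the source, that the employment shares and the 1995 publishing rates describe the same population and year, and the names `EMPLOYMENT_SHARE`, `PUBLISHING_RATE`, and `publishing_breakdown` are hypothetical.

```python
# Employment share of life scientists by sector, as quoted in the text.
EMPLOYMENT_SHARE = {"education": 0.552, "government": 0.102, "industry": 0.346}

# Share of doctorate holders in each sector who published (1995, Hoffer, 2004).
PUBLISHING_RATE = {"education": 0.913, "government": 0.872, "industry": 0.718}


def publishing_breakdown(share, rate):
    """Return the overall publishing fraction and each sector's part of it.

    Assumption: `share` and `rate` refer to the same population and year.
    """
    # Fraction of all life scientists who are in a sector AND publish.
    weighted = {sector: share[sector] * rate[sector] for sector in share}
    overall = sum(weighted.values())
    # Each sector's contribution to the pool of publishing scientists.
    contribution = {sector: value / overall for sector, value in weighted.items()}
    return overall, contribution


overall, contribution = publishing_breakdown(EMPLOYMENT_SHARE, PUBLISHING_RATE)
print(f"Overall publishing fraction: {overall:.3f}")
for sector, part in contribution.items():
    print(f"{sector}: {part:.1%} of publishing scientists")
```

Under that assumption, the breakdown gives a first indication of how much an academia-only estimate might leave out. It says nothing about how much data each sector produces per publication.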
Lastly, the project will need to determine whether to estimate the amount of ecological data generated in the US only or to expand the scope to include international researchers. If we use publication as our proxy for estimating data, data produced in the US would generally be the focus.
Data type and unit of measurement:
In general, the project will focus on data gathered with the intention of publication. This generally rules out automated data, such as weather station data, which is usually not the focus of studies intended for publication. Additionally, the measured amount of data varies with the data type and the unit of measurement. For example, data from experimental research can be the raw data (numbers collected in the field or in laboratories), the transformed or altered data used for analysis, and the published paper produced from the original data. While all these types of data are important, the raw data is the basis for the entire analysis as well as the foundation for the published paper. Therefore, estimating only the raw data would not only be more feasible but could also serve as a proxy or indicator for the other, equally important types of data.
Another important data type is metadata, which holds information about the data. Without the metadata, the raw data would be impossible to decipher. This project may want to include metadata in its estimate because it is, strictly speaking, new data. However, metadata comes in a variety of styles and formats, and metadata generated by individual researchers is highly variable in completeness. Processed metadata and workflows have been created to make metadata more easily understood and to allow experiments to be reproduced easily, but this is still a minority approach to creating metadata. Metadata may therefore not be counted, because of its highly variable quality and potential absence.
Duplication and redundancy exist in data gathering. For example, several copies of raw data could be created and stored in various databases and network servers. Ideally, we would not count duplicates in our analysis. However, the project will probably not try to account for redundancy, because it is probably a very small issue for ecological and environmental data, since so little of it is stored at all.
Because the focus is on traditional science, the estimate of the total amount of raw data generated must be built up from individual researchers. Therefore, an estimate of the amount of data generated by the “typical ecologist” must be made and then aggregated up to the entire ecological community.
Information will be needed to answer these questions:
- How many fields of ecology will be considered?
- How long is the lifespan of an ecologist’s research career?
- How many ecologists are entering or leaving the field?
- How many ecologists are there in total, now and at past time points?
- How much raw data is produced per ecologist, now and at past time points?
- How is an individual ecologist defined? After attaining a PhD/Master's?
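The aggregation these questions feed into can be sketched as follows. This is a minimal illustration, not the project's method: it assumes a yearly headcount and a yearly per-ecologist output in some arbitrary unit, and the names `YearEstimate`, `total_raw_data`, and `project_headcount`, as well as all numbers, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class YearEstimate:
    year: int
    ecologists: int            # active ecologists in that year
    data_per_ecologist: float  # raw data per ecologist per year (arbitrary unit)


def total_raw_data(estimates):
    """Aggregate up from the 'typical ecologist': sum headcount x output per year."""
    return sum(e.ecologists * e.data_per_ecologist for e in estimates)


def project_headcount(start, entering, leaving, years):
    """Simple stock-and-flow projection of headcount.

    Assumes constant numbers entering and leaving the field each year.
    """
    counts = [start]
    for _ in range(years):
        counts.append(counts[-1] + entering - leaving)
    return counts


# Placeholder values for illustration only; these are not real estimates.
example = [
    YearEstimate(year=2000, ecologists=100, data_per_ecologist=2.0),
    YearEstimate(year=2001, ecologists=110, data_per_ecologist=2.5),
]
print(total_raw_data(example))            # 100*2.0 + 110*2.5 = 475.0
print(project_headcount(100, 10, 4, 3))   # [100, 106, 112, 118]
```

Keeping headcount and per-ecologist output as separate inputs means each can be estimated from a different source and revised independently.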
Potential sources for gathering this information include:
Ecology Departments in Academic Universities- A sample of ecology departments could reveal how many ecologists and postdocs are employed, which fields of ecology are taught, which fields the ecologists specialize in, and how long their careers last. If ecologists employed by firms, government, and NGOs are also considered, then staff lists on government and NGO websites could be sampled. However, corporate staff data may be harder or even impossible to obtain.
Ecological databases- Raw data from databases will probably not be counted, because it will already be captured in the estimate of the data a researcher produces. However, ecological databases could provide insight into which ecological fields exist. With a sense of how many fields there are, we could choose representatives from several fields to sample for information about data generation.
Scientific societies and organizations- Scientific societies may provide information on how many scientists there are in the world or in the US, how many there were at past time points, and the rate of growth of the scientific community. This information could be important for comparing the growth rate of scientists to the growth rate of data. For example, the American Association for the Advancement of Science (AAAS) and the Chronicle of Higher Education may be good sources of information, as may ESA and NSF funding sites.
Sociology- Social scientists who study scientists’ behavior and patterns could be another interesting source of information.
Collaborative Studies- While the project will not focus on big science, networks such as NEON, LTER and OOI can be useful in estimating how many researchers participate in these large-scale projects. Other kinds of data from these networks could also be used for this project.
- Percentage of ecological fields in the scientific community in a pie chart.
- Comparing the growth rate of scientists (#scientists/time)
- Growth rate of data
- Total number of scientists at a time point
- Total amount of data at a time point
- Amount of data produced per scientist per unit time
- Dark data as a percentage of total data gathered
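The quantities listed above reduce to a few simple ratios. The sketch below shows one way to compute them, under stated assumptions: growth is measured as the average absolute change per year, and dark data is taken to be data that was gathered but never made accessible. The function names and example numbers are hypothetical.

```python
def growth_rate(count_start, count_end, years):
    """Average absolute change per year between two time points."""
    if years <= 0:
        raise ValueError("years must be positive")
    return (count_end - count_start) / years


def data_per_scientist(total_data, scientists, years=1.0):
    """Amount of data produced per scientist per unit time."""
    if scientists <= 0 or years <= 0:
        raise ValueError("scientists and years must be positive")
    return total_data / scientists / years


def dark_data_percentage(total_gathered, accessible):
    """Dark data as a percentage of all data gathered.

    Assumption: dark data = gathered data that never became accessible.
    """
    if total_gathered <= 0:
        raise ValueError("total_gathered must be positive")
    return 100.0 * (total_gathered - accessible) / total_gathered


# Placeholder numbers for illustration only.
print(growth_rate(100, 130, 3))           # (130 - 100) / 3 = 10.0
print(data_per_scientist(500.0, 100, 2))  # 500 / 100 / 2 = 2.5
print(dark_data_percentage(200.0, 50.0))  # 100 * 150 / 200 = 75.0
```

The same `growth_rate` function applies to both scientists and data, so the two rates can be compared directly once both series use the same time span.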