Ecological data is essential for scientists to understand the complexities of nature and for decision makers to make informed choices about protecting and using our natural resources. However, no one knows how much ecological data currently exists, is being produced, or will be produced in the near future.
Although no one has attempted to quantify the amount of data in ecology, or even in science as a whole, a few attempts have been made to quantify the amount of information in general.
The first attempt to estimate the amount of information came from Michael Lesk, who outlined how such an estimate could be made. Using the byte as his unit of measurement, he provided general estimates for several categories of information channels, such as cinema, images, broadcasting, sound, telephony, and the Web, totaling a few thousand petabytes of information (Lesk, 1997).
A more thorough investigation of information came from UC Berkeley, where Lyman and Varian conducted studies in 2000 and 2003 to estimate US and world information. Also using bytes as the unit of measurement, Lyman and Varian estimated in great detail a snapshot of the amount of “original” information created in four main categories of storage media: paper, film, optical, and magnetic (Lyman and Varian, 2000; Lyman and Varian, 2003). Most notably, the study found that print, film, magnetic, and optical storage media produced 5 exabytes of new information in 2002. Ninety-two percent of that new information was stored on magnetic media, mainly hard disks. Film represented 7% of the total, paper 0.01%, and optical media 0.002% (Lyman and Varian, 2003). Because hard drives and servers are so widely used, the study examined them closely: it estimated that an average single-user computer can store 20 GB, so that 200 MB of original data constitutes 1% of disk capacity. Servers, which provide disk space for a group of information-contributing users, were estimated to contain 35% of new information (Lyman and Varian, 2003). The study pointed out several major challenges in estimating information: redundancy or duplication of information, compression techniques that influence byte storage estimates, the distinction between information storage and information flows, information growth rates, and uncertainty (Lyman and Varian, 2000; Lyman and Varian, 2003).
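To make the quoted shares more concrete, the following minimal sketch converts them into approximate byte counts. It is an illustration only, not Lyman and Varian's method: it assumes decimal units (1 exabyte = 10^18 bytes) and applies the percentages to the rounded 5-exabyte total, so the results are order-of-magnitude figures. The names `EXABYTE`, `SHARES_PERCENT`, and `bytes_by_medium` are hypothetical.

```python
from decimal import Decimal

# Assumption: decimal (SI) units, 1 EB = 10^18 bytes.
EXABYTE = Decimal(10) ** 18

# Shares of new information in 2002 as quoted in the text (percent of total).
SHARES_PERCENT = {
    "magnetic": Decimal("92"),
    "film": Decimal("7"),
    "paper": Decimal("0.01"),
    "optical": Decimal("0.002"),
}


def bytes_by_medium(total_bytes, shares_percent):
    """Split a total byte count across media according to percentage shares."""
    return {medium: total_bytes * pct / 100 for medium, pct in shares_percent.items()}


# Apply the shares to the rounded 5 EB total reported for 2002.
estimate = bytes_by_medium(5 * EXABYTE, SHARES_PERCENT)
for medium, size in estimate.items():
    print(f"{medium:>8}: {size / EXABYTE} EB")

# The quoted shares are rounded and do not sum to exactly 100%.
print("sum of shares (%):", sum(SHARES_PERCENT.values()))
```

Exact decimal arithmetic is used here rather than floating point so that very small shares, such as the 0.002% for optical media, are not distorted by rounding.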
Following Lyman and Varian’s study, we could assume that ecological information will usually be stored in the form of print and magnetic media. Film and optical media may contribute to the overall amount of information, but it could be assumed that this information has already been converted to magnetic media or will be in the near future.
Starting in 2007, the EMC Corporation hired the research firm International Data Corporation (IDC) to conduct an industry study on the size of the “Digital Universe,” which was defined as the information that is either created or captured in digital form and then replicated. As part of this work, the study estimated the global hardware capacity, which was “all the empty or usable space on hard drives, tapes, CDs, DVDs, and memory (volatile and nonvolatile) in the market” (IDC, 2008). The first report estimated the digital universe at 161 exabytes in 2006 and expected it to grow to 988 exabytes by 2010, a compound annual growth rate of 57% (IDC, 2007). A year later, IDC revised its estimate to 281 exabytes for 2007 and predicted 1,800 exabytes by 2011 (ten times the amount in 2006), a compound annual growth rate of about 60% (IDC, 2008). In later reports, IDC continued to estimate the digital universe, asserting that it reached 487 exabytes in 2008 (IDC, 2009) and would grow to 1.2 million petabytes (1.2 zettabytes) in 2010 (IDC, 2010). The 2010 IDC study also predicted that between 2009 and 2020 the information in the Digital Universe would grow by a factor of 44 (IDC, 2010).
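The growth rates quoted from the IDC reports can be checked with the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The sketch below is a sanity check under stated assumptions, not IDC's own calculation: it takes 2006 as the base year for the 161-exabyte figure and 2007 for the 281-exabyte figure, and the function name `cagr` is hypothetical.

```python
def cagr(start, end, years):
    """Compound annual growth rate as a fraction (0.5 means 50% per year)."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end, and years must all be positive")
    return (end / start) ** (1.0 / years) - 1.0


# Figures in exabytes, as quoted in the text.
# Assumption: 161 EB refers to 2006 and 281 EB refers to 2007.
rate_2007_report = cagr(161, 988, 2010 - 2006)
rate_2008_report = cagr(281, 1800, 2011 - 2007)

# A 44-fold increase between 2009 and 2020 implies this yearly rate.
rate_2010_report = cagr(1, 44, 2020 - 2009)

for label, rate in [
    ("IDC 2007 report", rate_2007_report),
    ("IDC 2008 report", rate_2008_report),
    ("IDC 2010 report", rate_2010_report),
]:
    print(f"{label}: {rate:.1%} per year")
```

Under these assumptions the first two rates come out close to the 57% and roughly 60% quoted in the reports, while the 44-fold projection corresponds to a somewhat lower yearly rate of about 41%.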
Also in 2007, researchers at UC San Diego examined how much information was consumed in a household setting. Using three units of measurement (bytes, hours, and words), the study identified the main media sources of household information consumption, which totaled 3.6 zettabytes. It is important to note that the study defined information very differently from the Berkeley study. It considered information to be the flows of data delivered to people rather than the generation of original data, and it therefore included redundant information, which explains the very large estimate. Additionally, the research group conducted case studies to identify key trends and indicators of data growth in various research fields. Drawing on MIT researchers in different fields, these case studies defined the types of data researchers produced and roughly estimated how much data they produced in a year (HMI Case Studies, 2009). The major findings from these case studies were how much data generation varied from field to field, how the culture of each scientific field differed, and how individual researchers differed in their stances toward data retention, data reuse, and data sharing.
The latest study was conducted by Hilbert and Lopez, who focused on the world's capacity to handle information. Choosing not to account for redundancy, the study estimated storage capacity, communication capacity, and computation capacity for both analog and digital information over two decades (1986-2007) (Hilbert and Lopez, 2011). The study concluded that, although natural information processing is still relatively large in comparison, the world’s technological information processing capacities are growing exponentially (Hilbert and Lopez, 2011).
Although these studies are related to our research question, they define data and information differently and make different choices about what to include. Deciding how to define ecological data, and how to set the scope of the estimate, will therefore be a major challenge for this project.