In light of these studies, what are the issues related to gathering ecological data?
Defining Data versus Information:
- Do we focus on how much raw data is produced or how much information we are getting out of the data (analyzed and published)?
- Is raw data considered meaningless until it is interpreted (analyzed)? There are several types of raw data (observational, experimental, computational or generated). Other types of data could be qualitative (expert knowledge) and quantitative (numbers).
- Assumptions to be made. What unit will the estimate of “data” represent? Exabytes? For example, a CSV file is about x megabytes, etc. For databases, do we estimate their total storage size and their rate of increase over time? For traditional information, do we include books, maps, audio recordings, video recordings, and paper records? All of this could be translated into storage units (e.g., in the Lesk paper, 1 sheet of paper = 5000 bytes). Assumptions are needed for the calculation (e.g., the % of fresh information on a piece of paper).
- Do we only count original data? Or is teasing apart redundancy too hard?
- Are we interested in worldwide data or just in the US? Can we extrapolate? Who is the leading producer of ecological information?
- Over what time period? Snapshot approach versus all the data within a given range of years
- Estimating the size. Which fields are included? How many journal publishers? How many ecologists are out there (early, mid, late careers)? Who else is included (Govt, NGOs, Universities, Research stations)?
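The paper-to-bytes conversion mentioned above can be sketched as follows. This is only an illustration: the 5000 bytes per sheet figure is the one quoted from the Lesk paper in these notes, while the function names, the `fresh_fraction` parameter, and the 200-sheet notebook are hypothetical.

```python
# Bytes per sheet of paper, as quoted from the Lesk paper in the notes above.
BYTES_PER_SHEET = 5000


def paper_to_bytes(sheets, fresh_fraction=1.0):
    """Estimate the bytes of original information held on paper.

    fresh_fraction is the ASSUMED share of each sheet that is
    non-redundant ("fresh") information, between 0 and 1.
    """
    if not 0.0 <= fresh_fraction <= 1.0:
        raise ValueError("fresh_fraction must be between 0 and 1")
    return sheets * BYTES_PER_SHEET * fresh_fraction


def to_unit(n_bytes, unit="MB"):
    """Convert a byte count to a decimal (SI) storage unit."""
    powers = {"B": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4, "PB": 5, "EB": 6}
    return n_bytes / (1000 ** powers[unit])


# Hypothetical example: a 200-sheet field notebook, half of it original notes.
notebook_bytes = paper_to_bytes(200, fresh_fraction=0.5)
print(to_unit(notebook_bytes, "MB"))  # 0.5
```

Whatever unit is chosen, every medium (maps, audio, video) would need its own conversion factor, and each factor is an assumption that should be recorded alongside the estimate.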
What to measure:
- Do we count raw data as single units or as a part of a data set? What about databases?
- Would images/videos/audio be important? Audio data has been recorded for birds, there are databases of pictures of plant species for identification (e.g., Calflora.org), and there is night footage of nocturnal mammals. Additionally, would photographs of species or aerial maps made for ecological purposes apply? E.g., motion-capture photos of animals? GIS layers? Satellite imagery? Land use changes?
- Do we include raw data as well as transformed, computed, and analyzed data? Metadata? Is intermediate data considered data? What if it is deleted?
- Are we interested in the amount of data the ecological community generates and/or consumes/communicates?
- Consumption of Information: Is it important to know how much ecological data is consumed? How is information digested and shared in the ecological community? The HMI 2009 study measured consumption and allowed for redundancy due to a different definition of data. Is consumption important, given that it is not exactly what we are interested in? How much time does an ecologist spend researching online or in general?
- Dark data: According to the HMI 2009 study, dark data is data generated from machine to machine. Our definition of dark data may be different, i.e., data that never gets published?
- Do we measure data that has been used and interpreted into information or do we also try to quantify raw data that has never been used? It has potential value though.
- Stock versus Flow: The Berkeley study defined two measures: first, the annual size of the “stock” of new information contained in storage media; second, the volume of information seen or heard each year in information flows. Do we distinguish stock versus flow? The Berkeley 2003 study measured information flows. In ecology this would translate to ecological conferences, presentations, education workshops, and school lectures? The categories would be different from those in that study (radio, TV, telephone, internet). They would be emails between professors, PowerPoint presentations, journal articles (though is this storage?), interviews on TV and radio, websites (information storage?), the internet, YouTube videos on ecology, TV shows on ecology, radio shows (not likely much on ecology), and school lectures (count how many classes are held in ecology departments).
- Digital/Non-Digital Communication: Is how ecologists communicate with each other considered a form of data? Emails?
- Growth rates/Rate of change: Similarly, what is the growth rate of ecological data? In what field? In what format/medium? Is ecological data growth accelerating? The HMI case studies on Climate Change and Biological Oceanography researchers state that it will. Were there any equivalent downturns in ecology, when less information was produced?
- How much information is lost? Do researchers ever throw away their dark data? Or never look at it?
- Ecological data should be considered a form of information storage. Information flows would take the form of ESA international conferences, presentations, nature videos, etc., which we may not be interested in quantifying. BUT it would be interesting to quantify how much ecological information is shared within the ecological community of academia, land managers, and the general public. This may be similar to Sarah Clark’s project (however, she may be more interested in the perception of ecology by the public).
- Are we interested in sampling the amount of information professors have along the lifespan of their careers?
- How to quantify expert knowledge, i.e., the information in people’s heads? Lesk cited those that tried to estimate knowledge retained in humans….
- Need to know what the current storage media are. Hard drives, flash drives (no more floppy?), shared information: network servers, ???, shared Google Docs?, shared Dropbox?
- Need to calculate how many ecologists there are in the world.
- Information is growing rapidly; is ecological information too?
- How much are ecologists contributing to databases?
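For the growth-rate questions above, one possible sketch is a compound annual growth rate computed from two snapshots of the data stock, plus the implied doubling time. The compound-growth model is an assumption (real growth may not be smooth), and the function names and example sizes are hypothetical.

```python
import math


def annual_growth_rate(size_start, size_end, years):
    """Compound annual growth rate between two stock snapshots.

    Assumes smooth compound growth: size_end = size_start * (1 + r) ** years.
    Sizes can be in any unit, as long as both use the same one.
    """
    if size_start <= 0 or size_end <= 0 or years <= 0:
        raise ValueError("sizes and years must be positive")
    return (size_end / size_start) ** (1.0 / years) - 1.0


def doubling_time(rate):
    """Years needed for the stock to double at a constant positive annual rate."""
    if rate <= 0:
        raise ValueError("rate must be positive for the stock to double")
    return math.log(2) / math.log(1.0 + rate)


# Hypothetical snapshots: a lab archive grows from 1 TB to 4 TB in 2 years.
rate = annual_growth_rate(1.0, 4.0, 2)
print(f"annual growth rate: {rate:.0%}")
print(f"doubling time: {doubling_time(rate):.1f} years")
```

Comparing rates computed over successive periods would show whether growth is accelerating, and a negative rate would flag a downturn.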
How to measure data/information:
- What is the unit of measure? Bytes? Words? Hours? HMI 2009 used all three. Which unit will we use? Bytes are a little biased, but most ecological data is now stored digitally (PDFs, databases, Excel files). Words would mean we need to average the number of words in a journal article, and what about raw data? Does a measurement = a word? Hours in terms of what? Hours doing field work? Writing a paper? Giving presentations?
- How to standardize? Conversion: Past ecological data has been on paper, on audio tracks and in other forms of media, while magnetic storage (mainly hard drives) has been the main form of data storage. Regardless of which measure we choose, we will need to convert in order to compare.
- Hence, for most storage devices, their nominal capacity is much smaller than the data that can be housed on the device over a period of time (as files are erased and replaced) (HMI 2009). Think about how ecologists keep their data: do they modify it later or add to it? So if I ask a professor how much their hard drive contains, I also need to ask how many years they have been working; there is a time component. Do ecologists ever erase their data? And how do we classify data streams? This is relevant to real-time streaming of ecological data, e.g., from weather stations.
- While the Berkeley and IDC studies provide us with an upper-bound estimate, if we can estimate the amount of scientific data we can zero in on the amount of ecological data (since it is a subset of scientific data).
- Use an indicator to estimate? For example, measure only the number of text files or CSV files created per year, both those used in publications and those stored as dark data.
- How to assess redundancy? What about backups? Do ecologists store things on CDs/DVDs anymore? Are the data on servers? Are those usually just backups? There may be a method to account for journal redundancies: we need to estimate the number of journal articles being produced and what % is put online, to avoid double counting. Print is probably redundant info. Duplication: It is very difficult to distinguish “copies” from “original” information (Berkeley, 2000). Ecological data may have the same problem, especially if journals cite information from each other, so we need a way to find “unique” data. When new information is derived from an existing dataset, do we double count because the information is new even though the dataset is the same?
- Compression: If we use bytes as the measure of digital data, compression is an issue, and we need to normalize compression rates to compare information. Unlike print or film, there is no unambiguous way to measure the size of digital information (Berkeley, 2000). The Berkeley study steered a middle course between the high estimate (based on “reasonable” compression) and the low estimate (based on highly compressed content). We may copy this study’s strategy of using average low and high compression estimates.
- Any information we can get from industry? Not likely because if we try to estimate the amount of hard disk storage, we cannot tease apart ecological data from entertainment information stored on the same hard drive.
- How do we measure our uncertainty? Do we provide upper and lower bound estimates? Berkeley 2000 and 2003 did not seem to estimate their uncertainty well and provided only upper and lower bound estimates. Do we need to estimate our uncertainty? Statistics?!?
- The pure volume of information does not necessarily determine its value or impact (HMI 2009). Will we not try to look at the value of ecological information? The same may hold in ecology: there are volumes of information and journals, but they are not really valued. Perhaps not many people read them or use them in applied work? The IDC 2007 study also attempted to value the data in dollars. It may be interesting to see how much ecological info is worth.
- Human production versus automation. Researchers commission an experiment and get data, while satellites and weather stations grab information every x hours….
- Is there an ecology growth rate? Any explosions of data? What are the trends/hot topics?
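The compression and upper/lower bound points above could be combined in a sketch like the one below. It is not the Berkeley study's actual procedure: the compression ratios are placeholder assumptions, the "middle course" is taken here as a simple arithmetic midpoint, and all names are hypothetical.

```python
def compressed_size_bounds(raw_bytes, reasonable_ratio, high_ratio):
    """Return (low, middle, high) size estimates for one body of data.

    reasonable_ratio: ASSUMED raw/compressed ratio under "reasonable"
        compression; it yields the high estimate.
    high_ratio: ASSUMED raw/compressed ratio for highly compressed
        content; it yields the low estimate.
    The middle value is an arithmetic midpoint, which is an assumption
    and not necessarily how the Berkeley study chose its middle course.
    """
    if not 1 <= reasonable_ratio <= high_ratio:
        raise ValueError("expected 1 <= reasonable_ratio <= high_ratio")
    high = raw_bytes / reasonable_ratio
    low = raw_bytes / high_ratio
    middle = (low + high) / 2
    return low, middle, high


# Hypothetical example: 1000 MB of uncompressed data, ratios of 2 and 10.
low, middle, high = compressed_size_bounds(1000, reasonable_ratio=2, high_ratio=10)
print(low, middle, high)  # 100.0 300.0 500.0
```

Reporting all three numbers keeps the uncertainty visible, though a range built from assumed ratios is still not a statistical confidence interval.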