Hello everyone, this is Yilin, the intern working on Project 2: Provenance for Self or Others? A Study with Hands-on Experiments. I am very glad to work with all of you in DataONE and hope we all have a great experience this summer. Research has always been fascinating to me, and pursuing a Ph.D. degree has become my next goal, so if you have any questions related to research, feel free to discuss them with me. As for the project, a lot of work has been done this week. The meeting with my mentor Bertram Ludäscher went really well: we clarified the objectives of the project as well as the further steps we might take in the following weeks.
Big picture for Project 2
Project 2 is divided into two parts. The first part is “an environment scan” of current research on data provenance. Its goal is to answer the major question “how do people use data provenance, and what kinds of data provenance tools are used in academic disciplines?” The outcome of this part is an annotated bibliography. The second part consists of hands-on research and programming, with a report as its output.
What is provenance?
Provenance is a fairly new concept, yet people encounter it almost every day. Its definition varies across fields. The Merriam-Webster dictionary defines provenance as “origin, source” or “the history of ownership of a valued object or work of art or literature”. For the Open Provenance Model (OPM), Moreau et al. (2008) state that “Provenance is well understood in the context of art or digital libraries, where it respectively refers to the documented history of an art object, or the documentation of processes in a digital object’s life cycle.” The W3C, in turn, defines provenance as a record that describes the entities and processes involved in producing and delivering or otherwise influencing a certain resource. In the newer field of blockchain, provenance also has its own particular meaning. Data provenance combined with blockchain works more like a “data identity”: it shows when the data was created, who collected it, what kinds of operations were performed on it, and so on. Nobody can change the information in this “data identity”, so future researchers can easily trace the history of the data to assess its authenticity and to reproduce results.
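To make the “data identity” idea a bit more concrete, here is a minimal Python sketch of a tamper-evident provenance log. It is only an illustration under my own assumptions: the function names (`record_hash`, `append_record`, `verify`) and the record fields are hypothetical, and a plain hash chain is a big simplification of a real blockchain (there is no network and no consensus). Each record stores who did what and when, plus the hash of the previous record, so a later change to any record can be detected.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record (an assumption of this sketch)

def record_hash(record: dict) -> str:
    """Return the SHA-256 digest of a record's canonical JSON form."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_record(chain: list, who: str, when: str, operation: str) -> None:
    """Append a provenance record that is linked to the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {"who": who, "when": when, "operation": operation, "prev_hash": prev_hash}
    chain.append({**body, "hash": record_hash(body)})

def verify(chain: list) -> bool:
    """Check that no record was altered and that all links are intact."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or record_hash(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

# Hypothetical example data: two operations on one dataset.
log = []
append_record(log, "alice", "2019-05-20T09:00:00Z", "collected")
append_record(log, "bob", "2019-05-21T14:30:00Z", "cleaned")
print(verify(log))
```

If someone later edits a stored field, for example `log[0]["who"]`, the recomputed hash no longer matches the stored one and `verify(log)` returns `False`.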
Types of provenance
Herschel et al. (2017) explain provenance and classify it into four main types: provenance meta-data, information system provenance, workflow provenance, and data provenance. The types can be told apart as follows. Meta-data itself can be regarded as provenance, and operations related to it can also be seen as provenance. General meta-data tends to assign meaning to the data, while provenance meta-data focuses more on the data derivation process. Based on the W3C definition of provenance, a value like the size of a file is not provenance, while its date of creation is. When we limit the context of provenance to information systems, we speak of information system provenance. Furthermore, when the production process is restricted to so-called workflows, which help scientists conceptualize and manage each step of an analysis, provenance becomes workflow provenance. Finally, scientists sometimes use provenance to track the processing of individual data items; this kind of provenance is called data provenance.
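As a rough illustration of the last two types, the Python sketch below records the steps of a small workflow and then asks which steps contributed to one particular output. Everything in it is hypothetical (the `Step` class, the `lineage` function, and the file names are my own), and it tracks whole files, whereas data provenance as described by Herschel et al. (2017) can work at a much finer level, down to individual data items.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One workflow step: an activity that derives an output from its inputs."""
    activity: str
    inputs: list
    output: str

def lineage(item: str, steps: list) -> list:
    """Return the activities that contributed to `item`, earliest first."""
    produced_by = {s.output: s for s in steps}
    ordered, seen = [], set()

    def visit(name: str) -> None:
        step = produced_by.get(name)
        if step is None or step.output in seen:
            return  # a raw input, or a step that was already recorded
        seen.add(step.output)
        for parent in step.inputs:
            visit(parent)
        ordered.append(step.activity)

    visit(item)
    return ordered

# Hypothetical three-step analysis workflow.
steps = [
    Step("clean", ["raw.csv"], "clean.csv"),
    Step("aggregate", ["clean.csv"], "summary.csv"),
    Step("plot", ["summary.csv"], "figure.png"),
]
print(lineage("figure.png", steps))  # ['clean', 'aggregate', 'plot']
```

The list of steps plays the role of workflow provenance, while the `lineage` query is closer in spirit to data provenance because it follows a single item back through its processing history.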
Applications of provenance
Provenance is now widely used in many fields, and its applications can be summarized based on Herschel et al. (2017). The table below shows the summary.
If you have any questions, feel free to use the comment function below. See you next week. Have a nice weekend!
Provenance. (2019). In Merriam-Webster.com. Retrieved May 24, 2019, from https://www.merriam-webster.com/dictionary/provenance
Herschel, M., Diestelkämper, R., & Ben Lahmar, H. (2017). A survey on provenance: What for? What form? What from?. The VLDB Journal—The International Journal on Very Large Data Bases, 26(6), 881-906.
Moreau, L., Freire, J., Futrelle, J., McGrath, R. E., Myers, J., & Paulson, P. (2008, June). The open provenance model: An overview. In International Provenance and Annotation Workshop (pp. 323-326). Springer, Berlin, Heidelberg.
This post makes the concept of data provenance very simple, good work!
Thank you so much!