Summing Up – And An Update – DataONE Notebooks

As you may have noticed, I haven’t posted on here for a while. I neglected to post after the last meeting in Berkeley, and then went off the internet for a couple of months to try to get rid of my tendonitis. It didn’t go away, and I haven’t completely stopped working on what I started in this internship. So, here is a summing up of the internship, and where we’re going to take this next.

Here was the initial intention for the internship:

Understanding how scientists analyze data
Description: Scientists use a wide variety of tools and techniques to manage and analyze data. However, to our knowledge no one has taken a systematic look at how scientists do their work. In this project, we will examine a large number of the scientific workflows that have been constructed. We will develop a way of categorizing workflows based on their complexity, types of processing steps employed, and other factors. The goal is to develop new and significant understanding of the scientific process and how it is being enabled by science workflows.

Well, I did examine a large number of workflows, albeit in a roundabout way. The internship resulted in two major pools of data – one being over 500 papers on workflows in the Mendeley workgroup, and the other being the screenscraped data on workflow use, downloads, and other information about those workflows from the myExperiment workflow repository. We did end up finding differences between workflows, but what we found was constrained by what we could gather from myExperiment. It is the largest repository of workflows online, and difficult to mine if the workflows aren’t from Taverna, but to a certain extent we did accomplish that goal. We also learned, tentatively, that workflows that do more and help the researcher more do get used more, which suggests that, yes, scientific workflows can help the scientific method by expediting the process. Our results are generally finer-grained than that, and the internship did change direction from being about the workflows themselves to being about how myExperiment has grown and been used as a site, which we are using as a proxy to learn about those workflows.

Personally, I learned a lot from this internship. Here was the original goal for the intern:

Skills to be learned: Kepler and Taverna workflow languages, research methods, research analysis, keeping an open science research notebook, communicating research results. A peer-reviewed publication is envisioned.

I did learn all about Kepler, Taverna, and a few other workflow systems. I did not end up learning much about assembling workflows, but I did learn how the majority of workflows are set up by looking through as many as I did (at least several hundred, manually). I also looked in detail at several workflows that had been published, or that had been emailed to me by helpful scientists. While I wouldn’t be comfortable hooking up a workflow to parse gene IDs using SOAP and PubMed, I would feel comfortable saying what a workflow is probably doing.

More than that, however, I learned much more about research methods and analysis than I expected. Mining as many papers as I did led to better categorisation skills, and the directions from my mentors on where to study and focus next were particularly illuminating. As well, the write-ups were a great learning experience for me, as I could see what syntax to use and which words to ignore and all of those subtle skills that aren’t readily apparent or available to an undergraduate. I also learned much more about using a research notebook – I currently use a locally installed copy of WordPress in a very similar fashion, because of the success of this open notebook. My time with the Open Knowledge Foundation in Berlin has led to a heightened interest in Open Science, and I am very glad to have been part of this internship for that. Tangentially, I learned much more about RDF than I expected, and my Python and R skills went up exponentially as a result of the coding I did for this project. For that, I want to thank my friend Steve, and of course, and especially, Karthik for bearing with me as an R beginner.

Where are we going next? Well, we’ve submitted an abstract with our results to IDCC 2011, where we hope to present our results and publish a paper in the proceedings. I can’t share that here, in case they have an issue with prior publication. This is what we worked on at the meeting in Berkeley for two days – writing the abstract, and going over all of the results from the thousands of lines of R code I had written, eking out every bit of information I could from the code. I had one or two mistakes which we caught and fixed, including a pretty big issue in the code that underestimated my complexity proxy.

We also laid the groundwork for future research: this will involve a cluster analysis of the most downloaded workflows, where we hope to see what kinds of workflows are downloaded more and how best to make a workflow downloadable and reusable. This will involve more advanced R code on my part, which I am looking forward to learning how to write. We plan to take all of this information and submit a paper to PLoS once we have finished. I also plan to screenscrape more of myExperiment, particularly user profiles and group profiles (which will be suitably anonymised, although all of the information is publicly available), to see whether they have any bearing on workflow use. I also plan to continue doing hard research into the papers I have gathered, in particular to see what sorts of workflows are most commonly published about (not including development papers), which are used in publications, and whether there is any way to trace those papers that use workflows that have been loaded into a repository. I am also hoping, on a separate note, to start a repository for shell scripts and code used in the social sciences, such as Linguistics, as there is nothing like this at the moment that I know of.

And that was my internship. As you can see, it is still ongoing. However, I want to thank Bill Michener, Bertram Ludäscher, and Rebecca Koskela for giving me the opportunity to do this work. I would also like to thank them for inviting me to the DataONE All Hands Meeting in October in Albuquerque, where I will present a poster on this research. I would like to thank Karthik Ram for being so helpful at all times with my coding. I would also like to thank Carl Boettiger and Heather Piwowar for helping me understand open notebooks and encouraging their use. Finally, I would like to thank the Open Knowledge Foundation for their conference, where I was able to network with quite a few people I am still in touch with and with whom I am continuing to work on some projects, such as Open Linguistics and Open Economics.

I will keep using this blog whenever I have new things to share, so I am not going to say goodbye. But certainly don’t expect regular updates. Hopefully, the next one won’t be two months away. 🙂