Scraping the Surface

Last week, having come home from Berlin, I was faced with a problem. On the one hand, I hadn't been able to crack SPARQL. I'm sure it's a great language (that's not true), I'm sure it's useful to know in the long run (not convinced of this), and I'm aware that it must have a surpassable learning curve (more like a cliff). Without a proper course, or at least a coursebook, on it and basic RDF mining, I was at a loss. On the other hand, I needed to mine several hundred workflows on myExperiment for basic information. I hadn't done this, mostly due to time constraints, and partly because I didn't want to sit at my computer for hours copying and pasting. Without any other option, though, that seemed like the way it was going to have to go.

So I sat down and copied and pasted information from 50-odd workflows. That took far longer than one would expect. Towards the end of it, I was talking to a friend of mine about how this was being done, and he suggested using Python. I know some Python, but I didn't know of any easy function for this sort of screen scraping. He was working on a site that uses Beautiful Soup to mine webpages for musical artists and present them in a linked format, so I looked into Beautiful Soup. It was a bit over my head, so I messaged another friend of mine who has been a serious help with other coding issues I've had. After a few hours of work, and over a few pints in the local, we (mostly he) managed to bang out a script that does exactly what I wanted. I have uploaded it to GitHub here.

The script essentially mines the raw HTML from myExperiment and pulls out all of the relevant data it can. In my case, that is the title, the dates uploaded and updated, the author, the workflow type, the description, the tags, the number of views, the number of downloads, the inputs, processors, beanshells, outputs, datalinks, coordinations, ratings, attributions, and so on. That's most of the information we're looking for, and it's certainly minable. All of this information can then be written to a .csv. That .csv, rather than being in three columns like an RDF triple store, can be loaded into an SQL database rather easily and mined from there using R, which is what I am currently doing.
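For anyone curious what that looks like in practice, the core pattern is roughly the sketch below. This is not the actual script: the workflow IDs, URL pattern, and HTML selectors are placeholders I've made up for illustration (the real ones depend on myExperiment's page markup), but the overall pipeline of fetching a page, parsing it with Beautiful Soup, and writing a row to a CSV is the general idea.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder workflow IDs -- the real script works from the actual
# myExperiment workflow pages you want to mine.
WORKFLOW_IDS = [1001, 1002, 1003]

FIELDS = ["id", "title", "uploader", "created", "updated", "type",
          "description", "tags", "views", "downloads"]

def scrape_workflow(workflow_id):
    """Fetch one workflow page and pull out its basic metadata.

    The CSS selectors below are hypothetical; they would need to be
    adjusted to match the markup of the real pages.
    """
    url = f"https://www.myexperiment.org/workflows/{workflow_id}"
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    def text_or_blank(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    return {
        "id": workflow_id,
        "title": text_or_blank("h1.title"),
        "uploader": text_or_blank(".uploader a"),
        "created": text_or_blank(".created"),
        "updated": text_or_blank(".updated"),
        "type": text_or_blank(".workflow-type"),
        "description": text_or_blank(".description"),
        "tags": ", ".join(t.get_text(strip=True)
                          for t in soup.select(".tags a")),
        "views": text_or_blank(".viewings"),
        "downloads": text_or_blank(".downloads"),
    }

def main():
    # One row per workflow, one column per field -- this is what makes
    # the result easy to load into an SQL table or straight into R.
    with open("workflows.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for workflow_id in WORKFLOW_IDS:
            writer.writerow(scrape_workflow(workflow_id))

if __name__ == "__main__":
    main()
```

Because the output is a flat table rather than triples, getting it into SQL or R afterwards is the easy part: R's read.csv (or an import into SQLite) takes the file as-is, with each scraped field as a column.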

So, instead of fighting SPARQL, I'm able to mine the information directly, and in more depth than I had previously anticipated. This should make the research much easier and more streamlined. I've been under the weather these past few days, but expect results over the next few as I mine it further. Between this, the Mendeley account with hundreds of references, and the draft I'm slowly working on, a paper is taking shape.