For week five of my internship, my goals were to continue with exploratory data analysis, develop some additional figures, and refine some of the existing ones. We had originally planned to produce ‘session graphs’ that would illustrate the events of a session graphically, but given what we’ve learned about sessions, we decided to devote the time instead to developing better spatial and temporal analyses of search activity.
I also spent some time talking with Megan Mach, the intern working on the DataONE Messaging project, and we had a productive conversation about how she might be able to use some of the statistics that I’m coming up with to support the communication and outreach done by DataONE. She also gave me a few ideas for new analyses that I’ll be running in the next few weeks.
First, a redo of some figures from last week:
The above plot shows which member nodes participated in sessions in which more than one member node was accessed. We see the strongest line between Dryad and LTER, showing that those two shared the largest number of sessions. This differs from last week’s graph in that I’ve removed the Coordinating Node (CN). After speaking with my mentor group, it seemed that the CNs weren’t important in these interactions.
This plot is very similar, but instead of nodes accessed during the same session, it shows nodes accessed by the same client over the entire history of the download data. Where the previous plot shows clients accessing data from multiple nodes within a short period of time, this one does away with that time restriction. We can see that KNB is commonly paired with other nodes.
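For anyone curious how the pairings behind these two plots could be tallied, here is a minimal sketch in Python. It is not the code from the project repository: the record layout (`client`, `session`, `node`), the function name `count_node_pairs`, and the sample rows are all made up for illustration. The idea is simply to collect the set of nodes seen in each group (a session for the first plot, a client for the second) and count every pair of nodes that share a group.

```python
from collections import Counter
from itertools import combinations


def count_node_pairs(records, group_key, exclude=frozenset()):
    """Count how many groups (sessions or clients) each pair of nodes shares.

    `records` is a list of dicts with hypothetical keys "client",
    "session", and "node"; `group_key` selects which one defines a group.
    """
    nodes_by_group = {}
    for rec in records:
        node = rec["node"]
        if node in exclude:  # e.g. drop the Coordinating Node
            continue
        nodes_by_group.setdefault(rec[group_key], set()).add(node)

    pair_counts = Counter()
    for nodes in nodes_by_group.values():
        # Sorting makes (A, B) and (B, A) count as the same pair.
        for pair in combinations(sorted(nodes), 2):
            pair_counts[pair] += 1
    return pair_counts


# Made-up sample rows, only to show the shape of the input.
records = [
    {"client": "a", "session": 1, "node": "Dryad"},
    {"client": "a", "session": 1, "node": "LTER"},
    {"client": "a", "session": 2, "node": "KNB"},
    {"client": "b", "session": 3, "node": "Dryad"},
    {"client": "b", "session": 3, "node": "LTER"},
    {"client": "b", "session": 3, "node": "CN"},
]

by_session = count_node_pairs(records, "session", exclude={"CN"})
by_client = count_node_pairs(records, "client", exclude={"CN"})
```

In a sketch like this, the pair counts would become the line weights in the plot. Grouping by client rather than by session drops the time restriction, so pairs such as Dryad and KNB can appear even though they never share a session.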
Finally, here is a word cloud that shows the top 200 search terms used in DataONE search. We see that temperature is the most popular. I also see a fair number of words that may be related to hydrology: water, lake, river, stream, ocean, salinity, and so forth. It would be an interesting extension of this process to take the search terms and categorize them using some kind of ontology to get a more general idea of the kinds of sciences being supported by the data.
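As a rough illustration of the counting step behind a word cloud like this, here is a small Python sketch. It is an assumption about one way to do it, not the project's actual pipeline: the function name `top_terms`, the simple letters-only tokenizer, the stopword set, and the sample queries are all hypothetical.

```python
import re
from collections import Counter


def top_terms(queries, n=200, stopwords=frozenset()):
    """Return the n most frequent terms across a list of query strings."""
    counts = Counter()
    for query in queries:
        # Naive tokenizer: lowercase the query and keep runs of letters.
        for term in re.findall(r"[a-z]+", query.lower()):
            if term not in stopwords:
                counts[term] += 1
    return counts.most_common(n)


# Made-up queries, only to show the shape of the input.
queries = [
    "Water temperature",
    "lake temperature",
    "temperature of the stream",
]

top = top_terms(queries, n=3, stopwords={"of", "the"})
```

The resulting (term, count) pairs are what a word cloud tool would size the words by. The same counts could also be the starting point for grouping terms into broader categories.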
I’ll be off on a short vacation for the next week and a half, but I’ll update with my progress at the next opportunity. As always, take a look at the GitHub repository for the project as well as the hpad for more technical details about the analyses shown here.