Exploration of Search Logs, Metadata Quality and Data Discovery: Week 1

My name is Ed Flathers, and I’m the DataONE Summer Intern on the project, “Exploration of Search Logs, Metadata Quality and Data Discovery.”  This project is largely focused on data mining and analysis of the DataONE search logs, download logs, and quality reports; much of what I produce will be code written in Python and R.  I’ll be posting here on the blog every Friday with a general update on my activities, but I’ll also be updating the hpad for the project with more technical details.  In addition, there is a GitHub repository where I will publish the code that I produce.

It’s great to get started on the project!  I spent a day in Santa Barbara meeting in person with project mentors Lauren Walker, Amber Budden, Matt Jones, and Dave Vieglais.  We discussed the project goals and clarified the scope of activities, which included striking the original goal 3, with the possibility (time permitting) of replacing it with some graph-based visualization of DataONE log events.  As we further develop the goals, we will update them in the hpad.

For my first week, my goals were to collect search logs, parse the log events into tokens, and populate a database with the event information.  With access to just over a year’s worth of logs, I have collected 159 log files representing about 1.6 million events.  After parsing the event logs and populating the database, it looks like about 282,000 (17.6%) of the logged events represent search activity.  We will mostly ignore the non-search events for our analysis.  Reducing the data frame by such a significant amount is a helpful early step, since a lower volume of data is easier to work with, particularly in terms of the time it takes to execute code against the data.
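To give a rough idea of what the parse-and-load step looks like, here is a minimal Python sketch. It is an illustration only, not the project code from the GitHub repository. The log line layout (timestamp, client, method, path), the `events` table, the function names, and the rule that a path containing `/query/` marks a search event are all assumptions made up for this example; the real DataONE log format is not described in this post.

```python
import sqlite3

# Hypothetical log lines; the real DataONE log format may differ.
SAMPLE_LINES = [
    "2013-06-01T12:00:00Z 10.0.0.1 GET /v1/query/solr/?q=water",
    "2013-06-01T12:00:05Z 10.0.0.2 GET /v1/object/abc123",
    "2013-06-01T12:00:09Z 10.0.0.1 GET /v1/query/solr/?q=soil",
    "malformed line",
]


def parse_line(line):
    """Split one log line into tokens; return None unless it has four fields."""
    tokens = line.split()
    if len(tokens) != 4:
        return None
    timestamp, client, method, path = tokens
    # Assumption: search requests are those whose path contains "/query/".
    is_search = 1 if "/query/" in path else 0
    return (timestamp, client, method, path, is_search)


def load_events(conn, lines):
    """Create the (hypothetical) events table and insert every parseable line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "timestamp TEXT, client TEXT, method TEXT, path TEXT, is_search INTEGER)"
    )
    rows = [row for row in (parse_line(line) for line in lines) if row is not None]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def search_share(conn):
    """Return (search events, total events, fraction that are searches)."""
    total, searches = conn.execute(
        "SELECT COUNT(*), SUM(is_search) FROM events"
    ).fetchone()
    searches = searches or 0  # SUM returns NULL on an empty table
    return searches, total, (searches / total if total else 0.0)


conn = sqlite3.connect(":memory:")
loaded = load_events(conn, SAMPLE_LINES)
searches, total, share = search_share(conn)
print(f"{searches} of {total} events are searches ({share:.1%})")
```

Storing a search flag on each row means later analysis can restrict itself to search events with a simple `WHERE is_search = 1` filter instead of re-parsing the log files.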

For more technical details about week 1, take a look at the hpad and the GitHub repository.  Next week: collecting and parsing the download logs, plus more database population!
