{"id":2909,"date":"2017-05-26T19:32:36","date_gmt":"2017-05-26T19:32:36","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2909"},"modified":"2017-05-26T22:36:54","modified_gmt":"2017-05-26T22:36:54","slug":"exploration-of-search-logs-metadata-quality-and-data-discovery-week-1","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/search-logs\/exploration-of-search-logs-metadata-quality-and-data-discovery-week-1\/","title":{"rendered":"Exploration of Search Logs, Metadata Quality and Data Discovery: Week 1"},"content":{"rendered":"
My name is Ed Flathers, and I’m the DataONE Summer Intern on the project, “Exploration of Search Logs, Metadata Quality and Data Discovery.” \u00a0This project is largely focused on data mining and analysis of the DataONE search logs, download logs, and quality reports; many of my products will be program code using the Python and R languages. \u00a0I’ll be posting here on the blog every Friday with a general update on my activities, but I’ll also be updating the hpad<\/a> for the project with more technical details. \u00a0In addition, there is a GitHub<\/a> repository where I will publish the code that I produce.<\/p>\n It’s great to get started on the project! \u00a0I spent a day in Santa Barbara meeting in person with project mentors Lauren Walker, Amber Budden, Matt Jones, and Dave Vieglais. \u00a0We spent some time discussing the project goals and clarifying the scope of activities, including striking the original goal 3, with the possibility (time permitting) of replacing it with some graph-based visualization of DataONE log events. \u00a0As we further develop the goals, we will update them in the hpad<\/a>.<\/p>\n For my first week, my goals were to collect search logs, parse the log events into tokens, and populate a database with the event information. \u00a0With access to just over a year’s worth of logs, I have collected 159 log files representing about 1.6 million events. \u00a0After parsing the event logs and populating the database, it looks like about 282,000 (17.6%) of the logged events represent search activity. \u00a0We will mostly ignore the non-search events in our analysis. \u00a0Cutting the data by more than 80% is a helpful early step, since a smaller volume of data is easier to work with, particularly in terms of the time it takes to execute code against it.<\/p>\n For more technical details about week 1, take a look at the hpad<\/a> and the GitHub<\/a> repository. 
\u00a0Next week: collecting and parsing the download logs, plus more database population!<\/p>\n","protected":false},"excerpt":{"rendered":" My name is Ed Flathers, and I’m the DataONE Summer Intern on the project, “Exploration of Search Logs, Metadata Quality and Data Discovery.” \u00a0This project is largely focused on data mining and analysis of the DataONE search logs, download logs, and quality reports; many of my products will be program Continue reading Exploration of Search Logs, Metadata Quality and Data Discovery: Week 1<\/span>