{"id":2921,"date":"2017-06-02T23:17:57","date_gmt":"2017-06-02T23:17:57","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2921"},"modified":"2017-06-02T23:23:54","modified_gmt":"2017-06-02T23:23:54","slug":"exploration-of-search-logs-metadata-quality-and-data-discovery-week-2","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/search-logs\/exploration-of-search-logs-metadata-quality-and-data-discovery-week-2\/","title":{"rendered":"Exploration of Search Logs, Metadata Quality and Data Discovery: Week 2"},"content":{"rendered":"
For the second week of my project, my original goals were to\u00a0collect download logs, parse the log events into tokens, and populate a database with the download information. \u00a0After our weekly internship call, my mentors and I decided to change things up a little bit.<\/p>\n
The purpose of building a database of download events is to support an effort to associate download events with search events. \u00a0We can try to do that by comparing the time and remote host of each download event to those of the search events, but the comparison is much easier once the search events have been grouped into ‘sessions’ according to their proximity in time. \u00a0Each session covers a span of time, so when it comes time to associate download events with search events, we can check whether a download event occurs during the span of a session.<\/p>\n
So, this week I explored the best ways to group the search events into sessions. My first thought was to use a statistical clustering technique to group events together automatically. \u00a0Unfortunately, I found it difficult to build the concept of ‘human-scale’ time into the clustering approach. \u00a0For example, one remote host generated only 7 events in the logs, all within a period of about 5 minutes, but the clustering algorithms tended to split those events into two separate sessions. That split makes sense from a purely mathematical perspective, but it is clearly wrong when you think about how a person actually searches. \u00a0The approach I settled on instead is to treat as a session any collection of events that occur within 15 minutes of each other. \u00a0A fixed threshold like this takes the natural time scale of log events into account.<\/p>\n
I also collected the download logs as planned, though I haven’t yet had a chance to put them into the database. \u00a0More on that next week! \u00a0For more technical details about week 2, take a look at the hpad<\/a> and the GitHub<\/a> repository.<\/p>\n","protected":false},"excerpt":{"rendered":" For the second week of my project, my original goals were to\u00a0collect download logs, parse the log events into tokens, and populate a database with the download information. \u00a0After our weekly internship call, my mentors and I decided to change things up a little bit. The purpose of building a Continue reading Exploration of Search Logs, Metadata Quality and Data Discovery: Week 2<\/span>