For the second week of my project, my original goals were to collect download logs, parse the log events into tokens, and populate a database with the download information. After our weekly internship call, my mentors and I decided to change things up a little bit.
The purpose of building a database of download events is to support an effort to associate download events with search events. We can try to do that by comparing the time and remote host of each download event to those of the search events, but the comparison becomes much easier given the concept of 'sessions': groups of search events clustered together according to their proximity in time. Grouping events into sessions creates time spans, so when it comes time to associate a download event with search events, we can simply check whether the download occurred during the span of a session.
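To make the span-containment idea concrete, here is a minimal sketch of the check described above. The function name, the tuple layouts, and the example hosts and timestamps are all hypothetical, not part of the actual project code.

```python
from datetime import datetime

def matching_session(download, sessions):
    """Return the first session whose remote host matches the download's
    and whose time span contains the download's timestamp, else None.

    download: (host, timestamp)
    sessions: list of (host, span_start, span_end)
    """
    host, when = download
    for s_host, start, end in sessions:
        if s_host == host and start <= when <= end:
            return (s_host, start, end)
    return None

# Hypothetical session spans built from grouped search events.
sessions = [
    ("10.0.0.1", datetime(2015, 6, 1, 9, 0), datetime(2015, 6, 1, 9, 12)),
    ("10.0.0.2", datetime(2015, 6, 1, 9, 5), datetime(2015, 6, 1, 9, 30)),
]

# A download from 10.0.0.1 at 9:07 falls inside the first session's span.
print(matching_session(("10.0.0.1", datetime(2015, 6, 1, 9, 7)), sessions))
```

In practice one might also allow a little slack around the span edges, since a download can plausibly trail the last search event of its session.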
So, this week I explored the best ways to group the search events into sessions. My first thought was to use a statistical clustering technique to group the events automatically. Unfortunately, I found it difficult to build the concept of 'human-scale' time into the clustering approach. For example, one remote host generated only 7 events in the logs, all within a period of about 5 minutes, yet the clustering algorithms tended to split those events into two separate sessions. That made sense from a purely mathematical perspective, but it is clearly wrong when thinking of real-world behavior. The approach I settled on instead treats as a single session any run of events from one host in which consecutive events occur within 15 minutes of each other. This takes into account the natural time scale of log events.
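The gap-based grouping above can be sketched in a few lines. This is my own illustration of the technique, not the project's actual implementation; the function name and the 15-minute default are the only details taken from the description.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=15)):
    """Group one host's event timestamps into sessions.

    A new session starts whenever the gap between consecutive events
    exceeds `gap`. Returns a list of (span_start, span_end) tuples.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # continue the current session
        else:
            sessions.append([t])    # gap too large: start a new session
    return [(s[0], s[-1]) for s in sessions]

# Hypothetical events: three close together, then one an hour later.
events = [
    datetime(2015, 6, 1, 9, 0),
    datetime(2015, 6, 1, 9, 5),
    datetime(2015, 6, 1, 9, 10),
    datetime(2015, 6, 1, 10, 10),
]
print(sessionize(events))  # two sessions: one span of 10 minutes, one instant
```

One nice property of this approach over fixed-width clustering is that a session can stretch indefinitely as long as the user keeps searching, which matches how people actually use a search interface.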
I also collected the download logs as planned, though I haven’t yet had a chance to put them into the database. More on that next week! For more technical details about week 2, take a look at the hpad and the GitHub repository.