My goal for week four was to do some exploratory data analysis (EDA), now that the data are all transformed into a system that makes them easy to query. I produced some preliminary results and figures describing the search and download events captured by the logs. I’ll go through a series of graphs below, and you can find the code that generates them in the project’s GitHub repository. The hpad also has more numerical details about each example, in case that is of interest.
This histogram shows the number of sessions of different lengths. We can see that the 15-20 minute bin has by far the most sessions (in fact, 95% of them). From 20-25 minutes onward, the number of sessions drops off pretty quickly. I’ve cut this graph off at 60 minutes because a small number of very long sessions (the longest is about 2 hours and 45 minutes) would extend the graph out to the right quite a ways. This graph can be described as “right skewed” since most of the area of the graph is to the left and then there’s a long tail to the right. We’ll see this pattern quite a bit in these log data.
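For readers who want to see the binning spelled out, here is a minimal sketch of how a histogram like this could be built, assuming session lengths are available as a list of minutes. The function name, the 5-minute bin width, and the sample values are made up for illustration; this is not the code from the project repository.

```python
from collections import Counter
from statistics import mean, median


def bin_session_lengths(lengths_min, bin_width=5, cutoff=60):
    """Count sessions per bin, leaving out sessions at or beyond the cutoff."""
    counts = Counter()
    for length in lengths_min:
        if length >= cutoff:
            continue  # very long sessions are left off the plot
        start = int(length // bin_width) * bin_width
        counts[start] += 1
    return dict(sorted(counts.items()))


# Made-up session lengths in minutes, not taken from the real logs.
sample_lengths = [16, 17, 18, 19, 15.5, 22, 24, 31, 48, 165]
histogram = bin_session_lengths(sample_lengths)

for start, count in histogram.items():
    print(f"{start:>3}-{start + 5:<3} min: {'#' * count}")

# A mean well above the median is one quick sign of right skew.
print("mean > median:", mean(sample_lengths) > median(sample_lengths))
```

Note that the cutoff only affects what is drawn. The skew check uses all of the values, including the very long session.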
This graph shows how frequently different numbers of download events happened during sessions. Again we see right skew, with half of the sessions shown involving 1 or 2 download events. A couple of notes here: this is only showing sessions that included downloads, because the great majority of sessions do not. If I included sessions with 0 downloads, the graph would be very hard to read. Also, I’ve cut this graph off at 20 downloads because there are a small number of sessions with lots of downloads (the most is 791) and, again, the graph would need to extend quite far to the right to show all the data.
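The two filtering choices described above (dropping sessions with 0 downloads and cutting the axis at 20) can be sketched roughly as follows. This assumes a simple list with one download count per session. The function name and sample data are hypothetical and only meant to show the idea.

```python
from collections import Counter


def download_distribution(downloads_per_session, max_shown=20):
    """Return plotted counts and the share of download sessions with 1 or 2 downloads."""
    # Keep only sessions that downloaded something.
    with_downloads = [n for n in downloads_per_session if n > 0]
    if not with_downloads:
        return {}, 0.0
    counts = Counter(with_downloads)
    # Only values up to max_shown are drawn; larger ones still count toward the share.
    shown = {n: counts[n] for n in range(1, max_shown + 1) if counts[n]}
    share_one_or_two = (counts[1] + counts[2]) / len(with_downloads)
    return shown, share_one_or_two


# Made-up download counts per session, not the real log data.
sample_downloads = [0, 0, 0, 0, 1, 1, 2, 3, 5, 791]
shown, share = download_distribution(sample_downloads)
print(shown)
print(f"share with 1 or 2 downloads: {share:.0%}")
```

The share is computed over sessions with at least one download, which matches what the graph displays.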
This graph shows the count of search events per session. It has the usual right skew, but is a little more complex than what we’ve seen before. The most common number of events for a session is 1, which indicates a session that ended without actually performing any more than the default search. In other words, a client loaded the search page and then did nothing else. We see a drop for 2 events, and another for 3, but increases for 4 and 5 and a steep drop-off at 6. This suggests that common search activity tends to include several events, possibly refining the search with keywords, geographic bounds, or facets. Again, I have cut off this graph at 100 events to avoid showing a very wide graph.
We see from the above graph that most sessions involve download events from only one member node, but roughly 15% of sessions involve multiple member nodes. Below, a relationship graph shows which member nodes were paired up in sessions involving 2 member nodes. The width of the line connecting the nodes shows how often they participated in sessions together.
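The line widths in a relationship graph like this come down to counting how often each pair of member nodes shows up in the same session. Here is a rough sketch of that counting step, assuming a mapping from session id to the member nodes it downloaded from. The session ids, node names, and function name are all invented for the example.

```python
from collections import Counter


def pair_weights(session_nodes):
    """Count co-occurrences for sessions that touched exactly two member nodes."""
    weights = Counter()
    for nodes in session_nodes.values():
        unique = sorted(set(nodes))  # ignore repeat downloads from the same node
        if len(unique) == 2:
            weights[tuple(unique)] += 1
    return weights


# Hypothetical sessions and member node names, for illustration only.
sample_sessions = {
    "s1": ["node_a"],
    "s2": ["node_a", "node_b"],
    "s3": ["node_b", "node_a", "node_a"],
    "s4": ["node_b", "node_c"],
    "s5": ["node_a", "node_b", "node_c"],
}
weights = pair_weights(sample_sessions)

for (left, right), count in weights.most_common():
    print(f"{left} -- {right}: {count}")
```

Sorting each pair before counting means that (node_a, node_b) and (node_b, node_a) land on the same edge, so each count can be used directly as a line width.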
That’s all for this week. Next week, I will continue the EDA and begin to work on some more graph analysis of sessions to see if I can begin to describe what ‘typical’ sessions look like, visually.