{"id":2968,"date":"2017-06-17T02:30:59","date_gmt":"2017-06-17T02:30:59","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2968"},"modified":"2017-06-17T02:30:59","modified_gmt":"2017-06-17T02:30:59","slug":"exploration-of-search-logs-metadata-quality-and-data-discovery-week-4","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/search-logs\/exploration-of-search-logs-metadata-quality-and-data-discovery-week-4\/","title":{"rendered":"Exploration of Search Logs, Metadata Quality and Data Discovery: Week 4"},"content":{"rendered":"
My goal for week four was to do some exploratory data analysis (EDA), now that the data are all transformed into a system that makes them easy to query. I produced some preliminary results and figures describing the search and download events captured by the logs. I’ll go through a series of graphs below, and you can find the code that generates them in the project’s GitHub<\/a> repository. The\u00a0hpad<\/a> also has more numerical details about each example, in case that is of interest.<\/p>\n This histogram shows the number of sessions of different lengths. We can see that the 15-20 minute range has by far the most sessions (in fact, 95% of them). From 20-25 minutes onward, the number of sessions drops off pretty quickly. I’ve cut this graph off at 60 minutes because there are a small number of very long sessions (the longest is about 2 hours and 45 minutes) that would extend the graph out to the right quite a ways. This graph can be described as “right skewed” since most of the area of the graph is to the left and then there’s a long tail to the right. We’ll see this pattern quite a bit in these log data.<\/p>\n This graph shows how frequently different numbers of download events happened during sessions. Again we see right skew, with half of the sessions shown involving 1 or 2 download events. A couple of notes here: this only shows sessions that included downloads, because the great majority of sessions do not. If I included sessions with 0 downloads, the graph would be very hard to read. Also, I’ve cut this graph off at 20 downloads because there are a small number of sessions with lots of downloads (the most is 791) and, again, the graph would need to extend quite far to the right to show all the data.<\/p>\n This graph shows the count of search events per session. It has the usual right skew, but is a little more complex than what we’ve seen before. 
The most common number of events for a session is 1, which indicates a session that ended without performing anything more than the default search. In other words, a client loaded the search page and then did nothing else. We see a drop for 2 events, and another for 3, but increases for 4 and 5 and a steep drop-off at 6. This suggests that\u00a0common search activity tends to include several events, possibly refining the search with keywords, geographic bounds, or facets. Again, I have cut off this graph at 100 events to avoid showing a very wide graph.<\/p>\n We see from the above graph that most sessions involve download events from only one member node, but roughly 15% of sessions involve multiple member nodes. Below, a relationship graph shows which member nodes were paired up in sessions involving 2 member nodes. The width of the line connecting the nodes shows how often they participated in sessions together.<\/p>\n <\/p>\n That’s all for this week. Next week, I will continue the EDA and begin to work on some more graph analysis of sessions to see if I can begin to describe what ‘typical’ sessions look like, visually.<\/p>\n","protected":false},"excerpt":{"rendered":" My goal for week four was to do some exploratory data analysis (EDA), now that the data are all transformed into a system that makes them easy to query. I produced some preliminary results and figures describing the search and download events captured by the logs. I’ll go through a Continue reading Exploration of Search Logs, Metadata Quality and Data Discovery: Week 4<\/span><\/h5>\n
<\/h5>\n
<\/h5>\n
<\/h5>\n