{"id":2968,"date":"2017-06-17T02:30:59","date_gmt":"2017-06-17T02:30:59","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2968"},"modified":"2017-06-17T02:30:59","modified_gmt":"2017-06-17T02:30:59","slug":"exploration-of-search-logs-metadata-quality-and-data-discovery-week-4","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/search-logs\/exploration-of-search-logs-metadata-quality-and-data-discovery-week-4\/","title":{"rendered":"Exploration of Search Logs, Metadata Quality and Data Discovery: Week 4"},"content":{"rendered":"

My goal for week four was to do some exploratory data analysis (EDA), now that the data are all transformed into a system that makes them easy to query. I produced some preliminary results and figures describing the search and download events captured by the logs. I’ll go through a series of graphs below, and you can find the code that generates them in the project’s GitHub<\/a> repository. The\u00a0hpad<\/a> also has more numerical details about each example, in case that is of interest.<\/p>\n

\"Session<\/h5>\n

This histogram shows the number of sessions of different lengths. The 15-20 minute bin has by far the most sessions (in fact, 95% of them). From 20-25 minutes onward, the number of sessions drops off quickly. I’ve cut this graph off at 60 minutes because a small number of very long sessions (the longest is about 2 hours and 45 minutes) would otherwise stretch the graph far to the right. This distribution is “right skewed”: most of the area is on the left, followed by a long tail to the right. We’ll see this pattern quite a bit in these log data.<\/p>\n

\"Downloads<\/h5>\n

This graph shows how frequently different numbers of download events happened during sessions. Again we see right skew, with half of the sessions shown involving 1 or 2 download events. A couple of notes here: this only shows sessions that included downloads, because the great majority of sessions do not. If I included sessions with 0 downloads, the graph would be very hard to read. Also, I’ve cut this graph off at 20 downloads because there are a small number of sessions with lots of downloads (the most is 791) and, again, the graph would need to extend quite far to the right to show all the data.<\/p>\n

\"Search<\/h5>\n

This graph shows the count of search events per session. It has the usual right skew, but is a little more complex than what we’ve seen before. The most common number of events for a session is 1, which indicates a session that ended without performing anything more than the default search. In other words, a client loaded the search page and then did nothing else. We see a drop for 2 events, and another for 3, but increases for 4 and 5 and a steep drop-off at 6. This suggests that\u00a0common search activity tends to include several events, possibly refining the search with keywords, geographic bounds, or facets. Again, I have cut off this graph at 100 events to avoid showing a very wide graph.<\/p>\n

\"Member<\/h5>\n

We see from the above graph that most sessions involve download events from only one member node, but roughly 15% of sessions involve multiple member nodes. Below, a relationship graph shows which member nodes were paired up in sessions involving 2 member nodes. The width of the line connecting the nodes shows how often they participated in sessions together.<\/p>\n
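The line widths in a relationship graph like this come down to co-occurrence counts. Here is a minimal sketch of that counting step in Python, not the code from the project repository. The function name, the session ids, and the node names are all hypothetical, and it assumes each session has already been reduced to the list of member nodes it downloaded from.

```python
from collections import Counter


def pair_weights(session_nodes):
    '''Count how often each unordered pair of member nodes shares a session.

    session_nodes maps a session id to the member nodes it downloaded from.
    Only sessions touching exactly two distinct nodes are counted, matching
    the two-node sessions described in the post.
    '''
    weights = Counter()
    for nodes in session_nodes.values():
        distinct = sorted(set(nodes))  # ignore repeat downloads from one node
        if len(distinct) == 2:
            weights[tuple(distinct)] += 1
    return weights


# Hypothetical sessions and node names, for illustration only.
sessions = {
    's1': ['nodeA', 'nodeB'],
    's2': ['nodeB', 'nodeA', 'nodeA'],
    's3': ['nodeA'],                      # single node: not a pair
    's4': ['nodeB', 'nodeC'],
    's5': ['nodeA', 'nodeB', 'nodeC'],    # three nodes: skipped here
}
weights = pair_weights(sessions)
for (left, right), count in sorted(weights.items()):
    print(left, right, count)
```

Each count would then set the width of the edge between the two nodes. Sorting the pair before counting makes (nodeA, nodeB) and (nodeB, nodeA) land on the same edge.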

\"Member<\/p>\n

That’s all for this week. Next week, I will continue the EDA and begin to work on some more graph analysis of sessions to see if I can begin to describe what ‘typical’ sessions look like, visually.<\/p>\n","protected":false},"excerpt":{"rendered":"

My goal for week four was to do some exploratory data analysis (EDA), now that the data are all transformed into a system that makes them easy to query. I produced some preliminary results and figures describing the search and download events captured by the logs. I’ll go through a Continue reading Exploration of Search Logs, Metadata Quality and Data Discovery: Week 4<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":105,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[375],"tags":[],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2968"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/105"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=2968"}],"version-history":[{"count":9,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2968\/revisions"}],"predecessor-version":[{"id":2980,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2968\/revisions\/2980"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=2968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=2968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=2968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}