Exploration of Search Logs, Metadata Quality and Data Discovery: Week 7

For the seventh week of my internship, I took up a few spatial questions that I discussed with my mentor group, and I also looked into the temporal component of DataONE search. Last week, I looked at searches in DataONE that are spatially explicit: searches that specify a set of geographic coordinates to restrict the search to a given area. This week, I’m looking at implicit spatial search. The DataONE search interface has a “Location” filter that allows a user to enter a geographic region. Users may also imply a spatial context using a search term, like “Alaska” or “Cape Town.”

To find the spatial search terms, I went manually through the list of popular search terms from several weeks ago, picking out geographic words as I saw them. I covered the top 1,000 search terms and came up with just over 30 geographic words. There may be more, and some are more complex than others. Because my list of search terms is broken up into single words, it is not always clear whether I’ve re-assembled the two-word phrases correctly, so I checked them against the original search text. Here are the top 10 geographic search terms and the frequency with which they came up:

santa barbara: 809
california: 370
new mexico: 188
alaska: 173
mississippi: 108
cape town: 102
united states: 99
north carolina: 91
chesapeake bay: 76
africa: 67

Santa Barbara is conspicuously at the top of the list with more than twice as many mentions as any other place name.
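
As a rough illustration of checking two-word place names against the original search text, here is a minimal Python sketch. It is not the code from the project repository; the place-name list, the function name, and the sample queries are all made up for the example, and it assumes each search is available as a plain text string.

```python
from collections import Counter

# Hypothetical list of place names; the real list was picked out by hand.
PLACE_NAMES = ["santa barbara", "new mexico", "cape town", "california", "alaska"]


def count_place_mentions(queries, place_names=PLACE_NAMES):
    """Count how many queries mention each place name as a whole phrase."""
    counts = Counter()
    for query in queries:
        # Lowercase, collapse repeated whitespace, and pad with spaces so that
        # a phrase only matches on word boundaries and in the right word order.
        # Punctuation is not stripped, so "alaska," would be missed here.
        normalized = " " + " ".join(query.lower().split()) + " "
        for place in place_names:
            if " " + place + " " in normalized:
                counts[place] += 1
    return counts


# Made-up queries, for illustration only.
sample_queries = [
    "Santa Barbara kelp forest",
    "santa  barbara channel",
    "soil moisture New Mexico",
    "barbara santa",  # both words present, but not as the phrase
    "alaska permafrost",
]

counts = count_place_mentions(sample_queries)
for place, n in counts.most_common():
    print(f"{place}: {n}")
```

Matching against the full query text, rather than against single words, is what lets a phrase like “santa barbara” be counted once per search instead of being rebuilt from two separate word counts.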

Next, I looked at temporal search activity. In the DataONE search, a user can specify a date range for the data, for the publication date, or for both. Of the 10,889 sessions, 156 included temporal restrictions on the data, and only 23 included temporal restrictions on the publication date. Eleven included restrictions on both. Many of the sessions included multiple date ranges at different steps in the search session, so there are 1,112 search events for data dates and 140 search events for publication dates.
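
The difference between counting sessions and counting search events can be sketched in a few lines of Python. This is only an illustration under assumptions: the (session id, filter kind) layout and the sample events are hypothetical, not the actual log format, and a session that uses both filters is counted in both session totals as well as in the “both” total.

```python
# Each event is (session_id, filter_kind). Both the layout and the values
# are hypothetical; None marks a search event with no temporal filter.
events = [
    (1, "data"), (1, "data"), (1, "publication"),
    (2, "data"),
    (3, "publication"),
    (4, None),
]


def summarize_temporal(events):
    """Tally sessions and individual search events for each temporal filter."""
    # Sets de-duplicate session ids, so a session with several date ranges
    # counts once as a session but several times as events.
    data_sessions = {sid for sid, kind in events if kind == "data"}
    pub_sessions = {sid for sid, kind in events if kind == "publication"}
    return {
        "data_sessions": len(data_sessions),
        "publication_sessions": len(pub_sessions),
        "both_sessions": len(data_sessions & pub_sessions),
        "data_events": sum(1 for _, kind in events if kind == "data"),
        "publication_events": sum(1 for _, kind in events if kind == "publication"),
    }


summary = summarize_temporal(events)
print(summary)
```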

A popular way of visualizing time spans is the Gantt chart. It is possible to produce a full Gantt chart showing the spans of every temporal search event, but the chart becomes cumbersome when it has many rows. I’ve included a sample that shows an excerpt of the time spans for searches on data dates:

The chart is organized by session, so you can see, for example, that the first time span (session 356) runs from about 2010 to 2017 and the second time span (session 365) runs from 2000 to 2017. The next one in the chart, session 879, goes from roughly 1835 to the present day. After that, the next four time spans are all from session 1161, where we can see that the time span changed significantly over the course of the session. If you’d like to take a look at the whole chart, check out the hpad, where I’ve linked a couple of more elaborate images.

Histograms are a somewhat more comfortable way to visualize these data:

The above plot shows data searches in twenty-year bins. So, there were about 30 searches for data from 1800–1820 and about 100 for data from 1880–1900. By far, the most common temporal search was for data covering this century.
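
For readers curious how time spans might be sorted into twenty-year bins, here is a small Python sketch. It is not the code behind the plot. It assumes that each search is a (start year, end year) pair and that a span is counted once in every bin it overlaps, which may differ from how the histogram was actually built. The sample spans are illustrative values loosely based on the chart excerpt, with 2017 standing in for the present day.

```python
from collections import Counter

BIN_WIDTH = 20  # years, matching the twenty-year bins in the plot


def bin_start(year, width=BIN_WIDTH):
    """Return the first year of the bin containing `year` (1835 -> 1820)."""
    return (year // width) * width


def bin_spans(spans, width=BIN_WIDTH):
    """Count each (start_year, end_year) span once in every bin it overlaps.

    Bins are half-open: the 1800 bin covers 1800 through 1819.
    """
    counts = Counter()
    for start, end in spans:
        if end < start:
            # Tolerate reversed ranges instead of silently dropping them.
            start, end = end, start
        first, last = bin_start(start, width), bin_start(end, width)
        for b in range(first, last + width, width):
            counts[b] += 1
    return counts


# Illustrative spans only; not taken from the real search logs.
spans = [(2010, 2017), (2000, 2017), (1835, 2017)]

counts = bin_spans(spans)
for b in sorted(counts):
    print(f"{b}-{b + BIN_WIDTH}: {counts[b]}")
```

Counting a span in every bin it touches means one long search (such as 1835 to the present) adds to many bins, so the bin totals add up to more than the number of searches.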

This plot shows publication date searches. Note that it uses a different scale from the previous histogram because the numbers are much smaller for this type of search. (The data date search is the default option for temporal search, which may influence the number of people who use each option.) Again, we see that the searches are most commonly for recently published data.

That’s all for this week; take a look at the GitHub repository for the project if you’d like to see the code I used to create these graphics. Next week, metadata quality: does better metadata lead to more downloads?
