Exploration of Search Logs, Metadata Quality and Data Discovery: Week 9

The last goal of my DataONE summer internship was to try to determine whether or not metadata quality is related to data downloads. In other words, does higher quality metadata increase the likelihood that a dataset will be downloaded?

The basic approach to answering the question is to gather two groups of metadata (some for data that have been downloaded and some for data that have not) and compare the quality scores between the two groups. I described a method for building this collection last time, but that method didn’t work out in practice because of time constraints. Instead, I gathered one large group of metadata and then split it in two according to whether there was an associated download in the log. This worked out very well, and it let me draw a pilot sample of metadata so that I could decide how many records I ultimately needed to collect.
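As a rough illustration of the split-and-compare idea (not the code from the project repository), here is a minimal Python sketch. It assumes each record is a hypothetical `(dataset_id, quality_score)` pair and that the download log has already been reduced to a set of dataset ids. The permutation test is only a stand-in for illustration; the tests actually used in the analysis may well be different.

```python
import random
from statistics import mean


def split_by_download(records, downloaded_ids):
    """Split (dataset_id, quality_score) pairs into two lists of scores,
    depending on whether the id appears in the download log."""
    downloaded, not_downloaded = [], []
    for dataset_id, score in records:
        if dataset_id in downloaded_ids:
            downloaded.append(score)
        else:
            not_downloaded.append(score)
    return downloaded, not_downloaded


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    Returns an estimated p-value: the share of random relabelings whose
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    records = [("d1", 0.62), ("d2", 0.58), ("d3", 0.61), ("d4", 0.60)]
    log_ids = {"d1", "d3"}  # hypothetical ids seen in the download log
    downloaded, not_downloaded = split_by_download(records, log_ids)
    print(permutation_test(downloaded, not_downloaded, n_permutations=2000))
```

A large p-value here would mean the two groups' quality scores are not distinguishable, which is the kind of outcome that gives no support to the hypothesis.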

The pilot sample showed me that I needed to sample about 10,000 records, which was also time-consuming: it took about 3 days for the script I had written to gather all the necessary records and metadata quality scores. In the end, I found no support for the hypothesis that higher quality metadata increases the likelihood that a dataset will be downloaded. This doesn’t mean that metadata quality is unimportant, though. What it actually points to is that metadata quality is relatively constant across the DataONE collection. It would be interesting to revisit this question after a big push to improve metadata quality across the member nodes. Would improved quality result in more downloads? Perhaps a future intern will be responsible for figuring that one out.
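For readers curious how a pilot sample can translate into a target sample size, here is a minimal sketch of a standard power calculation for comparing two group means, using the normal approximation. The function name, the default significance level and power, and the example effect size are all assumptions made for illustration; they are not the values or the method from the actual analysis.

```python
from math import ceil
from statistics import NormalDist


def records_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate records needed in EACH group to detect a standardized
    difference in means (Cohen's d) with a two-sided test.

    Uses the normal approximation: n = 2 * ((z_alpha/2 + z_beta) / d)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect size, as might be estimated from a pilot sample.
    print(records_per_group(0.1))
```

The smaller the difference seen in the pilot, the more records are needed, which is how a small pilot effect can push the target into the thousands.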

This is only a brief write-up of a long series of statistical tests, and if you’d like to see the gritty details, take a look at the hpad. There, I’ve gone through most of the analyses step by step to show what the data look like, which tests I chose to run, and the results and interpretations of those tests. Also, check out the GitHub repository to see the data and code used to perform this last part of the project.

It’s been a great 9 weeks working on this project for DataONE, and I’m excited to see everyone at the all-hands meeting later this month!
