My goals for week 3 were to collect download logs from a SOLR index, parse those logs into tokens, populate a database with the log information, and relate the download events to the search events by connecting them in time and by remote host address. I was able to accomplish these goals after a minor detour to troubleshoot a problem with the SOLR query results. Check out the hpad for a lot more detail on what was going wrong and how we went about getting the data out of SOLR.
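The time-and-host linking described above could be sketched as a SQL join: pair each download with searches from the same remote host that occurred within a preceding time window. This is a minimal illustration, not the project's actual code; the table layout, column names, and 30-minute window are all assumptions.

```python
import sqlite3


def link_downloads_to_searches(conn, window_minutes=30):
    """Link each download event to search events from the same remote
    host that occurred within the preceding time window.

    Assumes `searches` and `downloads` tables with `event_time`
    (ISO-format text) and `remote_host` columns.
    """
    query = """
        SELECT d.rowid AS download_id, s.rowid AS search_id
        FROM downloads d
        JOIN searches s
          ON s.remote_host = d.remote_host
         AND s.event_time <= d.event_time
         AND s.event_time >= datetime(d.event_time, ?)
    """
    # SQLite's datetime() modifier subtracts the window from the
    # download time, e.g. '-30 minutes'
    return conn.execute(query, (f"-{window_minutes} minutes",)).fetchall()
```

A wider or narrower window changes how aggressively downloads are attributed to searches, so the window size is a parameter worth exploring during the analysis.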
In the end, we wound up exporting the download logs to CSV files, which I imported into the database with a Python function. There are roughly 35 million download events, which makes for slow going when importing into the database. At the moment, since we have only about one year of search logs, we can reduce the size of the download data by restricting it to the time interval covered by the search logs. We can further reduce the data by looking more closely at what each download event represents: since the DataONE search interface allows downloading either an entire dataset or individual files within it, not all download events are necessarily equivalent.
That will be part of next week's exploratory data analysis. I'll try to get some simple visualizations up and running so we can get a more intuitive feel for the shape of the data we've assembled so far. In the meantime, the new program code is up on the GitHub repository.