Week Three: Statistical Analysis of User Data

Week Three consisted of finishing up a working draft of the Literature Review and putting it on hold while we began the statistical analysis.

First, I prepared for the analysis in Excel, creating a timeline for each Member Node with the important dates we’ll need for our sampling windows. Initially I started with 30-day samples, but a quick analysis showed those windows were much too small. I’ve extended the timeframe to a year on either side of each node’s DataONE join date, with a 2-week gap around the join date itself. The only downside is that this narrows our Member Node base to fewer than ten, because many of the nodes don’t have enough data.
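To make the window arithmetic concrete, here is a minimal sketch in Python (the actual timeline lives in Excel). It assumes the 2-week gap is centered on the join date, so one week is excluded on each side, and that a "year" means 365 days. The function names, the half-open date ranges, and the example join date are all illustrative assumptions, not real Member Node data.

```python
from datetime import date, timedelta


def sampling_windows(join_date, window_days=365, gap_days=14):
    """Return the (before, after) windows around a join date.

    Each window is a half-open (start, end) pair of dates: start is
    included, end is excluded. Assumption: the gap is centered on the
    join date, so gap_days // 2 days are skipped on each side.
    """
    half_gap = timedelta(days=gap_days // 2)
    window = timedelta(days=window_days)
    before_end = join_date - half_gap
    after_start = join_date + half_gap
    before = (before_end - window, before_end)
    after = (after_start, after_start + window)
    return before, after


def has_enough_data(first_record, last_record, join_date):
    """True if a node's records span both sampling windows in full."""
    before, after = sampling_windows(join_date)
    last_needed_day = after[1] - timedelta(days=1)
    return first_record <= before[0] and last_record >= last_needed_day


# Hypothetical join date, for illustration only.
before, after = sampling_windows(date(2012, 7, 1))
print("before:", before[0], "to", before[1])
print("after: ", after[0], "to", after[1])
```

A check like has_enough_data is what shrinks the node base: any node whose records start less than roughly a year before joining, or end less than a year after, drops out of the sample.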

I’ve cleaned up all of the datasets I’ve received, loaded them into R, and begun running initial t-tests on the “before” and “after” years. I ran the t-test on the averages for each Member Node. Although the “after” year had a greater average number of downloads and uploads, the averages varied so widely from node to node that the difference was not statistically significant with this type of t-test. I’m therefore looking into other statistical tests that take into account the variance among Member Nodes based on their size (perhaps regression analysis will do the trick). Next week, I’ll load the other Member Node data we receive and run t-tests on both the upload and download information. I’ll also begin the regression analysis, which will include aggregating the data by day in order to get cumulative amounts.
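The daily aggregation step could look something like the sketch below. It is written in Python rather than R, and it assumes the raw data is one date per upload or download event; the function name and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate


def cumulative_daily_counts(event_dates, start, end):
    """Aggregate events by day and return (day, running_total) pairs.

    Covers every day in the half-open range [start, end), including
    days with no events, so the cumulative series has no gaps.
    """
    per_day = Counter(d for d in event_dates if start <= d < end)
    days = [start + timedelta(days=i) for i in range((end - start).days)]
    totals = accumulate(per_day[d] for d in days)
    return list(zip(days, totals))


# Hypothetical download events, for illustration only.
events = [date(2012, 7, 8), date(2012, 7, 8), date(2012, 7, 10)]
for day, total in cumulative_daily_counts(events, date(2012, 7, 8), date(2012, 7, 12)):
    print(day, total)
```

Filling in the empty days matters for the regression: without them, quiet stretches would disappear from the series instead of showing up as flat segments.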

