This past week was spent solely on the statistical analysis of the user data from 6 Member Nodes (MNs). Because the Member Nodes joined DataONE at staggered times, we had to restrict our sample from all thirty MNs to just the 6 that fit our sampling criteria: an MN needed download data for at least 1 year before and 1 year after its DataONE joining date. This meant checking not only that each MN fit the timeframe criteria, but also that we actually had data for it. Some I was able to confirm from the website, but for each MN I also had to check the user data file I received and find the first date of recorded data (i.e., the first non-zero numbers).
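As a rough illustration of the eligibility check described above, here is a minimal Python sketch (the actual work was done in R and partly by hand). The function names, the dictionary layout of dates mapped to download counts, and the example numbers are all assumptions made for illustration, not the real data files. It also assumes that the last non-zero entry marks the end of the recorded data.

```python
from datetime import date


def first_nonzero_date(series):
    """Return the earliest date with a non-zero count, or None if there is none."""
    dates = [d for d, count in series.items() if count != 0]
    return min(dates) if dates else None


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def is_eligible(series, join_date, years=1):
    """Check for recorded data at least `years` before and after the join date."""
    first = first_nonzero_date(series)
    if first is None:
        return False
    # Assumption: the last non-zero entry marks the end of the recorded data.
    last = max(d for d, count in series.items() if count != 0)
    return first <= add_years(join_date, -years) and last >= add_years(join_date, years)


# Made-up counts for a hypothetical node, for illustration only.
example = {
    date(2011, 1, 1): 0,
    date(2011, 6, 1): 12,
    date(2014, 6, 1): 30,
}
print(is_eligible(example, date(2012, 7, 1)))
print(is_eligible(example, date(2012, 1, 1)))
```

The first call passes because the recorded data spans more than a year on each side of the join date. The second fails because the first non-zero entry falls less than a year before the join date.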
Most of my initial time in R was spent prepping the data for analysis; I'm hoping that over the next few weeks I'll be able to speed up that process to allow more time for analysis. After the preparation, I ran individual t-tests for each MN comparing the data across the "before" and "after" timeframes, a rough t-test of the averages across the timeframes, and a t-test of the linear regression coefficients. I also ran a cumulative t-test of all downloads/uploads regardless of the MN. Results-wise, very few tests were statistically significant: only a few of the individual t-tests and the cumulative t-test. This is due in part to the fact that I didn't run paired t-tests (which I will be doing this week), and in part to the fact that the differences between the MNs may have hidden any before/after differences. I will account for both in a repeated-measures analysis of variance (ANOVA) model that I will run this week.
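To see why pairing matters here, the following is a minimal Python sketch, not the analysis itself. It compares an unpaired (Welch) t statistic with a paired one on invented before/after averages for six hypothetical nodes. The numbers are made up so that the nodes differ greatly in size while each one increases by a similar amount. The function names are assumptions, and only t statistics are computed, not p-values.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Unpaired (Welch) t statistic for mean(b) - mean(a)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(b) - mean(a)) / sqrt(var_a / len(a) + var_b / len(b))


def paired_t(a, b):
    """Paired t statistic computed on the per-node differences b[i] - a[i]."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented average downloads per node (same node order in both lists).
before = [10, 200, 55, 900, 30, 410]
after = [19, 212, 63, 911, 41, 419]

print("unpaired t:", round(welch_t(before, after), 3))
print("paired t:  ", round(paired_t(before, after), 3))
```

With these made-up values the unpaired statistic is close to zero, because the spread between nodes swamps the shift. The paired statistic is large because each node is compared only with itself. A repeated-measures model addresses the same issue by treating the node as a within-subject factor.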
This coming week will include the analyses mentioned above, as well as a new analysis with our control group of repositories that haven't yet joined DataONE, to test whether factors other than DataONE may be influencing upload and download rates.