Week Four: Further Analyses and Results

This past week was spent solely on the statistical analysis of the user data from 6 Member Nodes (MNs). Because the Member Nodes have staggered start dates, we had to restrict our sample from all thirty MNs to the 6 that fit our sampling criteria: an MN needed download data for at least 1 year before and 1 year after its DataONE joining date. This meant checking not only that the MN fit the timeframe criteria, but also that we actually had data for it. Some I was able to confirm from the website, but for each MN I also had to check the user data file I received and find the first date of recorded data (i.e. the first non-zero numbers).
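As a rough illustration of that eligibility check, here is a minimal Python sketch (my actual work was done in R). The function names, the (date, count) record layout, and the sample numbers are all hypothetical and only meant to show the logic: find the first and last dates with non-zero counts and confirm they cover at least 1 year on either side of the joining date.

```python
from datetime import date

def first_nonzero_date(records):
    """Return the earliest date with a non-zero count, or None if there is none."""
    dates = [d for d, count in records if count > 0]
    return min(dates) if dates else None

def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def is_eligible(records, join_date, years=1):
    """True if non-zero data spans at least `years` before and after join_date."""
    nonzero = [d for d, count in records if count > 0]
    if not nonzero:
        return False
    has_before = min(nonzero) <= add_years(join_date, -years)
    has_after = max(nonzero) >= add_years(join_date, years)
    return has_before and has_after

# Hypothetical records for one MN: (date, download count)
records = [
    (date(2012, 1, 1), 0),
    (date(2012, 6, 1), 40),
    (date(2014, 1, 1), 55),
    (date(2015, 8, 1), 60),
]
print(first_nonzero_date(records))
print(is_eligible(records, join_date=date(2014, 7, 1)))
```

Note that this only looks at the first and last non-zero dates; it does not check for gaps in between, which would need a separate pass.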

Most of my initial time in R was spent prepping the data for analysis; I’m hoping that in the next few weeks I’ll be able to speed up that process to allow more time for the analysis itself. After the preparation, I ran individual t-tests for each MN comparing the data across the “before” and “after” timeframes, a rough t-test of the averages across the timeframes, and a t-test of the linear regression coefficients. I also ran a cumulative t-test of all downloads/uploads regardless of the MN. Results-wise, very few tests were statistically significant: only a few of the individual t-tests and the cumulative t-test. This is due in part to the fact that I didn’t run pairwise t-tests (which I will be doing this week), and in part to the possibility that the differences between the MNs hid any before/after differences. I will account for both in a repeated-measures analysis of variance (ANOVA) model that I will run this week.
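To show why large differences between MNs can hide a before/after effect, here is a small Python sketch (again, not my R code). It assumes that "pairwise" here means a paired test, where each MN's "before" value is matched with its own "after" value. The numbers are made up purely for illustration and are not DataONE results; only the t statistics are computed, not p-values.

```python
from math import sqrt
from statistics import mean, stdev, variance

def welch_t(before, after):
    """Unpaired (Welch) t statistic: treats the two samples as independent."""
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se

def paired_t(before, after):
    """Paired t statistic: tests the per-node differences against zero."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Made-up average monthly downloads for 6 hypothetical nodes.
# The nodes differ hugely in size, but each one rises a little.
before = [10, 200, 55, 1200, 30, 400]
after = [14, 215, 60, 1230, 36, 418]

print("unpaired t:", round(welch_t(before, after), 3))
print("paired t:  ", round(paired_t(before, after), 3))
```

With data like this, the unpaired statistic is swamped by the spread between nodes, while the paired statistic only looks at each node's own change, so it comes out much larger.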

This coming week will include the analysis mentioned above, as well as a new analysis using our control group of repositories that haven’t yet joined DataONE, to test whether factors other than DataONE may be influencing upload and download rates.
