# Week 7: The School of Hard Knocks

Sometimes you just have to learn things the hard way.

On Monday at 2:37pm I started to run my spiffy Make-a-Network code on the big table of all the datasets currently stored in the DataONE archives. In general, this code takes as input a table of unique dataset-person pairs, where the “person” could be anybody – a creator, a contributor, or a user who downloads a dataset. The Big Table of all the DataONE datasets contains 1,295,315 rows of unique dataset-person combinations. …and already in the ether I hear the gnashing of teeth, as those familiar with R and big data worry tremendously about how this will turn out. I was young. I was naive.

My spiffy Make-a-Network code turns a table of unique dataset-person pairs into an edge list by identifying all the unique persons in the table, then connecting pairwise all the datasets associated with that unique person. This means that for any given person, if n datasets are associated with that person, then my code creates n choose 2 dataset pairs, also known as edges.
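The author's code is in R, but the idea can be sketched in Python. Everything here — the toy table, the column values, the variable names — is hypothetical, just to show the group-then-pair logic:

```python
from itertools import combinations

# Hypothetical toy table of unique (dataset, person) pairs.
pairs = [
    ("ds1", "alice"), ("ds2", "alice"), ("ds3", "alice"),
    ("ds1", "bob"), ("ds4", "bob"),
]

# Group datasets by person.
datasets_by_person = {}
for dataset, person in pairs:
    datasets_by_person.setdefault(person, set()).add(dataset)

# For each person with n datasets, emit all n-choose-2 dataset pairs (edges).
edges = []
for person, datasets in datasets_by_person.items():
    edges.extend(combinations(sorted(datasets), 2))

print(edges)
```

Here alice contributes 3 choose 2 = 3 edges and bob contributes 2 choose 2 = 1 edge, for 4 edges total — already more pairs than people.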

n choose k is the number of unordered selections of k objects that can be made without replacement from a set of n objects. For example, 52 choose 5 is the total number of 5-card poker hands that can be dealt from a standard deck. If k = 2, then n choose k is the number of distinct pairs that can be made from a set of n objects. And for those of you not into combinatorics: the "choose" function makes numbers grow really fast. I'll spare you the mathematical details (you can find them in the Wikipedia entry on binomial coefficients), but here are some examples:

• 5 choose 2 is 10
• 10 choose 2 is 45
• 36 choose 2 is 630 (36 is the mean number of datasets associated with a single person in the Big Table.)
• 50 choose 2 is 1225
• 100 choose 2 is 4950
• 39,268 choose 2 is 770,968,278 (39,268 is the max number of datasets associated with a single person in the Big Table.)
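For the skeptical, the examples above are easy to check in any language with a binomial coefficient function; a quick Python sketch using `math.comb` (n choose 2 is just n(n−1)/2):

```python
from math import comb

# Verify the choose-2 examples from the list above.
# comb(n, 2) is n choose 2, which equals n * (n - 1) / 2.
examples = [(5, 10), (10, 45), (36, 630), (50, 1225),
            (100, 4950), (39_268, 770_968_278)]

for n, expected in examples:
    assert comb(n, 2) == expected == n * (n - 1) // 2

print("all choose-2 examples check out")
```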

The point? An edge list built from a table with 1.3 million rows is going to be much, much larger than the original 1.3 million rows. How much larger depends on the exact distribution of the number of datasets associated with each person. We can calculate exactly how large the final edge list for the Big Table will be using a frequency table and the choose function in R: build a frequency table counting how many times each person appears (that count is the number of datasets associated with that person), apply choose(n, 2) to each of those frequencies, and sum the results. That sum is the size of the final edge list.
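The frequency-table calculation translates directly into a few lines of code. This is a Python sketch of the R recipe just described, with a made-up person column standing in for the Big Table:

```python
from collections import Counter
from math import comb

# Hypothetical "person" column from a dataset-person table.
persons = ["alice"] * 3 + ["bob"] * 2 + ["carol"]

# Frequency table: how many datasets each person is associated with.
freq = Counter(persons)

# Predicted edge-list size: sum of (n_i choose 2) over all persons.
# Here: choose(3, 2) + choose(2, 2) + choose(1, 2) = 3 + 1 + 0 = 4.
total_edges = sum(comb(n, 2) for n in freq.values())
print(total_edges)
```

Running the same sum over the real Big Table's frequencies is what produces the number in the next paragraph — no need to build the edge list first to know how big it will be.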

So I calculated the size of the final edge list for the Big Table. The answer: 4,079,564,237. That is, the number of edges in our final network will be four billion, seventy-nine million, five hundred sixty-four thousand and change. …and now I hear maniacal laughter in the ether, because those of you familiar with R and big data know how this is going to turn out.