Week 7: The School of Hard Knocks

Sometimes you just have to learn things the hard way.

On Monday at 2:37pm I started to run my spiffy Make-a-Network code on the big table of all the datasets currently stored in the DataONE archives. In general, this code takes as input a table of unique dataset-person pairs, where the “person” could be anybody – a creator, a contributor, or a user who downloads a dataset. The Big Table of all the DataONE datasets contains 1,295,315 rows of unique dataset-person combinations. …and already in the ether I hear the gnashing of teeth, as those familiar with R and big data worry tremendously about how this will turn out. I was young. I was naive.

My spiffy Make-a-Network code turns a table of unique dataset-person pairs into an edge list by identifying all the unique persons in the table, then connecting pairwise all the datasets associated with that unique person. This means that for any given person, if n datasets are associated with that person, then my code creates n choose 2 dataset pairs, also known as edges.
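For the curious, here is a minimal base-R sketch of that idea. It is not my actual Make-a-Network code: the toy table, the column names (dataset, person), and the function name make_edge_list are all made up for illustration.

```r
# Hypothetical toy table of unique dataset-person pairs.
pairs <- data.frame(
  dataset = c("d1", "d2", "d3", "d1", "d4", "d5"),
  person  = c("ann", "ann", "ann", "bob", "bob", "cat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: connect pairwise all datasets that share a person.
make_edge_list <- function(pairs) {
  # Group the dataset ids by person.
  by_person <- split(pairs$dataset, pairs$person)

  # For each person, build every unordered pair of their datasets.
  edges <- lapply(names(by_person), function(p) {
    ds <- by_person[[p]]
    if (length(ds) < 2) return(NULL)   # a single dataset makes no pairs
    combos <- combn(ds, 2)             # 2 x choose(n, 2) matrix of pairs
    data.frame(from = combos[1, ], to = combos[2, ],
               person = p, stringsAsFactors = FALSE)
  })

  # Drop the people who contributed no pairs, then stack the rest.
  edges <- Filter(Negate(is.null), edges)
  do.call(rbind, edges)
}

edges <- make_edge_list(pairs)
nrow(edges)   # 3 pairs for ann + 1 for bob + 0 for cat = 4
```

Note that this sketch keeps one row per person per pair, so two datasets that share several people show up several times. Whether to collapse those duplicates is a design choice the sketch leaves open.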

n choose k is the number of unordered k-tuples that can be made without replacement from a given set of n objects. 52 choose 5 is the total number of 5-card poker hands that can be made from a standard deck of cards. If k = 2, then n choose k is the number of pairs that can be made from a set of n objects, which works out to n(n − 1)/2. And for those of you not into combinatorics, the “choose” function in math makes numbers grow really fast. I’ll spare you the mathematical details (you can find them in the Wikipedia entry), but here are some examples:

5 choose 2 is 10
10 choose 2 is 45
36 choose 2 is 630 (36 is the mean number of datasets associated with a single person in the Big Table.)
50 choose 2 is 1225
100 choose 2 is 4950
39,268 choose 2 is 770,968,278 (39,268 is the max number of datasets associated with a single person in the Big Table.)
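If you want to check these yourself, base R's choose() does the work. A quick sketch (the variable names are just for illustration):

```r
# The values of n from the examples above.
n <- c(5, 10, 36, 50, 100, 39268)

# choose(n, 2) counts unordered pairs; n * (n - 1) / 2 is the same
# quantity written out by hand.
pair_counts <- choose(n, 2)
by_hand     <- n * (n - 1) / 2

data.frame(n = n, pairs = pair_counts, by_hand = by_hand)
```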

The point? An edge list from a table with about 1.3 million rows is going to be much, much larger than the original table. How much larger depends on the exact distribution of the number of datasets associated with each person. We can calculate exactly how large the final edge list for the Big Table is going to be using a frequency table and the choose function in R. We create a frequency table that counts how many times each person appears; this is the number of datasets associated with that person. We then run the choose function down that vector of frequencies, sum up the results, and that’s the size of the final edge list.
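In R this calculation is only a couple of lines. Here is a sketch on a made-up toy table; the column names dataset and person are assumptions for illustration, not the real Big Table's.

```r
# Hypothetical toy table of unique dataset-person pairs.
pairs <- data.frame(
  dataset = c("d1", "d2", "d3", "d1", "d4", "d5"),
  person  = c("ann", "ann", "ann", "bob", "bob", "cat"),
  stringsAsFactors = FALSE
)

# Frequency table: how many datasets each person is associated with.
freq <- table(pairs$person)

# choose(n, 2) pairs for each person, summed over all people.
# choose() returns doubles, so the total can go past R's integer limit.
n_edges <- sum(choose(as.vector(freq), 2))
n_edges   # 3 (ann) + 1 (bob) + 0 (cat) = 4
```

The nice part is that this never builds the edge list itself, so it is cheap to run before committing to the expensive step.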

So I calculated the size of the final edge list for the Big Table. The answer: 4,079,564,237. That is, the number of edges in our final network will be four billion, seventy-nine million, five hundred thousand and change. …and now I hear maniacal laughter in the ether because those of you familiar with R and big data know how this is going to turn out.

Badly.

I didn’t know this, but apparently it’s common wisdom that R starts to struggle with data frames much beyond a million rows or so. (It’s a rule of thumb, not a hard limit: R keeps its objects in memory, so the real ceiling is how much RAM you have.) So my original table of about 1.3 million rows was already pushing poor little R’s comfort zone. Making an edge list three orders of magnitude larger? Not so much.
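A back-of-the-envelope sketch of why an edge list that size is trouble. The storage layout here is an assumption: two columns of dataset ids stored as plain R integers (4 bytes each) or doubles (8 bytes each), with nothing else in the object.

```r
# Size of the final edge list, as calculated above.
n_edges <- 4079564237

# The row count alone is bigger than the largest value R's integer
# type can hold (2147483647).
exceeds_int <- n_edges > .Machine$integer.max

# Assumed layout: two id columns and nothing else.
gb_as_integers <- n_edges * 2 * 4 / 1e9   # 4-byte integers
gb_as_doubles  <- n_edges * 2 * 8 / 1e9   # 8-byte doubles

round(c(integers = gb_as_integers, doubles = gb_as_doubles), 1)
```

Under those assumptions the finished object alone needs tens of gigabytes, before counting any temporary copies made while building it.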

On Tuesday, 24 hours after I started the process remotely on one of DataONE’s big number-crunching machines, I thought that the code was just taking a long time to run. Wednesday – same thing. On Thursday the internet at my house went out and stayed out until Friday late morning. I was terrified that the internet hiccup had halted the process, and when I got back in on Friday afternoon my fears seemed to be confirmed – the process was no longer running and no gigantic edge-list object was happily swimming in my R environment. Bryce explained the situation to me gently and with compassion. The internet wasn’t the problem. I had simply broken R’s little brain.

So now I’m learning all about how to work with big data in R. This is a whole new set of skills, involving such arcane concepts as resilient distributed datasets, parallelized collections, and shuffle operations. Fortunately for me, there’s a package for that: sparklyr. It interfaces with dplyr. And so it seems that I’m finally going to have to break down and start using the pipe operator. Five years of programming in R, stubbornly resisting the siren song of the %>%. But now, big data, and I’m going to have to put away my childish attachment to base R functions and enter into developer adulthood.

Dear Hadley, fellow dissatisfied scion of the ISU Statistics Department, I hope you’re happy. Because as much as I cringe at your grandiloquence, I love my networks more.
