As a graduate student, Christine had spent the last several weeks working on a statistical analysis exploring the links between air pollution and heart disease. The dataset she was analyzing was quite large (the biggest Christine had ever worked with) and had been pulled from a much larger one, which combined air quality data with an immense collection of hospital records, census data, and more. With several million rows and seemingly endless columns, the smaller subset Christine was working with was already enough to make anyone’s head spin. Because it was not feasible to look through the values in each of the rows and columns one by one, Christine had to rely on summary statistics – things like mean, median, maximum, and minimum values – to help her identify and manage potential errors in the data.
The full dataset was stored on a research server at Christine’s university, where (due to the confidential nature of some of the data) it could only be accessed by authorized personnel. While Christine was not authorized to access this comprehensive dataset directly, she was authorized to work with a subset that had been culled from it and placed into a digital folder shared between Christine and her thesis advisor. Though something of a computer whiz herself, Christine was careful when setting up her workflow. Because she knew that it was all too easy to make mistakes when running analyses, she wrote her scripts so that they would always use a temporary copy of the data rather than the original file. This step minimized the risk of inadvertently making changes to the file her advisor had shared with her. It also kept her from accumulating file versions, because she saved only the commands needed to generate them. Each time Christine found an error in the data that needed correcting or decided to organize the data in a particular way for her analysis, she just added new commands to the script she used to generate the temporary dataset. With a well-documented workflow and a solid set of data protection measures in place, Christine was ready to start seeing some results.
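A workflow like the one described above might look something like the following minimal sketch. The story does not say which language or tools Christine used, so everything here is an assumption for illustration: the function names `make_working_copy` and `drop_incomplete_rows`, the CSV format, and the cleaning rule are all hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def make_working_copy(original: Path) -> Path:
    """Copy the shared file into a fresh temporary directory and return the copy's path."""
    work_dir = Path(tempfile.mkdtemp(prefix="working_copy_"))
    working_copy = work_dir / original.name
    shutil.copy2(original, working_copy)
    return working_copy


def drop_incomplete_rows(csv_path: Path) -> None:
    """Hypothetical correction step: remove rows that have missing or empty fields."""
    with csv_path.open(newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    kept = [
        row for row in body
        if len(row) == len(header) and all(field.strip() for field in row)
    ]
    with csv_path.open("w", newline="") as f:
        csv.writer(f).writerows([header, *kept])


if __name__ == "__main__":
    # Stand-in for the shared subset (made-up values, for demonstration only).
    shared_dir = Path(tempfile.mkdtemp(prefix="shared_"))
    original = shared_dir / "subset.csv"
    original.write_text("pm25,admissions\n12.1,3\n,4\n")

    # Every correction is applied to the copy; the original is only ever read.
    working = make_working_copy(original)
    drop_incomplete_rows(working)
    print("original untouched:", original.read_text().count("\n"), "lines")
    print("working copy:", working.read_text().count("\n"), "lines")
```

Each new correction becomes another function called on the working copy, so the script itself serves as the record of how the analysis dataset was produced.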
Subject to the perils of her unruly graduate student schedule, Christine found herself spending many hours in front of the computer, her eyelids propped open by caffeine and the glare of the monitor before her. To check her code with different combinations of variables and statistical analyses, she was selecting and running just one section of her code at a time. Things were getting exciting – she was working through the bugs in her code and finding some interesting patterns!
A mistake, she realized later, was probably inevitable. With one ill stroke of the keyboard, she managed to delete a section of code without noticing it. Normally this would not have been a problem, as she would have soon noticed the missing chunk of code and been able to retrace her steps to recover it from an earlier version. But the block of code she deleted contained one essential element that had an impact on the code that followed it – the comment character.
The line that the comment character had kept from being executed was a reference to the location of the original file. Christine had included it as an annotation in her code so that she would never forget where the data were located. While this type of documentation is good and generally encouraged, in this case it meant that one errant deletion could lead to a mini-disaster. When Christine clicked ‘Run’, the reference to the original file was read and interpreted as the location for a new, blank file containing nothing but white space. Just like that, in the space of a mouse click, the subset Christine was so depending on had ceased to exist.
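One possible guard against this kind of accident, offered here only as a sketch and not as something Christine's lab is known to have used, is to have scripts create output files in a mode that refuses to overwrite anything that already exists. In Python, opening a file with mode `"x"` does this; the helper name `write_new_file` is hypothetical.

```python
import tempfile
from pathlib import Path


def write_new_file(path: Path, text: str) -> None:
    """Create `path` and write `text` to it.

    Mode "x" (exclusive creation) raises FileExistsError if the file
    already exists, so a stray path can never silently replace real data.
    """
    with path.open("x") as f:
        f.write(text)


if __name__ == "__main__":
    folder = Path(tempfile.mkdtemp(prefix="shared_"))
    subset = folder / "subset.csv"
    write_new_file(subset, "pm25,admissions\n12.1,3\n")  # made-up values

    try:
        # Simulates the accident: a path meant as a note is used as an output location.
        write_new_file(subset, "")
    except FileExistsError:
        print("refused to overwrite", subset.name)
```

With a guard like this, the errant line would have stopped the script with an error instead of blanking the file.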
If this had been the only copy of the data, she would have almost certainly lost months or years of work. But fortunately the data were still contained within the larger dataset from which the subset was created. This larger dataset was preserved on the server that Christine could not access, and therefore could not possibly alter beyond repair!
Christine knew she could not be the first person to commit such an error. But as she sat down to draft an email to her advisor, she could not help but shake her head in disbelief – what kind of researcher couldn’t keep a handle on her own data?! To tell the truth, Christine had been quite proud of her workflow documentation and data protection practices. How could she have not recognized this risk? In hindsight, the solution was annoyingly clear: if she had saved an additional backup of her data in a separate folder, she could have instantly restored the subset and her pride. The entire mess could have been avoided with a minute of her time and a little more foresight. But Christine’s advisor told her not to feel embarrassed: he could access the main dataset and re-create the lost subset. He estimated that by reusing the query he had saved when creating the original subset, the re-subsetting would take an hour at most.
Of course, the process took a little time because of the massive size of the original file – but it worked. If her advisor had been away at a conference (as he often was), Christine might have been forced to give up several days of progress as she waited for his return. Luckily, little time was lost and the subset was back in the shared folder as though it had never left.
Christine’s lab already had good practices and norms for data protection. After her experience, however, they made some changes to how they manage shared data. Now each researcher keeps consistent backups of the datasets they’re sharing and working on so that little errors like the one Christine made don’t destroy data that everyone is depending on.
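The story does not describe how the lab makes its backups, so the following is only a minimal sketch of one way to do it: copy the file into a separate folder under a timestamped name, then compare checksums to confirm the copy is intact. The function names `sha256_of` and `back_up` and the naming scheme are assumptions made for illustration.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to handle large datasets."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` under a timestamped name and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Calling `back_up(Path("subset.csv"), Path("backups"))` before each working session would leave a dated copy to restore from. Note that two backups made within the same second would share a name in this sketch, so a real script might add a counter or finer-grained timestamp.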
Christine began her analysis with an intentionally structured workflow most researchers would envy. But careful though she was in her work, the smallest oversight found a way to test the structure’s strength. After averting the crisis of human error, Christine realized that even the most vigilantly maintained workflow could benefit from the teachings of trial and error.