The other day I spent close to two hours discussing data management with my lab mate James. It wasn't the first long conversation I've had with a fellow researcher about data management. By data management, I mean organizing all aspects of a project, from naming samples, to saving data, to running analyses, to writing up the results.
We, as researchers, know the importance of keeping our lab notes, data, analyses, and writing organized. Many of us have experienced the frustration of looking for a previous dataset or analysis only to find multiple versions of documents with ambiguous file names (e.g. 'data_v1.xlsx', 'data_final.xlsx', 'data_clean.xlsx', 'data_v2_edited.xlsx') stuffed into a directory (which often includes a subdirectory called 'old_stuff' or 'archive' with even more ambiguous files inside). It is only during times like these that data management is discussed and becomes a priority. But data management should be discussed, and should be a priority, from the outset of any project. Good data management allows for reproducible research, which we should all strive towards as researchers.
In this blog post I want to share some of my best data management practices, many of which I have adopted from my own experiences (and mistakes!) and from the advice of other researchers. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools by Vince Buffalo and Reproducible Research with R and RStudio by Christopher Gandrud are also great sources for best practices in data management.
At this point I also want to mention that organization is different for everyone! Don't get uptight if you are having trouble 'chameleon-ing' yourself to match someone else's data organization. Try several things out and then settle on what feels natural for you! This is your best guarantee of staying organized and producing reproducible research.
The rest of this blog post is organized into five sections where I discuss the following:
What is reproducible research?
Directories, Directories, Directories
The README.md File
Naming Samples & Files
Version Control
What is reproducible research?
Reproducible research means that anyone should be able to take our data, run the analysis, and get the exact results that we did. For this to happen our data must be structured logically and our analysis must be recorded and easy to follow.
Reproducible research matters more than ever as our datasets become larger and open source initiatives become standard. It is important to our success and credibility as researchers. It is also important for collaborators who depend on our data and for future lab members who may continue a project we started or design a new project around our data.
The key to reproducible research is good organization from the outset, as well as foresight about what we expect to do with our data once we have collected it. When designing a research project we might ask ourselves questions such as:
What data and what analyses will help us answer our question?
What type of data will we collect? And how is it best structured?
What is our individual sampling unit? How should we uniquely identify each sampling unit?
What analyses can we expect to perform with our data? Does this influence how we should collect our data?
These types of questions are the first step to help us create reproducible research. The next step is being able to easily access our data and analyses. That is, how do we manage our data and analyses on our computers?
Directories, Directories, Directories
Directories serve as the backbone for managing data and staying organized. Every project should have its own directory. Within this directory I often have several main subdirectories, which I name Background, Manuscript, Data, Scripts, and Output. Additionally, I often break larger projects into smaller projects and give each its own subdirectory. The directory structure within a subproject's directory typically mirrors the structure of the project directory.
It's important to remember that the directory structure should be flexible in order to accommodate the individual needs of each project, as well as be able to evolve over time, since a single project can last years, during which subsets of data will be analyzed and/or the project may expand. Below I describe in more detail how I use each of the main subdirectories in my project directory. And in Figure 1, I show how two of my own project directories are organized, to illustrate how a directory system can flex to accommodate the needs of different projects. The key to flexibility is to think ahead about what data you expect to get as well as its format (e.g. DNA sequences saved as fasta files, Excel spreadsheets with measurements, images, etc.).
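To make this concrete, here is one way to lay out such a skeleton from the command line (the project name 'IslandBees' and the subproject 'Pollination' are made-up examples, not directories from the post):

```shell
# Create a project directory with the five main subdirectories.
# "IslandBees" is a hypothetical project name used for illustration.
mkdir -p IslandBees/Background
mkdir -p IslandBees/Manuscript
mkdir -p IslandBees/Data
mkdir -p IslandBees/Scripts
mkdir -p IslandBees/Output

# A smaller subproject gets its own directory that mirrors this layout.
mkdir -p IslandBees/Pollination/Data
mkdir -p IslandBees/Pollination/Scripts
mkdir -p IslandBees/Pollination/Output
```

Setting up the skeleton once, at the start of the project, means every new file already has an obvious home.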
Background
I like to keep important publications that I frequently refer to in this directory.
Manuscript
This is where I keep all my written drafts and outlines. It is also where I keep a running list of notes, such as who helped with each part of the project (data collection, watering plants, DNA extractions, etc.) and funding sources, both of which are important to include in the acknowledgements section of a publication.
Data
I keep all my original data here. Oftentimes this data is structured in Excel spreadsheets, but it can also include other forms of data, such as raw sequence data or images.
Scripts
Scripts document how we performed our analyses and allow others to easily re-run the same analyses with our data or with new data. Scripts show what data we used and what we did to it (how we transformed the data, what statistical tests we performed, what figures we generated, etc.). Scripts in my Scripts directory are almost always R scripts.
I am in favor of keeping scripts short and compartmentalized. I don't like to combine analyses in the same script unless I have a 'gut feeling' about doing so. For example, I may have a script that performs linear regression and generates corresponding figures (called '1_LinReg.R'), a separate script for performing an ANOVA (called '2_ANOVA.R'), and a separate script that plots the locality where each sample was collected (called '3_LocalityPlotting.R'). At the top of each script I like to include a few sentences about what the analysis achieves or why I performed it. I also give each script a number in its file name, which helps me link it to its output (for more on this, see Figure 1 and the sections on the Output directory and Naming Samples & Files below).
Output
The Output directory is where I keep all the results and figures produced by my analyses. I typically name these output files beginning with the number of the script that generated them. For example, the output files '1_Log_IslandArea.pdf', '1_Log_IslandElevation.pdf', and '1_Log_IslandIsolation.pdf' were generated by the analysis saved as '1_LinReg.R'. (See Figure 1 for another example.)
Whenever I perform an analysis on a subset of my data or perform a preliminary analysis, say for an upcoming conference presentation or grant proposal, I will typically create an Output subdirectory specifically for that occasion. For example, I created a subdirectory called '2017_June_CommitteeMeeting' where I kept the output of the preliminary analyses included in my June 2017 committee meeting document.
The README.md File
I want to begin this section with a short story. The other day, I was doing some PCR when I noticed that I needed to restock my working stock of DNA. I went into the freezer and grabbed my box marked 'non-diluted DNA stock'. When I lifted the lid, much to my surprise, I found a printed table of the samples inside the box and each sample's corresponding DNA concentration. With this table I was able to easily find my samples and dilute my DNA. I loved finding this printed table, which my past self had left for my future self! I like to think of it as a little love note-- you know, those cute messages you find left by someone special in your jacket pocket which make you smile! That's what being organized in your research will feel like!
So the README.md file is the ultimate little love note. Every directory within your project directory should have a README.md file, especially all data, output, and scripts directories. (I'll be honest: I typically only have a README.md file for my main project directory and for the last three directory types mentioned. More than that seems like overkill.) The README.md file can be added to and modified as you work through your project. Think of it as a digital lab notebook. Here are some things to think about when creating a README.md file.
If it is a data directory-- how was the data generated, what is the format of the data, when was the data generated, how are the files named?
If it is a scripts directory-- what does each script do, what versions of the script exist?
If it is an output directory-- when was the analysis run, what version of the script or software did you use to generate the output, what parameter values did you use in the analysis (if you used the software's default values, be sure to record them, as defaults can change between software versions), how are output files named?
The README.md doesn’t always have to be long. Sometimes a simple line will suffice. Other times README.md files will be long and detailed. In Figure 2 I show two examples of README.md files from my own research. The first is a README.md file for a data directory and the second is a README.md file for a scripts directory. From these examples you can see how README.md files can differ in detail.
README.md files are written using a syntax called Markdown, a plain-text formatting syntax that can be translated to HTML. Markdown takes care of formatting, meaning you can avoid spending your time manually formatting headers, fonts, lists, etc. as you would in a Word document. A great web resource for learning more about Markdown can be found here.
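For illustration, here is what a short README.md for a data directory might look like in Markdown (the project code, file names, and dates below are invented for this example):

```markdown
# Data

Field measurements and sequence data for the BEE project
(project name invented for this example).

## Files

- `BEE_measurements.xlsx` -- one row per sample (BEE_001, BEE_002, ...)
- `raw_sequences/` -- fasta files received from the sequencing facility,
  August 2017
```

The `#` marks a header, `##` a subheader, `-` a list item, and backticks mark inline code such as file names.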
Naming Samples & Files
When it comes to naming, there are a few important things to keep in mind. First, avoid spaces; substitute an underscore (_) for a space. Spaces make doing anything from the command line painful. Second, think in repeatable patterns. Patterns allow you to automate tasks easily using computer code. Third, avoid ad hoc naming. Fourth, names should be unique and informative.
When it comes to naming samples, every sample should have a unique name, and ad hoc naming should be avoided. I have found the best way to do this is to use a three-letter code followed by a number. The three-letter code identifies the project, and the number allows samples to be stored numerically and easily found. The sample name can be kept in a spreadsheet, with additional columns providing further details about each sample. This method has worked for me for everything from DNA primers and DNA sequencing to bees collected in the field. In Figure 3, I show an example from my own research.
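One nice side effect of this scheme is that names can be generated automatically rather than typed by hand. A quick command-line sketch ('BEE' is a hypothetical project code used only for illustration):

```shell
# Generate unique sample names: a three-letter project code plus a
# zero-padded number. "BEE" is a hypothetical project code.
for i in 1 2 3 10 100; do
  printf 'BEE_%03d\n' "$i"
done > sample_names.txt

cat sample_names.txt
```

Zero-padding (BEE_001 rather than BEE_1) keeps names the same length, so they sort numerically in file listings and spreadsheets.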
The name of a file should be descriptive enough that you know what is inside. For example, an R script might run a linear regression analysis; I would consider naming that script 'LinearRegrAnalysis.R'. Additionally, because scripts often produce output, either as files or figures, I like to use an identifier to link the script with its output. To do this I have adopted the practice of numbering all my scripts and outputs, such that the output '4_output.pdf' was produced by the script '4_LinearRegrAnalysis.R' and the output '7_output.pdf' was produced by the script '7_LocalityPlotting.R'. This is shown in Figure 2 above.
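Because a script and its outputs share a number prefix, finding everything a script produced is a one-liner. A quick sketch, using file names drawn from the examples above (plus an invented '7_localities.csv'):

```shell
# Set up example files that follow the numbering scheme described above.
# File names are illustrative, not real project files.
mkdir -p Scripts Output
touch Scripts/4_LinearRegrAnalysis.R Scripts/7_LocalityPlotting.R
touch Output/4_output.pdf Output/7_output.pdf Output/7_localities.csv

# List every output produced by script number 7.
ls Output | grep '^7_'
```

The same prefix match works in R or any other language, which makes it easy to automate loading a script's results.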
Version Control
Version control means keeping track of changes made to documents. For example, we may write a script, run the analysis, and get results. We may go back later and tweak our code, then run the analysis again, only to get different results or find that the analysis fails because the changes broke the code. Either way, it is good to know what changed between versions and which version produced which results.
Software such as Git can help with version control. Git allows us to save multiple versions of a document, tells us what differs between versions, and lets us go back to previous versions. This is super helpful for coding. There is a learning curve to Git, and it is something I am still relatively new at, but it excites me. Many people have blogged about Git, and Vince Buffalo devotes a whole chapter to it in Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. A digital copy of the book Pro Git by Scott Chacon and Ben Straub can be found here.
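To give a flavor of the basic workflow, here is a minimal sketch of tracking a scripts directory with Git (it assumes Git is installed; the directory name, file contents, and commit message are all invented for this example):

```shell
# Start tracking a scripts directory with Git.
# "demo_scripts" is a made-up directory name for this sketch.
mkdir demo_scripts && cd demo_scripts
git init -q
git config user.name  "Demo User"        # local identity so commits work anywhere
git config user.email "demo@example.com"

echo '# Linear regression of island data' > 1_LinReg.R
git add 1_LinReg.R
git commit -q -m "Add linear regression script"

echo 'model <- lm(y ~ x)' >> 1_LinReg.R
git diff           # shows exactly what changed since the last commit
git log --oneline  # lists saved versions, newest first
```

Committing before and after each tweak means you can always see what changed between versions and recover the version that produced a given result.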
Finally, I would encourage data management to be discussed more openly, especially between early career researchers (undergrads working in the lab, early stage grad students) and more senior researchers. One way to do this is to dedicate several lab meetings to data management.