NEON Open Access Data to Explore Our Ecosystems Highlights the Need to Incorporate Data Science Skills for Reproducible Research into Biology Course Curriculum

The National Ecological Observatory Network (NEON) is an NSF-funded project designed to "characterize and quantify complex, rapidly changing effects of ecological processes across the US" over a 30-year time frame. The data are intended to be used by "scientists, educators, planners, decision makers and public to map, understand and predict the effects of human activities on ecology and effectively address critical ecological questions and issues." (https://www.neonscience.org/about). The data are diverse and extensive: NEON monitors 45 field sites across the United States, including Alaska, Hawaii and Puerto Rico, and collects both biological and physical data important to understanding ecosystem functioning.

 I wanted to explore the NEON datasets to better understand how the data could be accessed and used (potentially in a classroom setting or for a course project), so I created a mini-project for myself. In it, I explore the geographic distribution across the United States of tick species and the pathogens they carry that can affect human health. My code and analysis can be found on my GitHub (here & here). Below I document some of the challenges of accessing and using the data and the skills required to overcome them. My experience highlights the need to incorporate data science skills that allow for reproducible research (computer coding literacy, data formatting, cleaning, filtering, and analysis) into biology course curricula.

Challenge 1: Downloading and aggregating the data

 It is easy to download the data directly from the NEON site (see figure 1); however, upon opening the downloaded folder you are faced with a multitude of zip files. In fact, there is a zip file for each site/year combination (that can be upwards of 200 zip files!). Each zip file contains at minimum (1) a csv file with the data for that site and year, (2) a csv file describing the variables in the data csv, and (3) a text file detailing the data and the data collection methods. To get a complete dataset, the lines of data from every data csv file must be aggregated together, a task (like so many others in current biological research and data management) that would be tedious if done manually. (Plus, what if you make a mistake and have to start all over again, or accidentally skip a csv file?)

Figure 1. NEON data portal and download.

Luckily, as part of the NEON project, NEON researchers create tutorials on how to use and analyze the data. Tutorials can be found here. Among them are tutorials that demonstrate how to download and aggregate the data using either R or Python. In fact, NEON researchers developed a NEON-specific library for R and a NEON-specific module for Python. This makes the data easily accessible to students and researchers who are experienced in R or Python. Another benefit of these tutorials is that, although they are written for NEON data, the concepts they teach carry over to other programming and data analysis tasks.

Because I wanted to practice my Python skills, for my mini-project I wrote a custom Python script to aggregate the data after downloading it from the NEON site. The script unzips each folder, pulls the lines of data from the data csv file, and aggregates them into a single variable that is then written to a csv file (see figure 2).
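The workflow just described can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not my actual script: the function name `aggregate_neon_zips` and the `data_marker` substring used to tell the data csv apart from the variables csv are assumptions, and the real NEON file names may need a different rule.

```python
import csv
import io
import zipfile
from pathlib import Path


def aggregate_neon_zips(download_dir, output_csv, data_marker="data"):
    """Combine the data csv from every zip in download_dir into one csv.

    data_marker is a hypothetical substring used to pick the data csv
    out of each zip (and skip the variables csv and the text file);
    adjust it to match the real file names.
    Returns the number of data rows written.
    """
    header = None
    rows = []
    # Sorting makes the row order the same on every run.
    for zip_path in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                file_name = Path(name).name
                if not file_name.endswith(".csv") or data_marker not in file_name:
                    continue
                with archive.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    reader = csv.reader(text)
                    file_header = next(reader, None)
                    if file_header is None:
                        continue  # empty file, nothing to add
                    if header is None:
                        header = file_header
                    elif file_header != header:
                        # Refuse to silently mix files with different columns.
                        raise ValueError(f"Unexpected columns in {zip_path.name}: {name}")
                    rows.extend(reader)

    if header is None:
        raise ValueError(f"No data csv files found in {download_dir}")

    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Calling something like `aggregate_neon_zips("downloads", "all_sites.csv")` (both paths are placeholders) would write one combined file. Because every step is in code, the aggregation can be rerun from scratch whenever new site/year files are added.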

Figure 2. Schematic diagram depicting the workflow to aggregate downloaded NEON data.

Challenge 2: Managing large-scale datasets

Once aggregated across sites and years, the completed dataset has more than 5,000 lines of data. Once again, this amount of data is not easily manageable without reproducible workflows documented and performed in a language such as R or Python.

 For my mini-project, I performed my data analysis in R (the repo for this analysis can be found here). I relied on the R package dplyr, a SQL-like toolkit for data manipulation. With the functions in dplyr I was able to filter and summarize the data easily in order to answer my questions. See figure 3 for some ideas on the types of questions you may want to ask yourself when filtering and summarizing data.
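To make the filter-then-summarize idea concrete, here is a rough Python analogue of a dplyr pipeline that filters rows, groups them, and counts each group. It is a sketch under stated assumptions, not my R analysis: the function name and the column names `siteID` and `testResult`, as well as the value `"Positive"`, are hypothetical placeholders for whatever the aggregated csv actually contains.

```python
import csv
from collections import Counter


def count_positive_by_site(csv_path, site_column="siteID",
                           result_column="testResult",
                           positive_value="Positive"):
    """Count rows per site, keeping only rows with a positive test result.

    All column names and the positive_value default are assumptions
    made for illustration; check the variables csv for the real ones.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Filter step: skip rows that are not positive.
            if row[result_column] != positive_value:
                continue
            # Group-and-summarize step: one tally per site.
            counts[row[site_column]] += 1
    return dict(counts)
```

The same two-step pattern (decide which rows to keep, then decide how to group and summarize them) applies whichever tool is used.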

 I also created a SQLite database, allowing me to query the data using SQL (the repo for this can be found here).
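For readers who have not used SQLite, the sketch below shows one way to load an aggregated csv into a SQLite table with Python's built-in `sqlite3` module and then query it with SQL. It is an illustration only, not the code in my repo: the function name, the table name `ticks`, and the choice to store every column as text are assumptions made to keep the example short.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table="ticks"):
    """Create (or replace) a table from a csv file and return the connection.

    The table name is a hypothetical default. Every column is stored
    as TEXT for simplicity; a real schema would use proper types.
    """
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # "with conn" commits if everything succeeds, rolls back otherwise.
        with conn:
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({columns})')
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
    return conn
```

With the data loaded, a question such as "how many records are there per site?" becomes a single query, for example `conn.execute('SELECT siteID, COUNT(*) FROM ticks GROUP BY siteID').fetchall()`, where `siteID` is again an assumed column name.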

Figure 3. Example data frame of NEON data with suggested questions you might ask yourself when filtering and summarizing the data.

Final Thoughts

 The NEON project is awesome! It provides real ecological data that students and researchers can use to generate and answer ecological questions. However, it also highlights the need to incorporate data science skills that allow for reproducible research (computer coding literacy, data formatting, cleaning, filtering, and analysis) into biology course curricula. Without these skills, the data and the insights they hold are not accessible.