PART 2: Downloading DNA sequences from the NCBI's GenBank database: Let's automate this task!

This is part 2 (I’m still working on part 1) of 3 parts looking at megaphylogenies. The introduction to this series can be found here.

Most of us can use the NCBI website to search for a DNA sequence for a particular taxon by using the search bar on the webpage. From there we can click on the sequence record we want and then download the record as a FASTA or GenBank file. Doing this manually is probably fine for a few taxa, but what if you had to do this for 20 taxa? 100 taxa? Or thousands of taxa? Luckily there are tools to automate the task of retrieving GenBank records!

 Fasta format of nucleotide sequence downloaded from NCBI's GenBank.

In this blog post I cover what the NCBI is, how records are stored and searched, and available tools to help automate the task of downloading records. Also at the end of this blog post is a complete list of links to online resources both mentioned in this post and those I found helpful.

NCBI databases

The NCBI houses 39 molecular and literature databases, including databases for nucleotide and protein sequences, protein structures, complete genomes, gene expression, taxonomy, PubMed and more. To access these databases, the NCBI uses a text-based search and retrieval system called Entrez. Each record within a database is referred to by a unique identifier (UID). For nucleotide and protein sequences these UIDs are called GI numbers. Records can be retrieved in XML format, among others.

Entrez Programming Utilities (E-Utilities)

If you read through the Entrez Programming Utilities (E-Utilities) help documentation on the NCBI’s website (found here), there is a good chance a lot of it will sound like gibberish (or at least it did to me). Luckily, you don’t need to understand much beyond the fact that E-Utilities “are a set of nine server-side programs that provide a stable interface into the Entrez query and database system” using URL requests (this itself may sound like gibberish, but that’s OK!). Three key E-Utilities programs that you will likely use in some fashion to carry out basic tasks are ESearch, ESummary, and EFetch.

  • ESearch retrieves the UIDs that match the search term you provided. These UIDs can be stored and used later with ESummary and EFetch.

  • ESummary returns a summary of the data record corresponding to each UID provided. These summaries are in XML format.

  • EFetch returns the data record in a specified format (such as a FASTA or GenBank file for nucleotide sequences) for each UID provided.
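Each of these three programs is reached through a plain URL, so assembling a request is mostly string building. Here is a minimal sketch using only Python's standard library. The base address is the one the NCBI documents for E-Utilities; the helper name eutils_url, the search term, and the UID "12345" are made up for illustration and are not taken from a real search. The sketch only builds the URLs; it does not send them.

```python
from urllib.parse import urlencode

# Base address of the E-Utilities server, as documented by the NCBI.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(program, **params):
    """Return the URL for one E-Utilities program, e.g. 'esearch'."""
    # urlencode escapes brackets and spaces so the term is URL-safe.
    return f"{EUTILS_BASE}{program}.fcgi?{urlencode(params)}"


# Hypothetical search term: matK sequences from the genus Gongora.
search_url = eutils_url(
    "esearch",
    db="nucleotide",
    term="Gongora[Organism] AND matK[Gene]",
)

# "12345" is a placeholder UID, not a real record from the search above.
summary_url = eutils_url("esummary", db="nucleotide", id="12345")
fetch_url = eutils_url(
    "efetch", db="nucleotide", id="12345", rettype="fasta", retmode="text"
)

print(search_url)
print(summary_url)
print(fetch_url)
```

Pasting a URL built this way into a browser returns the raw response from the server. Sending the requests from code and parsing what comes back is the part that takes more work.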

Go ahead and try using these E-Utilities. Try copying the following URLs into your web browser’s address bar: (here we use the e-utility 'esearch' to search for matK sequences for one of my favorite plant genera, Gongora) (here we use the e-utility 'esummary' to retrieve the summary of the first listed UID from the search above)

You should see something like this:

[Screenshot: XML output returned by the esearch URL]

And this:

[Screenshot: XML output returned by the esummary URL]

This is pretty cool, eh? Although probably not very helpful on its own. Furthermore, to automate the use of E-Utilities you would need to write code that can handle URL requests. I don’t know how to do that, but luckily for us there is Entrez Direct (E-Direct), in which developers at the NCBI have done this heavy lifting for us!

Entrez Direct (E-Direct)

Entrez Direct (or E-Direct) allows access to the NCBI’s databases from a UNIX terminal using command line arguments. E-Direct is basically a set of Perl scripts that define functions which handle the URL requests for E-Utilities. These functions can be executed from the terminal command line or used in shell scripts to aid in automation. This is great because gaining access to NCBI databases now requires a lot less effort and programming knowledge on our end!

E-Direct must be installed, and it requires a Unix environment with the Perl language available. Installation instructions can be found here and here. The installation is straightforward: just copy and paste the commands into the terminal window. The installation will download a scripts directory called ‘edirect’ into the user’s home directory.
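Once installed, a typical E-Direct call pipes the esearch command into the efetch command. To keep the code in this post in one language, here is a rough sketch of driving that pipeline from Python rather than from a shell script. The commands esearch and efetch and their -db, -query, and -format options come from E-Direct itself; the function names edirect_commands and run_edirect and the search term are made up for illustration.

```python
import shlex
import subprocess


def edirect_commands(query, db="nucleotide", fmt="fasta"):
    """Build the argument lists for the pipeline `esearch ... | efetch ...`."""
    search_cmd = ["esearch", "-db", db, "-query", query]
    fetch_cmd = ["efetch", "-format", fmt]
    return search_cmd, fetch_cmd


def run_edirect(query, db="nucleotide", fmt="fasta"):
    """Pipe esearch into efetch and return the fetched records as text.

    Assumes the E-Direct scripts are on the PATH and that there is a
    working network connection.
    """
    search_cmd, fetch_cmd = edirect_commands(query, db, fmt)
    search = subprocess.Popen(search_cmd, stdout=subprocess.PIPE)
    fetch = subprocess.run(
        fetch_cmd, stdin=search.stdout, capture_output=True, text=True, check=True
    )
    search.stdout.close()
    search.wait()
    return fetch.stdout


# Show the equivalent shell pipeline without running it.
# The search term is a hypothetical example.
search_cmd, fetch_cmd = edirect_commands("Gongora[Organism] AND matK[Gene]")
print(shlex.join(search_cmd), "|", shlex.join(fetch_cmd))
```

Calling run_edirect with the same search term would actually execute the two commands and return the FASTA text, provided E-Direct is installed.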

Although I began by using E-Direct in shell scripts for my project, I quickly switched over to using the Biopython module once I discovered it. Therefore I won’t say much more about E-Direct. I’ve provided a link to a nice webinar on using E-Direct at the end of this post if you want more information.

Entrez with Biopython

Biopython is great and I am a huge fan! Once I discovered it, I settled on using Biopython for all of my sequence downloads from the NCBI databases. Biopython was also great for handling my sequences post-download and post-alignment.

First off, what is Biopython? “Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics.” The first version of Biopython was released in 2000. As of this blog post the most current version is Biopython v1.70, released July 2017.

You will need to install Biopython. Instructions are found here and it should be very easy, but if you run into problems check out this PDF. There is a great tutorial and cookbook for using Biopython, here, which will help you hit the ground running. Chapter 9 is all about accessing the NCBI’s Entrez databases. I used the examples from the first few chapters and chapter 9 to customize my code for my project needs.
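To give a flavor of what this looks like, here is a minimal sketch of the ESearch-then-EFetch pattern using Biopython's Bio.Entrez and Bio.SeqIO modules. The calls Entrez.esearch, Entrez.read, Entrez.efetch, SeqIO.parse, and SeqIO.write are part of Biopython; the function names build_term and download_fasta, the search term, and the default of 20 records are made up for illustration.

```python
def build_term(organism, gene):
    """Combine an organism name and a gene name into an Entrez search term."""
    return f"{organism}[Organism] AND {gene}[Gene]"


def download_fasta(term, email, out_path, retmax=20):
    """Search the nucleotide database and save the matching records as FASTA.

    Returns the number of sequences written to out_path.
    Needs Biopython installed and a working network connection.
    """
    # Imported here so that build_term() can be used without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # the NCBI asks users to identify themselves

    # Step 1: ESearch returns the UIDs that match the search term.
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    result = Entrez.read(handle)
    handle.close()
    uids = result["IdList"]
    if not uids:
        return 0

    # Step 2: EFetch returns the records for those UIDs in FASTA format.
    handle = Entrez.efetch(
        db="nucleotide", id=",".join(uids), rettype="fasta", retmode="text"
    )
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()

    # Step 3: write all of the records to a single FASTA file.
    return SeqIO.write(records, out_path, "fasta")
```

A call might look like download_fasta(build_term("Gongora", "matK"), "you@example.org", "gongora_matK.fasta"), where the email address and the filename are placeholders. Looping over a list of taxa with a call like this is what turns a manual download into an automated one.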


Entrez with R

There are several R packages that also access the NCBI’s Entrez databases. One widely used package is ‘ape’. There is a nice PowerPoint tutorial here that covers the basics. However, I found I wasn’t able to query records as efficiently as I could using the functions in the Biopython module. Therefore I have not explored this option much further.


Links to online resources

Overview of the NCBI databases

Resources for E-Utilities

Resources for E-Direct

Resources for Biopython

Resources for R