Recently, I found myself wanting to make a phylogeny of species that I found interesting. These species were organisms that had been isolated from olive oil, typically by plating methods, and identified either through screening or through amplicon sequencing. Because olive oil microbiology is not a huge field, and because olive oil is probably a highly selective environment, I was able to make a list of all of the organisms that have been identified in olive oil.
I had been using phyloT to generate phylogenetic trees, but because it is driven through a GUI, the process isn’t reproducible. And although olive oil microbiology may not be a bustling field, new organisms are being isolated from oil semi-frequently. I got annoyed with having to regenerate the tree each time I wanted to update it, and so I began my search for the “best” phylogenetic tree generator.
However, I did not want to use sequences. I simply wanted to input my list of species or species IDs and output a tree. That way, I could update my list (either in an external document or in the script itself) and update my tree each time a new species was identified.
If only the world were simple. I played around with a few methods, but the one I like best uses both Python and R. The first two that I show below use R-only methods, and although they are fine, they had shortcomings, which I will discuss below.
Three ways to build a tree from species name (or taxon ID)
The three ways I used to build my trees were:
- R metacoder package
- R rotl & ape packages
- R taxize package
Additionally, I was able to produce a phylogeny using the Python ETE toolkit quite easily. If you prefer Python, their page on dealing with NCBI taxonomy could be helpful.
Hopefully one of these ways will be helpful!
As stated before, I used a self-made CSV file of species and NCBI taxon identifiers for organisms that have been isolated from olive oil. You can see parts of the data.frame below.
oil_species <- read.csv("./data/olive_oil_species.csv")
# Make a vector for species
oil_species_taxon <- as.character(oil_species[, 2])
oil_species_taxon
## "Saccharomyces cerevisiae " "Nakazawaea wickerhamii"
## "Barnettozyma californica" "Candida boidinii"
## "Aspergillus" "Candida parapsilosis "
## "Meyerozyma guilliermondii" "Clavispora lusitaniae"
## "Debaryomyces hansenii" "Candida albicans"
## "Rhodotorula mucilaginosa " "Helicosporium"
## "Alternaria" "Penicillium"
## "Candida diddensiae " "Candida sp. CBS 12510"
## "Candida adriatica" "Brettanomyces acidodurans "
## "Nakazawaea molendini-olei" "Cystobasidium slooffiae"
## "Zygotorulaspora mrakii"
# Make a vector for NCBI ID
oil_species_id <- oil_species[, 3]
oil_species_id
## 4932 1538186 36038 5477 5052 5480 4929 36911
## 4959 5476 5537 171188 5598 5073 45543 1164822
## 1171601 1958866 1538181 106018 42260
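Several of the names printed above carry a trailing space (for example "Saccharomyces cerevisiae "). Whether that matters depends on the lookup service, but it is cheap to strip before querying. The following is a minimal sketch in base R; raw_names is a small illustrative subset copied from the output above, and clean_names and had_padding are names I made up for the example.

```r
# A few names copied from the printed vector; two have a trailing space.
raw_names <- c("Saccharomyces cerevisiae ",
               "Nakazawaea wickerhamii",
               "Candida parapsilosis ")

# trimws() is base R and strips leading and trailing whitespace.
clean_names <- trimws(raw_names)

# Flag the entries that changed, which helps when tidying the source CSV.
had_padding <- raw_names != clean_names

print(clean_names)
print(had_padding)
```

The same call could be applied to the full vector, e.g. trimws(oil_species_taxon), before passing it to any of the packages below.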
R metacoder package
The metacoder package was easy to use and interfaced with many databases. I could also choose to give it NCBI taxon IDs or species names. I chose to give it names.
library(metacoder)
oil_taxon_metacoder <- extract_taxonomy(oil_species_taxon, key = "name", database = "ncbi", allow_na = TRUE)
heat_tree(oil_taxon_metacoder, node_size = n_obs, node_color = n_obs, node_label = name)
However, I wasn’t a huge fan of the metacoder heat_tree() output. I decided to explore which other packages interface with phylogenies to see whether I could get an output that appealed to me.
R rotl & ape packages
rotl provides an interface to the “Open Tree of Life”. The package allows you to query the tree and retrieve a phylogeny.
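library(rotl)
# Match the species names against the Open Tree taxonomy
oil_species_resolved <- tnrs_match_names(oil_species_taxon)
# Request the subtree induced by the matched OTT IDs
tree <- tol_induced_subtree(ott_ids = oil_species_resolved$ott_id)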
The above code doesn’t work. The newest species, Brettanomyces acidodurans, isn’t found in the database, and so it isn’t represented on the tree. However, taking it out exposes a new problem:
# Remove the newest species, which wasn't found:
oil_species_taxon_rm <- oil_species_taxon[-18]
oil_species_resolved <- tnrs_match_names(oil_species_taxon_rm)
tree <- tol_induced_subtree(ott_ids = oil_species_resolved$ott_id)
This code won’t work either. The function tol_induced_subtree() can’t place the OTT ID for Candida sp. CBS 12510 (Yamadazyma terventina is how this species is referred to in the olive oil literature). If it is removed, then the phylogeny will work.
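library(ape) # provides the plot method for phylo objects
# Remove the newest species, which wasn't found:
oil_species_taxon_rm <- oil_species_taxon[-18]
oil_species_resolved <- tnrs_match_names(oil_species_taxon_rm)
# Also drop the OTT ID for Candida sp. CBS 12510 (position 16), which the tree can't place:
tree <- tol_induced_subtree(ott_ids = oil_species_resolved$ott_id[-16])
plot(tree)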
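Dropping entries by hard-coded position (-18, -16) works, but it will break as soon as the species list changes, which is the very thing I wanted to avoid. One alternative is to filter on the match results themselves. The sketch below uses only base R and a mock table so that it runs without a network connection; keep_placeable() is a hypothetical helper, the OTT IDs are placeholders rather than real identifiers, and the in_tree flags are typed in by hand. With real data, I believe rotl’s is_in_tree() can supply those flags, but check the package documentation before relying on that.

```r
# Hypothetical helper: keep only rows that have an OTT ID and are
# flagged as present in the synthetic tree.
keep_placeable <- function(resolved, in_tree) {
  stopifnot(length(in_tree) == nrow(resolved))
  resolved[!is.na(resolved$ott_id) & in_tree, , drop = FALSE]
}

# Mock stand-in for the tnrs_match_names() result.
# The ott_id values are placeholders, NOT real Open Tree identifiers.
resolved <- data.frame(
  search_string = c("candida boidinii",
                    "brettanomyces acidodurans",
                    "candida sp. cbs 12510"),
  ott_id = c(101L, NA, 303L),
  stringsAsFactors = FALSE
)

# Hand-written flags for illustration: the second name had no match,
# and the third has an ID that cannot be placed in the tree.
in_tree <- c(TRUE, FALSE, FALSE)

kept <- keep_placeable(resolved, in_tree)
print(kept$search_string)
```

The ott_id column of the filtered table could then be passed to tol_induced_subtree() in place of the manually indexed vector, so the script keeps working when new species are added to the CSV.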