This post is part of a three part series:
- From raw metagenome reads to phyloseq taxonomy table using sourmash gather and sourmash taxonomy
- From raw metagenome reads to taxonomic differential abundance with sourmash and corncob
- Visualizing the taxonomic composition of metagenomes using sourmash and metacoder
This post includes outputs and code from the first blog post which shared a tutorial to go from raw metagenome reads to a phyloseq table using sourmash gather
and sourmash taxonomy
. This tutorial demonstrates how to do differential abundance analysis with the same data using corncob. Corncob works really well for microbial count data because it accounts for sparsity (lots of unobserved taxonomic lineages in samples) and variable sequencing depth.
I don’t want to run the previous tutorial, I just want to play with this one
If you don’t feel like running through everything in the previous tutorial, you can download the sourmash gather
and sourmash taxonomy
results using the code below:
wget -O taxonomy.tar.gz https://osf.io/kf9n8/download
tar xf taxonomy.tar.gz
Building the phyloseq object
In the previous tutorial, we built a phyloseq object from the sourmash taxonomy
output. Corncob doesn’t require a phyloseq object to run, but it does play well with phyloseq objects. We’ll use that as a jumping off point, so I’ve included the code to build the phyloseq object below.
library(tidyverse)
library(phyloseq)
# create a metadata table -------------------------------------------------
# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.
run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
"SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
"MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")
metadata <- data.frame(run_accessions = run_accessions,
sample_ids = sample_ids,
groups = groups) %>%
column_to_rownames("run_accessions")
# read in sourmash gather & taxonomy results ------------------------------
# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("taxonomy/*gather_gtdbrs207_reps.with-lineages.csv") %>%
map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>%
mutate(name = gsub(" .*", "", name))
# We need two tables: a tax table and an "otu" table.
# The tax table will hold the taxonomic lineages of each of our gather matches.
# To make this, we'll make a table with two columns: the genome match and the lineage of the genome.
# The "otu" table will have the counts of each genome in each sample.
# We'll call this our gather_table.
tax_table <- sourmash_taxonomy_results %>%
select(name, lineage) %>%
distinct() %>%
separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>%
column_to_rownames("name")
gather_table <- sourmash_taxonomy_results %>%
mutate(n_unique_kmers = (unique_intersect_bp / scaled) * average_abund) %>% # calculate the number of uniquely matched k-mers
select(query_name, name, n_unique_kmers) %>% # select only the columns that have information we need
pivot_wider(id_cols = name, names_from = query_name, values_from = n_unique_kmers) %>% # transform to wide format
replace(is.na(.), 0) %>% # replace all NAs with 0
column_to_rownames("name") # move the metagenome sample name to a rowname
# create a phyloseq object ------------------------------------------------
physeq <- phyloseq(otu_table(gather_table, taxa_are_rows = T),
tax_table(as.matrix(tax_table)),
sample_data(metadata))
Differential abundance testing with corncob
Once we have the phyloseq object physeq
, running corncob is fairly straightforward. In this case, our two groups are stool microbiomes from individuals with Crohn’s disease (cd) and individuals without Crohn’s disease (nonIBD).
Let’s start by testing for differential abundance for a specific genome that sourmash gather
identified in our samples. We’ll test a genome that was identified in all of our samples. Using the below code block, we can take a look at 6 genomes that were identified in all samples.
sourmash_taxonomy_results %>%
group_by(name) %>%
tally() %>%
arrange(desc(n)) %>%
head()
name n
<chr> <int>
1 GCA_001915545.1 6
2 GCA_003112755.1 6
3 GCA_003573975.1 6
4 GCA_003584705.1 6
5 GCA_900557355.1 6
6 GCA_900755095.1 6
Randomly pulling out one of these genomes, let’s take a look at its taxonomic lineage:
sourmash_taxonomy_results %>%
filter(name == "GCA_003112755.1") %>%
select(lineage) %>%
distinct()
This genome is assigned to s__Lawsonibacter asaccharolyticus
in the GTDB taxonomy.
lineage
<chr>
1 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Lawsonibacter;s__Lawsonibacter asaccharolyticus
Testing differential abundance for a single genome
Corncob builds a model for each genome/ASV/taxonomic lineage separately, so we can start by building a model for this genome specifically:
library(corncob)
library(janitor)
corncob <- bbdml(formula = GCA_003112755.1 ~ groups,
phi.formula = ~ 1,
data = physeq)
corncob
And we see an output that looks like this:
Call:
bbdml(formula = GCA_003112755.1 ~ groups, phi.formula = ~1, data = physeq)
Coefficients associated with abundance:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.0475 0.3665 -16.499 0.000485 ***
groupsnonibd 0.8905 0.4255 2.093 0.127424
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Coefficients associated with dispersion:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.8616 0.6007 -11.42 0.00144 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-likelihood: -39.28
We can plot the results:
plot(corncob, color = groups)
While our results show that this genome is significantly differentially abundant in CD when compared to nonIBD, this result doesn’t seem super biologically meaningful; visually, the mean abundances of the genome aren’t that different between groups.
The corncob vignette includes a lot more information about creating and interpretting models, especially for more complex experimental designs.
Does bbdml() produce an error?
Error in bbdml(formula = GCA_003112755.1 ~ groups, phi.formula = ~1, data = physeq) :
Model could not be optimized! Try changing initializations or simplifying your model.
See this GitHub issue for potential resolutions.
Testing differential abundance at different levels of taxonomy
Because corncob integrates well with phyloseq, we can agglomerate to different levels of taxonomy and test for differential abundance systematically. Let’s test for differential abundance for all genuses observed in our data set.
physeq_genus <- physeq %>%
tax_glom("genus")
corncob_genus <- differentialTest(formula = ~ groups,
phi.formula = ~ 1,
formula_null = ~ 1,
phi.formula_null = ~ 1,
data = physeq_family,
test = "Wald", B = 100,
fdr_cutoff = 0.1)
To bootstrap or not to bootstrap
The B
parameter in differentialTest()
controls the number of bootstrap iterations. If you have at least 30 samples, you probably don’t need to bootstrap and this will save you time when running the model. If you have fewer than 30 samples, it is strongly recommended that you bootstrap with 100-2000 bootstraps.
Then we can plot the results, showing only taxa that were significantly different at an FDR cutoff of 0.1
plot(corncob_genus)
By default, the bbdml()
function does not do multiple testing correction because it only performs one test. If you choose to do multiple tests, be sure to include p value correction. In contrast, differentialTest()
does calculate the false discovery rate for all of the tests performed for that run.
Using corncob_genus$significant_models
, we can see the results of corncob. Below, we’ll parse the output into a dataframe for easier downstream use of the corncob results.
First, we’ll write a function that takes the differentialTest()
results object as input and parses it into a dataframe of taxa-labelled coefficients.
glance_significant_models <- function(corncob_results){
df <- data.frame()
for(i in 1:length(corncob_results$significant_taxa)){
taxa_name <- corncob_results$significant_taxa[i]
taxa_df <- corncob_results$significant_models[[i]]$coefficients %>%
as.data.frame() %>%
rownames_to_column("covariate") %>%
mutate(taxa_name = taxa_name)
df <- bind_rows(df, taxa_df)
}
df <- clean_names(df) %>%
select(covariate, estimate, std_error, t_value, fdr = pr_t, taxa_name)
return(df)
}
Then we’ll use this function to parse the results and filter to coefficients with a false discovery rate < 0.1.
glance_significant_models(corncob_genus) %>%
filter(fdr < 0.1) %>%
filter(!grepl("Intercept", covariate))
covariate estimate std_error t_value fdr taxa_name
1 mu.groupscd -1.4867950 0.4398837 -3.379973 0.07749496 GCF_000012825.1
2 mu.groupscd -0.8150979 0.2502086 -3.257673 0.08270706 GCF_009696065.1
3 mu.groupsnonibd -2.2554143 0.2852505 -7.906784 0.01562175 GCF_009696065.1
4 mu.groupscd 3.5106793 1.1003016 3.190652 0.08577909 GCA_003478935.1
5 mu.groupscd 3.4957077 1.1778541 2.967861 0.09725123 GCF_015557635.1
6 mu.groupscd 3.0423122 0.9160323 3.321185 0.07993957 GCF_900537995.1
7 mu.groupscd 3.5415046 1.1134170 3.180753 0.08624662 GCF_003478505.1
8 mu.groupsnonibd 3.2390747 1.0717325 3.022279 0.09425566 GCF_003478505.1
9 mu.groupscd 3.9035629 1.2645871 3.086828 0.09087019 GCF_015560765.1
Phyloseq has an …odd behavior of assigning a genome/OTU name instead of the lineage to aggregated taxa. The last parsing step we’ll perform is re-deriving the genus-level lineage for these results using the physeq_genus
taxonomy table.
physeq_genus_tax_table <- tax_table(physeq_genus) %>%
as.data.frame() %>%
rownames_to_column("taxa_name")
glance_significant_models(corncob_genus) %>%
filter(fdr < 0.1) %>%
filter(!grepl("Intercept", covariate)) %>%
left_join(physeq_genus_tax_table)
covariate estimate std_error t_value fdr taxa_name domain phylum class order family genus
1 mu.groupscd -1.4867950 0.4398837 -3.379973 0.07749496 GCF_000012825.1 d__Bacteria p__Bacteroidota c__Bacteroidia o__Bacteroidales f__Bacteroidaceae g__Phocaeicola
2 mu.groupscd -0.8150979 0.2502086 -3.257673 0.08270706 GCF_009696065.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Oliverpabstia
3 mu.groupsnonibd -2.2554143 0.2852505 -7.906784 0.01562175 GCF_009696065.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Oliverpabstia
4 mu.groupscd 3.5106793 1.1003016 3.190652 0.08577909 GCA_003478935.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__KLE1615
5 mu.groupscd 3.4957077 1.1778541 2.967861 0.09725123 GCF_015557635.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Fusicatenibacter
6 mu.groupscd 3.0423122 0.9160323 3.321185 0.07993957 GCF_900537995.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Roseburia
7 mu.groupscd 3.5415046 1.1134170 3.180753 0.08624662 GCF_003478505.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Ventrimonas
8 mu.groupsnonibd 3.2390747 1.0717325 3.022279 0.09425566 GCF_003478505.1 d__Bacteria p__Firmicutes_A c__Clostridia o__Lachnospirales f__Lachnospiraceae g__Ventrimonas
9 mu.groupscd 3.9035629 1.2645871 3.086828 0.09087019 GCF_015560765.1 d__Bacteria p__Actinobacteriota c__Coriobacteriia o__Coriobacteriales f__Coriobacteriaceae g__Collinsella
Instead of printing this dataframe to the console, we can write it to a file to use later, for example for visualization.
glance_significant_models(corncob_genus) %>%
filter(fdr < 0.1) %>%
filter(!grepl("Intercept", covariate)) %>%
left_join(physeq_genus_tax_table) %>%
write_tsv("corncob_results_genus.tsv")
Want to visualize these corncob results? See the next blog post on visualizing the taxonomic composition of metagenomes using sourmash and metacoder.
Corncob can do a lot more than what we’ve covered here. See the corncob documentation for more!
This post is part of a three part series:
This post includes outputs and code from the first blog post which shared a tutorial to go from raw metagenome reads to a phyloseq table using sourmash
gather
and sourmashtaxonomy
. This tutorial demonstrates how to do differential abundance analysis with the same data using corncob. Corncob works really well for microbial count data because it accounts for sparsity (lots of unobserved taxonomic lineages in samples) and variable sequencing depth.I don’t want to run the previous tutorial, I just want to play with this one
If you don’t feel like running through everything in the previous tutorial, you can download the sourmash
gather
and sourmashtaxonomy
results using the code below:Building the phyloseq object
In the previous tutorial, we built a phyloseq object from the sourmash
taxonomy
output. Corncob doesn’t require a phyloseq object to run, but it does play well with phyloseq objects. We’ll use that as a jumping off point, so I’ve included the code to build the phyloseq object below.Differential abundance testing with corncob
Once we have the phyloseq object
physeq
, running corncob is fairly straightforward. In this case, our two groups are stool microbiomes from individuals with Crohn’s disease (cd) and individuals without Crohn’s disease (nonIBD).Let’s start by testing for differential abundance for a specific genome that
sourmash gather
identified in our samples. We’ll test a genome that was identified in all of our samples. Using the below code block, we can take a look at 6 genomes that were identified in all samples.Randomly pulling out one of these genomes, let’s take a look at its taxonomic lineage:
This genome is assigned to
s__Lawsonibacter asaccharolyticus
in the GTDB taxonomy.Testing differential abundance for a single genome
Corncob builds a model for each genome/ASV/taxonomic lineage separately, so we can start by building a model for this genome specifically:
And we see an output that looks like this:
We can plot the results:
While our results show that this genome is significantly differentially abundant in CD when compared to nonIBD, this result doesn’t seem super biologically meaningful; visually, the mean abundances of the genome aren’t that different between groups.
The corncob vignette includes a lot more information about creating and interpretting models, especially for more complex experimental designs.
Does bbdml() produce an error?
See this GitHub issue for potential resolutions.
Testing differential abundance at different levels of taxonomy
Because corncob integrates well with phyloseq, we can agglomerate to different levels of taxonomy and test for differential abundance systematically. Let’s test for differential abundance for all genuses observed in our data set.
To bootstrap or not to bootstrap
The
B
parameter indifferentialTest()
controls the number of bootstrap iterations. If you have at least 30 samples, you probably don’t need to bootstrap and this will save you time when running the model. If you have fewer than 30 samples, it is strongly recommended that you bootstrap with 100-2000 bootstraps.Then we can plot the results, showing only taxa that were significantly different at an FDR cutoff of 0.1
By default, the
bbdml()
function does not do multiple testing correction because it only performs one test. If you choose to do multiple tests, be sure to include p value correction. In contrast,differentialTest()
does calculate the false discovery rate for all of the tests performed for that run.Using
corncob_genus$significant_models
, we can see the results of corncob. Below, we’ll parse the output into a dataframe for easier downstream use of the corncob results.First, we’ll write a function that takes the
differentialTest()
results object as input and parses it into a dataframe of taxa-labelled coefficients.Then we’ll use this function to parse the results and filter to coefficients with a false discovery rate < 0.1.
Phyloseq has an …odd behavior of assigning a genome/OTU name instead of the lineage to aggregated taxa. The last parsing step we’ll perform is re-deriving the genus-level lineage for these results using the
physeq_genus
taxonomy table.Instead of printing this dataframe to the console, we can write it to a file to use later, for example for visualization.
Want to visualize these corncob results? See the next blog post on visualizing the taxonomic composition of metagenomes using sourmash and metacoder.
Corncob can do a lot more than what we’ve covered here. See the corncob documentation for more!