I primarily work with two types of data: (meta)transcriptomes and metagenomes. When I’m working with (meta)transcriptomes, I love using Salmon for transcript quantification. It’s fast, lightweight, and accurate.
I’ve sometimes used Salmon in metagenome-land as well, although its less clear that this is an appropriate use of the tool (see the list of references below for people poking around in this area).
- https://github.com/COMBINE-lab/salmon/issues/195
- https://github.com/COMBINE-lab/salmon/issues/330
- https://microbiomejournal.biomedcentral.com/track/pdf/10.1186/s40168-019-0684-8
- https://support.bioconductor.org/p/116742/
- https://2017-ucsc-metagenomics.readthedocs.io/en/latest/salmon_tutorial.html
As a potentially even less appropriate use of the tool, I recently used Salmon to quantify gene abundance in a BAM file. Although quantification of a BAM file is explicitly supported by Salmon, I used an unconventional BAM file. Instead of nucleotide sequences, my BAM file contained amino acid sequences. I’ve included a snippet of the (SAM) file below.
SQ SN:hu-genome19_SRR1976948.14137953 LN:52
@SQ SN:hu-genome19_SRR1976948.6788822 LN:45
@SQ SN:hu-genome19_SRR1976948.11194291 LN:73
@SQ SN:hu-genome19_SRR1976948.15123609 LN:50
@SQ SN:hu-genome19_SRR1976948.14180782 LN:69
@SQ SN:hu-genome19_SRR1976948.8401008 LN:65
@SQ SN:hu-genome19_SRR1976948.3531827 LN:59
@SQ SN:hu-genome19_SRR1976948.14899353 LN:73
@PG ID:PALADIN PN:PALADIN VN:1.4.6 CL:paladin align -f 125 -t 2 outputs/cd-hit95/hu-genome19.cdhit95.faa inputs/reads/hu-s1_k31_r1_search_oh0/hu-genome19.fa.cdbg_id
0:1:1:SRR1976948.461589 0 hu-genome19_SRR1976948.11334561 16 60 83M * 0 0 DALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMA
1:3:3:SRR1976948.498798 0 hu-genome19_SRR1976948.11334561 30 60 65M * 0 0 QASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKP *
2:5:5:SRR1976948.533863 0 hu-genome19_SRR1976948.11334561 17 60 83M * 0 0 ALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMAL
3:1:1:SRR1976948.542297 0 hu-genome19_SRR1976948.11334561 7 60 83M * 0 0 EKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLD
4:1:1:SRR1976948.547977 0 hu-genome19_SRR1976948.11334561 3 60 83M * 0 0 IVRNEKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPN
5:5:5:SRR1976948.569142 0 hu-genome19_SRR1976948.11334561 30 60 83M * 0 0 QASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKPFVIKKILD
6:5:5:SRR1976948.571968 0 hu-genome19_SRR1976948.11334561 1 60 80M * 0 0 DVIVRNEKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVG
7:2:2:SRR1976948.592880 0 hu-genome19_SRR1976948.11334561 18 60 83M * 0 0 LIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALE
8:3:3:SRR1976948.592880 0 hu-genome19_SRR1976948.11334561 18 60 83M * 0 0 LIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALE
9:0:0:SRR1976948.593583 0 hu-genome19_SRR1976948.11334561 28 60 83M * 0 0 TTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKPFVIKKI
This BAM file was produced by the pipeline depicted below. I chose to assemble my metagenome reads in amino acid space with PLASS because this generally produces more assembled genes than nucleotide assemblers. Then, PALADIN maps the nucleotide reads to the amino acid sequences assembled by PLASS. Because I’m mapping reads to the amino acid sequences constructed from the reads themselves, I don’t need to worry about evolutionary distance here.
Additionally, Salmon expects to quantify against a transcriptome. In a way, the output of PLASS is similar to a transcriptome in that outputs the open reading frames from the metagenome, not the full genome assembly.
My end-goal is to measure differences in abundance of amino acid sequences between groups of metagenome samples.
I want to know, are there good reasons why I should not be using Salmon to do this? Are there good alternatives that I haven’t thought of?
I feel confidant in using PLASS and PALADIN for the first part of this pipeline. Although I was able to get Salmon to work on a small prototype, I’m not sure that I should be using it for this application. It seems vaguely concerning, but I don’t have a specific reason to not do this.
An example of my workflow is available at https://github.com/taylorreiter/2020-aa-abund. I’ve included Salmon log files and aux info in the repository as well.