How do i quantify amino acid sequences from a metagenome?

pipeline_question.utf8

I primarily work with two types of data: (meta)transcriptomes and metagenomes. When I’m working with (meta)transcriptomes, I love using Salmon for transcript quantification. It’s fast, lightweight, and accurate.

I’ve sometimes used Salmon in metagenome-land as well, although its less clear that this is an appropriate use of the tool (see the list of references below for people poking around in this area).

As a potentially even less appropriate use of the tool, I recently used Salmon to quantify gene abundance in a BAM file. Although quantification of a BAM file is explicitly supported by Salmon, I used an unconventional BAM file. Instead of nucleotide sequences, my BAM file contained amino acid sequences. I’ve included a snippet of the (SAM) file below.

SQ     SN:hu-genome19_SRR1976948.14137953      LN:52
@SQ     SN:hu-genome19_SRR1976948.6788822       LN:45
@SQ     SN:hu-genome19_SRR1976948.11194291      LN:73
@SQ     SN:hu-genome19_SRR1976948.15123609      LN:50
@SQ     SN:hu-genome19_SRR1976948.14180782      LN:69
@SQ     SN:hu-genome19_SRR1976948.8401008       LN:65
@SQ     SN:hu-genome19_SRR1976948.3531827       LN:59
@SQ     SN:hu-genome19_SRR1976948.14899353      LN:73
@PG     ID:PALADIN      PN:PALADIN      VN:1.4.6        CL:paladin align -f 125 -t 2 outputs/cd-hit95/hu-genome19.cdhit95.faa inputs/reads/hu-s1_k31_r1_search_oh0/hu-genome19.fa.cdbg_id
0:1:1:SRR1976948.461589 0       hu-genome19_SRR1976948.11334561 16      60      83M     *       0       0       DALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMA
1:3:3:SRR1976948.498798 0       hu-genome19_SRR1976948.11334561 30      60      65M     *       0       0       QASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKP       *
2:5:5:SRR1976948.533863 0       hu-genome19_SRR1976948.11334561 17      60      83M     *       0       0       ALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMAL
3:1:1:SRR1976948.542297 0       hu-genome19_SRR1976948.11334561 7       60      83M     *       0       0       EKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLD
4:1:1:SRR1976948.547977 0       hu-genome19_SRR1976948.11334561 3       60      83M     *       0       0       IVRNEKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPN
5:5:5:SRR1976948.569142 0       hu-genome19_SRR1976948.11334561 30      60      83M     *       0       0       QASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKPFVIKKILD
6:5:5:SRR1976948.571968 0       hu-genome19_SRR1976948.11334561 1       60      80M     *       0       0       DVIVRNEKRMLQEAVDALIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVG
7:2:2:SRR1976948.592880 0       hu-genome19_SRR1976948.11334561 18      60      83M     *       0       0       LIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALE
8:3:3:SRR1976948.592880 0       hu-genome19_SRR1976948.11334561 18      60      83M     *       0       0       LIDNNMRKSTTTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALE
9:0:0:SRR1976948.593583 0       hu-genome19_SRR1976948.11334561 28      60      83M     *       0       0       TTQASTGGRRLLKSLADMLQGKSGRFRQNLLGKRVDYSGRSVIVVGPNLKLDECGVPKKMALELFKPFVIKKI

This BAM file was produced by the pipeline depicted below. I chose to assemble my metagenome reads in amino acid space with PLASS because this generally produces more assembled genes than nucleotide assemblers. Then, PALADIN maps the nucleotide reads to the amino acid sequences assembled by PLASS. Because I’m mapping reads to the amino acid sequences constructed from the reads themselves, I don’t need to worry about evolutionary distance here.

Additionally, Salmon expects to quantify against a transcriptome. In a way, the output of PLASS is similar to a transcriptome in that outputs the open reading frames from the metagenome, not the full genome assembly.

My end-goal is to measure differences in abundance of amino acid sequences between groups of metagenome samples.

I want to know, are there good reasons why I should not be using Salmon to do this? Are there good alternatives that I haven’t thought of?

I feel confidant in using PLASS and PALADIN for the first part of this pipeline. Although I was able to get Salmon to work on a small prototype, I’m not sure that I should be using it for this application. It seems vaguely concerning, but I don’t have a specific reason to not do this.

An example of my workflow is available at https://github.com/taylorreiter/2020-aa-abund. I’ve included Salmon log files and aux info in the repository as well.