nucmer_viz3.utf8.md

Updated 05/11/19

I have simplified the code and included installation instructions. If you prefer the previous tutorial, you can find the code here.

Here we’ll visualize the nucmer alignment of two Methanosaeta harundinacea genomes. The first, M. harundinacea 6AC is an isolate genome. The second, M. harundinacea MAG07 is a metagenome assembled genome bin. This visualization will help us see how different the two genomes are.

In principle, this visualization method should work with any two multifastas. I have noticed some visual inconsistencies with the plots when alignments are short and plentiful, so keep that in mind if you try it out.

Installing mummer

conda create -n mummer 
conda activate mummer
conda install -c bioconda mummer4=4.0.0beta2

Running nucmer

To download the test data, run:

# M. harundinacea 6AC
wget -O mh6ac.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/235/565/GCA_000235565.1_ASM23556v1/GCA_000235565.1_ASM23556v1_genomic.fna.gz
gunzip mh6ac.fasta.gz

# M. harundinacea MAG07
wget -O mag07.fasta https://osf.io/d9qyg/download 

The general structure of the nucmer command looks like this:

nucmer --mum reference.fasta query.fasta -p query_ref_nucmer

We will use the genbank assembly as a reference, and the metagenome assembled genome bin as the query.

nucmer --mum mh6ac.fasta mag07.fasta -p m_harundinacea

Here, we filter the nucmer output to only include alignment of length 1000. This is arbitrary, and you should use a length that makes sense for your biological question.

delta-filter -l 1000 -q m_harundinacea.delta > m_harundinacea_filter.delta
show-coords -c -l -L 1000 -r -T m_harundinacea_filter.delta > m_harundinacea_filter_coords.txt

Other multifasta information needed for plotting

Other than the output produced by NUCmer, a bit more information is needed about the reference multifasta to produce an informative plot. The table produced by show-coords does not have the length of the reference contig as a metadata column. Further, because not all reference contigs may have queries that align, a reference contig may be missing from the show-coords output. By generating fasta headers and lengths we can be sure that all of the reference contigs are plotted.

samtools faidx mh6ac.fasta 

This produces two files, mh6ac.fasta.fai. There are five columns that represent the following information:

NAME          Name of this reference sequence
LENGTH      Total length of this reference sequence, in bases
OFFSET      Offset in the FASTA/FASTQ file of this sequence's first base
LINEBASES     The number of bases on each line
LINEWIDTH     The number of bytes in each line, including the newline
QUALOFFSET  Offset of sequence's first quality within the FASTQ file

Plotting in R

We can now create a circle plot that shows where the query multifasta overlaps with the reference multifasta.

Some pre-formatting is necessary to read all the data in and get it into the right format (GRanges objects). I’ve made a few functions that will be grouped together into a larger function to accomplish these tasks.

If you don’t already have the packages we will be using installed, you can run the following code.

install.packages(c("devtools", "ggplot2"))

library(devtools)
install_github("timflutre/rutilstimflutre")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("ggbio", "GenomicRanges")

load-coords loads the filtered NUCmer output into R as a GRanges object. You can set, and only includes the alignments where percent identity was 100%. It takes advantage of a function written by Timothee Flutre to load the NUCmer object in as a GRanges object.

load_coords <- function(coords_file, perc.id) {
  library(rutilstimflutre)
  library(GenomicRanges)
  
  coords <- loadMummer(coords_file, algo = "nucmer")                    # read in nucmer results as a GRanges obj
  coords <- coords[(elementMetadata(coords)[ , "perc.id"] >= perc.id)]  # filter entries with perc.id != 100
  seqlevels(coords) <- seqlevelsInUse(coords)                           # drop seq levels that are no longer used
  coords$qry.name <- names(coords)                                      # set column with query contig name
  coords$qry <- rep("Query", nrow(values(coords)))

  return(coords)
}

faidx_to_GRanges takes the faidx file we created for the reference sequence and turns it into a GRanges object. This will become the outer circle of the plot.

faidx_to_GRanges <- function(faidx_file){
  faidx <- read.table(file = faidx_file, header = F, stringsAsFactors = F,
                      col.names = c("name", "contig.len", "offset", 
                                    "linebases", "linewidth"))
  
  # create a GRanges object for reference sequences
  grange <- GRanges(seqnames = Rle(faidx$name), 
                    ranges = IRanges(start = rep(1, nrow(faidx)), 
                                     end = faidx$contig.len))
  
  # add seqlengths to the reference GRanges object
  seqlengths(grange) <- faidx$contig.len
  genome(grange) <- "Reference"
  grange <- sortSeqlevels(grange)
  grange <- sort(grange)
  return(grange)
}

circular_plot_with_ref contains the ggbio and ggplot code we will use to make the circle plot.

circular_plot_w_ref <- function(reference_GRange, NUCmer_coords){
  # reference_GRange: reference sequence GRanges obj
  # NUCmer_coords: a GRanges object produced by reading in a show-coords processed NUCmer object.
  
  library(ggplot2)
  library(ggbio)
  
  p <- ggbio() + 
    circle(NUCmer_coords, geom = "rect", 
           aes(color = "steelblue", fill = "steelblue")) +  # NUCmer obj
    circle(reference_GRange, geom = "ideo", 
           aes(color = "gray70", fill = "gray70")) +        # Ideogram of ref genome
    circle(reference_GRange, geom = "scale", 
           scale.type = "M", size = 1.5) +                  # Scale from seqlen of ref genome
    # circle(reference_GRange, geom = "text", 
    #       aes(label = seqnames), size = 2) +              # Uncomment for sequence label
    scale_color_manual(name = "Sequence Origin", 
                       labels = c("Reference", "Query"), 
                       values = c("gray70", "steelblue")) +
    scale_fill_manual(name = "Sequence Origin", 
                      labels = c("Reference", "Query"), 
                      values = c("gray70", "steelblue")) +
    ggtitle("Reference vs. Query")
  return(p)
}

load_and_plot_nucmer puts these functions together to produce a plot. It reads in the NUCmer object, creates a GRanges object for the reference, and plots the two GRanges objects together.

load_and_plot_nucmer_w_ref <- function(NUCmer_coords_file, ref_faidx_file, perc.id) {
  # NUCmer_coords_file: string for file path of NUCmer output produced by show coords 
  # ref_faidx_file: reference sequence GRanges obj
  # perc.id: percent id cutoff to show on plot
  
  library(GenomicRanges)
  
  NUCmer_coords <- load_coords(NUCmer_coords_file, perc.id = perc.id)   # Make GRanges obj of nucmer output file
  referenceGR <- faidx_to_GRanges(faidx_file = ref_faidx_file)
  plot <- circular_plot_w_ref(reference_GRange = referenceGR, NUCmer_coords = NUCmer_coords)
  return(plot)
}

And now we can run this function and produce a circle plot.

load_and_plot_nucmer_w_ref(NUCmer_coords_file = "m_harundinacea_filter_coords.txt", 
                           ref_faidx_file = "mh6ac.fasta.fai", 
                           perc.id = 87)
## nb of references: 1
## nb of queries: 116
## nb of alignments: 340

The plot has three tracks. The outer-most track is the scale, and it indicates the size of the contigs in the reference cluster. The gray track is an ideogram of the sequences in the reference. The blue track shows the locations of overlap between the query and the reference. When the blue track is two layers deep, there are two sequences in the query that overlap the same location on the reference.