MixviR_v3.5.0

Mike Sovic

2022-10-21

Introduction

MixviR is a package designed to aid in exploring and visualizing genomic and amino-acid level variation obtained from microbial high-throughput sequencing samples, including samples that contain mixtures of genotypes. The package was originally written to detect and estimate relative frequencies of SARS-CoV-2 lineages (variants) in samples obtained from environmental sources, but can be applied to any microbial taxon.

Quick Start

Inputs

Required

  1. Sample VCFs - one for each sample to be analyzed. Must contain “DP” and “AD” in FORMAT field. Can be “gz” compressed.
  2. Reference genome file (fasta; remove spaces and underscores ’_’ from chromosome names).
  3. File (bed format) defining translated regions of the genome. 6 columns: chromosome, feature start position, feature end position, feature name, score (not used - can enter “.”), and strand. Column names should not be included. Ensure that chromosome names in this file match those in the fasta reference exactly.

     **Pre-constructed reference information is available for SARS-CoV-2 (Covid-19) and can be specified
      as an argument to call_mutations(). In that case, only the sample VCFs are required.

Optional

Lineage-Associated Mutations File

The lineage-associated mutations (lineage.muts) file (csv) provides mutations/amino acid substitutions associated with lineages of interest. Requires columns named “Gene”, “Mutation”, and “Lineage”. Three additional columns can be included named “Sublineage” “Chr” and “Pos”. If “Chr” and “Pos” are included, make sure the chromosome names match those in the reference and bed files (no spaces, remove underscores). The file (including all optional colummns) should look like…

*Figure 1*. Example lineage-associated mutation file.

Figure 1. Example lineage-associated mutation file.

Location/Date File

The location/date (dates) file (csv) associates dates and locations with specific samples for cases where samples are obtained at various time points at one or more locations. Contains columns “SAMP_NAME”, “LOCATION”, and “DATE”. Dates are given in mmddyyyy format, and the file should look like…

*Figure 2*. Example location/date file.

Figure 2. Example location/date file.

Typical Workflow

Run call_mutations() to identify mutations in the sample(s). Required arguments:

Run estimate_lineages() to generate a table with summarized results for each sample. Required arguments:

Run explore_mutations() to interactively visualize results. Required arguments:

   Include one or both of the following if available:

Full Documentation

Inputs

Required

There are three required inputs for running MixviR in its most basic form…

Sample Data

Sample data are provided as variant call format (vcf) files - one for each sample to be analyzed. These vcf files are expected to contain the “DP” and “AD” flags in the FORMAT field. Illumina’s DRAGEN software, bcftools (mpileup/call; Danecek et al 2021) and the GATK (DePristo et al 2011) offer some widely-used workflows that can generate these vcf files. Below is some example bash code I’ve used for generating these files (most arguments can be customized as needed - including the FORMAT/AD AND FORMAT/DP fields in the output is important, if not there by default)…

bcftools example

[*path to bcftools*]/bcftools mpileup -f *fasta reference* -d 4000 -q 60 -Q 30 -L 4500 /
--ff UNMAP,SECONDARY,QCFAIL /
-a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,INFO/AD,INFO/ADF,INFO/ADR /
*input.bam* | [*path to bcftools*]/bcftools call -m -A -Ov -o out_temp.vcf

[*path to bcftools*]/bcftools norm out_temp.vcf -c w -f *fasta reference* /
-m -both -Ov -o out.vcf

Reference Files

The other two necessary inputs define the reference information for the taxon of interest. They include…

  1. A fasta-formatted reference genome file (remove spaces and underscores ’_’ from chromosome names).
  2. An associated bed-formatted file that defines the translated regions of the genome (ORFs/genes). This bed file should be tab delimited with 6 columns: chromosome, feature start position, feature end position, feature name, score (not used - can enter “.”), and strand (+/-). Column names should not be included. Ensure that the chromosome names in this file match those in the reference fasta file. An example of this file is…
*Figure 3*. Example features (bed) input file.

Figure 3. Example features (bed) input file.

Pre-constructed reference information is available for SARS-CoV-2 (Covid-19), and can be utilized with the reference option to call_mutations(). In that case, only sample vcf files are necessary as user-provided inputs.

Optional

There are two optional input files…

Lineage-associated mutations

The lineage-associated mutations file (lineage.muts) provides mutations associated with lineages/groups of interest. It requires columns named “Gene”, “Mutation”, and “Lineage”. Three additional columns can be optionally included: “Sublineage”, “Chr”, and “Pos”. If including the “Sublineage” column, the name of the sublineage associated with a given mutation can be provided. Mutations associated with a main/primary lineage should have “NA” in this column. If the “Chr” and “Pos” columns are included, make sure the chromosome names match those in the reference and bed input files (including no spaces or underscores). The file (including all optional columns) should look like…

*Figure 4*. Example lineage-associated mutation file.

Figure 4. Example lineage-associated mutation file.

It’s important that the syntax of the mutations in this file matches that used by MixviR - see details on naming conventions under the sections “SNP-Based Mutation Identification”,” “Indel-Based Mutation Identification”, and “Nonsense Mutation Identification” below. The “Chr” and “Pos” columns store the chromosome and associated genomic position giving rise to the mutation. These columns are needed if, as part of the results, you want the program to return the sequencing depth at the position when the mutation is not observed for a sample. This information could be relevant in determining whether a target mutation not observed in a sample is absent because it doesn’t occur in the sample, or because the sequencing coverage at that position is insufficient. If these columns are included, make sure the chromosome names match those in the reference genome (see note on naming chromosomes in section Cautions/Important Considerations below).

Location/Date File

In cases where samples are taken from the same location at different time points, the temporal information can be included by providing a csv location/date file (dates) that associates the sample dates and locations with each unique sample name. The file should contain columns named “SAMP_NAME”, “LOCATION”, and “DATE”, and should look like…

*Figure 5*. Example location/date file.

Figure 5. Example location/date file.

Primary Functions

call_mutations()

Overview

  • Description: The call_mutations() function reads in the variant calls from the vcf file, translates the associated amino acids (for mutations within genes), and indentifies any variants relative to the reference.

  • Common Options: Required options to call_mutations() include sample.dir, which is the path to a directory storing one or more vcf files (one for each sample to analyze). There should be no other files in this directory. Also required is information on the reference genome, which can be passed using the combination of fasta.genome and bed options. Alternatively, if working with SARS-CoV-2, the reference option can be set to “Wuhan” to use a pre-formatted reference. To report information (primarily sequencing depth) on all mutations of interest and not just those that are observed in the samples, the write.all.targets option can be set to TRUE, and the lineage-associated mutations file including the optional “Chr” and “Pos” columns (see above) must be specified with lineage.muts.

  • Output: Mutation data are reported in a data frame that by default contains the columns SAMP_NAME, CHR, POS, ALT_ID, AF, & DP, though a number of additional columns can be included (see “Output” section below, and out.cols argument to call_mutations()). This object (data frame) can also be written to a file, which is especially relevant if you’re analyzing a large number of samples or if the genome is large (causing longer run time for call_mutations(); see write.mut.table). The output from call_mutations(), either as an object in the working (global) environment, or as a csv file (created with write.mut.table), is used as input for the functions explore_mutations() or estimate_lineages().

Methods

The call_mutations() function uses one or more vcf files as input, along with a MixviR reference object (data frame) and creates a data frame/table that stores all mutations identified in the sample(s), along with a customizable set of associated information about each mutation. This data frame can be written to a file, and/or saved as an object in the global environment. In either case it is used as input to the explore_mutations() or estimate_lineages() functions.

call_mutations() first obtains the MixviR reference object (Fig 6), which is created as part of the run if the fasta.genome and bed options are provided. Alternatively, if analyzing SARS-CoV-2, the reference option can be set to “Wuhan” and a pre-constructed reference will be used. In the case of overlapping genes, positions will be duplicated in this reference object, with a separate entry for each gene the nucleotide position is associated with.

*Figure 6*. Example of the initial *MixviR* reference object. Two non-genic positions and 6 genic positions, representing 2 amino acids, are shown.

Figure 6. Example of the initial MixviR reference object. Two non-genic positions and 6 genic positions, representing 2 amino acids, are shown.

MixviR then reads in the set of files to be analyzed. These should be the only files stored in the directory given by the sample.dir option. In most cases, samples will be provided in variant call format (vcf). These vcf files need to include the DP and AD flags in the FORMAT field. Relevant information from each vcf file is extracted with functionality from vcfR (Knaus and Grünwald, 2017). If the write.all.targets option will be used to report sequencing depths for genomic positions associated with a priori-defined mutations that don’t occur in the sample, all positions should be included in the input vcf file(s). Otherwise, only variant positions need to be included.

MixviR loops over the set of input files, sequentially calling mutations from each and appending them to a master data frame that stores all mutations. The process of calling mutations for each sample involves several steps. The overall sequencing depths at each position in the input file are first added to the reference object (Fig. 7, column ‘DP’; note that all objects shown in Figs 7-10 are temporary objects created during a MixviR analysis and are not directly available to the user). Depths are ‘NA’ for any positions not in the vcf input file.