Primary Functions
call_mutations()
Overview
Description: The call_mutations() function reads in the variant calls from the vcf file, translates the associated amino acids (for mutations within genes), and identifies any variants relative to the reference.
Common Options: Required options to call_mutations() include sample.dir, which is the path to a directory storing one or more vcf files (one for each sample to analyze). There should be no other files in this directory. Also required is information on the reference genome, which can be passed using the combination of the fasta.genome and bed options. Alternatively, if working with SARS-CoV-2, the reference option can be set to “Wuhan” to use a pre-formatted reference. To report information (primarily sequencing depth) on all mutations of interest, and not just those observed in the samples, set the write.all.targets option to TRUE and use lineage.muts to specify the lineage-associated mutations file, which must include the optional “Chr” and “Pos” columns (see above).
Output: Mutation data are reported in a data frame that by default contains the columns SAMP_NAME, CHR, POS, ALT_ID, AF, and DP, though a number of additional columns can be included (see the “Output” section below and the out.cols argument to call_mutations()). This data frame can also be written to a file, which is especially relevant if you’re analyzing a large number of samples or if the genome is large (causing longer run time for call_mutations(); see write.mut.table). The output from call_mutations(), either as an object in the working (global) environment or as a csv file (created with write.mut.table), is used as input for the functions explore_mutations() or estimate_lineages().
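As a minimal sketch of the common options described above, the arguments can be collected in a list and passed to call_mutations(). The two paths are placeholders chosen for illustration, not files provided with this document, and the call is only attempted when MixviR is installed and the directory exists.

```r
# Placeholder paths (assumptions for illustration): replace with your own locations.
vcf_dir <- "path/to/vcf_directory"           # should contain only the vcf files to analyze
lineage_file <- "path/to/lineage_muts.csv"   # needs the optional "Chr" and "Pos" columns

# Options named in the text above. With reference = "Wuhan" the pre-formatted
# SARS-CoV-2 reference is used, so fasta.genome and bed are not supplied here.
cm_args <- list(
  sample.dir = vcf_dir,
  reference = "Wuhan",
  write.all.targets = TRUE,
  lineage.muts = lineage_file
)

# Run only when MixviR is installed and the placeholder directory actually exists.
if (requireNamespace("MixviR", quietly = TRUE) && dir.exists(vcf_dir)) {
  muts <- do.call(MixviR::call_mutations, cm_args)
  head(muts)
}
```

For a genome other than SARS-CoV-2, the reference entry would be replaced by the fasta.genome and bed options.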
Methods
The call_mutations() function uses one or more vcf files as input, along with a MixviR reference object (data frame), and creates a data frame/table that stores all mutations identified in the sample(s), along with a customizable set of associated information about each mutation. This data frame can be written to a file and/or saved as an object in the global environment. In either case it is used as input to the explore_mutations() or estimate_lineages() functions.
call_mutations() first obtains the MixviR reference object (Fig. 6), which is created as part of the run if the fasta.genome and bed options are provided. Alternatively, if analyzing SARS-CoV-2, the reference option can be set to “Wuhan” and a pre-constructed reference will be used. In the case of overlapping genes, positions will be duplicated in this reference object, with a separate entry for each gene the nucleotide position is associated with.
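The duplication of positions for overlapping genes can be illustrated with a small base R sketch. The gene names, coordinates, and column names below are hypothetical and are not taken from MixviR's internal reference object; the coordinates are treated as 1-based and inclusive purely for simplicity.

```r
# Hypothetical gene coordinates (1-based, inclusive); geneA and geneB overlap at 5-6.
genes <- data.frame(
  gene  = c("geneA", "geneB"),
  start = c(1L, 5L),
  end   = c(6L, 9L),
  stringsAsFactors = FALSE
)

# Build one row per (position, gene) pair, so a shared position appears once per gene.
ref_rows <- do.call(rbind, lapply(seq_len(nrow(genes)), function(i) {
  data.frame(
    POS  = seq(genes$start[i], genes$end[i]),
    GENE = genes$gene[i],
    stringsAsFactors = FALSE
  )
}))
ref_rows <- ref_rows[order(ref_rows$POS, ref_rows$GENE), ]
rownames(ref_rows) <- NULL

# Positions listed more than once are those covered by more than one gene.
dup_pos <- unique(ref_rows$POS[duplicated(ref_rows$POS)])
```

Here positions 5 and 6 each get two rows, one for geneA and one for geneB, so a variant at either position can be translated in the context of both genes.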
MixviR then reads in the set of files to be analyzed. These should be the only files stored in the directory given by the sample.dir option. In most cases, samples will be provided in variant call format (vcf). These vcf files need to include the DP and AD flags in the FORMAT field. Relevant information from each vcf file is extracted with functionality from vcfR (Knaus and Grünwald, 2017). If the write.all.targets option will be used to report sequencing depths for genomic positions associated with a priori-defined mutations that don’t occur in the sample, all positions should be included in the input vcf file(s). Otherwise, only variant positions need to be included.
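To make the DP and AD requirement concrete, the sketch below parses a tiny made-up vcf body in base R, checks that both keys are present in the FORMAT column, and extracts the values for one sample. MixviR itself uses vcfR for this step; the helper name parse_record and the calculation of allele frequency as alternate reads divided by DP are assumptions for illustration, not necessarily how MixviR computes AF.

```r
# A tiny made-up vcf for one sample (header lines start with "#").
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
  "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:30,10:40",
  "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:AD:DP\t0/1:15,5:20"
)

# Drop header lines and split each record into its tab-separated columns.
records <- vcf_lines[!startsWith(vcf_lines, "#")]
fields <- strsplit(records, "\t", fixed = TRUE)

# Hypothetical helper: pair FORMAT keys with the sample's values for one record.
parse_record <- function(f) {
  keys <- strsplit(f[9], ":", fixed = TRUE)[[1]]    # FORMAT column
  vals <- strsplit(f[10], ":", fixed = TRUE)[[1]]   # first sample column
  names(vals) <- keys
  if (!all(c("DP", "AD") %in% keys)) {
    stop("FORMAT field lacks DP and/or AD at position ", f[2])
  }
  ad <- as.numeric(strsplit(vals[["AD"]], ",", fixed = TRUE)[[1]])  # ref, alt read counts
  dp <- as.numeric(vals[["DP"]])
  data.frame(CHR = f[1], POS = as.integer(f[2]), DP = dp,
             AF = ad[2] / dp, stringsAsFactors = FALSE)
}

calls <- do.call(rbind, lapply(fields, parse_record))
```

A record whose FORMAT column lacks either key stops with an error, which mirrors the requirement that input vcf files carry both DP and AD.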
MixviR loops over the set of input files, sequentially calling mutations from each and appending them to a master data frame that stores all mutations. The process of calling mutations for each sample involves several steps. The overall sequencing depths at each position in the input file are first added to the reference object (Fig. 7, column ‘DP’; note that all objects shown in Figs 7-10 are temporary objects created during a MixviR analysis and are not directly available to the user). Depths are ‘NA’ for any positions not in the vcf input file.
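The per-sample loop and the handling of missing depths can be sketched in base R as follows. The data frames stand in for the contents of two vcf files, and all object names here are hypothetical; this shows the general pattern (left join onto the reference, then append to a master data frame) rather than MixviR's actual internal code.

```r
# Hypothetical reference positions and per-sample depths (stand-ins for vcf contents).
ref <- data.frame(POS = 1:5)
sample_depths <- list(
  sampleA = data.frame(POS = c(2L, 4L), DP = c(120, 80)),
  sampleB = data.frame(POS = c(1L, 2L, 5L), DP = c(60, 75, 90))
)

all_samples <- data.frame()  # master data frame, grown one sample at a time
for (samp in names(sample_depths)) {
  # Left join: keep every reference position; positions absent from the
  # sample's input get DP = NA.
  with_dp <- merge(ref, sample_depths[[samp]], by = "POS", all.x = TRUE)
  with_dp$SAMP_NAME <- samp
  all_samples <- rbind(all_samples, with_dp)
}
```

In this toy input, sampleA has NA depths at positions 1, 3, and 5, and sampleB at positions 3 and 4, matching the rule that depths are NA for any position not present in the input file.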