\name{lod_GWAS}
\alias{lod_GWAS}
\title{
  Genome-Wide Association Analysis Accounting for Limit of Detection
}
\description{
  \code{lod_GWAS} performs a genome-wide association study (GWAS)
  of a biomarker whose measurements are subject to a limit of
  detection (LOD). It fits a parametric survival model to the
  phenotype of interest, using both measured and censored values.
  
  \code{\link{lod_QC}} is automatically called within \code{lod_GWAS},
  and its quality report will be saved in a separate text file.
}
\usage{
lod_GWAS(phenofile, pheno_name,
        basic_model = NULL,
        dist = "gaussian",
        mapfile, genofile,
        outputfile,
        filedirectory = getwd(),
        outputheader = "QCGWAS", gzip_output = TRUE,
        lower_limit = NA, upper_limit = NA)
}
\arguments{
  \item{phenofile}{
    Either a data frame containing the phenotype (and covariate)
    values, or the filename (including the extension) of a data
    file containing the same. See below for information on the
    required format.
}
  \item{pheno_name}{
    The name of the column in phenofile that contains the phenotype
    values.
}
  \item{basic_model}{
    A formula (coded as a character string) describing the basic
    model, not including the genetic component. The covariates
    to include in the analysis are given as a single quoted
    string, separated by plus signs: for example,
    \code{basic_model = "sex+age"}. Covariate names must exactly
    match the corresponding column names of the phenotype file.
    The default is \code{NULL}, in which case the association is
    modelled without covariates.
}
  \item{dist}{
    Assumed distribution for (raw or transformed)
    phenotype. The options are \code{weibull},
    \code{exponential}, \code{gaussian}, \code{logistic},
    \code{lognormal} and \code{loglogistic}. Default is
    \code{gaussian}. For more information, see the function
    \code{psm} of package \code{rms}.
}
  \item{mapfile}{
    The file name of the genotype map file (including the file
    extension). See below for information on the
    required format.
}
  \item{genofile}{
    The file name of the genotype dosage file (including the
    file extension). See below for information on the
    required format.
}
  \item{outputfile}{
    The name for the output file.
}
  \item{filedirectory}{
    The directory that contains the phenotype and genotype files
    and where the output file will be saved. Please note that R
    uses \emph{forward} slash (/) where Windows uses backslash
    (\\). The default is the current R working directory.
}
  \item{outputheader}{
    The output format of the analysis results file, to make it
    compatible with different software packages. The options are
    \code{"QCGWAS"}, \code{"GWAMA"}, \code{"PLINK"},
    \code{"META"}, and \code{"GenABEL"}. Default is
    \code{"QCGWAS"}.
}
  \item{gzip_output}{
    Logical; determines whether the output file is compressed.
    Default is \code{TRUE}.
}
  \item{lower_limit, upper_limit}{
    Arguments passed to \code{\link{lod_QC}}. Specifying the limits
    of detection allows \code{\link{lod_QC}} to check whether the
    phenotype and \code{outsideLOD} columns have been coded
    correctly. Please note that these arguments are only used for
    a quality check. Any errors will be reported but \emph{not}
    corrected. Default is \code{NA}.
}
}
\details{
  \code{lod_GWAS} is the main function of the package. It performs
  a genome-wide association study (GWAS) of a phenotype that is
  subject to a limit of detection (LOD). Non-detects are treated
  as censored data (left-censored, right-censored, or both), and
  a parametric survival model is fitted to the phenotype of
  interest, using both measured and censored values.
}
\value{
  \code{lod_GWAS} returns an invisible \code{NULL}. The actual
  output consists of the association results (saved as
  \code{[output_file].txt}) and the log file generated by
  \code{\link{lod_QC}} (saved as \code{[output_file].txt.log}).
}
\section{Input File Format}{
  An analysis with \code{lod_GWAS} requires two files for the
  genotypes and one phenotype file. The files can be either
  space or tab delimited. The package also accepts files
  compressed in the gzip format (extension .gz).

  \emph{Genotype Files}
  
  \code{lodGWAS} uses the PLINK dosage format for the genotype
  data. This means that two files are needed: one with the
  genotypes themselves (genotype dosage file), and one with the
  locations of the genetic variants (map file).
  
  \emph{Genotype Dosage File}
  
  The genotype dosage file must start with a header line of the
  following form:
  
  \code{SNP A1 A2 FID1 IID1 FID2 IID2 ... FIDn IIDn}
  
  The first three columns must appear before the dosage data. The
  following columns are the family identifier (FID) and the
  individual identifier (IID) of individuals 1 to n. Thus, the
  number of columns of the header line should be exactly 3 +
  (2 x n_individuals).
  
  The next lines contain the actual genetic data per individual,
  with each row corresponding to a genetic variant. The PLINK
  dosage format can be any of three formats: dosage,
  two-probabilities, or three-probabilities (see below). lodGWAS
  accepts all three formats and will automatically recognize
  whether there are one (dosage), two (two-probabilities), or
  three (three-probabilities) columns per individual. In case
  of any other format it will report that it cannot recognize
  the format and will not run.
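  
  As a rough illustration of this rule, the format could be inferred
  by comparing the number of fields in a data line with the number
  of individuals implied by the header line. This is only a sketch,
  not the code used by the package; the helper name
  \code{guess_dosage_format} is hypothetical.
  
  \preformatted{
# Hypothetical helper (not part of lodGWAS): infer the PLINK dosage
# sub-format from the header line and one data line of a dosage file.
guess_dosage_format <- function(header_line, data_line) {
  split_fields <- function(x) strsplit(trimws(x), "[[:space:]]+")[[1]]
  n_header <- length(split_fields(header_line))
  n_data <- length(split_fields(data_line))
  # The header holds SNP, A1, A2 plus one FID/IID pair per individual
  n_ind <- (n_header - 3) / 2
  # Number of value columns per individual in the data line
  per_ind <- (n_data - 3) / n_ind
  formats <- c("dosage", "two-probabilities", "three-probabilities")
  idx <- match(per_ind, 1:3)
  if (n_ind < 1 || n_ind != round(n_ind) || is.na(idx)) {
    stop("cannot recognize the genotype dosage format")
  }
  formats[idx]
}

# Three individuals, one value each
guess_dosage_format("SNP A1 A2 FID1 IID1 FID2 IID2 FID3 IID3",
                    "rs0001 A C 0.08 0.72 1.99")   # returns "dosage"
  }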
  
  \emph{Dosage format}
  
  A dosage is provided in one column per individual. Each dosage
  is a number between 0 and 2. A dosage of 0, 1, or 2 means that
  the individual is homozygous for the A2 allele, heterozygous,
  or homozygous for the A1 allele, respectively. When the
  genetic dataset is expanded using imputation, non-integer
  values are also possible, and are defined as the weighted sum
  of genotype probabilities (i.e. 0 x prob(A2/A2) + 1 x prob(A1/A2)
  + 2 x prob(A1/A1) ).
  The number of columns of the (non-header) lines in a genotype
  file in dosage format should be exactly 3 + n_individuals.
  
  Example of the dosage format:  
  
  \code{      SNP  A1  A2   FID1 IID1 FID2 IID2 FID3 IID3}
  
  \code{   rs0001  A   C    0.08      0.72      1.99}
  
  \emph{Two-probabilities format}
  
  Two numbers are provided per individual, representing the
  probabilities of the A1/A1 and A1/A2 genotypes, respectively.
  The probability of A2/A2 equals
  1 minus the sum of Prob(A1/A1) and Prob(A1/A2). Each
  probability is a number between 0 and 1. The number of columns
  of the (non-header) lines in a genotype file in two-probabilities
  format should be exactly  3 + (2 x n_individuals).
  
  Example of the two-probabilities format:
  
  \code{      SNP  A1  A2   FID1 IID1   FID2 IID2}
  
  \code{   rs0001   A   C   0.97 0.02   0.88 0.10}
  
  \emph{Three-probabilities format}
  
  Three numbers are provided per individual, representing the
  probabilities of the A1/A1, A1/A2, and A2/A2 genotypes,
  respectively. Each probability is
  a number between 0 and 1, and the three probabilities per
  genetic variant per individual should add up to 1. The number
  of columns of the (non-header) lines in a genotype file in
  three-probabilities format should be exactly  3 +
  (3 x n_individuals).
  
  Example of the three-probabilities format:
  
  \code{      SNP  A1  A2   FID1 IID1       FID2 IID2}
  
  \code{   rs0001   A   C   0.97 0.02 0.01  0.88 0.10 0.02}
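  
  For reference, the expected dosage described under the dosage
  format can be derived from three-probabilities data. The sketch
  below uses the values of the example line above; the variable
  names are hypothetical and the code is illustrative only, not
  taken from the package.
  
  \preformatted{
# Genotype probabilities for two individuals, one row each, in the
# order P(A1/A1), P(A1/A2), P(A2/A2)
probs <- matrix(c(0.97, 0.02, 0.01,
                  0.88, 0.10, 0.02),
                ncol = 3, byrow = TRUE)

# Expected number of A1 alleles:
# 2 x P(A1/A1) + 1 x P(A1/A2) + 0 x P(A2/A2)
dosage <- 2 * probs[, 1] + 1 * probs[, 2]
dosage   # 1.96 1.86
  }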
  
  \emph{Genotype Map File}
  
  The genotype map file contains the locations of the genetic
  variants, with each row of the file corresponding to a variant.
  It must contain four columns:
  
  \itemize{
    \item Chromosome (1-22, X, Y or 0 if unspecified)
    \item Marker ID (identifier of the genetic variant)
    \item Genetic distance (in Morgans; not used by \code{lod_GWAS},
           so the actual value does not matter)
    \item Physical position (base-pair position)
  }
  
  \emph{Note}: unlike the other input files, the map file has \emph{no}
  header line.
  
  \emph{Phenotype File}
  
  The phenotype file is a text file containing the non-genetic
  data, with each row of the file corresponding to an individual.
  It must meet the following requirements:
  
  \itemize{
    \item The file must have a header line.
    \item It must contain the following variables: family
      ID, individual ID, phenotype, and outsideLOD (which
      indicates whether the phenotype is measured, or left- or
      right-censored). Other columns, e.g. for covariates, are
      optional.
    \item The header (name) of the columns for family ID,
      individual ID, and outsideLOD must be \code{FID},
      \code{IID}, and \code{outsideLOD}, respectively (note that
      R is case sensitive). The other columns (phenotype and cov1
      to covN) can have any arbitrary name.
  }

  The order of the rows (samples) or columns is not important.
  
  \emph{Column descriptions of phenotype file}
  
  \itemize{
    \item FID: family identifier of the individual. It must
      match the FIDs in the genotype dosage file.
    \item IID: the unique identifier of the individual within
      each family. It must match the IIDs in the genotype dosage
      file.
    \item Phenotype: the phenotype or trait of interest, which
      can be any numeric value. See below for a few considerations
      regarding the phenotype.
    \item outsideLOD: The variable \code{outsideLOD} indicates
      whether the phenotype value is within or beyond the range
      of LOD. It must be coded as \code{0} if phenotype >
      upper LOD; \code{1} if phenotype is within the detection
      interval; and \code{2} if phenotype < lower LOD. Values
      other than \code{0}, \code{1}, or \code{2} are not accepted.
    \item cov1 to covN: covariate 1 to covariate N. The phenotype
      file can contain as many covariates as necessary. Some
      examples are: age, sex, BMI, smoking status, medication,
      population stratification parameters (principal components),
      dosage data of a particular genetic variant (for conditional
      analysis), study center, etc.
  }
  
  \emph{A few considerations regarding the phenotype}
  
   Please pay particular attention to these instructions, as
   failing to heed them may cause invalid results.
   
   1) The user must carefully distinguish between two types of
   unobserved phenotype values: missing values and censored
   values. Any mix-up between these two types will yield
   incorrect results.
   
   2) \emph{Missing phenotype values} are those phenotypes that are
   missing for any reason other than being beyond the LOD. They
   are treated as truly missing (missing at random). These values
   must be coded as \code{NA} in both the \code{Phenotype} and
   \code{outsideLOD} columns.
   
   3) \emph{Censored phenotype values} are non-detects (NDs), i.e.
   measurements that fall beyond the LOD of the assay. NDs are
   not real missing values, since they do provide information
   about the distribution of the phenotype. Any ND that is below
   the lower LOD should be changed to the value of the lower LOD
   (and the corresponding outsideLOD value should be set to
   \code{2}). Any ND
   that is above the upper LOD should be changed to the value of
   the upper LOD (and the corresponding outsideLOD value should
   be set to \code{0}). NDs should \emph{NOT} be coded as missing
   (\code{NA}). \code{lodGWAS} can handle multiple lower and upper
   LOD levels (e.g. as a result of different assays being used to
   measure the biomarker) in a single file. In that case the
   phenotype of an ND should be changed to the lower/upper LOD
   level of the assay type used for that individual.
   
   4) The phenotype column can contain either raw or transformed
   values. Please ensure that NDs (whose phenotype value equals
   the LOD) are also transformed, using the same transformation.
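   
   The coding rules above can be sketched in R as follows. This is
   a minimal illustration, assuming a single lower LOD of 0.1 and a
   single upper LOD of 2 (the limits used in the Examples section).
   The toy data, the \code{status} column and the name
   \code{log_outcome1} are hypothetical; only \code{FID},
   \code{IID} and \code{outsideLOD} are names required by the
   package.
   
   \preformatted{
# Toy raw data: measured values are numeric, everything else is NA
pheno <- data.frame(
  FID = 1:5,
  IID = 1:5,
  status = c("measured", "below", "above", "missing", "measured"),
  outcome1 = c(0.8, NA, NA, NA, 1.5)
)
lower_lod <- 0.1   # assumed lower limit of detection
upper_lod <- 2     # assumed upper limit of detection

below <- pheno$status == "below"
above <- pheno$status == "above"
is_missing <- pheno$status == "missing"

# NDs take the value of the LOD they fall beyond; they are not NA
pheno$outcome1[below] <- lower_lod
pheno$outcome1[above] <- upper_lod

# 1 = within the detection interval, 2 = below lower LOD,
# 0 = above upper LOD, NA = truly missing
pheno$outsideLOD <- 1
pheno$outsideLOD[below] <- 2
pheno$outsideLOD[above] <- 0
pheno$outsideLOD[is_missing] <- NA

# A transformation is applied to measured values and NDs alike
pheno$log_outcome1 <- log(pheno$outcome1)
   }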
}
\section{Output File Format}{
  Column descriptions of the output file (as per default settings,
  with \code{outputheader="QCGWAS"}) are as follows:
   \itemize{
    \item MARKER: marker ID (identifier of the genetic variant)
      as specified in the genotype input files
    \item CHR: chromosome as specified in the genotype map file
    \item POSITION: physical position (base-pair position) as
      specified in the genotype map file
    \item OTHER_ALL: non-effect allele (non-coded allele)
    \item EFFECT_ALL: effect allele (coded allele)
    \item N_TOTAL: total sample size, including all NDs as well
      as valid measured values
    \item N_VALID: the sample size of valid measured values
      (excluding all NDs). This is useful if the user wants to
      know the percentage of NDs to the total sample size.
    \item EFF_ALL_FREQ: effect allele frequency 
    \item EFFECT: effect size (beta) of effect allele
    \item STDERR: standard error of the effect size
    \item PVALUE: p-value of association 
    \item IMP_QUALITY: imputation quality of the genetic variant
  }
  
  If another output format is chosen, the same columns will be
  present in the output file, but with header names as required
  by the specified software program.
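  
  As an illustration of how \code{N_TOTAL} and \code{N_VALID} relate,
  the percentage of NDs per variant could be computed as sketched
  below. The data frame \code{res} and its values are made up for
  the example, and \code{PCT_ND} is a hypothetical column name that
  is not produced by \code{lod_GWAS}.
  
  \preformatted{
# Made-up results table using the default (QCGWAS) column names
res <- data.frame(MARKER = c("rs0001", "rs0002"),
                  N_TOTAL = c(1000, 1000),
                  N_VALID = c(850, 900))

# Percentage of non-detects among all analysed samples
res$PCT_ND <- 100 * (res$N_TOTAL - res$N_VALID) / res$N_TOTAL
res$PCT_ND   # 15 10
  }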
}
\note{
  The association analysis is not performed 1) on rare genetic
  variants (with allele frequency < 0.001 or > 0.999), or 2) on
  badly imputed genetic variants (with imputation quality
  score < 0.01). Those genetic variants are still included in
  the output file, but their association results will be
  \code{NA}.
}
\examples{
# To run this example, copy the three sample files from the
# doc folder of the lodGWAS library
# (R-3.x.x/library/lodGWAS/doc) to your current R working directory
  
\dontrun{
setwd("YOUR WORKING DIRECTORY")
lod_GWAS(phenofile = "Sample_pheno.txt", pheno_name = "outcome1",
         basic_model = "sex",
         mapfile = "Sample_geno.map", genofile = "Sample_geno.dose",
         outputfile = "Sample_output.txt", gzip_output = FALSE,
         lower_limit = 0.1, upper_limit = 2)
}
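
# A variation on the call above (a sketch, not run): phenofile also
# accepts a data frame, so the phenotype data can be read and
# inspected in R first. The output file name "Sample_output_df.txt"
# is a hypothetical choice, and reading the sample file with
# read.table() assumes it is space or tab delimited with a header
# line, as described under Input File Format.

\dontrun{
pheno <- read.table("Sample_pheno.txt", header = TRUE,
                    stringsAsFactors = FALSE)
lod_GWAS(phenofile = pheno, pheno_name = "outcome1",
         basic_model = "sex",
         mapfile = "Sample_geno.map", genofile = "Sample_geno.dose",
         outputfile = "Sample_output_df.txt", gzip_output = FALSE,
         lower_limit = 0.1, upper_limit = 2)
}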
}