13  Choices and the reasons behind them: VARWRRUMM track

13.1 HaploSSembly for mapping

A haploid assembly (or as near to haploid as possible) is required for this variant calling track. We use FREEBAYES for variant calling. We recommend reading the FREEBAYES instructions carefully to understand what has been done (and not done) here.

13.2 Data preparation prior to variant calling

Reads are mapped using BWA-MEM. All alternative alignments are reported as secondary alignments in the BAM file, and shorter split hits are marked as secondary. See also: BWA-MEM reference manual. Read groups are added, with ID and SM both corresponding to the sample id.
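As a rough sketch of this mapping step, the command line could be assembled as below. The options `-a` (output all alignments, flagged as secondary), `-M` (mark shorter split hits as secondary), `-R` (read group header line) and `-t` (threads) are standard BWA-MEM options that match the behaviour described above. That the track uses exactly these options is an assumption, and the helper name `bwa_mem_command` and all file names are hypothetical.

```python
def bwa_mem_command(reference, reads, sample_id, threads=1):
    """Build (not run) a bwa mem command line as an argument list.

    Assumption: -a, -M and -R reproduce the behaviour described in the text;
    the actual call made by the track may differ.
    """
    # Read group with ID and SM both set to the sample id. The "\t" is kept
    # as a literal backslash-t: bwa converts it into a tab itself.
    read_group = "@RG\\tID:{0}\\tSM:{0}".format(sample_id)
    return [
        "bwa", "mem",
        "-t", str(threads),
        "-a",              # output all alignments; extra ones are flagged secondary
        "-M",              # mark shorter split hits as secondary
        "-R", read_group,  # read group header line
        reference,
    ] + list(reads)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = bwa_mem_command("assembly.fasta", ["s1_R1.fastq.gz", "s1_R2.fastq.gz"], "s1")
    print(" ".join(cmd))
```

Returning a list rather than a string means the command can be handed to `subprocess.run` without shell quoting issues around the read group.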

The mapped file is then indexed using samtools.

This is followed by marking of duplicate reads (the default) using sambamba, as recommended by Freebayes.

Note that it is also possible to ignore duplicates (e.g. if you did not use PCR for library preparation), or to remove them entirely.

For information, we then compute the depth of coverage using samtools.
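The preparation steps above (indexing, duplicate handling, depth of coverage) can be sketched as a list of command lines. `samtools index`, `samtools depth` and `sambamba markdup` (with `-r` to remove duplicates instead of flagging them) are existing subcommands; the function name `preparation_commands`, the `duplicates` switch and the file names are hypothetical, and the input BAM is assumed to be coordinate-sorted already.

```python
def preparation_commands(sorted_bam, marked_bam, duplicates="mark"):
    """Return command lines (argument lists) for the steps after mapping.

    duplicates: "mark" (default), "remove" or "ignore".
    Assumption: sorted_bam is already coordinate-sorted.
    """
    if duplicates not in ("mark", "remove", "ignore"):
        raise ValueError("duplicates must be 'mark', 'remove' or 'ignore'")

    commands = [["samtools", "index", sorted_bam]]
    final_bam = sorted_bam
    if duplicates != "ignore":
        markdup = ["sambamba", "markdup"]
        if duplicates == "remove":
            markdup.append("-r")  # remove duplicates instead of only flagging them
        markdup += [sorted_bam, marked_bam]
        commands.append(markdup)
        final_bam = marked_bam
    # Per-position depth, printed to standard output as tab-separated text.
    commands.append(["samtools", "depth", final_bam])
    return commands
```

With the default setting, `preparation_commands("s1.sorted.bam", "s1.markdup.bam")` yields three commands: index, mark duplicates, then depth on the duplicate-marked file.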

13.3 Variant calling with Freebayes

13.3.1 Raw variant calling

We use Freebayes for variant calling. Freebayes was developed for variant calling on diploid genomes (but can be used for other ploidy levels) from Illumina short reads. It calls variants probabilistically: it computes the probability that a variant exists at a given locus.
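For illustration, a raw calling command might be built as follows. `-f` (reference FASTA), `-p` (ploidy, 2 by default), `-v` (output VCF file) and `--genotype-qualities` are existing Freebayes options; which of them the track actually sets is not stated here, so the values below are assumptions and the helper name `freebayes_command` is hypothetical.

```python
def freebayes_command(reference, bam, out_vcf, ploidy=2, genotype_qualities=False):
    """Build (not run) a freebayes command line as an argument list.

    Assumption: option values are illustrative, not the track's settings.
    """
    cmd = [
        "freebayes",
        "-f", reference,    # reference FASTA
        "-p", str(ploidy),  # ploidy; 2 is the freebayes default
        "-v", out_vcf,      # write the VCF here instead of standard output
    ]
    if genotype_qualities:
        cmd.append("--genotype-qualities")  # also report per-sample GQ
    cmd.append(bam)
    return cmd
```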

13.3.2 Filtering calls: quality assurance

Calls can be filtered on QUAL and/or depth (DP) or observation count.

  • QUAL: probability that there is a polymorphism at the locus described by the record, \(1 - P(\text{locus is homozygous} \mid \text{data})\) [GQ, when supplying --genotype-qualities]

vcffilter (part of vcflib) can be used for filtering:

  • Filtering on QUAL > 20 keeps only sites with an estimated probability of not being polymorphic below 0.01 (Phred 20), i.e. a probability of polymorphism > 0.99.

  • Examine the output manually.
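The Freebayes documentation performs this filter with `vcffilter -f "QUAL > 20"`. To make the logic explicit, the sketch below applies the same threshold to VCF text lines in plain Python. It is an illustrative stand-in, not the vcflib implementation; the function name `filter_by_qual` and the example records are made up.

```python
def filter_by_qual(vcf_lines, min_qual=20.0):
    """Yield header lines and records whose QUAL is strictly above min_qual.

    Sketch only: blank lines and records with a missing QUAL (".") are dropped.
    """
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            yield line      # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]    # VCF columns: CHROM POS ID REF ALT QUAL ...
        if qual != "." and float(qual) > min_qual:
            yield line


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "contig1\t101\t.\tA\tG\t54.3\t.\tDP=31\n",
    "contig1\t250\t.\tC\tT\t3.1\t.\tDP=8\n",
]
kept = list(filter_by_qual(example))
```

Here `kept` holds the two header lines and the record at position 101; the record at position 250 is dropped because its QUAL is below the threshold.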

Useful links:

13.4 Normalization of variant representation


Freebayes outputs VCF 4.2:
> “probabilistic description of allelic variants within a population of samples, but it is equally suited to describing the probability of variation in a single sample.”

Quoted from

  • phred and probability of not being polymorphic (or formula)
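
The formula in question is the standard Phred scaling. Writing \(P\) for the probability that the locus is not polymorphic, \(P = P(\text{locus is homozygous} \mid \text{data})\) (the symbol \(P\) is introduced here only for illustration):

\[
\mathrm{QUAL} = -10 \log_{10} P, \qquad P = 10^{-\mathrm{QUAL}/10}
\]

For example, QUAL = 20 gives \(P = 10^{-2} = 0.01\), i.e. a probability of polymorphism of \(1 - P = 0.99\); QUAL = 30 gives \(P = 0.001\).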