Mapping short DNA sequencing reads and calling variants using mapping quality scores

Heng Li¹, Jue Ruan², and Richard Durbin¹,³

¹ The Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom
² Beijing Genomics Institute, Chinese Academy of Science, Beijing 100029, China

Abstract

New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
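
As an illustration of the mapping-quality idea, the following is a minimal sketch and not MAQ's actual implementation. It assumes, beyond what the abstract states, that mapping quality is reported on a Phred-style scale (Q = -10 log10 of the error probability), that all candidate positions are equally likely a priori, and that the likelihood of a read at a position is the product of the base error probabilities at its mismatches. The function names, the cap of 99, and the input layout are hypothetical.

```python
import math


def phred_from_error_prob(p_error, cap=99):
    """Convert an error probability to a Phred-style quality.

    Hypothetical helper: the cap of 99 is an arbitrary choice for this
    sketch, used when the error probability is zero.
    """
    if p_error <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_error))))


def mapping_error_prob(candidate_mismatch_quals, best_index):
    """Posterior probability that the chosen candidate position is wrong.

    candidate_mismatch_quals: one list per candidate position, holding the
    Phred base qualities at the read's mismatched bases for that position.
    Assumes a uniform prior over the listed candidates (an assumption of
    this sketch).
    """
    # Likelihood of the read at each position: product of base error
    # probabilities at mismatches, i.e. 10^(-sum of qualities / 10).
    likelihoods = [
        10.0 ** (-sum(quals) / 10.0) for quals in candidate_mismatch_quals
    ]
    total = sum(likelihoods)
    return 1.0 - likelihoods[best_index] / total


# A read with a perfect hit and a second hit with one Q20 mismatch.
p_wrong = mapping_error_prob([[], [20]], best_index=0)
print(phred_from_error_prob(p_wrong))
```

Under these assumptions, a read with a unique candidate position gets the capped maximum quality, while a read with two equally good candidates gets an error probability of 0.5 and a quality near 3.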

Footnotes

  • ³ Corresponding author. E-mail rd@sanger.ac.uk; fax 44-1223496802.

  • [Supplemental material is available online at www.genome.org. Short-read sequences have been deposited in the European Read Archive (ERA) under accession no. ERA000012 (ftp://ftp.era.ebi.ac.uk/ERA000012/).]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.078212.108.

    • Received March 7, 2008.
    • Accepted August 13, 2008.
  • Freely available online through the Genome Research Open Access option.
