Gentle Masking of Low-Complexity Sequences Improves Homology Search

Martin C. Frith*

Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo, Japan

Abstract

Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is min(0, S), where S is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.

Citation: Frith MC (2011) Gentle Masking of Low-Complexity Sequences Improves Homology Search. PLoS ONE 6(12): e28819. doi:10.1371/journal.pone.0028819

Editor: Leonardo Mariño-Ramírez, National Institutes of Health, United States of America

Received October 17, 2011; Accepted November 15, 2011; Published December 19, 2011

Copyright: © 2011 Martin C. Frith. This is an open-access article distributed



