New Publication on Paralog Removal
For the last few years, Dr. John Soghigian has been collaborating with Dr. Q.Y.J. Xiang's group at North Carolina State University to address paralogs in sequence capture data. Paralogs are a particular problem for data generated from angiosperms and other plants because of the frequent genome duplication in these organisms, but they may be a problem in other groups as well. The result of this collaboration, led by the excellent Dr. W. Zhou, was recently published in Systematic Biology.
The new putative paralog detection pipeline, or PPD (available via Dr. Zhou's GitHub, https://github.com/Bean061/putative_paralog), uses shared heterozygosity across evolutionarily distant taxa to detect potential paralogs. It does so under the assumption that, over deep divergences, polymorphisms shared among species are more likely to be fixed differences between paralogs than shared ancestral polymorphism. Even at shallow divergences, previous studies have shown that shared heterozygosity may be attributable to gene duplication events or previously undetected polyploidy. Regions with high heterozygosity may also indicate poorly aligned regions of a multiple sequence alignment.
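To make the idea of shared heterozygosity concrete, here is a minimal Python sketch that flags alignment columns where several taxa carry a two-base IUPAC ambiguity code. It is an illustration only and not PPD's actual implementation: the function name `shared_het_columns`, the `min_taxa` cutoff, and the toy alignment are all hypothetical, and the real pipeline should be consulted for how it defines and filters such sites.

```python
# Two-base IUPAC ambiguity codes; in a consensus sequence these mark a
# heterozygous site (e.g. R = A or G).
HET_CODES = set("RYSWKM")


def shared_het_columns(alignment, min_taxa=2):
    """Return 0-based column indices where at least `min_taxa` sequences
    carry a heterozygous (two-base ambiguity) code.

    `alignment` maps taxon name -> aligned sequence string.
    `min_taxa` is a hypothetical cutoff chosen for illustration.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same aligned length")

    flagged = []
    for col in range(length):
        # Count taxa that are heterozygous at this column.
        n_het = sum(1 for seq in sequences if seq[col].upper() in HET_CODES)
        if n_het >= min_taxa:
            flagged.append(col)
    return flagged


# Toy alignment (made up for this example).
example = {
    "taxon_A": "ACGRTTAY",
    "taxon_B": "ACGRTTAC",
    "taxon_C": "ACGATTAC",
}

print(shared_het_columns(example))              # heterozygosity shared by 2+ taxa
print(shared_het_columns(example, min_taxa=1))  # any heterozygous column
```

In this toy case, only the column where both taxon_A and taxon_B show an R is flagged under the default cutoff; the Y found in taxon_A alone is not shared and is ignored.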
As such, a key feature of PPD is recoding alignments with IUPAC degenerate nucleotide codes. This recoding is what enables PPD's putative paralog detection, and many phylogenetic programs can also use the degenerate codes when estimating evolutionary relationships and relative branch lengths. In the paper, we demonstrate that PPD is more accurate at detecting paralogs and leads to better phylogenetic estimates, particularly for analyses where relative branch length differences can be important, as in divergence time estimation.
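As a rough illustration of what recoding with IUPAC degenerate codes means, the sketch below merges two aligned allele sequences from one individual into a single consensus string. The two-base codes themselves are the standard IUPAC ones, but the function `recode_alleles` and its handling of gaps and unknown characters (collapsing mismatches to N) are assumptions made for this example, not a description of how PPD itself performs the recoding.

```python
# Standard IUPAC codes for two-base combinations.
IUPAC_PAIR = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def recode_alleles(allele1, allele2):
    """Merge two aligned allele sequences into one IUPAC-coded sequence.

    Identical characters are kept as-is; two different bases become the
    matching degenerate code. Any other mismatch (gap, N, etc.) becomes
    'N', which is a simplifying assumption of this sketch.
    """
    if len(allele1) != len(allele2):
        raise ValueError("alleles must have the same aligned length")

    out = []
    for a, b in zip(allele1.upper(), allele2.upper()):
        if a == b:
            out.append(a)
        elif a in "ACGT" and b in "ACGT":
            out.append(IUPAC_PAIR[frozenset((a, b))])
        else:
            out.append("N")
    return "".join(out)


print(recode_alleles("ACGT", "ACAT"))  # third site is G/A, coded as R
```

A sequence recoded this way keeps the information that a site is heterozygous, which is what allows heterozygosity to be compared across taxa afterwards.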