From pairs of most similar sequences to phylogenetic best matches
By:
Stadler, Peter F., Geiss, Manuela, Schaller, David, Sanchez, Alitzel Lopez, Laffitte, Marcos Gonzalez, Valdivia, Dulce I., Hellmuth, Marc, Rosales, Maribel Hernandez
Published:
9 Apr 2020
Abstract:
Background: Many commonly used methods for orthology detection start
from mutually most similar pairs of genes (reciprocal best hits) as an
approximation of the evolutionarily most closely related pairs of genes
(reciprocal best matches). This approximation of best matches by best
hits becomes exact for ultrametric dissimilarities, i.e., under the
Molecular Clock Hypothesis. It fails, however, whenever there are large
lineage-specific rate variations among paralogous genes. In practice,
this introduces a high level of noise into the input data for
best-hit-based orthology detection methods.

Results: If additive distances between genes are known, then the
evolutionarily most closely related pairs can be identified by
considering certain quartets of genes, provided that in each quartet the
outgroup relative to the remaining three genes is known. A priori
knowledge of the underlying species phylogeny greatly facilitates the
identification of the required outgroup. Although the workflow remains a
heuristic, since the correct outgroup cannot be determined reliably in
all cases, simulations with lineage-specific biases and rate asymmetries
show that nearly perfect results can be achieved. In a realistic
setting, where distance data have to be estimated from sequence data and
hence are noisy, it is still possible to obtain highly accurate sets of
best matches.

Conclusion: Improvements of tree-free orthology assessment methods can
be expected from combining the accurate inference of best matches
reported here with recent mathematical advances in the understanding of
(reciprocal) best match graphs and orthology relations.

Availability: Accompanying software is available at .
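
The quartet idea summarized under Results can be illustrated with a small sketch. This is not the accompanying software: the function names (`dist`, `best_hits`, `closer_relative`, `best_matches`) and the toy distance table `D` are assumptions made for this example. It assumes exactly additive distances and a gene `z` already known to be an outgroup to the query `x` and both candidates, and uses the four-point condition to decide which candidate shares the more recent common ancestor with `x`.

```python
# Illustrative sketch only; not the authors' software. All names and the
# toy distances below are assumptions made for this example.

def dist(d, a, b):
    """Look up a symmetric distance stored under an unordered pair."""
    return d[frozenset((a, b))]

def best_hits(d, x, candidates, tol=1e-9):
    """Candidates at minimal distance from x (the classical best-hit rule)."""
    m = min(dist(d, x, y) for y in candidates)
    return [y for y in candidates if dist(d, x, y) <= m + tol]

def closer_relative(d, x, y1, y2, z, tol=1e-9):
    """Four-point test on the quartet {x, y1, y2, z} with z as outgroup.

    Returns y1 or y2 if that gene groups with x against the other two,
    and None if y1 and y2 are equally closely related to x.
    """
    s_y1 = dist(d, x, y1) + dist(d, y2, z)   # smallest for split x y1 | y2 z
    s_y2 = dist(d, x, y2) + dist(d, y1, z)   # smallest for split x y2 | y1 z
    s_tie = dist(d, x, z) + dist(d, y1, y2)  # smallest for split y1 y2 | x z
    if s_y1 < min(s_y2, s_tie) - tol:
        return y1
    if s_y2 < min(s_y1, s_tie) - tol:
        return y2
    return None

def best_matches(d, x, candidates, z):
    """Candidates that no other candidate beats in the quartet test."""
    return [y for y in candidates
            if not any(closer_relative(d, x, y, other, z) == other
                       for other in candidates if other != y)]

# Toy additive distances (invented) on the tree (((x, y1), y2), z), where
# y1 sits on a long branch, i.e. it has evolved much faster than y2.
D = {frozenset(pair): value for pair, value in {
    ("x", "y1"): 7, ("x", "y2"): 3, ("y1", "y2"): 8,
    ("x", "z"): 5, ("y1", "z"): 10, ("y2", "z"): 4,
}.items()}

print(best_hits(D, "x", ["y1", "y2"]))          # ['y2']
print(best_matches(D, "x", ["y1", "y2"], "z"))  # ['y1']
```

In this toy case the best hit of `x` is `y2`, while the quartet test recovers `y1`, which is the best match on the assumed tree. With noisy distance estimates or a wrongly chosen outgroup the test can fail, which is consistent with the abstract describing the workflow as a heuristic.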
Affiliations:
Stadler, Peter F.:
Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany
Univ Leipzig, Interdisciplinary Ctr Bioinformat, Hartelstr 16-18, D-04107 Leipzig, Germany
Univ Leipzig, Competence Ctr Scalable Data Serv & Solut Dresden, Interdisciplinary Ctr Bioinformat, German Ctr Integrat Biodivers Res iDiv, Augustuspl 12, D-04107 Leipzig, Germany
Univ Leipzig, Leipzig Res Ctr Civilizat Dis, Augustuspl 12, D-04107 Leipzig, Germany
Max Planck Inst Math Sci, Inselstr 22, D-04103 Leipzig, Germany
Univ Vienna, Dept Theoret Chem, Wahringer Str 17, A-1090 Vienna, Austria
Univ Nacl Colombia, Fac Ciencias, Ciudad Univ, Bogota 111321, Colombia
Santa Fe Inst, 1399 Hyde Pk Rd, Santa Fe, NM 87501 USA
Geiss, Manuela:
Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany
Software Competence Ctr Hagenberg GmbH, Softwarepk 21, A-4232 Hagenberg, Austria
Schaller, David:
Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany
Sanchez, Alitzel Lopez:
UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico
Laffitte, Marcos Gonzalez:
UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico
Valdivia, Dulce I.:
Ctr Invest & Estudios Avanzados IPN CINVESTAV, Dept Ingn Genet, Km 9-6 Libramiento Norte Carretera Irapuato Leon, Irapuato, Gto, Mexico
Hellmuth, Marc:
Univ Leeds, Sch Comp, E C Stoner Bldg, Leeds LS2 9JT, W Yorkshire, England
Rosales, Maribel Hernandez:
UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico