From pairs of most similar sequences to phylogenetic best matches


By: Stadler, Peter F., Geiss, Manuela, Schaller, David, Sanchez, Alitzel Lopez, Laffitte, Marcos Gonzalez, Valdivia, Dulce I., Hellmuth, Marc, Rosales, Maribel Hernandez

Published: 9 Apr 2020
Abstract:
Background: Many commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation of the evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.
Results: If additive distances between genes are known, then the evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic, since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.
Conclusion: Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.
Availability: Accompanying software is available at .
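The difference between a best hit and a best match under rate variation can be illustrated with a toy example. The sketch below is not the authors' workflow or their software; it is a minimal illustration that assumes exact additive distances and a known outgroup gene `z`. The gene names, branch lengths, and both function names are hypothetical. It relies on a standard property of additive (tree) metrics: the quantity (d(x,z) + d(y,z) - d(x,y)) / 2 equals the distance from `z` to the point where the paths to `x` and `y` separate, so a larger value means that `x` and `y` diverged more recently.

```python
# Toy gene tree (hypothetical branch lengths):
#   root --1-- z                  outgroup gene
#   root --1-- u                  duplication event
#   u --1-- v, u --1-- y2         y2: slowly evolving paralog in species B
#   v --1-- x, v --5-- y1         x in species A; y1: fast-evolving ortholog in species B
# Pairwise distances are the path lengths in this tree, hence additive.
D = {
    frozenset(("x", "y1")): 6.0,
    frozenset(("x", "y2")): 3.0,
    frozenset(("x", "z")): 4.0,
    frozenset(("y1", "z")): 8.0,
    frozenset(("y2", "z")): 3.0,
    frozenset(("y1", "y2")): 7.0,
}


def dist(a, b, d):
    """Look up the symmetric distance between genes a and b."""
    return d[frozenset((a, b))]


def best_hit(x, candidates, d):
    """Candidate with the smallest distance to x (most similar sequence)."""
    return min(candidates, key=lambda y: dist(x, y, d))


def best_match_with_outgroup(x, candidates, z, d):
    """Candidate whose split from x lies farthest from the outgroup z.

    Assumes d is additive and z is an outgroup for x and all candidates.
    Ties are not handled in this sketch.
    """
    def split_depth(y):
        return (dist(x, z, d) + dist(y, z, d) - dist(x, y, d)) / 2.0

    return max(candidates, key=split_depth)


hit = best_hit("x", ["y1", "y2"], D)
match = best_match_with_outgroup("x", ["y1", "y2"], "z", D)
print(hit, match)
```

In this toy tree the slowly evolving paralog `y2` is the most similar sequence to `x`, while the outgroup-based criterion selects `y1`, which shares the more recent common ancestor with `x`. With noisy, estimated distances or a wrongly chosen outgroup the criterion can fail, which is consistent with the abstract describing the overall workflow as a heuristic.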

Affiliations:
Stadler, Peter F.:
 Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany

 Univ Leipzig, Interdisciplinary Ctr Bioinformat, Hartelstr 16-18, D-04107 Leipzig, Germany

 Univ Leipzig, Competence Ctr Scalable Data Serv & Solut Dresden, Interdisciplinary Ctr Bioinformat, German Ctr Integrat Biodivers Res iDiv, Augustuspl 12, D-04107 Leipzig, Germany

 Univ Leipzig, Leipzig Res Ctr Civilizat Dis, Augustuspl 12, D-04107 Leipzig, Germany

 Max Planck Inst Math Sci, Inselstr 22, D-04103 Leipzig, Germany

 Univ Vienna, Dept Theoret Chem, Wahringer Str 17, A-1090 Vienna, Austria

 Univ Nacl Colombia, Fac Ciencias, Ciudad Univ, Bogota 111321, Colombia

 Santa Fe Inst, 1399 Hyde Pk Rd, Santa Fe, NM 87501 USA

Geiss, Manuela:
 Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany

 Software Competence Ctr Hagenberg GmbH, Softwarepk 21, A-4232 Hagenberg, Austria

Schaller, David:
 Univ Leipzig, Bioinformat Grp, Dept Comp Sci, Hartelstr 16-18, D-04107 Leipzig, Germany

Sanchez, Alitzel Lopez:
 UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico

Laffitte, Marcos Gonzalez:
 UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico

Valdivia, Dulce I.:
 Ctr Invest & Estudios Avanzados IPN CINVESTAV, Dept Ingn Genet, Km 9-6 Libramiento Norte Carretera Irapuato Leon, Irapuato, Gto, Mexico

Hellmuth, Marc:
 Univ Leeds, Sch Comp, E C Stoner Bldg, Leeds LS2 9JT, W Yorkshire, England

Rosales, Maribel Hernandez:
 UNAM Juriquilla, CONACYT Inst Matemat, Blvd Juriquilla 3001, Queretaro, Qro, Mexico
ISSN: 17487188
Publisher
BioMed Central, CAMPUS, 4 CRINAN ST, LONDON N1 9XW, ENGLAND, United Kingdom
Document type: Article
Volume: 15 Issue: 1
Pages:
WOS Id: 000526867900001
PubMed ID: 32308731