First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes


By: Mendez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejia-Almonte, Citlalli; Castillo-Villalba, Marco-Polo; Muniz-Rascado, Luis-Jose; Collado-Vides, Julio

Published: 26 Sep 2017
Abstract:
The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naive Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality.
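The two evaluation measures named in the abstract, F-score and ROUGE-1 recall, can be illustrated with a minimal sketch. It assumes a simple lowercase alphanumeric tokenizer and clipped unigram counts; the abstract does not state the exact ROUGE configuration used (stemming, stop-word handling), so the function names and the example sentences below are hypothetical and for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate.

    Counts are clipped: a reference token repeated n times is matched
    at most as many times as it occurs in the candidate.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total


def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from RegulonDB.
    manual = "the TF regulates transport genes"
    automatic = "the TF regulates genes"
    print(rouge1_recall(automatic, manual))
    print(f_score(precision=1.0, recall=0.5))
```

Under this reading, a classifier with high precision but low recall yields a modest F-score, while a high ROUGE-1 recall indicates that most words of the curator-written summary also occur in the automatic one.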

Affiliations:
Mendez-Cruz, Carlos-Francisco:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico

Gama-Castro, Socorro:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico

Mejia-Almonte, Citlalli:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico

Castillo-Villalba, Marco-Polo:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico

Muniz-Rascado, Luis-Jose:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico

Collado-Vides, Julio:
 Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
ISSN: 1758-0463
Publisher
OXFORD UNIV PRESS, GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND, United Kingdom
Document type: Article
Volume: 2017 Issue:
Pages:
WOS Id: 000412308000001
PubMed ID: 29220462