First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes
By:
Mendez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejia-Almonte, Citlalli; Castillo-Villalba, Marco-Polo; Muniz-Rascado, Luis-Jose; Collado-Vides, Julio
Published:
26 Sep 2017
Abstract:
The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually
elaborated summaries about transcription factors (TFs) of Escherichia
coli K-12. These texts involve considerable effort, since they summarize
a diverse collection of structural, mechanistic and physiological
properties of TFs and, because new research appears constantly, they
ideally require frequent updating. In natural language processing, several
techniques for automatic summarization have been developed. Therefore,
our proposal is to extract, by using those techniques, relevant
information about TFs for assisting the curation and elaboration of the
manual summaries. Here, we present the results of the automatic
classification of sentences about the biological processes regulated by
a TF and the information about the structural domains constituting the
TF. We tested two classical classifiers, Naive Bayes and Support Vector
Machines (SVMs), with the sentences of the manual summaries as training
data. The best classifier was an SVM employing lexical, grammatical, and
terminological features (F-score, 0.8689). When this classifier was
applied to sentences from articles, the sentences it selected were
frequently correct, but many relevant sentences were set aside (high
precision with low recall); consequently, some
improvement is required. Nevertheless, automatic summaries of complete
articles about five TFs, generated with this classifier, included much
of the relevant information of the summaries written by curators (high
ROUGE-1 recall). In fact, a manual comparison confirmed that the best
summary encompassed 100% of the relevant information. Hence, our
empirical results suggest that the approach is promising for covering
more properties of TFs and for generating suggested sentences with
relevant information that help the curation work without losing quality.
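
The abstract evaluates the automatic summaries by ROUGE-1 recall against the curator-written summaries. As a rough illustration only, and not the authors' evaluation code, the metric can be sketched as the fraction of reference unigrams that also appear in the candidate summary, with counts clipped. The function names `tokenize` and `rouge1_recall` and the simple lowercase alphanumeric tokenizer are assumptions made for this sketch; the tokenization actually used in the study is not stated here.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(candidate, reference):
    """Share of reference unigrams covered by the candidate summary."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    total = sum(ref_counts.values())
    if total == 0:
        # No reference words to recall; return 0.0 by convention.
        return 0.0
    # Clip each match at the number of times the word occurs in the candidate.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / total


if __name__ == "__main__":
    # Hypothetical sentences, not taken from RegulonDB.
    manual = "the factor represses the operon"
    automatic = "the factor activates the operon"
    print(rouge1_recall(automatic, manual))
```

A high value means the automatic summary repeats most of the words of the manual summary. It says nothing about how much extra text the automatic summary contains, which is why the abstract also reports a manual comparison.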
Affiliations:
Mendez-Cruz, Carlos-Francisco:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
Gama-Castro, Socorro:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
Mejia-Almonte, Citlalli:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
Castillo-Villalba, Marco-Polo:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
Muniz-Rascado, Luis-Jose:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico
Collado-Vides, Julio:
Univ Nacl Autonoma Mexico, Ctr Genom Sci, Computat Genom Program, Av Univ S-N, Cuernavaca 62100, Morelos, Mexico