The gene normalization task in BioCreative III

Zhiyong Lu; Hung-Yu Kao; Chih-Hsuan Wei; Minlie Huang; Jingchen Liu; Cheng-Ju Kuo; Chun-Nan Hsu; Richard Tzong-Han Tsai; Hong-Jie Dai; Naoaki Okazaki; Han-Cheol Cho; Martin Gerner; Illes Solt; Shashank Agarwal; Feifan Liu; Dina Vishnyakova; Patrick Ruch; Martin Romacker; Fabio Rinaldi; Sanmitra Bhattacharya; Padmini Srinivasan; Hongfang Liu; Manabu Torii; Sergio Matos; David Campos; Karin Verspoor; Kevin M Livingston; W John Wilbur

doi:10.1186/1471-2105-12-S8-S2

Journal article

Open access

Peer reviewed

BMC bioinformatics, Vol.12(Suppl 8), pp.S2-S2

10/03/2011

DOI: 10.1186/1471-2105-12-S8-S2

PMCID: PMC3269937

PMID: 22151901

Files and links (1)

https://doi.org/10.1186/1471-2105-12-S8-S2

Published (Version of record) Open Access

Abstract

Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).

Results: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.

Conclusions: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
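
The abstract does not spell out the EM model used to infer ground truth from team submissions. As a rough illustration only, the sketch below assumes a generic two-class latent-truth model of the kind often used to aggregate noisy annotators: each candidate gene identifier is either correct or not, and each team has its own sensitivity and false-positive rate. The function name infer_truth, the vote matrix layout, and the toy data are all hypothetical and are not taken from the article.

```python
import math


def infer_truth(votes, iterations=50, eps=1e-6):
    """Estimate which candidates are correct from team votes alone.

    votes[t][i] is 1 if (hypothetical) team t returned candidate i, else 0.
    Returns (posteriors, sensitivities, false_positive_rates).
    This is an assumed generic model, not the article's exact algorithm.
    """
    n_teams, n_items = len(votes), len(votes[0])

    def clamp(x):
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(x, eps), 1.0 - eps)

    # Start from a soft majority vote: fraction of teams returning each candidate.
    post = [sum(votes[t][i] for t in range(n_teams)) / n_teams
            for i in range(n_items)]
    sens = [0.5] * n_teams
    fpr = [0.5] * n_teams

    for _ in range(iterations):
        # M-step: re-estimate the prior and each team's error rates.
        pos = sum(post)
        neg = n_items - pos
        prior = clamp(pos / n_items)
        for t in range(n_teams):
            hit = sum(p * v for p, v in zip(post, votes[t]))
            false_hit = sum((1.0 - p) * v for p, v in zip(post, votes[t]))
            sens[t] = clamp(hit / max(pos, eps))
            fpr[t] = clamp(false_hit / max(neg, eps))

        # E-step: posterior probability that each candidate is correct.
        for i in range(n_items):
            log_true = math.log(prior)
            log_false = math.log(1.0 - prior)
            for t in range(n_teams):
                if votes[t][i]:
                    log_true += math.log(sens[t])
                    log_false += math.log(fpr[t])
                else:
                    log_true += math.log(1.0 - sens[t])
                    log_false += math.log(1.0 - fpr[t])
            # Bound the exponent to avoid overflow with many teams.
            diff = min(max(log_false - log_true, -700.0), 700.0)
            post[i] = 1.0 / (1.0 + math.exp(diff))

    return post, sens, fpr


# Toy data: three hypothetical teams voting on six candidate identifiers.
toy_votes = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
]
posteriors, sensitivities, false_positive_rates = infer_truth(toy_votes)
```

In this toy run, candidates returned by most teams end up with high posterior probability and candidates returned by a single team end up with low probability. Thresholding the posteriors would give one possible form of inferred ground truth, though the article's actual procedure may differ.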

Research

Details

Title: The gene normalization task in BioCreative III
Creators: Zhiyong Lu - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Hung-Yu Kao - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Chih-Hsuan Wei - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Minlie Huang - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Jingchen Liu - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Cheng-Ju Kuo - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Chun-Nan Hsu - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Richard Tzong-Han Tsai - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Hong-Jie Dai - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Naoaki Okazaki - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Han-Cheol Cho - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Martin Gerner - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Illes Solt - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Shashank Agarwal - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Feifan Liu - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Dina Vishnyakova - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Patrick Ruch - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Martin Romacker - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Fabio Rinaldi - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Sanmitra Bhattacharya - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Padmini Srinivasan - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Hongfang Liu - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Manabu Torii - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Sergio Matos - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
David Campos - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Karin Verspoor - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Kevin M Livingston - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
W John Wilbur - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
Resource Type: Journal article
Publication Details: BMC bioinformatics, Vol.12(Suppl 8), pp.S2-S2
DOI: 10.1186/1471-2105-12-S8-S2
PMID: 22151901
PMCID: PMC3269937
NLM abbreviation: BMC Bioinformatics
ISSN: 1471-2105
eISSN: 1471-2105
Publisher: BioMed Central
Language: English
Date published: 10/03/2011
Academic Unit: Nursing; Computer Science; Business Analytics
Record Identifier: 9984003183502771

Metrics

77 Times Cited - Web of Science