Journal article
Assessing ChatGPT's ability to emulate human reviewers in scientific research: A descriptive and qualitative approach
Computer methods and programs in biomedicine, Vol.254, p.108313
09/2024
DOI: 10.1016/j.cmpb.2024.108313
PMID: 38954915
Abstract
•ChatGPT aims to replicate human behavior in scientific research but its ability to emulate real-world human reviewers is yet to be discovered.
•This study presents the first methodological qualitative assessment of ChatGPT's capability to imitate real human reviewers in scientific research.
•Analysis of 720 comments by human reviewers, independently reviewed by researchers showing strong consensus, revealed that the majority of human reviewers’ comments (78.5 %) lacked equivalents in ChatGPT's comments.
•Comments on context and methodology exhibited lower levels of complete and partial agreement compared to general comments.
•ChatGPT version GPT-4 has a limited ability to emulate human reviewers within the peer review process of scientific research.
ChatGPT is an AI platform whose relevance in the peer review of scientific articles is steadily growing. Nonetheless, it has sparked debates over its potential biases and inaccuracies. This study aims to assess ChatGPT's ability to qualitatively emulate human reviewers in scientific research.
We included the first submitted version of the latest twenty original research articles published by the 3rd of July 2023 in a high-profile medical journal. Each article underwent evaluation by a minimum of three human reviewers during the initial review stage. Subsequently, three researchers with medical backgrounds and expertise in manuscript revision independently and qualitatively assessed the agreement between the peer reviews generated by ChatGPT version GPT-4 and the comments provided by human reviewers for these articles. The level of agreement was categorized as complete, partial, none, or contradictory.
A total of 720 human reviewers’ comments were assessed. There was good agreement between the three assessors (overall kappa >0.6). ChatGPT's comments demonstrated complete agreement in terms of quality and substance with 48 (6.7 %) human reviewers’ comments; partially agreed with 92 (12.8 %), identifying issues that necessitated further elaboration or recommending supplementary steps to address concerns; had no agreement with the majority, 565 (78.5 %); and contradicted 15 (2.1 %). ChatGPT comments on methods had the lowest proportion of complete agreement (13 comments, 3.6 %), while general comments on the manuscript displayed the highest proportion of complete agreement (17 comments, 22.1 %).
ChatGPT version GPT-4 has a limited ability to emulate human reviewers within the peer review process of scientific research.
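The agreement breakdown reported in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not part of the study: the dictionary keys and variable names are assumptions chosen for readability, and only the four counts and the total of 720 come from the abstract.

```python
# Counts of human reviewers' comments by level of agreement with ChatGPT,
# as reported in the abstract. The key names are illustrative labels.
counts = {
    "complete": 48,
    "partial": 92,
    "none": 565,
    "contradictory": 15,
}

# The four categories should account for every assessed comment.
total = sum(counts.values())

# Percentage of all assessed comments in each category, to one decimal place.
percentages = {level: round(100 * n / total, 1) for level, n in counts.items()}

for level, pct in percentages.items():
    print(f"{level}: {counts[level]} of {total} ({pct} %)")
```

The four categories sum to the 720 assessed comments, and the rounded shares match the reported 6.7 %, 12.8 %, 78.5 %, and 2.1 %.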
Details
- Title: Subtitle
- Assessing ChatGPT's ability to emulate human reviewers in scientific research: A descriptive and qualitative approach
- Creators
- Aiman Suleiman - Albert Einstein College of Medicine
- Dario von Wedel - Beth Israel Deaconess Medical Center
- Ricardo Munoz-Acuna - Beth Israel Deaconess Medical Center
- Simone Redaelli - Beth Israel Deaconess Medical Center
- Abeer Santarisi - Beth Israel Deaconess Medical Center
- Eva-Lotte Seibold - Beth Israel Deaconess Medical Center
- Nikolai Ratajczak - Beth Israel Deaconess Medical Center
- Shinichiro Kato - Beth Israel Deaconess Medical Center
- Nader Said - Higher Colleges of Technology
- Eswar Sundar - Beth Israel Deaconess Medical Center
- Valerie Goodspeed - Beth Israel Deaconess Medical Center
- Maximilian S. Schaefer - Heinrich Heine University Düsseldorf
- Resource Type
- Journal article
- Publication Details
- Computer methods and programs in biomedicine, Vol.254, p.108313
- DOI
- 10.1016/j.cmpb.2024.108313
- PMID
- 38954915
- NLM abbreviation
- Comput Methods Programs Biomed
- ISSN
- 0169-2607
- eISSN
- 1872-7565
- Publisher
- Elsevier B.V.
- Language
- English
- Date published
- 09/2024
- Academic Unit
- Anesthesia
- Record Identifier
- 9985090698302771