Improving clinical text classification using large language models guided by semantic knowledge

Graham Scott

doi:10.25820/etd.007531

Thesis

Open access

University of Iowa

Master of Science (MS), University of Iowa

Autumn 2024

Free to read and download, Open Access

Abstract

Current state-of-the-art models for clinical text analysis cannot yet be incorporated into tools for live use by medical professionals in hospitals because of the discrepancy between the data used in research and the data created and used by hospitals. Public datasets used by natural language processing (NLP) research groups are heavily processed before use, both by necessity (removal of sensitive personal information) and to improve the ability of language-processing models to extract information. This thesis explores the aspects of unprocessed hospital text that add unwanted noise and, using the knowledge gained about the syntax and semantics of these documents, proposes a novel model architecture that incorporates measures for addressing anti-patterns common in hospital patient notes. The final goal is a model that can be used directly on hospital medical data without any intermediate human processing. Traditional machine learning models exhibit little capacity to cope with the intricacies of natural language. The introduction of deep learning architectures such as recurrent neural networks (RNNs) and transformers has made NLP possible by allowing models to capture both local and global entities in text. Transformers in particular address key challenges through mechanisms like self-attention, which enables models to weigh the importance of different tokens in a sequence without requiring an explicitly ordered dependency. However, the flexibility that allows transformers to handle the complexities of human language also makes them highly sensitive to noise and unwanted patterns in the data they are trained on.
We combat this by leveraging the semantic knowledge we have gained to create software that reduces the intensive manual data curation that would normally be necessary to a set of model hyperparameters, which can be tuned to account for the anti-patterns of similar patient document datasets.

Live Application

Medical Text Classification

Neural Networks

Patient Notes

Preprocessing

Semantic Knowledge

Details

Title: Improving clinical text classification using large language models guided by semantic knowledge
Creators: Graham Scott
Contributors: Kishlay Jha (Advisor)
Tyler Bell (Committee Member)
Hans Johnson (Committee Member)
Resource Type: Thesis
Degree Awarded: Master of Science (MS), University of Iowa
Degree in: Electrical and Computer Engineering
Date degree season: Autumn 2024
DOI: 10.25820/etd.007531
Publisher: University of Iowa
Number of pages: ix, 60 pages
Language: English
Date submitted: 12/09/2024
Description illustrations: Illustrations, tables, graphs, charts
Description bibliographic: Includes bibliographical references (pages 55-60).
Public Abstract (ETD): The models that are currently the best for analyzing medical text can’t yet be used in actual hospitals because of the difference between the data used in research and the data created and used by hospitals. Most medical text datasets used by researchers are heavily processed before use for many reasons, and a model would have to work even without said processing in order to be effective in a live hospital setting.

This thesis explores the aspects of unprocessed hospital notes that make models less accurate and slower to train, and using the knowledge gained of the semantics of these documents, proposes a novel model architecture that addresses the most common of those problems to create a model that can be used directly on hospital medical data without any intermediate human processing.

Traditional machine learning models have trouble processing human language. The invention of transformers has made natural language processing (NLP) possible by allowing models to understand connections between ideas in text even when they’re far apart. However, the flexibility that allows transformers to handle the complexities of human language also makes them highly sensitive to undesirable things in their training data, like typos. We combat this by using the semantic knowledge we have gained about the noise present in raw hospital notes to create software that reduces the intensive manual data curation normally needed to feed hospital data into large language models to a set of model hyperparameters, which can be tuned to account for the anti-patterns of similar patient document datasets.
Academic Unit: Electrical and Computer Engineering
Record Identifier: 9984774548002771
