Improving clinical text classification using large language models guided by semantic knowledge
Details
- Title: Subtitle
- Improving clinical text classification using large language models guided by semantic knowledge
- Creators
- Graham Scott
- Contributors
- Kishlay Jha (Advisor); Tyler Bell (Committee Member); Hans Johnson (Committee Member)
- Resource Type
- Thesis
- Degree Awarded
- Master of Science (MS), University of Iowa
- Degree in
- Electrical and Computer Engineering
- Date degree season
- Autumn 2024
- DOI
- 10.25820/etd.007531
- Publisher
- University of Iowa
- Number of pages
- ix, 60 pages
- Copyright
- Copyright 2024 Graham Scott
- Language
- English
- Date submitted
- 12/09/2024
- Description illustrations
- Illustrations, tables, graphs, charts
- Description bibliographic
- Includes bibliographical references (pages 55-60).
- Public Abstract (ETD)
The models that currently perform best at analyzing medical text cannot yet be used in actual hospitals, because the data used in research differs from the data that hospitals create and use. Most medical text datasets used by researchers are heavily processed before use, and a model would have to work without that processing to be effective in a live hospital setting.
This thesis explores the aspects of unprocessed hospital notes that make models less accurate and slower to train. Using the knowledge gained about the semantics of these documents, it proposes a novel model architecture that addresses the most common of those problems, yielding a model that can be used directly on hospital medical data without any intermediate human processing.
Traditional machine learning models have trouble processing human language. The invention of transformers has made natural language processing (NLP) possible by allowing models to understand connections between ideas in text even when they are far apart. However, the flexibility that allows transformers to handle the complexities of human language also makes them highly sensitive to undesirable features of their training data, such as typos. We combat this by using the semantic knowledge we have gained about the noise present in raw hospital notes to create software that reduces the intensive manual data curation normally needed to feed hospital data into large language models to a set of model hyperparameters, which can be tuned to account for the anti-patterns of similar patient document datasets.
- Academic Unit
- Electrical and Computer Engineering
- Record Identifier
- 9984774548002771