Journal article
Domain-specific embedding models for hydrology and environmental sciences: enhancing semantic retrieval and question answering in RAG pipelines
Water science and technology, Vol.92(9), pp.1328-1342
11/01/2025
DOI: 10.2166/wst.2025.156
PMID: 41236066
Abstract
Large Language Models (LLMs) have shown strong performance across natural language processing tasks, yet their general-purpose embeddings often fall short in domains with specialized terminology and complex syntax, such as hydrology and environmental science. This study introduces HydroEmbed, a suite of open-source sentence embedding models fine-tuned for four QA formats: multiple-choice (MCQ), true/false (TF), fill-in-the-blank (FITB), and open-ended questions. Models were trained on the HydroLLM Benchmark, a domain-aligned dataset combining textbook and scientific article content. Fine-tuning strategies included MultipleNegativesRankingLoss, CosineSimilarityLoss, and TripletLoss, selected to match each task's semantic structure. Evaluation was conducted on a held-out set of 400 textbook-derived QA pairs, using top-k similarity-based context retrieval and GPT-4o-mini for answer generation. Results show that the fine-tuned models match or exceed performance of strong proprietary and open-source baselines, particularly in FITB and open-ended tasks, where domain alignment significantly improves semantic precision. The MCQ/TF model also achieved competitive accuracy. These findings highlight the value of task- and domain-specific embedding models for building robust retrieval-augmented generation (RAG) pipelines and intelligent QA systems in scientific domains. This work represents a foundational step toward HydroLLM, a domain-specialized language model ecosystem for environmental sciences.
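The top-k similarity-based context retrieval mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the function names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, passage_vecs, k=3):
    """Return up to k (passage_index, score) pairs, highest similarity first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy embeddings (assumed values for illustration only); a real pipeline would
# obtain these from an embedding model applied to the question and the passages.
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.7, 0.3, 0.1],
]
query_vec = [1.0, 0.2, 0.0]

# Indices of the two passages most similar to the query.
hits = top_k(query_vec, passage_vecs, k=2)
```

In a RAG pipeline of the kind described, the passages selected this way would be passed as context to the answer-generation model.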
Details
- Title: Subtitle
- Domain-specific embedding models for hydrology and environmental sciences: enhancing semantic retrieval and question answering in RAG pipelines
- Creators
- Ramteja Sajja - Tulane University; Yusuf Sermet - Tulane University; Ibrahim Demir - Tulane University
- Resource Type
- Journal article
- Publication Details
- Water science and technology, Vol.92(9), pp.1328-1342
- DOI
- 10.2166/wst.2025.156
- PMID
- 41236066
- NLM abbreviation
- Water Sci Technol
- ISSN
- 0273-1223
- eISSN
- 1996-9732
- Publisher
- IWA PUBLISHING
- Grant note
- Cooperative Institute for Research to Operations in Hydrology (CIROH); NOAA Cooperative Institute Program: NA22NWS4320003; U.S. Department of the Interior (DOI)-U.S. Geological Survey (USGS): G25AP00137
This research was supported by the Cooperative Institute for Research to Operations in Hydrology (CIROH) with funding under award NA22NWS4320003 from the NOAA Cooperative Institute Program and by the U.S. Department of the Interior (DOI)-U.S. Geological Survey (USGS) under Award No. G25AP00137. The statements, findings, conclusions, and recommendations are those of the author(s) and do not necessarily reflect the views of NOAA or the U.S. Geological Survey.
- Language
- English
- Electronic publication date
- 10/25/2025
- Date published
- 11/01/2025
- Academic Unit
- Electrical and Computer Engineering; Civil and Environmental Engineering; IIHR--Hydroscience and Engineering; Injury Prevention Research Center
- Record Identifier
- 9985019147202771