LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Mitchell Piehl; Muchao Ye

doi:10.48550/arxiv.2605.15054

Preprint

Open access

ArXiv.org

Cornell University

05/14/2026

DOI: 10.48550/arxiv.2605.15054

Files and links (1)

url

https://doi.org/10.48550/arxiv.2605.15054

Preprint (Author's original). This preprint has not been evaluated by subject experts through peer review. Preprints may undergo extensive changes and/or become peer-reviewed journal articles. Open Access.

Abstract

Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context. Supplying that context allows VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selects historical content by frame diversity and visual-textual alignment and uses it as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs at test time, while generating temporally coherent and semantically grounded event-level explanations.
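The abstract does not give the selection or aggregation rules, so the following is only an illustrative sketch of the two ideas it names, not the authors' method. It assumes precomputed frame and text embeddings, a greedy trade-off between text alignment and redundancy for the memory step, and a split-and-merge recursion over per-segment anomaly scores for the aggregation step. All names (`select_memory`, `aggregate`, `align_weight`, `threshold`, `tol`) are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_memory(frame_embs, text_emb, k, align_weight=0.5):
    """Greedily pick up to k historical frames (hypothetical rule).

    Each step favours frames aligned with the text embedding and
    penalises frames similar to ones already chosen (diversity).
    """
    chosen = []
    candidates = list(range(len(frame_embs)))
    while candidates and len(chosen) < k:
        def gain(i):
            align = cosine(frame_embs[i], text_emb)
            redundancy = max(
                (cosine(frame_embs[i], frame_embs[j]) for j in chosen),
                default=0.0,
            )
            return align_weight * align - (1 - align_weight) * redundancy

        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return sorted(chosen)


def aggregate(scores, lo, hi, threshold=0.5, tol=0.15):
    """Recursively find anomalous intervals [start, end) in scores[lo:hi].

    A span is split in half until its scores are nearly uniform; uniform
    spans with a mean at or above the threshold are kept, and kept spans
    that touch at the split point are merged (hypothetical rule).
    """
    if hi <= lo:
        return []
    seg = scores[lo:hi]
    if hi - lo == 1 or max(seg) - min(seg) <= tol:
        return [(lo, hi)] if sum(seg) / len(seg) >= threshold else []
    mid = (lo + hi) // 2
    left = aggregate(scores, lo, mid, threshold, tol)
    right = aggregate(scores, mid, hi, threshold, tol)
    if left and right and left[-1][1] == right[0][0]:
        merged = (left[-1][0], right[0][1])
        return left[:-1] + [merged] + right[1:]
    return left + right


# Toy inputs: frame 1 duplicates frame 0, so diversity should skip it.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
memory = select_memory(frames, text_emb=[1.0, 0.0], k=2, align_weight=0.3)

# Toy per-segment anomaly scores with one high-scoring run.
scores = [0.1, 0.1, 0.8, 0.9, 0.85, 0.2, 0.1, 0.1]
intervals = aggregate(scores, 0, len(scores))
```

On these toy inputs the memory step skips the duplicate frame and the aggregation step returns a single interval covering segments 2 through 4. In a real pipeline the embeddings and scores would come from the frozen VLM, and the weights and thresholds here would need tuning.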

Computer Science - Computer Vision and Pattern Recognition

Details

Title: LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
Creators: Mitchell Piehl; Muchao Ye
Resource Type: Preprint
Publication Details: ArXiv.org
DOI: 10.48550/arxiv.2605.15054
ISSN: 2331-8422
Publisher: Cornell University; Ithaca, New York
Language: English
Date posted: 05/14/2026
Academic Unit: Computer Science
Record Identifier: 9985163463602771
