Efficient Preference Poisoning Attack on Offline RLHF

Chenye Yang; Weiyu Xu; Lifeng Lai

doi:10.48550/arxiv.2605.02495

Preprint

Open access

ArXiv.org

Cornell University

05/04/2026

https://doi.org/10.48550/arxiv.2605.02495

Preprint (Author's original). This preprint has not been evaluated by subject experts through peer review. Preprints may undergo extensive changes and/or become peer-reviewed journal articles. Open Access.

Abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks against log-linear DPO. We first show that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this property, we convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: the Binary-Aware Lattice Attack (BAL-A) and the Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for K-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
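
The parameter-independent gradient shift mentioned in the abstract can be checked numerically. The sketch below is not taken from the paper. It assumes the standard log-linear setup, in which the policy and reference policy are proportional to exp(theta . phi(x, y)), so the per-pair DPO loss is -log sigmoid(u) with u = beta * (theta - theta_ref) . dphi and dphi = phi(x, y_w) - phi(x, y_l). The names dpo_grad, dphi, theta_ref and beta are illustrative, and the paper's own notation, sign convention and normalization may differ. Under these assumptions, flipping the label changes the gradient by exactly beta * dphi, whatever the value of theta.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dpo_grad(theta, theta_ref, dphi, beta, flipped=False):
    """Gradient w.r.t. theta of the per-pair loss -log sigmoid(s * u),
    with u = beta * (theta - theta_ref) @ dphi and s = -1 for a flipped label.
    Assumes a log-linear policy, so the normalizers cancel in the margin."""
    s = -1.0 if flipped else 1.0
    u = beta * (theta - theta_ref) @ dphi
    return -s * sigmoid(-s * u) * beta * dphi

rng = np.random.default_rng(0)
d, beta = 5, 0.1                      # hypothetical feature dimension and DPO temperature
theta_ref = rng.normal(size=d)        # hypothetical reference parameters
dphi = rng.normal(size=d)             # feature difference phi(x, y_w) - phi(x, y_l)

# Evaluate the flip-induced gradient shift at several unrelated parameter values.
shifts = []
for _ in range(3):
    theta = rng.normal(size=d)
    g_clean = dpo_grad(theta, theta_ref, dphi, beta)
    g_flip = dpo_grad(theta, theta_ref, dphi, beta, flipped=True)
    shifts.append(g_flip - g_clean)

# Every shift equals beta * dphi, independent of theta.
print(all(np.allclose(s, beta * dphi) for s in shifts))
```

Because each flip contributes a fixed vector, the total shift from flipping a set of labels is the sum of the corresponding beta * dphi vectors. This is consistent with the abstract's reduction of targeted poisoning to choosing a sparse binary combination of dictionary columns.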

Computer Science - Artificial Intelligence

Computer Science - Learning

Statistics - Machine Learning

Details

Title: Efficient Preference Poisoning Attack on Offline RLHF
Creators: Chenye Yang; Weiyu Xu; Lifeng Lai
Resource Type: Preprint
Publication Details: ArXiv.org
DOI: 10.48550/arxiv.2605.02495
ISSN: 2331-8422
Publisher: Cornell University; Ithaca, New York
Language: English
Date posted: 05/04/2026
Academic Unit: Electrical and Computer Engineering
Record Identifier: 9985161448602771