Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
Preprint · Open Access

Yiyang Shen, Lifu Tu, and Weiran Wang
arXiv.org (Cornell University)
April 3, 2026
DOI: 10.48550/arxiv.2604.02621
URL: https://doi.org/10.48550/arxiv.2604.02621

Abstract

Reinforcement Learning (RL) has been shown to substantially improve the reasoning capabilities of both small and large language models (LLMs), but existing approaches typically rely on verifiable rewards and hence on ground-truth labels. We propose an RL framework that draws rewards from an LLM acting as a judge over model outputs on large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
Subjects: Computer Science - Computation and Language; Computer Science - Learning
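
The abstract names two ingredients: a judge LLM that scores an output with a single token, and an optional verifiable reward when a ground-truth label exists. The Python sketch below illustrates how such a reward could be computed; it is a minimal illustration, not the paper's implementation. The judge_fn interface, the Yes/No verdict prompt, and the 50/50 mixing weight are all hypothetical assumptions introduced here for clarity.

from typing import Callable, Optional

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Is the candidate answer correct? Reply with a single token: Yes or No.\n"
)

def judge_reward(question: str, answer: str,
                 judge_fn: Callable[[str], str]) -> float:
    """Score one model output with a single-token judge verdict.

    judge_fn is any callable that sends a prompt to the judge LLM and
    returns its first generated token (a hypothetical interface; the
    paper does not specify one). A single-token verdict keeps reward
    computation to one short generation per sample.
    """
    token = judge_fn(JUDGE_PROMPT.format(question=question, answer=answer))
    return 1.0 if token.strip().lower().startswith("yes") else 0.0

def combined_reward(question: str, answer: str,
                    judge_fn: Callable[[str], str],
                    ground_truth: Optional[str] = None,
                    weight: float = 0.5) -> float:
    """Mix the judge reward with a verifiable reward when a label exists.

    The 50/50 weight is an illustrative choice, not the paper's.
    On unlabeled data only the judge signal is available.
    """
    r_judge = judge_reward(question, answer, judge_fn)
    if ground_truth is None:
        return r_judge  # label-free case: judge reward only
    # Verifiable reward: exact match against the ground-truth answer.
    r_verify = 1.0 if answer.strip() == ground_truth.strip() else 0.0
    return weight * r_verify + (1.0 - weight) * r_judge

if __name__ == "__main__":
    # Toy stand-in for a judge LLM: answers Yes iff "4" appears in the prompt.
    toy_judge = lambda prompt: "Yes" if "4" in prompt else "No"
    print(combined_reward("What is 2 + 2?", "4", toy_judge, ground_truth="4"))

In a full RL fine-tuning loop, such scalar rewards would feed a policy-gradient update over sampled model outputs; the key point from the abstract is that the judge's single-token output keeps the per-sample reward cost to one short judge call.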
