
Title: Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Abstract: We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle. In particular, we do not assume that a reward function exists and that the preference signal is drawn from the Bradley-Terry model, as most prior work does. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and we propose sample-efficient algorithms both for offline learning from a pre-collected preference dataset and for online learning, where the preference oracle can be queried during training. Empirical studies verify the effectiveness of the proposed framework.
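
The minimax formulation described in the abstract can be sketched as follows. This is an illustrative form rather than the paper's exact statement: the symbols π and π' (the two competing LLM policies), π_0 (the reference policy), d_0 (the prompt distribution), P(y ≻ y' | x) (the general preference oracle), and η (the KL-regularization coefficient) are assumed notation for exposition.

\[
\max_{\pi}\;\min_{\pi'}\;\mathbb{E}_{x \sim d_0}\Big[\,\mathbb{E}_{y \sim \pi(\cdot\mid x),\; y' \sim \pi'(\cdot\mid x)}\big[\mathbb{P}(y \succ y' \mid x)\big]\,\Big]\;-\;\eta\,\mathrm{KL}\big(\pi \,\big\|\, \pi_0\big)\;+\;\eta\,\mathrm{KL}\big(\pi' \,\big\|\, \pi_0\big)
\]

By contrast, the reward-based setting assumes a Bradley-Terry model, \(\mathbb{P}(y \succ y' \mid x) = \sigma\big(r(x,y) - r(x,y')\big)\) for some reward function \(r\) and sigmoid \(\sigma\); the general preference oracle drops this assumption, which is why the framework is strictly more general.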
Comments: RLHF, Preference Learning, Alignment for LLMs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2402.07314 [cs.LG]
  (or arXiv:2402.07314v2 [cs.LG] for this version)

Submission history

From: Chenlu Ye [view email]
[v1] Sun, 11 Feb 2024 21:44:21 GMT (65kb)
[v2] Thu, 25 Apr 2024 04:05:06 GMT (65kb,D)
