Computer Science > Machine Learning
Title: A Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference
(Submitted on 11 Feb 2024 (this version), latest version 25 Apr 2024 (v2))
Abstract: Reinforcement Learning from Human Feedback (RLHF) learns from the preference signal provided by a probabilistic preference model, which takes a prompt and two responses as input and produces a score indicating the preference of one response over the other. So far, the most popular RLHF paradigm is reward-based: it starts with an initial reward-modeling step, and the constructed reward then provides the learning signal for the subsequent reward-optimization stage. However, the existence of a reward function is a strong assumption, and reward-based RLHF is limited in expressivity: it cannot capture complicated real-world human preferences.
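The expressivity gap described above can be made concrete with a toy sketch (the names `bt_preference`, `rewards`, and `general_pref` are illustrative, not from the paper): a Bradley-Terry-style reward model forces preferences to be transitive, whereas a general preference model is just a table of pairwise win probabilities and can encode cycles no reward function reproduces.

```python
import math

def bt_preference(r_a: float, r_b: float) -> float:
    """P(a is preferred over b) under a reward-based (Bradley-Terry-style)
    model: sigma(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Reward-based preferences are always transitive: if a beats b and
# b beats c (win probability > 0.5), then a necessarily beats c.
rewards = {"a": 2.0, "b": 1.0, "c": 0.0}
assert bt_preference(rewards["a"], rewards["b"]) > 0.5
assert bt_preference(rewards["b"], rewards["c"]) > 0.5
assert bt_preference(rewards["a"], rewards["c"]) > 0.5  # forced by the reward

# A general preference model is just a table P(y > y'); it can encode a
# cycle (a > b, b > c, c > a) that no single reward function can match.
general_pref = {("a", "b"): 0.8, ("b", "c"): 0.8, ("c", "a"): 0.8}
```

This is only meant to show why the paper treats the general preference model, rather than a reward function, as the primitive object.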
In this work, we provide theoretical insights for a recently proposed learning paradigm, Nash learning from human feedback (NLHF), which considers a general preference model and formulates the alignment process as a game between two competing LLMs. The learning objective is to find a policy that consistently generates responses preferred over those of any competing policy while staying close to the initial model; this objective is defined as the Nash equilibrium (NE) of the KL-regularized preference model. We make a first attempt to study the theoretical learnability of KL-regularized NLHF, considering both offline and online settings. For offline learning from a pre-collected dataset, we propose algorithms that are efficient under suitable coverage conditions on the dataset. For batch online learning from iterative interactions with a preference oracle, our proposed algorithm enjoys a finite-sample guarantee under a structural condition on the underlying preference model. Our results connect the new NLHF paradigm with traditional RL theory and validate the potential of reward-model-free learning under general preferences.
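In common notation for this setup (the symbols below are an assumed rendering, not fixed by the abstract: \(\pi_0\) is the initial model, \(\mathbb{P}(\pi \succ \pi')\) the probability that a response from \(\pi\) is preferred over one from \(\pi'\), and \(\eta > 0\) the KL-regularization coefficient), the NE objective described above can be sketched as the saddle point of

```latex
\pi^{*} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
  \mathbb{P}\!\left(\pi \succ \pi'\right)
  \;-\; \eta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
  \;+\; \eta\,\mathrm{KL}\!\left(\pi' \,\|\, \pi_0\right)
```

The two KL terms keep both players close to \(\pi_0\), so the game is symmetric and the NE is the policy that cannot be beaten by any competitor under the same regularization.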
Submission history
From: Chenlu Ye
[v1] Sun, 11 Feb 2024 21:44:21 GMT (65kb)
[v2] Thu, 25 Apr 2024 04:05:06 GMT (65kb,D)