Title: Weak-to-Strong Extrapolation Expedites Alignment

Abstract: The open-source community is experiencing a surge in the release of large language models (LLMs) that are trained to follow instructions and align with human preferences. However, further training to improve them still requires expensive computational resources and data annotation. Is it possible to bypass additional training and cost-effectively obtain better-aligned models? Inspired by the literature on model interpolation, we propose a simple method called ExPO to boost LLMs' alignment with human preferences. Given a model that has undergone alignment training (e.g., via DPO or RLHF) and its initial SFT checkpoint, ExPO directly obtains a better-aligned model by extrapolating from the weights of the initial and the aligned models, which implicitly optimizes the alignment objective via first-order approximation. Through experiments with twelve open-source LLMs on HuggingFace, we demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models, as evaluated on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench. Moreover, ExPO exhibits remarkable scalability across model sizes (from 1.8B to 70B) and capabilities. Through controlled experiments and further empirical analyses, we shed light on the essence of ExPO: it amplifies the reward signal learned during alignment training. Our work demonstrates the efficacy of model extrapolation in expediting the alignment of LLMs with human preferences, suggesting a promising direction for future research.
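To make the extrapolation step concrete, a minimal sketch follows, assuming the first-order form theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft), where alpha is an extrapolation coefficient. The function name, the default alpha, and the state-dict handling are illustrative assumptions based only on the abstract's description, not the paper's exact implementation.

    import torch

    def expo_extrapolate(sft_state, aligned_state, alpha=0.3):
        """Extrapolate past the aligned checkpoint along the SFT -> aligned weight direction."""
        expo_state = {}
        for name, w_aligned in aligned_state.items():
            w_sft = sft_state[name]
            # Move further in the direction alignment training moved the weights:
            # theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft)
            expo_state[name] = w_aligned + alpha * (w_aligned - w_sft)
        return expo_state

    # Toy usage with random tensors standing in for model weights.
    sft = {"linear.weight": torch.randn(4, 4)}
    aligned = {"linear.weight": sft["linear.weight"] + 0.1 * torch.randn(4, 4)}
    expo = expo_extrapolate(sft, aligned, alpha=0.5)

In practice the two state dicts would come from an SFT checkpoint and its DPO/RLHF-trained counterpart with identical architectures; alpha would be tuned on a held-out alignment benchmark, since too large a step can overshoot.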
Comments: Added theoretical explanation and more evaluation results
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2404.16792 [cs.LG]
  (or arXiv:2404.16792v2 [cs.LG] for this version)

Submission history

From: Chujie Zheng [view email]
[v1] Thu, 25 Apr 2024 17:39:50 GMT (1094kb,D)
[v2] Wed, 22 May 2024 19:33:30 GMT (1164kb,D)
