Multi-Head Mixture-of-Experts

Wu, Xun; Huang, Shaohan; Wang, Wenhui; Wei, Furu

(Submitted on 23 Apr 2024)

Abstract: Sparse Mixtures of Experts (SMoE) scale model capacity without significant increases in training and inference costs, but exhibit the following two issues: (1) Low expert activation, where only a small subset of experts is activated for optimization. (2) A lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, MH-MoE is straightforward to implement and is decoupled from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
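
The split, route, and merge flow described in the abstract can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`split_heads`, `route_top1`, `mh_moe_layer`, `make_expert`), the dot-product top-1 router, the scaling functions that stand in for real expert networks, and the sizes (4 experts, 2 heads, dimension 8).

```python
import random


def split_heads(token, num_heads):
    """Split one token vector into num_heads equal-length sub-tokens."""
    if len(token) % num_heads != 0:
        raise ValueError("token length must be divisible by num_heads")
    size = len(token) // num_heads
    return [token[i * size:(i + 1) * size] for i in range(num_heads)]


def route_top1(sub_token, router_weights):
    """Return the index of the expert with the highest dot-product score.

    Assumption: a simple top-1 linear router, chosen only for illustration.
    """
    scores = [sum(w * x for w, x in zip(row, sub_token)) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def mh_moe_layer(tokens, num_heads, router_weights, experts):
    """Split each token into sub-tokens, route each one, then merge back."""
    outputs, used_experts = [], set()
    for token in tokens:
        merged = []
        for sub_token in split_heads(token, num_heads):
            idx = route_top1(sub_token, router_weights)
            used_experts.add(idx)
            # Concatenating expert outputs restores the original token length.
            merged.extend(experts[idx](sub_token))
        outputs.append(merged)
    return outputs, used_experts


def make_expert(scale):
    """Hypothetical stand-in expert that scales every element."""
    return lambda sub_token: [scale * x for x in sub_token]


random.seed(0)
NUM_EXPERTS, NUM_HEADS, DIM = 4, 2, 8
SUB_DIM = DIM // NUM_HEADS

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[random.gauss(0, 1) for _ in range(SUB_DIM)]
                  for _ in range(NUM_EXPERTS)]
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

outputs, used_experts = mh_moe_layer(tokens, NUM_HEADS, router_weights, experts)
print(len(outputs), len(outputs[0]), sorted(used_experts))
```

Because each token contributes `NUM_HEADS` routing decisions instead of one, a batch of tokens has more chances to reach different experts, which is the intuition behind the higher expert activation claimed in the abstract.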

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2404.15045 [cs.CL]
	(or arXiv:2404.15045v1 [cs.CL] for this version)

Submission history

From: Shaohan Huang
[v1] Tue, 23 Apr 2024 13:47:09 GMT (11150kb,D)
