Computer Science > Artificial Intelligence
Title: How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
(Submitted on 18 Mar 2024 (v1), last revised 25 Apr 2024 (this version, v2))
Abstract: Decision-making is a complex task that draws on many abilities, making it an excellent testbed for assessing Large Language Models (LLMs). We investigate LLMs' decision-making capabilities through the lens of an established field, Game Theory, focusing specifically on games in which more than two agents participate simultaneously. We then introduce GAMA-Bench, a framework comprising eight classical multi-agent games, together with a scoring scheme that quantitatively assesses a model's performance across them. Using GAMA-Bench, we study LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited; its performance can nevertheless be improved through approaches such as Chain-of-Thought. We further evaluate a range of LLMs and find that GPT-4 outperforms the other models on GAMA-Bench, achieving a score of 60.5, while Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate comparable intelligence on the benchmark. The code and experimental results are made publicly available via this https URL
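As a rough illustration of how a per-game scoring scheme might roll up into the single benchmark number mentioned above (e.g., GPT-4's 60.5), here is a minimal sketch. The game names, the placeholder scores, the 0-100 per-game scale, and the unweighted mean are all assumptions for illustration; the abstract only states that eight games are scored and combined, and the paper's actual scheme may differ.

```python
from statistics import mean

# Placeholder scores for the eight GAMA-Bench games (the abstract does
# not name them here); the values are illustrative, not real results.
game_scores = {
    f"game_{i}": s
    for i, s in enumerate(
        [72.0, 55.0, 61.0, 48.0, 66.0, 59.0, 63.0, 60.0], start=1
    )
}

def overall_score(scores: dict[str, float]) -> float:
    """Aggregate per-game scores (each assumed to be on a 0-100 scale)
    into one benchmark number via an unweighted mean."""
    return mean(scores.values())

print(f"Overall score: {overall_score(game_scores):.1f}")
```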
Submission history
From: Jen-Tse Huang [view email]
[v1] Mon, 18 Mar 2024 14:04:47 GMT (2223kb,D)
[v2] Thu, 25 Apr 2024 15:04:41 GMT (2221kb,D)