Computer Science > Computer Science and Game Theory
Title: Near-Optimal Last-iterate Convergence of Policy Optimization in Zero-sum Polymatrix Markov games
(Submitted on 15 Aug 2023 (v1), last revised 16 Aug 2023 (this version, v2))
Abstract: Computing approximate Nash equilibria in multi-player general-sum Markov games is a computationally intractable task. However, multi-player Markov games with certain cooperative or competitive structures may circumvent this intractability. In this paper, we focus on multi-player zero-sum polymatrix Markov games, where players interact in a pairwise fashion while remaining overall competitive. We propose, to the best of our knowledge, the first policy optimization algorithm, Entropy-Regularized Optimistic-Multiplicative-Weights-Update (ER-OMWU), for finding approximate Nash equilibria in finite-horizon zero-sum polymatrix Markov games with full-information feedback. We provide last-iterate convergence guarantees for finding an $\epsilon$-approximate Nash equilibrium within $\tilde{O}(1/\epsilon)$ iterations, which is near-optimal compared to the optimal $O(1/\epsilon)$ iteration complexity in two-player zero-sum Markov games, the degenerate case of zero-sum polymatrix games with only two players. Our algorithm combines regularized and optimistic learning dynamics with a separate smooth value update within a single loop, where players update their strategies in a symmetric and almost uncoupled manner. It provides a natural dynamics for finding equilibria and is more amenable to future sample-efficient and fully decentralized implementations in which only partial-information feedback is available.
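To make the flavor of the learning dynamics concrete, here is a minimal sketch of entropy-regularized optimistic multiplicative-weights updates on the degenerate two-player zero-sum matrix-game case mentioned in the abstract. This is an illustration of the general ER-OMWU idea, not the paper's actual algorithm (which operates on finite-horizon Markov games with a value update); the step size `eta`, regularization weight `tau`, and iteration count are hypothetical choices.

```python
import numpy as np

def er_omwu_matrix_game(A, x0=None, y0=None, eta=0.1, tau=0.05, iters=2000):
    """Entropy-regularized OMWU on a two-player zero-sum matrix game.

    Player x minimizes x^T A y, player y maximizes it. Illustrative
    sketch only; hyperparameters are hypothetical, not from the paper.
    """
    n, m = A.shape
    x = np.ones(n) / n if x0 is None else x0.astype(float).copy()
    y = np.ones(m) / m if y0 is None else y0.astype(float).copy()
    gx_prev = A @ y       # previous loss gradient for x
    gy_prev = A.T @ x     # previous gain gradient for y
    for _ in range(iters):
        gx, gy = A @ y, A.T @ x
        # Optimistic step: extrapolate the gradient via 2*g_t - g_{t-1}.
        # The x**(1 - eta*tau) factor is the entropy regularization,
        # which pulls iterates toward the uniform distribution and
        # yields last-iterate convergence.
        x = x ** (1 - eta * tau) * np.exp(-eta * (2 * gx - gx_prev))
        y = y ** (1 - eta * tau) * np.exp(eta * (2 * gy - gy_prev))
        x /= x.sum()
        y /= y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

# Matching pennies: the unique Nash equilibrium is uniform play.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y = er_omwu_matrix_game(A, x0=np.array([0.8, 0.2]),
                              y0=np.array([0.3, 0.7]))
```

Note how both players run the same symmetric, almost uncoupled rule: each update reads only that player's own gradient, mirroring the decentralized flavor of the dynamics the abstract describes.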
Submission history
From: Zailin Ma [view email][v1] Tue, 15 Aug 2023 16:40:10 GMT (34kb)
[v2] Wed, 16 Aug 2023 17:17:48 GMT (0kb,I)