Computer Science > Machine Learning
Title: Data Mixture in Training Un-assures Out-of-Distribution Generalization
(Submitted on 25 Dec 2023 (v1), revised 2 Feb 2024 (this version, v3), latest version 23 Apr 2024 (v4))
Abstract: While deep neural networks achieve good performance on in-distribution samples, their generalization ability degrades significantly under unknown test shifts. We study the out-of-distribution (OOD) generalization capability of models by exploring the relationship between generalization error and training set size. Previous empirical evidence suggests that error falls off as a power of the training set size and that lower error indicates better generalization. Our observations show that this does not hold for OOD samples: counterintuitively, increasing the training data size does not always decrease the test generalization error. We formally investigate this non-decreasing phenomenon in a linear setting and verify it empirically across varying visual benchmarks. To explain these results, we redefine OOD data as data located outside the convex hull of the training data mixture and prove a new generalization error bound based on this definition. Together, our observations highlight that the effectiveness of well-trained models can be guaranteed on data within the convex hull of the training mixture; for OOD data beyond this coverage, the capability of models may be unassured. To achieve better generalization without knowledge of the target environments, we demonstrate multiple strategies, including data augmentation and pre-training. We also employ a novel data selection algorithm that outperforms baselines.
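The abstract's key definition is that a sample is OOD when it lies outside the convex hull of the training data mixture. As an illustration (not the paper's own code), membership in a convex hull can be tested by solving a small linear-programming feasibility problem; the sketch below assumes SciPy is available and uses a hypothetical helper name `in_convex_hull`:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Return True if `point` lies in the convex hull of the rows of `points`.

    A point x is in conv({p_1, ..., p_n}) iff there exist weights
    lambda_i >= 0 with sum(lambda_i) = 1 and sum(lambda_i * p_i) = x.
    We check feasibility of this system with a zero-objective LP.
    """
    n = points.shape[0]
    # Equality constraints: points.T @ lam = point, and 1^T lam = 1.
    A_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

# Usage: the unit square's corners; an interior point is in-hull,
# a point far outside is OOD under the paper's definition.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(in_convex_hull(np.array([0.5, 0.5]), corners))  # interior
print(in_convex_hull(np.array([2.0, 2.0]), corners))  # outside
```

Under this view, the paper's bound applies to test points for which the check succeeds; for points where it fails, the model's behavior is what the abstract calls "unassured."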
Submission history
From: Songming Zhang
[v1] Mon, 25 Dec 2023 11:00:38 GMT (4305kb,D)
[v2] Tue, 2 Jan 2024 11:50:38 GMT (4306kb,D)
[v3] Fri, 2 Feb 2024 04:45:45 GMT (4146kb,D)
[v4] Tue, 23 Apr 2024 07:43:10 GMT (4955kb,D)