Computer Science > Machine Learning

Title: Data Mixture in Training Un-assures Out-of-Distribution Generalization

Abstract: While deep neural networks can achieve good performance on in-distribution samples, their generalization ability degrades significantly under unknown test shifts. We study the out-of-distribution (OOD) generalization capability of models by examining the relationship between generalization error and training set size. Previous empirical evidence suggests that error falls off as a power of training set size and that lower error indicates better generalization. However, our observations show that this does not hold for OOD samples: counterintuitively, increasing the training set size does not always decrease the test generalization error. We formally investigate this non-decreasing phenomenon in a linear setting and verify it empirically across varying visual benchmarks. To explain these results, we redefine OOD data as data located outside the convex hull of the training data mixture and prove a new generalization error bound. Together, our observations highlight that the effectiveness of well-trained models can be guaranteed only on data within the convex hull of the training mixture; for OOD data beyond this coverage, model capability may be unassured. To achieve better generalization without knowledge of the target environments, we demonstrate multiple strategies, including data augmentation and pre-training, and employ a novel data selection algorithm that outperforms baselines.
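
To make the convex-hull notion of OOD concrete, below is a minimal sketch (not taken from the paper) of one way to test whether a held-out environment lies inside the convex hull of the training mixture. It assumes each environment is summarized by a mean feature vector; the function name in_convex_hull and the mean-feature representation are illustrative assumptions, and the paper's formal definition is stated over distributions rather than single vectors.

    import numpy as np
    from scipy.optimize import linprog

    def in_convex_hull(train_points, query):
        """Return True if `query` lies in the convex hull of the rows of `train_points`.

        Solves the feasibility LP: find lambda >= 0 with sum(lambda) = 1
        and train_points.T @ lambda = query.
        """
        n, d = train_points.shape
        # Equality constraints: the mixture weights must reproduce the query
        # point and sum to one.
        A_eq = np.vstack([train_points.T, np.ones((1, n))])
        b_eq = np.concatenate([query, [1.0]])
        res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n, method="highs")
        return res.success

    # Hypothetical usage: rows are mean feature vectors of training environments.
    train_envs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    print(in_convex_hull(train_envs, np.array([0.2, 0.3])))  # True: inside the training mixture's hull
    print(in_convex_hull(train_envs, np.array([1.0, 1.0])))  # False: outside the hull, i.e. OOD under this reading

Under the abstract's claim, guarantees for a well-trained model apply to test data covered by this hull; a query falling outside it, like the second one above, is where generalization may be unassured.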
Comments: 18 pages, 9 figures
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2312.16243 [cs.LG]
  (or arXiv:2312.16243v3 [cs.LG] for this version)

Submission history

From: Songming Zhang [view email]
[v1] Mon, 25 Dec 2023 11:00:38 GMT (4305kb,D)
[v2] Tue, 2 Jan 2024 11:50:38 GMT (4306kb,D)
[v3] Fri, 2 Feb 2024 04:45:45 GMT (4146kb,D)
[v4] Tue, 23 Apr 2024 07:43:10 GMT (4955kb,D)
