Computer Science > Machine Learning
Title: Data Mixture in Training Un-assures Out-of-Distribution Generalization
(Submitted on 25 Dec 2023 (v1), revised 2 Feb 2024 (this version, v3), latest version 23 Apr 2024 (v4))
Abstract: While deep neural networks achieve good performance on in-distribution samples, their generalization ability degrades significantly under unknown test shifts. We study the out-of-distribution (OOD) generalization capability of models by exploring the relationship between generalization error and training set size. Previous empirical evidence suggests that error falls off as a power of the training set size and that lower error indicates better generalization. Our observations show that this does not hold for OOD samples: counterintuitively, increasing the training data size does not always decrease the test generalization error. We formally investigate this non-decreasing phenomenon in a linear setting and verify it empirically across varying visual benchmarks. To explain these results, we redefine OOD data as data located outside the convex hull of the training data mixture and prove a new generalization error bound based on this definition. Together, our observations highlight that the effectiveness of well-trained models can be guaranteed on data within the convex hull of the training mixture; for OOD data beyond this coverage, the capability of models may be unassured. To achieve better generalization without knowledge of the target environments, we demonstrate multiple strategies, including data augmentation and pre-training. We also employ a novel data selection algorithm that outperforms baselines.
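The abstract's key definition is that a sample is OOD when it lies outside the convex hull of the training data mixture. As an illustration (not the paper's own code), membership in a convex hull can be tested by solving a small linear-programming feasibility problem; the sketch below assumes SciPy is available and uses a hypothetical helper name `in_convex_hull`:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Return True if `point` lies in the convex hull of the rows of `points`.

    A point x is in conv({p_1, ..., p_n}) iff there exist weights
    lambda_i >= 0 with sum(lambda_i) = 1 and sum(lambda_i * p_i) = x.
    We check feasibility of this system with a zero-objective LP.
    """
    n = points.shape[0]
    # Equality constraints: points.T @ lam = point, and 1^T lam = 1.
    A_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

# Usage: the unit square's corners; an interior point is in-hull,
# a point far outside is OOD under the paper's definition.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(in_convex_hull(np.array([0.5, 0.5]), corners))  # interior
print(in_convex_hull(np.array([2.0, 2.0]), corners))  # outside
```

Under this view, the paper's bound applies to test points for which the check succeeds; for points where it fails, the model's behavior is what the abstract calls "unassured."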
Submission history
From: Songming Zhang
[v1] Mon, 25 Dec 2023 11:00:38 GMT (4305kb,D)
[v2] Tue, 2 Jan 2024 11:50:38 GMT (4306kb,D)
[v3] Fri, 2 Feb 2024 04:45:45 GMT (4146kb,D)
[v4] Tue, 23 Apr 2024 07:43:10 GMT (4955kb,D)