
Machine Learning

New submissions

[ total of 124 entries: 1-124 ]

New submissions for Fri, 24 May 24

[1]  arXiv:2405.13073 [pdf, other]
Title: A graph-structured distance for heterogeneous datasets with meta variables
Comments: 25 pages (without references), 5 figures, data and scripts available at this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Heterogeneous datasets emerge in various machine learning or optimization applications that feature different data sources, various data types and complex relationships between variables. In practice, heterogeneous datasets are often partitioned into smaller well-behaved ones that are easier to process. However, some applications involve expensive-to-generate or limited-size datasets, which motivates methods based on the whole dataset. The first main contribution of this work is a graph-structured modeling framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks. This framework models domains that involve heterogeneous datasets in which variables may be continuous, integer, or categorical, with some identified as meta if their values determine the inclusion/exclusion or affect the bounds of other so-called decreed variables. Excluded variables are introduced to manage variables that are included or excluded depending on the given point. The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables: any pair of points can be compared, allowing one to work directly in heterogeneous datasets with meta variables. The contributions are illustrated with regression experiments, in which the performance of a multilayer perceptron with respect to its hyperparameters is modeled with inverse distance weighting and $K$-nearest neighbors models.
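Purely as an illustrative sketch (not the paper's construction), the snippet below encodes excluded variables as NaNs, defines a toy extended-point distance in which mismatched inclusion incurs a fixed penalty, and plugs it into an inverse distance weighting predictor; the encoding, the penalty term, and all names are assumptions.

    import numpy as np

    def extended_distance(u, v, penalty=1.0):
        # Toy distance between extended points; np.nan marks an excluded variable.
        # Included-included pairs contribute squared differences; a variable included
        # in one point but excluded in the other contributes a fixed penalty.
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        both = ~np.isnan(u) & ~np.isnan(v)
        mismatch = np.isnan(u) ^ np.isnan(v)
        return np.sqrt(np.sum((u[both] - v[both]) ** 2) + penalty * np.sum(mismatch))

    def idw_predict(x, X_train, y_train, p=2.0, eps=1e-8):
        # Inverse distance weighting with the toy distance above.
        d = np.array([extended_distance(x, xi) for xi in X_train])
        w = 1.0 / (d ** p + eps)
        return np.sum(w * y_train) / np.sum(w)

    X_train = [[1.0, 0.5, np.nan], [0.0, 0.2, 3.0], [2.0, np.nan, 1.0]]
    y_train = np.array([1.0, 0.0, 2.0])
    print(idw_predict([1.0, 0.4, np.nan], X_train, y_train))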

[2]  arXiv:2405.13149 [pdf, other]
Title: Gaussian Measures Conditioned on Nonlinear Observations: Consistency, MAP Estimators, and Simulation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)

The article presents a systematic study of the problem of conditioning a Gaussian random variable $\xi$ on nonlinear observations of the form $F \circ \phi(\xi)$, where $\phi: \mathcal{X} \to \mathbb{R}^N$ is a bounded linear operator and $F$ is nonlinear. Such problems arise in the context of Bayesian inference and recent machine learning-inspired PDE solvers. We give a representer theorem for the conditioned random variable $\xi \mid F\circ \phi(\xi)$, stating that it decomposes as the sum of an infinite-dimensional Gaussian (which is identified analytically) and a finite-dimensional non-Gaussian measure. We also introduce a novel notion of the mode of a conditional measure by taking the limit of the natural relaxation of the problem, to which we can apply the existing notion of maximum a posteriori estimators of posterior measures. Finally, we introduce a variant of the Laplace approximation for the efficient simulation of the aforementioned conditioned Gaussian random variables, towards uncertainty quantification.

[3]  arXiv:2405.13160 [pdf, other]
Title: Borrowing Strength in Distributionally Robust Optimization via Hierarchical Dirichlet Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

This paper presents a novel optimization framework to address key challenges presented by modern machine learning applications: high dimensionality, distributional uncertainty, and data heterogeneity. Our approach unifies regularized estimation, distributionally robust optimization (DRO), and hierarchical Bayesian modeling in a single data-driven criterion. By employing a hierarchical Dirichlet process (HDP) prior, the method effectively handles multi-source data, achieving regularization, distributional robustness, and borrowing strength across diverse yet related data-generating processes. We demonstrate the method's advantages by establishing theoretical performance guarantees and tractable Monte Carlo approximations based on Dirichlet process (DP) theory. Numerical experiments validate the framework's efficacy in improving and stabilizing both prediction and parameter estimation accuracy, showcasing its potential for application in complex data environments.

[4]  arXiv:2405.13302 [pdf, other]
Title: Accelerated Evaluation of Ollivier-Ricci Curvature Lower Bounds: Bridging Theory and Computation
Subjects: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Optimization and Control (math.OC)

Curvature serves as a potent and descriptive invariant, with its efficacy validated both theoretically and practically within graph theory. We employ a definition of generalized Ricci curvature proposed by Ollivier, which Lin and Yau later adapted to graph theory, known as Ollivier-Ricci curvature (ORC). ORC measures curvature using the Wasserstein distance, thereby integrating geometric concepts with probability theory and optimal transport. Jost and Liu previously discussed the lower bound of ORC by showing the upper bound of the Wasserstein distance. We extend the applicability of these bounds to discrete spaces with metrics on integers, specifically hypergraphs. Compared to prior work on ORC in hypergraphs by Coupette, Dalleiger, and Rieck, which faced computational challenges, our method introduces a simplified approach with linear computational complexity, making it particularly suitable for analyzing large-scale networks. Through extensive simulations and application to synthetic and real-world datasets, we demonstrate the significant improvements our method offers in evaluating ORC.
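For context only (this is the quantity whose direct computation the lower bounds let one avoid, not the paper's accelerated method), here is a minimal sketch of the exact Ollivier-Ricci curvature of a graph edge, assuming the networkx and POT (ot) packages and lazy random-walk measures with laziness alpha.

    import networkx as nx
    import numpy as np
    import ot  # Python Optimal Transport (POT) package

    def orc_edge(G, x, y, alpha=0.0):
        # kappa(x, y) = 1 - W1(mu_x, mu_y) / d(x, y), with mu_v the lazy random-walk
        # measure putting mass alpha on v and (1 - alpha)/deg(v) on each neighbour.
        def measure(v):
            nbrs = list(G.neighbors(v))
            nodes = [v] + nbrs
            probs = np.array([alpha] + [(1.0 - alpha) / len(nbrs)] * len(nbrs))
            return nodes, probs

        sx, mu_x = measure(x)
        sy, mu_y = measure(y)
        # Ground costs between the two supports: shortest-path distances.
        M = np.array([[nx.shortest_path_length(G, u, w) for w in sy] for u in sx], dtype=float)
        W1 = ot.emd2(mu_x, mu_y, M)  # exact 1-Wasserstein distance
        return 1.0 - W1 / nx.shortest_path_length(G, x, y)

    G = nx.karate_club_graph()
    print(orc_edge(G, 0, 1, alpha=0.0))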

[5]  arXiv:2405.13456 [pdf, other]
Title: Deep linear networks for regression are implicitly regularized towards flat minima
Comments: 46 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for overdetermined univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
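As a minimal sketch (not taken from the paper), the sharpness, i.e., the largest Hessian eigenvalue of the training loss, can be estimated by power iteration on Hessian-vector products; the toy deep linear network, data, and iteration counts below are assumptions.

    import torch

    def sharpness(loss, params, iters=50):
        # Largest Hessian eigenvalue via power iteration on Hessian-vector products.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat_grad = torch.cat([g.reshape(-1) for g in grads])
        v = torch.randn_like(flat_grad)
        v = v / v.norm()
        eig = torch.tensor(0.0)
        for _ in range(iters):
            hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
            hv = torch.cat([h.reshape(-1) for h in hv])
            eig = v @ hv                      # Rayleigh quotient (v has unit norm)
            v = hv / (hv.norm() + 1e-12)
        return eig.item()

    torch.manual_seed(0)
    X, y = torch.randn(200, 5), torch.randn(200, 1)       # overdetermined regression data
    net = torch.nn.Sequential(                            # deep *linear* network: no activations
        torch.nn.Linear(5, 16), torch.nn.Linear(16, 16), torch.nn.Linear(16, 1))
    loss = torch.nn.functional.mse_loss(net(X), y)
    print(sharpness(loss, list(net.parameters())))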

[6]  arXiv:2405.13481 [pdf, other]
Title: Locally Private Estimation with Public Features
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

We initiate the study of locally differentially private (LDP) learning with public features. We define semi-feature LDP, where some features are publicly available while the remaining ones, along with the label, require protection under local differential privacy. Under semi-feature LDP, we demonstrate that the minimax convergence rate for non-parametric regression is significantly reduced compared to that of classical LDP. Then we propose HistOfTree, an estimator that fully leverages the information contained in both public and private features. Theoretically, HistOfTree reaches the minimax optimal convergence rate. Empirically, HistOfTree achieves superior performance on both synthetic and real data. We also explore scenarios where users have the flexibility to select features for protection manually. In such cases, we propose an estimator and a data-driven parameter tuning strategy, leading to analogous theoretical and empirical results.

[7]  arXiv:2405.13587 [pdf, other]
Title: Exact Gradients for Stochastic Spiking Neural Networks Driven by Rough Signals
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

We introduce a mathematically rigorous framework based on rough path theory to model stochastic spiking neural networks (SSNNs) as stochastic differential equations with event discontinuities (Event SDEs) and driven by càdlàg rough paths. Our formalism is general enough to allow for potential jumps to be present both in the solution trajectories as well as in the driving noise. We then identify a set of sufficient conditions ensuring the existence of pathwise gradients of solution trajectories and event times with respect to the network's parameters and show how these gradients satisfy a recursive relation. Furthermore, we introduce a general-purpose loss function defined by means of a new class of signature kernels indexed on càdlàg rough paths and use it to train SSNNs as generative models. We provide an end-to-end autodifferentiable solver for Event SDEs and make its implementation available as part of the $\texttt{diffrax}$ library. Our framework is, to our knowledge, the first enabling gradient-based training of SSNNs with noise affecting both the spike timing and the network's dynamics.

[8]  arXiv:2405.13731 [pdf, other]
Title: Control, Transport and Sampling: Towards Better Loss Design
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Leveraging connections between diffusion-based sampling, optimal transport, and optimal stochastic control through their shared links to the Schrödinger bridge problem, we propose novel objective functions that can be used to transport $\nu$ to $\mu$ and, consequently, to sample from the target $\mu$, via optimally controlled dynamics. We highlight the importance of the pathwise perspective and the role that various optimality conditions on the path measure can play in the design of valid training losses, the careful choice of which offers numerical advantages in practical implementation.

[9]  arXiv:2405.13794 [pdf, other]
Title: Conditioning diffusion models by explicit forward-backward bridging
Comments: 24 pages, 12 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Given an unconditional diffusion model $\pi(x, y)$, using it to perform conditional simulation $\pi(x \mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express conditional simulation as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\pi(x \mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples.

[10]  arXiv:2405.13846 [pdf, other]
Title: Regression Trees Know Calculus
Authors: Nathan Wycoff
Comments: Comments very welcome!
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Regression trees have emerged as a preeminent tool for solving real-world regression problems due to their ability to deal with nonlinearities, interaction effects and sharp discontinuities. In this article, we instead study regression trees applied to well-behaved, differentiable functions, and determine the relationship between node parameters and the local gradient of the function being approximated. We find a simple estimate of the gradient which can be efficiently computed using quantities exposed by popular tree learning libraries. This allows the tools developed in the context of differentiable algorithms, like neural nets and Gaussian processes, to be deployed to tree-based models. To demonstrate this, we study measures of model sensitivity defined in terms of integrals of gradients and demonstrate how to compute them for regression trees using the proposed gradient estimates. Quantitative and qualitative numerical experiments reveal the capability of gradients estimated by regression trees to improve predictive analysis, solve tasks in uncertainty quantification, and provide interpretation of model behavior.

[11]  arXiv:2405.13899 [pdf, ps, other]
Title: Symmetric Linear Bandits with Hidden Symmetry
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

High-dimensional linear bandits with low-dimensional structure have received considerable attention in recent studies due to their practical significance. The most common structure in the literature is sparsity. However, it may not be available in practice. Symmetry, where the reward is invariant under certain groups of transformations on the set of arms, is another important inductive bias in the high-dimensional case that covers many standard structures, including sparsity. In this work, we study high-dimensional symmetric linear bandits where the symmetry is hidden from the learner, and the correct symmetry needs to be learned in an online setting. We examine the structure of a collection of hidden symmetries and provide a method based on model selection within the collection of low-dimensional subspaces. Our algorithm achieves a regret bound of $O(d_0^{1/3} T^{2/3} \log(d))$, where $d$ is the ambient dimension, which is potentially very large, and $d_0$ is the dimension of the true low-dimensional subspace such that $d_0 \ll d$. With an extra assumption on well-separated models, we can further improve the regret to $O(d_0\sqrt{T\log(d)})$.

[12]  arXiv:2405.13950 [pdf, other]
Title: Actor-critic algorithms for fiber sampling problems
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We propose an actor-critic algorithm for a family of complex problems arising in algebraic statistics and discrete optimization. The core task is to produce a sample from a finite subset of the non-negative integer lattice defined by a high-dimensional polytope. We translate the problem into a Markov decision process and devise an actor-critic reinforcement learning (RL) algorithm to learn a set of good moves that can be used for sampling. We prove that the actor-critic algorithm converges to an approximately optimal sampling policy.
To tackle complexity issues that typically arise in these sampling problems, and to allow the RL to function at scale, our solution strategy takes three steps: decomposing the starting point of the sample, using RL on each induced subproblem, and reconstructing to obtain a sample in the original polytope. In this setup, the proof of convergence applies to each subproblem in the decomposition.
We test the method in two regimes. In statistical applications, a high-dimensional polytope arises as the support set for the reference distribution in a model/data fit test for a broad family of statistical models for categorical data. We demonstrate how RL can be used for model fit testing problems for data sets for which traditional MCMC samplers converge too slowly due to problem size and sparsity structure. To test the robustness of the algorithm and explore its generalization properties, we apply it to synthetically generated data of various sizes and sparsity levels.

[13]  arXiv:2405.13962 [pdf, other]
Title: Learning heavy-tailed distributions with Wasserstein-proximal-regularized $α$-divergences
Comments: 23 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In this paper, we propose Wasserstein proximals of $\alpha$-divergences as suitable objective functionals for learning heavy-tailed distributions in a stable manner. First, we provide sufficient, and in some cases necessary, relations among data dimension, $\alpha$, and the decay rate of data distributions for the Wasserstein-proximal-regularized divergence to be finite. Finite-sample convergence rates for the estimation in the case of the Wasserstein-1 proximal divergences are then provided under certain tail conditions. Numerical experiments demonstrate stable learning of heavy-tailed distributions -- even those without first or second moment -- without any explicit knowledge of the tail behavior, using suitable generative models such as GANs and flow-based models related to our proposed Wasserstein-proximal-regularized $\alpha$-divergences. Heuristically, $\alpha$-divergences handle the heavy tails and Wasserstein proximals allow non-absolute continuity between distributions and control the velocities of flow-based algorithms as they learn the target distribution deep into the tails.

[14]  arXiv:2405.13997 [pdf, other]
Title: Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Comments: 31 pages, 2 figures. arXiv admin note: text overlap with arXiv:2402.02952
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in the current literature. In this paper, we verify theoretically that sigmoid gating, in fact, enjoys a higher sample efficiency than softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case in which the number of experts fitted is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate identifiability conditions for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activations such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under sigmoid gating than under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
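A toy numerical illustration (not the paper's estimator) contrasting the two gating functions on a mixture of linear experts; all shapes and parameter values are assumptions.

    import numpy as np

    def moe_output(x, expert_W, gate_W, gating="softmax"):
        # Mixture of linear experts with either softmax or sigmoid gating.
        scores = gate_W @ x                                    # one gating score per expert
        if gating == "softmax":
            g = np.exp(scores - scores.max())
            g = g / g.sum()                                    # experts compete for mass
        else:
            g = 1.0 / (1.0 + np.exp(-scores))                  # each expert gated independently
        expert_out = np.stack([W @ x for W in expert_W])       # (n_experts, out_dim)
        return g @ expert_out

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    expert_W = [rng.normal(size=(1, 4)) for _ in range(3)]
    gate_W = rng.normal(size=(3, 4))
    print(moe_output(x, expert_W, gate_W, "softmax"), moe_output(x, expert_W, gate_W, "sigmoid"))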

[15]  arXiv:2405.14038 [pdf, other]
Title: FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear Bandits
Comments: 28 pages, 1 figure
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, \textbf{F}orgetfu\textbf{L} \textbf{I}terative \textbf{P}rivate \textbf{HA}rd \textbf{T}hresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of the Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret up to logarithmic factors. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of independent interest.
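A minimal sketch of the flavor of noisy iterative hard thresholding used as the sparse regression oracle: a gradient step, Gaussian noise for privacy, then keeping the s largest coordinates. The step size, the noise scale and its calibration to a privacy budget, and FLIPHAT's episodic schedule are omitted; everything below is an assumption.

    import numpy as np

    def noisy_iht(X, y, s, sigma, iters=100, eta=0.05, seed=0):
        # Noisy iterative hard thresholding (sketch): gradient step + Gaussian noise,
        # followed by projection onto s-sparse vectors.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(iters):
            grad = X.T @ (X @ beta - y) / n
            beta = beta - eta * (grad + sigma * rng.normal(size=d))
            keep = np.argsort(np.abs(beta))[-s:]         # hard thresholding step
            sparse = np.zeros(d)
            sparse[keep] = beta[keep]
            beta = sparse
        return beta

    rng = np.random.default_rng(1)
    n, d, s = 500, 200, 5
    X = rng.normal(size=(n, d))
    beta_true = np.zeros(d); beta_true[:s] = 1.0
    y = X @ beta_true + 0.1 * rng.normal(size=n)
    print(np.flatnonzero(noisy_iht(X, y, s, sigma=0.01)))   # recovered support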

[16]  arXiv:2405.14064 [pdf, other]
Title: Building a stable classifier with the inflated argmax
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
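A simple sketch of the idea (the paper's exact inflation rule and its bagging pipeline may differ): after bagging has produced stable continuous scores, return every label whose score is within a slack eps of the maximum instead of the single maximizer.

    import numpy as np

    def inflated_argmax(scores, eps=0.05):
        # Stable relaxation of argmax: all labels within eps of the top score.
        scores = np.asarray(scores, dtype=float)
        return set(np.flatnonzero(scores >= scores.max() - eps))

    # Scores assumed to come from bagging (averaging over models fit on resamples).
    print(inflated_argmax([0.31, 0.33, 0.32, 0.04]))   # near-ties yield a set of candidates
    print(inflated_argmax([0.70, 0.20, 0.05, 0.05]))   # a clear winner yields a singleton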

[17]  arXiv:2405.14131 [pdf, other]
Title: Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of Experts
Comments: 44 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

The cosine router in sparse Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in sparse MoE has been lacking. Considering the least square estimation of the cosine routing sparse MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$ where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router -- simply adding noises to the $\mathbb{L}_{2}$ norms in the cosine router, which we refer to as \textit{perturbed cosine router}. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing sparse MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
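A toy sketch contrasting the plain and perturbed cosine router (the exact placement and distribution of the noise are assumptions): the routing score is a temperature-scaled cosine similarity between the input representation and the expert embeddings, and the perturbation adds noise to the $\mathbb{L}_2$ norms in the denominator.

    import numpy as np

    def cosine_router(x, E, tau=0.1, noise=0.0, rng=None):
        # Cosine routing: softmax of cos(x, e_k) / tau over expert embeddings e_k (rows of E).
        # noise > 0 perturbs the L2 norms in the denominator (the stabilization trick).
        rng = rng or np.random.default_rng(0)
        norms = np.linalg.norm(E, axis=1) * np.linalg.norm(x)
        if noise > 0:
            norms = norms + noise * np.abs(rng.normal(size=norms.shape))
        scores = (E @ x) / (tau * norms)
        probs = np.exp(scores - scores.max())
        return probs / probs.sum()

    rng = np.random.default_rng(1)
    x, E = rng.normal(size=8), rng.normal(size=(4, 8))
    print(cosine_router(x, E))                 # plain cosine router
    print(cosine_router(x, E, noise=0.1))      # perturbed cosine router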

[18]  arXiv:2405.14285 [pdf, other]
Title: Computing the Bias of Constant-step Stochastic Approximation with Markovian Noise
Comments: Preprint
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

We study stochastic approximation algorithms with Markovian noise and constant step-size $\alpha$. We develop a method based on infinitesimal generator comparisons to study the bias of the algorithm, which is the expected difference between $\theta_n$ -- the value at iteration $n$ -- and $\theta^*$ -- the unique equilibrium of the corresponding ODE. We show that, under some smoothness conditions, this bias is of order $O(\alpha)$. Furthermore, we show that the time-averaged bias is equal to $\alpha V + O(\alpha^2)$, where $V$ is a constant characterized by a Lyapunov equation, showing that $\mathbb{E}[\bar{\theta}_n] \approx \theta^*+V\alpha + O(\alpha^2)$, where $\bar{\theta}_n=(1/n)\sum_{k=1}^n\theta_k$ is the Polyak-Ruppert average. We also show that $\bar{\theta}_n$ converges with high probability around $\theta^*+\alpha V$. We illustrate how to combine this with Richardson-Romberg extrapolation to derive an iterative scheme with a bias of order $O(\alpha^2)$.
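A toy numerical sketch (not from the paper) of the Richardson-Romberg idea on a one-dimensional stochastic approximation with correlated noise: run the Polyak-Ruppert average at step sizes $\alpha$ and $2\alpha$ and combine them so that the $O(\alpha)$ bias term cancels; the recursion, the noise model, and all constants below are assumptions.

    import numpy as np

    def polyak_ruppert_average(alpha, n_iters, seed=0):
        # Constant-step stochastic approximation theta <- theta + alpha*(h(theta) + noise)
        # with a nonlinear drift h(theta) = -(theta + theta^2) (equilibrium theta* = 0)
        # and AR(1) ("Markovian") noise; returns the Polyak-Ruppert average.
        rng = np.random.default_rng(seed)
        theta, noise, avg = 0.0, 0.0, 0.0
        for k in range(1, n_iters + 1):
            noise = 0.5 * noise + 0.3 * rng.normal()
            theta += alpha * (-(theta + theta ** 2) + noise)
            avg += (theta - avg) / k               # running Polyak-Ruppert average
        return avg

    alpha = 0.05
    est_a = polyak_ruppert_average(alpha, 500_000, seed=0)
    est_2a = polyak_ruppert_average(2 * alpha, 500_000, seed=1)
    # Richardson-Romberg extrapolation: the O(alpha) bias term cancels in 2*est_a - est_2a.
    print(est_a, est_2a, 2 * est_a - est_2a)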

[19]  arXiv:2405.14335 [pdf, other]
Title: Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS) that logarithmically smooths large importance weights. The bound for LS is provably tighter than all its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.

[20]  arXiv:2405.14374 [pdf, other]
Title: State-Constrained Offline Reinforcement Learning
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Traditional offline reinforcement learning methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the algorithm greatly. In this paper, we alleviate this limitation by introducing a novel framework named \emph{state-constrained} offline reinforcement learning. By exclusively focusing on the dataset's state distribution, our framework significantly enhances learning potential and reduces previous limitations. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline reinforcement learning. Our research is underpinned by solid theoretical findings that pave the way for subsequent advancements in this domain. Additionally, we introduce StaCQ, a deep learning algorithm that is both performance-driven on the D4RL benchmark datasets and closely aligned with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in state-constrained offline reinforcement learning.

[21]  arXiv:2405.14459 [pdf, other]
Title: Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization
Authors: Ferdinand Genans-Boiteux (LPSM (UMR_8001)), Antoine Godichon-Baggioni (LPSM (UMR_8001)), François-Xavier Vialard (LIGM), Olivier Wintenberger (LPSM (UMR_8001))
Subjects: Machine Learning (stat.ML)

Optimal Transport (OT) based distances are powerful tools for machine learning to compare probability measures and manipulate them using OT maps. In this field, a setting of interest is semi-discrete OT, where the source measure $\mu$ is continuous, while the target $\nu$ is discrete. Recent works have shown that the minimax rate for the OT map is $\mathcal{O}(t^{-1/2})$ when using $t$ i.i.d. subsamples from each measure (two-sample setting). An open question is whether a better convergence rate can be achieved when the full information of the discrete measure $\nu$ is known (one-sample setting). In this work, we answer this question positively by (i) proving an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cell estimation and density support estimation, and (ii) proposing a Stochastic Gradient Descent (SGD) algorithm with adaptive entropic regularization and averaging acceleration. To nearly achieve the desired fast rate, characteristic of non-regular parametric problems, we design an entropic regularization scheme decreasing with the number of samples. Another key step in our algorithm is a projection step that allows us to leverage the local strong convexity of the regularized OT problem. Our convergence analysis integrates online convex optimization and stochastic gradient techniques, complemented by the specificities of the OT semi-dual. Moreover, while being as computationally and memory efficient as vanilla SGD, our algorithm achieves the unusually fast rates of our theory in numerical experiments.

[22]  arXiv:2405.14532 [pdf, other]
Title: Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem
Comments: 28 pages, 1 figure. Comments are most welcome!
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

The Procrustes-Wasserstein problem consists in matching two high-dimensional point clouds in an unsupervised setting, and has many applications in natural language processing and computer vision. We consider a planted model with two datasets $X,Y$ that consist of $n$ datapoints in $\mathbb{R}^d$, where $Y$ is a noisy version of $X$, up to an orthogonal transformation and a relabeling of the data points. This setting is related to the graph alignment problem in geometric models. In this work, we focus on the Euclidean transport cost between the point clouds as a measure of performance for the alignment. We first establish information-theoretic results, in the high ($d \gg \log n$) and low ($d \ll \log n$) dimensional regimes. We then study computational aspects and propose the Ping-Pong algorithm, alternately estimating the orthogonal transformation and the relabeling, initialized via a Frank-Wolfe convex relaxation. We give sufficient conditions for the method to retrieve the planted signal after a single step. We provide experimental results to compare the proposed approach with the state-of-the-art method of Grave et al. (2019).
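A simplified sketch of the alternating scheme, initialized here at the identity relabeling rather than the paper's Frank-Wolfe convex relaxation; the synthetic data and noise level are assumptions.

    import numpy as np
    from scipy.linalg import orthogonal_procrustes
    from scipy.optimize import linear_sum_assignment

    def ping_pong(X, Y, n_steps=10):
        # Alternate between the orthogonal map (Procrustes, via SVD) and the relabeling
        # (linear assignment), seeking perm and Q such that X[perm] @ Q ~ Y.
        n = X.shape[0]
        perm = np.arange(n)
        for _ in range(n_steps):
            Q, _ = orthogonal_procrustes(X[perm], Y)   # best orthogonal map given the labels
            cost = -(Y @ (X @ Q).T)                    # cost[i, j] = -<Y_i, (X Q)_j>
            _, perm = linear_sum_assignment(cost)      # best relabeling given the map
        return Q, perm

    rng = np.random.default_rng(0)
    n, d = 100, 10
    X = rng.normal(size=(n, d))
    Q_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
    perm_true = rng.permutation(n)
    Y = X[perm_true] @ Q_true + 0.05 * rng.normal(size=(n, d))
    Q_hat, perm_hat = ping_pong(X, Y)
    print((perm_hat == perm_true).mean())              # fraction of agreeing labels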

[23]  arXiv:2405.14540 [pdf, other]
Title: This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian Optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Bayesian Optimization (BO) has proven to be very successful at optimizing a static, noisy, costly-to-evaluate black-box function $f : \mathcal{S} \to \mathbb{R}$. However, optimizing a black-box which is also a function of time (i.e., a dynamic function) $f : \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ remains a challenge, since a dynamic Bayesian Optimization (DBO) algorithm has to keep track of the optimum over time. This changes the nature of the optimization problem in at least three aspects: (i) querying an arbitrary point in $\mathcal{S} \times \mathcal{T}$ is impossible, (ii) past observations become less and less relevant for keeping track of the optimum as time goes by and (iii) the DBO algorithm must have a high sampling frequency so it can collect enough relevant observations to keep track of the optimum through time. In this paper, we design a Wasserstein distance-based criterion able to quantify the relevancy of an observation with respect to future predictions. Then, we leverage this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus maintaining simultaneously a good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon. Numerical experiments establish the superiority of W-DBO, which outperforms state-of-the-art methods by a comfortable margin.

[24]  arXiv:2405.14574 [pdf, other]
Title: Learning with Fitzpatrick Losses
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Fenchel-Young losses are a family of convex loss functions, encompassing the squared, logistic and sparsemax losses, among others. Each Fenchel-Young loss is implicitly associated with a link function, for mapping model outputs to predictions. For instance, the logistic loss is associated with the soft argmax link function. Can we build new loss functions associated with the same link function as Fenchel-Young losses? In this paper, we introduce Fitzpatrick losses, a new family of convex loss functions based on the Fitzpatrick function. A well-known theoretical tool in maximal monotone operator theory, the Fitzpatrick function naturally leads to a refined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses, while maintaining the same link function for prediction. As an example, we introduce the Fitzpatrick logistic loss and the Fitzpatrick sparsemax loss, counterparts of the logistic and the sparsemax losses. This yields two new tighter losses associated with the soft argmax and the sparse argmax, two of the most ubiquitous output layers used in machine learning. We study in detail the properties of Fitzpatrick losses and, in particular, we show that they can be seen as Fenchel-Young losses using a modified, target-dependent generating function. We demonstrate the effectiveness of Fitzpatrick losses for label proportion estimation.

[25]  arXiv:2405.14630 [pdf, ps, other]
Title: Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension
Comments: 47 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Bounds on the smallest eigenvalue of the neural tangent kernel (NTK) are a key ingredient in the analysis of neural network optimization and memorization. However, existing results require distributional assumptions on the data and are limited to a high-dimensional setting, where the input dimension $d_0$ scales at least logarithmically in the number of samples $n$. In this work we remove both of these requirements and instead provide bounds in terms of a measure of the collinearity of the data: notably these bounds hold with high probability even when $d_0$ is held constant versus $n$. We prove our results through a novel application of the hemisphere transform.

[26]  arXiv:2405.14778 [pdf, ps, other]
Title: Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present an upper bound on the finite-sample risk of general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside the hypothesis space), which is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.

[27]  arXiv:2405.14840 [pdf, other]
Title: Differentiable Annealed Importance Sampling Minimizes The Jensen-Shannon Divergence Between Initial and Target Distribution
Comments: 22 pages, including 9 pages of main text and 11 pages of appendix, conference paper at ICML 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Differentiable annealed importance sampling (DAIS), proposed by Geffner & Domke (2021) and Zhang et al. (2021), allows optimizing, among others, over the initial distribution of AIS. In this paper, we show that, in the limit of many transitions, DAIS minimizes the symmetrized KL divergence (Jensen-Shannon divergence) between the initial and target distribution. Thus, DAIS can be seen as a form of variational inference (VI) in that its initial distribution is a parametric fit to an intractable target distribution. We empirically evaluate the usefulness of the initial distribution as a variational distribution on synthetic and real-world data, observing that it often provides more accurate uncertainty estimates than standard VI (optimizing the reverse KL divergence), importance weighted VI, and Markovian score climbing (optimizing the forward KL divergence).

[28]  arXiv:2405.14848 [pdf, other]
Title: Local Causal Discovery for Structural Evidence of Direct Discrimination
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient causal discovery methods for fairness applications. To this end, we introduce local discovery for direct discrimination (LD3): a polynomial-time algorithm that recovers structural evidence of direct discrimination. LD3 performs a linear number of conditional independence tests with respect to variable set size. Moreover, we propose a graphical criterion for identifying the weighted controlled direct effect (CDE), a qualitative measure of direct discrimination. We prove that this criterion is satisfied by the knowledge returned by LD3, increasing the accessibility of the weighted CDE as a causal fairness measure. Taking liver transplant allocation as a case study, we highlight the potential impact of LD3 for modeling fairness in complex decision systems. Results on real-world data demonstrate more plausible causal relations than baselines, which took 197x to 5870x longer to execute.

Cross-lists for Fri, 24 May 24

[29]  arXiv:2405.13153 (cross-list from math.ST) [pdf, other]
Title: Max-sliced Wasserstein concentration and uniform ratio bounds of empirical measures on RKHS
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

Optimal transport and the Wasserstein distance $\mathcal{W}_p$ have recently seen a number of applications in the fields of statistics, machine learning, data science, and the physical sciences. These applications are however severely restricted by the curse of dimensionality, meaning that the number of data points needed to estimate these problems accurately increases exponentially in the dimension. To alleviate this problem, a number of variants of $\mathcal{W}_p$ have been introduced. We focus here on one of these variants, namely the max-sliced Wasserstein metric $\overline{\mathcal{W}}_p$. This metric reduces the high-dimensional minimization problem given by $\mathcal{W}_p$ to a maximum of one-dimensional measurements in an effort to overcome the curse of dimensionality. In this note we derive concentration results and upper bounds on the expectation of $\overline{\mathcal{W}}_p$ between the true and empirical measure on unbounded reproducing kernel Hilbert spaces. We show that, under quite generic assumptions, probability measures concentrate uniformly fast in one-dimensional subspaces, at (nearly) parametric rates. Our results rely on an improvement of currently known bounds for $\overline{\mathcal{W}}_p$ in the finite-dimensional case.
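A crude Monte-Carlo sketch (not the paper's analysis) of the max-sliced $\mathcal{W}_1$ distance between two empirical measures with equal sample sizes: project onto random unit directions and take the largest one-dimensional $\mathcal{W}_1$, computed from sorted projections. The number of directions and the data are assumptions, and random search only lower-bounds the maximum over directions.

    import numpy as np

    def max_sliced_w1(X, Y, n_directions=500, seed=0):
        # Lower bound on max-sliced W1: maximize the 1D W1 of projections over
        # randomly sampled unit directions (equal sample sizes assumed).
        rng = np.random.default_rng(seed)
        best = 0.0
        for _ in range(n_directions):
            theta = rng.normal(size=X.shape[1])
            theta /= np.linalg.norm(theta)
            px, py = np.sort(X @ theta), np.sort(Y @ theta)
            best = max(best, np.mean(np.abs(px - py)))   # 1D W1 via the quantile coupling
        return best

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 20))
    Y = rng.normal(size=(1000, 20)) + 0.5                # mean shift in every coordinate
    print(max_sliced_w1(X, Y))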

[30]  arXiv:2405.13346 (cross-list from math.OC) [pdf, other]
Title: Convergence of the Deep Galerkin Method for Mean Field Control Problems
Comments: 27 pages, 6 figures
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)

We establish the convergence of the deep Galerkin method (DGM), a deep learning-based scheme for solving high-dimensional nonlinear PDEs, for Hamilton-Jacobi-Bellman (HJB) equations that arise from the study of mean field control problems (MFCPs). Based on a recent characterization of the value function of the MFCP as the unique viscosity solution of an HJB equation on the simplex, we establish both an existence and convergence result for the DGM. First, we show that the loss functional of the DGM can be made arbitrarily small given that the value function of the MFCP possesses sufficient regularity. Then, we show that if the loss functional of the DGM converges to zero, the corresponding neural network approximators must converge uniformly to the true value function on the simplex. We also provide numerical experiments demonstrating the DGM's ability to generalize to high-dimensional HJB equations.

[31]  arXiv:2405.13353 (cross-list from stat.ME) [pdf, other]
Title: Adaptive Bayesian Multivariate Spline Knot Inference with Prior Specifications on Model Complexity
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

In multivariate spline regression, the number and locations of knots influence the performance and interpretability significantly. However, due to non-differentiability and varying dimensions, there is no desirable frequentist method to make inference on knots. In this article, we propose a fully Bayesian approach for knot inference in multivariate spline regression. The existing Bayesian method often uses BIC to calculate the posterior, but BIC is too liberal and will heavily overestimate the knot number when the candidate model space is large. We specify a new prior on the knot number to take into account the complexity of the model space and derive an analytic formula in the normal model. In the non-normal cases, we utilize the extended Bayesian information criterion to approximate the posterior density. The samples are simulated in the space with differing dimensions via reversible jump Markov chain Monte Carlo. We apply the proposed method to knot inference and manifold denoising. Experiments demonstrate the strong capability of the algorithm, especially in fitting functions with jump discontinuities.

[32]  arXiv:2405.13375 (cross-list from cs.LG) [pdf, other]
Title: Adaptive Data Analysis for Growing Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis in the dynamic data setting. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grow with the square root of the number of adaptive queries, matching prior works' improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.
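A minimal sketch of answering a statistical query with a clipped Gaussian mechanism on growing data; the calibration of the noise scale to a privacy budget and to the adaptive query schedule, which is the substance of the paper, is omitted, and all constants are assumptions.

    import numpy as np

    def clipped_gaussian_query(data, query, clip=1.0, sigma=0.5, rng=None):
        # Answer the statistical query mean_x query(x), clipping each value to
        # [-clip, clip] and adding Gaussian noise scaled to the current data size.
        rng = rng or np.random.default_rng(0)
        vals = np.clip(np.array([query(x) for x in data]), -clip, clip)
        return vals.mean() + rng.normal(scale=sigma / max(len(data), 1))

    rng = np.random.default_rng(1)
    data = list(rng.normal(size=200))
    print(clipped_gaussian_query(data, lambda x: x))
    data += list(rng.normal(size=300))       # the data grows; the same mechanism is reused
    print(clipped_gaussian_query(data, lambda x: x))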

[33]  arXiv:2405.13392 (cross-list from cs.LG) [pdf, other]
Title: Local convergence of min-max algorithms to differentiable equilibrium on Riemannian manifold
Authors: Sixin Zhang
Comments: under review
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We study min-max algorithms to solve zero-sum differentiable games on Riemannian manifolds. The notions of differentiable Stackelberg equilibrium and differentiable Nash equilibrium in Euclidean space are generalized to Riemannian manifolds through an intrinsic definition that does not depend on the choice of local coordinate chart. We then provide sufficient conditions for the local convergence of the deterministic simultaneous algorithms $\tau$-GDA and $\tau$-SGA near such equilibria, using a general methodology based on spectral analysis. These algorithms are extended with stochastic gradients and applied to the training of a Wasserstein GAN. The discriminator of the GAN is constructed from Lipschitz-continuous functions based on the Stiefel manifold. We show numerically how the insights obtained from the local convergence analysis may lead to an improvement of GAN models.

[34]  arXiv:2405.13396 (cross-list from cs.LG) [pdf, other]
Title: Why In-Context Learning Transformers are Tabular Data Classifiers
Comments: 9 pages main body, 22 pages total. Preprint under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. As synthetic data does not share features or labels with real-world data, the underlying mechanism that contributes to the success of this method remains unclear. This study provides an explanation by demonstrating that ICL-transformers acquire the ability to create complex decision boundaries during pretraining. To validate our claim, we develop a novel forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. Our experiments confirm the effectiveness of ICL-transformers pretrained on this data. Furthermore, we create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator. By fine-tuning this model, we reach the current state-of-the-art on tabular data classification. Code is available at https://github.com/FelixdenBreejen/TabForestPFN.

[35]  arXiv:2405.13535 (cross-list from cs.LG) [pdf, other]
Title: Generalized Laplace Approximation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

In recent years, the inconsistency in Bayesian deep learning has garnered increasing attention. Tempered or generalized posterior distributions often offer a direct and effective solution to this issue. However, understanding the underlying causes and evaluating the effectiveness of generalized posteriors remain active areas of research. In this study, we introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors. We interpret the generalization of the posterior with a temperature factor as a correction for misspecified models through adjustments to the joint probability model, and the recalibration of priors by redistributing probability mass on models within the hypothesis space using data samples. Additionally, we highlight a distinctive feature of Laplace approximation, which ensures that the generalized normalizing constant can be treated as invariant, unlike the typical scenario in general Bayesian learning where this constant varies with model parameters post-generalization. Building on this insight, we propose the generalized Laplace approximation, which involves a simple adjustment to the computation of the Hessian matrix of the regularized loss function. This method offers a flexible and scalable framework for obtaining high-quality posterior distributions. We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.

[36]  arXiv:2405.13682 (cross-list from cs.LG) [pdf, other]
Title: Constructive Universal Approximation Theorems for Deep Joint-Equivariant Networks by Schur's Lemma
Subjects: Machine Learning (cs.LG); Representation Theory (math.RT); Machine Learning (stat.ML)

We present a unified constructive universal approximation theorem covering a wide range of learning machines, including both shallow and deep neural networks, based on group representation theory. Constructive here means that the distribution of parameters is given in a closed-form expression (called the ridgelet transform). Contrary to the case of shallow models, expressive power analysis of deep models has been conducted in a case-by-case manner. Recently, Sonoda et al. (2023a,b) developed a systematic method to show a constructive approximation theorem from scalar-valued joint-group-invariant feature maps, covering a formal deep network. However, each hidden layer was formalized as an abstract group action, so it was not possible to cover real deep networks defined by composites of nonlinear activation functions. In this study, we extend the method to vector-valued joint-group-equivariant feature maps, so as to cover such real networks.

[37]  arXiv:2405.13691 (cross-list from physics.flu-dyn) [pdf, other]
Title: Neural Networks-based Random Vortex Methods for Modelling Incompressible Flows
Comments: 16 pages, 5 figures
Subjects: Fluid Dynamics (physics.flu-dyn); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)

In this paper we introduce a novel Neural Networks-based approach for approximating solutions to the (2D) incompressible Navier--Stokes equations. Our algorithm uses a Physics-informed Neural Network, that approximates the vorticity based on a loss function that uses a computationally efficient formulation of the Random Vortex dynamics. The neural vorticity estimator is then combined with traditional numerical PDE-solvers for the Poisson equation to compute the velocity field. The main advantage of our method compared to standard Physics-informed Neural Networks is that it strictly enforces physical properties, such as incompressibility or boundary conditions, which might otherwise be hard to guarantee with purely Neural Networks-based approaches.

[38]  arXiv:2405.13712 (cross-list from cs.LG) [pdf, other]
Title: Learning Diffusion Priors from Observations by Expectation Maximization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Diffusion models recently proved to be remarkable priors for Bayesian inverse problems. However, training these models typically requires access to large amounts of clean data, which could prove difficult in some settings. In this work, we present a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only. Unlike previous works, our method leads to proper diffusion models, which is crucial for downstream tasks. As part of our method, we propose and motivate a new posterior sampling scheme for unconditional diffusion models. We present empirical evidence supporting the effectiveness of our method.

[39]  arXiv:2405.13785 (cross-list from cs.LG) [pdf, other]
Title: Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and Subsampling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)

Gaussian Process Regression (GPR) is widely used in statistics and machine learning for prediction tasks requiring uncertainty measures. Its efficacy depends on the appropriate specification of the mean function, covariance kernel function, and associated hyperparameters. Severe misspecifications can lead to inaccurate results and problematic consequences, especially in safety-critical applications. However, a systematic approach to handle these misspecifications is lacking in the literature. In this work, we propose a general framework to address these issues. Firstly, we introduce a flexible two-stage GPR framework that separates mean prediction and uncertainty quantification (UQ) to prevent mean misspecification, which can introduce bias into the model. Secondly, kernel function misspecification is addressed through a novel automatic kernel search algorithm, supported by theoretical analysis, that selects the optimal kernel from a candidate set. Additionally, we propose a subsampling-based warm-start strategy for hyperparameter initialization to improve efficiency and avoid hyperparameter misspecification. With much lower computational cost, our subsampling-based strategy can yield competitive or better performance than training exclusively on the full dataset. Combining all these components, we recommend two GPR methods, exact and scalable, designed to match available computational resources and specific UQ requirements. Extensive evaluation on real-world datasets, including UCI benchmarks and a safety-critical medical case study, demonstrates the robustness and precision of our methods.

[40]  arXiv:2405.13844 (cross-list from stat.ME) [pdf, other]
Title: Causal Inference with Cocycles
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)

Many interventions in causal inference can be represented as transformations. We identify a local symmetry property satisfied by a large class of causal models under such interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional and counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show they achieve semiparametric efficiency under typical conditions. Since (infinitely) many distributions can share the same cocycle, these estimators make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.

[41]  arXiv:2405.13888 (cross-list from cs.LG) [pdf, other]
Title: Marrying Causal Representation Learning with Dynamical Systems for Science
Comments: 21 pages, 8 figures, 6 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Causal representation learning promises to extend causal models to hidden causal variables from raw entangled measurements. However, most progress has focused on proving identifiability results in different settings, and we are not aware of any successful real-world application. At the same time, the field of dynamical systems benefited from deep learning and scaled to countless applications but does not allow parameter identification. In this paper, we draw a clear connection between the two and their key assumptions, allowing us to apply identifiable methods developed in causal representation learning to dynamical systems. At the same time, we can leverage scalable differentiable solvers developed for differential equations to build models that are both identifiable and practical. Overall, we learn explicitly controllable models that isolate the trajectory-specific parameters for further downstream tasks such as out-of-distribution classification or treatment effect estimation. We experiment with a wind simulator with partially known factors of variation. We also apply the resulting model to real-world climate data and successfully answer downstream causal questions in line with existing literature on climate change.

[42]  arXiv:2405.13910 (cross-list from cs.LG) [pdf, other]
Title: Learning Latent Space Hierarchical EBM Diffusion Models
Authors: Jiali Cui, Tian Han
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

This work studies the learning problem of the energy-based prior model and the multi-layer generator model. The multi-layer generator model, which contains multiple layers of latent variables organized in a top-down hierarchical structure, typically assumes the Gaussian prior model. Such a prior model can be limited in modelling expressivity, which results in a gap between the generator posterior and the prior model, known as the prior hole problem. Recent works have explored learning the energy-based (EBM) prior model as a second-stage, complementary model to bridge the gap. However, the EBM defined on a multi-layer latent space can be highly multi-modal, which makes sampling from such a marginal EBM prior challenging in practice, resulting in an ineffectively learned EBM. To tackle the challenge, we propose to leverage the diffusion probabilistic scheme to mitigate the burden of EBM sampling and thus facilitate EBM learning. Our extensive experiments demonstrate the superior performance of our diffusion-learned EBM prior on various challenging tasks.

[43]  arXiv:2405.13912 (cross-list from math.ST) [pdf, other]
Title: Matrix Denoising with Doubly Heteroscedastic Noise: Fundamental Limits and Optimal Spectral Methods
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

We study the matrix denoising problem of estimating the singular vectors of a rank-$1$ signal corrupted by noise with both column and row correlations. Existing works are either unable to pinpoint the exact asymptotic estimation error or, when they do so, the resulting approaches (e.g., based on whitening or singular value shrinkage) remain vastly suboptimal. On top of this, most of the literature has focused on the special case of estimating the left singular vector of the signal when the noise only possesses row correlation (one-sided heteroscedasticity). In contrast, our work establishes the information-theoretic and algorithmic limits of matrix denoising with doubly heteroscedastic noise. We characterize the exact asymptotic minimum mean square error, and design a novel spectral estimator with rigorous optimality guarantees: under a technical condition, it attains positive correlation with the signals whenever information-theoretically possible and, for one-sided heteroscedasticity, it also achieves the Bayes-optimal error. Numerical experiments demonstrate the significant advantage of our theoretically principled method over the state of the art. The proofs draw connections with statistical physics and approximate message passing, departing drastically from standard random matrix theory techniques.

[44]  arXiv:2405.13922 (cross-list from cs.LG) [pdf, other]
Title: Towards Certification of Uncertainty Calibration under Adversarial Attacks
Comments: 11 pages main paper, appendix included
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, \textit{certification methods} have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier score or the expected calibration error. We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the expected calibration error. Finally, we propose novel calibration attacks and demonstrate how they can improve model calibration through \textit{adversarial calibration training}.
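
A small sketch of the two calibration metrics mentioned above (the Brier score and the expected calibration error), computed with NumPy for a binary classifier; the binning scheme is an illustrative choice and has nothing to do with the paper's certified worst-case bounds.

    import numpy as np

    def brier_score(probs, labels):
        # probs: predicted probability of the positive class; labels in {0, 1}
        return np.mean((probs - labels) ** 2)

    def expected_calibration_error(probs, labels, n_bins=10):
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (probs >= lo) & (probs < hi)
            if mask.any():
                confidence = probs[mask].mean()   # average predicted confidence in the bin
                accuracy = labels[mask].mean()    # empirical frequency of the positive class
                ece += mask.mean() * abs(accuracy - confidence)
        return ece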

[45]  arXiv:2405.13975 (cross-list from cs.LG) [pdf, other]
Title: There is HOPE to Avoid HiPPOs for Long-memory State Space Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. However, these models typically face several challenges: (i) they require specifically designed initializations of the system matrices to achieve state-of-the-art performance, (ii) they require training of state matrices on a logarithmic scale with very small learning rates to prevent instabilities, and (iii) they require the model to have exponentially decaying memory in order to ensure an asymptotically stable LTI system. To address these issues, we view SSMs through the lens of Hankel operator theory, which provides us with a unified theory for the initialization and training of SSMs. Building on this theory, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. This approach allows for random initializations of the LTI systems and helps to improve training stability, while also providing the SSMs with non-decaying memory capabilities. Our model efficiently implements these innovations by nonuniformly sampling the transfer functions of LTI systems, and it requires fewer parameters compared to canonical SSMs. When benchmarked against HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel operators demonstrates improved performance on Long-Range Arena (LRA) tasks. Moreover, we use a sequential CIFAR-10 task with padded noise to empirically corroborate our SSM's long memory capacity.
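
For context on the "Markov parameters within Hankel operators" ingredient, here is a minimal NumPy sketch of the (truncated) Hankel matrix built from the Markov parameters $h_k = C A^{k-1} B$ of a discrete LTI system; this is standard system theory for illustration only, and the actual HOPE parameterization and its transfer-function sampling are not reproduced.

    import numpy as np

    def markov_parameters(A, B, C, n):
        h, Ak = [], np.eye(A.shape[0])
        for _ in range(n):
            h.append(C @ Ak @ B)   # h_{k+1} = C A^k B
            Ak = Ak @ A
        return np.array(h).squeeze()

    def hankel_from_markov(h, m):
        # H[i, j] = h_{i+j+1}: truncated Hankel operator of the system
        return np.array([[h[i + j] for j in range(m)] for i in range(m)])

    A = np.diag([0.9, 0.5]); B = np.ones((2, 1)); C = np.ones((1, 2))
    H = hankel_from_markov(markov_parameters(A, B, C, 12), 6)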

[46]  arXiv:2405.13977 (cross-list from cs.LG) [pdf, other]
Title: Removing Bias from Maximum Likelihood Estimation with Model Autophagy
Comments: 9 Pages, submission for NeurIPS 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We propose autophagy penalized likelihood estimation (PLE), an unbiased alternative to maximum likelihood estimation (MLE) that is fairer and less susceptible to model autophagy disorder (madness). Model autophagy refers to models trained on their own output; PLE ensures the statistics of these outputs coincide with the data statistics. This enables PLE to be statistically unbiased in certain scenarios where MLE is biased. When biased, MLE unfairly penalizes minority classes in unbalanced datasets and exacerbates the recently discovered issue of self-consuming generative modeling. Theoretical and empirical results show that 1) PLE is more fair to minority classes and 2) PLE is more stable in a self-consumed setting. Furthermore, we provide a scalable and portable implementation of PLE with a hypernetwork framework, allowing existing deep learning architectures to be easily trained with PLE. Finally, we show PLE can bridge the gap between Bayesian and frequentist paradigms in statistics.

[47]  arXiv:2405.13987 (cross-list from cs.LG) [pdf, other]
Title: Analysis of Corrected Graph Convolutions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Statistics Theory (math.ST); Machine Learning (stat.ML)

Machine learning for node classification on graphs is a prominent area driven by applications such as recommendation systems. State-of-the-art models often use multiple graph convolutions on the data, as empirical evidence suggests they can enhance performance. However, it has been shown, both empirically and theoretically, that too many graph convolutions can degrade performance significantly, a phenomenon known as oversmoothing. In this paper, we provide a rigorous theoretical analysis, based on the contextual stochastic block model (CSBM), of the performance of vanilla graph convolution from which we remove the principal eigenvector to avoid oversmoothing. We perform a spectral analysis for $k$ rounds of corrected graph convolutions, and we provide results for partial and exact classification. For partial classification, we show that each round of convolution can reduce the misclassification error exponentially up to a saturation level, after which performance does not worsen. For exact classification, we show that the separability threshold can be improved exponentially up to $O({\log{n}}/{\log\log{n}})$ corrected convolutions.
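
A sketch of the corrected convolution described above: the principal eigenvector of the normalized adjacency is projected out before applying $k$ rounds of convolution. This is illustrative NumPy code under the assumption of a symmetric adjacency matrix, not the paper's CSBM experiments.

    import numpy as np

    def corrected_convolutions(A, X, k):
        # A: (n, n) symmetric adjacency matrix, X: (n, d) node features
        d = A.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        A_norm = D_inv_sqrt @ A @ D_inv_sqrt
        eigvals, eigvecs = np.linalg.eigh(A_norm)
        v = eigvecs[:, -1]                              # principal eigenvector
        M = A_norm - eigvals[-1] * np.outer(v, v)       # remove the top spectral component
        for _ in range(k):                              # k corrected convolution rounds
            X = M @ X
        return X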

[48]  arXiv:2405.13998 (cross-list from cs.LG) [pdf, other]
Title: Bridging Operator Learning and Conditioned Neural Fields: A Unifying Perspective
Comments: 23 pages, 13 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Operator learning is an emerging area of machine learning which aims to learn mappings between infinite dimensional function spaces. Here we uncover a connection between operator learning architectures and conditioned neural fields from computer vision, providing a unified perspective for examining differences between popular operator learning models. We find that many commonly used operator learning models can be viewed as neural fields with conditioning mechanisms restricted to point-wise and/or global information. Motivated by this, we propose the Continuous Vision Transformer (CViT), a novel neural operator architecture that employs a vision transformer encoder and uses cross-attention to modulate a base field constructed with a trainable grid-based positional encoding of query coordinates. Despite its simplicity, CViT achieves state-of-the-art results across challenging benchmarks in climate modeling and fluid dynamics. Our contributions can be viewed as a first step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in physical sciences.

[49]  arXiv:2405.14066 (cross-list from cs.LG) [pdf, ps, other]
Title: Online Classification with Predictions
Comments: 24 pages
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worst-case analysis by using machine-learned predictions and distributional assumptions respectively.

[50]  arXiv:2405.14088 (cross-list from cs.LG) [pdf, other]
Title: High-dimensional Learning with Noisy Labels
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label noisiness aware loss function, when both the dimension of data $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory and assuming a Gaussian mixture data model, the performance of the linear classifier when $p,n\to \infty$ is shown to converge towards a limit, involving scalar statistics of the data. Importantly, our findings show that the low-dimensional intuitions for handling label noise do not hold in high dimensions, in the sense that the classifier that is optimal in low dimensions dramatically fails in high dimensions. Based on our derivations, we design an optimized method that is shown to be provably more efficient in handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where we show that our optimized approach outperforms the considered baselines.

[51]  arXiv:2405.14094 (cross-list from cs.LG) [pdf, other]
Title: Attending to Topological Spaces: The Cellular Transformer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Machine Learning (stat.ML)

Topological Deep Learning seeks to enhance the predictive performance of neural network models by harnessing topological structures in input data. Topological neural networks operate on spaces such as cell complexes and hypergraphs, which can be seen as generalizations of graphs. In this work, we introduce the Cellular Transformer (CT), a novel architecture that generalizes graph-based transformers to cell complexes. First, we propose a new formulation of the usual self- and cross-attention mechanisms, tailored to leverage incidence relations in cell complexes, e.g., edge-face and node-edge relations. Additionally, we propose a set of topological positional encodings specifically designed for cell complexes. By transforming three graph datasets into cell complex datasets, our experiments reveal that CT not only achieves state-of-the-art performance, but it does so without the need for more complex enhancements such as virtual nodes, in-domain structural encodings, or graph rewiring.

[52]  arXiv:2405.14392 (cross-list from stat.ME) [pdf, other]
Title: Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing Flows
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Continuous normalizing flows (CNFs) learn the probability path between a reference and a target density by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we re-purpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective and using the learned probability path to improve Monte Carlo sampling. We propose a sequential method, which uses samples from a Markov chain to fix the probability path defining the FM objective. We augment this scheme with an adaptive tempering mechanism that allows the discovery of multiple modes in the target. Under mild assumptions, we establish convergence to a local optimum of the FM objective, discuss improvements in the convergence rate, and illustrate our methods on synthetic and real-world examples.
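
A rough sketch of a flow matching objective in which the target samples come from a Markov chain, in the spirit of the description above; written in PyTorch for illustration with a linear (Euclidean) conditional path. The network `v_net`, the reference Gaussian, and the interpolation scheme are assumptions, and the paper's sequential procedure and adaptive tempering are not shown.

    import torch

    def flow_matching_loss(v_net, x1_chain):
        # x1_chain: (n, d) samples produced by an MCMC chain targeting the density
        x0 = torch.randn_like(x1_chain)          # reference Gaussian samples
        t = torch.rand(x1_chain.shape[0], 1)     # random times in [0, 1]
        xt = (1 - t) * x0 + t * x1_chain         # linear conditional interpolant
        target_velocity = x1_chain - x0          # time derivative of the interpolant
        pred = v_net(torch.cat([xt, t], dim=1))  # learned vector field v_theta(x, t)
        return ((pred - target_velocity) ** 2).mean()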

[53]  arXiv:2405.14402 (cross-list from cs.LG) [pdf, other]
Title: Exact Gauss-Newton Optimization for Training Deep Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at worst matches, the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.
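
A sketch of a Gauss-Newton-type step that only factorizes a matrix of mini-batch size, in the spirit of the description above (parameters $p$ much larger than batch size $b$). This uses a generic Woodbury-style rewrite of the regularized Gauss-Newton system and is not the exact EGN update; the Jacobian, residuals, and damping value are assumptions.

    import numpy as np

    def gauss_newton_step(J, r, lam=1e-3):
        # J: (b, p) Jacobian of per-example residuals, r: (b,) residuals, lam: damping
        b = J.shape[0]
        # Solve the b x b system (J J^T + lam I_b) alpha = r instead of a p x p system,
        # then map the solution back to parameter space.
        alpha = np.linalg.solve(J @ J.T + lam * np.eye(b), r)
        return J.T @ alpha   # descent direction of dimension p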

[54]  arXiv:2405.14425 (cross-list from cs.LG) [pdf, other]
Title: When predict can also explain: few-shot prediction to select better neural latents
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. In this study, we reveal the limitations of the widely-used 'co-smoothing' prediction framework and propose an improved few-shot prediction approach that encourages more accurate latent dynamics. Utilizing a student-teacher setup with Hidden Markov Models, we demonstrate that the high co-smoothing model space can encompass models with arbitrary extraneous dynamics within their latent representations. To address this, we introduce a secondary metric -- a few-shot version of co-smoothing. This involves performing regression from the latent variables to held-out channels in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in the few-shot co-smoothing compared to 'minimal' models devoid of such dynamics. We also provide analytical insights into the origin of this phenomenon. We further validate our findings on real neural data using two state-of-the-art methods: LFADS and STNDT. In the absence of ground truth, we suggest a proxy measure to quantify extraneous dynamics. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.
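
A minimal sketch of the few-shot co-smoothing idea described above: regress from inferred latent variables to held-out channels using only a small number of trials, then score the regression on the remaining trials. This assumes latents and held-out activity are already available as arrays, and uses ridge regression with an $R^2$ score purely as a stand-in for the paper's metric.

    import numpy as np
    from sklearn.linear_model import Ridge

    def few_shot_cosmoothing(latents, heldout, n_few=10, seed=0):
        # latents: (n_trials, n_latent), heldout: (n_trials, n_heldout_channels)
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(latents))
        fit_idx, eval_idx = idx[:n_few], idx[n_few:]
        reg = Ridge(alpha=1.0).fit(latents[fit_idx], heldout[fit_idx])
        return reg.score(latents[eval_idx], heldout[eval_idx])  # R^2 on held-out trials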

[55]  arXiv:2405.14440 (cross-list from cs.LG) [pdf, other]
Title: Bayesian Adaptive Calibration and Optimal Design
Comments: Preprint, currently under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The process of calibrating computer models of natural phenomena is essential for applications in the physical sciences, where plenty of domain knowledge can be embedded into simulations and then calibrated against real observations. Current machine learning approaches, however, mostly rely on rerunning simulations over a fixed set of designs available in the observed data, potentially neglecting informative correlations across the design space and requiring a large amount of simulations. Instead, we consider the calibration process from the perspective of Bayesian adaptive experimental design and propose a data-efficient algorithm to run maximally informative simulations within a batch-sequential process. At each round, the algorithm jointly estimates the parameters of the posterior distribution and optimal designs by maximising a variational lower bound of the expected information gain. The simulator is modelled as a sample from a Gaussian process, which allows us to correlate simulations and observed data with the unknown calibration parameters. We show the benefits of our method when compared to related approaches across synthetic and real-data problems.

[56]  arXiv:2405.14468 (cross-list from cs.LG) [pdf, other]
Title: Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Deep neural networks (DNNs) exhibit a surprising structure in their final layer known as neural collapse (NC), and a growing body of work has investigated the propagation of neural collapse to earlier layers of DNNs -- a phenomenon called deep neural collapse (DNC). However, existing theoretical results are restricted to special cases: linear models, only two layers or binary classification. In contrast, we focus on non-linear models of arbitrary depth in multi-class classification and reveal a surprising qualitative shift. As soon as we go beyond two layers or two classes, DNC stops being optimal for the deep unconstrained features model (DUFM) -- the standard theoretical framework for the analysis of collapse. The main culprit is a low-rank bias of multi-layer regularization schemes: this bias leads to optimal solutions of even lower rank than the neural collapse. We support our theoretical findings with experiments on both DUFM and real data, which show the emergence of the low-rank structure in the solution found by gradient descent.

[57]  arXiv:2405.14492 (cross-list from stat.ME) [pdf, other]
Title: Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages.
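
A sketch of the iterative-solver ingredient described above: the linear systems behind GP likelihoods and predictions are solved with a preconditioned conjugate gradient method. The preconditioner below is a simple diagonal (Jacobi) one chosen only for illustration, not the paper's novel preconditioner, and the covariance matrix is assumed to be given.

    import numpy as np
    from scipy.sparse.linalg import cg, LinearOperator

    def pcg_solve(K, y):
        # K: (n, n) covariance matrix (e.g., a full-scale approximation), y: (n,) targets
        n = K.shape[0]
        d = np.diag(K)
        M = LinearOperator((n, n), matvec=lambda v: v / d)  # Jacobi preconditioner
        alpha, info = cg(K, y, M=M)
        return alpha  # approximates K^{-1} y, used in the log-likelihood and predictive mean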

[58]  arXiv:2405.14508 (cross-list from q-bio.QM) [pdf, other]
Title: Prediction of cancer dynamics under treatment using Bayesian neural networks: A simulated study
Comments: 22 pages, 10 figures
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Predicting cancer dynamics under treatment is challenging due to high inter-patient heterogeneity, lack of predictive biomarkers, and sparse and noisy longitudinal data. Mathematical models can summarize cancer dynamics by a few interpretable parameters per patient. Machine learning methods can then be trained to predict the model parameters from baseline covariates, but do not account for uncertainty in the parameter estimates. Instead, hierarchical Bayesian modeling can model the relationship between baseline covariates to longitudinal measurements via mechanistic parameters while accounting for uncertainty in every part of the model.
The mapping from baseline covariates to model parameters can be modeled in several ways. A linear mapping simplifies inference but fails to capture nonlinear covariate effects and scales poorly for interaction modeling when the number of covariates is large. In contrast, Bayesian neural networks can potentially discover interactions between covariates automatically, but at a substantial cost in computational complexity.
In this work, we develop a hierarchical Bayesian model of subpopulation dynamics that uses baseline covariate information to predict cancer dynamics under treatment, inspired by cancer dynamics in multiple myeloma (MM), where serum M protein is a well-known proxy of tumor burden. As a working example, we apply the model to a simulated dataset and compare its ability to predict M protein trajectories to a model with linear covariate effects. Our results show that the Bayesian neural network covariate effect model predicts cancer dynamics more accurately than a linear covariate effect model when covariate interactions are present. The framework can also be applied to other types of cancer or other time series prediction problems that can be described with a parametric model.

[59]  arXiv:2405.14522 (cross-list from cs.LG) [pdf, other]
Title: Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model's decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.

[60]  arXiv:2405.14544 (cross-list from cs.LG) [pdf, other]
Title: Nuclear Norm Regularization for Deep Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Penalizing the nuclear norm of a function's Jacobian encourages it to locally behave like a low-rank linear map. Such functions vary locally along only a handful of directions, making the Jacobian nuclear norm a natural regularizer for machine learning problems. However, this regularizer is intractable for high-dimensional problems, as it requires computing a large Jacobian matrix and taking its singular value decomposition. We show how to efficiently penalize the Jacobian nuclear norm using techniques tailor-made for deep learning. We prove that for functions parametrized as compositions $f = g \circ h$, one may equivalently penalize the average squared Frobenius norm of $Jg$ and $Jh$. We then propose a denoising-style approximation that avoids the Jacobian computations altogether. Our method is simple, efficient, and accurate, enabling Jacobian nuclear norm regularization to scale to high-dimensional deep learning problems. We complement our theory with an empirical study of our regularizer's performance and investigate applications to denoising and representation learning.
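
A sketch of the Frobenius-norm surrogate described above for a composition $f = g \circ h$: the squared Frobenius norms of $Jg$ and $Jh$ are estimated with random Jacobian-vector products (a Hutchinson-style estimate), so no explicit Jacobian or SVD is formed. This is an illustrative PyTorch snippet, not the paper's denoising-style approximation, and the averaging weights are an assumption.

    import torch
    from torch.func import jvp

    def sq_frobenius_jacobian(fn, x):
        # One-sample estimate: E_v ||J(x) v||^2 = ||J(x)||_F^2 for v ~ N(0, I)
        v = torch.randn_like(x)
        _, jv = jvp(fn, (x,), (v,))
        return (jv ** 2).sum()

    def jacobian_penalty(h, g, x):
        # Penalize the average squared Frobenius norms of Jh (at x) and Jg (at h(x))
        z = h(x)
        return 0.5 * (sq_frobenius_jacobian(h, x) + sq_frobenius_jacobian(g, z))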

[61]  arXiv:2405.14547 (cross-list from cs.LG) [pdf, other]
Title: Causal Effect Identification in a Sub-Population with Latent Variables
Comments: 19 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub-population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by the presence of latent variables in a sub-population, we first extend the classical relevant graphical definitions, such as c-components and hedges, initially defined for the so-called ID problem (Pearl, 1995; Tian & Pearl, 2002), to their new counterparts. Subsequently, we propose a sound algorithm for the s-ID problem with latent variables.

[62]  arXiv:2405.14657 (cross-list from cs.LG) [pdf, other]
Title: Heteroscedastic Preferential Bayesian Optimization with Informative Noise Distributions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Preferential Bayesian optimization (PBO) is a sample-efficient framework for learning human preferences between candidate designs. PBO classically relies on homoscedastic noise models to represent human aleatoric uncertainty. Yet, such noise fails to accurately capture the varying levels of human aleatoric uncertainty, particularly when the user possesses partial knowledge among different pairs of candidates. For instance, a chemist with solid expertise in glucose-related molecules may easily compare two compounds from that family while struggling to compare alcohol-related molecules. Currently, PBO overlooks this uncertainty during the search for a new candidate through the maximization of the acquisition function, consequently underestimating the risk associated with human uncertainty. To address this issue, we propose a heteroscedastic noise model to capture human aleatoric uncertainty. This model adaptively assigns noise levels based on the distance of a specific input to a predefined set of reliable inputs known as anchors provided by the human. Anchors encapsulate partial knowledge and offer insight into the comparative difficulty of evaluating different candidate pairs. Such a model can be seamlessly integrated into the acquisition function, thus leading to candidate design pairs that elegantly trade informativeness and ease of comparison for the human expert. We perform an extensive empirical evaluation of the proposed approach, demonstrating a consistent improvement over homoscedastic PBO.
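
A small sketch of an anchor-based heteroscedastic noise level: the assumed functional form below (noise growing with the distance to the nearest anchor) is our illustration of the idea, not necessarily the exact model used in the paper; `sigma_min` and `scale` are hypothetical parameters.

    import numpy as np

    def noise_level(x, anchors, sigma_min=0.05, scale=0.5):
        # x: (d,) candidate input; anchors: (m, d) inputs the expert is reliable on
        dist = np.min(np.linalg.norm(anchors - x, axis=1))
        return sigma_min + scale * dist   # larger aleatoric noise far from reliable anchors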

[63]  arXiv:2405.14681 (cross-list from cs.LG) [pdf, other]
Title: Recursive PAC-Bayes: A Frequentist Approach to Sequential Prior Updates with No Information Loss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch.
We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluations, the new procedure significantly outperforms the state of the art.
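
One way to write the decomposition sketched above, as a rough paraphrase with a downscaling factor $\gamma \in (0,1)$ (the paper's exact statement, and how the prior term is bounded recursively, may differ):

    \mathbb{E}_{h \sim \rho}[L(h)]
      \;=\; \underbrace{\mathbb{E}_{h \sim \rho}\big[L(h) - \gamma\, \mathbb{E}_{h' \sim \pi}[L(h')]\big]}_{\text{excess loss, bounded with the current batch}}
      \;+\; \gamma\, \underbrace{\mathbb{E}_{h' \sim \pi}[L(h')]}_{\text{prior loss, bounded recursively}} .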

[64]  arXiv:2405.14711 (cross-list from stat.ME) [pdf, other]
Title: Zero-inflation in the Multivariate Poisson Lognormal Family
Comments: 27 pages including appendices. 8 figures, 1 table
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)

Analyzing high-dimensional count data is a challenge, and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts solely stem from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific, or depend on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset containing $90.6\%$ zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
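
A minimal sketch of sampling from a zero-inflated Poisson log-normal model as described above: a structured latent Gaussian drives the counts and an independent Bernoulli layer zeroes some of them out. Parameter shapes and the convention that a Bernoulli draw of 1 produces a structural zero are illustrative assumptions.

    import numpy as np

    def sample_zipln(n, mu, Sigma, pi, seed=0):
        # mu: (p,) latent means, Sigma: (p, p) latent covariance, pi: (p,) zero-inflation probabilities
        rng = np.random.default_rng(seed)
        Z = rng.multivariate_normal(mu, Sigma, size=n)   # structured latent Gaussian
        counts = rng.poisson(np.exp(Z))                  # Poisson log-normal counts
        zeroed = rng.binomial(1, pi, size=(n, len(pi)))  # Bernoulli zero-inflation layer
        return np.where(zeroed == 1, 0, counts)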

[65]  arXiv:2405.14780 (cross-list from cs.LG) [pdf, other]
Title: Metric Flow Matching for Smooth Interpolations on the Data Manifold
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Matching objectives underpin the success of modern generative models and rely on constructing conditional paths that transform a source distribution into a target distribution. Despite being a fundamental building block, conditional paths have been designed principally under the assumption of Euclidean geometry, resulting in straight interpolations. However, this can be particularly restrictive for tasks such as trajectory inference, where straight paths might lie outside the data manifold, thus failing to capture the underlying dynamics giving rise to the observed marginals. In this paper, we propose Metric Flow Matching (MFM), a novel simulation-free framework for conditional flow matching where interpolants are approximate geodesics learned by minimizing the kinetic energy of a data-induced Riemannian metric. This way, the generative model matches vector fields on the data manifold, which corresponds to lower uncertainty and more meaningful interpolations. We prescribe general metrics to instantiate MFM, independent of the task, and test it on a suite of challenging problems including LiDAR navigation, unpaired image translation, and modeling cellular dynamics. We observe that MFM outperforms the Euclidean baselines, particularly achieving SOTA on single-cell trajectory prediction.

[66]  arXiv:2405.14806 (cross-list from physics.data-an) [pdf, other]
Title: Lorentz-Equivariant Geometric Algebra Transformers for High-Energy Physics
Comments: 10+12 pages, 5+2 figures, 2 tables
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)

Extracting scientific understanding from particle-physics experiments requires solving diverse learning problems with high precision and good data efficiency. We propose the Lorentz Geometric Algebra Transformer (L-GATr), a new multi-purpose architecture for high-energy physics. L-GATr represents high-energy data in a geometric algebra over four-dimensional space-time and is equivariant under Lorentz transformations, the symmetry group of relativistic kinematics. At the same time, the architecture is a Transformer, which makes it versatile and scalable to large systems. L-GATr is first demonstrated on regression and classification tasks from particle physics. We then construct the first Lorentz-equivariant generative model: a continuous normalizing flow based on an L-GATr network, trained with Riemannian flow matching. Across our experiments, L-GATr is on par with or outperforms strong domain-specific baselines.

[67]  arXiv:2405.14822 (cross-list from cs.CV) [pdf, other]
Title: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. By the nature of the progressively growing decoder, PaGoDA avoids re-training the teacher/student models when upsampling the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion models such as LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.

[68]  arXiv:2405.14861 (cross-list from cs.LG) [pdf, ps, other]
Title: Adapting to Unknown Low-Dimensional Structures in Score-Based Diffusion Models
Authors: Gen Li, Yuling Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)

This paper investigates score-based diffusion models when the underlying target distribution is concentrated on or near low-dimensional manifolds within the higher-dimensional space in which they formally reside, a common characteristic of natural image distributions. Despite previous efforts to understand the data generation process of diffusion models, existing theoretical support remains highly suboptimal in the presence of low-dimensional structure, which we strengthen in this paper. For the popular Denoising Diffusion Probabilistic Model (DDPM), we find that the dependency of the error incurred within each denoising step on the ambient dimension $d$ is in general unavoidable. We further identify a unique design of coefficients that yields a convergence rate of order $O(k^{2}/\sqrt{T})$ (up to log factors), where $k$ is the intrinsic dimension of the target distribution and $T$ is the number of steps. This represents the first theoretical demonstration that the DDPM sampler can adapt to unknown low-dimensional structures in the target distribution, highlighting the critical importance of coefficient design. All of this is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner.

Replacements for Fri, 24 May 24

[69]  arXiv:2202.00563 (replaced) [pdf, other]
Title: On the Limitations of General Purpose Domain Generalisation Methods
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[70]  arXiv:2204.07747 (replaced) [pdf, other]
Title: A Variational Approach to Bayesian Phylogenetic Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[71]  arXiv:2302.12565 (replaced) [pdf, other]
Title: Variational Linearized Laplace Approximation for Bayesian Deep Learning
Comments: 22 pages, 8 figures, ICML 2024
Journal-ref: PMLR 235 (2024)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[72]  arXiv:2305.17028 (replaced) [pdf, other]
Title: Better Batch for Deep Probabilistic Time Series Forecasting
Comments: 11 pages, 3 figures, 3 tables, The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024); We corrected some misleading notations in the published version
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[73]  arXiv:2306.00096 (replaced) [pdf, other]
Title: Learning the Pareto Front Using Bootstrapped Observation Samples
Comments: 37 pages including appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[74]  arXiv:2306.06844 (replaced) [pdf, other]
Title: Provably Efficient Bayesian Optimization with Unbiased Gaussian Process Hyperparameter Estimation
Comments: 25 pages, 5 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[75]  arXiv:2306.11895 (replaced) [pdf, other]
Title: Learning Elastic Costs to Shape Monge Displacements
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[76]  arXiv:2310.18449 (replaced) [pdf, other]
Title: Conditional Generative Representation for Black-Box Optimization with Implicit Constraints
Subjects: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
[77]  arXiv:2402.00168 (replaced) [pdf, other]
Title: Continuous Treatment Effects with Surrogate Outcomes
Comments: 30 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[78]  arXiv:2403.03850 (replaced) [pdf, other]
Title: Conformal prediction for multi-dimensional time series by ellipsoidal sets
Comments: Accepted by the Forty-first International Conference on Machine Learning (ICML 2024)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[79]  arXiv:2405.07552 (replaced) [pdf, other]
Title: Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery
Comments: Forty-first International Conference on Machine Learning (ICML 2024), 27 pages, 4 figures, 14 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[80]  arXiv:2405.09493 (replaced) [pdf, ps, other]
Title: C-Learner: Constrained Learning for Causal Inference and Semiparametric Statistics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[81]  arXiv:2405.09584 (replaced) [pdf, other]
Title: Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
[82]  arXiv:2405.09831 (replaced) [pdf, other]
Title: Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Comments: Preprint. Under review
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[83]  arXiv:2405.10301 (replaced) [pdf, other]
Title: Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[84]  arXiv:2006.13456 (replaced) [src]
Title: Likelihood-Free Gaussian Process for Regression
Authors: Yuta Shikuri
Comments: There were errors in the proposed method
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[85]  arXiv:2205.15049 (replaced) [pdf, other]
Title: Metrizing Fairness
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[86]  arXiv:2209.14440 (replaced) [pdf, other]
Title: GeONet: a neural operator for learning the Wasserstein geodesic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[87]  arXiv:2211.07482 (replaced) [pdf, other]
Title: Unifying O(3) Equivariant Neural Networks Design with Tensor-Network Formalism
Comments: 10 pages + 12-page supplementary materials, many figures
Journal-ref: Mach. Learn.: Sci. Technol. 5 025044, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph); Machine Learning (stat.ML)
[88]  arXiv:2302.07185 (replaced) [pdf, other]
Title: When mitigating bias is unfair: multiplicity and arbitrariness in algorithmic group fairness
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[89]  arXiv:2304.07278 (replaced) [pdf, ps, other]
Title: Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning
Comments: accepted for presentation in COLT 2024
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Systems and Control (eess.SY); Statistics Theory (math.ST); Machine Learning (stat.ML)
[90]  arXiv:2304.14606 (replaced) [pdf, other]
Title: Algorithmic Recourse with Missing Values
Comments: 30 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[91]  arXiv:2305.15912 (replaced) [pdf, other]
Title: ReLU Characteristic Activation Analysis
Authors: Wenlin Chen, Hong Ge
Comments: code available at: this https URL
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[92]  arXiv:2306.11697 (replaced) [pdf, other]
Title: Treatment Effects in Extreme Regimes
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
[93]  arXiv:2306.13580 (replaced) [pdf, other]
Title: Lower Complexity Adaptation for Empirical Entropic Optimal Transport
Comments: 51 pages, 4 figures, proof of LCA for sub-Gaussian measures (Theorem 3.13) corrected
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
[94]  arXiv:2306.16564 (replaced) [pdf, other]
Title: Pareto Optimal Learning for Estimating Large Language Model Errors
Subjects: Computation and Language (cs.CL); Machine Learning (stat.ML)
[95]  arXiv:2307.01198 (replaced) [pdf, other]
Title: Improved sampling via learned diffusions
Comments: Accepted at ICLR 2024
Journal-ref: International Conference on Learning Representations, 2024
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
[96]  arXiv:2307.04191 (replaced) [pdf, ps, other]
Title: On the sample complexity of parameter estimation in logistic regression with normal design
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
[97]  arXiv:2310.11011 (replaced) [pdf, other]
Title: From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling
Comments: Published in Transactions on Machine Learning Research (TMLR) (05/2024); 72 pages, 27 figures, 4 tables
Journal-ref: Transactions on Machine Learning Research, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[98]  arXiv:2310.17571 (replaced) [pdf, other]
Title: Inside the black box: Neural network-based real-time prediction of US recessions
Authors: Seulki Chung
Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)
[99]  arXiv:2311.05025 (replaced) [pdf, other]
Title: Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients
Comments: 99 Pages, 13 Figures
Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Methodology (stat.ME); Machine Learning (stat.ML)
[100]  arXiv:2311.11153 (replaced) [pdf, other]
Title: Biarchetype analysis: simultaneous learning of observations and features based on extremes
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
[101]  arXiv:2311.18672 (replaced) [pdf, other]
Title: A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural Networks
Comments: 15 pages, 7 figures, 3 appendices
Journal-ref: Axioms 13 (2024) 160
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
[102]  arXiv:2312.05134 (replaced) [pdf, other]
Title: Optimal Multi-Distribution Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[103]  arXiv:2402.00847 (replaced) [pdf, other]
Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[104]  arXiv:2402.01454 (replaced) [pdf, other]
Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
[105]  arXiv:2402.01929 (replaced) [pdf, other]
Title: Sample, estimate, aggregate: A recipe for causal discovery foundation models
Comments: Preprint. Under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[106]  arXiv:2402.02239 (replaced) [pdf, other]
Title: Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein
Comments: 38 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[107]  arXiv:2402.02277 (replaced) [pdf, other]
Title: Causal Bayesian Optimization via Exogenous Distribution Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[108]  arXiv:2402.02851 (replaced) [pdf, other]
Title: Enhancing Compositional Generalization via Compositional Feature Alignment
Comments: Published in Transactions on Machine Learning Research (TMLR). The code is released at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[109]  arXiv:2402.03587 (replaced) [pdf, other]
Title: Information-Theoretic Active Correlation Clustering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[110]  arXiv:2402.03687 (replaced) [pdf, other]
Title: Pard: Permutation-Invariant Autoregressive Diffusion for Graph Generation
Comments: Diffusion Model on Graphs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[111]  arXiv:2402.03985 (replaced) [pdf, other]
Title: A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[112]  arXiv:2402.04906 (replaced) [pdf, other]
Title: Conformal Convolution and Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects
Comments: 25 pages, 14 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[113]  arXiv:2402.05569 (replaced) [pdf, other]
Title: Simplifying Hypergraph Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
[114]  arXiv:2402.08193 (replaced) [pdf, other]
Title: Gaussian Ensemble Belief Propagation for Efficient Inference in High-Dimensional Systems
Comments: Under conference submission
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[115]  arXiv:2402.10445 (replaced) [pdf, other]
Title: Collaborative Learning with Different Labeling Functions
Comments: To appear at ICML 2024; v2 and v3 included additional discussion on related work
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
[116]  arXiv:2402.12875 (replaced) [pdf, other]
Title: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Comments: 38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
[117]  arXiv:2402.13380 (replaced) [pdf, ps, other]
Title: Toward TransfORmers: Revolutionizing the Solution of Mixed Integer Programs with Transformers
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC); Machine Learning (stat.ML)
[118]  arXiv:2402.17826 (replaced) [pdf, other]
Title: Prediction-Powered Ranking of Large Language Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[119]  arXiv:2403.02957 (replaced) [pdf, other]
Title: On the Asymptotic Mean Square Error Optimality of Diffusion Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[120]  arXiv:2405.00675 (replaced) [pdf, other]
Title: Self-Play Probabilistic Preference Optimization for Language Model Alignment
Comments: 26 pages, 4 figures, 5 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
[121]  arXiv:2405.06627 (replaced) [pdf, other]
Title: Conformal Validity Guarantees Exist for Any Data Distribution
Comments: ICML 2024. Code available at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[122]  arXiv:2405.09784 (replaced) [pdf, other]
Title: Online bipartite matching with imperfect advice
Comments: Accepted into ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
[123]  arXiv:2405.10093 (replaced) [pdf, other]
Title: LaT-PFN: A Joint Embedding Predictive Architecture for In-context Time-series Forecasting
Comments: 9 pages plus references and appendix, 2 tables, 11 figures, added seeds, corrections
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[124]  arXiv:2405.10289 (replaced) [pdf, ps, other]
Title: Subgradient Convergence Implies Subdifferential Convergence on Weakly Convex Functions: With Uniform Rates Guarantees
Authors: Feng Ruan
Comments: This revision adds Lemma 1 and corrects several typos
Subjects: Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)