cs.LG - 2023-10-21

Optimal Batched Best Arm Identification

  • paper_url: http://arxiv.org/abs/2310.14129
  • repo_url: None
  • paper_authors: Tianyuan Jin, Yu Yang, Jing Tang, Xiaokui Xiao, Pan Xu
  • for: We study the batched best arm identification (BBAI) problem, where the learner's goal is to identify the best arm as quickly as possible while switching the policy as little as possible.
  • methods: We propose the three-batch best arm identification (Tri-BBAI) algorithm, the first batched algorithm to achieve the optimal sample complexity in the asymptotic setting (i.e., $\delta\to 0$) using at most three batches. Building on Tri-BBAI, we further propose the almost optimal batched best arm identification (Opt-BBAI) algorithm, which achieves near-optimal sample and batch complexity in the non-asymptotic setting (i.e., fixed $\delta>0$) while matching the complexity of Tri-BBAI as $\delta\to 0$.
  • results: Opt-BBAI attains near-optimal sample and batch complexity in the non-asymptotic setting without conditioning on the event that the best arm is returned, unlike previous batched algorithms. We also design a novel procedure for checking whether the best arm has been eliminated, which is of independent interest.
    Abstract We study the batched best arm identification (BBAI) problem, where the learner's goal is to identify the best arm while switching the policy as less as possible. In particular, we aim to find the best arm with probability $1-\delta$ for some small constant $\delta>0$ while minimizing both the sample complexity (total number of arm pulls) and the batch complexity (total number of batches). We propose the three-batch best arm identification (Tri-BBAI) algorithm, which is the first batched algorithm that achieves the optimal sample complexity in the asymptotic setting (i.e., $\delta\rightarrow 0$) and runs only in at most $3$ batches. Based on Tri-BBAI, we further propose the almost optimal batched best arm identification (Opt-BBAI) algorithm, which is the first algorithm that achieves the near-optimal sample and batch complexity in the non-asymptotic setting (i.e., $\delta>0$ is arbitrarily fixed), while enjoying the same batch and sample complexity as Tri-BBAI when $\delta$ tends to zero. Moreover, in the non-asymptotic setting, the complexity of previous batch algorithms is usually conditioned on the event that the best arm is returned (with a probability of at least $1-\delta$), which is potentially unbounded in cases where a sub-optimal arm is returned. In contrast, the complexity of Opt-BBAI does not rely on such an event. This is achieved through a novel procedure that we design for checking whether the best arm is eliminated, which is of independent interest.
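To make the batched protocol concrete, the sketch below implements a generic batched successive-elimination baseline, not the Tri-BBAI or Opt-BBAI algorithms from the paper: all active arms are pulled in batches, and arms whose upper confidence bound falls below the best lower confidence bound are eliminated. The batch size and the Hoeffding-style confidence radius are illustrative assumptions.

```python
import numpy as np

def batched_successive_elimination(means, delta=0.05, batch_size=200, rng=None):
    """Generic batched elimination baseline (illustrative, not Tri-BBAI/Opt-BBAI).

    means : true Bernoulli means of the arms (used only to simulate pulls).
    Returns (best_arm_index, total_pulls, num_batches).
    """
    rng = np.random.default_rng(rng)
    active = list(range(len(means)))
    sums = np.zeros(len(means))
    counts = np.zeros(len(means))
    batches = 0
    while len(active) > 1:
        batches += 1
        for a in active:                      # one batch: pull every active arm
            sums[a] += rng.binomial(batch_size, means[a])
            counts[a] += batch_size
        mu_hat = sums[active] / counts[active]
        # Hoeffding-style confidence radius (illustrative choice)
        rad = np.sqrt(np.log(2 * len(means) * batches / delta) / (2 * counts[active]))
        keep = mu_hat + rad >= (mu_hat - rad).max()
        active = [a for a, k in zip(active, keep) if k]
    return active[0], int(counts.sum()), batches

best, pulls, n_batches = batched_successive_elimination([0.3, 0.5, 0.45, 0.6])
print(best, pulls, n_batches)
```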

DispersioNET: Joint Inversion of Rayleigh-Wave Multimode Phase Velocity Dispersion Curves using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.14094
  • repo_url: None
  • paper_authors: Rohan Sharma, Divakar Vashisth, Bharath Shekar
  • for: This work introduces DispersioNET, a deep learning model for the joint inversion of Rayleigh-wave fundamental and higher-order mode phase velocity dispersion curves to recover shear-wave (S-wave) velocity profiles.
  • methods: The model is based on convolutional neural networks (CNNs) and is trained and tested on both noise-free and noisy dispersion curve datasets.
  • results: DispersioNET accurately predicts S-wave velocity profiles that closely match the true velocities, is robust to noise, and remains agnostic to variations in the profiles such as intermediate low-velocity layers.
    Abstract Rayleigh wave dispersion curves have been widely used in near-surface studies, and are primarily inverted for the shear wave (S-wave) velocity profiles. However, the inverse problem is ill-posed, non-unique and nonlinear. Here, we introduce DispersioNET, a deep learning model based on convolution neural networks (CNN) to perform the joint inversion of Rayleigh wave fundamental and higher order mode phase velocity dispersion curves. DispersioNET is trained and tested on both noise-free and noisy dispersion curve datasets and predicts S-wave velocity profiles that match closely with the true velocities. The architecture is agnostic to variations in S-wave velocity profiles such as increasing velocity with depth and intermediate low-velocity layers, while also ensuring that the output remains independent of the number of layers.
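A minimal sketch of the kind of 1D CNN regressor described above, mapping multimode dispersion curves (channels = modes, samples = frequencies) to a fixed-length S-wave velocity profile. The layer sizes, number of modes, and output depth discretization are assumptions for illustration and do not reproduce the published DispersioNET architecture.

```python
import torch
import torch.nn as nn

class DispersionCNN(nn.Module):
    """Illustrative 1D CNN: (batch, n_modes, n_freqs) -> (batch, n_depths)."""
    def __init__(self, n_modes=3, n_freqs=64, n_depths=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_modes, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_depths),
        )

    def forward(self, curves):
        return self.head(self.features(curves))

model = DispersionCNN()
curves = torch.randn(8, 3, 64)        # 8 synthetic multimode dispersion curves
vs_profile = model(curves)            # predicted S-wave velocity at 32 depths
print(vs_profile.shape)               # torch.Size([8, 32])
```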

A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport

  • paper_url: http://arxiv.org/abs/2310.14087
  • repo_url: None
  • paper_authors: Tianyi Lin, Marco Cuturi, Michael I. Jordan
  • for: This paper proposes a new method for solving optimal transport (OT) problems using a nonsmooth fixed-point model and a specialized semismooth Newton (SSN) method.
  • methods: The proposed method uses a nonsmooth fixed-point model and a specialized semismooth Newton (SSN) method to efficiently solve kernel-based OT problems.
  • results: The proposed method achieves a global convergence rate of $O(1/\sqrt{k})$ and a local quadratic convergence rate under standard regularity conditions, and shows substantial speedups over the short-step interior-point method (SSIPM) on both synthetic and real datasets.
    Abstract Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples. Recent works suggest that these estimators are more statistically efficient than plug-in (linear programming-based) OT estimators when comparing probability measures in high-dimensions~\citep{Vacher-2021-Dimension}. Unfortunately, that statistical benefit comes at a very steep computational price: because their computation relies on the short-step interior-point method (SSIPM), which comes with a large iteration count in practice, these estimators quickly become intractable w.r.t. sample size $n$. To scale these estimators to larger $n$, we propose a nonsmooth fixed-point model for the kernel-based OT problem, and show that it can be efficiently solved via a specialized semismooth Newton (SSN) method: We show, exploring the problem's structure, that the per-iteration cost of performing one SSN step can be significantly reduced in practice. We prove that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions. We show substantial speedups over SSIPM on both synthetic and real datasets.

Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

  • paper_url: http://arxiv.org/abs/2310.14085
  • repo_url: None
  • paper_authors: Michael I. Jordan, Tianyi Lin, Zhengyuan Zhou
  • for: This work designs an online gradient descent (OGD) algorithm that needs no prior knowledge of the strong convexity/monotonicity parameters, while achieving optimal regret in the single-agent setting and an optimal convergence rate to the Nash equilibrium in the multi-agent setting.
  • methods: The proposed fully adaptive OGD algorithm, \textsf{AdaOGD}, requires no a priori knowledge of these parameters. In the single-agent setting, \textsf{AdaOGD} achieves $O(\log^2(T))$ regret under strong convexity, which is optimal up to a log factor. In strongly monotone games where every agent runs \textsf{AdaOGD}, the joint action converges in the last-iterate sense to the unique Nash equilibrium at a rate of $O(\frac{\log^3 T}{T})$.
  • results: \textsf{AdaOGD} yields the first feasible and near-optimal algorithm for a learning version of the newsvendor problem, in both the single-retailer and multi-retailer settings. The results are further extended to the more general setting of exp-concave cost functions and games using the online Newton step (ONS) algorithm.
    Abstract Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions: (1) in the single-agent setting, it achieves an optimal regret of $\Theta(\log T)$ for strongly convex cost functions; and (2) in the multi-agent setting of strongly monotone games, with each agent employing OGD, we obtain last-iterate convergence of the joint action to a unique Nash equilibrium at an optimal rate of $\Theta(\frac{1}{T})$. While these finite-time guarantees highlight its merits, OGD has the drawback that it requires knowing the strong convexity/monotonicity parameters. In this paper, we design a fully adaptive OGD algorithm, \textsf{AdaOGD}, that does not require a priori knowledge of these parameters. In the single-agent setting, our algorithm achieves $O(\log^2(T))$ regret under strong convexity, which is optimal up to a log factor. Further, if each agent employs \textsf{AdaOGD} in strongly monotone games, the joint action converges in a last-iterate sense to a unique Nash equilibrium at a rate of $O(\frac{\log^3 T}{T})$, again optimal up to log factors. We illustrate our algorithms in a learning version of the classical newsvendor problem, where due to lost sales, only (noisy) gradient feedback can be observed. Our results immediately yield the first feasible and near-optimal algorithm for both the single-retailer and multi-retailer settings. We also extend our results to the more general setting of exp-concave cost functions and games, using the online Newton step (ONS) algorithm.
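For contrast with the adaptive algorithm above, the sketch below shows the textbook non-adaptive OGD baseline for strongly convex losses, whose step size $1/(\mu t)$ requires the strong-convexity parameter $\mu$; removing the need for this knowledge is exactly what \textsf{AdaOGD} provides (its specific adaptive rule is in the paper and is not reproduced here).

```python
import numpy as np

def ogd_strongly_convex(grad, x0, mu, T):
    """Textbook OGD for mu-strongly convex losses with step 1/(mu*t).

    This is the non-adaptive baseline: it needs the strong-convexity
    parameter mu, which is exactly the knowledge AdaOGD removes.
    grad(x, t) returns a (possibly noisy) gradient at round t.
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(1, T + 1):
        x = x - (1.0 / (mu * t)) * grad(x, t)
        iterates.append(x.copy())
    return np.array(iterates)

# Example: noisy gradients of f(x) = 0.5 * ||x - 3||^2  (mu = 1)
rng = np.random.default_rng(0)
noisy_grad = lambda x, t: (x - 3.0) + 0.1 * rng.standard_normal(x.shape)
traj = ogd_strongly_convex(noisy_grad, x0=np.zeros(2), mu=1.0, T=500)
print(traj[-1])   # close to [3, 3]
```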

Graph Neural Networks and Applied Linear Algebra

  • paper_url: http://arxiv.org/abs/2310.14084
  • repo_url: https://github.com/sandialabs/gnn-applied-linear-algebra
  • paper_authors: Nicholas S. Moore, Eric C. Cyr, Peter Ohm, Christopher M. Siefert, Raymond S. Tuminaro
  • for: This article addresses scientific computing problems in numerical linear algebra, and examines how neural networks (NNs) can be used for sparse matrix computations.
  • methods: Graph neural networks (GNNs) are used to handle sparse matrix computations: GNNs define aggregation functions (e.g., summations) that operate on variable-size input data and produce fixed-size outputs, so that MLPs can then be applied.
  • results: Through concrete examples, the article illustrates how many common linear algebra tasks can be accomplished with GNNs, including matrix-vector products, interpolation, relaxation methods, and strength-of-connection measures.
    Abstract Sparse matrix computations are ubiquitous in scientific computing. With the recent interest in scientific machine learning, it is natural to ask how sparse matrix computations can leverage neural networks (NN). Unfortunately, multi-layer perceptron (MLP) neural networks are typically not natural for either graph or sparse matrix computations. The issue lies with the fact that MLPs require fixed-sized inputs while scientific applications generally generate sparse matrices with arbitrary dimensions and a wide range of nonzero patterns (or matrix graph vertex interconnections). While convolutional NNs could possibly address matrix graphs where all vertices have the same number of nearest neighbors, a more general approach is needed for arbitrary sparse matrices, e.g. arising from discretized partial differential equations on unstructured meshes. Graph neural networks (GNNs) are one approach suitable to sparse matrices. GNNs define aggregation functions (e.g., summations) that operate on variable size input data to produce data of a fixed output size so that MLPs can be applied. The goal of this paper is to provide an introduction to GNNs for a numerical linear algebra audience. Concrete examples are provided to illustrate how many common linear algebra tasks can be accomplished using GNNs. We focus on iterative methods that employ computational kernels such as matrix-vector products, interpolation, relaxation methods, and strength-of-connection measures. Our GNN examples include cases where parameters are determined a-priori as well as cases where parameters must be learned. The intent with this article is to help computational scientists understand how GNNs can be used to adapt machine learning concepts to computational tasks associated with sparse matrices. It is hoped that this understanding will stimulate data-driven extensions of classical sparse linear algebra tasks.
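As a concrete instance of the first task listed above, the sketch below treats a sparse matrix as a graph (one edge per nonzero entry) and computes a matrix-vector product as a GNN-style sum-aggregation: each edge $(i, j)$ sends the message $A_{ij} x_j$ to its row vertex, and the per-vertex sums reproduce $A x$. This is a hand-coded illustration of the aggregation viewpoint, not code from the paper's repository.

```python
import numpy as np
import scipy.sparse as sp

# Sparse matrix viewed as a graph: vertices = rows/cols, edges = nonzeros.
A = sp.random(6, 6, density=0.4, format="coo", random_state=0)
x = np.arange(6, dtype=float)

# GNN-style aggregation: every edge (i, j) with weight A_ij emits the
# message A_ij * x[j]; summing messages at each row vertex i gives (A @ x)_i.
messages = A.data * x[A.col]
y_gnn = np.zeros(A.shape[0])
np.add.at(y_gnn, A.row, messages)          # sum-aggregation over incoming edges

assert np.allclose(y_gnn, A @ x)           # matches the ordinary SpMV
print(y_gnn)
```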

Counterfactual Prediction Under Selective Confounding

  • paper_url: http://arxiv.org/abs/2310.14064
  • repo_url: https://github.com/sohaib730/causalml
  • paper_authors: Sohaib Kiani, Jared Barton, Jon Sushinsky, Lynda Heimbach, Bo Luo
  • for: This research addresses the challenge of conducting interpretable causal inference between a binary treatment and its resulting outcome when not all confounders are known.
  • methods: We propose an approach for this Selective Confounding setting that uses dual-treatment samples, enabling two-step procedures such as Regression Adjustment or Doubly-Robust estimation to learn counterfactual predictors. The scheme is designed for settings where multiple decision-makers with different policies are involved and a re-evaluation mechanism follows the initial decision.
  • results: We provide both theoretical error bounds and empirical evidence of the method's effectiveness on synthetic and real-world child placement data, and introduce three evaluation methods tailored to child placement scenarios to enhance transparency and interpretability.
    Abstract This research addresses the challenge of conducting interpretable causal inference between a binary treatment and its resulting outcome when not all confounders are known. Confounders are factors that have an influence on both the treatment and the outcome. We relax the requirement of knowing all confounders under desired treatment, which we refer to as Selective Confounding, to enable causal inference in diverse real-world scenarios. Our proposed scheme is designed to work in situations where multiple decision-makers with different policies are involved and where there is a re-evaluation mechanism after the initial decision to ensure consistency. These assumptions are more practical to fulfill compared to the availability of all confounders under all treatments. To tackle the issue of Selective Confounding, we propose the use of dual-treatment samples. These samples allow us to employ two-step procedures, such as Regression Adjustment or Doubly-Robust, to learn counterfactual predictors. We provide both theoretical error bounds and empirical evidence of the effectiveness of our proposed scheme using synthetic and real-world child placement data. Furthermore, we introduce three evaluation methods specifically tailored to assess the performance in child placement scenarios. By emphasizing transparency and interpretability, our approach aims to provide decision-makers with a valuable tool. The source code repository of this work is located at https://github.com/sohaib730/CausalML.
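The Regression Adjustment step mentioned above can be sketched generically: fit one outcome model per treatment arm, then predict both potential outcomes for every unit and average their difference. The sketch below uses scikit-learn on simulated data and assumes all confounders are observed, which is precisely the assumption the paper relaxes; the dual-treatment sampling and error bounds are not reproduced.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                       # observed covariates
T = rng.binomial(1, 0.5, size=n)                  # binary treatment
tau = 1.5                                         # true constant treatment effect
Y = X[:, 0] + tau * T + rng.normal(scale=0.5, size=n)

# Step 1: fit one outcome model per treatment arm.
m0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
m1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])

# Step 2: predict both potential outcomes for every unit (counterfactuals).
y0_hat, y1_hat = m0.predict(X), m1.predict(X)
print("estimated ATE:", np.mean(y1_hat - y0_hat))  # close to 1.5
```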

On discretisation drift and smoothness regularisation in neural network training

  • paper_url: http://arxiv.org/abs/2310.14036
  • repo_url: None
  • paper_authors: Mihaela Claudia Rosca
  • for: The aim of this work is a deeper understanding of deep learning, with a focus on optimisation and model regularisation.
  • methods: The work studies gradient descent (GD) and its continuous-time counterpart, the negative gradient flow (NGF), and derives novel continuous-time flows that account for discretisation drift in order to describe learning-rate-specific behaviours of GD.
  • results: Insights from continuous time are translated into new learning rate schedules and regularisers that mitigate training instabilities observed in supervised learning and two-player games without additional hyperparameters. The work further finds that smoothness regularisation affects optimisation across multiple deep learning domains, and that incorporating smoothness regularisation in reinforcement learning yields a performance boost.
    Abstract The deep learning recipe of casting real-world problems as mathematical optimisation and tackling the optimisation by training deep neural networks using gradient-based optimisation has undoubtedly proven to be a fruitful one. The understanding behind why deep learning works, however, has lagged behind its practical significance. We aim to make steps towards an improved understanding of deep learning with a focus on optimisation and model regularisation. We start by investigating gradient descent (GD), a discrete-time algorithm at the basis of most popular deep learning optimisation algorithms. Understanding the dynamics of GD has been hindered by the presence of discretisation drift, the numerical integration error between GD and its often studied continuous-time counterpart, the negative gradient flow (NGF). To add to the toolkit available to study GD, we derive novel continuous-time flows that account for discretisation drift. Unlike the NGF, these new flows can be used to describe learning rate specific behaviours of GD, such as training instabilities observed in supervised learning and two-player games. We then translate insights from continuous time into mitigation strategies for unstable GD dynamics, by constructing novel learning rate schedules and regularisers that do not require additional hyperparameters. Like optimisation, smoothness regularisation is another pillar of deep learning's success with wide use in supervised learning and generative modelling. Despite their individual significance, the interactions between smoothness regularisation and optimisation have yet to be explored. We find that smoothness regularisation affects optimisation across multiple deep learning domains, and that incorporating smoothness regularisation in reinforcement learning leads to a performance boost that can be recovered using adaptions to optimisation methods.
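Discretisation drift, as used above, is the gap between a gradient-descent step of size $h$ and the negative gradient flow integrated over the same time span. The toy sketch below measures it on a simple quadratic by comparing one GD step against the NGF integrated with many small Euler sub-steps; the quadratic and step sizes are illustrative choices.

```python
import numpy as np

# Quadratic loss f(x) = 0.5 * x^T A x with an ill-conditioned A.
A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

def gd_step(x, h):
    return x - h * grad(x)

def ngf(x, h, substeps=10_000):
    """Integrate dx/dt = -grad f(x) for time h with tiny Euler steps."""
    dt = h / substeps
    for _ in range(substeps):
        x = x - dt * grad(x)
    return x

x0 = np.array([1.0, 1.0])
for h in [0.01, 0.05, 0.08]:
    drift = np.linalg.norm(gd_step(x0, h) - ngf(x0, h))
    print(f"step size {h}: discretisation drift = {drift:.4f}")
```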

Filling the Missing: Exploring Generative AI for Enhanced Federated Learning over Heterogeneous Mobile Edge Devices

  • paper_url: http://arxiv.org/abs/2310.13981
  • repo_url: None
  • paper_authors: Peichun Li, Hanwen Zhang, Yuan Wu, Liping Qian, Rong Yu, Dusit Niyato, Xuemin Shen
  • for: This work tackles the data and resource heterogeneity of edge devices in distributed AI model training over mobile edge networks.
  • methods: A generative-AI-empowered federated learning scheme is proposed based on the idea of FIlling the MIssing (FIMI) portion of local data, a resource-aware data augmentation method that mitigates data heterogeneity while keeping federated training efficient and improving device resource utilization.
  • results: Experiments show that FIMI can save up to 50% of the device-side energy needed to reach the target global test accuracy, and significantly improves the converged global accuracy under non-independently-and-identically-distributed (non-IID) data.
    Abstract Distributed Artificial Intelligence (AI) model training over mobile edge networks encounters significant challenges due to the data and resource heterogeneity of edge devices. The former hampers the convergence rate of the global model, while the latter diminishes the devices' resource utilization efficiency. In this paper, we propose a generative AI-empowered federated learning to address these challenges by leveraging the idea of FIlling the MIssing (FIMI) portion of local data. Specifically, FIMI can be considered as a resource-aware data augmentation method that effectively mitigates the data heterogeneity while ensuring efficient FL training. We first quantify the relationship between the training data amount and the learning performance. We then study the FIMI optimization problem with the objective of minimizing the device-side overall energy consumption subject to required learning performance constraints. The decomposition-based analysis and the cross-entropy searching method are leveraged to derive the solution, where each device is assigned suitable AI-synthesized data and resource utilization policy. Experiment results demonstrate that FIMI can save up to 50% of the device-side energy to achieve the target global test accuracy in comparison with the existing methods. Meanwhile, FIMI can significantly enhance the converged global accuracy under the non-independently-and-identically distribution (non-IID) data.

Continual Invariant Risk Minimization

  • paper_url: http://arxiv.org/abs/2310.13977
  • repo_url: None
  • paper_authors: Francesco Alesiani, Shujian Yu, Mathias Niepert
  • for: This work proposes a continual learning approach for capturing feature representations that remain invariant across sequentially observed environments.
  • methods: It builds on invariant risk minimization (IRM), which assumes all environments are available to the learning system at once, and extends IRM under a variational Bayesian and bilevel framework to obtain a general approach to continual invariant risk minimization, solved with a variant of the alternating direction method of multipliers (ADMM).
  • results: Experiments on multiple datasets with multiple sequential environments show that the proposed methods outperform or are competitive with prior approaches.
    Abstract Empirical risk minimization can lead to poor generalization behavior on unseen environments if the learned model does not capture invariant feature representations. Invariant risk minimization (IRM) is a recent proposal for discovering environment-invariant representations. IRM was introduced by Arjovsky et al. (2019) and extended by Ahuja et al. (2020). IRM assumes that all environments are available to the learning system at the same time. With this work, we generalize the concept of IRM to scenarios where environments are observed sequentially. We show that existing approaches, including those designed for continual learning, fail to identify the invariant features and models across sequentially presented environments. We extend IRM under a variational Bayesian and bilevel framework, creating a general approach to continual invariant risk minimization. We also describe a strategy to solve the optimization problems using a variant of the alternating direction method of multiplier (ADMM). We show empirically using multiple datasets and with multiple sequential environments that the proposed methods outperform or is competitive with prior approaches.
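For reference, the sketch below computes the IRMv1 penalty of Arjovsky et al. (2019), the non-continual objective this paper extends: the squared gradient of each environment's risk with respect to a fixed dummy classifier scale, added to the average empirical risk. It illustrates the base IRM idea only, not the variational Bayesian/bilevel continual extension proposed here.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, labels):
    """IRMv1 penalty for one environment: ||d R_e(scale * w) / d scale||^2 at scale=1."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad, = torch.autograd.grad(loss, [scale], create_graph=True)
    return grad.pow(2)

def irm_objective(model, environments, lam=10.0):
    """Average empirical risk plus lambda times the sum of per-environment penalties."""
    risks, penalties = [], []
    for x, y in environments:                     # list of (inputs, labels) per environment
        logits = model(x).squeeze(-1)
        risks.append(F.binary_cross_entropy_with_logits(logits, y))
        penalties.append(irmv1_penalty(logits, y))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).sum()

# Usage sketch: two synthetic environments and a linear model.
model = torch.nn.Linear(10, 1)
envs = [(torch.randn(64, 10), torch.randint(0, 2, (64,)).float()) for _ in range(2)]
loss = irm_objective(model, envs)
loss.backward()
```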

ASBART:Accelerated Soft Bayes Additive Regression Trees

  • paper_url: http://arxiv.org/abs/2310.13975
  • repo_url: https://github.com/richael008/xsbart
  • paper_authors: Hao Ran, Yang Bai
  • for: This paper proposes a new variant of Bayesian additive regression trees (BART) called accelerated soft BART (ASBART), which improves the speed of the existing Soft BART model while maintaining comparable accuracy.
  • methods: ASBART uses a new algorithm that is about 10 times faster than the default Soft BART sampler, making it more practical for real-world applications.
  • results: Simulation studies show that ASBART has accuracy comparable to Soft BART while being significantly faster; the code is open source and available online.
    Abstract Bayes additive regression trees (BART) is a nonparametric regression model which has gained widespread popularity in recent years due to its flexibility and high estimation accuracy. Soft BART, one variation of BART, improves both practically and theoretically on existing Bayesian sum-of-trees models. One bottleneck for Soft BART is its slow speed in the long MCMC loop: compared to BART, it takes about 20 times longer to complete the computation with the default settings. We propose a variant of BART named accelerated Soft BART (ASBART). Simulation studies show that the new method is about 10 times faster than Soft BART with comparable accuracy. Our code is open-source and available at https://github.com/richael008/XSBART.

Distributed Linear Regression with Compositional Covariates

  • paper_url: http://arxiv.org/abs/2310.13969
  • repo_url: None
  • paper_authors: Yue Chao, Lei Huang, Xuejun Ma
  • for: This work addresses distributed statistical methodology and computation for massive compositional data sets.
  • methods: Two distributed optimization techniques are proposed for the sparse penalized linear log-contrast model under centralized and decentralized topologies, built on the ADMM and CDMM frameworks; in the decentralized topology, a communication-efficient distributed coordinate-wise descent algorithm based on Group ADMM (GADMM) is introduced for regularized estimation.
  • results: Convergence theories for the proposed algorithms are established under regularity conditions, and numerical experiments on both synthetic and real data demonstrate their effectiveness.
    Abstract With the availability of extraordinarily huge data sets, solving the problems of distributed statistical methodology and computing for such data sets has become increasingly crucial in the big data area. In this paper, we focus on the distributed sparse penalized linear log-contrast model in massive compositional data. In particular, two distributed optimization techniques under centralized and decentralized topologies are proposed for solving the two different constrained convex optimization problems. Both two proposed algorithms are based on the frameworks of Alternating Direction Method of Multipliers (ADMM) and Coordinate Descent Method of Multipliers(CDMM, Lin et al., 2014, Biometrika). It is worth emphasizing that, in the decentralized topology, we introduce a distributed coordinate-wise descent algorithm based on Group ADMM(GADMM, Elgabli et al., 2020, Journal of Machine Learning Research) for obtaining a communication-efficient regularized estimation. Correspondingly, the convergence theories of the proposed algorithms are rigorously established under some regularity conditions. Numerical experiments on both synthetic and real data are conducted to evaluate our proposed algorithms.

Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression

  • paper_url: http://arxiv.org/abs/2310.13966
  • repo_url: None
  • paper_authors: Chao Wang, Caixing Wang, Xin He, Xingdong Feng
  • for: This paper focuses on investigating the transfer learning problem within the context of nonparametric regression over a reproducing kernel Hilbert space, with the aim of bridging the gap between practical effectiveness and theoretical guarantees.
  • methods: The proposed method uses kernel ridge regression for the known transferable source case, and an efficient aggregation algorithm for the unknown case, which can automatically detect and alleviate the effects of negative sources.
  • results: The paper provides the statistical properties of the desired estimators and establishes the minimax optimal rate, and through extensive numerical experiments on synthetic data and real examples, the effectiveness of the proposed method is validated.
    Abstract In recent years, transfer learning has garnered significant attention in the machine learning community. Its ability to leverage knowledge from related studies to improve generalization performance in a target study has made it highly appealing. This paper focuses on investigating the transfer learning problem within the context of nonparametric regression over a reproducing kernel Hilbert space. The aim is to bridge the gap between practical effectiveness and theoretical guarantees. We specifically consider two scenarios: one where the transferable sources are known and another where they are unknown. For the known transferable source case, we propose a two-step kernel-based estimator by solely using kernel ridge regression. For the unknown case, we develop a novel method based on an efficient aggregation algorithm, which can automatically detect and alleviate the effects of negative sources. This paper provides the statistical properties of the desired estimators and establishes the minimax optimal rate. Through extensive numerical experiments on synthetic data and real examples, we validate our theoretical findings and demonstrate the effectiveness of our proposed method.
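A minimal sketch of a two-step kernel-based transfer estimator in the known-transferable-source case: fit kernel ridge regression on the source sample, then fit a second kernel ridge regression on the target residuals to correct the source fit toward the target. The "fit then correct" split and all hyperparameters are illustrative assumptions and do not reproduce the paper's exact estimator or its minimax analysis.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
f_source = lambda x: np.sin(3 * x)
f_target = lambda x: np.sin(3 * x) + 0.3 * x           # target differs by a smooth offset

Xs = rng.uniform(-2, 2, size=(500, 1))                 # large source sample
ys = f_source(Xs[:, 0]) + 0.1 * rng.standard_normal(500)
Xt = rng.uniform(-2, 2, size=(40, 1))                  # small target sample
yt = f_target(Xt[:, 0]) + 0.1 * rng.standard_normal(40)

# Step 1: kernel ridge regression on the source data.
src = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0).fit(Xs, ys)

# Step 2: kernel ridge regression on target residuals (the source-target offset).
offset = KernelRidge(kernel="rbf", alpha=1e-1, gamma=1.0).fit(Xt, yt - src.predict(Xt))

Xgrid = np.linspace(-2, 2, 200).reshape(-1, 1)
transfer_pred = src.predict(Xgrid) + offset.predict(Xgrid)
target_only = KernelRidge(kernel="rbf", alpha=1e-1, gamma=1.0).fit(Xt, yt).predict(Xgrid)
print("transfer MSE:   ", np.mean((transfer_pred - f_target(Xgrid[:, 0])) ** 2))
print("target-only MSE:", np.mean((target_only - f_target(Xgrid[:, 0])) ** 2))
```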

Toward Generative Data Augmentation for Traffic Classification

  • paper_url: http://arxiv.org/abs/2310.13935
  • repo_url: None
  • paper_authors: Chao Wang, Alessandro Finamore, Pietro Michiardi, Massimo Gallo, Dario Rossi
  • for: This work explores the feasibility of data augmentation for networking use cases, in particular traffic classification.
  • methods: A preliminary study of 14 hand-crafted data augmentations applied to the MIRAGE19 dataset.
  • results: The results show that data augmentation can bring benefits previously unexplored in traffic classification, and motivate a research agenda on using generative models to automate augmentation design.
    Abstract Data Augmentation (DA)-augmenting training data with synthetic samples-is wildly adopted in Computer Vision (CV) to improve models performance. Conversely, DA has not been yet popularized in networking use cases, including Traffic Classification (TC). In this work, we present a preliminary study of 14 hand-crafted DAs applied on the MIRAGE19 dataset. Our results (i) show that DA can reap benefits previously unexplored in TC and (ii) foster a research agenda on the use of generative models to automate DA design.
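Hand-crafted augmentations of the kind studied here typically perturb a flow's input representation (for example, a packet-size time series) without changing its label. The sketch below shows two generic transforms, Gaussian jitter and random masking; the 14 augmentations evaluated in the paper are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(pkt_sizes, sigma=10.0):
    """Add Gaussian noise to packet sizes, clipped to a valid byte range."""
    noisy = pkt_sizes + rng.normal(0.0, sigma, size=pkt_sizes.shape)
    return np.clip(noisy, 0, 1500)

def random_mask(pkt_sizes, p=0.1):
    """Zero out a random fraction of packets, mimicking drops or padding."""
    mask = rng.random(pkt_sizes.shape) >= p
    return pkt_sizes * mask

flow = np.array([1500, 1500, 60, 1200, 60, 40, 1500, 800], dtype=float)
augmented = random_mask(jitter(flow))    # the flow's class label is unchanged
print(augmented)
```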

Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation

  • paper_url: http://arxiv.org/abs/2310.13923
  • repo_url: https://github.com/zfancy/divoe
  • paper_authors: Jianing Zhu, Geng Yu, Jiangchao Yao, Tongliang Liu, Gang Niu, Masashi Sugiyama, Bo Han
  • for: This work aims to improve the reliability of machine learning models in real-world deployment through out-of-distribution (OOD) detection.
  • methods: It proposes Diversified Outlier Exposure (DivOE), a framework that performs informative extrapolation from the given auxiliary outliers, introducing a new learning objective that explicitly synthesizes more informative outliers during training via a multi-step optimization method.
  • results: By generating outliers beyond the original auxiliary set, DivOE better covers the boundary between ID and OOD data, and extensive experiments demonstrate its effectiveness for OOD detection.
    Abstract Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications. Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning model with informatively sampled auxiliary outliers. However, previous methods assume that the collected outliers can be sufficiently large and representative to cover the boundary between ID and OOD data, which might be impractical and challenging. In this work, we propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers. Specifically, DivOE introduces a new learning objective, which diversifies the auxiliary distribution by explicitly synthesizing more informative outliers for extrapolation during training. It leverages a multi-step optimization method to generate novel outliers beyond the original ones, which is compatible with many variants of outlier exposure. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed DivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE.
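The standard outlier exposure objective that DivOE builds on adds, to the usual cross-entropy on in-distribution data, a term that pushes the model's predictive distribution on auxiliary outliers toward uniform. The sketch below implements that baseline objective; DivOE's multi-step synthesis of new outliers beyond the auxiliary set is not shown.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(model, x_id, y_id, x_outlier, lam=0.5):
    """Cross-entropy on ID data + cross-entropy to uniform on auxiliary outliers."""
    id_loss = F.cross_entropy(model(x_id), y_id)
    logits_out = model(x_outlier)
    # Encourage a uniform predictive distribution on outliers.
    uniform_term = -(F.log_softmax(logits_out, dim=1)).mean()
    return id_loss + lam * uniform_term

# Usage sketch with a toy classifier over 10 classes.
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x_id, y_id = torch.randn(16, 32), torch.randint(0, 10, (16,))
x_out = torch.randn(16, 32)                 # auxiliary outliers (no labels)
loss = outlier_exposure_loss(model, x_id, y_id, x_out)
loss.backward()
```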

Equivariant Map and Agent Geometry for Autonomous Driving Motion Prediction

  • paper_url: http://arxiv.org/abs/2310.13922
  • repo_url: None
  • paper_authors: Yuping Wang, Jier Chen
  • for: This research targets deep-learning-based motion prediction for autonomous driving, in particular ensuring equivariance of predicted motion under Euclidean transformations and invariance of agent interactions.
  • methods: It adopts EqMotion, a theoretically geometrically equivariant and interaction-invariant motion prediction model for particles and humans, as the backbone, and integrates agent-equivariant high-definition (HD) map features through an equivariant map-processing method that enriches spatial understanding while preserving the network's overall equivariance.
  • results: The model achieves high prediction accuracy while maintaining a lightweight design and efficient data utilization.
    Abstract In autonomous driving, deep learning enabled motion prediction is a popular topic. A critical gap in traditional motion prediction methodologies lies in ensuring equivariance under Euclidean geometric transformations and maintaining invariant interaction relationships. This research introduces a groundbreaking solution by employing EqMotion, a theoretically geometric equivariant and interaction invariant motion prediction model for particles and humans, plus integrating agent-equivariant high-definition (HD) map features for context aware motion prediction in autonomous driving. The use of EqMotion as backbone marks a significant departure from existing methods by rigorously ensuring motion equivariance and interaction invariance. Equivariance here implies that an output motion must be equally transformed under the same Euclidean transformation as an input motion, while interaction invariance preserves the manner in which agents interact despite transformations. These properties make the network robust to arbitrary Euclidean transformations and contribute to more accurate prediction. In addition, we introduce an equivariant method to process the HD map to enrich the spatial understanding of the network while preserving the overall network equivariance property. By applying these technologies, our model is able to achieve high prediction accuracy while maintain a lightweight design and efficient data utilization.

Southern Ocean Dynamics Under Climate Change: New Knowledge Through Physics-Guided Machine Learning

  • paper_url: http://arxiv.org/abs/2310.13916
  • repo_url: https://github.com/yikwill/THOR-MOM6
  • paper_authors: William Yik, Maike Sonnewald, Mariana C. A. Clare, Redouane Lguensat
  • for: This work seeks to understand climate-change-driven shifts in the Antarctic Circumpolar Current and the physical processes behind them.
  • methods: The Tracking global Heating with Ocean Regimes (THOR) method is extended to a mesoscale-eddy-permitting climate model: grid cells are clustered into dynamical regimes characterized by similar physics, and an ensemble of neural networks is trained to predict these regimes and track them under climate change.
  • results: In the region where the Antarctic Circumpolar Current interacts with the Pacific-Antarctic Ridge, THOR reveals a regime shift under climate change driven by changes in wind stress and interactions with bathymetry: as the current shifts north under intensifying wind stress, the dominant dynamical role of bathymetry weakens and the flow strengthens.
    Abstract Complex ocean systems such as the Antarctic Circumpolar Current play key roles in the climate, and current models predict shifts in their strength and area under climate change. However, the physical processes underlying these changes are not well understood, in part due to the difficulty of characterizing and tracking changes in ocean physics in complex models. To understand changes in the Antarctic Circumpolar Current, we extend the method Tracking global Heating with Ocean Regimes (THOR) to a mesoscale eddy permitting climate model and identify regions of the ocean characterized by similar physics, called dynamical regimes, using readily accessible fields from climate models. To this end, we cluster grid cells into dynamical regimes and train an ensemble of neural networks to predict these regimes and track them under climate change. Finally, we leverage this new knowledge to elucidate the dynamics of regime shifts. Here we illustrate the value of this high-resolution version of THOR, which allows for mesoscale turbulence, with a case study of the Antarctic Circumpolar Current and its interactions with the Pacific-Antarctic Ridge. In this region, THOR specifically reveals a shift in dynamical regime under climate change driven by changes in wind stress and interactions with bathymetry. Using this knowledge to guide further exploration, we find that as the Antarctic Circumpolar Current shifts north under intensifying wind stress, the dominant dynamical role of bathymetry weakens and the flow strengthens.
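The cluster-then-classify pipeline described above can be sketched generically: unsupervised clustering assigns each grid cell to a dynamical regime from a few physical fields, and an ensemble of classifiers learns to predict the regime from those fields so the mapping can be applied to other climate states. The field names, cluster count, and models below are illustrative stand-ins, not the THOR configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for per-grid-cell fields (e.g., wind stress curl, depth, SSH gradient).
n_cells = 5000
fields = rng.normal(size=(n_cells, 3))

# Step 1: cluster grid cells into dynamical regimes.
regimes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(fields)

# Step 2: train an ensemble of classifiers to predict the regime from the fields,
# so the mapping can be applied to projected future states.
ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=s).fit(fields, regimes)
    for s in range(3)
]

future_fields = fields + rng.normal(scale=0.2, size=fields.shape)   # perturbed "future" state
votes = np.stack([m.predict(future_fields) for m in ensemble])
future_regimes = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(np.bincount(future_regimes))
```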

Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models

  • paper_url: http://arxiv.org/abs/2310.13913
  • repo_url: None
  • paper_authors: Lihang Liu, Donglong He, Xianbin Ye, Shanzhuo Zhang, Xiaonan Zhang, Jingbo Zhou, Jun Li, Hua Chai, Fan Wang, Jingzhou He, Liang Zheng, Yonghui Li, Xiaomin Fang
  • for: This paper aims to improve the accuracy of protein-ligand binding pose prediction to advance AI-driven drug discovery.
  • methods: Deep learning is used for site-specific molecular docking: HelixDock, an SE(3)-equivariant network, is pre-trained on hundreds of millions of binding poses generated by traditional docking tools across diverse protein targets and small molecules, and then fine-tuned on a small number of precise receptor-ligand complex structures.
  • results: Comparative analyses against physics-based and deep-learning-based baseline methods highlight HelixDock's superiority, especially on challenging test sets, and the study shows consistent improvements with increased model parameters and pre-training data quantities.
    Abstract Molecular docking, a pivotal computational tool for drug discovery, predicts the binding interactions between small molecules (ligands) and target proteins (receptors). Conventional physics-based docking tools, though widely used, face limitations in precision due to restricted conformational sampling and imprecise scoring functions. Recent endeavors have employed deep learning techniques to enhance docking accuracy, but their generalization remains a concern due to limited training data. Leveraging the success of extensive and diverse data in other domains, we introduce HelixDock, a novel approach for site-specific molecular docking. Hundreds of millions of binding poses are generated by traditional docking tools, encompassing diverse protein targets and small molecules. Our deep learning-based docking model, a SE(3)-equivariant network, is pre-trained with this large-scale dataset and then fine-tuned with a small number of precise receptor-ligand complex structures. Comparative analyses against physics-based and deep learning-based baseline methods highlight HelixDock's superiority, especially on challenging test sets. Our study elucidates the scaling laws of the pre-trained molecular docking models, showcasing consistent improvements with increased model parameters and pre-train data quantities. Harnessing the power of extensive and diverse generated data holds promise for advancing AI-driven drug discovery.

Towards Hyperparameter-Agnostic DNN Training via Dynamical System Insights

  • paper_url: http://arxiv.org/abs/2310.13901
  • repo_url: None
  • paper_authors: Carmel Fiscko, Aayushya Agarwal, Yihan Ruan, Soummya Kar, Larry Pileggi, Bruno Sinopoli
  • for: ECCO-DNN is designed to optimize deep neural network training.
  • methods: ECCO-DNN is a stochastic first-order optimization method that models the optimization variable trajectory as a dynamical system and adaptively selects step sizes based on the trajectory's shape.
  • results: ECCO-DNN achieves performance comparable to state-of-the-art optimizers including ADAM, SGD, RMSProp, and AdaGrad, and its single hyperparameter can be changed by three orders of magnitude without affecting the trained models' accuracies. This insensitivity to hyperparameter variations reduces the data and computation needed for hyperparameter tuning, making ECCO-DNN advantageous for rapid prototyping and for applications with new datasets.
    Abstract We present a stochastic first-order optimization method specialized for deep neural networks (DNNs), ECCO-DNN. This method models the optimization variable trajectory as a dynamical system and develops a discretization algorithm that adaptively selects step sizes based on the trajectory's shape. This provides two key insights: designing the dynamical system for fast continuous-time convergence and developing a time-stepping algorithm to adaptively select step sizes based on principles of numerical integration and neural network structure. The result is an optimizer with performance that is insensitive to hyperparameter variations and that achieves comparable performance to state-of-the-art optimizers including ADAM, SGD, RMSProp, and AdaGrad. We demonstrate this in training DNN models and datasets, including CIFAR-10 and CIFAR-100 using ECCO-DNN and find that ECCO-DNN's single hyperparameter can be changed by three orders of magnitude without affecting the trained models' accuracies. ECCO-DNN's insensitivity reduces the data and computation needed for hyperparameter tuning, making it advantageous for rapid prototyping and for applications with new datasets. To validate the efficacy of our proposed optimizer, we train an LSTM architecture on a household power consumption dataset with ECCO-DNN and achieve an optimal mean-square-error without tuning hyperparameters.

Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages

  • paper_url: http://arxiv.org/abs/2310.13897
  • repo_url: None
  • paper_authors: Dana Angluin, David Chiang, Andy Yang
  • for: This paper studies the expressive power and logical limits of transformer encoders.
  • methods: It considers transformer encoders with hard attention (all attention focused on exactly one position) and strict future masking (each position attends only to positions strictly to its left), and proves that the class of languages recognized by such networks is exactly the star-free languages; adding position embeddings extends the recognized class to other well-studied classes. A key technique is Boolean RASP, a variant of RASP restricted to Boolean values.
  • results: Via the star-free languages, the paper relates transformers to first-order logic, temporal logic, and algebraic automata theory.
    Abstract We consider transformer encoders with hard attention (in which all attention is focused on exactly one position) and strict future masking (in which each position only attends to positions strictly to its left), and prove that the class of languages recognized by these networks is exactly the star-free languages. Adding position embeddings increases the class of recognized languages to other well-studied classes. A key technique in these proofs is Boolean RASP, a variant of RASP that is restricted to Boolean values. Via the star-free languages, we relate transformers to first-order logic, temporal logic, and algebraic automata theory.
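The sketch below runs one layer of the attention variant defined above on a toy sequence: positions at or to the right of the query are masked out (strict future masking) and each position attends to exactly one earlier position via argmax (hard attention). It is a numerical illustration of the mechanism, not the paper's construction for star-free languages.

```python
import numpy as np

def masked_hard_attention(values, scores):
    """One hard-attention layer with strict future masking.

    values : (n, d) array of per-position values.
    scores : (n, n) attention scores; scores[i, j] is how much position i wants j.
    Each position i attends to exactly one position j < i (argmax over the
    unmasked scores); position 0 has nothing to its left and keeps a zero vector.
    """
    n, d = values.shape
    out = np.zeros((n, d))
    for i in range(1, n):
        allowed = scores[i, :i]              # strictly to the left of position i
        j = int(np.argmax(allowed))          # hard attention: exactly one position
        out[i] = values[j]
    return out

# Toy example with Boolean-valued per-position features.
rng = np.random.default_rng(0)
vals = rng.integers(0, 2, size=(6, 3)).astype(float)
scr = rng.normal(size=(6, 6))
print(masked_hard_attention(vals, scr))
```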

Specify Robust Causal Representation from Mixed Observations

  • paper_url: http://arxiv.org/abs/2310.13892
  • repo_url: https://github.com/ymy4323460/cari
  • paper_authors: Mengyue Yang, Xinyu Cai, Furui Liu, Weinan Zhang, Jun Wang
  • for: This work studies learning a low-dimensional, compact representation from observational data that improves the robustness and generalization of prediction models.
  • methods: Under the hypothesis that the intrinsic latent factors follow a causal generative model, the learning procedure is regularized with mutual information measures according to the hypothetical factored causal graph, so that the learned causal representation captures the minimal sufficient causes of the whole system.
  • results: Theoretical and empirical results show that models trained with the learned causal representations are more robust under adversarial attacks and distribution shifts than baselines.
    Abstract Learning representations purely from observations concerns the problem of learning a low-dimensional, compact representation which is beneficial to prediction models. Under the hypothesis that the intrinsic latent factors follow some casual generative models, we argue that by learning a causal representation, which is the minimal sufficient causes of the whole system, we can improve the robustness and generalization performance of machine learning models. In this paper, we develop a learning method to learn such representation from observational data by regularizing the learning procedure with mutual information measures, according to the hypothetical factored causal graph. We theoretically and empirically show that the models trained with the learned causal representations are more robust under adversarial attacks and distribution shifts compared with baselines. The supplementary materials are available at https://github.com/ymy $4323460 / \mathrm{CaRI} /$.

Towards a General Framework for Continual Learning with Pre-training

  • paper_url: http://arxiv.org/abs/2310.13888
  • repo_url: https://github.com/thu-ml/hide-prompt
  • paper_authors: Liyuan Wang, Jingyi Xie, Xingxing Zhang, Hang Su, Jun Zhu
  • for: This work presents a general framework for continual learning of sequentially arriving tasks with pre-training, a promising direction for artificial intelligence systems that must accommodate real-world dynamics.
  • methods: The objective is theoretically decomposed into three hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. A novel approach is then proposed to explicitly optimize these components with parameter-efficient fine-tuning (PEFT) techniques and representation statistics.
  • results: Experiments demonstrate the superiority and generality of the approach in downstream continual learning and further explore the applicability of PEFT techniques in upstream continual learning; the biological basis of the framework is also discussed in light of recent advances in neuroscience.
    Abstract In this work, we present a general framework for continual learning of sequentially arrived tasks with the use of pre-training, which has emerged as a promising direction for artificial intelligence systems to accommodate real-world dynamics. From a theoretical perspective, we decompose its objective into three hierarchical components, including within-task prediction, task-identity inference, and task-adaptive prediction. Then we propose an innovative approach to explicitly optimize these components with parameter-efficient fine-tuning (PEFT) techniques and representation statistics. We empirically demonstrate the superiority and generality of our approach in downstream continual learning, and further explore the applicability of PEFT techniques in upstream continual learning. We also discuss the biological basis of the proposed framework with recent advances in neuroscience.
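Parameter-efficient fine-tuning of the kind referenced above typically freezes a pre-trained backbone and trains only a small set of new parameters per task, such as prompts or adapters. The sketch below attaches a learnable prompt and head to a frozen backbone; the layer sizes and prompt mechanism are generic illustrations, not the HiDe-Prompt method from the linked repository.

```python
import torch
import torch.nn as nn

class PromptedClassifier(nn.Module):
    """Frozen pre-trained backbone plus a small trainable prompt and head per task."""
    def __init__(self, backbone, feat_dim=128, prompt_dim=16, n_classes=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # PEFT: the backbone stays frozen
            p.requires_grad_(False)
        self.prompt = nn.Parameter(torch.zeros(prompt_dim))   # task-specific prompt
        self.head = nn.Linear(feat_dim + prompt_dim, n_classes)

    def forward(self, x):
        feats = self.backbone(x)
        prompt = self.prompt.expand(x.shape[0], -1)
        return self.head(torch.cat([feats, prompt], dim=1))

backbone = nn.Sequential(nn.Linear(32, 128), nn.ReLU())      # stand-in for a pre-trained model
model = PromptedClassifier(backbone)
logits = model(torch.randn(4, 32))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)         # only ['prompt', 'head.weight', 'head.bias'] are updated
```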

Optimal Transport-based Nonlinear Filtering in High-dimensional Settings

  • paper_url: http://arxiv.org/abs/2310.13886
  • repo_url: None
  • paper_authors: Mohammad Al-Jarrah, Niyizhen Jin, Bamdad Hosseini, Amirhossein Taghvaei
  • for: This paper addresses the nonlinear filtering problem, i.e., computing the conditional distribution of the state of a stochastic dynamical system given a history of noisy partial observations.
  • methods: The proposed method builds on an optimal transport interpretation of nonlinear filtering, leading to a simulation-based and likelihood-free algorithm that estimates the Brenier optimal transport map from the current distribution of the state to the distribution at the next time step, using neural networks to model complex, multi-modal distributions and stochastic optimization for scalability.
  • results: The method outperforms the SIR particle filter and the ensemble Kalman filter in sample efficiency, scalability to high dimensions, and the ability to capture complex and multi-modal distributions.
    Abstract This paper addresses the problem of nonlinear filtering, i.e., computing the conditional distribution of the state of a stochastic dynamical system given a history of noisy partial observations. The primary focus is on scenarios involving degenerate likelihoods or high-dimensional states, where traditional sequential importance resampling (SIR) particle filters face the weight degeneracy issue. Our proposed method builds on an optimal transport interpretation of nonlinear filtering, leading to a simulation-based and likelihood-free algorithm that estimates the Brenier optimal transport map from the current distribution of the state to the distribution at the next time step. Our formulation allows us to harness the approximation power of neural networks to model complex and multi-modal distributions and employ stochastic optimization algorithms to enhance scalability. Extensive numerical experiments are presented that compare our method to the SIR particle filter and the ensemble Kalman filter, demonstrating the superior performance of our method in terms of sample efficiency, high-dimensional scalability, and the ability to capture complex and multi-modal distributions.

Fast Approximation of Similarity Graphs with Kernel Density Estimation

  • paper_url: http://arxiv.org/abs/2310.13870
  • repo_url: https://github.com/pmacg/kde-similarity-graph
  • paper_authors: Peter Macgregor, He Sun
  • for: Constructing a sparse similarity graph from a set of data points, the first step of many modern clustering algorithms.
  • methods: Building on the kernel density estimation problem, a new algorithmic framework is proposed that quickly constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure, and is applicable to arbitrary kernel functions.
  • results: The method significantly outperforms the well-known implementations from the scikit-learn and FAISS libraries on a variety of datasets.
    Abstract Constructing a similarity graph from a set $X$ of data points in $\mathbb{R}^d$ is the first step of many modern clustering algorithms. However, typical constructions of a similarity graph have high time complexity, and a quadratic space dependency with respect to $|X|$. We address this limitation and present a new algorithmic framework that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure. Our presented algorithm is based on the kernel density estimation problem, and is applicable for arbitrary kernel functions. We compare our designed algorithm with the well-known implementations from the scikit-learn library and the FAISS library, and find that our method significantly outperforms the implementation from both libraries on a variety of datasets.
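For context, the sketch below builds the kind of similarity graph the paper approximates, using the scikit-learn baseline it is compared against: a symmetrized k-nearest-neighbour graph followed by spectral clustering. The neighbour count and dataset are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

X, y = make_moons(n_samples=1000, noise=0.05, random_state=0)

# Baseline similarity graph: symmetrized k-NN connectivity graph (what the
# paper's fast KDE-based construction approximates, but built exactly here).
knn = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (knn + knn.T)                     # symmetrize to get an undirected graph

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
print("cluster sizes:", np.bincount(labels))
```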

Distributionally Robust Optimization with Bias and Variance Reduction

  • paper_url: http://arxiv.org/abs/2310.13863
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Ronak Mehta, Vincent Roulet, Krishna Pillutla, Zaid Harchaoui
  • for: This work targets the distributionally robust optimization (DRO) problem with a spectral-risk-based uncertainty set and an $f$-divergence penalty, a formulation that includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and the average top-$k$ loss.
  • methods: It presents Prospect, a stochastic gradient-based algorithm that requires tuning only a single learning-rate hyperparameter and is proved to enjoy linear convergence for smooth regularized losses, in contrast to previous algorithms that require tuning multiple hyperparameters or may fail to converge due to biased gradient estimates or inadequate regularization.
  • results: Empirically, Prospect converges 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
    Abstract We consider the distributionally robust optimization (DRO) problem with spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized condition value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
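Spectral risk objectives of the kind named above reweight the sorted per-example losses. The sketch below computes two special cases on a batch of losses, the average top-$k$ loss and CVaR at level $\alpha$, as reweightings of the largest losses; it illustrates the objectives only, not the Prospect algorithm.

```python
import torch

def average_top_k(losses, k):
    """Mean of the k largest per-example losses."""
    return torch.topk(losses, k).values.mean()

def cvar(losses, alpha=0.9):
    """Conditional value-at-risk: mean of the worst (1 - alpha) fraction of losses."""
    n = losses.numel()
    k = max(1, int(round((1.0 - alpha) * n)))
    return torch.topk(losses, k).values.mean()

losses = torch.tensor([0.2, 1.5, 0.1, 3.0, 0.7, 0.4, 2.2, 0.9])
print(average_top_k(losses, k=3))   # mean of {3.0, 2.2, 1.5}
print(cvar(losses, alpha=0.75))     # mean of the worst 25% (2 losses)
```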