cs.LG - 2023-11-03

Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos

  • paper_url: http://arxiv.org/abs/2311.02076
  • repo_url: None
  • paper_authors: Dayal Singh Kalra, Tianyu He, Maissam Barkeshli
  • for: Studies the dynamics of sharpness (the top Hessian eigenvalue of the loss) and its interplay with the learning rate during gradient descent training of neural networks.
  • methods: Trains a simple 2-layer linear network (the UV model) on a single training example and analyzes the fixed-point structure in function space and the vector field of function updates to uncover the mechanisms behind the observed sharpness trends (a toy sharpness-tracking sketch follows the abstract).
  • results: Identifies the mechanisms behind early sharpness reduction and progressive sharpening, the conditions required for edge of stability, and a period-doubling route to chaos on the edge of stability manifold as the learning rate is increased; several predictions from the simplified model generalize to real-world scenarios.
    Abstract In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.
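
The UV model is small enough that its sharpness can be tracked exactly. Below is a minimal scalar sketch (assuming squared loss 0.5(uv - y)^2 on a single target, a simplification of the paper's setup) that runs gradient descent and records the top Hessian eigenvalue at every step.

```python
import numpy as np

def hessian(u, v, y):
    # Hessian of L(u, v) = 0.5 * (u*v - y)**2, derived by hand:
    # dL/du = (uv - y) * v,  dL/dv = (uv - y) * u.
    return np.array([[v ** 2, 2 * u * v - y],
                     [2 * u * v - y, u ** 2]])

def sharpness(u, v, y):
    # Sharpness = top eigenvalue of the loss Hessian.
    return np.linalg.eigvalsh(hessian(u, v, y)).max()

def run_gd(u, v, y, lr, steps):
    trace = []
    for _ in range(steps):
        r = u * v - y
        u, v = u - lr * r * v, v - lr * r * u   # simultaneous GD update
        trace.append(sharpness(u, v, y))
    return trace

# Varying the learning rate and the initialization scale in this toy setting
# lets one probe the regimes discussed in the paper (sharpness reduction,
# progressive sharpening, edge of stability).
trace = run_gd(u=2.5, v=0.05, y=1.0, lr=0.2, steps=300)
print(round(trace[0], 3), round(trace[-1], 3))
```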

Active Learning-Based Species Range Estimation

  • paper_url: http://arxiv.org/abs/2311.02061
  • repo_url: https://github.com/chris-lange/sdm_active_sampling
  • paper_authors: Christian Lange, Elijah Cole, Grant Van Horn, Oisin Mac Aodha
  • for: Proposes a new active learning approach for efficiently estimating a species' geographic range from a limited number of on-the-ground observations.
  • methods: Models the range of an unmapped species as a weighted combination of candidate ranges produced by models trained on large, weakly supervised community-collected observation data, then uses a new active querying strategy that sequentially selects the geographic locations to visit that best reduce uncertainty over the unmapped species' range (a toy querying loop is sketched after the abstract).
  • results: A detailed evaluation against existing active learning methods on an evaluation dataset with expert-derived ranges for one thousand species shows that the method outperforms the alternatives and approaches the accuracy of end-to-end trained models while using only a fraction of the data, highlighting the utility of active learning via transfer-learned spatial representations and of leveraging emerging large-scale crowdsourced datasets for actively discovering species ranges.
    Abstract We propose a new active learning approach for efficiently estimating the geographic range of a species from a limited number of on the ground observations. We model the range of an unmapped species of interest as the weighted combination of estimated ranges obtained from a set of different species. We show that it is possible to generate this candidate set of ranges by using models that have been trained on large weakly supervised community collected observation data. From this, we develop a new active querying approach that sequentially selects geographic locations to visit that best reduce our uncertainty over an unmapped species' range. We conduct a detailed evaluation of our approach and compare it to existing active learning methods using an evaluation dataset containing expert-derived ranges for one thousand species. Our results demonstrate that our method outperforms alternative active learning methods and approaches the performance of end-to-end trained models, even when only using a fraction of the data. This highlights the utility of active learning via transfer learned spatial representations for species range estimation. It also emphasizes the value of leveraging emerging large-scale crowdsourced datasets, not only for modeling a species' range, but also for actively discovering them.
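
A toy version of the querying loop described above: the unmapped species' range is modeled as a weighted mixture of candidate ranges, the next location to visit is the one with the highest predictive uncertainty, and the mixture weights are updated after each observation. The random candidate ranges, the entropy-based acquisition rule, and the Bernoulli update below are generic stand-ins, not the paper's exact models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: K candidate ranges over G grid cells, each giving a
# per-cell presence probability (in the paper these come from models trained
# on large weakly supervised community-collected data).
K, G = 20, 500
candidates = rng.uniform(0.05, 0.95, size=(K, G))
true_range = candidates[3]       # pretend the unmapped species matches candidate 3
weights = np.ones(K) / K         # uniform prior over candidates
visited = np.zeros(G, dtype=bool)

for step in range(25):
    p = weights @ candidates     # mixture prediction of presence per cell
    # Query the unvisited cell with the largest predictive (binary) entropy.
    ent = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    ent[visited] = -np.inf
    cell = int(np.argmax(ent))
    visited[cell] = True
    obs = rng.random() < true_range[cell]          # simulated field visit
    # Bayesian update of the candidate weights with a Bernoulli likelihood.
    lik = candidates[:, cell] if obs else 1.0 - candidates[:, cell]
    weights = weights * lik
    weights /= weights.sum()

print("posterior mass on the matching candidate:", round(float(weights[3]), 3))
```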

Reproducible Parameter Inference Using Bagged Posteriors

  • paper_url: http://arxiv.org/abs/2311.02019
  • repo_url: None
  • paper_authors: Jonathan H. Huggins, Jeffrey W. Miller
  • for: Addresses the failure of Bayesian posteriors to properly quantify uncertainty under model misspecification and proposes a criterion for reproducible uncertainty quantification.
  • methods: Applies bagging to the Bayesian posterior ("BayesBag"), i.e. averages posterior distributions conditioned on bootstrapped datasets, motivated from first principles via Jeffrey conditionalization (a minimal conjugate-model sketch follows the abstract).
  • results: Shows that BayesBag typically satisfies the overlap lower bound and proves a Bernstein-Von Mises theorem establishing its asymptotic normal distribution; simulation experiments and an application to crime rate prediction demonstrate its benefits.
    Abstract Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"'); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.
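
BayesBag averages posteriors computed on bootstrap resamples of the data. Below is a minimal sketch for a conjugate normal-mean model, chosen here only because its posterior has a closed form; the data-generating process and the specific prior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data drawn with a larger spread than the model assumes (misspecification).
data = rng.standard_normal(200) * 2.0 + 1.0

# Model: x ~ N(theta, 1) with prior theta ~ N(0, 10^2) (known-variance conjugate model).
prior_var, obs_var = 10.0 ** 2, 1.0

def posterior_params(x):
    n = len(x)
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (x.sum() / obs_var)
    return post_mean, post_var

def bayesbag_samples(x, bootstraps=50, draws=200):
    # BayesBag: average the posteriors obtained on bootstrap resamples of the data,
    # represented here by pooling samples drawn from each bootstrap posterior.
    samples = []
    for _ in range(bootstraps):
        xb = rng.choice(x, size=len(x), replace=True)
        m, v = posterior_params(xb)
        samples.append(rng.normal(m, np.sqrt(v), size=draws))
    return np.concatenate(samples)

m, v = posterior_params(data)
bagged = bayesbag_samples(data)
print("standard posterior sd:", round(float(np.sqrt(v)), 4))
print("bagged posterior sd:  ", round(float(bagged.std()), 4))  # typically wider under misspecification
```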

A Variational Perspective on High-Resolution ODEs

  • paper_url: http://arxiv.org/abs/2311.02002
  • repo_url: None
  • paper_authors: Hoomaan Maskan, Konstantinos C. Zygalakis, Alp Yurtsever
  • for: Considers unconstrained minimization of smooth convex functions.
  • methods: Proposes a novel variational perspective based on the forced Euler-Lagrange equation that allows high-resolution ODEs to be studied, yielding a faster convergence rate for gradient-norm minimization with Nesterov's accelerated gradient method.
  • results: Shows that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE, and derives a stochastic method for noisy gradients that compares favorably with state-of-the-art methods in numerical experiments.
    Abstract We consider unconstrained minimization of smooth convex functions. We propose a novel variational perspective using forced Euler-Lagrange equation that allows for studying high-resolution ODEs. Through this, we obtain a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method. Additionally, we show that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE. Finally, using the results from the new variational perspective, we propose a stochastic method for noisy gradients. Several numerical experiments compare and illustrate our stochastic algorithm with state of the art methods.

High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise

  • paper_url: http://arxiv.org/abs/2311.02000
  • repo_url: None
  • paper_authors: Yusu Hong, Junhong Lin
  • for: Studies the convergence of the Adam algorithm for unconstrained non-convex smooth stochastic optimization. Despite Adam's wide use in machine learning, its theory remains limited: prior work mostly analyzed convergence in expectation and often required uniformly bounded stochastic gradients or problem-dependent prior knowledge, which constrains applicability in practice.
  • methods: Provides a high-probability analysis under coordinate-wise affine variance noise, without any bounded-gradient assumption and without problem-dependent prior knowledge for tuning hyper-parameters; also studies a simplified version of Adam with one of the corrective terms removed (a minimal Adam sketch follows the abstract).
  • results: Shows that Adam converges to a stationary point in high probability at rate $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$, that it confines the gradient magnitudes within $\mathcal{O}\left({\rm poly}(\log T)\right)$, and that the simplified variant attains a convergence rate adaptive to the noise level.
    Abstract In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm under unconstrained non-convex smooth stochastic optimizations. Despite the widespread usage in machine learning areas, its theoretical properties remain limited. Prior researches primarily investigated Adam's convergence from an expectation view, often necessitating strong assumptions like uniformly stochastic bounded gradients or problem-dependent knowledge in prior. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, not requiring any bounded gradient assumption and any problem-dependent knowledge in prior to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
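
For reference, the sketch below is a plain NumPy Adam update with a switch that drops the bias-correction terms. Which corrective term the paper's simplified variant removes is not specified here, so the switch and the affine-variance-style gradient noise are purely illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              bias_correction=True):
    # One Adam update; bias_correction=False gives a simplified variant without
    # the corrective terms (illustrative only).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t) if bias_correction else m
    v_hat = v / (1 - b2 ** t) if bias_correction else v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = 0.5 * ||x||^2 with noisy gradients whose standard deviation
# grows affinely with the gradient norm (affine-variance-style noise).
rng = np.random.default_rng(0)
x = np.ones(10)
m = np.zeros_like(x); v = np.zeros_like(x)
for t in range(1, 2001):
    g = x + 0.1 * (1 + np.linalg.norm(x)) * rng.standard_normal(x.shape)
    x, m, v = adam_step(x, g, m, v, t)
print("final gradient norm:", round(float(np.linalg.norm(x)), 4))
```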

Conditions on Preference Relations that Guarantee the Existence of Optimal Policies

  • paper_url: http://arxiv.org/abs/2311.01990
  • repo_url: None
  • paper_authors: Jonathan Colaco Carr, Prakash Panangaden, Doina Precup
  • for: Addresses the gap between the theory and application of Learning from Preferential Feedback (LfPF) algorithms, specifically in partially observable, non-Markovian environments.
  • methods: Introduces the Direct Preference Process, a new framework for analyzing LfPF problems, and uses the von Neumann-Morgenstern Expected Utility Theorem to establish conditions for the existence of optimal policies.
  • results: Shows that the Direct Preference Process generalizes the standard reinforcement learning problem and provides future practitioners with the tools necessary for a more principled design of LfPF agents, narrowing the gap between empirical success and theoretical understanding.
    Abstract Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.

Latent Diffusion Model for Conditional Reservoir Facies Generation

  • paper_url: http://arxiv.org/abs/2311.01968
  • repo_url: None
  • paper_authors: Daesoo Lee, Oscar Ovanger, Jo Eidsvik, Erlend Aune, Jacob Skauvold, Ragnar Hauge
  • for: used to generate high-fidelity reservoir facies realizations that preserve conditioning data
  • methods: uses a novel Latent Diffusion Model that leverages the superiority of diffusion models over GANs
  • results: significantly outperforms a GAN-based alternative in generating realistic reservoir facies
    Abstract Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.

Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products

  • paper_url: http://arxiv.org/abs/2311.01960
  • repo_url: None
  • paper_authors: Tamas Sarlos, Xingyou Song, David Woodruff, Qiuyi Zhang
  • for: Studies low rank approximation in the entrywise transformed setting, where the goal is to find a good rank-$k$ approximation to $f(U \cdot V)$, with $U, V^\top \in \mathbb{R}^{n \times r}$ given, $r = O(\log(n))$, and $f(x)$ a general scalar function.
  • methods: Builds on prior work in sublinear low rank approximation and gives the first conditional time hardness results, showing that conditions (1) $U = V^\top$ and (2) $f$ being a PSD kernel function are both necessary to obtain a relative-error low rank approximation in better than $n^{2-o(1)}$ time for a wide class of functions.
  • results: Gives novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors, obtains runtime lower bounds of the form $\Omega(\min(n^{2-o(1)}, 2^p))$ when $U \neq V^\top$ even for the simple polynomial $f(x) = x^p$, and shows the lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative-error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive-error approximation based on fast tensor sketching.
    Abstract Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.

Optimistic Multi-Agent Policy Gradient for Cooperative Tasks

  • paper_url: http://arxiv.org/abs/2311.01953
  • repo_url: None
  • paper_authors: Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen
  • for: Tackles relative overgeneralization (RO) in cooperative multi-agent learning, where agents converge to a suboptimal joint policy by overfitting to other agents' suboptimal behavior, a problem that is aggravated when function approximation is used.
  • methods: Proposes a general yet simple framework that enables optimistic updates in multi-agent policy gradient methods by reshaping the advantages with a Leaky ReLU function, where a single hyperparameter selects the degree of optimism (a minimal advantage-reshaping sketch follows the abstract).
  • results: Extensive evaluations on diverse tasks, including matrix games, Multi-agent MuJoCo and Overcooked, show that the method outperforms strong baselines on 13 of 19 tested tasks and matches their performance on the remaining 6.
    Abstract \textit{Relative overgeneralization} (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. In early work, optimism has been shown to mitigate the \textit{RO} problem when using tabular Q-learning. However, with function approximation optimism can amplify overestimation and thus fail on complex tasks. On the other hand, recent deep multi-agent policy gradient (MAPG) methods have succeeded in many complex tasks but may fail with severe \textit{RO}. We propose a general, yet simple, framework to enable optimistic updates in MAPG methods and alleviate the RO problem. Specifically, we employ a \textit{Leaky ReLU} function where a single hyperparameter selects the degree of optimism to reshape the advantages when updating the policy. Intuitively, our method remains optimistic toward individual actions with lower returns which are potentially caused by other agents' sub-optimal behavior during learning. The optimism prevents the individual agents from quickly converging to a local optimum. We also provide a formal analysis from an operator view to understand the proposed advantage transformation. In extensive evaluations on diverse sets of tasks, including illustrative matrix games, complex \textit{Multi-agent MuJoCo} and \textit{Overcooked} benchmarks, the proposed method\footnote{Code can be found at \url{https://github.com/wenshuaizhao/optimappo}.} outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
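
The advantage reshaping is the concrete mechanism here and is easy to state in code. The sketch below (with a hypothetical `optimism` slope parameter) applies the Leaky-ReLU transform to a batch of advantages before they enter a standard policy-gradient loss.

```python
import numpy as np

def optimistic_advantage(adv, optimism=0.1):
    # Leaky-ReLU reshaping of advantages: negative advantages are down-weighted
    # by the slope `optimism`, so an agent stays optimistic about actions whose
    # low return may be caused by other agents' suboptimal behavior during learning.
    return np.where(adv >= 0, adv, optimism * adv)

adv = np.array([1.5, -2.0, 0.3, -0.7])
print(optimistic_advantage(adv))   # approx. [1.5, -0.2, 0.3, -0.07]
# The reshaped advantages would then be used in the usual policy-gradient loss,
# e.g. -(log_prob * optimistic_advantage(adv)).mean().
```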

ForecastPFN: Synthetically-Trained Zero-Shot Forecasting

  • paper_url: http://arxiv.org/abs/2311.01933
  • repo_url: https://github.com/abacusai/forecastpfn
  • paper_authors: Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, Colin White
  • for: Addresses time-series forecasting when very few initial observations are available, sometimes 40 or fewer.
  • methods: Introduces ForecastPFN, a prior-data fitted network trained purely on a novel synthetic data distribution to approximate Bayesian inference; it makes predictions on a new time series in a single forward pass.
  • results: Experiments show that ForecastPFN's zero-shot predictions are more accurate and faster than state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.
    Abstract The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very little initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called `zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.

Simplifying Transformer Blocks

  • paper_url: http://arxiv.org/abs/2311.01906
  • repo_url: https://github.com/bobby-he/simplified_transformers
  • paper_authors: Bobby He, Thomas Hofmann
  • for: Aims to simplify the standard transformer block to improve training speed and reduce the number of parameters.
  • methods: Uses signal propagation theory and empirical observations to motivate removing block components with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks, and normalisation layers (a rough sketch of such a simplified block follows the abstract).
  • results: The simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput and using 15% fewer parameters.
    Abstract A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.
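
A rough PyTorch sketch of the kind of block the paper studies: no normalization layers, no skip connections, and attention without value or output projection matrices. The exact parameterization the authors use (e.g. shaped attention and the initialization needed to keep signal propagation healthy) is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedBlock(nn.Module):
    """Sketch of a simplified transformer block: no LayerNorm, no skip
    connections, single-head attention that uses the inputs themselves as
    values (no value/output projections)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.ff1 = nn.Linear(d_model, d_ff)
        self.ff2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        h = torch.softmax(scores, dim=-1) @ x                # no value / output projection
        return self.ff2(F.relu(self.ff1(h)))                 # no skip, no normalization

x = torch.randn(2, 16, 64)
print(SimplifiedBlock(64, 256)(x).shape)                     # torch.Size([2, 16, 64])
```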

High Precision Causal Model Evaluation with Conditional Randomization

  • paper_url: http://arxiv.org/abs/2311.01902
  • repo_url: None
  • paper_authors: Chao Ma, Cheng Zhang
  • for: Evaluating causal models against effects estimated from randomized controlled trials (RCTs) is the gold standard, but RCTs are not always feasible or ethical; conditionally randomized experiments based on inverse probability weighting (IPW) are more realistic but can suffer from high estimation variance.
  • methods: Introduces a novel low-variance estimator of causal error, the pairs estimator, which applies the same IPW estimator to both the model's and the true experimental effects so that the variance due to IPW cancels, achieving a smaller asymptotic variance (a toy sketch follows the abstract).
  • results: Empirical studies demonstrate the improvement of the estimator and its potential to achieve near-RCT performance; the method evaluates causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, enabling more robust and reliable model assessment.
    Abstract The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improved of our estimator, highlighting its potential on achieving near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.
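
As read from the abstract, the pairs estimator applies one and the same IPW estimator to the model's predicted outcomes and to the observed outcomes and takes the difference, so the IPW-induced variance largely cancels. The simulation below is a toy illustration of that idea with a synthetic conditionally randomized experiment and a deliberately biased candidate model; it is not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
e = 1 / (1 + np.exp(-x))                  # known propensity (conditional randomization)
t = rng.random(n) < e
y = 1.0 + 2.0 * t + x + rng.standard_normal(n)   # observed outcome, true ATE = 2

def ipw_ate(y_obs, t, e):
    # Standard inverse-probability-weighted ATE estimator.
    return np.mean(t * y_obs / e - (1 - t) * y_obs / (1 - e))

# A candidate causal model to be evaluated (deliberately biased: its ATE is 1.5).
def model_outcome(x, t):
    return 1.0 + 1.5 * t + x

# Pairs-style estimate of the model's causal error: same IPW estimator applied to
# the model's predicted outcomes and to the observed outcomes, then differenced,
# so the shared IPW variance largely cancels.
err_pairs = ipw_ate(model_outcome(x, t), t, e) - ipw_ate(y, t, e)

# Naive alternative: compare the model's ATE directly to the IPW estimate.
err_naive = 1.5 - ipw_ate(y, t, e)
print(round(err_pairs, 3), round(err_naive, 3))  # both estimate the error (-0.5);
# across repetitions the pairs version is typically much less variable.
```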

Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization

  • paper_url: http://arxiv.org/abs/2311.01900
  • repo_url: None
  • paper_authors: Alejandro de la Concha, Nicolas Vayatis, Argyris Kalogeratos
  • for: Provides an online non-parametric likelihood-ratio estimation (OLRE) method for quantifying the difference between two probability density functions $p$ and $q$ as pairs of iid observations $(x_t \sim p, x'_t \sim q)$ arrive over time.
  • methods: Builds on recent advances in kernel methods and functional minimization to derive an estimator that can be efficiently updated online; the approach is non-parametric, i.e. agnostic to the forms of $p$ and $q$.
  • results: Provides theoretical guarantees for the performance of the OLRE method, along with empirical validation in synthetic experiments.
    Abstract Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to our best knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.

Learning Sparse Codes with Entropy-Based ELBOs

  • paper_url: http://arxiv.org/abs/2311.01888
  • repo_url: None
  • paper_authors: Dmytro Velychko, Simon Damm, Asja Fischer, Jörg Lücke
  • for: Derives a solely entropy-based learning objective for the parameters of standard probabilistic sparse coding that works with non-trivial posterior approximations.
  • methods: Shows that the standard ELBO converges to a sum of entropies and that the conditions under which this happens have analytical solutions, yielding a fully analytical objective that also allows a novel principled form of annealing.
  • results: Numerical experiments with different posterior approximations, including Gaussians with correlated latents and deep amortized approximations, demonstrate the feasibility of learning with entropy-based ELBOs, and entropy-based annealing is shown to improve learning.
    Abstract Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.

Domain Randomization via Entropy Maximization

  • paper_url: http://arxiv.org/abs/2311.01885
  • repo_url: https://github.com/gabrieletiboni/doraemon
  • paper_authors: Gabriele Tiboni, Pascal Klink, Jan Peters, Tatiana Tommasi, Carlo D’Eramo, Georgia Chalvatzaki
  • for: Addresses the sim-to-real gap in reinforcement learning, where Domain Randomization (DR) hinges on how the sampling distribution of the dynamics parameters is chosen: too little variability fails to regularize the agent, while excessive randomization yields overly conservative policies.
  • methods: Proposes Domain Randomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution of dynamics parameters, gradually increasing their diversity as long as the current policy's probability of success remains sufficiently high (a toy version of this loop is sketched after the abstract).
  • results: Experiments show that DORAEMON obtains highly adaptive and generalizable policies that solve the task across the widest range of dynamics parameters, and demonstrate its sim-to-real applicability via successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
    Abstract Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
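
DORAEMON widens the dynamics distribution only while the current policy still succeeds often enough. The loop below is a toy version, assuming a one-dimensional uniform dynamics distribution and a fixed, fictitious success test in place of actual policy training and evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

def success(dynamics_param):
    # Placeholder for "the current policy solves the task under this parameter".
    # Here: a fictitious policy that succeeds whenever the parameter lies in [0.4, 2.5].
    return 0.4 <= dynamics_param <= 2.5

lo, hi = 0.9, 1.1      # narrow initial distribution around the nominal value
alpha = 0.9            # required probability of success
step = 0.05

for it in range(100):
    params = rng.uniform(lo, hi, size=200)
    rate = np.mean([success(p) for p in params])
    if rate >= alpha:
        # Entropy of Uniform(lo, hi) is log(hi - lo), so widening the support
        # directly increases the entropy of the training distribution.
        lo, hi = lo - step, hi + step
    else:
        lo, hi = lo + step / 2, hi - step / 2   # back off if the policy struggles

print("final dynamics distribution:", (round(lo, 2), round(hi, 2)))
```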

Spectral Clustering of Attributed Multi-relational Graphs

  • paper_url: http://arxiv.org/abs/2311.01840
  • repo_url: None
  • paper_authors: Ylli Sadikaj, Yllka Velaj, Sahar Behzadi, Claudia Plant
  • for: Proposes a clustering method for graphs with multiple relation types and categorical node attributes, to better capture the interplay between graph structure and attributes.
  • methods: Introduces SpectralMix, a joint dimensionality reduction technique that integrates the graph structure, the different relation types, and the categorical attribute information into a single embedding, enabling a sound interpretation of the clustering results; it reduces to spectral embedding and clustering on a single graph and to homogeneity analysis on categorical data (a simplified joint spectral clustering sketch follows the abstract).
  • results: Experiments on several real-world datasets detect dependencies between graph structure and categorical attributes and show the superiority of SpectralMix over existing methods.
    Abstract Graph clustering aims at discovering a natural grouping of the nodes such that similar nodes are assigned to a common cluster. Many different algorithms have been proposed in the literature: for simple graphs, for graphs with attributes associated to nodes, and for graphs where edges represent different types of relations among nodes. However, complex data in many domains can be represented as both attributed and multi-relational networks. In this paper, we propose SpectralMix, a joint dimensionality reduction technique for multi-relational graphs with categorical node attributes. SpectralMix integrates all information available from the attributes, the different types of relations, and the graph structure to enable a sound interpretation of the clustering results. Moreover, it generalizes existing techniques: it reduces to spectral embedding and clustering when only applied to a single graph and to homogeneity analysis when applied to categorical data. Experiments conducted on several real-world datasets enable us to detect dependencies between graph structure and categorical attributes, moreover, they exhibit the superiority of SpectralMix over existing methods.
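
The sketch below is not SpectralMix itself; it only illustrates the joint-information idea by summing two relation adjacencies and an attribute-similarity matrix with fixed weights (SpectralMix learns such weights as part of a joint embedding) and running off-the-shelf spectral clustering on the result.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, k = 90, 3
labels_true = np.repeat(np.arange(k), n // k)

def relation(p_in, p_out):
    # Random multi-relational adjacency: denser within ground-truth groups.
    P = np.where(labels_true[:, None] == labels_true[None, :], p_in, p_out)
    A = (rng.random((n, n)) < P).astype(float)
    return np.triu(A, 1) + np.triu(A, 1).T

A1, A2 = relation(0.4, 0.05), relation(0.3, 0.1)          # two relation types
attr = (labels_true + rng.integers(0, 2, n)) % k           # noisy categorical attribute
S_attr = (attr[:, None] == attr[None, :]).astype(float)

# Combine relations and attribute similarity into one affinity, embed spectrally, cluster.
W = 0.4 * A1 + 0.4 * A2 + 0.2 * S_attr
pred = SpectralClustering(n_clusters=k, affinity="precomputed",
                          random_state=0).fit_predict(W)
print("ARI vs ground truth:", round(adjusted_rand_score(labels_true, pred), 3))
```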

Mix-ME: Quality-Diversity for Multi-Agent Learning

  • paper_url: http://arxiv.org/abs/2311.01829
  • repo_url: None
  • paper_authors: Garðar Ingvarsson, Mikayel Samvelyan, Bryan Lim, Manon Flageat, Antoine Cully, Tim Rocktäschel
  • for: Extends Quality-Diversity (QD) methods, which discover collections of high-performing yet behaviorally diverse solutions, from single-agent to multi-agent settings.
  • methods: Proposes Mix-ME, a multi-agent variant of the MAP-Elites algorithm that forms new solutions with a crossover-like operator mixing together agents from different teams (a toy crossover is sketched after the abstract).
  • results: On a variety of partially observable continuous control tasks, the multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but often outperform them in multi-agent settings under partial observability.
    Abstract In many real-world systems, such as adaptive robotics, achieving a single, optimised solution may be insufficient. Instead, a diverse set of high-performing solutions is often required to adapt to varying contexts and requirements. This is the realm of Quality-Diversity (QD), which aims to discover a collection of high-performing solutions, each with their own unique characteristics. QD methods have recently seen success in many domains, including robotics, where they have been used to discover damage-adaptive locomotion controllers. However, most existing work has focused on single-agent settings, despite many tasks of interest being multi-agent. To this end, we introduce Mix-ME, a novel multi-agent variant of the popular MAP-Elites algorithm that forms new solutions using a crossover-like operator by mixing together agents from different teams. We evaluate the proposed methods on a variety of partially observable continuous control tasks. Our evaluation shows that these multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but also often outperform them in multi-agent settings under partial observability.
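
The core new operator in Mix-ME is a crossover-like mixing of agents from different parent teams. Below is a minimal sketch, with plain parameter vectors standing in for per-agent policies and uniform slot-wise mixing as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_teams(team_a, team_b):
    # Crossover-like operator: for each agent slot, take that agent's parameters
    # from one of the two parent teams drawn from the archive.
    return [a if rng.random() < 0.5 else b for a, b in zip(team_a, team_b)]

# Toy archive entries: a "team" is a list of per-agent parameter vectors.
team_a = [rng.standard_normal(4) for _ in range(3)]
team_b = [rng.standard_normal(4) for _ in range(3)]
child = mix_teams(team_a, team_b)
# In MAP-Elites the child would then be evaluated and inserted into the
# behaviour-descriptor grid if it improves the fitness of its cell.
print([c.round(2) for c in child])
```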

Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees

  • paper_url: http://arxiv.org/abs/2311.01806
  • repo_url: None
  • paper_authors: Yingzhen Yang, Ping Li
  • for: solving large-scale optimization problems with regularization functions, such as least square problems with convex or nonconvex regularization.
  • methods: proposes a fast sketching algorithm called Sketching for Regularized Optimization (SRO), which generates a sketch of the original data matrix and solves the sketched problem to obtain the optimization results (a minimal sketch-then-solve example follows the abstract).
  • results: the proposed algorithm handles general Frechet subdifferentiable regularization functions in an unified framework, and provides general theoretical results for the approximation error between the original problem and the sketched problem for regularized least square problems. Additionally, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are obtained under mild conditions.
    Abstract Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least square problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Different from existing randomized algorithms, our algorithm handles general Frechet subdifferentiable regularization functions in an unified framework. We present general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least square problems which can be convex or nonconvex. For arbitrary convex regularizer, relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problem by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.
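
SRO's basic step is to sketch the data matrix and then solve the much smaller sketched regularized problem. The snippet below is only a generic illustration with a Gaussian sketch and an l1-regularized least squares solve; the paper's framework covers general Frechet-subdifferentiable (convex or nonconvex) regularizers, its own sketch constructions, and an iterative variant, none of which are reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Sparse regularized least squares: min_x 0.5*||A x - b||^2 + lam*||x||_1.
n, d = 20000, 50
A = rng.standard_normal((n, d))
x_true = np.zeros(d); x_true[:5] = 1.0
b = A @ x_true + 0.1 * rng.standard_normal(n)

# Sketch the data with a Gaussian sketching matrix S (m << n rows), then solve
# the sketched problem instead of the original one.
m = 500
S = rng.standard_normal((m, n)) / np.sqrt(m)
A_s, b_s = S @ A, S @ b

x_hat = Lasso(alpha=0.01, fit_intercept=False).fit(A_s, b_s).coef_
print("coordinates recovered as non-negligible:", np.nonzero(np.abs(x_hat) > 0.1)[0])
```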

On the Generalization Properties of Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.01797
  • repo_url: https://github.com/lphleo/diffusion_generalization
  • paper_authors: Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian
  • for: Seeks a theoretical understanding of the generalization capabilities of diffusion models.
  • methods: Establishes theoretical estimates of the generalization gap that evolve with the training dynamics of score-based diffusion models, in terms of the sample size $n$ and the model capacity $m$, and extends the analysis to a data-dependent scenario where the target distributions are a succession of densities with progressively increasing distances between modes.
  • results: Shows a polynomially small generalization error of order $O(n^{-2/5} + m^{-4/5})$ when training is early-stopped, evading the curse of dimensionality; quantifies the adverse effect of mode shift in the ground truth on generalization; and confirms the estimates with numerical simulations.
    Abstract Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.

Learning to Augment Distributions for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.01796
  • repo_url: None
  • paper_authors: Qizhou Wang, Zhen Fang, Yonggang Zhang, Feng Liu, Yixuan Li, Bo Han
  • for: Addresses out-of-distribution (OOD) detection in the open world, where the auxiliary OOD data available for training may differ in distribution from the unseen real OOD data.
  • methods: Proposes Distributional-Augmented OOD Learning (DAL), which alleviates the OOD distribution discrepancy by crafting an OOD distribution set containing all distributions within a Wasserstein ball centered on the auxiliary OOD distribution, and training the predictor against the worst OOD data in that ball.
  • results: Extensive evaluations across representative OOD detection setups show that a predictor trained with DAL on only the auxiliary OOD data improves open-world detection performance over its advanced counterparts.
    Abstract Open-world classification systems should discern out-of-distribution (OOD) data whose labels deviate from those of in-distribution (ID) cases, motivating recent studies in OOD detection. Advanced works, despite their promising progress, may still fail in the open world, owing to the lack of knowledge about unseen OOD data in advance. Although one can access auxiliary OOD data (distinct from unseen ones) for model training, it remains to analyze how such auxiliary data will work in the open world. To this end, we delve into such a problem from a learning theory perspective, finding that the distribution discrepancy between the auxiliary and the unseen real OOD data is the key to affecting the open-world detection performance. Accordingly, we propose Distributional-Augmented OOD Learning (DAL), alleviating the OOD distribution discrepancy by crafting an OOD distribution set that contains all distributions in a Wasserstein ball centered on the auxiliary OOD distribution. We justify that the predictor trained over the worst OOD data in the ball can shrink the OOD distribution discrepancy, thus improving the open-world detection performance given only the auxiliary OOD data. We conduct extensive evaluations across representative OOD detection setups, demonstrating the superiority of our DAL over its advanced counterparts.

Efficient Generalized Low-Rank Tensor Contextual Bandits

  • paper_url: http://arxiv.org/abs/2311.01771
  • repo_url: None
  • paper_authors: Qianxin Yi, Yiyang Yang, Yao Wang, Shaojie Tang
  • for: This paper aims to provide high-usable and accountable decision-making services by building a novel bandits algorithm that fully harnesses the power of multi-dimensional data and non-linear reward functions.
  • methods: The paper introduces a generalized low-rank tensor contextual bandits model, which represents an action as a tensor and determines the reward through a generalized linear function. The algorithm “Generalized Low-Rank Tensor Exploration Subspace then Refine” (G-LowTESTR) is introduced to effectively trade off exploration and exploitation.
  • results: The paper shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases through theoretical analysis and simulations/real data experiments. The algorithm is able to capitalize on the low-rank tensor structure for enhanced learning.
    Abstract In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide high-usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.

Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel

  • paper_url: http://arxiv.org/abs/2311.01762
  • repo_url: None
  • paper_authors: Oskar Allerbo
  • for: Investigates how changing the kernel during training affects model complexity and generalization in kernel ridge regression (KRR), and proposes an update scheme in which the bandwidth decreases during training.
  • methods: Uses the iterative (gradient descent) solution of KRR, which makes it possible to change the kernel during training; for translational-invariant kernels the bandwidth is let decrease to zero during training, circumventing the need for hyper-parameter selection (a toy version is sketched after the abstract).
  • results: On real and synthetic data, decreasing the bandwidth during training outperforms a constant bandwidth selected by cross-validation or marginal likelihood maximization, and can achieve zero training error together with good generalization as well as a double descent behavior, phenomena that do not occur for constant-bandwidth KRR but are known to appear for neural networks.
    Abstract Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.
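
A minimal version of the idea: run gradient descent on a kernel regression objective while the RBF bandwidth decays toward a small value during training, so the kernel itself changes across iterations. The decay schedule, the ridgeless objective, and the bandwidth floor below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Z, bw):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

# 1-D toy regression problem.
n = 40
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gradient descent on 0.5/n * ||y - K(bw) a||^2 with a shrinking bandwidth,
# i.e. the kernel changes during training as in the proposed update scheme.
a = np.zeros(n)
lr, bw = 0.05, 2.0
for t in range(2000):
    K = rbf(X, X, bw)
    a -= lr * K @ (K @ a - y) / n    # gradient with respect to the dual coefficients a
    bw = max(0.2, bw * 0.998)        # decreasing bandwidth during training

pred = rbf(X, X, bw) @ a
print("train MSE:", round(float(np.mean((pred - y) ** 2)), 4))
```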

TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices

  • paper_url: http://arxiv.org/abs/2311.01759
  • repo_url: None
  • paper_authors: Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Junyi Chen, Lingkun Long, Han Wan, Bei Yu, Weisheng Zhao
  • for: Targets the development and deployment of deep learning models, in particular transformers, on microcontroller units (MCUs) for embedded IoT applications with severe hardware resource constraints.
  • methods: Proposes TinyFormer, a framework consisting of SuperNAS, SparseNAS and SparseEngine: SuperNAS searches a vast space for an appropriate supernet, SparseNAS evaluates the best sparse single-path model (including transformer architectures) from the identified supernet, and SparseEngine efficiently deploys the searched sparse models onto MCUs for inference.
  • results: On CIFAR-10, TinyFormer produces efficient transformers with $96.1\%$ accuracy under hardware constraints of $1$MB storage and $320$KB memory, achieves speedups in sparse inference of up to $12.2\times$ over the CMSIS-NN library, and is expected to bring powerful transformers into TinyML scenarios, greatly expanding the scope of deep learning applications.
    Abstract Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformers on MCUs. TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path model including transformer architecture from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse models with transformer on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to $12.2\times$, when compared to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and greatly expand the scope of deep learning applications.

Epidemic Decision-making System Based Federated Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.01749
  • repo_url: None
  • paper_authors: Yangxi Zhou, Junping Du, Zhe Xue, Zhenhui Pan, Weikang Chen
  • for: 该论文旨在帮助政府通过对公共安全和经济发展进行全面考虑,应对公共卫生和安全紧急情况。
  • methods: 该论文提出了一种基于联合学习的方法,通过将各省的疫情情况数据进行合作训练,以提高疫情决策的准确性和效率。
  • results: 实验结果显示，联邦强化学习方法可以在疫情决策中获得更优的性能和回报，并能加快训练模型的收敛速度。此外，对比实验还表明，A2C是最适合疫情决策场景的强化学习模型，其次是PPO模型，而DDPG模型的性能较差。
    Abstract Epidemic decision-making can effectively help the government balance public health security and economic development when responding to public health and safety emergencies. Prior studies have shown that reinforcement learning can help the government make epidemic decisions and thus achieve this balance between health security and economic development. However, epidemic data often has the characteristics of limited samples and high privacy. This paper therefore proposes a federated reinforcement learning model for epidemic decision-making, which combines the epidemic situation data of multiple provinces for cooperative training while protecting data privacy. Experiments show that federated reinforcement learning obtains better performance and returns than plain reinforcement learning, and it also accelerates the training convergence speed of the clients. Experimental comparison further shows that A2C is the most suitable reinforcement learning model for the epidemic decision-making scenario, followed by the PPO model, while the performance of DDPG is unsatisfactory.
    摘要 疫情决策可以有效地帮助政府在应对公共卫生与安全突发事件时，全面兼顾公共安全与经济发展。一些研究表明，强化学习可以帮助政府做出疫情决策，从而实现健康安全与经济发展之间的平衡。然而，疫情数据往往具有样本有限和隐私性高的特点。为此，本文提出了一种基于联邦强化学习的疫情决策模型，可以结合各省的疫情数据进行协同训练，同时保护数据隐私。实验表明，联邦强化学习比单独的强化学习获得更优的性能和回报，并能加快客户端训练模型的收敛速度。同时，实验对比表明，A2C是最适合疫情决策场景的强化学习模型，其次是PPO模型，而DDPG模型的表现不尽如人意。
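The federated training pattern described above can be illustrated with a schematic FedAvg round over per-province policy parameters; the sketch below is ours, the local A2C update and epidemic environment are replaced by placeholders, and the sample-count weighting is an assumption.

```python
import numpy as np

def local_policy_update(theta, episodes=10, lr=0.01, rng=None):
    """Placeholder for a province's local RL update (e.g. A2C on private data);
    here we only take noisy gradient-like steps for illustration."""
    rng = rng or np.random.default_rng()
    for _ in range(episodes):
        theta = theta - lr * rng.standard_normal(theta.shape)
    return theta

def federated_round(global_theta, n_provinces=5, samples_per_province=None, rng=None):
    """One FedAvg round: each province trains locally, only the parameters are
    sent back and averaged (optionally weighted by local sample counts)."""
    rng = rng or np.random.default_rng(0)
    weights = samples_per_province or [1.0] * n_provinces
    locals_ = [local_policy_update(global_theta.copy(), rng=rng) for _ in range(n_provinces)]
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return sum(wi * ti for wi, ti in zip(w, locals_))

theta = np.zeros(16)                       # flattened policy parameters
for rnd in range(20):
    theta = federated_round(theta)         # broadcast the averaged model each round
```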

Global Optimization: A Machine Learning Approach

  • paper_url: http://arxiv.org/abs/2311.01742
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Dimitris Bertsimas, Georgios Margaritis
  • for: solves black-box global optimization problems with nonlinear constraints.
  • methods: uses hyperplane-based Decision-Trees and mixed integer optimization (MIO) approximation, with extensions to other ML models and adaptive sampling procedures.
  • results: shows improvements in solution feasibility and optimality in the majority of instances compared to BARON, with improved optimality gaps or solution times in 11 instances.
    Abstract Many approaches for addressing Global Optimization problems typically rely on relaxations of nonlinear constraints over specific mathematical primitives. This is restricting in applications with constraints that are black-box, implicit or consist of more general primitives. Trying to address such limitations, Bertsimas and Ozturk (2023) proposed OCTHaGOn as a way of solving black-box global optimization problems by approximating the nonlinear constraints using hyperplane-based Decision-Trees and then using those trees to construct a unified mixed integer optimization (MIO) approximation of the original problem. We provide extensions to this approach, by (i) approximating the original problem using other MIO-representable ML models besides Decision Trees, such as Gradient Boosted Trees, Multi Layer Perceptrons and Support Vector Machines, (ii) proposing adaptive sampling procedures for more accurate machine learning-based constraint approximations, (iii) utilizing robust optimization to account for the uncertainty of the sample-dependent training of the ML models, and (iv) leveraging a family of relaxations to address the infeasibilities of the final MIO approximation. We then test the enhanced framework in 81 Global Optimization instances. We show improvements in solution feasibility and optimality in the majority of instances. We also compare against BARON, showing improved optimality gaps or solution times in 11 instances.
    摘要 解决全局优化问题的许多方法通常依赖于对特定数学原语上的非线性约束进行松弛，这限制了其在约束为黑盒、隐式或由更一般原语构成的应用中的使用。为了解决这些局限，Bertsimas和Ozturk（2023）提出了OCTHaGOn，其通过基于超平面的决策树近似非线性约束，再利用这些树构造原问题的统一混合整数优化（MIO）近似，从而求解黑盒全局优化问题。我们对该方法进行了扩展：(i) 除决策树外，还使用其他可用MIO表示的机器学习模型（如梯度提升树、多层感知机和支持向量机）来近似原问题；(ii) 提出自适应采样方法，以获得更准确的基于机器学习的约束近似；(iii) 利用鲁棒优化来处理机器学习模型训练对样本的依赖所带来的不确定性；(iv) 利用一族松弛来处理最终MIO近似的不可行性。我们在81个全局优化实例上测试了改进后的框架，在大多数实例中解的可行性和最优性均有提升；与BARON相比，在11个实例中取得了更好的最优性差距或求解时间。
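The first step of this pipeline, learning an ML surrogate of a black-box constraint from sampled evaluations, can be sketched as follows; this sketch uses an axis-aligned scikit-learn tree rather than the hyperplane-based trees of the paper, and the MIO embedding, adaptive sampling, and robustness steps are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# black-box constraint g(x) <= 0, only available through evaluation
def g(x):
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2 - 0.5

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))          # sampled candidate points
y = (g(X) <= 0).astype(int)                     # feasibility labels

# the fitted tree is an explicit, MIO-representable approximation of the feasible region
surrogate = DecisionTreeClassifier(max_depth=6).fit(X, y)

# here we only use it to screen new candidates; OCTHaGOn instead encodes the
# tree's leaves as mixed-integer constraints and optimizes over them
X_new = rng.uniform(-1, 1, size=(5, 2))
print(surrogate.predict(X_new))                 # 1 = predicted feasible
```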

CDGraph: Dual Conditional Social Graph Synthesizing via Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.01729
  • repo_url: None
  • paper_authors: Jui-Yi Tsai, Ya-Wen Teng, Ho Chiok Yew, De-Nian Yang, Lydia Y. Chen
  • for: 本研究旨在提出一种基于两个指定条件的 conditional diffusion model for social networks,以生成符合条件的社交图。
  • methods: 该模型在去噪过程中引入双条件的协同演化依赖，以捕捉两个条件之间的相互依赖关系；同时结合社交同质性和社交传染，在满足指定条件的同时保持节点之间的连通性；并通过一种新的分类器损失，利用双条件的相互依赖来引导扩散过程的训练。
  • results: 对四个 dataset 进行评估,与四种现有的图生成方法进行比较,结果显示 CDGraph 可以生成符合 dual-conditional 的社交图,并且在多种社交网络指标中具有较低的不一致性和较高的 dual-conditional 有效性。
    Abstract The social graphs synthesized by the generative models are increasingly in demand due to data scarcity and concerns over user privacy. One of the key performance criteria for generating social networks is the fidelity to specified conditionals, such as users with certain membership and financial status. While recent diffusion models have shown remarkable performance in generating images, their effectiveness in synthesizing graphs has not yet been explored in the context of conditional social graphs. In this paper, we propose the first kind of conditional diffusion model for social networks, CDGraph, which trains and synthesizes graphs based on two specified conditions. We propose the co-evolution dependency in the denoising process of CDGraph to capture the mutual dependencies between the dual conditions and further incorporate social homophily and social contagion to preserve the connectivity between nodes while satisfying the specified conditions. Moreover, we introduce a novel classifier loss, which guides the training of the diffusion process through the mutual dependency of dual conditions. We evaluate CDGraph against four existing graph generative methods, i.e., SPECTRE, GSM, EDGE, and DiGress, on four datasets. Our results show that the generated graphs from CDGraph achieve much higher dual-conditional validity and lower discrepancy in various social network metrics than the baselines, thus demonstrating its proficiency in generating dual-conditional social graphs.
    摘要 由于数据稀缺和用户隐私方面的担忧，由生成模型合成的社交图的需求日益增长。生成社交网络的一个关键性能标准是对指定条件（例如用户具有某种成员身份和财务状况）的忠实性。尽管最近的扩散模型在图像生成方面表现出色，但其在条件社交图合成中的有效性尚未被研究。在本文中，我们提出了首个面向社交网络的条件扩散模型CDGraph，它基于两个指定条件训练并合成图。我们在CDGraph的去噪过程中提出协同演化依赖，以捕捉双条件之间的相互依赖关系，并进一步结合社交同质性与社交传染，在满足指定条件的同时保持节点之间的连通性。此外，我们引入了一种新的分类器损失，通过双条件的相互依赖来引导扩散过程的训练。我们在四个数据集上将CDGraph与四种现有的图生成方法（SPECTRE、GSM、EDGE和DiGress）进行比较。结果表明，与基线方法相比，CDGraph生成的图具有更高的双条件有效性，并在多种社交网络指标上具有更低的差异，从而证明了其在生成双条件社交图方面的能力。
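For orientation only, here is a toy conditional denoising-diffusion training step over a flattened adjacency matrix with two scalar conditions appended to the denoiser input; the co-evolution dependency, homophily/contagion terms, and classifier loss that define CDGraph are not reproduced, so treat this purely as a schematic.

```python
import torch
import torch.nn as nn

N = 16                                   # nodes; adjacency flattened to N*N entries
T = 100                                  # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                # predicts the injected noise
    nn.Linear(N * N + 2 + 1, 256), nn.SiLU(),
    nn.Linear(256, N * N),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def training_step(adj, cond):            # adj: (B, N*N) in [0,1], cond: (B, 2)
    t = torch.randint(0, T, (adj.shape[0],))
    a_bar = alphas_bar[t].unsqueeze(1)
    noise = torch.randn_like(adj)
    noisy = a_bar.sqrt() * adj + (1 - a_bar).sqrt() * noise
    inp = torch.cat([noisy, cond, t.float().unsqueeze(1) / T], dim=1)
    loss = ((denoiser(inp) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# toy data: random graphs with two made-up condition values per graph
adj = (torch.rand(64, N * N) < 0.1).float()
cond = torch.rand(64, 2)
for _ in range(200):
    training_step(adj, cond)
```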

Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces

  • paper_url: http://arxiv.org/abs/2311.01722
  • repo_url: https://github.com/apd10/flcf
  • paper_authors: Aditya Desai, Benjamin Meisburger, Zichang Liu, Anshumali Shrivastava
  • for: 这个论文的目的是推荐系统(RS)的实现,尤其是在面临数据隐私和GDPR等法规的情况下,通过联合学习来实现推荐模型,而不是在中央服务器上训练。
  • methods: 该论文采用联邦学习方法，使各客户端设备在本地协同训练推荐模型（尤其是嵌入表），而无需将数据上传到中央服务器。它使用基于哈希的随机投影实现设备容量感知的联邦平均，使具有不同内存容量的设备都能参与训练。
  • results: 论文在多个数据集上对神经协同过滤任务进行了实验，证明FAIR可以从容量各异的各种设备收集并共享信息，实现无缝协作。此外，论文还证明了FAIR在同构设定、非独立同分布数据下的收敛性。
    Abstract Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d data distribution. Our code is open source at {https://github.com/apd10/FLCF}
    摘要 针对物品（例如电影、书籍）和广告的推荐系统（RS）被广泛用于在各类互联网平台上为用户定制内容。传统上，推荐模型在中央服务器上训练。然而，出于对数据隐私的担忧以及GDPR等法规的要求，联邦学习成为一种日益流行的范式，数据始终不会离开客户端设备。将联邦学习应用于推荐模型并非易事，因为嵌入表通常会超过大多数用户设备的内存限制。为了让所有设备的数据都能参与联邦学习，我们必须支持在内存容量各异的设备上对嵌入表进行集体训练。现有的异构联邦学习方案只能支持较小范围的设备容量，因而限制了可参与训练的设备数量。我们提出了随机子空间联邦平均（FAIR），它可以根据设备容量对嵌入表进行任意压缩，并确保所有设备都能参与训练。FAIR使用我们称为"一致且可折叠的子空间"（由基于哈希的随机投影定义）来联合训练大型嵌入表，同时在各用户设备上使用不同程度的压缩。我们在多个数据集上对神经协同过滤任务进行了评估，验证了FAIR能够从容量各异的大量设备收集并共享信息，实现无缝协作。我们证明了FAIR在同构设定、非独立同分布数据下的收敛性。我们的代码开源于 https://github.com/apd10/FLCF 。
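The capacity-aware compression idea can be illustrated with a toy hashed embedding: the table lives in a flat parameter array whose size matches the device, and every (row, column) entry is addressed through a shared hash. The hash function and sizes below are our assumptions, and FAIR's actual consistent/collapsible subspaces and averaging rule differ.

```python
import numpy as np

class HashedEmbedding:
    """Embedding table stored in a flat parameter array of size `capacity`;
    entry (row, col) is mapped into the array with a fixed universal-style hash,
    so devices with different capacities hold differently-compressed copies."""
    def __init__(self, n_rows, dim, capacity, seed=0):
        rng = np.random.default_rng(seed)
        self.n_rows = n_rows
        self.dim = dim
        self.capacity = capacity
        self.a, self.b = 2654435761, 97531          # hash parameters shared by all devices
        self.memory = 0.01 * rng.standard_normal(capacity)

    def _index(self, row, col):
        return (self.a * (row * self.dim + col) + self.b) % self.capacity

    def lookup(self, row):
        return np.array([self.memory[self._index(row, j)] for j in range(self.dim)])

big_device = HashedEmbedding(n_rows=1_000_000, dim=32, capacity=2_000_000)
small_device = HashedEmbedding(n_rows=1_000_000, dim=32, capacity=50_000)
print(big_device.lookup(42).shape, small_device.lookup(42).shape)  # both (32,)
```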

Physics-Informed Generator-Encoder Adversarial Networks with Latent Space Matching for Stochastic Differential Equations

  • paper_url: http://arxiv.org/abs/2311.01708
  • repo_url: None
  • paper_authors: Ruisong Gao, Min Yang, Jin Zhang
  • for: 解决随机微分方程中的正问题、反问题和混合问题，即在仅有系统参数的有限快照数据时求解。
  • methods: 提出了一种新的物理信息生成器-编码器对抗网络（Physics-Informed Generator-Encoder Adversarial Networks），通过生成器与编码器的交替更新，并在低维潜在特征空间中进行间接匹配来求解随机微分方程。
  • results: 数值实验证明，该方法可以更准确地求解不同类型的随机微分方程问题，并能有效缓解训练不稳定问题。
    Abstract We propose a new class of physics-informed neural networks, called Physics-Informed Generator-Encoder Adversarial Networks, to effectively address the challenges posed by forward, inverse, and mixed problems in stochastic differential equations. In these scenarios, while the governing equations are known, the available data consist of only a limited set of snapshots for system parameters. Our model consists of two key components: the generator and the encoder, both updated alternately by gradient descent. In contrast to previous approaches of directly matching the approximated solutions with real snapshots, we employ an indirect matching that operates within the lower-dimensional latent feature space. This method circumvents challenges associated with high-dimensional inputs and complex data distributions, while yielding more accurate solutions compared to existing neural network solvers. In addition, the approach also mitigates the training instability issues encountered in previous adversarial frameworks in an efficient manner. Numerical results provide compelling evidence of the effectiveness of the proposed method in solving different types of stochastic differential equations.
    摘要 我们提出了一类新的物理信息神经网络，即物理信息生成器-编码器对抗网络，用于有效应对随机微分方程中的正问题、反问题和混合问题所带来的挑战。在这些情形下，控制方程已知，但可用数据仅为系统参数的有限快照。我们的模型包含两个关键组成部分：生成器和编码器，二者通过梯度下降交替更新。与以往直接将近似解与真实快照匹配的方法不同，我们在低维潜在特征空间中进行间接匹配。这种方法规避了高维输入和复杂数据分布带来的挑战，同时相比现有的神经网络求解器得到更精确的解。此外，该方法还以高效的方式缓解了以往对抗框架中遇到的训练不稳定问题。数值结果有力地证明了所提方法在求解不同类型随机微分方程上的有效性。
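A minimal sketch of the alternating generator/encoder updates with indirect matching in latent space is given below (ours); the physics-informed residual term and the actual architectures are omitted, and the moment-matching loss is an illustrative stand-in for whatever discrepancy the paper uses.

```python
import torch
import torch.nn as nn

snap_dim, latent_dim, noise_dim = 32, 8, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.Tanh(), nn.Linear(64, snap_dim))
E = nn.Sequential(nn.Linear(snap_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(E.parameters(), lr=1e-3)

def latent_discrepancy(real, fake):
    """Indirect matching: compare first and second moments in the encoder's
    low-dimensional latent space instead of matching snapshots directly."""
    zr, zf = E(real), E(fake)
    return ((zr.mean(0) - zf.mean(0)) ** 2).mean() + ((zr.var(0) - zf.var(0)) ** 2).mean()

real = torch.randn(256, snap_dim)        # stand-in for measured parameter snapshots
for step in range(500):
    # generator step: reduce the latent discrepancy
    fake = G(torch.randn(256, noise_dim))
    loss_g = latent_discrepancy(real, fake)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # encoder step: adversarially sharpen the discrepancy
    fake = G(torch.randn(256, noise_dim)).detach()
    loss_e = -latent_discrepancy(real, fake)
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
```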

Adversarial Attacks on Cooperative Multi-agent Bandits

  • paper_url: http://arxiv.org/abs/2311.01698
  • repo_url: None
  • paper_authors: Jinhang Zuo, Zhiyao Zhang, Xuchuang Wang, Cheng Chen, Shuai Li, John C. S. Lui, Mohammad Hajiesmaili, Adam Wierman
  • for: 这篇论文研究了合作多智能体多臂赌博机（CMA2B）中由协作带来的潜在脆弱性，以及针对这种协作的对抗攻击。
  • methods: 该论文通过攻击少数智能体来影响其余智能体的决策。具体来说，在同构设定下，我们提出了一种目标臂攻击策略，只需针对一个智能体，即可在$T$轮内使所有智能体选择特定目标臂$T-o(T)$次，而攻击成本仅为$o(T)$。在异构设定下，我们证明了目标臂攻击需要线性的攻击成本，并提出了只需操纵少数目标智能体的观测、即可以次线性成本迫使最多数量的智能体遭受线性遗憾的攻击策略。
  • results: 数值实验证明了我们提出的攻击策略的有效性。
    Abstract Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.
    摘要 合作多智能体多臂赌博机（CMA2B）考虑多个智能体在共享多臂赌博机博弈中的协作。我们研究这种协作暴露出的潜在脆弱性，并考虑对少数智能体进行对抗攻击，以影响其余智能体的决策。具体而言，我们在同构设定（所有智能体使用相同的臂集合）和异构设定（各智能体拥有不同的臂集合）下研究对CMA2B的对抗攻击。在同构设定中，我们提出的攻击策略只需针对一个智能体，即可使所有智能体在$T$轮中选择特定目标臂$T-o(T)$次，而攻击成本仅为$o(T)$。在异构设定中，我们证明目标臂攻击需要线性的攻击成本，并提出了只需操纵少数目标智能体的观测、以次线性成本迫使最多数量的智能体遭受线性遗憾的攻击策略。数值实验验证了所提攻击策略的有效性。
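To convey the flavor of a reward-corruption attack in the homogeneous setting, here is a toy single-agent version in which the attacker pushes the observed rewards of non-target arms down so a UCB learner concentrates on the target arm; the corruption rule is an assumption, and the paper's actual contribution, how such an attack propagates through the cooperating agents, is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 10_000
true_means = np.array([0.30, 0.45, 0.50, 0.55, 0.60])
target_arm = 0                      # attacker wants this (suboptimal) arm chosen

counts = np.zeros(K)
sums = np.zeros(K)
attack_cost = 0.0

for t in range(1, T + 1):
    # attacked agent runs UCB1 on its (possibly corrupted) observations
    if t <= K:
        arm = t - 1
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.binomial(1, true_means[arm])
    if arm != target_arm:
        # push non-target arms' observed rewards down; pay the change as attack cost
        corrupted = min(reward, 0.0)
        attack_cost += abs(reward - corrupted)
        reward = corrupted
    counts[arm] += 1
    sums[arm] += reward

print("target-arm pulls:", int(counts[target_arm]), "attack cost:", attack_cost)
```

With UCB exploration, the corrupted non-target arms are pulled only O(log T) times, so the target arm is selected T - o(T) times at o(T) cost, in the spirit of the homogeneous-setting result.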

Communication-Efficient Federated Non-Linear Bandit Optimization

  • paper_url: http://arxiv.org/abs/2311.01695
  • repo_url: None
  • paper_authors: Chuanhao Li, Chong Liu, Yu-Xiang Wang
  • for: 该论文研究了多客户端协同优化问题,以保持数据隐私和实现大规模计算,并采用中央服务器协调。
  • methods: 该论文提出了一种名为Fed-GO-UCB的新算法，用于具有一般非线性目标函数的联邦赌博机优化。
  • results: 在一些温和的条件下，论文严格证明了Fed-GO-UCB算法能够对累积遗憾和通信成本同时达到次线性速率。
    Abstract Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.
    摘要 联邦优化研究多个客户端（例如移动设备或组织）在中央服务器协调下进行协作函数优化的问题。由于数据由各客户端分别收集并始终保持去中心化，联邦优化能够保护数据隐私并支持大规模计算，因而成为一种很有前景的去中心化机器学习范式。尽管它常被部署于本质上在线的任务（例如键盘应用中的下一词预测），大多数工作仍将其建模为离线问题。少数考虑联邦赌博机优化的工作也仅限于非常简单的函数类（例如线性、广义线性、或RKHS范数有界的非参数函数类），这严重限制了其实际应用。在本文中，我们提出了一种新算法Fed-GO-UCB，用于具有一般非线性目标函数的联邦赌博机优化。在一些温和的条件下，我们严格证明了Fed-GO-UCB能够对累积遗憾和通信成本同时达到次线性速率。我们理论分析的核心是分布式回归预言机和个体置信集构造，它们本身也具有独立的研究价值。实验评估也证明了所提算法的有效性。
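Fed-GO-UCB targets generic non-linear objectives through distributed regression oracles and confidence sets; the sketch below is only a much simpler linear-bandit stand-in that illustrates the communication pattern, where clients send sufficient statistics rather than raw data and the server aggregates them and broadcasts a shared estimate. Everything here is a simplification, not the paper's algorithm.

```python
import numpy as np

d, n_clients = 4, 8
rng = np.random.default_rng(0)
theta_star = rng.standard_normal(d)      # unknown parameter of the toy environment

def client_round(theta_hat, n_pulls=50):
    """Client acts greedily w.r.t. the broadcast estimate and returns only
    its local sufficient statistics (no raw observations leave the device)."""
    A_loc, b_loc = np.zeros((d, d)), np.zeros(d)
    for _ in range(n_pulls):
        arms = rng.standard_normal((10, d))
        x = arms[np.argmax(arms @ theta_hat)]
        r = x @ theta_star + 0.1 * rng.standard_normal()
        A_loc += np.outer(x, x)
        b_loc += r * x
    return A_loc, b_loc

A, b = np.eye(d), np.zeros(d)            # server-side aggregated statistics
theta_hat = np.zeros(d)
for rnd in range(20):                    # one communication round per iteration
    for _ in range(n_clients):
        A_loc, b_loc = client_round(theta_hat)
        A += A_loc
        b += b_loc
    theta_hat = np.linalg.solve(A, b)    # broadcast the updated estimate
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```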

Amide Proton Transfer (APT) imaging in tumor with a machine learning approach using partially synthetic data

  • paper_url: http://arxiv.org/abs/2311.01683
  • repo_url: None
  • paper_authors: Malvika Viswanathan, Leqi Yin, Yashwant Kurmi, Zhongliang Zu
  • for: 这项研究旨在使用机器学习（ML）模型量化化学交换饱和转移（CEST）效应。
  • methods: 这项研究使用了一种将模拟数据与测量数据结合、生成部分合成CEST数据的新平台，并评估了用这种方法训练ML模型预测酰胺质子转移（APT）效应的可行性。
  • results: 研究结果表明,使用部分合成CEST数据可以准确地预测APT效应,并且在实验中比使用实际数据和完全模拟数据更为稳定和准确。
    Abstract Machine learning (ML) has been increasingly used to quantify chemical exchange saturation transfer (CEST) effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, while training with fully simulated data may introduce bias due to limited simulations pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and to evaluate its feasibility for training ML models to predict amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict APT effect in rat brains bearing 9L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than the training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.
    摘要 机器学习（ML）越来越多地被用于量化化学交换饱和转移（CEST）效应。ML模型通常使用测量数据或完全模拟数据进行训练，但测量数据往往数量不足，而完全模拟数据可能因模拟池有限而引入偏差。本研究提出了一个结合模拟与测量成分来生成部分合成CEST数据的新平台，并评估了用其训练ML模型预测酰胺质子转移（APT）效应的可行性。部分合成CEST信号通过将模拟得到的APT效应与来自测量的其他成分进行逆求和来构建；通过改变APT模拟参数并对测量成分施加缩放因子来生成训练数据，从而在模拟灵活性与保真度之间取得平衡。首先，利用多池模型模拟生成带有真值信息的仿组织CEST信号来验证该方法；其次，分别在部分合成数据、活体数据和完全模拟数据上训练ML模型，用于预测荷载9L肿瘤的大鼠脑中的APT效应。仿组织数据实验表明，基于部分合成数据的ML方法能够准确预测APT；活体实验表明，该方法比使用活体数据和完全合成数据训练得到的预测更准确、更稳健。部分合成CEST数据能够解决传统ML方法面临的挑战。

Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling

  • paper_url: http://arxiv.org/abs/2311.01660
  • repo_url: None
  • paper_authors: Mert Ketenci, Shreyas Bhave, Noémie Elhadad, Adler Perotte
  • for: 这篇论文旨在提出一种新的生存分析方法，以减少使用者需要调整的超参数数量，包括离散模型中的分箱数量与分箱大小，以及基于混合的模型中的簇分配数量。
  • methods: 该方法无需调整混合分配数量和分箱大小等超参数，从而避免了繁琐的调参，并缓解了混合模型的模式塌缩和数值不稳定问题。
  • results: 研究人员通过实验表明，该方法在多个真实数据集上可以达到或超过基线方法的性能，并减少使用者所需的调参时间和精力。
    Abstract Survival analysis is a widely-used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters including: (1) number of bins and bin size for discrete models and (2) number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.
    摘要 生存分析是一种广泛用于分析存在删失的事件时间数据的技术。近年来出现了许多可扩展到大规模数据集、并放宽比例风险等传统假设的生存分析方法。这些模型虽然性能良好，但对模型超参数非常敏感，包括：(1) 离散模型的分箱数量与分箱大小，以及(2) 基于混合的模型的簇分配数量。每一项选择都需要使用者进行大量调参才能获得最佳性能。此外，我们在实证研究中发现：(1) 最优分箱大小可能因关注的指标（如一致性指数与Brier分数）而差异巨大；(2) 混合模型可能出现模式塌缩和数值不稳定。为此，我们提出一种无需调整混合分配和分箱大小等超参数的生存分析方法，减轻使用者的负担。我们在多个真实数据集上证明该方法能够达到或超过基线方法的性能。

Calibrate and Boost Logical Expressiveness of GNN Over Multi-Relational and Temporal Graphs

  • paper_url: http://arxiv.org/abs/2311.01647
  • repo_url: https://github.com/hdmmblz/multi-graph
  • paper_authors: Yeyuan Chen, Dingmin Wang
  • for: 这篇论文主要研究图神经网络（GNNs）作为布尔节点分类器在多关系图（每条边带有特定关系类型）上的逻辑表达能力。
  • methods: 作者研究了R$^2$-GNN架构，该架构通过在局部消息传递GNN中加入全局读出来提高表达能力。
  • results: 研究发现，R$^2$-GNN模型在一般情况下无法捕捉$\mathcal{FOC}_2$分类器，但在某些受限且合理的情况下与$\mathcal{FOC}_2$分类器等价。作者还提出了一种可在线性时间内执行的简单图变换技术，使R$^2$-GNN在变换后的输入图上能够有效捕捉任意$\mathcal{FOC}_2$分类器。
    Abstract As a powerful framework for graph representation learning, Graph Neural Networks (GNNs) have garnered significant attention in recent years. However, to the best of our knowledge, there has been no formal analysis of the logical expressiveness of GNNs as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type. In this paper, we investigate $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. On the negative side, we demonstrate that the R$^2$-GNN architecture, which extends the local message passing GNN by incorporating global readout, fails to capture $\mathcal{FOC}_2$ classifiers in the general case. Nevertheless, on the positive side, we establish that R$^2$-GNNs models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios. To address the limitations of R$^2$-GNNs regarding expressiveness, we propose a simple graph transformation technique, akin to a preprocessing step, which can be executed in linear time. This transformation enables R$^2$-GNNs to effectively capture any $\mathcal{FOC}_2$ classifiers when applied to the "transformed" input graph. Moreover, we extend our analysis of expressiveness and graph transformation to temporal graphs, exploring several temporal GNN architectures and providing an expressiveness hierarchy for them. To validate our findings, we implement R$^2$-GNNs and the graph transformation technique and conduct empirical tests in node classification tasks against various well-known GNN architectures that support multi-relational or temporal graphs. Our experimental results consistently demonstrate that R$^2$-GNN with the graph transformation outperforms the baseline methods on both synthetic and real-world datasets
    摘要 作为一种强大的图表示学习框架，图神经网络（GNNs）近年来受到了广泛关注。然而据我们所知，对于多关系图（每条边带有特定关系类型）上作为布尔节点分类器的GNN，其逻辑表达能力尚缺乏正式分析。在本文中，我们研究$\mathcal{FOC}_2$，即一阶逻辑中带有两个变量和计数量词的片段。在消极的一面，我们证明了通过在局部消息传递GNN中加入全局读出而得到的R$^2$-GNN架构，在一般情况下无法捕捉$\mathcal{FOC}_2$分类器。在积极的一面，我们证明了在某些受限但合理的情形下，R$^2$-GNN模型与$\mathcal{FOC}_2$分类器等价。为了解决R$^2$-GNN在表达能力上的局限，我们提出了一种类似预处理步骤、可在线性时间内执行的简单图变换技术。该变换使R$^2$-GNN在"变换后"的输入图上能够有效捕捉任意$\mathcal{FOC}_2$分类器。此外，我们将表达能力与图变换的分析扩展到时序图，探讨了多种时序GNN架构，并给出了它们的表达能力层级。为了验证我们的发现，我们实现了R$^2$-GNN和图变换技术，并在节点分类任务上与多种支持多关系图或时序图的知名GNN架构进行了实验比较。实验结果一致表明，结合图变换的R$^2$-GNN在合成数据集和真实数据集上均优于基线方法。
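The "local message passing plus global readout" idea can be sketched as a single layer that adds a mean-pooled global term to each node update; the update rule and dimensions below are our assumptions, and the paper's graph transformation itself is not reproduced.

```python
import numpy as np

def r2_layer(H, A, W_self, W_nbr, W_glob):
    """One message-passing step with a global readout:
    h_v' = relu(W_self h_v + W_nbr * mean_{u in N(v)} h_u + W_glob * mean_u h_u)."""
    deg = A.sum(1, keepdims=True).clip(min=1)
    nbr = (A @ H) / deg                       # local neighbor aggregation
    glob = H.mean(0, keepdims=True)           # global readout, broadcast to all nodes
    out = H @ W_self.T + nbr @ W_nbr.T + glob @ W_glob.T
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
n, d_in, d_out = 6, 4, 8
H = rng.standard_normal((n, d_in))            # node features
A = (rng.random((n, n)) < 0.3).astype(float)  # random undirected adjacency
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
Ws = [rng.standard_normal((d_out, d_in)) * 0.1 for _ in range(3)]
H1 = r2_layer(H, A, *Ws)
print(H1.shape)  # (6, 8)
```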

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

  • paper_url: http://arxiv.org/abs/2311.01644
  • repo_url: None
  • paper_authors: Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea
  • for: 本文研究一个神经元数较少的"学生"神经网络应当如何最优地近似一个神经元数较多的单隐层"教师"网络。
  • methods: 本文通过分析欠参数化学生网络拟合教师网络时的临界点结构来研究这一问题，并针对$n=1$的学生网络给出了常用激活函数下非平凡临界点的闭式解。
  • results: 结果表明，在使用erf激活函数和标准高斯输入分布时，"复制-平均"配置是临界点；其中最优配置为$n-1$个学生神经元各复制一个教师神经元，第$n$个学生神经元对其余$k-n+1$个教师神经元取平均。对于ReLU激活函数也得到了类似结果，表明欠参数化网络的最优解具有普适结构。
    Abstract Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.
    摘要 任何连续函数$f^*$都可以被具有足够多神经元$k$的神经网络任意好地近似。我们考虑$f^*$本身是一个具有单个隐藏层和$k$个神经元的神经网络的情形。用一个具有$n<k$个神经元的网络近似$f^*$，可以视为用一个欠参数化的"学生"网络去拟合一个"教师"网络。由于学生的神经元少于教师，尚不清楚每个学生神经元应当复制某个教师神经元，还是对一组教师神经元取平均。对于使用erf激活函数的浅层网络和标准高斯输入分布，我们证明：当教师的输入向量正交归一、输出权重为单位权重时，"复制-平均"配置是临界点；并且在此类配置中，最优解为$n-1$个学生神经元各复制一个教师神经元、第$n$个学生神经元对其余$k-n+1$个教师神经元取平均。对于仅含$n=1$个神经元的学生网络，我们还通过求解等价的带约束优化问题，给出了常用激活函数下非平凡临界点的闭式解。实验上，对于erf激活函数，梯度流要么收敛到最优的复制-平均临界点，要么收敛到另一个每个学生神经元近似复制不同教师神经元的点。最后，对于ReLU激活函数我们得到了类似结果，表明欠参数化网络的最优解具有普适结构。
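The copy-average configuration is easy to instantiate numerically: build a teacher with orthonormal incoming vectors and unit outgoing weights, let n-1 student neurons copy teacher neurons, and let the last student neuron average the remaining ones. In the sketch below the averaging neuron's incoming vector is the plain mean and its outgoing weight is the corresponding sum, which is an illustrative assumption rather than the paper's closed-form optimum; the sketch simply estimates the resulting population loss under standard Gaussian inputs.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, k, n = 10, 6, 3

# teacher: orthonormal incoming vectors (rows), unit outgoing weights
W_teacher = np.linalg.qr(rng.standard_normal((d, k)))[0].T      # (k, d)
a_teacher = np.ones(k)

def net(x, W, a):
    return erf(x @ W.T) @ a

# copy-average student: n-1 neurons copy teacher neurons,
# the last neuron averages the remaining k-n+1 teacher neurons
W_student = np.vstack([W_teacher[: n - 1],
                       W_teacher[n - 1:].mean(0, keepdims=True)])
a_student = np.concatenate([a_teacher[: n - 1],
                            [a_teacher[n - 1:].sum()]])          # assumed scaling

X = rng.standard_normal((100_000, d))                            # standard Gaussian inputs
gap = np.mean((net(X, W_teacher, a_teacher) - net(X, W_student, a_student)) ** 2)
print(f"estimated population loss of the copy-average student: {gap:.4f}")
```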

Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula

  • paper_url: http://arxiv.org/abs/2311.01642
  • repo_url: None
  • paper_authors: Aryaman Reddi, Maximilian Tölle, Jan Peters, Georgia Chalvatzaki, Carlo D’Eramo
  • for: 本文旨在提高强化学习（RL）对对抗攻击和分布偏移的鲁棒性。
  • methods: 本文提出了一种基于熵正则化的新方法，以降低鞍点优化问题的复杂度，并通过调节温度系数以课程方式逐步提高对手的理性程度。
  • results: 对多个MuJoCo运动与导航问题进行了广泛的实验，证明了QARL在整体性能和鲁棒性方面均优于RARL和最近的基线方法。
    Abstract Robustness against adversarial attacks and distribution shifts is a long-standing goal of Reinforcement Learning (RL). To this end, Robust Adversarial Reinforcement Learning (RARL) trains a protagonist against destabilizing forces exercised by an adversary in a competitive zero-sum Markov game, whose optimal solution, i.e., rational strategy, corresponds to a Nash equilibrium. However, finding Nash equilibria requires facing complex saddle point optimization problems, which can be prohibitive to solve, especially for high-dimensional control. In this paper, we propose a novel approach for adversarial RL based on entropy regularization to ease the complexity of the saddle point optimization problem. We show that the solution of this entropy-regularized problem corresponds to a Quantal Response Equilibrium (QRE), a generalization of Nash equilibria that accounts for bounded rationality, i.e., agents sometimes play random actions instead of optimal ones. Crucially, the connection between the entropy-regularized objective and QRE enables free modulation of the rationality of the agents by simply tuning the temperature coefficient. We leverage this insight to propose our novel algorithm, Quantal Adversarial RL (QARL), which gradually increases the rationality of the adversary in a curriculum fashion until it is fully rational, easing the complexity of the optimization problem while retaining robustness. We provide extensive evidence of QARL outperforming RARL and recent baselines across several MuJoCo locomotion and navigation problems in overall performance and robustness.
    摘要 对对抗攻击和分布偏移的鲁棒性是强化学习（RL）的一个长期目标。为此，鲁棒对抗强化学习（RARL）在一个竞争性零和马尔可夫博弈中训练一个主角以对抗对手施加的破坏性力量，其最优解（即理性策略）对应于纳什均衡。然而，求解纳什均衡需要面对复杂的鞍点优化问题，尤其在高维控制中可能难以求解。在本文中，我们提出了一种基于熵正则化的对抗RL新方法，以降低鞍点优化问题的复杂度。我们证明了该熵正则化问题的解对应于量化反应均衡（Quantal Response Equilibrium，QRE），这是纳什均衡的一种推广，它考虑了有限理性，即智能体有时会采取随机动作而非最优动作。关键的是，熵正则化目标与QRE之间的联系使我们可以仅通过调节温度系数来自由调控智能体的理性程度。我们利用这一洞见提出了新算法QARL（Quantal Adversarial RL），它以课程方式逐步提高对手的理性程度直至完全理性，在降低优化问题复杂度的同时保持鲁棒性。我们在多个MuJoCo运动与导航问题上提供了充分的证据，表明QARL在整体性能和鲁棒性方面优于RARL和最近的基线方法。
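The quantal-response ingredient is essentially a temperature-controlled softmax over the adversary's action values, with the temperature annealed toward zero over a curriculum so the adversary becomes increasingly rational; the schedule and values below are illustrative, and the full protagonist/adversary training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantal_response(q_values, temperature):
    """Boltzmann / quantal-response policy: temperature -> 0 recovers argmax
    (fully rational play), large temperature approaches uniform random play
    (bounded rationality)."""
    z = q_values / max(temperature, 1e-8)
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

q_adv = np.array([0.1, 0.5, 0.3, 0.9])   # adversary's action-value estimates (illustrative)

n_iters = 10_000
for it in range(n_iters):
    # curriculum: anneal the temperature so the adversary grows more rational
    temperature = 1.0 * (1.0 - it / n_iters) + 0.01
    probs = quantal_response(q_adv, temperature)
    action = rng.choice(len(q_adv), p=probs)
    # ... protagonist and adversary policy updates would go here ...

print("near-final adversary policy:", np.round(quantal_response(q_adv, 0.01), 3))
```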