cs.LG - 2023-11-02

PPI++: Efficient Prediction-Powered Inference

  • paper_url: http://arxiv.org/abs/2311.01453
  • repo_url: https://github.com/aangelopoulos/ppi_py
  • paper_authors: Anastasios N. Angelopoulos, John C. Duchi, Tijana Zrnic
  • for: This paper proposes a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions.
  • methods: The method builds on prediction-powered inference (PPI), automatically adapting to the quality of the available predictions to compute efficient confidence sets for parameters of any dimensionality.
  • results: Real and synthetic experiments show that the proposed modifications improve computational and statistical efficiency, with confidence sets that always improve on classical intervals using only the labeled data.
    Abstract We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations.
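For intuition, here is a minimal numpy sketch of a power-tuned prediction-powered mean estimate with a normal-approximation confidence interval. The plug-in choice of the tuning weight `lam` is one reasonable rule in the spirit of the paper; the released `ppi_py` package implements the actual estimators.

```python
import numpy as np
from scipy.stats import norm

def ppi_pp_mean_ci(y, yhat, yhat_unlabeled, alpha=0.05):
    """Power-tuned prediction-powered estimate of a mean, with a
    normal-approximation confidence interval (a sketch)."""
    n, N = len(y), len(yhat_unlabeled)
    # Plug-in tuning: lam -> 0 when predictions are useless,
    # lam -> ~1 when they track the labels well.
    lam = np.cov(y, yhat)[0, 1] / (np.var(yhat) * (1 + n / N))
    theta = lam * yhat_unlabeled.mean() + (y - lam * yhat).mean()
    var = lam**2 * yhat_unlabeled.var() / N + (y - lam * yhat).var() / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta - half, theta + half

rng = np.random.default_rng(0)
truth = rng.normal(size=100_000)
y_all = truth + 0.3 * rng.normal(size=truth.size)  # gold labels
f_all = truth + 0.5 * rng.normal(size=truth.size)  # ML predictions
lo, hi = ppi_pp_mean_ci(y_all[:500], f_all[:500], f_all[500:])
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")
```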

Deep Double Descent for Time Series Forecasting: Avoiding Undertrained Models

  • paper_url: http://arxiv.org/abs/2311.01442
  • repo_url: None
  • paper_authors: Valentino Assandri, Sam Heshmati, Burhaneddin Yaman, Anton Iakovlev, Ariel Emiliano Repetur
  • for: This work studies the training schema of deep learning models for time series forecasting, i.e., how models are trained regardless of their architecture.
  • methods: Extensive experiments investigate epoch-wise deep double descent in several Transformer models trained on public time series datasets, showing that overfitting can be reverted using more epochs.
  • results: The approach achieves state-of-the-art long-sequence forecasting results on nearly 70% of the 72 benchmarks tested, suggesting untapped potential in many published models. A taxonomy for classifying training schema modifications is also introduced, covering data augmentation, model inputs, model targets, time series per model, and computational budget.
    Abstract Deep learning models, particularly Transformers, have achieved impressive results in various domains, including time series forecasting. While existing time series literature primarily focuses on model architecture modifications and data augmentation techniques, this paper explores the training schema of deep learning models for time series; how models are trained regardless of their architecture. We perform extensive experiments to investigate the occurrence of deep double descent in several Transformer models trained on public time series data sets. We demonstrate epoch-wise deep double descent and that overfitting can be reverted using more epochs. Leveraging these findings, we achieve state-of-the-art results for long sequence time series forecasting in nearly 70% of the 72 benchmarks tested. This suggests that many models in the literature may possess untapped potential. Additionally, we introduce a taxonomy for classifying training schema modifications, covering data augmentation, model inputs, model targets, time series per model, and computational budget.
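The training-schema point is easy to reproduce in miniature. The sketch below trains a toy MLP forecaster (not one of the paper's Transformers) on a noisy sine series far past the first signs of overfitting while logging validation loss, which is how epoch-wise double descent is observed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.linspace(0, 60, 1200)
series = torch.sin(t) + 0.4 * torch.randn_like(t)          # noisy series
X = torch.stack([series[i:i + 24] for i in range(1100)])    # 24-step windows
Y = series[24:1124].unsqueeze(1)                            # one-step-ahead targets
Xtr, Ytr, Xva, Yva = X[:800], Y[:800], X[800:], Y[800:]

model = nn.Sequential(nn.Linear(24, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

for epoch in range(3001):          # keep training well past the first overfit
    opt.zero_grad()
    loss = mse(model(Xtr), Ytr)
    loss.backward()
    opt.step()
    if epoch % 250 == 0:
        with torch.no_grad():
            print(f"epoch {epoch:5d}  train {loss.item():.4f}  "
                  f"val {mse(model(Xva), Yva).item():.4f}")
```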

Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time

  • paper_url: http://arxiv.org/abs/2311.01435
  • repo_url: None
  • paper_authors: Xinyuan Cao, Santosh S. Vempala
  • for: Learning high-dimensional halfspaces with margins to within a desired total variation (TV) distance.
  • methods: A polynomial-time algorithm that needs no labels and uses only the first two moments of suitable re-weightings of the empirical distribution, called contrastive moments.
  • results: The sample and time complexity are polynomial in the dimension and $1/\epsilon$, with guaranteed accuracy of the learned halfspace.
    Abstract We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.

Identifying Alzheimer Disease Dementia Levels Using Machine Learning Methods

  • paper_url: http://arxiv.org/abs/2311.01428
  • repo_url: None
  • paper_authors: Md Gulzar Hussain, Ye Shiren
  • for: This study aims to provide an effective approach for Alzheimer's disease (AD) dementia-stage classification using machine learning or deep learning models.
  • methods: RF, SVM, and CNN algorithms are used, augmented with watershed segmentation for feature extraction from MRI images.
  • results: SVM with watershed features achieves a high accuracy of 96.25%, surpassing the other classification methods.
    Abstract Dementia, a prevalent neurodegenerative condition, is a major manifestation of Alzheimer's disease (AD). As the condition progresses from mild to severe, it significantly impairs the individual's ability to perform daily tasks independently, necessitating the need for timely and accurate AD classification. Machine learning or deep learning models have emerged as effective tools for this purpose. In this study, we suggested an approach for classifying the four stages of dementia using RF, SVM, and CNN algorithms, augmented with watershed segmentation for feature extraction from MRI images. Our results reveal that SVM with watershed features achieves an impressive accuracy of 96.25%, surpassing other classification methods. The ADNI dataset is utilized to evaluate the effectiveness of our method, and we observed that the inclusion of watershed segmentation contributes to the enhanced performance of the models.
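A hypothetical sketch of the watershed-plus-SVM pipeline on random stand-in images; the histogram of watershed region means is an assumed featurization, not necessarily the paper's exact feature set.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import sobel
from skimage.segmentation import watershed
from sklearn.svm import SVC

def watershed_features(img, n_bins=16):
    """Segment with watershed on the gradient image and summarize the
    region-mean intensities as a histogram (an assumed featurization)."""
    labels = watershed(sobel(img))    # local gradient minima act as markers
    means = ndi.mean(img, labels=labels, index=np.unique(labels))
    hist, _ = np.histogram(means, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
imgs = rng.random((40, 64, 64))       # stand-ins for MRI slices
y = rng.integers(0, 4, size=40)       # four dementia stages
F = np.array([watershed_features(im) for im in imgs])
clf = SVC(kernel="rbf").fit(F[:30], y[:30])
print("toy accuracy:", clf.score(F[30:], y[30:]))
```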

Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

  • paper_url: http://arxiv.org/abs/2311.01420
  • repo_url: None
  • paper_authors: Cheng-Hao Tu, Hong-You Chen, Zheda Mai, Jike Zhong, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, Wei-Lun Chao
  • for: This work studies adapting a pre-trained source model to a target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. The problem is practical, since it is unrealistic for end-users to collect data for all classes before adaptation, yet it has received limited attention in the literature. Through constructed benchmark datasets and extensive experiments, the paper uncovers a dilemma: adapting to the new target domain is important for performance, but preserving the accuracy of classes missing from the adaptation data is highly challenging, let alone improving them.
  • methods: Two key directions are identified to resolve this dilemma: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. Several effective solutions are proposed along these lines.
  • results: Experiments show the methods maintain accuracy on the missing classes while enhancing overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.
    Abstract We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma -- on the one hand, adapting to the new target domain is important to claim better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.
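One plausible instantiation of "preserving class relationships" is to distill the frozen source model's outputs over all classes while fitting the partial-label target data. The PyTorch sketch below illustrates that idea only; it is not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def holistic_finetune_step(model, source_model, x, y, opt, alpha=1.0, T=2.0):
    """Fit the partial-label target data while distilling the frozen
    source model over all classes, keeping missing classes alive."""
    logits = model(x)
    with torch.no_grad():
        src_logits = source_model(x)
    ce = F.cross_entropy(logits, y)                  # adapt to target labels
    kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                  F.softmax(src_logits / T, dim=1),
                  reduction="batchmean") * T * T     # preserve class relations
    loss = ce + alpha * kd
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = torch.nn.Linear(16, 10)                      # 10 source classes
source_model = torch.nn.Linear(16, 10).eval()        # frozen source model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,)) # labels cover 4 of 10 classes
holistic_finetune_step(model, source_model, x, y, opt)
```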

A Coreset-based, Tempered Variational Posterior for Accurate and Scalable Stochastic Gaussian Process Inference

  • paper_url: http://arxiv.org/abs/2311.01409
  • repo_url: None
  • paper_authors: Mert Ketenci, Adler Perotte, Noémie Elhadad, Iñigo Urteaga
  • for: This work proposes a novel stochastic variational Gaussian process (SVGP) inference method for accurate and scalable GP inference.
  • methods: The method is based on a posterior over a learnable set of weighted pseudo input-output points (a coreset); the coreset-based, variational tempered family (CVTGP) is defined in terms of the GP prior and the data likelihood, accommodating the modeling inductive biases.
  • results: Analysis of CVTGP's lower bound shows it reduces the learnable parameter size to $\mathcal{O}(M)$, maintains sparse and explainable representations of the data, and delivers better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic GP inference methods on simulated and real-world regression tasks.
    Abstract We present a novel stochastic variational Gaussian process ($\mathcal{GP}$) inference method, based on a posterior over a learnable set of weighted pseudo input-output points (coresets). Instead of a free-form variational family, the proposed coreset-based, variational tempered family for $\mathcal{GP}$s (CVTGP) is defined in terms of the $\mathcal{GP}$ prior and the data-likelihood; hence, accommodating the modeling inductive biases. We derive CVTGP's lower bound for the log-marginal likelihood via marginalization of the proposed posterior over latent $\mathcal{GP}$ coreset variables, and show it is amenable to stochastic optimization. CVTGP reduces the learnable parameter size to $\mathcal{O}(M)$, enjoys numerical stability, and maintains $\mathcal{O}(M^3)$ time- and $\mathcal{O}(M^2)$ space-complexity, by leveraging a coreset-based tempered posterior that, in turn, provides sparse and explainable representations of the data. Results on simulated and real-world regression problems with Gaussian observation noise validate that CVTGP provides better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic $\mathcal{GP}$ inference methods.

Normalizing flows as approximations of optimal transport maps via linear-control neural ODEs

  • paper_url: http://arxiv.org/abs/2311.01404
  • repo_url: None
  • paper_authors: Alessandro Scagliotti, Sara Farinelli
  • for: To compute the optimal transport map between two probability measures $\mu$ and $\nu$.
  • methods: Deep neural networks are used to construct invertible transport maps, recovering the $W_2$-optimal transport map as the flow of a linear-control neural ODE.
  • results: A numerical method based on a $\Gamma$-convergence argument is proposed, enabling practical computation of the approximated optimal transport map.
    Abstract The term "Normalizing Flows" is related to the task of constructing invertible transport maps between probability measures by means of deep neural networks. In this paper, we consider the problem of recovering the $W_2$-optimal transport map $T$ between absolutely continuous measures $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$ as the flow of a linear-control neural ODE. We first show that, under suitable assumptions on $\mu,\nu$ and on the controlled vector fields, the optimal transport map is contained in the $C^0_c$-closure of the flows generated by the system. Assuming that discrete approximations $\mu_N,\nu_N$ of the original measures $\mu,\nu$ are available, we use a discrete optimal coupling $\gamma_N$ to define an optimal control problem. With a $\Gamma$-convergence argument, we prove that its solutions correspond to flows that approximate the optimal transport map $T$. Finally, taking advantage of the Pontryagin Maximum Principle, we propose an iterative numerical scheme for the resolution of the optimal control problem, resulting in an algorithm for the practical computation of the approximated optimal transport map.

Time-series Generation by Contrastive Imitation

  • paper_url: http://arxiv.org/abs/2311.01388
  • repo_url: None
  • paper_authors: Daniel Jarrett, Ioana Bica, Mihaela van der Schaar
  • for: The goal is to learn a generative model for time-series data, addressing the unique challenge of the sequential setting: the generator must capture the conditional dynamics of stepwise transitions while its open-loop rollouts preserve the joint distribution of multi-step trajectories.
  • methods: A hybrid framework is used: a global (but stepwise-decomposable) energy model is learned by contrastive estimation, and a local (but forward-looking) transition policy is optimized against it. At training time the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives; at inference, the learned policy serves as the generator for iterative sampling and the learned energy as a trajectory-level measure of sample quality.
  • results: Experiments show the approach generates predictively useful samples from real-world datasets, performing at the standard of existing benchmarks.
    Abstract Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.

Monotone Generative Modeling via a Gromov-Monge Embedding

  • paper_url: http://arxiv.org/abs/2311.01375
  • repo_url: None
  • paper_authors: Wonjun Lee, Yifei Yang, Dongmian Zou, Gilad Lerman
  • for: This paper proposes a deep generative model that addresses the sensitivity to starting conditions and mode collapse observed in generative adversarial networks (GANs).
  • methods: The method uses the Gromov-Monge embedding (GME) to identify the low-dimensional structure of the underlying data measure and map it, while preserving its geometry, into a low-dimensional latent space, which is then optimally transported to the reference measure. Preservation of the geometry is guaranteed by the GME and by $c$-cyclical monotonicity of the generative map, where $c$ is an intrinsic embedding cost employed by the GME.
  • results: Numerical experiments demonstrate that the method generates high-quality images, avoids mode collapse, and exhibits robustness to different starting conditions.
    Abstract Generative Adversarial Networks (GANs) are powerful tools for creating new content, but they face challenges such as sensitivity to starting conditions and mode collapse. To address these issues, we propose a deep generative model that utilizes the Gromov-Monge embedding (GME). It helps identify the low-dimensional structure of the underlying measure of the data and then maps it, while preserving its geometry, into a measure in a low-dimensional latent space, which is then optimally transported to the reference measure. We guarantee the preservation of the underlying geometry by the GME and $c$-cyclical monotonicity of the generative map, where $c$ is an intrinsic embedding cost employed by the GME. The latter property is a first step in guaranteeing better robustness to initialization of parameters and mode collapse. Numerical experiments demonstrate the effectiveness of our approach in generating high-quality images, avoiding mode collapse, and exhibiting robustness to different starting conditions.

Respiratory Anomaly Detection using Reflected Infrared Light-wave Signals

  • paper_url: http://arxiv.org/abs/2311.01367
  • repo_url: None
  • paper_authors: Md Zobaer Islam, Brenden Martin, Carly Gotcher, Tyler Martinez, John F. O’Hara, Sabit Ekin
  • for: This study develops a non-contact respiratory anomaly detection method using incoherent light-wave signals reflected from the chest of a mechanical robot that can breathe like a human.
  • methods: The approach uses only a low-cost, ubiquitous infrared light-emitting diode and photodetector, with machine learning models recognizing breathing anomalies from variations in the reflected light intensity.
  • results: Experiments show the system classifies 7 different types of breathing data within a 0.5m-1.5m range with up to 96.6% average accuracy, and can also detect faulty data that contains no breathing information. The system can be used at home or in healthcare facilities as a smart, non-contact, and discreet respiration monitoring method.
    Abstract In this study, we present a non-contact respiratory anomaly detection method using incoherent light-wave signals reflected from the chest of a mechanical robot that can breathe like human beings. In comparison to existing radar and camera-based sensing systems for vitals monitoring, this technology uses only a low-cost ubiquitous light source (e.g., infrared light emitting diode) and sensor (e.g., photodetector). This light-wave sensing (LWS) system recognizes different breathing anomalies from the variations of light intensity reflected from the chest of the robot within a 0.5m-1.5m range. The anomaly detection model demonstrates up to 96.6% average accuracy in classifying 7 different types of breathing data using machine learning. The model can also detect faulty data collected by the system that does not contain breathing information. The developed system can be utilized at home or healthcare facilities as a smart, non-contact and discreet respiration monitoring method.
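As a toy analogue, the sketch below classifies synthetic light-intensity breathing waveforms using simple spectral features and a random forest; the signals, features, and anomaly classes are all stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
fs, seconds = 50, 10                                  # 50 Hz photodetector samples

def breathing(rate_hz, depth, n):
    """Synthetic reflected-light intensity for n windows of breathing."""
    t = np.arange(n * fs * seconds) / fs
    sig = depth * np.sin(2 * np.pi * rate_hz * t) + 0.05 * rng.normal(size=t.size)
    return sig.reshape(n, fs * seconds)

X = np.vstack([breathing(0.25, 1.0, 50),              # normal (~15 breaths/min)
               breathing(0.60, 1.0, 50),              # abnormally fast
               breathing(0.25, 0.2, 50)])             # abnormally shallow
y = np.repeat([0, 1, 2], 50)

def features(w):                                      # simple spectral features
    spec = np.abs(np.fft.rfft(w))
    return [spec.argmax(), spec.max(), w.std()]

F = np.array([features(w) for w in X])
Xtr, Xte, ytr, yte = train_test_split(F, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
print("toy accuracy:", clf.score(Xte, yte))
```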

On the Lipschitz constant of random neural networks

  • paper_url: http://arxiv.org/abs/2311.01356
  • repo_url: None
  • paper_authors: Paul Geuchen, Thomas Heindl, Dominik Stöger, Felix Voigtlaender
  • for: This paper studies the Lipschitz constant of random ReLU neural networks, which quantifies worst-case robustness against adversarial perturbations.
  • methods: Theoretical analysis of networks whose weights are chosen at random and which employ the ReLU activation function.
  • results: For shallow networks, the Lipschitz constant is characterized up to an absolute numerical constant; for deep networks of sufficiently large width, upper and lower bounds are proven that match up to a logarithmic factor depending on the depth.
    Abstract Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. However, only few theoretical results regarding this quantity exist in the literature. In this paper, we initiate the study of the Lipschitz constant of random ReLU neural networks, i.e., neural networks whose weights are chosen at random and which employ the ReLU activation function. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. Moreover, we extend our analysis to deep neural networks of sufficiently large width where we prove upper and lower bounds for the Lipschitz constant. These bounds match up to a logarithmic factor that depends on the depth.
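The quantity under study is easy to probe empirically: for a random ReLU network, the maximum input-gradient norm over sampled inputs lower-bounds the Lipschitz constant, while the product of layer spectral norms gives a naive upper bound for comparison. A PyTorch sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, width, depth = 50, 1000, 5
layers = []
for i in range(depth):
    layers += [nn.Linear(d if i == 0 else width, width), nn.ReLU()]
net = nn.Sequential(*layers, nn.Linear(width, 1))   # default random init

def grad_norm(x):
    x = x.clone().requires_grad_(True)
    net(x).backward()
    return x.grad.norm().item()

# Max gradient norm over random inputs lower-bounds the Lipschitz constant.
lower = max(grad_norm(torch.randn(d)) for _ in range(200))

# Product of spectral norms is the naive upper bound.
upper = 1.0
for m in net:
    if isinstance(m, nn.Linear):
        upper *= torch.linalg.matrix_norm(m.weight, ord=2).item()

print(f"empirical lower bound {lower:.2f}  vs  product-of-norms bound {upper:.2f}")
```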

Unreading Race: Purging Protected Features from Chest X-ray Embeddings

  • paper_url: http://arxiv.org/abs/2311.01349
  • repo_url: None
  • paper_authors: Tobias Weber, Michael Ingrisch, Bernd Bischl, David Rügamer
  • for: This study aims to analyze and remove the effects of protected features (e.g., age, sex, race) in chest radiograph embeddings of deep learning models.
  • methods: An orthogonalization is used to remove the influence of the protected features from the embeddings, ensuring feature-independent results.
  • results: Protected features are shown to significantly influence pathology predictions; orthogonalization removes these effects while maintaining competitive predictive performance, makes it infeasible to directly predict the protected attributes, and mitigates subgroup disparities.
    Abstract Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Materials and Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in chest radiograph embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray classification.
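A minimal sketch of linear orthogonalization, assuming the protected features enter the embeddings linearly (the paper's exact procedure may differ): project the embeddings onto the orthogonal complement of the space spanned by the protected features.

```python
import numpy as np

def orthogonalize(E, C):
    """Project embeddings E (n x d) onto the orthogonal complement of the
    column space of the protected features C (n x p), removing their
    linear influence."""
    C = np.column_stack([np.ones(len(C)), C])   # include an intercept
    P = C @ np.linalg.pinv(C)                   # projection onto span(C)
    return E - P @ E                            # residual embeddings

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, size=500)
E = rng.normal(size=(500, 32))
E[:, 0] += 0.05 * age                           # leak age into one dimension
E_orth = orthogonalize(E, age[:, None])
print("corr before:", np.corrcoef(E[:, 0], age)[0, 1])
print("corr after: ", np.corrcoef(E_orth[:, 0], age)[0, 1])
```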

High-dimensional Linear Bandits with Knapsacks

  • paper_url: http://arxiv.org/abs/2311.01327
  • repo_url: None
  • paper_authors: Wanteng Ma, Dong Xia, Jiashuo Jiang
  • for: This work studies the contextual bandits with knapsacks (CBwK) problem in the high-dimensional setting, where the reward of pulling each arm equals a sparse high-dimensional weight vector multiplied by the feature of the current arrival, plus random noise.
  • methods: An online variant of the hard thresholding algorithm performs sparse estimation in an online manner, combined with a primal-dual framework that assigns a dual variable to each knapsack constraint and updates it via online learning to control consumption of the knapsack capacity.
  • results: The integrated approach achieves sublinear regret that depends only logarithmically on the feature dimension, improving the polynomial dependency in prior work; applied to the high-dimensional contextual bandit problem without knapsack constraints, it attains optimal regret in both the data-poor and data-rich regimes. Numerical experiments confirm the efficient empirical performance in high dimensions.
    Abstract We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the multiplication of a sparse high-dimensional weight vector and the feature of the current arrival, with additional random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimension contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to show the efficient empirical performance of our algorithms under the high dimensional setting.
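The online sparse-estimation ingredient can be sketched as a hard-thresholded gradient iteration on streaming contexts. The paper's full algorithm couples such an estimator with a primal-dual controller for the knapsack constraints, which this toy omits.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

rng = np.random.default_rng(0)
d, s, eta = 500, 5, 0.02
beta_true = np.zeros(d)
beta_true[:s] = rng.normal(size=s)
beta_hat = np.zeros(d)

for t in range(20_000):
    x = rng.normal(size=d)                      # context of the pulled arm
    r = x @ beta_true + 0.1 * rng.normal()      # observed noisy reward
    grad = (x @ beta_hat - r) * x               # single-sample LS gradient
    beta_hat = hard_threshold(beta_hat - eta * grad, s)

print("support recovered:", np.nonzero(beta_hat)[0])
print("max coefficient error:", np.abs(beta_hat - beta_true).max())
```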

Long-Range Neural Atom Learning for Molecular Graphs

  • paper_url: http://arxiv.org/abs/2311.01276
  • repo_url: None
  • paper_authors: Xuan Li, Zhanke Zhou, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
  • for: Improving the ability of Graph Neural Networks (GNNs) in drug discovery to capture long-range interactions (LRI), which, together with short-range interactions, are crucial for determining molecular properties.
  • methods: The method implicitly projects all original atoms into a few Neural Atoms, which abstract the collective information of atomic groups within a molecule. Information is explicitly exchanged among the neural atoms and projected back to the atoms' representations as an enhancement, establishing communication channels among distant nodes and effectively reducing the interaction scope of arbitrary node pairs to a single hop.
  • results: Extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs, show the method can be equipped with an arbitrary GNN and helps capture LRI.
    Abstract Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs are mainly good at leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method that implicitly projects all original atoms into a few Neural Atoms, which abstracts the collective information of atomic groups within a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection with the traditional LRI calculation method, Ewald Summation. We conduct extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs. We empirically justify that our method can be equipped with an arbitrary GNN and help to capture LRI.
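A rough PyTorch sketch of the projection idea described in the abstract; the specific module choices (learnable queries, attention in both directions) are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class NeuralAtomLayer(nn.Module):
    """Project atom features onto K 'neural atoms', let those exchange
    information, then project back: a one-hop channel between atoms
    that are far apart in the molecular graph."""
    def __init__(self, dim=64, num_neural_atoms=8, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_neural_atoms, dim))
        self.to_na = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mix = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.back = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h):                  # h: (batch, n_atoms, dim)
        q = self.queries.unsqueeze(0).expand(h.size(0), -1, -1)
        na, _ = self.to_na(q, h, h)        # atoms -> neural atoms
        na = self.mix(na)                  # exchange among neural atoms
        out, _ = self.back(h, na, na)      # neural atoms -> atoms
        return h + out                     # residual enhancement

layer = NeuralAtomLayer()
h = torch.randn(2, 30, 64)                 # two molecules, 30 atoms each
print(layer(h).shape)                      # torch.Size([2, 30, 64])
```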

Sanitized Clustering against Confounding Bias

  • paper_url: http://arxiv.org/abs/2311.01252
  • repo_url: https://github.com/evaflower/scab
  • paper_authors: Yinghua Yao, Yuangang Pan, Jing Li, Ivor W. Tsang, Xin Yao
  • for: This paper aims to address the issue of confounding bias in cluster analysis by proposing a new framework called Sanitized Clustering Against confounding Bias (SCAB).
  • methods: The SCAB framework uses a Variational Auto-Encoder (VAE) to eliminate the confounding bias in the semantic latent space of complex data by minimizing the mutual information between the confounding factor and the latent representation.
  • results: The proposed SCAB framework achieves a significant gain in clustering performance by removing the confounding bias, as demonstrated through extensive experiments on complex datasets. The code is available at \url{https://github.com/EvaFlower/SCAB}.
    Abstract Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace expanded by the confounding factor before clustering. Therein, the interested clustering factor and the confounding factor are coarsely considered in the raw feature space, where the correlation between the data and the confounding factor is ideally assumed to be linear for convenient solutions. These approaches are thus limited in scope as the data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. To be specific, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at \url{https://github.com/EvaFlower/SCAB}.

Gaussian Processes on Cellular Complexes

  • paper_url: http://arxiv.org/abs/2311.01198
  • repo_url: None
  • paper_authors: Mathieu Alain, So Takao, Brooks Paige, Marc Peter Deisenroth
  • for: This paper develops machine learning models on graph-like structures that exploit topological inductive biases while also accounting for uncertainty.
  • methods: Gaussian processes (GPs) are defined on cellular complexes, a generalization of graphs that captures polyadic interactions among vertices, edges, and their higher-order generalizations (cells); two novel kernels are proposed, one generalizing the graph Matérn kernel and one that additionally mixes information of different cell types.
  • results: The resulting GP models go beyond dyadic vertex relations, capturing higher-order structure of complex data while retaining the uncertainty quantification of Gaussian processes.
    Abstract In recent years, there has been considerable interest in developing machine learning models on graphs in order to account for topological inductive biases. In particular, recent attention was given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Mat\'ern kernel and one that additionally mixes information of different cell types.

Combating Bilateral Edge Noise for Robust Link Prediction

  • paper_url: http://arxiv.org/abs/2311.01196
  • repo_url: https://github.com/tmlr-group/rgib
  • paper_authors: Zhanke Zhou, Jiangchao Yao, Jiaxu Liu, Xiawei Guo, Quanming Yao, Li He, Liang Wang, Bo Zheng, Bo Han
  • for: This work aims to improve the robustness of link prediction with graph neural networks (GNNs) under edge noise.
  • methods: An information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), is proposed to extract reliable supervision signals and avoid representation collapse. Unlike the basic information bottleneck, RGIB decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives that are robust against bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, leverage self-supervised learning and data reparameterization for implicit and explicit data denoising, respectively.
  • results: Extensive experiments on six datasets and three GNNs under diverse noisy scenarios verify the effectiveness of the RGIB instantiations.
    Abstract Although link prediction on graphs has achieved great success with the development of graph neural networks (GNNs), the potential robustness under the edge noise is still less investigated. To close this gap, we first conduct an empirical study to disclose that the edge noise bilaterally perturbs both input topology and target label, yielding severe performance degradation and representation collapse. To address this dilemma, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse. Different from the basic information bottleneck, RGIB further decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives for robust representation against the bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, are explored to leverage the merits of different methodologies, i.e., self-supervised learning and data reparameterization, for implicit and explicit data denoising, respectively. Extensive experiments on six datasets and three GNNs with diverse noisy scenarios verify the effectiveness of our RGIB instantiations. The code is publicly available at: https://github.com/tmlr-group/RGIB.

Add and Thin: Diffusion for Temporal Point Processes

  • paper_url: http://arxiv.org/abs/2311.01139
  • repo_url: None
  • paper_authors: David Lüdke, Marin Biloš, Oleksandr Shchur, Marten Lienen, Stephan Günnemann
  • for: This paper proposes a principled probabilistic denoising diffusion model for modeling and forecasting continuous-time event data.
  • methods: The model operates within the temporal point process (TPP) framework on entire event sequences; unlike existing diffusion approaches, it naturally handles data with both discrete and continuous components, avoiding the error accumulation of sequential one-step-ahead models.
  • results: On synthetic and real-world datasets, the model matches state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.
    Abstract Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.

Generating QM1B with PySCF$_{\text{IPU}}$

  • paper_url: http://arxiv.org/abs/2311.01135
  • repo_url: https://github.com/graphcore-research/pyscf-ipu
  • paper_authors: Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters
  • for: To bring the dataset-scale progress that foundation models enabled in computer vision and natural language processing to quantum chemistry, where deep learning has been constrained by comparatively small datasets.
  • methods: The data generator PySCF$_{\text{IPU}}$, running on Intelligence Processing Units (IPUs), is introduced to create large numbers of training examples, producing the dataset QM1B with one billion training examples containing 9-11 heavy atoms.
  • results: A simple baseline neural network (SchNet 9M) improves its performance simply by increasing the amount of training data, without additional inductive biases. Several limitations of QM1B and the low resolution of its DFT options are highlighted, motivating even larger, more accurate datasets.
    Abstract The emergence of foundation models in Computer Vision and Natural Language Processing have resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on Github: http://github.com/graphcore-research/pyscf-ipu
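For reference, a single DFT label of the kind stored in QM1B can be computed with standard CPU PySCF as below. The functional and basis are illustrative (the paper itself flags the low resolution of its DFT options), and the dataset itself was generated with the pyscf-ipu generator on IPUs.

```python
from pyscf import gto, dft

# One water molecule; QM1B stores DFT labels of this kind per molecule.
mol = gto.M(atom="O 0 0 0; H 0 0.757 0.587; H 0 -0.757 0.587",
            basis="sto-3g")
mf = dft.RKS(mol)
mf.xc = "b3lyp"
energy = mf.kernel()                     # total energy in Hartree
print(f"E(B3LYP/STO-3G) = {energy:.6f} Ha")
```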

AI for Interpretable Chemistry: Predicting Radical Mechanistic Pathways via Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.01118
  • repo_url: None
  • paper_authors: Mohammadamin Tavakoli, Yin Ting T. Chiu, Alexander Shmakov, Ann Marie Carlton, David Van Vranken, Pierre Baldi
  • for: This paper aims to address the limitations of existing deep learning-based reaction predictors by introducing a new system called RMechRP, which leverages contrastive learning and mechanistic pathways to provide more interpretable and generalizable predictions of radical reactions.
  • methods: The authors use a public database of radical reactions, RMechDB, to develop and train multiple deep-learning models. They employ contrastive learning to learn a representation of chemical reactions based on mechanistic pathways, the most interpretable representation of chemical reactions.
  • results: The authors demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, establishing the first benchmark for predicting radical reactions and showing potential for various applications in atmospheric chemistry.
    Abstract Deep learning-based reaction predictors have undergone significant architectural evolution. However, their reliance on reactions from the US Patent Office results in a lack of interpretable predictions and limited generalization capability to other chemistry domains, such as radical and atmospheric chemistry. To address these challenges, we introduce a new reaction predictor system, RMechRP, that leverages contrastive learning in conjunction with mechanistic pathways, the most interpretable representation of chemical reactions. Specifically designed for radical reactions, RMechRP provides different levels of interpretation of chemical reactions. We develop and train multiple deep-learning models using RMechDB, a public database of radical reactions, to establish the first benchmark for predicting radical reactions. Our results demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, and its potential for various applications in atmospheric chemistry.

In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer

  • paper_url: http://arxiv.org/abs/2311.01106
  • repo_url: None
  • paper_authors: Yuzhou Cao, Hussein Mozannar, Lei Feng, Hongxin Wei, Bo An
  • for: To improve the safety and performance of machine learning classifiers by jointly learning how to classify and when to defer the decision to a downstream expert.
  • methods: The learning-to-defer framework is used with a novel statistically consistent asymmetric softmax-based surrogate loss; the paper first shows that the unbounded, miscalibrated estimates of prior approaches stem from the symmetric nature of the surrogate losses rather than from the softmax parameterization itself.
  • results: The proposed surrogate produces valid probability estimates without the unboundedness issue; its non-asymptotic properties are analyzed, and its performance and calibration are empirically validated on benchmark datasets.
    Abstract Enabling machine learning classifiers to defer their decision to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework which aims to jointly learn how to classify and how to defer to the expert. In recent studies, it has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and if we can find a softmax-based estimator that is both statistically consistent and possesses a valid probability estimator. In this work, we first show that the cause of the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used and not due to softmax. We then propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our method and empirically validate its performance and calibration on benchmark datasets.
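For context, this is the widely used symmetric softmax surrogate for learning to defer (K class scores plus one defer score) whose calibration the paper analyzes; the proposed asymmetric softmax modifies this parametrization.

```python
import torch
import torch.nn.functional as F

def l2d_softmax_surrogate(logits, y, expert_correct):
    """Symmetric softmax surrogate for learning to defer: `logits` holds
    K class scores plus one defer score in the last slot; the defer slot
    is rewarded whenever the expert would have been correct."""
    logp = F.log_softmax(logits, dim=1)
    cls_lp = logp.gather(1, y.unsqueeze(1)).squeeze(1)  # log p(class = y)
    defer_lp = logp[:, -1]                              # log p(defer)
    return -(cls_lp + expert_correct.float() * defer_lp).mean()

logits = torch.randn(16, 11, requires_grad=True)  # 10 classes + defer
y = torch.randint(0, 10, (16,))
expert_correct = torch.rand(16) > 0.3             # whether the expert is right
l2d_softmax_surrogate(logits, y, expert_correct).backward()
```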

Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.01075
  • repo_url: https://github.com/niiceMing/CMTA
  • paper_authors: Siming Lan, Rui Zhang, Qi Yi, Jiaming Guo, Shaohui Peng, Yunkai Gao, Fan Wu, Ruizhi Chen, Zidong Du, Xing Hu, Xishan Zhang, Ling Li, Yunji Chen
  • for: This paper focuses on the modular principle in multi-task reinforcement learning, i.e., specializing functionalities into different modules and combining them appropriately, to prevent the negative transfer problem where performance degrades due to conflicts between tasks.
  • methods: The Contrastive Modules with Temporal Attention (CMTA) method constrains modules to be different from each other via contrastive learning and combines shared modules at a finer granularity than the task level using temporal attention, alleviating negative transfer within tasks and improving the expressiveness and generalization of modular methods.
  • results: On Meta-World, a multi-task RL benchmark of diverse robotics manipulation tasks, CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements over the baselines.
    Abstract In the field of multi-task reinforcement learning, the modular principle, which involves specializing functionalities into different modules and combining them appropriately, has been widely adopted as a promising approach to prevent the negative transfer problem that performance degradation due to conflicts between tasks. However, most of the existing multi-task RL methods only combine shared modules at the task level, ignoring that there may be conflicts within the task. In addition, these methods do not take into account that without constraints, some modules may learn similar functions, resulting in restricting the model's expressiveness and generalization capability of modular methods. In this paper, we propose the Contrastive Modules with Temporal Attention(CMTA) method to address these limitations. CMTA constrains the modules to be different from each other by contrastive learning and combining shared modules at a finer granularity than the task level with temporal attention, alleviating the negative transfer within the task and improving the generalization ability and the performance for multi-task RL. We conducted the experiment on Meta-World, a multi-task RL benchmark containing various robotics manipulation tasks. Experimental results show that CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements over the baselines.

Deep Learning for real-time neural decoding of grasp

  • paper_url: http://arxiv.org/abs/2311.01061
  • repo_url: None
  • paper_authors: Paolo Viviani, Ilaria Gesmundo, Elios Ghinato, Andres Agudelo-Toro, Chiara Vercellino, Giacomo Vitali, Letizia Bergamasco, Alberto Scionti, Marco Ghislieri, Valentina Agostini, Olivier Terzo, Hansjörg Scherberger
  • for: This paper proposes a deep-learning-based approach to neural decoding for classifying grasp types from neural signals.
  • methods: LSTM networks classify time series of neural data (i.e., spike trains) recorded from monkey motor cortex into classes representing the object being grasped, relying only on the ability of deep learning models to extract correlations from data, without any prior neuroscience knowledge.
  • results: The approach achieves a significant improvement in classification accuracy over previous works on the same dataset, even when considering simulated real-time decoding.
    Abstract Neural decoding involves correlating signals acquired from the brain to variables in the physical world like limb movement or robot control in Brain Machine Interfaces. In this context, this work starts from a specific pre-existing dataset of neural recordings from monkey motor cortex and presents a Deep Learning-based approach to the decoding of neural signals for grasp type classification. Specifically, we propose here an approach that exploits LSTM networks to classify time series containing neural data (i.e., spike trains) into classes representing the object being grasped. The main goal of the presented approach is to improve over state-of-the-art decoding accuracy without relying on any prior neuroscience knowledge, and leveraging only the capability of deep learning models to extract correlations from data. The paper presents the results achieved for the considered dataset and compares them with previous works on the same dataset, showing a significant improvement in classification accuracy, even if considering simulated real-time decoding.
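A minimal PyTorch sketch of the decoding setup; the channel count, bin count, and number of grasp classes are chosen arbitrarily.

```python
import torch
import torch.nn as nn

class GraspDecoder(nn.Module):
    """Classify binned spike counts (time bins x channels) into grasp types."""
    def __init__(self, n_channels=96, hidden=128, n_grasps=5):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_grasps)

    def forward(self, x):                 # x: (batch, time_bins, channels)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])           # logits from the last hidden state

model = GraspDecoder()
spikes = torch.poisson(torch.full((8, 50, 96), 2.0))  # fake spike-count bins
labels = torch.randint(0, 5, (8,))
loss = nn.CrossEntropyLoss()(model(spikes), labels)
loss.backward()
```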

Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment

  • paper_url: http://arxiv.org/abs/2311.01059
  • repo_url: None
  • paper_authors: Annie S. Chen, Govind Chada, Laura Smith, Archit Sharma, Zipeng Fu, Sergey Levine, Chelsea Finn
  • for: This work aims to help robots adapt at deployment time to situations that differ from those seen during training, a prerequisite for success in the real world.
  • methods: The approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt behaviors from a diverse repertoire to the situation at hand, all within a single episode at test time and without human supervision.
  • results: ROAM enables rapid adaptation to changes in dynamics both in simulation and on a real Go1 quadruped, even moving forward with roller skates on its feet, and adapts over 2x as efficiently as existing methods across a variety of out-of-distribution deployment situations.
    Abstract To succeed in the real world, robots must cope with situations that differ from those seen during training. We study the problem of adapting on-the-fly to such novel scenarios during deployment, by drawing upon a diverse repertoire of previously learned behaviors. Our approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt pre-trained behaviors to the situation at hand. Crucially, this adaptation process all happens within a single episode at test time, without any human supervision. We provide theoretical analysis of our selection mechanism and demonstrate that ROAM enables a robot to adapt rapidly to changes in dynamics both in simulation and on a real Go1 quadruped, even successfully moving forward with roller skates on its feet. Our approach adapts over 2x as efficiently compared to existing methods when facing a variety of out-of-distribution situations during deployment by effectively choosing and adapting relevant behaviors on-the-fly.

Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis

  • paper_url: http://arxiv.org/abs/2311.01052
  • repo_url: https://github.com/victorletzelter/code-rmcl
  • paper_authors: Victor Letzelter, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Gael Richard, Slim Essid
  • for: This paper addresses conditional distribution estimation in regression settings where multiple targets may be sampled for each training input.
  • methods: Resilient Multiple Choice Learning (rMCL) extends Multiple Choice Learning, which trains a set of hypotheses with the Winner-Takes-All (WTA) loss; rMCL adds a novel learned scoring scheme underpinned by Voronoi tessellations of the output space, from which a probabilistic interpretation can be derived.
  • results: rMCL preserves the diversity of the predictions while admitting a probabilistic interpretation; after empirical validation on synthetic data, experiments on the sound source localization problem demonstrate its practical usefulness and the relevance of its interpretation.
    Abstract We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
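The core ingredients, a Winner-Takes-All loss plus a learned scoring head, can be sketched as below; the paper further grounds the scores in Voronoi tessellations of the output space, which this toy omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 4                                       # number of hypotheses
backbone = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
heads = nn.ModuleList(nn.Linear(64, 2) for _ in range(K))  # K predictions
scorer = nn.Linear(64, K)                   # learned scores over hypotheses

def rmcl_loss(x, y):
    """Winner-takes-all on the best head, plus a scoring loss that
    teaches the scorer which head tends to win."""
    z = backbone(x)
    preds = torch.stack([h(z) for h in heads], dim=1)      # (B, K, 2)
    errs = ((preds - y.unsqueeze(1)) ** 2).sum(-1)         # (B, K)
    winners = errs.argmin(dim=1)
    wta = errs.gather(1, winners.unsqueeze(1)).mean()      # update winners only
    score = F.cross_entropy(scorer(z), winners)            # learn the scoring
    return wta + score

x, y = torch.randn(32, 8), torch.randn(32, 2)
rmcl_loss(x, y).backward()
```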

Application and Energy-Aware Data Aggregation using Vector Synchronization in Distributed Battery-less IoT Networks

  • paper_url: http://arxiv.org/abs/2311.01050
  • repo_url: None
  • paper_authors: Chetna Singhal, Subhrajit Barick, Rishabh Sonkar
  • for: To provide a mechanism for aggregating sensor data and sustaining application support in distributed battery-less IoT networks, where devices run on intermittent ambient energy stored in small capacitors.
  • methods: An application-aware task and energy manager (ATEM) and a vector-synchronization based data aggregator (VSDA) are proposed. ATEM is supported by device-level federated energy harvesting and system-level energy-aware heterogeneous application management; VSDA forecasts the available power from the ambient energy harvester with a long short-term memory (LSTM) model and sets the device profile and application task rates accordingly.
  • results: The proposed scheme meets heterogeneous application requirements with negligible overhead, reduces data loss and packet delay, increases hardware component availability, and makes components available sooner than the state-of-the-art.
    Abstract The battery-less Internet of Things (IoT) devices are a key element in the sustainable green initiative for the next-generation wireless networks. These battery-free devices use the ambient energy, harvested from the environment. The energy harvesting environment is dynamic and causes intermittent task execution. The harvested energy is stored in small capacitors and it is challenging to assure the application task execution. The main goal is to provide a mechanism to aggregate the sensor data and provide a sustainable application support in the distributed battery-less IoT network. We model the distributed IoT network system consisting of many battery-free IoT sensor hardware modules and heterogeneous IoT applications that are being supported in the device-edge-cloud continuum. The applications require sensor data from a distributed set of battery-less hardware modules and there is provision of joint control over the module actuators. We propose an application-aware task and energy manager (ATEM) for the IoT devices and a vector-synchronization based data aggregator (VSDA). The ATEM is supported by device-level federated energy harvesting and system-level energy-aware heterogeneous application management. In our proposed framework the data aggregator forecasts the available power from the ambient energy harvester using long-short-term-memory (LSTM) model and sets the device profile as well as the application task rates accordingly. Our proposed scheme meets the heterogeneous application requirements with negligible overhead; reduces the data loss and packet delay; increases the hardware component availability; and makes the components available sooner as compared to the state-of-the-art.

Improving Robustness via Tilted Exponential Layer: A Communication-Theoretic Perspective

  • paper_url: http://arxiv.org/abs/2311.01047
  • repo_url: https://github.com/bhagyapuranik/texp_for_robustness
  • paper_authors: Bhagyashree Puranik, Ahmad Beirami, Yao Qin, Upamanyu Madhow
  • for: To improve the robustness of deep networks beyond what empirical risk minimization with suitable data augmentation provides.
  • methods: Proposes a communication-theory-motivated approach that raises the signal-to-noise ratio at the output of a neural network layer through neural competition during learning and inference.
  • results: Experiments on standard image datasets show that TEXP learning and inference improve robustness against noise and other common corruptions without requiring data augmentation; further cumulative gains are obtained by suitably combining TEXP with data augmentation techniques.
    Abstract State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization with suitable data augmentation. In this paper, we propose a complementary approach motivated by communication theory, aimed at enhancing the signal-to-noise ratio at the output of a neural network layer via neural competition during learning and inference. In addition to minimization of a standard end-to-end cost, neurons compete to sparsely represent layer inputs by maximization of a tilted exponential (TEXP) objective function for the layer. TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. Inference in a TEXP layer is accomplished by replacing batch norm by a tilted softmax, which can be interpreted as computation of posterior probabilities for the competing signaling hypotheses represented by each neuron. After providing insights via simplified models, we show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise and other common corruptions, without requiring data augmentation. Further cumulative gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with data augmentation techniques.
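The inference step can be pictured as a tilted softmax over competing neurons in place of batch norm, as in the sketch below. The exact tilt parameterization and normalization used in the paper may differ; this only illustrates the operation.

```python
# Minimal sketch of a "tilted" softmax: softmax over competing neurons'
# matched-filter outputs, scaled by a tilt parameter. The output can be read
# as posterior-like weights over the competing signaling hypotheses.
import torch
import torch.nn.functional as F

def tilted_softmax(activations: torch.Tensor, tilt: float = 5.0) -> torch.Tensor:
    """activations: (batch, neurons) pre-activation outputs for one layer."""
    return F.softmax(tilt * activations, dim=-1)
```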

Time-Independent Information-Theoretic Generalization Bounds for SGLD

  • paper_url: http://arxiv.org/abs/2311.01046
  • repo_url: None
  • paper_authors: Futoshi Futami, Masahiro Fujisawa
  • for: To provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD), of interest to sampling and non-convex optimization research.
  • methods: Assumes smoothness and dissipativity and derives generalization error bounds by tracking the time evolution of the Kullback–Leibler divergence.
  • results: Obtains time-independent bounds that decay to zero as the sample size grows, regardless of the number of iterations and whether the step size is fixed; also establishes the first information-theoretic generalization bound for the case where training and test losses coincide, which is time-independent and step-size-free and leads to an improved excess risk bound.
    Abstract We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
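For reference, the SGLD iterate these bounds apply to is the standard noisy gradient step, sketched below with an assumed inverse temperature beta; this is not code from the paper.

```python
# SGLD step: theta_{t+1} = theta_t - eta * grad(theta_t) + sqrt(2*eta/beta) * xi,
# with xi ~ N(0, I) and beta the inverse temperature.
import numpy as np

def sgld_step(theta, grad_fn, eta=1e-3, beta=1.0, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * eta / beta)
    return theta - eta * grad_fn(theta) + noise
```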

Better with Less: A Data-Active Perspective on Pre-Training Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.01038
  • repo_url: https://github.com/galina0217/apt
  • paper_authors: Jiarong Xu, Renhong Huang, Xin Jiang, Yuxuan Cao, Carl Yang, Chunping Wang, Yang Yang
  • for: Pre-training graph neural networks aims to learn transferable knowledge from unlabeled data for downstream tasks.
  • methods: Proposes a "better-with-less" framework that feeds fewer, but carefully chosen, data points into the GNN. The framework (APT) consists of a graph selector and a pre-training model: the selector picks the most representative and instructive data points based on inherent graph properties and on predictive uncertainty, which the pre-training model feeds back as a measure of its confidence on the data.
  • results: Experiments show that APT obtains an efficient pre-training model from fewer training data with better downstream performance.
    Abstract Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experiment results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.
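A sketch of the uncertainty side of the graph selector: score candidate graphs by the predictive entropy of the current pre-training model and keep the least confident ones. APT also weighs inherent graph properties, which this sketch omits; all names are hypothetical.

```python
# Data-active selection sketch: higher predictive entropy = the model is
# less certain about the graph, so the graph is more instructive to train on.
import torch

def select_graphs(logits_per_graph: torch.Tensor, k: int) -> torch.Tensor:
    """logits_per_graph: (num_graphs, num_classes). Returns indices of the
    k graphs the model is least confident about."""
    probs = torch.softmax(logits_per_graph, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return torch.topk(entropy, k).indices
```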

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

  • paper_url: http://arxiv.org/abs/2311.01011
  • repo_url: None
  • paper_authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
  • for: To give researchers attack and defense samples for studying attack behavior and defense mechanisms in LLMs.
  • methods: All attacks and defenses were created by players of an online game called Tensor Trust.
  • results: Many models are vulnerable to these attack strategies, and some strategies generalize to settings beyond the game.
    Abstract While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. The attacks in our dataset have a lot of easily interpretable structure, and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, we show that some attack strategies from the dataset generalize to deployed LLM-based applications, even though they have a very different set of constraints to the game. We release all data and source code at https://tensortrust.ai/paper

Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner’s Approach

  • paper_url: http://arxiv.org/abs/2311.00993
  • repo_url: None
  • paper_authors: Xueying Long, Quang Bui, Grady Oktavian, Daniel F. Schmidt, Christoph Bergmeir, Rakshitha Godahewa, Seong Per Lee, Kaifeng Zhao, Paul Condylis
  • for: To address the forecasting challenges faced by a large e-commerce company, accounting for important differences between e-commerce and brick-and-mortar retail.
  • methods: Proposes a two-layer hierarchy: a top-down approach first forecasts at an aggregated level with fewer and less intermittent series, then disaggregates to obtain decision-level forecasts; alternatively, models are trained directly at the lower level on subsamples.
  • results: Evaluates on multiple datasets, including a proprietary dataset, the Favorita dataset, and the M5 dataset, demonstrating the scalability and accuracy of the two-layer hierarchy and highlighting differences between e-commerce and brick-and-mortar retail datasets.
    Abstract The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, we notice important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, firstly, we investigate a two-layer hierarchy and propose a top-down approach to forecasting at an aggregated level with less amount of series and intermittency, and then disaggregating to obtain the decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. Secondly, direct training at the lower level with subsamples can also be an alternative way of scaling. Performance of modelling with subsets is evaluated with the main dataset. Apart from a proprietary dataset, the proposed scalable methods are evaluated using the Favorita dataset and the M5 dataset. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.
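A sketch of the top-down disaggregation step: forecast the aggregate series, then split it by historical proportions. The paper's pipeline uses gradient boosted trees and produces probabilistic forecasts under distributional assumptions; this shows only the structural idea, with illustrative names.

```python
# Top-down forecasting sketch: one aggregate forecast shared out to item-level
# series via historical sales shares.
import numpy as np

def top_down_forecast(agg_forecast: np.ndarray, history: np.ndarray) -> np.ndarray:
    """agg_forecast: (horizon,) aggregate-level forecast.
    history: (num_series, T) item-level history used to estimate shares."""
    shares = history.sum(axis=1) / history.sum()          # (num_series,)
    return shares[:, None] * agg_forecast[None, :]        # (num_series, horizon)
```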

Autonomous Learning of Generative Models with Chemical Reaction Network Ensembles

  • paper_url: http://arxiv.org/abs/2311.00975
  • repo_url: None
  • paper_authors: William Poole, Thomas E. Ouldridge, Manoj Gopalkrishnan
  • for: To study whether a micron-sized sack of interacting molecules can autonomously learn an internal model of a complex, fluctuating environment.
  • methods: Draws on control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture by which a broad class of chemical systems can autonomously learn complex distributions, via a chemical implementation of gradient descent on a relative entropy cost function.
  • results: The method can optimize any detailed balanced chemical reaction network and can use hidden units to learn complex distributions; the construction is recast as a form of integral feedback control, and the explicit physical model of learning allows derivation of the thermodynamic costs and trade-offs of the process.
    Abstract Can a micron sized sack of interacting molecules autonomously learn an internal model of a complex and fluctuating environment? We draw insights from control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture whereby a broad class of chemical systems can autonomously learn complex distributions. Our construction takes the form of a chemical implementation of machine learning's optimization workhorse: gradient descent on the relative entropy cost function. We show how this method can be applied to optimize any detailed balanced chemical reaction network and that the construction is capable of using hidden units to learn complex distributions. This result is then recast as a form of integral feedback control. Finally, due to our use of an explicit physical model of learning, we are able to derive thermodynamic costs and trade-offs associated to this process.
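The learning rule being chemically implemented is gradient descent on a relative entropy cost. As a point of reference, for a simple softmax-parameterized categorical model the gradient takes the familiar p_model − p_data form; the sketch below is illustrative notation only, not the paper's reaction-network construction.

```python
# Gradient descent on D(p_data || p_model) for a categorical model with
# softmax parameters theta: the gradient is p_model - p_data.
import numpy as np

def kl_gradient_step(theta: np.ndarray, p_data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    p_model = np.exp(theta) / np.exp(theta).sum()
    return theta - lr * (p_model - p_data)
```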

Federated Linear Bandits with Finite Adversarial Actions

  • paper_url: http://arxiv.org/abs/2311.00973
  • repo_url: None
  • paper_authors: Li Fan, Ruida Zhou, Chao Tian, Cong Shen
  • for: To solve federated linear contextual bandit problems with finite adversarial action sets.
  • methods: Proposes FedSupLinUCB, which extends the SupLinUCB and OFUL algorithms to the federated setting with adversarial finite action sets.
  • results: Achieves a total regret of $\tilde{O}(\sqrt{dT})$, matching the minimax lower bound and hence order-optimal up to polylog terms.
    Abstract We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
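A sketch of the LinUCB-style scoring that FedSupLinUCB builds on: a ridge-regression estimate plus an elliptical confidence width per arm. Federation, the Sup-style phased elimination, and the communication protocol are omitted; the exploration constant alpha is an assumption.

```python
# LinUCB-style arm scores from sufficient statistics (A, b); the arm with the
# highest optimistic score would be pulled.
import numpy as np

def linucb_scores(A: np.ndarray, b: np.ndarray, arms: np.ndarray, alpha: float = 1.0):
    """A: (d, d) regularized Gram matrix; b: (d,) reward-weighted features;
    arms: (num_arms, d) candidate action features."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # Elliptical confidence width x^T A^{-1} x for each arm.
    width = alpha * np.sqrt(np.einsum("ad,dk,ak->a", arms, A_inv, arms))
    return arms @ theta_hat + width
```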

Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization Algorithms

  • paper_url: http://arxiv.org/abs/2311.00966
  • repo_url: https://github.com/facebookresearch/InvarianceUnitTests
  • paper_authors: Haoxiang Wang, Gargi Balasubramaniam, Haozhe Si, Bo Li, Han Zhao
  • for: To build predictive models that generalize robustly to unseen environments (domain generalization).
  • methods: Proposes a new class of algorithms, Invariant-feature Subspace Recovery (ISR), which provably recovers the subspace spanned by invariant features to address robustness across environments.
  • results: Experiments show that ISR achieves robust prediction across environments while requiring fewer training environments, improving efficiency.
    Abstract Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose Invariant-feature Subspace Recovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of Rosenfeld et al. (2021), we show that our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments. Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose ISR-Multiclass, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k\rceil+1$ training environments for $k$-class classification. Finally, for regression problems, we propose ISR-Regression that can identify the invariant-feature subspace with $d_s+1$ training environments. Empirically, we demonstrate the superior performance of our ISRs on synthetic benchmarks. Further, ISR can be used as post-processing methods for feature extractors such as neural nets.
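A sketch of the geometric idea behind ISR-Mean: class-conditional means that vary across environments span spurious directions, so the invariant-feature subspace can be estimated as their orthogonal complement. The SVD-based recovery below is a simplified stand-in for the paper's algorithm, with hypothetical names.

```python
# Estimate the invariant-feature subspace from per-environment class means:
# directions of cross-environment variation are treated as spurious.
import numpy as np

def isr_mean_subspace(env_means: np.ndarray, d_spurious: int) -> np.ndarray:
    """env_means: (num_envs, d) class-conditional feature means per environment.
    Returns a (d, d - d_spurious) basis of the estimated invariant subspace."""
    centered = env_means - env_means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=True)
    # Top d_spurious right singular vectors capture cross-environment variation;
    # the remaining directions are kept as (approximately) invariant.
    return vt[d_spurious:].T
```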

On Finding Bi-objective Pareto-optimal Fraud Prevention Rule Sets for Fintech Applications

  • paper_url: http://arxiv.org/abs/2311.00964
  • repo_url: None
  • paper_authors: Chengyao Wen, Yin Lou
  • for: To find high-quality rule subsets from an initial pool of rules so as to improve fraud prevention decisions.
  • methods: Casts the task as finding a Pareto front of non-dominated rule subsets in a bi-objective space, identifies solution selection on the front (SSF) as the core problem, and introduces a variant of sequential covering called SpectralRules to diversify the initial rule set.
  • results: Experiments on public and proprietary datasets, including two real application scenarios within Alipay, show advantages over existing work.
    Abstract Rules are widely used in Fintech institutions to make fraud prevention decisions, since rules are highly interpretable thanks to their intuitive if-then structure. In practice, a two-stage framework of fraud prevention decision rule set mining is usually employed in large Fintech institutions. This paper is concerned with finding high-quality rule subsets in a bi-objective space (such as precision and recall) from an initial pool of rules. To this end, we adopt the concept of Pareto optimality and aim to find a set of non-dominated rule subsets, which constitutes a Pareto front. We propose a heuristic-based framework called PORS and we identify that the core of PORS is the problem of solution selection on the front (SSF). We provide a systematic categorization of the SSF problem and a thorough empirical evaluation of various SSF methods on both public and proprietary datasets. We also introduce a novel variant of sequential covering algorithm called SpectralRules to encourage the diversity of the initial rule set and we empirically find that SpectralRules further improves the quality of the found Pareto front. On two real application scenarios within Alipay, we demonstrate the advantages of our proposed methodology compared to existing work.
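The bi-objective backbone of the problem is a standard non-dominated filter over (precision, recall) pairs; a quadratic-time sketch follows. PORS adds the search over rule subsets and the solution-selection-on-the-front step on top of this primitive.

```python
# Keep only non-dominated (precision, recall) points: a point is dropped if
# some other point is at least as good in both objectives and differs.
def pareto_front(points):
    """points: list of (precision, recall) tuples. Returns the non-dominated ones."""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front
```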

Dynamic Fair Federated Learning Based on Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.00959
  • repo_url: None
  • paper_authors: Weikang Chen, Junping Du, Yingxia Shao, Jia Wang, Yangxi Zhou
  • for: To address the unfair representation of the global model across devices in federated learning.
  • methods: Proposes DQFFL, a dynamic q-fairness federated learning algorithm with reinforcement learning, which mitigates discrepancies in device aggregation and improves the fairness of treatment for all groups involved; α-fairness turns the preservation of fairness into the distribution of client weights during aggregation, with reinforcement learning adjusting the fairness parameters dynamically.
  • results: DQFFL outperforms existing methods in overall performance of the global federated model, fairness, and convergence speed.
    Abstract Federated learning enables a collaborative training and optimization of global models among a group of devices without sharing local data samples. However, the heterogeneity of data in federated learning can lead to unfair representation of the global model across different devices. To address the fairness issue in federated learning, we propose a dynamic q fairness federated learning algorithm with reinforcement learning, called DQFFL. DQFFL aims to mitigate the discrepancies in device aggregation and enhance the fairness of treatment for all groups involved in federated learning. To quantify fairness, DQFFL leverages the performance of the global federated model on each device and incorporates {\alpha}-fairness to transform the preservation of fairness during federated aggregation into the distribution of client weights in the aggregation process. Considering the sensitivity of parameters in measuring fairness, we propose to utilize reinforcement learning for dynamic parameters during aggregation. Experimental results demonstrate that our DQFFL outperforms the state-of-the-art methods in terms of overall performance, fairness and convergence speed.
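A sketch of α-fairness-style aggregation weights: clients with higher loss under the current global model receive proportionally more weight, steering the update toward worse-served groups, as in q-FFL-type objectives. DQFFL tunes the fairness parameters dynamically with reinforcement learning; the fixed exponent here is an assumption.

```python
# Fairness-weighted client aggregation weights from per-client losses.
import numpy as np

def fair_client_weights(client_losses: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    raw = np.power(client_losses, alpha)  # amplify high-loss clients
    return raw / raw.sum()                # normalize to a weight distribution
```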

Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization

  • paper_url: http://arxiv.org/abs/2311.00944
  • repo_url: None
  • paper_authors: Wei Shen, Minhui Huang, Jiawei Zhang, Cong Shen
  • for: To study how smoothing techniques can help solve nonconvex minimax optimization problems in the federated setting.
  • methods: Proposes a new algorithm, Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which applies smoothing to federated minimax optimization.
  • results: Proves that FESS-GDA can be uniformly applied to several classes of federated minimax problems with new or better convergence guarantees, and demonstrates its practical efficiency on training generative adversarial networks (GANs) and on fair classification.
    Abstract In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved its success in centralized nonconvex minimax optimization, how and whether smoothing technique could be helpful in federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly used to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.
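One round of smoothed gradient descent ascent, the centralized primitive that FESS-GDA federates: the primal step is damped by a proximal anchor z that slowly tracks x. Client sampling, local updates, and communication are omitted, and the step sizes are illustrative.

```python
# Smoothed GDA: descend on f(x, y) + (p/2)||x - z||^2, ascend on y, then
# drag the smoothing anchor z toward the new x.
import numpy as np

def smoothed_gda_step(x, y, z, grad_x, grad_y, eta_x=0.01, eta_y=0.01, p=1.0, beta=0.1):
    """grad_x(x, y), grad_y(x, y) return (stochastic) gradients of f."""
    x_new = x - eta_x * (grad_x(x, y) + p * (x - z))  # descent on smoothed objective
    y_new = y + eta_y * grad_y(x_new, y)              # ascent on the dual variable
    z_new = z + beta * (x_new - z)                    # move the anchor toward x
    return x_new, y_new, z_new
```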

Learning Defect Prediction from Unrealistic Data

  • paper_url: http://arxiv.org/abs/2311.00931
  • repo_url: None
  • paper_authors: Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso
  • for: To investigate the pitfalls of large-scale synthetic data generation and how such data can still improve model performance on code understanding and generation tasks.
  • methods: Proposes a method for identifying the useful samples within large but unrealistic datasets: a neural model extracts high-dimensional embeddings of real-world and artificially bug-injected programs, artificial samples are scored by their distance to the nearest real-world sample, and only the representationally most similar samples are used for training.
  • results: Training on the selected subsets improves two popular pretrained code models on two code understanding tasks; the study also highlights the limitations of large-scale synthetic data generation and of applying AI models to predict vulnerabilities and bugs in real-world applications.
    Abstract Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to only perform well on similar data, while underperforming on real world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples based on their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples while discarding samples that are not at all similar in representations yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications.
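A sketch of the proposed filtering step: embed real and synthetic programs with a shared encoder, score every synthetic sample by its distance to the nearest real sample, and train on the closest ones. The encoder, the Euclidean metric, and all names are assumptions, not the paper's pipeline.

```python
# Select the synthetic samples that are representationally closest to
# real-world programs, given precomputed embeddings.
import numpy as np

def select_realistic(synthetic_emb: np.ndarray, real_emb: np.ndarray, keep: int):
    """synthetic_emb: (n_syn, d); real_emb: (n_real, d) embeddings."""
    # Pairwise Euclidean distances, then distance to the nearest real sample.
    dists = np.linalg.norm(synthetic_emb[:, None, :] - real_emb[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    return np.argsort(nearest)[:keep]  # indices of the most realistic samples
```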

A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations

  • paper_url: http://arxiv.org/abs/2311.00923
  • repo_url: None
  • paper_authors: Hang Chen, Keqing Du, Chenguang Li, Xinyu Yang
  • for: To survey the integration of causal models with deep learning over increasingly intricate datasets, such as causal associations within images or between textual components.
  • methods: Proposes redefining causal data into three categories according to causal structure and representation: definite data, semi-definite data, and indefinite data.
  • results: Definite data chiefly covers the statistical data of conventional causal scenarios; semi-definite data spans data formats common in deep learning, including time series, images, and text; indefinite data is an emergent and still underdeveloped research area.
    Abstract The fusion of causal models with deep learning introducing increasingly intricate data sets, such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions of causal data into three distinct categories from the standpoint of causal structure and representation: definite data, semi-definite data, and indefinite data. Definite data chiefly pertains to statistical data used in conventional causal scenarios, while semi-definite data refers to a spectrum of data formats germane to deep learning, including time-series, images, text, and others. Indefinite data is an emergent research sphere inferred from the progression of data forms by us. To comprehensively present these three data paradigms, we elaborate on their formal definitions, differences manifested in datasets, resolution pathways, and development of research. We summarize key tasks and achievements pertaining to definite and semi-definite data from myriad research undertakings, present a roadmap for indefinite data, beginning with its current research conundrums. Lastly, we classify and scrutinize the key datasets presently utilized within these three paradigms.

MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training

  • paper_url: http://arxiv.org/abs/2311.00919
  • repo_url: None
  • paper_authors: Jiacheng Li, Ninghui Li, Bruno Ribeiro
  • for: To defend against membership inference (MI) attacks in machine learning, i.e., attacks that try to determine whether a particular instance was used to train a model.
  • methods: Introduces Membership-Invariant Subspace Training (MIST), which leverages counterfactually-invariant representations and subspace learning to avoid overfitting the instances vulnerable to MI attacks without significantly impacting the others.
  • results: Extensive experiments against several state-of-the-art MI attacks show that MIST outperforms other state-of-the-art MI defenses with minimal reduction in testing accuracy.
    Abstract In Membership Inference (MI) attacks, the adversary tries to determine whether an instance is used to train a machine learning (ML) model. MI attacks are a major privacy concern when using private data to train ML models. Most MI attacks in the literature take advantage of the fact that ML models are trained to fit the training data well, and thus have very low loss on training instances. Most defenses against MI attacks therefore try to make the model fit the training data less well. Doing so, however, generally results in lower accuracy. We observe that training instances have different degrees of vulnerability to MI attacks. Most instances will have low loss even when not included in training. For these instances, the model can fit them well without concerns of MI attacks. An effective defense only needs to (possibly implicitly) identify instances that are vulnerable to MI attacks and avoid overfitting them. A major challenge is how to achieve such an effect in an efficient training process. Leveraging two distinct recent advancements in representation learning: counterfactually-invariant representations and subspace learning methods, we introduce a novel Membership-Invariant Subspace Training (MIST) method to defend against MI attacks. MIST avoids overfitting the vulnerable instances without significant impact on other instances. We have conducted extensive experimental studies, comparing MIST with various other state-of-the-art (SOTA) MI defenses against several SOTA MI attacks. We find that MIST outperforms other defenses while resulting in minimal reduction in testing accuracy.
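A sketch of the observation MIST starts from: an instance is vulnerable to membership inference when a model that never saw it assigns it high loss. Scoring vulnerability with a held-out shadow model, as below, is a simplified heuristic; MIST itself works through counterfactually-invariant representations and subspace training, not this procedure.

```python
# Score per-instance MI vulnerability: higher loss under a model that was
# not trained on the instance suggests the instance is hard to fit without
# memorization, and hence more exposed to membership inference.
import torch
import torch.nn.functional as F

def vulnerability_scores(shadow_model, inputs, targets):
    with torch.no_grad():
        logits = shadow_model(inputs)
        return F.cross_entropy(logits, targets, reduction="none")
```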