methods: 本文提出了一种新方法 Federated Orthogonal Training (FOT):提取各层针对旧任务的全局输入子空间,并修正新任务的聚合更新,使其与旧任务的全局主子空间正交,从而缓解全局灾难性遗忘。
results: 实验表明,FOT 在 CFL 设置下优于现有最先进的持续学习方法,平均准确率最高提升 15%,遗忘程度降低 27%,且仅带来极小的计算和通信开销,同时不违反隐私原则。
Abstract
Federated Learning (FL) has gained significant attention due to its ability to enable privacy-preserving training over decentralized data. Current literature in FL mostly focuses on single-task learning. However, over time, new tasks may appear in the clients and the global model should learn these tasks without forgetting previous tasks. This real-world scenario is known as Continual Federated Learning (CFL). The main challenge of CFL is Global Catastrophic Forgetting, which corresponds to the fact that when the global model is trained on new tasks, its performance on old tasks decreases. A few recent works on CFL propose methods that aim to address the global catastrophic forgetting problem. However, these works either have unrealistic assumptions on the availability of past data samples or violate the privacy principles of FL. We propose a novel method, Federated Orthogonal Training (FOT), to overcome these drawbacks and address the global catastrophic forgetting in CFL. Our algorithm extracts the global input subspace of each layer for old tasks and modifies the aggregated updates of new tasks such that they are orthogonal to the global principal subspace of old tasks for each layer. This decreases the interference between tasks, which is the main cause for forgetting. We empirically show that FOT outperforms state-of-the-art continual learning methods in the CFL setting, achieving an average accuracy gain of up to 15% with 27% lower forgetting while only incurring a minimal computation and communication cost.
摘要
联邦学习(Federated Learning, FL)因能够在去中心化数据上进行隐私保护训练而受到广泛关注。然而,目前的 FL 文献主要关注单任务学习。随着时间的推移,客户端上可能会出现新任务,全局模型需要在不遗忘旧任务的前提下学习这些新任务。这种现实场景被称为持续联邦学习(Continual Federated Learning, CFL)。CFL 的主要挑战是全局灾难性遗忘,即全局模型在新任务上训练后,其在旧任务上的性能下降。近期有少量针对 CFL 的工作提出了解决全局灾难性遗忘的方法,但这些方法要么对过去数据样本的可用性做出了不切实际的假设,要么违反了联邦学习的隐私原则。我们提出了一种新方法 Federated Orthogonal Training (FOT) 来克服这些缺陷。我们的算法提取各层针对旧任务的全局输入子空间,并修改新任务的聚合更新,使其在每一层都与旧任务的全局主子空间正交。这降低了任务之间的干扰,而干扰正是遗忘的主要原因。实验表明,FOT 在 CFL 设置下优于最先进的持续学习方法,平均准确率最高提升 15%,遗忘程度降低 27%,且仅带来极小的计算和通信开销。
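To make the projection step concrete, the sketch below shows how an aggregated update could be made orthogonal to the principal subspace of old-task layer inputs, in the spirit of FOT. This is an illustrative reconstruction, not the authors' code; the variable names (`U_old`, `delta_W`), the SVD-based subspace extraction, and the energy threshold are assumptions.

```python
import numpy as np

def principal_subspace(activations: np.ndarray, energy: float = 0.95) -> np.ndarray:
    """Return an orthonormal basis U (d x k) spanning the top principal
    directions of old-task layer inputs, keeping `energy` of the spectral mass."""
    # activations: (num_samples, d) matrix of layer inputs collected for old tasks
    _, s, vt = np.linalg.svd(activations, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1
    return vt[:k].T                      # columns are principal input directions

def orthogonal_update(delta_W: np.ndarray, U_old: np.ndarray) -> np.ndarray:
    """Project an aggregated weight update onto the orthogonal complement of
    the old-task input subspace, so new-task learning does not interfere."""
    # delta_W: (d_out, d_in) update; U_old: (d_in, k) basis of old-task inputs
    return delta_W - (delta_W @ U_old) @ U_old.T

# toy example
rng = np.random.default_rng(0)
old_inputs = rng.normal(size=(512, 64))
U = principal_subspace(old_inputs, energy=0.9)
update = rng.normal(size=(32, 64))
safe_update = orthogonal_update(update, U)
print(np.abs(safe_update @ U).max())     # ~0: update no longer excites old-task directions
```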
A Comparative Evaluation of FedAvg and Per-FedAvg Algorithms for Dirichlet Distributed Heterogeneous Data
paper_authors: Hamza Reguieg, Mohammed El Hanjri, Mohamed El Kamili, Abdellatif Kobbane
for: investigate Federated Learning (FL) and compare two strategies within this paradigm: Federated Averaging (FedAvg) and Personalized Federated Averaging (Per-FedAvg)
methods: use Non-Identically and Independently Distributed (Non-IID) data to evaluate the performance of both strategies
results: Per-FedAvg shows superior robustness in conditions of high data heterogeneity, and our results provide insights into the development of more effective and efficient machine learning strategies in a decentralized setting.
methods: 使用 Non-Identically and Independently Distributed (Non-IID) 数据来评估这两种策略的性能。
results: Per-FedAvg 在高度数据异构的条件下表现出更强的鲁棒性,我们的结果为在去中心化环境中开发更有效、更高效的机器学习策略提供了参考。
Abstract
In this paper, we investigate Federated Learning (FL), a paradigm of machine learning that allows for decentralized model training on devices without sharing raw data, thereby preserving data privacy. In particular, we compare two strategies within this paradigm: Federated Averaging (FedAvg) and Personalized Federated Averaging (Per-FedAvg), focusing on their performance with Non-Identically and Independently Distributed (Non-IID) data. Our analysis shows that the level of data heterogeneity, modeled using a Dirichlet distribution, significantly affects the performance of both strategies, with Per-FedAvg showing superior robustness in conditions of high heterogeneity. Our results provide insights into the development of more effective and efficient machine learning strategies in a decentralized setting.
摘要
在这篇论文中,我们研究了联邦学习(Federated Learning, FL)。这是一种机器学习范式,允许在设备上进行去中心化的模型训练而无需共享原始数据,从而保护数据隐私。我们比较了该范式下的两种策略:联邦平均(FedAvg)和个性化联邦平均(Per-FedAvg),并重点考察它们在非独立同分布(Non-IID)数据上的性能。我们的分析表明,用 Dirichlet 分布建模的数据异构程度对两种策略的性能都有显著影响,其中 Per-FedAvg 在高度异构条件下表现出更强的鲁棒性。我们的结果为在去中心化环境中开发更有效、更高效的机器学习策略提供了参考。
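The heterogeneity studied in this paper is commonly modelled by drawing per-client class proportions from a Dirichlet distribution. The sketch below illustrates one standard way to build such a non-IID partition; it is a generic reconstruction, and the concentration parameter `alpha` and the partitioning routine are assumptions rather than the authors' exact setup.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_clients: int, alpha: float, seed: int = 0):
    """Split sample indices across clients with class proportions drawn from
    Dirichlet(alpha); smaller alpha -> more heterogeneous (non-IID) clients."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # proportions of this class assigned to each client
        p = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# toy example: 10 clients, alpha=0.1 gives highly skewed label distributions
labels = np.random.default_rng(1).integers(0, 10, size=5000)
parts = dirichlet_partition(labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])
```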
Modified Step Size for Enhanced Stochastic Gradient Descent: Convergence and Experiments
paper_authors: M. Soheil Shamaee, S. Fathi Hafshejani
for: 提高 Stochastic Gradient Descent(SGD)算法的性能
methods: 使用修改后的衰减步长,在 $\frac{1}{\sqrt{t}}$ 的基础上引入对数项
results: 在光滑非凸函数上达到 $O(\frac{\ln T}{\sqrt{T}})$ 的收敛速率,并通过图像分类数据集上的数值实验验证了该方法的有效性。
Abstract
This paper introduces a novel approach to enhance the performance of the stochastic gradient descent (SGD) algorithm by incorporating a modified decay step size based on $\frac{1}{\sqrt{t}}$. The proposed step size integrates a logarithmic term, leading to the selection of smaller values in the final iterations. Our analysis establishes a convergence rate of $O(\frac{\ln T}{\sqrt{T}})$ for smooth non-convex functions without the Polyak-{\L}ojasiewicz condition. To evaluate the effectiveness of our approach, we conducted numerical experiments on image classification tasks using the FashionMNIST and CIFAR10 datasets, and the results demonstrate significant improvements in accuracy, with enhancements of $0.5\%$ and $1.4\%$ observed, respectively, compared to the traditional $\frac{1}{\sqrt{t}}$ step size. The source code can be found at \url{https://github.com/Shamaeem/LNSQRTStepSize}.
摘要
这篇论文提出了一种新方法来提高随机梯度下降(SGD)算法的性能。该方法采用基于 $\frac{1}{\sqrt{t}}$ 的修改衰减步长,其中引入了对数项,使得在最后的迭代中选取更小的步长值。我们的分析表明,对于光滑非凸函数,该方法可以在不需要 Polyak-{\L}ojasiewicz 条件的情况下达到 $O(\frac{\ln T}{\sqrt{T}})$ 的收敛速率。为验证该方法的有效性,我们在 FashionMNIST 和 CIFAR10 数据集上进行了图像分类的数值实验。结果显示,与传统的 $\frac{1}{\sqrt{t}}$ 步长相比,该方法分别提高了 $0.5\%$ 和 $1.4\%$ 的准确率。源代码可以在 \url{https://github.com/Shamaeem/LNSQRTStepSize} 找到。
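A minimal sketch comparing the classical $1/\sqrt{t}$ schedule against a log-modified schedule on a smooth toy problem is given below. The exact functional form used in the paper is not reproduced here; the $1/(\sqrt{t}\,\ln t)$ choice, the base learning rate, and the toy objective are illustrative assumptions consistent with the abstract's description of smaller steps in the final iterations.

```python
import numpy as np

# Step-size schedules: the classical 1/sqrt(t) baseline and a log-modified variant.
def baseline_step(t, eta0=0.5):
    return eta0 / np.sqrt(t)

def log_modified_step(t, eta0=0.5):
    # Assumed illustrative form; yields smaller steps in late iterations.
    return eta0 / (np.sqrt(t) * np.log(t + 1.0))

def sgd(step_fn, T=2000, seed=0):
    """Run SGD with noisy gradients on a smooth toy objective f(x) = ||x||^2 / 2."""
    rng = np.random.default_rng(seed)
    x = np.ones(10)
    for t in range(1, T + 1):
        grad = x + 0.1 * rng.normal(size=x.shape)   # stochastic gradient
        x -= step_fn(t) * grad
    return 0.5 * np.dot(x, x)

print("1/sqrt(t) final loss:    ", sgd(baseline_step))
print("log-modified final loss: ", sgd(log_modified_step))
```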
Privacy-Utility Tradeoff of OLS with Random Projections
paper_authors: Yun Lu, Malik Magdon-Ismail, Yu Wei, Vassilis Zikas
for: 本研究探讨了线性普通最小二乘(Ordinary Least Squares, OLS,即 $\ell_2$ 回归)问题的差分隐私(DP)性质。
methods: 本研究使用了Sarlos(2006)提出的Approximate LS Algorithm(ALS),以及Dwork et al.(2014)的标准 Gaussian Mechanism。我们还提出了一种新的DP分析方法,以及一些可能是独立有用的工具。
results: 我们的研究结果表明,ALS 算法无需修改或额外加噪即可保护隐私。我们给出了首个紧致的 DP 分析,并提出了一些改进的 DP 分析工具。此外,我们还证明了对输入数据库的约束是必要的,因此在大规模数据集中精确计算 DP 水平可能不可行;这就需要黑盒 DP 估计器,以便在实际应用中以经验方式估计依赖于数据的隐私水平。
Abstract
We study the differential privacy (DP) of a core ML problem, linear ordinary least squares (OLS), a.k.a. $\ell_2$-regression. Our key result is that the approximate LS algorithm (ALS) (Sarlos, 2006), a randomized solution to the OLS problem primarily used to improve performance on large datasets, also preserves privacy. ALS achieves a better privacy/utility tradeoff, without modifications or further noising, when compared to alternative private OLS algorithms which modify and/or noise OLS. We give the first {\em tight} DP-analysis for the ALS algorithm and the standard Gaussian mechanism (Dwork et al., 2014) applied to OLS. Our methodology directly improves the privacy analysis of (Blocki et al., 2012) and (Sheffet, 2019) and introduces new tools which may be of independent interest: (1) the exact spectrum of $(\epsilon, \delta)$-DP parameters (the "DP spectrum") for mechanisms whose output is a $d$-dimensional Gaussian, and (2) an improved DP spectrum for random projection (compared to (Blocki et al., 2012) and (Sheffet, 2019)). All methods for private OLS (including ours) assume, often implicitly, restrictions on the input database, such as bounds on leverage and residuals. We prove that such restrictions are necessary. Hence, computing the privacy of mechanisms such as ALS must estimate these database parameters, which can be infeasible in big datasets. For more complex ML models, DP bounds may not even be tractable. There is a need for blackbox DP-estimators (Lu et al., 2022) which empirically estimate a data-dependent privacy. We demonstrate the effectiveness of such a DP-estimator by empirically recovering a DP-spectrum that matches our theory for OLS. This validates the DP-estimator in a nontrivial ML application, opening the door to its use in more complex nonlinear ML settings where theory is unavailable.
摘要
我们研究了线性普通最小二乘(OLS)问题的差分隐私(DP)。我们的关键结论是:近似最小二乘(ALS)算法(Sarlos, 2006)——一种主要用于提升大规模数据集上性能的随机化 OLS 求解方法——同时也保护隐私。与其他需要修改 OLS 或额外加噪的私有 OLS 算法相比,ALS 无需任何修改或加噪即可实现更好的隐私/效用权衡。我们给出了首个针对 ALS 算法以及应用于 OLS 的标准高斯机制(Dwork 等, 2014)的紧致 DP 分析。我们的方法直接改进了 (Blocki 等, 2012) 和 (Sheffet, 2019) 的隐私分析,并引入了一些可能具有独立价值的新工具:(1) 针对输出为 $d$ 维高斯分布的机制,给出 $(\epsilon, \delta)$-DP 参数的精确谱("DP 谱");(2) 针对随机投影给出改进的 DP 谱。所有私有 OLS 方法(包括我们的方法)都(往往隐含地)假设输入数据库满足某些约束,例如对杠杆值和残差的界。我们证明了这些约束是必要的。因此,计算诸如 ALS 之类机制的隐私必须估计这些数据库参数,而这在大规模数据集中可能不可行。对于更复杂的机器学习模型,DP 界甚至可能无法计算。因此需要黑盒 DP 估计器(Lu 等, 2022),以经验方式估计依赖于数据的隐私。我们通过经验恢复出与 OLS 理论相符的 DP 谱,证明了这种 DP 估计器的有效性。这在一个非平凡的机器学习应用中验证了 DP 估计器,为其在缺乏理论的更复杂非线性机器学习场景中的应用打开了大门。
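The approximate least squares (ALS) construction analysed in the paper is, at its core, sketch-and-solve: project the data with a random matrix and solve OLS on the projection. The NumPy sketch below illustrates that computation with a Gaussian sketch; the sketch dimension and the choice of a Gaussian projection are assumptions for illustration, and no privacy accounting is performed here.

```python
import numpy as np

def ols(X, y):
    """Exact ordinary least squares via lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def approximate_ls(X, y, sketch_dim, seed=0):
    """Sketch-and-solve OLS: apply a random Gaussian projection S (r x n)
    to both X and y, then solve the smaller r x d least-squares problem."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.normal(size=(sketch_dim, n)) / np.sqrt(sketch_dim)
    return ols(S @ X, S @ y)

# toy example: n=5000 samples, d=20 features
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=5000)
print("exact:  ", ols(X, y)[:3])
print("approx: ", approximate_ls(X, y, sketch_dim=500)[:3])
```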
lfads-torch: A modular and extensible implementation of latent factor analysis via dynamical systems
paper_authors: Andrew R. Sedler, Chethan Pandarinath
for: 这篇论文是为了减少高维度神经活动中的噪音,以便在科学和工程领域中使用。
methods: 这篇论文使用了一种名为“Latent Factor Analysis via Dynamical Systems”的Variational Sequential Autoencoder(RNN-based),以解决高维度神经活动中的噪音问题。
results: 这篇论文的结果显示,该方法可以达到最先进的性能,并且可以应用于神经科学中的许多问题。
Abstract
Latent factor analysis via dynamical systems (LFADS) is an RNN-based variational sequential autoencoder that achieves state-of-the-art performance in denoising high-dimensional neural activity for downstream applications in science and engineering. Recently introduced variants and extensions continue to demonstrate the applicability of the architecture to a wide variety of problems in neuroscience. Since the development of the original implementation of LFADS, new technologies have emerged that use dynamic computation graphs, minimize boilerplate code, compose model configuration files, and simplify large-scale training. Building on these modern Python libraries, we introduce lfads-torch -- a new open-source implementation of LFADS that unifies existing variants and is designed to be easier to understand, configure, and extend. Documentation, source code, and issue tracking are available at https://github.com/arsedler9/lfads-torch .
摘要
Latent Factor Analysis via Dynamical Systems(LFADS)是一种基于 RNN 的变分序列自编码器,在对高维神经活动进行去噪方面达到了最先进的性能,可服务于科学和工程中的下游应用。近期提出的变体和扩展持续证明了该架构在神经科学中广泛问题上的适用性。自 LFADS 最初实现以来,出现了许多新技术,例如动态计算图、减少样板代码、组合式模型配置文件以及简化大规模训练。基于这些现代 Python 库,我们介绍了 lfads-torch——一个新的 LFADS 开源实现,它统一了现有变体,并被设计得更易于理解、配置和扩展。文档、源代码和问题跟踪可以在 https://github.com/arsedler9/lfads-torch 找到。
Implicit regularization of deep residual networks towards neural ODEs
paper_authors: Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau
for: 这篇论文旨在为深度学习模型之间的联系奠定数学基础,具体而言是夯实残差神经网络与神经常微分方程(ODE)之间的联系。
methods: 这篇论文研究用梯度流训练的非线性残差网络,并证明:如果网络初始化为某个神经 ODE 的离散化,那么这种离散化形式在整个训练过程中得以保持。
results: 这篇论文的结果表明,如果网络满足 Polyak-Lojasiewicz 条件,那么梯度流会收敛到全局最小值。该条件适用于一类残差网络,其残差块为两层感知机,且仅需宽度上线性的过参数化。数值实验验证了这些结果。
Abstract
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
摘要
残差神经网络是最先进的深度学习模型,其连续深度对应物——神经常微分方程(ODE)——也被广泛使用。尽管二者都很成功,但离散模型与连续模型之间的联系仍缺乏坚实的数学基础。在这篇文章中,我们朝这个方向迈出了一步:对于用梯度流训练的非线性残差网络,我们建立了其向神经 ODE 的隐式正则化。我们证明,如果网络初始化为某个神经 ODE 的离散化,那么这种离散化形式在整个训练过程中得以保持。我们的结果在有限训练时间内成立;当训练时间趋于无穷时,只要网络满足 Polyak-Lojasiewicz 条件,结果同样成立。重要的是,该条件适用于一类残差网络,其残差块为两层感知机,且仅需宽度上线性的过参数化,并且它保证梯度流收敛到全局最小值。数值实验验证了我们的结果。
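The correspondence studied here starts from initializing a depth-$L$ residual network as an Euler discretization of a neural ODE, $h_{k+1} = h_k + \frac{1}{L} f(h_k, \theta_k)$. A minimal NumPy sketch of that initialization is shown below; the two-layer perceptron residual blocks follow the abstract, while the width, depth, and constant weights across layers are illustrative assumptions.

```python
import numpy as np

def make_residual_net(dim, width, depth, seed=0):
    """Initialize a depth-`depth` residual network as an Euler discretization of
    a neural ODE: identical two-layer perceptron residual blocks, scaled by 1/depth."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(width, dim)) / np.sqrt(dim)
    W2 = rng.normal(size=(dim, width)) / np.sqrt(width)
    # Constant weights across depth correspond to a single velocity field
    # f(h) = W2 tanh(W1 h) of the underlying ODE dh/dt = f(h).
    return [(W1.copy(), W2.copy()) for _ in range(depth)]

def forward(layers, h):
    L = len(layers)
    for W1, W2 in layers:
        h = h + (1.0 / L) * (W2 @ np.tanh(W1 @ h))   # Euler step of size 1/L
    return h

x = np.random.default_rng(1).normal(size=8)
print(forward(make_residual_net(dim=8, width=64, depth=10), x))
print(forward(make_residual_net(dim=8, width=64, depth=1000), x))  # approaches the ODE flow
```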
Symbolically integrating tensor networks over various random tensors by the second version of Python RTNI
for: 这篇论文介绍了 PyRTNI2,即 RTNI 库的 Python 升级版本,该库可以对 Haar 分布酉矩阵上的张量网络进行符号积分。
methods: 这篇论文使用了element-wise moment calculus的方法,以及将tensor network diagrams和delta functions相关联的方法。
results: PyRTNI2 还可以处理 Haar 分布的正交矩阵以及实数和复数正态高斯张量,并能将张量网络导出为 TensorNetwork 格式,以便用具体张量做进一步计算,包括 Weingarten 函数与高维情形不同的低维情形。
Abstract
We are upgrading the Python-version of RTNI, which symbolically integrates tensor networks over the Haar-distributed unitary matrices. Now, PyRTNI2 can treat the Haar-distributed orthogonal matrices and the real and complex normal Gaussian tensors as well. Moreover, it can export tensor networks in the format of TensorNetwork so that one can make further calculations with concrete tensors, even for low dimensions, where the Weingarten functions differ from the ones for high dimensions. The tutorial notebooks are found at GitHub: https://github.com/MotohisaFukuda/PyRTNI2. In this paper, we explain maths behind the program and show what kind of tensor network calculations can be made with it. For the former, we interpret the element-wise moment calculus of the above random matrices and tensors in terms of tensor network diagrams, and argue that the view is natural, relating delta functions in the calculus to edges in tensor network diagrams.
摘要
我们正在升级 Python 版本的 RTNI,它可以对 Haar 分布酉矩阵上的张量网络进行符号积分。现在,PyRTNI2 还可以处理 Haar 分布的正交矩阵以及实数和复数正态高斯张量。此外,它可以将张量网络导出为 TensorNetwork 格式,以便用具体张量做进一步计算,甚至包括低维情形,此时的 Weingarten 函数与高维情形不同。教程笔记本可以在 GitHub 上找到:https://github.com/MotohisaFukuda/PyRTNI2。在这篇论文中,我们解释了该程序背后的数学原理,并展示了可以用它进行哪些张量网络计算。对于前者,我们将上述随机矩阵和张量的逐元素矩量计算解释为张量网络图,并论证这一视角是自然的:计算中的 delta 函数对应于张量网络图中的边。
Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement
results: 实验结果表明,NRSER 可以有效提高 SER 系统的噪声鲁棒性,包括防止系统对仅由背景噪声组成的信号做出情感识别。此外,所提出的 SNR 水平检测结构可以单独用于数据选择等任务。
Abstract
Speech emotion recognition (SER) often experiences reduced performance due to background noise. In addition, making a prediction on signals with only background noise could undermine user trust in the system. In this study, we propose a Noise Robust Speech Emotion Recognition system, NRSER. NRSER employs speech enhancement (SE) to effectively reduce the noise in input signals. Then, the signal-to-noise-ratio (SNR)-level detection structure and waveform reconstitution strategy are introduced to reduce the negative impact of SE on speech signals with no or little background noise. Our experimental results show that NRSER can effectively improve the noise robustness of the SER system, including preventing the system from making emotion recognition on signals consisting solely of background noise. Moreover, the proposed SNR-level detection structure can be used individually for tasks such as data selection.
摘要
语音情感识别(SER)的性能经常因背景噪声而下降。此外,对仅包含背景噪声的信号做出预测可能会削弱用户对系统的信任。在本研究中,我们提出了一种噪声鲁棒的语音情感识别系统 NRSER。NRSER 使用语音增强(SE)来有效降低输入信号中的噪声;随后引入信噪比(SNR)水平检测结构和波形重构策略,以降低 SE 对无噪声或低噪声语音信号的负面影响。实验结果表明,NRSER 能有效提高 SER 系统的噪声鲁棒性,包括防止系统对仅由背景噪声组成的信号做出情感识别。此外,所提出的 SNR 水平检测结构可以单独应用于数据选择等任务。
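The abstract describes an SNR-level detection stage that decides how much of the enhanced waveform to trust. As a rough illustration of the quantities involved, the snippet below computes a segment-level SNR estimate from a noisy signal and an enhanced (denoised) estimate, plus a simple blending rule; the thresholds and the reconstitution weighting are assumptions, not the paper's exact design.

```python
import numpy as np

def estimate_snr_db(noisy: np.ndarray, enhanced: np.ndarray) -> float:
    """Treat the enhanced waveform as the signal estimate and the residual
    (noisy - enhanced) as the noise estimate, then compute SNR in dB."""
    noise = noisy - enhanced
    p_signal = np.mean(enhanced**2) + 1e-12
    p_noise = np.mean(noise**2) + 1e-12
    return 10.0 * np.log10(p_signal / p_noise)

def reconstitute(noisy, enhanced, snr_db, high=15.0, low=0.0):
    """Illustrative reconstitution: keep the original waveform when it is already
    clean (high SNR), use the enhanced one when it is noisy, blend in between."""
    w = np.clip((high - snr_db) / (high - low), 0.0, 1.0)  # weight on enhanced signal
    return (1.0 - w) * noisy + w * enhanced

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
noisy = clean + 0.3 * rng.normal(size=clean.shape)
enhanced = clean + 0.05 * rng.normal(size=clean.shape)   # stand-in for an SE model output
snr = estimate_snr_db(noisy, enhanced)
print(f"estimated SNR: {snr:.1f} dB")
print(reconstitute(noisy, enhanced, snr)[:5])
```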
An Accurate Graph Generative Model with Tunable Features
results: 实验结果表明,借助新的反馈误差机制,GraphTune 模型能够准确地调整生成图的特征,其调整精度高于传统模型。
Abstract
A graph is a very common and powerful data structure used for modeling communication and social networks. Models that generate graphs with arbitrary features are important basic technologies in repeated simulations of networks and prediction of topology changes. Although existing generative models for graphs are useful for providing graphs similar to real-world graphs, graph generation models with tunable features have been less explored in the field. Previously, we have proposed GraphTune, a generative model for graphs that continuously tune specific graph features of generated graphs while maintaining most of the features of a given graph dataset. However, the tuning accuracy of graph features in GraphTune has not been sufficient for practical applications. In this paper, we propose a method to improve the accuracy of GraphTune by adding a new mechanism to feed back errors of graph features of generated graphs and by training them alternately and independently. Experiments on a real-world graph dataset showed that the features in the generated graphs are accurately tuned compared with conventional models.
摘要
图是一种非常常见且强大的数据结构,用于建模通信网络和社交网络。能够生成具有任意特征的图的模型,是反复进行网络模拟和预测拓扑变化的重要基础技术。虽然现有的图生成模型可以生成与真实图相似的图,但具有可调特征的图生成模型在该领域中尚未得到充分探索。我们此前提出了 GraphTune,这是一种图生成模型,可以在保持给定图数据集大部分特征的同时,连续调整生成图的特定特征。然而,GraphTune 对图特征的调整精度尚不足以满足实际应用需求。在这篇论文中,我们提出了一种提高 GraphTune 精度的方法:引入一种将生成图的特征误差反馈回来的新机制,并对其进行交替且独立的训练。在真实图数据集上的实验表明,与传统模型相比,生成图的特征能够被更准确地调整。
Advances in machine-learning-based sampling motivated by lattice quantum chromodynamics
results: 论文指出,机器学习模型有望高效地对格点量子色动力学中的场组态进行采样,从而支撑对物质结构与相互作用的第一性原理物理计算。
Abstract
Sampling from known probability distributions is a ubiquitous task in computational science, underlying calculations in domains from linguistics to biology and physics. Generative machine-learning (ML) models have emerged as a promising tool in this space, building on the success of this approach in applications such as image, text, and audio generation. Often, however, generative tasks in scientific domains have unique structures and features -- such as complex symmetries and the requirement of exactness guarantees -- that present both challenges and opportunities for ML. This Perspective outlines the advances in ML-based sampling motivated by lattice quantum field theory, in particular for the theory of quantum chromodynamics. Enabling calculations of the structure and interactions of matter from our most fundamental understanding of particle physics, lattice quantum chromodynamics is one of the main consumers of open-science supercomputing worldwide. The design of ML algorithms for this application faces profound challenges, including the necessity of scaling custom ML architectures to the largest supercomputers, but also promises immense benefits, and is spurring a wave of development in ML-based sampling more broadly. In lattice field theory, if this approach can realize its early promise it will be a transformative step towards first-principles physics calculations in particle, nuclear and condensed matter physics that are intractable with traditional approaches.
摘要
从已知概率分布中采样是计算科学中无处不在的任务,支撑着从语言学到生物学和物理学等领域的计算。生成式机器学习(ML)模型已成为该领域一个有前景的工具,这建立在其在图像、文本和音频生成等应用中的成功之上。然而,科学领域中的生成任务往往具有独特的结构和特点,例如复杂的对称性和对精确性保证的要求,这为 ML 带来了挑战,也带来了机遇。这篇观点文章概述了由格点量子场论(特别是量子色动力学理论)推动的基于 ML 的采样进展。格点量子色动力学使我们能够从最基本的粒子物理学理解出发计算物质的结构与相互作用,是全球开放科学超级计算的主要用户之一。为这一应用设计 ML 算法面临着深刻的挑战,包括必须将定制的 ML 架构扩展到最大的超级计算机上,但同时也有望带来巨大收益,并正在推动更广泛的基于 ML 的采样方法的发展。在格点场论中,如果这一方法能够兑现其早期的承诺,它将成为迈向粒子物理、核物理和凝聚态物理中第一性原理计算的变革性一步,而这些计算是传统方法无法完成的。
for: The paper is written for users who want to perform machine learning tasks but do not have deep domain knowledge. The paper aims to provide a framework called AutoML-GPT that simplifies the machine learning pipeline and reduces the time and effort required for these tasks.
methods: The paper uses a conversational interface to allow users to specify their requirements, constraints, and evaluation metrics. The system employs advanced techniques for hyperparameter optimization and model selection, ensuring that the resulting model achieves optimal performance.
results: The paper demonstrates through experimental results on diverse datasets that AutoML-GPT significantly reduces the time and effort required for machine learning tasks. The system's ability to leverage the vast knowledge encoded in large language models enables it to provide valuable insights, identify potential pitfalls, and suggest effective solutions to common challenges faced during model training.
Abstract
With the emerging trend of GPT models, we have established a framework called AutoML-GPT that integrates a comprehensive set of tools and libraries. This framework grants users access to a wide range of data preprocessing techniques, feature engineering methods, and model selection algorithms. Through a conversational interface, users can specify their requirements, constraints, and evaluation metrics. Throughout the process, AutoML-GPT employs advanced techniques for hyperparameter optimization and model selection, ensuring that the resulting model achieves optimal performance. The system effectively manages the complexity of the machine learning pipeline, guiding users towards the best choices without requiring deep domain knowledge. Through our experimental results on diverse datasets, we have demonstrated that AutoML-GPT significantly reduces the time and effort required for machine learning tasks. Its ability to leverage the vast knowledge encoded in large language models enables it to provide valuable insights, identify potential pitfalls, and suggest effective solutions to common challenges faced during model training.
results: 本文梳理了 B 细胞免疫疗法设计领域中最常用的数据源、评估指标以及基于机器学习的工具与框架的可用性,全面评估了它们的意义与局限性,并讨论了未来面临的主要挑战。
Abstract
Antibodies, a prominent class of approved biologics, play a crucial role in detecting foreign antigens. The effectiveness of antigen neutralisation and elimination hinges upon the strength, sensitivity, and specificity of the paratope-epitope interaction, which demands resource-intensive experimental techniques for characterisation. In recent years, artificial intelligence and machine learning methods have made significant strides, revolutionising the prediction of protein structures and their complexes. The past decade has also witnessed the evolution of computational approaches aiming to support immunotherapy design. This review focuses on the progress of machine learning-based tools and their frameworks in the domain of B-cell immunotherapy design, encompassing linear and conformational epitope prediction, paratope prediction, and antibody design. We mapped the most commonly used data sources, evaluation metrics, and method availability and thoroughly assessed their significance and limitations, discussing the main challenges ahead.
摘要
抗体是一类重要的已获批生物制剂,在识别外源抗原方面发挥着关键作用。抗原的中和与清除效果取决于互补位-表位相互作用的强度、敏感性和特异性,而对这些特性的表征需要耗费大量资源的实验技术。近年来,人工智能和机器学习方法取得了长足进步,革新了蛋白质结构及其复合物的预测。过去十年中,旨在支持免疫疗法设计的计算方法也在不断发展。本综述聚焦于 B 细胞免疫疗法设计领域中基于机器学习的工具及其框架的进展,涵盖线性与构象表位预测、互补位预测以及抗体设计。我们梳理了最常用的数据源、评估指标和方法可用性,深入评估了它们的意义与局限性,并讨论了未来面临的主要挑战。
Double Clipping: Less-Biased Variance Reduction in Off-Policy Evaluation
results: 研究表明,double clipping 可以在保持原有方差缩减特性的同时,补偿截断带来的向下偏差,从而降低估计器的总体偏差。
Abstract
"Clipping" (a.k.a. importance weight truncation) is a widely used variance-reduction technique for counterfactual off-policy estimators. Like other variance-reduction techniques, clipping reduces variance at the cost of increased bias. However, unlike other techniques, the bias introduced by clipping is always a downward bias (assuming non-negative rewards), yielding a lower bound on the true expected reward. In this work we propose a simple extension, called $\textit{double clipping}$, which aims to compensate this downward bias and thus reduce the overall bias, while maintaining the variance reduction properties of the original estimator.
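For context, the sketch below implements the standard clipped inverse-propensity-score (IPS) estimator that the paper builds on: importance weights are truncated at a threshold, which lowers variance but, with non-negative rewards, biases the estimate downward. The double-clipping correction itself is not reproduced since its exact form is not given in the abstract; the data-generation setup and threshold values are assumptions.

```python
import numpy as np

def clipped_ips(rewards, target_prob, logging_prob, clip):
    """Off-policy value estimate with importance weights truncated at `clip`.
    Clipping reduces variance but introduces a downward bias for rewards >= 0."""
    w = np.minimum(target_prob / logging_prob, clip)
    return np.mean(w * rewards)

# toy bandit: logging policy chooses action 1 rarely, target policy chooses it often
rng = np.random.default_rng(0)
n = 100_000
logging_p1, target_p1 = 0.05, 0.8
actions = rng.random(n) < logging_p1                 # actions drawn from the logging policy
rewards = np.where(actions, 1.0, 0.2) * rng.random(n)
target_prob = np.where(actions, target_p1, 1 - target_p1)
logging_prob = np.where(actions, logging_p1, 1 - logging_p1)

for c in [2.0, 5.0, np.inf]:
    print(f"clip={c:>5}: estimate = {clipped_ips(rewards, target_prob, logging_prob, c):.4f}")
```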
Carbon Emission Prediction and Clean Industry Transformation Based on Machine Learning: A Case Study of Sichuan Province
for: 本研究对 2000-2019 年四川省 46 个重点产业的能源消耗数据进行矩阵归一化预处理,并使用 DBSCAN 聚类对行业进行客观分组。
methods: 本研究使用 DBSCAN 聚类得到 16 个特征类以客观划分行业,并利用惩罚回归模型在控制过拟合、处理高维数据和特征选择方面的优势进行建模。
results: 研究发现,围绕煤炭的第二个聚类因生产需求而排放最高,以汽油为主和以焦炭为主的聚类排放也较为显著。据此提出了清洁煤技术、交通管理、钢铁行业以电代煤、行业标准化等减排建议。
Abstract
This study preprocessed 2000-2019 energy consumption data for 46 key Sichuan industries using matrix normalization. DBSCAN clustering identified 16 feature classes to objectively group industries. Penalized regression models were then applied for their advantages in overfitting control, high-dimensional data processing, and feature selection - well-suited for the complex energy data. Results showed the second cluster around coal had highest emissions due to production needs. Emissions from gasoline-focused and coke-focused clusters were also significant. Based on this, emission reduction suggestions included clean coal technologies, transportation management, coal-electricity replacement in steel, and industry standardization. The research introduced unsupervised learning to objectively select factors and aimed to explore new emission reduction avenues. In summary, the study identified industry groupings, assessed emissions drivers, and proposed scientific reduction strategies to better inform decision-making using algorithms like DBSCAN and penalized regression models.
摘要
本研究对 2000-2019 年四川省 46 个重点产业的能源消耗数据进行了矩阵归一化预处理,使用 DBSCAN 聚类方法识别出 16 个特征类,以客观地对行业分组。随后应用惩罚回归模型,发挥其在控制过拟合、处理高维数据和特征选择方面的优势,非常适合复杂的能源数据。结果显示,围绕煤炭的第二个聚类因生产需求而排放最高,以汽油为主和以焦炭为主的聚类排放也较为显著。据此提出的减排建议包括清洁煤技术、交通管理、钢铁行业以电代煤以及行业标准化。本研究引入无监督学习以客观选择因素,旨在探索新的减排途径。总之,研究借助 DBSCAN 和惩罚回归等算法识别了行业分组、评估了排放驱动因素,并提出了科学的减排策略,以便更好地支撑决策。
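A schematic of the analysis pipeline described above (normalize, cluster with DBSCAN, then fit a penalized regression) is sketched below with scikit-learn. The feature matrix, DBSCAN parameters, and the use of Lasso as the penalized model are placeholders; the study's actual data and hyperparameters are not reproduced.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# placeholder data: 46 industries x 20 energy-consumption features, plus emissions
X = rng.normal(size=(46, 20))
emissions = X[:, 0] * 3.0 + X[:, 3] * 1.5 + rng.normal(scale=0.5, size=46)

# 1) normalize features (the study uses matrix normalization of 2000-2019 data)
X_norm = StandardScaler().fit_transform(X)

# 2) group industries objectively with DBSCAN (eps/min_samples are assumptions)
labels = DBSCAN(eps=5.0, min_samples=2).fit_predict(X_norm)
print("cluster labels per industry:", labels)

# 3) penalized regression (Lasso) for emissions drivers with feature selection
model = LassoCV(cv=5).fit(X_norm, emissions)
selected = np.flatnonzero(np.abs(model.coef_) > 1e-6)
print("selected feature indices:", selected)
```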
Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?
results: 研究发现,无论在 seen 还是 unseen 设置下,与 MFCC 相比,使用 SSL 模型表示都能显著改善构音障碍患者的发音器官运动轨迹预测。在微调方案下,DeCoAR 对健康对照组和患者分别取得了约 $1.81\%$ 和 $4.56\%$ 的皮尔逊相关系数(CC)相对提升。
Abstract
Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic space to the articulatory space. Signal-processing features like the MFCCs have been widely used for the AAI task. For subjects with dysarthric speech, AAI is challenging because of an imprecise and indistinct pronunciation. In this work, we perform AAI for dysarthric speech using representations from pre-trained self-supervised learning (SSL) models. We demonstrate the impact of different pre-trained features on this challenging AAI task, at low-resource conditions. In addition, we also condition x-vectors to the extracted SSL features to train a BLSTM network. In the seen case, we experiment with three AAI training schemes (subject-specific, pooled, and fine-tuned). The results, consistent across training schemes, reveal that DeCoAR, in the fine-tuned scheme, achieves a relative improvement of the Pearson Correlation Coefficient (CC) by ${\sim}$1.81\% and ${\sim}$4.56\% for healthy controls and patients, respectively, over MFCCs. In the unseen case, we observe similar average trends for different SSL features. Overall, SSL networks like wav2vec, APC, and DeCoAR, which are trained with feature reconstruction or future timestep prediction tasks, perform well in predicting dysarthric articulatory trajectories.
摘要
声学-发音逆映射(AAI)旨在从声学空间映射到发音空间。MFCC 等信号处理特征已被广泛用于 AAI 任务。对于构音障碍(dysarthric)患者,由于发音不准确、不清晰,AAI 尤其具有挑战性。在这项工作中,我们使用预训练自监督学习(SSL)模型的表示来完成构音障碍语音的 AAI,并在低资源条件下展示了不同预训练特征对这一任务的影响。此外,我们还将 x-vector 与提取的 SSL 特征一起作为条件来训练 BLSTM 网络。在 seen 设置下,我们尝试了三种 AAI 训练方案(个体特定、合并和微调)。结果在各训练方案中是一致的:在微调方案下,与 MFCC 相比,DeCoAR 对健康对照组和患者分别取得了约 1.81% 和 4.56% 的皮尔逊相关系数(CC)相对提升。在 unseen 设置下,不同 SSL 特征也呈现出类似的平均趋势。总体而言,wav2vec、APC 和 DeCoAR 等以特征重建或未来时间步预测为训练任务的 SSL 网络,在预测构音障碍患者的发音运动轨迹方面表现良好。
Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
results: 数值实验表明,与传统的矩形不确定集相比,该方法可以学习到更加稳健且不那么保守的策略。
Abstract
In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method, and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
摘要
在鲁棒马尔可夫决策过程(RMDP)中,假设奖励和转移动态位于给定的不确定集中。通过在该集合中最不利的模型下最大化回报,RMDP 解决了性能对环境错误设定的敏感性问题。然而,为保持计算上的可处理性,不确定集传统上对每个状态独立构造。这种所谓的矩形性条件完全出于计算上的考虑,缺乏实际动机,并可能导致过于保守的行为。在这项工作中,我们研究奖励耦合的 RMDP:其转移核固定,但奖励函数位于名义奖励函数的 $\alpha$ 半径范围内。我们在这类非矩形奖励 RMDP 与策略访问频率正则化之间建立了直接联系。我们提出了一种策略梯度方法,并证明了其收敛性。数值实验表明,与矩形不确定集相比,所学策略既鲁棒又不那么保守。
Tropical Geometric Tools for Machine Learning: the TML package
methods: 该论文使用 Hit and Run 马尔可夫链蒙特卡罗采样器结合热带度量进行统计推断。此外,论文还介绍了若干基于热带度量的监督与无监督学习方法,包括热带主成分分析、热带逻辑回归和热带核密度估计。
results: 论文的结果主要表明,结合热带度量的 HAR 采样器可以有效地进行统计推断,并可应用于多种监督与无监督学习问题;同时给出了热带主成分分析、热带核密度估计等基于热带几何的方法与应用。
Abstract
In the last decade, developments in tropical geometry have provided a number of uses directly applicable to problems in statistical learning. The TML package is the first R package which contains a comprehensive set of tools and methods used for basic computations related to tropical convexity, visualization of tropically convex sets, as well as supervised and unsupervised learning models using the tropical metric under the max-plus algebra over the tropical projective torus. Primarily, the TML package employs a Hit and Run Markov chain Monte Carlo sampler in conjunction with the tropical metric as its main tool for statistical inference. In addition to basic computation and various applications of the tropical HAR sampler, we also focus on several supervised and unsupervised methods incorporated in the TML package including tropical principal component analysis, tropical logistic regression and tropical kernel density estimation.
摘要
在过去十年中,热带几何的发展为统计学习中的问题提供了许多可直接应用的工具。TML 包是第一个 R 包,它包含一整套工具和方法,用于与热带凸性相关的基本计算、热带凸集的可视化,以及在热带射影环面上基于 max-plus 代数的热带度量的监督与无监督学习模型。TML 包主要以 Hit and Run 马尔可夫链蒙特卡罗采样器结合热带度量作为统计推断的核心工具。除基本计算和热带 HAR 采样器的各种应用外,我们还重点介绍了 TML 包中包含的若干监督与无监督方法,包括热带主成分分析、热带逻辑回归和热带核密度估计。
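The tropical (projective) metric underlying the package has a simple closed form: for points $x, y$ in the tropical projective torus, $d_{\mathrm{tr}}(x, y) = \max_i (x_i - y_i) - \min_i (x_i - y_i)$. The snippet below computes it in Python as a companion illustration to the R package; it only reproduces the metric, not the TML sampler or models.

```python
import numpy as np

def tropical_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Tropical metric on the tropical projective torus:
    d_tr(x, y) = max_i (x_i - y_i) - min_i (x_i - y_i).
    It is invariant to adding a constant to all coordinates of x or y."""
    d = x - y
    return float(d.max() - d.min())

x = np.array([0.0, 2.0, 5.0])
y = np.array([1.0, 1.0, 1.0])
print(tropical_distance(x, y))            # 5.0
print(tropical_distance(x + 7.0, y))      # unchanged: 5.0 (projective invariance)
```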
Federated Few-shot Learning for Cough Classification with Edge Devices
results: 我们的结果显示,F2LCough 在 COVID-19 Thermal Face & Cough 数据集上取得了 86% 的平均 F1 分数,高于其他方法。这表明在数据稀缺的情况下,少样本学习与联邦学习相结合可以构建咳嗽声分类模型,同时保持隐私特性。
Abstract
Automatically classifying cough sounds is one of the most critical tasks for the diagnosis and treatment of respiratory diseases. However, collecting a huge amount of labeled cough dataset is challenging mainly due to high laborious expenses, data scarcity, and privacy concerns. In this work, our aim is to develop a framework that can effectively perform cough classification even in situations when enormous cough data is not available, while also addressing privacy concerns. Specifically, we formulate a new problem to tackle these challenges and adopt few-shot learning and federated learning to design a novel framework, termed F2LCough, for solving the newly formulated problem. We illustrate the superiority of our method compared with other approaches on COVID-19 Thermal Face & Cough dataset, in which F2LCough achieves an average F1-Score of 86%. Our results show the feasibility of few-shot learning combined with federated learning to build a classification model of cough sounds. This new methodology is able to classify cough sounds in data-scarce situations and maintain privacy properties. The outcomes of this work can be a fundamental framework for building support systems for the detection and diagnosis of cough-related diseases.
摘要
自动分类咳嗽声是呼吸系统疾病诊断和治疗中最关键的任务之一,但由于高昂的人工成本、数据稀缺和隐私问题,收集大量标注咳嗽数据十分困难。在这种情况下,我们的目标是开发一个即使没有海量咳嗽数据也能有效进行咳嗽分类的框架,同时解决隐私问题。我们将这些挑战重新表述为一个新问题,并采用少样本学习与联邦学习来设计一个名为 F2LCough 的新框架。在 COVID-19 Thermal Face & Cough 数据集上的比较实验表明,F2LCough 的平均 F1 分数达到 86%,优于其他方法。这些结果表明少样本学习与联邦学习相结合的可行性:可以在数据稀缺情况下对咳嗽声进行分类,并保持隐私特性。这项工作的成果可以作为构建咳嗽相关疾病检测与诊断支持系统的基础框架。
Towards Efficient Modeling and Inference in Multi-Dimensional Gaussian Process State-Space Models
results: 实验结果表明,所提出的方法可以减少参数数量和计算复杂度,同时达到与现有方法相当的推断性能。
Abstract
The Gaussian process state-space model (GPSSM) has attracted extensive attention for modeling complex nonlinear dynamical systems. However, the existing GPSSM employs separate Gaussian processes (GPs) for each latent state dimension, leading to escalating computational complexity and parameter proliferation, thus posing challenges for modeling dynamical systems with high-dimensional latent states. To surmount this obstacle, we propose to integrate the efficient transformed Gaussian process (ETGP) into the GPSSM, which involves pushing a shared GP through multiple normalizing flows to efficiently model the transition function in high-dimensional latent state space. Additionally, we develop a corresponding variational inference algorithm that surpasses existing methods in terms of parameter count and computational complexity. Experimental results on diverse synthetic and real-world datasets corroborate the efficiency of the proposed method, while also demonstrating its ability to achieve similar inference performance compared to existing methods. Code is available at \url{https://github.com/zhidilin/gpssmProj}.
摘要
高斯过程状态空间模型(GPSSM)在建模复杂非线性动力系统方面受到广泛关注。然而,现有的 GPSSM 为每个隐状态维度使用单独的高斯过程(GP),导致计算复杂度和参数数量不断攀升,从而给高维隐状态动力系统的建模带来挑战。为克服这一障碍,我们提议将高效变换高斯过程(ETGP)集成到 GPSSM 中:将一个共享 GP 通过多个归一化流(normalizing flows),以高效地建模高维隐状态空间中的转移函数。此外,我们还开发了相应的变分推断算法,其参数数量和计算复杂度均优于现有方法。在多种合成数据集和真实数据集上的实验结果验证了所提方法的高效性,同时表明其推断性能与现有方法相当。代码可以在 \url{https://github.com/zhidilin/gpssmProj} 获取。
MQENet: A Mesh Quality Evaluation Neural Network Based on Dynamic Graph Attention
results: 实验结果表明,MQENet 可以有效地评估结构网格的质量,并在 NACA-Market 基准数据集上取得了良好的评估效果。
Abstract
With the development of computational fluid dynamics, the requirements for the fluid simulation accuracy in industrial applications have also increased. The quality of the generated mesh directly affects the simulation accuracy. However, previous mesh quality metrics and models cannot evaluate meshes comprehensively and objectively. To this end, we propose MQENet, a structured mesh quality evaluation neural network based on dynamic graph attention. MQENet treats the mesh evaluation task as a graph classification task for classifying the quality of the input structured mesh. To make graphs generated from structured meshes more informative, MQENet introduces two novel structured mesh preprocessing algorithms. These two algorithms can also improve the conversion efficiency of structured mesh data. Experimental results on the benchmark structured mesh dataset NACA-Market show the effectiveness of MQENet in the mesh quality evaluation task.
摘要
随着计算流体力学的发展,工业应用对流体仿真精度的要求也在不断提高。生成网格的质量直接影响仿真精度。然而,以往的网格质量指标和模型无法全面、客观地评估网格。为此,我们提出了 MQENet,一种基于动态图注意力的结构网格质量评估神经网络。MQENet 将网格评估任务视为图分类任务,用于对输入结构网格的质量进行分类。为了使由结构网格生成的图包含更多信息,MQENet 引入了两种新的结构网格预处理算法,这两种算法还可以提高结构网格数据的转换效率。在基准结构网格数据集 NACA-Market 上的实验结果表明了 MQENet 在网格质量评估任务中的有效性。
Distribution learning via neural differential equations: a nonparametric statistical perspective
results: 这篇论文建立了一个一般的非参数统计收敛分析框架,并针对 $C^k$ 光滑目标密度,在 $C^k$ 函数类和神经网络这两类速度场上得到了近似极小极大最优的收敛速率。
Abstract
Ordinary differential equations (ODEs), via their induced flow maps, provide a powerful framework to parameterize invertible transformations for the purpose of representing complex probability distributions. While such models have achieved enormous success in machine learning, particularly for generative modeling and density estimation, little is known about their statistical properties. This work establishes the first general nonparametric statistical convergence analysis for distribution learning via ODE models trained through likelihood maximization. We first prove a convergence theorem applicable to arbitrary velocity field classes $\mathcal{F}$ satisfying certain simple boundary constraints. This general result captures the trade-off between approximation error (`bias') and the complexity of the ODE model (`variance'). We show that the latter can be quantified via the $C^1$-metric entropy of the class $\mathcal F$. We then apply this general framework to the setting of $C^k$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $\mathcal F$: $C^k$ functions and neural networks. The latter is the practically important case of neural ODEs. Our proof techniques require a careful synthesis of (i) analytical stability results for ODEs, (ii) classical theory for sieved M-estimators, and (iii) recent results on approximation rates and metric entropies of neural network classes. The results also provide theoretical insight on how the choice of velocity field class, and the dependence of this choice on sample size $n$ (e.g., the scaling of width, depth, and sparsity of neural network classes), impacts statistical performance.
摘要
常微分方程(ODE)通过其诱导的流映射,为参数化可逆变换以表示复杂概率分布提供了强大框架。这类模型在机器学习中取得了巨大成功,特别是在生成建模和密度估计方面,但人们对其统计性质知之甚少。这项工作首次建立了针对通过最大化似然训练的 ODE 模型进行分布学习的一般非参数统计收敛分析。我们首先证明了一个适用于满足某些简单边界约束的任意速度场类 $\mathcal{F}$ 的收敛定理。这一一般结果刻画了逼近误差("偏差")与 ODE 模型复杂度("方差")之间的权衡,后者可以通过类 $\mathcal{F}$ 的 $C^1$ 度量熵来量化。随后,我们将该一般框架应用于 $C^k$ 光滑目标密度的情形,并针对两类相关的速度场类 $\mathcal{F}$——$C^k$ 函数和神经网络——建立了近似极小极大最优的收敛速率,其中后者对应于实践中重要的神经 ODE。我们的证明技术需要仔细综合:(i) ODE 的解析稳定性结果,(ii) 筛分 M 估计量的经典理论,以及 (iii) 关于神经网络类的逼近速率和度量熵的近期结果。这些结果也从理论上揭示了速度场类的选择,以及该选择如何依赖于样本量 $n$(例如神经网络类的宽度、深度和稀疏度的标度)对统计性能的影响。
Breast MRI radiomics and machine learning radiomics-based predictions of response to neoadjuvant chemotherapy – how are they affected by variations in tumour delineation?
paper_authors: Sepideh Hatamikia, Geevarghese George, Florian Schwarzhans, Amirreza Mahbod, Ramona Woitek
for: This study evaluates the impact of variations in manual delineations of volumes of interest (VOIs) on the performance of radiomics predictors in two breast cancer subtypes.
methods: The study uses contrast-enhanced magnetic resonance imaging acquired prior to treatment (baseline MRI scans), and applies mathematical operations such as erosion, smoothing, dilation, randomization, and ellipse fitting to simulate variations of the segmentation masks.
results: Variations in VOI delineation can significantly affect the number of robust features and the prediction performance in radiomics analysis. Smoothing and erosion yielded the highest number of robust features and the best prediction performance, while ellipse fitting and dilation led to the lowest robustness and prediction performance for both breast cancer subtypes. At most 28% of the selected features were similar to those obtained from manual VOIs when different VOI delineation data were used.
Abstract
Manual delineation of volumes of interest (VOIs) by experts is considered the gold-standard method in radiomics analysis. However, it suffers from inter- and intra-operator variability. A quantitative assessment of the impact of variations in these delineations on the performance of the radiomics predictors is required to develop robust radiomics based prediction models. In this study, we developed radiomics models for the prediction of pathological complete response to neoadjuvant chemotherapy in patients with two different breast cancer subtypes based on contrast-enhanced magnetic resonance imaging acquired prior to treatment (baseline MRI scans). Different mathematical operations such as erosion, smoothing, dilation, randomization, and ellipse fitting were applied to the original VOIs delineated by experts to simulate variations of segmentation masks. The effects of such VOI modifications on various steps of the radiomics workflow, including feature extraction, feature selection, and prediction performance, were evaluated. Using manual tumor VOIs and radiomics features extracted from baseline MRI scans, an AUC of up to 0.96 and 0.89 was achieved for human epidermal growth receptor 2 positive and triple-negative breast cancer, respectively. For smoothing and erosion, VOIs yielded the highest number of robust features and the best prediction performance, while ellipse fitting and dilation lead to the lowest robustness and prediction performance for both breast cancer subtypes. At most 28% of the selected features were similar to manual VOIs when different VOI delineation data were used. Differences in VOI delineation affects different steps of radiomics analysis, and their quantification is therefore important for development of standardized radiomics research.
摘要
由专家手动勾画感兴趣体积(VOI)被认为是影像组学分析中的金标准方法,但它存在观察者间和观察者内变异。为了构建稳健的影像组学预测模型,需要定量评估这些勾画变化对预测器性能的影响。在本研究中,我们基于治疗前获取的对比增强磁共振成像(基线 MRI 扫描),针对两种不同乳腺癌亚型的患者,构建了预测新辅助化疗病理完全缓解的影像组学模型。我们对专家勾画的原始 VOI 施加腐蚀、平滑、膨胀、随机化和椭圆拟合等数学操作,以模拟分割掩膜的变化,并评估了这些 VOI 修改对影像组学流程中特征提取、特征选择和预测性能等环节的影响。使用手动肿瘤 VOI 和从基线 MRI 扫描中提取的影像组学特征,人表皮生长因子受体 2 阳性和三阴性乳腺癌的 AUC 分别最高可达 0.96 和 0.89。对于平滑和腐蚀操作,VOI 产生的稳健特征数量最多、预测性能最好;而椭圆拟合和膨胀操作在两种乳腺癌亚型中的稳健性和预测性能均最低。当使用不同的 VOI 勾画数据时,所选特征中至多有 28% 与手动 VOI 的结果一致。VOI 勾画的差异会影响影像组学分析的各个环节,因此对其进行量化对于开展标准化的影像组学研究十分重要。
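The segmentation perturbations studied in this paper (erosion, dilation, smoothing, and so on) can be simulated with standard morphological operations. The SciPy-based sketch below shows erosion, dilation, and an opening used as a smoothing proxy on a binary VOI mask; the structuring-element iterations are illustrative assumptions, and the randomization and ellipse-fitting variants are omitted.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, binary_opening

def perturb_voi(mask: np.ndarray, iterations: int = 2):
    """Generate perturbed versions of a binary VOI mask, mimicking the
    delineation variations (erosion / dilation / smoothing) evaluated in the study."""
    return {
        "eroded":   binary_erosion(mask, iterations=iterations),
        "dilated":  binary_dilation(mask, iterations=iterations),
        "smoothed": binary_opening(mask, iterations=iterations),  # opening as a smoothing proxy
    }

# toy 2D "tumour" mask: a filled circle on a 64x64 grid
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2

for name, m in perturb_voi(mask).items():
    print(f"{name:<8} voxels: {int(m.sum()):5d} (original {int(mask.sum())})")
```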
results: 本研究发现,通过在 VC 模型训练的不同阶段加入以目标说话人标签为条件的对抗约束,可以生成保留目标说话人音色的对抗攻击音频。这些对抗音频既能诱导 SID 系统给出攻击者期望的判定,又保留了目标说话人的音色特征。
Abstract
As a type of biometric identification, a speaker identification (SID) system is confronted with various kinds of attacks. The spoofing attacks typically imitate the timbre of the target speakers, while the adversarial attacks confuse the SID system by adding a well-designed adversarial perturbation to an arbitrary speech. Although the spoofing attack copies a similar timbre as the victim, it does not exploit the vulnerability of the SID model and may not make the SID system give the attacker's desired decision. As for the adversarial attack, despite the SID system can be led to a designated decision, it cannot meet the specified text or speaker timbre requirements for the specific attack scenarios. In this study, to make the attack in SID not only leverage the vulnerability of the SID model but also reserve the timbre of the target speaker, we propose a timbre-reserved adversarial attack in the speaker identification. We generate the timbre-reserved adversarial audios by adding an adversarial constraint during the different training stages of the voice conversion (VC) model. Specifically, the adversarial constraint is using the target speaker label to optimize the adversarial perturbation added to the VC model representations and is implemented by a speaker classifier joining in the VC model training. The adversarial constraint can help to control the VC model to generate the speaker-wised audio. Eventually, the inference of the VC model is the ideal adversarial fake audio, which is timbre-reserved and can fool the SID system.
摘要
作为一种生物特征识别技术,说话人识别(SID)系统面临多种攻击。欺骗攻击通常模仿目标说话人的音色,但并不利用 SID 模型的漏洞,未必能使系统给出攻击者期望的判定;对抗攻击虽然可以将 SID 系统引导到指定的判定,却无法满足特定攻击场景对文本或说话人音色的要求。在本研究中,为了使针对 SID 的攻击既利用 SID 模型的漏洞又保留目标说话人的音色,我们提出了一种音色保留的对抗攻击。我们在语音转换(VC)模型训练的不同阶段加入对抗约束来生成音色保留的对抗音频:该约束利用目标说话人标签来优化添加到 VC 模型表示上的对抗扰动,并通过一个参与 VC 模型训练的说话人分类器来实现。这一对抗约束有助于控制 VC 模型生成符合目标说话人特征的音频,最终 VC 模型推理得到的即为理想的对抗伪造音频:它保留了目标说话人的音色,并能欺骗 SID 系统。
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin
paper_authors: Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
for: 这篇研究旨在提高cross-lingual TTS的自然度和情感表达能力。
methods: 提出了一种基于扩散模型的跨语言情感迁移方法 DiCLET-TTS,可以将情感从源说话人迁移到同语言和跨语言的目标说话人,以提升跨语言合成语音的自然度和情感表现力。
results: 实验结果显示,DiCLET-TTS 优于多种有竞争力的模型,并且 OP-EDM 能够学习到与说话人无关但具有情感判别性的嵌入。
Abstract
While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.
results: 研究表明,SEPAL 模型在两个人类乳腺癌数据集上表现出色,持续且显著地优于此前最先进的方法以及其他引入空间上下文的机制。
Abstract
Spatial transcriptomics is an emerging technology that aligns histopathology images with spatially resolved gene expression profiling. It holds the potential for understanding many diseases but faces significant bottlenecks such as specialized equipment and domain expertise. In this work, we present SEPAL, a new model for predicting genetic profiles from visual tissue appearance. Our method exploits the biological biases of the problem by directly supervising relative differences with respect to mean expression, and leverages local visual context at every coordinate to make predictions using a graph neural network. This approach closes the gap between complete locality and complete globality in current methods. In addition, we propose a novel benchmark that aims to better define the task by following current best practices in transcriptomics and restricting the prediction variables to only those with clear spatial patterns. Our extensive evaluation in two different human breast cancer datasets indicates that SEPAL outperforms previous state-of-the-art methods and other mechanisms of including spatial context.
摘要
空间转录组学是一种新兴技术,它将组织病理学图像与空间分辨的基因表达谱对应起来。该技术有望帮助理解多种疾病,但面临着诸如专用设备和领域专业知识等显著瓶颈。在这项工作中,我们提出了 SEPAL,一种从组织的视觉外观预测基因表达谱的新模型。我们的方法利用问题本身的生物学先验,直接监督相对于平均表达水平的差异,并利用每个坐标点的局部视觉上下文,通过图神经网络进行预测。这种做法弥合了现有方法中完全局部与完全全局之间的差距。此外,我们还提出了一个新的基准,通过遵循转录组学的现行最佳实践,并将预测变量限制为具有明确空间模式的基因,来更好地定义该任务。我们在两个不同的人类乳腺癌数据集上进行的大量评估表明,SEPAL 优于此前最先进的方法以及其他引入空间上下文的机制。
Contrastive Grouping with Transformer for Referring Image Segmentation
methods: 本研究提出了一种名为 Contrastive Grouping with Transformer network(CGFormer)的掩码分类方法。该方法通过可学习的查询 token 来表示对象,随后在每两个连续层之间交替地查询语言特征并将视觉特征分组到查询 token 中,从而实现对象感知的跨模态推理。此外,CGFormer 还将对比学习与分组策略结合,以识别与指代对象相对应的查询 token 及其掩码。
results: 实验结果表明,CGFormer 在分割和泛化设定下均能持续且显著地超越现有最先进的方法。
Abstract
Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
摘要
指代图像分割旨在根据自然语言表达在图像中分割出目标指代对象。现有的单阶段方法采用逐像素分类框架,试图直接在像素级别对齐视觉和语言,因而无法捕捉关键的对象级信息。在本文中,我们提出了一种掩码分类框架——对比分组 Transformer 网络(CGFormer),它通过基于 token 的查询与分组策略显式地捕捉对象级信息。具体来说,CGFormer 首先引入可学习的查询 token 来表示对象,然后交替地查询语言特征并将视觉特征分组到查询 token 中,以进行对象感知的跨模态推理。此外,CGFormer 在每两个连续层中联合更新查询 token 和解码掩码,实现跨层级交互。最后,CGFormer 将对比学习与分组策略结合,以识别与指代对象相对应的 token 及其掩码。实验结果表明,CGFormer 在分割和泛化设定下均持续且显著地优于最先进的方法。
Comparative Analysis of Deep Learning Architectures for Breast Cancer Diagnosis Using the BreaKHis Dataset
results: Xception 模型表现最佳,F1 分数达到 0.9,准确率达到 89%;Inception 和 InceptionResNet 模型的准确率均为 87%,但 Inception 模型的 F1 分数为 87,InceptionResNet 模型的 F1 分数为 86。这些结果表明了深度学习方法在乳腺癌诊断中的价值,有助于为患者提供更好的诊断服务。
Abstract
Cancer is an extremely difficult and dangerous health problem because it manifests in so many different ways and affects so many different organs and tissues. The primary goal of this research was to evaluate deep learning models' ability to correctly identify breast cancer cases using the BreakHis dataset. The BreakHis dataset covers a wide range of breast cancer subtypes through its huge collection of histopathological pictures. In this study, we use and compare the performance of five well-known deep learning models for cancer classification: VGG, ResNet, Xception, Inception, and InceptionResNet. The results placed the Xception model at the top, with an F1 score of 0.9 and an accuracy of 89%. At the same time, the Inception and InceptionResNet models both reached an accuracy of 87%. However, the F1 score for the Inception model was 87, while that for the InceptionResNet model was 86. These results demonstrate the importance of deep learning methods in making correct breast cancer diagnoses. This highlights the potential to provide improved diagnostic services to patients. The findings of this study not only improve current methods of cancer diagnosis, but also make significant contributions to the creation of new and improved cancer treatment strategies. In a nutshell, the results of this study represent a major advancement in the direction of achieving these vital healthcare goals.
摘要
癌症是一种极其复杂且危险的健康问题,因为它的表现形式多样,并会累及多种器官和组织。本研究的主要目标是使用 BreakHis 数据集评估深度学习模型正确识别乳腺癌病例的能力。BreakHis 数据集通过其庞大的组织病理学图像收藏覆盖了多种乳腺癌亚型。在本研究中,我们使用并比较了五种著名的深度学习模型在癌症分类上的性能:VGG、ResNet、Xception、Inception 和 InceptionResNet。结果显示 Xception 模型表现最佳,F1 分数为 0.9,准确率为 89%;Inception 和 InceptionResNet 模型的准确率均为 87%,但 Inception 模型的 F1 分数为 87,而 InceptionResNet 模型的 F1 分数为 86。这些结果表明了深度学习方法在做出正确乳腺癌诊断中的重要性,并凸显了为患者提供更优诊断服务的潜力。本研究的发现不仅改进了当前的癌症诊断方法,也为制定新的、更好的癌症治疗策略做出了重要贡献。总之,本研究的结果是朝着实现这些重要医疗目标迈出的一大步。
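As an indication of how such a comparison is typically set up, the sketch below builds an ImageNet-pretrained Xception classifier for binary (benign vs. malignant) histopathology patches with tf.keras. The input size, head design, and training hyperparameters are assumptions; loading and preprocessing of the BreakHis images is not shown.

```python
import tensorflow as tf

def build_xception_classifier(input_shape=(299, 299, 3)):
    """ImageNet-pretrained Xception backbone with a small binary-classification head."""
    base = tf.keras.applications.Xception(
        weights="imagenet", include_top=False, input_shape=input_shape, pooling="avg"
    )
    base.trainable = False                      # first stage: train only the head
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.applications.xception.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_xception_classifier()
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # train_ds/val_ds: BreakHis patches
```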
RevColV2: Exploring Disentangled Representations in Masked Image Modeling
results: 实验结果显示,采用 RevColV2 架构的基础模型可以在图像分类、语义分割和物体检测等多个下游视觉任务中取得有竞争力的表现。例如,在 ImageNet-22K 数据集上进行中间微调后,RevColV2-L 在 ImageNet-1K 分类上达到 88.4% 的 top-1 准确率,在 ADE20K 语义分割上达到 58.6 mIoU。
Abstract
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. Such design enables our architecture with the nice property: maintaining disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection. For example, after intermediate fine-tuning on ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With extra teacher and large scale dataset, RevColv2-L achieves 62.1 box AP on COCO detection and 60.4 mIoU on ADE20K semantic segmentation. Code and models are released at https://github.com/megvii-research/RevCol
摘要
受预训练掩模型(MIM)的普遍使用,已经取得了领先的表现。然而,现有的MIM方法在下游应用中抛弃了解码器网络,导致预训练和细化 phases的表现不一致,从而降低下游任务的表现。在本文中,我们提出了一种新的架构——RevColV2,以解决这个问题。RevColV2架构包括底层列和顶层列,这两个列之间的信息在反向传播的过程中被恰当地传递和慢慢分离。这种设计使得RevColV2架构保持了预训练和细化 phases中的独立特征,从而实现了维持低级别特征和 semantic 信息的优良性。我们的实验结果表明,一个基于RevColV2架构的基础模型可以在多个下游视觉任务上达到竞争性的表现,如图像分类、semantic segmentation和物体检测。例如,在ImageNet-22K数据集上进行中间细化训练后,RevColV2-L模型可以达到88.4%的顶层准确率和58.6 mIoU的semantic segmentation精度。另外,通过添加教师和大规模数据集,RevColV2-L模型可以达到62.1个box AP和60.4 mIoU的semantic segmentation精度。代码和模型可以在https://github.com/megvii-research/RevCol 上下载。
Constrained CycleGAN for Effective Generation of Ultrasound Sector Images of Improved Spatial Resolution
paper_authors: Xiaofei Sun, He Li, Wei-Ning Lee for: 这个研究的目的是将多普勒超声图像(US)的空间分辨率改善,以提高心脏动态运动的评估质量。methods: 这个研究使用了一种名为CCycleGAN的新型的循环GAN模型,该模型直接使用不同的超声探针所获取的无对对的US图像进行生成。此外,CCycleGAN还引入了一种新的束约条件,以保证生成图像的结构一致性和吸收信号特征的一致性。results: 实验结果表明,CCycleGAN可以成功地生成高空间分辨率的US图像,同时也提高了图像的峰信号噪声比(PSNR)和结构相似度(SSIM)。此外,CCycleGAN生成的US图像在人体内部的心脏运动评估中也有更高的质量,特别是在深部区域。Abstract
Objective. A phased or a curvilinear array produces ultrasound (US) images with a sector field of view (FOV), which inherently exhibits spatially-varying image resolution with inferior quality in the far zone and towards the two sides azimuthally. Sector US images with improved spatial resolutions are favorable for accurate quantitative analysis of large and dynamic organs, such as the heart. Therefore, this study aims to translate US images with spatially-varying resolution to ones with less spatially-varying resolution. CycleGAN has been a prominent choice for unpaired medical image translation; however, it neither guarantees structural consistency nor preserves backscattering patterns between input and generated images for unpaired US images. Approach. To circumvent this limitation, we propose a constrained CycleGAN (CCycleGAN), which directly performs US image generation with unpaired images acquired by different ultrasound array probes. In addition to conventional adversarial and cycle-consistency losses of CycleGAN, CCycleGAN introduces an identical loss and a correlation coefficient loss based on intrinsic US backscattered signal properties to constrain structural consistency and backscattering patterns, respectively. Instead of post-processed B-mode images, CCycleGAN uses envelope data directly obtained from beamformed radio-frequency signals without any other non-linear postprocessing. Main Results. In vitro phantom results demonstrate that CCycleGAN successfully generates images with improved spatial resolution as well as higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) compared with benchmarks. Significance. CCycleGAN-generated US images of the in vivo human beating heart further facilitate higher quality heart wall motion estimation than benchmarks-generated ones, particularly in deep regions.
摘要
目标:使用phasized或curvilinear array生成ultrasound(US)图像,图像具有锐度场视野(FOV),这些图像自然而然地在远区和两侧扫描方向展现空间不均匀的图像解析质量。锐度US图像有助于准确地量化大小和动态的器官,如心脏。因此,本研究的目标是将US图像中的空间不均匀的图像解析转换为更少的空间不均匀的图像解析。方法:我们提出了一种受限制的CycleGAN(CCycleGAN),该模型直接使用不同ultrasound array探针获取的无对应图像来生成US图像。除了传统的对抗学习和循环一致性损失外,CCycleGAN还引入了基于US回射信号特性的相同损失和相关系数损失,以确保结构一致性和回射特征的保持。而不是使用后处理的B模式图像,CCycleGAN使用直接从Radio frequency信号中得到的封包数据,无需其他非线性后处理。主要结果:在医学实验室中,我们使用了医学实验室中的人工胚膜模型,对CCycleGAN进行了测试。结果表明,CCycleGAN成功地生成了高分辨率的US图像,同时具有更高的PSNR和SSIM值,比benchmarks更高。意义:CCycleGAN生成的US图像可以更好地估计人体心脏墙运动,特别是在深部区域。这些结果表明,CCycleGAN可以成为一种有用的医学图像翻译工具,可以帮助医生更好地诊断和治疗各种疾病。
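The correlation coefficient loss used to constrain backscattering patterns can be illustrated with a small sketch. This is a generic Pearson-correlation loss between input and generated envelope images, not the paper's implementation; the tensor shapes and how the term is weighted against the adversarial and cycle losses are assumptions.

```python
import torch

def correlation_coefficient_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation between input and generated envelopes; x, y: [B, 1, H, W]."""
    b = x.shape[0]
    x = x.reshape(b, -1)
    y = y.reshape(b, -1)
    x = x - x.mean(dim=1, keepdim=True)
    y = y - y.mean(dim=1, keepdim=True)
    corr = (x * y).sum(dim=1) / (x.norm(dim=1) * y.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()

# Hypothetical usage inside a CycleGAN-style objective:
# total_loss = adv_loss + cycle_loss + correlation_coefficient_loss(real_env, fake_env)
```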
Deep-Learning Framework for Optimal Selection of Soil Sampling Sites
results: 研究结果表明,我们提出的模型在测试数据集上达到了99.52%的准确率,57.35%的交集 overlap(IoU)和71.47%的 dice相似度,而现有的CNN模型的性能指标分别为66.08%、3.85%和1.98%。这表明我们的模型在土壤抽样数据集上表现出了优异性。Abstract
This work leverages recent advances of deep learning in image processing to find optimal locations that present the important characteristics of a field. The data for training are collected at different fields in local farms with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because the ground truth consists of highly imbalanced binary images. Therefore, we approached the problem with two methods: the first utilizes a state-of-the-art model with a convolutional neural network (CNN) backbone, while the second develops a deep-learning design grounded in the concepts of the transformer and self-attention. Our framework is constructed with an encoder-decoder architecture with the self-attention mechanism as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, which produces feature maps. In the decoder, we introduce atrous convolution networks to concatenate and fuse the extracted features and then export the optimal locations for soil sampling. Currently, the model has achieved impressive results on the testing dataset, with a mean accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the performance metrics of the state-of-the-art CNN-based model are 66.08%, 3.85%, and 1.98%, respectively. This indicates that our proposed model outperforms the CNN-based method on the soil-sampling dataset. To the best of our knowledge, our work is the first to provide a soil-sampling dataset with multiple attributes and leverage deep learning techniques to enable the automatic selection of soil-sampling sites. This work lays a foundation for novel applications of data science and machine-learning technologies to solve other emerging agricultural problems.
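As a hedged sketch of the decoder idea described above (atrous convolutions fusing self-attention features before exporting a sampling-site mask), the block below shows a simple dilated-convolution fusion head. The channel sizes and dilation rates are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AtrousFusion(nn.Module):
    """Fuses encoder feature maps with parallel dilated convolutions, then predicts a mask."""
    def __init__(self, in_ch: int, out_ch: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)                      # multi-scale context
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        self.head = nn.Conv2d(out_ch, 1, kernel_size=1)   # binary sampling-site logits

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = torch.cat([branch(feats) for branch in self.branches], dim=1)
        return self.head(torch.relu(self.fuse(x)))

# Example: fuse [B, 256, H, W] self-attention features into a [B, 1, H, W] mask.
logits = AtrousFusion(in_ch=256)(torch.randn(2, 256, 64, 64))
```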
摘要
AdLER: Adversarial Training with Label Error Rectification for One-Shot Medical Image Segmentation
results: 这个研究的结果显示,这个新的一对一分类方法(AdLER)可以提高分类性能,并且在没有足够训练数据的情况下具有更好的一致性和更高的精度。Abstract
Accurate automatic segmentation of medical images typically requires large datasets with high-quality annotations, making it less applicable in clinical settings due to limited training data. One-shot segmentation based on learned transformations (OSSLT) has shown promise when labeled data is extremely limited, typically including unsupervised deformable registration, data augmentation with learned registration, and segmentation learned from augmented data. However, current one-shot segmentation methods are challenged by limited data diversity during augmentation, and potential label errors caused by imperfect registration. To address these issues, we propose a novel one-shot medical image segmentation method with adversarial training and label error rectification (AdLER), with the aim of improving the diversity of generated data and correcting label errors to enhance segmentation performance. Specifically, we implement a novel dual consistency constraint to ensure anatomy-aligned registration that lessens registration errors. Furthermore, we develop an adversarial training strategy to augment the atlas image, which ensures both generation diversity and segmentation robustness. We also propose to rectify potential label errors in the augmented atlas images by estimating segmentation uncertainty, which can compensate for the imperfect nature of deformable registration and improve segmentation authenticity. Experiments on the CANDI and ABIDE datasets demonstrate that the proposed AdLER outperforms previous state-of-the-art methods by 0.7% (CANDI), 3.6% (ABIDE "seen"), and 4.9% (ABIDE "unseen") in segmentation based on Dice scores, respectively. The source code will be available at https://github.com/hsiangyuzhao/AdLER.
摘要
通常,医疗图像自动分割需要大量高质量标注数据,因此在临床设置下采用自动分割是更加困难的。一旦分割基于学习的变换(OSSLT)已经展示了在有限的标注数据下可以取得满意的结果,包括不supervised deformable registration、数据增强通过学习 registration以及基于增强数据进行分割学习。然而,当数据多样性很低时,当前一旦分割方法会受到多样性不足的限制,以及可能的标签错误引起的registration错误。为了解决这些问题,我们提出了一种基于对抗学习和标签修正的新一代医疗图像分割方法(AdLER),以提高分割性能。具体来说,我们实现了一种双重一致性约束,以降低注射错误。此外,我们开发了一种对抗训练策略,以增强生成数据的多样性和分割的Robustness。此外,我们还提出了一种纠正可能存在的标签错误的方法,通过估算分割不确定性,可以补偿杂论注射和提高分割的authenticity。实验结果表明,提案的AdLER方法在CANDI和ABIDE datasets上比前一代方法提高0.7%(CANDI)、3.6%(ABIDE "seen")和4.9%(ABIDE "unseen")的分割基于dice scores,分别。源代码将在https://github.com/hsiangyuzhao/AdLER上提供。
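A hedged sketch of uncertainty-based label rectification in the spirit of AdLER: multiple stochastic forward passes yield a per-voxel entropy map, and highly uncertain pseudo-labels are masked out of the segmentation loss. The Monte-Carlo scheme, threshold, and returned weight map are simplifications, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def rectify_pseudo_labels(model, image, pseudo_label, passes: int = 8, threshold: float = 0.2):
    """Returns the pseudo-label plus a per-voxel weight map that masks uncertain regions."""
    model.train()                                   # keep dropout active for MC sampling
    probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(passes)])
    mean_prob = probs.mean(dim=0)                   # [B, C, ...] averaged class probabilities
    entropy = -(mean_prob * torch.log(mean_prob + 1e-8)).sum(dim=1)  # uncertainty map
    weight = (entropy < threshold).float()          # 0 where the label is too uncertain
    return pseudo_label, weight                     # weight can multiply the segmentation loss
```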
NTU4DRadLM: 4D Radar-centric Multi-Modal Dataset for Localization and Mapping
paper_authors: Jun Zhang, Huayang Zhuge, Yiyao Liu, Guohao Peng, Zhenyu Wu, Haoyuan Zhang, Qiyang Lyu, Heshan Li, Chunyang Zhao, Dogan Kircali, Sanat Mharolkar, Xun Yang, Su Yi, Yuanzhe Wang, Danwei Wang
for: This paper is written for researchers and developers who are interested in Simultaneous Localization and Mapping (SLAM) using 4D radar, thermal camera, and Inertial Measurement Unit (IMU).
methods: The paper presents a new dataset called NTU4DRadLM, which includes all 6 sensors (4D radar, thermal camera, IMU, 3D LiDAR, visual camera, and RTK GPS) and is specifically designed for SLAM tasks.
results: The paper evaluates three types of SLAM algorithms using the NTU4DRadLM dataset and reports the results, which include the accuracy of the algorithms in various environments.Here’s the simplified Chinese version:
results: 这篇论文使用NTU4DRadLM数据集评估了三种SLAM算法的性能,并报告了结果,其中包括不同环境下的算法准确性。Abstract
Simultaneous Localization and Mapping (SLAM) is moving towards a robust perception age. However, LiDAR- and visual-SLAM may easily fail in adverse conditions (rain, snow, smoke, fog, etc.). In comparison, SLAM based on 4D radar, thermal camera and IMU can work robustly, but only a few related works can be found. A major reason is the lack of related datasets, which seriously hinders the research. Even though some datasets based on 4D radar have been proposed in the past four years, they are mainly designed for object detection rather than SLAM. Furthermore, they normally do not include a thermal camera. Therefore, in this paper, NTU4DRadLM is presented to meet this requirement. The main characteristics are: 1) It is the only dataset that simultaneously includes all 6 sensors: 4D radar, thermal camera, IMU, 3D LiDAR, visual camera and RTK GPS. 2) Specifically designed for SLAM tasks, it provides fine-tuned ground truth odometry and intentionally formulated loop closures. 3) Considered both a low-speed robot platform and a fast-speed unmanned vehicle platform. 4) Covered structured, unstructured and semi-structured environments. 5) Considered both middle- and large-scale outdoor environments, i.e., the 6 trajectories range from 246m to 6.95km. 6) Comprehensively evaluated three types of SLAM algorithms. In total, the dataset is around 17.6km, 85mins, 50GB and it will be accessible from this link: https://github.com/junzhang2016/NTU4DRadLM
摘要
《同时地位和地图Localization(SLAM)正在迈向一个强大的感知年代。然而,雷达和视觉SLAM可能在不利的条件下(雨、雪、烟雾等)容易失败。相比之下,基于4D雷达、热成像和IMU的SLAM可以工作稳定。然而,相关的数据集很少,这使得研究受到了严重的阻碍。尽管过去四年有一些基于4D雷达的数据集被提出,但是它们主要是为了对象检测而不是SLAM。此外,它们通常不包括热成像。因此,本文提出了NTU4DRadLM数据集。NTU4DRadLM的主要特点包括:1. 同时包含6种感知器:4D雷达、热成像、IMU、3D雷达、视觉摄像头和RTK GPS。2. 专门为SLAM任务设计,提供精度调整的地理位置轨迹和故意设计的循环关闭。3. 考虑了中速和快速无人车平台。4. 覆盖结构化、无结构化和半结构化环境。5. 考虑了中型和大型的外部环境,即6个轨迹的距离从246米到6.95公里。6. 全面评估了三种SLAM算法。总的来说,数据集约为17.6公里,85分钟,50GB,可以从以下链接获取:https://github.com/junzhang2016/NTU4DRadLM。
ASF-Net: Robust Video Deraining via Temporal Alignment and Online Adaptive Learning
results: 我们在基于新建的 dataset 上进行了参数学习过程,并开发了一种创新的恢复学习策略,该策略可以将 sintetic 和实际场景之间的差异bridged,从而提高场景适应性。我们的提出方法在三个标准准点上表现出优于常见方法,并在实际场景中具有惊喜的视觉质量。Abstract
In recent times, learning-based methods for video deraining have demonstrated commendable results. However, there are two critical challenges that these methods are yet to address: exploiting temporal correlations among adjacent frames and ensuring adaptability to unknown real-world scenarios. To overcome these challenges, we explore video deraining from a paradigm design perspective to learning strategy construction. Specifically, we propose a new computational paradigm, Alignment-Shift-Fusion Network (ASF-Net), which incorporates a temporal shift module. This module is novel to this field and provides deeper exploration of temporal information by facilitating the exchange of channel-level information within the feature space. To fully unleash the model's characterization capability, we further construct a LArge-scale RAiny video dataset (LARA), which also supports the development of this community. On the basis of the newly constructed dataset, we explore the parameter learning process by developing an innovative re-degraded learning strategy. This strategy bridges the gap between synthetic and real-world scenes, resulting in stronger scene adaptability. Our proposed approach exhibits superior performance in three benchmarks and compelling visual quality in real-world scenarios, underscoring its efficacy. The code is available at https://github.com/vis-opt-group/ASF-Net.
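The temporal shift module mentioned above can be illustrated with a minimal sketch that exchanges a slice of channels between neighbouring frames; the fraction of shifted channels is an assumption, and this is not ASF-Net's released code.

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: [B, T, C, H, W]; exchanges a slice of channels between neighbouring frames."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                      # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]      # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                 # remaining channels untouched
    return out
```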
摘要
Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning
results: 该方法在现有的标准测试集上比既有的无监督方法和部分监督方法提供更高的准确率,并且与完全监督方法相当或甚至超过。Abstract
Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association in an implicit manner. On existing benchmarks, our method outperforms the existing unsupervised methods using only limited help from the ReID head, and even provides higher accuracy than many fully supervised methods.
摘要
Unsupervised learning是一项复杂的任务,因为缺乏标签。多个对象跟踪(MOT),它无法避免互相干扰、遮挡等问题,更加困难无标签指导。在这篇文章中,我们探索视频帧中样本特征的潜在一致性,并提出了无监督相似性学习方法(UCSL),包括三种对比模块:自我对比、相互对比和抽象对比。特别是:i) 自我对比使用内帧直接和间帧间接对比,以获得特征表示,充分发挥自我相似性。ii) 相互对比将相互匹配和连续帧匹配结果对齐,解决对象遮挡的持续性负面影响。iii) 抽象对比将抽象对象相互对应,进一步增加后续对象关联的确定性。在现有的benchmark上,我们的方法比现有的无监督方法使用更少的ReID头的帮助,甚至提供了高于许多全监督方法的准确率。
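The self-contrast idea above (maximizing self-similarity of the same objects across frames) can be sketched as an InfoNCE-style loss. The pairing of embeddings across two frames and the temperature are assumptions, not UCSL's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_contrast_loss(feat_a: torch.Tensor, feat_b: torch.Tensor, tau: float = 0.1):
    """feat_a/feat_b: [N, D] embeddings of the same N objects observed in two frames."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / tau                        # cosine similarities between all pairs
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)         # each object should best match itself
```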
Exploring the Robustness of Human Parsers Towards Common Corruptions
results: 实验结果表明,提出的方法可以提高人像分割模型的robustness,并且可以在不同的图像损害情况下保持相对的表现。Abstract
Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper, we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentations from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated in a sequential manner for adapting to unforeseen image corruptions. The image-aware augmentation enriches the high diversity of training images with the help of common image operations. The model-aware augmentation strategy improves the diversity of input data by considering the model's randomness. The proposed method is model-agnostic, and it can plug and play into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method demonstrates good universality: it improves the robustness of human parsing models, and even of semantic segmentation models, when facing various common image corruptions, while still obtaining comparable performance on clean data.
摘要
Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation
results: 实验结果表明,TiO-Depth在KITTI、Cityscapes和DDAD datasets上大多数情况下都能够超过单目和双目现有方法的性能,并证明了一种两个任务合一的网络可以为单目和双目深度估计提供更高的精度。Abstract
Monocular and binocular self-supervised depth estimations are two important and related tasks in computer vision, which aim to predict scene depths from single images and stereo image pairs respectively. In literature, the two tasks are usually tackled separately by two different kinds of models, and binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to binocular models. In this paper, we propose a Two-in-One self-supervised depth estimation network, called TiO-Depth, which could not only compatibly handle the two tasks, but also improve the prediction accuracy. TiO-Depth employs a Siamese architecture and each sub-network of it could be used as a monocular depth estimation model. For binocular depth estimation, a Monocular Feature Matching module is proposed for incorporating the stereo knowledge between the two images, and the full TiO-Depth is used to predict depths. We also design a multi-stage joint-training strategy for improving the performances of TiO-Depth in both two tasks by combining the relative advantages of them. Experimental results on the KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms both the monocular and binocular state-of-the-art methods in most cases, and further verify the feasibility of a two-in-one network for monocular and binocular depth estimation. The code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch.
摘要
眼镜和双眼自助深度估计是计算机视觉中两个重要和相关的任务,它们目标是从单个图像和双图像对中预测场景的深度。在文献中,这两个任务通常由两种不同的模型来解决,而双眼模型通常无法从单个图像中预测深度,而眼镜模型的预测精度通常落后于双眼模型。在这篇论文中,我们提出了一个名为 TiO-Depth 的 Two-in-One 自助深度估计网络,可以同时处理这两个任务,并提高预测精度。TiO-Depth 使用了同构网络,并且每个子网络可以作为眼镜深度估计模型使用。对于双眼深度估计,我们提出了一个名为 Monocular Feature Matching 的单眼特征匹配模块,以利用双图像之间的相似性,并使用全 TiO-Depth 来预测深度。我们还设计了一种多阶段联合培训策略,以提高 TiO-Depth 在这两个任务中的性能。实验结果表明,TiO-Depth 在 KITTI、Cityscapes 和 DDAD 数据集上的表现都较为出色,大多数情况下超过了眼镜和双眼状态的目标方法,并证明了 Two-in-One 网络的可行性。代码可以在 GitHub 上找到:https://github.com/ZM-Zhou/TiO-Depth_pytorch。
S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection
results: 对KITTI和Waymo开放 dataset进行了广泛的实验,显示S$^3$-DA可以显著提高检测精度,在单一训练过程中实现单类和多类3D物体检测的州际最佳性能。Abstract
Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, which can predict 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, whose quality plays a decisive role in the detection accuracy. However, current unsupervised attention mechanisms without any geometry appearance awareness in transformers are susceptible to producing noisy features for query points, which severely limits the network performance and also makes the model have a poor ability to detect multi-category objects in a single training process. To tackle this problem, this paper proposes a novel "Supervised Shape&Scale-perceptive Deformable Attention" (S$^3$-DA) module for monocular 3D object detection. Concretely, S$^3$-DA utilizes visual and depth features to generate diverse local features with various shapes and scales and predict the corresponding matching distribution simultaneously to impose valuable shape&scale perception for each query. Benefiting from this, S$^3$-DA effectively estimates receptive fields for query points belonging to any category, enabling them to generate robust query features. Besides, we propose a Multi-classification-based Shape$\&$Scale Matching (MSM) loss to supervise the above process. Extensive experiments on KITTI and Waymo Open datasets demonstrate that S$^3$-DA significantly improves the detection accuracy, yielding state-of-the-art performance of single-category and multi-category 3D object detection in a single training process compared to the existing approaches. The source code will be made publicly available at https://github.com/mikasa3lili/S3-MonoDETR.
摘要
最近,基于transformer的方法在单视图3D物体检测中表现出色,可以从单个2D图像中预测3D特征。这些方法通常使用视觉和深度表示来生成查询点对象, whose quality具有决定性的影响于检测精度。然而,当前无supervised attention机制,无法考虑对象的几何外观,这会使transformer中的网络性能受到严重的限制,同时也使得模型无法在单一训练过程中检测多类对象。为解决这个问题,本文提出了一种novel的“Supervised Shape&Scale-perceptive Deformable Attention”(S$^3$-DA)模块。具体来说,S$^3$-DA利用视觉和深度特征来生成多样的本地特征,同时预测匹配分布,以便为每个查询点强制实施有价值的形状&比例见解。这使得S$^3$-DA可以efficiently估计查询点所属类别的接受领域,从而生成Robust查询特征。此外,我们提出了一种Multi-classification-based Shape$\&$Scale Matching(MSM)损失函数来监督上述过程。广泛的实验表明,S$^3$-DA可以显著提高检测精度,在单个训练过程中实现单类和多类3D物体检测的state-of-the-art性能。网站将在https://github.com/mikasa3lili/S3-MonoDETR中公开源代码。
GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning
methods: 该论文提出了一种新的并有效的集群强化框架(GBE-MLZSL),以全面利用图像的本地和全局特征,并提高预测精度和稳定性。特别是,该框架将特征地图分成多个特征组,每个特征组可以独立地在Local Information Distinguishing Module(LID)中进行训练,以保证唯一性。同时,Global Enhancement Module(GEM)是设计来保持图像的主要方向。此外,还设计了一个静止图 Structured 构建本地特征之间的相关性。
results: 实验表明,提出的GBE-MLZSL方法在大规模的MLZSL benchmark数据集NUS-WIDE和Open-Images-v4上,与其他当前state-of-the-art方法之间的margin比较大。Abstract
This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL), wherein, the model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually resort to analyzing the relationship of various seen classes residing in a sample from the dimension of spatial or semantic characteristics, and transfer the learned model to unseen ones. But they ignore the effective integration of local and global features. That is, in the process of inferring unseen classes, global features represent the principal direction of the image in the feature space, while local features should maintain uniqueness within a certain range. This integrated neglect will make the model lose its grasp of the main components of the image. Relying only on the local existence of seen classes during the inference stage introduces unavoidable bias. In this paper, we propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. Specifically, we split the feature maps into several feature groups, of which each feature group can be trained independently with the Local Information Distinguishing Module (LID) to ensure uniqueness. Meanwhile, a Global Enhancement Module (GEM) is designed to preserve the principal direction. Besides, a static graph structure is designed to construct the correlation of local features. Experiments on large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods with large margins.
摘要
这篇论文研究了多类零例学习(MLZSL)问题,即在一个样本(例如一张图像)中识别多个未经见过的类,基于已经见过的类和auxiliary知识,如semantic信息。现有的方法通常是分析样本中不同类的关系,从空间或semantic特征的角度来转移已经学习的模型到未经见过的类。但它们忽略了Integrate Local and Global Features的效果。即在推断未经见过的类时,global feature在特征空间中表示样本的主要方向,而local feature在某些范围内保持uniqueness。这种总体忽略会使模型失去样本的主要组成部分。只靠基于seen类的local存在来进行推断会引入不可避免的偏见。在这篇论文中,我们提出了一种新的和有效的集群强化框架,名为GBE-MLZSL,以便充分利用这些特性并实现更加准确和可靠的视semantic投影。具体来说,我们将特征图分成多个特征组,每个特征组可以独立地通过Local Information Distinguishing Module(LID)来确保uniqueness。同时,我们设计了Global Enhancement Module(GEM)来保持主要方向。此外,我们还设计了一个静态图Structured Graph来建立本地特征之间的相关性。实验结果表明,提出的GBE-MLZSL在大规模MLZSL benchmark数据集NUS-WIDE和Open-Images-v4上显著超越了其他最新方法。
A novel framework employing deep multi-attention channels network for the autonomous detection of metastasizing cells through fluorescence microscopy
paper_authors: Michail Mamalakis, Sarah C. Macfarlane, Scott V. Notley, Annica K. B Gad, George Panoutsos
for: distinguishing between normal and metastasizing human cells
methods: combines multi-attention channels network and global explainable techniques using fluorescence microscopy images of actin and vimentin filaments
results: unprecedented understanding of cytoskeletal changes accompanying oncogenic transformation, and potential spatial biomarker for diagnostic tools against metastasis (spatial distribution of vimentin)Here is the same information in Simplified Chinese text:
results: 未曾有的细胞变化理解,并可能提供将来的诊断工具 против转移细胞 (细胞分布的 vimentin)Abstract
We developed a transparent computational large-scale imaging-based framework that can distinguish between normal and metastasizing human cells. The method relies on fluorescence microscopy images showing the spatial organization of actin and vimentin filaments in normal and metastasizing single cells, using a combination of a multi-attention channels network and global explainable techniques. We test a classification between normal cells (Bj primary fibroblasts) and their isogenically matched, transformed and invasive counterpart (BjTertSV40TRasV12). The annotation task is not trivial to automate due to the intricacy of the biologically relevant features. In this research, we utilized established deep learning networks and our new multi-attention channel architecture. To increase the interpretability of the network - crucial for this application area - we developed an interpretable global explainable approach correlating the weighted geometric mean of the total cell images and their local GradCam scores. The significant results from our analysis allowed, for the first time, a more detailed and biologically relevant understanding of the cytoskeletal changes that accompany the oncogenic transformation of normal cells into invasive and metastasizing cells. We also paved the way for a possible spatial micrometre-level biomarker for the future development of diagnostic tools against metastasis (the spatial distribution of vimentin).
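A minimal sketch of the weighted geometric mean aggregation mentioned above, applied to per-image Grad-CAM saliency scores; the weighting scheme and example values are assumptions rather than the authors' exact definition.

```python
import numpy as np

def weighted_geometric_mean(scores: np.ndarray, weights: np.ndarray) -> float:
    """scores: positive per-image Grad-CAM scores; weights: non-negative, any scale."""
    weights = weights / weights.sum()
    return float(np.exp(np.sum(weights * np.log(scores + 1e-12))))

# Example with hypothetical saliency scores for three cell images.
print(weighted_geometric_mean(np.array([0.8, 0.6, 0.9]), np.array([1.0, 2.0, 1.0])))
```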
摘要
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
results: MagicProp 结合了图像修饰技术的灵活性和泛化生成模型的高效性,可以在输入视频中任意地区进行对象类型和艺术风格的修改,同时保持视频帧之间的 temporal consistency。广泛的实验表明 MagicProp 有效。Abstract
This paper addresses the issue of modifying the visual appearance of videos while preserving their motion. A novel framework, named MagicProp, is proposed, which disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. The flexibility of these techniques enables the editing of arbitrary regions within the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach. To achieve this, a diffusion-based conditional generation model, called PropDPM, is developed, which synthesizes the target frame by conditioning on the reference appearance, the target motion, and its previous appearance. The autoregressive editing approach ensures temporal consistency in the resulting videos. Overall, MagicProp combines the flexibility of image-editing techniques with the superior temporal consistency of autoregressive modeling, enabling flexible editing of object types and aesthetic styles in arbitrary regions of input videos while maintaining good temporal consistency across frames. Extensive experiments in various video editing scenarios demonstrate the effectiveness of MagicProp.
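The two-stage procedure can be summarized as pseudocode: edit one key frame, then propagate its appearance frame by frame with the diffusion renderer. `edit_image`, `propdpm`, and `estimate_motion` are placeholder callables, not the released implementation.

```python
def magicprop(frames, edit_image, propdpm, estimate_motion, key_idx=0):
    # Stage 1: appearance editing on a single key frame (any image-editing tool).
    edited_key = edit_image(frames[key_idx])

    # Stage 2: motion-aware autoregressive propagation with the diffusion renderer.
    outputs = [edited_key]
    prev = edited_key
    for t in range(key_idx + 1, len(frames)):
        motion = estimate_motion(frames[t - 1], frames[t])    # target motion cue
        prev = propdpm(appearance_ref=edited_key, motion=motion, prev_frame=prev)
        outputs.append(prev)
    return outputs
```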
摘要
A Generic Fundus Image Enhancement Network Boosted by Frequency Self-supervised Representation Learning
results: 对比state-of-the-art算法,GFE-Net在数据依赖度、增强性能、部署效率和扩展可用性等方面表现出优异,并且可以方便进行后续的基本图像分析。Abstract
Fundus photography is prone to suffer from image quality degradation that impacts clinical examination performed by ophthalmologists or intelligent systems. Though enhancement algorithms have been developed to promote fundus observation on degraded images, high data demands and limited applicability hinder their clinical deployment. To circumvent this bottleneck, a generic fundus image enhancement network (GFE-Net) is developed in this study to robustly correct unknown fundus images without supervised or extra data. Leveraging image frequency information, self-supervised representation learning is conducted to learn robust structure-aware representations from degraded images. Then, with a seamless architecture that couples representation learning and image enhancement, GFE-Net can accurately correct fundus images while preserving retinal structures. Comprehensive experiments are implemented to demonstrate the effectiveness and advantages of GFE-Net. Compared with state-of-the-art algorithms, GFE-Net achieves superior performance in data dependency, enhancement performance, deployment efficiency, and scale generalizability. Follow-up fundus image analysis is also facilitated by GFE-Net, whose modules are respectively verified to be effective for image enhancement.
摘要
血液照片容易受到影像质量下降的影响,这会对眼科医生或智能系统进行临床诊断带来困难。虽然有增强算法可以提高血液图像质量,但这些算法具有高数据需求和局限性,使其在临床应用中受到限制。为了绕过这个瓶颈,本研究提出了一种通用血液图像增强网络(GFE-Net),可以不需要指导或额外数据,强制约束血液图像中的结构。GFE-Net 利用图像频率信息,自我指导学习来学习血液图像中的结构,然后通过将 representation learning 和图像增强结合在一起,GFE-Net 可以准确地 corrections 血液图像,同时保持血液结构。我们进行了全面的实验,以证明 GFE-Net 的有效性和优势。相比之前的算法,GFE-Net 在数据依赖、增强性、部署效率和扩展可行性等方面具有显著优势。此外,GFE-Net 的模块也在不同的应用中进行了验证,其中每个模块都能够准确地进行图像增强。
Fearless Luminance Adaptation: A Macro-Micro-Hierarchical Transformer for Exposure Correction
results: 实验表明,本方法可以提供更加吸引人的图像修复结果,并且在low-light face recognition和low-light semantic segmentation中表现出色。Abstract
Photographs taken with less-than-ideal exposure settings often display poor visual quality. Since the correction procedures vary significantly, it is difficult for a single neural network to handle all exposure problems. Moreover, the inherent limitations of convolutions hinder the model's ability to restore faithful color or details in extremely over-/under-exposed regions. To overcome these limitations, we propose a Macro-Micro-Hierarchical transformer, which consists of a macro attention to capture long-range dependencies, a micro attention to extract local features, and a hierarchical structure for coarse-to-fine correction. Specifically, the complementary macro-micro attention designs enhance locality while allowing global interactions. The hierarchical structure enables the network to correct exposure errors of different scales layer by layer. Furthermore, we propose a contrast constraint and couple it seamlessly into the loss function, where the corrected image is pulled towards the positive sample and pushed away from dynamically generated negative samples. Thus the remaining color distortion and loss of detail can be removed. We also extend our method as an image enhancer for low-light face recognition and low-light semantic segmentation. Experiments demonstrate that our approach obtains more attractive results than state-of-the-art methods both quantitatively and qualitatively.
摘要
照片拍摄时使用不理想的曝光设置通常会导致视觉质量差。由于修正方法之间差异很大,因此单一神经网络难以处理所有曝光问题。另外,卷积的内在限制,阻碍模型恢复 faithful 的颜色或细节在极度过曝光或 Underexposed 区域。为了缓解这些限制,我们提议了一种宏微层次 transformer,它包括一个宏注意力来捕捉长距离依赖关系,一个微注意力来提取本地特征,以及一个层次结构来进行粗细修正。具体来说,宏微注意力的补做设计可以提高地方性,同时允许全局交互。层次结构使得网络可以层次修正不同的曝光错误。此外,我们还提出了一种对比约束,并将其灵活地添加到损失函数中,使 corrected 图像被pull towards 正样本,并被push away FROM 动态生成的负样本。因此,剩下的颜色扭曲和细节损失可以被去除。我们还扩展了我们的方法,用于低光照人脸识别和低光照 semantic segmentation。实验表明,我们的方法可以比 estado-of-the-art 方法更加吸引人地得到结果, both quantitatively and qualitatively。
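The contrast constraint described above (pulling the corrected image toward the positive sample and pushing it away from negatives) can be sketched as a simple distance-ratio loss. Using raw-pixel L1 distances rather than deep features, and the exact ratio form, are simplifications rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def contrast_constraint(corrected, positive, negatives, eps: float = 1e-6):
    """corrected/positive: [B, C, H, W]; negatives: list of tensors of the same shape."""
    d_pos = F.l1_loss(corrected, positive)                       # pull toward the target
    d_neg = sum(F.l1_loss(corrected, n) for n in negatives) / max(len(negatives), 1)
    return d_pos / (d_neg + eps)                                 # push away from negatives
```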
Boosting Weakly-Supervised Image Segmentation via Representation, Transform, and Compensator
results: 我们在 PASCAL VOC 2012 数据集上进行了实验,并证明了我们的方法可以准确地 segment 图像。我们在 PASCAL VOC 2012 验证集上达到了 67.2% 和 68.76% mIoU,在测试集上达到了 68.76% mIoU。此外,我们还扩展了我们的方法到弱地监督对象定位任务,并实验表明我们的方法在这个任务中仍然能够获得非常竞争力的结果。Abstract
Weakly-supervised image segmentation (WSIS) is a critical task in computer vision that relies on image-level class labels. Multi-stage training procedures have been widely used in existing WSIS approaches to obtain high-quality pseudo-masks as ground-truth, resulting in significant progress. However, single-stage WSIS methods have recently gained attention due to their potential for simplifying training procedures, despite often suffering from low-quality pseudo-masks that limit their practical applications. To address this issue, we propose a novel single-stage WSIS method that utilizes a siamese network with contrastive learning to improve the quality of class activation maps (CAMs) and achieve a self-refinement process. Our approach employs a cross-representation refinement method that expands reliable object regions by utilizing different feature representations from the backbone. Additionally, we introduce a cross-transform regularization module that learns robust class prototypes for contrastive learning and captures global context information to feed back rough CAMs, thereby improving the quality of CAMs. Our final high-quality CAMs are used as pseudo-masks to supervise the segmentation result. Experimental results on the PASCAL VOC 2012 dataset demonstrate that our method significantly outperforms other state-of-the-art methods, achieving 67.2% and 68.76% mIoU on PASCAL VOC 2012 val set and test set, respectively. Furthermore, our method has been extended to weakly supervised object localization task, and experimental results demonstrate that our method continues to achieve very competitive results.
摘要
弱样指导图像分割(WSIS)是计算机视觉中的关键任务,它基于图像级别的类标签。现有的WSIS方法多使用多个阶段训练过程来获得高质量的假标签,从而取得了显著的进步。然而,单阶段WSIS方法在最近受到了关注,因为它们可能简化训练过程,尽管经常受到低质量假标签的限制,使其在实际应用中具有局限性。为解决这个问题,我们提出了一种新的单阶段WSIS方法,该方法使用对称网络和对比学习来提高类激活图(CAM)的质量,并实现自我调整过程。我们的方法使用跨表示反复增强方法,将不同的特征表示从后向扩展可靠的物体区域,并 introduce a cross-transform regularization module,该模块学习强健的类范例,以便对比学习,并捕捉全局信息,以帮助改善CAM的质量。最终,我们的高质量CAM被用作假标签,以便监督分割结果。实验结果表明,我们的方法在PASCAL VOC 2012数据集上与其他状态对照方法相比,显著超越了它们,达到了67.2%和68.76%的mIoU在PASCAL VOC 2012验证集和测试集上,分别。此外,我们的方法还被扩展到弱有指导物体定位任务,实验结果表明,我们的方法在这个任务上仍然实现了非常竞争力的结果。
results: 在三个 popular datasets 上(包括 CIFAR100、miniImageNet 和 CUB200),提出的 B-FSCL 方法完全超越了所有现有的 FSCL 方法。Abstract
Few-shot continual learning (FSCL) has attracted intensive attention and achieved some advances in recent years, but it is now difficult to make further large gains in accuracy due to the limited number of few-shot incremental samples. Inspired by the distinctive human cognitive ability in lifelong learning, in this work, we propose a novel Big-model driven Few-shot Continual Learning (B-FSCL) framework to gradually evolve the model under the guidance of the world's big models (analogous to accumulated human knowledge). Specifically, we perform big-model driven transfer learning to leverage the powerful encoding capability of these existing big models, which can adapt the continual model to a few newly added samples while avoiding the over-fitting problem. Considering that the big model and the continual model may have different perceived results for identical images, we introduce an instance-level adaptive decision mechanism to provide high-level flexible cognitive support adjusted to varying samples. In turn, the adaptive decision can be further adopted to optimize the parameters of the continual model, performing adaptive distillation of the big model's knowledge information. Experimental results of our proposed B-FSCL on three popular datasets (including CIFAR100, miniImageNet and CUB200) completely surpass all state-of-the-art FSCL methods.
摘要
Correlated and Multi-frequency Diffusion Modeling for Highly Under-sampled MRI Reconstruction
results: 实验结果表明,提出的方法可以更高精度地重建MRI图像,并且可以加速抽象过程。Abstract
Most existing MRI reconstruction methods perform targeted reconstruction of the entire MR image without taking specific tissue regions into consideration. This may fail to emphasize the reconstruction accuracy on important tissues for diagnosis. In this study, leveraging a combination of the properties of k-space data and the diffusion process, our novel scheme focuses on mining the multi-frequency prior with different strategies to preserve fine texture details in the reconstructed image. In addition, a diffusion process can converge more quickly if its target distribution closely resembles the noise distribution in the process. This can be accomplished through various high-frequency prior extractors. The finding further solidifies the effectiveness of the score-based generative model. On top of all the advantages, our method improves the accuracy of MRI reconstruction and accelerates the sampling process. Experimental results verify that the proposed method successfully obtains more accurate reconstruction and outperforms state-of-the-art methods.
摘要
大多数现有MRI重建方法都是对整个MRI图像进行targeted重建,不考虑特定组织区域的重建精度。这可能导致重建精度不够,特别是在诊断中需要准确的组织区域。在本研究中,我们提出了一种新的方法,利用k空间数据的性质和扩散过程,将多频率优先级与不同策略相结合,以保留重建图像中细节的细腻 texture。此外,扩散过程可以更快地 converges,如果target分布和噪声分布在过程中很相似。这可以通过多种高频率优先级抽取器来实现。这些发现进一步证明了Score-based生成模型的效iveness。此外,我们的方法还提高了MRI重建的精度和采样速度。实验结果证明,我们提出的方法可以更好地重建MRI图像,并且超过了现有的方法。
A Post-Processing Based Bengali Document Layout Analysis with YOLOV8
paper_authors: Nazmus Sakib Ahmed, Saad Sakib Noor, Ashraful Islam Shanto Sikder, Abhijit Paul
for: 这 paper 的目的是提高孟加拉文档格式分析 (DLA),使用 YOLOv8 模型和创新的后处理技术。
methods: 这 paper 使用数据增强来提高模型的鲁棒性,并使用两个阶段预测策略来实现准确的元素分 segmentation。
results: 这 paper 的结果表明, ensemble 模型和后处理技术可以超越基础模型,解决在 BaDLAD 数据集中存在的问题。Abstract
This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques. We tackle challenges unique to the complex Bengali script by employing data augmentation for model robustness. After meticulous validation set evaluation, we fine-tune our approach on the complete dataset, leading to a two-stage prediction strategy for accurate element segmentation. Our ensemble model, combined with post-processing, outperforms individual base architectures, addressing issues identified in the BaDLAD dataset. By leveraging this approach, we aim to advance Bengali document analysis, contributing to improved OCR and document comprehension and BaDLAD serves as a foundational resource for this endeavor, aiding future research in the field. Furthermore, our experiments provided key insights to incorporate new strategies into the established solution.
摘要
这篇论文关注使用YOLOv8模型和创新的后处理技术进行增强孟加拉文档布局分析(DLA)。我们利用数据增强来提高模型的可靠性,并在完整的数据集上精心调整方法,实现了两stage预测策略以确定精确的元素分割。我们的集成模型,结合后处理,超越了基础模型,解决了在BaDLAD数据集中发现的问题。通过这种方法,我们希望推进孟加拉文档分析,提高OCR和文档理解。BaDLAD 数据集为这项工作提供了基础资源,有助于该领域的后续研究。此外,我们的实验提供了关键的思路,可以在已有的解决方案中添加新策略。
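A minimal sketch of fine-tuning YOLOv8 with the Ultralytics API for this layout-analysis setting. The dataset config `badlad.yaml` and the page image are hypothetical, and the paper's two-stage prediction and post-processing are only hinted at by the confidence filter here.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                               # COCO-pretrained checkpoint
model.train(data="badlad.yaml", epochs=50, imgsz=1024)   # hypothetical dataset config

# Inference on one page; further post-processing of the boxes would follow as in the paper.
results = model.predict("sample_page.png", conf=0.25)
for r in results:
    print(r.boxes.xyxy, r.boxes.cls)
```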
pSTarC: Pseudo Source Guided Target Clustering for Fully Test-Time Adaptation
methods: 本方法叫做 Pseudo Source guided Target Clustering(pSTarC),它是在实际领域变换下的TTA领域中相对未曾研究的。这种方法 Draws inspiration from target clustering techniques and exploits the source classifier for generating pseudo-source samples。
results: 实验表明,pSTarC可以减轻计算需求,同时提高预测精度。此外,我们还证明了pSTarC的普适性,并在连续TTA框架中表现出色。Abstract
Test Time Adaptation (TTA) is a pivotal concept in machine learning, enabling models to perform well in real-world scenarios, where test data distribution differs from training. In this work, we propose a novel approach called pseudo Source guided Target Clustering (pSTarC) addressing the relatively unexplored area of TTA under real-world domain shifts. This method draws inspiration from target clustering techniques and exploits the source classifier for generating pseudo-source samples. The test samples are strategically aligned with these pseudo-source samples, facilitating their clustering and thereby enhancing TTA performance. pSTarC operates solely within the fully test-time adaptation protocol, removing the need for actual source data. Experimental validation on a variety of domain shift datasets, namely VisDA, Office-Home, DomainNet-126, CIFAR-100C verifies pSTarC's effectiveness. This method exhibits significant improvements in prediction accuracy along with efficient computational requirements. Furthermore, we also demonstrate the universality of the pSTarC framework by showing its effectiveness for the continuous TTA framework.
摘要
测试时适应(TTA)是机器学习中的一个重要概念,它允许模型在真实世界中表现良好,其测试数据分布与训练数据分布不同。在这种情况下,我们提出了一种新的方法called pseudo Source guided Target Clustering(pSTarC),用于解决真实世界域转移下的TTA。这种方法 Draws inspiration from 目标划分技术,利用源分类器来生成pseudo-source样本。测试样本被策略性地与这些pseudo-source样本相对应,从而提高了TTA性能。pSTarC在完全测试时适应协议下运行,不需要实际的源数据。在多个域转移数据集上,包括VisDA、Office-Home、DomainNet-126和CIFAR-100C,我们进行了实验 validate pSTarC的效果。这种方法在预测精度和计算需求方面具有显著改进。此外,我们还证明了pSTarC框架的通用性,其在连续TTA框架中也表现出了效果。
ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data
results: 对于多个物体检测数据集(包括 COCO)和多种模型(包括 Detectron-X101 和 Faster-RCNN),ObjectLab 能够准确地检测标签错误,与其他标签质量分数相比,具有更高的准确率/回归率。Abstract
Despite powering sensitive systems like autonomous vehicles, object detection remains fairly brittle in part due to annotation errors that plague most real-world training datasets. We propose ObjectLab, a straightforward algorithm to detect diverse errors in object detection labels, including: overlooked bounding boxes, badly located boxes, and incorrect class label assignments. ObjectLab utilizes any trained object detection model to score the label quality of each image, such that mislabeled images can be automatically prioritized for label review/correction. Properly handling erroneous data enables training a better version of the same object detection model, without any change in existing modeling code. Across different object detection datasets (including COCO) and different models (including Detectron-X101 and Faster-RCNN), ObjectLab consistently detects annotation errors with much better precision/recall compared to other label quality scores.
摘要
尽管它用于感知系统如自动驾驶汽车的识别系统,但对象检测仍然比较脆弱,其中一个主要原因是训练数据中的注释错误。我们提议ObjectLab,一种简单的算法,用于检测对象检测标签中的多种错误,包括:被忽略的 bounding box、 incorrect 的位置和类别标签分配错误。ObjectLab 使用任何已经训练过的对象检测模型来评估每个图像的标签质量,以便自动优先级化需要更正的标签。正确处理错误数据可以训练一个更好的同样的对象检测模型,无需更改现有的代码。在不同的对象检测 dataset (包括 COCO)和不同的模型(包括 Detectron-X101 和 Faster-RCNN)中,ObjectLab invariably detects 注释错误的精度/回归比例远高于其他标签质量分数。
Multi-scale, Data-driven and Anatomically Constrained Deep Learning Image Registration for Adult and Fetal Echocardiography
results: 测试结果表明,这种方法可以提高图像匹配的精度和稳定性,并且可以在成人和胎儿电子医学图像中达到优秀的结果。Abstract
Temporal echocardiography image registration is a basis for clinical quantifications such as cardiac motion estimation, myocardial strain assessments, and stroke volume quantifications. In past studies, deep learning image registration (DLIR) has shown promising results and is consistently accurate and precise, requiring less computational time. We propose that a greater focus on the warped moving image's anatomic plausibility and image quality can support robust DLIR performance. Further, past implementations have focused on adult echocardiography, and there is an absence of DLIR implementations for fetal echocardiography. We propose a framework that combines three strategies for DLIR in both fetal and adult echo: (1) an anatomic shape-encoded loss to preserve physiological myocardial and left ventricular anatomical topologies in warped images; (2) a data-driven loss that is trained adversarially to preserve good image texture features in warped images; and (3) a multi-scale training scheme of a data-driven and anatomically constrained algorithm to improve accuracy. Our tests show that good anatomical topology and image textures are strongly linked to shape-encoded and data-driven adversarial losses. They improve different aspects of registration performance in a non-overlapping way, justifying their combination. Despite fundamental distinctions between adult and fetal echo images, we show that these strategies can provide excellent registration results in both adult and fetal echocardiography using the publicly available CAMUS adult echo dataset and our private multi-demographic fetal echo dataset. Our approach outperforms traditional non-DL gold standard registration approaches, including Optical Flow and Elastix. Registration improvements could be translated to more accurate and precise clinical quantification of cardiac ejection fraction, demonstrating a potential for translation.
摘要
医学影像协调是基础 для临床量化,如心脏运动评估、肌肉弹性评估和心脏血量评估。过去的研究表明,深度学习图像协调(DLIR)有扎实的结果和精度,需要较少的计算时间。我们提议更重视卷积动图像的 анатомиче可能性和图像质量,以支持Robust DLIR表现。此外,过去的实施都是成人echo,而 absence of DLIR实现 для胎儿echo。我们提议一种框架,该框架结合以下三种策略:(1)一种适应Physiological myocardial和左心脏的生物学特征的形状编码损失;(2)一种通过对图像特征进行反向学习来保持好的图像特征;(3)一种多尺度训练的数据驱动和生物学特征驱动的算法,以提高准确性。我们的测试表明,良好的生物学特征和图像特征是强相关的,这些损失可以不同方面提高协调性能。尽管成人echo和胎儿echo图像具有基本不同的特征,我们的策略可以在两者上提供出色的协调结果。我们的方法超过了传统的非深度学习标准注册方法,包括折射流和Elastix。更好的协调可能可以翻译到更准确和精度的临床量化,表明了我们的方法的潜在应用。
When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision
results: 在ScanNet-v2和S3DIS测试集上实现高质量的3D点云实例标签,并在噪音矩形框筛选情况下显示了高效性和稳定性。Abstract
Learning from bounding-box annotations has shown great potential in weakly-supervised 3D point cloud instance segmentation. However, we observed that existing methods suffer severe performance degradation with perturbed bounding box annotations. To tackle this issue, we propose a complementary image prompt-induced weakly-supervised point cloud instance segmentation (CIP-WPIS) method. CIP-WPIS leverages pretrained knowledge embedded in the 2D foundation model SAM and 3D geometric priors to achieve accurate point-wise instance labels from the bounding box annotations. Specifically, CIP-WPIS first selects image views in which the 3D candidate points of an instance are fully visible. Then, we generate complementary background and foreground prompts from projections to obtain SAM 2D instance mask predictions. According to these, we assign confidence values to points indicating the likelihood of points belonging to the instance. Furthermore, we utilize the 3D geometric homogeneity provided by superpoints to decide the final instance label assignments. In this fashion, we achieve high-quality 3D point-wise instance labels. Extensive experiments on both the ScanNet-v2 and S3DIS benchmarks demonstrate that our method is robust against noisy 3D bounding-box annotations and achieves state-of-the-art performance.
摘要
学习封包注解可以提高弱相关3D点云实例分割的潜力。然而,我们发现现有方法对受到扰动 boundING box注解时会表现出严重的性能下降。为解决这个问题,我们提出了补充图 prompt-induced 弱相关3D点云实例分割(CIP-WPIS)方法。CIP-WPIS 利用预训练在2D基础模型 SAM 中嵌入的知识和3D几何规范来实现从 bounding box 注解中获取高质量点云实例标签。具体来说,CIP-WPIS 首先选择在实例中3D候选点完全可见的图像视图。然后,我们生成补充背景和前景投影以获得 SAM 2D实例幕标注。根据这些标注,我们将点Cloud中的点分配确idence值,以表示点是否属于实例。此外,我们利用 superpoints 提供的3D几何一致性来决定实例标签分配。这种方法可以实现高质量点云实例标签。我们在 Scannet-v2 和 S3DIS benchmark上进行了广泛的实验,并证明了我们的方法对受到扰动 bounding box 注解的Robustness和性能具有状态的某些表现。
Few shot font generation via transferring similarity guided global style and quantization local style
results: 实验结果表明,该方法可以获得完整的组件级风格表示,并控制全局字形特征。与其他当前状态顶尖方法相比,该方法在不同的语言书写系统上表现出了更高的效果和普适性。代码可以在 GitHub 上找到:https://github.com/awei669/VQ-Font。Abstract
Automatic few-shot font generation (AFFG), aiming at generating new fonts with only a few glyph references, reduces the labor cost of manually designing fonts. However, the traditional AFFG paradigm of style-content disentanglement cannot capture the diverse local details of different fonts. So, many component-based approaches are proposed to tackle this problem. The issue with component-based approaches is that they usually require special pre-defined glyph components, e.g., strokes and radicals, which is infeasible for AFFG of different languages. In this paper, we present a novel font generation approach by aggregating styles from character similarity-guided global features and stylized component-level representations. We calculate the similarity scores of the target character and the referenced samples by measuring the distance along the corresponding channels from the content features, and assigning them as the weights for aggregating the global style features. To better capture the local styles, a cross-attention-based style transfer module is adopted to transfer the styles of reference glyphs to the components, where the components are self-learned discrete latent codes through vector quantization without manual definition. With these designs, our AFFG method could obtain a complete set of component-level style representations, and also control the global glyph characteristics. The experimental results reflect the effectiveness and generalization of the proposed method on different linguistic scripts, and also show its superiority when compared with other state-of-the-art methods. The source code can be found at https://github.com/awei669/VQ-Font.
摘要
自动几个字体生成(AFFG),目的是通过只需几个字形引用来生成新字体,从而减少手动设计字体的劳动成本。然而,传统的AFFG模式中的风格内容分离无法捕捉不同字体的地方细节。因此,许多组件化方法被提议。然而,这些组件化方法通常需要特定的预定义字形组件,例如笔画和基本元素,这是不适用于不同语言的AFFG。在这篇论文中,我们提出了一种新的字体生成方法,通过将风格特征从类似性指导的全局特征和精细组件水平表示相乘。我们在目标字形和参考样本之间计算相似性分数,并将其作为全局风格特征的权重进行相乘。为更好地捕捉地方风格,我们采用了交叉注意力基于的风格传输模块,将参考字形的风格传输到组件水平,其中组件是通过量化Vector без manual定义得到的自适应积分码。通过这些设计,我们的AFFG方法可以获得完整的组件级别风格表示,同时控制全局字形特征。实验结果表明了我们的方法在不同的文字系统中的效果和普遍性,以及与其他当前领域的方法相比的优势。详细代码可以在 找到。
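For the font-generation approach above, the similarity-guided global style aggregation can be sketched as a softmax-weighted average of reference style vectors, with weights given by content-feature similarity to the target character. The tensor shapes and the softmax temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_global_style(target_content, ref_contents, ref_styles, tau: float = 0.1):
    """target_content: [C]; ref_contents: [K, C]; ref_styles: [K, D] -> aggregated [D]."""
    sims = F.cosine_similarity(target_content.unsqueeze(0), ref_contents, dim=1)  # [K]
    weights = torch.softmax(sims / tau, dim=0)       # more similar references weigh more
    return (weights.unsqueeze(1) * ref_styles).sum(dim=0)
```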
results: 训练后,Mask R-CNN模型可以准确地 segmentation 土壤图像,并在不同环境下收集的图像上表现良好。训练集的损失值为0.1999,验证集的mAP值(IoU=0.5)为0.8804,并且只需0.06秒的时间来完成图像 segmentation。Abstract
The complex background of soil images collected in natural field environments affects subsequent machine-vision-based soil image recognition. Segmenting the central soil area from the image removes the influence of this complex background and is an important preprocessing step for subsequent recognition. For the first time, a deep learning method was applied to soil image segmentation, with the Mask R-CNN model selected for locating and segmenting soil regions. We constructed a soil image dataset from the collected images, marked the soil area with the EISeg annotation tool, saved the annotation information, and trained a Mask R-CNN soil instance segmentation model. The trained model obtains accurate segmentation results and performs well on soil images collected in different environments: it reaches a training loss of 0.1999 and a validation segmentation mAP (IoU=0.5) of 0.8804, and segments an image in only 0.06 s with GPU acceleration, which satisfies the real-time segmentation and detection of soil images in the field under natural conditions. Our code is provided in the Conclusions; the homepage is https://github.com/YidaMyth.
摘要
在自然环境中采集的土壤图像中的复杂背景将影响后续的土壤图像认知基于机器视觉。 segmenting 土壤中心区域从土壤图像中可以消除复杂背景的影响,这是土壤图像认知前置处理的重要步骤。 这是首次应用深度学习方法进行土壤图像分割,选择了Mask R-CNN模型来完成位置和分割土壤图像。 根据收集的土壤图像构建了土壤图像数据集,使用EISeg注意力工具标记土壤区域为土壤,并保存注意力信息。 训练Mask R-CNN土壤图像实例分割模型。 训练后的模型可以在不同环境中获得高精度的分割结果,并且在0.06秒钟内完成图像分割基于GPU加速,可以满足在自然条件下的实时分割和检测土壤图像。 可以在结论中获取我们的代码。 主页是https://github.com/YidaMyth。
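As a rough sketch (not the authors' training code), the block below sets up torchvision's Mask R-CNN for a two-class background/soil task and runs a forward pass on a stand-in image; fine-tuning on the EISeg-annotated dataset would follow.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2                                      # background + soil
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box and mask heads for the two-class soil task.
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)

# Fine-tuning on the annotated soil dataset would happen here; below is only an
# inference-style forward pass on a random stand-in image.
model.eval()
with torch.no_grad():
    image = torch.rand(3, 512, 512)
    pred = model([image])[0]                         # dict with boxes, labels, scores, masks
    soil_masks = pred["masks"][pred["scores"] > 0.5]
```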
Self-Supervised Video Transformers for Isolated Sign Language Recognition
results: 我们发现,MaskFeat 可以超过 pose-based 和监督视频模型,在 gloss-based WLASL2000 上达到 79.02% 的top-1 准确率。此外,我们还分析了这些模型对 ASL 手语表示的能力,并通过 linear probing 分析表示的多样性。这个研究证明了 ISLR 中 architecture 和预训练任务的选择对性的重要性。Abstract
This paper presents an in-depth analysis of various self-supervision methods for isolated sign language recognition (ISLR). We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes, and study all the combinations on the WLASL2000 dataset. Our findings reveal that MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000. Furthermore, we analyze these models' ability to produce representations of ASL signs using linear probing on diverse phonological features. This study underscores the value of architecture and pre-training task choices in ISLR. Specifically, our results on WLASL2000 highlight the power of masked reconstruction pre-training, and our linear probing results demonstrate the importance of hierarchical vision transformers for sign language representation.
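Linear probing, as used above to analyze the learned sign representations, can be sketched as training only a linear head on top of a frozen video encoder. The encoder interface and hyper-parameters here are stand-ins, not a specific MaskFeat checkpoint.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes, epochs=10, device="cpu"):
    """Trains only a linear classifier on top of frozen clip embeddings."""
    encoder.eval()                                   # backbone stays frozen
    head = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in train_loader:
            clips, labels = clips.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(clips)               # [B, feat_dim] clip embeddings
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```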
AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism
results: 对于HumanML3D和KIT-ML两个数据集,该方法在质量和量化评估中几乎所有超过当前状态的方法,并实现了细腻的生成和action2motion。Abstract
Generating 3D human motion based on textual descriptions has been a research focus in recent years. It requires the generated motion to be diverse, natural, and conform to the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty in learning the cross-modal relationship between text and motion, text-driven motion generation is still a challenging problem. To address these issues, we propose AttT2M, a two-stage method with a multi-perspective attention mechanism: body-part attention and global-local motion-text attention. The former focuses on the motion embedding perspective, which means introducing a body-part spatio-temporal encoder into VQ-VAE to learn a more expressive discrete latent space. The latter is from the cross-modal perspective, which is used to learn the sentence-level and word-level motion-text cross-modal relationship. The text-driven motion is finally generated with a generative transformer. Extensive experiments conducted on HumanML3D and KIT-ML demonstrate that our method outperforms the current state-of-the-art works in terms of qualitative and quantitative evaluation, and achieves fine-grained synthesis and action2motion. Our code is available at https://github.com/ZcyMonkey/AttT2M
FastPoseGait: A Toolbox and Benchmark for Efficient Pose-based Gait Recognition
methods: 这个工具箱支持多种现代 pose-based gait recognition 算法,包括多种 SOTA 算法和最新的进步。此外,这个工具箱还提供了许多预训模型和详细的 benchmark 结果,对未来的研究提供了宝贵的参考和对照。
results: 这个研究提供了一个高度可调的 pose-based gait recognition 工具箱,可以快速地进行 pose-based gait recognition 的研究。此外,这个工具箱还提供了许多预训模型和详细的 benchmark 结果,对未来的研究提供了宝贵的参考和对照。Abstract
We present FastPoseGait, an open-source toolbox for pose-based gait recognition based on PyTorch. Our toolbox supports a set of cutting-edge pose-based gait recognition algorithms and a variety of related benchmarks. Unlike other pose-based projects that focus on a single algorithm, FastPoseGait integrates several state-of-the-art (SOTA) algorithms under a unified framework, incorporating both the latest advancements and best practices to ease the comparison of effectiveness and efficiency. In addition, to promote future research on pose-based gait recognition, we provide numerous pre-trained models and detailed benchmark results, which offer valuable insights and serve as a reference for further investigations. By leveraging the highly modular structure and diverse methods offered by FastPoseGait, researchers can quickly delve into pose-based gait recognition and promote development in the field. In this paper, we outline various features of this toolbox, aiming that our toolbox and benchmarks can further foster collaboration, facilitate reproducibility, and encourage the development of innovative algorithms for pose-based gait recognition. FastPoseGait is available at https://github.com//BNU-IVC/FastPoseGait and is actively maintained. We will continue updating this report as we add new features.
摘要
我们推出 FastPoseGait,一个基于 PyTorch 的开源 pose-based 步态识别工具箱。该工具箱支持一系列最先进的 pose-based 步态识别算法以及多种相关 benchmark。与其他只关注单一算法的 pose-based 项目不同,FastPoseGait 在统一框架下集成了多种 SOTA 算法,并融合了最新进展与最佳实践,便于比较各方法的效果与效率。此外,为促进未来的 pose-based 步态识别研究,我们提供了大量预训练模型和详细的 benchmark 结果,为后续研究提供有价值的参考。借助 FastPoseGait 高度模块化的结构和多样的方法,研究人员可以快速切入 pose-based 步态识别领域并推动其发展。在本文中,我们介绍了该工具箱的各项特性,希望工具箱和 benchmark 能进一步促进合作、提升可复现性,并激励 pose-based 步态识别领域创新算法的发展。FastPoseGait 可在 https://github.com/BNU-IVC/FastPoseGait 下载,并持续维护;我们将随新特性的加入继续更新本报告。
Towards High-Frequency Tracking and Fast Edge-Aware Optimization
paper_authors: Akash Bapat
for: 这篇论文旨在将AR/VR跟踪系统的跟踪频率提升数个数量级,超越现有最先进水平。
methods: 论文提出一种利用多个商用摄像头的方法,借助摄像头的卷帘快门和径向畸变等特性实现高频跟踪。
results: 实验表明,该方法可以在不同幅度的运动下实现高精度跟踪,并可与现有最先进系统进行对比。Abstract
This dissertation advances the state of the art for AR/VR tracking systems by increasing the tracking frequency by orders of magnitude and proposes an efficient algorithm for the problem of edge-aware optimization. AR/VR is a natural way of interacting with computers, where the physical and digital worlds coexist. We are on the cusp of a radical change in how humans perform and interact with computing. Humans are sensitive to small misalignments between the real and the virtual world, and tracking at kilo-Hertz frequencies becomes essential. Current vision-based systems fall short, as their tracking frequency is implicitly limited by the frame-rate of the camera. This thesis presents a prototype system which can track at orders of magnitude higher than the state-of-the-art methods using multiple commodity cameras. The proposed system exploits characteristics of the camera traditionally considered as flaws, namely rolling shutter and radial distortion. The experimental evaluation shows the effectiveness of the method for various degrees of motion. Furthermore, edge-aware optimization is an indispensable tool in the computer vision arsenal for accurate filtering of depth-data and image-based rendering, which is increasingly being used for content creation and geometry processing for AR/VR. As applications increasingly demand higher resolution and speed, there exists a need to develop methods that scale accordingly. This dissertation proposes such an edge-aware optimization framework which is efficient, accurate, and algorithmically scales well, all of which are much desirable traits not found jointly in the state of the art. The experiments show the effectiveness of the framework in a multitude of computer vision tasks such as computational photography and stereo.
Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs
results: 研究结果显示,新的评估metric 具有高相关性,并且可以帮助加速机器学习影像压缩器的研究。此外,研究者还将dataset和FRVQA模型开源,以便其他人可以进一步改进FRVQA模型。Abstract
Machine learning-based video codecs have made significant progress in the past few years. A critical area in the development of ML-based video codecs is an accurate evaluation metric that does not require an expensive and slow subjective test. We show that existing evaluation metrics that were designed and trained on DSP-based video codecs are not highly correlated to subjective opinion when used with ML video codecs due to the video artifacts being quite different between ML and video codecs. We provide a new dataset of ML video codec videos that have been accurately labeled for quality. We also propose a new full reference video quality assessment (FRVQA) model that achieves a Pearson Correlation Coefficient (PCC) of 0.99 and a Spearman's Rank Correlation Coefficient (SRCC) of 0.99 at the model level. We make the dataset and FRVQA model open source to help accelerate research in ML video codecs, and so that others can further improve the FRVQA model.
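As a side note for readers reproducing the evaluation, the two correlation measures reported above can be computed directly with SciPy; the arrays below are placeholder scores, not data from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder subjective (MOS) and predicted quality scores for a few clips.
mos       = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 1.7])
predicted = np.array([4.0, 3.3, 2.2, 4.9, 3.6, 1.9])

pcc, _ = pearsonr(mos, predicted)    # linear correlation
srcc, _ = spearmanr(mos, predicted)  # rank-order correlation
print(f"PCC = {pcc:.3f}, SRCC = {srcc:.3f}")
```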
paper_authors: K. Acharya, W. Raza, C. M. J. M. Dourado Jr, A. Velasquez, H. Song for: 本研究的目的是对 neurosymbolic 人工智能(Neurosymbolic AI)领域的发展进行文献综述,特别是 neurosymbolic deep learning(Neurosymbolic DL)和 neurosymbolic reinforcement learning(Neurosymbolic RL)这两个子领域。methods: 本研究使用文献综述的方法,对 neurosymbolic RL 领域的研究进行分类和概括。三种分类方法是:学习 для理解、理解 для学习和学习-理解。这些分类方法再次细分为各个应用领域。results: 本研究发现 neurosymbolic RL 领域的研究主要集中在三个方面:学习、理解和决策。学习方面包括对于不同应用领域的学习方法和技术的研究,例如 image recognition 和自然语言处理。理解方面包括对于不同应用领域的理解方法和技术的研究,例如知识 Graph 和 Semantic Reasoning。决策方面包括对于不同应用领域的决策方法和技术的研究,例如 deep reinforcement learning 和 Transfer Learning。Abstract
The area of Neurosymbolic Artificial Intelligence (Neurosymbolic AI) is rapidly developing and has become a popular research topic, encompassing sub-fields such as Neurosymbolic Deep Learning (Neurosymbolic DL) and Neurosymbolic Reinforcement Learning (Neurosymbolic RL). Compared to traditional learning methods, Neurosymbolic AI offers significant advantages by simplifying complexity and providing transparency and explainability. Reinforcement Learning(RL), a long-standing Artificial Intelligence(AI) concept that mimics human behavior using rewards and punishment, is a fundamental component of Neurosymbolic RL, a recent integration of the two fields that has yielded promising results. The aim of this paper is to contribute to the emerging field of Neurosymbolic RL by conducting a literature survey. Our evaluation focuses on the three components that constitute Neurosymbolic RL: neural, symbolic, and RL. We categorize works based on the role played by the neural and symbolic parts in RL, into three taxonomies:Learning for Reasoning, Reasoning for Learning and Learning-Reasoning. These categories are further divided into sub-categories based on their applications. Furthermore, we analyze the RL components of each research work, including the state space, action space, policy module, and RL algorithm. Additionally, we identify research opportunities and challenges in various applications within this dynamic field.
摘要
神经符号人工智能(Neurosymbolic AI)领域发展迅速,已成为热门研究方向,涵盖 Neurosymbolic Deep Learning(Neurosymbolic DL)和 Neurosymbolic Reinforcement Learning(Neurosymbolic RL)等子领域。相比传统学习方法,Neurosymbolic AI 具有显著优势,例如降低复杂性并提供透明性与可解释性。强化学习(RL)是人工智能中历史悠久的概念,通过奖励与惩罚模拟人类行为,也是 Neurosymbolic RL 的基础组件;Neurosymbolic RL 作为两个领域的最新结合,已经取得了有前景的成果。本文的目标是对新兴的 Neurosymbolic RL 领域进行文献综述。我们的评估聚焦于构成 Neurosymbolic RL 的三个组件:神经、符号和 RL。根据神经与符号部分在 RL 中扮演的角色,我们将相关工作分为三类:学习用于推理、推理用于学习以及学习-推理,并按应用进一步划分子类别。此外,我们还分析了每项研究中的 RL 组件,包括状态空间、动作空间、策略模块和 RL 算法,并指出了这一快速发展领域中各类应用的研究机会与挑战。
Deep Deformable Models: Learning 3D Shape Abstractions with Part Consistency
results: 在ShapeNet数据集上进行了广泛的实验,证明DDM在形状抽象的重建精度和部件一致性上明显优于现有最先进方法。Abstract
The task of shape abstraction with semantic part consistency is challenging due to the complex geometries of natural objects. Recent methods learn to represent an object shape using a set of simple primitives to fit the target. However, in these methods, the primitives used do not always correspond to real parts or lack geometric flexibility for semantic interpretation. In this paper, we investigate salient and efficient primitive descriptors for accurate shape abstractions, and propose Deep Deformable Models (DDMs). DDM employs global deformations and diffeomorphic local deformations. These properties enable DDM to abstract complex object shapes with significantly fewer primitives that offer broader geometry coverage and finer details. DDM is also capable of learning part-level semantic correspondences due to the differentiable and invertible properties of our primitive deformation. Moreover, DDM learning formulation is based on dynamic and kinematic modeling, which enables joint regularization of each sub-transformation during primitive fitting. Extensive experiments on ShapeNet demonstrate that DDM outperforms the state-of-the-art in terms of reconstruction and part consistency by a notable margin.
摘要
由于自然物体的几何形状十分复杂,具有语义部件一致性的形状抽象是一项困难的任务。现有方法通过一组简单基元来拟合目标形状,但这些基元并不总是对应真实的部件,或缺乏用于语义解释的几何灵活性。在本文中,我们研究了显著且高效的基元描述符以实现精确的形状抽象,并提出了深度可变形模型(DDM)。DDM 采用全局形变和微分同胚的局部形变,这些特性使其能够用明显更少的基元抽象复杂物体形状,同时覆盖更广的几何范围和更精细的细节。由于基元形变具有可微且可逆的性质,DDM 还能够学习部件级的语义对应。此外,DDM 的学习框架基于动力学与运动学建模,可以在基元拟合过程中对各个子变换进行联合正则化。在 ShapeNet 上的大量实验表明,DDM 在重建质量和部件一致性方面显著优于现有最先进方法。
Explainability for Large Language Models: A Survey
results: 本文提供了一个结构化的概述,描述了基于Transformer架构的语言模型的解释技术。还介绍了评估生成的解释 metric,以及如何使用解释来调试模型和提高性能。Abstract
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: traditional fine-tuning-based paradigm and prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations, and discuss how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional machine learning models.
The survey additionally sketches its taxonomy as follows:
- Traditional fine-tuning-based paradigm — local explanations: feature importance analysis, saliency maps, attention visualization; global explanations: model interpretability techniques, knowledge distillation.
- Prompting-based paradigm — local explanations: prompt-based attention analysis, prompt-based feature importance analysis; global explanations: prompt-based knowledge distillation.
- Evaluation metrics: accuracy, F1-score, ROUGE score, BLEU score.
- Challenges and opportunities: interpretability of complex models, lack of transparency in decision-making processes, potential biases and ethical considerations, opportunities for improving model performance and trustworthiness, and potential applications in natural language processing and beyond.
Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
results: 在一个synthetic多Modal推动环境中,该方法可以准确地推荐多Modal的内容项,不需要额外学习。Abstract
We present a method for zero-shot recommendation of multimodal non-stationary content that leverages recent advancements in the field of generative AI. We propose rendering inputs of different modalities as textual descriptions and to utilize pre-trained LLMs to obtain their numerical representations by computing semantic embeddings. Once unified representations of all content items are obtained, the recommendation can be performed by computing an appropriate similarity metric between them without any additional learning. We demonstrate our approach on a synthetic multimodal nudging environment, where the inputs consist of tabular, textual, and visual data.
摘要
我们提出了一种面向多模态非平稳内容的零样本推荐方法,利用生成式AI的最新进展实现。我们将不同模态的输入统一表示为文本描述,并使用预训练的大语言模型计算语义嵌入,得到它们的数值表示。在获得所有内容项的统一表示后,推荐即可通过计算适当的相似度度量来完成,无需任何额外学习。我们在一个合成的多模态助推(nudging)环境中演示了该方法,输入包括表格、文本和视觉数据。
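A minimal sketch of this embed-then-match idea is given below. It assumes sentence-transformers as the embedding backend and cosine similarity as the matching metric; the paper's own choice of LLM embeddings and similarity measure may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Render every item, whatever its original modality, as a short textual description.
items = [
    "Tabular record: 25-year-old runner, prefers low-sugar breakfast options",
    "Article: five easy stretching routines for desk workers",
    "Image caption: a bowl of oatmeal with fresh berries",
]
query = "healthy morning habits for someone who exercises regularly"

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
item_vecs = model.encode(items, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors; no training needed.
scores = item_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {items[idx]}")
```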
Mitigating Motion Blur for Robust 3D Baseball Player Pose Modeling for Pitch Analysis
results: 提高 pose 估计模型对运动人体动作的处理能力,并在不同的实际场景和摄像头位置下保持模型的稳定性Abstract
Using videos to analyze pitchers in baseball can play a vital role in strategizing and injury prevention. Computer vision-based pose analysis offers a time-efficient and cost-effective approach. However, the use of accessible broadcast videos, with a 30fps framerate, often results in partial body motion blur during fast actions, limiting the performance of existing pose keypoint estimation models. Previous works have primarily relied on fixed backgrounds, assuming minimal motion differences between frames, or utilized multiview data to address this problem. To this end, we propose a synthetic data augmentation pipeline to enhance the model's capability to deal with the pitcher's blurry actions. In addition, we leverage in-the-wild videos to make our model robust under different real-world conditions and camera positions. By carefully optimizing the augmentation parameters, we observed a notable reduction in the loss by 54.2% and 36.2% on the test dataset for 2D and 3D pose estimation respectively. By applying our approach to existing state-of-the-art pose estimators, we demonstrate an average improvement of 29.2%. The findings highlight the effectiveness of our method in mitigating the challenges posed by motion blur, thereby enhancing the overall quality of pose estimation.
摘要
利用视频分析棒球投手,对制定战术和预防伤病具有重要作用。基于计算机视觉的姿态分析提供了一种省时且低成本的方法。然而,常见的广播视频帧率仅为30fps,快速动作常导致人体局部运动模糊,限制了现有姿态关键点估计模型的性能。先前的工作主要依赖固定背景、假设帧间运动差异很小,或借助多视角数据来解决该问题。为此,我们提出一种合成数据增强管线,以提高模型处理投手模糊动作的能力;同时利用真实场景视频,使模型在不同的现实条件和相机位置下更加稳健。通过精心优化增强参数,我们在测试集上观察到2D和3D姿态估计的损失分别下降54.2%和36.2%;将该方法应用于现有最先进的姿态估计器后,平均改进达29.2%。这些结果表明,我们的方法能有效缓解运动模糊带来的挑战,从而提升整体姿态估计质量。
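The abstract does not spell out the augmentation pipeline, so the snippet below is only one plausible way to synthesize motion blur for training: averaging an image with shifted copies of itself to mimic fast limb motion. The function name and the shift range are illustrative assumptions.

```python
import numpy as np

def synthetic_motion_blur(image, max_shift=8, steps=5, rng=None):
    """Approximate motion blur by averaging the image with shifted copies.

    image: float32 array of shape (H, W, 3) in [0, 1]
    Returns a blurred image of the same shape.
    """
    rng = rng or np.random.default_rng()
    # Random blur direction and magnitude, as if the subject moved during exposure.
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    acc = np.zeros_like(image)
    for t in np.linspace(0.0, 1.0, steps):
        shifted = np.roll(image, shift=(int(t * dy), int(t * dx)), axis=(0, 1))
        acc += shifted
    return acc / steps

# Toy usage on a random "frame".
frame = np.random.rand(256, 256, 3).astype(np.float32)
blurred = synthetic_motion_blur(frame)
print(blurred.shape, blurred.dtype)
```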
Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation
results: 这个系统在实验中仅使用了几个任务物品进行训练,但能够实现zero-shot转移到真实世界中的机器人,并且能够对不同的物品进行适应和自动调整。更多细节和视频结果可以在https://sequential-dexterity.github.io获取。Abstract
Many real-world manipulation tasks consist of a series of subtasks that are significantly different from one another. Such long-horizon, complex tasks highlight the potential of dexterous hands, which possess adaptability and versatility, capable of seamlessly transitioning between different modes of functionality without the need for re-grasping or external tools. However, the challenges arise due to the high-dimensional action space of dexterous hand and complex compositional dynamics of the long-horizon tasks. We present Sequential Dexterity, a general system based on reinforcement learning (RL) that chains multiple dexterous policies for achieving long-horizon task goals. The core of the system is a transition feasibility function that progressively finetunes the sub-policies for enhancing chaining success rate, while also enables autonomous policy-switching for recovery from failures and bypassing redundant stages. Despite being trained only in simulation with a few task objects, our system demonstrates generalization capability to novel object shapes and is able to zero-shot transfer to a real-world robot equipped with a dexterous hand. More details and video results could be found at https://sequential-dexterity.github.io
摘要
许多真实世界的操作任务由一系列差异很大的子任务组成。这类长时程、复杂的任务凸显了灵巧手的适应性与多功能性:它可以在不同功能模式之间无缝切换,而无需重新抓取或借助外部工具。然而,灵巧手的高维动作空间以及长时程任务复杂的组合动力学也带来了挑战。我们提出了 Sequential Dexterity,一种基于强化学习(RL)的通用系统,通过串联多个灵巧策略来实现长时程任务目标。系统的核心是一个转移可行性函数,它逐步微调各子策略以提高串联成功率,并支持自主的策略切换,以便从失败中恢复并跳过冗余阶段。尽管仅在仿真环境中使用少量任务物体进行训练,我们的系统仍能零样本迁移到配备灵巧手的真实机器人上。更多细节和视频结果见 https://sequential-dexterity.github.io 。
results: Diffusion-CCSP 模型能够强大地泛化到新的约束组合中,并可以与任务和运动规划结合,以生成包含离散和连续参数动作的长期计划。
for: This paper proposes a method for learning to solve continuous constraint satisfaction problems (CCSP) in robotic reasoning and planning.
methods: The proposed method uses a compositional diffusion continuous constraint solver (Diffusion-CCSP) model, which represents CCSPs as factor graphs and combines the energies of diffusion models trained to sample for individual constraint types.
results: The Diffusion-CCSP model demonstrates strong generalization to novel combinations of known constraints, and can be integrated into a task and motion planner to devise long-horizon plans that include actions with both discrete and continuous parameters.Abstract
This paper introduces an approach for learning to solve continuous constraint satisfaction problems (CCSP) in robotic reasoning and planning. Previous methods primarily rely on hand-engineering or learning generators for specific constraint types and then rejecting the value assignments when other constraints are violated. By contrast, our model, the compositional diffusion continuous constraint solver (Diffusion-CCSP) derives global solutions to CCSPs by representing them as factor graphs and combining the energies of diffusion models trained to sample for individual constraint types. Diffusion-CCSP exhibits strong generalization to novel combinations of known constraints, and it can be integrated into a task and motion planner to devise long-horizon plans that include actions with both discrete and continuous parameters. Project site: https://diffusion-ccsp.github.io/
摘要
这篇论文介绍了一种在机器人推理与规划中学习求解连续约束满足问题(CCSP)的方法。先前的方法主要依靠人工设计或针对特定约束类型学习生成器,当其他约束被违反时再拒绝相应的取值。而我们的模型——组合式扩散连续约束求解器(Diffusion-CCSP)——将 CCSP 表示为因子图,并组合针对各类约束训练的扩散模型的能量,从而得到 CCSP 的全局解。Diffusion-CCSP 对已知约束的全新组合具有很强的泛化能力,并且可以与任务和运动规划器结合,生成包含离散与连续参数动作的长时程规划。项目网站:https://diffusion-ccsp.github.io/
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
results: 实验结果显示,这个方法可以将预训模型压缩到2.5GB(3bit/weight),并且在较宽的语言模型测试 benchmark 上保持了良好的准确性(例如PIQA的准确度为77.7%,Winograde的准确度为66.1%等)。Abstract
Since Large Language Models or LLMs have demonstrated high-quality performance on many complex language tasks, there is a great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight-clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. Especially, Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this paper, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques to reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on CPU for the backward pass of DKM, we compressed the tensor by applying uniquification and sharding after checking if there is no duplicated tensor previously copied to CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3bit/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130$\times$, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for Winograde, and so on).
摘要
自 Large Language Models (LLMs) 在许多复杂语言任务上表现出色,因此有很大的兴趣将这些 LLMs 带到移动设备上进行更快的响应和更好的隐私保护。然而,LLMs 的大小(即数十亿参数)需要非常有效的压缩以适应存储有限的设备。许多压缩技术之一是 weight-clustering,它是一种非线性量化,并且由现代智能手机支持。然而,它的训练负担是 LLM 精度调整的瓶颈。特别是 Differentiable KMeans Clustering (DKM) 表现出了状态码的质量与精度回归的最佳平衡,但它的内存复杂度使其几乎不可能应用于训练时 LLM 压缩。在这篇论文中,我们提出了一种内存高效的 DKM 实现,即 eDKM,通过新的技术减少 DKM 的内存占用量。为一个给定的矩阵在 CPU 上进行 backwards 的 DKM,我们将矩阵压缩通过应用 uniquification 和 sharding,并且只有在检查到矩阵没有已经被复制到 CPU 的情况下进行压缩。我们的实验结果表明,我们可以使用 eDKM 将预训练的 LLaMA 7B 模型从 12.6 GB 压缩到 2.5 GB (3bit/weight),在 Alpaca 数据集上进行精度调整,同时在更广泛的 LLM 标准准则(例如 PIQA 77.7%、Winograde 66.1% 等)上保持良好的准确率。
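To make the weight-clustering idea concrete, here is a plain k-means clustering of a weight tensor into a small shared codebook, which is the non-differentiable baseline that DKM/eDKM refine; it is a sketch for intuition, not the eDKM algorithm or its memory optimizations.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(weights, n_clusters=8):
    """Quantize a weight tensor to n_clusters shared values (3 bits -> 8 clusters)."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()          # the shared weight values
    assignments = km.labels_                        # per-weight cluster index
    quantized = codebook[assignments].reshape(weights.shape)
    return quantized, codebook, assignments

w = np.random.randn(256, 128).astype(np.float32)    # a toy layer weight matrix
q, codebook, idx = cluster_weights(w, n_clusters=8)
print("codebook:", np.round(codebook, 3))
print("mean abs quantization error:", np.abs(w - q).mean())
```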
Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries
paper_authors: Jiaqi Liu, Yonghao Long, Kai Chen, Cheuk Hei Leung, Zerui Wang, Qi Dou
for: precisely segmenting surgical instrument tips to enable downstream applications in robotic surgery, such as skill assessment, tool-tissue interaction, and deformation modeling, as well as surgical autonomy.
methods: a novel visual-kinematics graph learning framework that encodes relational features of instrument parts from both image and kinematics, and a cross-modal contrastive loss to incorporate robust geometric prior from kinematics to image for tip segmentation.
results: the proposed multi-modal segmentation method significantly outperformed current image-based state-of-the-art approaches, exceeding averagely 11.2% on Dice, on a private paired visual-kinematics dataset including multiple procedures.Abstract
Accurate segmentation of surgical instrument tip is an important task for enabling downstream applications in robotic surgery, such as surgical skill assessment, tool-tissue interaction and deformation modeling, as well as surgical autonomy. However, this task is very challenging due to the small sizes of surgical instrument tips, and significant variance of surgical scenes across different procedures. Although much effort has been made on visual-based methods, existing segmentation models still suffer from low robustness thus not usable in practice. Fortunately, kinematics data from the robotic system can provide reliable prior for instrument location, which is consistent regardless of different surgery types. To make use of such multi-modal information, we propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip given various surgical procedures. Specifically, a graph learning framework is proposed to encode relational features of instrument parts from both image and kinematics. Next, a cross-modal contrastive loss is designed to incorporate robust geometric prior from kinematics to image for tip segmentation. We have conducted experiments on a private paired visual-kinematics dataset including multiple procedures, i.e., prostatectomy, total mesorectal excision, fundoplication and distal gastrectomy on cadaver, and distal gastrectomy on porcine. The leave-one-procedure-out cross validation demonstrated that our proposed multi-modal segmentation method significantly outperformed current image-based state-of-the-art approaches, exceeding averagely 11.2% on Dice.
摘要
准确分割手术器械尖端是机器人手术中支撑各类下游应用的重要任务,例如手术技能评估、器械-组织交互与形变建模,以及手术自主化。然而,由于器械尖端尺寸很小,且不同术式的手术场景差异显著,这一任务极具挑战。尽管已有大量基于视觉的方法,现有分割模型的鲁棒性仍然不足,难以在实际中使用。幸运的是,机器人系统的运动学数据可以为器械位置提供可靠的先验,且这一先验在不同术式下保持一致。为利用这种多模态信息,我们提出了一种新的视觉-运动学图学习框架,以在各种术式下准确分割器械尖端。具体而言,我们提出一个图学习框架,从图像和运动学两方面编码器械各部件之间的关系特征;随后设计了一种跨模态对比损失,将运动学中稳健的几何先验引入图像,用于尖端分割。我们在一个私有的视觉-运动学配对数据集上进行了实验,涵盖多种术式,包括尸体上的前列腺切除术、全直肠系膜切除术、胃底折叠术和远端胃切除术,以及猪体上的远端胃切除术。留一术式交叉验证表明,我们提出的多模态分割方法显著优于当前基于图像的最先进方法,Dice 平均提升超过11.2%。
Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities
methods: 该模型结构称为“桥接干扰模型”(BDM),具有脊梁支网络结构,能够同时学习非英语语言 semantics 和英语 TTI 社区的 latent space 兼容性。
results: 经验表明,BDM 可以不仅生成准确表达非英语语言 semantics 的图像,还可以与多种英语 TTI 插件相容,如不同的检查点、LoRA、ControlNet、Dreambooth 等等。此外,BDM 还可以同时生成 combine 非英语 native 和英语 native semantics 的内容,促进文化交流。Abstract
Text-to-Image generation (TTI) technologies are advancing rapidly, especially in the English language communities. However, English-native TTI models inherently carry biases from English world centric training data, which creates a dilemma for development of other language-native TTI models. One common choice is fine-tuning the English-native TTI model with translated samples from non-English communities. It falls short of fully addressing the model bias problem. Alternatively, training non-English language native models from scratch can effectively resolve the English world bias, but diverges from the English TTI communities, thus not able to utilize the strides continuously gaining in the English TTI communities any more. To build non-English language native TTI model meanwhile keep compatability with the English TTI communities, we propose a novel model structure referred as "Bridge Diffusion Model" (BDM). The proposed BDM employs a backbone-branch network structure to learn the non-English language semantics while keep the latent space compatible with the English-native TTI backbone, in an end-to-end manner. The unique advantages of the proposed BDM are that it's not only adept at generating images that precisely depict non-English language semantics, but also compatible with various English-native TTI plugins, such as different checkpoints, LoRA, ControlNet, Dreambooth, and Textual Inversion, etc. Moreover, BDM can concurrently generate content seamlessly combining both non-English native and English-native semantics within a single image, fostering cultural interaction. We verify our method by applying BDM to build a Chinese-native TTI model, whereas the method is generic and applicable to any other language.
摘要
文本到图像生成(TTI)技术在英语社区中进步快速,但英语原生TTI模型带有英语世界中心训练数据的偏见问题。一种常见的选择是使用翻译后的非英语样本来微调英语原生TTI模型,但这并不能完全解决模型偏见问题。 Alternatively, 从scratch来训练非英语语言原生TTI模型可以有效解决英语世界偏见问题,但是这会与英语TTI社区分离,无法再利用英语TTI社区的成果。为建立非英语语言原生TTI模型,同时保持与英语TTI社区的兼容性,我们提出了一种新的模型结构, referred to as "Bridge Diffusion Model" (BDM)。我们的提议的BDM模型采用了后脊架-分支网络结构,通过练习非英语语言 semantics 来学习非英语语言 semantics,并保持与英语原生TTI脊架兼容的粒子空间,从而实现了端到端的学习。BDM模型具有以下优点:一是能够准确地描述非英语语言 semantics,二是可以与英语原生TTI插件(如不同的检查点、LoRA、ControlNet、Dreambooth、Textual Inversion等)兼容,三是能够同时生成 combining both non-English native and English-native semantics within a single image,激发文化交流。我们验证了我们的方法,通过应用BDM模型来建立一个中文原生TTI模型,而这种方法是通用的,可以应用于任何其他语言。
From Specific to Generic Learned Sorted Set Dictionaries: A Theoretically Sound Paradigm Yelding Competitive Data Structural Boosters in Practice
results: 我们 obtiained several interesting results,包括(a)首个学习优质二分搜索森林,其 mean access time bounded by Entropy 的概率分布下的访问 Dictionary。(b)首个学习排序集合,在动态情况下,在权衡分析设置下,与经典字典匹配的时间上限。这后者在广泛接受的宇宙大小下。实验部分,软件开发相对复杂,显示了非常有趣的发现,即我们的总结可以生成有效竞争力的学习数据结构加速器,即使与特定的benchmark模型相比。Abstract
This research concerns Learned Data Structures, a recent area that has emerged at the crossroad of Machine Learning and Classic Data Structures. It is methodologically important and with a high practical impact. We focus on Learned Indexes, i.e., Learned Sorted Set Dictionaries. The proposals available so far are specific in the sense that they can boost, indeed impressively, the time performance of Table Search Procedures with a sorted layout only, e.g., Binary Search. We propose a novel paradigm that, complementing known specialized ones, can produce Learned versions of any Sorted Set Dictionary, for instance, Balanced Binary Search Trees or Binary Search on layouts other that sorted, i.e., Eytzinger. Theoretically, based on it, we obtain several results of interest, such as (a) the first Learned Optimum Binary Search Forest, with mean access time bounded by the Entropy of the probability distribution of the accesses to the Dictionary; (b) the first Learned Sorted Set Dictionary that, in the Dynamic Case and in an amortized analysis setting, matches the same time bounds known for Classic Dictionaries. This latter under widely accepted assumptions regarding the size of the Universe. The experimental part, somewhat complex in terms of software development, clearly indicates the nonobvious finding that the generalization we propose can yield effective and competitive Learned Data Structural Booster, even with respect to specific benchmark models.
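To give a flavor of what a "learned" sorted set dictionary replaces, the sketch below fits a simple linear model from key to array position and then corrects the prediction with a bounded local search. This is the generic learned-index idea, not the specific constructions (learned binary search forests, Eytzinger layouts) studied in the paper.

```python
import bisect
import numpy as np

class TinyLearnedIndex:
    """Linear key->position model plus a local search within the max observed error."""

    def __init__(self, sorted_keys):
        self.keys = np.asarray(sorted_keys)
        positions = np.arange(len(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        preds = np.clip(self.slope * self.keys + self.intercept, 0, len(self.keys) - 1)
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions))))

    def lookup(self, key):
        guess = int(np.clip(self.slope * key + self.intercept, 0, len(self.keys) - 1))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Classic binary search, but only inside the small error window.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = np.unique(np.random.default_rng(0).integers(0, 1_000_000, size=10_000))
idx = TinyLearnedIndex(keys)
print(idx.lookup(int(keys[1234])), idx.lookup(-1))  # position of an existing key, then None
```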
Pressmatch: Automated journalist recommendation for media coverage with Nearest Neighbor search
results: 这借研究发现,使用自然语言处理和机器学习技术可以快速和准确地推荐适合产品新闻发布的记者,从而提高公司发布产品的媒体推广效果。Abstract
Slating a product for release often involves pitching journalists to run stories on your press release. Good media coverage often ensures greater product reach and drives audience engagement for those products. Hence, ensuring that those releases are pitched to the right journalists with relevant interests is crucial, since they receive several pitches daily. Keeping up with journalist beats and curating a media contacts list is often a huge and time-consuming task. This study proposes a model to automate and expedite the process by recommending suitable journalists to run media coverage on the press releases provided by the user.
摘要
筹备产品发布时,通常需要向记者推介新闻稿,争取报道。良好的媒体报道往往能扩大产品的触达范围并提升受众参与度。由于记者每天都会收到大量推介,确保把新闻稿推介给关注相关领域的记者至关重要。然而,跟踪记者的报道领域并维护媒体联系人名单往往是一项庞大而耗时的工作。本研究提出了一个模型,通过为用户提供的新闻稿推荐合适的记者,使这一过程自动化并显著提速。
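A minimal sketch of such a recommender, matching the nearest-neighbor framing in the title, is shown below: journalists are represented by TF-IDF vectors of their past coverage and the press release is matched against them with cosine distance. The corpus, vectorizer, and number of neighbors are illustrative assumptions rather than the system described in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy "beat" corpus: one document of past coverage per journalist.
journalists = ["A. Rivera", "B. Chen", "C. Okafor"]
past_coverage = [
    "electric vehicles batteries charging infrastructure startups",
    "cloud computing data centers enterprise software earnings",
    "consumer apps social media creator economy funding rounds",
]
press_release = "Startup launches fast-charging battery pack for electric delivery vans"

vectorizer = TfidfVectorizer(stop_words="english")
journalist_vecs = vectorizer.fit_transform(past_coverage)
release_vec = vectorizer.transform([press_release])

# Cosine-distance nearest-neighbor search over journalist profiles.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(journalist_vecs)
distances, indices = nn.kneighbors(release_vec)
for d, i in zip(distances[0], indices[0]):
    print(f"{journalists[i]} (cosine distance {d:.3f})")
```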
Content Prompting: Modeling Content Provider Dynamics to Improve User Welfare in Recommender Ecosystems
results: 论文通过一种抽象模型和数学分析,证明了这种提示策略可以优化用户社会利益,同时尊重提供者的激励。此外,论文还通过简单的实验证明了这种策略可以提高生态系统的健康度和用户满意度。Abstract
Users derive value from a recommender system (RS) only to the extent that it is able to surface content (or items) that meet their needs/preferences. While RSs often have a comprehensive view of user preferences across the entire user base, content providers, by contrast, generally have only a local view of the preferences of users that have interacted with their content. This limits a provider's ability to offer new content to best serve the broader population. In this work, we tackle this information asymmetry with content prompting policies. A content prompt is a hint or suggestion to a provider to make available novel content for which the RS predicts unmet user demand. A prompting policy is a sequence of such prompts that is responsive to the dynamics of a provider's beliefs, skills and incentives. We aim to determine a joint prompting policy that induces a set of providers to make content available that optimizes user social welfare in equilibrium, while respecting the incentives of the providers themselves. Our contributions include: (i) an abstract model of the RS ecosystem, including content provider behaviors, that supports such prompting; (ii) the design and theoretical analysis of sequential prompting policies for individual providers; (iii) a mixed integer programming formulation for optimal joint prompting using path planning in content space; and (iv) simple, proof-of-concept experiments illustrating how such policies improve ecosystem health and user welfare.
摘要
用户只有在推荐系统(RS)能够浮现符合他们需求和偏好的内容时,才能获得价值。而RS通常有用户基本库的全面视图,而内容提供者只有与他们内容的用户交互的本地视图。这限制了提供者对整个人口的新内容供应的能力。在这种情况下,我们使用内容提醒策略来缓解信息不均衡。内容提醒是一个提醒或建议,让提供者为RS预测的用户需求而提供新的内容。提醒策略是一个针对提供者的信念、技能和利益的回应序列。我们的目标是找到一个共同的提醒策略,使提供者在均衡下为用户社会福祉产生最佳内容,同时尊重提供者自己的利益。我们的贡献包括:1. RS生态系统抽象模型,包括内容提供者行为,支持内容提醒策略。2. 针对个体提供者的sequential提醒策略的设计和理论分析。3. 内容空间探索路径规划法,用于优化共同提醒策略。4. 简单的证明性实验,说明如何实施内容提醒策略,提高生态系统健康和用户福祉。
Deep supervised hashing for fast retrieval of radio image cubes
paper_authors: Steven Ndung’u, Trienko Grobler, Stefan J. Wijnholds, Dimka Karastoyanova, George Azzopardi
for: Next-generation radio surveys will result in a large number of serendipitous discoveries, and deep hashing algorithms can be used to efficiently search and retrieve similar images in a large database.
methods: The paper uses deep hashing algorithms for image retrieval tasks in astronomy, specifically using the Hamming distance between the binary hash of the query image and those of the reference images in the database.
results: The experimental results achieved a precision of 88.5% using the mean average precision (mAP) metric, demonstrating the capability to search and retrieve similar radio images efficiently and at scale.Abstract
The shear number of sources that will be detected by next-generation radio surveys will be astronomical, which will result in serendipitous discoveries. Data-dependent deep hashing algorithms have been shown to be efficient at image retrieval tasks in the fields of computer vision and multimedia. However, there are limited applications of these methodologies in the field of astronomy. In this work, we utilize deep hashing to rapidly search for similar images in a large database. The experiment uses a balanced dataset of 2708 samples consisting of four classes: Compact, FRI, FRII, and Bent. The performance of the method was evaluated using the mean average precision (mAP) metric where a precision of 88.5\% was achieved. The experimental results demonstrate the capability to search and retrieve similar radio images efficiently and at scale. The retrieval is based on the Hamming distance between the binary hash of the query image and those of the reference images in the database.
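The retrieval step described above reduces to comparing binary codes bit by bit; a small sketch of that lookup, with random hashes standing in for the learned ones, is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_refs = 64, 10_000

# Stand-ins for learned binary hashes: reference database and one query code.
db_codes = rng.integers(0, 2, size=(n_refs, n_bits), dtype=np.uint8)
query_code = rng.integers(0, 2, size=n_bits, dtype=np.uint8)

# Hamming distance = number of differing bits.
hamming = np.count_nonzero(db_codes != query_code, axis=1)

top_k = 5
nearest = np.argsort(hamming)[:top_k]
print("closest reference images:", nearest, "distances:", hamming[nearest])
```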
Studying the impacts of pre-training using ChatGPT-generated text on downstream tasks
results: 我们的实验结果表明,在预训练阶段使用人工文本不会对语言模型在下游任务中的表现产生显著影响,也不会导致模型的性别偏见增加。Abstract
In recent times, significant advancements have been witnessed in the field of language models, particularly with the emergence of Large Language Models (LLMs) that are trained on vast amounts of data extracted from internet archives. These LLMs, such as ChatGPT, have become widely accessible, allowing users to generate text for various purposes including articles, essays, jokes, and poetry. Given that LLMs are trained on a diverse range of text sources, encompassing platforms like Reddit and Twitter, it is foreseeable that future training datasets will also incorporate text generated by previous iterations of the models themselves. In light of this development, our research aims to investigate the influence of artificial text in the pre-training phase of language models. Specifically, we conducted a comparative analysis between a language model, RoBERTa, pre-trained using CNN/DailyMail news articles, and ChatGPT, which employed the same articles for its training and evaluated their performance on three downstream tasks as well as their potential gender bias, using sentiment analysis as a metric. Through a series of experiments, we demonstrate that the utilization of artificial text during pre-training does not have a significant impact on either the performance of the models in downstream tasks or their gender bias. In conclusion, our findings suggest that the inclusion of text generated by LLMs in their own pre-training process does not yield substantial effects on the subsequent performance of the models in downstream tasks or their potential gender bias.
Knowledge Graph Embeddings for Multi-Lingual Structured Representations of Radiology Reports
results: 研究表明,这个图形嵌入可以更好地捕捉医疗报告中的关系性,并且在疾病分类和影像分类任务中表现出色,比BERT模型更好,且具有较小的模型大小和训练数据需求。此外,这个方法还可以跨语言进行应用。Abstract
The way we analyse clinical texts has undergone major changes over the last years. The introduction of language models such as BERT led to adaptations for the (bio)medical domain like PubMedBERT and ClinicalBERT. These models rely on large databases of archived medical documents. While performing well in terms of accuracy, both the lack of interpretability and limitations to transfer across languages limit their use in clinical setting. We introduce a novel light-weight graph-based embedding method specifically catering radiology reports. It takes into account the structure and composition of the report, while also connecting medical terms in the report through the multi-lingual SNOMED Clinical Terms knowledge base. The resulting graph embedding uncovers the underlying relationships among clinical terms, achieving a representation that is better understandable for clinicians and clinically more accurate, without reliance on large pre-training datasets. We show the use of this embedding on two tasks namely disease classification of X-ray reports and image classification. For disease classification our model is competitive with its BERT-based counterparts, while being magnitudes smaller in size and training data requirements. For image classification, we show the effectiveness of the graph embedding leveraging cross-modal knowledge transfer and show how this method is usable across different languages.
摘要
医学文本分析方法在最近几年内经历了重大变革。BERT语言模型的引入对医学领域的PubMedBERT和ClinicalBERT进行了适应。这些模型依靠大量储存的医学文献库。虽然在准确性方面表现良好,但lack of interpretability和语言转移限制使其在临床设置中无法使用。我们介绍了一种新的轻量级图 embedding方法,专门针对医学报告。该方法考虑报告的结构和组成,同时通过多语言的SNOMED临床术语知识库连接医学术语。得到的图 embedding揭示了临床术语之间的下面关系,实现了更好的可读性和临床准确性,不需要大量的预训练数据。我们在疾病分类和图像分类两个任务上使用了这种embedding,并证明了它的效果。在疾病分类任务中,我们的模型与BERT基于模型相比竞争,而且它的大小和训练数据要求都比BERT要小得多。在图像分类任务中,我们利用了图 embedding在不同语言之间的交叉模式知识传递,并证明了这种方法在不同语言上的可用性。
A 3D explainability framework to uncover learning patterns and crucial sub-regions in variable sulci recognition
paper_authors: Michail Mamalakis, Heloise de Vareilles, Atheer AI-Manea, Samantha C. Mitchell, Ingrid Arartz, Lynn Egeland Morch-Johnsen, Jane Garrison, Jon Simons, Pietro Lio, John Suckling, Graham Murray
results: 研究发现,在使用TOP-OSLO dataset中的MRI图像中,左半球比右半球更有可能正确地检测皮层 Sulcus(存在或不存在),并且发现了特定但广泛的子区域对每个类别结果做出了重要贡献。此外,研究还启示了不偏袋注意力的注意事项对网络性能的影响。该方法不仅提供了自动化、公正的皮层 Sulcus annotations,还为脑科学领域的进一步探索和调查提供了新的思路。Abstract
Precisely identifying sulcal features in brain MRI is made challenging by the variability of brain folding. This research introduces an innovative 3D explainability frame-work that validates outputs from deep learning networks in their ability to detect the paracingulate sulcus, an anatomical feature that may or may not be present on the frontal medial surface of the human brain. This study trained and tested two networks, amalgamating local explainability techniques GradCam and SHAP with a dimensionality reduction method. The explainability framework provided both localized and global explanations, along with accuracy of classification results, revealing pertinent sub-regions contributing to the decision process through a post-fusion transformation of explanatory and statistical features. Leveraging the TOP-OSLO dataset of MRI acquired from patients with schizophrenia, greater accuracies of paracingulate sulcus detection (presence or absence) were found in the left compared to right hemispheres with distinct, but extensive sub-regions contributing to each classification outcome. The study also inadvertently highlighted the critical role of an unbiased annotation protocol in maintaining network performance fairness. Our proposed method not only offers automated, impartial annotations of a variable sulcus but also provides insights into the broader anatomical variations associated with its presence throughout the brain. The adoption of this methodology holds promise for instigating further explorations and inquiries in the field of neuroscience.
摘要
由于大脑皮层折叠的高度变异,在脑部MRI中精确识别脑沟特征十分困难。本研究提出了一个创新的3D可解释性框架,用于验证深度学习网络检测 paracingulate sulcus 的能力——这一解剖特征在人脑额叶内侧面上可能存在也可能不存在。研究训练并测试了两个网络,将局部可解释技术 GradCAM 和 SHAP 与一种降维方法相结合。该可解释性框架在给出分类准确率的同时,提供局部与全局两类解释,并通过对解释特征与统计特征的后融合变换,揭示对决策过程有贡献的相关子区域。基于取自精神分裂症患者MRI的 TOP-OSLO 数据集,研究发现左半球对 paracingulate sulcus(存在或缺失)的检测准确率高于右半球,且每个分类结果都由范围广泛但各不相同的子区域共同贡献。研究还揭示了无偏标注协议对于保持网络性能公平性的关键作用。该方法不仅能对这一变异性脑沟提供自动化、无偏的标注,还有助于理解与其存在相关的更广泛的解剖变异,有望推动神经科学领域的进一步探索。
Large Process Models: Business Process Management in the Age of Generative AI
paper_authors: Timotheus Kampik, Christian Warmuth, Adrian Rebmann, Ron Agam, Lukas N. P. Egger, Andreas Gerber, Johannes Hoffart, Jonas Kolk, Philipp Herzig, Gero Decker, Han van der Aa, Artem Polyvyanyy, Stefanie Rinderle-Ma, Ingo Weber, Matthias Weidlich
results: 论文认为,实施LPM可以减少企业转型所需的时间和努力,同时提供更加深入、更加有效和更加可行的业务转型建议,相比之下传统的 Symbolic 模型。然而,论文也提出了实施LPM的限制和研究挑战。Abstract
The continued success of Large Language Models (LLMs) and other generative artificial intelligence approaches highlights the advantages that large information corpora can have over rigidly defined symbolic models, but also serves as a proof-point of the challenges that purely statistics-based approaches have in terms of safety and trustworthiness. As a framework for contextualizing the potential, as well as the limitations of LLMs and other foundation model-based technologies, we propose the concept of a Large Process Model (LPM) that combines the correlation power of LLMs with the analytical precision and reliability of knowledge-based systems and automated reasoning approaches. LPMs are envisioned to directly utilize the wealth of process management experience that experts have accumulated, as well as process performance data of organizations with diverse characteristics, e.g., regarding size, region, or industry. In this vision, the proposed LPM would allow organizations to receive context-specific (tailored) process and other business models, analytical deep-dives, and improvement recommendations. As such, they would allow to substantially decrease the time and effort required for business transformation, while also allowing for deeper, more impactful, and more actionable insights than previously possible. We argue that implementing an LPM is feasible, but also highlight limitations and research challenges that need to be solved to implement particular aspects of the LPM vision.
摘要
大型语言模型(LLM)和其他生成人工智能方法的继续成功,强调了大量信息库对于固定符号模型的优势,但也 serves as a proof-point of 隐性和可靠性问题。作为 LLM 和其他基础模型技术的框架,我们提出了大量处理模型(LPM)的概念,该模型结合了 LLM 的相关力和知识基础系统和自动化推理方法的分析精度和可靠性。LPM 可以直接利用专家们积累的过程管理经验和不同特征的组织过程性能数据,例如大小、地区、行业等,以提供Context-specific(特定)的过程和商业模型、深入分析和改进建议。因此,LPM 可以减少企业转型所需的时间和努力,同时提供更深入、更有影响和更可行的发现。我们认为实施 LPM 是可能的,但也存在实现特定方面的限制和研究挑战。
Regularly Truncated M-estimators for Learning with Noisy Labels
results: 理论上显示了方法的抗噪音特性,实验结果表明我们的方法可以超越多个基eline,并在各种噪音类型和水平下表现稳定和可靠。Abstract
The sample selection approach is very popular in learning with noisy labels. As deep networks learn pattern first, prior methods built on sample selection share a similar training procedure: the small-loss examples can be regarded as clean examples and used for helping generalization, while the large-loss examples are treated as mislabeled ones and excluded from network parameter updates. However, such a procedure is arguably debatable from two folds: (a) it does not consider the bad influence of noisy labels in selected small-loss examples; (b) it does not make good use of the discarded large-loss examples, which may be clean or have meaningful information for generalization. In this paper, we propose regularly truncated M-estimators (RTME) to address the above two issues simultaneously. Specifically, RTME can alternately switch modes between truncated M-estimators and original M-estimators. The former can adaptively select small-losses examples without knowing the noise rate and reduce the side-effects of noisy labels in them. The latter makes the possibly clean examples but with large losses involved to help generalization. Theoretically, we demonstrate that our strategies are label-noise-tolerant. Empirically, comprehensive experimental results show that our method can outperform multiple baselines and is robust to broad noise types and levels.
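The small-loss / truncation idea behind the method can be illustrated with a few lines of PyTorch. The snippet keeps only the fraction of samples with the smallest per-sample losses in each batch; the keep ratio and the hard cut-off are illustrative simplifications, not the regularly truncated M-estimator itself.

```python
import torch
import torch.nn.functional as F

def truncated_ce_loss(logits, targets, keep_ratio=0.7):
    """Cross-entropy averaged over the keep_ratio fraction of smallest-loss samples."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")  # (B,)
    k = max(1, int(keep_ratio * per_sample.numel()))
    # Small-loss samples are treated as (probably) clean; large-loss ones are truncated.
    kept, _ = torch.topk(per_sample, k, largest=False)
    return kept.mean()

# Toy batch: 8 samples, 5 classes, some of which could carry noisy labels.
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(truncated_ce_loss(logits, targets).item())
```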
Equitable-FL: Federated Learning with Sparsity for Resource-Constrained Environment
results: 实验结果表明,该方法可以在不同的数据集和环境下减少模型大小,同时保持模型的准确性,并且可以适应不同的客户端资源具有不同的缺省值。Abstract
In Federated Learning, model training is performed across multiple computing devices, where only parameters are shared with a common central server without exchanging their data instances. This strategy assumes abundance of resources on individual clients and utilizes these resources to build a richer model as user's models. However, when the assumption of the abundance of resources is violated, learning may not be possible as some nodes may not be able to participate in the process. In this paper, we propose a sparse form of federated learning that performs well in a Resource Constrained Environment. Our goal is to make learning possible, regardless of a node's space, computing, or bandwidth scarcity. The method is based on the observation that model size viz a viz available resources defines resource scarcity, which entails that reduction of the number of parameters without affecting accuracy is key to model training in a resource-constrained environment. In this work, the Lottery Ticket Hypothesis approach is utilized to progressively sparsify models to encourage nodes with resource scarcity to participate in collaborative training. We validate Equitable-FL on the $MNIST$, $F-MNIST$, and $CIFAR-10$ benchmark datasets, as well as the $Brain-MRI$ data and the $PlantVillage$ datasets. Further, we examine the effect of sparsity on performance, model size compaction, and speed-up for training. Results obtained from experiments performed for training convolutional neural networks validate the efficacy of Equitable-FL in heterogeneous resource-constrained learning environment.
摘要
在联合学习中,模型训练在多个计算设备之间进行,只是共享参数而不是数据实例。这种策略假设每个客户端都有充足的资源,并利用这些资源建立更加丰富的模型。然而,当假设丰富资源的假设不成立时,学习可能无法进行,因为一些节点可能无法参与过程中。在这篇论文中,我们提出了一种缺省形式的联合学习方法,可以在资源受限环境中进行学习。我们的目标是让学习无论节点的空间、计算或带宽scarce都可以进行。该方法基于参数数量与可用资源的关系,即模型大小与可用资源的定义资源缺乏问题。在这种情况下,我们采用了抽奖票假设方法,以逐步减少模型参数,以鼓励有资源缺乏的节点参与合作训练。我们在$MNIST$, $F-MNIST$, $CIFAR-10$benchmark数据集和$Brain-MRI$数据集以及$PlantVillage$数据集进行了验证。此外,我们还研究了缺省对性能、模型大小压缩和训练速度的影响。实验结果表明,Equitable-FL在多种不同资源环境下进行联合学习时具有有效性。
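The progressive sparsification that Equitable-FL relies on can be approximated by simple global magnitude pruning, sketched below; the pruning schedule and the use of a plain mask are assumptions for illustration, not the exact Lottery-Ticket procedure used in the paper.

```python
import torch
import torch.nn as nn

def magnitude_prune(model, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:                      # prune weight matrices, keep biases
                masks[name] = (p.abs() > threshold).float()
                p.mul_(masks[name])
    return masks                                  # reapply after each local update

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
for round_sparsity in (0.3, 0.5, 0.7):            # progressively sparser rounds
    masks = magnitude_prune(model, round_sparsity)
    nonzero = sum(int(m.sum()) for m in masks.values())
    print(f"sparsity {round_sparsity}: {nonzero} weights remain")
```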
Domain Generalization via Balancing Training Difficulty and Model Capability
methods: 这个研究使用了两个新的设计,即 MoDify-based Data Augmentation 和 MoDify-based Network Optimization,它们协同缓解训练过程中样本难度与模型能力之间的不匹配问题,以获得更好的泛化性能。
results: 这个研究在多个 benchmark 上获得了 superior performance,并且可以作为插件与现有方法整合,适用于不同的视觉识别任务。Abstract
Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains. Despite its recent progress, most existing work suffers from the misalignment between the difficulty level of training samples and the capability of contemporarily trained models, leading to over-fitting or under-fitting in the trained generalization model. We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties along the training process. MoDify consists of two novel designs that collaborate to fight against the misalignment while learning domain-generalizable models. The first is MoDify-based Data Augmentation which exploits an RGB Shuffle technique to generate difficulty-aware training samples on the fly. The second is MoDify-based Network Optimization which dynamically schedules the training samples for balanced and smooth learning with appropriate difficulty. Without bells and whistles, a simple implementation of MoDify achieves superior performance across multiple benchmarks. In addition, MoDify can complement existing methods as a plug-in, and it is generic and can work for different visual recognition tasks.
摘要
领域泛化(DG)的目标是从一个或多个源领域学习能够在未见目标领域上表现良好的模型。尽管近来取得了进展,但大多数现有工作都存在训练样本难度与当前所训模型能力之间不匹配的问题,导致所得泛化模型过拟合或欠拟合。我们设计了 MoDify,一个动量难度(Momentum Difficulty)框架,通过在训练过程中平衡模型能力与样本难度之间的跷跷板来解决这种不匹配。MoDify 包含两个相互协作的新设计。其一是基于 MoDify 的数据增强,利用 RGB Shuffle 技术在训练过程中即时生成难度感知的训练样本;其二是基于 MoDify 的网络优化,动态调度训练样本,以合适的难度实现平衡且平稳的学习。无需额外花哨设计,MoDify 的简单实现即可在多个基准上取得优秀表现。此外,MoDify 可以作为插件补充现有方法,并且具有通用性,可应用于不同的视觉识别任务。
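The RGB Shuffle augmentation mentioned above can be sketched in a few lines of numpy; MoDify's difficulty-aware scheduling of when such samples are used is not reproduced here.

```python
import numpy as np

def rgb_shuffle(image, rng=None):
    """Randomly permute the colour channels of an H x W x 3 image.
    The label is unchanged, but the sample becomes harder to fit,
    which is the kind of difficulty-aware augmentation described above."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(3)
    return image[..., perm]

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
aug = rgb_shuffle(img)
```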
LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs
results: 与基线相比,LeanContext 方法可以显著降低上下文成本,同时保持 ROUGE-1 分数基本不变。如果使用免费的预训练 LLM 摘要器压缩上下文,LeanContext 还可以进一步提高准确性。Abstract
Question-answering (QA) is a significant application of Large Language Models (LLMs), shaping chatbot capabilities across healthcare, education, and customer service. However, widespread LLM integration presents a challenge for small businesses due to the high expenses of LLM API usage. Costs rise rapidly when domain-specific data (context) is used alongside queries for accurate domain-specific LLM responses. One option is to summarize the context by using LLMs and reduce the context. However, this can also filter out useful information that is necessary to answer some domain-specific queries. In this paper, we shift from human-oriented summarizers to AI model-friendly summaries. Our approach, LeanContext, efficiently extracts $k$ key sentences from the context that are closely aligned with the query. The choice of $k$ is neither static nor random; we introduce a reinforcement learning technique that dynamically determines $k$ based on the query and context. The rest of the less important sentences are reduced using a free open source text reduction method. We evaluate LeanContext against several recent query-aware and query-unaware context reduction approaches on prominent datasets (arxiv papers and BBC news articles). Despite cost reductions of $37.29\%$ to $67.81\%$, LeanContext's ROUGE-1 score decreases only by $1.41\%$ to $2.65\%$ compared to a baseline that retains the entire context (no summarization). Additionally, if free pretrained LLM-based summarizers are used to reduce context (into human consumable summaries), LeanContext can further modify the reduced context to enhance the accuracy (ROUGE-1 score) by $13.22\%$ to $24.61\%$.
摘要
问答(Question-answering,QA)是大型语言模型(Large Language Models,LLMs)的一项重要应用,塑造了医疗、教育和客户服务等领域聊天机器人的能力。然而,LLM 的广泛集成对小型企业而言是一大挑战,因为 LLM API 的使用成本很高。当查询需要搭配领域特定数据(上下文)才能得到准确的领域回答时,成本会迅速上升。一种选择是使用 LLM 对上下文进行摘要,从而缩减上下文;然而这也可能过滤掉回答某些领域特定问题所必需的有用信息。在这篇论文中,我们从面向人类的摘要器转向对 AI 模型友好的摘要。我们的方法 LeanContext 能高效地从上下文中提取与查询高度相关的 $k$ 个关键句子。$k$ 的选择既非固定也非随机:我们引入一种强化学习技术,根据查询和上下文动态确定 $k$。其余较不重要的句子则用一种免费开源的文本压缩方法进行缩减。我们在知名数据集(arXiv 论文和 BBC 新闻文章)上,将 LeanContext 与多种最新的查询感知和查询无关的上下文压缩方法进行了比较。在成本降低 $37.29\%$ 到 $67.81\%$ 的情况下,LeanContext 的 ROUGE-1 分数相比保留完整上下文(不做摘要)的基线仅下降 $1.41\%$ 到 $2.65\%$。此外,如果使用免费的预训练 LLM 摘要器将上下文压缩为人类可读的摘要,LeanContext 还能进一步修改压缩后的上下文,使准确性(ROUGE-1 分数)提高 $13.22\%$ 到 $24.61\%$。
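A minimal sketch of the query-aware sentence-selection idea is shown below. TF-IDF cosine similarity stands in for the embedding model, and k is fixed here rather than chosen by the reinforcement-learning policy described in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_sentences(context_sentences, query, k=3):
    """Keep the k sentences most similar to the query and drop the rest."""
    vec = TfidfVectorizer().fit(context_sentences + [query])
    sent_mat = vec.transform(context_sentences)
    query_vec = vec.transform([query])
    scores = cosine_similarity(sent_mat, query_vec).ravel()
    keep = np.argsort(scores)[::-1][:k]
    return [context_sentences[i] for i in sorted(keep)]   # preserve original order

sentences = [
    "Federated learning trains models without sharing raw data.",
    "The weather in Paris is mild in spring.",
    "Clients only exchange model parameters with the server.",
]
print(top_k_sentences(sentences, "How does federated learning protect data?", k=2))
```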
Leveraging Semi-Supervised Graph Learning for Enhanced Diabetic Retinopathy Detection
results: 研究表明,该算法可以在两个公共可用的数据集上获得显著改进的分类精度、特异性和敏感性,并且具有鲁棒性 against 噪音和异常值。Abstract
Diabetic Retinopathy (DR) is a significant cause of blindness globally, highlighting the urgent need for early detection and effective treatment. Recent advancements in Machine Learning (ML) techniques have shown promise in DR detection, but the availability of labeled data often limits their performance. This research proposes a novel Semi-Supervised Graph Learning (SSGL) algorithm tailored for DR detection, which capitalizes on the relationships between labeled and unlabeled data to enhance accuracy. The work begins by investigating data augmentation and preprocessing techniques to address the challenges of image quality and feature variations. Techniques such as image cropping, resizing, contrast adjustment, normalization, and data augmentation are explored to optimize feature extraction and improve the overall quality of retinal images. Moreover, apart from detection and diagnosis, this work delves into applying ML algorithms for predicting the risk of developing DR or the likelihood of disease progression. Personalized risk scores for individual patients are generated using comprehensive patient data encompassing demographic information, medical history, and retinal images. The proposed Semi-Supervised Graph Learning algorithm is rigorously evaluated on two publicly available datasets and is benchmarked against existing methods. Results indicate significant improvements in classification accuracy, specificity, and sensitivity while demonstrating robustness against noise and outliers. Notably, the proposed algorithm addresses the challenge of imbalanced datasets, common in medical image analysis, further enhancing its practical applicability.
摘要
糖尿病视网膜病变(DR)是全球致盲的重要原因,凸显了早期发现和有效治疗的迫切需要。机器学习(ML)技术的最新进展在 DR 检测方面展现出潜力,但标注数据的可获得性常常限制其性能。本研究提出了一种面向 DR 检测的新型半监督图学习(SSGL)算法,利用有标注数据与无标注数据之间的关系来提高准确率。研究首先考察了数据增强和预处理技术,以应对图像质量和特征差异带来的挑战;探索了图像裁剪、缩放、对比度调整、归一化和数据增强等技术,以优化特征提取并提升视网膜图像的整体质量。此外,除检测与诊断之外,本研究还探讨了利用 ML 算法预测患 DR 的风险或疾病进展的可能性:基于涵盖人口统计信息、病史和视网膜图像的完整患者数据,为每位患者生成个性化风险评分。所提出的半监督图学习算法在两个公开数据集上进行了严格评估,并与现有方法进行了对比。结果表明,该算法在分类准确率、特异性和敏感性方面均有显著提升,并对噪声和离群值表现出稳健性。值得一提的是,该算法还能应对医学图像分析中常见的数据不平衡问题,进一步增强了其实用性。
Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits
for: 解决 adversarial linear contextual bandit problem,loss vectors 完全 adversarially 选择,per-round action set 从固定分布中随机选择。
methods: 不需要 simulator 即可实现 $\widetilde{O}(\sqrt{T})$ regret,同时在每轮 action set 较小时保持计算效率。
results: 对于 sleeping bandits 的特例,肯定地回答了 Saha et al. [2020] 的开放问题,即存在具有 $poly(d)\sqrt{T}$ regret 的 polynomial-time 算法,并且可以处理 linear loss 带有 additive misspecification error 的情形。Abstract
We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e. the context) is drawn from a fixed distribution. Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\widetilde{O}(T^{\frac{5}{6}})$, or are computationally inefficient. We greatly improve these results by achieving a regret of $\widetilde{O}(\sqrt{T})$ without a simulator, while maintaining computational efficiency when the action set in each round is small. In the special case of sleeping bandits with adversarial loss and stochastic arm availability, our result answers affirmatively the open question by Saha et al. [2020] on whether there exists a polynomial-time algorithm with $poly(d)\sqrt{T}$ regret. Our approach naturally handles the case where the loss is linear up to an additive misspecification error, and our regret shows near-optimal dependence on the magnitude of the error.
摘要
我们考虑对抗线性上下文赌博机问题,其中损失向量被完全对抗地选择,而每轮的动作集(即上下文)从一个固定分布中抽取。现有方法要么需要借助模拟器来生成免费的独立同分布上下文,要么只能达到不优于 $\widetilde{O}(T^{\frac{5}{6}})$ 的次优遗憾,要么计算效率低下。我们大幅改进了这些结果:在不依赖模拟器的情况下达到 $\widetilde{O}(\sqrt{T})$ 的遗憾,同时在每轮动作集较小时保持计算效率。在具有对抗损失和随机臂可用性的休眠赌博机这一特例中,我们的结果肯定地回答了 Saha et al. [2020] 提出的开放问题,即是否存在具有 $poly(d)\sqrt{T}$ 遗憾的多项式时间算法。我们的方法自然地处理损失函数在可加误设误差范围内为线性的情形,且我们的遗憾界对误差幅度具有近似最优的依赖性。
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
paper_authors: Fengxiang Bie, Yibo Yang, Zhongzhu Zhou, Adam Ghanem, Minjia Zhang, Zhewei Yao, Xiaoxia Wu, Connor Holmes, Pareesa Golnari, David A. Clifton, Yuxiong He, Dacheng Tao, Shuaiwen Leon Song
For: This paper focuses on text-to-image generation (TTI) models that use neural networks to generate high-fidelity images based on text descriptions.* Methods: The paper discusses various types of generative models used for TTI, including diffusion models, which have been shown to be effective in image synthesis and have become the major image decoder used by TTI models. The paper also explores the integration of large language models with TTI models to improve performance.* Results: The paper reports that TTI models have made significant progress in recent years, with the generation results nearly indistinguishable from real-world images. The paper argues that further improvements could be made through the combination of innovative model architectures and prediction enhancement techniques.Abstract
Text-to-image generation (TTI) refers to the use of models that can process text input and generate high-fidelity images based on text descriptions. Text-to-image generation using neural networks can be traced back to the emergence of Generative Adversarial Networks (GANs), followed by the autoregressive Transformer. Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noise over repeated steps. Owing to the impressive results of diffusion models on image synthesis, they have been cemented as the major image decoder used by text-to-image models and have brought text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and integration with large language models have further improved the performance of TTI models, yielding generation results nearly indistinguishable from real-world images and revolutionizing the way we obtain images. Our explorative study has led us to think that there are further ways of scaling text-to-image models through the combination of innovative model architectures and prediction enhancement techniques. We have divided this survey into five main sections, wherein we detail the frameworks of the major literature in order to delve into the different types of text-to-image generation methods. Following this, we provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. Looking ahead, we argue that TTI development could yield impressive productivity improvements for creation, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.
摘要
文本到图像生成(TTI)是指利用模型处理文本输入,并根据文本描述生成高保真图像。基于神经网络的文本到图像生成可以追溯到生成对抗网络(GAN)的出现,随后是自回归 Transformer。扩散模型是一类重要的生成模型,通过反复迭代地注入噪声来系统性地生成图像。由于扩散模型在图像合成方面表现出色,它已成为文本到图像模型的主要图像解码器,并使文本到图像生成走到了机器学习(ML)研究的前沿。在大模型时代,扩大模型规模以及与大语言模型的集成进一步提升了 TTI 模型的性能,使生成结果几乎与真实图像难以分辨,革新了我们获取图像的方式。我们的探索性研究促使我们认为,结合创新的模型架构与预测增强技术,文本到图像模型还有进一步扩展的空间。本综述分为五个主要部分,详细介绍主要文献的框架,以便深入探讨不同类型的文本到图像生成方法。随后,我们对这些方法进行了详细比较与评述,并为未来工作提出了可能的改进方向。我们认为,TTI 的发展可以带来可观的创作效率提升,尤其是在 AIGC 时代,并且可以扩展到视频生成和 3D 生成等更复杂的任务。
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
paper_authors: Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, Maarten Sap, John Tasioulas, Yejin Choi
for: The paper aims to improve AI systems’ ability to reflect value pluralism, which is the view that multiple correct values may be held in tension with one another.
methods: The authors introduce ValuePrism, a large-scale dataset of human-written values, rights, and duties, and use GPT-4 to generate contextualized values. They also build Kaleido, an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence of human values, rights, and duties within a specific context.
results: The authors show that Kaleido outperforms the teacher GPT-4 in terms of accuracy and broader coverage, and can help explain variability in human decision-making by outputting contrasting values. Additionally, they demonstrate that Kaleido’s representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism.Abstract
Human values are crucial to human decision-making. Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). As statistical learners, AI systems fit to averages by default, washing out these potentially irreducible value conflicts. To improve AI systems to better reflect value pluralism, the first-order challenge is to explore the extent to which AI systems can model pluralistic human values, rights, and duties as well as their interaction. We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. ValuePrism's contextualized values are generated by GPT-4 and deemed high-quality by human annotators 91% of the time. We conduct a large-scale study with annotators across diverse social and demographic backgrounds to try to understand whose values are represented. With ValuePrism, we build Kaleido, an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. Humans prefer the sets of values output by our system over the teacher GPT-4, finding them more accurate and with broader coverage. In addition, we demonstrate that Kaleido can help explain variability in human decision-making by outputting contrasting values. Finally, we show that Kaleido's representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism. We hope that our work will serve as a step to making more explicit the implicit values behind human decision-making and to steering AI systems to make decisions that are more in accordance with them.
摘要
人类价值观对人类决策至关重要。价值多元论认为,多个正确的价值观可能彼此处于张力之中(例如,当考虑为保护朋友的感受而撒谎时,如何在诚实与友谊之间取舍?)。作为统计学习器,AI 系统默认拟合平均值,从而抹平这些可能无法化约的价值冲突。要改进 AI 系统以更好地反映价值多元论,首要挑战是探究 AI 系统能在多大程度上建模多元的人类价值观、权利与义务及其相互作用。我们介绍了 ValuePrism,一个包含 21.8 万条价值观、权利和义务、并与 3.1 万个人类撰写情境相关联的大规模数据集。ValuePrism 中的情境化价值观由 GPT-4 生成,91% 的情况下被人类标注者评为高质量。我们开展了一项大规模研究,邀请来自不同社会和人口背景的标注者,以尝试了解其中代表了谁的价值观。基于 ValuePrism,我们构建了 Kaleido,一个开放、轻量、结构化的基于语言的多任务模型,能够在特定情境下生成、解释并评估人类价值观、权利和义务的相关性与倾向(即支持或反对)。与教师模型 GPT-4 相比,人类更偏好我们系统输出的价值观集合,认为其更准确且覆盖面更广。此外,我们还证明 Kaleido 可以通过输出相互对立的价值观来帮助解释人类决策中的差异。最后,我们展示了 Kaleido 的表示可以迁移到其他哲学框架和数据集,证实了对价值多元论采用显式、模块化且可解释方法的益处。我们希望这项工作能成为使人类决策背后隐含价值观更加明确、并引导 AI 系统做出更符合这些价值观的决策的一步。
methods: 该方法将掩码自编码器(MAE)目标融入对比学习目标,以改进表示的局部语义。此外,我们还引入位置嵌入随机丢弃(Positional Embedding Dropout, PED),以应对图像-文本预训练与检测微调之间的尺度差异。
results: 在 LVIS 开放词汇检测基准上,CFM-ViT 实现了 33.9 AP$_r$ 的最佳记录,比之前最佳方法高 7.6 分。此外,CFM-ViT 还实现了更好的零样本检测迁移能力和图像级表示。Abstract
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$r$, surpassing the best approach by 7.6 points and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
摘要
我们提出了对比特征掩码视觉 Transformer(CFM-ViT),一种用于开放词汇目标检测(OVD)的图像-文本预训练方法,可同时学习图像级和区域级表示。我们的方法将掩码自编码器(MAE)目标与对比学习目标相结合,以改进定位任务的表示。与标准 MAE 不同,我们在图像-文本联合嵌入空间而非像素空间中进行重建,这使模型能更好地学习区域级语义。此外,我们引入位置嵌入随机丢弃(PED),通过在预训练期间随机丢弃位置嵌入,来解决图像-文本预训练与检测微调之间的尺度差异问题。PED 提升了检测性能,并使冻结的 ViT 主干可以用作区域分类器,从而避免在检测微调过程中遗忘开放词汇知识。在 LVIS 开放词汇检测基准上,CFM-ViT 取得了 33.9 AP$_r$ 的最先进成绩,比最佳方法高出 7.6 分,并实现了更好的零样本检测迁移。此外,CFM-ViT 还获得了强大的图像级表示,在零样本图像-文本检索基准的 12 项指标中有 8 项超过了现有最佳水平。
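The Positional Embedding Dropout step can be sketched as follows in PyTorch. Dropping the embedding for the whole sequence with a fixed probability is an assumption about the granularity; the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn

class PositionalEmbeddingDropout(nn.Module):
    """Randomly skip adding positional embeddings during pretraining so the
    backbone does not over-rely on absolute scale/position cues."""
    def __init__(self, drop_prob=0.5):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, patch_tokens, pos_embed):
        if self.training and torch.rand(()) < self.drop_prob:
            return patch_tokens                 # positional embedding dropped
        return patch_tokens + pos_embed         # normal ViT behaviour

tokens = torch.randn(8, 196, 768)               # (batch, patches, dim)
pos = torch.randn(1, 196, 768)
ped = PositionalEmbeddingDropout(drop_prob=0.5)
out = ped(tokens, pos)
```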
Bias and Fairness in Large Language Models: A Survey
paper_authors: Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Nesreen K. Ahmed
for: 这篇论文旨在提供一份关于大语言模型(LLMs)偏见评估与缓解技术的完整综述,以便研究者和实践者更好地理解并防止 LLM 中偏见的传播。
results: 这篇论文给出了关于 LLM 偏见评估与缓解技术最新研究的完整综述,涵盖不同类型的评估指标、数据集和缓解技术,以及它们之间的关系与交互。该综述可以帮助研究者和实践者更好地理解并防止 LLM 中的偏见。Abstract
Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.
摘要
大语言模型(LLMs)的快速进步使其具备了处理、理解和生成类人文本的能力,并日益集成到触及社会领域的各类系统中。然而,这些模型可能学习、延续并放大有害的社会偏见。在这篇论文中,我们对 LLM 的偏见评估与缓解技术进行了全面综述。我们首先整合、形式化并扩展了自然语言处理中的社会偏见与公平概念,定义了不同侧面的危害,并提出了若干将公平性落实到 LLM 的要求。随后,我们提出了三种直观的分类体系来统一文献:两种用于偏见评估(评估指标与数据集),一种用于偏见缓解。第一种分类体系厘清了评估指标与评估数据集之间的关系,并按指标作用的不同层面(嵌入、概率、生成文本)对其进行组织。第二种分类体系按数据集结构(反事实输入或提示)对其分类,并标识其针对的危害类型和社会群体;我们还整理并发布了一份公开可用数据集的汇总,以改善获取途径。第三种分类体系按缓解方法的介入阶段(预处理、训练中、推理中、后处理)对其分类,并以细分子类呈现研究趋势。最后,我们指出了未来工作中的开放问题与挑战。通过综合大量最新研究成果,我们希望为研究者和实践者提供一份清晰的文献指南,帮助其更好地理解并防止 LLM 中偏见的传播。
methods: 该研究使用了开源的 LLMs 作为控制器,并提供了一个用户友好的系统库,可以自定义引擎设计以支持模型训练在多个开源 LLMs 上,同时允许在一起融合模型 API 和常见 API。
results: 研究提出了一个涵盖工具使用数据收集、工具检索、工具注册、内存控制、自定义模型训练和评估的框架,以帮助 LLMs 具备工具使用能力。此外,还展示了一个基于 ModelScope-Agent 框架的实际应用程序——ModelScopeGPT,可以连接多个开源 LLMs 和上千个公共 AI 模型,以及本地化社区知识。Abstract
Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent framework that equips LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning over tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library\footnote{https://github.com/modelscope/modelscope-agent} and online demo\footnote{https://modelscope.cn/studios/damo/ModelScopeGPT/summary} are now publicly available.
摘要
大型语言模型(LLM)近来展现出了理解人类意图、进行推理以及设计类规划行为的强大能力。为进一步释放 LLM 完成复杂任务的能力,当前有一个日益明显的趋势:构建智能体框架,为 ChatGPT 等 LLM 配备工具使用能力,使其能够连接海量外部 API。在这项工作中,我们介绍了 ModelScope-Agent,一个面向真实应用、通用且可定制的智能体框架,以开源 LLM 作为控制器。它提供了用户友好的系统库,可定制的引擎设计支持在多个开源 LLM 上进行模型训练,同时能以统一的方式无缝集成模型 API 和常见 API。为了让 LLM 具备工具使用能力,我们提出了一个涵盖工具使用数据收集、工具检索、工具注册、记忆控制、定制化模型训练以及面向实际应用的评估的完整框架。最后,我们展示了 ModelScopeGPT,一个基于 ModelScope-Agent 框架的 ModelScope 社区真实智能助手,能够将开源 LLM 与超过 1000 个公共 AI 模型以及 ModelScope 中的本地化社区知识连接起来。ModelScope-Agent 库(https://github.com/modelscope/modelscope-agent)和在线示例(https://modelscope.cn/studios/damo/ModelScopeGPT/summary)现已公开可用。
results: 研究发现,现代语言模型可以在濒危语言的低资源方言上取得有竞争力的表现,并且可以拓展到已知语言边界之外。然而,要通过统一的建模空间在不同语言和使用者之间确保公平的文本表示,还需要解决一些问题。Abstract
Modern NLP breakthrough includes large multilingual models capable of performing tasks across more than 100 languages. State-of-the-art language models came a long way, starting from the simple one-hot representation of words capable of performing tasks like natural language understanding, common-sense reasoning, or question-answering, thus capturing both the syntax and semantics of texts. At the same time, language models are expanding beyond our known language boundary, even competitively performing over very low-resource dialects of endangered languages. However, there are still problems to solve to ensure an equitable representation of texts through a unified modeling space across language and speakers. In this survey, we shed light on this iterative progression of multilingual text representation and discuss the driving factors that ultimately led to the current state-of-the-art. Subsequently, we discuss how the full potential of language democratization could be obtained, reaching beyond the known limits and what is the scope of improvement in that space.
摘要
现代 NLP 的突破包括能够在 100 多种语言上执行任务的大型多语言模型。最先进的语言模型走过了漫长的历程:从简单的词语 one-hot 表示出发,逐步发展到能够完成自然语言理解、常识推理和问答等任务,同时捕捉文本的句法和语义。与此同时,语言模型正在不断拓展已知的语言边界,甚至在濒危语言的极低资源方言上也能取得有竞争力的表现。然而,要通过统一的建模空间在不同语言和使用者之间实现公平的文本表示,仍有许多问题需要解决。在这篇综述中,我们梳理了多语言文本表示的这一迭代演进过程,并讨论了最终促成当前最先进水平的驱动因素。随后,我们讨论了如何充分释放语言民主化的潜力、超越已知的限制,以及这一领域还有哪些改进空间。
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
results: 该论文实现了将 LLM 的语言能力扩展到语音频谱上,可以实现语音识别、语音翻译、语音理解和语音对话,甚至在零shot cross-lingual scenarios 下。Abstract
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
摘要
大型语言模型(LLM)的出现引发了人们将其出色的语言能力扩展到语音领域的浓厚兴趣。然而,语音与文本之间的模态对齐仍是一个悬而未决的问题。现有解决方案可分为两类策略:一类是级联方式,将单独训练的语音识别系统的输出(token 或状态)作为 LLM 的输入,这限制了其在语音与文本对齐建模方面的潜力;另一类是端到端方式,依赖语音指令数据,而此类数据很难大规模收集。在这篇论文中,我们针对这些问题提出了 BLSP 方法,即通过续写行为对齐来引导语言-语音预训练。我们在冻结的语音编码器与 LLM 之间学习一个轻量级的模态适配器,确保无论输入是语音片段还是其转录文本,LLM 都表现出相同的生成行为。训练过程分为两步:第一步,以语音转录文本作为前缀,提示 LLM 生成文本续写;第二步,将这些续写作为监督信号,端到端地训练模态适配器。我们证明,这一简单的流程即可将 LLM 的能力扩展到语音领域,实现语音识别、语音翻译、口语理解和语音对话,甚至在零样本跨语言场景下也能工作。
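A minimal PyTorch sketch of the lightweight modality-adapter idea follows: a frozen speech encoder, a small trainable projection into the LLM embedding space, and (in step two) supervision from the text continuations. The encoder stand-in, the dimensions, and the adapter architecture here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Small trainable bridge mapping frozen speech-encoder features
    into the (frozen) LLM's embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim)
        )

    def forward(self, speech_feats):            # (batch, frames, speech_dim)
        return self.net(speech_feats)           # (batch, frames, llm_dim)

speech_encoder = nn.GRU(80, 1024, batch_first=True)   # stand-in for the real encoder
for p in speech_encoder.parameters():
    p.requires_grad_(False)                     # encoder stays frozen

adapter = ModalityAdapter()
feats, _ = speech_encoder(torch.randn(2, 300, 80))
llm_inputs = adapter(feats)                     # fed to the LLM in place of text embeddings
# Step two: minimise the LLM's loss on the text continuations produced in step one,
# updating only the adapter parameters.
```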
Evaluating Transformer’s Ability to Learn Mildly Context-Sensitive Languages
results: 研究发现,Transformer模型在已经看过的数据上能够泛化良好,但在长串上的推断能力较差,而LSTM模型在这个方面表现更好。分析还显示,Transformer模型学习了自我关注 patrerns和表示,这些 patrerns和表示可能帮助模型解决语言。Abstract
Although Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications in modeling natural language, which is hypothesized to be mildly context-sensitive. We test Transformers' ability to learn a variety of mildly context-sensitive languages of varying complexities, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.
摘要
尽管 Transformer 在自然语言处理(NLP)任务中表现良好,但最新研究表明,自注意力在学习某些正则语言和上下文无关语言方面存在理论上的局限。这些发现促使我们思考其对自然语言建模的影响,因为自然语言被假设为轻度上下文相关(mildly context-sensitive)。我们测试了 Transformer 学习不同复杂度的轻度上下文相关语言的能力,发现它们在分布内的未见数据上泛化良好,但在更长字符串上的外推能力不如 LSTM。我们的分析表明,模型学习到的自注意力模式和表示刻画了依赖关系并表现出计数行为,这可能帮助模型求解这些语言。
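As an illustration of how such a test can be set up, the sketch below generates strings from a^n b^n c^n, a classic mildly context-sensitive language, with an in-distribution training range and a strictly longer extrapolation range. The specific languages and splits used in the paper may differ.

```python
import random

def sample_anbncn(max_n):
    """Return one string from {a^n b^n c^n : 1 <= n <= max_n}."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n + "c" * n

def make_splits(train_max_n=20, test_max_n=40, size=1000):
    train = [sample_anbncn(train_max_n) for _ in range(size)]
    # extrapolation set: strictly longer strings than anything seen in training
    test = []
    while len(test) < size:
        s = sample_anbncn(test_max_n)
        if len(s) > 3 * train_max_n:
            test.append(s)
    return train, test

train, test = make_splits()
print(len(train), len(test), max(len(s) for s in train), min(len(s) for s in test))
```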
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
For: The paper aims to improve record linkage in noisy datasets using large language models (LLMs) and make it more accessible to users who are familiar with popular string matching packages like R and Stata.* Methods: The paper proposes an open-source package called LinkTransformer that treats record linkage as a text retrieval problem and uses transformer LLMs to perform record linkage. The package includes a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI.* Results: The paper claims that LinkTransformer can perform record linkage with high accuracy and supports standard functionality such as blocking and linking on multiple noisy fields. It also includes comprehensive tools for efficient model tuning and makes it easy for users to contribute their custom-trained models to its model hub.Abstract
Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular softwares such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
摘要
跨数据源链接信息是社会科学、商业和政府领域许多分析的基础。尽管大型语言模型(LLM)为改进含噪数据集中的记录链接带来了巨大潜力,但在许多领域,R 和 Stata 等流行软件中的近似字符串匹配包仍占主导地位。这些软件包接口简洁清晰,并且可以轻松扩展到多种语言。我们的开源软件包 LinkTransformer 旨在将流行字符串匹配方法的易用性延伸到深度学习。它是一个基于 Transformer LLM 的通用记录链接软件包,将记录链接视为文本检索问题。其核心是一套开箱即用的工具,只需四行代码即可将 Transformer 模型应用于记录链接。LinkTransformer 内置了涵盖多种语言的预训练 Transformer 语义相似度模型库,并支持轻松集成来自 Hugging Face 或 OpenAI 的任意 Transformer 语言模型。它支持分块(blocking)以及基于多个含噪字段的链接等标准功能。LinkTransformer 的 API 还可执行其他常见的文本数据处理任务,例如聚合、含噪去重以及无需翻译的跨语言链接。重要的是,LinkTransformer 还提供了完善的模型调优工具,以便在现成模型无法达到所需精度时进行不同程度的定制。最后,为促进复用、可复现性与可扩展性,LinkTransformer 方便用户将自己训练的模型贡献到其模型中心。通过将 Transformer 语言模型与许多字符串匹配软件包用户所熟悉的直观 API 相结合,LinkTransformer 旨在让不熟悉深度学习框架的用户也能享受到 LLM 带来的益处。
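The retrieval view of record linkage can be sketched as below. Note this is a generic illustration, not LinkTransformer's API: TF-IDF character n-grams stand in for the package's transformer embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_records(queries, candidates):
    """For every query record, return the index of the closest candidate
    record in embedding space, treating linkage as text retrieval."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(queries + candidates)
    sims = cosine_similarity(vec.transform(queries), vec.transform(candidates))
    return sims.argmax(axis=1), sims.max(axis=1)

left = ["Acme Corp., New York", "Globex LLC, Springfield"]
right = ["ACME Corporation NY", "Initech, Austin", "Globex L.L.C. (Springfield)"]
matches, scores = link_records(left, right)
print(list(zip(left, [right[i] for i in matches], scores.round(2))))
```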
results: 仿真结果表明,所提出的 CMR-ISPS 波束形成器能够通过在阵列方向响应中形成足够深的零陷,有效抑制靠近期望信号方向的干扰,并在不同干扰水平下提供相应的抗干扰性能。Abstract
This work presents a cost-effective technique for designing robust adaptive beamforming algorithms based on efficient covariance matrix reconstruction with iterative spatial power spectrum (CMR-ISPS). The proposed CMR-ISPS approach reconstructs the interference-plus-noise covariance (INC) matrix based on a simplified maximum entropy power spectral density function that can be used to shape the directional response of the beamformer. Firstly, we estimate the directions of arrival (DoAs) of the interfering sources with the available snapshots. We then develop an algorithm to reconstruct the INC matrix using a weighted sum of outer products of steering vectors whose coefficients can be estimated in the vicinity of the DoAs of the interferences which lie in a small angular sector. We also devise a cost-effective adaptive algorithm based on conjugate gradient techniques to update the beamforming weights and a method to obtain estimates of the signal of interest (SOI) steering vector from the spatial power spectrum. The proposed CMR-ISPS beamformer can suppress interferers close to the direction of the SOI by producing notches in the directional response of the array with sufficient depths. Simulation results are provided to confirm the validity of the proposed method and make a comparison to existing approaches
摘要
首先,利用可用的快拍估计干扰源的到达方向(DoA)。随后,提出一种算法,利用导向矢量外积的加权和来重构干扰加噪声协方差(INC)矩阵,其系数可在干扰源 DoA 附近的小角度扇区内估计。此外,还提出了一种基于共轭梯度技术的低成本自适应算法来更新波束形成权值,以及一种从空间功率谱中估计期望信号(SOI)导向矢量的方法。所提出的 CMR-ISPS 波束形成器通过在阵列方向响应中产生足够深度的零陷,能够有效抑制靠近 SOI 方向的干扰。仿真结果验证了所提方法的有效性,并与现有方法进行了比较。该技术为存在干扰情况下的自适应波束形成提供了一种低成本且稳健的解决方案。
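A small numpy sketch of the core reconstruction idea follows: sum outer products of steering vectors over a small angular sector around each estimated interferer DoA, then form MVDR-style weights. The array geometry, sector width, and interferer powers are illustrative assumptions, and the conjugate-gradient weight update of the paper is replaced by a direct solve.

```python
import numpy as np

def steering(theta_deg, n_ant=10, spacing=0.5):
    """Steering vector of a uniform linear array (half-wavelength spacing)."""
    k = np.arange(n_ant)
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

def reconstruct_inc(interferer_doas, powers, sector=2.0, step=0.5,
                    noise_power=1.0, n_ant=10):
    """INC matrix as a weighted sum of steering-vector outer products taken
    in a small angular sector around each estimated interferer DoA."""
    R = noise_power * np.eye(n_ant, dtype=complex)
    for doa, p in zip(interferer_doas, powers):
        for ang in np.arange(doa - sector, doa + sector + step, step):
            a = steering(ang, n_ant)
            R += p * np.outer(a, a.conj())
    return R

R_inc = reconstruct_inc(interferer_doas=[-30.0, 40.0], powers=[10.0, 5.0])
a_soi = steering(10.0)                          # assumed signal-of-interest direction
w = np.linalg.solve(R_inc, a_soi)
w /= a_soi.conj() @ w                           # MVDR-style normalisation
print(abs(w.conj() @ steering(-30.0)))          # deep notch towards the interferer
```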
paper_authors: Lianke Qin, Aravind Reddy, Zhao Song
For: This paper is written for studying dimension reduction for Mahalanobis metrics and providing efficient data structures for solving the Approximate Distance Estimation (ADE) problem for Mahalanobis distances.* Methods: The paper uses randomized Monte Carlo data structures and adapts it to handle sequences of adaptive queries and online updates to the Mahalanobis metric matrix and the data points.* Results: The paper provides efficient data structures for solving the ADE problem for Mahalanobis distances, which can be used in conjunction with prior algorithms for online learning of Mahalanobis metrics.Abstract
Mahalanobis metrics are widely used in machine learning in conjunction with methods like $k$-nearest neighbors, $k$-means clustering, and $k$-medians clustering. Despite their importance, there has not been any prior work on applying sketching techniques to speed up algorithms for Mahalanobis metrics. In this paper, we initiate the study of dimension reduction for Mahalanobis metrics. In particular, we provide efficient data structures for solving the Approximate Distance Estimation (ADE) problem for Mahalanobis distances. We first provide a randomized Monte Carlo data structure. Then, we show how we can adapt it to provide our main data structure which can handle sequences of \textit{adaptive} queries and also online updates to both the Mahalanobis metric matrix and the data points, making it amenable to be used in conjunction with prior algorithms for online learning of Mahalanobis metrics.
摘要
马哈拉诺比斯(Mahalanobis)度量在机器学习中应用广泛,常与 $k$-最近邻、$k$-均值聚类和 $k$-中位数聚类等方法结合使用。尽管其十分重要,此前尚无工作将草图(sketching)技术用于加速马哈拉诺比斯度量相关的算法。在这篇论文中,我们开启了针对马哈拉诺比斯度量的降维研究。具体来说,我们提供了高效的数据结构来求解马哈拉诺比斯距离的近似距离估计(ADE)问题。我们首先给出一个随机化的蒙特卡洛数据结构,然后展示如何将其改造为我们的主数据结构,使其既能处理自适应查询序列,又能支持对马哈拉诺比斯度量矩阵和数据点的在线更新,从而可以与已有的马哈拉诺比斯度量在线学习算法结合使用。
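The basic reduction behind this line of work can be sketched as follows: a Mahalanobis distance equals a Euclidean distance after applying a Cholesky factor of the metric, and a Johnson-Lindenstrauss random projection then gives fast approximate distances. The adaptive-query handling and online updates of the paper's data structure are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 50, 1000, 25                  # ambient dim, points, sketch dim

# a random positive-definite Mahalanobis metric M = A^T A
A = rng.standard_normal((d, d))
M = A.T @ A
L = np.linalg.cholesky(M)               # M = L L^T, so ||x-y||_M = ||L^T (x-y)||_2

X = rng.standard_normal((n, d))
q = rng.standard_normal(d)

S = rng.standard_normal((m, d)) / np.sqrt(m)   # JL sketch matrix
X_sk = (S @ L.T @ X.T).T                       # pre-sketched data points
q_sk = S @ L.T @ q

exact = np.sqrt(np.einsum("ij,jk,ik->i", X - q, M, X - q))
approx = np.linalg.norm(X_sk - q_sk, axis=1)
print(np.median(np.abs(approx - exact) / exact))   # relative error shrinks as m grows
```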
On the training and generalization of deep operator networks
results: 该训练方法可以在各种情况下提高DeepONets的稳定性和泛化能力,并且可以更好地处理各种非线性和非对易性问题。Abstract
We present a novel training method for deep operator networks (DeepONets), one of the most popular neural network models for operators. DeepONets are constructed by two sub-networks, namely the branch and trunk networks. Typically, the two sub-networks are trained simultaneously, which amounts to solving a complex optimization problem in a high dimensional space. In addition, the nonconvex and nonlinear nature makes training very challenging. To tackle such a challenge, we propose a two-step training method that trains the trunk network first and then sequentially trains the branch network. The core mechanism is motivated by the divide-and-conquer paradigm and is the decomposition of the entire complex training task into two subtasks with reduced complexity. Therein the Gram-Schmidt orthonormalization process is introduced which significantly improves stability and generalization ability. On the theoretical side, we establish a generalization error estimate in terms of the number of training data, the width of DeepONets, and the number of input and output sensors. Numerical examples are presented to demonstrate the effectiveness of the two-step training method, including Darcy flow in heterogeneous porous media.
摘要
我们提出了一种用于深度算子网络(DeepONets)的新训练方法,DeepONet 是最流行的算子神经网络模型之一。DeepONet 由两个子网络组成:分支网络(branch)和主干网络(trunk)。通常这两个子网络同时训练,相当于在高维空间中求解一个复杂的优化问题;加之问题的非凸与非线性性质,训练十分困难。为应对这一挑战,我们提出了一种两步训练方法:先训练主干网络,再依次训练分支网络。其核心机制受分而治之思想启发,将整个复杂的训练任务分解为两个复杂度更低的子任务。在此过程中引入的 Gram-Schmidt 正交归一化显著提升了稳定性和泛化能力。在理论方面,我们建立了一个泛化误差估计,其依赖于训练数据量、DeepONet 的宽度以及输入和输出传感器的数量。数值算例(包括非均质多孔介质中的 Darcy 流)展示了两步训练方法的有效性。
MPTopic: Improving topic modeling via Masked Permuted pre-training
For: The paper aims to improve the quality of topic modeling in text analysis by addressing the limitations of existing methods such as BERTopic and Top2Vec.* Methods: The paper introduces a new approach called TF-RDF (Term Frequency - Relative Document Frequency) to assess the relevance of terms within a document, and uses this approach to drive a clustering algorithm called MPTopic.* Results: The paper shows that the topic keywords identified using MPTopic and TF-RDF outperform those extracted by BERTopic and Top2Vec through comprehensive evaluation.Here’s the same information in Simplified Chinese:* For: 论文目的是为了提高文本分析中的话题模型质量,并且解决现有方法如BERTopic和Top2Vec的局限性。* Methods: 论文引入了一种新的方法 called TF-RDF (文档频次-相对文档频次),用于评估文档中 термина的 relevance,并使用这种方法驱动一种名为 MPTopic 的聚类算法。* Results: 论文表明,使用 MPTopic 和 TF-RDF 提取的话题关键词比 BERTopic 和 Top2Vec 提取的词语要出色。Abstract
Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged in the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses C-TF-IDF for its topic extraction.In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assess the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Through comprehensive evaluation, it is evident that the topic keywords identified with the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec.
摘要
主题建模对于发现文本中隐藏的语义结构、进而生成有意义的描述性关键词至关重要。尽管 BERTopic 和 Top2Vec 等创新技术近来走到前沿,但它们仍存在一定局限。我们的分析表明,这些方法可能没有着力改进其聚类机制,从而可能影响所得主题簇的质量。例如,Top2Vec 以聚类结果的质心来表示主题,而 BERTopic 则借助 C-TF-IDF 进行主题提取。针对这些挑战,我们提出了 TF-RDF(词频-相对文档频率),一种评估词项在文档内相关性的独特方法。在 TF-RDF 的基础上,我们提出了 MPTopic,一种由 TF-RDF 洞察驱动的聚类算法。综合评估表明,MPTopic 与 TF-RDF 协同识别出的主题关键词优于 BERTopic 和 Top2Vec 所提取的关键词。
Streaming Active Learning for Regression Problems Using Regression via Classification
results: 实验结果表明,提出的方法可以在同等级别的注释成本下实现更高的回归精度。Abstract
One of the challenges in deploying a machine learning model is that the model's performance degrades as the operating environment changes. To maintain the performance, streaming active learning is used, in which the model is retrained by adding a newly annotated sample to the training dataset if the prediction of the sample is not certain enough. Although many streaming active learning methods have been proposed for classification, few efforts have been made for regression problems, which are often handled in the industrial field. In this paper, we propose to use the regression-via-classification framework for streaming active learning for regression. Regression-via-classification transforms regression problems into classification problems so that streaming active learning methods proposed for classification problems can be applied directly to regression problems. Experimental validation on four real data sets shows that the proposed method can perform regression with higher accuracy at the same annotation cost.
摘要
一个机器学习模型部署后面临的挑战是其性能会随运行环境的变化而下降。为维持性能,可采用流式主动学习:当模型对某个样本的预测不够确定时,将新标注的样本加入训练集并重新训练模型。虽然已有许多面向分类问题的流式主动学习方法被提出,但针对工业领域常见的回归问题的研究还很少。本文提出使用"回归转分类"框架来实现面向回归的流式主动学习:将回归问题转化为分类问题,从而可直接应用为分类问题设计的流式主动学习方法。在四个真实数据集上的实验验证表明,所提方法能在相同的标注成本下实现更高的回归精度。
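A minimal scikit-learn sketch of regression-via-classification combined with an uncertainty-based streaming query rule is given below. The bin count, confidence threshold, and toy target function are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_bins = 10
edges = np.linspace(0.0, 1.0, n_bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
classes = np.arange(n_bins)
confidence_threshold = 0.5

def target_fn(x):
    return 0.5 * (np.sin(4 * x[:, 0]) + 1.0)        # ground truth in [0, 1]

clf = SGDClassifier(loss="log_loss")
X0 = rng.random((20, 2))                             # warm start with a few labels
y0 = np.clip(np.digitize(target_fn(X0), edges) - 1, 0, n_bins - 1)
clf.partial_fit(X0, y0, classes=classes)

queried = 0
for _ in range(500):                                 # simulated data stream
    x = rng.random((1, 2))
    proba = clf.predict_proba(x)[0]
    if proba.max() < confidence_threshold:           # prediction not certain enough
        y_bin = np.clip(np.digitize(target_fn(x), edges) - 1, 0, n_bins - 1)
        clf.partial_fit(x, y_bin)                    # annotate and update online
        queried += 1
    else:
        y_hat = centers[proba.argmax()]              # regression estimate from the bin
print("annotation budget used:", queried)
```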
Bayesian sparsity and class sparsity priors for dictionary learning and coding
paper_authors: Alberto Bocchinfuso, Daniela Calvetti, Erkki Somersalo
for: solves challenging inverse problems using dictionary learning methods
methods: uses sparse coding techniques and dictionary compression to reduce computational complexity
results: effectively identifies relevant subdictionaries and reduces computational complexity in real-world applications such as glitch detection and hyperspectral remote sensingAbstract
Dictionary learning methods continue to gain popularity for the solution of challenging inverse problems. In the dictionary learning approach, the computational forward model is replaced by a large dictionary of possible outcomes, and the problem is to identify the dictionary entries that best match the data, akin to traditional query matching in search engines. Sparse coding techniques are used to guarantee that the dictionary matching identifies only few of the dictionary entries, and dictionary compression methods are used to reduce the complexity of the matching problem. In this article, we propose a work flow to facilitate the dictionary matching process. First, the full dictionary is divided into subdictionaries that are separately compressed. The error introduced by the dictionary compression is handled in the Bayesian framework as a modeling error. Furthermore, we propose a new Bayesian data-driven group sparsity coding method to help identify subdictionaries that are not relevant for the dictionary matching. After discarding irrelevant subdictionaries, the dictionary matching is addressed as a deflated problem using sparse coding. The compression and deflation steps can lead to substantial decreases of the computational complexity. The effectiveness of compensating for the dictionary compression error and using the novel group sparsity promotion to deflate the original dictionary are illustrated by applying the methodology to real world problems, the glitch detection in the LIGO experiment and hyperspectral remote sensing.
摘要
字典学习方法在求解困难反问题方面日益流行。在字典学习方法中,计算性前向模型被一个包含各种可能结果的大型字典取代,问题转化为找出与数据最匹配的字典条目,类似于搜索引擎中传统的查询匹配。稀疏编码技术用于保证字典匹配只选中少量字典条目,而字典压缩方法用于降低匹配问题的复杂度。在这篇文章中,我们提出了一个便于字典匹配过程的工作流程。首先,将完整字典划分为若干子字典并分别压缩;字典压缩引入的误差在贝叶斯框架中作为建模误差处理。此外,我们提出了一种新的贝叶斯数据驱动的组稀疏编码方法,用于识别与字典匹配无关的子字典。剔除无关子字典之后,字典匹配作为一个规模缩减的问题用稀疏编码求解。压缩与缩减步骤可以显著降低计算复杂度。我们将该方法应用于实际问题——LIGO 实验中的毛刺(glitch)检测以及高光谱遥感,展示了补偿字典压缩误差以及利用新的组稀疏性促进方法缩减原始字典的有效性。
Switch and Conquer: Efficient Algorithms By Switching Stochastic Gradient Oracles For Decentralized Saddle Point Problems
paper_authors: Chhavi Sharma, Vishnu Narayanan, P. Balamurugan
for: 这个论文targets non-smooth strongly convex-strongly concave saddle point problems in a decentralized setting without a central server.
methods: authors propose an inexact primal dual hybrid gradient (inexact PDHG) procedure that allows generic gradient computation oracles to update the primal and dual variables.
results: authors prove that the proposed algorithm, Decentralized Proximal Switching Stochastic Gradient method with Compression (C-DPSSG), converges to an $\epsilon$-accurate saddle point solution with linear rate, and the algorithm is well suited for obtaining solutions of low/medium accuracy faster.Abstract
We consider a class of non-smooth strongly convex-strongly concave saddle point problems in a decentralized setting without a central server. To solve a consensus formulation of problems in this class, we develop an inexact primal dual hybrid gradient (inexact PDHG) procedure that allows generic gradient computation oracles to update the primal and dual variables. We first investigate the performance of inexact PDHG with stochastic variance reduction gradient (SVRG) oracle. Our numerical study uncovers a significant phenomenon of initial conservative progress of iterates of IPDHG with SVRG oracle. To tackle this, we develop a simple and effective switching idea, where a generalized stochastic gradient (GSG) computation oracle is employed to hasten the iterates' progress to a saddle point solution during the initial phase of updates, followed by a switch to the SVRG oracle at an appropriate juncture. The proposed algorithm is named Decentralized Proximal Switching Stochastic Gradient method with Compression (C-DPSSG), and is proven to converge to an $\epsilon$-accurate saddle point solution with linear rate. Apart from delivering highly accurate solutions, our study reveals that utilizing the best convergence phases of GSG and SVRG oracles makes C-DPSSG well suited for obtaining solutions of low/medium accuracy faster, useful for certain applications. Numerical experiments on two benchmark machine learning applications show C-DPSSG's competitive performance which validate our theoretical findings. The codes used in the experiments can be found \href{https://github.com/chhavisharma123/C-DPSSG-CDC2023}{here}.
摘要
我们考虑一类去中心化(无中央服务器)设置下的非光滑强凸-强凹鞍点问题。为求解该类问题的一致性(consensus)形式,我们提出了一种不精确原始对偶混合梯度(inexact PDHG)过程,允许使用一般的梯度计算预言机来更新原始变量和对偶变量。我们首先研究了采用随机方差缩减梯度(SVRG)预言机的不精确 PDHG 的性能。数值研究揭示了一个显著现象:采用 SVRG 预言机的 IPDHG 在初始阶段的迭代进展较为保守。为解决这一问题,我们提出了一个简单而有效的切换思路:在更新的初始阶段采用广义随机梯度(GSG)计算预言机,加快迭代向鞍点解的推进,随后在适当时刻切换到 SVRG 预言机。所提算法命名为带压缩的去中心化近端切换随机梯度法(C-DPSSG),并被证明以线性速率收敛到 $\epsilon$-精确的鞍点解。除了给出高精度解之外,我们的研究表明,利用 GSG 与 SVRG 预言机各自最佳的收敛阶段,使 C-DPSSG 非常适合更快地获得低/中等精度的解,这在某些应用中十分有用。在两个基准机器学习应用上的数值实验显示了 C-DPSSG 的竞争性表现,验证了我们的理论结果。实验代码可在以下链接获取:https://github.com/chhavisharma123/C-DPSSG-CDC2023
A Boosted Machine Learning Framework for the Improvement of Phase and Crystal Structure Prediction of High Entropy Alloys Using Thermodynamic and Configurational Parameters
for: This paper aims to predict the phases and crystal structures of High-Entropy Alloys (HEAs) using machine learning (ML) techniques.
methods: The study employs five distinct boosting algorithms (XGBoost, LightGBM, Random Forest, Gradient Boosting, and CatBoost) to predict phases and crystal structures, and introduces a methodical framework using the Pearson correlation coefficient to select strongly co-related features for improved accuracy.
results: The study achieves an accuracy of 94.05% for phase prediction and 90.07% for crystal structure prediction, and provides a new approach to quantify the influence of parameters on the model’s accuracy.Abstract
The reason behind the remarkable properties of High-Entropy Alloys (HEAs) is rooted in the diverse phases and the crystal structures they contain. In the realm of material informatics, employing machine learning (ML) techniques to classify phases and crystal structures of HEAs has gained considerable significance. In this study, we assembled a new collection of 1345 HEAs with varying compositions to predict phases. Within this collection, there were 705 sets of data that were utilized to predict the crystal structures with the help of thermodynamics and electronic configuration. Our study introduces a methodical framework i.e., the Pearson correlation coefficient that helps in selecting the strongly co-related features to increase the prediction accuracy. This study employed five distinct boosting algorithms to predict phases and crystal structures, offering an enhanced guideline for improving the accuracy of these predictions. Among all these algorithms, XGBoost gives the highest accuracy of prediction (94.05%) for phases and LightGBM gives the highest accuracy of prediction of crystal structure of the phases (90.07%). The quantification of the influence exerted by parameters on the model's accuracy was conducted and a new approach was made to elucidate the contribution of individual parameters in the process of phase prediction and crystal structure prediction.
摘要
高熵合金(HEA)之所以具有卓越的性能,根源在于其所包含的多样相与晶体结构。在材料信息学领域,利用机器学习(ML)技术对 HEA 的相和晶体结构进行分类已获得相当大的关注。在这项研究中,我们构建了一个包含 1345 种不同成分 HEA 的新数据集用于相预测;其中 705 组数据结合热力学与电子构型参数用于晶体结构预测。我们的研究引入了一个系统性的框架,即利用 Pearson 相关系数挑选强相关特征以提高预测精度。本研究采用五种不同的提升(boosting)算法来预测相和晶体结构,为提高此类预测的精度提供了更完善的指导。在所有算法中,XGBoost 对相预测的精度最高(94.05%),LightGBM 对相的晶体结构预测精度最高(90.07%)。我们还量化了各参数对模型精度的影响,并提出了一种新方法来阐明单个参数在相预测和晶体结构预测过程中的贡献。
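The two steps of the framework, correlation-based feature selection followed by a boosting classifier, can be sketched on synthetic data as below. Correlating features with an encoded class label and using scikit-learn's gradient boosting are simplifying assumptions; the paper evaluates several boosting libraries including XGBoost and LightGBM on real alloy descriptors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for thermodynamic/configurational descriptors and phase labels
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# absolute Pearson correlation of each feature with the (encoded) phase label
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = np.argsort(corr)[::-1][:6]            # keep the 6 most correlated features
X_sel = X[:, selected]

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("kept features:", selected.tolist(), "accuracy:", round(model.score(X_te, y_te), 3))
```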
An Ensemble Score Filter for Tracking High-Dimensional Nonlinear Dynamical Systems
methods: 利用基于分数的扩散模型来刻画滤波密度的演化,并使用小批量 Monte Carlo 估计器直接估计分数函数,而无需训练神经网络。
results: 在高维洛伦兹(Lorenz)系统上,EnSF 能够可靠地跟踪具有高度非线性观测过程的高维系统,并提供高精度的滤波结果,而这正是现有滤波方法所面临的挑战。Abstract
We propose an ensemble score filter (EnSF) for solving high-dimensional nonlinear filtering problems with superior accuracy. A major drawback of existing filtering methods, e.g., particle filters or ensemble Kalman filters, is the low accuracy in handling high-dimensional and highly nonlinear problems. EnSF attacks this challenge by exploiting the score-based diffusion model, defined in a pseudo-temporal domain, to characterizing the evolution of the filtering density. EnSF stores the information of the recursively updated filtering density function in the score function, in stead of storing the information in a set of finite Monte Carlo samples (used in particle filters and ensemble Kalman filters). Unlike existing diffusion models that train neural networks to approximate the score function, we develop a training-free score estimation that uses mini-batch-based Monte Carlo estimator to directly approximate the score function at any pseudo-spatial-temporal location, which provides sufficient accuracy in solving high-dimensional nonlinear problems as well as saves tremendous amount of time spent on training neural networks. Another essential aspect of EnSF is its analytical update step, gradually incorporating data information into the score function, which is crucial in mitigating the degeneracy issue faced when dealing with very high-dimensional nonlinear filtering problems. High-dimensional Lorenz systems are used to demonstrate the performance of our method. EnSF provides surprisingly impressive performance in reliably tracking extremely high-dimensional Lorenz systems (up to 1,000,000 dimension) with highly nonlinear observation processes, which is a well-known challenging problem for existing filtering methods.
摘要
我们提出了一种集合分数滤波器(EnSF),用于以更高的精度求解高维非线性滤波问题。现有滤波方法(如粒子滤波或集合卡尔曼滤波)的一大缺陷是在处理高维、高度非线性问题时精度较低。EnSF 通过利用定义在伪时间域上的基于分数的扩散模型来刻画滤波密度的演化,从而应对这一挑战。EnSF 将递归更新的滤波密度函数的信息存储在分数函数中,而不是像粒子滤波和集合卡尔曼滤波那样存储在有限的蒙特卡洛样本集中。与通过训练神经网络来逼近分数函数的现有扩散模型不同,我们提出了一种无需训练的分数估计方法,利用基于小批量的蒙特卡洛估计器直接在任意伪时空位置逼近分数函数,这既为求解高维非线性问题提供了足够的精度,也省去了训练神经网络所需的大量时间。EnSF 的另一个关键之处在于其解析更新步骤,逐步将观测信息纳入分数函数,这对于缓解极高维非线性滤波问题中的退化问题至关重要。我们用高维洛伦兹系统来展示该方法的性能:EnSF 在可靠跟踪具有高度非线性观测过程的极高维洛伦兹系统(最高达 1,000,000 维)方面表现出令人惊讶的出色性能,而这是现有滤波方法公认的难题。
Pure Message Passing Can Estimate Common Neighbor for Link Prediction
results: 我们在来自不同领域的基准数据集上进行了实验,结果显示我们的方法在链接预测任务中表现出色,一致优于基线方法。Abstract
Message Passing Neural Networks (MPNNs) have emerged as the {\em de facto} standard in graph representation learning. However, when it comes to link prediction, they often struggle, surpassed by simple heuristics such as Common Neighbor (CN). This discrepancy stems from a fundamental limitation: while MPNNs excel in node-level representation, they stumble with encoding the joint structural features essential to link prediction, like CN. To bridge this gap, we posit that, by harnessing the orthogonality of input vectors, pure message-passing can indeed capture joint structural features. Specifically, we study the proficiency of MPNNs in approximating CN heuristics. Based on our findings, we introduce the Message Passing Link Predictor (MPLP), a novel link prediction model. MPLP taps into quasi-orthogonal vectors to estimate link-level structural features, all while preserving the node-level complexities. Moreover, our approach demonstrates that leveraging message-passing to capture structural features could offset MPNNs' expressiveness limitations at the expense of estimation variance. We conduct experiments on benchmark datasets from various domains, where our method consistently outperforms the baseline methods.
摘要
消息传递神经网络(MPNNs)已成为图表示学习的事实标准。然而在链接预测任务上,它们往往表现不佳,甚至被共同邻居(CN)等简单启发式方法超越。这一差距源于一个根本性的局限:MPNN 擅长节点级表示,却难以编码链接预测所必需的联合结构特征(如 CN)。为弥合这一差距,我们提出:通过利用输入向量的正交性,纯粹的消息传递其实能够捕捉联合结构特征。具体而言,我们研究了 MPNN 逼近 CN 启发式方法的能力。基于这些发现,我们提出了消息传递链接预测器(MPLP),一种新的链接预测模型。MPLP 利用准正交向量来估计链接级结构特征,同时保留节点级的复杂信息。此外,我们的方法表明,利用消息传递来捕捉结构特征可以弥补 MPNN 表达能力上的不足,代价是估计方差的增加。我们在来自不同领域的基准数据集上进行了实验,我们的方法一致优于基线方法。
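The observation behind MPLP can be illustrated directly: give every node a random quasi-orthogonal vector, do one round of sum message passing, and the inner product of two nodes' messages concentrates around their common-neighbour count. The sketch below checks this on a random graph; the graph size and vector dimension are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, p = 200, 2048, 0.05

# random undirected graph (symmetric 0/1 adjacency, no self-loops)
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T

X = rng.standard_normal((n, dim)) / np.sqrt(dim)   # quasi-orthogonal node vectors
M = A @ X                                          # one round of sum message passing

u, v = 3, 7
cn_true = int(A[u] @ A[v])                         # exact common-neighbour count
cn_est = float(M[u] @ M[v])                        # inner product of messages
print(cn_true, round(cn_est, 2))
```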
Network Topology Inference with Sparsity and Laplacian Constraints
results: 数值实验表明,所提方法能够有效地求解网络拓扑推断问题,并且比传统的 $\ell_1$ 范数方法更加稳定和有效。Abstract
We tackle the network topology inference problem by utilizing Laplacian constrained Gaussian graphical models, which recast the task as estimating a precision matrix in the form of a graph Laplacian. Recent research \cite{ying2020nonconvex} has uncovered the limitations of the widely used $\ell_1$-norm in learning sparse graphs under this model: empirically, the number of nonzero entries in the solution grows with the regularization parameter of the $\ell_1$-norm; theoretically, a large regularization parameter leads to a fully connected (densest) graph. To overcome these challenges, we propose a graph Laplacian estimation method incorporating the $\ell_0$-norm constraint. An efficient gradient projection algorithm is developed to solve the resulting optimization problem, characterized by sparsity and Laplacian constraints. Through numerical experiments with synthetic and financial time-series datasets, we demonstrate the effectiveness of the proposed method in network topology inference.
摘要
我们利用拉普拉斯约束的高斯图模型来处理网络拓扑推断问题,将该任务重新表述为估计一个具有图拉普拉斯形式的精度矩阵。最近的研究 \cite{ying2020nonconvex} 揭示了在该模型下广泛使用的 $\ell_1$ 范数在学习稀疏图方面的局限:经验上,解中非零元素的数量会随 $\ell_1$ 范数的正则化参数增大而增加;理论上,较大的正则化参数会导致一个全连接(最稠密)的图。为克服这些挑战,我们提出了一种引入 $\ell_0$ 范数约束的图拉普拉斯估计方法,并开发了一种高效的梯度投影算法来求解由此产生的带有稀疏性和拉普拉斯约束的优化问题。通过在合成数据和金融时间序列数据上的数值实验,我们展示了所提方法在网络拓扑推断中的有效性。
results: 本文的实验结果表明,使用该方法可以减少学习的复杂性,同时保证算法的准确性。这种方法可以用于各种电子电路设计问题,如电路优化、灵活性分析等。Abstract
Electrical circuits are present in a variety of technologies, making their design an important part of computer aided engineering. The growing number of tunable parameters that affect the final design leads to a need for new approaches of quantifying their impact. Machine learning may play a key role in this regard, however current approaches often make suboptimal use of existing knowledge about the system at hand. In terms of circuits, their description via modified nodal analysis is well-understood. This particular formulation leads to systems of differential-algebraic equations (DAEs) which bring with them a number of peculiarities, e.g. hidden constraints that the solution needs to fulfill. We aim to use the recently introduced dissection concept for DAEs that can decouple a given system into ordinary differential equations, only depending on differential variables, and purely algebraic equations that describe the relations between differential and algebraic variables. The idea then is to only learn the differential variables and reconstruct the algebraic ones using the relations from the decoupling. This approach guarantees that the algebraic constraints are fulfilled up to the accuracy of the nonlinear system solver, which represents the main benefit highlighted in this article.
摘要
电路设计是现代工程设计中的一个重要组成部分。随着参数的增加,电路设计的最终结果的影响需要新的方法来衡量其影响。机器学习可能会在这个领域发挥关键作用,但现有的方法常常不充分利用现有系统的知识。在电路方面,使用修改后的节点分析来描述电路是非常好的。这种形式化导致系统拥有偏微分方程(DAE),这些DAE具有一些特点,例如隐藏的约束,解决方案需要满足这些约束。我们想使用最近引入的分割概念来处理DAE,将系统分解成仅依赖于偏微分变量的普通偏微分方程,并且使用系统的关系来重建代数变量。这种方法保证了代数约束的满足,直到非线性系统解决器的精度,这是本文的主要优点。
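The decoupling idea above — learn only the differential variables and reconstruct the algebraic ones from the coupling relations — can be illustrated with a toy constraint solved by a standard nonlinear solver. The sketch below is a hypothetical example, not the paper's circuit models; the constraint g and the x-values are made up.

```python
# Toy sketch of the dissection-based decoupling: only the differential variable x
# is predicted/learned, and the algebraic variable z is reconstructed at each step
# by solving g(x, z) = 0, so the hidden constraint holds up to solver accuracy.
import numpy as np
from scipy.optimize import fsolve

def g(z, x):
    """Hypothetical algebraic constraint relating z to x: z - x**3 = 0."""
    return z - x**3

# Pretend these x-values came from a learned surrogate of the decoupled ODE.
x_pred = np.linspace(1.0, 0.2, 5)

# Reconstruct z pointwise from the constraint instead of learning it.
z_rec = np.array([fsolve(g, x0=x, args=(x,))[0] for x in x_pred])

for x, z in zip(x_pred, z_rec):
    print(f"x={x:.3f}  reconstructed z={z:.3f}  residual g={g(z, x):.1e}")
```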
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
paper_authors: Neel Nanda, Andrew Lee, Martin Wattenberg
for: This paper investigates how sequence models represent their decision-making process, and provides evidence of a closely related linear representation of the board state.
methods: The paper uses an Othello-playing neural network and probing to understand the model's internal state.
results: The paper obtains a simple yet powerful way to interpret the model's internal state, and demonstrates that linear representations enable significant interpretability progress.Abstract
How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
摘要
<>translate "How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for 'my colour' vs. 'opponent's colour' may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed." into Chinese (Simplified)Answer:sequence models的决策过程是如何表示?先前的工作表明,抽象棋盘 neural network 学习了非线性模型(Li et al., 2023)。在这项工作中,我们提供了一种相关的直线表示,具体来说,我们表明了 probing for "我的颜色" vs. "对手的颜色" 可能是一种简单却强大的内部状态的解释方法。这种精确的内部表示允许我们通过简单的矢量算术控制模型的行为。直线表示具有显著的可读性进步,我们通过进一步探索世界模型如何计算来证明这一点。
Short-term power load forecasting method based on CNN-SAEDN-Res
results: Simulation results show that the proposed load forecasting method has clear advantages in prediction accuracy and prediction stability, and is better than previous methods at capturing the correlations within data mixed with non-temporal factors.Abstract
In deep learning, load data mixed with non-temporal factors is difficult for sequence models to process. This problem results in insufficient precision of the prediction. Therefore, a short-term load forecasting method based on a convolutional neural network (CNN), a self-attention encoder-decoder network (SAEDN) and residual refinement (Res) is proposed. In this method, the feature extraction module is composed of a two-dimensional convolutional neural network, which is used to mine the local correlation between data and obtain high-dimensional data features. The initial load forecasting module consists of a self-attention encoder-decoder network and a feedforward neural network (FFN). The module utilizes self-attention mechanisms to encode high-dimensional features. This operation can obtain the global correlation between data. Therefore, the model is able to retain important information based on the coupling relationship between the data in data mixed with non-time series factors. Then, self-attention decoding is performed and the feedforward neural network is used to regress the initial load. This paper introduces the residual mechanism to build the load optimization module. The module generates residual load values to optimize the initial load. The simulation results show that the proposed load forecasting method has advantages in terms of prediction accuracy and prediction stability.
摘要
在深度学习中,带有非时序因素的数据加载具有困难处理序列模型的问题。这种问题导致预测精度不够。因此,一种基于卷积神经网络(CNN)、自注意编码器解码网络(SAEDN)和剩余级修正(Res)的短期电力预测方法被提出。在这种方法中,特征提取模块由两维卷积神经网络组成,用于挖掘数据中的本地相关性,并从而获得高维数据特征。初始电力预测模块包括自注意编码器解码网络和Feedforward神经网络(FFN)。这个模块使用自注意机制编码高维特征,从而获得数据之间的全局相关性。因此,模型能够保留数据混合非时序因素的重要信息。然后,自注意解码被执行,并使用Feedforward神经网络进行回归初始电力。本文介绍了剩余机制来建立电力优化模块。该模块生成剩余电力值,以优化初始电力。实验结果显示,提议的电力预测方法具有更高的预测精度和预测稳定性。
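A rough PyTorch sketch of the pipeline described above — 2-D convolutional feature extraction, a self-attention encoder-decoder, an FFN for the initial forecast, and a residual-refinement head — is given below; layer sizes, input shapes and the simple additive refinement are placeholders rather than the paper's settings.

```python
# Illustrative architecture sketch only, not the authors' implementation.
import torch
import torch.nn as nn

class ShortTermLoadForecaster(nn.Module):
    def __init__(self, in_channels=1, d_model=64, horizon=24):
        super().__init__()
        self.feature_extractor = nn.Sequential(            # local correlations
            nn.Conv2d(in_channels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((horizon, 1)),
        )
        self.encoder = nn.TransformerEncoder(               # global correlations
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.residual_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, x):                                    # x: (batch, channels, time, features)
        feats = self.feature_extractor(x)                    # (batch, d_model, horizon, 1)
        feats = feats.squeeze(-1).transpose(1, 2)            # (batch, horizon, d_model)
        memory = self.encoder(feats)
        decoded = self.decoder(feats, memory)
        initial_load = self.ffn(decoded).squeeze(-1)         # initial forecast
        residual = self.residual_head(decoded).squeeze(-1)   # refinement term
        return initial_load + residual

model = ShortTermLoadForecaster()
print(model(torch.randn(2, 1, 24, 8)).shape)                 # torch.Size([2, 24])
```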
A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading
results: The results show that the proposed multi-head ensemble multi-task learning (MEMTL) approach can efficiently handle time-varying environments while achieving high inference accuracy.Abstract
Computation offloading has become a popular solution to support computationally intensive and latency-sensitive applications by transferring computing tasks to mobile edge servers (MESs) for execution, which is known as mobile/multi-access edge computing (MEC). To improve the MEC performance, it is required to design an optimal offloading strategy that includes offloading decision (i.e., whether offloading or not) and computational resource allocation of MEC. The design can be formulated as a mixed-integer nonlinear programming (MINLP) problem, which is generally NP-hard and its effective solution can be obtained by performing online inference through a well-trained deep neural network (DNN) model. However, when the system environments change dynamically, the DNN model may lose efficacy due to the drift of input parameters, thereby decreasing the generalization ability of the DNN model. To address this unique challenge, in this paper, we propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs). Specifically, the shared backbone will be invariant during the PHs training and the inferred results will be ensembled, thereby significantly reducing the required training overhead and improving the inference performance. As a result, the joint optimization problem for offloading decision and resource allocation can be efficiently solved even in a time-varying wireless environment. Experimental results show that the proposed MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
摘要
computation offloading 已成为支持 computationally intensive 和延迟敏感应用的受欢迎解决方案,通过将计算任务传输到 mobil edge server (MES) 进行执行,这被称为 mobil/多 access edge computing (MEC)。为了提高 MEC 性能,需要设计一个优化的卸载策略,包括卸载决策(是否卸载)和 MEC 的计算资源分配。该设计可以表示为混合整数非线性编程 (MINLP) 问题,通常是NP-hard 的,其有效解决方法是通过在训练了深度神经网络 (DNN) 模型的线上推理进行获取。然而,当系统环境变化 dynamically 时,DNN 模型可能会失去有效性,因为输入参数的漂移,从而降低 DNN 模型的泛化能力。为解决这个特殊挑战,在这篇论文中,我们提出了一种多头集合多任务学习 (MEMTL) 方法,其特点是共享脊梁和多个预测头 (PH)。具体来说,共享脊梁在 PH 训练时保持不变,并将推理结果ensemble,从而减少了训练负担和提高了推理性能。因此,在时变无线环境中,可以效率地解决卸载决策和资源分配的共优化问题。实验结果表明,提出的 MEMTL 方法在推理准确率和平均方差Error 方面具有显著优势,而无需更多的训练数据。
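The shared-backbone/multi-head idea can be sketched in a few lines of PyTorch; the snippet below is our illustration with arbitrary dimensions and a plain averaging ensemble, not the authors' MEMTL implementation.

```python
# Illustrative sketch: a shared (frozen) backbone with multiple prediction heads
# whose outputs are ensembled at inference time.
import torch
import torch.nn as nn

class MEMTLSketch(nn.Module):
    def __init__(self, in_dim=16, hidden=64, out_dim=8, num_heads=4):
        super().__init__()
        # Shared backbone: kept invariant while the heads are trained.
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        # Multiple lightweight prediction heads (offloading decision + resources).
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(num_heads))

    def forward(self, x):
        z = self.backbone(x)
        preds = torch.stack([head(z) for head in self.heads], dim=0)
        return preds.mean(dim=0)          # simple ensemble of the head outputs

model = MEMTLSketch()
for p in model.backbone.parameters():     # backbone frozen during head training
    p.requires_grad = False
print(model(torch.randn(3, 16)).shape)    # torch.Size([3, 8])
```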
Discovering Predictive Relational Object Symbols with Symbolic Attentive Layers
results: Experiments show that, in a simulated tabletop environment, the model predicts action effects more accurately while simultaneously discovering both object symbols and relational symbols. Analysis shows that the learned symbols relate to the relative positions of objects, object types, and their horizontal alignment on the table.Abstract
In this paper, we propose and realize a new deep learning architecture for discovering symbolic representations for objects and their relations based on the self-supervised continuous interaction of a manipulator robot with multiple objects in a tabletop environment. The key feature of the model is that it can handle a changing number of objects naturally and map the object-object relations into the symbolic domain explicitly. In the model, we employ a self-attention layer that computes discrete attention weights from object features, which are treated as relational symbols between objects. These relational symbols are then used to aggregate the learned object symbols and predict the effects of executed actions on each object. The result is a pipeline that allows the formation of object symbols and relational symbols from a dataset of object features, actions, and effects in an end-to-end manner. We compare the performance of our proposed architecture with state-of-the-art symbol discovery methods in a simulated tabletop environment where the robot needs to discover symbols related to the relative positions of objects to predict the observed effect successfully. Our experiments show that the proposed architecture performs better than other baselines in effect prediction while forming not only object symbols but also relational symbols. Furthermore, we analyze the learned symbols and relational patterns between objects to learn about how the model interprets the environment. Our analysis shows that the learned symbols relate to the relative positions of objects, object types, and their horizontal alignment on the table, which reflect the regularities in the environment.
摘要
在这篇论文中,我们提出了一种新的深度学习架构,用于从自适应的 kontinuous 互动中找到对象和它们之间的关系的符号表示。这个模型的关键特点是可以自然地处理变化数量的对象,并将对象之间的关系Explicitly map到符号领域中。在模型中,我们使用了一层自注意力层,从对象特征中计算出精确的注意力权重,这些注意力权重被视为对象之间的关系符号。这些关系符号然后用于聚合学习的对象符号和预测对象上执行的效果。这个管道可以从对象特征、动作和效果的数据集中形成对象符号和关系符号,并在端到端的方式下进行结构化的符号探索。我们与其他基eline进行比较,并在模拟的桌面环境中证明了我们提出的架构的性能比其他基eline更好,可以成功预测对象之间的关系和效果。此外,我们分析了模型中学习的符号和对象之间的关系,发现符号与对象的相对位置、对象类型和桌面上的水平对齐有关,这些符号与环境中的常见性相符。
Tight Bounds for Machine Unlearning via Differential Privacy
results: The authors close the gap between upper and lower bounds on the deletion capacity of DP-based machine unlearning algorithms, obtaining tight bounds on the deletion capacity achievable by these algorithms.Abstract
We consider the formulation of "machine unlearning" of Sekhari, Acharya, Kamath, and Suresh (NeurIPS 2021), which formalizes the so-called "right to be forgotten" by requiring that a trained model, upon request, should be able to "unlearn" a number of points from the training data, as if they had never been included in the first place. Sekhari et al. established some positive and negative results about the number of data points that can be successfully unlearnt by a trained model without impacting the model's accuracy (the "deletion capacity"), showing that machine unlearning could be achieved by using differentially private (DP) algorithms. However, their results left open a gap between upper and lower bounds on the deletion capacity of these algorithms: our work fully closes this gap, obtaining tight bounds on the deletion capacity achievable by DP-based machine unlearning algorithms.
摘要
我团队考虑了Sekhari等人(NeurIPS 2021)所提出的机器“忘记” formalization,即要求已经训练过的模型,在请求时,能够“忘记”一些训练数据点,如果这些点从来没有被包含在模型中。Sekhari等人确立了一些积极和消极结果,表明可以通过使用匿名隐私(DP)算法实现机器忘记。然而,他们的结果留下了一个DP算法的删除容量(deletion capacity)的上下限之间的差距:我们的工作完全关闭了这个差距,得到了DP基于的机器忘记算法的精确 deletion capacity 上限。
Towards Certified Probabilistic Robustness with High Accuracy
paper_authors: Ruihan Zhang, Peixin Zhang, Jun Sun
for: This paper aims to build certifiably robust yet accurate neural network models, which is an open problem in the field of adversarial examples.
methods: The proposed approach consists of two parts: a probabilistic robust training method that minimizes variance in terms of divergence, and a runtime inference method for certified probabilistic robustness of the prediction.
results: The proposed approach significantly outperforms existing approaches in terms of both certification rate and accuracy, and is reasonably efficient. The approach works for a variety of perturbations and is applicable to multiple models trained on different datasets.Abstract
Adversarial examples pose a security threat to many critical systems built on neural networks (such as face recognition systems, and self-driving cars). While many methods have been proposed to build robust models, how to build certifiably robust yet accurate neural network models remains an open problem. For example, adversarial training improves empirical robustness, but they do not provide certification of the model's robustness. On the other hand, certified training provides certified robustness but at the cost of a significant accuracy drop. In this work, we propose a novel approach that aims to achieve both high accuracy and certified probabilistic robustness. Our method has two parts, i.e., a probabilistic robust training method with an additional goal of minimizing variance in terms of divergence and a runtime inference method for certified probabilistic robustness of the prediction. The latter enables efficient certification of the model's probabilistic robustness at runtime with statistical guarantees. This is supported by our training objective, which minimizes the variance of the model's predictions in a given vicinity, derived from a general definition of model robustness. Our approach works for a variety of perturbations and is reasonably efficient. Our experiments on multiple models trained on different datasets demonstrate that our approach significantly outperforms existing approaches in terms of both certification rate and accuracy.
摘要
遭遇攻击性示例对许多基于神经网络的重要系统(如识别面部系统和自动驾驶车)的安全性带来了威胁。许多方法已经被提出来建立坚固的模型,但是如何建立认证可靠且精确的神经网络模型仍然是一个开启的问题。例如,敌对训练可以提高了实际的抗衡能力,但它们不会提供模型的认证 robustness。另一方面,认证训练则可以提供认证的 robustness,但是它们会导致模型的精确度下降。在这个工作中,我们提出了一个新的方法,旨在实现高精确度和认证可靠的神经网络模型。我们的方法有两部分:一个是一种概率 robust 的训练方法,另一个是一种runtime inference方法,用于认证模型的概率 robustness。这个方法可以实现在不同类型的攻击下,且是相对高效的。我们在多个模型和不同的数据集上进行了实验,结果显示,我们的方法在认证率和精确度两方面都大大超过了现有的方法。
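A hedged sketch of the two ingredients described above — a variance penalty over a perturbation vicinity during training, and a Monte-Carlo runtime check with a simple Hoeffding lower bound — is given below; the bound, thresholds, sample counts and toy model are our illustrative choices, not the paper's certification procedure.

```python
# Illustrative sketch only: variance-regularized training loss and a
# Monte-Carlo probabilistic-robustness check with a Hoeffding lower bound.
import math
import torch
import torch.nn.functional as F

def variance_regularized_loss(model, x, y, eps=0.03, n_samples=8, lam=1.0):
    loss = F.cross_entropy(model(x), y)
    # Predictions over random points in the eps-ball around x.
    noisy = [model(x + eps * (2 * torch.rand_like(x) - 1)) for _ in range(n_samples)]
    probs = torch.stack([F.softmax(l, dim=-1) for l in noisy], dim=0)
    penalty = probs.var(dim=0).sum(dim=-1).mean()      # variance across the vicinity
    return loss + lam * penalty

def certify_probabilistic_robustness(model, x, label, eps=0.03, n=1000, delta=1e-3, threshold=0.9):
    """True if P(correct under random perturbation) >= threshold with confidence 1 - delta."""
    with torch.no_grad():
        correct = 0
        for _ in range(n):
            pred = model(x + eps * (2 * torch.rand_like(x) - 1)).argmax(dim=-1)
            correct += int(pred.item() == label)
    lower_bound = correct / n - math.sqrt(math.log(1 / delta) / (2 * n))  # Hoeffding bound
    return lower_bound >= threshold

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.randn(1, 1, 28, 28), torch.tensor([3])
print(variance_regularized_loss(model, x, y))
print(certify_probabilistic_robustness(model, x, label=3))
```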
Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning
results: The method achieves an F1-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge.Abstract
Deep learning has been widely used recently for sound event detection and classification. Its success is linked to the availability of sufficiently large datasets, possibly with corresponding annotations when supervised learning is considered. In bioacoustic applications, most tasks come with few labelled training data, because annotating long recordings is time consuming and costly. Therefore supervised learning is not the best suited approach to solve bioacoustic tasks. The bioacoustic community recasted the problem of sound event detection within the framework of few-shot learning, i.e. training a system with only few labeled examples. The few-shot bioacoustic sound event detection task in the DCASE challenge focuses on detecting events in long audio recordings given only five annotated examples for each class of interest. In this paper, we show that learning a rich feature extractor from scratch can be achieved by leveraging data augmentation using a supervised contrastive learning framework. We highlight the ability of this framework to transfer well for five-shot event detection on previously unseen classes in the training data. We obtain an F-score of 63.46\% on the validation set and 42.7\% on the test set, ranking second in the DCASE challenge. We provide an ablation study for the critical choices of data augmentation techniques as well as for the learning strategy applied on the training set.
摘要
现代深度学习技术在声音事件检测和分类方面得到了广泛应用。其成功与具有足够大的数据集,可能带有相应的注释时supervised learning是考虑的。在生物声学应用中,大多数任务都有少量标注的训练数据,因为注释长录音是时间consuming和costly。因此,supervised learning不是解决生物声学任务的最佳方法。生物声学社区将声音事件检测问题重新定义为few-shot learning问题,即使用只有几个标注的示例来训练系统。DCASE挑战中的声音事件检测五个难题中的few-shot bioacoustic sound event detection task是检测长录音中的事件,只需五个标注示例。在这篇论文中,我们表明了可以通过利用数据增强和supervised contrastive learning框架来学习rich feature extractor从scratch。我们指出了这种框架的可轻 Transfer Learning,能够在未看过的类型上进行五个shot事件检测。我们在验证集上取得了63.46%的F-score和42.7%的测试集F-score,在DCASE挑战中排名第二。我们还提供了关键的数据增强技术和训练集上的学习策略的ablation study。
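A minimal sketch of a supervised contrastive objective of the kind used for pretraining here is shown below (a simplified SupCon-style loss, not the authors' training code); the temperature, batch construction and embedding shapes are assumptions.

```python
# Simplified supervised contrastive loss: positives share a class label,
# the denominator runs over all other samples in the batch.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """embeddings: (N, D) features; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau                                   # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # log-softmax over all other samples (anchor excluded from the denominator)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of positives per anchor, then mean over anchors
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts)
    return loss[pos_mask.any(dim=1)].mean()

emb = torch.randn(8, 128, requires_grad=True)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(emb, lab))
```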
Tutorial: a priori estimation of sample size, effect size, and statistical power for cluster analysis, latent class analysis, and multivariate mixture models
for: The paper is written for researchers who want to determine the sample size and effect size for analyses that identify subgroups.
methods: The paper provides a roadmap for determining sample size and effect size using a procedure that formalizes expectations about effect sizes in a specific domain, and establishes the minimum sample size for subgroup analyses using simulations.
results: The paper provides a reference table for the most popular subgroup analyses, including k-means, Ward agglomerative hierarchical clustering, c-means fuzzy clustering, latent class analysis, latent profile analysis, and Gaussian mixture modeling. The table shows the minimum numbers of observations per expected subgroup and features to achieve acceptable statistical power.Abstract
Before embarking on data collection, researchers typically compute how many individual observations they should collect. This is vital for conducting studies with sufficient statistical power, and is often a cornerstone of study pre-registrations and grant applications. For traditional statistical tests, one would typically determine an acceptable level of statistical power, (gu)estimate effect size, and then use both values to compute the required sample size. However, for analyses that identify subgroups, statistical power is harder to establish. Once sample size reaches a sufficient threshold, effect size is primarily determined by the number of measured features and the underlying subgroup separation. As a consequence, a priori computations of statistical power are notoriously complex. In this tutorial, I will provide a roadmap to determining sample size and effect size for analyses that identify subgroups. First, I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice, and use this to compute the minimally required number of measured variables. Next, I outline how to establish the minimum sample size in subgroup analyses. Finally, I use simulations to provide a reference table for the most popular subgroup analyses: k-means, Ward agglomerative hierarchical clustering, c-means fuzzy clustering, latent class analysis, latent profile analysis, and Gaussian mixture modelling. The table shows the minimum numbers of observations per expected subgroup (sample size) and features (measured variables) to achieve acceptable statistical power, and can be readily used in study design.
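The simulation recipe can be illustrated for k-means: generate data with a known subgroup separation, cluster repeatedly, and report how often the subgroup structure is recovered. The sketch below is our toy version with arbitrary separation, sample sizes and recovery thresholds, not the tutorial's reference-table procedure.

```python
# Simulation-based "power" estimate for subgroup recovery with k-means
# (illustrative assumptions throughout).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def recovery_rate(n_per_group=50, n_features=4, separation=1.0, k=2,
                  n_sims=200, ari_ok=0.8, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # Two subgroups whose means differ by `separation` on every feature.
        means = np.stack([np.zeros(n_features), np.full(n_features, separation)])
        labels = np.repeat([0, 1], n_per_group)
        X = means[labels] + rng.standard_normal((2 * n_per_group, n_features))
        pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        hits += adjusted_rand_score(labels, pred) >= ari_ok
    return hits / n_sims

# "Power" here = probability of recovering the subgroup structure.
for n in (20, 50, 100):
    print(n, recovery_rate(n_per_group=n))
```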
DoRA: Domain-Based Self-Supervised Learning Framework for Low-Resource Real Estate Appraisal
results: Tested on real-world transaction data, the model significantly outperforms tabular SSL baselines, graph-based methods, and supervised approaches in few-shot scenarios.Abstract
The marketplace system connecting demands and supplies has been explored to develop unbiased decision-making in valuing properties. Real estate appraisal serves as one of the high-cost property valuation tasks for financial institutions since it requires domain experts to appraise the estimation based on the corresponding knowledge and the judgment of the market. Existing automated valuation models reducing the subjectivity of domain experts require a large number of transactions for effective evaluation, which is predominantly limited to not only the labeling efforts of transactions but also the generalizability of new developing and rural areas. To learn representations from unlabeled real estate sets, existing self-supervised learning (SSL) for tabular data neglects various important features, and fails to incorporate domain knowledge. In this paper, we propose DoRA, a Domain-based self-supervised learning framework for low-resource Real estate Appraisal. DoRA is pre-trained with an intra-sample geographic prediction as the pretext task based on the metadata of the real estate for equipping the real estate representations with prior domain knowledge. Furthermore, inter-sample contrastive learning is employed to generalize the representations to be robust for limited transactions of downstream tasks. Our benchmark results on three property types of real-world transactions show that DoRA significantly outperforms the SSL baselines for tabular data, the graph-based methods, and the supervised approaches in the few-shot scenarios by at least 7.6% for MAPE, 11.59% for MAE, and 3.34% for HR10%. We expect DoRA to be useful to other financial practitioners with similar marketplace applications who need general models for properties that are newly built and have limited records. The source code is available at https://github.com/wwweiwei/DoRA.
摘要
marketplace系统连接需求和供应,以发展不偏袋折的决策方法。房地产评估作为高成本房产评估任务,需要域专家根据相关知识和市场判断来进行估价。现有的自动评估模型可以减少域专家的主观性,但它们需要大量的交易数据进行有效评估,这是限制了不仅标注努力,还限制了新规划和农村地区的普适性。为了学习不标注的房地产集合中的表示,现有的自动学习(SSL)技术对于表格数据 neglects 多种重要特征,并且无法包含域知识。在这篇论文中,我们提出了DoRA,一种基于域的自动学习框架,用于低资源房地产评估。DoRA通过 metadata 中的地理预测任务进行预训练,以具备房地产表示的先验知识。此外,我们还使用了交叉样本学习来使表示扩展到有限交易下的稳定性。我们对实际交易中的三种不同类型的财产进行了测试,结果显示DoRA在几个shot scenario下明显超过了SSL基线、图表基eline和批处理方法的性能,提高了MAPЭ、MAE和HR10的性能。我们预计DoRA将对其他金融实践人员有用,他们需要面临新建和有限记录的财产评估模型。代码可以在 获取。
A Unifying Variational Framework for Gaussian Process Motion Planning
paper_authors: Lucas Cosier, Rares Iordan, Sicelukwanda Zwane, Giovanni Franzese, James T. Wilson, Marc Peter Deisenroth, Alexander Terenin, Yasemin Bekiroglu
results: Experimental results show that, compared with baseline methods, the proposed approach achieves a better balance between success rates and motion-planning quality.Abstract
To control how a robot moves, motion planning algorithms must compute paths in high-dimensional state spaces while accounting for physical constraints related to motors and joints, generating smooth and stable motions, avoiding obstacles, and preventing collisions. A motion planning algorithm must therefore balance competing demands, and should ideally incorporate uncertainty to handle noise, model errors, and facilitate deployment in complex environments. To address these issues, we introduce a framework for robot motion planning based on variational Gaussian Processes, which unifies and generalizes various probabilistic-inference-based motion planning algorithms. Our framework provides a principled and flexible way to incorporate equality-based, inequality-based, and soft motion-planning constraints during end-to-end training, is straightforward to implement, and provides both interval-based and Monte-Carlo-based uncertainty estimates. We conduct experiments using different environments and robots, comparing against baseline approaches based on the feasibility of the planned paths, and obstacle avoidance quality. Results show that our proposed approach yields a good balance between success rates and path quality.
摘要
要控制 robot 的移动,动作规划算法需要计算高维状态空间中的路径,同时考虑到机械制约和 JOINTS 的物理约束,生成平滑和稳定的动作,避免障碍物和冲突。一个动作规划算法应该平衡竞合的需求,并应该包含不确定性,以处理噪声、模型错误和复杂环境中的部署。为解决这些问题,我们介绍了基于Variational Gaussian Processes的机器人动作规划框架,这个框架统一和总结了各种基于概率推理的动作规划算法。我们的框架可以在终端训练中采用等价、不等价和软动作规划约束,并提供了间隔型和Monte Carlo 类型的不确定性估计。我们在不同的环境和机器人上进行了实验,与基线方法进行比较,评价计划路径的可行性和避免障碍质量。结果表明,我们的提议方法可以获得良好的平衡,同时保证动作质量。
Autonomous Soft Tissue Retraction Using Demonstration-Guided Reinforcement Learning
results: This study demonstrates a proof-of-concept for autonomous surgical soft tissue retraction and establishes the feasibility of the approach.Abstract
In the context of surgery, robots can provide substantial assistance by performing small, repetitive tasks such as suturing, needle exchange, and tissue retraction, thereby enabling surgeons to concentrate on more complex aspects of the procedure. However, existing surgical task learning mainly pertains to rigid body interactions, whereas the advancement towards more sophisticated surgical robots necessitates the manipulation of soft bodies. Previous work focused on tissue phantoms for soft tissue task learning, which can be expensive and can be an entry barrier to research. Simulation environments present a safe and efficient way to learn surgical tasks before their application to actual tissue. In this study, we create a Robot Operating System (ROS)-compatible physics simulation environment with support for both rigid and soft body interactions within surgical tasks. Furthermore, we investigate the soft tissue interactions facilitated by the patient-side manipulator of the DaVinci surgical robot. Leveraging the pybullet physics engine, we simulate kinematics and establish anchor points to guide the robotic arm when manipulating soft tissue. Using demonstration-guided reinforcement learning (RL) algorithms, we investigate their performance in comparison to traditional reinforcement learning algorithms. Our in silico trials demonstrate a proof-of-concept for autonomous surgical soft tissue retraction. The results corroborate the feasibility of learning soft body manipulation through the application of reinforcement learning agents. This work lays the foundation for future research into the development and refinement of surgical robots capable of managing both rigid and soft tissue interactions. Code is available at https://github.com/amritpal-001/tissue_retract.
摘要
在外科领域,机器人可以提供重要的协助,包括进行小、重复的任务,如缝合、针替换和组织吸引,以便外科医生能够更专注于更复杂的过程。然而,现有的外科任务学习主要关注坚体交互,而随着外科机器人的发展,需要涉及到软体的操作。之前的工作主要集中在假体中学习软组织任务,这可能会昂贵并成为研究入门障碍。在这种情况下,我们创建了ROS兼容的物理 simulate环境,并支持坚体和软体交互在外科任务中。此外,我们通过DaVinci外科机器人的病人侧把手 investigate软组织交互的可能性。通过pybullet物理引擎,我们模拟了机械学和确定了引导外科机器人的软组织 manipulate的anchor点。使用示例导引学习(RL)算法,我们研究其性能与传统RL算法相比。我们的室内实验结果表明,通过应用RL代理人,可以实现自主的外科软组织吸引。这些结果证明了在应用RL算法时,可以学习软体操作。这项工作为未来关于开发和改进外科机器人的研究提供了基础。代码可以在https://github.com/amritpal-001/tissue_retract中找到。
Approximating Fair $k$-Min-Sum-Radii in $\mathbb{R}^d$
paper_authors: Lukas Drexler, Annika Hennes, Abhiruk Lahiri, Melanie Schmidt, Julian Wargalla
for: The paper is focused on the $k$-min-sum-radii problem in the context of fair clustering.
methods: The paper proposes a PTAS (polynomial-time approximation scheme) for the fair $k$-min-sum-radii problem in Euclidean spaces of arbitrary dimension, with a constant number of clusters $k$.
results: The proposed algorithm is the first PTAS for the fair $k$-min-sum-radii problem, and it works for different notions of group fairness.Abstract
The $k$-center problem is a classical clustering problem in which one is asked to find a partitioning of a point set $P$ into $k$ clusters such that the maximum radius of any cluster is minimized. It is well-studied. But what if we add up the radii of the clusters instead of only considering the cluster with maximum radius? This natural variant is called the $k$-min-sum-radii problem. It has become the subject of more and more interest in recent years, inspiring the development of approximation algorithms for the $k$-min-sum-radii problem in its plain version as well as in constrained settings. We study the problem for Euclidean spaces $\mathbb{R}^d$ of arbitrary dimension but assume the number $k$ of clusters to be constant. In this case, a PTAS for the problem is known (see Bandyapadhyay, Lochet and Saurabh, SoCG, 2023). Our aim is to extend the knowledge base for $k$-min-sum-radii to the domain of fair clustering. We study several group fairness constraints, such as the one introduced by Chierichetti et al. (NeurIPS, 2017). In this model, input points have an additional attribute (e.g., colors such as red and blue), and clusters have to preserve the ratio between different attribute values (e.g., have the same fraction of red and blue points as the ground set). Different variants of this general idea have been studied in the literature. To the best of our knowledge, no approximative results for the fair $k$-min-sum-radii problem are known, despite the immense amount of work on the related fair $k$-center problem. We propose a PTAS for the fair $k$-min-sum-radii problem in Euclidean spaces of arbitrary dimension for the case of constant $k$. To the best of our knowledge, this is the first PTAS for the problem. It works for different notions of group fairness.
摘要
“$k$-中心问题”是一个热门的聚集问题,要找到一个分 partitioning 方案,使得该集合中的各个对象的最大半径 minimized。这个问题已经很受欢迎,但如果我们总和所有对象的半径而不是仅考虑最大半径,这就是“$k$-min-sum-radii”问题。这个问题在最近的几年中已经引起了越来越多的关注,并且开发了访问算法。我们在 $\mathbb{R}^d$ 的任意维度上研究这个问题,并假设对象的数量是常数的。在这种情况下,我们知道PTAS的存在(见 Bandyapadhyay、Lochet 和 Saurabh, SoCG, 2023)。我们的目标是将这个知识库扩展到公平聚集领域。我们研究了许多公平聚集约束,例如 Chierichetti 等人(NeurIPS, 2017)提出的一种模型,在这个模型中,输入对象有一个额外的特征(例如颜色,如红色和蓝色),并且要求各个集合保持不同特征值的比例(例如输入集合中的红色和蓝色对象的比例和输入集合中的红色和蓝色对象的比例一样)。不同的这种一般的想法已经在文献中被研究。我们提出了一个PTAS для公平的 $k$-min-sum-radii 问题。这是我们知道的第一个PTAS。它适用于不同的公平性观念。
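For readers new to the objective, the helper below (toy data, our own illustration) computes the k-min-sum-radii cost of a given assignment: the sum over clusters of each cluster's radius, in contrast to the k-center cost, which takes the maximum radius.

```python
# Toy illustration of the k-min-sum-radii objective versus k-center.
import numpy as np

def sum_of_radii(points, centers, assignment):
    cost = 0.0
    for c in range(len(centers)):
        members = points[assignment == c]
        if len(members):
            cost += np.linalg.norm(members - centers[c], axis=1).max()
    return cost

pts = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]], dtype=float)
ctrs = np.array([[0.3, 0.3], [10.5, 10.0]])
asg = np.array([0, 0, 0, 1, 1])
print(sum_of_radii(pts, ctrs, asg))   # k-center would instead take the max radius
```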
Trustworthiness-Driven Graph Convolutional Networks for Signed Network Embedding
paper_authors: Min-Jeong Kim, Yeon-Chang Lee, David Y. Kang, Sang-Wook Kim
for: This paper targets the problem of representing nodes in a signed network as low-dimensional vectors, and proposes a novel GCN-based approach named TrustSGCN to correct for incorrect embedding propagation.
results: Experimental results show that TrustSGCN consistently outperforms five state-of-the-art GCN-based SNE methods on four real-world signed network datasets.Abstract
The problem of representing nodes in a signed network as low-dimensional vectors, known as signed network embedding (SNE), has garnered considerable attention in recent years. While several SNE methods based on graph convolutional networks (GCN) have been proposed for this problem, we point out that they significantly rely on the assumption that the decades-old balance theory always holds in the real-world. To address this limitation, we propose a novel GCN-based SNE approach, named as TrustSGCN, which corrects for incorrect embedding propagation in GCN by utilizing the trustworthiness on edge signs for high-order relationships inferred by the balance theory. The proposed approach consists of three modules: (M1) generation of each node's extended ego-network; (M2) measurement of trustworthiness on edge signs; and (M3) trustworthiness-aware propagation of embeddings. Furthermore, TrustSGCN learns the node embeddings by leveraging two well-known societal theories, i.e., balance and status. The experiments on four real-world signed network datasets demonstrate that TrustSGCN consistently outperforms five state-of-the-art GCN-based SNE methods. The code is available at https://github.com/kmj0792/TrustSGCN.
摘要
“简洁网络图(Signed Network)内的节点表示为低维Vector的问题,称为简洁网络嵌入(SNE),在最近的几年中受到了很大的关注。然而,许多基于图像润满网络(GCN)的SNE方法已经被提出,但是它们假设了现今已经很长时间的平衡理论应用于实际中。为了解决这个限制,我们提出了一个新的GCN基于的SNE方法,名为信任GCN(TrustSGCN),它通过使用对端标签的信任性来修正GCN中的嵌入传播。提案的方法包括三个模组:(M1)每个节点的扩展EGO网络的生成;(M2)根据端标签的信任性量度;以及(M3)基于信任性的嵌入传播。此外,TrustSGCN利用了社会学中两个著名的理论,即平衡和社会地位,来学习节点的嵌入。实验结果显示,TrustSGCN在四个真实的简洁网络数据集上显著地超越了五个现有的GCN基于的SNE方法。代码可以在https://github.com/kmj0792/TrustSGCN中找到。”
Fairness Implications of Heterogeneous Treatment Effect Estimation with Machine Learning Methods in Policy-making
for: The paper is written for governments trying to make and implement policy using causal machine learning methods, and for researchers and practitioners working in this area.
methods: The paper discusses the use of AI Fairness methods to protect against unintended consequences in machine learning models, but argues that these methods are not suitable for all causal machine learning applications. Instead, the paper proposes a definition of fairness for indirect decision-making scenarios, where the causal machine learning model only has indirect power.
results: The paper argues that the complexity of causal machine learning models can make it difficult to achieve fairness in policy-making, and suggests that careful modelling and awareness of decision-making biases are necessary to address this challenge.Abstract
Causal machine learning methods which flexibly generate heterogeneous treatment effect estimates could be very useful tools for governments trying to make and implement policy. However, as the critical artificial intelligence literature has shown, governments must be very careful of unintended consequences when using machine learning models. One way to try and protect against unintended bad outcomes is with AI Fairness methods which seek to create machine learning models where sensitive variables like race or gender do not influence outcomes. In this paper we argue that standard AI Fairness approaches developed for predictive machine learning are not suitable for all causal machine learning applications because causal machine learning generally (at least so far) uses modelling to inform a human who is the ultimate decision-maker while AI Fairness approaches assume a model that is making decisions directly. We define these scenarios as indirect and direct decision-making respectively and suggest that policy-making is best seen as a joint decision where the causal machine learning model usually only has indirect power. We lay out a definition of fairness for this scenario - a model that provides the information a decision-maker needs to accurately make a value judgement about just policy outcomes - and argue that the complexity of causal machine learning models can make this difficult to achieve. The solution here is not traditional AI Fairness adjustments, but careful modelling and awareness of some of the decision-making biases that these methods might encourage which we describe.
摘要
政府可以使用可变性机器学习方法来生成不同类型的干预效果估计,这些方法可能是政府制定和实施政策的有用工具。然而,根据人工智能文献所示,政府应该非常小心不良后果,因为机器学习模型可能会导致不良后果。一种方法是使用 AI Fairness 方法来创建不受敏感变量(如种族或性别)影响的机器学习模型。在这篇论文中,我们 argue That标准 AI Fairness 方法不适用于所有 causal machine learning 应用程序,因为 causal machine learning 通常(至少是)使用模型来告诉人类决策者做出决定。我们称这些场景为 indirect 和 direct 决策making 分别,并认为政策制定是 indirect 决策和 machine learning 模型通常只有 indirect 力量的共同决策。我们提出了一种公平定义 - 一个模型可以提供决策者准确地判断正确的政策结果的信息 - 并 argue dass causal machine learning 模型的复杂性可能使这困难实现。在这里,不是传统的 AI Fairness 调整,而是仔细的模型和决策BIAS 的认识,我们描述。
results: The paper describes two cases: first, the case where the forward operator is known and used as a physics constraint; second, more general data-driven DL methods.Abstract
Machine Learning (ML) methods and tools have achieved great success in many data, signal, image and video processing tasks, such as classification, clustering, object detection, semantic segmentation, language processing, Human-Machine interface, etc. In computer vision, image and video processing, these methods are mainly based on Neural Networks (NN) and in particular Convolutional NN (CNN), and more generally Deep NN. Inverse problems arise wherever we have indirect measurements. Since such inverse problems are, in general, ill-posed, obtaining satisfactory solutions requires prior information. Different regularization methods have been proposed, where the problem becomes the optimization of a criterion with a likelihood term and a regularization term. The main difficulty in high-dimensional real-world applications, however, remains the computational cost. Using NNs, and in particular Deep Learning (DL) surrogate models and approximate computation, can be very helpful. In this work, we focus on NN and DL particularly adapted for inverse problems. We consider two cases: first, the case where the forward operator is known and used as a physics constraint; second, more general data-driven DL methods.
results: The study finds that the LightGBM model reaches an average accuracy of 81.62% when 60%-80% of the match time has elapsed, while the Logistic Regression and Gradient Boosting models perform better in the early stages of a match, with promising results.Abstract
This paper presents a study on the prediction of outcomes in matches of the electronic game League of Legends (LoL) using machine learning techniques. With the aim of exploring the ability to predict real-time results, considering different variables and stages of the match, we highlight the use of unpublished data as a fundamental part of this process. With the increasing popularity of LoL and the emergence of tournaments, betting related to the game has also emerged, making the investigation in this area even more relevant. A variety of models were evaluated and the results were encouraging. A model based on LightGBM showed the best performance, achieving an average accuracy of 81.62\% in intermediate stages of the match when the percentage of elapsed time was between 60\% and 80\%. On the other hand, the Logistic Regression and Gradient Boosting models proved to be more effective in early stages of the game, with promising results. This study contributes to the field of machine learning applied to electronic games, providing valuable insights into real-time prediction in League of Legends. The results obtained may be relevant for both players seeking to improve their strategies and the betting industry related to the game.
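For illustration only, the sketch below trains a LightGBM classifier on synthetic mid-game snapshots; the feature names and data are hypothetical placeholders, since the study's data are unpublished.

```python
# Hypothetical example of real-time match-outcome prediction with LightGBM.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "gold_diff": rng.normal(0, 3000, n),          # hypothetical features
    "kill_diff": rng.normal(0, 5, n),
    "tower_diff": rng.integers(-5, 6, n),
    "elapsed_pct": rng.uniform(0.6, 0.8, n),      # 60-80% of match time
})
y = (df["gold_diff"] + 500 * df["kill_diff"] + rng.normal(0, 2000, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=0)
model = LGBMClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```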
Diffusion Modeling with Domain-conditioned Prior Guidance for Accelerated MRI and qMRI Reconstruction
results: The method shows clear reductions in reconstruction degradation while preserving accuracy in multi-coil MRI and quantitative MRI reconstruction, particularly at high acceleration factors. In addition, it maintains high efficiency and accuracy across different anatomical structures.Abstract
This study introduces a novel approach for image reconstruction based on a diffusion model conditioned on the native data domain. Our method is applied to multi-coil MRI and quantitative MRI reconstruction, leveraging the domain-conditioned diffusion model within the frequency and parameter domains. The prior MRI physics are used as embeddings in the diffusion model, enforcing data consistency to guide the training and sampling process, characterizing MRI k-space encoding in MRI reconstruction, and leveraging MR signal modeling for qMRI reconstruction. Furthermore, a gradient descent optimization is incorporated into the diffusion steps, enhancing feature learning and improving denoising. The proposed method demonstrates a significant promise, particularly for reconstructing images at high acceleration factors. Notably, it maintains great reconstruction accuracy and efficiency for static and quantitative MRI reconstruction across diverse anatomical structures. Beyond its immediate applications, this method provides potential generalization capability, making it adaptable to inverse problems across various domains.
Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction
paper_authors: Alejandro Rodriguez Dominguez, Muhammad Shahzad, Xia Hong
for: This study aims to address multi-modal regression problems, in particular forecasting nonstationary processes or processes with a complex mixture of distributions.
methods: The study uses a Structured Radial Basis Function Network that ensembles multiple hypothesis predictors, and shows that this model can approximate the multiple-hypotheses target distribution.
results: The study finds that the model achieves strong generalization performance and computational efficiency using only two-layer neural networks as predictors, with diversity control as a key component; a loss-agnostic gradient-descent approach is also introduced. Experiments show the model outperforms the top competitors in the literature.Abstract
Multi-modal regression is important in forecasting nonstationary processes or processes with a complex mixture of distributions. It can be tackled with multiple hypotheses frameworks but with the difficulty of combining them efficiently in a learning model. A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems. The predictors are regression models of any type that can form centroidal Voronoi tessellations which are a function of their losses during training. It is proved that this structured model can efficiently interpolate this tessellation and approximate the multiple hypotheses target distribution and is equivalent to interpolating the meta-loss of the predictors, the loss being a zero set of the interpolation error. This model has a fixed-point iteration algorithm between the predictors and the centers of the basis functions. Diversity in learning can be controlled parametrically by truncating the tessellation formation with the losses of individual predictors. A closed-form solution with least-squares is presented, which, to the authors' knowledge, is the fastest solution in the literature for multiple hypotheses and structured predictions. Superior generalization performance and computational efficiency are achieved using only two-layer neural networks as predictors, with controlled diversity as a key component of success. A gradient-descent approach is introduced which is loss-agnostic regarding the predictors. The expected value for the loss of the structured model with Gaussian basis functions is computed, finding that correlation between predictors is not an appropriate tool for diversification. The experiments show outperformance with respect to the top competitors in the literature.
摘要
多Modal重要预测非站点过程或复杂的混合分布。它可以通过多个假设框架来解决,但是将其有效地结合到学习模型中是困难的。一种结构化圆拟函数网络是提出的一种多个假设预测器 ensemble for regression problems。这些预测器是任何类型的回归模型,可以形成中心 Voronoi 划分,这是在训练时的损失函数。 proved that this structured model can efficiently interpolate this tessellation and approximate the multiple hypotheses target distribution, and is equivalent to interpolating the meta-loss of the predictors, the loss being a zero set of the interpolation error. This model has a fixed-point iteration algorithm between the predictors and the centers of the basis functions. Diversity in learning can be controlled parametrically by truncating the tessellation formation with the losses of individual predictors. A closed-form solution with least-squares is presented, which to the authors' knowledge, is the fastest solution in the literature for multiple hypotheses and structured predictions. Using only two-layer neural networks as predictors, the model achieves superior generalization performance and computational efficiency, with diversity as a key component of success. A gradient-descent approach is introduced, which is loss-agnostic regarding the predictors. The expected value for the loss of the structured model with Gaussian basis functions is computed, finding that correlation between predictors is not an appropriate tool for diversification. Experimental results show outperformance with respect to the top competitors in the literature.
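As background for the closed-form least-squares fit mentioned above, the snippet below fits a plain Gaussian RBF layer by least squares on toy data; it does not reproduce the structured, multiple-hypotheses machinery of the paper, and the centers and bandwidth are arbitrary choices.

```python
# Generic Gaussian RBF layer fitted in closed form with least squares.
import numpy as np

def fit_rbf(X, y, centers, gamma=1.0):
    # Design matrix of Gaussian basis functions evaluated at the data.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-gamma * d2)
    # Closed-form least-squares weights.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, Phi

X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).standard_normal(200)
centers = np.linspace(-3, 3, 15)[:, None]
w, Phi = fit_rbf(X, y, centers)
print("train RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)))
```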
Non-Asymptotic Bounds for Adversarial Excess Risk under Misspecified Models
for: Evaluating the performance of robust estimators based on adversarial losses under misspecified models.
methods: Evaluation via distributional adversarial attacks and adversarial training.
results: A general evaluation approach is proposed, and non-asymptotic upper bounds are established for the adversarial excess risk associated with Lipschitz loss functions.Abstract
We propose a general approach to evaluating the performance of robust estimators based on adversarial losses under misspecified models. We first show that adversarial risk is equivalent to the risk induced by a distributional adversarial attack under certain smoothness conditions. This ensures that the adversarial training procedure is well-defined. To evaluate the generalization performance of the adversarial estimator, we study the adversarial excess risk. Our proposed analysis method includes investigations on both generalization error and approximation error. We then establish non-asymptotic upper bounds for the adversarial excess risk associated with Lipschitz loss functions. In addition, we apply our general results to adversarial training for classification and regression problems. For the quadratic loss in nonparametric regression, we show that the adversarial excess risk bound can be improved over those for a general loss.
摘要
我们提出一个通用的方法来评估预测器在不准确模型下的性能,基于敌对损失函数。我们首先显示出敌对损失相等于对于某些紧缩条件的分布型敌对攻击带来的损失。这确保了敌对训练程序的定义性。然后,我们研究了对于预测器的敌对剩余损失,包括预测器的整合误差和近似误差。我们then establish non-asymptotic upper bounds for the adversarial excess risk associated with Lipschitz loss functions. Finally, we apply our general results to adversarial training for classification and regression problems. For the quadratic loss in nonparametric regression, we show that the adversarial excess risk bound can be improved over those for a general loss.
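For reference, the quantities discussed above can be written generically as follows (our notation; the paper's definitions may differ in details such as the perturbation set):

$$R_{\mathrm{adv}}(f)=\mathbb{E}_{(X,Y)}\Big[\sup_{\|\delta\|\le\epsilon}\ell\big(f(X+\delta),Y\big)\Big],\qquad \mathcal{E}_{\mathrm{adv}}(\hat f)=R_{\mathrm{adv}}(\hat f)-\inf_{f\in\mathcal{F}}R_{\mathrm{adv}}(f),$$

where $\hat f$ is the adversarially trained estimator over a class $\mathcal{F}$; the non-asymptotic upper bounds concern $\mathcal{E}_{\mathrm{adv}}(\hat f)$ for Lipschitz losses $\ell$.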
Physics-informed machine learning of the correlation functions in bulk fluids
results: The machine learning models achieve high accuracy and efficiency in solving the Ornstein-Zernike equation, and show significant potential for applications in the thermodynamic theory of liquids.Abstract
The Ornstein-Zernike (OZ) equation is the fundamental equation for pair correlation function computations in the modern integral equation theory for liquids. In this work, machine learning models, notably physics-informed neural networks and physics-informed neural operator networks, are explored to solve the OZ equation. The physics-informed machine learning models demonstrate great accuracy and high efficiency in solving the forward and inverse OZ problems of various bulk fluids. The results highlight the significant potential of physics-informed machine learning for applications in thermodynamic state theory.
摘要
“欧兹方程”(Ornstein-Zernike equation)是现代流体 integral equation theory 中 Computational pair correlation function 的基本方程。在这项工作中,我们explore了机器学习模型,主要是physics-informed neural networks和physics-informed neural operator networks,来解决欧兹方程。这些physics-informed machine learning模型在解决前向和反向欧兹问题方面表现出了很高的准确率和高效性。结果表明physics-informed machine learning在热动力学状态理论中有广泛的应用前景。
results: The method was evaluated on extensive clinical datasets and achieved higher Dice scores than previous CNN-based and transformer-based models. In addition, the produced segmentation shapes more closely resemble human annotations and avoid common issues seen in other models, such as holes or fragments.Abstract
Cardiac Magnetic Resonance imaging (CMR) is the gold standard for assessing cardiac function. Segmenting the left ventricle (LV), right ventricle (RV), and LV myocardium (MYO) in CMR images is crucial but time-consuming. Deep learning-based segmentation methods have emerged as effective tools for automating this process. However, CMR images present additional challenges due to irregular and varying heart shapes, particularly in basal and apical slices. In this study, we propose a classifier-guided two-stage network with an all-slice fusion transformer to enhance CMR segmentation accuracy, particularly in basal and apical slices. Our method was evaluated on extensive clinical datasets and demonstrated better performance in terms of Dice score compared to previous CNN-based and transformer-based models. Moreover, our method produces visually appealing segmentation shapes resembling human annotations and avoids common issues like holes or fragments in other models' segmentations.
摘要
卡ди亚磁共振成像(CMR)是评估心脏功能的标准方法。在CMR图像中,正确分割左心室(LV)、右心室(RV)和心肌(MYO)是关键,但是却是耗时的。深度学习基于的分割方法在 automating 这个过程中表现出了有效的特点。然而,CMR图像又具有心形不规则和不同的心形特征,特别是在基层和脊梁slice中。在这项研究中,我们提议一种类型导向的两阶段网络,以增强CMR分割精度,特别是在基层和脊梁slice中。我们的方法在丰富的临床数据集上进行了评估,并与之前的CNN基于的和 transformer基于的模型相比,表现出更高的Dice分数。此外,我们的方法生成的分割形状与人工标注相似,并避免了其他模型中的常见问题,如孔洞或 Fragmentation。
Online Targetless Radar-Camera Extrinsic Calibration Based on the Common Features of Radar and Camera
results: Our experimental results show that the proposed method achieves accurate and robust radar-camera extrinsic calibration.Abstract
Sensor fusion is essential for autonomous driving and autonomous robots, and radar-camera fusion systems have gained popularity due to their complementary sensing capabilities. However, accurate calibration between these two sensors is crucial to ensure effective fusion and improve overall system performance. Calibration involves intrinsic and extrinsic calibration, with the latter being particularly important for achieving accurate sensor fusion. Unfortunately, many target-based calibration methods require complex operating procedures and well-designed experimental conditions, posing challenges for researchers attempting to reproduce the results. To address this issue, we introduce a novel approach that leverages deep learning to extract a common feature from raw radar data (i.e., Range-Doppler-Angle data) and camera images. Instead of explicitly representing these common features, our method implicitly utilizes these common features to match identical objects from both data sources. Specifically, the extracted common feature serves as an example to demonstrate an online targetless calibration method between the radar and camera systems. The estimation of the extrinsic transformation matrix is achieved through this feature-based approach. To enhance the accuracy and robustness of the calibration, we apply the RANSAC and Levenberg-Marquardt (LM) nonlinear optimization algorithm for deriving the matrix. Our experiments in the real world demonstrate the effectiveness and accuracy of our proposed method.
摘要
感知融合是自动驾驶和自动机器人的关键技术,而雷达-摄像头融合系统在过去几年中得到了广泛的应用。然而,为了确保有效的感知融合,需要进行精准的协调。协调包括内在协调和外在协调,其中外在协调对于实现准确的感知融合是非常重要的。然而,许多目标基于的协调方法需要复杂的操作程序和丰富的实验条件,这会对研究人员进行重现结果的困难。为解决这个问题,我们介绍了一种新的方法,利用深度学习提取雷达数据(即距离-Doppler-角度数据)和摄像头图像中的公共特征。而不是直接表示这些公共特征,我们的方法将这些公共特征直接地用于匹配雷达和摄像头系统中的同一个目标。具体来说,提取的公共特征可以作为一个在线无目标协调方法的示例,用于计算雷达和摄像头系统之间的外在协调矩阵。为了提高准确性和稳定性,我们在这个特征基础上应用了RANSAC和Levenberg-Marquardt(LM)非线性优化算法来得到矩阵。我们在实际世界中进行的实验表明了我们的提案的有效性和准确性。
paper_authors: Mingjun Ying, Dipankar Shakya, Hitesh Poddar, Theodore S. Rappaport
for: This paper aims to improve the power efficiency of cascaded communication systems by developing a new metric called the Waste Factor ($W$).
methods: The authors use a mathematical framework to evaluate power efficiency in cascaded communication systems, by accounting for power wasted in individual components along a cascade. They also use the consumption efficiency factor (CEF) to evaluate the effects of insertion loss and deployment density on power efficiency.
results: The authors show that the Waste Factor is a unifying metric for defining wasted power in a cascade, and that it can be used to compare power efficiency between data centers and their components. They also observe that the CEF is markedly sensitive to insertion loss changes, particularly in uplink transmissions, and that energy efficiency improves at 142 GHz compared to 28 GHz as UE and BS numbers increase.Abstract
In this paper, we expand upon a new metric called the Waste Factor ($W$), a mathematical framework used to evaluate power efficiency in cascaded communication systems, by accounting for power wasted in individual components along a cascade. We show that the derivation of the Waste Factor, a unifying metric for defining wasted power along the signal path of any cascade, is similar to the mathematical approach used by H. Friis in 1944 to develop the Noise Factor ($F$), which has since served as a unifying metric for quantifying additive noise power in a cascade. Furthermore, the mathematical formulation of $W$ can be utilized in artificial intelligence (AI) and machine learning (ML) design and control for enhanced power efficiency. We consider the power usage effectiveness (PUE), which is a widely used energy efficiency metric for data centers, to evaluate $W$ for the data center as a whole. The use of $W$ allows easy comparison of power efficiency between data centers and their components. Our study further explores how insertion loss of components in a cascaded communication system influences $W$ at 28 GHz and 142 GHz along with the data rate performance, evaluated using the consumption efficiency factor (CEF). We observe CEF's marked sensitivity, particularly to phase shifter insertion loss changes. Notably, CEF variations are more prominent in uplink transmissions, whereas downlink transmissions offer relative CEF stability. Our exploration also covers the effects of varying User Equipment (UE) and Base Station (BS) deployment density on CEF in cellular networks. This work underscores the enhanced energy efficiency at 142 GHz, compared to 28 GHz, as UE and BS numbers escalate.
摘要
在本文中,我们扩展了一个新的度量 called Waste Factor ($W$), 用于评估级联通信系统中的能效性,通过对各个组件的能量损耗进行考虑。我们表明了度量Waste Factor的 derivation, 是级联通信系统中能效性度量的一个统一metric,类似于1944年Friis提出的Noise Factor ($F$),该度量在级联通信系统中 quantifying 添加的噪声功率。此外,度量W的数学表述可以在人工智能(AI)和机器学习(ML)设计和控制中使用,以提高能效性。我们使用数据中心的能效性指标(PUE)来评估W,以便对数据中心和其组件进行简单的能效性比较。我们的研究还探讨了级联通信系统中组件插入损耗对W的影响,以及数据率性能,通过消耗效应因子(CEF)的变化来评估。我们发现CEF在28GHz和142GHz之间具有明显的敏感度,特别是在相位调制器插入损耗变化时。此外,我们还发现在无线电网络中 UE 和 BS 的分布密度变化对CEF的影响。这种研究表明了142GHz的能效性比28GHz更高,当 UE 和 BS 数量增加时。
A Sub-Terahertz Sliding Correlator Channel Sounder with Absolute Timing using Precision Time Protocol over Wi-Fi
paper_authors: Dipankar Shakya, Hitesh Poddar, Theodore S. Rappaport
for: This paper aims to achieve sub-nanosecond timing accuracy for multipath component (MPC) propagation delays in power delay profiles (PDPs) for 5G and 6G communications at mmWave and sub-THz frequencies.
methods: The proposed solution utilizes precision time protocol (PTP) and periodic drift correction to achieve absolute timing for MPCs in PDPs. It synchronizes the transmitter (TX) and receiver (RX) Rubidium clocks using two RaspberryPi computers over a dedicated Wi-Fi link, and applies a periodic drift correction algorithm to eliminate PDP sample drift.
results: The proposed solution achieves sub-nanosecond timing accuracy for MPC delays, reducing PDP sample drift to 150 samples/hour compared to several thousand samples/hour without synchronization. The solution shows promise in myriad applications, including precise position location and distributed systems that require sub-nanosecond timing accuracy and synchronization among components.Abstract
Radio channels at mmWave and sub-THz frequencies for 5G and 6G communications offer large channel bandwidths (hundreds of MHz to several GHz) to achieve multi-Gbps data rates. Accurate modeling of the radio channel for these wide bandwidths requires capturing the absolute timing of multipath component (MPC) propagation delays with sub-nanosecond accuracy. Achieving such timing accuracy is challenging due to clock drift in untethered transmitter (TX) and receiver (RX) clocks used in time-domain channel sounders, yet will become vital in many future 6G applications. This paper proposes a novel solution utilizing precision time protocol (PTP) and periodic drift correction to achieve absolute timing for MPCs in power delay profiles (PDPs) --captured as discrete samples using sliding correlation channel sounders. Two RaspberryPi computers are programmed to implement PTP over a dedicated Wi-Fi link and synchronize the TX and RX Rubidium clocks continuously every second. This synchronization minimizes clock drift, reducing PDP sample drift to 150 samples/hour, compared to several thousand samples/hour without synchronization. Additionally, a periodic drift correction algorithm is applied to eliminate PDP sample drift and achieve sub-nanosecond timing accuracy for MPC delays. The achieved synchronicity eliminates the need for tedious and sometimes inaccurate ray tracing to synthesize omnidirectional PDPs from directional measurements. The presented solution shows promise in myriad applications, including precise position location and distributed systems that require sub-nanosecond timing accuracy and synchronization among components.
Robust Joint Active-Passive Beamforming Design for IRS-Assisted ISAC Systems
paper_authors: Mahmoud AlaaEldin, Emad Alsusa, Karim G. Seddik, Christos Masouros, Iman Valiulahi
for: This work studies the integration of intelligent reflective surfaces (IRS) with integrated sensing and communication (ISAC) systems to alleviate spectrum congestion in future wireless networks.
methods: A low-complexity and efficient joint optimization of the transmit beamforming at the BS and the reflective beamforming at the IRS is proposed, minimizing the Frobenius distance between the covariance matrices of the transmitted signal and the desired radar beam pattern.
results: The results show that IRS-ISAC systems deliver improved communication and sensing performance under various conditions. A robust beamforming optimization algorithm is also proposed that adapts the design to the IRS channel uncertainties that may exist in practical systems.
Abstract
The idea of Integrated Sensing and Communication (ISAC) offers a promising solution to the problem of spectrum congestion in future wireless networks. This paper studies the integration of intelligent reflective surfaces (IRS) with ISAC systems to improve the performance of radar and communication services. Specifically, an IRS-assisted ISAC system is investigated where a multi-antenna base station (BS) performs multi-target detection and multi-user communication. A low complexity and efficient joint optimization of transmit beamforming at the BS and reflective beamforming at the IRS is proposed. This is done by jointly optimizing the BS beamformers and IRS reflection coefficients to minimize the Frobenius distance between the covariance matrices of the transmitted signal and the desired radar beam pattern. This optimization aims to satisfy the signal-to-interference-and-noise ratio (SINR) constraints of the communication users, the total transmit power limit at the BS, and the unit modulus constraints of the IRS reflection coefficients. To address the resulting complex non-convex optimization problem, an efficient alternating optimization (AO) algorithm combining fractional programming (FP), semi-definite programming (SDP), and second order cone programming (SOCP) methods is proposed. Furthermore, we propose robust beamforming optimization for IRS-ISAC systems by adapting the proposed optimization algorithm to the IRS channel uncertainties that may exist in practical systems. Using advanced tools from convex optimization theory, the constraints containing uncertainty are transformed to their equivalent linear matrix inequalities (LMIs) to account for the channels' uncertainty radius. The results presented quantify the benefits of IRS-ISAC systems under various conditions and demonstrate the effectiveness of the proposed algorithm.
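As a rough illustration of the quantities being optimized, the Python snippet below evaluates the Frobenius-distance objective between the realised transmit covariance and a desired radar covariance, together with the unit-modulus projection used for IRS coefficients. It is not the paper's FP/SDP/SOCP alternating optimization; the dimensions, matrices, and helper names are made up for the example.

```python
import numpy as np

def transmit_covariance(W):
    """Covariance of the transmit signal for a beamforming matrix W (N_t x K),
    assuming unit-power independent data streams: R_x = W W^H."""
    return W @ W.conj().T

def beampattern_mismatch(W, R_desired):
    """Frobenius distance between the realised and the desired radar covariance."""
    return np.linalg.norm(transmit_covariance(W) - R_desired, ord="fro")

def project_unit_modulus(theta):
    """Project IRS reflection coefficients onto the unit-modulus constraint |theta_m| = 1."""
    theta = np.asarray(theta, dtype=complex)
    projected = np.exp(1j * np.angle(theta))
    projected[theta == 0] = 1.0  # arbitrary phase for exactly-zero entries
    return projected

# Toy dimensions: 8 BS antennas, 4 communication users, 16 IRS elements.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
R_d = np.eye(8, dtype=complex)  # e.g. an isotropic desired beam pattern
theta = rng.standard_normal(16) + 1j * rng.standard_normal(16)
print("objective value:", beampattern_mismatch(W, R_d))
print("max |theta| after projection:", np.abs(project_unit_modulus(theta)).max())  # 1.0
```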
Consensus-based Distributed Variational Multi-object Tracker in Multi-Sensor Network
results: Experimental results show that the proposed distributed variational tracker achieves the same tracking accuracy as the centralised fusion tracker, and outperforms the suboptimal distributed strategy that fuses each sensor's posterior through arithmetic sensor fusion and an average consensus scheme.
Abstract
The growing need for accurate and reliable tracking systems has driven significant progress in sensor fusion and object tracking techniques. In this paper, we design two variational Bayesian trackers that effectively track multiple targets in cluttered environments within a sensor network. We first present a centralised sensor fusion scheme, which involves transmitting sensor data to a fusion center. Then, we develop a distributed version leveraging the average consensus algorithm, which is theoretically equivalent to the centralised sensor fusion tracker and requires only local message passing with neighbouring sensors. In addition, we empirically verify that our proposed distributed variational tracker performs on par with the centralised version with equal tracking accuracy. Simulation results show that our distributed multi-target tracker outperforms the suboptimal distributed sensor fusion strategy that fuses each sensor's posterior based on arithmetic sensor fusion and an average consensus strategy.
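The distributed tracker builds on the standard average consensus primitive, in which each sensor repeatedly averages its local statistic with those of its neighbours. Below is a minimal Python sketch of that primitive only, not of the variational tracker itself; the four-node ring network and the step size are arbitrary choices for the example.

```python
import numpy as np

def average_consensus(local_values, adjacency, step_size, num_iters=200):
    """Distributed averaging via local message passing.

    Each sensor repeatedly mixes its value with its neighbours' values and, for a
    connected graph with step_size below 1/max_degree, every node converges to
    the network-wide mean without any fusion center.

    local_values: (num_sensors, dim) array of local statistics (for the tracker,
        e.g. the natural parameters of each sensor's local variational posterior).
    adjacency: (num_sensors, num_sensors) symmetric 0/1 matrix with zero diagonal.
    """
    x = np.array(local_values, dtype=float)
    degrees = adjacency.sum(axis=1, keepdims=True)
    for _ in range(num_iters):
        # x_i <- x_i + eps * sum_j a_ij (x_j - x_i): only neighbour messages are used
        x = x + step_size * (adjacency @ x - degrees * x)
    return x

# Four sensors on a ring; every node ends up with the global average (3.0).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
vals = np.array([[1.0], [2.0], [3.0], [6.0]])
print(average_consensus(vals, A, step_size=0.3))
```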
Delay-Doppler Alignment Modulation for Spatially Sparse Massive MIMO Communication
results: The paper first shows that, by applying path-based zero-forcing (ZF) precoding and receive combining, DDAM transforms the original time-variant frequency-selective channels into time-invariant ISI-free channels, and it derives the necessary and/or sufficient conditions for this transformation. An asymptotic analysis then shows that when the number of base station antennas is much larger than the number of channel paths, DDAM achieves time-invariant ISI-free channels with only simple delay-Doppler compensation and path-based maximal-ratio transmission (MRT) beamforming. Furthermore, for the general DDAM design with some tolerable ISI, the path-based transmit precoding and receive combining matrices are optimized to maximize the spectral efficiency.
Abstract
Delay alignment modulation (DAM) is an emerging technique for achieving inter-symbol interference (ISI)-free wideband communications using spatial-delay processing, without relying on channel equalization or multi-carrier transmission. However, existing works on DAM only consider multiple-input single-output (MISO) communication systems and assume time-invariant channels. In this paper, by extending DAM to time-variant frequency-selective multiple-input multiple-output (MIMO) channels, we propose a novel technique termed delay-Doppler alignment modulation (DDAM). Specifically, by leveraging delay-Doppler compensation and path-based beamforming, the Doppler effect of each multi-path can be eliminated and all multi-path signal components may reach the receiver concurrently and constructively. We first show that by applying path-based zero-forcing (ZF) precoding and receive combining, DDAM can transform the original time-variant frequency-selective channels into time-invariant ISI-free channels. The necessary and/or sufficient conditions to achieve such a transformation are derived. Then an asymptotic analysis is provided by showing that when the number of base station (BS) antennas is much larger than that of channel paths, DDAM enables time-invariant ISI-free channels with the simple delay-Doppler compensation and path-based maximal-ratio transmission (MRT) beamforming. Furthermore, for the general DDAM design with some tolerable ISI, the path-based transmit precoding and receive combining matrices are optimized to maximize the spectral efficiency. Numerical results are provided to compare the proposed DDAM technique with various benchmarking schemes, including MIMO-orthogonal time frequency space (OTFS), MIMO-orthogonal frequency-division multiplexing (OFDM) without or with carrier frequency offset (CFO) compensation, and beam alignment along the dominant path.
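For intuition only, the following single-antenna Python sketch shows the per-path delay and Doppler pre-compensation idea: each path's branch is pre-delayed so all branches arrive with a common delay, and pre-rotated to cancel that path's Doppler shift. The actual DDAM scheme operates on MIMO channels with path-based ZF/MRT beamforming, which the uniform scalar branch gain here merely stands in for; all numbers are hypothetical.

```python
import numpy as np

def ddam_precompensate(symbols, path_delays, path_dopplers, sample_rate):
    """Per-path delay and Doppler pre-compensation (single-antenna toy version).

    Each path l, with delay n_l (in samples) and Doppler nu_l (in Hz), gets its
    own transmit branch: the symbol stream is pre-delayed by (n_max - n_l)
    samples so that every branch arrives with the common delay n_max, and it is
    pre-rotated by exp(-j 2 pi nu_l t) so the channel's Doppler shift on that
    path is cancelled at the receiver. In DDAM proper each branch is weighted
    by a path-based beamformer (ZF or MRT); here a uniform scalar gain stands in.
    """
    n_max = max(path_delays)
    total_len = len(symbols) + n_max
    t = np.arange(total_len) / sample_rate
    x = np.zeros(total_len, dtype=complex)
    for n_l, nu_l in zip(path_delays, path_dopplers):
        branch = np.zeros(total_len, dtype=complex)
        branch[n_max - n_l : n_max - n_l + len(symbols)] = symbols
        x += branch * np.exp(-2j * np.pi * nu_l * t) / len(path_delays)
    return x

# Two hypothetical paths: delays of 0 and 12 samples, Dopplers of +300 Hz and -150 Hz.
rng = np.random.default_rng(2)
s = (2 * rng.integers(0, 2, 64) - 1).astype(complex)  # BPSK symbols
x = ddam_precompensate(s, path_delays=[0, 12], path_dopplers=[300.0, -150.0], sample_rate=1e6)
```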
paper_authors: Étienne Labbé, Thomas Pellegrini, Julien Pinquier
for: The paper is written for the task of automated audio captioning (AAC), specifically using a ConvNeXt architecture as the audio encoder and exploring the use of task embeddings to improve performance across multiple datasets.
methods: The paper uses a ConvNeXt architecture as the audio encoder, which is adapted from the vision domain to audio classification. The model is trained on multiple AAC datasets (AC, CL, MACS, WavCaps) with a task embedding (TE) token to identify the source dataset for each input sample.
results: The paper achieves state-of-the-art scores on the AudioCaps (AC) dataset and competitive performance on Clotho (CL) with fewer parameters than existing models. The use of task embeddings improves cross-dataset performance, but there is still a performance gap between datasets, indicating the need for dataset-specific models. The resulting model, called CoNeTTE, achieves SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively.
Abstract
Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content, using encoder-decoder architectures. An audio encoder produces audio embeddings fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset due to its origin from AudioSet by investigating the impact of an unbiased encoder on performance. Using the well-known CNN14 from PANNs, for instance, as an unbiased encoder, we observed a 1.7% absolute reduction in SPIDEr score (where higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments by combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short compared to models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset for each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code available: https://github.com/Labbeti/conette-audio-captioning.
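To make the task-embedding idea concrete, here is a toy PyTorch decoder that prepends a learned dataset token to the caption tokens before decoding against the audio embeddings. It is an illustrative sketch, not the CoNeTTE code released in the linked repository; the class name, layer sizes, and dataset ids are all arbitrary assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoderWithTaskEmbedding(nn.Module):
    """Toy Transformer caption decoder that prepends a learned task-embedding (TE)
    token identifying the source dataset (e.g. 0=AudioCaps, 1=Clotho, 2=MACS, ...).
    All sizes are arbitrary; this is not the released CoNeTTE implementation."""

    def __init__(self, vocab_size, d_model=256, num_datasets=4, num_layers=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.task_emb = nn.Embedding(num_datasets, d_model)  # one TE token per dataset
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, dataset_ids, audio_embeddings):
        # token_ids: (B, T) caption tokens; dataset_ids: (B,); audio_embeddings: (B, S, d_model)
        te = self.task_emb(dataset_ids).unsqueeze(1)              # (B, 1, d_model)
        tgt = torch.cat([te, self.word_emb(token_ids)], dim=1)    # TE token goes first
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory=audio_embeddings, tgt_mask=causal)
        return self.out(hidden[:, 1:])                            # logits for the caption positions

# Usage: a batch with one AudioCaps sample (id 0) and one Clotho sample (id 1).
model = CaptionDecoderWithTaskEmbedding(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 12)), torch.tensor([0, 1]), torch.randn(2, 31, 256))
```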
Remixing-based Unsupervised Source Separation from Scratch
results: Experimental results show that the proposed approach outperforms mixture invariant training, the only existing approach for training a monaural separation model from scratch. In addition, a simple remixing method is introduced to stabilize training.
Abstract
We propose an unsupervised approach for training separation models from scratch using RemixIT and Self-Remixing, which are recently proposed self-supervised learning methods for refining pre-trained models. They first separate mixtures with a teacher model and create pseudo-mixtures by shuffling and remixing the separated signals. A student model is then trained to separate the pseudo-mixtures using either the teacher's outputs or the initial mixtures as supervision. To refine the teacher's outputs, the teacher's weights are updated with the student's weights. While these methods originally assumed that the teacher is pre-trained, we show that they are capable of training models from scratch. We also introduce a simple remixing method to stabilize training. Experimental results demonstrate that the proposed approach outperforms mixture invariant training, which is currently the only available approach for training a monaural separation model from scratch.
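A minimal PyTorch sketch of the shuffle-and-remix step described above follows. It covers only the creation of pseudo-mixtures from the teacher's separated estimates, not the full RemixIT/Self-Remixing training loop; the function name is made up, and the exponential-moving-average teacher update mentioned in the closing comment is one common choice rather than a detail taken from this paper.

```python
import torch

def make_pseudo_mixtures(teacher_estimates):
    """Shuffle-and-remix step in RemixIT / Self-Remixing style training.

    teacher_estimates: tensor of shape (batch, num_sources, num_samples) holding
    the teacher's separated signals for a batch of input mixtures. Each source
    slot is permuted independently across the batch, so every pseudo-mixture
    combines sources that originally came from different mixtures. Returns the
    pseudo-mixtures and the permuted sources used as the student's targets.
    """
    batch, num_sources, _ = teacher_estimates.shape
    permuted = torch.empty_like(teacher_estimates)
    for s in range(num_sources):
        perm = torch.randperm(batch)
        permuted[:, s] = teacher_estimates[perm, s]
    pseudo_mixtures = permuted.sum(dim=1)
    return pseudo_mixtures, permuted

# Example: a batch of 8 two-source teacher estimates, 16000 samples each.
estimates = torch.randn(8, 2, 16000)
pseudo_mix, targets = make_pseudo_mixtures(estimates)
# The student is trained to recover `targets` (or, in Self-Remixing, the original
# mixtures) from `pseudo_mix`; the teacher is then refreshed from the student's
# weights, e.g. by an exponential moving average.
```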