paper_authors: Adarsha M, S. Malathi, Santosh Kumar
for: This study investigates the performance of a fixed grid network under various modulation formats to estimate the system's capacity and spectral efficiency.
methods: The study uses the optical In-phase Quadrature Modulator structure to build the fixed grid network modulation, with homodyne detection at the receiver; data multiplexing is accomplished with polarization division multiplexing.
results: Under these conditions, data rates of 100 Gbps, 150 Gbps, and 200 Gbps are transmitted with various modulation formats. Using modern digital signal processing, spectral efficiencies of 2, 3, and 4 bits/s/Hz are achieved for PM-QPSK, PM-8-QAM, and PM-16-QAM, respectively, with system capacities of 8-9, 12-13.5, and 16-18 Tbps over transmission distances of 3000, 1300, and 700 km at a bit error rate of at most 2×10^-3.
Abstract
In this article, the performance of a fixed grid network is examined for various modulation formats to estimate the system's capacity and spectral efficiency. The optical In-phase Quadrature Modulator structure is used to build the fixed grid network modulation, and the homodyne detection approach is used at the receiver. Data multiplexing is accomplished using Polarization Division Multiplexing technology. 100 Gbps, 150 Gbps, and 200 Gbps data rates are transmitted under these circumstances utilizing various modulation formats. Various pre-processing and signal recovery steps are explained using modern digital signal processing systems. The achieved spectral efficiencies for PM-QPSK, PM-8-QAM, and PM-16-QAM were 2, 3, and 4 bits/s/Hz, respectively. PM-QPSK, PM-8-QAM, and PM-16-QAM achieve system capacities of 8-9, 12-13.5, and 16-18 Tbps and reach transmission distances of 3000, 1300, and 700 kilometers, respectively, with an acceptable bit error rate of at most 2×10^-3. Peak optical power for received signal detection and the full width at half maximum are noted for the different modulations under a fixed grid network.
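The reported figures are internally consistent if one assumes the standard 50 GHz fixed-grid channel spacing and roughly 4-4.5 THz of usable C-band, neither of which is quoted explicitly above; a quick sanity check in Python:

```python
# Hedged sanity check: the 50 GHz grid spacing and ~4-4.5 THz usable
# C-band are assumptions, not values quoted in the abstract.
grid_spacing_ghz = 50.0
band_thz = (4.0, 4.5)

for rate_gbps in (100, 150, 200):
    se = rate_gbps / grid_spacing_ghz        # spectral efficiency, bits/s/Hz
    lo, hi = (se * b for b in band_thz)      # total capacity, Tbps
    print(f"{rate_gbps} Gbps -> SE = {se:g} b/s/Hz, capacity ~ {lo:g}-{hi:g} Tbps")
```

This reproduces the 2/3/4 bits/s/Hz and 8-9/12-13.5/16-18 Tbps figures for PM-QPSK, PM-8-QAM, and PM-16-QAM.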
Enhancing Secrecy in UAV RSMA Networks: Deep Unfolding Meets Deep Reinforcement Learning
paper_authors: Abuzar B. M. Adam, Mohammed A. M. Elhassan
for: This paper studies the maximization of the secrecy rate in multi-UAV rate-splitting multiple access (RSMA) networks.
methods: A joint beamforming, rate allocation, and UAV trajectory optimization problem is formulated, which is nonconvex; it is therefore transformed into a Markov decision problem and solved with a novel multiagent deep reinforcement learning (DRL) framework named DUN-DRL.
results: DUN-DRL shows strong performance, outperforming other DRL-based methods in the literature and achieving higher secrecy rates.
Abstract
In this paper, we consider the maximization of the secrecy rate in a multiple unmanned aerial vehicle (UAV) rate-splitting multiple access (RSMA) network. A joint beamforming, rate allocation, and UAV trajectory optimization problem is formulated, which is nonconvex. Hence, the problem is transformed into a Markov decision problem and a novel multiagent deep reinforcement learning (DRL) framework is designed. The proposed framework (named DUN-DRL) combines deep unfolding to design the beamforming and rate allocation, a data-driven approach to design the UAV trajectory, and deep deterministic policy gradient (DDPG) for the learning procedure. The proposed DUN-DRL has shown great performance and outperformed other DRL-based methods in the literature.
RIS-aided Near-Field MIMO Communications: Codebook and Beam Training Design
results: Comparative studies show that the proposed beam training schemes achieve near-optimal performance while significantly reducing the training overhead, and that including distance information in beam training, rather than angular information alone, noticeably improves the achievable rate.
Abstract
Downlink reconfigurable intelligent surface (RIS)-assisted multi-input-multi-output (MIMO) systems are considered with far-field, near-field, and hybrid-far-near-field channels. According to the angular or distance information contained in the received signals, 1) a distance-based codebook is designed for near-field MIMO channels, based on which a hierarchical beam training scheme is proposed to reduce the training overhead; 2) a combined angular-distance codebook is designed for mixed-far-near-field MIMO channels, based on which a two-stage beam training scheme is proposed to achieve alignment in the angular and distance domains separately. For maximizing the achievable rate while reducing the complexity, an alternating optimization algorithm is proposed to carry out the joint optimization iteratively. Specifically, the RIS coefficient matrix is optimized through the beam training process, the optimal combining matrix is obtained from the closed-form solution for the mean square error (MSE) minimization problem, and the active beamforming matrix is optimized by exploiting the relationship between the achievable rate and MSE. Numerical results reveal that: 1) the proposed beam training schemes achieve near-optimal performance with a significantly decreased training overhead; 2) compared to the angular-only far-field channel model, taking the additional distance information into consideration will effectively improve the achievable rate when carrying out beam design for near-field communications.
RIS-Aided Cell-Free Massive MIMO Systems for 6G: Fundamentals, System Design, and Applications
results: The paper provides a comprehensive survey of RIS-aided CF mMIMO wireless communication systems, including an overview of system architectures and application scenarios, communication protocols, and channel models, along with in-depth analyses of system operation and resource allocation.
Abstract
The introduction of intelligent interconnectivity for people and things has posed higher demands and more challenges for sixth-generation (6G) networks, such as high spectral efficiency and energy efficiency, ultra-low latency, and ultra-high reliability. Cell-free (CF) massive multiple-input multiple-output (mMIMO) and reconfigurable intelligent surface (RIS), also called intelligent reflecting surface (IRS), are two promising technologies for coping with these unprecedented demands. Given their distinct capabilities, integrating the two technologies to further enhance wireless network performances has received great research and development attention. In this paper, we provide a comprehensive survey of research on RIS-aided CF mMIMO wireless communication systems. We first introduce system models focusing on system architecture and application scenarios, channel models, and communication protocols. Subsequently, we summarize the relevant studies on system operation and resource allocation, providing in-depth analyses and discussions. Following this, we present practical challenges faced by RIS-aided CF mMIMO systems, particularly those introduced by RIS, such as hardware impairments and electromagnetic interference. We summarize corresponding analyses and solutions to further facilitate the implementation of RIS-aided CF mMIMO systems. Furthermore, we explore the interplay between RIS-aided CF mMIMO and other emerging 6G technologies, such as next-generation multiple-access (NGMA), simultaneous wireless information and power transfer (SWIPT), and millimeter wave (mmWave). Finally, we outline several research directions for future RIS-aided CF mMIMO systems.
Identifying Distribution Network Faults Using Adaptive Transition Probability
paper_authors: Xinliang Ma, Weihua Liu, Bingying Jin
for: Improving the accuracy of fault detection in distribution networks.
methods: Combines adaptive probability learning with waveform decomposition to optimize feature similarity.
results: Experiments show that, under adaptive learning conditions, the method outperforms commonly used classifiers such as convolutional neural networks, support vector machines, and k-nearest neighbors, especially with limited sample sizes.
Abstract
A novel approach is suggested for improving the accuracy of fault detection in distribution networks. This technique combines adaptive probability learning and waveform decomposition to optimize the similarity of features. Its objective is to discover the most appropriate linear mapping between simulated and real data to minimize distribution differences. By aligning the data in the same feature space, the proposed method effectively overcomes the challenge posed by limited sample size when identifying faults and classifying real data in distribution networks. Experimental results utilizing simulated system data and real field data demonstrate that this approach outperforms commonly used classification models such as convolutional neural networks, support vector machines, and k-nearest neighbors, especially under adaptive learning conditions. Consequently, this research provides a fresh perspective on fault detection in distribution networks, particularly when adaptive learning conditions are employed.
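The abstract describes learning a linear mapping that minimizes the distribution difference between simulated and real features; the exact objective is not given here, so the sketch below uses CORAL-style second-order alignment as an assumed stand-in for that mapping:

```python
import numpy as np

def coral_map(sim_feats, real_feats, eps=1e-6):
    """Learn a linear map that aligns the covariance of simulated features
    with that of real features (CORAL-style; an assumed stand-in for the
    paper's unspecified linear mapping). Inputs are (N, D) arrays."""
    def inv_sqrt(cov):
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T

    def sqrt(cov):
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(np.sqrt(vals + eps)) @ vecs.T

    cov_sim = np.cov(sim_feats, rowvar=False)
    cov_real = np.cov(real_feats, rowvar=False)
    W = inv_sqrt(cov_sim) @ sqrt(cov_real)   # whiten sim, re-color to real
    return (sim_feats - sim_feats.mean(0)) @ W + real_feats.mean(0)
```

Simulated features mapped this way share first- and second-order statistics with the real data, which is one concrete way to "align the data in the same feature space".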
Bayesian Approach for Adaptive EMG Pattern Classification Via Semi-Supervised Sequential Learning
results: Experimental results show that the proposed method suppresses the decline in classification accuracy over time and outperforms conventional methods, demonstrating its effectiveness and applicability to practical EMG-based control systems.
Abstract
Intuitive human-machine interfaces may be developed using pattern classification to estimate executed human motions from electromyogram (EMG) signals generated during muscle contraction. The continual use of EMG-based interfaces gradually alters signal characteristics owing to electrode shift and muscle fatigue, leading to a gradual decline in classification accuracy. This paper proposes a Bayesian approach for adaptive EMG pattern classification using semi-supervised sequential learning. The proposed method uses a Bayesian classification model based on Gaussian distributions to predict the motion class and estimate its confidence. Pseudo-labels are subsequently assigned to data with high-prediction confidence, and the posterior distributions of the model are sequentially updated within the framework of Bayesian updating, thereby achieving adaptive motion recognition to alterations in signal characteristics over time. Experimental results on six healthy adults demonstrated that the proposed method can suppress the degradation of classification accuracy over time and outperforms conventional methods. These findings demonstrate the validity of the proposed approach and its applicability to practical EMG-based control systems.
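One plausible minimal instantiation of the described classifier is a per-class Gaussian model whose parameters are updated sequentially with confident pseudo-labels; the sketch below simplifies the paper's Bayesian updating to a running mean update (the confidence threshold and shared covariance are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

class SemiSupervisedGaussianClassifier:
    def __init__(self, means, cov, conf_threshold=0.9):
        self.means = [np.array(m, dtype=float) for m in means]  # per-class means
        self.cov = np.array(cov, dtype=float)                   # shared covariance
        self.counts = [1] * len(means)                          # pseudo-counts
        self.conf_threshold = conf_threshold

    def predict_proba(self, x):
        lik = np.array([multivariate_normal.pdf(x, m, self.cov)
                        for m in self.means])
        return lik / lik.sum()

    def update(self, x):
        """Assign a pseudo-label if prediction confidence is high, then
        update that class mean sequentially (simplified Bayesian update)."""
        p = self.predict_proba(x)
        c = int(np.argmax(p))
        if p[c] >= self.conf_threshold:
            self.counts[c] += 1
            self.means[c] += (x - self.means[c]) / self.counts[c]
        return c, p[c]
```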
results: Extensive experiments on a wide range of datasets show that the proposed single model successfully handles diverse input conditions with strong performance.
Abstract
The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance.
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
results: The model achieves a new state-of-the-art SPIDEr-FL score of 32.6 on the Clotho evaluation split and won the 2023 DCASE AAC challenge.
Abstract
Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of training data. During inference, we propose to employ nucleus sampling and a hybrid reranking algorithm, which has not been explored in AAC research. Combining our efforts, our model achieves a new state-of-the-art 32.6 SPIDEr-FL score on the Clotho evaluation split, and wins the 2023 DCASE AAC challenge.
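The auxiliary loss aligns BEATs audio features with Instructor caption embeddings via InfoNCE; a generic symmetric InfoNCE over a batch might look as follows (the temperature value and any projection heads are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def info_nce(audio_feats, text_embs, temperature=0.07):
    """Symmetric InfoNCE between pooled audio features and caption
    embeddings, both (B, D); matched (audio, caption) pairs are positives,
    all other pairs in the batch are negatives."""
    a = F.normalize(audio_feats, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = a @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```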
paper_authors: Ivan Yakovlev, Mikhail Melnikov, Nikita Bukhal, Rostislav Makarov, Alexander Alenin, Nikita Torgashov, Anton Okhotnikov
For: This work aims to improve deep neural network (DNN) performance in voice anti-spoofing (VAS) and to provide a large dataset that fosters the progress of neural network systems.
* Methods: The study introduces the Large Replay Parallel Dataset (LRPD), which contains more than 1M utterances collected by 19 recording devices in 17 different environments.
* Results: A baseline model trained on LRPD performs consistently under fully unknown conditions, reaching 0.28% EER on the LRPD evaluation subset and 11.91% EER on the ASVspoof 2017 eval set.
Abstract
The latest research in the field of voice anti-spoofing (VAS) shows that deep neural networks (DNN) outperform classic approaches like GMM in the task of presentation attack detection. However, DNNs require a lot of data to converge, and still lack generalization ability. In order to foster the progress of neural network systems, we introduce a Large Replay Parallel Dataset (LRPD) aimed for a detection of replay attacks. LRPD contains more than 1M utterances collected by 19 recording devices in 17 various environments. We also provide an example training pipeline in PyTorch [1] and a baseline system, that achieves 0.28% Equal Error Rate (EER) on the evaluation subset of LRPD and 11.91% EER on the publicly available ASVspoof 2017 [2] eval set. These results show that a model trained with the LRPD dataset has a consistent performance on fully unknown conditions. Our dataset is free for research purposes and hosted on GDrive. Baseline code and pre-trained models are available at GitHub.
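Equal Error Rate is the operating point where the false-rejection rate on genuine trials equals the false-acceptance rate on spoofed trials; a minimal, generic computation (not the dataset's official scoring script) might look like this:

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """EER: threshold where the false-rejection rate on genuine trials
    equals the false-acceptance rate on spoof trials (higher score =
    more genuine)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2, thresholds[idx]

# Example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
eer, thr = compute_eer(rng.normal(2.0, 1.0, 1000), rng.normal(-2.0, 1.0, 1000))
print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```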
ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech
results: Experiments show that ReFlow-TTS achieves the best performance among diffusion-based models on the LJSpeech dataset, and its single-step sampling is competitive with existing one-step TTS models.
Abstract
Diffusion models, including Denoising Diffusion Probabilistic Models (DDPM) and score-based generative models, have demonstrated excellent performance in speech synthesis tasks. However, their effectiveness comes at the cost of numerous sampling steps, resulting in the prolonged sampling time required to synthesize high-quality speech. This drawback hinders their practical applicability in real-world scenarios. In this paper, we introduce ReFlow-TTS, a novel rectified flow based method for speech synthesis with high fidelity. Specifically, our ReFlow-TTS is simply an Ordinary Differential Equation (ODE) model that transports the Gaussian distribution to the ground-truth Mel-spectrogram distribution by straight line paths as much as possible. Furthermore, our proposed approach enables high-quality speech synthesis with a single sampling step and eliminates the need for training a teacher model. Our experiments on the LJSpeech dataset show that our ReFlow-TTS method achieves the best performance compared with other diffusion based models, and ReFlow-TTS with one-step sampling achieves competitive performance compared with existing one-step TTS models.
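Rectified flow trains an ODE velocity field to transport Gaussian noise to data along near-straight paths, which is what enables single-step sampling; a minimal sketch of the training objective and one Euler step follows (the network interface and conditioning are placeholders, not the paper's architecture):

```python
import torch

def rectified_flow_loss(model, x1, cond):
    """x1: ground-truth Mel-spectrograms (B, ...); cond: text condition.
    The model v(x_t, t, cond) regresses the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # Gaussian source sample
    t = torch.rand(x1.size(0), device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = t_ * x1 + (1 - t_) * x0                   # linear interpolation path
    v_target = x1 - x0                             # constant straight-line velocity
    return ((model(xt, t, cond) - v_target) ** 2).mean()

@torch.no_grad()
def sample_one_step(model, shape, cond, device="cpu"):
    """Single Euler step from t=0 to t=1, as in one-step sampling."""
    x0 = torch.randn(shape, device=device)
    t = torch.zeros(shape[0], device=device)
    return x0 + model(x0, t, cond)                 # x1 ~ x0 + v * (1 - 0)
```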
Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
paper_authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed
for: Improving low-resource self-supervised learning for speech processing.
methods: Augmenting a low-resource pre-training corpus with synthetic speech generated by an SSL-enhanced text-to-speech system.
results: Successfully reduces the demand for speech data by 90% with only slight performance degradation.
Abstract
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90\% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features
results: The method offers low computational complexity and performs well in both cross-dataset and silence-trimming scenarios, improving the robustness and interpretability of SSD methods.
Abstract
Current synthetic speech detection (SSD) methods perform well on certain datasets but still face issues of robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech. In this paper, the flaws of the speaker features inherent in the text-to-speech (TTS) process are analyzed. Differences in the temporal consistency of intra-utterance speaker features arise due to the lack of fine-grained control over speaker features in TTS. Since the speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features differs between synthetic and bonafide speech. Based on these analyzes, an SSD method based on temporal consistency and distribution of speaker features is proposed. On one hand, modeling the temporal consistency of intra-utterance speaker features can aid speech anti-spoofing. On the other hand, distribution differences in inter-utterance speaker features can be utilized for SSD. The proposed method offers low computational complexity and performs well in both cross-dataset and silence trimming scenarios.
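One way to operationalize the two cues above is to extract frame-level speaker embeddings and summarize their intra-utterance temporal consistency; the features below are a simplified illustration of that idea, not the paper's model:

```python
import numpy as np

def speaker_consistency_features(frame_embs):
    """frame_embs: (T, D) per-frame speaker embeddings of one utterance.
    Returns simple statistics capturing temporal (in)consistency; synthetic
    speech is expected to show different consistency patterns than bonafide."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    cos_adjacent = np.sum(e[:-1] * e[1:], axis=1)   # frame-to-frame similarity
    centroid = e.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    cos_to_centroid = e @ centroid                  # spread around utterance mean
    return np.array([cos_adjacent.mean(), cos_adjacent.std(),
                     cos_to_centroid.mean(), cos_to_centroid.std()])
```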
Enhancing Code-switching Speech Recognition with Interactive Language Biases
paper_authors: Hexin Liu, Leibny Paola Garcia, Xiangyu Zhang, Andy W. H. Khong, Sanjeev Khudanpur
for: This work aims to improve automatic speech recognition (ASR) under code-switching scenarios by biasing a hybrid CTC/attention ASR model with multi-level language information.
methods: The hybrid CTC/attention ASR model is biased with frame- and token-level language posteriors, and the interaction between the various resolutions of language biases is explored.
results: Experiments on the ASRU 2019 code-switching challenge dataset show that the proposed interactive language biases (ILB) method outperforms the baseline, with ablation studies analyzing the effects of different language biases and their interactions.
Abstract
Languages usually switch within a multilingual speech signal, especially in a bilingual society. This phenomenon is referred to as code-switching (CS), making automatic speech recognition (ASR) challenging under a multilingual scenario. We propose to improve CS-ASR by biasing the hybrid CTC/attention ASR model with multi-level language information comprising frame- and token-level language posteriors. The interaction between various resolutions of language biases is subsequently explored in this work. We conducted experiments on datasets from the ASRU 2019 code-switching challenge. Compared to the baseline, the proposed interactive language biases (ILB) method achieves higher performance and ablation studies highlight the effects of different language biases and their interactions. In addition, the results presented indicate that language bias implicitly enhances internal language modeling, leading to performance degradation after employing an external language model.
For: This work aims to improve the accuracy and reliability of urban building energy estimation by extracting the geometric specificities of the local building stock to construct representative building archetypes.
* Methods: The study applies representation learning, specifically a VQ-AE, to encode local building footprints and purify their geometric information into latent vectors constrained by multiple architectural downstream tasks.
* Results: Geometric feature embeddings extracted with MARL improve the accuracy and reliability of urban building energy estimation and enable automatic archetype generation across multi-scale regions. Code, dataset, and trained models are publicly available: https://github.com/ZixunHuang1997/MARL-BuildingEnergyEstimation.
Abstract
Building archetypes, representative models of building stock, are crucial for precise energy simulations in Urban Building Energy Modeling. The current widely adopted building archetypes are developed on a nationwide scale, potentially neglecting the impact of local buildings' geometric specificities. We present Multi-scale Archetype Representation Learning (MARL), an approach that leverages representation learning to extract geometric features from a specific building stock. Built upon VQ-AE, MARL encodes building footprints and purifies geometric information into latent vectors constrained by multiple architectural downstream tasks. These tailored representations are proven valuable for further clustering and building energy modeling. The advantages of our algorithm are its adaptability with respect to the different building footprint sizes, the ability for automatic generation across multi-scale regions, and the preservation of geometric features across neighborhoods and local ecologies. In our study spanning five regions in LA County, we show MARL surpasses both conventional and VQ-AE extracted archetypes in performance. Results demonstrate that geometric feature embeddings significantly improve the accuracy and reliability of energy consumption estimates. Code, dataset and trained models are publicly available: https://github.com/ZixunHuang1997/MARL-BuildingEnergyEstimation
SCoRe: Submodular Combinatorial Representation Learning for Real-World Class-Imbalanced Settings
For: The paper focuses on the challenge of representation learning in real-world class-imbalanced settings, particularly in deep learning.* Methods: The authors propose a new framework called SCoRe (Submodular Combinatorial Representation Learning) that leverages set-based combinatorial functions to model diversity and cooperation among feature clusters. They also introduce a family of Submodular Combinatorial Loss functions to overcome the pitfalls of class-imbalance in contrastive learning.* Results: The authors show that their proposed objectives outperform state-of-the-art metric learners by up to 7.6% for imbalanced classification tasks and up to 19.4% for object detection tasks on several benchmark datasets.Abstract
Representation Learning in real-world class-imbalanced settings has emerged as a challenging task in the evolution of deep learning. Lack of diversity in visual and structural features for rare classes restricts modern neural networks to learn discriminative feature clusters. This manifests in the form of large inter-class bias between rare object classes and elevated intra-class variance among abundant classes in the dataset. Although deep metric learning approaches have shown promise in this domain, significant improvements need to be made to overcome the challenges associated with class-imbalance in mission critical tasks like autonomous navigation and medical diagnostics. Set-based combinatorial functions like Submodular Information Measures exhibit properties that allow them to simultaneously model diversity and cooperation among feature clusters. In this paper, we introduce the SCoRe (Submodular Combinatorial Representation Learning) framework and propose a family of Submodular Combinatorial Loss functions to overcome these pitfalls in contrastive learning. We also show that existing contrastive learning approaches are either submodular or can be re-formulated to create their submodular counterparts. We conduct experiments on the newly introduced family of combinatorial objectives on two image classification benchmarks - pathologically imbalanced CIFAR-10, subsets of MedMNIST and a real-world road object detection benchmark - India Driving Dataset (IDD). Our experiments clearly show that the newly introduced objectives like Facility Location, Graph-Cut and Log Determinant outperform state-of-the-art metric learners by up to 7.6% for the imbalanced classification tasks and up to 19.4% for object detection tasks.
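The paper's combinatorial objectives build on set functions such as facility location; as a minimal illustration (not SCoRe's actual loss formulation), the facility-location value of a candidate set over a ground set of embeddings can be computed as:

```python
import torch
import torch.nn.functional as F

def facility_location(candidate_embs, ground_set_embs):
    """f(A) = sum_i max_{j in A} sim(i, j): how well the candidate set A
    covers the ground set V under cosine similarity. This function is
    monotone submodular, the property the paper exploits."""
    a = F.normalize(candidate_embs, dim=-1)     # (|A|, D)
    v = F.normalize(ground_set_embs, dim=-1)    # (|V|, D)
    sims = v @ a.T                              # (|V|, |A|) similarities
    return sims.max(dim=1).values.sum()
```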
PRIME: Prioritizing Interpretability in Failure Mode Extraction
results: Through experiments on several datasets, the method successfully identifies failure modes and generates high-quality text descriptions associated with them, highlighting the importance of interpretability in understanding model failures.
Abstract
In this work, we study the challenge of providing human-understandable descriptions for failure modes in trained image classification models. Existing works address this problem by first identifying clusters (or directions) of incorrectly classified samples in a latent space and then aiming to provide human-understandable text descriptions for them. We observe that in some cases, describing text does not match well with identified failure modes, partially owing to the fact that shared interpretable attributes of failure modes may not be captured using clustering in the feature space. To improve on these shortcomings, we propose a novel approach that prioritizes interpretability in this problem: we start by obtaining human-understandable concepts (tags) of images in the dataset and then analyze the model's behavior based on the presence or absence of combinations of these tags. Our method also ensures that the tags describing a failure mode form a minimal set, avoiding redundant and noisy descriptions. Through several experiments on different datasets, we show that our method successfully identifies failure modes and generates high-quality text descriptions associated with them. These results highlight the importance of prioritizing interpretability in understanding model failures.
Prior Mismatch and Adaptation in PnP-ADMM with a Nonconvex Convergence Analysis
methods: The paper studies the alternating direction method of multipliers (ADMM) variant of PnP and analyzes the effect of mismatched prior distributions on PnP methods.
results: PnP-ADMM is shown to be relatively robust to prior distribution mismatch on image super-resolution, although denoisers trained and tested on different distributions degrade performance; a simple domain adaptation strategy with few training samples from the desired distribution substantially closes the gap.
Abstract
Plug-and-Play (PnP) priors is a widely-used family of methods for solving imaging inverse problems by integrating physical measurement models with image priors specified using image denoisers. PnP methods have been shown to achieve state-of-the-art performance when the prior is obtained using powerful deep denoisers. Despite extensive work on PnP, the topic of distribution mismatch between the training and testing data has often been overlooked in the PnP literature. This paper presents a set of new theoretical and numerical results on the topic of prior distribution mismatch and domain adaptation for alternating direction method of multipliers (ADMM) variant of PnP. Our theoretical result provides an explicit error bound for PnP-ADMM due to the mismatch between the desired denoiser and the one used for inference. Our analysis contributes to the work in the area by considering the mismatch under nonconvex data-fidelity terms and expansive denoisers. Our first set of numerical results quantifies the impact of the prior distribution mismatch on the performance of PnP-ADMM on the problem of image super-resolution. Our second set of numerical results considers a simple and effective domain adaption strategy that closes the performance gap due to the use of mismatched denoisers. Our results suggest the relative robustness of PnP-ADMM to prior distribution mismatch, while also showing that the performance gap can be significantly reduced with few training samples from the desired distribution.
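For reference, the PnP-ADMM iterations the analysis concerns alternate a data-fidelity proximal step with a pretrained denoiser standing in for the prior's proximal operator; a generic sketch (with `prox_data` and `denoiser` as placeholders) is:

```python
import torch

def pnp_admm(prox_data, denoiser, x_init, n_iters=50):
    """prox_data(v): proximal operator of the (possibly nonconvex)
    data-fidelity term; denoiser(v): pretrained image denoiser used as an
    implicit prior. Standard PnP-ADMM iterations; the paper's analysis
    concerns what happens when `denoiser` was trained on a mismatched
    distribution."""
    x = z = x_init.clone()
    u = torch.zeros_like(x_init)        # scaled dual variable
    for _ in range(n_iters):
        x = prox_data(z - u)            # data-fidelity step
        z = denoiser(x + u)             # prior step via the denoiser
        u = u + x - z                   # dual update
    return x
```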
Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition
results: Experiments show that semantically quantized and decomposed audio representations significantly improve AVS performance, e.g., +21.2% mIoU on the most challenging AVS-Semantic benchmark.
Abstract
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos based on their associated acoustic cues. With multiple sound sources involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) the intricate entanglement across sound sources and (2) the frequent shift among sound events. Assuming sound events occur independently, the multi-source semantic space (which encompasses all possible semantic categories) can be viewed as the Cartesian product of single-source sub-spaces. This motivates us to decompose the multi-source audio semantics into single-source semantics, allowing for more effective interaction with visual content. Specifically, we propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several quantized single-source semantics. Furthermore, we introduce a global-to-local quantization mechanism that distills knowledge from stable global (clip-level) features into local (frame-level) ones to handle the constant shift of audio semantics. Extensive experiments demonstrate that semantically quantized and decomposed audio representation significantly improves AVS performance, e.g., +21.2% mIoU on the most challenging AVS-Semantic benchmark.
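Product quantization decomposes a semantic vector into sub-vectors quantized against independent codebooks; a generic sketch follows (codebook sizes and the global-to-local distillation mechanism are not reproduced here):

```python
import torch

def product_quantize(z, codebooks):
    """z: (B, D) audio semantic features; codebooks: list of M tensors,
    each (K, D/M). Each D/M-dim chunk of z is snapped to its nearest code,
    decomposing multi-source semantics into single-source codes."""
    chunks = z.chunk(len(codebooks), dim=-1)
    quantized, indices = [], []
    for chunk, cb in zip(chunks, codebooks):
        d = torch.cdist(chunk, cb)          # (B, K) distances to the codes
        idx = d.argmin(dim=-1)
        quantized.append(cb[idx])
        indices.append(idx)
    return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```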
Fewshot learning on global multimodal embeddings for earth observation tasks
results: With only ~200-500 randomly selected labeled examples (around 4K-10K km^2), performance matches that achieved with the full labeled datasets (about 150K image chips or 3M km^2 in each AOI) across all modalities, AOIs, and downstream tasks, suggesting the model has captured earth features useful in a wide variety of scenarios.
Abstract
In this work we pretrain a CLIP/ViT based model using three different modalities of satellite imagery across five AOIs covering over ~10% of the Earth's total landmass, namely Sentinel 2 RGB optical imagery, Sentinel 1 SAR amplitude and Sentinel 1 SAR interferometric coherence. This model uses ~250M parameters. Then, we use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation related to vegetation, built up surface, croplands and permanent water. We consistently show how we reduce the need for labeled data by 99%, so that with ~200-500 randomly selected labeled examples (around 4K-10K km^2) we reach performance levels analogous to those achieved with the full labeled datasets (about 150K image chips or 3M km^2 in each AOI) on all modalities, AOIs and downstream tasks. This leads us to think that the model has captured significant earth features useful in a wide variety of scenarios. To enhance our model's usability in practice, its architecture allows inference in contexts with missing modalities and even missing channels within each modality. Additionally, we visually show that this embedding space, obtained with no labels, is sensitive to the different earth features represented by the labelled datasets we selected.
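The downstream recipe pairs frozen embeddings with a classical learner and a few hundred labels; a sketch using logistic regression (the abstract does not name the exact classical method used) might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fewshot_eval(embeddings, labels, n_labeled=500, seed=0):
    """embeddings: (N, D) frozen multimodal embeddings; labels: (N,).
    Fit a classical model on a few hundred randomly selected labeled
    examples and evaluate on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    train, test = idx[:n_labeled], idx[n_labeled:]
    clf = LogisticRegression(max_iter=1000).fit(embeddings[train], labels[train])
    return accuracy_score(labels[test], clf.predict(embeddings[test]))
```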
Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study
paper_authors: Myeongseob Ko, Ming Jin, Chenguang Wang, Ruoxi Jia
for: This paper aims to develop practical membership inference attacks (MIAs) against large-scale multi-modal models, specifically targeting CLIP models.
methods: The proposed baseline strategy thresholds the cosine similarity between text and image features of a target point, and the enhanced attack method aggregates cosine similarity across transformations of the target. Additionally, a new weakly supervised attack method leverages ground-truth non-members to further enhance the attack.
results: The simple baseline achieves over 75% membership identification accuracy, and the enhanced attacks outperform the baseline across multiple models and datasets, with the weakly supervised attack demonstrating an average-case performance improvement of 17% and being at least 7X more effective at low false-positive rates.
Abstract
Membership inference attacks (MIAs) aim to infer whether a data point has been used to train a machine learning model. These attacks can be employed to identify potential privacy vulnerabilities and detect unauthorized use of personal data. While MIAs have been traditionally studied for simple classification models, recent advancements in multi-modal pre-training, such as CLIP, have demonstrated remarkable zero-shot performance across a range of computer vision tasks. However, the sheer scale of data and models presents significant computational challenges for performing the attacks. This paper takes a first step towards developing practical MIAs against large-scale multi-modal models. We introduce a simple baseline strategy by thresholding the cosine similarity between text and image features of a target point and propose further enhancing the baseline by aggregating cosine similarity across transformations of the target. We also present a new weakly supervised attack method that leverages ground-truth non-members (e.g., obtained by using the publication date of a target model and the timestamps of the open data) to further enhance the attack. Our evaluation shows that CLIP models are susceptible to our attack strategies, with our simple baseline achieving over 75% membership identification accuracy. Furthermore, our enhanced attacks outperform the baseline across multiple models and datasets, with the weakly supervised attack demonstrating an average-case performance improvement of 17% and being at least 7X more effective at low false-positive rates. These findings highlight the importance of protecting the privacy of multi-modal foundational models, which were previously assumed to be less susceptible to MIAs due to less overfitting. Our code is available at https://github.com/ruoxi-jia-group/CLIP-MIA.
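The baseline attack reduces to thresholding the cosine similarity between paired text and image features; a sketch with generic encoder callables follows (the threshold calibration and the transformation-aggregation enhancement are omitted):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mia_scores(image_encoder, text_encoder, images, texts):
    """Cosine similarity between paired image/text features; higher
    similarity for a (image, caption) pair suggests it was seen during
    training."""
    img = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt = F.normalize(text_encoder(texts), dim=-1)     # (B, D)
    return (img * txt).sum(dim=-1)                     # (B,) per-pair scores

def predict_membership(scores, threshold):
    """Baseline decision rule: member iff similarity exceeds a threshold
    (calibrated, e.g., on known non-members)."""
    return scores > threshold
```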
Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation
results: Across multiple evaluation setups on the COCO benchmark, the model outperforms previous state-of-the-art semi-supervised pose estimators, especially in the extreme low-data regime; for example, with only 0.5K labeled images the method surpasses the best competitor by 7.22 mAP (+25% absolute improvement). The model also learns effectively from unlabeled data in the wild to further boost its generalization and performance.
Abstract
We propose a new semi-supervised learning design for human pose estimation that revisits the popular dual-student framework and enhances it two ways. First, we introduce a denoising scheme to generate reliable pseudo-heatmaps as targets for learning from unlabeled data. This uses multi-view augmentations and a threshold-and-refine procedure to produce a pool of pseudo-heatmaps. Second, we select the learning targets from these pseudo-heatmaps guided by the estimated cross-student uncertainty. We evaluate our proposed method on multiple evaluation setups on the COCO benchmark. Our results show that our model outperforms previous state-of-the-art semi-supervised pose estimators, especially in extreme low-data regime. For example with only 0.5K labeled images our method is capable of surpassing the best competitor by 7.22 mAP (+25% absolute improvement). We also demonstrate that our model can learn effectively from unlabeled data in the wild to further boost its generalization and performance.
Towards Few-Call Model Stealing via Active Self-Paced Knowledge Distillation and Diffusion-Based Image Generation
methods: A new framework uses a synthetic proxy data set generated by a diffusion model as training data, under a constraint on the maximum number of API calls. The allowed number of samples is passed through the black-box model to collect labels, and knowledge distillation then transfers the knowledge of the black-box teacher (attacked model) into a student model (copy of the attacked model).
results: Experiments show the framework outperforms two state-of-the-art methods in the few-call model extraction scenario.
Abstract
Diffusion models showcased strong capabilities in image synthesis, being used in many computer vision tasks with great success. To this end, we propose to explore a new use case, namely to copy black-box classification models without having access to the original training data, the architecture, and the weights of the model, i.e., the model is only exposed through an inference API. More specifically, we can only observe the (soft or hard) labels for some image samples passed as input to the model. Furthermore, we consider an additional constraint limiting the number of model calls, mostly focusing our research on few-call model stealing. In order to solve the model extraction task given the applied restrictions, we propose the following framework. As training data, we create a synthetic data set (called proxy data set) by leveraging the ability of diffusion models to generate realistic and diverse images. Given a maximum number of allowed API calls, we pass the respective number of samples through the black-box model to collect labels. Finally, we distill the knowledge of the black-box teacher (attacked model) into a student model (copy of the attacked model), harnessing both labeled and unlabeled data generated by the diffusion model. We employ a novel active self-paced learning framework to make the most of the proxy data during distillation. Our empirical results on two data sets confirm the superiority of our framework over two state-of-the-art methods in the few-call model extraction scenario.
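Under the few-call constraint, the distillation step amounts to spending the query budget once on proxy images and then training the student on the returned labels; a minimal hard-label sketch follows (the paper's active self-paced scheduling and the use of unlabeled proxy data are omitted):

```python
import torch
import torch.nn.functional as F

def distill_from_blackbox(student, query_api, proxy_images, call_budget,
                          optimizer, epochs=10, batch_size=64):
    """proxy_images: diffusion-generated tensor (N, C, H, W); query_api
    returns integer class labels and is queried once per sample, so total
    queries never exceed `call_budget`."""
    labeled_x = proxy_images[:call_budget]
    with torch.no_grad():
        labels = query_api(labeled_x)      # budget-limited black-box queries
    for _ in range(epochs):
        perm = torch.randperm(labeled_x.size(0))
        for i in range(0, labeled_x.size(0), batch_size):
            idx = perm[i:i + batch_size]
            loss = F.cross_entropy(student(labeled_x[idx]), labels[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```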
DataDAM: Efficient Dataset Distillation with Attention Matching
paper_authors: Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z. Liu, Yuri A. Lawryshyn, Konstantinos N. Plataniotis
for: Reducing deep learning training costs while maintaining strong generalization across diverse datasets.
methods: Efficient Dataset Distillation with Attention Matching (DataDAM), which learns high-quality synthetic images by matching the spatial attention maps produced at different layers of randomly initialized neural networks.
results: Achieves state-of-the-art performance on CIFAR10/100, TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most settings, with improvements of up to 6.5% and 4.1% on CIFAR100 and ImageNet-1K, respectively.
Abstract
Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets. Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset and ultimately achieves test accuracy equivalent to a model trained on the whole dataset. Unfortunately, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data, and they incur significant computational costs. Despite promising results, there still exists a significant performance gap between models trained on condensed synthetic sets and those trained on the whole dataset. In this paper, we address these challenges using efficient Dataset Distillation with Attention Matching (DataDAM), achieving state-of-the-art performance while reducing training costs. Specifically, we learn synthetic images by matching the spatial attention maps of real and synthetic data generated by different layers within a family of randomly initialized neural networks. Our method outperforms the prior methods on several datasets, including CIFAR10/100, TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most of the settings, and achieves improvements of up to 6.5% and 4.1% on CIFAR100 and ImageNet-1K, respectively. We also show that our high-quality distilled images have practical benefits for downstream applications, such as continual learning and neural architecture search.
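Attention matching compares channel-aggregated spatial activation maps of real and synthetic batches, layer by layer; a simplified sketch of such a loss follows (the pooling power and layer selection are assumptions, not DataDAM's exact configuration):

```python
import torch
import torch.nn.functional as F

def spatial_attention(feat, p=2):
    """feat: (B, C, H, W) -> normalized (B, H*W) spatial attention map
    obtained by aggregating |activations|^p over channels."""
    att = feat.abs().pow(p).sum(dim=1).flatten(1)
    return F.normalize(att, dim=1)

def attention_matching_loss(real_feats, syn_feats):
    """Match mean attention maps of real vs. synthetic batches across
    layers; real_feats and syn_feats are lists of per-layer activations."""
    loss = 0.0
    for fr, fs in zip(real_feats, syn_feats):
        loss = loss + (spatial_attention(fr).mean(0)
                       - spatial_attention(fs).mean(0)).pow(2).sum()
    return loss
```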
Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
results: For low-perturbation watermarking methods, there is a fundamental trade-off between the evasion error rate and the spoofing error rate under diffusion purification attacks; for high-perturbation methods, a model substitution adversarial attack can successfully remove watermarks. Watermarking methods are also vulnerable to spoofing attacks in which real (potentially obscene) images are falsely flagged as watermarked, damaging the developers' reputation.
Abstract
In light of recent advancements in generative AI models, it has become essential to distinguish genuine content from AI-generated one to prevent the malicious usage of fake materials as authentic ones and vice versa. Various techniques have been introduced for identifying AI-generated images, with watermarking emerging as a promising approach. In this paper, we analyze the robustness of various AI-image detectors including watermarking and classifier-based deepfake detectors. For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked ones) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked ones) upon an application of a diffusion purification attack. In this regime, we also empirically show that diffusion purification effectively removes watermarks with minimal changes to images. For high perturbation watermarking methods where notable changes are applied to images, the diffusion purification attack is not effective. In this case, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images (potentially obscene) identified as watermarked ones, damaging the reputation of the developers. In particular, by just having black-box access to the watermarking method, we show that one can generate a watermarked noise image which can be added to the real images to have them falsely flagged as watermarked ones. Finally, we extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deep fake detectors and demonstrate it through experiments.
Multi-task View Synthesis with Neural Radiance Fields
results: Extensive evaluation on both synthetic and realistic benchmarks shows that MuvieNeRF can simultaneously synthesize multiple scene properties with promising visual quality and exhibits universal applicability across a range of NeRF backbones. Code is available at https://github.com/zsh2000/MuvieNeRF.
Abstract
Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capability for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MTVS), which reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties, including RGB. To tackle the MTVS problem, we propose MuvieNeRF, a framework that incorporates both multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. MuvieNeRF integrates two key modules, the Cross-Task Attention (CTA) and Cross-View Attention (CVA) modules, enabling the efficient use of information across multiple views and tasks. Extensive evaluation on both synthetic and realistic benchmarks demonstrates that MuvieNeRF is capable of simultaneously synthesizing different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings. Notably, we show that MuvieNeRF exhibits universal applicability across a range of NeRF backbones. Our code is available at https://github.com/zsh2000/MuvieNeRF.
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
paper_authors: Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu
For: This work aims to advance expressive human pose and shape estimation (EHPS) by scaling up data and model size, enabling human motion and shape estimation across diverse scenarios.
* Methods: The study systematically investigates 32 EHPS datasets and optimizes the training scheme to select the most suitable data and training strategy; it also uses vision transformers to study the scaling law of model sizes, and a finetuning strategy further specializes the models.
* Results: With big data and large models, the approach achieves strong performance across diverse test benchmarks and excellent transferability to unseen environments. Notably, the SMPLer-X foundation model ranks first on seven benchmarks, including AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning).
Abstract
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turn SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/
FACTS: First Amplify Correlations and Then Slice to Discover Bias
for: This work aims to identify spurious correlations in computer vision datasets in order to inform downstream bias mitigation strategies.
methods: A method called First Amplify Correlations and Then Slice to Discover Bias (FACTS) first amplifies correlations and then performs correlation-aware slicing to identify underperforming data slices.
results: The method considerably improves correlation bias identification compared to prior work, with gains of up to 35% precision@10 across a range of diverse evaluation settings.
Abstract
Computer vision datasets frequently contain spurious correlations between task-relevant labels and (easy to learn) latent task-irrelevant attributes (e.g. context). Models trained on such datasets learn "shortcuts" and underperform on bias-conflicting slices of data where the correlation does not hold. In this work, we study the problem of identifying such slices to inform downstream bias mitigation strategies. We propose First Amplify Correlations and Then Slice to Discover Bias (FACTS), wherein we first amplify correlations to fit a simple bias-aligned hypothesis via strongly regularized empirical risk minimization. Next, we perform correlation-aware slicing via mixture modeling in bias-aligned feature space to discover underperforming data slices that capture distinct correlations. Despite its simplicity, our method considerably improves over prior work (by as much as 35% precision@10) in correlation bias identification across a range of diverse evaluation settings. Our code is available at: https://github.com/yvsriram/FACTS.
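After the correlation-amplification stage, slicing reduces to clustering in the bias-aligned feature space and ranking clusters by accuracy; a compact sketch of this second stage using a Gaussian mixture follows (the component count and other details are assumptions, not the paper's exact configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discover_bias_slices(bias_feats, correct, n_slices=10, seed=0):
    """bias_feats: (N, D) features from the bias-amplified model;
    correct: (N,) array of 0/1 per-sample correctness. Cluster the feature
    space and surface slices with unusually low accuracy."""
    gmm = GaussianMixture(n_components=n_slices, random_state=seed)
    assign = gmm.fit_predict(bias_feats)
    slice_acc = np.array([correct[assign == k].mean() if (assign == k).any()
                          else np.nan for k in range(n_slices)])
    return np.argsort(slice_acc), slice_acc   # lowest-accuracy slices first
```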
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
methods: The authors first show that the reward-function gradient can be backpropagated through the full sampling procedure, achieving strong performance across a variety of rewards. They then propose the variants DRaFT-K and DRaFT-LV, which truncate backpropagation to the last K steps and to a single step, respectively.
results: Experiments show that DRaFT improves the aesthetic quality of images under a variety of reward functions. The authors also draw connections to prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
Abstract
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
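To make the truncation in DRaFT-K concrete, here is a minimal sketch of the gradient flow, assuming hypothetical `denoise_step`, `decode`, and `reward_fn` callables rather than the paper's actual diffusion pipeline; with K=1 this corresponds to the setting that DRaFT-LV further refines with lower-variance estimates.

```python
import torch

def draft_k_loss(x_T, timesteps, denoise_step, decode, reward_fn, K=1):
    """Run the full sampler but differentiate only the last K steps."""
    x = x_T
    n = len(timesteps)
    with torch.no_grad():                 # early steps build no graph
        for t in timesteps[: n - K]:
            x = denoise_step(x, t)
    for t in timesteps[n - K:]:           # last K steps carry gradients
        x = denoise_step(x, t)
    return -reward_fn(decode(x)).mean()   # minimize negative reward
```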
IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images
results: Extensive experiments show that IFAST achieves state-of-the-art results on a large binocular NIR image dataset (BNI-FAS), demonstrating that single-shot face anti-spoofing based on binocular NIR images is feasible.
Abstract
Single-shot face anti-spoofing (FAS) is a key technique for securing face recognition systems, and it requires only static images as input. However, single-shot FAS remains a challenging and under-explored problem due to two main reasons: 1) on the data side, learning FAS from RGB images is largely context-dependent, and single-shot images without additional annotations contain limited semantic information. 2) on the model side, existing single-shot FAS models are infeasible to provide proper evidence for their decisions, and FAS methods based on depth estimation require expensive per-pixel annotations. To address these issues, a large binocular NIR image dataset (BNI-FAS) is constructed and published, which contains more than 300,000 real face and plane attack images, and an Interpretable FAS Transformer (IFAST) is proposed that requires only weak supervision to produce interpretable predictions. Our IFAST can produce pixel-wise disparity maps by the proposed disparity estimation Transformer with Dynamic Matching Attention (DMA) block. Besides, a well-designed confidence map generator is adopted to cooperate with the proposed dual-teacher distillation module to obtain the final discriminant results. The comprehensive experiments show that our IFAST can achieve state-of-the-art results on BNI-FAS, proving the effectiveness of the single-shot FAS based on binocular NIR images.
Forward Flow for Novel View Synthesis of Dynamic Scenes
paper_authors: Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang
for: This paper proposes a neural radiance field (NeRF) based method for synthesizing novel views of dynamic scenes. Existing methods frequently adopt a static NeRF to represent the canonical space and render dynamic images by mapping sampled 3D points back to the canonical space with a learned backward flow field; however, this backward flow field is discontinuous and difficult to fit with commonly used smooth motion models.
results: The method outperforms existing approaches in both novel view rendering and motion modeling, demonstrating the effectiveness of the proposed forward flow motion modeling.
Abstract
This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which is difficult to be fitted by commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such forward flow field is smooth and continuous within the object region, which benefits the motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. Project page: https://npucvr.github.io/ForwardFlowDNeRF
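The many-to-one side of forward warping can be illustrated with a small average-splatting sketch; the nearest-neighbor splats and single-channel grid below are simplifying assumptions, and the paper's differentiable splatting and inpaint network (which handles the one-to-many gaps) are not reproduced.

```python
import torch

def average_splat(feat, flow):
    """feat: (H, W) float map; flow: (H, W, 2) forward flow in pixels."""
    H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    tx = (xs + flow[..., 0]).round().long().clamp(0, W - 1)
    ty = (ys + flow[..., 1]).round().long().clamp(0, H - 1)
    idx = (ty * W + tx).reshape(-1)       # flat target index per source pixel

    num = torch.zeros(H * W).scatter_add_(0, idx, feat.reshape(-1))
    cnt = torch.zeros(H * W).scatter_add_(0, idx, torch.ones(H * W))
    out = num / cnt.clamp(min=1.0)        # many-to-one targets are averaged
    return out.reshape(H, W)              # one-to-many gaps (cnt == 0) stay 0
```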
Prompt-based test-time real image dehazing: a novel pipeline
results: On real-captured hazy images, PTTD restores visibility and fine detail better than existing dehazing models, while offering stronger generalization and flexibility.
Abstract
Existing methods attempt to improve models' generalization ability on real-world hazy images by exploring well-designed training schemes (e.g., cycleGAN, prior loss). However, most of them need very complicated training procedures to achieve satisfactory results. In this work, we present a totally novel testing pipeline called Prompt-based Test-Time Dehazing (PTTD) to help generate visually pleasing results of real-captured hazy images during the inference phase. We experimentally find that given a dehazing model trained on synthetic data, by fine-tuning the statistics (i.e., mean and standard deviation) of encoding features, PTTD is able to narrow the domain gap, boosting the performance of real image dehazing. Accordingly, we first apply a prompt generation module (PGM) to generate a visual prompt, which is the source of appropriate statistical perturbations for mean and standard deviation. And then, we employ the feature adaptation module (FAM) into the existing dehazing models for adjusting the original statistics with the guidance of the generated prompt. Note that, PTTD is model-agnostic and can be equipped with various state-of-the-art dehazing models trained on synthetic hazy-clean pairs. Extensive experimental results demonstrate that our PTTD is flexible meanwhile achieves superior performance against state-of-the-art dehazing methods in real-world scenarios. The source code of our PTTD will be made available at https://github.com/cecret3350/PTTD-Dehazing.
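The core statistic-adjustment idea can be sketched in a few lines; the AdaIN-style blend and the `alpha` mixing weight below are illustrative assumptions, not the paper's PGM/FAM design.

```python
import torch

def adapt_feature_stats(feat, prompt_feat, alpha=0.5, eps=1e-5):
    """feat, prompt_feat: (B, C, H, W); alpha blends the two statistics."""
    f_mu = feat.mean(dim=(2, 3), keepdim=True)
    f_std = feat.std(dim=(2, 3), keepdim=True) + eps
    p_mu = prompt_feat.mean(dim=(2, 3), keepdim=True)
    p_std = prompt_feat.std(dim=(2, 3), keepdim=True) + eps

    # Perturb the encoder statistics toward the prompt statistics,
    # then re-normalize the feature with the blended mean/std.
    mu = (1 - alpha) * f_mu + alpha * p_mu
    std = (1 - alpha) * f_std + alpha * p_std
    return (feat - f_mu) / f_std * std + mu
```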
Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings
paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
for: Reducing the memory footprint of deep neural networks (DNNs) so that models can be loaded on commodity mobile devices.
methods: Jointly learnable codebooks are proposed, with an initialization scheme and learned mappings to determine the optimal assignment.
results: Compresses a Llama 7B model efficiently enough to load it on a 5-year-old smartphone.
Abstract
The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this led to an increase in the memory footprint, to a point where it can be challenging to simply load a model on commodity devices such as mobile phones. To address this limitation, quantization is a favored solution as it maps high precision tensors to a low precision, memory efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, hence not defining a proximal search. In this work, we propose to address these two limitations. First, we initially group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups, or map weights that fall in these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing inspiration from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: as such, a Llama 7B can be compressed down to 2 GB and loaded on 5-year-old smartphones.
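A minimal sketch of per-group codebook quantization conveys the initialization step; grouping rows by norm and fitting k-means codebooks are illustrative stand-ins, and the joint codebook/mapping learning with the proximal gradient update is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def grouped_codebook_quantize(W, n_groups=4, codebook_size=16):
    """W: (out, in) weight matrix -> quantized copy, one codebook per group."""
    norms = np.linalg.norm(W, axis=1)
    order = np.argsort(norms)                      # similar rows end up together
    groups = np.array_split(order, n_groups)

    W_q = np.empty_like(W)
    for rows in groups:
        scalars = W[rows].reshape(-1, 1)           # scalar weights in this group
        km = KMeans(n_clusters=codebook_size, n_init=4).fit(scalars)
        codes = km.cluster_centers_[km.labels_]    # map each weight to a centroid
        W_q[rows] = codes.reshape(len(rows), -1)
    return W_q
```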
Towards Free Data Selection with General-Purpose Models
results: Achieves significant improvements on a variety of computer vision tasks while being 530x faster than existing active learning methods.
Abstract
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from inter-mediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods. Extensive experiments verify the effectiveness of FreeSel on various computer vision tasks. Our code is available at https://github.com/yichen928/FreeSel.
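Single-pass, distance-based selection can be sketched as greedy farthest-point sampling; treating one feature vector per image (rather than the paper's fine-grained semantic patterns) is a simplifying assumption.

```python
import torch

def farthest_point_select(feats, budget):
    """feats: (N, D) features from a frozen general-purpose model."""
    N = feats.shape[0]
    selected = [torch.randint(N, (1,)).item()]           # random seed sample
    d = torch.cdist(feats, feats[selected]).squeeze(1)   # distance to selection
    for _ in range(budget - 1):
        nxt = int(d.argmax())                            # most distant sample
        selected.append(nxt)
        d = torch.minimum(d, torch.cdist(feats, feats[nxt:nxt + 1]).squeeze(1))
    return selected
```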
See Beyond Seeing: Robust 3D Object Detection from Point Clouds via Cross-Modal Hallucination
paper_authors: Jianning Deng, Gabriel Chan, Hantao Zhong, Chris Xiaoxuan Lu
for: A novel framework for robust 3D object detection from point clouds via cross-modal hallucination.
methods: Multiple alignment methods, including spatial alignment and feature alignment, enable simultaneous backbone refinement and hallucination generation.
results: The method handles detection cases with diverse sensing characteristics more robustly, and shows strong performance and efficiency in both training and inference.
This paper presents a novel framework for robust 3D object detection from point clouds via cross-modal hallucination. Our proposed approach is agnostic to either hallucination direction between LiDAR and 4D radar. We introduce multiple alignments on both spatial and feature levels to achieve simultaneous backbone refinement and hallucination generation. Specifically, spatial alignment is proposed to deal with the geometry discrepancy for better instance matching between LiDAR and radar. The feature alignment step further bridges the intrinsic attribute gap between the sensing modalities and stabilizes the training. The trained object detection models can deal with difficult detection cases better, even though only single-modal data is used as the input during the inference stage. Extensive experiments on the View-of-Delft (VoD) dataset show that our proposed method outperforms the state-of-the-art (SOTA) methods for both radar and LiDAR object detection while maintaining competitive efficiency in runtime.
Multi-Depth Branches Network for Efficient Image Super-Resolution
results: The model achieves better SR performance with lower computational overhead. Code is available at https://github.com/thy960112/MDBN.
Abstract
Significant progress has been made in the field of super-resolution (SR), yet many convolutional neural networks (CNNs) based SR models primarily focus on restoring high-frequency details, often overlooking crucial low-frequency contour information. Transformer-based SR methods, while incorporating global structural details, frequently come with an abundance of parameters, leading to high computational overhead. In this paper, we address these challenges by introducing a Multi-Depth Branches Network (MDBN). This framework extends the ResNet architecture by integrating an additional branch that captures vital structural characteristics of images. Our proposed multi-depth branches module (MDBM) involves the stacking of convolutional kernels of identical size at varying depths within distinct branches. By conducting a comprehensive analysis of the feature maps, we observe that branches with differing depths can extract contour and detail information respectively. By integrating these branches, the overall architecture can preserve essential low-frequency semantic structural information during the restoration of high-frequency visual elements, which is more closely with human visual cognition. Compared to GoogLeNet-like models, our basic multi-depth branches structure has fewer parameters, higher computational efficiency, and improved performance. Our model outperforms state-of-the-art (SOTA) lightweight SR methods with less inference time. Our code is available at https://github.com/thy960112/MDBN
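A minimal sketch of a multi-depth branches module is shown below, with illustrative channel counts and branch depths rather than the paper's exact configuration: identical 3x3 kernels are stacked to different depths in parallel branches, so the shallow branch keeps contour cues while the deeper one extracts detail.

```python
import torch
import torch.nn as nn

class MultiDepthBranches(nn.Module):
    def __init__(self, channels=64, depths=(1, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(*[
                nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                              nn.ReLU(inplace=True))
                for _ in range(d)
            ])
            for d in depths
        ])
        self.fuse = nn.Conv2d(channels * len(depths), channels, 1)

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]  # per-depth features
        return x + self.fuse(torch.cat(outs, dim=1))    # residual fusion
```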
Telling Stories for Common Sense Zero-Shot Action Recognition
results: Using the Stories dataset and the proposed method, the approach sets a new state of the art on multiple benchmarks, improving top-1 accuracy by up to 6.1%.
Abstract
Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classification. Without any target dataset fine-tuning, our method achieves new state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain. The data can be found here: https://github.com/kini5gowda/Stories .
Development of a Deep Learning Method to Identify Acute Ischemic Stroke Lesions on Brain CT
results: The deep learning algorithm accurately detects AIS lesions and classifies the affected side of the brain. The best-performing method reached 72% accuracy, with higher detection rates for large lesions (80% accuracy), multiple lesions (87% accuracy), and follow-up scans (76% accuracy). However, chronic brain conditions reduced accuracy, particularly non-stroke lesions and old stroke lesions (32% and 31% error rates, respectively).
Abstract
Computed Tomography (CT) is commonly used to image acute ischemic stroke (AIS) patients, but its interpretation by radiologists is time-consuming and subject to inter-observer variability. Deep learning (DL) techniques can provide automated CT brain scan assessment, but usually require annotated images. Aiming to develop a DL method for AIS using labelled but not annotated CT brain scans from patients with AIS, we designed a convolutional neural network-based DL algorithm using routinely-collected CT brain scans from the Third International Stroke Trial (IST-3), which were not acquired using strict research protocols. The DL model aimed to detect AIS lesions and classify the side of the brain affected. We explored the impact of AIS lesion features, background brain appearances, and timing on DL performance. From 5772 unique CT scans of 2347 AIS patients (median age 82), 54% had visible AIS lesions according to expert labelling. Our best-performing DL method achieved 72% accuracy for lesion presence and side. Lesions that were larger (80% accuracy) or multiple (87% accuracy for two lesions, 100% for three or more), were better detected. Follow-up scans had 76% accuracy, while baseline scans 67% accuracy. Chronic brain conditions reduced accuracy, particularly non-stroke lesions and old stroke lesions (32% and 31% error rates respectively). DL methods can be designed for AIS lesion detection on CT using the vast quantities of routinely-collected CT brain scan data. Ultimately, this should lead to more robust and widely-applicable methods.
Efficient Large Scale Medical Image Dataset Preparation for Machine Learning Applications
paper_authors: Stefan Denner, Jonas Scherer, Klaus Kades, Dimitrios Bounias, Philipp Schader, Lisa Kausch, Markus Bujotzek, Andreas Michael Bucher, Tobias Penzkofer, Klaus Maier-Hein
results: The data curation tool improves the quality and reliability of medical imaging data and helps uncover potential dataset biases. It also supports researchers in validating image and segmentation quality.
Abstract
In the rapidly evolving field of medical imaging, machine learning algorithms have become indispensable for enhancing diagnostic accuracy. However, the effectiveness of these algorithms is contingent upon the availability and organization of high-quality medical imaging datasets. Traditional Digital Imaging and Communications in Medicine (DICOM) data management systems are inadequate for handling the scale and complexity of data required to be facilitated in machine learning algorithms. This paper introduces an innovative data curation tool, developed as part of the Kaapana open-source toolkit, aimed at streamlining the organization, management, and processing of large-scale medical imaging datasets. The tool is specifically tailored to meet the needs of radiologists and machine learning researchers. It incorporates advanced search, auto-annotation and efficient tagging functionalities for improved data curation. Additionally, the tool facilitates quality control and review, enabling researchers to validate image and segmentation quality in large datasets. It also plays a critical role in uncovering potential biases in datasets by aggregating and visualizing metadata, which is essential for developing robust machine learning models. Furthermore, Kaapana is integrated within the Radiological Cooperative Network (RACOON), a pioneering initiative aimed at creating a comprehensive national infrastructure for the aggregation, transmission, and consolidation of radiological data across all university clinics throughout Germany. A supplementary video showcasing the tool's functionalities can be accessed at https://bit.ly/MICCAI-DEMI2023.
results: Experiments show that M-MAE is more effective than state-of-the-art methods, including a 3.9% improvement when linear probing ViT-Base and a 1% improvement when fine-tuning ViT-Large, both on ImageNet.
Abstract
In this paper, we provide a comprehensive toolbox for understanding and enhancing self-supervised learning (SSL) methods through the lens of matrix information theory. Specifically, by leveraging the principles of matrix mutual information and joint entropy, we offer a unified analysis for both contrastive and feature decorrelation based methods. Furthermore, we propose the matrix variational masked auto-encoder (M-MAE) method, grounded in matrix information theory, as an enhancement to masked image modeling. The empirical evaluations underscore the effectiveness of M-MAE compared with the state-of-the-art methods, including a 3.9% improvement in linear probing ViT-Base, and a 1% improvement in fine-tuning ViT-Large, both on ImageNet.
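For intuition, a matrix entropy of the kind used in such analyses can be computed from a trace-normalized Gram matrix of a batch of embeddings; the von Neumann-style estimator below is illustrative only and not necessarily the exact quantity optimized by M-MAE.

```python
import torch

def matrix_entropy(z, eps=1e-8):
    """z: (N, D) batch of embeddings -> scalar matrix entropy."""
    z = torch.nn.functional.normalize(z, dim=1)
    K = z @ z.T                       # Gram matrix of the batch
    K = K / K.trace()                 # trace-normalize: eigenvalues sum to 1
    evals = torch.linalg.eigvalsh(K).clamp(min=eps)
    return -(evals * evals.log()).sum()
```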
Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN
paper_authors: Weiwen Zhang, Dawei Yang, Haoxuan Che, An Ran Ran, Carol Y. Cheung, Hao Chen
for: This paper aims to improve the resolution of optical coherence tomography angiography (OCTA) images without paired training data.
methods: The proposed method uses a generator and discriminator based on Generative Adversarial Networks (GANs), with a dual-path generator to emphasize high-frequency fine capillaries and a frequency-aware adversarial loss for the discriminator.
results: The proposed method outperforms other state-of-the-art unpaired methods both quantitatively and visually, with improved preservation of fine capillary details.
Abstract
For optical coherence tomography angiography (OCTA) images, a limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is greatly hampered due to lower resolution. To increase the resolution, previous works only achieved satisfactory performance by using paired data for training, but real-world applications are limited by the challenge of collecting large-scale paired images. Thus, an unpaired approach is highly demanded. Generative Adversarial Network (GAN) has been commonly used in the unpaired setting, but it may struggle to accurately preserve fine-grained capillary details, which are critical biomarkers for OCTA. In this paper, our approach aspires to preserve these details by leveraging the frequency information, which represents details as high-frequencies ($\textbf{hf}$) and coarse-grained backgrounds as low-frequencies ($\textbf{lf}$). In general, we propose a GAN-based unpaired super-resolution method for OCTA images and exceptionally emphasize $\textbf{hf}$ fine capillaries through a dual-path generator. To facilitate a precise spectrum of the reconstructed image, we also propose a frequency-aware adversarial loss for the discriminator and introduce a frequency-aware focal consistency loss for end-to-end optimization. Experiments show that our method outperforms other state-of-the-art unpaired methods both quantitatively and visually.
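A frequency-aware loss term can be sketched with an FFT-based split into $\textbf{lf}$ and $\textbf{hf}$ components; the circular mask radius and the high-frequency weighting below are illustrative assumptions, not the paper's adversarial or focal consistency losses.

```python
import torch

def split_frequencies(img, radius=0.1):
    """img: (B, C, H, W) -> (low-frequency, high-frequency) components."""
    B, C, H, W = img.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    lf_mask = ((fy ** 2 + fx ** 2).sqrt() < radius).to(img.dtype)
    spec = torch.fft.fft2(img)
    low = torch.fft.ifft2(spec * lf_mask).real   # keep only low frequencies
    return low, img - low

def freq_aware_l1(pred, target, hf_weight=2.0):
    p_lf, p_hf = split_frequencies(pred)
    t_lf, t_hf = split_frequencies(target)
    # Fine capillaries live in the high-frequency band, so weight it more.
    return (p_lf - t_lf).abs().mean() + hf_weight * (p_hf - t_hf).abs().mean()
```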
Effect of structure-based training on 3D localization precision and quality
results: Compared with conventional random-based training, the structure-based training approach yields significant improvements in detection rate and localization precision, particularly across varying signal-to-noise ratios (SNRs). The method also effectively removes checkerboard artifacts, ensuring more accurate 3D reconstructions.
Abstract
This study introduces a structural-based training approach for CNN-based algorithms in single-molecule localization microscopy (SMLM) and 3D object reconstruction. We compare this approach with the traditional random-based training method, utilizing the LUENN package as our AI pipeline. The quantitative evaluation demonstrates significant improvements in detection rate and localization precision with the structural-based training approach, particularly in varying signal-to-noise ratios (SNRs). Moreover, the method effectively removes checkerboard artifacts, ensuring more accurate 3D reconstructions. Our findings highlight the potential of the structural-based training approach to advance super-resolution microscopy and deepen our understanding of complex biological systems at the nanoscale.
Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors
results: Extensive evaluations on a variety of objects show that the method achieves highly consistent 3D reconstruction and strong generalization ability.
Abstract
Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods. See https://Consistent123.github.io for a more comprehensive exploration of our generated 3D assets.
A Survey on Deep Learning Techniques for Action Anticipation
results: This survey summarizes recent advances in action anticipation algorithms, reviews the commonly used evaluation metrics and datasets, and systematically discusses future research directions.
Abstract
The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances of action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with systematical discussions.
for: This paper addresses video deraining for scenes with rain layers of complex spatio-temporal distribution.
methods: The authors employ an event camera together with an end-to-end learning-based network. Specifically, they propose an event-aware motion detection module and a pyramidal adaptive selection module to reliably separate the background and rain layers.
results: Compared with existing state-of-the-art methods, the proposed method demonstrates clear superiority on synthetic and self-collected real-world datasets. The code and dataset are available at \url{https://github.com/booker-max/EGVD}.
Abstract
With the rapid development of deep learning, video deraining has experienced significant progress. However, existing video deraining pipelines cannot achieve satisfying performance for scenes with rain layers of complex spatio-temporal distribution. In this paper, we approach video deraining by employing an event camera. As a neuromorphic sensor, the event camera suits scenes of non-uniform motion and dynamic light conditions. We propose an end-to-end learning-based network to unlock the potential of the event camera for video deraining. First, we devise an event-aware motion detection module to adaptively aggregate multi-frame motion contexts using event-aware masks. Second, we design a pyramidal adaptive selection module for reliably separating the background and rain layers by incorporating multi-modal contextualized priors. In addition, we build a real-world dataset consisting of rainy videos and temporally synchronized event streams. We compare our method with extensive state-of-the-art methods on synthetic and self-collected real-world datasets, demonstrating the clear superiority of our method. The code and dataset are available at \url{https://github.com/booker-max/EGVD}.
Glioma subtype classification from histopathological images using in-domain and out-of-domain transfer learning: An experimental study
paper_authors: Vladimir Despotovic, Sang-Yoon Kim, Ann-Christin Hau, Aliaksandra Kakoichankava, Gilbert Georg Klamminger, Felix Bruno Kleine Borgmann, Katrin B. M. Frauenknecht, Michel Mittelbronn, Petr V. Nazarov
for: This paper provides a comprehensive comparison of transfer learning strategies and deep learning architectures for computer-aided classification of adult-type diffuse gliomas.
results: The studied methods improve histopathological image classification performance while reducing the pathologists' annotation effort. The paper also provides a visualization tool that generates heatmaps at the whole-slide-image level, highlighting the most informative regions for pathologists.
Abstract
We provide in this paper a comprehensive comparison of various transfer learning strategies and deep learning architectures for computer-aided classification of adult-type diffuse gliomas. We evaluate the generalizability of out-of-domain ImageNet representations for a target domain of histopathological images, and study the impact of in-domain adaptation using self-supervised and multi-task learning approaches for pretraining the models using the medium-to-large scale datasets of histopathological images. A semi-supervised learning approach is furthermore proposed, where the fine-tuned models are utilized to predict the labels of unannotated regions of the whole slide images (WSI). The models are subsequently retrained using the ground-truth labels and weak labels determined in the previous step, providing superior performance in comparison to standard in-domain transfer learning with balanced accuracy of 96.91% and F1-score 97.07%, and minimizing the pathologist's efforts for annotation. Finally, we provide a visualization tool working at WSI level which generates heatmaps that highlight tumor areas; thus, providing insights to pathologists concerning the most informative parts of the WSI.
When Epipolar Constraint Meets Non-local Operators in Multi-View Stereo
paper_authors: Tianqi Liu, Xinyi Ye, Weiyue Zhao, Zhiyu Pan, Min Shi, Zhiguo Cao
for: This paper proposes a new method for multi-view stereo (MVS) reconstruction, designed to improve the efficiency and accuracy of the feature matching process.
methods: The proposed method, ET-MVSNet, uses a novel non-local feature augmentation strategy based on epipolar geometry, which reduces the 2D search space to the epipolar line in stereo matching.
results: ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmarks with high efficiency.
Abstract
Learning-based multi-view stereo (MVS) method heavily relies on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS. Each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point only attends the corresponding pair of epipolar lines. Our idea takes inspiration from the classic epipolar geometry, which shows that one point with different depth hypotheses will be projected to the epipolar line on the other view. This constraint reduces the 2D search space into the epipolar line in stereo matching. Similarly, this suggests that the matching of MVS is to distinguish a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmark with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet.
Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing
paper_authors: Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
for: 实时对资源有限的设备进行检查和识别,以减少对于调用运算的需求。
methods: 使用构造化排除法,将网络中的浮点运算减少到最小化,而不需要特定的训练或调整程序。
results: 在CIFAR-10和ImageNet等受欢迎的视觉标准库上,实现了对网络的减少,并且仅对网络的调用时间进行快速对应。 Specifically, 在CIFAR-10上,将ResNet34中的条件减少46.72%,仅对网络的精度下降1.25%。Abstract
To reduce the computational cost of convolutional neural networks (CNNs) for usage on resource-constrained devices, structured pruning approaches have shown promising results, drastically reducing floating-point operations (FLOPs) without substantial drops in accuracy. However, most recent methods require fine-tuning or specific training procedures to achieve a reasonable trade-off between retained accuracy and reduction in FLOPs. This introduces additional cost in the form of computational overhead and requires training data to be available. To this end, we propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module. It instantly reduces the network's test-time inference cost without requiring any training or fine-tuning. We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH) to detect redundancies in the channel dimension. Similar channels are aggregated to reduce the input and filter depth simultaneously, allowing for cheaper convolutions. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet. In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
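The LSH grouping idea can be sketched as random-hyperplane hashing of channels followed by bucket averaging; hashing on batch-averaged activations and plain averaging are illustrative simplifications of the actual HASTE module.

```python
import torch

def lsh_merge_channels(x, n_bits=4, seed=0):
    """x: (B, C, H, W) -> (B, C', H, W) with similar channels averaged."""
    B, C, H, W = x.shape
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(H * W, n_bits, generator=g)     # random hyperplanes
    feats = x.mean(dim=0).reshape(C, -1)                 # one row per channel
    bits = (feats @ planes > 0).long()                   # sign pattern per channel
    codes = (bits * (2 ** torch.arange(n_bits))).sum(1)  # bucket id per channel

    # Channels hashed into the same bucket are redundant: average them,
    # shrinking the input depth of the following convolution.
    merged = [x[:, codes == c].mean(dim=1) for c in codes.unique()]
    return torch.stack(merged, dim=1)
```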
Robots That Can See: Leveraging Human Pose for Trajectory Prediction
results: The model predicts future human trajectories and achieves state-of-the-art performance on common prediction benchmarks and on a human tracking dataset captured from a mobile robot. The authors also identify new agents with limited historical data as a major contributor to error, and show that 3D skeletal keypoints reduce prediction error in such challenging scenarios.
Abstract
Anticipating the motion of all humans in dynamic environments such as homes and offices is critical to enable safe and effective robot navigation. Such spaces remain challenging as humans do not follow strict rules of motion and there are often multiple occluded entry points such as corners and doors that create opportunities for sudden encounters. In this work, we present a Transformer based architecture to predict human future trajectories in human-centric environments from input features including human positions, head orientations, and 3D skeletal keypoints from onboard in-the-wild sensory information. The resulting model captures the inherent uncertainty for future human trajectory prediction and achieves state-of-the-art performance on common prediction benchmarks and a human tracking dataset captured from a mobile robot adapted for the prediction task. Furthermore, we identify new agents with limited historical data as a major contributor to error and demonstrate the complementary nature of 3D skeletal poses in reducing prediction error in such challenging scenarios.
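A minimal sketch of a Transformer-based trajectory predictor over pose-augmented inputs follows; the input layout (2D position, head yaw, 17 flattened 3D keypoints), model width, and prediction horizon are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    def __init__(self, in_dim=2 + 1 + 17 * 3, d_model=128, horizon=12):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, horizon * 2)
        self.horizon = horizon

    def forward(self, hist):
        """hist: (B, T, in_dim) observed steps -> (B, horizon, 2) offsets."""
        h = self.encoder(self.embed(hist))
        return self.head(h[:, -1]).view(-1, self.horizon, 2)
```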
Towards Complex-query Referring Image Segmentation: A Novel Benchmark
results: Experimental results show that the \textsc{DuMoGa} method performs strongly on RIS-CQ, with clear improvements across datasets and models.
Abstract
Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms. However, there has been a lack of research investigating how existing algorithms should be benchmarked with complex language queries, which include more informative descriptions of surrounding objects and backgrounds (\eg \textit{"the black car."} vs. \textit{"the black car is parking on the road and beside the bus."}). Given the significant improvement in the semantic understanding capability of large pre-trained models, it is crucial to take a step further in RIS by incorporating complex language that resembles real-world applications. To close this gap, building upon the existing RefCOCO and Visual Genome datasets, we propose a new RIS benchmark with complex queries, namely \textbf{RIS-CQ}. The RIS-CQ dataset is of high quality and large scale, which challenges the existing RIS with enriched, specific and informative queries, and enables a more realistic scenario of RIS research. Besides, we present a niche-targeting method to better tackle RIS-CQ, called the dual-modality graph alignment model (\textbf{\textsc{DuMoGa}}), which outperforms a series of RIS methods.
A Survey of Incremental Transfer Learning: Combining Peer-to-Peer Federated Learning and Domain Incremental Learning for Multicenter Collaboration
results: The paper develops deep learning models for multicenter collaboration and evaluates the impact of different regularization-based continual learning methods. The study examines how data heterogeneity, classifier head settings, network optimizers, model initialization, center order, and weight transfer type influence the results.
Abstract
Due to data privacy constraints, data sharing among multiple clinical centers is restricted, which impedes the development of high performance deep learning models from multicenter collaboration. Naive weight transfer methods share intermediate model weights without raw data and hence can bypass data privacy restrictions. However, performance drops are typically observed when the model is transferred from one center to the next because of the forgetting problem. Incremental transfer learning, which combines peer-to-peer federated learning and domain incremental learning, can overcome the data privacy issue and meanwhile preserve model performance by using continual learning techniques. In this work, a conventional domain/task incremental learning framework is adapted for incremental transfer learning. A comprehensive survey on the efficacy of different regularization-based continual learning methods for multicenter collaboration is performed. The influences of data heterogeneity, classifier head setting, network optimizer, model initialization, center order, and weight transfer type have been investigated thoroughly. Our framework is publicly accessible to the research community for further development.
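One regularization-based continual-learning option covered by such a survey is an EWC-style quadratic penalty, sketched below; the Fisher weighting and the `lam` strength are illustrative assumptions.

```python
import torch

def ewc_penalty(model, prev_params, fisher, lam=100.0):
    """Penalize deviation from the previous center's weights.

    prev_params and fisher map parameter names to tensors saved after
    training at the previous center (fisher approximates importance).
    """
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - prev_params[name]) ** 2).sum()
    return lam * loss

# Training at the next center: total_loss = task_loss + ewc_penalty(...)
```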
RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation
results: Outperforms the previous SOTA model while using only 10% of the parameters and 18% of the MACs.
Abstract
Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
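Independent modelling of the two axes can be sketched with one RNN scanning across time for each frequency bin and another scanning across frequency for each frame; the channel size and residual connections below are illustrative, and the paper's full RTFS blocks, attention-based fusion, and mask estimation are not reproduced.

```python
import torch
import torch.nn as nn

class TimeFreqRNN(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.time_rnn = nn.LSTM(channels, channels // 2,
                                bidirectional=True, batch_first=True)
        self.freq_rnn = nn.LSTM(channels, channels // 2,
                                bidirectional=True, batch_first=True)

    def forward(self, x):
        """x: (B, C, T, F) features over the time-frequency bins."""
        B, C, T, F = x.shape
        t_in = x.permute(0, 3, 2, 1).reshape(B * F, T, C)  # scan along time
        t_out, _ = self.time_rnn(t_in)
        x = t_out.reshape(B, F, T, C).permute(0, 3, 2, 1) + x

        f_in = x.permute(0, 2, 3, 1).reshape(B * T, F, C)  # scan along freq
        f_out, _ = self.freq_rnn(f_in)
        return f_out.reshape(B, T, F, C).permute(0, 3, 1, 2) + x
```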
TBD Pedestrian Data Collection: Towards Rich, Portable, and Large-Scale Natural Pedestrian Data
results: The system enables large-scale data collection in diverse environments and fast production of trajectory labels. Compared with existing pedestrian data collection methods, it has three distinguishing components: a combination of top-down and ego-centric views, natural human behavior in the presence of a socially appropriate robot, and human-verified labels.
Abstract
Social navigation and pedestrian behavior research has shifted towards machine learning-based methods and converged on the topic of modeling inter-pedestrian interactions and pedestrian-robot interactions. For this, large-scale datasets that contain rich information are needed. We describe a portable data collection system, coupled with a semi-autonomous labeling pipeline. As part of the pipeline, we designed a label correction web app that facilitates human verification of automated pedestrian tracking outcomes. Our system enables large-scale data collection in diverse environments and fast trajectory label production. Compared with existing pedestrian data collection methods, our system contains three components: a combination of top-down and ego-centric views, natural human behavior in the presence of a socially appropriate "robot", and human-verified labels grounded in the metric space. To the best of our knowledge, no prior data collection system has a combination of all three components. We further introduce our ever-expanding dataset from the ongoing data collection effort -- the TBD Pedestrian Dataset and show that our collected data is larger in scale, contains richer information when compared to prior datasets with human-verified labels, and supports new research opportunities.
TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields
results: The method surpasses prior approaches in several respects, including a larger vocabulary, better text consistency, and lower latency. Experimental results show that TextField3D enables open-vocabulary 3D generation.
Abstract
Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest to inject dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.
Domain-Adaptive Learning: Unsupervised Adaptation for Histology Images with Improved Loss Function Combination
paper_authors: Ravi Kant Gupta, Shounak Das, Amit Sethi
for: This study proposes a new unsupervised domain adaptation (UDA) method for Hematoxylin & Eosin (H&E) stained histology images; existing adversarial domain adaptation methods may fail to effectively align the multimodal distributions of different domains.
methods: Our approach introduces a new loss function, combined with carefully selected existing losses, tailored to the specific challenges of histology images. We exploit histology-specific features, such as tissue structure and cell morphology, to enhance adaptation performance.
results: Our method excels in accuracy, robustness, and generalization, surpassing the state of the art. Extensive experiments on the FHIST dataset show that the proposed Domain Adaptive Learning (DAL) method surpasses ViT-based and CNN-based SoTA methods by 1.41% and 6.56%, respectively.
Abstract
This paper presents a novel approach for unsupervised domain adaptation (UDA) targeting H&E stained histology images. Existing adversarial domain adaptation methods may not effectively align different domains of multimodal distributions associated with classification problems. The objective is to enhance domain alignment and reduce domain shifts between these domains by leveraging their unique characteristics. Our approach proposes a novel loss function along with carefully selected existing loss functions tailored to address the challenges specific to histology images. This loss combination not only makes the model accurate and robust but also faster in terms of training convergence. We specifically focus on leveraging histology-specific features, such as tissue structure and cell morphology, to enhance adaptation performance in the histology domain. The proposed method is extensively evaluated in accuracy, robustness, and generalization, surpassing state-of-the-art techniques for histology images. We conducted extensive experiments on the FHIST dataset and the results show that our proposed method - Domain Adaptive Learning (DAL) significantly surpasses the ViT-based and CNN-based SoTA methods by 1.41% and 6.56% respectively.
Retail-786k: a Large-Scale Dataset for Visual Entity Matching
results: Based on a price comparison task, each entity forms an equivalence class of comparable products, yet standard image-based classification and retrieval algorithms cannot sufficiently solve this problem. New methods are therefore needed that can transfer example-based visual equivalence classes to new data; the aim of this work is to provide a benchmark for such algorithms.
Abstract
Entity Matching (EM) defines the task of learning to group objects by transferring semantic concepts from example groups (=entities) to unseen data. Despite the general availability of image data in the context of many EM-problems, most currently available EM-algorithms solely rely on (textual) meta data. In this paper, we introduce the first publicly available large-scale dataset for "visual entity matching", based on a production level use case in the retail domain. Using scanned advertisement leaflets, collected over several years from different European retailers, we provide a total of ~786k manually annotated, high resolution product images containing ~18k different individual retail products which are grouped into ~3k entities. The annotation of these product entities is based on a price comparison task, where each entity forms an equivalence class of comparable products. Following on a first baseline evaluation, we show that the proposed "visual entity matching" constitutes a novel learning problem which can not sufficiently be solved using standard image based classification and retrieval algorithms. Instead, novel approaches which allow to transfer example based visual equivalent classes to new data are needed to address the proposed problem. The aim of this paper is to provide a benchmark for such algorithms. Information about the dataset, evaluation code and download instructions are provided under https://www.retail-786k.org/.
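For orientation, a minimal nearest-prototype baseline for visual entity matching might look as follows: embed product images, average each entity's embeddings into a prototype, and assign a query to the closest prototype. The paper argues that exactly such standard retrieval baselines fall short on this task; the random embeddings below stand in for a real image encoder.

```python
import numpy as np

def entity_prototypes(embeddings: np.ndarray, entity_ids: np.ndarray) -> dict:
    """Average the embeddings of each entity (an equivalence class of
    comparable products) into a single prototype vector."""
    protos = {}
    for eid in np.unique(entity_ids):
        protos[eid] = embeddings[entity_ids == eid].mean(axis=0)
    return protos

def match_entity(query: np.ndarray, protos: dict):
    """Assign a query embedding to the entity with the nearest
    (cosine-most-similar) prototype."""
    ids = list(protos)
    mat = np.stack([protos[i] for i in ids])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return ids[int(np.argmax(mat @ q))]

# toy usage with random 128-d "image embeddings" for 3 entities
rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 128))
eids = np.repeat([0, 1, 2], 10)
protos = entity_prototypes(emb, eids)
print(match_entity(emb[5], protos))  # typically entity 0, the query's own group
```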
APNet: Urban-level Scene Segmentation of Aerial Images and Point Clouds
results: Our experiments show that the fusion output consistently outperforms each individual network branch, and APNet achieves state-of-the-art performance of 65.2 mIoU on the SensatUrban dataset.
Abstract
In this paper, we focus on a semantic segmentation method for point clouds of urban scenes. Our fundamental concept revolves around the collaborative utilization of diverse scene representations to benefit from different context information and network architectures. To this end, the proposed network architecture, called APNet, is split into two branches: a point cloud branch and an aerial image branch whose input is generated from a point cloud. To leverage the different properties of each branch, we employ a geometry-aware fusion module that is learned to combine the results of each branch. Separate losses for each branch prevent one branch from dominating the results, ensure the best performance for each branch individually, and explicitly define the input domain of the fusion network, ensuring it only performs data fusion. Our experiments demonstrate that the fusion output consistently outperforms the individual network branches and that APNet achieves state-of-the-art performance of 65.2 mIoU on the SensatUrban dataset. Upon acceptance, the source code will be made accessible.
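A minimal sketch of the fusion step: per-point logits from the two branches are combined with point-wise weights predicted by a small learned gate. The gate design and layer sizes are illustrative assumptions; the paper's geometry-aware fusion module is more involved.

```python
import torch
import torch.nn as nn

class LearnedFusion(nn.Module):
    """Combine per-point class logits from two branches using
    point-wise weights predicted from the concatenated logits."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * num_classes, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Softmax(dim=-1),
        )

    def forward(self, logits_pc, logits_img):
        # per-point convex combination of the two branch predictions
        w = self.gate(torch.cat([logits_pc, logits_img], dim=-1))
        return w[..., :1] * logits_pc + w[..., 1:] * logits_img

# usage: 1024 points, 13 semantic classes
fuse = LearnedFusion(13)
out = fuse(torch.randn(1024, 13), torch.randn(1024, 13))
print(out.shape)  # torch.Size([1024, 13])
```

Keeping separate per-branch losses alongside such a gate, as the paper describes, ensures neither input to the fusion module degenerates during training.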
Redistributing the Precision and Content in 3D-LUT-based Inverse Tone-mapping for HDR/WCG Display
results: Experimental results show that the method reduces conversion error and delivers better visual results across different display devices.
Abstract
ITM (inverse tone-mapping) converts SDR (standard dynamic range) footage to HDR/WCG (high dynamic range / wide color gamut) for media production. It happens not only when remastering legacy SDR footage at front-end content providers, but also when adapting on-the-air SDR services on user-end HDR displays. The latter requires more efficiency, so the pre-calculated LUT (look-up table) has become a popular solution. Yet, a conventional fixed LUT lacks adaptability, so we draw on the research community and combine it with AI. Meanwhile, higher-bit-depth HDR/WCG requires a larger LUT than SDR, so we consult traditional ITM for an efficiency-performance trade-off: we use 3 smaller LUTs, each with non-uniform packing (precision), respectively denser in the dark, middle, and bright luma ranges. In this case, each result has less error only in its own range, so we use a contribution map to combine their best parts into the final result. With the guidance of this map, the elements (content) of the 3 LUTs are also redistributed during training. We conduct ablation studies to verify the method's effectiveness, and subjective and objective experiments to show its practicability. Code is available at: https://github.com/AndreGuo/ITMLUT.
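A rough sketch of the lookup-and-combine step: each of the three LUTs, packed densest in its own luma range, is applied to the SDR input, and a contribution map blends the three outputs. The nearest-vertex lookup, the packing functions, and the Gaussian contribution map are simplifying assumptions here (real ITM LUTs interpolate, and the paper's contribution map is learned).

```python
import numpy as np

def apply_3d_lut(img, lut, pack):
    """Apply a 3D LUT with a non-uniform packing function.

    img  : (H, W, 3) floats in [0, 1] (SDR RGB)
    lut  : (N, N, N, 3) table of HDR outputs
    pack : maps [0, 1] -> [0, 1]; sampling is denser where its slope is steep
    Nearest-vertex lookup for brevity (real ITM LUTs interpolate).
    """
    n = lut.shape[0]
    idx = np.clip(np.round(pack(img) * (n - 1)).astype(int), 0, n - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]

# three small LUTs, each packed denser in the dark / middle / bright range
packs = [lambda x: x ** 0.5, lambda x: x, lambda x: x ** 2.0]
luts = [np.random.rand(17, 17, 17, 3) for _ in packs]  # stand-in tables

img = np.random.rand(64, 64, 3)
outs = [apply_3d_lut(img, lut, p) for lut, p in zip(luts, packs)]

# contribution map: soft assignment of each pixel's luma to the
# dark / middle / bright expert (illustrative; learned in practice)
luma = img.mean(axis=-1, keepdims=True)
centers = np.array([0.15, 0.5, 0.85])
w = np.exp(-((luma - centers) ** 2) / 0.02)   # (H, W, 3 experts)
w = w / w.sum(axis=-1, keepdims=True)
hdr = sum(w[..., i:i + 1] * outs[i] for i in range(3))
print(hdr.shape)  # (64, 64, 3)
```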
HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field
for: Addresses the problem of modeling an animatable 3D human head avatar under light-weight setups, which has not been well solved.
methods: Introduces a novel hybrid explicit-implicit 3D representation, Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF and the prior information from the parametric template.
results: Achieves state-of-the-art performance for 3D head avatar animation, with high-resolution, realistic, and view-consistent synthesis of dynamic head appearance.Abstract
The problem of modeling an animatable 3D human head avatar under light-weight setups is of significant importance but has not been well solved. Existing 3D representations either perform well in the realism of portrait images synthesis or the accuracy of expression control, but not both. To address the problem, we introduce a novel hybrid explicit-implicit 3D representation, Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF and the prior information from the parametric template. At the core of our representation, a synthetic-renderings-based condition method is proposed to fuse the prior information from the parametric model into the implicit field without constraining its topological flexibility. Besides, based on the hybrid representation, we properly overcome the inconsistent shape issue presented in existing methods and improve the animation stability. Moreover, by adopting an overall GAN-based architecture using an image-to-image translation network, we achieve high-resolution, realistic and view-consistent synthesis of dynamic head appearance. Experiments demonstrate that our method can achieve state-of-the-art performance for 3D head avatar animation compared with previous methods.
Reconstruction of Patient-Specific Confounders in AI-based Radiologic Image Interpretation using Generative Pretraining
paper_authors: Tianyu Han, Laura Žigutytė, Luisa Huck, Marc Huppertz, Robert Siepmann, Yossi Gandelsman, Christian Blüthgen, Firas Khader, Christiane Kuhl, Sven Nebelung, Jakob Kather, Daniel Truhn
results: We find that DiffChest achieves high inter-reader agreement, with Fleiss' Kappa values of 0.8 or higher for most imaging findings. DiffChest accurately captures confounding factors at prevalence rates of 11.1% to 100%. Moreover, our pretraining process optimizes the model to capture the most relevant information from the input radiographs. DiffChest achieves excellent diagnostic accuracy for 11 chest conditions and at least sufficient diagnostic accuracy for the remaining conditions.
Abstract
Detecting misleading patterns in automated diagnostic assistance systems, such as those powered by Artificial Intelligence, is critical to ensuring their reliability, particularly in healthcare. Current techniques for evaluating deep learning models cannot visualize confounding factors at a diagnostic level. Here, we propose a self-conditioned diffusion model termed DiffChest and train it on a dataset of 515,704 chest radiographs from 194,956 patients from multiple healthcare centers in the United States and Europe. DiffChest explains classifications on a patient-specific level and visualizes the confounding factors that may mislead the model. We found high inter-reader agreement when evaluating DiffChest's capability to identify treatment-related confounders, with Fleiss' Kappa values of 0.8 or higher across most imaging findings. Confounders were accurately captured with 11.1% to 100% prevalence rates. Furthermore, our pretraining process optimized the model to capture the most relevant information from the input radiographs. DiffChest achieved excellent diagnostic accuracy when diagnosing 11 chest conditions, such as pleural effusion and cardiac insufficiency, and at least sufficient diagnostic accuracy for the remaining conditions. Our findings highlight the potential of pretraining based on diffusion models in medical image classification, specifically in providing insights into confounding factors and model robustness.
Continual Action Assessment via Task-Consistent Score-Discriminative Feature Distribution Modeling
results: Our method performs well across multiple tasks and action types, mitigates forgetting, and is comparatively effective and flexible.
Abstract
Action Quality Assessment (AQA) is a task that tries to answer how well an action is carried out. While remarkable progress has been achieved, existing works on AQA assume that all the training data are visible for training in one time, but do not enable continual learning on assessing new technical actions. In this work, we address such a Continual Learning problem in AQA (Continual-AQA), which urges a unified model to learn AQA tasks sequentially without forgetting. Our idea for modeling Continual-AQA is to sequentially learn a task-consistent score-discriminative feature distribution, in which the latent features express a strong correlation with the score labels regardless of the task or action types. From this perspective, we aim to mitigate the forgetting in Continual-AQA from two aspects. Firstly, to fuse the features of new and previous data into a score-discriminative distribution, a novel Feature-Score Correlation-Aware Rehearsal is proposed to store and reuse data from previous tasks with limited memory size. Secondly, an Action General-Specific Graph is developed to learn and decouple the action-general and action-specific knowledge so that the task-consistent score-discriminative features can be better extracted across various tasks. Extensive experiments are conducted to evaluate the contributions of proposed components. The comparisons with the existing continual learning methods additionally verify the effectiveness and versatility of our approach.
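As a rough sketch of the rehearsal side: a small fixed-capacity memory keeps exemplars from earlier tasks and mixes them into each new-task batch, so the score-discriminative feature distribution is fit jointly over old and new data. Reservoir sampling below is a placeholder selection rule; the paper's Feature-Score Correlation-Aware Rehearsal decides what to store using feature-score correlation.

```python
import random

class RehearsalBuffer:
    """Fixed-size exemplar memory for continual learning (reservoir sampling)."""
    def __init__(self, capacity: int = 200):
        self.capacity = capacity
        self.data = []   # (features, score) pairs from past tasks
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # keep each seen sample with probability capacity / seen
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def replay(self, k: int):
        return random.sample(self.data, min(k, len(self.data)))

# usage: mix a new-task batch with replayed old-task exemplars
buf = RehearsalBuffer(capacity=4)
for i in range(10):
    buf.add((f"feat_{i}", float(i)))
new_batch = [("feat_new", 9.5)]
train_batch = new_batch + buf.replay(k=2)
print(train_batch)
```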
Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification
for: Addresses the practical issue of incomplete text-based person re-identification (ReID) in real-world applications, where person images and text descriptions are not completely matched and contain partially missing modality data.
methods: Proposes a novel Prototype-guided Cross-modal Completion and Alignment (PCCA) framework, which includes cross-modal nearest neighbor construction, relation graphs, and prototype-aware cross-modal alignment loss to handle incomplete data.
results: Consistently outperforms state-of-the-art text-image ReID approaches on several benchmarks with different missing ratios, demonstrating the effectiveness of the proposed method in handling incomplete data.Abstract
Traditional text-based person re-identification (ReID) techniques heavily rely on fully matched multi-modal data, which is an ideal scenario. However, due to inevitable data missing and corruption during the collection and processing of cross-modal data, the incomplete data issue is usually met in real-world applications. Therefore, we consider a more practical task termed the incomplete text-based ReID task, where person images and text descriptions are not completely matched and contain partially missing modality data. To this end, we propose a novel Prototype-guided Cross-modal Completion and Alignment (PCCA) framework to handle the aforementioned issues for incomplete text-based ReID. Specifically, we cannot directly retrieve person images based on a text query on missing modality data. Therefore, we propose the cross-modal nearest neighbor construction strategy for missing data by computing the cross-modal similarity between existing images and texts, which provides key guidance for the completion of missing modal features. Furthermore, to efficiently complete the missing modal features, we construct the relation graphs with the aforementioned cross-modal nearest neighbor sets of missing modal data and the corresponding prototypes, which can further enhance the generated missing modal features. Additionally, for tighter fine-grained alignment between images and texts, we raise a prototype-aware cross-modal alignment loss that can effectively reduce the modality heterogeneity gap for better fine-grained alignment in common space. Extensive experimental results on several benchmarks with different missing ratios amply demonstrate that our method can consistently outperform state-of-the-art text-image ReID approaches.
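A minimal sketch of the cross-modal nearest-neighbor construction: for a text whose paired image is missing, retrieve the most cross-modally similar existing image features and use them (here, their average) as a stand-in for the missing modality. Averaging is a simplification; PCCA further refines the completion with relation graphs and prototypes.

```python
import torch
import torch.nn.functional as F

def complete_missing_image(text_feat, image_bank, k=5):
    """Approximate a missing image feature from a text query.

    text_feat  : (D,) embedding of the text whose image is missing
    image_bank : (N, D) embeddings of existing images (shared space)
    Returns the mean of the k most cross-modally similar image features.
    """
    sims = F.cosine_similarity(image_bank, text_feat.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices
    return image_bank[topk].mean(dim=0)

# usage with random features in a 256-d shared embedding space
bank = F.normalize(torch.randn(1000, 256), dim=-1)
t = F.normalize(torch.randn(256), dim=-1)
fake_img_feat = complete_missing_image(t, bank, k=5)
print(fake_img_feat.shape)  # torch.Size([256])
```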
Guiding Instruction-based Image Editing via Multimodal Large Language Models
results: Experimental results show that MLLM guidance notably improves automatic metrics and human evaluation of image editing while maintaining competitive inference efficiency.
Abstract
Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
for: Improving the reliability of cross-modal retrieval methods by quantifying the uncertainty arising from inherent data ambiguity.
methods: The paper proposes a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework, which constructs learnable prototypes for each modality and uses Dempster-Shafer Theory and Subjective Logic Theory to build an evidential theoretical framework.
results: The paper demonstrates the effectiveness of the PAU model through extensive experiments on four major benchmark datasets, achieving accurate uncertainty estimates and reliable predictions for cross-modal retrieval.
Abstract
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity. Concretely, we first construct a set of various learnable prototypes for each modality to represent the entire semantics subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet Distribution parameters. The PAU model induces accurate uncertainty and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
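The evidential core can be sketched in a few lines: non-negative evidence is mapped to Dirichlet parameters alpha = evidence + 1, and subjective logic yields per-class beliefs b_k = e_k / S and an uncertainty mass u = K / S, with S = sum(alpha), so beliefs and u sum to one. The softplus head below is a placeholder assumption; in PAU the evidence derives from similarities to the learnable prototypes.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(evidence: torch.Tensor):
    """Subjective-logic uncertainty from non-negative evidence.

    evidence : (B, K) non-negative evidence per class
    Returns (belief, uncertainty); beliefs plus u sum to 1 per sample.
    """
    alpha = evidence + 1.0                      # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)  # Dirichlet strength S
    belief = evidence / strength                # b_k = e_k / S
    u = evidence.shape[-1] / strength           # u = K / S
    return belief, u

# usage: evidence from a placeholder softplus head over logits
logits = torch.randn(2, 10)
evidence = F.softplus(logits)
b, u = dirichlet_uncertainty(evidence)
print(u.squeeze(-1))                   # higher u = more aleatoric ambiguity
print(b.sum(-1) + u.squeeze(-1))       # ~1.0 per sample
```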
SegRCDB: Semantic Segmentation via Formula-Driven Supervised Learning
paper_authors: Risa Shinoda, Ryo Hayamizu, Kodai Nakashima, Nakamasa Inoue, Rio Yokota, Hirokatsu Kataoka
for: This paper aims to improve the training of vision models so that they can be trained effectively with a limited number of labeled images.
methods: The paper introduces a new dataset, SegRCDB, based on formula-driven supervised learning, which enables pre-training for semantic segmentation without real images or manual semantic annotations.
results: Pre-training with SegRCDB achieved higher mIoU than pre-training with COCO-Stuff when fine-tuning on ADE-20k and Cityscapes with the same number of training images, showing SegRCDB's potential to support semantic segmentation pre-training and research.
Abstract
Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what matters in semantic segmentation pre-training has not been fully investigated. In this paper, we propose the Segmentation Radial Contour DataBase (SegRCDB), which for the first time applies formula-driven supervised learning for semantic segmentation. SegRCDB enables pre-training for semantic segmentation without real images or any manual semantic labels. SegRCDB is based on insights about what is important in pre-training for semantic segmentation and allows efficient pre-training. Pre-training with SegRCDB achieved higher mIoU than the pre-training with COCO-Stuff for fine-tuning on ADE-20k and Cityscapes with the same number of training images. SegRCDB has a high potential to contribute to semantic segmentation pre-training and investigation by enabling the creation of large datasets without manual annotation. The SegRCDB dataset will be released under a license that allows research and commercial use. Code is available at: https://github.com/dahlian00/SegRCDB
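A toy sketch of the formula-driven idea: images and pixel-perfect masks are rendered from a mathematical rule (here, random radial contours, i.e. star-shaped polygons), so supervision comes for free without photos or annotators. The shape counts, radii, and class assignment below are illustrative assumptions, not SegRCDB's actual generation rules.

```python
import numpy as np

def radial_contour_mask(size=128, classes=4, shapes=6, seed=0):
    """Render random radial-contour shapes into an image and a
    semantic mask, with no real photos or manual labels involved."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size), dtype=np.float32)
    mask = np.zeros((size, size), dtype=np.int64)
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(shapes):
        cy, cx = rng.integers(16, size - 16, size=2)
        base_r = rng.uniform(8, 24)
        freq = rng.integers(3, 9)            # number of contour "petals"
        amp = rng.uniform(0.1, 0.5)
        cls = rng.integers(1, classes + 1)   # 0 is background
        theta = np.arctan2(yy - cy, xx - cx)
        r = np.hypot(yy - cy, xx - cx)
        boundary = base_r * (1 + amp * np.cos(freq * theta))
        inside = r <= boundary
        img[inside] = rng.uniform(0.3, 1.0)
        mask[inside] = cls
    return img, mask

img, mask = radial_contour_mask()
print(img.shape, mask.max())  # (128, 128) and a class id <= 4
```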
Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications
paper_authors: Vladislav Dordiuk, Maksim Dzhigil, Konstantin Ushenin
for: This study investigates how mirror weight symmetry can improve model accuracy and reduce the parameter count in 3D mesh segmentation tasks.
methods: The study uses convolutional neural networks and enforces symmetry on the weights to improve generalization.
results: The study finds that enforcing weight symmetry improves model accuracy, reduces the number of trainable parameters, and even allows training on very small training sets.
Abstract
3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variations in organ positions. It allows us to expect a positive effect of rotation and inversion invariant layers in convolutional neural networks that perform biomedical segmentations. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as sampling from the signed distance function, and the neural network performs prediction for each mesh node. We show that weight symmetry gains from 1 to 3% of additional accuracy and allows decreasing the number of trainable parameters up to 8 times without suffering the performance loss if neural networks have at least three convolutional layers. This also works for very small training sets.
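One simple way to realize mirror weight symmetry, sketched for a 2D convolution: re-parameterize the effective kernel as the average of the raw weight and its mirrored copy, which makes the layer equivariant to flips and roughly halves the free parameters along the mirrored axis. Applying this to mesh or point operators, and the choice of mirror axis, are application-specific assumptions.

```python
import torch
import torch.nn as nn

class MirrorSymmetricConv2d(nn.Conv2d):
    """Conv2d whose effective kernel is symmetric under a horizontal
    flip: W_eff = (W + flip_w(W)) / 2. This roughly halves the number
    of free parameters per kernel along the mirrored axis."""
    def forward(self, x):
        w = 0.5 * (self.weight + torch.flip(self.weight, dims=[-1]))
        return self._conv_forward(x, w, self.bias)

# usage: the layer's output commutes with mirroring of the input
conv = MirrorSymmetricConv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)
y1 = conv(x)
y2 = torch.flip(conv(torch.flip(x, dims=[-1])), dims=[-1])
print(torch.allclose(y1, y2, atol=1e-5))  # True: equivariance to flips
```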
DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation
results: Extensive experiments on multiple datasets demonstrate a strong balance between the performance and efficiency of full-layer models, and the method even brings additional performance gains to baseline models. Code and models are publicly released for reproduction.
Abstract
Diffusion models achieve great success in generating diverse and high-fidelity images. The performance improvements come with low generation speed per image, which hinders the application diffusion models in real-time scenarios. While some certain predictions benefit from the full computation of the model in each sample iteration, not every iteration requires the same amount of computation, potentially leading to computation waste. In this work, we propose DeeDiff, an early exiting framework that adaptively allocates computation resources in each sampling step to improve the generation efficiency of diffusion models. Specifically, we introduce a timestep-aware uncertainty estimation module (UEM) for diffusion models which is attached to each intermediate layer to estimate the prediction uncertainty of each layer. The uncertainty is regarded as the signal to decide if the inference terminates. Moreover, we propose uncertainty-aware layer-wise loss to fill the performance gap between full models and early-exited models. With such loss strategy, our model is able to obtain comparable results as full-layer models. Extensive experiments of class-conditional, unconditional, and text-guided generation on several datasets show that our method achieves state-of-the-art performance and efficiency trade-off compared with existing early exiting methods on diffusion models. More importantly, our method even brings extra benefits to baseline models and obtains better performance on CIFAR-10 and Celeb-A datasets. Full code and model are released for reproduction.
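The early-exit control flow can be sketched as follows: every block carries a small uncertainty head, and inference stops at the first block whose estimated uncertainty drops below a threshold, using that block's output head for the prediction. Block widths, head designs, and the threshold are illustrative assumptions, not DeeDiff's exact architecture.

```python
import torch
import torch.nn as nn

class EarlyExitBackbone(nn.Module):
    """Stack of blocks, each with an output head and an uncertainty
    head; inference exits at the first sufficiently confident block."""
    def __init__(self, dim=64, depth=6, out_dim=64):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)])
        self.out_heads = nn.ModuleList(
            [nn.Linear(dim, out_dim) for _ in range(depth)])
        self.unc_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid()) for _ in range(depth)])

    @torch.no_grad()
    def forward(self, h, threshold=0.2):
        for i, block in enumerate(self.blocks):
            h = block(h)
            u = self.unc_heads[i](h).mean()   # scalar uncertainty estimate
            if u < threshold or i == len(self.blocks) - 1:
                return self.out_heads[i](h), i  # prediction + exit layer

model = EarlyExitBackbone()
pred, exit_layer = model(torch.randn(1, 64))
print(pred.shape, "exited at layer", exit_layer)
```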
GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation
results: We achieve state-of-the-art performance on the KITTI dataset, together with efficient cue fusion speed.
Abstract
Depth estimation provides an alternative approach for perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues and specializing in either static or dynamic scenes. Recently, these cues fusion becomes an attractive topic, aiming to enable the combined cues to perform well in both types of scenes. However, adaptive cue fusion relies on attention mechanisms, where the quadratic complexity limits the granularity of cue representation. Additionally, explicit cue fusion depends on precise segmentation, which imposes a heavy burden on mask prediction. To address these issues, we propose the GSDC Transformer, an efficient and effective component for cue fusion in monocular multi-frame depth estimation. We utilize deformable attention to learn cue relationships at a fine scale, while sparse attention reduces computational requirements when granularity increases. To compensate for the precision drop in dynamic scenes, we represent scene attributes in the form of super tokens without relying on precise shapes. Within each super token attributed to dynamic scenes, we gather its relevant cues and learn local dense relationships to enhance cue fusion. Our method achieves state-of-the-art performance on the KITTI dataset with efficient fusion speed.
Imagery Dataset for Condition Monitoring of Synthetic Fibre Ropes
paper_authors: Anju Rani, Daniel O. Arroyo, Petar Durdevic
for: automated visual inspection of synthetic fibre ropes (SFRs) to detect defects and assess remaining useful life (RUL)
methods: computer vision applications, including object detection, classification, and segmentation
results: a comprehensive dataset of 6,942 raw images representing both normal and defective SFRs to support the development of robust defect detection algorithmsAbstract
Automatic visual inspection of synthetic fibre ropes (SFRs) is a challenging task in the field of offshore, wind turbine industries, etc. The presence of any defect in SFRs can compromise their structural integrity and pose significant safety risks. Due to the large size and weight of these ropes, it is often impractical to detach and inspect them frequently. Therefore, there is a critical need to develop efficient defect detection methods to assess their remaining useful life (RUL). To address this challenge, a comprehensive dataset has been generated, comprising a total of 6,942 raw images representing both normal and defective SFRs. The dataset encompasses a wide array of defect scenarios which may occur throughout their operational lifespan, including but not limited to placking defects, cut strands, chafings, compressions, core outs and normal. This dataset serves as a resource to support computer vision applications, including object detection, classification, and segmentation, aimed at detecting and analyzing defects in SFRs. The availability of this dataset will facilitate the development and evaluation of robust defect detection algorithms. The aim of generating this dataset is to assist in the development of automated defect detection systems that outperform traditional visual inspection methods, thereby paving the way for safer and more efficient utilization of SFRs across a wide range of applications.
A 5-Point Minimal Solver for Event Camera Relative Motion Estimation
paper_authors: Ling Gao, Hang Su, Daniel Gehrig, Marco Cannici, Davide Scaramuzza, Laurent Kneip
for: Linear motion estimation using event-based cameras
methods: Derive correct non-linear parametrization of eventails (manifolds generated by lines in the space-time volume of events) and introduce a novel minimal 5-point solver that jointly estimates line parameters and linear camera velocity projections.
results: Generate more stable relative motion estimates than other methods and consistently achieve a 100% success rate in estimating linear velocity, outperforming existing closed-form solvers.Abstract
Event-based cameras are ideal for line-based motion estimation, since they predominantly respond to edges in the scene. However, accurately determining the camera displacement based on events continues to be an open problem. This is because line feature extraction and dynamics estimation are tightly coupled when using event cameras, and no precise model is currently available for describing the complex structures generated by lines in the space-time volume of events. We solve this problem by deriving the correct non-linear parametrization of such manifolds, which we term eventails, and demonstrate its application to event-based linear motion estimation, with known rotation from an Inertial Measurement Unit. Using this parametrization, we introduce a novel minimal 5-point solver that jointly estimates line parameters and linear camera velocity projections, which can be fused into a single, averaged linear velocity when considering multiple lines. We demonstrate on both synthetic and real data that our solver generates more stable relative motion estimates than other methods while capturing more inliers than clustering based on spatio-temporal planes. In particular, our method consistently achieves a 100% success rate in estimating linear velocity where existing closed-form solvers only achieve between 23% and 70%. The proposed eventails contribute to a better understanding of spatio-temporal event-generated geometries and we thus believe it will become a core building block of future event-based motion estimation algorithms.
On Uniform Scalar Quantization for Learned Image Compression
results: The authors' method outperforms existing quantization-surrogate practices on a variety of representative image compression networks, and two subtle tricks are provided: setting an appropriate lower bound for the variance parameter, and using zero-center quantization with partial stop-gradient.
Abstract
Learned image compression possesses a unique challenge when incorporating non-differentiable quantization into the gradient-based training of the networks. Several quantization surrogates have been proposed to fulfill the training, but they were not systematically justified from a theoretical perspective. We fill this gap by contrasting uniform scalar quantization, the most widely used category with rounding being its simplest case, and its training surrogates. In principle, we find two factors crucial: one is the discrepancy between the surrogate and rounding, leading to train-test mismatch; the other is gradient estimation risk due to the surrogate, which consists of bias and variance of the gradient estimation. Our analyses and simulations imply that there is a tradeoff between the train-test mismatch and the gradient estimation risk, and the tradeoff varies across different network structures. Motivated by these analyses, we present a method based on stochastic uniform annealing, which has an adjustable temperature coefficient to control the tradeoff. Moreover, our analyses enlighten us as to two subtle tricks: one is to set an appropriate lower bound for the variance parameter of the estimated quantized latent distribution, which effectively reduces the train-test mismatch; the other is to use zero-center quantization with partial stop-gradient, which reduces the gradient estimation variance and thus stabilize the training. Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
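For reference, the two uniform scalar quantization surrogates most commonly contrasted in this line of work can be sketched as follows: additive uniform noise (fully differentiable, but the forward pass differs from test-time rounding, hence train-test mismatch) and straight-through rounding (matched forward pass, biased gradients). The trade-off between these two effects is what the paper's stochastic uniform annealing tunes with its temperature coefficient.

```python
import torch

def quantize_noise(y: torch.Tensor) -> torch.Tensor:
    """Training surrogate: add U(-0.5, 0.5) noise. Differentiable, but
    the forward pass differs from test-time rounding (train-test mismatch)."""
    return y + torch.empty_like(y).uniform_(-0.5, 0.5)

def quantize_ste(y: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: round in the forward pass, pass the
    gradient through unchanged in the backward pass (biased gradients)."""
    return y + (torch.round(y) - y).detach()

y = 3 * torch.randn(4)
y.requires_grad_(True)
print(quantize_noise(y))   # stochastic surrogate values
print(quantize_ste(y))     # exact integers, still differentiable
quantize_ste(y).sum().backward()
print(y.grad)              # all ones: gradient passed straight through
```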
UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling
results: The results show that, in experiments and real-world use, the system achieves strong robustness and accuracy, along with high efficiency and reliability in tracking and modeling.
Abstract
Tracking and modeling unknown rigid objects in the environment play a crucial role in autonomous unmanned systems and virtual-real interactive applications. However, many existing Simultaneous Localization, Mapping and Moving Object Tracking (SLAMMOT) methods focus solely on estimating specific object poses and lack estimation of object scales and are unable to effectively track unknown objects. In this paper, we propose a novel SLAM backend that unifies ego-motion tracking, rigid object motion tracking, and modeling within a joint optimization framework. In the perception part, we designed a pixel-level asynchronous object tracker (AOT) based on the Segment Anything Model (SAM) and DeAOT, enabling the tracker to effectively track target unknown objects guided by various predefined tasks and prompts. In the modeling part, we present a novel object-centric quadric parameterization to unify both static and dynamic object initialization and optimization. Subsequently, in the part of object state estimation, we propose a tightly coupled optimization model for object pose and scale estimation, incorporating hybrids constraints into a novel dual sliding window optimization framework for joint estimation. To our knowledge, we are the first to tightly couple object pose tracking with light-weight modeling of dynamic and static objects using quadric. We conduct qualitative and quantitative experiments on simulation datasets and real-world datasets, demonstrating the state-of-the-art robustness and accuracy in motion estimation and modeling. Our system showcases the potential application of object perception in complex dynamic scenes.
Unveiling Document Structures with YOLOv5 Layout Detection
paper_authors: Herman Sugiharto, Yorissa Silviana, Yani Siti Nurpazrin
for: This study aims to rapidly identify document layouts and extract unstructured data.
methods: The study uses YOLOv5, a cutting-edge computer vision model, for document layout detection and unstructured data extraction.
results: The YOLOv5 model performs strongly on document layout detection, achieving a precision of 0.91, a recall of 0.971, an F1-score of 0.939, and an area under the ROC curve of 0.975. The system effectively improves the efficiency of unstructured data extraction.
Abstract
The current digital environment is characterized by the widespread presence of data, particularly unstructured data, which poses many issues in sectors including finance, healthcare, and education. Conventional techniques for data extraction encounter difficulties in dealing with the inherent variety and complexity of unstructured data, hence requiring the adoption of more efficient methodologies. This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data. The present study establishes a conceptual framework for delineating the notion of "objects" as they pertain to documents, incorporating various elements such as paragraphs, tables, photos, and other constituent parts. The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data, hence improving the effectiveness of data extraction. In the conducted examination, the YOLOv5 model exhibits notable effectiveness in the task of document layout identification, attaining a high accuracy rate along with a precision value of 0.91, a recall value of 0.971, an F1-score of 0.939, and an area under the receiver operating characteristic curve (AUC-ROC) of 0.975. The remarkable performance of this system optimizes the process of extracting textual and tabular data from document images. Its prospective applications are not limited to document analysis but can encompass unstructured data from diverse sources, such as audio data. This study lays the foundation for future investigations into the wider applicability of YOLOv5 in managing various types of unstructured data, offering potential for novel applications across multiple domains.
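A sketch of running such a model through the public ultralytics/yolov5 torch hub entry point; the weights file and the page image are hypothetical placeholders for a model fine-tuned on layout classes (paragraph, table, photo, ...), which is what the paper describes.

```python
import torch

# Load custom YOLOv5 weights fine-tuned on document layouts.
# 'layout_yolov5.pt' is a hypothetical weights file; it would be trained
# on boxes labeled paragraph / table / photo / etc.
model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="layout_yolov5.pt")
model.conf = 0.4  # confidence threshold for detections

results = model("page_001.png")   # run on a scanned page image (placeholder)
df = results.pandas().xyxy[0]     # boxes + class names as a DataFrame
for _, det in df.iterrows():
    print(f"{det['name']}: ({det.xmin:.0f}, {det.ymin:.0f})"
          f" - ({det.xmax:.0f}, {det.ymax:.0f}) conf={det.confidence:.2f}")
```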
HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World
paper_authors: Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, Marc Pollefeys
for: The goal of this work is to develop interactive AI assistants that can collaborate with people on tasks in the physical world.
methods: The work introduces HoloAssist, a large-scale egocentric human interaction dataset in which two people collaboratively complete physical manipulation tasks: the performer wears a mixed-reality headset that captures seven synchronized data streams, while the instructor watches the performer's egocentric video in real time and gives verbal guidance.
results: By augmenting the data with action and conversational annotations and observing the rich behaviors of participants, the authors present key insights into how human assistants correct mistakes, intervene in task completion, and ground instructions in the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs, and benchmarks are constructed for mistake detection, intervention type prediction, and hand forecasting, with detailed analysis. Data can be downloaded at https://holoassist.github.io/.
Abstract
Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task instructor watches the performer's egocentric video in real time and guides them verbally. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/.
Segment Anything Model is a Good Teacher for Local Feature Learning
paper_authors: Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang
for: Improving local feature detection and description performance.
methods: Uses the SAM model as a teacher, applying Pixel Semantic Relational Distillation (PSRD) and Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) for local feature learning and description.
results: Achieves higher performance on multiple tasks, such as image matching on HPatches and long-term visual localization on Aachen Day-Night.
Abstract
Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in "any scene" and "any downstream task". Data-driven local feature learning methods need to rely on pixel-level correspondence for training, which is challenging to acquire at scale, thus hindering further improvements in performance. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a fundamental model trained on 11 million images, as a teacher to guide local feature learning and thus inspire higher performance on limited datasets. To do so, first, we construct an auxiliary task of Pixel Semantic Relational Distillation (PSRD), which distillates feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat's performance on various tasks such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The release code is available at https://github.com/vignywang/SAMFeat.
Text-image Alignment for Diffusion-based Perception
paper_authors: Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogério Guimarães, Pietro Perona
for: Exploring the use of diffusion models for visual tasks and improving the perceptual performance of diffusion-based models.
methods: The paper uses automatically generated captions to improve text-image alignment and enhance the cross-attention maps of the model, leading to better perceptual performance.
results: The paper achieves state-of-the-art (SOTA) results in diffusion-based semantic segmentation on ADE20K and overall SOTA in depth estimation on NYUv2. The method also generalizes to the cross-domain setting, achieving SOTA results in object detection on Watercolor2K and segmentation on Dark Zurich-val and Nighttime Driving.
Abstract
Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current SOTA in diffusion-based semantic segmentation on ADE20K and the current overall SOTA in depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting; we use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/
SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features
paper_authors: Song Wang, Zhu Wang, Can Li, Xiaojuan Qi, Hayden Kwok-Hay So
for: Event-based multi-object tracking (MOT) in real-world settings with complex background and camera motion.
methods: SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects, and a simultaneous object detector provides updated spatial information.
results: SpikeMOT achieves high tracking accuracy amidst challenging real-world scenarios, advancing the state-of-the-art in event-based multi-object tracking.Abstract
In comparison to conventional RGB cameras, the superior temporal resolution of event cameras allows them to capture rich information between frames, making them prime candidates for object tracking. Yet in practice, despite their theoretical advantages, the body of work on event-based multi-object tracking (MOT) remains in its infancy, especially in real-world settings where events from complex background and camera motion can easily obscure the true target motion. In this work, an event-based multi-object tracker, called SpikeMOT, is presented to address these challenges. SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects. The resulting spike train representations are used to track the object movement at high frequency, while a simultaneous object detector provides updated spatial information of these objects at an equivalent frame rate. To evaluate the effectiveness of SpikeMOT, we introduce DSEC-MOT, the first large-scale event-based MOT benchmark incorporating fine-grained annotations for objects experiencing severe occlusions, frequent trajectory intersections, and long-term re-identification in real-world contexts. Extensive experiments employing DSEC-MOT and another event-based dataset, named FE240hz, demonstrate SpikeMOT's capability to achieve high tracking accuracy amidst challenging real-world scenarios, advancing the state-of-the-art in event-based multi-object tracking.
Perceptual Tone Mapping Model for High Dynamic Range Imaging
results: Objective and subjective evaluations show that the model excels in contrast, colorfulness, and overall image quality, outperforming existing TMOs.
Abstract
One of the key challenges in tone mapping is to preserve the perceptual quality of high dynamic range (HDR) images when mapping them to standard dynamic range (SDR) displays. Traditional tone mapping operators (TMOs) compress the luminance of HDR images without considering the surround and display conditions, leading to suboptimal results. Current research addresses this challenge by incorporating perceptual color appearance attributes. In this work, we propose a TMO (TMOz) that leverages CIECAM16 perceptual attributes, i.e., brightness, colorfulness, and hue. TMOz accounts for the effects of both the surround and the display conditions to achieve more optimal colorfulness reproduction. The perceptual brightness is compressed, and the perceptual color scales, i.e., colorfulness and hue, are derived from HDR images by employing CIECAM16 color adaptation equations. A psychophysical experiment was conducted to automate the brightness compression parameter. The model employs a fully automatic and adaptive approach, obviating the requirement for manual parameter selection. TMOz was evaluated in terms of contrast, colorfulness, and overall image quality. The objective and subjective evaluation methods revealed that the proposed model outperformed state-of-the-art TMOs.
Synthetic Data Generation and Deep Learning for the Topological Analysis of 3D Data
results: The study shows that deep learning models can extract these topological features and offer some advantages over existing topological data analysis tools based on persistent homology. In addition, semantic segmentation is used to provide additional geometric information alongside the topological labels.
Abstract
This research uses deep learning to estimate the topology of manifolds represented by sparse, unordered point cloud scenes in 3D. A new labelled dataset was synthesised to train neural networks and evaluate their ability to estimate the genus of these manifolds. This data used random homeomorphic deformations to provoke the learning of visual topological features. We demonstrate that deep learning models could extract these features and discuss some advantages over existing topological data analysis tools that are based on persistent homology. Semantic segmentation was used to provide additional geometric information in conjunction with topological labels. Common point cloud multi-layer perceptron and transformer networks were both used to compare the viability of these methods. The experimental results of this pilot study support the hypothesis that, with the aid of sophisticated synthetic data generation, neural networks can perform segmentation-based topological data analysis. While our study focused on simulated data, the accuracy achieved suggests a potential for future applications using real data.
nnSAM: Plug-and-play Segment Anything Model Improves nnUNet Performance
methods: The method synergistically integrates two neural networks, the Segment Anything Model (SAM) and nnUNet, to achieve higher accuracy and better adaptability.
results: Experiments across different training-set sizes show that the method supports few-shot learning and achieves more accurate and robust medical image segmentation. Abstract
The recent developments of foundation models in computer vision, especially the Segment Anything Model (SAM), allow scalable and domain-agnostic image segmentation to serve as a general-purpose segmentation tool. In parallel, the field of medical image segmentation has benefited significantly from specialized neural networks like the nnUNet, which is trained on domain-specific datasets and can automatically configure the network to tailor to specific segmentation challenges. To combine the advantages of foundation models and domain-specific models, we present nnSAM, which synergistically integrates the SAM model with the nnUNet model to achieve more accurate and robust medical image segmentation. The nnSAM model leverages the powerful and robust feature extraction capabilities of SAM, while harnessing the automatic configuration capabilities of nnUNet to promote dataset-tailored learning. Our comprehensive evaluation of nnSAM model on different sizes of training samples shows that it allows few-shot learning, which is highly relevant for medical image segmentation where high-quality, annotated data can be scarce and costly to obtain. By melding the strengths of both its predecessors, nnSAM positions itself as a potential new benchmark in medical image segmentation, offering a tool that combines broad applicability with specialized efficiency. The code is available at https://github.com/Kent0n-Li/Medical-Image-Segmentation.
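To make the fusion idea concrete, here is a minimal sketch of combining a frozen general-purpose encoder (standing in for SAM's image encoder) with task-specific UNet features; the class name, channel sizes, and stand-in encoder are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSegHead(nn.Module):
    """Sketch of the nnSAM idea: fuse embeddings from a frozen foundation
    encoder with task-specific UNet features, then decode to class logits."""
    def __init__(self, frozen_encoder, unet_channels=32, enc_channels=256, n_classes=2):
        super().__init__()
        self.encoder = frozen_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False           # keep the foundation encoder frozen
        self.head = nn.Conv2d(unet_channels + enc_channels, n_classes, kernel_size=1)

    def forward(self, image, unet_feats):
        with torch.no_grad():
            enc_feats = self.encoder(image)   # (B, enc_channels, h, w)
        enc_feats = F.interpolate(enc_feats, size=unet_feats.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.head(torch.cat([unet_feats, enc_feats], dim=1))
```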
AdaPose: Towards Cross-Site Device-Free Human Pose Estimation with Commodity WiFi
results: Experimental results demonstrate that AdaPose effectively eliminates domain shift, facilitating the widespread application of WiFi-based pose estimation in smart cities. Abstract
WiFi-based pose estimation is a technology with great potential for the development of smart homes and metaverse avatar generation. However, current WiFi-based pose estimation methods are predominantly evaluated under controlled laboratory conditions with sophisticated vision models to acquire accurately labeled data. Furthermore, WiFi CSI is highly sensitive to environmental variables, and direct application of a pre-trained model to a new environment may yield suboptimal results due to domain shift. In this paper, we propose a domain adaptation algorithm, AdaPose, designed specifically for weakly-supervised WiFi-based pose estimation. The proposed method aims to identify consistent human poses that are highly resistant to environmental dynamics. To achieve this goal, we introduce a Mapping Consistency Loss that aligns the domain discrepancy of source and target domains based on inner consistency between input and output at the mapping level. We conduct extensive experiments on domain adaptation in two different scenes using our self-collected pose estimation dataset containing WiFi CSI frames. The results demonstrate the effectiveness and robustness of AdaPose in eliminating domain shift, thereby facilitating the widespread application of WiFi-based pose estimation in smart cities.
COMNet: Co-Occurrent Matching for Weakly Supervised Semantic Segmentation
results: On the Pascal VOC 2012 and MS-COCO datasets, the network effectively boosts the performance of the baseline model and achieves new state-of-the-art results. Abstract
Image-level weakly supervised semantic segmentation is a challenging task that has been deeply studied in recent years. Most of the common solutions exploit class activation map (CAM) to locate object regions. However, such response maps generated by the classification network usually focus on discriminative object parts. In this paper, we propose a novel Co-Occurrent Matching Network (COMNet), which can promote the quality of the CAMs and enforce the network to pay attention to the entire parts of objects. Specifically, we perform inter-matching on paired images that contain common classes to enhance the corresponded areas, and construct intra-matching on a single image to propagate the semantic features across the object regions. The experiments on the Pascal VOC 2012 and MS-COCO datasets show that our network can effectively boost the performance of the baseline model and achieve new state-of-the-art performance.
Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training
results: Model2Scene performs well on downstream tasks including label-free 3D object salient detection, label-efficient 3D scene perception, and zero-shot 3D semantic segmentation; notably, it achieves average mAPs of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively. Abstract
Current successful methods of 3D scene perception rely on large-scale annotated point clouds, which are tedious and expensive to acquire. In this paper, we propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages. The main challenges are the domain gaps between the CAD models and the real scene's objects, including model-to-scene (from a single model to the scene) and synthetic-to-real (from synthetic model to real scene's object). To handle the above challenges, Model2Scene first simulates a crowded scene by mixing data-augmented CAD models. Next, we propose a novel feature regularization operation, termed Deep Convex-hull Regularization (DCR), to project point features into a unified convex hull space, reducing the domain gap. Ultimately, we impose a contrastive loss on the language embeddings and the point features of CAD models to pre-train the 3D network. Extensive experiments verify the learned 3D scene representation is beneficial for various downstream tasks, including label-free 3D object salient detection, label-efficient 3D scene perception and zero-shot 3D semantic segmentation. Notably, Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08\% and 55.49\% on the ScanNet and S3DIS datasets, respectively. The code will be publicly available.
CrossZoom: Simultaneously Motion Deblurring and Event Super-Resolving
paper_authors: Chi Zhang, Xiang Zhang, Mingyuan Lin, Cheng Li, Chu He, Wen Yang, Gui-Song Xia, Lei Yu
for: This paper aims to improve the performance of frame-event based vision applications by bridging the resolution gap between traditional and neuromorphic event cameras.
methods: The proposed method, called CrossZoom, uses a novel unified neural network (CZ-Net) to jointly recover sharp latent sequences and high-resolution events from blurry inputs. The method leverages scale-variant properties and effectively fuses cross-modality information to achieve cross-enhancement.
results: Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed method, which improves the temporal resolution of images and the spatial resolution of events, leading to better performance in frame-event based vision applications. Abstract
Even though the collaboration between traditional and neuromorphic event cameras brings prosperity to frame-event based vision applications, the performance is still confined by the resolution gap crossing two modalities in both spatial and temporal domains. This paper is devoted to bridging the gap by increasing the temporal resolution for images, i.e., motion deblurring, and the spatial resolution for events, i.e., event super-resolving, respectively. To this end, we introduce CrossZoom, a novel unified neural Network (CZ-Net) to jointly recover sharp latent sequences within the exposure period of a blurry input and the corresponding High-Resolution (HR) events. Specifically, we present a multi-scale blur-event fusion architecture that leverages the scale-variant properties and effectively fuses cross-modality information to achieve cross-enhancement. Attention-based adaptive enhancement and cross-interaction prediction modules are devised to alleviate the distortions inherent in Low-Resolution (LR) events and enhance the final results through the prior blur-event complementary information. Furthermore, we propose a new dataset containing HR sharp-blurry images and the corresponding HR-LR event streams to facilitate future research. Extensive qualitative and quantitative experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed method. Codes and datasets are released at https://bestrivenzc.github.io/CZ-Net/.
Incremental Rotation Averaging Revisited and More: A New Rotation Averaging Benchmark
results: Comprehensive comparisons between IRAv4 and other mainstream rotation averaging methods demonstrate the effectiveness of the proposed approach. Abstract
In order to further advance the accuracy and robustness of the incremental parameter estimation-based rotation averaging methods, in this paper, a new member of the Incremental Rotation Averaging (IRA) family is introduced, which is termed as IRAv4. As the most significant feature of the IRAv4, a task-specific connected dominating set is extracted to serve as a more reliable and accurate reference for rotation global alignment. In addition, to further address the limitations of the existing rotation averaging benchmark of relying on the slightly outdated Bundler camera calibration results as ground truths and focusing solely on rotation estimation accuracy, this paper presents a new COLMAP-based rotation averaging benchmark that incorporates a cross check between COLMAP and Bundler, and employ the accuracy of both rotation and downstream location estimation as evaluation metrics, which is desired to provide a more reliable and comprehensive evaluation tool for the rotation averaging research. Comprehensive comparisons between the proposed IRAv4 and other mainstream rotation averaging methods on this new benchmark demonstrate the effectiveness of our proposed approach.
methods: The paper builds on the You Only Learn One Representation (YOLOR) network architecture, which combines explicit and implicit knowledge, from data observations and learned latents respectively, to improve a shared representation.
results: The method jointly performs object detection, instance segmentation, semantic segmentation, and image captioning while maintaining a low parameter count and requiring no pre-training. Abstract
Multi-task learning (MTL) aims to learn multiple tasks using a single model and jointly improve all of them assuming generalization and shared semantics. Reducing conflicts between tasks during joint learning is difficult and generally requires careful network design and extremely large models. We propose building on You Only Learn One Representation (YOLOR), a network architecture specifically designed for multitasking. YOLOR leverages both explicit and implicit knowledge, from data observations and learned latents, respectively, to improve a shared representation while minimizing the number of training parameters. However, YOLOR and its follow-up, YOLOv7, only trained two tasks at once. In this paper, we jointly train object detection, instance segmentation, semantic segmentation, and image captioning. We analyze tradeoffs and attempt to maximize sharing of semantic information. Through our architecture and training strategies, we find that our method achieves competitive performance on all tasks while maintaining a low parameter count and without any pre-training. We will release code soon.
Investigating Shift Equivalence of Convolutional Neural Networks in Industrial Defect Segmentation
results: Experiments on the micro surface defect (MSD) dataset and four real-world industrial defect datasets show that the proposed method achieves higher output consistency and segmentation performance than state-of-the-art methods. Abstract
In industrial defect segmentation tasks, while pixel accuracy and Intersection over Union (IoU) are commonly employed metrics to assess segmentation performance, the output consistency (also referred to as equivalence) of the model is often overlooked. Even a small shift in the input image can yield significant fluctuations in the segmentation results. Existing methodologies primarily focus on data augmentation or anti-aliasing to enhance the network's robustness against translational transformations, but their shift equivalence performs poorly on the test set or is susceptible to nonlinear activation functions. Additionally, the variations in boundaries resulting from the translation of input images are consistently disregarded, thus imposing further limitations on the shift equivalence. In response to this particular challenge, a novel pair of down/upsampling layers called component attention polyphase sampling (CAPS) is proposed as a replacement for the conventional sampling layers in CNNs. To mitigate the effect of image boundary variations on the equivalence, an adaptive windowing module is designed in CAPS to adaptively filter out the border pixels of the image. Furthermore, a component attention module is proposed to fuse all downsampled features to improve the segmentation performance. The experimental results on the micro surface defect (MSD) dataset and four real-world industrial defect datasets demonstrate that the proposed method exhibits higher equivalence and segmentation performance compared to other state-of-the-art methods. Our code will be available at https://github.com/xiaozhen228/CAPS.
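For intuition, here is a minimal sketch of plain adaptive polyphase downsampling, the shift-equivalence idea that CAPS builds on; it omits CAPS's component attention and adaptive windowing, and assumes spatial dimensions divisible by the stride.

```python
import torch

def adaptive_polyphase_downsample(x, stride=2):
    """Keep the polyphase component with the largest L2 norm, so the
    choice -- and hence the output -- stays consistent when the input shifts.
    x: (B, C, H, W) with H, W divisible by `stride`."""
    comps = [x[..., i::stride, j::stride]
             for i in range(stride) for j in range(stride)]
    norms = torch.stack([c.flatten(1).norm(dim=1) for c in comps], dim=1)  # (B, s*s)
    idx = norms.argmax(dim=1)               # per-sample component choice
    stacked = torch.stack(comps, dim=1)     # (B, s*s, C, H/s, W/s)
    return stacked[torch.arange(x.size(0)), idx]
```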
paper_authors: Xiaotian Han, Hanqing Zeng, Yu Chen, Shaoliang Nie, Jingzhou Liu, Kanika Narang, Zahra Shakeri, Karthik Abinav Sankararaman, Song Jiang, Madian Khabsa, Qifan Wang, Xia Hu
for: This study investigates the relationship between graph convolution and Mixup techniques.
methods: Graph convolution and Mixup are related as techniques that derive feature representations from multiple samples.
results: Under two mild conditions, graph convolution can be viewed as a special form of Mixup applied during both training and testing. The two conditions are: 1) Homophily Relabel, assigning the target node's label to all its neighbors, and 2) Test-Time Mixup, mixing features at test time. The equivalence is shown mathematically for graph convolution networks (GCN) and simplified graph convolution (SGC), and verified empirically by training an MLP under the two conditions. Abstract
This paper investigates the relationship between graph convolution and Mixup techniques. Graph convolution in a graph neural network involves aggregating features from neighboring samples to learn representative features for a specific node or sample. On the other hand, Mixup is a data augmentation technique that generates new examples by averaging features and one-hot labels from multiple samples. One commonality between these techniques is their utilization of information from multiple samples to derive feature representation. This study aims to explore whether a connection exists between these two approaches. Our investigation reveals that, under two mild conditions, graph convolution can be viewed as a specialized form of Mixup that is applied during both the training and testing phases. The two conditions are: 1) \textit{Homophily Relabel} - assigning the target node's label to all its neighbors, and 2) \textit{Test-Time Mixup} - Mixup the feature during the test time. We establish this equivalence mathematically by demonstrating that graph convolution networks (GCN) and simplified graph convolution (SGC) can be expressed as a form of Mixup. We also empirically verify the equivalence by training an MLP using the two conditions to achieve comparable performance.
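To make the claimed equivalence concrete, here is a toy numpy sketch (the graph, features, and labels are invented for illustration): one row-normalized propagation step reproduces a Mixup of neighbor features, and under Homophily Relabel the mixed label collapses to the target node's own label.

```python
import numpy as np

# Toy graph: 4 nodes, adjacency A, features X, one-hot labels Y.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
Y = np.eye(2)[[0, 0, 1, 1]]

A_hat = A + np.eye(4)                  # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(1))
X_agg = D_inv @ A_hat @ X              # one SGC propagation step

i = 2
lam = D_inv[i, i] * A_hat[i]           # Mixup coefficients; they sum to 1
assert np.allclose(lam @ X, X_agg[i])  # aggregation == feature Mixup

# Under Homophily Relabel, every neighbor of i carries Y[i],
# so the mixed label is exactly Y[i]:
Y_relabel = np.tile(Y[i], (4, 1))
assert np.allclose(lam @ Y_relabel, Y[i])
```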
Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity
results: Although small-magnitude weights may appear useless, they encode crucial knowledge needed for harder downstream tasks; removing them can cause irreversible knowledge forgetting and performance damage on difficult tasks. These findings may change how we understand knowledge encoding in LLMs and open research directions in model pruning and task-aware conditional computation. Abstract
The traditional notion of "Junk DNA" has long been linked to non-coding segments within the human genome, constituting roughly 98% of its composition. However, recent research has unveiled the critical roles some of these seemingly non-functional DNA sequences play in cellular processes. Intriguingly, the weights within deep neural networks exhibit a remarkable similarity to the redundancy observed in human genes. It was believed that weights in gigantic models contained excessive redundancy, and could be removed without compromising performance. This paper challenges this conventional wisdom by presenting a compelling counter-argument. We employ sparsity as a tool to isolate and quantify the nuanced significance of low-magnitude weights in pre-trained large language models (LLMs). Our study demonstrates a strong correlation between these weight magnitudes and the knowledge they encapsulate, from a downstream task-centric angle. We raise the "Junk DNA Hypothesis" backed by our in-depth investigation: while small-magnitude weights may appear "useless" for simple tasks and suitable for pruning, they actually encode crucial knowledge necessary for solving more difficult downstream tasks. Removing these seemingly insignificant weights can lead to irreversible knowledge forgetting and performance damage in difficult tasks. These findings offer fresh insights into how LLMs encode knowledge in a task-sensitive manner, pave future research directions in model pruning, and open avenues for task-aware conditional computation during inference.
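The probe underlying such studies is one-shot magnitude pruning; a minimal sketch follows (not the paper's full protocol, and ties at the threshold may prune slightly more than the requested fraction).

```python
import torch

def magnitude_prune_(weight, sparsity):
    """Zero out (approximately) the smallest-magnitude fraction
    `sparsity` of entries of `weight`, in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    thresh = weight.abs().flatten().kthvalue(k).values
    weight.mul_((weight.abs() > thresh).float())
    return weight
```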
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
results: In the NetHack game, Motif improves the game score without directly optimizing for it. Combined with the environment reward, the method significantly outperforms existing approaches and makes progress on tasks where no advances had previously been made without demonstrations. Motif also mostly generates intuitive, human-aligned behaviors that can be steered easily through prompt modifications, and it scales well. Abstract
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
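Distilling pairwise LLM preferences over captions into a scalar reward is typically done with a Bradley-Terry-style objective; a plausible minimal sketch is below (tensor shapes and the toy scores are assumptions, not Motif's exact training code).

```python
import torch
import torch.nn.functional as F

def preference_loss(r_a, r_b, pref):
    """Bradley-Terry-style loss for a reward model scored on caption pairs.
    `pref` is 1.0 if caption a was preferred, 0.0 if caption b was."""
    return F.binary_cross_entropy_with_logits(r_a - r_b, pref)

# Toy usage: reward-model scores for two batches of captions.
loss = preference_loss(torch.tensor([0.7, -0.2]),
                       torch.tensor([0.1, 0.5]),
                       torch.tensor([1.0, 0.0]))
```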
Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection
paper_authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
for: bridging the gap between image-level pretraining and open-vocabulary object detection
methods: using detection-oriented image-text pretraining with a detector architecture, and a shifted-window learning approach upon window attention
results: setting a new state of the art on the LVIS open-vocabulary detection benchmark (40.4 mask AP$_r$, +6.5 over the best prior method), and achieving competitive results (40.8 novel AP) on the COCO benchmark without pseudo labeling or weak supervision. Abstract
We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.
Self-Specialization: Uncovering Latent Expertise within Large Language Models
paper_authors: Junmo Kang, Hongyin Luo, Yada Zhu, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky
for: This study focuses on self-alignment for expert domain specialization (e.g., biomedicine), finding it effective for improving zero-shot and few-shot performance in target domains of interest.
results: The self-specialized model (30B) outperforms its base model, MPT-30B, by a large margin on the biomedical domain and even surpasses larger popular models based on LLaMA-65B, highlighting its practicality and efficiency. Abstract
Recent works have demonstrated the effectiveness of self-alignment in which a large language model is, by itself, aligned to follow general instructions through the automatic generation of instructional data using a handful of human-written seeds. Instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine), discovering it to be very effective for improving zero-shot and few-shot performance in target domains of interest. As a preliminary, we first present the benchmark results of existing aligned models within a specialized domain, which reveals the marginal effect that "generic" instruction-following training has on downstream expert domains' performance. To remedy this, we explore self-specialization that leverages domain-specific unlabelled data and a few labeled seeds for the self-alignment process. When augmented with retrieval to reduce hallucination and enhance concurrency of the alignment, self-specialization offers an effective (and efficient) way of "carving out" an expert model out of a "generalist", pre-trained LLM where different domains of expertise are originally combined in a form of "superposition". Our experimental results on a biomedical domain show that our self-specialized model (30B) outperforms its base model, MPT-30B by a large margin and even surpasses larger popular models based on LLaMA-65B, highlighting its potential and practicality for specialization, especially considering its efficiency in terms of data and parameters.
Feedback-guided Data Synthesis for Imbalanced Classification
results: Achieves state-of-the-art results on the ImageNet-LT dataset, with over 4% improvement on underrepresented classes, and gains of over 5% in worst-group accuracy on the group-imbalanced NICO++ dataset. Abstract
Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5 percent in worst group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.
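A minimal sketch of the feedback-driven selection step is below; the filtering criterion (recognizable as the class, but non-trivial by predictive entropy) is a simplification of the paper's criteria, and `generator.sample` is a hypothetical API.

```python
import torch

@torch.no_grad()
def select_synthetic_batch(generator, classifier, target_class, n_candidates, n_keep):
    """Keep generated samples the classifier recognizes as the target class
    (close to the real support) but finds non-trivial (entropy proxy)."""
    imgs = generator.sample(target_class, n_candidates)   # hypothetical API
    probs = classifier(imgs).softmax(dim=-1)
    correct = probs.argmax(dim=-1) == target_class
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    scores = torch.where(correct, entropy, torch.full_like(entropy, -1e9))
    return imgs[scores.topk(n_keep).indices]
```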
Learning Generalizable Tool-use Skills through Trajectory Generation
methods: The method learns a generative model of tool-use trajectories, represented as sequences of point clouds, that generalizes across tool shapes: given a novel tool, a tool-use trajectory is first generated, and the sequence of tool poses is then optimized to align with it. A single model is trained on four challenging deformable object manipulation tasks.
results: The model manipulates deformable objects effectively with various novel tools and significantly outperforms baselines. Additional results are available on the project website: https://sites.google.com/view/toolgen. Abstract
Autonomous systems that efficiently utilize tools can assist humans in completing many common tasks such as cooking and cleaning. However, current systems fall short of matching human-level of intelligence in terms of adapting to novel tools. Prior works based on affordance often make strong assumptions about the environments and cannot scale to more complex, contact-rich tasks. In this work, we tackle this challenge and explore how agents can learn to use previously unseen tools to manipulate deformable objects. We propose to learn a generative model of the tool-use trajectories as a sequence of point clouds, which generalizes to different tool shapes. Given any novel tool, we first generate a tool-use trajectory and then optimize the sequence of tool poses to align with the generated trajectory. We train a single model for four different challenging deformable object manipulation tasks. Our model is trained with demonstration data from just a single tool for each task and is able to generalize to various novel tools, significantly outperforming baselines. Additional materials can be found on our project website: https://sites.google.com/view/toolgen.
Primal-Dual Continual Learning: Stability and Plasticity through Lagrange Multipliers
paper_authors: Juan Elenter, Navid NaderiAlizadeh, Tara Javidi, Alejandro Ribeiro
for: The goal is to address the no-forgetting constraint in continual learning, i.e., learning new tasks without forgetting previously learned ones.
methods: The constrained continual learning problem is tackled directly through Lagrangian duality, analyzing two versions of the problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level.
results: Dual variables indicate the sensitivity of the optimal value to constraint perturbations, and are used to partition the replay buffer, allocate resources, and populate the buffer. Sub-optimality bounds are derived and corroborated empirically on several continual learning benchmarks. Abstract
Continual learning is inherently a constrained learning problem. The goal is to learn a predictor under a \emph{no-forgetting} requirement. Although several prior studies formulate it as such, they do not solve the constrained problem explicitly. In this work, we show that it is both possible and beneficial to undertake the constrained optimization problem directly. To do this, we leverage recent results in constrained learning through Lagrangian duality. We focus on memory-based methods, where a small subset of samples from previous tasks can be stored in a replay buffer. In this setting, we analyze two versions of the continual learning problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level. We show that dual variables indicate the sensitivity of the optimal value with respect to constraint perturbations. We then leverage this result to partition the buffer in the coarse approach, allocating more resources to harder tasks, and to populate the buffer in the fine approach, including only impactful samples. We derive sub-optimality bounds, and empirically corroborate our theoretical results in various continual learning benchmarks. We also discuss the limitations of these methods with respect to the amount of memory available and the number of constraints involved in the optimization problem.
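A minimal sketch of the coarse, task-level primal-dual update is below; the variable names and the exact constraint form (per-task replay loss under a forgetting budget) are assumptions consistent with the abstract, not the paper's implementation.

```python
import torch

def primal_dual_step(new_task_loss, replay_losses, lambdas, eps, lr_dual):
    """Lagrangian = new-task loss + sum_k lambda_k * (replay_loss_k - eps_k).
    Backward gives primal gradients; dual ascent grows lambda_k whenever
    task k exceeds its forgetting budget eps_k. `lambdas` are non-negative
    scalar tensors."""
    lagrangian = new_task_loss + sum(
        l * (rl - e) for l, rl, e in zip(lambdas, replay_losses, eps))
    lagrangian.backward()                 # primal gradients for the model
    with torch.no_grad():                 # dual ascent, projected onto >= 0
        for k, (rl, e) in enumerate(zip(replay_losses, eps)):
            lambdas[k] = torch.clamp(lambdas[k] + lr_dual * (rl.detach() - e), min=0.0)
    return lagrangian
```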
3D Reconstruction in Noisy Agricultural Environments: A Bayesian Optimization Perspective for View Planning
results: Achieves efficient 3D reconstruction in noisy environments with only a small number of cameras. Abstract
3D reconstruction is a fundamental task in robotics that gained attention due to its major impact in a wide variety of practical settings, including agriculture, underwater, and urban environments. An important approach for this task, known as view planning, is to judiciously place a number of cameras in positions that maximize the visual information improving the resulting 3D reconstruction. Circumventing the need for a large number of arbitrary images, geometric criteria can be applied to select fewer yet more informative images to markedly improve the 3D reconstruction performance. Nonetheless, incorporating the noise of the environment that exists in various real-world scenarios into these criteria may be challenging, particularly when prior information about the noise is not provided. To that end, this work advocates a novel geometric function that accounts for the existing noise, relying solely on a relatively small number of noise realizations without requiring its closed-form expression. With no analytic expression of the geometric function, this work puts forth a Bayesian optimization algorithm for accurate 3D reconstruction in the presence of noise. Numerical tests on noisy agricultural environments showcase the impressive merits of the proposed approach for 3D reconstruction with even a small number of available cameras.
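The generic Bayesian-optimization step for view planning can be sketched as follows; the UCB acquisition and plain GP surrogate are standard stand-ins, not the paper's noise-aware geometric function.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def choose_next_view(evaluated_poses, scores, candidate_poses, beta=2.0):
    """Fit a GP surrogate to noisy reconstruction-quality scores and pick
    the camera pose maximizing an upper-confidence-bound acquisition, so
    informative views are found with few expensive evaluations."""
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.asarray(evaluated_poses), np.asarray(scores))
    mu, sigma = gp.predict(np.asarray(candidate_poses), return_std=True)
    return candidate_poses[int(np.argmax(mu + beta * sigma))]
```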
Probabilistic Sampling-Enhanced Temporal-Spatial GCN: A Scalable Framework for Transaction Anomaly Detection in Ethereum Networks
results: Compared with conventional GCNs, the proposed TRW-GCN framework substantially improves performance metrics for detecting anomalies and transaction bursts. Abstract
The rapid evolution of the Ethereum network necessitates sophisticated techniques to ensure its robustness against potential threats and to maintain transparency. While Graph Neural Networks (GNNs) have pioneered anomaly detection in such platforms, capturing the intricacies of both spatial and temporal transactional patterns has remained a challenge. This study presents a fusion of Graph Convolutional Networks (GCNs) with Temporal Random Walks (TRW) enhanced by probabilistic sampling to bridge this gap. Our approach, unlike traditional GCNs, leverages the strengths of TRW to discern complex temporal sequences in Ethereum transactions, thereby providing a more nuanced transaction anomaly detection mechanism. Preliminary evaluations demonstrate that our TRW-GCN framework substantially advances the performance metrics over conventional GCNs in detecting anomalies and transaction bursts. This research not only underscores the potential of temporal cues in Ethereum transactional data but also offers a scalable and effective methodology for ensuring the security and transparency of decentralized platforms. By harnessing both spatial relationships and time-based transactional sequences as node features, our model introduces an additional layer of granularity, making the detection process more robust and less prone to false positives. This work lays the foundation for future research aimed at optimizing and enhancing the transparency of blockchain technologies, and serves as a testament to the significance of considering both time and space dimensions in the ever-evolving landscape of the decentralized platforms.
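A temporal random walk differs from an ordinary walk in that each step must respect edge timestamps; a minimal sketch (with an assumed adjacency-list format and uniform next-edge sampling, where the paper uses probability-weighted sampling) is below.

```python
import random

def temporal_random_walk(edges_by_node, start, walk_len):
    """edges_by_node: {node: [(neighbor, timestamp), ...]}. Each step only
    follows edges with non-decreasing timestamps, so the walk respects
    the temporal order of transactions."""
    walk, t = [start], float("-inf")
    for _ in range(walk_len - 1):
        options = [(v, ts) for v, ts in edges_by_node.get(walk[-1], []) if ts >= t]
        if not options:
            break
        v, t = random.choice(options)
        walk.append(v)
    return walk
```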
GASS: Generalizing Audio Source Separation with Large-scale Data
paper_authors: Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà
for: separating speech, music, and sound events in a supervised fashion
methods: trained on a large-scale dataset, GASS models show feasibility and competitive performance in in-distribution tasks, but struggle with generalizing to out-of-distribution cinematic and music content
results: all fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.Abstract
Universal source separation targets at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows its generalization abilities. Yet, it is challenging for GASS models to generalize for separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.
ABScribe: Rapid Exploration of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models
paper_authors: Mohi Reza, Nathan Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan “Michael” Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, Joseph Jay Williams
results: A user study with 12 writers found that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs. Abstract
Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art large language models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new versions without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly produce multiple variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text segments for rapid in-place comparisons using mouse-over interactions on a context toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.
Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization
results: A scalable method is developed for computing accurate, guaranteed differentiable upper bounds on the Lipschitz constant of neural networks; these bounds can be used to design new layers with controllable Lipschitz constants. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet datasets show that the proposed algorithm obtains competitively improved results compared to the state of the art. Abstract
To improve the robustness of deep classifiers against adversarial perturbations, many approaches have been proposed, such as designing new architectures with better robustness properties (e.g., Lipschitz-capped networks), or modifying the training process itself (e.g., min-max optimization, constrained learning, or regularization). These approaches, however, might not be effective at increasing the margin in the input (feature) space. As a result, there has been an increasing interest in developing training procedures that can directly manipulate the decision boundary in the input space. In this paper, we build upon recent developments in this category by developing a robust training algorithm whose objective is to increase the margin in the output (logit) space while regularizing the Lipschitz constant of the model along vulnerable directions. We show that these two objectives can directly promote larger margins in the input space. To this end, we develop a scalable method for calculating guaranteed differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently. The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary. Furthermore, our Lipschitz bounding algorithm exploits the monotonicity and Lipschitz continuity of the activation layers, and the resulting bounds can be used to design new layers with controllable bounds on their Lipschitz constant. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.
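For context, the classical baseline that such certified bounds improve on is the product of layer spectral norms; a minimal sketch is below (it only inspects Linear layers and assumes 1-Lipschitz activations such as ReLU; the paper's bound is tighter and differentiable).

```python
import torch

def product_spectral_bound(model):
    """Loose Lipschitz upper bound for a feed-forward network with
    1-Lipschitz activations: the product of layer spectral norms."""
    bound = 1.0
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    return bound
```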
HyperMask: Adaptive Hypernetwork-based Masks for Continual Learning
results: Proposes HyperMask, which trains a single network for all tasks; a hypernetwork generates semi-binary masks that yield task-dedicated subnetworks. The solution inherits the hypernetwork's ability to adapt to new tasks with minimal forgetting while, via the lottery ticket hypothesis, using a single network with weighted subnetworks for each task. Abstract
Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, there exist many continual learning strategies. One of the most effective is the hypernetwork-based approach. The hypernetwork generates the weights of a target model based on the task's identity. The model's main limitation is that the hypernetwork can produce completely different networks for each task. Consequently, each task is solved separately. The model does not use information from the network dedicated to previous tasks and practically produces new architectures when it learns the subsequent tasks. To solve such a problem, we use the lottery ticket hypothesis, which postulates the existence of sparse subnetworks, named winning tickets, that preserve the performance of a full network. In the paper, we propose a method called HyperMask, which trains a single network for all tasks. The hypernetwork produces semi-binary masks to obtain target subnetworks dedicated to new tasks. This solution inherits the ability of the hypernetwork to adapt to new tasks with minimal forgetting. Moreover, due to the lottery ticket hypothesis, we can use a single network with weighted subnets dedicated to each task.
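A minimal sketch of the mask-producing hypernetwork follows; the module shapes and the sigmoid-with-temperature trick for near-binary masks are assumptions in the spirit of the abstract, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MaskHypernet(nn.Module):
    """Maps a learned task embedding to a soft, near-binary mask over a
    shared weight tensor; the masked weights form the task's subnetwork
    (a 'winning ticket' carved out of a single shared model)."""
    def __init__(self, n_tasks, emb_dim, n_weights, temperature=0.1):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, emb_dim)
        self.head = nn.Linear(emb_dim, n_weights)
        self.temperature = temperature

    def forward(self, task_id):
        # task_id: LongTensor of task indices
        logits = self.head(self.task_emb(task_id))
        return torch.sigmoid(logits / self.temperature)  # semi-binary mask in (0, 1)
```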
FashionFlow: Leveraging Diffusion Models for Dynamic Fashion Video Synthesis from Static Imagery
results: Successfully synthesizes fashion videos featuring models posing from multiple angles, showcasing the fit and appearance of garments. These findings hold promise for improving and enhancing the online fashion shopping experience. Abstract
Our study introduces a new image-to-video generator called FashionFlow. By utilising a diffusion model, we are able to create short videos from still images. Our approach involves developing and connecting relevant components with the diffusion model, which sets our work apart. The components include the use of pseudo-3D convolutional layers to generate videos efficiently. VAE and CLIP encoders capture vital characteristics from still images to influence the diffusion model. Our research demonstrates a successful synthesis of fashion videos featuring models posing from various angles, showcasing the fit and appearance of the garment. Our findings hold great promise for improving and enhancing the shopping experience for the online fashion industry.
Multilingual Natural Language Processing Model for Radiology Reports – The Summary is all you need!
paper_authors: Mariana Lindo, Ana Sofia Santos, André Ferreira, Jianning Li, Gijs Luijten, Gustavo Correia, Moon Kim, Jens Kleesiek, Jan Egger, Victor Alves
for: The study aims to automatically generate radiology report summaries in multiple languages, so that future research and deep learning models can incorporate data from patients of different ethnic backgrounds.
results: In a blind test, two board-certified radiologists judged that for at least 70% of the system-generated summaries the quality matched or exceeded the corresponding human-written summaries, suggesting substantial clinical reliability. The multilingual model also outperformed models specialized in summarizing radiology reports in a single language, as well as models not designed for radiology report summarization, such as ChatGPT. Abstract
The impression section of a radiology report summarizes important radiology findings and plays a critical role in communicating these findings to physicians. However, the preparation of these summaries is time-consuming and error-prone for radiologists. Recently, numerous models for radiology report summarization have been developed. Nevertheless, there is currently no model that can summarize these reports in multiple languages. Such a model could greatly improve future research and the development of Deep Learning models that incorporate data from patients with different ethnic backgrounds. In this study, the generation of radiology impressions in different languages was automated by fine-tuning a model, publicly available, based on a multilingual text-to-text Transformer to summarize findings available in English, Portuguese, and German radiology reports. In a blind test, two board-certified radiologists indicated that for at least 70% of the system-generated summaries, the quality matched or exceeded the corresponding human-written summaries, suggesting substantial clinical reliability. Furthermore, this study showed that the multilingual model outperformed other models that specialized in summarizing radiology reports in only one language, as well as models that were not specifically designed for summarizing radiology reports, such as ChatGPT.
Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality
results: Experimental results show that the Voice2Action framework performs more efficiently and accurately in a virtual urban engineering environment than approaches without these optimizations. Abstract
Large Language Models (LLMs) are trained and aligned to follow natural language instructions with only a handful of examples, and they are prompted as task-driven autonomous agents to adapt to various sources of execution environments. However, deploying agent LLMs in virtual reality (VR) has been challenging due to the lack of efficiency in online interactions and the complex manipulation categories in 3D environments. In this work, we propose Voice2Action, a framework that hierarchically analyzes customized voice signals and textual commands through action and entity extraction and divides the execution tasks into canonical interaction subsets in real-time with error prevention from environment feedback. Experiment results in an urban engineering VR environment with synthetic instruction data show that Voice2Action can perform more efficiently and accurately than approaches without optimizations.
SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
paper_authors: Hangfeng He, Hongming Zhang, Dan Roth
for: Assessing the capacity of current models for complex reasoning.
methods: GPT-4 is used with Socratic-method prompts to automatically evaluate reasoning chain quality, without human-crafted reference chains.
results: SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Abstract
To comprehensively assess the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains to assess the model-derived chains. However, such ``gold-standard'' human-written reasoning chains may not be unique and their acquisition is often labor-intensive. Existing reference-free reasoning metrics eliminate the need for human-crafted reasoning chains as references, but they typically require fine-tuning on datasets with human-derived reasoning chains, which complicates the process and raises concerns regarding generalizability across diverse datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, obviating the need for human-crafted references. Leveraging the Socratic method, we devise tailored prompts to enhance reference-free reasoning evaluation, which we term SocREval (Socratic method for Reasoning Evaluation). Empirical results from four human annotated datasets reveal that SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, our proposed framework, large language models (LLMs) with the Socratic method, proves to be both cost-efficient and robust to prompt writing and example selection, as substantiated by our in-depth analysis.
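A reference-free judge of this kind boils down to a carefully structured prompt; the sketch below illustrates the Socratic-probe structure, with wording that is purely illustrative and not the paper's actual prompt.

```python
def socratic_eval_prompt(question, candidate_reasoning):
    """Reference-free rubric in the spirit of SocREval: the judge LLM is
    walked through Socratic probes before emitting a score."""
    return (
        f"Question: {question}\n"
        f"Candidate reasoning: {candidate_reasoning}\n\n"
        "First, restate the problem in your own words.\n"
        "Next, examine each step: is it relevant, logically valid, and "
        "consistent with the preceding steps?\n"
        "Finally, output a quality score from 1 (unsound) to 5 (sound)."
    )
```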
Emotional Listener Portrait: Neural Listener Head Generation with Emotion
results: Under several quantitative metrics, the ELP model shows significant improvements over previous methods.
Abstract
Listener head generation centers on generating non-verbal behaviors (e.g., smile) of a listener in reference to the information delivered by a speaker. A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which varies depending on the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotion in conversation. Benefiting from the ``explicit'' and ``discrete'' design, our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements compared to previous methods.
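The codeword formulation can be illustrated with a toy sketch: motions are drawn from an emotion-conditioned categorical distribution over a discrete codebook. The codebook entries and probabilities below are invented stand-ins for the learned quantities in ELP.

```python
import random

# Toy codebook of discrete motion-codewords (ELP learns these from data).
codebook = ["smile", "nod", "eyebrow_raise", "neutral"]
emotion_probs = {  # hypothetical emotion-conditioned distributions
    "happy":   [0.60, 0.20, 0.15, 0.05],
    "neutral": [0.10, 0.30, 0.10, 0.50],
}

def sample_listener_motion(emotion: str, length: int = 5, seed: int = 0):
    """Sample a sequence of codewords conditioned on the listener's emotion."""
    rng = random.Random(seed)
    weights = emotion_probs[emotion]
    return [rng.choices(codebook, weights=weights)[0] for _ in range(length)]

print(sample_listener_motion("happy"))    # diverse responses via sampling
print(sample_listener_motion("neutral"))  # controllable via the emotion label
```

Sampling from the distribution yields diverse responses, while conditioning the distribution on a predetermined attitude yields the controllability the paper describes.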
AI ensemble for signal detection of higher order gravitational wave modes of quasi-circular, spinning, non-precessing binary black hole mergers
results: The ensemble processed a year-long test set containing 300,000 injected signals within 5.19 minutes, provides state-of-the-art signal detection accuracy with only 2 misclassifications per year of searched data, and is the first AI ensemble designed to search for and detect higher-order gravitational wave mode signals.
Abstract
We introduce spatiotemporal-graph models that concurrently process data from the twin advanced LIGO detectors and the advanced Virgo detector. We trained these AI classifiers with 2.4 million \texttt{IMRPhenomXPHM} waveforms that describe quasi-circular, spinning, non-precessing binary black hole mergers with component masses $m_{\{1,2\}} \in [3M_\odot, 50 M_\odot]$, and individual spins $s^z_{\{1,2\}} \in [-0.9, 0.9]$; and which include the $(\ell, |m|) = \{(2, 2), (2, 1), (3, 3), (3, 2), (4, 4)\}$ modes, and mode mixing effects in the $\ell = 3, |m| = 2$ harmonics. We trained these AI classifiers within 22 hours using distributed training over 96 NVIDIA V100 GPUs in the Summit supercomputer. We then used transfer learning to create AI predictors that estimate the total mass of potential binary black holes identified by all AI classifiers in the ensemble. We used this ensemble, 3 AI classifiers and 2 predictors, to process a year-long test set in which we injected 300,000 signals. This year-long test set was processed within 5.19 minutes using 1024 NVIDIA A100 GPUs in the Polaris supercomputer (for AI inference) and 128 CPU nodes in the ThetaKNL supercomputer (for post-processing of noise triggers), housed at the Argonne Leadership Supercomputing Facility. These studies indicate that our AI ensemble provides state-of-the-art signal detection accuracy, and reports 2 misclassifications for every year of searched data. This is the first AI ensemble designed to search for and find higher order gravitational wave mode signals.
Efficient Streaming Language Models with Attention Sinks
paper_authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
for: Deploying large language models (LLMs) in streaming applications, such as multi-round dialogue, where long interactions are expected.
methods: Introduces StreamingLLM, a framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence lengths without any fine-tuning.
results: StreamingLLM enables several state-of-the-art LLMs to perform stable and efficient language modeling with up to 4 million tokens and more, and outperforms the sliding-window recomputation baseline by up to a 22.2x speedup in streaming settings.
Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a ``sink'' even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
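The cache policy at the heart of this idea can be sketched in a few lines: keep the KV entries of the first few "sink" tokens plus a sliding window of the most recent tokens. Names and sizes below are illustrative, not taken from the paper's released code.

```python
def evict_kv_cache(cache, n_sink=4, window=1020):
    """Keep the first n_sink entries (attention sinks) and the most recent
    `window` entries; drop everything in between."""
    if len(cache) <= n_sink + window:
        return cache  # nothing to evict yet
    return cache[:n_sink] + cache[-window:]

# Toy usage: integers stand in for per-token (key, value) tensors.
cache = list(range(2000))
kept = evict_kv_cache(cache, n_sink=4, window=8)
assert kept == [0, 1, 2, 3] + list(range(1992, 2000))
```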
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
results: ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets, with 13%-19% absolute improvements on average; ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute; and ToRA-Code-34B achieves over 50% accuracy on MATH, significantly outperforming GPT-4's CoT result and remaining competitive with GPT-4 solving problems with programs.
Abstract
Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.
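A minimal sketch of the tool-integrated loop: the model interleaves natural-language rationales with program blocks, and the interpreter's output is appended to the context before reasoning continues. The format markers and the unsandboxed `exec` are illustrative assumptions, not ToRA's actual implementation.

```python
import contextlib
import io

def run_tool_block(code: str) -> str:
    """Execute a generated code block and capture its stdout as an observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # a real system would use a proper sandbox
    return buf.getvalue().strip()

trajectory = [
    ("rationale", "To find the sum of squares up to 10, delegate to Python."),
    ("program", "print(sum(i * i for i in range(1, 11)))"),
]

context = []
for kind, content in trajectory:
    context.append(f"<{kind}> {content}")
    if kind == "program":
        context.append(f"<output> {run_tool_block(content)}")

print("\n".join(context))  # the context the model conditions on next
```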
results: LLMs can understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with real-world object motion patterns; the approach can be combined with any diffusion model that admits classifier guidance and significantly outperforms the base model and several strong baselines.
Abstract
Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
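One way to picture the layout-guided step: a bounding box produced by the LLM is converted into a spatial bias that boosts the diffusion model's cross-attention inside that region. Grid size, box format, and the biasing scheme below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def box_to_attention_bias(box, grid=8, strength=1.0):
    """box = (x0, y0, x1, y1) in [0, 1]; returns a grid x grid bias map that
    could be added to cross-attention logits for the corresponding object."""
    bias = np.zeros((grid, grid))
    x0, y0, x1, y1 = (int(round(v * grid)) for v in box)
    bias[y0:y1, x0:x1] = strength
    return bias

# A hypothetical LLM layout for "a car moving left to right" at an early frame:
print(box_to_attention_bias((0.0, 0.5, 0.3, 0.8), grid=8))
```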
Learning Decentralized Flocking Controllers with Spatio-Temporal Graph Neural Network
paper_authors: Siji Chen, Yanshen Sun, Peihan Li, Lifeng Zhou, Chang-Tien Lu
for: Decentralized flocking control of robot swarms.
methods: A spatio-temporal graph neural network (STGNN) that combines delayed states from distant ($L$-hop) neighbors with previous states from immediate neighbors.
results: Achieves decentralized control that imitates a centralized expert policy, accomplishing cohesive flocking, leader following, and obstacle avoidance across diverse scenarios.
Abstract
Recently, a line of research has explored the use of graph neural networks (GNNs) for decentralized control in swarm robotics. However, it has been observed that relying solely on the states of immediate neighbors is insufficient to imitate a centralized control policy. To address this limitation, prior studies proposed incorporating $L$-hop delayed states into the computation. While this approach shows promise, it can lead to a lack of consensus among distant flock members and the formation of small clusters, consequently resulting in the failure of cohesive flocking behaviors. Instead, our approach leverages a spatiotemporal GNN, named STGNN, that encompasses both spatial and temporal expansions. The spatial expansion collects delayed states from distant neighbors, while the temporal expansion incorporates previous states from immediate neighbors. The broader and more comprehensive information gathered from both expansions results in more effective and accurate predictions. We develop an expert algorithm for controlling a swarm of robots and employ imitation learning to train our decentralized STGNN model based on the expert algorithm. We simulate the proposed STGNN approach in various settings, demonstrating its decentralized capacity to emulate the global expert algorithm. Further, we implemented our approach to achieve cohesive flocking, leader following and obstacle avoidance by a group of Crazyflie drones. The performance of STGNN underscores its potential as an effective and reliable approach for achieving cohesive flocking, leader following and obstacle avoidance tasks.
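A toy sketch of the two expansions: each agent averages delayed states from multi-hop neighbors (spatial expansion) and its own recent history (temporal expansion). The averaging below stands in for STGNN's learned attention; shapes and the fusion scheme are illustrative assumptions.

```python
import numpy as np

def stgnn_inputs(history, adj, L=2):
    """history: (T, N, d) agent states over time; adj: (N, N) adjacency.
    Combines k-hop-delayed neighbor states with each agent's recent history."""
    T, N, d = history.shape
    reach = np.eye(N)
    spatial = np.zeros((N, d))
    for k in range(1, L + 1):
        reach = (reach @ adj > 0).astype(float)        # neighbors within k hops
        deg = reach.sum(1, keepdims=True).clip(min=1)
        spatial += (reach / deg) @ history[T - 1 - k]  # k-hop, k-step delayed
    temporal = history[-L:].mean(0)                    # own recent states
    return np.concatenate([spatial / L, temporal], axis=-1)

rng = np.random.default_rng(2)
adj = np.ones((4, 4)) - np.eye(4)                      # 4 fully connected agents
print(stgnn_inputs(rng.normal(size=(6, 4, 3)), adj).shape)  # (4, 6)
```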
DREAM: Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems
results: Evaluations across various simulated environments show that the approach improves the performance of resource-constrained robots by about 25% over the baseline method.
Abstract
Resource-constrained robots often suffer from energy inefficiencies, underutilized computational abilities due to inadequate task allocation, and a lack of robustness in dynamic environments, all of which strongly affect their performance. This paper introduces DREAM - Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems, a comprehensive framework that optimizes the allocation of resources for efficient exploration. It advances beyond conventional heuristic-based task planning as observed conventionally. The framework incorporates Operational Range Estimation using Reinforcement Learning to perform exploration and obstacle avoidance in unfamiliar terrains. DREAM further introduces an Energy Consumption Model for goal allocation, thereby ensuring mission completion under constrained resources using a Graph Neural Network. This approach also ensures that the entire Multi-Robot System can survive for an extended period of time for further missions compared to the conventional approach of randomly allocating goals, which compromises one or more agents. Our approach adapts to prioritizing agents in real-time, showcasing remarkable resilience against dynamic environments. This robust solution was evaluated in various simulated environments, demonstrating adaptability and applicability across diverse scenarios. We observed a substantial improvement of about 25% over the baseline method, leading the way for future research in resource-constrained robotics.
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
results: Experiments on vision-language, tabular processing, and mathematical reasoning tasks show substantial improvements over strong baselines, with adaptation to unseen domains and modalities requiring no training. In-depth analysis further shows that (1) consistent performance gains are achieved by scaling up the toolset size and the capability of the backbone models; (2) each component of the approach contributes to the performance gains; and (3) the created tools are well-structured and reliable, with low complexity and atomicity.
Abstract
Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrained by general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. For each task, we collect specific code solutions by prompting GPT-4 to solve the training examples. Following a validation step ensuring the correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioning on the retrieved snippets. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved by scaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity. The code is available at \url{https://github.com/lifan-yuan/CRAFT}.
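The retrieval step can be illustrated with a small sketch: given a new task description, pull the most similar snippets from the curated toolset. The Jaccard token-overlap similarity here is a stand-in assumption; CRAFT's actual retriever is not reproduced.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap between token sets, as a toy relevance score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve_tools(query: str, toolset: dict, k: int = 2):
    """toolset maps a description to a reusable, deduplicated code snippet."""
    ranked = sorted(toolset, key=lambda desc: similarity(query, desc),
                    reverse=True)
    return [toolset[desc] for desc in ranked[:k]]

toolset = {  # hypothetical abstracted snippets
    "crop an image to a bounding box": "def crop(img, box): ...",
    "compute the mean of each column in a table": "def col_means(rows): ...",
    "solve a quadratic equation": "def quad(a, b, c): ...",
}
print(retrieve_tools("find the mean of each column of this table", toolset, k=1))
```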
Classification of Potholes Based on Surface Area Using Pre-Trained Models of Convolutional Neural Network
results: MobileNet v2 detects potholes with 98% accuracy. For images taken at a height of 2 feet, classification accuracy is 87.33%, 88.67%, and 92% for large potholes, small potholes, and normal pavement, respectively; for images taken from full waist height (FFW), the accuracy for the three classes is 98.67%, 98.67%, and 100%.
Abstract
Potholes are dangerous: they can cause severe damage to vehicles as well as deadly accidents. In South Asian countries, pavement distress is the primary cause, due to poor subgrade conditions, lack of subsurface drainage, and excessive rainfall. The present research compares the performance of three pre-trained Convolutional Neural Network (CNN) models, i.e., ResNet 50, ResNet 18, and MobileNet. First, pavement images are classified according to whether they contain potholes, i.e., Pothole or Normal. Second, pavement images are classified into three categories, i.e., Small Pothole, Large Pothole, and Normal. Pavement images are taken from 3.5 feet (waist height) and 2 feet. MobileNet v2 has an accuracy of 98% for detecting a pothole. The classification of images taken at the height of 2 feet has an accuracy value of 87.33%, 88.67%, and 92% for classifying the large, small, and normal pavement, respectively. Similarly, the classification of the images taken from full waist (FFW) height has an accuracy value of 98.67%, 98.67%, and 100%.
results: The best-performing dataset, DFN-5B, enables training state-of-the-art models across tasks, including 83.0% zero-shot transfer accuracy on ImageNet; in addition, a new 2-billion-example dataset, DFN-2B, is released, demonstrating that high-performance data filtering networks can be trained from publicly available data.
Abstract
Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.
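The filtering step itself is simple once a scoring model exists: rank the candidate pool by DFN score and keep the top fraction. The sketch below uses a placeholder scorer; a real DFN would be a learned (e.g., CLIP-style) image-text model.

```python
def filter_pool(candidates, score_fn, keep_fraction=0.2):
    """Keep the top `keep_fraction` of a candidate pool by DFN score."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Toy usage: tuples stand in for (image, caption) pairs; the placeholder
# scorer just prefers a larger first field.
pool = [(5, "a"), (42, "b"), (17, "c"), (30, "d"), (2, "e")]
print(filter_pool(pool, score_fn=lambda pair: pair[0], keep_fraction=0.4))
# [(42, 'b'), (30, 'd')]
```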
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
results: Experiments show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J: whitebox and blackbox attacks recover "deleted" information from an edited model 38% of the time. The attacks exploit two key observations: (1) traces of deleted information can be found in the model's intermediate hidden states, and (2) applying an editing method to one question may not delete the information across rephrased versions of that question. Finally, new defense methods are provided, but no single universally effective defense is found.
Abstract
Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models.
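The paper's success criterion can be stated in a few lines: an extraction attack succeeds if the deleted answer appears among B generated candidates. The candidate generator below is a placeholder; in the paper, candidates are derived from, e.g., intermediate hidden states or rephrased prompts.

```python
def attack_succeeds(deleted_answer: str, candidate_generator, B: int = 10) -> bool:
    """True if the supposedly deleted answer is among B candidates."""
    candidates = [candidate_generator(i) for i in range(B)]
    return deleted_answer in candidates

# Toy stand-in generator that happens to leak the answer at position 3.
leaky = lambda i: "Paris" if i == 3 else f"guess_{i}"
print(attack_succeeds("Paris", leaky, B=10))  # True -> not truly deleted
```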
Adversarial Machine Learning in Latent Representations of Neural Networks
results: Extensive experimental analysis shows that (i) assuming the same level of information distortion, latent features are always more robust than input representations, and (ii) the adversarial robustness of distributed DNNs is jointly determined by the feature dimension and the generalization capability of the DNN. Across 10 adversarial attacks on the ImageNet-1K dataset, compressed latent representations reduce the attack success rate by 88% in the best case and by 57% on average compared to attacks on the input space.
Abstract
Distributed deep neural networks (DNNs) have been shown to reduce the computational burden of mobile devices and decrease the end-to-end inference latency in edge computing scenarios. While distributed DNNs have been studied, to the best of our knowledge the resilience of distributed DNNs to adversarial action still remains an open problem. In this paper, we fill the existing research gap by rigorously analyzing the robustness of distributed DNNs against adversarial action. We cast this problem in the context of information theory and introduce two new measurements for distortion and robustness. Our theoretical findings indicate that (i) assuming the same level of information distortion, latent features are always more robust than input representations; (ii) the adversarial robustness is jointly determined by the feature dimension and the generalization capability of the DNN. To test our theoretical findings, we perform extensive experimental analysis by considering 6 different DNN architectures, 6 different approaches for distributed DNN and 10 different adversarial attacks to the ImageNet-1K dataset. Our experimental results support our theoretical findings by showing that the compressed latent representations can reduce the success rate of adversarial attacks by 88% in the best case and by 57% on the average compared to attacks to the input space.
LoRA ensembles for large language model fine-tuning
results: LoRA ensembles improve predictive accuracy and uncertainty quantification, both on their own and on top of existing regularization techniques.
Abstract
Finetuned LLMs often exhibit poor uncertainty quantification, manifesting as overconfidence, poor calibration, and unreliable prediction results on test data or out-of-distribution samples. One approach commonly used in vision for alleviating this issue is a deep ensemble, which constructs an ensemble by training the same model multiple times using different random initializations. However, there is a huge challenge to ensembling LLMs: the most effective LLMs are very, very large. Keeping a single LLM in memory is already challenging enough: keeping an ensemble of e.g. 5 LLMs in memory is impossible in many settings. To address these issues, we propose an ensemble approach using Low-Rank Adapters (LoRA), a parameter-efficient fine-tuning technique. Critically, these low-rank adapters represent a very small number of parameters, orders of magnitude less than the underlying pre-trained model. Thus, it is possible to construct large ensembles of LoRA adapters with almost the same computational overhead as using the original model. We find that LoRA ensembles, applied on its own or on top of pre-existing regularization techniques, gives consistent improvements in predictive accuracy and uncertainty quantification.
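The memory argument can be made concrete with a toy sketch: one frozen base weight is shared, each ensemble member adds only a rank-r update, and predictive probabilities are averaged across members. Shapes and the linear "model" are illustrative; real adapters attach to attention layers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_classes, rank, n_members = 16, 4, 2, 5

W_base = rng.normal(size=(n_classes, d))                # frozen pre-trained weight
adapters = [(0.1 * rng.normal(size=(n_classes, rank)),  # A_i
             0.1 * rng.normal(size=(rank, d)))          # B_i
            for _ in range(n_members)]                  # tiny per-member overhead

x = rng.normal(size=d)
probs = np.mean([softmax((W_base + A @ B) @ x) for A, B in adapters], axis=0)
print(probs, probs.sum())  # ensemble-averaged predictive distribution
```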
Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency
for: The paper aims to improve the ability of large language models (LLMs) to complete tasks provably within a minimum number of interactions with the external environment.
methods: The proposed “reason for future, act for now” (\texttt{RAFA}) framework combines long-term reasoning and short-term acting to achieve provable regret guarantees. The framework includes a prompt template for reasoning, learning and planning in Bayesian adaptive Markov decision processes (MDPs), and an “in-context” actor-critic update.
results: The proposed framework achieves a $\sqrt{T}$ regret bound, outperforming various existing frameworks and achieving nearly perfect scores on a few benchmarks.
Abstract
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (\texttt{RAFA}). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks.
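The control flow of "reason for future, act for now" fits in a short loop: replan a multi-step trajectory from memory at every step, execute only its first action, and store the feedback. The environment and planner below are trivial placeholders for the LLM subroutines the paper describes.

```python
def rafa_loop(env_step, plan, horizon=3, max_steps=5):
    memory = []                # buffer of (state, action, feedback)
    state = "start"
    for _ in range(max_steps):
        trajectory = plan(state, memory, horizon)  # reason for future
        action = trajectory[0]                     # act for now
        state, feedback = env_step(state, action)
        memory.append((state, action, feedback))   # learn from feedback
    return memory

# Toy environment and planner for illustration only.
env = lambda s, a: (f"{s}->{a}", f"observed {a}")
planner = lambda s, mem, h: [f"a{len(mem) + i}" for i in range(h)]
for record in rafa_loop(env, planner):
    print(record)
```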
Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile
results: Experiments show that a fine-tuned GPT LLM with 3 billion parameters can run smoothly on devices with as little as 4GB of memory; the application also provides text-to-actions features, letting users control the mobile device through text input.
Abstract
The field of Artificial Intelligence has witnessed remarkable progress in recent years, especially with the emergence of powerful large language models (LLMs) based on the transformer architecture. Cloud-based LLMs, such as OpenAI's ChatGPT, offer impressive capabilities but come with concerns regarding latency and privacy due to network dependencies. This article presents an innovative approach to LLM inference, envisioning a future where LLMs with billions of parameters can be executed directly on mobile devices without network connectivity. The article showcases a fine-tuned GPT LLM with 3 billion parameters that can operate smoothly on devices with as low as 4GB of memory. Through the integration of native code and model quantization techniques, the application not only serves as a general-purpose assistant but also facilitates seamless mobile interactions with text-to-actions features. The article provides insights into the training pipeline, implementation details, test results, and future directions of on-device LLM inference. This breakthrough technology opens up possibilities for empowering users with sophisticated AI capabilities while preserving their privacy and eliminating latency concerns.
Neural Lithography: Close the Design-to-Manufacturing Gap in Computational Optics with a ‘Real2Sim’ Learned Photolithography Simulator
paper_authors: Cheng Zheng, Guangyuan Zhao, Peter T. C. So
for: Bridging the "design-to-manufacturing" gap in computational optics.
methods: A fully differentiable design framework integrating a pre-trained photolithography simulator, leveraging physics-informed modeling and data-driven training.
results: Improved optical performance on task-specific metrics for a holographic optical element (HOE) and a multi-level diffractive lens (MDL) fabricated with a two-photon lithography system.
Abstract
We introduce neural lithography to address the 'design-to-manufacturing' gap in computational optics. Computational optics with large design degrees of freedom enable advanced functionalities and performance beyond traditional optics. However, the existing design approaches often overlook the numerical modeling of the manufacturing process, which can result in significant performance deviation between the design and the fabricated optics. To bridge this gap, we, for the first time, propose a fully differentiable design framework that integrates a pre-trained photolithography simulator into the model-based optical design loop. Leveraging a blend of physics-informed modeling and data-driven training using experimentally collected datasets, our photolithography simulator serves as a regularizer on fabrication feasibility during design, compensating for structure discrepancies introduced in the lithography process. We demonstrate the effectiveness of our approach through two typical tasks in computational optics, where we design and fabricate a holographic optical element (HOE) and a multi-level diffractive lens (MDL) using a two-photon lithography system, showcasing improved optical performance on the task-specific metrics.
MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search
results: Proposes MixQuant, a search algorithm that finds the optimal quantization bit-width for each layer's weights based on roundoff error and can be combined with any quantization method to improve quantized model accuracy.
Abstract
Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than f32 floating point precision. Quantization reduces model size and inference latency, and therefore allows for DNNs to be deployed on platforms with constrained computational resources and real-time systems. However, quantization can lead to numerical instability caused by roundoff error which leads to inaccurate computations and therefore, a decrease in quantized model accuracy. Similarly to prior works, which have shown that both biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths, we show that some weights are more sensitive than others which should be reflected on their quantization bit-width. To that end we propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error and can be combined with any quantization method as a form of pre-processing optimization. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone. Additionally, we combine MixQuant with vanilla asymmetric quantization to show that MixQuant has the potential to optimize the performance of any quantization technique.
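A rough sketch of the per-layer idea: quantize a layer's weights at several candidate bit-widths, measure roundoff error, and pick the smallest bit-width that stays under a tolerance. The uniform quantizer and the tolerance criterion are illustrative assumptions; MixQuant's actual search objective is not reproduced here.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def smallest_feasible_bitwidth(w, tol=1e-3, candidates=(2, 3, 4, 8)):
    """Pick the smallest bit-width whose roundoff MSE stays under `tol`."""
    err = float("inf")
    for b in sorted(candidates):
        err = float(np.mean((w - quantize(w, b)) ** 2))
        if err <= tol:
            return b, err
    return max(candidates), err  # fall back to the largest budgeted bit-width

rng = np.random.default_rng(1)
layer_weights = rng.normal(size=(64, 64))
print(smallest_feasible_bitwidth(layer_weights))  # sensitive layers need more bits
```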
Improving Trajectory Prediction in Dynamic Multi-Agent Environment by Dropping Waypoints
for: Improving the accuracy of trajectory prediction by addressing the challenges of modeling diverse and uncertain trajectories.
methods: A novel framework, Temporal Waypoint Dropping (TWD), which promotes explicit temporal learning through the waypoint dropping technique.
results: Extensive experiments on three datasets (NBA Sports VU, ETH-UCY, and TrajNet++) show that TWD effectively forces the model to learn complex temporal correlations.
Abstract
The inherently diverse and uncertain nature of trajectories presents a formidable challenge in accurately modeling them. Motion prediction systems must effectively learn spatial and temporal information from the past to forecast the future trajectories of the agent. Many existing methods learn temporal motion via separate components within stacked models to capture temporal features. This paper introduces a novel framework, called Temporal Waypoint Dropping (TWD), that promotes explicit temporal learning through the waypoint dropping technique. Learning through waypoint dropping can compel the model to improve its understanding of temporal correlations among agents, thus leading to a significant enhancement in trajectory prediction. Trajectory prediction methods often operate under the assumption that observed trajectory waypoint sequences are complete, disregarding real-world scenarios where missing values may occur, which can influence their performance. Moreover, these models frequently exhibit a bias towards particular waypoint sequences when making predictions. Our TWD is capable of effectively addressing these issues. It incorporates stochastic and fixed processes that regularize projected past trajectories by strategically dropping waypoints based on temporal sequences. Through extensive experiments, we demonstrate the effectiveness of TWD in forcing the model to learn complex temporal correlations among agents. Our approach can complement existing trajectory prediction methods to enhance prediction accuracy. We also evaluate our proposed method across three datasets: NBA Sports VU, ETH-UCY, and TrajNet++.
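The central operation is easy to sketch: during training, some observed waypoints are masked out, stochastically or at fixed temporal positions, so the model must reason across gaps. The schedule below is an illustrative assumption, not TWD's exact procedure.

```python
import random

def drop_waypoints(trajectory, p_drop=0.3, fixed_drops=(), seed=0):
    """Replace some waypoints with None (missing), combining a stochastic
    process (p_drop) with a fixed set of dropped time steps."""
    rng = random.Random(seed)
    out = []
    for t, point in enumerate(trajectory):
        if t in fixed_drops or rng.random() < p_drop:
            out.append(None)   # the model must infer across this gap
        else:
            out.append(point)
    return out

traj = [(t, 2 * t) for t in range(8)]  # toy (x, y) waypoints
print(drop_waypoints(traj, p_drop=0.25, fixed_drops={3}))
```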
Toward Operationalizing Pipeline-aware ML Fairness: A Research Agenda for Developing Practical Guidelines and Tools
results: Presents an extensive categorization of prior work and a research agenda to help researchers, practitioners, and students explore, design, and test pipeline-oriented approaches to algorithmic fairness.
Abstract
While algorithmic fairness is a thriving area of research, in practice, mitigating issues of bias often gets reduced to enforcing an arbitrarily chosen fairness metric, either by enforcing fairness constraints during the optimization step, post-processing model outputs, or by manipulating the training data. Recent work has called on the ML community to take a more holistic approach to tackle fairness issues by systematically investigating the many design choices made through the ML pipeline, and identifying interventions that target the issue's root cause, as opposed to its symptoms. While we share the conviction that this pipeline-based approach is the most appropriate for combating algorithmic unfairness on the ground, we believe there are currently very few methods of \emph{operationalizing} this approach in practice. Drawing on our experience as educators and practitioners, we first demonstrate that without clear guidelines and toolkits, even individuals with specialized ML knowledge find it challenging to hypothesize how various design choices influence model behavior. We then consult the fair-ML literature to understand the progress to date toward operationalizing the pipeline-aware approach: we systematically collect and organize the prior work that attempts to detect, measure, and mitigate various sources of unfairness through the ML pipeline. We utilize this extensive categorization of previous contributions to sketch a research agenda for the community. We hope this work serves as the stepping stone toward a more comprehensive set of resources for ML researchers, practitioners, and students interested in exploring, designing, and testing pipeline-oriented approaches to algorithmic fairness.
results: AGG achieves state-of-the-art results in time series data imputation, classification, and prediction on the benchmark datasets Beijing Air Quality, PhysioNet Challenge 2012, and UCI localisation.
Abstract
We introduce the asynchronous graph generator (AGG), a novel graph neural network architecture for multi-channel time series which models observations as nodes on a dynamic graph and can thus perform data imputation by transductive node generation. Completely free from recurrent components or assumptions about temporal regularity, AGG represents measurements, timestamps and metadata directly in the nodes via learnable embeddings, to then leverage attention to learn expressive relationships across the variables of interest. This way, the proposed architecture implicitly learns a causal graph representation of sensor measurements which can be conditioned on unseen timestamps and metadata to predict new measurements by an expansion of the learnt graph. The proposed AGG is compared both conceptually and empirically to previous work, and the impact of data augmentation on the performance of AGG is also briefly discussed. Our experiments reveal that AGG achieved state-of-the-art results in time series data imputation, classification and prediction for the benchmark datasets Beijing Air Quality, PhysioNet Challenge 2012 and UCI localisation.
Efficient Anatomical Labeling of Pulmonary Tree Structures via Implicit Point-Graph Networks
results: Delivers state-of-the-art accuracy at a low computational cost, with resulting models that have usable surfaces; due to the scarcity of publicly accessible data, an extensive evaluation dataset was also curated and will be made public.
Abstract
Pulmonary diseases rank prominently among the principal causes of death worldwide. Curing them will require, among other things, a better understanding of the many complex 3D tree-shaped structures within the pulmonary system, such as airways, arteries, and veins. In theory, they can be modeled using high-resolution image stacks. Unfortunately, standard CNN approaches operating on dense voxel grids are prohibitively expensive. To remedy this, we introduce a point-based approach that preserves graph connectivity of tree skeleton and incorporates an implicit surface representation. It delivers SOTA accuracy at a low computational cost and the resulting models have usable surfaces. Due to the scarcity of publicly accessible data, we have also curated an extensive dataset to evaluate our approach and will make it public.
Assessing Look-Ahead Bias in Stock Return Predictions Generated By GPT Sentiment Analysis
results: In-sample (within the LLM training window), strategies based on anonymized headlines outperform those based on the original text, indicating that the distraction effect has a greater impact than look-ahead bias; this tendency is particularly strong for larger companies. Out-of-sample, look-ahead bias is not a concern, but distraction remains possible.
Abstract
Large language models (LLMs), including ChatGPT, can extract profitable trading signals from the sentiment in news text. However, backtesting such strategies poses a challenge because LLMs are trained on many years of data, and backtesting produces biased results if the training and backtesting periods overlap. This bias can take two forms: a look-ahead bias, in which the LLM may have specific knowledge of the stock returns that followed a news article, and a distraction effect, in which general knowledge of the companies named interferes with the measurement of a text's sentiment. We investigate these sources of bias through trading strategies driven by the sentiment of financial news headlines. We compare trading performance based on the original headlines with de-biased strategies in which we remove the relevant company's identifiers from the text. In-sample (within the LLM training window), we find, surprisingly, that the anonymized headlines outperform, indicating that the distraction effect has a greater impact than look-ahead bias. This tendency is particularly strong for larger companies--companies about which we expect an LLM to have greater general knowledge. Out-of-sample, look-ahead bias is not a concern but distraction remains possible. Our proposed anonymization procedure is therefore potentially useful in out-of-sample implementation, as well as for de-biased backtesting.
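The de-biasing procedure reduces to removing company identifiers before scoring sentiment; a minimal sketch is below. The identifier list is an illustrative stand-in for whatever entity mapping a production pipeline would use.

```python
import re

def anonymize_headline(headline: str, identifiers) -> str:
    """Replace company names/tickers with a neutral token before sentiment scoring."""
    for name in identifiers:
        headline = re.sub(re.escape(name), "the company", headline, flags=re.I)
    return headline

headline = "Acme Corp (ACME) beats earnings estimates, shares jump"
print(anonymize_headline(headline, identifiers=["Acme Corp", "ACME"]))
# -> "the company (the company) beats earnings estimates, shares jump"
```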
Building Privacy-Preserving and Secure Geospatial Artificial Intelligence Foundation Models
results: Identifies the privacy and security risks that arise throughout the lifecycle of GeoAI foundation models and proposes a comprehensive blueprint of research directions and preventative and control strategies to help researchers and policymakers understand and address these issues.
Abstract
In recent years we have seen substantial advances in foundation models for artificial intelligence, including language, vision, and multimodal models. Recent studies have highlighted the potential of using foundation models in geospatial artificial intelligence, known as GeoAI Foundation Models, for geographic question answering, remote sensing image understanding, map generation, and location-based services, among others. However, the development and application of GeoAI foundation models can pose serious privacy and security risks, which have not been fully discussed or addressed to date. This paper introduces the potential privacy and security risks throughout the lifecycle of GeoAI foundation models and proposes a comprehensive blueprint for research directions and preventative and control strategies. Through this vision paper, we hope to draw the attention of researchers and policymakers in geospatial domains to these privacy and security risks inherent in GeoAI foundation models and advocate for the development of privacy-preserving and secure GeoAI foundation models.
AutoAgents: A Framework for Automatic Agent Generation
results: Experiments show that the framework generates more coherent and accurate solutions than existing multi-agent methods.
Abstract
Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents' responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents.
AI-Aristotle: A Physics-Informed framework for Systems Biology Gray-Box Identification
paper_authors: Nazanin Ahmadi Daryakenari, Mario De Florio, Khemraj Shukla, George Em Karniadakis
for: Discovering the unknown governing equations of biological systems and inferring them from observed data.
methods: The approach combines eXtreme Theory of Functional Connections (X-TFC) domain decomposition and Physics-Informed Neural Networks (PINNs) with symbolic regression (SR) techniques for parameter discovery and gray-box identification.
results: Tested on two benchmark problems in Systems Biology, the method shows high accuracy, speed, flexibility, and robustness.
Abstract
Discovering mathematical equations that govern physical and biological systems from observed data is a fundamental challenge in scientific research. We present a new physics-informed framework for parameter estimation and missing physics identification (gray-box) in the field of Systems Biology. The proposed framework -- named AI-Aristotle -- combines eXtreme Theory of Functional Connections (X-TFC) domain-decomposition and Physics-Informed Neural Networks (PINNs) with symbolic regression (SR) techniques for parameter discovery and gray-box identification. We test the accuracy, speed, flexibility and robustness of AI-Aristotle based on two benchmark problems in Systems Biology: a pharmacokinetics drug absorption model, and an ultradian endocrine model for glucose-insulin interactions. We compare the two machine learning methods (X-TFC and PINNs), and moreover, we employ two different symbolic regression techniques to cross-verify our results. While the current work focuses on the performance of AI-Aristotle based on synthetic data, it can equally handle noisy experimental data and can even be used for black-box identification in just a few minutes on a laptop. More broadly, our work provides insights into the accuracy, cost, scalability, and robustness of integrating neural networks with symbolic regressors, offering a comprehensive guide for researchers tackling gray-box identification challenges in complex dynamical systems in biomedicine and beyond.
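For intuition, the gray-box idea can be sketched in a few lines: fit a network to trajectory data while treating an unknown physics parameter as trainable and penalizing the ODE residual. The PyTorch sketch below is illustrative only (it is not the authors' X-TFC implementation); the toy law dx/dt = -k*x and all names are assumptions.

```python
import torch

# Gray-box sketch: recover the unknown rate constant k in dx/dt = -k*x
# from noisy synthetic observations, using a PINN-style residual penalty.
t = torch.linspace(0, 5, 100).reshape(-1, 1)
x_data = torch.exp(-0.7 * t) + 0.01 * torch.randn_like(t)  # ground truth k = 0.7

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
log_k = torch.nn.Parameter(torch.tensor(0.0))              # unknown physics parameter
opt = torch.optim.Adam(list(net.parameters()) + [log_k], lr=1e-2)

for step in range(2000):
    t_req = t.clone().requires_grad_(True)
    x_hat = net(t_req)
    dx_dt = torch.autograd.grad(x_hat.sum(), t_req, create_graph=True)[0]
    residual = dx_dt + torch.exp(log_k) * x_hat            # ODE residual
    loss = ((net(t) - x_data) ** 2).mean() + (residual ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("recovered k:", torch.exp(log_k).item())             # should approach 0.7
```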
Split and Merge: Aligning Position Biases in Large Language Model based Evaluators
results: Extensive experiments on 11,520 answer pairs show that PORTIA markedly improves the consistency rates of LLMs, reaching up to 98%; it also enables less advanced GPT models to match the performance of the state-of-the-art GPT-4 model at a fraction of the evaluation cost.
Abstract
Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while keeping cost-efficiency. This represents a valuable step toward a more reliable and scalable use of LLMs for automated evaluations across diverse applications.
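A minimal sketch of the split-align-merge step may help make the idea concrete. It assumes sentence-level segmentation and uses difflib string similarity as a crude stand-in for PORTIA's actual alignment; the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

def split_sentences(answer: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]

def align_and_merge(answer_a: str, answer_b: str) -> str:
    """Pair each segment of answer A with its most similar segment of
    answer B, then interleave the pairs into one evaluation prompt."""
    segs_a, segs_b = split_sentences(answer_a), split_sentences(answer_b)
    merged = []
    for i, sa in enumerate(segs_a):
        sb = max(segs_b, key=lambda s: SequenceMatcher(None, sa, s).ratio(), default="")
        merged.append(f"Part {i + 1}:\nAnswer A: {sa}\nAnswer B: {sb}")
    return "\n\n".join(merged)

prompt = align_and_merge("Paris is the capital. It lies on the Seine.",
                         "The capital of France is Paris. The Seine runs through it.")
print(prompt)  # the merged prompt is what gets sent to the LLM judge
```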
results: Our experiments show that PB-LLM achieves extreme low-bit quantization while preserving the linguistic reasoning capacity of LLMs. We further propose a Hessian-based reconstruction method that recovers the capacity of compressed LLMs under PTQ, and a scaling method with frozen salient weights that achieves better quantization accuracy under QAT.
Abstract
This paper explores network binarization, a radical form of quantization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. Due to previous binarization methods collapsing LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partial binarization. PB-LLM is extended to recover the capacities of quantized LLMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs. The code is available at https://github.com/hahnyuan/BinaryLLM.
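The core partial-binarization step can be sketched compactly. The snippet below assumes magnitude-based saliency and a per-tensor scaling factor, and omits the paper's Hessian-guided reconstruction (PTQ) and scaling-factor derivation (QAT).

```python
import torch

def partially_binarize(weight: torch.Tensor, salient_ratio: float = 0.1):
    """Keep the top `salient_ratio` of weights (by magnitude) in full
    precision; binarize the rest to +/-alpha, with alpha chosen to
    preserve the mean magnitude of the binarized weights."""
    k = max(1, int(salient_ratio * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    salient_mask = weight.abs() >= threshold
    alpha = weight[~salient_mask].abs().mean()          # per-tensor scale
    quantized = torch.where(salient_mask, weight, alpha * weight.sign())
    return quantized, salient_mask

w = torch.randn(256, 256)
w_q, mask = partially_binarize(w, salient_ratio=0.05)
print(f"high-precision fraction: {mask.float().mean().item():.1%}")
```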
results: The method was evaluated on a dataset of legal opinions and shown to outperform several strong baselines with respect to ROUGE, BERTScore, and structural similarity.
Abstract
We propose an approach for the structure controllable summarization of long legal opinions that considers the argument structure of the document. Our approach involves using predicted argument role information to guide the model in generating coherent summaries that follow a provided structure pattern. We demonstrate the effectiveness of our approach on a dataset of legal opinions and show that it outperforms several strong baselines with respect to ROUGE, BERTScore, and structure similarity.
Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4
results: Experiments show that GPT-4 achieves strong performance across a range of imperfect-information games without any specialized training or examples.
Abstract
Unlike perfect information games, where all elements are known to every player, imperfect information games emulate the real-world complexities of decision-making under uncertain or incomplete information. GPT-4, the recent breakthrough in large language models (LLMs) trained on massive passive data, is notable for its knowledge retrieval and reasoning abilities. This paper delves into the applicability of GPT-4's learned knowledge for imperfect information games. To achieve this, we introduce \textbf{Suspicion-Agent}, an innovative agent that leverages GPT-4's capabilities for performing in imperfect information games. With proper prompt engineering to achieve different functions, Suspicion-Agent based on GPT-4 demonstrates remarkable adaptability across a range of imperfect information card games. Importantly, GPT-4 displays a strong high-order theory of mind (ToM) capacity, meaning it can understand others and intentionally impact others' behavior. Leveraging this, we design a planning strategy that enables GPT-4 to competently play against different opponents, adapting its gameplay style as needed, while requiring only the game rules and descriptions of observations as input. In the experiments, we qualitatively showcase the capabilities of Suspicion-Agent across three different imperfect information games and then quantitatively evaluate it in Leduc Hold'em. The results show that Suspicion-Agent can potentially outperform traditional algorithms designed for imperfect information games, without any specialized training or examples. In order to encourage and foster deeper insights within the community, we make our game-related data publicly available.
Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency
paper_authors: Baizhou Huang, Shuai Lu, Weizhu Chen, Xiaojun Wan, Nan Duan
methods: The paper uses the following methods:
* Sample multiple diverse outputs from several perspectives (solution, specification, and test case) for a given query and organize them into a multipartite graph.
* Embed self-consistency information into the graph using two predefined consistency measures.
* Select the optimal choice based on consistency analysis over the graph.
results: Experiments show that the MPSC framework boosts performance on several popular benchmarks, including HumanEval (+17.60%), HumanEval Plus (+17.61%), MBPP (+6.50%), and CodeContests (+11.82%) in Pass@1, outperforming the original outputs generated by ChatGPT and even surpassing GPT-4.
Abstract
Large language models (LLMs) have exhibited remarkable ability in textual generation. However, in complex reasoning tasks such as code generation, generating the correct answer in a single attempt remains a formidable challenge for LLMs. Previous research has explored solutions by aggregating multiple outputs, leveraging the consistency among them. However, none of them have comprehensively captured this consistency from different perspectives. In this paper, we propose the Multi-Perspective Self-Consistency (MPSC) framework, a novel decoding strategy for LLM that incorporates both inter-consistency across outputs from multiple perspectives and intra-consistency within a single perspective. Specifically, we ask LLMs to sample multiple diverse outputs from various perspectives for a given query and then construct a multipartite graph based on them. With two predefined measures of consistency, we embed both inter- and intra-consistency information into the graph. The optimal choice is then determined based on consistency analysis in the graph. We conduct comprehensive evaluation on the code generation task by introducing solution, specification and test case as three perspectives. We leverage a code interpreter to quantitatively measure the inter-consistency and propose several intra-consistency measure functions. Our MPSC framework significantly boosts the performance on various popular benchmarks, including HumanEval (+17.60%), HumanEval Plus (+17.61%), MBPP (+6.50%) and CodeContests (+11.82%) in Pass@1, when compared to original outputs generated from ChatGPT, and even surpassing GPT-4.
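A toy version of the inter-consistency measure on the solution/test-case slice of the graph is sketched below. It uses a naive exec-based checker as a stand-in for the paper's code interpreter, and it omits the intra-consistency measures and the graph-based optimization.

```python
def passes(solution_code: str, test_code: str) -> bool:
    """Run the candidate solution, then the test, in one namespace."""
    env = {}
    try:
        exec(solution_code, env)
        exec(test_code, env)
        return True
    except Exception:
        return False

# Score each sampled solution by how many sampled test cases it satisfies
# (one edge type of the multipartite consistency graph).
solutions = ["def add(a, b):\n    return a + b",
             "def add(a, b):\n    return a - b"]
tests = ["assert add(2, 3) == 5", "assert add(1, 1) == 2"]

scores = {s: sum(passes(s, t) for t in tests) for s in solutions}
best = max(solutions, key=scores.get)
print(scores[best], best)
```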
A Foundation Model for General Moving Object Segmentation in Medical Images
results: Experimental results show that iMOS achieves accurate tracking and segmentation in medical image sequences while requiring only a small number of annotated images.
Abstract
Medical image segmentation aims to delineate the anatomical or pathological structures of interest, playing a crucial role in clinical diagnosis. A substantial amount of high-quality annotated data is crucial for constructing high-precision deep segmentation models. However, medical annotation is highly cumbersome and time-consuming, especially for medical videos or 3D volumes, due to the huge labeling space and poor inter-frame consistency. Recently, a fundamental task named Moving Object Segmentation (MOS) has made significant advancements in natural images. Its objective is to delineate moving objects from the background within image sequences, requiring only minimal annotations. In this paper, we propose the first foundation model, named iMOS, for MOS in medical images. Extensive experiments on a large multi-modal medical dataset validate the effectiveness of the proposed iMOS. Specifically, with the annotation of only a small number of images in the sequence, iMOS can achieve satisfactory tracking and segmentation performance of moving objects throughout the entire sequence in bi-directions. We hope that the proposed iMOS can help accelerate the annotation speed of experts, and boost the development of medical foundation models.
PlaceNav: Topological Navigation through Place Recognition
results: Experimental results show that the new model achieves 76% and 23% higher success rates in indoor and outdoor navigation tasks, respectively, with higher computational efficiency.
Abstract
Recent results suggest that splitting topological navigation into robot-independent and robot-specific components improves navigation performance by enabling the robot-independent part to be trained with data collected by different robot types. However, the navigation methods are still limited by the scarcity of suitable training data and suffer from poor computational scaling. In this work, we present PlaceNav, subdividing the robot-independent part into navigation-specific and generic computer vision components. We utilize visual place recognition for the subgoal selection of the topological navigation pipeline. This makes subgoal selection more efficient and enables leveraging large-scale datasets from non-robotics sources, increasing training data availability. Bayesian filtering, enabled by place recognition, further improves navigation performance by increasing the temporal consistency of subgoals. Our experimental results verify the design and the new model obtains a 76% higher success rate in indoor and 23% higher in outdoor navigation tasks with higher computational efficiency.
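The Bayesian-filtering step can be sketched as a discrete filter over topological map nodes. The snippet below assumes a simple forward-motion prior and treats place-recognition similarity scores as the measurement likelihood; all numbers are illustrative.

```python
import numpy as np

def bayes_filter_step(belief, similarities, spread=2):
    """One update of a discrete Bayes filter over map nodes: predict with a
    forward-motion model (advance 0..spread nodes), then weight the
    prediction by place-recognition similarity scores."""
    n = len(belief)
    predicted = np.zeros(n)
    for step in range(spread + 1):
        predicted[step:] += belief[: n - step] / (spread + 1)
    posterior = predicted * similarities
    return posterior / posterior.sum()

belief = np.full(10, 0.1)                          # uniform prior over 10 nodes
sims = np.array([.1, .1, .2, .9, .4, .1, .1, .1, .1, .1])
belief = bayes_filter_step(belief, sims)
subgoal = int(np.argmax(belief))                   # temporally consistent subgoal
print(subgoal, belief.round(2))
```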
Knowledge Graphs for the Life Sciences: Recent Developments, Challenges and Opportunities
results: Through a selection of exemplary use cases, the paper surveys developments and challenges in three areas: constructing and managing Knowledge Graphs (KGs), using KGs and associated technologies to discover new knowledge, and using KGs to support explainable artificial intelligence applications.
Abstract
The term life sciences refers to the disciplines that study living organisms and life processes, and include chemistry, biology, medicine, and a range of other related disciplines. Research efforts in life sciences are heavily data-driven, as they produce and consume vast amounts of scientific data, much of which is intrinsically relational and graph-structured. The volume of data and the complexity of scientific concepts and relations referred to therein promote the application of advanced knowledge-driven technologies for managing and interpreting data, with the ultimate aim to advance scientific discovery. In this survey and position paper, we discuss recent developments and advances in the use of graph-based technologies in life sciences and set out a vision for how these technologies will impact these fields into the future. We focus on three broad topics: the construction and management of Knowledge Graphs (KGs), the use of KGs and associated technologies in the discovery of new knowledge, and the use of KGs in artificial intelligence applications to support explanations (explainable AI). We select a few exemplary use cases for each topic, discuss the challenges and open research questions within these topics, and conclude with a perspective and outlook that summarizes the overarching challenges and their potential solutions as a guide for future research.
Forest Mixing: investigating the impact of multiple search trees and a shared refinements pool on ontology learning
methods: The paper extends the Class Expression Learning for Ontology Engineering (CELOE) algorithm in the DL-Learner tool, using multiple search trees and a shared refinements pool to decompose the search space. It also introduces a conjunction operation that merges the best class expressions from each search tree, keeping the most informative results.
results: The current implementation and settings show that the Forest Mixing approach does not outperform the traditional CELOE. Nevertheless, the conceptual proposal may stimulate future improvements in finding class expressions in ontologies, particularly when traversing large search spaces.
Abstract
We aim at developing white-box machine learning algorithms. We focus here on algorithms for learning axioms in description logic. We extend the Class Expression Learning for Ontology Engineering (CELOE) algorithm contained in the DL-Learner tool. The approach uses multiple search trees and a shared pool of refinements in order to split the search space in smaller subspaces. We introduce the conjunction operation of best class expressions from each tree, keeping the results which give the most information. The aim is to foster exploration from a diverse set of starting classes and to streamline the process of finding class expressions in ontologies, particularly in large search spaces. The current implementation and settings indicated that the Forest Mixing approach did not outperform the traditional CELOE. Despite these results, the conceptual proposal brought forward by this approach may stimulate future improvements in class expression finding in ontologies, and may influence the way we traverse search spaces in general.
Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
paper_authors: Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy
for: Addressing the problem of prompt brittleness and various bias factors in large language models (LLMs) to improve their performance.
methods: Propose a simple and intuitive calibration method called Batch Calibration (BC) that controls contextual bias from batched input and unifies various prior approaches.
results: Demonstrates state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks using PaLM 2-(S, M, L) and CLIP models.
Abstract
Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
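The calibration step itself is only a few lines. The sketch below assumes the contextual bias is estimated as the mean class log-probability over the batch and subtracted before the argmax; the paper also discusses a probability-space variant and a few-shot extension that learns the bias from labeled data.

```python
import numpy as np

def batch_calibrate(log_probs: np.ndarray) -> np.ndarray:
    """Estimate the contextual bias as the per-class mean over the batch
    and remove it, so predictions are made relative to the batch."""
    bias = log_probs.mean(axis=0, keepdims=True)
    return log_probs - bias

# Log-probabilities for four inputs over two classes, skewed toward class 0
lp = np.log(np.array([[0.70, 0.30], [0.60, 0.40], [0.90, 0.10], [0.55, 0.45]]))
print("raw:", lp.argmax(axis=1), "calibrated:", batch_calibrate(lp).argmax(axis=1))
```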
MORPH: Design Co-optimization with Reinforcement Learning via a Differentiable Hardware Model Proxy
results: In simulated 2D reaching and 3D multi-fingered manipulation tasks, MORPH effectively co-optimizes hardware design parameters and control policies while keeping the hardware proxy as close as possible to its realistic counterpart.
Abstract
We introduce MORPH, a method for co-optimization of hardware design parameters and control policies in simulation using reinforcement learning. Like most co-optimization methods, MORPH relies on a model of the hardware being optimized, usually simulated based on the laws of physics. However, such a model is often difficult to integrate into an effective optimization routine. To address this, we introduce a proxy hardware model, which is always differentiable and enables efficient co-optimization alongside a long-horizon control policy using RL. MORPH is designed to ensure that the optimized hardware proxy remains as close as possible to its realistic counterpart, while still enabling task completion. We demonstrate our approach on simulated 2D reaching and 3D multi-fingered manipulation tasks.
RSAM: Learning on manifolds with Riemannian Sharpness-aware Minimization
results: Proposes a novel Riemannian Sharpness-Aware Minimization (RSAM) algorithm and shows, through evaluations on a range of datasets, that RSAM improves models' generalization ability and robustness.
Abstract
Nowadays, understanding the geometry of the loss landscape shows promise in enhancing a model's generalization ability. In this work, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization ability for constrained optimization problems. Indeed, this paper aims to generalize the Sharpness-Aware Minimization (SAM) optimizer to Riemannian manifolds. In doing so, we first extend the concept of sharpness and introduce a novel notion of sharpness on manifolds. To support this notion of sharpness, we present a theoretical analysis characterizing generalization capabilities with respect to manifold sharpness, which demonstrates a tighter bound on the generalization gap, a result not known before. Motivated by this analysis, we introduce our algorithm, Riemannian Sharpness-Aware Minimization (RSAM). To demonstrate RSAM's ability to enhance generalization ability, we evaluate and contrast our algorithm on a broad set of problems, such as image classification and contrastive learning across different datasets, including CIFAR100, CIFAR10, and FGVCAircraft. Our code is publicly available at \url{https://t.ly/RiemannianSAM}.
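To make the algorithm concrete, the sketch below runs sharpness-aware steps on the unit sphere, used here as a stand-in for a general Riemannian manifold; the tangent-space projection and normalization retraction are assumptions of this toy setting, not the authors' implementation.

```python
import torch

def rsam_step(w: torch.nn.Parameter, loss_fn, lr=0.1, rho=0.05):
    # Ascend along the Riemannian gradient in the tangent space at w,
    # retract onto the sphere, then descend using the gradient taken there.
    loss = loss_fn(w)
    grad = torch.autograd.grad(loss, w)[0]
    rgrad = grad - (grad @ w) * w                         # tangent projection
    w_adv = (w + rho * rgrad / (rgrad.norm() + 1e-12)).detach()
    w_adv = (w_adv / w_adv.norm()).requires_grad_(True)   # retraction
    adv_grad = torch.autograd.grad(loss_fn(w_adv), w_adv)[0]
    adv_rgrad = adv_grad - (adv_grad @ w_adv) * w_adv
    with torch.no_grad():
        w -= lr * adv_rgrad                               # sharpness-aware update
        w /= w.norm()                                     # stay on the manifold
    return loss.item()

w = torch.nn.Parameter(torch.randn(5))
with torch.no_grad():
    w /= w.norm()
target = torch.zeros(5); target[0] = 1.0
for _ in range(200):
    rsam_step(w, lambda v: ((v - target) ** 2).sum())
```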
ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery
results: ComSD achieves state-of-the-art adaptation performance in numerical evaluations, with clear advantages on skill-combination tasks and most skill-finetuning tasks.
Abstract
Learning diverse and qualified behaviors for utilization and adaptation without supervision is a key ability of intelligent creatures. Ideal unsupervised skill discovery methods are able to produce diverse and qualified skills in the absence of extrinsic reward, while the discovered skill set can efficiently adapt to downstream tasks in various ways. Maximizing the Mutual Information (MI) between skills and visited states can achieve ideal skill-conditioned behavior distillation in theory. However, it's difficult for recent advanced methods to well balance behavioral quality (exploration) and diversity (exploitation) in practice, which may be attributed to the unreasonable MI estimation by their rigid intrinsic reward design. In this paper, we propose Contrastive multi-objectives Skill Discovery (ComSD) which tries to mitigate the quality-versus-diversity conflict of discovered behaviors through a more reasonable MI estimation and a dynamically weighted intrinsic reward. ComSD proposes to employ contrastive learning for a more reasonable estimation of skill-conditioned entropy in MI decomposition. In addition, a novel weighting mechanism is proposed to dynamically balance different entropy (in MI decomposition) estimations into a novel multi-objective intrinsic reward, to improve both skill diversity and quality. For challenging robot behavior discovery, ComSD can produce a qualified skill set consisting of diverse behaviors at different activity levels, which recent advanced methods cannot. On numerical evaluations, ComSD exhibits state-of-the-art adaptation performance, significantly outperforming recent advanced skill discovery methods across all skill combination tasks and most skill finetuning tasks. Codes will be released at https://github.com/liuxin0824/ComSD.
An Investigation Into Race Bias in Random Forest Models Based on Breast DCE-MRI Derived Radiomics Features
results: The study finds that radiomics features derived from DCE-MRI data contain race-identifiable information, and that RF models can predict White and Black race with 60-70% accuracy depending on the feature subset used. Moreover, RF models trained to predict tumour molecular subtype on race-imbalanced data appear to behave in a biased way, performing better on test data from the race on which they were trained.
Abstract
Recent research has shown that artificial intelligence (AI) models can exhibit bias in performance when trained using data that are imbalanced by protected attribute(s). Most work to date has focused on deep learning models, but classical AI techniques that make use of hand-crafted features may also be susceptible to such bias. In this paper we investigate the potential for race bias in random forest (RF) models trained using radiomics features. Our application is prediction of tumour molecular subtype from dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) of breast cancer patients. Our results show that radiomics features derived from DCE-MRI data do contain race-identifiable information, and that RF models can be trained to predict White and Black race from these data with 60-70% accuracy, depending on the subset of features used. Furthermore, RF models trained to predict tumour molecular subtype using race-imbalanced data seem to produce biased behaviour, exhibiting better performance on test data from the race on which they were trained.
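The audit itself is straightforward to reproduce in outline: test whether the protected attribute can be predicted from the features at all. The sketch below uses synthetic stand-in features with an injected race-identifiable signal; only the scikit-learn calls are real.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # stand-in radiomics features
race = rng.integers(0, 2, size=200)       # stand-in protected attribute
X[race == 1, :5] += 0.8                   # inject a race-identifiable signal

rf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(rf, X, race, cv=5, scoring="roc_auc")
print(f"race predictable from features: AUC = {auc.mean():.2f}")
```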
PARF: Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis
results: Experiments show that the method reconstructs scene radiance fields quickly, with high rendering quality and convenient editing functionality.
Abstract
This paper proposes a method for fast scene radiance field reconstruction with strong novel view synthesis performance and convenient scene editing functionality. The key idea is to fully utilize semantic parsing and primitive extraction for constraining and accelerating the radiance field reconstruction process. To fulfill this goal, a primitive-aware hybrid rendering strategy was proposed to enjoy the best of both volumetric and primitive rendering. We further contribute a reconstruction pipeline that conducts primitive parsing and radiance field learning iteratively for each input frame, successfully fusing semantic, primitive, and radiance information into a single framework. Extensive evaluations demonstrate the fast reconstruction ability, high rendering quality, and convenient editing functionality of our method.
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
paper_authors: Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, Jun Wang
for: Improving the reasoning and decoding abilities of large language models (LLMs)
methods: Tree-search algorithms combined with a learned value function
results: Strong performance on reasoning, planning, and RLHF alignment tasks, with broad applicability and scalability
Abstract
Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent work like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by utilizing tree-search algorithms to guide multi-step reasoning. These methods mainly focus on LLMs' reasoning ability during inference and heavily rely on human-designed prompts to activate LLM as a value function, which lacks general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search framework for LLMs (termed TS-LLM), systematically illustrating how tree-search with a learned value function can guide LLMs' decoding ability. TS-LLM distinguishes itself in two key ways: (1) Leveraging a learned value function, our approach can be generally applied to different tasks beyond reasoning (such as RLHF alignment), and LLMs of any size, without prompting advanced, large-scale models. (2) It can guide LLM's decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64.
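The decoding side can be pictured as a generic UCT loop in which the LLM proposes candidate next steps and a learned value function replaces rollouts. In the skeleton below, `propose_steps` and `value_fn` are hypothetical stand-ins for LLM step sampling and the learned value network, and the loop is illustrative rather than the paper's exact algorithm.

```python
import math, random

def propose_steps(state):                  # stand-in for LLM step sampling
    return [state + t for t in (" A", " B")]

def value_fn(state):                       # stand-in for the learned value net
    return random.random()

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct_search(root_state, iters=50, c=1.4):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while node.children:               # selection by UCT score
            node = max(node.children,
                       key=lambda n: n.value / (n.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-9)))
        node.children = [Node(s, node) for s in propose_steps(node.state)]
        leaf = random.choice(node.children)
        v = value_fn(leaf.state)           # value estimate instead of rollout
        while leaf:                        # backpropagation
            leaf.visits += 1; leaf.value += v; leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state

print(uct_search("Q:"))
```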
RLAdapter: Bridging Large Language Models to Reinforcement Learning in Open Worlds
results: Experiments in the Crafter environment show that RLAdapter outperforms the SOTA baselines, and agents under the framework exhibit common-sense behaviors that are absent in the baseline models.
Abstract
While reinforcement learning (RL) shows remarkable success in decision-making problems, it often requires a lot of interactions with the environment, and in sparse-reward environments, it is challenging to learn meaningful policies. Large Language Models (LLMs) can potentially provide valuable guidance to agents in learning policies, thereby enhancing the performance of RL algorithms in such environments. However, LLMs often encounter difficulties in understanding downstream tasks, which hinders their ability to optimally assist agents in these tasks. A common approach to mitigating this issue is to fine-tune the LLMs with task-related data, enabling them to offer useful guidance for RL agents. However, this approach encounters several difficulties, such as inaccessible model weights or the need for significant computational resources, making it impractical. In this work, we introduce RLAdapter, a framework that builds a better connection between RL algorithms and LLMs by incorporating an adapter model. Within the RLAdapter framework, fine-tuning a lightweight language model with information generated during the training process of RL agents significantly aids LLMs in adapting to downstream tasks, thereby providing better guidance for RL agents. We conducted experiments to evaluate RLAdapter in the Crafter environment, and the results show that RLAdapter surpasses the SOTA baselines. Furthermore, agents under our framework exhibit common-sense behaviors that are absent in baseline models.
A Vision-Guided Robotic System for Grasping Harvested Tomato Trusses in Cluttered Environments
paper_authors: Luuk van den Bent, Tomás Coleman, Robert Babuska
for: automating truss tomato weighing and packaging processes
methods: deep learning-based vision system to identify and grasp trusses in a crate with clutter
results: 100% clearance rate and 93% success rate of grasping trusses on the first try
Abstract
Currently, truss tomato weighing and packaging require significant manual work. The main obstacle to automation lies in the difficulty of developing a reliable robotic grasping system for already harvested trusses. We propose a method to grasp trusses that are stacked in a crate with considerable clutter, which is how they are commonly stored and transported after harvest. The method consists of a deep learning-based vision system to first identify the individual trusses in the crate and then determine a suitable grasping location on the stem. To this end, we have introduced a grasp pose ranking algorithm with online learning capabilities. After selecting the most promising grasp pose, the robot executes a pinch grasp without needing touch sensors or geometric models. Lab experiments with a robotic manipulator equipped with an eye-in-hand RGB-D camera showed a 100% clearance rate when tasked to pick all trusses from a pile. 93% of the trusses were successfully grasped on the first try, while the remaining 7% required more attempts.
An evaluation of GPT models for phenotype concept recognition
paper_authors: Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A Haendel, Peter N Robinson, Chris J Mungall, Justin T Reese
for: To examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT in clinical deep phenotyping.
methods: The study used seven prompts of varying specificity, two GPT models (gpt-3.5 and gpt-4.0), and an established gold standard for phenotype recognition.
results: The results show that these models have not yet achieved state-of-the-art performance. The best run, using few-shot learning, achieved a 0.41 F1 score, compared with the 0.62 F1 score achieved by the current best-in-class tool.
Abstract
Objective: Clinical deep phenotyping plays a critical role in both the diagnosis of patients with rare disorders as well as in building care coordination plans. The process relies on modelling and curating patient profiles using ontology concepts, usually from the Human Phenotype Ontology. Machine learning methods have been widely adopted to support this phenotype concept recognition task. With the significant shift in the use of large language models (LLMs) for most NLP tasks, herewithin, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT in clinical deep phenotyping. Materials and Methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5 and gpt-4.0) and an established gold standard for phenotype recognition. Results: Our results show that, currently, these models have not yet achieved state of the art performance. The best run, using few-shots learning, achieved 0.41 F1 score, compared to a 0.62 F1 score achieved by the current best in class tool. Conclusion: The non-deterministic nature of the outcomes and the lack of concordance between different runs using the same prompt and input makes the use of these LLMs in clinical settings problematic.
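For reference, runs like these are typically scored by set overlap between predicted and gold ontology identifiers. A minimal sketch follows (the HPO IDs below are examples, not from the paper's data):

```python
def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision/recall/F1 over HPO identifiers for one document."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {"HP:0001250", "HP:0000252"}
pred = {"HP:0001250", "HP:0004322"}       # one correct, one spurious concept
print(prf1(pred, gold))                   # (0.5, 0.5, 0.5)
```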
DyVal: Graph-informed Dynamic Evaluation of Large Language Models
results: Experiments show that LLMs perform worse on DyVal-generated evaluation samples, underscoring the importance of dynamic evaluation. Additionally, the researchers analyzed the failure cases and results of different prompting methods, and found that DyVal-generated samples are not only evaluation sets but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks.
Abstract
Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns about their performance are raised on potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a novel, general, and flexible evaluation protocol for dynamic evaluation of LLMs. Based on our proposed dynamic evaluation framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to ChatGPT and GPT4. Experiments demonstrate that LLMs perform worse in DyVal-generated evaluation samples with different complexities, emphasizing the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on the future evaluation research of LLMs.
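The generation idea can be illustrated with a toy arithmetic version: build a random DAG whose depth controls complexity, render it as a question, and compute the ground-truth answer from the graph itself, so every sample is fresh. The rendering format below is an assumption.

```python
import random, operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_arithmetic_dag(depth: int):
    """Toy DyVal-style generator over a random arithmetic DAG."""
    desc, value, frontier, idx = [], {}, [], 0
    for _ in range(3):                            # leaf nodes
        name, idx = f"x{idx}", idx + 1
        value[name] = random.randint(1, 9)
        desc.append(f"{name} = {value[name]}"); frontier.append(name)
    for _ in range(depth):                        # internal nodes
        a, b = random.sample(frontier, 2)
        op = random.choice(list(OPS))
        name, idx = f"x{idx}", idx + 1
        desc.append(f"{name} = {a} {op} {b}")
        value[name] = OPS[op](value[a], value[b]); frontier.append(name)
    return "; ".join(desc) + f". What is {frontier[-1]}?", value[frontier[-1]]

question, answer = make_arithmetic_dag(depth=4)
print(question, "->", answer)
```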
Advances in Kidney Biopsy Structural Assessment through Dense Instance Segmentation
results: Trained on a single NVIDIA GeForce RTX 3090 GPU, the model efficiently recognizes more than 500 objects in renal biopsies across three common anatomical classes: glomeruli, tubuli, and arteries. The dataset consisted of 148 Jones' silver-stained renal whole slide images (WSIs), with 249 patches used for training and 54 for evaluation. Moreover, without adjustment or retraining, the model transfers directly to PAS-stained WSIs and produces decent instance segmentation results. Compared with baseline models, it reaches a new state of the art with an AP of 51.7% in detection.
Abstract
The kidney biopsy is the gold standard for the diagnosis of kidney diseases. Lesion scores made by expert renal pathologists are semi-quantitative and suffer from high inter-observer variability. Automatically obtaining statistics per segmented anatomical object, therefore, can bring significant benefits in reducing labor and this inter-observer variability. Instance segmentation for a biopsy, however, has been a challenging problem due to (a) the on average large number (around 300 to 1000) of densely touching anatomical structures, (b) with multiple classes (at least 3) and (c) in different sizes and shapes. The currently used instance segmentation models cannot simultaneously deal with these challenges in an efficient yet generic manner. In this paper, we propose the first anchor-free instance segmentation model that combines diffusion models, transformer modules, and RCNNs (regional convolution neural networks). Our model is trained on just one NVIDIA GeForce RTX 3090 GPU, but can efficiently recognize more than 500 objects with 3 common anatomical object classes in renal biopsies, i.e., glomeruli, tubuli, and arteries. Our data set consisted of 303 patches extracted from 148 Jones' silver-stained renal whole slide images (WSIs), where 249 patches were used for training and 54 patches for evaluation. In addition, without adjustment or retraining, the model can directly transfer its domain to generate decent instance segmentation results from PAS-stained WSIs. Importantly, it outperforms other baseline models and reaches an AP 51.7% in detection as the new state-of-the-art.
Compromise in Multilateral Negotiations and the Global Regulation of Artificial Intelligence
results: Drawing on Boltanski's pragmatic sociology, the paper finds that the combination of structural normative hybridity and situated normative ambiguity embedded in the practice of multilateral negotiations made the multilateral compromise possible.
Abstract
As artificial intelligence (AI) technologies spread worldwide, international discussions have increasingly focused on their consequences for democracy, human rights, fundamental freedoms, security, and economic and social development. In this context, UNESCO's Recommendation on the Ethics of Artificial Intelligence, adopted in November 2021, has emerged as the first global normative framework for AI development and deployment. The intense negotiations of every detail of the document brought forth numerous controversies among UNESCO member states. Drawing on a unique set of primary sources, including written positions and recorded deliberations, this paper explains the achievement of global compromise on AI regulation despite the multiplicity of UNESCO member-state positions representing a variety of liberal and sovereignist preferences. Building upon Boltanski's pragmatic sociology, it conceptualises the practice of multilateral negotiations and attributes the multilateral compromise to two embedded therein mechanisms: Structural normative hybridity and situated normative ambiguity allowed to accomplish a compromise by linking macro-normative structures with situated debates of multilateral negotiations.
Age Group Discrimination via Free Handwriting Indicators
results: The study reports exceptional classification performance, with accuracy ranging from 82.5% to 97.5%, precision from 81.8% to 100%, recall from 75% to 100%, and ROC-AUC from 92.2% to 100%.
Abstract
The growing global elderly population is expected to increase the prevalence of frailty, posing significant challenges to healthcare systems. Frailty, a syndrome associated with ageing, is characterised by progressive health decline, increased vulnerability to stressors and increased risk of mortality. It represents a significant burden on public health and reduces the quality of life of those affected. The lack of a universally accepted method to assess frailty and a standardised definition highlights a critical research gap. Given this lack and the importance of early prevention, this study presents an innovative approach using an instrumented ink pen to ecologically assess handwriting for age group classification. Content-free handwriting data from 80 healthy participants in different age groups (20-40, 41-60, 61-70 and 70+) were analysed. Fourteen gesture- and tremor-related indicators were computed from the raw data and used in five classification tasks. These tasks included discriminating between adjacent and non-adjacent age groups using Catboost and Logistic Regression classifiers. Results indicate exceptional classifier performance, with accuracy ranging from 82.5% to 97.5%, precision from 81.8% to 100%, recall from 75% to 100% and ROC-AUC from 92.2% to 100%. Model interpretability, facilitated by SHAP analysis, revealed age-dependent sensitivity of temporal and tremor-related handwriting features. Importantly, this classification method offers potential for early detection of abnormal signs of ageing in uncontrolled settings such as remote home monitoring, thereby addressing the critical issue of frailty detection and contributing to improved care for older adults.
Using Large Language Models for Qualitative Analysis can Introduce Serious Bias
results: The study finds that using LLMs to annotate interview transcripts carries a risk of bias that can lead to misleading inferences. Simple supervised models trained on high-quality human annotations with flexible coding show less measurement error and bias; since some high-quality annotations are needed anyway to assess whether an LLM introduces bias, training a bespoke model on them is probably preferable to using an LLM for annotation.
Abstract
Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood. This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh. We find that a great deal of caution is needed in using LLMs to annotate text as there is a risk of introducing biases that can lead to misleading inferences. We here mean bias in the technical sense, that the errors that LLMs make in annotating interview transcripts are not random with respect to the characteristics of the interview subjects. Training simpler supervised models on high-quality human annotations with flexible coding leads to less measurement error and bias than LLM annotations. Therefore, given that some high quality annotations are necessary in order to asses whether an LLM introduces bias, we argue that it is probably preferable to train a bespoke model on these annotations than it is to use an LLM for annotation.
Prototype Generation: Robust Feature Visualisation for Data Independent Interpretability
results: The study finds that Prototype Generation produces inputs with natural activation paths, and quantitatively measures the similarity between the internal activations of the generated prototypes and natural images. Interpreting the generated prototypes also yields important insights, such as revealing spurious correlations and biases learned by models that quantitative methods over test sets cannot identify.
Abstract
We introduce Prototype Generation, a stricter and more robust form of feature visualisation for model-agnostic, data-independent interpretability of image classification models. We demonstrate its ability to generate inputs that result in natural activation paths, countering previous claims that feature visualisation algorithms are untrustworthy due to the unnatural internal activations. We substantiate these claims by quantitatively measuring similarity between the internal activations of our generated prototypes and natural images. We also demonstrate how the interpretation of generated prototypes yields important insights, highlighting spurious correlations and biases learned by models which quantitative methods over test-sets cannot identify.
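For contrast, a bare-bones feature-visualisation loop is sketched below: gradient ascent on the input to maximise one class logit, with jitter and weight decay as simple regularisers. Prototype Generation adds constraints so that internal activations remain natural, which this sketch deliberately does not implement; the class index is an arbitrary example.

```python
import torch, torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.zeros(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05, weight_decay=1e-4)
target_class = 309                                  # example class index

for step in range(200):
    dy, dx = torch.randint(-4, 5, (2,)).tolist()    # random jitter
    x_j = torch.roll(x, shifts=(dy, dx), dims=(2, 3))
    loss = -model(x_j)[0, target_class]             # maximise the class logit
    opt.zero_grad(); loss.backward(); opt.step()

prototype = x.detach().clamp(-1, 1)                 # crude visualisation result
```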
Revisiting Cephalometric Landmark Detection from the view of Human Pose Estimation with Lightweight Super-Resolution Head
methods: The study builds a robust and adaptable baseline on MMPose and incorporates a simple yet efficient super-resolution module into the framework to improve performance.
results: In the MICCAI CLDetection2023 challenge, the method ranks first on three metrics and third on the remaining one.
Abstract
Accurate localization of cephalometric landmarks holds great importance in the fields of orthodontics and orthognathics due to its potential for automating key point labeling. In the context of landmark detection, particularly in cephalometrics, it has been observed that existing methods often lack standardized pipelines and well-designed bias reduction processes, which significantly impact their performance. In this paper, we revisit a related task, human pose estimation (HPE), which shares numerous similarities with cephalometric landmark detection (CLD), and emphasize the potential for transferring techniques from the former field to benefit the latter. Motivated by this insight, we have developed a robust and adaptable benchmark based on the well-established HPE codebase known as MMPose. This benchmark can serve as a dependable baseline for achieving exceptional CLD performance. Furthermore, we introduce an upscaling design within the framework to further enhance performance. This enhancement involves the incorporation of a lightweight and efficient super-resolution module, which generates heatmap predictions on high-resolution features and leads to further performance refinement, benefiting from its ability to reduce quantization bias. In the MICCAI CLDetection2023 challenge, our method achieves 1st place ranking on three metrics and 3rd place on the remaining one. The code for our method is available at https://github.com/5k5000/CLdetection2023.
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?
results: The study finds that the latest commercial models perform strongly with the Turtle language, but fall short of the output formatting requirements and need further improvement.
Abstract
Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements at natural language processing and coding tasks. Yet, their ability to work with formal languages representing data, specifically within the realm of knowledge graph engineering, remains under-investigated. To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax. These tasks, each embodying distinct degrees of complexity and being able to scale with the size of the problem, have been integrated into our automated evaluation system, the LLM-KG-Bench. The evaluation encompassed four commercially available LLMs - GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0, as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B. This analysis offers an in-depth understanding of the strengths and shortcomings of LLMs in relation to their application within RDF knowledge graph engineering workflows utilizing Turtle representation. While our findings show that the latest commercial models outperform their forerunners in terms of proficiency with the Turtle language, they also reveal an apparent weakness. These models fall short when it comes to adhering strictly to the output formatting constraints, a crucial requirement in this context.
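A minimal version of the syntactic part of such a check can be written with rdflib: parse the model's Turtle output and count the triples it yields.

```python
from rdflib import Graph

def check_turtle(candidate: str) -> tuple[bool, int]:
    """Return (is_valid_turtle, number_of_triples) for an LLM's output."""
    g = Graph()
    try:
        g.parse(data=candidate, format="turtle")
        return True, len(g)
    except Exception:
        return False, 0

sample = """@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob ."""
print(check_turtle(sample))   # (True, 1)
```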
Meta-Path Learning for Multi-relational Graph Neural Networks
results: Experiments show that the approach correctly identifies informative meta-paths even with a large number of relations, and substantially outperforms existing multi-relational GNNs in synthetic and real-world experiments.
Abstract
Existing multi-relational graph neural networks use one of two strategies for identifying informative relations: either they reduce this problem to low-level weight learning, or they rely on handcrafted chains of relational dependencies, called meta-paths. However, the former approach faces challenges in the presence of many relations (e.g., knowledge graphs), while the latter requires substantial domain expertise to identify relevant meta-paths. In this work we propose a novel approach to learn meta-paths and meta-path GNNs that are highly accurate based on a small number of informative meta-paths. Key element of our approach is a scoring function for measuring the potential informativeness of a relation in the incremental construction of the meta-path. Our experimental evaluation shows that the approach manages to correctly identify relevant meta-paths even with a large number of relations, and substantially outperforms existing multi-relational GNNs on synthetic and real-world experiments.
Dynamic Interpretability for Model Comparison via Decision Rules
results: Experiments on synthetic and real-world datasets, covering a variety of model comparison scenarios involving different types of concept drift, demonstrate that DeltaXplainer can effectively describe the resulting differences. Abstract
Explainable AI (XAI) methods have mostly been built to investigate and shed light on single machine learning models and are not designed to capture and explain differences between multiple models effectively. This paper addresses the challenge of understanding and explaining differences between machine learning models, which is crucial for model selection, monitoring and lifecycle management in real-world applications. We propose DeltaXplainer, a model-agnostic method for generating rule-based explanations describing the differences between two binary classifiers. To assess the effectiveness of DeltaXplainer, we conduct experiments on synthetic and real-world datasets, covering various model comparison scenarios involving different types of concept drift.
GAIA-1: A Generative World Model for Autonomous Driving
results: GAIA-1 learns high-level structures and scene dynamics, contextual awareness, generalization, and an understanding of geometry; its learned representation captures expectations of future events, and its ability to generate realistic samples opens new possibilities for, and accelerates, the training of autonomous driving technology. Abstract
Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.
Assessment and treatment of visuospatial neglect using active learning with Gaussian processes regression
paper_authors: Ivan De Boi, Elissa Embrechts, Quirine Schatteman, Rudi Penne, Steven Truijen, Wim Saeys
for: An artificial intelligence solution for assessing and diagnosing visuospatial neglect
methods: An active learning method based on Gaussian process regression, used to reduce the effort a patient must expend to undergo an assessment
results: In clinical trials conducted in a real-world setting, our AI-based assessment proved more sensitive than the conventional visuospatial neglect tests currently used in clinical practice, with high intra-rater reliability and reduced assessment effort. Abstract
Visuospatial neglect is a disorder characterised by impaired awareness for visual stimuli located in regions of space and frames of reference. It is often associated with stroke. Patients can struggle with all aspects of daily living and community participation. Assessment methods are limited and show several shortcomings, considering they are mainly performed on paper and do not implement the complexity of daily life. Similarly, treatment options are sparse and often show only small improvements. We present an artificial intelligence solution designed to accurately assess a patient's visuospatial neglect in a three-dimensional setting. We implement an active learning method based on Gaussian process regression to reduce the effort it takes a patient to undergo an assessment. Furthermore, we describe how this model can be utilised in patient oriented treatment and how this opens the way to gamification, tele-rehabilitation and personalised healthcare, providing a promising avenue for improving patient engagement and rehabilitation outcomes. To validate our assessment module, we conducted clinical trials involving patients in a real-world setting. We compared the results obtained using our AI-based assessment with the widely used conventional visuospatial neglect tests currently employed in clinical practice. The validation process serves to establish the accuracy and reliability of our model, confirming its potential as a valuable tool for diagnosing and monitoring visuospatial neglect. Our VR application proves to be more sensitive, while intra-rater reliability remains high.
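The active learning loop can be illustrated with a generic uncertainty-sampling sketch using scikit-learn's GP regressor; the 2D "visual field", the kernel, and all numbers are invented, and the paper's acquisition rule may differ:

```python
# Minimal sketch of uncertainty-driven active learning with GP regression:
# query next the stimulus location where the model is least certain,
# instead of probing every location. Data here is synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
true_field = lambda X: np.exp(-((X[:, 0] - 0.7) ** 2 + (X[:, 1] - 0.3) ** 2) / 0.05)

candidates = rng.uniform(0, 1, size=(400, 2))      # possible probe locations
X, y = candidates[:3], true_field(candidates[:3])  # a few seed assessments

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
for _ in range(10):                                # each loop = one probe shown to the patient
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    i = int(np.argmax(std))                        # most informative next probe
    X = np.vstack([X, candidates[i]])
    y = np.append(y, true_field(candidates[i:i + 1]))

print(f"assessed {len(X)} locations instead of all {len(candidates)}")
```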
SCALE: Synergized Collaboration of Asymmetric Language Translation Engines
paper_authors: Xin Cheng, Xun Wang, Tao Ge, Si-Qing Chen, Furu Wei, Dongyan Zhao, Rui Yan
for: The paper introduces a collaborative framework called SCALE that connects specialized translation models (STMs) and large language models (LLMs) to improve machine translation.
methods: Translations from the STM are incorporated into triplet in-context demonstrations, enabling the LLM to refine and pivot its translations and mitigating both the language bias of LLMs and the parallel data bias of STMs.
results: SCALE significantly outperforms both few-shot LLMs (GPT-4) and specialized models (NLLB) in challenging low-resource settings, improves Xhosa-to-English translation by 4 BLEURT points without tuning the LLM, and, by using an English-centric STM as a pivot between any language pair, exploits the existing language bias of LLMs to outperform few-shot GPT-4 by an average of 6 COMET points across eight translation directions. Abstract
In this paper, we introduce SCALE, a collaborative framework that connects compact Specialized Translation Models (STMs) and general-purpose Large Language Models (LLMs) as one unified translation engine. By introducing translation from STM into the triplet in-context demonstrations, SCALE unlocks refinement and pivoting ability of LLM, thus mitigating language bias of LLM and parallel data bias of STM, enhancing LLM speciality without sacrificing generality, and facilitating continual learning without expensive LLM fine-tuning. Our comprehensive experiments show that SCALE significantly outperforms both few-shot LLMs (GPT-4) and specialized models (NLLB) in challenging low-resource settings. Moreover, in Xhosa to English translation, SCALE experiences consistent improvement by a 4 BLEURT score without tuning LLM and surpasses few-shot GPT-4 by 2.5 COMET score and 3.8 BLEURT score when equipped with a compact model consisting of merely 600M parameters. SCALE could also effectively exploit the existing language bias of LLMs by using an English-centric STM as a pivot for translation between any language pairs, outperforming few-shot GPT-4 by an average of 6 COMET points across eight translation directions. Furthermore we provide an in-depth analysis of SCALE's robustness, translation characteristics, and latency costs, providing solid foundation for future studies exploring the potential synergy between LLMs and more specialized, task-specific models.
Tell Me a Story! Narrative-Driven XAI with Large Language Models
results: Survey results show that over 90% of the general audience finds the narratives generated by SHAPstories convincing, and data scientists see their value in communicating AI predictions to non-specialists. In image classification, CFstories are judged as convincing as or more convincing than users' own crafted stories, bring a tenfold speed gain in creating a narrative, and improve accuracy by over 20%. Abstract
In today's critical domains, the predominance of black-box machine learning models amplifies the demand for Explainable AI (XAI). The widely used SHAP values, while quantifying feature importance, are often too intricate and lack human-friendly explanations. Furthermore, counterfactual (CF) explanations present `what ifs' but leave users grappling with the 'why'. To bridge this gap, we introduce XAIstories. Leveraging Large Language Models, XAIstories provide narratives that shed light on AI predictions: SHAPstories do so based on SHAP explanations to explain a prediction score, while CFstories do so for CF explanations to explain a decision. Our results are striking: over 90% of the surveyed general audience finds the narrative generated by SHAPstories convincing. Data scientists primarily see the value of SHAPstories in communicating explanations to a general audience, with 92% of data scientists indicating that it will contribute to the ease and confidence of nonspecialists in understanding AI predictions. Additionally, 83% of data scientists indicate they are likely to use SHAPstories for this purpose. In image classification, CFstories are considered more or equally convincing as users own crafted stories by over 75% of lay user participants. CFstories also bring a tenfold speed gain in creating a narrative, and improves accuracy by over 20% compared to manually created narratives. The results thereby suggest that XAIstories may provide the missing link in truly explaining and understanding AI predictions.
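A minimal sketch of the SHAPstories idea, assuming a template-based prompt (the feature names, values, and wording are invented; the paper's actual prompts may differ):

```python
# Sketch: turn SHAP feature attributions into an LLM prompt that asks
# for a narrative explanation. All values below are invented.
shap_values = {"income": +0.42, "age": -0.13, "num_late_payments": +0.31}
prediction = "loan default risk: high"

ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
lines = [f"- {name}: SHAP {val:+.2f}" for name, val in ranked]
prompt = (
    f"A model predicted '{prediction}'. The feature attributions were:\n"
    + "\n".join(lines)
    + "\nWrite a short, plain-language story explaining this prediction "
      "to a non-expert, without mentioning SHAP values directly."
)
print(prompt)  # this string would be sent to an LLM of choice
```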
paper_authors: Ramin Barati, Reza Safabakhsh, Mohammad Rahmati
for: This study investigates the reliability of machine learning models and how to improve both their robustness and accuracy.
methods: Empirical studies across a range of machine learning tasks, combined with analysis and reasoning about the functions learned during training.
results: The study finds that continuous functions cannot effectively learn the optimal robust hypothesis, whereas discontinuous hypotheses strike a better balance between robustness and accuracy; it also provides a framework for the rigorous study of harmonic and holomorphic hypotheses in learning-theoretic terms. Abstract
The reliability of a learning model is key to the successful deployment of machine learning in various applications. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. It has been shown that adversarial training can improve the robustness of the hypothesis. However, this improvement comes at the cost of decreased performance on natural samples. Hence, it has been suggested that robustness and accuracy of a hypothesis are at odds with each other. In this paper, we put forth the alternative proposal that it is the continuity of a hypothesis that is incompatible with its robustness and accuracy. In other words, a continuous function cannot effectively learn the optimal robust hypothesis. To this end, we will introduce a framework for a rigorous study of harmonic and holomorphic hypothesis in learning theory terms and provide empirical evidence that continuous hypotheses does not perform as well as discontinuous hypotheses in some common machine learning tasks. From a practical point of view, our results suggests that a robust and accurate learning rule would train different continuous hypotheses for different regions of the domain. From a theoretical perspective, our analysis explains the adversarial examples phenomenon as a conflict between the continuity of a sequence of functions and its uniform convergence to a discontinuous function.
Refined Kolmogorov Complexity of Analog, Evolving and Stochastic Recurrent Neural Networks
results: The study obtains infinite hierarchies of classes of analog networks, evolving networks, and stochastic networks; the first two hierarchies lie between $\mathbf{P}$ and $\mathbf{P/poly}$, while the stochastic one bridges the gap between $\mathbf{BPP}$ and $\mathbf{BPP/log^*}$. A generic way of constructing these hierarchies is also provided. Abstract
We provide a refined characterization of the super-Turing computational power of analog, evolving, and stochastic neural networks based on the Kolmogorov complexity of their real weights, evolving weights, and real probabilities, respectively. First, we retrieve an infinite hierarchy of classes of analog networks defined in terms of the Kolmogorov complexity of their underlying real weights. This hierarchy is located between the complexity classes $\mathbf{P}$ and $\mathbf{P/poly}$. Then, we generalize this result to the case of evolving networks. A similar hierarchy of Kolmogorov-based complexity classes of evolving networks is obtained. This hierarchy also lies between $\mathbf{P}$ and $\mathbf{P/poly}$. Finally, we extend these results to the case of stochastic networks employing real probabilities as a source of randomness. An infinite hierarchy of stochastic networks based on the Kolmogorov complexity of their probabilities is therefore achieved. In this case, the hierarchy bridges the gap between $\mathbf{BPP}$ and $\mathbf{BPP/log^*}$. Beyond proving the existence and providing examples of such hierarchies, we describe a generic way of constructing them based on classes of functions of increasing complexity. For the sake of clarity, this study is formulated within the framework of echo state networks. Overall, this paper intends to fill the missing results and provide a unified view about the refined capabilities of analog, evolving and stochastic neural networks.
Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic Change Process
for: This paper presents a scalable multi-temporal remote sensing change data generator based on generative modeling, which alleviates the cost of collecting, preprocessing, and annotating multi-temporal remote sensing images at scale.
methods: The proposed method, called Changen, is a GAN-based generative probabilistic change model that decouples the complex simulation problem into two more tractable sub-problems: change event simulation and semantic change synthesis.
results: Extensive experiments show that Changen has superior generation capability, and change detectors pre-trained with Changen exhibit excellent transferability to real-world change datasets. Abstract
Understanding the temporal dynamics of Earth's surface is a mission of multi-temporal remote sensing image analysis, significantly promoted by deep vision models with its fuel -- labeled multi-temporal images. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present a scalable multi-temporal remote sensing change data generator via generative modeling, which is cheap and automatic, alleviating these problems. Our main idea is to simulate a stochastic change process over time. We consider the stochastic change process as a probabilistic semantic state transition, namely generative probabilistic change model (GPCM), which decouples the complex simulation problem into two more trackable sub-problems, \ie, change event simulation and semantic change synthesis. To solve these two problems, we present the change generator (Changen), a GAN-based GPCM, enabling controllable object change data generation, including customizable object property, and change event. The extensive experiments suggest that our Changen has superior generation capability, and the change detectors with Changen pre-training exhibit excellent transferability to real-world change datasets.
Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection
paper_authors: Swapnil Bhosale, Abhra Chaudhuri, Alex Lee Robert Williams, Divyank Tiwari, Anjan Dutta, Xiatian Zhu, Pushpak Bhattacharyya, Diptesh Kanojia
for: To rigorously benchmark the MUStARD++ dataset so as to fully utilize its multi-modal richness, achieving a 2% improvement in macro-F1 over the existing benchmark.
methods: State-of-the-art language, speech, and visual encoders are used to obtain a comprehensive multi-modal representation. The paper also proposes an extension, MUStARD++ Balanced, whose instances are split across both train and test sets to cure the imbalance in the sarcasm-type category, yielding a further 2.4% macro-F1 boost.
results: New clips taken from the TV show House MD, manually annotated by multiple annotators with substantial inter-annotator agreement, add to the diversity of the MUStARD++ dataset and improve macro-F1. Abstract
The introduction of the MUStARD dataset, and its emotion recognition extension MUStARD++, have identified sarcasm to be a multi-modal phenomenon -- expressed not only in natural language text, but also through manners of speech (like tonality and intonation) and visual cues (facial expression). With this work, we aim to perform a rigorous benchmarking of the MUStARD++ dataset by considering state-of-the-art language, speech, and visual encoders, for fully utilizing the totality of the multi-modal richness that it has to offer, achieving a 2\% improvement in macro-F1 over the existing benchmark. Additionally, to cure the imbalance in the `sarcasm type' category in MUStARD++, we propose an extension, which we call \emph{MUStARD++ Balanced}, benchmarking the same with instances from the extension split across both train and test sets, achieving a further 2.4\% macro-F1 boost. The new clips were taken from a novel source -- the TV show, House MD, which adds to the diversity of the dataset, and were manually annotated by multiple annotators with substantial inter-annotator agreement in terms of Cohen's kappa and Krippendorf's alpha. Our code, extended data, and SOTA benchmark models are made public.
Benchmarking Cognitive Biases in Large Language Models as Evaluators
results: The study finds that LLMs are biased text-quality evaluators, exhibiting strong indications of cognitive bias in an average of 40% of comparisons in each of their evaluations. Moreover, the average Rank-Biased Overlap between human and machine preferences is only 49.6%, indicating that machine preferences are misaligned with human ones. Abstract
Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
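The Rank-Biased Overlap metric used above can be sketched in its truncated prefix-sum form (the paper may use the extrapolated variant):

```python
# Minimal truncated Rank-Biased Overlap (RBO) estimate for two rankings,
# as used to compare human and machine preference orderings. The
# persistence parameter p weights agreement at shallow depths more.
def rbo(ranking_a, ranking_b, p=0.9):
    depth = min(len(ranking_a), len(ranking_b))
    score, seen_a, seen_b = 0.0, set(), set()
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b) / d       # agreement at depth d
        score += (p ** (d - 1)) * overlap
    return (1 - p) * score

human = ["sysA", "sysB", "sysC", "sysD"]
machine = ["sysB", "sysA", "sysD", "sysC"]
print(f"RBO = {rbo(human, machine):.3f}")
```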
Medical Foundation Models are Susceptible to Targeted Misinformation Attacks
paper_authors: Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Mueller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Nikolas Kather, Daniel Truhn
results: The study finds that by modifying just 1.1% of the model's weights, incorrect biomedical facts can be deliberately injected; the erroneous information is then propagated in the model's output, while performance on other biomedical tasks remains intact. These results raise security and trustworthiness concerns for the use of LLMs in healthcare. Abstract
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. The erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. We validate our findings in a set of 1,038 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
methods: Through extensive experiments with supervised pre-training on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that slight noise in pre-training can benefit in-domain (ID) transfer performance, where training and testing data share the same distribution, but always deteriorates out-of-domain (OOD) performance, where the distributions differ. We also show that this happens because noise in pre-training shapes the feature space differently.
results: Evaluations on popular vision and language models pre-trained on noisy data show that our method effectively mitigates the impact of noise in pre-training and improves generalization on both ID and OOD tasks. Abstract
Pre-training on large-scale datasets and then fine-tuning on downstream tasks have become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) transfer performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing data distribution are different. We empirically verify that the reason behind is noise in pre-training shapes the feature space differently. We then propose a lightweight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, considering one may not be able to fully fine-tune or even access the pre-trained models. We conduct practical experiments on popular vision and language models that are pre-trained on noisy data for evaluation of our approach. Our analysis and results show the importance of this interesting and novel research direction, which we term Noisy Model Learning.
A Closer Look at Bearing Fault Classification Approaches
paper_authors: Harika Abburi, Tanya Chaudhary, Haider Ilyas, Lakshmi Manne, Deepak Mittal, Don Williams, Derek Snaidauf, Edward Bowen, Balaji Veeramani
for: To improve the efficiency and accuracy of rolling bearing fault diagnosis, so as to avoid unexpected machine downtime and enhance maintenance schedules.
methods: Modern machine learning techniques, including deep learning architectures, are used to analyze vibration data and predict bearing failures.
results: The study finds that the choices of data partitions, model evaluation metrics, and failure-label generation methods have a major impact when diagnosing bearing failures, and it offers model development considerations for real-world scenarios. Abstract
Rolling bearing fault diagnosis has garnered increased attention in recent years owing to its presence in rotating machinery across various industries, and an ever increasing demand for efficient operations. Prompt detection and accurate prediction of bearing failures can help reduce the likelihood of unexpected machine downtime and enhance maintenance schedules, averting lost productivity. Recent technological advances have enabled monitoring the health of these assets at scale using a variety of sensors, and predicting the failures using modern Machine Learning (ML) approaches including deep learning architectures. Vibration data has been collected using accelerated run-to-failure of overloaded bearings, or by introducing known failure in bearings, under a variety of operating conditions such as rotating speed, load on the bearing, type of bearing fault, and data acquisition frequency. However, in the development of bearing failure classification models using vibration data there is a lack of consensus in the metrics used to evaluate the models, data partitions used to evaluate models, and methods used to generate failure labels in run-to-failure experiments. An understanding of the impact of these choices is important to reliably develop models, and deploy them in practical settings. In this work, we demonstrate the significance of these choices on the performance of the models using publicly-available vibration datasets, and suggest model development considerations for real world scenarios. Our experimental findings demonstrate that assigning vibration data from a given bearing across training and evaluation splits leads to over-optimistic performance estimates, PCA-based approach is able to robustly generate labels for failure classification in run-to-failure experiments, and $F$ scores are more insightful to evaluate the models with unbalanced real-world failure data.
AI Algorithm for the Generation of Three-Dimensional Accessibility Ramps in Grasshopper / Rhinoceros 7
paper_authors: Antonio Li, Leila Yi, Brandon Yeo Pei Hui
for: This paper aims to provide an algorithm for the automatic generation of wheelchair-accessible ramps in urban development, with the goal of improving accessibility for people with mobile impairments and able-bodied third parties.
methods: The algorithm uses AI search algorithms to determine the optimal pathway connecting initial and terminal points within a 3D model of the environment, taking into account essential components such as elevation differentials, spatial constraints, and gradient specifications.
results: The algorithm generates a full-scale, usable model of a ramp that can be easily exported and transformed through inter-software exchanges, providing significant efficiency gains in the design process and lowering the threshold for the incorporation of accessibility features in future urban design.Abstract
Often overlooked as a component of urban development, accessibility infrastructure is undeniably crucial in daily life. Accessibility ramps are one of the most common types of accessibility infrastructure, and serve to benefit not only people with mobile impairments but also able-bodied third parties. While the necessity of accessibility ramps is acknowledged, actual implementation fails in light of the limits of manpower required for the design stage. In response, we present an algorithm capable of the automatic generation of a feasible accessibility ramp based on a 3D model of the relevant environment. Through the manual specification of initial and terminal points within a 3D model, the algorithm uses AI search algorithms to determine the optimal pathway connecting these points. Essential components in devising a wheelchair-accessible ramp are encoded within the process, as evaluated by the algorithm, including but not limited to elevation differentials, spatial constraints, and gradient specifications. From this, the algorithm then generates the pathway to be expanded into a full-scale, usable model of a ramp, which then can be easily exported and transformed through inter-software exchanges. Though some human input is still required following the generation stage, the minimising of human resources provides significant boosts of efficiency in the design process thus lowering the threshold for the incorporation of accessibility features in future urban design.
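A generic sketch of the search component under an assumed 1:12 gradient constraint might look as follows; this is plain grid-based A* in Python, not the Grasshopper/Rhinoceros implementation:

```python
# Sketch: A* over a grid of elevations, where a step is legal only if
# its gradient stays within a wheelchair limit (1:12 here, an assumed
# value). The elevation grid below is invented for illustration.
import heapq

def ramp_path(elev, start, goal, cell=1.0, max_grad=1 / 12):
    rows, cols = len(elev), len(elev[0])
    frontier = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _priority, (r, c), path = heapq.heappop(frontier)
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if abs(elev[nr][nc] - elev[r][c]) / cell > max_grad:
                continue  # too steep for an accessible ramp segment
            g = best[(r, c)] + cell
            if g < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = g
                h = abs(goal[0] - nr) + abs(goal[1] - nc)  # Manhattan heuristic
                heapq.heappush(frontier, (g + h, (nr, nc), path + [(nr, nc)]))
    return None  # no feasible ramp exists

elev = [[0.00, 0.05, 0.10],
        [0.00, 0.50, 0.15],   # the 0.50 cell acts as a wall the path must avoid
        [0.00, 0.05, 0.20]]
print(ramp_path(elev, (0, 0), (2, 2)))
```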
Reliability Quantification of Deep Reinforcement Learning-based Control
results: The proposed method is applied to DQN-based control of a simple task and shown to be effective; it is further applied to the problem of switching trained models depending on the state, improving the performance of DRL-based control by switching models according to their reliability. Abstract
Reliability quantification of deep reinforcement learning (DRL)-based control is a significant challenge for the practical application of artificial intelligence (AI) in safety-critical systems. This study proposes a method for quantifying the reliability of DRL-based control. First, an existing method, random noise distillation, was applied to the reliability evaluation to clarify the issues to be solved. Second, a novel method for reliability quantification was proposed to solve these issues. The reliability is quantified using two neural networks: reference and evaluator. They have the same structure with the same initial parameters. The outputs of the two networks were the same before training. During training, the evaluator network parameters were updated to maximize the difference between the reference and evaluator networks for trained data. Thus, the reliability of the DRL-based control for a state can be evaluated based on the difference in output between the two networks. The proposed method was applied to DQN-based control as an example of a simple task, and its effectiveness was demonstrated. Finally, the proposed method was applied to the problem of switching trained models depending on the state. Consequently, the performance of the DRL-based control was improved by switching the trained models according to their reliability.
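Reading the abstract's construction literally, a toy sketch might look like the following; the clamp bound, network sizes, and training loop are assumptions rather than the paper's recipe:

```python
# Sketch of the reference/evaluator idea: two identically initialized
# networks; the evaluator is trained to enlarge its (bounded)
# disagreement with the frozen reference on visited states.
import copy
import torch
import torch.nn as nn

reference = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 8))
evaluator = copy.deepcopy(reference)   # same structure, same initial parameters
for p in reference.parameters():
    p.requires_grad_(False)            # the reference stays frozen

opt = torch.optim.Adam(evaluator.parameters(), lr=1e-3)
visited_states = torch.randn(256, 4)   # stand-in for states seen during DRL training

for _ in range(200):
    diff = (evaluator(visited_states) - reference(visited_states)).pow(2).mean(dim=1)
    loss = -diff.clamp(max=1.0).mean()  # maximize bounded disagreement on trained data
    opt.zero_grad(); loss.backward(); opt.step()

def reliability(state):
    """Per the abstract, a larger output difference on a state marks it as
    closer to the training distribution, i.e. more reliably handled."""
    with torch.no_grad():
        return (evaluator(state) - reference(state)).pow(2).mean().item()

print(reliability(visited_states[:1]))      # trained to be large
print(reliability(torch.randn(1, 4) * 10))  # untrained region: not pushed up
```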
A Quantum States Preparation Method Based on Difference-Driven Reinforcement Learning
results: Experimental results show that the proposed algorithm can prepare quantum states of a two-qubit system with high fidelity under limited conditions, with varying degrees of improvement in convergence speed and final-state fidelity compared with other algorithms. Abstract
Due to the large state space of the two-qubit system, and the adoption of ladder reward function in the existing quantum state preparation methods, the convergence speed is slow and it is difficult to prepare the desired target quantum state with high fidelity under limited conditions. To solve the above problems, a difference-driven reinforcement learning (RL) algorithm for quantum state preparation of two-qubit system is proposed by improving the reward function and action selection strategy. Firstly, a model is constructed for the problem of preparing quantum states of a two-qubit system, with restrictions on the type of quantum gates and the time for quantum state evolution. In the preparation process, a weighted differential dynamic reward function is designed to assist the algorithm quickly obtain the maximum expected cumulative reward. Then, an adaptive e-greedy action selection strategy is adopted to achieve a balance between exploration and utilization to a certain extent, thereby improving the fidelity of the final quantum state. The simulation results show that the proposed algorithm can prepare quantum state with high fidelity under limited conditions. Compared with other algorithms, it has different degrees of improvement in convergence speed and fidelity of the final quantum state.
Discrete-Choice Model with Generalized Additive Utility Network
results: A novel utility function built on a generalized additive utility network (GAUNet) is proposed and evaluated on trip survey data collected in Tokyo; the resulting models are comparable to ASU-DNN in accuracy while exhibiting improved interpretability. Abstract
Discrete-choice models are a powerful framework for analyzing decision-making behavior to provide valuable insights for policymakers and businesses. Multinomial logit models (MNLs) with linear utility functions have been used in practice because they are easy to use and interpretable. Recently, MNLs with neural networks (e.g., ASU-DNN) have been developed, and they have achieved higher prediction accuracy in behavior choice than classical MNLs. However, these models lack interpretability owing to complex structures. We developed utility functions with a novel neural-network architecture based on generalized additive models, named generalized additive utility network (GAUNet), for discrete-choice models. We evaluated the performance of the MNL with GAUNet using the trip survey data collected in Tokyo. Our models were comparable to ASU-DNN in accuracy and exhibited improved interpretability compared to previous models.
Adversarial Driving Behavior Generation Incorporating Human Risk Cognition for Autonomous Vehicle Evaluation
results: A comparative case study shows that the adversarial method can effectively expose the weaknesses of the tested AV, with strong results on a high-fidelity Hardware-in-the-Loop (HiL) platform. Abstract
Autonomous vehicle (AV) evaluation has been the subject of increased interest in recent years both in industry and in academia. This paper focuses on the development of a novel framework for generating adversarial driving behavior of background vehicle interfering against the AV to expose effective and rational risky events. Specifically, the adversarial behavior is learned by a reinforcement learning (RL) approach incorporated with the cumulative prospect theory (CPT) which allows representation of human risk cognition. Then, the extended version of deep deterministic policy gradient (DDPG) technique is proposed for training the adversarial policy while ensuring training stability as the CPT action-value function is leveraged. A comparative case study regarding the cut-in scenario is conducted on a high fidelity Hardware-in-the-Loop (HiL) platform and the results demonstrate the adversarial effectiveness to infer the weakness of the tested AV.
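For concreteness, the classic Tversky-Kahneman CPT value and probability-weighting functions (shown with the 1992 parameter estimates; the paper's exact parameterization is not given in the abstract) look like this:

```python
# Cumulative prospect theory building blocks in their standard forms.
def cpt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """S-shaped value function: risk-averse for gains, loss-averse for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

def cpt_weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweights small probabilities."""
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

# A rare, severe event looms larger subjectively than objectively:
print(cpt_weight(0.01))  # ~0.055, overweighted relative to 0.01
print(cpt_value(-10))    # ~-17.1, losses hurt more than equal gains please
```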
results: The paper proposes three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. Each uniquely satisfies a set of desirable properties, and evaluations on multiple datasets show that the resulting explanations are robust. Abstract
The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue -- there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME.
On Generating Explanations for Reinforcement Learning Policies: An Empirical Study
paper_authors: Mikihisa Yuasa, Huy T. Tran, Ramavarapu S. Sreenivas
for: Providing explanations for reinforcement learning policies
methods: Linear Temporal Logic (LTL) formulae are used to provide the explanations
results: The approach is demonstrated to be effective in a simulated capture-the-flag environment, producing explanations that reveal both the ultimate objectives accomplished by the policy and the conditions it upholds during execution; the paper concludes with suggested directions for future research. Abstract
In this paper, we introduce a set of \textit{Linear Temporal Logic} (LTL) formulae designed to provide explanations for policies. Our focus is on crafting explanations that elucidate both the ultimate objectives accomplished by the policy and the prerequisites it upholds throughout its execution. These LTL-based explanations feature a structured representation, which is particularly well-suited for local-search techniques. The effectiveness of our proposed approach is illustrated through a simulated capture the flag environment. The paper concludes with suggested directions for future research.
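To make the flavour of such explanations concrete, here is a minimal finite-trace checker for two LTL operators; the atomic propositions and the capture-the-flag trace are invented:

```python
# Minimal checker for two LTL operators over a finite trace.
def always(trace, prop):          # G prop: prop holds in every state
    return all(prop(s) for s in trace)

def eventually(trace, prop):      # F prop: prop holds in some state
    return any(prop(s) for s in trace)

# Toy capture-the-flag trace: (has_flag, in_safe_zone) per timestep.
trace = [(False, True), (False, True), (True, True), (True, True)]

# "The policy eventually captures the flag and never leaves the safe zone":
explanation_holds = eventually(trace, lambda s: s[0]) and always(trace, lambda s: s[1])
print(explanation_holds)  # True
```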
results: In experiments, the method achieves significant improvement on challenging image datasets, and when the problem is reduced to image generation by setting the source distribution to random noise, it achieves FID scores comparable to state-of-the-art methods. Abstract
Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task.
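The bridge processes DDBMs build on can be previewed with the plain Brownian bridge between paired endpoints, whose marginal has a closed form (this illustrates the interpolating process, not the paper's learned score model):

```python
# Brownian bridge between paired endpoints x0 (t=0) and x1 (t=1):
# the marginal mean interpolates linearly and the noise vanishes
# at both endpoints. Dimensions and sigma are arbitrary here.
import numpy as np

rng = np.random.default_rng(0)

def brownian_bridge_sample(x0, x1, t, sigma=1.0):
    """Sample x_t pinned to x0 at t=0 and x1 at t=1."""
    mean = (1 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1 - t))   # zero at both endpoints
    return mean + std * rng.standard_normal(x0.shape)

x0 = np.zeros(4)   # e.g. a corrupted image, flattened
x1 = np.ones(4)    # its paired clean target
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, brownian_bridge_sample(x0, x1, t).round(2))
```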
Asynchrony-Robust Collaborative Perception via Bird’s Eye View Flow
for: This paper is written to address the issue of temporal asynchrony in multi-agent collaboration, which can negatively impact the accuracy of perception and fusion in real-world scenarios.
methods: The proposed method, CoBEVFlow, uses a bird’s eye view (BEV) flow to compensate for motions and align asynchronous collaboration messages sent by multiple agents. This approach allows for robust and efficient collaboration, even in extremely asynchronous settings.
results: The paper presents extensive experiments conducted on both synthetic and real-world datasets, which demonstrate the efficacy of CoBEVFlow in mitigating the impact of asynchrony and outperforming other baselines. The code for CoBEVFlow is available at https://github.com/MediaBrain-SJTU/CoBEVFlow. Abstract
Collaborative perception can substantially boost each agent's perception ability by facilitating communication among multiple agents. However, temporal asynchrony among agents is inevitable in the real world due to communication delays, interruptions, and clock misalignments. This issue causes information mismatch during multi-agent fusion, seriously shaking the foundation of collaboration. To address this issue, we propose CoBEVFlow, an asynchrony-robust collaborative perception system based on bird's eye view (BEV) flow. The key intuition of CoBEVFlow is to compensate motions to align asynchronous collaboration messages sent by multiple agents. To model the motion in a scene, we propose BEV flow, which is a collection of the motion vector corresponding to each spatial location. Based on BEV flow, asynchronous perceptual features can be reassigned to appropriate positions, mitigating the impact of asynchrony. CoBEVFlow has two advantages: (i) CoBEVFlow can handle asynchronous collaboration messages sent at irregular, continuous time stamps without discretization; and (ii) with BEV flow, CoBEVFlow only transports the original perceptual features, instead of generating new perceptual features, avoiding additional noises. To validate CoBEVFlow's efficacy, we create IRregular V2V(IRV2V), the first synthetic collaborative perception dataset with various temporal asynchronies that simulate different real-world scenarios. Extensive experiments conducted on both IRV2V and the real-world dataset DAIR-V2X show that CoBEVFlow consistently outperforms other baselines and is robust in extremely asynchronous settings. The code is available at https://github.com/MediaBrain-SJTU/CoBEVFlow.
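The core compensation step can be sketched as warping BEV features along per-cell motion vectors; the integer cell offsets and toy grid are simplifying assumptions, not the paper's implementation:

```python
# Sketch: move each BEV cell's feature along its estimated motion vector
# so a stale collaboration message lines up with the current timestamp.
import numpy as np

def warp_bev(features, flow):
    """features: (H, W, C); flow: (H, W, 2) integer cell offsets (dy, dx)."""
    H, W, C = features.shape
    warped = np.zeros_like(features)
    for y in range(H):
        for x in range(W):
            ny, nx = y + flow[y, x, 0], x + flow[y, x, 1]
            if 0 <= ny < H and 0 <= nx < W:
                warped[ny, nx] = features[y, x]  # reassign to compensated position
    return warped

feat = np.zeros((4, 4, 1)); feat[1, 1, 0] = 7.0           # one detected object
flow = np.zeros((4, 4, 2), dtype=int); flow[1, 1] = (1, 2)  # it moved down-right
print(warp_bev(feat, flow)[2, 3, 0])  # 7.0: feature now at its predicted location
```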
PC-Adapter: Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds with Rectified Pseudo-label
results: On standard benchmarks including PointDA, GraspNetPC, and PointSegDA, the proposed method demonstrates higher accuracy and better robustness than the baselines. Abstract
Understanding point clouds captured from the real world is challenging due to shifts in data distribution caused by varying object scales, sensor angles, and self-occlusion. Prior works have addressed this issue by combining recent learning principles such as self-supervised learning, self-training, and adversarial training, which leads to significant computational overhead. Toward succinct yet powerful domain adaptation for point clouds, we revisit the unique challenges of point cloud data under domain shift scenarios and discover the importance of the global geometry of source data and trends of target pseudo-labels biased to the source label distribution. Motivated by our observations, we propose an adapter-guided domain adaptation method, PC-Adapter, that preserves the global shape information of the source domain using an attention-based adapter, while learning the local characteristics of the target domain via another adapter equipped with graph convolution. Additionally, we propose a novel pseudo-labeling strategy resilient to the classifier bias by adjusting confidence scores using their class-wise confidence distributions to consider relative confidences. Our method demonstrates superiority over baselines on various domain shift settings in benchmark datasets - PointDA, GraspNetPC, and PointSegDA.
TranDRL: A Transformer-Driven Deep Reinforcement Learning Enabled Prescriptive Maintenance Framework
results: The framework is validated on the NASA C-MAPSS dataset, where it demonstrates significant advances in both RUL prediction accuracy and the optimization of maintenance actions; the approach thus provides a data-driven methodology for prescriptive maintenance that addresses key challenges in industrial operations, leading to more efficient, cost-effective, and reliable systems. Abstract
Industrial systems demand reliable predictive maintenance strategies to enhance operational efficiency and reduce downtime. This paper introduces a novel, integrated framework that leverages the power of transformer neural networks and deep reinforcement learning (DRL) algorithms to optimize maintenance actions. Our approach employs the transformer model to effectively capture complex temporal patterns in sensor data, thereby accurately predicting the Remaining Useful Life (RUL) of equipment. Simultaneously, the DRL component of our framework provides cost-effective and timely maintenance recommendations. We validate the efficacy of our framework on the NASA C-MAPSS dataset, where it demonstrates significant advancements in both RUL prediction accuracy and the optimization of maintenance actions. Consequently, our pioneering approach provides an innovative data-driven methodology for prescriptive maintenance, addressing key challenges in industrial operations and leading the way to more efficient, cost-effective, and reliable systems.
Learning to Receive Help: Intervention-Aware Concept Embedding Models
results: Experimental results show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of the approach. Abstract
Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures by constructing and explaining their predictions using a set of high-level concepts. A special property of these models is that they permit concept interventions, wherein users can correct mispredicted concepts and thus improve the model's performance. Recent work, however, has shown that intervention efficacy can be highly dependent on the order in which concepts are intervened on and on the model's architecture and training hyperparameters. We argue that this is rooted in a CBM's lack of train-time incentives for the model to be appropriately receptive to concept interventions. To address this, we propose Intervention-aware Concept Embedding models (IntCEMs), a novel CBM-based architecture and training paradigm that improves a model's receptiveness to test-time interventions. Our model learns a concept intervention policy in an end-to-end fashion from where it can sample meaningful intervention trajectories at train-time. This conditions IntCEMs to effectively select and receive concept interventions when deployed at test-time. Our experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of our approach.
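A test-time concept intervention can be sketched on a classic scalar-bottleneck model (simpler than IntCEM's embedding-based design); the networks, data, and intervened values below are toy:

```python
# Sketch: an expert overwrites mispredicted concepts in the bottleneck,
# and the label head is re-run on the corrected concept vector.
import torch
import torch.nn as nn

torch.manual_seed(0)
concept_encoder = nn.Sequential(nn.Linear(16, 8), nn.Sigmoid())  # x -> concepts
label_head = nn.Linear(8, 3)                                     # concepts -> label

x = torch.randn(1, 16)
c_hat = concept_encoder(x)

# Expert intervenes on concepts 2 and 5 with the ground-truth values:
intervened = c_hat.clone()
intervened[0, 2], intervened[0, 5] = 1.0, 0.0

before = label_head(c_hat).argmax(dim=1)
after = label_head(intervened).argmax(dim=1)
print(before.item(), "->", after.item())  # label may change after the correction
```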
Mode Connectivity and Data Heterogeneity of Federated Learning
results: The study finds that reducing client data heterogeneity makes models more stable and increases the connectivity between global modes, improving performance. The paper also identifies a barrier to connectivity when linearly connecting two global modes, which can be eliminated by considering non-linear mode connectivity. Abstract
Federated learning (FL) enables multiple clients to collaboratively train a model while keeping their data private. Previous studies have shown that data heterogeneity between clients leads to drifts across client updates. However, there are few studies on the relationship between client and global modes, making it unclear where these updates end up drifting. We perform empirical and theoretical studies on this relationship by utilizing mode connectivity, which measures performance change (i.e., connectivity) along parametric paths between different modes. Empirically, reducing data heterogeneity makes the connectivity on different paths more similar, forming more low-error overlaps between client and global modes. We also find that a barrier to connectivity occurs when linearly connecting two global modes, while it disappears when considering non-linear mode connectivity. Theoretically, we establish a quantitative bound on the global-mode connectivity using mean-field theory or dropout stability. The bound demonstrates that the connectivity improves when reducing data heterogeneity and widening trained models. Numerical results further corroborate our analytical findings.
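The linear probe of mode connectivity described above is simple to reproduce. Below is a minimal sketch, assuming models whose parameters can be flattened into vectors and a hypothetical `eval_loss` callback that loads a parameter vector and returns held-out loss; the barrier is measured as the worst-case loss excess over the linear interpolation of the endpoint losses.

```python
# Minimal sketch of measuring linear mode connectivity between two trained
# models. `eval_loss` is a hypothetical callback that loads a flat parameter
# vector into a model and returns its loss on held-out data.
import numpy as np

def loss_barrier(theta_a, theta_b, eval_loss, num_points=11):
    """Worst-case loss increase along the linear path between two modes,
    relative to the linear interpolation of the endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([eval_loss((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))  # ~0 => modes are linearly connected

if __name__ == "__main__":
    # Toy quadratic "loss" over 2-D parameters:
    eval_loss = lambda w: float((w ** 2).sum())
    a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(loss_barrier(a, b, eval_loss))  # 0.0: the convex toy loss has no barrier
```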
ACGAN-GNNExplainer: Auxiliary Conditional Generative Explainer for Graph Neural Networks
results: Experimental results show that ACGAN-GNNExplainer achieves higher accuracy and fidelity than other existing GNN explainers on both synthetic and real-world graph datasets. Abstract
Graph neural networks (GNNs) have proven their efficacy in a variety of real-world applications, but their underlying mechanisms remain a mystery. To address this challenge and enable reliable decision-making, many GNN explainers have been proposed in recent years. However, these methods often encounter limitations, including their dependence on specific instances, lack of generalizability to unseen graphs, producing potentially invalid explanations, and yielding inadequate fidelity. To overcome these limitations, we, in this paper, introduce the Auxiliary Classifier Generative Adversarial Network (ACGAN) into the field of GNN explanation and propose a new GNN explainer dubbed ACGAN-GNNExplainer. Our approach leverages a generator to produce explanations for the original input graphs while incorporating a discriminator to oversee the generation process, ensuring explanation fidelity and improving accuracy. Experimental evaluations conducted on both synthetic and real-world graph datasets demonstrate the superiority of our proposed method compared to other existing GNN explainers.
ONNXExplainer: an ONNX Based Generic Framework to Explain Neural Networks Using Shapley Values
results: Extensive benchmarks on VGG19, ResNet50, DenseNet201, and EfficientNetB0 show that the proposed optimization improves explanation latency by as much as 500% over the existing open-source counterpart, SHAP. Abstract
Understanding why a neural network model makes certain decisions can be as important as the inference performance. Various methods have been proposed to help practitioners explain the prediction of a neural network model, of which Shapley values are the most popular. The SHAP package is a leading implementation of Shapley values for explaining neural networks implemented in TensorFlow or PyTorch, but it lacks cross-platform support and one-shot deployment and is highly inefficient. To address these problems, we present ONNXExplainer, which is a generic framework to explain neural networks using Shapley values in the ONNX ecosystem. In ONNXExplainer, we develop our own automatic differentiation and optimization approach, which not only enables one-shot deployment of neural network inference and explanations, but also significantly improves the efficiency of computing explanations with less memory consumption. For fair comparison purposes, we also implement the same optimization in TensorFlow and PyTorch and measure its performance against the current state-of-the-art open-source counterpart, SHAP. Extensive benchmarks demonstrate that the proposed optimization approach improves the explanation latency of VGG19, ResNet50, DenseNet201, and EfficientNetB0 by as much as 500%.
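For readers unfamiliar with the quantity being accelerated here, the sketch below estimates Shapley values by Monte Carlo sampling of feature permutations. It is a model-agnostic illustration of what SHAP-style explainers approximate, not ONNXExplainer's actual optimized implementation; the toy `model` and baseline are assumptions for the example.

```python
# Monte Carlo estimate of Shapley values: average each feature's marginal
# contribution over random orderings. Features absent from a coalition are
# replaced by a baseline value; `model` maps a feature vector to a scalar.
import numpy as np

def shapley_sample(model, x, baseline, num_perms=200, rng=None):
    rng = np.random.default_rng(rng)
    n = len(x)
    phi = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        z = baseline.copy()
        prev = model(z)
        for i in perm:                 # add features one at a time
            z[i] = x[i]
            cur = model(z)
            phi[i] += cur - prev       # marginal contribution of feature i
            prev = cur
    return phi / num_perms             # sums to model(x) - model(baseline)

if __name__ == "__main__":
    model = lambda v: 2.0 * v[0] + v[1] * v[2]
    x, base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
    print(shapley_sample(model, x, base, rng=0))  # ~[2.0, 3.0, 3.0]
```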
ASAP: Automated Sequence Planning for Complex Robotic Assembly with Physical Feasibility
paper_authors: Yunsheng Tian, Karl D. D. Willis, Bassel Al Omari, Jieliang Luo, Pingchuan Ma, Yichen Li, Farhad Javid, Edward Gu, Joshua Jacob, Shinjiro Sueda, Hui Li, Sachin Chitta, Wojciech Matusik
results: The authors show on a large dataset of complex product assemblies that ASAP generates physically feasible assembly sequence plans, and demonstrate its applicability on real-world robotic setups. Abstract
The automated assembly of complex products requires a system that can automatically plan a physically feasible sequence of actions for assembling many parts together. In this paper, we present ASAP, a physics-based planning approach for automatically generating such a sequence for general-shaped assemblies. ASAP accounts for gravity to design a sequence where each sub-assembly is physically stable with a limited number of parts being held and a support surface. We apply efficient tree search algorithms to reduce the combinatorial complexity of determining such an assembly sequence. The search can be guided by either geometric heuristics or graph neural networks trained on data with simulation labels. Finally, we show the superior performance of ASAP at generating physically realistic assembly sequence plans on a large dataset of hundreds of complex product assemblies. We further demonstrate the applicability of ASAP on both simulation and real-world robotic setups. Project website: asap.csail.mit.edu
paper_authors: Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Pedro Moreno Mengibar
for: Improving the recognition accuracy of rare entities in automatic speech recognition (ASR) systems.
methods: A pattern-matching algorithm based on the Knuth-Morris-Pratt algorithm for contextual biasing. During beam search, the score of a token extension is boosted if it extends a match into the set of biasing phrases. The method simulates the classical approaches implemented in the Weighted Finite State Transducer (WFST) framework, but avoids the FST language altogether, with careful attention to memory footprint and efficiency on tensor processing units (TPUs).
results: Significant word error rate (WER) reductions on biasing test sets on its own, with further performance gains when combined with a model-based biasing method. Abstract
Contextual biasing refers to the problem of biasing automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios. We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching. During beam search, we boost the score of a token extension if it extends matching into a set of biasing phrases. Our method simulates the classical approaches often implemented in the weighted finite state transducer (WFST) framework, but avoids the FST language altogether, with careful consideration of memory footprint and efficiency on tensor processing units (TPUs) through vectorization. Without introducing additional model parameters, our method achieves significant word error rate (WER) reductions on biasing test sets by itself, and yields further performance gain when combined with a model-based biasing method.
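A minimal sketch of the matching machinery involved, assuming a single biasing phrase of token IDs and an illustrative boost value: the classic KMP failure function lets each beam hypothesis carry a match state that advances in amortized O(1) per token, and extensions that deepen the match receive a score bonus. The paper's vectorized, multi-phrase TPU version is considerably more involved.

```python
# Track a KMP match state against one biasing phrase as the beam extends
# token by token, and boost hypotheses whose extension advances the match.
def kmp_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (classic KMP preprocessing)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def advance(state, token, pattern, fail):
    """Next KMP match state after consuming one token."""
    if state == len(pattern):          # full match: continue from longest border
        state = fail[state - 1]
    while state > 0 and token != pattern[state]:
        state = fail[state - 1]
    return state + 1 if token == pattern[state] else state

def biased_score(base_score, state, new_state, boost=2.0):
    # Reward extensions that deepen the match into the biasing phrase.
    return base_score + boost * max(0, new_state - state)

if __name__ == "__main__":
    phrase = [17, 42, 99]              # token IDs of a rare entity (illustrative)
    fail = kmp_failure(phrase)
    s = advance(0, 17, phrase, fail)   # partial match begins
    print(s, biased_score(-1.3, 0, s)) # 1 0.7
```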
Automatic Prompt Rewriting for Personalized Text Generation
results: On datasets from three representative domains, the rewritten prompts outperform both the original prompts and prompts optimized via supervised learning or reinforcement learning alone. Abstract
Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human-readable, but also able to guide manual revision of prompts when there are limited resources to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference.
The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning
paper_authors: Lillian Zhou, Yuxin Ding, Mingqing Chen, Harry Zhang, Rohit Prabhavalkar, Dhruv Guliani, Giovanni Motta, Rajiv Mathews
for: addressing the issue of outdated automatic speech recognition (ASR) models on edge devices due to language evolution
methods: using Federated Learning (FL) to continually learn from on-device user corrections and improve recognition of fresh terms, while mitigating catastrophic forgetting
results: improved recognition of fresh terms while preserving overall language distribution quality in experimental evaluations. Abstract
Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.
A Large Language Model Approach to Educational Survey Feedback Analysis
results: The study shows that effective prompting practices enable GPT-4 to reach human-level performance on multiple tasks, and that the LLM's chain-of-thought reasoning can provide valuable insight. The paper also develops a versatile set of classification categories suitable for different course types (online, hybrid, or in-person) that can be customized as needed. Abstract
This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less exploration of capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, often requiring time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by LLM. We apply these workflows to a real-world dataset of 2500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling workflows necessary to achieve typical goals. We also show the potential of inspecting LLMs' chain-of-thought (CoT) reasoning for providing insight that may foster confidence in practice. Moreover, this study features development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text.
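As a concrete (and heavily simplified) illustration of treating survey analysis as a sequence of NLP tasks, the sketch below frames multi-label classification of one comment as a zero-shot prompt. `call_llm` is a hypothetical stand-in for whatever API client is used, and the label set is illustrative rather than the paper's taxonomy.

```python
# Zero-shot, prompt-driven multi-label classification of a survey comment.
# `call_llm` is a hypothetical wrapper over a chat-completion API.
import json

LABELS = ["course content", "instructor", "assessment", "logistics"]

def classify_comment(comment: str, call_llm) -> list[str]:
    prompt = (
        "You label end-of-course survey comments.\n"
        f"Allowed labels: {LABELS}\n"
        "Return a JSON list of every applicable label (multi-label).\n\n"
        f"Comment: {comment!r}\nLabels:"
    )
    raw = call_llm(prompt)                      # model's raw text response
    labels = json.loads(raw)                    # expect a JSON list back
    return [l for l in labels if l in LABELS]   # drop anything off-taxonomy
```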
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
results: Through a systematic evaluation on 7 tasks, the paper finds that LLMs have strong language-to-code generation capabilities across semantic parsing, math reasoning, and Python programming, but also exhibit some common failure modes. Abstract
Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models' language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs. This enables us to identify and analyze the typical failure modes across various tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We also release the evaluation framework and all model outputs, hoping to lay the groundwork for further future research in this domain.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
results: The analysis shows that GPT-4V can process arbitrarily interleaved multimodal inputs, with highly consistent and generic capabilities. GPT-4V can also understand visual markers drawn on input images, which may open up new human-computer interaction methods such as visual referring prompting. Abstract
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf
Intuitive or Dependent? Investigating LLMs’ Robustness to Conflicting Prompts
methods: The authors set up a quantitative benchmarking framework and test robustness by controlling LLMs' preferences. Specifically, they define two types of robustness: factual robustness, targeting LLMs' ability to correctly select information from the prompt or from memory, and decision style, categorizing LLMs' behavior as intuitive, dependent, or rational based on cognitive theory.
results: Through extensive experiments on seven open-source and closed-source LLMs, the authors find that these models are highly susceptible to misleading prompts, especially for commonsense knowledge. Although detailed instructions reduce the risk of selecting wrong answers, they also increase the incidence of invalid responses. By applying specific styles of role instruction to LLMs of different sizes, the authors reveal their varying upper bounds of robustness and adaptivity. Abstract
This paper explores the robustness of LLMs' preference for their internal memory or the given prompt, which may contain contrasting information in real-world applications due to noise or task settings. To this end, we establish a quantitative benchmarking framework and conduct role-playing interventions to control LLMs' preference. Specifically, we define two types of robustness: factual robustness, targeting the ability to identify the correct fact from prompts or memory, and decision style, to categorize LLMs' behavior in making consistent choices -- assuming there is no definitive "right" answer -- as intuitive, dependent, or rational based on cognitive theory. Our findings, derived from extensive experiments on seven open-source and closed-source LLMs, reveal that these models are highly susceptible to misleading prompts, especially for instructing commonsense knowledge. While detailed instructions can mitigate the selection of misleading answers, they also increase the incidence of invalid responses. After unraveling the preference, we intervene on different-sized LLMs through specific styles of role instruction, showing their varying upper bounds of robustness and adaptivity.
Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles
methods: The study uses the Lay Summarisation of Biomedical Research Articles (BioLaySumm) shared task, hosted at the BioNLP Workshop at ACL 2023, to evaluate the summarization models built by participants.
results: The results show that participants' models can generate high-quality lay summaries in both controllable and non-controllable settings, achieving relatively high accuracy across different article types and lengths. Abstract
This paper presents the results of the shared task on Lay Summarisation of Biomedical Research Articles (BioLaySumm), hosted at the BioNLP Workshop at ACL 2023. The goal of this shared task is to develop abstractive summarisation models capable of generating "lay summaries" (i.e., summaries that are comprehensible to non-technical audiences) in both a controllable and non-controllable setting. There are two subtasks: 1) Lay Summarisation, where the goal is for participants to build models for lay summary generation only, given the full article text and the corresponding abstract as input; and 2) Readability-controlled Summarisation, where the goal is for participants to train models to generate both the technical abstract and the lay summary, given an article's main text as input. In addition to overall results, we report on the setup and insights from the BioLaySumm shared task, which attracted a total of 20 participating teams across both subtasks.
Few-Shot Domain Adaptation for Charge Prediction on Unprofessional Descriptions
results: Experiments show that, compared with existing FSDA methods, DLCCP performs strongly on the newly released layperson dataset (NCCP), outperforming competitive baselines. Abstract
Recent works considering professional legal-linguistic style (PLLS) texts have shown promising results on the charge prediction task. However, unprofessional users also show an increasing demand on such a prediction service. There is a clear domain discrepancy between PLLS texts and non-PLLS texts expressed by those laypersons, which degrades the current SOTA models' performance on non-PLLS texts. A key challenge is the scarcity of non-PLLS data for most charge classes. This paper proposes a novel few-shot domain adaptation (FSDA) method named Disentangled Legal Content for Charge Prediction (DLCCP). Compared with existing FSDA works, which solely perform instance-level alignment without considering the negative impact of text style information existing in latent features, DLCCP (1) disentangles the content and style representations for better domain-invariant legal content learning with carefully designed optimization goals for content and style spaces and, (2) employs the constitutive elements knowledge of charges to extract and align element-level and instance-level content representations simultaneously. We contribute the first publicly available non-PLLS dataset named NCCP for developing layperson-friendly charge prediction models. Experiments on NCCP show the superiority of our methods over competitive baselines.
Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization
for: Presenting a large-scale public dataset for contextual spellchecking customization of automatic speech recognition (ASR) systems, with a focus on diverse rare and out-of-vocabulary (OOV) phrases.
methods: The approach creates millions of realistic corrupted ASR hypotheses and simulates non-trivial biasing lists for the customization task. It also proposes injecting two types of "hard negatives" into the biasing lists and describes procedures to mine them automatically.
results: Training an open-source customization model on the proposed dataset shows that injecting "hard negatives" reduces WER and the number of false alarms. Abstract
We present a first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR) with a focus on diverse rare and out-of-vocabulary (OOV) phrases, such as proper names or terms. The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task. Furthermore, we propose injecting two types of "hard negatives" into the simulated biasing lists in training examples and describe our procedures to automatically mine them. We report experiments with training an open-source customization model on the proposed dataset and show that the injection of hard negative biasing phrases decreases WER and the number of false alarms.
LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games
results: Through diverse text-based, multi-agent, multi-issue, semantically rich negotiation games, the paper shows that agents can successfully negotiate and reach agreements, and that the evaluation generalizes to new games and setups. Abstract
There is a growing interest in using Large Language Models (LLMs) as agents to tackle real-world tasks that may require assessing complex situations. Yet, we have a limited understanding of LLMs' reasoning and decision-making capabilities, partly stemming from a lack of dedicated evaluation benchmarks. As negotiating and compromising are key aspects of our everyday communication and collaboration, we propose using scorable negotiation games as a new evaluation framework for LLMs. We create a testbed of diverse text-based, multi-agent, multi-issue, semantically rich negotiation games, with easily tunable difficulty. To solve the challenge, agents need to have strong arithmetic, inference, exploration, and planning capabilities, while seamlessly integrating them. Via a systematic zero-shot Chain-of-Thought prompting (CoT), we show that agents can negotiate and consistently reach successful deals. We quantify the performance with multiple metrics and observe a large gap between GPT-4 and earlier models. Importantly, we test the generalization to new games and setups. Finally, we show that these games can help evaluate other critical aspects, such as the interaction dynamics between agents in the presence of greedy and adversarial players.
Training and inference of large language models using 8-bit floating point
paper_authors: Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon
results: The paper trains and validates large language models of the GPT and Llama 2 type using FP8, and plots the per-tensor scale distributions to better illustrate the FP8 dynamics. Abstract
FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
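A minimal sketch of the per-tensor scaling idea, under stated assumptions: numpy has no FP8 dtype, so only the scale selection and range clipping are modeled (real FP8 kernels also round to the 8-bit grid). The format maxima follow the OCP FP8 specification.

```python
# Per-tensor dynamic scaling for FP8-style training: pick a scale so the
# tensor's absolute maximum lands near the format's largest finite value,
# then clip on cast and divide the scale back out on dequantization.
import numpy as np

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}    # largest finite values (OCP spec)

def compute_scale(tensor, fmt="e4m3", margin=1.0):
    amax = float(np.abs(tensor).max()) + 1e-12
    return (FP8_MAX[fmt] * margin) / amax      # multiply before casting to FP8

def fake_cast_fp8(tensor, scale, fmt="e4m3"):
    scaled = np.clip(tensor * scale, -FP8_MAX[fmt], FP8_MAX[fmt])
    return scaled / scale                      # dequantized reference value

if __name__ == "__main__":
    grads = np.random.randn(4, 4).astype(np.float32) * 1e-4  # tiny gradients
    s = compute_scale(grads)
    print(s, np.abs(fake_cast_fp8(grads, s) - grads).max())  # ~0 without rounding
```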
Comparative Analysis of Named Entity Recognition in the Dungeons and Dragons Domain
methods: The study uses open-source large language models to annotate named entities in 7 Dungeons and Dragons (D&D) adventure books and evaluates each model's precision.
results: The study finds that, without modifications, Flair, Trankit, and Spacy perform best on D&D text, while the other models perform worse. Abstract
Many NLP tasks, although well-resolved for general English, face challenges in specific domains like fantasy literature. This is evident in Named Entity Recognition (NER), which detects and categorizes entities in text. We analyzed 10 NER models on 7 Dungeons and Dragons (D&D) adventure books to assess domain-specific performance. Using open-source Large Language Models, we annotated named entities in these books and evaluated each model's precision. Our findings indicate that, without modifications, Flair, Trankit, and Spacy outperform others in identifying named entities in the D&D context.
LatticeGen: A Cooperative Framework which Hides Generated Text in a Lattice for Privacy-Aware Generation on Cloud
methods: A cooperative framework in which the server handles most of the computation while the user controls the sampling operation, using noise tokens and a repeated beam-search analysis to defend against potential attacks from a malicious server.
results: In experiments, LatticeGen protects the privacy of text generation and successfully shields the true generated content under strong attacks, with more than 50% of the semantics remaining hidden as measured by BERTScore. Abstract
In the current user-server interaction paradigm of prompted generation with large language models (LLM) on cloud, the server fully controls the generation process, which leaves zero options for users who want to keep the generated text to themselves. We propose LatticeGen, a cooperative framework in which the server still handles most of the computation while the user controls the sampling operation. The key idea is that the true generated sequence is mixed with noise tokens by the user and hidden in a noised lattice. Considering potential attacks from a hypothetically malicious server and how the user can defend against it, we propose the repeated beam-search attack and the mixing noise scheme. In our experiments we apply LatticeGen to protect both prompt and generation. It is shown that while the noised lattice degrades generation quality, LatticeGen successfully protects the true generation to a remarkable degree under strong attacks (more than 50% of the semantic remains hidden as measured by BERTScore).
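A toy sketch of the client-side mixing step, assuming two tokens per lattice position (N=2) and a uniform noise sampler: the user sends only a shuffled tuple per step and privately records which slot is real. This illustrates the information hiding alone, not the server-side lattice decoding or the defenses analyzed in the paper.

```python
# Client-side mixing: hide the true token among N-1 noise tokens per step.
import random

def mix_step(true_token: int, vocab_size: int, n: int = 2, rng=None):
    rng = rng or random.Random()
    noise = [rng.randrange(vocab_size) for _ in range(n - 1)]
    slots = noise + [true_token]
    rng.shuffle(slots)
    secret_index = slots.index(true_token)   # kept on the user side only
    return tuple(slots), secret_index

if __name__ == "__main__":
    lattice, secrets = [], []
    for tok in [101, 7592, 2088]:            # the user's true token sequence
        slots, idx = mix_step(tok, vocab_size=30000, n=2)
        lattice.append(slots)                 # what the server sees
        secrets.append(idx)                   # what the user keeps private
    print(lattice, secrets)
```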
Promoting Generalized Cross-lingual Question Answering in Few-resource Scenarios via Self-knowledge Distillation
results: Outperforms standard cross-entropy fine-tuning by a significant margin, and remains competitive with a strong baseline in resource-constrained settings, even in zero-shot scenarios. Abstract
Despite substantial progress in multilingual extractive Question Answering (QA), models with high and uniformly distributed performance across languages remain challenging, especially for languages with limited resources. We study cross-lingual transfer mainly focusing on the Generalized Cross-Lingual Transfer (G-XLT) task, where the question language differs from the context language - a challenge that has received limited attention thus far. Our approach seeks to enhance cross-lingual QA transfer using a high-performing multilingual model trained on a large-scale dataset, complemented by a few thousand aligned QA examples across languages. Our proposed strategy combines cross-lingual sampling and advanced self-distillation training in generations to tackle the previous challenge. Notably, we introduce the novel mAP@k coefficients to fine-tune self-knowledge distillation loss, dynamically regulating the teacher's model knowledge to perform a balanced and effective knowledge transfer. We extensively evaluate our approach to assess XLT and G-XLT capabilities in extractive QA. Results reveal that our self-knowledge distillation approach outperforms standard cross-entropy fine-tuning by a significant margin. Importantly, when compared to a strong baseline that leverages a sizeable volume of machine-translated data, our approach shows competitive results despite the considerable challenge of operating within resource-constrained settings, even in zero-shot scenarios. Beyond performance improvements, we offer valuable insights through comprehensive analyses and an ablation study, further substantiating the benefits and constraints of our approach. In essence, we propose a practical solution to improve cross-lingual QA transfer by leveraging a few data resources in an efficient way.
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
paper_authors: Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne
for: This paper proposes a new method called Fine-grained Late-interaction Multi-modal Retrieval (FLMR) to improve the performance of Retrieval-Augmented Visual Question Answering (RA-VQA) systems.
methods: FLMR uses a vision model aligned with an existing text-based retriever to obtain image representations that complement those from the image-to-text transforms. It also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents.
results: FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Additionally, when equipped with two state-of-the-art large multi-modal/language models, RA-VQA achieves $\sim61\%$ VQA score in the OK-VQA dataset. Abstract
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8\%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve $\sim61\%$ VQA score in the OK-VQA dataset.
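A common form of the late-interaction scoring the abstract alludes to is the ColBERT-style MaxSim, sketched below under the assumption of unit-normalized per-token embeddings: each query token keeps its best match over document tokens and the maxima are summed, which is what makes multi-dimensional embeddings sensitive to finer-grained relevance than a single dot product. Shapes are illustrative.

```python
# ColBERT-style MaxSim late-interaction relevance between token embeddings.
import numpy as np

def late_interaction_score(Q, D):
    """Q: (num_query_tokens, dim), D: (num_doc_tokens, dim); rows unit-norm."""
    sim = Q @ D.T                         # all token-pair similarities
    return float(sim.max(axis=1).sum())   # MaxSim over doc tokens, sum over query

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((5, 8));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    D = rng.standard_normal((20, 8)); D /= np.linalg.norm(D, axis=1, keepdims=True)
    print(late_interaction_score(Q, D))
```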
Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models
results: Experimental results show promising performance on automatic evaluation metrics, though a qualitative analysis uncovers areas for further improvement. The work has the potential to advance legal NLP research and to serve as a rigorous benchmark for evaluating NLP models in specialized domains. Abstract
Many individuals are likely to face a legal dispute at some point in their lives, but their lack of understanding of how to navigate these complex issues often renders them vulnerable. The advancement of natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. However, existing legal question answering (LQA) approaches often suffer from a narrow scope, being either confined to specific legal domains or limited to brief, uninformative responses. In this work, we propose an end-to-end methodology designed to generate long-form answers to any statutory law questions, utilizing a "retrieve-then-read" pipeline. To support this approach, we introduce and release the Long-form Legal Question Answering (LLeQA) dataset, comprising 1,868 expert-annotated legal questions in the French language, complete with detailed answers rooted in pertinent legal provisions. Our experimental results demonstrate promising performance on automatic evaluation metrics, but a qualitative analysis uncovers areas for refinement. As one of the only comprehensive, expert-annotated long-form LQA dataset, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains. We publicly release our code, data, and models.
Contextualising Levels of Language Resourcedness affecting Digital Processing of Text
results: The paper argues that the existing LRL/HRL classification is problematic, proposes a new characterization, and illustrates its application through examples. Abstract
Application domains such as digital humanities and tools like chatbots involve some form of processing natural language, from digitising hardcopies to speech generation. The language of the content is typically characterised as either a low resource language (LRL) or high resource language (HRL), also known as resource-scarce and well-resourced languages, respectively. African languages have been characterized as resource-scarce languages (Bosch et al. 2007; Pretorius & Bosch 2003; Keet & Khumalo 2014) and English is by far the most well-resourced language. Varied language resources are used to develop software systems for these languages to accomplish a wide range of tasks. In this paper we argue that the dichotomous typology LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterizes languages as Very LRL, LRL, RL, HRL and Very HRL. The characterization is based on the typology of contextual features for each category, rather than counting tools, and motivation is provided for each feature and each characterization. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project is, may assist in, among others, better planning of research and implementation projects. We thus argue in this paper that the characterization of language resources within a given scale in a project is an indispensable component, particularly in the context of low-resourced languages.
I Wish to Have an Argument: Argumentative Reasoning in Large Language Models
results: The authors find that although LLMs can match or surpass the state of the art, their argumentative reasoning performance depends heavily on the input and output representations. They also observe an "exemplar effect": task performance degrades when too many exemplars are provided, with 4-5 being the optimal amount. Under chain-of-thought (CoT) prompting this effect disappears, and CoT enables better performance under ill-conditioned problems. Abstract
We evaluate the ability of contemporary large language models (LLMs) to perform argumentative reasoning. We frame our experiments in terms of the argument mining (AM) and argument pair extraction (APE) tasks, and evaluate their ability to perform reasoning at increasing levels of abstraction in the input and output representations (e.g., arbitrary label sets, semantic graphs). We find that, although LLMs are able to match or surpass the state-of-the-art in AM and APE, their argumentative reasoning performance is very dependent on the input and output representation. We also find an "exemplar effect", where too many exemplars increasingly become detrimental for task performance, with about 4-5 being the optimal amount. Neither result extends to chain-of-thought (CoT) prompting: we find the exemplar effect to be nullified, and our results suggest that CoT allows for better performance under ill-conditioned problems. We hope that the work reported contributes to the improvement of argumentative reasoning in LLMs.
SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition
results: Experiments on two multilingual datasets, Common Voice and ML-SUPERB, show that SSHR achieves state-of-the-art performance. Abstract
Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) has demonstrated its effectiveness in multilingual ASR, it is worth noting that the various layers' representations of SSL potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune multilingual ASR. We first analyze the different layers of the SSL model for language-related and content-related information, uncovering layers that show a stronger correlation. Then, we extract a language-related frame from correlated middle layers and guide specific content extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance to the best of our knowledge.
Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning
results: Baseline results show that, through continual learning, models can adapt to problematic content detection on social media and track its evolving manifestations over time. Abstract
Detecting problematic content, such as hate speech, is a multifaceted and ever-changing task, influenced by social dynamics, user populations, diversity of sources, and evolving language. There has been significant efforts, both in academia and in industry, to develop annotated resources that capture various aspects of problematic content. Due to researchers' diverse objectives, the annotations are inconsistent and hence, reports of progress on detection of problematic content are fragmented. This pattern is expected to persist unless we consolidate resources considering the dynamic nature of the problem. We propose integrating the available resources, and leveraging their dynamic nature to break this pattern. In this paper, we introduce a continual learning benchmark and framework for problematic content detection comprising over 84 related tasks encompassing 15 annotation schemas from 8 sources. Our benchmark creates a novel measure of progress: prioritizing the adaptability of classifiers to evolving tasks over excelling in specific tasks. To ensure the continuous relevance of our framework, we designed it so that new tasks can easily be integrated into the benchmark. Our baseline results demonstrate the potential of continual learning in capturing the evolving content and adapting to novel manifestations of problematic content.
results: Experimental results show that the neural preconditioner efficiently solves Poisson problems with mixed boundary conditions, and on several practical cases outperforms algebraic multigrid as well as some other neural preconditioners. Abstract
We introduce a neural-preconditioned iterative solver for Poisson equations with mixed boundary conditions. The Poisson equation is ubiquitous in scientific computing: it governs a wide array of physical phenomena, arises as a subproblem in many numerical algorithms, and serves as a model problem for the broader class of elliptic PDEs. The most popular Poisson discretizations yield large sparse linear systems. At high resolution, and for performance-critical applications, iterative solvers can be advantageous for these -- but only when paired with powerful preconditioners. The core of our solver is a neural network trained to approximate the inverse of a discrete structured-grid Laplace operator for a domain of arbitrary shape and with mixed boundary conditions. The structure of this problem motivates a novel network architecture that we demonstrate is highly effective as a preconditioner even for boundary conditions outside the training set. We show that on challenging test cases arising from an incompressible fluid simulation, our method outperforms state-of-the-art solvers like algebraic multigrid as well as some recent neural preconditioners.
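To make concrete where a learned preconditioner enters, here is a standard preconditioned conjugate gradient loop with the preconditioner application left as a pluggable callable; a Jacobi (diagonal) preconditioner and a 1-D Poisson system stand in for the trained network and the paper's structured-grid Laplacian, both assumptions for the sketch.

```python
# Preconditioned conjugate gradient (PCG) for SPD systems Ax = b.
# `apply_Minv` would wrap the trained network approximating A^{-1}.
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-8, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)                 # preconditioner application
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

if __name__ == "__main__":
    n = 50                            # 1-D Poisson stencil as a toy SPD system
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    jacobi = lambda r: r / np.diag(A) # stand-in for the neural M^{-1}
    x = pcg(A, b, jacobi)
    print(np.linalg.norm(A @ x - b))  # ~0
```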
Tight Bounds for Volumetric Spanners and Applications
results: The paper gives tight bounds for all $\ell_p$ norms and shows that they can be achieved using a local search algorithm. It also applies these results to other tasks, including finding coresets for the Minimum Volume Enclosing Ellipsoid (MVEE) problem. Abstract
Given a set of points of interest, a volumetric spanner is a subset of the points using which all the points can be expressed using "small" coefficients (measured in an appropriate norm). Formally, given a set of vectors $X = \{v_1, v_2, \dots, v_n\}$, the goal is to find $T \subseteq [n]$ such that every $v \in X$ can be expressed as $\sum_{i\in T} \alpha_i v_i$, with $\|\alpha\|$ being small. This notion, which has also been referred to as a well-conditioned basis, has found several applications, including bandit linear optimization, determinant maximization, and matrix low rank approximation. In this paper, we give almost optimal bounds on the size of volumetric spanners for all $\ell_p$ norms, and show that they can be constructed using a simple local search procedure. We then show the applications of our result to other tasks and in particular the problem of finding coresets for the Minimum Volume Enclosing Ellipsoid (MVEE) problem.
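A hedged sketch of the simple local-search idea the abstract mentions, assuming the subset size equals the dimension and using $|\det|$ as the volume objective: swap a vector into the subset whenever it strictly increases the volume, then read off the expansion coefficients against the chosen basis. This illustrates the search loop only, not the paper's exact procedure or bounds.

```python
# Local search for a volume-maximizing subset of d vectors out of n,
# a common surrogate for well-conditioned spanning subsets.
import numpy as np

def local_search_spanner(V, rng=None):
    """V: (n, d) rows are the input vectors; returns indices of a size-d subset."""
    n, d = V.shape
    rng = np.random.default_rng(rng)
    T = list(rng.choice(n, size=d, replace=False))
    vol = abs(np.linalg.det(V[T]))
    improved = True
    while improved:
        improved = False
        for out_pos in range(d):
            for j in range(n):
                if j in T:
                    continue
                cand = T.copy(); cand[out_pos] = j
                v = abs(np.linalg.det(V[cand]))
                if v > vol * (1 + 1e-9):   # strict improvement => termination
                    T, vol, improved = cand, v, True
    return T

if __name__ == "__main__":
    V = np.random.default_rng(1).standard_normal((30, 4))
    T = local_search_spanner(V, rng=0)
    coeffs = V @ np.linalg.inv(V[T])       # each row expressed in the chosen basis
    print(T, np.abs(coeffs).max())         # coefficients stay modest
```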
ADMET property prediction through combinations of molecular fingerprints
results: A Gradient Boosting Decision Tree (CatBoost) combined with ECFP, Avalon, and ErG fingerprints plus 200 molecular properties performed best, and the model was successfully validated on 22 Therapeutics Data Commons ADMET benchmarks.

Abstract
While investigating methods to predict small molecule potencies, we found random forests or support vector machines paired with extended-connectivity fingerprints (ECFP) consistently outperformed recently developed methods. A detailed investigation into regression algorithms and molecular fingerprints revealed gradient-boosted decision trees, particularly CatBoost, in conjunction with a combination of ECFP, Avalon, and ErG fingerprints, as well as 200 molecular properties, to be most effective. Incorporating a graph neural network fingerprint further enhanced performance. We successfully validated our model across 22 Therapeutics Data Commons ADMET benchmarks. Our findings underscore the significance of richer molecular representations for accurate property prediction.
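A minimal sketch of the winning feature combination, assuming RDKit and CatBoost are installed; `train_smiles` and `train_labels` are placeholder variables, the fingerprint sizes are illustrative, and the paper's additional 200 molecular properties (e.g., RDKit descriptors) and graph-network fingerprint are omitted for brevity.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdReducedGraphs
from rdkit.Avalon import pyAvalonTools
from catboost import CatBoostRegressor

def featurize(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    ecfp = np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    avalon = np.array(pyAvalonTools.GetAvalonFP(mol, nBits=1024))
    erg = np.array(rdReducedGraphs.GetErGFingerprint(mol))  # 315-dim vector
    return np.concatenate([ecfp, avalon, erg])

# train_smiles / train_labels stand in for an ADMET dataset split
X = np.stack([featurize(s) for s in train_smiles])
model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=0)
model.fit(X, train_labels)
```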
One for All: Towards Training One Graph Model for All Classification Tasks
results: By training on graph data from multiple domains simultaneously, the paper demonstrates the model's generality across tasks; it performs well in supervised, few-shot, and zero-shot learning scenarios.

Abstract
Designing a single model that addresses multiple tasks has been a long-standing objective in artificial intelligence. Recently, large language models have demonstrated exceptional capability in integrating and solving different tasks within the language domain. However, a unified model for various tasks on graphs remains underexplored, primarily due to the challenges unique to the graph learning domain. First, graph data from different areas carry distinct attributes and follow different distributions. Such discrepancy makes it hard to represent graphs in a single representation space. Second, tasks on graphs diversify into node, link, and graph tasks, requiring distinct embedding strategies. Finally, an appropriate graph prompting paradigm for in-context learning is unclear. Striving to handle all the aforementioned challenges, we propose One for All (OFA), the first general framework that can use a single graph model to address the above challenges. Specifically, OFA proposes text-attributed graphs to unify different graph data by describing nodes and edges with natural language and uses language models to encode the diverse and possibly cross-domain text attributes to feature vectors in the same embedding space. Furthermore, OFA introduces the concept of nodes-of-interest to standardize different tasks with a single task representation. For in-context learning on graphs, OFA introduces a novel graph prompting paradigm that appends prompting substructures to the input graph, which enables it to address varied tasks without fine-tuning. We train the OFA model using graph data from multiple domains (including citation networks, molecular graphs, knowledge graphs, etc.) simultaneously and evaluate its ability in supervised, few-shot, and zero-shot learning scenarios. OFA performs well across different tasks, making it the first general-purpose graph classification model across domains.
On the Disconnect Between Theory and Practice of Overparametrized Neural Networks
results: The study finds that, in optimization, uncertainty quantification, and continual learning, large-width networks do not exhibit the behavior predicted by infinite-width theory, even though the theory presumes widths orders of magnitude larger than the depth. This observed disconnect calls the link between theory and practice into question.

Abstract
The infinite-width limit of neural networks (NNs) has garnered significant attention as a theoretical framework for analyzing the behavior of large-scale, overparametrized networks. By approaching infinite width, NNs effectively converge to a linear model with features characterized by the neural tangent kernel (NTK). This establishes a connection between NNs and kernel methods, the latter of which are well understood. Based on this link, theoretical benefits and algorithmic improvements have been hypothesized and empirically demonstrated in synthetic architectures. These advantages include faster optimization, reliable uncertainty quantification and improved continual learning. However, current results quantifying the rate of convergence to the kernel regime suggest that exploiting these benefits requires architectures that are orders of magnitude wider than they are deep. This assumption raises concerns that practically relevant architectures do not exhibit behavior as predicted via the NTK. In this work, we empirically investigate whether the limiting regime either describes the behavior of large-width architectures used in practice or is informative for algorithmic improvements. Our empirical results demonstrate that this is not the case in optimization, uncertainty quantification or continual learning. This observed disconnect between theory and practice calls into question the practical relevance of the infinite-width limit.
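For reference, the linearization underlying the kernel regime: near initialization $\theta_0$, the network is approximated by its first-order Taylor expansion in the parameters, whose feature map is the parameter gradient, giving the neural tangent kernel:

$$f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0), \qquad \Theta(x, x') \;=\; \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0).$$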
Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs
paper_authors: Jean Kossaifi, Nikola Kovachki, Kamyar Azizzadenesheli, Anima Anandkumar
for: addresses the limitations of learning solution operators of partial differential equations (PDEs) at high resolutions by introducing a new data efficient and highly parallelizable approach with reduced memory requirement and better generalization.
methods: leverages local and global structures of full-scale, real-world phenomena through a decomposition of both the input domain and the operator’s parameter space, and represents the parameters of the model in a high-order latent subspace of the Fourier domain through a global tensor factorization.
results: achieves superior performance on the turbulent Navier-Stokes equations with less than half the error and over 150x compression, and reduces the number of parameters by over 150x and the domain size by 7x without losses in accuracy, while slightly enabling parallelism.

Abstract
Memory complexity and data scarcity have so far prohibited learning solution operators of partial differential equations (PDEs) at high resolutions. We address these limitations by introducing a new data efficient and highly parallelizable operator learning approach with reduced memory requirement and better generalization, called multi-grid tensorized neural operator (MG-TFNO). MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena, through a decomposition of both the input domain and the operator's parameter space. Our contributions are threefold: i) we enable parallelization over input samples with a novel multi-grid-based domain decomposition, ii) we represent the parameters of the model in a high-order latent subspace of the Fourier domain, through a global tensor factorization, resulting in an extreme reduction in the number of parameters and improved generalization, and iii) we propose architectural improvements to the backbone FNO. Our approach can be used in any operator learning setting. We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression. The tensorization combined with the domain decomposition, yields over 150x reduction in the number of parameters and 7x reduction in the domain size without losses in accuracy, while slightly enabling parallelism.
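To illustrate the kind of global tensor factorization involved, here is a sketch using TensorLy's Tucker decomposition on a hypothetical dense Fourier-domain weight tensor of a spectral convolution layer; the shapes and ranks are made up, and the real MG-TFNO parametrizes the operator directly in factorized form (jointly across layers) rather than factorizing a trained tensor after the fact.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Hypothetical dense weight tensor of one spectral convolution:
# (in_channels, out_channels, kept_modes_x, kept_modes_y)
weights = tl.tensor(np.random.randn(64, 64, 32, 32))

core, factors = tucker(weights, rank=[16, 16, 8, 8])  # low-rank factorization
n_params = core.size + sum(f.size for f in factors)
print(f"compression: {weights.size / n_params:.1f}x")
```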
Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks
paper_authors: Yanqiao Zhu, Jeehyun Hwang, Keir Adams, Zhen Liu, Bozhao Nan, Brock Stenfors, Yuanqi Du, Jatin Chauhan, Olaf Wiest, Olexandr Isayev, Connor W. Coley, Yizhou Sun, Wei Wang
for: Studies the potential of molecular representation learning (MRL) for chemical applications.
methods: Uses Graph Neural Networks (GNNs) and ensemble learning over sets of conformer structures to learn molecular representations.
results: Shows that learning directly from an accessible conformer space can improve performance across a variety of tasks and models.

Abstract
Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.
Reinforcement Learning for Node Selection in Branch-and-Bound
paper_authors: Alexander Mattick, Christopher Mutschler
for: Improving the optimization performance of branch-and-bound algorithms.
methods: Uses a reinforcement learning (RL) model that selects nodes based on the state of the entire search tree.
results: Induces a high-quality node selection policy on varied and complex problem sets, improving optimality gaps and per-node efficiency under strict time constraints.

Abstract
A big challenge in branch and bound lies in identifying the optimal node within the search tree from which to proceed. Current state-of-the-art selectors utilize either hand-crafted ensembles that automatically switch between naive sub-node selectors, or learned node selectors that rely on individual node data. We propose a novel bi-simulation technique that uses reinforcement learning (RL) while considering the entire tree state, rather than just isolated nodes. To achieve this, we train a graph neural network that produces a probability distribution based on the path from the model's root to its ``to-be-selected'' leaves. Modelling node-selection as a probability distribution allows us to train the model using state-of-the-art RL techniques that capture both intrinsic node-quality and node-evaluation costs. Our method induces a high quality node selection policy on a set of varied and complex problem sets, despite only being trained on specially designed, synthetic TSP instances. Experiments on several benchmarks show significant improvements in optimality gap reductions and per-node efficiency under strict time constraints.
Gradient and Uncertainty Enhanced Sequential Sampling for Global Fit
paper_authors: Sven Lämmle, Can Bogoclu, Kevin Cremanns, Dirk Roos
for: Proposes a new sequential sampling strategy to improve the accuracy and efficiency of global surrogate fits.
methods: Builds a machine-learning-based sampling strategy named Gradient and Uncertainty Enhanced Sequential Sampling (GUESS), whose acquisition function combines two terms: the predictive posterior uncertainty and a weighted approximation of second- and higher-order Taylor expansion values.
results: Across 26 deterministic benchmark functions in 1 to 8 dimensions, GUESS achieved on average the highest sample efficiency for global fitting; an ablation study examines its behavior in higher dimensions and the importance of the surrogate choice.

Abstract
Surrogate models based on machine learning methods have become an important part of modern engineering to replace costly computer simulations. The data used for creating a surrogate model are essential for the model accuracy and often restricted due to cost and time constraints. Adaptive sampling strategies have been shown to reduce the number of samples needed to create an accurate model. This paper proposes a new sampling strategy for global fit called Gradient and Uncertainty Enhanced Sequential Sampling (GUESS). The acquisition function uses two terms: the predictive posterior uncertainty of the surrogate model for exploration of unseen regions and a weighted approximation of the second and higher-order Taylor expansion values for exploitation. Although various sampling strategies have been proposed so far, the selection of a suitable method is not trivial. Therefore, we compared our proposed strategy to 9 adaptive sampling strategies for global surrogate modeling, based on 26 different 1 to 8-dimensional deterministic benchmarks functions. Results show that GUESS achieved on average the highest sample efficiency compared to other surrogate-based strategies on the tested examples. An ablation study considering the behavior of GUESS in higher dimensions and the importance of surrogate choice is also presented.
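An illustrative two-term score in the spirit of GUESS, using a scikit-learn GP: posterior standard deviation for exploration plus the magnitude of a gradient of the posterior mean for exploitation. This sketch stops at first order and estimates the gradient by finite differences, whereas the paper weights second- and higher-order Taylor terms; the weight `w` and step `h` are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def acquisition(gp: GaussianProcessRegressor, x, w=0.5, h=1e-3):
    """Exploration (posterior std) + exploitation (first-order Taylor
    magnitude of the posterior mean, via central finite differences)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    _, std = gp.predict(x, return_std=True)
    grad = np.array([
        (gp.predict(x + h * e) - gp.predict(x - h * e)) / (2 * h)
        for e in np.eye(x.shape[1])
    ]).ravel()
    return float(std[0] + w * np.linalg.norm(grad))
```

The next sample is then the maximizer of this score over a candidate pool.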
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things
results: The results highlight both the opportunities and the challenges of FL for AIoT; FedAIoT is intended as a valuable resource for advancing FL in this domain.

Abstract
There is a significant relevance of federated learning (FL) in the realm of Artificial Intelligence of Things (AIoT). However, most existing FL works are not conducted on datasets collected from authentic IoT devices that capture unique modalities and inherent challenges of IoT data. In this work, we introduce FedAIoT, an FL benchmark for AIoT to fill this critical gap. FedAIoT includes eight datatsets collected from a wide range of IoT devices. These datasets cover unique IoT modalities and target representative applications of AIoT. FedAIoT also includes a unified end-to-end FL framework for AIoT that simplifies benchmarking the performance of the datasets. Our benchmark results shed light on the opportunities and challenges of FL for AIoT. We hope FedAIoT could serve as an invaluable resource to foster advancements in the important field of FL for AIoT. The repository of FedAIoT is maintained at https://github.com/AIoT-MLSys-Lab/FedAIoT.
methods: LaLiGAN learns a mapping from data to a latent space in which the symmetries become linear, and simultaneously discovers the symmetries in that latent space.
results: Experiments show that LaLiGAN captures intrinsic symmetries in high-dimensional observations, yielding a well-structured latent space that is useful for downstream tasks; for example, it improves performance on equation discovery and long-term forecasting.

Abstract
Equivariant neural networks require explicit knowledge of the symmetry group. Automatic symmetry discovery methods aim to relax this constraint and learn invariance and equivariance from data. However, existing symmetry discovery methods are limited to linear symmetries in their search space and cannot handle the complexity of symmetries in real-world, often high-dimensional data. We propose a novel generative model, Latent LieGAN (LaLiGAN), which can discover nonlinear symmetries from data. It learns a mapping from data to a latent space where the symmetries become linear and simultaneously discovers symmetries in the latent space. Theoretically, we show that our method can express any nonlinear symmetry under certain conditions. Experimentally, our method can capture the intrinsic symmetry in high-dimensional observations, which results in a well-structured latent space that is useful for other downstream tasks. We demonstrate the use cases for LaLiGAN in improving equation discovery and long-term forecasting for various dynamical systems.
Federated Learning with Differential Privacy for End-to-End Speech Recognition
results: The paper achieves user-level ($7.2$, $10^{-9}$)-DP and ($4.5$, $10^{-9}$)-DP guarantees while training models that are nearly optimal across different levels of data heterogeneity and population scales.

Abstract
While federated learning (FL) has recently emerged as a promising approach to train machine learning models, it is limited to only preliminary explorations in the domain of automatic speech recognition (ASR). Moreover, FL does not inherently guarantee user privacy and requires the use of differential privacy (DP) for robust privacy guarantees. However, we are not aware of prior work on applying DP to FL for ASR. In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. First, we extend the existing research on FL for ASR by exploring different aspects of recent $\textit{large end-to-end transformer models}$: architecture design, seed models, data heterogeneity, domain shift, and impact of cohort size. With a $\textit{practical}$ number of central aggregations we are able to train $\textbf{FL models}$ that are \textbf{nearly optimal} even with heterogeneous data, a seed model from another domain, or no pre-trained seed model. Second, we apply DP to FL for ASR, which is non-trivial since DP noise severely affects model training, especially for large transformer models, due to highly imbalanced gradients in the attention block. We counteract the adverse effect of DP noise by reviving per-layer clipping and explaining why its effect is more apparent in our case than in the prior work. Remarkably, we achieve user-level ($7.2$, $10^{-9}$)-$\textbf{DP}$ (resp. ($4.5$, $10^{-9}$)-$\textbf{DP}$) with a 1.3% (resp. 4.6%) absolute drop in the word error rate for extrapolation to high (resp. low) population scale for $\textbf{FL with DP in ASR}$.
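A sketch of the per-layer clipping mechanic the authors revive, assuming per-example gradients are already available (e.g., from Opacus or functorch, not shown); the per-layer thresholds in `clip_norms` and the noise multiplier are hypothetical values, and privacy accounting is omitted entirely.

```python
import torch

def dp_step_per_layer(model, per_example_grads, clip_norms, noise_mult,
                      lr, batch_size):
    """One DP-SGD step with per-layer clipping: each layer's per-example
    gradient is clipped to its own threshold C_l before aggregation, and
    Gaussian noise calibrated to C_l is added to that layer's sum."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            g = per_example_grads[name]              # shape (B, *p.shape)
            flat = g.reshape(g.shape[0], -1)
            norms = flat.norm(dim=1, keepdim=True)
            scale = (clip_norms[name] / (norms + 1e-12)).clamp(max=1.0)
            summed = (flat * scale).sum(dim=0)       # sum of clipped grads
            noise = noise_mult * clip_norms[name] * torch.randn_like(summed)
            p -= lr * (summed + noise).reshape(p.shape) / batch_size
```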
Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym
results: BO-based optimizers perform well when the evaluation budget is limited, but are often outperformed by algorithms from other families once the budget grows; some algorithms from the BBO community also perform surprisingly well on ML tasks.

Abstract
The growing ubiquity of machine learning (ML) has led it to enter various areas of computer science, including black-box optimization (BBO). Recent research is particularly concerned with Bayesian optimization (BO). BO-based algorithms are popular in the ML community, as they are used for hyperparameter optimization and more generally for algorithm configuration. However, their efficiency decreases as the dimensionality of the problem and the budget of evaluations increase. Meanwhile, derivative-free optimization methods have evolved independently in the optimization community. Therefore, we urge to understand whether cross-fertilization is possible between the two communities, ML and BBO, i.e., whether algorithms that are heavily used in ML also work well in BBO and vice versa. Comparative experiments often involve rather small benchmarks and show visible problems in the experimental setup, such as poor initialization of baselines, overfitting due to problem-specific setting of hyperparameters, and low statistical significance. With this paper, we update and extend a comparative study presented by Hutter et al. in 2013. We compare BBO tools for ML with more classical heuristics, first on the well-known BBOB benchmark suite from the COCO environment and then on Direct Policy Search for OpenAI Gym, a reinforcement learning benchmark. Our results confirm that BO-based optimizers perform well on both benchmarks when budgets are limited, albeit with a higher computational cost, while they are often outperformed by algorithms from other families when the evaluation budget becomes larger. We also show that some algorithms from the BBO community perform surprisingly well on ML tasks.
EPiC-ly Fast Particle Cloud Generation with Flow-Matching and Diffusion
paper_authors: Erik Buhmann, Cedric Ewen, Darius A. Faroughy, Tobias Golling, Gregor Kasieczka, Matthew Leigh, Guillaume Quétant, John Andrew Raine, Debajyoti Sengupta, David Shih
for: This paper is written for researchers and practitioners in the field of particle physics and generative modeling. The authors aim to provide two novel methods for generating LHC jets as point clouds, which can be used for a variety of applications such as particle physics experiments and simulations.
methods: The paper introduces two novel methods for generating LHC jets as point clouds: \epcjedi and \epcfm. \epcjedi combines score-matching diffusion models with the Equivariant Point Cloud (EPiC) architecture based on the deep sets framework, while \epcfm is the first permutation equivariant continuous normalizing flow (CNF) for particle cloud generation. Both methods are trained using the flow-matching objective, which is a scalable and easy-to-train objective based on optimal transport.
results: The authors demonstrate that both \epcjedi and \epcfm achieve state-of-the-art performance on the top-quark JetNet datasets while maintaining fast generation speed. Specifically, \epcfm consistently outperforms all the other generative models considered in the paper across every metric. Additionally, the authors introduce two new particle cloud performance metrics: one based on the Kullback-Leibler divergence between feature distributions, and the other is the negative log-posterior of a multi-model ParticleNet classifier.

Abstract
Jets at the LHC, typically consisting of a large number of highly correlated particles, are a fascinating laboratory for deep generative modeling. In this paper, we present two novel methods that generate LHC jets as point clouds efficiently and accurately. We introduce \epcjedi, which combines score-matching diffusion models with the Equivariant Point Cloud (EPiC) architecture based on the deep sets framework. This model offers a much faster alternative to previous transformer-based diffusion models without reducing the quality of the generated jets. In addition, we introduce \epcfm, the first permutation equivariant continuous normalizing flow (CNF) for particle cloud generation. This model is trained with {\it flow-matching}, a scalable and easy-to-train objective based on optimal transport that directly regresses the vector fields connecting the Gaussian noise prior to the data distribution. Our experiments demonstrate that \epcjedi and \epcfm both achieve state-of-the-art performance on the top-quark JetNet datasets whilst maintaining fast generation speed. Most notably, we find that the \epcfm model consistently outperforms all the other generative models considered here across every metric. Finally, we also introduce two new particle cloud performance metrics: the first based on the Kullback-Leibler divergence between feature distributions, the second is the negative log-posterior of a multi-model ParticleNet classifier.
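The flow-matching objective mentioned above reduces to a simple regression. The sketch below uses the straight-line path from Gaussian noise to data, with `v_theta` standing in for the EPiC network (any callable taking the noisy sample and the time); the interface is an assumption, not the paper's code.

```python
import torch

def flow_matching_loss(v_theta, x1):
    """Conditional flow matching with a Gaussian prior: sample noise x0 and
    time t, build the linear interpolant x_t, and regress the model onto
    the constant conditional velocity x1 - x0 (optimal-transport path)."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return ((v_theta(xt, t.flatten()) - target) ** 2).mean()
```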
Machine Learning Clifford invariants of ADE Coxeter elements
results: The paper finds that these Clifford algebraic datasets can be machine learned to very high accuracy, and it sheds light on relationships between these novel geometric invariants and other well-known ones.

Abstract
There has been recent interest in novel Clifford geometric invariants of linear transformations. This motivates the investigation of such invariants for a certain type of geometric transformation of interest in the context of root systems, reflection groups, Lie groups and Lie algebras: the Coxeter transformations. We perform exhaustive calculations of all Coxeter transformations for $A_8$, $D_8$ and $E_8$ for a choice of basis of simple roots and compute their invariants, using high-performance computing. This computational algebra paradigm generates a dataset that can then be mined using techniques from data science such as supervised and unsupervised machine learning. In this paper we focus on neural network classification and principal component analysis. Since the output -- the invariants -- is fully determined by the choice of simple roots and the permutation order of the corresponding reflections in the Coxeter element, we expect huge degeneracy in the mapping. This provides the perfect setup for machine learning, and indeed we see that the datasets can be machine learned to very high accuracy. This paper is a pump-priming study in experimental mathematics using Clifford algebras, showing that such Clifford algebraic datasets are amenable to machine learning, and shedding light on relationships between these novel and other well-known geometric invariants and also giving rise to analytic results.
Networked Inequality: Preferential Attachment Bias in Graph Neural Network Link Prediction
results: The study finds that a simple training-time regularization reduces within-group unfairness in GCN link prediction, and it proposes a new within-group fairness metric that quantifies disparities in link prediction scores between social groups.

Abstract
Graph neural network (GNN) link prediction is increasingly deployed in citation, collaboration, and online social networks to recommend academic literature, collaborators, and friends. While prior research has investigated the dyadic fairness of GNN link prediction, the within-group fairness and ``rich get richer'' dynamics of link prediction remain underexplored. However, these aspects have significant consequences for degree and power imbalances in networks. In this paper, we shed light on how degree bias in networks affects Graph Convolutional Network (GCN) link prediction. In particular, we theoretically uncover that GCNs with a symmetric normalized graph filter have a within-group preferential attachment bias. We validate our theoretical analysis on real-world citation, collaboration, and online social networks. We further bridge GCN's preferential attachment bias with unfairness in link prediction and propose a new within-group fairness metric. This metric quantifies disparities in link prediction scores between social groups, towards combating the amplification of degree and power disparities. Finally, we propose a simple training-time strategy to alleviate within-group unfairness, and we show that it is effective on citation, online social, and credit networks.
Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform
results: On Atari games, the Cleanba variants match or exceed strong IMPALA and PPO baselines while exhibiting shorter training times and more reproducible learning curves across hardware settings.

Abstract
Distributed Deep Reinforcement Learning (DRL) aims to leverage more computational resources to train autonomous agents with less training time. Despite recent progress in the field, reproducibility issues have not been sufficiently explored. This paper first shows that the typical actor-learner framework can have reproducibility issues even if hyperparameters are controlled. We then introduce Cleanba, a new open-source platform for distributed DRL that proposes a highly reproducible architecture. Cleanba implements highly optimized distributed variants of PPO and IMPALA. Our Atari experiments show that these variants can obtain equivalent or higher scores than strong IMPALA baselines in moolib and torchbeast and PPO baseline in CleanRL. However, Cleanba variants present 1) shorter training time and 2) more reproducible learning curves in different hardware settings. Cleanba's source code is available at \url{https://github.com/vwxyzjn/cleanba}
Maximal Volume Matrix Cross Approximation for Image Compression and Least Squares Solution
methods: The paper gives a new proof of a classic estimate for matrix cross approximation, with an improved constant, together with a greedy algorithm for finding maximal-volume submatrices.
results: It proposes maximal-volume-based matrix approximation algorithms with theoretical convergence guarantees and presents applications including image compression and least squares approximation of continuous functions; the numerical results confirm their effectiveness.

Abstract
We study the classic cross approximation of matrices based on the maximal volume submatrices. Our main results consist of an improvement of a classic estimate for matrix cross approximation and a greedy approach for finding the maximal volume submatrices. Indeed, we present a new proof of a classic estimate of the inequality with an improved constant. Also, we present a family of greedy maximal volume algorithms which improve the error bound of cross approximation of a matrix in the Chebyshev norm and also improve the computational efficiency of classic maximal volume algorithm. The proposed algorithms are shown to have theoretical guarantees of convergence. Finally, we present two applications: one is image compression and the other is least squares approximation of continuous functions. Our numerical results in the end of the paper demonstrate the effective performances of our approach.
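The classic maxvol iteration at the core of such cross approximations can be stated in a few lines; this is the textbook version for a tall matrix, given as background, not the paper's improved greedy family.

```python
import numpy as np

def maxvol(A, tol=1.05, max_iter=100):
    """Find row indices of an n x r matrix A whose r x r submatrix is
    dominant: every row of A expressed in that basis has coefficients of
    magnitude <= tol. Swapping in the row with the largest coefficient
    strictly increases |det| of the chosen submatrix."""
    n, r = A.shape
    idx = np.arange(r)                 # naive start; pivoted LU is safer
    for _ in range(max_iter):
        B = A @ np.linalg.inv(A[idx])  # coefficients in the current basis
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if abs(B[i, j]) <= tol:
            break
        idx[j] = i                     # swap row j of the basis for row i
    return idx

# Cross approximation from selected rows I and columns J:
# A ≈ A[:, J] @ np.linalg.inv(A[np.ix_(I, J)]) @ A[I, :]
```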
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
results: On the LRS3 dataset, the method obtains significant improvements in Visual Speech Recognition (VSR) performance while maintaining practical ASR and AVSR performance; it can also leverage unlabeled visual-only speech data to further improve VSR.

Abstract
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the LRS3 dataset while maintaining practical ASR and AVSR performance. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.
results: Experiments show that the TCA module performs comparably to cross attention across various classification and uncertainty regression tasks while being significantly more token-efficient, with clear gains over Perceiver IO at the same number of inference tokens.

Abstract
Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of $\mathcal{O}(N)$ tokens. In practice, however, often only a small subset of tokens are required for good performance. Methods such as Perceiver IO are cheap at inference as they distill the information to a smaller-sized set of latent tokens $L < N$ on which cross attention is then applied, resulting in only $\mathcal{O}(L)$ complexity. However, in practice, as the number of input tokens and the amount of information to distill increases, the number of latent tokens needed also increases significantly. In this work, we propose Tree Cross Attention (TCA) - a module based on Cross Attention that only retrieves information from a logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference. TCA organizes the data in a tree structure and performs a tree search at inference time to retrieve the relevant tokens for prediction. Leveraging TCA, we introduce ReTreever, a flexible architecture for token-efficient inference. We show empirically that Tree Cross Attention (TCA) performs comparable to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient. Furthermore, we compare ReTreever against Perceiver IO, showing significant gains while using the same number of tokens for inference.
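To make the $\mathcal{O}(\log(N))$ retrieval concrete, here is a toy version: organize the tokens in a balanced binary tree with mean-pooled node summaries, descend greedily by query-key score, and keep the off-path sibling summaries, yielding a logarithmic-size set to attend over. The real TCA learns the node aggregation and trains the search; everything below (mean pooling, dot-product routing) is a simplification.

```python
import numpy as np

def build_tree(tokens):
    """Bottom-up balanced tree; nodes are (summary, left, right)."""
    level = [(t, None, None) for t in tokens]
    while len(level) > 1:
        nxt = [((level[k][0] + level[k + 1][0]) / 2, level[k], level[k + 1])
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # odd node carried up unchanged
        level = nxt
    return level[0]

def retrieve(root, q):
    """Greedy root-to-leaf descent keeping sibling summaries."""
    kept, node = [], root
    while node[1] is not None:
        left, right = node[1], node[2]
        if q @ left[0] >= q @ right[0]:
            kept.append(right[0]); node = left
        else:
            kept.append(left[0]); node = right
    kept.append(node[0])               # the selected leaf token
    return np.stack(kept)              # attend over these instead of all N
```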
Parallel Computation of Multi-Slice Clustering of Third-Order Tensors
methods: The MSC algorithm is implemented on a distributed-memory system: it performs spectral analysis of the tensor slices and processes each tensor mode independently in parallel.
results: The parallel scheme outperforms sequential computation and allows the MSC method to scale.

Abstract
Machine Learning approaches like clustering methods deal with massive datasets that present an increasing challenge. We devise parallel algorithms to compute the Multi-Slice Clustering (MSC) for 3rd-order tensors. The MSC method is based on spectral analysis of the tensor slices and works independently on each tensor mode. Such features fit well in the parallel paradigm via a distributed memory system. We show that our parallel scheme outperforms sequential computing and allows for the scalability of the MSC method.
Adversarial Imitation Learning from Visual Observations using Latent Information
results: On high-dimensional continuous robotic tasks, the algorithm matches state-of-the-art performance while providing significant computational advantages; the method can also improve the efficiency of reinforcement learning from pixels by leveraging expert videos.

Abstract
We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our algorithm matches state-of-the-art performance while providing significant computational advantages. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to our code.
Graph-based Neural Weather Prediction for Limited Area Modeling
paper_authors: Joel Oskarsson, Tomas Landelius, Fredrik Lindsten
for: Applying neural weather prediction methods to limited-area forecasting.
methods: Uses a graph-based neural weather prediction model and proposes a multi-scale hierarchical model extension.
results: Experiments with a local model for the Nordic region show that the approach is effective for limited-area forecasting and delivers high-resolution predictions.

Abstract
The rise of accurate machine learning methods for weather forecasting is creating radical new possibilities for modeling the atmosphere. In the time of climate change, having access to high-resolution forecasts from models like these is also becoming increasingly vital. While most existing Neural Weather Prediction (NeurWP) methods focus on global forecasting, an important question is how these techniques can be applied to limited area modeling. In this work we adapt the graph-based NeurWP approach to the limited area setting and propose a multi-scale hierarchical model extension. Our approach is validated by experiments with a local model for the Nordic region.
Module-wise Training of Neural Networks via the Minimizing Movement Scheme
results: Experiments show that adding the proposed regularization improves the accuracy of module-wise training across architectures such as ResNets, Transformers, and VGG, outperforming other module-wise training methods and often end-to-end training, with up to 60% less memory usage.

Abstract
Greedy layer-wise or module-wise training of neural networks is compelling in constrained and on-device settings where memory is limited, as it circumvents a number of problems of end-to-end back-propagation. However, it suffers from a stagnation problem, whereby early layers overfit and deeper layers stop increasing the test accuracy after a certain depth. We propose to solve this issue by introducing a module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space. We call the method TRGL for Transport Regularized Greedy Learning and study it theoretically, proving that it leads to greedy modules that are regular and that progressively solve the task. Experimentally, we show improved accuracy of module-wise training of various architectures such as ResNets, Transformers and VGG, when our regularization is added, superior to that of other module-wise training methods and often to end-to-end training, with as much as 60% less memory usage.
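A minimal sketch of the idea: each module is trained greedily with its own auxiliary head while earlier modules stay frozen, and a transport-style penalty keeps each module a small "movement" of its input. The penalty below is a plain squared distance under the assumption of shape-preserving modules (as in a ResNet stage); the paper derives its regularizer from the minimizing movement scheme for gradient flows, which this does not reproduce exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_module_wise(modules, heads, loader, lam=0.1, lr=1e-3, epochs=10):
    """Greedy module-wise training with a transport penalty (sketch)."""
    frozen = nn.Identity()
    for module, head in zip(modules, heads):
        opt = torch.optim.Adam(
            list(module.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():
                    h = frozen(x)          # features from earlier modules
                out = module(h)
                loss = (F.cross_entropy(head(out), y)
                        + lam * ((out - h) ** 2).mean())  # movement penalty
                opt.zero_grad(); loss.backward(); opt.step()
        frozen = nn.Sequential(frozen, module).eval()  # freeze trained module
    return frozen
```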
Efficient Biologically Plausible Adversarial Training
results: The results show that the biologically plausible learning algorithm PEPITA has higher intrinsic adversarial robustness and, with adversarial training, a more favorable natural-vs-adversarial performance trade-off on various computer vision tasks.

Abstract
Artificial Neural Networks (ANNs) trained with Backpropagation (BP) show astounding performance and are increasingly often used in performing our daily life tasks. However, ANNs are highly vulnerable to adversarial attacks, which alter inputs with small targeted perturbations that drastically disrupt the models' performance. The most effective method to make ANNs robust against these attacks is adversarial training, in which the training dataset is augmented with exemplary adversarial samples. Unfortunately, this approach has the drawback of increased training complexity since generating adversarial samples is very computationally demanding. In contrast to ANNs, humans are not susceptible to adversarial attacks. Therefore, in this work, we investigate whether biologically-plausible learning algorithms are more robust against adversarial attacks than BP. In particular, we present an extensive comparative analysis of the adversarial robustness of BP and Present the Error to Perturb the Input To modulate Activity (PEPITA), a recently proposed biologically-plausible learning algorithm, on various computer vision tasks. We observe that PEPITA has higher intrinsic adversarial robustness and, with adversarial training, has a more favourable natural-vs-adversarial performance trade-off as, for the same natural accuracies, PEPITA's adversarial accuracies decrease in average by 0.26% and BP's by 8.05%.
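A minimal numpy sketch of the PEPITA-style update for a fully connected network: one clean forward pass, one pass on the error-perturbed input, and layer-local weight updates from the difference of activations. The fixed random projection `F`, ReLU everywhere (including the output layer), and the exact update form are simplifications of the published rule.

```python
import numpy as np

def pepita_update(Ws, F, x, y, lr=0.01):
    """One PEPITA-style step. Ws: list of weight matrices; F: fixed random
    projection of shape (input_dim, output_dim); x, y: one sample."""
    relu = lambda z: np.maximum(z, 0.0)
    hs = [x]                                # clean forward pass
    for W in Ws:
        hs.append(relu(W @ hs[-1]))
    e = hs[-1] - y                          # output error
    hm = [x + F @ e]                        # error-perturbed forward pass
    for W in Ws:
        hm.append(relu(W @ hm[-1]))
    for l, W in enumerate(Ws):              # layer-local updates, no backprop
        delta = e if l == len(Ws) - 1 else hs[l + 1] - hm[l + 1]
        W -= lr * np.outer(delta, hm[l])
    return Ws
```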
Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer
results: Evaluated on data from a real-world SaaS company, Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98; it detects changes in service metrics early and reduces the Mean Time To Detection (MTTD) of outages, demonstrating the efficacy of the proposed method.

Abstract
Cloud services are omnipresent and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussian is used to model the distribution of the QoS metrics for flexibility and an extreme event regularizer helps in improving learning in tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating efficacy of our proposed method.
Scaling Experiments in Self-Supervised Cross-Table Representation Learning
results: The pretrained models are evaluated with linear probing on a curated set of benchmark datasets and compared against conventional baselines to assess how the architecture scales to larger tabular data.

Abstract
To analyze the scaling potential of deep tabular representation learning models, we introduce a novel Transformer-based architecture specifically tailored to tabular data and cross-table representation learning by utilizing table-specific tokenizers and a shared Transformer backbone. Our training approach encompasses both single-table and cross-table models, trained via missing value imputation through a self-supervised masked cell recovery objective. To understand the scaling behavior of our method, we train models of varying sizes, ranging from approximately $10^4$ to $10^7$ parameters. These models are trained on a carefully curated pretraining dataset, consisting of 135M training tokens sourced from 76 diverse datasets. We assess the scaling of our architecture in both single-table and cross-table pretraining setups by evaluating the pretrained models using linear probing on a curated set of benchmark datasets and comparing the results with conventional baselines.
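The self-supervised objective is essentially masked-cell regression. A minimal sketch for purely numeric columns follows; the paper's models additionally handle categorical cells, language-model-tokenized attributes, and cross-table sharing, none of which appear here.

```python
import torch

def masked_cell_loss(model, x, mask_prob=0.15, mask_value=0.0):
    """Self-supervised masked cell recovery: hide a random subset of cells,
    ask the model to reconstruct the full row, and score only the masked
    entries. A cross-entropy term over categorical cells would be added
    analogously in a mixed-type table."""
    mask = torch.rand_like(x) < mask_prob
    corrupted = torch.where(mask, torch.full_like(x, mask_value), x)
    recon = model(corrupted)                 # same shape as x
    return ((recon - x)[mask] ** 2).mean()
```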
Robust Stochastic Optimization via Gradient Quantile Clipping
results: For strongly convex objectives, the iterates are shown to converge to a concentrated distribution, with high-probability bounds on the final estimation error; in the non-convex case, the limit distribution is localized on a neighborhood with low gradient. An implementation based on rolling quantiles yields a highly efficient optimization procedure with strong robustness properties, as confirmed by numerical experiments.

Abstract
We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.
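The strategy is easy to state in code: maintain a rolling window of recent gradient norms and clip at their q-th quantile. A sketch, with `grad_fn` any stochastic gradient oracle; the window size, quantile level, and constant step size are illustrative, and the paper's analysis conditions are omitted.

```python
from collections import deque
import numpy as np

def quantile_clipped_sgd(grad_fn, theta, lr=0.01, q=0.9,
                         window=256, steps=10_000):
    """SGD whose clipping threshold is a rolling quantile of recent
    gradient norms rather than a fixed constant."""
    norms = deque(maxlen=window)
    for _ in range(steps):
        g = grad_fn(theta)                   # stochastic gradient
        n = np.linalg.norm(g)
        norms.append(n)
        tau = np.quantile(norms, q)          # rolling quantile threshold
        if n > tau:
            g = g * (tau / n)                # clip heavy-tailed gradients
        theta = theta - lr * g
    return theta
```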
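The rolling-quantile variant lends itself to a compact sketch. The NumPy code below clips each stochastic gradient to a quantile of recent gradient norms; the quantile level q and window size are illustrative choices, not the paper's tuned values.

```python
import collections
import numpy as np

def quantile_clipped_sgd(grad_fn, w0, steps=2000, lr=0.01, q=0.9, window=200):
    """SGD whose clipping threshold is a rolling quantile of recent
    stochastic gradient norms."""
    w = np.array(w0, dtype=float)
    norms = collections.deque(maxlen=window)
    for _ in range(steps):
        g = grad_fn(w)                       # one stochastic gradient
        n = np.linalg.norm(g)
        norms.append(n)
        tau = np.quantile(norms, q)          # rolling quantile threshold
        if n > tau > 0:
            g = g * (tau / n)                # clip to the threshold
        w -= lr * g
    return w

# Heavy-tailed noise (Student-t, df=2, infinite variance) on a quadratic.
grad_fn = lambda w: 2 * (w - 3.0) + np.random.standard_t(2, size=w.shape)
print(quantile_clipped_sgd(grad_fn, np.zeros(3)))   # approaches [3, 3, 3]
```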
Leave-one-out Distinguishability in Machine Learning
methods: We use Gaussian processes to model the randomness of machine learning algorithms and demonstrate the usefulness of LOOD through a detailed empirical analysis of information leakage.
results: We find that LOOD quantifies data memorization and privacy risks and enables the analysis of influential training data points. In addition, optimized queries can be used to disclose the most significant information about the training data.Abstract
We introduce a new analytical framework to quantify the changes in a machine learning algorithm's output distribution following the inclusion of a few data points in its training set, a notion we define as leave-one-out distinguishability (LOOD). This problem is key to measuring data **memorization** and **information leakage** in machine learning, and the **influence** of training data points on model predictions. We illustrate how our method broadens and refines existing empirical measures of memorization and privacy risks associated with training data. We use Gaussian processes to model the randomness of machine learning algorithms, and validate LOOD with extensive empirical analysis of information leakage using membership inference attacks. Our theoretical framework enables us to investigate the causes of information leakage and where the leakage is high. For example, we analyze the influence of activation functions, on data memorization. Additionally, our method allows us to optimize queries that disclose the most significant information about the training data in the leave-one-out setting. We illustrate how optimal queries can be used for accurate **reconstruction** of training data.
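A minimal sketch of the core idea, assuming a scalar regression setting and sklearn's exact GP: fit a Gaussian process with and without one training point and measure the KL divergence between the two predictive distributions at a query point. The paper's modeling of trained networks and its exact choice of divergence may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gaussian_kl(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def lood_at_query(X, y, i, x_query):
    """Distinguishability of GP predictions at x_query when training
    with versus without data point i."""
    gp_in = GaussianProcessRegressor(kernel=RBF(), alpha=1e-4).fit(X, y)
    gp_out = GaussianProcessRegressor(kernel=RBF(), alpha=1e-4).fit(
        np.delete(X, i, axis=0), np.delete(y, i))
    m1, s1 = gp_in.predict(x_query, return_std=True)
    m2, s2 = gp_out.predict(x_query, return_std=True)
    return gaussian_kl(m1[0], s1[0], m2[0], s2[0])

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
print(lood_at_query(X, y, i=0, x_query=X[:1]))   # leakage about point 0
```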
Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation
methods: The study explores previously uncharted regions of the design space of E(3)-equivariant diffusion models. A comparative analysis evaluates the interplay between continuous and discrete state spaces. Out of this investigation, the EQGAT-diff model is introduced, which consistently outperforms established models on the QM9 and GEOM-Drugs datasets by a large margin.
results: Experiments show that EQGAT-diff outperforms previous models on the QM9 and GEOM-Drugs datasets by a large margin. Moreover, under limited training data, EQGAT-diff trained on PubChem3D transfers to target distributions with explicit hydrogens, and a few fine-tuning iterations further improve state-of-the-art performance across datasets.Abstract
Deep generative diffusion models are a promising avenue for de novo 3D molecular design in material science and drug discovery. However, their utility is still constrained by suboptimal performance with large molecular structures and limited training data. Addressing this gap, we explore the design space of E(3) equivariant diffusion models, focusing on previously blank spots. Our extensive comparative analysis evaluates the interplay between continuous and discrete state spaces. Out of this investigation, we introduce the EQGAT-diff model, which consistently surpasses the performance of established models on the QM9 and GEOM-Drugs datasets by a large margin. Distinctively, EQGAT-diff takes continuous atomic positions while chemical elements and bond types are categorical and employ a time-dependent loss weighting that significantly increases training convergence and the quality of generated samples. To further strengthen the applicability of diffusion models to limited training data, we examine the transferability of EQGAT-diff trained on the large PubChem3D dataset with implicit hydrogens to target distributions with explicit hydrogens. Fine-tuning EQGAT-diff for a couple of iterations further pushes state-of-the-art performance across datasets. We envision that our findings will find applications in structure-based drug design, where the accuracy of generative models for small datasets of complex molecules is critical.
Deep learning soliton dynamics and complex potentials recognition for 1D and 2D PT-symmetric saturable nonlinear Schrödinger equations
for: This paper extends physics-informed neural networks (PINNs) to learn data-driven stationary and non-stationary solitons of the 1D and 2D saturable nonlinear Schrödinger equation (SNLSE) with two fundamental PT-symmetric Scarf-II and periodic potentials in optical fibers.
methods: The methods include extending PINNs to learn data-driven solutions of the SNLSE and studying data-driven inverse problems for discovering PT-symmetric potential functions rather than just potential parameters. In particular, a modified PINNs (mPINNs) scheme is proposed to identify the 1D and 2D PT-symmetric potential functions directly from solution data.
results: The results show that deep neural networks achieve high-accuracy solutions of the 1D and 2D SNLSEs, and that two network structures compared under different parameter conditions reach similarly high accuracy. The paper also analyzes the main factors affecting network performance, including activation functions, network structures, and the size of the training data.Abstract
In this paper, we firstly extend the physics-informed neural networks (PINNs) to learn data-driven stationary and non-stationary solitons of 1D and 2D saturable nonlinear Schr\"odinger equations (SNLSEs) with two fundamental PT-symmetric Scarf-II and periodic potentials in optical fibers. Secondly, the data-driven inverse problems are studied for PT-symmetric potential functions discovery rather than just potential parameters in the 1D and 2D SNLSEs. Particularly, we propose a modified PINNs (mPINNs) scheme to identify directly the PT potential functions of the 1D and 2D SNLSEs by the solution data. And the inverse problems about 1D and 2D PT -symmetric potentials depending on propagation distance z are also investigated using mPINNs method. We also identify the potential functions by the PINNs applied to the stationary equation of the SNLSE. Furthermore, two network structures are compared under different parameter conditions such that the predicted PT potentials can achieve the similar high accuracy. These results illustrate that the established deep neural networks can be successfully used in 1D and 2D SNLSEs with high accuracies. Moreover, some main factors affecting neural networks performance are discussed in 1D and 2D PT Scarf-II and periodic potentials, including activation functions, structures of the networks, and sizes of the training data. In particular, twelve different nonlinear activation functions are in detail analyzed containing the periodic and non-periodic functions such that it is concluded that selecting activation functions according to the form of solution and equation usually can achieve better effect.
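For readers unfamiliar with the PINN machinery these papers build on, here is a minimal PyTorch sketch of the PDE-residual loss for the standard cubic NLSE $i\psi_t + \frac{1}{2}\psi_{xx} + |\psi|^2\psi = 0$, written with $\psi = u + iv$; the saturable nonlinearity and PT-symmetric potentials studied in the paper would add further terms to the residuals.

```python
import torch

# Network maps (x, t) -> (u, v), the real and imaginary parts of psi.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 2))

def grad(f, x):
    return torch.autograd.grad(f, x, torch.ones_like(f), create_graph=True)[0]

def residual_loss(xt):
    """Mean squared PDE residual of i*psi_t + 0.5*psi_xx + |psi|^2 psi = 0."""
    xt = xt.requires_grad_(True)
    u, v = net(xt).unbind(dim=1)
    du, dv = grad(u, xt), grad(v, xt)        # columns: d/dx, d/dt
    u_xx = grad(du[:, 0], xt)[:, 0]
    v_xx = grad(dv[:, 0], xt)[:, 0]
    sq = u**2 + v**2
    f_u = -dv[:, 1] + 0.5 * u_xx + sq * u    # real part of the residual
    f_v = du[:, 1] + 0.5 * v_xx + sq * v     # imaginary part
    return (f_u**2 + f_v**2).mean()

# Collocation points on x in [-5, 5], t in [0, 1]; a full PINN adds data
# and boundary terms and minimizes the sum with Adam or L-BFGS.
xt = torch.rand(256, 2) * torch.tensor([10.0, 1.0]) - torch.tensor([5.0, 0.0])
print(residual_loss(xt))
```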
In search of dispersed memories: Generative diffusion models are associative memory networks
results: The study finds that the storage capacity of a diffusion model is identical to that of a modern Hopfield network. These results establish a strong link between generative modeling and the theoretical neuroscience of memory, providing a powerful computational foundation for creative generation and memory recall.Abstract
Hopfield networks are widely used in neuroscience as simplified theoretical models of biological associative memory. The original Hopfield networks store memories by encoding patterns of binary associations, which result in a synaptic learning mechanism known as Hebbian learning rule. Modern Hopfield networks can achieve exponential capacity scaling by using highly non-linear energy functions. However, the energy function of these newer models cannot be straightforwardly compressed into binary synaptic couplings and it does not directly provide new synaptic learning rules. In this work we show that generative diffusion models can be interpreted as energy-based models and that, when trained on discrete patterns, their energy function is equivalent to that of modern Hopfield networks. This equivalence allows us to interpret the supervised training of diffusion models as a synaptic learning process that encodes the associative dynamics of a modern Hopfield network in the weight structure of a deep neural network. Accordingly, in our experiments we show that the storage capacity of a continuous modern Hopfield network is identical to the capacity of a diffusion model. Our results establish a strong link between generative modeling and the theoretical neuroscience of memory, which provide a powerful computational foundation for the reconstructive theory of memory, where creative generation and memory recall can be seen as parts of a unified continuum.
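The Hopfield side of the claimed equivalence can be made concrete. The sketch below implements the continuous modern Hopfield retrieval update $x \leftarrow X^\top \mathrm{softmax}(\beta X x)$ of Ramsauer et al.; the paper's contribution, showing that a diffusion model trained on discrete patterns realizes an equivalent energy, is not reproduced here.

```python
import numpy as np

def hopfield_retrieve(patterns, query, beta=8.0, steps=3):
    """Continuous modern Hopfield retrieval: iterate
    x <- X^T softmax(beta * X x); stored patterns are fixed points."""
    X = np.asarray(patterns, dtype=float)
    x = np.asarray(query, dtype=float)
    for _ in range(steps):
        logits = beta * X @ x
        p = np.exp(logits - logits.max())    # stable softmax
        x = X.T @ (p / p.sum())
    return x

X = np.random.randn(10, 32)                  # ten stored patterns
rec = hopfield_retrieve(X, X[3] + 0.3 * np.random.randn(32))
print(np.argmax(X @ rec))                    # recovers pattern index 3
```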
Toward Robust Recommendation via Real-time Vicinal Defense
results: Extensive experiments demonstrate that RVD effectively mitigates various targeted poisoning attacks without changing the model structure or training process, making it more practical.Abstract
Recommender systems have been shown to be vulnerable to poisoning attacks, where malicious data is injected into the dataset to cause the recommender system to provide biased recommendations. To defend against such attacks, various robust learning methods have been proposed. However, most methods are model-specific or attack-specific, making them lack generality, while other methods, such as adversarial training, are oriented towards evasion attacks and thus have a weak defense strength in poisoning attacks. In this paper, we propose a general method, Real-time Vicinal Defense (RVD), which leverages neighboring training data to fine-tune the model before making a recommendation for each user. RVD works in the inference phase to ensure the robustness of the specific sample in real-time, so there is no need to change the model structure and training process, making it more practical. Extensive experimental results demonstrate that RVD effectively mitigates targeted poisoning attacks across various models without sacrificing accuracy. Moreover, the defensive effect can be further amplified when our method is combined with other strategies.
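A minimal sketch of the inference-time idea, using a toy matrix-factorization scorer; the neighbor selection, loss, and step counts are illustrative assumptions rather than the paper's exact procedure.

```python
import copy
import torch

class MF(torch.nn.Module):
    """Tiny matrix-factorization scorer, for illustration only."""
    def __init__(self, n_users, n_items, d=16):
        super().__init__()
        self.num_items = n_items
        self.U = torch.nn.Embedding(n_users, d)
        self.V = torch.nn.Embedding(n_items, d)
    def forward(self, user, items):
        return (self.U(torch.as_tensor(user)) * self.V(items)).sum(-1)

def recommend_with_rvd(model, target_user, neighbor_ids, interactions,
                       lr=1e-2, steps=5):
    """Before serving one user, briefly fine-tune a throwaway copy of the
    model on the neighbors' observed interactions; the global model and
    training pipeline are left untouched."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        for u in neighbor_ids:
            items, ratings = interactions[u]   # LongTensor, FloatTensor
            loss = torch.nn.functional.mse_loss(local(u, items), ratings)
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                      # score all items in real time
        return local(target_user, torch.arange(model.num_items))
```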
Utility-based Adaptive Teaching Strategies using Bayesian Theory of Mind
paper_authors: Clémence Grislain, Hugo Caselles-Dupré, Olivier Sigaud, Mohamed Chetouani
for: This paper aims to build teacher agents equipped with a Bayesian Theory of Mind (ToM) that, like human teachers, adapt to the learner's internal state and select the most useful demonstrations for them.
methods: The paper uses Bayesian ToM mechanisms to build models of the learner's internal state from observations of their behaviour, then selects the demonstrations that maximise the learner's rewards while minimising teaching costs.
results: Experiments show that learners taught with this ToM-based strategy learn faster and perform better, and the effect strengthens when the teacher's ToM model aligns more closely with the actual learner's state.Abstract
Good teachers always tailor their explanations to the learners. Cognitive scientists model this process under the rationality principle: teachers try to maximise the learner's utility while minimising teaching costs. To this end, human teachers seem to build mental models of the learner's internal state, a capacity known as Theory of Mind (ToM). Inspired by cognitive science, we build on Bayesian ToM mechanisms to design teacher agents that, like humans, tailor their teaching strategies to the learners. Our ToM-equipped teachers construct models of learners' internal states from observations and leverage them to select demonstrations that maximise the learners' rewards while minimising teaching costs. Our experiments in simulated environments demonstrate that learners taught this way are more efficient than those taught in a learner-agnostic way. This effect gets stronger when the teacher's model of the learner better aligns with the actual learner's state, either using a more accurate prior or after accumulating observations of the learner's behaviour. This work is a first step towards social machines that teach us and each other, see https://teacher-with-tom.github.io.
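A compact sketch of the utility-based selection loop, with the learner-type space, likelihood, utility, and cost functions left as illustrative parameters; the belief update is the Bayesian ToM step, and the paper's machinery is considerably richer.

```python
import numpy as np

def select_demo(demos, learner_types, prior, utility, cost):
    """Pick the demonstration maximizing expected learner utility minus
    teaching cost, under the teacher's belief over learner types."""
    belief = np.asarray(prior, dtype=float)
    scores = [sum(b * utility(d, t) for b, t in zip(belief, learner_types))
              - cost(d) for d in demos]
    return demos[int(np.argmax(scores))]

def update_belief(belief, observation, learner_types, likelihood):
    """Bayesian ToM step: posterior over learner types given behaviour."""
    post = np.array([b * likelihood(observation, t)
                     for b, t in zip(belief, learner_types)])
    return post / post.sum()

types = ["novice", "expert"]
util = lambda d, t: {("basics", "novice"): 1.0, ("basics", "expert"): 0.2,
                     ("trick", "novice"): 0.1, ("trick", "expert"): 0.9}[(d, t)]
print(select_demo(["basics", "trick"], types, [0.7, 0.3], util, lambda d: 0.05))
```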
Estimation and Inference in Distributional Reinforcement Learning
results: The paper proves that with a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2p}(1-\gamma)^{2p+2}}\right)$, the $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is guaranteed to be less than $\epsilon$ with high probability. Additionally, the paper shows that a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure the Kolmogorov metric and total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability. Finally, the paper demonstrates that the empirical process $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in certain function spaces.Abstract
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete distribution of the random return (denoted $\eta^\pi$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat\eta^\pi$, given a generative model is available. We show that in this circumstance we need a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2p}(1-\gamma)^{2p+2}}\right)$ to guarantee a $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\epsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. Also, we show that under different mild assumptions a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure the Kolmogorov metric and total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ is below $\epsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on Lipschitz function class $\ell^\infty(\mathcal{F}_{W_1})$, also in the space of bounded functionals on indicator function class $\ell^\infty(\mathcal{F}_{\mathrm{KS}})$ and bounded measurable function class $\ell^\infty(\mathcal{F}_{\mathrm{TV}})$ when some mild conditions hold. Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $\eta^\pi$.
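A minimal sketch of certainty equivalence for tabular MDPs: fit an empirical model from observed transitions, approximate $\hat\eta^\pi$ by Monte Carlo rollouts in the fitted model, and compare return distributions with an empirical 1-Wasserstein distance. It assumes every state-action pair appears in the data, and the rollout approximation stands in for the paper's exact certainty-equivalent distribution.

```python
import numpy as np

def ce_return_samples(S, A, data, policy, gamma=0.9, horizon=200,
                      n_rollouts=5000, s0=0, seed=0):
    """Certainty equivalence: fit an empirical MDP from (s, a, r, s')
    transitions, then sample returns of `policy` by rolling out the
    fitted model (assumes every (s, a) pair appears in `data`)."""
    counts = np.zeros((S, A, S)); rew_sum = np.zeros((S, A))
    for s, a, r, s2 in data:
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2)
    P = counts / n_sa[:, :, None]            # empirical transition kernel
    R = rew_sum / n_sa                       # empirical mean rewards
    rng = np.random.default_rng(seed)
    returns = np.empty(n_rollouts)
    for k in range(n_rollouts):
        s, g, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            g += disc * R[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        returns[k] = g                       # one sample from eta_hat^pi(s0)
    return returns

def wasserstein1(x, y):
    """Empirical 1-Wasserstein distance between two equal-size samples."""
    return np.abs(np.sort(x) - np.sort(y)).mean()
```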
Data-driven localized waves and parameter discovery in the massive Thirring model via extended physics-informed neural networks with interface zones
results: Through data-driven simulations and analysis of various solutions, the paper demonstrates the high accuracy and fast convergence of the method and successfully solves inverse problems for different types of localized wave solutions.Abstract
In this paper, we study data-driven localized wave solutions and parameter discovery in the massive Thirring (MT) model via the deep learning in the framework of physics-informed neural networks (PINNs) algorithm. Abundant data-driven solutions including soliton of bright/dark type, breather and rogue wave are simulated accurately and analyzed contrastively with relative and absolute errors. For higher-order localized wave solutions, we employ the extended PINNs (XPINNs) with domain decomposition to capture the complete pictures of dynamic behaviors such as soliton collisions, breather oscillations and rogue-wave superposition. In particular, we modify the interface line in domain decomposition of XPINNs into a small interface zone and introduce the pseudo initial, residual and gradient conditions as interface conditions linked adjacently with individual neural networks. Then this modified approach is applied successfully to various solutions ranging from bright-bright soliton, dark-dark soliton, dark-antidark soliton, general breather, Kuznetsov-Ma breather and second-order rogue wave. Experimental results show that this improved version of XPINNs reduce the complexity of computation with faster convergence rate and keep the quality of learned solutions with smoother stitching performance as well. For the inverse problems, the unknown coefficient parameters of linear and nonlinear terms in the MT model are identified accurately with and without noise by using the classical PINNs algorithm.
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data
results: Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, informative graph structures are created for model training and gene representation generation, with weighted similarity learning and contrastive learning employed to learn cross-data gene-gene relationships. This design yields multimodal gene representations that capture functional similarity across different contexts.Abstract
Discovering genes with similar functions across diverse biomedical contexts poses a significant challenge in gene representation learning due to data heterogeneity. In this study, we resolve this problem by introducing a novel model called Multimodal Similarity Learning Graph Neural Network, which combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data. Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, we create informative graph structures for model training and gene representations generation, while incorporating regularization with weighted similarity learning and contrastive learning to learn cross-data gene-gene relationships. This novel design ensures that we can offer gene representations containing functional similarity across different contexts in a joint space. Comprehensive benchmarking analysis shows our model's capacity to effectively capture gene function similarity across multiple modalities, outperforming state-of-the-art methods in gene representation learning by up to 97.5%. Moreover, we employ bioinformatics tools in conjunction with gene representations to uncover pathway enrichment, regulation causal networks, and functions of disease-associated or dosage-sensitive genes. Therefore, our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.
paper_authors: Yong Lin, Lu Tan, Yifan Hao, Honam Wong, Hanze Dong, Weizhong Zhang, Yujiu Yang, Tong Zhang
for: This paper aims to explain the mechanism behind the effectiveness of ensemble methods on out-of-distribution (OOD) data.
methods: The paper examines weight-space ensemble methods, in particular WiSE-FT, which interpolates model parameters.
results: The study finds that WiSE-FT achieves excellent OOD performance and corrects many cases where each individual model makes incorrect predictions. Moreover, ensemble methods reduce prediction errors by exploiting a more diverse set of spurious features.Abstract
Generalization to out-of-distribution (OOD) data is a critical challenge in machine learning. Ensemble-based methods, like weight space ensembles that interpolate model parameters, have been shown to achieve superior OOD performance. However, the underlying mechanism for their effectiveness remains unclear. In this study, we closely examine WiSE-FT, a popular weight space ensemble method that interpolates between a pre-trained and a fine-tuned model. We observe an unexpected phenomenon, in which WiSE-FT successfully corrects many cases where each individual model makes incorrect predictions, which contributes significantly to its OOD effectiveness. To gain further insights, we conduct theoretical analysis in a multi-class setting with a large number of spurious features. Our analysis predicts the above phenomenon and it further shows that ensemble-based models reduce prediction errors in the OOD settings by utilizing a more diverse set of spurious features. Contrary to the conventional wisdom that focuses on learning invariant features for better OOD performance, our findings suggest that incorporating a large number of diverse spurious features weakens their individual contributions, leading to improved overall OOD generalization performance. Empirically we demonstrate the effectiveness of utilizing diverse spurious features on a MultiColorMNIST dataset, and our experimental results are consistent with the theoretical analysis. Building upon the new theoretical insights into the efficacy of ensemble methods, we further identify an issue of WiSE-FT caused by the overconfidence of fine-tuned models in OOD situations. This overconfidence magnifies the fine-tuned model's incorrect prediction, leading to deteriorated OOD ensemble performance. To remedy this problem, we propose a novel method called BAlaNced averaGing (BANG), which significantly enhances the OOD performance of WiSE-FT.
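The weight-space ensembling the analysis builds on is essentially a one-liner. The sketch below interpolates two PyTorch state dicts in the style of WiSE-FT (assuming all entries are floating-point tensors); BANG modifies this averaging step, and its details are not reproduced here.

```python
import torch

def wise_ft(pretrained_sd, finetuned_sd, alpha=0.5):
    """Weight-space ensemble: interpolate two state dicts; alpha = 0
    recovers the pre-trained model, alpha = 1 the fine-tuned one."""
    return {k: (1 - alpha) * pretrained_sd[k] + alpha * finetuned_sd[k]
            for k in pretrained_sd}

# Usage (hypothetical model objects):
# model.load_state_dict(wise_ft(pretrained.state_dict(),
#                               finetuned.state_dict(), alpha=0.5))
```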
Memory Gym: Partially Observable Challenges to Memory-Based Agents in Endless Episodes
paper_authors: Marco Pleines, Matthias Pallasch, Frank Zimmer, Mike Preuss
for: The paper compares the performance of Gated Recurrent Unit (GRU) and Transformer-XL (TrXL) in deep reinforcement learning tasks, specifically in memorizing long sequences, withstanding noise, and generalizing.
methods: The paper uses partially observable 2D environments with discrete controls, such as Mortar Mayhem, Mystery Path, and Searing Spotlights, and extrapolates these environments to novel endless tasks as an automatic curriculum. The paper also uses Proximal Policy Optimization and a sliding window approach with TrXL as episodic memory.
results: The paper shows that GRU consistently outperforms TrXL by significant margins in all endless tasks, while TrXL demonstrates superior sample efficiency in Mystery Path and outperforms GRU in Mortar Mayhem.Abstract
Memory Gym introduces a unique benchmark designed to test Deep Reinforcement Learning agents, specifically comparing Gated Recurrent Unit (GRU) against Transformer-XL (TrXL), on their ability to memorize long sequences, withstand noise, and generalize. It features partially observable 2D environments with discrete controls, namely Mortar Mayhem, Mystery Path, and Searing Spotlights. These originally finite environments are extrapolated to novel endless tasks that act as an automatic curriculum, drawing inspiration from the car game ``I packed my bag''. These endless tasks are not only beneficial for evaluating efficiency but also intriguingly valuable for assessing the effectiveness of approaches in memory-based agents. Given the scarcity of publicly available memory baselines, we contribute an implementation driven by TrXL and Proximal Policy Optimization. This implementation leverages TrXL as episodic memory using a sliding window approach. In our experiments on the finite environments, TrXL demonstrates superior sample efficiency in Mystery Path and outperforms in Mortar Mayhem. However, GRU is more efficient on Searing Spotlights. Most notably, in all endless tasks, GRU makes a remarkable resurgence, consistently outperforming TrXL by significant margins.
ResBit: Residual Bit Vector for Categorical Values
methods: The paper builds on Analog Bits, a diffusion-model-based method for representing discrete data, and proposes Table Residual Bit Diffusion (TRBD), which incorporates ResBit into TabDDPM for tabular data generation.
results: Experiments confirm that TRBD generates diverse, high-quality tabular data faster than TabDDPM, and that ResBit can serve as an alternative to the one-hot vector, both for conditioning in GANs and as a label expression in image classification.Abstract
The one-hot vector has long been widely used in machine learning as a simple and generic method for representing discrete data. However, this method increases the number of dimensions linearly with the categorical data to be represented, which is problematic from the viewpoint of spatial computational complexity in deep learning, which requires a large amount of data. Recently, Analog Bits, a method for representing discrete data as a sequence of bits, was proposed on the basis of the high expressiveness of diffusion models. However, since the number of category types to be represented in a generation task is not necessarily a power of two, there is a discrepancy between the range that Analog Bits can represent and the range represented as category data. If such a value is generated, the problem is that the original category value cannot be restored. To address this issue, we propose Residual Bit Vector (ResBit), which is a hierarchical bit representation. Although it is a general-purpose representation method, in this paper, we treat it as numerical data and show that it can be used as an extension of Analog Bits using Table Residual Bit Diffusion (TRBD), which is incorporated into TabDDPM, a tabular data generation method. We experimentally confirmed that TRBD can generate diverse and high-quality data from small-scale table data to table data containing diverse category values faster than TabDDPM. Furthermore, we show that ResBit can also serve as an alternative to the one-hot vector by utilizing ResBit for conditioning in GANs and as a label expression in image classification.
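For context, here is a minimal NumPy sketch of the Analog Bits baseline that ResBit extends: integers become $\{-1, +1\}$ bit vectors, and generated analog values are thresholded back. Since n_bits = 3 covers exactly 8 categories, decoding can land outside the valid range when the category count is not a power of two, which is the discrepancy ResBit's hierarchical residual construction (not reproduced here) addresses.

```python
import numpy as np

def analog_bits_encode(labels, n_bits):
    """Encode integer categories as {-1, +1} bit vectors (Analog Bits)."""
    bits = (np.asarray(labels)[:, None] >> np.arange(n_bits)) & 1
    return bits.astype(np.float32) * 2.0 - 1.0

def analog_bits_decode(analog):
    """Threshold generated analog values back to integer categories."""
    bits = (np.asarray(analog) > 0).astype(np.int64)
    return (bits << np.arange(bits.shape[-1])).sum(axis=-1)

x = analog_bits_encode([0, 3, 5], n_bits=3)   # 3 bits cover 8 categories
print(analog_bits_decode(x + 0.2))            # [0 3 5], robust to small noise
```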
Generalized Activation via Multivariate Projection
for: This paper aims to improve the performance of neural networks by introducing a new activation function called Multivariate Projection Unit (MPU).
methods: The paper uses a mathematical proof to establish the expressive power of MPU compared to the widely used Rectified Linear Unit (ReLU) activation function. Experimental evaluations are also conducted to compare the performance of MPU with other activation functions.
results: The paper shows that MPU outperforms ReLU and other activation functions in terms of expressive power, and provides a mathematical proof to support this claim. Experimental results also corroborate the effectiveness of MPU on widely-adopted architectures.Abstract
Activation functions are essential to introduce nonlinearity into neural networks, with the Rectified Linear Unit (ReLU) often favored for its simplicity and effectiveness. Motivated by the structural similarity between a shallow Feedforward Neural Network (FNN) and a single iteration of the Projected Gradient Descent (PGD) algorithm, a standard approach for solving constrained optimization problems, we consider ReLU as a projection from R onto the nonnegative half-line R+. Building on this interpretation, we extend ReLU by substituting it with a generalized projection operator onto a convex cone, such as the Second-Order Cone (SOC) projection, thereby naturally extending it to a Multivariate Projection Unit (MPU), an activation function with multiple inputs and multiple outputs. We further provide a mathematical proof establishing that FNNs activated by SOC projections outperform those utilizing ReLU in terms of expressive power. Experimental evaluations on widely-adopted architectures further corroborate MPU's effectiveness against a broader range of existing activation functions.
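The second-order cone projection behind the MPU admits a short closed form. The PyTorch sketch below projects each row $(t, x)$ onto $\{(t, x) : \|x\|_2 \le t\}$; how the paper groups network channels into cones is left out here and would be an assumption.

```python
import torch

def soc_projection(y, eps=1e-12):
    """Project each row y = (t, x) in R^{1+n} onto the second-order
    cone {(t, x) : ||x||_2 <= t}; a multivariate stand-in for ReLU,
    which is the n = 0 projection onto the nonnegative half-line."""
    t, x = y[:, :1], y[:, 1:]
    nx = x.norm(dim=1, keepdim=True)
    coef = (t + nx) / 2                        # boundary-case scaling
    proj = torch.cat([coef, coef * x / (nx + eps)], dim=1)
    out = torch.where(nx <= t, y, proj)        # already inside the cone
    return torch.where(nx <= -t, torch.zeros_like(y), out)

print(soc_projection(torch.randn(4, 5)))       # cone over R^4
```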
RECOMBINER: Robust and Enhanced Compression with Bayesian Implicit Neural Representations
results: Extensive experiments across several data modalities show that RECOMBINER is competitive with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates.Abstract
COMpression with Bayesian Implicit NEural Representations (COMBINER) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (INR)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, COMBINER still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters' initializations. Our proposed method, Robust and Enhanced COMBINER (RECOMBINER), addresses these issues by 1) enriching the variational approximation while maintaining its computational cost via a linear reparameterization of the INR weights, 2) augmenting our INRs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that RECOMBINER achieves competitive results with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates.
FedZeN: Towards superlinear zeroth-order federated learning via incremental Hessian estimation
results: The paper presents FedZeN, a federated zeroth-order algorithm designed to achieve a superlinear convergence rate in practice. Experimental results show that FedZeN outperforms existing federated zeroth-order methods and attains the superlinear rate across several real-world applications.Abstract
Federated learning is a distributed learning framework that allows a set of clients to collaboratively train a model under the orchestration of a central server, without sharing raw data samples. Although in many practical scenarios the derivatives of the objective function are not available, only a few works have considered the federated zeroth-order setting, in which functions can only be accessed through a budgeted number of point evaluations. In this work we focus on convex optimization and design the first federated zeroth-order algorithm to estimate the curvature of the global objective, with the purpose of achieving superlinear convergence. We take an incremental Hessian estimator whose error norm converges linearly, and we adapt it to the federated zeroth-order setting, sampling the random search directions from the Stiefel manifold for improved performance. In particular, both the gradient and Hessian estimators are built at the central server in a communication-efficient and privacy-preserving way by leveraging synchronized pseudo-random number generators. We provide a theoretical analysis of our algorithm, named FedZeN, proving local quadratic convergence with high probability and global linear convergence up to zeroth-order precision. Numerical simulations confirm the superlinear convergence rate and show that our algorithm outperforms the federated zeroth-order methods available in the literature.
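The synchronized pseudo-random number generators mentioned in the abstract can be illustrated with a two-point zeroth-order gradient estimator: because client and server derive identical directions from a shared seed, only the scalar finite differences need to be communicated. This sketch uses single normalized Gaussian directions rather than the paper's Stiefel-manifold sampling, and it omits the incremental Hessian estimator entirely.

```python
import numpy as np

def zo_gradient(f, w, seed, n_dirs=10, mu=1e-4):
    """Two-point zeroth-order gradient estimate with directions drawn
    from a seeded generator: server and client regenerate the same
    directions, so only n_dirs scalars need to be communicated."""
    rng = np.random.default_rng(seed)
    d = w.size
    g = np.zeros(d)
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # unit direction on the sphere
        delta = (f(w + mu * u) - f(w - mu * u)) / (2 * mu)
        g += delta * u
    return g * d / n_dirs                    # unbiased for uniform directions

f = lambda w: np.sum((w - 1.0) ** 2)         # toy client objective
w = np.zeros(5)
for t in range(300):
    w -= 0.05 * zo_gradient(f, w, seed=t)
print(np.round(w, 3))                        # approaches the optimum at 1
```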
Efficient Interpretable Nonlinear Modeling for Multiple Time Series
results: Experimental results on both synthetic and real data sets show that the proposed method identifies the support of the VAR coefficients in a parsimonious manner while also improving time-series prediction compared to state-of-the-art methods.Abstract
Predictive linear and nonlinear models based on kernel machines or deep neural networks have been used to discover dependencies among time series. This paper proposes an efficient nonlinear modeling approach for multiple time series, with a complexity comparable to linear vector autoregressive (VAR) models while still incorporating nonlinear interactions among different time-series variables. The modeling assumption is that the set of time series is generated in two steps: first, a linear VAR process in a latent space, and second, a set of invertible and Lipschitz continuous nonlinear mappings that are applied per sensor, that is, a component-wise mapping from each latent variable to a variable in the measurement space. The VAR coefficient identification provides a topology representation of the dependencies among the aforementioned variables. The proposed approach models each component-wise nonlinearity using an invertible neural network and imposes sparsity on the VAR coefficients to reflect the parsimonious dependencies usually found in real applications. To efficiently solve the formulated optimization problems, a custom algorithm is devised combining proximal gradient descent, stochastic primal-dual updates, and projection to enforce the corresponding constraints. Experimental results on both synthetic and real data sets show that the proposed algorithm improves the identification of the support of the VAR coefficients in a parsimonious manner while also improving the time-series prediction, as compared to the current state-of-the-art methods.
results: An extensive evaluation on a predefined benchmark of 19 classification datasets shows that GRANDE outperforms existing gradient-boosting and deep learning frameworks on most datasets.Abstract
Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose $\text{GRANDE}$, $\text{GRA}$die$\text{N}$t-Based $\text{D}$ecision Tree $\text{E}$nsembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both, simple and complex relations, within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.
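Training hard splits end to end typically relies on a straight-through estimator, sketched below: the forward pass routes with a hard 0/1 decision while gradients flow through a soft relaxation. This is a generic illustration, not GRANDE's exact parameterization; its dense ensemble representation and instance-wise weighting are omitted.

```python
import torch

def hard_split(x, w, b, tau=1.0):
    """Straight-through split: route with a hard 0/1 decision in the
    forward pass while gradients flow through the sigmoid relaxation,
    so the split parameters (w, b) remain trainable end to end. For
    axis-aligned splits, w would be (near) one-hot over features."""
    soft = torch.sigmoid((x @ w + b) / tau)
    hard = (soft > 0.5).float()
    return hard + (soft - soft.detach())    # value: hard; gradient: soft

x = torch.randn(8, 4)
w = torch.zeros(4, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
hard_split(x, w, b).sum().backward()        # gradients reach w and b
print(w.grad)
```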
Style Transfer for Non-differentiable Audio Effects
for: Audio production style matching, particularly for multi-band compressor effects.
methods: Deep learning approach using audio embeddings, which can be applied to various classes of effects and does not require auto-differentiation.
results: The proposed approach convincingly style-matches a multi-band compressor effect, and the audio embeddings can be used for downstream tasks such as timbral information retrieval.Abstract
Digital audio effects are widely used by audio engineers to alter the acoustic and temporal qualities of audio data. However, these effects can have a large number of parameters which can make them difficult to learn for beginners and hamper creativity for professionals. Recently, there have been a number of efforts to employ progress in deep learning to acquire the low-level parameter configurations of audio effects by minimising an objective function between an input and reference track, commonly referred to as style transfer. However, current approaches use inflexible black-box techniques or require that the effects under consideration are implemented in an auto-differentiation framework. In this work, we propose a deep learning approach to audio production style matching which can be used with effects implemented in some of the most widely used frameworks, requiring only that the parameters under consideration have a continuous domain. Further, our method includes style matching for various classes of effects, many of which are difficult or impossible to be approximated closely using differentiable functions. We show that our audio embedding approach creates logical encodings of timbral information, which can be used for a number of downstream tasks. Further, we perform a listening test which demonstrates that our approach is able to convincingly style match a multi-band compressor effect.
results: Extensive experiments on multiple benchmark datasets show that this generalization significantly improves performance, achieving top results for hypergraph node classification.Abstract
Higher-order relations are widespread in nature, with numerous phenomena involving complex interactions that extend beyond simple pairwise connections. As a result, advancements in higher-order processing can accelerate the growth of various fields requiring structured data. Current approaches typically represent these interactions using hypergraphs. We enhance this representation by introducing cellular sheaves for hypergraphs, a mathematical construction that adds extra structure to the conventional hypergraph while maintaining its local, higher-order connectivity. Drawing inspiration from existing Laplacians in the literature, we develop two unique formulations of sheaf hypergraph Laplacians: linear and non-linear. Our theoretical analysis demonstrates that incorporating sheaves into the hypergraph Laplacian provides a more expressive inductive bias than standard hypergraph diffusion, creating a powerful instrument for effectively modelling complex data structures. We employ these sheaf hypergraph Laplacians to design two categories of models: Sheaf Hypergraph Neural Networks and Sheaf Hypergraph Convolutional Networks. These models generalize classical Hypergraph Networks often found in the literature. Through extensive experimentation, we show that this generalization significantly improves performance, achieving top results on multiple benchmark datasets for hypergraph node classification.
Benchmarking Collaborative Learning Methods Cost-Effectiveness for Prostate Segmentation
results: Our experimental results show that, in the considered practical scenario, CBM provides results equal to or better than FL while being highly cost-effective, suggesting that the consensus paradigm may represent a valid alternative to FL.Abstract
Healthcare data is often split into medium/small-sized collections across multiple hospitals and access to it is encumbered by privacy regulations. This makes it difficult to use such data for the development of machine learning and deep learning models, which are known to be data-hungry. One way to overcome this limitation is to use collaborative learning (CL) methods, which allow hospitals to work collaboratively to solve a task, without the need to explicitly share local data. In this paper, we address a prostate segmentation problem from MRI in a collaborative scenario by comparing two different approaches: federated learning (FL) and consensus-based methods (CBM). To the best of our knowledge, this is the first work in which CBM, such as label fusion techniques, are used to solve a problem of collaborative learning. In this setting, CBM combine predictions from locally trained models to obtain a federated strong learner with ideally improved robustness and predictive variance properties. Our experiments show that, in the considered practical scenario, CBMs provide equal or better results than FL, while being highly cost-effective. Our results demonstrate that the consensus paradigm may represent a valid alternative to FL for typical training tasks in medical imaging.
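The simplest consensus-based method is a per-voxel majority vote over the locally trained models, sketched below; the paper benchmarks label-fusion techniques of this kind against federated learning.

```python
import numpy as np

def majority_vote_fusion(predictions):
    """Fuse binary segmentation masks from locally trained models by a
    per-voxel strict majority vote; predictions: list of 0/1 arrays."""
    votes = np.stack(predictions).sum(axis=0)
    return (2 * votes > len(predictions)).astype(np.uint8)

masks = [np.random.randint(0, 2, (4, 4)) for _ in range(5)]
print(majority_vote_fusion(masks))
```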
Too Big, so Fail? – Enabling Neural Construction Methods to Solve Large-Scale Routing Problems
results: The paper proposes a neural construction approach based on the ruin-recreate principle that performs better on large-scale problem instances; thorough experiments on four datasets of varying distributions and modalities demonstrate its advantages.Abstract
In recent years new deep learning approaches to solve combinatorial optimization problems, in particular NP-hard Vehicle Routing Problems (VRP), have been proposed. The most impactful of these methods are sequential neural construction approaches which are usually trained via reinforcement learning. Due to the high training costs of these models, they usually are trained on limited instance sizes (e.g. serving 100 customers) and later applied to vastly larger instance size (e.g. 2000 customers). By means of a systematic scale-up study we show that even state-of-the-art neural construction methods are outperformed by simple heuristics, failing to generalize to larger problem instances. We propose to use the ruin recreate principle that alternates between completely destroying a localized part of the solution and then recreating an improved variant. In this way, neural construction methods like POMO are never applied to the global problem but just in the reconstruction step, which only involves partial problems much closer in size to their original training instances. In thorough experiments on four datasets of varying distributions and modalities we show that our neural ruin recreate approach outperforms alternative forms of improving construction methods such as sampling and beam search and in several experiments also advanced local search approaches.
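A minimal version of the ruin-recreate loop for the TSP, with greedy cheapest insertion standing in for the neural reconstruction step (the paper applies a model such as POMO to the partial problem instead):

```python
import numpy as np

def tour_length(tour, dist):
    t = np.asarray(tour)
    return dist[t, np.roll(t, -1)].sum()

def ruin_recreate(dist, iters=500, k=8, seed=0):
    """Ruin a random localized segment of k cities, rebuild it by
    cheapest insertion, and keep the new tour only if it improves."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    tour = list(rng.permutation(n))
    best = tour_length(tour, dist)
    for _ in range(iters):
        start = int(rng.integers(n - k))
        removed = tour[start:start + k]               # ruin
        partial = tour[:start] + tour[start + k:]
        for c in removed:                             # recreate greedily
            costs = [dist[partial[i - 1], c] + dist[c, partial[i]]
                     - dist[partial[i - 1], partial[i]]
                     for i in range(len(partial))]
            partial.insert(int(np.argmin(costs)), c)
        cand = tour_length(partial, dist)
        if cand < best:
            tour, best = partial, cand
    return tour, best

pts = np.random.rand(50, 2)
dist = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
print(ruin_recreate(dist)[1])
```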
From Empirical Measurements to Augmented Data Rates: A Machine Learning Approach for MCS Adaptation in Sidelink Communication
results: The paper shows that a machine learning approach can adaptively predict MCS levels and significantly improve prediction performance. In addition, it presents a large data set acquired in extensive drive tests, which is made publicly available.Abstract
Due to the lack of a feedback channel in the C-V2X sidelink, finding a suitable modulation and coding scheme (MCS) is a difficult task. However, recent use cases for vehicle-to-everything (V2X) communication with higher demands on data rate necessitate choosing the MCS adaptively. In this paper, we propose a machine learning approach to predict suitable MCS levels. Additionally, we propose the use of quantile prediction and evaluate it in combination with different algorithms for the task of predicting the MCS level with the highest achievable data rate. Thereby, we show significant improvements over conventional methods of choosing the MCS level. Using a machine learning approach, however, requires larger real-world data sets than are currently publicly available for research. For this reason, this paper presents a data set that was acquired in extensive drive tests, and that we make publicly available.
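Quantile prediction for rate adaptation can be sketched with off-the-shelf gradient boosting: a low quantile of the achievable data rate gives a conservative basis for the MCS choice. The features and targets below are toy stand-ins, not the paper's drive-test measurements.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-ins for link features (e.g. SINR, speed, distance) and the
# achieved data rate, with heavy-tailed noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_t(3, size=2000)

# A low quantile of the achievable rate supports a conservative MCS
# choice: the link stays reliable even when the channel is below average.
q10 = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
q50 = GradientBoostingRegressor(loss="quantile", alpha=0.5).fit(X, y)
print(q10.predict(X[:3]), q50.predict(X[:3]))
```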
Diffusion Models as Stochastic Quantization in Lattice Field Theory
results: Numerical simulations demonstrate that the DM can serve as a global sampler for generating quantum lattice field configurations in two-dimensional $\phi^4$ theory. The DM also significantly reduces autocorrelation times, especially in the critical region where MCMC algorithms suffer from critical slowing down. These findings may motivate further advancements in lattice field theory simulations, in particular where generating large ensembles is expensive.Abstract
In this work, we establish a direct connection between generative diffusion models (DMs) and stochastic quantization (SQ). The DM is realized by approximating the reversal of a stochastic process dictated by the Langevin equation, generating samples from a prior distribution to effectively mimic the target distribution. Using numerical simulations, we demonstrate that the DM can serve as a global sampler for generating quantum lattice field configurations in two-dimensional $\phi^4$ theory. We demonstrate that DMs can notably reduce autocorrelation times in the Markov chain, especially in the critical region where standard Markov Chain Monte-Carlo (MCMC) algorithms experience critical slowing down. The findings can potentially inspire further advancements in lattice field theory simulations, in particular in cases where it is expensive to generate large ensembles.
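The stochastic-quantization side of the correspondence is straightforward to simulate. The NumPy sketch below runs unadjusted Langevin dynamics for a 2D $\phi^4$ lattice, whose stationary distribution is proportional to $e^{-S[\phi]}$; the paper's point is that a diffusion model learns to reverse exactly this kind of process. All parameters are illustrative.

```python
import numpy as np

def langevin_phi4(L=32, m2=-1.0, lam=1.0, eps=0.01, steps=5000, seed=0):
    """Unadjusted Langevin dynamics for the 2D phi^4 lattice action
    S = sum_x [ |grad phi|^2 / 2 + m2 phi^2 / 2 + lam phi^4 / 4 ]."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((L, L))
    for _ in range(steps):
        lap = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4 * phi)
        drift = lap - m2 * phi - lam * phi**3        # -dS/dphi
        phi += eps * drift + np.sqrt(2 * eps) * rng.standard_normal((L, L))
    return phi

cfg = langevin_phi4()
print(cfg.mean(), (cfg**2).mean())   # magnetization and second moment
```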
On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters
results: The work provides a precise characterization of the WL-dimension of graph motif parameters and shows that, whenever the $k$WL test distinguishes graphs with different occurrences of a pattern, the exact number of occurrences can be computed using only local information from the last layer of a corresponding GNN.Abstract
Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the $k$-dimensional Weisfeiler-Leman ($k$WL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the $k$WL test. A central focus of research in this field revolves around determining the least dimensionality $k$, for which $k$WL can discern graphs with different number of occurrences of a pattern graph $P$. We refer to such a least $k$ as the WL-dimension of this pattern counting problem. This inquiry traditionally delves into two distinct counting problems related to patterns: subgraph counting and induced subgraph counting. Intriguingly, despite their initial appearance as separate challenges with seemingly divergent approaches, both of these problems are interconnected components of a more comprehensive problem: "graph motif parameters". In this paper, we provide a precise characterization of the WL-dimension of labeled graph motif parameters. As specific instances of this result, we obtain characterizations of the WL-dimension of the subgraph counting and induced subgraph counting problem for every labeled pattern $P$. We additionally demonstrate that in cases where the $k$WL test distinguishes between graphs with varying occurrences of a pattern $P$, the exact number of occurrences of $P$ can be computed uniformly using only local information of the last layer of a corresponding GNN. We finally delve into the challenge of recognizing the WL-dimension of various graph parameters. We give a polynomial time algorithm for determining the WL-dimension of the subgraph counting problem for given pattern $P$, answering an open question from previous work.
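For reference, the $k = 1$ case of the Weisfeiler-Leman test (color refinement) fits in a few lines; higher-dimensional $k$WL refines colors of node $k$-tuples instead.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL color refinement: repeatedly hash each node's color together
    with the multiset of its neighbors' colors. Differing histograms
    certify non-isomorphism (equal ones are inconclusive)."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
    return Counter(colors.values())

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_histogram(triangle) == wl_histogram(path))   # False: distinguished
```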
Efficient Agnostic Learning with Average Smoothness
results: We fully close these gaps: we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting and match it with a computationally efficient agnostic learning algorithm. Our results hold over any totally bounded metric space and show that the guarantees obtained for realizable learning transfer to the agnostic setting.Abstract
We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case. In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest.
Feature Cognition Enhancement via Interaction-Aware Automated Transformation
for: This paper aims to address the challenges of representation learning in machine learning, specifically the issues of heavy reliance on manual feature engineering, lack of explainability, and inflexible feature space reconstruction.
methods: The proposed approach is based on interaction-aware reinforcement generation, which involves creating meaningful features and controlling feature set size through selection. The authors use a hierarchical reinforcement learning structure with cascading Markov Decision Processes to automate feature and operation selection, as well as feature crossing.
results: The authors conduct extensive experiments to validate their proposed approach, demonstrating the effectiveness of their method in generating intelligible and efficient feature spaces that emulate human decision-making.
Abstract
Creating an effective representation space is crucial for mitigating the curse of dimensionality, enhancing model generalization, addressing data sparsity, and leveraging classical models more effectively. Recent advancements in automated feature engineering (AutoFE) have made significant progress in addressing various challenges associated with representation learning, issues such as heavy reliance on intensive labor and empirical experiences, lack of explainable explicitness, and inflexible feature space reconstruction embedded into downstream tasks. However, these approaches are constrained by: 1) generation of potentially unintelligible and illogical reconstructed feature spaces, stemming from the neglect of expert-level cognitive processes; 2) lack of systematic exploration, which subsequently results in slower model convergence for identification of optimal feature space. To address these, we introduce an interaction-aware reinforced generation perspective. We redefine feature space reconstruction as a nested process of creating meaningful features and controlling feature set size through selection. We develop a hierarchical reinforcement learning structure with cascading Markov Decision Processes to automate feature and operation selection, as well as feature crossing. By incorporating statistical measures, we reward agents based on the interaction strength between selected features, resulting in intelligent and efficient exploration of the feature space that emulates human decision-making. Extensive experiments are conducted to validate our proposed approach.
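The abstract above does not pin down the statistical measure behind "interaction strength", so the sketch below illustrates one plausible instantiation: score a feature crossing by the mutual information its result carries about the target beyond either parent feature. The function names (`interaction_reward`, `op`) and the measure itself are illustrative assumptions; only the idea of rewarding crossings by interaction strength comes from the paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def interaction_reward(x1, x2, y, op=np.multiply):
    """Reward a feature crossing by how much the generated feature tells
    us about the target beyond its parents (one possible proxy for the
    paper's interaction-strength reward)."""
    crossed = op(x1, x2).reshape(-1, 1)
    parents = np.column_stack([x1, x2])
    mi_crossed = mutual_info_regression(crossed, y, random_state=0)[0]
    mi_parents = mutual_info_regression(parents, y, random_state=0).max()
    return mi_crossed - mi_parents  # positive means the cross adds information

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)
y = x1 * x2 + 0.1 * rng.normal(size=1000)  # target driven by the interaction
print(interaction_reward(x1, x2, y) > 0)    # True: the crossing is rewarded
```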
Deep Representation Learning for Prediction of Temporal Event Sets in the Continuous Time Domain
results: Compared with existing methods, the proposed approach is validated through extensive experiments on multiple datasets and achieves higher prediction accuracy and computational efficiency.
Abstract
Temporal Point Processes (TPP) play an important role in predicting or forecasting events. Although these problems have been studied extensively, predicting multiple simultaneously occurring events can be challenging. For instance, more often than not, a patient gets admitted to a hospital with multiple conditions at a time. Similarly people buy more than one stock and multiple news breaks out at the same time. Moreover, these events do not occur at discrete time intervals, and forecasting event sets in the continuous time domain remains an open problem. Naive approaches for extending the existing TPP models for solving this problem lead to dealing with an exponentially large number of events or ignoring set dependencies among events. In this work, we propose a scalable and efficient approach based on TPPs to solve this problem. Our proposed approach incorporates contextual event embeddings, temporal information, and domain features to model the temporal event sets. We demonstrate the effectiveness of our approach through extensive experiments on multiple datasets, showing that our model outperforms existing methods in terms of prediction metrics and computational efficiency. To the best of our knowledge, this is the first work that solves the problem of predicting event set intensities in the continuous time domain by using TPPs.
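As background for the abstract above, a temporal point process is specified by a conditional intensity over continuous time. The sketch below simulates a univariate Hawkes process via Ogata's thinning; this is standard TPP machinery to make the continuous-time setting concrete, not the paper's event-set model.

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, t_end, seed=0):
    """Simulate a univariate Hawkes process via Ogata's thinning.

    Conditional intensity: lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta (t - t_i)).
    Between events the intensity only decays, so its current value is a
    valid upper bound for the thinning (accept/reject) step.
    """
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < t_end:
        lam_bar = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)                       # candidate time
        lam_t = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        if t < t_end and rng.uniform() * lam_bar <= lam_t:
            events.append(t)                                      # accept candidate
    return np.array(events)

events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.5, t_end=100.0)
print(f"{len(events)} events; stationary rate mu/(1 - alpha/beta) is about 1.07/unit time")
```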
Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning
results: In offline RL, generative models as policies demonstrate expressiveness on multi-modal data. In offline-to-online RL, the consistency policy is more computationally efficient than the diffusion policy, with comparable performance. In online RL, the consistency policy demonstrates significant speedup and even higher average performance than the diffusion policy.
Abstract
Score-based generative models like the diffusion model have been shown to be effective in modeling multi-modal data from image generation to reinforcement learning (RL). However, the inference process of the diffusion model can be slow, which hinders its usage in RL with iterative sampling. We propose to apply the consistency model as an efficient yet expressive policy representation, namely the consistency policy, with an actor-critic style algorithm for three typical RL settings: offline, offline-to-online and online. For offline RL, we demonstrate the expressiveness of generative models as policies from multi-modal data. For offline-to-online RL, the consistency policy is shown to be more computationally efficient than the diffusion policy, with comparable performance. For online RL, the consistency policy demonstrates significant speedup and even higher average performance than the diffusion policy.
Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors
results: The paper provides a comprehensive performance comparison of GAUDI, revealing the relative strengths and weaknesses of the MME and TPC. It further proposes strategies for optimizing MME and TPC utilization and evaluates the performance of Transformer models on GAUDI.
Abstract
Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The quadratic complexity of the self-attention mechanism further exacerbates these challenges when dealing with long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. Firstly, we provide a comprehensive performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Secondly, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Thirdly, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences and uncovering performance bottlenecks. Lastly, we evaluate the end-to-end performance of two Transformer-based large language models (LLM) on GAUDI. The contributions of this work encompass practical insights for practitioners and researchers alike. We delve into GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration. Our study bridges a research gap and offers a roadmap for optimizing Transformer-based model training on the GAUDI architecture.
Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness
results: RO2O achieves a stable learning procedure in online adaptation and attains significant improvement with limited online interactions.
Abstract
To obtain a near-optimal policy with fewer interactions in Reinforcement Learning (RL), a promising approach involves the combination of offline RL, which enhances sample efficiency by leveraging offline datasets, and online RL, which explores informative transitions by interacting with the environment. Offline-to-Online (O2O) RL provides a paradigm for improving an offline trained agent within limited online interactions. However, due to the significant distribution shift between online experiences and offline data, most offline RL algorithms suffer from performance drops and fail to achieve stable policy improvement in O2O adaptation. To address this problem, we propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness, and to mitigate the performance drop in online adaptation. Specifically, RO2O incorporates Q-ensemble for uncertainty penalty and adversarial samples for policy and value smoothness, which enable RO2O to maintain a consistent learning procedure in online adaptation without requiring special changes to the learning objective. Theoretical analyses in linear MDPs demonstrate that the uncertainty and smoothness lead to a tighter optimality bound in O2O against distribution shift. Experimental results illustrate the superiority of RO2O in facilitating stable offline-to-online learning and achieving significant improvement with limited online interactions.
Multi-Resolution Active Learning of Fourier Neural Operators
for: Improve the training and prediction efficiency of the FNO while lowering its data cost.
methods: Dynamically select input functions and resolutions; use ensemble Monte-Carlo to develop an effective posterior inference algorithm; use moment matching and the matrix determinant lemma for tractable, efficient utility computation.
results: Excellent performance on several benchmark operator learning tasks, while avoiding getting stuck at low-resolution queries caused by over-penalizing high-resolution queries early on.
Abstract
Fourier Neural Operator (FNO) is a popular operator learning framework, which not only achieves the state-of-the-art performance in many tasks, but also is highly efficient in training and prediction. However, collecting training data for the FNO is a costly bottleneck in practice, because it often demands expensive physical simulations. To overcome this problem, we propose Multi-Resolution Active learning of FNO (MRA-FNO), which can dynamically select the input functions and resolutions to lower the data cost as much as possible while optimizing the learning efficiency. Specifically, we propose a probabilistic multi-resolution FNO and use ensemble Monte-Carlo to develop an effective posterior inference algorithm. To conduct active learning, we maximize a utility-cost ratio as the acquisition function to acquire new examples and resolutions at each step. We use moment matching and the matrix determinant lemma to enable tractable, efficient utility computation. Furthermore, we develop a cost annealing framework to avoid over-penalizing high-resolution queries at the early stage. The over-penalization is severe when the cost difference is significant between the resolutions, which renders active learning often stuck at low-resolution queries and inferior performance. Our method overcomes this problem and applies to general multi-fidelity active learning and optimization problems. We have shown the advantage of our method in several benchmark operator learning tasks.
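A minimal sketch of the acquisition step described above, with two loudly flagged assumptions: utility is proxied here by ensemble predictive variance (the paper derives an information-based utility via moment matching and the matrix determinant lemma), and cost annealing is reduced to a single exponent that softens the cost penalty early in training.

```python
import numpy as np

def select_query(candidates, ensemble_preds, cost, anneal=1.0):
    """Pick the (input, resolution) pair maximizing utility / annealed cost."""
    scores = {}
    for key in candidates:
        utility = ensemble_preds[key].var(axis=0).mean()   # ensemble disagreement
        scores[key] = utility / (cost[key[1]] ** anneal)   # utility-cost ratio
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
# ensemble_preds[(i, r)]: 5 ensemble members' predictions on a 32-point grid.
preds = {(i, r): rng.normal(scale=1.0 + i + 2 * r, size=(5, 32))
         for i in range(3) for r in range(2)}
cost = {0: 1.0, 1: 8.0}   # high resolution costs 8x more to simulate
print(select_query(list(preds), preds, cost, anneal=0.2))  # early: weak cost penalty
print(select_query(list(preds), preds, cost, anneal=1.0))  # later: full cost penalty
```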
Controlling Continuous Relaxation for Combinatorial Optimization
results: Experiments show that the proposed method obtains better results on CO problems on dense graphs, where the PI-GNN solver struggles, and also performs well on relatively sparse graphs. Moreover, the computational time grows linearly, the same as the PI-GNN solver.
Abstract
Recent advancements in combinatorial optimization (CO) problems emphasize the potential of graph neural networks (GNNs). The physics-inspired GNN (PI-GNN) solver, which finds approximate solutions through unsupervised learning, has attracted significant attention for large-scale CO problems. Nevertheless, there has been limited discussion on the performance of the PI-GNN solver for CO problems on relatively dense graphs where the performance of greedy algorithms worsens. In addition, since the PI-GNN solver employs a relaxation strategy, an artificial transformation from the continuous space back to the original discrete space is necessary after learning, potentially undermining the robustness of the solutions. This paper numerically demonstrates that the PI-GNN solver can be trapped in a local solution, where all variables are zero, in the early stage of learning for CO problems on the dense graphs. Then, we address these problems by controlling the continuity and discreteness of relaxed variables while avoiding the local solution: (i) introducing a new penalty term that controls the continuity and discreteness of the relaxed variables and eliminates the local solution; (ii) proposing a new continuous relaxation annealing (CRA) strategy. This new annealing first prioritizes continuous solutions and intensifies exploration by leveraging the continuity while avoiding the local solution and then schedules the penalty term for prioritizing a discrete solution until the relaxed variables are almost discrete values, which eliminates the need for an artificial transformation from the continuous to the original discrete space. Empirically, better results are obtained for CO problems on the dense graphs, where the PI-GNN solver struggles to find reasonable solutions, and for those on relatively sparse graphs. Furthermore, the computational time scaling is identical to that of the PI-GNN solver.
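A toy sketch of the annealing idea on relaxed MaxCut, assuming an illustrative penalty $\Phi(p)=\sum_i (1-(2p_i-1)^2)$: a negative coefficient rewards staying continuous, a positive one pushes the relaxed variables toward $\{0,1\}$, so annealing the coefficient from negative to positive removes the need for a final artificial rounding step. The paper's exact penalty and schedule may differ.

```python
import torch

def relaxed_loss(p, edges, lam):
    # Expected cut size under independent rounding, plus the annealed penalty.
    cut = sum(p[i] + p[j] - 2 * p[i] * p[j] for i, j in edges)
    phi = (1 - (2 * p - 1) ** 2).sum()   # zero at binary p, maximal at p = 0.5
    return -cut + lam * phi

torch.manual_seed(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]      # 4-cycle plus one chord
theta = (0.1 * torch.randn(4)).requires_grad_()        # break the symmetric saddle
opt = torch.optim.Adam([theta], lr=0.1)

for step in range(300):
    lam = -0.5 + 2.0 * step / 300     # anneal: continuous phase, then discrete phase
    loss = relaxed_loss(torch.sigmoid(theta), edges, lam)
    opt.zero_grad()
    loss.backward()
    opt.step()

print([round(v) for v in torch.sigmoid(theta).tolist()])  # near-binary cut assignment
```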
Leveraging Optimization for Adaptive Attacks on Image Watermarks
paper_authors: Nils Lukas, Abdulrahman Diaa, Lucas Fenaux, Florian Kerschbaum
for: The paper is written to address the issue of untrustworthy users misusing image generators to create high-quality deepfakes and engage in online spam or disinformation campaigns.
methods: The paper proposes a method of watermarking to deter misuse by marking generated content with a hidden message, and uses an adaptive attack to evaluate the robustness of the watermarking algorithm.
results: The paper demonstrates that an adaptive attack can break all five surveyed watermarking methods at negligible degradation in image quality, emphasizing the need for more rigorous robustness testing against adaptive, learnable attackers.
Abstract
Untrustworthy users can misuse image generators to synthesize high-quality deepfakes and engage in online spam or disinformation campaigns. Watermarking deters misuse by marking generated content with a hidden message, enabling its detection using a secret watermarking key. A core security property of watermarking is robustness, which states that an attacker can only evade detection by substantially degrading image quality. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. A challenge when evaluating watermarking algorithms and their (adaptive) attacks is to determine whether an adaptive attack is optimal, i.e., it is the best possible attack. We solve this problem by defining an objective function and then approach adaptive attacks as an optimization problem. The core idea of our adaptive attacks is to replicate secret watermarking keys locally by creating surrogate keys that are differentiable and can be used to optimize the attack's parameters. We demonstrate for Stable Diffusion models that such an attacker can break all five surveyed watermarking methods at negligible degradation in image quality. These findings emphasize the need for more rigorous robustness testing against adaptive, learnable attackers.
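A heavily hedged sketch of the surrogate-key idea: everything below is hypothetical scaffolding (the random linear `surrogate` stands in for a locally trained, differentiable replica of the secret decoder), and only the overall strategy of optimizing a bounded perturbation against a differentiable surrogate follows the abstract.

```python
import torch

def adaptive_attack(image, surrogate, steps=200, eps=4 / 255, lr=0.01):
    """Optimize a bounded perturbation against a differentiable surrogate of
    the watermark decoder, pushing every decoded bit toward chance level."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = surrogate(image + delta)
        loss = logits.abs().mean()        # drive per-bit confidences to zero
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)      # L-infinity budget preserves image quality
    return (image + delta).clamp(0, 1).detach()

torch.manual_seed(0)
W = torch.randn(32, 3 * 64 * 64) / 100               # hypothetical 32-bit decoder
surrogate = lambda x: x.flatten(1) @ W.t()
watermarked = torch.rand(1, 3, 64, 64)
attacked = adaptive_attack(watermarked, surrogate)
# Fraction of bits whose decoded confidence dropped under the attack:
print((surrogate(attacked).abs() < surrogate(watermarked).abs()).float().mean())
```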
Beyond Tides and Time: Machine Learning Triumph in Water Quality
results: The LightGBM model performs best, achieving the highest average precision; tree-based models excel at the regression problems, while the MLP neural network is sensitive to feature scaling.
Abstract
Water resources are essential for sustaining human livelihoods and environmental well-being. Accurate water quality prediction plays a pivotal role in effective resource management and pollution mitigation. In this study, we assess the effectiveness of five distinct predictive models: linear regression, Random Forest, XGBoost, LightGBM, and an MLP neural network, in forecasting pH values within the geographical context of Georgia, USA. Notably, LightGBM emerges as the top-performing model, achieving the highest average precision. Our analysis underscores the supremacy of tree-based models in addressing regression challenges, while revealing the sensitivity of MLP neural networks to feature scaling. Intriguingly, our findings shed light on a counterintuitive discovery: machine learning models, which do not explicitly account for time dependencies and spatial considerations, outperform spatio-temporal models. This unexpected superiority of machine learning models challenges conventional assumptions and highlights their potential for practical applications in water quality prediction. Our research aims to establish a robust predictive pipeline accessible to both data science experts and those without domain-specific knowledge. In essence, we present a novel perspective on achieving high prediction accuracy and interpretability in data science methodologies. Through this study, we redefine the boundaries of water quality forecasting, emphasizing the significance of data-driven approaches over traditional spatio-temporal models. Our findings offer valuable insights into the evolving landscape of water resource management and environmental protection.
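As a minimal, reproducible template of the comparison described above (synthetic data standing in for the Georgia measurements, and scikit-learn's HistGradientBoostingRegressor standing in for LightGBM/XGBoost to avoid extra dependencies):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water-quality features and the pH target.
X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    # scikit-learn stand-in for the LightGBM/XGBoost family of models.
    "hist_gbdt": HistGradientBoostingRegressor(random_state=0),
    # The MLP is wrapped with scaling; the study observes that MLPs,
    # unlike tree-based models, are sensitive to feature scaling.
    "mlp": make_pipeline(StandardScaler(),
                         MLPRegressor(max_iter=2000, random_state=0)),
}

for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name:>14}: RMSE = {rmse:.2f}")
```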
results: Experimental results show that NeuIM achieves accurate and efficient electromagnetic transient simulation, and can still correctly predict the induction machine dynamics even in the absence of data.
Abstract
This rapid communication devises a Neural Induction Machine (NeuIM) model, which pilots the use of physics-informed machine learning to enable AI-based electromagnetic transient simulations. The contributions are threefold: (1) a formation of NeuIM to represent the induction machine in phase domain; (2) a physics-informed neural network capable of capturing fast and slow IM dynamics even in the absence of data; and (3) a data-physics-integrated hybrid NeuIM approach which is adaptive to various levels of data availability. Extensive case studies validate the efficacy of NeuIM and in particular, its advantage over purely data-driven approaches.
G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks
for: Provides a comprehensive evaluation framework for Graph Neural Network (GNN)-based Boolean Satisfiability Problem (SAT) solvers.
methods: Benchmarks a broad range of GNN models, covering different prediction tasks, training objectives, and inference algorithms, and compares them.
results: Shows that GNN models can effectively learn a solving strategy akin to greedy local search, but struggle to learn backtracking search in the latent space.
Abstract
Graph neural networks (GNNs) have recently emerged as a promising approach for solving the Boolean Satisfiability Problem (SAT), offering potential alternatives to traditional backtracking or local search SAT solvers. However, despite the growing volume of literature in this field, there remains a notable absence of a unified dataset and a fair benchmark to evaluate and compare existing approaches. To address this crucial gap, we present G4SATBench, the first benchmark study that establishes a comprehensive evaluation framework for GNN-based SAT solvers. In G4SATBench, we meticulously curate a large and diverse set of SAT datasets comprising 7 problems with 3 difficulty levels and benchmark a broad range of GNN models across various prediction tasks, training objectives, and inference algorithms. To explore the learning abilities and comprehend the strengths and limitations of GNN-based SAT solvers, we also compare their solving processes with the heuristics in search-based SAT solvers. Our empirical results provide valuable insights into the performance of GNN-based SAT solvers and further suggest that existing GNN models can effectively learn a solving strategy akin to greedy local search but struggle to learn backtracking search in the latent space.
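To make the input representation concrete, the sketch below parses a DIMACS CNF into a literal-clause graph, one common bipartite encoding fed to GNN-based SAT solvers; the exact encodings benchmarked in G4SATBench may differ in detail.

```python
def cnf_to_lcg(dimacs_lines):
    """Parse DIMACS CNF into a literal-clause graph (LCG): one node per
    literal, one per clause, edges between a clause and its literals,
    plus edges linking each literal to its negation."""
    n_vars, clauses = 0, []
    for line in dimacs_lines:
        tok = line.split()
        if not tok or tok[0] == "c":
            continue                       # skip comments and blank lines
        if tok[0] == "p":
            n_vars = int(tok[2])           # "p cnf <n_vars> <n_clauses>"
            continue
        clauses.append([int(t) for t in tok if t != "0"])

    def lit_node(lit):                     # literals occupy nodes 0 .. 2*n_vars-1
        return 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)

    edges = [(lit_node(l), 2 * n_vars + c)
             for c, clause in enumerate(clauses) for l in clause]
    edges += [(2 * v, 2 * v + 1) for v in range(n_vars)]  # literal/negation links
    return n_vars, len(clauses), edges

# (x1 v ~x2) & (x2 v x3) & (~x1 v ~x3)
dimacs = ["p cnf 3 3", "1 -2 0", "2 3 0", "-1 -3 0"]
print(cnf_to_lcg(dimacs))
```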
Symmetry Leads to Structured Constraint of Learning
results: The paper shows that symmetries can lead to sparsity, low-rankness, and homogeneous ensembling in neural networks, and can be used to design gradient-descent algorithms that enforce hard constraints in a differentiable way.
Abstract
Due to common architecture designs, symmetries exist extensively in contemporary neural networks. In this work, we unveil the importance of the loss function symmetries in affecting, if not deciding, the learning behavior of machine learning models. We prove that every mirror symmetry of the loss function leads to a structured constraint, which becomes a favored solution when either the weight decay or gradient noise is large. As direct corollaries, we show that rescaling symmetry leads to sparsity, rotation symmetry leads to low rankness, and permutation symmetry leads to homogeneous ensembling. Then, we show that the theoretical framework can explain the loss of plasticity and various collapse phenomena in neural networks and suggest how symmetries can be used to design algorithms to enforce hard constraints in a differentiable way.
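The rescaling-symmetry case admits a one-line illustration in the familiar factorized setting. Suppose the loss depends on parameters $u,v$ only through $w=uv$, so it is invariant under $(u,v)\mapsto(\lambda u, v/\lambda)$; minimizing the weight-decay term over this symmetry orbit gives

```latex
\min_{\lambda>0}\ \frac{\gamma}{2}\left(\lambda^{2}u^{2}+\frac{v^{2}}{\lambda^{2}}\right)
  \;=\; \gamma\,|uv| \;=\; \gamma\,|w|,
```

that is, $L_2$ regularization on the factors acts as an $L_1$ penalty on the effective parameter $w$, which is sparsity-inducing. This is only the textbook instance of the sparsity claim above, not the paper's general derivation.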
Unlabeled Out-Of-Domain Data Improves Generalization
results: Using the proposed method yields substantially improved generalization error bounds compared to ERM.
Abstract
We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, where scenarios involving the minimization of either i) adversarially robust or ii) non-robust loss functions have been considered. Notably, we allow the unlabeled samples to deviate slightly (in total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples is given as well. Using only the labeled data, it is known that the generalization error can be bounded by $\propto\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
paper_authors: Pavel Sinha, Ioannis Psaromiligkos, Zeljko Zilic
for: Automatic segmentation of lumen and media in IntraVascular ultra-sound (IVUS) images.
methods: Closed polygonal chains, an adaptive-subband-decomposition CNN, a Jaccard Measure (JM) loss function, and a Mean Squared Error (MSE) loss function.
results: Outperforms state-of-the-art lumen and media segmentation methods under the JM and Hausdorff Distance (HD) metrics.
Abstract
We propose an automatic segmentation method for lumen and media with irregular contours in IntraVascular ultra-sound (IVUS) images. In contrast to most approaches that broadly label each pixel as either lumen, media, or background, we propose to approximate the lumen and media contours by closed polygonal chains. The chain vertices are placed at fixed angles obtained by dividing the entire 360° angular space into equally spaced angles, and we predict their radius using an adaptive-subband-decomposition CNN. We consider two loss functions during training. The first is a novel loss function using the Jaccard Measure (JM) to quantify the similarities between the predicted lumen and media segments and the corresponding ground-truth image segments. The second loss function is the traditional Mean Squared Error. The proposed architecture significantly reduces computational costs by replacing the popular auto-encoder structure with a simple CNN as the encoder, while the decoder is reduced to simply joining the consecutive predicted points. We evaluated our network on the publicly available IVUS-Challenge-2011 dataset using two performance metrics, namely JM and Hausdorff Distance (HD). The evaluation results show that our proposed network mostly outperforms the state-of-the-art lumen and media segmentation methods.
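A small sketch of the contour parameterization described above, assuming nothing beyond the abstract: vertices sit at equally spaced angles and only the per-angle radius is predicted, so decoding a contour is just a polar-to-Cartesian conversion.

```python
import numpy as np

def chain_vertices(radii, center=(0.0, 0.0)):
    """Turn per-angle radius predictions into a closed polygonal chain.

    Vertices sit at n equally spaced angles covering 360 degrees; only the
    radius at each angle comes from the network, as in the paper.
    """
    n = len(radii)
    angles = 2 * np.pi * np.arange(n) / n
    cx, cy = center
    return np.stack([cx + radii * np.cos(angles),
                     cy + radii * np.sin(angles)], axis=1)

def jaccard(mask_a, mask_b):
    """Jaccard Measure between two boolean masks (the JM loss in the paper
    is a differentiable surrogate; here we just evaluate overlap)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return inter / np.logical_or(mask_a, mask_b).sum()

# Example: a 32-vertex chain approximating a slightly elliptic lumen.
radii = 10 + 2 * np.cos(2 * 2 * np.pi * np.arange(32) / 32)
print(chain_vertices(radii)[:3])
```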
Prescanning Assembly Optimization Criteria for Computed Tomography
results: Optimizing the sample location placement avoids up to a 50% root-mean-square error, a 16.5% loss of similarity index, and 40% scattering noise in the reconstructed image relative to the unoptimized placement, and helps automate the computed tomography assembly, saving time, dosage, and operational cost.
Abstract
Computerized Tomography assembly and system configuration are optimized for enhanced invertibility in sparse data reconstruction. Assembly generating maximum principal components/condition number of weight matrix is designated as best configuration. The gamma CT system is used for testing. The unoptimized sample location placement with 7.7% variation results in a maximum 50% root mean square error, 16.5% loss of similarity index, and 40% scattering noise in the reconstructed image relative to the optimized sample location when the proposed criteria are used. The method can help to automate the CT assembly, resulting in relatively artifact-free recovery and reducing the iteration to figure out the best scanning configuration for a given sample size, thus saving time, dosage, and operational cost.
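The selection criterion above reduces to linear algebra on the system weight matrix; a minimal sketch (illustrative tolerance, toy matrices) is:

```python
import numpy as np

def assembly_score(W, tol=1e-4):
    """Score a candidate CT assembly by the conditioning of its weight matrix.

    Following the criterion above, the assembly whose weight matrix has the
    most significant principal components (and the most benign condition
    number) is designated the best configuration. 'tol' is an illustrative
    numerical tolerance, not a value from the paper.
    """
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    n_components = int((s > tol * s[0]).sum())    # significant principal components
    condition = s[0] / s[-1]
    return n_components, condition

# Toy comparison of two 'assemblies' (rows = ray measurements, cols = pixels).
rng = np.random.default_rng(0)
diverse = rng.normal(size=(64, 64))
redundant = diverse.copy()
redundant[32:] = redundant[:32] + 1e-6 * rng.normal(size=(32, 64))  # near-duplicate rays
for name, W in [("diverse rays", diverse), ("redundant rays", redundant)]:
    rank, cond = assembly_score(W)
    print(f"{name}: significant components = {rank}, condition number = {cond:.2e}")
```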
FreqAlign: Excavating Perception-oriented Transferability for Blind Image Quality Assessment from A Frequency Perspective
results: Proposes an effective frequency alignment strategy (FreqAlign) that studies the perception-oriented transferability of different frequency components and selects the most suitable components for alignment, thereby improving the transferability of BIQA.
Abstract
Blind Image Quality Assessment (BIQA) is susceptible to poor transferability when the distribution shift occurs, e.g., from synthesis degradation to authentic degradation. To mitigate this, some studies have attempted to design unsupervised domain adaptation (UDA) based schemes for BIQA, which intends to eliminate the domain shift through adversarial-based feature alignment. However, the feature alignment is usually taken at the low-frequency space of features since the global average pooling operation. This ignores the transferable perception knowledge in other frequency components and causes the sub-optimal solution for the UDA of BIQA. To overcome this, from a novel frequency perspective, we propose an effective alignment strategy, i.e., Frequency Alignment (dubbed FreqAlign), to excavate the perception-oriented transferability of BIQA in the frequency space. Concretely, we study what frequency components of features are more proper for perception-oriented alignment. Based on this, we propose to improve the perception-oriented transferability of BIQA by performing feature frequency decomposition and selecting the frequency components that contained the most transferable perception knowledge for alignment. To achieve a stable and effective frequency selection, we further propose the frequency movement with a sliding window to find the optimal frequencies for alignment, which is composed of three strategies, i.e., warm up with pre-training, frequency movement-based selection, and perturbation-based finetuning. Extensive experiments under different domain adaptation settings of BIQA have validated the effectiveness of our proposed method. The code will be released at https://github.com/lixinustc/Openworld-IQA.
results: Simulation results show that the proposed estimators are superior to state-of-the-art one-bit estimators, and that the more diverse the structured sparsity exploited, the better the estimation performance achieved.
Abstract
Recently, intelligent reflecting surface (IRS)-assisted communication has gained considerable attention due to its advantage in extending the coverage and compensating the path loss with low-cost passive metasurface. This paper considers the uplink channel estimation for IRS-aided multiuser massive MISO communications with one-bit ADCs at the base station (BS). The use of one-bit ADC is impelled by the low-cost and power efficient implementation of massive antennas techniques. However, the passiveness of IRS and the lack of signal level information after one-bit quantization make the IRS channel estimation challenging. To tackle this problem, we exploit the structured sparsity of the user-IRS-BS cascaded channels and develop three channel estimators, each of which utilizes the structured sparsity at different levels. Specifically, the first estimator exploits the elementwise sparsity of the cascaded channel and employs the sparse Bayesian learning (SBL) to infer the channel responses via the type-II maximum likelihood (ML) estimation. However, due to the one-bit quantization, the type-II ML in general is intractable. As such, a variational expectation-maximization (EM) algorithm is custom-derived to iteratively compute an ML solution. The second estimator utilizes the common row-structured sparsity induced by the IRS-to-BS channel shared among the users, and develops another type-II ML solution via the block SBL (BSBL) and the variational EM. To further improve the performance of BSBL, a third two-stage estimator is proposed, which can utilize both the common row-structured sparsity and the column-structured sparsity arising from the limited scattering around the users. Simulation results show that the more diverse structured sparsity is exploited, the better estimation performance is achieved, and that the proposed estimators are superior to state-of-the-art one-bit estimators.
Conformal Metamaterials with Active Tunability and Self-adaptivity for Magnetic Resonance Imaging
results: The study obtains a metamaterial structure that can flexibly tune its resonance frequency and selectively enhance the magnetic field, enabling wide-ranging applications in clinical MRI and resolving several problems of conventional metamaterials in imaging.
Abstract
Ongoing effort has been devoted to applying metamaterials to boost the imaging performance of magnetic resonance imaging owing to their unique capacity for electromagnetic field confinement and enhancement. However, there are still major obstacles to widespread clinical adoption of conventional metamaterials due to several notable restrictions, namely: their typically bulky and rigid structures, deviations in their optimal resonance frequency, and their inevitable interference with the transmission RF field in MRI. Herein, we address these restrictions and report a conformal, smart metamaterial, which may not only be readily tuned to achieve the desired, precise frequency match with MRI by a controlling circuit, but is also capable of selectively amplifying the magnetic field during the RF reception phase by sensing the excitation signal strength passively, thereby remaining off during the RF transmission phase and thereby ensuring its optimal performance when applied to MRI as an additive technology. By addressing a host of current technological challenges, the metamaterial presented herein paves the way toward the wide-ranging utilization of metamaterials in clinical MRI, thereby translating this promising technology to the MRI bedside.
Overview of Use Cases in Single Channel Full Duplex Techniques for Satellite Communication
paper_authors: Victor Monzon Baeza, Steven Kisseleff, Jorge Luis González Rios, Juan Andrés Vasquez-Peralvo, Carlos Mosquera, Roberto López Valcarce, Tomás Ramírez Parracho, Pablo Losada Sanisidro, Juan Carlos Merlano Duncan, Symeon Chatzinotas
for: This study provides an overview of the diverse applications and use cases of single-channel full-duplex techniques in satellite communication systems.
methods: Single-channel full-duplex techniques, which enable simultaneous transmission and reception on a single frequency channel.
results: Eight potential use cases are selected and given a preliminary assessment; the preliminary results indicate that single-channel full-duplex techniques could be transformative across a range of critical domains.
Abstract
This paper provides an overview of the diverse range of applications and use cases for Single-Channel Full-Duplex (SCFD) techniques within the field of satellite communication. SCFD, allowing simultaneous transmission and reception on a single frequency channel, presents a transformative approach to enhancing satellite communication systems. We select eight potential use cases with the objective of highlighting the substantial potential of SCFD techniques in revolutionizing SatCom across a multitude of critical domains. In addition, preliminary results from the qualitative assessment are shown. This work is carried out within the European Space Agency (ESA) ongoing activity FDSAT: Single Channel Full Duplex Techniques for Satellite Communications.
Meta Reinforcement Learning for Fast Spectrum Sharing in Vehicular Networks
paper_authors: Kai Huang, Le Liang, Shi Jin, Geoffrey Ye Li
for: This paper studies fast spectrum sharing in vehicle-to-everything communication, with the aim of improving the spectrum efficiency of the whole system.
methods: The authors model the problem with deep reinforcement learning and tackle it with proximal policy optimization.
results: The method rapidly adapts to new tasks, reducing the number of interactions and the training time; numerical results show near-optimal performance and rapid convergence.
Abstract
In this paper, we investigate the problem of fast spectrum sharing in vehicle-to-everything communication. In order to improve the spectrum efficiency of the whole system, the spectrum of vehicle-to-infrastructure links is reused by vehicle-to-vehicle links. To this end, we model it as a problem of deep reinforcement learning and tackle it with proximal policy optimization. A considerable number of interactions are often required for training an agent with good performance, so simulation-based training is commonly used in communication networks. Nevertheless, severe performance degradation may occur when the agent is directly deployed in the real world, even though it can perform well on the simulator, due to the reality gap between the simulation and the real environments. To address this issue, we make preliminary efforts by proposing an algorithm based on meta reinforcement learning. This algorithm enables the agent to rapidly adapt to a new task with the knowledge extracted from similar tasks, leading to fewer interactions and less training time. Numerical results show that our method achieves near-optimal performance and exhibits rapid convergence.
D-Band 2D MIMO FMCW Radar System Design for Indoor Wireless Sensing
results: For a 64-element array, the MUSIC algorithm provides lower root-mean-square error (RMSE) than MVDR across 1-10 m indoor distances and 0-30 dB SNR, while for a 16-element array the two algorithms perform comparably and for a 4-element array MVDR outperforms MUSIC by a large margin. The scaling of the radar receiver (RX) SNR and transmitter (TX) output power with target distance is also investigated and benchmarked against state-of-the-art D-band on-chip radars.
Abstract
In this article, we present the system design of a D-band multi-input multi-output (MIMO) frequency-modulated continuous-wave (FMCW) radar for indoor wireless sensing. A uniform rectangular array (URA) of radar elements is used for 2D direction-of-arrival (DOA) estimation. The DOA estimation accuracy of the MIMO radar array in the presence of noise is evaluated using the multiple-signal classification (MUSIC) and the minimum variance distortionless response (MVDR) algorithms. We investigate different scaling scenarios for the radar receiver (RX) SNR and the transmitter (TX) output power with the target distance. The DOA estimation algorithm providing the highest accuracy and shortest simulation time is shown to depend on the size of the radar array. Specifically, for a 64-element array, MUSIC achieves lower root-mean-square error (RMSE) compared to MVDR across 1-10 m indoor distances and 0-30 dB SNR (e.g., 0.8°/0.3° versus 1.0°/0.5° at 10/20 dB SNR and 5 m distance) at 0.5x the simulation time. For a 16-element array, the two algorithms provide comparable performance, while for a 4-element array, MVDR outperforms MUSIC by a large margin (e.g., 8.3°/3.8° versus 62.2°/48.8° at 10/20 dB SNR and 5 m distance) at 0.8x the simulation time. Furthermore, the TX output power requirement of the radar array is investigated in free-space and through-wall wireless sensing scenarios, and is benchmarked against state-of-the-art D-band on-chip radars.
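For readers who want the MUSIC step concrete, here is a minimal one-dimensional sketch (half-wavelength uniform linear array; the paper's 2D URA case factorizes similarly per axis). Array sizes, SNR, and directions are illustrative only.

```python
import numpy as np

def music_spectrum(snapshots, n_sources, scan_deg):
    """MUSIC pseudospectrum for a half-wavelength uniform linear array."""
    n = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]  # sample covariance
    eigvals, eigvecs = np.linalg.eigh(R)                     # eigenvalues ascending
    En = eigvecs[:, : n - n_sources]                         # noise subspace
    k = np.arange(n)
    p = []
    for theta in np.deg2rad(scan_deg):
        a = np.exp(1j * np.pi * k * np.sin(theta))           # steering vector
        p.append(1.0 / np.linalg.norm(En.conj().T @ a) ** 2)
    return np.array(p)

rng = np.random.default_rng(0)
n_ant, n_snap = 16, 200
doas = np.deg2rad([-20.0, 35.0])                             # ground-truth directions
A = np.exp(1j * np.pi * np.outer(np.arange(n_ant), np.sin(doas)))
s = (rng.normal(size=(2, n_snap)) + 1j * rng.normal(size=(2, n_snap))) / np.sqrt(2)
noise = (rng.normal(size=(n_ant, n_snap))
         + 1j * rng.normal(size=(n_ant, n_snap))) * np.sqrt(0.05)  # 10 dB SNR
x = A @ s + noise
scan = np.arange(-90.0, 90.0, 0.25)
p = music_spectrum(x, n_sources=2, scan_deg=scan)
peaks = np.where((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:]))[0] + 1    # local maxima
best = peaks[np.argsort(p[peaks])[-2:]]
print("estimated DOAs (deg):", np.sort(scan[best]))
```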
White Paper on Radio Channel Modeling and Prediction to Support Future Environment-aware Wireless Communication Systems
paper_authors: Mate Boban, Vittorio Degli-Esposti
for: The paper provides an overview of the state-of-the-art in radio channel measurement and modeling, and identifies the key challenges that need to be addressed to support the development of 6G networks.
methods: Channel sounder design, metrology, and measurement methodologies, together with measurements, modeling, and systematic dataset collection and analysis.
results: A summary of the state-of-the-art in radio channel measurement and modeling, and the key challenges the scientific community will need to address for 6G, including a paradigm shift in channel measurements and modeling, wider frequency ranges, and better support for diverse and highly cluttered environments.
Abstract
COST INTERACT working group (WG)1 aims at increasing the theoretical and experimental understanding of radio propagation and channels in environments of interest and at deriving models for design, simulation, planning and operation of future wireless systems. Wide frequency ranges from sub-GHz to terahertz (THz), potentially high mobility, diverse and highly cluttered environments, dense networks, massive antenna systems, and the use of intelligent surfaces, are some of the challenges for radio channel measurements and modeling for next generation systems. As indicated in [1], with increased number of use cases (e.g., those identified by one6G [2] and shown in Fig. 1) to be supported and a larger number of frequency bands, a paradigm shift in channel measurements and modeling will be required. To address the particular challenges that come with such a paradigm shift, WG1 started the work on relevant topics, ranging from channel sounder design, metrology and measurement methodologies, measurements, modeling, and systematic dataset collection and analysis. In addition to the core activities of WG1, based on the strong interest of the participants, two sub-working groups (subWGs) have been initiated as part of WG1: i) subWG1.1 on millimeter-wave (mmWave) and THz sounding (subWG THz) and ii) subWG1.2 on propagation aspects related to reconfigurable intelligent surfaces (RIS) (subWG RIS). This white paper has two main goals: i) it summarizes the state-of-theart in radio channel measurement and modeling and the key challenges that the scientific community will have to face over the next years to support the development of 6G networks, as identified by WG1 and its subWGs; and ii) it charts the main directions for the work of WG1 and subWGs for the remainder of COST INTERACT duration (i.e., until October 2025).
Double-Layer Power Control for Mobile Cell-Free XL-MIMO with Multi-Agent Reinforcement Learning
results: Comparing various MARL algorithms shows that the proposed MARL-based algorithm effectively balances spectral efficiency (SE) performance and convergence time. Moreover, compared to a single-layer architecture, the proposed double-layer power control architecture achieves a nearly 24% SE performance improvement, especially with massive antennas and smaller antenna spacing.
Abstract
Cell-free (CF) extremely large-scale multiple-input multiple-output (XL-MIMO) is regarded as a promising technology for enabling future wireless communication systems. Significant attention has been generated by its considerable advantages in augmenting degrees of freedom. In this paper, we first investigate a CF XL-MIMO system with base stations equipped with XL-MIMO panels under a dynamic environment. Then, we propose an innovative multi-agent reinforcement learning (MARL)-based power control algorithm that incorporates predictive management and a distributed optimization architecture, which provides a dynamic strategy for addressing high-dimension signal processing problems. Specifically, we compare various MARL-based algorithms, which shows that the proposed MARL-based algorithm effectively strikes a balance between spectral efficiency (SE) performance and convergence time. Moreover, we consider a double-layer power control architecture based on the large-scale fading coefficients between antennas to suppress interference within dynamic systems. Compared to the single-layer architecture, the results obtained unveil that the proposed double-layer architecture has a nearly 24% SE performance improvement, especially with massive antennas and smaller antenna spacing.
Energy-Efficient Secure Offloading System Designed via UAV-Mounted Intelligent Reflecting Surface for Resilience Enhancement
results: By optimizing the transmit power of the ground user devices, the trajectory and phase-shift matrix of the UAV-mounted IRS, and the offloading ratio between local execution and edge computing with successive convex approximation (SCA) algorithms, the system provides considerable energy savings compared with local execution and partial optimizations.
Abstract
With increasing interest in mmWave and THz communication systems, an unmanned aerial vehicle (UAV)-mounted intelligent reflecting surface (IRS) has been suggested as a key enabling technology to establish robust line-of-sight (LoS) connections with ground nodes owing to their free mobility and high altitude, especially for emergency and disaster response. This paper investigates a secure offloading system, where the UAV-mounted IRS assists the offloading procedures between ground users and an access point (AP) acting as an edge cloud. In this system, the users except the intended recipients in the offloading process are considered as potential eavesdroppers. The system aims to achieve the minimum total energy consumption of battery-limited ground user devices under constraints for secure offloading accomplishment and operability of UAV-mounted IRS, which is done by optimizing the transmit power of ground user devices, the trajectory and phase shift matrix of UAV-mounted IRS, and the offloading ratio between local execution and edge computing based on the successive convex approximation (SCA) algorithms. Numerical results show that the proposed algorithm can provide the considerable energy savings compared with local execution and partial optimizations.
Tree based Single LED Indoor Visible Light Positioning Technique
results: The technique achieves highly accurate three-dimensional positioning, with a mean 3D positioning error of 2.88 centimeters, lower than the 6.26 centimeters of the closest competitor.
Abstract
Visible light positioning (VLP) has gained prominence as a highly accurate indoor positioning technique. Few techniques consider the practical limitations of implementing VLP systems for indoor positioning. These limitations range from having a single LED in the field of view (FoV) of the image sensor to not having enough images for training deep learning techniques. Practical implementation of indoor positioning techniques needs to leverage the ubiquity of smartphones, which is the case with VLP using complementary metal oxide semiconductor (CMOS) sensors. Images for VLP can be gathered only after the lights in question have been installed, making it a cumbersome process. These limitations are addressed in the proposed technique, which uses simulated data of a single LED to train machine learning models and tests them on actual images captured from a similar experimental setup. Such testing produced a mean three-dimensional (3D) positioning error of 2.88 centimeters, while training with real images achieves an accuracy of less than one centimeter, compared to 6.26 centimeters for the closest competitor.
paper_authors: Enis Berk Çoban, Megan Perra, Michael I. Mandel
for: This paper aims to improve the accuracy of weather predictions in wildlife research using acoustic data and satellite data.
methods: The paper uses machine learning algorithms to train acoustic classifiers using satellite data from the MERRA-2 system, and then uses these classifiers to predict rain, wind, and air temperature at different thresholds.
results: The paper finds that acoustic classifiers trained using MERRA-2 data are more accurate than the raw MERRA-2 data itself, and that using MERRA-2 to roughly identify rain in the acoustic data allows for the production of a functional model without the need for human-validated labels.
Abstract
Across various research domains, remotely-sensed weather products are valuable for answering many scientific questions; however, their temporal and spatial resolutions are often too coarse to answer many questions. For instance, in wildlife research, it's crucial to have fine-scaled, highly localized weather observations when studying animal movement and behavior. This paper harnesses acoustic data to identify variations in rain, wind and air temperature at different thresholds, with rain being the most successfully predicted. Training a model solely on acoustic data yields optimal results, but it demands labor-intensive sample labeling. Meanwhile, hourly satellite data from the MERRA-2 system, though sufficient for certain tasks, produced predictions that were notably less accurate in predict these acoustic labels. We find that acoustic classifiers can be trained from the MERRA-2 data that are more accurate than the raw MERRA-2 data itself. By using MERRA-2 to roughly identify rain in the acoustic data, we were able to produce a functional model without using human-validated labels. Since MERRA-2 has global coverage, our method offers a practical way to train rain models using acoustic datasets around the world.
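A minimal sketch of the weak-labeling recipe, with everything synthetic and hypothetical except the idea itself: threshold the MERRA-2 precipitation product to label audio clips, then train an ordinary classifier on acoustic features, so no human-validated labels are needed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: per-clip acoustic features (e.g., mel-band statistics)
# and the MERRA-2 hourly precipitation matched to each clip's time/location.
rng = np.random.default_rng(0)
n_clips = 2000
acoustic_features = rng.normal(size=(n_clips, 40))
merra2_precip_mm = rng.gamma(shape=0.3, scale=2.0, size=n_clips)

# Make the toy features weakly informative about precipitation.
acoustic_features[:, 0] += 2.0 * (merra2_precip_mm > 1.0)

# Weak labels: threshold the satellite product instead of human annotation.
# The 1 mm/h threshold is illustrative, not the study's value.
rain_label = (merra2_precip_mm > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, acoustic_features, rain_label, cv=5).mean()
print(f"rain/no-rain accuracy with satellite-derived labels: {acc:.2f}")
```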
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
results: By using sentence- and word-level boundaries from the ASR module to support speaker turn detection, the paper achieves a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
Abstract
We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
Efficient Supervised Training of Audio Transformers for Music Representation Learning
results: The study finds that initializing the model with ImageNet or AudioSet weights and using longer input segments are both beneficial, that the learned representations from the middle blocks of the transformer are best for the downstream tasks, and that applying patchout at inference speeds up feature extraction without degrading performance.
Abstract
In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training task. We assess the impact of initializing the models with different pre-trained weights, using various input audio segment lengths, using learned representations from different blocks and tokens of the transformer for downstream tasks, and applying patchout at inference to speed up feature extraction. We find that 1) initializing the model from ImageNet or AudioSet weights and using longer input segments are beneficial both for the training and downstream tasks, 2) the best representations for the considered downstream tasks are located in the middle blocks of the transformer, and 3) using patchout at inference allows faster processing than our convolutional baselines while maintaining superior performance. The resulting models, MAEST, are publicly available and obtain the best performance among open models in music tagging tasks.
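Two of the findings above, taking representations from a middle transformer block and applying patchout at inference, can be sketched as follows. The `blocks` list, the token shapes, and the pooling are assumptions rather than the MAEST API.

```python
import torch

def extract_features(tokens, blocks, mid_block=6, keep_ratio=0.5):
    # tokens: (B, N, D) patch embeddings; blocks: list of transformer blocks.
    B, N, D = tokens.shape
    keep = torch.randperm(N)[: int(N * keep_ratio)]  # inference-time patchout
    x = tokens[:, keep, :]
    for blk in blocks[:mid_block]:  # stop at a middle block, per the findings
        x = blk(x)
    return x.mean(dim=1)            # pooled embedding for downstream tasks
```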
Audio Visual Speaker Localization from EgoCentric Views
paper_authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Wenwu Wang
for: This paper studies estimating the speaker's direction of arrival (DOA) from egocentric audio and video.
methods: The paper uses a transformer model to fuse the audio and visual data, and proposes a training strategy to address the problem of the speaker disappearing from the camera's field of view.
results: Experimental results show that the proposed method achieves promising tracking accuracy on the new dataset. The paper also adapts the method to the multi-speaker scenario and achieves state-of-the-art results on EasyCom.
Abstract
The use of audio and visual modalities for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike them, in this work, we explore the ego-centric setting, where the heterogeneous sensors are embodied and could be moving with a human to facilitate speaker localization. Compared to the static scenario, the ego-centric setting is more realistic for smart-home applications, e.g., a service robot. However, this also brings new challenges such as blurred images, frequent speaker disappearance from the field of view of the wearer, and occlusions. In this paper, we study egocentric audio-visual speaker DOA estimation and deal with the challenges mentioned above. Specifically, we propose a transformer-based audio-visual fusion method to estimate the relative DOA of the speaker to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating the out-of-view scenarios, by creating a scene with a camera wearer walking around while a speaker is moving at the same time. The experimental results show that our proposed method offers promising performance in this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method for the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, which achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.
Predicting performance difficulty from piano sheet music images
results: The approach is evaluated on five datasets comprising more than 7,500 scores with up to 9 difficulty levels. After fine-tuning on the considered datasets, the model achieves its best performance, with a balanced accuracy of 40.34% and a mean square error of 1.33.
Abstract
Estimating the performance difficulty of a musical score is crucial in music education for adequately designing the learning curriculum of the students. Although the Music Information Retrieval community has recently shown interest in this task, existing approaches mainly use machine-readable scores, leaving the broader case of sheet music images unaddressed. Based on previous works involving sheet music images, we use a mid-level representation, bootleg score, describing notehead positions relative to staff lines coupled with a transformer model. This architecture is adapted to our task by introducing an encoding scheme that reduces the encoded sequence length to one-eighth of the original size. In terms of evaluation, we consider five datasets -- more than 7500 scores with up to 9 difficulty levels -- , two of them particularly compiled for this work. The results obtained when pretraining the scheme on the IMSLP corpus and fine-tuning it on the considered datasets prove the proposal's validity, achieving the best-performing model with a balanced accuracy of 40.34\% and a mean square error of 1.33. Finally, we provide access to our code, data, and models for transparency and reproducibility.
NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
results: Across three tasks (ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement), NOMAD performs competitively with full-reference audio metrics while outperforming other non-matching reference approaches. This shows that NOMAD can reliably assess degraded audio quality and learn perceptual embeddings that mimic human judgments of degraded audio.
Abstract
This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks for audio analysis e.g. quality assessment and generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated with 3 tasks. Ranking degradation intensity, predicting speech quality, and as a loss function for speech enhancement. Results indicate NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics human capabilities in assessing audio quality with non-matching references to learn perceptual embeddings without the need for human-generated labels.
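The training and inference logic reduces to a standard triplet objective over embeddings plus a Euclidean distance at test time. The sketch below assumes triplets have already been ordered offline by an NSIM-style degradation score, and `encoder` is a placeholder network.

```python
import torch
import torch.nn.functional as F

def triplet_loss(encoder, anchor, positive, negative, margin=0.2):
    # positive/negative: samples whose degradation intensity is closer/farther
    # to the anchor's, as judged by an NSIM-style score computed offline.
    za, zp, zn = encoder(anchor), encoder(positive), encoder(negative)
    d_pos = (za - zp).pow(2).sum(-1)
    d_neg = (za - zn).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

def nomad_distance(encoder, x, ref):
    # Inference: Euclidean distance between embeddings of any two clips,
    # usable with non-matching references.
    return (encoder(x) - encoder(ref)).pow(2).sum(-1).sqrt()
```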
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
results: Compared to a model trained solely with one-hot labels, the SPA model improves the OmAP evaluation metric by +1.8, and human evaluations show that the predictions of the SPA model are more consistent with human auditory perception.
Abstract
Most audio tagging models are trained with one-hot labels as supervised information. However, one-hot labels treat all sound events equally, ignoring the semantic hierarchy and proximity relationships between sound events. In contrast, the event descriptions contains richer information, describing the distance between different sound events with semantic proximity. In this paper, we explore the impact of training audio tagging models with auxiliary text descriptions of sound events. By aligning the audio features with the text features of corresponding labels, we inject the hierarchy and proximity information of sound events into audio encoders, improving the performance while making the prediction more consistent with human perception. We refer to this approach as Semantic Proximity Alignment (SPA). We use Ontology-aware mean Average Precision (OmAP) as the main evaluation metric for the models. OmAP reweights the false positives based on Audioset ontology distance and is more consistent with human perception compared to mAP. Experimental results show that the audio tagging models trained with SPA achieve higher OmAP compared to models trained with one-hot labels solely (+1.8 OmAP). Human evaluations also demonstrate that the predictions of SPA models are more consistent with human perception.
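The alignment idea can be sketched as an auxiliary loss that pulls audio embeddings toward frozen text embeddings of the corresponding label descriptions. The cosine formulation and the weighting `alpha` are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def spa_loss(audio_emb, label_text_emb, logits, targets, alpha=0.5):
    # audio_emb: (B, D) from the audio encoder; label_text_emb: (B, D) frozen
    # text features of the label descriptions (assumed precomputed).
    align = 1.0 - F.cosine_similarity(audio_emb, label_text_emb, dim=-1).mean()
    tagging = F.binary_cross_entropy_with_logits(logits, targets)
    return tagging + alpha * align  # tagging loss plus semantic alignment
```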
PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System
results: On the test set of the M2MeT2.0 challenge dataset, the system achieves a cp-CER of 11.27%, ranking first under both fixed and open training conditions.
Abstract
Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose the PP-MeT system, a real-world personalized prompt based meeting transcription system, which consists of a clustering system, target-speaker voice activity detection (TS-VAD), and TS-ASR. Specifically, we utilize target-speaker embedding as a prompt in the TS-VAD and TS-ASR modules in our proposed system. In contrast with previous systems, we fully leverage pre-trained models for system initialization, thereby bestowing our approach with heightened generalizability and precision. Experiments on the M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both fixed and open training conditions.
LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR
results: Experimental results show that, compared to the LAE-based CTC, the LAE-ST-MoE model reduces the mix error rate on the code-switching test set by 9.26%, and the well-trained model can translate code-switching speech into Mandarin or English text.
Abstract
Recently, to mitigate the confusion between different languages in code-switching (CS) automatic speech recognition (ASR), the conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. It incorporates speech translation (ST) tasks into LAE and utilizes ST to learn the contextual information between different languages. It introduces a task-based mixture of expert modules, employing separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset demonstrate that, compared to the LAE-based CTC, the LAE-ST-MoE model achieves a 9.26% mix error reduction on the CS test with the same decoding parameter. Moreover, the well-trained LAE-ST-MoE model can perform ST tasks from CS speech to Mandarin or English text.
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR
results: With CTC greedy decoding for inference (without a language model), the paper achieves state-of-the-art performance on the AISHELL-1 dataset, with character error rates (CERs) of 3.64% and 3.94% on the development and test sets, corresponding to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system.
Abstract
Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a connectionist temporal classification (CTC) based ASR system where hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, where the transformer attention is a special case of this Sinkhorn attention process. The CMKT learning is supposed to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with 3.64% and 3.94% character error rates (CERs) for the development and test sets, which correspond to relative improvements of 34.18% and 34.88% compared to the baseline CTC-ASR system, respectively.
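A minimal sketch of Sinkhorn attention as described above: the usual row softmax is replaced by a few alternating row and column normalizations in log space, making the attention matrix approximately doubly stochastic. With zero iterations this collapses to ordinary transformer attention, matching the remark that softmax attention is a special case. The temperature and iteration count are illustrative.

```python
import torch

def sinkhorn_attention(q, k, v, n_iters=3, tau=1.0):
    # q, k, v: (..., N, D) query/key/value tensors.
    log_p = (q @ k.transpose(-2, -1)) / (tau * q.shape[-1] ** 0.5)
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(-1, keepdim=True)  # normalize rows
        log_p = log_p - log_p.logsumexp(-2, keepdim=True)  # normalize columns
    log_p = log_p - log_p.logsumexp(-1, keepdim=True)      # final row softmax
    return log_p.exp() @ v
```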
results: The paper shows that PnP methods achieve state-of-the-art results in various imaging applications and provides a linear convergence analysis, based on the contraction mapping theorem, that explains this success.
Abstract
In plug-and-play (PnP) regularization, the proximal operator in algorithms such as ISTA and ADMM is replaced by a powerful denoiser. This formal substitution works surprisingly well in practice. In fact, PnP has been shown to give state-of-the-art results for various imaging applications. The empirical success of PnP has motivated researchers to understand its theoretical underpinnings and, in particular, its convergence. It was shown in prior work that for kernel denoisers such as the nonlocal means, PnP-ISTA provably converges under some strong assumptions on the forward model. The present work is motivated by the following questions: Can we relax the assumptions on the forward model? Can the convergence analysis be extended to PnP-ADMM? Can we estimate the convergence rate? In this letter, we resolve these questions using the contraction mapping theorem: (i) for symmetric denoisers, we show that (under mild conditions) PnP-ISTA and PnP-ADMM exhibit linear convergence; and (ii) for kernel denoisers, we show that PnP-ISTA and PnP-ADMM converge linearly for image inpainting. We validate our theoretical findings using reconstruction experiments.
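For concreteness, here is a minimal PnP-ISTA sketch for the inpainting case analyzed in the paper, with a Gaussian blur standing in for the kernel denoiser; the step size, iteration count, and denoiser choice are assumptions, not the paper's setup.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_ista(y, mask, step=1.0, iters=100, sigma=1.0):
    # Inpainting forward model: y = mask * x_true, with mask entries in {0, 1}.
    # Gradient of the data term 0.5 * ||mask * (x - y)||^2 is mask * (x - y).
    x = y.copy()
    for _ in range(iters):
        grad = mask * (x - y)
        x = gaussian_filter(x - step * grad, sigma)  # denoiser replaces prox
    return x
```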
Superpixel Transformers for Efficient Semantic Segmentation
results: Our method achieves state-of-the-art performance on Cityscapes and ADE20K while using fewer model parameters and achieving lower latency.
Abstract
Semantic segmentation, which aims to classify every pixel in an image, is a key task in machine perception, with many applications across robotics and autonomous driving. Due to the high dimensionality of this task, most existing approaches use local operations, such as convolutions, to generate per-pixel features. However, these methods are typically unable to effectively leverage global context information due to the high computational costs of operating on a dense image. In this work, we propose a solution to this issue by leveraging the idea of superpixels, an over-segmentation of the image, and applying them with a modern transformer framework. In particular, our model learns to decompose the pixel space into a spatially low dimensional superpixel space via a series of local cross-attentions. We then apply multi-head self-attention to the superpixels to enrich the superpixel features with global context and then directly produce a class prediction for each superpixel. Finally, we directly project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features. Reasoning in the superpixel space allows our method to be substantially more computationally efficient compared to convolution-based decoder methods. Yet, our method achieves state-of-the-art performance in semantic segmentation due to the rich superpixel features generated by the global self-attention mechanism. Our experiments on Cityscapes and ADE20K demonstrate that our method matches the state of the art in terms of accuracy, while outperforming in terms of model parameters and latency.
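A minimal sketch of the pool-attend-project pattern described above. The soft pixel-to-superpixel associations here come from a 1x1 convolution, whereas the paper learns them with a series of local cross-attentions, so all module choices are illustrative.

```python
import torch
import torch.nn as nn

class SuperpixelHead(nn.Module):
    def __init__(self, dim, n_superpixels, n_classes):
        super().__init__()
        self.assoc = nn.Conv2d(dim, n_superpixels, 1)  # placeholder associations
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, feats):                              # feats: (B, C, H, W)
        B, C, H, W = feats.shape
        A = self.assoc(feats).flatten(2).softmax(dim=1)    # (B, S, HW), per pixel
        x = feats.flatten(2).transpose(1, 2)               # (B, HW, C)
        sp = (A @ x) / (A.sum(-1, keepdim=True) + 1e-6)    # pool to superpixels
        sp, _ = self.attn(sp, sp, sp)                      # global self-attention
        logits = self.cls(sp)                              # per-superpixel classes
        pix = A.transpose(1, 2) @ logits                   # project back to pixels
        return pix.transpose(1, 2).reshape(B, -1, H, W)
```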
LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection
results: We evaluate our method on the Waymo Open Dataset and demonstrate improved 3D object detection over the baseline model, especially for the challenging category of large objects.
Abstract
We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to better capture the shapes and poses for challenging objects, compared with learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner. This is achieved by enforcing window-based attention blocks upon temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10$\times$. We also propose a stochastic-length FrameDrop training technique, which generalizes the model to variable frame lengths at inference for improved performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improvement on 3D object detection against the baseline model, especially for the challenging category of large objects.
Stochastic Digital Twin for Copy Detection Patterns
results: The results show that DDPM models perform well in digital twin applications and better capture the stochasticity of the printing-imaging process. The DDPM models are also evaluated effectively in the context of mobile phone data acquisition. Despite the increased complexity of DDPM methods compared to traditional approaches, the study highlights their advantages and explores their potential for future applications.
Abstract
Copy detection patterns (CDP) present an efficient technique for product protection against counterfeiting. However, the complexity of studying CDP production variability often results in time-consuming and costly procedures, limiting CDP scalability. Recent advancements in computer modelling, notably the concept of a "digital twin" for printing-imaging channels, allow for enhanced scalability and the optimization of authentication systems. Yet, the development of an accurate digital twin is far from trivial. This paper extends previous research which modelled a printing-imaging channel using a machine learning-based digital twin for CDP. This model, built upon an information-theoretic framework known as "Turbo", demonstrated superior performance over traditional generative models such as CycleGAN and pix2pix. However, the emerging field of Denoising Diffusion Probabilistic Models (DDPM) presents a potential advancement in generative models due to its ability to stochastically model the inherent randomness of the printing-imaging process, and its impressive performance in image-to-image translation tasks. This study aims at comparing the capabilities of the Turbo framework and DDPM on the same CDP datasets, with the goal of establishing the real-world benefits of DDPM models for digital twin applications in CDP security. Furthermore, the paper seeks to evaluate the generative potential of the studied models in the context of mobile phone data acquisition. Despite the increased complexity of DDPM methods when compared to traditional approaches, our study highlights their advantages and explores their potential for future applications.
Sketch2CADScript: 3D Scene Reconstruction from 2D Sketch using Visual Transformer and Rhino Grasshopper
results: Test results on two datasets show that the model reconstructs simple scenes with high accuracy but faces challenges in more complex ones.
Abstract
Existing 3D model reconstruction methods typically produce outputs in the form of voxels, point clouds, or meshes. However, each of these approaches has its limitations and may not be suitable for every scenario. For instance, the resulting model may exhibit a rough surface and distorted structure, making manual editing and post-processing challenging for humans. In this paper, we introduce a novel 3D reconstruction method designed to address these issues. We trained a visual transformer to predict a "scene descriptor" from a single wire-frame image. This descriptor encompasses crucial information, including object types and parameters such as position, rotation, and size. With the predicted parameters, a 3D scene can be reconstructed using 3D modeling software like Blender or Rhino Grasshopper which provides a programmable interface, resulting in finely and easily editable 3D models. To evaluate the proposed model, we created two datasets: one featuring simple scenes and another with complex scenes. The test results demonstrate the model's ability to accurately reconstruct simple scenes but reveal its challenges with more complex ones.
Space-Time Attention with Shifted Non-Local Search
results: Experimental results show that this search strategy improves video denoising quality by 0.30 dB PSNR for a 7.5% increase in overall runtime, and that integrating it into existing space-time attention modules achieves state-of-the-art video denoising results.
Abstract
Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less memory and is over 3 times faster than previous work. Experimentally, correcting the small spatial errors improves the video frame alignment quality by over 3 dB PSNR. Our search upgrades existing space-time attention modules, which improves video denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We integrate our space-time attention module into a UNet-like architecture to achieve state-of-the-art results on video denoising.
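The core correction step can be sketched as a small grid search around the predicted offsets. Single-pixel feature differences stand in for patch similarities here, and all names and shapes are illustrative simplifications of the paper's method.

```python
import torch

def shifted_nonlocal_search(feat_q, feat_k, pred_offsets, radius=1):
    # feat_q, feat_k: (C, H, W) feature maps of two frames;
    # pred_offsets: (2, H, W) predicted (dy, dx) flow from an auxiliary net.
    C, H, W = feat_q.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    best_err = torch.full((H, W), float("inf"))
    best_off = pred_offsets.clone()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ty = (ys + pred_offsets[0].round().long() + dy).clamp(0, H - 1)
            tx = (xs + pred_offsets[1].round().long() + dx).clamp(0, W - 1)
            err = ((feat_q - feat_k[:, ty, tx]) ** 2).sum(0)
            update = err < best_err
            best_err = torch.where(update, err, best_err)
            best_off[0] = torch.where(update, (ty - ys).float(), best_off[0])
            best_off[1] = torch.where(update, (tx - xs).float(), best_off[1])
    return best_off, best_err
```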
Propagation and Attribution of Uncertainty in Medical Imaging Pipelines
results: On a realistic imaging pipeline, we use the method to reconstruct undersampled brain and knee magnetic resonance images from test data and to predict quantitative information from the images, such as brain volume, knee side, or the patient's sex. We quantitatively show that the propagated uncertainty correlates with the input uncertainty and compare the proportions that the pipeline stages contribute to the joint uncertainty measure.
Abstract
Uncertainty estimation, which provides a means of building explainable neural networks for medical imaging applications, has mostly been studied for single deep learning models that focus on a specific task. In this paper, we propose a method to propagate uncertainty through cascades of deep learning models in medical imaging pipelines. This allows us to aggregate the uncertainty in later stages of the pipeline and to obtain a joint uncertainty measure for the predictions of later models. Additionally, we can separately report contributions of the aleatoric, data-based, uncertainty of every component in the pipeline. We demonstrate the utility of our method on a realistic imaging pipeline that reconstructs undersampled brain and knee magnetic resonance (MR) images and subsequently predicts quantitative information from the images, such as the brain volume, knee side, or patient's sex. We quantitatively show that the propagated uncertainty is correlated with input uncertainty and compare the proportions of contributions of pipeline stages to the joint uncertainty measure.
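A minimal Monte Carlo reading of the propagation idea, assuming each stage outputs a predictive mean and standard deviation: samples from the first model's predictive distribution are pushed through the second model, and the spread of the final outputs gives a joint uncertainty. The paper's actual framework additionally attributes aleatoric contributions per component, which this sketch does not.

```python
import torch

def propagate(recon_model, pred_model, y, n_samples=32):
    # recon_model(y) and pred_model(img) are assumed to return (mean, std).
    mean_img, std_img = recon_model(y)
    outputs = []
    for _ in range(n_samples):
        img = mean_img + std_img * torch.randn_like(std_img)  # sample stage 1
        mu, sigma = pred_model(img)
        outputs.append(mu + sigma * torch.randn_like(sigma))  # sample stage 2
    outputs = torch.stack(outputs)
    return outputs.mean(0), outputs.var(0)  # joint prediction and uncertainty
```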
MEM: Multi-Modal Elevation Mapping for Robotics and Learning
paper_authors: Gian Erni, Jonas Frey, Takahiro Miki, Matias Mattamala, Marco Hutter
for: This paper is written for robotic and learning tasks that require the fusion of multi-modal information for environment perception.
methods: The paper presents a 2.5D robot-centric elevation mapping framework that fuses multi-modal information from multiple sources into a popular map representation, using a set of fusion algorithms that can be selected based on the information type and user requirements.
results: The paper demonstrates the capabilities of the framework by deploying it on multiple robots with varying sensor configurations and showcasing a range of applications that utilize multi-modal layers, including line detection, human detection, and colorization.
Abstract
Elevation maps are commonly used to represent the environment of mobile robots and are instrumental for locomotion and navigation tasks. However, pure geometric information is insufficient for many field applications that require appearance or semantic information, which limits their applicability to other platforms or domains. In this work, we extend a 2.5D robot-centric elevation mapping framework by fusing multi-modal information from multiple sources into a popular map representation. The framework allows inputting data contained in point clouds or images in a unified manner. To manage the different nature of the data, we also present a set of fusion algorithms that can be selected based on the information type and user requirements. Our system is designed to run on the GPU, making it real-time capable for various robotic and learning tasks. We demonstrate the capabilities of our framework by deploying it on multiple robots with varying sensor configurations and showcasing a range of applications that utilize multi-modal layers, including line detection, human detection, and colorization.
SatDM: Synthesizing Realistic Satellite Image with Semantic Layout Conditioning using Diffusion Models
results: Validation shows that the proposed model generates high-quality, diverse, and semantically accurate satellite images that deviate only minimally from real ones.
Abstract
Deep learning models in the Earth Observation domain heavily rely on the availability of large-scale accurately labeled satellite imagery. However, obtaining and labeling satellite imagery is a resource-intensive endeavor. While generative models offer a promising solution to address data scarcity, their potential remains underexplored. Recently, Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant promise in synthesizing realistic images from semantic layouts. In this paper, a conditional DDPM model capable of taking a semantic map and generating high-quality, diverse, and correspondingly accurate satellite images is implemented. Additionally, a comprehensive illustration of the optimization dynamics is provided. The proposed methodology integrates cutting-edge techniques such as variance learning, classifier-free guidance, and improved noise scheduling. The denoising network architecture is further complemented by the incorporation of adaptive normalization and self-attention mechanisms, enhancing the model's capabilities. The effectiveness of our proposed model is validated using a meticulously labeled dataset introduced within the context of this study. Validation encompasses both algorithmic methods such as Frechet Inception Distance (FID) and Intersection over Union (IoU), as well as a human opinion study. Our findings indicate that the generated samples exhibit minimal deviation from real ones, opening doors for practical applications such as data augmentation. We look forward to further explorations of DDPMs in a wider variety of settings and data modalities. An open-source reference implementation of the algorithm and a link to the benchmarked dataset are provided at https://github.com/obaghirli/syn10-diffusion.
Granularity at Scale: Estimating Neighborhood Well-Being from High-Resolution Orthographic Imagery and Hybrid Learning
for: fills in the gaps of basic information on the well-being of the population in areas with limited data collection methods.
methods: uses high-resolution imagery from satellite or aircraft, and machine learning and computer vision techniques to extract features and detect patterns in the image data.
results: accurately estimates population density, median household income, and educational attainment of individual neighborhoods with R$^2$ up to 0.81, and provides a basis for future work to estimate fine-scale information from overhead imagery without label data.
Abstract
Many areas of the world are without basic information on the well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from overhead imagery without the need for label data.
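The semi-supervised branch rests on a bag-of-visual-words representation, which can be sketched as follows. The local descriptor extraction (e.g. SIFT or CNN patch features) and the codebook size are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=256):
    # descriptors: (N, D) stacked local features from many image tiles.
    return KMeans(n_clusters=n_words, n_init=10).fit(descriptors)

def bovw_histogram(tile_descriptors, codebook):
    # Quantize a tile's descriptors against the codebook and return its
    # normalized histogram of visual words.
    words = codebook.predict(tile_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```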
Ultra-low-power Image Classification on Neuromorphic Hardware
results: The study benchmarks the method in simulation on MNIST, CIFAR10, and ImageNet, showing its advantages in power consumption, throughput, and latency. The method is also implemented on the Loihi neuromorphic chip, providing evidence that temporal coding achieves similar classification accuracy at lower power consumption and higher throughput.
Abstract
Spiking neural networks (SNNs) promise ultra-low-power applications by exploiting temporal and spatial sparsity. The number of binary activations, called spikes, is proportional to the power consumed when executed on neuromorphic hardware. Training such SNNs using backpropagation through time for vision tasks that rely mainly on spatial features is computationally costly. Training a stateless artificial neural network (ANN) to then convert the weights to an SNN is a straightforward alternative when it comes to image recognition datasets. Most conversion methods rely on rate coding in the SNN to represent ANN activation, which uses enormous amounts of spikes and, therefore, energy to encode information. Recently, temporal conversion methods have shown promising results requiring significantly fewer spikes per neuron, but sometimes complex neuron models. We propose a temporal ANN-to-SNN conversion method, which we call Quartz, that is based on the time to first spike (TTFS). Quartz achieves high classification accuracy and can be easily implemented on neuromorphic hardware while using the least amount of synaptic operations and memory accesses. It incurs a cost of two additional synapses per neuron compared to previous temporal conversion methods, which are readily available on neuromorphic hardware. We benchmark Quartz on MNIST, CIFAR10, and ImageNet in simulation to show the benefits of our method and follow up with an implementation on Loihi, a neuromorphic chip by Intel. We provide evidence that temporal coding has advantages in terms of power consumption, throughput, and latency for similar classification accuracy. Our code and models are publicly available.
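A minimal sketch of time-to-first-spike (TTFS) coding, the principle behind Quartz: larger ANN activations spike earlier, so each neuron fires at most once per example. The linear time mapping below is illustrative; Quartz's exact scheme differs in detail.

```python
import numpy as np

def ttfs_encode(activations, t_max=100):
    a = np.clip(activations, 0.0, None)
    a = a / (a.max() + 1e-12)            # normalize to [0, 1]
    spike_times = (1.0 - a) * t_max      # high activation -> early spike
    spike_times[a == 0] = np.inf         # silent neurons never spike
    return spike_times

def ttfs_decode(spike_times, t_max=100):
    a = 1.0 - spike_times / t_max
    return np.where(np.isfinite(spike_times), a, 0.0)
```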
paper_authors: Adam Schmidt, Omid Mohareri, Simon DiMaio, Septimiu E. Salcudean
for: The paper is written for researchers and developers working on image guidance and automation of medical interventions and surgery, specifically in endoscopic environments.
methods: The paper introduces a novel labeling methodology called Surgical Tattoos in Infrared (STIR), which uses invisible IR-fluorescent dye (indocyanine green, ICG) to label tissue points in video clips, allowing for persistent but invisible labels for tracking and mapping.
results: The paper analyzes multiple frame-based tracking methods on STIR using both 3D and 2D endpoint error and accuracy metrics, providing a benchmark dataset for evaluating and improving tracking and mapping methods in endoscopic environments.
Abstract
Quantifying performance of methods for tracking and mapping tissue in endoscopic environments is essential for enabling image guidance and automation of medical interventions and surgery. Datasets developed so far either use rigid environments, visible markers, or require annotators to label salient points in videos after collection. These are respectively: not general, visible to algorithms, or costly and error-prone. We introduce a novel labeling methodology along with a dataset that uses said methodology, Surgical Tattoos in Infrared (STIR). STIR has labels that are persistent but invisible to visible spectrum algorithms. This is done by labelling tissue points with IR-flourescent dye, indocyanine green (ICG), and then collecting visible light video clips. STIR comprises hundreds of stereo video clips in both in-vivo and ex-vivo scenes with start and end points labelled in the IR spectrum. With over 3,000 labelled points, STIR will help to quantify and enable better analysis of tracking and mapping methods. After introducing STIR, we analyze multiple different frame-based tracking methods on STIR using both 3D and 2D endpoint error and accuracy metrics. STIR is available at https://dx.doi.org/10.21227/w8g4-g548
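The evaluation metrics reduce to simple endpoint computations, sketched below; the 4 mm threshold is an illustrative choice, not the benchmark's official setting.

```python
import numpy as np

def endpoint_error(pred, gt):
    # pred, gt: (N, 2) or (N, 3) arrays of tracked / ICG-labeled end points.
    return np.linalg.norm(pred - gt, axis=-1)

def accuracy_at(pred, gt, thresh_mm=4.0):
    # Fraction of tracked points within a distance threshold of ground truth.
    return float((endpoint_error(pred, gt) < thresh_mm).mean())
```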
Deep Learning based Systems for Crater Detection: A Review
results: We train and test all semantic segmentation-based CDAs on a common dataset to evaluate the effectiveness of each architecture for crater detection and its potential applications. We also provide recommendations for future work.
Abstract
Craters are one of the most prominent features on planetary surfaces, used in applications such as age estimation, hazard detection, and spacecraft navigation. Crater detection is a challenging problem due to various aspects, including complex crater characteristics such as varying sizes and shapes, data resolution, and planetary data types. Similar to other computer vision tasks, deep learning-based approaches have significantly impacted research on crater detection in recent years. This survey aims to assist researchers in this field by examining the development of deep learning-based crater detection algorithms (CDAs). The review includes over 140 research works covering diverse crater detection approaches, including planetary data, craters database, and evaluation metrics. To be specific, we discuss the challenges in crater detection due to the complex properties of the craters and survey the DL-based CDAs by categorizing them into three parts: (a) semantic segmentation-based, (b) object detection-based, and (c) classification-based. Additionally, we have conducted training and testing of all the semantic segmentation-based CDAs on a common dataset to evaluate the effectiveness of each architecture for crater detection and its potential applications. Finally, we have provided recommendations for potential future works.
Prompt-Enhanced Self-supervised Representation Learning for Remote Sensing Image Understanding
results: Our method performs strongly on a variety of downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results show that it learns remote sensing representations with high generalization and transferability.
Abstract
Learning representations through self-supervision on a large-scale, unlabeled dataset has proven to be highly effective for understanding diverse images, such as those used in remote sensing image analysis. However, remote sensing images often have complex and densely populated scenes, with multiple land objects and no clear foreground objects. This intrinsic property can lead to false positive pairs in contrastive learning, or missing contextual information in reconstructive learning, which can limit the effectiveness of existing self-supervised learning methods. To address these problems, we propose a prompt-enhanced self-supervised representation learning method that uses a simple yet efficient pre-training pipeline. Our approach involves utilizing original image patches as a reconstructive prompt template, and designing a prompt-enhanced generative branch that provides contextual information through semantic consistency constraints. We collected a dataset of over 1.28 million remote sensing images that is comparable to the popular ImageNet dataset, but without specific temporal or geographical constraints. Our experiments show that our method outperforms fully supervised learning models and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that our approach learns impressive remote sensing representations with high generalization and transferability.
Learning to Transform for Generalizable Instance-wise Invariance
results: Experiments on CIFAR10, CIFAR10-LT, and TinyImageNet show that our method improves accuracy and robustness. In particular, it can learn a much larger range of transformations than Augerino and InstaAug.
Abstract
Computer vision research has long aimed to build systems that are robust to spatial transformations found in natural data. Traditionally, this is done using data augmentation or hard-coding invariances into the architecture. However, too much or too little invariance can hurt, and the correct amount is unknown a priori and dependent on the instance. Ideally, the appropriate invariance would be learned from data and inferred at test-time. We treat invariance as a prediction problem. Given any image, we use a normalizing flow to predict a distribution over transformations and average the predictions over them. Since this distribution only depends on the instance, we can align instances before classifying them and generalize invariance across classes. The same distribution can also be used to adapt to out-of-distribution poses. This normalizing flow is trained end-to-end and can learn a much larger range of transformations than Augerino and InstaAug. When used as data augmentation, our method shows accuracy and robustness gains on CIFAR 10, CIFAR10-LT, and TinyImageNet.
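The test-time behavior can be sketched as sampling instance-specific transformations and averaging classifier outputs over them. Here the predicted distribution is restricted to rotations for simplicity, and `flow` and `classifier` are assumed modules, not the paper's architecture.

```python
import torch
import torchvision.transforms.functional as TF

def averaged_prediction(x, flow, classifier, n_samples=8):
    # x: (1, C, H, W). Assumption: flow(x) returns a torch.distributions
    # object over rotation angles (degrees) for this particular instance.
    angles = flow(x).sample((n_samples,))
    logits = [classifier(TF.rotate(x, float(a))) for a in angles]
    return torch.stack(logits).mean(0)  # marginalize over sampled transforms
```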
Decaf: Monocular Deformation Capture for Face and Hand Interactions
paper_authors: Soshi Shimada, Vladislav Golyanik, Patrick Pérez, Christian Theobalt
for: This work addresses the challenges of 3D tracking from monocular RGB videos, particularly the non-rigid face deformations induced by interacting human hands.
results: The results show that the method produces realistic and more plausible 3D hand and face reconstructions than several applicable baselines, both quantitatively and qualitatively.
Abstract
Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects. Modelling dense non-rigid object deformations in this setting remained largely unaddressed so far, although such effects can improve the realism of the downstream applications such as AR/VR and avatar communications. This is due to the severe ill-posedness of the monocular view setting and the associated challenges. While it is possible to naively track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates such as depth ambiguity, unnatural intra-object collisions and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges depicted above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible compared to several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf
Training a Large Video Model on a Single Machine in a Day
results: Compared to prior work with comparable architectures, the pipeline achieves higher accuracy with $\frac{1}{8}$ of the computation while completing training in a day.
Abstract
Videos are big, complex to pre-process, and slow to train on. State-of-the-art large-scale video models are trained on clusters of 32 or more GPUs for several days. As a consequence, academia largely ceded the training of large video models to industry. In this paper, we show how to still train a state-of-the-art video model on a single machine with eight consumer-grade GPUs in a day. We identify three bottlenecks, IO, CPU, and GPU computation, and optimize each. The result is a highly efficient video training pipeline. For comparable architectures, our pipeline achieves higher accuracies with $\frac{1}{8}$ of the computation compared to prior work. Code is available at https://github.com/zhaoyue-zephyrus/AVION.
Geodesic Regression Characterizes 3D Shape Changes in the Female Brain During Menstruation
paper_authors: Adele Myers, Caitlin Taylor, Emily Jacobs, Nina Miolane
for: The paper aims to investigate the connection between female brain health and sex hormone fluctuations.
methods: The researchers use geodesic regression on the space of 3D discrete surfaces to characterize the evolution of brain shape during hormone fluctuations, and propose approximation schemes to accelerate the process, with rules of thumb for when to use each approximation.
results: The researchers test the approach on synthetic data, showing a significant speed-accuracy trade-off, and apply it to real brain shape data to produce the first characterization of how the female hippocampus changes shape during the menstrual cycle as a function of progesterone.
Abstract
Women are at higher risk of Alzheimer's and other neurological diseases after menopause, and yet research connecting female brain health to sex hormone fluctuations is limited. We seek to investigate this connection by developing tools that quantify 3D shape changes that occur in the brain during sex hormone fluctuations. Geodesic regression on the space of 3D discrete surfaces offers a principled way to characterize the evolution of a brain's shape. However, in its current form, this approach is too computationally expensive for practical use. In this paper, we propose approximation schemes that accelerate geodesic regression on shape spaces of 3D discrete surfaces. We also provide rules of thumb for when each approximation can be used. We test our approach on synthetic data to quantify the speed-accuracy trade-off of these approximations and show that practitioners can expect very significant speed-up while only sacrificing little accuracy. Finally, we apply the method to real brain shape data and produce the first characterization of how the female hippocampus changes shape during the menstrual cycle as a function of progesterone: a characterization made (practically) possible by our approximation schemes. Our work paves the way for comprehensive, practical shape analyses in the fields of bio-medicine and computer vision. Our implementation is publicly available on GitHub: https://github.com/bioshape-lab/my28brains.
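To make the geodesic-regression objective concrete, here is a toy PyTorch sketch on the unit sphere, a much simpler manifold than the paper's space of 3D discrete surfaces: fit an intercept point p and a tangent velocity v so that Exp_p(t_i v) tracks the observations y_i. The ambient squared error is used as a stand-in for geodesic distance; all data here are synthetic.

```python
import torch

def exp_map_sphere(p, v):
    """Riemannian exponential map on the unit sphere (v tangent at p)."""
    norm_v = v.norm().clamp_min(1e-8)
    return torch.cos(norm_v) * p + torch.sin(norm_v) * v / norm_v

# synthetic observations: points drifting along the sphere over "time" t
t = torch.linspace(0.0, 1.0, 10)
y = torch.stack([exp_map_sphere(torch.tensor([0., 0., 1.]),
                                ti * torch.tensor([0.8, 0.3, 0.]))
                 for ti in t])
y = y + 0.01 * torch.randn_like(y)
y = y / y.norm(dim=1, keepdim=True)

p = torch.tensor([0., 0., 1.]).requires_grad_(True)
v = (0.1 * torch.randn(3)).requires_grad_(True)  # small nonzero init
opt = torch.optim.Adam([p, v], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    p_unit = p / p.norm()
    v_tan = v - (v @ p_unit) * p_unit   # project v to the tangent space at p
    preds = torch.stack([exp_map_sphere(p_unit, ti * v_tan) for ti in t])
    loss = ((preds - y) ** 2).sum()     # ambient error as a geodesic proxy
    loss.backward()
    opt.step()
```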
Visual In-Context Learning for Few-Shot Eczema Segmentation
paper_authors: Neelesh Kumar, Oya Aran, Venugopal Vasudevan
for: The study targets automated diagnosis of eczema from digital camera images so that patients can self-monitor their recovery, with eczema segmentation as the key task. Existing segmentation methods based on deep neural networks, such as CNN-based U-Net or transformer-based Swin U-Net, require large volumes of annotated data that can be difficult to obtain.
methods: The authors investigate visual in-context learning for few-shot eczema segmentation and propose a strategy for applying the generalist vision model SegGPT to the task.
results: On a dataset of annotated eczema images, SegGPT with just 2 representative example images outperforms a CNN U-Net trained on 428 images (mIoU: 36.69 vs 32.60); using more examples can in fact hurt SegGPT's performance. The results highlight the value of visual in-context learning for faster and better skin imaging solutions, including inclusive solutions for demographic minorities that are heavily under-represented in training data.
Abstract
Automated diagnosis of eczema from digital camera images is crucial for developing applications that allow patients to self-monitor their recovery. An important component of this is the segmentation of eczema region from such images. Current methods for eczema segmentation rely on deep neural networks such as convolutional (CNN)-based U-Net or transformer-based Swin U-Net. While effective, these methods require high volume of annotated data, which can be difficult to obtain. Here, we investigate the capabilities of visual in-context learning that can perform few-shot eczema segmentation with just a handful of examples and without any need for retraining models. Specifically, we propose a strategy for applying in-context learning for eczema segmentation with a generalist vision model called SegGPT. When benchmarked on a dataset of annotated eczema images, we show that SegGPT with just 2 representative example images from the training dataset performs better (mIoU: 36.69) than a CNN U-Net trained on 428 images (mIoU: 32.60). We also discover that using more number of examples for SegGPT may in fact be harmful to its performance. Our result highlights the importance of visual in-context learning in developing faster and better solutions to skin imaging tasks. Our result also paves the way for developing inclusive solutions that can cater to minorities in the demographics who are typically heavily under-represented in the training data.
Novel Deep Learning Pipeline for Automatic Weapon Detection
methods: The paper proposes a new pipeline consisting of an ensemble of convolutional neural networks (CNNs) with distinct architectures, each trained on a unique mini-batch, to improve the system's accuracy and specificity.
results: Benchmarked on multiple datasets against state-of-the-art (SoA) systems, the proposed pipeline yields an average improvement of 5% in accuracy, specificity, and recall.
Abstract
Weapon and gun violence have recently become a pressing issue today. The degree of these crimes and activities has risen to the point of being termed as an epidemic. This prevalent misuse of weapons calls for an automatic system that detects weapons in real-time. Real-time surveillance video is captured and recorded in almost all public forums and places. These videos contain abundant raw data which can be extracted and processed into meaningful information. This paper proposes a novel pipeline consisting of an ensemble of convolutional neural networks with distinct architectures. Each neural network is trained with a unique mini-batch with little to no overlap in the training samples. This paper will present several promising results using multiple datasets associated with comparing the proposed architecture and state-of-the-art (SoA) models. The proposed pipeline produced an average increase of 5% in accuracy, specificity, and recall compared to the SoA systems.
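A minimal sketch of the ensemble recipe as described: CNNs with distinct architectures, each trained on its own mini-batches, with softmax probabilities averaged at inference. The model choices and helper functions below are illustrative stand-ins, not the authors' exact networks.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_members(num_classes=2):
    # two members with distinct architectures, as the pipeline prescribes
    member_a = models.resnet18(num_classes=num_classes)
    member_b = models.mobilenet_v3_small(num_classes=num_classes)
    return [member_a, member_b]

def train_step(member, optimizer, batch, criterion=nn.CrossEntropyLoss()):
    # each member receives batches from its own data stream
    images, labels = batch
    optimizer.zero_grad()
    loss = criterion(member(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ensemble_predict(members, images):
    # average softmax probabilities across ensemble members
    probs = torch.stack([m(images).softmax(dim=1) for m in members])
    return probs.mean(dim=0).argmax(dim=1)
```

At training time, each member would draw from its own disjoint index split (e.g., via `SubsetRandomSampler`) so that mini-batches have little to no overlap, as the paper describes.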
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
results: The method generates high-quality textured 3D meshes in just 2 minutes from a single-view image, roughly a 10x speed-up over existing methods.
Abstract
Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods.
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
results: The representation generalizes to novel semantic classes without collecting large 3D datasets or finetuning models, and supports complex reasoning over spatial and semantic concepts specified through language prompts.
Abstract
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )
FLIP: Cross-domain Face Anti-spoofing with Language Guidance
results: Extensive experiments on three standard protocols show that the method significantly outperforms state-of-the-art FAS approaches, achieving better zero-shot transfer performance than five-shot transfer of adaptive ViTs and stronger results in low-data regimes.
Abstract
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to boost feature generalization further and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms the state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of adaptive ViTs. Code: https://github.com/koushiksrivats/FLIP
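A minimal sketch of the language-grounding step as we read it: the image embedding is compared against the mean text embedding of several natural-language descriptions per class. The encoders are stand-ins for a CLIP-style vision-language model, and the example descriptions are illustrative, not the paper's exact prompts.

```python
import torch
import torch.nn.functional as F

def class_prototypes(text_encoder, descriptions_per_class):
    # descriptions_per_class: list (one entry per class) of lists of strings
    protos = []
    for descriptions in descriptions_per_class:
        emb = F.normalize(text_encoder(descriptions), dim=-1)  # (D, d)
        protos.append(F.normalize(emb.mean(dim=0), dim=-1))    # ensemble mean
    return torch.stack(protos)                                 # (C, d)

def fas_logits(image_encoder, images, prototypes, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)           # (B, d)
    return img @ prototypes.t() / temperature                  # (B, C)

# hypothetical prompts, e.g.:
# descriptions_per_class = [
#     ["a photo of a real human face", "a bona fide face"],
#     ["a photo of a spoof face", "a printed photo of a face"],
# ]
```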
Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors
results: The equivariant regularization technique applies to both CNN and Transformer architectures, incurs no extra cost at test time, and notably improves supervised and semi-supervised learning performance on Taskonomy tasks. Moreover, finetuning state-of-the-art depth and normal predictors with the proposed loss on unlabeled images improves not only their equivariance but also their accuracy on NYU-v2.
Abstract
Dense depth and surface normal predictors should possess the equivariant property to cropping-and-resizing -- cropping the input image should result in cropping the same output image. However, we find that state-of-the-art depth and normal predictors, despite having strong performances, surprisingly do not respect equivariance. The problem exists even when crop-and-resize data augmentation is employed during training. To remedy this, we propose an equivariant regularization technique, consisting of an averaging procedure and a self-consistency loss, to explicitly promote cropping-and-resizing equivariance in depth and normal networks. Our approach can be applied to both CNN and Transformer architectures, does not incur extra cost during testing, and notably improves the supervised and semi-supervised learning performance of dense predictors on Taskonomy tasks. Finally, finetuning with our loss on unlabeled images improves not only equivariance but also accuracy of state-of-the-art depth and normal predictors when evaluated on NYU-v2. GitHub link: https://github.com/mikuhatsune/equivariance
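A minimal sketch of a crop-and-resize self-consistency loss in the spirit described: the prediction for a resized crop of the image should match the same crop taken from the full-image prediction. The paper's exact averaging procedure may differ; this is our reading of the self-consistency term.

```python
import torch
import torch.nn.functional as F

def crop_resize_consistency_loss(model, image, top, left, h, w):
    # image: (B, 3, H, W); model returns a dense map of the same spatial size
    full_pred = model(image)
    crop_of_pred = full_pred[:, :, top:top + h, left:left + w]

    crop_img = image[:, :, top:top + h, left:left + w]
    crop_img = F.interpolate(crop_img, size=image.shape[-2:],
                             mode='bilinear', align_corners=False)
    pred_of_crop = model(crop_img)
    # resize the crop's prediction back to its native resolution to compare
    pred_of_crop = F.interpolate(pred_of_crop, size=(h, w),
                                 mode='bilinear', align_corners=False)
    # one-sided consistency (detach is a design choice, not from the paper)
    return F.l1_loss(pred_of_crop, crop_of_pred.detach())
```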
for: addressing the inbetweening problem in the anime industry, specifically the generation of intermediate frames between black-and-white line drawings
methods: using a new approach called AnimeInbet, which geometrizes raster line drawings into graphs of endpoints and reframes the inbetweening task as a graph fusion problem with vertex repositioning
results: synthesizing high-quality, clean, and complete intermediate line drawings that outperform existing methods quantitatively and qualitatively, especially in cases with large motions.
Abstract
We aim to address a significant but understudied problem in the anime industry, namely the inbetweening of cartoon line drawings. Inbetweening involves generating intermediate frames between two black-and-white line drawings and is a time-consuming and expensive process that can benefit from automation. However, existing frame interpolation methods that rely on matching and warping whole raster images are unsuitable for line inbetweening and often produce blurring artifacts that damage the intricate line structures. To preserve the precision and detail of the line drawings, we propose a new approach, AnimeInbet, which geometrizes raster line drawings into graphs of endpoints and reframes the inbetweening task as a graph fusion problem with vertex repositioning. Our method can effectively capture the sparsity and unique structure of line drawings while preserving the details during inbetweening. This is made possible via our novel modules, i.e., vertex geometric embedding, a vertex correspondence Transformer, an effective mechanism for vertex repositioning and a visibility predictor. To train our method, we introduce MixamoLine240, a new dataset of line drawings with ground truth vectorization and matching labels. Our experiments demonstrate that AnimeInbet synthesizes high-quality, clean, and complete intermediate line drawings, outperforming existing methods quantitatively and qualitatively, especially in cases with large motions. Data and code are available at https://github.com/lisiyao21/AnimeInbet.
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
paper_authors: Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, Christian Wolf
for: This paper addresses goal-oriented visual navigation, learned through large-scale machine learning in simulated environments.
methods: The paper uses two pretext tasks to tackle the main challenges: learning compact representations that generalize to unseen environments and learning high-capacity perception modules capable of reasoning on high-dimensional input.
results: Experiments show that the two pretext tasks help the model learn strong perception and reasoning abilities, achieving significant improvements and state-of-the-art performance.
Abstract
Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.
Class Activation Map-based Weakly supervised Hemorrhage Segmentation using Resnet-LSTM in Non-Contrast Computed Tomography images
results: On the validation data of the MICCAI 2022 INSTANCE challenge, the method achieves a Dice value of 0.55, comparable with an existing weakly supervised method (Dice value of 0.47), despite training on much less data.
Abstract
In clinical settings, intracranial hemorrhages (ICH) are routinely diagnosed using non-contrast CT (NCCT) for severity assessment. Accurate automated segmentation of ICH lesions is the initial and essential step, immensely useful for such assessment. However, compared to other structural imaging modalities such as MRI, in NCCT images ICH appears with very low contrast and poor SNR. Over recent years, deep learning (DL)-based methods have shown great potential, however, training them requires a huge amount of manually annotated lesion-level labels, with sufficient diversity to capture the characteristics of ICH. In this work, we propose a novel weakly supervised DL method for ICH segmentation on NCCT scans, using image-level binary classification labels, which are less time-consuming and labor-efficient when compared to the manual labeling of individual ICH lesions. Our method initially determines the approximate location of ICH using class activation maps from a classification network, which is trained to learn dependencies across contiguous slices. We further refine the ICH segmentation using pseudo-ICH masks obtained in an unsupervised manner. The method is flexible and uses a computationally light architecture during testing. On evaluating our method on the validation data of the MICCAI 2022 INSTANCE challenge, our method achieves a Dice value of 0.55, comparable with those of existing weakly supervised method (Dice value of 0.47), despite training on a much smaller training data.
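A minimal sketch of the class activation map (CAM) step that coarsely localizes ICH from an image-level classifier: the last convolutional feature maps are weighted by the classifier weights of the positive class. This is the generic CAM recipe, not the paper's full pipeline, which adds inter-slice modeling and pseudo-mask refinement.

```python
import torch
import torch.nn as nn

class CAMClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes=2):
        super().__init__()
        self.backbone = backbone             # conv features: (B, C, H, W)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)
        logits = self.fc(self.pool(feats).flatten(1))
        return logits, feats

@torch.no_grad()
def class_activation_map(model, x, class_idx=1):
    _, feats = model(x)                      # (B, C, H, W)
    weights = model.fc.weight[class_idx]     # (C,)
    cam = torch.einsum('bchw,c->bhw', feats, weights)
    cam = cam.clamp(min=0)                   # keep positive evidence only
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp_min(1e-8)
    return cam                               # coarse ICH localization in [0, 1]
```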
KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing
results: The method requires neither training the Stable Diffusion model itself nor scanning a large-scale dataset for time-consuming training.
Abstract
Text-conditioned image editing is a recently emerged and highly practical task, and its potential is immeasurable. However, most of the concurrent methods are unable to perform action editing, i.e. they can not produce results that conform to the action semantics of the editing prompt and preserve the content of the original image. To solve the problem of action editing, we propose KV Inversion, a method that can achieve satisfactory reconstruction performance and action editing, which can solve two major problems: 1) the edited result can match the corresponding action, and 2) the edited object can retain the texture and identity of the original real image. In addition, our method does not require training the Stable Diffusion model itself, nor does it require scanning a large-scale dataset to perform time-consuming training.
Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection
results: Experiments show that TensorFact improves object detection performance on RGB images, and that the pre-trained model, when fine-tuned on IR images, outperforms a standard state-of-the-art object detector.
Abstract
The primary bottleneck towards obtaining good recognition performance in IR images is the lack of sufficient labeled training data, owing to the cost of acquiring such data. Realizing that object detection methods for the RGB modality are quite robust (at least for some commonplace classes, like person, car, etc.), thanks to the giant training sets that exist, in this work we seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality. At the core of our method, is a novel tensor decomposition method called TensorFact which splits the convolution kernels of a layer of a Convolutional Neural Network (CNN) into low-rank factor matrices, with fewer parameters than the original CNN. We first pretrain these factor matrices on the RGB modality, for which plenty of training data are assumed to exist and then augment only a few trainable parameters for training on the IR modality to avoid over-fitting, while encouraging them to capture complementary cues from those trained only on the RGB modality. We validate our approach empirically by first assessing how well our TensorFact decomposed network performs at the task of detecting objects in RGB images vis-a-vis the original network and then look at how well it adapts to IR images of the FLIR ADAS v1 dataset. For the latter, we train models under scenarios that pose challenges stemming from data paucity. From the experiments, we observe that: (i) TensorFact shows performance gains on RGB images; (ii) further, this pre-trained model, when fine-tuned, outperforms a standard state-of-the-art object detector on the FLIR ADAS v1 dataset by about 4% in terms of mAP 50 score.
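A minimal sketch of replacing a KxK convolution with a low-rank factored pair, in the spirit of splitting kernels into factor matrices with fewer parameters. This is a generic low-rank scheme for illustration; the exact TensorFact decomposition in the paper may differ.

```python
import torch.nn as nn

class LowRankConv2d(nn.Module):
    """Approximates Conv2d(in_ch, out_ch, k) with rank-r factors:
    a kxk conv into r channels followed by a 1x1 conv to out_ch."""
    def __init__(self, in_ch, out_ch, k, rank, padding=None):
        super().__init__()
        padding = k // 2 if padding is None else padding
        self.spatial = nn.Conv2d(in_ch, rank, k, padding=padding, bias=False)
        self.mix = nn.Conv2d(rank, out_ch, 1, bias=False)

    def forward(self, x):
        return self.mix(self.spatial(x))

# Parameter count: in_ch*rank*k*k + rank*out_ch versus in_ch*out_ch*k*k.
# Mirroring the paper's recipe, the factors could be pretrained on RGB and
# only a few extra parameters trained on IR to limit over-fitting.
```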
methods: The paper studies both supervised and self-supervised ViT networks and proposes a simple yet effective solution: providing additional tokens in the input sequence to take over the role played by high-norm tokens during inference.
results: The paper shows that this solution fixes the problem entirely, sets a new state of the art for self-supervised models on dense visual prediction tasks, enables object discovery methods with larger models, and leads to smoother feature maps and attention maps for downstream visual processing.
Abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
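A minimal sketch of the proposed fix: append a few extra learnable "register" tokens to the ViT input sequence and discard them at the output, so high-norm internal computations have somewhere to live. The wrapper below is a hypothetical module over any ViT-style encoder that operates on token sequences.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, patch_embed, encoder, dim, num_registers=4):
        super().__init__()
        self.patch_embed = patch_embed          # image -> (B, N, dim) tokens
        self.encoder = encoder                  # transformer over tokens
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, x):
        tokens = self.patch_embed(x)                         # (B, N, dim)
        regs = self.registers.expand(tokens.shape[0], -1, -1)
        out = self.encoder(torch.cat([tokens, regs], dim=1))
        return out[:, : tokens.shape[1]]  # drop register tokens at the output
```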
paper_authors: Zilong Chen, Feng Wang, Huaping Liu
for: High-quality 3D object generation from text.
methods: 3D Gaussian Splatting with a progressive optimization strategy (geometry optimization followed by appearance refinement).
results: Accurate 3D geometry with delicate details.
Abstract
In this paper, we present Gaussian Splatting based text-to-3D generation (GSGEN), a novel approach for generating high-quality 3D objects. Previous methods suffer from inaccurate geometry and limited fidelity due to the absence of 3D prior and proper representation. We leverage 3D Gaussian Splatting, a recent state-of-the-art representation, to address existing shortcomings by exploiting the explicit nature that enables the incorporation of 3D prior. Specifically, our method adopts a progressive optimization strategy, which includes a geometry optimization stage and an appearance refinement stage. In geometry optimization, a coarse representation is established under a 3D geometry prior along with the ordinary 2D SDS loss, ensuring a sensible and 3D-consistent rough shape. Subsequently, the obtained Gaussians undergo an iterative refinement to enrich details. In this stage, we increase the number of Gaussians by compactness-based densification to enhance continuity and improve fidelity. With these designs, our approach can generate 3D content with delicate details and more accurate geometry. Extensive evaluations demonstrate the effectiveness of our method, especially for capturing high-frequency components. Video results are provided at https://gsgen3d.github.io. Our code is available at https://github.com/gsgen3d/gsgen
Audio-Visual Speaker Verification via Joint Cross-Attention
results: Experimental results show that the method significantly outperforms state-of-the-art audio-visual fusion approaches for speaker verification.
Abstract
Speaker verification has been widely explored using speech signals, which has shown significant improvement using deep models. Recently, there has been a surge in exploring faces and voices as they can offer more complementary and comprehensive information than relying only on a single modality of speech signals. Though current methods in the literature on the fusion of faces and voices have shown improvement over that of individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most of the existing methods based on audio-visual fusion either rely on score-level fusion or simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature presentation and that of the individual feature representations in order to effectively capture both intra-modal as well inter-modal relationships among the faces and voices. We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the Voxceleb1 dataset. Results show that the proposed approach can significantly outperform the state-of-the-art methods of audio-visual fusion for speaker verification.
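A minimal sketch of the joint cross-attention idea as described: attention weights are derived from the correlation between a joint audio-visual feature and each modality's own features, so both intra- and inter-modal relationships shape the fused representation. The shapes and linear projections below are our assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w_a = nn.Linear(2 * d, d, bias=False)  # joint -> audio space
        self.w_v = nn.Linear(2 * d, d, bias=False)  # joint -> visual space

    def forward(self, a, v):
        # a, v: (B, T, d) per-segment audio / visual features
        joint = torch.cat([a, v], dim=-1)                        # (B, T, 2d)
        # correlation of each modality with the projected joint feature
        attn_a = torch.softmax(a @ self.w_a(joint).transpose(1, 2), dim=-1)
        attn_v = torch.softmax(v @ self.w_v(joint).transpose(1, 2), dim=-1)
        a_att = attn_a @ a                                       # (B, T, d)
        v_att = attn_v @ v
        return torch.cat([a_att, v_att], dim=-1)                 # fused feature
```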
MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond
paper_authors: Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, Bo Dai
for: Building a large-scale, comprehensive, and high-quality synthetic dataset to advance city-scale neural rendering research.
methods: Using the Unreal Engine 5 City Sample project, a pipeline collects aerial and street city views, accompanied by ground-truth camera poses and a range of additional data modalities.
results: The resulting MatrixCity dataset contains 67k aerial images and 452k street images from two city maps covering a total area of $28km^2$.
Abstract
Neural radiance fields (NeRF) and its subsequent variants have led to remarkable progress in neural rendering. While most of recent neural rendering works focus on objects and small-scale scenes, developing neural rendering methods for city-scale scenes is of great potential in many real-world applications. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, yet collecting such a dataset over real city-scale scenes is costly, sensitive, and technically difficult. To this end, we build a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering researches. Leveraging the Unreal Engine 5 City Sample project, we develop a pipeline to easily collect aerial and street city views, accompanied by ground-truth camera poses and a range of additional data modalities. Flexible controls over environmental factors like light, weather, human and car crowd are also available in our pipeline, supporting the need of various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset, MatrixCity, contains 67k aerial images and 452k street images from two city maps of total size $28km^2$. On top of MatrixCity, a thorough benchmark is also conducted, which not only reveals unique challenges of the task of city-scale neural rendering, but also highlights potential improvements for future works. The dataset and code will be publicly available at our project page: https://city-super.github.io/matrixcity/.
Uncertainty Quantification for Eosinophil Segmentation
paper_authors: Kevin Lin, Donald Brown, Sana Syed, Adam Greene
for: The study aims to improve the approach of Adorno et al. for quantifying eosinophils using deep image segmentation.
methods: The method uses Monte Carlo Dropout to provide uncertainty quantification for deep learning models, visualized in the output image to evaluate model performance, provide insight into how the algorithms function, and assist pathologists in identifying eosinophils.
results: The method can help pathologists identify eosinophils more accurately and efficiently.
Abstract
Eosinophilic Esophagitis (EoE) is an allergic condition increasing in prevalence. To diagnose EoE, pathologists must find 15 or more eosinophils within a single high-power field (400X magnification). Determining whether or not a patient has EoE can be an arduous process and any medical imaging approaches used to assist diagnosis must consider both efficiency and precision. We propose an improvement of Adorno et al.'s approach for quantifying eosinophils using deep image segmentation. Our new approach leverages Monte Carlo Dropout, a common approach in deep learning to reduce overfitting, to provide uncertainty quantification on current deep learning models. The uncertainty can be visualized in an output image to evaluate model performance, provide insight to how deep learning algorithms function, and assist pathologists in identifying eosinophils.
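A minimal sketch of Monte Carlo Dropout uncertainty for a segmentation network: keep the dropout layers active at test time, run several stochastic forward passes, and use the per-pixel variance as an uncertainty map. This is the generic MC Dropout recipe; the paper's exact network is not reproduced here.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Put only the dropout layers in train mode so they stay stochastic."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model, image, num_samples=20):
    enable_mc_dropout(model)
    probs = torch.stack([model(image).softmax(dim=1)
                         for _ in range(num_samples)])  # (S, B, C, H, W)
    mean = probs.mean(dim=0)           # averaged segmentation prediction
    uncertainty = probs.var(dim=0)     # per-pixel, per-class variance map
    return mean, uncertainty
```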
HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs
results: The model outperforms state-of-the-art methods in HOI detection and anticipation on the VidHOI dataset, with gains of 1.76% and 1.04% in mAP respectively, while being 15.4 times faster. Experiments on a real robot show that the robot's ability to anticipate HOIs enables more efficient and intuitive human-robot collaboration.
Abstract
Robots are becoming increasingly integrated into our lives, assisting us in various tasks. To ensure effective collaboration between humans and robots, it is essential that they understand our intentions and anticipate our actions. In this paper, we propose a Human-Object Interaction (HOI) anticipation framework for collaborative robots. We propose an efficient and robust transformer-based model to detect and anticipate HOIs from videos. This enhanced anticipation empowers robots to proactively assist humans, resulting in more efficient and intuitive collaborations. Our model outperforms state-of-the-art results in HOI detection and anticipation in VidHOI dataset with an increase of 1.76% and 1.04% in mAP respectively while being 15.4 times faster. We showcase the effectiveness of our approach through experimental results in a real robot, demonstrating that the robot's ability to anticipate HOIs is key for better Human-Robot Interaction. More information can be found on our project webpage: https://evm7.github.io/HOI4ABOT_page/
Latent Noise Segmentation: How Neural Noise Leads to the Emergence of Segmentation and Grouping
results: The study finds that the method successfully segments images, with results that align with perceptual grouping phenomena observed in humans. Moreover, the method requires as few as a handful of samples and succeeds across a range of noise levels, including remarkably low ones.
Abstract
Deep Neural Networks (DNNs) that achieve human-level performance in general tasks like object segmentation typically require supervised labels. In contrast, humans are able to perform these tasks effortlessly without supervision. To accomplish this, the human visual system makes use of perceptual grouping. Understanding how perceptual grouping arises in an unsupervised manner is critical for improving both models of the visual system, and computer vision models. In this work, we propose a counterintuitive approach to unsupervised perceptual grouping and segmentation: that they arise because of neural noise, rather than in spite of it. We (1) mathematically demonstrate that under realistic assumptions, neural noise can be used to separate objects from each other, and (2) show that adding noise in a DNN enables the network to segment images even though it was never trained on any segmentation labels. Interestingly, we find that (3) segmenting objects using noise results in segmentation performance that aligns with the perceptual grouping phenomena observed in humans. We introduce the Good Gestalt (GG) datasets -- six datasets designed to specifically test perceptual grouping, and show that our DNN models reproduce many important phenomena in human perception, such as illusory contours, closure, continuity, proximity, and occlusion. Finally, we (4) demonstrate the ecological plausibility of the method by analyzing the sensitivity of the DNN to different magnitudes of noise. We find that some model variants consistently succeed with remarkably low levels of neural noise ($\sigma<0.001$), and surprisingly, that segmenting this way requires as few as a handful of samples. Together, our results suggest a novel unsupervised segmentation method requiring few assumptions, a new explanation for the formation of perceptual grouping, and a potential benefit of neural noise in the visual system.
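A speculative sketch of the noise-driven grouping mechanism as we read it: inject small Gaussian noise into an autoencoder's latent, record how each pixel's reconstruction responds across noise samples, and cluster pixels whose responses co-vary. This is our own illustration of the idea, not the authors' code; the encoder and decoder are hypothetical stand-ins.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def noise_segmentation(encoder, decoder, image, num_samples=32,
                       sigma=1e-3, num_segments=4):
    z = encoder(image)                        # (1, latent_dim)
    base = decoder(z)                         # (1, C, H, W)
    deltas = []
    for _ in range(num_samples):
        noisy = decoder(z + sigma * torch.randn_like(z))
        delta = (noisy - base).mean(dim=1)    # (1, H, W) per-pixel response
        deltas.append(delta.flatten())
    responses = torch.stack(deltas, dim=1)    # (H*W, S) response vectors
    # pixels with correlated noise responses are grouped into one segment
    labels = KMeans(n_clusters=num_segments, n_init=10).fit_predict(
        responses.cpu().numpy())
    h, w = base.shape[-2:]
    return torch.from_numpy(labels.reshape(h, w))
```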
CCEdit: Creative and Controllable Video Editing via Diffusion Models
results: Experimental results confirm the exceptional functionality and editing capabilities of the CCEdit framework.
Abstract
In this work, we present CCEdit, a versatile framework designed to address the challenges of creative and controllable video editing. CCEdit accommodates a wide spectrum of user editing requirements and enables enhanced creative control through an innovative approach that decouples video structure and appearance. We leverage the foundational ControlNet architecture to preserve structural integrity, while seamlessly integrating adaptable temporal modules compatible with state-of-the-art personalization techniques for text-to-image generation, such as DreamBooth and LoRA.Furthermore, we introduce reference-conditioned video editing, empowering users to exercise precise creative control over video editing through the more manageable process of editing key frames. Our extensive experimental evaluations confirm the exceptional functionality and editing capabilities of the proposed CCEdit framework. Demo video is available at https://www.youtube.com/watch?v=UQw4jq-igN4.
Deep Single Models vs. Ensembles: Insights for a Fast Deployment of Parking Monitoring Systems
results: Across different datasets and deep learning architectures, including fusion strategies and ensemble methods, models trained on diverse datasets achieve 95% accuracy without data annotation or model training on the target parking lot.
Abstract
Searching for available parking spots in high-density urban centers is a stressful task for drivers that can be mitigated by systems that know in advance the nearest parking space available. To this end, image-based systems offer cost advantages over other sensor-based alternatives (e.g., ultrasonic sensors), requiring less physical infrastructure for installation and maintenance. Despite recent deep learning advances, deploying intelligent parking monitoring is still a challenge since most approaches involve collecting and labeling large amounts of data, which is laborious and time-consuming. Our study aims to uncover the challenges in creating a global framework, trained using publicly available labeled parking lot images, that performs accurately across diverse scenarios, enabling the parking space monitoring as a ready-to-use system to deploy in a new environment. Through exhaustive experiments involving different datasets and deep learning architectures, including fusion strategies and ensemble methods, we found that models trained on diverse datasets can achieve 95\% accuracy without the burden of data annotation and model training on the target parking lot
Accurate and lightweight dehazing via multi-receptive-field non-local network and novel contrastive regularization
results: A novel detail-focused contrastive regularization (DFCR) is proposed, which emphasizes low-level details to improve dehazing performance; the resulting model has fewer than 1.5 million parameters and outperforms state-of-the-art dehazing methods.
Abstract
Recently, deep learning-based methods have dominated image dehazing domain. Although very competitive dehazing performance has been achieved with sophisticated models, effective solutions for extracting useful features are still under-explored. In addition, non-local network, which has made a breakthrough in many vision tasks, has not been appropriately applied to image dehazing. Thus, a multi-receptive-field non-local network (MRFNLN) consisting of the multi-stream feature attention block (MSFAB) and cross non-local block (CNLB) is presented in this paper. We start with extracting richer features for dehazing. Specifically, we design a multi-stream feature extraction (MSFE) sub-block, which contains three parallel convolutions with different receptive fields (i.e., $1\times 1$, $3\times 3$, $5\times 5$) for extracting multi-scale features. Following MSFE, we employ an attention sub-block to make the model adaptively focus on important channels/regions. The MSFE and attention sub-blocks constitute our MSFAB. Then, we design a cross non-local block (CNLB), which can capture long-range dependencies beyond the query. Instead of the same input source of query branch, the key and value branches are enhanced by fusing more preceding features. CNLB is computation-friendly by leveraging a spatial pyramid down-sampling (SPDS) strategy to reduce the computation and memory consumption without sacrificing the performance. Last but not least, a novel detail-focused contrastive regularization (DFCR) is presented by emphasizing the low-level details and ignoring the high-level semantic information in the representation space. Comprehensive experimental results demonstrate that the proposed MRFNLN model outperforms recent state-of-the-art dehazing methods with less than 1.5 Million parameters.
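A minimal sketch of the multi-stream feature extraction idea: three parallel convolutions with 1x1, 3x3, and 5x5 receptive fields, fused and followed by a channel-attention gate. The paper's exact MSFAB (including its attention design) may differ; this is illustrative.

```python
import torch
import torch.nn as nn

class MSFEBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 1)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
        self.ca = nn.Sequential(              # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        # three receptive fields extract multi-scale features in parallel
        multi = torch.cat([self.conv1(x), self.conv3(x), self.conv5(x)], dim=1)
        feat = self.fuse(multi)
        return x + feat * self.ca(feat)       # residual, channel-gated output
```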
HTC-DC Net: Monocular Height Estimation from Single Remote Sensing Images
results: Experimental results show that the proposed network outperforms existing methods by large margins on three datasets of different resolutions, and extensive ablation studies demonstrate the effectiveness of each design component.
Abstract
3D geo-information is of great significance for understanding the living environment; however, 3D perception from remote sensing data, especially on a large scale, is restricted. To tackle this problem, we propose a method for monocular height estimation from optical imagery, which is currently one of the richest sources of remote sensing data. As an ill-posed problem, monocular height estimation requires well-designed networks for enhanced representations to improve performance. Moreover, the distribution of height values is long-tailed with the low-height pixels, e.g., the background, as the head, and thus trained networks are usually biased and tend to underestimate building heights. To solve the problems, instead of formalizing the problem as a regression task, we propose HTC-DC Net following the classification-regression paradigm, with the head-tail cut (HTC) and the distribution-based constraints (DCs) as the main contributions. HTC-DC Net is composed of the backbone network as the feature extractor, the HTC-AdaBins module, and the hybrid regression process. The HTC-AdaBins module serves as the classification phase to determine bins adaptive to each input image. It is equipped with a vision transformer encoder to incorporate local context with holistic information and involves an HTC to address the long-tailed problem in monocular height estimation for balancing the performances of foreground and background pixels. The hybrid regression process does the regression via the smoothing of bins from the classification phase, which is trained via DCs. The proposed network is tested on three datasets of different resolutions, namely ISPRS Vaihingen (0.09 m), DFC19 (1.3 m) and GBH (3 m). Experimental results show the superiority of the proposed network over existing methods by large margins. Extensive ablation studies demonstrate the effectiveness of each design component.
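A minimal sketch of the classification-regression paradigm: the network predicts per-image adaptive bin widths plus per-pixel bin probabilities, and the final height is the probability-weighted sum of bin centers (the "smoothing of bins"). This follows the generic AdaBins-style recipe; the head-tail cut and distribution-based constraints of HTC-DC Net are not reproduced here.

```python
import torch

def heights_from_bins(bin_widths, bin_probs, max_height=200.0):
    # bin_widths: (B, K) non-negative (e.g., softmax outputs)
    # bin_probs:  (B, K, H, W) per-pixel bin probabilities
    widths = max_height * bin_widths / bin_widths.sum(dim=1, keepdim=True)
    edges = torch.cumsum(widths, dim=1)
    centers = edges - 0.5 * widths                   # (B, K) bin centers
    # hybrid regression: smooth over bins with the predicted probabilities
    return torch.einsum('bkhw,bk->bhw', bin_probs, centers)
```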
Rethinking Domain Generalization: Discriminability and Generalizability
methods: The method is built on two core components: Selective Channel Pruning (SCP) and Micro-level Distribution Alignment (MDA). SCP curtails redundant features within neural networks to enhance feature stability and classification accuracy, while MDA emphasizes micro-level distribution alignment within each class to retain sufficiently generalizable features and accommodate within-class variations.
results: Extensive experiments on four benchmark datasets demonstrate the effectiveness of the method.
Abstract
Domain generalization (DG) endeavors to develop robust models that possess strong generalizability while preserving excellent discriminability. Nonetheless, pivotal DG techniques tend to improve the feature generalizability by learning domain-invariant representations, inadvertently overlooking the feature discriminability. On the one hand, the simultaneous attainment of generalizability and discriminability of features presents a complex challenge, often entailing inherent contradictions. This challenge becomes particularly pronounced when domain-invariant features manifest reduced discriminability owing to the inclusion of unstable factors, \emph{i.e.,} spurious correlations. On the other hand, prevailing domain-invariant methods can be categorized as category-level alignment, susceptible to discarding indispensable features possessing substantial generalizability and narrowing intra-class variations. To surmount these obstacles, we rethink DG from a new perspective that concurrently imbues features with formidable discriminability and robust generalizability, and present a novel framework, namely, Discriminative Microscopic Distribution Alignment (DMDA). DMDA incorporates two core components: Selective Channel Pruning~(SCP) and Micro-level Distribution Alignment (MDA). Concretely, SCP attempts to curtail redundancy within neural networks, prioritizing stable attributes conducive to accurate classification. This approach alleviates the adverse effect of spurious domain invariance and amplifies the feature discriminability. Besides, MDA accentuates micro-level alignment within each class, going beyond mere category-level alignment. This strategy accommodates sufficient generalizable features and facilitates within-class variations. Extensive experiments on four benchmark datasets corroborate the efficacy of our method.
Diverse Target and Contribution Scheduling for Domain Generalization
results: Experiments show that the method achieves competitive performance on four benchmark datasets in comparison with state-of-the-art approaches, demonstrating the effectiveness and advantages of the proposed DTCS.
Abstract
Generalization under the distribution shift has been a great challenge in computer vision. The prevailing practice of directly employing the one-hot labels as the training targets in domain generalization~(DG) can lead to gradient conflicts, making it insufficient for capturing the intrinsic class characteristics and hard to increase the intra-class variation. Besides, existing methods in DG mostly overlook the distinct contributions of source (seen) domains, resulting in uneven learning from these domains. To address these issues, we firstly present a theoretical and empirical analysis of the existence of gradient conflicts in DG, unveiling the previously unexplored relationship between distribution shifts and gradient conflicts during the optimization process. In this paper, we present a novel perspective of DG from the empirical source domain's risk and propose a new paradigm for DG called Diverse Target and Contribution Scheduling (DTCS). DTCS comprises two innovative modules: Diverse Target Supervision (DTS) and Diverse Contribution Balance (DCB), with the aim of addressing the limitations associated with the common utilization of one-hot labels and equal contributions for source domains in DG. In specific, DTS employs distinct soft labels as training targets to account for various feature distributions across domains and thereby mitigates the gradient conflicts, and DCB dynamically balances the contributions of source domains by ensuring a fair decline in losses of different source domains. Extensive experiments with analysis on four benchmark datasets show that the proposed method achieves a competitive performance in comparison with the state-of-the-art approaches, demonstrating the effectiveness and advantages of the proposed DTCS.
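A minimal sketch of the two ideas as described: (i) soft labels instead of one-hot targets, and (ii) dynamically re-weighting each source domain so their losses decline at comparable rates. The actual DTS/DCB schedules in the paper may differ; the weighting below is a simple inverse-progress heuristic of our own.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_targets):
    # soft_targets: (B, C) distributions, e.g., smoothed or teacher-derived,
    # standing in for the one-hot labels that cause gradient conflicts
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def domain_balanced_loss(per_domain_losses):
    # per_domain_losses: list of scalar losses, one per source domain;
    # domains whose loss is still high get proportionally more weight
    losses = torch.stack(per_domain_losses)
    weights = (losses / losses.sum()).detach()  # stop-grad through weights
    # rescale so the magnitude stays comparable to a plain sum
    return (weights * losses).sum() * len(per_domain_losses)
```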
Towards Novel Class Discovery: A Study in Novel Skin Lesions Clustering
results: The method effectively leverages knowledge from known categories to discover new semantic categories, as validated through extensive ablation experiments.
Abstract
Existing deep learning models have achieved promising performance in recognizing skin diseases from dermoscopic images. However, these models can only recognize samples from predefined categories; when they are deployed in the clinic, data from new, unknown categories constantly emerge. Therefore, it is crucial to automatically discover and identify new semantic categories from new data. In this paper, we propose a new framework for novel class discovery that automatically discovers new semantic classes from dermoscopy image datasets based on the knowledge of known classes. Specifically, we first use contrastive learning to learn a robust and unbiased feature representation based on all data from known and unknown categories. We then propose an uncertainty-aware multi-view cross pseudo-supervision strategy, which is trained jointly on all categories of data using pseudo labels generated by a self-labeling strategy. Finally, we further refine the pseudo labels by aggregating neighborhood information through local sample similarity to improve the clustering performance of the model for unknown categories. We conducted extensive experiments on the dermatology dataset ISIC 2019, and the experimental results show that our approach can effectively leverage knowledge from known categories to discover new semantic categories. We also further validated the effectiveness of the different modules through extensive ablation experiments. Our code will be released soon.
Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds
results: The proposed moving instance segmentation method for radar point clouds enhances scene interpretation and improves the performance of autonomous robots in safety-critical tasks.
Abstract
The perception of moving objects is crucial for autonomous robots performing collision avoidance in dynamic environments. LiDARs and cameras tremendously enhance scene interpretation but do not provide direct motion information and face limitations under adverse weather. Radar sensors overcome these limitations and provide Doppler velocities, delivering direct information on dynamic objects. In this paper, we address the problem of moving instance segmentation in radar point clouds to enhance scene interpretation for safety-critical tasks. Our Radar Instance Transformer enriches the current radar scan with temporal information without passing aggregated scans through a neural network. We propose a full-resolution backbone to prevent information loss in sparse point cloud processing. Our instance transformer head not only incorporates essential information to enhance segmentation but also enables reliable, class-agnostic instance assignments. In sum, our approach shows superior performance on the new moving instance segmentation benchmarks, including diverse environments, and provides model-agnostic modules to enhance scene interpretation. The benchmark is based on the RadarScenes dataset and will be made available upon acceptance.
Distilling ODE Solvers of Diffusion Models into Smaller Steps
results: Outperforms existing ODE solvers, especially when generating samples with fewer steps, while incurring lower computational overhead.
Abstract
Distillation techniques have substantially improved the sampling speed of diffusion models, allowing generation within only one or a few steps. However, these distillation methods require extensive training for each dataset, sampler, and network, which limits their practical applicability. To address this limitation, we propose a straightforward distillation approach, Distilled-ODE solvers (D-ODE solvers), that optimizes the ODE solver rather than training the denoising network. D-ODE solvers are formulated by simply applying a single parameter adjustment to existing ODE solvers. Subsequently, D-ODE solvers with smaller steps are optimized by ODE solvers with larger steps through distillation over a batch of samples. Our comprehensive experiments indicate that D-ODE solvers outperform existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, especially when generating samples with fewer steps. Our method incurs negligible computational overhead compared to previous distillation techniques, enabling simple and rapid integration with previous samplers. Qualitative analysis further shows that D-ODE solvers enhance image quality while preserving the sampling trajectory of ODE solvers.
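The single-parameter distillation idea can be sketched as follows. Assuming a generic `ddim_step` solver update and a pretrained `denoiser` (both placeholders), one learnable scalar per student step rescales the denoiser output and is fit so that few-step student samples match samples from a many-step teacher; the exact D-ODE parameterization may differ from this reading.

```python
# A hedged sketch of the distillation idea: a few-step student solver carries
# one learnable scalar per step that rescales the denoiser output, and the
# scalars are fit to match a many-step teacher on a batch of samples.
import torch

def distill_scalars(denoiser, ddim_step, x_T, student_ts, teacher_samples,
                    iters=100, lr=1e-2):
    lam = torch.ones(len(student_ts), requires_grad=True)  # one scalar/step
    opt = torch.optim.Adam([lam], lr=lr)
    for _ in range(iters):
        x = x_T.clone()
        for i, t in enumerate(student_ts):
            eps = lam[i] * denoiser(x, t)        # single-parameter adjustment
            x = ddim_step(x, t, eps)             # unchanged solver update
        loss = ((x - teacher_samples) ** 2).mean()  # match teacher outputs
        opt.zero_grad()
        loss.backward()
        opt.step()
    return lam.detach()
```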
HIC-YOLOv5: Improved YOLOv5 For Small Object Detection
for: Improving the accuracy and speed of small object detection, addressing the challenges that small objects pose in object detection tasks.
methods: Adds a prediction head specific to small objects, adopts an involution block between the backbone and neck, and applies the Convolutional Block Attention Module (CBAM) to increase channel information and emphasize important information in both the channel and spatial domains.
results: On the VisDrone-2019-DET dataset, HIC-YOLOv5 improves mAP@[.5:.95] by 6.42% and mAP@0.5 by 9.38%.
Abstract
Small object detection has been a challenging problem in the field of object detection. There have been some works proposing improvements for this task, such as adding several attention blocks or changing the whole structure of feature fusion networks. However, the computation cost of these models is large, which makes deploying a real-time object detection system infeasible, while leaving room for improvement. To this end, an improved YOLOv5 model, HIC-YOLOv5, is proposed to address the aforementioned problems. Firstly, an additional prediction head specific to small objects is added to provide a higher-resolution feature map for better prediction. Secondly, an involution block is adopted between the backbone and neck to increase the channel information of the feature map. Moreover, the Convolutional Block Attention Module (CBAM) is applied at the end of the backbone, thus not only decreasing the computation cost compared with previous works but also emphasizing the important information in both the channel and spatial domains. Our results show that HIC-YOLOv5 improves mAP@[.5:.95] by 6.42% and mAP@0.5 by 9.38% on the VisDrone-2019-DET dataset.
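For reference, CBAM itself (Woo et al., 2018) is compact enough to sketch in full: channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise average and max maps. This is a standard rendering of the published module, not HIC-YOLOv5's full architecture.

```python
# Compact PyTorch rendering of CBAM: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: squeeze H,W with both avg and max pooling.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: squeeze C with avg and max, then a 7x7 conv.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```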
An Enhanced Low-Resolution Image Recognition Method for Traffic Environments
results: Experiments validate the effectiveness of the algorithm for low-resolution image recognition in traffic environments.
Abstract
Currently, low-resolution image recognition is confronted with a significant challenge in the field of intelligent traffic perception. Compared to high-resolution images, low-resolution images suffer from small size, low quality, and lack of detail, leading to a notable decrease in the accuracy of traditional neural network recognition algorithms. The key to low-resolution image recognition lies in effective feature extraction. Therefore, this paper delves into the fundamental dimensions of residual modules and their impact on feature extraction and computational efficiency. Based on experiments, we introduce a dual-branch residual network structure that leverages the basic architecture of residual networks and a common feature subspace algorithm. Additionally, it incorporates the utilization of intermediate-layer features to enhance the accuracy of low-resolution image recognition. Furthermore, we employ knowledge distillation to reduce network parameters and computational overhead. Experimental results validate the effectiveness of this algorithm for low-resolution image recognition in traffic environments.
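The knowledge distillation step mentioned above follows the standard recipe, sketched below: the compact low-resolution recognizer matches softened teacher logits alongside the ground-truth labels. The temperature and mixing weight are illustrative choices, not values from the paper.

```python
# Minimal sketch of Hinton-style knowledge distillation: KL divergence
# between softened teacher and student distributions, scaled by T^2 so
# gradient magnitudes stay comparable, mixed with the hard-label loss.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```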
Biomedical Image Splicing Detection using Uncertainty-Guided Refinement
paper_authors: Xun Lin, Wenzhong Tang, Shuai Wang, Zitong Yu, Yizhong Liu, Haoran Wang, Ying Fu, Alex Kot
for: Proposing a method that mitigates the effects of disruptive factors in biomedical images, such as artifacts, abnormal patterns, and noise, and detects splicing traces in biomedical images.
results: Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed method. In addition, the generalizability of URN and its robustness against post-processing approaches are verified.
Abstract
Recently, a surge in biomedical academic publications suspected of image manipulation has led to numerous retractions, turning biomedical image forensics into a research hotspot. While manipulation detection is drawing growing attention, the specific detection of splicing traces in biomedical images remains underexplored. The disruptive factors within biomedical images, such as artifacts, abnormal patterns, and noise, exhibit misleading features resembling splicing traces, greatly increasing the challenge of this task. Moreover, the scarcity of high-quality spliced biomedical images also limits potential advancements in this field. In this work, we propose an Uncertainty-guided Refinement Network (URN) to mitigate the effects of these disruptive factors. Our URN can explicitly suppress the propagation of unreliable information flow caused by disruptive factors among regions, thereby obtaining robust features. Moreover, URN enables a concentration on the refinement of uncertainly predicted regions during the decoding phase. Besides, we construct a dataset for Biomedical image Splicing (BioSp) detection, which consists of 1,290 spliced images. Compared with existing datasets, BioSp comprises the largest number of spliced images and the most diverse sources. Comprehensive experiments on three benchmark datasets demonstrate the superiority of the proposed method. Meanwhile, we verify the generalizability of URN against cross-dataset domain shifts and its robustness to post-processing approaches. Our BioSp dataset will be released upon acceptance.
A Comprehensive Review on Tree Detection Methods Using Point Cloud and Aerial Imagery from Unmanned Aerial Vehicles
paper_authors: Weijie Kuang, Hann Woei Ho, Ye Zhou, Shahrel Azmin Suandi, Farzad Ismail
for: Analyzing and evaluating point cloud and image data collected by UAVs for tree detection.
methods: Point-cloud-based tree detection methods are classified according to whether the data come from LiDAR or Digital Aerial Photography (DAP), while methods that detect trees directly from images are reviewed according to whether they use Deep Learning (DL).
results: The paper analyzes the comparison and combination of the various methods and introduces the advantages, drawbacks, and application fields of each. It also counts the number of tree detection studies using different methods in recent years, finding that by 2022, DL-based detection directly on images accounted for 45% of all tree detection studies. The review can therefore help researchers conduct tree detection in specific forests and help agricultural producers use UAVs in managing agricultural resources.
Abstract
Unmanned Aerial Vehicles (UAVs) are considered cutting-edge technology with highly cost-effective and flexible usage scenarios. Although many papers have reviewed the application of UAVs in agriculture, reviews of their application to tree detection remain insufficient. This paper focuses on tree detection methods applied to data collected by UAVs. There are two kinds of data, point clouds and images, which are acquired by the Light Detection and Ranging (LiDAR) sensor and camera, respectively. Among the detection methods using point-cloud data, this paper mainly classifies these methods according to LiDAR and Digital Aerial Photography (DAP). For the detection methods using images directly, this paper reviews these methods by whether or not they use the Deep Learning (DL) method. Our review concludes and analyses the comparison and combination between the application of LiDAR-based and DAP-based point cloud data. The performance, relative merits, and application fields of the methods are also introduced. Meanwhile, this review counts the number of tree detection studies using different methods in recent years. From our statistics, the detection task using DL methods on images has become a mainstream trend, as the number of DL-based detection studies increased to 45% of the total number of tree detection studies up to 2022. As a result, this review could help and guide researchers who want to carry out tree detection on specific forests and help farmers use UAVs in managing agricultural production.
FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation
results: Experiments demonstrate that the method predicts lower rendering errors and more reliable uncertainty, showing the strong performance of the independence-assumption-free neural radiance field.
Abstract
Neural radiance fields with stochasticity have garnered significant interest by enabling the sampling of plausible radiance fields and quantifying uncertainty for downstream tasks. Existing works rely on the independence assumption of points in the radiance field or the pixels in input views to obtain tractable forms of the probability density function. However, this assumption inadvertently impacts performance when dealing with intricate geometry and texture. In this work, we propose an independence-assumption-free probabilistic neural radiance field based on Flow-GAN. By combining the generative capability of adversarial learning and the powerful expressivity of normalizing flow, our method explicitly models the density-radiance distribution of the whole scene. We represent our probabilistic NeRF as a mean-shifted probabilistic residual neural model. Our model is trained without an explicit likelihood function, thereby avoiding the independence assumption. Specifically, we downsample the training images with different strides and centers to form fixed-size patches which are used to train the generator with patch-based adversarial learning. Through extensive experiments, our method demonstrates state-of-the-art performance by predicting lower rendering errors and more reliable uncertainty on both synthetic and real-world datasets.
Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning
paper_authors: Albert Mohwald, Tomas Jenicek, Ondřej Chum
for: Improving night-time image retrieval; addresses the challenge of poor retrieval performance on night images with limited training data.
methods: Uses Generative Adversarial Networks (GANs) to generate synthetic night images, which serve as data augmentation for metric learning. A novel lightweight GAN architecture is proposed that enforces edge consistency between the original and translated images and allows simultaneous training of an edge detector operating on both night and day images.
results: Achieves state-of-the-art results on the standard Tokyo 24/7 day-night retrieval benchmark without requiring training pairs of matching day and night images.
Abstract
Image retrieval methods based on CNN descriptors rely on metric learning from a large number of diverse examples of positive and negative image pairs. Domains, such as night-time images, with limited availability and variability of training data suffer from poor retrieval performance even with methods performing well on standard benchmarks. We propose to train a GAN-based synthetic-image generator, translating available day-time image examples into night images. Such a generator is used in metric learning as a form of augmentation, supplying training data to the scarce domain. Various types of generators are evaluated and analyzed. We contribute a novel lightweight GAN architecture that enforces the consistency between the original and translated image through edge consistency. The proposed architecture also allows a simultaneous training of an edge detector that operates on both night and day images. To further increase the variability in the training examples and to maximize the generalization of the trained model, we propose a novel method of diverse anchor mining. The proposed method improves over the state-of-the-art results on the standard Tokyo 24/7 day-night retrieval benchmark while preserving the performance on the Oxford and Paris datasets. This is achieved without the need for training pairs of matching day and night images. The source code is available at https://github.com/mohwald/gandtr .
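An edge-consistency term of the kind described can be sketched with fixed Sobel filters: the translated night image is penalized when its gradient magnitudes drift from those of the day input. The paper additionally trains a learned day/night edge detector, which this fixed-filter version omits.

```python
# Minimal sketch of an edge-consistency loss using Sobel gradients as the
# edge map; the learned edge detector from the paper is not modeled here.
import torch
import torch.nn.functional as F

def sobel_edges(img):  # img: (N, 1, H, W) grayscale in [0, 1]
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1)        # (2, 1, 3, 3)
    g = F.conv2d(img, k, padding=1)               # (N, 2, H, W)
    return g.pow(2).sum(1, keepdim=True).sqrt()   # gradient magnitude

def edge_consistency_loss(day_img, fake_night_img):
    return F.l1_loss(sobel_edges(fake_night_img), sobel_edges(day_img))
```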
Logarithm-transform aided Gaussian Sampling for Few-Shot Learning
results: Experiments show that the new Gaussian transform enables better few-shot classification performance while sampling less data.
Abstract
Few-shot image classification has recently witnessed the rise of representation learning, utilised so that models can adapt to new classes using only a few training examples. Therefore, the properties of the representations, such as their underlying probability distributions, assume vital importance. Representations sampled from Gaussian distributions have been used in recent works [19] to train classifiers for few-shot classification. These methods rely on transforming the distributions of experimental data to approximate Gaussian distributions for their functioning. In this paper, I propose a novel Gaussian transform that outperforms existing methods at transforming experimental data into Gaussian-like distributions. I then utilise this novel transformation for few-shot image classification and show significant gains in performance, while sampling less data.
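The overall pipeline such methods share can be sketched as follows: Gaussianize the support features, fit a per-class Gaussian, sample synthetic features, and train a simple classifier on real plus sampled features. The exact logarithm-based transform is the paper's contribution; the log1p placeholder below is an assumption for illustration.

```python
# Hedged sketch of the Gaussian-sampling pipeline; the transform used here
# (log1p on non-negative features) is a stand-in, not the paper's formula.
import numpy as np

def gaussianize(feats):
    return np.log1p(np.maximum(feats, 0.0))  # placeholder transform

def sample_class_features(support_feats, n_samples=100, reg=1e-3):
    """support_feats: (k, d) few-shot features of one class."""
    z = gaussianize(support_feats)
    mu = z.mean(axis=0)
    # Regularize: with k < d the sample covariance is rank-deficient.
    cov = np.cov(z, rowvar=False) + reg * np.eye(z.shape[1])
    return np.random.multivariate_normal(mu, cov, size=n_samples)

# Usage: augment a 5-shot class with 100 sampled features, then fit any
# simple classifier (e.g., logistic regression) on real + sampled features.
support = np.abs(np.random.randn(5, 64))
augmented = np.vstack([gaussianize(support), sample_class_features(support)])
```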
Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention
paper_authors: Yidan Fan, Yongxin Yu, Wenhuan Lu, Yahong Han
for: Anomaly detection in untrimmed videos containing abnormal events.
methods: Proposes an anomalous attention mechanism that considers snippet-level encoded features without requiring pseudo labels. Specifically, the approach first generates snippet-level anomalous attention and then feeds it, together with the original anomaly scores, into a Multi-branch Supervision Module.
results: Experiments show that the method detects abnormal events more effectively and localizes anomalies more precisely. Experiments on the benchmark datasets XDViolence and UCF-Crime verify the effectiveness of the method.
Abstract
With a focus on abnormal events contained within untrimmed videos, video anomaly detection has attracted increasing interest among researchers. Among different video anomaly detection scenarios, weakly-supervised video anomaly detection poses a significant challenge as it lacks frame-wise labels during the training stage, relying only on video-level labels as coarse supervision. Previous methods have attempted either to learn discriminative features in an end-to-end manner or to employ a two-stage self-training strategy to generate snippet-level pseudo labels. However, both approaches have certain limitations. The former tends to overlook informative features at the snippet level, while the latter can be susceptible to noise. In this paper, we propose an Anomalous Attention mechanism for weakly-supervised anomaly detection to tackle the aforementioned problems. Our approach takes into account snippet-level encoded features without the supervision of pseudo labels. Specifically, our approach first generates snippet-level anomalous attention and then feeds it together with original anomaly scores into a Multi-branch Supervision Module. The module learns different areas of the video, including areas that are challenging to detect, and also assists the attention optimization. Experiments on the benchmark datasets XDViolence and UCF-Crime verify the effectiveness of our method. Besides, thanks to the proposed snippet-level attention, we obtain more precise anomaly localization.
Can the Query-based Object Detector Be Designed with Fewer Stages?
results: Experiments on the COCO dataset show that, compared with mainstream query-based models with multi-stage decoders, GOLO uses fewer decoder stages while still achieving considerable performance.
Abstract
Query-based object detectors have made significant advancements since the publication of DETR. However, most existing methods still rely on multi-stage encoders and decoders, or a combination of both. Despite achieving high accuracy, the multi-stage paradigm (typically consisting of 6 stages) suffers from issues such as heavy computational burden, prompting us to reconsider its necessity. In this paper, we explore multiple techniques to enhance query-based detectors and, based on these findings, propose a novel model called GOLO (Global Once and Local Once), which follows a two-stage decoding paradigm. Compared to other mainstream query-based models with multi-stage decoders, our model employs fewer decoder stages while still achieving considerable performance. Experimental results on the COCO dataset demonstrate the effectiveness of our approach.
Multi-scale Recurrent LSTM and Transformer Network for Depth Completion
results: Experimental results show that, without a repetitive network structure or post-processing steps, the method achieves state-of-the-art performance on the mainstream autonomous driving KITTI benchmark by adding the proposed modules to a simple encoder-decoder structure; it can also serve as a backbone network for other methods, likewise achieving state-of-the-art performance.
Abstract
LiDAR depth completion is a new and active topic in depth estimation. In this task, the key difficulty is fusing features from the color space and the depth space. In this paper, we migrate the classic LSTM and Transformer modules from NLP to depth completion and redesign them appropriately. Specifically, we use a Forget gate, Update gate, Output gate, and Skip gate to achieve the efficient fusion of color and depth features and perform loop optimization at multiple scales. Finally, we further fuse the deep features through the Transformer multi-head attention mechanism. Experimental results show that without repetitive network structure and post-processing steps, our method can achieve state-of-the-art performance by adding our modules to a simple encoder-decoder network structure. Our method ranks first on the current mainstream autonomous driving KITTI benchmark dataset. It can also be regarded as a backbone network for other methods, which likewise achieve state-of-the-art performance.
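A plausible reading of the gated fusion can be sketched as an LSTM-style block over aligned color and depth features; the exact wiring of the Forget/Update/Output/Skip gates in the paper may differ from this illustration.

```python
# Minimal sketch of LSTM-style gated fusion of color and depth features,
# assuming both streams are already aligned to the same (N, C, H, W) shape.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        conv = lambda: nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.forget, self.update = conv(), conv()
        self.output, self.skip = conv(), conv()

    def forward(self, color_feat, depth_feat, state):
        x = torch.cat([color_feat, depth_feat], dim=1)
        f = torch.sigmoid(self.forget(x))   # what to drop from the state
        u = torch.tanh(self.update(x))      # candidate fused features
        state = f * state + (1 - f) * u     # recurrent state across scales
        o = torch.sigmoid(self.output(x))   # what to expose downstream
        s = torch.sigmoid(self.skip(x))     # gated skip of the color stream
        return o * torch.tanh(state) + s * color_feat, state
```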
GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects
results: Experimental results show that the model performs well on unseen and cross-category articulated objects, outperforming existing articulation modeling and manipulation algorithms.
Abstract
Articulated objects like cabinets and doors are widespread in daily life. However, directly manipulating 3D articulated objects is challenging because they have diverse geometrical shapes, semantic categories, and kinetic constraints. Prior works mostly focused on recognizing and manipulating articulated objects with specific joint types. They can either estimate the joint parameters or distinguish suitable grasp poses to facilitate trajectory planning. Although these approaches have succeeded in certain types of articulated objects, they lack generalizability to unseen objects, which significantly impedes their application in broader scenarios. In this paper, we propose a novel framework of Generalizable Articulation Modeling and Manipulating for Articulated Objects (GAMMA), which learns both articulation modeling and grasp pose affordance from diverse articulated objects with different categories. In addition, GAMMA adopts adaptive manipulation to iteratively reduce the modeling errors and enhance manipulation performance. We train GAMMA with the PartNet-Mobility dataset and evaluate with comprehensive experiments in SAPIEN simulation and real-world Franka robot. Results show that GAMMA significantly outperforms SOTA articulation modeling and manipulation algorithms in unseen and cross-category articulated objects. We will open-source all codes and datasets in both simulation and real robots for reproduction in the final version. Images and videos are published on the project website at: http://sites.google.com/view/gamma-articulation
FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding
results: The study finds that different retrieval strategies perform differently across image categories, and introduces a new image retrieval benchmark (FORB) for assessing the quality of image embeddings on out-of-distribution domains.
Abstract
Image retrieval is a fundamental task in computer vision. Despite recent advances in this field, many techniques have been evaluated on a limited number of domains, with a small number of instance categories. Notably, most existing works only consider domains like 3D landmarks, making it difficult to generalize the conclusions made by these works to other domains, e.g., logo and other 2D flat objects. To bridge this gap, we introduce a new dataset for benchmarking visual search methods on flat images with diverse patterns. Our flat object retrieval benchmark (FORB) supplements the commonly adopted 3D object domain, and more importantly, it serves as a testbed for assessing the image embedding quality on out-of-distribution domains. In this benchmark we investigate the retrieval accuracy of representative methods in terms of candidate ranks, as well as matching score margin, a viewpoint which is largely ignored by many works. Our experiments not only highlight the challenges and rich heterogeneity of FORB, but also reveal the hidden properties of different retrieval strategies. The proposed benchmark is a growing project and we expect to expand in both quantity and variety of objects. The dataset and supporting codes are available at https://github.com/pxiangwu/FORB/.
for: This paper is written for the purpose of full-body human motion synthesis for the manipulation of large-sized objects, with applications in character animation, embodied AI, VR/AR, and robotics.
methods: The proposed method, called Object MOtion guided human MOtion synthesis (OMOMO), is a conditional diffusion framework that uses two separate denoising processes to generate full-body manipulation behaviors from only the object motion. The method employs hand positions as an intermediate representation to explicitly enforce contact constraints, resulting in more physically plausible manipulation motions.
results: The proposed pipeline is demonstrated to be effective through extensive experiments, and is shown to generalize well to unseen objects. Additionally, a large-scale dataset consisting of 3D object geometry, object motion, and human motion is collected, which contains human-object interaction motion for 15 objects with a total duration of approximately 10 hours.
Abstract
Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.
Off-the-shelf bin picking workcell with visual pose estimation: A case study on the world robot summit 2018 kitting task
results: The setup was tested on the World Robot Summit 2018 Assembly Challenge and obtained a higher score than all teams at the competition, showing that current technology can perform bin picking at a much higher level than previous results.
Abstract
The World Robot Summit 2018 Assembly Challenge included four different tasks. The kitting task, which required bin-picking, was the task in which the fewest points were obtained. However, bin-picking is a vital skill that can significantly increase the flexibility of robotic set-ups, and is, therefore, an important research field. In recent years, advancements have been made in sensor technology and pose estimation algorithms. These advancements allow for better performance in visual pose estimation. This paper shows that by utilizing new vision sensors and pose estimation algorithms, pose estimation in bins can be performed successfully. We also implement a workcell for bin picking along with a force-based grasping approach to perform the complete bin picking. Our set-up is tested on the World Robot Summit 2018 Assembly Challenge and successfully obtains a higher score compared with all teams at the competition. This demonstrates that current technology can perform bin-picking at a much higher level compared with previous results.
GAFlow: Incorporating Gaussian Attention into Optical Flow
results: Extensive experiments on standard computer vision datasets show improved performance, including strong generalization ability and online benchmark results.
Abstract
Optical flow, or the estimation of motion fields from image sequences, is one of the fundamental problems in computer vision. Unlike most pixel-wise tasks that aim at achieving consistent representations of the same category, optical flow raises extra demands for obtaining local discrimination and smoothness, which is not yet fully explored by existing approaches. In this paper, we push Gaussian Attention (GA) into optical flow models to accentuate local properties during representation learning and enforce the motion affinity during matching. Specifically, we introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks to highlight the local neighborhood that contains fine-grained structural information. Moreover, for reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM) which not only inherits properties from the Gaussian distribution to instinctively revolve around the neighbor fields of each point but also is empowered to put the emphasis on contextually related regions during matching. Our fully-equipped model, namely the Gaussian Attention Flow network (GAFlow), naturally incorporates a series of novel Gaussian-based modules into the conventional optical flow framework for reliable motion analysis. Extensive experiments on standard optical flow datasets consistently demonstrate the exceptional performance of the proposed approach in terms of both generalization ability evaluation and online benchmark testing. Code is available at https://github.com/LA30/GAFlow.
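The core locality idea can be sketched as attention with an additive Gaussian positional bias: dot-product logits are discounted by the squared distance between token positions, so attention concentrates on local neighborhoods. GAFlow's actual GCL and GGAM are more elaborate; the fixed sigma below is an assumption for illustration.

```python
# Hedged sketch of Gaussian-biased attention: standard attention logits plus
# a locality prior that decays with squared distance between token positions.
import torch

def gaussian_attention(q, k, v, positions, sigma=2.0):
    """q, k, v: (N, T, D); positions: (T, 2) 2D token coordinates."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    logits = logits - d2 / (2.0 * sigma ** 2)  # Gaussian locality prior
    return torch.softmax(logits, dim=-1) @ v
```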
Abdominal multi-organ segmentation in CT using Swinunter
results: Acceptable results and inference time were obtained on the public validation set, indicating that transformer-based models can achieve higher performance on these tasks.
Abstract
Abdominal multi-organ segmentation in computed tomography (CT) is crucial for many clinical applications including disease detection and treatment planning. Deep learning methods have shown unprecedented performance in this respect. However, it is still quite challenging to accurately segment different organs utilizing a single network due to the vague boundaries of organs, the complex background, and the substantially different organ size scales. In this work we used a transformer-based model for training. It was found through previous years' competitions that basically all of the top 5 methods used CNN-based methods, which is likely due to the lack of data volume that prevents transformer-based methods from taking full advantage. The thousands of samples in this competition may enable the transformer-based model to achieve better results. The results on the public validation set also show that the transformer-based model can achieve an acceptable result and inference time.
Nonconvex third-order Tensor Recovery Based on Logarithmic Minimax Function
results: Experiments show that the proposed method achieves higher completion accuracy on various real datasets, outperforming the similar EMLCP method and other state-of-the-art methods.
Abstract
Low-rank tensor recovery based on non-convex relaxation has recently gained extensive attention. In this context, we propose a new Logarithmic Minimax (LM) function. The comparative analysis between the LM function and the Logarithmic, Minimax concave penalty (MCP), and Minimax Logarithmic concave penalty (MLCP) functions reveals that the proposed function can protect large singular values while imposing stronger penalization on small singular values. Based on this, we define a weighted tensor LM norm as a non-convex relaxation for the tensor tubal rank. Subsequently, we propose the TLM-based low-rank tensor completion (LRTC) model and the TLM-based tensor robust principal component analysis (TRPCA) model, respectively. Furthermore, we provide theoretical convergence guarantees for the proposed methods. Comprehensive experiments were conducted on various real datasets, and a comparison analysis was made with the similar EMLCP method. The results demonstrate that the proposed method outperforms the state-of-the-art methods.
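For intuition, a matrix analogue of the intended behavior, preserving large singular values while penalizing small ones more strongly, can be sketched with a logarithmic reweighting of a soft threshold. This is not the paper's LM formula; the tensor case would apply such shrinkage per frontal slice in the FFT domain of the t-SVD.

```python
# Illustrative non-convex singular-value shrinkage: small singular values
# receive larger weights (stronger penalization) while large ones are nearly
# preserved. The weight schedule is a generic logarithmic reweighting chosen
# for illustration, not the paper's exact LM function.
import numpy as np

def log_reweighted_shrink(X, tau=1.0, eps=1e-2):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    w = 1.0 / np.log(1.0 + s / eps + 1e-12)   # large s -> small weight
    s_shrunk = np.maximum(s - tau * w, 0.0)   # weighted soft threshold
    return U @ np.diag(s_shrunk) @ Vt
```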
Parameter-Saving Adversarial Training: Reinforcing Multi-Perturbation Robustness via Hypernetworks
results: Extensive evaluation and comparison against various recent attack methods demonstrate that the proposed method achieves the best robustness trade-off and parameter savings across different datasets; for example, on CIFAR-10 with ResNet-50 as the backbone, PSAT reduces the parameter count by approximately 80% while achieving the state-of-the-art robustness trade-off accuracy.
Abstract
Adversarial training serves as one of the most popular and effective methods to defend against adversarial perturbations. However, most defense mechanisms only consider a single type of perturbation, while various attack methods might be adopted to perform stronger adversarial attacks against the deployed model in real-world scenarios, e.g., $\ell_2$ or $\ell_\infty$. Defending against various attacks can be a challenging problem since multi-perturbation adversarial training and its variants only achieve suboptimal robustness trade-offs, due to the theoretical limit to multi-perturbation robustness for a single model. Besides, it is impractical to deploy large models in some storage-efficient scenarios. To address these drawbacks, in this paper we propose a novel multi-perturbation adversarial training framework, parameter-saving adversarial training (PSAT), to reinforce multi-perturbation robustness with the advantageous side effect of saving parameters. It leverages hypernetworks to train specialized models against a single perturbation and aggregates these specialized models to defend against multiple perturbations. Eventually, we extensively evaluate and compare our proposed method with state-of-the-art single/multi-perturbation robust methods against various latest attack methods on different datasets, showing the robustness superiority and parameter efficiency of our proposed method, e.g., for the CIFAR-10 dataset with ResNet-50 as the backbone, PSAT saves approximately 80% of parameters while achieving the state-of-the-art robustness trade-off accuracy.
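The parameter-saving mechanism can be sketched with a toy hypernetwork: a shared backbone feeds a head whose weights are emitted by a small network conditioned on the perturbation type, so per-perturbation specialists share almost all parameters. The sizes and embedding-based conditioning below are assumptions for illustration, not PSAT's exact design.

```python
# Toy hypernetwork head: the weights of a linear classifier are generated
# from an embedding of the perturbation type (e.g., 0 = l2, 1 = linf).
import torch
import torch.nn as nn

class HyperHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=10, num_perturbations=2):
        super().__init__()
        self.embed = nn.Embedding(num_perturbations, 64)
        # Hypernetwork emits the head's weight matrix and bias.
        self.gen = nn.Linear(64, feat_dim * num_classes + num_classes)
        self.feat_dim, self.num_classes = feat_dim, num_classes

    def forward(self, features, perturbation_id):
        # features: (B, feat_dim); perturbation_id: (B,) long tensor
        params = self.gen(self.embed(perturbation_id))      # (B, F*C + C)
        W = params[:, :self.feat_dim * self.num_classes]
        W = W.view(-1, self.num_classes, self.feat_dim)
        b = params[:, self.feat_dim * self.num_classes:]
        # Per-sample specialized head applied to shared backbone features.
        return torch.einsum("bcf,bf->bc", W, features) + b
```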
Alzheimer’s Disease Prediction via Brain Structural-Functional Deep Fusing Network
paper_authors: Qiankun Zuo, Junren Pan, Shuqiang Wang
For: Developing a model that effectively fuses multimodal neuroimages to analyze the deterioration of Alzheimer's disease (AD).
Methods: Proposes a new model, the cross-modal transformer generative adversarial network (CT-GAN), which extracts functional information from functional magnetic resonance imaging (fMRI) and structural information from diffusion tensor imaging (DTI) and fuses them into unified multimodal connectivity.
Results: Evaluated on the ADNI dataset, the proposed CT-GAN clearly improves prediction performance and effectively detects AD-related brain regions. The model also provides new insights for detecting AD-related abnormal neural circuits.
Abstract
Fusing structural-functional images of the brain has shown great potential to analyze the deterioration of Alzheimer's disease (AD). However, it is a big challenge to effectively fuse the correlated and complementary information from multimodal neuroimages. In this paper, a novel model termed cross-modal transformer generative adversarial network (CT-GAN) is proposed to effectively fuse the functional and structural information contained in functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). The CT-GAN can learn topological features and generate multimodal connectivity from multimodal imaging data in an efficient end-to-end manner. Moreover, the swapping bi-attention mechanism is designed to gradually align common features and effectively enhance the complementary features between modalities. By analyzing the generated connectivity features, the proposed model can identify AD-related brain connections. Evaluations on the public ADNI dataset show that the proposed CT-GAN can dramatically improve prediction performance and detect AD-related brain regions effectively. The proposed model also provides new insights for detecting AD-related abnormal neural circuits.
DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI
results: Tested on the ADNI dataset, the model efficiently generates empirical SC-preserved connectivity and shows superior SC prediction performance compared with related models. It can also identify most of the important brain regions and connections, providing an alternative way to fuse multimodal brain networks and analyze clinical diseases.
Abstract
Mapping from functional connectivity (FC) to structural connectivity (SC) can facilitate multimodal brain network fusion and discover potential biomarkers for clinical implications. However, it is challenging to directly bridge the reliable non-linear mapping relations between SC and functional magnetic resonance imaging (fMRI). In this paper, a novel diffusion generative adversarial network-based fMRI-to-SC (DiffGAN-F2S) model is proposed to predict SC from brain fMRI in an end-to-end manner. To be specific, the proposed DiffGAN-F2S leverages denoising diffusion probabilistic models (DDPMs) and adversarial learning to efficiently generate high-fidelity SC through a few steps from fMRI. By designing the dual-channel multi-head spatial attention (DMSA) and graph convolutional modules, the symmetric graph generator first captures global relations among directly and indirectly connected brain regions, then models the local brain region interactions. It can uncover the complex mapping relations between fMRI and structural connectivity. Furthermore, the spatially connected consistency loss is devised to constrain the generator to preserve global-local topological information for accurate intrinsic SC prediction. Testing on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the proposed model can effectively generate empirical SC-preserved connectivity from four-dimensional imaging data and shows superior performance in SC prediction compared with other related models. Furthermore, the proposed model can identify the vast majority of important brain regions and connections derived from the empirical method, providing an alternative way to fuse multimodal brain networks and analyze clinical disease.
Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing
paper_authors: Lu Dai, Liqian Ma, Shenhan Qian, Hao Liu, Ziwei Liu, Hui Xiong
for: Generating 3D human body meshes from 2D clothing images.
methods: Proposes an end-to-end framework combining human pose estimation, body shape estimation, and a pose generation method. Kinematics-aware pose estimation is first used to estimate body pose parameters, with a 3D skeleton as a proxy and an inverse kinematics module to improve estimation accuracy; an adaptive depth trick is then proposed to disentangle the effects of object size and camera extrinsics.
results: Experimental results show that the method accurately recovers natural and diverse 3D body meshes from 2D clothing images, and can produce diverse plausible results during pose exploration and generation.
Abstract
In this paper, we define and study a new Cloth2Body problem which has a goal of generating 3D human body meshes from a 2D clothing image. Unlike the existing human mesh recovery problem, Cloth2Body needs to address new and emerging challenges raised by the partial observation of the input and the high diversity of the output. Indeed, there are three specific challenges. First, how to locate and pose human bodies into the clothes. Second, how to effectively estimate body shapes out of various clothing types. Finally, how to generate diverse and plausible results from a 2D clothing image. To this end, we propose an end-to-end framework that can accurately estimate 3D body mesh parameterized by pose and shape from a 2D clothing image. Along this line, we first utilize Kinematics-aware Pose Estimation to estimate body pose parameters. 3D skeleton is employed as a proxy followed by an inverse kinematics module to boost the estimation accuracy. We additionally design an adaptive depth trick to align the re-projected 3D mesh better with 2D clothing image by disentangling the effects of object size and camera extrinsic. Next, we propose Physics-informed Shape Estimation to estimate body shape parameters. 3D shape parameters are predicted based on partial body measurements estimated from RGB image, which not only improves pixel-wise human-cloth alignment, but also enables flexible user editing. Finally, we design Evolution-based pose generation method, a skeleton transplanting method inspired by genetic algorithms to generate diverse reasonable poses during inference. As shown by experimental results on both synthetic and real-world data, the proposed framework achieves state-of-the-art performance and can effectively recover natural and diverse 3D body meshes from 2D images that align well with clothing.
BEVHeight++: Toward Robust Visual Centric 3D Object Detection
paper_authors: Lei Yang, Tao Tang, Jun Li, Peng Chen, Kun Yuan, Li Wang, Yi Huang, Xinyu Zhang, Kaicheng Yu
for: This paper focuses on addressing the limitations of vision-centric bird’s eye view detection methods for roadside cameras, and proposes a simple yet effective approach called BEVHeight++ to improve the accuracy and robustness of camera-only perception methods.
methods: The proposed BEVHeight++ method regresses the height to the ground to achieve a distance-agnostic formulation, and incorporates both height and depth encoding techniques to improve the projection from 2D to BEV spaces.
results: The proposed method surpasses all previous vision-centric methods on popular 3D detection benchmarks for roadside cameras, and achieves significant improvements over depth-only methods in ego-vehicle scenarios. Specifically, the method yields a notable improvement of +1.9% NDS and +1.1% mAP over BEVDepth on the nuScenes validation set, and achieves substantial advancements of +2.8% NDS and +1.7% mAP on the nuScenes test set.
Abstract
While most recent autonomous driving systems focus on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach: leveraging intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that the state-of-the-art vision-centric bird's eye view detection methods have inferior performance on roadside cameras. This is because these methods mainly focus on recovering the depth regarding the camera center, where the depth difference between the car and the ground quickly shrinks while the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight++, to address this issue. In essence, we regress the height to the ground to achieve a distance-agnostic formulation to ease the optimization process of camera-only perception methods. By incorporating both height and depth encoding techniques, we achieve a more accurate and robust projection from 2D to BEV spaces. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. In the ego-vehicle scenario, our BEVHeight++ is superior to depth-only methods. Specifically, it yields a notable improvement of +1.9% NDS and +1.1% mAP over BEVDepth when evaluated on the nuScenes validation set. Moreover, on the nuScenes test set, our method achieves substantial advancements, with an increase of +2.8% NDS and +1.7% mAP, respectively.
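Why height regression is distance-agnostic can be seen from the back-projection geometry: given calibrated intrinsics K and a camera-to-ground transform (R, t) with the ground z-axis pointing up, a pixel plus its predicted height above ground fixes a unique 3D point without any per-pixel depth. A worked numpy sketch with toy calibration values follows; the calibration numbers are illustrative assumptions.

```python
# Back-project a pixel to the ground frame given its height above ground.
# Ray: p = lam * R @ K^-1 @ [u, v, 1] + t; solve lam so the z-coordinate
# of p equals the predicted height h.
import numpy as np

def pixel_height_to_ground(u, v, h, K, R, t):
    """u, v: pixel; h: predicted height above ground (m);
    returns (x, y, z) in the ground frame."""
    ray = R @ np.linalg.inv(K) @ np.array([u, v, 1.0])
    lam = (h - t[2]) / ray[2]          # scale so the point's z equals h
    return lam * ray + t

# Toy roadside camera mounted 5 m above the ground, looking along ground +y
# (camera z -> ground y, camera y-down -> ground -z).
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
R = np.array([[1.0, 0, 0], [0, 0, 1.0], [0, -1.0, 0]])
t = np.array([0.0, 0.0, 5.0])
print(pixel_height_to_ground(700, 500, 0.5, K, R, t))  # z comes out as 0.5
```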
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
results: Achieves performance comparable to baselines across three commonly used language-image pre-training models (about 0.32 average accuracy drop), and the spared GPU resources allow larger batch sizes, accelerating model pre-training and sometimes even enhancing downstream task performance.
Abstract
Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into efficient language-image pre-training, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method, i.e., ELIP, to remove less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressively pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30% of vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance with baselines (~0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, etc. In addition, the spared GPU resources from our ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. Our code will be released at https://github.com/guoyang9/ELIP.
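The pruning-and-merging step can be sketched as scoring each vision token by its affinity to a pooled text feature, keeping the top fraction, and merging the remainder into one averaged token so information is reduced rather than discarded. The scoring and merge rules below are illustrative assumptions, not ELIP's exact formulation.

```python
# Hedged sketch of language-supervised token pruning with a merge token.
import torch

def prune_and_merge(vision_tokens, text_query, keep_ratio=0.7):
    """vision_tokens: (N, T, D); text_query: (N, D) pooled text feature."""
    scores = (vision_tokens @ text_query.unsqueeze(-1)).squeeze(-1)  # (N, T)
    num_keep = int(keep_ratio * vision_tokens.shape[1])
    idx = scores.topk(num_keep, dim=1).indices                       # (N, k)
    keep = torch.gather(
        vision_tokens, 1,
        idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1]))
    # Merge pruned tokens into one averaged token instead of dropping them.
    mask = torch.ones_like(scores, dtype=torch.bool).scatter(1, idx, False)
    merged = (vision_tokens * mask.unsqueeze(-1)).sum(1) \
        / mask.sum(1, keepdim=True)
    return torch.cat([keep, merged.unsqueeze(1)], dim=1)             # (N, k+1, D)
```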
methods: We propose a new method, the Depth-Inference Termination Agent (DITA), which uses a supervised model called the Judge Model to implicitly infer object-wise depth information and decide termination jointly with reinforcement learning. The Judge Model is trained in parallel with reinforcement learning and is supervised efficiently by the reward signal.
results: Our evaluation shows that DITA performs strongly across all room types: compared with the baseline method, it achieves a 51.2% improvement in long-episode environments while maintaining a slightly better Success weighted by Path Length (SPL). Code and resources are available on GitHub: https://github.com/HuskyKingdom/DITA_acml2023. Abstract
This paper tackles the critical challenge of object navigation in autonomous navigation systems, particularly focusing on the problem of target approach and episode termination in environments with long optimal episode lengths in Deep Reinforcement Learning (DRL) based methods. While effective for environment exploration and object localization, conventional DRL methods often struggle with optimal path planning and termination recognition due to a lack of depth information. To overcome these limitations, we propose a novel approach, namely the Depth-Inference Termination Agent (DITA), which incorporates a supervised model called the Judge Model to implicitly infer object-wise depth and decide termination jointly with reinforcement learning. We train our Judge Model in parallel with reinforcement learning and supervise it efficiently by the reward signal. Our evaluation shows that the method demonstrates superior performance: we achieve a 9.3% gain in success rate over our baseline method across all room types and a 51.2% improvement in long-episode environments, while maintaining a slightly better Success weighted by Path Length (SPL). Code, resources, and visualizations are available at: https://github.com/HuskyKingdom/DITA_acml2023
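To make the joint termination decision concrete, here is a minimal sketch of how a supervised judge head could gate the policy's Done action. The coupling rule (terminate only when policy and judge agree, otherwise fall back to the next-best action), the feature dimension, and all names are assumptions for illustration, not DITA's actual architecture.

```python
import torch
import torch.nn as nn

class JudgeModel(nn.Module):
    """Supervised head that implicitly infers whether the target is within
    termination range from the agent's current visual feature."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128),
                                  nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, obs_feat):
        return torch.sigmoid(self.head(obs_feat))   # P(correct to terminate)

def act_with_termination(policy_logits, judge_prob, done_idx, thresh=0.5):
    """Terminate only when the policy proposes Done AND the judge agrees;
    otherwise fall back to the second-best action. policy_logits is 1-D."""
    action = int(policy_logits.argmax())
    if action == done_idx and float(judge_prob) < thresh:
        action = int(policy_logits.topk(2).indices[1])
    return action

judge = JudgeModel()
feat = torch.randn(512)
a = act_with_termination(torch.randn(6), judge(feat), done_idx=5)
```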
OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions
results: Compared with other methods, OSM-Net generates more natural and realistic talking-head motions under a reasonable one-to-many mapping paradigm. Abstract
One-shot talking head generation has no explicit head movement reference, so it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unrealistic talking-head performance. Other works construct a one-to-one mapping between the audio signal and head motion sequences, introducing ambiguous correspondences into the mapping, since people can behave differently in head motions when speaking the same content. This unreasonable mapping form fails to model the diversity and produces either nearly static or even exaggerated head motions, which are unnatural and strange. Therefore, the one-shot talking head generation task is actually a one-to-many ill-posed problem, and people present diverse head motions when speaking. Based on the above observation, we propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space that contains rich and various clip-level head motion features. Each basis of the space represents a feature of meaningful head motion in a clip rather than just a frame, thus providing more coherent and natural motion changes in talking heads. The driving audio is mapped into the motion space, around which various motion features can be sampled within a reasonable range to achieve the one-to-many mapping. Besides, a landmark constraint and time-window feature input improve accurate expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural, realistic head motions under a reasonable one-to-many mapping paradigm compared with other methods.
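The one-to-many sampling step lends itself to a short sketch: map the audio to a point in the motion space, perturb it within a small neighborhood, and blend the learned clip-level motion bases. This is one plausible realization under assumed shapes; the mapper, perturbation radius, and softmax blending are illustrative, not the paper's exact design.

```python
import torch

def sample_motion_feature(audio_feat, mapper, motion_bases, radius=0.1):
    """Map driving audio into the motion space, perturb it around the
    mapped point, and blend the clip-level motion bases.
    motion_bases: (K, D) learned clip-level head-motion features."""
    center = mapper(audio_feat)                          # (D,) point in motion space
    sample = center + radius * torch.randn_like(center)  # one-to-many sampling
    weights = torch.softmax(sample @ motion_bases.t(), dim=-1)   # (K,)
    return weights @ motion_bases                        # (D,) blended motion feature

# Each call yields a different, but nearby, plausible motion feature:
mapper = torch.nn.Linear(128, 64)
motion = sample_motion_feature(torch.randn(128), mapper, torch.randn(32, 64))
```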
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
for: This paper studies cross-modal sponsored search, aiming to improve the effectiveness of ad retrieval in search engines.
methods: The paper uses a simple alignment network that explicitly maps fine-grained visual parts in ad images to the corresponding text, without requiring expensive labeled training data. It also proposes a novel model that effectively conducts cross-modal alignment and ad retrieval using only half of the training data.
results: Compared with state-of-the-art models, the proposed model achieves a 2.57% improvement on a large commercial dataset. The paper also studies a typical cross-modal retrieval task on the MSCOCO dataset, achieving consistent performance gains that demonstrate the generality of the method. Abstract
Cross-modal sponsored search displays multi-modal advertisements (ads) when consumers look for desired products via natural language queries in search engines. Since multi-modal ads bring complementary details for query-ads matching, the ability to align ads-specific information in both images and texts is crucial for accurate and flexible sponsored search. Conventional research mainly models the implicit correlations between images and texts for query-ads matching, ignoring the alignment of detailed product information and resulting in suboptimal search performance. In this work, we propose a simple alignment network for explicitly mapping fine-grained visual parts in ads images to the corresponding text, which leverages the co-occurrence structure consistency between vision and language spaces without requiring expensive labeled training data. Moreover, we propose a novel model for cross-modal sponsored search that effectively conducts the cross-modal alignment and query-ads matching in two separate processes. In this way, the model matches the multi-modal input in the same language space, resulting in superior performance with merely half of the training data. Our model outperforms the state-of-the-art models by 2.57% on a large commercial dataset. Besides sponsored search, our alignment method is applicable to general cross-modal search. We study a typical cross-modal retrieval task on the MSCOCO dataset, which achieves consistent performance improvement and proves the generalization ability of our method. Our code is available at https://github.com/Pter61/AlignCMSS/
CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
paper_authors: Shaoxiang Guo, Qing Cai, Lin Qi, Junyu Dong
for: This paper proposes a novel 3D hand pose estimator from monocular images, which can successfully bridge the gap between text prompts and irregular detailed pose distribution.
methods: The proposed model uses a CLIP-based contrastive learning paradigm, which encodes pose-aware features and maximizes semantic consistency for a pair of pose-text features. A coarse-to-fine mesh regressor is also designed to effectively query joint-aware cues from the feature pyramid.
results: The proposed model achieves significantly faster inference while attaining state-of-the-art performance compared to methods using a similar-scale backbone, on several public hand benchmarks. Abstract
Contrastive Language-Image Pre-training (CLIP) starts to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed as CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial distribution (in x, y, and z axes) is encoded to form pose-aware features. Subsequently, we maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods utilizing the similar scale backbone.
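The pose-text consistency objective follows the standard CLIP contrastive paradigm, which can be sketched as a symmetric InfoNCE loss over a batch of paired pose-aware and text features. The temperature value and shapes are assumptions; this is the generic CLIP-style loss the abstract refers to, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def pose_text_contrastive_loss(pose_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pose-aware and text features, following
    the CLIP training paradigm. Both inputs: (B, D); matched pairs share
    the same batch index."""
    pose = F.normalize(pose_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = pose @ text.t() / temperature               # (B, B) similarities
    targets = torch.arange(pose.size(0), device=pose.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = pose_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```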
Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling
paper_authors: Ke Yu, Stephen Albro, Giulia DeSalvo, Suraj Kothawade, Abdullah Rashwan, Sasan Tavakkol, Kayhan Batmanghelich, Xiaoqi Yin
for: To improve the quality of instance segmentation model training while reducing annotation cost.
methods: Integrates uncertainty-based sampling with diversity-based sampling; the algorithm is simple, easy to implement, and delivers strong performance.
results: Achieves a fivefold improvement in labeling efficiency across multiple datasets. Abstract
Training high-quality instance segmentation models requires an abundance of labeled images with instance masks and classifications, which is often expensive to procure. Active learning addresses this challenge by striving for optimum performance with minimal labeling cost by selecting the most informative and representative images for labeling. Despite its potential, active learning has been less explored in instance segmentation compared to other tasks like image classification, which require less labeling. In this study, we propose a post-hoc active learning algorithm that integrates uncertainty-based sampling with diversity-based sampling. Our proposed algorithm is not only simple and easy to implement, but it also delivers superior performance on various datasets. Its practical application is demonstrated on a real-world overhead imagery dataset, where it increases the labeling efficiency fivefold.
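The two-step idea, an uncertainty shortlist followed by a diversity-based pick, can be sketched with a standard k-center greedy (farthest-point) selection. The shortlist factor and the particular diversity criterion are assumptions; the abstract does not specify the exact sampling rules.

```python
import numpy as np

def two_step_select(uncertainty, embeddings, budget, shortlist_factor=5):
    """Step 1: shortlist the most uncertain images.
    Step 2: pick a diverse subset of the shortlist via k-center greedy
    (farthest-point sampling) in embedding space.
    uncertainty: (N,); embeddings: (N, D); returns `budget` image indices."""
    shortlist = np.argsort(-uncertainty)[:budget * shortlist_factor]
    feats = embeddings[shortlist]
    chosen = [0]                                         # seed: most uncertain image
    dists = np.linalg.norm(feats - feats[0], axis=1)
    for _ in range(budget - 1):
        nxt = int(dists.argmax())                        # farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(feats - feats[nxt], axis=1))
    return shortlist[chosen]

picked = two_step_select(np.random.rand(1000), np.random.rand(1000, 64), budget=20)
```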
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
methods: context-dependent mapping network (Context-I2W) with intent view selector and visual target extractor
results: strong generalization ability and significant performance boosts on four ZS-CIR tasks, achieving new state-of-the-art results. Abstract
Different from Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, for adaptively converting description-relevant Image information into a pseudo-word token composed of the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/context_i2w.
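To give a flavor of mapping an image to a single context-dependent pseudo-word token, here is a minimal sketch using learnable queries that attend over image tokens and project into the text embedding space. The architecture, dimensions, and names are assumptions; Context-I2W's actual Intent View Selector and Visual Target Extractor are more elaborate.

```python
import torch
import torch.nn as nn

class ContextMapper(nn.Module):
    """Learnable queries attend over image tokens and are projected into
    the text embedding space as a single pseudo-word token."""
    def __init__(self, img_dim=768, txt_dim=512, n_queries=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, img_dim))
        self.attn = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, img_tokens):                       # (B, N, img_dim)
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        out, _ = self.attn(q, img_tokens, img_tokens)    # queries attend to image
        return self.proj(out.mean(dim=1))                # (B, txt_dim) pseudo-word

# The pseudo-word would then be spliced into a caption template such as
# "a photo of [pseudo-word]" before encoding with a frozen text encoder.
token = ContextMapper()(torch.randn(2, 197, 768))
```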
A dual-branch model with inter- and intra-branch contrastive loss for long-tailed recognition
results: Experimental results show that our DB-LTR is competitive with and superior to the compared methods on three long-tailed benchmark datasets (CIFAR100-LT, ImageNet-LT, and Places-LT). Abstract
Real-world data often exhibits a long-tailed distribution, in which head classes occupy most of the data, while tail classes only have very few samples. Models trained on long-tailed datasets have poor adaptability to tail classes and the decision boundaries are ambiguous. Therefore, in this paper, we propose a simple yet effective model, named Dual-Branch Long-Tailed Recognition (DB-LTR), which includes an imbalanced learning branch and a Contrastive Learning Branch (CoLB). The imbalanced learning branch, which consists of a shared backbone and a linear classifier, leverages common imbalanced learning approaches to tackle the data imbalance issue. In CoLB, we learn a prototype for each tail class, and calculate an inter-branch contrastive loss, an intra-branch contrastive loss and a metric loss. CoLB can improve the capability of the model in adapting to tail classes and assist the imbalanced learning branch to learn a well-represented feature space and discriminative decision boundary. Extensive experiments on three long-tailed benchmark datasets, i.e., CIFAR100-LT, ImageNet-LT and Places-LT, show that our DB-LTR is competitive and superior to the comparative methods.
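One plausible form of CoLB's prototype-based objective is a metric loss that pulls each feature toward its class prototype and away from the others, as sketched below. This is a generic sketch under assumed shapes, not the paper's exact inter-branch and intra-branch contrastive formulation.

```python
import torch
import torch.nn.functional as F

def prototype_metric_loss(feats, labels, prototypes, temperature=0.1):
    """Pull each feature toward its class prototype and away from the
    other prototypes. feats: (B, D); labels: (B,); prototypes: (C, D)."""
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.t() / temperature            # (B, C) similarities
    return F.cross_entropy(logits, labels)

loss = prototype_metric_loss(torch.randn(16, 256),
                             torch.randint(0, 100, (16,)),
                             torch.randn(100, 256))
```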
MASK4D: Mask Transformer for 4D Panoptic Segmentation
paper_authors: Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe
for: This work aims to improve the ability of autonomous agents to make sound decisions in dynamic environments by proposing the Mask4D model for 4D panoptic segmentation of LiDAR point cloud sequences.
methods: Mask4D is the first transformer-based model to tightly unify semantic instance segmentation and tracking, directly predicting semantic instances and their temporal associations without hand-crafted, non-learned association strategies.
results: Mask4D achieves a new state of the art on the SemanticKITTI test set with a score of 68.4 LSTQ, outperforming the best published methods by at least +4.5%. Abstract
Accurately perceiving and tracking instances over time is essential for the decision-making processes of autonomous agents interacting safely in dynamic environments. With this intention, we propose Mask4D for the challenging task of 4D panoptic segmentation of LiDAR point clouds. Mask4D is the first transformer-based approach unifying semantic instance segmentation and tracking of sparse and irregular sequences of 3D point clouds into a single joint model. Our model directly predicts semantic instances and their temporal associations without relying on any hand-crafted non-learned association strategies such as probabilistic clustering or voting-based center prediction. Instead, Mask4D introduces spatio-temporal instance queries which encode the semantic and geometric properties of each semantic tracklet in the sequence. In an in-depth study, we find that it is critical to promote spatially compact instance predictions as spatio-temporal instance queries tend to merge multiple semantically similar instances, even if they are spatially distant. To this end, we regress 6-DOF bounding box parameters from spatio-temporal instance queries, which is used as an auxiliary task to foster spatially compact predictions. Mask4D achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ, improving upon published top-performing methods by at least +4.5%.
Joint Correcting and Refinement for Balanced Low-Light Image Enhancement
methods: Proposes a novel synergistic structure, the Joint Correcting and Refinement Network (JCRNet), which balances brightness, color, and illumination during enhancement through three stages.
results: Compared with 21 state-of-the-art methods on 9 benchmark datasets, JCRNet shows broad performance advantages, and it also achieves better results on downstream visual tasks (e.g., saliency detection). Abstract
Low-light image enhancement tasks demand an appropriate balance among brightness, color, and illumination. Existing methods often focus on one aspect of the image without considering this balance, which causes problems such as color distortion and overexposure. This seriously affects both human visual perception and the performance of high-level visual models. In this work, a novel synergistic structure is proposed which balances brightness, color, and illumination more effectively. Specifically, the proposed method, the Joint Correcting and Refinement Network (JCRNet), mainly consists of three stages to balance the brightness, color, and illumination of enhancement. Stage 1: we utilize a basic encoder-decoder and a local supervision mechanism to extract local information and more comprehensive details for enhancement. Stage 2: cross-stage feature transmission and spatial feature transformation further facilitate color correction and feature refinement. Stage 3: we employ a dynamic illumination adjustment approach to embed residuals between predicted and ground-truth images into the model, adaptively adjusting the illumination balance. Extensive experiments demonstrate that the proposed method exhibits comprehensive performance advantages over 21 state-of-the-art methods on 9 benchmark datasets. Furthermore, an additional experiment has been conducted to validate the effectiveness of our approach in downstream visual tasks (e.g., saliency detection). Compared to several enhancement models, the proposed method effectively improves the segmentation results and quantitative metrics of saliency detection. The source code will be available at https://github.com/woshiyll/JCRNet.
Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation
results: The paper achieves state-of-the-art results on different datasets, improving the accuracy of semantic image segmentation. Abstract
Many semantic image segmentation methods have borrowed from the success of open compound domain adaptation. They minimize the style gap between images of the source and target domains, making it easier to predict accurate pseudo annotations for the target domain's images that train the segmentation network. The existing methods globally adapt the scene style of the images, whereas the object styles of different categories or instances are adapted improperly. This paper proposes Object Style Compensation, where we construct an Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from the target to the source domain. We learn the discrepancy features from the images of the source and target domains, storing them in memory. With this memory, we select appropriate discrepancy features to compensate the style information of object instances of various categories, adapting the object styles to a unified style of the source domain. Our method enables a more accurate computation of the pseudo annotations for the target domain's images, thus yielding state-of-the-art results on different datasets.
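A minimal sketch of the compensation step might look like the following: retrieve the discrepancy features stored for an object's category and use them to shift the object's features toward the source-domain style. The soft attention-based selection and the additive compensation are assumptions; the abstract does not specify the exact selection operator.

```python
import torch

def compensate_object_style(obj_feat, category, memory):
    """Softly select a discrepancy feature for this category from the
    Object-Level Discrepancy Memory and add it to shift the object's
    style toward the source domain.
    obj_feat: (D,); memory[category]: (M, D) discrepancy feature set."""
    bank = memory[category]                                       # (M, D)
    sims = (bank @ obj_feat) / (bank.norm(dim=1) * obj_feat.norm() + 1e-8)
    weights = torch.softmax(sims, dim=0)                          # soft selection
    discrepancy = (weights.unsqueeze(1) * bank).sum(dim=0)        # (D,)
    return obj_feat + discrepancy

memory = {'car': torch.randn(8, 256)}
styled = compensate_object_style(torch.randn(256), 'car', memory)
```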
UVL: A Unified Framework for Video Tampering Localization
results: Achieves state-of-the-art performance on various benchmarks covering three types of synthetic forgery (video inpainting, video splicing, and DeepFake), and outperforms existing methods by a large margin in cross-dataset evaluation. Abstract
With the development of deep learning technology, various forgery methods emerge endlessly. Meanwhile, methods to detect these fake videos have achieved excellent performance on some datasets. However, these methods suffer from poor generalization to unknown videos and are ineffective for new forgery methods. To address this challenging problem, we propose UVL, a novel unified video tampering localization framework for synthesized forgeries. Specifically, UVL extracts common features of synthetic forgeries: boundary artifacts of synthetic edges, the unnatural distribution of generated pixels, and the non-correlation between the forgery region and the original. These features are widely present in different types of synthetic forgeries and help improve generalization for detecting unknown videos. Extensive experiments on three types of synthetic forgery (video inpainting, video splicing, and DeepFake) show that the proposed UVL achieves state-of-the-art performance on various benchmarks and outperforms existing methods by a large margin in cross-dataset evaluation.
D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation
paper_authors: Yixuan Wang, Zhuoran Li, Mingtong Zhang, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li
for: This paper proposes a novel scene representation to improve the accuracy and generalization of robotic manipulation systems.
methods: It uses dynamic 3D descriptor fields, which capture the dynamic 3D environment of the workspace and combine semantic features with instance masks. Specifically, arbitrary 3D points are projected onto multi-view visual observations, and features extracted from foundation models are interpolated and fused.
results: The results show that D$^3$Fields enable zero-shot robotic manipulation tasks and outperform existing dense descriptors in both generalization and manipulation accuracy. Abstract
Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields - dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from foundational models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to a wide range of robotic manipulation tasks in a zero-shot manner. Through extensive evaluation in both real-world scenarios and simulations, we demonstrate that D$^3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks. In quantitative comparisons with state-of-the-art dense descriptors, such as Dense Object Nets and DINO, D$^3$Fields exhibit significantly better generalization abilities and manipulation accuracy.
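The projection-and-fusion step described above can be sketched directly: project each 3D point into every view, bilinearly sample the 2D feature maps, and average. This sketch assumes pinhole intrinsics and world-to-camera extrinsics, and it omits the visibility and occlusion handling a real system needs.

```python
import torch
import torch.nn.functional as F

def fuse_descriptors(points, feat_maps, intrinsics, extrinsics):
    """Project 3D world points into each view, bilinearly sample 2D
    features, and average them into one descriptor per point.
    points: (N, 3); feat_maps: (V, C, H, W); intrinsics: (V, 3, 3);
    extrinsics: (V, 3, 4) world-to-camera. Visibility handling omitted."""
    N, V = points.size(0), feat_maps.size(0)
    H, W = feat_maps.shape[-2:]
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)           # (N, 4)
    fused = torch.zeros(N, feat_maps.size(1))
    for v in range(V):
        cam = homo @ extrinsics[v].t()                            # (N, 3) camera coords
        uv = cam @ intrinsics[v].t()
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)               # pixel coords
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,           # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
        feat = F.grid_sample(feat_maps[v:v + 1], grid.view(1, N, 1, 2),
                             align_corners=True)                  # (1, C, N, 1)
        fused += feat.squeeze(0).squeeze(-1).t()                  # (N, C)
    return fused / V
```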
Learning Effective NeRFs and SDFs Representations with 3D Generative Adversarial Networks for 3D Object Generation: Technical Report for ICCV 2023 OmniObject3D Challenge
results: The model can be trained with only a few images of each object from a variety of classes, instead of using a large number of images per object or training one model per class. This pipeline makes it possible to optimize an effective model for 3D object generation. The solution placed in the final top three of the ICCV 2023 OmniObject3D Challenge. Abstract
In this technical report, we present a solution for 3D object generation in the ICCV 2023 OmniObject3D Challenge. In recent years, 3D object generation has made great progress and achieved promising results, but it remains a challenging task due to the difficulty of generating complex, textured and high-fidelity results. To resolve this problem, we study learning effective NeRF and SDF representations with 3D Generative Adversarial Networks (GANs) for 3D object generation. Specifically, inspired by recent works, we use efficient geometry-aware 3D GANs as the backbone, incorporating label embedding and color mapping, which enables training the model on different taxonomies simultaneously. Then, through a decoder, we aggregate the resulting features to generate Neural Radiance Field (NeRF) based representations for rendering high-fidelity synthetic images. Meanwhile, we optimize Signed Distance Functions (SDFs) to effectively represent objects with 3D meshes. Besides, we observe that this model can be effectively trained with only a few images of each object from a variety of classes, instead of using a great number of images per object or training one model per class. With this pipeline, we can optimize an effective model for 3D object generation. This solution is one of the final top-3-place solutions in the ICCV 2023 OmniObject3D Challenge.