results: The BiLSTM method outperforms traditional regression approaches, captures nonlinear dynamics between multiple time series, and remains robust under different growing conditions, performing especially well over the senescence period; BiLSTM can therefore be used for LAI imputation from time-series Sentinel-1 VH/VV and Sentinel-2 data.
Abstract
The Leaf Area Index (LAI) is vital for predicting winter wheat yield. Acquisition of crop conditions via Sentinel-2 remote sensing images can be hindered by persistent clouds, affecting yield predictions. Synthetic Aperture Radar (SAR) provides all-weather imagery, and the ratio between its cross- and co-polarized channels (C-band) shows a high correlation with time series LAI over winter wheat regions. This study evaluates the use of time series Sentinel-1 VH/VV for LAI imputation, aiming to increase spatial-temporal density. We utilize a bidirectional LSTM (BiLSTM) network to impute time series LAI and use half mean squared error for each time step as the loss function. We trained models on data from southern Germany and the North China Plain using only LAI data generated by Sentinel-1 VH/VV and Sentinel-2. Experimental results show BiLSTM outperforms traditional regression methods, capturing nonlinear dynamics between multiple time series. It proves robust in various growing conditions and is effective even with limited Sentinel-2 images. BiLSTM's performance surpasses that of LSTM, particularly over the senescence period. Therefore, BiLSTM can be used to impute LAI with time-series Sentinel-1 VH/VV and Sentinel-2 data, and this method could be applied to other time-series imputation issues.
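As a rough sketch of the imputation setup described above (the layer sizes, the observation mask for cloud-free Sentinel-2 steps, and all names are assumptions, not the authors' code), a BiLSTM mapping a Sentinel-1 VH/VV series to per-step LAI, trained with half mean squared error at each supervised time step, might look like:

```python
import torch
import torch.nn as nn

class BiLSTMImputer(nn.Module):
    """Maps a time series of Sentinel-1 VH/VV ratios to per-step LAI estimates."""
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # one LAI value per time step

    def forward(self, x):                # x: (batch, time, in_dim)
        h, _ = self.lstm(x)              # (batch, time, 2*hidden)
        return self.head(h).squeeze(-1)  # (batch, time)

def half_mse_per_step(pred, target, mask):
    """Half mean squared error, averaged over time steps where a
    (cloud-free) Sentinel-2 LAI observation exists (mask == 1)."""
    err = 0.5 * (pred - target) ** 2
    return (err * mask).sum() / mask.sum().clamp(min=1)

# toy usage: 8 pixels, 30-step VH/VV series, sparse LAI labels
vhvv = torch.randn(8, 30, 1)
lai = torch.rand(8, 30) * 6
obs = (torch.rand(8, 30) > 0.5).float()
model = BiLSTMImputer()
loss = half_mse_per_step(model(vhvv), lai, obs)
loss.backward()
```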
BiGSeT: Binary Mask-Guided Separation Training for DNN-based Hyperspectral Anomaly Detection
results: We validate our training strategy on real-world datasets and show higher detection performance than several state-of-the-art methods; in particular, we achieve a 90.67% AUC score on the HyMap Cooke City dataset. We also apply the training strategy to other deep network structures, improving their anomaly detection performance.
Abstract
Hyperspectral anomaly detection (HAD) aims to recognize a minority of anomalies that are spectrally different from their surrounding background without prior knowledge. Deep neural networks (DNNs), including autoencoders (AEs), convolutional neural networks (CNNs) and vision transformers (ViTs), have shown remarkable performance in this field due to their powerful ability to model the complicated background. However, for reconstruction tasks, DNNs tend to incorporate both background and anomalies into the estimated background, which is referred to as the identical mapping problem (IMP) and leads to significantly decreased performance. To address this limitation, we propose a model-independent binary mask-guided separation training strategy for DNNs, named BiGSeT. Our method introduces a separation training loss based on a latent binary mask to separately constrain the background and anomalies in the estimated image. The background is preserved, while the potential anomalies are suppressed by using an efficient second-order Laplacian of Gaussian (LoG) operator, generating a pure background estimate. In order to maintain separability during training, we periodically update the mask using a robust proportion threshold estimated before the training. In our experiments, we adopt a vanilla AE as the network to validate our training strategy on several real-world datasets. Our results show superior performance compared to some state-of-the-art methods. Specifically, we achieved a 90.67% AUC score on the HyMap Cooke City dataset. Additionally, we applied our training strategy to other deep network structures, achieving improved detection performance compared to their original versions, demonstrating its effective transferability. The code of our method will be available at https://github.com/enter-i-username/BiGSeT.
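A minimal sketch of the separation training idea, assuming a vanilla AE producing a reconstruction `x_hat`; the 5x5 LoG kernel, the loss weighting, and the error-quantile mask update below are illustrative stand-ins for the paper's exact operators and proportion threshold:

```python
import torch
import torch.nn.functional as F

# common 5x5 discrete Laplacian-of-Gaussian approximation, applied band-wise
LOG = torch.tensor([[ 0.,  0., -1.,  0.,  0.],
                    [ 0., -1., -2., -1.,  0.],
                    [-1., -2., 16., -2., -1.],
                    [ 0., -1., -2., -1.,  0.],
                    [ 0.,  0., -1.,  0.,  0.]])

def separation_loss(x, x_hat, mask, lam=1.0):
    """x, x_hat: (B, C, H, W) hyperspectral image and AE reconstruction.
    mask: (B, 1, H, W) binary, 1 = suspected anomaly. Background pixels are
    pushed toward faithful reconstruction; suspected anomalies are suppressed
    by penalizing the LoG response of the estimate."""
    bg = ((1 - mask) * (x - x_hat) ** 2).mean()
    k = LOG.view(1, 1, 5, 5).repeat(x.size(1), 1, 1, 1).to(x)
    log_resp = F.conv2d(x_hat, k, padding=2, groups=x.size(1))
    an = (mask * log_resp ** 2).mean()
    return bg + lam * an

def update_mask(x, x_hat, prop=0.02):
    """Periodic mask refresh: flag the top `prop` fraction of pixels by
    reconstruction error (a stand-in for the paper's proportion threshold)."""
    err = ((x - x_hat) ** 2).mean(dim=1, keepdim=True)     # (B, 1, H, W)
    thr = torch.quantile(err.flatten(1), 1 - prop, dim=1).view(-1, 1, 1, 1)
    return (err > thr).float()
```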
Transient Neural Radiance Fields for Lidar View Synthesis and 3D Reconstruction
paper_authors: Anagh Malik, Parsa Mirdehghan, Sotiris Nousias, Kiriakos N. Kutulakos, David B. Lindell
for: Rendering virtual scenes that simulate raw lidar measurements.
methods: Uses a time-resolved version of the volume rendering equation to render lidar measurements and capture transient light transport phenomena.
results: Renders transient lidar measurements from novel views and, compared with conventional point cloud-based supervision, recovers improved geometry and conventional appearance.
Abstract
Neural radiance fields (NeRFs) have become a ubiquitous tool for modeling scene appearance and geometry from multiview imagery. Recent work has also begun to explore how to use additional supervision from lidar or depth sensor measurements in the NeRF framework. However, previous lidar-supervised NeRFs focus on rendering conventional camera imagery and use lidar-derived point cloud data as auxiliary supervision; thus, they fail to incorporate the underlying image formation model of the lidar. Here, we propose a novel method for rendering transient NeRFs that take as input the raw, time-resolved photon count histograms measured by a single-photon lidar system, and we seek to render such histograms from novel views. Different from conventional NeRFs, the approach relies on a time-resolved version of the volume rendering equation to render the lidar measurements and capture transient light transport phenomena at picosecond timescales. We evaluate our method on a first-of-its-kind dataset of simulated and captured transient multiview scans from a prototype single-photon lidar. Overall, our work brings NeRFs to a new dimension of imaging at transient timescales, newly enabling rendering of transient imagery from novel views. Additionally, we show that our approach recovers improved geometry and conventional appearance compared to point cloud-based supervision when training on few input viewpoints. Transient NeRFs may be especially useful for applications which seek to simulate raw lidar measurements for downstream tasks in autonomous driving, robotics, and remote sensing.
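The time-resolved rendering can be pictured as standard NeRF compositing whose per-sample weights are deposited into histogram bins by round-trip time-of-flight. The sketch below is a simplification that ignores the laser pulse shape, sensor jitter, and multi-bounce transport, and all names are assumptions:

```python
import numpy as np

C = 3e8  # speed of light, m/s

def render_transient(sigmas, radiances, dists, bin_width_ps=4.0, n_bins=2048):
    """Minimal time-resolved volume rendering along one ray.
    sigmas, radiances, dists: per-sample density, returned intensity, and
    distance from the sensor (length-N arrays). Each sample's compositing
    weight is binned by its round-trip time-of-flight, approximating a
    single-photon lidar photon-count histogram."""
    deltas = np.diff(dists, append=dists[-1] + (dists[-1] - dists[-2]))
    alphas = 1.0 - np.exp(-sigmas * deltas)                    # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                                   # standard NeRF weights
    tof_s = 2.0 * dists / C                                    # round-trip time
    bins = (tof_s / (bin_width_ps * 1e-12)).astype(int)
    hist = np.zeros(n_bins)
    np.add.at(hist, np.clip(bins, 0, n_bins - 1), weights * radiances)
    return hist
```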
Reconstructing Three-decade Global Fine-Grained Nighttime Light Observations by a New Super-Resolution Framework
results: Validation on one billion data points shows that our model reaches a correlation coefficient of 0.873 at the global scale, significantly higher than existing models (maximum = 0.713), and it also outperforms them at the national and urban scales. Through an inspection of airports and roads, only our model's image details reveal the historical development of these facilities. We release the long-term, fine-grained nighttime light observations to support research on human activities; the dataset is available at \url{https://doi.org/10.5281/zenodo.7859205}.
Abstract
Satellite-collected nighttime light provides a unique perspective on human activities, including urbanization, population growth, and epidemics. Yet, long-term and fine-grained nighttime light observations are lacking, leaving the analysis and applications of decades of light changes in urban facilities undeveloped. To fill this gap, we developed an innovative framework and used it to design a new super-resolution model that reconstructs low-resolution nighttime light data into high resolution. The validation of one billion data points shows that the correlation coefficient of our model at the global scale reaches 0.873, which is significantly higher than that of other existing models (maximum = 0.713). Our model also outperforms existing models at the national and urban scales. Furthermore, through an inspection of airports and roads, only our model's image details can reveal the historical development of these facilities. We provide the long-term and fine-grained nighttime light observations to promote research on human activities. The dataset is available at \url{https://doi.org/10.5281/zenodo.7859205}.
Sampling-Priors-Augmented Deep Unfolding Network for Robust Video Compressive Sensing
paper_authors: Yuhao Huang, Gangrong Qu, Youran Ge
for: High-speed scene recording with low-frame-rate sensors.
methods: Uses a Sampling-Priors-Augmented Deep Unfolding Network (SPA-DUN) for efficient and robust multi-frame reconstruction.
results: Achieves SOTA performance on both simulated and real data, handles diverse sampling settings with a single model, and improves interpretability and generality.
Abstract
Video Compressed Sensing (VCS) aims to reconstruct multiple frames from one single captured measurement, thus achieving high-speed scene recording with a low-frame-rate sensor. Although there have been impressive advances in VCS recently, those state-of-the-art (SOTA) methods also significantly increase model complexity and suffer from poor generality and robustness, which means that those networks need to be retrained to accommodate the new system. Such limitations hinder the real-time imaging and practical deployment of models. In this work, we propose a Sampling-Priors-Augmented Deep Unfolding Network (SPA-DUN) for efficient and robust VCS reconstruction. Under the optimization-inspired deep unfolding framework, a lightweight and efficient U-net is exploited to downsize the model while improving overall performance. Moreover, the prior knowledge from the sampling model is utilized to dynamically modulate the network features to enable single SPA-DUN to handle arbitrary sampling settings, augmenting interpretability and generality. Extensive experiments on both simulation and real datasets demonstrate that SPA-DUN is not only applicable for various sampling settings with one single model but also achieves SOTA performance with incredible efficiency.
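A generic deep-unfolding skeleton conveying the idea (not SPA-DUN itself; its sampling-prior-driven feature modulation and U-net design are omitted): each stage alternates a gradient step on the snapshot-measurement data term with a learned denoiser, and all names are assumptions:

```python
import torch
import torch.nn as nn

class UnfoldedStage(nn.Module):
    """One unrolled iteration: a gradient step on the data-fidelity term
    ||y - sum_t(phi_t * x_t)||^2 followed by a learned denoising prior."""
    def __init__(self, denoiser):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))  # learnable step size
        self.denoiser = denoiser  # e.g. a lightweight net treating T as channels

    def forward(self, x, y, phi):
        # x: (B, T, H, W) frame estimate; y: (B, H, W) snapshot measurement
        # phi: (B, T, H, W) binary sampling masks (the sampling prior)
        residual = (phi * x).sum(dim=1) - y
        x = x - self.step * phi * residual.unsqueeze(1)
        return self.denoiser(x)

class UnfoldingNet(nn.Module):
    def __init__(self, make_denoiser, n_stages=8):
        super().__init__()
        self.stages = nn.ModuleList(UnfoldedStage(make_denoiser())
                                    for _ in range(n_stages))

    def forward(self, y, phi):
        x = phi * y.unsqueeze(1)  # crude init by mask back-projection
        for stage in self.stages:
            x = stage(x, y, phi)
        return x
```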
cOOpD: Reformulating COPD classification on chest CT scans as anomaly detection using contrastive representations
paper_authors: Silvia D. Almeida, Carsten T. Lüth, Tobias Norajitra, Tassilo Wald, Marco Nolden, Paul F. Jaeger, Claus P. Heussel, Jürgen Biederer, Oliver Weinheimer, Klaus Maier-Hein
methods: Uses a self-supervised contrastive pretext model to learn representations of lung regions, then a generative model to detect anomalies.
results: Achieves the best performance on two public datasets, improving AUROC by 8.2% and 7.7% over the previous supervised state-of-the-art. It also yields well-interpretable spatial anomaly maps and patient-level scores that help identify individuals in early stages of disease progression.
Abstract
Classification of heterogeneous diseases is challenging due to their complexity, variability of symptoms and imaging findings. Chronic Obstructive Pulmonary Disease (COPD) is a prime example, being underdiagnosed despite being the third leading cause of death. Its sparse, diffuse and heterogeneous appearance on computed tomography challenges supervised binary classification. We reformulate COPD binary classification as an anomaly detection task, proposing cOOpD: heterogeneous pathological regions are detected as Out-of-Distribution (OOD) from normal homogeneous lung regions. To this end, we learn representations of unlabeled lung regions employing a self-supervised contrastive pretext model, potentially capturing specific characteristics of diseased and healthy unlabeled regions. A generative model then learns the distribution of healthy representations and identifies abnormalities (stemming from COPD) as deviations. Patient-level scores are obtained by aggregating region OOD scores. We show that cOOpD achieves the best performance on two public datasets, with an increase of 8.2% and 7.7% in terms of AUROC compared to the previous supervised state-of-the-art. Additionally, cOOpD yields well-interpretable spatial anomaly maps and patient-level scores which we show to be of additional value in identifying individuals in the early stage of progression. Experiments in artificially designed real-world prevalence settings further support that anomaly detection is a powerful way of tackling COPD classification.
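A sketch of the scoring pipeline, with a Gaussian mixture standing in for the unspecified generative model and mean aggregation standing in for the paper's patient-level aggregation; both choices are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_healthy_model(healthy_embeddings, n_components=16):
    """Fit a generative model to representations of healthy lung regions.
    healthy_embeddings: (N, D) features from the contrastive encoder."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(healthy_embeddings)
    return gmm

def region_ood_scores(gmm, embeddings):
    # low likelihood under the healthy model => out-of-distribution region
    return -gmm.score_samples(embeddings)  # (M,) per-region anomaly scores

def patient_score(gmm, patient_region_embeddings, agg=np.mean):
    return agg(region_ood_scores(gmm, patient_region_embeddings))
```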
Masked Autoencoders for Unsupervised Anomaly Detection in Medical Images
results: Experiments on two medical image datasets, BRATS2020 and LUNA16, comparing against four state-of-the-art anomaly detection frameworks (AST, RD4AD, AnoVAEGAN and f-AnoGAN), show that our method performs favorably at anomaly detection.
Abstract
Pathological anomalies exhibit diverse appearances in medical imaging, making it difficult to collect and annotate a representative amount of data required to train deep learning models in a supervised setting. Therefore, in this work, we tackle anomaly detection in medical images training our framework using only healthy samples. We propose to use the Masked Autoencoder model to learn the structure of the normal samples, then train an anomaly classifier on top of the difference between the original image and the reconstruction provided by the masked autoencoder. We train the anomaly classifier in a supervised manner using as negative samples the reconstruction of the healthy scans, while as positive samples, we use pseudo-abnormal scans obtained via our novel pseudo-abnormal module. The pseudo-abnormal module alters the reconstruction of the normal samples by changing the intensity of several regions. We conduct experiments on two medical image data sets, namely BRATS2020 and LUNA16 and compare our method with four state-of-the-art anomaly detection frameworks, namely AST, RD4AD, AnoVAEGAN and f-AnoGAN.
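A toy version of the pseudo-abnormal module and the classifier inputs it produces; the region counts, sizes, and intensity factors are illustrative assumptions, and the code expects float images:

```python
import numpy as np

def make_pseudo_abnormal(reconstruction, n_regions=3, max_size=32, rng=None):
    """Turn a healthy reconstruction into a pseudo-abnormal positive sample
    by rescaling the intensity of a few random rectangular regions."""
    rng = rng or np.random.default_rng()
    img = reconstruction.copy()
    h, w = img.shape[:2]
    for _ in range(n_regions):
        rh, rw = rng.integers(8, max_size, size=2)
        y, x = rng.integers(0, h - rh), rng.integers(0, w - rw)
        img[y:y + rh, x:x + rw] *= rng.uniform(0.3, 1.9)  # darken or brighten
    return img

def classifier_pair(image, reconstruction):
    """The anomaly classifier sees difference images: negatives come from
    the unmodified healthy reconstruction, positives from an altered one."""
    neg = image - reconstruction
    pos = image - make_pseudo_abnormal(reconstruction)
    return neg, pos
```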
Improved Flood Insights: Diffusion-Based SAR to EO Image Translation
paper_authors: Minseok Seo, Youngtack Oh, Doyi Kim, Dongmin Kang, Yeji Choi
for: This paper aims to improve the interpretability of flood insights from Synthetic Aperture Radar (SAR) images for human analysts.
methods: The paper proposes a novel framework called Diffusion-Based SAR to EO Image Translation (DSE) to convert SAR images into Electro-Optical (EO) images, enhancing the interpretability of flood insights.
results: Experimental results on two datasets (Sen1Floods11 and SEN12-FLOOD) show that the DSE framework not only delivers enhanced visual information but also improves performance across all tested flood segmentation baselines.
Abstract
Driven by rapid climate change, the frequency and intensity of flood events are increasing. Electro-Optical (EO) satellite imagery is commonly utilized for rapid response. However, its utilities in flood situations are hampered by issues such as cloud cover and limitations during nighttime, making accurate assessment of damage challenging. Several alternative flood detection techniques utilizing Synthetic Aperture Radar (SAR) data have been proposed. Despite the advantages of SAR over EO in the aforementioned situations, SAR presents a distinct drawback: human analysts often struggle with data interpretation. To tackle this issue, this paper introduces a novel framework, Diffusion-Based SAR to EO Image Translation (DSE). The DSE framework converts SAR images into EO images, thereby enhancing the interpretability of flood insights for humans. Experimental results on the Sen1Floods11 and SEN12-FLOOD datasets confirm that the DSE framework not only delivers enhanced visual information but also improves performance across all tested flood segmentation baselines.
results: The paper reviews and compares audio spoofing detection methods and identifies open problems and future research directions, including the robustness and generalizability of spoofing detection systems.
Abstract
Audio has become an increasingly crucial biometric modality due to its ability to provide an intuitive way for humans to interact with machines. It is currently being used for a range of applications, including person authentication to banking to virtual assistants. Research has shown that these systems are also susceptible to spoofing and attacks. Therefore, protecting audio processing systems against fraudulent activities, such as identity theft, financial fraud, and spreading misinformation, is of paramount importance. This paper reviews the current state-of-the-art techniques for detecting audio spoofing and discusses the current challenges along with open research problems. The paper further highlights the importance of considering the ethical and privacy implications of audio spoofing detection systems. Lastly, the work aims to accentuate the need for building more robust and generalizable methods, the integration of automatic speaker verification and countermeasure systems, and better evaluation protocols.
An Improved Metric of Informational Masking for Perceptual Audio Quality Measurement
for: The paper is written to develop and improve perceptual audio quality measurement systems that can accurately estimate the perceived quality of audio signals processed by perceptual audio codecs.
methods: The paper uses models of disturbance audibility and cognitive effects to predict perceived quality degradation in audio signals. Specifically, it proposes an improved model of informational masking (IM) that considers the complexity of disturbance information around the masking threshold.
results: The proposed IM metric is shown to outperform previously proposed IM metrics in a validation task against subjective quality scores from large and diverse listening test databases. Additionally, the proposed system demonstrated improved quality prediction for music signals coded with bandwidth extension techniques, where other models frequently fail.
Abstract
Perceptual audio quality measurement systems algorithmically analyze the output of audio processing systems to estimate possible perceived quality degradation using perceptual models of human audition. In this manner, they save the time and resources associated with the design and execution of listening tests (LTs). Models of disturbance audibility predicting peripheral auditory masking in quality measurement systems have considerably increased subjective quality prediction performance of signals processed by perceptual audio codecs. Additionally, cognitive effects have also been known to regulate perceived distortion severity by influencing their salience. However, the performance gains due to cognitive effect models in quality measurement systems were inconsistent so far, particularly for music signals. Firstly, this paper presents an improved model of informational masking (IM) -- an important cognitive effect in quality perception -- that considers disturbance information complexity around the masking threshold. Secondly, we incorporate the proposed IM metric into a quality measurement system using a novel interaction analysis procedure between cognitive effects and distortion metrics. The procedure establishes interactions between cognitive effects and distortion metrics using LT data. The proposed IM metric is shown to outperform previously proposed IM metrics in a validation task against subjective quality scores from large and diverse LT databases. Particularly, the proposed system showed an increased quality prediction of music signals coded with bandwidth extension techniques, where other models frequently fail.
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
results: The study finds that sentences corrected by LLMs frequently resulted in higher Word Error Rates (WER), indicating that leveraging LLM in-context learning in speech applications remains challenging.
Abstract
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLM's in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions is still a challenging task at the current stage.
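The evaluation loop can be approximated as below; `llm_correct` and `llm_call` are hypothetical wrappers around whichever chat model is used (ChatGPT/GPT-4 in the paper), and WER is computed with the jiwer package:

```python
from jiwer import wer  # pip install jiwer

def llm_correct(hypothesis, llm_call):
    """Ask an LLM to repair an ASR hypothesis. `llm_call` is any function
    that sends a prompt string to a chat model and returns its reply."""
    prompt = (
        "The following is a possibly erroneous speech recognition transcript. "
        "Return a corrected version, changing as little as possible:\n"
        f"{hypothesis}"
    )
    return llm_call(prompt)

def evaluate(references, hypotheses, llm_call):
    corrected = [llm_correct(h, llm_call) for h in hypotheses]
    return {"baseline_wer": wer(references, hypotheses),
            "llm_wer": wer(references, corrected)}
```

Note that, per the paper's findings, `llm_wer` frequently came out higher than `baseline_wer` in this kind of setup.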
Feature Embeddings from Large-Scale Acoustic Bird Classifiers Enable Few-Shot Transfer Learning
paper_authors: Burooj Ghani, Tom Denton, Stefan Kahl, Holger Klinck
for: Helping to understand and protect marine and terrestrial animals and their habitats across extensive spatiotemporal scales.
methods: Uses feature embeddings extracted from large-scale deep learning audio classifiers to classify bioacoustic data.
results: Feature embeddings extracted from large-scale audio classification models can classify diverse bioacoustic classes, including bird, bat, marine mammal, and amphibian calls, and enable high-quality classification even when training data is scarce.
Abstract
Automated bioacoustic analysis aids understanding and protection of both marine and terrestrial animals and their habitats across extensive spatiotemporal scales, and typically involves analyzing vast collections of acoustic data. With the advent of deep learning models, classification of important signals from these datasets has markedly improved. These models power critical data analyses for research and decision-making in biodiversity monitoring, animal behaviour studies, and natural resource management. However, deep learning models are often data-hungry and require a significant amount of labeled training data to perform well. While sufficient training data is available for certain taxonomic groups (e.g., common bird species), many classes (such as rare and endangered species, many non-bird taxa, and call-type), lack enough data to train a robust model from scratch. This study investigates the utility of feature embeddings extracted from large-scale audio classification models to identify bioacoustic classes other than the ones these models were originally trained on. We evaluate models on diverse datasets, including different bird calls and dialect types, bat calls, marine mammals calls, and amphibians calls. The embeddings extracted from the models trained on bird vocalization data consistently allowed higher quality classification than the embeddings trained on general audio datasets. The results of this study indicate that high-quality feature embeddings from large-scale acoustic bird classifiers can be harnessed for few-shot transfer learning, enabling the learning of new classes from a limited quantity of training data. Our findings reveal the potential for efficient analyses of novel bioacoustic tasks, even in scenarios where available training data is limited to a few samples.
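A minimal few-shot transfer recipe in the spirit of the paper: freeze a large-scale acoustic classifier, embed the support recordings, and fit a shallow classifier. Here `embed_fn` is a hypothetical hook into the frozen model's penultimate layer (e.g., of a bird classifier such as BirdNET):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_classifier(embed_fn, support_audio, support_labels):
    """Train a shallow classifier on embeddings from a frozen large-scale
    bird classifier. `embed_fn` maps a waveform to a fixed-length vector."""
    X = np.stack([embed_fn(a) for a in support_audio])  # (n_shots, D)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, support_labels)
    # returns a predictor for new recordings
    return lambda audio: clf.predict(np.stack([embed_fn(a) for a in audio]))
```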
results: On a test set with personalized named entities, each approach improves word error rate (WER) by over 10% against a neural rescoring baseline; natural language prompts improve WER by 7% without any training and with a marginal loss in generalization. Gazetteers perform best, with a 10% WER improvement while also improving WER on a general test set by 1%.
Abstract
Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized named entities to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10%, against a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with a marginal loss in generalization. Overall, gazetteers were found to perform the best with a 10% improvement in word error rate (WER), while also improving WER on a general test set by 1%.
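A toy illustration of gazetteer-based rescoring. The paper applies personalization inside a neural rescoring step; the additive bonus below is a deliberately simple stand-in, and the entity list and scores are made up:

```python
def rescore_nbest(nbest, gazetteer, bonus=2.0):
    """Re-rank ASR n-best hypotheses, rewarding matches against a list of
    user-personalized named entities (contact names, playlists, ...).
    `nbest` is a list of (hypothesis_text, neural_rescoring_score) pairs,
    where higher scores are better."""
    entities = {e.lower() for e in gazetteer}

    def adjusted(item):
        text, score = item
        hits = sum(1 for e in entities if e in text.lower())
        return score + bonus * hits

    return max(nbest, key=adjusted)

# usage with a hypothetical assistant request
nbest = [("call ann marie", -3.1), ("call anne-marie", -3.4)]
print(rescore_nbest(nbest, gazetteer=["Anne-Marie", "Dad"]))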
LACE: A light-weight, causal model for enhancing coded speech through adaptive convolutions
paper_authors: Jan Büthe, Jean-Marc Valin, Ahmed Mustafa
for: Enhancing the quality of coded speech.
methods: Uses a deep neural network to generate classical filter kernels on a per-frame basis, running in real time on desktop or mobile device CPUs.
results: Enables effective wideband coding at bitrates down to 6 kb/s.
Abstract
Classical speech coding uses low-complexity postfilters with zero lookahead to enhance the quality of coded speech, but their effectiveness is limited by their simplicity. Deep Neural Networks (DNNs) can be much more effective, but require high complexity and model size, or added delay. We propose a DNN model that generates classical filter kernels on a per-frame basis with a model of just 300~K parameters and 100~MFLOPS complexity, which is a practical complexity for desktop or mobile device CPUs. The lack of added delay allows it to be integrated into the Opus codec, and we demonstrate that it enables effective wideband encoding for bitrates down to 6 kb/s.
methods: We propose a novel Unsupervised Dense Few-shot Medical Image Segmentation Model Training Pipeline (DenseMP) that capitalizes on unsupervised dense pre-training. DenseMP consists of two stages: (1) segmentation-aware dense contrastive pre-training, and (2) few-shot-aware superpixel-guided dense pre-training.
results: The proposed pipeline significantly improves the performance of the widely recognized few-shot segmentation model PA-Net, achieving state-of-the-art results on the Abd-CT and Abd-MRI datasets.
Abstract
Few-shot medical image semantic segmentation is of paramount importance in the domain of medical image analysis. However, existing methodologies grapple with the challenge of data scarcity during the training phase, leading to over-fitting. To mitigate this issue, we introduce a novel Unsupervised Dense Few-shot Medical Image Segmentation Model Training Pipeline (DenseMP) that capitalizes on unsupervised dense pre-training. DenseMP is composed of two distinct stages: (1) segmentation-aware dense contrastive pre-training, and (2) few-shot-aware superpixel guided dense pre-training. These stages collaboratively yield a pre-trained initial model specifically designed for few-shot medical image segmentation, which can subsequently be fine-tuned on the target dataset. Our proposed pipeline significantly enhances the performance of the widely recognized few-shot segmentation model, PA-Net, achieving state-of-the-art results on the Abd-CT and Abd-MRI datasets. Code will be released after acceptance.
Combining Vision and EMG-Based Hand Tracking for Extended Reality Musical Instruments
methods: Combines vision-based hand tracking with surface electromyography (sEMG) data for finger joint angle estimation.
results: Compared with a baseline vision-based tracking method, our multimodal tracking system improves tracking accuracy for finger joints prone to self-occlusion. These results suggest the system can enhance XR experiences, especially in the presence of self-occlusion.
Abstract
Hand tracking is a critical component of natural user interactions in extended reality (XR) environments, including extended reality musical instruments (XRMIs). However, self-occlusion remains a significant challenge for vision-based hand tracking systems, leading to inaccurate results and degraded user experiences. In this paper, we propose a multimodal hand tracking system that combines vision-based hand tracking with surface electromyography (sEMG) data for finger joint angle estimation. We validate the effectiveness of our system through a series of hand pose tasks designed to cover a wide range of gestures, including those prone to self-occlusion. By comparing the performance of our multimodal system to a baseline vision-based tracking method, we demonstrate that our multimodal approach significantly improves tracking accuracy for several finger joints prone to self-occlusion. These findings suggest that our system has the potential to enhance XR experiences by providing more accurate and robust hand tracking, even in the presence of self-occlusion.
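One simple way to picture the multimodal combination, assuming the vision tracker exposes per-joint confidences: blend the two angle estimates so that sEMG takes over for occluded joints. The paper's actual fusion is very likely learned; this is only a sketch under those assumptions:

```python
import numpy as np

def fuse_joint_angles(vision_angles, vision_conf, semg_angles, floor=0.15):
    """Per-joint weighted fusion of two finger-joint-angle estimates.
    When the vision tracker reports low confidence for a joint (e.g. it is
    self-occluded), the sEMG-based estimate dominates; otherwise vision does.
    All inputs are (n_joints,) arrays; `vision_conf` lies in [0, 1]."""
    w = np.clip(vision_conf, floor, 1.0)  # never trust either modality blindly
    return w * vision_angles + (1.0 - w) * semg_angles
```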
Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
paper_authors: Denis Coquenet, Clément Rambour, Emanuele Dalsasso, Nicolas Thome
for: Improving the performance of vision-language foundation models on fine-grained attribute detection and localization tasks.
methods: Uses a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of vision-language foundation models.
results: Using the CLIP architecture as baseline, the paper shows strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing classification performance on the CUB200-2011 dataset.
Abstract
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
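For intuition, here is a zero-shot version of the positive/negative prompt formulation on a frozen CLIP model; the paper fine-tunes multitask heads on top of this idea, and the prompt wording and attribute names below are assumptions:

```python
import clip  # https://github.com/openai/CLIP
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def attribute_score(pil_image, attribute="red wing"):
    """Score one fine-grained attribute with a positive/negative prompt
    pair: the probability mass CLIP assigns to the positive phrasing."""
    prompts = clip.tokenize(
        [f"a photo of a bird with a {attribute}",
         f"a photo of a bird without a {attribute}"]).to(device)
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, prompts)
        probs = logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()  # in [0, 1]; > 0.5 suggests the attribute is present
```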
Robotic surface exploration with vision and tactile sensing for cracks detection and characterisation
results: Experimental results show that the proposed algorithm successfully detects and localizes cracks, and that the motion planning algorithm improves crack classification and geometry estimation over vision alone at minimal cost.
Abstract
This paper presents a novel algorithm for crack localisation and detection based on visual and tactile analysis via fibre-optics. A finger-shaped sensor based on fibre-optics is employed for the data acquisition to collect data for the analysis and the experiments. To detect the possible locations of cracks a camera is used to scan an environment while running an object detection algorithm. Once the crack is detected, a fully-connected graph is created from a skeletonised version of the crack. A minimum spanning tree is then employed for calculating the shortest path to explore the crack which is then used to develop the motion planner for the robotic manipulator. The motion planner divides the crack into multiple nodes which are then explored individually. Then, the manipulator starts the exploration and performs the tactile data classification to confirm if there is indeed a crack in that location or just a false positive from the vision algorithm. If a crack is detected, also the length, width, orientation and number of branches are calculated. This is repeated until all the nodes of the crack are explored. In order to validate the complete algorithm, various experiments are performed: comparison of exploration of cracks through full scan and motion planning algorithm, implementation of frequency-based features for crack classification and geometry analysis using a combination of vision and tactile data. From the results of the experiments, it is shown that the proposed algorithm is able to detect cracks and improve the results obtained from vision to correctly classify cracks and their geometry with minimal cost thanks to the motion planning algorithm.
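A compact sketch of the exploration-planning step using scikit-image and networkx: skeletonize the detected crack, link neighbouring skeleton pixels into a graph, extract a minimum spanning tree, and emit a node visit order for the manipulator. Connectivity handling and path costs are simplified relative to the paper:

```python
import networkx as nx
import numpy as np
from skimage.morphology import skeletonize

def crack_exploration_order(crack_mask):
    """From a binary crack mask: skeletonize, connect 8-neighbouring
    skeleton pixels into a graph, take the minimum spanning tree, and
    return a depth-first visit order the manipulator could follow."""
    skel = skeletonize(crack_mask.astype(bool))
    pixels = list(map(tuple, np.argwhere(skel)))
    g = nx.Graph()
    g.add_nodes_from(pixels)
    for (r, c) in pixels:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nb = (r + dr, c + dc)
                if (dr or dc) and nb in g:
                    g.add_edge((r, c), nb, weight=np.hypot(dr, dc))
    mst = nx.minimum_spanning_tree(g)
    return list(nx.dfs_preorder_nodes(mst, source=pixels[0]))
```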
Generalizing Supervised Deep Learning MRI Reconstruction to Multiple and Unseen Contrasts using Meta-Learning Hypernetworks
results: Experiments on multi-contrast MRI reconstruction show the model outperforms joint training, other meta-learning methods, and context-specific MRI reconstruction methods, and adapts better to different acquisition settings, with improvement margins of 0.5 dB in PSNR and 0.01 in SSIM. A representation analysis with U-Net shows that kernel modulation infuses 80% of mode-specific representation changes in the high-resolution layers.
Abstract
Meta-learning has recently been an emerging data-efficient learning technique for various medical imaging operations and has helped advance contemporary deep learning models. Furthermore, meta-learning enhances the knowledge generalization of the imaging tasks by learning both shared and discriminative weights for various configurations of imaging tasks. However, existing meta-learning models attempt to learn a single set of weight initializations of a neural network that might be restrictive for multimodal data. This work aims to develop a multimodal meta-learning model for image reconstruction, which augments meta-learning with evolutionary capabilities to encompass diverse acquisition settings of multimodal data. Our proposed model called KM-MAML (Kernel Modulation-based Multimodal Meta-Learning), has hypernetworks that evolve to generate mode-specific weights. These weights provide the mode-specific inductive bias for multiple modes by re-calibrating each kernel of the base network for image reconstruction via a low-rank kernel modulation operation. We incorporate gradient-based meta-learning (GBML) in the contextual space to update the weights of the hypernetworks for different modes. The hypernetworks and the reconstruction network in the GBML setting provide discriminative mode-specific features and low-level image features, respectively. Experiments on multi-contrast MRI reconstruction show that our model, (i) exhibits superior reconstruction performance over joint training, other meta-learning methods, and context-specific MRI reconstruction methods, and (ii) better adaptation capabilities with improvement margins of 0.5 dB in PSNR and 0.01 in SSIM. Besides, a representation analysis with U-Net shows that kernel modulation infuses 80% of mode-specific representation changes in the high-resolution layers. Our source code is available at https://github.com/sriprabhar/KM-MAML/.
Watch Your Pose: Unsupervised Domain Adaption with Pose based Triplet Selection for Gait Recognition
paper_authors: Gavriel Habib, Noa Barzilay, Or Shimshi, Rami Ben-Ari, Nir Darshan
for: Proposes an unsupervised domain adaptation method to improve the generalization of gait recognition models to unseen scenarios.
methods: Based on a novel Triplet Selection algorithm with a curriculum learning framework that adapts the embedding space so the model reduces its bias toward pose-based features.
results: Extensive experiments on four widely-used gait datasets (CASIA-B, OU-MVLP, GREW, Gait3D) and three backbones (GaitSet, GaitPart, GaitGL) show the superiority of the proposed method over prior works.
Abstract
Gait Recognition is a computer vision task aiming to identify people by their walking patterns. Existing methods show impressive results on individual datasets but lack the ability to generalize to unseen scenarios. Unsupervised Domain Adaptation (UDA) tries to adapt a model, pre-trained in a supervised manner on a source domain, to an unlabelled target domain. UDA for Gait Recognition is still in its infancy and existing works proposed solutions to limited scenarios. In this paper, we reveal a fundamental phenomenon in adaptation of gait recognition models, in which the target domain is biased to pose-based features rather than identity features, causing a significant performance drop in the identification task. We suggest Gait Orientation-based method for Unsupervised Domain Adaptation (GOUDA) to reduce this bias. To this end, we present a novel Triplet Selection algorithm with a curriculum learning framework, aiming to adapt the embedding space by pushing away samples of similar poses and bringing closer samples of different poses. We provide extensive experiments on four widely-used gait datasets, CASIA-B, OU-MVLP, GREW, and Gait3D, and on three backbones, GaitSet, GaitPart, and GaitGL, showing the superiority of our proposed method over prior works.
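An illustrative (not the paper's exact) selection rule capturing the stated goal of pushing away samples of similar poses and bringing closer samples of different poses, assuming a pose descriptor is available per sample:

```python
import torch
from torch.nn import TripletMarginLoss

triplet = TripletMarginLoss(margin=0.2)

def pose_guided_triplet_loss(gait_emb, pose_emb):
    """Toy triplet selection on a target-domain batch: positives are close
    in gait space but far in pose space; negatives are close in pose space
    but far in gait space, discouraging pose-based shortcuts.
    gait_emb: (B, D) gait embeddings; pose_emb: (B, P) pose descriptors."""
    gait_d = torch.cdist(gait_emb, gait_emb)
    pose_d = torch.cdist(pose_emb, pose_emb)
    self_mask = torch.eye(len(gait_emb), dtype=torch.bool,
                          device=gait_emb.device)
    pos = (gait_d - pose_d).masked_fill(self_mask, float("inf")).argmin(dim=1)
    neg = (pose_d - gait_d).masked_fill(self_mask, float("inf")).argmin(dim=1)
    return triplet(gait_emb, gait_emb[pos], gait_emb[neg])
```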
Improving 2D Human Pose Estimation across Unseen Camera Views with Synthetic Data
results: Experiments show that adding RePoGen data to COCO significantly improves top-view pose estimation and performance on a bottom-view dataset.
Abstract
Human Pose Estimation is a thoroughly researched problem; however, most datasets focus on the side and front-view scenarios. We address the limitation by proposing a novel approach that tackles the challenges posed by extreme viewpoints and poses. We introduce a new method for synthetic data generation - RePoGen, RarE POses GENerator - with comprehensive control over pose and view to augment the COCO dataset. Experiments on a new dataset of real images show that adding RePoGen data to the COCO surpasses previous attempts to top-view pose estimation and significantly improves performance on the bottom-view dataset. Through an extensive ablation study on both the top and bottom view data, we elucidate the contributions of methodological choices and demonstrate improved performance. The code and the datasets are available on the project website.
results: The paper surveys multimodal datasets suitable for evaluation and finds that multimodal data fusion can improve object detection accuracy.
Abstract
Object detection in remote sensing is a crucial computer vision task that has seen significant advancements with deep learning techniques. However, most existing works in this area focus on the use of generic object detection and do not leverage the potential of multimodal data fusion. In this paper, we present a comparison of methods for multimodal object detection in remote sensing, survey available multimodal datasets suitable for evaluation, and discuss future directions.
Weakly supervised marine animal detection from remote sensing images using vector-quantized variational autoencoder
results: Evaluated on marine animal detection from aerial image data against existing methods from the literature, experiments on two dedicated datasets show the proposed method outperforms recent studies. The framework offers improved interpretability and anomaly localization, providing valuable insights for monitoring marine ecosystems and mitigating the impact of human activities on marine animals.
Abstract
This paper studies a reconstruction-based approach for weakly-supervised animal detection from aerial images in marine environments. Such an approach leverages an anomaly detection framework that computes metrics directly on the input space, enhancing interpretability and anomaly localization compared to feature embedding methods. Building upon the success of Vector-Quantized Variational Autoencoders in anomaly detection on computer vision datasets, we adapt them to the marine animal detection domain and address the challenge of handling noisy data. To evaluate our approach, we compare it with existing methods in the context of marine animal detection from aerial image data. Experiments conducted on two dedicated datasets demonstrate the superior performance of the proposed method over recent studies in the literature. Our framework offers improved interpretability and localization of anomalies, providing valuable insights for monitoring marine ecosystems and mitigating the impact of human activities on marine animals.
YOLIC: An Efficient Method for Object Localization and Classification on Edge Devices
paper_authors: Kai Su, Yoichi Tomioka, Qiangfu Zhao, Yong Liu
for: Proposes an efficient object localization and classification method for image processing on edge devices.
methods: Classifies over predefined Cells of Interest, blending the strengths of semantic segmentation and object detection to improve computational efficiency and accuracy.
results: Extensive experiments on multiple datasets show YOLIC achieves detection performance comparable to state-of-the-art YOLO algorithms while exceeding 30 fps on a Raspberry Pi 4B CPU.
Abstract
In the realm of Tiny AI, we introduce ``You Only Look at Interested Cells" (YOLIC), an efficient method for object localization and classification on edge devices. Through seamlessly blending the strengths of semantic segmentation and object detection, YOLIC offers superior computational efficiency and precision. By adopting Cells of Interest for classification instead of individual pixels, YOLIC encapsulates relevant information, reduces computational load, and enables rough object shape inference. Importantly, the need for bounding box regression is obviated, as YOLIC capitalizes on the predetermined cell configuration that provides information about potential object location, size, and shape. To tackle the issue of single-label classification limitations, a multi-label classification approach is applied to each cell for effectively recognizing overlapping or closely situated objects. This paper presents extensive experiments on multiple datasets to demonstrate that YOLIC achieves detection performance comparable to the state-of-the-art YOLO algorithms while surpassing in speed, exceeding 30fps on a Raspberry Pi 4B CPU. All resources related to this study, including datasets, cell designer, image annotation tool, and source code, have been made publicly available on our project website at https://kai3316.github.io/yolic.github.io
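The cell-level multi-label formulation can be sketched as a single head emitting one logit per (cell, class) pair trained with binary cross-entropy, so a cell covering overlapping objects can carry several labels at once; the cell count, feature size, and class count below are placeholders:

```python
import torch
import torch.nn as nn

class YolicHead(nn.Module):
    """Multi-label classification over a fixed grid of Cells of Interest:
    the backbone emits one logit per (cell, class), so each cell may carry
    several labels and no bounding-box regression is needed."""
    def __init__(self, feat_dim, n_cells, n_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_cells * n_classes)
        self.n_cells, self.n_classes = n_cells, n_classes

    def forward(self, features):  # (B, feat_dim) backbone features
        return self.fc(features).view(-1, self.n_cells, self.n_classes)

criterion = nn.BCEWithLogitsLoss()
head = YolicHead(feat_dim=512, n_cells=96, n_classes=10)
feats = torch.randn(4, 512)
targets = torch.randint(0, 2, (4, 96, 10)).float()  # per-cell multi-hot labels
loss = criterion(head(feats), targets)
```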
Body Fat Estimation from Surface Meshes using Graph Neural Networks
results: Triangulated body surface meshes can be used to accurately predict visceral (VAT) and abdominal subcutaneous (ASAT) adipose tissue volumes while reducing training time and required resources; the method could also be applied to cheap and easily accessible medical surface scans instead of expensive medical images.
Abstract
Body fat volume and distribution can be a strong indication for a person's overall health and the risk for developing diseases like type 2 diabetes and cardiovascular diseases. Frequently used measures for fat estimation are the body mass index (BMI), waist circumference, or the waist-hip-ratio. However, those are rather imprecise measures that do not allow for a discrimination between different types of fat or between fat and muscle tissue. The estimation of visceral (VAT) and abdominal subcutaneous (ASAT) adipose tissue volume has shown to be a more accurate measure for named risk factors. In this work, we show that triangulated body surface meshes can be used to accurately predict VAT and ASAT volumes using graph neural networks. Our methods achieve high performance while reducing training time and required resources compared to state-of-the-art convolutional neural networks in this area. We furthermore envision this method to be applicable to cheaper and easily accessible medical surface scans instead of expensive medical images.
DGCNet: An Efficient 3D-Densenet based on Dynamic Group Convolution for Hyperspectral Remote Sensing Image Classification
results: With the DGC module, the 3D-Densenet model selects channel information with richer semantic features and deploys efficiently on edge devices, achieving outstanding performance on the IN, Pavia, and KSC datasets, ahead of mainstream hyperspectral image classification methods.
Abstract
Deep neural networks face many problems in the field of hyperspectral image classification, lack of effective utilization of spatial spectral information, gradient disappearance and overfitting as the model depth increases. In order to accelerate the deployment of the model on edge devices with strict latency requirements and limited computing power, we introduce a lightweight model based on the improved 3D-Densenet model and designs DGCNet. It improves the disadvantage of group convolution. Referring to the idea of dynamic network, dynamic group convolution(DGC) is designed on 3d convolution kernel. DGC introduces small feature selectors for each grouping to dynamically decide which part of the input channel to connect based on the activations of all input channels. Multiple groups can capture different and complementary visual and semantic features of input images, allowing convolution neural network(CNN) to learn rich features. 3D convolution extracts high-dimensional and redundant hyperspectral data, and there is also a lot of redundant information between convolution kernels. DGC module allows 3D-Densenet to select channel information with richer semantic features and discard inactive regions. The 3D-CNN passing through the DGC module can be regarded as a pruned network. DGC not only allows 3D-CNN to complete sufficient feature extraction, but also takes into account the requirements of speed and calculation amount. The inference speed and accuracy have been improved, with outstanding performance on the IN, Pavia and KSC datasets, ahead of the mainstream hyperspectral image classification methods.
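A rough sketch of a dynamic, per-group channel selector in the spirit of DGC; the gating architecture and its placement are assumptions, and `channels` is assumed divisible by `groups`:

```python
import torch
import torch.nn as nn

class DynamicGroupGate(nn.Module):
    """Per-group input-channel selector: a tiny head pools the activations
    of all input channels and decides, per group, how strongly each channel
    participates, before a grouped 3D convolution consumes the result."""
    def __init__(self, channels, groups):
        super().__init__()
        self.selectors = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
            for _ in range(groups))
        self.conv = nn.Conv3d(channels * groups, channels, kernel_size=3,
                              padding=1, groups=groups)

    def forward(self, x):            # x: (B, C, D, H, W)
        ctx = x.mean(dim=(2, 3, 4))  # global context per channel
        gated = [sel(ctx)[:, :, None, None, None] * x for sel in self.selectors]
        return self.conv(torch.cat(gated, dim=1))
```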
Transformer-based end-to-end classification of variable-length volumetric data
results: Compared to state-of-the-art video transformers, the proposed method achieves a 21.96% average improvement in balanced accuracy on a retinal OCT volume classification task; randomly varying the volume-wise resolution of the input during training yields more informative volume representations than training with a fixed number of slices.
Abstract
The automatic classification of 3D medical data is memory-intensive. Also, variations in the number of slices between samples is common. Na\"ive solutions such as subsampling can solve these problems, but at the cost of potentially eliminating relevant diagnosis information. Transformers have shown promising performance for sequential data analysis. However, their application for long sequences is data, computationally, and memory demanding. In this paper, we propose an end-to-end Transformer-based framework that allows to classify volumetric data of variable length in an efficient fashion. Particularly, by randomizing the input volume-wise resolution(#slices) during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the accumulated positional information in each positional embedding can be generalized to the neighbouring slices, even for high-resolution volumes at the test time. By doing so, the model will be more robust to variable volume length and amenable to different computational budgets. We evaluated the proposed approach in retinal OCT volume classification and achieved 21.96% average improvement in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the volume-wise resolution of the input during training results in more informative volume representation as compared to training with fixed number of slices per volume.
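The volume-wise resolution randomization can be pictured as subsampling a random, sorted subset of slices while indexing positional embeddings by each slice's original position, so that accumulated positional information generalizes to neighbouring slices at test time. The sketch below is a simplification of the paper's scheme, and assumes the volume has at least `min_slices` slices:

```python
import torch

def random_resolution_batch(volume, pos_embed, min_slices=16):
    """Subsample a random number of slices from one volume during training,
    keeping each slice's positional embedding aligned with its original
    index. volume: (S, D) per-slice tokens; pos_embed: (S_max, D) table."""
    s = volume.size(0)
    k = torch.randint(min_slices, s + 1, (1,)).item()
    idx, _ = torch.sort(torch.randperm(s)[:k])  # sorted original indices
    return volume[idx] + pos_embed[idx]
```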
A Comprehensive Analysis of Blockchain Applications for Securing Computer Vision Systems
results: The review identifies the potential and challenges of combining BC and CV, along with possible application scenarios in areas such as supply chain management, healthcare, smart cities, and defense, and proposes directions for future research.
Abstract
Blockchain (BC) and Computer Vision (CV) are the two emerging fields with the potential to transform various sectors.The ability of BC can help in offering decentralized and secure data storage, while CV allows machines to learn and understand visual data. This integration of the two technologies holds massive promise for developing innovative applications that can provide solutions to the challenges in various sectors such as supply chain management, healthcare, smart cities, and defense. This review explores a comprehensive analysis of the integration of BC and CV by examining their combination and potential applications. It also provides a detailed analysis of the fundamental concepts of both technologies, highlighting their strengths and limitations. This paper also explores current research efforts that make use of the benefits offered by this combination. The effort includes how BC can be used as an added layer of security in CV systems and also ensure data integrity, enabling decentralized image and video analytics using BC. The challenges and open issues associated with this integration are also identified, and appropriate potential future directions are also proposed.
Automated Deception Detection from Videos: Using End-to-End Learning Based High-Level Features and Classification Approaches
results: Facial expressions are more informative for deception detection than gaze and head pose, and combining modalities with feature selection further improves detection performance. Results also show that the approach performs better than chance levels, even on low-stake datasets such as the Rolling-Dice Experiment.
Abstract
Deception detection is an interdisciplinary field attracting researchers from psychology, criminology, computer science, and economics. We propose a multimodal approach combining deep learning and discriminative models for automated deception detection. Using video modalities, we employ convolutional end-to-end learning to analyze gaze, head pose, and facial expressions, achieving promising results compared to state-of-the-art methods. Due to limited training data, we also utilize discriminative models for deception detection. Although sequence-to-class approaches are explored, discriminative models outperform them due to data scarcity. Our approach is evaluated on five datasets, including a new Rolling-Dice Experiment motivated by economic factors. Results indicate that facial expressions outperform gaze and head pose, and combining modalities with feature selection enhances detection performance. Differences in expressed features across datasets emphasize the importance of scenario-specific training data and the influence of context on deceptive behavior. Cross-dataset experiments reinforce these findings. Despite the challenges posed by low-stake datasets, including the Rolling-Dice Experiment, deception detection performance exceeds chance levels. Our proposed multimodal approach and comprehensive evaluation shed light on the potential of automating deception detection from video modalities, opening avenues for future research.
摘要
欺骗检测是一个跨学科领域,吸引了心理学、犯罪学、计算机科学和经济学等领域的研究者。我们提出一种结合深度学习与判别模型的多模态方法,用于自动欺骗检测。在视频模态上,我们使用卷积端到端学习来分析目光、头部姿态和面部表情,取得了与最先进方法相比颇具前景的结果。由于训练数据有限,我们还利用判别模型进行欺骗检测。虽然也探索了序列到类别的方法,但在数据稀缺的情况下判别模型表现更好。我们的方法在五个数据集上进行了评估,其中包括一个由经济学因素驱动的新的滚动骰子实验。结果表明,面部表情的检测效果优于目光和头部姿态,并且将多模态与特征选择相结合可以提高检测性能。不同数据集间所表达特征的差异,强调了场景特定训练数据的重要性以及上下文对欺骗行为的影响。跨数据集实验进一步证实了这些发现。尽管低风险数据集(包括滚动骰子实验)带来了挑战,欺骗检测性能仍超过了随机水平。我们提出的多模态方法与全面评估揭示了从视频模态自动化欺骗检测的潜力,为未来研究开辟了新途径。
NLOS Dies Twice: Challenges and Solutions of V2X for Cooperative Perception
results: 通过设计一个新的模拟框架,对自动驾驶、感知融合和V2X通信进行综合评估,证明了我们的解决方案的效果。Abstract
Multi-agent multi-lidar sensor fusion between connected vehicles for cooperative perception has recently been recognized as the best technique for minimizing the blind zone of individual vehicular perception systems and further enhancing the overall safety of autonomous driving systems. This technique relies heavily on the reliability and availability of vehicle-to-everything (V2X) communication. In practical sensor fusion application scenarios, the non-line-of-sight (NLOS) issue causes blind zones for not only the perception system but also V2X direct communication. To counteract underlying communication issues, we introduce an abstract perception matrix matching method for quick sensor fusion matching procedures and mobility-height hybrid relay determination procedures, proactively improving the efficiency and performance of V2X communication to serve the upper layer application fusion requirements. To demonstrate the effectiveness of our solution, we design a new simulation framework to consider autonomous driving, sensor fusion and V2X communication in general, paving the way for end-to-end performance evaluation and further solution derivation.
摘要
互联车辆之间的多智能体多激光雷达传感器融合式协同感知,近来被认为是最小化单车感知系统盲区、进一步提升自动驾驶系统整体安全性的最佳技术。这种技术高度依赖车联网(V2X)通信的可靠性和可用性。在实际的传感器融合应用场景中,非视距(NLOS)问题不仅给感知系统带来盲区,也给V2X直接通信带来盲区。为了应对底层通信问题,我们提出了一种用于快速传感器融合匹配流程的抽象感知矩阵匹配方法,以及移动性-高度混合中继确定流程,主动提升V2X通信的效率和性能,以满足上层应用的融合需求。为证明我们方案的有效性,我们设计了一个综合考虑自动驾驶、传感器融合和V2X通信的新仿真框架,为端到端性能评估和进一步方案推导铺平了道路。
paper_authors: Alexander Ziller, Alp Güvenir, Ayhan Can Erdur, Tamara T. Mueller, Philip Müller, Friederike Jungmann, Johannes Brandt, Jan Peeken, Rickmer Braren, Daniel Rueckert, Georgios Kaissis
results: 我们在医学分类基准和一个真实的临床数据集上评估了我们的方法,取得了与现有方法相当的结果。此外,通过在前向传播中使用注意力池化来获得每个切片的加权重要性值,我们展示了可以据此检查模型预测的依据。Abstract
Training Artificial Intelligence (AI) models on three-dimensional image data presents unique challenges compared to the two-dimensional case: Firstly, the computational resources are significantly higher, and secondly, the availability of large pretraining datasets is often limited, impeding training success. In this study, we propose a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D volumes. Our method involves sequentially applying these networks to slices of a 3D volume from all orientations. Subsequently, a feature reduction module combines the extracted slice features into a single representation, which is then used for classification. We evaluate our approach on medical classification benchmarks and a real-world clinical dataset, demonstrating comparable results to existing methods. Furthermore, by employing attention pooling as a feature reduction module we obtain weighted importance values for each slice during the forward pass. We show that slices deemed important by our approach allow the inspection of the basis of a model's prediction.
摘要
在三维图像数据上训练人工智能(AI)模型,与二维情形相比存在独特的挑战:首先,所需计算资源显著更高;其次,大规模预训练数据集的可用性往往有限,阻碍训练的成功。在这项研究中,我们提出了一种简单的方法,即借助中间特征表示来适配2D网络以处理三维体数据。我们的方法是将这些网络依次应用于三维体数据各个方向上的切片,随后由特征缩减模块将提取出的切片特征合并为单一表示,并用于分类。我们在医学分类基准和一个真实的临床数据集上评估了该方法,取得了与现有方法相当的结果。此外,通过使用注意力池化作为特征缩减模块,我们在前向传播中获得了每个切片的加权重要性值。我们发现,被该方法判定为重要的切片可以用于检查模型预测的依据。
results: 比先前方法更好地保持细节效果,抑制渗色和模糊等伪影;方法高效、易于实现、无敏感参数。通过对多种图像算子进行评估,并进行定量和定性分析,展示了方法的优势。适用于交互式图像编辑和实时高分辨率视频处理。Abstract
Guided upsampling is an effective approach for accelerating high-resolution image processing. In this paper, we propose a simple yet effective guided upsampling method. Each pixel in the high-resolution image is represented as a linear interpolation of two low-resolution pixels, whose indices and weights are optimized to minimize the upsampling error. The downsampling can be jointly optimized in order to prevent missing small isolated regions. Our method can be derived from the color line model and local color transformations. Compared to previous methods, our method can better preserve detail effects while suppressing artifacts such as bleeding and blurring. It is efficient, easy to implement, and free of sensitive parameters. We evaluate the proposed method with a wide range of image operators, and show its advantages through quantitative and qualitative analysis. We demonstrate the advantages of our method for both interactive image editing and real-time high-resolution video processing. In particular, for interactive editing, the joint optimization can be precomputed, thus allowing for instant feedback without hardware acceleration.
摘要
导向上采样是一种加速高分辨率图像处理的有效方法。在这篇论文中,我们提出了一种简单而有效的导向上采样方法。高分辨率图像中的每个像素都表示为两个低分辨率像素的线性插值,其索引和权重通过最小化上采样误差来优化。下采样也可以联合优化,以避免丢失孤立的小区域。我们的方法可以由颜色线模型和局部颜色变换推导而来。与先前方法相比,我们的方法能更好地保留细节效果,同时抑制渗色和模糊等伪影。该方法高效、易于实现,且没有敏感参数。我们在一系列图像算子上评估了所提方法,并通过定量和定性分析展示了其优势。我们还展示了该方法在交互式图像编辑和实时高分辨率视频处理中的优势。特别是在交互式编辑中,联合优化可以预先计算,因此无需硬件加速即可实现即时反馈。
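The per-pixel fit described above admits a closed-form weight once the two source pixels are fixed: minimizing (w*a + (1-w)*b - h)^2 over w gives w = (h - b)/(a - b). The NumPy sketch below illustrates this for a single 1-D value over a small candidate window; it is a toy illustration of the idea, not the paper's optimized joint solver.

```python
import numpy as np

def fit_pixel(hr_val, lo_candidates):
    # Choose two low-res candidates and a weight w in [0, 1] minimizing
    # (w*a + (1-w)*b - hr_val)^2; w = (hr_val - b) / (a - b) in closed form.
    best = (0, 0, 0.0, np.inf)
    for i in range(len(lo_candidates)):
        for j in range(i + 1, len(lo_candidates)):
            a, b = lo_candidates[i], lo_candidates[j]
            w = 1.0 if a == b else float(np.clip((hr_val - b) / (a - b), 0.0, 1.0))
            err = (w * a + (1.0 - w) * b - hr_val) ** 2
            if err < best[3]:
                best = (i, j, w, err)
    return best  # (index_a, index_b, weight, squared error)
```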
Image Denoising and the Generative Accumulation of Photons
methods: 本研究使用一种自监督去噪方法,其中网络通过预测下一个光子到达的位置来求解最小均方误差(MMSE)去噪任务。
results: 研究表明,该方法在4个新的荧光显微数据集上能实现高效的预测与去噪,性能优于有监督、自监督和无监督基线,或与之持平。Abstract
We present a fresh perspective on shot noise corrupted images and noise removal. By viewing image formation as the sequential accumulation of photons on a detector grid, we show that a network trained to predict where the next photon could arrive is in fact solving the minimum mean square error (MMSE) denoising task. This new perspective allows us to make three contributions: we present a new strategy for self-supervised denoising; we present a new method for sampling from the posterior of possible solutions by iteratively sampling and adding small numbers of photons to the image; and we derive a full generative model by starting this process from an empty canvas. We call this approach generative accumulation of photons (GAP). We evaluate our method quantitatively and qualitatively on 4 new fluorescence microscopy datasets, which will be made available to the community. We find that it outperforms supervised, self-supervised and unsupervised baselines or performs on-par.
摘要
我们提出了一种新的视角:将图像形成视为光子在探测器网格上逐个累积的过程,并证明一个被训练来预测下一个光子可能到达位置的网络,实际上是在求解最小均方误差(MMSE)去噪任务。这一新视角使我们做出三个贡献:首先,我们提出了一种新的自监督去噪策略;其次,我们提出了一种新的采样方法,通过迭代采样并向图像中添加少量光子,从可能解的后验分布中采样;最后,从空白画布开始这一过程,我们推导出了一个完整的生成模型。我们将这种方法称为光子生成式累积(GAP)。我们在4个新的荧光显微数据集上对该方法进行了定量和定性评估,发现它优于有监督、自监督和无监督基线,或与之持平。
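A hedged sketch of the accumulation loop implied by the abstract: a network scores where the next photon could arrive, and repeatedly sampling from that distribution deposits photons on the canvas. The network interface is an assumption; calling the loop on `torch.zeros(h, w)` realizes the generative variant, while calling it on a noisy count image performs posterior sampling for denoising.

```python
import torch

@torch.no_grad()
def accumulate_photons(net, canvas, n_photons, temperature=1.0):
    # `net` is assumed to map a (1, 1, H, W) photon-count image to
    # same-shaped logits over the location of the next photon.
    h, w = canvas.shape
    for _ in range(n_photons):
        logits = net(canvas[None, None]) / temperature
        probs = torch.softmax(logits.flatten(), dim=0)
        idx = torch.multinomial(probs, 1).item()
        canvas[idx // w, idx % w] += 1.0   # deposit one photon
    return canvas
```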
A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023
results: 我们的实验结果表明,利用动词与名词之间的共现关系可以提升模型性能;而使用GPT-3.5生成规则以增强模型对未见动作标签的适应性时,反而导致性能略有下降。这些结果揭示了在无监督域适应动作识别中结合可微分逻辑与大语言模型进行知识提取的潜力与挑战。Abstract
In this technical report, we present our findings from a study conducted on the EPIC-KITCHENS-100 Unsupervised Domain Adaptation task for Action Recognition. Our research focuses on the innovative application of a differentiable logic loss in the training to leverage the co-occurrence relations between verb and noun, as well as the pre-trained Large Language Models (LLMs) to generate the logic rules for the adaptation to unseen action labels. Specifically, the model's predictions are treated as the truth assignment of a co-occurrence logic formula to compute the logic loss, which measures the consistency between the predictions and the logic constraints. By using the verb-noun co-occurrence matrix generated from the dataset, we observe a moderate improvement in model performance compared to our baseline framework. To further enhance the model's adaptability to novel action labels, we experiment with rules generated using GPT-3.5, which leads to a slight decrease in performance. These findings shed light on the potential and challenges of incorporating differentiable logic and LLMs for knowledge extraction in unsupervised domain adaptation for action recognition. Our final submission (entitled `NS-LLM') achieved the first place in terms of top-1 action recognition accuracy.
摘要
在这份技术报告中,我们介绍了在EPIC-KITCHENS-100无监督域适应动作识别任务上开展研究的发现。我们的研究集中于在训练中创新地应用可微分逻辑损失,以利用动词与名词之间的共现关系,并借助预训练的大语言模型(LLM)生成逻辑规则,以适应未见的动作标签。具体来说,模型的预测结果被视为共现逻辑公式的真值赋值,用以计算逻辑损失,该损失度量了预测与逻辑约束之间的一致性。通过使用从数据集中生成的动词-名词共现矩阵,我们观察到相比基线框架,模型性能有适度提升。为进一步提高模型对新动作标签的适应性,我们对GPT-3.5生成的规则进行了实验,结果导致性能略有下降。这些发现揭示了在无监督域适应动作识别中结合可微分逻辑与LLM进行知识提取的潜力与挑战。我们的最终提交(命名为NS-LLM)在top-1动作识别准确率上获得了第一名。
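The logic loss described above can be sketched as follows: treating softmax outputs as soft truth values, the co-occurrence formula is violated exactly when joint probability mass falls on verb-noun pairs that never co-occur. This is a hedged illustration; the report's exact formulation may differ.

```python
import torch

def cooccurrence_logic_loss(verb_probs, noun_probs, cooc):
    # verb_probs: (B, V) softmax over verbs; noun_probs: (B, N) over nouns;
    # cooc: (V, N) {0,1} matrix of verb-noun pairs seen in training labels.
    joint = verb_probs.unsqueeze(2) * noun_probs.unsqueeze(1)  # (B, V, N)
    violation = joint * (1.0 - cooc).unsqueeze(0)  # mass on impossible pairs
    return violation.sum(dim=(1, 2)).mean()
```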
Multi-objective Evolutionary Search of Variable-length Composite Semantic Perturbations
results: 实验结果显示,相比于现有方法,MES-VCSP 可以获得更高的攻击成功率、更自然的攻击和更少的时间成本。Abstract
Deep neural networks have proven to be vulnerable to adversarial attacks in the form of adding specific perturbations on images to make wrong outputs. Designing stronger adversarial attack methods can help more reliably evaluate the robustness of DNN models. To relieve this heavy burden and improve attack performance, auto machine learning (AutoML) has recently emerged as one successful technique for automatically finding a near-optimal adversarial attack strategy. However, existing works on AutoML for adversarial attacks only focus on $L_{\infty}$-norm-based perturbations. In fact, semantic perturbations attract increasing attention due to their naturalness and physical realizability. To bridge the gap between AutoML and semantic adversarial attacks, we propose a novel method called multi-objective evolutionary search of variable-length composite semantic perturbations (MES-VCSP). Specifically, we construct the mathematical model of variable-length composite semantic perturbations, which provides five gradient-based semantic attack methods. The same type of perturbation in an attack sequence is allowed to be performed multiple times. Besides, we introduce a multi-objective evolutionary search consisting of NSGA-II and neighborhood search to find near-optimal variable-length attack sequences. Experimental results on CIFAR10 and ImageNet datasets show that, compared with existing methods, MES-VCSP can obtain adversarial examples with a higher attack success rate, more naturalness, and less time cost.
摘要
深度神经网络已被证明容易受到对抗攻击,即通过在图像上添加特定扰动使其输出错误。设计更强的对抗攻击方法有助于更可靠地评估DNN模型的鲁棒性。为减轻这一负担并提升攻击性能,自动机器学习(AutoML)近年来成为一种能够自动找到近似最优对抗攻击策略的成功技术。然而,现有关于AutoML对抗攻击的研究仅关注基于 $L_{\infty}$ 范数的扰动。事实上,语义扰动因其自然性和物理可实现性而受到越来越多的关注。为弥合AutoML与语义对抗攻击之间的差距,我们提出了一种新方法:变长复合语义扰动的多目标演化搜索(MES-VCSP)。具体来说,我们构建了变长复合语义扰动的数学模型,该模型提供了五种基于梯度的语义攻击方法;攻击序列中同一类型的扰动允许被执行多次。此外,我们引入了由NSGA-II和邻域搜索组成的多目标演化搜索,以寻找近似最优的变长攻击序列。在CIFAR10和ImageNet数据集上的实验结果表明,与现有方法相比,MES-VCSP能获得攻击成功率更高、更自然且时间成本更低的对抗样本。
Full-resolution Lung Nodule Segmentation from Chest X-ray Images using Residual Encoder-Decoder Networks
results: 该研究取得了85%的敏感度和每张图像8个假阳性,比以往方法更快且更高效。Abstract
Lung cancer is the leading cause of cancer death and early diagnosis is associated with a positive prognosis. Chest X-ray (CXR) provides an inexpensive imaging mode for lung cancer diagnosis. Suspicious nodules are difficult to distinguish from vascular and bone structures using CXR. Computer vision has previously been proposed to assist human radiologists in this task, however, leading studies use down-sampled images and computationally expensive methods with unproven generalization. Instead, this study localizes lung nodules using efficient encoder-decoder neural networks that process full resolution images to avoid any signal loss resulting from down-sampling. Encoder-decoder networks are trained and tested using the JSRT lung nodule dataset. The networks are used to localize lung nodules from an independent external CXR dataset. Sensitivity and false positive rates are measured using an automated framework to eliminate any observer subjectivity. These experiments allow for the determination of the optimal network depth, image resolution and pre-processing pipeline for generalized lung nodule localization. We find that nodule localization is influenced by subtlety, with more subtle nodules being detected in earlier training epochs. Therefore, we propose a novel self-ensemble model from three consecutive epochs centered on the validation optimum. This ensemble achieved a sensitivity of 85% in 10-fold internal testing with false positives of 8 per image. A sensitivity of 81% is achieved at a false positive rate of 6 following morphological false positive reduction. This result is comparable to more computationally complex systems based on linear and spatial filtering, but with a sub-second inference time that is faster than other methods. The proposed algorithm achieved excellent generalization results against an external dataset with sensitivity of 77% at a false positive rate of 7.6.
摘要
肺癌是癌症死亡的首要原因,早期诊断与较好的预后相关。胸部X射线(CXR)为肺癌诊断提供了一种低成本的成像方式。然而,在CXR中,可疑结节难以与血管和骨骼结构区分。此前已有研究提出用计算机视觉协助放射科医生完成这一任务,但主流研究使用下采样图像和计算代价高昂的方法,且泛化能力未经证实。与之不同,本研究使用高效的编码器-解码器神经网络处理全分辨率图像来定位肺结节,避免下采样导致的任何信号损失。编码器-解码器网络在JSRT肺结节数据集上进行训练和测试,并被用于在一个独立的外部CXR数据集上定位肺结节。敏感度和假阳性率在自动化框架下测量,以消除观察者主观性。这些实验可用于确定泛化肺结节定位的最优网络深度、图像分辨率和预处理流程。我们发现结节定位受结节细微程度的影响,较细微的结节在较早的训练轮次中更易被检出。因此,我们提出了一种新颖的自集成模型,由以验证最优点为中心的三个连续训练轮次构成。该集成在10折内部测试中取得了85%的敏感度,每张图像8个假阳性;经形态学假阳性削减后,在每张图像6个假阳性的水平上达到81%的敏感度。这一结果与基于线性和空间滤波的计算更复杂的系统相当,但推理时间不到一秒,快于其他方法。所提算法在外部数据集上也取得了出色的泛化结果,在每张图像7.6个假阳性的水平上敏感度为77%。
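A small sketch of the self-ensemble idea (averaging probability maps from the checkpoints of three consecutive epochs centred on the validation optimum); the checkpoint handling and sigmoid output head are assumptions.

```python
import torch

@torch.no_grad()
def self_ensemble_predict(model, checkpoint_paths, image):
    # Average probability maps from three consecutive epoch checkpoints
    # centred on the validation optimum (paths are illustrative).
    preds = []
    for path in checkpoint_paths:          # e.g. epochs k-1, k, k+1
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        preds.append(torch.sigmoid(model(image)))
    return torch.stack(preds).mean(dim=0)
```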
Quantum Image Denoising: A Framework via Boltzmann Machines, QUBO, and Quantum Annealing
results: 研究发现,该方法能将含噪图像转换为更接近无噪图像的图像,且在一定附加假设下,去噪结果在期望意义下严格地比含噪图像更接近无噪图像。Abstract
We investigate a framework for binary image denoising via restricted Boltzmann machines (RBMs) that introduces a denoising objective in quadratic unconstrained binary optimization (QUBO) form and is well-suited for quantum annealing. The denoising objective is attained by balancing the distribution learned by a trained RBM with a penalty term for deviations from the noisy image. We derive the statistically optimal choice of the penalty parameter assuming the target distribution has been well-approximated, and further suggest an empirically supported modification to make the method robust to that idealistic assumption. We also show under additional assumptions that the denoised images attained by our method are, in expectation, strictly closer to the noise-free images than the noisy images are. While we frame the model as an image denoising model, it can be applied to any binary data. As the QUBO formulation is well-suited for implementation on quantum annealers, we test the model on a D-Wave Advantage machine, and also test on data too large for current quantum annealers by approximating QUBO solutions through classical heuristics.
摘要
我们研究了一种基于受限玻尔兹曼机(RBM)的二值图像去噪框架,它以二次无约束二值优化(QUBO)形式引入去噪目标函数,非常适合量子退火。去噪目标通过在训练好的RBM所学分布与偏离含噪图像的惩罚项之间取得平衡来实现。在目标分布已被良好近似的假设下,我们推导了惩罚参数的统计最优选择,并进一步提出一种有实验支持的修正,使方法对这一理想化假设更加鲁棒。我们还证明,在附加假设下,由我们方法得到的去噪图像在期望意义下严格地比含噪图像更接近无噪图像。虽然我们将该模型表述为图像去噪模型,但它可应用于任何二值数据。由于QUBO形式适合在量子退火机上实现,我们在D-Wave Advantage机器上测试了该模型;对于超出当前量子退火机规模的数据,则通过经典启发式方法近似求解QUBO。
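Since the RBM joint energy is quadratic in the binary visible and hidden units, the denoising objective maps directly to a QUBO. A hedged NumPy sketch of that construction, with the penalty encoded via (v_i - y_i)^2 = v_i(1 - 2*y_i) + const for binary variables:

```python
import numpy as np

def build_denoising_qubo(W, b, c, y, lam):
    # Variables 0..n-1 are visible units v, n..n+m-1 are hidden units h.
    # RBM energy: E(v, h) = -v^T W h - b^T v - c^T h (quadratic in binaries).
    # Penalty: lam * (v_i - y_i)^2 = lam * v_i * (1 - 2*y_i) + const.
    n, m = W.shape
    Q = {}
    for i in range(n):
        Q[(i, i)] = -b[i] + lam * (1.0 - 2.0 * y[i])
    for j in range(m):
        Q[(n + j, n + j)] = -c[j]
    for i in range(n):
        for j in range(m):
            Q[(i, n + j)] = -W[i, j]
    return Q  # e.g. pass to a D-Wave sampler's sample_qubo(Q)
```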
Domain-adaptive Person Re-identification without Cross-camera Paired Samples
results: 所提方法在三个具有挑战性的数据集上通过四种实验设置验证了其有效性。Abstract
Existing person re-identification (re-ID) research mainly focuses on pedestrian identity matching across cameras in adjacent areas. However, in reality, it is inevitable to face the problem of pedestrian identity matching across long-distance scenes. The cross-camera pedestrian samples collected from long-distance scenes often have no positive samples. It is extremely challenging to use cross-camera negative samples to achieve cross-region pedestrian identity matching. Therefore, a novel domain-adaptive person re-ID method that focuses on cross-camera consistent discriminative feature learning under the supervision of unpaired samples is proposed. This method mainly includes category synergy co-promotion module (CSCM) and cross-camera consistent feature learning module (CCFLM). In CSCM, a task-specific feature recombination (FRT) mechanism is proposed. This mechanism first groups features according to their contributions to specific tasks. Then an interactive promotion learning (IPL) scheme between feature groups is developed and embedded in this mechanism to enhance feature discriminability. Since the control parameters of the specific task model are reduced after division by task, the generalization ability of the model is improved. In CCFLM, instance-level feature distribution alignment and cross-camera identity consistent learning methods are constructed. Therefore, the supervised model training is achieved under the style supervision of the target domain by exchanging styles between source-domain samples and target-domain samples, and the challenges caused by the lack of cross-camera paired samples are solved by utilizing cross-camera similar samples. In experiments, three challenging datasets are used as target domains, and the effectiveness of the proposed method is demonstrated through four experimental settings.
摘要
现有的行人重识别(re-ID)研究主要关注相邻区域相机之间的行人身份匹配。然而在现实中,跨远距离场景的行人身份匹配问题不可避免。从远距离场景采集的跨相机行人样本往往没有正样本,而利用跨相机负样本实现跨区域行人身份匹配极具挑战性。因此,我们提出了一种新型域适应行人重识别方法,关注在非配对样本监督下的跨相机一致判别特征学习。该方法主要包括类别协同共促模块(CSCM)和跨相机一致特征学习模块(CCFLM)。在CSCM中,我们提出了一种任务特定的特征重组(FRT)机制:该机制首先根据特征对特定任务的贡献对其进行分组,然后在特征组之间设计并嵌入一种交互促进学习(IPL)方案,以增强特征判别性。由于按任务划分后特定任务模型的控制参数减少,模型的泛化能力得到提升。在CCFLM中,我们构建了实例级特征分布对齐与跨相机身份一致学习方法。由此,通过在源域样本与目标域样本之间交换风格,在目标域风格监督下实现了有监督的模型训练,并利用跨相机相似样本解决了缺乏跨相机配对样本所带来的挑战。在实验中,我们以三个具有挑战性的数据集作为目标域,通过四种实验设置证明了所提方法的有效性。
Optimised Least Squares Approach for Accurate Rectangle Fitting
results: 与文献中的方法相比,该方法在干净数据和含噪点云上都具有更高的精度和速度,并能在不到10次迭代内收敛。Abstract
This study introduces a novel and efficient least squares based method for rectangle fitting, using a continuous fitness function that approximates a unit square accurately. The proposed method is compared with the existing method in the literature using both simulated data and real data. The real data is derived from aerial photogrammetry point clouds of a rectangular building. The simulated tests show that the proposed method performs better than the reference method, reducing the root-mean-square error by about 93% and 14% for clean datasets and noisy point clouds, respectively. The proposed method also improves the fitting of the real dataset by about 81%, achieving centimetre level accuracy. Furthermore, the test results show that the proposed method converges in fewer than 10 iterations.
摘要
这项研究提出了一种新颖而高效的基于最小二乘的矩形拟合方法,使用能够准确逼近单位正方形的连续适应度函数。我们利用仿真数据和实际数据,将所提方法与文献中的现有方法进行了比较;实际数据来自一座矩形建筑的航空摄影测量点云。仿真测试表明,所提方法优于参考方法,在干净数据集和含噪点云上分别将均方根误差降低约93%和14%。所提方法还将实际数据的拟合改善约81%,达到厘米级精度。此外,测试结果还表明,所提方法能在不到10次迭代内收敛。
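One way to realize a continuous fitness that approximates a unit square is a p-norm level set, which tends to the square's max-norm contour as p grows. The SciPy sketch below fits centre, rotation, and half-sizes this way; the exact fitness function in the paper may differ, so treat this as an assumption-laden illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def rectangle_residuals(params, pts, p=8.0):
    # Map points into the rectangle's frame, then measure distance to the
    # smooth p-norm approximation of the unit-square contour.
    cx, cy, theta, a, b = params          # centre, rotation, half-sizes
    ct, st = np.cos(theta), np.sin(theta)
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    u = (dx * ct + dy * st) / a
    v = (-dx * st + dy * ct) / b
    return (np.abs(u) ** p + np.abs(v) ** p) ** (1.0 / p) - 1.0

pts = np.random.rand(200, 2)              # stand-in for real boundary points
fit = least_squares(rectangle_residuals, x0=[0.5, 0.5, 0.0, 0.5, 0.5],
                    bounds=([-np.inf, -np.inf, -np.pi, 1e-6, 1e-6], np.inf),
                    args=(pts,))
print(fit.x)                              # fitted centre, angle, half-sizes
```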
Free-Form Composition Networks for Egocentric Action Recognition
for: 解决 egocentric action recognition 中数据不足问题,具体来说是处理 rare class 的动作视频数据。
methods: 提议一种 free-form composition network (FFCN),可以同时学习 verb, preposition, 和 noun 表示,并将它们用于在 feature space 中组合新的样本,以提高 rare class 动作识别性能。
results: 在 Something-Something V2、H2O 和 EPIC-KITCHENS-100 这三个流行的第一人称动作识别数据集上进行了评估,结果表明所提方法可以有效处理数据稀缺问题,包括长尾和小样本第一人称动作识别。Abstract
Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and then use them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We thus decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for those rare classes in a free-form manner, which is not restricted to a rigid form of a verb and a noun. The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improve action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.
摘要
第一人称(egocentric)动作识别在人类动作识别领域正获得越来越多的关注。在这篇论文中,我们从组合泛化的角度出发,解决第一人称动作识别中的数据稀缺问题。为此,我们提出了一种自由形式组合网络(FFCN),它可以同时学习解耦的动词、介词和名词表示,并利用它们在特征空间中为罕见类别的动作视频组合新样本。首先,我们使用图来刻画每个动作视频中不同手部/物体实例之间的时空关系,从而利用图的边特征将每个动作分解为一组动词和介词的时空表示。其中,时间分解从不同视频帧中提取动词和介词表示,而空间分解则自适应地从每帧中与动作相关的实例学习动词和介词表示。借助这些动词和介词的时空表示,我们可以以自由形式为罕见类别组合新样本,而不局限于一个动词加一个名词的固定形式。所提出的FFCN能够直接为罕见类别生成新的训练样本,从而显著提升动作识别性能。我们在 Something-Something V2、H2O 和 EPIC-KITCHENS-100 三个流行的第一人称动作识别数据集上进行了评估,实验结果证明了所提方法在处理数据稀缺问题(包括长尾和小样本第一人称动作识别)方面的有效性。
AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion
results: 与前一代方法比较,我们的框架在所有指标上都有显著改进,并且可以交换 avatar 的衣物。Abstract
Large-scale pre-trained vision-language models allow for the zero-shot text-based generation of 3D avatars. The previous state-of-the-art method utilized CLIP to supervise neural implicit models that reconstructed a human body mesh. However, this approach has two limitations. Firstly, the lack of avatar-specific models can cause facial distortion and unrealistic clothing in the generated avatars. Secondly, CLIP only provides optimization direction for the overall appearance, resulting in less impressive results. To address these limitations, we propose AvatarFusion, the first framework to use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars while simultaneously segmenting clothing from the avatar's body. AvatarFusion includes the first clothing-decoupled neural implicit avatar model that employs a novel Dual Volume Rendering strategy to render the decoupled skin and clothing sub-models in one space. We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes, and generates a variety of clothing styles. Moreover, we establish the first benchmark for zero-shot text-to-avatar generation. Our experimental results demonstrate that our framework outperforms previous approaches, with significant improvements observed in all metrics. Additionally, since our model is clothing-decoupled, we can exchange the clothes of avatars. Code will be available on Github.
摘要
大规模预训练的视觉-语言模型使得零样本文本生成3D人偶成为可能。此前的最先进方法利用CLIP监督重建人体网格的神经隐式模型。然而,这种做法有两个局限:其一,缺乏针对人偶的专门模型可能导致生成人偶出现面部扭曲和不真实的服装;其二,CLIP只为整体外观提供优化方向,导致效果欠佳。为解决这些局限,我们提出了AvatarFusion,这是首个利用潜在扩散模型提供像素级指导、在生成逼真人偶的同时将衣物从人偶身体中分割出来的框架。AvatarFusion包含首个衣物解耦的神经隐式人偶模型,采用新颖的双体渲染(Dual Volume Rendering)策略,在同一空间中渲染解耦的皮肤与衣物子模型。我们还提出了一种新的优化方法,称为像素-语义差异采样(Pixel-Semantics Difference-Sampling,PS-DS),它在语义上分离身体与衣物的生成,并能生成多种服装风格。此外,我们建立了首个零样本文本生成人偶的基准。实验结果表明,我们的框架优于以往方法,在所有指标上均有显著提升。另外,由于我们的模型是衣物解耦的,我们可以交换人偶的衣物。代码将在Github上公开。
WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmark for Autonomous Driving on Water Surfaces
paper_authors: Shanliang Yao, Runwei Guan, Zhaodong Wu, Yi Ni, Zile Huang, Zixian Zhang, Yong Yue, Weiping Ding, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, Yutao Yue
results: 实验结果表明,4D雷达-相机融合可以在水面上提高感知精度和可靠性,特别是在不利的照明和天气条件下。Abstract
Autonomous driving on water surfaces plays an essential role in executing hazardous and time-consuming missions, such as maritime surveillance, survivors rescue, environmental monitoring, hydrography mapping and waste cleaning. This work presents WaterScenes, the first multi-task 4D radar-camera fusion dataset for autonomous driving on water surfaces. Equipped with a 4D radar and a monocular camera, our Unmanned Surface Vehicle (USV) proffers all-weather solutions for discerning object-related information, including color, shape, texture, range, velocity, azimuth, and elevation. Focusing on typical static and dynamic objects on water surfaces, we label the camera images and radar point clouds at pixel-level and point-level, respectively. In addition to basic perception tasks, such as object detection, instance segmentation and semantic segmentation, we also provide annotations for free-space segmentation and waterline segmentation. Leveraging the multi-task and multi-modal data, we conduct benchmark experiments on the uni-modality of radar and camera, as well as the fused modalities. Experimental results demonstrate that 4D radar-camera fusion can considerably improve the accuracy and robustness of perception on water surfaces, especially in adverse lighting and weather conditions. WaterScenes dataset is public on https://waterscenes.github.io.
摘要
水面自动驾驶在执行危险且耗时的任务中扮演着重要角色,例如海事监视、幸存者救援、环境监测、水文测绘和垃圾清理。这项工作介绍了WaterScenes,首个面向水面自动驾驶的多任务4D雷达-摄像头融合数据集。我们的无人水面艇(USV)配备了4D雷达和单目摄像头,可提供全天候解决方案,获取物体相关信息,包括颜色、形状、纹理、距离、速度、方位角和俯仰角。针对水面上典型的静态和动态物体,我们分别对摄像头图像和雷达点云进行了像素级和点级标注。除了物体检测、实例分割和语义分割等基本感知任务之外,我们还提供了自由空间分割和水线分割的标注。利用这些多任务、多模态数据,我们对雷达和摄像头的单模态以及融合模态进行了基准实验。实验结果表明,4D雷达-摄像头融合能够显著提升水面感知的精度和鲁棒性,尤其是在不利的光照和天气条件下。WaterScenes数据集公开于 https://waterscenes.github.io。
On the ability of CNNs to extract color invariant intensity based features for image classification
paper_authors: Pradyumna Elavarthi, James Lee, Anca Ralescu
for: 这篇 paper 研究卷积神经网络(CNN)在保持上下文和背景的同时,适应图像中不同颜色分布的能力。
methods: 这篇 paper 在修改后的 MNIST 和 FashionMNIST 数据上进行实验,探究多种正则化技术对跨数据集泛化误差的影响,并提出一种小的结构改动,以新颖的方式利用 dropout 正则化,增强模型对颜色不变的基于强度的特征的依赖,从而提高分类准确率。
results: 实验结果表明,颜色变化会显著影响分类准确率,而所提出的正则化技术能够增强模型对颜色不变的基于强度的特征的依赖,从而提高分类准确率。Abstract
Convolutional neural networks (CNNs) have demonstrated remarkable success in vision-related tasks. However, their susceptibility to failing when inputs deviate from the training distribution is well-documented. Recent studies suggest that CNNs exhibit a bias toward texture instead of object shape in image classification tasks, and that background information may affect predictions. This paper investigates the ability of CNNs to adapt to different color distributions in an image while maintaining context and background. The results of our experiments on modified MNIST and FashionMNIST data demonstrate that changes in color can substantially affect classification accuracy. The paper explores the effects of various regularization techniques on generalization error across datasets and proposes a minor architectural modification utilizing the dropout regularization in a novel way that enhances model reliance on color-invariant intensity-based features for improved classification accuracy. Overall, this work contributes to ongoing efforts to understand the limitations and challenges of CNNs in image classification tasks and offers potential solutions to enhance their performance.
摘要
卷积神经网络(CNN)在视觉相关任务中表现出色,但它们在输入偏离训练分布时容易失效,这一点已有充分记录。近期研究表明,CNN在图像分类任务中更偏向纹理而非物体形状,且背景信息会影响预测。这篇论文考察了CNN在保持上下文和背景的同时适应图像中不同颜色分布的能力。我们在修改后的 MNIST 和 FashionMNIST 数据上的实验结果表明,颜色变化会显著影响分类准确率。论文还考察了多种正则化技术对跨数据集泛化误差的影响,并提出一种小的结构改动,以新颖的方式利用 dropout 正则化,增强模型对颜色不变的基于强度的特征的依赖,从而提高分类准确率。总之,这项工作有助于持续理解CNN在图像分类任务中的局限性与挑战,并为提升其性能提供了可能的解决方案。
Single-Class Target-Specific Attack against Interpretable Deep Learning Systems
paper_authors: Eldor Abdukhamidov, Mohammed Abuhamad, George K. Thiruvathukal, Hyoungshick Kim, Tamer Abuhmed
for: The paper aims to develop a novel single-class target-specific adversarial attack called SingleADV, which can effectively deceive deep learning models and their associated interpreters.
methods: The SingleADV attack uses a stochastic and iterative optimization approach to generate a universal perturbation that confuses the target model into classifying a specific category of objects as the target category while maintaining high relevance and accuracy. The optimization process is guided by the first- and second-moment estimations and designed to minimize the adversarial loss that considers both classifier and interpreter costs.
results: The paper demonstrates the effectiveness of SingleADV through extensive empirical evaluation using four different model architectures and three interpretation models. The results show that SingleADV achieves an average fooling ratio of 0.74 and an adversarial confidence level of 0.78 in generating deceptive adversarial samples, outperforming existing attacks under various conditions and settings.Abstract
In this paper, we present a novel Single-class target-specific Adversarial attack called SingleADV. The goal of SingleADV is to generate a universal perturbation that deceives the target model into confusing a specific category of objects with a target category while ensuring highly relevant and accurate interpretations. The universal perturbation is stochastically and iteratively optimized by minimizing the adversarial loss that is designed to consider both the classifier and interpreter costs in targeted and non-targeted categories. In this optimization framework, ruled by the first- and second-moment estimations, the desired loss surface promotes high confidence and interpretation score of adversarial samples. By avoiding unintended misclassification of samples from other categories, SingleADV enables more effective targeted attacks on interpretable deep learning systems in both white-box and black-box scenarios. To evaluate the effectiveness of SingleADV, we conduct experiments using four different model architectures (ResNet-50, VGG-16, DenseNet-169, and Inception-V3) coupled with three interpretation models (CAM, Grad, and MASK). Through extensive empirical evaluation, we demonstrate that SingleADV effectively deceives the target deep learning models and their associated interpreters under various conditions and settings. Our experimental results show that the performance of SingleADV is effective, with an average fooling ratio of 0.74 and an adversarial confidence level of 0.78 in generating deceptive adversarial samples. Furthermore, we discuss several countermeasures against SingleADV, including a transfer-based learning approach and existing preprocessing defenses.
摘要
在这篇论文中,我们提出了一种新颖的单类别目标特定对抗攻击,称为SingleADV。SingleADV的目标是生成一个通用扰动,使目标模型将特定类别的物体误判为目标类别,同时确保高度相关且准确的解释。该通用扰动通过随机、迭代的优化来最小化对抗损失,该损失同时考虑目标类别与非目标类别中分类器和解释器的代价。在这个由一阶矩和二阶矩估计引导的优化框架中,期望的损失曲面会提升对抗样本的置信度和解释得分。通过避免其他类别样本被意外误分类,SingleADV能够在白盒和黑盒场景下对可解释深度学习系统实施更有效的目标攻击。为评估SingleADV的效果,我们在四种模型架构(ResNet-50、VGG-16、DenseNet-169和Inception-V3)与三种解释模型(CAM、Grad和MASK)的组合上进行了实验。大量实验评估表明,SingleADV能够在各种条件和设置下有效欺骗目标深度学习模型及其相应的解释器。实验结果显示,SingleADV生成欺骗性对抗样本的性能良好,平均愚弄率为0.74,对抗置信度为0.78。此外,我们还讨论了针对SingleADV的若干防御措施,包括基于迁移的学习方法和现有的预处理防御。
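A stripped-down sketch of one universal-perturbation update in the spirit of SingleADV: the classifier term pushes source-category images toward the target class, while the interpreter term (omitted here for brevity) would keep attribution maps plausible. The names and exact loss composition are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def singleadv_step(delta, model, x_src, target_cls, opt, alpha=1e-3):
    # One update of the universal perturbation delta (a leaf tensor created
    # with requires_grad=True and registered with opt, e.g. Adam([delta])).
    logits = model(torch.clamp(x_src + delta, 0.0, 1.0))
    tgt = torch.full((x_src.size(0),), target_cls,
                     dtype=torch.long, device=x_src.device)
    loss = F.cross_entropy(logits, tgt) + alpha * delta.norm()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```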
Early Autism Diagnosis based on Path Signature and Siamese Unsupervised Feature Compressor
paper_authors: Zhuowen Yin, Xinyao Ding, Xin Zhang, Zhengwang Wu, Li Wang, Gang Li
for: Early diagnosis of Autism Spectrum Disorder (ASD) in children younger than 2 years of age.
methods: A novel deep learning-based method that extracts key features from structural MR images to diagnose ASD. This method includes a Siamese verification framework, an unsupervised compressor, weight constraints, and Path Signature.
results: The proposed method performed well under practical scenarios, transcending existing machine learning methods.Abstract
Autism Spectrum Disorder (ASD) has been emerging as a growing public health threat. Early diagnosis of ASD is crucial for timely, effective intervention and treatment. However, conventional diagnosis methods based on communications and behavioral patterns are unreliable for children younger than 2 years of age. Given evidences of neurodevelopmental abnormalities in ASD infants, we resort to a novel deep learning-based method to extract key features from the inherently scarce, class-imbalanced, and heterogeneous structural MR images for early autism diagnosis. Specifically, we propose a Siamese verification framework to extend the scarce data, and an unsupervised compressor to alleviate data imbalance by extracting key features. We also proposed weight constraints to cope with sample heterogeneity by giving different samples different voting weights during validation, and we used Path Signature to unravel meaningful developmental features from the two-time point data longitudinally. Extensive experiments have shown that our method performed well under practical scenarios, transcending existing machine learning methods.
摘要
自闭症谱系障碍(ASD)正日益成为公共卫生威胁。早期诊断对于及时、有效的干预和治疗至关重要。然而,基于交流和行为模式的传统诊断方法对2岁以下儿童并不可靠。鉴于ASD婴儿存在神经发育异常的证据,我们采用一种新颖的基于深度学习的方法,从天然稀缺、类别不平衡且异质的结构MR图像中提取关键特征,用于自闭症早期诊断。具体来说,我们提出了一种孪生验证框架来扩充稀缺数据,以及一个无监督压缩器,通过提取关键特征来缓解数据不平衡。我们还提出了权重约束,在验证时赋予不同样本不同的投票权重,以应对样本异质性;并使用Path Signature从两个时间点的纵向数据中挖掘有意义的发育特征。大量实验表明,我们的方法在实际场景下表现良好,超越了现有的机器学习方法。
Discovering Image Usage Online: A Case Study With "Flatten the Curve"
results: 研究发现,反向图像搜索引擎可以提供当前图像 reuse 的视图,社交媒体可以评估图像的时间 Popularity,而网络档案可以为未来研究提供一个图像的popularity视图。Abstract
Understanding the spread of images across the web helps us understand the reuse of scientific visualizations and their relationship with the public. The "Flatten the Curve" graphic was heavily used during the COVID-19 pandemic to convey a complex concept in a simple form. It displays two curves comparing the impact on case loads for medical facilities if the populace either adopts or fails to adopt protective measures during a pandemic. We use five variants of the "Flatten the Curve" image as a case study for viewing the spread of an image online. To evaluate its spread, we leverage three information channels: reverse image search engines, social media, and web archives. Reverse image searches give us a current view into image reuse. Social media helps us understand a variant's popularity over time. Web archives help us see when it was preserved, highlighting a view of popularity for future researchers. Our case study shows that document URLs can be used as a proxy for images when studying the spread of images online.
摘要
了解图像在网络上的传播方式,有助于我们理解科学可视化的重用及其与公众的关系。COVID-19大流行期间,"平滑曲线(Flatten the Curve)"图表被广泛用于以简单形式传达复杂概念。该图表通过两条曲线,比较民众在大流行期间采取或不采取防护措施时医疗机构病例负荷所受的影响。我们以"平滑曲线"图表的五种变体作为案例,研究图像在网上的传播。为评估其传播,我们利用三种信息渠道:反向图像搜索引擎、社交媒体和网络存档。反向图像搜索为我们提供当前图像重用的视图;社交媒体帮助我们理解某一变体随时间的受欢迎程度;网络存档则能显示它在何时被保存,为未来研究人员提供其受欢迎程度的视图。我们的案例研究表明,在研究在线图像传播时,文档URL可以用作图像的代理。
SAM-Path: A Segment Anything Model for Semantic Segmentation in Digital Pathology
methods: 本研究引入可训练的类别提示(trainable class prompts)并结合病理编码器(pathology encoder),对SAM加以改进以进行语义分割。
results: 实验表明,在BCSS和CRAG两个公共病理数据集上,使用可训练类别提示进行微调,相比使用人工提示和后处理的原始SAM,Dice分数提高了27.52%,IOU提高了71.63%。在这两个数据集上,所提出的额外病理基础模型进一步带来了5.07%至5.12%的Dice分数相对提升和4.50%至8.48%的IOU相对提升。Abstract
Semantic segmentations of pathological entities have crucial clinical value in computational pathology workflows. Foundation models, such as the Segment Anything Model (SAM), have been recently proposed for universal use in segmentation tasks. SAM shows remarkable promise in instance segmentation on natural images. However, the applicability of SAM to computational pathology tasks is limited due to the following factors: (1) lack of comprehensive pathology datasets used in SAM training and (2) the design of SAM is not inherently optimized for semantic segmentation tasks. In this work, we adapt SAM for semantic segmentation by introducing trainable class prompts, followed by further enhancements through the incorporation of a pathology encoder, specifically a pathology foundation model. Our framework, SAM-Path enhances SAM's ability to conduct semantic segmentation in digital pathology without human input prompts. Through experiments on two public pathology datasets, the BCSS and the CRAG datasets, we demonstrate that the fine-tuning with trainable class prompts outperforms vanilla SAM with manual prompts and post-processing by 27.52% in Dice score and 71.63% in IOU. On these two datasets, the proposed additional pathology foundation model further achieves a relative improvement of 5.07% to 5.12% in Dice score and 4.50% to 8.48% in IOU.
摘要
病理实体的语义分割在计算病理学工作流程中具有重要的临床价值。基础模型(如Segment Anything Model,SAM)近来被提出用于通用分割任务。SAM在自然图像的实例分割上展现出显著的潜力,但其在计算病理任务中的适用性受到以下因素限制:(1)SAM训练中缺乏全面的病理数据集;(2)SAM的设计并非天然针对语义分割任务优化。在这项工作中,我们通过引入可训练的类别提示来将SAM适配于语义分割,并进一步引入病理编码器(具体为一个病理基础模型)加以增强。我们的框架SAM-Path增强了SAM在数字病理中无需人工输入提示即可进行语义分割的能力。在BCSS和CRAG两个公共病理数据集上的实验表明,使用可训练类别提示进行微调,相比使用人工提示和后处理的原始SAM,Dice分数提高了27.52%,IOU提高了71.63%。在这两个数据集上,所提出的额外病理基础模型进一步带来了5.07%至5.12%的Dice分数相对提升和4.50%至8.48%的IOU相对提升。
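A hedged sketch of the trainable class-prompt idea: one learnable token per tissue class replaces human clicks or boxes as the prompt input to the mask decoder. The interface is illustrative, not SAM's actual API.

```python
import torch
import torch.nn as nn

class ClassPrompts(nn.Module):
    # One learnable prompt token per tissue class, fed to the mask decoder
    # in place of human clicks/boxes (illustrative interface, not SAM's API).
    def __init__(self, num_classes, prompt_dim=256):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_classes, prompt_dim) * 0.02)

    def forward(self, batch_size):
        # (B, num_classes, D): each class token requests one semantic mask
        return self.prompts.unsqueeze(0).expand(batch_size, -1, -1)
```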
Efficient Convolution and Transformer-Based Network for Video Frame Interpolation
results: 在多个包含复杂运动的基准上进行评估,该方法可以达到与现有插值网络相当的性能,同时具有更好的速度和内存利用率。Abstract
Video frame interpolation is an increasingly important research task with several key industrial applications in the video coding, broadcast and production sectors. Recently, transformers have been introduced to the field resulting in substantial performance gains. However, this comes at a cost of greatly increased memory usage, training and inference time. In this paper, a novel method integrating a transformer encoder and convolutional features is proposed. This network reduces the memory burden by close to 50% and runs up to four times faster during inference time compared to existing transformer-based interpolation methods. A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with those of the transformer for long-range dependencies. Quantitative evaluations are conducted on various benchmarks with complex motion to showcase the robustness of the proposed method, achieving competitive performance compared to state-of-the-art interpolation networks.
摘要
视频帧插值是一项日益重要的研究任务,在视频编码、广播和制作领域有多个关键的工业应用。最近,Transformer被引入该领域,带来了显著的性能提升。然而,这以内存占用、训练和推理时间的大幅增加为代价。本文提出了一种融合Transformer编码器与卷积特征的新方法。与现有基于Transformer的插值方法相比,该网络将内存负担降低近50%,推理速度最高可提升四倍。我们引入了一种双编码器架构,将卷积建模局部相关性的优势与Transformer建模长距离依赖的优势相结合。我们在多个包含复杂运动的基准上进行了定量评估,展示了所提方法的鲁棒性,其性能与最先进的插值网络相当。
RaBiT: An Efficient Transformer using Bidirectional Feature Pyramid Network with Reverse Attention for Colon Polyp Segmentation
results: 实验结果显示,这篇论文的方法在多个基准数据集上都优于现有方法,同时保持较低的计算复杂度。此外,这篇论文还展示了很强的泛化能力,即使在训练和测试数据集特征不同时仍能保持高准确率。Abstract
Automatic and accurate segmentation of colon polyps is essential for early diagnosis of colorectal cancer. Advanced deep learning models have shown promising results in polyp segmentation. However, they still have limitations in representing multi-scale features and generalization capability. To address these issues, this paper introduces RaBiT, an encoder-decoder model that incorporates a lightweight Transformer-based architecture in the encoder to model multiple-level global semantic relationships. The decoder consists of several bidirectional feature pyramid layers with reverse attention modules to better fuse feature maps at various levels and incrementally refine polyp boundaries. We also propose ideas to lighten the reverse attention module and make it more suitable for multi-class segmentation. Extensive experiments on several benchmark datasets show that our method outperforms existing methods across all datasets while maintaining low computational complexity. Moreover, our method demonstrates high generalization capability in cross-dataset experiments, even when the training and test sets have different characteristics.
摘要
结肠息肉的自动精确分割对于结直肠癌的早期诊断至关重要。先进的深度学习模型在息肉分割中已展现出良好的效果,但它们在表征多尺度特征和泛化能力方面仍有局限。为解决这些问题,本文介绍了RaBiT,一种编码器-解码器模型:其编码器采用轻量级的基于Transformer的架构,以建模多层级的全局语义关系;解码器由多个带反向注意力模块的双向特征金字塔层组成,以更好地融合各层级特征图,并逐步细化息肉边界。我们还提出了轻量化反向注意力模块的思路,使其更适合多类分割。在多个基准数据集上的大量实验表明,我们的方法在所有数据集上都优于现有方法,同时保持较低的计算复杂度。此外,在跨数据集实验中,即使训练集与测试集特性不同,我们的方法也表现出很强的泛化能力。
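A minimal sketch of a reverse-attention refinement step as commonly formulated for polyp segmentation (shown here for the binary case): features are re-weighted by the complement of the current prediction so the block attends to regions the coarse map missed. RaBiT's lightweight multi-class variant may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttention(nn.Module):
    # Refine a coarse mask with features weighted by 1 - sigmoid(prediction).
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat, coarse_logits):
        coarse = F.interpolate(coarse_logits, size=feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        rev = 1.0 - torch.sigmoid(coarse)      # attend to what is still missed
        return coarse + self.conv(feat * rev)  # residual refinement
```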
Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization
for: 本研究针对视听事件定位(AVEL)任务,即对视频中的视听事件进行时间定位和分类。
methods: 我们使用一个基础模型,先以比视频级更细的时间分辨率估计训练数据的事件标签,再用这些标签重新训练模型。为处理合成视频的分布外特性,我们为基础模型提出了一个辅助目标函数,使其对定位事件标签的预测更加可靠。
results: 我们的三个阶段管道可以在无需更改模型结构的情况下,超越一些现有的 AVEL 方法,并在一个相关的弱监督任务上也提高了性能。Abstract
Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels. I.e., we determine the subset of labels for each \emph{slice} of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels as desired. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and improves performance on a related weakly-supervised task as well.
摘要
视听事件定位(AVEL)是对视频中同时可见且可听的事件进行时间定位和分类的任务。在这篇论文中,我们在弱监督设置下求解AVEL,即训练时只有视频级的事件标签(只知其是否存在,而不知其在时间上的位置)可作监督。我们的思路是:用一个基础模型以比视频级更细的时间分辨率估计训练数据上的标签,再用这些标签重新训练模型。具体来说,我们通过以下两步确定训练视频中每个帧片段(slice,即一段特定的时间区间)的标签子集:(i)将该片段之外的帧替换为另一段在视频级标签上与原视频没有重叠的视频中的帧;(ii)将这段合成视频输入基础模型,仅提取该片段的标签。为处理合成视频的分布外特性,我们为基础模型提出了一个辅助目标,使其产生更可靠的定位事件标签预测。我们的三阶段流程在不改动模型结构的情况下超越了多种现有AVEL方法,并在一个相关的弱监督任务上也提升了性能。
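The frame-splicing trick in step (i) is simple to express; the sketch below keeps one slice from video A and fills the rest with frames from a label-disjoint video B, so the base model's output on the result can be attributed to the slice alone. The tensor layout is an assumption.

```python
import torch

def make_slice_video(video_a, video_b, start, end):
    # Keep frames [start, end) from A; fill the rest from a video B that
    # shares no video-level labels with A. (T, C, H, W) layout assumed.
    assert video_a.shape == video_b.shape
    synthetic = video_b.clone()
    synthetic[start:end] = video_a[start:end]
    return synthetic
```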
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
results: 对先前方法进行了广泛的测试和评估,证明了 T2I-CompBench 数据集和评估指标的有效性,以及 GORS 方法的提升效果。Abstract
Despite the stunning ability to generate high-quality images by recent text-to-image models, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and GORS approach. Project page is available at https://karine-h.github.io/T2I-CompBench/.
摘要
尽管近期的文本到图像模型能够生成高质量图像,但现有方法往往难以将具有不同属性和关系的对象有效地组合成复杂而连贯的场景。我们提出了T2I-CompBench,一个面向开放世界组合式文本到图像生成的全面基准,包含来自3个类别(属性绑定、对象关系和复杂组合)和6个子类别(颜色绑定、形状绑定、纹理绑定、空间关系、非空间关系和复杂组合)的6,000个组合式文本提示。我们还提出了若干专门为评估组合式文本到图像生成而设计的评估指标。我们提出了一种新方法,即奖励驱动样本选择的生成模型微调(GORS),以提升预训练文本到图像模型的组合生成能力。我们进行了广泛的实验与评估,在T2I-CompBench上对先前方法进行基准测试,并验证了所提评估指标和GORS方法的有效性。项目页面: https://karine-h.github.io/T2I-CompBench/。
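A hedged sketch of the reward-driven selection step suggested by the GORS acronym: only self-generated images whose compositional reward clears a threshold contribute to the fine-tuning loss, weighted by that reward. The thresholding and weighting details here are assumptions, not the paper's exact recipe.

```python
import torch

def gors_loss(per_sample_loss, rewards, threshold):
    # Keep only generations whose compositional reward clears the threshold,
    # and weight each surviving sample's fine-tuning loss by its reward.
    mask = (rewards > threshold).float()
    weights = rewards * mask
    return (weights * per_sample_loss).sum() / weights.sum().clamp(min=1e-8)
```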
Neural Free-Viewpoint Relighting for Glossy Indirect Illumination
results: 该方法可以实现高效的实时渲染,涵盖光泽反射乃至焦散等光传输效应,并支持在不同视点和光照条件下重新打光。Abstract
Precomputed Radiance Transfer (PRT) remains an attractive solution for real-time rendering of complex light transport effects such as glossy global illumination. After precomputation, we can relight the scene with new environment maps while changing viewpoint in real-time. However, practical PRT methods are usually limited to low-frequency spherical harmonic lighting. All-frequency techniques using wavelets are promising but have so far had little practical impact. The curse of dimensionality and much higher data requirements have typically limited them to relighting with fixed view or only direct lighting with triple product integrals. In this paper, we demonstrate a hybrid neural-wavelet PRT solution to high-frequency indirect illumination, including glossy reflection, for relighting with changing view. Specifically, we seek to represent the light transport function in the Haar wavelet basis. For global illumination, we learn the wavelet transport using a small multi-layer perceptron (MLP) applied to a feature field as a function of spatial location and wavelet index, with reflected direction and material parameters being other MLP inputs. We optimize/learn the feature field (compactly represented by a tensor decomposition) and MLP parameters from multiple images of the scene under different lighting and viewing conditions. We demonstrate real-time (512 x 512 at 24 FPS, 800 x 600 at 13 FPS) precomputed rendering of challenging scenes involving view-dependent reflections and even caustics.
摘要
预计算辐射传输(PRT)仍然是实时渲染光泽全局光照等复杂光传输效果的一种有吸引力的方案。预计算后,我们可以用新的环境贴图对场景重新打光,同时实时改变视点。然而,实用的PRT方法通常局限于低频球谐光照。基于小波的全频技术虽有前景,但迄今实际影响甚微:维度灾难和高得多的数据需求通常将其限制在固定视点下的重打光,或仅借助三重积积分处理直接光照。在这篇论文中,我们展示了一种混合神经-小波PRT方案,用于支持视点变化的高频间接光照(包括光泽反射)重打光。具体来说,我们将光传输函数表示在Haar小波基下。对于全局光照,我们用一个小型多层感知机(MLP)作用于随空间位置和小波索引变化的特征场来学习小波传输,反射方向和材质参数作为MLP的其他输入。我们从场景在不同光照和视点条件下的多幅图像中学习/优化特征场(以张量分解紧凑表示)和MLP参数。我们演示了对包含视点相关反射乃至焦散的挑战性场景的实时预计算渲染(512 x 512下24 FPS,800 x 600下13 FPS)。
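A sketch of the per-coefficient transport query implied by the abstract: a small MLP maps a spatial feature, a wavelet-index embedding, the reflected direction, and material parameters to one Haar coefficient of the transport; relighting then reduces to a dot product with the environment map's wavelet coefficients. All dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WaveletTransportMLP(nn.Module):
    # Predict one Haar-wavelet coefficient of the light transport at a
    # shading point; summing coeff * env_wavelet_coeff over indices relights.
    def __init__(self, feat_dim=32, idx_dim=16, dir_dim=3, mat_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + idx_dim + dir_dim + mat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, feat, idx_embed, refl_dir, material):
        return self.net(torch.cat([feat, idx_embed, refl_dir, material], dim=-1))
```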
Deep Learning of Crystalline Defects from TEM images: A Solution for the Problem of “Never Enough Training Data”
results: 测试结果表明,使用本研究提出的方法可以在真实图像上取得高质量的结果,并具有更好的泛化性和可靠性。Abstract
Crystalline defects, such as line-like dislocations, play an important role for the performance and reliability of many metallic devices. Their interaction and evolution still poses a multitude of open questions to materials science and materials physics. In-situ TEM experiments can provide important insights into how dislocations behave and move. During such experiments, the dislocation microstructure is captured in form of videos. The analysis of individual video frames can provide useful insights but is limited by the capabilities of automated identification, digitization, and quantitative extraction of the dislocations as curved objects. The vast amount of data also makes manual annotation very time consuming, thereby limiting the use of Deep Learning-based, automated image analysis and segmentation of the dislocation microstructure. In this work, a parametric model for generating synthetic training data for segmentation of dislocations is developed. Even though domain scientists might dismiss synthetic training images sometimes as too artificial, our findings show that they can result in superior performance, particularly regarding the generalizing of the Deep Learning models with respect to different microstructures and imaging conditions. Additionally, we propose an enhanced deep learning method optimized for segmenting overlapping or intersecting dislocation lines. Upon testing this framework on four distinct real datasets, we find that our synthetic training data are able to yield high-quality results also on real images-even more so if fine-tune on a few real images was done.
摘要
晶体缺陷(如线状位错)对许多金属器件的性能和可靠性起着重要作用。它们的相互作用和演化仍给材料科学和材料物理留下诸多悬而未决的问题。原位TEM实验可以为位错的行为和运动提供重要见解。在此类实验中,位错微观结构以视频形式被记录下来。对单个视频帧的分析可以提供有用的见解,但受限于将位错作为曲线对象进行自动识别、数字化和定量提取的能力。庞大的数据量也使人工标注极为耗时,从而限制了基于深度学习的位错微观结构自动图像分析与分割的应用。在这项工作中,我们开发了一个用于生成位错分割合成训练数据的参数化模型。尽管领域科学家有时会认为合成训练图像过于人工,但我们的研究结果表明,它们可以带来更优的性能,尤其是在深度学习模型对不同微观结构和成像条件的泛化方面。此外,我们提出了一种针对重叠或相交位错线分割优化的增强深度学习方法。在四个不同的真实数据集上测试该框架后,我们发现我们的合成训练数据在真实图像上也能产生高质量的结果;若再用少量真实图像进行微调,效果更佳。
Data Augmentation in Training CNNs: Injecting Noise to Images
results: 研究结果表明,不同噪音模型的影响程度不同,且一些噪音模型在某些级别下可以提高图像分类器的性能。此外,研究人员还提出了一些新的噪音扩展策略和建议,可以帮助更好地理解图像分类器的学习过程。Abstract
Noise injection is a fundamental tool for data augmentation, and yet there is no widely accepted procedure to incorporate it with learning frameworks. This study analyzes the effects of adding or applying different noise models of varying magnitudes to Convolutional Neural Network (CNN) architectures. Noise models that are distributed with different density functions are given common magnitude levels via Structural Similarity (SSIM) metric in order to create an appropriate ground for comparison. The basic results are conforming with the most of the common notions in machine learning, and also introduce some novel heuristics and recommendations on noise injection. The new approaches will provide better understanding on optimal learning procedures for image classification.
摘要
<> translate "Noise injection is a fundamental tool for data augmentation, and yet there is no widely accepted procedure to incorporate it with learning frameworks. This study analyzes the effects of adding or applying different noise models of varying magnitudes to Convolutional Neural Network (CNN) architectures. Noise models that are distributed with different density functions are given common magnitude levels via Structural Similarity (SSIM) metric in order to create an appropriate ground for comparison. The basic results are conforming with the most of the common notions in machine learning, and also introduce some novel heuristics and recommendations on noise injection. The new approaches will provide better understanding on optimal learning procedures for image classification." into Simplified Chinese.翻译:“噪声注入是数据增强工具的基本工具,但是没有一个广泛accepted的程序来与学习框架结合。这个研究分析了不同噪声模型的不同强度在Convolutional Neural Network (CNN) 架构中的效果。噪声模型按照不同的分布函数分布在不同的强度水平上,通过Structural Similarity (SSIM) 度量来创建一个相对适用的比较基础。研究结果基本上与机器学习中的大多数通俗观点一致,并提出了一些新的启示和建议,包括噪声注入的优化方法。这些新方法将为图像分类学习提供更好的理解。
Correlation-Aware Mutual Learning for Semi-supervised Medical Image Segmentation
results: 我们的方法在 Atrial Segmentation Challenge 数据集上进行实验,结果显示我们的方法能够超越现有方法,体现了我们的框架在医疗影像分割任务中的有效性。Abstract
Semi-supervised learning has become increasingly popular in medical image segmentation due to its ability to leverage large amounts of unlabeled data to extract additional information. However, most existing semi-supervised segmentation methods only focus on extracting information from unlabeled data, disregarding the potential of labeled data to further improve the performance of the model. In this paper, we propose a novel Correlation Aware Mutual Learning (CAML) framework that leverages labeled data to guide the extraction of information from unlabeled data. Our approach is based on a mutual learning strategy that incorporates two modules: the Cross-sample Mutual Attention Module (CMA) and the Omni-Correlation Consistency Module (OCC). The CMA module establishes dense cross-sample correlations among a group of samples, enabling the transfer of label prior knowledge to unlabeled data. The OCC module constructs omni-correlations between the unlabeled and labeled datasets and regularizes dual models by constraining the omni-correlation matrix of each sub-model to be consistent. Experiments on the Atrial Segmentation Challenge dataset demonstrate that our proposed approach outperforms state-of-the-art methods, highlighting the effectiveness of our framework in medical image segmentation tasks. The codes, pre-trained weights, and data are publicly available.
摘要
半监督学习在医疗图像分割中日益受欢迎,因为它可以利用大量的无标注数据来提取更多的信息。然而,大多数现有的半监督分割方法仅关注于从无标注数据中提取信息,忽视了利用标注数据进一步提升模型性能的潜力。在这篇论文中,我们提出了一种新的相关感知互学习(CAML)框架,该框架利用标注数据来引导无标注数据中的信息提取。我们的方法基于一种互学习策略,包括两个模块:跨样本互注意力模块(CMA)和全面相关一致性模块(OCC)。CMA 模块在一组样本之间建立稠密的跨样本相关,以便将标签先验知识传递给无标注数据。OCC 模块在无标注和标注数据集之间构建全面相关,并通过约束每个子模型的全面相关矩阵保持一致来正则化双模型。在 Atrial Segmentation Challenge 数据集上的实验表明,我们提出的方法超越了当前最佳方法,有力地证明了我们的框架在医疗图像分割任务中的有效性。代码、预训练权重和数据均已公开。
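A rough PyTorch sketch of how an omni-correlation consistency term could look: each sub-model's unlabeled features are correlated against a shared bank of labeled features, and the two resulting correlation matrices are constrained to agree. The shapes, the softmax normalization, the temperature, and the MSE distance are our assumptions, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def omni_correlation(feat_unlabeled, feat_labeled, tau=0.1):
    """Row-normalized similarity of unlabeled features to a labeled feature bank."""
    u = F.normalize(feat_unlabeled, dim=1)          # (N_u, D)
    l = F.normalize(feat_labeled, dim=1)            # (N_l, D)
    return F.softmax(u @ l.t() / tau, dim=1)        # (N_u, N_l)

def occ_loss(feat_u_model_a, feat_u_model_b, feat_labeled_bank):
    """Constrain both sub-models to produce consistent omni-correlation matrices."""
    corr_a = omni_correlation(feat_u_model_a, feat_labeled_bank)
    corr_b = omni_correlation(feat_u_model_b, feat_labeled_bank)
    return F.mse_loss(corr_a, corr_b)

# toy usage
bank = torch.randn(128, 64)                         # labeled feature bank
fa, fb = torch.randn(32, 64), torch.randn(32, 64)   # unlabeled features from two sub-models
loss = occ_loss(fa, fb, bank)
```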
Facial Reenactment Through a Personalized Generator
paper_authors: Ariel Elazary, Yotam Nitzan, Daniel Cohen-Or
for: facial reenactment
methods: personalized generator, latent optimization
results: state-of-the-art performance, semantic latent space editing and stylizing.Abstract
In recent years, the role of image generative models in facial reenactment has been steadily increasing. Such models are usually subject-agnostic and trained on domain-wide datasets. The appearance of the reenacted individual is learned from a single image, and hence, the entire breadth of the individual's appearance is not entirely captured, leading these methods to resort to unfaithful hallucination. Thanks to recent advancements, it is now possible to train a personalized generative model tailored specifically to a given individual. In this paper, we propose a novel method for facial reenactment using a personalized generator. We train the generator using frames from a short, yet varied, self-scan video captured using a simple commodity camera. Images synthesized by the personalized generator are guaranteed to preserve identity. The premise of our work is that the task of reenactment is thus reduced to accurately mimicking head poses and expressions. To this end, we locate the desired frames in the latent space of the personalized generator using carefully designed latent optimization. Through extensive evaluation, we demonstrate state-of-the-art performance for facial reenactment. Furthermore, we show that since our reenactment takes place in a semantic latent space, it can be semantically edited and stylized in post-processing.
Improved Real-time Image Smoothing with Weak Structures Preserved and High-contrast Details Removed
results: 相比现有方法,我们的方法在保留弱结构、去除高对比度细节的同时,兼顾了图像平滑的质量和效率。实验结果表明,我们的方法在效率和质量两方面均具有优势。Abstract
Image smoothing works by reducing pixel-wise gradients to smooth out details. Since existing methods rely on gradients to determine how to smooth, and the gradient ranges of structures and details overlap, it is difficult to distinguish the two and handle them distinctively. Thus, it is still challenging to achieve high-quality results, especially on preserving weak structures and removing high-contrast details. In this paper, we address this challenge by improving the real-time optimization-based method via iterative least squares (called ILS). We observe that 1) ILS uses gradients as the independent variable in its penalty function for determining smoothing manners, and 2) the framework of ILS still works for image smoothing when we use some values other than gradients in the penalty function. Thus, corresponding to whether pixels lie on structures or not, we compute values to use in the penalty function to determine smoothing manners, so that we can handle structures and details distinctively, no matter whether their gradients are high or low. As a result, we can conveniently remove high-contrast details while preserving weak structures. Moreover, such values can be adjusted to accelerate the optimization, so that we need fewer iterations than the original ILS method for efficiency. This also reduces the changes to structures, helping structure preservation. Experimental results show our advantages over existing methods on efficiency and quality.
摘要
图像平滑通过减少像素级梯度来平滑细节。由于现有方法总是依赖梯度来确定平滑方式,而结构与细节的梯度范围相互重叠,因此难以区分二者并分别处理。因而,要获得高质量的结果,特别是在保留弱结构和去除高对比度细节方面,仍然具有挑战性。在这篇论文中,我们通过改进基于迭代最小二乘(ILS)的实时优化方法来应对这一挑战。我们观察到:1)ILS 在其惩罚函数中以梯度作为自变量来确定平滑方式;2)当我们在惩罚函数中使用其他数值代替梯度时,ILS 框架仍然适用于图像平滑。因此,根据像素是否位于结构上这一属性,我们计算相应的数值用于惩罚函数以确定平滑方式,从而无论梯度高低,都可以区别对待结构和细节。这样,我们可以方便地去除高对比度细节,同时保留弱结构。此外,这些数值还可以调整以加速优化计算,使我们能够使用比原始 ILS 方法更少的迭代次数,从而提高效率;这也减少了对结构的改动,有助于结构保持。实验结果表明,我们的方法在效率和质量两方面均具有优势。
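The following is a minimal 1D analogue of the ILS idea (a sketch under our own simplifications, not the paper's implementation): each iteration solves a least-squares system whose penalty weights are computed from per-pixel values. The paper's key observation is that these values need not be raw gradients, so structure-aware quantities can be swapped in at the marked line.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_ils_like(f, lam=5.0, iters=4, eps=1e-4):
    """Edge-aware least-squares smoothing of a 1D signal.

    Each iteration solves (I + lam * D^T W D) u = f, where D is the forward
    difference operator and W down-weights smoothing across large responses.
    Here the weights come from gradients; other per-pixel values can be
    plugged in instead to treat weak structures and details differently.
    """
    n = len(f)
    D = sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
    u = f.copy()
    for _ in range(iters):
        g = D @ u
        w = 1.0 / (np.abs(g) ** 1.2 + eps)   # <-- swap in structure-aware values here
        A = sp.eye(n) + lam * D.T @ sp.diags(w) @ D
        u = spsolve(A.tocsc(), f)
    return u

x = np.linspace(0.0, 1.0, 200)
f = np.where(x > 0.5, 1.0, 0.0) + 0.05 * np.sin(60 * x)   # step edge plus oscillating detail
u = smooth_ils_like(f)
```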
results: 该算法可以生成具有正确视差和焦点提示的全息图,从而提高观看体验的真实感。Abstract
The Visual Turing Test is the ultimate goal to evaluate the realism of holographic displays. Previous studies have focused on addressing challenges such as limited étendue and image quality over a large focal volume, but they have not investigated the effect of pupil sampling on the viewing experience in full 3D holograms. In this work, we tackle this problem with a novel hologram generation algorithm motivated by matching the projection operators of incoherent Light Field and coherent Wigner Function light transport. To this end, we supervise hologram computation using synthesized photographs, which are rendered on-the-fly using Light Field refocusing from stochastically sampled pupil states during optimization. The proposed method produces holograms with correct parallax and focus cues, which are important for passing the Visual Turing Test. We validate that our approach compares favorably to state-of-the-art CGH algorithms that use Light Field and Focal Stack supervision. Our experiments demonstrate that our algorithm significantly improves the realism of the viewing experience for a variety of different pupil states.
摘要
“视觉图灵测试”是评估全息显示器真实感的最终目标。先前的研究主要关注有限光学范围(étendue)和大焦距体积内的图像质量等问题,但尚未考察瞳孔采样对全三维全息图观看体验的影响。在本工作中,我们提出一种新的全息图生成算法来解决这一问题,其动机是匹配非相干光场与相干维格纳函数(Wigner Function)光传输的投影算子。为此,我们在优化过程中使用合成照片来监督全息图计算,这些照片通过对随机采样的瞳孔状态进行光场重聚焦实时渲染得到。该方法生成的全息图具有正确的视差和焦点提示,这对于通过“视觉图灵测试”至关重要。实验表明,与使用光场和焦栈监督的最新 CGH 算法相比,我们的方法在多种瞳孔状态下显著提升了观看体验的真实感。
Exposing the Fake: Effective Diffusion-Generated Images Detection
results: 我们的评估表明,SeDID 在应用于扩散模型时优于现有方法。因此,本研究的成果为区分扩散模型生成的图像做出了重要贡献,是人工智能安全领域的重要一步。Abstract
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based $\text{SeDID}_{\text{Stat}}$ and neural network-based $\text{SeDID}_{\text{NNs}}$, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security.
摘要
图像合成随着去噪扩散概率模型(DDPM)和文本到图像扩散模型等基于扩散的生成模型的出现取得了显著进展。尽管这些模型非常有效,但专门针对扩散生成图像检测的研究仍然匮乏,这可能带来潜在的安全和隐私风险。本文弥补了这一空白,提出了一种新的检测方法,即面向扩散生成图像检测的逐步误差法(SeDID)。SeDID 包括基于统计的 $\text{SeDID}_{\text{Stat}}$ 和基于神经网络的 $\text{SeDID}_{\text{NNs}}$,它利用扩散模型的独有特性,即确定性反向计算和确定性去噪计算的误差。我们的评估结果显示,SeDID 在应用于扩散模型时表现出色,相比现有方法具有显著优势。因此,我们的工作为区分扩散模型生成的图像做出了关键贡献,是人工智能安全领域的重要一步。
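As a hedged illustration of the stepwise-error idea behind the statistical variant: an image is pushed through a deterministic inversion to some timestep and deterministically denoised back, and the round-trip error serves as the detection statistic. The `invert_step`/`denoise_step` callables, the error direction, and the threshold are placeholders standing in for a pretrained diffusion model's computations, not a real API.

```python
import numpy as np

def stepwise_error(x, invert_step, denoise_step, t):
    """Error between an image and its deterministic invert-then-denoise
    round trip at timestep t. The two callables stand in for a pretrained
    diffusion model's deterministic (DDIM-style) reverse and denoising
    computations."""
    x_t = invert_step(x, t)          # deterministic noising/inversion to step t
    x_hat = denoise_step(x_t, t)     # deterministic denoising back
    return float(np.mean((x - x_hat) ** 2))

def sedid_stat_classify(images, invert_step, denoise_step, t, threshold):
    """Statistical variant sketch: diffusion-generated images tend to
    reproduce more faithfully under the model's own dynamics, so a low
    stepwise error flags an image as generated. Direction and threshold
    would be calibrated on held-out data."""
    errors = np.array([stepwise_error(x, invert_step, denoise_step, t) for x in images])
    return errors < threshold, errors
```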
The Whole Pathological Slide Classification via Weakly Supervised Learning
results: 我们在 Camelyon16 乳腺癌数据集和 TCGA-NSCLC 肺癌数据集上进行了广泛的实验,结果表明,我们提出的框架能够有效处理癌症检测和亚型区分相关的任务,优于现有基于 MIL 的医学图像分类方法。Abstract
Due to its superior efficiency in utilizing annotations and addressing gigapixel-sized images, multiple instance learning (MIL) has shown great promise as a framework for whole slide image (WSI) classification in digital pathology diagnosis. However, existing methods tend to focus on advanced aggregators with different structures, often overlooking the intrinsic features of H\&E pathological slides. To address this limitation, we introduced two pathological priors: nuclear heterogeneity of diseased cells and spatial correlation of pathological tiles. Leveraging the former, we proposed a data augmentation method that utilizes stain separation during extractor training via a contrastive learning strategy to obtain instance-level representations. We then described the spatial relationships between the tiles using an adjacency matrix. By integrating these two views, we designed a multi-instance framework for analyzing H\&E-stained tissue images based on pathological inductive bias, encompassing feature extraction, filtering, and aggregation. Extensive experiments on the Camelyon16 breast dataset and TCGA-NSCLC Lung dataset demonstrate that our proposed framework can effectively handle tasks related to cancer detection and differentiation of subtypes, outperforming state-of-the-art medical image classification methods based on MIL. The code will be released later.
摘要
由于在利用标注和处理十亿像素级图像方面的卓越效率,多实例学习(MIL)作为数字病理诊断中全切片图像(WSI)分类的框架展现出了巨大潜力。然而,现有方法往往专注于结构各异的高级聚合器,经常忽视 H&E 病理切片的内在特征。为了解决这一局限,我们引入了两种病理先验:病变细胞的细胞核异质性和病理图块之间的空间相关性。利用前者,我们提出了一种数据增强方法,在特征提取器训练中借助染色分离,通过对比学习策略获取实例级表示;随后,我们使用邻接矩阵来描述图块之间的空间关系。通过整合这两种视角,我们设计了一个基于病理归纳偏置的多实例框架,用于分析 H&E 染色组织图像,涵盖特征提取、筛选和聚合。在 Camelyon16 乳腺癌数据集和 TCGA-NSCLC 肺癌数据集上的大量实验表明,我们提出的框架能够有效处理癌症检测和亚型区分相关的任务,优于现有基于 MIL 的医学图像分类方法。代码将在后续发布。
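A toy sketch of how the two priors could be combined in an MIL head (our own simplification, not the paper's architecture): tile features are first mixed with their spatial neighbors through a row-normalized adjacency matrix, then pooled into a slide-level prediction with gated attention.

```python
import torch
import torch.nn as nn

class SpatialAttentionMIL(nn.Module):
    """Toy MIL head: instance features are mixed with spatial neighbors via
    a row-normalized adjacency matrix, then attention-pooled into one
    slide-level prediction."""
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tiles, adj):
        # tiles: (N, dim) instance features; adj: (N, N) tile adjacency
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)
        tiles = tiles + adj @ tiles                  # mix in spatial context
        a = torch.softmax(self.attn(tiles), dim=0)   # (N, 1) attention per tile
        slide_feat = (a * tiles).sum(dim=0)          # weighted bag pooling
        return self.head(slide_feat)

feats = torch.randn(500, 256)                        # 500 tiles from one WSI
adj = (torch.rand(500, 500) < 0.01).float()          # stand-in adjacency
logits = SpatialAttentionMIL()(feats, adj)
```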
UGCANet: A Unified Global Context-Aware Transformer-based Network with Feature Alignment for Endoscopic Image Analysis
results: 提高胃肠道疾病的检出率,降低误诊率,改善患者的生存率和整体健康结果Abstract
Gastrointestinal endoscopy is a medical procedure that utilizes a flexible tube equipped with a camera and other instruments to examine the digestive tract. This minimally invasive technique allows for diagnosing and managing various gastrointestinal conditions, including inflammatory bowel disease, gastrointestinal bleeding, and colon cancer. The early detection and identification of lesions in the upper gastrointestinal tract and the identification of malignant polyps that may pose a risk of cancer development are critical components of gastrointestinal endoscopy's diagnostic and therapeutic applications. Therefore, enhancing the detection rates of gastrointestinal disorders can significantly improve a patient's prognosis by increasing the likelihood of timely medical intervention, which may prolong the patient's lifespan and improve overall health outcomes. This paper presents a novel Transformer-based deep neural network designed to perform multiple tasks simultaneously, thereby enabling accurate identification of both upper gastrointestinal tract lesions and colon polyps. Our approach proposes a unique global context-aware module and leverages the powerful MiT backbone, along with a feature alignment block, to enhance the network's representation capability. This novel design leads to a significant improvement in performance across various endoscopic diagnosis tasks. Extensive experiments demonstrate the superior performance of our method compared to other state-of-the-art approaches.
摘要
胃肠内镜检查是一种医疗手段,利用装有摄像头和其他器械的柔性软管检查消化道。这是一种微创技术,可用于诊断和治疗多种胃肠道疾病,包括炎症性肠病、胃肠道出血和结肠癌。及早发现并识别上消化道病变,以及识别可能发展为癌症的恶性息肉,是胃肠内镜诊疗应用的关键环节。因此,提高胃肠道疾病的检出率能够增加及时医疗干预的可能性,从而显著改善患者预后,延长患者寿命并提升整体健康水平。本文提出了一种基于 Transformer 的深度神经网络,可同时执行多项任务,从而准确识别上消化道病变和结肠息肉。我们的方法提出了一个独特的全局上下文感知模块,并利用强大的 MiT 骨干网络和特征对齐模块来增强网络的表示能力。这一新颖设计显著提升了多种内镜诊断任务的性能。大量实验表明,我们的方法优于其他最新方法。
results: 实验结果显示,使用该IR设计可以在标准数据查询操作中提高处理速度,比直接使用标准XML格式数据的处理速度提高了四十倍以上。Abstract
In the realm of software applications in the transportation industry, Domain-Specific Languages (DSLs) have enjoyed widespread adoption due to their ease of use and various other benefits. With the ceaseless progress in computer performance and the rapid development of large-scale models, the possibility of programming using natural language in specified applications - referred to as Application-Specific Natural Language (ASNL) - has emerged. ASNL exhibits greater flexibility and freedom, which, in turn, leads to an increase in computational complexity for parsing and a decrease in processing performance. To tackle this issue, our paper advances a design for an intermediate representation (IR) that caters to ASNL and can uniformly process transportation data into graph data format, improving data processing performance. Experimental comparisons reveal that in standard data query operations, our proposed IR design can achieve a speed improvement of over forty times compared to direct usage of standard XML format data.
摘要
在交通行业的软件应用领域,领域特定语言(DSL)因其易用性及其他诸多优点而得到广泛采用。随着计算机性能的不断提升和大规模模型的快速发展,在特定应用中使用自然语言编程(即面向特定应用的自然语言,ASNL)的可能性开始显现。ASNL 具有更大的灵活性和自由度,但这也导致解析的计算复杂度增加、处理性能下降。为解决这一问题,我们的论文提出了一种面向 ASNL 的中间表示(IR)设计,可以将交通数据统一处理为图数据格式,从而提高数据处理性能。实验比较表明,在标准数据查询操作中,我们提出的 IR 设计与直接使用标准 XML 格式数据相比,可实现四十倍以上的速度提升。
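The paper's IR is not spelled out here, so the sketch below only illustrates the general parse-once-into-a-graph pattern behind the reported speedup: the XML is traversed a single time into an adjacency-list structure, after which neighbor queries are dictionary lookups instead of repeated tree scans. The schema and element names are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<network>
  <road id="r1" from="A" to="B" length="120"/>
  <road id="r2" from="B" to="C" length="80"/>
  <road id="r3" from="A" to="C" length="230"/>
</network>"""

def to_graph_ir(xml_text):
    """Parse the XML once into an adjacency-list IR; later queries become
    dictionary lookups instead of repeated tree scans."""
    graph = defaultdict(list)
    for road in ET.fromstring(xml_text).iter("road"):
        graph[road.get("from")].append((road.get("to"), float(road.get("length"))))
    return graph

ir = to_graph_ir(XML)
print(ir["A"])   # [('B', 120.0), ('C', 230.0)] -- constant-time neighbor lookup
```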
A Causal Framework to Unify Common Domain Generalization Approaches
results: 本文为理解领域泛化提供了新的视角,并阐明了各种 DG 方法之间的关系及其优缺点。这些研究可以帮助研究人员更好地理解领域泛化的基本原理,并开发更有效的方法来解决这一重要问题。Abstract
Domain generalization (DG) is about learning models that generalize well to new domains that are related to, but different from, the training domain(s). It is a fundamental problem in machine learning and has attracted much attention in recent years. A large number of approaches have been proposed. Different approaches are motivated from different perspectives, making it difficult to gain an overall understanding of the area. In this paper, we propose a causal framework for domain generalization and present an understanding of common DG approaches in the framework. Our work sheds new light on the following questions: (1) What are the key ideas behind each DG method? (2) Why is it expected to improve generalization to new domains theoretically? (3) How are different DG methods related to each other and what are relative advantages and limitations? By providing a unified perspective on DG, we hope to help researchers better understand the underlying principles and develop more effective approaches for this critical problem in machine learning.
摘要
领域泛化(DG)研究的是如何让模型在与训练域相关但不同的新域上依然泛化良好。这是机器学习的基本问题,近年来受到广泛关注,大量方法相继被提出。不同方法出于不同动机,使人难以对该领域形成整体认识。在本文中,我们提出了一个用于领域泛化的因果框架,并在该框架下解读了常见的 DG 方法。我们的工作为以下问题提供了新的启示:1. 各种领域泛化方法的关键思想是什么?2. 为什么理论上可以预期这些方法能提高对新域的泛化性能?3. 各种 DG 方法之间有何关联,各自的优势与局限是什么?通过提供统一的视角,我们希望帮助研究人员更好地理解领域泛化的基本原理,并为这一机器学习的关键问题开发更有效的方法。
TinyMetaFed: Efficient Federated Meta-Learning for TinyML
results: 研究结果显示,TinyMetaFed 可以在三个 TinyML 应用中显著降低能耗和通信负载,加速收敛并稳定训练过程。Abstract
The field of Tiny Machine Learning (TinyML) has made substantial advancements in democratizing machine learning on low-footprint devices, such as microcontrollers. The prevalence of these miniature devices raises the question of whether aggregating their knowledge can benefit TinyML applications. Federated meta-learning is a promising answer to this question, as it addresses the scarcity of labeled data and heterogeneous data distribution across devices in the real world. However, deploying TinyML hardware faces unique resource constraints, making existing methods impractical due to energy, privacy, and communication limitations. We introduce TinyMetaFed, a model-agnostic meta-learning framework suitable for TinyML. TinyMetaFed facilitates collaborative training of a neural network initialization that can be quickly fine-tuned on new devices. It offers communication savings and privacy protection through partial local reconstruction and Top-P% selective communication, computational efficiency via online learning, and robustness to client heterogeneity through few-shot learning. The evaluations on three TinyML use cases demonstrate that TinyMetaFed can significantly reduce energy consumption and communication overhead, accelerate convergence, and stabilize the training process.
摘要
小型机器学习(TinyML)领域在低资源设备(如微控制器)上实现机器学习民主化方面取得了重要进展。随着这类微型设备的普及,人们开始思考聚合它们的知识能否使 TinyML 应用受益。联邦元学习是一个有前景的答案,因为它可以解决现实世界中标注数据稀缺和设备间数据分布异构的问题。然而,TinyML 硬件的部署面临独特的资源限制,能耗、隐私和通信方面的约束使现有方法难以实用。我们介绍了 TinyMetaFed,一个适用于 TinyML 的模型无关元学习框架。TinyMetaFed 支持协同训练一个可在新设备上快速微调的神经网络初始化。它通过部分本地重建和 Top-P% 选择性通信实现通信节省和隐私保护,通过在线学习实现计算效率,并通过少样本学习增强对客户端异构性的鲁棒性。在三个 TinyML 应用场景上的评估表明,TinyMetaFed 能显著降低能耗和通信开销,加速收敛,并稳定训练过程。
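Top-P% selective communication can be sketched generically as magnitude-based sparsification of the model update (the paper's exact selection rule may differ): only the largest fraction of parameter changes is shipped, and the receiver reconstructs the rest locally from its old copy.

```python
import numpy as np

def top_p_delta(new_params, old_params, p=0.1):
    """Keep only the top-p fraction of parameter changes (by magnitude) for
    upload; everything else is reconstructed locally from the old values."""
    delta = new_params - old_params
    k = max(1, int(p * delta.size))
    idx = np.argpartition(np.abs(delta), -k)[-k:]    # indices of largest changes
    return idx, delta[idx]

def apply_delta(old_params, idx, values):
    out = old_params.copy()                          # partial local reconstruction
    out[idx] += values
    return out

old = np.zeros(10_000)
new = old + np.random.default_rng(1).normal(0, 1, old.shape)
idx, vals = top_p_delta(new, old, p=0.05)            # ship ~5% of the update
approx = apply_delta(old, idx, vals)
```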
Negated Complementary Commonsense using Large Language Models
for: 本研究旨在解决大语言模型在超出常规问题上的表现问题。
methods: 研究提出了一种与模型无关的方法,以改善否定补充情景下的性能。
results: 该方法比 GPT-3 的少样本生成高出 11 个百分点以上,更重要的是,强调了研究大语言模型在否定补充问题上的回答的重要性。Abstract
Larger language models, such as GPT-3, have shown to be excellent in many tasks. However, we demonstrate that out-of-ordinary questions can throw the model off guard. This work focuses on finding answers to negated complementary questions in commonsense scenarios. We illustrate how such questions adversely affect the model responses. We propose a model-agnostic methodology to improve the performance in negated complementary scenarios. Our method outperforms few-shot generation from GPT-3 (by more than 11 points) and, more importantly, highlights the significance of studying the response of large language models in negated complementary questions. The code, data, and experiments are available under: https://github.com/navidre/negated_complementary_commonsense.
摘要
大型语言模型,如GPT-3,在许多任务中表现出色。然而,我们发现超出常规的问题会让模型措手不及。这项工作关注在常识情景下为否定补充问题寻找答案。我们展示了此类问题如何对模型的回答产生负面影响,并提出了一种与模型无关的方法来改善否定补充情景下的性能。我们的方法比 GPT-3 的少样本生成高出 11 个百分点以上,更重要的是,它凸显了研究大语言模型在否定补充问题上的回答的重要性。代码、数据和实验可以在以下链接中找到:https://github.com/navidre/negated_complementary_commonsense。
results: 本研究通过实验和案例研究表明,序数数据科学可以与其他机器学习和知识表示方法相互促进,并应用于多个学科领域,为探讨和解决复杂的数据问题提供新的视角和工具。Abstract
Order is one of the main instruments to measure the relationship between objects in (empirical) data. However, compared to methods that use numerical properties of objects, the number of ordinal methods developed is rather small. One reason for this is the limited availability of computational resources in the last century that would have been required for ordinal computations. Another reason -- particularly important for this line of research -- is that order-based methods are often seen as too mathematically rigorous for applying them to real-world data. In this paper, we will therefore discuss different means for measuring and 'calculating' with ordinal structures -- a specific class of directed graphs -- and show how to infer knowledge from them. Our aim is to establish Ordinal Data Science as a fundamentally new research agenda. Besides cross-fertilization with other cornerstone machine learning and knowledge representation methods, a broad range of disciplines will benefit from this endeavor, including, psychology, sociology, economics, web science, knowledge engineering, scientometrics.
摘要
“序(order)是衡量(经验)数据中对象之间关系的主要工具之一。然而,与利用对象数值属性的方法相比,已发展出的序数方法数量相对较少。原因之一是上个世纪可用的计算资源有限,难以支撑序数计算;另一个对本研究方向尤为重要的原因是,基于序的方法常被认为在数学上过于严苛,难以应用于真实世界数据。因此,在本文中,我们将讨论度量和“计算”序数结构(一类特殊的有向图)的不同手段,并展示如何从中推断知识。我们的目标是将序数数据科学(Ordinal Data Science)确立为一个全新的研究议程。除了与机器学习和知识表示等基石方法相互促进之外,心理学、社会学、经济学、网络科学、知识工程、科学计量学等众多学科也将从中受益。”
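One elementary building block of "calculating" with ordinal structures is verifying that an empirical relation actually is a partial order. A brute-force check of reflexivity, antisymmetry, and transitivity (illustrative only; practical ordinal methods would work on far larger structures):

```python
from itertools import product

def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry, and transitivity of a relation
    given as a predicate leq(a, b)."""
    elements = list(elements)
    for a in elements:
        if not leq(a, a):
            return False
    for a, b in product(elements, repeat=2):
        if a != b and leq(a, b) and leq(b, a):
            return False
    for a, b, c in product(elements, repeat=3):
        if leq(a, b) and leq(b, c) and not leq(a, c):
            return False
    return True

# divisibility orders {1,...,8}, but is not a total order (e.g. 3 vs. 5)
print(is_partial_order(range(1, 9), lambda a, b: b % a == 0))   # True
```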
Layered controller synthesis for dynamic multi-agent systems
results: 使用 SWA-SMT 的解作为初始训练数据集,通过强化学习训练神经网络控制策略,并证明了初始数据集的重要性。Abstract
In this paper we present a layered approach for the multi-agent control problem, decomposed into three stages, each building upon the results of the previous one. First, a high-level plan for a coarse abstraction of the system is computed, relying on parametric timed automata augmented with stopwatches, as they allow to efficiently model simplified dynamics of such systems. The second stage, based on an SMT formulation, mainly handles the combinatorial aspects of the problem and refines the high-level plan into a more dynamically accurate solution. These stages are collectively referred to as the SWA-SMT solver. They are correct by construction but lack a crucial feature: they cannot be executed in real time. To overcome this, we use SWA-SMT solutions as the initial training dataset for our last stage, which aims at obtaining a neural network control policy. We use reinforcement learning to train the policy, and show that the initial dataset is crucial for the overall success of the method.
摘要
在这篇论文中,我们提出了一种分层方法来解决多智能体控制问题,该方法分为三个阶段,每一阶段都建立在前一阶段的结果之上。第一阶段为系统的粗粒度抽象计算高层规划,利用带秒表的参数化时间自动机来高效地建模此类系统的简化动态。第二阶段基于 SMT 形式化,主要处理问题的组合方面,将高层规划细化为动态上更精确的解。前两个阶段合称为 SWA-SMT 求解器。它们在构造上是正确的,但缺少一个关键特性:无法实时执行。为了解决这个问题,我们将 SWA-SMT 的解作为最后一个阶段的初始训练数据集,利用强化学习训练神经网络控制策略,并证明了初始数据集对该方法的整体成功具有关键作用。
Extended Graph Assessment Metrics for Graph Neural Networks
results: 研究发现,这些 metric 与模型在不同的医学人口图上的性能相关,并且在不同的学习设置下也有相似的结果。Abstract
When re-structuring patient cohorts into so-called population graphs, initially independent data points can be incorporated into one interconnected graph structure. This population graph can then be used for medical downstream tasks using graph neural networks (GNNs). The construction of a suitable graph structure is a challenging step in the learning pipeline that can have severe impact on model performance. To this end, different graph assessment metrics have been introduced to evaluate graph structures. However, these metrics are limited to classification tasks and discrete adjacency matrices, only covering a small subset of real-world applications. In this work, we introduce extended graph assessment metrics (GAMs) for regression tasks and continuous adjacency matrices. We focus on two GAMs in specific: \textit{homophily} and \textit{cross-class neighbourhood similarity} (CCNS). We extend the notion of GAMs to more than one hop, define homophily for regression tasks, as well as continuous adjacency matrices, and propose a light-weight CCNS distance for discrete and continuous adjacency matrices. We show the correlation of these metrics with model performance on different medical population graphs and under different learning settings.
摘要
当将患者队列重构为所谓的人口图时,原本相互独立的数据点可以被整合到一个相互连接的图结构中。这个人口图随后可用于基于图神经网络(GNN)的医疗下游任务。然而,构建合适的图结构是学习流程中一个具有挑战性的步骤,可能对模型性能产生严重影响。为此,人们引入了多种图评估指标来评价图结构,但这些指标仅适用于分类任务和离散邻接矩阵,只覆盖了现实应用中的一小部分。在本工作中,我们提出了面向回归任务和连续邻接矩阵的扩展图评估指标(GAMs)。我们重点关注其中两个指标:同质性(homophily)和跨类邻域相似度(CCNS)。我们将 GAMs 的概念扩展到多跳,为回归任务和连续邻接矩阵定义了同质性,并提出了一种适用于离散和连续邻接矩阵的轻量级 CCNS 距离。我们展示了这些指标与模型在不同医学人口图和不同学习设置下的性能之间的相关性。
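The paper defines homophily for regression and continuous adjacency matrices; the exact definition is not reproduced here, so the sketch below shows one plausible reading as an assumption: edge-weighted agreement of continuous labels, where 1 means neighbors carry identical labels.

```python
import numpy as np

def regression_homophily(adj, y):
    """Edge-weighted agreement of continuous labels across a (possibly
    continuous) adjacency matrix: 1 means connected nodes share identical
    labels, values near 0 mean neighbor labels differ maximally."""
    diff = np.abs(y[:, None] - y[None, :])           # pairwise label distances
    diff = diff / (diff.max() + 1e-12)               # scale to [0, 1]
    return 1.0 - float((adj * diff).sum() / (adj.sum() + 1e-12))

rng = np.random.default_rng(0)
y = rng.normal(size=100)
adj = np.exp(-np.abs(y[:, None] - y[None, :]))       # similarity-built graph
np.fill_diagonal(adj, 0.0)
print(regression_homophily(adj, y))                  # close to 1: labels align with edges
```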
Learning Multiple Coordinated Agents under Directed Acyclic Graph Constraints
results: 在四个 DAG 环境中(包括英特尔一座高产量封装测试工厂的真实调度环境),该方法的表现优于其他非 DAG 方法。Abstract
This paper proposes a novel multi-agent reinforcement learning (MARL) method to learn multiple coordinated agents under directed acyclic graph (DAG) constraints. Unlike existing MARL approaches, our method explicitly exploits the DAG structure between agents to achieve more effective learning performance. Theoretically, we propose a novel surrogate value function based on a MARL model with synthetic rewards (MARLM-SR) and prove that it serves as a lower bound of the optimal value function. Computationally, we propose a practical training algorithm that exploits the new notions of a leader agent and a reward generator and distributor agent to guide the decomposed follower agents to better explore the parameter space in environments with DAG constraints. Empirically, we exploit four DAG environments, including a real-world scheduling environment from one of Intel's high-volume packaging and test factories, to benchmark our method and show it outperforms the other non-DAG approaches.
摘要
这篇论文提出了一种新的多智能体强化学习(MARL)方法,用于在有向无环图(DAG)约束下学习多个协调的智能体。与现有的 MARL 方法不同,我们的方法显式利用智能体之间的 DAG 结构来获得更有效的学习性能。理论上,我们基于带合成奖励的 MARL 模型(MARLM-SR)提出了一种新的代理价值函数,并证明其是最优价值函数的下界。计算上,我们提出了一种实用的训练算法,引入首领智能体以及奖励生成与分配智能体的新概念,引导分解后的跟随者智能体在 DAG 约束环境中更好地探索参数空间。实验上,我们在四个 DAG 环境(包括英特尔一座高产量封装测试工厂的真实调度任务)中对方法进行了基准测试,结果表明我们的方法优于其他非 DAG 方法。
Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
paper_authors: Jinhua Si, Fang He, Xi Lin, Xindi Tang
for: 提高城市群的交通效率和服务质量
methods: 使用多智能体强化学习模型和自适应大邻域搜索算法
results: 提高日均系统利润和订单完成率Abstract
The integrated development of city clusters has given rise to an increasing demand for intercity travel. Intercity ride-pooling service exhibits considerable potential in upgrading traditional intercity bus services by implementing demand-responsive enhancements. Nevertheless, its online operations suffer the inherent complexities due to the coupling of vehicle resource allocation among cities and pooled-ride vehicle routing. To tackle these challenges, this study proposes a two-level framework designed to facilitate online fleet management. Specifically, a novel multi-agent feudal reinforcement learning model is proposed at the upper level of the framework to cooperatively assign idle vehicles to different intercity lines, while the lower level updates the routes of vehicles using an adaptive large neighborhood search heuristic. Numerical studies based on the realistic dataset of Xiamen and its surrounding cities in China show that the proposed framework effectively mitigates the supply and demand imbalances, and achieves significant improvement in both the average daily system profit and order fulfillment ratio.
摘要
城市群的一体化发展带来了城际出行需求的增长。城际合乘服务通过引入需求响应式改进,在升级传统城际巴士服务方面具有可观的潜力。然而,其在线运营因城市间车辆资源分配与合乘车辆路径规划的耦合而面临固有的复杂性。为了应对这些挑战,本研究提出了一个用于在线车队管理的两级框架。具体而言,框架上层提出了一种新颖的多智能体封建强化学习模型,用于协同地将空闲车辆分配到不同的城际线路;下层则使用自适应大邻域搜索启发式算法更新车辆路径。基于中国厦门及其周边城市真实数据集的数值研究表明,所提框架能有效缓解供需失衡,并在日均系统利润和订单完成率两方面均取得显著提升。
MPR-Net:Multi-Scale Pattern Reproduction Guided Universality Time Series Interpretable Forecasting
results: 在多个真实数据集上进行了详细的实验,并取得了领先的预测性能和泛化性能,同时还保持了良好的可解性。Abstract
Time series forecasting has received wide interest from existing research due to its broad applications and inherent challenges. The research challenge lies in identifying effective patterns in historical series and applying them to future forecasting. Advanced models based on point-wise connected MLP and Transformer architectures have strong fitting power, but their quadratic computational complexity limits practicality. Additionally, those structures inherently disrupt the temporal order, reducing information utilization and making the forecasting process uninterpretable. To solve these problems, this paper proposes a forecasting model, MPR-Net. It first adaptively decomposes multi-scale historical series patterns using a convolution operation, then constructs a pattern extension forecasting method based on the prior knowledge of pattern reproduction, and finally reconstructs future patterns into future series using a deconvolution operation. By leveraging the temporal dependencies present in the time series, MPR-Net not only achieves linear time complexity, but also makes the forecasting process interpretable. By carrying out sufficient experiments on more than ten real datasets covering both short- and long-term forecasting tasks, MPR-Net achieves state-of-the-art forecasting performance, as well as good generalization and robustness.
摘要
时间序列预测因其广泛的应用和内在的挑战性而受到现有研究的广泛关注。研究的挑战在于识别历史序列中的有效模式,并将其应用于未来预测。基于逐点连接的 MLP 和 Transformer 架构的先进模型具有强大的拟合能力,但其二次计算复杂度限制了实用性。此外,这些结构天然地破坏了时间顺序,降低了信息利用率,并使预测过程不可解释。为了解决这些问题,本文提出了一种预测模型 MPR-Net。它首先利用卷积操作自适应地分解多尺度历史序列模式,然后基于模式复现的先验知识构建模式扩展预测方法,最后利用反卷积操作将未来模式重构为未来序列。通过利用时间序列中的时间依赖关系,MPR-Net 不仅实现了线性时间复杂度,还使预测过程具有可解释性。通过在十多个真实数据集上对长短期预测任务进行充分实验,MPR-Net 取得了最先进的预测性能,并具有良好的泛化性和鲁棒性。
GRAN is superior to GraphRNN: node orderings, kernel- and graph embeddings-based metrics for graph generators
results: 研究发现,GRAN 模型优于 GraphRNN;此外,对 GraphRNN 采用深度优先搜索节点排序的改进在小型图上也很有效。Abstract
A wide variety of generative models for graphs have been proposed. They are used in drug discovery, road networks, neural architecture search, and program synthesis. Generating graphs has theoretical challenges, such as isomorphic representations -- evaluating how well a generative model performs is difficult. Which model to choose depending on the application domain? We extensively study kernel-based metrics on distributions of graph invariants and manifold-based and kernel-based metrics in graph embedding space. Manifold-based metrics outperform kernel-based metrics in embedding space. We use these metrics to compare GraphRNN and GRAN, two well-known generative models for graphs, and unveil the influence of node orderings. It shows the superiority of GRAN over GraphRNN - further, our proposed adaptation of GraphRNN with a depth-first search ordering is effective for small-sized graphs. A guideline on good practices regarding dataset selection and node feature initialization is provided. Our work is accompanied by open-source code and reproducible experiments.
摘要
各种图生成模型已被提出,应用于药物发现、道路网络、神经架构搜索和程序综合等领域。图的生成存在理论上的挑战,例如同构表示,使得评估生成模型的表现十分困难。针对不同应用领域应选择哪种模型?我们广泛研究了定义在图不变量分布上的基于核的度量,以及图嵌入空间中基于流形和基于核的度量。在嵌入空间中,基于流形的度量优于基于核的度量。我们使用这些度量比较了 GraphRNN 和 GRAN 这两种知名的图生成模型,并揭示了节点排序的影响。结果表明 GRAN 优于 GraphRNN;此外,我们提出的采用深度优先搜索排序的 GraphRNN 改进版本在小型图上同样有效。我们还提供了关于数据集选择和节点特征初始化的最佳实践指南。我们的工作附有开源代码和可复现实验。
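A kernel-based metric on a distribution of graph invariants can be sketched as an MMD between degree samples of two graph sets (a simplification: evaluation protocols typically compare per-graph degree histograms rather than pooled degrees, and the kernel bandwidth matters):

```python
import numpy as np
import networkx as nx

def gaussian_mmd(x, y, sigma=1.0):
    """Squared MMD between two 1-D samples under an RBF kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def degrees(graphs):
    return np.array([d for g in graphs for _, d in g.degree()], dtype=float)

real = [nx.barabasi_albert_graph(50, 2, seed=s) for s in range(20)]
fake = [nx.erdos_renyi_graph(50, 0.08, seed=s) for s in range(20)]
print(gaussian_mmd(degrees(real), degrees(fake)))    # larger -> distributions differ more
```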
S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
results: 在 KTH 人体动作和 Moving-MNIST 任务上进行了广泛的实验研究,表明我们的模型与当前最佳视频预测技术相比,在定量和定性两方面均有出色表现,即使模型规模相对较小。此外,我们还提出了一种新的训练方法,可以同时优化 HR-VQVAE 和 ST-PixelCNN 的参数。Abstract
We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the ST-PixelCNN's ability at handling spatiotemporal information, S-HR-VQVAE can better deal with chief challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters.
摘要
我们针对视频预测任务提出了一种新模型,它结合了:(i)我们最近提出的分层残差向量量化变分自编码器(HR-VQVAE),以及(ii)一种新的时空 PixelCNN(ST-PixelCNN)。我们将这种方法称为顺序分层残差学习向量量化变分自编码器(S-HR-VQVAE)。通过利用 HR-VQVAE 以简约表示建模静止图像的内在能力,并结合 ST-PixelCNN 处理时空信息的能力,S-HR-VQVAE 能够更好地应对视频预测中的主要挑战,包括学习时空信息、处理高维数据、对抗模糊预测以及隐式建模物理特性。我们在 KTH Human Action 和 Moving-MNIST 任务上进行了广泛的实验,结果表明,尽管模型规模小得多,我们的模型在定量和定性评估中均优于其他顶尖视频预测技术。最后,我们提出了一种新的训练方法,可以同时估计 HR-VQVAE 和 ST-PixelCNN 的参数,进一步提升 S-HR-VQVAE。
Short Boolean Formulas as Explanations in Practice
results: 这篇论文提出了一些新的定量界限,并在实际应用中使用答案集编程(Answer Set Programming)来计算解释公式。结果表明,这些解释公式可以达到与其他方法相近的准确率;但由于过拟合,它们并不一定是理想的解释。为避免过拟合,作者使用交叉验证来确定合适的解释长度。Abstract
We investigate explainability via short Boolean formulas in the data model based on unary relations. As an explanation of length k, we take a Boolean formula of length k that minimizes the error with respect to the target attribute to be explained. We first provide novel quantitative bounds for the expected error in this scenario. We then also demonstrate how the setting works in practice by studying three concrete data sets. In each case, we calculate explanation formulas of different lengths using an encoding in Answer Set Programming. The most accurate formulas we obtain achieve errors similar to other methods on the same data sets. However, due to overfitting, these formulas are not necessarily ideal explanations, so we use cross validation to identify a suitable length for explanations. By limiting to shorter formulas, we obtain explanations that avoid overfitting but are still reasonably accurate and also, importantly, human interpretable.
摘要
我们研究在基于一元关系的数据模型中,通过短布尔公式实现可解释性。作为长度为 k 的解释,我们选取一个长度为 k 的布尔公式,使其相对于待解释的目标属性的误差最小。我们首先为该情形下的期望误差给出了新的定量界限。随后,我们通过研究三个具体的数据集来展示这一设置的实际效果。在每个案例中,我们使用答案集编程(Answer Set Programming)编码来计算不同长度的解释公式。我们得到的最准确公式的误差与其他方法在相同数据集上的误差相近。然而,由于过拟合,这些公式并不一定是理想的解释,因此我们使用交叉验证来确定合适的解释长度。通过限制使用较短的公式,我们得到的解释既避免了过拟合,又保持了相当的准确性,并且重要的是,具有人类可解释性。
IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation
results: 研究发现传统的 KGE 模型无法捕捉知识图中的语义,因此提出了一个新的子图推理任务,以促进机器学习模型对语义的理解。Abstract
Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations. A key task in the literature is predicting missing links between entities. However, Knowledge Graphs are not just sets of links but also have semantics underlying their structure. Semantics is crucial in several downstream tasks, such as query answering or reasoning. We introduce the subgraph inference task, where a model has to generate likely and semantically valid subgraphs. We propose IntelliGraphs, a set of five new Knowledge Graph datasets. The IntelliGraphs datasets contain subgraphs with semantics expressed in logical rules for evaluating subgraph inference. We also present the dataset generator that produced the synthetic datasets. We designed four novel baseline models, which include three models based on traditional KGEs. We evaluate their expressiveness and show that these models cannot capture the semantics. We believe this benchmark will encourage the development of machine learning models that emphasize semantic understanding.
摘要
知识图嵌入(KGE)模型用于学习实体和关系的连续表示。文献中的一个关键任务是预测实体之间缺失的链接。但知识图不仅仅是链接的集合,其结构背后还蕴含着语义。语义对查询回答、推理等下游任务至关重要。我们引入了子图推理任务,要求模型生成既可能又在语义上有效的子图。我们提出了 IntelliGraphs,这是一组五个新的知识图数据集,其中的子图带有以逻辑规则表达的语义,用于评估子图推理。我们还介绍了生成这些合成数据集的数据生成器。我们设计了四种新的基线模型,其中三种基于传统的 KGE。我们评估了这些模型的表达能力,并证明它们无法捕捉语义。我们相信这一基准将鼓励开发注重语义理解的机器学习模型。
Reinforcement Learning for Syntax-Guided Synthesis
for: 这个论文目的是提出一种基于Monte-Carlo Tree Search(MCTS)的自动代码生成算法,用于解决Syntax-Guided Synthesis(SyGuS)问题。
methods: 这种算法使用强化学习引导的搜索,将学习到的策略和价值函数与树的置信上界(UCT)相结合,以平衡探索和利用。
results: 在训练集和测试集上,该算法将合成性能相比基线枚举器提高了 26 个百分点以上。此外,该算法还能自动生成 SyGuS 训练数据;与 CVC5 等现有工具相比,它在训练集上表现更优,在测试集上表现相当。Abstract
Program synthesis is the task of automatically generating code based on a specification. In Syntax-Guided Synthesis(SyGuS) this specification is a combination of a syntactic template and a logical formula, and any generated code is proven to satisfy both. Techniques like SyGuS are critical to guaranteeing correct synthesis results. Despite the proliferation of machine learning in other types of program synthesis, state-of-the-art techniques in SyGuS are still driven by automated reasoning tools and simple enumeration. We hypothesize this is for two reasons: first the complexity of the search problem, and second the relatively small data sets available. In this work, we tackle these challenges by framing general SyGuS problems as a tree-search, and present a reinforcement learning guided synthesis algorithm for SyGuS based on Monte-Carlo Tree Search (MCTS). Our algorithm incorporates learned policy and value functions combined with the upper confidence bound for trees to balance exploration and exploitation. We incorporate this search procedure in a reinforcement learning setup in order to iteratively improve our policy and value estimators which are based on boosted tree models. To address the scarcity of training data, we present a method for automatically generating training data for SyGuS based on \emph{anti-unification} of existing first-order satisfiability problems, which we use to train our MCTS policy. We implement and evaluate this setup and demonstrate that learned policy and value improve the synthesis performance over a baseline enumerator by over $26$ percentage points in the training and testing sets. With these results our tool outperforms state-of-the-art-tools such as CVC5 on the training set and performs comparably on the testing set. We make our data set publicly available, enabling further application of machine learning methods to the SyGuS problem.
摘要
程序合成是根据规约自动生成代码的任务。在语法引导合成(SyGuS)中,规约由一个语法模板和一个逻辑公式组成,任何生成的代码都被证明同时满足两者。SyGuS 这类技术对于保证合成结果的正确性至关重要。尽管机器学习在其他类型的程序合成中应用广泛,SyGuS 中最先进的技术仍然依赖自动推理工具和简单的枚举。我们推测这有两个原因:一是搜索问题的复杂性,二是可用的数据集相对较小。在本工作中,我们将一般的 SyGuS 问题刻画为树搜索,并提出了一种基于蒙特卡洛树搜索(MCTS)的强化学习引导合成算法。我们的算法将学习到的策略和价值函数与树的置信上界(UCT)相结合,以平衡探索和利用。我们将该搜索过程纳入强化学习框架,迭代改进基于提升树模型的策略和价值估计器。为了解决训练数据的匮乏,我们提出了一种基于现有一阶可满足性问题的反合一(anti-unification)自动生成 SyGuS 训练数据的方法,并用其训练我们的 MCTS 策略。我们实现并评估了这一框架,发现学习到的策略和价值函数使合成性能在训练集和测试集上比基线枚举器提高了 26 个百分点以上。我们的工具在训练集上优于 CVC5 等最先进的工具,在测试集上表现相当。我们公开了数据集,以便机器学习方法在 SyGuS 问题上的进一步应用。
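The exploration-exploitation balance mentioned above comes from the UCT rule at each tree node. Below is plain UCT for reference; the paper additionally mixes in learned policy and value estimates, roughly in the spirit of PUCT, which this sketch omits:

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing the UCT score
    value/visits + c * sqrt(ln(total_visits) / visits),
    balancing exploitation against exploration. Unvisited children win."""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")
        return ch["value"] / ch["visits"] + c * math.sqrt(math.log(total) / ch["visits"])
    return max(children, key=score)

children = [{"value": 3.0, "visits": 5}, {"value": 1.0, "visits": 1}, {"value": 0.0, "visits": 0}]
print(uct_select(children))   # the unvisited child is explored first
```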
Towards Ubiquitous Semantic Metaverse: Challenges, Approaches, and Opportunities
methods: 论文详细介绍了四种基本系统组件的技术,即人工智能(AI)、时空数据表示(STDR)、语义物联网(SIoT)和语义增强数字孪生(SDT),以及它们在泛在语义元宇宙中的应用。
results: 论文指出了构建未来泛在语义元宇宙所面临的一系列挑战,包括可扩展性、互操作性、隐私与安全、性能评估与标准化,以及伦理考量和负责任的 AI。Abstract
In recent years, ubiquitous semantic Metaverse has been studied to revolutionize immersive cyber-virtual experiences for augmented reality (AR) and virtual reality (VR) users, which leverages advanced semantic understanding and representation to enable seamless, context-aware interactions within mixed-reality environments. This survey focuses on the intelligence and spatio-temporal characteristics of four fundamental system components in ubiquitous semantic Metaverse, i.e., artificial intelligence (AI), spatio-temporal data representation (STDR), semantic Internet of Things (SIoT), and semantic-enhanced digital twin (SDT). We thoroughly survey the representative techniques of the four fundamental system components that enable intelligent, personalized, and context-aware interactions with typical use cases of the ubiquitous semantic Metaverse, such as remote education, work and collaboration, entertainment and socialization, healthcare, and e-commerce marketing. Furthermore, we outline the opportunities for constructing the future ubiquitous semantic Metaverse, including scalability and interoperability, privacy and security, performance measurement and standardization, as well as ethical considerations and responsible AI. Addressing those challenges is important for creating a robust, secure, and ethically sound system environment that offers engaging immersive experiences for the users and AR/VR applications.
Machine Learning-Assisted Pattern Recognition Algorithms for Estimating Ultimate Tensile Strength in Fused Deposition Modeled Polylactic Acid Specimens
paper_authors: Akshansh Mishra, Vijaykumar S Jatti
for: 这个研究旨在使用监督学习算法来估计熔融沉积成型(FDM)制造的聚乳酸(PLA)样品的极限抗拉强度(UTS)。
methods: 研究使用了四种监督分类算法,即 Logistic Classification、Gradient Boosting Classification、Decision Tree 和 K-Nearest Neighbor,以估计样品的 UTS。
results: 研究发现,Decision Tree 和 K-Nearest Neighbor 算法均达到 0.71 的 F1 分数,但 KNN 算法的 AUC 分数为 0.79,表明 KNN 算法在该数据集的分类中表现更好,因此是这项研究中的最佳选择。Abstract
In this study, we investigate the application of supervised machine learning algorithms for estimating the Ultimate Tensile Strength (UTS) of Polylactic Acid (PLA) specimens fabricated using the Fused Deposition Modeling (FDM) process. A total of 31 PLA specimens were prepared, with Infill Percentage, Layer Height, Print Speed, and Extrusion Temperature serving as input parameters. The primary objective was to assess the accuracy and effectiveness of four distinct supervised classification algorithms, namely Logistic Classification, Gradient Boosting Classification, Decision Tree, and K-Nearest Neighbor, in predicting the UTS of the specimens. The results revealed that while the Decision Tree and K-Nearest Neighbor algorithms both achieved an F1 score of 0.71, the KNN algorithm exhibited a higher Area Under the Curve (AUC) score of 0.79, outperforming the other algorithms. This demonstrates the superior ability of the KNN algorithm in differentiating between the two classes of ultimate tensile strength within the dataset, rendering it the most favorable choice for classification in the context of this research. This study represents the first attempt to estimate the UTS of PLA specimens using machine learning-based classification algorithms, and the findings offer valuable insights into the potential of these techniques in improving the performance and accuracy of predictive models in the domain of additive manufacturing.
摘要
在本研究中,我们探讨了使用监督机器学习算法来估计采用熔融沉积成型(FDM)工艺制造的聚乳酸(PLA)样品的极限抗拉强度(UTS)。共制备了 31 个 PLA 样品,以填充率、层高、打印速度和挤出温度作为输入参数。研究的主要目标是评估四种不同的监督分类算法(Logistic Classification、Gradient Boosting Classification、Decision Tree 和 K-Nearest Neighbor)在预测样品 UTS 方面的准确性和有效性。结果表明,Decision Tree 和 K-Nearest Neighbor 算法均达到 0.71 的 F1 分数,而 KNN 算法的 AUC 分数为 0.79,优于其他算法。这表明 KNN 算法在区分数据集中两类极限抗拉强度方面能力更强,因此是本研究分类任务的最佳选择。这是首次尝试使用基于机器学习的分类算法估计 PLA 样品的 UTS,研究结果为此类技术在增材制造领域提升预测模型性能和准确性的潜力提供了有价值的见解。
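The study's pipeline can be reproduced in outline with scikit-learn. The data below is synthetic and only mimics the four input parameters (the actual 31 specimens are not available here), so the printed scores will not match the reported 0.71 F1 / 0.79 AUC:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(10, 90, 200),     # infill percentage
    rng.uniform(0.1, 0.3, 200),   # layer height (mm)
    rng.uniform(30, 90, 200),     # print speed (mm/s)
    rng.uniform(190, 220, 200),   # extrusion temperature (C)
])
y = (X[:, 0] + 200 * X[:, 1] + rng.normal(0, 10, 200) > 90).astype(int)  # toy high/low UTS

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for clf in (KNeighborsClassifier(5), DecisionTreeClassifier(random_state=0)):
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    print(type(clf).__name__,
          "F1:", f1_score(y_te, clf.predict(X_te)),
          "AUC:", roc_auc_score(y_te, proba))
```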
Explainable Artificial Intelligence driven mask design for self-supervised seismic denoising
for: 提高地震数据的精度和可靠性,降低相干噪声的影响
methods: 使用自监督卷积神经网络,并通过分析雅可比矩阵自动找出最有效的噪声抑制掩模
results: 在合成数据和真实地震数据上,提出了一种完全自动的噪声抑制方法,无需干净的训练标签或先验知识,可以高效地抑制逐道噪声和有色噪声Abstract
The presence of coherent noise in seismic data leads to errors and uncertainties, and as such it is paramount to suppress noise as early and efficiently as possible. Self-supervised denoising circumvents the common requirement of deep learning procedures of having noisy-clean training pairs. However, self-supervised coherent noise suppression methods require extensive knowledge of the noise statistics. We propose the use of explainable artificial intelligence approaches to see inside the black box that is the denoising network and use the gained knowledge to replace the need for any prior knowledge of the noise itself. This is achieved in practice by leveraging bias-free networks and the direct linear link between input and output provided by the associated Jacobian matrix; we show that a simple averaging of the Jacobian contributions over a number of randomly selected input pixels, provides an indication of the most effective mask to suppress noise present in the data. The proposed method therefore becomes a fully automated denoising procedure requiring no clean training labels or prior knowledge. Realistic synthetic examples with noise signals of varying complexities, ranging from simple time-correlated noise to complex pseudo rig noise propagating at the velocity of the ocean, are used to validate the proposed approach. Its automated nature is highlighted further by an application to two field datasets. Without any substantial pre-processing or any knowledge of the acquisition environment, the automatically identified blind-masks are shown to perform well in suppressing both trace-wise noise in common shot gathers from the Volve marine dataset and colored noise in post stack seismic images from a land seismic survey.
摘要
“地震数据中存在的相干噪声会导致错误和不确定性,因此尽早且高效地抑制噪声至关重要。自监督去噪绕开了深度学习方法通常需要‘含噪-干净’训练对的要求,但自监督相干噪声抑制方法仍需要对噪声统计特性有充分的了解。我们提出使用可解释人工智能方法来窥视去噪网络这一‘黑盒’内部,并利用所获知识来取代对噪声本身的任何先验知识。在实践中,这通过无偏置网络以及相应雅可比矩阵所提供的输入与输出之间的直接线性联系来实现:我们证明,对若干随机选取的输入像素的雅可比贡献做简单平均,即可指示抑制数据中噪声的最有效掩模。因此,所提方法成为一个完全自动的去噪流程,既不需要干净的训练标签,也不需要先验知识。我们使用噪声信号复杂度各异的真实感合成算例(从简单的时间相关噪声到以海水速度传播的复杂伪钻机噪声)来验证该方法。将其应用于两个现场数据集进一步突显了其自动化特性:在没有任何实质性预处理、也不了解采集环境的情况下,自动识别出的盲掩模在抑制 Volve 海洋数据集共炮点道集中的逐道噪声以及陆地地震勘探叠后剖面中的有色噪声方面均表现良好。”
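A compact PyTorch sketch of the Jacobian-averaging step as we understand it: for a bias-free network, each output pixel is a linear function of the input, so averaging Jacobian rows over a few randomly selected output pixels reveals which input neighborhood the network draws on, i.e. a candidate blind mask. The toy network and pixel count are assumptions:

```python
import torch

def average_jacobian_mask(net, x, n_pixels=32):
    """Average d(out_i)/d(x) over randomly chosen output pixels of a
    bias-free denoising network; the averaged row indicates the effective
    input support, usable as a blind mask."""
    x = x.clone().requires_grad_(True)
    out = net(x).flatten()
    acc = torch.zeros_like(x)
    for i in torch.randint(out.numel(), (n_pixels,)):
        grad, = torch.autograd.grad(out[i], x, retain_graph=True)
        acc += grad
    return (acc / n_pixels).detach()

net = torch.nn.Sequential(                           # toy bias-free CNN
    torch.nn.Conv2d(1, 8, 3, padding=1, bias=False), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, 3, padding=1, bias=False))
mask = average_jacobian_mask(net, torch.randn(1, 1, 32, 32))
```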
Real-time Percussive Technique Recognition and Embedding Learning for the Acoustic Guitar
results: 研究人员发现,与 CNN 相比,使用 VAE 能获得更好的类间分离,有望支持更紧密的控制和更丰富的交互;但模型在不同数据集间的泛化问题仍有待解决。Abstract
Real-time music information retrieval (RT-MIR) has much potential to augment the capabilities of traditional acoustic instruments. We develop RT-MIR techniques aimed at augmenting percussive fingerstyle, which blends acoustic guitar playing with guitar body percussion. We formulate several design objectives for RT-MIR systems for augmented instrument performance: (i) causal constraint, (ii) perceptually negligible action-to-sound latency, (iii) control intimacy support, (iv) synthesis control support. We present and evaluate real-time guitar body percussion recognition and embedding learning techniques based on convolutional neural networks (CNNs) and CNNs jointly trained with variational autoencoders (VAEs). We introduce a taxonomy of guitar body percussion based on hand part and location. We follow a cross-dataset evaluation approach by collecting three datasets labelled according to the taxonomy. The embedding quality of the models is assessed using KL-Divergence across distributions corresponding to different taxonomic classes. Results indicate that the networks are strong classifiers especially in a simplified 2-class recognition task, and the VAEs yield improved class separation compared to CNNs as evidenced by increased KL-Divergence across distributions. We argue that the VAE embedding quality could support control intimacy and rich interaction when the latent space's parameters are used to control an external synthesis engine. Further design challenges around generalisation to different datasets have been identified.
PatchSorter: A High Throughput Deep Learning Digital Pathology Tool for Object Labeling
paper_authors: Cedric Walker, Tasneem Talawalla, Robert Toth, Akhil Ambekar, Kien Rea, Oswin Chamian, Fan Fan, Sabina Berezowska, Sven Rottenberg, Anant Madabhushi, Marie Maillard, Laura Barisoni, Hugo Mark Horlings, Andrew Janowczyk
results: 通过使用>100,000个对象,这篇论文展示了>7倍的标签速度提升,并且对标签准确性产生了最小的影响,因此可以快速标注大量数据集。Abstract
The discovery of patterns associated with diagnosis, prognosis, and therapy response in digital pathology images often requires intractable labeling of large quantities of histological objects. Here we release an open-source labeling tool, PatchSorter, which integrates deep learning with an intuitive web interface. Using >100,000 objects, we demonstrate a >7x improvement in labels per second over unaided labeling, with minimal impact on labeling accuracy, thus enabling high-throughput labeling of large datasets.
摘要
发现与诊断、预后和治疗反应相关的数字病理图像模式,往往需要对大量组织学对象进行繁重的标注。我们在此发布一个开源标注工具 PatchSorter,它将深度学习与直观的网页界面相结合。基于超过 100,000 个对象的实验表明,与无辅助标注相比,每秒标注数量提升超过 7 倍,且对标注准确性的影响极小,从而可以实现大规模数据集的高通量标注。
DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle
paper_authors: Oskar Natan, Jun Miura
for: DeepIPCv2 是一个用于自动驾驶的模型,它利用激光雷达(LiDAR)进行环境感知,以提供更稳健的驾驶,特别是在夜间或光照不良的情况下。
methods: DeepIPCv2 使用 LiDAR 点云作为主要感知输入。由于点云不受光照变化影响,无论条件如何都能提供清晰的环境观测,从而帮助控制模块更好地估计导航控制。
results: 实验结果表明,DeepIPCv2 在不同的驾驶条件下均表现最佳,并通过与若干最新模型的对比和消融研究验证了其性能。Abstract
We present DeepIPCv2, an autonomous driving model that perceives the environment using a LiDAR sensor for more robust drivability, especially when driving under poor illumination conditions where everything is not clearly visible. DeepIPCv2 takes a set of LiDAR point clouds as the main perception input. Since point clouds are not affected by illumination changes, they can provide a clear observation of the surroundings no matter what the condition is. This results in a better scene understanding and stable features provided by the perception module to support the controller module in estimating navigational control properly. To evaluate its performance, we conduct several tests by deploying the model to predict a set of driving records and perform real automated driving under three different conditions. We also conduct ablation and comparative studies with some recent models to justify its performance. Based on the experimental results, DeepIPCv2 shows a robust performance by achieving the best drivability in all driving scenarios. Furthermore, we will upload the codes to https://github.com/oskarnatan/DeepIPCv2.
摘要
我们提出了 DeepIPCv2,一个自动驾驶模型,它通过激光雷达传感器感知环境,以实现更稳健的驾驶能力,尤其是在光照不佳、视野不清的情况下。DeepIPCv2 以激光雷达点云作为主要感知输入。由于点云不受光照变化影响,无论环境条件如何,都能提供清晰的周围环境观测,从而带来更好的场景理解,并为控制模块提供稳定的特征,以便正确估计导航控制。为了评估其性能,我们部署该模型预测一组驾驶记录,并在三种不同条件下进行真实的自动驾驶测试。我们还与若干最新模型进行了消融和对比研究以验证其性能。实验结果表明,DeepIPCv2 在所有驾驶场景中都表现出稳健的性能,达到了最佳的驾驶能力。此外,我们将在 https://github.com/oskarnatan/DeepIPCv2 上传代码。
Image Transformation Sequence Retrieval with General Reinforcement Learning
results: 实验表明,使用 MCTS 训练的模型在真实和合成领域中都能超越其监督学习版本,特别是在最简单和最复杂的情况下。Abstract
In this work, the novel Image Transformation Sequence Retrieval (ITSR) task is presented, in which a model must retrieve the sequence of transformations between two given images that act as source and target, respectively. Given certain characteristics of the challenge such as the multiplicity of a correct sequence or the correlation between consecutive steps of the process, we propose a solution to ITSR using a general model-based Reinforcement Learning such as Monte Carlo Tree Search (MCTS), which is combined with a deep neural network. Our experiments provide a benchmark in both synthetic and real domains, where the proposed approach is compared with supervised training. The results report that a model trained with MCTS is able to outperform its supervised counterpart in both the simplest and the most complex cases. Our work draws interesting conclusions about the nature of ITSR and its associated challenges.
摘要
本文提出了一种新的图像变换序列检索(ITSR)任务:给定分别作为源和目标的两幅图像,模型需要检索出二者之间的变换序列。鉴于该挑战的特点,例如正确序列的多样性以及变换过程中相邻步骤之间的相关性,我们提出了一种基于通用模型驱动强化学习(蒙特卡洛树搜索,MCTS)并结合深度神经网络的 ITSR 解决方案。我们的实验在合成和真实两个领域建立了基准,并将所提方法与监督训练进行了比较。结果表明,使用 MCTS 训练的模型在最简单和最复杂的情况下均能超越其监督学习版本。我们的工作就 ITSR 的本质及其相关挑战得出了有趣的结论。
SecureFalcon: The Next Cyber Reasoning System for Cyber Security
methods: 本研究使用 FalconLLM 模型,通过对 C 语言代码样本进行训练,以区分易受攻击的代码和安全的代码。我们结合生成式人工智能(AI)与形式验证构建了新的 FormAI 训练数据集,用于评估模型性能。
results: 研究发现,SecureFalcon 模型在检测软件漏洞方面达到了 94% 的准确率,表明它在软件安全领域具有广泛的应用前景和变革性潜力。Abstract
Software vulnerabilities leading to various detriments such as crashes, data loss, and security breaches, significantly hinder the quality, affecting the market adoption of software applications and systems. Although traditional methods such as automated software testing, fault localization, and repair have been intensively studied, static analysis tools are most commonly used and have an inherent false positives rate, posing a solid challenge to developer productivity. Large Language Models (LLMs) offer a promising solution to these persistent issues. Among these, FalconLLM has shown substantial potential in identifying intricate patterns and complex vulnerabilities, hence crucial in software vulnerability detection. In this paper, for the first time, FalconLLM is being fine-tuned for cybersecurity applications, thus introducing SecureFalcon, an innovative model architecture built upon FalconLLM. SecureFalcon is trained to differentiate between vulnerable and non-vulnerable C code samples. We build a new training dataset, FormAI, constructed thanks to Generative Artificial Intelligence (AI) and formal verification to evaluate its performance. SecureFalcon achieved an impressive 94% accuracy rate in detecting software vulnerabilities, emphasizing its significant potential to redefine software vulnerability detection methods in cybersecurity.
Introducing Foundation Models as Surrogate Models: Advancing Towards More Practical Adversarial Attacks
for: Studies adversarial attacks in the no-box setting, examining how the surrogate model selection process affects attack effectiveness.
methods: Uses foundational models as surrogate models, fine-tuned with a margin-based loss strategy, to carry out adversarial attacks based on the FGSM algorithm.
results: Compared with other algorithms, combining a foundational model with the margin-based loss strategy mitigates the performance gap and improves attack effectiveness. Abstract
Recently, the no-box adversarial attack, in which the attacker lacks access to the model's architecture, weights, and training data, has become the most practical and challenging attack setup. However, the potential and flexibility inherent in the surrogate model selection process in the no-box setting remain underexplored. Inspired by the burgeoning interest in utilizing foundational models to address downstream tasks, this paper adopts an innovative idea: 1) recasting the adversarial attack as a downstream task, specifically image noise generation, to meet the emerging trend, and 2) introducing foundational models as surrogate models. Harnessing the concept of non-robust features, we elaborate on two guiding principles for surrogate model selection to explain why the foundational model is an optimal choice for this role. However, paradoxically, we observe that these foundational models underperform. Analyzing this unexpected behavior within the feature space, we attribute the lackluster performance of foundational models (e.g., CLIP) to their significant representational capacity and, conversely, their lack of discriminative prowess. To mitigate this issue, we propose the use of a margin-based loss strategy for the fine-tuning of foundational models on target images. The experimental results verify that our approach, which employs the basic Fast Gradient Sign Method (FGSM) attack algorithm, outstrips the performance of other, more convoluted algorithms. We conclude by advocating for the research community to consider surrogate models as crucial determinants in the effectiveness of adversarial attacks in no-box settings. The implications of our work bear relevance for improving the efficacy of such adversarial attacks and the overall robustness of AI systems.
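To make the recipe concrete, here is a minimal sketch of the two ingredients named in the abstract: a margin-based fine-tuning loss and the one-step FGSM update. The exact loss formulation used in the paper is not given, so this margin variant is an assumption.

```python
import torch

def margin_loss(scores, target, margin=1.0):
    # Hypothetical margin-based objective for fine-tuning the foundational
    # surrogate on target images: push the target-class score above the
    # strongest rival class by at least `margin`.
    tgt = scores.gather(1, target.unsqueeze(1)).squeeze(1)
    rival = scores.scatter(1, target.unsqueeze(1), float("-inf")).max(dim=1).values
    return torch.clamp(margin - (tgt - rival), min=0).mean()

def fgsm_attack(surrogate, loss_fn, x, target, eps=8 / 255):
    # One-step FGSM: move the input along the sign of the loss gradient
    # computed through the surrogate model.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(surrogate(x_adv), target).backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```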
results: The study finds that the supplementary output provided by XAI methods may not be compatible with application tasks, and that conceptual and technical limitations frequently turn the XAI methods themselves into black boxes. Abstract
Our work serves as a framework for unifying the challenges of contemporary explainable AI (XAI). We demonstrate that while XAI methods provide supplementary and potentially useful output for machine learning models, researchers and decision-makers should be mindful of their conceptual and technical limitations, which frequently result in these methods themselves becoming black boxes. We examine three XAI research avenues spanning image, textual, and graph data, covering saliency, attention, and graph-type explainers. Despite the varying contexts and timeframes of the mentioned cases, the same persistent roadblocks emerge, highlighting the need for a conceptual breakthrough in the field to address the challenge of compatibility between XAI methods and application tasks.
EFL Students’ Attitudes and Contradictions in a Machine-in-the-loop Activity System
results: Most students showed positive attitudes, while some held negative or mixed feelings. A thematic analysis found that contradictions between students and AI stemmed from AI inadequacies, from students balancing enthusiasm with preference, and from their striving for language autonomy. Abstract
This study applies Activity Theory and investigates the attitudes and contradictions of 67 English as a foreign language (EFL) students from four Hong Kong secondary schools towards machine-in-the-loop writing, where artificial intelligence (AI) suggests ideas during composition. Students answered an open-ended question about their feelings on writing with AI. Results revealed mostly positive attitudes, with some negative or mixed feelings. From a thematic analysis, contradictions or points of tension between students and AI stemmed from AI inadequacies, students' balancing enthusiasm with preference, and their striving for language autonomy. The research highlights the benefits and challenges of implementing machine-in-the-loop writing in EFL classrooms, suggesting educators align activity goals with students' values, language abilities, and AI capabilities to enhance students' activity systems.
RVD: A Handheld Device-Based Fundus Video Dataset for Retinal Vessel Segmentation
paper_authors: MD Wahiduzzaman Khan, Hongwei Sheng, Hu Zhang, Heming Du, Sen Wang, Minas Theodore Coroneo, Farshid Hajati, Sahar Shariflou, Michael Kalloniatis, Jack Phu, Ashish Agar, Zi Huang, Mojtaba Golzan, Xin Yu
results: The study provides a new video-based retinal vessel dataset comprising 635 handheld fundus videos from 415 patients, along with evaluation metrics and benchmark results. The dataset exhibits rich data characteristics, poses significant challenges, and can help advance retinal vessel segmentation techniques. Abstract
Retinal vessel segmentation is generally grounded in image-based datasets collected with bench-top devices. The static images naturally lose the dynamic characteristics of retinal fluctuation, resulting in diminished dataset richness, and the usage of bench-top devices further restricts dataset scalability due to their limited accessibility. Considering these limitations, we introduce the first video-based retinal dataset by employing handheld devices for data acquisition. The dataset comprises 635 smartphone-based fundus videos collected from four different clinics, involving 415 patients from 50 to 75 years old. It delivers comprehensive and precise annotations of retinal structures in both spatial and temporal dimensions, aiming to advance the landscape of vasculature segmentation. Specifically, the dataset provides three levels of spatial annotations: binary vessel masks for overall retinal structure delineation, general vein-artery masks for distinguishing the vein and artery, and fine-grained vein-artery masks for further characterizing the granularities of each artery and vein. In addition, the dataset offers temporal annotations that capture the vessel pulsation characteristics, assisting in detecting ocular diseases that require fine-grained recognition of hemodynamic fluctuation. In application, our dataset exhibits a significant domain shift with respect to data captured by bench-top devices, thus posing great challenges to existing methods. In the experiments, we provide evaluation metrics and benchmark results on our dataset, reflecting both the potential and challenges it offers for vessel segmentation tasks. We hope this challenging dataset would significantly contribute to the development of eye disease diagnosis and early prevention.
Regression-Oriented Knowledge Distillation for Lightweight Ship Orientation Angle Prediction with Optical Remote Sensing Images
methods: The paper proposes a new SOAP model called Mobile-SOAP, based on MobileNetV2, which achieves state-of-the-art prediction accuracy. Four tiny SOAP models are also created by replacing the convolutional blocks in Mobile-SOAP with four small-scale networks, respectively.
results: Extensive experiments on the FGSC-23 dataset confirm the superiority of Mobile-SOAP over existing models, and show that the SOAP-KD knowledge distillation framework transfers knowledge from Mobile-SOAP to four specially designed tiny models, improving their prediction performance. Notably, with SOAP-KD, the test mean absolute error of the ShuffleNetV2x1.0-based model is only 8% higher than that of Mobile-SOAP, while its number of parameters and multiply-accumulate operations (MACs) are respectively 61.6% and 60.8% lower. Abstract
Ship orientation angle prediction (SOAP) with optical remote sensing images is an important image processing task, which often relies on deep convolutional neural networks (CNNs) to make accurate predictions. This paper proposes a novel framework to reduce the model sizes and computational costs of SOAP models without harming prediction accuracy. First, a new SOAP model called Mobile-SOAP is designed based on MobileNetV2, achieving state-of-the-art prediction accuracy. Four tiny SOAP models are also created by replacing the convolutional blocks in Mobile-SOAP with four small-scale networks, respectively. Then, to transfer knowledge from Mobile-SOAP to four lightweight models, we propose a novel knowledge distillation (KD) framework termed SOAP-KD consisting of a novel feature-based guidance loss and an optimized synthetic samples-based knowledge transfer mechanism. Lastly, extensive experiments on the FGSC-23 dataset confirm the superiority of Mobile-SOAP over existing models and also demonstrate the effectiveness of SOAP-KD in improving the prediction performance of four specially designed tiny models. Notably, by using SOAP-KD, the test mean absolute error of the ShuffleNetV2x1.0-based model is only 8% higher than that of Mobile-SOAP, but its number of parameters and multiply-accumulate operations (MACs) are respectively 61.6% and 60.8% less.
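The abstract names a feature-based guidance loss without giving its form; the following is a hypothetical rendering of such a regression-oriented KD objective, with the weighting term `beta` an assumed hyperparameter rather than the paper's.

```python
import torch.nn.functional as F

def soap_kd_loss(student_feat, teacher_feat, student_angle, true_angle, beta=1.0):
    # Hypothetical regression-oriented KD objective: a feature-matching
    # guidance term pulls student features toward the Mobile-SOAP teacher's,
    # alongside the ordinary regression loss on the orientation angle.
    guidance = F.mse_loss(student_feat, teacher_feat)
    regression = F.l1_loss(student_angle, true_angle)
    return regression + beta * guidance
```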
Prescriptive Process Monitoring Under Resource Constraints: A Reinforcement Learning Approach
for: The paper aims to optimize the performance of business processes by triggering interventions at runtime, thereby increasing the probability of positive case outcomes.
methods: The paper uses artificial intelligence techniques, specifically reinforcement learning, to learn intervention policies through trial and error.
results: The paper shows that under resource constraints, prediction uncertainty and resource utilization cannot be ignored, as doing so can lead to suboptimal intervention decisions. It proposes using conformal prediction techniques to account for the uncertainty of the predictions so that the reinforcement learning agent can learn better intervention policies. An evaluation validates the approach, showing that explicitly modeling uncertainty helps the agent converge towards policies with higher net intervention gain. Abstract
Prescriptive process monitoring methods seek to optimize the performance of business processes by triggering interventions at runtime, thereby increasing the probability of positive case outcomes. These interventions are triggered according to an intervention policy. Reinforcement learning has been put forward as an approach to learning intervention policies through trial and error. Existing approaches in this space assume that the number of resources available to perform interventions in a process is unlimited, an unrealistic assumption in practice. This paper argues that, in the presence of resource constraints, a key dilemma in the field of prescriptive process monitoring is to trigger interventions based not only on predictions of their necessity, timeliness, or effect but also on the uncertainty of these predictions and the level of resource utilization. Indeed, committing scarce resources to an intervention when the necessity or effects of this intervention are highly uncertain may intuitively lead to suboptimal intervention effects. Accordingly, the paper proposes a reinforcement learning approach for prescriptive process monitoring that leverages conformal prediction techniques to consider the uncertainty of the predictions upon which an intervention decision is based. An evaluation using real-life datasets demonstrates that explicitly modeling uncertainty using conformal predictions helps reinforcement learning agents converge towards policies with higher net intervention gain
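For readers unfamiliar with conformal prediction, a minimal split-conformal sketch follows; how the paper wires these prediction sets into the agent's decision making is not detailed in the abstract, so only the generic mechanism is illustrated.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    # Split conformal prediction: the finite-sample-adjusted (1 - alpha)
    # quantile of calibration nonconformity scores.
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level)

def prediction_set(class_probs, qhat):
    # Include every outcome whose nonconformity (1 - predicted probability)
    # falls below the threshold. A large set flags an uncertain prediction,
    # so a resource-constrained agent may prefer to hold back the intervention.
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= qhat]
```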
results: Experiments show that the method delivers higher-quality text generation on the standard language modeling benchmark (WikiText-103), with inference efficiency comparable to token-level autoregressive models. The method also enables convenient domain adaptation by simply switching to a domain-specific text collection, and attains additional gains by scaling up to larger text collections, all without extra training. Abstract
The dominant text generation models compose the output by sequentially selecting words from a fixed vocabulary. In this paper, we formulate text generation as progressively copying text segments (e.g., words or phrases) from an existing text collection. We compute the contextualized representations of meaningful text segments and index them using efficient vector search toolkits. The task of text generation is then decomposed into a series of copy-and-paste operations: at each time step, we seek suitable text spans from the text collection rather than selecting from a standalone vocabulary. Experiments on the standard language modeling benchmark (WikiText-103) show that our approach achieves better generation quality according to both automatic and human evaluations. Besides, its inference efficiency is comparable to token-level autoregressive models thanks to the reduction of decoding steps. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific text collection without extra training. Finally, we observe that our approach attains additional performance gains by simply scaling up to larger text collections, again without further training. (Our source code is publicly available at https://github.com/gmftbyGMFTBY/Copyisallyouneed.)
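Conceptually, each decoding step replaces the vocabulary softmax with a nearest-neighbor lookup over indexed segments. A brute-force sketch follows; a real system would use an approximate vector-search toolkit, as the abstract indicates, rather than this exhaustive inner product.

```python
import numpy as np

def copy_next_segment(context_vec, segment_vecs, segments):
    # Conceptual copy step: score all indexed text segments against the
    # current decoding context and copy the best-matching span.
    # segment_vecs: (n_segments, dim); context_vec: (dim,).
    scores = segment_vecs @ context_vec
    return segments[int(np.argmax(scores))]
```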
On the Effective Horizon of Inverse Reinforcement Learning
results: The study finds that an effective time horizon shorter than the ground-truth value often produces better results faster. This phenomenon is explained by the horizon controlling the complexity of the induced policy class and mitigating overfitting with limited data. Abstract
Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning over a given time horizon to compute an approximately optimal policy for a hypothesized reward function and then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimate and the computational efficiency of IRL algorithms. Interestingly, an effective time horizon shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis leads to a principled choice of the effective horizon for IRL. It also prompts us to reexamine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon together rather than the reward alone with a given horizon. Our experimental results confirm the theoretical analysis.
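The role of the horizon is easiest to see in plain finite-horizon value iteration, sketched below for a tabular MDP; `P` and `R` are assumed inputs, not the paper's notation.

```python
import numpy as np

def finite_horizon_greedy_policy(P, R, H):
    # P: (S, A, S) transition tensor; R: (S, A) rewards; H >= 1 planning steps.
    # Planning with a shorter effective horizon H induces a simpler policy
    # class, the mechanism the paper identifies for mitigating overfitting.
    assert H >= 1
    V = np.zeros(P.shape[0])
    for _ in range(H):
        Q = R + np.einsum("sat,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)
```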
Artificial Intelligence for Drug Discovery: Are We There Yet?
for: The paper is written to discuss the use of artificial intelligence (AI) in the three pillars of drug discovery, including diseases, targets, and therapeutic modalities, with a focus on small molecule drugs.
methods: The paper uses a variety of AI technologies, such as generative chemistry, machine learning, and multi-property optimization, to accelerate drug discovery and reduce costs.
results: The paper highlights several compounds that have entered clinical trials using AI-driven drug discovery methods, and emphasizes the need for careful vetting of known information to address the reproducibility crisis in the field. Abstract
Drug discovery is adapting to novel technologies such as data science, informatics, and artificial intelligence (AI) to accelerate effective treatment development while reducing costs and animal experiments. AI is transforming drug discovery, as indicated by increasing interest from investors, industrial and academic scientists, and legislators. Successful drug discovery requires optimizing properties related to pharmacodynamics, pharmacokinetics, and clinical outcomes. This review discusses the use of AI in the three pillars of drug discovery: diseases, targets, and therapeutic modalities, with a focus on small molecule drugs. AI technologies, such as generative chemistry, machine learning, and multi-property optimization, have enabled several compounds to enter clinical trials. The scientific community must carefully vet known information to address the reproducibility crisis. The full potential of AI in drug discovery can only be realized with sufficient ground truth and appropriate human intervention at later pipeline stages.
results: The study empirically applies the framework to find a Pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions. Abstract
Beliefs and values are increasingly being incorporated into our AI systems through alignment processes, such as carefully curating data collection principles or regularizing the loss function used for training. However, the meta-alignment problem is that these human beliefs are diverse and not aligned across populations; furthermore, the implicit strength of each belief may not be well calibrated even among humans, especially when trying to generalize across contexts. Specifically, in high regret situations, we observe that contextual counterfactuals and recourse costs are particularly important in updating a decision maker's beliefs and the strengths to which such beliefs are held. Therefore, we argue that including counterfactuals is key to an accurate calibration of beliefs during alignment. To do this, we first segment belief diversity into two categories: subjectivity (across individuals within a population) and epistemic uncertainty (within an individual across different contexts). By leveraging our notion of epistemic uncertainty, we introduce `the belief calibration cycle' framework to more holistically calibrate this diversity of beliefs with context-driven counterfactual reasoning by using a multi-objective optimization. We empirically apply our framework for finding a Pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions.
results: The study finds that diffusion-generated images improve NAFLD classification performance. Compared with images generated by generative adversarial networks (GANs), diffusion-generated images are of higher quality, with a maximum Inception Score (IS) of $1.90$ and a minimum Fréchet Inception Distance (FID) of $69.45$. Using a partially frozen CNN backbone (EfficientNet v1), the synthetic augmentation method achieves a maximum image-level ROC AUC of $0.904$ on the NAFLD prediction task. Abstract
Integrating deep learning with clinical expertise holds great potential for addressing healthcare challenges and empowering medical professionals with improved diagnostic tools. However, the need for annotated medical images is often an obstacle to leveraging the full power of machine learning models. Our research demonstrates that by combining synthetic images, generated using diffusion models, with real images, we can enhance nonalcoholic fatty liver disease (NAFLD) classification performance. We evaluate the quality of the synthetic images by comparing two metrics: Inception Score (IS) and Fr\'{e}chet Inception Distance (FID), computed on diffusion-generated images and generative adversarial networks (GANs)-generated images. Our results show superior performance for the diffusion-generated images, with a maximum IS score of $1.90$ compared to $1.67$ for GANs, and a minimum FID score of $69.45$ compared to $99.53$ for GANs. Utilizing a partially frozen CNN backbone (EfficientNet v1), our synthetic augmentation method achieves a maximum image-level ROC AUC of $0.904$ on a NAFLD prediction task.
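As a reference for the reported metric, the following sketch computes the Fréchet Inception Distance from precomputed Inception features; the feature extractor itself is omitted.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_fake):
    # FID between two feature matrices of shape (n_samples, dim), e.g.
    # activations of a pretrained Inception network on real vs. synthetic images.
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```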
Hybrid Control Policy for Artificial Pancreas via Ensemble Deep Reinforcement Learning
paper_authors: Wenzhou Lv, Tianyu Wu, Luolin Xiong, Liang Wu, Jian Zhou, Yang Tang, Feng Qian
for: This study aims to provide a closed-loop blood glucose control solution for individuals with type 1 diabetes mellitus (T1DM).
methods: The study uses a hybrid control policy that combines model predictive control (MPC) with ensemble deep reinforcement learning (DRL), fusing the safety and stability of MPC with the personalization and adaptability of DRL. Meta-learning techniques are further incorporated into HyCPAP to enable fast adaptation to new patients with only limited available data.
results: Extensive experiments using the FDA-accepted UVA/Padova T1DM simulator across three scenarios show that HyCPAP achieves the highest percentage of time spent in the desired euglycemic range and the lowest occurrences of hypoglycemia. Abstract
Objective: The artificial pancreas (AP) has shown promising potential in achieving closed-loop glucose control for individuals with type 1 diabetes mellitus (T1DM). However, designing an effective control policy for the AP remains challenging due to the complex physiological processes, delayed insulin response, and inaccurate glucose measurements. While model predictive control (MPC) offers safety and stability through the dynamic model and safety constraints, it lacks individualization and is adversely affected by unannounced meals. Conversely, deep reinforcement learning (DRL) provides personalized and adaptive strategies but faces challenges with distribution shifts and substantial data requirements. Methods: We propose a hybrid control policy for the artificial pancreas (HyCPAP) to address the above challenges. HyCPAP combines an MPC policy with an ensemble DRL policy, leveraging the strengths of both policies while compensating for their respective limitations. To facilitate faster deployment of AP systems in real-world settings, we further incorporate meta-learning techniques into HyCPAP, leveraging previous experience and patient-shared knowledge to enable fast adaptation to new patients with limited available data. Results: We conduct extensive experiments using the FDA-accepted UVA/Padova T1DM simulator across three scenarios. Our approaches achieve the highest percentage of time spent in the desired euglycemic range and the lowest occurrences of hypoglycemia. Conclusion: The results clearly demonstrate the superiority of our methods for closed-loop glucose management in individuals with T1DM. Significance: The study presents novel control policies for AP systems, affirming the great potential of proposed methods for efficient closed-loop glucose control.
AutoHint: Automatic Prompt Optimization with Hint Generation
results: Experiments show that the AutoHint framework significantly boosts accuracy on multiple tasks for both zero-shot and few-shot prompts. Abstract
This paper presents AutoHint, a novel framework for automatic prompt engineering and optimization for Large Language Models (LLMs). While LLMs have demonstrated remarkable ability in achieving high-quality annotation in various tasks, the key to applying this ability to specific tasks lies in developing high-quality prompts. Thus we propose a framework to inherit the merits of both in-context learning and zero-shot learning by incorporating enriched instructions derived from input-output demonstrations to optimize the original prompt. We refer to the enrichment as the hint and propose a framework to automatically generate the hint from labeled data. More concretely, starting from an initial prompt, our method first instructs a LLM to deduce new hints for selected samples from incorrect predictions, and then summarizes from per-sample hints and adds the results back to the initial prompt to form a new, enriched instruction. The proposed method is evaluated on the BIG-Bench Instruction Induction dataset for both zero-shot and few-shot prompts, where experiments demonstrate that our method is able to significantly boost accuracy for multiple tasks.
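A rough, hypothetical rendering of one AutoHint round is sketched below; `llm` stands in for any text-completion callable, and the prompt strings are illustrative rather than the paper's.

```python
def autohint_round(llm, prompt, labeled_samples):
    # One optimization round following the abstract's recipe: derive
    # per-sample hints from incorrect predictions, summarize them, and fold
    # the summary back into the prompt as an enriched instruction.
    wrong = [(x, y) for x, y in labeled_samples
             if llm(f"{prompt}\nInput: {x}\nOutput:").strip() != y]
    hints = [llm(f"Input: {x}\nExpected output: {y}\n"
                 "Deduce a hint that would prevent the wrong prediction.")
             for x, y in wrong]
    summary = llm("Summarize the hints below into one enriched instruction:\n"
                  + "\n".join(hints))
    return f"{prompt}\nHint: {summary}"
```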
Microbial Genetic Algorithm-based Black-box Attack against Interpretable Deep Learning Systems
paper_authors: Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin, Tamer Abuhmed
for: This paper focuses on the vulnerability of interpretable deep learning systems (IDLSes) to black-box attacks, and proposes a Query-efficient Score-based black-box attack (QuScore) to overcome this vulnerability.
methods: The proposed attack method, QuScore, employs an effective microbial genetic algorithm together with transfer-based and score-based methods to reduce the number of queries necessary to carry out successful attacks.
results: The proposed attack method achieves a high attack success rate of between 95% and 100% and transferability with an average success rate of 69% on the ImageNet and CIFAR datasets. The generated adversarial examples have attribution maps that resemble benign samples, and the attack is resilient against various preprocessing defense techniques. Abstract
Deep learning models are susceptible to adversarial samples in white and black-box environments. Although previous studies have shown high attack success rates, coupling DNN models with interpretation models could offer a sense of security when a human expert is involved, who can identify whether a given sample is benign or malicious. However, in white-box environments, interpretable deep learning systems (IDLSes) have been shown to be vulnerable to malicious manipulations. In black-box settings, as access to the components of IDLSes is limited, it becomes more challenging for the adversary to fool the system. In this work, we propose a Query-efficient Score-based black-box attack against IDLSes, QuScore, which requires no knowledge of the target model and its coupled interpretation model. QuScore is based on transfer-based and score-based methods by employing an effective microbial genetic algorithm. Our method is designed to reduce the number of queries necessary to carry out successful attacks, resulting in a more efficient process. By continuously refining the adversarial samples created based on feedback scores from the IDLS, our approach effectively navigates the search space to identify perturbations that can fool the system. We evaluate the attack's effectiveness on four CNN models (Inception, ResNet, VGG, DenseNet) and two interpretation models (CAM, Grad), using both ImageNet and CIFAR datasets. Our results show that the proposed approach is query-efficient with a high attack success rate that can reach between 95% and 100% and transferability with an average success rate of 69% in the ImageNet and CIFAR datasets. Our attack method generates adversarial examples with attribution maps that resemble benign samples. We have also demonstrated that our attack is resilient against various preprocessing defense techniques and can easily be transferred to different DNN models.
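The microbial genetic algorithm at QuScore's core follows a simple tournament scheme. The sketch below shows one generic step; the encoding of perturbations and the feedback-score fitness are left abstract, as the paper's exact operators are not given in the abstract.

```python
import numpy as np

def microbial_ga_step(pop, fitness, p_infect=0.5, p_mut=0.02, sigma=0.1, rng=None):
    # One tournament of a microbial GA: two random individuals compete, and
    # the loser is "infected" with a fraction of the winner's genes, then
    # mutated. For a black-box attack, an individual would encode a
    # perturbation and fitness the model's feedback score on the perturbed image.
    rng = rng or np.random.default_rng()
    i, j = rng.choice(len(pop), size=2, replace=False)
    win, lose = (i, j) if fitness(pop[i]) >= fitness(pop[j]) else (j, i)
    infect = rng.random(pop.shape[1]) < p_infect
    pop[lose, infect] = pop[win, infect]
    mutate = rng.random(pop.shape[1]) < p_mut
    pop[lose, mutate] += sigma * rng.standard_normal(mutate.sum())
    return pop
```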
Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!
for: The paper is written for communication scholars and researchers who use automated classifiers (ACs) in their studies, and it aims to provide a better understanding of the limitations and potential biases of these classifiers.
methods: The paper uses a systematic literature review and Monte Carlo simulations to investigate the issue of misclassification bias in ACs, and it introduces and tests several methods for correcting this bias.
results: The paper finds that existing statistical methods can be used to correct misclassification bias, and it recommends a new error correction method that is versatile and efficient. The results suggest that ACs can be useful for measurement with careful study design and appropriate error correction methods. Abstract
Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses, unless such analyses account for these errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use "gold standard" validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates. We introduce and test such methods, including a new method we design and implement in the R package misclassificationmodels, via Monte Carlo simulations designed to reveal each method's limitations, which we also release. Based on our results, we recommend our new error correction method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
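The simplest instance of such error correction is the classic prevalence (Rogan-Gladen) estimator, sketched below for intuition; this illustrates the general principle only and is not the API of the released misclassificationmodels R package.

```python
def corrected_proportion(p_observed, sensitivity, specificity):
    # Rogan-Gladen correction: recovers the true class proportion from the
    # observed one, given the classifier's error rates estimated on
    # gold-standard (human-annotated) validation data.
    return (p_observed + specificity - 1.0) / (sensitivity + specificity - 1.0)

# Example: a classifier with 85% sensitivity and 90% specificity labels 30%
# of documents as positive; the corrected prevalence is
# (0.3 + 0.9 - 1) / (0.85 + 0.9 - 1) = 0.2 / 0.75, roughly 0.267.
```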
Efficiently-Verifiable Strong Uniquely Solvable Puzzles and Matrix Multiplication
for: The paper is aimed at developing fast matrix multiplication algorithms.
methods: The paper uses the Cohn-Umans framework and introduces a new subclass of strong uniquely solvable puzzles (SUSPs) called simplifiable SUSPs.
results: The paper shows that individual simplifiable SUSPs can achieve the same strength of bounds on the matrix multiplication exponent $\omega$ as infinite families of SUSPs, and reports on the construction of larger SUSPs than previously known for small width, which strengthens the upper bound on the matrix multiplication exponent from $2.66$ to $2.505$. Abstract
We advance the Cohn-Umans framework for developing fast matrix multiplication algorithms. We introduce, analyze, and search for a new subclass of strong uniquely solvable puzzles (SUSP), which we call simplifiable SUSPs. We show that these puzzles are efficiently verifiable, which remains an open question for general SUSPs. We also show that individual simplifiable SUSPs can achieve the same strength of bounds on the matrix multiplication exponent $\omega$ that infinite families of SUSPs can. We report on the construction, by computer search, of larger SUSPs than previously known for small width. This, combined with our tighter analysis, strengthens the upper bound on the matrix multiplication exponent from $2.66$ to $2.505$ obtainable via this computational approach, and nears the results of the handcrafted constructions of Cohn et al.
ACTI at EVALITA 2023: Overview of the Conspiracy Theory Identification Task
for: This study aims to identify conspiratorial content and specific conspiracy theory categories.
methods: This study uses large language models to identify conspiratorial content and specific conspiracy theory categories.
results: The study finds that large language models can effectively identify conspiratorial content and specific conspiracy theory categories. Abstract
The Conspiracy Theory Identification task is a new shared task proposed for the first time at Evalita 2023. The ACTI challenge, based exclusively on comments published on conspiratorial channels of Telegram, is divided into two subtasks: (i) Conspiratorial Content Classification: identifying conspiratorial content, and (ii) Conspiratorial Category Classification: classifying specific conspiracy theories. A total of fifteen teams participated in the task, for a total of 81 submissions. We show that the best-performing approaches were based on the utilization of large language models. We finally draw conclusions about the utilization of these models for counteracting the spread of misinformation in online platforms.
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
results: When pre-training BERT and T5 with a fixed computation budget using these methods, their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. The paper proposes an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine (reference system time). Abstract
The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events
results: The study finds that distilling GPT-3.5 into a PubMedBERT model attains accuracy comparable to supervised state-of-the-art models on standard ADE extraction evaluation, without using any labeled data. The distilled model also outperforms both its teacher GPT-3.5 and GPT-4 in F1 score. Abstract
Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities across a wide range of tasks, including health applications. In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs, with additional advantages such as cost, efficiency, and white-box model access. We conduct a case study on adverse drug event (ADE) extraction, which is an important area for improving care. On standard ADE extraction evaluation, a GPT-3.5 distilled PubMedBERT model attained comparable accuracy as supervised state-of-the-art models without using any labeled data. Despite being over 1,000 times smaller, the distilled model outperformed its teacher GPT-3.5 by over 6 absolute points in F1 and GPT-4 by over 5 absolute points. Ablation studies on distillation model choice (e.g., PubMedBERT vs BioGPT) and ADE extraction architecture shed light on best practice for biomedical knowledge extraction. Similar gains were attained by distillation for other standard biomedical knowledge extraction tasks such as gene-disease associations and protected health information, further illustrating the promise of this approach.
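The distillation recipe essentially lets the teacher label unlabeled text and fine-tunes the student on the result; a schematic sketch follows, with `teacher_llm` and the prompt template as hypothetical stand-ins rather than the paper's pipeline.

```python
def build_distillation_data(teacher_llm, prompt_template, unlabeled_texts):
    # Self-supervised distillation sketch: the LLM teacher annotates raw
    # biomedical text (e.g., marking adverse drug events), and the resulting
    # (text, label) pairs fine-tune a compact student such as PubMedBERT.
    # teacher_llm is assumed to be a text-completion callable, not a real API.
    return [(text, teacher_llm(prompt_template.format(text=text)))
            for text in unlabeled_texts]
```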
paper_authors: Ce Zhou
for: This project aims to help people with motor disabilities, especially paralyzed people, improve their quality of life.
methods: The project uses Steady-State Visual Evoked Potential (SSVEP) technology and processes the EEG signals in Matlab with a Butterworth bandpass filter and the Fast Fourier Transform (FFT). A harmonics-based classification method is also applied.
results: Experimental results show that the system can accurately control an electronic wheelchair, achieving approximately a minimum 1-second time delay. This indicates that the SSVEP-based BCI-controlled wheelchair has great application potential in the future. Abstract
A brain-computer interface (BCI) is a system that allows a person to communicate with or control the surroundings without depending on the brain's normal output pathways of peripheral nerves and muscles. Many successful applications have arisen utilizing the advantages of BCI to assist disabled people with so-called assistive technology. Considering that BCI has fewer limitations and huge potential, this project proposes to control the movement of an electronic wheelchair via brain signals. The goal of this project is to help disabled people, especially paralyzed people suffering from motor disabilities, improve their quality of life. To realize the project, Steady-State Visual Evoked Potential (SSVEP) is involved. It can be easily elicited in the visual cortex at the same frequency as the stimulus the subject is focusing on. There are two important parts in this project. One is to process the EEG signals, and the other is to build a visual stimulator in hardware. The EEG signals are processed in Matlab using a Butterworth Infinite Impulse Response (IIR) bandpass filter (for preprocessing) and the Fast Fourier Transform (FFT) (for feature extraction). In addition, a harmonics-based classification method is proposed and applied in the classification part. Moreover, the design of the visual stimulator combines LEDs as flickers and LCDs as information displays on one panel. Microcontrollers are employed to control the SSVEP visual stimulus panel. The project is evaluated with subjects of different races and ages. Experimental results show the system is easy to operate and can achieve approximately a minimum 1-second time delay. This demonstrates that the SSVEP-based BCI-controlled wheelchair has huge potential to be applied to disabled people in the future.
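The described pipeline (bandpass filtering, FFT, harmonics-based scoring) can be sketched compactly. The sketch below is a generic Python illustration rather than the project's Matlab code; the filter order, band edges, and number of harmonics are assumed values.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def classify_ssvep(eeg, fs, stim_freqs, n_harmonics=2, band=(5.0, 45.0)):
    # Bandpass-filter a single-channel EEG trace, then score each candidate
    # stimulus frequency by summing FFT power at its fundamental and
    # harmonics; the highest-scoring frequency is taken as the attended flicker.
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg)
    power = np.abs(np.fft.rfft(filtered)) ** 2
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fs)
    scores = []
    for f0 in stim_freqs:
        bins = [np.argmin(np.abs(freqs - f0 * (h + 1))) for h in range(n_harmonics)]
        scores.append(power[bins].sum())
    return stim_freqs[int(np.argmax(scores))]
```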
Designing Behavior Trees from Goal-Oriented LTLf Formulas
results: The paper shows that constructing BTs from LTL formulas allows a wide range of planners to implement the action nodes, and that any successful trace satisfies the corresponding LTL formula. Two examples demonstrate the usefulness of the approach: exploring the alignment between two planners and LTL goals, and solving a sequential key-door problem for a Fetch robot. Abstract
Temporal logic can be used to formally specify autonomous agent goals, but synthesizing planners that guarantee goal satisfaction can be computationally prohibitive. This paper shows how to turn goals specified using a subset of finite trace Linear Temporal Logic (LTL) into a behavior tree (BT) that guarantees that successful traces satisfy the LTL goal. Useful LTL formulas for achievement goals can be derived using achievement-oriented task mission grammars, leading to missions made up of tasks combined using LTL operators. Constructing BTs from LTL formulas leads to a relaxed behavior synthesis problem in which a wide range of planners can implement the action nodes in the BT. Importantly, any successful trace induced by the planners satisfies the corresponding LTL formula. The usefulness of the approach is demonstrated in two ways: a) exploring the alignment between two planners and LTL goals, and b) solving a sequential key-door problem for a Fetch robot.
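For intuition, the two BT composites involved in such constructions can be sketched as follows; the key-door tree in the trailing comment is a hypothetical illustration of the general pattern, not the paper's generated tree.

```python
SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

class Fallback:
    # Tick children left to right; succeed on the first non-failing child.
    def __init__(self, *children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != FAILURE:
                return status
        return FAILURE

class Sequence:
    # Tick children left to right; fail or stall on the first non-succeeding child.
    def __init__(self, *children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != SUCCESS:
                return status
        return SUCCESS

class Leaf:
    # Wraps a condition or action callback returning SUCCESS/FAILURE/RUNNING.
    def __init__(self, fn): self.fn = fn
    def tick(self): return self.fn()

# Hypothetical key-door mission: an achievement goal such as F(has_key & F(door_open))
# maps to nested Fallback(condition, action) pairs inside a Sequence, e.g.
# tree = Sequence(Fallback(Leaf(has_key), Leaf(get_key)),
#                 Fallback(Leaf(door_open), Leaf(open_door)))
```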
methods: The study proposes a simple naming convention for enforcing locality of predicates in Answer Set Programming templates. Local predicates are mapped to the usual global namespace adopted by mainstream engines, using universally unique identifiers to avoid name clashes.
results: The results show that with this approach, local predicates can be used to enforce invariants on the expected outcome of a template in a possibly empty application context, and template applications can be safely shared with other knowledge designers, even when they have zero knowledge of templates. Abstract
In imperative programming, the Domain-Driven Design methodology helps in coping with the complexity of software development by materializing in code the invariants of a domain of interest. Code is cleaner and more secure because any implicit assumption is removed in favor of invariants, thus enabling a fail fast mindset and the immediate reporting of unexpected conditions. This article introduces a notion of template for Answer Set Programming that, in addition to the don't repeat yourself principle, enforces locality of some predicates by means of a simple naming convention. Local predicates are mapped to the usual global namespace adopted by mainstream engines, using universally unique identifiers to avoid name clashes. This way, local predicates can be used to enforce invariants on the expected outcome of a template in a possibly empty context of application, independently by other rules that can be added to such a context. Template applications transpiled this way can be processed by mainstream engines and safely shared with other knowledge designers, even when they have zero knowledge of templates.
Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation
results: Experiments validate the framework on discrete and continuous control tasks with real human users, showing that it helps users better understand agent failures, reduces the number of demonstrations required for fine-tuning, and aligns the agent with individual users' task objectives. Abstract
Policies often fail due to distribution shift -- changes in the state and reward that occur when a policy is deployed in new environments. Data augmentation can increase robustness by making the model invariant to task-irrelevant changes in the agent's observation. However, designers don't know which concepts are irrelevant a priori, especially when different end users have different preferences about how the task is performed. We propose an interactive framework to leverage feedback directly from the user to identify personalized task-irrelevant concepts. Our key idea is to generate counterfactual demonstrations that allow users to quickly identify possible task-relevant and irrelevant concepts. The knowledge of task-irrelevant concepts is then used to perform data augmentation and thus obtain a policy adapted to personalized user objectives. We present experiments validating our framework on discrete and continuous control tasks with real human users. Our method (1) enables users to better understand agent failure, (2) reduces the number of demonstrations required for fine-tuning, and (3) aligns the agent to individual user task preferences.
results: Experimental results show that the method's overall performance is better than that of state-of-the-art offline reinforcement learning methods on tasks in the widely-used D4RL benchmarks. Abstract
The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas within the realm of potential actions: What if we were to choose a different course of action? These circumstances frequently give rise to extrapolation errors, which tend to accumulate exponentially with the problem horizon. Hence, it becomes crucial to acknowledge that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy make in order to control the extrapolation. Contrary to existing approaches that use regularization on either the policy or value function, we propose an approach to explicitly bound the amount of out-of-distribution actions during training. Specifically, our method utilizes dynamic programming to decide where to extrapolate and where not to, with an upper bound on the decisions different from behavior policy. It balances between the potential for improvement from taking out-of-distribution actions and the risk of making errors due to extrapolation. Theoretically, we justify our method by the constrained optimality of the fixed point solution to our $Q$ updating rules. Empirically, we show that the overall performance of our method is better than the state-of-the-art offline RL methods on tasks in the widely-used D4RL benchmarks.
FDAPT: Federated Domain-adaptive Pre-training for Language Models
results: The study finds that FDAPT can maintain downstream task performance competitive with the centralized baseline in both IID and non-IID situations. A novel algorithm, FFDAPT, is also proposed, improving computational efficiency by 12.1% on average while exhibiting downstream task performance similar to standard FDAPT, with general performance fluctuations remaining below 1%. Abstract
Combining Domain-adaptive Pre-training (DAPT) with Federated Learning (FL) can enhance model adaptation by leveraging more sensitive and distributed data while preserving data privacy. However, few studies have focused on this method. Therefore, we conduct the first comprehensive empirical study to evaluate the performance of Federated Domain-adaptive Pre-training (FDAPT). We demonstrate that FDAPT can maintain competitive downstream task performance to the centralized baseline in both IID and non-IID situations. Furthermore, we propose a novel algorithm, Frozen Federated Domain-adaptive Pre-training (FFDAPT). FFDAPT improves the computational efficiency by 12.1% on average and exhibits similar downstream task performance to standard FDAPT, with general performance fluctuations remaining less than 1%. Finally, through a critical evaluation of our work, we identify promising future research directions for this new research area.
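The abstract does not fix the aggregation rule, so the sketch below assumes plain FedAvg-style weight averaging and an HF-style language model whose forward pass returns a loss; both are assumptions, not the paper's specification.

```python
import copy
import torch

def fdapt_round(global_model, client_loaders, local_steps, lr=1e-5):
    # One communication round of federated domain-adaptive pre-training:
    # each client adapts a copy of the global model on its private domain
    # corpus, then the server averages the resulting weights.
    states = []
    for loader in client_loaders:
        model = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for _, batch in zip(range(local_steps), loader):
            loss = model(**batch).loss   # assumes an HF-style masked-LM model
            opt.zero_grad()
            loss.backward()
            opt.step()
        states.append(model.state_dict())
    avg = {k: torch.stack([s[k].float() for s in states]).mean(0)
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```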
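A rough FedAvg-style sketch of one FDAPT communication round, with an optional frozen-parameter prefix standing in for the computation and communication savings FFDAPT aims at. The masked-LM interface and the choice of which layers to freeze are assumptions, not the paper's exact recipe.

```python
import copy
import torch

def federated_round(global_model, client_loaders, frozen_prefix=None, lr=5e-5):
    """One FedAvg-style round of federated domain-adaptive pre-training.

    If frozen_prefix is set (e.g. the hypothetical "embeddings"), matching
    parameters are frozen locally and skipped during aggregation.
    """
    states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        for name, p in local.named_parameters():
            if frozen_prefix and name.startswith(frozen_prefix):
                p.requires_grad_(False)
        opt = torch.optim.AdamW(
            [p for p in local.parameters() if p.requires_grad], lr=lr)
        for batch in loader:               # local masked-LM steps
            loss = local(**batch).loss     # assumes a HF-style masked-LM model
            loss.backward()
            opt.step()
            opt.zero_grad()
        states.append(dict(local.named_parameters()))
        sizes.append(len(loader.dataset))

    weights = [n / sum(sizes) for n in sizes]
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            if frozen_prefix and name.startswith(frozen_prefix):
                continue                   # frozen weights are never communicated
            p.copy_(sum(w * s[name] for w, s in zip(weights, states)))
```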
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
paper_authors: Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby
results: In experiments, NaViT achieves higher training and inference efficiency and stronger performance across a wide range of image and video tasks. It also transfers well to standard tasks and improves results on robustness and fairness benchmarks.Abstract
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
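A small illustration of the sequence-packing idea (a greedy first-fit sketch, not NaViT's implementation): variable-resolution images are patchified, and several patch sequences share one fixed-length token buffer, with per-token image ids kept so attention can later be restricted to tokens of the same image. It assumes each image fits in a single buffer.

```python
import numpy as np

PATCH, MAX_TOKENS = 16, 256   # assumed patch size and packed-sequence length

def patchify(img):
    """img: (H, W, 3) array with H and W divisible by PATCH."""
    h, w, _ = img.shape
    p = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * 3)

def pack(images):
    """Greedy first-fit packing; assumes each image fits in one buffer."""
    buffers = []                          # each entry: (token_list, image_id_list)
    for idx, img in enumerate(images):
        tokens = patchify(img)
        for toks, ids in buffers:
            if len(ids) + len(tokens) <= MAX_TOKENS:
                toks.extend(tokens)
                ids.extend([idx] * len(tokens))
                break
        else:                             # no buffer had room: open a new one
            buffers.append((list(tokens), [idx] * len(tokens)))
    return buffers                        # image ids later drive attention masking
```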
Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
results: InstructMining is used to select and evaluate samples from multiple instruction-following datasets, selecting relatively high-quality samples, and its performance is validated by selecting from unseen datasets. Compared with finetuning on unfiltered datasets, models finetuned on InstructMining-selected data perform better in 42.5% of cases.Abstract
Large language models typically undergo two training stages, pretraining and finetuning. Despite that large-scale pretraining endows the model with strong capabilities to generate natural language responses, these pretrained models can still fail to understand human instructions at times. To enhance language models' ability of interpreting and responding to instructions, instruction finetuning has emerged as a critical method in this area. Recent studies found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, the selection of high-quality datasets for finetuning language models still lacks clear guidelines to follow. In this paper, we propose InstructMining, a linear rule for evaluating instruction-following data quality. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we further conduct extensive finetuning experiments. The experiment results are then applied to estimating parameters in InstructMining. To further investigate its performance, we use InstructMining to select high-quality data from unseen datasets. Results demonstrate that InstructMining can help select relatively high-quality samples from various instruction-following datasets. Compared to models finetuned on unfiltered datasets, models finetuned on InstructMining selected datasets perform better on 42.5% cases.
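A minimal sketch of what a linear data-quality rule looks like in practice; the indicator names and weights below are placeholders, since the paper estimates its own natural-language indicators and parameters from finetuning experiments.

```python
import numpy as np

def quality_score(example, w=(0.5, -0.3, 0.2), b=0.0):
    """Linear rule over per-example indicators; names and weights are placeholders."""
    feats = np.array([
        example["reward_score"],        # hypothetical indicator 1
        example["perplexity"],          # hypothetical indicator 2
        example["length_norm"],         # hypothetical indicator 3
    ])
    return float(np.dot(w, feats) + b)

def select_top(dataset, keep_ratio=0.2):
    """Keep the highest-scoring fraction of an instruction dataset."""
    ranked = sorted(dataset, key=quality_score, reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]
```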
results: On a test set with personalized named entities, each approach improves word error rate (WER) by over 10%, while natural language prompts improve WER by 7% without any training, at a marginal cost in generalization. Overall, gazetteers perform best, improving WER by 10% while also improving WER on a general test set by 1%.Abstract
Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized named entities to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10%, against a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with a marginal loss in generalization. Overall, gazetteers were found to perform the best with a 10% improvement in word error rate (WER), while also improving WER on a general test set by 1%.
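To illustrate the gazetteer idea in a rescoring step, here is a toy n-best re-ranker that boosts hypotheses containing a user's personalized entities. The scoring weights and string-matching rule are illustrative assumptions, not the paper's model.

```python
# Toy n-best rescoring with a gazetteer bonus (a sketch, not the paper's model):
# hypotheses that contain the user's personalized entities get a score boost
# before re-ranking.

def rescore(nbest, gazetteer, am_weight=1.0, bonus=2.0):
    """nbest: list of (hypothesis_text, base_score); higher score is better."""
    def score(hyp, base):
        hits = sum(entity.lower() in hyp.lower() for entity in gazetteer)
        return am_weight * base + bonus * hits
    return max(nbest, key=lambda h: score(*h))

nbest = [("call jon smith", -3.2), ("call john smyth", -3.5)]
print(rescore(nbest, gazetteer={"John Smyth"}))   # -> ("call john smyth", -3.5)
```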
Revisiting the DARPA Communicator Data using Conversation Analysis
results: The study finds that one species of failure is caused by the Communicator systems' inability to handle mixed initiative at the discourse structure level.Abstract
The state of the art in human computer conversation leaves something to be desired and, indeed, talking to a computer can be down-right annoying. This paper describes an approach to identifying ``opportunities for improvement'' in these systems by looking for abuse in the form of swear words. The premise is that humans swear at computers as a sanction and, as such, swear words represent a point of failure where the system did not behave as it should. Having identified where things went wrong, we can work backward through the transcripts and, using conversation analysis (CA) work out how things went wrong. Conversation analysis is a qualitative methodology and can appear quite alien - indeed unscientific - to those of us from a quantitative background. The paper starts with a description of Conversation analysis in its modern form, and then goes on to apply the methodology to transcripts of frustrated and annoyed users in the DARPA Communicator project. The conclusion is that there is at least one species of failure caused by the inability of the Communicator systems to handle mixed initiative at the discourse structure level. Along the way, I hope to demonstrate that there is an alternative future for computational linguistics that does not rely on larger and larger text corpora.
Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models
paper_authors: Arman Sakif Chowdhury, G. M. Shahariar, Ahammed Tarik Aziz, Syed Mohibul Alam, Md. Azad Sheikh, Tanveer Ahmed Belal
For: This study investigates how to detect fake news articles in a low-resource language such as Bengali.* Methods: The study proposes a methodology comprising four distinct approaches, using five pre-trained language models, to classify Bengali fake news articles. The approach includes translating English news articles into Bengali and using augmentation techniques to curb the deficit of fake news articles; it also explores news summarization to tackle the token-length limitation of BERT-based models.* Results: Through extensive experimentation and rigorous evaluation, the study shows the effectiveness of summarization and augmentation for Bengali fake news detection, evaluated on three separate test sets. The BanglaBERT Base model combined with augmentation reached 96% accuracy on the first test set; a BanglaBERT model trained on summarized, augmented news articles reached 97% on the second; and the mBERT Base model reached 86% on the third test set, reserved for evaluating generalization. The test sets and implementation are available on GitHub.Abstract
With the rise of social media and online news sources, fake news has become a significant issue globally. However, the detection of fake news in low resource languages like Bengali has received limited attention in research. In this paper, we propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali using summarization and augmentation techniques with five pre-trained language models. Our approach includes translating English news articles and using augmentation techniques to curb the deficit of fake news articles. Our research also focused on summarizing the news to tackle the token length limitation of BERT based models. Through extensive experimentation and rigorous evaluation, we show the effectiveness of summarization and augmentation in the case of Bengali fake news detection. We evaluated our models using three separate test datasets. The BanglaBERT Base model, when combined with augmentation techniques, achieved an impressive accuracy of 96% on the first test dataset. On the second test dataset, the BanglaBERT model, trained with summarized augmented news articles achieved 97% accuracy. Lastly, the mBERT Base model achieved an accuracy of 86% on the third test dataset which was reserved for generalization performance evaluation. The datasets and implementations are available at https://github.com/arman-sakif/Bengali-Fake-News-Detection
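The two pre-processing ingredients, summarization to fit BERT's token limit and augmentation via back-translation, could be sketched with HuggingFace pipelines as below; the model identifiers are placeholders for illustration, and the paper's exact models and augmentation recipe may differ.

```python
from transformers import pipeline

# Model identifiers below are placeholders, not the paper's chosen models.
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
bn_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-bn-en")
en_to_bn = pipeline("translation", model="Helsinki-NLP/opus-mt-en-bn")

def shorten(article, max_tokens=512):
    """Summarize so the text fits a BERT-style token limit."""
    return summarizer(article, max_length=max_tokens)[0]["summary_text"]

def back_translate(article):
    """Augment a Bengali article by a Bengali -> English -> Bengali round trip."""
    english = bn_to_en(article)[0]["translation_text"]
    return en_to_bn(english)[0]["translation_text"]
```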
ChatGPT and Bard Responses to Polarizing Questions
results: The study finds that both ChatGPT and Bard show a left-leaning bias on polarizing topics, with Bard more likely to provide responses around such topics. Bard also exhibits fewer guardrails around controversial topics and gives more comprehensive, human-like responses. These findings can help stakeholders mitigate misinformative and polarizing responses from LLMs.Abstract
Recent developments in natural language processing have demonstrated the potential of large language models (LLMs) to improve a range of educational and learning outcomes. Of recent chatbots based on LLMs, ChatGPT and Bard have made it clear that artificial intelligence (AI) technology will have significant implications on the way we obtain and search for information. However, these tools sometimes produce text that is convincing, but often incorrect, known as hallucinations. As such, their use can distort scientific facts and spread misinformation. To counter polarizing responses on these tools, it is critical to provide an overview of such responses so stakeholders can determine which topics tend to produce more contentious responses -- key to developing targeted regulatory policy and interventions. In addition, there currently exists no annotated dataset of ChatGPT and Bard responses around possibly polarizing topics, central to the above aims. We address the indicated issues through the following contribution: Focusing on highly polarizing topics in the US, we created and described a dataset of ChatGPT and Bard responses. Broadly, our results indicated a left-leaning bias for both ChatGPT and Bard, with Bard more likely to provide responses around polarizing topics. Bard seemed to have fewer guardrails around controversial topics, and appeared more willing to provide comprehensive, and somewhat human-like responses. Bard may thus be more likely to be abused by malicious actors. Stakeholders may utilize our findings to mitigate misinformative and/or polarizing responses from LLMs.
Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
results: Through experiments on the MultiWOZ dataset, the paper identifies the problems of adversarial learning (AL) in dialog policy learning (DPL) and proposes a method that estimates rewards and learns dialog policies without AL. The method retains the advantages of AL while avoiding issues such as mode collapse.Abstract
Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.
Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models
results: Results show that these methods outperform the un-adapted model across different numbers of training shots in the prompt.Abstract
A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and only few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different numbers of training shots in the prompt, as well as a previous approach where calibration is performed without using any adaptation data.
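One common way to realize unsupervised prior adaptation is EM-style re-weighting of the model's posteriors by a re-estimated class prior; the sketch below follows that generic recipe and is not necessarily the paper's exact procedure.

```python
import numpy as np

def adapt_prior(posteriors, n_iter=20):
    """EM-style prior re-estimation from unlabelled in-domain queries.

    posteriors: (N, K) array of LLM class probabilities for N queries.
    Returns calibrated posteriors and the estimated in-domain class prior.
    """
    implicit_prior = posteriors.mean(axis=0)             # the model's apparent prior
    prior = np.full(posteriors.shape[1], 1.0 / posteriors.shape[1])
    for _ in range(n_iter):
        adjusted = posteriors * prior / implicit_prior   # re-weight by new prior
        adjusted /= adjusted.sum(axis=1, keepdims=True)  # renormalize per query
        prior = adjusted.mean(axis=0)                    # update prior estimate
    return adjusted, prior
```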
To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?
results: The study finds that participants react differently to different levels of privacy threat, and that their reactions are influenced by the privacy budget (ε). These results suggest that laypeople's willingness to share sensitive textual data depends on the framing of the risk and on the privacy budget, and that the choice of ε should not rest solely with researchers or system developers.Abstract
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget $\varepsilon$ that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data - to our knowledge, the first study of its kind.
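For reference, the ε in question is the parameter of the standard (central) differential-privacy guarantee, a textbook definition rather than anything specific to this paper: for every pair of neighboring datasets $D, D'$ and every measurable output set $S$, a mechanism $\mathcal{M}$ satisfies

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S],$$

so smaller ε means stronger protection. The ε = 10 in the question above would permit the output distribution to shift by a factor of up to $e^{10} \approx 2.2 \times 10^{4}$.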
Intent-calibrated Self-training for Answer Selection in Open-domain Dialogues
results: Experiments on two open-domain dialogue datasets show that ICAST consistently outperforms baselines with 1%, 5%, and 10% labeled data, improving the F1 score by 2.06% and 1.00% on the two datasets respectively, compared to the strongest baseline with only 5% labeled data.Abstract
Answer selection in open-domain dialogues aims to select an accurate answer from candidates. Recent success of answer selection models hinges on training with large amounts of labeled data. However, collecting large-scale labeled data is labor-intensive and time-consuming. In this paper, we introduce the predicted intent labels to calibrate answer labels in a self-training paradigm. Specifically, we propose the intent-calibrated self-training (ICAST) to improve the quality of pseudo answer labels through the intent-calibrated answer selection paradigm, in which we employ pseudo intent labels to help improve pseudo answer labels. We carry out extensive experiments on two benchmark datasets with open-domain dialogues. The experimental results show that ICAST outperforms baselines consistently with 1%, 5% and 10% labeled data. Specifically, it improves 2.06% and 1.00% of F1 score on the two datasets, compared with the strongest baseline with only 5% labeled data.
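A schematic self-training round in the spirit of ICAST: pseudo intent labels gate which pseudo answer labels are kept for the next training round. The model interfaces and thresholds are hypothetical, not the paper's API.

```python
def icast_round(answer_model, intent_clf, unlabeled, tau_intent=0.9, tau_answer=0.9):
    """One self-training round: intent confidence gates pseudo answer labels."""
    pseudo = []
    for dialog, candidates in unlabeled:
        intent_probs = intent_clf.predict_proba(dialog)   # pseudo intent label
        if intent_probs.max() < tau_intent:
            continue                                      # drop uncertain intents
        intent = intent_probs.argmax()
        scores = answer_model.score(dialog, candidates, intent=intent)
        if scores.max() >= tau_answer:                    # calibrated answer label
            pseudo.append((dialog, candidates, int(scores.argmax())))
    answer_model.train_on(pseudo)                         # retrain with pseudo labels
    return answer_model
```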
Parmesan: mathematical concept extraction for education
paper_authors: Jacob Collard, Valeria de Paiva, Eswaran Subrahmanian
For: This paper is written for researchers who are not experts in mathematics, but need to understand mathematical concepts in order to conduct multidisciplinary research.* Methods: The paper uses natural language processing techniques such as concept extraction, relation extraction, definition extraction, and entity linking to develop a prototype system for searching and defining mathematical concepts in context, specifically in the field of category theory.* Results: The authors show that existing natural language processing techniques cannot be applied directly to the category theory domain, and suggest hybrid techniques that perform well. They also provide two cleaned mathematical corpora that power the prototype system, which are based on journal articles and wiki pages, respectively.Abstract
Mathematics is a highly specialized domain with its own unique set of challenges that has seen limited study in natural language processing. However, mathematics is used in a wide variety of fields and multidisciplinary research in many different domains often relies on an understanding of mathematical concepts. To aid researchers coming from other fields, we develop a prototype system for searching for and defining mathematical concepts in context, focusing on the field of category theory. This system, Parmesan, depends on natural language processing components including concept extraction, relation extraction, definition extraction, and entity linking. In developing this system, we show that existing techniques cannot be applied directly to the category theory domain, and suggest hybrid techniques that do perform well, though we expect the system to evolve over time. We also provide two cleaned mathematical corpora that power the prototype system, which are based on journal articles and wiki pages, respectively. The corpora have been annotated with dependency trees, lemmas, and part-of-speech tags.
Going Beyond Local: Global Graph-Enhanced Personalized News Recommendations
results: Evaluated on two public news datasets, the model outperforms existing methods and also offers more diverse recommendations.Abstract
Precisely recommending candidate news articles to users has always been a core challenge for personalized news recommendation systems. Most recent works primarily focus on using advanced natural language processing techniques to extract semantic information from rich textual data, employing content-based methods derived from local historical news. However, this approach lacks a global perspective, failing to account for users' hidden motivations and behaviors beyond semantic information. To address this challenge, we propose a novel model called GLORY (Global-LOcal news Recommendation sYstem), which combines global representations learned from other users with local representations to enhance personalized recommendation systems. We accomplish this by constructing a Global-aware Historical News Encoder, which includes a global news graph and employs gated graph neural networks to enrich news representations, thereby fusing historical news representations by a historical news aggregator. Similarly, we extend this approach to a Global Candidate News Encoder, utilizing a global entity graph and a candidate news aggregator to enhance candidate news representation. Evaluation results on two public news datasets demonstrate that our method outperforms existing approaches. Furthermore, our model offers more diverse recommendations.
Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach
results: The model achieves a macro-average F1 score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiment. These results demonstrate the effectiveness of CNNs for sentiment analysis tasks, with practical applications in social media analysis, market research, and policy studies.Abstract
This study addressed the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN), offering a new approach to Natural Language Processing (NLP). The data, sourced from Baidu's PaddlePaddle AI platform, were meticulously preprocessed, tokenized, and categorized based on sentiment labels. A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification. The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments. The findings underscore the effectiveness of CNNs for sentiment analysis tasks, with implications for practical applications in social media analysis, market research, and policy studies. The complete experimental content and code have been made publicly available on the Kaggle data platform for further research and development. Future work may involve exploring different architectures, such as Recurrent Neural Networks (RNN) or transformers, or using more complex pre-trained models like BERT, to further improve the model's ability to understand linguistic nuances and context.
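A generic text-CNN of the kind described, with word embeddings feeding parallel convolutions and max-pooling into a three-class classifier; the vocabulary size, dimensions, and filter widths are assumptions, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab=50000, dim=128, n_classes=3, widths=(3, 4, 5), ch=100):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(dim, ch, w) for w in widths)
        self.fc = nn.Linear(ch * len(widths), n_classes)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)        # -> (batch, dim, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))  # logits over pos/neu/neg

logits = TextCNN()(torch.randint(1, 50000, (8, 64)))  # smoke test
```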
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
results: Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications.Abstract
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLM's in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions is still a challenging task at the current stage.
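Since the paper's findings hinge on word error rate, here is the standard Levenshtein-distance WER computation for reference (a textbook implementation, not tied to the paper's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words, normalized by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

assert wer("the cat sat", "the cat sat") == 0.0
```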
Agreement Tracking for Multi-Issue Negotiation Dialogues
results: Transfer-learning a T5 model pre-trained on MultiWOZ 2.4's DST task improves agreement tracking, with T5-small and T5-base improving by 21% and 9% respectively over training solely on GPT-Negochat; smaller training-subset experiments validate the method's sample efficiency.Abstract
Automated negotiation support systems aim to help human negotiators reach more favorable outcomes in multi-issue negotiations (e.g., an employer and a candidate negotiating over issues such as salary, hours, and promotions before a job offer). To be successful, these systems must accurately track agreements reached by participants in real-time. Existing approaches either focus on task-oriented dialogues or produce unstructured outputs, rendering them unsuitable for this objective. Our work introduces the novel task of agreement tracking for two-party multi-issue negotiations, which requires continuous monitoring of agreements within a structured state space. To address the scarcity of annotated corpora with realistic multi-issue negotiation dialogues, we use GPT-3 to build GPT-Negochat, a synthesized dataset that we make publicly available. We present a strong initial baseline for our task by transfer-learning a T5 model trained on the MultiWOZ 2.4 corpus. Pre-training T5-small and T5-base on MultiWOZ 2.4's DST task enhances results by 21% and 9% respectively over training solely on GPT-Negochat. We validate our method's sample-efficiency via smaller training subset experiments. By releasing GPT-Negochat and our baseline models, we aim to encourage further research in multi-issue negotiation dialogue agreement tracking.
National Origin Discrimination in Deep-learning-powered Automated Resume Screening
results: The study finds that relying on deep-learning-powered automated resume screening tools may lead to decisions favoring or disfavoring certain demographic groups, raising ethical and even legal concerns. To address this, a bias mitigation method is developed and validated through extensive experiments on real candidate resumes.Abstract
Many companies and organizations have started to use some form of AI-enabled automated tools to assist in their hiring process, e.g. screening resumes, interviewing candidates, performance evaluation. While those AI tools have greatly improved human resource operations efficiency and provided conveniences to job seekers as well, there are increasing concerns on unfair treatment to candidates, caused by underlying bias in AI systems. Laws around equal opportunity and fairness, like GDPR, CCPA, are introduced or under development, in an attempt to regulate AI. However, it is difficult to implement AI regulations in practice, as technologies are constantly advancing and the risk pertinent to their applications can fail to be recognized. This study examined deep learning methods, a recent technology breakthrough, with focus on their application to automated resume screening. One impressive performance of deep learning methods is the representation of individual words as low-dimensional numerical vectors, called word embedding, which are learned from aggregated global word-word co-occurrence statistics from a corpus, like Wikipedia or Google news. The resulting word representations possess interesting linear substructures of the word vector space and have been widely used in downstream tasks, like resume screening. However, word embedding inherits and reinforces the stereotyping from the training corpus, as deep learning models essentially learn a probability distribution of words and their relations from history data. Our study finds that if we rely on such deep-learning-powered automated resume screening tools, it may lead to decisions favoring or disfavoring certain demographic groups and raise ethical, even legal, concerns. To address the issue, we developed a bias mitigation method. Extensive experiments on real candidate resumes are conducted to validate our study.
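A projection-based mitigation in the spirit of hard debiasing is one way to realize the described bias-mitigation step; the paper's own method may differ, and the word pairs used to estimate the bias direction are placeholders.

```python
import numpy as np

def bias_direction(emb, pairs):
    """Estimate a bias direction from paired words.

    emb: dict mapping word -> vector; pairs: e.g. nationality word pairs
    (placeholders, chosen for illustration).
    """
    diffs = [emb[a] - emb[b] for a, b in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def neutralize(vec, direction):
    """Remove the bias component by projecting it out of the vector."""
    return vec - np.dot(vec, direction) * direction
```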
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews
for: This study aims to automate the screening phase of Systematic Reviews (SRs) to improve the efficiency and accuracy of SRs.
methods: The study uses ChatGPT, a generative-AI-driven conversational chatbot backed by large language models, to automate the screening phase of SRs.
results: The results show that ChatGPT achieves a high degree of consistency and classification performance comparable to traditional classifiers used in SR automation, making it a viable option for automating SR processes. However, developers need to consider several factors carefully when integrating ChatGPT into SR tools to ensure its performance.Abstract
By organizing knowledge within a research field, Systematic Reviews (SR) provide valuable leads to steer research. Evidence suggests that SRs have become first-class artifacts in software engineering. However, the tedious manual effort associated with the screening phase of SRs renders these studies a costly and error-prone endeavor. While screening has traditionally been considered not amenable to automation, the advent of generative AI-driven chatbots, backed with large language models is set to disrupt the field. In this report, we propose an approach to leverage these novel technological developments for automating the screening of SRs. We assess the consistency, classification performance, and generalizability of ChatGPT in screening articles for SRs and compare these figures with those of traditional classifiers used in SR automation. Our results indicate that ChatGPT is a viable option to automate the SR processes, but requires careful considerations from developers when integrating ChatGPT into their SR tools.
results: The paper provides a systematic overview of concepts and models, covering a broad range of LLM architectures, datasets, and major findings, and organizes and summarizes these research results.Abstract
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations of the underlying neural networks, context length improvements, model alignment, training datasets, benchmarking, efficiency and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides that overview to the research community. It not only focuses on a systematic treatment of the existing literature on a broad range of LLM related concept, but also pays special attention to providing comprehensive summaries with extensive details about the individual existing models, datasets and major insights. We also pay heed to aligning our overview with the emerging outlook of this research direction by accounting for the other recently materializing reviews of the broader research direction of LLMs. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of this research direction. This review article is intended to not only provide a systematic survey, but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research direction.
The Acquisition of Semantic Relationships between words
for: This paper investigates the relationship between linguistic morphology and semantic relationships, to understand how the structure of language influences language comprehension.
methods: The paper uses linguistic methods, including morphological analysis and semantic analysis, to study the relationship between morphology and semantic relationships.
results: The study finds a close relationship between morphology and semantic relationships, which plays an important role in language comprehension and generation.Abstract
The study of semantic relationships has revealed a close connection between these relationships and the morphological characteristics of a language. Morphology, as a subfield of linguistics, investigates the internal structure and formation of words. By delving into the relationship between semantic relationships and language morphology, we can gain deeper insights into how the underlying structure of words contributes to the interpretation and comprehension of language. This paper explores the dynamic interplay between semantic relationships and the morphological aspects of different languages, by examining the intricate relationship between language morphology and semantic relationships, valuable insights can be gained regarding how the structure of words influences language comprehension.
MMBench: Is Your Multi-modal Model an All-around Player?
paper_authors: Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
for: MMBench is a novel multi-modality benchmark designed to evaluate the various abilities of vision-language models.
methods: MMBench uses a meticulously curated dataset and a novel CircularEval strategy, which incorporates the use of ChatGPT to convert free-form predictions into pre-defined choices, facilitating a more robust evaluation of the model's predictions.
results: MMBench provides a comprehensive evaluation pipeline for vision-language models, allowing for a more objective and robust assessment of their abilities. It is expected to assist the research community in better evaluating their models and encourage future advancements in this domain.Abstract
Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.
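The CircularEval idea can be sketched as follows (an illustration of the strategy, not MMBench's code): each multiple-choice question is asked once per rotation of its options, and the model is credited only if it selects the same underlying answer every time.

```python
def circular_eval(ask_model, question, options, answer_idx):
    """ask_model(question, options) -> index of the chosen option."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]   # rotate the choices
        choice = ask_model(question, rotated)
        if rotated[choice] != options[answer_idx]:
            return False        # one wrong or inconsistent pass fails the item
    return True
```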
results: The study finds that both the faster attention-based architecture and consistency distillation preserve quality while greatly increasing generation speed, up to two orders of magnitude faster than PC-JeDi and three orders of magnitude faster than Delphes.Abstract
Building on the success of PC-JeDi we introduce PC-Droid, a substantially improved diffusion model for the generation of jet particle clouds. By leveraging a new diffusion formulation, studying more recent integration solvers, and training on all jet types simultaneously, we are able to achieve state-of-the-art performance for all types of jets across all evaluation metrics. We study the trade-off between generation speed and quality by comparing two attention based architectures, as well as the potential of consistency distillation to reduce the number of diffusion steps. Both the faster architecture and consistency models demonstrate performance surpassing many competing models, with generation time up to two orders of magnitude faster than PC-JeDi and three orders of magnitude faster than Delphes.
results: The paper provides an upper bound on the posterior probability that also accounts for uncertainty in the likelihood, together with a sufficient condition under which the bound holds with equality. This result is interesting in its own right and can be applied to various fields of engineering (e.g. model predictive control), machine learning, and artificial intelligence.Abstract
In their seminal 1990 paper, Wasserman and Kadane establish an upper bound for the Bayes' posterior probability of a measurable set $A$, when the prior lies in a class of probability measures $\mathcal{P}$ and the likelihood is precise. They also give a sufficient condition for such upper bound to hold with equality. In this paper, we introduce a generalization of their result by additionally addressing uncertainty related to the likelihood. We give an upper bound for the posterior probability when both the prior and the likelihood belong to a set of probabilities. Furthermore, we give a sufficient condition for this upper bound to become an equality. This result is interesting on its own, and has the potential of being applied to various fields of engineering (e.g. model predictive control), machine learning, and artificial intelligence.
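For orientation, the quantity being bounded can be written as the upper envelope of ordinary Bayes posteriors over the prior class $\mathcal{P}$ and likelihood class $\mathcal{L}$ (a generic formulation consistent with the abstract; the paper's explicit bound and equality condition are not reproduced here):

$$\overline{P}(A \mid x) \;=\; \sup_{\pi \in \mathcal{P},\, L \in \mathcal{L}} \frac{\int_{A} L(x \mid \theta)\, \pi(\mathrm{d}\theta)}{\int_{\Theta} L(x \mid \theta)\, \pi(\mathrm{d}\theta)},$$

with Wasserman and Kadane's 1990 setting recovered when $\mathcal{L}$ contains a single precise likelihood.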
A Causal Framework to Unify Common Domain Generalization Approaches
paper_authors: Nevin L. Zhang, Kaican Li, Han Gao, Weiyan Xie, Zhi Lin, Zhenguo Li, Luning Wang, Yongxiang Huang
for: This paper is written for researchers and practitioners who are interested in domain generalization (DG) in machine learning. It aims to provide a causal framework for understanding the key ideas behind different DG approaches and their relationships.
methods: The paper uses a causal framework to understand and unify different DG approaches, including domain adaptation, transfer learning, and multi-task learning. It also discusses the theoretical underpinnings of these methods and how they are related to each other.
results: The paper provides a new understanding of the underlying principles of DG and sheds light on the relative advantages and limitations of different DG methods. It also helps to identify future research directions in this area.Abstract
Domain generalization (DG) is about learning models that generalize well to new domains that are related to, but different from, the training domain(s). It is a fundamental problem in machine learning and has attracted much attention in recent years. A large number of approaches have been proposed. Different approaches are motivated from different perspectives, making it difficult to gain an overall understanding of the area. In this paper, we propose a causal framework for domain generalization and present an understanding of common DG approaches in the framework. Our work sheds new lights on the following questions: (1) What are the key ideas behind each DG method? (2) Why is it expected to improve generalization to new domains theoretically? (3) How are different DG methods related to each other and what are relative advantages and limitations? By providing a unified perspective on DG, we hope to help researchers better understand the underlying principles and develop more effective approaches for this critical problem in machine learning.
TinyMetaFed: Efficient Federated Meta-Learning for TinyML
results: Experimental results show that TinyMetaFed significantly reduces energy consumption and communication overhead in TinyML applications, allows models to be quickly fine-tuned on new devices, and accelerates and stabilizes training.Abstract
The field of Tiny Machine Learning (TinyML) has made substantial advancements in democratizing machine learning on low-footprint devices, such as microcontrollers. The prevalence of these miniature devices raises the question of whether aggregating their knowledge can benefit TinyML applications. Federated meta-learning is a promising answer to this question, as it addresses the scarcity of labeled data and heterogeneous data distribution across devices in the real world. However, deploying TinyML hardware faces unique resource constraints, making existing methods impractical due to energy, privacy, and communication limitations. We introduce TinyMetaFed, a model-agnostic meta-learning framework suitable for TinyML. TinyMetaFed facilitates collaborative training of a neural network initialization that can be quickly fine-tuned on new devices. It offers communication savings and privacy protection through partial local reconstruction and Top-P% selective communication, computational efficiency via online learning, and robustness to client heterogeneity through few-shot learning. The evaluations on three TinyML use cases demonstrate that TinyMetaFed can significantly reduce energy consumption and communication overhead, accelerate convergence, and stabilize the training process.
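One of the listed ingredients, Top-P% selective communication, can be sketched as magnitude-based sparsification of the client update; the threshold logic below is generic, not the paper's exact scheme.

```python
import numpy as np

def top_p_update(delta: np.ndarray, p: float = 0.1):
    """Return (flat indices, values) for the largest-magnitude fraction p of delta."""
    k = max(1, int(p * delta.size))
    idx = np.argpartition(np.abs(delta).ravel(), -k)[-k:]
    return idx, delta.ravel()[idx]

def apply_sparse_update(weights: np.ndarray, idx, values):
    """Server side: add the sparse client update in place (assumes contiguous weights)."""
    flat = weights.reshape(-1)       # view into the same memory
    flat[idx] += values
    return weights
```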
Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics
results: The method is successfully applied to four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.Abstract
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA or protein sequences data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied on the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.
Robotic surface exploration with vision and tactile sensing for cracks detection and characterisation
results: Experimental results show that the proposed algorithm successfully detects cracks and characterizes their length, width, orientation, and number of branches. The motion-planning algorithm also reduces cost relative to a full visual scan while improving the classification of cracks and the accuracy of their geometry analysis.Abstract
This paper presents a novel algorithm for crack localisation and detection based on visual and tactile analysis via fibre-optics. A finger-shaped sensor based on fibre-optics is employed for the data acquisition to collect data for the analysis and the experiments. To detect the possible locations of cracks a camera is used to scan an environment while running an object detection algorithm. Once the crack is detected, a fully-connected graph is created from a skeletonised version of the crack. A minimum spanning tree is then employed for calculating the shortest path to explore the crack which is then used to develop the motion planner for the robotic manipulator. The motion planner divides the crack into multiple nodes which are then explored individually. Then, the manipulator starts the exploration and performs the tactile data classification to confirm if there is indeed a crack in that location or just a false positive from the vision algorithm. If a crack is detected, also the length, width, orientation and number of branches are calculated. This is repeated until all the nodes of the crack are explored. In order to validate the complete algorithm, various experiments are performed: comparison of exploration of cracks through full scan and motion planning algorithm, implementation of frequency-based features for crack classification and geometry analysis using a combination of vision and tactile data. From the results of the experiments, it is shown that the proposed algorithm is able to detect cracks and improve the results obtained from vision to correctly classify cracks and their geometry with minimal cost thanks to the motion planning algorithm.
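The graph step in the abstract, skeleton to graph to minimum spanning tree to visit order, might look like the following sketch; the 8-connected pixel adjacency and depth-first traversal are generic choices rather than the paper's exact planner.

```python
import networkx as nx

def crack_waypoints(skeleton_pixels):
    """skeleton_pixels: set of (row, col) pixels on the crack skeleton.

    Assumes the skeleton is 8-connected so every pixel has a neighbor.
    """
    g = nx.Graph()
    for (r, c) in skeleton_pixels:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr or dc) and (r + dr, c + dc) in skeleton_pixels:
                    g.add_edge((r, c), (r + dr, c + dc),
                               weight=(dr * dr + dc * dc) ** 0.5)
    mst = nx.minimum_spanning_tree(g)
    start = next(iter(skeleton_pixels))
    return list(nx.dfs_preorder_nodes(mst, source=start))  # node visit order
```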
results: The study shows that ordinal data-science methods can extract useful knowledge from empirical data and can be cross-fertilized with other machine learning and knowledge representation methods, benefiting research and applications across many disciplines.Abstract
Order is one of the main instruments to measure the relationship between objects in (empirical) data. However, compared to methods that use numerical properties of objects, the amount of ordinal methods developed is rather small. One reason for this is the limited availability of computational resources in the last century that would have been required for ordinal computations. Another reason -- particularly important for this line of research -- is that order-based methods are often seen as too mathematically rigorous for applying them to real-world data. In this paper, we will therefore discuss different means for measuring and 'calculating' with ordinal structures -- a specific class of directed graphs -- and show how to infer knowledge from them. Our aim is to establish Ordinal Data Science as a fundamentally new research agenda. Besides cross-fertilization with other cornerstone machine learning and knowledge representation methods, a broad range of disciplines will benefit from this endeavor, including, psychology, sociology, economics, web science, knowledge engineering, scientometrics.
摘要
顺序是度量(经验)数据中对象之间关系的主要工具之一。然而,与利用对象数值属性的方法相比,已开发的顺序方法数量相对较少。其原因有二:一是上个世纪顺序计算所需的计算资源有限;二是基于顺序的方法常被认为数学上过于严格,难以应用于真实数据。在这篇论文中,我们讨论了度量和"计算"顺序结构(一类特殊的有向图)的不同方法,并说明如何从中推断知识。我们的目标是将顺序数据科学(Ordinal Data Science)确立为一个全新的研究议程。除了与机器学习和知识表示等基础方法的相互融合之外,这一方向还将惠及心理学、社会学、经济学、网络科学、知识工程和科学计量学等众多学科。
A decision framework for selecting information-transfer strategies in population-based SHM
paper_authors: Aidan J. Hughes, Jack Poole, Nikolaos Dervilis, Paul Gardner, Keith Worden
for: 为基于人口的结构健康监测(SHM)系统提供决策支持,以降低结构运行和维护的成本并提高安全性。
methods: 使用迁移学习技术,在各个结构之间共享信息,以缓解数据稀缺的问题。
results: 提出了一种基于信息迁移期望值的决策框架,用于选择最佳的迁移策略以避免负迁移;通过优化信息迁移策略,可降低结构运行和维护成本并提高安全性。Abstract
Decision-support for the operation and maintenance of structures provides significant motivation for the development and implementation of structural health monitoring (SHM) systems. Unfortunately, the limited availability of labelled training data hinders the development of the statistical models on which these decision-support systems rely. Population-based SHM seeks to mitigate the impact of data scarcity by using transfer learning techniques to share information between individual structures within a population. The current paper proposes a decision framework for selecting transfer strategies based upon a novel concept -- the expected value of information transfer -- such that negative transfer is avoided. By avoiding negative transfer, and by optimising information transfer strategies using the transfer-decision framework, one can reduce the costs associated with operating and maintaining structures, and improve safety.
摘要
结构健康监测(SHM)系统的开发与实施,为结构的运行和维护提供了重要的决策支持。然而,有限的标注训练数据阻碍了这些决策支持系统所依赖的统计模型的发展。基于人口的 SHM 试图通过迁移学习技术在群体内的各个结构之间共享信息,以减轻数据稀缺的影响。本文提出了一种基于新概念——信息迁移期望值——的决策框架,用于选择迁移策略,从而避免负迁移。通过避免负迁移并利用该迁移决策框架优化信息迁移策略,可以降低结构运行和维护的成本,并提高安全性。
Generalizing Supervised Deep Learning MRI Reconstruction to Multiple and Unseen Contrasts using Meta-Learning Hypernetworks
paper_authors: Sriprabha Ramanarayanan, Arun Palla, Keerthi Ram, Mohanasankar Sivaprakasam
for: This paper proposes a multimodal meta-learning model for image reconstruction, which aims to improve the knowledge generalization of imaging tasks by learning both shared and discriminative weights for various configurations of imaging tasks.
methods: The proposed model, called KM-MAML, uses hypernetworks to evolve mode-specific weights, and incorporates gradient-based meta-learning in the contextual space to update the weights of the hypernetworks for different modes. The model also uses a low-rank kernel modulation operation to provide mode-specific inductive bias for multiple modes.
results: The experiments on multi-contrast MRI reconstruction show that the proposed model exhibits superior reconstruction performance over joint training, other meta-learning methods, and context-specific MRI reconstruction methods, and better adaptation capabilities with improvement margins of 0.5 dB in PSNR and 0.01 in SSIM. Additionally, a representation analysis with U-Net shows that kernel modulation infuses 80% of mode-specific representation changes in the high-resolution layers.Abstract
Meta-learning has recently been an emerging data-efficient learning technique for various medical imaging operations and has helped advance contemporary deep learning models. Furthermore, meta-learning enhances the knowledge generalization of the imaging tasks by learning both shared and discriminative weights for various configurations of imaging tasks. However, existing meta-learning models attempt to learn a single set of weight initializations of a neural network that might be restrictive for multimodal data. This work aims to develop a multimodal meta-learning model for image reconstruction, which augments meta-learning with evolutionary capabilities to encompass diverse acquisition settings of multimodal data. Our proposed model called KM-MAML (Kernel Modulation-based Multimodal Meta-Learning), has hypernetworks that evolve to generate mode-specific weights. These weights provide the mode-specific inductive bias for multiple modes by re-calibrating each kernel of the base network for image reconstruction via a low-rank kernel modulation operation. We incorporate gradient-based meta-learning (GBML) in the contextual space to update the weights of the hypernetworks for different modes. The hypernetworks and the reconstruction network in the GBML setting provide discriminative mode-specific features and low-level image features, respectively. Experiments on multi-contrast MRI reconstruction show that our model, (i) exhibits superior reconstruction performance over joint training, other meta-learning methods, and context-specific MRI reconstruction methods, and (ii) better adaptation capabilities with improvement margins of 0.5 dB in PSNR and 0.01 in SSIM. Besides, a representation analysis with U-Net shows that kernel modulation infuses 80% of mode-specific representation changes in the high-resolution layers. Our source code is available at https://github.com/sriprabhar/KM-MAML/.
摘要
元学习近来已成为多种医学影像任务中一种新兴的数据高效学习技术,并推动了当代深度学习模型的发展。元学习通过学习各种影像任务配置下的共享权重与判别权重,增强了影像任务的知识泛化能力。然而,现有元学习模型试图为神经网络学习单一的一组权重初始化,这对多模态数据而言可能过于受限。本工作旨在开发一种用于图像重建的多模态元学习模型,通过引入演化能力使元学习能够涵盖多模态数据的多种采集设置。我们提出的模型称为 KM-MAML(基于核调制的多模态元学习),其超网络(hypernetworks)通过演化生成各模式特定的权重。这些权重通过低秩核调制操作重新校准基础重建网络的每个卷积核,为多种模式提供模式特定的归纳偏置。我们在上下文空间中引入基于梯度的元学习(GBML)来更新不同模式下超网络的权重。在 GBML 设置中,超网络与重建网络分别提供具有判别性的模式特定特征和低层图像特征。多对比度 MRI 重建实验表明,我们的模型(i)在重建性能上优于联合训练、其他元学习方法以及特定上下文的 MRI 重建方法,并且(ii)具有更好的适应能力,PSNR 提升 0.5 dB、SSIM 提升 0.01。此外,基于 U-Net 的表示分析表明,核调制在高分辨率层中注入了 80% 的模式特定表示变化。我们的源代码可在 https://github.com/sriprabhar/KM-MAML/ 获取。
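To make the low-rank kernel modulation idea concrete, the sketch below applies a rank-1, hypernetwork-generated channel modulation to a convolution kernel in PyTorch. The module names, rank, and wiring are illustrative assumptions, not the published KM-MAML code.

```python
import torch
import torch.nn as nn

class LowRankKernelModulation(nn.Module):
    """Re-calibrate a conv kernel with a rank-1 channel modulation (a sketch;
    the actual KM-MAML modulation may differ in detail)."""
    def __init__(self, conv: nn.Conv2d, mode_dim: int):
        super().__init__()
        out_c, in_c = conv.weight.shape[:2]
        self.conv = conv
        # Tiny "hypernetwork": maps a mode embedding to the low-rank factors.
        self.to_u = nn.Linear(mode_dim, out_c)
        self.to_v = nn.Linear(mode_dim, in_c)

    def forward(self, x, mode_embedding):
        u = self.to_u(mode_embedding)                # (out_c,)
        v = self.to_v(mode_embedding)                # (in_c,)
        scale = torch.outer(u, v)[..., None, None]   # (out_c, in_c, 1, 1)
        w = self.conv.weight * (1.0 + scale)         # modulated kernel
        return nn.functional.conv2d(x, w, self.conv.bias,
                                    stride=self.conv.stride,
                                    padding=self.conv.padding)

layer = LowRankKernelModulation(nn.Conv2d(3, 8, 3, padding=1), mode_dim=4)
y = layer(torch.randn(1, 3, 32, 32), torch.randn(4))
print(y.shape)  # torch.Size([1, 8, 32, 32])
```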
Privacy-Utility Trade-offs in Neural Networks for Medical Population Graphs: Insights from Differential Privacy and Graph Structure
results: 研究揭示了在医疗领域将差分隐私图神经网络应用于人口图的潜力与挑战,并发现图结构(特别是图同质性程度)与模型准确率之间存在相关性。Abstract
We initiate an empirical investigation into differentially private graph neural networks on population graphs from the medical domain by examining privacy-utility trade-offs at different privacy levels on both real-world and synthetic datasets and performing auditing through membership inference attacks. Our findings highlight the potential and the challenges of this specific DP application area. Moreover, we find evidence that the underlying graph structure constitutes a potential factor for larger performance gaps by showing a correlation between the degree of graph homophily and the accuracy of the trained model.
摘要
我们开展了一项实证研究,在医疗领域的人口图上考察差分隐私图神经网络在不同隐私级别下的隐私-效用权衡,实验涵盖真实数据集与合成数据集,并通过成员推断攻击进行审计。我们的发现凸显了这一特定差分隐私应用领域的潜力与挑战。此外,我们发现底层图结构可能是造成性能差距扩大的一个因素:图同质性程度与训练模型的准确率之间存在相关性。
Extended Graph Assessment Metrics for Graph Neural Networks
results: 论文通过对不同的医疗人口图和不同的学习设置进行测试,显示了这些指标与模型性能之间的相关性。Abstract
When re-structuring patient cohorts into so-called population graphs, initially independent data points can be incorporated into one interconnected graph structure. This population graph can then be used for medical downstream tasks using graph neural networks (GNNs). The construction of a suitable graph structure is a challenging step in the learning pipeline that can have severe impact on model performance. To this end, different graph assessment metrics have been introduced to evaluate graph structures. However, these metrics are limited to classification tasks and discrete adjacency matrices, only covering a small subset of real-world applications. In this work, we introduce extended graph assessment metrics (GAMs) for regression tasks and continuous adjacency matrices. We focus on two GAMs in specific: homophily and cross-class neighbourhood similarity (CCNS). We extend the notion of GAMs to more than one hop, define homophily for regression tasks, as well as continuous adjacency matrices, and propose a light-weight CCNS distance for discrete and continuous adjacency matrices. We show the correlation of these metrics with model performance on different medical population graphs and under different learning settings.
摘要
将患者队列重构为所谓的人口图后,原本相互独立的数据点便可以被纳入一个互连的图结构中,进而利用图神经网络(GNN)执行医疗下游任务。构建合适的图结构是学习流程中的一个挑战环节,可能对模型性能产生重大影响。为此,已有多种图评估指标(GAM)被提出用于评估图结构。然而,这些指标仅适用于分类任务和离散邻接矩阵,只覆盖了实际应用中的一小部分场景。在本工作中,我们为回归任务和连续邻接矩阵引入了扩展的图评估指标。我们重点关注两个 GAM:同质性(homophily)和跨类邻域相似度(CCNS)。我们将 GAM 的概念扩展到多跳邻域,为回归任务及连续邻接矩阵定义了同质性,并为离散与连续邻接矩阵提出了一种轻量级的 CCNS 距离。我们展示了这些指标在不同医疗人口图和不同学习设置下与模型性能的相关性。
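As one plausible reading of homophily for a continuous adjacency matrix (the paper's exact GAM definitions may differ), the following sketch measures the fraction of edge weight connecting same-label nodes.

```python
import numpy as np

def weighted_homophily(A, y):
    """Fraction of (continuous) edge weight that connects same-label nodes.
    One plausible extension of homophily to continuous adjacency matrices;
    the paper's exact definition may differ."""
    same = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    A = A.copy()
    np.fill_diagonal(A, 0.0)
    return (A * same).sum() / A.sum()

A = np.random.rand(5, 5)          # continuous adjacency (e.g., similarity scores)
A = (A + A.T) / 2                 # symmetrize
y = np.array([0, 0, 1, 1, 1])     # node labels
print(weighted_homophily(A, y))
```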
Neuro-symbolic Empowered Denoising Diffusion Probabilistic Models for Real-time Anomaly Detection in Industry 4.0
paper_authors: Luigi Capogrosso, Alessio Mascolini, Federico Girella, Geri Skenderi, Sebastiano Gaiardelli, Nicola Dall’Ora, Francesco Ponzio, Enrico Fraccaroli, Santa Di Cataldo, Sara Vinco, Enrico Macii, Franco Fummi, Marco Cristani
results: 该模型可以提供简单而有效的异常预测方法,并可部署在嵌入式系统上,从而直接集成到生产过程中。Abstract
Industry 4.0 involves the integration of digital technologies, such as IoT, Big Data, and AI, into manufacturing and industrial processes to increase efficiency and productivity. As these technologies become more interconnected and interdependent, Industry 4.0 systems become more complex, which brings the difficulty of identifying and stopping anomalies that may cause disturbances in the manufacturing process. This paper aims to propose a diffusion-based model for real-time anomaly prediction in Industry 4.0 processes. Using a neuro-symbolic approach, we integrate industrial ontologies in the model, thereby adding formal knowledge on smart manufacturing. Finally, we propose a simple yet effective way of distilling diffusion models through Random Fourier Features for deployment on an embedded system for direct integration into the manufacturing process. To the best of our knowledge, this approach has never been explored before.
摘要
工业4.0涉及将物联网、大数据和人工智能等数字技术整合到制造与工业流程中,以提高效率和生产力。随着这些技术变得更加相互连接和相互依赖,工业4.0系统也变得更加复杂,从而加大了识别并阻止可能干扰生产过程的异常的难度。本文提出了一种基于扩散模型的工业4.0流程实时异常预测模型。我们采用神经符号方法,将工业本体集成到模型中,从而为智能制造注入形式化知识。此外,我们还提出了一种简单而有效的扩散模型蒸馏方法,借助随机傅里叶特征(Random Fourier Features)进行压缩,以便部署在嵌入式系统上并直接集成到生产过程中。据我们所知,这一方法此前从未被探索过。
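The distillation step relies on Random Fourier Features; as background, here is the classic RFF construction (Rahimi and Recht) that approximates an RBF kernel with an explicit feature map. This shows the ingredient only, not the authors' diffusion-model distillation pipeline.

```python
import numpy as np

def random_fourier_features(X, n_features=128, gamma=1.0, seed=0):
    """Explicit map z(x) with z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.randn(4, 10)
Z = random_fourier_features(X)
K_approx = Z @ Z.T            # approximates the RBF Gram matrix
K_exact = np.exp(-1.0 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
print(np.abs(K_approx - K_exact).max())   # small approximation error
```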
Cramer Type Distances for Learning Gaussian Mixture Models by Gradient Descent
results: 本文通过一个 Gaussian Mixture Distributional Deep Q Network 的示例,证明了该方法的效果。与之前的模型相比,这种模型具有参数效率和更好的解释性。Abstract
The learning of Gaussian Mixture Models (also referred to simply as GMMs) plays an important role in machine learning. Known for their expressiveness and interpretability, Gaussian mixture models have a wide range of applications, from statistics, computer vision to distributional reinforcement learning. However, as of today, few known algorithms can fit or learn these models, some of which include Expectation-Maximization algorithms and Sliced Wasserstein Distance. Even fewer algorithms are compatible with gradient descent, the common learning process for neural networks. In this paper, we derive a closed formula of two GMMs in the univariate, one-dimensional case, then propose a distance function called Sliced Cramér 2-distance for learning general multivariate GMMs. Our approach has several advantages over many previous methods. First, it has a closed-form expression for the univariate case and is easy to compute and implement using common machine learning libraries (e.g., PyTorch and TensorFlow). Second, it is compatible with gradient descent, which enables us to integrate GMMs with neural networks seamlessly. Third, it can fit a GMM not only to a set of data points, but also to another GMM directly, without sampling from the target model. And fourth, it has some theoretical guarantees like global gradient boundedness and unbiased sampling gradient. These features are especially useful for distributional reinforcement learning and Deep Q Networks, where the goal is to learn a distribution over future rewards. We will also construct a Gaussian Mixture Distributional Deep Q Network as a toy example to demonstrate its effectiveness. Compared with previous models, this model is parameter efficient in terms of representing a distribution and possesses better interpretability.
摘要
高斯混合模型(GMM)的学习在机器学习中扮演着重要角色。GMM 以其表达力和可解释性著称,在统计学、计算机视觉以及分布型强化学习等领域有广泛应用。然而,迄今能够拟合或学习这类模型的算法很少,其中包括期望最大化(EM)算法和切片 Wasserstein 距离;能与梯度下降(神经网络的常用学习方式)兼容的算法则更少。在这篇论文中,我们推导了一维情形下两个 GMM 之间距离的闭式公式,进而提出了一种名为切片 Cramér 2-距离(Sliced Cramér 2-distance)的距离函数,用于学习一般的多元 GMM。我们的方法相比以往许多方法具有以下优点:其一,在一维情形下具有闭式表达,易于计算,并可使用常见的机器学习库(如 PyTorch 和 TensorFlow)实现;其二,它与梯度下降兼容,可与神经网络无缝结合;其三,它不仅可以将 GMM 拟合到一组数据点,还可以直接拟合到另一个 GMM,而无需从目标模型采样;其四,它具有全局梯度有界性和采样梯度无偏性等理论保证。这些特性对分布型强化学习和 Deep Q Network 尤其有用,其目标是学习未来奖励的分布。我们还构建了一个高斯混合分布式 Deep Q Network 作为示例,以展示其有效性。与以往模型相比,该模型在表示分布时参数更高效,且具有更好的可解释性。
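The paper derives a closed-form distance for GMMs; as a purely numerical stand-in for intuition, the sketch below estimates the squared 1-D Cramér 2-distance between two samples as the integral of the squared CDF difference, and averages it over random projections ("slicing"). Sample sizes and slice counts are arbitrary illustrative choices.

```python
import numpy as np

def cramer2_1d(x, y):
    """Numerical estimate of the squared Cramér 2-distance between two 1-D
    samples: the integral of (F(t) - G(t))^2 dt over the pooled support."""
    z = np.sort(np.concatenate([x, y]))
    F = np.searchsorted(np.sort(x), z, side="right") / len(x)
    G = np.searchsorted(np.sort(y), z, side="right") / len(y)
    return float(np.sum((F[:-1] - G[:-1]) ** 2 * np.diff(z)))

def sliced_cramer2(X, Y, n_slices=50, seed=0):
    """Average the 1-D distance over random projection directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_slices):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        total += cramer2_1d(X @ w, Y @ w)
    return total / n_slices

X = np.random.randn(500, 3)
Y = np.random.randn(500, 3) + 0.5
print(sliced_cramer2(X, Y))
```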
Learning Multiple Coordinated Agents under Directed Acyclic Graph Constraints
results: 我们在四个 DAG 环境中进行了实验,包括来自英特尔某高产量封装测试工厂的真实调度任务。结果表明,我们的方法在 DAG 约束下优于非 DAG 方法。Abstract
This paper proposes a novel multi-agent reinforcement learning (MARL) method to learn multiple coordinated agents under directed acyclic graph (DAG) constraints. Unlike existing MARL approaches, our method explicitly exploits the DAG structure between agents to achieve more effective learning performance. Theoretically, we propose a novel surrogate value function based on a MARL model with synthetic rewards (MARLM-SR) and prove that it serves as a lower bound of the optimal value function. Computationally, we propose a practical training algorithm that exploits new notions of a leader agent and a reward generator and distributor agent to guide the decomposed follower agents to better explore the parameter space in environments with DAG constraints. Empirically, we use four DAG environments, including a real-world scheduling task from one of Intel's high-volume packaging and test factories, to benchmark our method and show that it outperforms other non-DAG approaches.
摘要
本文提出了一种新颖的多智能体强化学习(MARL)方法,用于在有向无环图(DAG)约束下学习多个协同智能体。与现有 MARL 方法不同,我们的方法显式利用智能体之间的 DAG 结构以取得更有效的学习性能。理论上,我们基于带合成奖励的 MARL 模型(MARLM-SR)提出了一种新的替代价值函数,并证明其为最优价值函数的下界。计算上,我们提出了一种实用的训练算法,利用"领导者智能体"以及"奖励生成与分配智能体"这两个新概念,引导分解后的跟随者智能体在具有 DAG 约束的环境中更好地探索参数空间。实验上,我们在四个 DAG 环境(包括英特尔某高产量封装测试工厂的真实调度任务)中对方法进行了基准测试,结果表明其优于其他非 DAG 方法。
Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
results: 数值研究基于中国厦门及其周边城市的实际数据表明,提出的框架可以有效缓解供应和需求不均衡,并实现了显著提高每天系统利润和订单完成率。Abstract
The integrated development of city clusters has given rise to an increasing demand for intercity travel. Intercity ride-pooling service exhibits considerable potential in upgrading traditional intercity bus services by implementing demand-responsive enhancements. Nevertheless, its online operations suffer the inherent complexities due to the coupling of vehicle resource allocation among cities and pooled-ride vehicle routing. To tackle these challenges, this study proposes a two-level framework designed to facilitate online fleet management. Specifically, a novel multi-agent feudal reinforcement learning model is proposed at the upper level of the framework to cooperatively assign idle vehicles to different intercity lines, while the lower level updates the routes of vehicles using an adaptive large neighborhood search heuristic. Numerical studies based on the realistic dataset of Xiamen and its surrounding cities in China show that the proposed framework effectively mitigates the supply and demand imbalances, and achieves significant improvement in both the average daily system profit and order fulfillment ratio.
摘要
城市群的一体化发展带来了日益增长的城际出行需求。城际合乘服务通过需求响应式改进,在升级传统城际巴士服务方面展现出可观的潜力。然而,由于城市间车辆资源分配与合乘车辆路径规划相互耦合,其线上运营存在固有的复杂性。为应对这些挑战,本研究提出了一个旨在支持线上车队管理的两层框架:上层提出一种新颖的多智能体封建强化学习模型,以协同方式将闲置车辆分配至不同的城际线路;下层则利用自适应大邻域搜索启发式算法更新车辆路径。基于中国厦门及其周边城市真实数据的数值研究表明,所提框架能有效缓解供需失衡,并在日均系统利润和订单完成率上取得显著提升。
Implicit regularization in AI meets generalized hardness of approximation in optimization – Sharp results for diagonal linear networks
results: 这篇论文给出了新的、精确的收敛界,证明了 DLN 的梯度流能够逼近基追踪优化问题的极小化解;并且当极小化解不唯一时,梯度流选取哪一个 $\ell^1$ 极小化解取决于 DLN 的深度。Abstract
Understanding the implicit regularization imposed by neural network architectures and gradient based optimization methods is a key challenge in deep learning and AI. In this work we provide sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks (DLNs) in the over-parameterized regression setting and, potentially surprisingly, link this to the phenomenon of phase transitions in generalized hardness of approximation (GHA). GHA generalizes the phenomenon of hardness of approximation from computer science to, among others, continuous and robust optimization. It is well-known that the $\ell^1$-norm of the gradient flow of DLNs with tiny initialization converges to the objective function of basis pursuit. We improve upon these results by showing that the gradient flow of DLNs with tiny initialization approximates minimizers of the basis pursuit optimization problem (as opposed to just the objective function), and we obtain new and sharp convergence bounds with respect to the initialization size. Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem -- which is a contradiction -- thus implying sharpness. Moreover, we characterize which $\ell^1$ minimizer of the basis pursuit problem is chosen by the gradient flow whenever the minimizer is not unique. Interestingly, this depends on the depth of the DLN.
摘要
理解神经网络结构和基于梯度的优化方法所施加的隐式正则化,是深度学习与人工智能中的一个关键挑战。在这项工作中,我们针对过参数化回归设置,给出了对角线性网络(DLN)梯度流所施加隐式正则化的精确刻画,并且(或许令人意外地)将其与广义近似难度(GHA)中的相变现象联系起来。GHA 将计算机科学中的近似难度现象推广到了连续优化与鲁棒优化等领域。众所周知,初始化极小的 DLN 梯度流的 $\ell^1$ 范数会收敛到基追踪(basis pursuit)优化问题的目标函数值。我们改进了这一结果:证明初始化极小的 DLN 梯度流逼近的是基追踪优化问题的极小化解(而不仅是目标函数值),并获得了关于初始化规模的全新且精确的收敛界。若我们的结果不精确,则意味着 GHA 现象不会出现在基追踪优化问题中——这将导致矛盾——从而证明了结果的精确性。此外,当极小化解不唯一时,我们刻画了梯度流所选取的是基追踪问题的哪一个 $\ell^1$ 极小化解;有趣的是,这取决于 DLN 的深度。
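A toy numpy sketch of the phenomenon studied: a depth-2 diagonal linear network with tiny initialization, trained by plain gradient descent on an under-determined regression problem, drives the effective weights toward a sparse (small $\ell^1$) interpolant. Data, step size, and iteration count are untuned illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                              # under-determined: many interpolants
X = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:3] = [1.0, -2.0, 1.5]   # sparse ground truth
y = X @ w_star

alpha, lr = 1e-4, 1e-2                     # tiny initialization, untuned step size
u = alpha * np.ones(d)
v = alpha * np.ones(d)                     # depth-2 DLN: w = u*u - v*v

for _ in range(50000):
    w = u * u - v * v
    g = X.T @ (X @ w - y) / n              # gradient of mean squared residual wrt w
    u, v = u - lr * 2 * g * u, v + lr * 2 * g * v   # chain rule through w = u^2 - v^2

print(np.round(u * u - v * v, 2)[:6])      # approaches the sparse interpolant
```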
MPR-Net:Multi-Scale Pattern Reproduction Guided Universality Time Series Interpretable Forecasting
results: 该模型在多个真实数据集上进行了充分的实验,取得了最先进的预测性能,同时具有良好的泛化性和鲁棒性。Abstract
Time series forecasting has received wide interest from existing research due to its broad applications and inherent challenges. The research challenge lies in identifying effective patterns in historical series and applying them to future forecasting. Advanced models based on point-wise connected MLP and Transformer architectures have strong fitting power, but their quadratic computational complexity limits practicality. Additionally, those structures inherently disrupt the temporal order, reducing the information utilization and making the forecasting process uninterpretable. To solve these problems, this paper proposes a forecasting model, MPR-Net. It first adaptively decomposes multi-scale historical series patterns using a convolution operation, then constructs a pattern extension forecasting method based on the prior knowledge of pattern reproduction, and finally reconstructs future patterns into future series using a deconvolution operation. By leveraging the temporal dependencies present in the time series, MPR-Net not only achieves linear time complexity, but also makes the forecasting process interpretable. Extensive experiments on more than ten real datasets covering both short- and long-term forecasting tasks show that MPR-Net achieves state-of-the-art forecasting performance, as well as good generalization and robustness.
摘要
时间序列预测因其广泛的应用和内在的挑战性而受到研究界的广泛关注。研究的难点在于从历史序列中识别有效模式,并将其应用于未来预测。基于逐点连接 MLP 和 Transformer 架构的先进模型具有强大的拟合能力,但其二次计算复杂度限制了实用性;而且这类结构本身会破坏时间顺序,降低信息利用率,使预测过程不可解释。为解决这些问题,本文提出了一种预测模型 MPR-Net。它首先利用卷积操作自适应地分解多尺度历史序列模式,然后基于模式重现的先验知识构建模式延展预测方法,最后利用反卷积操作将未来模式重构为未来序列。通过利用时间序列中的时间依赖关系,MPR-Net 不仅实现了线性时间复杂度,还使预测过程具有可解释性。在十余个真实数据集上针对短期与长期预测任务进行的充分实验表明,MPR-Net 取得了最先进的预测性能,并具有良好的泛化性和鲁棒性。
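A minimal sketch of the decompose-then-reproduce idea: strided 1-D convolutions extract patterns at several scales and transposed convolutions map them back to series space. The scales, channel counts, and averaging are assumptions, not the published MPR-Net architecture.

```python
import torch
import torch.nn as nn

class MultiScaleDecomp(nn.Module):
    """Decompose a series into patterns at several scales with strided
    convolutions, then map them back with transposed convolutions."""
    def __init__(self, scales=(2, 4, 8), channels=8):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Conv1d(1, channels, kernel_size=s, stride=s) for s in scales)
        self.dec = nn.ModuleList(
            nn.ConvTranspose1d(channels, 1, kernel_size=s, stride=s) for s in scales)

    def forward(self, x):                 # x: (batch, 1, length)
        patterns = [e(x) for e in self.enc]
        recon = sum(d(p) for d, p in zip(self.dec, patterns)) / len(patterns)
        return patterns, recon

model = MultiScaleDecomp()
x = torch.randn(2, 1, 64)                 # length divisible by all scales
patterns, recon = model(x)
print([p.shape for p in patterns], recon.shape)
```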
Breaking 3-Factor Approximation for Correlation Clustering in Polylogarithmic Rounds
results: 我们的方法可以达到优于 3 的近似比,并且仅需 $\tilde{O}(m^{1.5})$ 的工作量和内存。此外,我们还证明了该方法可以转化为串行算法和 MPC 算法。Abstract
In this paper, we study parallel algorithms for the correlation clustering problem, where every pair of two different entities is labeled with similar or dissimilar. The goal is to partition the entities into clusters to minimize the number of disagreements with the labels. Currently, all efficient parallel algorithms have an approximation ratio of at least 3. In comparison with the $1.994+\epsilon$ ratio achieved by polynomial-time sequential algorithms [CLN22], a significant gap exists. We propose the first poly-logarithmic depth parallel algorithm that achieves a better approximation ratio than 3. Specifically, our algorithm computes a $(2.4+\epsilon)$-approximate solution and uses $\tilde{O}(m^{1.5})$ work. Additionally, it can be translated into a $\tilde{O}(m^{1.5})$-time sequential algorithm and a poly-logarithmic rounds sublinear-memory MPC algorithm with $\tilde{O}(m^{1.5})$ total memory. Our approach is inspired by Awerbuch, Khandekar, and Rao's [AKR12] length-constrained multi-commodity flow algorithm, where we develop an efficient parallel algorithm to solve a truncated correlation clustering linear program of Charikar, Guruswami, and Wirth [CGW05]. Then we show the solution of the truncated linear program can be rounded with a factor of at most 2.4 loss by using the framework of [CMSY15]. Such a rounding framework can then be implemented using parallel pivot-based approaches.
摘要
在这篇论文中,我们研究相关聚类问题的并行算法:每一对不同实体都被标注为"相似"或"不相似",目标是将实体划分为若干簇,使与标注不一致的数量最小。目前所有高效的并行算法的近似比都不低于 3;与多项式时间串行算法可达的 $1.994+\epsilon$ 近似比 [CLN22] 相比,存在显著差距。我们提出了首个近似比优于 3 的多对数深度并行算法:该算法计算出 $(2.4+\epsilon)$-近似解,并使用 $\tilde{O}(m^{1.5})$ 的工作量。此外,它还可以转化为一个 $\tilde{O}(m^{1.5})$ 时间的串行算法,以及一个多对数轮次、次线性内存的 MPC 算法,总内存为 $\tilde{O}(m^{1.5})$。我们的方法受 Awerbuch、Khandekar 和 Rao [AKR12] 的长度受限多商品流算法启发:我们开发了一个高效的并行算法来求解 Charikar、Guruswami 和 Wirth [CGW05] 的截断相关聚类线性规划,然后证明该截断线性规划的解可以利用 [CMSY15] 的框架以至多 2.4 倍的损失进行舍入,而这一舍入框架可以通过并行的基于枢轴的方法实现。
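For contrast with the paper's parallel $(2.4+\epsilon)$-approximation, here is the classic randomized Pivot algorithm (a 3-approximation in expectation) on which simple pivot-based approaches build; this is the sequential textbook version, not the paper's algorithm.

```python
import random

def pivot_correlation_clustering(n, similar, seed=0):
    """Classic randomized Pivot for correlation clustering on complete graphs.
    `similar` is a set of frozensets {i, j} labelled '+'; all other pairs
    are treated as '-'."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)                       # random pivot order
    remaining = set(order)
    clusters = []
    for p in order:
        if p not in remaining:
            continue
        # The pivot grabs all still-unclustered '+' neighbours.
        cluster = {p} | {q for q in remaining
                         if q != p and frozenset((p, q)) in similar}
        remaining -= cluster
        clusters.append(cluster)
    return clusters

similar = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (3, 4)]}
print(pivot_correlation_clustering(5, similar))
```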
Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
results: 该论文在多领域任务型对话语料库 MultiWOZ 上进行了评估,证明所提方法能够在保留对抗学习优势的同时,消除对其的依赖。Abstract
Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.
摘要
对话策略决定系统在每个对话轮次根据当前状态采取的行动,对对话的成功至关重要。近年来,强化学习(RL)已成为对话策略学习(DPL)中一种颇具前景的选择。在基于 RL 的 DPL 中,对话策略依据奖励进行更新。在状态-动作组合数量庞大的多领域任务型对话场景中,手动构建细粒度奖励(如基于状态-动作的奖励)以有效引导对话策略是一大挑战。一种从已收集数据中估计奖励的方法是利用对抗学习(AL)同时训练奖励估计器和对话策略。尽管该方法在实验中表现优异,但它受制于对抗学习的固有问题,如模式坍塌。本文首先通过对对话策略和奖励估计器目标函数的细致分析,厘清了 AL 在 DPL 中的作用;随后基于这些分析,提出了一种在保留 AL 优势的同时将其从奖励估计和 DPL 中移除的方法。我们使用多领域任务型对话语料库 MultiWOZ 对方法进行了评估。
Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models
results: 研究结果显示,该方法在提示中不同训练样本数的设置下均优于未适配的模型,也优于此前不使用适配数据进行校准的方法。Abstract
A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and only few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different numbers of training shots in the prompt, as well as a previous approach where calibration is performed without using any adaptation data.
摘要
目前,许多自然语言任务正在使用大规模语言模型(LLM)处理。这些模型通常利用海量无监督文本数据进行训练,再通过微调、校准或上下文学习等方法适配到下游自然语言任务。在这项工作中,我们提出了一种无需标注样本、仅需少量领域内查询样本即可适配先验类别分布以执行文本分类任务的方法。该方法将 LLM 视为黑盒,增加了一个将模型后验概率向任务进行校准的阶段。结果表明,在提示中不同训练样本数(shots)的设置下,该方法均优于未适配的模型,也优于此前一种不使用任何适配数据进行校准的方法。
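One concrete way to realize unsupervised prior adaptation is EM-style re-estimation of the class prior from unlabelled queries (Saerens et al., 2002); the sketch below applies it to black-box posteriors. The paper's exact calibration procedure may differ.

```python
import numpy as np

def adapt_prior(posteriors, n_iter=20):
    """EM-style prior re-estimation from unlabelled queries: re-weight each
    black-box posterior by the ratio of the estimated target prior to the
    model's implicit prior, then re-estimate the target prior."""
    p_src = posteriors.mean(axis=0)          # proxy for the model's implicit prior
    p_tgt = p_src.copy()
    for _ in range(n_iter):
        w = posteriors * (p_tgt / p_src)     # E-step: re-weighted posteriors
        w /= w.sum(axis=1, keepdims=True)
        p_tgt = w.mean(axis=0)               # M-step: new prior estimate
    return w

# Hypothetical LLM posteriors for four queries over two classes.
posteriors = np.array([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1], [0.55, 0.45]])
print(adapt_prior(posteriors))
```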
GRAN is superior to GraphRNN: node orderings, kernel- and graph embeddings-based metrics for graph generators
results: 研究发现,在图嵌入空间中,基于流形的评估指标优于基于核的指标;GRAN 优于 GraphRNN;而采用深度优先搜索节点排序改进后的 GraphRNN 对小规模图也很有效。此外,本文还提供了关于数据集选择和节点特征初始化的实践指南。Abstract
A wide variety of generative models for graphs have been proposed. They are used in drug discovery, road networks, neural architecture search, and program synthesis. Generating graphs has theoretical challenges, such as isomorphic representations -- evaluating how well a generative model performs is difficult. Which model to choose depending on the application domain? We extensively study kernel-based metrics on distributions of graph invariants and manifold-based and kernel-based metrics in graph embedding space. Manifold-based metrics outperform kernel-based metrics in embedding space. We use these metrics to compare GraphRNN and GRAN, two well-known generative models for graphs, and unveil the influence of node orderings. It shows the superiority of GRAN over GraphRNN - further, our proposed adaptation of GraphRNN with a depth-first search ordering is effective for small-sized graphs. A guideline on good practices regarding dataset selection and node feature initialization is provided. Our work is accompanied by open-source code and reproducible experiments.
摘要
目前已有多种面向图的生成模型被提出,应用于药物发现、道路网络、神经架构搜索和程序合成等领域。图生成存在理论上的挑战,例如同构表示——这使得评估生成模型的好坏十分困难。应根据应用领域选择哪种模型?我们系统地研究了作用于图不变量分布的基于核的指标,以及图嵌入空间中基于流形和基于核的指标。在嵌入空间中,基于流形的指标优于基于核的指标。我们利用这些指标比较了两种知名的图生成模型 GraphRNN 和 GRAN,并揭示了节点排序的影响。结果显示 GRAN 优于 GraphRNN;此外,我们提出的采用深度优先搜索排序的 GraphRNN 改进版本对小规模图很有效。我们还提供了关于数据集选择和节点特征初始化的良好实践指南。本工作附带开源代码和可复现的实验。
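A minimal kernel-based graph-generator metric in the spirit of those studied: MMD with an RBF kernel between node-degree samples of a real and a generated graph. The kernel, bandwidth, and choice of invariant (degrees) are illustrative assumptions.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Biased (V-statistic) MMD^2 with an RBF kernel between two 1-D samples,
    e.g. node degrees of a reference graph and of a generated graph."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

deg_real = np.array([2, 3, 3, 4, 2, 3], dtype=float)
deg_gen = np.array([1, 1, 2, 5, 6, 2], dtype=float)
print(mmd_rbf(deg_real, deg_gen))   # larger value = degree distributions differ more
```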
S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
paper_authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi
for: This paper targets the video prediction task, aiming to improve the accuracy and efficiency of video prediction models.
methods: The model combines two novel techniques: (i) a hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a spatiotemporal PixelCNN (ST-PixelCNN). The proposed model is called the sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE).
results: Experimental results show that, compared with other state-of-the-art video prediction techniques, S-HR-VQVAE achieves better performance in both quantitative and qualitative evaluations, despite having a much smaller model size.Abstract
We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the ST-PixelCNN's ability at handling spatiotemporal information, S-HR-VQVAE can better deal with chief challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters.
摘要
我们提出了一种新模型,它结合了(i)我们最近提出的层次残差向量量化变分自编码器(HR-VQVAE)和(ii)一种新的时空 PixelCNN(ST-PixelCNN)。我们将这种方法称为序列式层次残差学习向量量化变分自编码器(S-HR-VQVAE)。通过结合 HR-VQVAE 以简约表示建模静态图像的内在能力与 ST-PixelCNN 处理时空信息的能力,S-HR-VQVAE 能更好地应对视频预测中的主要挑战,包括学习时空信息、处理高维数据、对抗模糊预测以及对物理特性的隐式建模。我们在 KTH Human Action 和 Moving-MNIST 任务上进行了大量实验,结果表明,尽管模型规模小得多,我们的模型在定量与定性评估中均不逊于甚至优于当前最佳的视频预测技术。最后,我们提出了一种联合估计 HR-VQVAE 与 ST-PixelCNN 参数的新训练方法,进一步提升了 S-HR-VQVAE 的性能。
Short Boolean Formulas as Explanations in Practice
paper_authors: Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Masood Feyzbakhsh Rankooh, Miikka Vilander
for: 这篇论文研究如何在基于一元关系的数据模型中,通过简短的布尔公式对目标属性做出解释。
methods: 该论文使用长度为 k 的简短布尔公式作为解释,使其相对目标属性的误差最小。
results: 研究发现,使用简短的布尔公式可以获得相当准确的解释,并且可以避免过拟合。Abstract
We investigate explainability via short Boolean formulas in the data model based on unary relations. As an explanation of length k, we take a Boolean formula of length k that minimizes the error with respect to the target attribute to be explained. We first provide novel quantitative bounds for the expected error in this scenario. We then also demonstrate how the setting works in practice by studying three concrete data sets. In each case, we calculate explanation formulas of different lengths using an encoding in Answer Set Programming. The most accurate formulas we obtain achieve errors similar to other methods on the same data sets. However, due to overfitting, these formulas are not necessarily ideal explanations, so we use cross validation to identify a suitable length for explanations. By limiting to shorter formulas, we obtain explanations that avoid overfitting but are still reasonably accurate and also, importantly, human interpretable.
摘要
我们研究在基于一元关系的数据模型中,利用简短布尔公式实现可解释性:长度为 k 的解释即是使相对目标属性误差最小的长度为 k 的布尔公式。我们首先给出了该情形下期望误差的新的定量界,随后在三个具体数据集上展示了这一框架的实际效果:利用回答集编程(Answer Set Programming)编码计算不同长度的解释公式。我们得到的最准确公式在相同数据集上的误差与其他方法相当;但由于过拟合,这些公式未必是理想的解释,因此我们使用交叉验证来确定合适的解释长度。通过限制使用较短的公式,我们获得了既能避免过拟合、又保持相当准确、并且便于人类理解的解释。
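As a brute-force stand-in for the Answer Set Programming encoding, the sketch below searches conjunctions of at most k literals over binary attributes and returns the one with the lowest empirical error; the paper searches over general Boolean formulas, not just conjunctions.

```python
import itertools
import numpy as np

def best_conjunction(X, y, k=2):
    """Exhaustively search conjunctions of at most k literals over binary
    attributes and return the one minimizing empirical error."""
    n, d = X.shape
    best, best_err = None, np.inf
    literals = [(j, s) for j in range(d) for s in (0, 1)]   # (attribute, polarity)
    for size in range(1, k + 1):
        for combo in itertools.combinations(literals, size):
            pred = np.ones(n, dtype=bool)
            for j, s in combo:
                pred &= (X[:, j] == s)
            err = np.mean(pred != y)
            if err < best_err:
                best, best_err = combo, err
    return best, best_err

X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([True, True, False, False])
print(best_conjunction(X, y, k=2))   # here, the single literal x0 == 1 is exact
```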
IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation
results: 本文的实验表明,传统的 KGE 模型无法捕捉语义;IntelliGraphs 数据集及其生成器有助于推动注重语义理解的机器学习模型的发展。Abstract
Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations. A key task in the literature is predicting missing links between entities. However, Knowledge Graphs are not just sets of links but also have semantics underlying their structure. Semantics is crucial in several downstream tasks, such as query answering or reasoning. We introduce the subgraph inference task, where a model has to generate likely and semantically valid subgraphs. We propose IntelliGraphs, a set of five new Knowledge Graph datasets. The IntelliGraphs datasets contain subgraphs with semantics expressed in logical rules for evaluating subgraph inference. We also present the dataset generator that produced the synthetic datasets. We designed four novel baseline models, which include three models based on traditional KGEs. We evaluate their expressiveness and show that these models cannot capture the semantics. We believe this benchmark will encourage the development of machine learning models that emphasize semantic understanding.
摘要
知识图嵌入(KGE)模型用于学习实体和关系的连续表示,文献中的一项关键任务是预测实体之间缺失的链接。然而,知识图不仅是链接的集合,其结构背后还蕴含语义;这些语义对查询回答、推理等下游任务至关重要。我们引入了子图推断任务:模型需要生成合理且语义有效的子图。我们提出了 IntelliGraphs,一组共五个新的知识图数据集,其中包含以逻辑规则表达语义的子图,用于评估子图推断;我们同时介绍了生成这些合成数据集的数据生成器。我们设计了四种新的基线模型,其中三种基于传统 KGE。我们评估了它们的表达能力,并证明这些模型无法捕捉语义。我们相信这一基准将推动更加注重语义理解的机器学习模型的发展。
Ageing Analysis of Embedded SRAM on a Large-Scale Testbed Using Machine Learning
results: 研究发现,尽管老化的影响十分细微,我们的指标仍能较好地估算节点的使用时长:回归模型的 $R^2$ 得分为 0.77、平均误差为 24%;在六个月分辨率下,分类器的 F1 得分超过 0.6。Abstract
Ageing detection and failure prediction are essential in many Internet of Things (IoT) deployments, which operate huge quantities of embedded devices unattended in the field for years. In this paper, we present a large-scale empirical analysis of natural SRAM wear-out using 154 boards from a general-purpose testbed. Starting from SRAM initialization bias, which each node can easily collect at startup, we apply various metrics for feature extraction and experiment with common machine learning methods to predict the age of operation for this node. Our findings indicate that even though ageing impacts are subtle, our indicators can well estimate usage times with an $R^2$ score of 0.77 and a mean error of 24% using regressors, and with an F1 score above 0.6 for classifiers applying a six-months resolution.
摘要
在许多物联网(IoT)部署中,海量嵌入式设备在现场无人值守地运行多年,因此老化检测与故障预测至关重要。本文基于一个通用测试平台上的 154 块开发板,对 SRAM 的自然老化进行了大规模实证分析。我们从每个节点启动时即可轻松采集的 SRAM 初始化偏置出发,应用多种指标进行特征提取,并尝试用常见的机器学习方法预测节点的已运行时长。我们的结果表明,尽管老化的影响十分细微,这些指标仍能较好地估计使用时长:回归模型的 $R^2$ 得分为 0.77、平均误差为 24%;在六个月分辨率下,分类器的 F1 得分超过 0.6。
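A hedged sketch of the pipeline: extract simple statistics from each node's SRAM startup bias and regress the age with cross-validation. All data below are synthetic placeholders with an invented wear model; the paper uses real testbed measurements and several feature metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: startup SRAM bit values per node (1 = powered up as '1').
rng = np.random.default_rng(0)
n_nodes, n_bits = 154, 4096
ages_months = rng.uniform(0, 60, n_nodes)
# Invented toy wear model: ageing nudges a small fraction of cells toward '1'.
base_bias = rng.random((n_nodes, n_bits)) < 0.5
drift = rng.random((n_nodes, n_bits)) < (ages_months[:, None] / 6000.0)
sram = base_bias | drift

# Feature extraction from the initialization bias (guesses at useful metrics).
ones_fraction = sram.mean(axis=1)
features = np.column_stack([ones_fraction, sram[:, :256].mean(axis=1)])

model = RandomForestRegressor(n_estimators=100, random_state=0)
print(cross_val_score(model, features, ages_months, scoring="r2", cv=5))
```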
Aeolus Ocean – A simulation environment for the autonomous COLREG-compliant navigation of Unmanned Surface Vehicles using Deep Reinforcement Learning and Maritime Object Detection
paper_authors: Andrew Alexander Vekinis, Stavros Perantonis
for: 帮助无人水面艇(USV)在海事领域实现自主导航,以提高安全性并降低运营成本,同时为海洋研究、探索和监测提供新的可能性。
methods: 使用深度强化学习(DRL)和计算机视觉(CV)算法,在逼真的海洋仿真环境中创建符合 COLREG 的数字孪生,以开发和引导 USV 控制系统。
results: 经此方法训练的自主智能体在开阔海域及沿海与其他船只相遇的多次航行任务中,均能成功抵达预设航点并避免碰撞。Abstract
Heading towards navigational autonomy in unmanned surface vehicles (USVs) in the maritime sector can fundamentally lead towards safer waters as well as reduced operating costs, while also providing a range of exciting new capabilities for oceanic research, exploration and monitoring. However, achieving such a goal is challenging. USV control systems must, safely and reliably, be able to adhere to the international regulations for preventing collisions at sea (COLREGs) in encounters with other vessels as they navigate to a given waypoint while being affected by realistic weather conditions, either during the day or at night. To deal with the multitude of possible scenarios, it is critical to have a virtual environment that is able to replicate the realistic operating conditions USVs will encounter, before they can be implemented in the real world. Such "digital twins" form the foundations upon which Deep Reinforcement Learning (DRL) and Computer Vision (CV) algorithms can be used to develop and guide USV control systems. In this paper we describe the novel development of a COLREG-compliant, DRL-based collision-avoidance navigational system with CV-based awareness in a realistic ocean simulation environment. The performance of the trained autonomous Agents resulting from this approach is evaluated in several successful navigations to set waypoints in both open sea and coastal encounters with other vessels. A binary executable version of the simulator with trained agents is available at https://github.com/aavek/Aeolus-Ocean
摘要
海事领域无人水面艇(USV)迈向航行自主,不仅能从根本上提升水域安全并降低运营成本,还能为海洋研究、探索与监测带来一系列全新能力。然而,实现这一目标颇具挑战:USV 控制系统必须在昼夜及真实天气条件下,在驶向指定航点的过程中与其他船只相遇时,安全可靠地遵守国际海上避碰规则(COLREGs)。面对纷繁多样的可能场景,关键在于先拥有一个能够复现 USV 真实运行条件的虚拟环境,然后再将系统部署到现实世界。这类"数字孪生"构成了利用深度强化学习(DRL)和计算机视觉(CV)算法开发并引导 USV 控制系统的基础。本文介绍了在逼真的海洋仿真环境中开发的一个符合 COLREG、基于 DRL 并具备 CV 感知能力的避碰导航系统。经训练得到的自主智能体在开阔海域及沿海与其他船只相遇的多次航行任务中成功抵达预设航点,验证了其性能。附带训练好的智能体的仿真器二进制可执行版本可在 https://github.com/aavek/Aeolus-Ocean 获取。
Machine Learning-Assisted Pattern Recognition Algorithms for Estimating Ultimate Tensile Strength in Fused Deposition Modeled Polylactic Acid Specimens
results: 研究发现,决策树(Decision Tree)和 K 近邻(K-Nearest Neighbor)算法的 F1 分数均为 0.71,但 KNN 算法的曲线下面积(AUC)达到 0.79,在分类任务中表现更佳。这表明 KNN 算法对数据集中两类极限抗拉强度的区分能力更强,因此是本研究情境下分类任务的最佳选择。Abstract
In this study, we investigate the application of supervised machine learning algorithms for estimating the Ultimate Tensile Strength (UTS) of Polylactic Acid (PLA) specimens fabricated using the Fused Deposition Modeling (FDM) process. A total of 31 PLA specimens were prepared, with Infill Percentage, Layer Height, Print Speed, and Extrusion Temperature serving as input parameters. The primary objective was to assess the accuracy and effectiveness of four distinct supervised classification algorithms, namely Logistic Classification, Gradient Boosting Classification, Decision Tree, and K-Nearest Neighbor, in predicting the UTS of the specimens. The results revealed that while the Decision Tree and K-Nearest Neighbor algorithms both achieved an F1 score of 0.71, the KNN algorithm exhibited a higher Area Under the Curve (AUC) score of 0.79, outperforming the other algorithms. This demonstrates the superior ability of the KNN algorithm in differentiating between the two classes of ultimate tensile strength within the dataset, rendering it the most favorable choice for classification in the context of this research. This study represents the first attempt to estimate the UTS of PLA specimens using machine learning-based classification algorithms, and the findings offer valuable insights into the potential of these techniques in improving the performance and accuracy of predictive models in the domain of additive manufacturing.
摘要
在本研究中,我们探讨了使用监督式机器学习算法来估计采用熔融沉积成型(FDM)工艺制造的聚乳酸(PLA)试样的极限抗拉强度(UTS)。我们共制备了 31 个 PLA 试样,以填充率、层高、打印速度和挤出温度作为输入参数。研究的主要目标是评估四种监督式分类算法——逻辑回归分类、梯度提升分类、决策树和 K 近邻——在预测试样 UTS 方面的准确性和有效性。结果显示,决策树和 K 近邻算法的 F1 分数均达到 0.71,但 KNN 算法的曲线下面积(AUC)达到 0.79,优于其他算法。这表明 KNN 算法在区分数据集中两类极限抗拉强度方面能力更强,是本研究情境下分类的最佳选择。本研究是首次尝试使用基于机器学习的分类算法估计 PLA 试样的 UTS,其发现为这些技术在增材制造领域提升预测模型性能和准确性的潜力提供了有价值的见解。
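A sketch of the classification setup with scikit-learn; the specimen data below are random placeholders standing in for the 31 measured specimens, so the printed score is not the paper's result.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the 31 FDM specimens: four print parameters
# and a binarized UTS label (low/high).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 100, 31),    # infill percentage (%)
    rng.uniform(0.1, 0.3, 31),   # layer height (mm)
    rng.uniform(30, 90, 31),     # print speed (mm/s)
    rng.uniform(190, 220, 31),   # extrusion temperature (deg C)
])
y = (X[:, 0] + 100 * X[:, 1] + rng.normal(0, 5, 31)) > 60   # toy label rule

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
print(cross_val_score(knn, X, y, scoring="f1", cv=5).mean())
```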
Real-time Percussive Technique Recognition and Embedding Learning for the Acoustic Guitar
results: 研究发现,这些网络是强大的分类器,特别是在简化后的二分类识别任务中;而 VAE 能够提升不同类别分布之间的分离度,体现为更大的 KL 散度。Abstract
Real-time music information retrieval (RT-MIR) has much potential to augment the capabilities of traditional acoustic instruments. We develop RT-MIR techniques aimed at augmenting percussive fingerstyle, which blends acoustic guitar playing with guitar body percussion. We formulate several design objectives for RT-MIR systems for augmented instrument performance: (i) causal constraint, (ii) perceptually negligible action-to-sound latency, (iii) control intimacy support, (iv) synthesis control support. We present and evaluate real-time guitar body percussion recognition and embedding learning techniques based on convolutional neural networks (CNNs) and CNNs jointly trained with variational autoencoders (VAEs). We introduce a taxonomy of guitar body percussion based on hand part and location. We follow a cross-dataset evaluation approach by collecting three datasets labelled according to the taxonomy. The embedding quality of the models is assessed using KL-Divergence across distributions corresponding to different taxonomic classes. Results indicate that the networks are strong classifiers especially in a simplified 2-class recognition task, and the VAEs yield improved class separation compared to CNNs as evidenced by increased KL-Divergence across distributions. We argue that the VAE embedding quality could support control intimacy and rich interaction when the latent space's parameters are used to control an external synthesis engine. Further design challenges around generalisation to different datasets have been identified.
摘要
实时音乐信息检索(RT-MIR)在增强传统原声乐器能力方面潜力巨大。我们开发了面向打击指弹风格(将原声吉他演奏与吉他琴体打击相融合)的 RT-MIR 技术,并为面向增强乐器演奏的 RT-MIR 系统提出了若干设计目标:(i)因果性约束;(ii)感知上可忽略的动作到声音延迟;(iii)支持控制亲密性;(iv)支持合成控制。我们提出并评估了基于卷积神经网络(CNN)以及 CNN 与变分自编码器(VAE)联合训练的实时吉他琴体打击识别与嵌入学习技术,并依据手部部位和敲击位置提出了吉他琴体打击的分类法。我们采用跨数据集评估方法,收集了三个按该分类法标注的数据集,并用不同分类类别对应分布之间的 KL 散度来评估模型的嵌入质量。结果表明,这些网络是强大的分类器,尤其是在简化后的二分类识别任务中;而 VAE 相比 CNN 带来了更好的类别分离,体现在分布间 KL 散度的增大上。我们认为,当隐空间参数被用于控制外部合成引擎时,VAE 的嵌入质量可以支撑控制亲密性和丰富的交互。围绕不同数据集泛化等进一步的设计挑战也已被识别。
results: 论文的结果表明,采用分层(layerwise)方法可以减轻模型之间的差异,从而提高联邦训练的效果。此外,论文还发现某些特定的层或层组会在联邦训练中造成阻碍效应,而这些阻碍效应可以通过调整学习率来缓解。Abstract
In the federated setup one performs an aggregation of separate local models multiple times during training in order to obtain a stronger global model; most often aggregation is a simple averaging of the parameters. Understanding when and why averaging works in a non-convex setup, such as federated deep learning, is an open challenge that hinders obtaining highly performant global models. On i.i.d. datasets, federated deep learning with frequent averaging is successful. The common understanding, however, is that during independent training the models drift away from each other, and thus averaging may no longer work after many local parameter updates. The problem can be seen from the perspective of the loss surface: for points on a non-convex surface the average can become arbitrarily bad. The assumption of local convexity, often used to explain the success of federated averaging, contradicts the empirical evidence showing that high loss barriers exist between models from the very beginning of the learning, even when training on the same data. Based on the observation that the learning process evolves differently in different layers, we investigate the barrier between models in a layerwise fashion. Our conjecture is that barriers preventing from successful federated training are caused by a particular layer or group of layers.
摘要
在联邦设置中,训练过程中会多次对多个本地模型进行聚合,以获得更强的全局模型;聚合通常就是对参数的简单平均。然而,在诸如联邦深度学习这样的非凸设置中,理解平均何时以及为何有效,仍是一个开放的挑战,它阻碍了我们获得高性能的全局模型。在独立同分布(i.i.d.)数据集上,采用频繁平均的联邦深度学习是成功的。但通常的理解是:在各自独立训练过程中,模型会彼此漂移,因此在多次本地参数更新之后,平均可能不再有效。从损失曲面的角度看,非凸曲面上若干点的平均可能变得任意糟糕。常被用来解释联邦平均成功的"局部凸性"假设,与实证证据相矛盾:即使在相同数据上训练,模型之间从学习一开始就存在高损失壁垒。基于学习过程在不同层中演化方式不同的观察,我们以分层的方式考察模型之间的壁垒。我们的猜想是:阻碍联邦训练成功的壁垒源自某个特定的层或层组。
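A minimal probe for the layerwise conjecture: interpolate only one named layer between two models and evaluate the loss along the path. The models, data, and interpolation grid below are toy stand-ins, not the paper's experimental setup.

```python
import copy
import torch
import torch.nn as nn

def layerwise_barrier(model_a, model_b, loss_fn, data, name, alphas=(0.25, 0.5, 0.75)):
    """Interpolate only the parameter tensor `name` between two trained models
    (keeping all other parameters from model_a) and report the loss along the
    path -- a crude probe for which layer raises the averaging barrier."""
    sa, sb = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in alphas:
        probe = copy.deepcopy(model_a)
        sd = probe.state_dict()
        sd[name] = (1 - alpha) * sa[name] + alpha * sb[name]
        probe.load_state_dict(sd)
        with torch.no_grad():
            x, y = data
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Tiny demo with two randomly initialized "clients".
net_a, net_b = nn.Linear(10, 2), nn.Linear(10, 2)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
print(layerwise_barrier(net_a, net_b, nn.CrossEntropyLoss(), (x, y), "weight"))
```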
Multivariate Time Series characterization and forecasting of VoIP traffic in real mobile networks
results: 实验结果表明,结合时间序列分析和机器学习技术可以准确预测 VoIP 流量中重要的 QoS/QoE 指标;其中基于深度学习的方法表现更好,且时间复杂度更低。此外,本文还进行了一系列辅助分析(如平稳性和正交脉冲响应),以更深入地理解 VoIP 流量的行为。Abstract
Predicting the behavior of real-time traffic (e.g., VoIP) in mobility scenarios could help the operators to better plan their network infrastructures and to optimize the allocation of resources. Accordingly, in this work the authors propose a forecasting analysis of crucial QoS/QoE descriptors (some of which neglected in the technical literature) of VoIP traffic in a real mobile environment. The problem is formulated in terms of a multivariate time series analysis. Such a formalization allows to discover and model the temporal relationships among various descriptors and to forecast their behaviors for future periods. Techniques such as Vector Autoregressive models and machine learning (deep-based and tree-based) approaches are employed and compared in terms of performance and time complexity, by reframing the multivariate time series problem into a supervised learning one. Moreover, a series of auxiliary analyses (stationarity, orthogonal impulse responses, etc.) are performed to discover the analytical structure of the time series and to provide deep insights about their relationships. The whole theoretical analysis has an experimental counterpart since a set of trials across a real-world LTE-Advanced environment has been performed to collect, post-process and analyze about 600,000 voice packets, organized per flow and differentiated per codec.
摘要
在移动场景下预测实时业务(如 VoIP)的行为,可以帮助运营商更好地规划网络基础设施并优化资源分配。为此,本文作者对真实移动环境中 VoIP 流量的关键 QoS/QoE 描述量(其中一些在技术文献中被忽视)进行了预测分析。该问题被形式化为多元时间序列分析:这种形式化可以发现并建模各描述量之间的时间关系,并预测其未来的走势。通过将多元时间序列问题重构为监督学习问题,文中采用并比较了向量自回归(VAR)模型与机器学习方法(基于深度学习与基于树的方法)的性能和时间复杂度。此外,作者还进行了一系列辅助分析(平稳性、正交脉冲响应等),以揭示时间序列的解析结构并深入理解其相互关系。整个理论分析均有实验支撑:作者在真实的 LTE-Advanced 环境中开展了一系列实验,收集、后处理并分析了约 60 万个语音数据包,按流组织并按编解码器区分。
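A minimal VAR-based forecasting sketch with statsmodels, on synthetic stand-ins for the QoS descriptors (the study itself uses measured VoIP flows and also compares deep and tree-based learners):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Synthetic stand-in for a multivariate series of VoIP QoS descriptors.
rng = np.random.default_rng(0)
n = 300
jitter = np.cumsum(rng.normal(0, 0.1, n)) + 5.0
rtt = 0.8 * jitter + rng.normal(0, 0.2, n) + 40.0
loss = 0.05 * jitter + rng.normal(0, 0.05, n)
df = pd.DataFrame({"jitter_ms": jitter, "rtt_ms": rtt, "loss_pct": loss})

# Fit a VAR on differenced (stationarized) data and forecast 10 steps ahead.
diffed = df.diff().dropna()
fit = VAR(diffed).fit(2)                     # fixed lag order for simplicity
forecast = fit.forecast(diffed.values[-fit.k_ar:], steps=10)
print(forecast.shape)                        # (10, 3)
```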
An Improved Uniform Convergence Bound with Fat-Shattering Dimension
paper_authors: Roberto Colomboni, Emmanuel Esposito, Andrea Paudice
for: 这篇论文研究实值函数类的一致收敛性质。
methods: 该论文提出了一种新的一致收敛界,改进了现有最佳上界中关于样本复杂度的乘性平方对数因子。
results: 该论文给出的改进一致收敛界弥合了现有上界与已知下界之间的差距。Abstract
The fat-shattering dimension characterizes the uniform convergence property of real-valued functions. The state-of-the-art upper bounds feature a multiplicative squared logarithmic factor on the sample complexity, leaving an open gap with the existing lower bound. We provide an improved uniform convergence bound that closes this gap.
摘要
Fat-shattering 维度(fat-shattering dimension)刻画了实值函数的一致收敛性质。现有最佳上界中含有一个关于样本复杂度的乘性平方对数因子,与已知下界之间留有差距。我们给出了一个改进的一致收敛界,弥合了这一差距。
results: 实验表明,该方法仅需使用单条系统轨迹短程运行中的 5 个样本,即可准确恢复真实的智能体系统动力学,包括均衡选择以及对混沌系统结果的预测。这些发现表明,该方法可为策略性多智能体系统中的政策制定与决策提供有效支持。Abstract
Decentralized learning algorithms are an essential tool for designing multi-agent systems, as they enable agents to autonomously learn from their experience and past interactions. In this work, we propose a theoretical and algorithmic framework for real-time identification of the learning dynamics that govern agent behavior using a short burst of a single system trajectory. Our method identifies agent dynamics through polynomial regression, where we compensate for limited data by incorporating side-information constraints that capture fundamental assumptions or expectations about agent behavior. These constraints are enforced computationally using sum-of-squares optimization, leading to a hierarchy of increasingly better approximations of the true agent dynamics. Extensive experiments demonstrated that our approach, using only 5 samples from a short run of a single trajectory, accurately recovers the true dynamics across various benchmarks, including equilibrium selection and prediction of chaotic systems up to 10 Lyapunov times. These findings suggest that our approach has significant potential to support effective policy and decision-making in strategic multi-agent systems.
摘要
去中心化学习算法是设计多智能体系统的重要工具,它使智能体能够自主地从经验和历史交互中学习。在这项工作中,我们提出了一个理论与算法框架,仅利用单条系统轨迹的一小段,即可实时辨识支配智能体行为的学习动力学。我们的方法通过多项式回归辨识智能体动力学,并通过引入刻画对智能体行为的基本假设或预期的辅助信息(side-information)约束来弥补数据的不足。这些约束通过平方和(sum-of-squares)优化在计算上予以强制,从而得到对真实智能体动力学逐级更优的近似层级。大量实验表明,我们的方法仅需单条短轨迹中的 5 个样本,就能在多个基准上准确恢复真实动力学,包括均衡选择以及对混沌系统长达 10 个李雅普诺夫时间的预测。这些发现表明,该方法在策略性多智能体系统中为有效的政策制定与决策提供支持方面具有巨大潜力。
results: 对比实验表明,我们提出的模型可以达到与教师模型相同甚至更高的学习精度,同时保持高速推理。Abstract
Knowledge distillation (KD) has shown great potential for transferring knowledge from a complex teacher model to a simple student model in which the heavy learning task can be accomplished efficiently and without losing too much prediction accuracy. Recently, many attempts have been made by applying the KD mechanism to the graph representation learning models such as graph neural networks (GNNs) to accelerate the model's inference speed via student models. However, many existing KD-based GNNs utilize MLP as a universal approximator in the student model to imitate the teacher model's process without considering the graph knowledge from the teacher model. In this work, we provide a KD-based framework on multi-scaled GNNs, known as graph framelet, and prove that by adequately utilizing the graph knowledge in a multi-scaled manner provided by graph framelet decomposition, the student model is capable of adapting both homophilic and heterophilic graphs and has the potential of alleviating the over-squashing issue with a simple yet effectively graph surgery. Furthermore, we show how the graph knowledge supplied by the teacher is learned and digested by the student model via both algebra and geometry. Comprehensive experiments show that our proposed model can generate learning accuracy identical to or even surpass the teacher model while maintaining the high speed of inference.
摘要
知识蒸馏(KD)在将知识从复杂的教师模型迁移到简单的学生模型方面展现出巨大潜力,使繁重的学习任务得以高效完成且不至于损失过多预测精度。近来,许多工作尝试将 KD 机制应用于图神经网络(GNN)等图表示学习模型,借助学生模型加快推理速度。然而,许多现有的基于 KD 的 GNN 方法在学生模型中使用 MLP 作为通用逼近器来模仿教师模型的过程,却没有考虑教师模型中的图知识。在这项工作中,我们提出了一个基于多尺度 GNN(即图小框架,graph framelet)的 KD 框架,并证明:通过充分利用图小框架分解所提供的多尺度图知识,学生模型既能适应同质图也能适应异质图,并且有望通过简单而有效的"图手术"缓解过度压缩(over-squashing)问题。此外,我们从代数与几何两个角度展示了教师提供的图知识是如何被学生模型学习和消化的。全面的实验表明,我们所提模型在保持高推理速度的同时,学习精度可与教师模型持平甚至更优。
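The framelet method builds on the standard logit-distillation objective; for reference, here is the classic Hinton-style KD loss. This is the generic recipe only, not the paper's full graph-framelet training loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: cross-entropy on labels plus temperature-softened
    KL divergence toward the teacher's logits."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients by T^2, as is customary
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 4, requires_grad=True)    # student logits
t = torch.randn(8, 4)                        # teacher logits
y = torch.randint(0, 4, (8,))
print(distillation_loss(s, t, y))
```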
Quantum Autoencoders for Learning Quantum Channel Codes
results: 我们将该框架应用于不同的量子信道模型作为概念验证,并在每个场景中都取得了强劲表现。我们的结果表明,量子机器学习可以助力量子通信系统研究,帮助我们更好地理解各种通信设置、多样化的信道模型以及调制约束下的容量界。Abstract
This work investigates the application of quantum machine learning techniques for classical and quantum communication across different qubit channel models. By employing parameterized quantum circuits and a flexible channel noise model, we develop a machine learning framework to generate quantum channel codes and evaluate their effectiveness. We explore classical, entanglement-assisted, and quantum communication scenarios within our framework. Applying it to various quantum channel models as proof of concept, we demonstrate strong performance in each case. Our results highlight the potential of quantum machine learning in advancing research on quantum communication systems, enabling a better understanding of capacity bounds under modulation constraints, various communication settings, and diverse channel models.
摘要
本工作研究将量子机器学习技术应用于不同量子比特信道模型下的经典与量子通信。通过采用参数化量子电路和灵活的信道噪声模型,我们构建了一个机器学习框架,用于生成量子信道编码并评估其有效性。我们在该框架内探讨了经典通信、纠缠辅助通信和量子通信三种场景,并将其应用于多种量子信道模型作为概念验证,在每种情形下都展现出强劲性能。我们的结果凸显了量子机器学习在推进量子通信系统研究方面的潜力,有助于更好地理解调制约束下的容量界、各种通信设置以及多样化的信道模型。
Online Distributed Learning with Quantized Finite-Time Coordination
results: 我们以与在线解的平均距离来分析所提算法的性能。最后,我们对一个逻辑回归任务进行了数值研究。Abstract
In this paper we consider online distributed learning problems. Online distributed learning refers to the process of training learning models on distributed data sources. In our setting a set of agents need to cooperatively train a learning model from streaming data. Differently from federated learning, the proposed approach does not rely on a central server but only on peer-to-peer communications among the agents. This approach is often used in scenarios where data cannot be moved to a centralized location due to privacy, security, or cost reasons. In order to overcome the absence of a central server, we propose a distributed algorithm that relies on a quantized, finite-time coordination protocol to aggregate the locally trained models. Furthermore, our algorithm allows for the use of stochastic gradients during local training. Stochastic gradients are computed using a randomly sampled subset of the local training data, which makes the proposed algorithm more efficient and scalable than traditional gradient descent. In our paper, we analyze the performance of the proposed algorithm in terms of the mean distance from the online solution. Finally, we present numerical results for a logistic regression task.
摘要
在这篇论文中,我们研究在线分布式学习问题。在线分布式学习是指在分布式数据源上训练学习模型的过程:在我们的设定中,一组智能体需要协作地从流式数据中训练一个学习模型。与联邦学习不同,所提方法不依赖中央服务器,而仅依靠智能体之间的点对点通信。这种方法常用于因隐私、安全或成本原因而无法将数据移至集中位置的场景。为弥补中央服务器的缺失,我们提出了一种分布式算法,依靠量化的有限时间协调协议来聚合本地训练的模型。此外,我们的算法允许在本地训练中使用随机梯度:随机梯度基于本地训练数据的随机抽样子集计算,使算法比传统梯度下降更高效、更具可扩展性。在论文中,我们以与在线解的平均距离来分析所提算法的性能,最后给出了一个逻辑回归任务上的数值结果。
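A rough sketch of the aggregation idea: agents repeatedly average quantized model vectors with their neighbours via a doubly stochastic mixing matrix. The actual finite-time coordination protocol in the paper is more sophisticated than this plain quantized consensus.

```python
import numpy as np

def quantize(x, step=0.05):
    """Uniform quantizer standing in for the paper's quantized messages."""
    return step * np.round(x / step)

def quantized_average(models, A, rounds=30):
    """Average local model vectors by repeated neighbour averaging, where each
    transmitted vector is quantized; A is a doubly stochastic mixing matrix."""
    X = np.array(models, dtype=float)
    for _ in range(rounds):
        X = A @ quantize(X)
    return X

A = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
models = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([2.0, 4.0])]
print(quantized_average(models, A))   # rows settle near the mean [2, 2],
                                      # up to quantization error
```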
Learning IMM Filter Parameters from Measurements using Gradient Descent
results: 测试与比较结果表明,经该方法训练的模型性能可与使用真实参数配置的 IMM 滤波器相当。Abstract
The performance of data fusion and tracking algorithms often depends on parameters that not only describe the sensor system, but can also be task-specific. While for the sensor system tuning these variables is time-consuming and mostly requires expert knowledge, intrinsic parameters of targets under track can even be completely unobservable until the system is deployed. With state-of-the-art sensor systems growing more and more complex, the number of parameters naturally increases, necessitating the automatic optimization of the model variables. In this paper, the parameters of an interacting multiple model (IMM) filter are optimized solely using measurements, thus without necessity for any ground-truth data. The resulting method is evaluated through an ablation study on simulated data, where the trained model manages to match the performance of a filter parametrized with ground-truth values.
摘要
数据融合与跟踪算法的性能往往取决于一些参数,这些参数不仅描述传感器系统,还可能与具体任务相关。对传感器系统而言,这些变量的调校费时且大多需要专家知识;而被跟踪目标的内在参数甚至可能在系统部署之前完全不可观测。随着最先进的传感器系统日益复杂,参数数量自然增多,因此需要对模型变量进行自动优化。本文仅利用量测数据(无需任何真实标注)来优化交互式多模型(IMM)滤波器的参数。我们通过在仿真数据上的消融研究对该方法进行评估,训练得到的模型性能可与使用真实参数配置的滤波器相匹敌。
Introducing Foundation Models as Surrogate Models: Advancing Towards More Practical Adversarial Attacks
results: 实验结果表明,使用 margin-based loss strategy 来微调 foundational models 可以提高其性能,并且这种方法的性能超过了其他更复杂的算法。Abstract
Recently, the no-box adversarial attack, in which the attacker lacks access to the model's architecture, weights, and training data, has become the most practical and challenging attack setup. However, the potential and flexibility inherent in the surrogate model selection process in the no-box setting have gone largely unexamined. Inspired by the burgeoning interest in utilizing foundational models to address downstream tasks, this paper adopts an innovative idea: 1) recasting the adversarial attack as a downstream task, specifically image noise generation, and 2) introducing foundational models as surrogate models. Harnessing the concept of non-robust features, we elaborate on two guiding principles for surrogate model selection to explain why the foundational model is an optimal choice for this role. Paradoxically, however, we observe that these foundational models underperform. Analyzing this unexpected behavior within the feature space, we attribute the lackluster performance of foundational models (e.g., CLIP) to their significant representational capacity and, conversely, their lack of discriminative prowess. To mitigate this issue, we propose a margin-based loss strategy for fine-tuning foundational models on target images. The experimental results verify that our approach, which employs the basic Fast Gradient Sign Method (FGSM) attack algorithm, outstrips the performance of other, more convoluted algorithms. We conclude by advocating for the research community to consider surrogate models as crucial determinants of the effectiveness of adversarial attacks in no-box settings. The implications of our work bear relevance for improving the efficacy of such adversarial attacks and the overall robustness of AI systems.
摘要
近期,无盒(no-box)对抗攻击,即攻击者无法访问模型的结构、权重和训练数据的设定,已成为最实用且最具挑战性的攻击设置。然而,无盒设定下代理模型(surrogate model)选择过程中所蕴含的潜力和灵活性长期受到忽视。受利用基础模型解决下游任务这一新兴趋势的启发,本文采用了一种创新思路:其一,将对抗攻击重新表述为一项下游任务,具体而言即图像噪声生成;其二,引入基础模型作为代理模型。基于非鲁棒特征的概念,我们阐述了代理模型选择的两条指导原则,以解释基础模型为何是这一角色的最优选择。然而,矛盾的是,我们观察到这些基础模型表现不佳。通过在特征空间中分析这一出人意料的行为,我们将基础模型(如CLIP)表现平平的原因归结于其强大的表征能力,以及与之相对的判别能力不足。为缓解这一问题,我们提出使用基于间隔(margin)的损失策略在目标图像上微调基础模型。实验结果验证了我们的方法:仅采用基础的快速梯度符号法(FGSM)攻击算法,其性能就超越了其他更复杂的算法。最后,我们呼吁研究社区将代理模型视为无盒设定下对抗攻击有效性的关键决定因素。我们的工作对于提升此类对抗攻击的效力以及AI系统的整体鲁棒性均具有意义。
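A hedged sketch of the attack side: FGSM driven by a margin-style loss, with a tiny CNN standing in for the (fine-tuned) foundation-model surrogate. The exact margin formulation used in the paper is not specified in the abstract; this shows only the generic pattern.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder surrogate: in the paper this role is played by a fine-tuned
# foundation model such as CLIP; a tiny CNN keeps the sketch self-contained.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
model.eval()

def margin_loss(logits, y):
    """Gap between the true-class logit and the best competing logit."""
    true = logits.gather(1, y[:, None]).squeeze(1)
    other = logits.scatter(1, y[:, None], float('-inf')).max(dim=1).values
    return (true - other).mean()

def fgsm(x, y, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = margin_loss(model(x), y)     # margin we want to destroy
    loss.backward()
    # One signed step that decreases the margin, clamped to valid pixels.
    return (x - eps * x.grad.sign()).clamp(0, 1).detach()

x = torch.rand(4, 3, 32, 32)            # stand-in images
y = torch.randint(0, 10, (4,))
x_adv = fgsm(x, y)
print("perturbation linf norm:", (x_adv - x).abs().max().item())
```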
results: 本研究发现,虽然XAI方法可以提供补充性和有用的输出,但研究人员和决策者应考虑XAI方法在概念和技术上的局限,这些局限常常导致XAI方法本身成为黑盒。Abstract
Our work serves as a framework for unifying the challenges of contemporary explainable AI (XAI). We demonstrate that while XAI methods provide supplementary and potentially useful output for machine learning models, researchers and decision-makers should be mindful of their conceptual and technical limitations, which frequently result in these methods themselves becoming black boxes. We examine three XAI research avenues spanning image, textual, and graph data, covering saliency, attention, and graph-type explainers. Despite the varying contexts and timeframes of the mentioned cases, the same persistent roadblocks emerge, highlighting the need for a conceptual breakthrough in the field to address the challenge of compatibility between XAI methods and application tasks.
摘要
我们的工作为统一当代可解释人工智能(XAI)所面临的挑战提供了一个框架。我们表明,尽管XAI方法可以为机器学习模型提供补充性的、可能有用的输出,但研究人员和决策者应注意这些方法在概念和技术上的局限,这些局限常常导致XAI方法本身成为黑盒。我们考察了涵盖图像、文本和图数据的三条XAI研究路线,涉及显著性、注意力和图类型的解释器。尽管上述案例的背景和时间各不相同,但同样的顽固障碍反复出现,这凸显出该领域需要概念性突破,以应对XAI方法与应用任务之间的兼容性挑战。
Deep Neural Networks for Semiparametric Frailty Models via H-likelihood
results: 实验研究表明,所提方法可以提升现有方法的预测性能。一项真实数据分析表明,纳入个体特异的脆弱项(frailty)有助于提升基于DNN的Cox模型(DNN-Cox)的预测性能。Abstract
For prediction of clustered time-to-event data, we propose a new deep neural network based gamma frailty model (DNN-FM). An advantage of the proposed model is that the joint maximization of the new h-likelihood provides maximum likelihood estimators for fixed parameters and best unbiased predictors for random frailties. Thus, the proposed DNN-FM is trained by using a negative profiled h-likelihood as a loss function, constructed by profiling out the non-parametric baseline hazard. Experimental studies show that the proposed method enhances the prediction performance of the existing methods. A real data analysis shows that the inclusion of subject-specific frailties helps to improve prediction of the DNN based Cox model (DNN-Cox).
摘要
针对聚类时间-事件数据的预测,我们提出了一种新的基于深度神经网络的伽马脆弱模型(DNN-FM)。该模型的优点在于,对新的h-似然进行联合最大化,可同时给出固定参数的最大似然估计和随机脆弱项的最佳无偏预测。因此,所提DNN-FM以负的剖面h-似然作为损失函数进行训练,该损失通过剖面掉非参数基线风险函数而构造。实验研究表明,所提方法提升了现有方法的预测性能。一项真实数据分析表明,纳入个体特异的脆弱项有助于提升基于DNN的Cox模型(DNN-Cox)的预测。注:伽马脆弱模型(gamma frailty model)是生存分析中的一类统计模型,用于刻画个体或群体之间风险率的差异。
Efficient SGD Neural Network Training via Sublinear Activated Neuron Identification
results: 本论文证明了其训练方法可以在 $O(M^2/\epsilon^2)$ 时间内收敛,其中网络规模与系数范数上界 $M$ 成二次关系,$\epsilon$ 为误差项。Abstract
Deep learning has been widely used in many fields, but the model training process usually consumes massive computational resources and time. Therefore, designing an efficient neural network training method with a provable convergence guarantee is a fundamental and important research question. In this paper, we present a static half-space report data structure that consists of a fully connected two-layer neural network for shifted ReLU activation to enable activated neuron identification in sublinear time via geometric search. We also prove that our algorithm can converge in $O(M^2/\epsilon^2)$ time with network size quadratic in the coefficient norm upper bound $M$ and error term $\epsilon$.
摘要
深度学习在许多领域得到广泛应用,但模型训练过程通常需要大量计算资源和时间。因此,设计高效且具有可证收敛保证的神经网络训练方法是一个基本而重要的研究问题。在这篇论文中,我们提出了一种静态半空间报告数据结构,它由一个采用平移ReLU激活的全连接两层神经网络构成,可通过几何搜索在亚线性时间内识别被激活的神经元。我们还证明了我们的算法可以在 $O(M^2/\epsilon^2)$ 时间内收敛,其中网络规模与系数范数上界 $M$ 成二次关系,$\epsilon$ 为误差项。
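A sketch of what "activated neuron identification" means for a shifted-ReLU layer, assuming a shared shift b: the naive O(md) scan below is exactly the step the paper's static half-space report data structure is designed to answer in sublinear time via geometric search; the dimensions and threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, b = 10_000, 64, 0.8          # neurons, input dim, shift (illustrative)
W = rng.normal(size=(m, d)) / np.sqrt(d)

def activated_neurons(x, W, b):
    """Indices i with w_i . x > b, i.e. nonzero under a shifted ReLU.

    This O(m d) scan is the query that the paper's half-space report
    structure answers sublinearly via geometric search."""
    return np.flatnonzero(W @ x > b)

x = rng.normal(size=d)
act = activated_neurons(x, W, b)
print(f"{act.size} of {m} neurons active; only these need gradient work")
```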
Prescriptive Process Monitoring Under Resource Constraints: A Reinforcement Learning Approach
for: 这篇论文的目的是优化业务过程的性能,通过在运行时触发干预来提高案例取得积极结果的概率。
methods: 这篇论文使用强化学习方法,通过试错来学习干预策略。
results: 这篇论文的实验结果表明,利用共形预测(conformal prediction)技术考虑预测的不确定性,可以帮助强化学习代理收敛到净干预收益更高的策略。Abstract
Prescriptive process monitoring methods seek to optimize the performance of business processes by triggering interventions at runtime, thereby increasing the probability of positive case outcomes. These interventions are triggered according to an intervention policy. Reinforcement learning has been put forward as an approach to learning intervention policies through trial and error. Existing approaches in this space assume that the number of resources available to perform interventions in a process is unlimited, an unrealistic assumption in practice. This paper argues that, in the presence of resource constraints, a key dilemma in the field of prescriptive process monitoring is to trigger interventions based not only on predictions of their necessity, timeliness, or effect but also on the uncertainty of these predictions and the level of resource utilization. Indeed, committing scarce resources to an intervention when the necessity or effects of this intervention are highly uncertain may intuitively lead to suboptimal intervention effects. Accordingly, the paper proposes a reinforcement learning approach for prescriptive process monitoring that leverages conformal prediction techniques to consider the uncertainty of the predictions upon which an intervention decision is based. An evaluation using real-life datasets demonstrates that explicitly modeling uncertainty using conformal predictions helps reinforcement learning agents converge towards policies with higher net intervention gain.
摘要
处方式过程监控方法旨在通过在运行时触发干预来优化业务过程的性能,从而提高案例取得积极结果的概率。这些干预依据某种干预策略被触发。强化学习已被提出作为一种通过试错来学习干预策略的方法。该领域现有方法假设可用于执行干预的资源数量不受限制,这在实践中并不现实。本文认为,在资源受限的情况下,处方式过程监控领域的一个关键困境在于:触发干预不仅要依据对其必要性、时效性或效果的预测,还要考虑这些预测的不确定性以及资源的利用水平。事实上,当某项干预的必要性或效果高度不确定时,将稀缺资源投入其中直观上可能导致次优的干预效果。因此,本文提出了一种用于处方式过程监控的强化学习方法,利用共形预测技术来考虑干预决策所依据预测的不确定性。基于真实数据集的评估表明,使用共形预测显式建模不确定性有助于强化学习代理收敛到净干预收益更高的策略。
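A minimal sketch of the conformal-prediction ingredient, assuming a split-conformal setup with absolute-error nonconformity scores: the calibrated quantile yields prediction intervals whose width can serve as the uncertainty signal that the paper feeds into the reinforcement learning agent's intervention decisions (the RL part is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are a predictive model's outputs on a held-out calibration set.
y_cal = rng.normal(size=500)
pred_cal = y_cal + rng.normal(scale=0.3, size=500)

alpha = 0.1
scores = np.abs(y_cal - pred_cal)                 # nonconformity scores
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))           # finite-sample rank
qhat = np.sort(scores)[k - 1]                     # conformal quantile

def prediction_interval(pred):
    """Interval with >= 1 - alpha marginal coverage under exchangeability."""
    return pred - qhat, pred + qhat

lo, hi = prediction_interval(0.42)
print(f"90% conformal interval around 0.42: [{lo:.3f}, {hi:.3f}]")
print("interval width (a proxy for prediction uncertainty):", 2 * qhat)
```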
Nested Elimination: A Simple Algorithm for Best-Item Identification from Choice-Based Feedback
results: 本文给出了NE期望样本复杂度的实例相关、非渐近上界,证明NE在样本复杂度方面具有强理论保证;并证明NE达到高阶最差情况渐近最优性。合成数据与真实数据的数值实验均印证了我们的理论发现。Abstract
We study the problem of best-item identification from choice-based feedback. In this problem, a company sequentially and adaptively shows display sets to a population of customers and collects their choices. The objective is to identify the most preferred item with the least number of samples and at a high confidence level. We propose an elimination-based algorithm, namely Nested Elimination (NE), which is inspired by the nested structure implied by the information-theoretic lower bound. NE is simple in structure, easy to implement, and has a strong theoretical guarantee for sample complexity. Specifically, NE utilizes an innovative elimination criterion and circumvents the need to solve any complex combinatorial optimization problem. We provide an instance-specific and non-asymptotic bound on the expected sample complexity of NE. We also show NE achieves high-order worst-case asymptotic optimality. Finally, numerical experiments from both synthetic and real data corroborate our theoretical findings.
摘要
我们研究基于选择反馈的最佳项目识别问题。在这一问题中,一家公司按顺序、自适应地向客户群体展示展示集并收集他们的选择,目标是以尽可能少的样本和较高的置信水平识别出最受偏好的项目。我们提出了一种基于淘汰的算法,称为嵌套淘汰(NE),其灵感来自信息论下界所蕴含的嵌套结构。NE结构简单、易于实现,并在样本复杂度方面具有强有力的理论保证。具体而言,NE采用一种新颖的淘汰准则,避免了求解任何复杂的组合优化问题。我们给出了NE期望样本复杂度的实例相关、非渐近上界,并证明NE达到高阶最差情况渐近最优性。最后,合成数据与真实数据的数值实验均印证了我们的理论发现。
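To illustrate the flavor of elimination from choice-based feedback, here is a generic successive-elimination loop under an assumed multinomial-logit choice model with Hoeffding-style confidence radii. NE's actual nested elimination criterion and stopping rule are more refined than this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
utils = np.array([1.0, 0.8, 0.6, 0.4, 0.2])   # assumed latent preferences

def customer_choice(display):
    """Multinomial-logit choice over the displayed items (an assumption)."""
    w = np.exp(utils[display])
    return display[rng.choice(len(display), p=w / w.sum())]

active = list(range(len(utils)))
wins = np.zeros(len(utils))
shows, delta = 0, 0.05
while len(active) > 1 and shows < 100_000:
    wins[customer_choice(np.array(active))] += 1
    shows += 1
    # Hoeffding-style radius on empirical top-choice frequencies.
    rad = np.sqrt(np.log(2 * len(utils) * shows / delta) / (2 * shows))
    freq = wins / shows
    best = max(active, key=lambda i: freq[i])
    active = [i for i in active if freq[i] + rad >= freq[best] - rad]

print("identified best item:", active[0] if len(active) == 1 else active)
```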
Metal Oxide-based Gas Sensor Array for the VOCs Analysis in Complex Mixtures using Machine Learning
results: 研究发现,使用机器学习方法可以实现99%以上的准确率来识别不同的化学物质,并且在预测化学物质浓度方面也有出色的效果。Abstract
Detection of Volatile Organic Compounds (VOCs) from the breath is becoming a viable route for the early detection of diseases non-invasively. This paper presents a sensor array with three metal oxide electrodes that can use machine learning methods to identify four distinct VOCs in a mixture. The metal oxide sensor array was subjected to various VOC concentrations, including ethanol, acetone, toluene and chloroform. The dataset obtained from individual gases and their mixtures were analyzed using multiple machine learning algorithms, such as Random Forest (RF), K-Nearest Neighbor (KNN), Decision Tree, Linear Regression, Logistic Regression, Naive Bayes, Linear Discriminant Analysis, Artificial Neural Network, and Support Vector Machine. KNN and RF have shown more than 99% accuracy in classifying different varying chemicals in the gas mixtures. In regression analysis, KNN has delivered the best results with R2 value of more than 0.99 and LOD of 0.012, 0.015, 0.014 and 0.025 PPM for predicting the concentrations of varying chemicals Acetone, Toluene, Ethanol, and Chloroform, respectively in complex mixtures. Therefore, it is demonstrated that the array utilizing the provided algorithms can classify and predict the concentrations of the four gases simultaneously for disease diagnosis and treatment monitoring.
摘要
通过呼吸检测挥发性有机化合物(VOC)正在成为一种可行的疾病早期无创检测途径。本文提出了一种包含三个金属氧化物电极的传感器阵列,结合机器学习方法可识别混合物中四种不同的VOC。该金属氧化物传感器阵列在多种VOC浓度下进行了测试,包括乙醇、丙酮、甲苯和氯仿。对单一气体及其混合物所得数据使用多种机器学习算法进行了分析,包括随机森林(RF)、K近邻(KNN)、决策树、线性回归、逻辑回归、朴素贝叶斯、线性判别分析、人工神经网络和支持向量机。KNN和RF在区分气体混合物中不同化学物质时达到了99%以上的准确率。在回归分析中,KNN取得了最佳结果,R2值超过0.99,对复杂混合物中丙酮、甲苯、乙醇和氯仿浓度预测的检出限(LOD)分别为0.012、0.015、0.014和0.025 ppm。因此,该阵列结合所述算法能够同时对四种气体进行分类和浓度预测,可用于疾病诊断和治疗监测。
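A minimal sketch of the classification/regression pattern reported in the paper, on synthetic stand-in data (the response patterns, noise level, and concentration range are invented; the real dataset comes from three metal oxide electrodes and four VOCs):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the 3-electrode array: each gas has a distinct
# response pattern scaled by its concentration, plus sensor noise.
n = 2000
patterns = np.array([[1.0, 0.2, 0.5], [0.3, 1.0, 0.4],
                     [0.6, 0.5, 1.0], [0.9, 0.8, 0.1]])
gas = rng.integers(0, 4, size=n)               # which VOC
conc = rng.uniform(0.1, 5.0, size=n)           # concentration in ppm
X = patterns[gas] * conc[:, None] + rng.normal(scale=0.05, size=(n, 3))

Xtr, Xte, gtr, gte, ctr, cte = train_test_split(X, gas, conc, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(Xtr, gtr)
print("gas id accuracy:", accuracy_score(gte, clf.predict(Xte)))

reg = KNeighborsRegressor(n_neighbors=5).fit(Xtr, ctr)
print("concentration R2:", r2_score(cte, reg.predict(Xte)))
```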
Deep Network Approximation: Beyond ReLU to Diverse Activation Functions
results: 论文表明,对任意激活函数 $\varrho\in \mathscr{A}$,宽度为 $N$、深度为 $L$ 的 $\mathtt{ReLU}$ 网络都可以在任意有界集上被宽度为 $6N$、深度为 $2L$ 的 $\varrho$ 激活网络以任意精度逼近。这一结果使得大多数基于 $\mathtt{ReLU}$ 网络的逼近结果能够以略大的常数为代价推广到多种其他激活函数。Abstract
This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $6N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, at the cost of slightly larger constants.
摘要
results: 研究发现,代理之间的影响关系取决于社交网络的拓扑结构和每个代理对其所求解推理问题所掌握的信息水平。文中提出了一种对代理之间总体影响进行排序、以发现高影响力代理的算法,并给出了从原始观测数据中学习所需模型参数的方法。Abstract
This paper investigates causal influences between agents linked by a social graph and interacting over time. In particular, the work examines the dynamics of social learning models and distributed decision-making protocols, and derives expressions that reveal the causal relations between pairs of agents and explain the flow of influence over the network. The results turn out to be dependent on the graph topology and the level of information that each agent has about the inference problem they are trying to solve. Using these conclusions, the paper proposes an algorithm to rank the overall influence between agents to discover highly influential agents. It also provides a method to learn the necessary model parameters from raw observational data. The results and the proposed algorithm are illustrated by considering both synthetic data and real Twitter data.
摘要
Full-resolution Lung Nodule Segmentation from Chest X-ray Images using Residual Encoder-Decoder Networks
paper_authors: Michael James Horry, Subrata Chakraborty, Biswajeet Pradhan, Manoranjan Paul, Jing Zhu, Prabal Datta Barua, U. Rajendra Acharya, Fang Chen, Jianlong Zhou
for: 实现肺癌的早期诊断,提高肺癌患者的生存率。
methods: 使用高效的编码器-解码器神经网络处理全分辨率图像,不进行降采样,以避免信号损失。
results: 该方法能够定位肺结节,在10折内部测试中达到85%的灵敏度、每幅图像8个假阳性;具有较低的假阳性率和亚秒级推理时间,并在外部数据集上取得77%灵敏度、假阳性率7.6的泛化结果。Abstract
Lung cancer is the leading cause of cancer death and early diagnosis is associated with a positive prognosis. Chest X-ray (CXR) provides an inexpensive imaging mode for lung cancer diagnosis. Suspicious nodules are difficult to distinguish from vascular and bone structures using CXR. Computer vision has previously been proposed to assist human radiologists in this task, however, leading studies use down-sampled images and computationally expensive methods with unproven generalization. Instead, this study localizes lung nodules using efficient encoder-decoder neural networks that process full resolution images to avoid any signal loss resulting from down-sampling. Encoder-decoder networks are trained and tested using the JSRT lung nodule dataset. The networks are used to localize lung nodules from an independent external CXR dataset. Sensitivity and false positive rates are measured using an automated framework to eliminate any observer subjectivity. These experiments allow for the determination of the optimal network depth, image resolution and pre-processing pipeline for generalized lung nodule localization. We find that nodule localization is influenced by subtlety, with more subtle nodules being detected in earlier training epochs. Therefore, we propose a novel self-ensemble model from three consecutive epochs centered on the validation optimum. This ensemble achieved a sensitivity of 85% in 10-fold internal testing with false positives of 8 per image. A sensitivity of 81% is achieved at a false positive rate of 6 following morphological false positive reduction. This result is comparable to more computationally complex systems based on linear and spatial filtering, but with a sub-second inference time that is faster than other methods. The proposed algorithm achieved excellent generalization results against an external dataset with sensitivity of 77% at a false positive rate of 7.6.
摘要
肺癌是癌症死亡的首要原因,早期诊断与良好预后相关。胸部X光(CXR)为肺癌诊断提供了一种低成本的成像方式。然而,在CXR上疑似结节很难与血管和骨骼结构区分。此前已有研究提出用计算机视觉辅助放射科医生完成这项任务,但主流研究使用降采样图像和计算代价高昂的方法,且泛化能力未经证实。与之不同,本研究使用高效的编码器-解码器神经网络处理全分辨率图像来定位肺结节,从而避免降采样造成的信号损失。编码器-解码器网络使用JSRT肺结节数据集进行训练和测试,并被用于在一个独立的外部CXR数据集上定位肺结节。灵敏度和假阳性率通过自动化框架测量,以消除观察者的主观性。这些实验确定了泛化肺结节定位的最优网络深度、图像分辨率和预处理管道。我们发现结节定位受其细微程度影响:更细微的结节在较早的训练轮次中更易被检出。因此,我们提出了一种新颖的自集成模型,由以验证最优点为中心的三个连续轮次的模型组成。该集成在10折内部测试中达到85%的灵敏度,每幅图像8个假阳性;经形态学假阳性削减后,在每幅图像6个假阳性的水平上达到81%的灵敏度。这一结果可与基于线性和空间滤波的计算更复杂的系统相媲美,而推理时间低于一秒,比其他方法更快。所提算法在外部数据集上取得了出色的泛化结果,灵敏度为77%,假阳性率为7.6。
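A sketch of the self-ensemble idea: average the per-pixel sigmoid outputs of three checkpoints. In the paper these are the models from three consecutive epochs centered on the validation optimum; here three freshly initialized copies of a stand-in network play that role.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the residual encoder-decoder; any per-pixel model fits here.
def make_model():
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1))

# In practice these would be checkpoints from epochs e-1, e, e+1 around the
# validation optimum e; independently initialized copies stand in here.
ensemble = [make_model().eval() for _ in range(3)]

@torch.no_grad()
def self_ensemble_predict(x):
    """Average the sigmoid nodule maps of the three checkpoints."""
    probs = torch.stack([torch.sigmoid(m(x)) for m in ensemble])
    return probs.mean(dim=0)

cxr = torch.rand(1, 1, 256, 256)      # full-resolution input, no down-sampling
nodule_map = self_ensemble_predict(cxr)
print("predicted map shape:", tuple(nodule_map.shape))
```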
On the Effective Horizon of Inverse Reinforcement Learning
results: 本文的实验结果证明,使用比真实值更短的有效时间视野可以更快地获得更好的结果,并且可以避免过拟合。此外,本文还提出了一种联合学习奖励函数和有效时间视野的方法,该方法在实验中取得了良好的结果。Abstract
Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning over a given time horizon to compute an approximately optimal policy for a hypothesized reward function and then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimate and the computational efficiency of IRL algorithms. Interestingly, an effective time horizon shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis leads to a principled choice of the effective horizon for IRL. It also prompts us to reexamine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon together rather than the reward alone with a given horizon. Our experimental results confirm the theoretical analysis.
摘要
逆强化学习(Inverse Reinforcement Learning,IRL)算法通常依赖(前向)强化学习或在给定时间视野上进行规划,为假设的奖励函数计算近似最优策略,然后将该策略与专家示范相匹配。时间视野在奖励估计的准确性和IRL算法的计算效率中都起着关键作用。有趣的是,比真实值更短的有效时间视野往往能更快地得到更好的结果。本工作对这一现象进行了形式化分析并给出解释:时间视野控制了所诱导策略类的复杂度,从而在数据有限时缓解过拟合。这一分析带来了IRL中有效视野的原则性选择方法,也促使我们重新审视经典的IRL表述:与其在给定视野下只学习奖励,不如同时学习奖励和有效视野,这样更为自然。我们的实验结果证实了理论分析。
Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach
results: 该研究发现,使用CNN模型进行情感分类任务可以取得均衡的性能,并可用于社交媒体分析、市场调研和政策研究等实际应用。Abstract
This study addressed the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN), offering a new approach to Natural Language Processing (NLP). The data, sourced from Baidu's PaddlePaddle AI platform, were meticulously preprocessed, tokenized, and categorized based on sentiment labels. A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification. The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments. The findings underscore the effectiveness of CNNs for sentiment analysis tasks, with implications for practical applications in social media analysis, market research, and policy studies. The complete experimental content and code have been made publicly available on the Kaggle data platform for further research and development. Future work may involve exploring different architectures, such as Recurrent Neural Networks (RNN) or transformers, or using more complex pre-trained models like BERT, to further improve the model's ability to understand linguistic nuances and context.
摘要
本研究使用卷积神经网络(CNN)对来自微博的119,988条原创推文数据集进行情感分析这一复杂任务,为自然语言处理(NLP)提供了一种新方法。数据来自百度PaddlePaddle AI平台,经过细致的预处理、分词,并按情感标签分类。模型基于CNN,利用词嵌入进行特征提取,训练后执行情感分类,在测试集上取得约0.73的宏平均F1分数,在积极、中性和消极情感上表现均衡。研究结果印证了CNN在情感分析任务中的有效性,对社交媒体分析、市场调研和政策研究等实际应用具有意义。完整的实验内容和代码已在Kaggle数据平台上公开,以供进一步研究和开发。未来工作可能包括探索循环神经网络(RNN)或Transformer等不同架构,或使用BERT等更复杂的预训练模型,以进一步提升模型理解语言细微差别和上下文的能力。
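A minimal text-CNN of the kind the study describes (embedding, 1-D convolutions over tokens, max-pooling, a three-way sentiment head); the vocabulary size, dimensions, and filter widths are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Embedding -> 1-D convolutions over tokens -> max-pool -> classifier."""
    def __init__(self, vocab=50_000, dim=128, n_classes=3, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, 100, kernel_size=w) for w in widths])
        self.fc = nn.Linear(100 * len(widths), n_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)   # (batch, dim, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))

model = TextCNN()
tokens = torch.randint(1, 50_000, (8, 64))    # a fake batch of tokenized posts
logits = model(tokens)                        # positive / neutral / negative
print(logits.shape)                           # torch.Size([8, 3])
```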
Tensor Decompositions Meet Control Theory: Learning General Mixtures of Linear Dynamical Systems
results: 该算法无需对各分量施加强分离条件,在具有挑战性的部分观测设定下依然有效,并有助于提升对时间序列数据的预测和理解。Abstract
Recently Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a close relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.
摘要
最近,Chen 和 Poor 开创了学习线性动力系统混合模型的研究。线性动力系统本身已广泛应用于时间序列数据建模,而使用混合模型可以获得更好的拟合,甚至更深入地理解数据中所代表的潜在子群体。在这项工作中,我们提出了一种基于张量分解的学习线性动力系统混合模型的新方法。因此,我们的算法无需对各分量施加强分离条件即可成功,并能够与贝叶斯最优的轨迹聚类相竞争。此外,我们的算法在具有挑战性的部分观测设定下同样适用。我们的出发点是一个简单而有力的观察:经典的Ho-Kalman算法与现代用于学习潜变量模型的张量分解方法是近亲。这为我们提供了将其推广到更复杂生成模型的一套方法论。
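Since the abstract builds on the Ho-Kalman algorithm, here is a compact sketch of classic Ho-Kalman realization from noiseless Markov parameters of a single linear dynamical system (the paper's tensor-decomposition extension to mixtures and its handling of noise and partial observation are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth single-input single-output LDS with 3-dim state.
A = np.diag([0.9, 0.5, -0.3])
B = rng.normal(size=(3, 1))
C = rng.normal(size=(1, 3))
markov = [C @ np.linalg.matrix_power(A, k) @ B for k in range(12)]  # h_k = C A^k B

# Hankel matrices of the Markov parameters.
T = 5
H0 = np.block([[markov[i + j] for j in range(T)] for i in range(T)])
H1 = np.block([[markov[i + j + 1] for j in range(T)] for i in range(T)])

U, s, Vt = np.linalg.svd(H0)
n = int((s > 1e-8).sum())                      # numerical rank = state dimension
O = U[:, :n] * np.sqrt(s[:n])                  # observability factor
Q = np.sqrt(s[:n])[:, None] * Vt[:n]           # controllability factor

A_hat = np.linalg.pinv(O) @ H1 @ np.linalg.pinv(Q)
B_hat, C_hat = Q[:, :1], O[:1, :]

# The realization matches the true system up to similarity: check Markov params.
err = max(np.abs(C_hat @ np.linalg.matrix_power(A_hat, k) @ B_hat - markov[k]).max()
          for k in range(12))
print("recovered state dim:", n, " max Markov-parameter error:", err)
```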
DSV: An Alignment Validation Loss for Self-supervised Outlier Model Selection
for: 这篇论文主要关注如何运用自监督学习(Self-supervised learning)进行无监督异常检测(Unsupervised anomaly detection),并提出了名为「Discordance and Separability Validation」(DSV)的无监督验证损失,用于选择高效能的检测模型。
methods: DSV透过代理损失分别近似测试资料的不一致性(discordance)与可分性(separability),以衡量资料增强函数及其超参数与异常生成机制之间的对齐程度,进而选择具有有效增强超参数的检测模型。
results: 本文的实验结果显示,这个名为“Discordance and Separability Validation”的无监督验证损失函数能够帮助选择高性能的检测模型,并且与其他基准相比,具有更高的检测精度。Abstract
Self-supervised learning (SSL) has proven effective in solving various problems by generating internal supervisory signals. Unsupervised anomaly detection, which faces the high cost of obtaining true labels, is an area that can greatly benefit from SSL. However, recent literature suggests that tuning the hyperparameters (HP) of data augmentation functions is crucial to the success of SSL-based anomaly detection (SSAD), yet a systematic method for doing so remains unknown. In this work, we propose DSV (Discordance and Separability Validation), an unsupervised validation loss to select high-performing detection models with effective augmentation HPs. DSV captures the alignment between an augmentation function and the anomaly-generating mechanism with surrogate losses, which approximate the discordance and separability of test data, respectively. As a result, the evaluation via DSV leads to selecting an effective SSAD model exhibiting better alignment, which results in high detection accuracy. We theoretically derive the degree of approximation conducted by the surrogate losses and empirically show that DSV outperforms a wide range of baselines on 21 real-world tasks.
摘要
Artificial Intelligence for Drug Discovery: Are We There Yet?
results: AI技术已经使得许多化合物进入临床试验阶段,但科学界必须仔细评估已知信息,以应对可重复性危机。AI在药物发现中的潜力,只有在具备足够的真实基准数据并在管道后期阶段辅以适当的人类干预时才能充分实现。Abstract
Drug discovery is adapting to novel technologies such as data science, informatics, and artificial intelligence (AI) to accelerate effective treatment development while reducing costs and animal experiments. AI is transforming drug discovery, as indicated by increasing interest from investors, industrial and academic scientists, and legislators. Successful drug discovery requires optimizing properties related to pharmacodynamics, pharmacokinetics, and clinical outcomes. This review discusses the use of AI in the three pillars of drug discovery: diseases, targets, and therapeutic modalities, with a focus on small molecule drugs. AI technologies, such as generative chemistry, machine learning, and multi-property optimization, have enabled several compounds to enter clinical trials. The scientific community must carefully vet known information to address the reproducibility crisis. The full potential of AI in drug discovery can only be realized with sufficient ground truth and appropriate human intervention at later pipeline stages.
摘要
药物发现正在采用数据科学、信息学和人工智能(AI)等新技术,以加速有效治疗的开发,同时降低成本并减少动物实验。AI正在改变药物发现,投资者、产业界和学术界科学家以及立法者都对此表现出日益浓厚的兴趣。成功的药物发现需要优化与药效学、药代动力学和临床结果相关的属性。本文综述了AI在药物发现三大支柱(疾病、靶点和治疗方式)中的应用,重点关注小分子药物。生成化学、机器学习和多属性优化等AI技术已经使多个化合物进入临床试验。科学界必须仔细审核已知信息,以应对可重复性危机。AI在药物发现中的潜力,只有在具备足够的真实基准数据并在管道后期阶段辅以适当的人类干预时才能充分实现。
for: This paper focuses on the interactions between practitioners and the tools they use in machine learning (ML) practices, and how these interactions shape the development of ML systems.
methods: The paper uses an empirical study of questions asked on the Stack Exchange forums to explore the use of interactive computing platforms (e.g. Jupyter Notebook and Google Colab) in ML practices.
results: The paper finds that interactive computing platforms are used in a variety of learning and coordination practices, which constitutes an infrastructural relationship between interactive computing platforms and ML practitioners. The paper also highlights how this relationship risks making invisible aspects of the ML life cycle that are important for the societal impact of deployed ML systems.Abstract
Machine Learning (ML) systems, particularly when deployed in high-stakes domains, are deeply consequential. They can exacerbate existing inequities, create new modes of discrimination, and reify outdated social constructs. Accordingly, the social context (i.e. organisations, teams, cultures) in which ML systems are developed is a site of active research for the field of AI ethics, and intervention for policymakers. This paper focuses on one aspect of social context that is often overlooked: interactions between practitioners and the tools they rely on, and the role these interactions play in shaping ML practices and the development of ML systems. In particular, through an empirical study of questions asked on the Stack Exchange forums, the use of interactive computing platforms (e.g. Jupyter Notebook and Google Colab) in ML practices is explored. I find that interactive computing platforms are used in a host of learning and coordination practices, which constitutes an infrastructural relationship between interactive computing platforms and ML practitioners. I describe how ML practices are co-evolving alongside the development of interactive computing platforms, and highlight how this risks making invisible aspects of the ML life cycle that AI ethics researchers' have demonstrated to be particularly salient for the societal impact of deployed ML systems.
摘要
机器学习(ML)系统,特别是部署在高风险领域时,影响深远:它们可能加剧既有的不平等、产生新的歧视形式,并固化过时的社会建构。因此,ML系统开发所处的社会环境(即组织、团队、文化)既是AI伦理领域的活跃研究对象,也是政策制定者的干预对象。本文关注社会环境中一个常被忽视的方面:实践者与其所依赖工具之间的互动,以及这些互动在塑造ML实践和ML系统开发中的作用。通过对Stack Exchange论坛上所提问题的实证研究,本文探讨了交互式计算平台(如Jupyter Notebook和Google Colab)在ML实践中的使用。我发现这些平台被用于多种学习与协调实践,这构成了交互式计算平台与ML实践者之间的一种基础设施关系。我描述了ML实践如何随交互式计算平台的发展而共同演化,并强调这有可能使ML生命周期中某些环节变得不可见,而AI伦理研究者已经证明,这些环节对已部署ML系统的社会影响尤为重要。
results: 研究者将所提出的"信念校准循环"框架付诸实践,找到了一组能够跨不同情境泛化的聚类最优信念强度的帕累托前沿,并在一个信贷决策的玩具数据集上验证了其有效性,表明可以将人类价值观和信念更准确地校准进AI系统的决策过程。Abstract
Beliefs and values are increasingly being incorporated into our AI systems through alignment processes, such as carefully curating data collection principles or regularizing the loss function used for training. However, the meta-alignment problem is that these human beliefs are diverse and not aligned across populations; furthermore, the implicit strength of each belief may not be well calibrated even among humans, especially when trying to generalize across contexts. Specifically, in high regret situations, we observe that contextual counterfactuals and recourse costs are particularly important in updating a decision maker's beliefs and the strengths to which such beliefs are held. Therefore, we argue that including counterfactuals is key to an accurate calibration of beliefs during alignment. To do this, we first segment belief diversity into two categories: subjectivity (across individuals within a population) and epistemic uncertainty (within an individual across different contexts). By leveraging our notion of epistemic uncertainty, we introduce `the belief calibration cycle' framework to more holistically calibrate this diversity of beliefs with context-driven counterfactual reasoning by using a multi-objective optimization. We empirically apply our framework for finding a Pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions.
摘要
信念与价值观正日益通过对齐过程被纳入我们的AI系统,例如精心制定数据收集原则,或对训练所用的损失函数进行正则化。然而,元对齐问题在于:人类的信念是多样的,在不同人群之间并不一致;而且即便在人类之间,每种信念所隐含的强度也未必得到良好校准,在跨情境泛化时尤其如此。具体而言,在高遗憾情形下,我们观察到情境化的反事实与补救成本对更新决策者的信念及其持有强度尤为重要。因此,我们认为纳入反事实是对齐过程中准确校准信念的关键。为此,我们首先将信念多样性分为两类:主观性(同一群体内不同个体之间)与认知不确定性(同一个体在不同情境之间)。借助认知不确定性这一概念,我们提出了"信念校准循环"框架,通过情境驱动的反事实推理和多目标优化,更全面地校准这些多样的信念。我们在一个信贷决策的玩具数据集上实证应用该框架,找到了一组能够跨情境泛化的聚类最优信念强度的帕累托前沿,证明了其有效性。
Hybrid Control Policy for Artificial Pancreas via Ensemble Deep Reinforcement Learning
paper_authors: Wenzhou Lv, Tianyu Wu, Luolin Xiong, Liang Wu, Jian Zhou, Yang Tang, Feng Qian
for: 这项研究的目的是为1型糖尿病(T1DM)患者开发一种能够实现闭环血糖控制的人工胰腺(AP)系统,以改善其血糖水平控制。
methods: 研究提出了一种混合控制策略,结合模型预测控制(MPC)与深度强化学习(DRL),以融合MPC的安全性、稳定性与DRL的个性化、适应性;并引入元学习技术,利用有限数据快速适应新患者。
results: 在FDA批准的UVA/Padova T1DM仿真器上的三种场景中,该控制策略实现了最高的目标血糖范围内时间占比,并使低血糖发生次数最低。结论:结果表明,所提控制策略能够有效实现闭环血糖控制,展现了在实际应用中改善T1DM患者血糖控制的潜力。Abstract
Objective: The artificial pancreas (AP) has shown promising potential in achieving closed-loop glucose control for individuals with type 1 diabetes mellitus (T1DM). However, designing an effective control policy for the AP remains challenging due to the complex physiological processes, delayed insulin response, and inaccurate glucose measurements. While model predictive control (MPC) offers safety and stability through the dynamic model and safety constraints, it lacks individualization and is adversely affected by unannounced meals. Conversely, deep reinforcement learning (DRL) provides personalized and adaptive strategies but faces challenges with distribution shifts and substantial data requirements. Methods: We propose a hybrid control policy for the artificial pancreas (HyCPAP) to address the above challenges. HyCPAP combines an MPC policy with an ensemble DRL policy, leveraging the strengths of both policies while compensating for their respective limitations. To facilitate faster deployment of AP systems in real-world settings, we further incorporate meta-learning techniques into HyCPAP, leveraging previous experience and patient-shared knowledge to enable fast adaptation to new patients with limited available data. Results: We conduct extensive experiments using the FDA-accepted UVA/Padova T1DM simulator across three scenarios. Our approaches achieve the highest percentage of time spent in the desired euglycemic range and the lowest occurrences of hypoglycemia. Conclusion: The results clearly demonstrate the superiority of our methods for closed-loop glucose management in individuals with T1DM. Significance: The study presents novel control policies for AP systems, affirming the great potential of proposed methods for efficient closed-loop glucose control.
摘要
目标:人工胰腺(AP)在为1型糖尿病(T1DM)患者实现闭环血糖控制方面展现出令人期待的潜力。然而,由于生理过程复杂、胰岛素响应存在延迟以及血糖测量不准确,为AP设计有效的控制策略仍然充满挑战。模型预测控制(MPC)借助动力学模型和安全约束提供了安全性与稳定性,但缺乏个性化,且易受未预告进餐的不利影响;相反,深度强化学习(DRL)能提供个性化、自适应的策略,却面临分布偏移和大量数据需求的挑战。方法:我们提出一种人工胰腺混合控制策略(HyCPAP)以应对上述挑战。HyCPAP将MPC策略与集成DRL策略相结合,在发挥两者优势的同时弥补各自的局限。为了便于AP系统在真实环境中更快部署,我们进一步在HyCPAP中引入元学习技术,利用既往经验和患者共享知识,使其能够在可用数据有限的情况下快速适应新患者。结果:我们使用FDA批准的UVA/Padova T1DM仿真器在三种场景下开展了大量实验。我们的方法实现了最高的目标血糖范围内时间占比和最低的低血糖发生次数。结论:结果清楚地表明了我们的方法在T1DM患者闭环血糖管理中的优越性。意义:本研究为AP系统提出了新的控制策略,印证了所提方法在高效闭环血糖控制方面的巨大潜力。
Microbial Genetic Algorithm-based Black-box Attack against Interpretable Deep Learning Systems
results: 我们的实验结果表明,QuScore 能够以较少的查询次数找到可以欺骗 IDLS 的对抗样本,并且可以在不同的 DNN 模型和解释模型上实现高攻击成功率。在 ImageNet 和 CIFAR 数据集上,我们取得了95%-100% 的攻击成功率和平均69% 的迁移成功率。Abstract
Deep learning models are susceptible to adversarial samples in white and black-box environments. Although previous studies have shown high attack success rates, coupling DNN models with interpretation models could offer a sense of security when a human expert is involved, who can identify whether a given sample is benign or malicious. However, in white-box environments, interpretable deep learning systems (IDLSes) have been shown to be vulnerable to malicious manipulations. In black-box settings, as access to the components of IDLSes is limited, it becomes more challenging for the adversary to fool the system. In this work, we propose a Query-efficient Score-based black-box attack against IDLSes, QuScore, which requires no knowledge of the target model and its coupled interpretation model. QuScore is based on transfer-based and score-based methods by employing an effective microbial genetic algorithm. Our method is designed to reduce the number of queries necessary to carry out successful attacks, resulting in a more efficient process. By continuously refining the adversarial samples created based on feedback scores from the IDLS, our approach effectively navigates the search space to identify perturbations that can fool the system. We evaluate the attack's effectiveness on four CNN models (Inception, ResNet, VGG, DenseNet) and two interpretation models (CAM, Grad), using both ImageNet and CIFAR datasets. Our results show that the proposed approach is query-efficient with a high attack success rate that can reach between 95% and 100% and transferability with an average success rate of 69% in the ImageNet and CIFAR datasets. Our attack method generates adversarial examples with attribution maps that resemble benign samples. We have also demonstrated that our attack is resilient against various preprocessing defense techniques and can easily be transferred to different DNN models.
摘要
深度学习模型在白盒和黑盒环境中都容易受到对抗样本的攻击。以往的研究表明,将DNN模型与解释模型相耦合,在有人类专家参与判断给定样本是良性还是恶意时,可以带来一定的安全感。然而,在白盒环境中,可解释深度学习系统(IDLS)已被证明容易受到恶意操纵;而在黑盒设定下,由于攻击者对IDLS各组件的访问受限,欺骗系统变得更加困难。在这项工作中,我们提出了一种针对IDLS的查询高效、基于得分的黑盒攻击QuScore,它无需了解目标模型及其耦合的解释模型。QuScore基于迁移与得分方法,并采用一种高效的微生物遗传算法。我们的方法旨在减少成功攻击所需的查询次数,从而提高效率。通过依据IDLS反馈得分不断改进生成的对抗样本,我们的方法能够有效地在搜索空间中找到可以欺骗系统的扰动。我们在四种CNN模型(Inception、ResNet、VGG、DenseNet)和两种解释模型(CAM、Grad)上,使用ImageNet和CIFAR数据集评估了攻击效果。结果表明,所提方法查询高效,攻击成功率可达95%到100%,在ImageNet和CIFAR数据集上的平均迁移成功率为69%。我们的攻击方法生成的对抗样本,其归因图与良性样本相似。我们还证明了该攻击对多种预处理防御技术具有鲁棒性,并且可以轻松迁移到不同的DNN模型。
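For flavor, here is the microbial genetic algorithm loop (Harvey-style: random pairwise tournament, the loser is overwritten by recombination and mutation) on a toy bit-string objective. In QuScore this kind of loop would instead be driven by feedback scores from the IDLS; the toy fitness function and rates below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(g):
    return g.sum()                        # toy objective: maximize ones

pop = rng.integers(0, 2, size=(30, 64))   # population of bit-string genotypes

for _ in range(3000):
    i, j = rng.choice(len(pop), size=2, replace=False)  # random tournament
    win, lose = (i, j) if fitness(pop[i]) >= fitness(pop[j]) else (j, i)
    # Loser is overwritten gene-by-gene: copy from winner with prob 0.5
    # (recombination), then flip bits with prob 1/64 (mutation).
    copy = rng.random(64) < 0.5
    pop[lose, copy] = pop[win, copy]
    flip = rng.random(64) < 1 / 64
    pop[lose, flip] ^= 1

best = max(pop, key=fitness)
print("best fitness:", fitness(best), "of 64")
```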
Embracing the chaos: analysis and diagnosis of numerical instability in variational flows
results: 作者发现,尽管流动可能会出现严重的数值不稳定性,但是在应用中,流动的结果通常足够准确。此外,作者还开发了一种用于实践中验证流动结果的诊断方法。Abstract
In this paper, we investigate the impact of numerical instability on the reliability of sampling, density evaluation, and evidence lower bound (ELBO) estimation in variational flows. We first empirically demonstrate that common flows can exhibit a catastrophic accumulation of error: the numerical flow map deviates significantly from the exact map -- which affects sampling -- and the numerical inverse flow map does not accurately recover the initial input -- which affects density and ELBO computations. Surprisingly though, we find that results produced by flows are often accurate enough for applications despite the presence of serious numerical instability. In this work, we treat variational flows as dynamical systems, and leverage shadowing theory to elucidate this behavior via theoretical guarantees on the error of sampling, density evaluation, and ELBO estimation. Finally, we develop and empirically test a diagnostic procedure that can be used to validate results produced by numerically unstable flows in practice.
摘要
在这篇论文中,我们研究了数值不稳定性对变分流中采样、密度评估和证据下界(ELBO)估计可靠性的影响。我们首先通过实验表明,常见的流可能出现灾难性的误差积累:数值流映射与精确映射偏离显著,从而影响采样;数值逆流映射无法准确地恢复初始输入,从而影响密度和ELBO的计算。然而令人惊讶的是,尽管存在严重的数值不稳定性,流所产生的结果在应用中往往足够准确。在本文中,我们将变分流视为动力系统,并利用阴影(shadowing)理论,为采样、密度评估和ELBO估计的误差提供理论保证,从而阐明这一行为。最后,我们开发并实证检验了一种诊断程序,可用于在实践中验证数值不稳定的流所产生的结果。
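A small numerical illustration of the failure mode described above, assuming a generic affine-coupling flow in float32: composing the numerical forward map with its analytic inverse does not return the input exactly, and the reconstruction error accumulates with depth. This is only an illustration, not the paper's diagnostic procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D, depth = 4, 100
# Random affine-coupling layers: y2 = x2 * exp(tanh(S x1)) + T x1, with the
# two halves swapped after each layer so all coordinates get transformed.
params = [(rng.normal(scale=0.5, size=(2, 2)).astype(np.float32),
           rng.normal(scale=0.5, size=(2, 2)).astype(np.float32))
          for _ in range(depth)]

def forward(x):
    for S, T in params:
        x1, x2 = x[:2], x[2:]
        x = np.concatenate([x2 * np.exp(np.tanh(S @ x1)) + T @ x1, x1])
    return x

def inverse(y):
    for S, T in reversed(params):
        y2, x1 = y[:2], y[2:]
        x2 = (y2 - T @ x1) * np.exp(-np.tanh(S @ x1))
        y = np.concatenate([x1, x2])
    return y

x = rng.normal(size=D).astype(np.float32)
x_rec = inverse(forward(x))
# In exact arithmetic this would be zero; float32 roundoff accumulates.
print("reconstruction error after", depth, "layers:", np.abs(x - x_rec).max())
```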
Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!
results: 论文发现,传播学学者大多忽视自动分类器的误差,而这些误差会导致误分类偏差和误导性结果。论文还提出了一种新的误差校正方法,并通过蒙特卡洛模拟进行了测试,发现这种方法通用且高效。因此,这种方法可用于校正自动分类器的误差,以提高测量结果的准确性。Abstract
Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses-unless such analyses account for these errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use "gold standard" validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates. We introduce and test such methods, including a new method we design and implement in the R package misclassificationmodels, via Monte Carlo simulations designed to reveal each method's limitations, which we also release. Based on our results, we recommend our new error correction method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
摘要
自动分类器(AC)通常通过有监督机器学习(SML)构建,能够对从文本到图像、视频等大规模、具有统计效力的数据样本进行分类,已成为传播学及相关领域广泛使用的测量工具。尽管如此,即便是高度准确的分类器也会出错,造成误分类偏差,并在下游分析中产生误导性结果,除非这些分析对误差加以校正。正如我们对SML应用的系统性文献综述所示,传播学学者在很大程度上忽视了误分类偏差。原则上,现有的统计方法可以利用"金标准"验证数据(例如由人工标注者创建的数据)来校正误分类偏差并得到一致估计。我们介绍并检验了这类方法,包括我们在R包misclassificationmodels中设计并实现的一种新方法,检验借助旨在揭示各方法局限性的蒙特卡洛模拟进行(模拟代码亦一并发布)。基于结果,我们推荐我们的新误差校正方法,因为它通用且高效。总之,自动分类器,即便其准确率低于常用标准或存在系统性误分类,只要研究设计审慎并采用适当的误差校正方法,仍然可用于测量。
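As a minimal illustration of correcting with gold-standard validation data, here is the classic Rogan-Gladen prevalence correction using sensitivity and specificity estimated on a human-annotated validation set. The paper's R package targets regression settings and goes well beyond this prevalence-only sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gold-standard validation set: true labels plus classifier predictions.
n_val, sens_true, spec_true = 1000, 0.85, 0.90
truth = rng.random(n_val) < 0.30
pred = np.where(truth,
                rng.random(n_val) < sens_true,      # hits on true positives
                rng.random(n_val) < 1 - spec_true)  # false positives

sens = pred[truth].mean()          # estimated sensitivity
spec = (~pred[~truth]).mean()      # estimated specificity

# Large unlabeled corpus coded by the same classifier.
p_obs = 0.32                       # observed (naive) positive rate
p_corrected = (p_obs + spec - 1) / (sens + spec - 1)   # Rogan-Gladen estimator
print(f"naive prevalence {p_obs:.3f} -> corrected {p_corrected:.3f}")
```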
Tackling Combinatorial Distribution Shift: A Matrix Completion Perspective
results: 论文在具有已知解的金融相关问题上测试了该方法,证明了其有效性。此外,论文还构造了一些新问题并推导了它们的解析纳什均衡解,作为评估所提深度学习方法的额外基准。Abstract
In this paper, we propose a numerical methodology for finding the closed-loop Nash equilibrium of stochastic delay differential games through deep learning. These games are prevalent in finance and economics where multi-agent interaction and delayed effects are often desired features in a model, but are introduced at the expense of increased dimensionality of the problem. This increased dimensionality is especially significant as that arising from the number of players is coupled with the potential infinite dimensionality caused by the delay. Our approach involves parameterizing the controls of each player using distinct recurrent neural networks. These recurrent neural network-based controls are then trained using a modified version of Brown's fictitious play, incorporating deep learning techniques. To evaluate the effectiveness of our methodology, we test it on finance-related problems with known solutions. Furthermore, we also develop new problems and derive their analytical Nash equilibrium solutions, which serve as additional benchmarks for assessing the performance of our proposed deep learning approach.
摘要
在这篇论文中,我们提出了一种通过深度学习求解随机时滞微分博弈闭环纳什均衡的数值方法。这类博弈在金融和经济领域十分常见,多主体交互和时滞效应往往是模型所需的特性,但其代价是问题维度的增加。这种维度增加尤为显著,因为玩家数量带来的维度与时滞可能引起的无穷维度相互耦合。我们的方法使用相互独立的循环神经网络来参数化每个玩家的控制,然后采用结合深度学习技术的改进版Brown虚拟对策(fictitious play)来训练这些基于循环神经网络的控制。为评估方法的有效性,我们在具有已知解的金融相关问题上进行了测试。此外,我们还构造了新的问题并推导了它们的解析纳什均衡解,作为评估所提深度学习方法性能的额外基准。
On Collaboration in Distributed Parameter Estimation with Resource Constraints
results: 本研究通过仿真表明,所提出的多臂老虎机算法(DOUBLE-F、DOUBLE-Z、UCB-F、UCB-Z)在分布式参数估计问题中是有效的,能够在资源受限且部分观测相关性未知的情况下学习最优的数据收集与协作策略。Abstract
We study sensor/agent data collection and collaboration policies for parameter estimation, accounting for resource constraints and correlation between observations collected by distinct sensors/agents. Specifically, we consider a group of sensors/agents each samples from different variables of a multivariate Gaussian distribution and has different estimation objectives, and we formulate a sensor/agent's data collection and collaboration policy design problem as a Fisher information maximization (or Cramer-Rao bound minimization) problem. When the knowledge of correlation between variables is available, we analytically identify two particular scenarios: (1) where the knowledge of the correlation between samples cannot be leveraged for collaborative estimation purposes and (2) where the optimal data collection policy involves investing scarce resources to collaboratively sample and transfer information that is not of immediate interest and whose statistics are already known, with the sole goal of increasing the confidence on the estimate of the parameter of interest. When the knowledge of certain correlation is unavailable but collaboration may still be worthwhile, we propose novel ways to apply multi-armed bandit algorithms to learn the optimal data collection and collaboration policy in our distributed parameter estimation problem and demonstrate that the proposed algorithms, DOUBLE-F, DOUBLE-Z, UCB-F, UCB-Z, are effective through simulations.
摘要
我们研究用于参数估计的传感器/智能体数据收集与协作策略,同时考虑资源限制以及不同传感器/智能体所采集观测之间的相关性。具体而言,我们考虑一组传感器/智能体,每个都从一个多元高斯分布的不同变量中采样,且各自有不同的估计目标;我们将传感器/智能体的数据收集与协作策略设计问题表述为Fisher信息最大化(或Cramer-Rao界最小化)问题。当变量之间的相关性已知时,我们解析地刻画了两种特定情形:(1)样本间相关性知识无法被用于协作估计;(2)最优数据收集策略需要投入稀缺资源去协作采样并传输那些本身并非直接关注、统计特性也已知的信息,其唯一目的是提高对目标参数估计的置信度。当某些相关性知识不可得、但协作仍可能有价值时,我们提出了将多臂老虎机算法应用于该分布式参数估计问题、以学习最优数据收集与协作策略的新方法,并通过仿真证明了所提算法DOUBLE-F、DOUBLE-Z、UCB-F、UCB-Z的有效性。
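A minimal UCB1 loop for intuition on the bandit ingredient; the paper's DOUBLE-F/DOUBLE-Z/UCB-F/UCB-Z variants tailor the reward to Fisher-information-style collaboration gains, which this toy Gaussian-reward example does not model.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.35, 0.45])   # unknown value of each choice

n_arms, horizon = len(means), 5000
counts, sums = np.zeros(n_arms), np.zeros(n_arms)

for t in range(horizon):
    if t < n_arms:
        arm = t                                       # play each arm once
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))                     # optimism under uncertainty
    reward = means[arm] + 0.1 * rng.normal()          # noisy feedback
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts.astype(int))           # concentrates on arm 1
```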
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
results: 在固定计算预算下用这些方法预训练BERT和T5时,我们发现,与采用完全衰减学习率的基线相比,它们在训练、验证和下游性能上的收益消失了。我们定义了一种评估协议,通过将所有计算时间映射到一台参考机器(参考系统时间),使计算可以在任意机器上进行。Abstract
The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.
摘要
近年来,训练基于Transformer的语言模型所需的计算量急剧增长。这一趋势推动了对高效训练算法的研究,这些算法旨在比标准训练更快地提升训练、验证和下游性能。在这项工作中,我们回顾了三类此类算法:动态架构(层堆叠、层丢弃)、批量选择(选择性反向传播、RHO损失)和高效优化器(Lion、Sophia)。在固定计算预算下用这些方法预训练BERT和T5时,我们发现,与采用完全衰减学习率的基线相比,它们在训练、验证和下游上的收益消失了。我们定义了一种评估协议,通过将所有计算时间映射到一台我们称为参考系统时间的参考机器,使计算可以在任意机器上进行。我们讨论了所提评估协议的局限性,并发布了代码,以促进对高效训练方法的严谨研究:https://github.com/JeanKaddour/NoTrainNoGain。
Improved selective background Monte Carlo simulation at Belle II with graph attention networks and weighted events
results: 通过引入图注意力机制提升了过滤器的性能,并研究了采样和重加权等统计方法,以处理过滤所引入的偏差。Abstract
When measuring rare processes at Belle II, a huge luminosity is required, which means a large number of simulations are necessary to determine signal efficiencies and background contributions. However, this process demands high computation costs while most of the simulated data, in particular in case of background, are discarded by the event selection. Thus, filters using graph neural networks are introduced at an early stage to save the resources for the detector simulation and reconstruction of events discarded at analysis level. In our work, we improved the performance of the filters using graph attention and investigated statistical methods including sampling and reweighting to deal with the biases introduced by the filtering.
摘要
在 Belle II 上测量稀有过程时需要巨大的亮度,这意味着需要大量模拟来确定信号效率和本底贡献。然而,这一过程计算成本很高,而大部分模拟数据(尤其是本底)都会被事件筛选所丢弃。因此,我们在早期阶段引入基于图神经网络的过滤器,以节省那些在分析层面会被丢弃的事件在探测器模拟和重建上所耗费的资源。在我们的工作中,我们利用图注意力机制提升了过滤器的性能,并研究了包括采样和重加权在内的统计方法,以处理过滤所引入的偏差。
Energy Discrepancies: A Score-Independent Loss for Energy-Based Models
paper_authors: Tobias Schröder, Zijing Ou, Jen Ning Lim, Yingzhen Li, Sebastian J. Vollmer, Andrew B. Duncan
for: 提高能量基模型的训练效率和准确性
methods: 提出了一种新的损失函数 called Energy Discrepancy (ED), 不需要计算分数或昂贵的Markov链 Monte Carlo
results: 在数值实验中,ED可以更快和更准确地学习低维数据分布,并且在高维图像数据上表现出较好的效果。Abstract
Energy-based models are a simple yet powerful class of probabilistic models, but their widespread adoption has been limited by the computational burden of training them. We propose a novel loss function called Energy Discrepancy (ED) which does not rely on the computation of scores or expensive Markov chain Monte Carlo. We show that ED approaches the explicit score matching and negative log-likelihood loss under different limits, effectively interpolating between both. Consequently, minimum ED estimation overcomes the problem of nearsightedness encountered in score-based estimation methods, while also enjoying theoretical guarantees. Through numerical experiments, we demonstrate that ED learns low-dimensional data distributions faster and more accurately than explicit score matching or contrastive divergence. For high-dimensional image data, we describe how the manifold hypothesis puts limitations on our approach and demonstrate the effectiveness of energy discrepancy by training the energy-based model as a prior of a variational decoder model.
摘要
Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection
results: 该文的实验结果表明,DPDGC模型能够在隐私与效用之间取得更好的权衡,并且在七个节点分类基准数据集上优于基于标准图卷积设计的现有DP-GNN。Abstract
Graph learning methods, such as Graph Neural Networks (GNNs) based on graph convolutions, are highly successful in solving real-world learning problems involving graph-structured data. However, graph learning methods expose sensitive user information and interactions not only through their model parameters but also through their model predictions. Consequently, standard Differential Privacy (DP) techniques that merely offer model weight privacy are inadequate. This is especially the case for node predictions that leverage neighboring node attributes directly via graph convolutions that create additional risks of privacy leakage. To address this problem, we introduce Graph Differential Privacy (GDP), a new formal DP framework tailored to graph learning settings that ensures both provably private model parameters and predictions. Furthermore, since there may be different privacy requirements for the node attributes and graph structure, we introduce a novel notion of relaxed node-level data adjacency. This relaxation can be used for establishing guarantees for different degrees of graph topology privacy while maintaining node attribute privacy. Importantly, this relaxation reveals a useful trade-off between utility and topology privacy for graph learning methods. In addition, our analysis of GDP reveals that existing DP-GNNs fail to exploit this trade-off due to the complex interplay between graph topology and attribute data in standard graph convolution designs. To mitigate this problem, we introduce the Differentially Private Decoupled Graph Convolution (DPDGC) model, which benefits from decoupled graph convolution while providing GDP guarantees. Extensive experiments on seven node classification benchmarking datasets demonstrate the superior privacy-utility trade-off of DPDGC over existing DP-GNNs based on standard graph convolution design.
摘要
图学习方法(如基于图卷积的图神经网络GNN)在解决涉及图结构数据的现实学习问题上非常成功。然而,图学习方法不仅会通过模型参数、还会通过模型预测泄露敏感的用户信息和交互。因此,仅提供模型权重隐私的标准差分隐私(DP)技术是不够的,对于通过图卷积直接利用邻居节点属性的节点预测而言尤其如此,这会带来额外的隐私泄露风险。为解决这一问题,我们提出了图差分隐私(GDP),一种为图学习场景量身定制的新的形式化DP框架,可同时保证模型参数和预测结果的可证隐私。此外,由于节点属性和图结构可能有不同的隐私要求,我们引入了一种新的宽松节点级数据相邻关系概念。这种宽松可用于在保持节点属性隐私的同时,为不同程度的图拓扑隐私建立保证。重要的是,这种宽松揭示了图学习方法在效用与拓扑隐私之间的一种有用权衡。我们对GDP的分析还表明,由于标准图卷积设计中图拓扑与属性数据之间复杂的相互作用,现有的DP-GNN未能利用这一权衡。为缓解该问题,我们提出了差分隐私解耦图卷积(DPDGC)模型,它受益于解耦的图卷积,同时提供GDP保证。在七个节点分类基准数据集上的大量实验表明,DPDGC在隐私-效用权衡方面优于基于标准图卷积设计的现有DP-GNN。
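A sketch of what "decoupled" graph convolution means, in the spirit of propagate-then-transform designs: parameter-free feature smoothing followed by an ordinary classifier. The differential-privacy noise injection that makes DPDGC private is omitted; the toy graph and features are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy graph: two 10-node communities with distinct feature means.
n = 20
A = (rng.random((n, n)) < 0.15).astype(float)
A[:10, :10] = np.maximum(A[:10, :10], rng.random((10, 10)) < 0.5)
A[10:, 10:] = np.maximum(A[10:, 10:], rng.random((10, 10)) < 0.5)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)                       # self-loops
X = rng.normal(size=(n, 4)) + np.repeat([[1.0], [-1.0]], 10, axis=0)

# Decoupled stage 1: parameter-free propagation H <- (D^-1/2 A D^-1/2)^K X.
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))
H = X.copy()
for _ in range(2):                             # K = 2 hops
    H = A_norm @ H

# Decoupled stage 2: an ordinary, graph-free classifier on smoothed features.
# In DPDGC, calibrated noise would be injected between the two decoupled
# stages to obtain the privacy guarantee.
y = np.repeat([0, 1], 10)
clf = LogisticRegression().fit(H, y)
print("train accuracy on smoothed features:", clf.score(H, y))
```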
results: 本文通过仿真评估了该假设检验的性能,并在一项人类银屑病研究的数据上演示了所建议的线性BN结构发现工作流程,以帮助选择合适的结构发现算法。Abstract
Bayesian network (BN) structure discovery algorithms typically either make assumptions about the sparsity of the true underlying network, or are limited by computational constraints to networks with a small number of variables. While these sparsity assumptions can take various forms, frequently the assumptions focus on an upper bound for the maximum in-degree of the underlying graph $\nabla_G$. Theorem 2 in Duttweiler et al. (2023) demonstrates that the largest eigenvalue of the normalized inverse covariance matrix ($\Omega$) of a linear BN is a lower bound for $\nabla_G$. Building on this result, this paper provides the asymptotic properties of, and a debiasing procedure for, the sample eigenvalues of $\Omega$, leading to a hypothesis test that may be used to determine if the BN has max in-degree greater than 1. A linear BN structure discovery workflow is suggested in which the investigator uses this hypothesis test to aid in selecting an appropriate structure discovery algorithm. The hypothesis test performance is evaluated through simulations and the workflow is demonstrated on data from a human psoriasis study.
摘要
贝叶斯网络(BN)结构发现算法通常要么假设真实网络是稀疏的,要么受计算限制只能处理变量较少的网络。这些稀疏性假设形式多样,但往往集中在真实图最大入度 $\nabla_G$ 的上界上。Duttweiler 等人(2023)的定理2表明,线性BN的规范化逆协方差矩阵($\Omega$)的最大特征值是 $\nabla_G$ 的一个下界。在此结果的基础上,本文给出了 $\Omega$ 样本特征值的渐近性质和一种去偏方法,从而得到一个可用于判断BN最大入度是否大于1的假设检验。文中还建议了一种线性BN结构发现工作流程,研究者可以利用该假设检验来辅助选择合适的结构发现算法。该假设检验的性能通过仿真进行了评估,并在一项人类银屑病研究的数据上演示了该工作流程。
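A small numerical illustration of the eigenvalue quantity, assuming "normalized" means the precision matrix rescaled to unit diagonal (the paper's exact normalization, debiasing procedure, and test thresholds may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear BN: X1, X2 independent; X3 = X1 + X2 + noise (max in-degree 2).
n = 5000
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = X1 + X2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([X1, X2, X3])

prec = np.linalg.inv(np.cov(X, rowvar=False))      # sample precision matrix
d = np.sqrt(np.diag(prec))
omega = prec / np.outer(d, d)                      # rescaled to unit diagonal

lam_max = np.linalg.eigvalsh(omega).max()
print("largest eigenvalue of normalized precision:", lam_max)
# For fully independent variables omega would be the identity (eigenvalue 1);
# a value well above 1 is the kind of evidence the paper's debiased test
# formalizes for deciding whether the max in-degree exceeds 1.
```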
Trainability, Expressivity and Interpretability in Gated Neural ODEs
results: 作者们证明了gnODE能够学习(近似的)连续吸引子,并且在大幅提升可解释性的同时保留了建模能力,甚至可以显式地可视化所学吸引子的结构。此外,作者们还引入了一种新的表达能力度量,用于探查神经网络生成复杂轨迹的能力。Abstract
Understanding how the dynamics in biological and artificial neural networks implement the computations required for a task is a salient open question in machine learning and neuroscience. In particular, computations requiring complex memory storage and retrieval pose a significant challenge for these networks to implement or learn. Recently, a family of models described by neural ordinary differential equations (nODEs) has emerged as powerful dynamical neural network models capable of capturing complex dynamics. Here, we extend nODEs by endowing them with adaptive timescales using gating interactions. We refer to these as gated neural ODEs (gnODEs). Using a task that requires memory of continuous quantities, we demonstrate the inductive bias of the gnODEs to learn (approximate) continuous attractors. We further show how reduced-dimensional gnODEs retain their modeling power while greatly improving interpretability, even allowing explicit visualization of the structure of learned attractors. We introduce a novel measure of expressivity which probes the capacity of a neural network to generate complex trajectories. Using this measure, we explore how the phase-space dimension of the nODEs and the complexity of the function modeling the flow field contribute to expressivity. We see that a more complex function for modeling the flow field allows a lower-dimensional nODE to capture a given target dynamics. Finally, we demonstrate the benefit of gating in nODEs on several real-world tasks.
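The gating mechanism can be sketched in a few lines. The parameterization below, dx/dt = z(x) * (-x + f(x)) with a sigmoid gate z acting as an adaptive per-unit timescale, is one plausible reading of gating interactions, not necessarily the paper's exact form.

```python
import torch
import torch.nn as nn

class GatedNeuralODE(nn.Module):
    """Minimal sketch of a gated nODE: a gating network z(x) adapts each
    unit's effective timescale, dx/dt = z(x) * (-x + f(x)). The gate/flow
    parameterization is illustrative, not the paper's exact one."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.flow = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x0, steps=100, dt=0.05):
        x = x0
        for _ in range(steps):  # forward Euler integration of the flow field
            dx = self.gate(x) * (-x + self.flow(x))
            x = x + dt * dx
        return x
```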
Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization
results: Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and also improves performance on a related weakly-supervised task. Abstract
Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels. I.e., we determine the subset of labels for each \emph{slice} of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels as desired. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and improves performance on a related weakly-supervised task as well.
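The slice-labeling step admits a compact sketch. The function below assumes a base model mapping a video to per-class probabilities, two videos of equal length with disjoint video-level labels, and an illustrative 0.5 threshold.

```python
import torch

def slice_labels(base_model, video_a, video_b, sl):
    """Sketch of the label-refinement step: replace frames of video_a outside
    the slice `sl` with frames from video_b (chosen to share no video-level
    labels), then read the base model's video-level prediction as pseudo-labels
    for just that slice. `base_model` returning per-class probabilities is an
    assumption of this sketch."""
    synthetic = video_b.clone()
    synthetic[sl] = video_a[sl]        # only the slice of interest comes from video_a
    with torch.no_grad():
        probs = base_model(synthetic)  # video-level class probabilities
    return (probs > 0.5).float()       # slice-level pseudo-labels
```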
Personalized Anomaly Detection in PPG Data using Representation Learning and Biometric Identification
results: The results show that representation learning significantly improves anomaly detection performance while reducing the high inter-subject variability, and personalized models enhance performance further, underscoring the role of personalization for PPG signals. The biometric identification results show that it is easier to distinguish a new user from one intended authorized user than from a group of users. Overall, this study provides evidence of the effectiveness of representation learning and personalization for anomaly detection in PPG data. Abstract
Photoplethysmography (PPG) signals, typically acquired from wearable devices, hold significant potential for continuous fitness-health monitoring. In particular, heart conditions that manifest in rare and subtle deviating heart patterns may be interesting. However, robust and reliable anomaly detection within these data remains a challenge due to the scarcity of labeled data and high inter-subject variability. This paper introduces a two-stage framework leveraging representation learning and personalization to improve anomaly detection performance in PPG data. The proposed framework first employs representation learning to transform the original PPG signals into a more discriminative and compact representation. We then apply three different unsupervised anomaly detection methods for movement detection and biometric identification. We validate our approach using two different datasets in both generalized and personalized scenarios. The results show that representation learning significantly improves anomaly detection performance while reducing the high inter-subject variability. Personalized models further enhance anomaly detection performance, underscoring the role of personalization in PPG-based fitness-health monitoring systems. The results from biometric identification show that it's easier to distinguish a new user from one intended authorized user than from a group of users. Overall, this study provides evidence of the effectiveness of representation learning and personalization for anomaly detection in PPG data.
Spectral-Bias and Kernel-Task Alignment in Physically Informed Neural Networks
methods: This paper leverages an equivalence between infinitely over-parameterized neural networks and Gaussian process regression (GPR) to derive an integro-differential equation that governs PINN prediction in the large data-set limit -- the Neurally-Informed Equation (NIE).
results: The paper quantifies the implicit bias induced by the PINN architecture via a spectral decomposition of the source term in the original differential equation. Abstract
Physically informed neural networks (PINNs) are a promising emerging method for solving differential equations. As in many other deep learning approaches, the choice of PINN design and training protocol requires careful craftsmanship. Here, we suggest a comprehensive theoretical framework that sheds light on this important problem. Leveraging an equivalence between infinitely over-parameterized neural networks and Gaussian process regression (GPR), we derive an integro-differential equation that governs PINN prediction in the large data-set limit -- the Neurally-Informed Equation (NIE). This equation augments the original one by a kernel term reflecting architecture choices and allows quantifying implicit bias induced by the network via a spectral decomposition of the source term in the original differential equation.
Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation
results: Through experiments with real human users, our method helps users better understand the causes of agent failure, reduces the number of demonstrations required for fine-tuning, and aligns the agent with individual users' task preferences. Abstract
Policies often fail due to distribution shift -- changes in the state and reward that occur when a policy is deployed in new environments. Data augmentation can increase robustness by making the model invariant to task-irrelevant changes in the agent's observation. However, designers don't know which concepts are irrelevant a priori, especially when different end users have different preferences about how the task is performed. We propose an interactive framework to leverage feedback directly from the user to identify personalized task-irrelevant concepts. Our key idea is to generate counterfactual demonstrations that allow users to quickly identify possible task-relevant and irrelevant concepts. The knowledge of task-irrelevant concepts is then used to perform data augmentation and thus obtain a policy adapted to personalized user objectives. We present experiments validating our framework on discrete and continuous control tasks with real human users. Our method (1) enables users to better understand agent failure, (2) reduces the number of demonstrations required for fine-tuning, and (3) aligns the agent to individual user task preferences.
results: Our method outperforms state-of-the-art offline reinforcement learning methods in overall performance on tasks in the widely-used D4RL benchmarks. Abstract
The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas within the realm of potential actions: What if we were to choose a different course of action? These circumstances frequently give rise to extrapolation errors, which tend to accumulate exponentially with the problem horizon. Hence, it becomes crucial to acknowledge that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control the extrapolation. Contrary to existing approaches that use regularization on either the policy or value function, we propose an approach to explicitly bound the amount of out-of-distribution actions during training. Specifically, our method utilizes dynamic programming to decide where to extrapolate and where not to, with an upper bound on the decisions different from behavior policy. It balances between the potential for improvement from taking out-of-distribution actions and the risk of making errors due to extrapolation. Theoretically, we justify our method by the constrained optimality of the fixed point solution to our $Q$ updating rules. Empirically, we show that the overall performance of our method is better than the state-of-the-art offline RL methods on tasks in the widely-used D4RL benchmarks.
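A toy dynamic program conveys the budgeting idea. The sketch below is deliberately simplified: scalar per-step value estimates stand in for the paper's Q-updating rules, and the table decides at each (time, remaining budget) pair whether to spend one out-of-distribution decision.

```python
import numpy as np

def budgeted_values(v_behavior, v_all, horizon, budget):
    """Toy sketch of budgeting counterfactual decisions: dynamic programming
    over (time, remaining budget) chooses at each step between following the
    behavior policy's value and 'spending' one out-of-distribution action for
    the best available value. v_behavior[t] and v_all[t] are scalar per-step
    value estimates; purely illustrative, not the paper's algorithm."""
    V = np.zeros((horizon + 1, budget + 1))
    for t in range(horizon - 1, -1, -1):
        for b in range(budget + 1):
            stay = v_behavior[t] + V[t + 1, b]   # in-distribution step
            V[t, b] = stay
            if b > 0:                            # extrapolate only if budget remains
                V[t, b] = max(stay, v_all[t] + V[t + 1, b - 1])
    return V
```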
results: The paper establishes provably faster convergence rates for gradient descent than traditional analyses, and motivates a conjecture that gradient descent may achieve a faster $O(1/T\log T)$ rate. Abstract
This work establishes provably faster convergence rates for gradient descent in smooth convex optimization via a computer-assisted analysis technique. Our theory allows nonconstant stepsize policies with frequent long steps potentially violating descent by analyzing the overall effect of many iterations at once rather than the typical one-iteration inductions used in most first-order method analyses. We show that long steps, which may increase the objective value in the short term, lead to provably faster convergence in the long term. A conjecture towards proving a faster $O(1/T\log T)$ rate for gradient descent is also motivated along with simple numerical validation.
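A schematic schedule illustrates the idea of occasional long steps. The periodic pattern below is a stand-in, not the certified stepsize sequence from the computer-assisted analysis; the point it demonstrates is that individual steps may violate descent while the overall sequence still converges.

```python
import numpy as np

def gd_with_long_steps(grad, x0, L, T, long_every=10, long_factor=5.0):
    """Illustrative nonconstant-stepsize schedule in the spirit of the paper:
    mostly conservative steps of size 1/L, with an occasional long step that
    may temporarily increase the objective. The specific schedule is a
    stand-in, not the certified one from the computer-assisted analysis."""
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        step = (long_factor if (t + 1) % long_every == 0 else 1.0) / L
        x = x - step * grad(x)
    return x

# Example: minimize f(x) = 0.5 * x^T A x for a simple quadratic with L = 10.
A = np.diag([1.0, 10.0])
x_final = gd_with_long_steps(lambda x: A @ x, np.array([1.0, 1.0]), L=10.0, T=200)
```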
Data Augmentation in Training CNNs: Injecting Noise to Images
results: The study finds that injecting different noise models affects image classification in different ways; the basic results conform with most common notions in machine learning, while also introducing some novel heuristics and recommendations on noise injection. These approaches provide a better understanding of optimal learning procedures for image classification. Abstract
Noise injection is a fundamental tool for data augmentation, and yet there is no widely accepted procedure to incorporate it with learning frameworks. This study analyzes the effects of adding or applying different noise models of varying magnitudes to Convolutional Neural Network (CNN) architectures. Noise models that are distributed with different density functions are given common magnitude levels via Structural Similarity (SSIM) metric in order to create an appropriate ground for comparison. The basic results are conforming with the most of the common notions in machine learning, and also introduce some novel heuristics and recommendations on noise injection. The new approaches will provide better understanding on optimal learning procedures for image classification.
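The SSIM-based magnitude matching can be sketched directly with scikit-image. The grid search below calibrates a Gaussian noise level to a target SSIM; repeating the same search for other noise densities (speckle, salt-and-pepper, etc.) puts all models at common magnitude levels, as the paper requires. Images in [0, 1] are assumed.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def calibrate_noise_std(image, target_ssim,
                        stds=np.linspace(0.01, 0.5, 50), seed=0):
    """Sketch: find the Gaussian noise std whose corrupted image matches a
    target SSIM, so different noise models can be compared at equal SSIM
    levels. `image` is a 2D float array in [0, 1]."""
    rng = np.random.default_rng(seed)
    best_std, best_gap = stds[0], np.inf
    for s in stds:
        noisy = np.clip(image + rng.normal(0.0, s, image.shape), 0.0, 1.0)
        gap = abs(ssim(image, noisy, data_range=1.0) - target_ssim)
        if gap < best_gap:
            best_std, best_gap = s, gap
    return best_std
```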
Facial Reenactment Through a Personalized Generator
paper_authors: Ariel Elazary, Yotam Nitzan, Daniel Cohen-Or
for: This paper studies personalized generative models for facial reenactment.
methods: The paper trains a personalized generator on frames from a short yet varied self-scan video captured with a simple commodity camera, guaranteeing that synthesized images preserve the subject's identity, and locates the desired frames in the generator's latent space via carefully designed latent optimization.
results: Through extensive evaluation, the paper demonstrates state-of-the-art performance for facial reenactment and shows that, since reenactment takes place in a semantic latent space, results can be semantically edited and stylized in post-processing. Abstract
In recent years, the role of image generative models in facial reenactment has been steadily increasing. Such models are usually subject-agnostic and trained on domain-wide datasets. The appearance of the reenacted individual is learned from a single image, and hence, the entire breadth of the individual's appearance is not entirely captured, leading these methods to resort to unfaithful hallucination. Thanks to recent advancements, it is now possible to train a personalized generative model tailored specifically to a given individual. In this paper, we propose a novel method for facial reenactment using a personalized generator. We train the generator using frames from a short, yet varied, self-scan video captured using a simple commodity camera. Images synthesized by the personalized generator are guaranteed to preserve identity. The premise of our work is that the task of reenactment is thus reduced to accurately mimicking head poses and expressions. To this end, we locate the desired frames in the latent space of the personalized generator using carefully designed latent optimization. Through extensive evaluation, we demonstrate state-of-the-art performance for facial reenactment. Furthermore, we show that since our reenactment takes place in a semantic latent space, it can be semantically edited and stylized in post-processing.
FDAPT: Federated Domain-adaptive Pre-training for Language Models
results: The study finds that FDAPT maintains downstream task performance competitive with the centralized baseline, and proposes a novel algorithm, FFDAPT, which improves computational efficiency by 12.1% on average while exhibiting downstream task performance similar to standard FDAPT. Abstract
Combining Domain-adaptive Pre-training (DAPT) with Federated Learning (FL) can enhance model adaptation by leveraging more sensitive and distributed data while preserving data privacy. However, few studies have focused on this method. Therefore, we conduct the first comprehensive empirical study to evaluate the performance of Federated Domain-adaptive Pre-training (FDAPT). We demonstrate that FDAPT can maintain competitive downstream task performance to the centralized baseline in both IID and non-IID situations. Furthermore, we propose a novel algorithm, Frozen Federated Domain-adaptive Pre-training (FFDAPT). FFDAPT improves the computational efficiency by 12.1% on average and exhibits similar downstream task performance to standard FDAPT, with general performance fluctuations remaining less than 1%. Finally, through a critical evaluation of our work, we identify promising future research directions for this new research area.
Locally Adaptive Federated Learning via Stochastic Polyak Stepsizes
paper_authors: Sohom Mukherjee, Nicolas Loizou, Sebastian U. Stich
for: Improving the performance of federated learning algorithms, in particular by removing the need for careful stepsize tuning.
methods: Building on the recently proposed stochastic Polyak stepsize (SPS), the paper proposes new locally adaptive and nearly parameter-free distributed SPS variants (FedSPS and FedDecSPS), proving linear convergence in strongly convex and sublinear convergence in convex settings when the interpolation condition (overparametrization) holds, and convergence to a neighborhood of the solution in the general case.
results: In convex experiments, the proposed methods match the optimization performance of FedAvg with the best tuned hyperparameters in the i.i.d. case and outperform FedAvg in the non-i.i.d. case. Abstract
State-of-the-art federated learning algorithms such as FedAvg require carefully tuned stepsizes to achieve their best performance. The improvements proposed by existing adaptive federated methods involve tuning of additional hyperparameters such as momentum parameters, and consider adaptivity only in the server aggregation round, but not locally. These methods can be inefficient in many practical scenarios because they require excessive tuning of hyperparameters and do not capture local geometric information. In this work, we extend the recently proposed stochastic Polyak stepsize (SPS) to the federated learning setting, and propose new locally adaptive and nearly parameter-free distributed SPS variants (FedSPS and FedDecSPS). We prove that FedSPS converges linearly in strongly convex and sublinearly in convex settings when the interpolation condition (overparametrization) is satisfied, and converges to a neighborhood of the solution in the general case. We extend our proposed method to a decreasing stepsize version FedDecSPS, that converges also when the interpolation condition does not hold. We validate our theoretical claims by performing illustrative convex experiments. Our proposed algorithms match the optimization performance of FedAvg with the best tuned hyperparameters in the i.i.d. case, and outperform FedAvg in the non-i.i.d. case.
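The local stepsize rule is easy to sketch. The snippet below shows an SPS-style rule inside one client's local updates; the exact FedSPS safeguards and the decreasing-stepsize variant (FedDecSPS) are omitted, and `loss_grad` is an assumed helper returning (loss, gradient).

```python
import numpy as np

def local_sps_steps(x, batches, loss_grad, loss_min=0.0,
                    c=0.5, gamma_max=1.0, K=10):
    """Hedged sketch of a locally adaptive Polyak stepsize inside a federated
    client update (the flavor of FedSPS): each local step sets
    gamma = min((f_i(x) - f_i*) / (c * ||g||^2), gamma_max), so no stepsize
    tuning is needed. loss_min = 0 corresponds to the interpolation setting."""
    for k in range(K):
        batch = batches[k % len(batches)]
        f, g = loss_grad(x, batch)
        gamma = min((f - loss_min) / (c * np.dot(g, g) + 1e-12), gamma_max)
        x = x - gamma * g
    return x  # the server then averages clients' iterates as in FedAvg
```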
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
paper_authors: Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby
results: The paper shows that NaViT improves training efficiency for large-scale supervised and contrastive image-text pretraining, transfers effectively to standard tasks such as image and video classification, object detection, and semantic segmentation with improved results on robustness and fairness benchmarks, and allows the input resolution at inference time to be adapted to smoothly navigate the cost-performance trade-off. Abstract
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
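Sequence packing can be sketched as greedy bin filling. The helper below patchifies variable-resolution images and packs their token sequences into fixed-length buffers; NaViT's attention masking and factorized positional embeddings are omitted, and dimensions divisible by the patch size are assumed.

```python
import torch

def pack_images(images, patch=16, max_len=256):
    """Sketch of Patch n' Pack: patchify images of arbitrary resolution and
    aspect ratio, then greedily pack their token sequences into fixed-length
    buffers, avoiding any resize to a fixed resolution."""
    buffers, current, used = [], [], 0
    for img in images:                     # img: (C, H, W), H and W divisible by `patch`
        c, h, w = img.shape
        tokens = (img.unfold(1, patch, patch)
                     .unfold(2, patch, patch)           # (C, H/p, W/p, p, p)
                     .reshape(c, -1, patch * patch)
                     .permute(1, 0, 2)
                     .reshape(-1, c * patch * patch))   # (n_tokens, C*p*p)
        if used + tokens.shape[0] > max_len and current:
            buffers.append(torch.cat(current))          # close the full buffer
            current, used = [], 0
        current.append(tokens)
        used += tokens.shape[0]
    if current:
        buffers.append(torch.cat(current))
    return buffers  # each entry is one packed token sequence for the ViT
```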
Towards a Certified Proof Checker for Deep Neural Network Verification
results: The paper presents a new proof checker for DNN verification that offers improved numerical stability and greater verifiability over existing implementations; it checks the proofs produced by DNN verifiers, providing a trusted checking mechanism. Abstract
Recent developments in deep neural networks (DNNs) have led to their adoption in safety-critical systems, which in turn has heightened the need for guaranteeing their safety. These safety properties of DNNs can be proven using tools developed by the verification community. However, these tools are themselves prone to implementation bugs and numerical stability problems, which make their reliability questionable. To overcome this, some verifiers produce proofs of their results which can be checked by a trusted checker. In this work, we present a novel implementation of a proof checker for DNN verification. It improves on existing implementations by offering numerical stability and greater verifiability. To achieve this, we leverage two key capabilities of Imandra, an industrial theorem prover: its support of infinite precision real arithmetic and its formal verification infrastructure. So far, we have implemented a proof checker in Imandra, specified its correctness properties and started to verify the checker's compliance with them. Our ongoing work focuses on completing the formal verification of the checker and further optimizing its performance.
Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
results: Selecting high-quality data with InstructMining improves language models' performance on instruction-following tasks: models finetuned on InstructMining-selected datasets outperform models finetuned on unfiltered datasets in 42.5% of cases, based on extensive finetuning experiments that are also used to estimate the parameters of InstructMining. Abstract
Large language models typically undergo two training stages, pretraining and finetuning. Despite that large-scale pretraining endows the model with strong capabilities to generate natural language responses, these pretrained models can still fail to understand human instructions at times. To enhance language models' ability of interpreting and responding to instructions, instruction finetuning has emerged as a critical method in this area. Recent studies found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, the selection of high-quality datasets for finetuning language models still lacks clear guidelines to follow. In this paper, we propose InstructMining, a linear rule for evaluating instruction-following data quality. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we further conduct extensive finetuning experiments. The experiment results are then applied to estimating parameters in InstructMining. To further investigate its performance, we use InstructMining to select high-quality data from unseen datasets. Results demonstrate that InstructMining can help select relatively high-quality samples from various instruction-following datasets. Compared to models finetuned on unfiltered datasets, models finetuned on InstructMining selected datasets perform better on 42.5% cases.
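The linear rule itself is simple to sketch. The indicator names and weights below are placeholders; in the paper, the weights are estimated from finetuning experiments.

```python
import numpy as np

def instruct_mining_score(indicators, weights):
    """Minimal sketch of a linear quality rule: score each instruction-response
    example as a weighted sum of natural-language indicators (e.g. length,
    perplexity under a reference model, reward-model score). Indicator set and
    weights here are hypothetical placeholders."""
    X = np.asarray(indicators)          # (n_examples, n_indicators)
    return X @ np.asarray(weights)      # higher score = predicted higher quality

# Usage: keep the top-k examples by score for finetuning.
scores = instruct_mining_score([[120, 3.1, 0.8], [45, 7.9, 0.2]],
                               weights=[0.001, -0.1, 1.0])
top = np.argsort(-scores)
```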
paper_authors: Matthew Newton, Antonis Papachristodoulou
for: This paper aims to improve the robustness of neural network controllers in control systems by using rational activation functions and a general rational neural network structure.
methods: The paper proposes a method to recover a stabilising controller from a Sum of Squares feasibility test, and applies this method to a refined rational neural network that is more compatible with Sum of Squares programming.
results: The paper shows that the proposed method can successfully recover stabilising rational neural network controllers for neural feedback loops with non-linear plants subject to noise and parametric uncertainty. Abstract
Neural networks have shown great success in many machine learning related tasks, due to their ability to act as general function approximators. Recent work has demonstrated the effectiveness of neural networks in control systems (known as neural feedback loops), most notably by using a neural network as a controller. However, one of the big challenges of this approach is that neural networks have been shown to be sensitive to adversarial attacks. This means that, unless they are designed properly, they are not an ideal candidate for controllers due to issues with robustness and uncertainty, which are pivotal aspects of control systems. There has been initial work on robustness to both analyse and design dynamical systems with neural network controllers. However, one prominent issue with these methods is that they use existing neural network architectures tailored for traditional machine learning tasks. These structures may not be appropriate for neural network controllers and it is important to consider alternative architectures. This paper considers rational neural networks and presents novel rational activation functions, which can be used effectively in robustness problems for neural feedback loops. Rational activation functions are replaced by a general rational neural network structure, which is convex in the neural network's parameters. A method is proposed to recover a stabilising controller from a Sum of Squares feasibility test. This approach is then applied to a refined rational neural network which is more compatible with Sum of Squares programming. Numerical examples show that this method can successfully recover stabilising rational neural network controllers for neural feedback loops with non-linear plants with noise and parametric uncertainty.
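A trainable rational activation can be sketched in PyTorch. The degrees and the positivity-safe denominator below are illustrative choices; the paper's refined structure, designed for Sum of Squares compatibility, differs in detail.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Sketch of a trainable rational activation P(x)/Q(x) with a denominator
    bounded below by 1, in the spirit of rational neural networks. Degrees
    (3, 2) and the |.| parameterization are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([0.0, 1.0, 0.5, 0.1]))  # numerator coeffs
        self.q = nn.Parameter(torch.tensor([0.2, 0.1]))            # denominator coeffs

    def forward(self, x):
        num = self.p[0] + self.p[1] * x + self.p[2] * x**2 + self.p[3] * x**3
        den = 1.0 + torch.abs(self.q[0] * x) + torch.abs(self.q[1] * x**2)
        return num / den  # den >= 1, so there are no poles on the real line
```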
Tackling Computational Heterogeneity in FL: A Few Theoretical Insights
paper_authors: Adnan Ben Mansour, Gaia Carenini, Alexandre Duplessis
for: This paper focuses on Federated Learning (FL) as a solution for moving data collection and training to the edge, and proposes a novel aggregation framework to tackle computational heterogeneity in federated optimization.
methods: The paper introduces and analyzes a new aggregation framework that formalizes and addresses heterogeneity in federated optimization, in terms of both heterogeneous data and local updates.
results: The proposed aggregation algorithms are extensively analyzed from both theoretical and experimental perspectives. Abstract
The future of machine learning lies in moving data collection along with training to the edge. Federated Learning, for short FL, has been recently proposed to achieve this goal. The principle of this approach is to aggregate models learned over a large number of distributed clients, i.e., resource-constrained mobile devices that collect data from their environment, to obtain a new more general model. The latter is subsequently redistributed to clients for further training. A key feature that distinguishes federated learning from data-center-based distributed training is the inherent heterogeneity. In this work, we introduce and analyse a novel aggregation framework that allows for formalizing and tackling computational heterogeneity in federated optimization, in terms of both heterogeneous data and local updates. Proposed aggregation algorithms are extensively analyzed from a theoretical, and an experimental prospective.
Exposing the Fake: Effective Diffusion-Generated Images Detection
methods: The paper proposes a detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID), comprising statistical-based SeDID-Stat and neural network-based SeDID-NNs. SeDID exploits unique attributes of diffusion models, namely the deterministic reverse and deterministic denoising computation errors, to detect diffusion-generated images.
results: Our evaluations show that SeDID outperforms existing methods when applied to diffusion models. This work therefore makes a pivotal contribution to distinguishing diffusion-generated images, marking a significant step in the domain of artificial intelligence security. Abstract
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based $\text{SeDID}_{\text{Stat}}$ and neural network-based $\text{SeDID}_{\text{NNs}}$, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security.
Physics-informed Machine Learning for Calibrating Macroscopic Traffic Flow Models
results: We demonstrate our approach with a case study of I-210 E in California, achieving performance comparable to and even better than that of traditional optimization-based methods. Abstract
Well-calibrated traffic flow models are fundamental to understanding traffic phenomena and designing control strategies. Traditional calibration has been developed based on optimization methods. In this paper, we propose a novel physics-informed, learning-based calibration approach that achieves performances comparable to and even better than those of optimization-based methods. To this end, we combine the classical deep autoencoder, an unsupervised machine learning model consisting of one encoder and one decoder, with traffic flow models. Our approach informs the decoder of the physical traffic flow models and thus induces the encoder to yield reasonable traffic parameters given flow and speed measurements. We also introduce the denoising autoencoder into our method so that it can handle not only normal data but also corrupted data with missing values. We verified our approach with a case study of I-210 E in California.
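The encoder-physical-decoder idea can be sketched concretely. A Greenshields fundamental diagram serves as a stand-in physical decoder here; the paper's actual traffic flow models and measurement layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class PhysicsInformedCalibrator(nn.Module):
    """Sketch of the calibration idea: an encoder maps (flow, speed)
    measurements to traffic model parameters, and the decoder is the physical
    model itself, so training the autoencoder calibrates the parameters.
    The Greenshields diagram v = v_f * (1 - k / k_j) is a stand-in for the
    paper's traffic flow models."""

    def __init__(self, n_meas):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_meas, 64), nn.ReLU(),
                                     nn.Linear(64, 2))

    def forward(self, meas, density):
        params = torch.nn.functional.softplus(self.encoder(meas))  # positive v_f, k_j
        v_f, k_j = params[:, 0:1], params[:, 1:2]
        return v_f * (1.0 - density / k_j)  # physical decoder: predicted speed

# Training minimizes MSE between predicted and measured speeds; a denoising
# variant feeds the encoder measurements with randomly masked entries.
```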
On the hierarchical Bayesian modelling of frequency response functions
results: The study develops combined probabilistic FRF models that accommodate benign unit-to-unit variations under varying temperature conditions while exploiting the similarities among units. Abstract
Population-based structural health monitoring (PBSHM) aims to share valuable information among members of a population, such as normal- and damage-condition data, to improve inferences regarding the health states of the members. Even when the population is comprised of nominally-identical structures, benign variations among the members will exist as a result of slight differences in material properties, geometry, boundary conditions, or environmental effects (e.g., temperature changes). These discrepancies can affect modal properties and present as changes in the characteristics of the resonance peaks of the frequency response function (FRF). Many SHM strategies depend on monitoring the dynamic properties of structures, so benign variations can be challenging for the practical implementation of these systems. Another common challenge with vibration-based SHM is data loss, which may result from transmission issues, sensor failure, a sample-rate mismatch between sensors, and other causes. Missing data in the time domain will result in decreased resolution in the frequency domain, which can impair dynamic characterisation. The hierarchical Bayesian approach provides a useful modelling structure for PBSHM, because statistical distributions at the population and individual (or domain) level are learnt simultaneously to bolster statistical strength among the parameters. As a result, variance is reduced among the parameter estimates, particularly when data are limited. In this paper, combined probabilistic FRF models are developed for a small population of nominally-identical helicopter blades under varying temperature conditions, using a hierarchical Bayesian structure. These models address critical challenges in SHM, by accommodating benign variations that present as differences in the underlying dynamics, while also considering (and utilising), the similarities among the blades.
paper_authors: Tamara T. Mueller, Siyu Zhou, Sophie Starck, Friederike Jungmann, Alexander Ziller, Orhun Aksoy, Danylo Movchan, Rickmer Braren, Georgios Kaissis, Daniel Rueckert
for: This paper aims to estimate the distribution and volume of muscle and adipose tissue as indicators of overall health and disease risk.
methods: Triangulated body surface meshes and graph neural networks are used to estimate the distribution and volume of visceral (VAT) and abdominal subcutaneous (ASAT) adipose tissue.
results: The method achieves high performance while reducing training time and required resources compared to state-of-the-art convolutional neural networks, and the authors envision applying it to cheap and easily accessible medical surface scans rather than expensive medical imaging. Abstract
Body fat volume and distribution can be a strong indication for a person's overall health and the risk for developing diseases like type 2 diabetes and cardiovascular diseases. Frequently used measures for fat estimation are the body mass index (BMI), waist circumference, or the waist-hip-ratio. However, those are rather imprecise measures that do not allow for a discrimination between different types of fat or between fat and muscle tissue. The estimation of visceral (VAT) and abdominal subcutaneous (ASAT) adipose tissue volume has shown to be a more accurate measure for named risk factors. In this work, we show that triangulated body surface meshes can be used to accurately predict VAT and ASAT volumes using graph neural networks. Our methods achieve high performance while reducing training time and required resources compared to state-of-the-art convolutional neural networks in this area. We furthermore envision this method to be applicable to cheaper and easily accessible medical surface scans instead of expensive medical images.
Transformer-based end-to-end classification of variable-length volumetric data
results: On a retinal OCT volume classification task, the proposed method achieves a 21.96% average improvement in balanced accuracy on a 9-class diagnostic task compared to state-of-the-art video transformers, and remains robust across different computational budgets. Abstract
The automatic classification of 3D medical data is memory-intensive. Also, variations in the number of slices between samples is common. Na\"ive solutions such as subsampling can solve these problems, but at the cost of potentially eliminating relevant diagnosis information. Transformers have shown promising performance for sequential data analysis. However, their application for long sequences is data, computationally, and memory demanding. In this paper, we propose an end-to-end Transformer-based framework that allows to classify volumetric data of variable length in an efficient fashion. Particularly, by randomizing the input volume-wise resolution(#slices) during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the accumulated positional information in each positional embedding can be generalized to the neighbouring slices, even for high-resolution volumes at the test time. By doing so, the model will be more robust to variable volume length and amenable to different computational budgets. We evaluated the proposed approach in retinal OCT volume classification and achieved 21.96% average improvement in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the volume-wise resolution of the input during training results in more informative volume representation as compared to training with fixed number of slices per volume.
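The augmentation reduces to subsampling slices at a random resolution each training step. The helper below assumes a volume laid out as (n_slices, ...) and returns the kept indices so the matching positional embeddings can be selected; at test time the full resolution is used.

```python
import torch

def random_resolution(volume, min_slices=8):
    """Sketch of the training-time augmentation: subsample a random number of
    slices per volume so the learnable positional embeddings generalize across
    volume-wise resolutions. `volume` has shape (n_slices, ...)."""
    n = volume.shape[0]
    k = torch.randint(min_slices, n + 1, (1,)).item()  # random resolution
    idx = torch.sort(torch.randperm(n)[:k]).values     # keep slice order
    return volume[idx], idx  # idx selects the matching positional embeddings
```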
PatchSorter: A High Throughput Deep Learning Digital Pathology Tool for Object Labeling
paper_authors: Cedric Walker, Tasneem Talawalla, Robert Toth, Akhil Ambekar, Kien Rea, Oswin Chamian, Fan Fan, Sabina Berezowska, Sven Rottenberg, Anant Madabhushi, Marie Maillard, Laura Barisoni, Hugo Mark Horlings, Andrew Janowczyk
results: Using >100,000 objects, the tool achieves a >7x improvement in labels per second over unaided labeling with minimal impact on labeling accuracy, enabling high-throughput labeling of large datasets. Abstract
The discovery of patterns associated with diagnosis, prognosis, and therapy response in digital pathology images often requires intractable labeling of large quantities of histological objects. Here we release an open-source labeling tool, PatchSorter, which integrates deep learning with an intuitive web interface. Using >100,000 objects, we demonstrate a >7x improvement in labels per second over unaided labeling, with minimal impact on labeling accuracy, thus enabling high-throughput labeling of large datasets.
paper_authors: Alexander Ziller, Alp Güvenir, Ayhan Can Erdur, Tamara T. Mueller, Philip Müller, Friederike Jungmann, Johannes Brandt, Jan Peeken, Rickmer Braren, Daniel Rueckert, Georgios Kaissis
results: We evaluate our method on medical classification benchmarks and a real-world clinical dataset, demonstrating results comparable to existing methods. In addition, using attention pooling as the feature-reduction module yields a weighted importance value for each slice during the forward pass, and we show that the slices deemed important allow inspection of the basis of the model's predictions. Abstract
Training Artificial Intelligence (AI) models on three-dimensional image data presents unique challenges compared to the two-dimensional case: Firstly, the computational resources are significantly higher, and secondly, the availability of large pretraining datasets is often limited, impeding training success. In this study, we propose a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D volumes. Our method involves sequentially applying these networks to slices of a 3D volume from all orientations. Subsequently, a feature reduction module combines the extracted slice features into a single representation, which is then used for classification. We evaluate our approach on medical classification benchmarks and a real-world clinical dataset, demonstrating comparable results to existing methods. Furthermore, by employing attention pooling as a feature reduction module we obtain weighted importance values for each slice during the forward pass. We show that slices deemed important by our approach allow the inspection of the basis of a model's prediction.
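Attention pooling as the feature-reduction module takes only a few lines. The sketch below returns both the pooled feature and the per-slice softmax weights that serve as importance values.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch of attention pooling over slice features: per-slice features are
    combined with learned softmax weights, and the weights double as
    slice-importance values for inspecting a prediction."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, slice_feats):  # slice_feats: (n_slices, dim)
        w = torch.softmax(self.score(slice_feats).squeeze(-1), dim=0)
        pooled = (w.unsqueeze(-1) * slice_feats).sum(dim=0)
        return pooled, w             # w: importance weight per slice
```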
Image Denoising and the Generative Accumulation of Photons
paper_authors: Alexander Krull, Hector Basevi, Benjamin Salmon, Andre Zeug, Franziska Müller, Samuel Tonks, Leela Muppala, Ales Leonardis
for: This paper addresses the task of noise removal in images corrupted by shot noise.
methods: The paper proposes a new method called “generative accumulation of photons” (GAP) that uses a network to predict the location of the next photon to arrive and solve the minimum mean square error (MMSE) denoising task.
results: The paper evaluates the GAP method on 4 new fluorescence microscopy datasets and shows that it outperforms supervised, self-supervised, and unsupervised baselines, or performs on par with them.Abstract
We present a fresh perspective on shot noise corrupted images and noise removal. By viewing image formation as the sequential accumulation of photons on a detector grid, we show that a network trained to predict where the next photon could arrive is in fact solving the minimum mean square error (MMSE) denoising task. This new perspective allows us to make three contributions: we present a new strategy for self-supervised denoising; we present a new method for sampling from the posterior of possible solutions by iteratively sampling and adding small numbers of photons to the image; and we derive a full generative model by starting this process from an empty canvas. We call this approach generative accumulation of photons (GAP). We evaluate our method quantitatively and qualitatively on 4 new fluorescence microscopy datasets, which will be made available to the community. We find that it outperforms supervised, self-supervised and unsupervised baselines or performs on-par.
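Posterior sampling by photon accumulation can be sketched as follows. A model mapping a photon-count image to per-pixel logits is an assumption of this sketch; starting from an all-zero canvas turns the same loop into the full generative model.

```python
import torch

@torch.no_grad()
def gap_sample(model, photons, n_new, temperature=1.0):
    """Sketch of generative accumulation of photons: the network predicts
    where the next photon arrives; sampling photons one at a time and adding
    them to the image yields samples from the posterior. `model` maps a
    (1, 1, H, W) count image to per-pixel logits (an assumed interface)."""
    img = photons.clone()                 # (H, W) integer photon counts
    for _ in range(n_new):
        logits = model(img.unsqueeze(0).unsqueeze(0).float()).flatten()
        p = torch.softmax(logits / temperature, dim=0)
        idx = torch.multinomial(p, 1).item()
        img.view(-1)[idx] += 1            # accumulate one sampled photon
    return img
```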
Quantum Image Denoising: A Framework via Boltzmann Machines, QUBO, and Quantum Annealing
for: This paper describes a framework for denoising binary images via Restricted Boltzmann Machines (RBMs).
methods: The denoising objective is cast in quadratic unconstrained binary optimization (QUBO) form, balancing the distribution learned by a trained RBM against a penalty term for deviations from the noisy image, which makes it well-suited for quantum annealing. The statistically optimal penalty parameter is derived under the assumption that the target distribution has been well-approximated, together with an empirically supported modification that makes the method robust to that idealistic assumption.
results: Under additional assumptions, the denoised images are, in expectation, strictly closer to the noise-free images than the noisy images are. Although framed as image denoising, the method applies to any binary data; it is tested on a D-Wave Advantage machine and, for data too large for current quantum annealers, by approximating QUBO solutions with classical heuristics. Abstract
We investigate a framework for binary image denoising via restricted Boltzmann machines (RBMs) that introduces a denoising objective in quadratic unconstrained binary optimization (QUBO) form and is well-suited for quantum annealing. The denoising objective is attained by balancing the distribution learned by a trained RBM with a penalty term for derivations from the noisy image. We derive the statistically optimal choice of the penalty parameter assuming the target distribution has been well-approximated, and further suggest an empirically supported modification to make the method robust to that idealistic assumption. We also show under additional assumptions that the denoised images attained by our method are, in expectation, strictly closer to the noise-free images than the noisy images are. While we frame the model as an image denoising model, it can be applied to any binary data. As the QUBO formulation is well-suited for implementation on quantum annealers, we test the model on a D-Wave Advantage machine, and also test on data too large for current quantum annealers by approximating QUBO solutions through classical heuristics.
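The QUBO construction follows directly from the RBM energy. The sketch below uses the identity (v_i - y_i)^2 = v_i + y_i - 2*v_i*y_i for binary variables and drops constant terms; the paper's optimal penalty choice is not reproduced.

```python
import numpy as np

def denoising_qubo(W, a, b, y, lam):
    """Sketch of the QUBO: minimize the RBM energy
    E(v, h) = -a.v - b.h - v^T W h plus a penalty lam * sum_i (v_i - y_i)^2
    keeping v near the noisy binary image y. Variables are ordered x = [v, h];
    for binary variables the penalty is linear in v, and constants are dropped."""
    nv, nh = W.shape
    Q = np.zeros((nv + nh, nv + nh))
    Q[np.arange(nv), np.arange(nv)] = -a + lam * (1.0 - 2.0 * y)  # linear terms on v
    Q[nv + np.arange(nh), nv + np.arange(nh)] = -b                # linear terms on h
    Q[:nv, nv:] = -W                                              # v_i h_j couplings
    return Q  # pass to a quantum annealer or a classical QUBO heuristic
```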
SAM-Path: A Segment Anything Model for Semantic Segmentation in Digital Pathology
results: Experiments on two public pathology datasets, BCSS and CRAG, show that fine-tuning with trainable class prompts outperforms vanilla SAM with manual prompts and post-processing by 27.52% in Dice score and 71.63% in IOU. The proposed additional pathology foundation model further yields a relative improvement of 5.07% to 5.12% in Dice score and 4.50% to 8.48% in IOU. Abstract
Semantic segmentations of pathological entities have crucial clinical value in computational pathology workflows. Foundation models, such as the Segment Anything Model (SAM), have been recently proposed for universal use in segmentation tasks. SAM shows remarkable promise in instance segmentation on natural images. However, the applicability of SAM to computational pathology tasks is limited due to the following factors: (1) lack of comprehensive pathology datasets used in SAM training and (2) the design of SAM is not inherently optimized for semantic segmentation tasks. In this work, we adapt SAM for semantic segmentation by introducing trainable class prompts, followed by further enhancements through the incorporation of a pathology encoder, specifically a pathology foundation model. Our framework, SAM-Path enhances SAM's ability to conduct semantic segmentation in digital pathology without human input prompts. Through experiments on two public pathology datasets, the BCSS and the CRAG datasets, we demonstrate that the fine-tuning with trainable class prompts outperforms vanilla SAM with manual prompts and post-processing by 27.52% in Dice score and 71.63% in IOU. On these two datasets, the proposed additional pathology foundation model further achieves a relative improvement of 5.07% to 5.12% in Dice score and 4.50% to 8.48% in IOU.
results: The study shows that the proposed algorithm generates holograms with correct parallax and focus cues, improving the realism of the viewing experience, and that it compares favorably to state-of-the-art CGH algorithms across a variety of pupil states. Abstract
The Visual Turing Test is the ultimate goal to evaluate the realism of holographic displays. Previous studies have focused on addressing challenges such as limited \'etendue and image quality over a large focal volume, but they have not investigated the effect of pupil sampling on the viewing experience in full 3D holograms. In this work, we tackle this problem with a novel hologram generation algorithm motivated by matching the projection operators of incoherent Light Field and coherent Wigner Function light transport. To this end, we supervise hologram computation using synthesized photographs, which are rendered on-the-fly using Light Field refocusing from stochastically sampled pupil states during optimization. The proposed method produces holograms with correct parallax and focus cues, which are important for passing the Visual Turing Test. We validate that our approach compares favorably to state-of-the-art CGH algorithms that use Light Field and Focal Stack supervision. Our experiments demonstrate that our algorithm significantly improves the realism of the viewing experience for a variety of different pupil states.
The Whole Pathological Slide Classification via Weakly Supervised Learning
results: On the Camelyon16 breast dataset and the TCGA-NSCLC lung dataset, the proposed method effectively handles cancer detection and subtype differentiation, outperforming state-of-the-art MIL-based medical image classification methods. Abstract
Due to its superior efficiency in utilizing annotations and addressing gigapixel-sized images, multiple instance learning (MIL) has shown great promise as a framework for whole slide image (WSI) classification in digital pathology diagnosis. However, existing methods tend to focus on advanced aggregators with different structures, often overlooking the intrinsic features of H\&E pathological slides. To address this limitation, we introduced two pathological priors: nuclear heterogeneity of diseased cells and spatial correlation of pathological tiles. Leveraging the former, we proposed a data augmentation method that utilizes stain separation during extractor training via a contrastive learning strategy to obtain instance-level representations. We then described the spatial relationships between the tiles using an adjacency matrix. By integrating these two views, we designed a multi-instance framework for analyzing H\&E-stained tissue images based on pathological inductive bias, encompassing feature extraction, filtering, and aggregation. Extensive experiments on the Camelyon16 breast dataset and TCGA-NSCLC Lung dataset demonstrate that our proposed framework can effectively handle tasks related to cancer detection and differentiation of subtypes, outperforming state-of-the-art medical image classification methods based on MIL. The code will be released later.
results: On synthetic cases and real-world experiments with broad- and narrowband sources, B-CLEAN-SC improves source reconstruction at low and high frequencies and suppresses noise, at the cost of additional memory but no additional computational effort. Abstract
This paper presents B-CLEAN-SC, a variation of CLEAN-SC for broadband sources. Opposed to CLEAN-SC, which ``deconvolves'' the beamforming map for each frequency individually, B-CLEAN-SC processes frequency intervals. Instead of performing a deconvolution iteration at the location of the maximum level, B-CLEAN-SC performs it at the location of the over-frequency-averaged maximum to improve the location estimation. The method is validated and compared to standard CLEAN-SC on synthetic cases, and real-world experiments, for broad- and narrowband sources. It improves the source reconstruction at low and high frequencies and suppresses noise, while it only increases the need for memory but not computational effort.
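The key change relative to CLEAN-SC can be sketched on per-frequency beamforming maps. The loop below is a heavily simplified PSF-subtraction variant: the deconvolution location is the maximum of the frequency-averaged map, while CLEAN-SC's source-coherence machinery is omitted.

```python
import numpy as np

def b_clean_iterations(maps, psf, n_iter=50, loop_gain=0.9):
    """Simplified sketch of the B-CLEAN-SC idea: instead of deconvolving each
    frequency at its own maximum, pick the grid point maximizing the map
    averaged over the frequency band, then remove a scaled point-spread
    function at that location for every frequency. `maps` is (n_freq, n_grid)
    and `psf[f, :, j]` is frequency f's PSF for a source at grid point j."""
    maps = maps.copy()
    clean = np.zeros_like(maps)
    for _ in range(n_iter):
        j = int(np.argmax(maps.mean(axis=0)))  # over-frequency-averaged maximum
        for f in range(maps.shape[0]):
            amp = loop_gain * maps[f, j]
            if amp > 0:
                maps[f] -= amp * psf[f, :, j]  # deconvolve this frequency at j
                clean[f, j] += amp
    return clean
```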
Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition
paper_authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
for: Improving the efficiency and scalability of speech recognition systems.
methods: The paper proposes a linear-time alternative to self-attention that summarises a whole utterance with the mean over vectors for all time steps and combines this single summary with time-specific information, a method called Summary Mixing.
results: Introducing Summary Mixing in state-of-the-art ASR models preserves or exceeds previous speech recognition performance while lowering training and inference times by up to 27% and reducing the memory budget by a factor of two. Abstract
Modern speech recognition systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but fail to consistently reach the same level of accuracy. In practice, however, the self-attention weights of trained speech recognizers take the form of a global average over time. This paper, therefore, proposes a linear-time alternative to self-attention for speech recognition. It summarises a whole utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method ``Summary Mixing''. Introducing Summary Mixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 27% and reducing the memory budget by a factor of two.
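A minimal sketch of the Summary Mixing idea described above: a per-utterance mean summary is combined with a per-frame transformation, giving linear-time token mixing. Layer sizes and the exact combination (the paper's layer may add gating or normalisation) are assumptions.

```python
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    """Linear-time token mixing: one mean summary per utterance is
    concatenated with a time-specific transform of each frame."""
    def __init__(self, d_model, d_hidden=None):
        super().__init__()
        d_hidden = d_hidden or d_model
        self.local = nn.Linear(d_model, d_hidden)     # time-specific branch
        self.summary = nn.Linear(d_model, d_hidden)   # summarised branch
        self.merge = nn.Linear(2 * d_hidden, d_model)

    def forward(self, x):                  # x: (batch, time, d_model)
        s = self.summary(x).mean(dim=1, keepdim=True)  # one vector per utt.
        s = s.expand(-1, x.size(1), -1)                # broadcast over time
        f = self.local(x)
        return self.merge(torch.cat([f, s], dim=-1))

y = SummaryMixing(80)(torch.randn(4, 200, 80))  # (4, 200, 80)
```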
Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers
paper_authors: Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, Björn W. Schuller
for: Improving the performance of state-of-the-art speech emotion recognition (SER) models.
methods: Using large language models (LLMs) to annotate abundant speech data.
results: Experiments show that LLMs can help improve SER performance in single-shot and few-shot settings, and that augmenting existing datasets with LLM-annotated samples further improves the results.
Abstract
Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of the limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state-of-the-art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shot scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
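A sketch of the annotation loop implied above, with `call_llm` left as a hypothetical stub (the paper uses ChatGPT; the prompt wording, label set, and few-shot format here are illustrative assumptions, not the paper's protocol):

```python
LABELS = ["angry", "happy", "neutral", "sad"]

def call_llm(prompt: str) -> str:
    """Hypothetical stub: wire up your LLM client of choice here."""
    raise NotImplementedError

def annotate(transcript: str, few_shot: list[tuple[str, str]]) -> str:
    """Label one utterance transcript, optionally with few-shot examples."""
    examples = "\n".join(f'Text: "{t}" -> {l}' for t, l in few_shot)
    prompt = (
        "Classify the emotion of the speaker as one of "
        f"{', '.join(LABELS)}.\n{examples}\nText: \"{transcript}\" ->"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in LABELS else "neutral"  # simple fallback
```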
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition
methods: The study uses a Language-Routing Mixture of Experts (LR-MoE) network, which extracts language-specific representations through a Mixture of Language Experts (MLE) guided by a frame-wise language routing mechanism; a weight-shared frame-level language identification (LID) network is jointly trained as the shared pre-router of each MoE layer.
results: Compared to the baseline, the proposed method significantly improves multilingual and code-switching speech recognition performance with comparable computational efficiency.
Abstract
Multilingual speech recognition for both monolingual and code-switching speech is a challenging task. Recently, based on the Mixture of Experts (MoE), many works have made good progress in multilingual and code-switching ASR, but present huge computational complexity with the increase of supported languages. In this work, we propose a computation-efficient network named Language-Routing Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. LR-MoE extracts language-specific representations through the Mixture of Language Experts (MLE), which is guided to learn by a frame-wise language routing mechanism. The weight-shared frame-level language identification (LID) network is jointly trained as the shared pre-router of each MoE layer. Experiments show that the proposed method significantly improves multilingual and code-switching speech recognition performances over baseline with comparable computational efficiency.
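A minimal sketch of frame-wise language routing: a shared frame-level LID classifier assigns each frame to one language expert. The dimensions, the hard argmax routing, and the plain linear experts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrameLanguageRouting(nn.Module):
    """A shared LID pre-router picks one language expert per frame."""
    def __init__(self, d_model, n_langs):
        super().__init__()
        self.lid = nn.Linear(d_model, n_langs)        # shared pre-router
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_langs))

    def forward(self, x):                  # x: (batch, time, d_model)
        lang = self.lid(x).argmax(dim=-1)  # (batch, time) language ids
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = lang == i                # frames routed to expert i
            out[sel] = expert(x[sel])
        return out, lang

y, lang = FrameLanguageRouting(256, 3)(torch.randn(2, 50, 256))
```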
SnakeSynth: New Interactions for Generative Audio Synthesis
results: The author built a browser-based high-fidelity audio synthesizer whose generated sound length and intensity are controlled through real-time two-dimensional input.
Abstract
I present "SnakeSynth," a web-based lightweight audio synthesizer that combines audio generated by a deep generative model and real-time continuous two-dimensional (2D) input to create and control variable-length generative sounds through 2D interaction gestures. Interaction gestures are touch and mobile-compatible with analogies to strummed, bowed, and plucked musical instrument controls. Point-and-click and drag-and-drop gestures directly control audio playback length and I show that sound length and intensity are modulated by interactions with a programmable 2D coordinate grid. Leveraging the speed and ubiquity of browser-based audio and hardware acceleration in Google's TensorFlow.js we generate time-varying high-fidelity sounds with real-time interactivity. SnakeSynth adaptively reproduces and interpolates between sounds encountered during model training, notably without long training times, and I briefly discuss possible futures for deep generative models as an interactive paradigm for musical expression.
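A toy sketch of the 2D-gesture-to-sound mapping described above, assuming horizontal position controls playback length and vertical position controls intensity (the actual SnakeSynth mapping, grid, and ranges may differ):

```python
def gesture_to_playback(x, y, grid=1.0, max_seconds=4.0):
    """Map a 2D gesture coordinate to (duration in seconds, gain)."""
    x = min(max(x / grid, 0.0), 1.0)          # normalise to [0, 1]
    y = min(max(y / grid, 0.0), 1.0)
    duration = 0.1 + x * (max_seconds - 0.1)  # length of generated audio
    gain = y                                  # linear intensity scaling
    return duration, gain

print(gesture_to_playback(0.5, 0.8))          # (2.05 s, 0.8 gain)
```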
paper_authors: Benoit Brummer, Christophe De Vleeschouwer
for: Improving joint image compression and denoising performance.
methods: Leverages the Natural Image Noise Dataset and explicitly learns the image denoising task while training the codec.
results: A single model trained on a mixture of images with variable noise levels achieves best-in-class rate-distortion on both noisy and clean images, outperforming a compression-only model and even a pair of denoising-then-compression models with almost one order of magnitude fewer GMac operations.
Abstract
Image noise is ubiquitous in photography. However, image noise is neither compressible nor desirable, thus attempting to convey the noise in compressed image bitstreams yields sub-par results in both rate and distortion. We propose to explicitly learn the image denoising task when training a codec. Therefore, we leverage the Natural Image Noise Dataset, which offers a wide variety of scenes captured with various ISO numbers, leading to different noise levels, including insignificant ones. Given this training set, we supervise the codec with noisy-clean image pairs, and show that a single model trained based on a mixture of images with variable noise levels appears to yield best-in-class results with both noisy and clean images, achieving better rate-distortion than a compression-only model or even than a pair of denoising-then-compression models with almost one order of magnitude fewer GMac operations.
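The training objective implied above can be sketched as a rate-distortion loss where the codec sees the noisy image but distortion is measured against the clean one. The `codec` interface returning a reconstruction and a rate estimate, and the trade-off weight, are assumptions:

```python
import torch

def joint_denoise_compress_loss(codec, noisy, clean, lam=0.01):
    """Rate-distortion objective with denoising built in: the codec
    encodes the noisy image, but distortion is measured against the
    clean ground truth, so decoding learns to denoise."""
    recon, bits = codec(noisy)                    # hypothetical codec API
    distortion = torch.mean((recon - clean) ** 2)
    rate = bits.mean()                            # estimated bits per image
    return rate + lam * distortion
```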
SepVAE: a contrastive VAE to separate pathological patterns from healthy ones
results: The authors compare the method against previous CA-VAE approaches on three medical applications and a natural image dataset (CelebA) and demonstrate its superior performance.
Abstract
Contrastive Analysis VAE (CA-VAEs) is a family of Variational auto-encoders (VAEs) that aims at separating the common factors of variation between a background dataset (BG) (i.e., healthy subjects) and a target dataset (TG) (i.e., patients) from the ones that only exist in the target dataset. To do so, these methods separate the latent space into a set of salient features (i.e., proper to the target dataset) and a set of common features (i.e., exist in both datasets). Currently, all models fail to prevent the sharing of information between latent spaces effectively and to capture all salient factors of variation. To this end, we introduce two crucial regularization losses: a disentangling term between common and salient representations and a classification term between background and target samples in the salient space. We show a better performance than previous CA-VAEs methods on three medical applications and a natural images dataset (CelebA). Code and datasets are available on GitHub https://github.com/neurospin-projects/2023_rlouiset_sepvae.
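A sketch of the two extra regularizers described above, added on top of a standard VAE objective: an independence penalty between common and salient latents (here a cross-covariance term, which is an assumption; the paper's disentangling loss may be defined differently) and a background-vs-target classification loss on the salient space:

```python
import torch
import torch.nn.functional as F

def sepvae_regularizers(z_common, z_salient, classifier, is_target):
    """(i) cross-covariance penalty discouraging information sharing
    between common and salient latents; (ii) background-vs-target
    classification from the salient latents only."""
    zc = z_common - z_common.mean(dim=0)
    zs = z_salient - z_salient.mean(dim=0)
    disentangle = (zc.T @ zs / len(zc)).pow(2).sum()
    logits = classifier(z_salient).squeeze(-1)
    classify = F.binary_cross_entropy_with_logits(logits, is_target.float())
    return disentangle, classify

# usage sketch: total = elbo + a * disentangle + b * classify
```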
CellGAN: Conditional Cervical Cell Synthesis for Augmenting Cytopathological Image Classification
results: CellGAN can produce visually plausible TCT cytopathological images and greatly augments patch-level cell classification performance.
Abstract
Automatic examination of thin-prep cytologic test (TCT) slides can assist pathologists in finding cervical abnormality for accurate and efficient cancer screening. Current solutions mostly need to localize suspicious cells and classify abnormality based on local patches, given that whole slide images of TCT are extremely large. It thus requires many annotations of normal and abnormal cervical cells to supervise the training of the patch-level classifier for promising performance. In this paper, we propose CellGAN to synthesize cytopathological images of various cervical cell types for augmenting patch-level cell classification. Built upon a lightweight backbone, CellGAN is equipped with a non-linear class mapping network to effectively incorporate cell type information into image generation. We also propose the Skip-layer Global Context module to model the complex spatial relationship of the cells, and attain high fidelity of the synthesized images through adversarial learning. Our experiments demonstrate that CellGAN can produce visually plausible TCT cytopathological images for different cell types. We also validate the effectiveness of using CellGAN to greatly augment patch-level cell classification performance.
Large Class Separation is not what you need for Relational Reasoning-based OOD Detection
results: By analysing the learned embeddings, the authors find a correlation between inter-class feature distance and OOD detection accuracy, and propose a new loss function that controls the inter-class margin to improve detection accuracy.
Abstract
Standard recognition approaches are unable to deal with novel categories at test time. Their overconfidence on the known classes makes the predictions unreliable for safety-critical applications such as healthcare or autonomous driving. Out-Of-Distribution (OOD) detection methods provide a solution by identifying semantic novelty. Most of these methods leverage a learning stage on the known data, which means training (or fine-tuning) a model to capture the concept of normality. This process is clearly sensitive to the amount of available samples and might be computationally expensive for on-board systems. A viable alternative is that of evaluating similarities in the embedding space produced by large pre-trained models without any further learning effort. We focus exactly on such a fine-tuning-free OOD detection setting. This works presents an in-depth analysis of the recently introduced relational reasoning pre-training and investigates the properties of the learned embedding, highlighting the existence of a correlation between the inter-class feature distance and the OOD detection accuracy. As the class separation depends on the chosen pre-training objective, we propose an alternative loss function to control the inter-class margin, and we show its advantage with thorough experiments.
results: The study enables real-time traffic-situation assessment, resolves occlusions between motorized and non-motorized traffic, and provides a reliable, high-accuracy basis for data generation and AI training strategies.
Abstract
Complex inner-city junctions are among the most critical traffic areas for injury and fatal accidents. The development of highly automated driving (HAD) systems struggles with the complex and hectic everyday life within those areas. Sensor-equipped smart infrastructures, which can communicate and cooperate with vehicles, are essential to enable a holistic scene understanding and to resolve occlusions that drivers and on-board vehicle perception systems cannot cover on their own. We introduce an intelligent research infrastructure equipped with visual sensor technology, located at a public inner-city junction in Aschaffenburg, Germany. A multiple-view camera system monitors the traffic situation to perceive road users' behavior. Both motorized and non-motorized traffic is considered. The system is used for research in data generation, evaluating new HAD sensor systems, algorithms, and Artificial Intelligence (AI) training strategies using real, synthetic, and augmented data. In addition, the junction features a highly accurate digital twin. Real-world data can be taken into the digital twin for simulation purposes and synthetic data generation.
The IMPTC Dataset: An Infrastructural Multi-Person Trajectory and Context Dataset
paper_authors: Manuel Hetzel, Hannes Reichert, Günther Reitberger, Erich Fuchs, Konrad Doll, Bernhard Sick
for: The paper is written for researchers and developers working on automated traffic systems, particularly those focused on improving the performance of autonomous vehicles in inner-city intersections.
methods: The paper uses a variety of methods, including visual sensor technology and LiDAR systems, to collect data on traffic situations and road users’ behavior at an intelligent public inner-city intersection in Germany. Additional sensors monitor contextual information like weather, lighting, and traffic light signal status.
results: The paper presents the Infrastructural Multi-Person Trajectory and Context Dataset (IMPTC), which contains over 2,500 VRU trajectories and over 20,000 vehicle trajectories, captured over eight hours of measurement data at different times of day, weather conditions, and seasons. The dataset includes all data from sensor calibration to trajectory and context data, and is available online for non-commercial research.
Abstract
Inner-city intersections are among the most critical traffic areas for injury and fatal accidents. Automated vehicles struggle with the complex and hectic everyday life within those areas. Sensor-equipped smart infrastructures, which can cooperate with vehicles, can benefit automated traffic by extending the perception capabilities of drivers and vehicle perception systems. Additionally, they offer the opportunity to gather reproducible and precise data of a holistic scene understanding, including context information as a basis for training algorithms for various applications in automated traffic. Therefore, we introduce the Infrastructural Multi-Person Trajectory and Context Dataset (IMPTC). We use an intelligent public inner-city intersection in Germany with visual sensor technology. A multi-view camera and LiDAR system perceives traffic situations and road users' behavior. Additional sensors monitor contextual information like weather, lighting, and traffic light signal status. The data acquisition system focuses on Vulnerable Road Users (VRUs) and multi-agent interaction. The resulting dataset consists of eight hours of measurement data. It contains over 2,500 VRU trajectories, including pedestrians, cyclists, e-scooter riders, strollers, and wheelchair users, and over 20,000 vehicle trajectories at different times of day, weather conditions, and seasons. In addition, to enable the entire stack of research capabilities, the dataset includes all data, from sensor, calibration, and detection data through to trajectory and context data. The dataset is continuously expanded and is available online for non-commercial research at https://github.com/kav-institute/imptc-dataset.
Sequential Experimental Design for X-Ray CT Using Deep Reinforcement Learning
paper_authors: Tianyuan Wang, Felix Lucka, Tristan van Leeuwen
for: Improving in-line quality control with X-ray computed tomography (CT) by reducing the number of measurement angles while maintaining reconstruction quality.
methods: Uses sparse-angle tomography for 3D reconstruction and solves the resulting optimal experimental design (OED) problem with deep reinforcement learning.
results: Experimental evaluation shows that solving the OED problem with deep reinforcement learning yields high-quality 3D reconstructions and can select the most informative scan angles online.
Abstract
In X-ray Computed Tomography (CT), projections from many angles are acquired and used for 3D reconstruction. To make CT suitable for in-line quality control, reducing the number of angles while maintaining reconstruction quality is necessary. Sparse-angle tomography is a popular approach for obtaining 3D reconstructions from limited data. To optimize its performance, one can adapt scan angles sequentially to select the most informative angles for each scanned object. Mathematically, this corresponds to solving an optimal experimental design (OED) problem. OED problems are high-dimensional, non-convex, bi-level optimization problems that cannot be solved online, i.e., during the scan. To address these challenges, we pose the OED problem as a partially observable Markov decision process in a Bayesian framework, and solve it through deep reinforcement learning. The approach learns efficient non-greedy policies to solve a given class of OED problems through extensive offline training rather than solving a given OED problem directly via numerical optimization. As such, the trained policy can successfully find the most informative scan angles online. We use a policy training method based on the Actor-Critic approach and evaluate its performance on 2D tomography with synthetic data.
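To make the sequential angle-selection idea concrete, here is a heavily simplified policy-gradient sketch (REINFORCE rather than the paper's actor-critic, with the reconstruction-quality reward left as a stub), where the state is the binary mask of already-acquired angles:

```python
import torch
import torch.nn as nn

def reconstruction_reward(acquired_mask, new_angle):
    """Stub: a real system would reconstruct from the acquired angles and
    score reconstruction quality; here it returns a placeholder value."""
    return float(torch.rand(()))

class AnglePolicy(nn.Module):
    """Maps the mask of already-acquired angles to a distribution over
    which angle to acquire next."""
    def __init__(self, n_angles=180):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_angles, 128), nn.ReLU(),
                                 nn.Linear(128, n_angles))

    def forward(self, acquired):
        logits = self.net(acquired)
        logits = logits.masked_fill(acquired.bool(), float('-inf'))  # no repeats
        return torch.distributions.Categorical(logits=logits)

n_angles, budget = 180, 10
policy = AnglePolicy(n_angles)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(50):                       # REINFORCE training loop
    mask = torch.zeros(n_angles)
    log_probs, rewards = [], []
    for _ in range(budget):                     # pick angles sequentially
        dist = policy(mask)
        angle = dist.sample()
        log_probs.append(dist.log_prob(angle))
        rewards.append(reconstruction_reward(mask, angle))
        mask = mask.clone()
        mask[angle] = 1.0
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)  # returns-to-go
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```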
Learning Kernel-Modulated Neural Representation for Efficient Light Field Compression
results: Experiments show that the method outperforms other SOTA methods on the light field compression task by a significant margin, and that, after aligning descriptors, the learned modulators can be transferred to new light fields to render dense views, suggesting a potential solution to the view synthesis task.
Abstract
Light field is a type of image data that captures the 3D scene information by recording light rays emitted from a scene at various orientations. It offers a more immersive perception than classic 2D images but at the cost of huge data volume. In this paper, we draw inspiration from the visual characteristics of Sub-Aperture Images (SAIs) of light field and design a compact neural network representation for the light field compression task. The network backbone takes randomly initialized noise as input and is supervised on the SAIs of the target light field. It is composed of two types of complementary kernels: descriptive kernels (descriptors) that store scene description information learned during training, and modulatory kernels (modulators) that control the rendering of different SAIs from the queried perspectives. To further enhance compactness of the network meanwhile retain high quality of the decoded light field, we accordingly introduce modulator allocation and kernel tensor decomposition mechanisms, followed by non-uniform quantization and lossless entropy coding techniques, to finally form an efficient compression pipeline. Extensive experiments demonstrate that our method outperforms other state-of-the-art (SOTA) methods by a significant margin in the light field compression task. Moreover, after aligning descriptors, the modulators learned from one light field can be transferred to new light fields for rendering dense views, indicating a potential solution for view synthesis task.
Recognizing student identification numbers from the matrix templates using a modified U-net architecture
for: This paper presents an innovative approach to student identification during exams and knowledge tests, aiming to overcome the limitations of traditional personal information entry methods.
methods: The proposed method employs a matrix template on the designated section of the exam, where squares containing numbers are selectively blackened. A neural network specifically designed for recognizing students’ personal identification numbers is developed, using a specially adapted U-Net architecture and trained on an extensive dataset of images of blackened tables.
results: The neural network demonstrates high accuracy in recognizing the patterns and arrangement of blackened squares, accurately interpreting the information inscribed within them. The method automates the identification process, reducing administrative effort and expediting data processing, and offers multiple advantages, such as significantly accelerating the exam marking process and minimizing the potential for errors.
Abstract
This paper presents an innovative approach to student identification during exams and knowledge tests, which overcomes the limitations of the traditional personal information entry method. The proposed method employs a matrix template on the designated section of the exam, where squares containing numbers are selectively blackened. The methodology involves the development of a neural network specifically designed for recognizing students' personal identification numbers. The neural network utilizes a specially adapted U-Net architecture, trained on an extensive dataset comprising images of blackened tables. The network demonstrates proficiency in recognizing the patterns and arrangement of blackened squares, accurately interpreting the information inscribed within them. Additionally, the model exhibits high accuracy in correctly identifying entered student personal numbers and effectively detecting erroneous entries within the table. This approach offers multiple advantages. Firstly, it significantly accelerates the exam marking process by automatically extracting identifying information from the blackened tables, eliminating the need for manual entry and minimizing the potential for errors. Secondly, the method automates the identification process, thereby reducing administrative effort and expediting data processing. The introduction of this innovative identification system represents a notable advancement in the field of exams and knowledge tests, replacing the conventional manual entry of personal data with a streamlined, efficient, and accurate identification process.
ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression
results: Experiments show that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions on four widely used datasets, outperforming the VVC reference encoder (VTM-18.0) and the state-of-the-art learned image compression method SwinT-ChARM. The authors also provide model scaling studies to verify the computational efficiency of the approach.
Abstract
Over the last few years, neural image compression has gained wide attention from research and industry, yielding promising end-to-end deep neural codecs outperforming their conventional counterparts in rate-distortion performance. Despite significant advancement, current methods, including attention-based transform coding, still need to be improved in reducing the coding rate while preserving the reconstruction fidelity, especially in non-homogeneous textured image areas. Those models also require more parameters and a higher decoding time. To tackle the above challenges, we propose ConvNeXt-ChARM, an efficient ConvNeXt-based transform coding framework, paired with a compute-efficient channel-wise auto-regressive prior to capturing both global and local contexts from the hyper and quantized latent representations. The proposed architecture can be optimized end-to-end to fully exploit the context information and extract compact latent representation while reconstructing higher-quality images. Experimental results on four widely-used datasets showed that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions estimated on average to 5.24% and 1.22% over the versatile video coding (VVC) reference encoder (VTM-18.0) and the state-of-the-art learned image compression method SwinT-ChARM, respectively. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the next generation ConvNet, namely ConvNeXt, and Swin Transformer.
RFENet: Towards Reciprocal Feature Evolution for Glass Segmentation
paper_authors: Ke Fan, Changan Wang, Yabiao Wang, Chengjie Wang, Ran Yi, Lizhuang Ma
for: This paper proposes a novel network (RFENet) for effective glass-like object segmentation in images.
methods: The proposed method uses a Selective Mutual Evolution (SME) module to learn the reciprocal features of semantic and boundary information, and a Structurally Attentive Refinement (SAR) module to refine the features of ambiguous points around the boundary.
results: The proposed method achieves state-of-the-art performance on three popular public datasets.
Abstract
Glass-like objects are widespread in daily life but remain intractable to be segmented for most existing methods. The transparent property makes it difficult to be distinguished from background, while the tiny separation boundary further impedes the acquisition of their exact contour. In this paper, by revealing the key co-evolution demand of semantic and boundary learning, we propose a Selective Mutual Evolution (SME) module to enable the reciprocal feature learning between them. Then to exploit the global shape context, we propose a Structurally Attentive Refinement (SAR) module to conduct a fine-grained feature refinement for those ambiguous points around the boundary. Finally, to further utilize the multi-scale representation, we integrate the above two modules into a cascaded structure and then introduce a Reciprocal Feature Evolution Network (RFENet) for effective glass-like object segmentation. Extensive experiments demonstrate that our RFENet achieves state-of-the-art performance on three popular public datasets.
results: Compared with the VVC reference encoder (VTM-18.0) and the SwinT-ChARM neural codec, the proposed adaptive image compression transformer (AICT) framework achieves a better trade-off between coding efficiency and decoder complexity.
Abstract
Motivated by the efficiency investigation of the Transformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, first, with a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Current methods that still rely on ConvNet-based entropy coding are limited in long-range modeling dependencies due to their local connectivity and an increasing number of architectural biases and priors. On the contrary, the proposed ICT can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representation while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed adaptive image compression transformer (AICT) framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM.
results: Experiments show that the proposed OSEN approach performs well in three different applications: support estimation from compressive sensing (CS) measurements, representation-based classification, and learning-aided CS reconstruction, where the OSEN output serves as prior knowledge. In particular, it outperforms competing methods by a significant margin at low measurement rates while improving computational efficiency. The software implementation is available at https://github.com/meteahishali/OSEN.
Abstract
In this work, we propose a novel approach called Operational Support Estimator Networks (OSENs) for the support estimation task. Support Estimation (SE) is defined as finding the locations of non-zero elements in a sparse signal. By its very nature, the mapping between the measurement and sparse signal is a non-linear operation. Traditional support estimators rely on computationally expensive iterative signal recovery techniques to achieve such non-linearity. Contrary to the convolution layers, the proposed OSEN approach consists of operational layers that can learn such complex non-linearities without the need for deep networks. In this way, the performance of the non-iterative support estimation is greatly improved. Moreover, the operational layers comprise so-called generative \textit{super neurons} with non-local kernels. The kernel location for each neuron/feature map is optimized jointly for the SE task during the training. We evaluate the OSENs in three different applications: i. support estimation from Compressive Sensing (CS) measurements, ii. representation-based classification, and iii. learning-aided CS reconstruction where the output of OSENs is used as prior knowledge to the CS algorithm for an enhanced reconstruction. Experimental results show that the proposed approach achieves computational efficiency and outperforms competing methods, especially at low measurement rates by a significant margin. The software implementation is publicly shared at https://github.com/meteahishali/OSEN.
Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images
methods: The method uses ResNet50 and PointNet++ to derive features from RGB images and point clouds, respectively, and introduces a novel pyramid deep fusion network (PDFNet) to aggregate the multi-modal features.
results: Comprehensive ablation experiments demonstrate the effectiveness of the proposed fusion algorithm, which outperforms state-of-the-art methods on public datasets.
Abstract
Accurately recovering the dense 3D mesh of both hands from monocular images poses considerable challenges due to occlusions and projection ambiguity. Most of the existing methods extract features from color images to estimate the root-aligned hand meshes, which neglect the crucial depth and scale information in the real world. Given the noisy sensor measurements with limited resolution, depth-based methods predict 3D keypoints rather than a dense mesh. These limitations motivate us to take advantage of these two complementary inputs to acquire dense hand meshes on a real-world scale. In this work, we propose an end-to-end framework for recovering dense meshes for both hands, which employs single-view RGB-D image pairs as input. The primary challenge lies in effectively utilizing two different input modalities to mitigate the blurring effects in RGB images and noises in depth images. Instead of directly treating depth maps as additional channels for RGB images, we encode the depth information into the unordered point cloud to preserve more geometric details. Specifically, our framework employs ResNet50 and PointNet++ to derive features from RGB and point cloud, respectively. Additionally, we introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales, which demonstrates superior efficacy compared to previous fusion strategies. Furthermore, we employ a GCN-based decoder to process the fused features and recover the corresponding 3D pose and dense mesh. Through comprehensive ablation experiments, we have not only demonstrated the effectiveness of our proposed fusion algorithm but also outperformed the state-of-the-art approaches on publicly available datasets. To reproduce the results, we will make our source code and models publicly available at \url{https://github.com/zijinxuxu/PDFNet}.
results: On a medical image classification task, the approach produces improved explanations (+0.02, +3%) with minimal human input, at the cost of slightly reduced classification performance (-0.04, -4%).
Abstract
eXplanation Based Learning (XBL) is a form of Interactive Machine Learning (IML) that provides a model refining approach via user feedback collected on model explanations. Although the interactivity of XBL promotes model transparency, XBL requires a huge amount of user interaction and can become expensive as feedback is in the form of detailed annotation rather than simple category labelling which is more common in IML. This expense is exacerbated in high stakes domains such as medical image classification. To reduce the effort and expense of XBL we introduce a new approach that uses two input instances and their corresponding Gradient Weighted Class Activation Mapping (GradCAM) model explanations as exemplary explanations to implement XBL. Using a medical image classification task, we demonstrate that, using minimal human input, our approach produces improved explanations (+0.02, +3%) and achieves reduced classification performance (-0.04, -4%) when compared against a model trained without interactions.
What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
results: The study finds that pretraining induces transferable invariances in shallow layers and that invariances from deeper pretrained layers are compressed towards shallower layers during finetuning. These findings help explain why pretrained models succeed on downstream tasks and how a pretrained model changes when finetuned.
Abstract
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task, becoming commonplace across many areas of machine learning. While pretraining is empirically observed to be beneficial for a range of tasks, there is not a clear understanding yet of the reasons for this effect. In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks. We present new metrics that specifically investigate the degree to which invariances learned by a pretrained model are retained or forgotten during finetuning. Using these metrics, we present a suite of empirical findings, including that pretraining induces transferable invariances in shallow layers and that invariances from deeper pretrained layers are compressed towards shallower layers during finetuning. Together, these findings contribute to understanding some of the reasons for the successes of pretrained models and the changes that a pretrained model undergoes when finetuned on a downstream task.
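One simple way to probe the layer-wise invariances discussed above is to compare features of an image and a transformed copy, layer by layer. The `features` helper returning per-layer activations is hypothetical, and the paper defines its own, more careful metrics:

```python
import torch
import torch.nn.functional as F

def layer_invariance(model, x, transform, features):
    """Per-layer cosine similarity between features of a batch of images
    and their transformed versions; values near 1 indicate invariance."""
    with torch.no_grad():
        f_orig = features(model, x)           # list of per-layer activations
        f_aug = features(model, transform(x))
    return [F.cosine_similarity(a.flatten(1), b.flatten(1)).mean().item()
            for a, b in zip(f_orig, f_aug)]
```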
Unsupervised Optical Flow Estimation with Dynamic Timing Representation for Spike Camera
results: Experiments show that the method predicts accurate optical flow from spike streams in different high-speed scenes, including real scenes. For instance, compared with the best spike-based method, SCFlow, it reduces the error by $15\%$ and $19\%$ at $\Delta t=10$ and $\Delta t=20$, respectively.
Abstract
Efficiently selecting an appropriate spike stream data length to extract precise information is the key to the spike vision tasks. To address this issue, we propose a dynamic timing representation for spike streams. Based on multi-layers architecture, it applies dilated convolutions on temporal dimension to extract features on multi-temporal scales with few parameters. And we design layer attention to dynamically fuse these features. Moreover, we propose an unsupervised learning method for optical flow estimation in a spike-based manner to break the dependence on labeled data. In addition, to verify the robustness, we also build a spike-based synthetic validation dataset for extreme scenarios in autonomous driving, denoted as SSES dataset. It consists of various corner cases. Experiments show that our method can predict optical flow from spike streams in different high-speed scenes, including real scenes. For instance, our method gets $15\%$ and $19\%$ error reduction from the best spike-based work, SCFlow, in $\Delta t=10$ and $\Delta t=20$ respectively which are the same settings as the previous works.
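A minimal sketch of the dynamic timing representation's core ingredient: dilated 1-D convolutions over the temporal dimension extract features at several temporal scales with few parameters. The channel counts, dilation set, and 1x1 fusion (the paper uses layer attention instead) are assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleTemporal(nn.Module):
    """Dilated temporal convolutions over a spike stream, one branch
    per dilation, fused with a 1x1 convolution."""
    def __init__(self, in_ch=1, ch=16, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, ch, kernel_size=3, dilation=d, padding=d)
            for d in dilations)                # padding=d keeps length
        self.fuse = nn.Conv1d(ch * len(dilations), ch, kernel_size=1)

    def forward(self, spikes):                 # (batch, in_ch, time)
        feats = torch.cat([b(spikes) for b in self.branches], dim=1)
        return self.fuse(feats)

y = MultiScaleTemporal()(torch.rand(2, 1, 400).round())  # binary spikes
```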
Flexible and Fully Quantized Ultra-Lightweight TinyissimoYOLO for Ultra-Low-Power Edge Systems
results: Experimental measurements comprehensively characterize the detection performance of TinyissimoYOLO and compare deployments across platforms, including the GAP9 processor with and without its hardware accelerator, an ARM Cortex-M7 core, two ARM Cortex-M4 cores, and a multi-core platform with a CNN hardware accelerator. The GAP9 hardware accelerator achieves the lowest inference latency and energy, at 2.12 ms and 150 uJ respectively.
Abstract
This paper deploys and explores variants of TinyissimoYOLO, a highly flexible and fully quantized ultra-lightweight object detection network designed for edge systems with a power envelope of a few milliwatts. With experimental measurements, we present a comprehensive characterization of the network's detection performance, exploring the impact of various parameters, including input resolution, number of object classes, and hidden layer adjustments. We deploy variants of TinyissimoYOLO on state-of-the-art ultra-low-power extreme edge platforms, presenting an in-depth comparison of latency, energy efficiency, and their ability to efficiently parallelize the workload. In particular, the paper presents a comparison between a novel parallel RISC-V processor (GAP9 from Greenwaves) with and without use of its on-chip hardware accelerator, an ARM Cortex-M7 core (STM32H7 from ST Microelectronics), two ARM Cortex-M4 cores (STM32L4 from STM and Apollo4b from Ambiq), and a multi-core platform with a CNN hardware accelerator (Analog Devices MAX78000). Experimental results show that the GAP9's hardware accelerator achieves the lowest inference latency and energy at 2.12ms and 150uJ respectively, which is around 2x faster and 20% more efficient than the next best platform, the MAX78000. The hardware accelerator of GAP9 can even run an increased resolution version of TinyissimoYOLO with 112x112 pixels and 10 detection classes within 3.2ms, consuming 245uJ. To showcase the competitiveness of a versatile general-purpose system we also deployed and profiled a multi-core implementation on GAP9 at different operating points, achieving 11.3ms with the lowest-latency and 490uJ with the most energy-efficient configuration. With this paper, we demonstrate the suitability and flexibility of TinyissimoYOLO on state-of-the-art detection datasets for real-time ultra-low-power edge inference.
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
results: Online and offline evaluations across diverse environments show that GVCCI steadily improves VG performance by up to 56.7% and improves the resulting LGRM by up to 29.4%. Qualitative analysis further shows that an unadapted VG model often fails to find the correct object due to a strong bias learned from the pre-training data. Finally, the authors introduce a new VG dataset of image-object-instruction triplets from diverse manipulation environments.
Abstract
Language-Guided Robotic Manipulation (LGRM) is a challenging task as it requires a robot to understand human instructions to manipulate everyday objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG) models to detect objects without adapting to manipulation environments. This results in a performance drop due to a substantial domain gap between the pre-training and real-world data. A straightforward solution is to collect additional training data, but the cost of human-annotation is extortionate. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM, which continuously learns VG without human supervision. GVCCI iteratively generates synthetic instruction via object detection and trains the VG model with the generated data. We validate our framework in offline and online settings across diverse environments on different VG models. Experimental results show that accumulating synthetic data from GVCCI leads to a steady improvement in VG by up to 56.7% and improves resultant LGRM by up to 29.4%. Furthermore, the qualitative analysis shows that the unadapted VG model often fails to find correct objects due to a strong bias learned from the pre-training data. Finally, we introduce a novel VG dataset for LGRM, consisting of nearly 252k triplets of image-object-instruction from diverse manipulation environments.
YOGA: Deep Object Detection in the Wild with Lightweight Feature Learning and Multiscale Attention
for: Developing a deep-learning-based, lightweight object detection model that can run on low-end edge devices while achieving competitive accuracy.
methods: The model uses a two-phase feature learning pipeline with a cheap linear transformation that learns feature maps using only half of the convolution filters, and performs multi-scale feature fusion in its neck with an attention mechanism instead of the concatenation used by conventional detectors.
results: YOGA strikes the best trade-off between model size and accuracy (up to a 22% increase in AP and a 23-34% reduction in parameters and FLOPs), making it well suited to in-the-wild deployment on low-end edge devices; this is confirmed by a hardware implementation and evaluation on the NVIDIA Jetson Nano.
Abstract
We introduce YOGA, a deep learning based yet lightweight object detection model that can operate on low-end edge devices while still achieving competitive accuracy. The YOGA architecture consists of a two-phase feature learning pipeline with a cheap linear transformation, which learns feature maps using only half of the convolution filters required by conventional convolutional neural networks. In addition, it performs multi-scale feature fusion in its neck using an attention mechanism instead of the naive concatenation used by conventional detectors. YOGA is a flexible model that can be easily scaled up or down by several orders of magnitude to fit a broad range of hardware constraints. We evaluate YOGA on COCO-val and COCO-testdev datasets with other over 10 state-of-the-art object detectors. The results show that YOGA strikes the best trade-off between model size and accuracy (up to 22% increase of AP and 23-34% reduction of parameters and FLOPs), making it an ideal choice for deployment in the wild on low-end edge devices. This is further affirmed by our hardware implementation and evaluation on NVIDIA Jetson Nano.
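A sketch of one common way to realise the "half the convolution filters" idea mentioned above (a Ghost-style module: half the channels come from a full convolution, the rest from a cheap per-channel transform of those outputs); YOGA's exact linear transformation is not specified here, so treat this as an interpretation:

```python
import torch
import torch.nn as nn

class CheapFeatures(nn.Module):
    """Half the output channels via a standard convolution, the other
    half via a cheap depthwise transform of the first half."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.primary = nn.Conv2d(in_ch, half, 3, padding=1)
        self.cheap = nn.Conv2d(half, half, 3, padding=1, groups=half)

    def forward(self, x):
        p = self.primary(x)                   # half the filters, full conv
        return torch.cat([p, self.cheap(p)], dim=1)

y = CheapFeatures(3, 32)(torch.randn(1, 3, 64, 64))  # (1, 32, 64, 64)
```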
Sem-CS: Semantic CLIPStyler for Text-Based Image Style Transfer
results: Based on DISTS, NIMA, and user-study scores, the proposed framework achieves superior qualitative and quantitative performance.
Abstract
CLIPStyler demonstrated image style transfer with realistic textures using only a style text description (instead of requiring a reference style image). However, the ground semantics of objects in the style transfer output is lost due to style spill-over on salient and background objects (content mismatch) or over-stylization. To solve this, we propose Semantic CLIPStyler (Sem-CS), that performs semantic style transfer. Sem-CS first segments the content image into salient and non-salient objects and then transfers artistic style based on a given style text description. The semantic style transfer is achieved using global foreground loss (for salient objects) and global background loss (for non-salient objects). Our empirical results, including DISTS, NIMA and user study scores, show that our proposed framework yields superior qualitative and quantitative performance. Our code is available at github.com/chandagrover/sem-cs.
Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt
results: Extensive experiments demonstrate that the framework performs excellently on several downstream tasks, including image classification, detection, and segmentation.
Abstract
Contrastive language-image Pre-training (CLIP) [13] can leverage large datasets of unlabeled Image-Text pairs, which have demonstrated impressive performance in various downstream tasks. Given that annotating medical data is time-consuming and laborious, Image-Text Pre-training has promising applications in exploiting large-scale medical image and radiology report datasets. However, medical Image-Text Pre-training faces several challenges, as follows: (1) Due to privacy concerns, the amount of available medical data is relatively small compared to natural data, leading to weaker generalization ability of the model. (2) Medical images are highly similar with only fine-grained differences in subtleties, resulting in a large number of false-negative sample pairs in comparison learning. (3) The hand-crafted prompt usually differs from the natural medical image report; subtle changes in wording can lead to significant differences in performance. In this paper, we propose a unified Image-Text-Label contrastive learning framework based on continuous prompts, with three main contributions. First, we unified the data of images, text, and labels, which greatly expanded the training data that the model could utilize. Second, we address the issue of data diversity and the impact of hand-crafted prompts on model performance by introducing continuous implicit prompts. Lastly, we propose an Image-Text-Label contrastive training to mitigate the problem of too many false-negative samples. We demonstrate through sufficient experiments that the Unified Medical Contrastive Learning (UMCL) framework exhibits excellent performance on several downstream tasks.
results: 我们在多个最大规模的人类功能脑成像数据集上进行了实验,证明SwiFT在预测性别、年龄和认知智力等任务中持续表现出色,超过了最新的先进模型。此外,我们还证明了基于对比损失的自监督预训练可以提升SwiFT在下游任务上的性能。Abstract
The modeling of spatiotemporal brain dynamics from high-dimensional data, such as 4D functional MRI, is a formidable task in neuroscience. To address this challenge, we present SwiFT (Swin 4D fMRI Transformer), a Swin Transformer architecture that can learn brain dynamics directly from 4D functional brain MRI data in a memory- and computation-efficient manner. SwiFT achieves this by implementing a 4D window multi-head self-attention mechanism and absolute positional embeddings. We evaluate SwiFT on multiple of the largest-scale human functional brain imaging datasets in tasks such as predicting sex, age, and cognitive intelligence. Our experimental outcomes reveal that SwiFT consistently outperforms recent state-of-the-art models. To the best of our knowledge, SwiFT is the first Swin Transformer architecture that can process high-dimensional spatiotemporal brain functional data in an end-to-end fashion. Furthermore, due to its end-to-end learning capability, we show that contrastive loss-based self-supervised pre-training of SwiFT is also feasible, achieving improved performance on a downstream task. We believe that our work holds substantial potential to facilitate scalable learning of functional brain imaging in neuroscience research by reducing the hurdles associated with applying Transformer models to high-dimensional fMRI.
摘要
从高维数据(如4D功能磁共振成像)建模大脑的时空动态,是神经科学中的一项艰巨任务。为应对这一挑战,我们提出了SwiFT(Swin 4D fMRI Transformer),一种基于Swin Transformer的架构,能够以内存与计算高效的方式直接从4D功能脑磁共振数据中学习脑动态。SwiFT通过实现4D窗口多头自注意力机制与绝对位置嵌入来达成这一点。我们在多个最大规模的人类功能脑成像数据集上,对预测性别、年龄和认知智力等任务评估了SwiFT。实验结果表明,SwiFT持续优于最新的先进模型。据我们所知,SwiFT是第一个能够端到端处理高维时空脑功能数据的Swin Transformer架构。此外,得益于端到端学习能力,我们还证明了基于对比损失的自监督预训练同样可行,能够提升下游任务性能。我们相信,这项工作通过降低将Transformer模型应用于高维fMRI的门槛,将为神经科学研究中功能脑成像的可扩展学习提供重要助力。
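The heart of a 4D window multi-head self-attention is partitioning the (space x time) feature map into non-overlapping 4D windows whose tokens attend to each other; a minimal partitioning sketch (window size and tensor layout are illustrative, not the paper's exact configuration):

```python
import torch

def window_partition_4d(x, ws):
    """Split a channels-last fMRI feature map (B, D, H, W, T, C) into
    non-overlapping 4D windows for windowed self-attention. ws = (wd, wh, ww, wt)."""
    B, D, H, W, T, C = x.shape
    wd, wh, ww, wt = ws
    x = x.view(B, D // wd, wd, H // wh, wh, W // ww, ww, T // wt, wt, C)
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8, 9).contiguous()
    return x.view(-1, wd * wh * ww * wt, C)  # (num_windows * B, tokens, C)

x = torch.randn(1, 8, 8, 8, 4, 32)           # toy spatial+temporal feature map
tokens = window_partition_4d(x, (4, 4, 4, 2))
print(tokens.shape)  # torch.Size([16, 128, 32]): 16 windows of 128 tokens each
```

Standard multi-head attention is then applied within each window, which keeps memory cost bounded by the window size rather than by the full 4D volume.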
Single Domain Generalization via Normalised Cross-correlation Based Convolutions
methods: 本文提出了一种新方法,即通过深度学习中常用的线性运算符(如卷积层和全连接层)来实现对域偏移的鲁棒性。我们提出了一种名为XCNorm的新算子,它计算权重与输入特征块之间的归一化互相关。该方法不依赖常用的非线性激活函数,且计算高效。
results: 我们的实验结果表明,所提方法在单域泛化基准上可实现与最先进方法相当的性能。此外,我们还证明了该方法的鲁棒性,在不同的语义分布偏移下仍能保持较高的性能。Abstract
Deep learning techniques often perform poorly in the presence of domain shift, where the test data follows a different distribution than the training data. The most practically desirable approach to address this issue is Single Domain Generalization (S-DG), which aims to train robust models using data from a single source. Prior work on S-DG has primarily focused on using data augmentation techniques to generate diverse training data. In this paper, we explore an alternative approach by investigating the robustness of linear operators, such as convolution and dense layers commonly used in deep learning. We propose a novel operator called XCNorm that computes the normalized cross-correlation between weights and an input feature patch. This approach is invariant to both affine shifts and changes in energy within a local feature patch and eliminates the need for commonly used non-linear activation functions. We show that deep neural networks composed of this operator are robust to common semantic distribution shifts. Furthermore, our empirical results on single-domain generalization benchmarks demonstrate that our proposed technique performs comparably to the state-of-the-art methods.
摘要
深度学习技术在存在域偏移时往往表现不佳,即测试数据的分布不同于训练数据。解决这一问题最具实用价值的途径是单域泛化(S-DG),其目标是仅使用单一来源的数据训练鲁棒的模型。先前的S-DG研究主要集中在使用数据增强技术生成多样化的训练数据。在这篇论文中,我们探索了一种不同的方法,即研究深度学习中常用的线性运算符(如卷积层和全连接层)的鲁棒性。我们提出了一种名为XCNorm的新算子,它计算权重与输入特征块之间的归一化互相关。这种方法对局部特征块内的仿射偏移和能量变化均保持不变,并消除了对常用非线性激活函数的需求。我们表明,由该算子构成的深度神经网络对常见的语义分布偏移具有鲁棒性。此外,我们在单域泛化基准上的实验结果表明,所提出的技术性能与最先进的方法相当。
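The abstract's description of XCNorm maps naturally onto an unfold-and-normalize computation; a sketch of one plausible reading, where both weights and patches are mean-centered and L2-normalized before correlation (these normalization details are assumptions):

```python
import torch
import torch.nn.functional as F

def xcnorm_conv2d(x, weight, eps=1e-5):
    """Normalized cross-correlation between conv weights and input patches.
    Mean-centering and unit-norm scaling make each output invariant to affine
    intensity shifts and to the energy of the local patch."""
    kh, kw = weight.shape[-2:]
    patches = F.unfold(x, (kh, kw), padding=(kh // 2, kw // 2))  # (B, C*kh*kw, L)
    patches = patches - patches.mean(dim=1, keepdim=True)
    patches = patches / (patches.norm(dim=1, keepdim=True) + eps)
    w = weight.flatten(1)
    w = w - w.mean(dim=1, keepdim=True)
    w = w / (w.norm(dim=1, keepdim=True) + eps)
    out = w @ patches                          # (B, C_out, L) via broadcasting
    B, _, H, W = x.shape
    return out.view(B, weight.shape[0], H, W)

x = torch.randn(2, 3, 32, 32)
w = torch.randn(16, 3, 3, 3)
print(xcnorm_conv2d(x, w).shape)  # torch.Size([2, 16, 32, 32])
```

Because the output is already a bounded similarity score, a following non-linear activation becomes unnecessary, consistent with the claim above.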
DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation
results: 这篇论文的实验结果表明,使用GAE可以基于示例实现多属性的图像编辑,获得令人信服的样本质量与属性对齐,同时显著减少计算量。Abstract
Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks such as text-to-image generation and image inpainting. However, compared to other generative methods like VAEs and GANs, DPMs lack a low-dimensional, interpretable, and well-decoupled latent code. Recently, diffusion autoencoders (Diff-AE) were proposed to explore the potential of DPMs for representation learning via autoencoding. Diff-AE provides an accessible latent space that exhibits remarkable interpretability, allowing us to manipulate image attributes based on latent codes from the space. However, previous works are not generic, as they only operate on a few limited attributes. To further explore the latent space of Diff-AE and achieve a generic editing pipeline, we propose a module called Group-supervised AutoEncoder (dubbed GAE) for Diff-AE to achieve better disentanglement of the latent code. Our GAE is trained via an attribute-swap strategy to acquire latent codes for example-based multi-attribute image manipulation. We empirically demonstrate that our method enables multiple-attribute manipulation and achieves convincing sample quality and attribute alignment, while significantly reducing computational requirements compared to pixel-based approaches for representational decoupling. Code will be released soon.
摘要
扩散概率模型(DPMs)在文本到图像生成和图像修补等多种图像生成任务上取得了非常出色的成果。然而,与VAE和GAN等其他生成方法相比,DPMs缺乏低维、可解释且良好解耦的潜在编码。最近,扩散自编码器(Diff-AE)被提出,用于通过自编码探索DPMs的表示学习潜力。Diff-AE提供了一个可访问的潜在空间,表现出卓越的可解释性,使我们能够基于该空间中的潜在编码操纵图像属性。然而,先前的工作缺乏通用性,只能在少数有限的属性上操作。为了进一步探索Diff-AE的潜在空间并实现通用的编辑管线,我们为Diff-AE提出了一个名为组监督自编码器(GAE)的模块,以在潜在编码上实现更好的解耦。我们所提的GAE通过属性交换策略进行训练,以获得用于基于示例的多属性图像编辑的潜在编码。我们的实验表明,该方法能够实现多属性编辑,取得令人信服的样本质量与属性对齐,同时相比基于像素的表示解耦方法显著降低了计算需求。我们即将发布代码。
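The attribute-swap strategy can be sketched as exchanging per-attribute chunks of two latent codes; how GAE actually partitions the code and supervises the swap is not specified in the abstract, so the equal-chunk layout below is an assumption:

```python
import torch

def attribute_swap(z_a, z_b, attr_idx, n_attrs):
    """Swap the sub-vector for one attribute between two latent codes.
    z_*: (B, D) latent codes, assumed partitioned into n_attrs equal chunks."""
    chunk = z_a.size(1) // n_attrs
    s, e = attr_idx * chunk, (attr_idx + 1) * chunk
    z_a_new, z_b_new = z_a.clone(), z_b.clone()
    z_a_new[:, s:e], z_b_new[:, s:e] = z_b[:, s:e], z_a[:, s:e]
    return z_a_new, z_b_new

z1, z2 = torch.randn(4, 512), torch.randn(4, 512)
z1_sw, z2_sw = attribute_swap(z1, z2, attr_idx=0, n_attrs=8)
# A decoder would then be supervised so that z1_sw reconstructs image 1 with
# attribute 0 taken from image 2 (and symmetrically for z2_sw).
```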
Rectifying Noisy Labels with Sequential Prior: Multi-Scale Temporal Feature Affinity Learning for Robust Video Segmentation
results: 在合成与真实世界标签噪声下的实验表明,我们的方法显著优于最新的先进鲁棒分割方法。Abstract
Noisy labels inevitably exist in medical image segmentation, causing severe performance degradation. Previous segmentation methods for noisy label problems only utilize a single image, while the potential of leveraging the correlation between images has been overlooked. Especially for video segmentation, adjacent frames contain rich contextual information beneficial for recognizing noisy labels. Based on these two insights, we propose a Multi-Scale Temporal Feature Affinity Learning (MS-TFAL) framework to resolve noisy-labeled medical video segmentation issues. First, we argue that the sequential prior of videos is an effective reference, i.e., pixel-level features from adjacent frames are close in distance for the same class and far in distance otherwise. Therefore, Temporal Feature Affinity Learning (TFAL) is devised to indicate possible noisy labels by evaluating the affinity between pixels in two adjacent frames. We also notice that the noise distribution exhibits considerable variation across the video, image, and pixel levels. We therefore introduce Multi-Scale Supervision (MSS) to supervise the network from three different perspectives by re-weighting and refining the samples. This design enables the network to concentrate on clean samples in a coarse-to-fine manner. Experiments with both synthetic and real-world label noise demonstrate that our method outperforms recent state-of-the-art robust segmentation approaches. Code is available at https://github.com/BeileiCui/MS-TFAL.
摘要
医学图像分割中的噪声标签问题不可避免,会导致严重的性能下降。先前针对噪声标签问题的分割方法只利用单张图像,忽略了利用图像间相关性的潜力。特别是在视频分割中,相邻帧包含丰富的上下文信息,有助于识别噪声标签。基于这两点洞察,我们提出了多尺度时间特征亲和学习(MS-TFAL)框架,用于解决噪声标签下的医学视频分割问题。首先,我们认为视频的时序先验是一种有效的参考:对于同一类别,相邻帧中的像素级特征距离较近,否则距离较远。因此,我们设计了时间特征亲和学习(TFAL),通过评估相邻两帧中像素间的亲和度来指示可能的噪声标签。我们还注意到,噪声分布在视频、图像和像素层面上存在相当大的差异。为此,我们引入多尺度监督(MSS),通过重新加权和精选样本,从三个不同的视角监督网络。这一设计使网络能够以从粗到细的方式聚焦于干净样本。在合成与真实世界标签噪声下的实验表明,我们的方法优于最新的先进鲁棒分割方法。代码可在 https://github.com/BeileiCui/MS-TFAL 获取。
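A minimal sketch of a temporal feature affinity signal: cosine similarity between pixel features of adjacent frames, used to down-weight likely noisy pixels. This is a simplified same-location variant; the paper's exact pixel pairing and multi-scale re-weighting are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def temporal_affinity(feat_t, feat_t1):
    """Pixel-wise cosine affinity between features of two adjacent frames.
    feat_*: (B, C, H, W). Low affinity at a pixel suggests its label may be noisy."""
    a = F.normalize(feat_t, dim=1)
    b = F.normalize(feat_t1, dim=1)
    return (a * b).sum(dim=1)  # (B, H, W), values in [-1, 1]

f_t, f_t1 = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
affinity = temporal_affinity(f_t, f_t1)
# One simple use: re-weight the per-pixel segmentation loss, e.g.
weights = affinity.clamp(min=0)  # down-weights low-affinity (likely noisy) pixels
```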
Deep learning-based estimation of whole-body kinematics from multi-view images
results: 该论文在新的屋顶作业(Roofing)数据集上取得了 $7.19^\circ$ 的平均角度误差,在 Human3.6M 数据集上取得了 $8.41^\circ$,为基于多视图图像的现场运动学分析迈出了重要一步。Abstract
It is necessary to analyze whole-body kinematics (including joint locations and joint angles) to assess risks of fatal and musculoskeletal injuries in occupational tasks. Human pose estimation has received more attention in recent years as a method to minimize errors in determining joint locations. However, joint angles are not often estimated, nor is the quality of joint angle estimation assessed. In this paper, we present an end-to-end approach for direct joint angle estimation from multi-view images. Our method leverages a volumetric pose representation and maps the rotation representation to a continuous space where each rotation is uniquely represented. We also present a new kinematic dataset in the domain of residential roofing, with a data processing pipeline to generate the annotations necessary for supervised training of direct joint angle estimation. We achieve a mean angle error of $7.19^\circ$ on the new Roofing dataset and $8.41^\circ$ on the Human3.6M dataset, paving the way for on-site kinematic analysis using multi-view images.
摘要
为评估职业任务中致命伤害与肌肉骨骼损伤的风险,需要分析全身运动学(包括关节位置和关节角度)。作为减少关节位置判定误差的一种方法,人体姿态估计近年来受到了更多关注。然而,关节角度往往不被估计,关节角度估计的质量也缺乏评估。在这篇论文中,我们提出了一种端到端方法,直接从多视图图像中估计关节角度。我们的方法利用体积姿态表示,并将旋转表示映射到一个连续空间中,使每个旋转都有唯一表示。我们还提供了一个住宅屋顶作业领域的新运动学数据集,以及一条数据处理管线,用于为直接关节角度估计的监督训练生成必要的标注。我们在新的Roofing数据集上取得了 $7.19^\circ$ 的平均角度误差,在Human3.6M数据集上取得了 $8.41^\circ$,为利用多视图图像进行现场运动学分析开辟了道路。
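A continuous rotation space "where each rotation is uniquely represented" is commonly realized with the 6D representation of Zhou et al. (2019); the paper's exact choice is not spelled out, so the following is a sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization.
    Unlike Euler angles or quaternions, this representation is continuous,
    which makes it friendlier for regression by a neural network."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)  # (..., 3, 3)

pred = torch.randn(8, 6)          # toy network output, one 6D vector per joint
R = rotation_6d_to_matrix(pred)   # valid rotation matrices
print(R.shape)  # torch.Size([8, 3, 3])
```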
SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views
results: 与先前方法相比,该方法能够在稀疏且含噪的视图下获得细节更精细、精度更高的神经表面重建结果。Abstract
Recent neural surface reconstruction approaches based on volume rendering have made much progress, achieving impressive surface reconstruction quality, but they are still limited to dense and highly accurate posed views. To overcome such drawbacks, this paper pays special attention to consistent surface reconstruction from sparse views with noisy camera poses. Unlike previous approaches, the key difference of this paper is to exploit multi-view constraints directly from the explicit geometry of the neural surface, which can be used as effective regularization to jointly learn the neural surface and refine the camera poses. To build effective multi-view constraints, we introduce a fast differentiable on-surface intersection to generate on-surface points, and propose view-consistent losses based on such differentiable points to regularize the neural surface learning. Based on this, we propose a joint learning strategy for the neural surface and camera poses, named SC-NeuS, to perform geometry-consistent surface reconstruction in an end-to-end manner. With extensive evaluation on public datasets, our SC-NeuS achieves consistently better surface reconstruction results with finer-grained details than previous state-of-the-art neural surface reconstruction approaches, especially from sparse and noisy camera views.
摘要
最近基于体渲染的神经表面重建方法取得了很大进展,但仍然受限于稠密且位姿高度准确的视图。为了突破这些局限,本文着重研究从带有噪声相机位姿的稀疏视图中实现一致的表面重建。与先前方法不同,本文的关键区别在于直接从神经表面的显式几何中获取多视图约束,并将其作为有效的正则化,联合学习神经表面并精化相机位姿。为了构建有效的多视图约束,我们引入了一种快速可微的表面交点计算来生成表面上的点,并基于这些可微点提出视图一致损失,用以正则化神经表面学习。在此基础上,我们提出了一种联合学习神经表面与相机位姿的策略,称为SC-NeuS,以端到端方式实现几何一致的表面重建。通过在公开数据集上的广泛评估,SC-NeuS相比先前最先进的神经表面重建方法能够取得持续更优、细节更精细的表面重建结果,尤其是在稀疏且含噪的相机视图下。
FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction
results: 提出了一种简单而有效的频带感知自引导网络(FreeSeed),能够有效去除伪影,并从受污染的稀疏视图CT图像中恢复丢失的细节。Abstract
Sparse-view computed tomography (CT) is a promising solution for expediting the scanning process and mitigating radiation exposure to patients; the reconstructed images, however, contain severe streak artifacts, compromising subsequent screening and diagnosis. Recently, deep learning-based image post-processing methods, along with their dual-domain counterparts, have shown promising results. However, existing methods usually produce over-smoothed images with loss of details due to (1) the difficulty in accurately modeling the artifact patterns in the image domain, and (2) the equal treatment of each pixel in the loss function. To address these issues, we concentrate on image post-processing and propose a simple yet effective FREquency-band-awarE and SElf-guidED network, termed FreeSeed, which can effectively remove artifacts and recover missing details from contaminated sparse-view CT images. Specifically, we first propose a frequency-band-aware artifact modeling network (FreeNet), which learns artifact-related frequency-band attention in the Fourier domain for better modeling of the globally distributed streak artifacts in sparse-view CT images. We then introduce a self-guided artifact refinement network (SeedNet), which leverages the predicted artifact to assist FreeNet in further refining the severely corrupted details. Extensive experiments demonstrate the superior performance of FreeSeed and its dual-domain counterpart over state-of-the-art sparse-view CT reconstruction methods. Source code is made available at https://github.com/Masaaki-75/freeseed.
摘要
稀疏视图计算机断层成像(CT)是一种有前景的解决方案,可以加快扫描过程并降低病人所受的辐射剂量。然而,重建图像中含有严重的条状伪影,会干扰后续的筛查和诊断。最近,基于深度学习的图像后处理方法及其双域对应方法已显示出有前景的结果。然而,现有方法通常会产生过度平滑、细节丢失的图像,原因在于:(1)难以在图像域中准确建模伪影模式;(2)损失函数对每个像素一视同仁。为了解决这些问题,我们专注于图像后处理,提出了一种简单而有效的频带感知自引导网络(FreeSeed),它能够有效去除伪影,并从受污染的稀疏视图CT图像中恢复丢失的细节。具体而言,我们首先提出了一种频带感知的伪影建模网络(FreeNet),它在傅里叶域学习与伪影相关的频带注意力,以更好地建模稀疏视图CT图像上全局分布的条状伪影。随后,我们引入了一种自引导的伪影精修网络(SeedNet),它利用预测出的伪影辅助FreeNet继续修复严重受损的细节。大量实验证明,FreeSeed及其双域对应版本优于最先进的稀疏视图CT重建方法。源代码可在 https://github.com/Masaaki-75/freeseed 获取。
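A toy stand-in for frequency-band-aware attention: split the 2D FFT spectrum into radial bands and re-weight each band before transforming back. The band layout and the (here random) weights are assumptions; in FreeNet the weights would be learned and artifact-conditioned.

```python
import torch

def frequency_band_attention(x, n_bands=4):
    """Re-weight radial frequency bands of an image's 2D FFT spectrum."""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    r = torch.sqrt((yy - H / 2.0) ** 2 + (xx - W / 2.0) ** 2)
    band = torch.clamp((r / r.max() * n_bands).long(), max=n_bands - 1)
    weights = torch.rand(n_bands)           # stand-in for learned attention weights
    out = torch.zeros_like(spec)
    for i in range(n_bands):
        out = out + weights[i] * spec * (band == i).float()
    return torch.fft.ifft2(torch.fft.ifftshift(out, dim=(-2, -1))).real

img = torch.randn(1, 1, 64, 64)
filtered = frequency_band_attention(img)
print(filtered.shape)  # torch.Size([1, 1, 64, 64])
```

Because streak artifacts have a characteristic global frequency signature, operating band-wise in the Fourier domain lets the network target them without over-smoothing everything else.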
Rethinking Mitosis Detection: Towards Diverse Data and Feature Representation
paper_authors: Hao Wang, Jiatai Lin, Danyi Li, Jing Wang, Bingchao Zhao, Zhenwei Shi, Xipeng Pan, Huadeng Wang, Bingbing Li, Changhong Liang, Guoqiang Han, Li Liang, Chu Han, Zaiyi Liu
for: The paper proposes a novel generalizable framework (MitDet) for mitosis detection that balances data and feature diversity to achieve better generalizability.
methods: MitDet uses a diversity-guided sample balancing (DGSB) module to account for data diversity, an inter- and intra-class feature diversity-preserved module (InCDP) to preserve feature diversity, and a stain enhancement (SE) module to enhance the domain-relevant diversity of both data and features.
results: MitDet outperforms all state-of-the-art (SOTA) approaches on several popular mitosis detection datasets, on both internal and external test sets, with minimal annotation effort using point annotations only. Comprehensive ablation studies also confirm the effectiveness of rethinking the balance of data and feature diversity.Abstract
Mitosis detection is one of the fundamental tasks in computational pathology, and it is extremely challenging due to the heterogeneity of mitotic cells. Most current studies address this heterogeneity on the technical side by increasing model complexity. However, neglecting biological knowledge while relying on complex model designs may lead to overfitting and limit the generalizability of the detection model. In this paper, we systematically study the morphological appearances of different mitotic phases as well as ambiguous non-mitotic cells, and find that balancing data and feature diversity achieves better generalizability. Based on this observation, we propose a novel generalizable framework (MitDet) for mitosis detection. Data diversity is handled by the proposed diversity-guided sample balancing (DGSB), and feature diversity is preserved by an inter- and intra-class feature diversity-preserved module (InCDP). A stain enhancement (SE) module is introduced to enhance the domain-relevant diversity of both data and features simultaneously. Extensive experiments demonstrate that our proposed model outperforms all SOTA approaches on several popular mitosis detection datasets, on both internal and external test sets, using minimal annotation effort with point annotations only. Comprehensive ablation studies have also proven the effectiveness of rethinking the balance of data and feature diversity. By analyzing the results quantitatively and qualitatively, we believe that our proposed model not only achieves SOTA performance but may also inspire future studies from new perspectives. Source code is at https://github.com/Onehour0108/MitDet.
摘要
有丝分裂检测是计算病理学中的一项基本任务,但由于有丝分裂细胞的高度异质性而极具挑战性。现有的大多数研究通过提高模型复杂度从技术层面解决异质性问题,但忽视生物学知识并依赖复杂的模型设计可能导致过拟合,限制检测模型的泛化能力。本文系统地研究了不同有丝分裂阶段的形态特征以及易混淆的非有丝分裂细胞,发现平衡数据与特征的多样性可以获得更好的泛化能力。基于这一观察,我们提出了一种可泛化的有丝分裂检测框架(MitDet):数据多样性由所提的多样性引导样本均衡模块(DGSB)考虑,特征多样性由类间与类内特征多样性保持模块(InCDP)保留。此外,我们还引入了染色增强(SE)模块,以同时提升数据与特征的领域相关多样性。我们所提的模型在多个流行的有丝分裂检测数据集上进行了广泛实验,仅使用点标注这一最少的标注方式,就在内部与外部测试集上均超越了所有SOTA方法。全面的消融研究也证明了重新思考数据与特征多样性平衡的有效性。通过定量与定性分析,我们相信所提模型不仅达到了SOTA性能,还可能为未来研究提供新的视角。代码位于 https://github.com/Onehour0108/MitDet。
for: 本研究提出一种快速、简单的多目标跟踪(MOT)模型,无需任何附加模块,如卡尔曼滤波、匈牙利算法、Transformer块或图网络。
methods: 该模型仅由基础检测器和一个交叉注意力模块组成,不需要附加模块,因此计算成本较低。
results: 研究表明,TicrossNet可实时运行:在 MOT17 上达到 32.6 FPS,在 MOT20 上达到 31.0 FPS(Tesla V100),其中每帧实例数多达100个以上。此外,TicrossNet 对实例数 $N_t$ 具有鲁棒性,因此无需像其他模型那样根据 $N_t$ 调整基础检测器的大小。Abstract
We propose a conceptually simple and thus fast multi-object tracking (MOT) model that does not require any attached modules, such as the Kalman filter, Hungarian algorithm, transformer blocks, or graph networks. Conventional MOT models are built upon the multi-step modules listed above, and thus the computational cost is high. Our proposed end-to-end MOT model, \textit{TicrossNet}, is composed of a base detector and a cross-attention module only. As a result, the overhead of tracking does not increase significantly even when the number of instances ($N_t$) increases. We show that TicrossNet runs \textit{in real-time}; specifically, it achieves 32.6 FPS on MOT17 and 31.0 FPS on MOT20 (Tesla V100), which includes as many as $>$100 instances per frame. We also demonstrate that TicrossNet is robust to $N_t$; thus, it does not have to change the size of the base detector, depending on $N_t$, as is often done by other models for real-time processing.
摘要
我们提出了一种概念简单、因而快速的多目标跟踪(MOT)模型,无需附加任何模块,如卡尔曼滤波、匈牙利算法、Transformer块或图网络。传统的MOT模型通常建立在上述多步模块之上,因此计算成本高。我们提出的端到端MOT模型TicrossNet仅由基础检测器和一个交叉注意力模块组成。因此,即使实例数($N_t$)增加,跟踪开销也不会显著上升。我们表明,TicrossNet可实时运行:具体而言,它在 MOT17 上达到 32.6 FPS,在 MOT20 上达到 31.0 FPS(Tesla V100),其中每帧实例数多达100个以上。我们还证明了TicrossNet对 $N_t$ 具有鲁棒性,因此无需像其他实时处理模型那样根据 $N_t$ 调整基础检测器的大小。
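The tracking-by-cross-attention idea can be sketched as a single attention score map between current-frame detection embeddings and previous-frame track embeddings; the feature source, temperature, and handling of track births and deaths are assumptions, since the abstract only names the cross-attention module:

```python
import torch
import torch.nn.functional as F

def cross_frame_association(curr_feats, prev_feats, temperature=0.1):
    """Soft association of current-frame detections (queries) with previous-frame
    tracks (keys) using one normalized cross-attention score map."""
    q = F.normalize(curr_feats, dim=-1)   # (N, D) current detections
    k = F.normalize(prev_feats, dim=-1)   # (M, D) existing tracks
    scores = q @ k.t() / temperature      # (N, M)
    return scores.softmax(dim=-1)         # row i: matching distribution over tracks

curr = torch.randn(5, 128)
prev = torch.randn(4, 128)
assoc = cross_frame_association(curr, prev)
print(assoc.shape)  # torch.Size([5, 4])
```

The appeal of this formulation is that it replaces the usual Kalman-filter prediction plus Hungarian matching with one differentiable operation whose cost grows gracefully with the number of instances.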
OG: Equip vision occupancy with instance segmentation and visual grounding
methods: OG方法使用亲和场预测实现实例聚类,并通过关联策略将2D实例掩码与3D占用实例对齐。
results: 经过大量实验与分析,作者发现OG方法能够准确地预测实例掩码与占用实例,并在不同环境下保持较高的精度与稳定性。Abstract
Occupancy prediction tasks focus on inferring both geometry and semantic labels for each voxel, an important perception task. However, occupancy prediction remains a semantic segmentation task that does not distinguish between instances. Further, although some existing works, such as Open-Vocabulary Occupancy (OVO), have already solved the problem of open-vocabulary detection, visual grounding in occupancy has, to the best of our knowledge, not yet been solved. To tackle these two limitations, this paper proposes Occupancy Grounding (OG), a novel method that equips vanilla occupancy prediction with instance segmentation ability and can perform visual grounding in a voxel manner with the help of Grounded-SAM. Keys to our approach are (1) affinity field prediction for instance clustering and (2) an association strategy for aligning 2D instance masks with 3D occupancy instances. Extensive experiments have been conducted, whose visualization results and analysis are shown below. Our code will be publicly released soon.
摘要
占用预测任务关注对每个体素的几何与语义标签的推断,这是一项重要的感知任务。然而,它仍然是一种不区分实例的语义分割任务。此外,尽管一些现有工作(如开放词汇占用,OVO)已经解决了开放词汇检测问题,但据我们所知,占用中的视觉定位尚未得到解决。为了解决以上两个局限,本文提出了占用定位(OG)方法,它为普通的占用预测赋予实例分割能力,并能借助Grounded-SAM以体素方式进行视觉定位。我们方法的关键点包括:
1. 用于实例聚类的亲和场预测
2. 用于对齐2D实例掩码与3D占用实例的关联策略
我们进行了广泛的实验,可视化结果与分析如下所示。我们的代码即将公开发布。
GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video
paper_authors: Bruce X. B. Yu, Zhi Zhang, Yongxu Liu, Sheng-hua Zhong, Yan Liu, Chang Wen Chen
for: This paper aims to improve the performance of 3D human pose lifting by exploiting ground truth 2D pose data.
methods: The paper proposes a simple yet effective model called Global-local Adaptive Graph Convolutional Network (GLA-GCN) that globally models the spatiotemporal structure via a graph representation and backtraces local joint features for 3D human pose estimation.
results: The experimental results show that the proposed GLA-GCN significantly outperforms state-of-the-art methods (e.g., up to around 3%, 17%, and 14% error reductions on Human3.6M, HumanEva-I, and MPI-INF-3DHP, respectively).Abstract
3D human pose estimation has been researched for decades with promising results. 3D human pose lifting is one of the promising research directions toward the task, where both estimated pose and ground truth pose data are used for training. Existing pose lifting works mainly focus on improving the performance of estimated pose, but they usually underperform when tested on ground truth pose data. We observe that the performance of the estimated pose can be easily improved by preparing good-quality 2D pose, such as fine-tuning the 2D pose or using advanced 2D pose detectors. As such, we concentrate on improving 3D human pose lifting via ground truth data, anticipating the future availability of higher-quality estimated pose data. Towards this goal, a simple yet effective model called Global-local Adaptive Graph Convolutional Network (GLA-GCN) is proposed in this work. Our GLA-GCN globally models the spatiotemporal structure via a graph representation and backtraces local joint features for 3D human pose estimation via individually connected layers. To validate our model design, we conduct extensive experiments on three benchmark datasets: Human3.6M, HumanEva-I, and MPI-INF-3DHP. Experimental results show that our GLA-GCN implemented with ground truth 2D poses significantly outperforms state-of-the-art methods (e.g., up to around 3%, 17%, and 14% error reductions on Human3.6M, HumanEva-I, and MPI-INF-3DHP, respectively). GitHub: https://github.com/bruceyo/GLA-GCN.
摘要
三维人体姿态估计已被研究数十年,并取得了可观的成果。三维人体姿态提升(pose lifting)是该任务中一个有前景的研究方向,其中估计姿态与真实姿态数据均用于训练。现有的姿态提升工作主要关注提高估计姿态的性能,但在真实姿态数据上测试时往往表现欠佳。我们观察到,通过准备高质量的2D姿态(例如微调2D姿态或使用先进的2D姿态检测器),可以轻易提升估计姿态的性能。因此,我们专注于借助真实数据改进三维人体姿态提升,以期未来受益于更高质量的估计姿态数据。为实现这一目标,本文提出了一种简单而有效的模型:全局-局部自适应图卷积网络(GLA-GCN)。GLA-GCN通过图表示对时空结构进行全局建模,并通过独立连接层回溯局部关节特征以进行三维人体姿态估计。为验证模型设计,我们在三个基准数据集上进行了广泛实验:Human3.6M、HumanEva-I和MPI-INF-3DHP。实验结果表明,基于真实2D姿态实现的GLA-GCN显著优于最先进方法(例如在Human3.6M、HumanEva-I和MPI-INF-3DHP上分别减少约3%、17%和14%的误差)。GitHub:https://github.com/bruceyo/GLA-GCN。
Denoising Simulated Low-Field MRI (70mT) using Denoising Autoencoders (DAE) and Cycle-Consistent Generative Adversarial Networks (Cycle-GAN)
paper_authors: Fernando Vega, Abdoljalil Addeh, M. Ethan MacDonald
for: 这篇论文的目的是提高低场磁共振成像(MRI)图像的质量。
methods: 论文使用循环一致性生成对抗网络(Cycle-GAN),将低场、低分辨率、低信噪比(SNR)的MRI图像转换为高场、高分辨率、高SNR的MRI图像。
results: 论文表明该方法可以提高低场MRI图像的质量,并且不需要成对的图像。Abstract
In this work, a denoising Cycle-GAN (Cycle Consistent Generative Adversarial Network) is implemented to yield high-field, high resolution, high signal-to-noise ratio (SNR) Magnetic Resonance Imaging (MRI) images from simulated low-field, low resolution, low SNR MRI images. Resampling and additive Rician noise were used to simulate low-field MRI. Images were utilized to train a Denoising Autoencoder (DAE) and a Cycle-GAN, with paired and unpaired cases. Both networks were evaluated using SSIM and PSNR image quality metrics. This work demonstrates the use of a generative deep learning model that can outperform classical DAEs to improve low-field MRI images and does not require image pairs.
摘要
在这项工作中,我们实现了一种去噪Cycle-GAN(循环一致生成对抗网络),以便从模拟的低场、低分辨率、低信噪比(SNR)MRI图像生成高场、高分辨率、高SNR的磁共振成像(MRI)图像。我们使用重采样与加性莱斯(Rician)噪声来模拟低场MRI。这些图像被用于训练一个去噪自编码器(DAE)和一个Cycle-GAN,涵盖成对与非成对两种情况。两个网络均使用SSIM和PSNR图像质量指标进行评估。这项工作表明,生成式深度学习模型能够超越经典DAE来改善低场MRI图像,且不需要成对图像。
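Simulating low-field acquisition with resampling plus additive Rician noise is straightforward, since Rician noise is the magnitude of a signal with independent Gaussian noise on the real and imaginary channels; a sketch (sigma and downsampling factor are illustrative):

```python
import numpy as np

def add_rician_noise(img, sigma):
    """Simulate low-SNR magnitude MRI: add complex Gaussian noise to the
    signal, then take the magnitude, which yields a Rician distribution."""
    n_real = np.random.normal(0.0, sigma, img.shape)
    n_imag = np.random.normal(0.0, sigma, img.shape)
    return np.sqrt((img + n_real) ** 2 + n_imag ** 2)

hi_field = np.random.rand(256, 256)                   # stand-in for a high-field slice
lo_field = add_rician_noise(hi_field[::4, ::4], sigma=0.05)  # resample + noise
print(lo_field.shape)  # (64, 64)
```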
results: 该论文在外部基准和人工评估中均达到了最先进性能。此外,作者公开了其预训练的CLIP transformer模型StreetCLIP,可用于相邻领域,应用于应对气候变化以及城市与乡村场景理解。Abstract
We introduce PIGEON, a multi-task end-to-end system for planet-scale image geolocalization that achieves state-of-the-art performance on both external benchmarks and in human evaluation. Our work incorporates semantic geocell creation with label smoothing, conducts pretraining of a vision transformer on images with geographic information, and refines location predictions with ProtoNets across a candidate set of geocells. The contributions of PIGEON are three-fold: first, we design a semantic geocell creation and splitting algorithm based on open-source data which can be adapted to any geospatial dataset. Second, we show the effectiveness of intra-geocell refinement and the applicability of unsupervised clustering and ProtoNets to the task. Finally, we make our pre-trained CLIP transformer model, StreetCLIP, publicly available for use in adjacent domains with applications to fighting climate change and urban and rural scene understanding.
摘要
我们介绍PIGEON,一个用于行星尺度图像地理定位的多任务端到端系统,在外部基准和人工评估中均达到最先进性能。我们的工作结合了带标签平滑的语义地理单元(geocell)创建,在带有地理信息的图像上预训练视觉transformer,并使用ProtoNets在候选地理单元集合上精化位置预测。PIGEON的贡献有三:首先,我们设计了一种基于开源数据的语义地理单元创建与划分算法,可适用于任何地理空间数据集;其次,我们展示了单元内精化的有效性,以及无监督聚类和ProtoNets对该任务的适用性;最后,我们公开了预训练的CLIP transformer模型StreetCLIP,可用于相邻领域,应用于应对气候变化以及城市与乡村场景理解。
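Label smoothing over geocells presumably decays with geographic distance from the true cell; a sketch using the haversine distance with an exponential kernel (the kernel form and its scale are assumptions, as the abstract only names "semantic geocell creation with label smoothing"):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def smoothed_geocell_targets(true_idx, centroids, tau=75.0):
    """Soft classification targets that decay with distance from the true cell."""
    lat0, lon0 = centroids[true_idx]
    d = np.array([haversine(lat0, lon0, la, lo) for la, lo in centroids])
    t = np.exp(-d / tau)        # hypothetical exponential kernel, scale tau in km
    return t / t.sum()

cells = [(48.1, 11.6), (48.8, 2.3), (40.7, -74.0)]  # toy geocell centroids
print(smoothed_geocell_targets(0, cells))
```

Compared to one-hot targets, such smoothing keeps nearby geocells from being penalized as harshly as cells on another continent.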
Improving Segmentation and Detection of Lesions in CT Scans Using Intensity Distribution Supervision
paper_authors: Seung Yeon Shin, Thomas C. Shen, Ronald M. Summers
for: 在不增加标注成本的前提下,提升分割与检测网络的训练效果。
methods: 基于目标病灶的强度直方图构建基于强度的病灶概率(ILP)函数,根据每个体素的强度计算其属于病灶的概率,并将得到的ILP图作为网络训练的额外监督。
results: 在三种病灶类型(小肠类癌瘤、肾肿瘤和肺结节)的分割中,Dice分数分别从41.3%提升至47.8%、从74.2%提升至76.0%、从26.4%提升至32.7%;在肾肿瘤检测任务中,平均精度从64.6%提升至75.5%。Abstract
We propose a method to incorporate the intensity information of a target lesion on CT scans in training segmentation and detection networks. We first build an intensity-based lesion probability (ILP) function from an intensity histogram of the target lesion. It is used to compute the probability of being the lesion for each voxel based on its intensity. Finally, the computed ILP map of each input CT scan is provided as additional supervision for network training, which aims to inform the network about possible lesion locations in terms of intensity values at no additional labeling cost. The method was applied to improve the segmentation of three different lesion types, namely, small bowel carcinoid tumor, kidney tumor, and lung nodule. The effectiveness of the proposed method on a detection task was also investigated. We observed improvements of 41.3% -> 47.8%, 74.2% -> 76.0%, and 26.4% -> 32.7% in segmenting small bowel carcinoid tumor, kidney tumor, and lung nodule, respectively, in terms of per-case Dice scores. An improvement of 64.6% -> 75.5% was achieved in detecting kidney tumors in terms of average precision. The results of different usages of the ILP map and the effect of varied amounts of training data are also presented.
摘要
我们提出了一种方法,在训练分割与检测网络时纳入目标病灶在CT扫描中的强度信息。我们首先根据目标病灶的强度直方图构建一个基于强度的病灶概率(ILP)函数,用于根据强度值计算每个体素属于病灶的概率。最后,将每张输入CT扫描计算得到的ILP图作为网络训练的额外监督,旨在以强度值的形式向网络提示可能的病灶位置,而无需额外的标注成本。我们应用该方法提升三种病灶类型的分割效果:小肠类癌瘤、肾肿瘤和肺结节。以逐例Dice分数衡量,我们观察到:
* 小肠类癌瘤:从 41.3% 提升到 47.8%
* 肾肿瘤:从 74.2% 提升到 76.0%
* 肺结节:从 26.4% 提升到 32.7%
此外,在肾肿瘤检测任务中,平均精度从 64.6% 提升到 75.5%。我们还展示了ILP图的不同使用方式以及训练数据量变化的影响。
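The ILP function itself is simple to sketch: build a histogram of lesion-voxel intensities, normalize it, and use it as a lookup from intensity to lesion probability. The normalization to [0, 1], the HU range, and the edge-bin clamping below are assumptions.

```python
import numpy as np

def build_ilp(lesion_intensities, bins=64, rng=(-200, 400)):
    """Intensity-based lesion probability (ILP): a lookup from HU value to the
    relative likelihood of being lesion, built from lesion-voxel intensities."""
    hist, edges = np.histogram(lesion_intensities, bins=bins, range=rng)
    ilp = hist / hist.max()                  # normalize peak to 1.0 (assumed)
    def lookup(volume):
        idx = np.digitize(volume, edges) - 1
        idx = np.clip(idx, 0, bins - 1)      # out-of-range clamped to edge bins
        return ilp[idx]
    return lookup

lesion_vox = np.random.normal(60, 15, 5000)  # toy lesion HU samples
ct = np.random.normal(40, 50, (8, 64, 64))   # toy CT volume
ilp_map = build_ilp(lesion_vox)(ct)
# ilp_map is then supplied to the network as extra supervision alongside the CT.
```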
Differentiable Forward Projector for X-ray Computed Tomography
results: 该软件库能够准确地进行前向投影,并保证预测图像与原始测量数据之间的一致性,可用于各种计算机断层成像(CT)重建问题。Abstract
Data-driven deep learning has been successfully applied to various computed tomographic reconstruction problems. The deep inference models may outperform existing analytical and iterative algorithms, especially in ill-posed CT reconstruction. However, those methods often predict images that do not agree with the measured projection data. This paper presents an accurate differentiable forward and back projection software library to ensure the consistency between the predicted images and the original measurements. The software library efficiently supports various projection geometry types while minimizing the GPU memory footprint requirement, which facilitates seamless integration with existing deep learning training and inference pipelines. The proposed software is available as open source: https://github.com/LLNL/LEAP.
摘要
数据驱动的深度学习已成功应用于多种计算机断层成像(CT)重建问题。深度推理模型可能超越现有的解析与迭代算法,尤其是在病态(ill-posed)的CT重建中。然而,这些方法预测的图像常常与测量得到的投影数据不一致。本文介绍了一个精确的可微分前向与反向投影软件库,以确保预测图像与原始测量之间的一致性。该软件库高效地支持多种投影几何类型,同时尽量减少GPU内存占用,便于与现有的深度学习训练和推理管线无缝集成。该软件已开源,可在GitHub上获取:https://github.com/LLNL/LEAP。
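A toy differentiable parallel-beam projector illustrates the core idea: as long as every step (here, rotation via grid_sample and a sum) is differentiable, gradients can flow from a projection-consistency loss back to the image. This is only a sketch of the concept, not LEAP's implementation, which supports many geometries efficiently.

```python
import math
import torch
import torch.nn.functional as F

def forward_project(img, angles_deg):
    """Toy differentiable parallel-beam projector: rotate the image and
    integrate along one axis to obtain one projection per angle."""
    views = []
    for a in angles_deg:
        c, s = math.cos(math.radians(a)), math.sin(math.radians(a))
        theta = torch.tensor([[[c, -s, 0.0], [s, c, 0.0]]], dtype=img.dtype)
        grid = F.affine_grid(theta, list(img.shape), align_corners=False)
        rotated = F.grid_sample(img, grid, align_corners=False)
        views.append(rotated.sum(dim=2))      # line integral along rows
    return torch.stack(views, dim=2)          # (B, C, n_angles, W)

phantom = torch.zeros(1, 1, 64, 64, requires_grad=True)
sinogram = forward_project(phantom, range(0, 180, 10))
sinogram.sum().backward()                     # gradients reach the image
```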
A Hierarchical Transformer Encoder to Improve Entire Neoplasm Segmentation on Whole Slide Image of Hepatocellular Carcinoma
results: 实验证明,HiTrans通过利用更大的感受野并学习全局依赖关系,提升了分割性能。Abstract
In digital histopathology, entire neoplasm segmentation on Whole Slide Images (WSI) of Hepatocellular Carcinoma (HCC) plays an important role, especially as a preprocessing filter to automatically exclude healthy tissue, in histological molecular correlation mining and other downstream histopathological tasks. The segmentation task remains challenging due to HCC's inherent high heterogeneity and the lack of dependency learning in a large field of view. In this article, we propose a novel deep learning architecture with a hierarchical Transformer encoder, HiTrans, to learn the global dependencies within expanded 4096$\times$4096 WSI patches. HiTrans is designed to encode and decode the patches with larger receptive fields and the learned global dependencies, compared to state-of-the-art fully convolutional neural networks (FCNNs). Empirical evaluations verified that HiTrans leads to better segmentation performance by taking regional and global dependency information into account.
摘要
在数字病理学中,肝细胞癌(HCC)全切片图像(WSI)上的整体肿瘤分割扮演着重要角色,尤其是作为自动排除健康组织的预处理过滤器,服务于组织学-分子相关性挖掘及其他下游病理任务。由于HCC固有的高度异质性以及大视野下依赖关系学习的缺失,该分割任务仍具挑战性。在这篇文章中,我们提出了一种带有分层Transformer编码器的新型深度学习架构HiTrans,用于学习扩展到4096$\times$4096的WSI图块内的全局依赖关系。与最先进的全卷积神经网络(FCNN)相比,HiTrans以更大的感受野并结合所学的全局依赖关系对图块进行编码与解码。实证评估表明,HiTrans通过兼顾区域与全局依赖信息获得了更好的分割性能。
3D Medical Image Segmentation based on multi-scale MPU-Net
results: 对于LiTS 2017数据集,MPU-Net模型的最佳分割结果达到了92.17%的Dice系数、99.08%的准确率、91.91%的精确率、99.52%的特异度、85.91%的IoU和91.74%的MCC。这些成绩在各个方面都展现了模型的突出表现。Abstract
The high cure rate of cancer is inextricably linked to physicians' accuracy in diagnosis and treatment; therefore, a model that can accomplish high-precision tumor segmentation has become a necessity in many applications of the medical industry. It can effectively lower the rate of misdiagnosis while considerably lessening the burden on clinicians. However, fully automated target organ segmentation is problematic due to the irregular stereo structure of 3D volume organs. As a basic model for this class of real applications, U-Net excels. It can learn certain global and local features, but still lacks the capacity to grasp spatial long-range relationships and contextual information at multiple scales. This paper proposes a tumor segmentation model MPU-Net for patient volume CT images, which is inspired by Transformer with a global attention mechanism. By combining image serialization with the Position Attention Module, the model attempts to comprehend deeper contextual dependencies and accomplish precise positioning. Each layer of the decoder is also equipped with a multi-scale module and a cross-attention mechanism. The capability of feature extraction and integration at different levels has been enhanced, and the hybrid loss function developed in this study can better exploit high-resolution characteristic information. Moreover, the suggested architecture is tested and evaluated on the Liver Tumor Segmentation Challenge 2017 (LiTS 2017) dataset. Compared with the benchmark model U-Net, MPU-Net shows excellent segmentation results. The dice, accuracy, precision, specificity, IOU, and MCC metrics for the best model segmentation results are 92.17%, 99.08%, 91.91%, 99.52%, 85.91%, and 91.74%, respectively. Outstanding indicators in various aspects illustrate the exceptional performance of this framework in automatic medical image segmentation.
摘要
癌症的高治愈率与医生诊断和治疗的准确性密不可分,因此能够实现高精度肿瘤分割的模型已成为医疗行业诸多应用中的必需品。它可以有效降低误诊率,同时大幅减轻临床医生的负担。然而,由于3D体积器官的不规则立体结构,完全自动的目标器官分割仍然困难。作为此类实际应用的基础模型,U-Net表现出色,能够学习一定的全局与局部特征,但仍缺乏把握空间长程关系和多尺度上下文信息的能力。本文受带有全局注意力机制的Transformer启发,提出了面向患者体积CT图像的肿瘤分割模型MPU-Net。通过将图像序列化与位置注意力模块相结合,模型尝试理解更深层的上下文依赖并实现精确定位。解码器的每一层还配备了多尺度模块和交叉注意力机制,增强了不同层级的特征提取与整合能力;本研究设计的混合损失函数也能更好地利用高分辨率特征信息。此外,所提架构在肝脏肿瘤分割挑战赛2017(LiTS 2017)数据集上进行了测试与评估。与基准模型U-Net相比,MPU-Net展现出优异的分割效果:最佳模型分割结果的Dice、准确率、精确率、特异度、IoU和MCC分别达到92.17%、99.08%、91.91%、99.52%、85.91%和91.74%。各方面的出色指标表明该框架在自动医学图像分割中的卓越性能。
Automated Artifact Detection in Ultra-widefield Fundus Photography of Patients with Sickle Cell Disease
results: 研究发现,该算法能够准确分类常见的UWF-FP伪影,包括眼睫毛、下眼睑遮挡、上眼睑遮挡、图像过暗、暗伪影和图像未居中,各类别均取得了较高的分类准确率。Abstract
Importance: Ultra-widefield fundus photography (UWF-FP) has shown utility in sickle cell retinopathy screening; however, image artifact may diminish the quality and gradeability of images. Objective: To create an automated algorithm for UWF-FP artifact classification. Design: A neural network-based automated artifact detection algorithm was designed to identify commonly encountered UWF-FP artifacts in a cross section of patient UWF-FP. A pre-trained ResNet-50 neural network was trained on a subset of the images, and the classification accuracy, sensitivity, and specificity were quantified on the held-out test set. Setting: The study is based on patients from a tertiary care hospital site. Participants: A total of 243 UWF-FP were acquired from patients with sickle cell disease (SCD), and artifact labelling was performed in the following categories: Eyelash Present, Lower Eyelid Obstructing, Upper Eyelid Obstructing, Image Too Dark, Dark Artifact, and Image Not Centered. Results: Overall, the accuracy for each class was Eyelash Present at 83.7%, Lower Eyelid Obstructing at 83.7%, Upper Eyelid Obstructing at 98.0%, Image Too Dark at 77.6%, Dark Artifact at 93.9%, and Image Not Centered at 91.8%. Conclusions and Relevance: This automated algorithm shows promise in identifying common imaging artifacts on a subset of Optos UWF-FP in SCD patients. Further refinement is ongoing with the goal of improving the efficiency of tele-retinal screening in sickle cell retinopathy (SCR) by providing the photographer real-time feedback on the types of artifacts present and the need for image re-acquisition. This algorithm also may have potential future applicability in other retinal diseases by improving the quality and efficiency of UWF-FP image acquisition.
摘要
重要性:超广角眼底摄影(UWF-FP)已在镰状细胞视网膜病变筛查中显示出实用价值;然而,图像伪影可能降低图像质量与可分级性。目的:为UWF-FP伪影分类建立一种自动化算法。设计:设计了一种基于神经网络的自动伪影检测算法,用于在患者UWF-FP的横断面样本中识别常见的UWF-FP伪影。在部分图像上训练了预训练的ResNet-50神经网络,并在留出测试集上量化分类准确率、敏感度与特异度。场所:本研究基于一家三级医疗中心的患者。参与者:共采集了243张镰状细胞病(SCD)患者的UWF-FP,并按以下类别进行伪影标注:眼睫毛、下眼睑遮挡、上眼睑遮挡、图像过暗、暗伪影和图像未居中。结果:各类别的准确率分别为:眼睫毛83.7%、下眼睑遮挡83.7%、上眼睑遮挡98.0%、图像过暗77.6%、暗伪影93.9%、图像未居中91.8%。结论与意义:该自动算法在SCD患者的部分Optos UWF-FP图像上显示出识别常见成像伪影的潜力。我们正在进一步完善该算法,目标是通过向摄影师实时反馈存在的伪影类型及是否需要重新采集图像,提高镰状细胞视网膜病变(SCR)远程视网膜筛查的效率。该算法未来还可能应用于其他视网膜疾病,以提升UWF-FP图像采集的质量与效率。
Line Art Colorization of Fakemon using Generative Adversarial Neural Networks
results: 这个研究的视觉结果显示彩色化成果是可行的,但还有改进的空间。Abstract
This work proposes a complete methodology to colorize images of Fakemon, anime-style monster-like creatures. In addition, we propose algorithms to extract the line art from colorized images as well as to extract color hints. Our work is the first in the literature to use automatic color hint extraction, to train the networks specifically with anime-styled creatures and to combine the Pix2Pix and CycleGAN approaches, two different generative adversarial networks that create a single final result. Visual results of the colorizations are feasible but there is still room for improvement.
摘要
本工作提出了一套完整的方法,用于为Fakemon(动漫风格的怪兽类生物)图像着色。此外,我们还提出了从彩色图像中提取线稿以及提取颜色提示的算法。我们的工作在文献中首次使用自动颜色提示提取、首次专门以动漫风格生物训练网络,并首次将Pix2Pix与CycleGAN这两种不同的生成对抗网络结合起来产生单一的最终结果。可视化结果表明着色效果可行,但仍有改进空间。
MoP-CLIP: A Mixture of Prompt-Tuned CLIP Models for Domain Incremental Learning
methods: 基于提示微调CLIP模型的混合(MoP-CLIP):在训练阶段对每个域中每个类别的特征分布建模,并学习各域的文本与视觉提示以适应给定域;在推理阶段,所学分布可用于判断测试样本是属于已知域(选择正确的提示完成分类任务),还是来自未见域(利用提示微调CLIP模型的混合)。
results: 实验表明,现有DIL方法在域偏移下表现不佳,而MoP-CLIP在标准DIL设置下与最先进方法相当,并在OOD场景下表现更优。这些结果体现了MoP-CLIP的优越性,为域增量学习问题提供了一种鲁棒且通用的解决方案。Abstract
Despite the recent progress in incremental learning, addressing catastrophic forgetting under distributional drift is still an open and important problem. Indeed, while state-of-the-art domain incremental learning (DIL) methods perform satisfactorily within known domains, their performance largely degrades in the presence of novel domains. This limitation hampers their generalizability, and restricts their scalability to more realistic settings where train and test data are drawn from different distributions. To address these limitations, we present a novel DIL approach based on a mixture of prompt-tuned CLIP models (MoP-CLIP), which generalizes the paradigm of S-Prompting to handle both in-distribution and out-of-distribution data at inference. In particular, at the training stage we model the features distribution of every class in each domain, learning individual text and visual prompts to adapt to a given domain. At inference, the learned distributions allow us to identify whether a given test sample belongs to a known domain, selecting the correct prompt for the classification task, or from an unseen domain, leveraging a mixture of the prompt-tuned CLIP models. Our empirical evaluation reveals the poor performance of existing DIL methods under domain shift, and suggests that the proposed MoP-CLIP performs competitively in the standard DIL settings while outperforming state-of-the-art methods in OOD scenarios. These results demonstrate the superiority of MoP-CLIP, offering a robust and general solution to the problem of domain incremental learning.
摘要
尽管增量学习近来取得了进展,但在分布漂移下应对灾难性遗忘仍是一个开放且重要的问题。实际上,最先进的域增量学习(DIL)方法在已知域内表现尚可,但在出现新域时性能大幅下降。这一局限削弱了它们的泛化能力,也限制了它们在训练与测试数据来自不同分布的更现实场景中的可扩展性。为解决这些局限,我们提出了一种基于提示微调CLIP模型混合(MoP-CLIP)的新DIL方法,它将S-Prompting范式加以推广,使其在推理时能够同时处理分布内与分布外数据。具体而言,在训练阶段,我们对每个域中每个类别的特征分布建模,学习各域的文本与视觉提示以适应给定域。在推理阶段,所学分布使我们能够判断测试样本是否属于已知域(为分类任务选择正确的提示),或来自未见域(利用提示微调CLIP模型的混合)。我们的实证评估揭示了现有DIL方法在域偏移下的不佳表现,并表明所提MoP-CLIP在标准DIL设置下与最先进方法相当,而在OOD场景中更优。这些结果展示了MoP-CLIP的优越性,为域增量学习问题提供了一种鲁棒且通用的解决方案。
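One plausible mechanism for the inference-time domain decision is a Mahalanobis distance test against per-domain feature distributions fit at training time; the abstract does not give the exact criterion, so the sketch below is an assumption:

```python
import torch

def pick_domain(feat, means, covs_inv):
    """Choose the domain whose Gaussian (fit at training time) is closest in
    Mahalanobis distance to a test feature; treat as OOD if all are far."""
    d = torch.stack([(feat - m) @ ci @ (feat - m)
                     for m, ci in zip(means, covs_inv)])
    return d.argmin().item(), d.min().item()

feat = torch.randn(64)                          # CLIP feature of a test sample
means = [torch.randn(64) for _ in range(3)]     # per-domain feature means
covs_inv = [torch.eye(64) for _ in range(3)]    # per-domain inverse covariances
dom, dist = pick_domain(feat, means, covs_inv)
# If dist exceeds a threshold, fall back to mixing the prompt-tuned CLIP models
# instead of committing to a single domain's prompt.
```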
SepHRNet: Generating High-Resolution Crop Maps from Remote Sensing imagery using HRNet with Separable Convolution
results: 本研究获得了97.5%的高精度分类率和55.2%的 IoU 值,在生成农作物地图方面表现出色,并且较以前的模型(如U-Net++, ResNet50、VGG19、InceptionV3、DenseNet、EfficientNet)的性能更高。Abstract
The accurate mapping of crop production is crucial for ensuring food security, effective resource management, and sustainable agricultural practices. One way to achieve this is by analyzing high-resolution satellite imagery. Deep Learning has been successful in analyzing images, including remote sensing imagery. However, capturing intricate crop patterns is challenging due to their complexity and variability. In this paper, we propose a novel Deep learning approach that integrates HRNet with Separable Convolutional layers to capture spatial patterns and Self-attention to capture temporal patterns of the data. The HRNet model acts as a backbone and extracts high-resolution features from crop images. Spatially separable convolution in the shallow layers of the HRNet model captures intricate crop patterns more effectively while reducing the computational cost. The multi-head attention mechanism captures long-term temporal dependencies from the encoded vector representation of the images. Finally, a CNN decoder generates a crop map from the aggregated representation. Adaboost is used on top of this to further improve accuracy. The proposed algorithm achieves a high classification accuracy of 97.5\% and IoU of 55.2\% in generating crop maps. We evaluate the performance of our pipeline on the Zuericrop dataset and demonstrate that our results outperform state-of-the-art models such as U-Net++, ResNet50, VGG19, InceptionV3, DenseNet, and EfficientNet. This research showcases the potential of Deep Learning for Earth Observation Systems.
摘要
农作物生产的精准制图对保障粮食安全、有效资源管理和可持续农业实践至关重要。实现这一目标的一种途径是分析高分辨率卫星影像。深度学习在图像分析(包括遥感影像)方面已取得成功。然而,农作物模式复杂多变,难以捕捉。本文提出了一种新的深度学习方法:以HRNet为主干网络,结合可分离卷积层捕捉空间模式,并利用自注意力机制捕捉数据的时间模式。HRNet模型作为主干,从农作物图像中提取高分辨率特征;HRNet浅层中的空间可分离卷积能更有效地捕捉复杂的农作物模式,同时降低计算成本;多头注意力机制从图像的编码向量表示中捕捉长期时间依赖。最后,一个CNN解码器根据聚合后的表示生成农作物地图,并在此之上使用AdaBoost进一步提升精度。所提算法在生成农作物地图时达到了97.5%的高分类精度与55.2%的IoU。我们在Zuericrop数据集上评估了该流程,结果优于U-Net++、ResNet50、VGG19、InceptionV3、DenseNet和EfficientNet等最先进模型。这项研究展示了深度学习在地球观测系统中的潜力。
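The abstract says "spatially separable" convolution; whether that means the common depthwise-separable factorization or a k x 1 / 1 x k split is ambiguous, so here is a sketch of the depthwise variant, which achieves the stated computational savings:

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: a per-channel spatial conv followed by
    a 1x1 pointwise conv, with far fewer FLOPs than a full convolution."""
    def __init__(self, in_c, out_c, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_c, in_c, k, padding=k // 2, groups=in_c)
        self.pointwise = nn.Conv2d(in_c, out_c, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 64, 64)
print(SeparableConv2d(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```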
Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives
paper_authors: Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei A. Efros, Mathieu Aubry
for: Given a set of calibrated images of a scene, the paper presents an approach to produce a simple, compact, and actionable 3D world representation using 3D primitives.
methods: The approach models primitives as textured superquadric meshes and optimizes their parameters from scratch with an image rendering loss, using differentiable rendering. The approach also includes modeling transparency for each primitive.
results: The resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. The approach is compared to the state of the art on diverse scenes and demonstrated to be robust on real-life captures.Abstract
Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations. Code and video results are available at https://www.tmonnier.com/DBW .
摘要
给定一组标定好的场景图像,我们提出了一种通过3D基元生成简单、紧凑且可操作的3D世界表示的方法。与许多专注于恢复高保真3D场景的方法不同,我们专注于将场景解析为由少量带纹理基元构成的中层3D表示。这类表示具有可解释性、易于操作,并适用于基于物理的仿真。此外,与依赖3D输入数据的现有基元分解方法不同,我们的方法通过可微渲染直接在图像上进行。具体而言,我们将基元建模为带纹理的超二次曲面网格,并以图像渲染损失从头优化其参数。我们强调为每个基元建模透明度的重要性:这不仅对优化至关重要,还使得处理数量可变的基元成为可能。我们表明,所得的带纹理基元能够忠实重建输入图像并准确建模可见的3D点,同时对未见物体区域给出非模态(amodal)形状补全。我们在DTU的多种场景上与最先进方法进行了比较,并在BlendedMVS和Nerfstudio的真实拍摄数据上验证了其鲁棒性。我们还展示了如何利用结果轻松编辑场景或进行物理仿真。代码与视频结果见 https://www.tmonnier.com/DBW 。
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
paper_authors: Roland S. Zimmermann, Thomas Klein, Wieland Brendel
for: The paper investigates whether scaling neural networks improves their mechanistic interpretability, in the context of machine vision.
methods: The authors use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models, from older architectures to state-of-the-art ones.
results: They find no scaling effect on interpretability, for either model or dataset size; latest-generation vision models appear even less interpretable than older architectures, suggesting a regression rather than an improvement.Abstract
In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We here use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the nine investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 120'000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated instead of human-based interpretability evaluations that can ultimately be leveraged to directly optimize the mechanistic interpretability of models.
摘要
因为现代人工智能系统的广泛应用,理解神经网络内部的信息处理变得日益重要。最近,机器视觉领域通过将神经网络的数据集与模型规模扩展到前所未有的水平取得了显著进步。我们在此追问:这种规模上的巨大增长是否也对机制可解释性领域产生了积极影响?换言之,我们对大规模神经网络内部机理的理解是否也随之提升?我们采用一种心理物理学范式来量化多种模型的机制可解释性,结果发现可解释性不存在规模效应——无论是模型规模还是数据集规模。具体而言,在所考察的九个最先进模型中,没有一个比近十年前的GoogLeNet模型更易解释。最新一代视觉模型甚至显得比旧架构更难解释,暗示的是一种倒退而非进步:现代模型为了精度牺牲了可解释性。这些结果凸显了对显式面向机制可解释性设计的模型的需求,以及对更有效的可解释性方法的需求,以便在原子层面增进我们对网络的理解。我们发布了一个数据集,包含对九个模型中767个单元进行心理物理学评估所得的12万余条人类反馈。该数据集旨在促进基于自动化(而非人工)的可解释性评估研究,最终可被用于直接优化模型的机制可解释性。
My3DGen: Building Lightweight Personalized 3D Generative Model
results: 该系统仅需10张图像即可构建个性化且轻量的3D生成先验,能够从输入测试图像重建多视图一致的图像,并通过在同一人物任意两张图像之间插值生成新的外观,在模型大小减少约50倍的情况下不牺牲生成3D人脸的质量。Abstract
Our paper presents My3DGen, a practical system for creating a personalized and lightweight 3D generative prior using as few as 10 images. My3DGen can reconstruct multi-view consistent images from an input test image, and generate novel appearances by interpolating between any two images of the same individual. While recent studies have demonstrated the effectiveness of personalized generative priors in producing high-quality 2D portrait reconstructions and syntheses, to the best of our knowledge, we are the first to develop a personalized 3D generative prior. Instead of fine-tuning a large pre-trained generative model with millions of parameters to achieve personalization, we propose a parameter-efficient approach. Our method involves utilizing a pre-trained model with fixed weights as a generic prior, while training a separate personalized prior through low-rank decomposition of the weights in each convolution and fully connected layer. However, parameter-efficient few-shot fine-tuning on its own often leads to overfitting. To address this, we introduce a regularization technique based on symmetry of human faces. This regularization enforces that novel view renderings of a training sample, rendered from symmetric poses, exhibit the same identity. By incorporating this symmetry prior, we enhance the quality of reconstruction and synthesis, particularly for non-frontal (profile) faces. Our final system combines low-rank fine-tuning with symmetry regularization and significantly surpasses the performance of pre-trained models, e.g. EG3D. It introduces only approximately 0.6 million additional parameters per identity compared to 31 million for full finetuning of the original model. As a result, our system achieves a 50-fold reduction in model size without sacrificing the quality of the generated 3D faces. Code will be available at our project page: https://luchaoqi.github.io/my3dgen.
摘要
我们的论文介绍了My3DGen,一个实用的系统,仅需10张图像即可创建个性化且轻量的3D生成先验。My3DGen能够从输入测试图像重建多视图一致的图像,并通过在同一人物的任意两张图像之间插值生成新的外观。尽管近期研究已证明个性化生成先验能够产生高质量的2D肖像重建与合成,但据我们所知,我们是第一个开发个性化3D生成先验的工作。我们没有为实现个性化而微调拥有数百万参数的大型预训练生成模型,而是提出了一种参数高效的方法:以固定权重的预训练模型作为通用先验,并通过对每个卷积层与全连接层的权重做低秩分解,训练一个单独的个性化先验。然而,参数高效的少样本微调本身往往会导致过拟合。为此,我们引入了一种基于人脸对称性的正则化技术:要求训练样本从对称姿态渲染出的新视图呈现相同的身份。通过引入这一对称先验,我们提升了重建与合成的质量,尤其是对非正面(侧脸)人脸。我们的最终系统将低秩微调与对称正则化相结合,性能显著超越了EG3D等预训练模型。与对原模型3100万参数的完整微调相比,它每个身份仅引入约60万个额外参数,使模型大小减少50倍,而不牺牲所生成3D人脸的质量。代码将发布于我们的项目页面:https://luchaoqi.github.io/my3dgen。
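The low-rank decomposition of each layer's weights can be sketched in the spirit of LoRA: freeze the pre-trained kernel and learn a rank-r update on top of it. The rank, initialization, and exact factorization below are assumptions; the paper only states that the personalized prior is trained through low-rank decomposition of the weights.

```python
import torch
import torch.nn as nn

class LowRankConv2d(nn.Module):
    """Frozen pre-trained conv plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, conv, rank=4):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad_(False)                  # generic prior stays fixed
        out_c, in_c, kh, kw = conv.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_c * kh * kw) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_c, rank))  # zero-init: starts at prior

    def forward(self, x):
        dw = (self.B @ self.A).view_as(self.conv.weight)
        return nn.functional.conv2d(x, self.conv.weight + dw, self.conv.bias,
                                    self.conv.stride, self.conv.padding)

layer = LowRankConv2d(nn.Conv2d(64, 64, 3, padding=1), rank=4)
y = layer(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

With rank r, the added parameters per layer are r * (in_c * kh * kw + out_c), which is how the per-identity overhead stays near 0.6M parameters instead of the full 31M.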
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
results: 实验结果显示,EgoVLPv2在各种视频-语言任务上持续取得最先进的表现,并在所有下游任务上超越强基线。Abstract
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
摘要
视频-语言预训练(VLP)因其对多种视觉与语言任务的泛化能力而日益重要。然而,现有的第一视角VLP框架使用相互独立的视频与语言编码器,仅在微调阶段学习任务相关的跨模态信息,限制了统一系统的发展。在这项工作中,我们提出了第二代第一视角视频-语言预训练框架(EgoVLPv2),相比上一代有显著改进:我们将跨模态融合直接引入视频与语言主干网络。EgoVLPv2在预训练阶段学习强大的视频-文本表示,并能复用跨模态注意力模块,以灵活高效的方式支持不同的下游任务,降低微调成本。此外,我们提出的主干内融合策略比堆叠额外的融合专用层更轻量、更省算力。在广泛的视频-语言任务上的大量实验表明,EgoVLPv2在所有下游任务上均持续超越强基线,达到最先进性能。关于我们的项目,请访问项目页面:https://shramanpramanick.github.io/EgoVLPv2/。
Efficient 3D Articulated Human Generation with Layered Surface Volumes
results: LSV-GAN能够高效地生成高质量、可关节驱动的3D人体。我们在非结构化的单视图2D图像数据集上训练LSV-GAN,无需视角不一致的2D上采样网络即可获得高质量且视角一致的3D人体生成。Abstract
Access to high-quality and diverse 3D articulated digital human assets is crucial in various applications, ranging from virtual reality to social platforms. Generative approaches, such as 3D generative adversarial networks (GANs), are rapidly replacing laborious manual content creation tools. However, existing 3D GAN frameworks typically rely on scene representations that leverage either template meshes, which are fast but offer limited quality, or volumes, which offer high capacity but are slow to render, thereby limiting the 3D fidelity in GAN settings. In this work, we introduce layered surface volumes (LSVs) as a new 3D object representation for articulated digital humans. LSVs represent a human body using multiple textured mesh layers around a conventional template. These layers are rendered using alpha compositing with fast differentiable rasterization, and they can be interpreted as a volumetric representation that allocates its capacity to a manifold of finite thickness around the template. Unlike conventional single-layer templates that struggle with representing fine off-surface details like hair or accessories, our surface volumes naturally capture such details. LSVs can be articulated, and they exhibit exceptional efficiency in GAN settings, where a 2D generator learns to synthesize the RGBA textures for the individual layers. Trained on unstructured, single-view 2D image datasets, our LSV-GAN generates high-quality and view-consistent 3D articulated digital humans without the need for view-inconsistent 2D upsampling networks.
摘要
在虚拟现实与社交平台等各种应用中,获取高质量、多样化的3D关节数字人资产至关重要。3D生成对抗网络(GAN)等生成式方法正在迅速取代费时费力的人工内容创作工具。然而,现有的3D GAN框架所依赖的场景表示,要么采用模板网格(速度快但质量有限),要么采用体积表示(容量大但渲染缓慢),从而限制了GAN设置下的3D保真度。在这项工作中,我们提出了分层表面体积(LSV)作为关节数字人的一种新的3D物体表示。LSV在常规模板周围使用多层带纹理网格来表示人体;这些层通过alpha合成与快速可微光栅化进行渲染,可以被解释为一种把容量分配到模板周围有限厚度流形上的体积表示。与难以表示头发、饰品等精细离表面细节的传统单层模板不同,我们的表面体积能够自然地捕捉这些细节。LSV支持关节驱动,并在GAN设置中表现出极高的效率:一个2D生成器学习为各个层合成RGBA纹理。在非结构化的单视图2D图像数据集上训练后,我们的LSV-GAN无需视角不一致的2D上采样网络即可生成高质量且视角一致的3D关节数字人。
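Rendering the layers with alpha compositing reduces to a standard front-to-back "over" operation across the textured layer images; a minimal sketch (layer count and ordering convention are illustrative):

```python
import torch

def composite_layers(rgbas):
    """Front-to-back alpha compositing of rendered layer images.
    rgbas: list of (B, 4, H, W) tensors, ordered front (index 0) to back."""
    out = torch.zeros_like(rgbas[0][:, :3])
    transmittance = torch.ones_like(rgbas[0][:, 3:4])  # light not yet absorbed
    for layer in rgbas:
        rgb, a = layer[:, :3], layer[:, 3:4]
        out = out + transmittance * a * rgb
        transmittance = transmittance * (1.0 - a)
    return out

layers = [torch.rand(1, 4, 64, 64) for _ in range(3)]  # rasterized mesh layers
img = composite_layers(layers)
print(img.shape)  # torch.Size([1, 3, 64, 64])
```

Because each step is differentiable, gradients from an image-space GAN loss can reach every layer's RGBA texture, which is what makes the layered representation trainable with a 2D generator.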