cs.SD - 2023-07-16

NoiseBandNet: Controllable Time-Varying Neural Synthesis of Sound Effects Using Filterbanks

  • paper_url: http://arxiv.org/abs/2307.08007
  • repo_url: https://github.com/adrianbarahona/noisebandnet
  • paper_authors: Adrián Barahona-Ríos, Tom Collins
  • for: This work proposes a controllable neural audio synthesis method for generating diverse-sounding sound effects with good time and frequency resolution.
  • methods: The method synthesises and controls sound effects by filtering white noise through a filterbank.
  • results: Compared against four variants of the DDSP filtered-noise synthesiser, NoiseBandNet scores higher in nine out of ten evaluation categories, showing that it can generate time-varying, diverse-sounding sound effects of arbitrary length.
    Abstract Controllable neural audio synthesis of sound effects is a challenging task due to the potential scarcity and spectro-temporal variance of the data. Differentiable digital signal processing (DDSP) synthesisers have been successfully employed to model and control musical and harmonic signals using relatively limited data and computational resources. Here we propose NoiseBandNet, an architecture capable of synthesising and controlling sound effects by filtering white noise through a filterbank, thus going further than previous systems that make assumptions about the harmonic nature of sounds. We evaluate our approach via a series of experiments, modelling footsteps, thunderstorm, pottery, knocking, and metal sound effects. Comparing NoiseBandNet audio reconstruction capabilities to four variants of the DDSP-filtered noise synthesiser, NoiseBandNet scores higher in nine out of ten evaluation categories, establishing a flexible DDSP method for generating time-varying, inharmonic sound effects of arbitrary length with both good time and frequency resolution. Finally, we introduce some potential creative uses of NoiseBandNet, by generating variations, performing loudness transfer, and by training it on user-defined control curves.
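    The core synthesis step, filtering white noise through a filterbank with time-varying per-band gains, can be sketched as follows. This is a minimal illustration of the technique named in the abstract, not the authors' code; the band count, filter order, and log-spaced band edges are our assumptions.

```python
# Sketch of time-varying filterbank noise synthesis (illustrative only).
import numpy as np
from scipy.signal import butter, lfilter

def filterbank_noise(amplitudes, sr=44100, f_lo=20.0, f_hi=16000.0, order=4):
    """amplitudes: (n_bands, n_samples) per-band gain envelopes, e.g. the
    control curves a network would predict. Returns the summed signal."""
    n_bands, n = amplitudes.shape
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    out = np.zeros(n)
    for b in range(n_bands):
        noise = np.random.randn(n)                  # fresh white noise per band
        ba = butter(order, [edges[b], edges[b + 1]], btype="band", fs=sr)
        out += amplitudes[b] * lfilter(*ba, noise)  # apply time-varying gain
    return out

# Constant envelopes yield stationary coloured noise; shaping them over time
# is what produces footsteps, thunder, etc.
y = filterbank_noise(np.ones((16, 44100)) * 0.1)
```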

eess.AS - 2023-07-16

Model Adaptation for ASR in low-resource Indian Languages

  • paper_url: http://arxiv.org/abs/2307.07948
  • repo_url: None
  • paper_authors: Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Savitha, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Rohan Saxena, Sai Praneeth Reddy Mora, Srinivasa Raghavan
  • for: The paper aims to improve automatic speech recognition (ASR) performance for low-resource languages, specifically Indian languages like Bengali and Bhojpuri.
  • methods: The paper uses self-supervised learning (SSL) based acoustic models like wav2vec2 and large-scale multi-lingual training like Whisper, and explores the use of adaptation and fine-tuning techniques to overcome the low-resource nature of the data.
  • results: The paper aims to understand the importance of each modality (acoustics and text) in building a reliable ASR system for low-resource languages, and to explore the applicability of these approaches to various languages spoken around the world.
    Abstract Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages. In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the vice-versa could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages of Bengali and Bhojpuri. These approaches are not limited to Indian languages, the solutions are potentially applicable to various languages spoken around the world.

cs.CV - 2023-07-16

Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector

  • paper_url: http://arxiv.org/abs/2307.08076
  • repo_url: None
  • paper_authors: Shuo-Yen Lin, Ernie Chu, Che-Hsien Lin, Jun-Cheng Chen, Jia-Ching Wang
  • for: The paper aims to address the issue of poor-quality physical adversarial patches for protecting personal privacy from malicious monitoring using object detectors.
  • methods: The proposed method uses diffusion models (DM) to generate naturalistic adversarial patches that are high-quality and stable, without suffering from mode collapse problems.
  • results: The proposed approach achieves better-quality and more naturalistic adversarial patches than other state-of-the-art patch generation methods, with acceptable attack performance and various generation trade-offs under different conditions.
    Abstract Many physical adversarial patch generation methods are widely proposed to protect personal privacy from malicious monitoring using object detectors. However, they usually fail to generate satisfactory patch images in terms of both stealthiness and attack performance without making huge efforts on careful hyperparameter tuning. To address this issue, we propose a novel naturalistic adversarial patch generation method based on the diffusion models (DM). Through sampling the optimal image from the DM model pretrained upon natural images, it allows us to stably craft high-quality and naturalistic physical adversarial patches to humans without suffering from serious mode collapse problems as other deep generative models. To the best of our knowledge, we are the first to propose DM-based naturalistic adversarial patch generation for object detectors. With extensive quantitative, qualitative, and subjective experiments, the results demonstrate the effectiveness of the proposed approach to generate better-quality and more naturalistic adversarial patches while achieving acceptable attack performance than other state-of-the-art patch generation methods. We also show various generation trade-offs under different conditions.

Dense Multitask Learning to Reconfigure Comics

  • paper_url: http://arxiv.org/abs/2307.08071
  • repo_url: None
  • paper_authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
  • for: The paper develops a MultiTask Learning (MTL) model to achieve dense predictions for comic panels, helping authors reconfigure their narratives when transferring comics from one publication channel to another.
  • methods: The authors leverage the commonly-used strategy of unsupervised image-to-image translation to exploit a large corpus of real-world annotations, then build on the translation results a multitask approach based on a vision transformer backbone and a domain-transferable attention module.
  • results: The MTL method successfully identifies the semantic units as well as the embedded notion of 3D in comic panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the author's creative process.
    Abstract In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in the task of reconfiguring their narratives. Our MTL method can successfully identify the semantic units as well as the embedded notion of 3D in comic panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the authors creative process. Typically, dense image-based prediction techniques require a large corpus of data. Finding an automated solution for dense prediction in the comics domain, therefore, becomes more difficult with the lack of ground-truth dense annotations for the comics images. To address these challenges, we develop the following solutions: 1) we leverage a commonly-used strategy known as unsupervised image-to-image translation, which allows us to utilize a large corpus of real-world annotations; 2) we utilize the results of the translations to develop our multitasking approach that is based on a vision transformer backbone and a domain transferable attention module; 3) we study the feasibility of integrating our MTL dense-prediction method with an existing retargeting method, thereby reconfiguring comics.

MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment

  • paper_url: http://arxiv.org/abs/2307.08065
  • repo_url: None
  • paper_authors: Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
  • for: The paper targets vision-based applications that require efficient processing on heterogeneous MPSoC platforms.
  • methods: The paper proposes a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms, including a mapping-aware Graph Neural Architecture Search (MaGNAS) framework.
  • results: The proposed MaGNAS framework achieves a 1.57x latency speedup and is 3.38x more energy-efficient for several vision datasets executed on the Xavier MPSoC compared to a GPU-only deployment, while sustaining an average 0.11% accuracy reduction from the baseline.
    Abstract Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57x latency speedup and is 3.38x more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.

LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.08059
  • repo_url: None
  • paper_authors: Haonan Yin, Guanlong Jiao, Qianhui Wu, Borje F. Karlsson, Biqing Huang, Chin Yew Lin
  • for: In the context of flexible manufacturing systems, this work proposes an unsupervised multi-class anomaly detection method that detects anomalies from objects belonging to multiple classes when only normal data is accessible.
  • methods: The generative approach uses a latent diffusion model for reconstruction to mitigate the "identity shortcut" issue, plus a feature editing strategy that further alleviates identity shortcuts while improving the reconstruction quality of normal regions.
  • results: Extensive experiments on the MVTec-AD and MPDD datasets show that the proposed LafitE outperforms existing methods by a significant margin in terms of average AUROC, and the hyperparameters selected via the proposed pseudo validation set match the real test set well.
    Abstract In the context of flexible manufacturing systems that are required to produce different types and quantities of products with minimal reconfiguration, this paper addresses the problem of unsupervised multi-class anomaly detection: develop a unified model to detect anomalies from objects belonging to multiple classes when only normal data is accessible. We first explore the generative-based approach and investigate latent diffusion models for reconstruction to mitigate the notorious ``identity shortcut'' issue in auto-encoder based methods. We then introduce a feature editing strategy that modifies the input feature space of the diffusion model to further alleviate ``identity shortcuts'' and meanwhile improve the reconstruction quality of normal regions, leading to fewer false positive predictions. Moreover, we are the first who pose the problem of hyperparameter selection in unsupervised anomaly detection, and propose a solution of synthesizing anomaly data for a pseudo validation set to address this problem. Extensive experiments on benchmark datasets MVTec-AD and MPDD show that the proposed LafitE, \ie, Latent Diffusion Model with Feature Editing, outperforms state-of-art methods by a significant margin in terms of average AUROC. The hyperparamters selected via our pseudo validation set are well-matched to the real test set.

TransNuSeg: A Lightweight Multi-Task Transformer for Nuclei Segmentation

  • paper_url: http://arxiv.org/abs/2307.08051
  • repo_url: https://github.com/zhenqi-he/transnuseg
  • paper_authors: Zhenqi He, Mathias Unberath, Jing Ke, Yiqing Shen
  • for: The paper proposes a Transformer-based automatic nuclei segmentation method to improve segmentation accuracy and efficiency.
  • methods: A multi-task learning strategy decouples nuclei segmentation into three sub-tasks, nuclei instance, nuclei edge, and clustered edge segmentation, handled by a tri-decoder structure; a self distillation loss enforces consistency between branches, and an attention sharing scheme partially shares self-attention heads amongst the tri-decoders.
  • results: On two datasets of different modalities, the method outperforms state-of-the-art counterparts such as CA2.5-Net by 2-3% Dice while using 30% fewer parameters, improving computational efficiency.
    Abstract Nuclei appear small in size, yet, in real clinical practice, the global spatial information and correlation of the color or brightness contrast between nuclei and background, have been considered a crucial component for accurate nuclei segmentation. However, the field of automatic nuclei segmentation is dominated by Convolutional Neural Networks (CNNs), meanwhile, the potential of the recently prevalent Transformers has not been fully explored, which is powerful in capturing local-global correlations. To this end, we make the first attempt at a pure Transformer framework for nuclei segmentation, called TransNuSeg. Different from prior work, we decouple the challenging nuclei segmentation task into an intrinsic multi-task learning task, where a tri-decoder structure is employed for nuclei instance, nuclei edge, and clustered edge segmentation respectively. To eliminate the divergent predictions from different branches in previous work, a novel self distillation loss is introduced to explicitly impose consistency regulation between branches. Moreover, to formulate the high correlation between branches and also reduce the number of parameters, an efficient attention sharing scheme is proposed by partially sharing the self-attention heads amongst the tri-decoders. Finally, a token MLP bottleneck replaces the over-parameterized Transformer bottleneck for a further reduction in model complexity. Experiments on two datasets of different modalities, including MoNuSeg have shown that our methods can outperform state-of-the-art counterparts such as CA2.5-Net by 2-3% Dice with 30% fewer parameters. In conclusion, TransNuSeg confirms the strength of Transformer in the context of nuclei segmentation, which thus can serve as an efficient solution for real clinical practice. Code is available at https://github.com/zhenqi-he/transnuseg.

A Novel SLCA-UNet Architecture for Automatic MRI Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.08048
  • repo_url: None
  • paper_authors: Tejashwini P S, Thriveni J, Venugopal K R
  • for: The paper targets automated classification and identification of brain tumors in MRI via deep learning.
  • methods: A modified UNet, the SLCA UNet, incorporates residual dense blocks, layered attention, and channel attention modules in addition to stacked convolution, effectively capturing both coarse and fine feature information.
  • results: On the BraTS 2020 dataset, the method achieves average scores of 0.845, 0.845, 0.999, and 8.1 in terms of Dice, Sensitivity, Specificity, and Hausdorff95, respectively.
    Abstract Brain tumor is deliberated as one of the severe health complications which lead to decrease in life expectancy of the individuals and is also considered as a prominent cause of mortality worldwide. Therefore, timely detection and prediction of brain tumors can be helpful to prevent death rates due to brain tumors. Biomedical image analysis is a widely known solution to diagnose brain tumor. Although MRI is the current standard method for imaging tumors, its clinical usefulness is constrained by the requirement of manual segmentation which is time-consuming. Deep learning-based approaches have emerged as a promising solution to develop automated biomedical image exploration tools and the UNet architecture is commonly used for segmentation. However, the traditional UNet has limitations in terms of complexity, training, accuracy, and contextual information processing. As a result, the modified UNet architecture, which incorporates residual dense blocks, layered attention, and channel attention modules, in addition to stacked convolution, can effectively capture both coarse and fine feature information. The proposed SLCA UNet approach achieves good performance on the freely accessible Brain Tumor Segmentation (BraTS) dataset, with an average performance of 0.845, 0.845, 0.999, and 8.1 in terms of Dice, Sensitivity, Specificity, and Hausdorff95 for BraTS 2020 dataset, respectively.

Planting a SEED of Vision in Large Language Model

  • paper_url: http://arxiv.org/abs/2307.08041
  • repo_url: https://github.com/ailab-cvc/seed
  • paper_authors: Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
  • for: The paper presents an image tokenizer that enables Large Language Models to SEE and Draw at the same time.
  • methods: The tokenizer produces image tokens with a 1D causal dependency rather than tying them to 2D physical patch positions, and optimizes the tokens during tokenizer training so that they capture high-level semantics.
  • results: With this tokenizer, an off-the-shelf LLM performs both image-to-text and text-to-image generation through efficient LoRA tuning. This version of SEED was trained in 5.7 days using 64 V100 GPUs and 5M publicly available image-text pairs.
    Abstract We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite the limitations, we remain confident in its natural capacity to unify visual and textual representations, facilitating scalable multimodal training with LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.

Multi-Object Discovery by Low-Dimensional Object Motion

  • paper_url: http://arxiv.org/abs/2307.08027
  • repo_url: https://github.com/sadrasafa/multi-object-segmentation
  • paper_authors: Sadra Safadoust, Fatma Güney
  • for: The work aims to resolve the inherent ambiguity of reconstructing motion from a single image, without access to the next frame.
  • methods: Pixel-wise geometry and object motion are modeled: the image is divided into coherently moving regions, and depth is used to construct flow bases that best explain the observed flow in each region.
  • results: The approach achieves state-of-the-art multi-object segmentation results on synthetic and real-world datasets, and the predicted depth maps show reliable monocular depth estimation performance.
    Abstract Recent work in unsupervised multi-object segmentation shows impressive results by predicting motion from a single image despite the inherent ambiguity in predicting motion without the next image. On the other hand, the set of possible motions for an image can be constrained to a low-dimensional space by considering the scene structure and moving objects in it. We propose to model pixel-wise geometry and object motion to remove ambiguity in reconstructing flow from a single image. Specifically, we divide the image into coherently moving regions and use depth to construct flow bases that best explain the observed flow in each region. We achieve state-of-the-art results in unsupervised multi-object segmentation on synthetic and real-world datasets by modeling the scene structure and object motion. Our evaluation of the predicted depth maps shows reliable performance in monocular depth estimation.

Analysing Gender Bias in Text-to-Image Models using Object Detection

  • paper_url: http://arxiv.org/abs/2307.08025
  • repo_url: https://github.com/harveymannering/text-to-image-bias
  • paper_authors: Harvey Mannering
  • for: The work presents a strategy to measure gender bias in text-to-image models.
  • methods: Paired prompts that specify gender and vaguely reference an object (e.g. "a man/woman holding an item") are used to examine whether certain objects are associated with a certain gender.
  • results: In analysing results from Stable Diffusion, male prompts more frequently generated objects such as ties, knives, trucks, baseball bats, and bicycles, while female prompts more frequently generated objects such as handbags, umbrellas, bowls, bottles, and cups.
    Abstract This work presents a novel strategy to measure bias in text-to-image models. Using paired prompts that specify gender and vaguely reference an object (e.g. "a man/woman holding an item") we can examine whether certain objects are associated with a certain gender. In analysing results from Stable Diffusion, we observed that male prompts generated objects such as ties, knives, trucks, baseball bats, and bicycles more frequently. On the other hand, female prompts were more likely to generate objects such as handbags, umbrellas, bowls, bottles, and cups. We hope that the method outlined here will be a useful tool for examining bias in text-to-image models.
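    The measurement procedure lends itself to a short sketch: generate images from each gendered prompt, run an off-the-shelf object detector, and compare per-class detection counts. This is our illustration under stated assumptions (a torchvision COCO detector; the text-to-image generation step is assumed to happen elsewhere), not the paper's code.

```python
# Count COCO objects detected in images generated from gendered prompt pairs.
from collections import Counter
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]             # COCO class names

def count_objects(images, score_thresh=0.7):
    """images: list of float tensors (3, H, W) in [0, 1]. Returns a Counter
    of detected class names above the score threshold."""
    counts = Counter()
    with torch.no_grad():
        for det in detector(images):
            keep = det["scores"] > score_thresh
            counts.update(categories[i] for i in det["labels"][keep].tolist())
    return counts

# images_from(...) is a hypothetical helper wrapping the text-to-image model:
# male_counts   = count_objects(images_from("a man holding an item"))
# female_counts = count_objects(images_from("a woman holding an item"))
```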

Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer

  • paper_url: http://arxiv.org/abs/2307.08015
  • repo_url: None
  • paper_authors: Yujiao Shi, Fei Wu, Akhil Perincherry, Ankit Vora, Hongdong Li
  • for: The goal is to improve the accuracy of camera pose estimation, which is often very coarse due to the limited sampling density of database satellite images.
  • methods: The method estimates the relative rotation and translation between a ground-level image and its matched/retrieved satellite image by (1) a geometry-guided cross-view transformer that maps ground-view observations to an overhead view; (2) a neural pose optimizer with strong global information embedding ability that estimates the relative rotation between the synthesized overhead view and the observed satellite feature maps; and (3) an uncertainty-guided spatial correlation that generates a probability map of vehicle locations, from which the relative translation is determined.
  • results: Experiments on the cross-view KITTI dataset show significant improvements over the state-of-the-art: the likelihood of restricting the lateral vehicle pose to within 1m of its ground-truth value improves from 35.54% to 76.44%, and the likelihood of restricting the vehicle orientation to within 1° improves from 19.64% to 99.10%.
    Abstract Image retrieval-based cross-view localization methods often lead to very coarse camera pose estimation, due to the limited sampling density of the database satellite images. In this paper, we propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image. Our approach designs a geometry-guided cross-view transformer that combines the benefits of conventional geometry and learnable cross-view transformers to map the ground-view observations to an overhead view. Given the synthesized overhead view and observed satellite feature maps, we construct a neural pose optimizer with strong global information embedding ability to estimate the relative rotation between them. After aligning their rotations, we develop an uncertainty-guided spatial correlation to generate a probability map of the vehicle locations, from which the relative translation can be determined. Experimental results demonstrate that our method significantly outperforms the state-of-the-art. Notably, the likelihood of restricting the vehicle lateral pose to be within 1m of its Ground Truth (GT) value on the cross-view KITTI dataset has been improved from $35.54\%$ to $76.44\%$, and the likelihood of restricting the vehicle orientation to be within $1^{\circ}$ of its GT value has been improved from $19.64\%$ to $99.10\%$.

Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks

  • paper_url: http://arxiv.org/abs/2307.08013
  • repo_url: None
  • paper_authors: Haobo Song, Soumajit Majumder, Tao Lin
  • for: This paper aims to revisit the line of implicit models, specifically weight-tied models, and evaluate their effectiveness, stability, and efficiency on vision tasks.
  • methods: The paper uses weight-tied models as the basis for its study, and proposes the use of distinct sparse masks to improve the model capacity.
  • results: The paper finds that weight-tied models are more effective, stable, and efficient on vision tasks compared to DEQ variants, and provides design guidelines for practitioners regarding the selection of depth, width, and sparsity.
    Abstract Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models with elegant solution-finding procedures and constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit the line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants. Through the lens of these simple-yet-clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve the model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.

Householder Projector for Unsupervised Latent Semantics Discovery

  • paper_url: http://arxiv.org/abs/2307.08012
  • repo_url: https://github.com/kingjamessong/householdergan
  • paper_authors: Yue Song, Jichao Zhang, Nicu Sebe, Wei Wang
  • for: The paper explores the structured latent space of generative adversarial networks (GANs) to better understand and control the image generation process.
  • methods: The Householder Projector, a low-rank orthogonal matrix representation based on Householder transformations, parameterizes the projection matrix that maps latent codes to features, so that traversing the latent code along its eigenvectors discovers disentangled semantic attributes.
  • results: Integrated into pre-trained StyleGAN2/StyleGAN3 and evaluated on several benchmarks, the projector helps StyleGANs discover more disentangled and precise semantic attributes within only 1% of the original training steps for fine-tuning, without sacrificing image fidelity.
    Abstract Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only $1\%$ of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity.
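    The building block named in the title, an orthogonal matrix parameterized by Householder transformations, is standard linear algebra and can be sketched directly. A minimal sketch, assuming the projector is the product of k learnable reflections H_i = I - 2 v_i v_i^T / ||v_i||^2 (this mirrors the general construction, not the authors' exact implementation):

```python
# Orthogonal matrix as a product of Householder reflections (sketch).
import torch

def householder_orthogonal(vs):
    """vs: (k, d) learnable vectors. Returns the (d, d) orthogonal product
    H_1 H_2 ... H_k of the corresponding Householder reflections."""
    d = vs.shape[1]
    Q = torch.eye(d)
    for v in vs:
        v = v / v.norm()
        Q = Q - 2.0 * (Q @ v).unsqueeze(1) * v.unsqueeze(0)  # Q @ (I - 2vv^T)
    return Q

Q = householder_orthogonal(torch.randn(8, 16))   # 8 reflections in 16 dims
assert torch.allclose(Q.T @ Q, torch.eye(16), atol=1e-5)  # orthogonality holds
```

Each factor is exactly orthogonal by construction, and using few reflections (k much smaller than d) gives the low-rank representation the abstract refers to.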

LUCYD: A Feature-Driven Richardson-Lucy Deconvolution Network

  • paper_url: http://arxiv.org/abs/2307.07998
  • repo_url: https://github.com/ctom2/lucyd-deconvolution
  • paper_authors: Tomáš Chobola, Gesine Müller, Veit Dausmann, Anton Theileis, Jan Taucher, Jan Huisken, Tingying Peng
  • for: The goal is to improve the quality and interpretability of volumetric microscopy images.
  • methods: LUCYD combines the Richardson-Lucy deconvolution formula with the fusion of deep features obtained by a fully convolutional network, enhancing restored image quality while reducing computational cost.
  • results: LUCYD outperforms state-of-the-art methods on both synthetic and real microscopy images and handles various microscopy modalities and imaging conditions, as shown on volumetric widefield and light-sheet microscopy datasets.
    Abstract The process of acquiring microscopic images in life sciences often results in image degradation and corruption, characterised by the presence of noise and blur, which poses significant challenges in accurately analysing and interpreting the obtained data. This paper proposes LUCYD, a novel method for the restoration of volumetric microscopy images that combines the Richardson-Lucy deconvolution formula and the fusion of deep features obtained by a fully convolutional network. By integrating the image formation process into a feature-driven restoration model, the proposed approach aims to enhance the quality of the restored images whilst reducing computational costs and maintaining a high degree of interpretability. Our results demonstrate that LUCYD outperforms the state-of-the-art methods in both synthetic and real microscopy images, achieving superior performance in terms of image quality and generalisability. We show that the model can handle various microscopy modalities and different imaging conditions by evaluating it on two different microscopy datasets, including volumetric widefield and light-sheet microscopy. Our experiments indicate that LUCYD can significantly improve resolution, contrast, and overall quality of microscopy images. Therefore, it can be a valuable tool for microscopy image restoration and can facilitate further research in various microscopy applications. We made the source code for the model accessible under https://github.com/ctom2/lucyd-deconvolution.
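    For reference, the classical Richardson-Lucy iteration that LUCYD builds on has a compact form. A textbook sketch of the plain iteration only, without the deep-feature fusion the paper adds:

```python
# Classical Richardson-Lucy deconvolution (textbook iteration).
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(y, psf, n_iter=30, eps=1e-12):
    """y: blurred, noisy image; psf: point spread function (sums to 1)."""
    x = np.full(y.shape, y.mean())         # flat initial estimate
    psf_mirror = psf[::-1, ::-1]           # adjoint of the blur operator
    for _ in range(n_iter):
        ratio = y / (fftconvolve(x, psf, mode="same") + eps)
        x = x * fftconvolve(ratio, psf_mirror, mode="same")
    return x
```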

Enforcing Topological Interaction between Implicit Surfaces via Uniform Sampling

  • paper_url: http://arxiv.org/abs/2307.08716
  • repo_url: None
  • paper_authors: Hieu Le, Nicolas Talabot, Jiancheng Yang, Pascal Fua
  • for: The paper proposes a method to refine 3D object representations so that their surfaces adhere to a topological prior on their interactions, such as containment, contact, or maintaining fixed distances.
  • methods: The key observation is that object interaction can be observed via a stochastic approximation: the statistics of signed distances between a large number of random points and the object surfaces reflect the interaction, so the surfaces can be refined indirectly by choosing a set of points as anchors.
  • results: Experiments show accurate 3D reconstruction of human hearts with proper topological connectivity between components, and the method can also simulate various ways a hand can interact with an arbitrary object.
    Abstract Objects interact with each other in various ways, including containment, contact, or maintaining fixed distances. Ensuring these topological interactions is crucial for accurate modeling in many scenarios. In this paper, we propose a novel method to refine 3D object representations, ensuring that their surfaces adhere to a topological prior. Our key observation is that the object interaction can be observed via a stochastic approximation method: the statistic of signed distances between a large number of random points to the object surfaces reflect the interaction between them. Thus, the object interaction can be indirectly manipulated by using choosing a set of points as anchors to refine the object surfaces. In particular, we show that our method can be used to enforce two objects to have a specific contact ratio while having no surface intersection. The conducted experiments show that our proposed method enables accurate 3D reconstruction of human hearts, ensuring proper topological connectivity between components. Further, we show that our proposed method can be used to simulate various ways a hand can interact with an arbitrary object.
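    The stochastic approximation at the heart of the method is easy to illustrate: draw uniform random points and read off the two signed distance fields. A minimal sketch of the statistic (our reading of the abstract; the thresholds and sampling range are illustrative):

```python
# Estimate intersection and contact between two implicit surfaces by
# sampling uniform random points and inspecting their signed distances.
import numpy as np

def interaction_stats(sdf_a, sdf_b, n=100_000, lo=-1.0, hi=1.0):
    """sdf_a, sdf_b: callables mapping (n, 3) points to signed distances
    (negative inside). Returns fractions of points inside both shapes
    (overlap) and near both surfaces (contact)."""
    pts = np.random.uniform(lo, hi, size=(n, 3))
    da, db = sdf_a(pts), sdf_b(pts)
    overlap = np.mean((da < 0) & (db < 0))
    contact = np.mean((np.abs(da) < 1e-2) & (np.abs(db) < 1e-2))
    return overlap, contact

# Two unit spheres with centres 1.5 apart overlap slightly:
sphere = lambda c: (lambda p: np.linalg.norm(p - c, axis=1) - 1.0)
print(interaction_stats(sphere(np.zeros(3)), sphere(np.array([1.5, 0.0, 0.0]))))
```

Driving the overlap fraction to zero while holding the contact fraction at a target value is one way to read the "specific contact ratio with no surface intersection" constraint mentioned in the abstract.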

Integrating Human Parsing and Pose Network for Human Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07977
  • repo_url: https://github.com/liujf69/ipp-net-parsing
  • paper_authors: Runwei Ding, Yuhang Wen, Jinfu Liu, Nan Dai, Fanyang Meng, Mengyuan Liu
  • for: This work aims to improve the accuracy of human action recognition by using human parsing feature maps together with pose data as input modalities.
  • methods: The Integrating Human Parsing and Pose Network (IPP-Net) takes a dual-branch approach: the pose branch feeds compact skeletal representations of different modalities into a graph convolutional network to model pose features, while the parsing branch extracts multi-frame body-part parsing features with a human detector and parser, learned by a convolutional backbone; a late ensemble of the two branches gives the final predictions.
  • results: On the NTU RGB+D and NTU RGB+D 120 benchmarks, IPP-Net consistently achieves higher accuracy than existing action recognition methods. Code is available at https://github.com/liujf69/IPP-Net-Parsing.
    Abstract Human skeletons and RGB sequences are both widely-adopted input modalities for human action recognition. However, skeletons lack appearance features and color data suffer large amount of irrelevant depiction. To address this, we introduce human parsing feature map as a novel modality, since it can selectively retain spatiotemporal features of the body parts, while filtering out noises regarding outfits, backgrounds, etc. We propose an Integrating Human Parsing and Pose Network (IPP-Net) for action recognition, which is the first to leverage both skeletons and human parsing feature maps in dual-branch approach. The human pose branch feeds compact skeletal representations of different modalities in graph convolutional network to model pose features. In human parsing branch, multi-frame body-part parsing features are extracted with human detector and parser, which is later learnt using a convolutional backbone. A late ensemble of two branches is adopted to get final predictions, considering both robust keypoints and rich semantic body-part features. Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed IPP-Net, which outperforms the existing action recognition methods. Our code is publicly available at https://github.com/liujf69/IPP-Net-Parsing .

HRHD-HK: A benchmark dataset of high-rise and high-density urban scenes for 3D semantic segmentation of photogrammetric point clouds

  • paper_url: http://arxiv.org/abs/2307.07976
  • repo_url: https://github.com/luzaijiaoxial/hrhd-hk
  • paper_authors: Maosu Li, Yijie Wu, Anthony G. O. Yeh, Fan Xue
  • for: The paper aims to assess existing 3D semantic segmentation methods quantitatively across diversified real-world urban scenes, including the high-rise, high-density areas that existing benchmarks do not cover.
  • methods: Eight popular semantic segmentation methods are comprehensively evaluated on the proposed HRHD-HK benchmark dataset.
  • results: Experimental results confirm plenty of room for enhancing current 3D semantic segmentation of point clouds in high-rise, high-density urban areas, especially for city objects with small volumes.
    Abstract Many existing 3D semantic segmentation methods, deep learning in computer vision notably, claimed to achieve desired results on urban point clouds, in which the city objects are too many and diverse for people to judge qualitatively. Thus, it is significant to assess these methods quantitatively in diversified real-world urban scenes, encompassing high-rise, low-rise, high-density, and low-density urban areas. However, existing public benchmark datasets primarily represent low-rise scenes from European cities and cannot assess the methods comprehensively. This paper presents a benchmark dataset of high-rise urban point clouds, namely High-Rise, High-Density urban scenes of Hong Kong (HRHD-HK), which has been vacant for a long time. HRHD-HK arranged in 150 tiles contains 273 million colorful photogrammetric 3D points from diverse urban settings. The semantic labels of HRHD-HK include building, vegetation, road, waterbody, facility, terrain, and vehicle. To the best of our knowledge, HRHD-HK is the first photogrammetric dataset that focuses on HRHD urban areas. This paper also comprehensively evaluates eight popular semantic segmentation methods on the HRHD-HK dataset. Experimental results confirmed plenty of room for enhancing the current 3D semantic segmentation of point clouds, especially for city objects with small volumes. Our dataset is publicly available at: https://github.com/LuZaiJiaoXiaL/HRHD-HK.

Towards Viewpoint-Invariant Visual Recognition via Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.10235
  • repo_url: None
  • paper_authors: Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, Xingxing Wei
  • for: The goal is to improve the viewpoint invariance of image classifiers so that they remain accurate under 3D viewpoint changes.
  • methods: Viewpoint-Invariant Adversarial Training (VIAT) regards viewpoint transformation as an attack and is formulated as a minimax optimization problem: the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on a new attack, GMVFool, while the outer minimization trains a viewpoint-invariant classifier against the worst-case adversarial viewpoint distributions; a distribution sharing strategy exploits the transferability of adversarial viewpoints across objects.
  • results: Experiments validate that VIAT effectively improves the viewpoint robustness of various image classifiers, based on the diversity of adversarial viewpoints generated by GMVFool.
    Abstract Visual recognition models are not invariant to viewpoint changes in the 3D world, as different viewing directions can dramatically affect the predictions given the same object. Although many efforts have been devoted to making neural networks invariant to 2D image translations and rotations, viewpoint invariance is rarely investigated. As most models process images in the perspective view, it is challenging to impose invariance to 3D viewpoint changes based only on 2D inputs. Motivated by the success of adversarial training in promoting model robustness, we propose Viewpoint-Invariant Adversarial Training (VIAT) to improve viewpoint robustness of common image classifiers. By regarding viewpoint transformation as an attack, VIAT is formulated as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on a new attack GMVFool, while the outer minimization trains a viewpoint-invariant classifier by minimizing the expected loss over the worst-case adversarial viewpoint distributions. To further improve the generalization performance, a distribution sharing strategy is introduced leveraging the transferability of adversarial viewpoints across objects. Experiments validate the effectiveness of VIAT in improving the viewpoint robustness of various image classifiers based on the diversity of adversarial viewpoints generated by GMVFool.
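    A hedged sketch of the minimax objective described above, in our own notation (the paper's exact formulation may differ):

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\,
\max_{\phi}\; \mathbb{E}_{v\sim p_{\phi}(v)}\,
\mathcal{L}\big(f_{\theta}(R(x, v)),\, y\big)
```

Here p_phi is the Gaussian mixture over viewpoints v learned by the inner maximization (via GMVFool), R(x, v) renders the object under viewpoint v, and the outer minimization trains the classifier f_theta against the worst-case viewpoint distribution.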

Dual-level Interaction for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.07972
  • repo_url: https://github.com/rainjamesy/dida
  • paper_authors: Dongyu Yao, Boheng Li
  • for: The paper targets domain-adaptive semantic segmentation and proposes a dual-level interaction method (DIDA) to address erroneous pseudo-labels near the boundaries of the semantic classifier.
  • methods: Different augmented views of the same pixel are encouraged to have not only similar class predictions (semantic level) but also akin similarity relationships with respect to other pixels (instance level); a labeled instance bank with dynamic updating strategies selectively stores informative instance features, and cross-level interaction with scattering and gathering techniques regenerates more reliable pseudo-labels.
  • results: The method outperforms the state-of-the-art by a notable margin, especially on confusing and long-tailed classes.
    Abstract Self-training approach recently secures its position in domain adaptive semantic segmentation, where a model is trained with target domain pseudo-labels. Current advances have mitigated noisy pseudo-labels resulting from the domain gap. However, they still struggle with erroneous pseudo-labels near the boundaries of the semantic classifier. In this paper, we tackle this issue by proposing a dual-level interaction for domain adaptation (DIDA) in semantic segmentation. Explicitly, we encourage the different augmented views of the same pixel to have not only similar class prediction (semantic-level) but also akin similarity relationship with respect to other pixels (instance-level). As it's impossible to keep features of all pixel instances for a dataset, we, therefore, maintain a labeled instance bank with dynamic updating strategies to selectively store the informative features of instances. Further, DIDA performs cross-level interaction with scattering and gathering techniques to regenerate more reliable pseudo-labels. Our method outperforms the state-of-the-art by a notable margin, especially on confusing and long-tailed classes. Code is available at \href{https://github.com/RainJamesY/DIDA}

EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes

  • paper_url: http://arxiv.org/abs/2307.07961
  • repo_url: None
  • paper_authors: Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, Hui Huang
  • for: This paper introduces EmoSet, a large-scale visual emotion dataset with rich annotations, to support research in visual emotion analysis and understanding.
  • methods: The dataset is constructed by collecting images from social networks and artistic sources; of its 3.3 million images, 118,102 are carefully labeled with an emotion category and a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action.
  • results: EmoSet is five times larger than the largest existing dataset and is well balanced between different emotion categories, providing a valuable resource for researchers in affective computing.
    Abstract Visual Emotion Analysis (VEA) aims at predicting people's emotional responses to visual stimuli. This is a promising, yet challenging, task in affective computing, which has drawn increasing attention in recent years. Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction. In this work, we introduce EmoSet, the first large-scale visual emotion dataset annotated with rich attributes, which is superior to existing datasets in four aspects: scale, annotation richness, diversity, and data balance. EmoSet comprises 3.3 million images in total, with 118,102 of these images carefully labeled by human annotators, making it five times larger than the largest existing dataset. EmoSet includes images from social networks, as well as artistic images, and it is well balanced between different emotion categories. Motivated by psychological studies, in addition to emotion category, each image is also annotated with a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action, which can help understand visual emotions in a precise and interpretable way. The relevance of these emotion attributes is validated by analyzing the correlations between them and visual emotion, as well as by designing an attribute module to help visual emotion recognition. We believe EmoSet will bring some key insights and encourage further research in visual emotion analysis and understanding. Project page: https://vcc.tech/EmoSet.

Accurate 3D Prediction of Missing Teeth in Diverse Patterns for Precise Dental Implant Planning

  • paper_url: http://arxiv.org/abs/2307.07953
  • repo_url: None
  • paper_authors: Lei Ma, Peng Xue, Yuning Gu, Yue Zhao, Min Zhu, Zhongxiang Ding, Dinggang Shen
  • for: This study presents a framework for accurately predicting missing teeth in diverse patterns, facilitating precise digital dental implant planning and placement.
  • methods: Point-to-point correspondence is estimated among dental mesh models reconstructed from CBCT images of healthy subjects, and tooth dictionaries encoding position and shape information are constructed for each tooth type; sparse coefficients learned by sparsely representing the teeth adjacent to the missing ones are then applied to the missing teeth's dictionaries to generate accurate predictions of their positions and shapes.
  • results: The framework achieves an average prediction error of 1.04mm for a single missing tooth and 1.33mm for 14 missing teeth, demonstrating its capability of accurately predicting missing teeth in various patterns.
    Abstract In recent years, the demand for dental implants has surged, driven by their high success rates and esthetic advantages. However, accurate prediction of missing teeth for precise digital implant planning remains a challenge due to the intricate nature of dental structures and the variability in tooth loss patterns. This study presents a novel framework for accurate prediction of missing teeth in different patterns, facilitating digital implant planning. The proposed framework begins by estimating point-to-point correspondence among a dataset of dental mesh models reconstructed from CBCT images of healthy subjects. Subsequently, tooth dictionaries are constructed for each tooth type, encoding their position and shape information based on the established point-to-point correspondence. To predict missing teeth in a given dental mesh model, sparse coefficients are learned by sparsely representing adjacent teeth of the missing teeth using the corresponding tooth dictionaries. These coefficients are then applied to the dictionaries of the missing teeth to generate accurate predictions of their positions and shapes. The evaluation results on real subjects shows that our proposed framework achieves an average prediction error of 1.04mm for predictions of single missing tooth and an average prediction error of 1.33mm for the prediction of 14 missing teeth, which demonstrates its capability of accurately predicting missing teeth in various patterns. By accurately predicting missing teeth, dental professionals can improve the planning and placement of dental implants, leading to better esthetic and functional outcomes for patients undergoing dental implant procedures.
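
The dictionary-based prediction step lends itself to a compact illustration. Below is a minimal sketch, assuming each tooth dictionary is a matrix whose columns are vectorized example shapes in point-to-point correspondence; the sizes, random data, and the use of orthogonal matching pursuit for the sparse coding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_points, n_atoms = 300, 40                             # hypothetical sizes
D_adjacent = rng.normal(size=(3 * n_points, n_atoms))   # dictionary of an adjacent tooth
D_missing = rng.normal(size=(3 * n_points, n_atoms))    # dictionary of the missing tooth

# Observed adjacent tooth of the query patient (flattened xyz coordinates).
x_adjacent = D_adjacent @ rng.normal(size=n_atoms)

# Learn sparse coefficients that represent the adjacent tooth with its dictionary.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=8)
omp.fit(D_adjacent, x_adjacent)
coeffs = omp.coef_

# Reuse the same coefficients on the missing tooth's dictionary to predict
# its position and shape.
x_missing_pred = D_missing @ coeffs
print(x_missing_pred.reshape(n_points, 3).shape)        # (300, 3) predicted points
```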

Accelerating Distributed ML Training via Selective Synchronization

  • paper_url: http://arxiv.org/abs/2307.07950
  • repo_url: None
  • paper_authors: Sahil Tyagi, Martin Swany
  • for: This paper proposes a practical, low-overhead training method for deep neural networks (DNNs) that improves the efficiency of distributed training.
  • methods: The method (1) decides precisely at each step whether aggregation is necessary, avoiding the overhead of communication-heavy aggregation when it is not; (2) adapts the aggregation frequency across different training scenarios to achieve the best trade-off in training time; and (3) introduces several optimizations to improve convergence in semi-synchronous training.
  • results: \texttt{SelSync} achieves the same or better accuracy than BSP training while reducing training time by up to 14$\times$.
    Abstract In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not scale out linearly due to the high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
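
The step-level decision can be sketched as follows, assuming \texttt{torch.distributed} is already initialized across workers. The significance test shown here, the update-to-parameter norm ratio against a fixed threshold, is an illustrative stand-in for the paper's criterion rather than its exact rule.

```python
import torch
import torch.distributed as dist

def selsync_step(model, optimizer, loss, threshold=1e-3):
    optimizer.zero_grad()
    loss.backward()
    # Gauge how significant this step's update is relative to the weights.
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum()
                               for p in model.parameters() if p.grad is not None))
    param_norm = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))
    if grad_norm / param_norm > threshold:
        # Significant update: synchronize gradients across workers (BSP-style).
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= dist.get_world_size()
    # Otherwise skip communication and apply the purely local update.
    optimizer.step()
```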

Language Conditioned Traffic Generation

  • paper_url: http://arxiv.org/abs/2307.07947
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, Philipp Kraehenbuehl
  • for: This paper addresses a central problem in modern self-driving development: creating realistic, scalable, and interesting traffic scenarios for simulation.
  • methods: The paper uses language as a source of supervision for dynamic traffic scene generation, combining a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps and produces an initial traffic distribution as well as the dynamics of each vehicle.
  • results: Compared to prior work, the LCTGen model achieves higher realism and fidelity in both unconditional and conditional traffic scene generation.
    Abstract Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.

Surface Geometry Processing: An Efficient Normal-based Detail Representation

  • paper_url: http://arxiv.org/abs/2307.07945
  • repo_url: None
  • paper_authors: Wuyuan Xie, Miaohui Wang, Di Lin, Boxin Shi, Jianmin Jiang
  • for: This paper proposes an efficient surface detail processing framework for high-resolution 3D vision applications, where traditional methods incur substantial memory and computation costs.
  • methods: The paper introduces a new surface detail processing method in the 2D normal domain, extracting a novel normal feature representation as the carrier of micro-geometry structures. Three important properties of this representation are established both theoretically and empirically: detail separability, detail transferability, and detail idempotence. Three schemes are further designed for geometric surface detail processing: geometric texture synthesis, geometry detail transfer, and 3D surface super-resolution.
  • results: Theoretical analysis and experiments on the latest benchmark dataset verify the effectiveness and versatility of the proposed normal-based representation: for the same input surface vertices, the method takes only 6.5% of the memory cost and 14.0% of the running time of existing competing algorithms.
    Abstract With the rapid development of high-resolution 3D vision applications, the traditional way of manipulating surface detail requires considerable memory and computing time. To address these problems, we introduce an efficient surface detail processing framework in the 2D normal domain, which extracts new normal feature representations as the carrier of micro geometry structures that are illustrated both theoretically and empirically in this article. Compared with the existing state of the art, we verify and demonstrate that the proposed normal-based representation has three important properties, including detail separability, detail transferability and detail idempotence. Finally, three new schemes are further designed for geometric surface detail processing applications, including geometric texture synthesis, geometry detail transfer, and 3D surface super-resolution. Theoretical analysis and experimental results on the latest benchmark dataset verify the effectiveness and versatility of our normal-based representation, which accepts 30 times the input surface vertices while taking only 6.5% of the memory cost and 14.0% of the running time of existing competing algorithms.
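
A base/detail split in the normal domain can be illustrated compactly. Below is a minimal sketch, assuming normals are stored as an (H, W, 3) unit-vector map; the smoothing-based decomposition is an illustrative stand-in for the paper's normal feature representation, not its actual construction.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_detail(normals, sigma=3.0):
    # Smooth each channel to get the low-frequency "base" normals.
    base = np.stack([gaussian_filter(normals[..., c], sigma) for c in range(3)], axis=-1)
    base /= np.linalg.norm(base, axis=-1, keepdims=True) + 1e-8
    detail = normals - base            # micro-geometry residual
    return base, detail

def transfer_detail(target_normals, detail, sigma=3.0):
    base, _ = split_detail(target_normals, sigma)
    out = base + detail                # graft source detail onto target base
    return out / (np.linalg.norm(out, axis=-1, keepdims=True) + 1e-8)

# Toy usage on random unit normal maps.
rng = np.random.default_rng(0)
src = rng.normal(size=(64, 64, 3)); src /= np.linalg.norm(src, axis=-1, keepdims=True)
tgt = rng.normal(size=(64, 64, 3)); tgt /= np.linalg.norm(tgt, axis=-1, keepdims=True)
_, src_detail = split_detail(src)
result = transfer_detail(tgt, src_detail)
print(result.shape)                    # (64, 64, 3)
```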

CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion

  • paper_url: http://arxiv.org/abs/2307.07938
  • repo_url: None
  • paper_authors: Haotian Dong, Enhui Ma, Lubo Wang, Miaohui Wang, Wuyuan Xie, Qing Guo, Ping Li, Lingyu Liang, Kairui Yang, Di Lin
  • for: CVSformer is proposed to improve semantic scene completion by learning cross-view object relationships.
  • methods: CVSformer consists of Multi-View Feature Synthesis and a Cross-View Transformer for learning cross-view object relationships.
  • results: CVSformer achieves state-of-the-art results on public datasets.
    Abstract Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between the objects in the 3D scene for reasoning the occluded objects. The popular SSC methods voxelize the 3D objects, allowing the deep 3D convolutional network (3D CNN) to learn the object relationships from the complex scenes. However, the current networks lack the controllable kernels to model the object relationship across multiple views, where appropriate views provide the relevant information for suggesting the existence of the occluded objects. In this paper, we propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for each voxel. In the cross-view transformer, we employ the cross-view fusion to comprehensively learn the cross-view relationships, which form useful information for enhancing the features of individual views. We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels. We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
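
The multi-view feature synthesis idea of applying differently rotated 3D kernels can be sketched as follows. Rotating a shared kernel with `torch.rot90` is an illustrative approximation of the paper's controllable rotated kernels; shapes and the number of views are assumptions.

```python
import torch
import torch.nn.functional as F

voxels = torch.randn(1, 8, 32, 32, 32)      # (batch, channels, D, H, W)
kernel = torch.randn(16, 8, 3, 3, 3)        # one learnable 3D kernel bank

multi_view_feats = []
for k in range(4):                          # four views: 0/90/180/270 degrees
    rot_kernel = torch.rot90(kernel, k, dims=(3, 4))  # rotate in the H-W plane
    multi_view_feats.append(F.conv3d(voxels, rot_kernel, padding=1))

fused = torch.stack(multi_view_feats, dim=0)  # (views, B, 16, 32, 32, 32)
print(fused.shape)
```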

S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality

  • paper_url: http://arxiv.org/abs/2307.07935
  • repo_url: None
  • paper_authors: Jinlong Li, Runsheng Xu, Xinyu Liu, Baolu Li, Qin Zou, Jiaqi Ma, Hongkai Yu
  • for: This work addresses the degraded real-world performance of existing multi-agent cooperative perception algorithms: models trained on simulated data lose perception accuracy when deployed on real data because of the domain gap between the two.
  • methods: The paper proposes the first simulation-to-reality transfer learning framework for multi-agent cooperative perception, built on a novel Vision Transformer (ViT) and accounting for two kinds of domain gap, the Implementation Gap and the Feature Gap. An uncertainty-aware vision transformer is proposed to relieve the Implementation Gap, and the Feature Gap is reduced through agent-based feature adaptation with inter-agent and ego-agent discriminators.
  • results: Extensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real show that the proposed S2R-ViT effectively bridges the gap from simulation to reality and significantly outperforms other methods for point cloud-based 3D object detection.
    Abstract Due to the lack of real multi-agent data and time-consuming of labeling, existing multi-agent cooperative perception algorithms usually select the simulated sensor data for training and validating. However, the perception performance is degraded when these simulation-trained models are deployed to the real world, due to the significant domain gap between the simulated and real data. In this paper, we propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named as S2R-ViT, which considers both the Implementation Gap and Feature Gap between simulated and real data. We investigate the effects of these two types of domain gaps and propose a novel uncertainty-aware vision transformer to effectively relief the Implementation Gap and an agent-based feature adaptation module with inter-agent and ego-agent discriminators to reduce the Feature Gap. Our intensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality and outperform other methods significantly for point cloud-based 3D object detection.

Contrastive Multi-Task Dense Prediction

  • paper_url: http://arxiv.org/abs/2307.07934
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Siwei Yang, Hanrong Ye, Dan Xu
  • for: This paper addresses the problem of multi-task dense prediction, aiming to achieve simultaneous learning and inference on multiple dense prediction tasks in a single framework.
  • methods: The paper introduces feature-wise contrastive consistency to model cross-task interactions, which effectively boosts representation learning for different sub-tasks without extra expensive distillation modules.
  • results: The proposed multi-task contrastive learning approach achieves superior performance on two challenging datasets (NYUD-v2 and Pascal-Context), establishing new state-of-the-art results for dense predictions.
    Abstract This paper targets the problem of multi-task dense prediction which aims to achieve simultaneous learning and inference on a bunch of multiple dense prediction tasks in a single framework. A core objective in design is how to effectively model cross-task interactions to achieve a comprehensive improvement on different tasks based on their inherent complementarity and consistency. Existing works typically design extra expensive distillation modules to perform explicit interaction computations among different task-specific features in both training and inference, bringing difficulty in adaptation for different task sets, and reducing efficiency due to clearly increased size of multi-task models. In contrast, we introduce feature-wise contrastive consistency into modeling the cross-task interactions for multi-task dense prediction. We propose a novel multi-task contrastive regularization method based on the consistency to effectively boost the representation learning of the different sub-tasks, which can also be easily generalized to different multi-task dense prediction frameworks, and costs no additional computation in the inference. Extensive experiments on two challenging datasets (i.e. NYUD-v2 and Pascal-Context) clearly demonstrate the superiority of the proposed multi-task contrastive learning approach for dense predictions, establishing new state-of-the-art performances.
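
A feature-wise cross-task contrastive consistency term can be sketched as follows, assuming two task-specific feature maps sampled at the same spatial locations. The InfoNCE-style pairing of same-location features across tasks is an illustrative reading of the paper's regularization, not its exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_task_contrastive(feat_a, feat_b, temperature=0.1):
    # feat_*: (N, C) features sampled at the same N spatial locations.
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature        # (N, N) cross-task similarities
    labels = torch.arange(a.size(0))        # same location = positive pair
    return F.cross_entropy(logits, labels)

feat_seg = torch.randn(128, 256)   # e.g., segmentation-head features
feat_dep = torch.randn(128, 256)   # e.g., depth-head features
print(cross_task_contrastive(feat_seg, feat_dep).item())
```

Because the term only regularizes training-time features, it adds no computation at inference, consistent with the paper's claim.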

Holistic Prototype Attention Network for Few-Shot VOS

  • paper_url: http://arxiv.org/abs/2307.07933
  • repo_url: https://github.com/nust-machine-intelligence-laboratory/hpan
  • paper_authors: Yin Tang, Tao Chen, Xiruo Jiang, Yazhou Yao, Guo-Sen Xie, Heng-Tao Shen
  • for: To improve the segmentation accuracy of few-shot video object segmentation (FSVOS), which segments dynamic objects of unseen classes from only a small set of annotated support images.
  • methods: The paper proposes a holistic prototype attention network (HPAN), comprising a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), which transfer informative knowledge from seen to unseen classes to improve segmentation performance.
  • results: Extensive experiments on YouTube-FSVOS demonstrate the effectiveness and superiority of the proposed HPAN method.
    Abstract Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images that contain pixel-level object annotations. Existing methods have demonstrated that the domain agent-based attention mechanism is effective in FSVOS by learning the correlation between support images and query frames. However, the agent frame contains redundant pixel information and background noise, resulting in inferior segmentation performance. Moreover, existing methods tend to ignore inter-frame correlations in query videos. To alleviate the above dilemma, we propose a holistic prototype attention network (HPAN) for advancing FSVOS. Specifically, HPAN introduces a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), transferring informative knowledge from seen to unseen classes. PGAM generates local prototypes from all foreground features and then utilizes their internal correlations to enhance the representation of the holistic prototypes. BPAM exploits the holistic information from support images and video frames by fusing co-attention and self-attention to achieve support-query semantic consistency and inner-frame temporal consistency. Extensive experiments on YouTube-FSVOS have been provided to demonstrate the effectiveness and superiority of our proposed HPAN method.
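
The prototype-generation step common to this line of work can be sketched as follows: local foreground prototypes are built from masked support features. The simple k-means-style clustering shown is an illustrative stand-in for PGAM's prototype construction.

```python
import torch

def local_prototypes(features, mask, num_prototypes=5):
    # features: (C, H, W) support features; mask: (H, W) binary foreground.
    fg = features.permute(1, 2, 0)[mask.bool()]               # (N_fg, C)
    protos = fg[torch.randperm(fg.size(0))[:num_prototypes]]  # random init
    for _ in range(10):                                       # k-means refinement
        assign = torch.cdist(fg, protos).argmin(dim=1)
        protos = torch.stack([fg[assign == k].mean(0) if (assign == k).any() else protos[k]
                              for k in range(num_prototypes)])
    return protos                                             # (K, C) local prototypes

feats = torch.randn(64, 60, 60)
mask = (torch.rand(60, 60) > 0.7).float()
print(local_prototypes(feats, mask).shape)  # (5, 64)
```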

DocTr: Document Transformer for Structured Information Extraction in Documents

  • paper_url: http://arxiv.org/abs/2307.07929
  • repo_url: None
  • paper_authors: Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan
  • for: The paper proposes a new formulation for structured information extraction (SIE) from visually rich documents.
  • methods: Borrowing from anchor-based object detectors in vision, an entity is represented as an anchor word and a bounding box, and entity linking is represented as the association between anchor words.
  • results: Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and a simple language pre-training strategy helps the model learn entity detection.
    Abstract We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.

Reinforced Disentanglement for Face Swapping without Skip Connection

  • paper_url: http://arxiv.org/abs/2307.07928
  • repo_url: https://github.com/alaist/RD-FS
  • paper_authors: Xiaohang Ren, Xingyu Chen, Pengfei Yao, Heung-Yeung Shum, Baoyuan Wang
  • for: To resolve two problems of SOTA face swap models: leakage of the target identity (i.e., shape) and failure to fully preserve non-identity target attributes (i.e., background, hair).
  • methods: The paper introduces a new face swap framework, 'WSC-swap', that eliminates skip connections and uses two target encoders to respectively capture pixel-level non-facial region attributes and semantic non-identity attributes in the face region. It additionally employs an identity removal loss via adversarial training and a non-identity preservation loss via prior 3DMM models.
  • results: Extensive experiments on FaceForensics++ and CelebA-HQ show results that significantly outperform previous work on a rich set of metrics, including a new metric for measuring identity consistency that had previously been neglected.
    Abstract The SOTA face swap models still suffer the problem of either target identity (i.e., shape) being leaked or the target non-identity attributes (i.e., background, hair) failing to be fully preserved in the final results. We show that this insufficient disentanglement is caused by two flawed designs that were commonly adopted in prior models: (1) counting on only one compressed encoder to represent both the semantic-level non-identity facial attributes(i.e., pose) and the pixel-level non-facial region details, which is contradictory to satisfy at the same time; (2) highly relying on long skip-connections between the encoder and the final generator, leaking a certain amount of target face identity into the result. To fix them, we introduce a new face swap framework called 'WSC-swap' that gets rid of skip connections and uses two target encoders to respectively capture the pixel-level non-facial region attributes and the semantic non-identity attributes in the face region. To further reinforce the disentanglement learning for the target encoder, we employ both identity removal loss via adversarial training (i.e., GAN) and the non-identity preservation loss via prior 3DMM models like [11]. Extensive experiments on both FaceForensics++ and CelebA-HQ show that our results significantly outperform previous works on a rich set of metrics, including one novel metric for measuring identity consistency that was completely neglected before.

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2307.10233
  • repo_url: None
  • paper_authors: Yifei Shi, Junhua Xi, Dewen Hu, Zhiping Cai, Kai Xu
  • for: This paper proposes a learning-based multi-view stereo (MVS) method that improves both the accuracy and the efficiency of multi-view depth estimation.
  • methods: The method directly optimizes the depth value along each camera ray, mimicking the range finding of a laser scanner. Concretely, it performs sequential prediction of a 1D implicit field along each ray based on transformer features, which essentially learns the epipolar line search of traditional multi-view stereo.
  • results: The method ranks top over all previous learning-based methods on the DTU and Tanks & Temples datasets, with an overall reconstruction score of 0.33mm and an F-score of 59.48%, producing high-quality depth estimation and point cloud reconstruction in challenging scenes. The extended RayMVSNet++ enhances per-ray contextual feature aggregation with an attentional gating unit that selects semantically relevant neighboring rays within the local frustum, and achieves state-of-the-art performance on the ScanNet dataset.
    Abstract Learning-based multi-view stereo (MVS) has by far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNN, the resolution of output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization which is much more light-weight than full cost volume optimization. In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We devise a multi-task learning for better optimization convergence and depth accuracy. We found the monotonicity property of the SDFs along each ray greatly benefits the depth estimation. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce high-quality depth estimation and point cloud reconstruction in challenging scenarios such as objects/scenes with non-textured surface, severe occlusion, and highly varying depth range. Further, we propose RayMVSNet++ to enhance contextual feature aggregation for each ray through designing an attentional gating unit to select semantically relevant neighboring rays within the local frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces accurate results on the two subsets of textureless regions and large depth variation.
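
The core idea of recovering scene depth as the zero-crossing of a 1D implicit field along a ray can be sketched as follows; the linear interpolation between bracketing samples and the toy SDF are illustrative choices, not the paper's network.

```python
import numpy as np

def ray_depth_from_sdf(depths, sdf_values):
    # depths: monotonically increasing sample depths along one camera ray.
    # sdf_values: predicted signed distances at those samples (+ outside, - inside).
    sign_flip = np.where(np.diff(np.sign(sdf_values)) < 0)[0]
    if len(sign_flip) == 0:
        return None                      # ray misses the surface
    i = sign_flip[0]
    # Linearly interpolate the zero crossing between samples i and i+1.
    t = sdf_values[i] / (sdf_values[i] - sdf_values[i + 1])
    return depths[i] + t * (depths[i + 1] - depths[i])

depths = np.linspace(0.5, 5.0, 64)
sdf = 2.3 - depths                        # toy 1D field: surface at depth 2.3
print(ray_depth_from_sdf(depths, sdf))    # ~2.3
```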

On the Robustness of Split Learning against Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2307.07916
  • repo_url: https://github.com/fmy266/SplitADV
  • paper_authors: Mingyuan Fan, Cen Chen, Chengyu Wang, Wenmeng Zhou, Jun Huang
  • for: To evaluate whether split learning, often promoted for settings where sensitive data cannot be shared directly, actually enhances model security.
  • methods: Instead of sharing raw data and model details directly, split learning exchanges partial sub-networks and intermediate computations; the paper probes its robustness by attacking the intermediate layers exposed to untrusted servers.
  • results: The study finds that while split learning limits attackers' access to the full model, attacks mounted on the intermediate layers remain effective, as demonstrated by a new tailored attack (SPADV).
    Abstract Split learning enables collaborative deep learning model training while preserving data privacy and model security by avoiding direct sharing of raw data and model details (i.e., server and clients only hold partial sub-networks and exchange intermediate computations). However, existing research has mainly focused on examining its reliability for privacy protection, with little investigation into model security. Specifically, by exploring full models, attackers can launch adversarial attacks, and split learning can mitigate this severe threat by only disclosing part of models to untrusted servers. This paper aims to evaluate the robustness of split learning against adversarial attacks, particularly in the most challenging setting where untrusted servers only have access to the intermediate layers of the model. Existing adversarial attacks mostly focus on the centralized setting instead of the collaborative setting; thus, to better evaluate the robustness of split learning, we develop a tailored attack called SPADV, which comprises two stages: 1) shadow model training that addresses the issue of lacking part of the model, and 2) a local adversarial attack that produces adversarial examples for evaluation. The first stage only requires a few unlabeled non-IID data, and, in the second stage, SPADV perturbs the intermediate output of natural samples to craft the adversarial ones. The overall cost of the proposed attack process is relatively low, yet the empirical attack effectiveness is significantly high, demonstrating the surprising vulnerability of split learning to adversarial attacks.

Predicting mechanical properties of Carbon Nanotube (CNT) images Using Multi-Layer Synthetic Finite Element Model Simulations

  • paper_url: http://arxiv.org/abs/2307.07912
  • repo_url: None
  • paper_authors: Kaveh Safavigerdini, Koundinya Nouduri, Ramakrishna Surya, Andrew Reinhard, Zach Quinlan, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
  • For: The paper predicts mechanical properties of vertically-oriented carbon nanotube (CNT) forest images using a deep learning model for artificial intelligence (AI)-based materials discovery.
  • Methods: The paper uses an innovative data augmentation technique based on multi-layer synthetic (MLS), or quasi-2.5D, images generated by blending 2D synthetic images. MLS images resemble real scanning electron microscopy (SEM) images of CNTs more closely than single 2D synthetic images, without the cost of expensive 3D simulations or experiments. Mechanical properties such as stiffness and buckling load for the MLS images are estimated with a physics-based model. The proposed deep learning architecture, CNTNeXt, builds upon the previous CNTNet neural network, using a ResNeXt feature representation followed by a random forest regression estimator.
  • Results: The blended-synthetic-image approach is expected to outperform single-synthetic-image-based learning when predicting mechanical properties of real scanning electron microscopy images, with the potential to accelerate understanding and control of CNT forest self-assembly for diverse applications.
    Abstract We present a pipeline for predicting mechanical properties of vertically-oriented carbon nanotube (CNT) forest images using a deep learning model for artificial intelligence (AI)-based materials discovery. Our approach incorporates an innovative data augmentation technique that involves the use of multi-layer synthetic (MLS) or quasi-2.5D images which are generated by blending 2D synthetic images. The MLS images more closely resemble 3D synthetic and real scanning electron microscopy (SEM) images of CNTs but without the computational cost of performing expensive 3D simulations or experiments. Mechanical properties such as stiffness and buckling load for the MLS images are estimated using a physics-based model. The proposed deep learning architecture, CNTNeXt, builds upon our previous CNTNet neural network, using a ResNeXt feature representation followed by random forest regression estimator. Our machine learning approach for predicting CNT physical properties by utilizing a blended set of synthetic images is expected to outperform single synthetic image-based learning when it comes to predicting mechanical properties of real scanning electron microscopy images. This has the potential to accelerate understanding and control of CNT forest self-assembly for diverse applications.
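
The construction of a quasi-2.5D image by blending 2D layers can be sketched as follows; the uniform-to-decaying per-layer weights, meant to mimic SEM depth attenuation, are an illustrative choice rather than the paper's exact recipe.

```python
import numpy as np

def blend_layers(layers, weights=None):
    # layers: list of (H, W) grayscale 2D synthetic images in [0, 1].
    stack = np.stack(layers, axis=0).astype(np.float32)
    if weights is None:
        # Give nearer layers more weight to mimic SEM depth attenuation.
        weights = np.linspace(1.0, 0.3, num=len(layers))
    weights = np.asarray(weights, dtype=np.float32)
    blended = (weights[:, None, None] * stack).sum(axis=0) / weights.sum()
    return np.clip(blended, 0.0, 1.0)

rng = np.random.default_rng(0)
layers = [rng.random((128, 128)) for _ in range(5)]  # stand-ins for 2D synthetic CNT images
mls_image = blend_layers(layers)
print(mls_image.shape, mls_image.dtype)
```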

Multitemporal SAR images change detection and visualization using RABASAR and simplified GLR

  • paper_url: http://arxiv.org/abs/2307.07892
  • repo_url: None
  • paper_authors: Weiying Zhao, Charles-Alban Deledalle, Loïc Denis, Henri Maître, Jean-Marie Nicolas, Florence Tupin
  • for: This study aims to detect different kinds of land-surface changes, providing precise change information for land surface monitoring.
  • methods: The paper proposes a simplified generalized likelihood ratio (SGLR) method that assumes corresponding temporal pixels share the same equivalent number of looks (ENL), applied to denoised data produced by the ratio-based multitemporal SAR image denoising method RABASAR. A new change magnitude index and an improved spectral-clustering-based change classification method are also developed.
  • results: Processing of simulated and real SAR images, together with comparisons against classical techniques, demonstrates the effectiveness of the proposed methods; in particular, the numerical experiments show good performance in detecting farmland, building, harbour, and flooding area changes.
    Abstract Understanding the state of changed areas requires that precise information be given about the changes. Thus, detecting different kinds of changes is important for land surface monitoring. SAR sensors are ideal to fulfil this task, because of their all-time and all-weather capabilities, with good accuracy of the acquisition geometry and without effects of atmospheric constituents for amplitude data. In this study, we propose a simplified generalized likelihood ratio ($S_{GLR}$) method assuming that corresponding temporal pixels have the same equivalent number of looks (ENL). Thanks to the denoised data provided by a ratio-based multitemporal SAR image denoising method (RABASAR), we successfully applied this similarity test approach to compute the change areas. A new change magnitude index method and an improved spectral clustering-based change classification method are also developed. In addition, we apply the simplified generalized likelihood ratio to detect the maximum change magnitude time, and the change starting and ending times. Then, we propose to use an adaptation of the REACTIV method to visualize the detection results vividly. The effectiveness of the proposed methods is demonstrated through the processing of simulated and SAR images, and the comparison with classical techniques. In particular, numerical experiments proved that the developed method has good performances in detecting farmland area changes, building area changes, harbour area changes and flooding area changes.
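
A per-pixel generalized likelihood ratio test between two dates can be sketched as follows, assuming Gamma-distributed intensities with a shared ENL; the threshold is illustrative, and the paper's SGLR additionally exploits full temporal stacks.

```python
import numpy as np

def glr_change_map(i1, i2, enl=10.0, threshold=6.0):
    # i1, i2: positive intensity images of the same scene at two dates.
    mean = 0.5 * (i1 + i2)
    # Gamma-likelihood GLR: zero where i1 == i2, large where intensities differ.
    glr = 2.0 * enl * np.log(mean**2 / (i1 * i2))
    return glr, glr > threshold

rng = np.random.default_rng(0)
base = rng.gamma(shape=10.0, scale=0.1, size=(100, 100))
changed = base.copy()
changed[40:60, 40:60] *= 4.0              # simulate a changed patch
glr, mask = glr_change_map(base, changed)
print(mask.mean())                        # fraction of pixels flagged as changed
```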

Why Does Little Robustness Help? Understanding Adversarial Transferability From Surrogate Training

  • paper_url: http://arxiv.org/abs/2307.07873
  • repo_url: None
  • paper_authors: Yechao Zhang, Shengshan Hu, Leo Yu Zhang, Junyu Shi, Minghui Li, Xiaogeng Liu, Wei Wan, Hai Jin
  • for: This study aims to explain the transferability of adversarial examples (AEs) crafted against deep neural networks (DNNs).
  • methods: Through a series of theoretical and empirical analyses, the paper studies the joint effects of model smoothness and gradient similarity when surrogate models are adversarially trained.
  • results: The study finds a trade-off between the two factors: the data distribution shift induced by adversarial training explains the degradation of gradient similarity. By controlling this shift and applying gradient regularization, better surrogate models can be constructed to improve adversarial transferability.
    Abstract Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs that successfully fool white-box surrogate models can also deceive other black-box models with different architectures. Although a bunch of empirical studies have provided guidance on generating highly transferable AEs, many of these findings lack explanations and even lead to inconsistent advice. In this paper, we take a further step towards understanding adversarial transferability, with a particular focus on surrogate aspects. Starting from the intriguing little robustness phenomenon, where models adversarially trained with mildly perturbed adversarial samples can serve as better surrogates, we attribute it to a trade-off between two predominant factors: model smoothness and gradient similarity. Our investigations focus on their joint effects, rather than their separate correlations with transferability. Through a series of theoretical and empirical analyses, we conjecture that the data distribution shift in adversarial training explains the degradation of gradient similarity. Building on these insights, we explore the impacts of data augmentation and gradient regularization on transferability and identify that the trade-off generally exists in the various training mechanisms, thus building a comprehensive blueprint for the regulation mechanism behind transferability. Finally, we provide a general route for constructing better surrogates to boost transferability which optimizes both model smoothness and gradient similarity simultaneously, e.g., the combination of input gradient regularization and sharpness-aware minimization (SAM), validated by extensive experiments. In summary, we call for attention to the united impacts of these two factors for launching effective transfer attacks, rather than optimizing one while ignoring the other, and emphasize the crucial role of manipulating surrogate models.
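
One surrogate-training step that optimizes smoothness and gradient behavior jointly can be sketched as follows, combining an input-gradient penalty with an inline sharpness-aware minimization (SAM) update. The radius `rho` and penalty weight `lam` are illustrative, and this is a reading of the suggested combination rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def surrogate_step(model, x, y, optimizer, rho=0.05, lam=0.1):
    def regularized_loss():
        x_req = x.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_req), y)
        (g,) = torch.autograd.grad(loss, x_req, create_graph=True)
        return loss + lam * g.pow(2).sum()      # input-gradient penalty

    # SAM ascent: perturb weights toward higher (regularized) loss.
    optimizer.zero_grad()
    regularized_loss().backward()
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum()
                               for p in model.parameters() if p.grad is not None))
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                eps[p] = rho * p.grad / (grad_norm + 1e-12)
                p.add_(eps[p])

    # Descent step evaluated at the perturbed weights.
    optimizer.zero_grad()
    regularized_loss().backward()
    with torch.no_grad():
        for p in model.parameters():
            if p in eps:
                p.sub_(eps[p])                  # restore the original weights
    optimizer.step()
```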

Unified Adversarial Patch for Cross-modal Attacks in the Physical World

  • paper_url: http://arxiv.org/abs/2307.07859
  • repo_url: None
  • paper_authors: Xingxing Wei, Yao Huang, Yitong Sun, Jie Yu
  • for: This paper aims to demonstrate the potential risks of physical adversarial attacks on object detectors that use both visible and infrared sensors.
  • methods: The authors propose a unified adversarial patch that can fool both visible and infrared object detectors simultaneously, using a single patch. They design a novel boundary-limited shape optimization method to achieve compact and smooth shapes, and propose a score-aware iterative evaluation to balance the fooling degree between the two sensors.
  • results: The authors achieve an Attack Success Rate (ASR) of 73.33% and 69.17% against one-stage and two-stage object detectors, respectively, and verify effective attacks in the physical world under various settings, such as different angles, distances, postures, and scenes.
    Abstract Recently, physical adversarial attacks have been presented to evade DNNs-based object detectors. To ensure the security, many scenarios are simultaneously deployed with visible sensors and infrared sensors, leading to the failures of these single-modal physical attacks. To show the potential risks under such scenes, we propose a unified adversarial patch to perform cross-modal physical attacks, i.e., fooling visible and infrared object detectors at the same time via a single patch. Considering different imaging mechanisms of visible and infrared sensors, our work focuses on modeling the shapes of adversarial patches, which can be captured in different modalities when they change. To this end, we design a novel boundary-limited shape optimization to achieve the compact and smooth shapes, and thus they can be easily implemented in the physical world. In addition, to balance the fooling degree between visible detector and infrared detector during the optimization process, we propose a score-aware iterative evaluation, which can guide the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. We finally test our method against the one-stage detector: YOLOv3 and the two-stage detector: Faster RCNN. Results show that our unified patch achieves an Attack Success Rate (ASR) of 73.33% and 69.17%, respectively. More importantly, we verify the effective attacks in the physical world when visible and infrared sensors shoot the objects under various settings like different angles, distances, postures, and scenes.

Neural Video Recovery for Cloud Gaming

  • paper_url: http://arxiv.org/abs/2307.07847
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Shuozhe Li, Diyuan Dai, Lili Qiu
  • for: To improve the accuracy and efficiency of video recovery in cloud gaming.
  • methods: The approach uses game states to significantly improve video recovery accuracy and utilizes partially decoded frames to quickly recover lost video portions.
  • results: Implementations on an iPhone 12 and a laptop demonstrate the value of game states for video recovery and the effectiveness of the overall design.
    Abstract Cloud gaming is a multi-billion dollar industry. A client in cloud gaming sends its movement to the game server on the Internet, which renders and transmits the resulting video back. In order to provide a good gaming experience, a latency below 80 ms is required. This means that video rendering, encoding, transmission, decoding, and display have to finish within that time frame, which is especially challenging to achieve due to server overload, network congestion, and losses. In this paper, we propose a new method for recovering lost or corrupted video frames in cloud gaming. Unlike traditional video frame recovery, our approach uses game states to significantly enhance recovery accuracy and utilizes partially decoded frames to recover lost portions. We develop a holistic system that consists of (i) efficiently extracting game states, (ii) modifying H.264 video decoder to generate a mask to indicate which portions of video frames need recovery, and (iii) designing a novel neural network to recover either complete or partial video frames. Our approach is extensively evaluated using iPhone 12 and laptop implementations, and we demonstrate the utility of game states in the game video recovery and the effectiveness of our overall design.

cs.AI - 2023-07-16

A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data

  • paper_url: http://arxiv.org/abs/2307.08087
  • repo_url: None
  • paper_authors: Jaime de Miguel-Rodriguez, Fernando Sancho-Caparrini
  • for: This paper proposes a new symbolic-only method for generating hierarchical concept structures from complex spatial sensory data.
  • methods: Following Bateson's notion of difference, the model extracts atomic features from raw data by computing elemental sequential comparisons, then organizes these features into higher-level constructs through a recursive process.
  • results: The model produces fairly rich yet human-readable conceptual representations without training. The resulting concept structures exhibit high composability, support formal reasoning, and have inherent abilities for generalization and out-of-distribution learning.
    Abstract Neural-symbolic approaches to machine learning incorporate the advantages from both connectionist and symbolic methods. Typically, these models employ a first module based on a neural architecture to extract features from complex data. Then, these features are processed as symbols by a symbolic engine that provides reasoning, concept structures, composability, better generalization and out-of-distribution learning among other possibilities. However, neural approaches to the grounding of symbols in sensory data, albeit powerful, still require heavy training and tedious labeling for the most part. This paper presents a new symbolic-only method for the generation of hierarchical concept structures from complex spatial sensory data. The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept. Following his suggestion, the model extracts atomic features from raw data by computing elemental sequential comparisons in a stream of multivariate numerical values. Higher-level constructs are built from these features by subjecting them to further comparisons in a recursive process. At any stage in the recursion, a concept structure may be obtained from these constructs and features by means of Formal Concept Analysis. Results show that the model is able to produce fairly rich yet human-readable conceptual representations without training. Additionally, the concept structures obtained through the model (i) present high composability, which potentially enables the generation of 'unseen' concepts, (ii) allow formal reasoning, and (iii) have inherent abilities for generalization and out-of-distribution learning. Consequently, this method may offer an interesting angle to current neural-symbolic research. Future work is required to develop a training methodology so that the model can be tested against a larger dataset.
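
The recursive difference extraction can be sketched as follows: each level compares consecutive elements of a multivariate stream, and the recursion re-applies the comparison to its own output. The sign encoding of differences and the depth are illustrative choices, and the Formal Concept Analysis stage is omitted.

```python
import numpy as np

def differences(stream):
    # stream: (T, D) multivariate numerical values.
    return np.sign(np.diff(stream, axis=0))    # atomic features: -1, 0, +1

def recursive_constructs(stream, depth=3):
    levels, current = [], stream
    for _ in range(depth):
        current = differences(current)
        if len(current) < 2:
            break
        levels.append(current)                 # higher-level constructs
    return levels

signal = np.cumsum(np.random.default_rng(0).integers(-1, 2, size=(16, 2)), axis=0)
for i, lvl in enumerate(recursive_constructs(signal), start=1):
    print(f"level {i}: shape {lvl.shape}")
```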

Dataset Distillation Meets Provable Subset Selection

  • paper_url: http://arxiv.org/abs/2307.08086
  • repo_url: None
  • paper_authors: Murad Tukan, Alaa Maalouf, Margarita Osadchy
  • for: To improve dataset distillation, i.e., compressing a large training dataset into a smaller synthetic one that retains model performance.
  • methods: Two improvements are proposed. First, a provable, sampling-based approach initializes the distilled set by identifying important points and removing redundant ones. Second, data subset selection is merged with dataset distillation by training the distilled set on "important" sampled points during the training procedure rather than uniformly sampled batches.
  • results: The method can latch on to existing dataset distillation techniques and improve their performance, as shown experimentally.
    Abstract Deep learning has grown tremendously over recent years, yielding state-of-the-art results in various fields. However, training such models requires huge amounts of data, increasing the computational time and cost. To address this, dataset distillation was proposed to compress a large training dataset into a smaller synthetic one that retains its performance -- this is usually done by (1) uniformly initializing a synthetic set and (2) iteratively updating/learning this set according to a predefined loss by uniformly sampling instances from the full data. In this paper, we improve both phases of dataset distillation: (1) we present a provable, sampling-based approach for initializing the distilled set by identifying important and removing redundant points in the data, and (2) we further merge the idea of data subset selection with dataset distillation, by training the distilled set on ``important'' sampled points during the training procedure instead of randomly sampling the next batch. To do so, we define the notion of importance based on the relative contribution of instances with respect to two different loss functions, i.e., one for the initialization phase (a kernel fitting function for kernel ridge regression and $K$-means based loss function for any other distillation method), and the relative cross-entropy loss (or any other predefined loss) function for the training phase. Finally, we provide experimental results showing how our method can latch on to existing dataset distillation techniques and improve their performance.
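
A sampling-based initialization of the distilled set can be sketched as follows; picking the data points nearest k-means centers merely illustrates the important/redundant-point idea, and is not the paper's provable selection criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_distilled_set(X, m):
    # X: (n, d) training data; m: distilled-set size.
    km = KMeans(n_clusters=m, n_init=5, random_state=0).fit(X)
    # Use the real point nearest each center so the set stays in-distribution;
    # points far from every center (redundant near-duplicates) are dropped.
    idx = [np.argmin(((X - c) ** 2).sum(axis=1)) for c in km.cluster_centers_]
    return X[np.array(idx)]

X = np.random.default_rng(0).normal(size=(1000, 32))
S = init_distilled_set(X, m=50)
print(S.shape)  # (50, 32)
```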

POMDP inference and robust solution via deep reinforcement learning: An application to railway optimal maintenance

  • paper_url: http://arxiv.org/abs/2307.08082
  • repo_url: https://github.com/giarcieri/robust-optimal-maintenance-planning-through-reinforcement-learning-and-rllib
  • paper_authors: Giacomo Arcieri, Cyprien Hoelzl, Oliver Schwery, Daniel Straub, Konstantinos G. Papakonstantinou, Eleni Chatzi
  • for: This paper targets complex sequential decision-making problems modeled as POMDPs, and the difficulty of solving POMDPs with uncertain model parameters via deep reinforcement learning.
  • methods: The paper proposes a combined framework for inference and robust solution of POMDPs: transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model conditioned on actions, and the POMDP is then solved with deep RL techniques that incorporate the parameter distributions via domain randomization, yielding policies that are robust to model uncertainty.
  • results: The methods are applied to the real-world problem of optimal maintenance planning for railway assets, including a comparison of transformers and long short-term memory networks as well as a model-based/model-free hybrid approach.
    Abstract Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems under stochastic and uncertain environments. A main reason hindering their broad adoption in real-world applications is the lack of availability of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require the knowledge of the transition dynamics and the observation generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model, which is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques with the parameter distributions incorporated into the solution via domain randomization, in order to develop solutions that are robust to model uncertainty. As a further contribution, we compare the use of transformers and long short-term memory networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.
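
Domain randomization over inferred parameters can be sketched as follows: each training episode draws a fresh model from the posterior samples, so the learned policy must perform well across the parameter uncertainty. The toy environment, state space, and cost are hypothetical stand-ins, and the transition model here ignores the action for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MCMC posterior samples: 500 draws of a 3x3 row-stochastic
# transition matrix P(s' | s).
posterior_samples = rng.dirichlet(np.ones(3), size=(500, 3))

def sample_transition_matrix():
    return posterior_samples[rng.integers(len(posterior_samples))]

def run_episode(policy, horizon=20):
    P = sample_transition_matrix()        # fresh model draw per episode
    s, total = 0, 0.0
    for _ in range(horizon):
        a = policy(s)                     # action chosen by the RL policy
        s = rng.choice(3, p=P[s])         # toy transition (action-independent)
        total += -1.0 if s == 2 else 0.0  # toy cost: state 2 is "failure"
    return total

print(run_episode(lambda s: 0))
```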

Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

  • paper_url: http://arxiv.org/abs/2307.08074
  • repo_url: None
  • paper_authors: Longyue Wang, Zefeng Du, Donghuai Liu, Deng Cai, Dian Yu, Haiyun Jiang, Yan Wang, Leyang Cui, Shuming Shi, Zhaopeng Tu
  • for: This paper proposes a new evaluation benchmark for assessing how well natural language processing (NLP) models capture discourse phenomena in text.
  • methods: The benchmark, Disco-Bench, evaluates discourse properties across a diverse set of NLP tasks covering understanding, translation, and generation, and includes a diagnostic test suite for probing whether models learn discourse knowledge.
  • results: Evaluating 20 models shows that fine-grained pretraining on literary document-level training data consistently improves the modeling of discourse information.
    Abstract Modeling discourse -- the linguistic phenomena that go beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on the evaluation of inter-sentence properties and overlook critical discourse phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena (e.g. cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge. We totally evaluate 20 general-, in-domain and commercial models based on Transformer, advanced pretraining architectures and large language models (LLMs). Our results show (1) the challenge and necessity of our evaluation benchmark; (2) fine-grained pretraining based on literary document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field: https://github.com/longyuewangdcu/Disco-Bench.

Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study

  • paper_url: http://arxiv.org/abs/2307.08072
  • repo_url: https://github.com/rucaibox/quantizedempirical
  • paper_authors: Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen
  • for: investigate the impact of quantization on emergent abilities of large language models
  • methods: empirical experiments and fine-grained impact analysis
  • results: 4-bit quantization models still maintain emergent abilities, while 2-bit models show severe performance degradation; fine-tuning can improve performance
    Abstract Despite the superior performance, Large Language Models~(LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs as well as to increase the inference rate. However, a major challenge is that low-bit quantization methods often lead to performance degradation. It is important to understand how quantization impacts the capacity of LLMs. Different from previous studies focused on overall performance, this work aims to investigate the impact of quantization on \emph{emergent abilities}, which are important characteristics that distinguish LLMs from small language models. Specifically, we examine the abilities of in-context learning, chain-of-thought reasoning, and instruction-following in quantized LLMs. Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation on the test of these abilities. To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning. Our work derives a series of important findings to understand the impact of quantization on emergent abilities, and sheds light on the possibilities of extremely low-bit quantization for LLMs.

Towards Flexible Time-to-event Modeling: Optimizing Neural Networks via Rank Regression

  • paper_url: http://arxiv.org/abs/2307.08044
  • repo_url: https://github.com/teboozas/dart_ecai23
  • paper_authors: Hyunjun Lee, Junhyun Lee, Taehwa Choi, Jaewoo Kang, Sangbum Choi
  • for: develop a deep learning-based time-to-event prediction model that improves predictive performance while relaxing strict assumptions
  • methods: an objective function based on Gehan's rank statistic; no baseline event time distribution needs to be specified, so the model retains the advantage of directly predicting event times
  • results: quantitative analyses on various benchmark datasets show significant potential for modeling high-throughput censored time-to-event data
    Abstract Time-to-event analysis, also known as survival analysis, aims to predict the time of occurrence of an event, given a set of features. One of the major challenges in this area is dealing with censored data, which can make learning algorithms more complex. Traditional methods such as Cox's proportional hazards model and the accelerated failure time (AFT) model have been popular in this field, but they often require assumptions such as proportional hazards and linearity. In particular, the AFT models often require pre-specified parametric distributional assumptions. To improve predictive performance and alleviate strict assumptions, there have been many deep learning approaches for hazard-based models in recent years. However, representation learning for AFT has not been widely explored in the neural network literature, despite its simplicity and interpretability in comparison to hazard-focused methods. In this work, we introduce the Deep AFT Rank-regression model for Time-to-event prediction (DART). This model uses an objective function based on Gehan's rank statistic, which is efficient and reliable for representation learning. On top of eliminating the requirement to establish a baseline event time distribution, DART retains the advantages of directly predicting event time in standard AFT models. The proposed method is a semiparametric approach to AFT modeling that does not impose any distributional assumptions on the survival time distribution. This also eliminates the need for additional hyperparameters or complex model architectures, unlike existing neural network-based AFT models. Through quantitative analysis on various benchmark datasets, we have shown that DART has significant potential for modeling high-throughput censored time-to-event data.
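The Gehan-type rank objective at the heart of DART can be written compactly: with residuals e_i = log t_i − f(x_i) and event indicators δ_i, the loss averages δ_i · max(0, e_j − e_i) over all pairs, penalizing rankings that contradict observed (uncensored) event times. A minimal PyTorch sketch, with smoothing and normalization details left as assumptions:

```python
import torch

def gehan_loss(pred, log_time, event):
    """pred, log_time, event: float tensors of shape (n,); event[i] = 1 if observed."""
    e = log_time - pred                        # residuals e_i
    diff = e.unsqueeze(0) - e.unsqueeze(1)     # diff[i, j] = e_j - e_i
    return (event.unsqueeze(1) * torch.clamp(diff, min=0)).mean()

n = 8
x = torch.randn(n, 5)
model = torch.nn.Linear(5, 1)                  # f(x): predicted log event time
loss = gehan_loss(model(x).squeeze(1), torch.randn(n), torch.randint(0, 2, (n,)).float())
loss.backward()
```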

A Neural-Symbolic Approach Towards Identifying Grammatically Correct Sentences

  • paper_url: http://arxiv.org/abs/2307.08036
  • repo_url: None
  • paper_authors: Nicos Isaak
  • for: validate English sentences
  • methods: neural-symbolic approach, blending of grammatical and syntactical rules with language models
  • results: effective tackling of Corpus of Linguistic Acceptability (COLA) task, improvement of symbolic systems’ accuracy results through blending with non-symbolic systems
    Abstract Textual content around us is growing on a daily basis. Numerous articles are being written as we speak on online newspapers, blogs, or social media. Similarly, recent advances in the AI field, like language models or traditional classic AI approaches, utilize all of the above to improve their learned representations and tackle NLP challenges with human-like accuracy. It is commonly accepted that access to well-written text from valid sources is crucial for tackling challenges like text summarization, question-answering, machine translation, or even pronoun resolution. For instance, to summarize well, one needs to select the most important sentences in order to concatenate them to form the summary. However, what happens if we do not have access to well-formed English sentences, or are faced with invalid ones? Despite the importance of having access to well-written sentences, figuring out ways to validate them is still an open area of research. To address this problem, we present a simplified way to validate English sentences through a novel neural-symbolic approach. Lately, neural-symbolic approaches have triggered increasing interest in tackling various NLP challenges, as they demonstrate their effectiveness as a central component in various AI systems. By combining Classic with Modern AI, blending grammatical and syntactical rules with language models, we effectively tackle the Corpus of Linguistic Acceptability (COLA), a task that shows whether or not a sequence of words is a grammatical English sentence. Among other things, our experiments show that blending symbolic and non-symbolic systems helps the former provide insights about the latter's accuracy results.

Bayesian inference for data-efficient, explainable, and safe robotic motion planning: A review

  • paper_url: http://arxiv.org/abs/2307.08024
  • repo_url: None
  • paper_authors: Chengmin Zhou, Chao Wang, Haseeb Hassan, Himat Shah, Bingding Huang, Pasi Fränti
  • for: review the application of Bayesian inference in robotic motion planning, covering uncertainty quantification, safety and optimality guarantees, data efficiency, the sim2real gap, and hybrids of Bayesian inference and learning
  • methods: surveys Bayesian estimation, model-based and model-free Bayesian RL, Bayesian inference in inverse RL, and the hybridization of Bayesian inference and RL
  • results: a systematic account of the progress and applications of Bayesian inference in robotic motion planning, with analyses covering complex cases, data efficiency, the sim2real gap, and interpretable and safe planning
    Abstract Bayesian inference has many advantages in robotic motion planning from four perspectives: uncertainty quantification of the policy, safety (risk-aware) and optimality guarantees for robot motions, data efficiency in the training of reinforcement learning, and reducing the sim2real gap when the robot is applied to real-world tasks. However, the application of Bayesian inference in robotic motion planning lags behind the comprehensive theory of Bayesian inference. Further, there are no comprehensive reviews summarizing the progress of Bayesian inference to give researchers a systematic understanding of its use in robotic motion planning. This paper first provides the probabilistic theories of Bayesian inference, which are the preliminaries for Bayesian inference in complex cases. Second, Bayesian estimation is presented for estimating the posterior of policies or of unknown functions used to compute the policy. Third, classical model-based Bayesian RL and model-free Bayesian RL algorithms for robotic motion planning are summarized, and these algorithms are also analyzed in complex cases. Fourth, Bayesian inference in inverse RL is analyzed as a way to infer reward functions in a data-efficient manner. Fifth, we systematically present the hybridization of Bayesian inference and RL, a promising direction for improving the convergence of RL for better motion planning. Sixth, building on Bayesian inference, we present interpretable and safe robotic motion planning, a recent hot research topic. Finally, all algorithms reviewed in this paper are summarized analytically as knowledge graphs, and the future of Bayesian inference for robotic motion planning is discussed, to pave the way for data-efficient, explainable, and safe robotic motion planning strategies for practical applications.

Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

  • paper_url: http://arxiv.org/abs/2307.08016
  • repo_url: None
  • paper_authors: Ruipu Luo, Jiwen Zhang, Zhongyu Wei
  • for: address the long action sequences in vision-and-language decision-making tasks with a unit-grained training framework that enables active exploration of the environment and reduces exposure bias
  • methods: a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory
  • results: extensive experiments on the TEACH benchmark show the proposed method outperforms state-of-the-art methods on all evaluation metrics
    Abstract Vision language decision making (VLDM) is a challenging multimodal task. The agent has to understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained units, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces the exposure bias. Such a framework leverages the unit-grained configurations and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on the TEACH benchmark, we demonstrate that our proposed framework outperforms existing state-of-the-art methods in terms of all evaluation metrics. Overall, our work introduces a novel approach to tackling the VLDM task by breaking it down into smaller, manageable units and utilizing a hybrid-training framework. By doing so, we provide a more flexible and effective solution for multimodal decision making.
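A rough sketch of the unit-grained idea: encode one unit's (vision-and-language) tokens with a Transformer, and carry a recurrent state across units as the cross-modal memory. Sizes, pooling, and wiring here are illustrative assumptions rather than the paper's exact Unit-Transformer:

```python
import torch
import torch.nn as nn

class UnitPolicy(nn.Module):
    def __init__(self, d_model=128, n_actions=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.memory = nn.GRUCell(d_model, d_model)   # unit-scale recurrent state
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, unit_tokens, state):
        # unit_tokens: (batch, seq, d_model) fused features of the current unit
        h = self.encoder(unit_tokens).mean(dim=1)    # pool this unit's tokens
        state = self.memory(h, state)                # update cross-unit memory
        return self.head(state), state

policy = UnitPolicy()
state = torch.zeros(1, 128)
logits, state = policy(torch.randn(1, 16, 128), state)  # one unit at a time
```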

SHAMSUL: Simultaneous Heatmap-Analysis to investigate Medical Significance Utilizing Local interpretability methods

  • paper_url: http://arxiv.org/abs/2307.08003
  • repo_url: https://github.com/anondo1969/shamsul
  • paper_authors: Mahbub Ul Alam, Jaakko Hollmén, Jón Rúnar Baldvinsson, Rahim Rahmani
  • For: The paper aims to evaluate the performance of four well-established interpretability methods (LIME, SHAP, Grad-CAM, and LRP) in interpreting deep neural network predictions for chest radiography images.
  • Methods: The study uses a transfer learning approach with a multi-label-multi-class chest radiography dataset, and evaluates the interpretability methods on both single-label and multi-label predictions. The analysis includes quantitative and qualitative investigations, and compares the results against human expert annotation.
  • Results: The study finds that Grad-CAM demonstrates the most favorable performance in quantitative evaluation, while the LIME heatmap segmentation visualization exhibits the highest level of medical significance. The research highlights the strengths and limitations of these interpretability methods and suggests that a multimodal-based approach could offer additional insights for enhancing interpretability in the medical domain.
    Abstract The interpretability of deep neural networks has become a subject of great interest within the medical and healthcare domain. This attention stems from concerns regarding transparency, legal and ethical considerations, and the medical significance of predictions generated by these deep neural networks in clinical decision support systems. To address this matter, our study delves into the application of four well-established interpretability methods: Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive exPlanations (SHAP), Gradient-weighted Class Activation Mapping (Grad-CAM), and Layer-wise Relevance Propagation (LRP). Leveraging the approach of transfer learning with a multi-label-multi-class chest radiography dataset, we aim to interpret predictions pertaining to specific pathology classes. Our analysis encompasses both single-label and multi-label predictions, providing a comprehensive and unbiased assessment through quantitative and qualitative investigations, which are compared against human expert annotation. Notably, Grad-CAM demonstrates the most favorable performance in quantitative evaluation, while the LIME heatmap segmentation visualization exhibits the highest level of medical significance. Our research highlights the strengths and limitations of these interpretability methods and suggests that a multimodal-based approach, incorporating diverse sources of information beyond chest radiography images, could offer additional insights for enhancing interpretability in the medical domain.
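Of the four methods compared, Grad-CAM is the most compact to sketch: weight the target convolutional layer's activations by the spatially averaged gradients of the class score, then apply a ReLU. Model and layer choices below are placeholders standing in for the paper's chest-radiography network:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
acts, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)            # stand-in for a chest radiograph
model(x)[0, 0].backward()                  # score of the target pathology class

w = grads["g"].mean(dim=(2, 3), keepdim=True)          # pooled gradients per channel
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True)) # weighted sum over channels
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear")  # upsample to image size
```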

MargCTGAN: A “Marginally” Better CTGAN for the Low Sample Regime

  • paper_url: http://arxiv.org/abs/2307.07997
  • repo_url: https://github.com/tejuafonja/margctgan
  • paper_authors: Tejumade Afonja, Dingfan Chen, Mario Fritz
  • for: evaluate state-of-the-art synthetic tabular data generators, with particular attention to the low-sample regime
  • methods: builds on the CTGAN model and uses feature matching of de-correlated marginals to improve the statistical properties and downstream utility of the synthetic data
  • results: the proposed MargCTGAN maintains downstream utility and statistical fidelity consistently from the high- to the low-sample regime
    Abstract The potential of realistic and useful synthetic data is significant. However, current evaluation methods for synthetic tabular data generation predominantly focus on downstream task usefulness, often neglecting the importance of statistical properties. This oversight becomes particularly prominent in low sample scenarios, accompanied by a swift deterioration of these statistical measures. In this paper, we address this issue by conducting an evaluation of three state-of-the-art synthetic tabular data generators based on their marginal distribution, column-pair correlation, joint distribution and downstream task utility performance across high to low sample regimes. The popular CTGAN model shows strong utility, but underperforms in low sample settings in terms of utility. To overcome this limitation, we propose MargCTGAN that adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
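The "marginal" addition can be pictured as an extra matching loss: project real and synthetic batches into a de-correlated (whitened) space and penalize the gap between their statistics there. The exact statistics MargCTGAN matches may differ; this is an illustrative loss term:

```python
import torch

def marginal_matching_loss(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """real, fake: (n, d) batches of (encoded) tabular rows."""
    mu = real.mean(dim=0, keepdim=True)
    cov = torch.cov(real.T) + 1e-4 * torch.eye(real.shape[1])
    eigvals, eigvecs = torch.linalg.eigh(cov)
    proj = eigvecs / eigvals.clamp_min(1e-4).sqrt()   # de-correlating projection
    r, f = (real - mu) @ proj, (fake - mu) @ proj
    return (r.mean(dim=0) - f.mean(dim=0)).pow(2).mean()

loss = marginal_matching_loss(torch.randn(256, 10), torch.randn(256, 10))
```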

CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

  • paper_url: http://arxiv.org/abs/2307.10237
  • repo_url: None
  • paper_authors: Bhavin Jawade, Deen Dayal Mohan, Dennis Fedorishin, Srirangaraj Setlur, Venu Govindaraju
  • for: address face recognition at long range under unconstrained conditions, such as varying distance, resolution, viewpoint, illumination, pose, and atmospheric conditions
  • methods: a distribution-conditioned feature aggregation approach, CoNAN, which learns a context vector conditioned on the distribution of the incoming feature set and weighs features by their estimated informativeness
  • results: state-of-the-art results on long-range unconstrained face recognition datasets such as BTS and DroneSURF, validating the benefits of this aggregation strategy
    Abstract Face recognition from image sets acquired under unregulated and uncontrolled settings, such as at large distances, low resolutions, varying viewpoints, illumination, pose, and atmospheric conditions, is challenging. Face feature aggregation, which involves aggregating a set of N feature representations present in a template into a single global representation, plays a pivotal role in such recognition systems. Existing works in traditional face feature aggregation either utilize metadata or high-dimensional intermediate feature representations to estimate feature quality for aggregation. However, generating high-quality metadata or style information is not feasible for extremely low-resolution faces captured in long-range and high altitude settings. To overcome these limitations, we propose a feature distribution conditioning approach called CoNAN for template aggregation. Specifically, our method aims to learn a context vector conditioned over the distribution information of the incoming feature set, which is utilized to weigh the features based on their estimated informativeness. The proposed method produces state-of-the-art results on long-range unconstrained face recognition datasets such as BTS, and DroneSURF, validating the advantages of such an aggregation strategy.
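The core mechanism is easy to sketch: summarize the distribution of the incoming feature set, map that summary to a context vector, and use it to weigh and pool the features into one template descriptor. Layer sizes and the exact conditioning below are assumptions:

```python
import torch
import torch.nn as nn

class ConditionalAggregator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.context = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats):
        # feats: (N, dim) face features from one template
        stats = torch.cat([feats.mean(0), feats.std(0)])  # distribution summary
        ctx = self.context(stats)                          # conditioned context vector
        weights = torch.softmax(feats @ ctx, dim=0)        # estimated informativeness
        return (weights.unsqueeze(1) * feats).sum(0)       # single global descriptor

agg = ConditionalAggregator()
template = agg(torch.randn(32, 256))
```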

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.10236
  • repo_url: None
  • paper_authors: Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, Lei Ma
  • for: study the trustworthiness of large language models (LLMs), particularly in safety-, security- and reliability-sensitive scenarios
  • methods: twelve uncertainty estimation methods applied to four LLMs across four prominent natural language processing (NLP) tasks to explore the prediction risks of LLMs
  • results: uncertainty estimation helps reveal LLMs' uncertain/non-factual predictions, and in code generation it can potentially uncover buggy programs
    Abstract The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, erroneous generations, such as false predictions, misinformation, and hallucinations made by LLMs, have also raised severe concerns about the trustworthiness of LLMs, especially in safety-, security- and reliability-sensitive scenarios, potentially hindering real-world adoption. While uncertainty estimation has shown its potential for interpreting the prediction risks made by general machine learning (ML) models, little is known about whether and to what extent it can help explore an LLM's capabilities and counteract its undesired behavior. To bridge the gap, in this paper, we initiate an exploratory study on the risk assessment of LLMs from the lens of uncertainty. In particular, we experiment with twelve uncertainty estimation methods and four LLMs on four prominent natural language processing (NLP) tasks to investigate to what extent uncertainty estimation techniques could help characterize the prediction risks of LLMs. Our findings validate the effectiveness of uncertainty estimation for revealing LLMs' uncertain/non-factual predictions. In addition to general NLP tasks, we extensively conduct experiments with four LLMs for code generation on two datasets. We find that uncertainty estimation can potentially uncover buggy programs generated by LLMs. Insights from our study shed light on future design and development for reliable LLMs, facilitating further research toward enhancing the trustworthiness of LLMs.
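One of the simplest uncertainty signals in the family the study covers: sample several generations for the same prompt and compute the entropy of the answer distribution, where high entropy flags a prediction to distrust. The `generate` call is a placeholder for any sampling-based LLM API:

```python
import math
from collections import Counter

def predictive_entropy(answers):
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# answers = [generate(prompt, temperature=1.0) for _ in range(10)]
print(predictive_entropy(["Paris", "Paris", "Lyon", "Paris"]))  # moderately low entropy
```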

Automated Polynomial Filter Learning for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.07956
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Wendi Yu, Zhichao Hou, Xiaorui Liu
  • for: explore the potential and limitations of polynomial graph filter learning and propose a general automated framework that makes polynomial graph filters more effective
  • methods: adaptive learning of polynomial graph filters, together with a novel automated framework, Auto-Polynomial, which efficiently learns filters that adapt to various complex graph signals
  • results: experiments and ablation studies show significant and consistent performance improvements on both homophilic and heterophilic graphs across multiple learning settings and labeling ratios
    Abstract Polynomial graph filters have been widely used as guiding principles in the design of Graph Neural Networks (GNNs). Recently, the adaptive learning of polynomial graph filters has demonstrated promising performance for modeling graph signals on both homophilic and heterophilic graphs, owing to their flexibility and expressiveness. In this work, we conduct a novel preliminary study to explore the potential and limitations of polynomial graph filter learning approaches, revealing a severe overfitting issue. To improve the effectiveness of polynomial graph filters, we propose Auto-Polynomial, a novel and general automated polynomial graph filter learning framework that efficiently learns better filters capable of adapting to various complex graph signals. Comprehensive experiments and ablation studies demonstrate significant and consistent performance improvements on both homophilic and heterophilic graphs across multiple learning settings considering various labeling ratios, which unleashes the potential of polynomial filter learning.
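The object being tuned is a learnable polynomial filter y = Σ_k θ_k L̂^k x over a (normalized) graph operator L̂. A minimal sketch of that forward pass; Auto-Polynomial's automated search over the coefficients is not shown:

```python
import torch

def poly_filter(L_hat, x, theta):
    """Apply y = sum_k theta[k] * L_hat^k x via repeated propagation."""
    out, h = torch.zeros_like(x), x
    for k in range(theta.shape[0]):
        out = out + theta[k] * h
        h = L_hat @ h                 # next power of the graph operator
    return out

n = 5
L_hat = torch.eye(n) - torch.ones(n, n) / n       # toy normalized operator
theta = torch.nn.Parameter(torch.randn(4))        # order-3 filter coefficients
y = poly_filter(L_hat, torch.randn(n, 8), theta)  # filtered node features
```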

MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning

  • paper_url: http://arxiv.org/abs/2307.07951
  • repo_url: None
  • paper_authors: Zhenwen Liang, Dian Yu, Xiaoman Pan, Wenlin Yao, Qingkai Zeng, Xiangliang Zhang, Dong Yu
  • For: The paper aims to improve mathematical reasoning in relatively small language models (LMs) by introducing a multi-view fine-tuning method that leverages diverse annotation styles in existing mathematical problem datasets.
  • Methods: The proposed method postpends distinct instructions to input questions and trains the model to generate solutions in diverse formats in a flexible manner, utilizing the various annotation formats as different "views".
  • Results: The approach outperforms prior knowledge-distillation-based methods and carefully established baselines, with the model demonstrating promising generalization ability across various views and datasets, as well as the capability to learn from inaccurate or incomplete noisy data.
    Abstract Reasoning in mathematical domains remains a significant challenge for relatively small language models (LMs). Many current methods focus on specializing LMs in mathematical reasoning and rely heavily on knowledge distillation from powerful but inefficient large LMs (LLMs). In this work, we explore a new direction that avoids over-reliance on LLM teachers, introducing a multi-view fine-tuning method that efficiently exploits existing mathematical problem datasets with diverse annotation styles. Our approach uniquely considers the various annotation formats as different "views" and leverages them in training the model. By postpending distinct instructions to input questions, models can learn to generate solutions in diverse formats in a flexible manner. Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches that utilize knowledge distillation, as well as carefully established baselines. Additionally, the proposed method grants the models promising generalization ability across various views and datasets, and the capability to learn from inaccurate or incomplete noisy data. We hope our multi-view training paradigm could inspire future studies in other machine reasoning domains.
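The multi-view trick is a data-formatting one: postpend a different instruction to the same question for each annotation style, so one model learns to emit all solution formats. The instruction wordings below are illustrative, not the paper's exact prompts:

```python
VIEWS = {
    "equation": "Solve by writing the underlying equation.",
    "rationale": "Solve with a step-by-step natural language rationale.",
    "program": "Solve by writing a short Python program.",
}

def make_examples(question, solutions):
    """solutions: dict mapping a view name to that view's gold solution."""
    return [
        {"input": f"{question} {VIEWS[view]}", "target": target}
        for view, target in solutions.items()
    ]

examples = make_examples(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    {"equation": "3 + 2 = 5", "program": "print(3 + 2)"},
)
```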

SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning

  • paper_url: http://arxiv.org/abs/2307.10234
  • repo_url: None
  • paper_authors: Kiana Kheiri, Hamid Karimi
  • for: systematically evaluate Generative Pretrained Transformer (GPT) methodologies for sentiment analysis, specifically on Task 4 of the SemEval 2017 dataset
  • methods: three primary strategies: (1) prompt engineering with the advanced GPT-3.5 Turbo, (2) fine-tuning GPT models, and (3) an inventive approach to embedding classification
  • results: the GPT approaches clearly outperform the state of the art, by more than 22% in F1-score, and better handle complexities such as understanding context and detecting sarcasm
    Abstract This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis, specifically in the context of Task 4 on the SemEval 2017 dataset. Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification. The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations. Additionally, the study compares these GPT-based methodologies with other current, high-performing models previously used with the same dataset. The results illustrate the significant superiority of the GPT approaches in terms of predictive performance, more than 22% in F1-score compared to the state-of-the-art. Further, the paper sheds light on common challenges in sentiment analysis tasks, such as understanding context and detecting sarcasm. It underscores the enhanced capabilities of the GPT models to effectively handle these complexities. Taken together, these findings highlight the promising potential of GPT models in sentiment analysis, setting the stage for future research in this field. The code can be found at https://github.com/DSAatUSU/SentimentGPT
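The third strategy (embedding classification) amounts to fitting a lightweight classifier on frozen GPT-derived sentence embeddings. A sketch with `embed` as a placeholder for whichever embedding model or endpoint is used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(texts):
    # placeholder: return (n, d) embeddings from a GPT-family embedding model
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))

texts = ["great phone", "awful service", "it's fine"]
labels = [2, 0, 1]                      # 0 = negative, 1 = neutral, 2 = positive
clf = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print(clf.predict(embed(["loved it"])))
```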

Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling

  • paper_url: http://arxiv.org/abs/2307.07944
  • repo_url: https://github.com/zhuoxiao-chen/redb-da-3ddet
  • paper_authors: Zhuoxiao Chen, Yadan Luo, Zheng Wang, Mahsa Baktashmotlagh, Zi Huang
  • for: address the performance drop of pseudo-labeling-based unsupervised domain adaptation (DA) for 3D object detection in multi-class training settings
  • methods: the ReDB framework, which learns to detect all classes at once by producing reliable, diverse, and class-balanced pseudo 3D boxes to guide self-training; cross-domain examination (CDE) assesses pseudo-label correctness under environmental discrepancy, and an overlapped boxes counting (OBC) metric uniformly downsamples pseudo-labeled objects to cut computation and mitigate object shift, while the target point clouds are progressively augmented with a class-balanced set of pseudo-labeled instances
  • results: with voxel-based (SECOND) and point-based (PointRCNN) 3D detectors on three benchmark datasets, ReDB outperforms existing 3D DA methods by a large margin, improving mAP by 23.15% on the nuScenes → KITTI task
    Abstract Unsupervised domain adaptation (DA) with the aid of pseudo labeling techniques has emerged as a crucial approach for domain-adaptive 3D object detection. While effective, existing DA methods suffer from a substantial drop in performance when applied to a multi-class training setting, due to the co-existence of low-quality pseudo labels and class imbalance issues. In this paper, we address this challenge by proposing a novel ReDB framework tailored for learning to detect all classes at once. Our approach produces Reliable, Diverse, and class-Balanced pseudo 3D boxes to iteratively guide the self-training on a distributionally different target domain. To alleviate disruptions caused by the environmental discrepancy (e.g., beam numbers), the proposed cross-domain examination (CDE) assesses the correctness of pseudo labels by copy-pasting target instances into a source environment and measuring the prediction consistency. To reduce computational overhead and mitigate the object shift (e.g., scales and point densities), we design an overlapped boxes counting (OBC) metric that allows to uniformly downsample pseudo-labeled objects across different geometric characteristics. To confront the issue of inter-class imbalance, we progressively augment the target point clouds with a class-balanced set of pseudo-labeled target instances and source objects, which boosts recognition accuracies on both frequently appearing and rare classes. Experimental results on three benchmark datasets using both voxel-based (i.e., SECOND) and point-based 3D detectors (i.e., PointRCNN) demonstrate that our proposed ReDB approach outperforms existing 3D domain adaptation methods by a large margin, improving 23.15% mAP on the nuScenes $\rightarrow$ KITTI task. The code is available at https://github.com/zhuoxiao-chen/ReDB-DA-3Ddet.
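The class-balancing ingredient is straightforward to sketch: keep per-class quotas when selecting confident pseudo-boxes so frequent classes cannot drown out rare ones. The threshold and quota rule here are illustrative assumptions, not ReDB's full reliability/diversity scoring:

```python
from collections import defaultdict

def select_balanced(pseudo_boxes, per_class_quota=100, min_conf=0.7):
    """pseudo_boxes: iterable of (class_name, confidence, box) tuples."""
    kept, counts = [], defaultdict(int)
    for cls, conf, box in sorted(pseudo_boxes, key=lambda b: -b[1]):
        if conf >= min_conf and counts[cls] < per_class_quota:
            kept.append((cls, conf, box))
            counts[cls] += 1
    return kept

kept = select_balanced([("car", 0.9, None), ("cyclist", 0.8, None), ("car", 0.6, None)])
```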

KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.07942
  • repo_url: https://github.com/Luoyadan/KECOR-active-3Ddet
  • paper_authors: Yadan Luo, Zhuoxiao Chen, Zhen Fang, Zheng Zhang, Zi Huang, Mahsa Baktashmotlagh
  • for: improve the reliability of LiDAR-based object detectors for autonomous driving while reducing the amount of annotation required
  • methods: a novel kernel coding rate maximization (KECOR) strategy that uses information theory to select the most informative point clouds to annotate
  • results: roughly 44% lower box-level annotation cost and 26% less computation time than state-of-the-art methods, without compromising detection performance
    Abstract Achieving a reliable LiDAR-based object detector in autonomous driving is paramount, but its success hinges on obtaining large amounts of precise 3D annotations. Active learning (AL) seeks to mitigate the annotation burden through algorithms that use fewer labels and can attain performance comparable to fully supervised learning. Although AL has shown promise, current approaches prioritize the selection of unlabeled point clouds with high uncertainty and/or diversity, leading to the selection of more instances for labeling and reduced computational efficiency. In this paper, we resort to a novel kernel coding rate maximization (KECOR) strategy which aims to identify the most informative point clouds to acquire labels through the lens of information theory. Greedy search is applied to seek desired point clouds that can maximize the minimal number of bits required to encode the latent features. To determine the uniqueness and informativeness of the selected samples from the model perspective, we construct a proxy network of the 3D detector head and compute the outer product of Jacobians from all proxy layers to form the empirical neural tangent kernel (NTK) matrix. To accommodate both one-stage (i.e., SECOND) and two-stage detectors (i.e., PVRCNN), we further incorporate the classification entropy maximization and well trade-off between detection performance and the total number of bounding boxes selected for annotation. Extensive experiments conducted on two 3D benchmarks and a 2D detection dataset evidence the superiority and versatility of the proposed approach. Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art AL method, without compromising detection performance.
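The quantity KECOR maximizes is the coding rate of the latent features, R(Z) = ½ log det(I + d/(nε²) ZᵀZ), i.e. the minimal number of bits required to encode Z up to distortion ε; candidates that raise it most are the ones selected for labeling. A minimal sketch of the rate itself:

```python
import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Z: (n, d) latent features of the candidate point clouds."""
    n, d = Z.shape
    gram = Z.T @ Z * (d / (n * eps ** 2))
    return 0.5 * torch.logdet(torch.eye(d) + gram)

Z = torch.randn(50, 16)
print(coding_rate(Z))   # greedy selection keeps samples that increase this most
```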

GeoGPT: Understanding and Processing Geospatial Tasks through An Autonomous GPT

  • paper_url: http://arxiv.org/abs/2307.07930
  • repo_url: None
  • paper_authors: Yifan Zhang, Cheng Wei, Shangyou Wu, Zhengting He, Wenhao Yu
  • For: The paper explores a new framework called GeoGPT that integrates the semantic understanding ability of large language models (LLMs) with mature tools within the GIS community, aiming to lower the threshold for non-professional users to solve geospatial tasks.
  • Methods: The paper utilizes Generative Pre-trained Transformers (e.g., ChatGPT) and AutoGPT to enable the framework to automatically reason and call externally defined tools; the framework is validated on several cases including geospatial data crawling, spatial query, facility siting, and mapping.
  • Results: The results show that GeoGPT can conduct geospatial data collection, processing, and analysis in an autonomous manner with the instruction of only natural language, and that the "foundational plus professional" paradigm implied in GeoGPT provides an effective way to develop next-generation GIS in this era of large foundation models.
    Abstract Decision-makers in GIS need to combine a series of spatial algorithms and operations to solve geospatial tasks. For example, in the task of facility siting, the Buffer tool is usually first used to locate areas close or away from some specific entities; then, the Intersect or Erase tool is used to select candidate areas satisfied multiple requirements. Though professionals can easily understand and solve these geospatial tasks by sequentially utilizing relevant tools, it is difficult for non-professionals to handle these problems. Recently, Generative Pre-trained Transformer (e.g., ChatGPT) presents strong performance in semantic understanding and reasoning. Especially, AutoGPT can further extend the capabilities of large language models (LLMs) by automatically reasoning and calling externally defined tools. Inspired by these studies, we attempt to lower the threshold of non-professional users to solve geospatial tasks by integrating the semantic understanding ability inherent in LLMs with mature tools within the GIS community. Specifically, we develop a new framework called GeoGPT that can conduct geospatial data collection, processing, and analysis in an autonomous manner with the instruction of only natural language. In other words, GeoGPT is used to understand the demands of non-professional users merely based on input natural language descriptions, and then think, plan, and execute defined GIS tools to output final effective results. Several cases including geospatial data crawling, spatial query, facility siting, and mapping validate the effectiveness of our framework. Though limited cases are presented in this paper, GeoGPT can be further extended to various tasks by equipping with more GIS tools, and we think the paradigm of "foundational plus professional" implied in GeoGPT provides an effective way to develop next-generation GIS in this era of large foundation models.
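The Buffer-then-Intersect chain described above is exactly the kind of tool sequence GeoGPT composes from a natural-language request. Sketched directly with GeoPandas (file names and distances are placeholders):

```python
import geopandas as gpd

roads = gpd.read_file("roads.shp").to_crs(epsg=3857)     # metric CRS for buffering
rivers = gpd.read_file("rivers.shp").to_crs(epsg=3857)

near_roads = roads.buffer(500).unary_union         # within 500 m of a road
near_rivers = rivers.buffer(1000).unary_union      # within 1 km of a river
candidates = near_roads.intersection(near_rivers)  # areas satisfying both requirements
```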

Neural Architecture Retrieval

  • paper_url: http://arxiv.org/abs/2307.07919
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Xiaohuan Pei, Yanxi Li, Minjing Dong, Chang Xu
  • for: retrieve existing neural architectures with designs similar to a query architecture, helping researchers situate their contributions relative to existing architectures and identify connections between designs
  • methods: divides the computational graph into motifs that are used to rebuild the macro graph, and introduces multi-level contrastive learning for accurate graph representation learning
  • results: extensive evaluations on both human-designed and synthesized neural architectures demonstrate the superiority of the algorithm; a dataset of 12k real-world network architectures, together with their embeddings, is built for neural architecture retrieval
    Abstract With the increasing number of new neural architecture designs and substantial existing neural architectures, it becomes difficult for the researchers to situate their contributions compared with existing neural architectures or establish the connections between their designs and other relevant ones. To discover similar neural architectures in an efficient and automatic manner, we define a new problem Neural Architecture Retrieval which retrieves a set of existing neural architectures which have similar designs to the query neural architecture. Existing graph pre-training strategies cannot address the computational graph in neural architectures due to the graph size and motifs. To fulfill this potential, we propose to divide the graph into motifs which are used to rebuild the macro graph to tackle these issues, and introduce multi-level contrastive learning to achieve accurate graph representation learning. Extensive evaluations on both human-designed and synthesized neural architectures demonstrate the superiority of our algorithm. Such a dataset which contains 12k real-world network architectures, as well as their embedding, is built for neural architecture retrieval.
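The workhorse of a multi-level contrastive objective is an InfoNCE-style loss over graph (or motif) embeddings: two views of the same architecture should land close together and away from other architectures in the batch. A standard sketch; how the views and levels are constructed is the paper's contribution and is not shown:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """z1, z2: (B, d) embeddings of two views of the same B architectures."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                 # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0])       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```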

Recognition of Mental Adjectives in An Efficient and Automatic Style

  • paper_url: http://arxiv.org/abs/2307.11767
  • repo_url: None
  • paper_authors: Fei Yang
  • for: propose a new lexical inference task, Mental and Physical Classification (MPC), to handle commonsense reasoning in a reasoning graph
  • methods: a fine-tuned BERT model, with an active learning algorithm adopted in the training framework to reduce the required annotation resources
  • results: using the ENTROPY strategy, the model achieves satisfactory accuracy with only about 300 labeled words; a comparison with SentiWordNet examines how MPC differs from the subjectivity classification task in sentiment analysis
    Abstract In recent years, commonsense reasoning has received more and more attention from the academic community. We propose a new lexical inference task, Mental and Physical Classification (MPC), to handle commonsense reasoning in a reasoning graph. Mental words relate to mental activities, which fall into six categories: Emotion, Need, Perceiving, Reasoning, Planning and Personality. Physical words describe physical attributes of an object, like color, hardness, speed and malleability. A BERT model is fine-tuned for this task and an active learning algorithm is adopted in the training framework to reduce the required annotation resources. The model using the ENTROPY strategy achieves satisfactory accuracy and requires only about 300 labeled words. We also compare our result with SentiWordNet to check the difference between MPC and the subjectivity classification task in sentiment analysis.
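The ENTROPY acquisition strategy mentioned above can be sketched in a few lines: score each unlabeled word by the entropy of the classifier's predicted distribution and send the most uncertain ones for labeling. `probs` would come from the fine-tuned BERT classifier:

```python
import torch

def entropy_select(probs: torch.Tensor, k: int) -> torch.Tensor:
    """probs: (n_words, n_classes) softmax outputs; returns indices to label next."""
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return ent.topk(k).indices

probs = torch.softmax(torch.randn(1000, 7), dim=1)   # e.g. 6 mental classes + physical
picks = entropy_select(probs, k=20)
```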

Is Imitation All You Need? Generalized Decision-Making with Dual-Phase Training

  • paper_url: http://arxiv.org/abs/2307.07909
  • repo_url: https://github.com/yunyikristy/dualmind
  • paper_authors: Yao Wei, Yanchao Sun, Ruijie Zheng, Sai Vemprala, Rogerio Bonatti, Shuhang Chen, Ratnesh Madaan, Zhongjie Ba, Ashish Kapoor, Shuang Ma
  • for: address challenges posed by current methods, such as overfitting behaviors and dependence on task-specific fine-tuning
  • methods: a novel "Dual-phase" training strategy that emulates how humans learn to act: the model first learns fundamental common knowledge through a self-supervised objective tailored for control tasks, and then learns to make decisions by imitating behaviors conditioned on given prompts
  • results: extensive experiments on MetaWorld and Habitat show it outperforms other generalist agents by over 70% on MetaWorld and 50% on Habitat, achieving a success rate above 90% on more than 30 of MetaWorld's 45 tasks
    Abstract We introduce DualMind, a generalist agent designed to tackle various decision-making tasks that addresses challenges posed by current methods, such as overfitting behaviors and dependence on task-specific fine-tuning. DualMind uses a novel "Dual-phase" training strategy that emulates how humans learn to act in the world. The model first learns fundamental common knowledge through a self-supervised objective tailored for control tasks and then learns how to make decisions based on different contexts through imitating behaviors conditioned on given prompts. DualMind can handle tasks across domains, scenes, and embodiments using just a single set of model weights and can execute zero-shot prompting without requiring task-specific fine-tuning. We evaluate DualMind on MetaWorld and Habitat through extensive experiments and demonstrate its superior generalizability compared to previous techniques, outperforming other generalist agents by over 50% and 70% on Habitat and MetaWorld, respectively. On the 45 tasks in MetaWorld, DualMind achieves over 30 tasks at a 90% success rate.

A Dialogue System for Assessing Activities of Daily Living: Improving Consistency with Grounded Knowledge

  • paper_url: http://arxiv.org/abs/2307.07544
  • repo_url: None
  • paper_authors: Zhecheng Sheng, Raymond Finzel, Michael Lucke, Sheena Dufresne, Maria Gini, Serguei Pakhomov
  • for: a dialogue system that helps assessors, especially novice assessors, practice evaluating participants' functional abilities
  • methods: two major modules, natural language understanding (NLU) and natural language generation (NLG), which simulate realistic assessment conversations with individuals of varying functioning
  • results: using recently released InstructGPT-like models, the system generates responses conditioned on the simulated individual's biographical details and the user's query, keeping its responses consistent with the underlying knowledge base
    Abstract In healthcare, the ability to care for oneself is reflected in the "Activities of Daily Living (ADL)," which serve as a measure of functional ability (functioning). A lack of functioning may lead to poor living conditions requiring personal care and assistance. To accurately identify those in need of support, assistance programs continuously evaluate participants' functioning across various domains. However, the assessment process may encounter consistency issues when multiple assessors with varying levels of expertise are involved. Novice assessors, in particular, may lack the necessary preparation for real-world interactions with participants. To address this issue, we developed a dialogue system that simulates interactions between assessors and individuals of varying functioning in a natural and reproducible way. The dialogue system consists of two major modules, one for natural language understanding (NLU) and one for natural language generation (NLG), respectively. In order to generate responses consistent with the underlying knowledge base, the dialogue system requires both an understanding of the user's query and of biographical details of an individual being simulated. To fulfill this requirement, we experimented with query classification and generated responses based on those biographical details using some recently released InstructGPT-like models.

Anomaly Detection in Automated Fibre Placement: Learning with Data Limitations

  • paper_url: http://arxiv.org/abs/2307.07893
  • repo_url: None
  • paper_authors: Assef Ghamisi, Todd Charter, Li Ji, Maxime Rivard, Gil Lund, Homayoun Najjaran
  • for: 实时检测 Automated Fibre Placement (AFP) 中的瑕疵,以免让产品质量受到影响。
  • methods: 融合深度学习和传统computer vision算法,不需要大量的 Labelled defective samples 进行训练。使用对称性优化的抽取方法,从 AFP 的 fibre layup 表面中提取地方样本,并训练 autoencoder 来检测瑕疵。
  • results: 这种方法可以实时检测 AFP 中的瑕疵,并对产品质量进行严格的检测,而不需要大量的 Labelled defective samples。实验结果显示,这种方法可以实现高精度的瑕疵检测,并且可以检测到所有类型的瑕疵。
    Abstract Conventional defect detection systems in Automated Fibre Placement (AFP) typically rely on end-to-end supervised learning, necessitating a substantial number of labelled defective samples for effective training. However, the scarcity of such labelled data poses a challenge. To overcome this limitation, we present a comprehensive framework for defect detection and localization in Automated Fibre Placement. Our approach combines unsupervised deep learning and classical computer vision algorithms, eliminating the need for labelled data or manufacturing defect samples. It efficiently detects various surface issues while requiring fewer images of composite parts for training. Our framework employs an innovative sample extraction method leveraging AFP's inherent symmetry to expand the dataset. By inputting a depth map of the fibre layup surface, we extract local samples aligned with each composite strip (tow). These samples are processed through an autoencoder, trained on normal samples for precise reconstructions, highlighting anomalies through reconstruction errors. Aggregated values form an anomaly map for insightful visualization. The framework employs blob detection on this map to locate manufacturing defects. The experimental findings reveal that despite training the autoencoder with a limited number of images, our proposed method exhibits satisfactory detection accuracy and accurately identifies defect locations. Our framework demonstrates comparable performance to existing methods, while also offering the advantage of detecting all types of anomalies without relying on an extensive labelled dataset of defects.
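The scoring step reduces to reconstruction error: an autoencoder trained only on normal layup samples reconstructs each local depth patch, and patches it reconstructs poorly are flagged. A numpy sketch with `autoencoder.predict` as a Keras-style placeholder:

```python
import numpy as np

def patch_scores(patches: np.ndarray, autoencoder) -> np.ndarray:
    """patches: (n, h, w) local depth samples extracted along each tow."""
    recon = autoencoder.predict(patches)
    return ((patches - recon) ** 2).mean(axis=(1, 2))   # one anomaly score per patch

# scores are aggregated spatially into the anomaly map,
# on which blob detection then localizes the defects
```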

Handwritten and Printed Text Segmentation: A Signature Case Study

  • paper_url: http://arxiv.org/abs/2307.07887
  • repo_url: None
  • paper_authors: Sina Gholamian, Ali Vahdat
  • for: resolve the overlap between handwritten and printed text in scanned documents, improving Optical Character Recognition (OCR) and the digitization pipeline
  • methods: new approaches to handwritten and printed text segmentation, including a new dataset, SignaTR6K, collected from real legal documents, and a new model architecture
  • results: the best configuration outperforms prior work on two different datasets by 17.9% and 7.3% in IoU scores
    Abstract While analyzing scanned documents, handwritten text can overlap with printed text. This overlap causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses solely on the binary classification of handwritten text or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This approach results in the assignment of overlapping handwritten and printed pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches to address the challenges of handwritten and printed text segmentation. Our objective is to recover text from different classes in their entirety, especially enhancing the segmentation performance on overlapping sections. To support this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for the handwritten and printed text segmentation task. Our best configuration outperforms prior work on two different datasets by 17.9% and 7.3% on IoU scores. The SignaTR6K dataset is accessible for download via the following link: https://forms.office.com/r/2a5RDg7cAY.
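One natural way to honor the paper's goal of recovering each class in its entirety is multi-label segmentation: per-class sigmoid outputs instead of a mutually exclusive softmax, so an overlapping pixel can be both handwritten and printed. This framing is an assumption for illustration, not necessarily the authors' exact output head:

```python
import torch
import torch.nn as nn

logits = torch.randn(1, 3, 256, 256)   # channels: handwritten, printed, background
probs = torch.sigmoid(logits)          # each pixel may activate several classes
masks = probs > 0.5                    # an overlap pixel can be 1 in both text channels

loss_fn = nn.BCEWithLogitsLoss()       # trained against multi-label ground truth
target = torch.randint(0, 2, (1, 3, 256, 256)).float()
loss = loss_fn(logits, target)
```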

Online Goal Recognition in Discrete and Continuous Domains Using a Vectorial Representation

  • paper_url: http://arxiv.org/abs/2307.07876
  • repo_url: None
  • paper_authors: Douglas Tesch, Leonardo Rosa Amado, Felipe Meneguzzi
  • for: propose an efficient online goal recognition method that works in both discrete and continuous domains
  • methods: a single call to the planner for each possible goal in discrete domains, or a simplified motion model that reduces the computational burden in continuous ones
  • results: the online recognition component runs orders of magnitude faster than the current state of the art, making this the first online method effectively usable for robotics applications that require sub-second recognition
    Abstract While recent work on online goal recognition efficiently infers goals under low observability, comparatively less work focuses on online goal recognition that works in both discrete and continuous domains. Online goal recognition approaches often rely on repeated calls to the planner at each new observation, incurring high computational costs. Recognizing goals online in continuous space quickly and reliably is critical for any trajectory planning problem since the real physical world is fast-moving, e.g. robot applications. We develop an efficient method for goal recognition that relies either on a single call to the planner for each possible goal in discrete domains or a simplified motion model that reduces the computational burden in continuous ones. The resulting approach performs the online component of recognition orders of magnitude faster than the current state of the art, making it the first online method effectively usable for robotics applications that require sub-second recognition.
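The discrete-domain recipe boils down to one planner call per candidate goal up front, after which each new observation only requires rescoring, not replanning. A toy sketch with `plan` as a placeholder for any off-the-shelf planner returning an action sequence, and a deliberately simple matching score:

```python
def recognize(observations, goals, plan):
    plans = {g: plan(g) for g in goals}          # single planner call per goal

    def score(p):                                # fraction of observed actions on-plan
        return sum(a in p for a in observations) / max(len(observations), 1)

    return max(goals, key=lambda g: score(plans[g]))

# e.g. recognize(["pick", "move"], ["goal_a", "goal_b"], my_planner)
```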

Does Double Descent Occur in Self-Supervised Learning?

  • paper_url: http://arxiv.org/abs/2307.07872
  • repo_url: https://github.com/yonatangideoni/double_descent_tiny_paper
  • paper_authors: Alisia Lupidi, Yonatan Gideoni, Dulhan Jayalath
  • for: most investigations into double descent focus on supervised models, while studies of self-supervised settings are surprisingly absent
  • methods: experiments with two previously unstudied settings, a standard autoencoder and a linear autoencoder
  • results: the test loss either exhibits a classical U-shape or decreases monotonically, rather than displaying a double-descent curve
    Abstract Most investigations into double descent have focused on supervised models while the few works studying self-supervised settings find a surprising lack of the phenomenon. These results imply that double descent may not exist in self-supervised models. We show this empirically using a standard and linear autoencoder, two previously unstudied settings. The test loss is found to have either a classical U-shape or to monotonically decrease instead of exhibiting a double-descent curve. We hope that further work on this will help elucidate the theoretical underpinnings of this phenomenon.
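The experiment behind the finding is a width sweep: train autoencoders of increasing hidden size and record the test reconstruction loss, looking for (the absence of) a double-descent bump. A compressed sketch of that loop on synthetic data:

```python
import torch
import torch.nn as nn

def test_loss(width, train, test, epochs=200):
    model = nn.Sequential(nn.Linear(20, width), nn.Linear(width, 20))  # linear AE
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(train) - train) ** 2).mean()
        loss.backward()
        opt.step()
    return ((model(test) - test) ** 2).mean().item()

train, test = torch.randn(64, 20), torch.randn(256, 20)
curve = {w: test_loss(w, train, test) for w in (2, 8, 32, 128)}
print(curve)   # U-shaped or monotone, rather than double-descending
```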

The SocialAI School: Insights from Developmental Psychology Towards Artificial Socio-Cultural Agents

  • paper_url: http://arxiv.org/abs/2307.07871
  • repo_url: None
  • paper_authors: Grgur Kovač, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
  • for: Apply knowledge from developmental psychology to AI in order to build more intelligent socially interactive agents.
  • methods: Draws on the theories of psychologists Michael Tomasello and Jerome Bruner and proposes a customizable, parameterized experimental design for studying the development of social intelligence.
  • results: Introduces the SocialAI school, a learning platform that helps researchers conduct experiments on social intelligence, together with example experiments.
    Abstract Developmental psychologists have long-established the importance of socio-cognitive abilities in human intelligence. These abilities enable us to enter, participate and benefit from human culture. AI research on social interactive agents mostly concerns the emergence of culture in a multi-agent setting (often without a strong grounding in developmental psychology). We argue that AI research should be informed by psychology and study socio-cognitive abilities enabling to enter a culture too. We discuss the theories of Michael Tomasello and Jerome Bruner to introduce some of their concepts to AI and outline key concepts and socio-cognitive abilities. We present The SocialAI school - a tool including a customizable parameterized uite of procedurally generated environments, which simplifies conducting experiments regarding those concepts. We show examples of such experiments with RL agents and Large Language Models. The main motivation of this work is to engage the AI community around the problem of social intelligence informed by developmental psychology, and to provide a tool to simplify first steps in this direction. Refer to the project website for code and additional information: https://sites.google.com/view/socialai-school.

Large Language Models as Superpositions of Cultural Perspectives

  • paper_url: http://arxiv.org/abs/2307.07870
  • repo_url: None
  • paper_authors: Grgur Kovač, Masataka Sawayama, Rémy Portelas, Cédric Colas, Peter Ford Dominey, Pierre-Yves Oudeyer
  • for: Investigates whether Large Language Models (LLMs) can be seen as a superposition of perspectives and values.
  • methods: Uses psychology questionnaires (PVQ, VSM, IPIP) to study how the values and personality traits exhibited by LLMs change across contexts.
  • results: Finds that the values and personality traits LLMs express are context-dependent across prompts, and that they can be steered through various perspective-inducing methods.
    Abstract Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at https://sites.google.com/view/llm-superpositions .

Benchmarking the Effectiveness of Classification Algorithms and SVM Kernels for Dry Beans

  • paper_url: http://arxiv.org/abs/2307.07863
  • repo_url: None
  • paper_authors: Anant Mehta, Prajit Sengupta, Divisha Garg, Harpreet Singh, Yosi Shacham Diamand
  • for: Help plant breeders and agricultural researchers increase crop productivity by analysing the Dry Bean dataset to identify desirable traits, disease resistance, and nutritional content.
  • methods: Compares Support Vector Machine (SVM) classification algorithms with linear, polynomial, and radial basis function (RBF) kernels, along with other popular classifiers, on the Dry Bean dataset, using Principal Component Analysis (PCA) for dimensionality reduction as a preprocessing step.
  • results: Evaluated on accuracy of 93.34%, precision of 92.61%, recall of 92.35%, and an F1 score of 91.40%, the RBF SVM kernel achieves the highest accuracy. Together with suitable visualization and empirical analysis, the study offers valuable guidance for classifying complex, non-linearly structured datasets.
    Abstract Plant breeders and agricultural researchers can increase crop productivity by identifying desirable features, disease resistance, and nutritional content by analysing the Dry Bean dataset. This study analyses and compares different Support Vector Machine (SVM) classification algorithms, namely linear, polynomial, and radial basis function (RBF), along with other popular classification algorithms. The analysis is performed on the Dry Bean Dataset, with PCA (Principal Component Analysis) conducted as a preprocessing step for dimensionality reduction. The primary evaluation metric used is accuracy, and the RBF SVM kernel algorithm achieves the highest Accuracy of 93.34%, Precision of 92.61%, Recall of 92.35% and F1 Score as 91.40%. Along with adept visualization and empirical analysis, this study offers valuable guidance by emphasizing the importance of considering different SVM algorithms for complex and non-linear structured datasets.
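The evaluated pipeline maps directly onto a few lines of scikit-learn: standardize, reduce dimensionality with PCA, then fit an RBF-kernel SVM. The synthetic data below stands in for the 16-feature, 7-class Dry Bean dataset, and the hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 16-feature, 7-class Dry Bean dataset.
X, y = make_classification(n_samples=2000, n_features=16, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scale -> PCA (dimensionality reduction) -> RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), PCA(n_components=8),
                    SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```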

Automated Knowledge Modeling for Cancer Clinical Practice Guidelines

  • paper_url: http://arxiv.org/abs/2307.10231
  • repo_url: None
  • paper_authors: Pralaypati Ta, Bhumika Gupta, Arihant Jain, Sneha Sree C, Arunima Sarkar, Keerthi Ram, Mohanasankar Sivaprakasam
  • for: Extract knowledge from cancer Clinical Practice Guidelines (CPGs) and generate a structured model that enables programmatic interaction with the guidelines.
  • methods: An automated method extracts knowledge from National Comprehensive Cancer Network (NCCN) CPGs in Oncology and generates a structured model. Three enrichment strategies — cancer staging information, Unified Medical Language System (UMLS) Metathesaurus & National Cancer Institute thesaurus (NCIt) concepts, and node classification — are proposed to enhance the model towards programmatic traversal and querying of cancer care guidelines.
  • results: Node classification with a Support Vector Machine (SVM) model achieves a classification accuracy of 0.81 under 10-fold cross-validation.
    Abstract Clinical Practice Guidelines (CPGs) for cancer diseases evolve rapidly due to new evidence generated by active research. Currently, CPGs are primarily published in a document format that is ill-suited for managing this developing knowledge. A knowledge model of the guidelines document suitable for programmatic interaction is required. This work proposes an automated method for extraction of knowledge from National Comprehensive Cancer Network (NCCN) CPGs in Oncology and generating a structured model containing the retrieved knowledge. The proposed method was tested using two versions of NCCN Non-Small Cell Lung Cancer (NSCLC) CPG to demonstrate the effectiveness in faithful extraction and modeling of knowledge. Three enrichment strategies using Cancer staging information, Unified Medical Language System (UMLS) Metathesaurus & National Cancer Institute thesaurus (NCIt) concepts, and Node classification are also presented to enhance the model towards enabling programmatic traversal and querying of cancer care guidelines. The Node classification was performed using a Support Vector Machine (SVM) model, achieving a classification accuracy of 0.81 with 10-fold cross-validation.
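The node-classification step reported above (an SVM evaluated with 10-fold cross-validation) is sketched below with scikit-learn; the node feature vectors are random placeholders, since the paper's actual node representations are not given here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_nodes = rng.normal(size=(300, 64))      # placeholder node feature vectors
y_nodes = rng.integers(0, 4, size=300)    # placeholder node type labels

svm = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(svm, X_nodes, y_nodes, cv=10)  # 10-fold cross-validation
print(f"mean accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```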

A Multi-Heuristic Search-based Motion Planning for Automated Parking

  • paper_url: http://arxiv.org/abs/2307.07857
  • repo_url: None
  • paper_authors: Bhargav Adabala, Zlatan Ajanović
  • for: Real-time motion planning in unstructured environments such as parking lots or construction sites is challenging because of the large search space and the kinodynamic constraints of the vehicle.
  • methods: A Multi-Heuristic Search approach that employs multiple heuristic functions to capture different complexities of the search space and exploit the individual strengths of each heuristic.
  • results: Benchmarked against the Hybrid A* algorithm, the Multi-Heuristic A* algorithm outperforms it in both computational efficiency and motion plan (path) quality.
    Abstract In unstructured environments like parking lots or construction sites, due to the large search-space and kinodynamic constraints of the vehicle, it is challenging to achieve real-time planning. Several state-of-the-art planners utilize heuristic search-based algorithms. However, they heavily rely on the quality of the single heuristic function, used to guide the search. Therefore, they are not capable to achieve reasonable computational performance, resulting in unnecessary delays in the response of the vehicle. In this work, we are adopting a Multi-Heuristic Search approach, that enables the use of multiple heuristic functions and their individual advantages to capture different complexities of a given search space. Based on our knowledge, this approach was not used previously for this problem. For this purpose, multiple admissible and non-admissible heuristic functions are defined, the original Multi-Heuristic A* Search was extended for bidirectional use and dealing with hybrid continuous-discrete search space, and a mechanism for adapting scale of motion primitives is introduced. To demonstrate the advantage, the Multi-Heuristic A* algorithm is benchmarked against a very popular heuristic search-based algorithm, Hybrid A*. The Multi-Heuristic A* algorithm outperformed baseline in both terms, computation efficiency and motion plan (path) quality.
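The key mechanism — several heuristics, each driving its own priority queue, all sharing one set of g-values — can be sketched on a toy grid. This loosely follows the round-robin structure of Multi-Heuristic A* but simplifies away anchor-queue admissibility bookkeeping; the grid, heuristics, and weight are illustrative.

```python
import heapq

def mha_star(start, goal, passable, heuristics, w=1.5):
    """Simplified multi-heuristic A* on a 4-connected grid.
    One priority queue per heuristic; queues are expanded round-robin
    and share a single g-value table, so progress made under one
    heuristic immediately benefits the others."""
    g = {start: 0.0}
    queues = [[(w * h(start, goal), start)] for h in heuristics]
    closed = set()
    while any(queues):
        for q in queues:                          # round-robin over heuristics
            if not q:
                continue
            _, s = heapq.heappop(q)
            if s == goal:
                return g[s]
            if s in closed:
                continue
            closed.add(s)
            x, y = s
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if passable(nxt) and g[s] + 1 < g.get(nxt, float("inf")):
                    g[nxt] = g[s] + 1
                    for q2, h2 in zip(queues, heuristics):
                        heapq.heappush(q2, (g[nxt] + w * h2(nxt, goal), nxt))
    return None

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
zero = lambda a, b: 0                             # uninformed fallback heuristic
free = lambda p: 0 <= p[0] < 20 and 0 <= p[1] < 20 and p != (10, 10)
print(mha_star((0, 0), (19, 19), free, [manhattan, zero]))  # -> 38
```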

AspectCSE: Sentence Embeddings for Aspect-based Semantic Textual Similarity using Contrastive Learning and Structured Knowledge

  • paper_url: http://arxiv.org/abs/2307.07851
  • repo_url: None
  • paper_authors: Tim Schopf, Emanuel Gerber, Malte Ostendorff, Florian Matthes
  • for: This paper aims to improve the accuracy of information retrieval tasks through aspect-based contrastive learning of sentence embeddings.
  • methods: The paper proposes a new approach called AspectCSE, which uses aspect-based contrastive learning to train sentence embeddings that can capture specific aspects of textual similarity.
  • results: The paper reports an average improvement of 3.97% on information retrieval tasks across multiple aspects compared to the previous best results. Additionally, the paper shows that multi-aspect embeddings outperform single-aspect embeddings on aspect-specific information retrieval tasks.
    Abstract Generic sentence embeddings provide a coarse-grained approximation of semantic textual similarity but ignore specific aspects that make texts similar. Conversely, aspect-based sentence embeddings provide similarities between texts based on certain predefined aspects. Thus, similarity predictions of texts are more targeted to specific requirements and more easily explainable. In this paper, we present AspectCSE, an approach for aspect-based contrastive learning of sentence embeddings. Results indicate that AspectCSE achieves an average improvement of 3.97% on information retrieval tasks across multiple aspects compared to the previous best results. We also propose using Wikidata knowledge graph properties to train models of multi-aspect sentence embeddings in which multiple specific aspects are simultaneously considered during similarity predictions. We demonstrate that multi-aspect embeddings outperform single-aspect embeddings on aspect-specific information retrieval tasks. Finally, we examine the aspect-based sentence embedding space and demonstrate that embeddings of semantically similar aspect labels are often close, even without explicit similarity training between different aspect labels.
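At its core, aspect-based contrastive learning uses an InfoNCE-style objective where the positive for each anchor is a text sharing the same aspect, and the other in-batch items serve as negatives. A minimal PyTorch sketch follows; the encoder and batch construction are placeholder assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def aspect_info_nce(anchor, positive, temperature=0.05):
    """InfoNCE over a batch: row i of `positive` shares anchor i's aspect;
    all other rows act as in-batch negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature        # pairwise cosine similarities
    labels = torch.arange(a.size(0))      # the matching positive sits on the diagonal
    return F.cross_entropy(logits, labels)

# Placeholder sentence embeddings, e.g. from a BERT-style encoder.
anchor = torch.randn(32, 768, requires_grad=True)
positive = torch.randn(32, 768)
loss = aspect_info_nce(anchor, positive)
loss.backward()
print(float(loss))
```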

AIOptimizer – A reinforcement learning-based software performance optimisation prototype for cost minimisation

  • paper_url: http://arxiv.org/abs/2307.07846
  • repo_url: None
  • paper_authors: Noopur Zambare
  • for: Introduces AIOptimizer, a prototype software performance optimisation tool based on cost reduction. AIOptimizer uses a reinforcement learning-driven recommendation system to improve the efficiency and affordability of software systems.
  • methods: Employs a modular design, data-gathering techniques, continuous learning, and resilient integration to provide effective and user-centric performance optimisation solutions.
  • results: Highlights AIOptimizer's design factors such as accuracy, adaptability, scalability, and user-friendliness, and examines its features including fault identification, cost optimisation recommendations, efficiency prediction, and cooperation.
    Abstract This research article introduces AIOptimizer, a prototype for a software performance optimisation tool based on cost reduction. AIOptimizer uses a recommendation system driven by reinforcement learning to improve software system efficiency and affordability. The paper highlights AIOptimizer's design factors, such as accuracy, adaptability, scalability, and user-friendliness. To provide effective and user-centric performance optimisation solutions, it emphasises the use of a modular design, data gathering techniques, continuous learning, and resilient integration. The article also investigates AIOptimizer features such as fault identification, cost optimisation recommendations, efficiency prediction, and cooperation. Furthermore, it explores several software development life cycle models and introduces AIOptimizer uses a reinforcement learning-based recommendation engine for cost optimisation. The purpose of this research study is to highlight AIOptimizer as a prototype that uses advanced optimisation techniques and smart recommendation systems to continually enhance software performance and save expenses. The research focuses on various software development life cycle models, such as the Waterfall model, Iterative model, Spiral model, V-Model, Big Bang model and Agile Model. Each model has advantages and disadvantages, and their usefulness is determined by the project's specifications and characteristics. The AIOptimizer tool is a theoretical prototype for such software performance optimizers.

RegExplainer: Generating Explanations for Graph Neural Networks in Regression Task

  • paper_url: http://arxiv.org/abs/2307.07840
  • repo_url: None
  • paper_authors: Jiaxing Zhang, Zhuomin Chen, Hao Mei, Dongsheng Luo, Hua Wei
  • for: Explain the behaviour of graph neural networks in regression tasks (XAIG-R), to better understand how GNN regression models work.
  • methods: Proposes a novel objective based on the information bottleneck theory, a mix-up framework that supports various GNNs in a model-agnostic manner, and a contrastive learning strategy to handle the continuously ordered labels in regression tasks.
  • results: Extensive experiments on three benchmark datasets and one real-world dataset demonstrate the effectiveness of the proposed method in interpreting GNN models in regression tasks.
    Abstract Graph regression is a fundamental task and has received increasing attention in a wide range of graph learning tasks. However, the inference process is often not interpretable. Most existing explanation techniques are limited to understanding GNN behaviors in classification tasks. In this work, we seek an explanation to interpret the graph regression models (XAIG-R). We show that existing methods overlook the distribution shifting and continuously ordered decision boundary, which hinders them away from being applied in the regression tasks. To address these challenges, we propose a novel objective based on the information bottleneck theory and introduce a new mix-up framework, which could support various GNNs in a model-agnostic manner. We further present a contrastive learning strategy to tackle the continuously ordered labels in regression task. To empirically verify the effectiveness of the proposed method, we introduce three benchmark datasets and a real-life dataset for evaluation. Extensive experiments show the effectiveness of the proposed method in interpreting GNN models in regression tasks.
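For reference, the generic graph information bottleneck trade-off underlying such objectives has the form below, written abstractly; the paper's regression-specific instantiation and mix-up framework differ in detail:

```latex
% G_s: explanatory subgraph, Y: regression target, G: input graph,
% beta: trade-off between predictiveness and compression.
\min_{G_s}\; -\,I\!\left(G_s;\, Y\right) \;+\; \beta\, I\!\left(G_s;\, G\right)
```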

cs.CL - 2023-07-16

Automatic Identification of Alzheimer’s Disease using Lexical Features extracted from Language Samples

  • paper_url: http://arxiv.org/abs/2307.08070
  • repo_url: None
  • paper_authors: M. Zakaria Kurdi
  • for: Improve the understanding of how Dementia of type Alzheimer's Disease (AD) affects different aspects of the lexicon, and show that such lexical features, used as input to a machine learning classifier, can achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD.
  • methods: Uses the ADDreSS challenge dataset, part of the DementiaBank corpus, consisting of Cookie Theft picture description transcripts from 54 subjects in the training set and 24 in the test set (108 and 48 narrative samples, respectively). The impact of AD on 99 selected lexical features is studied first, using both parts of the dataset; machine learning experiments across different areas of lexical complexity then identify the feature subsets that achieve optimal performance, and the effect of input size on classification is also examined.
  • results: Using lexical features alone, state-of-the-art performance with accuracy above 91% is achieved in automatically distinguishing language samples produced by individuals with AD from those produced by healthy control subjects, showing that AD has a substantial impact on lexical processing.
    Abstract Objective: this study has a twofold goal. First, it aims to improve the understanding of the impact of Dementia of type Alzheimer's Disease (AD) on different aspects of the lexicon. Second, it aims to demonstrate that such aspects of the lexicon, when used as features of a machine learning classifier, can help achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD. Methods: data is derived from the ADDreSS challenge, which is a part of the DementiaBank corpus. The used dataset consists of transcripts of Cookie Theft picture descriptions, produced by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is 108 in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features is studied using both the training and testing parts of the dataset. Then some machine learning experiments were conducted on the task of classifying transcribed speech samples with text samples that were produced by people with AD from those produced by normal subjects. Several experiments were conducted to compare the different areas of lexical complexity, identify the subset of features that help achieve optimal performance, and study the impact of the size of the input on the classification. To evaluate the generalization of the models built on narrative speech, two generalization tests were conducted using written data from two British authors, Iris Murdoch and Agatha Christie, and the transcription of some speeches by former President Ronald Reagan. Results: using lexical features only, state-of-the-art classification, F1 and accuracies, of over 91% were achieved in categorizing language samples produced by individuals with AD from the ones produced by healthy control subjects. This confirms the substantial impact of AD on lexicon processing.
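The pipeline — compute lexical features per transcript, then train a classifier — can be sketched as follows. The two features shown (type-token ratio and mean word length) stand in for the paper's 99 features, and the toy transcripts are fabricated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lexical_features(transcript):
    words = transcript.lower().split()
    ttr = len(set(words)) / len(words)            # type-token ratio
    mean_len = sum(map(len, words)) / len(words)  # mean word length
    return [ttr, mean_len]

# Toy stand-ins for Cookie Theft picture-description transcripts.
samples = [("the boy is taking a cookie from the jar", 0),
           ("the the boy um cookie cookie jar", 1),
           ("the woman is washing dishes while water overflows", 0),
           ("water um water dishes um the the sink", 1)]
X = np.array([lexical_features(t) for t, _ in samples])
y = np.array([label for _, label in samples])

clf = LogisticRegression().fit(X, y)  # 0 = control, 1 = AD (illustrative labels)
print(clf.predict(X))
```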

Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2307.07994
  • repo_url: https://github.com/jfzhouyoo/supporter
  • paper_authors: Jinfeng Zhou, Zhuang Chen, Bo Wang, Minlie Huang
  • for: Provide emotional support (ES) to improve the seeker's mental state.
  • methods: A mixture-of-experts model trained with reinforcement learning, with rewards assessing dialogue coherence and the elicitation of positive emotion.
  • results: While responding, the Supporter model elicits positive emotions and at the same time maintains dialogue coherence.
    Abstract Emotional support conversation (ESC) aims to provide emotional support (ES) to improve one's mental state. Existing works stay at fitting grounded responses and responding strategies (e.g., question), which ignore the effect on ES and lack explicit goals to guide emotional positive transition. To this end, we introduce a new paradigm to formalize multi-turn ESC as a process of positive emotion elicitation. Addressing this task requires finely adjusting the elicitation intensity in ES as the conversation progresses while maintaining conversational goals like coherence. In this paper, we propose Supporter, a mixture-of-expert-based reinforcement learning model, and well design ES and dialogue coherence rewards to guide policy's learning for responding. Experiments verify the superiority of Supporter in achieving positive emotion elicitation during responding while maintaining conversational goals including coherence.
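Abstractly, the reward design combines an elicitation term with a coherence term. A schematic sketch is below; both scorers are toy stubs standing in for the learned reward models in the actual system, and the weighting is an assumption.

```python
def supporter_reward(history, response,
                     elicitation_score, coherence_score, alpha=0.5):
    """Combine an emotion-elicitation reward (does the response move the
    seeker toward more positive emotion?) with a coherence reward (does
    the response fit the dialogue context?). Both scorers are stubs."""
    r_elicit = elicitation_score(history, response)  # e.g. positivity gain
    r_cohere = coherence_score(history, response)    # e.g. context relevance
    return alpha * r_elicit + (1 - alpha) * r_cohere

# Toy stand-in scorers, for illustration only.
pos_words = {"hope", "better", "glad", "proud"}
elicit = lambda h, r: sum(w in pos_words for w in r.split()) / max(len(r.split()), 1)
cohere = lambda h, r: len(set(h.split()) & set(r.split())) / max(len(r.split()), 1)

print(supporter_reward("i failed my exam",
                       "i am glad you shared that, things will get better",
                       elicit, cohere))
```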

A Survey of Techniques for Optimizing Transformer Inference

  • paper_url: http://arxiv.org/abs/2307.07982
  • repo_url: None
  • paper_authors: Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani
  • for: The paper aims to survey optimization techniques for the inference phase of transformer networks, including knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design.
  • methods: The paper surveys a series of algorithm-level optimization techniques, including knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design.
  • results: The paper summarizes the quantitative results of several models and techniques, including the tradeoff between accuracy and computational complexity.
    Abstract Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
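As a concrete instance of one surveyed family of techniques, post-training weight quantization trades a little accuracy for a 4x memory reduction. A minimal symmetric per-tensor int8 example in numpy follows, illustrative of the idea rather than any specific paper's scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats in
    [-max|w|, max|w|] onto integers in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)      # a toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # small, at 4x less memory
```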

Model Adaptation for ASR in low-resource Indian Languages

  • paper_url: http://arxiv.org/abs/2307.07948
  • repo_url: None
  • paper_authors: Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Savitha, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Rohan Saxena, Sai Praneeth Reddy Mora, Srinivasa Raghavan
  • for: Investigate how self-supervised learning and large-scale multilingual training can improve speech recognition performance, particularly for low-resource languages with limited audio and text data.
  • methods: Uses the wav2vec2 self-supervised model and large-scale multilingual training, together with adaptation and fine-tuning on text and speech data.
  • results: Adapting and fine-tuning from similar well-resourced languages can improve recognition performance in low-resource languages, and large amounts of text data can likewise be leveraged to improve performance.
    Abstract Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages. In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the vice-versa could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages of Bengali and Bhojpuri. These approaches are not limited to Indian languages, the solutions are potentially applicable to various languages spoken around the world.

Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling

  • paper_url: http://arxiv.org/abs/2307.07946
  • repo_url: https://github.com/zifengcheng/cdap
  • paper_authors: Zifeng Cheng, Qingyu Zhou, Zhiwei Jiang, Xuemin Zhao, Yunbo Cao, Qing Gu
  • for: Propose a method that can recognise novel classes from only a few labelled samples.
  • methods: A Consistent Dual Adaptive Prototypical (CDAP) network containing token-level and span-level networks jointly trained at different granularities, together with a consistency loss that lets the two networks learn from each other.
  • results: New state-of-the-art results on three benchmark datasets.
    Abstract Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods solve the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are only trained at a single granularity (i.e., either token level or span level) and have some weaknesses of the corresponding granularity. In this paper, we first unify token and span level supervisions and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains the token-level and span-level networks, jointly trained at different granularities. To align the outputs of two networks, we further propose a consistent loss to enable them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probability and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets.

Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions

  • paper_url: http://arxiv.org/abs/2307.07940
  • repo_url: None
  • paper_authors: Atsushi Shirafuji, Yutaka Watanobe
  • for: Help learners on online judge systems refer to a variety of solution approaches.
  • methods: Deduplicate near-duplicate solution programs and rank the unique programs by duplicate count, reducing the number of reference programs that need to be consulted.
  • results: Experiments show the number of programs is reduced by 60.20% after deduplication, versus 29.59% for the baseline, meaning users only need to refer to 39.80% of programs on average; the top-10 ranked programs cover 29.95% of programs on average.
    Abstract Referring to the solution programs written by the other users is helpful for learners in programming education. However, current online judge systems just list all solution programs submitted by users for references, and the programs are sorted based on the submission date and time, execution time, or user rating, ignoring to what extent the program can be a reference. In addition, users struggle to refer to a variety of solution approaches since there are too many duplicated and near-duplicated programs. To motivate the learners to refer to various solutions to learn the better solution approaches, in this paper, we propose an approach to deduplicate and rank common solution programs in each programming problem. Based on the hypothesis that the more duplicated programs adopt a more common approach and can be a reference, we remove the near-duplicated solution programs and rank the unique programs based on the duplicate count. The experiments on the solution programs submitted to a real-world online judge system demonstrate that the number of programs is reduced by 60.20%, whereas the baseline only reduces by 29.59% after the deduplication, meaning that the users only need to refer to 39.80% of programs on average. Furthermore, our analysis shows that top-10 ranked programs cover 29.95% of programs on average, indicating that the users can grasp 29.95% of solution approaches by referring to only 10 programs. The proposed approach shows the potential of reducing the learners' burden of referring to too many solutions and motivating them to learn a variety of better approaches.
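The deduplicate-then-rank idea can be sketched directly: map each program to a crude canonical form, group identical forms, and rank representatives by how many submissions they stand for. The whitespace/comment stripping below is a simple stand-in for the paper's near-duplicate detection.

```python
import re
from collections import Counter

def normalize(src):
    # Crude canonical form: drop comments and remove all whitespace.
    src = re.sub(r"#.*", "", src)
    return re.sub(r"\s+", "", src)

def dedup_and_rank(programs):
    counts = Counter(normalize(p) for p in programs)
    # More duplicates => more common approach => better reference.
    return counts.most_common()

submissions = [
    "print(sum(map(int, input().split())))",
    "print(  sum(map(int, input().split())) )  # fast",
    "a, b = map(int, input().split())\nprint(a + b)",
    "print(sum(map(int,input().split())))",
]
for form, n in dedup_and_rank(submissions):
    print(n, "x", form)
```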

Communicative Agents for Software Development

  • paper_url: http://arxiv.org/abs/2307.07924
  • repo_url: None
  • paper_authors: Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, Maosong Sun
  • for: Explore a software development approach based on large language models (LLMs), in which natural-language communication replaces specialised models across the phases of software development.
  • methods: Proposes a virtual-dialogue-driven development process in which each stage (designing, coding, testing, documenting) involves a team of agents — programmers, code reviewers, and test engineers — who collaborate, share ideas, and resolve problems through conversation.
  • results: The entire development process can be completed in under seven minutes at a cost below one dollar, while potential vulnerabilities and hallucinations are quickly detected and fixed, keeping the software reliable and efficient.
    Abstract Software engineering is a domain characterized by intricate decision-making processes, often relying on nuanced intuition and consultation. Recent advancements in deep learning have started to revolutionize software engineering practices through elaborate designs implemented at various stages of software development. In this paper, we present an innovative paradigm that leverages large language models (LLMs) throughout the entire software development process, streamlining and unifying key processes through natural language communication, thereby eliminating the need for specialized models at each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered software development company that mirrors the established waterfall model, meticulously dividing the development process into four distinct chronological stages: designing, coding, testing, and documenting. Each stage engages a team of agents, such as programmers, code reviewers, and test engineers, fostering collaborative dialogue and facilitating a seamless workflow. The chat chain acts as a facilitator, breaking down each stage into atomic subtasks. This enables dual roles, allowing for proposing and validating solutions through context-aware communication, leading to efficient resolution of specific subtasks. The instrumental analysis of ChatDev highlights its remarkable efficacy in software generation, enabling the completion of the entire software development process in under seven minutes at a cost of less than one dollar. It not only identifies and alleviates potential vulnerabilities but also rectifies potential hallucinations while maintaining commendable efficiency and cost-effectiveness. The potential of ChatDev unveils fresh possibilities for integrating LLMs into the realm of software development.
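The chat-chain pattern — an instructor agent and an assistant agent exchanging role-conditioned messages until a subtask closes — can be sketched as below. Here `llm` is a placeholder for any chat-model call, and the role prompts are illustrative rather than ChatDev's actual prompts.

```python
def llm(system_prompt, messages):
    # Placeholder for a real chat-model call (e.g. an API request).
    return f"[{system_prompt.split(',')[0]}]: response to '{messages[-1][:30]}...'"

def chat_chain(instructor_role, assistant_role, task, max_turns=4):
    """Dual-role subtask loop: the instructor proposes, the assistant
    answers, and the transcript accumulates until the turn budget ends."""
    transcript = [task]
    for _ in range(max_turns):
        ask = llm(instructor_role, transcript)
        transcript.append(ask)
        answer = llm(assistant_role, transcript)
        transcript.append(answer)
    return transcript

# One atomic subtask from the "coding" phase, for illustration.
log = chat_chain("CTO, reviews design decisions",
                 "programmer, writes the code",
                 "Implement a CLI TODO app in Python.")
print("\n".join(log))
```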

Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2307.08714
  • repo_url: None
  • paper_authors: Sunisth Kumar, Davide Liu, Alexandre Boulenger
  • for: Propose an efficient cross-lingual named entity recognition framework for semi-structured text data.
  • methods: The design rests on knowledge distillation and consistency training: knowledge from a large language model (XLMRoBERTa) pretrained on the source language is transferred through a student-teacher relationship, and the student additionally undergoes unsupervised consistency training (with a KL divergence loss) on the low-resource target language.
  • results: Using two independent SMS datasets in English and Arabic, each containing semi-structured banking transaction information, the model trained with only 30 labelled samples transfers recognition of merchants, amounts, and other fields from English to Arabic, outperforming state-of-the-art approaches such as DistilBERT pretrained on the target language or a model directly supervised on target-language labels.
    Abstract We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. Our approach relies on both knowledge distillation and consistency training. The modeling framework leverages knowledge from a large language model (XLMRoBERTa) pre-trained on the source language, with a student-teacher relationship (knowledge distillation). The student model incorporates unsupervised consistency training (with KL divergence loss) on the low-resource target language. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information, and focus on exhibiting the transfer of knowledge from English to Arabic. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic. We show that our modeling approach, while efficient, performs best overall when compared to state-of-the-art approaches like DistilBERT pre-trained on the target language or a supervised model directly trained on labeled data in the target language. Our experiments show that it is enough to learn to recognize entities in English to reach reasonable performance in a low-resource language in the presence of a few labeled samples of semi-structured data. The proposed framework has implications for developing multi-lingual applications, especially in geographies where digital endeavors rely on both English and one or more low-resource language(s), sometimes mixed with English or employed singly.
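The two training signals — distillation from the teacher on source-language data and KL-based consistency on unlabeled target-language text — can be sketched in PyTorch as a combined loss. Models, batches, temperature, and weighting below are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soften both distributions with temperature T and match them."""
    p_t = F.log_softmax(teacher_logits / T, dim=-1)
    p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(p_s, p_t, log_target=True, reduction="batchmean") * T * T

def consistency_loss(logits_a, logits_b):
    """KL divergence between predictions on two views of the same
    unlabeled target-language sentence (e.g. with/without dropout)."""
    return F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.log_softmax(logits_b, dim=-1),
                    log_target=True, reduction="batchmean")

# Placeholder per-token logits over an NER tag set of size 9.
teacher = torch.randn(16, 32, 9)                      # teacher on labeled English
student = torch.randn(16, 32, 9, requires_grad=True)  # student on same batch
view_a = torch.randn(16, 32, 9, requires_grad=True)   # Arabic sentence, view 1
view_b = torch.randn(16, 32, 9)                       # Arabic sentence, view 2

loss = distill_loss(student, teacher) + 0.5 * consistency_loss(view_a, view_b)
loss.backward()
print(float(loss))
```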

LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models

  • paper_url: http://arxiv.org/abs/2307.07889
  • repo_url: None
  • paper_authors: Adian Liusie, Potsawee Manakul, Mark J. F. Gales
  • for: Investigates how large language models (LLMs) can be used for automated zero-shot assessment of natural language generation (NLG), and the potential and advantages of comparative assessment.
  • methods: Uses the emergent abilities of LLMs for zero-shot NLG evaluation via absolute score prediction and comparative assessment, the latter relying on relative comparisons between pairs of candidates rather than scoring each candidate independently.
  • results: Comparative assessment with moderate-sized open-source LLMs outperforms prompt scoring and can be competitive with state-of-the-art methods; the study also identifies strong positional biases in LLM pairwise comparisons and proposes debiasing methods that further improve performance.
    Abstract Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. An interesting application of these systems is in the automated assessment of natural language generation (NLG), a highly challenging area with great practical benefit. In this paper, we explore two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment: absolute score prediction, and comparative assessment which uses relative comparisons between pairs of candidates. Though comparative assessment has not been extensively studied in NLG assessment, we note that humans often find it more intuitive to compare two options rather than scoring each one independently. This work examines comparative assessment from multiple perspectives: performance compared to absolute grading; positional biases in the prompt; and efficient ranking in terms of the number of comparisons. We illustrate that LLM comparative assessment is a simple, general and effective approach for NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring, and in many cases can achieve performance competitive with state-of-the-art methods. Additionally, we demonstrate that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that can further improve performance.
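A simple version of comparative assessment with positional debiasing: query each pair in both presentation orders, average the two judgments, then rank candidates by total wins. Here `judge` is a toy stub standing in for an LLM prompted to pick the better of two candidates.

```python
from itertools import combinations

def judge(text_a, text_b):
    # Stub for an LLM call that returns P(candidate A is better).
    # Toy proxy for illustration: prefer the longer candidate.
    return 0.9 if len(text_a) > len(text_b) else 0.1

def debiased_compare(a, b):
    """Average over both presentation orders to cancel positional bias."""
    return 0.5 * (judge(a, b) + (1.0 - judge(b, a)))

def rank_by_wins(candidates):
    wins = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        p = debiased_compare(a, b)
        wins[a] += p
        wins[b] += 1.0 - p
    return sorted(candidates, key=wins.get, reverse=True)

summaries = ["Short.", "A slightly longer candidate summary.",
             "The longest and most detailed candidate summary of all."]
print(rank_by_wins(summaries))
```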

Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding

  • paper_url: http://arxiv.org/abs/2307.07880
  • repo_url: https://github.com/boleima/profit
  • paper_authors: Bolei Ma, Ercong Nie, Helmut Schmid, Hinrich Schütze
  • for: Study the cross-lingual capabilities of prompt-based finetuning of multilingual pretrained language models (MPLMs).
  • methods: Applies the ProFiT pipeline of prompt-based finetuning and evaluates on a wide range of target languages.
  • results: Prompt-based finetuning proves effective and versatile for cross-lingual language understanding tasks, with distinct performance patterns across few-shot and full-data settings.
    Abstract Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the ProFiT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.
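For intuition, prompt-based finetuning reframes classification as filling a masked slot scored by the masked-LM head. A schematic sketch follows, where the template, verbalizer, and scorer are illustrative stubs rather than the paper's setup.

```python
def mlm_fill_prob(prompt, word):
    # Stub for a masked-LM call returning P(word at [MASK] | prompt).
    return 0.7 if word == "great" and "good" in prompt else 0.3

template = 'Review: "{x}" Overall it was [MASK].'
verbalizer = {"positive": "great", "negative": "terrible"}

def classify(x):
    prompt = template.format(x=x)
    scores = {label: mlm_fill_prob(prompt, word)
              for label, word in verbalizer.items()}
    return max(scores, key=scores.get)

print(classify("good food and friendly staff"))  # -> "positive"
```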

CIDER: Context sensitive sentiment analysis for short-form text

  • paper_url: http://arxiv.org/abs/2307.07864
  • repo_url: https://github.com/jcy204/ciderpolarity
  • paper_authors: James C. Young, Rudy Arthur, Hywel T. P. Williams
  • for: This paper is written for researchers who are interested in sentiment analysis and natural language processing.
  • methods: The paper presents a new approach called CIDER (Context Informed Dictionary and sEntiment Reasoner), which performs context-sensitive sentiment analysis by inferring the valence of sentiment-laden terms from the whole corpus before scoring individual texts.
  • results: The paper demonstrates that CIDER outperforms state-of-the-art generalist sentiment analysis on a large collection of tweets about the weather.
    Abstract Researchers commonly perform sentiment analysis on large collections of short texts like tweets, Reddit posts or newspaper headlines that are all focused on a specific topic, theme or event. Usually, general purpose sentiment analysis methods are used which perform well on average but miss the variation in meaning that happens across different contexts, for example, the word "active" has a very different intention and valence in the phrase "active lifestyle" versus "active volcano". This work presents a new approach, CIDER (Context Informed Dictionary and sEntiment Reasoner), which performs context sensitive sentiment analysis, where the valence of sentiment laden terms is inferred from the whole corpus before being used to score the individual texts. In this paper we detail the CIDER algorithm and demonstrate that it outperforms state-of-the-art generalist sentiment analysis on a large collection of tweets about the weather. We have made our implementation of CIDER available as a python package: https://pypi.org/project/ciderpolarity/.
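A drastically simplified version of corpus-informed valence: score each term by how often it co-occurs with positive versus negative seed words across the corpus, then sum term valences per text. This is a toy stand-in for CIDER's actual induction procedure; the seed sets and corpus are fabricated.

```python
from collections import defaultdict

POS_SEEDS, NEG_SEEDS = {"good", "love"}, {"bad", "hate"}

def induce_valence(corpus):
    """Crude context-informed valence: a term's score reflects how often
    it co-occurs with positive vs. negative seed words in this corpus."""
    pos, neg = defaultdict(int), defaultdict(int)
    for text in corpus:
        words = set(text.lower().split())
        for w in words:
            pos[w] += len(words & POS_SEEDS)
            neg[w] += len(words & NEG_SEEDS)
    return {w: (pos[w] - neg[w]) / max(pos[w] + neg[w], 1)
            for w in set(pos) | set(neg)}

corpus = ["i love this mild sunny weather", "hate this mild damp cold",
          "sunny days are good", "cold rain is bad"]
valence = induce_valence(corpus)
score = lambda text: sum(valence.get(w, 0.0) for w in text.lower().split())
print(score("mild sunny morning"), score("cold damp night"))
```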

Transformers are Universal Predictors

  • paper_url: http://arxiv.org/abs/2307.07843
  • repo_url: https://github.com/danderfer/Comp_Sci_Sem_2
  • paper_authors: Sourya Basu, Moulik Choraria, Lav R. Varshney
  • for: Examines the limits of the Transformer architecture for language modeling and its universal prediction property in an information-theoretic sense.
  • methods: Uses information theory to analyse the Transformer architecture in non-asymptotic data regimes, examining the role of its various components, especially in the context of data-efficient training.
  • results: Experiments on both synthetic and real datasets validate the theoretical analysis and demonstrate the universal prediction property.
    Abstract We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.

cs.LG - 2023-07-16

Dataset Distillation Meets Provable Subset Selection

  • paper_url: http://arxiv.org/abs/2307.08086
  • repo_url: None
  • paper_authors: Murad Tukan, Alaa Maalouf, Margarita Osadchy
  • for: Improve the effectiveness of dataset distillation while reducing data volume and computational cost.
  • methods: Uses a sampling-based method to initialize the distilled set, and an importance definition to select which data points are trained on during the distillation procedure.
  • results: Experiments show the method can plug into existing dataset distillation techniques and improve their performance.
    Abstract Deep learning has grown tremendously over recent years, yielding state-of-the-art results in various fields. However, training such models requires huge amounts of data, increasing the computational time and cost. To address this, dataset distillation was proposed to compress a large training dataset into a smaller synthetic one that retains its performance -- this is usually done by (1) uniformly initializing a synthetic set and (2) iteratively updating/learning this set according to a predefined loss by uniformly sampling instances from the full data. In this paper, we improve both phases of dataset distillation: (1) we present a provable, sampling-based approach for initializing the distilled set by identifying important and removing redundant points in the data, and (2) we further merge the idea of data subset selection with dataset distillation, by training the distilled set on ``important'' sampled points during the training procedure instead of randomly sampling the next batch. To do so, we define the notion of importance based on the relative contribution of instances with respect to two different loss functions, i.e., one for the initialization phase (a kernel fitting function for kernel ridge regression and $K$-means based loss function for any other distillation method), and the relative cross-entropy loss (or any other predefined loss) function for the training phase. Finally, we provide experimental results showing how our method can latch on to existing dataset distillation techniques and improve their performance.
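The initialization idea — start from important, non-redundant points rather than uniform samples — can be approximated with a k-means-style selection, used here as a loose stand-in for the paper's provable sampling scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_distilled_set(X, m, seed=0):
    """Initialize m distilled points as cluster representatives:
    important regions of the data get covered, near-duplicates don't
    get picked twice. (A stand-in for provable coreset sampling.)"""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X)
    # Snap each center to its nearest real data point.
    idx = [int(np.argmin(np.linalg.norm(X - c, axis=1)))
           for c in km.cluster_centers_]
    return X[idx]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.1, size=(200, 2)) for loc in (0, 3, 6)])
S = init_distilled_set(X, m=3)
print(S.round(2))   # one representative per cluster instead of uniform picks
```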

POMDP inference and robust solution via deep reinforcement learning: An application to railway optimal maintenance

  • paper_url: http://arxiv.org/abs/2307.08082
  • repo_url: https://github.com/giarcieri/robust-optimal-maintenance-planning-through-reinforcement-learning-and-rllib
  • paper_authors: Giacomo Arcieri, Cyprien Hoelzl, Oliver Schwery, Daniel Straub, Konstantinos G. Papakonstantinou, Eleni Chatzi
  • for: Propose a combined framework for inference and robust solution of POMDPs via deep RL, for complex sequential decision-making problems in uncertain environments.
  • methods: The framework jointly infers transition and observation model parameters via Markov Chain Monte Carlo sampling of a hidden Markov model conditioned on actions, then solves the POMDP with deep RL techniques, incorporating the parameter distributions into the solution via domain randomization.
  • results: The study compares transformers with long short-term memory networks, as well as a model-based/model-free hybrid approach, and applies the methods to the real-world problem of optimal maintenance planning for railway assets.
    Abstract Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems under stochastic and uncertain environments. A main reason hindering their broad adoption in real-world applications is the lack of availability of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require the knowledge of the transition dynamics and the observation generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model, which is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques with the parameter distributions incorporated into the solution via domain randomization, in order to develop solutions that are robust to model uncertainty. As a further contribution, we compare the use of transformers and long short-term memory networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.
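Domain randomization over inferred parameters amounts to redrawing the model parameters from their posterior at every episode reset, so the learned policy must perform well across the whole posterior. A gym-style toy sketch follows; the environment dynamics and the posterior samples are fabricated placeholders.

```python
import numpy as np

class RandomizedMaintenanceEnv:
    """Toy deterioration environment whose transition parameter is
    redrawn from a (here: fabricated) MCMC posterior at each reset,
    so a policy trained on it must be robust to model uncertainty."""
    def __init__(self, posterior_samples):
        self.posterior = posterior_samples   # e.g. MCMC draws of a decay rate
        self.reset()

    def reset(self):
        self.decay = np.random.choice(self.posterior)  # domain randomization
        self.condition = 1.0
        return self._observe()

    def step(self, repair):
        self.condition = 1.0 if repair else self.condition * (1 - self.decay)
        reward = -5.0 * repair - (1.0 - self.condition)  # repair cost vs. wear
        return self._observe(), reward, False, {}

    def _observe(self):
        # Noisy, partial observation of the latent condition (POMDP-style).
        return self.condition + np.random.normal(0, 0.05)

posterior = np.random.beta(2, 20, size=1000)   # stand-in for MCMC samples
env = RandomizedMaintenanceEnv(posterior)
obs = env.reset()
print(env.step(repair=0), env.step(repair=1))
```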
    摘要 In this work, we propose a combined framework for inferring and solving POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo (MCMC) sampling of a hidden Markov model (HMM), conditioned on actions, to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques, with the parameter distributions incorporated into the solution via domain randomization, to develop solutions that are robust to model uncertainty.As a further contribution, we compare the use of transformers and long short-term memory (LSTM) networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.

Flexible and efficient spatial extremes emulation via variational autoencoders

  • paper_url: http://arxiv.org/abs/2307.08079
  • repo_url: None
  • paper_authors: Likun Zhang, Xiaoyu Ma, Christopher K. Wikle, Raphaël Huser
  • for: Model the complex dependence properties of spatial extremes.
  • methods: A variational autoencoder with an encoding-decoding structure (extVAE).
  • results: Emulates spatial extremes model outputs faster and more accurately, with better capture of tail behaviour.
    Abstract Many real-world processes have complex tail dependence structures that cannot be characterized using classical Gaussian processes. More flexible spatial extremes models such as Gaussian scale mixtures and single-station conditioning models exhibit appealing extremal dependence properties but are often exceedingly prohibitive to fit and simulate from. In this paper, we develop a new spatial extremes model that has flexible and non-stationary dependence properties, and we integrate it in the encoding-decoding structure of a variational autoencoder (extVAE). The extVAE can be used as a spatio-temporal emulator that characterizes the distribution of potential mechanistic model output states and produces outputs that have the same properties as the inputs, especially in the tail. Through extensive simulation studies, we show that our extVAE is vastly more time-efficient than traditional Bayesian inference while also outperforming many spatial extremes models with a stationary dependence structure. To further demonstrate the computational power of the extVAE, we analyze a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea, which includes daily measurements at 16703 grid cells.

MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment

  • paper_url: http://arxiv.org/abs/2307.08065
  • repo_url: None
  • paper_authors: Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
  • for: improving the performance and energy efficiency of vision-based GNN inference tasks.
  • methods: MaGNAS, a mapping-aware graph neural architecture search framework that co-designs GNN architectures and their mappings onto heterogeneous multi-processor systems-on-chip (MPSoCs).
  • results: experiments on the NVIDIA Xavier AGX platform show a 1.57x latency speedup and 3.38x better energy efficiency over a GPU-only baseline, with an average accuracy drop of only 0.11%.
    Abstract Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57x latency speedup and is 3.38x more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.
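
The two-tier evolutionary search can be pictured as below: one tier mutates the GNN architecture, the other its mapping onto the heterogeneous SoC. This is a minimal sketch under stated assumptions; `init_pair`, the mutation operators, and `fitness` (accuracy blended with latency/energy estimates) are hypothetical placeholders for MaGNAS's actual operators.

```python
import random

def two_tier_search(init_pair, mutate_arch, mutate_map, fitness, pop=20, gens=50):
    """Sketch of a two-tier evolutionary search over (architecture, mapping) pairs."""
    population = [init_pair() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]                    # keep the fitter half
        children = []
        for arch, mapping in parents:
            if random.random() < 0.5:
                children.append((mutate_arch(arch), mapping))   # outer tier: evolve the GNN architecture
            else:
                children.append((arch, mutate_map(mapping)))    # inner tier: evolve the SoC mapping
        population = parents + children
    return max(population, key=fitness)
```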

Fast Quantum Algorithm for Attention Computation

  • paper_url: http://arxiv.org/abs/2307.08045
  • repo_url: None
  • paper_authors: Yeqi Gao, Zhao Song, Xin Yang, Ruizhe Zhang
  • for: accelerating attention computation, a central bottleneck in large language models (LLMs) whose strong performance rests on advanced deep learning techniques.
  • methods: uses Grover's Search algorithm to compute a sparse attention matrix efficiently.
  • results: the quantum algorithm achieves a polynomial speedup over the classical method, and the output attention matrix exhibits an extra low-rank structure useful for faster LLM training algorithms.
    Abstract Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks. These models, powered by advanced deep learning techniques, have revolutionized the field of natural language processing (NLP) and have achieved remarkable results in various language-related tasks. LLMs have excelled in tasks such as machine translation, sentiment analysis, question answering, text generation, text classification, language modeling, and more. They have proven to be highly effective in capturing complex linguistic patterns, understanding context, and generating coherent and contextually relevant text. The attention scheme plays a crucial role in the architecture of large language models (LLMs). It is a fundamental component that enables the model to capture and utilize contextual information during language processing tasks effectively. Making the attention scheme computation faster is one of the central questions to speed up the LLMs computation. It is well-known that quantum machine has certain computational advantages compared to the classical machine. However, it is currently unknown whether quantum computing can aid in LLM. In this work, we focus on utilizing Grover's Search algorithm to compute a sparse attention computation matrix efficiently. We achieve a polynomial quantum speed-up over the classical method. Moreover, the attention matrix outputted by our quantum algorithm exhibits an extra low-rank structure that will be useful in obtaining a faster training algorithm for LLMs. Additionally, we present a detailed analysis of the algorithm's error analysis and time complexity within the context of computing the attention matrix.

Towards Flexible Time-to-event Modeling: Optimizing Neural Networks via Rank Regression

  • paper_url: http://arxiv.org/abs/2307.08044
  • repo_url: https://github.com/teboozas/dart_ecai23
  • paper_authors: Hyunjun Lee, Junhyun Lee, Taehwa Choi, Jaewoo Kang, Sangbum Choi
  • for: predicting the time of occurrence of an event (time-to-event / survival analysis)
  • methods: a deep learning model whose objective is based on Gehan's rank statistic (DART)
  • results: significant improvements on multiple benchmark datasets without additional hyperparameters or complex model architectures
    Abstract Time-to-event analysis, also known as survival analysis, aims to predict the time of occurrence of an event, given a set of features. One of the major challenges in this area is dealing with censored data, which can make learning algorithms more complex. Traditional methods such as Cox's proportional hazards model and the accelerated failure time (AFT) model have been popular in this field, but they often require assumptions such as proportional hazards and linearity. In particular, the AFT models often require pre-specified parametric distributional assumptions. To improve predictive performance and alleviate strict assumptions, there have been many deep learning approaches for hazard-based models in recent years. However, representation learning for AFT has not been widely explored in the neural network literature, despite its simplicity and interpretability in comparison to hazard-focused methods. In this work, we introduce the Deep AFT Rank-regression model for Time-to-event prediction (DART). This model uses an objective function based on Gehan's rank statistic, which is efficient and reliable for representation learning. On top of eliminating the requirement to establish a baseline event time distribution, DART retains the advantages of directly predicting event time in standard AFT models. The proposed method is a semiparametric approach to AFT modeling that does not impose any distributional assumptions on the survival time distribution. This also eliminates the need for additional hyperparameters or complex model architectures, unlike existing neural network-based AFT models. Through quantitative analysis on various benchmark datasets, we have shown that DART has significant potential for modeling high-throughput censored time-to-event data.
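
The Gehan-type objective is concrete enough to sketch. Assuming the classical form of Gehan's rank statistic (every observed event i is compared with every subject j and penalized when its residual is smaller), a minimal PyTorch version might look like this; DART's exact formulation may differ in details such as normalization.

```python
import torch

def gehan_rank_loss(log_time, event, pred):
    """Gehan-type rank objective for a neural AFT model.
    e_i = log(t_i) - f(x_i);  L = (1/n^2) * sum_{i,j} delta_i * max(0, e_j - e_i)."""
    e = log_time - pred                          # residuals, shape (n,)
    diff = e.unsqueeze(0) - e.unsqueeze(1)       # diff[i, j] = e_j - e_i
    pairwise = torch.clamp(diff, min=0.0)        # hinge on each pair
    return (event.unsqueeze(1) * pairwise).mean()  # only observed events (delta_i = 1) contribute
```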

Bivariate DeepKriging for Large-scale Spatial Interpolation of Wind Fields

  • paper_url: http://arxiv.org/abs/2307.08038
  • repo_url: None
  • paper_authors: Pratik Nag, Ying Sun, Brian J Reich
  • for: large-scale spatial interpolation or downscaling of high-resolution bivariate wind field data for climate, oceanographic, and meteorological studies.
  • methods: bivariate DeepKriging, a spatially dependent deep neural network with an embedding layer built from spatial radial basis functions, plus a distribution-free uncertainty quantification based on bootstrap and ensemble DNNs.
  • results: outperforms traditional cokriging predictors while computing about 20 times faster; applied to wind data at 506,771 locations over the Middle East, it delivers superior predictions with drastically reduced computation time.
    Abstract High spatial resolution wind data are essential for a wide range of applications in climate, oceanographic and meteorological studies. Large-scale spatial interpolation or downscaling of bivariate wind fields having velocity in two dimensions is a challenging task because wind data tend to be non-Gaussian with high spatial variability and heterogeneity. In spatial statistics, cokriging is commonly used for predicting bivariate spatial fields. However, the cokriging predictor is not optimal except for Gaussian processes. Additionally, cokriging is computationally prohibitive for large datasets. In this paper, we propose a method, called bivariate DeepKriging, which is a spatially dependent deep neural network (DNN) with an embedding layer constructed by spatial radial basis functions for bivariate spatial data prediction. We then develop a distribution-free uncertainty quantification method based on bootstrap and ensemble DNN. Our proposed approach outperforms the traditional cokriging predictor with commonly used covariance functions, such as the linear model of co-regionalization and flexible bivariate Mat\'ern covariance. We demonstrate the computational efficiency and scalability of the proposed DNN model, with computations that are, on average, 20 times faster than those of conventional techniques. We apply the bivariate DeepKriging method to the wind data over the Middle East region at 506,771 locations. The prediction performance of the proposed method is superior over the cokriging predictors and dramatically reduces computation time.
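
A minimal sketch of the radial-basis embedding layer: each 2-D location is expanded into Gaussian basis features centered at a set of knots before entering the DNN. The single-bandwidth Gaussian kernel and the `rbf_embedding` name are illustrative simplifications, not the paper's exact multi-resolution design.

```python
import numpy as np

def rbf_embedding(coords, knots, bandwidth):
    """Expand 2-D locations (n, 2) into Gaussian RBF features at fixed knots (k, 2).
    The resulting (n, k) matrix, concatenated with any covariates, feeds an
    ordinary DNN with a 2-D output (the two wind-velocity components)."""
    d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)   # squared distances (n, k)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

# usage sketch: features = rbf_embedding(locations, knot_grid, bandwidth=0.1)
```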

Magnetic Field-Based Reward Shaping for Goal-Conditioned Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.08033
  • repo_url: None
  • paper_authors: Hongyu Ding, Yuanze Tang, Qing Wu, Bo Wang, Chunlin Chen, Zhi Wang
  • for: improving sample efficiency in goal-conditioned RL tasks with dynamic targets, obstacles, and sparse rewards.
  • methods: magnetic field-based reward shaping (MFRS), which treats the target and obstacles as permanent magnets and derives the reward function from the nonlinear, anisotropic intensity of the resulting magnetic field.
  • results: in both simulated and real-world robotic manipulation tasks, MFRS outperforms relevant existing methods and effectively improves the sample efficiency of RL algorithms across various target and obstacle dynamics.
    Abstract Goal-conditioned reinforcement learning (RL) is an interesting extension of the traditional RL framework, where the dynamic environment and reward sparsity can cause conventional learning algorithms to fail. Reward shaping is a practical approach to improving sample efficiency by embedding human domain knowledge into the learning process. Existing reward shaping methods for goal-conditioned RL are typically built on distance metrics with a linear and isotropic distribution, which may fail to provide sufficient information about the ever-changing environment with high complexity. This paper proposes a novel magnetic field-based reward shaping (MFRS) method for goal-conditioned RL tasks with dynamic target and obstacles. Inspired by the physical properties of magnets, we consider the target and obstacles as permanent magnets and establish the reward function according to the intensity values of the magnetic field generated by these magnets. The nonlinear and anisotropic distribution of the magnetic field intensity can provide more accessible and conducive information about the optimization landscape, thus introducing a more sophisticated magnetic reward compared to the distance-based setting. Further, we transform our magnetic reward to the form of potential-based reward shaping by learning a secondary potential function concurrently to ensure the optimal policy invariance of our method. Experiments results in both simulated and real-world robotic manipulation tasks demonstrate that MFRS outperforms relevant existing methods and effectively improves the sample efficiency of RL algorithms in goal-conditioned tasks with various dynamics of the target and obstacles.
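
To make the shaping idea concrete, here is a hedged sketch that treats the target and obstacles as point magnets with a dipole-like 1/r^3 intensity decay. The paper's actual field model is anisotropic and more detailed; the isotropic decay and the constants below are illustrative assumptions only.

```python
import numpy as np

def magnetic_reward(agent_pos, target_pos, obstacle_pos, m_target=1.0, m_obs=1.0, eps=1e-6):
    """Reward shaping sketch: attract toward the target's field, repel from obstacles'."""
    def intensity(src, m):
        r = np.linalg.norm(agent_pos - src) + eps
        return m / r ** 3                       # dipole-like, nonlinear in distance

    reward = intensity(target_pos, m_target)
    for obs in np.atleast_2d(obstacle_pos):     # each obstacle contributes a repulsive term
        reward -= intensity(obs, m_obs)
    return reward
```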

Noise-aware Speech Enhancement using Diffusion Probabilistic Model

  • paper_url: http://arxiv.org/abs/2307.08029
  • repo_url: https://github.com/yuchen005/nase
  • paper_authors: Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng
  • for: improving diffusion-based generative speech enhancement (SE), especially on unseen noise types.
  • methods: noise-aware speech enhancement (NASE), which extracts noise-specific information via a noise classification (NC) model to condition the reverse denoising process; a multi-task learning scheme jointly optimizes the SE and NC tasks to sharpen the noise specificity of the extracted conditioner.
  • results: NASE is a plug-and-play module that generalizes to mainstream diffusion SE models; experiments on the VoiceBank-DEMAND dataset show significant improvements, especially on unseen testing noises.
    Abstract With recent advances of diffusion model, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for unseen testing noises. However, existing efforts mainly focus on inherent properties of clean speech for inference, underexploiting the varying noise information in real-world conditions. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in diffusion model. Specifically, we design a noise classification (NC) model to produce acoustic embedding as a noise conditioner for guiding the reverse denoising process. Meanwhile, a multi-task learning scheme is devised to jointly optimize SE and NC tasks, in order to enhance the noise specificity of extracted noise conditioner. Our proposed NASE is shown to be a plug-and-play module that can be generalized to any diffusion SE models. Experiment evidence on VoiceBank-DEMAND dataset shows that NASE achieves significant improvement over multiple mainstream diffusion SE models, especially on unseen testing noises.
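
One plausible way to wire the noise conditioner into a denoiser block is FiLM-style modulation, sketched below: the acoustic embedding from the NC model scales and shifts intermediate activations. This is an assumption for illustration; the paper's exact injection mechanism may differ.

```python
import torch
import torch.nn as nn

class NoiseConditionedBlock(nn.Module):
    """Sketch: condition a diffusion denoiser block on a noise embedding
    via a learned per-channel scale and shift (FiLM-style)."""
    def __init__(self, channels, embed_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.to_scale_shift = nn.Linear(embed_dim, 2 * channels)

    def forward(self, x, noise_embed):           # x: (B, C, T), noise_embed: (B, embed_dim)
        scale, shift = self.to_scale_shift(noise_embed).chunk(2, dim=-1)
        h = self.conv(x)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```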

Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks

  • paper_url: http://arxiv.org/abs/2307.08013
  • repo_url: None
  • paper_authors: Haobo Song, Soumajit Majumder, Tao Lin
  • for: revisiting implicit models (DEQs) on vision tasks and comparing them against their weight-tied origins.
  • methods: an in-depth study of weight-tied models, plus distinct sparse masks proposed to improve model capacity, with design guidelines on depth, width, and sparsity.
  • results: weight-tied models prove more effective, stable, and efficient than DEQ variants on vision tasks, and the insights generalize to other learning paradigms.
    Abstract Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models with elegant solution-finding procedures and constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit the line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants. Through the lens of these simple-yet-clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve the model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.
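
A weight-tied model is easy to state in code: one block's parameters are reused for a fixed number of unrolled iterations, in contrast to a DEQ, which solves for the fixed point z* = f(z*, x) implicitly. A minimal sketch (layer sizes and iteration count are illustrative):

```python
import torch
import torch.nn as nn

class WeightTiedNet(nn.Module):
    """Sketch of a weight-tied model: the same block is applied n_iters times."""
    def __init__(self, dim, n_iters=12):
        super().__init__()
        self.inject = nn.Linear(dim, dim)   # input injection U x
        self.block = nn.Linear(dim, dim)    # shared (tied) weights W
        self.n_iters = n_iters

    def forward(self, x):
        u = self.inject(x)
        z = torch.zeros_like(u)
        for _ in range(self.n_iters):       # same weights every iteration
            z = torch.relu(self.block(z) + u)
        return z
```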

For One-Shot Decoding: Self-supervised Deep Learning-Based Polar Decoder

  • paper_url: http://arxiv.org/abs/2307.08004
  • repo_url: None
  • paper_authors: Huiying Song, Yihao Luo, Yuma Fukuzawa
  • for: a deep learning-based polar code decoding scheme capable of one-shot decoding.
  • methods: trains a neural network via self-supervised learning, leveraging the generator matrix of polar codes so the network functions as a bounded distance decoder.
  • results: computer simulations show the scheme approaches maximum a posteriori (MAP) decoder performance for very short packets, and the proposed neural network decoder (NND) generalizes far better than the conventional one.
    Abstract We propose a self-supervised deep learning-based decoding scheme that enables one-shot decoding of polar codes. In the proposed scheme, rather than using the information bit vectors as labels for training the neural network (NN) through supervised learning as the conventional scheme did, the NN is trained to function as a bounded distance decoder by leveraging the generator matrix of polar codes through self-supervised learning. This approach eliminates the reliance on predefined labels, empowering the potential to train directly on the actual data within communication systems and thereby enhancing the applicability. Furthermore, computer simulations demonstrate that (i) the bit error rate (BER) and block error rate (BLER) performances of the proposed scheme can approach those of the maximum a posteriori (MAP) decoder for very short packets and (ii) the proposed NN decoder (NND) exhibits much superior generalization ability compared to the conventional one.

Joint Microseismic Event Detection and Location with a Detection Transformer

  • paper_url: http://arxiv.org/abs/2307.09207
  • repo_url: None
  • paper_authors: Yuanyuan Yang, Claire Birnie, Tariq Alkhalifah
  • for: jointly detecting and locating microseismic events, toward real-time microseismic monitoring.
  • methods: a convolutional neural network backbone and an encoder-decoder Transformer trained with a set-based Hungarian loss, applied directly to recorded waveforms.
  • results: a synthetic test shows accurate event detection and subsurface location, and a field test on Arkoma Basin data demonstrates practicality and efficiency.
    Abstract Microseismic event detection and location are two primary components in microseismic monitoring, which offers us invaluable insights into the subsurface during reservoir stimulation and evolution. Conventional approaches for event detection and location often suffer from manual intervention and/or heavy computation, while current machine learning-assisted approaches typically address detection and location separately; such limitations hinder the potential for real-time microseismic monitoring. We propose an approach to unify event detection and source location into a single framework by adapting a Convolutional Neural Network backbone and an encoder-decoder Transformer with a set-based Hungarian loss, which is applied directly to recorded waveforms. The proposed network is trained on synthetic data simulating multiple microseismic events corresponding to random source locations in the area of suspected microseismic activities. A synthetic test on a 2D profile of the SEAM Time Lapse model illustrates the capability of the proposed method in detecting the events properly and locating them in the subsurface accurately; while, a field test using the Arkoma Basin data further proves its practicability, efficiency, and its potential in paving the way for real-time monitoring of microseismic events.
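
The set-based Hungarian loss reduces to an optimal one-to-one assignment between predicted and ground-truth events. Below is a hedged sketch using SciPy's assignment solver, with an illustrative cost of location error minus log-confidence; the paper's exact cost terms may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_set_loss(pred_locs, pred_conf, true_locs, lam=1.0):
    """Sketch: match predicted events to ground-truth events with minimal total
    cost, then sum the matched pairwise costs as the training loss."""
    cost = np.linalg.norm(pred_locs[:, None, :] - true_locs[None, :, :], axis=-1)
    cost = cost - lam * np.log(pred_conf + 1e-8)[:, None]   # favor confident matches
    rows, cols = linear_sum_assignment(cost)                # optimal one-to-one matching
    return cost[rows, cols].sum()
```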

LUCYD: A Feature-Driven Richardson-Lucy Deconvolution Network

  • paper_url: http://arxiv.org/abs/2307.07998
  • repo_url: https://github.com/ctom2/lucyd-deconvolution
  • paper_authors: Tomáš Chobola, Gesine Müller, Veit Dausmann, Anton Theileis, Jan Taucher, Jan Huisken, Tingying Peng
  • for: improving the quality and interpretability of microscopy images.
  • methods: a feature-driven image restoration model combining the Richardson-Lucy deconvolution formula with deep features from a fully convolutional network.
  • results: LUCYD outperforms state-of-the-art methods on both synthetic and real microscopy images, improving quality and generalizability across microscopy modalities and imaging conditions.
    Abstract The process of acquiring microscopic images in life sciences often results in image degradation and corruption, characterised by the presence of noise and blur, which poses significant challenges in accurately analysing and interpreting the obtained data. This paper proposes LUCYD, a novel method for the restoration of volumetric microscopy images that combines the Richardson-Lucy deconvolution formula and the fusion of deep features obtained by a fully convolutional network. By integrating the image formation process into a feature-driven restoration model, the proposed approach aims to enhance the quality of the restored images whilst reducing computational costs and maintaining a high degree of interpretability. Our results demonstrate that LUCYD outperforms the state-of-the-art methods in both synthetic and real microscopy images, achieving superior performance in terms of image quality and generalisability. We show that the model can handle various microscopy modalities and different imaging conditions by evaluating it on two different microscopy datasets, including volumetric widefield and light-sheet microscopy. Our experiments indicate that LUCYD can significantly improve resolution, contrast, and overall quality of microscopy images. Therefore, it can be a valuable tool for microscopy image restoration and can facilitate further research in various microscopy applications. We made the source code for the model accessible under https://github.com/ctom2/lucyd-deconvolution.
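
For reference, the classical Richardson-Lucy update that LUCYD builds on is x_{k+1} = x_k · H^T(y / (H x_k)), with H the PSF convolution and H^T convolution with the flipped PSF; LUCYD fuses this with learned deep features. The base rule alone, as a minimal sketch:

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(image, psf, n_iter=30, eps=1e-12):
    """Classical Richardson-Lucy deconvolution (2-D, non-blind)."""
    estimate = np.full_like(image, image.mean())     # flat positive initialization
    psf_mirror = psf[::-1, ::-1]                     # H^T as convolution with flipped PSF
    for _ in range(n_iter):
        denom = fftconvolve(estimate, psf, mode="same")
        ratio = image / (denom + eps)                # y / (H x_k)
        estimate = estimate * fftconvolve(ratio, psf_mirror, mode="same")
    return estimate
```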

MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime

  • paper_url: http://arxiv.org/abs/2307.07997
  • repo_url: https://github.com/tejuafonja/margctgan
  • paper_authors: Tejumade Afonja, Dingfan Chen, Mario Fritz
  • for: evaluating synthetic tabular data generation in low-sample regimes, addressing the common neglect of statistical properties in current evaluation practice.
  • methods: evaluates three state-of-the-art synthetic tabular data generators (including CTGAN) on marginal distribution, column-pair correlation, joint distribution, and downstream task utility; the proposed MargCTGAN adds feature matching of de-correlated marginals.
  • results: CTGAN underperforms in low-sample settings in terms of utility, while MargCTGAN consistently improves both downstream utility and the statistical properties of the synthetic data.
    Abstract The potential of realistic and useful synthetic data is significant. However, current evaluation methods for synthetic tabular data generation predominantly focus on downstream task usefulness, often neglecting the importance of statistical properties. This oversight becomes particularly prominent in low sample scenarios, accompanied by a swift deterioration of these statistical measures. In this paper, we address this issue by conducting an evaluation of three state-of-the-art synthetic tabular data generators based on their marginal distribution, column-pair correlation, joint distribution and downstream task utility performance across high to low sample regimes. The popular CTGAN model shows strong utility, but underperforms in low sample settings in terms of utility. To overcome this limitation, we propose MargCTGAN that adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
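
A hedged sketch of what "feature matching of de-correlated marginals" could look like: project both batches onto the real data's principal axes (decorrelation), then match per-dimension means and standard deviations. MargCTGAN's actual objective may differ in detail; this only illustrates the idea.

```python
import torch

def decorrelated_marginal_loss(real, fake):
    """Sketch: moment-match the marginals of real and synthetic batches after
    rotating both into the real data's decorrelated (principal-axis) basis."""
    x = real - real.mean(0, keepdim=True)
    cov = x.T @ x / (real.shape[0] - 1)
    _, eigvecs = torch.linalg.eigh(cov)              # principal axes of the real data
    r_proj, f_proj = real @ eigvecs, fake @ eigvecs
    mean_gap = (r_proj.mean(0) - f_proj.mean(0)).pow(2).mean()
    std_gap = (r_proj.std(0) - f_proj.std(0)).pow(2).mean()
    return mean_gap + std_gap
```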

CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

  • paper_url: http://arxiv.org/abs/2307.10237
  • repo_url: None
  • paper_authors: Bhavin Jawade, Deen Dayal Mohan, Dennis Fedorishin, Srirangaraj Setlur, Venu Govindaraju
  • for: robust face recognition from image sets captured at long range, low resolution, and under varying viewpoint, illumination, pose, and atmospheric conditions.
  • methods: conditions feature aggregation on the distribution of the incoming feature set by learning a context vector that weighs features according to their estimated informativeness.
  • results: state-of-the-art results on long-range unconstrained face recognition datasets such as BTS and DroneSURF, validating the advantages of this aggregation strategy.
    Abstract Face recognition from image sets acquired under unregulated and uncontrolled settings, such as at large distances, low resolutions, varying viewpoints, illumination, pose, and atmospheric conditions, is challenging. Face feature aggregation, which involves aggregating a set of N feature representations present in a template into a single global representation, plays a pivotal role in such recognition systems. Existing works in traditional face feature aggregation either utilize metadata or high-dimensional intermediate feature representations to estimate feature quality for aggregation. However, generating high-quality metadata or style information is not feasible for extremely low-resolution faces captured in long-range and high altitude settings. To overcome these limitations, we propose a feature distribution conditioning approach called CoNAN for template aggregation. Specifically, our method aims to learn a context vector conditioned over the distribution information of the incoming feature set, which is utilized to weigh the features based on their estimated informativeness. The proposed method produces state-of-the-art results on long-range unconstrained face recognition datasets such as BTS, and DroneSURF, validating the advantages of such an aggregation strategy.
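
The conditioning idea admits a compact sketch: derive a context vector from simple distribution statistics of the incoming feature set, then softmax-weight features by similarity to it before pooling into a single template. The mean/std statistics and dot-product similarity used here are illustrative choices, not necessarily CoNAN's.

```python
import torch
import torch.nn as nn

class ConditionalAggregator(nn.Module):
    """Sketch: distribution-conditioned weighted pooling of a face feature set."""
    def __init__(self, dim):
        super().__init__()
        self.context = nn.Linear(2 * dim, dim)    # distribution stats -> context vector

    def forward(self, feats):                     # feats: (N, dim) set of face features
        stats = torch.cat([feats.mean(0), feats.std(0)])
        c = self.context(stats)                   # context conditioned on the set's distribution
        w = torch.softmax(feats @ c, dim=0)       # per-feature informativeness weights
        return (w.unsqueeze(-1) * feats).sum(0)   # single global representation
```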

A Survey of Techniques for Optimizing Transformer Inference

  • paper_url: http://arxiv.org/abs/2307.07982
  • repo_url: None
  • paper_authors: Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani
  • for: a comprehensive reference on techniques for optimizing the inference phase of transformer networks, for both novice and seasoned researchers.
  • methods: surveys knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design at the algorithmic level, alongside hardware-level optimizations and novel hardware accelerators for transformers.
  • results: summarizes quantitative results on parameter/FLOP counts versus accuracy across models and techniques to expose their trade-offs, and outlines future directions for this rapidly evolving field.
    Abstract Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
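
As one concrete instance from the surveyed toolbox, unstructured magnitude pruning simply zeroes the smallest-magnitude weights; real pipelines pair it with fine-tuning to recover accuracy, and quantization or distillation act analogously at other levels of the stack. A minimal sketch:

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Sketch: zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return weight * (weight.abs() > threshold)              # binary mask applied in place of W
```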

Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in An Adversarial Environment

  • paper_url: http://arxiv.org/abs/2307.07980
  • repo_url: https://github.com/wanger521/ogd
  • paper_authors: Xingrong Dong, Zhaoxian Wu, Qing Ling, Zhi Tian
  • for: distributed online learning under Byzantine attacks.
  • methods: analyzes distributed online gradient descent with state-of-the-art robust aggregation rules, and proposes a Byzantine-robust distributed online momentum algorithm.
  • results: proves that in a fully adversarial environment with Byzantine participants only a (tight) linear adversarial regret bound is achievable; when the honest participants' losses are i.i.d., the proposed algorithm attains a sublinear stochastic regret bound.
    Abstract This paper studies distributed online learning under Byzantine attacks. The performance of an online learning algorithm is often characterized by (adversarial) regret, which evaluates the quality of one-step-ahead decision-making when an environment provides adversarial losses, and a sublinear bound is preferred. But we prove that, even with a class of state-of-the-art robust aggregation rules, in an adversarial environment and in the presence of Byzantine participants, distributed online gradient descent can only achieve a linear adversarial regret bound, which is tight. This is the inevitable consequence of Byzantine attacks, even though we can control the constant of the linear adversarial regret to a reasonable level. Interestingly, when the environment is not fully adversarial so that the losses of the honest participants are i.i.d. (independent and identically distributed), we show that sublinear stochastic regret, in contrast to the aforementioned adversarial regret, is possible. We develop a Byzantine-robust distributed online momentum algorithm to attain such a sublinear stochastic regret bound. Extensive numerical experiments corroborate our theoretical analysis.
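
For concreteness, one classical robust aggregation rule studied in this line of work is the coordinate-wise median. A minimal distributed online gradient step using it is sketched below; the paper analyzes a class of such rules plus a momentum variant, not this specific one.

```python
import numpy as np

def robust_online_step(params, local_grads, lr):
    """Sketch: server aggregates workers' gradients with the coordinate-wise
    median, which resists a minority of Byzantine gradients, then descends."""
    agg = np.median(np.stack(local_grads), axis=0)
    return params - lr * agg
```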

Finite element inspired networks: Learning physically-plausible deformable object dynamics from partial observations

  • paper_url: http://arxiv.org/abs/2307.07975
  • repo_url: None
  • paper_authors: Shamil Mamedov, A. René Geist, Jan Swevers, Sebastian Trimpe
  • for: a human-interpretable, data-efficient model for simulating the dynamics of deformable linear objects (DLOs), as sketched after the abstract below.
  • methods: inspired by the rigid finite element method (R-FEM), a DLO is modeled as a serial chain of rigid bodies whose internal state is unrolled through time by a dynamics network, trained jointly with a physics-informed encoder and decoded via the chain's forward kinematics.
  • results: a robot experiment shows that the "Finite element inspired network" (FEN) forms a capable DLO dynamics model that yields physically interpretable predictions from partial observations.
    Abstract The accurate simulation of deformable linear object (DLO) dynamics is challenging if the task at hand requires a human-interpretable and data-efficient model that also yields fast predictions. To arrive at such model, we draw inspiration from the rigid finite element method (R-FEM) and model a DLO as a serial chain of rigid bodies whose internal state is unrolled through time by a dynamics network. As this state is not observed directly, the dynamics network is trained jointly with a physics-informed encoder mapping observed motion variables to the body chain's state. To encourage that the state acquires a physically meaningful representation, we leverage the forward kinematics (FK) of the underlying R-FEM model as a decoder. We demonstrate in a robot experiment that this architecture - being termed "Finite element inspired network" - forms an easy to handle, yet capable DLO dynamics model yielding physically interpretable predictions from partial observations. The project code is available at: \url{https://tinyurl.com/fei-networks}
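
The forward-kinematics decoder has a simple planar sketch: joint angles (the latent state of the rigid-body chain) accumulate into body endpoint positions that can be compared with observed motion. Planarity and a uniform link length are simplifying assumptions of this sketch, not the paper's setup.

```python
import numpy as np

def chain_forward_kinematics(joint_angles, link_length):
    """Sketch: map relative joint angles of a planar serial chain of rigid
    bodies (approximating a DLO) to the chain's endpoint coordinates."""
    positions = [np.zeros(2)]
    heading = 0.0
    for theta in joint_angles:          # accumulate relative rotations along the chain
        heading += theta
        step = link_length * np.array([np.cos(heading), np.sin(heading)])
        positions.append(positions[-1] + step)
    return np.stack(positions)          # (n_bodies + 1, 2) endpoint coordinates
```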

Heteroscedastic Causal Structure Learning

  • paper_url: http://arxiv.org/abs/2307.07973
  • repo_url: https://github.com/baosws/host
  • paper_authors: Bao Duong, Thin Nguyen
  • for: learning directed acyclic graphs (DAGs) that encode the cause-effect relationships embedded in observational data.
  • methods: HOST (Heteroscedastic causal STructure learning), which handles heteroscedastic Gaussian noise by exploiting the normality of the causal mechanisms to recover a valid causal ordering, then identifies the DAG via conditional independence tests, all in polynomial time.
  • results: extensive experiments on controlled and real datasets show HOST is competitive with state-of-the-art methods on both causal order learning and structure learning.
    Abstract Heretofore, learning the directed acyclic graphs (DAGs) that encode the cause-effect relationships embedded in observational data is a computationally challenging problem. A recent trend of studies has shown that it is possible to recover the DAGs with polynomial time complexity under the equal variances assumption. However, this prohibits the heteroscedasticity of the noise, which allows for more flexible modeling capabilities, but at the same time is substantially more challenging to handle. In this study, we tackle the heteroscedastic causal structure learning problem under Gaussian noises. By exploiting the normality of the causal mechanisms, we can recover a valid causal ordering, which can uniquely identify the causal DAG using a series of conditional independence tests. The result is HOST (Heteroscedastic causal STructure learning), a simple yet effective causal structure learning algorithm that scales polynomially in both sample size and dimensionality. In addition, via extensive empirical evaluations on a wide range of both controlled and real datasets, we show that the proposed HOST method is competitive with state-of-the-art approaches in both the causal order learning and structure learning problems.

Enhancing Energy Efficiency and Reliability in Autonomous Systems Estimation using Neuromorphic Approach

  • paper_url: http://arxiv.org/abs/2307.07963
  • repo_url: None
  • paper_authors: Reza Ahmadvand, Sarah Safura Sharif, Yaser Mike Banad
  • for: an estimation framework based on spike coding theory and spiking neural networks (SNNs), enabling energy-efficient and reliable estimation on low size, weight, and power (SWaP) computers.
  • methods: an SNN-based Kalman filter (KF) and, building on the modified sliding innovation filter, a robust SNN-MSIF; the network weight matrices are designed from the system model, eliminating the need for learning.
  • results: Monte Carlo comparisons against the algorithmic KF and MSIF, plus robustness checks under model uncertainty and neuron loss, confirm the applicability of the approach and show SNN-MSIF's superior accuracy and robustness; the observed spiking patterns indicate roughly 97% fewer emitted spikes than possible, evidencing energy efficiency.
    Abstract Energy efficiency and reliability have long been crucial factors for ensuring cost-effective and safe missions in autonomous systems computers. With the rapid evolution of industries such as space robotics and advanced air mobility, the demand for these low size, weight, and power (SWaP) computers has grown significantly. This study focuses on introducing an estimation framework based on spike coding theories and spiking neural networks (SNN), leveraging the efficiency and scalability of neuromorphic computers. Therefore, we propose an SNN-based Kalman filter (KF), a fundamental and widely adopted optimal strategy for well-defined linear systems. Furthermore, based on the modified sliding innovation filter (MSIF) we present a robust strategy called SNN-MSIF. Notably, the weight matrices of the networks are designed according to the system model, eliminating the need for learning. To evaluate the effectiveness of the proposed strategies, we compare them to their algorithmic counterparts, namely the KF and the MSIF, using Monte Carlo simulations. Additionally, we assess the robustness of SNN-MSIF by comparing it to SNN-KF in the presence of modeling uncertainties and neuron loss. Our results demonstrate the applicability of the proposed methods and highlight the superior performance of SNN-MSIF in terms of accuracy and robustness. Furthermore, the spiking pattern observed from the networks serves as evidence of the energy efficiency achieved by the proposed methods, as they exhibited an impressive reduction of approximately 97 percent in emitted spikes compared to possible spikes.
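
For reference, the algorithmic Kalman filter that the spiking network emulates is a single predict/update cycle for a linear-Gaussian system x' = A x + w, z = H x + v:

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/update cycle of the classical Kalman filter."""
    # predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # update
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)        # correct with the measurement residual
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```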

Automated Polynomial Filter Learning for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.07956
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Wendi Yu, Zhichao Hou, Xiaorui Liu
  • for: researchers and practitioners in graph neural networks (GNNs), exploring the potential and limitations of polynomial graph filter learning and proposing the automated framework Auto-Polynomial.
  • methods: uses polynomial graph filters as a guiding principle in GNN design; a preliminary study reveals a severe overfitting issue, and Auto-Polynomial is proposed to efficiently learn better filters that adapt to complex graph signals.
  • results: significant and consistent performance improvements on both homophilic and heterophilic graphs across multiple learning settings and labeling ratios, unleashing the potential of polynomial filter learning.
    Abstract Polynomial graph filters have been widely used as guiding principles in the design of Graph Neural Networks (GNNs). Recently, the adaptive learning of the polynomial graph filters has demonstrated promising performance for modeling graph signals on both homophilic and heterophilic graphs, owning to their flexibility and expressiveness. In this work, we conduct a novel preliminary study to explore the potential and limitations of polynomial graph filter learning approaches, revealing a severe overfitting issue. To improve the effectiveness of polynomial graph filters, we propose Auto-Polynomial, a novel and general automated polynomial graph filter learning framework that efficiently learns better filters capable of adapting to various complex graph signals. Comprehensive experiments and ablation studies demonstrate significant and consistent performance improvements on both homophilic and heterophilic graphs across multiple learning settings considering various labeling ratios, which unleashes the potential of polynomial filter learning.
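
A plain polynomial graph filter, the object these methods parameterize, is y = Σ_k θ_k L̂^k x with L̂ the symmetrically normalized Laplacian; Auto-Polynomial learns the coefficients. A minimal monomial-basis sketch follows (practical filters often use Chebyshev or Bernstein bases instead):

```python
import numpy as np

def polynomial_filter(adj, x, coeffs):
    """Sketch: apply y = sum_k theta_k * L_hat^k x on a dense adjacency matrix."""
    deg = adj.sum(1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    y, power = np.zeros_like(x), x.copy()
    for theta in coeffs:         # accumulate theta_k * L^k x term by term
        y = y + theta * power
        power = L @ power
    return y
```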

SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning

  • paper_url: http://arxiv.org/abs/2307.10234
  • repo_url: None
  • paper_authors: Kiana Kheiri, Hamid Karimi
  • for: a thorough examination of Generative Pretrained Transformer (GPT) methodologies for sentiment analysis, specifically Task 4 of the SemEval 2017 dataset.
  • methods: three primary strategies: 1) prompt engineering with the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive embedding-based classification approach.
  • results: the GPT approaches clearly outperform the prior state of the art, improving F1-score by more than 22%, and handle common sentiment analysis challenges such as understanding context and detecting sarcasm notably well.
    Abstract This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis, specifically in the context of Task 4 on the SemEval 2017 dataset. Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification. The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations. Additionally, the study compares these GPT-based methodologies with other current, high-performing models previously used with the same dataset. The results illustrate the significant superiority of the GPT approaches in terms of predictive performance, more than 22\% in F1-score compared to the state-of-the-art. Further, the paper sheds light on common challenges in sentiment analysis tasks, such as understanding context and detecting sarcasm. It underscores the enhanced capabilities of the GPT models to effectively handle these complexities. Taken together, these findings highlight the promising potential of GPT models in sentiment analysis, setting the stage for future research in this field. The code can be found at https://github.com/DSAatUSU/SentimentGPT

Accelerating Distributed ML Training via Selective Synchronization

  • paper_url: http://arxiv.org/abs/2307.07950
  • repo_url: None
  • paper_authors: Sahil Tyagi, Martin Swany
  • for: a practical, low-overhead method for distributed deep neural network (DNN) training that improves training efficiency.
  • methods: \texttt{SelSync} decides at each step whether to communicate, either calling the aggregation op or applying local updates based on their significance, with additional optimizations to improve convergence in semi-synchronous training.
  • results: converges to the same or better accuracy than bulk-synchronous parallel (BSP) training while reducing training time by up to 14x.
    Abstract In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
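
The per-step decision can be sketched as follows, using the relative update norm as a stand-in significance metric; the paper's actual metric and accompanying optimizations may differ.

```python
import numpy as np

def selsync_step(params, local_update, threshold, allreduce):
    """Sketch of SelSync's gating: synchronize only significant updates."""
    significance = np.linalg.norm(local_update) / (np.linalg.norm(params) + 1e-12)
    if significance > threshold:
        update = allreduce(local_update)   # communicate: average updates across workers
    else:
        update = local_update              # skip communication, apply locally
    return params + update
```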

Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling

  • paper_url: http://arxiv.org/abs/2307.07944
  • repo_url: https://github.com/zhuoxiao-chen/redb-da-3ddet
  • paper_authors: Zhuoxiao Chen, Yadan Luo, Zheng Wang, Mahsa Baktashmotlagh, Zi Huang
  • for: domain-adaptive 3D object detection, specifically addressing low-quality pseudo labels and class imbalance in multi-class training settings.
  • methods: the ReDB framework, which produces reliable, diverse, and class-balanced pseudo 3D boxes, using cross-domain examination (CDE) to verify pseudo-label correctness and an overlapped boxes counting (OBC) metric for uniform downsampling across geometric characteristics.
  • results: ReDB substantially outperforms existing 3D domain adaptation methods, e.g., a 23.15% mAP improvement on the nuScenes → KITTI task with both voxel-based (SECOND) and point-based (PointRCNN) detectors.
    Abstract Unsupervised domain adaptation (DA) with the aid of pseudo labeling techniques has emerged as a crucial approach for domain-adaptive 3D object detection. While effective, existing DA methods suffer from a substantial drop in performance when applied to a multi-class training setting, due to the co-existence of low-quality pseudo labels and class imbalance issues. In this paper, we address this challenge by proposing a novel ReDB framework tailored for learning to detect all classes at once. Our approach produces Reliable, Diverse, and class-Balanced pseudo 3D boxes to iteratively guide the self-training on a distributionally different target domain. To alleviate disruptions caused by the environmental discrepancy (e.g., beam numbers), the proposed cross-domain examination (CDE) assesses the correctness of pseudo labels by copy-pasting target instances into a source environment and measuring the prediction consistency. To reduce computational overhead and mitigate the object shift (e.g., scales and point densities), we design an overlapped boxes counting (OBC) metric that allows to uniformly downsample pseudo-labeled objects across different geometric characteristics. To confront the issue of inter-class imbalance, we progressively augment the target point clouds with a class-balanced set of pseudo-labeled target instances and source objects, which boosts recognition accuracies on both frequently appearing and rare classes. Experimental results on three benchmark datasets using both voxel-based (i.e., SECOND) and point-based 3D detectors (i.e., PointRCNN) demonstrate that our proposed ReDB approach outperforms existing 3D domain adaptation methods by a large margin, improving 23.15% mAP on the nuScenes $\rightarrow$ KITTI task. The code is available at https://github.com/zhuoxiao-chen/ReDB-DA-3Ddet.

KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.07942
  • repo_url: https://github.com/Luoyadan/KECOR-active-3Ddet
  • paper_authors: Yadan Luo, Zhuoxiao Chen, Zhen Fang, Zheng Zhang, Zi Huang, Mahsa Baktashmotlagh
  • for: reliable LiDAR-based 3D object detection for autonomous driving with a reduced annotation burden, via active learning.
  • methods: a kernel coding rate maximization (KECOR) strategy that, through the lens of information theory, greedily selects the point clouds whose latent features require the most bits to encode, evaluated via an empirical neural tangent kernel built from the detector head.
  • results: compared to the state-of-the-art active learning method, roughly 44% of box-level annotation costs and 26% of computational time are saved without compromising detection performance.
    Abstract Achieving a reliable LiDAR-based object detector in autonomous driving is paramount, but its success hinges on obtaining large amounts of precise 3D annotations. Active learning (AL) seeks to mitigate the annotation burden through algorithms that use fewer labels and can attain performance comparable to fully supervised learning. Although AL has shown promise, current approaches prioritize the selection of unlabeled point clouds with high uncertainty and/or diversity, leading to the selection of more instances for labeling and reduced computational efficiency. In this paper, we resort to a novel kernel coding rate maximization (KECOR) strategy which aims to identify the most informative point clouds to acquire labels through the lens of information theory. Greedy search is applied to seek desired point clouds that can maximize the minimal number of bits required to encode the latent features. To determine the uniqueness and informativeness of the selected samples from the model perspective, we construct a proxy network of the 3D detector head and compute the outer product of Jacobians from all proxy layers to form the empirical neural tangent kernel (NTK) matrix. To accommodate both one-stage (i.e., SECOND) and two-stage detectors (i.e., PVRCNN), we further incorporate the classification entropy maximization and well trade-off between detection performance and the total number of bounding boxes selected for annotation. Extensive experiments conducted on two 3D benchmarks and a 2D detection dataset evidence the superiority and versatility of the proposed approach. Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art AL method, without compromising detection performance.
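
The selection criterion can be sketched directly from its information-theoretic form: the coding rate R(Z) = ½ log det(I + d/(nε²) Z Zᵀ) of the latent features, maximized greedily. KECOR evaluates this through an empirical NTK of the detector head; here `encode` is a hypothetical feature extractor standing in for that machinery.

```python
import numpy as np

def coding_rate(Z, eps=0.1):
    """Bits (up to constants) to encode features Z (n, d) at precision eps."""
    n, d = Z.shape
    gram = np.eye(n) + (d / (n * eps ** 2)) * (Z @ Z.T)
    return 0.5 * np.linalg.slogdet(gram)[1]

def greedy_select(pool, budget, encode, eps=0.1):
    """Sketch: pick `budget` samples whose features jointly maximize the rate."""
    chosen = []
    for _ in range(budget):
        best = max(pool, key=lambda s: coding_rate(encode(chosen + [s]), eps))
        chosen.append(best)
        pool = [s for s in pool if s is not best]
    return chosen
```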

Optimal Compression of Unit Norm Vectors in the High Distortion Regime

  • paper_url: http://arxiv.org/abs/2307.07941
  • repo_url: None
  • paper_authors: Heng Zhu, Avishek Ghosh, Arya Mazumdar
  • for: compressing a unit norm vector into the minimum number of bits while still allowing an acceptable level of distortion in recovery.
  • methods: revisits the rate-distortion/covering code literature but focuses exclusively on the "high-distortion" regime, in a worst-case setting with no prior on the vector while allowing randomized compression maps; both biased and unbiased compression methods are considered.
  • results: the optimal compression rates are determined, and simple compression schemes turn out to be nearly optimal; the results, a mix of new and known, are compiled for completeness.
    Abstract Motivated by the need for communication-efficient distributed learning, we investigate the method for compressing a unit norm vector into the minimum number of bits, while still allowing for some acceptable level of distortion in recovery. This problem has been explored in the rate-distortion/covering code literature, but our focus is exclusively on the "high-distortion" regime. We approach this problem in a worst-case scenario, without any prior information on the vector, but allowing for the use of randomized compression maps. Our study considers both biased and unbiased compression methods and determines the optimal compression rates. It turns out that simple compression schemes are nearly optimal in this scenario. While the results are a mix of new and known, they are compiled in this paper for completeness.
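
The flavor of "simple schemes" in this regime can be illustrated with one-bit-per-coordinate sign quantization of a unit vector, offered here as an example of such a scheme rather than the paper's specific construction.

```python
import numpy as np

def compress_sign(x):
    """Sketch: keep only the coordinate signs (d bits for x in R^d) and rescale
    so the reconstruction is again exactly unit norm."""
    bits = np.signbit(x)                           # d bits
    x_hat = np.where(bits, -1.0, 1.0) / np.sqrt(x.size)
    return bits, x_hat

# example distortion: for unit-norm x and this unit-norm x_hat,
# ||x - x_hat||^2 = 2 - 2 * <x, x_hat>
```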

A Novel Truncated Norm Regularization Method for Multi-channel Color Image Denoising

  • paper_url: http://arxiv.org/abs/2307.07932
  • repo_url: https://github.com/wangzhi82/DtNFM
  • paper_authors: Yiwen Shan, Dong Hu, Haoming Ding, Chunming Yang, Zhi Wang
  • for: multi-channel color image denoising that accounts for both cross-channel differences and the spatial variation of noise in real-world images.
  • methods: a double-weighted truncated nuclear norm minus truncated Frobenius norm minimization (DtNFM) model; nonlocal self-similar patches are grouped, each group is denoised by the DtNFM model, and the denoised image is assembled by concatenating the estimates, with an ADMM algorithm whose subproblems have closed-form global optima.
  • results: extensive experiments on synthetic and real noise datasets show the method outperforms many state-of-the-art color image denoising methods.
    Abstract Due to the high flexibility and remarkable performance, low-rank approximation methods has been widely studied for color image denoising. However, those methods mostly ignore either the cross-channel difference or the spatial variation of noise, which limits their capacity in real world color image denoising. To overcome those drawbacks, this paper is proposed to denoise color images with a double-weighted truncated nuclear norm minus truncated Frobenius norm minimization (DtNFM) method. Through exploiting the nonlocal self-similarity of the noisy image, the similar structures are gathered and a series of similar patch matrices are constructed. For each group, the DtNFM model is conducted for estimating its denoised version. The denoised image would be obtained by concatenating all the denoised patch matrices. The proposed DtNFM model has two merits. First, it models and utilizes both the cross-channel difference and the spatial variation of noise. This provides sufficient flexibility for handling the complex distribution of noise in real world images. Second, the proposed DtNFM model provides a close approximation to the underlying clean matrix since it can treat different rank components flexibly. To solve the problem resulted from DtNFM model, an accurate and effective algorithm is proposed by exploiting the framework of the alternating direction method of multipliers (ADMM). The generated subproblems are discussed in detail. And their global optima can be easily obtained in closed-form. Rigorous mathematical derivation proves that the solution sequences generated by the algorithm converge to a single critical point. Extensive experiments on synthetic and real noise datasets demonstrate that the proposed method outperforms many state-of-the-art color image denoising methods.
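
The SVD-based proximal step underlying truncated-norm models is easy to sketch: keep the leading r singular values intact (the dominant structure) and soft-threshold the tail. DtNFM's actual closed-form solution additionally involves the double weights and the truncated Frobenius term, so this is only the basic building block.

```python
import numpy as np

def truncated_shrinkage(Y, r, tau):
    """Sketch: SVD shrinkage that preserves the top-r spectrum and
    soft-thresholds the remaining singular values by tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_new = s.copy()
    s_new[r:] = np.maximum(s[r:] - tau, 0.0)   # shrink only the tail spectrum
    return (U * s_new) @ Vt
```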
    摘要 由于其高度的灵活性和出色的性能，低秩逼近方法在彩色图像去噪中得到了广泛研究。然而，这些方法大多忽略了噪声的跨通道差异或空间变化，限制了其在真实世界彩色图像去噪中的能力。为克服这些缺陷，本文提出一种双加权截断核范数减截断Frobenius范数最小化（DtNFM）的彩色图像去噪方法。通过利用含噪图像的非局部自相似性，将相似结构聚集起来，构造一系列相似块矩阵；对每一组求解 DtNFM 模型以估计其去噪版本，再将所有去噪后的块矩阵拼接得到去噪图像。所提出的 DtNFM 模型有两个优点：其一，它同时建模并利用了噪声的跨通道差异和空间变化，为处理真实图像中复杂的噪声分布提供了足够的灵活性；其二，DtNFM 模型能够灵活地处理不同的秩成分，从而对底层干净矩阵给出紧密的逼近。为求解 DtNFM 模型产生的优化问题，本文基于交替方向乘子法（ADMM）框架提出了一种精确而有效的算法，详细讨论了所产生的子问题，其全局最优解均可以闭式形式轻松求得。严格的数学推导证明了算法生成的解序列收敛到单一临界点。在合成和真实噪声数据集上的大量实验表明，所提方法优于许多最先进的彩色图像去噪方法。
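
As a rough illustration of the kind of proximal step at the heart of such ADMM solvers, the sketch below implements generic weighted singular-value thresholding. The actual DtNFM proximal operator additionally involves the subtracted truncated Frobenius term and the paper's specific weights, which are not reproduced here.

```python
import numpy as np

def weighted_svt(Y, weights):
    """Weighted singular-value thresholding: the proximal step at the core of
    ADMM solvers for weighted/truncated nuclear-norm patch models like DtNFM."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - weights, 0.0)   # shrink each singular value by its weight
    return (U * s_shrunk) @ Vt                # == U @ diag(s_shrunk) @ Vt

# Truncation as a special case: zero weights leave the r largest singular
# values untouched, as in truncated nuclear-norm models.
Y = np.random.default_rng(0).normal(size=(64, 27))
w = np.full(min(Y.shape), 0.8)
w[:3] = 0.0                                   # keep the top-3 singular values intact
Y_lowrank = weighted_svt(Y, w)
```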

On the Robustness of Split Learning against Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2307.07916
  • repo_url: https://github.com/fmy266/SplitADV
  • paper_authors: Mingyuan Fan, Cen Chen, Chengyu Wang, Wenmeng Zhou, Jun Huang
  • for: 评估拆分学习（split learning）对对抗攻击的鲁棒性，特别是在不可信服务器仅能访问模型中间层的最具挑战性的设定下。
  • methods: 开发了一种专门针对拆分学习的攻击方法，称为 SPADV，它包括两个阶段：1）影子模型训练，解决缺少部分模型的问题；2）本地对抗攻击，生成用于评估的对抗样本。
  • results: SPADV 的第一阶段仅需少量无标注的非独立同分布数据，第二阶段通过扰动自然样本的中间输出来生成对抗样本；攻击成本较低而效果显著，表明拆分学习对对抗攻击出人意料地脆弱。
    Abstract Split learning enables collaborative deep learning model training while preserving data privacy and model security by avoiding direct sharing of raw data and model details (i.e., server and clients only hold partial sub-networks and exchange intermediate computations). However, existing research has mainly focused on examining its reliability for privacy protection, with little investigation into model security. Specifically, by exploring full models, attackers can launch adversarial attacks, and split learning can mitigate this severe threat by only disclosing part of models to untrusted servers. This paper aims to evaluate the robustness of split learning against adversarial attacks, particularly in the most challenging setting where untrusted servers only have access to the intermediate layers of the model. Existing adversarial attacks mostly focus on the centralized setting instead of the collaborative setting; thus, to better evaluate the robustness of split learning, we develop a tailored attack called SPADV, which comprises two stages: 1) shadow model training that addresses the issue of lacking part of the model, and 2) a local adversarial attack that produces adversarial examples to evaluate. The first stage only requires a few unlabeled non-IID data, and, in the second stage, SPADV perturbs the intermediate output of natural samples to craft the adversarial ones. The overall cost of the proposed attack process is relatively low, yet the empirical attack effectiveness is significantly high, demonstrating the surprising vulnerability of split learning to adversarial attacks.
    摘要 拆分学习通过避免直接共享原始数据和模型细节（即服务器和客户端各自仅持有部分子网络并交换中间计算结果），在保护数据隐私和模型安全的同时实现了协作式深度学习模型训练。然而，现有研究主要关注其在隐私保护方面的可靠性，对模型安全性的探讨很少。具体而言，攻击者可以通过探索完整模型发起对抗攻击，而拆分学习仅向不可信服务器披露部分模型，从而可以缓解这种严重威胁。本文旨在评估拆分学习对对抗攻击的鲁棒性，特别是在不可信服务器仅能访问模型中间层这一最具挑战性的设定下。现有的对抗攻击大多针对集中式设定而非协作式设定，因此，为了更好地评估拆分学习的鲁棒性，我们开发了一种专门的攻击方法，称为 SPADV。SPADV 包括两个阶段：1）影子模型训练，解决缺少部分模型的问题；2）本地对抗攻击，生成用于评估的对抗样本。第一阶段仅需少量无标注的非独立同分布数据；第二阶段中，SPADV 对自然样本的中间输出进行扰动以构造对抗样本。整个攻击过程的总成本较低，但实际攻击效果却非常显著，表明拆分学习对对抗攻击出人意料地脆弱。
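
A hedged PyTorch sketch of the two-stage idea: assume the adversary holds the client sub-network `client_net` and has already trained a local `shadow_top` standing in for the unseen server-side layers (stage 1); stage 2 is then an ordinary PGD run through the composed model. The step sizes and budgets are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def spadv_like_attack(client_net, shadow_top, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    # Stage 2 of a SPADV-style attack: craft an L_inf-bounded perturbation by
    # ascending the loss of shadow_top(client_net(.)), i.e. without ever
    # querying the real server-side sub-network.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(shadow_top(client_net(x_adv)), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                      # gradient ascent
            x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)   # project to the ball
    return x_adv.detach()
```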

Exploiting FPGA Capabilities for Accelerated Biomedical Computing

  • paper_url: http://arxiv.org/abs/2307.07914
  • repo_url: None
  • paper_authors: Kayode Inadagbo, Baran Arig, Nisanur Alici, Murat Isik
  • for: 这项研究旨在利用现场可编程门阵列（FPGA）增强心电（ECG）信号分析，并给出了多种先进神经网络架构，包括卷积神经网络（CNN）、循环神经网络（RNN）、长短期记忆网络（LSTM）和深度信念网络（DBN）。
  • methods: 我们使用 MIT-BIH 心律失常数据库进行训练和验证，并在数据中引入高斯噪声以提高算法的鲁棒性；同时使用 EarlyStopping 回调和 Dropout 层来避免过拟合。此外，我们还为 PYNQ Z1 板开发了一个自定义的 Tensor Compute Unit（TCU）加速器。
  • results: 我们计算了延迟和吞吐量等性能指标以提供实用的洞见，展示了 FPGA 在高性能生物医学计算中的潜力。这项研究最终为在 FPGA 上优化神经网络性能提供了适用于多种应用场景的指南。
    Abstract This study presents advanced neural network architectures including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTMs), and Deep Belief Networks (DBNs) for enhanced ECG signal analysis using Field Programmable Gate Arrays (FPGAs). We utilize the MIT-BIH Arrhythmia Database for training and validation, introducing Gaussian noise to improve algorithm robustness. The implemented models feature various layers for distinct processing and classification tasks and techniques like EarlyStopping callback and Dropout layer are used to mitigate overfitting. Our work also explores the development of a custom Tensor Compute Unit (TCU) accelerator for the PYNQ Z1 board, offering comprehensive steps for FPGA-based machine learning, including setting up the Tensil toolchain in Docker, selecting architecture, configuring PS-PL, and compiling and executing models. Performance metrics such as latency and throughput are calculated for practical insights, demonstrating the potential of FPGAs in high-performance biomedical computing. The study ultimately offers a guide for optimizing neural network performance on FPGAs for various applications.
    摘要 这项研究提出了使用现场可编程门阵列（FPGA）增强心电（ECG）信号分析的先进神经网络架构，包括卷积神经网络（CNN）、循环神经网络（RNN）、长短期记忆网络（LSTM）和深度信念网络（DBN）。我们使用 MIT-BIH 心律失常数据库进行训练和验证，并在数据中引入高斯噪声以提高算法的鲁棒性。实现的模型包含面向不同处理与分类任务的多种层，并使用 EarlyStopping 回调和 Dropout 层来避免过拟合。我们的工作还探讨了基于 PYNQ Z1 板的自定义 Tensor Compute Unit（TCU）加速器的开发，并提供了完整的基于 FPGA 的机器学习实现步骤，包括在 Docker 中设置 Tensil 工具链、选择架构、配置 PS-PL，以及编译并执行模型。实验计算了延迟和吞吐量等性能指标以提供实用的洞见，展示了 FPGA 在高性能生物医学计算中的潜力。研究最终为在 FPGA 上优化神经网络性能提供了适用于多种应用的指南。
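
A hedged Keras sketch of the kind of model the paper describes, combining Gaussian noise injection, an LSTM, Dropout, and an EarlyStopping callback; the layer sizes, noise level, and beat length are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, callbacks

def make_ecg_model(beat_len=187, n_classes=5):
    # Layer sizes, noise level, and beat length are illustrative assumptions.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(beat_len, 1)),
        layers.GaussianNoise(0.05),      # noise injection for robustness
        layers.Conv1D(32, 7, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dropout(0.3),             # mitigates overfitting
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

early = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early])
```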

Predicting mechanical properties of Carbon Nanotube (CNT) images Using Multi-Layer Synthetic Finite Element Model Simulations

  • paper_url: http://arxiv.org/abs/2307.07912
  • repo_url: None
  • paper_authors: Kaveh Safavigerdini, Koundinya Nouduri, Ramakrishna Surya, Andrew Reinhard, Zach Quinlan, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
  • for: 预测垂直排列碳纳米管（CNT）森林图像的机械性能
  • methods: 使用深度学习模型进行基于人工智能（AI）的材料发现
  • results: 提出了一种使用多层合成（MLS）图像（即准2.5D图像）进行数据增强的管道，有望更好地预测 CNT 森林的机械性能。
    Abstract We present a pipeline for predicting mechanical properties of vertically-oriented carbon nanotube (CNT) forest images using a deep learning model for artificial intelligence (AI)-based materials discovery. Our approach incorporates an innovative data augmentation technique that involves the use of multi-layer synthetic (MLS) or quasi-2.5D images which are generated by blending 2D synthetic images. The MLS images more closely resemble 3D synthetic and real scanning electron microscopy (SEM) images of CNTs but without the computational cost of performing expensive 3D simulations or experiments. Mechanical properties such as stiffness and buckling load for the MLS images are estimated using a physics-based model. The proposed deep learning architecture, CNTNeXt, builds upon our previous CNTNet neural network, using a ResNeXt feature representation followed by random forest regression estimator. Our machine learning approach for predicting CNT physical properties by utilizing a blended set of synthetic images is expected to outperform single synthetic image-based learning when it comes to predicting mechanical properties of real scanning electron microscopy images. This has the potential to accelerate understanding and control of CNT forest self-assembly for diverse applications.
    摘要 我们提出了一条利用深度学习模型预测垂直排列碳纳米管（CNT）森林图像机械性能的管道，用于基于人工智能（AI）的材料发现。我们的方法包含一种创新的数据增强技术，使用多层合成（MLS）或准2.5D图像，这些图像由混合2D合成图像生成。MLS 图像更接近3D合成图像和真实的扫描电子显微镜（SEM）CNT图像，却无需承担昂贵的3D模拟或实验的计算成本。对于 MLS 图像，刚度和屈曲载荷等机械性能通过基于物理的模型进行估算。我们提出的深度学习架构 CNTNeXt 建立在此前的 CNTNet 神经网络之上，采用 ResNeXt 特征表示，随后接随机森林回归估计器。在预测真实 SEM 图像的机械性能方面，这种利用混合合成图像集的机器学习方法有望超越仅基于单一合成图像的学习，从而有潜力加速对 CNT 森林自组装的理解与控制，服务于多种应用。

MESOB: Balancing Equilibria & Social Optimality

  • paper_url: http://arxiv.org/abs/2307.07911
  • repo_url: None
  • paper_authors: Xin Guo, Lihong Li, Sareh Nabi, Rabih Salhab, Junzi Zhang
  • for: 这篇论文旨在为具有大量匿名代理人、竞争与合作复杂交织的多层次多智能体博弈提供一种新的优化方法。
  • methods: 所提方法称为 MESOB-OMO，它将平均场近似与占用测度优化方法相结合，用于求解一个双目标优化问题。
  • results: 该方法能够均衡各方利益并处理竞标者的竞争性，在模拟的按点击付费广告拍卖中优于仅考虑竞争或仅考虑合作的基线方法。
    Abstract Motivated by bid recommendation in online ad auctions, this paper considers a general class of multi-level and multi-agent games, with two major characteristics: one is a large number of anonymous agents, and the other is the intricate interplay between competition and cooperation. To model such complex systems, we propose a novel and tractable bi-objective optimization formulation with mean-field approximation, called MESOB (Mean-field Equilibria & Social Optimality Balancing), as well as an associated occupation measure optimization (OMO) method called MESOB-OMO to solve it. MESOB-OMO enables obtaining approximately Pareto efficient solutions in terms of the dual objectives of competition and cooperation in MESOB, and in particular allows for Nash equilibrium selection and social equalization in an asymptotic manner. We apply MESOB-OMO to bid recommendation in a simulated pay-per-click ad auction. Experiments demonstrate its efficacy in balancing the interests of different parties and in handling the competitive nature of bidders, as well as its advantages over baselines that only consider either the competitive or the cooperative aspects.
    摘要 受在线广告拍卖中出价推荐的启发，本文考虑了一类一般的多层次多智能体博弈，其具有两个主要特征：一是存在大量匿名智能体，二是竞争与合作之间的复杂互动。为建模此类复杂系统，我们提出了一种新颖且易于处理的、带平均场近似的双目标优化形式，称为 MESOB（Mean-field Equilibria & Social Optimality Balancing），以及与之配套的占用测度优化方法 MESOB-OMO 来求解。MESOB-OMO 能够就 MESOB 中竞争与合作两个目标获得近似帕累托有效的解，特别地，它可以在渐近意义下实现纳什均衡选择与社会均等化。我们将 MESOB-OMO 应用于模拟的按点击付费广告拍卖中的出价推荐。实验表明，它能够均衡各方利益并处理竞标者的竞争性，且优于仅考虑竞争或仅考虑合作方面的基线方法。

Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation

  • paper_url: http://arxiv.org/abs/2307.07907
  • repo_url: None
  • paper_authors: Wenhao Ding, Laixi Shi, Yuejie Chi, Ding Zhao
  • for: Handle spurious correlation in reinforcement learning to improve the robustness of real-world tasks.
  • methods: Propose a new framework called Robust State-Confounded Markov Decision Processes (RSC-MDPs) and design an empirical algorithm to learn the robust optimal policy.
  • results: Outperform all baselines in eight realistic self-driving and manipulation tasks.
    Abstract Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have causality but have correlations induced by unobserved confounders. These spurious correlations are ubiquitous in real-world tasks, for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and sequential structure of RL, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in breaking spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.
    摘要 鲁棒性在强化学习（RL）中已被广泛研究，用以应对随机扰动、罕见事件和恶意攻击等各种形式的不确定性。在本工作中，我们考虑一类关键的鲁棒性问题——对抗虚假相关性：状态的不同部分之间并无因果关系，却因不可观测的混杂因素而产生相关性。这类虚假相关性在现实任务中无处不在，例如由于不可观测的人类活动，自动驾驶汽车通常在白天观察到繁忙的交通、在夜间观察到稀疏的交通。学习了这种无用甚至有害相关性的模型，当测试时的混杂因素偏离训练分布时，可能灾难性地失效。尽管动机明确，实现对虚假相关性的鲁棒性仍面临重大挑战：由不可观测混杂因素和 RL 的序列结构所塑造的不确定性集合难以刻画和识别，现有假设简单、无结构不确定性集合的鲁棒算法因此不足以应对这一挑战。为解决该问题，我们提出了鲁棒状态混杂马尔可夫决策过程（RSC-MDP），并从理论上证明其在打破虚假相关性方面优于其他鲁棒 RL 方法。我们还设计了一种经验算法来学习 RSC-MDP 的鲁棒最优策略，在八个真实的自动驾驶和操纵任务中优于所有基线。

Anomaly Detection in Automated Fibre Placement: Learning with Data Limitations

  • paper_url: http://arxiv.org/abs/2307.07893
  • repo_url: None
  • paper_authors: Assef Ghamisi, Todd Charter, Li Ji, Maxime Rivard, Gil Lund, Homayoun Najjaran
  • for: Automated Fibre Placement (AFP) 自动纤维放置系统中的缺陷检测
  • methods: 无监督深度学习和经典计算机视觉算法
  • results: 可以减少训练图像数量,同时检测到各种表面问题,并且可以准确地标识缺陷位置
    Abstract Conventional defect detection systems in Automated Fibre Placement (AFP) typically rely on end-to-end supervised learning, necessitating a substantial number of labelled defective samples for effective training. However, the scarcity of such labelled data poses a challenge. To overcome this limitation, we present a comprehensive framework for defect detection and localization in Automated Fibre Placement. Our approach combines unsupervised deep learning and classical computer vision algorithms, eliminating the need for labelled data or manufacturing defect samples. It efficiently detects various surface issues while requiring fewer images of composite parts for training. Our framework employs an innovative sample extraction method leveraging AFP's inherent symmetry to expand the dataset. By inputting a depth map of the fibre layup surface, we extract local samples aligned with each composite strip (tow). These samples are processed through an autoencoder, trained on normal samples for precise reconstructions, highlighting anomalies through reconstruction errors. Aggregated values form an anomaly map for insightful visualization. The framework employs blob detection on this map to locate manufacturing defects. The experimental findings reveal that despite training the autoencoder with a limited number of images, our proposed method exhibits satisfactory detection accuracy and accurately identifies defect locations. Our framework demonstrates comparable performance to existing methods, while also offering the advantage of detecting all types of anomalies without relying on an extensive labelled dataset of defects.
    摘要 自动纤维铺放（AFP）中的传统缺陷检测系统通常依赖端到端的监督学习，需要大量带标注的缺陷样本才能有效训练，然而此类标注数据的稀缺构成了挑战。为克服这一限制，我们提出了一个用于 AFP 缺陷检测与定位的完整框架。该方法结合无监督深度学习与经典计算机视觉算法，无需标注数据或制造缺陷样本，能够高效地检测各种表面问题，且训练所需的复合材料部件图像更少。我们的框架采用一种创新的样本提取方法，利用 AFP 固有的对称性来扩充数据集：输入纤维铺层表面的深度图，提取与每条复合材料条带（tow）对齐的局部样本。这些样本经由一个仅在正常样本上训练、能够精确重建的自编码器处理，通过重建误差凸显异常；聚合后的数值构成异常图，便于直观可视化。框架在该图上进行斑点检测以定位制造缺陷。实验结果表明，尽管自编码器仅用有限数量的图像训练，所提方法仍表现出令人满意的检测精度，并能准确识别缺陷位置。我们的框架在性能上可与现有方法相媲美，同时还具有无需依赖大规模缺陷标注数据集即可检测所有类型异常的优势。
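
A hedged sketch of the aggregation and localization steps: reconstruction errors from an autoencoder trained only on normal samples are written back into an image-sized anomaly map, and simple blob detection flags defect locations. The window bookkeeping and the Keras-style `predict` interface are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def anomaly_map(autoencoder, samples, windows, shape):
    # samples: local patches extracted along each tow; windows: the
    # (row_slice, col_slice) each patch came from. The per-patch
    # reconstruction error is written back into an image-sized map.
    amap = np.zeros(shape, dtype=np.float32)
    for patch, (rows, cols) in zip(samples, windows):
        recon = autoencoder.predict(patch[None], verbose=0)[0]
        amap[rows, cols] = np.mean((patch - recon) ** 2)
    return amap

def locate_defects(amap, thresh):
    labeled, n_blobs = ndimage.label(amap > thresh)   # simple blob detection
    return ndimage.find_objects(labeled), n_blobs
```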

Multitemporal SAR images change detection and visualization using RABASAR and simplified GLR

  • paper_url: http://arxiv.org/abs/2307.07892
  • repo_url: None
  • paper_authors: Weiying Zhao, Charles-Alban Deledalle, Loïc Denis, Henri Maître, Jean-Marie Nicolas, Florence Tupin
  • for: 这篇论文主要研究土地表面变化的检测，尤其是农田、建筑、港口和洪涝等不同类型区域的变化。
  • methods: 本论文提出了一种简化的广义似然比（SGLR）方法，假设对应时相的像素具有相同的等效视数（ENL）；并在 RABASAR 去噪数据的基础上，提出了一种新的变化幅度指数方法和一种改进的基于谱聚类的变化分类方法。
  • results: 本论文通过处理模拟图像与SAR图像，证明了所提方法的有效性，特别是在检测农田、建筑、港口和洪涝区域变化方面。
    Abstract Understanding the state of changed areas requires that precise information be given about the changes. Thus, detecting different kinds of changes is important for land surface monitoring. SAR sensors are ideal to fulfil this task, because of their all-time and all-weather capabilities, with good accuracy of the acquisition geometry and without effects of atmospheric constituents for amplitude data. In this study, we propose a simplified generalized likelihood ratio ($S_{GLR}$) method assuming that corresponding temporal pixels have the same equivalent number of looks (ENL). Thanks to the denoised data provided by a ratio-based multitemporal SAR image denoising method (RABASAR), we successfully applied this similarity test approach to compute the change areas. A new change magnitude index method and an improved spectral clustering-based change classification method are also developed. In addition, we apply the simplified generalized likelihood ratio to detect the maximum change magnitude time, and the change starting and ending times. Then, we propose to use an adaptation of the REACTIV method to visualize the detection results vividly. The effectiveness of the proposed methods is demonstrated through the processing of simulated and SAR images, and the comparison with classical techniques. In particular, numerical experiments proved that the developed method has good performances in detecting farmland area changes, building area changes, harbour area changes and flooding area changes.
    摘要 理解变化区域的状况需要提供关于变化的精确信息，因此检测不同类型的变化对土地表面监测十分重要。SAR传感器非常适合完成这项任务：它们具备全天时、全天候的观测能力，获取几何精度高，且幅度数据不受大气成分影响。在本研究中，我们提出了一种简化的广义似然比（SGLR）方法，假设对应时相的像素具有相同的等效视数（ENL）。得益于基于比值的多时相SAR图像去噪方法（RABASAR）提供的去噪数据，我们成功地应用这种相似性检验方法计算变化区域。我们还提出了一种新的变化幅度指数方法和一种改进的基于谱聚类的变化分类方法。此外，我们应用简化的广义似然比来检测最大变化幅度时刻以及变化的起止时间，并提出使用改编的 REACTIV 方法对检测结果进行生动的可视化。通过处理模拟图像和SAR图像并与经典技术进行比较，证明了所提方法的有效性。数值实验尤其表明，该方法在检测农田、建筑、港口和洪涝区域变化方面具有良好的性能。
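
For intuition, here is a minimal sketch of a pixelwise GLR test of this kind: for two Gamma-distributed multi-look intensities sharing the same ENL, the statistic below is -2 log(likelihood ratio) of "same reflectivity" against "different reflectivities". It is a generic illustration (the paper applies its test to RABASAR-denoised data with further refinements), and the threshold is a hypothetical choice.

```python
import numpy as np

def sglr(i1, i2, enl):
    # Under "no change" the MLE reflectivity is the average of the two pixels;
    # the statistic grows as the two intensities diverge.
    i1, i2 = np.asarray(i1, float), np.asarray(i2, float)
    return 2.0 * enl * np.log((i1 + i2) ** 2 / (4.0 * i1 * i2))

rng = np.random.default_rng(0)
L = 10.0                                     # equivalent number of looks
mu1 = np.ones((128, 128))
mu2 = mu1.copy()
mu2[32:64, 32:64] = 4.0                      # a block whose reflectivity changed
img_t0 = rng.gamma(L, mu1 / L)               # multi-look speckle model, mean mu1
img_t1 = rng.gamma(L, mu2 / L)
change_mask = sglr(img_t0, img_t1, enl=L) > 5.0   # hypothetical threshold
print("changed pixels:", int(change_mask.sum()))
```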

Investigation of compressor cascade flow based on physics-informed neural networks

  • paper_url: http://arxiv.org/abs/2308.04501
  • repo_url: None
  • paper_authors: Zhihui Li, Francesco Montomoli, Sanjiv Sharma
  • for: 这项研究首次使用新兴的物理信息神经网络（PINN）方法预测压气机叶栅流场。
  • methods: 该方法在二维问题上进行演示，在正问题与反问题中均嵌入 Navier-Stokes 方程。
  • results: PINN 能够高精度地预测压气机流场；在缺少部分边界条件的反问题中，PINN 相比传统 CFD 方法显示出明显优势。
    Abstract In this study, we utilize the emerging Physics Informed Neural Networks (PINNs) approach for the first time to predict the flow field of a compressor cascade. The approach is demonstrated on a two-dimensional problem, incorporating Navier-Stokes equations in both the forward and inverse problems. In the forward problem, PINNs effectively predict the flow field of the compressor. The key advantage over Deep Neural Networks (DNNs) is that the PINNs model incorporates a physical relationship between the relevant quantities, resulting in more precise predictions. PINNs show obvious advantages over the traditional CFD approaches when dealing with inverse problems in the absence of partial boundary conditions. PINNs successfully reconstruct the flow field of the compressor cascade solely based on partial velocity vectors and wall pressure information. This research provides compelling evidence that PINNs offer turbomachinery designers a promising alternative to the current dominant CFD methods, delivering higher accuracy compared to DNNs.
    摘要 在本研究中，我们首次利用新兴的物理信息神经网络（PINN）方法预测压气机叶栅的流场。该方法在一个二维问题上进行了演示，在正问题和反问题中均嵌入 Navier-Stokes 方程。在正问题中，PINN 能够有效地预测压气机的流场；与深度神经网络（DNN）相比，其关键优势在于 PINN 模型嵌入了相关物理量之间的物理关系，从而得到更精确的预测。在缺少部分边界条件的反问题中，PINN 相比传统 CFD 方法显示出明显优势：仅凭部分速度矢量和壁面压力信息，PINN 便成功重建了压气机叶栅的流场。这项研究有力地证明，PINN 为叶轮机械设计人员提供了一种颇具前景的、可替代当前主流 CFD 方法的选择，并且比 DNN 具有更高的精度。
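
A minimal PyTorch sketch of the PINN ingredient the paper relies on: a network maps coordinates to flow variables, and a PDE residual obtained by automatic differentiation is added to the training loss. Only the 2-D incompressibility (continuity) residual is shown; the network size and sampling are illustrative.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 3),              # outputs (u, v, p)
)

def continuity_residual(xy):
    xy = xy.clone().requires_grad_(True)
    u, v, _ = net(xy).unbind(dim=1)
    du = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    dv = torch.autograd.grad(v.sum(), xy, create_graph=True)[0]
    return du[:, 0] + dv[:, 1]           # du/dx + dv/dy, zero for incompressible flow

xy = torch.rand(1024, 2)                 # collocation points in the domain
loss_pde = continuity_residual(xy).pow(2).mean()
loss_pde.backward()                      # combined in practice with a data-fit loss
```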

Handwritten and Printed Text Segmentation: A Signature Case Study

  • paper_url: http://arxiv.org/abs/2307.07887
  • repo_url: None
  • paper_authors: Sina Gholamian, Ali Vahdat
  • for: 提高手写与印刷文本分割的精度，尤其是在两者相互重叠的区域
  • methods: 引入新的数据集 SignaTR6K 以及一种新的模型架构
  • results: 在两个不同的数据集上，IoU 分数分别比先前工作提高 17.9% 和 7.3%
    Abstract While analyzing scanned documents, handwritten text can overlap with printed text. This overlap causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses solely on the binary classification of handwritten text or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This approach results in the assignment of overlapping handwritten and printed pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches to address the challenges of handwritten and printed text segmentation. Our objective is to recover text from different classes in their entirety, especially enhancing the segmentation performance on overlapping sections. To support this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for the handwritten and printed text segmentation task. Our best configuration outperforms prior work on two different datasets by 17.9% and 7.3% on IoU scores. The SignaTR6K dataset is accessible for download via the following link: https://forms.office.com/r/2a5RDg7cAY.
    摘要 在分析扫描文档时，手写文本可能与印刷文本重叠。这种重叠给文档的光学字符识别（OCR）和数字化过程带来困难，进而损害下游的自然语言处理任务。先前的研究要么仅关注手写文本的二分类，要么对文档进行三类分割，即识别手写、印刷和背景像素。这种做法会将重叠的手写与印刷像素仅归入其中一个类别，从而在另一类别中被忽略。因此，在本研究中，我们开发了新的方法来应对手写与印刷文本分割的挑战，目标是完整地恢复不同类别的文本，尤其是提升重叠区域的分割性能。为支持这一任务，我们引入了从真实法律文档中收集的新数据集 SignaTR6K，以及一种用于手写与印刷文本分割的新模型架构。我们的最佳配置在两个不同数据集上的 IoU 分数分别比先前工作高出 17.9% 和 7.3%。SignaTR6K 数据集可通过以下链接下载：https://forms.office.com/r/2a5RDg7cAY 。

Intuitionistic Fuzzy Broad Learning System: Enhancing Robustness Against Noise and Outliers

  • paper_url: http://arxiv.org/abs/2307.08713
  • repo_url: None
  • paper_authors: M. Sajid, A. K. Malik, M. Tanveer
  • for: 提高宽度学习系统（Broad Learning System, BLS）的鲁棒性和有效性，以应对真实数据集中噪声和离群点的问题。
  • methods: 提出了两种改进的 BLS 模型：模糊 BLS（F-BLS）模型和基于直觉模糊理论的 IF-BLS 模型。两种模型都通过距离来确定训练点的隶属度，但 F-BLS 模型仅考虑训练点到类中心（在原始特征空间中）的距离，而 IF-BLS 模型则同时考虑训练点的隶属度与非隶属度。
  • results: 在 44 个 UCI 基准数据集和 ADNI 数据集上，所提出的 F-BLS 和 IF-BLS 模型均表现出优于基线模型的泛化性能；在加入高斯噪声的 UCI 数据集上，所提方法同样保持了较好的泛化性能和鲁棒性。
    Abstract In the realm of data classification, broad learning system (BLS) has proven to be a potent tool that utilizes a layer-by-layer feed-forward neural network. It consists of feature learning and enhancement segments, working together to extract intricate features from input data. The traditional BLS treats all samples as equally significant, which makes it less robust and less effective for real-world datasets with noises and outliers. To address this issue, we propose the fuzzy BLS (F-BLS) model, which assigns a fuzzy membership value to each training point to reduce the influence of noises and outliers. In assigning the membership value, the F-BLS model solely considers the distance from samples to the class center in the original feature space without incorporating the extent of non-belongingness to a class. We further propose a novel BLS based on intuitionistic fuzzy theory (IF-BLS). The proposed IF-BLS utilizes intuitionistic fuzzy numbers based on fuzzy membership and non-membership values to assign scores to training points in the high-dimensional feature space by using a kernel function. We evaluate the performance of proposed F-BLS and IF-BLS models on 44 UCI benchmark datasets across diverse domains. Furthermore, Gaussian noise is added to some UCI datasets to assess the robustness of the proposed F-BLS and IF-BLS models. Experimental results demonstrate superior generalization performance of the proposed F-BLS and IF-BLS models compared to baseline models, both with and without Gaussian noise. Additionally, we implement the proposed F-BLS and IF-BLS models on the Alzheimers Disease Neuroimaging Initiative (ADNI) dataset, and promising results showcase the models effectiveness in real-world applications. The proposed methods offer a promising solution to enhance the BLS frameworks ability to handle noise and outliers.
    摘要 在数据分类领域，宽度学习系统（BLS）已被证明是一种强大的工具，其采用逐层前馈神经网络，由特征学习与增强两部分协同工作，从输入数据中提取复杂特征。传统 BLS 将所有样本视为同等重要，这使其在含有噪声和离群点的真实数据集上鲁棒性和有效性不足。为解决这一问题，我们提出了模糊 BLS（F-BLS）模型，为每个训练点赋予模糊隶属度，以降低噪声和离群点的影响。在赋予隶属度时，F-BLS 模型仅考虑样本在原始特征空间中到类中心的距离，而不考虑其对某一类的非隶属程度。我们进一步提出了一种基于直觉模糊理论的新型 BLS（IF-BLS）。所提出的 IF-BLS 利用基于隶属度与非隶属度的直觉模糊数，通过核函数在高维特征空间中为训练点赋予分数。我们在 44 个跨领域的 UCI 基准数据集上评估了所提出的 F-BLS 和 IF-BLS 模型的性能，并在部分 UCI 数据集上加入高斯噪声以评估其鲁棒性。实验结果表明，无论是否加入高斯噪声，所提出的 F-BLS 和 IF-BLS 模型的泛化性能均优于基线模型。此外，我们在阿尔茨海默病神经影像学计划（ADNI）数据集上实现了所提出的 F-BLS 和 IF-BLS 模型并取得了可观的结果，展示了模型在真实应用中的有效性。所提方法为增强 BLS 框架处理噪声和离群点的能力提供了一种颇具前景的解决方案。
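
A hedged sketch of the F-BLS-style weighting step: each training point receives a membership in (0, 1] that shrinks with its distance to its class centre in the original feature space, so likely outliers contribute less to the fit. The exact functional form in the paper may differ.

```python
import numpy as np

def fuzzy_membership(X, y, delta=1e-4):
    # Distance-to-class-centre membership: points far from their class centre
    # (likely noise or outliers) get small weights; delta keeps weights > 0.
    m = np.empty(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centre = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centre, axis=1)
        m[idx] = 1.0 - d / (d.max() + delta)   # class radius normalization
    return np.clip(m, delta, 1.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(fuzzy_membership(X, y)[:5])
```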

Gradient-free training of neural ODEs for system identification and control using ensemble Kalman inversion

  • paper_url: http://arxiv.org/abs/2307.07882
  • repo_url: https://gitlab.com/computationalscience/eki-neural-ode
  • paper_authors: Lucas Böttcher
  • for: solves inverse problems within a Bayesian framework for system identification and optimal control tasks
  • methods: Ensemble Kalman inversion (EKI), a sequential Monte Carlo method that is gradient-free and only requires forward passes to evaluate artificial neural networks
  • results: EKI is an efficient method for training neural ODEs, with competitive runtime and solution quality compared to commonly used gradient-based optimizers.
    Abstract Ensemble Kalman inversion (EKI) is a sequential Monte Carlo method used to solve inverse problems within a Bayesian framework. Unlike backpropagation, EKI is a gradient-free optimization method that only necessitates the evaluation of artificial neural networks in forward passes. In this study, we examine the effectiveness of EKI in training neural ordinary differential equations (neural ODEs) for system identification and control tasks. To apply EKI to optimal control problems, we formulate inverse problems that incorporate a Tikhonov-type regularization term. Our numerical results demonstrate that EKI is an efficient method for training neural ODEs in system identification and optimal control problems, with runtime and quality of solutions that are competitive with commonly used gradient-based optimizers.
    摘要 集合卡尔曼反演（EKI）是一种在贝叶斯框架内求解反问题的序贯蒙特卡洛方法。与反向传播不同，EKI 是一种无梯度的优化方法，只需对人工神经网络进行前向计算。在本研究中，我们考察了 EKI 在训练用于系统辨识与控制任务的神经常微分方程（neural ODE）方面的有效性。为将 EKI 应用于最优控制问题，我们构造了包含 Tikhonov 型正则项的反问题。数值结果表明，EKI 是训练神经 ODE 以进行系统辨识和最优控制的高效方法，其运行时间和解的质量均可与常用的基于梯度的优化器相媲美。
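
For intuition, a minimal NumPy sketch of one EKI update is given below. This is a basic deterministic variant; practical implementations often also perturb the observations per ensemble member, and the paper adds a Tikhonov-type regularization term not shown here.

```python
import numpy as np

def eki_step(params, preds, y, gamma):
    """One ensemble Kalman inversion update.
    params: (J, p) ensemble of flattened network weights
    preds:  (J, d) forward-model outputs G(params_j)
    y:      (d,)   observed data; gamma: (d, d) observation noise covariance.
    Gradient-free: only forward passes and ensemble covariances are needed."""
    P = params - params.mean(axis=0)
    G = preds - preds.mean(axis=0)
    C_pg = P.T @ G / (len(params) - 1)           # (p, d) cross-covariance
    C_gg = G.T @ G / (len(params) - 1) + gamma   # (d, d)
    K = np.linalg.solve(C_gg, C_pg.T).T          # Kalman-style gain (p, d)
    return params + (y - preds) @ K.T            # move ensemble towards the data

# One illustrative iteration with a random ensemble and a linear forward map:
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))                     # stand-in forward operator
ens = rng.normal(size=(50, 20))                  # J=50 members, p=20 weights
ens = eki_step(ens, ens @ A.T, np.zeros(5), 0.1 * np.eye(5))
```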

Graph Embedded Intuitionistic Fuzzy RVFL for Class Imbalance Learning

  • paper_url: http://arxiv.org/abs/2307.07881
  • repo_url: None
  • paper_authors: M. A. Ganaie, M. Sajid, A. K. Malik, M. Tanveer
  • for: 该论文旨在解决机器学习中的类不平衡学习问题：在类别分布不平衡的数据上训练时，模型往往偏向多数类，导致少数类表示不足。
  • methods: 该论文提出了一种基于随机向量函数链接（RVFL）网络的新型分类模型，称为图嵌入直觉模糊 RVFL（GE-IFRVFL-CIL）模型。该模型引入加权机制来处理不平衡数据集，利用图嵌入提取数据集中语义丰富的信息，并使用直觉模糊集来处理数据中的不确定性与不精确性。
  • results: 该论文在包括 UCI 和 KEEL 在内的多个基准不平衡数据集上进行了实验，结果表明 GE-IFRVFL-CIL 模型取得了优异的性能。此外，该论文还将该模型应用于 ADNI 数据集并取得了良好的结果，证明其在实际应用中同样表现出色。
    Abstract The domain of machine learning is confronted with a crucial research area known as class imbalance learning, which presents considerable hurdles in the precise classification of minority classes. This issue can result in biased models where the majority class takes precedence in the training process, leading to the underrepresentation of the minority class. The random vector functional link (RVFL) network is a widely-used and effective learning model for classification due to its speed and efficiency. However, it suffers from low accuracy when dealing with imbalanced datasets. To overcome this limitation, we propose a novel graph embedded intuitionistic fuzzy RVFL for class imbalance learning (GE-IFRVFL-CIL) model incorporating a weighting mechanism to handle imbalanced datasets. The proposed GE-IFRVFL-CIL model has a plethora of benefits, such as $(i)$ it leverages graph embedding to extract semantically rich information from the dataset, $(ii)$ it uses intuitionistic fuzzy sets to handle uncertainty and imprecision in the data, $(iii)$ and the most important, it tackles class imbalance learning. The amalgamation of a weighting scheme, graph embedding, and intuitionistic fuzzy sets leads to the superior performance of the proposed model on various benchmark imbalanced datasets, including UCI and KEEL. Furthermore, we implement the proposed GE-IFRVFL-CIL on the ADNI dataset and achieved promising results, demonstrating the model's effectiveness in real-world applications. The proposed method provides a promising solution for handling class imbalance in machine learning and has the potential to be applied to other classification problems.
    摘要 机器学习领域面临一个被称为类不平衡学习的重要研究课题，它给少数类的精确分类带来了相当大的障碍。这一问题可能导致模型产生偏差：多数类在训练过程中占据主导地位，造成少数类表示不足。随机向量函数链接（RVFL）网络凭借其速度与效率，是一种广泛使用且有效的分类学习模型，但在处理不平衡数据集时准确率较低。为克服这一限制，我们提出了一种用于类不平衡学习的新型图嵌入直觉模糊 RVFL（GE-IFRVFL-CIL）模型，并引入加权机制来处理不平衡数据集。所提出的 GE-IFRVFL-CIL 模型具有诸多优点：(i) 利用图嵌入从数据集中提取语义丰富的信息；(ii) 使用直觉模糊集处理数据中的不确定性与不精确性；(iii) 最重要的是，它直接应对类不平衡学习问题。加权方案、图嵌入与直觉模糊集的融合，使所提模型在包括 UCI 和 KEEL 在内的多个基准不平衡数据集上取得了优异的性能。此外，我们将所提出的 GE-IFRVFL-CIL 应用于 ADNI 数据集并取得了可喜的结果，证明了模型在真实应用中的有效性。该方法为处理机器学习中的类不平衡提供了一种颇具前景的解决方案，并有潜力推广到其他分类问题。
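
As background, here is a minimal sketch of the plain RVFL network the paper builds on: random hidden features plus a direct input link, with the output weights obtained in closed form by ridge regression rather than backpropagation. The graph-embedding and intuitionistic-fuzzy weighting terms of GE-IFRVFL-CIL are not reproduced here.

```python
import numpy as np

def rvfl_fit_predict(X_tr, Y_tr, X_te, n_hidden=200, lam=1e-2, seed=0):
    # Random hidden weights are fixed, never trained; only the output layer
    # beta is solved, via a regularized least-squares problem.
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X_tr.shape[1], n_hidden))
    b = rng.uniform(-1, 1, n_hidden)
    feats = lambda X: np.hstack([X, np.tanh(X @ W + b)])   # direct link + hidden
    H = feats(X_tr)
    beta = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y_tr)
    return feats(X_te) @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y]                                           # one-hot labels
scores = rvfl_fit_predict(X[:100], Y[:100], X[100:])
print("accuracy:", (scores.argmax(1) == y[100:]).mean())
```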

Why Does Little Robustness Help? Understanding Adversarial Transferability From Surrogate Training

  • paper_url: http://arxiv.org/abs/2307.07873
  • repo_url: None
  • paper_authors: Yechao Zhang, Shengshan Hu, Leo Yu Zhang, Junyu Shi, Minghui Li, Xiaogeng Liu, Wei Wan, Hai Jin
  • for: 本研究旨在更深入地理解深度神经网络（DNN）对抗样本的可迁移性，尤其是从代理模型训练的角度。
  • methods: 本研究通过一系列理论与实验分析，探讨了模型平滑性和梯度相似性这两个主要因素如何共同影响对抗样本的可迁移性。
  • results: 研究发现，对抗训练中的数据分布偏移可以解释梯度相似性的退化；模型平滑性与梯度相似性之间普遍存在权衡，二者的共同作用决定可迁移性。通过同时优化模型平滑性与梯度相似性（例如将输入梯度正则化与锐度感知最小化（SAM）相结合），可以构造更好的代理模型，从而提升对抗样本的可迁移性。
    Abstract Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs that successfully fool white-box surrogate models can also deceive other black-box models with different architectures. Although a bunch of empirical studies have provided guidance on generating highly transferable AEs, many of these findings lack explanations and even lead to inconsistent advice. In this paper, we take a further step towards understanding adversarial transferability, with a particular focus on surrogate aspects. Starting from the intriguing little robustness phenomenon, where models adversarially trained with mildly perturbed adversarial samples can serve as better surrogates, we attribute it to a trade-off between two predominant factors: model smoothness and gradient similarity. Our investigations focus on their joint effects, rather than their separate correlations with transferability. Through a series of theoretical and empirical analyses, we conjecture that the data distribution shift in adversarial training explains the degradation of gradient similarity. Building on these insights, we explore the impacts of data augmentation and gradient regularization on transferability and identify that the trade-off generally exists in the various training mechanisms, thus building a comprehensive blueprint for the regulation mechanism behind transferability. Finally, we provide a general route for constructing better surrogates to boost transferability which optimizes both model smoothness and gradient similarity simultaneously, e.g., the combination of input gradient regularization and sharpness-aware minimization (SAM), validated by extensive experiments. In summary, we call for attention to the united impacts of these two factors for launching effective transfer attacks, rather than optimizing one while ignoring the other, and emphasize the crucial role of manipulating surrogate models.
    摘要 深度神经网络（DNN）的对抗样本（AE）已被证明具有可迁移性：能够成功欺骗白盒代理模型的对抗样本，也能欺骗具有不同架构的其他黑盒模型。尽管大量实证研究为生成高可迁移性的对抗样本提供了指导，但其中许多发现缺乏解释，甚至给出相互矛盾的建议。本文朝着理解对抗可迁移性又迈进一步，并特别关注代理模型方面。从耐人寻味的"轻微鲁棒性"现象出发——用轻微扰动的对抗样本进行对抗训练得到的模型可以充当更好的代理模型——我们将其归因于两个主导因素之间的权衡：模型平滑性与梯度相似性。我们的研究关注二者的联合效应，而非它们各自与可迁移性的相关性。通过一系列理论与实验分析，我们推测对抗训练中的数据分布偏移解释了梯度相似性的退化。基于这些洞见，我们考察了数据增强与梯度正则化对可迁移性的影响，并发现这种权衡普遍存在于各类训练机制之中，从而为可迁移性背后的调控机制构建了一幅完整的蓝图。最后，我们给出了一条构造更好代理模型以提升可迁移性的一般路线，即同时优化模型平滑性与梯度相似性，例如将输入梯度正则化与锐度感知最小化（SAM）相结合，并通过大量实验加以验证。总之，我们呼吁关注这两个因素的联合影响以发起有效的迁移攻击，而不是只优化其一而忽略另一个，并强调操控代理模型的关键作用。

Does Double Descent Occur in Self-Supervised Learning?

  • paper_url: http://arxiv.org/abs/2307.07872
  • repo_url: https://github.com/yonatangideoni/double_descent_tiny_paper
  • paper_authors: Alisia Lupidi, Yonatan Gideoni, Dulhan Jayalath
  • for: investigate the existence of double descent in self-supervised models
  • methods: use standard and linear autoencoders, two previously unstudied settings
  • results: the test loss either has a classical U-shape or monotonically decreases, without exhibiting a double-descent curve.
    Abstract Most investigations into double descent have focused on supervised models while the few works studying self-supervised settings find a surprising lack of the phenomenon. These results imply that double descent may not exist in self-supervised models. We show this empirically using a standard and linear autoencoder, two previously unstudied settings. The test loss is found to have either a classical U-shape or to monotonically decrease instead of exhibiting a double-descent curve. We hope that further work on this will help elucidate the theoretical underpinnings of this phenomenon.
    摘要 大多数关于双下降（double descent）的研究都集中在监督模型上，而少数研究自监督设置的工作却出人意料地没有发现这一现象。这些结果表明双下降可能不存在于自监督模型中。我们使用标准自编码器和线性自编码器这两种此前未被研究的设置，从实验上证实了这一点。测试损失的曲线要么呈经典的 U 形，要么单调下降，而没有表现出双下降曲线。我们希望后续工作能够帮助阐明这一现象的理论基础。

The SocialAI School: Insights from Developmental Psychology Towards Artificial Socio-Cultural Agents

  • paper_url: http://arxiv.org/abs/2307.07871
  • repo_url: None
  • paper_authors: Grgur Kovač, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
  • for: 这篇论文旨在探讨人工智能在社交互动情境中的发展，以及发展心理学如何帮助 AI 研究社会智能。
  • methods: 这篇论文借鉴了 Michael Tomasello 和 Jerome Bruner 等发展心理学家的理论，并提出了可参数化的 SocialAI school（一个包含程序化生成环境套件的工具），以帮助研究者开展相关实验。
  • results: 该论文展示了使用 SocialAI school 对 RL 智能体和大型语言模型开展社会认知实验的示例，并为 AI 研究者理解和应用发展心理学提供了一个简单的切入点。
    Abstract Developmental psychologists have long-established the importance of socio-cognitive abilities in human intelligence. These abilities enable us to enter, participate and benefit from human culture. AI research on social interactive agents mostly concerns the emergence of culture in a multi-agent setting (often without a strong grounding in developmental psychology). We argue that AI research should be informed by psychology and study socio-cognitive abilities enabling to enter a culture too. We discuss the theories of Michael Tomasello and Jerome Bruner to introduce some of their concepts to AI and outline key concepts and socio-cognitive abilities. We present The SocialAI school - a tool including a customizable parameterized suite of procedurally generated environments, which simplifies conducting experiments regarding those concepts. We show examples of such experiments with RL agents and Large Language Models. The main motivation of this work is to engage the AI community around the problem of social intelligence informed by developmental psychology, and to provide a tool to simplify first steps in this direction. Refer to the project website for code and additional information: https://sites.google.com/view/socialai-school.
    摘要 发展心理学家早已确认社会认知能力在人类智能中的重要性，这些能力使我们能够进入、参与并受益于人类文化。关于社交互动智能体的 AI 研究大多关注多智能体环境中文化的涌现（往往缺乏发展心理学的坚实基础）。我们认为 AI 研究应当借鉴心理学，同样研究使智能体得以进入文化的社会认知能力。我们讨论了 Michael Tomasello 和 Jerome Bruner 的理论，将其中一些概念引入 AI，并概述了关键概念与社会认知能力。我们提出了 SocialAI school——一个包含可定制、可参数化的程序化生成环境套件的工具，可简化围绕这些概念开展的实验。我们展示了使用强化学习（RL）智能体和大型语言模型进行此类实验的示例。这项工作的主要动机是让 AI 社区围绕由发展心理学启发的社会智能问题展开研究，并提供一个简化这一方向初步探索的工具。代码与更多信息请参见项目网站：https://sites.google.com/view/socialai-school 。

Large Language Models as Superpositions of Cultural Perspectives

  • paper_url: http://arxiv.org/abs/2307.07870
  • repo_url: None
  • paper_authors: Grgur Kovač, Masataka Sawayama, Rémy Portelas, Cédric Colas, Peter Ford Dominey, Pierre-Yves Oudeyer
  • for: 这篇论文主要研究大型语言模型（LLM）被认为具有人格或价值观这一问题。
  • methods: 作者使用心理学问卷（PVQ、VSM、IPIP）研究 LLM 在不同视角下表达的价值观与人格特质如何变化，并通过定性实验展示当提示中（显式或隐式）暗含不同视角时，LLM 表达的价值观会随之改变。
  • results: 研究发现 LLM 表达的价值观与人格特质具有情境依赖性，并且可以通过不同方法加以引导；作者还定量研究了不同模型的可控性、各种视角诱导方法的有效性，以及模型可引导性的平滑程度。
    Abstract Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at https://sites.google.com/view/llm-superpositions .
    摘要 大型语言模型（LLM）经常被误认为具有某种人格或一组价值观。我们认为，LLM 可以被视为多种视角的叠加，各视角具有不同的价值观和人格特质。LLM 展现出随所诱导视角变化的、情境依赖的价值观与人格特质，而人类在不同情境中的价值观和人格特质往往更为一致。我们提出了"视角可控性"的概念，指模型采取具有不同价值观和人格特质的多种视角的能力。在实验中，我们使用心理学问卷（PVQ、VSM、IPIP）研究所表达的价值观与人格特质如何随不同视角而变化。我们通过定性实验表明，当提示中（显式或隐式）暗含某种价值观时，LLM 会表达不同的价值观；即便这些价值观并非显而易见地被暗示，LLM 也会表达不同的价值观（展示其情境依赖的本质）。随后，我们开展定量实验，研究不同模型（GPT-4、GPT-3.5、OpenAssistant、StableVicuna、StableLM）的可控性、各种视角诱导方法的有效性，以及模型可引导性的平滑程度。最后，我们讨论了这项工作的更广泛意义，并列出了一系列相关的科学问题。项目网站地址为 https://sites.google.com/view/llm-superpositions 。

Custom DNN using Reward Modulated Inverted STDP Learning for Temporal Pattern Recognition

  • paper_url: http://arxiv.org/abs/2307.07869
  • repo_url: None
  • paper_authors: Vijay Shankaran Vivekanand, Rajkumar Kubendran
  • for: 本研究旨在提出一种高效的时间脉冲模式识别算法，适用于异常检测、关键词识别和神经科学等多个领域。
  • methods: 该算法结合奖励调制行为与基于 Hebbian 和反 Hebbian 的学习方法，能够在短时间训练下识别动态数据集中的时间脉冲模式。
  • results: 在一个包含语音数字脉冲信息的复杂数据集上训练后，该算法的输出与最先进方法相比表现更优，表明其能够有效识别真实数据中的时间脉冲模式。
    Abstract Temporal spike recognition plays a crucial role in various domains, including anomaly detection, keyword spotting and neuroscience. This paper presents a novel algorithm for efficient temporal spike pattern recognition on sparse event series data. The algorithm leverages a combination of reward-modulatory behavior, Hebbian and anti-Hebbian based learning methods to identify patterns in dynamic datasets with short intervals of training. The algorithm begins with a preprocessing step, where the input data is rationalized and translated to a feature-rich yet sparse spike time series data. Next, a linear feed forward spiking neural network processes this data to identify a trained pattern. Finally, the next layer performs a weighted check to ensure the correct pattern has been detected. To evaluate the performance of the proposed algorithm, it was trained on a complex dataset containing spoken digits with spike information, and its output was compared to the state of the art.
    摘要 时间脉冲识别在异常检测、关键词识别和神经科学等多个领域中都起着关键作用。本文提出了一种针对稀疏事件序列数据的高效时间脉冲模式识别算法。该算法结合奖励调制行为以及基于 Hebbian 和反 Hebbian 的学习方法，能够在短时间训练下识别动态数据集中的模式。算法首先进行预处理，将输入数据规整并转换为特征丰富而稀疏的脉冲时间序列数据；接着，一个线性前馈脉冲神经网络处理这些数据以识别已训练的模式；最后，下一层执行加权校验，以确保检测到正确的模式。为评估所提算法的性能，我们在一个包含语音数字脉冲信息的复杂数据集上对其进行了训练，并将其输出与最先进方法进行了比较。
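
For intuition, a hedged NumPy sketch of a reward-modulated STDP weight update is shown below: causal pre-before-post spike pairs potentiate, anti-causal pairs depress, and the whole update is scaled by a scalar reward. The constants are illustrative, and the paper's inverted/anti-Hebbian variant would flip or reshape these windows.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0, reward=1.0):
    # Exponential STDP windows modulated by a reward signal; weights are
    # clipped to [0, 1] as a simple stability measure.
    dt = t_post - t_pre                              # spike-pair timing (ms)
    dw = np.where(dt >= 0,
                  a_plus * np.exp(-dt / tau),        # causal pair -> potentiation
                  -a_minus * np.exp(dt / tau))       # anti-causal -> depression
    return np.clip(w + reward * dw, 0.0, 1.0)

w = np.full(4, 0.5)
t_pre = np.array([10.0, 10.0, 10.0, 10.0])
t_post = np.array([12.0, 30.0, 8.0, 10.0])           # ms
print(stdp_update(w, t_pre, t_post, reward=1.0))
```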

Contrasting the efficiency of stock price prediction models using various types of LSTM models aided with sentiment analysis

  • paper_url: http://arxiv.org/abs/2307.07868
  • repo_url: None
  • paper_authors: Varun Sangwan, Vishesh Kumar Singh, Bibin Christopher V
  • for: 该研究旨在找到利用公司业绩预测与行业表现来准确预测股票价格的最佳模型，兼顾短期与长期目标。
  • methods: 该研究利用公司业绩预测与行业表现构建模型，并对比多种辅以情感分析的 LSTM 模型来预测股票价格。
  • results: 研究结果可以帮助投资者更好地理解公司的股票价格走势，并可用于长期和短期投资决策。
    Abstract Our research aims to find the best model that uses company projections and sector performance, and how the given company fares accordingly, to correctly predict equity share prices for both short- and long-term goals.
    摘要 我们的研究目标是找到利用公司业绩预测与行业表现来准确预测股票价格的最佳模型，兼顾短期与长期目标。

Benchmarking the Effectiveness of Classification Algorithms and SVM Kernels for Dry Beans

  • paper_url: http://arxiv.org/abs/2307.07863
  • repo_url: None
  • paper_authors: Anant Mehta, Prajit Sengupta, Divisha Garg, Harpreet Singh, Yosi Shacham Diamand
  • for: 植物育种者和农业研究人员可以通过分析 Dry Bean 数据集来识别理想特征、抗病性和营养含量，从而提高作物产量。
  • methods: 本研究分析并比较了不同的支持向量机（SVM）分类算法，包括线性、多项式和径向基函数（RBF）核，以及其他流行的分类算法；预处理阶段使用主成分分析（PCA）进行降维。
  • results: 研究发现，径向基函数（RBF）SVM 算法取得了最高的准确率（93.34%）、精确率（92.61%）、召回率（92.35%）和 F1 分数（91.40%）。此外，研究还通过翔实的可视化与实证分析，强调了在复杂且非线性结构的数据集上考虑不同 SVM 算法的重要性，提供了有价值的指导。
    Abstract Plant breeders and agricultural researchers can increase crop productivity by identifying desirable features, disease resistance, and nutritional content by analysing the Dry Bean dataset. This study analyses and compares different Support Vector Machine (SVM) classification algorithms, namely linear, polynomial, and radial basis function (RBF), along with other popular classification algorithms. The analysis is performed on the Dry Bean Dataset, with PCA (Principal Component Analysis) conducted as a preprocessing step for dimensionality reduction. The primary evaluation metric used is accuracy, and the RBF SVM kernel algorithm achieves the highest Accuracy of 93.34%, Precision of 92.61%, Recall of 92.35% and F1 Score as 91.40%. Along with adept visualization and empirical analysis, this study offers valuable guidance by emphasizing the importance of considering different SVM algorithms for complex and non-linear structured datasets.
    摘要 植物育种者和农业研究人员可以通过分析 Dry Bean 数据集识别理想特征、抗病性和营养含量，从而提高作物产量。本研究分析并比较了不同的支持向量机（SVM）分类算法，即线性、多项式和径向基函数（RBF）核，以及其他流行的分类算法。分析在 Dry Bean 数据集上进行，预处理阶段使用主成分分析（PCA）进行降维。主要评价指标为准确率，其中 RBF SVM 核算法取得了最高的准确率 93.34%、精确率 92.61%、召回率 92.35% 和 F1 分数 91.40%。此外，本研究还通过翔实的可视化与实证分析，强调了在复杂且非线性结构的数据集上考虑不同 SVM 算法的重要性，提供了有价值的指导。
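
A hedged scikit-learn sketch of this protocol (PCA preprocessing, then linear/polynomial/RBF SVMs compared by cross-validated accuracy); a built-in dataset stands in for the UCI Dry Bean data, and the component count and CV folds are illustrative.

```python
from sklearn.datasets import load_wine            # stand-in for the UCI Dry Bean data
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
for kernel in ("linear", "poly", "rbf"):
    # Standardize, reduce dimensionality with PCA, then fit the kernel SVM.
    clf = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel=kernel))
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{kernel:>6}: mean CV accuracy = {acc:.3f}")
```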

Automated Knowledge Modeling for Cancer Clinical Practice Guidelines

  • paper_url: http://arxiv.org/abs/2307.10231
  • repo_url: None
  • paper_authors: Pralaypati Ta, Bhumika Gupta, Arihant Jain, Sneha Sree C, Arunima Sarkar, Keerthi Ram, Mohanasankar Sivaprakasam
  • for: This paper aims to develop an automated method for extracting knowledge from National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines (CPGs) in Oncology and generating a structured model containing the retrieved knowledge.
  • methods: The proposed method uses natural language processing (NLP) techniques to extract knowledge from NCCN CPGs and employs a knowledge model to represent the extracted knowledge in a structured format. The method also includes three enrichment strategies to enhance the model: cancer staging information, UMLS Metathesaurus & NCIt concepts, and node classification.
  • results: The proposed method was tested using two versions of the NCCN Non-Small Cell Lung Cancer (NSCLC) CPG, and node classification with a Support Vector Machine (SVM) model achieved an accuracy of 0.81 under 10-fold cross-validation.
    Abstract Clinical Practice Guidelines (CPGs) for cancer diseases evolve rapidly due to new evidence generated by active research. Currently, CPGs are primarily published in a document format that is ill-suited for managing this developing knowledge. A knowledge model of the guidelines document suitable for programmatic interaction is required. This work proposes an automated method for extraction of knowledge from National Comprehensive Cancer Network (NCCN) CPGs in Oncology and generating a structured model containing the retrieved knowledge. The proposed method was tested using two versions of NCCN Non-Small Cell Lung Cancer (NSCLC) CPG to demonstrate the effectiveness in faithful extraction and modeling of knowledge. Three enrichment strategies using Cancer staging information, Unified Medical Language System (UMLS) Metathesaurus & National Cancer Institute thesaurus (NCIt) concepts, and Node classification are also presented to enhance the model towards enabling programmatic traversal and querying of cancer care guidelines. The Node classification was performed using a Support Vector Machine (SVM) model, achieving a classification accuracy of 0.81 with 10-fold cross-validation.
    摘要 癌症疾病的临床实践指南（CPG）随着活跃研究产生的新证据而快速演进。目前，CPG 主要以文档格式发布，这种格式不适合管理这些不断发展的知识，因此需要一种适合程序化交互的指南文档知识模型。这项工作提出了一种自动方法，用于从美国国家综合癌症网络（NCCN）肿瘤学 CPG 中提取知识，并生成包含所提取知识的结构化模型。该方法使用两个版本的 NCCN 非小细胞肺癌（NSCLC）CPG 进行了测试，证明了其在忠实提取和建模知识方面的有效性。此外，文章还提出了三种增强策略——癌症分期信息、统一医学语言系统（UMLS）Metathesaurus 与美国国家癌症研究所词表（NCIt）概念、以及节点分类——使模型支持对癌症诊疗指南的程序化遍历与查询。节点分类使用支持向量机（SVM）模型完成，在 10 折交叉验证下达到了 0.81 的分类准确率。

Variational Inference with Gaussian Score Matching

  • paper_url: http://arxiv.org/abs/2307.07849
  • repo_url: https://github.com/modichirag/gsm-vi
  • paper_authors: Chirag Modi, Charles Margossian, Yuling Yao, Robert Gower, David Blei, Lawrence Saul
  • for: The paper is written for researchers and practitioners interested in Bayesian inference and variational inference.
  • methods: The paper proposes a new approach to variational inference based on the principle of score matching, applicable to a wide class of models. The algorithm iteratively adjusts the variational estimate to match the scores at a newly sampled value of the latent variables; for a Gaussian variational family the inner optimization has a closed-form solution (GSM-VI).
  • results: The paper compares GSM-VI to black box variational inference (BBVI) and studies how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix, and the degree of mismatch between the approximating and exact posterior distribution. The results show that GSM-VI is faster than BBVI without sacrificing accuracy, requiring 10-100x fewer gradient evaluations for a comparable quality of approximation.
    Abstract Variational inference (VI) is a method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics. Typically, VI fits a simple parametric distribution to the target posterior by minimizing an appropriate objective such as the evidence lower bound (ELBO). In this work, we present a new approach to VI based on the principle of score matching, that if two distributions are equal then their score functions (i.e., gradients of the log density) are equal at every point on their support. With this, we develop score matching VI, an iterative algorithm that seeks to match the scores between the variational approximation and the exact posterior. At each iteration, score matching VI solves an inner optimization, one that minimally adjusts the current variational estimate to match the scores at a newly sampled value of the latent variables. We show that when the variational family is a Gaussian, this inner optimization enjoys a closed form solution, which we call Gaussian score matching VI (GSM-VI). GSM-VI is also a ``black box'' variational algorithm in that it only requires a differentiable joint distribution, and as such it can be applied to a wide class of models. We compare GSM-VI to black box variational inference (BBVI), which has similar requirements but instead optimizes the ELBO. We study how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix (when the target is Gaussian), and the degree of mismatch between the approximating and exact posterior distribution. We also study GSM-VI on a collection of real-world Bayesian inference problems from the posteriorDB database of datasets and models. In all of our studies we find that GSM-VI is faster than BBVI, but without sacrificing accuracy. It requires 10-100x fewer gradient evaluations to obtain a comparable quality of approximation.
    摘要 变分推断（VI）是一种用于近似贝叶斯统计中难以计算的后验分布的方法。通常，VI 通过最小化适当的目标函数（如证据下界 ELBO）来用一个简单的参数分布拟合目标后验。在本工作中，我们基于得分匹配原理提出了一种新的 VI 方法：若两个分布相等，则它们的得分函数（即对数密度的梯度）在其支撑集上处处相等。据此，我们提出了得分匹配 VI——一种迭代算法，旨在使变分近似与精确后验之间的得分相匹配。在每次迭代中，得分匹配 VI 求解一个内部优化问题，以最小的调整使当前变分估计在新采样的隐变量处与目标得分匹配。我们证明，当变分族为高斯分布时，该内部优化问题具有闭式解，我们称之为高斯得分匹配 VI（GSM-VI）。GSM-VI 也是一种"黑盒"变分算法，只需联合分布可微即可，因此可应用于广泛的模型。我们将 GSM-VI 与要求类似但优化 ELBO 的黑盒变分推断（BBVI）进行比较，研究了 GSM-VI 的表现如何随问题维度、目标协方差矩阵的条件数（当目标为高斯时）以及近似分布与精确后验之间失配程度而变化。我们还在 posteriorDB 数据库中的一系列真实世界贝叶斯推断问题上研究了 GSM-VI。在所有实验中，我们发现 GSM-VI 比 BBVI 更快且不损失精度：获得相当质量的近似所需的梯度计算次数减少 10-100 倍。
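
As a toy illustration of the score-matching principle behind GSM-VI (not the paper's closed-form update, which solves the inner matching problem exactly for Gaussians), the sketch below compares the diagonal-Gaussian variational score with the target score at a sampled point and differentiates the mismatch:

```python
import torch

mu = torch.zeros(2, requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)

def log_joint(z):                         # toy target: standard normal posterior
    return -0.5 * (z ** 2).sum()

z = (mu + log_std.exp() * torch.randn(2)).detach().requires_grad_(True)
score_p = torch.autograd.grad(log_joint(z), z)[0]       # target score at z
score_q = -(z - mu) / log_std.exp() ** 2                # Gaussian variational score
mismatch = ((score_p - score_q) ** 2).sum()             # driven to zero by the method
mismatch.backward()                                     # gradients w.r.t. mu, log_std
print(mu.grad, log_std.grad)
```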

Neural Video Recovery for Cloud Gaming

  • paper_url: http://arxiv.org/abs/2307.07847
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Shuozhe Li, Diyuan Dai, Lili Qiu
  • for: 提高云游戏的视频恢复率和质量,以提供更好的游戏体验。
  • methods: 使用游戏状态进行视频恢复，并利用部分解码的帧来恢复丢失的视频部分。开发了一个整体系统，包括提取游戏状态、修改 H.264 视频解码器以生成指示视频帧中需要恢复部分的掩码，以及设计一种新的神经网络来恢复完整或部分的视频帧。
  • results: 通过 iPhone 12 和笔记本电脑上的实现，证明了游戏状态在游戏视频恢复中的重要性，以及整体设计的有效性。
    Abstract Cloud gaming is a multi-billion dollar industry. A client in cloud gaming sends its movement to the game server on the Internet, which renders and transmits the resulting video back. In order to provide a good gaming experience, a latency below 80 ms is required. This means that video rendering, encoding, transmission, decoding, and display have to finish within that time frame, which is especially challenging to achieve due to server overload, network congestion, and losses. In this paper, we propose a new method for recovering lost or corrupted video frames in cloud gaming. Unlike traditional video frame recovery, our approach uses game states to significantly enhance recovery accuracy and utilizes partially decoded frames to recover lost portions. We develop a holistic system that consists of (i) efficiently extracting game states, (ii) modifying H.264 video decoder to generate a mask to indicate which portions of video frames need recovery, and (iii) designing a novel neural network to recover either complete or partial video frames. Our approach is extensively evaluated using iPhone 12 and laptop implementations, and we demonstrate the utility of game states in the game video recovery and the effectiveness of our overall design.
    摘要 云游戏是一个规模达数十亿美元的产业。云游戏中的客户端将其操作指令通过互联网发送到游戏服务器，服务器渲染后将生成的视频回传。为了提供良好的游戏体验，延迟需要低于 80 毫秒。这意味着视频渲染、编码、传输、解码和显示都必须在这一时间窗口内完成，而服务器过载、网络拥塞和丢包使这一目标尤其难以实现。在本文中，我们提出了一种恢复云游戏中丢失或损坏视频帧的新方法。与传统的视频帧恢复不同，我们的方法利用游戏状态显著提升恢复精度，并利用部分解码的帧来恢复丢失的部分。我们开发了一个整体系统，包括：(i) 高效地提取游戏状态；(ii) 修改 H.264 视频解码器以生成指示视频帧中哪些部分需要恢复的掩码；(iii) 设计一种新的神经网络来恢复完整或部分的视频帧。我们的方法在 iPhone 12 和笔记本电脑的实现上进行了广泛评估，证明了游戏状态在游戏视频恢复中的作用以及整体设计的有效性。

Transformers are Universal Predictors

  • paper_url: http://arxiv.org/abs/2307.07843
  • repo_url: https://github.com/danderfer/Comp_Sci_Sem_2
  • paper_authors: Sourya Basu, Moulik Choraria, Lav R. Varshney
  • for: 这篇论文研究 Transformer 架构在语言建模中的局限性，并证明其在信息论意义上具有通用预测性质。
  • methods: 论文使用信息论方法分析 Transformer 架构的性能，并在非渐近数据情形下分析各组件的作用，特别是在数据高效训练的背景下。
  • results: 在合成与真实数据集上的实验验证了理论分析。
    Abstract We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
    摘要 我们找出了 Transformer 架构用于语言建模的若干极限，并证明其在信息论意义上具有通用预测性质。我们进一步分析了非渐近数据情形下的性能，以理解 Transformer 架构中各组件的作用，尤其是在数据高效训练的背景下。我们通过在合成与真实数据集上的实验验证了理论分析。

RegExplainer: Generating Explanations for Graph Neural Networks in Regression Task

  • paper_url: http://arxiv.org/abs/2307.07840
  • repo_url: None
  • paper_authors: Jiaxing Zhang, Zhuomin Chen, Hao Mei, Dongsheng Luo, Hua Wei
  • for: 这篇论文的目的是为图回归模型生成解释(XAIG-R),以便更好地理解图学习任务中的推理过程。
  • methods: 论文提出了一个基于信息瓶颈理论的新目标函数,以及一种新的混合(mix-up)框架,能以模型无关的方式支持不同的GNN模型;此外,还提出了一种对比学习策略,以处理回归任务中连续有序的标签。
  • results: 论文在三个基准数据集和一个真实数据集上进行了广泛的实验,证明了该方法在解释回归任务中GNN模型的有效性。
    Abstract Graph regression is a fundamental task and has received increasing attention in a wide range of graph learning tasks. However, the inference process is often not interpretable. Most existing explanation techniques are limited to understanding GNN behaviors in classification tasks. In this work, we seek an explanation to interpret the graph regression models (XAIG-R). We show that existing methods overlook the distribution shifting and continuously ordered decision boundary, which hinders them away from being applied in the regression tasks. To address these challenges, we propose a novel objective based on the information bottleneck theory and introduce a new mix-up framework, which could support various GNNs in a model-agnostic manner. We further present a contrastive learning strategy to tackle the continuously ordered labels in regression task. To empirically verify the effectiveness of the proposed method, we introduce three benchmark datasets and a real-life dataset for evaluation. Extensive experiments show the effectiveness of the proposed method in interpreting GNN models in regression tasks.
    摘要 图回归是一个基本任务,在各种图学习任务中受到越来越多的关注。然而,其推断过程往往不可解释。大多数现有的解释技术仅适用于理解GNN在分类任务中的行为。在这项工作中,我们寻求一种解释方法来解释图回归模型(XAIG-R)。我们发现现有方法忽略了分布偏移和连续有序的决策边界,这阻碍了它们在回归任务中的应用。为解决这些挑战,我们提出了一个基于信息瓶颈理论的新目标函数,并引入一种新的混合(mix-up)框架,能够以模型无关的方式支持多种GNN。此外,我们还提出了一种对比学习策略,用于处理回归任务中连续有序的标签。为验证所提方法的有效性,我们引入了三个基准数据集和一个真实数据集进行评估。广泛的实验表明,该方法能够有效地解释回归任务中的GNN模型。

eess.IV - 2023-07-16

TransNuSeg: A Lightweight Multi-Task Transformer for Nuclei Segmentation

  • paper_url: http://arxiv.org/abs/2307.08051
  • repo_url: https://github.com/zhenqi-he/transnuseg
  • paper_authors: Zhenqi He, Mathias Unberath, Jing Ke, Yiqing Shen
  • for: 这篇论文提出一种基于Transformer的细胞核分割方法,以解决现有自动细胞核分割方法参数量大、训练时间长的问题。
  • methods: 这篇论文使用名为TransNuSeg的纯Transformer框架,其中包括一个三解码器结构,分别用于细胞核实例、细胞核边缘和聚簇边缘的分割。此外,作者还提出了一种新的自蒸馏损失函数,以确保不同分支预测结果之间的一致性。
  • results: 实验表明,TransNuSeg方法在两个不同的数据集上,与CA2.5-Net等最先进方法相比,Dice指标提高了2-3%,同时参数数量减少了30%。这表明Transformer在细胞核分割领域具有强大的能力,可以作为实际临床应用中的有效解决方案。
    Abstract Nuclei appear small in size, yet, in real clinical practice, the global spatial information and correlation of the color or brightness contrast between nuclei and background, have been considered a crucial component for accurate nuclei segmentation. However, the field of automatic nuclei segmentation is dominated by Convolutional Neural Networks (CNNs), meanwhile, the potential of the recently prevalent Transformers has not been fully explored, which is powerful in capturing local-global correlations. To this end, we make the first attempt at a pure Transformer framework for nuclei segmentation, called TransNuSeg. Different from prior work, we decouple the challenging nuclei segmentation task into an intrinsic multi-task learning task, where a tri-decoder structure is employed for nuclei instance, nuclei edge, and clustered edge segmentation respectively. To eliminate the divergent predictions from different branches in previous work, a novel self distillation loss is introduced to explicitly impose consistency regulation between branches. Moreover, to formulate the high correlation between branches and also reduce the number of parameters, an efficient attention sharing scheme is proposed by partially sharing the self-attention heads amongst the tri-decoders. Finally, a token MLP bottleneck replaces the over-parameterized Transformer bottleneck for a further reduction in model complexity. Experiments on two datasets of different modalities, including MoNuSeg have shown that our methods can outperform state-of-the-art counterparts such as CA2.5-Net by 2-3% Dice with 30% fewer parameters. In conclusion, TransNuSeg confirms the strength of Transformer in the context of nuclei segmentation, which thus can serve as an efficient solution for real clinical practice. Code is available at https://github.com/zhenqi-he/transnuseg.
    摘要 细胞核在图像中尺寸很小,但在实际临床实践中,全局空间信息以及细胞核与背景之间颜色或亮度对比的相关性,被视为精确细胞核分割的关键因素。然而,自动细胞核分割领域目前由卷积神经网络(CNN)主导,而近来流行的Transformer在捕捉局部-全局相关性方面能力强大,其潜力尚未得到充分探索。为此,我们首次尝试了用于细胞核分割的纯Transformer框架,称为TransNuSeg。与先前工作不同,我们将具有挑战性的细胞核分割任务分解为内在的多任务学习问题,采用三解码器结构,分别进行细胞核实例、细胞核边缘和聚簇边缘分割。为了消除以往工作中不同分支间预测结果的分歧,我们引入了一种新的自蒸馏损失,以显式地在分支之间施加一致性约束。此外,为刻画分支间的高相关性并减少参数数量,我们提出了一种高效的注意力共享方案,在三个解码器之间部分共享自注意力头。最后,我们用token MLP瓶颈取代过参数化的Transformer瓶颈,进一步降低模型复杂度。在包括MoNuSeg在内的两个不同模态数据集上的实验表明,我们的方法可以比CA2.5-Net等最先进方法提高2-3%的Dice指标,并减少30%的参数。总之,TransNuSeg证明了Transformer在细胞核分割中的优势,可作为实际临床应用的高效解决方案。代码可在 https://github.com/zhenqi-he/transnuseg 获取。
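    A small sketch of the self-distillation consistency idea between two decoder branches: a symmetric KL divergence between softened branch outputs is one natural instantiation, though the exact pairing and weighting used by TransNuSeg may differ.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(edge_logits, clustered_edge_logits):
    """Illustrative consistency regularizer between two decoder branches:
    each branch's softened output supervises the other via symmetric KL."""
    p = F.log_softmax(edge_logits, dim=1)
    q = F.log_softmax(clustered_edge_logits, dim=1)
    kl_pq = F.kl_div(p, q.exp(), reduction="batchmean")
    kl_qp = F.kl_div(q, p.exp(), reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

edge = torch.randn(2, 2, 64, 64)     # nuclei-edge branch logits
cluster = torch.randn(2, 2, 64, 64)  # clustered-edge branch logits
print(self_distillation_loss(edge, cluster))
```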

A Novel SLCA-UNet Architecture for Automatic MRI Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.08048
  • repo_url: None
  • paper_authors: Tejashwini P S, Thriveni J, Venugopal K R
  • for: 预测和检测脑肿瘤,以降低因脑肿瘤而导致的死亡率。
  • methods: 使用深度学习方法,特别是UNet架构,自动化生物医学影像探索工具。
  • results: 提出了一种改进的UNet架构,能够有效地捕捉脑肿瘤影像中的粗细特征信息,并在Brain Tumor Segmentation(BraTS)数据集上取得良好性能:在BraTS 2020上,Dice、敏感性、特异性和Hausdorff95分别为0.845、0.845、0.999和8.1。
    Abstract Brain tumor is deliberated as one of the severe health complications which lead to decrease in life expectancy of the individuals and is also considered as a prominent cause of mortality worldwide. Therefore, timely detection and prediction of brain tumors can be helpful to prevent death rates due to brain tumors. Biomedical image analysis is a widely known solution to diagnose brain tumor. Although MRI is the current standard method for imaging tumors, its clinical usefulness is constrained by the requirement of manual segmentation which is time-consuming. Deep learning-based approaches have emerged as a promising solution to develop automated biomedical image exploration tools and the UNet architecture is commonly used for segmentation. However, the traditional UNet has limitations in terms of complexity, training, accuracy, and contextual information processing. As a result, the modified UNet architecture, which incorporates residual dense blocks, layered attention, and channel attention modules, in addition to stacked convolution, can effectively capture both coarse and fine feature information. The proposed SLCA UNet approach achieves good performance on the freely accessible Brain Tumor Segmentation (BraTS) dataset, with an average performance of 0.845, 0.845, 0.999, and 8.1 in terms of Dice, Sensitivity, Specificity, and Hausdorff95 for BraTS 2020 dataset, respectively.
    摘要 脑肿瘤是一种严重的健康问题,会降低个体的预期寿命,并被认为是全球主要的死亡原因之一。因此,及时检测和预测脑肿瘤有助于降低脑肿瘤导致的死亡率。生物医学图像分析是诊断脑肿瘤的常用方案。尽管MRI是目前肿瘤成像的标准方法,但其临床实用性受限于耗时的手动分割。基于深度学习的方法已成为开发自动化生物医学图像探索工具的有力解决方案,其中UNet架构常用于分割。然而,传统UNet在复杂度、训练、准确率和上下文信息处理等方面存在局限。为此,改进后的UNet架构在堆叠卷积之外,引入了残差密集块、层级注意力和通道注意力模块,能够有效地捕捉粗粒度和细粒度的特征信息。所提出的SLCA-UNet方法在公开的Brain Tumor Segmentation(BraTS)数据集上取得了良好性能:在BraTS 2020数据集上,Dice、敏感性、特异性和Hausdorff95的平均值分别为0.845、0.845、0.999和8.1。

SHAMSUL: Simultaneous Heatmap-Analysis to investigate Medical Significance Utilizing Local interpretability methods

  • paper_url: http://arxiv.org/abs/2307.08003
  • repo_url: https://github.com/anondo1969/shamsul
  • paper_authors: Mahbub Ul Alam, Jaakko Hollmén, Jón Rúnar Baldvinsson, Rahim Rahmani
  • for: 本研究旨在通过在胸部X光数据集上应用并比较四种成熟的可解释性方法(LIME、SHAP、Grad-CAM、LRP),提升深度神经网络在医疗健康领域的可解释性。
  • methods: 研究使用迁移学习和一个多标签多类别胸部X光数据集,解释针对特定病理类别的预测结果;通过定量与定性分析评估四种可解释性方法的表现,并与人类专家标注进行比较。
  • results: 定量评估中Grad-CAM表现最佳,而LIME热图分割可视化具有最高的医学意义。研究揭示了四种可解释性方法的优势与局限,并指出在胸部X光图像之外融合多模态信息的方法有望进一步提升医疗领域的可解释性。
    Abstract The interpretability of deep neural networks has become a subject of great interest within the medical and healthcare domain. This attention stems from concerns regarding transparency, legal and ethical considerations, and the medical significance of predictions generated by these deep neural networks in clinical decision support systems. To address this matter, our study delves into the application of four well-established interpretability methods: Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive exPlanations (SHAP), Gradient-weighted Class Activation Mapping (Grad-CAM), and Layer-wise Relevance Propagation (LRP). Leveraging the approach of transfer learning with a multi-label-multi-class chest radiography dataset, we aim to interpret predictions pertaining to specific pathology classes. Our analysis encompasses both single-label and multi-label predictions, providing a comprehensive and unbiased assessment through quantitative and qualitative investigations, which are compared against human expert annotation. Notably, Grad-CAM demonstrates the most favorable performance in quantitative evaluation, while the LIME heatmap segmentation visualization exhibits the highest level of medical significance. Our research highlights the strengths and limitations of these interpretability methods and suggests that a multimodal-based approach, incorporating diverse sources of information beyond chest radiography images, could offer additional insights for enhancing interpretability in the medical domain.
    摘要 深度神经网络的可解释性已成为医疗健康领域备受关注的课题,这源于对透明度、法律与伦理因素,以及临床决策支持系统中神经网络预测之医学意义的重视。为此,我们的研究探讨了四种成熟的可解释性方法:局部可解释模型无关解释(LIME)、Shapley加性解释(SHAP)、梯度加权类激活映射(Grad-CAM)和逐层相关性传播(LRP)。借助迁移学习和一个多标签多类别胸部X光数据集,我们旨在解释针对特定病理类别的预测。我们的分析涵盖单标签和多标签预测,通过定量与定性调查进行全面而无偏的评估,并与人类专家标注进行比较。结果显示,Grad-CAM在定量评估中表现最佳,而LIME热图分割可视化展现出最高水平的医学意义。我们的研究阐明了这些可解释性方法的优势与局限,并表明融合胸部X光图像之外多种信息源的多模态方法,有望为医疗领域的可解释性带来更多洞见。
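    Of the four methods compared, Grad-CAM is the most self-contained to sketch. The hook-based implementation below is one common way to compute it (not the authors' code); `feature_layer` is typically the last convolutional block.

```python
import torch

def grad_cam(model, feature_layer, image, class_idx):
    """Minimal Grad-CAM: weight the chosen layer's activations by the
    spatially averaged gradients of the target class score, then ReLU."""
    acts, grads = [], []
    h1 = feature_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = feature_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    model(image)[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # GAP over H, W
    cam = torch.relu((weights * acts[0]).sum(dim=1))   # weighted map sum
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]

# e.g. with a torchvision ResNet: grad_cam(resnet, resnet.layer4, x, cls)
```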

MoTIF: Learning Motion Trajectories with Local Implicit Neural Functions for Continuous Space-Time Video Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.07988
  • repo_url: https://github.com/sichun233746/motif
  • paper_authors: Yi-Hsin Chen, Si-Cun Chen, Yi-Hsin Chen, Yen-Yu Lin, Wen-Hsiao Peng
  • for: 这篇论文旨在提出一种能够以任意缩放因子在空间和时间维度上放大视频的连续时空超分辨率(C-STVSR)技术。
  • methods: 该技术使用一种时空局部隐式神经函数,能够为连续的像素学习前向运动。其特点是学习单个运动轨迹,而非带有后向运动的轨迹混合。为了简化运动插值,该方法将从输入视频中稀疏采样的前向运动编码为上下文输入。
  • results: 该技术在C-STVSR领域达到了最先进的性能,MoTIF的源代码已经公开。
    Abstract This work addresses continuous space-time video super-resolution (C-STVSR) that aims to up-scale an input video both spatially and temporally by any scaling factors. One key challenge of C-STVSR is to propagate information temporally among the input video frames. To this end, we introduce a space-time local implicit neural function. It has the striking feature of learning forward motion for a continuum of pixels. We motivate the use of forward motion from the perspective of learning individual motion trajectories, as opposed to learning a mixture of motion trajectories with backward motion. To ease motion interpolation, we encode sparsely sampled forward motion extracted from the input video as the contextual input. Along with a reliability-aware splatting and decoding scheme, our framework, termed MoTIF, achieves the state-of-the-art performance on C-STVSR. The source code of MoTIF is available at https://github.com/sichun233746/MoTIF.
    摘要 本工作研究连续时空视频超分辨率(C-STVSR),其目标是以任意缩放因子在空间和时间维度上放大输入视频。C-STVSR的一个关键挑战是在输入视频帧之间传递时间信息。为此,我们引入一种时空局部隐式神经函数,其突出特点是为连续的像素学习前向运动。我们从学习单个运动轨迹(而非带有后向运动的轨迹混合)的角度出发,论证了使用前向运动的合理性。为简化运动插值,我们将从输入视频中稀疏采样的前向运动编码为上下文输入。结合可靠性感知的splatting与解码方案,我们的框架(称为MoTIF)在C-STVSR上实现了最先进的性能。MoTIF的源代码可在 https://github.com/sichun233746/MoTIF 获取。

Panoramic Voltage-Sensitive Optical Mapping of Contracting Hearts using Cooperative Multi-View Motion Tracking with 12 to 24 Cameras

  • paper_url: http://arxiv.org/abs/2307.07943
  • repo_url: None
  • paper_authors: Shrey Chowdhary, Jan Lebert, Shai Dickman, Jan Christoph
  • for: 本研究用于以高空间和时间分辨率对心脏表面的动作电位波进行成像。
  • methods: 这种多摄像头光学标测技术使用最多24个高速低成本摄像头,可以在整个强烈形变的心室表面上对动作电位波进行成像。
  • results: 研究发现,使用12个摄像头即可以0.5-1.0百万像素的合计分辨率覆盖整个心室表面,并借助三维协同多视图运动跟踪等计算机视觉技术,实现形变心脏表面的三维动态重建和高分辨率电压敏感测量。借助该装置,研究者在兔心的窦性心律、起搏心律和心室颤动等不同心律下测量了动作电位波。该装置确立了该领域的新技术水平,可用于研究健康与疾病状态下心脏的机电动力学。
    Abstract Action potential waves triggering the heart's contractions can be imaged at high spatial and temporal resolutions across the heart surface using voltage-sensitive optical mapping. However, for over three decades, optical mapping has been performed with contraction-inhibited hearts. While it was recently demonstrated that action potential waves can be imaged on parts of the three-dimensional deforming ventricular surface using multi-camera optical mapping, panoramic measurements of action potential waves across the entire beating heart surface remained elusive. Here, we introduce a high-resolution multi-camera optical mapping system consisting of up to 24 high-speed, low-cost cameras with which it is possible to image action potential waves at high resolutions on the entire, strongly deforming ventricular surface of the heart. We imaged isolated hearts inside a custom-designed soccerball-shaped imaging chamber, which facilitates imaging and even illumination with excitation light from all sides in a panoramic fashion. We found that it is possible to image the entire ventricular surface using 12 cameras with 0.5-1.0 megapixels combined resolution. The 12 calibrated cameras generate 1.5 gigabytes of video data per second at imaging speeds of 500 fps, which we process using various computer vision techniques, including three-dimensional cooperative multi-view motion tracking, to generate three-dimensional dynamic reconstructions of the deforming heart surface with corresponding high-resolution voltage-sensitive optical measurements. With our setup, we measured action potential waves at unprecedented resolutions on the contracting three-dimensional surface of rabbit hearts during sinus rhythm, paced rhythm as well as ventricular fibrillation. Our imaging setup defines a new state-of-the-art in the field and can be used to study the heart's electromechanical dynamics during health and disease.
    摘要 触发心脏收缩的动作电位波可以通过电压敏感光学标测,在心脏表面以高空间和时间分辨率成像。然而,三十多年来,光学标测一直是在抑制收缩的心脏上进行的。虽然最近已有研究证明可以用多摄像头光学标测在三维形变心室表面的局部区域成像动作电位波,但在整个跳动心脏表面进行动作电位波的全景测量仍未实现。在此,我们介绍一套高分辨率多摄像头光学标测系统,由最多24个高速、低成本摄像头组成,能够在整个强烈形变的心室表面上以高分辨率成像动作电位波。我们将离体心脏置于定制的足球形成像腔内,便于从各个方向全景地成像并以激发光均匀照明。我们发现,使用12个合计分辨率为0.5-1.0百万像素的摄像头即可对整个心室表面成像。这12个经校准的摄像头在500 fps的成像速度下每秒产生1.5 GB的视频数据,我们使用包括三维协同多视图运动跟踪在内的多种计算机视觉技术进行处理,生成形变心脏表面的三维动态重建及对应的高分辨率电压敏感光学测量。借助该装置,我们在兔心的窦性心律、起搏心律以及心室颤动期间,以前所未有的分辨率在收缩的三维心脏表面上测量了动作电位波。我们的成像装置确立了该领域的新技术水平,可用于研究健康与疾病状态下心脏的机电动力学。

cs.SD - 2023-07-15

Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features

  • paper_url: http://arxiv.org/abs/2307.07683
  • repo_url: https://github.com/audio-df-ucb/clonedvoicedetection
  • paper_authors: Sarah Barrington, Romit Barua, Gautham Koorma, Hany Farid
  • for: 本研究旨在区分真实语音与克隆语音,特别是在合成语音克隆技术日益普及的背景下。
  • methods: 本研究使用三种不同的方法来区分真实语音与克隆语音,包括基于低维感知特征的方法、基于通用频谱特征的方法,以及基于端到端学习特征的方法。
  • results: 研究显示,这三种方法无论是在单一说话人还是多个说话人的语音上训练,都能有效区分真实语音与克隆语音;其中学习得到的特征的等错误率稳定在0%至4%之间,并且对对抗性“洗白”处理具有一定的鲁棒性。
    Abstract Synthetic-voice cloning technologies have seen significant advances in recent years, giving rise to a range of potential harms. From small- and large-scale financial fraud to disinformation campaigns, the need for reliable methods to differentiate real and synthesized voices is imperative. We describe three techniques for differentiating a real from a cloned voice designed to impersonate a specific person. These three approaches differ in their feature extraction stage with low-dimensional perceptual features offering high interpretability but lower accuracy, to generic spectral features, and end-to-end learned features offering less interpretability but higher accuracy. We show the efficacy of these approaches when trained on a single speaker's voice and when trained on multiple voices. The learned features consistently yield an equal error rate between $0\%$ and $4\%$, and are reasonably robust to adversarial laundering.
    摘要 合成语音克隆技术近年来取得显著进步,也带来了一系列潜在危害。从大小规模的金融诈骗到虚假信息宣传,迫切需要可靠的方法来区分真实语音与合成语音。我们描述了三种用于区分真实语音与冒充特定人物的克隆语音的技术。这三种方法的差异在于特征提取阶段:低维感知特征可解释性高但准确率较低;其次是通用频谱特征;端到端学习特征可解释性较低但准确率更高。我们展示了这些方法在单一说话人语音和多个说话人语音上训练时的效果。学习得到的特征的等错误率稳定在0%至4%之间,并对对抗性“洗白”处理具有一定的鲁棒性。
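    The equal error rate (EER) reported above can be computed from detector scores as follows; the Gaussian scores in this example are synthetic stand-ins for a real detector's outputs.

```python
import numpy as np

def equal_error_rate(scores_real, scores_cloned):
    """EER: sweep a threshold over detector scores (higher = more likely
    cloned) and find where false-accept and false-reject rates cross."""
    thresholds = np.sort(np.concatenate([scores_real, scores_cloned]))
    far = np.array([(scores_real >= t).mean() for t in thresholds])   # real flagged as cloned
    frr = np.array([(scores_cloned < t).mean() for t in thresholds])  # cloned missed
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 500)    # synthetic scores for real voices
cloned = rng.normal(2.5, 1.0, 500)  # synthetic scores for cloned voices
print(f"EER = {equal_error_rate(real, cloned):.3f}")
```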

Towards spoken dialect identification of Irish

  • paper_url: http://arxiv.org/abs/2307.07436
  • repo_url: None
  • paper_authors: Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide
  • for: 本研究旨在开发一个用于识别爱尔兰语言方言的语音识别系统,以便在识别爱尔兰语言时提供更加准确的结果。
  • methods: 本研究使用了两种音频分类模型:XLS-R和ECAPA-TDNN,以及一个基于预训练的爱尔兰语言BERT模型来进行文本分类。ECAPA-TDNN模型在整体上表现最佳,具有73%的准确率,而将其与文本模型进行融合可以提高准确率至76%。
  • results: 研究发现,最容易识别的爱尔兰语方言是Ulster方言,准确率达94%。然而,模型在区分Connacht和Munster方言时存在困难,这表明可能需要更加细致的方法才能稳健地区分爱尔兰语的各方言。
    Abstract The Irish language is rich in its diversity of dialects and accents. This compounds the difficulty of creating a speech recognition system for the low-resource language, as such a system must contend with a high degree of variability with limited corpora. A recent study investigating dialect bias in Irish ASR found that balanced training corpora gave rise to unequal dialect performance, with performance for the Ulster dialect being consistently worse than for the Connacht or Munster dialects. Motivated by this, the present experiments investigate spoken dialect identification of Irish, with a view to incorporating such a system into the speech recognition pipeline. Two acoustic classification models are tested, XLS-R and ECAPA-TDNN, in conjunction with a text-based classifier using a pretrained Irish-language BERT model. The ECAPA-TDNN, particularly a model pretrained for language identification on the VoxLingua107 dataset, performed best overall, with an accuracy of 73%. This was further improved to 76% by fusing the model's outputs with the text-based model. The Ulster dialect was most accurately identified, with an accuracy of 94%, however the model struggled to disambiguate between the Connacht and Munster dialects, suggesting a more nuanced approach may be necessary to robustly distinguish between the dialects of Irish.
    摘要 爱尔兰语的方言和口音十分多样。这加剧了为这种低资源语言构建语音识别系统的难度,因为系统必须在语料有限的情况下应对高度的变异。最近一项关于爱尔兰语ASR方言偏差的研究发现,即使训练语料均衡,各方言的表现仍不均等,其中Ulster方言的表现始终差于Connacht和Munster方言。受此启发,本实验研究爱尔兰语的口语方言识别,以期将此类系统整合进语音识别流程。实验测试了两种声学分类模型XLS-R和ECAPA-TDNN,并结合一个基于预训练爱尔兰语BERT模型的文本分类器。其中,在VoxLingua107数据集上为语言识别预训练过的ECAPA-TDNN模型整体表现最佳,准确率为73%;将其输出与文本模型融合后,准确率进一步提升至76%。Ulster方言识别最为准确,准确率达94%,但模型难以区分Connacht和Munster方言,这表明可能需要更细致的方法才能稳健地区分爱尔兰语的各方言。
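    The paper fuses the acoustic model's outputs with the text-based classifier. A weighted average of per-dialect posteriors, as sketched below, is one common late-fusion choice; the fusion weight `w` is an assumption to be tuned on held-out data, not a value from the paper.

```python
import numpy as np

def fuse_dialect_predictions(p_acoustic, p_text, w=0.5):
    """Late fusion of per-dialect posteriors from an acoustic model
    (e.g. ECAPA-TDNN) and a text model (e.g. an Irish BERT classifier)."""
    fused = w * p_acoustic + (1 - w) * p_text
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalize

dialects = ["Ulster", "Connacht", "Munster"]
p_a = np.array([0.70, 0.20, 0.10])  # acoustic posteriors (illustrative)
p_t = np.array([0.40, 0.35, 0.25])  # text posteriors (illustrative)
print(dialects[int(np.argmax(fuse_dialect_predictions(p_a, p_t)))])
```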

eess.AS - 2023-07-15

Audio-Visual Speech Enhancement Using Self-supervised Learning to Improve Speech Intelligibility in Cochlear Implant Simulations

  • paper_url: http://arxiv.org/abs/2307.07748
  • repo_url: None
  • paper_authors: Richard Lee Lai, Jen-Cheng Hou, Mandar Gogate, Kia Dashtipour, Amir Hussain, Yu Tsao
  • for: 帮助听力障碍者更好地理解对话,特别是在噪声环境中。
  • methods: 提出了一种基于深度学习的视听语音增强技术,结合视觉与音频信号,使用基于Transformer的自监督模型AV-HuBERT提取特征,再经由基于BLSTM的语音增强(SE)模型处理。
  • results: 实验结果显示,所提方法能够克服训练数据有限的问题,并在不同噪声环境中提升语音可懂度:PESQ值从1.43提升至1.67,STOI值从0.70提升至0.74。此外,使用CI声码器处理后的语音进行评估表明,在人类对话中常见的动态噪声下,SSL-AVSE表现出明显改善,NCM值相比含噪基线从26.5%提升至87.2%。
    Abstract Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in enhancing the intelligibility of vocoded speech in cochlear implant (CI) simulations. Notably, the study focuses on a challenged scenario where there is limited availability of training data for the AVSE task. To address this problem, we propose a novel deep neural network framework termed Self-Supervised Learning-based AVSE (SSL-AVSE). The proposed SSL-AVSE combines visual cues, such as lip and mouth movements, from the target speakers with corresponding audio signals. The contextually combined audio and visual data are then fed into a Transformer-based SSL AV-HuBERT model to extract features, which are further processed using a BLSTM-based SE model. The results demonstrate several key findings. Firstly, SSL-AVSE successfully overcomes the issue of limited data by leveraging the AV-HuBERT model. Secondly, by fine-tuning the AV-HuBERT model parameters for the target SE task, significant performance improvements are achieved. Specifically, there is a notable enhancement in PESQ (Perceptual Evaluation of Speech Quality) from 1.43 to 1.67 and in STOI (Short-Time Objective Intelligibility) from 0.70 to 0.74. Furthermore, the performance of the SSL-AVSE was evaluated using CI vocoded speech to assess the intelligibility for CI users. Comparative experimental outcomes reveal that in the presence of dynamic noises encountered during human conversations, SSL-AVSE exhibits a substantial improvement. The NCM (Normal Correlation Matrix) values indicate an increase of 26.5% to 87.2% compared to the noisy baseline.
    摘要 听力障碍者在理解语音方面面临挑战,尤其是在噪声环境中。本研究旨在探讨视听语音增强(AVSE)在人工耳蜗(CI)模拟中提升声码语音可懂度的有效性。值得注意的是,本研究聚焦于AVSE任务训练数据有限这一困难场景。为此,我们提出了一种新的深度神经网络框架,称为基于自监督学习的AVSE(SSL-AVSE)。SSL-AVSE将目标说话人的唇部与口部运动等视觉线索与对应的音频信号结合,将融合了上下文的视听数据输入基于Transformer的自监督模型AV-HuBERT提取特征,再由基于BLSTM的语音增强(SE)模型进一步处理。结果显示出以下几点:首先,SSL-AVSE借助AV-HuBERT模型成功克服了数据有限的问题;其次,针对目标SE任务微调AV-HuBERT模型参数后,性能显著提升,PESQ(语音质量感知评估)从1.43提升至1.67,STOI(短时客观可懂度)从0.70提升至0.74。此外,我们使用CI声码器处理后的语音评估了SSL-AVSE对CI用户的可懂度。对比实验结果表明,在人类对话中常见的动态噪声下,SSL-AVSE表现出显著改善,NCM值相比含噪基线从26.5%提升至87.2%。

cs.CV - 2023-07-15

HQG-Net: Unpaired Medical Image Enhancement with High-Quality Guidance

  • paper_url: http://arxiv.org/abs/2307.07829
  • repo_url: None
  • paper_authors: Chunming He, Kai Li, Guoxia Xu, Jiangpeng Yan, Longxiang Tang, Yulun Zhang, Xiu Li, Yaowei Wang
  • for: 提高医疗影像质量(UMIE),使低质量医疗影像变成高质量医疗影像,不需要使用对应的高质量影像进行训练。
  • methods: 提出了一种新的UMIE方法,利用变分归一化模块将高质量线索直接编码进低质量影像的增强过程,并通过对抗学习和内容感知损失引导增强过程。
  • results: 在三个医疗影像数据集上的实验证明,该方法在增强质量和下游任务性能两方面均表现出色,并在多数情况下超越了现有方法。
    Abstract Unpaired Medical Image Enhancement (UMIE) aims to transform a low-quality (LQ) medical image into a high-quality (HQ) one without relying on paired images for training. While most existing approaches are based on Pix2Pix/CycleGAN and are effective to some extent, they fail to explicitly use HQ information to guide the enhancement process, which can lead to undesired artifacts and structural distortions. In this paper, we propose a novel UMIE approach that avoids the above limitation of existing methods by directly encoding HQ cues into the LQ enhancement process in a variational fashion and thus model the UMIE task under the joint distribution between the LQ and HQ domains. Specifically, we extract features from an HQ image and explicitly insert the features, which are expected to encode HQ cues, into the enhancement network to guide the LQ enhancement with the variational normalization module. We train the enhancement network adversarially with a discriminator to ensure the generated HQ image falls into the HQ domain. We further propose a content-aware loss to guide the enhancement process with wavelet-based pixel-level and multi-encoder-based feature-level constraints. Additionally, as a key motivation for performing image enhancement is to make the enhanced images serve better for downstream tasks, we propose a bi-level learning scheme to optimize the UMIE task and downstream tasks cooperatively, helping generate HQ images both visually appealing and favorable for downstream tasks. Experiments on three medical datasets, including two newly collected datasets, verify that the proposed method outperforms existing techniques in terms of both enhancement quality and downstream task performance. We will make the code and the newly collected datasets publicly available for community study.
    摘要 无配对医学图像增强(UMIE)的目标是在不依赖配对图像训练的情况下,将低质量(LQ)医学图像转换为高质量(HQ)图像。现有方法多基于Pix2Pix/CycleGAN,虽有一定效果,但没有显式利用HQ信息来引导增强过程,可能导致不必要的伪影和结构失真。本文提出一种新的UMIE方法,通过以变分方式将HQ线索直接编码进LQ增强过程,从而在LQ与HQ域的联合分布下对UMIE任务建模,避免了现有方法的上述局限。具体而言,我们从HQ图像中提取特征,并通过变分归一化模块将这些编码HQ线索的特征显式注入增强网络,以引导LQ增强。我们以对抗方式训练增强网络,利用判别器确保生成的HQ图像落入HQ域。我们进一步提出一种内容感知损失,以基于小波的像素级约束和基于多编码器的特征级约束引导增强过程。此外,鉴于图像增强的一个主要动机是让增强后的图像更好地服务下游任务,我们提出一种双层学习方案,协同优化UMIE任务与下游任务,帮助生成既具视觉吸引力又有利于下游任务的HQ图像。在三个医学数据集(包括两个新收集的数据集)上的实验验证了所提方法在增强质量和下游任务性能两方面均优于现有技术。我们将公开代码和新收集的数据集,供社区研究。
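    One way to picture "directly encoding HQ cues into the LQ enhancement process" is a feature-modulation layer in which an HQ code predicts per-channel scale and shift, AdaIN-style. HQG-Net's variational normalization is more involved than this; the sketch below only illustrates the interface, and all layer choices are assumptions.

```python
import torch
import torch.nn as nn

class HQGuidedNorm(nn.Module):
    """Sketch: modulate normalized LQ features with scale/shift predicted
    from a high-quality (HQ) cue vector."""
    def __init__(self, channels, hq_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(hq_dim, channels)
        self.to_shift = nn.Linear(hq_dim, channels)

    def forward(self, lq_feat, hq_code):
        g = self.to_scale(hq_code).unsqueeze(-1).unsqueeze(-1)
        b = self.to_shift(hq_code).unsqueeze(-1).unsqueeze(-1)
        return (1 + g) * self.norm(lq_feat) + b

f = torch.randn(2, 64, 32, 32)  # LQ feature map
z = torch.randn(2, 128)         # HQ cue vector from an HQ encoder
print(HQGuidedNorm(64, 128)(f, z).shape)  # torch.Size([2, 64, 32, 32])
```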

TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation

  • paper_url: http://arxiv.org/abs/2307.07813
  • repo_url: None
  • paper_authors: Pietro Bonazzi, Thomas Ruegg, Sizhen Bian, Yawei Li, Michele Magno
  • for: 这个论文的目的是提出一种高效低功耗的边缘视觉应用,以解决边缘设备的计算负担问题。
  • methods: 该论文使用一种新的“传感器内AI”视觉平台,即索尼的IMX500,实现超快速、超低功耗的端到端边缘视觉应用。
  • results: 研究表明,IMX500比Google Coral Dev Micro和索尼Spresense更快(19ms对34.4ms)、更节能(4.9mJ对34.2mJ),并且可以实现高精度的2D视线估计(完全量化后最大误差0.16cm)。
    Abstract Intelligent edge vision tasks encounter the critical challenge of ensuring power and latency efficiency due to the typically heavy computational load they impose on edge platforms.This work leverages one of the first "AI in sensor" vision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power end-to-end edge vision applications. We evaluate the IMX500 and compare it to other edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by exploring gaze estimation as a case study. We propose TinyTracker, a highly efficient, fully quantized model for 2D gaze estimation designed to maximize the performance of the edge vision systems considered in this study. TinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1] without significant loss in gaze estimation accuracy (maximum of 0.16 cm when fully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor results in end-to-end latency of around 19ms. The camera takes around 17.9ms to read, process and transmit the pixels to the accelerator. The inference time of the network is 0.86ms with an additional 0.24 ms for retrieving the results from the sensor. The overall energy consumption of the end-to-end system is 4.9 mJ, including 0.06 mJ for inference. The end-to-end study shows that IMX500 is 1.7x faster than CoralMicro (19ms vs 34.4ms) and 7x more power efficient (4.9mJ VS 34.2mJ)
    摘要 智能边缘视觉任务面临的关键挑战是保证功耗与延迟效率,因为此类任务通常给边缘平台带来沉重的计算负担。本工作利用首批“传感器内AI”视觉平台之一的索尼IMX500,实现超快速、超低功耗的端到端边缘视觉应用。我们以视线估计(gaze estimation)为案例研究,评估IMX500并与Google Coral Dev Micro和索尼Spresense等其他边缘平台进行比较。我们提出了TinyTracker,一种高效、完全量化的2D视线估计模型,旨在最大化本研究所考虑的边缘视觉系统的性能。TinyTracker相比iTracker体积缩小41倍(600Kb),且视线估计精度没有显著损失(完全量化后最大误差0.16cm)。TinyTracker部署在索尼IMX500视觉传感器上后,端到端延迟约为19ms:摄像头读取、处理并将像素传输给加速器约需17.9ms,网络推理时间为0.86ms,另需0.24ms从传感器取回结果。端到端系统的总能耗为4.9mJ,其中推理占0.06mJ。端到端研究表明,IMX500比CoralMicro快1.7倍(19ms对34.4ms),能效高7倍(4.9mJ对34.2mJ)。
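    A quick consistency check of the latency and energy figures quoted above (all numbers taken from the abstract):

```python
# End-to-end latency breakdown for TinyTracker on the IMX500.
read_process_transmit_ms = 17.9  # sensor read-out, processing, transfer
inference_ms = 0.86              # TinyTracker forward pass
readback_ms = 0.24               # retrieving results from the sensor
total_ms = read_process_transmit_ms + inference_ms + readback_ms
print(f"end-to-end latency = {total_ms:.1f} ms")  # 19.0 ms, as reported

total_mj, inference_mj = 4.9, 0.06
print(f"inference share of energy: {100 * inference_mj / total_mj:.1f}%")
# Compare against CoralMicro (reported as 1.7x faster, 7x more efficient):
print(f"speed ratio: {34.4 / 19:.2f}x, energy ratio: {34.2 / 4.9:.1f}x")
```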

Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation

  • paper_url: http://arxiv.org/abs/2307.07812
  • repo_url: https://github.com/msiam/mmc-multiscalememory
  • paper_authors: Mennatullah Siam, Rezaul Karim, He Zhao, Richard Wildes
  • for: 这篇论文针对少样本视频分割:利用少量标注的支持图像,在查询视频中分割特定的新类。
  • methods: 该方法使用一种名为多尺度记忆比较器(MMC)的元学习结构,在Transformer解码器内跨尺度共享信息,以提高分割精度。
  • results: 该方法在多项实验中均超过了基线,并达到了最先进水平。代码可以在 https://github.com/MSiam/MMC-MultiscaleMemory 下载。
    Abstract Few-shot video segmentation is the task of delineating a specific novel class in a query video using few labelled support images. Typical approaches compare support and query features while limiting comparisons to a single feature layer and thereby ignore potentially valuable information. We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation that combines information across scales within a transformer decoder. Typical multiscale transformer decoders for segmentation tasks learn a compressed representation, their queries, through information exchange across scales. Unlike previous work, we instead preserve the detailed feature maps during across scale information exchange via a multiscale memory transformer decoding to reduce confusion between the background and novel class. Integral to the approach, we investigate multiple forms of information exchange across scales in different tasks and provide insights with empirical evidence on which to use in each task. The overall comparisons among query and support features benefit from both rich semantics and precise localization. We demonstrate our approach primarily on few-shot video object segmentation and an adapted version on the fully supervised counterpart. In all cases, our approach outperforms the baseline and yields state-of-the-art performance. Our code is publicly available at https://github.com/MSiam/MMC-MultiscaleMemory.
    摘要 少样本视频分割是指利用少量标注的支持图像,在查询视频中分割特定的新类。典型方法在比较支持特征与查询特征时仅限于单一特征层,因而可能忽略有价值的信息。我们提出一种元学习的多尺度记忆比较器(MMC),用于少样本视频分割,它在Transformer解码器内跨尺度融合信息。用于分割任务的典型多尺度Transformer解码器通过跨尺度信息交换学习压缩表示(其查询)。与先前工作不同,我们在跨尺度信息交换中通过多尺度记忆Transformer解码保留细节特征图,以减少背景与新类之间的混淆。作为方法的组成部分,我们研究了不同任务中多种跨尺度信息交换形式,并以实验证据给出各任务应采用何种形式的见解。查询与支持特征之间的整体比较同时受益于丰富的语义和精确的定位。我们主要在少样本视频目标分割上验证了该方法,并在其全监督改编版本上进行了验证。在所有情形下,我们的方法均超过基线并达到最先进性能。代码公开于 https://github.com/MSiam/MMC-MultiscaleMemory 。

MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis

  • paper_url: http://arxiv.org/abs/2307.07807
  • repo_url: https://github.com/jeunyuli/muaf
  • paper_authors: Junyu Li, Han Huang, Dong Ni, Wufeng Xue, Dongmei Zhu, Jun Cheng
  • for: 本研究旨在实现肾肿瘤的早期诊断,以提高患者生存率。
  • methods: 使用融合B模式与CEUS模式超声视频的多模态融合网络,实现多模态特征融合与视频分类。
  • results: 实验结果显示,所提框架的准确率高于单模态模型和竞争方法;此外,我们的OTA模块取得了比帧级预测更高的分类准确率。
    Abstract Early diagnosis of renal cancer can greatly improve the survival rate of patients. Contrast-enhanced ultrasound (CEUS) is a cost-effective and non-invasive imaging technique and has become more and more frequently used for renal tumor diagnosis. However, the classification of benign and malignant renal tumors can still be very challenging due to the highly heterogeneous appearance of cancer and imaging artifacts. Our aim is to detect and classify renal tumors by integrating B-mode and CEUS-mode ultrasound videos. To this end, we propose a novel multi-modal ultrasound video fusion network that can effectively perform multi-modal feature fusion and video classification for renal tumor diagnosis. The attention-based multi-modal fusion module uses cross-attention and self-attention to extract modality-invariant features and modality-specific features in parallel. In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis. Experimental results on a multicenter dataset show that the proposed framework outperforms the single-modal models and the competing methods. Furthermore, our OTA module achieves higher classification accuracy than the frame-level predictions. Our code is available at \url{https://github.com/JeunyuLi/MUAF}.
    摘要 早期诊断肾癌可以大幅提高患者生存率。对比增强超声(CEUS)是一种成本低且无创的成像技术,在肾肿瘤诊断中的应用日益广泛。然而,由于肿瘤外观高度异质以及成像伪影的存在,肾肿瘤的良恶性分类仍然十分困难。我们的目标是通过融合B模式与CEUS模式超声视频来检测和分类肾肿瘤。为此,我们提出一种新的多模态超声视频融合网络,能够有效地进行多模态特征融合与视频分类,用于肾肿瘤诊断。基于注意力的多模态融合模块使用交叉注意力和自注意力,并行提取模态不变特征和模态特定特征。此外,我们设计了一个对象级时间聚合(OTA)模块,能够自动过滤低质量特征,并高效整合多帧的时间信息,以提高肿瘤诊断的准确性。在多中心数据集上的实验结果表明,所提框架优于单模态模型和竞争方法;此外,我们的OTA模块取得了比帧级预测更高的分类准确率。代码可在 https://github.com/JeunyuLi/MUAF 获取。

Joint Adversarial and Collaborative Learning for Self-Supervised Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07791
  • repo_url: https://github.com/levigty/acl
  • paper_authors: Tianyu Guo, Mengyuan Liu, Hong Liu, Wenhao Li, Jingwen Guo, Tao Wang, Yidi Li
  • for: 本研究的目的是使用MoCo和SimCLR等对比学习方法,解决基于骨架的自监督动作识别任务。
  • methods: 此类方法通常使用多个数据流(关节、运动和骨骼)进行集成学习;而如何在单个流内构建判别性特征空间,并有效聚合多个流的信息,仍是一个开放问题。
  • results: 我们首先将名为BYOL的对比学习方法应用于骨架数据,并将所得的SkeletonBYOL确立为一个简单而有效的自监督骨架动作识别基线。受其启发,我们进一步提出了对抗与协作联合学习(ACL)框架,结合了跨模型对抗学习(CMAL)与跨流协作学习(CSCL)。
    Abstract Considering the instance-level discriminative ability, contrastive learning methods, including MoCo and SimCLR, have been adapted from the original image representation learning task to solve the self-supervised skeleton-based action recognition task. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning, meanwhile, how to construct a discriminative feature space within a single stream and effectively aggregate the information from multiple streams remains an open problem. To this end, we first apply a new contrastive learning method called BYOL to learn from skeleton data and formulate SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Inspired by SkeletonBYOL, we further present a joint Adversarial and Collaborative Learning (ACL) framework, which combines Cross-Model Adversarial Learning (CMAL) and Cross-Stream Collaborative Learning (CSCL). Specifically, CMAL learns single-stream representation by cross-model adversarial loss to obtain more discriminative features. To aggregate and interact with multi-stream information, CSCL is designed by generating similarity pseudo label of ensemble learning as supervision and guiding feature generation for individual streams. Exhaustive experiments on three datasets verify the complementary properties between CMAL and CSCL and also verify that our method can perform favorably against state-of-the-art methods using various evaluation protocols. Our code and models are publicly available at \url{https://github.com/Levigty/ACL}.
    摘要 考虑到实例级判别能力,MoCo和SimCLR等对比学习方法已从最初的图像表示学习任务被引入到基于骨架的自监督动作识别任务。这些方法通常使用多个数据流(即关节、运动和骨骼)进行集成学习;与此同时,如何在单个流内构建判别性特征空间,并有效聚合多个流的信息,仍是一个开放问题。为此,我们首先将名为BYOL的对比学习方法应用于骨架数据学习,并将SkeletonBYOL确立为一个简单而有效的自监督骨架动作识别基线。受SkeletonBYOL启发,我们进一步提出了对抗与协作联合学习(ACL)框架,结合跨模型对抗学习(CMAL)与跨流协作学习(CSCL)。具体而言,CMAL通过跨模型对抗损失学习单流表示,以获得更具判别性的特征;为聚合多流信息并实现流间交互,CSCL通过生成集成学习的相似度伪标签作为监督,引导各流的特征生成。在三个数据集上的详尽实验验证了CMAL与CSCL之间的互补性,并表明在各种评估协议下,我们的方法均优于最先进的方法。代码与模型公开于 https://github.com/Levigty/ACL 。

Adaptive Nonlinear Latent Transformation for Conditional Face Editing

  • paper_url: http://arxiv.org/abs/2307.07790
  • repo_url: https://github.com/hzzone/adatrans
  • paper_authors: Zhizhong Huang, Siteng Ma, Junping Zhang, Hongming Shan
  • for: 本文提出了一种新的自适应非线性潜在变换方法(AdaTrans),用于解耦的条件人脸编辑。
  • methods: 本方法将编辑过程分解为多个细化步骤,每步的方向和幅度以人脸属性和潜在码为条件。这种非线性变换轨迹可将人脸编辑至目标属性,同时保持其他属性不变。此外,本文还基于互信息框架提出了一种解耦学习策略,以消除属性之间的纠缠。
  • results: 对多种人脸属性的实验表明,AdaTrans能够实现可控的人脸编辑,具有解耦、支持非二值属性的灵活性和高保真度等优点。与现有方法相比,AdaTrans在年龄差距大、标注数据少等最具挑战性的场景中表现尤为出色。代码可从 https://github.com/Hzzone/AdaTrans 获取。
    Abstract Recent works for face editing usually manipulate the latent space of StyleGAN via the linear semantic directions. However, they usually suffer from the entanglement of facial attributes, need to tune the optimal editing strength, and are limited to binary attributes with strong supervision signals. This paper proposes a novel adaptive nonlinear latent transformation for disentangled and conditional face editing, termed AdaTrans. Specifically, our AdaTrans divides the manipulation process into several finer steps; i.e., the direction and size at each step are conditioned on both the facial attributes and the latent codes. In this way, AdaTrans describes an adaptive nonlinear transformation trajectory to manipulate the faces into target attributes while keeping other attributes unchanged. Then, AdaTrans leverages a predefined density model to constrain the learned trajectory in the distribution of latent codes by maximizing the likelihood of transformed latent code. Moreover, we also propose a disentangled learning strategy under a mutual information framework to eliminate the entanglement among attributes, which can further relax the need for labeled data. Consequently, AdaTrans enables a controllable face editing with the advantages of disentanglement, flexibility with non-binary attributes, and high fidelity. Extensive experimental results on various facial attributes demonstrate the qualitative and quantitative effectiveness of the proposed AdaTrans over existing state-of-the-art methods, especially in the most challenging scenarios with a large age gap and few labeled examples. The source code is available at https://github.com/Hzzone/AdaTrans.
    摘要 近来的人脸编辑工作通常沿线性语义方向操纵StyleGAN的潜在空间。然而,这类方法往往受人脸属性纠缠的困扰,需要调节最佳编辑强度,并且仅限于有强监督信号的二值属性。本文提出一种新的自适应非线性潜在变换方法AdaTrans,用于解耦的条件人脸编辑。具体而言,AdaTrans将编辑过程划分为多个细化步骤,每步的方向和幅度都以人脸属性和潜在码为条件。如此,AdaTrans刻画出一条自适应的非线性变换轨迹,将人脸编辑至目标属性,同时保持其他属性不变。随后,AdaTrans利用一个预定义的密度模型,通过最大化变换后潜在码的似然,将所学轨迹约束在潜在码的分布之内。此外,我们在互信息框架下提出一种解耦学习策略,消除属性之间的纠缠,从而进一步降低对标注数据的需求。由此,AdaTrans实现了可控的人脸编辑,兼具解耦、支持非二值属性的灵活性以及高保真度等优势。在多种人脸属性上的大量实验定性和定量地证明了AdaTrans优于现有最先进方法,尤其是在年龄差距大、标注样本少等最具挑战性的场景中。源代码公开于 https://github.com/Hzzone/AdaTrans 。

SoccerKDNet: A Knowledge Distillation Framework for Action Recognition in Soccer Videos

  • paper_url: http://arxiv.org/abs/2307.07768
  • repo_url: None
  • paper_authors: Sarosij Bose, Saikat Sarkar, Amlan Chakrabarti
  • for: 本文针对足球视频中球员动作的分类问题,这是体育分析中的一个难题。
  • methods: 提出了一种端到端的基于知识蒸馏的迁移学习网络(在Kinetics400数据集上预训练),并引入了一种独特的损失参数化方法。
  • results: 在新数据集SoccerDB1(包含448个视频、4类球员动作)上取得了67.20%的验证准确率,且模型能较好地泛化到新数据集。
    Abstract Classifying player actions from soccer videos is a challenging problem, which has become increasingly important in sports analytics over the years. Most state-of-the-art methods employ highly complex offline networks, which makes it difficult to deploy such models in resource constrained scenarios. Here, in this paper we propose a novel end-to-end knowledge distillation based transfer learning network pre-trained on the Kinetics400 dataset and then perform extensive analysis on the learned framework by introducing a unique loss parameterization. We also introduce a new dataset named SoccerDB1 containing 448 videos and consisting of 4 diverse classes each of players playing soccer. Furthermore, we introduce an unique loss parameter that help us linearly weigh the extent to which the predictions of each network are utilized. Finally, we also perform a thorough performance study using various changed hyperparameters. We also benchmark the first classification results on the new SoccerDB1 dataset obtaining 67.20% validation accuracy. Apart from outperforming prior arts significantly, our model also generalizes to new datasets easily. The dataset has been made publicly available at: https://bit.ly/soccerdb1
    摘要 “从足球视频中对球员动作进行分类是一个复杂的问题,近年来在体育分析领域愈发重要。当前最先进的方法大多采用高度复杂的离线网络,难以部署在资源受限的场景中。本文提出了一种新颖的端到端基于知识蒸馏的迁移学习网络(在Kinetics400数据集上预训练),并通过引入一种独特的损失参数化对所学框架进行了深入分析。我们还发布了一个名为SoccerDB1的新数据集,包含448个视频,涵盖4类不同的球员踢球动作。此外,我们引入了一个独特的损失参数,用于线性加权各网络预测的利用程度。最后,我们使用不同的超参数进行了全面的性能研究,并在新的SoccerDB1数据集上给出了首个分类基准结果,验证准确率达67.20%。除了显著超越先前方法外,我们的模型还能轻松泛化到新数据集。数据集已公开: https://bit.ly/soccerdb1 ”
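    The paper's loss parameterization linearly weighs how much each network's predictions are used; the standard knowledge-distillation objective below (temperature-softened KL plus cross-entropy, after Hinton et al.) shows the general shape of such a loss. `T` and `alpha` are illustrative values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Distillation loss: softened KL to the teacher + hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 4), torch.randn(8, 4)  # 4 action classes
y = torch.randint(0, 4, (8,))
print(kd_loss(s, t, y))
```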

Tightly-Coupled LiDAR-Visual SLAM Based on Geometric Features for Mobile Agents

  • paper_url: http://arxiv.org/abs/2307.07763
  • repo_url: None
  • paper_authors: Ke Cao, Ruiping Liu, Ze Wang, Kunyu Peng, Jiaming Zhang, Junwei Zheng, Zhifeng Teng, Kailun Yang, Rainer Stiefelhagen
  • for: 为移动机器人在复杂未知环境中提供自主导航与任务执行能力。
  • methods: 提出一种基于几何特征的紧耦合LiDAR-视觉SLAM,包括两个子系统(LiDAR与单目视觉SLAM)和一个融合框架。融合框架将多模态几何特征的深度与语义信息相关联,以补充视觉线路标,并在光束法平差(BA)中加入方向优化。
  • results: 在公开数据集M2DGR上的评估表明,与当前最先进的多模态方法相比,我们的系统实现了更高精度、更加鲁棒的位姿估计。
    Abstract The mobile robot relies on SLAM (Simultaneous Localization and Mapping) to provide autonomous navigation and task execution in complex and unknown environments. However, it is hard to develop a dedicated algorithm for mobile robots due to dynamic and challenging situations, such as poor lighting conditions and motion blur. To tackle this issue, we propose a tightly-coupled LiDAR-visual SLAM based on geometric features, which includes two sub-systems (LiDAR and monocular visual SLAM) and a fusion framework. The fusion framework associates the depth and semantics of the multi-modal geometric features to complement the visual line landmarks and to add direction optimization in Bundle Adjustment (BA). This further constrains visual odometry. On the other hand, the entire line segment detected by the visual subsystem overcomes the limitation of the LiDAR subsystem, which can only perform the local calculation for geometric features. It adjusts the direction of linear feature points and filters out outliers, leading to a higher accurate odometry system. Finally, we employ a module to detect the subsystem's operation, providing the LiDAR subsystem's output as a complementary trajectory to our system while visual subsystem tracking fails. The evaluation results on the public dataset M2DGR, gathered from ground robots across various indoor and outdoor scenarios, show that our system achieves more accurate and robust pose estimation compared to current state-of-the-art multi-modal methods.
    摘要 移动机器人依靠SLAM(同时定位与建图)在复杂未知环境中实现自主导航和任务执行。然而,由于光照不足、运动模糊等动态且具有挑战性的情形,为移动机器人开发专用算法十分困难。为解决该问题,我们提出一种基于几何特征的紧耦合LiDAR-视觉SLAM,包括两个子系统(LiDAR与单目视觉SLAM)和一个融合框架。融合框架将多模态几何特征的深度与语义信息相关联,以补充视觉线路标,并在光束法平差(BA)中加入方向优化,从而进一步约束视觉里程计。另一方面,视觉子系统检测到的完整线段弥补了LiDAR子系统只能对几何特征进行局部计算的局限:它能调整线特征点的方向并滤除离群点,带来精度更高的里程计系统。最后,我们引入一个模块来检测子系统的运行状态,当视觉子系统跟踪失败时,以LiDAR子系统的输出作为系统的补充轨迹。在公开数据集M2DGR(由地面机器人在多种室内外场景中采集)上的评估结果表明,与当前最先进的多模态方法相比,我们的系统实现了更高精度、更加鲁棒的位姿估计。

Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

  • paper_url: http://arxiv.org/abs/2307.07757
  • repo_url: https://github.com/ruipingl/opensu
  • paper_authors: Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: 帮助视觉障碍人士(PVI)更加独立地出行,通过增强场景理解并提供有用信息。
  • methods: 基于情境识别(Grounded Situation Recognition,GSR)技术,另外采用高效的Segment Anything Model(SAM)与坚实的纯Transformer骨干结构,并将GSR解码器中的所有激活函数替换为GELU以加速收敛、提升GSR性能。
  • results: 在SWiG数据集上取得最先进的表现,并通过在专用辅助技术数据集上的实地测试与应用演示,证明了OpenSU系统的可行性与实用性。
    Abstract Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU system using a solid pure transformer backbone to improve the performance of GSR. In order to accelerate the convergence, we replace all the activation functions within the GSR decoders with GELU, thereby reducing the training duration. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, through field testing on dedicated assistive technology datasets and application demonstrations, the proposed OpenSU system can be used to enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.
    摘要 “情境识别(Grounded Situation Recognition,GSR)能够以符合语境直觉的方式识别并解读视觉场景,给出图像中突出的活动(动词)及所涉及的实体(角色)。本工作关注GSR在帮助视觉障碍人士(PVI)方面的应用。然而,PVI人士往往需要被检测物体的精确定位信息,才能自信地在周围环境中移动并做出明智决策。为此,我们首次提出开放场景理解(OpenSU)系统,旨在生成所涉实体的像素级密集分割掩码,而非边界框。具体而言,我们在GSR的基础上构建OpenSU系统,另外采用高效的Segment Anything Model(SAM)。此外,为增强编码器-解码器结构之间的特征提取与交互,我们采用坚实的纯Transformer骨干来提升GSR性能。为加速收敛,我们将GSR解码器中的所有激活函数替换为GELU,从而缩短训练时间。在定量分析中,我们的模型在SWiG数据集上取得最先进的表现。此外,通过在专用辅助技术数据集上的实地测试与应用演示,所提出的OpenSU系统可用于增强场景理解,促进视觉障碍人士的独立出行。代码将在 https://github.com/RuipingL/OpenSU 公开。”
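    Swapping decoder activations for GELU, as OpenSU does to speed up convergence, can be done generically; the helper below replaces every ReLU in a module tree (OpenSU applies this only inside the GSR decoders).

```python
import torch.nn as nn

def swap_relu_for_gelu(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU in a model with nn.GELU in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.GELU())
        else:
            swap_relu_for_gelu(child)
    return module

# Usage: decoder = swap_relu_for_gelu(decoder) before training.
```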

Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation

  • paper_url: http://arxiv.org/abs/2308.07929
  • repo_url: None
  • paper_authors: Victor Gallego
  • for: 本研究旨在使大型多模态模型(如CLIP和Stable Diffusion)适配特定的人类偏好集合,使用户能够针对特定任务或偏好对这些模型进行个性化。
  • methods: 我们利用Bradley-Terry偏好模型开发了一种快速适配方法,仅需少量示例和极少计算资源即可高效微调原始模型。
  • results: 我们通过在多模态文本与图像理解的不同领域(包括作为奖励模型的偏好预测以及生成任务)的实验,验证了该框架的能力。
    Abstract Recently, large multimodal models, such as CLIP and Stable Diffusion have experimented tremendous successes in both foundations and applications. However, as these models increase in parameter size and computational requirements, it becomes more challenging for users to personalize them for specific tasks or preferences. In this work, we address the problem of adapting the previous models towards sets of particular human preferences, aligning the retrieved or generated images with the preferences of the user. We leverage the Bradley-Terry preference model to develop a fast adaptation method that efficiently fine-tunes the original model, with few examples and with minimal computing resources. Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding, including preference prediction as a reward model, and generation tasks.
    摘要 近年来,CLIP和Stable Diffusion等大型多模态模型在基础研究和应用中都取得了巨大成功。然而,随着模型参数规模和计算需求的增长,用户为特定任务或偏好对其进行个性化变得愈发困难。本文研究如何使上述模型适配特定的人类偏好集合,使检索或生成的图像与用户偏好保持一致。我们利用Bradley-Terry偏好模型开发了一种快速适配方法,仅需少量示例和极少计算资源即可高效微调原始模型。我们在多模态文本与图像理解相关的多个领域(包括作为奖励模型的偏好预测以及生成任务)中进行了大量实验,验证了该框架的能力。
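    The Bradley-Terry model at the core of the method has a simple training objective: the probability that item a is preferred over item b is sigmoid(r_a - r_b), which gives the pairwise loss below. That the reward head sits on top of frozen CLIP embeddings is an assumption for illustration, not a detail from the abstract.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the Bradley-Terry preference model:
    P(a preferred over b) = sigmoid(r_a - r_b)."""
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# r(.) could be a small head on frozen CLIP embeddings, for example.
r_pref = torch.randn(16, requires_grad=True)  # rewards of preferred images
r_rej = torch.randn(16, requires_grad=True)   # rewards of rejected images
loss = bradley_terry_loss(r_pref, r_rej)
loss.backward()
print(loss.item())
```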

Prawn Morphometrics and Weight Estimation from Images using Deep Learning for Landmark Localization

  • paper_url: http://arxiv.org/abs/2307.07732
  • repo_url: None
  • paper_authors: Alzayat Saleh, Md Mehedi Hasan, Herman W Raadsma, Mehar S Khatkar, Dean R Jerry, Mostafa Rahimi Azghadi
  • for: 本研究旨在开发一种可靠且高精度的数字图像分析技术,以便在水产养殖中快速准确地获取对虾的形态数据,包括体重估计与形态测量分析。
  • methods: 本研究使用一种新的深度学习(DL)方法,包括两个主要组成部分:一个特征提取模块,利用克罗内克积(Kronecker product)运算高效结合低层与高层特征;以及一个地标定位模块,利用这些特征预测对虾身体上关键形态点(地标)的坐标。
  • results: 实验结果显示,这种新的DL方法在准确率、鲁棒性和效率方面均优于先前的DL方法,并在8164张对虾图像数据集上完成了评估。我们还利用检测到的地标推导出五个重要的对虾性状,并使用PCA分析地标间距离,发现其与体长、体宽等形状特征高度相关。
    Abstract Accurate weight estimation and morphometric analyses are useful in aquaculture for optimizing feeding, predicting harvest yields, identifying desirable traits for selective breeding, grading processes, and monitoring the health status of production animals. However, the collection of phenotypic data through traditional manual approaches at industrial scales and in real-time is time-consuming, labour-intensive, and prone to errors. Digital imaging of individuals and subsequent training of prediction models using Deep Learning (DL) has the potential to rapidly and accurately acquire phenotypic data from aquaculture species. In this study, we applied a novel DL approach to automate weight estimation and morphometric analysis using the black tiger prawn (Penaeus monodon) as a model crustacean. The DL approach comprises two main components: a feature extraction module that efficiently combines low-level and high-level features using the Kronecker product operation; followed by a landmark localization module that then uses these features to predict the coordinates of key morphological points (landmarks) on the prawn body. Once these landmarks were extracted, weight was estimated using a weight regression module based on the extracted landmarks using a fully connected network. For morphometric analyses, we utilized the detected landmarks to derive five important prawn traits. Principal Component Analysis (PCA) was also used to identify landmark-derived distances, which were found to be highly correlated with shape features such as body length, and width. We evaluated our approach on a large dataset of 8164 images of the Black tiger prawn (Penaeus monodon) collected from Australian farms. Our experimental results demonstrate that the novel DL approach outperforms existing DL methods in terms of accuracy, robustness, and efficiency.
    摘要 准确的体重估计与形态测量分析在水产养殖中十分有用,可用于优化投喂、预测收获产量、确定选育所需的优良性状、进行分级,以及监测养殖动物的健康状况。然而,在工业规模下通过传统人工方式实时采集表型数据既耗时又费力,且容易出错。对个体进行数字成像,再利用深度学习(DL)训练预测模型,有望快速而准确地获取水产物种的表型数据。在本研究中,我们以斑节对虾(Penaeus monodon)为模式甲壳动物,应用一种新的DL方法实现体重估计与形态测量分析的自动化。该DL方法包括两个主要组成部分:一个特征提取模块,利用克罗内克积运算高效结合低层与高层特征;以及一个地标定位模块,利用这些特征预测对虾身体上关键形态点(地标)的坐标。提取地标后,基于地标的体重回归模块使用全连接网络估计体重。在形态测量分析方面,我们利用检测到的地标推导出五个重要的对虾性状。我们还使用主成分分析(PCA)确定由地标推导的距离,发现其与体长、体宽等形状特征高度相关。我们在从澳大利亚养殖场采集的8164张斑节对虾图像的大规模数据集上评估了该方法。实验结果表明,这种新的DL方法在准确率、鲁棒性和效率方面均优于现有DL方法。
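    The Kronecker-product fusion mentioned above combines low- and high-level features via all pairwise products; the vector version below keeps the sketch small (the paper's module operates on feature maps, so treat this as a simplified illustration).

```python
import torch

def kronecker_fuse(low_feat, high_feat):
    """Fuse a low-level and a high-level feature vector with torch.kron,
    forming all pairwise products as a joint representation."""
    return torch.stack([torch.kron(l, h) for l, h in zip(low_feat, high_feat)])

low = torch.randn(2, 8)    # batch of low-level descriptors
high = torch.randn(2, 16)  # batch of high-level descriptors
print(kronecker_fuse(low, high).shape)  # torch.Size([2, 128])
```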

Improving NeRF with Height Data for Utilization of GIS Data

  • paper_url: http://arxiv.org/abs/2307.07729
  • repo_url: None
  • paper_authors: Hinata Aoki, Takao Yamanaka
  • for: 利用可从GIS(地理信息系统)获取的高程数据,应用Neural Radiance Fields(NeRF)技术重建大规模场景。
  • methods: 利用高程数据将场景空间划分为多个物体和背景,并用独立的神经网络分别表示;同时提出一种基于高程数据的自适应采样方法。
  • results: 通过这种方法,在加快训练速度的同时提高了图像渲染的准确性。
    Abstract Neural Radiance Fields (NeRF) has been applied to various tasks related to representations of 3D scenes. Most studies based on NeRF have focused on a small object, while a few studies have tried to reconstruct large-scale scenes although these methods tend to require large computational cost. For the application of NeRF to large-scale scenes, a method based on NeRF is proposed in this paper to effectively use height data which can be obtained from GIS (Geographic Information System). For this purpose, the scene space was divided into multiple objects and a background using the height data to represent them with separate neural networks. In addition, an adaptive sampling method is also proposed by using the height data. As a result, the accuracy of image rendering was improved with faster training speed.
    摘要 NeRF(神经辐射场)已被应用于多种3D场景表示任务。基于NeRF的研究大多集中于小物体,只有少数研究尝试重建大规模场景,且这些方法往往需要高昂的计算成本。为了将NeRF应用于大规模场景,本文提出一种基于NeRF的方法,有效利用可从GIS(地理信息系统)获取的高程数据。为此,利用高程数据将场景空间划分为多个物体和背景,并用独立的神经网络分别表示。此外,还提出一种利用高程数据的自适应采样方法。结果表明,图像渲染精度得到提升,同时训练速度也得以加快。

Improving Translation Invariance in Convolutional Neural Networks with Peripheral Prediction Padding

  • paper_url: http://arxiv.org/abs/2307.07725
  • repo_url: None
  • paper_authors: Kensuke Mukai, Takao Yamanaka
  • for: 这篇论文提出了一种新的填充(padding)方法,使填充值可以在卷积神经网络中进行端到端训练。
  • methods: 该方法使用一种称为Peripheral Prediction Padding(PP-Pad)的新填充方法,可以针对每个任务分别训练合适的填充值,以取代零填充。
  • results: 经过测试,该方法在 semantic segmentation 任务中实现了更高的准确率和翻译不变性,比既前一些方法更好。I hope this helps! Let me know if you have any other questions.
    Abstract Zero padding is often used in convolutional neural networks to prevent the feature map size from decreasing with each layer. However, recent studies have shown that zero padding promotes encoding of absolute positional information, which may adversely affect the performance of some tasks. In this work, a novel padding method called Peripheral Prediction Padding (PP-Pad) method is proposed, which enables end-to-end training of padding values suitable for each task instead of zero padding. Moreover, novel metrics to quantitatively evaluate the translation invariance of the model are presented. By evaluating with these metrics, it was confirmed that the proposed method achieved higher accuracy and translation invariance than the previous methods in a semantic segmentation task.
    摘要 卷积神经网络中常用零填充来防止特征图尺寸随层数递减。然而,最近的研究表明,零填充会促使网络编码绝对位置信息,这可能对某些任务的性能产生不利影响。在这项工作中,我们提出一种称为Peripheral Prediction Padding(PP-Pad)的新填充方法,能够针对每个任务端到端地训练合适的填充值,以取代零填充。此外,我们还提出了用于定量评估模型平移不变性的新指标。通过这些指标的评估,我们确认所提方法在语义分割任务中取得了比以往方法更高的准确率和平移不变性。
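    A minimal sketch of the PP-Pad idea: predict padding values from the adjacent border pixels with a small trainable layer, so padding is learned end-to-end per task. The 1x1-conv predictor and single-pixel padding here are simplifying assumptions about the actual module.

```python
import torch
import torch.nn as nn

class PeripheralPad(nn.Module):
    """Pad by one pixel on each side with values predicted from the
    nearest border row/column, instead of zeros."""
    def __init__(self, channels):
        super().__init__()
        self.predict = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        top = self.predict(x[:, :, :1, :])     # predict from top border row
        bottom = self.predict(x[:, :, -1:, :])
        x = torch.cat([top, x, bottom], dim=2)
        left = self.predict(x[:, :, :, :1])
        right = self.predict(x[:, :, :, -1:])
        return torch.cat([left, x, right], dim=3)

x = torch.randn(1, 16, 32, 32)
print(PeripheralPad(16)(x).shape)  # torch.Size([1, 16, 34, 34])
```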

Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group Convolution

  • paper_url: http://arxiv.org/abs/2307.07720
  • repo_url: None
  • paper_authors: Guandong Li, Mengxia Ye
  • for: This paper proposes a learnable group convolution network (LGCNet) to improve the performance of deep neural networks in hyperspectral image classification.
  • methods: The LGCNet module uses a dynamic learning method for the input channels and convolution kernel grouping, which allows for flexible grouping structures and improved representation ability.
  • results: The LGCNet achieves better inference speed and accuracy than mainstream hyperspectral image classification methods on three datasets (Indian Pines, Pavia University, and KSC).
    Abstract Deep neural networks have faced many problems in hyperspectral image classification, including the ineffective utilization of spectral-spatial joint information and the problems of gradient vanishing and overfitting that arise with increasing depth. In order to accelerate the deployment of models on edge devices with strict latency requirements and limited computing power, this paper proposes a learnable group convolution network (LGCNet) based on an improved 3D-DenseNet model and a lightweight model design. The LGCNet module improves the shortcomings of group convolution by introducing a dynamic learning method for the input channels and convolution kernel grouping, enabling flexible grouping structures and generating better representation ability. Through the overall loss and gradient of the backpropagation network, the 3D group convolution is dynamically determined and updated in an end-to-end manner. The learnable number of channels and corresponding grouping can capture different complementary visual features of input images, allowing the CNN to learn richer feature representations. When extracting high-dimensional and redundant hyperspectral data, the 3D convolution kernels also contain a large amount of redundant information. The LGC module allows the 3D-DenseNet to choose channel information with more semantic features, and is very efficient, making it suitable for embedding in any deep neural network for acceleration and efficiency improvements. LGC enables the 3D-CNN to achieve sufficient feature extraction while also meeting speed and computing requirements. Furthermore, LGCNet has achieved progress in inference speed and accuracy, and outperforms mainstream hyperspectral image classification methods on the Indian Pines, Pavia University, and KSC datasets.
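
As a rough illustration of what "learnable grouping" can mean in code (the soft-gating formulation below is an assumption, not the paper's exact mechanism): each input channel carries logits over the groups, and a softmax relaxation decides how strongly each group's kernels see it, so the grouping is driven by the overall loss.

```python
# Hedged sketch of a learnable group convolution: the channel-to-group
# assignment is a trainable parameter instead of a fixed partition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGroupConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, groups: int, kernel_size: int = 3):
        super().__init__()
        assert out_ch % groups == 0
        self.assign_logits = nn.Parameter(torch.zeros(groups, in_ch))
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch // groups, kernel_size, padding=kernel_size // 2)
            for _ in range(groups)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each channel distributes its mass over groups; sharpening the
        # logits during training pushes this toward a discrete grouping.
        gates = F.softmax(self.assign_logits, dim=0)  # (G, C)
        outs = [conv(x * gates[g].view(1, -1, 1, 1))
                for g, conv in enumerate(self.convs)]
        return torch.cat(outs, dim=1)
```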

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.07710
  • repo_url: https://github.com/wyf0912/ExposureDiffusion
  • paper_authors: Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen
  • for: Enhancing low-light images with improved brightness and fidelity.
  • methods: Integrates a diffusion model with a physics-based exposure model, so that restoration can start directly from the noisy image.
  • results: Achieves improved enhancement quality, with better consistency and faster inference than vanilla diffusion models.
    Abstract Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feed-forward neural model when only a small number of parameters is adopted.
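
A minimal sketch of the central trick, under the assumption that the noisy low-light capture can be treated as an intermediate diffusion state: the reverse process starts at step start_t from the image itself instead of from pure Gaussian noise, which is what shortens inference. The deterministic DDIM-style update is an illustrative choice, not the paper's exact sampler.

```python
# Hedged sketch: run the reverse diffusion from a noisy image rather than
# from pure noise, assuming its noise level roughly matches step start_t.
import torch

@torch.no_grad()
def restore_from_noisy(model, noisy_img, alphas_cumprod, start_t):
    """model(x, t) predicts the noise residual; alphas_cumprod is 1-D."""
    x = noisy_img
    for t in range(start_t, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = model(x, torch.tensor([t]))
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # step to t-1 (DDIM, eta=0)
    return x
```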

PSGformer: Enhancing 3D Point Cloud Instance Segmentation via Precise Semantic Guidance

  • paper_url: http://arxiv.org/abs/2307.07708
  • repo_url: None
  • paper_authors: Lei Pan, Wuyang Luan, Yuan Zheng, Qiang Fu, Junhui Li
  • for: Improving 3D instance segmentation performance by addressing the limitations inherited from 3D semantic segmentation models.
  • methods: Proposes PSGformer, a novel 3D instance segmentation network with two key advances: a Multi-Level Semantic Aggregation Module that captures scene features through foreground point filtering and multi-radius aggregation, and a Parallel Feature Fusion Transformer Module that processes super-point features and aggregated features independently for a more comprehensive representation.
  • results: Extensive experiments on the ScanNetv2 dataset show that PSGformer outperforms compared state-of-the-art methods by 2.2% mAP on the ScanNetv2 hidden test set.
    Abstract Most existing 3D instance segmentation methods are derived from 3D semantic segmentation models. However, these indirect approaches suffer from certain limitations. They fail to fully leverage global and local semantic information for accurate prediction, which hampers the overall performance of the 3D instance segmentation framework. To address these issues, this paper presents PSGformer, a novel 3D instance segmentation network. PSGformer incorporates two key advancements to enhance the performance of 3D instance segmentation. Firstly, we propose a Multi-Level Semantic Aggregation Module, which effectively captures scene features by employing foreground point filtering and multi-radius aggregation. This module enables the acquisition of more detailed semantic information from global and local perspectives. Secondly, PSGformer introduces a Parallel Feature Fusion Transformer Module that independently processes super-point features and aggregated features using transformers. The model achieves a more comprehensive feature representation by connecting global and local features. We conducted extensive experiments on the ScanNetv2 dataset. Notably, PSGformer exceeds compared state-of-the-art methods by 2.2% mAP on the ScanNetv2 hidden test set. Our code and models will be publicly released.
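
A toy sketch of multi-radius aggregation as the abstract describes it for the Multi-Level Semantic Aggregation Module; the brute-force ball query and mean pooling are illustrative simplifications, not the paper's implementation.

```python
# Hedged sketch: for each (foreground-filtered) point, pool neighbor
# features within several radii and concatenate, capturing local and
# progressively more global context.
import torch

def multi_radius_aggregate(xyz, feats, radii=(0.1, 0.4, 1.6)):
    # xyz: (N, 3) point coordinates; feats: (N, C) per-point features
    dist = torch.cdist(xyz, xyz)                       # (N, N) pairwise distances
    pooled = []
    for r in radii:
        mask = (dist <= r).float()                     # neighbors within radius r
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1)
        pooled.append(mask @ feats / denom)            # mean over each ball
    return torch.cat(pooled, dim=1)                    # (N, C * len(radii))
```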

Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction and Modeling from 2D Sparse Cardiac Magnetic Resonance Imaging

  • paper_url: http://arxiv.org/abs/2307.07693
  • repo_url: None
  • paper_authors: Meng Ye, Dong Yang, Mikael Kanski, Leon Axel, Dimitris Metaxas
  • for: Reconstructing and modeling the 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data.
  • methods: Models the bi-ventricular shape with blended deformable superquadrics that deform both globally and locally.
  • results: Outperforms conventional methods, automatically generates high-quality triangular meshes, and learns dense correspondences for accurate cardiac shape registration.
    Abstract We propose a novel neural deformable model (NDM) targeting the reconstruction and modeling of 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data. We model the bi-ventricular shape using blended deformable superquadrics, which are parameterized by a set of geometric parameter functions and are capable of deforming globally and locally. While global geometric parameter functions and deformations capture gross shape features from visual data, local deformations, parameterized as neural diffeomorphic point flows, can be learned to recover the detailed heart shape. Different from iterative optimization methods used in conventional deformable model formulations, NDMs can be trained to learn such geometric parameter functions, global and local deformations from a shape distribution manifold. Our NDM can learn to densify a sparse cardiac point cloud with arbitrary scales and generate high-quality triangular meshes automatically. It also enables the implicit learning of dense correspondences among different heart shape instances for accurate cardiac shape registration. Furthermore, the parameters of NDM are intuitive, and can be used by a physician without sophisticated post-processing. Experimental results on a large CMR dataset demonstrate the improved performance of NDM over conventional methods.
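
For reference, the classical superquadric surface that such deformable models build on can be generated as below; the scale parameters a and shape exponents e are the kind of geometric parameters the network predicts, while the global and local deformations (not shown) warp these points.

```python
# Sketch of the standard superquadric surface parameterization.
import numpy as np

def superquadric_surface(a=(1.0, 1.0, 1.0), e=(0.5, 0.5), n=64):
    def spow(base, p):  # signed power, keeping the sign of cos/sin terms
        return np.sign(base) * np.abs(base) ** p
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)[:, None]  # latitude
    omega = np.linspace(-np.pi, np.pi, n)[None, :]        # longitude
    x = a[0] * spow(np.cos(eta), e[0]) * spow(np.cos(omega), e[1])
    y = a[1] * spow(np.cos(eta), e[0]) * spow(np.sin(omega), e[1])
    z = a[2] * spow(np.sin(eta), e[0]) * np.ones_like(omega)
    return np.stack([x, y, z], axis=-1)  # (n, n, 3) surface points
```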

DRM-IR: Task-Adaptive Deep Unfolding Network for All-In-One Image Restoration

  • paper_url: http://arxiv.org/abs/2307.07688
  • repo_url: https://github.com/YuanshuoCheng/DRM-IR
  • paper_authors: Yuanshuo Cheng, Mingwen Shao, Yecong Wan, Chao Wang, Wangmeng Zuo
  • for: Proposing an efficient all-in-one image restoration method (DRM-IR) to improve restoration performance.
  • methods: Comprises two subtasks, task-adaptive degradation modeling and model-based image restoration, formulated as a pair of entangled reference-based maximum a posteriori (MAP) inferences.
  • results: Extensive experiments on multiple benchmark datasets show that DRM-IR achieves state-of-the-art all-in-one image restoration performance.
    Abstract Existing All-In-One image restoration (IR) methods usually lack flexible modeling on various types of degradation, thus impeding the restoration performance. To achieve All-In-One IR with higher task dexterity, this work proposes an efficient Dynamic Reference Modeling paradigm (DRM-IR), which consists of task-adaptive degradation modeling and model-based image restoring. Specifically, these two subtasks are formalized as a pair of entangled reference-based maximum a posteriori (MAP) inferences, which are optimized synchronously in an unfolding-based manner. With the two cascaded subtasks, DRM-IR first dynamically models the task-specific degradation based on a reference image pair and further restores the image with the collected degradation statistics. Besides, to bridge the semantic gap between the reference and target degraded images, we further devise a Degradation Prior Transmitter (DPT) that restrains the instance-specific feature differences. DRM-IR explicitly provides superior flexibility for All-in-One IR while being interpretable. Extensive experiments on multiple benchmark datasets show that our DRM-IR achieves state-of-the-art in All-In-One IR.

Semantic Contrastive Bootstrapping for Single-positive Multi-label Recognition

  • paper_url: http://arxiv.org/abs/2307.07680
  • repo_url: https://github.com/iCVTEAM/Scob
  • paper_authors: Cheng Chen, Yifan Zhao, Jia Li
  • for: Learning multi-label image recognition from partially annotated data, improving recognition performance while reducing annotation effort.
  • methods: Semantic contrastive bootstrapping (Scob), which gradually recovers cross-object relationships by introducing class activation as semantic guidance, together with a recurrent semantic masked transformer that extracts iconic object-level representations for contrastive learning on multi-label classification tasks.
  • results: Extensive experiments show the proposed joint learning framework surpasses existing models by a large margin on four public multi-label image recognition benchmarks, while mitigating interference from erroneous semantic guidance.
    Abstract Learning multi-label image recognition with incomplete annotation is gaining popularity due to its superior performance and significant labor savings when compared to training with fully labeled datasets. Existing literature mainly focuses on label completion and co-occurrence learning while struggling with the most common setting, where only a single positive label is provided per image. To tackle this problem, we present a semantic contrastive bootstrapping (Scob) approach to gradually recover the cross-object relationships by introducing class activation as semantic guidance. With this learning guidance, we then propose a recurrent semantic masked transformer to extract iconic object-level representations and delve into the contrastive learning problems on multi-label classification tasks. We further propose a bootstrapping framework in an Expectation-Maximization fashion that iteratively optimizes the network parameters and refines semantic guidance to alleviate possible disturbance caused by wrong semantic guidance. Extensive experimental results demonstrate that the proposed joint learning framework surpasses the state-of-the-art models by a large margin on four public multi-label image recognition benchmarks. Codes can be found at https://github.com/iCVTEAM/Scob.

Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting

  • paper_url: http://arxiv.org/abs/2307.07678
  • repo_url: None
  • paper_authors: Ze Lu, Yalei Lv, Wenqi Wang, Pengfei Xiong
  • for: Image inpainting with deep generative approaches.
  • methods: Proposed Frequency-Spatial Complementary Network (FSCN) with extra Frequency Branch and Frequency Loss, and Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features.
  • results: Superior inpainting results with fewer parameters and less computation cost, outperforming previous state-of-the-art approaches.
    Abstract Deep generative approaches have obtained great success in image inpainting recently. However, most generative inpainting networks suffer from either over-smooth results or aliasing artifacts. The former lacks high-frequency details, while the latter lacks semantic structure. To address this issue, we propose an effective Frequency-Spatial Complementary Network (FSCN) by exploiting rich semantic information in both spatial and frequency domains. Specifically, we introduce an extra Frequency Branch and Frequency Loss on the spatial-based network to impose direct supervision on the frequency information, and propose a Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features and combine the corresponding characteristics. With our FSCAB, the inpainting network is capable of capturing frequency information and preserving visual consistency simultaneously. Extensive quantitative and qualitative experiments demonstrate that our inpainting network can effectively achieve superior results, outperforming previous state-of-the-art approaches with significantly fewer parameters and less computation cost. The code will be released soon.
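
A minimal sketch of the kind of supervision a Frequency Loss imposes (the L1-on-spectrum form and the 0.1 weight are assumptions, not the paper's exact choices): comparing FFT spectra penalizes missing high-frequency detail directly, complementing the spatial loss.

```python
# Hedged sketch of a frequency-domain reconstruction loss.
import torch

def frequency_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (B, C, H, W) images
    pred_f = torch.fft.rfft2(pred, norm="ortho")
    target_f = torch.fft.rfft2(target, norm="ortho")
    return (pred_f - target_f).abs().mean()  # L1 over the complex spectrum

# Assumed combined objective:
# loss = spatial_l1(pred, target) + 0.1 * frequency_loss(pred, target)
```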

Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting

  • paper_url: http://arxiv.org/abs/2307.07677
  • repo_url: None
  • paper_authors: Jingyi Xu, Hieu Le, Dimitris Samaras
  • for: Addressing the challenge that existing counting models face when counting objects of interest in multi-class images, given only a few exemplars of the target class.
  • methods: Localizes the regions containing the objects of interest with an exemplar-based segmentation model before counting; the segmentation model is trained on pseudo segmentation masks obtained from box exemplars and dot annotations only.
  • results: On two new multi-class benchmarks, a synthetic dataset and a real-image test set, the proposed method significantly outperforms previous class-agnostic counting approaches.
    Abstract Class-agnostic counting (CAC) has numerous potential applications across various domains. The goal is to count objects of an arbitrary category during testing, based on only a few annotated exemplars. In this paper, we point out that the task of counting objects of interest when there are multiple object classes in the image (namely, multi-class object counting) is particularly challenging for current object counting models. They often greedily count every object regardless of the exemplars. To address this issue, we propose localizing the area containing the objects of interest via an exemplar-based segmentation model before counting them. The key challenge here is the lack of segmentation supervision to train this model. To this end, we propose a method to obtain pseudo segmentation masks using only box exemplars and dot annotations. We show that the segmentation model trained on these pseudo-labeled masks can effectively localize objects of interest for an arbitrary multi-class image based on the exemplars. To evaluate the performance of different methods on multi-class counting, we introduce two new benchmarks, a synthetic multi-class dataset and a new test set of real images in which objects from multiple classes are present. Our proposed method shows a significant advantage over the previous CAC methods on these two benchmarks.

INVE: Interactive Neural Video Editing

  • paper_url: http://arxiv.org/abs/2307.07663
  • repo_url: None
  • paper_authors: Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee
  • for: Providing a real-time video editing solution that assists the editing process by consistently propagating sparse frame edits to the entire video clip.
  • methods: Builds on Layered Neural Atlas (LNA), which is too slow for interactive editing and does not support use cases such as direct frame editing and rigid texture tracking; adopts an efficient network architecture with hash-grid encoding for speed, learns bi-directional functions between image and atlas, and introduces vectorized editing, which together let INVE support a much wider range of edits.
  • results: Compared with LNA, INVE reduces learning and inference time by a factor of 5 and supports video editing operations that LNA cannot; a comprehensive quantitative and qualitative analysis demonstrates its superiority and improved performance. For video results, see https://gabriel-huang.github.io/inve/
    Abstract We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges we leverage and adopt highly efficient network architectures, powered by hash-grids encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image-atlas and introduce vectorized editing, which collectively enables a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5, and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see https://gabriel-huang.github.io/inve/

RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World

  • paper_url: http://arxiv.org/abs/2307.07653
  • repo_url: https://github.com/winterwindwang/rfla
  • paper_authors: Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, Xiaoqian Chen
  • for: Physical-world adversarial attacks against deep neural networks (DNNs).
  • methods: A novel reflected light attack (RFLA), implemented by placing a colored transparent plastic sheet and a paper cutout of a specific shape in front of a mirror to create differently colored geometries on the target object.
  • results: Experiments show success rates above 99% across different datasets and models, with effectiveness verified in different physical environments.
    Abstract Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physically deployable attack is the optical attack, which is stealthy but performs weakly in daytime sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA), which is effective and stealthy in both the digital and physical world, and is implemented by placing the color transparent plastic sheet and a paper cut of a specific shape in front of the mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometry shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves over 99% success rate on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight.
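
A toy digital-world version of the attack's parameterization (alpha compositing is an assumed stand-in for the physical reflection model): a circle given by center and radius places a semi-transparent colored geometry on the image, and parameters like these are what the attack optimizes.

```python
# Hedged sketch of placing an optimizable "reflected light" disk on an image.
import numpy as np

def apply_reflected_light(img, cx, cy, r, color=(1.0, 0.2, 0.2), alpha=0.5):
    # img: (H, W, 3) float array in [0, 1]
    h, w, _ = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = ((xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2)[..., None]  # disk region
    overlay = np.asarray(color, dtype=img.dtype)
    return np.where(mask, (1 - alpha) * img + alpha * overlay, img)
```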

ACF-Net: An Attention-enhanced Co-interactive Fusion Network for Automated Structural Condition Assessment in Visual Inspection

  • paper_url: http://arxiv.org/abs/2307.07643
  • repo_url: None
  • paper_authors: Chenyu Zhang, Zhaozheng Yin, Ruwen Qin
  • for: Proposing a method to automate structural condition assessment in visual bridge inspection, improving the efficiency of monitoring civil infrastructure.
  • methods: An Attention-enhanced Co-interactive Fusion Network (ACF-Net) that simultaneously parses structural elements and segments surface defects, using two task-specific relearning subnets to extract task-specific features.
  • results: Achieves 92.11% mIoU for element parsing and 87.16% mIoU for corrosion segmentation on the new Steel Bridge Condition Inspection Visual (SBCIV) test set, surpassing state-of-the-art methods.
    Abstract Efficiently monitoring the condition of civil infrastructures necessitates automating the structural condition assessment in visual inspection. This paper proposes an Attention-enhanced Co-interactive Fusion Network (ACF-Net) for automatic structural condition assessment in visual bridge inspection. The ACF-Net can simultaneously parse structural elements and segment surface defects on the elements in inspection images. It integrates two task-specific relearning subnets to extract task-specific features from an overall feature embedding and a co-interactive feature fusion module to capture the spatial correlation and facilitate information sharing between tasks. Experimental results demonstrate that the proposed ACF-Net outperforms the current state-of-the-art approaches, achieving promising performance with 92.11% mIoU for element parsing and 87.16% mIoU for corrosion segmentation on the new benchmark dataset Steel Bridge Condition Inspection Visual (SBCIV) testing set. An ablation study reveals the strengths of ACF-Net, and a case study showcases its capability to automate structural condition assessment. The code will be open-source after acceptance.

CoTracker: It is Better to Track Together

  • paper_url: http://arxiv.org/abs/2307.07635
  • repo_url: https://github.com/facebookresearch/co-tracker
  • paper_authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
  • for: Proposing a method that jointly tracks multiple points across video frames to improve video motion prediction.
  • methods: A transformer network with specialised attention layers that iteratively updates the estimates of multiple trajectories.
  • results: Outperforms state-of-the-art methods on almost all benchmarks; can track from one to several points jointly and supports adding new points to track at any time.
    Abstract Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.

Generalizable Embeddings with Cross-batch Metric Learning

  • paper_url: http://arxiv.org/abs/2307.07620
  • repo_url: https://github.com/yetigurbuz/xml-dml
  • paper_authors: Yeti Z. Gurbuz, A. Aydin Alatan
  • for: Examining the global average pooling (GAP) component of deep metric learning (DML) and its effect on learning generalizable representations.
  • methods: Views GAP as combining feature vectors treated as distinct semantic entities and formulates it as a convex combination of learnable prototypes; prototype learning becomes a recursive process fitting a linear predictor to a batch of samples, regularized by expressing the samples of one batch with prototypes fitted to another batch of disjoint classes.
  • results: Validated on 4 popular DML benchmarks, improving the generalization ability of DML models.
    Abstract Global average pooling (GAP) is a popular component in deep metric learning (DML) for aggregating features. Its effectiveness is often attributed to treating each feature vector as a distinct semantic entity and GAP as a combination of them. Albeit substantiated, such an explanation's algorithmic implications to learn generalizable entities to represent unseen classes, a crucial DML goal, remain unclear. To address this, we formulate GAP as a convex combination of learnable prototypes. We then show that the prototype learning can be expressed as a recursive process fitting a linear predictor to a batch of samples. Building on that perspective, we consider two batches of disjoint classes at each iteration and regularize the learning by expressing the samples of a batch with the prototypes that are fitted to the other batch. We validate our approach on 4 popular DML benchmarks.
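
A small sketch of the reading of GAP that motivates the method: average pooling is a uniform convex combination of spatial feature vectors, and replacing the uniform weights with attention over learnable prototypes makes the combination trainable. The pooling below is illustrative, not the paper's exact formulation.

```python
# Hedged sketch: GAP generalized to a prototype-weighted convex combination.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypePooling(nn.Module):
    def __init__(self, dim: int, num_prototypes: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, HW, D) flattened spatial features; GAP is feats.mean(1)
        attn = F.softmax(feats @ self.prototypes.t(), dim=1)   # (B, HW, K)
        pooled = torch.einsum("bnk,bnd->bkd", attn, feats)     # per-prototype mix
        return pooled.mean(dim=1)                              # (B, D) embedding
```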

Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.07603
  • repo_url: https://github.com/dibya404/gastrointestinal-disease-classification-through-explainable-and-cost-sensitive-dnn-with-scl
  • paper_authors: Dibya Nath, G. M. Shahariar
  • for: Developing a gastrointestinal disease classification method based on deep convolutional neural networks with explainability.
  • methods: Uses cost-sensitive pre-trained deep CNN architectures with supervised contrastive learning to learn disease-related features, and integrates gradient-based explainability techniques to improve interpretability.
  • results: Extensive experiments and comparisons show high-accuracy gastrointestinal disease classification with robustness and interpretability.
    Abstract Gastrointestinal diseases pose significant healthcare challenges as they manifest in diverse ways and can lead to potential complications. Ensuring precise and timely classification of these diseases is pivotal in guiding treatment choices and enhancing patient outcomes. This paper introduces a novel approach on classifying gastrointestinal diseases by leveraging cost-sensitive pre-trained deep convolutional neural network (CNN) architectures with supervised contrastive learning. Our approach enables the network to learn representations that capture vital disease-related features, while also considering the relationships of similarity between samples. To tackle the challenges posed by imbalanced datasets and the cost-sensitive nature of misclassification errors in healthcare, we incorporate cost-sensitive learning. By assigning distinct costs to misclassifications based on the disease class, we prioritize accurate classification of critical conditions. Furthermore, we enhance the interpretability of our model by integrating gradient-based techniques from explainable artificial intelligence (AI). This inclusion provides valuable insights into the decision-making process of the network, aiding in understanding the features that contribute to disease classification. To assess the effectiveness of our proposed approach, we perform extensive experiments on a comprehensive gastrointestinal disease dataset, such as the Hyper-Kvasir dataset. Through thorough comparisons with existing works, we demonstrate the strong classification accuracy, robustness and interpretability of our model. We have made the implementation of our proposed approach publicly available at https://github.com/dibya404/Gastrointestinal-Disease-Classification-through-Explainable-and-Cost-Sensitive-DNN-with-SCL
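
In its simplest form, cost-sensitive learning of the kind described reduces to per-class weights on the cross-entropy, so errors on critical disease classes are penalized more. The sketch below shows that pattern; the cost values themselves are illustrative assumptions.

```python
# Hedged sketch of cost-sensitive classification via weighted cross-entropy.
import torch
import torch.nn.functional as F

# Higher cost => the model is pushed harder to classify that class correctly.
class_costs = torch.tensor([1.0, 2.0, 5.0, 5.0])  # e.g., benign ... critical

def cost_sensitive_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (B, num_classes), labels: (B,)
    return F.cross_entropy(logits, labels, weight=class_costs)
```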

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2307.07511
  • repo_url: None
  • paper_authors: Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas
  • for: Generating realistic 3D human motions interacting with objects in a scene.
  • methods: Neural Interaction Fields for Trajectory sYnthesis (NIFTY): a neural interaction field attached to a specific object guides an object-conditioned human motion diffusion model toward plausible contacts and affordance semantics.
  • results: Using an automated synthetic data pipeline and the guided diffusion model, synthesizes realistic sitting and lifting motions with several objects, with higher motion quality and action completion rates than other methods.
    Abstract We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.

Brain Tumor Detection using Convolutional Neural Networks with Skip Connections

  • paper_url: http://arxiv.org/abs/2307.07503
  • repo_url: None
  • paper_authors: Aupam Hamran, Marzieh Vaeztourshizi, Amirhossein Esmaili, Massoud Pedram
  • for: Classifying brain tumors as benign or malignant with CNNs.
  • methods: Uses MRI data with different CNN architectures, applying optimization techniques such as widening and deepening the network and adding skip connections to improve accuracy.
  • results: Results show that a subset of these optimization techniques can judiciously be used to outperform the baseline CNN model.
    Abstract In this paper, we present different architectures of Convolutional Neural Networks (CNN) to analyze and classify the brain tumors into benign and malignant types using the Magnetic Resonance Imaging (MRI) technique. Different CNN architecture optimization techniques such as widening and deepening of the network and adding skip connections are applied to improve the accuracy of the network. Results show that a subset of these techniques can judiciously be used to outperform a baseline CNN model used for the same purpose.
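
The skip connections evaluated here follow the standard residual pattern sketched below (channel counts and layer layout are illustrative): the block's input is added back to its convolutional output, easing optimization as networks are deepened.

```python
# Hedged sketch of a residual block with a skip connection.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # skip connection
```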

TALL: Thumbnail Layout for Deepfake Video Detection

  • paper_url: http://arxiv.org/abs/2307.07494
  • repo_url: None
  • paper_authors: Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He
  • for: Deepfake video detection.
  • methods: A simple yet effective strategy, Thumbnail Layout (TALL), which rearranges video frames into a predefined layout that preserves spatial and temporal dependencies: consecutive frames are masked at a fixed position, resized into sub-images, and tiled into a thumbnail. TALL is model-agnostic and requires changing only a few lines of code.
  • results: TALL and the state-of-the-art TALL-Swin perform strongly on intra-dataset and cross-dataset benchmarks, with TALL-Swin reaching 90.79% AUC on the challenging cross-dataset task FaceForensics++ → Celeb-DF.
    Abstract The growing threats of deepfakes to society and cybersecurity have raised enormous public concerns, and increasing efforts have been devoted to this critical topic of deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. Specifically, consecutive frames are masked in a fixed position in each frame to improve generalization, then resized to sub-images and rearranged into a pre-defined layout as the thumbnail. TALL is model-agnostic and extremely simple by only modifying a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method TALL-Swin. Extensive experiments on intra-dataset and cross-dataset validate the validity and superiority of TALL and SOTA TALL-Swin. TALL-Swin achieves 90.79$\%$ AUC on the challenging cross-dataset task, FaceForensics++ $\to$ Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
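
A minimal sketch of the layout transform as the abstract describes it: mask a fixed position in consecutive frames, resize them to sub-images, and tile them into one thumbnail that a 2D backbone such as Swin can consume. The grid size and mask placement are illustrative assumptions.

```python
# Hedged sketch of the Thumbnail Layout (TALL) transform.
import torch
import torch.nn.functional as F

def thumbnail_layout(frames: torch.Tensor, grid=(2, 2), mask_frac=0.25):
    # frames: (T, C, H, W) with T == grid[0] * grid[1]
    t, c, h, w = frames.shape
    mh, mw = int(h * mask_frac), int(w * mask_frac)
    frames = frames.clone()
    frames[:, :, :mh, :mw] = 0.0                       # fixed-position mask
    sub = F.interpolate(frames, size=(h // grid[0], w // grid[1]),
                        mode="bilinear", align_corners=False)
    rows = [torch.cat(list(sub[r * grid[1]:(r + 1) * grid[1]]), dim=-1)
            for r in range(grid[0])]
    return torch.cat(rows, dim=-2)                     # (C, H, W) thumbnail
```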

PseudoCal: A Source-Free Approach to Unsupervised Uncertainty Calibration in Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.07489
  • repo_url: None
  • paper_authors: Dapeng Hu, Jian Liang, Xinchao Wang, Chuan-Sheng Foo
  • for: Improving the calibration of predictive uncertainty in the target domain.
  • methods: PseudoCal, a source-free calibration method that relies exclusively on unlabeled target data.
  • results: Shows significantly lower calibration error than existing calibration methods.
    Abstract Unsupervised domain adaptation (UDA) has witnessed remarkable advancements in improving the accuracy of models for unlabeled target domains. However, the calibration of predictive uncertainty in the target domain, a crucial aspect of the safe deployment of UDA models, has received limited attention. The conventional in-domain calibration method, temperature scaling (TempScal), encounters challenges due to domain distribution shifts and the absence of labeled target domain data. Recent approaches have employed importance-weighting techniques to estimate the target-optimal temperature based on re-weighted labeled source data. Nonetheless, these methods require source data and suffer from unreliable density estimates under severe domain shifts, rendering them unsuitable for source-free UDA settings. To overcome these limitations, we propose PseudoCal, a source-free calibration method that exclusively relies on unlabeled target data. Unlike previous approaches that treat UDA calibration as a covariate shift problem, we consider it as an unsupervised calibration problem specific to the target domain. Motivated by the factorization of the negative log-likelihood (NLL) objective in TempScal, we generate a labeled pseudo-target set that captures the structure of the real target. By doing so, we transform the unsupervised calibration problem into a supervised one, enabling us to effectively address it using widely-used in-domain methods like TempScal. Finally, we thoroughly evaluate the calibration performance of PseudoCal by conducting extensive experiments on 10 UDA methods, considering both traditional UDA settings and recent source-free UDA scenarios. The experimental results consistently demonstrate the superior performance of PseudoCal, exhibiting significantly reduced calibration error compared to existing calibration methods.
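
For context, the in-domain baseline PseudoCal builds on, temperature scaling, fits a single scalar T by minimizing the NLL on a labeled set; PseudoCal's twist is to supply that set from pseudo-labeled target data. A standard sketch:

```python
# Temperature scaling: fit one scalar T on held-out (pseudo-)labeled data,
# then calibrate with softmax(logits / T). Optimizer settings are assumed.
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, steps: int = 200):
    # logits: (N, K) from a frozen model; labels: (N,) hard (pseudo-)labels
    log_t = torch.zeros(1, requires_grad=True)  # log-parameterized to keep T > 0
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)  # NLL objective
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```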

DreamTeacher: Pretraining Image Backbones with Deep Generative Models

  • paper_url: http://arxiv.org/abs/2307.07487
  • repo_url: None
  • paper_authors: Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler
  • for: Proposes DreamTeacher, a self-supervised feature representation learning framework that uses generative networks to pre-train downstream image backbones.
  • methods: Two knowledge distillation schemes: (1) distilling learned generative features onto target image backbones as an alternative to pre-training on large labeled datasets such as ImageNet; (2) distilling labels obtained from generative networks with task heads onto the logits of target backbones.
  • results: Extensive analyses across multiple generative models, dense prediction benchmarks, and pre-training regimes show that DreamTeacher significantly outperforms existing self-supervised representation learning methods; unsupervised ImageNet pre-training with DreamTeacher improves over ImageNet classification pre-training on downstream datasets, highlighting generative models, and diffusion models in particular, as a promising approach to representation learning on large, diverse datasets without manual annotation.
    Abstract In this work, we introduce a self-supervised feature representation learning framework DreamTeacher that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto logits of target backbones. We perform extensive analyses on multiple generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.
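
A minimal sketch of feature distillation in this spirit (the projection heads, interpolation, and MSE objective are illustrative choices, not the paper's exact recipe): features from the frozen generative model act as regression targets for the image backbone.

```python
# Hedged sketch: regress backbone features onto frozen generative features.
import torch
import torch.nn.functional as F

def distill_step(backbone_feats, teacher_feats, heads):
    # backbone_feats / teacher_feats: lists of (B, C_i, H_i, W_i) feature maps;
    # heads: per-level 1x1 convs projecting backbone dims to teacher dims.
    loss = 0.0
    for f_s, f_t, head in zip(backbone_feats, teacher_feats, heads):
        f_s = head(f_s)
        if f_s.shape[-2:] != f_t.shape[-2:]:
            f_s = F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear",
                                align_corners=False)
        loss = loss + F.mse_loss(f_s, f_t.detach())  # teacher stays frozen
    return loss
```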

Multimodal Distillation for Egocentric Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07483
  • repo_url: https://github.com/gorjanradevski/multimodal-distillation
  • paper_authors: Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, Tinne Tuytelaars
  • for: Modeling hand-object interactions to improve egocentric video understanding.
  • methods: Standard models such as CNNs and Vision Transformers improve when given additional input modalities that provide complementary cues, but the modality-specific modules make them impractical to deploy; the goal is to retain multimodal performance while using only RGB frames as input at inference time.
  • results: Students taught by multimodal teachers are more accurate and better calibrated than architecturally equivalent models trained on ground-truth labels; a principled multimodal knowledge distillation framework handles the issues of naive distillation, and the approach reduces computational complexity while maintaining higher performance as the number of input views is reduced.
    Abstract The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views. We release our code at https://github.com/gorjanradevski/multimodal-distillation.
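
A sketch of the plain knowledge-distillation objective underlying such a setup (the temperature and weighting are assumptions, and the paper's principled multimodal framework adds more than this): the RGB-only student matches the temperature-softened predictions of the multimodal teacher.

```python
# Hedged sketch of logit distillation from a multimodal teacher to an
# RGB-only student, combined with standard supervised cross-entropy.
import torch.nn.functional as F

def multimodal_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # standard KD temperature scaling
    ce = F.cross_entropy(student_logits, labels)  # ground-truth supervision
    return alpha * kd + (1 - alpha) * ce
```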

Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification

  • paper_url: http://arxiv.org/abs/2307.07482
  • repo_url: None
  • paper_authors: Simon Holdenried-Krafft, Peter Somers, Ivonne A. Montes-Majarro, Diana Silimon, Cristina Tarín, Falko Fend, Hendrik P. A. Lensch
  • for: Whole slide image (WSI) assessment for cancer diagnosis and treatment planning, where high magnifications are needed for sub-cellular analysis.
  • methods: An embedding-based Dual-Query multiple instance learning pipeline (DQ-MIL) contributing to both the embedding and aggregation steps: dynamic meta-embedding built on state-of-the-art self-supervised pre-trained models, and a new MIL architecture that combines MIL-attention with correlated self-attention.
  • results: Demonstrates improvements of up to 10% over state-of-the-art approaches on three histopathological datasets.
    Abstract Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning. WSIs require high magnifications to facilitate sub-cellular analysis. Precise annotations for patch- or even pixel-level classifications in the context of gigapixel WSIs are tedious to acquire and require domain experts. Coarse-grained labels, on the other hand, are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL). In our work, we propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL). We contribute to both the embedding and aggregation steps. Since all-purpose visual feature representations are not yet available, embedding models are currently limited in terms of generalizability. With our work, we explore the potential of dynamic meta-embedding based on cutting-edge self-supervised pre-trained models in the context of MIL. Moreover, we propose a new MIL architecture capable of combining MIL-attention with correlated self-attention. The Dual-Query Perceiver design of our approach allows us to leverage the concept of self-distillation and to combine the advantages of a small model in the context of a low data regime with the rich feature representation of a larger model. We demonstrate the superior performance of our approach on three histopathological datasets, where we show improvement of up to 10% over state-of-the-art approaches.
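
For orientation, the attention-based MIL pooling that such pipelines extend looks as below (a variant after Ilse et al.; dimensions are illustrative): patch embeddings from one slide are weighted by learned attention scores and summed into a single slide-level embedding.

```python
# Hedged sketch of attention-based MIL pooling over one bag (slide).
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, D) embeddings of all patches from one slide
        w = torch.softmax(self.attn(patches), dim=0)  # (N, 1) attention weights
        return (w * patches).sum(dim=0)               # (D,) slide embedding
```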

Atlas-Based Interpretable Age Prediction

  • paper_url: http://arxiv.org/abs/2307.07439
  • repo_url: None
  • paper_authors: Sophie Starck, Yadunandan Vivekanand Kini, Jessica Johanna Maria Ritter, Rickmer Braren, Daniel Rueckert, Tamara Mueller
  • for: Improving age prediction for medical assessment and research, to help detect diseases and abnormal ageing.
  • methods: Studies age-related changes on a larger scale using whole-body images and applies the Grad-CAM interpretability method, extended population-wide via registration, to determine the body areas most predictive of age.
  • results: Identifies three primary areas of interest, the spine, the autochthonous back muscles, and the cardiac region, as the most important for age prediction.
    Abstract Age prediction is an important part of medical assessments and research. It can aid in detecting diseases as well as abnormal ageing by highlighting the discrepancy between chronological and biological age. To gain a comprehensive understanding of age-related changes observed in various body parts, we investigate them on a larger scale by using whole-body images. We utilise the Grad-CAM interpretability method to determine the body areas most predictive of a person's age. We expand our analysis beyond individual subjects by employing registration techniques to generate population-wide interpretability maps. Furthermore, we set state-of-the-art whole-body age prediction with a model that achieves a mean absolute error of 2.76 years. Our findings reveal three primary areas of interest: the spine, the autochthonous back muscles, and the cardiac region, which exhibits the highest importance.
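
Grad-CAM, the interpretability method used here, carries over directly to regression: gradients of the predicted age with respect to a convolutional feature map weight its channels into a heatmap. A compact sketch:

```python
# Hedged sketch of Grad-CAM for a scalar regression output (predicted age).
import torch
import torch.nn.functional as F

def grad_cam(feature_map: torch.Tensor, predicted_age: torch.Tensor):
    # feature_map: (1, C, H, W) activations captured (e.g., via a forward
    # hook) so that they are part of the autograd graph; predicted_age: scalar.
    grads, = torch.autograd.grad(predicted_age, feature_map)
    weights = grads.mean(dim=(2, 3), keepdim=True)      # channel importance
    cam = F.relu((weights * feature_map).sum(dim=1))    # (1, H, W) heatmap
    return cam / cam.max().clamp(min=1e-8)
```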