eess.IV - 2023-07-29

Why is thermal imaging textureless

  • paper_url: http://arxiv.org/abs/2307.15800
  • repo_url: https://github.com/fanglinbao/hadar
  • paper_authors: Fanglin Bao, Shubhankar Jape, Andrew Schramka, Junjie Wang, Tim E. McGraw, Zubin Jacob
  • for: Investigates night vision based on thermal imaging and how TeX vision can mitigate the ghosting effect.
  • methods: Uses experiments and computational simulations to study texture recovery under non-uniform temperature, comparing traditional thermal imaging with TeX vision.
  • results: Under non-uniform temperature, traditional thermal imaging fails to recover geometric textures while TeX vision successfully mitigates the ghosting effect and recovers them; the study also analyzes previously unexplored aspects of the TeX vision theory and demonstrates true night vision like broad daylight with the experimentally more feasible Bayer-filter setup.
    Abstract Thermal imaging can enable night vision but is usually textureless, well-known as the ghosting effect. The mechanism of this ghosting effect has recently been explained, and TeX vision has been proposed to overcome the ghosting effect. However, it is still unknown for realistic scenarios with non-uniform temperature whether TeX vision can correctly recover geometric textures and how its performance is compared with traditional thermal imaging. Here, we focus on the interplay of geometric textures and non-uniform temperature which is common in realistic thermal imaging, and demonstrate the failure of traditional approaches while TeX vision successfully recovers geometric textures. We also analyze important yet unexplored aspects of the TeX vision theory, and demonstrate a true night vision like broad daylight with the experimentally more feasible Bayer-filter setup. This deepens the understanding of the ghosting effect and bridges the gap between the TeX vision theory and the consumer thermal-imaging market.

cs.SD - 2023-07-28

All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

  • paper_url: http://arxiv.org/abs/2307.15555
  • repo_url: None
  • paper_authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani
  • for: Countering the misuse of deep learning and computer vision in the audio domain, where speech deepfake generation enables frauds and identity thefts.
  • methods: Fuses three feature sets proposed in the literature into a single model for synthetic speech detection, improving on existing approaches.
  • results: Tested on different scenarios and datasets, demonstrating robustness to anti-forensic attacks and generalization capability.
    Abstract Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which solicit the development of synthetic speech detection algorithms to counter possible mischievous uses such as frauds or identity thefts. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performances with respect to the state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
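For intuition, a minimal PyTorch sketch of late fusion over three precomputed per-utterance feature embeddings; the dimensions, projection layers, and classification head are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    """Fuse three per-utterance feature embeddings and classify bona fide vs. spoofed."""
    def __init__(self, dims=(256, 192, 128), hidden=128):
        super().__init__()
        # Project each feature set to a common size before fusion.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.head = nn.Sequential(
            nn.ReLU(), nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, feats):  # feats: list of 3 tensors, each (batch, dim_i)
        fused = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1)
        return self.head(fused)  # (batch, 2) logits

model = FusionDetector()
logits = model([torch.randn(4, 256), torch.randn(4, 192), torch.randn(4, 128)])
```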

Automated approach for source location in shallow waters

  • paper_url: http://arxiv.org/abs/2307.15491
  • repo_url: https://github.com/niclas-angele/source_localization
  • paper_authors: Angèle Niclas, Josselin Garnier
  • for: Describes a fully automated method for recovering the location of a source and medium parameters in shallow waters.
  • methods: Introduces theoretical tools to understand the robustness of the warping method, proposes an automated separation of the modal components of the recorded signal, and recovers parameter estimates with a penalized minimization algorithm.
  • results: Demonstrates effectiveness in real-world scenarios on experimental data of right whale gunshot and combustive sound sources.
    Abstract This paper proposes a fully automated method for recovering the location of a source and medium parameters in shallow waters. The scenario involves an unknown source emitting low-frequency sound waves in a shallow water environment, and a single hydrophone recording the signal. Firstly, theoretical tools are introduced to understand the robustness of the warping method and to propose and analyze an automated way to separate the modal components of the recorded signal. Secondly, using the spectrogram of each modal component, the paper investigates the best way to recover the modal travel times and provides stability estimates. Finally, a penalized minimization algorithm is presented to recover estimates of the source location and medium parameters. The proposed method is tested on experimental data of right whale gunshot and combustive sound sources, demonstrating its effectiveness in real-world scenarios.
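To illustrate the final step, a toy SciPy sketch of penalized minimization over source and medium parameters; the forward model `predicted_travel_times` and all numbers are hypothetical stand-ins for the paper's modal travel-time model:

```python
import numpy as np
from scipy.optimize import minimize

t_obs = np.array([1.02, 1.10, 1.25])          # measured modal travel times (s), illustrative
prior = np.array([5000.0, 50.0, 1500.0])      # prior guess: range (m), depth (m), sound speed (m/s)

def predicted_travel_times(theta):
    """Stand-in forward model: travel time per mode from range, depth, sound speed."""
    r, z, c = theta
    mode_factor = np.array([1.0, 1.08, 1.22])  # crude mode-dependent slowdown
    return mode_factor * r / c * (1 + z / 1e4)

def objective(theta, lam=1e-8):
    misfit = predicted_travel_times(theta) - t_obs
    penalty = lam * np.sum((theta - prior) ** 2)  # penalization keeps estimates near the prior
    return np.sum(misfit ** 2) + penalty

res = minimize(objective, x0=prior, method="Nelder-Mead")
print("estimated (range, depth, sound speed):", res.x)
```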

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

  • paper_url: http://arxiv.org/abs/2307.15484
  • repo_url: None
  • paper_authors: Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
  • for: Methods for minimally-supervised text-to-speech (TTS) synthesis.
  • methods: Uses diffusion models together with variational autoencoders to improve prompt representation capabilities, and diffusion models to address the challenges of high dimensionality and waveform distortion in discrete representations.
  • results: Outperforms baseline methods and can generate speech with diverse prosodic expressions.
    Abstract Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. To address the challenges associated with high dimensionality and waveform distortion in discrete representations, we propose Diff-LM-Speech, which models semantic embeddings into mel-spectrogram based on diffusion models and introduces a prompt encoder structure based on variational autoencoders and prosody bottlenecks to improve prompt representation capabilities. Autoregressive language models often suffer from missing and repeated words, while non-autoregressive frameworks face expression averaging problems due to duration prediction models. To address these issues, we propose Tetra-Diff-Speech, which designs a duration diffusion model to achieve diverse prosodic expressions. While we expect the information content of semantic coding to be between that of text and acoustic coding, existing models extract semantic coding with a lot of redundant information and dimensionality explosion. To verify that semantic coding is not necessary, we propose Tri-Diff-Speech. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.
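As background, a generic DDPM-style training step for a denoiser conditioned on semantic embeddings, in the spirit of Diff-LM-Speech's mapping from semantic embeddings to mel-spectrograms; the `denoiser` stub, shapes, and noise schedule are assumptions, not the paper's model:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, mel, semantic):
    """One conditional denoising step: predict the noise added to a mel-spectrogram."""
    b = mel.size(0)
    t = torch.randint(0, T, (b,))
    a = alphas_cumprod[t].view(b, 1, 1)                  # broadcast over (mel_bins, frames)
    noise = torch.randn_like(mel)
    noisy_mel = a.sqrt() * mel + (1 - a).sqrt() * noise  # forward process q(x_t | x_0)
    return F.mse_loss(denoiser(noisy_mel, t, semantic), noise)

# Stub denoiser; a real model would be a conditional U-Net or Transformer.
denoiser = lambda x, t, cond: torch.zeros_like(x)
loss = diffusion_loss(denoiser, mel=torch.randn(4, 80, 200), semantic=torch.randn(4, 256))
```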

The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

  • paper_url: http://arxiv.org/abs/2307.15400
  • repo_url: None
  • paper_authors: Li Zhang, Huan Zhao, Yue Li, Bowen Pang, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie
  • for: Describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held at ICASSP 2022.
  • methods: An end-to-end audio-visual speaker diarization (AVSD) system consisting of a lip encoder, a speaker encoder, and an audio-visual decoder; to mitigate the performance degradation caused by separate training, the speaker encoder and the audio-visual decoder are trained jointly, and a large-data pretrained speaker extractor initializes the speaker encoder.
  • results: Experiments show the AVSD system performs well across different numbers of speakers and background-noise levels and, compared with other participants' systems, achieves higher accuracy in most cases.
    Abstract This paper describes the FlySpeech speaker diarization system submitted to the second \textbf{M}ultimodal \textbf{I}nformation Based \textbf{S}peech \textbf{P}rocessing~(\textbf{MISP}) Challenge held in ICASSP 2022. We develop an end-to-end audio-visual speaker diarization~(AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we leverage the large-data pretrained speaker extractor to initialize the speaker encoder.

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

  • paper_url: http://arxiv.org/abs/2307.15344
  • repo_url: None
  • paper_authors: Yifei Xin, Yuexian Zou
  • for: Improving audio-text retrieval (ATR), where existing methods ignore fine-grained cross-modal relationships.
  • methods: Introduces a hierarchical cross-modal interaction (HCI) method that simultaneously explores clip-sentence, segment-phrase, and frame-word relationships for a comprehensive multi-modal semantic comparison.
  • results: Experiments show HCI significantly improves ATR performance; the auxiliary-captions (AC) framework yields enhanced audio representations and also serves as data augmentation for training.
    Abstract Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. Besides, we also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that our HCI significantly improves the ATR performance. Moreover, our AC framework also shows stable performance gains on multiple datasets.
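A rough sketch of multi-level audio-text similarity in PyTorch, assuming precomputed frame-level audio and word-level text features; the pooling windows crudely stand in for the paper's clip/segment and sentence/phrase structures:

```python
import torch
import torch.nn.functional as F

def cos_sim(a, b):
    return F.cosine_similarity(a, b, dim=-1)

audio = torch.randn(40, 128)   # 40 frames x 128-d features
text = torch.randn(12, 128)    # 12 words x 128-d features

# frame-word: best match over all frame/word pairs
frame_word = cos_sim(audio.unsqueeze(1), text.unsqueeze(0)).max()

# segment-phrase: pool frames into 4 segments and words into 3 phrases, then best match
segments = audio.view(4, 10, 128).mean(dim=1)
phrases = text.view(3, 4, 128).mean(dim=1)
seg_phrase = cos_sim(segments.unsqueeze(1), phrases.unsqueeze(0)).max()

# clip-sentence: global pooling on both sides
clip_sentence = cos_sim(audio.mean(dim=0), text.mean(dim=0))

score = (frame_word + seg_phrase + clip_sentence) / 3  # simple average of the three levels
```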

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

  • paper_url: http://arxiv.org/abs/2307.15251
  • repo_url: None
  • paper_authors: Xinmeng Xu, Weiping Tu, Yuhong Yang
  • for: Improving monaural speech enhancement.
  • methods: Unifies CNN and Transformer architectures to combine convolution and self-attention in parallel, and designs a multi-branch dilated convolution (MBDC) module and a self-channel-time-frequency attention (Self-CTFA) module.
  • results: Outperforms state-of-the-art methods in most evaluation criteria while maintaining the lowest number of model parameters.
    Abstract Convolutional neural networks (CNN) and Transformer have wildly succeeded in multimedia applications. However, more effort needs to be made to harmonize these two architectures effectively to satisfy speech enhancement. This paper aims to unify these two architectures and presents a Parallel Conformer for speech enhancement. In particular, the CNN and the self-attention (SA) in the Transformer are fully exploited for local format patterns and global structure representations. Based on the small receptive field size of CNN and the high computational complexity of SA, we specially designed a multi-branch dilated convolution (MBDC) and a self-channel-time-frequency attention (Self-CTFA) module. MBDC contains three convolutional layers with different dilation rates for the feature from local to non-local processing. Experimental results show that our method performs better than state-of-the-art methods in most evaluation criteria while maintaining the lowest model parameters.
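A minimal PyTorch sketch of a multi-branch dilated convolution block; the dilation rates (1, 2, 4), channel counts, and residual connection are assumptions based on the abstract, not the published PCNN:

```python
import torch
import torch.nn as nn

class MBDC(nn.Module):
    """Multi-branch dilated convolution: local-to-non-local processing in parallel."""
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.merge = nn.Conv2d(len(dilations) * channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, freq, time)
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.merge(out) + x  # residual connection (an assumption)

y = MBDC()(torch.randn(2, 64, 32, 100))
```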

Self-Supervised Visual Acoustic Matching

  • paper_url: http://arxiv.org/abs/2307.15064
  • repo_url: None
  • paper_authors: Arjun Somayazulu, Changan Chen, Kristen Grauman
  • for: Visual acoustic matching: re-synthesizing an audio clip to sound as if it were recorded in a target acoustic environment.
  • methods: A self-supervised approach whose training samples include only the target scene image and audio, jointly learning to disentangle room acoustics and re-synthesize audio via a conditional GAN framework and a novel metric that quantifies residual acoustic information.
  • results: Trained on in-the-wild web data or simulated data, it outperforms the state of the art on multiple challenging datasets and a wide variety of real-world audio and environments.
    Abstract Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.

eess.AS - 2023-07-28

Efficient Acoustic Echo Suppression with Condition-Aware Training

  • paper_url: http://arxiv.org/abs/2307.15630
  • repo_url: None
  • paper_authors: Ernst Seidel, Pejman Mowlaee, Tim Fingscheidt
  • for: Improving deep acoustic echo control (DAEC), in particular near-end speech preservation in double-talk.
  • methods: An improved convolutional recurrent network (CRN) topology, consisting of a convolutional encoder and decoder with a recurrent bottleneck, trained in a condition-aware manner.
  • results: Outperforms the baseline FCRN and CRUSE architectures in double-talk while saving parameters and computational complexity.
    Abstract The topic of deep acoustic echo control (DAEC) has seen many approaches with various model topologies in recent years. Convolutional recurrent networks (CRNs), consisting of a convolutional encoder and decoder encompassing a recurrent bottleneck, are repeatedly employed due to their ability to preserve nearend speech even in double-talk (DT) condition. However, past architectures are either computationally complex or trade off smaller model sizes with a decrease in performance. We propose an improved CRN topology which, compared to other realizations of this class of architectures, not only saves parameters and computational complexity, but also shows improved performance in DT, outperforming both baseline architectures FCRN and CRUSE. Striving for a condition-aware training, we also demonstrate the importance of a high proportion of double-talk and the missing value of nearend-only speech in DAEC training data. Finally, we show how to control the trade-off between aggressive echo suppression and near-end speech preservation by fine-tuning with condition-aware component loss functions.
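A bare-bones sketch of the CRN topology the abstract describes (convolutional encoder, recurrent bottleneck, mask-based output); the layer sizes are illustrative and this is not the proposed improved architecture:

```python
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    def __init__(self, freq_bins=64, hidden=128):
        super().__init__()
        self.enc = nn.Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 1), padding=1)
        self.rnn = nn.GRU(16 * (freq_bins // 2), hidden, batch_first=True)
        self.fc = nn.Linear(hidden, freq_bins)

    def forward(self, spec):                  # spec: (batch, 1, freq, time)
        h = torch.relu(self.enc(spec))        # downsample along frequency
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.rnn(h)                    # recurrent bottleneck over time
        mask = torch.sigmoid(self.fc(h))      # (batch, time, freq) suppression mask
        return spec.squeeze(1) * mask.transpose(1, 2)

out = TinyCRN()(torch.randn(2, 1, 64, 100))
```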

A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment

  • paper_url: http://arxiv.org/abs/2307.15611
  • repo_url: https://github.com/aircarlo/bin2bin-gan-plc
  • paper_authors: Carlo Aironi, Samuele Cornell, Luca Serafini, Stefano Squartini
  • for: This paper targets the problem of voice quality degradation in VoIP transmissions caused by packet loss, and proposes a generative adversarial approach to repair lost fragments during the transmission of audio streams.
  • methods: The proposed method, called bin2bin, is based on an improved pix2pix framework and uses a combination of two STFT-based loss functions and a modified PatchGAN structure as discriminator to translate magnitude spectrograms of audio frames with lost packets to noncorrupted speech spectrograms.
  • results: Experimental results show that the proposed method has obvious advantages compared to current state-of-the-art methods, particularly in handling high packet loss rates and large gaps.
    Abstract Packet loss is a major cause of voice quality degradation in VoIP transmissions with serious impact on intelligibility and user experience. This paper describes a system based on a generative adversarial approach, which aims to repair the lost fragments during the transmission of audio streams. Inspired by the powerful image-to-image translation capability of Generative Adversarial Networks (GANs), we propose bin2bin, an improved pix2pix framework to achieve the translation task from magnitude spectrograms of audio frames with lost packets, to noncorrupted speech spectrograms. In order to better maintain the structural information after spectrogram translation, this paper introduces the combination of two STFT-based loss functions, mixed with the traditional GAN objective. Furthermore, we employ a modified PatchGAN structure as discriminator and we lower the concealment time by a proper initialization of the phase reconstruction algorithm. Experimental results show that the proposed method has obvious advantages when compared with the current state-of-the-art methods, as it can better handle both high packet loss rates and large gaps.
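The abstract mixes the GAN objective with two STFT-based losses. A common such pair is spectral convergence plus log-magnitude L1 (as used in Parallel WaveGAN); the sketch below assumes that formulation rather than reproducing the paper's exact losses:

```python
import torch

def stft_losses(x, y, fft_size=1024, hop=256):
    """Spectral convergence + log-magnitude L1 between predicted x and reference y waveforms."""
    window = torch.hann_window(fft_size)
    X = torch.stft(x, fft_size, hop, window=window, return_complex=True).abs()
    Y = torch.stft(y, fft_size, hop, window=window, return_complex=True).abs()
    sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")   # spectral convergence
    log_mag = torch.mean(torch.abs(torch.log(Y + 1e-7) - torch.log(X + 1e-7)))
    return sc, log_mag

x, y = torch.randn(2, 16000), torch.randn(2, 16000)
sc, lm = stft_losses(x, y)
total = sc + lm   # combined with the adversarial objective during training
```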

cs.CV - 2023-07-28

TriadNet: Sampling-free predictive intervals for lesional volume in 3D brain MR images

  • paper_url: http://arxiv.org/abs/2307.15638
  • repo_url: https://github.com/benolmbrt/TriadNet
  • paper_authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat
  • for: The volume of a brain lesion (e.g., infarct or tumor) is a powerful indicator of patient prognosis and can be used to guide the therapeutic strategy.
  • methods: Segmentation with deep convolutional neural networks (CNN), currently the state-of-the-art approach.
  • results: Proposes TriadNet, which provides lesion volumes and the associated predictive intervals simultaneously in less than a second, and demonstrates its superiority on BraTS 2021, a large-scale MRI glioblastoma image database.
    Abstract The volume of a brain lesion (e.g. infarct or tumor) is a powerful indicator of patient prognosis and can be used to guide the therapeutic strategy. Lesional volume estimation is usually performed by segmentation with deep convolutional neural networks (CNN), currently the state-of-the-art approach. However, to date, few work has been done to equip volume segmentation tools with adequate quantitative predictive intervals, which can hinder their usefulness and acceptation in clinical practice. In this work, we propose TriadNet, a segmentation approach relying on a multi-head CNN architecture, which provides both the lesion volumes and the associated predictive intervals simultaneously, in less than a second. We demonstrate its superiority over other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.
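One way to read the multi-head design is three segmentation heads whose lower/mean/upper masks yield a volume estimate with a predictive interval; the sketch below is that assumed arrangement, not the published TriadNet:

```python
import torch
import torch.nn as nn

class TriadHead(nn.Module):
    """Shared backbone features -> three masks: lower, mean, upper lesion estimates."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.lower = nn.Conv3d(in_ch, 1, 1)
        self.mean = nn.Conv3d(in_ch, 1, 1)
        self.upper = nn.Conv3d(in_ch, 1, 1)

    def forward(self, feats, voxel_volume_ml=0.001):
        masks = [torch.sigmoid(h(feats)) > 0.5 for h in (self.lower, self.mean, self.upper)]
        vols = [m.sum(dim=(1, 2, 3, 4)).float() * voxel_volume_ml for m in masks]
        lo, mu, hi = vols
        return mu, (lo, hi)   # volume estimate and its predictive interval

feats = torch.randn(1, 32, 24, 24, 24)   # backbone output for one MR volume
volume, (low, high) = TriadHead()(feats)
```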

A Survey on Deep Learning in Medical Image Registration: New Technologies, Uncertainty, Evaluation Metrics, and Beyond

  • paper_url: http://arxiv.org/abs/2307.15615
  • repo_url: None
  • paper_authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du
  • for: Provides a comprehensive overview of the most recent advances in deep learning-based medical image registration.
  • methods: Surveys innovative network architectures, registration-specific loss functions, methods for estimating registration uncertainty, and evaluation metrics suited to assessing deep learning models on registration tasks.
  • results: Discusses practical applications of deep learning-based registration, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration, and outlines future prospects.
    Abstract Over the past decade, deep learning technologies have greatly advanced the field of medical image registration. The initial developments, such as ResNet-based and U-Net-based networks, laid the groundwork for deep learning-driven image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, and uncertainty estimation. These advancements have not only enriched the field of deformable image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.

Integrated Digital Reconstruction of Welded Components: Supporting Improved Fatigue Life Prediction

  • paper_url: http://arxiv.org/abs/2307.15604
  • repo_url: None
  • paper_authors: Anders Faarbæk Mikkelstrup, Morten Kristiansen
  • for: Improving the fatigue performance of welded steel structures.
  • methods: Automated high-frequency mechanical impact (HFMI) treatment combined with digital reconstruction, using an industrial manipulator with a line scanner, standard image processing, simple filtering, non-linear optimization for aligning and merging overlapping scans, and screened Poisson surface reconstruction.
  • results: A generic, cost-effective, flexible, and rapid digital reconstruction method that supports component design, overall quality assurance, and documentation of the HFMI treatment.
    Abstract In the design of offshore jacket foundations, fatigue life is crucial. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, where particularly high-frequency mechanical impact (HFMI) treatment has been shown to improve fatigue performance significantly. Automated HFMI treatment has improved quality assurance and can lead to cost-effective design when combined with accurate fatigue life prediction. However, the finite element method (FEM), commonly used for predicting fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld, failing to consider the actual weld geometry and defects. Including the actual weld geometry in the FE model improves fatigue life prediction and possible crack location prediction but requires a digital reconstruction of the weld. Current digital reconstruction methods are time-consuming or require specialised scanning equipment and potential component relocation. The proposed framework instead uses an industrial manipulator combined with a line scanner to integrate digital reconstruction as part of the automated HFMI treatment setup. This approach applies standard image processing, simple filtering techniques, and non-linear optimisation for aligning and merging overlapping scans. A screened Poisson surface reconstruction finalises the 3D model to create a meshed surface. The outcome is a generic, cost-effective, flexible, and rapid method that enables generic digital reconstruction of welded parts, aiding in component design, overall quality assurance, and documentation of the HFMI treatment.

OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation of Road Scenes

  • paper_url: http://arxiv.org/abs/2307.15588
  • repo_url: https://github.com/feibryantkit/oafuser
  • paper_authors: Fei Teng, Jiaming Zhang, Kunyu Peng, Kailun Yang, Yaonan Wang, Rainer Stiefelhagen
  • for: Improving image semantic segmentation for autonomous-driving scene understanding by exploiting the rich angular and spatial information of light field cameras.
  • methods: Proposes the Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and angular information from sub-aperture images to generate semantically consistent results; a Sub-Aperture Fusion Module (SAFM) embeds sub-aperture images into angular features without additional memory cost.
  • results: State-of-the-art performance on the UrbanLF-Real and -Syn datasets, reaching 84.93% mIoU on the UrbanLF-Real Extended dataset, a gain of +4.53%.
    Abstract Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resource of intelligent vehicles. Besides, inappropriate compression leads to information corruption and data loss. To excavate representative information, we propose an Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and discovers the angular information from sub-aperture images to generate a semantically-consistent result. To avoid feature loss during network propagation and simultaneously streamline the redundant information from the light field camera, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular features without any additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we present Center Angular Rectification Module (CARM) realized feature resorting and prevent feature occlusion caused by asymmetric information. Our proposed OAFuser achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +4.53%. The source code of OAFuser will be made publicly available at https://github.com/FeiBryantkit/OAFuser.

Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding

  • paper_url: http://arxiv.org/abs/2307.15569
  • repo_url: None
  • paper_authors: Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam
  • for: Self-supervised representation learning (SSRL) for point cloud understanding, addressing 3D data scarcity and high annotation costs.
  • methods: Proposes PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images"; this conceptual shift lets PCExpert leverage knowledge from the large-scale image modality more directly and deeply by extensively sharing parameters with a pretrained image encoder in a multi-way Transformer architecture.
  • results: PCExpert outperforms the state of the art on a variety of tasks with a remarkable reduction in trainable parameters; under LINEAR fine-tuning its performance (e.g., 90.02% overall accuracy on ScanObjectNN) already approaches full fine-tuning (92.66%), demonstrating a strong and robust representation capability.
    Abstract Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the arts in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.

Panoptic Scene Graph Generation with Semantics-prototype Learning

  • paper_url: http://arxiv.org/abs/2307.15567
  • repo_url: None
  • paper_authors: Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann
  • for: 提高 PSG 模型在实际应用中的表现,解决固有偏见导致 PSG 模型在构建准确的决策平面上遇到困难。
  • methods: 提出了一种名为 ADTrans 的新框架,用于适应性地传输偏见 predicate 笔记到有用和统一的笔记。通过保证每个 predicate 类划 representation 的一致性和准确性,学习不偏 predicate 的聚类表示。同时,通过不断测量每个表现与其聚类表示之间的分布变化,不断屏选掉潜在的偏见数据。
  • results: 实验显示,ADTrans 可以显著提高 benchmark 模型的表现,实现新的州OF-the-art 性能,并在多个数据集上显示出极高的一致性和效果。
    Abstract Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicate) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e. different predicates for same object pairs. Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To promise consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each presentation and its prototype, and constantly screen potential biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.

Beating Backdoor Attack at Its Own Game

  • paper_url: http://arxiv.org/abs/2307.15539
  • repo_url: https://github.com/damianliumin/non-adversarial_backdoor
  • paper_authors: Min Liu, Alberto Sangiovanni-Vincentelli, Xiangyu Yue
  • for: Defending deep neural networks (DNNs) against backdoor attacks, which do not affect performance on clean data but manipulate network behavior once a trigger pattern is added.
  • methods: A simple yet highly effective defense framework that injects a non-adversarial backdoor targeting poisoned samples: a small set of suspected samples is detected and poisoned so that, once triggered, the benign backdoor suppresses the attacker's backdoor on poisoned data while having limited influence on clean data.
  • results: Extensive experiments on multiple benchmarks show state-of-the-art defense effectiveness with by far the lowest performance drop on clean data.
    Abstract Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.
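The core preprocessing step is stamping a fixed, non-adversarial trigger onto the suspected samples. A toy NumPy sketch; the trigger pattern, size, and placement are illustrative:

```python
import numpy as np

def stamp_trigger(images, suspected_idx, patch=4, value=1.0):
    """Overwrite a small corner patch on suspected samples with a fixed benign trigger."""
    images = images.copy()
    for i in suspected_idx:
        images[i, -patch:, -patch:, :] = value   # bottom-right white square
    return images

train_x = np.random.rand(100, 32, 32, 3).astype(np.float32)
suspected = [3, 17, 42]                          # indices flagged by the detection stage
train_x = stamp_trigger(train_x, suspected)
# Training then proceeds normally; at test time the same trigger is applied to inputs
# so the benign backdoor fires and suppresses the attacker's backdoor behavior.
```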

YOLOv8 for Defect Inspection of Hexagonal Directed Self-Assembly Patterns: A Data-Centric Approach

  • paper_url: http://arxiv.org/abs/2307.15516
  • repo_url: None
  • paper_authors: Enrique Dehaerne, Bappaditya Dey, Hossein Esfandiar, Lander Verstraete, Hyo Seon Suh, Sandip Halder, Stefan De Gendt
  • for: Obtaining coherent and complete defect labels for directed self-assembly (DSA) patterns with minimal quality-control effort from a DSA expert, for training supervised machine learning models.
  • methods: Machine learning-based SEM image analysis for automatic defect detection in hexagonal contact hole DSA patterns.
  • results: YOLOv8, a state-of-the-art neural network, achieves defect-detection precisions of more than 0.9 mAP on the final dataset, which best reflects DSA-expert defect-labeling expectations.
    Abstract Shrinking pattern dimensions leads to an increased variety of defect types in semiconductor devices. This has spurred innovation in patterning approaches such as Directed self-assembly (DSA) for which no traditional, automatic defect inspection software exists. Machine Learning-based SEM image analysis has become an increasingly popular research topic for defect inspection with supervised ML models often showing the best performance. However, little research has been done on obtaining a dataset with high-quality labels for these supervised models. In this work, we propose a method for obtaining coherent and complete labels for a dataset of hexagonal contact hole DSA patterns while requiring minimal quality control effort from a DSA expert. We show that YOLOv8, a state-of-the-art neural network, achieves defect detection precisions of more than 0.9 mAP on our final dataset which best reflects DSA expert defect labeling expectations. We discuss the strengths and limitations of our proposed labeling approach and suggest directions for future work in data-centric ML-based defect inspection.
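For reference, training YOLOv8 on a custom defect dataset follows the standard Ultralytics workflow; the dataset config `dsa_defects.yaml` and the hyperparameters below are hypothetical placeholders, not the paper's settings:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                  # pretrained nano model as a starting point
model.train(data="dsa_defects.yaml",        # hypothetical dataset config (paths + class names)
            epochs=100, imgsz=640)
metrics = model.val()                       # reports mAP50-95, mAP50, precision, recall
results = model.predict("sem_image.png")    # defect boxes on a new SEM image
```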

Improving Image Quality of Sparse-view Lung Cancer CT Images with a Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2307.15506
  • repo_url: None
  • paper_authors: Annika Ries, Tina Dorosti, Johannes Thalhammer, Daniel Sasse, Andreas Sauter, Felix Meurer, Ashley Benne, Franz Pfeiffer, Daniela Pfeiffer
  • for: Improving the image quality of sparse-view computed tomography (CT) images for lung cancer detection and determining the best trade-off between number of views, image quality, and diagnostic confidence.
  • methods: A U-Net post-processes sparse-view CT images reconstructed at different levels of undersampling (16, 32, 64, 128, 256, and 512 views), and the effect on image quality and diagnostic confidence is evaluated.
  • results: 64-projection sparse-view images yield high image quality and diagnostic confidence, while fewer views lead to insufficient quality; post-processing the sparse-view images with the U-Net further improves both.
    Abstract Purpose: To improve the image quality of sparse-view computed tomography (CT) images with a U-Net for lung cancer detection and to determine the best trade-off between number of views, image quality, and diagnostic confidence. Methods: CT images from 41 subjects (34 with lung cancer, seven healthy) were retrospectively selected (01.2016-12.2018) and forward projected onto 2048-view sinograms. Six corresponding sparse-view CT data subsets at varying levels of undersampling were reconstructed from sinograms using filtered backprojection with 16, 32, 64, 128, 256, and 512 views, respectively. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy) for a single-blinded reader study. The selected slices, for all levels of subsampling, with and without post-processing by the U-Net model, were presented to three readers. Image quality and diagnostic confidence were ranked using pre-defined scales. Subjective nodule segmentation was evaluated utilizing sensitivity (Se) and Dice Similarity Coefficient (DSC) with 95% confidence intervals (CI). Results: The 64-projection sparse-view images resulted in Se = 0.89 and DSC = 0.81 [0.75,0.86] while their counterparts, post-processed with the U-Net, had improved metrics (Se = 0.94, DSC = 0.85 [0.82,0.87]). Fewer views lead to insufficient quality for diagnostic purposes. For increased views, no substantial discrepancies were noted between the sparse-view and post-processed images. Conclusion: Projection views can be reduced from 2048 to 64 while maintaining image quality and the confidence of the radiologists on a satisfactory level.
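The sparse-view inputs come from filtered backprojection over a reduced set of angles. A sketch with scikit-image's Radon tools, simulating a 64-view reconstruction on a phantom as a stand-in for the paper's 2048-view sinogram pipeline:

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

image = resize(shepp_logan_phantom(), (256, 256))       # stand-in for a CT slice
theta_full = np.linspace(0.0, 180.0, 512, endpoint=False)
theta_64 = theta_full[::8]                              # keep every 8th view -> 64 projections

sinogram = radon(image, theta=theta_64)                 # forward projection
recon_64 = iradon(sinogram, theta=theta_64, filter_name="ramp")  # filtered backprojection
# recon_64 shows streak artifacts; the paper's U-Net is trained to map such
# sparse-view reconstructions toward the fully sampled image.
```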

Local and Global Information in Obstacle Detection on Railway Tracks

  • paper_url: http://arxiv.org/abs/2307.15478
  • repo_url: None
  • paper_authors: Matthias Brucker, Andrei Cramariuc, Cornelius von Einem, Roland Siegwart, Cesar Cadena
  • for: Preventing railway collisions that cause injuries and can damage or derail the train.
  • methods: A shallow network learns railway segmentation from normal railway images; its limited receptive field prevents overconfident predictions, and global information is included in a controlled way by learning to hallucinate obstacle-free images.
  • results: Outperforms other learning-based baseline methods, evaluated on a custom dataset of railway images with artificially augmented obstacles.
    Abstract Reliable obstacle detection on railways could help prevent collisions that result in injuries and potentially damage or derail the train. Unfortunately, generic object detectors do not have enough classes to account for all possible scenarios, and datasets featuring objects on railways are challenging to obtain. We propose utilizing a shallow network to learn railway segmentation from normal railway images. The limited receptive field of the network prevents overconfident predictions and allows the network to focus on the locally very distinct and repetitive patterns of the railway environment. Additionally, we explore the controlled inclusion of global information by learning to hallucinate obstacle-free images. We evaluate our method on a custom dataset featuring railway images with artificially augmented obstacles. Our proposed method outperforms other learning-based baseline methods.

Defocus Blur Synthesis and Deblurring via Interpolation and Extrapolation in Latent Space

  • paper_url: http://arxiv.org/abs/2307.15461
  • repo_url: https://github.com/nis-research/linear-latent-blur
  • paper_authors: Ioana Mazilu, Shunxin Wang, Sven Dummer, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
  • for: Improving the quality of microscopy images for further processing and analysis of diseases.
  • methods: Trains autoencoders with implicit and explicit regularization to enforce linear relations among latent representations of different blur levels, so that deblurring and defocus-blur synthesis reduce to linear interpolation/extrapolation in latent space.
  • results: Effectively mimics blur and deblurring at flexible levels, increasing data variety as a data augmentation technique and improving the quality of microscopy images.
    Abstract Though modern microscopes have an autofocusing system to ensure optimal focus, out-of-focus images can still occur when cells within the medium are not all in the same focal plane, affecting the image quality for medical diagnosis and analysis of diseases. We propose a method that can deblur images as well as synthesize defocus blur. We train autoencoders with implicit and explicit regularization techniques to enforce linearity relations among the representations of different blur levels in the latent space. This allows for the exploration of different blur levels of an object by linearly interpolating/extrapolating the latent representations of images taken at different focal planes. Compared to existing works, we use a simple architecture to synthesize images with flexible blur levels, leveraging the linear latent space. Our regularized autoencoders can effectively mimic blur and deblur, increasing data variety as a data augmentation technique and improving the quality of microscopic images, which would be beneficial for further processing and analysis.
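The key property is that blur level is linear in latent space: interpolating between the latents of a sharp and a blurred image synthesizes intermediate blur, and extrapolating past the sharp latent deblurs. A sketch assuming a trained encoder/decoder pair (stubbed here):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))                # stub trained encoder
decoder = nn.Sequential(nn.Linear(32, 64 * 64), nn.Unflatten(1, (64, 64)))   # stub trained decoder

sharp = torch.rand(1, 64, 64)
blurred = torch.rand(1, 64, 64)
z_sharp, z_blur = encoder(sharp), encoder(blurred)

def blend(alpha):
    """alpha in [0, 1] interpolates blur; alpha < 0 extrapolates toward deblurring."""
    z = (1 - alpha) * z_sharp + alpha * z_blur
    return decoder(z)

half_blur = blend(0.5)     # synthesize an intermediate blur level
deblurred = blend(-0.5)    # extrapolate beyond the sharp latent to sharpen further
```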

ERCPMP: An Endoscopic Image and Video Dataset for Colorectal Polyps Morphology and Pathology

  • paper_url: http://arxiv.org/abs/2307.15444
  • repo_url: None
  • paper_authors: Mojgan Forootan, Mohsen Rajabnia, Ahmad R Mafi, Hamed Azhdari Tehrani, Erfan Ghadirzadeh, Mahziar Setayeshfar, Zahra Ghaffari, Mohammad Tashakoripour, Mohammad Reza Zali, Hamidreza Bolhasani
  • for: Developing accurate algorithms for medical prediction, detection, diagnosis, treatment, and prognosis of colorectal polyps.
  • methods: Presents ERCPMP, an endoscopic image and video dataset of 191 patients with colorectal polyps, containing demographic, morphological data based on the Paris, Pit, and JNET classifications, and pathological diagnoses (Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory, and Adenocarcinoma with dysplasia grade and differentiation).
  • results: A wide-ranging dataset suitable for training and testing machine learning and deep learning models for colorectal polyp recognition.
    Abstract In the recent years, artificial intelligence (AI) and its leading subtypes, machine learning (ML) and deep learning (DL) and their applications are spreading very fast in various aspects such as medicine. Today the most important challenge of developing accurate algorithms for medical prediction, detection, diagnosis, treatment and prognosis is data. ERCPMP is an Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology. This dataset contains demographic, morphological and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is included based on the latest international gastroenterology classification references such as Paris, Pit and JNET classification. Pathological data includes the diagnosis of the polyps including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse and since it is under development, the latest version is accessible via: https://databiox.com.
    摘要 recent 年们,人工智能(AI)和其主要子类型,机器学习(ML)和深度学习(DL)以及其应用在各个领域都在快速扩散,其中医学领域也是如此。 currently, the most important challenge of developing accurate algorithms for medical prediction, detection, diagnosis, treatment, and prognosis is data. ERCPMP 是一个 Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology。 This dataset contains demographic, morphological, and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is based on the latest international gastroenterology classification references such as Paris, Pit, and JNET classification. Pathological data includes the diagnosis of the polyps, including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory, and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse, and the latest version can be accessed via: .

Automated Visual Monitoring of Nocturnal Insects with Light-based Camera Traps

  • paper_url: http://arxiv.org/abs/2307.15433
  • repo_url: None
  • paper_authors: Dimitri Korsch, Paul Bodesheim, Gunnar Brehm, Joachim Denzler
  • for: Automated camera-assisted monitoring of nocturnal insects for abundance estimation, to better understand and counteract the ongoing insect decline.
  • methods: A two-stage detection and classification pipeline developed and evaluated on the EU-Moths dataset, captured manually by citizen scientists; additionally introduces a prototype of an automated visual monitoring system, which produced a second dataset of more than 27,000 images over 95 nights.
  • results: Provides first detection and classification baselines for both datasets and encourages other scientists to use the publicly available data.
    Abstract Automatic camera-assisted monitoring of insects for abundance estimations is crucial to understand and counteract ongoing insect decline. In this paper, we present two datasets of nocturnal insects, especially moths as a subset of Lepidoptera, photographed in Central Europe. One of the datasets, the EU-Moths dataset, was captured manually by citizen scientists and contains species annotations for 200 different species and bounding box annotations for those. We used this dataset to develop and evaluate a two-stage pipeline for insect detection and moth species classification in previous work. We further introduce a prototype for an automated visual monitoring system. This prototype produced the second dataset consisting of more than 27,000 images captured on 95 nights. For evaluation and bootstrapping purposes, we annotated a subset of the images with bounding boxes enframing nocturnal insects. Finally, we present first detection and classification baselines for these datasets and encourage other scientists to use this publicly available data.

Implicit neural representation for change detection

  • paper_url: http://arxiv.org/abs/2307.15428
  • repo_url: None
  • paper_authors: Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci
  • for: Detecting changes between a pair of 3D airborne LiDAR point clouds acquired at two different times over the same area, which is challenging because of unmatched spatial supports and acquisition-system noise.
  • methods: An unsupervised approach with two components: a Neural Field (NF) for continuous shape reconstruction, offering a grid-agnostic representation that can be regularized to increase high-frequency detail and reduce noise, and a Gaussian Mixture Model for categorizing changes.
  • results: Beats previous methods on a benchmark of simulated LiDAR point clouds for urban sprawl by a 10% margin in intersection over union, and, applied to a real-world scenario of identifying illegal excavation (looting) of archaeological sites, matches findings from field experts.
    Abstract Detecting changes that occurred in a pair of 3D airborne LiDAR point clouds, acquired at two different times over the same geographical area, is a challenging task because of unmatching spatial supports and acquisition system noise. Most recent attempts to detect changes on point clouds are based on supervised methods, which require large labelled data unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. NF offer a grid-agnostic representation to encode bi-temporal point clouds with unmatched spatial support that can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset of simulated LiDAR point clouds for urban sprawling. The dataset offers different challenging scenarios with different resolutions, input modalities and noise levels, allowing a multi-scenario comparison of our method with the current state-of-the-art. We boast the previous methods on this dataset by a 10% margin in intersection over union metric. In addition, we apply our methods to a real-world scenario to identify illegal excavation (looting) of archaeological sites and confirm that they match findings from field experts.
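Once per-point differences between the two NF reconstructions are available, a Gaussian mixture can separate change categories. A scikit-learn sketch on synthetic 1-D height differences; the paper's feature design is richer:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic per-point differences: mostly unchanged (~0), some additions (+), removals (-)
diffs = np.concatenate([rng.normal(0.0, 0.05, 5000),
                        rng.normal(2.0, 0.30, 300),
                        rng.normal(-1.5, 0.30, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(diffs)
labels = gmm.predict(diffs)

# Map mixture components to semantic classes by their mean difference.
order = np.argsort(gmm.means_.ravel())
names = {order[0]: "removed", order[1]: "unchanged", order[2]: "added"}
print({names[k]: int((labels == k).sum()) for k in range(3)})
```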

Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification

  • paper_url: http://arxiv.org/abs/2307.15427
  • repo_url: None
  • paper_authors: Dimitri Korsch, Paul Bodesheim, Joachim Denzler
  • for: Developing a deep learning-based automated visual monitoring system for moths to support biodiversity monitoring.
  • methods: A deep learning pipeline for images captured by a moth scanner: individuals are first localized with a moth detector and the species of detected insects is then determined with a classifier.
  • results: Combining detector and classifier raises the species-identification accuracy on moth-scanner images from 79.62% to 88.05% (the detector achieves up to 99.01% mean average precision; the classifier distinguishes 200 moth species with 93.13% accuracy).
    Abstract Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are rarely applied so far, and experts evaluate the generated data masses manually. Especially the support of deep learning methods for visual monitoring is not yet established in biodiversity research, compared to other areas like advertising or entertainment. In this paper, we present a deep learning pipeline for analyzing images captured by a moth scanner, an automated visual monitoring system of moth species developed within the AMMOD project. We first localize individuals with a moth detector and afterward determine the species of detected insects with a classifier. Our detector achieves up to 99.01% mean average precision and our classifier distinguishes 200 moth species with an accuracy of 93.13% on image cutouts depicting single insects. Combining both in our pipeline improves the accuracy for species identification in images of the moth scanner from 79.62% to 88.05%.

MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2307.15421
  • repo_url: https://github.com/jiangweibeta/mlic
  • paper_authors: Wei Jiang, Ronggang Wang
  • for: Learned image compression with a multi-reference entropy model that captures channel-wise, local spatial, and global spatial correlations.
  • methods: Captures global correlations with linear complexity via a decomposition of the softmax operation, avoiding the quadratic complexity of the attention used in previous work.
  • results: Compared to VTM-17.0, MLIC$^{++}$ reduces BD-rate by 12.44% on the Kodak dataset when measured in PSNR.
    Abstract Recently, multi-reference entropy model has been proposed, which captures channel-wise, local spatial, and global spatial correlations. Previous works adopt attention for global correlation capturing, however, the quadratic cpmplexity limits the potential of high-resolution image coding. In this paper, we propose the linear complexity global correlations capturing, via the decomposition of softmax operation. Based on it, we propose the MLIC$^{++}$, a learned image compression with linear complexity for multi-reference entropy modeling. Our MLIC$^{++}$ is more efficient and it reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be available at https://github.com/JiangWeibeta/MLIC.

Uncertainty-aware Unsupervised Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2307.15409
  • repo_url: None
  • paper_authors: Kai Liu, Sheng Jin, Zhihang Fu, Ze Chen, Rongxin Jiang, Jieping Ye
  • for: Improving the performance of unsupervised multi-object tracking.
  • methods: Self-supervised learning combined with an uncertainty-based metric that verifies and rectifies risky inter-frame associations, plus a tracklet-guided augmentation strategy with hierarchical uncertainty-based sampling.
  • results: State-of-the-art unsupervised multi-object tracking performance on the MOT-Challenge and VisDrone-MOT benchmarks.
    Abstract Without manually annotated identities, unsupervised multi-object trackers are inferior to learning reliable feature embeddings. It causes the similarity-based inter-frame association stage also be error-prone, where an uncertainty problem arises. The frame-by-frame accumulated uncertainty prevents trackers from learning the consistent feature embedding against time variation. To avoid this uncertainty problem, recent self-supervised techniques are adopted, whereas they failed to capture temporal relations. The interframe uncertainty still exists. In fact, this paper argues that though the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify the risky associations. The resulting accurate pseudo-tracklets boost learning the feature consistency. And accurate tracklets can incorporate temporal information into spatial transformation. This paper proposes a tracklet-guided augmentation strategy to simulate tracklets' motion, which adopts a hierarchical uncertainty-based sampling mechanism for hard sample mining. The ultimate unsupervised MOT framework, namely U2MOT, is proven effective on MOT-Challenges and VisDrone-MOT benchmark. U2MOT achieves a SOTA performance among the published supervised and unsupervised trackers.

Task-Oriented Channel Attention for Fine-Grained Few-Shot Classification

  • paper_url: http://arxiv.org/abs/2308.00093
  • repo_url: None
  • paper_authors: SuBeen Lee, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
  • for: fine-grained image classification with limited training data
  • methods: Task Discrepancy Maximization (TDM) with Support Attention Module (SAM) and Query Attention Module (QAM)
  • results: accurate class-sensitive similarity measure and instance-wise highlighting of object-relevant channels
    Abstract The difficulty of the fine-grained image classification mainly comes from a shared overall appearance across classes. Thus, recognizing discriminative details, such as eyes and beaks for birds, is a key in the task. However, this is particularly challenging when training data is limited. To address this, we propose Task Discrepancy Maximization (TDM), a task-oriented channel attention method tailored for fine-grained few-shot classification with two novel modules Support Attention Module (SAM) and Query Attention Module (QAM). SAM highlights channels encoding class-wise discriminative features, while QAM assigns higher weights to object-relevant channels of the query. Based on these submodules, TDM produces task-adaptive features by focusing on channels encoding class-discriminative details and possessed by the query at the same time, for accurate class-sensitive similarity measure between support and query instances. While TDM influences high-level feature maps by task-adaptive calibration of channel-wise importance, we further introduce Instance Attention Module (IAM) operating in intermediate layers of feature extractors to instance-wisely highlight object-relevant channels, by extending QAM. The merits of TDM and IAM and their complementary benefits are experimentally validated in fine-grained few-shot classification tasks. Moreover, IAM is also shown to be effective in coarse-grained and cross-domain few-shot classifications.
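The channel-attention idea behind SAM, scoring channels by how well they separate the support classes, can be mocked up in a few lines. This is a toy stand-in under assumed shapes, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def support_channel_weights(support_feats, labels, n_way):
    """support_feats: (n_support, C) pooled features; labels: (n_support,).
    Channels along which the class prototypes spread out the most are the
    class-discriminative ones, so they receive larger attention weights."""
    protos = torch.stack([support_feats[labels == c].mean(dim=0) for c in range(n_way)])
    discrepancy = protos.var(dim=0)                              # (C,) spread across classes
    return F.softmax(discrepancy, dim=0) * discrepancy.numel()  # mean weight ~= 1

feats = torch.randn(25, 64)                        # e.g. 5-way 5-shot, 64 channels
labels = torch.arange(5).repeat_interleave(5)
print(support_channel_weights(feats, labels, n_way=5).shape)  # torch.Size([64])
```

QAM would act analogously on the query side, up-weighting object-relevant channels per instance.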

AffineGlue: Joint Matching and Robust Estimation

  • paper_url: http://arxiv.org/abs/2307.15381
  • repo_url: None
  • paper_authors: Daniel Barath, Dmytro Mishkin, Luca Cavalli, Paul-Edouard Sarlin, Petr Hruby, Marc Pollefeys
  • for: AffineGlue performs joint two-view feature matching and robust estimation, reducing the combinatorial complexity of the problem.
  • methods: Single-point minimal solvers select potential matches from one-to-many correspondences, and guided matching finds model-consistent matches; a new minimal solver estimates a homography from a single affine correspondence (AC) plus a gravity prior, and a neural network rejects ACs unlikely to yield a good model.
  • results: AffineGlue outperforms the SOTA on real-world datasets even when the gravity direction is simply assumed to point downwards: AUC@10° improves by 6.6 points on PhotoTourism, and on ScanNet it lets SuperPoint and SuperGlue reach accuracy similar to the detector-free LoFTR.
    Abstract We propose AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem by employing single-point minimal solvers. AffineGlue selects potential matches from one-to-many correspondences to estimate minimal models. Guided matching is then used to find matches consistent with the model, suffering less from the ambiguities of one-to-one matches. Moreover, we derive a new minimal solver for homography estimation, requiring only a single affine correspondence (AC) and a gravity prior. Furthermore, we train a neural network to reject ACs that are unlikely to lead to a good model. AffineGlue is superior to the SOTA on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10° score is improved by 6.6 points compared to the SOTA. On ScanNet, AffineGlue makes SuperPoint and SuperGlue achieve similar accuracy as the detector-free LoFTR.

Prompt Guided Transformer for Multi-Task Dense Prediction

  • paper_url: http://arxiv.org/abs/2307.15362
  • repo_url: None
  • paper_authors: Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu
  • for: This paper targets the trade-off between performance and parameter count in task-conditional architectures for multi-task dense prediction, proposing the simple and lightweight Prompt Guided Transformer (PGT).
  • methods: A Prompt-conditioned Transformer block injects task-specific prompts into the self-attention mechanism for global dependency modeling and parameter-efficient feature adaptation (a sketch follows the abstract); a lightweight decoder further reduces parameter usage to 2.7% of the total model.
  • results: Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, show state-of-the-art results among task-conditional methods with fewer parameters, maintaining a strong balance between performance and parameter size.
    Abstract Task-conditional architecture offers advantage in parameter efficiency but falls short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to optimize this challenge. Our approach designs a Prompt-conditioned Transformer block, which incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder to further reduce parameter usage, which accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and maintains a significant balance between performance and parameter size.
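The prompt-conditioned attention block is easy to picture: learnable per-task prompt tokens join the token sequence before self-attention and are dropped afterwards. A minimal sketch, with sizes and the exact injection point assumed:

```python
import torch
import torch.nn as nn

class PromptConditionedBlock(nn.Module):
    def __init__(self, dim, num_tasks, prompt_len=4, heads=8):
        super().__init__()
        # one small bank of learnable prompt tokens per task
        self.prompts = nn.Parameter(torch.randn(num_tasks, prompt_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, task_id):            # tokens: (batch, n, dim)
        p = self.prompts[task_id].expand(tokens.shape[0], -1, -1)
        x = torch.cat([p, tokens], dim=1)          # prompts attend jointly with features
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        return x[:, p.shape[1]:]                   # drop prompts, keep task-adapted features

block = PromptConditionedBlock(dim=256, num_tasks=4)
print(block(torch.randn(2, 196, 256), task_id=0).shape)  # torch.Size([2, 196, 256])
```

Because only the small prompt banks differ per task, the parameter overhead stays far below one decoder per task.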

Supervised Homography Learning with Realistic Dataset Generation

  • paper_url: http://arxiv.org/abs/2307.15353
  • repo_url: https://github.com/jianghaiscu/realsh
  • paper_authors: Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu
  • for: An iterative framework with a generation phase and a training phase that produces realistic training data and yields a supervised homography network.
  • methods: Given an unlabeled image pair, the pre-estimated dominant-plane masks and homography, together with a sampled ground-truth homography, generate a new labeled training pair with realistic motion (see the sketch after the abstract); the supervised network is then trained on this data, refined via a content-consistency module and a quality-assessment module, and used to update the pre-estimated homography in the next iteration.
  • results: State-of-the-art performance, and existing supervised methods also improve when trained on the generated dataset. Code and dataset: https://github.com/JianghaiSCU/RealSH.
    Abstract In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and existing supervised methods can be also improved based on the generated dataset. Code and dataset are available at https://github.com/JianghaiSCU/RealSH.
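The generation phase boils down to warping an image with a sampled homography so that the pair comes with an exact label. A stripped-down sketch (the paper additionally blends in pre-estimated homographies and dominant-plane masks to keep the motion realistic):

```python
import cv2
import numpy as np

def make_labeled_pair(img, max_shift=32):
    """Perturb the four corners, fit the ground-truth homography H, and
    warp: (img, warped, H) is then a perfectly labeled training pair."""
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-max_shift, max_shift, src.shape).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h)), H

img = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
warped, H = make_labeled_pair(img)
```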

The Radon Signed Cumulative Distribution Transform and its applications in classification of Signed Images

  • paper_url: http://arxiv.org/abs/2307.15339
  • repo_url: https://github.com/rohdelab/PyTransKit
  • paper_authors: Le Gong, Shiying Li, Naqib Sad Pathan, Mohammad Shifat-E-Rabbi, Gustavo K. Rohde, Abu Hasnat Mohammad Rubaiyat, Sumati Thareja
  • for: A new image representation technique grounded in the mathematics of transport and optimal transport.
  • methods: The method combines the Radon transform with the Signed Cumulative Distribution Transform, generalizing earlier transport-based representations to arbitrary functions (images) and hence to more applications (the mechanics are sketched after the abstract).
  • results: On real and simulated data, the new transform represents the information content of signed images more accurately than existing transport transforms and deep learning-based classifiers, yielding higher classification accuracy.
    Abstract Here we describe a new image representation technique based on the mathematics of transport and optimal transport. The method relies on the combination of the well-known Radon transform for images and a recent signal representation method called the Signed Cumulative Distribution Transform. The newly proposed method generalizes previous transport-related image representation methods to arbitrary functions (images), and thus can be used in more applications. We describe the new transform, and some of its mathematical properties and demonstrate its ability to partition image classes with real and simulated data. In comparison to existing transport transform methods, as well as deep learning-based classification methods, the new transform more accurately represents the information content of signed images, and thus can be used to obtain higher classification accuracies. The implementation of the proposed method in Python language is integrated as a part of the software package PyTransKit, available on Github.
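Mechanically, the transform splits a signed 1D signal into positive and negative parts, applies the Cumulative Distribution Transform (inverse-CDF sampling) to each, and does this for every Radon projection of the image. The sketch below shows only these mechanics; the published transform's normalisation, reference measure, and handling of the total masses are more careful.

```python
import numpy as np
from skimage.transform import radon

def cdt(signal, x, n_ref=128):
    """CDT of a strictly positive 1D signal: the inverse CDF of the
    normalised signal, sampled on a uniform grid, plus the total mass."""
    cdf = np.cumsum(signal)
    cdf = cdf / cdf[-1]
    return np.interp(np.linspace(0, 1, n_ref), cdf, x), signal.sum()

def signed_cdt(signal, x):
    """Signed CDT: transform the positive and negative parts separately."""
    pos = np.clip(signal, 0, None) + 1e-8
    neg = np.clip(-signal, 0, None) + 1e-8
    return cdt(pos, x), cdt(neg, x)

def radon_scdt(image, angles=tuple(range(0, 180, 4))):
    """Radon-SCDT sketch: the signed CDT of every Radon projection."""
    sino = radon(image, theta=np.asarray(angles, dtype=float), circle=False)
    x = np.arange(sino.shape[0], dtype=float)
    return [signed_cdt(sino[:, i], x) for i in range(sino.shape[1])]
```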

Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF

  • paper_url: http://arxiv.org/abs/2307.15333
  • repo_url: None
  • paper_authors: Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang
  • for: Improving the training and inference efficiency of explicit NeRF representations for applications such as virtual reality and gaming.
  • methods: Dynamic PlenOctree (DOT), a hierarchical octree representation that adaptively refines the sample distribution as scene complexity changes, using hierarchical feature fusion with octree sampling and pruning for rapid parameter learning.
  • results: Compared with PlenOctree, DOT improves visual quality, removes over 55.15%/68.84% of parameters, and delivers 1.7x/1.9x FPS on NeRF-synthetic and Tanks & Temples, respectively.
    Abstract The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction such as virtual reality and gaming. In particular, PlenOctree (POT)[1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree DOT, which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT outperforms it by enhancing visual quality, reducing over $55.15$/$68.84\%$ parameters, and providing 1.7/1.9 times FPS for NeRF-synthetic and Tanks $\&$ Temples, respectively. Project homepage:https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation

  • paper_url: http://arxiv.org/abs/2307.15326
  • repo_url: None
  • paper_authors: Yueh-Ning Ku, Mikhail Kuznetsov, Shaunak Mishra, Paloma de Juan
  • for: Enhancing un-staged product images for dynamic product ads (DPA) on e-commerce platforms using generative adversarial networks (GANs).
  • methods: A retrieval-assisted copy-paste staging approach: retrieve staged products from the catalog that are similar to the un-staged input, copy-paste the retrieved background into the input image, and fill the remaining holes with a GAN-based in-painting model (a simplified sketch follows the abstract).
  • results: Offline metrics and human evaluation show the copy-paste staging method is effective; the approach also enables animating products, turning a product image into a video ad.
    Abstract Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be in the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product in the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics, and human evaluation. In addition, we show how our staging approach can enable animations of moving products leading to a video ad from a product image.
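The copy-paste staging step itself is simple image surgery; only the hole-filling needs a learned model. In the sketch below, cv2.inpaint stands in for the paper's GAN-based in-painting network, and boolean foreground masks are assumed to be available.

```python
import cv2
import numpy as np

def copy_paste_stage(product_rgb, product_mask, staged_rgb, staged_mask):
    """product_mask/staged_mask: boolean (H, W) foreground masks. Keep the
    retrieved image's background, paste the input product on top, and
    inpaint the pixels where the retrieved product used to be."""
    out = staged_rgb.copy()
    out[product_mask] = product_rgb[product_mask]
    holes = (staged_mask & ~product_mask).astype(np.uint8) * 255
    return cv2.inpaint(out, holes, 3, cv2.INPAINT_TELEA)
```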

TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts

  • paper_url: http://arxiv.org/abs/2307.15324
  • repo_url: https://github.com/prismformore/multi-task-transformer
  • paper_authors: Hanrong Ye, Dan Xu
  • for: Addressing the problem that decoding multiple task-specific features from one fully shared backbone feature leads to a static, less discriminative decoding process in multi-task learning.
  • methods: TaskExpert, a multi-task mixture-of-experts model: expert networks decompose the backbone feature into several representative task-generic feature spaces, and dynamic task-specific gating networks assemble the task-specific features (a sketch follows the abstract); a multi-task feature memory adds long-range modeling across layers.
  • results: TaskExpert outperforms previous best-performing methods on all 9 metrics of two competitive multi-task benchmarks, PASCAL-Context and NYUD-v2.
    Abstract Learning discriminative task-specific features simultaneously for multiple distinct tasks is a fundamental problem in multi-task learning. Recent state-of-the-art models consider directly decoding task-specific features from one shared task-generic feature (e.g., feature from a backbone layer), and utilize carefully designed decoders to produce multi-task features. However, as the input feature is fully shared and each task decoder also shares decoding parameters for different input samples, it leads to a static feature decoding process, producing less discriminative task-specific representations. To tackle this limitation, we propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces and decoding task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. Then, the task-specific features are decoded by using dynamic task-specific gating networks operating on the decomposed task-generic features. Furthermore, to establish long-range modeling of the task-specific representations from different layers of TaskExpert, we design a multi-task feature memory that updates at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. Extensive experiments demonstrate that our TaskExpert clearly outperforms previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks for visual scene understanding (i.e., PASCAL-Context and NYUD-v2). Codes and models will be made publicly available at https://github.com/prismformore/Multi-Task-Transformer
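The core mechanism, experts that decompose a shared feature plus per-task, per-sample gates that reassemble it, fits in a short module. Sizes, linear experts, and the gating input are simplifying assumptions; the paper additionally feeds a multi-task feature memory into the decoding.

```python
import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    def __init__(self, dim, num_experts=4, num_tasks=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gates = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))

    def forward(self, feat, task_id):                      # feat: (batch, dim)
        expert_feats = torch.stack([e(feat) for e in self.experts], dim=1)
        gate = torch.softmax(self.gates[task_id](feat), dim=-1)   # depends on the sample
        return (gate.unsqueeze(-1) * expert_feats).sum(dim=1)

moe = TaskMoE(dim=256)
print(moe(torch.randn(8, 256), task_id=1).shape)  # torch.Size([8, 256])
```

The gate is a function of the input feature, which is exactly what makes the decoding dynamic rather than static.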

DocDeshadower: Frequency-aware Transformer for Document Shadow Removal

  • paper_url: http://arxiv.org/abs/2307.15318
  • repo_url: None
  • paper_authors: Shenghong Luo, Ruifeng Xu, Xuhang Chen, Zinuo Li, Chi-Man Pun, Shuqiang Wang
  • for: Removing shadows from scanned documents more effectively.
  • methods: A multi-frequency Transformer built on a Laplacian pyramid (the decomposition is sketched after the abstract): an Attention-Aggregation Network removes shadows in the low-frequency part, while a Gated Multi-scale Fusion Transformer refines the entire image at a global scale.
  • results: Outperforms state-of-the-art methods both qualitatively and quantitatively.
    Abstract The presence of shadows significantly impacts the visual quality of scanned documents. However, the existing traditional techniques and deep learning methods used for shadow removal have several limitations. These methods either rely heavily on heuristics, resulting in suboptimal performance, or require large datasets to learn shadow-related features. In this study, we propose the DocDeshadower, a multi-frequency Transformer-based model built on Laplacian Pyramid. DocDeshadower is designed to remove shadows at different frequencies in a coarse-to-fine manner. To achieve this, we decompose the shadow image into different frequency bands using Laplacian Pyramid. In addition, we introduce two novel components to this model: the Attention-Aggregation Network and the Gated Multi-scale Fusion Transformer. The Attention-Aggregation Network is designed to remove shadows in the low-frequency part of the image, whereas the Gated Multi-scale Fusion Transformer refines the entire image at a global scale with its large perceptive field. Our extensive experiments demonstrate that DocDeshadower outperforms the current state-of-the-art methods in both qualitative and quantitative terms.
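The frequency split that everything else hangs off is a plain Laplacian pyramid; a minimal version with OpenCV (float arrays assumed so the residuals keep their sign):

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Return `levels` high-frequency bands plus the low-frequency base.
    DocDeshadower's Attention-Aggregation Network works on the low-frequency
    part, and its fusion transformer refines the recombined result globally."""
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))
    bands = [g - cv2.pyrUp(gn, dstsize=g.shape[1::-1])     # signed residuals
             for g, gn in zip(gauss[:-1], gauss[1:])]
    return bands, gauss[-1]
```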

Attentive Multimodal Fusion for Optical and Scene Flow

  • paper_url: http://arxiv.org/abs/2307.15301
  • repo_url: https://github.com/jiesico/fusionraft
  • paper_authors: Youjie Zhou, Guofeng Mei, Yiming Wang, Fabio Poiesi, Yi Wan
  • for: Estimating optical and scene flow from RGBD data in scenarios where the RGB modality is noisy or captured in dark environments.
  • methods: FusionRAFT, a deep network that fuses the RGB and depth modalities at an early stage, using self- and cross-attention layers at different network levels to build features that exploit the strengths of both sensors.
  • results: Outperforms recent methods on the synthetic Flyingthings3D dataset, generalizes to the real-world KITTI dataset, and is more robust to noise and low-lighting conditions affecting the RGB images.
    Abstract This paper presents an investigation into the estimation of optical and scene flow using RGBD information in scenarios where the RGB modality is affected by noise or captured in dark environments. Existing methods typically rely solely on RGB images or fuse the modalities at later stages, which can result in lower accuracy when the RGB information is unreliable. To address this issue, we propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities (RGB and depth). Our approach incorporates self- and cross-attention layers at different network levels to construct informative features that leverage the strengths of both modalities. Through comparative experiments, we demonstrate that our approach outperforms recent methods in terms of performance on the synthetic dataset Flyingthings3D, as well as the generalization on the real-world dataset KITTI. We illustrate that our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images. We release the code, models and dataset at https://github.com/jiesico/FusionRAFT.

AC-Norm: Effective Tuning for Medical Image Analysis via Affine Collaborative Normalization

  • paper_url: http://arxiv.org/abs/2307.15282
  • repo_url: https://github.com/endoluminalsurgicalvision-imr/acnorm
  • paper_authors: Chuyan Zhang, Yuncheng Yang, Hao Zheng, Yun Gu
  • for: Enhancing the performance of clinical applications with limited annotations under the self-supervised "pretraining-then-finetuning" paradigm.
  • methods: Affine Collaborative Normalization (AC-Norm) uses the trainable affine parameters of batch normalization (BN) layers, which are sensitive indicators of domain information, to dynamically recalibrate target-model channels according to cross-domain channel-wise correlations, without adding extra parameters (an illustrative sketch follows the abstract).
  • results: Up to 4% improvement over vanilla finetuning on diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation, and MRI cardiac segmentation, plus the ability to estimate transferability quickly.
    Abstract Driven by the latest trend towards self-supervised learning (SSL), the paradigm of "pretraining-then-finetuning" has been extensively explored to enhance the performance of clinical applications with limited annotations. Previous literature on model finetuning has mainly focused on regularization terms and specific policy models, while the misalignment of channels between source and target models has not received sufficient attention. In this work, we revisited the dynamics of batch normalization (BN) layers and observed that the trainable affine parameters of BN serve as sensitive indicators of domain information. Therefore, Affine Collaborative Normalization (AC-Norm) is proposed for finetuning, which dynamically recalibrates the channels in the target model according to the cross-domain channel-wise correlations without adding extra parameters. Based on a single-step backpropagation, AC-Norm can also be utilized to measure the transferability of pretrained models. We evaluated AC-Norm against the vanilla finetuning and state-of-the-art fine-tuning methods on transferring diverse pretrained models to the diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation and MRI cardiac segmentation tasks. Extensive experiments demonstrate that AC-Norm unanimously outperforms the vanilla finetuning by up to 4% improvement, even under significant domain shifts where the state-of-the-art methods bring no gains. We also prove the capability of AC-Norm in fast transferability estimation. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/ACNorm.
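A loose proxy for the idea: BN affine scales summarise channel importance per domain, so target channels can be reweighted by how their scales co-vary with the pretrained (source) ones. This sketch only conveys the shape of the mechanism; the paper's actual channel-wise correlation and recalibration rule are more involved, and the names and sigmoid squashing below are assumptions.

```python
import torch

def acnorm_channel_weights(gamma_src, gamma_tgt):
    """gamma_src, gamma_tgt: (C,) BN scale parameters from the source and
    target models. Standardise both, then reward channels whose scales
    agree across domains."""
    gs = (gamma_src - gamma_src.mean()) / (gamma_src.std() + 1e-6)
    gt = (gamma_tgt - gamma_tgt.mean()) / (gamma_tgt.std() + 1e-6)
    return torch.sigmoid(gs * gt)

def recalibrate(feat, gamma_src, gamma_tgt):       # feat: (N, C, H, W)
    return feat * acnorm_channel_weights(gamma_src, gamma_tgt).view(1, -1, 1, 1)
```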

Recovering high-quality FODs from a reduced number of diffusion-weighted images using a model-driven deep learning architecture

  • paper_url: http://arxiv.org/abs/2307.15273
  • repo_url: https://github.com/jbartlett6/sdnet
  • paper_authors: J Bartlett, C E Davey, L A Johnston, J Duan
  • for: Reconstructing high-quality fibre orientation distributions (FODs) from a reduced number of diffusion-weighted images (DWIs) using deep learning, to shorten total imaging time.
  • methods: A spherical deconvolution network, a model-driven architecture that constrains intermediate and output FODs to be consistent with the input DWI signals, plus a fixel classification penalty in the loss that encourages FODs that segment into the correct number of fixels.
  • results: Competitive with the state-of-the-art FOD super-resolution network FOD-Net, and the fixel classification penalty can be tuned to improve metrics that rely on accurately segmented FODs. Code: https://github.com/Jbartlett6/SDNet.
    Abstract Fibre orientation distribution (FOD) reconstruction using deep learning has the potential to produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time. Diffusion acquisition invariant representations of the DWI signals are typically used as input to these methods to ensure that they can be applied flexibly to data with different b-vectors and b-values; however, this means the network cannot condition its output directly on the DWI signal. In this work, we propose a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture, that ensures intermediate and output FODs produced by the network are consistent with the input DWI signals. Furthermore, we implement a fixel classification penalty within our loss function, encouraging the network to produce FODs that can subsequently be segmented into the correct number of fixels and improve downstream fixel-based analysis. Our results show that the model-based deep learning architecture achieves competitive performance compared to a state-of-the-art FOD super-resolution network, FOD-Net. Moreover, we show that the fixel classification penalty can be tuned to offer improved performance with respect to metrics that rely on accurately segmented of FODs. Our code is publicly available at https://github.com/Jbartlett6/SDNet .

Anatomy-Aware Lymph Node Detection in Chest CT using Implicit Station Stratification

  • paper_url: http://arxiv.org/abs/2307.15271
  • repo_url: None
  • paper_authors: Ke Yan, Dakai Jin, Dazhou Guo, Minfeng Xu, Na Shen, Xian-Sheng Hua, Xianghua Ye, Le Lu
  • for: Improving automated detection of abnormal lymph nodes in chest CT, which matters for tasks such as cancer metastasis staging and radiotherapy planning.
  • methods: An end-to-end framework that exploits lymph node station information: a multi-head detector in which each head distinguishes LN from non-LN structures of certain stations, with pseudo station labels generated by an LN station classifier as multi-task supervision during training, so no explicit station-prediction model is needed at inference.
  • results: On 82 lung cancer and 91 esophageal cancer patients, detection sensitivity of thoracic lymph nodes improves from 65.1% to 71.4% and from 80.3% to 85.5% at 2 false positives per patient, significantly outperforming baselines such as nnUNet, nnDetection, and LENS.
    Abstract Finding abnormal lymph nodes in radiological images is highly important for various medical tasks such as cancer metastasis staging and radiotherapy planning. Lymph nodes (LNs) are small glands scattered throughout the body. They are grouped or defined to various LN stations according to their anatomical locations. The CT imaging appearance and context of LNs in different stations vary significantly, posing challenges for automated detection, especially for pathological LNs. Motivated by this observation, we propose a novel end-to-end framework to improve LN detection performance by leveraging their station information. We design a multi-head detector and make each head focus on differentiating the LN and non-LN structures of certain stations. Pseudo station labels are generated by an LN station classifier as a form of multi-task learning during training, so we do not need another explicit LN station prediction model during inference. Our algorithm is evaluated on 82 patients with lung cancer and 91 patients with esophageal cancer. The proposed implicit station stratification method improves the detection sensitivity of thoracic lymph nodes from 65.1% to 71.4% and from 80.3% to 85.5% at 2 false positives per patient on the two datasets, respectively, which significantly outperforms various existing state-of-the-art baseline techniques such as nnUNet, nnDetection and LENS.

RSGPT: A Remote Sensing Vision Language Model and Benchmark

  • paper_url: http://arxiv.org/abs/2307.15266
  • repo_url: None
  • paper_authors: Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li
  • for: Building resources for developing large vision language models (VLMs) tailored to data analysis in the remote sensing (RS) domain.
  • methods: RSICap, a human-annotated remote sensing image captioning dataset of 2,585 captions with rich scene and object descriptions, for training large VLMs; plus RSIEval, a benchmark of human-annotated captions and visual question-answer pairs for evaluation.
  • results: High-quality captioning data and the RSIEval benchmark enable effective development and comprehensive assessment of large VLMs for RS applications.
    Abstract The emergence of large-scale large language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, lacking comprehensive, large-scale image-text datasets that are aligned and suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. These results are comparable to state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this captivating idea, in this work, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that either employ model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.

Learning with Constraint Learning: New Perspective, Solution Strategy and Various Applications

  • paper_url: http://arxiv.org/abs/2307.15257
  • repo_url: None
  • paper_authors: Risheng Liu, Jiaxin Gao, Xuan Liu, Xin Fan
  • for: Solving complex machine learning and computer vision problems under one perspective: GANs and their variants, multi-task and meta-learning, hyper-parameter learning, and various real-world vision applications.
  • methods: Learning with Constraint Learning (LwCL), a general hierarchical optimization model that captures the essence of these diverse problems, together with a gradient-response based fast solution strategy for its optimization challenges.
  • results: LwCL efficiently addresses a wide range of applications spanning three categories and nine problem types; extensive experiments on synthetic tasks and real-world applications verify its effectiveness, bridging the gap between theory and practice.
    Abstract The complexity of learning problems, such as Generative Adversarial Network (GAN) and its variants, multi-task and meta-learning, hyper-parameter learning, and a variety of real-world vision applications, demands a deeper understanding of their underlying coupling mechanisms. Existing approaches often address these problems in isolation, lacking a unified perspective that can reveal commonalities and enable effective solutions. Therefore, in this work, we proposed a new framework, named Learning with Constraint Learning (LwCL), that can holistically examine challenges and provide a unified methodology to tackle all the above-mentioned complex learning and vision problems. Specifically, LwCL is designed as a general hierarchical optimization model that captures the essence of these diverse learning and vision problems. Furthermore, we develop a gradient-response based fast solution strategy to overcome optimization challenges of the LwCL framework. Our proposed framework efficiently addresses a wide range of applications in learning and vision, encompassing three categories and nine different problem types. Extensive experiments on synthetic tasks and real-world applications verify the effectiveness of our approach. The LwCL framework offers a comprehensive solution for tackling complex machine learning and computer vision problems, bridging the gap between theory and practice.

A Solution to Co-occurrence Bias: Attributes Disentanglement via Mutual Information Minimization for Pedestrian Attribute Recognition

  • paper_url: http://arxiv.org/abs/2307.15252
  • repo_url: https://github.com/sdret/a-solution-to-co-occurence-bias-in-pedestrian-attribute-recognition
  • paper_authors: Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Zhenbo Xu, Weiqing Lu, Yuran Cao
  • for: Improving the robustness and generalization of pedestrian attribute recognition, which current methods undermine by overfitting dataset-specific attribute co-occurrence patterns.
  • methods: Attributes-disentangled feature learning, formulated as mutual information minimization, so that recognizing one attribute does not rely on the presence of others (a cheap surrogate for the objective is sketched after the abstract).
  • results: Substantially improves the baseline in realistic settings and establishes state-of-the-art results on datasets such as PETAzs and RAPzs.
    Abstract Recent studies on pedestrian attribute recognition progress with either explicit or implicit modeling of the co-occurrence among attributes. Considering that this known a prior is highly variable and unforeseeable regarding the specific scenarios, we show that current methods can actually suffer in generalizing such fitted attributes interdependencies onto scenes or identities off the dataset distribution, resulting in the underlined bias of attributes co-occurrence. To render models robust in realistic scenes, we propose the attributes-disentangled feature learning to ensure the recognition of an attribute not inferring on the existence of others, and which is sequentially formulated as a problem of mutual information minimization. Rooting from it, practical strategies are devised to efficiently decouple attributes, which substantially improve the baseline and establish state-of-the-art performance on realistic datasets like PETAzs and RAPzs. Code is released on https://github.com/SDret/A-Solution-to-Co-occurence-Bias-in-Pedestrian-Attribute-Recognition.
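Mutual information between continuous features is hard to compute exactly; a common cheap surrogate is to penalise cross-attribute feature correlation. The sketch below is such a surrogate under assumed shapes, not the estimator used in the paper.

```python
import torch

def decorrelation_penalty(attr_feats):
    """attr_feats: (batch, num_attrs, dim) per-attribute embeddings.
    Drives the off-diagonal of the attribute-by-attribute correlation
    matrix toward zero so no attribute's feature predicts another's."""
    f = attr_feats - attr_feats.mean(dim=0, keepdim=True)
    f = f / (f.norm(dim=0, keepdim=True) + 1e-6)               # unit norm over the batch
    corr = torch.einsum("bad,bed->ae", f, f) / f.shape[-1]     # (attrs, attrs), diag ~= 1
    off_diag = corr - torch.diag_embed(torch.diagonal(corr))
    return off_diag.pow(2).mean()

loss = decorrelation_penalty(torch.randn(32, 26, 128))  # e.g. 26 attributes
```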

D2S: Representing local descriptors and global scene coordinates for camera relocalization

  • paper_url: http://arxiv.org/abs/2307.15250
  • repo_url: None
  • paper_authors: Bach-Thuan Bui, Dinh-Tuan Tran, Joo-Ho Lee
  • for: A direct learning-based visual localization method that avoids the high inference, storage, and update costs of matching local descriptors against 3D point clouds.
  • methods: A simple network named D2S represents local descriptors and their scene coordinates, trained with a simple loss function and graph attention that selectively focuses on robust descriptors while disregarding unreliable regions such as clouds, trees, and dynamic objects; testing needs only a single RGB image (a sketch of the regression head follows the abstract). A new outdoor dataset evaluates scene generalization and self-updating from unlabeled observations.
  • results: Outperforms state-of-the-art CNN-based scene coordinate regression indoors and outdoors, and generalizes beyond training data, including day-to-night transitions and domain shifts, even without labeled data sources.
    Abstract State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant cost in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a new outdoor dataset to evaluate the capabilities of visual localization methods in terms of scene generalization and self-updating from unlabeled observations. Our approach outperforms the state-of-the-art CNN-based methods in scene coordinate regression in indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts, even in the absence of the labeled data sources. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s
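The regression target is concrete enough to sketch: each local descriptor maps to a 3D scene coordinate plus a reliability logit for the binary robust-vs-unreliable classification. Layer sizes and the plain MLP (the paper also uses graph attention) are assumptions.

```python
import torch
import torch.nn as nn

class D2SHead(nn.Module):
    def __init__(self, desc_dim=256, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(desc_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))      # xyz + reliability logit

    def forward(self, descs):                    # descs: (num_desc, desc_dim)
        out = self.mlp(descs)
        return out[:, :3], out[:, 3]             # scene coordinates, reliability

coords, reliability = D2SHead()(torch.randn(500, 256))
```

The predicted 2D-3D correspondences, filtered by reliability, would then typically feed a standard PnP-RANSAC pose solver, the usual final step in scene-coordinate relocalization.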

TROPHY: A Topologically Robust Physics-Informed Tracking Framework for Tropical Cyclones

  • paper_url: http://arxiv.org/abs/2307.15243
  • repo_url: None
  • paper_authors: Lin Yan, Hanqi Guo, Thomas Peterka, Bei Wang, Jiali Wang
  • for: A computationally efficient, physics-informed framework for tracking tropical cyclones (TCs) in large-scale climate datasets.
  • methods: During preprocessing, a physics-informed feature selection strategy filters out 90% of critical points that are short-lived and have low stability; during processing, physics-informed constraints restrict the multilevel robustness computation to neighborhoods of TCs.
  • results: On 30 years of 2D wind fields from ERA5 reanalysis data, TROPHY captures TC characteristics comparable to, and sometimes better than, a well-validated tracker that requires multiple dynamic and thermodynamic scalar fields.
    Abstract Tropical cyclones (TCs) are among the most destructive weather systems. Realistically and efficiently detecting and tracking TCs are critical for assessing their impacts and risks. Recently, a multilevel robustness framework has been introduced to study the critical points of time-varying vector fields. The framework quantifies the robustness of critical points across varying neighborhoods. By relating the multilevel robustness with critical point tracking, the framework has demonstrated its potential in cyclone tracking. An advantage is that it identifies cyclonic features using only 2D wind vector fields, which is encouraging as most tracking algorithms require multiple dynamic and thermodynamic variables at different altitudes. A disadvantage is that the framework does not scale well computationally for datasets containing a large number of cyclones. This paper introduces a topologically robust physics-informed tracking framework (TROPHY) for TC tracking. The main idea is to integrate physical knowledge of TC to drastically improve the computational efficiency of multilevel robustness framework for large-scale climate datasets. First, during preprocessing, we propose a physics-informed feature selection strategy to filter 90% of critical points that are short-lived and have low stability, thus preserving good candidates for TC tracking. Second, during in-processing, we impose constraints during the multilevel robustness computation to focus only on physics-informed neighborhoods of TCs. We apply TROPHY to 30 years of 2D wind fields from reanalysis data in ERA5 and generate a number of TC tracks. In comparison with the observed tracks, we demonstrate that TROPHY can capture TC characteristics that are comparable to and sometimes even better than a well-validated TC tracking algorithm that requires multiple dynamic and thermodynamic scalar fields.

Fast Dust Sand Image Enhancement Based on Color Correction and New Membership Function

  • paper_url: http://arxiv.org/abs/2307.15230
  • repo_url: None
  • paper_authors: Ali Hakem Alsaeedi, Suha Mohammed Hadi, Yarub Alazzawi
  • for: Enhancing the visibility and quality of images captured in dusty environments.
  • methods: Color-shift correction using a new membership function to adjust the U and V values in the YUV color space, haze removal via an Adaptive Dark Channel Prior (A-DCP), and contrast/brightness enhancement based on Contrast Limited Adaptive Histogram Equalization (CLAHE); a simplified sketch follows the abstract.
  • results: Removes red and yellow casts more effectively than existing studies and yields high-quality dust images.
    Abstract Images captured in dusty environments suffer from poor visibility and quality. Enhancement of such images, for example sand dust images, plays a critical role in various atmospheric optics applications. In this work, we propose a new model based on color correction and a new membership function to enhance sand dust images. The proposed model consists of three phases: correction of color shift, removal of haze, and enhancement of contrast and brightness. The color shift is corrected using a new membership function to adjust the values of U and V in the YUV color space. The Adaptive Dark Channel Prior (A-DCP) is used for haze removal. The stretching of contrast and improvement of image brightness are based on Contrast Limited Adaptive Histogram Equalization (CLAHE). The proposed model is tested and evaluated on many real sand dust images. The experimental results show that the proposed solution outperforms current studies in effectively removing the red and yellow cast and provides high-quality dust images.
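Two of the three phases are a few lines each; the sketch below recenters the U/V chroma channels as a plain stand-in for the paper's new membership function, skips the A-DCP dehazing step, and applies CLAHE to the luma channel.

```python
import cv2
import numpy as np

def enhance_dust_image(bgr):
    yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    for c in (1, 2):                                  # U, V: pull casts back toward neutral 128
        yuv[:, :, c] += 128.0 - yuv[:, :, c].mean()
    yuv = np.clip(yuv, 0, 255).astype(np.uint8)
    # (haze removal via A-DCP would go here)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    yuv[:, :, 0] = clahe.apply(yuv[:, :, 0])          # contrast/brightness on luma only
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)
```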

Sustainable Transparency in Recommender Systems: Bayesian Ranking of Images for Explainability

  • paper_url: http://arxiv.org/abs/2308.01196
  • repo_url: None
  • paper_authors: Jorge Paz-Ruza, Amparo Alonso-Betanzos, Berta Guijarro-Berdiñas, Brais Cancela, Carlos Eiras-Franco
  • for: Improving transparency and user trust in recommender systems through personalized explanations.
  • methods: BRIE ranks user-created visual content as explanations for recommendations, trained with a Bayesian Pairwise Ranking objective that matches the actual goal of ranking the most effective personalized explanations (the standard BPR loss is sketched after the abstract).
  • results: Consistently superior performance to state-of-the-art models on six real-world datasets, while emitting up to 75% less CO$_2$ during training and inference with a model up to 64 times smaller.
    Abstract Recommender Systems have become crucial in the modern world, commonly guiding users towards relevant content or products, and having a large influence over the decisions of users and citizens. However, ensuring transparency and user trust in these systems remains a challenge; personalized explanations have emerged as a solution, offering justifications for recommendations. Among the existing approaches for generating personalized explanations, using visual content created by the users is one particularly promising option, showing a potential to maximize transparency and user trust. Existing models for explaining recommendations in this context face limitations: sustainability has been a critical concern, as they often require substantial computational resources, leading to significant carbon emissions comparable to the Recommender Systems where they would be integrated. Moreover, most models employ surrogate learning goals that do not align with the objective of ranking the most effective personalized explanations for a given recommendation, leading to a suboptimal learning process and larger model sizes. To address these limitations, we present BRIE, a novel model designed to tackle the existing challenges by adopting a more adequate learning goal based on Bayesian Pairwise Ranking, enabling it to achieve consistently superior performance than state-of-the-art models in six real-world datasets, while exhibiting remarkable efficiency, emitting up to 75% less CO$_2$ during training and inference with a model up to 64 times smaller than previous approaches.
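The learning goal the abstract names is the classical Bayesian Pairwise Ranking objective, which only needs scores for a positive and a negative explanation candidate:

```python
import torch
import torch.nn.functional as F

def bpr_loss(score_pos, score_neg):
    """BPR: maximise the log-probability that a positive candidate (e.g. an
    image the user authored) outranks a negative one as the explanation
    for a given recommendation."""
    return -F.logsigmoid(score_pos - score_neg).mean()

pos, neg = torch.randn(32, requires_grad=True), torch.randn(32)
bpr_loss(pos, neg).backward()
```

Because the loss matches the actual goal of ranking explanations, the model can stay small, which is where the reported CO$_2$ and size savings come from.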

Generative AI for Medical Imaging: extending the MONAI Framework

  • paper_url: http://arxiv.org/abs/2307.15208
  • repo_url: https://github.com/project-monai/generativemodels
  • paper_authors: Walter H. L. Pinaya, Mark S. Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F. da Costa, Ashay Patel, Hyungjin Chung, Can Zhao, Wei Peng, Zelong Liu, Xueyan Mei, Oeslle Lucena, Jong Chul Ye, Sotirios A. Tsaftaris, Prerna Dogra, Andrew Feng, Marc Modat, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
  • for: An open-source platform for training, evaluating, and deploying generative models and related applications in medical imaging.
  • methods: Reproduces state-of-the-art studies in a standardized way across architectures including diffusion models, autoregressive transformers, and GANs, with pre-trained models provided for the community (a generic diffusion training step is sketched after the abstract).
  • results: The models generalize to 2D and 3D scenarios and to medical images of different modalities (CT, MRI, X-ray) and anatomical areas, supporting applications such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction.
    Abstract Recent advances in generative AI have brought incredible breakthroughs in several areas, including medical imaging. These generative models have tremendous potential not only to help safely share medical data via synthetic datasets but also to perform an array of diverse applications, such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction. However, due to the complexity of these models, their implementation and reproducibility can be difficult. This complexity can hinder progress, act as a use barrier, and dissuade the comparison of new methods with existing works. In this study, we present MONAI Generative Models, a freely available open-source platform that allows researchers and developers to easily train, evaluate, and deploy generative models and related applications. Our platform reproduces state-of-the-art studies in a standardised way involving different architectures (such as diffusion models, autoregressive transformers, and GANs), and provides pre-trained models for the community. We have implemented these models in a generalisable fashion, illustrating that their results can be extended to 2D or 3D scenarios, including medical images with different modalities (like CT, MRI, and X-Ray data) and from different anatomical areas. Finally, we adopt a modular and extensible approach, ensuring long-term maintainability and the extension of current applications for future features.
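For the diffusion family, the training objective behind such models is the standard DDPM noise-prediction step. The sketch below is that generic step, not MONAI Generative's own API; `model(noisy, t)` returning the predicted noise is an assumption.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod):
    """x0: clean images (N, C, ...); alphas_cumprod: (T,) noise schedule.
    Sample a timestep, noise the images, and regress the added noise."""
    n = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (n,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(n, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    return F.mse_loss(model(x_t, t), noise)
```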

Small, but important: Traffic light proposals for detecting small traffic lights and beyond

  • paper_url: http://arxiv.org/abs/2307.15191
  • repo_url: None
  • paper_authors: Tom Sanitz, Christian Wilms, Simone Frintrop
  • for: Improving detection accuracy for small and tiny traffic lights, which existing systems often overlook.
  • methods: A new detection system comprising a traffic light proposal generator that builds on general object proposal generation, fine-grained multi-scale features, and attention for efficient processing, plus a new detection head for classifying and refining the proposals.
  • results: Evaluated on three publicly available datasets against six methods: at least a 12.6% improvement on small and tiny traffic lights, with strong results across all traffic light sizes.
    Abstract Traffic light detection is a challenging problem in the context of self-driving cars and driver assistance systems. While most existing systems produce good results on large traffic lights, detecting small and tiny ones is often overlooked. A key problem here is the inherent downsampling in CNNs, leading to low-resolution features for detection. To mitigate this problem, we propose a new traffic light detection system, comprising a novel traffic light proposal generator that utilizes findings from general object proposal generation, fine-grained multi-scale features, and attention for efficient processing. Moreover, we design a new detection head for classifying and refining our proposals. We evaluate our system on three challenging, publicly available datasets and compare it against six methods. The results show substantial improvements of at least $12.6\%$ on small and tiny traffic lights, as well as strong results across all sizes of traffic lights.
    摘要 交通灯检测是自动驾驶汽车和驾驶辅助系统中的一个挑战性问题。大多数现有系统在大型交通灯上能取得良好结果,但对小型和微型交通灯的检测往往被忽视。这里的关键问题在于卷积神经网络中固有的下采样,导致用于检测的特征分辨率过低。为缓解这一问题,我们提出了一个新的交通灯检测系统,包括一个新的交通灯提案生成器,它利用了通用物体提案生成的研究成果、细粒度多尺度特征以及注意力机制以实现高效处理。此外,我们还设计了一个新的检测头,用于对提案进行分类和精细修正。我们在三个具有挑战性的公开数据集上评估了该系统,并与六种方法进行比较。结果显示,该系统在小型和微型交通灯上取得了至少 12.6% 的提升,并在所有尺寸的交通灯上都有出色表现。
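
A quick note on the method above: the gains on small traffic lights come from fusing fine-grained multi-scale features with attention. The following is a minimal, hypothetical sketch of attention-weighted fusion across pyramid levels; the module name and design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttentionFusion(nn.Module):
    """Fuse multi-scale feature maps with per-pixel attention over scales.

    Hypothetical sketch: each pyramid level is resized to the finest
    resolution, scored by a shared 1x1 conv, and the levels are combined
    with a softmax over the scale dimension.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, features):  # list of (B, C, H_i, W_i), coarse to fine
        target = features[-1].shape[-2:]  # finest resolution
        resized = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in features]
        stack = torch.stack(resized, dim=1)             # (B, S, C, H, W)
        logits = torch.stack([self.score(f) for f in resized], dim=1)
        weights = logits.softmax(dim=1)                 # attention over scales
        return (weights * stack).sum(dim=1)             # (B, C, H, W)

if __name__ == "__main__":
    fpn = [torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32),
           torch.randn(1, 64, 64, 64)]
    fused = ScaleAttentionFusion(64)(fpn)
    print(fused.shape)  # torch.Size([1, 64, 64, 64])
```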

EnSolver: Uncertainty-Aware CAPTCHA Solver Using Deep Ensembles

  • paper_url: http://arxiv.org/abs/2307.15180
  • repo_url: https://github.com/hoangcongduc/ensolver
  • paper_authors: Duc C. Hoang, Cuong V. Nguyen, Amin Kharraz
  • for: 保护网站从自动化机器人的攻击中,通过使用文本基于的 CAPTCHA 安全机制。
  • methods: 使用深度学习技术建立 CAPTCHA 解决器,并使用深度ensemble不确定性估计来检测和跳过不符合预期的样本。
  • results: 使用目标检测模型的实验结果表明,EnSolver 在分布内与分布外数据上均表现良好:检测分布外数据的准确率最高达 98.1%,求解分布内 CAPTCHA 的成功率最高达 93%。
    Abstract The popularity of text-based CAPTCHA as a security mechanism to protect websites from automated bots has prompted researches in CAPTCHA solvers, with the aim of understanding its failure cases and subsequently making CAPTCHAs more secure. Recently proposed solvers, built on advances in deep learning, are able to crack even the very challenging CAPTCHAs with high accuracy. However, these solvers often perform poorly on out-of-distribution samples that contain visual features different from those in the training set. Furthermore, they lack the ability to detect and avoid such samples, making them susceptible to being locked out by defense systems after a certain number of failed attempts. In this paper, we propose EnSolver, a novel CAPTCHA solver that utilizes deep ensemble uncertainty estimation to detect and skip out-of-distribution CAPTCHAs, making it harder to be detected. We demonstrate the use of our solver with object detection models and show empirically that it performs well on both in-distribution and out-of-distribution data, achieving up to 98.1% accuracy when detecting out-of-distribution data and up to 93% success rate when solving in-distribution CAPTCHAs.
    摘要 基于文本的 CAPTCHA 作为保护网站免受自动化机器人攻击的安全机制广受欢迎,这促使研究人员开发 CAPTCHA 求解器,以了解其失效情形,进而使 CAPTCHA 更加安全。最近提出的基于深度学习的求解器,能够以高准确率破解甚至非常困难的 CAPTCHA。然而,这些求解器在包含与训练集不同视觉特征的分布外样本上往往表现不佳,并且缺乏检测和规避此类样本的能力,因而在若干次失败尝试后容易被防御系统锁定。在本文中,我们提出 EnSolver,一种新的 CAPTCHA 求解器,它利用深度集成的不确定性估计来检测并跳过分布外的 CAPTCHA,从而更难被察觉。我们基于目标检测模型实现该求解器,并通过实验证明其在分布内和分布外数据上均表现良好:检测分布外数据的准确率最高达 98.1%,求解分布内 CAPTCHA 的成功率最高达 93%。
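
The core mechanism in EnSolver, abstaining when a deep ensemble disagrees, is easy to illustrate. The sketch below is a hedged toy version: `models` is a list of independently trained solvers, and the agreement threshold is a made-up heuristic rather than the paper's calibrated criterion.

```python
import torch

@torch.no_grad()
def ensemble_solve_or_skip(models, x, agreement_threshold=0.8):
    """Return predictions, or -1 to skip likely out-of-distribution inputs.

    Illustrative sketch: `models` is a deep ensemble of classifiers; we
    abstain wherever the members' top-1 votes agree less than
    `agreement_threshold` (an assumed heuristic, not the paper's rule).
    """
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, B, K)
    votes = probs.argmax(dim=-1)                                 # (M, B)
    majority, _ = votes.mode(dim=0)                              # (B,)
    agreement = (votes == majority).float().mean(dim=0)          # (B,)
    preds = probs.mean(dim=0).argmax(dim=-1)
    # Skip (-1) wherever the ensemble is too uncertain to answer safely.
    return torch.where(agreement >= agreement_threshold, preds,
                       torch.full_like(preds, -1))
```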

R-LPIPS: An Adversarially Robust Perceptual Similarity Metric

  • paper_url: http://arxiv.org/abs/2307.15157
  • repo_url: https://github.com/saraghazanfari/r-lpips
  • paper_authors: Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo
  • for: 本研究旨在提出一种robust learned perceptual image patch similarity(R-LPIPS)度量,以提高图像相似度评估中的安全性。
  • methods: 该度量使用对抗训练得到的深度特征,并通过一系列实验证明其比经典 LPIPS 度量更加稳健可靠。
  • results: 研究表明,R-LPIPS 度量能够更好地抗击 adversarial examples,并且在大规模应用中具有更高的安全性。
    Abstract Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. The code is available at https://github.com/SaraGhazanfari/R-LPIPS.
    摘要 相似度度量在计算机视觉中对捕捉图像的底层语义起着重要作用。近年来出现了更先进的相似度度量,例如学习感知图像块相似度(LPIPS)。这些度量利用训练好的神经网络提取的深度特征,在评估图像相对相似性时表现出与人类感知高度一致的能力。然而,众所周知,神经网络容易受到对抗样本的影响,即人类不可察觉、却能蓄意误导模型的微小扰动;因此 LPIPS 度量同样对此类对抗样本敏感。考虑到 LPIPS 在大规模应用中的广泛采用,这种脆弱性带来了严重的安全隐患。在本文中,我们提出鲁棒的学习感知图像块相似度(R-LPIPS)度量,一种利用对抗训练深度特征的新度量。通过一整套全面的实验,我们证明了 R-LPIPS 相比经典 LPIPS 度量的优越性。代码可以在 https://github.com/SaraGhazanfari/R-LPIPS 找到。

R-Block: Regularized Block of Dropout for convolutional networks

  • paper_url: http://arxiv.org/abs/2307.15150
  • repo_url: None
  • paper_authors: Liqi Wang, Qiya Hu
  • for: 本研究旨在提出一种基于相互学习的卷积层正则化技术,以提高卷积神经网络的性能。
  • methods: 本研究使用了一种名为 R-Block 的相互学习训练策略,迫使两个不同采样得到的子模型输出彼此一致。具体来说,R-Block 最小化训练集中每个样本在两个具有不同丢弃区域的子模型之间的输出分布差异。我们还提出了两种构建子模型的方法。
  • results: 我们的实验结果显示,R-Block在比较其他结构化抛出变体时表现更好,而且我们的子模型构建方法也超过了其他方法。
    Abstract Dropout as a regularization technique is widely used in fully connected layers while is less effective in convolutional layers. Therefore more structured forms of dropout have been proposed to regularize convolutional networks. The disadvantage of these methods is that the randomness introduced causes inconsistency between training and inference. In this paper, we apply a mutual learning training strategy for convolutional layer regularization, namely R-Block, which forces two outputs of the generated difference maximizing sub models to be consistent with each other. Concretely, R-Block minimizes the losses between the output distributions of two sub models with different drop regions for each sample in the training dataset. We design two approaches to construct such sub models. Our experiments demonstrate that R-Block achieves better performance than other existing structured dropout variants. We also demonstrate that our approaches to construct sub models outperforms others.
    摘要 Dropout 作为一种常用的正则化技术广泛用于全连接层,而在卷积层中效果较差。因此,人们提出了更具结构化的 Dropout 变体来正则化卷积网络。这些方法的缺点在于引入的随机性会导致训练与推理之间的不一致。在本文中,我们为卷积层正则化采用了一种相互学习训练策略,即 R-Block,它迫使两个以差异最大化方式生成的子模型的输出彼此一致。具体来说,R-Block 最小化训练集中每个样本在两个具有不同丢弃区域的子模型之间的输出分布差异。我们设计了两种构建此类子模型的方法。实验表明,R-Block 的性能优于其他现有的结构化 Dropout 变体,而且我们构建子模型的方法也优于其他方法。
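
The R-Block objective can be sketched compactly: run the same dropout-equipped network twice so that two different sub-models are sampled, then add a symmetric KL term that makes their output distributions consistent. A minimal sketch under those assumptions follows (the paper's two sub-model construction schemes are more structured than plain dropout).

```python
import torch.nn.functional as F

def r_block_loss(model, x, y, alpha=1.0):
    """Cross-entropy on two stochastic forward passes plus a symmetric
    KL term that forces the two differently-dropped sub-models to agree.

    Sketch only: `model` must contain dropout-style layers so that two
    forward passes in train mode sample two different sub-models.
    """
    logits1, logits2 = model(x), model(x)   # two different drop masks
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    # Symmetric KL between the two output distributions.
    kl = F.kl_div(p1, p2.exp(), reduction="batchmean") \
       + F.kl_div(p2, p1.exp(), reduction="batchmean")
    return ce + alpha * kl
```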

Online Clustered Codebook

  • paper_url: http://arxiv.org/abs/2307.15139
  • repo_url: https://github.com/lyndonzheng/cvq-vae
  • paper_authors: Chuanxia Zheng, Andrea Vedaldi
  • for: 这篇论文旨在提出一种简单的在线码本学习方法,以解决现有 VQ-VAE 中的码本坍塌问题。
  • methods: 这 paper 使用 Clustering VQ-VAE (CVQ-VAE) 方法,选择编码特征作为更新“死亡”的代码Vector 的参考点,同时使用原始损失来优化代码库。
  • results: 该 paper 的 CVQ-VAE 方法可以广泛验证在不同的 dataset、任务(如重建和生成)和架构(如 VQ-VAE、VQGAN、LDM)上,并且可以轻松地与现有模型集成。
    Abstract Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimizing the codevectors in existing VQ-VAE is not entirely trivial. A problem is codebook collapse, where only a small subset of codevectors receive gradients useful for their optimisation, whereas a majority of them simply ``dies off'' and is never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the ``dead'' codevectors, while optimising the codebooks which are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing the likelihood of being chosen and optimized. We extensively validate the generalization capability of our quantiser on various datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM). Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
    摘要 向量量化(VQ)正在机器学习中重新兴起,越来越多地被用于表示学习。然而,在现有的 VQ-VAE 中优化码矢量并非易事。一个问题是码本坍塌:只有一小部分码矢量能得到对其优化有用的梯度,而大多数码矢量则直接"死亡",从不被更新或使用。这限制了 VQ 在需要高容量表示的复杂计算机视觉任务中学习更大码本的有效性。在这篇论文中,我们提出了一种简单的在线码本学习替代方法,即 Clustering VQ-VAE(CVQ-VAE)。我们的方法选择编码特征作为锚点来更新"死亡"的码矢量,同时通过原始损失优化仍然活跃的码本。这一策略使未被使用的码矢量在分布上更接近编码特征,提高了它们被选中和优化的可能性。我们在不同的数据集、任务(例如重建和生成)和架构(例如 VQ-VAE、VQGAN、LDM)上广泛验证了该量化器的泛化能力。我们的 CVQ-VAE 只需几行代码即可轻松集成到现有模型中。
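
The dead-code revival idea in CVQ-VAE can be sketched as follows: track an exponential moving average of code usage and re-anchor rarely used codes on randomly chosen encoded features. This is an illustrative simplification; the threshold, anchor selection, and the commitment/codebook losses (omitted here for brevity) follow the official repo, not this sketch.

```python
import torch

class ClusteredCodebook(torch.nn.Module):
    """Vector quantiser that revives under-used codes (CVQ-VAE-style sketch)."""

    def __init__(self, num_codes=512, dim=64, decay=0.99, dead_threshold=0.03):
        super().__init__()
        self.codes = torch.nn.Parameter(torch.randn(num_codes, dim))
        self.register_buffer("usage", torch.ones(num_codes))
        self.decay, self.dead_threshold = decay, dead_threshold

    def forward(self, z):                       # z: (N, dim) encoder features
        d = torch.cdist(z, self.codes)          # pairwise distances
        idx = d.argmin(dim=1)
        if self.training:
            counts = torch.bincount(idx, minlength=self.codes.shape[0]).float()
            self.usage.mul_(self.decay).add_(counts, alpha=1 - self.decay)
            dead = self.usage < self.dead_threshold   # heuristic cut-off
            if dead.any():
                # Re-anchor dead codes on random encoded features so they
                # re-enter the nearest-neighbour competition.
                anchors = z[torch.randint(len(z), (int(dead.sum()),))]
                self.codes.data[dead] = anchors.detach()
        q = self.codes[idx]
        return z + (q - z).detach(), idx        # straight-through estimator
```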

Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.15131
  • repo_url: https://github.com/windingwind/seal-3d
  • paper_authors: Xiangyu Wang, Jingsen Zhu, Qi Ye, Yuchi Huo, Yunlong Ran, Zhihua Zhong, Jiming Chen
  • for: 这个论文旨在提供一种可交互地编辑神经表示的方法,以便在受限的编辑灵活性、质量和速度等方面提高NeRF编辑的效果。
  • methods: 该方法使用一种新的教师-学生训练策略,结合局部预训练和全局微调,将编辑指令映射到原始 NeRF 模型空间,以实现对编辑指令的直接响应并即时预览编辑效果。
  • results: 该方法可以实现各种编辑效果,并且可以在约1秒钟的交互速度下达到出色的编辑效果。
    Abstract With the popularity of implicit neural representations, or neural radiance fields (NeRF), there is a pressing need for editing methods to interact with the implicit 3D models for tasks like post-processing reconstructed scenes and 3D content creation. While previous works have explored NeRF editing from various perspectives, they are restricted in editing flexibility, quality, and speed, failing to offer direct editing response and instant preview. The key challenge is to conceive a locally editable neural representation that can directly reflect the editing instructions and update instantly. To bridge the gap, we propose a new interactive editing method and system for implicit representations, called Seal-3D, which allows users to edit NeRF models in a pixel-level and free manner with a wide range of NeRF-like backbone and preview the editing effects instantly. To achieve the effects, the challenges are addressed by our proposed proxy function mapping the editing instructions to the original space of NeRF models and a teacher-student training strategy with local pretraining and global finetuning. A NeRF editing system is built to showcase various editing types. Our system can achieve compelling editing effects with an interactive speed of about 1 second.
    摘要 随着隐式神经表示(即神经辐射场,NeRF)的流行,亟需能与隐式三维模型交互的编辑方法,以支持重建场景的后处理和三维内容创作等任务。虽然之前的工作已从多个角度探索了 NeRF 编辑,但它们在编辑灵活性、质量和速度上受限,无法提供直接的编辑响应和即时预览。其关键挑战在于构想一种可局部编辑的神经表示,使其能直接反映编辑指令并即时更新。为弥合这一差距,我们提出了一种新的交互式编辑方法和系统 Seal-3D,允许用户以像素级、自由的方式编辑 NeRF 模型,兼容多种类 NeRF 主干,并即时预览编辑效果。为实现这些效果,我们提出了将编辑指令映射到 NeRF 模型原始空间的代理函数,以及包含局部预训练和全局微调的教师-学生训练策略。我们构建了一个 NeRF 编辑系统来展示多种编辑类型。该系统能以约 1 秒的交互速度实现出色的编辑效果。

End-to-end Remote Sensing Change Detection of Unregistered Bi-temporal Images for Natural Disasters

  • paper_url: http://arxiv.org/abs/2307.15128
  • repo_url: None
  • paper_authors: Guiqin Zhao, Lianlei Shan, Weiqiang Wang
  • for: 本研究旨在针对自然灾害区域内的建筑物损害检测,通过远程感知图像进行检测。
  • methods: 本研究使用了深度网络,并提出了一种无需注册的端到端变化检测网络(E2ECDNet),可以处理不匹配的双时间图像对。
  • results: 实验结果表明,E2ECDNet 能够在未配准的双时相图像对上提供高精度的变化检测结果,并取得显著的性能提升。
    Abstract Change detection based on remote sensing images has been a prominent area of interest in the field of remote sensing. Deep networks have demonstrated significant success in detecting changes in bi-temporal remote sensing images and have found applications in various fields. Given the degradation of natural environments and the frequent occurrence of natural disasters, accurately and swiftly identifying damaged buildings in disaster-stricken areas through remote sensing images holds immense significance. This paper aims to investigate change detection specifically for natural disasters. Considering that existing public datasets used in change detection research are registered, which does not align with the practical scenario where bi-temporal images are not matched, this paper introduces an unregistered end-to-end change detection synthetic dataset called xBD-E2ECD. Furthermore, we propose an end-to-end change detection network named E2ECDNet, which takes an unregistered bi-temporal image pair as input and simultaneously generates the flow field prediction result and the change detection prediction result. It is worth noting that our E2ECDNet also supports change detection for registered image pairs, as registration can be seen as a special case of non-registration. Additionally, this paper redefines the criteria for correctly predicting a positive case and introduces neighborhood-based change detection evaluation metrics. The experimental results have demonstrated significant improvements.
    摘要 基于遥感图像的变化检测一直是遥感领域的研究热点。深度网络在双时相遥感图像的变化检测中取得了显著成功,并已应用于多个领域。鉴于自然环境的退化和自然灾害的频发,通过遥感图像快速而准确地识别受灾地区的受损建筑物具有重大意义。本文旨在研究面向自然灾害的变化检测。考虑到现有用于变化检测研究的公开数据集都是已配准的,与双时相图像并未配准的实际场景不符,本文引入了一个未配准的端到端变化检测合成数据集 xBD-E2ECD。此外,我们提出了一种端到端变化检测网络 E2ECDNet,它以未配准的双时相图像对为输入,同时生成光流场预测结果和变化检测预测结果。值得注意的是,E2ECDNet 同样支持已配准图像对的变化检测,因为配准可视为未配准的特例。此外,本文重新定义了正确预测正例的标准,并引入了基于邻域的变化检测评价指标。实验结果表明了显著的改进。
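
The core operation behind end-to-end change detection on unregistered pairs is aligning one image to the other with a predicted flow field before comparing them. Below is a generic flow-warping sketch using `grid_sample` plus a toy L1 change score; it illustrates the mechanism only and is not E2ECDNet's actual architecture.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img, flow):
    """Warp `img` (B, C, H, W) by a dense flow field (B, 2, H, W) in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img)     # (2, H, W)
    coords = grid.unsqueeze(0) + flow                       # sample positions
    # Normalise to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    norm = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(img, norm, align_corners=True)

def change_map(img1, img2, flow):
    """Toy change score: per-pixel L1 difference after flow alignment."""
    aligned = warp_with_flow(img2, flow)
    return (img1 - aligned).abs().mean(dim=1, keepdim=True)
```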

To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.15063
  • repo_url: https://github.com/MarcBotet/hamlet
  • paper_authors: Marc Botet Colomer, Pier Luigi Dovesi, Theodoros Panagiotakopoulos, Joao Frederico Carvalho, Linus Härenstam-Nielsen, Hossein Azizpour, Hedvig Kjellström, Daniel Cremers, Matteo Poggi
  • for: 该论文旨在解决在部署时出现不可预期的领域变化,如突发天气事件,以实现 semantic segmentation 的在线领域适应。
  • methods: 该论文提出了一种硬件感知的模块化最低成本训练框架(HAMLET),包括硬件感知的反向传播协调代理(HAMT)和专用的领域偏移检测器(LT),以实现实时领域自适应。
  • results: 该论文的方法可以在单块消费级 GPU 上以超过 29 帧/秒的速度同时进行语义分割和领域自适应,并在 OnDA 和 SHIFT 基准上实现了令人鼓舞的准确率与速度权衡。
    Abstract The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29FPS on a single consumer-grade GPU. Our framework's encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.
    摘要 语义分割的在线领域自适应旨在应对部署过程中不可预见的领域变化,例如突发天气事件。然而,暴力自适应带来的高计算成本使这一范式难以应用于真实场景。本文提出 HAMLET,一种面向实时领域自适应的硬件感知模块化最低成本训练框架。该方法包括一个硬件感知的反向传播协调代理(HAMT)和一个专用的领域偏移检测器,用于主动控制模型何时以及如何进行自适应(LT)。得益于这些改进,我们的方法能够在单块消费级 GPU 上以超过 29FPS 的速度同时进行语义分割和自适应。实验结果在 OnDA 和 SHIFT 基准上展示了该框架在准确率与速度之间令人鼓舞的权衡。
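
HAMLET's efficiency rests on adapting only when a shift is detected and restricting which modules receive gradients. The sketch below caricatures that control loop: the confidence-based trigger and the self-training loss are assumptions for illustration; HAMT's hardware-aware orchestration is considerably more sophisticated.

```python
import torch
import torch.nn.functional as F

def online_step(model, optimizer, frame, conf_threshold=0.7):
    """Segment one frame; run a lightweight adaptation step only when the
    mean prediction confidence suggests a domain shift (sketch heuristic)."""
    model.eval()
    with torch.no_grad():
        probs = model(frame).softmax(dim=1)          # (B, C, H, W)
        confidence = probs.max(dim=1).values.mean().item()
    pred = probs.argmax(dim=1)

    if confidence < conf_threshold:                  # crude shift detector
        model.train()
        logits = model(frame)
        # Self-training on the model's own pseudo-labels (assumed loss).
        pseudo = logits.detach().argmax(dim=1)
        loss = F.cross_entropy(logits, pseudo)
        optimizer.zero_grad()
        loss.backward()          # HAMLET further restricts backprop to
        optimizer.step()         # selected modules to save compute
    return pred
```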

Self-Supervised Visual Acoustic Matching

  • paper_url: http://arxiv.org/abs/2307.15064
  • repo_url: None
  • paper_authors: Arjun Somayazulu, Changan Chen, Kristen Grauman
  • for: 用于将音频片段重新合成,使其听起来仿佛是在目标声学环境中录制的,可服务于媒体生成等应用场景。
  • methods: 提出一种自监督方法,仅使用目标场景图像和音频,不需要声学失配的源音频作为参考。通过条件 GAN 框架和一种量化去偏音频中残余声学信息的新指标,学习解耦房间声学特性并重新合成音频。
  • results: 在多个具有挑战性的数据集以及各种真实世界音频和环境上均超越当前最佳方法。
    Abstract Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.
    摘要 声学匹配的目标是重新合成一段音频,使其听起来仿佛是在目标声学环境中录制的。现有方法假设可获得成对的训练数据,即同一音频在源环境和目标环境中都被观测到,但这限制了训练数据的多样性,或需要借助模拟数据或启发式规则来构造成对样本。我们提出了一种自监督的视觉声学匹配方法,训练样本仅包含目标场景图像和音频,而无需声学失配的源音频作为参考。我们的方法通过条件 GAN 框架以及一种量化去偏音频中残余声学信息的新指标,联合学习解耦房间声学特性并将音频重新合成到目标环境中。无论使用真实网络数据还是模拟数据进行训练,我们都证明了该方法在多个具有挑战性的数据集以及各种真实世界音频和环境上超越了当前最佳方法。

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.15061
  • repo_url: https://github.com/ldkong1205/robodepth
  • paper_authors: Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng, Minglei Li, Di Xu, Changpeng Yang, Yuanqi Yao, Gang Wu, Jian Kuai, Xianming Liu, Junjun Jiang, Jiamian Huang, Baojun Li, Jiale Chen, Shuang Zhang, Sun Ao, Zhenyu Li, Runze Chen, Haiyong Luo, Fang Zhao, Jingze Yu
  • for: 提高安全应用中深度估计的可靠性,如在不良天气、传感器故障和噪声污染等情况下提供可靠的深度预测。
  • methods: 使用空间域与频域增强、掩码图像建模、图像修复与超分辨率、对抗训练、基于扩散的噪声抑制、视觉-语言预训练、学习模型集成和层次特征增强等方法,提高深度估计的鲁棒性和可靠性。
  • results: 通过 RoboDepth Challenge 这一学术竞赛,总结了 9 种表现最佳的解决方案,涵盖空间域与频域增强、掩码图像建模、图像修复与超分辨率、对抗训练、基于扩散的噪声抑制、视觉-语言预训练、学习模型集成和层次特征增强等设计,这些方法有效提高了深度估计的可靠性和鲁棒性。
    Abstract Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.
    摘要 在分布外(OoD)场景下(例如恶劣天气、传感器故障和噪声污染)实现准确的深度估计,对安全攸关的应用十分重要。然而,现有的深度估计系统不可避免地受到真实世界中各种损坏和扰动的影响,难以在此类情况下给出可靠的深度预测。本文总结了 RoboDepth Challenge 的获胜方案,该学术竞赛旨在促进和推进鲁棒的分布外深度估计,基于新建立的 KITTI-C 和 NYUDepth2-C 基准。我们设置了两条独立赛道,分别侧重鲁棒的自监督深度估计和鲁棒的全监督深度估计。在两百余名参赛者中,涌现出 9 种独特且表现优异的解决方案,其新颖设计涵盖以下方面:空间域与频域增强、掩码图像建模、图像修复与超分辨率、对抗训练、基于扩散的噪声抑制、视觉-语言预训练、学习模型集成以及层次特征增强。我们进行了广泛的实验分析并给出深入的观察,以更好地理解每种设计背后的原理。我们希望这次竞赛能为未来鲁棒可靠的深度估计及更广泛的研究奠定坚实基础。数据集、竞赛工具箱、研讨会录像以及获胜团队的源代码均已在竞赛网站公开。

MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.15058
  • repo_url: https://github.com/open-air-sun/mars
  • paper_authors: Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, Yuxin Huang, Xiaoyu Ye, Zike Yan, Yongliang Shi, Yiyi Liao, Hao Zhao
  • for: 该论文提出了基于神经辐射场(NeRF)的自动驾驶模拟器,以通过仿真解决剩余的极端情形并提高模拟的真实感。
  • methods: 该模拟器将前景实例和背景环境分别建模为独立的网络,使实例的静态属性(如尺寸和外观)与动态属性(如轨迹)可分别控制;同时采用模块化设计,支持在不同的 NeRF 相关主干、采样策略、输入模态等之间灵活切换。
  • results: 该模拟器在选择最佳模块时取得了最先进的照片级真实感效果,并将开源发布,而多数同类工作并未开源。
    Abstract Nowadays, autonomous cars can drive smoothly in ordinary cases, and it is widely recognized that realistic sensor simulation will play a critical role in solving remaining corner cases by simulating them. To this end, we propose an autonomous driving simulator based upon neural radiance fields (NeRFs). Compared with existing works, ours has three notable features: (1) Instance-aware. Our simulator models the foreground instances and background environments separately with independent networks so that the static (e.g., size and appearance) and dynamic (e.g., trajectory) properties of instances can be controlled separately. (2) Modular. Our simulator allows flexible switching between different modern NeRF-related backbones, sampling strategies, input modalities, etc. We expect this modular design to boost academic progress and industrial deployment of NeRF-based autonomous driving simulation. (3) Realistic. Our simulator set new state-of-the-art photo-realism results given the best module selection. Our simulator will be open-sourced while most of our counterparts are not. Project page: https://open-air-sun.github.io/mars/.
    摘要 现在,自动驾驶车辆已能在常规情形下平稳行驶,业界普遍认为,真实感的传感器仿真将在通过模拟来解决剩余极端情形方面发挥关键作用。为此,我们提出了基于神经辐射场(NeRF)的自动驾驶模拟器。与现有工作相比,我们的模拟器具有以下三个特点:1. 实例感知:模拟器用独立的网络分别建模背景环境和前景实例,以便分别控制实例的静态属性(如尺寸和外观)和动态属性(如轨迹)。2. 模块化:模拟器支持在不同的现代 NeRF 相关主干、采样策略、输入模态等之间灵活切换。我们期望这种模块化设计能够促进 NeRF 自动驾驶仿真的学术进展与工业落地。3. 真实感:在选择最佳模块时,我们的模拟器取得了新的最先进照片级真实感效果。我们的模拟器将开源,而大多数同类工作并未开源。项目页面:https://open-air-sun.github.io/mars/。

PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

  • paper_url: http://arxiv.org/abs/2307.15055
  • repo_url: https://github.com/y-zheng18/point_odyssey
  • paper_authors: Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas
  • for: 本研究旨在提高长期细化跟踪算法的状态前方,强调自然化的运动。
  • methods: 该研究使用真实世界动作捕捉数据来驱动可变形角色,构建与动作捕捉环境匹配的 3D 场景,并使用在真实视频上通过运动恢复结构(structure-from-motion)挖掘得到的相机轨迹来渲染视点。其中,角色外观、动作特征、材质、光照、3D 资产和大气效果均进行了随机化。
  • results: 研究人员通过对现有算法进行修改,使其在PointOdyssey数据集上表现更好,并在两个真实世界benchmark上表现出色。此外,研究人员还提出了一种改进PIPs点跟踪方法,使其在时间上具有更广泛的感知范围,并在PointOdyssey数据集和两个真实世界benchmark上提高了表现。
    Abstract We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos. We create combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Our dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work. We show that existing methods can be trained from scratch in our dataset and outperform the published variants. Finally, we introduce modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks. Our data and code are publicly available at: https://pointodyssey.com
    摘要 我们介绍 PointOdyssey,一个大规模合成数据集及数据生成框架,用于长时程细粒度跟踪算法的训练和评估。我们的目标是通过强调具有自然运动的长视频来推进当前最佳水平。为追求自然性,我们使用真实世界的动作捕捉数据来驱动可变形角色,构建与动作捕捉环境匹配的 3D 场景,并使用在真实视频上通过运动恢复结构(structure-from-motion)挖掘得到的轨迹来渲染相机视点。我们通过随机化角色外观、动作特征、材质、光照、3D 资产和大气效果来构造组合多样性。该数据集目前包含 104 个视频,平均每个 2,000 帧,其对应关系标注比先前工作多出数个数量级。我们展示了现有方法可以在该数据集上从头训练,并优于已发表的版本。最后,我们对 PIPs 点跟踪方法进行了改进,大幅拓宽其时间感受野,从而在 PointOdyssey 以及两个真实世界基准上提升了性能。我们的数据和代码已在 https://pointodyssey.com 公开。

Learning Depth Estimation for Transparent and Mirror Surfaces

  • paper_url: http://arxiv.org/abs/2307.15052
  • repo_url: None
  • paper_authors: Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, Luigi Di Stefano
  • for: 对传感器、算法以及深度网络而言,估计透明或镜面(ToM)表面的深度都是一项困难任务。
  • methods: 我们提出了一个简单的流程,使用神经网络学习正确估计 ToM 表面深度,无需任何真实标注。我们说明了如何获取可靠的伪标签:在图像中对 ToM 物体进行补全(in-painting),再用单目深度估计模型处理补全后的图像。这些标签可用于微调现有的单目或双目网络,使其学会处理 ToM 表面。
  • results: 在Booster数据集上进行实验,我们发现我们的简单提案具有很大的改进作用。
    Abstract Inferring the depth of transparent or mirror (ToM) surfaces represents a hard challenge for either sensors, algorithms, or deep networks. We propose a simple pipeline for learning to estimate depth properly for such surfaces with neural networks, without requiring any ground-truth annotation. We unveil how to obtain reliable pseudo labels by in-painting ToM objects in images and processing them with a monocular depth estimation model. These labels can be used to fine-tune existing monocular or stereo networks, to let them learn how to deal with ToM surfaces. Experimental results on the Booster dataset show the dramatic improvements enabled by our remarkably simple proposal.
    摘要 推断透明或镜面(ToM)表面的深度,对传感器、算法以及深度网络而言都是一项艰巨挑战。我们提出了一个简单的流程,利用神经网络学习正确估计此类表面的深度,且无需任何真实标注。我们说明了如何获取可靠的伪标签:在图像中对 ToM 物体进行补全,并用单目深度估计模型处理补全后的图像。这些标签可用于微调现有的单目或双目网络,使其学会处理 ToM 表面。在 Booster 数据集上的实验结果表明,我们这一极为简单的方案带来了显著的改进。
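
The pseudo-labelling pipeline described above is simple enough to express directly: inpaint the ToM region, run a monocular depth model on the inpainted image, and use the result inside the mask as supervision. In the sketch below, `inpaint_fn` and `depth_fn` are hypothetical placeholders for off-the-shelf models, not specific libraries.

```python
import numpy as np

def tom_pseudo_depth(image, tom_mask, inpaint_fn, depth_fn):
    """Build a pseudo ground-truth depth map for transparent/mirror (ToM)
    surfaces (illustrative sketch of the pipeline, not the authors' code).

    image:    (H, W, 3) float array
    tom_mask: (H, W) bool array, True on ToM pixels
    inpaint_fn, depth_fn: placeholder callables standing in for an
    inpainting model and a monocular depth network (assumptions).
    """
    filled = inpaint_fn(image, tom_mask)    # hallucinate an opaque surface
    depth_filled = depth_fn(filled)         # depth as if surface were opaque
    depth_orig = depth_fn(image)            # depth on untouched pixels
    # Pseudo label: inpainted depth inside the mask, original elsewhere.
    return np.where(tom_mask, depth_filled, depth_orig)
```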

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.15049
  • repo_url: None
  • paper_authors: Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zheng-Jun Zha, Wei Chen, Yujun Shen
  • for: 这个研究想要将预训练的潜在视觉语言模型(VLM)转移到不同的下游任务上。
  • methods: 我们提出了一种新型的调整方法,即正则化掩码调整,它通过可学习的选择器来掩码网络参数。受神经通路的启发,我们认为下游任务所需的知识早已存在于预训练权重之中,只是在上游预训练阶段被掩盖。为了让这些有用的知识重见天日,我们首先确定对给定下游任务重要的一组参数,然后为每个参数附加一个二值掩码,最后在冻结参数的前提下,在下游数据上优化这些掩码。在更新掩码时,我们引入了一种新颖的梯度丢弃策略来正则化参数选择,以防止模型遗忘旧知识并在下游数据上过拟合。
  • results: 我们在 11 个数据集上的实验结果显示,该方法稳定优于先前的替代方案。尤其值得注意的是,仅掩码平均 2.56% 的参数,我们就比零样本 CLIP 取得了 18.73% 的性能提升。此外,该方法可与大多数现有的参数高效调整方法协同,在其基础上进一步提升表现。详情可查看我们的项目页面(https://wuw2019.github.io/R-AMT/)。
    Abstract Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here (https://wuw2019.github.io/R-AMT/).
    摘要 Prompt tuning 和 adapter tuning 在将预训练视觉-语言模型(VLM)迁移到各类下游任务方面展现出了巨大潜力。在本工作中,我们设计了一种新型调整方法,称为正则化掩码调整(Regularized Mask Tuning),它通过可学习的选择器来掩码网络参数。受神经通路的启发,我们认为下游任务所需的知识早已存在于预训练权重之中,只是在上游预训练阶段被掩盖。为了让这些有用的知识重见天日,我们首先确定对给定下游任务重要的一组参数,然后为每个参数附加一个二值掩码,最后在冻结参数的前提下,在下游数据上优化这些掩码。在更新掩码时,我们引入了一种新颖的梯度丢弃策略来正则化参数选择,以防止模型遗忘旧知识并在下游数据上过拟合。在 11 个数据集上的实验结果表明,该方法稳定优于先前的替代方案。值得注意的是,仅掩码平均 2.56% 的参数,我们就比零样本 CLIP 取得了 18.73% 的性能提升。此外,该方法可与大多数现有的参数高效调整方法协同,并在其基础上进一步提升性能。项目页面见 https://wuw2019.github.io/R-AMT/。
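
Learning a binary mask over frozen weights needs a differentiable relaxation; a common construction is a straight-through estimator over per-weight scores, and gradient dropout can be realised with a backward hook. The sketch below shows both mechanisms generically; it is not the paper's exact parameter-selection or regularisation scheme.

```python
import torch

class MaskedLinear(torch.nn.Module):
    """Frozen linear layer whose weights are gated by a learnable binary mask."""

    def __init__(self, weight, bias=None, grad_drop=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(weight, requires_grad=False)  # frozen
        self.bias = bias
        self.scores = torch.nn.Parameter(torch.zeros_like(weight))
        # Gradient dropout (assumed form): randomly zero a fraction of the
        # score gradients so the mask evolves slowly and old knowledge is kept.
        self.scores.register_hook(
            lambda g: g * (torch.rand_like(g) > grad_drop).float())

    def forward(self, x):
        soft = torch.sigmoid(self.scores)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()   # straight-through estimator
        return torch.nn.functional.linear(x, self.weight * mask, self.bias)
```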

A Transformer-based Approach for Arabic Offline Handwritten Text Recognition

  • paper_url: http://arxiv.org/abs/2307.15045
  • repo_url: None
  • paper_authors: Saleh Momeni, Bagher BabaAli
  • for: 本研究旨在提高 Offline 阿拉伯手写文本识别精度。
  • methods: 我们提出了两种新的架构:Transformer Transducer 和标准sequence-to-sequence Transformer,并对其表现进行比较。这两种架构均利用了注意力机制,可以更好地模型语言依赖关系,并且更容易并行化。
  • results: 我们的方法在 Arabic KHATT 数据集上的评估中表现出色,超越了现有的状态之作。
    Abstract Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.
    摘要 手写识别是模式识别与机器学习领域中一个具有挑战性且至关重要的问题,其应用遍及众多领域。在本文中,我们关注离线阿拉伯手写文本识别这一具体问题。现有方法通常结合卷积神经网络进行图像特征提取、循环神经网络进行时序建模,并使用连接时序分类(CTC)生成文本。然而,由于循环神经网络的序列特性,这些方法难以并行化;并且这类模型无法建模语言规则,因而需要在后处理阶段借助外部语言模型来提升准确率。为克服这些问题,我们引入两种替代架构,即 Transformer Transducer 和标准的序列到序列 Transformer,并就准确率和速度对两者进行比较。我们的方法能够建模语言依赖关系,且仅依赖注意力机制,因此更易并行化、复杂度更低。我们将预训练的 Transformer 同时用于图像理解和语言建模。在 Arabic KHATT 数据集上的评估表明,我们提出的方法优于当前最先进的离线阿拉伯手写文本识别方法。

TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis

  • paper_url: http://arxiv.org/abs/2307.15042
  • repo_url: None
  • paper_authors: Zihan Zhang, Richard Liu, Kfir Aberman, Rana Hanocka
  • for: 提出一种基于渐进去噪扩散的动作序列合成模型,以满足长时程动作合成的应用需求。
  • methods: 借鉴去噪扩散概率模型(DDPM)的核心思想,将沿扩散时间轴的渐进去噪映射到动作序列的时间轴上,使噪声水平随帧位置变化。
  • results: 通过实验表明,提出的方法可以生成高质量的动作序列,并且可以应用于人物动画和其他领域。
    Abstract The gradual nature of a diffusion process that synthesizes samples in small increments constitutes a key ingredient of Denoising Diffusion Probabilistic Models (DDPM), which have presented unprecedented quality in image synthesis and been recently explored in the motion domain. In this work, we propose to adapt the gradual diffusion concept (operating along a diffusion time-axis) into the temporal-axis of the motion sequence. Our key idea is to extend the DDPM framework to support temporally varying denoising, thereby entangling the two axes. Using our special formulation, we iteratively denoise a motion buffer that contains a set of increasingly-noised poses, which auto-regressively produces an arbitrarily long stream of frames. With a stationary diffusion time-axis, in each diffusion step we increment only the temporal-axis of the motion such that the framework produces a new, clean frame which is removed from the beginning of the buffer, followed by a newly drawn noise vector that is appended to it. This new mechanism paves the way towards a new framework for long-term motion synthesis with applications to character animation and other domains.
    摘要 扩散过程以小增量逐步合成样本的渐进特性,是去噪扩散概率模型(DDPM)的关键要素;DDPM 在图像合成中展现了前所未有的质量,最近也被探索用于动作领域。在本工作中,我们提出将这种沿扩散时间轴推进的渐进扩散概念迁移到动作序列的时间轴上。我们的核心想法是扩展 DDPM 框架以支持随时间变化的去噪,从而将这两条轴相互纠缠。基于这一特殊的表述,我们迭代地对一个动作缓冲区去噪,该缓冲区包含一组噪声逐渐增大的姿态,以自回归方式产生任意长度的帧流。在扩散时间轴保持静止的情况下,每个扩散步骤只推进动作的时间轴:框架产生一帧全新的干净帧并将其从缓冲区头部移出,随后在尾部追加一个新采样的噪声向量。这一新机制为长时程动作合成开辟了新的框架,可应用于角色动画及其他领域。
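
The temporally entangled buffer is the heart of TEDi and can be mimicked in a few lines: the noise level increases along the buffer, each step denoises the whole buffer once, the now-clean first frame is emitted, and fresh noise is appended. The sketch below uses a placeholder `denoiser`; the real model also conditions on the diffusion schedule and past motion.

```python
import torch

def tedi_stream(denoiser, pose_dim=63, buf_len=50, num_frames=200):
    """Generate an arbitrarily long motion stream with a temporally
    entangled buffer (illustrative sketch, not the authors' code).

    denoiser(buffer, t_levels) -> slightly less noisy buffer, where
    t_levels[i] is the noise level of frame i (increasing along the buffer).
    """
    t_levels = torch.linspace(0.0, 1.0, buf_len)        # clean -> pure noise
    buffer = torch.randn(buf_len, pose_dim)             # warm-up state
    frames = []
    for _ in range(num_frames):
        buffer = denoiser(buffer, t_levels)             # one denoising step
        frames.append(buffer[0].clone())                # frame 0 is now clean
        buffer = torch.cat([buffer[1:],                 # shift the buffer and
                            torch.randn(1, pose_dim)])  # append fresh noise
    return torch.stack(frames)
```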

Detecting Morphing Attacks via Continual Incremental Training

  • paper_url: http://arxiv.org/abs/2307.15105
  • repo_url: None
  • paper_authors: Lorenzo Pellegrini, Guido Borghi, Annalisa Franco, Davide Maltoni
  • for: 这个论文旨在解决数据传输和存储限制下,难以组合单一数据集,以执行批处理训练方法。
  • methods: 这篇论文使用了来自不同数据源的持续学习(CL)方法,模拟一个每当有新数据块(大小可变)到达时即更新模型的场景。
  • results: 实验结果显示,Learning without Forgetting(LwF)方法在这种场景下表现最佳,并且在 Morphing Attack Detection 和 Object Classification 任务中进行了详细的调研和优化。
    Abstract Scenarios in which restrictions in data transfer and storage limit the possibility to compose a single dataset -- also exploiting different data sources -- to perform a batch-based training procedure, make the development of robust models particularly challenging. We hypothesize that the recent Continual Learning (CL) paradigm may represent an effective solution to enable incremental training, even through multiple sites. Indeed, a basic assumption of CL is that once a model has been trained, old data can no longer be used in successive training iterations and in principle can be deleted. Therefore, in this paper, we investigate the performance of different Continual Learning methods in this scenario, simulating a learning model that is updated every time a new chunk of data, even of variable size, is available. Experimental results reveal that a particular CL method, namely Learning without Forgetting (LwF), is one of the best-performing algorithms. Then, we investigate its usage and parametrization in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that became available.
    摘要 在数据传输和存储受限、难以(即使借助不同数据源)组成单一数据集以执行批量训练的场景下,构建鲁棒模型尤为困难。我们假设,最近的持续学习(CL)范式可能是一种有效的解决方案,即使跨多个站点也能实现增量训练。实际上,CL 的一个基本假设是:模型一旦训练完成,旧数据便不能在后续训练迭代中再次使用,原则上可以删除。因此,在本文中,我们研究了不同持续学习方法在该场景下的表现,模拟一个每当有新数据块(大小可变)到达时即更新的学习模型。实验结果表明,一种特定的 CL 方法,即 Learning without Forgetting(LwF),是表现最佳的算法之一。随后,我们研究了它在形变攻击检测(Morphing Attack Detection)和物体分类任务中的使用与参数设置,特别是针对可用新训练数据量的影响。
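
Learning without Forgetting, the best performer here, is a standard recipe: before training on each new chunk, freeze a copy of the model and distil its outputs into the updated model alongside the usual cross-entropy. A common sketch of the objective follows (temperature and weighting are conventional defaults, not values from the paper).

```python
import copy
import torch
import torch.nn.functional as F

def lwf_update(model, old_model, batch_x, batch_y, T=2.0, lam=1.0):
    """One Learning-without-Forgetting loss on a new data chunk."""
    logits = model(batch_x)
    ce = F.cross_entropy(logits, batch_y)            # learn the new chunk
    with torch.no_grad():
        old_logits = old_model(batch_x)              # frozen pre-update copy
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                  F.softmax(old_logits / T, dim=-1),
                  reduction="batchmean") * T * T     # preserve old behaviour
    return ce + lam * kd

# Before each new chunk arrives: old_model = copy.deepcopy(model).eval()
```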

Diverse Inpainting and Editing with GAN Inversion

  • paper_url: http://arxiv.org/abs/2307.15033
  • repo_url: None
  • paper_authors: Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar
  • for: 将被擦除的图像反演到 StyleGAN 的潜在空间,实现逼真的图像修复,并在这些图像上进行多种编辑。
  • methods: 我们提出了一个混合网络,将擦除图像的编码特征与 StyleGAN 从随机样本映射得到的特征相结合,并设计了一种利用生成数据进行训练的新方案。
  • results: 我们的方法与现有的逆向和填充方法进行比较,实现了较好的质量和多样性。
    Abstract Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements.
    摘要 最近的反演方法已经证明,真实图像可以被反演到 StyleGAN 的潜在空间,并且得益于训练良好的 GAN 模型语义丰富的特征表示,可以对这些图像实现多种编辑。然而,大量研究也表明,图像反演在高保真重建与可编辑性之间存在权衡,因而颇具挑战。在这篇论文中,我们面对一项更为困难的任务:将被擦除的图像反演到 GAN 的潜在空间,以实现逼真的修复与编辑。此外,通过用不同的潜在样本增广反演得到的潜在编码,我们实现了多样化的修复。具体来说,我们提出学习一个编码器和混合网络,将擦除图像的编码特征与 StyleGAN 从随机样本映射得到的特征相结合。为了促使混合网络同时利用两类输入,我们通过一种新的设置,用生成数据训练这些网络。我们还利用更高分辨率的特征,以避免修复区域与未擦除区域之间的颜色不一致。我们进行了广泛的实验,并与当前最先进的反演和修复方法进行比较。定性指标和视觉对比均显示出显著的改进。

Adaptive Segmentation Network for Scene Text Detection

  • paper_url: http://arxiv.org/abs/2307.15029
  • repo_url: None
  • paper_authors: Guiqin Zhao
  • for: 提高场景文本检测器的性能,解决手动调整参数的繁琐问题,并且能够处理文本实例的极大比例和方向。
  • methods: 提出自适应分割阈值学习方法,自动区分文本像素和背景像素,并使用全局信息增强的特征金字塔网络(GE-FPN)捕捉宏观尺度和极端长宽比的文本实例。随后,引入级联优化结构进一步精细化文本实例。
  • results: 借助提出的阈值学习策略和文本检测结构,实现了场景文本检测的最先进性能,并通过消融实验验证了各项贡献的有效性。
    Abstract Inspired by deep convolution segmentation algorithms, scene text detectors break the performance ceiling of datasets steadily. However, these methods often encounter threshold selection bottlenecks and have poor performance on text instances with extreme aspect ratios. In this paper, we propose to automatically learn the discriminate segmentation threshold, which distinguishes text pixels from background pixels for segmentation-based scene text detectors and then further reduces the time-consuming manual parameter adjustment. Besides, we design a Global-information Enhanced Feature Pyramid Network (GE-FPN) for capturing text instances with macro size and extreme aspect ratios. Following the GE-FPN, we introduce a cascade optimization structure to further refine the text instances. Finally, together with the proposed threshold learning strategy and text detection structure, we design an Adaptive Segmentation Network (ASNet) for scene text detection. Extensive experiments are carried out to demonstrate that the proposed ASNet can achieve the state-of-the-art performance on four text detection benchmarks, i.e., ICDAR 2015, MSRA-TD500, ICDAR 2017 MLT and CTW1500. The ablation experiments also verify the effectiveness of our contributions.
    摘要 受深度卷积分割算法的启发,场景文本检测器不断突破各数据集上的性能上限。然而,这些方法常常遭遇阈值选择瓶颈,并且在极端长宽比的文本实例上表现不佳。在这篇论文中,我们提出自动学习判别性的分割阈值,用以区分文本像素与背景像素,从而进一步减少基于分割的场景文本检测器耗时的手动参数调整。此外,我们设计了全局信息增强的特征金字塔网络(GE-FPN),用于捕捉宏观尺度和极端长宽比的文本实例。随后,我们引入级联优化结构进一步精细化文本实例。最终,结合提出的阈值学习策略与文本检测结构,我们设计了用于场景文本检测的自适应分割网络(ASNet)。大量实验表明,ASNet 在四个文本检测基准(ICDAR 2015、MSRA-TD500、ICDAR 2017 MLT 和 CTW1500)上均取得最先进的性能。消融实验也验证了我们各项贡献的有效性。
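
A learnable segmentation threshold can be made trainable by replacing the hard comparison with a steep sigmoid, in the spirit of differentiable binarization. The sketch below illustrates that approximation; it is not claimed to be ASNet's exact formulation.

```python
import torch

def soft_binarize(prob_map, threshold_map, k=50.0):
    """Differentiable approximation of `prob_map > threshold`.

    prob_map, threshold_map: (B, 1, H, W). `threshold_map` can itself be a
    network output, so the threshold is learned instead of hand-tuned.
    `k` controls how closely the sigmoid approximates a hard step.
    """
    return torch.sigmoid(k * (prob_map - threshold_map))

# Usage sketch: both maps come from the detection head, and the soft binary
# map is supervised with the usual segmentation loss end-to-end.
binary = soft_binarize(torch.rand(1, 1, 64, 64), torch.full((1, 1, 64, 64), 0.3))
```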

Self-Supervised Graph Transformer for Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.15019
  • repo_url: None
  • paper_authors: Aminollah Khormali, Jiann-Shiun Yuan
  • for: 本研究提出了一种用于检测视频深度伪造的检测框架。
  • methods: 该框架包括一个基于视觉 Transformer 架构的特征提取器、一个与 Transformer 判别器相结合的图卷积网络,以及一个图 Transformer 相关性图。
  • results: 研究人员通过多项具有挑战性的实验,包括分布内测试、跨数据集检测、跨操纵方式泛化以及对常见后期处理扰动的鲁棒性评估,验证了该框架的优异效果,超越了当前最先进方法。
    Abstract Deepfake detection methods have shown promising results in recognizing forgeries within a given dataset, where training and testing take place on the in-distribution dataset. However, their performance deteriorates significantly when presented with unseen samples. As a result, a reliable deepfake detection system must remain impartial to forgery types, appearance, and quality for guaranteed generalizable detection performance. Despite various attempts to enhance cross-dataset generalization, the problem remains challenging, particularly when testing against common post-processing perturbations, such as video compression or blur. Hence, this study introduces a deepfake detection framework, leveraging a self-supervised pre-training model that delivers exceptional generalization ability, withstanding common corruptions and enabling feature explainability. The framework comprises three key components: a feature extractor based on vision Transformer architecture that is pre-trained via self-supervised contrastive learning methodology, a graph convolution network coupled with a Transformer discriminator, and a graph Transformer relevancy map that provides a better understanding of manipulated regions and further explains the model's decision. To assess the effectiveness of the proposed framework, several challenging experiments are conducted, including in-data distribution performance, cross-dataset, cross-manipulation generalization, and robustness against common post-production perturbations. The results achieved demonstrate the remarkable effectiveness of the proposed deepfake detection framework, surpassing the current state-of-the-art approaches.
    摘要 深度伪造检测方法在给定数据集内(训练与测试同分布)识别伪造内容方面已取得可喜成果,然而面对未见样本时其性能会显著下降。因此,一个可靠的深度伪造检测系统必须对伪造类型、外观和质量保持无偏,以保证可泛化的检测性能。尽管已有多种提升跨数据集泛化能力的尝试,该问题仍然充满挑战,尤其是在面对视频压缩或模糊等常见后期处理扰动时。为此,本研究提出了一种深度伪造检测框架,利用自监督预训练模型获得出色的泛化能力,能抵御常见损坏并支持特征可解释性。该框架包含三个关键组件:一个基于视觉 Transformer 架构、通过自监督对比学习方法预训练的特征提取器;一个与 Transformer 判别器相结合的图卷积网络;以及一个图 Transformer 相关性图,用于更好地理解被操纵区域并进一步解释模型的决策。为评估该框架的有效性,我们开展了多项具有挑战性的实验,包括分布内性能、跨数据集与跨操纵方式泛化,以及对常见后期制作扰动的鲁棒性。实验结果表明,所提出的深度伪造检测框架效果显著,超越了当前最先进的方法。

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability

  • paper_url: http://arxiv.org/abs/2307.15007
  • repo_url: None
  • paper_authors: Usha Bhalla, Suraj Srinivas, Himabindu Lakkaraju
  • for: 该论文旨在解释机器学习模型的行为,提高模型的可解释性。
  • methods: 该论文讨论了两种主要的解释策略:事后(post hoc)解释和内在可解释模型。事后解释方法可以解释复杂黑盒模型的行为,但这些解释可能不忠实,而且无法验证;内在可解释模型则将解释显式编码进模型结构,其解释天然忠实且可验证,但表达能力有限,预测性能通常较差。为此,该论文提出了 Verifiability Tuning(VerT)方法,可以将黑盒模型转化为能产生忠实且可验证特征归因的模型。
  • results: 该论文在半合成与真实世界数据集上的实验结果表明,VerT 得到的模型能够产生正确且可验证的解释,同时忠实于其所要解释的原始黑盒模型,并保持其预测性能。
    Abstract With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by highlighting features that are critical to model predictions; however, prior work has shown that these explanations may not be faithful, and even more concerning is our inability to verify them. Specifically, it is nontrivial to evaluate if a given attribution is correct with respect to the underlying model. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful and verifiable, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we aim to bridge the gap between the aforementioned strategies by proposing Verifiability Tuning (VerT), a method that transforms black-box models into models that naturally yield faithful and verifiable feature attributions. We begin by introducing a formal theoretical framework to understand verifiability and show that attributions produced by standard models cannot be verified. We then leverage this framework to propose a method to build verifiable models and feature attributions out of fully trained black-box models. Finally, we perform extensive experiments on semi-synthetic and real-world datasets, and show that VerT produces models that (1) yield explanations that are correct and verifiable and (2) are faithful to the original black-box models they are meant to explain.
    摘要 随着机器学习模型在各种实际应用中的广泛部署,研究人员和实践者们一样强调了模型行为的解释的必要性。为此,先前的文献中提出了两种广泛的解释策略:后期解释方法通过强调模型预测中关键的特征来解释复杂黑盒模型的行为,但是先前的研究表明这些解释可能不准确,甚至更加担忧的是无法确认这些贡献的正确性。而内置可解释模型则通过显式地编码解释到模型建构中,因此其解释自然地准确和可靠,但它们通常具有有限的表达能力,导致预测性能不佳。在这个工作中,我们希望bridge这两种策略的差异,提出一种名为Verifiability Tuning(VerT)的方法,可以将黑盒模型转化为可以自然地生成准确和可靠的特征贡献的模型。我们首先引入了一个正式的理论框架,以理解可靠性的概念,并证明标准模型生成的贡献无法被验证。然后,我们利用这个框架,提出一种建立可靠模型和特征贡献的方法,并在 semi-synthetic 和实际数据集上进行了广泛的实验,结果表明VerT可以生成准确和可靠的特征贡献,同时保持和原始黑盒模型相同的预测性能。
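
One way to operationalise "verifiable" attributions is a prediction-invariance test: keep only the top-attributed features and check that the model's decision survives. The toy check below is a generic construction in that spirit, not the paper's formal verification procedure.

```python
import torch

@torch.no_grad()
def attribution_is_verified(model, x, attribution, keep_ratio=0.2):
    """Toy verification: keep only the top-attributed input features and
    test whether the model's predicted class is unchanged.

    `x` and `attribution` share the same shape (a single unbatched sample);
    the keep ratio is an illustrative assumption.
    """
    flat = attribution.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = flat.topk(k).values.min()
    mask = (attribution >= threshold).float()
    pred_full = model(x.unsqueeze(0)).argmax(dim=-1)
    pred_masked = model((x * mask).unsqueeze(0)).argmax(dim=-1)
    return bool((pred_full == pred_masked).item())
```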

MapNeRF: Incorporating Map Priors into Neural Radiance Fields for Driving View Simulation

  • paper_url: http://arxiv.org/abs/2307.14981
  • repo_url: None
  • paper_authors: Chenming Wu, Jiadai Sun, Zhelun Shen, Liangjun Zhang
  • for: 用于在自动驾驶仿真中模拟摄像头传感器,以测试相机感知算法。
  • methods: 将地图先验与不确定的相机位姿相结合,融入神经辐射场,以确保多视角一致性;具体而言,利用粗糙地面作为不确定信息来监督密度场,并对来自未知相机位姿的深度进行带不确定性的变形。
  • results: 实验结果显示,我们的方法能够在偏离原轨迹的视角下保持语义道路一致性。补充视频见 https://youtu.be/jEQWr-Rfh3A。
    Abstract Simulating camera sensors is a crucial task in autonomous driving. Although neural radiance fields are exceptional at synthesizing photorealistic views in driving simulations, they still fail to generate extrapolated views. This paper proposes to incorporate map priors into neural radiance fields to synthesize out-of-trajectory driving views with semantic road consistency. The key insight is that map information can be utilized as a prior to guiding the training of the radiance fields with uncertainty. Specifically, we utilize the coarse ground surface as uncertain information to supervise the density field and warp depth with uncertainty from unknown camera poses to ensure multi-view consistency. Experimental results demonstrate that our approach can produce semantic consistency in deviated views for vehicle camera simulation. The supplementary video can be viewed at https://youtu.be/jEQWr-Rfh3A.
    摘要 模拟摄像头传感器是自动驾驶中的一项关键任务。尽管神经辐射场在驾驶仿真中合成照片级真实视图方面表现出色,但它们仍无法生成偏离原轨迹的外推视图。本文提出将地图先验融入神经辐射场,以合成具有语义道路一致性的偏离轨迹驾驶视图。其核心思想是,地图信息可作为先验,以带不确定性的方式指导辐射场的训练。具体而言,我们利用粗糙地面作为不确定信息来监督密度场,并对来自未知相机位姿的深度进行带不确定性的变形,以确保多视角一致性。实验结果表明,我们的方法能够在车载相机仿真中为偏离视角保持语义一致性。补充视频可在 https://youtu.be/jEQWr-Rfh3A 查看。

cs.AI - 2023-07-28

Evaluating the structure of cognitive tasks with transfer learning

  • paper_url: http://arxiv.org/abs/2308.02408
  • repo_url: None
  • paper_authors: Bruno Aristimunha, Raphael Y. de Camargo, Walter H. Lopez Pinaya, Sylvain Chevallier, Alexandre Gramfort, Cedric Rommel
  • for: 这项研究旨在考察深度学习表示在不同 EEG 解码任务之间的可迁移性。
  • methods: 研究者使用了最先进的解码模型和两个最新发布的 EEG 数据集(ERP CORE 和 M$^3$CV),涵盖 140 余名被试和 11 种不同的认知任务。他们在一个任务上预训练深度神经网络,然后评估其解码后续任务的能力。
  • results: 研究结果表明,即使仅使用线性探测(linear probing)迁移,也能获得显著提升,相比纯监督方法最高提升 28%。此外,研究者发现某些解码范式会激发特定且狭窄的脑活动,而另一些则受益于在广泛表示上的预训练。这些发现有助于缓解 EEG 解码中的数据稀缺问题;同时,生成的迁移图谱也从神经科学角度揭示了认知任务之间的层次关系。
    Abstract Electroencephalography (EEG) decoding is a challenging task due to the limited availability of labelled data. While transfer learning is a promising technique to address this challenge, it assumes that transferable data domains and task are known, which is not the case in this setting. This study investigates the transferability of deep learning representations between different EEG decoding tasks. We conduct extensive experiments using state-of-the-art decoding models on two recently released EEG datasets, ERP CORE and M$^3$CV, containing over 140 subjects and 11 distinct cognitive tasks. We measure the transferability of learned representations by pre-training deep neural networks on one task and assessing their ability to decode subsequent tasks. Our experiments demonstrate that, even with linear probing transfer, significant improvements in decoding performance can be obtained, with gains of up to 28% compare with the pure supervised approach. Additionally, we discover evidence that certain decoding paradigms elicit specific and narrow brain activities, while others benefit from pre-training on a broad range of representations. By revealing which tasks transfer well and demonstrating the benefits of transfer learning for EEG decoding, our findings have practical implications for mitigating data scarcity in this setting. The transfer maps generated also provide insights into the hierarchical relations between cognitive tasks, hence enhancing our understanding of how these tasks are connected from a neuroscientific standpoint.
    摘要 由于标注数据有限,脑电图(EEG)解码是一项具有挑战性的任务。迁移学习是应对这一挑战的有前途的技术,但它假设可迁移的数据域和任务是已知的,而在该场景下并非如此。本研究考察了深度学习表示在不同 EEG 解码任务之间的可迁移性。我们在两个最新发布的 EEG 数据集 ERP CORE 和 M$^3$CV(涵盖 140 余名被试和 11 种不同的认知任务)上,使用最先进的解码模型进行了大量实验。我们通过在一个任务上预训练深度神经网络、再评估其解码后续任务的能力,来衡量所学表示的可迁移性。实验表明,即使仅使用线性探测迁移,也能显著提升解码性能,相比纯监督方法最高提升 28%。此外,我们发现某些解码范式会激发特定且狭窄的脑活动,而另一些则受益于在广泛表示上的预训练。通过揭示哪些任务之间迁移良好,并展示迁移学习对 EEG 解码的助益,我们的发现对于缓解该场景下的数据稀缺具有实际意义。生成的迁移图谱还提供了认知任务之间层次关系的洞见,从神经科学角度增进了我们对这些任务之间联系的理解。

We are all Individuals: The Role of Robot Personality and Human Traits in Trustworthy Interaction

  • paper_url: http://arxiv.org/abs/2307.15568
  • repo_url: None
  • paper_authors: Mei Yii Lim, José David Aguas Lopes, David A. Robb, Bruce W. Wilson, Meriam Moujahid, Emanuele De Pellegrin, Helen Hastie
  • for: 这篇论文旨在研究机器人在人类社会中扮演角色时的个性呈现,以及人类对机器人的偏好和信任度。
  • methods: 该论文采用定量与定性相结合的方法,通过语音线索和语言特征来刻画机器人的个性,并通过收集参与者对不同机器人个性的偏好和信任评分来评估机器人的表现。
  • results: 研究发现,对于 Robo-Barista,无论参与者自身人格特质如何,外向型机器人都比内向型机器人更受信任和喜爱。此外,个体对机器人的态度与先入倾向也会影响信任度,因此在设计人机交互研究时,除机器人个性、角色和交互情境之外,这些因素同样是重要考量。
    Abstract As robots take on roles in our society, it is important that their appearance, behaviour and personality are appropriate for the job they are given and are perceived favourably by the people with whom they interact. Here, we provide an extensive quantitative and qualitative study exploring robot personality but, importantly, with respect to individual human traits. Firstly, we show that we can accurately portray personality in a social robot, in terms of extroversion-introversion using vocal cues and linguistic features. Secondly, through garnering preferences and trust ratings for these different robot personalities, we establish that, for a Robo-Barista, an extrovert robot is preferred and trusted more than an introvert robot, regardless of the subject's own personality. Thirdly, we find that individual attitudes and predispositions towards robots do impact trust in the Robo-Baristas, and are therefore important considerations in addition to robot personality, roles and interaction context when designing any human-robot interaction study.
    摘要 随着机器人在社会中承担越来越多的角色,其外观、行为和个性应与所承担的工作相称,并被与之交互的人们积极看待。本文提供了一项广泛的定量与定性研究,探讨机器人个性,并着重考察其与个体人格特质的关系。首先,我们证明可以通过语音线索和语言特征,在社交机器人上准确呈现外向-内向维度的个性。其次,通过收集对不同机器人个性的偏好和信任评分,我们发现对于 Robo-Barista 而言,无论被试自身个性如何,外向型机器人都比内向型机器人更受偏爱和信任。第三,我们发现个体对机器人的态度和先入倾向确实会影响对 Robo-Barista 的信任,因此在设计任何人机交互研究时,除机器人个性、角色和交互情境之外,这些因素也是重要的考量。

Few-shot Image Classification based on Gradual Machine Learning

  • paper_url: http://arxiv.org/abs/2307.15524
  • repo_url: None
  • paper_authors: Na Chen, Xianming Kuang, Feiyu Liu, Kehao Wang, Qun Chen
  • for: 该论文旨在仅利用少量标注样本实现对无标注图像的准确分类。
  • methods: 这个论文使用非同一个分布(Non-i.i.d)的渐进机器学习(GML)方法,从只有几个标注样本开始,然后逐渐将目标图像标注为增加难度的顺序,通过迭代因子推理在因子图中。
  • results: 该方法能够将最新(SOTA)性能提升 1-5% 的准确率;在基准数据集上的对比研究证明了其在图像分类任务中的优越性。特别是,随着查询集规模的增大,该方法的性能能够持续提升,而深度模型的性能则基本持平甚至变差。
    Abstract Few-shot image classification aims to accurately classify unlabeled images using only a few labeled samples. The state-of-the-art solutions are built by deep learning, which focuses on designing increasingly complex deep backbones. Unfortunately, the task remains very challenging due to the difficulty of transferring the knowledge learned in training classes to new ones. In this paper, we propose a novel approach based on the non-i.i.d paradigm of gradual machine learning (GML). It begins with only a few labeled observations, and then gradually labels target images in the increasing order of hardness by iterative factor inference in a factor graph. Specifically, our proposed solution extracts indicative feature representations by deep backbones, and then constructs both unary and binary factors based on the extracted features to facilitate gradual learning. The unary factors are constructed based on class center distance in an embedding space, while the binary factors are constructed based on k-nearest neighborhood. We have empirically validated the performance of the proposed approach on benchmark datasets by a comparative study. Our extensive experiments demonstrate that the proposed approach can improve the SOTA performance by 1-5% in terms of accuracy. More notably, it is more robust than the existing deep models in that its performance can consistently improve as the size of query set increases while the performance of deep models remains essentially flat or even becomes worse.
    摘要 少样本图像分类旨在仅用少量标注样本对无标注图像进行准确分类。当前最先进的解决方案基于深度学习,侧重于设计日益复杂的深度主干网络。然而,由于难以将训练类别中学到的知识迁移到新类别,该任务依然非常困难。在这篇论文中,我们提出了一种基于渐进机器学习(GML)非独立同分布范式的新方法。它从少量标注观测出发,随后按难度递增的顺序,通过因子图中的迭代因子推断逐步为目标图像打标签。具体来说,我们的方法先用深度主干网络提取有判别力的特征表示,再基于所提特征构建一元因子和二元因子以支撑渐进学习:一元因子基于嵌入空间中的类中心距离构建,二元因子基于 k 近邻构建。我们在基准数据集上通过对比研究实证验证了该方法的性能。大量实验表明,该方法可将 SOTA 准确率提升 1-5%。更值得注意的是,它比现有深度模型更加稳健:随着查询集规模增大,其性能能够持续提升,而深度模型的性能则基本持平甚至变差。
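
The gradual part of GML is easy to illustrate with the unary factors alone: score unlabeled images by their margin to class centers in the embedding space, commit labels from easiest to hardest, and refine the centers as labels accumulate. The sketch below does exactly that; the full method additionally builds binary k-NN factors and runs factor-graph inference.

```python
import numpy as np

def gradual_label(embeddings, labeled_idx, labels, num_classes):
    """Label all points in increasing order of hardness (toy GML sketch).

    embeddings: (N, D) array; labeled_idx/labels: the few-shot seeds,
    with at least one seed per class. Hardness = small margin between the
    nearest and second-nearest class center; easy points are committed
    first and progressively refine the centers.
    """
    y = -np.ones(len(embeddings), dtype=int)
    y[labeled_idx] = labels
    while (y < 0).any():
        centers = np.stack([embeddings[y == c].mean(axis=0)
                            for c in range(num_classes)])
        unl = np.where(y < 0)[0]
        d = np.linalg.norm(embeddings[unl, None] - centers[None], axis=-1)
        two = np.sort(d, axis=1)[:, :2]
        margin = two[:, 1] - two[:, 0]            # big margin == easy
        easiest = margin.argmax()
        y[unl[easiest]] = d[easiest].argmin()     # commit one label per round
    return y
```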

Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.15514
  • repo_url: None
  • paper_authors: Jaime Corsetti, Davide Boscaini, Fabio Poiesi
  • for: 6D object pose estimation
  • methods: Fully Convolutional Geometric Features (FCGF) with sparse convolutions and hardest contrastive loss, and key modifications to the loss and input data representations, as well as careful tuning of training strategies and data augmentations
  • results: state-of-the-art performance on popular benchmarks, with outperformance of recent competitors
    Abstract Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisations. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Features (FCGF) and tailor it for object 6D pose estimation to achieve state-of-the-art performance. FCGF employs sparse convolutions and learns point-level features using a fully-convolutional network by optimising a hardest contrastive loss. We can outperform recent competitors on popular benchmarks by adopting key modifications to the loss and to the input data representations, by carefully tuning the training strategies, and by employing data augmentations suitable for the underlying problem. We carry out a thorough ablation to study the contribution of each modification.
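
The hardest contrastive loss is the core of the FCGF recipe above; the snippet below is a minimal PyTorch sketch of its usual form, with margin values chosen for illustration rather than taken from the paper.

```python
import torch

def hardest_contrastive_loss(anchors, positives, pos_margin=0.1, neg_margin=1.4):
    """Sketch of a hardest-contrastive loss in the spirit of FCGF.
    anchors, positives: (N, D) feature tensors; row i of each is a matched pair."""
    d_pos = (anchors - positives).norm(dim=1)      # matched-pair distances
    # All-pairs distances; every non-matching row is a candidate negative.
    d_all = torch.cdist(anchors, positives)
    d_all.fill_diagonal_(float("inf"))             # exclude the true match
    d_neg = d_all.min(dim=1).values                # hardest (closest) negative
    loss_pos = torch.relu(d_pos - pos_margin).pow(2).mean()
    loss_neg = torch.relu(neg_margin - d_neg).pow(2).mean()
    return loss_pos + loss_neg
```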

Exploring Format Consistency for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2307.15504
  • repo_url: None
  • paper_authors: Shihao Liang, Kunlun Zhu, Runchu Tian, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, Maosong Sun
  • for: enhancing large language models' ability to follow human instructions; increasing the diversity and number of instructions in the training data consistently improves generalization.
  • methods: a "Unified Instruction Tuning" (UIT) framework that calls OpenAI APIs for automatic format transfer among instruction tuning datasets, plus a novel perplexity-based denoising method to reduce the noise introduced by automatic transfer and a smaller offline model with comparable transfer capability to cut costs in practice.
  • results: UIT improves generalization performance on unseen instructions, highlighting the importance of format consistency for instruction tuning.
    Abstract Instruction tuning has emerged as a promising approach to enhancing large language models in following human instructions. It is shown that increasing the diversity and number of instructions in the training data can consistently enhance generalization performance, which facilitates a recent endeavor to collect various instructions and integrate existing instruction tuning datasets into larger collections. However, different users have their unique ways of expressing instructions, and there often exist variations across different datasets in the instruction styles and formats, i.e., format inconsistency. In this work, we study how format inconsistency may impact the performance of instruction tuning. We propose a framework called "Unified Instruction Tuning" (UIT), which calls OpenAI APIs for automatic format transfer among different instruction tuning datasets. We show that UIT successfully improves the generalization performance on unseen instructions, which highlights the importance of format consistency for instruction tuning. To make the UIT framework more practical, we further propose a novel perplexity-based denoising method to reduce the noise of automatic format transfer. We also train a smaller offline model that achieves comparable format transfer capability than OpenAI APIs to reduce costs in practice.
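
A perplexity-based denoiser of the kind proposed here can be sketched in a few lines: score each automatically transferred instruction with a small causal LM and drop high-perplexity candidates. The model choice (gpt2) and the cutoff are assumptions for illustration, not the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    """Mean-token perplexity of `text` under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token negative log-likelihood
    return torch.exp(loss).item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
candidates = ["Rewrite the sentence in formal English: ...",
              "English formal the in sentence Rewrite: ..."]  # a garbled transfer
kept = [c for c in candidates if perplexity(lm, tok, c) < 200.0]  # assumed cutoff
```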

Curiosity-Driven Reinforcement Learning based Low-Level Flight Control

  • paper_url: http://arxiv.org/abs/2307.15724
  • repo_url: https://github.com/a-ramezani/cdrl-l2fc_u_hcm
  • paper_authors: Amir Ramezani Dooraki, Alexandros Iosifidis
  • for: proposes a curiosity-driven autonomous learning algorithm for controlling a quadcopter navigating through obstacles.
  • methods: a prediction-error formulation of curiosity combined with reinforcement-learning algorithms, generating motor speeds from odometry data.
  • results: the proposed algorithm learns an optimal policy and maximizes reward where the compared algorithms fail to do so.
    Abstract Curiosity is one of the main motives in many of the natural creatures with measurable levels of intelligence for exploration and, as a result, more efficient learning. It makes it possible for humans and many animals to explore efficiently by searching for being in states that make them surprised with the goal of learning more about what they do not know. As a result, while being curious, they learn better. In the machine learning literature, curiosity is mostly combined with reinforcement learning-based algorithms as an intrinsic reward. This work proposes an algorithm based on the drive of curiosity for autonomous learning to control by generating proper motor speeds from odometry data. The quadcopter controlled by our proposed algorithm can pass through obstacles while controlling the Yaw direction of the quad-copter toward the desired location. To achieve that, we also propose a new curiosity approach based on prediction error. We ran tests using on-policy, off-policy, on-policy plus curiosity, and the proposed algorithm and visualized the effect of curiosity in evolving exploration patterns. Results show the capability of the proposed algorithm to learn optimal policy and maximize reward where other algorithms fail to do so.
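
A prediction-error curiosity signal of the kind described above can be sketched as a forward model whose error is paid out as an intrinsic reward; the linear model and learning rate below are illustrative assumptions, not the paper's network.

```python
import numpy as np

class PredictionErrorCuriosity:
    """Sketch: a forward model predicts the next odometry reading; its
    prediction error is returned as a curiosity bonus added to the reward."""
    def __init__(self, obs_dim, act_dim, lr=1e-2):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))
        self.lr = lr

    def bonus(self, obs, act, next_obs):
        x = np.concatenate([obs, act])
        pred = self.W @ x
        err = next_obs - pred
        self.W += self.lr * np.outer(err, x)  # online update of the forward model
        return float(np.linalg.norm(err))     # intrinsic reward = surprise
```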

ETHER: Aligning Emergent Communication for Hindsight Experience Replay

  • paper_url: http://arxiv.org/abs/2307.15494
  • repo_url: None
  • paper_authors: Kevin Denamganaï, Daniel Hernandez, Ozan Vardal, Sondess Missaoui, James Alfred Walker
  • for: improving the ability of artificial agents to collaborate with humans by following natural-language instructions.
  • methods: natural-language-conditioned reinforcement learning (RL) agents that exploit properties of language, such as compositionality, as a strong inductive bias for learning complex policies; a discriminative visual referential game serves as an unsupervised auxiliary task, with a semantic grounding scheme to align the emergent language with natural language.
  • results: the referential game lets the RL agent exploit the linguistic, structured information contained in all trajectories, improving performance and data efficiency without relying on an oracle predicate function.
    Abstract Natural language instruction following is paramount to enable collaboration between artificial agents and human beings. Natural language-conditioned reinforcement learning (RL) agents have shown how natural languages' properties, such as compositionality, can provide a strong inductive bias to learn complex policies. Previous architectures like HIGhER combine the benefit of language-conditioning with Hindsight Experience Replay (HER) to deal with sparse rewards environments. Yet, like HER, HIGhER relies on an oracle predicate function to provide a feedback signal highlighting which linguistic description is valid for which state. This reliance on an oracle limits its application. Additionally, HIGhER only leverages the linguistic information contained in successful RL trajectories, thus hurting its final performance and data-efficiency. Without early successful trajectories, HIGhER is no better than DQN upon which it is built. In this paper, we propose the Emergent Textual Hindsight Experience Replay (ETHER) agent, which builds on HIGhER and addresses both of its limitations by means of (i) a discriminative visual referential game, commonly studied in the subfield of Emergent Communication (EC), used here as an unsupervised auxiliary task and (ii) a semantic grounding scheme to align the emergent language with the natural language of the instruction-following benchmark. We show that the referential game's agents make an artificial language emerge that is aligned with the natural-like language used to describe goals in the BabyAI benchmark and that it is expressive enough so as to also describe unsuccessful RL trajectories and thus provide feedback to the RL agent to leverage the linguistic, structured information contained in all trajectories. Our work shows that EC is a viable unsupervised auxiliary task for RL and provides missing pieces to make HER more widely applicable.

A Semantic Approach to Decidability in Epistemic Planning (Extended Version)

  • paper_url: http://arxiv.org/abs/2307.15485
  • repo_url: None
  • paper_authors: Alessandro Burigana, Paolo Felli, Marco Montali, Nicolas Troquard
  • for: decidability of epistemic planning in multi-agent settings based on Dynamic Epistemic Logic (DEL).
  • methods: a novel semantic approach that, instead of syntactic restrictions, augments the axioms of the epistemic logic S5$_n$ with an interaction axiom called (knowledge) commutativity, which controls agents' ability to reason unboundedly about other agents' knowledge.
  • results: the resulting epistemic planning problem is proved decidable; generalizations of the commutativity axiom are studied to obtain decidability for more expressive DEL fragments; and two well-known epistemic planning systems based on action templates, interpreted under knowledge, are shown to satisfy the axiom and hence to be decidable.
    Abstract The use of Dynamic Epistemic Logic (DEL) in multi-agent planning has led to a widely adopted action formalism that can handle nondeterminism, partial observability and arbitrary knowledge nesting. As such expressive power comes at the cost of undecidability, several decidable fragments have been isolated, mainly based on syntactic restrictions of the action formalism. In this paper, we pursue a novel semantic approach to achieve decidability. Namely, rather than imposing syntactical constraints, the semantic approach focuses on the axioms of the logic for epistemic planning. Specifically, we augment the logic of knowledge S5$_n$ and with an interaction axiom called (knowledge) commutativity, which controls the ability of agents to unboundedly reason on the knowledge of other agents. We then provide a threefold contribution. First, we show that the resulting epistemic planning problem is decidable. In doing so, we prove that our framework admits a finitary non-fixpoint characterization of common knowledge, which is of independent interest. Second, we study different generalizations of the commutativity axiom, with the goal of obtaining decidability for more expressive fragments of DEL. Finally, we show that two well-known epistemic planning systems based on action templates, when interpreted under the setting of knowledge, conform to the commutativity axiom, hence proving their decidability.

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

  • paper_url: http://arxiv.org/abs/2307.15484
  • repo_url: None
  • paper_authors: Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
  • for: proposes a minimally-supervised text-to-speech (TTS) system built on diffusion models and language models to improve quality and naturalness.
  • methods: two types of discrete speech representations and two sequence-to-sequence tasks to decouple TTS, plus a prompt-encoder structure to strengthen prompt representation capabilities.
  • results: experiments show the proposed methods outperform baseline methods; audio samples are provided on a website.
    Abstract Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. To address the challenges associated with high dimensionality and waveform distortion in discrete representations, we propose Diff-LM-Speech, which models semantic embeddings into mel-spectrogram based on diffusion models and introduces a prompt encoder structure based on variational autoencoders and prosody bottlenecks to improve prompt representation capabilities. Autoregressive language models often suffer from missing and repeated words, while non-autoregressive frameworks face expression averaging problems due to duration prediction models. To address these issues, we propose Tetra-Diff-Speech, which designs a duration diffusion model to achieve diverse prosodic expressions. While we expect the information content of semantic coding to be between that of text and acoustic coding, existing models extract semantic coding with a lot of redundant information and dimensionality explosion. To verify that semantic coding is not necessary, we propose Tri-Diff-Speech. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.

Non-invasive Diabetes Detection using Gabor Filter: A Comparative Analysis of Different Cameras

  • paper_url: http://arxiv.org/abs/2307.15480
  • repo_url: None
  • paper_authors: Christina A. Garcia, Patricia Angela R. Abu, Rosula SJ. Reyes
  • for: compares mobile-device cameras and a laptop camera as convenient tools for capturing facial images for non-invasive detection of Diabetes Mellitus (DM) using facial-block texture features.
  • methods: 12mp and 7mp mobile cameras and a laptop camera under normal lighting; extracted facial blocks are Gabor-filtered and classified with k-nearest neighbors and a support vector machine.
  • results: best performance of 96.7% accuracy, 100% sensitivity, and 93% specificity, achieved by the 12mp rear camera with SVM on 100 images.
    Abstract This paper compares and explores the performance of both mobile device camera and laptop camera as convenient tool for capturing images for non-invasive detection of Diabetes Mellitus (DM) using facial block texture features. Participants within age bracket 20 to 79 years old were chosen for the dataset. 12mp and 7mp mobile cameras, and a laptop camera were used to take the photo under normal lighting condition. Extracted facial blocks were classified using k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM). 100 images were captured, preprocessed, filtered using Gabor, and iterated. Performance of the system was measured in terms of accuracy, specificity, and sensitivity. Best performance of 96.7% accuracy, 100% sensitivity, and 93% specificity were achieved from 12mp back camera using SVM with 100 images.
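
Since the pipeline is Gabor filtering followed by k-NN/SVM classification, here is a minimal sketch of the feature-extraction step with OpenCV and scikit-learn; the kernel parameters and pooling choices are illustrative, not the paper's settings.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def gabor_features(gray_block):
    """Filter a facial block with a small Gabor bank and pool the responses
    into a texture descriptor (mean and std per orientation)."""
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):   # 4 orientations
        kern = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5)
        resp = cv2.filter2D(gray_block, cv2.CV_32F, kern)
        feats += [resp.mean(), resp.std()]
    return np.array(feats)

# X: list of grayscale facial blocks, y: diabetic / non-diabetic labels
# clf = SVC(kernel="rbf").fit([gabor_features(b) for b in X], y)
```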

FeedbackLogs: Recording and Incorporating Stakeholder Feedback into Machine Learning Pipelines

  • paper_url: http://arxiv.org/abs/2307.15475
  • repo_url: None
  • paper_authors: Matthew Barker, Emma Kallina, Dhananjay Ashok, Katherine M. Collins, Ashley Casovan, Adrian Weller, Ameet Talwalkar, Valerie Chen, Umang Bhatt
  • for: provides a method for recording and incorporating feedback from multiple stakeholders to better understand and document the impact of ML pipelines.
  • methods: proposes FeedbackLogs, addenda to existing documentation of ML pipelines that track stakeholder input; each log records details of the feedback-collection process, the feedback itself, and how the feedback is used to update the pipeline.
  • results: concrete use cases in which FeedbackLogs serve as evidence for algorithmic audits and as a tool to record updates based on stakeholder feedback.
    Abstract Even though machine learning (ML) pipelines affect an increasing array of stakeholders, there is little work on how input from stakeholders is recorded and incorporated. We propose FeedbackLogs, addenda to existing documentation of ML pipelines, to track the input of multiple stakeholders. Each log records important details about the feedback collection process, the feedback itself, and how the feedback is used to update the ML pipeline. In this paper, we introduce and formalise a process for collecting a FeedbackLog. We also provide concrete use cases where FeedbackLogs can be employed as evidence for algorithmic auditing and as a tool to record updates based on stakeholder feedback.
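
A FeedbackLog is essentially a structured record; the sketch below shows one plausible schema covering the three parts named in the abstract (collection process, feedback, pipeline update). Field names are assumptions, not the paper's template.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FeedbackEntry:
    """One record in a FeedbackLog."""
    stakeholder: str
    collected_on: date
    collection_method: str   # e.g. "interview", "survey", "bug report"
    feedback: str
    pipeline_update: str     # how (or whether) the ML pipeline changed

@dataclass
class FeedbackLog:
    pipeline: str
    entries: list[FeedbackEntry] = field(default_factory=list)

log = FeedbackLog(pipeline="loan-approval-v2")
log.entries.append(FeedbackEntry("domain expert", date(2023, 7, 1),
                                 "interview", "Feature X proxies for age.",
                                 "Removed feature X; retrained and re-audited."))
```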

Rethinking Noisy Label Learning in Real-world Annotation Scenarios from the Noise-type Perspective

  • paper_url: http://arxiv.org/abs/2307.16889
  • repo_url: https://github.com/fuxiailab/protosemi
  • paper_authors: Renyu Zhu, Haoyu Liu, Runze Wu, Minmin Lin, Tangjie Lv, Changjie Fan, Haobo Wang
  • for: investigate the problem of learning with noisy labels in real-world annotation scenarios
  • methods: propose a novel sample selection-based approach for noisy label learning called Proto-semi
  • results: demonstrate the effectiveness of Proto-semi in handling the problem of learning from noisy labels, and show that the prototype-based repartitioning strategy is effective in mitigating the adverse impact of label noise.
    Abstract In this paper, we investigate the problem of learning with noisy labels in real-world annotation scenarios, where noise can be categorized into two types: factual noise and ambiguity noise. To better distinguish these noise types and utilize their semantics, we propose a novel sample selection-based approach for noisy label learning, called Proto-semi. Proto-semi initially divides all samples into the confident and unconfident datasets via warm-up. By leveraging the confident dataset, prototype vectors are constructed to capture class characteristics. Subsequently, the distances between the unconfident samples and the prototype vectors are calculated to facilitate noise classification. Based on these distances, the labels are either corrected or retained, resulting in the refinement of the confident and unconfident datasets. Finally, we introduce a semi-supervised learning method to enhance training. Empirical evaluations on a real-world annotated dataset substantiate the robustness of Proto-semi in handling the problem of learning from noisy labels. Meanwhile, the prototype-based repartitioning strategy is shown to be effective in mitigating the adverse impact of label noise. Our code and data are available at https://github.com/fuxiAIlab/ProtoSemi.
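
The prototype-based repartitioning step can be sketched directly from the description above: build class prototypes from the confident set, then correct or retain unconfident labels by distance to the prototypes. The margin threshold is an illustrative assumption, not the paper's rule.

```python
import numpy as np

def repartition(conf_feats, conf_labels, unconf_feats, unconf_labels, tau=0.5):
    """Sketch of prototype-based repartitioning in the spirit of Proto-semi."""
    classes = np.unique(conf_labels)
    protos = np.stack([conf_feats[conf_labels == c].mean(0) for c in classes])
    d = np.linalg.norm(unconf_feats[:, None] - protos[None], axis=-1)
    nearest = classes[d.argmin(1)]                       # closest prototype
    margin = np.sort(d, axis=1)                          # needs >= 2 classes
    confident_now = (margin[:, 1] - margin[:, 0]) > tau  # clear winner
    corrected = np.where(confident_now, nearest, unconf_labels)
    return corrected, confident_now   # samples flagged True move to the confident set
```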

Testing the Depth of ChatGPT’s Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5’s Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking

  • paper_url: http://arxiv.org/abs/2307.16806
  • repo_url: None
  • paper_authors: David Bayani
  • for: probes the GPT-3.5 model's aptitude for visual tasks, including image recognition, knowledge of image parts, and image generation, with content supplied as ASCII art.
  • methods: experiments on GPT-3.5 after various transforms typical of visual settings, without distilling the inputs into linguistic summaries.
  • results: the model performs poorly on the image recognition and image-part tasks but comparatively better on image generation.
    Abstract Over the eight months since its release, ChatGPT and its underlying model, GPT3.5, have garnered massive attention, due to their potent mix of capability and accessibility. While a niche-industry of papers have emerged examining the scope of capabilities these models possess, the information fed to and extracted from these networks has been either natural language text or stylized, code-like language. Drawing inspiration from the prowess we expect a truly human-level intelligent agent to have across multiple signal modalities, in this work we examine GPT3.5's aptitude for visual tasks, where the inputs feature content provided as ASCII-art without overt distillation into a lingual summary. We conduct experiments analyzing the model's performance on image recognition tasks after various transforms typical in visual settings, trials investigating knowledge of image parts, and tasks covering image generation.

Worrisome Properties of Neural Network Controllers and Their Symbolic Representations

  • paper_url: http://arxiv.org/abs/2307.15456
  • repo_url: https://github.com/mimuw-rl/worrisome-nn
  • paper_authors: Jacek Cyranka, Kevin E M Church, Jean-Philippe Lessard
  • for: raises concerns about the robustness of controllers in simple reinforcement-learning benchmark problems.
  • methods: neural-network controllers and their low-neuron and symbolic abstractions.
  • results: typical controllers reaching high mean returns still generate an abundance of persistent low-return solutions, a highly undesirable property easily exploited by an adversary, and simpler controllers admit more persistent bad solutions; an algorithm for systematic robustness studies is provided, with computer-assisted proofs of the existence of persistent solutions and, in some cases, periodic orbits.
    Abstract We raise concerns about controllers' robustness in simple reinforcement learning benchmark problems. We focus on neural network controllers and their low neuron and symbolic abstractions. A typical controller reaching high mean return values still generates an abundance of persistent low-return solutions, which is a highly undesirable property, easily exploitable by an adversary. We find that the simpler controllers admit more persistent bad solutions. We provide an algorithm for a systematic robustness study and prove existence of persistent solutions and, in some cases, periodic orbits, using a computer-assisted proof methodology.

From Probabilistic Programming to Complexity-based Programming

  • paper_url: http://arxiv.org/abs/2307.15453
  • repo_url: None
  • paper_authors: Giovanni Sileno, Jean-Louis Dessalles
  • for: presents CompLog, a novel computational framework inspired by probabilistic programming systems such as ProbLog and grounded in the inferential mechanisms of Simplicity Theory, replacing probabilistic inference with the computation of two Kolmogorov complexities.
  • methods: the two complexities (implemented as min-path searches via ASP programs) yield ex-post and ex-ante measures of unexpectedness, mapping to posterior and prior subjective probabilities; the computation rests on world and mental models specified through causal and descriptive relations between predicates weighted by complexity.
  • results: illustrative applications include generating relevant descriptions and providing alternative approaches to disjunction and negation.
    Abstract The paper presents the main characteristics and a preliminary implementation of a novel computational framework named CompLog. Inspired by probabilistic programming systems like ProbLog, CompLog builds upon the inferential mechanisms proposed by Simplicity Theory, relying on the computation of two Kolmogorov complexities (here implemented as min-path searches via ASP programs) rather than probabilistic inference. The proposed system enables users to compute ex-post and ex-ante measures of unexpectedness of a certain situation, mapping respectively to posterior and prior subjective probabilities. The computation is based on the specification of world and mental models by means of causal and descriptive relations between predicates weighted by complexity. The paper illustrates a few examples of application: generating relevant descriptions, and providing alternative approaches to disjunction and to negation.

DELPHIC: Practical DEL Planning via Possibilities (Extended Version)

  • paper_url: http://arxiv.org/abs/2307.15451
  • repo_url: None
  • paper_authors: Alessandro Burigana, Paolo Felli, Marco Montali
  • for: This paper aims to improve the practicality of Dynamic Epistemic Logic (DEL) planning by questioning the traditional semantics and proposing an alternative, more compact approach called DELPHIC.
  • methods: The paper uses a new semantics defined using possibilities, which are non-well-founded objects representing both factual properties and what agents consider to be possible. The authors implement the DELPHIC approach in Answer Set Programming (ASP) and compare it with the traditional Kripke-based approach.
  • results: The experimental evaluation shows that DELPHIC outperforms the traditional approach in terms of space and time.
    Abstract Dynamic Epistemic Logic (DEL) provides a framework for epistemic planning that is capable of representing non-deterministic actions, partial observability, higher-order knowledge and both factual and epistemic change. The high expressivity of DEL challenges existing epistemic planners, which typically can handle only restricted fragments of the whole framework. The goal of this work is to push the envelop of practical DEL planning, ultimately aiming for epistemic planners to be able to deal with the full range of features offered by DEL. Towards this goal, we question the traditional semantics of DEL, defined in terms on Kripke models. In particular, we propose an equivalent semantics defined using, as main building block, so-called possibilities: non well-founded objects representing both factual properties of the world, and what agents consider to be possible. We call the resulting framework DELPHIC. We argue that DELPHIC indeed provides a more compact representation of epistemic states. To substantiate this claim, we implement both approaches in ASP and we set up an experimental evaluation to compare DELPHIC with the traditional, Kripke-based approach. The evaluation confirms that DELPHIC outperforms the traditional approach in space and time.

Optimal Alignment of Temporal Knowledge Bases

  • paper_url: http://arxiv.org/abs/2307.15439
  • repo_url: None
  • paper_authors: Oliver Fernandez-Gil, Fabio Patrizi, Giuseppe Perelli, Anni-Yasmin Turhan
  • for: ontology-based situation recognition by answering temporal conjunctive queries (CQs) over temporalized Description Logic knowledge bases (TKBs), where inaccurate data can cause important query answers to be missed.
  • methods: introduces the TKB Alignment problem, which computes a variant of the TKB that changes it minimally yet entails the given temporal CQ, and is in that sense (cost-)optimal.
  • results: the problem is studied for ALC TKBs and conjunctive queries with LTL operators, and a solution technique is devised that extends alignment techniques for propositional LTL over finite traces to compute (cost-optimal) alignments of TKBs.
    Abstract Answering temporal CQs over temporalized Description Logic knowledge bases (TKB) is a main technique to realize ontology-based situation recognition. In case the collected data in such a knowledge base is inaccurate, important query answers can be missed. In this paper we introduce the TKB Alignment problem, which computes a variant of the TKB that minimally changes the TKB, but entails the given temporal CQ and is in that sense (cost-)optimal. We investigate this problem for ALC TKBs and conjunctive queries with LTL operators and devise a solution technique to compute (cost-optimal) alignments of TKBs that extends techniques for the alignment problem for propositional LTL over finite traces.

Improvable Gap Balancing for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2307.15429
  • repo_url: https://github.com/yanqidai/igb4mtl
  • paper_authors: Yanqi Dai, Nanyi Fei, Zhiwu Lu
  • for: gradient balancing versus loss balancing in multi-task learning (MTL), and the improvable gaps that vary across tasks.
  • methods: two novel improvable gap balancing (IGB) algorithms: one uses a simple heuristic, the other (for the first time) deploys deep reinforcement learning for MTL; both dynamically assign task weights instead of directly balancing losses.
  • results: experiments show IGB achieves the best MTL results via loss balancing and improves further when combined with gradient balancing.
    Abstract In multi-task learning (MTL), gradient balancing has recently attracted more research interest than loss balancing since it often leads to better performance. However, loss balancing is much more efficient than gradient balancing, and thus it is still worth further exploration in MTL. Note that prior studies typically ignore that there exist varying improvable gaps across multiple tasks, where the improvable gap per task is defined as the distance between the current training progress and desired final training progress. Therefore, after loss balancing, the performance imbalance still arises in many cases. In this paper, following the loss balancing framework, we propose two novel improvable gap balancing (IGB) algorithms for MTL: one takes a simple heuristic, and the other (for the first time) deploys deep reinforcement learning for MTL. Particularly, instead of directly balancing the losses in MTL, both algorithms choose to dynamically assign task weights for improvable gap balancing. Moreover, we combine IGB and gradient balancing to show the complementarity between the two types of algorithms. Extensive experiments on two benchmark datasets demonstrate that our IGB algorithms lead to the best results in MTL via loss balancing and achieve further improvements when combined with gradient balancing. Code is available at https://github.com/YanqiDai/IGB4MTL.
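
The heuristic flavor of improvable gap balancing can be sketched as follows: weight each task by its remaining distance from the desired final training progress. Using per-task loss ratios as the progress proxy is an assumption made for illustration.

```python
import numpy as np

def igb_weights(current_losses, initial_losses, target_losses):
    """Sketch: task weights proportional to each task's improvable gap,
    i.e. how far its loss still is from the desired final value."""
    cur = np.asarray(current_losses, dtype=float)
    ini = np.asarray(initial_losses, dtype=float)
    tgt = np.asarray(target_losses, dtype=float)
    gaps = (cur - tgt) / (ini - tgt + 1e-8)   # 1 = no progress, 0 = at target
    gaps = np.clip(gaps, 0.0, None)           # tasks already at target get 0
    return gaps / (gaps.sum() + 1e-8)         # total loss = sum_i w[i] * loss_i
```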

A Critical Review of Large Language Models: Sensitivity, Bias, and the Path Toward Specialized AI

  • paper_url: http://arxiv.org/abs/2307.15425
  • repo_url: None
  • paper_authors: Arash Hajikhani, Carolyn Cole
  • for: examines the comparative effectiveness of a specialized compiled language model and a general-purpose model such as OpenAI's GPT-3.5 for detecting Sustainable Development Goals (SDGs) in text data.
  • methods: a critical review of large language models (LLMs) addressing challenges of bias and sensitivity, underlining the need for specialized training for precise, unbiased analysis.
  • results: on a company-descriptions dataset, the specialized model zeroes in on highly pertinent SDGs, whereas GPT-3.5, despite broader coverage, may flag SDGs of limited relevance to the companies' activities; model selection should weigh task requirements, cost, complexity, and transparency.
    Abstract This paper examines the comparative effectiveness of a specialized compiled language model and a general-purpose model like OpenAI's GPT-3.5 in detecting SDGs within text data. It presents a critical review of Large Language Models (LLMs), addressing challenges related to bias and sensitivity. The necessity of specialized training for precise, unbiased analysis is underlined. A case study using a company descriptions dataset offers insight into the differences between the GPT-3.5 and the specialized SDG detection model. While GPT-3.5 boasts broader coverage, it may identify SDGs with limited relevance to the companies' activities. In contrast, the specialized model zeroes in on highly pertinent SDGs. The importance of thoughtful model selection is emphasized, taking into account task requirements, cost, complexity, and transparency. Despite the versatility of LLMs, the use of specialized models is suggested for tasks demanding precision and accuracy. The study concludes by encouraging further research to find a balance between the capabilities of LLMs and the need for domain-specific expertise and interpretability.

Improving Social Media Popularity Prediction with Multiple Post Dependencies

  • paper_url: http://arxiv.org/abs/2307.15413
  • repo_url: None
  • paper_authors: Zhizhen Zhang, Xiaohui Xie, Mengyu Yang, Ye Tian, Yong Jiang, Yong Cui
  • for: predicting the popularity of social-media posts, which matters for applications such as recommendation systems and multimedia advertising.
  • methods: a novel prediction framework, the Dependency-aware Sequence Network (DSN), which exploits both intra- and inter-post dependencies to improve prediction accuracy.
  • results: experiments on the Social Media Popularity Dataset show DSN outperforms existing state-of-the-art models.
    Abstract Social Media Popularity Prediction has drawn a lot of attention because of its profound impact on many different applications, such as recommendation systems and multimedia advertising. Despite recent efforts to leverage the content of social media posts to improve prediction accuracy, many existing models fail to fully exploit the multiple dependencies between posts, which are important to comprehensively extract content information from posts. To tackle this problem, we propose a novel prediction framework named Dependency-aware Sequence Network (DSN) that exploits both intra- and inter-post dependencies. For intra-post dependency, DSN adopts a multimodal feature extractor with an efficient fine-tuning strategy to obtain task-specific representations from images and textual information of posts. For inter-post dependency, DSN uses a hierarchical information propagation method to learn category representations that could better describe the difference between posts. DSN also exploits recurrent networks with a series of gating layers for more flexible local temporal processing abilities and multi-head attention for long-term dependencies. The experimental results on the Social Media Popularity Dataset demonstrate the superiority of our method compared to existing state-of-the-art models.

Agent-Based Model: Simulating a Virus Expansion Based on the Acceptance of Containment Measures

  • paper_url: http://arxiv.org/abs/2307.15723
  • repo_url: None
  • paper_authors: Alejandro Rodríguez-Arias, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas, Noelia Sánchez-Marroño
  • for: describes an agent-based model (ABM) for analyzing how an epidemic spreads through a society and how containment measures affect it.
  • methods: combines an adapted SEIRD model with a decision-making model for citizens, simulating their behavior and choices during an outbreak.
  • results: citizens' decisions to accept or reject public-health measures significantly affect the spread of the virus, and this influence can be analyzed at the level of individual behavior.
    Abstract Compartmental epidemiological models categorize individuals based on their disease status, such as the SEIRD model (Susceptible-Exposed-Infected-Recovered-Dead). These models determine the parameters that influence the magnitude of an outbreak, such as contagion and recovery rates. However, they don't account for individual characteristics or population actions, which are crucial for assessing mitigation strategies like mask usage in COVID-19 or condom distribution in HIV. Additionally, studies highlight the role of citizen solidarity, interpersonal trust, and government credibility in explaining differences in contagion rates between countries. Agent-Based Modeling (ABM) offers a valuable approach to study complex systems by simulating individual components, their actions, and interactions within an environment. ABM provides a useful tool for analyzing social phenomena. In this study, we propose an ABM architecture that combines an adapted SEIRD model with a decision-making model for citizens. In this paper, we propose an ABM architecture that allows us to analyze the evolution of virus infections in a society based on two components: 1) an adaptation of the SEIRD model and 2) a decision-making model for citizens. In this way, the evolution of infections is affected, in addition to the spread of the virus itself, by individual behavior when accepting or rejecting public health measures. We illustrate the designed model by examining the progression of SARS-CoV-2 infections in A Coruña, Spain. This approach makes it possible to analyze the effect of the individual actions of citizens during an epidemic on the spread of the virus.

Co-attention Graph Pooling for Efficient Pairwise Graph Interaction Learning

  • paper_url: http://arxiv.org/abs/2307.15377
  • repo_url: https://github.com/leejunhyun/coattentiongraphpooling
  • paper_authors: Junhyun Lee, Bumsoo Kim, Minji Jeon, Jaewoo Kang
  • for: processing and learning from pairs of graph-structured data.
  • methods: co-attention within graph pooling to extract graph-level interaction representations.
  • results: competitive accuracy on real-world classification and regression datasets with lower computational cost than existing node-level methods.
    Abstract Graph Neural Networks (GNNs) have proven to be effective in processing and learning from graph-structured data. However, previous works mainly focused on understanding single graph inputs while many real-world applications require pair-wise analysis for graph-structured data (e.g., scene graph matching, code searching, and drug-drug interaction prediction). To this end, recent works have shifted their focus to learning the interaction between pairs of graphs. Despite their improved performance, these works were still limited in that the interactions were considered at the node-level, resulting in high computational costs and suboptimal performance. To address this issue, we propose a novel and efficient graph-level approach for extracting interaction representations using co-attention in graph pooling. Our method, Co-Attention Graph Pooling (CAGPool), exhibits competitive performance relative to existing methods in both classification and regression tasks using real-world datasets, while maintaining lower computational complexity.
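
The co-attention idea, scoring each node in one graph by its affinity to the other graph's nodes before pooling, can be sketched in a few lines of PyTorch; this is a minimal illustration, not the paper's exact layer.

```python
import torch

def co_attention_scores(h1, h2):
    """h1: (n1, d), h2: (n2, d) node embeddings of the two input graphs.
    Each node is scored by its strongest affinity to the other graph."""
    affinity = h1 @ h2.T                                   # (n1, n2)
    s1 = torch.softmax(affinity.max(dim=1).values, dim=0)  # scores for graph 1
    s2 = torch.softmax(affinity.max(dim=0).values, dim=0)  # scores for graph 2
    return s1, s2

def pool(h, scores, k):
    idx = scores.topk(k).indices                  # keep the k most relevant nodes
    return h[idx] * scores[idx].unsqueeze(-1)     # gate kept nodes by their score
```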

Confident Feature Ranking

  • paper_url: http://arxiv.org/abs/2307.15361
  • repo_url: None
  • paper_authors: Bitya Neuhof, Yuval Benjamini
  • for: a post-hoc method, based on pairwise tests, that turns feature-importance values into a stable ranking.
  • methods: pairwise comparisons of the feature-importance values produce a ranking together with simultaneous confidence intervals for the ranks.
  • results: the intervals are guaranteed to contain the "true" (infinite-sample) ranking with high probability, and the method supports selecting top-k sets.
    Abstract Interpretation of feature importance values often relies on the relative order of the features rather than on the value itself, referred to as ranking. However, the order may be unstable due to the small sample sizes used in calculating the importance values. We propose that post-hoc importance methods produce a ranking and simultaneous confident intervals for the rankings. Based on pairwise comparisons of the feature importance values, our method is guaranteed to include the ``true'' (infinite sample) ranking with high probability and allows for selecting top-k sets.
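
The pairwise-comparison construction can be sketched as follows: feature j's rank interval is set by how many rivals significantly beat it or lose to it. Paired t-tests with a Bonferroni correction stand in here for the paper's exact testing procedure; this is an illustrative sketch only.

```python
import numpy as np
from scipy import stats

def rank_intervals(imp, alpha=0.05):
    """imp: (B, p) importance values over B resamples.
    Returns lower/upper rank bounds per feature (rank 1 = most important)."""
    B, p = imp.shape
    n_tests = p * (p - 1) // 2
    lower = np.ones(p, dtype=int)
    upper = np.full(p, p, dtype=int)
    for j in range(p):
        for k in range(p):
            if j == k:
                continue
            t, pval = stats.ttest_rel(imp[:, j], imp[:, k])
            if pval < alpha / n_tests:    # significant pairwise difference
                if t < 0:
                    lower[j] += 1         # k is confidently more important
                else:
                    upper[j] -= 1         # k is confidently less important
    return lower, upper
```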

Med-HALT: Medical Domain Hallucination Test for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.15343
  • repo_url: None
  • paper_authors: Logesh Kumar Umapathi, Ankit Pal, Malaikannan Sankarasubbu
  • for: The paper is written to address the challenges of hallucinations in large language models (LLMs) in the medical domain, and to propose a new benchmark and dataset (Med-HALT) to evaluate and reduce hallucinations.
  • methods: The paper proposes a new benchmark and dataset (Med-HALT) that includes reasoning and memory-based hallucination tests to assess LLMs’ problem-solving and information retrieval abilities.
  • results: The study evaluates leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, and reveals significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility.
    Abstract This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests reasoning and memory-based hallucination tests, designed to assess LLMs's problem-solving and information retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at medhalt.github.io

Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

  • paper_url: http://arxiv.org/abs/2307.15337
  • repo_url: None
  • paper_authors: Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, Yu Wang
  • for: decreasing the end-to-end generation latency of large language models (LLMs).
  • methods: proposes "Skeleton-of-Thought" (SoT), which has the LLM first generate the skeleton of the answer and then complete the content of each skeleton point in parallel via API calls or batched decoding.
  • results: considerable speed-ups (up to 2.39x) across 11 different LLMs, with potential answer-quality improvements on several question categories.
    Abstract This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose "Skeleton-of-Thought" (SoT), which guides LLMs to first generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-up (up to 2.39x across 11 different LLMs), but it can also potentially improve the answer quality on several question categories in terms of diversity and relevance. SoT is an initial attempt at data-centric optimization for efficiency, and reveal the potential of pushing LLMs to think more like a human for answer quality.
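
The two-stage decoding is easy to sketch against a generic `llm(prompt) -> str` callable (a stand-in assumption for any chat-completion API); the prompt wording below is illustrative, not the paper's templates.

```python
import concurrent.futures

def skeleton_of_thought(question, llm):
    """Sketch of SoT: ask for a short outline first, then expand every
    outline point in parallel instead of decoding the answer sequentially."""
    skeleton = llm(f"Answer with a short numbered outline only.\nQ: {question}")
    points = [ln.strip() for ln in skeleton.splitlines() if ln.strip()]
    with concurrent.futures.ThreadPoolExecutor() as pool:
        bodies = list(pool.map(
            lambda pt: llm(f"Q: {question}\nExpand this point in 1-2 "
                           f"sentences, no preamble: {pt}"),
            points))
    return "\n".join(f"{pt} {body}" for pt, body in zip(points, bodies))
```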

Tutorials on Stance Detection using Pre-trained Language Models: Fine-tuning BERT and Prompting Large Language Models

  • paper_url: http://arxiv.org/abs/2307.15331
  • repo_url: None
  • paper_authors: Yun-Shiuan Chuang
  • for: provides two self-contained tutorials on stance detection in Twitter data: fine-tuning BERT and prompting large language models (LLMs).
  • methods: the first tutorial covers BERT architecture and tokenization and guides users through training, tuning, and evaluating standard and domain-specific BERT models with HuggingFace transformers; the second constructs prompts and few-shot examples for ChatGPT and the open-source FLAN-T5 without fine-tuning.
  • results: prompting strategies are implemented and evaluated with confusion matrices and macro F1 scores; few-shot ChatGPT and FLAN-T5 outperform the fine-tuned BERTs.
    Abstract This paper presents two self-contained tutorials on stance detection in Twitter data using BERT fine-tuning and prompting large language models (LLMs). The first tutorial explains BERT architecture and tokenization, guiding users through training, tuning, and evaluating standard and domain-specific BERT models with HuggingFace transformers. The second focuses on constructing prompts and few-shot examples to elicit stances from ChatGPT and open-source FLAN-T5 without fine-tuning. Various prompting strategies are implemented and evaluated using confusion matrices and macro F1 scores. The tutorials provide code, visualizations, and insights revealing the strengths of few-shot ChatGPT and FLAN-T5 which outperform fine-tuned BERTs. By covering both model fine-tuning and prompting-based techniques in an accessible, hands-on manner, these tutorials enable learners to gain applied experience with cutting-edge methods for stance detection.
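
The prompting tutorial's core artifact is a few-shot prompt; a minimal sketch is shown below, with examples and wording that are illustrative rather than the tutorial's own.

```python
FEW_SHOT = """Classify the tweet's stance toward the target as FAVOR, AGAINST, or NONE.

Tweet: "Wind farms ruin the landscape." Target: renewable energy -> AGAINST
Tweet: "Solar prices keep falling, great news!" Target: renewable energy -> FAVOR
"""

def stance_prompt(tweet, target):
    """Build a few-shot stance prompt for an LLM such as ChatGPT or FLAN-T5."""
    return FEW_SHOT + f'Tweet: "{tweet}" Target: {target} ->'

# The model's single-label completion (FAVOR / AGAINST / NONE) is collected
# over the test set and scored with a confusion matrix and macro F1.
```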

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2307.15320
  • repo_url: None
  • paper_authors: Ricardo Garcia, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
  • for: learning visuomotor policies in simulation, which is safer and cheaper than in the real world, for real-robot manipulation.
  • methods: domain randomization (DR) to bridge the visual sim-to-real gap, with an off-line proxy task of cube localization used to select DR parameters (texture, lighting, object colors, camera parameters).
  • results: DR parameters affect the off-line proxy task and on-line policies similarly; simulator-trained policies reach a 93% average success rate on a diverse set of challenging manipulation tasks and are robust to visual variations in real scenes, outperforming policies learned from real but limited data.
    Abstract Learning visuomotor policies in simulation is much safer and cheaper than in the real world. However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots. One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR). While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks. In particular, we propose an off-line proxy task of cube localization to select DR parameters for texture randomization, lighting randomization, variations of object colors and camera parameters. Notably, we demonstrate that DR parameters have similar impact on our off-line proxy task and on-line policies. We, hence, use off-line optimized DR parameters to train visuomotor policies in simulation and directly apply such policies to a real robot. Our approach achieves 93% success rate on average when tested on a diverse set of challenging manipulation tasks. Moreover, we evaluate the robustness of policies to visual variations in real scenes and show that our simulator-trained policies outperform policies learned using real but limited data. Code, simulation environment, real robot datasets and trained models are available at https://www.di.ens.fr/willow/research/robust_s2r/.
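
A domain-randomization draw over the factors named above (textures, lighting, object color, camera) might look like the sketch below; all ranges are illustrative assumptions, not the paper's tuned values.

```python
import random

def sample_dr_params():
    """One randomization draw per training episode."""
    return {
        "texture_id":   random.randrange(5000),                    # random texture
        "light_energy": random.uniform(0.3, 2.0),                  # lighting strength
        "object_rgb":   [random.uniform(0, 1) for _ in range(3)],  # object color
        "cam_fov_deg":  random.uniform(40, 60),                    # camera intrinsics
        "cam_jitter_m": [random.uniform(-0.02, 0.02) for _ in range(3)],
    }

# The cube-localization proxy task is used off-line to decide which of these
# ranges actually help before training the visuomotor policy.
```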

Beyond Reality: The Pivotal Role of Generative AI in the Metaverse

  • paper_url: http://arxiv.org/abs/2308.06272
  • repo_url: None
  • paper_authors: Vinay Chamola, Gaurang Bansal, Tridib Kumar Das, Vikas Hassija, Naga Siva Sai Reddy, Jiacheng Wang, Sherali Zeadally, Amir Hussain, F. Richard Yu, Mohsen Guizani, Dusit Niyato
  • for: explores how generative AI technologies are shaping the Metaverse into a dynamic, immersive, and interactive virtual world.
  • methods: surveys text generation models (ChatGPT, GPT-3), image generation models (DALL-E, MidJourney), and 3D model generation technologies (Point-E, Lumirithmic).
  • results: summarizes the applications and outlook of these technologies in the Metaverse and discusses the challenges and ethical considerations, including the balance between user control and AI automation.
    Abstract Imagine stepping into a virtual world that's as rich, dynamic, and interactive as our physical one. This is the promise of the Metaverse, and it's being brought to life by the transformative power of Generative Artificial Intelligence (AI). This paper offers a comprehensive exploration of how generative AI technologies are shaping the Metaverse, transforming it into a dynamic, immersive, and interactive virtual world. We delve into the applications of text generation models like ChatGPT and GPT-3, which are enhancing conversational interfaces with AI-generated characters. We explore the role of image generation models such as DALL-E and MidJourney in creating visually stunning and diverse content. We also examine the potential of 3D model generation technologies like Point-E and Lumirithmic in creating realistic virtual objects that enrich the Metaverse experience. But the journey doesn't stop there. We also address the challenges and ethical considerations of implementing these technologies in the Metaverse, offering insights into the balance between user control and AI automation. This paper is not just a study, but a guide to the future of the Metaverse, offering readers a roadmap to harnessing the power of generative AI in creating immersive virtual worlds.

DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall’s Rank Correlation

  • paper_url: http://arxiv.org/abs/2307.15317
  • repo_url: None
  • paper_authors: Kaipeng Zheng, Huishuai Zhang, Weiran Huang
  • for: few-shot learning, where the test categories have not been seen by the model during training.
  • methods: measures semantic similarity between features with Kendall's rank correlation instead of geometric similarity metrics, together with a carefully designed differentiable surrogate loss for meta-training.
  • results: replacing the geometric metric with Kendall's rank correlation, even only at inference, already improves few-shot performance across a wide range of datasets and domains.
    Abstract Few-shot learning aims to adapt models trained on the base dataset to novel tasks where the categories are not seen by the model before. This often leads to a relatively uniform distribution of feature values across channels on novel classes, posing challenges in determining channel importance for novel tasks. Standard few-shot learning methods employ geometric similarity metrics such as cosine similarity and negative Euclidean distance to gauge the semantic relatedness between two features. However, features with high geometric similarities may carry distinct semantics, especially in the context of few-shot learning. In this paper, we demonstrate that the importance ranking of feature channels is a more reliable indicator for few-shot learning than geometric similarity metrics. We observe that replacing the geometric similarity metric with Kendall's rank correlation only during inference is able to improve the performance of few-shot learning across a wide range of datasets with different domains. Furthermore, we propose a carefully designed differentiable loss for meta-training to address the non-differentiability issue of Kendall's rank correlation. Extensive experiments demonstrate that the proposed rank-correlation-based approach substantially enhances few-shot learning performance.
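
Kendall's rank correlation between two feature vectors compares the ordering of channel values rather than their magnitudes; below is a sketch of the plain (inference-time) form, with a note on the differentiable surrogate. This is a minimal illustration, not the paper's implementation.

```python
import torch

def kendall_similarity(f1, f2):
    """Kendall's tau between two 1-D float feature vectors, counting
    channel pairs whose ordering agrees across the two vectors."""
    d1 = f1.unsqueeze(0) - f1.unsqueeze(1)        # pairwise channel differences
    d2 = f2.unsqueeze(0) - f2.unsqueeze(1)
    concordant = torch.sign(d1) * torch.sign(d2)  # +1 if a pair agrees in order
    n = f1.numel()
    iu = torch.triu_indices(n, n, offset=1)       # each channel pair once
    return concordant[iu[0], iu[1]].mean()        # tau in [-1, 1]

# A differentiable variant replaces torch.sign with a smooth function such as
# tanh(alpha * d), which is the spirit of the proposed meta-training loss.
```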

Efficient Multiuser AI Downloading via Reusable Knowledge Broadcasting

  • paper_url: http://arxiv.org/abs/2307.15316
  • repo_url: None
  • paper_authors: Hai Wu, Qunsong Zeng, Kaibin Huang
  • for: in-situ downloading of real-time adaptive AI models to edge devices in sixth-generation (6G) mobile networks, reducing communication overhead over wireless links.
  • methods: the Model Broadcasting and Assembling (MBA) framework, which exploits reusable knowledge (parameters shared among tasks) to enable parameter broadcasting; it comprises the MBA protocol and a joint parameter-selection-and-power-control (PS-PC) design.
  • results: experiments show substantially lower downloading latency than conventional model downloading while guaranteeing the devices' model performance.
    Abstract For the 6G mobile networks, in-situ model downloading has emerged as an important use case to enable real-time adaptive artificial intelligence on edge devices. However, the simultaneous downloading of diverse and high-dimensional models to multiple devices over wireless links presents a significant communication bottleneck. To overcome the bottleneck, we propose the framework of model broadcasting and assembling (MBA), which represents the first attempt on leveraging reusable knowledge, referring to shared parameters among tasks, to enable parameter broadcasting to reduce communication overhead. The MBA framework comprises two key components. The first, the MBA protocol, defines the system operations including parameter selection from a model library, power control for broadcasting, and model assembling at devices. The second component is the joint design of parameter-selection-and-power-control (PS-PC), which provides guarantees on devices' model performance and minimizes the downloading latency. The corresponding optimization problem is simplified by decomposition into the sequential PS and PC sub-problems without compromising its optimality. The PS sub-problem is solved efficiently by designing two efficient algorithms. On one hand, the low-complexity algorithm of greedy parameter selection features the construction of candidate model sets and a selection metric, both of which are designed under the criterion of maximum reusable knowledge among tasks. On the other hand, the optimal tree-search algorithm gains its efficiency via the proposed construction of a compact binary tree pruned using model architecture constraints and an intelligent branch-and-bound search. Given optimal PS, the optimal PC policy is derived in closed form. Extensive experiments demonstrate the substantial reduction in downloading latency achieved by the proposed MBA compared to traditional model downloading.
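A toy sketch (not the paper's PS-PC optimization) of why reusable knowledge helps: parameter blocks shared by many requested tasks should be broadcast first, since one transmission serves all devices. The library and block names below are invented for illustration.

```python
# Toy sketch of greedy, reuse-driven broadcast ordering: blocks needed by
# more devices are broadcast earlier so a single transmission serves all.
from collections import Counter

# Hypothetical model library: task -> set of parameter-block ids it needs.
library = {
    "segmentation": {"stem", "enc1", "enc2", "seg_head"},
    "detection":    {"stem", "enc1", "enc2", "det_head"},
    "depth":        {"stem", "enc1", "depth_head"},
}
requested = ["segmentation", "detection", "depth"]  # one task per device

# Count how many devices can reuse each block; broadcast most-reused first.
reuse = Counter(b for t in requested for b in library[t])
broadcast_order = sorted(reuse, key=lambda b: -reuse[b])
print(broadcast_order)  # blocks like 'stem' (reused by all 3 devices) come first
```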

WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for Wikipedia Categories

  • paper_url: http://arxiv.org/abs/2307.15293
  • repo_url: https://github.com/seventychi/wc-sbert
  • paper_authors: Te-Yu Chi, Yu-Meng Tang, Chia-Wen Lu, Qiu-Xia Zhang, Jyh-Shing Roger Jang
  • for: Solving zero-shot text classification, with a particular focus on self-training strategies.
  • methods: Proposes a novel self-training strategy that trains on labels rather than texts: Wikipedia categories serve as the training set, and an SBERT pre-trained model is used to establish positive correlations between pairs of categories appearing in the same text, enabling associative training.
  • results: Experiments show the model can be adapted to a target dataset within minutes and achieves state-of-the-art results on the Yahoo Topic and AG News datasets. Compared with other BERT-based transformer models, the approach trains only on labels rather than full texts, reducing training data and greatly improving the efficiency of fine-tuning and inference across datasets.
    Abstract Our research focuses on solving the zero-shot text classification problem in NLP, with a particular emphasis on innovative self-training strategies. To achieve this objective, we propose a novel self-training strategy that uses labels rather than text for training, significantly reducing the model's training time. Specifically, we use categories from Wikipedia as our training set and leverage the SBERT pre-trained model to establish positive correlations between pairs of categories within the same text, facilitating associative training. For new test datasets, we have improved the original self-training approach, eliminating the need for prior training and testing data from each target dataset. Instead, we adopt Wikipedia as a unified training dataset to better approximate the zero-shot scenario. This modification allows for rapid fine-tuning and inference across different datasets, greatly reducing the time required for self-training. Our experimental results demonstrate that this method can adapt the model to the target dataset within minutes. Compared to other BERT-based transformer models, our approach significantly reduces the amount of training data by training only on labels, not the actual text, and greatly improves training efficiency by utilizing a unified training set. Additionally, our method achieves state-of-the-art results on both the Yahoo Topic and AG News datasets.
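A minimal sketch of the zero-shot inference side, assuming the sentence-transformers library and a common SBERT checkpoint ("all-MiniLM-L6-v2" is a standard choice, not necessarily the authors'): candidate labels and the input text are embedded, and the nearest label wins. The label-based self-training on Wikipedia categories is omitted.

```python
# Zero-shot classification by label-text similarity with SBERT embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
labels = ["Sports", "Business", "Science and technology", "World news"]
text = "The central bank raised interest rates by half a percentage point."

label_emb = model.encode(labels, convert_to_tensor=True)
text_emb = model.encode(text, convert_to_tensor=True)
scores = util.cos_sim(text_emb, label_emb)[0]   # similarity to each label
print(labels[int(scores.argmax())])             # expected: "Business"
```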

Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation

  • paper_url: http://arxiv.org/abs/2308.00085
  • repo_url: None
  • paper_authors: Yahui Fu, Koji Inoue, Chenhui Chu, Tatsuya Kawahara
  • for: Improving how dialogue systems understand and respond to users' emotions.
  • methods: Incorporates commonsense knowledge and reasoning about the causes of emotions, integrates in-context learning with commonsense knowledge, and combines the resulting causality explanations with both ChatGPT and a T5-based model.
  • results: Outperforms other comparable methods on both automatic and human evaluations.
    Abstract Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user's perspective (user's desires and reactions) and the system's perspective (system's intentions and reactions). We enhance ChatGPT's ability to reason for the system's perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.

Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2307.15254
  • repo_url: https://github.com/dearcaat/mhim-mil
  • paper_authors: Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, Bo Liu
  • for: Whole slide image (WSI) classification formulated as a multiple instance learning (MIL) problem, common in medical image analysis.
  • methods: Proposes a novel instance masking strategy, Masked Hard Instance Mining (MHIM), which uses a Siamese (teacher-student) structure with a consistency constraint: a momentum teacher scores instances by attention and masks the salient ones so the student trains on potential hard instances.
  • results: On the CAMELYON-16 and TCGA Lung Cancer datasets, MHIM-MIL outperforms other state-of-the-art methods in both performance and training cost.
    Abstract The whole slide image (WSI) classification is often formulated as a multiple instance learning (MIL) problem. Since the positive tissue is only a small fraction of the gigapixel WSI, existing MIL methods intuitively focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting hard-to-classify instances. Some literature has revealed that hard examples are beneficial for modeling a discriminative boundary accurately. By applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a consistency constraint to explore the potential hard instances. With several instance masking strategies based on attention scores, MHIM-MIL employs a momentum teacher to implicitly mine hard instances for training the student model, which can be any attention-based MIL model. This counter-intuitive strategy essentially enables the student to learn a better discriminating boundary. Moreover, the student is used to update the teacher with an exponential moving average (EMA), which in turn identifies new hard instances for subsequent training iterations and stabilizes the optimization. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that MHIM-MIL outperforms other latest methods in terms of performance and training cost. The code is available at: https://github.com/DearCaat/MHIM-MIL.
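A sketch of the two mechanisms the abstract combines, under stand-in data: masking the highest-attention ("easy") instances scored by a momentum teacher, and updating that teacher as an exponential moving average (EMA) of the student.

```python
# Sketch: (1) drop the most salient instances so the student must learn
# from harder ones; (2) EMA-update the teacher from the student.
import torch

def mask_easy_instances(feats, attn_scores, mask_ratio=0.3):
    # Keep only the (1 - mask_ratio) fraction of lowest-attention instances.
    k = int(feats.shape[0] * mask_ratio)
    keep = torch.argsort(attn_scores, descending=True)[k:]
    return feats[keep]

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

feats = torch.randn(100, 512)      # 100 patch embeddings from one WSI (stand-in)
attn = torch.rand(100)             # teacher attention scores (stand-in)
print(mask_easy_instances(feats, attn).shape)  # torch.Size([70, 512])

teacher, student = torch.nn.Linear(512, 128), torch.nn.Linear(512, 128)
ema_update(teacher, student)       # stabilizes hard-instance mining over time
```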

An Overview Of Temporal Commonsense Reasoning and Acquisition

  • paper_url: http://arxiv.org/abs/2308.00002
  • repo_url: None
  • paper_authors: Georg Wenzel, Adam Jatowt
  • for: Surveys how well large language models perform temporal commonsense reasoning and reviews methods proposed to improve them.
  • methods: Reviews a variety of augmentations for enhancing language models' temporal commonsense ability and their evaluation across a growing number of datasets.
  • results: Despite these augmentations, models still struggle to approach human performance on reasoning over temporal commonsense properties, such as the typical occurrence times, orderings, or durations of events.
    Abstract Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events, and use it to reason over problems requiring such knowledge. This trait is essential in temporal natural language processing tasks, with possible applications such as timeline summarization, temporal question answering, and temporal natural language inference. Recent research on the performance of large language models suggests that, although they are adept at generating syntactically correct sentences and solving classification tasks, they often take shortcuts in their reasoning and fall prey to simple linguistic traps. This article provides an overview of research in the domain of temporal commonsense reasoning, particularly focusing on enhancing language model performance through a variety of augmentations and their evaluation across a growing number of datasets. However, these augmented models still struggle to approach human performance on reasoning tasks over temporal common sense properties, such as the typical occurrence times, orderings, or durations of events. We further emphasize the need for careful interpretation of research to guard against overpromising evaluation results in light of the shallow reasoning present in transformers. This can be achieved by appropriately preparing datasets and suitable evaluation metrics.

A Practical Recipe for Federated Learning Under Statistical Heterogeneity Experimental Design

  • paper_url: http://arxiv.org/abs/2307.15245
  • repo_url: https://github.com/mmorafah/fedzoo-bench
  • paper_authors: Mahdi Morafah, Weijia Wang, Bill Lin
  • for: Investigates Federated Learning (FL) under statistical (data) heterogeneity and aims to provide a comparable, well-incentivized experimental setup.
  • methods: Studies a broad set of FL methods, including 22 state-of-the-art approaches, and releases FedZoo-Bench, an open-source PyTorch library implementing them.
  • results: Characterizes how FL-specific experimental variables affect performance and offers recommendations and standardized features for designing more meaningful and consistent FL experimental setups.
    Abstract Federated Learning (FL) has been an area of active research in recent years. There have been numerous studies in FL to make it more successful in the presence of data heterogeneity. However, despite the existence of many publications, the state of progress in the field is unknown. Many of the works use inconsistent experimental settings and there are no comprehensive studies on the effect of FL-specific experimental variables on the results and practical insights for a more comparable and consistent FL experimental setup. Furthermore, the existence of several benchmarks and confounding variables has further complicated the issue of inconsistency and ambiguity. In this work, we present the first comprehensive study on the effect of FL-specific experimental variables in relation to each other and performance results, bringing several insights and recommendations for designing a meaningful and well-incentivized FL experimental setup. We further aid the community by releasing FedZoo-Bench, an open-source library based on PyTorch with pre-implementation of 22 state-of-the-art methods, and a broad set of standardized and customizable features available at https://github.com/MMorafah/FedZoo-Bench. We also provide a comprehensive comparison of several state-of-the-art (SOTA) methods to better understand the current state of the field and existing limitations.

BOURNE: Bootstrapped Self-supervised Learning Framework for Unified Graph Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.15244
  • repo_url: https://github.com/Jackson117/BOURNE
  • paper_authors: Jie Liu, Mengting He, Xuequn Shang, Jieming Shi, Bin Cui, Hongzhi Yin
  • for: Proposes a unified graph anomaly detection method that detects both node and edge anomalies.
  • methods: Builds node-centered subgraph (graph-view) and dual hypergraph (hypergraph-view) contexts and trains them with a bootstrapped self-supervised learning framework (BOURNE); by swapping context embeddings between nodes and edges, node and edge anomalies are detected mutually.
  • results: Experiments on six benchmark datasets show superior effectiveness and efficiency in anomaly detection, and the method scales to large graphs.
    Abstract Graph anomaly detection (GAD) has gained increasing attention in recent years due to its critical application in a wide range of domains, such as social networks, financial risk management, and traffic analysis. Existing GAD methods can be categorized into node and edge anomaly detection models based on the type of graph objects being detected. However, these methods typically treat node and edge anomalies as separate tasks, overlooking their associations and frequent co-occurrences in real-world graphs. As a result, they fail to leverage the complementary information provided by node and edge anomalies for mutual detection. Additionally, state-of-the-art GAD methods, such as CoLA and SL-GAD, heavily rely on negative pair sampling in contrastive learning, which incurs high computational costs, hindering their scalability to large graphs. To address these limitations, we propose a novel unified graph anomaly detection framework based on bootstrapped self-supervised learning (named BOURNE). We extract a subgraph (graph view) centered on each target node as node context and transform it into a dual hypergraph (hypergraph view) as edge context. These views are encoded using graph and hypergraph neural networks to capture the representations of nodes, edges, and their associated contexts. By swapping the context embeddings between nodes and edges and measuring the agreement in the embedding space, we enable the mutual detection of node and edge anomalies. Furthermore, we adopt a bootstrapped training strategy that eliminates the need for negative sampling, enabling BOURNE to handle large graphs efficiently. Extensive experiments conducted on six benchmark datasets demonstrate the superior effectiveness and efficiency of BOURNE in detecting both node and edge anomalies.
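A minimal sketch of a bootstrapped (negative-free) agreement loss of the kind the abstract describes, with random tensors standing in for the subgraph-view and dual-hypergraph-view embeddings; the actual encoders are graph and hypergraph neural networks.

```python
# BYOL-style agreement objective: the two context views of the same target
# are pushed together in a shared latent space, with no negative sampling.
import torch
import torch.nn.functional as F

def bootstrap_agreement_loss(node_ctx, edge_ctx):
    n = F.normalize(node_ctx, dim=-1)
    e = F.normalize(edge_ctx, dim=-1)
    return (2 - 2 * (n * e).sum(dim=-1)).mean()  # cosine regression loss

node_ctx = torch.randn(32, 128)   # subgraph-view embeddings (stand-ins)
edge_ctx = torch.randn(32, 128)   # dual-hypergraph-view embeddings (stand-ins)
print(bootstrap_agreement_loss(node_ctx, edge_ctx))
```

Avoiding negative pairs is what lets this style of objective scale to large graphs, which is the bottleneck the abstract attributes to CoLA and SL-GAD.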

Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

  • paper_url: http://arxiv.org/abs/2307.15220
  • repo_url: https://github.com/camma-public/surgvlp
  • paper_authors: Kun Yuan, Vinkle Srivastav, Tong Yu, Joel Lavanchy, Pietro Mascagni, Nassir Navab, Nicolas Padoy
  • for: Uses openly available surgical video lectures from e-learning platforms as effective supervision for multi-modal representation learning, without manual annotations.
  • methods: Generates text transcriptions of the video lectures with multiple complementary automatic speech recognition systems and proposes SurgVLP, a new multi-modal representation learning method that contrastively aligns video clip embeddings with the corresponding text embeddings in a joint latent space.
  • results: Demonstrates the learned representations on several vision-and-language tasks, including text-based video retrieval, temporal activity grounding, and video captioning, and shows that, without any labeled ground truth, the approach transfers to vision-only downstream tasks such as surgical tool, phase, and triplet recognition.
    Abstract Recent advancements in surgical computer vision applications have been driven by fully-supervised methods, primarily using only visual data. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively show the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as benchmarks for evaluation. We further demonstrate that without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVLP
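A hedged sketch of a CLIP-style objective that aligns a clip embedding with multiple text embeddings (one per ASR system); the fusion-by-averaging and all tensors are stand-ins, not the paper's exact formulation.

```python
# Contrastive alignment of video clips with multiple ASR transcripts.
import torch
import torch.nn.functional as F

def multi_text_clip_loss(video, texts, temperature=0.07):
    # video: (B, D); texts: (B, K, D) with K transcripts per clip.
    v = F.normalize(video, dim=-1)
    t = F.normalize(F.normalize(texts, dim=-1).mean(dim=1), dim=-1)  # fuse K
    logits = v @ t.T / temperature                 # (B, B) pairwise similarity
    target = torch.arange(v.shape[0])
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target)) / 2

video = torch.randn(8, 256)        # clip embeddings (stand-ins)
texts = torch.randn(8, 2, 256)     # transcripts from 2 ASR systems (stand-ins)
print(multi_text_clip_loss(video, texts))
```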

Reachability Poorman Discrete-Bidding Games

  • paper_url: http://arxiv.org/abs/2307.15218
  • repo_url: None
  • paper_authors: Guy Avni, Tobias Meggendorfer, Suman Sadhukhan, Josef Tkadlec, Đorđe Žikelić
  • for: Studies bidding games, a class of two-player zero-sum games played on graphs.
  • methods: Considers, for the first time, the poorman discrete-bidding mechanism, in which bid granularity is restricted and the higher bid is paid to the bank.
  • results: Shows that threshold budgets exist; in DAGs they can be approximated, with error bounds, by continuous-bidding thresholds and exhibit periodic behavior, with closed-form solutions in special cases. An algorithm for finding threshold budgets is implemented and evaluated.
    Abstract We consider {\em bidding games}, a class of two-player zero-sum {\em graph games}. The game proceeds as follows. Both players have bounded budgets. A token is placed on a vertex of a graph, in each turn the players simultaneously submit bids, and the higher bidder moves the token, where we break bidding ties in favor of Player 1. Player 1 wins the game iff the token visits a designated target vertex. We consider, for the first time, {\em poorman discrete-bidding} in which the granularity of the bids is restricted and the higher bid is paid to the bank. Previous work either did not impose granularity restrictions or considered {\em Richman} bidding (bids are paid to the opponent). While the latter mechanisms are technically more accessible, the former is more appealing from a practical standpoint. Our study focuses on {\em threshold budgets}, which is the necessary and sufficient initial budget required for Player 1 to ensure winning against a given Player 2 budget. We first show existence of thresholds. In DAGs, we show that threshold budgets can be approximated with error bounds by thresholds under continuous-bidding and that they exhibit a periodic behavior. We identify closed-form solutions in special cases. We implement and experiment with an algorithm to find threshold budgets.
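For intuition about what a threshold budget is, the sketch below computes thresholds for the classical *continuous-bidding Richman* setting by fixed-point iteration: at an interior vertex the threshold is the average of the best and worst successor thresholds. This is background intuition only; the paper's poorman discrete-bidding mechanism differs (the winning bid goes to the bank and bids are granular).

```python
# Richman continuous-bidding thresholds on a toy path graph:
# t(target) = 0, t(sink) = 1, t(v) = (min_u t(u) + max_u t(u)) / 2.
succ = {"sink": [], "v1": ["sink", "v2"], "v2": ["v1", "target"], "target": []}
t = {v: 1.0 for v in succ}
t["target"] = 0.0
for _ in range(200):                       # iterate to a fixed point
    for v, us in succ.items():
        if us:
            vals = [t[u] for u in us]
            t[v] = (min(vals) + max(vals)) / 2
print({v: round(x, 3) for v, x in t.items()})  # v1 -> 2/3, v2 -> 1/3
```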

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2307.15217
  • repo_url: None
  • paper_authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
  • for: Examines the open problems and fundamental limitations of reinforcement learning from human feedback (RLHF) and how to develop safer AI systems.
  • methods: Surveys techniques to understand, improve, and complement RLHF and related methods in practice.
  • results: Catalogs open problems and fundamental limitations of RLHF and proposes auditing and disclosure standards to improve societal oversight of RLHF systems.
    Abstract Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.15199
  • repo_url: None
  • paper_authors: Junhyeong Cho, Gilhyun Nam, Sungyeon Kim, Hunmin Yang, Suha Kwak
  • for: Proposes a source-free domain generalization method that requires no images, instead synthesizing diverse style features in a joint vision-language space.
  • methods: Generates diverse styles via prompts with learnable style word vectors for pseudo-words S*; to prevent styles from distorting content, style-content features are constrained to lie near their corresponding content features in the joint vision-language space.
  • results: Achieves state of the art on PACS, VLCS, OfficeHome, and DomainNet without requiring any images for training.
    Abstract In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenomenon of this joint space. From these observations, we propose PromptStyler which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts without using any images to deal with source-free domain generalization. The proposed method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*. To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located nearby their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, even though it does not require any images for training.
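A hedged sketch of the two constraints on the learnable style word vectors, with a frozen random linear map standing in for the CLIP text encoder and crude additive prompt composition; it shows the shape of the objective, not the paper's exact losses.

```python
# Learnable style vectors: push style features apart (diversity) while
# keeping "a S* style of a [class]" near the plain "[class]" feature.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, K, C = 512, 4, 3                          # embed dim, styles, classes
encode = torch.nn.Linear(D, D, bias=False)   # frozen stand-in text encoder
encode.requires_grad_(False)
styles = torch.nn.Parameter(torch.randn(K, D) * 0.02)  # pseudo-word vectors
content = torch.randn(C, D)                  # "[class]" embeddings (fixed)

def losses():
    style_feat = F.normalize(encode(styles), dim=-1)                   # (K, D)
    sc = F.normalize(encode(styles[:, None] + content[None]), dim=-1)  # (K, C, D)
    cf = F.normalize(encode(content), dim=-1)                          # (C, D)
    div = (style_feat @ style_feat.T - torch.eye(K)).abs().mean()      # diversity
    cons = (1 - (sc * cf[None]).sum(-1)).mean()                        # consistency
    return div + cons

opt = torch.optim.SGD([styles], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = losses()
    loss.backward()
    opt.step()
print(float(losses()))
```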

One-shot Joint Extraction, Registration and Segmentation of Neuroimaging Data

  • paper_url: http://arxiv.org/abs/2307.15198
  • repo_url: https://github.com/anonymous4545/jers
  • paper_authors: Yao Su, Zhentian Qian, Lei Ma, Lifang He, Xiangnan Kong
  • for: Develops a one-shot method that uses a single labeled template image (an atlas) and a few unlabeled raw images to improve the extraction, registration, and segmentation preprocessing steps for neuroimaging data.
  • methods: Proposes JERS, a unified end-to-end framework that jointly optimizes extraction, registration, and segmentation; its extraction, registration, and segmentation modules are interconnected and mutually reinforced through self-supervision.
  • results: Experiments on real-world datasets show excellent performance on all three tasks with minimal manual intervention and annotation; code and data are available at https://github.com/Anonymous4545/JERS.
    Abstract Brain extraction, registration and segmentation are indispensable preprocessing steps in neuroimaging studies. The aim is to extract the brain from raw imaging scans (i.e., extraction step), align it with a target brain image (i.e., registration step) and label the anatomical brain regions (i.e., segmentation step). Conventional studies typically focus on developing separate methods for the extraction, registration and segmentation tasks in a supervised setting. The performance of these methods is largely contingent on the quantity of training samples and the extent of visual inspections carried out by experts for error correction. Nevertheless, collecting voxel-level labels and performing manual quality control on high-dimensional neuroimages (e.g., 3D MRI) are expensive and time-consuming in many medical studies. In this paper, we study the problem of one-shot joint extraction, registration and segmentation in neuroimaging data, which exploits only one labeled template image (a.k.a. atlas) and a few unlabeled raw images for training. We propose a unified end-to-end framework, called JERS, to jointly optimize the extraction, registration and segmentation tasks, allowing feedback among them. Specifically, we use a group of extraction, registration and segmentation modules to learn the extraction mask, transformation and segmentation mask, where modules are interconnected and mutually reinforced by self-supervision. Empirical results on real-world datasets demonstrate that our proposed method performs exceptionally in the extraction, registration and segmentation tasks. Our code and data can be found at https://github.com/Anonymous4545/JERS
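One concrete piece of such a pipeline is how registration supports segmentation: the atlas's labels can be propagated to an unlabeled image by warping them with the predicted deformation. A 2D stand-in sketch (the paper works with 3D neuroimages):

```python
# Propagate atlas labels through a predicted deformation field.
import torch
import torch.nn.functional as F

atlas_labels = torch.zeros(1, 1, 64, 64)
atlas_labels[..., 20:40, 20:40] = 1.0        # a square "region" on the atlas

# Identity sampling grid plus a small displacement (stand-in for the
# registration module's output).
base = F.affine_grid(torch.eye(2, 3).unsqueeze(0), atlas_labels.shape,
                     align_corners=False)
flow = 0.05 * torch.randn(1, 64, 64, 2)      # predicted deformation field
warped = F.grid_sample(atlas_labels, base + flow, align_corners=False)
print(warped.sum())  # warped mask = segmentation proposal for the new image
```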

Learning in Repeated Multi-Unit Pay-As-Bid Auctions

  • paper_url: http://arxiv.org/abs/2307.15193
  • repo_url: None
  • paper_authors: Rigel Galgana, Negin Golrezaei
  • for: Learning how to bid in repeated multi-unit pay-as-bid auctions, with the goal of maximizing the bidder's payoff.
  • methods: A polynomial-time dynamic programming solution to the offline problem, leveraged to build online learning algorithms with polynomial time and space complexity under full-information and bandit feedback.
  • results: Regret upper bounds of $O(M\sqrt{T\log |\mathcal{B}|})$ and $O(M\sqrt{|\mathcal{B}|T\log |\mathcal{B}|})$ respectively, with a matching lower bound in the linear dependence on $M$; numerical results show the resulting market dynamics converge to a welfare-maximizing equilibrium in which bidders submit uniform bids, and the pay-as-bid auction consistently generates significantly higher revenue than the uniform price auction.
    Abstract Motivated by Carbon Emissions Trading Schemes, Treasury Auctions, and Procurement Auctions, which all involve the auctioning of homogeneous multiple units, we consider the problem of learning how to bid in repeated multi-unit pay-as-bid auctions. In each of these auctions, a large number of (identical) items are to be allocated to the largest submitted bids, where the price of each of the winning bids is equal to the bid itself. The problem of learning how to bid in pay-as-bid auctions is challenging due to the combinatorial nature of the action space. We overcome this challenge by focusing on the offline setting, where the bidder optimizes their vector of bids while only having access to the past submitted bids by other bidders. We show that the optimal solution to the offline problem can be obtained using a polynomial time dynamic programming (DP) scheme. We leverage the structure of the DP scheme to design online learning algorithms with polynomial time and space complexity under full information and bandit feedback settings. We achieve an upper bound on regret of $O(M\sqrt{T\log |\mathcal{B}|})$ and $O(M\sqrt{|\mathcal{B}|T\log |\mathcal{B}|})$ respectively, where $M$ is the number of units demanded by the bidder, $T$ is the total number of auctions, and $|\mathcal{B}|$ is the size of the discretized bid space. We accompany these results with a regret lower bound, which match the linear dependency in $M$. Our numerical results suggest that when all agents behave according to our proposed no regret learning algorithms, the resulting market dynamics mainly converge to a welfare maximizing equilibrium where bidders submit uniform bids. Lastly, our experiments demonstrate that the pay-as-bid auction consistently generates significantly higher revenue compared to its popular alternative, the uniform price auction.
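As a toy illustration of the offline problem the paper's DP solves, the sketch below brute-forces the best bid vector against known competing bids on a tiny discretized bid space; the numbers are made up, and the real DP scales polynomially where this brute force does not. Note that the optimum here is a uniform bid, echoing the equilibrium behavior reported in the abstract.

```python
# Offline pay-as-bid: pick M discrete bids to maximize utility given
# competitors' (known) bids; winners pay their own bid.
from itertools import combinations_with_replacement

K = 3                     # units for sale
M = 2                     # units we demand
value = 1.0               # our value per unit
others = [0.9, 0.5, 0.2]  # competitors' bids (known in the offline setting)
grid = [i / 10 for i in range(11)]  # discretized bid space B

def utility(my_bids):
    # Highest K bids win; ties broken in our favor for illustration.
    allb = sorted([(b, 0) for b in my_bids] + [(b, 1) for b in others],
                  key=lambda x: (-x[0], x[1]))
    return sum(value - b for b, who in allb[:K] if who == 0)

best = max(combinations_with_replacement(grid, M), key=utility)
print(best, utility(best))  # (0.5, 0.5): a uniform bid wins both units
```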

Med-Flamingo: a Multimodal Medical Few-shot Learner

  • paper_url: http://arxiv.org/abs/2307.15189
  • repo_url: https://github.com/snap-stanford/med-flamingo
  • paper_authors: Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec
  • for: Proposes a multimodal few-shot learner adapted to the medical domain, addressing the data scarcity common in medical applications.
  • methods: Continues pre-training from OpenFlamingo-9B on paired and interleaved medical image-text data from publications and textbooks.
  • results: Unlocks few-shot generative medical visual question answering with supporting rationales, improving clinician ratings by up to 20%; the work also contributes the first human evaluation of generative medical VQA, in which physicians review problems and blinded generations in an interactive app.
    Abstract Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable down-stream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real-time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets including a novel challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20\% in clinician's rating and firstly enables multimodal medical few-shot adaptations, such as rationale generation. We release our model, code, and evaluation app under https://github.com/snap-stanford/med-flamingo.

Rotation-Invariant Random Features Provide a Strong Baseline for Machine Learning on 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2308.06271
  • repo_url: https://github.com/meliao/rotation-invariant-random-features
  • paper_authors: Owen Melia, Eric Jonas, Rebecca Willett
  • for: Studies feature learning on 3D point cloud data, aiming to learn rotation-invariant functions.
  • methods: Extends the random features approach with a version that is invariant to three-dimensional rotations and fast to evaluate on point cloud data.
  • results: Experiments show the method matches or outperforms general-purpose rotation-invariant deep neural networks across several tasks while offering an order-of-magnitude smaller prediction latency than competing kernel methods.
    Abstract Rotational invariance is a popular inductive bias used by many fields in machine learning, such as computer vision and machine learning for quantum chemistry. Rotation-invariant machine learning methods set the state of the art for many tasks, including molecular property prediction and 3D shape classification. These methods generally either rely on task-specific rotation-invariant features, or they use general-purpose deep neural networks which are complicated to design and train. However, it is unclear whether the success of these methods is primarily due to the rotation invariance or the deep neural networks. To address this question, we suggest a simple and general-purpose method for learning rotation-invariant functions of three-dimensional point cloud data using a random features approach. Specifically, we extend the random features method of Rahimi & Recht 2007 by deriving a version that is invariant to three-dimensional rotations and showing that it is fast to evaluate on point cloud data. We show through experiments that our method matches or outperforms the performance of general-purpose rotation-invariant neural networks on standard molecular property prediction benchmark datasets QM7 and QM9. We also show that our method is general-purpose and provides a rotation-invariant baseline on the ModelNet40 shape classification task. Finally, we show that our method has an order of magnitude smaller prediction latency than competing kernel methods.
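The paper derives its rotation-invariant random features in closed form; as a generic, hedged sketch of the underlying idea, the code below makes an arbitrary random feature map approximately rotation-invariant by averaging it over sampled 3D rotations.

```python
# Approximate rotation invariance via rotation averaging (generic trick,
# not the paper's exact construction).
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 128))                  # random projection directions

def feat(points):                              # random Fourier-style features
    return np.cos(points @ W).mean(axis=0)

def invariant_feat(points, n_rot=256):
    rots = Rotation.random(n_rot, random_state=1).as_matrix()
    return np.mean([feat(points @ R.T) for R in rots], axis=0)

cloud = rng.normal(size=(50, 3))
R = Rotation.random(random_state=2).as_matrix()
raw_gap = np.abs(feat(cloud) - feat(cloud @ R.T)).max()
avg_gap = np.abs(invariant_feat(cloud) - invariant_feat(cloud @ R.T)).max()
print(raw_gap, avg_gap)  # averaged features shift far less under rotation
```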

RCT Rejection Sampling for Causal Estimation Evaluation

  • paper_url: http://arxiv.org/abs/2307.15176
  • repo_url: https://github.com/kakeith/rct_rejection_sampling
  • paper_authors: Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit Bhattacharya
  • for: Improving the empirical evaluation of causal effect estimation from observational data, where high-dimensional covariates cause confounding.
  • methods: Builds on machine-learning-based adjustment methods for causal estimation and proposes a new sampling algorithm, RCT rejection sampling, with theoretical guarantees that causal identification holds in the resulting observational data, enabling valid comparisons against the RCT ground truth.
  • results: On synthetic data, the algorithm yields low bias when oracle estimators are evaluated on the confounded samples; the paper also highlights finite-data considerations for evaluation designers and, as a proof of concept, implements an example evaluation pipeline on a newly released real-world RCT of roughly 70k observations with text as high-dimensional covariates.
    Abstract Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
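A toy sketch of the evaluation strategy (not the paper's exact acceptance rule): subsample a simulated RCT so a covariate becomes predictive of treatment, then watch the naive difference-in-means drift away from the known average treatment effect (ATE).

```python
# Create a confounded observational dataset from a simulated RCT.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)           # covariate (e.g., derived from text)
t = rng.binomial(1, 0.5, n)           # randomized treatment
y = 2.0 * t + x + rng.normal(size=n)  # true ATE = 2.0

# Accept rows with probability depending on (x, t): x now predicts t.
p_accept = np.where(x == t, 0.9, 0.3)
keep = rng.random(n) < p_accept
to, yo = t[keep], y[keep]

naive = yo[to == 1].mean() - yo[to == 0].mean()
print(f"naive (confounded) estimate: {naive:.2f} vs true ATE 2.00")  # ~2.5
```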

VISU at WASSA 2023 Shared Task: Detecting Emotions in Reaction to News Stories Leveraging BERT and Stacked Embeddings

  • paper_url: http://arxiv.org/abs/2307.15164
  • repo_url: None
  • paper_authors: Vivek Kumar, Sushmita Singh, Prayag Tiwari
  • for: Emotion detection from essays written in reaction to news articles.
  • methods: Develops deep learning (DL) models that combine word embedding representations with tailored preprocessing strategies to capture the nuances of expressed emotions; experiments use static and contextual embeddings (individual and stacked) with BiLSTM and Transformer-based models.
  • results: Ranked tenth in the emotion detection task of the WASSA 2023 Shared Task (3) with a Macro F1-Score of 0.2717, validating the approach on a small, imbalanced dataset with mixed target emotion categories.
    Abstract Our system, VISU, participated in the WASSA 2023 Shared Task (3) of Emotion Classification from essays written in reaction to news articles. Emotion detection from complex dialogues is challenging and often requires context/domain understanding. Therefore in this research, we have focused on developing deep learning (DL) models using the combination of word embedding representations with tailored prepossessing strategies to capture the nuances of emotions expressed. Our experiments used static and contextual embeddings (individual and stacked) with Bidirectional Long short-term memory (BiLSTM) and Transformer based models. We occupied rank tenth in the emotion detection task by scoring a Macro F1-Score of 0.2717, validating the efficacy of our implemented approaches for small and imbalanced datasets with mixed categories of target emotions.
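A minimal sketch of the stacked-embedding BiLSTM family described, with random embedding tables standing in for static (GloVe-style) and contextual representations; the dimensions and the 7-way emotion head are illustrative.

```python
# Two embedding sources concatenated per token -> BiLSTM -> classifier.
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    def __init__(self, vocab=5000, d_static=100, d_ctx=768, hidden=128, n_emotions=7):
        super().__init__()
        self.static = nn.Embedding(vocab, d_static)
        self.ctx = nn.Embedding(vocab, d_ctx)   # stand-in for a contextual encoder
        self.lstm = nn.LSTM(d_static + d_ctx, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, tokens):
        x = torch.cat([self.static(tokens), self.ctx(tokens)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out.mean(dim=1))       # mean-pool over tokens

logits = StackedBiLSTM()(torch.randint(0, 5000, (4, 32)))
print(logits.shape)  # torch.Size([4, 7])
```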

Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

  • paper_url: http://arxiv.org/abs/2308.07931
  • repo_url: None
  • paper_authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
  • for: bridges the 2D-to-3D gap for robotic manipulation
  • methods: leverages distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models
  • results: achieves in-the-wild generalization to unseen objects via a few-shot learning method for 6-DOF grasping and placing
    Abstract Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
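A hedged sketch of the language-guided query step, with random tensors standing in for distilled per-point features and a CLIP-style text embedding; in practice both come from trained models.

```python
# Score 3D points against a text embedding to localize a named object.
import torch
import torch.nn.functional as F

points = torch.rand(2048, 3)            # 3D points in the scene (stand-ins)
point_feats = torch.randn(2048, 512)    # distilled per-point features (stand-ins)
text_feat = torch.randn(512)            # embedding of e.g. "the red mug"

scores = F.cosine_similarity(point_feats, text_feat[None], dim=-1)
top = points[scores.topk(64).indices]   # candidate region for the grasp
print(top.mean(dim=0))                  # rough 3D location of the queried object
```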

Matching Patients to Clinical Trials with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.15051
  • repo_url: None
  • paper_authors: Qiao Jin, Zifeng Wang, Charalampos S. Floudas, Jimeng Sun, Zhiyong Lu
  • for: To assist individual patients and referral physicians in identifying suitable clinical trials from an extensive selection, using large language models (LLMs).
  • methods: Introduces TrialGPT, a novel architecture that employs LLMs to predict criterion-level eligibility with detailed explanations, which are then aggregated to rank and exclude candidate clinical trials based on free-text patient notes.
  • results: TrialGPT achieves high criterion-level prediction accuracy with faithful explanations; aggregated trial-level scores correlate highly with expert eligibility annotations and are effective for ranking trials and excluding ineligible candidates, though current LLMs still make some mistakes due to limited medical knowledge and domain-specific context understanding.
    Abstract Clinical trials are vital in advancing drug development and evidence-based medicine, but their success is often hindered by challenges in patient recruitment. In this work, we investigate the potential of large language models (LLMs) to assist individual patients and referral physicians in identifying suitable clinical trials from an extensive selection. Specifically, we introduce TrialGPT, a novel architecture employing LLMs to predict criterion-level eligibility with detailed explanations, which are then aggregated for ranking and excluding candidate clinical trials based on free-text patient notes. We evaluate TrialGPT on three publicly available cohorts of 184 patients and 18,238 annotated clinical trials. The experimental results demonstrate several key findings: First, TrialGPT achieves high criterion-level prediction accuracy with faithful explanations. Second, the aggregated trial-level TrialGPT scores are highly correlated with expert eligibility annotations. Third, these scores prove effective in ranking clinical trials and exclude ineligible candidates. Our error analysis suggests that current LLMs still make some mistakes due to limited medical knowledge and domain-specific context understanding. Nonetheless, we believe the explanatory capabilities of LLMs are highly valuable. Future research is warranted on how such AI assistants can be integrated into the routine trial matching workflow in real-world settings to improve its efficiency.
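As a hedged illustration of the aggregation step only (the criterion-level predictions themselves come from an LLM upstream), the sketch below combines hypothetical per-criterion labels into a trial-level score; the label scheme and scoring rule are assumptions, not TrialGPT's exact method.

```python
# Aggregate per-criterion eligibility labels into a trial-level score
# used for ranking, with exclusion-criterion violations as hard filters.
criterion_preds = {                      # criterion -> (type, label); all hypothetical
    "age >= 18":               ("inclusion", "met"),
    "ECOG status 0-1":         ("inclusion", "unknown"),
    "no prior immunotherapy":  ("exclusion", "not_violated"),
    "adequate renal function": ("inclusion", "met"),
}

def trial_score(preds):
    if any(t == "exclusion" and lab == "violated" for t, lab in preds.values()):
        return float("-inf")             # hard exclusion
    inc = [lab for t, lab in preds.values() if t == "inclusion"]
    return sum(lab == "met" for lab in inc) / len(inc)

print(trial_score(criterion_preds))      # ~0.67: rank against other trials
```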

Universal and Transferable Adversarial Attacks on Aligned Language Models

  • paper_url: http://arxiv.org/abs/2307.15043
  • repo_url: https://github.com/llm-attacks/llm-attacks
  • paper_authors: Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
  • for: Proposes a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
  • methods: Automatically produces adversarial suffixes using a combination of greedy and gradient-based search techniques, improving over past automatic prompt generation methods.
  • results: The attack suffixes transfer broadly, inducing objectionable content in black-box, publicly released LLMs such as ChatGPT, Bard, and Claude, as well as open-source models.
    Abstract Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
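The following toy sketch conveys the flavor of suffix optimization using plain random token swaps scored by a small open model (GPT-2 standing in for an aligned chat model, and a benign target string); the paper's actual method combines greedy and gradient-based search and targets real aligned LLMs.

```python
# Toy suffix search: mutate a short suffix to raise the model's
# log-likelihood of an affirmative continuation.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt, target = "Write the instructions.", " Sure, here is"

@torch.no_grad()
def target_logprob(suffix):
    ids = tok(prompt + suffix + target, return_tensors="pt").input_ids
    n_tgt = len(tok(target).input_ids)         # approximate token boundary
    logprobs = model(ids).logits[0, :-1].log_softmax(-1)
    tgt = ids[0, 1:]
    return logprobs[-n_tgt:].gather(1, tgt[-n_tgt:, None]).sum().item()

suffix = [" x"] * 5
best = target_logprob("".join(suffix))
for _ in range(50):                             # greedy random token swaps
    cand = suffix.copy()
    cand[random.randrange(len(cand))] = tok.decode(random.randrange(tok.vocab_size))
    score = target_logprob("".join(cand))
    if score > best:
        suffix, best = cand, score
print(repr("".join(suffix)), best)
```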

AI Literature Review Suite

  • paper_url: http://arxiv.org/abs/2308.02443
  • repo_url: https://github.com/datovar4/ai_literature_review_suite
  • paper_authors: David A. Tovar
  • for: automate and optimize the process of literature review in academic and industrial research
  • methods: leverages open access science, large language models (LLMs), natural language processing, semantic search queries, text embeddings, and summarization
  • results: provides a comprehensive literature review, enables searching, downloading, and organizing of PDF files, and extracts content from articles with succinct summaries
    Abstract The process of conducting literature reviews is often time-consuming and labor-intensive. To streamline this process, I present an AI Literature Review Suite that integrates several functionalities to provide a comprehensive literature review. This tool leverages the power of open access science, large language models (LLMs) and natural language processing to enable the searching, downloading, and organizing of PDF files, as well as extracting content from articles. Semantic search queries are used for data retrieval, while text embeddings and summarization using LLMs present succinct literature reviews. Interaction with PDFs is enhanced through a user-friendly graphical user interface (GUI). The suite also features integrated programs for bibliographic organization, interaction and query, and literature review summaries. This tool presents a robust solution to automate and optimize the process of literature review in academic and industrial research.
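A minimal sketch of the semantic-search component, assuming the sentence-transformers library and a common checkpoint (the suite's actual models and abstracts may differ).

```python
# Embed paper abstracts and retrieve the closest matches for a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
abstracts = [
    "We study reinforcement learning from human feedback for alignment.",
    "A method for rotation-invariant point cloud features.",
    "Benchmarking federated learning under statistical heterogeneity.",
]
corpus = model.encode(abstracts, convert_to_tensor=True)
query = model.encode("surveys of RLHF limitations", convert_to_tensor=True)
hits = util.semantic_search(query, corpus, top_k=2)[0]
for h in hits:
    print(round(h["score"], 3), abstracts[h["corpus_id"]])
```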

SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

  • paper_url: http://arxiv.org/abs/2307.15020
  • repo_url: None
  • paper_authors: Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan
  • for: Evaluating large language models by how well they serve real-world usage, where user preference, not accuracy alone, is the criterion.
  • methods: Introduces SuperCLUE, a comprehensive Chinese benchmark with three sub-tasks: CArena (actual users' queries and ratings from an LLM battle platform), OPEN (open-ended questions with single- and multi-turn dialogues), and CLOSE (closed-ended questions sharing stems with the open-ended single-turn ones).
  • results: Finds that accuracy on closed-ended questions is insufficient to reflect human preferences on open-ended ones, though the two complement each other in predicting actual user preference; also shows that GPT-4 is a reliable automatic judge of human preferences on open-ended questions in a Chinese context.
    Abstract Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones. At the same time, they can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge to automatically evaluate human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com

How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges

  • paper_url: http://arxiv.org/abs/2307.15016
  • repo_url: https://github.com/htqin/googlebard-visunderstand
  • paper_authors: Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool
  • for: This paper explores Google Bard's ability to understand and interpret visual data (images) conditioned on text questions, evaluating its performance across varied task scenarios and identifying areas for improvement.
  • methods: Probes Bard with text-and-image inputs across 15 diverse task scenarios, spanning regular, camouflaged, medical, under-water, and remote sensing data.
  • results: The primary finding is that Bard still struggles in these vision scenarios, highlighting a significant gap in vision-based understanding that future multi-modal models need to bridge; the study offers insights for improving models' comprehension of fine-grained visual data.
    Abstract Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal Generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, under-water and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released on https://github.com/htqin/GoogleBard-VisUnderstand

Improved Neural Radiance Fields Using Pseudo-depth and Fusion

  • paper_url: http://arxiv.org/abs/2308.03772
  • repo_url: None
  • paper_authors: Jingliang Li, Qiang Zhou, Chaohui Yu, Zhengda Lu, Jun Xiao, Zhibin Wang, Fan Wang
  • for: Improving the rendering accuracy and novel view synthesis of Neural Radiance Fields (NeRF), particularly for real scenes containing objects and structures at many scales.
  • methods: Constructs multi-scale encoding volumes that supply multi-scale geometry information to the NeRF model; performs depth prediction and radiance field reconstruction simultaneously, with the predicted depth map supervising the rendered depth, narrowing the depth range, and guiding point sampling; further enhances point volume features through depth-guided neighbor feature fusion.
  • results: The method delivers superior performance in both novel view synthesis and dense geometry modeling without per-scene optimization.
    Abstract Since the advent of Neural Radiance Fields, novel view synthesis has received tremendous attention. The existing approach for the generalization of radiance field reconstruction primarily constructs an encoding volume from nearby source images as additional inputs. However, these approaches cannot efficiently encode the geometric information of real scenes with various scale objects/structures. In this work, we propose constructing multi-scale encoding volumes and providing multi-scale geometry information to NeRF models. To make the constructed volumes as close as possible to the surfaces of objects in the scene and the rendered depth more accurate, we propose to perform depth prediction and radiance field reconstruction simultaneously. The predicted depth map will be used to supervise the rendered depth, narrow the depth range, and guide points sampling. Finally, the geometric information contained in point volume features may be inaccurate due to occlusion, lighting, etc. To this end, we propose enhancing the point volume feature from depth-guided neighbor feature fusion. Experiments demonstrate the superior performance of our method in both novel view synthesis and dense geometry modeling without per-scene optimization.

Thinker: Learning to Plan and Act

  • paper_url: http://arxiv.org/abs/2307.14993
  • repo_url: https://github.com/anonymous-scrl/thinker
  • paper_authors: Stephen Chung, Ivan Anokhin, David Krueger
  • for: The goal is a new reinforcement learning algorithm that enables agents to autonomously interact with, and plan through, a learned world model.
  • methods: The algorithm wraps the environment with a world model and introduces dedicated model-interaction actions, letting the agent propose alternative plans to the world model before selecting a final action to execute in the environment.
  • results: Experiments show state-of-the-art performance on Sokoban and competitive results on the Atari 2600 benchmark; visualizations of trained agents indicate that they have learned to plan effectively with the world model to select better actions.
    Abstract We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for hand-crafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. The algorithm's generality opens a new research direction on how a world model can be used in reinforcement learning and how planning can be seamlessly integrated into an agent's decision-making process.

Multilingual Code Co-Evolution Using Large Language Models

  • paper_url: http://arxiv.org/abs/2307.14991
  • repo_url: None
  • paper_authors: Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
  • for: This work addresses the problem of propagating code changes across programming languages by using large language models (LLMs) to translate code edits.
  • methods: The proposed Codeditor models code changes as edit sequences and learns to correlate changes across programming languages.
  • results: On 6,613 aligned code changes, Codeditor outperforms state-of-the-art approaches by a large margin, and combining it with existing generation-based models yields even better performance.
    Abstract Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is being propagated, timely and without errors, to implementations in other programming languages. In the world of ever-changing software, using rule-based translation tools (i.e., transpilers) or machine learning models for translating code from one language to another provides limited value. Translating each time the entire codebase from one language to another is not the way developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects implementing similar functionalities in two programming languages (Java and C#). Results show that Codeditor outperforms the state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to the existing generation-based models, and their combination ensures even greater performance.

Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models

  • paper_url: http://arxiv.org/abs/2307.14971
  • repo_url: https://github.com/wangzy22/tap
  • paper_authors: Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu
  • for: To boost the performance of 3D vision models through generative pre-training.
  • methods: A 3D-to-2D generative pre-training scheme that generates view images from different instructed poses via a cross-attention mechanism.
  • results: Outperforms previous pre-training methods and also boosts architecture-oriented approaches, achieving state-of-the-art results when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation.
    Abstract With the overwhelming trend of mask image modeling led by MAE, generative pre-training has shown a remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images has more precise supervision than its point cloud counterpart, thus assisting 3D backbones to have a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results have proved the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP.

cs.CL - 2023-07-28

Robust Distortion-free Watermarks for Language Models

  • paper_url: http://arxiv.org/abs/2307.15593
  • repo_url: https://github.com/jthickstun/watermark
  • paper_authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang
  • for: The goal is to plant watermarks in text generated by language models so that misuse of generated text, e.g., for fraud or identity theft, can be detected.
  • methods: The watermark maps a sequence of random numbers, computed from a randomized watermark key, to samples from the language model, leaving the text distribution unchanged up to a maximum generation budget; two sampling schemes are instantiated, inverse transform sampling and exponential minimum sampling.
  • results: For OPT-1.3B and LLaMA-7B, watermarked text is reliably detectable (p <= 0.01) from 35 tokens even after 40-50% of the tokens are corrupted by random substitutions, insertions, or deletions; for Alpaca-7B, whose responses have lower entropy, detection is harder, with about 25% of responses (median length around 100 tokens) detectable at p <= 0.01, and the watermark is less robust to certain automated paraphrasing attacks.
    Abstract We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50$\% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
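
A toy sketch of the inverse-transform idea: a keyed pseudorandom sequence drives token sampling (marginally over the keyed uniforms each draw is an exact sample, so the text distribution is unchanged), and a key-holder later scores how well a text aligns with that sequence. The alignment statistic, function names, and unigram model below are simplified placeholders, not the authors' released implementation (see the linked repo for that):

```python
import numpy as np

def keyed_uniforms(key: int, n: int) -> np.ndarray:
    """Pseudorandom uniforms derived from the watermark key (toy stand-in)."""
    return np.random.default_rng(key).random(n)

def inverse_transform_sample(probs: np.ndarray, u: float) -> int:
    """Sample a token by inverting the CDF of the next-token distribution
    at the keyed uniform u; marginally over u this is an exact sample."""
    return int(np.searchsorted(np.cumsum(probs), u))

def generate(model_probs, key: int, length: int):
    us = keyed_uniforms(key, length)
    tokens = []
    for t in range(length):
        probs = model_probs(tokens)  # next-token distribution given context
        tokens.append(inverse_transform_sample(probs, us[t]))
    return tokens

def detect(tokens, key: int, vocab_size: int) -> float:
    """Toy alignment score: with the right key, each token's CDF position
    tracks the keyed uniform, so watermarked text scores higher."""
    us = keyed_uniforms(key, len(tokens))
    d = np.abs(np.array(tokens) / vocab_size - us)
    return float(1.0 - d.mean())

# demo with a fixed unigram model over a 50-token vocabulary
V = 50
unigram = np.random.default_rng(0).dirichlet(np.ones(V))
text = generate(lambda _: unigram, key=1234, length=35)
print(detect(text, key=1234, vocab_size=V))  # right key: high alignment
print(detect(text, key=9999, vocab_size=V))  # wrong key: near-chance score
```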

When to generate hedges in peer-tutoring interactions

  • paper_url: http://arxiv.org/abs/2307.15582
  • repo_url: https://github.com/neuromaancer/hedge_prediction
  • paper_authors: Alafate Abulimiti, Chloé Clavel, Justine Cassell
  • for: This paper studies how machine learning can predict where hedges occur in peer-tutoring interactions.
  • methods: The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviours; these elements are encoded into a vector representation of the previous turns, which serves as input to several machine learning models.
  • results: Embedding layers that capture the semantic information of the previous turns significantly improve model performance; Shapley-value explanations show the importance of features such as interpersonal rapport and nonverbal behaviours, and reveal that the eye gaze of both tutor and tutee has a significant impact on hedge prediction, an observation validated by a follow-up ablation study.
    Abstract This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviours. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models. Results show that embedding layers, that capture the semantic information of the previous turns, significantly improves the model's performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviours, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.
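
A small sketch of the general recipe described here — turn-level feature vectors fed to a classifier, with Shapley values for feature attribution — assuming synthetic features and the third-party shap library; the features, labels, and model choice below are placeholders, not the authors' setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
import shap  # third-party Shapley-value library

rng = np.random.default_rng(0)
n = 500
# placeholder turn-level features: e.g., rapport score, tutor gaze,
# tutee gaze, and a few previous-turn embedding dimensions
X = rng.normal(size=(n, 6))
y = (X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=n) > 0).astype(int)  # hedge / no hedge

clf = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
# mean |SHAP| per feature ~ global importance for hedge prediction
print(np.abs(shap_values).mean(axis=0))
```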

All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

  • paper_url: http://arxiv.org/abs/2307.15555
  • repo_url: None
  • paper_authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani
  • for: To address synthetic speech detection, and thereby help prevent the frauds and identity thefts enabled by speech deepfakes.
  • methods: The model fuses three different feature sets proposed in the literature for the synthetic speech detection task.
  • results: The fused model achieves better overall performance than state-of-the-art solutions across different scenarios and datasets, with robustness to anti-forensic attacks and good generalization capabilities.
    Abstract Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which solicit the development of synthetic speech detection algorithms to counter possible mischievous uses such as frauds or identity thefts. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performances with respect to the state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
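
A minimal PyTorch sketch of late feature fusion for a binary real/synthetic speech classifier, assuming three pre-computed per-utterance feature vectors; the dimensions and fusion-by-concatenation design are illustrative assumptions, since the paper does not release code:

```python
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    """Projects each feature set to a shared space, concatenates, classifies."""
    def __init__(self, dims=(64, 128, 256), hidden=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
                                   for d in dims])
        self.head = nn.Sequential(nn.Linear(hidden * len(dims), hidden),
                                  nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, feats):  # feats: list of (batch, dim) tensors
        z = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1)
        return self.head(z)  # logits for real vs synthetic

model = FusionDetector()
feats = [torch.randn(8, d) for d in (64, 128, 256)]
print(model(feats).shape)  # torch.Size([8, 2])
```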

‘What are you referring to?’ Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges

  • paper_url: http://arxiv.org/abs/2307.15554
  • repo_url: https://github.com/jchiyah/what-are-you-referring-to
  • paper_authors: Javier Chiyah-Garcia, Alessandro Suglia, Arash Eshghi, Helen Hastie
  • for: This paper targets referential ambiguity in dialogue, i.e., the repair mechanisms (Clarificational Exchanges) triggered when a referring expression does not uniquely identify its intended referent.
  • methods: The SIMMC 2.0 dataset is used to evaluate how well different state-of-the-art model architectures process Clarificational Exchanges (CEs), including CEs that depend on the dialogue history, via a metric that probes the contextual updates they induce in the model.
  • results: Language-based models can encode simple multimodal semantic information and handle some CEs, excelling at those related to the dialogue history, while multimodal models can use additional learning objectives to obtain disentangled object representations, which become crucial for handling complex referential ambiguities across modalities.
    Abstract Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair it using meta-communicative, Clarificational Exchanges (CE): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial to handle complex referential ambiguities across modalities overall.

Oracle Computability and Turing Reducibility in the Calculus of Inductive Constructions

  • paper_url: http://arxiv.org/abs/2307.15543
  • repo_url: None
  • paper_authors: Yannick Forster, Dominik Kirst, Niklas Mück
  • for: This paper develops synthetic notions of oracle computability and Turing reducibility and studies their properties.
  • methods: The development is carried out in the Calculus of Inductive Constructions (CIC) and machine-checked in the Coq proof assistant; oracle computations are defined via meta-level functions rather than object-level models of computation, with a notion of sequential continuity characterising valid oracle computations.
  • results: Turing reducibility forms an upper semilattice, transports decidability, and is strictly more expressive than truth-table reducibility; moreover, whenever a predicate $p$ and its complement are both semi-decidable relative to an oracle $q$, then $p$ Turing-reduces to $q$.
    Abstract We develop synthetic notions of oracle computability and Turing reducibility in the Calculus of Inductive Constructions (CIC), the constructive type theory underlying the Coq proof assistant. As usual in synthetic approaches, we employ a definition of oracle computations based on meta-level functions rather than object-level models of computation, relying on the fact that in constructive systems such as CIC all definable functions are computable by construction. Such an approach lends itself well to machine-checked proofs, which we carry out in Coq. There is a tension in finding a good synthetic rendering of the higher-order notion of oracle computability. On the one hand, it has to be informative enough to prove central results, ensuring that all notions are faithfully captured. On the other hand, it has to be restricted enough to benefit from axioms for synthetic computability, which usually concern first-order objects. Drawing inspiration from a definition by Andrej Bauer based on continuous functions in the effective topos, we use a notion of sequential continuity to characterise valid oracle computations. As main technical results, we show that Turing reducibility forms an upper semilattice, transports decidability, and is strictly more expressive than truth-table reducibility, and prove that whenever both a predicate $p$ and its complement are semi-decidable relative to an oracle $q$, then $p$ Turing-reduces to $q$.
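
For reference, the classical counterparts of these results can be stated in standard notation (the paper itself proves synthetic analogues inside CIC/Coq, not these exact formulations):

```latex
\begin{align*}
  p \preceq_T q
    &\iff \text{$p$ is decidable by a machine with oracle access to $q$} \\
  p \preceq_T r \ \wedge\ q \preceq_T r
    &\implies p \oplus q \preceq_T r
      && \text{(upper semilattice via the join $p \oplus q$)} \\
  p \preceq_T q \ \wedge\ q \text{ decidable}
    &\implies p \text{ decidable}
      && \text{(transport of decidability)} \\
  p \preceq_{tt} q
    &\implies p \preceq_T q, \text{ but not conversely}
      && \text{(strictly more expressive than truth-table)} \\
  p,\ \overline{p} \text{ semi-decidable in } q
    &\implies p \preceq_T q
      && \text{(relativised Post-style direction)}
\end{align*}
```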

The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling

  • paper_url: http://arxiv.org/abs/2307.15508
  • repo_url: https://github.com/briemadu/inc-eval-revisions
  • paper_authors: Brielen Madureira, Patrick Kahardipraja, David Schlangen
  • for: This paper is about edits and revision policies in incremental sequence labelling, where model components produce output prefixes that may later need to be revised.
  • methods: The authors formalise and characterise edits and revisions, propose metrics to evaluate revision policies, and profile three Transformer-based encoders across various tasks.
  • results: The incremental behaviour of these encoders differs across tasks, and the proposed profiling paves the way for better revision policies.
    Abstract Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.
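
A tiny sketch of one natural metric in this setting — the fraction of already-emitted labels that later get revised as the input grows — assuming a labeller that returns a full label sequence for each prefix; this illustrates the general idea rather than the paper's exact metric suite:

```python
def revision_rate(prefix_outputs):
    """prefix_outputs[t] is the label sequence the model assigns after
    seeing t+1 input tokens. Counts how often a previously emitted label
    changes in a later, longer output."""
    revised, emitted = 0, 0
    for t in range(1, len(prefix_outputs)):
        prev, curr = prefix_outputs[t - 1], prefix_outputs[t]
        emitted += len(prev)
        revised += sum(a != b for a, b in zip(prev, curr))
    return revised / emitted if emitted else 0.0

# the label for the first token flips from B to O once more context arrives
print(revision_rate([["B"], ["B", "I"], ["O", "B", "I"]]))  # 2/3
```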

The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems

  • paper_url: http://arxiv.org/abs/2307.15493
  • repo_url: None
  • paper_authors: Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
  • for: To assess how well current commercial speech recognition systems perform in conversational settings.
  • methods: Five major commercial ASR systems are evaluated for their conversational and multilingual support on natural conversational data in six languages.
  • results: Word error rates on natural conversational data remain abysmal and overlap remains a key challenge (study 1); this especially hurts the recognition of conversational words (study 2), which in turn has dire consequences for downstream intent recognition (study 3). The findings help evaluate the current state of conversational ASR, contribute to multidimensional error analysis, and identify the phenomena that most need attention on the way to robust interactive speech technologies.
    Abstract Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial ASR systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This impacts especially the recognition of conversational words (study 2), and in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify phenomena that need most attention on the way to build robust interactive speech technologies.
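
For context, word error rate — the headline metric in studies like this — is word-level edit distance over reference length; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    via the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

print(wer("yeah so we could meet tomorrow", "yes so we meet to morrow"))
```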

Cross-Modal Concept Learning and Inference for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.15460
  • repo_url: None
  • paper_authors: Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He
  • for: To improve on existing fine-tuning methods for vision-language models by modelling the different semantic objects and concepts within an image, instead of matching a class-specific text description against the whole image.
  • methods: The proposed cross-model concept learning and inference (CCLI) method uses CLIP's strong text-image correlation capability to automatically learn a large set of distinctive visual concepts from images using semantic text concepts, builds a discriminative image representation from these concepts, and learns a concept inference network for downstream image classification tasks.
  • results: CCLI improves on state-of-the-art methods by large margins, e.g., by up to 8.0% on few-shot learning and up to 1.3% on domain generalization.
    Abstract Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-model concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.
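
A minimal sketch of the text-image correlation step such methods build on, using the Hugging Face CLIP API; the concept list and image path are made-up placeholders, and selecting concepts by similarity only illustrates the general idea, not CCLI's full pipeline:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["whiskers", "pointed ears", "fur", "wheels", "wings"]  # placeholder text concepts
image = Image.open("cat.jpg")  # placeholder image path

inputs = processor(text=concepts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# similarity logits between the image and each text concept
scores = out.logits_per_image.softmax(dim=-1).squeeze(0)
top = scores.topk(3)
print([(concepts[i], round(float(s), 3)) for s, i in zip(top.values, top.indices)])
```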

Trie-NLG: Trie Context Augmentation to Improve Personalized Query Auto-Completion for Short and Unseen Prefixes

  • paper_url: http://arxiv.org/abs/2307.15455
  • repo_url: None
  • paper_authors: Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Manish Gupta, Puneet Agrawal
  • for: To improve the completion accuracy of query auto-completion (QAC) systems, especially for short and unseen prefixes.
  • methods: The proposed Trie-NLG augments the prefix with rich context — recent session queries plus top trie completions — so that a natural language generation model can jointly exploit popularity signals from the trie and personalization signals from previous session queries.
  • results: On two large QAC datasets, the approach improves MRR by about 57% and 14% on average over a popular trie-based lookup and a strong BART-based baseline, respectively.
    Abstract Query auto-completion (QAC) aims at suggesting plausible completions for a given query prefix. Traditionally, QAC systems have leveraged tries curated from historical query logs to suggest most popular completions. In this context, there are two specific scenarios that are difficult to handle for any QAC system: short prefixes (which are inherently ambiguous) and unseen prefixes. Recently, personalized Natural Language Generation (NLG) models have been proposed to leverage previous session queries as context for addressing these two challenges. However, such NLG models suffer from two drawbacks: (1) some of the previous session queries could be noisy and irrelevant to the user intent for the current prefix, and (2) NLG models cannot directly incorporate historical query popularity. This motivates us to propose a novel NLG model for QAC, Trie-NLG, which jointly leverages popularity signals from trie and personalization signals from previous session queries. We train the Trie-NLG model by augmenting the prefix with rich context comprising of recent session queries and top trie completions. This simple modeling approach overcomes the limitations of trie-based and NLG-based approaches and leads to state-of-the-art performance. We evaluate the Trie-NLG model using two large QAC datasets. On average, our model achieves huge ~57% and ~14% boost in MRR over the popular trie-based lookup and the strong BART-based baseline methods, respectively. We make our code publicly available.
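
A small sketch of the context-augmentation idea: a popularity-weighted trie supplies top completions, which are concatenated with recent session queries into the input for a seq2seq model. The trie implementation and the prompt template here are illustrative, not the paper's:

```python
from collections import defaultdict

class Trie:
    def __init__(self):
        self.children = defaultdict(Trie)
        self.count = 0  # popularity of the query ending at this node

    def add(self, query):
        node = self
        for ch in query:
            node = node.children[ch]
        node.count += 1

    def top_completions(self, prefix, k=3):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found = []
        def walk(n, suffix):
            if n.count:
                found.append((n.count, prefix + suffix))
            for ch, child in n.children.items():
                walk(child, suffix + ch)
        walk(node, "")
        return [q for _, q in sorted(found, reverse=True)[:k]]

trie = Trie()
for q in ["taylor swift", "taylor swift tickets", "tax return", "tax return deadline"]:
    trie.add(q)

prefix, session = "ta", ["concert dates", "ticket prices"]
# augmented input for the NLG model (illustrative template)
nlg_input = (f"prefix: {prefix} | session: {' ; '.join(session)}"
             f" | trie: {' ; '.join(trie.top_completions(prefix))}")
print(nlg_input)
```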

CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition

  • paper_url: http://arxiv.org/abs/2307.15432
  • repo_url: None
  • paper_authors: Jiang Li, Yingjian Liu, Xiaoping Wang, Zhigang Zeng
  • for: This paper proposes a cross-modal fusion network with emotion-shift awareness (CFN-ESA) for multimodal emotion recognition in conversation (ERC).
  • methods: Textual modalities are treated as the primary source of emotional information, with visual and acoustic modalities as secondary sources; an emotion-shift module (LESM) captures emotion-shift information and guides the learning of the main task.
  • results: Experiments show that CFN-ESA effectively improves ERC performance and remarkably outperforms state-of-the-art models.
    Abstract Multimodal Emotion Recognition in Conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a cross-modal fusion network with emotion-shift awareness (CFN-ESA) for ERC. Extant approaches employ each modality equally without distinguishing the amount of emotional information, rendering it hard to adequately extract complementary and associative information from multimodal data. To cope with this problem, in CFN-ESA, textual modalities are treated as the primary source of emotional information, while visual and acoustic modalities are taken as the secondary sources. Besides, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, leading to the failure of emotion recognition under emotion-shift scenario. We elaborate an emotion-shift module to address this challenge. CFN-ESA mainly consists of the unimodal encoder (RUME), cross-modal encoder (ACME), and emotion-shift module (LESM). RUME is applied to extract conversation-level contextual emotional cues while pulling together the data distributions between modalities; ACME is utilized to perform multimodal interaction centered on textual modality; LESM is used to model emotion shift and capture related information, thereby guide the learning of the main task. Experimental results demonstrate that CFN-ESA can effectively promote performance for ERC and remarkably outperform the state-of-the-art models.

Investigating the Learning Behaviour of In-context Learning: A Comparison with Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.15411
  • repo_url: https://github.com/xdwang0726/icl_ll
  • paper_authors: Xindi Wang, Yufei Wang, Can Xu, Xiubo Geng, Bowen Zhang, Chongyang Tao, Frank Rudzicz, Robert E. Mercer, Daxin Jiang
  • for: This paper investigates the learning behaviour of in-context learning (ICL) and compares it with supervised learning (SL) under label perturbations.
  • methods: The same LLMs are trained with the same demonstration examples via ICL and SL respectively, and their downstream performance on classification tasks is studied under label perturbations, i.e., noisy labels and label imbalance.
  • results: Gold labels have significant impacts on downstream in-context performance, especially for large language models, whereas imbalanced labels matter little to ICL across all model sizes; compared with SL, ICL is less sensitive to label perturbations and gradually attains performance comparable to SL as model size increases.
    Abstract Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behaviour of ICL, we train the same LLMs with the same demonstration examples via ICL and supervised learning (SL), respectively, and investigate their performance under label perturbations (i.e., noisy labels and label imbalance) on a range of classification tasks. First, via extensive experiments, we find that gold labels have significant impacts on the downstream in-context performance, especially for large language models; however, imbalanced labels matter little to ICL across all model sizes. Second, when comparing with SL, we show empirically that ICL is less sensitive to label perturbations than SL, and ICL gradually attains comparable performance to SL as the model size increases.

Towards a Fully Unsupervised Framework for Intent Induction in Customer Support Dialogues

  • paper_url: http://arxiv.org/abs/2307.15410
  • repo_url: None
  • paper_authors: Rita Costa, Bruno Martins, Sérgio Viana, Luisa Coheur
  • for: This work proposes a completely unsupervised framework for intent induction within a dialogue.
  • methods: Pre-processing the dialogue corpora is shown to improve results, and the dialogue flows of intentions are extracted by investigating the most common sequences.
  • results: The framework yields reliable results on the MultiWOZ dataset; since it requires no prior knowledge, it is applicable to any possible use case, making it highly relevant to real-world customer support applications across industry.
    Abstract State of the art models in intent induction require annotated datasets. However, annotating dialogues is time-consuming, laborious and expensive. In this work, we propose a completely unsupervised framework for intent induction within a dialogue. In addition, we show how pre-processing the dialogue corpora can improve results. Finally, we show how to extract the dialogue flows of intentions by investigating the most common sequences. Although we test our work in the MultiWOZ dataset, the fact that this framework requires no prior knowledge make it applicable to any possible use case, making it very relevant to real world customer support applications across industry.

Multilingual Tourist Assistance using ChatGPT: Comparing Capabilities in Hindi, Telugu, and Kannada

  • paper_url: http://arxiv.org/abs/2307.15376
  • repo_url: None
  • paper_authors: Sanjana Kolar, Rohit Kumar
  • for: This study evaluates the effectiveness of OpenAI's ChatGPT in translating English into Hindi, Telugu, and Kannada, with the aim of assisting tourists in India's linguistically diverse environment.
  • methods: A test set of 50 questions from diverse fields such as general knowledge, food, and travel was translated; five volunteers rated each translation for accuracy and fluency, and the scores were converted into BLEU scores, which measure the closeness of a machine-generated translation to a human translation (higher is better).
  • results: Hindi translations outperformed the others, showing superior accuracy and fluency, whereas Telugu translations lagged behind; the human ratings of both accuracy and fluency offer a comprehensive perspective on the language model's performance.
    Abstract This research investigates the effectiveness of ChatGPT, an AI language model by OpenAI, in translating English into Hindi, Telugu, and Kannada languages, aimed at assisting tourists in India's linguistically diverse environment. To measure the translation quality, a test set of 50 questions from diverse fields such as general knowledge, food, and travel was used. These were assessed by five volunteers for accuracy and fluency, and the scores were subsequently converted into a BLEU score. The BLEU score evaluates the closeness of a machine-generated translation to a human translation, with a higher score indicating better translation quality. The Hindi translations outperformed others, showcasing superior accuracy and fluency, whereas Telugu translations lagged behind. Human evaluators rated both the accuracy and fluency of translations, offering a comprehensive perspective on the language model's performance.
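
For reference, corpus-level BLEU can be computed with the sacrebleu package; the hypothesis/reference strings below are toy examples:

```python
import sacrebleu

hypotheses = ["the museum opens at nine", "where can i buy a ticket"]
# one reference stream, aligned one-to-one with the hypotheses
references = [["the museum opens at nine a.m.", "where can i purchase a ticket"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))  # 0-100 scale; higher means closer to the references
```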

Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in Argumentation

  • paper_url: http://arxiv.org/abs/2307.15341
  • repo_url: None
  • paper_authors: Camélia Guerraoui, Paul Reisert, Naoya Inoue, Farjana Sultana Mim, Shoichi Naito, Jungmin Choi, Irfan Robbani, Wenzhi Wang, Kentaro Inui
  • for: This survey examines the feedback provided by computational models of argumentation, with the goal of helping learners improve their critical thinking skills.
  • methods: It explores the different dimensions of feedback (richness, visualization, interactivity, and personalization) offered by current computational models for argumentation, and how the explanatory power of such models could be enhanced.
  • results: Existing models are useful for evaluating argument quality but often cannot explain why a particular argument is considered poor, which limits the constructive feedback they can give to learners.
    Abstract The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to provide constructive feedback to users to strengthen their critical thinking skills. In this survey, we aim to explore the different dimensions of feedback (Richness, Visualization, Interactivity, and Personalization) provided by the current computational models for argumentation, and the possibility of enhancing the power of explanations of such models, ultimately helping learners improve their critical thinking skills.

BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.15335
  • repo_url: None
  • paper_authors: Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu Thuy Nguyen
  • for: To propose a transformer-based Vietnamese model for visual question answering (VQA), addressing the fact that, while resource-rich English has seen notable VQA advances, languages such as Vietnamese lack dedicated models; the model is evaluated on Vietnamese VQA datasets.
  • methods: The model combines pre-trained Sequence-to-Sequence and bidirectional encoder representation from Image Transformers, trained for Vietnamese.
  • results: The proposed model outperforms a strong baseline and improves the state of the art on six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.
    Abstract Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV), capturing the interest of researchers. The English language, renowned for its wealth of resources, has witnessed notable advancements in both datasets and models designed for VQA. However, there is a lack of models that target specific countries such as Vietnam. To address this limitation, we introduce a transformer-based Vietnamese model named BARTPhoBEiT. This model includes pre-trained Sequence-to-Sequence and bidirectional encoder representation from Image Transformers in Vietnamese and evaluates Vietnamese VQA datasets. Experimental results demonstrate that our proposed model outperforms the strong baseline and improves the state-of-the-art in six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.

SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text

  • paper_url: http://arxiv.org/abs/2308.01420
  • repo_url: None
  • paper_authors: Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez
  • for: To improve dimensionality reduction of text corpora so that low-dimensional projections better capture human notions of document similarity.
  • methods: A semi-supervised, human-in-the-loop LDA-based method that learns topics preserving semantically meaningful relationships between documents, using only a small number of user-provided labels.
  • results: On synthetic corpora the method yields more interpretable projections than baseline methods with only a fraction of labels provided; on a real corpus, qualitatively similar results are obtained.
    Abstract A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for dimensionality reduction of text corpora, like Latent Dirichlet Allocation (LDA), often produce projections that do not capture human notions of document similarity. We propose a semi-supervised human-in-the-loop LDA-based method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections. On synthetic corpora, our method yields more interpretable projections than baseline methods with only a fraction of labels provided. On a real corpus, we obtain qualitatively similar results.

TrafficSafetyGPT: Tuning a Pre-trained Large Language Model to a Domain-Specific Expert in Transportation Safety

  • paper_url: http://arxiv.org/abs/2307.15311
  • repo_url: https://github.com/ozheng1993/trafficsafetygpt
  • paper_authors: Ou Zheng, Mohamed Abdel-Aty, Dongdong Wang, Chenzhu Wang, Shengxuan Ding
  • for: To improve the suboptimal performance of large language models (LLMs) on transportation-safety-domain tasks, which stems from the need for specialized transportation safety expertise when generating accurate responses.
  • methods: A novel LLaMA-based model undergoes supervised fine-tuning on the TrafficSafety-2K dataset, which contains human labels from government-produced guiding books and ChatGPT-generated instruction-output pairs.
  • results: The resulting TrafficSafetyGPT performs well on transportation safety domain tasks while reducing the reliance on specialized expertise; the model and training dataset are available at https://github.com/ozheng1993/TrafficSafetyGPT.
    Abstract Large Language Models (LLMs) have shown remarkable effectiveness in various general-domain natural language processing (NLP) tasks. However, their performance in transportation safety domain tasks has been suboptimal, primarily attributed to the requirement for specialized transportation safety expertise in generating accurate responses [1]. To address this challenge, we introduce TrafficSafetyGPT, a novel LLAMA-based model, which has undergone supervised fine-tuning using TrafficSafety-2K dataset which has human labels from government produced guiding books and ChatGPT-generated instruction-output pairs. Our proposed TrafficSafetyGPT model and TrafficSafety-2K train dataset are accessible at https://github.com/ozheng1993/TrafficSafetyGPT.

ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation

  • paper_url: http://arxiv.org/abs/2307.15290
  • repo_url: https://github.com/lianjiatech/belle
  • paper_authors: Cheng Wen, Xianghui Sun, Shuaijiang Zhao, Xiaoquan Fang, Liangyu Chen, Wei Zou
  • for: To develop and evaluate a domain-specific language model (DSLM) designed for the intricate field of home renovation.
  • methods: The study employs domain-adaptive pretraining and instruction-tuning over an extensive dataset comprising professional articles, standard documents, and web content related to home renovation.
  • results: Experiments on diverse datasets, including the newly introduced "EvalHome" domain dataset, show that ChatHome not only enhances domain-specific capabilities but also preserves its versatility.
    Abstract This paper presents the development and evaluation of ChatHome, a domain-specific language model (DSLM) designed for the intricate field of home renovation. Considering the proven competencies of large language models (LLMs) like GPT-4 and the escalating fascination with home renovation, this study endeavors to reconcile these aspects by generating a dedicated model that can yield high-fidelity, precise outputs relevant to the home renovation arena. ChatHome's novelty rests on its methodology, fusing domain-adaptive pretraining and instruction-tuning over an extensive dataset. This dataset includes professional articles, standard documents, and web content pertinent to home renovation. This dual-pronged strategy is designed to ensure that our model can assimilate comprehensive domain knowledge and effectively address user inquiries. Via thorough experimentation on diverse datasets, both universal and domain-specific, including the freshly introduced "EvalHome" domain dataset, we substantiate that ChatHome not only amplifies domain-specific functionalities but also preserves its versatility.

Multilingual Lexical Simplification via Paraphrase Generation

  • paper_url: http://arxiv.org/abs/2307.15286
  • repo_url: https://github.com/kpkqwq/lspg
  • paper_authors: Kang Liu, Jipeng Qiang, Yun Li, Yunhao Yuan, Yi Zhu, Kaixun Hua
  • for: To improve lexical simplification so that it generates more accurate and diverse substitutes for complex words while preserving sentence meaning.
  • methods: A multilingual lexical simplification method based on paraphrase generation: paraphrasing is treated as a zero-shot translation task within multilingual neural machine translation, and a novel decoding strategy concentrates solely on the lexical variations of the complex word.
  • results: The approach significantly outperforms BERT-based methods and a zero-shot GPT-3-based method on English, Spanish, and Portuguese.
    Abstract Lexical simplification (LS) methods based on pretrained language models have made remarkable progress, generating potential substitutes for a complex word through analysis of its contextual surroundings. However, these methods require separate pretrained models for different languages and disregard the preservation of sentence meaning. In this paper, we propose a novel multilingual LS method via paraphrase generation, as paraphrases provide diversity in word selection while preserving the sentence's meaning. We regard paraphrasing as a zero-shot translation task within multilingual neural machine translation that supports hundreds of languages. After feeding the input sentence into the encoder of paraphrase modeling, we generate the substitutes based on a novel decoding strategy that concentrates solely on the lexical variations of the complex word. Experimental results demonstrate that our approach surpasses BERT-based methods and zero-shot GPT3-based method significantly on English, Spanish, and Portuguese.

f-Divergence Minimization for Sequence-Level Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2307.15190
  • repo_url: https://github.com/manga-uofa/fdistill
  • paper_authors: Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou
  • for: This paper targets knowledge distillation in language processing, i.e., transferring knowledge from a large model to a small one.
  • methods: The proposed f-DISTILL framework formulates sequence-level knowledge distillation as minimizing a generalized f-divergence; four distilling variants are derived under the framework, with the existing SeqKD and ENGINE approaches shown to be approximations of them, and a step-wise decomposition reduces the intractable sequence-level divergence to word-level losses that can be computed tractably.
  • results: Across four datasets, the methods outperform existing KD approaches, and the symmetric distilling losses better force the student to learn from the teacher distribution.
    Abstract Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
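
A small PyTorch sketch of word-level distillation losses for a few common f-divergences (forward KL, reverse KL, and the symmetric Jensen-Shannon), applied per token position in the spirit of the step-wise decomposition; this is a generic illustration of the loss family, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, kind="forward_kl"):
    """student_logits, teacher_logits: (batch, seq, vocab)."""
    p = F.softmax(teacher_logits, dim=-1)   # teacher distribution
    q = F.softmax(student_logits, dim=-1)   # student distribution
    log_p, log_q = p.clamp_min(1e-9).log(), q.clamp_min(1e-9).log()
    if kind == "forward_kl":       # KL(p || q): mode-covering
        loss = (p * (log_p - log_q)).sum(-1)
    elif kind == "reverse_kl":     # KL(q || p): mode-seeking
        loss = (q * (log_q - log_p)).sum(-1)
    elif kind == "js":             # symmetric Jensen-Shannon divergence
        m = 0.5 * (p + q)
        log_m = m.clamp_min(1e-9).log()
        loss = (0.5 * (p * (log_p - log_m)).sum(-1)
                + 0.5 * (q * (log_q - log_m)).sum(-1))
    else:
        raise ValueError(kind)
    return loss.mean()  # average over batch and positions

s, t = torch.randn(2, 5, 100), torch.randn(2, 5, 100)
for k in ("forward_kl", "reverse_kl", "js"):
    print(k, float(word_level_kd_loss(s, t, k)))
```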

A Geometric Notion of Causal Probing

  • paper_url: http://arxiv.org/abs/2307.15054
  • repo_url: None
  • paper_authors: Clément Guerner, Anej Svete, Tianyu Liu, Alexander Warstadt, Ryan Cotterell
  • for: The paper proposes a formal definition of $\textit{intrinsic}$ information in a subspace of a language model's representation space, with the aim of controlling the model's predictions along a given concept.
  • methods: A counterfactual approach treats components in the subspace and its orthogonal complement independently, avoiding the spurious-correlations failure mode (Kumar et al., 2022); the counterfactual notion of information in a subspace is shown to be optimized by a $\textit{causal}$ concept subspace.
  • results: Under this framework, the one-dimensional subspace returned by R-LACE (Ravfogel et al., 2022) contains roughly half of the total concept information, and a causal controlled intervention shows that, for at least one model, this subspace can be used to manipulate the concept value of the generated word with precision.
    Abstract Large language models rely on real-valued representations of text to make their predictions. These representations contain information learned from the data that the model has trained on, including knowledge of linguistic properties and forms of demographic bias, e.g., based on gender. A growing body of work has considered removing information about concepts such as these using orthogonal projections onto subspaces of the representation space. We contribute to this body of work by proposing a formal definition of $\textit{intrinsic}$ information in a subspace of a language model's representation space. We propose a counterfactual approach that avoids the failure mode of spurious correlations (Kumar et al., 2022) by treating components in the subspace and its orthogonal complement independently. We show that our counterfactual notion of information in a subspace is optimized by a $\textit{causal}$ concept subspace. Furthermore, this intervention allows us to attempt concept controlled generation by manipulating the value of the conceptual component of a representation. Empirically, we find that R-LACE (Ravfogel et al., 2022) returns a one-dimensional subspace containing roughly half of total concept information under our framework. Our causal controlled intervention shows that, for at least one model, the subspace returned by R-LACE can be used to manipulate the concept value of the generated word with precision.
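
A tiny numpy sketch of the underlying geometric operation — splitting a representation into its component inside a concept subspace and its orthogonal complement, then intervening only on the former; the basis here is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 1                      # representation dim, concept-subspace dim
B, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis of the subspace

h = rng.normal(size=d)            # a hidden representation
h_concept = B @ (B.T @ h)         # component inside the concept subspace
h_rest = h - h_concept            # orthogonal complement, left untouched

# counterfactual intervention: set the concept component to a target value
target = 2.5 * B[:, 0]
h_intervened = h_rest + target

print(np.allclose(B.T @ h_intervened, [2.5]))  # concept coordinate is now 2.5
```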

Gzip versus bag-of-words for text classification

  • paper_url: http://arxiv.org/abs/2307.15002
  • repo_url: https://github.com/flipz357/npc_gzip_exp
  • paper_authors: Juri Opitz
  • for: This note aims to show that bag-of-words approaches can achieve similar or better text classification results than the recently popular gzip compression method, while being more efficient.
  • methods: Bag-of-words classifiers are compared against the compression-based ('gzip') approach.
  • results: Bag-of-words achieves similar or better results and is more efficient.
    Abstract The effectiveness of compression in text classification ('gzip') has recently garnered lots of attention. In this note we show that `bag-of-words' approaches can achieve similar or better results, and are more efficient.
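
A compact sketch of the two approaches being compared — gzip-based 1-nearest-neighbour with normalized compression distance versus a TF-IDF bag-of-words classifier — on toy data; a real comparison would of course use proper datasets and train/test splits:

```python
import gzip
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = [("cheap pills buy now", "spam"), ("meeting moved to friday", "ham"),
         ("win a free prize today", "spam"), ("see attached project report", "ham")]
test_text = "free pills prize"

def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))

# gzip + 1-NN with normalized compression distance (NCD)
def ncd(a: str, b: str) -> float:
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

gzip_pred = min(train, key=lambda tl: ncd(test_text, tl[0]))[1]

# bag-of-words: TF-IDF features + logistic regression
texts, labels = zip(*train)
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)
bow_pred = clf.predict(vec.transform([test_text]))[0]

print(gzip_pred, bow_pred)
```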

Scaling TransNormer to 175 Billion Parameters

  • paper_url: http://arxiv.org/abs/2307.14995
  • repo_url: None
  • paper_authors: Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, Yu Qiao, Yiran Zhong
  • for: The paper proposes TransNormerLLM, a new linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency.
  • methods: The paper introduces several advanced modifications to the previous linear attention architecture TransNormer, including positional embedding, linear attention acceleration, a gating mechanism, tensor normalization, and inference acceleration and stabilization; it also proposes a new technique called Lightning Attention to accelerate linear attention.
  • results: Lightning Attention more than doubles linear attention's runtime and reduces memory usage by a remarkable four times, while gating and the new tensor normalization scheme yield an acceleration of over 20%; the model shows superior efficiency during both training and inference, is scalable for seamless deployment on large-scale clusters, and is validated through comprehensive experiments on a self-collected corpus exceeding 6TB and containing over 2 trillion tokens.
    Abstract We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that accelerates linear attention by more than twice in runtime and reduces memory usage by a remarkable four times. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an impressive acceleration of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed, regardless of the sequence length, showcasing superior efficiency during both training and inference stages. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models, all while maintaining outstanding performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, boasting a size exceeding 6TB and containing over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
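
A minimal numpy sketch of causal linear attention, the core primitive such models build on: softmax(QK^T)V is replaced by a feature map and a running sum of key-value outer products, so each step costs O(d^2) instead of O(n d). The elu-plus-one feature map here is one common choice, not necessarily TransNormerLLM's:

```python
import numpy as np

def phi(x):  # a common positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Q, K: (n, d); V: (n, dv). Maintains running sums
    S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i)."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    z = np.zeros(d)
    out = np.zeros((n, V.shape[1]))
    for t in range(n):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-9)  # normalized attention output
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)  # (6, 4)
```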

Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

  • paper_url: http://arxiv.org/abs/2307.14988
  • repo_url: None
  • paper_authors: Or Sharir, Anima Anandkumar
  • for: To enable efficient incremental computation for deep learning models whose inputs change dynamically, such as sensor data or user edits to a document.
  • methods: Vector quantization discretizes intermediate values in the network, filtering out noisy and unnecessary modifications to hidden neurons so that their values can be reused; applied to the transformer architecture, this yields an incremental inference algorithm whose complexity is proportional to the fraction of modified inputs.
  • results: Adapting the OPT-125M pre-trained language model, the method achieves comparable accuracy on document classification while requiring 12.1x (median) fewer operations for processing sequences of atomic edits.
    Abstract Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.
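
A toy numpy sketch of the reuse mechanism: hidden vectors are snapped to a small codebook, and a downstream layer is recomputed only at positions whose code index changed between edits. The codebook and the "layer" here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))   # 32 codes, hidden dim 8
W = rng.normal(size=(8, 8))           # a downstream layer (placeholder)

def quantize(H):
    """Nearest-code index for each row of H, shape (n, 8)."""
    d2 = ((H[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def incremental_layer(H_new, codes_old, cache):
    codes_new = quantize(H_new)
    changed = np.nonzero(codes_new != codes_old)[0]
    # recompute only where the discrete code changed; reuse the rest
    cache[changed] = codebook[codes_new[changed]] @ W
    return codes_new, cache, len(changed)

H = rng.normal(size=(10, 8))
codes = quantize(H)
cache = codebook[codes] @ W           # initial full pass

H_edit = H.copy()
H_edit[3] += 2.0                      # a single "atomic edit"
codes, cache, n_recomputed = incremental_layer(H_edit, codes, cache)
print(n_recomputed, "of", len(H), "positions recomputed")
```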

cs.LG - 2023-07-28

TriadNet: Sampling-free predictive intervals for lesional volume in 3D brain MR images

  • paper_url: http://arxiv.org/abs/2307.15638
  • repo_url: https://github.com/benolmbrt/TriadNet
  • paper_authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat
  • for: To make volume segmentation tools more useful and accepted in clinical practice by equipping them with quantitative predictive intervals.
  • methods: TriadNet, a multi-head convolutional neural network (CNN) architecture, provides both the lesion volumes and the associated predictive intervals simultaneously, in less than a second.
  • results: The approach outperforms other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.
    Abstract The volume of a brain lesion (e.g. infarct or tumor) is a powerful indicator of patient prognosis and can be used to guide the therapeutic strategy. Lesional volume estimation is usually performed by segmentation with deep convolutional neural networks (CNN), currently the state-of-the-art approach. However, to date, few work has been done to equip volume segmentation tools with adequate quantitative predictive intervals, which can hinder their usefulness and acceptation in clinical practice. In this work, we propose TriadNet, a segmentation approach relying on a multi-head CNN architecture, which provides both the lesion volumes and the associated predictive intervals simultaneously, in less than a second. We demonstrate its superiority over other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.

A Comparative Analysis of Machine Learning Methods for Lane Change Intention Recognition Using Vehicle Trajectory Data

  • paper_url: http://arxiv.org/abs/2307.15625
  • repo_url: None
  • paper_authors: Renteng Yuan
  • for: The study aims to recognize lane change (LC) intention from high-dimensional time-series data with machine learning, helping autonomous vehicles better understand their surroundings, recognize potential safety hazards, and improve traffic safety.
  • methods: It compares the performance of different machine learning methods for LC intention recognition, including the XGBoost and LightGBM algorithms, on 1023 vehicle trajectories extracted from the CitySim dataset.
  • results: With ninety-eight percent classification accuracy, ensemble methods reduce the impact of Type II and Type III classification errors, and LightGBM trains about six times faster than XGBoost without sacrificing recognition accuracy.
    Abstract Accurately detecting and predicting lane change (LC)processes can help autonomous vehicles better understand their surrounding environment, recognize potential safety hazards, and improve traffic safety. This paper focuses on LC processes and compares different machine learning methods' performance to recognize LC intention from high-dimensionality time series data. To validate the performance of the proposed models, a total number of 1023 vehicle trajectories is extracted from the CitySim dataset. For LC intention recognition issues, the results indicate that with ninety-eight percent of classification accuracy, ensemble methods reduce the impact of Type II and Type III classification errors. Without sacrificing recognition accuracy, the LightGBM demonstrates a sixfold improvement in model training efficiency than the XGBoost algorithm.
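A minimal sketch of the kind of LightGBM-based LC-intention classifier compared above, with synthetic stand-in features and labels (the real study uses CitySim trajectories); it assumes the lightgbm and scikit-learn packages are installed.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1023, 60))      # hypothetical flattened trajectory windows
y = rng.integers(0, 3, size=1023)    # 0: keep lane, 1: LC left, 2: LC right

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```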

  • paper_url: http://arxiv.org/abs/2307.15621
  • repo_url: https://github.com/awesomelemon/pbt-nas
  • paper_authors: Alexander Chebykin, Arkadiy Dushatskiy, Tanja Alderliesten, Peter A. N. Bosman
  • for: This paper targets Neural Architecture Search (NAS) by simultaneously training and mixing neural networks.
  • methods: Reusing partially trained weights enables efficient search; the proposed PBT-NAS adapts Population Based Training (PBT) to NAS by replacing poorly performing networks during training with the result of mixing well-performing ones, inheriting weights via the shrink-perturb technique. After PBT-NAS terminates, the created networks can be used directly without retraining.
  • results: On challenging tasks (image generation and reinforcement learning), PBT-NAS achieves superior performance compared to the baselines (random search and mutation-based PBT).
    Abstract In this work, we show that simultaneously training and mixing neural networks is a promising way to conduct Neural Architecture Search (NAS). For hyperparameter optimization, reusing the partially trained weights allows for efficient search, as was previously demonstrated by the Population Based Training (PBT) algorithm. We propose PBT-NAS, an adaptation of PBT to NAS where architectures are improved during training by replacing poorly-performing networks in a population with the result of mixing well-performing ones and inheriting the weights using the shrink-perturb technique. After PBT-NAS terminates, the created networks can be directly used without retraining. PBT-NAS is highly parallelizable and effective: on challenging tasks (image generation and reinforcement learning) PBT-NAS achieves superior performance compared to baselines (random search and mutation-based PBT).
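A minimal sketch of the shrink-perturb weight inheritance used when mixing networks. The blend coefficients and the parameter-wise pairing of two same-shaped networks are illustrative assumptions; PBT-NAS also mixes architectures, which this sketch does not model.

```python
import torch

def shrink_perturb(inherited, fresh, shrink=0.4, perturb=0.1):
    """Shrink an inherited parameter toward zero and add a scaled
    freshly initialized one (coefficients here are placeholders)."""
    return shrink * inherited + perturb * fresh

@torch.no_grad()
def inherit_weights(child, parent, shrink=0.4, perturb=0.1):
    """Copy parent's weights into child via shrink-perturb where shapes match."""
    for p_c, p_p in zip(child.parameters(), parent.parameters()):
        if p_c.shape == p_p.shape:
            p_c.copy_(shrink_perturb(p_p, torch.randn_like(p_p), shrink, perturb))
    return child
```

In the PBT loop, a poorly performing population member would be replaced by a mixture of two well-performing ones, with the inherited weights treated this way so training can continue without a full restart.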

Robust Distortion-free Watermarks for Language Models

  • paper_url: http://arxiv.org/abs/2307.15593
  • repo_url: https://github.com/jthickstun/watermark
  • paper_authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang
  • for: The paper plants watermarks in text generated by an autoregressive language model so that machine-generated text can be detected, deterring misuse such as fraud or impersonation, without changing the distribution over text up to a maximum generation budget.
  • methods: Watermarked text is generated by mapping a keyed sequence of random numbers to samples from the language model, instantiated with two sampling schemes: inverse transform sampling and exponential minimum sampling.
  • results: Experiments on OPT-1.3B, LLaMA-7B, and Alpaca-7B show the watermark is reliably detectable ($p \leq 0.01$ from 35 tokens) even after 40-50% of the tokens are corrupted by random edits; detection is harder for the lower-entropy Alpaca-7B responses, of which around 25% are detectable.
    Abstract We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50$\% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
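A toy sketch of the inverse-transform-sampling scheme named above: a keyed pseudorandom sequence drives token sampling, and anyone holding the key can re-derive the sequence and test alignment. The detection statistic here is a crude stand-in for the paper's alignment test.

```python
import numpy as np

def watermark_generate(next_token_probs, key):
    """Token t is the smallest index whose CDF reaches the keyed draw u_t."""
    rng = np.random.default_rng(key)
    tokens = []
    for p in next_token_probs:                     # p: vocab probability vector
        u = rng.random()
        tokens.append(int(min(np.searchsorted(np.cumsum(p), u), len(p) - 1)))
    return tokens

def watermark_score(tokens, key, vocab_size):
    """Small when text aligns with the keyed sequence, about 1/3 otherwise."""
    rng = np.random.default_rng(key)
    u = rng.random(len(tokens))
    return float(np.mean(np.abs(u - np.asarray(tokens) / vocab_size)))

V = 50
probs = [np.full(V, 1.0 / V) for _ in range(200)]  # toy near-uniform model
toks = watermark_generate(probs, key=42)
print(watermark_score(toks, key=42, vocab_size=V))  # small: watermarked
print(watermark_score(toks, key=7, vocab_size=V))   # ~0.33: wrong key
```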

Evaluating the structure of cognitive tasks with transfer learning

  • paper_url: http://arxiv.org/abs/2308.02408
  • repo_url: None
  • paper_authors: Bruno Aristimunha, Raphael Y. de Camargo, Walter H. Lopez Pinaya, Sylvain Chevallier, Alexandre Gramfort, Cedric Rommel
  • for: This study investigates the transferability of deep learning representations between EEG decoding tasks, to address the challenge of limited labelled data.
  • methods: State-of-the-art decoding models are evaluated on two recently released EEG datasets, ERP CORE and M$^3$CV, covering over 140 subjects and 11 distinct cognitive tasks; transferability is measured by pre-training deep neural networks on one task and assessing how well they decode subsequent tasks.
  • results: Even with linear-probing transfer, decoding performance improves significantly, with gains of up to 28% over a purely supervised approach; some decoding paradigms elicit specific, narrow brain activity, while others benefit from pre-training on a broad range of representations.
    Abstract Electroencephalography (EEG) decoding is a challenging task due to the limited availability of labelled data. While transfer learning is a promising technique to address this challenge, it assumes that transferable data domains and task are known, which is not the case in this setting. This study investigates the transferability of deep learning representations between different EEG decoding tasks. We conduct extensive experiments using state-of-the-art decoding models on two recently released EEG datasets, ERP CORE and M$^3$CV, containing over 140 subjects and 11 distinct cognitive tasks. We measure the transferability of learned representations by pre-training deep neural networks on one task and assessing their ability to decode subsequent tasks. Our experiments demonstrate that, even with linear probing transfer, significant improvements in decoding performance can be obtained, with gains of up to 28% compare with the pure supervised approach. Additionally, we discover evidence that certain decoding paradigms elicit specific and narrow brain activities, while others benefit from pre-training on a broad range of representations. By revealing which tasks transfer well and demonstrating the benefits of transfer learning for EEG decoding, our findings have practical implications for mitigating data scarcity in this setting. The transfer maps generated also provide insights into the hierarchical relations between cognitive tasks, hence enhancing our understanding of how these tasks are connected from a neuroscientific standpoint.

Dynamic algorithms for k-center on graphs

  • paper_url: http://arxiv.org/abs/2307.15557
  • repo_url: https://github.com/swati1024/torrents
  • paper_authors: Emilio Cruciani, Sebastian Forster, Gramoz Goranci, Yasamin Nazari, Antonis Skarlatos
  • for: The paper gives the first efficient algorithms for the $k$-center problem on dynamic graphs undergoing edge updates.
  • methods: It presents a deterministic decremental $(2+\epsilon)$-approximation algorithm and a randomized incremental $(4+\epsilon)$-approximation algorithm, both with amortized update time $kn^{o(1)}$ for weighted graphs.
  • results: A reduction yields a fully dynamic $(2+\epsilon)$-approximation algorithm for the $k$-center problem whose worst-case update time is within a factor $k$ of the state-of-the-art upper bound for maintaining $(1+\epsilon)$-approximate single-source distances in graphs.
    Abstract In this paper we give the first efficient algorithms for the $k$-center problem on dynamic graphs undergoing edge updates. In this problem, the goal is to partition the input into $k$ sets by choosing $k$ centers such that the maximum distance from any data point to the closest center is minimized. It is known that it is NP-hard to get a better than $2$ approximation for this problem. While in many applications the input may naturally be modeled as a graph, all prior works on $k$-center problem in dynamic settings are on metrics. In this paper, we give a deterministic decremental $(2+\epsilon)$-approximation algorithm and a randomized incremental $(4+\epsilon)$-approximation algorithm, both with amortized update time $kn^{o(1)}$ for weighted graphs. Moreover, we show a reduction that leads to a fully dynamic $(2+\epsilon)$-approximation algorithm for the $k$-center problem, with worst-case update time that is within a factor $k$ of the state-of-the-art upper bound for maintaining $(1+\epsilon)$-approximate single-source distances in graphs. Matching this bound is a natural goalpost because the approximate distances of each vertex to its center can be used to maintain a $(2+\epsilon)$-approximation of the graph diameter and the fastest known algorithms for such a diameter approximation also rely on maintaining approximate single-source distances.
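For context, the classical static baseline behind the approximation guarantees above is Gonzalez's farthest-first traversal, the standard greedy 2-approximation for k-center. The sketch below is that static algorithm on a metric given as a distance matrix, not the paper's dynamic data structure.

```python
import numpy as np

def greedy_k_center(dist, k, seed=0):
    """Farthest-first traversal: a static 2-approximation for k-center."""
    d_to_centers = dist[seed].copy()
    centers = [seed]
    for _ in range(k - 1):
        nxt = int(d_to_centers.argmax())   # farthest point from chosen centers
        centers.append(nxt)
        d_to_centers = np.minimum(d_to_centers, dist[nxt])
    return centers, float(d_to_centers.max())  # centers and achieved radius

rng = np.random.default_rng(2)
pts = rng.random((50, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(greedy_k_center(dist, k=4))
```

The dynamic setting is harder precisely because edge updates change the shortest-path metric itself, so this greedy pass cannot simply be re-run from scratch at every update.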

On the Trade-off Between Efficiency and Precision of Neural Abstraction

  • paper_url: http://arxiv.org/abs/2307.15546
  • repo_url: None
  • paper_authors: Alec Edwards, Mirco Giacobbe, Alessandro Abate
  • for: This paper studies neural abstractions (formal, neural-network-based approximations of complex nonlinear dynamical models) and how the right abstraction depends on its intended use: some scenarios call for coarse abstractions that are easier to analyse, others for more refined ones.
  • methods: It uses formal inductive synthesis to generate neural abstractions of alternative shapes: piecewise constant, piecewise affine (ReLU), and nonlinear non-polynomial (obtained via sigmoidal activations).
  • results: The templates exhibit different trade-offs between precision, synthesis time, and safety-verification time (via reachability computation), so the appropriate template depends on the application scenario.
    Abstract Neural abstractions have been recently introduced as formal approximations of complex, nonlinear dynamical models. They comprise a neural ODE and a certified upper bound on the error between the abstract neural network and the concrete dynamical model. So far neural abstractions have exclusively been obtained as neural networks consisting entirely of $ReLU$ activation functions, resulting in neural ODE models that have piecewise affine dynamics, and which can be equivalently interpreted as linear hybrid automata. In this work, we observe that the utility of an abstraction depends on its use: some scenarios might require coarse abstractions that are easier to analyse, whereas others might require more complex, refined abstractions. We therefore consider neural abstractions of alternative shapes, namely either piecewise constant or nonlinear non-polynomial (specifically, obtained via sigmoidal activations). We employ formal inductive synthesis procedures to generate neural abstractions that result in dynamical models with these semantics. Empirically, we demonstrate the trade-off that these different neural abstraction templates have vis-a-vis their precision and synthesis time, as well as the time required for their safety verification (done via reachability computation). We improve existing synthesis techniques to enable abstraction of higher-dimensional models, and additionally discuss the abstraction of complex neural ODEs to improve the efficiency of reachability analysis for these models.

Beating Backdoor Attack at Its Own Game

  • paper_url: http://arxiv.org/abs/2307.15539
  • repo_url: https://github.com/damianliumin/non-adversarial_backdoor
  • paper_authors: Min Liu, Alberto Sangiovanni-Vincentelli, Xiangyu Yue
  • for: The goal is to defend deep neural networks (DNNs) against backdoor attacks while preserving normal model behavior on clean data.
  • methods: The defense injects non-adversarial backdoors targeting poisoned samples: a small set of suspected samples is detected and then deliberately poisoned so that, once triggered, the injected backdoor suppresses the attacker's backdoor on poisoned data while having limited influence on clean data.
  • results: Extensive experiments on multiple benchmarks with different architectures and representative attacks show state-of-the-art defense effectiveness with by far the lowest performance drop on clean data.
    Abstract Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.

RFID-Assisted Indoor Localization Using Hybrid Wireless Data Fusion

  • paper_url: http://arxiv.org/abs/2308.02410
  • repo_url: None
  • paper_authors: Abouzar Ghavami, Ali Abedi
  • for: This work proposes a hybrid section-based indoor localization method that combines a developed RFID tracking device with multiple IoT wireless technologies, reducing cost by installing RFID tags only on the borders of each section.
  • methods: The RFID tracking device identifies the section, and the hybrid method, analytically driven by linear location estimates from IoT wireless technologies such as Bluetooth, WiFi, and ZigBee, finds the location of the object inside the section.
  • results: Experiments using the developed RFID tracking device and RSSI-based localization for Bluetooth, WiFi, and ZigBee verify the analytical results.
    Abstract Wireless localization is essential for tracking objects in indoor environments. Internet of Things (IoT) enables localization through its diverse wireless communication protocols. In this paper, a hybrid section-based indoor localization method using a developed Radio Frequency Identification (RFID) tracking device and multiple IoT wireless technologies is proposed. In order to reduce the cost of the RFID tags, the tags are installed only on the borders of each section. The RFID tracking device identifies the section, and the proposed wireless hybrid method finds the location of the object inside the section. The proposed hybrid method is analytically driven by linear location estimates obtained from different IoT wireless technologies. The experimental results using developed RFID tracking device and RSSI-based localization for Bluetooth, WiFi and ZigBee technologies verifies the analytical results.

The Applicability of Federated Learning to Official Statistics

  • paper_url: http://arxiv.org/abs/2307.15503
  • repo_url: None
  • paper_authors: Joshua Stock, Oliver Hauke, Julius Weißmann, Hannes Federrath
  • for: This work investigates the potential of Federated Learning (FL) for official statistics and shows that FL models can keep up with centralized learning methods while safeguarding data holders' privacy, thereby broadening data access and ultimately enhancing official statistics.
  • methods: Three use cases are simulated (a medical insurance dataset, a fine-dust pollution dataset, and a mobile radio coverage dataset, all from domains close to official statistics), with a detailed comparison of centralized and FL algorithm performance.
  • results: In all three use cases, FL models reach performance very close to the centralized benchmarks; the key observations and their implications for practice are summarized, concluding that FL can emerge as a pivotal technology for official statistics.
    Abstract This work investigates the potential of Federated Learning (FL) for official statistics and shows how well the performance of FL models can keep up with centralized learning methods. At the same time, its utilization can safeguard the privacy of data holders, thus facilitating access to a broader range of data and ultimately enhancing official statistics. By simulating three different use cases, important insights on the applicability of the technology are gained. The use cases are based on a medical insurance data set, a fine dust pollution data set and a mobile radio coverage data set - all of which are from domains close to official statistics. We provide a detailed analysis of the results, including a comparison of centralized and FL algorithm performances for each simulation. In all three use cases, we were able to train models via FL which reach a performance very close to the centralized model benchmarks. Our key observations and their implications for transferring the simulations into practice are summarized. We arrive at the conclusion that FL has the potential to emerge as a pivotal technology in future use cases of official statistics.
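A minimal sketch of one aggregation round in the style of FedAvg (a common FL recipe, though the paper does not commit to a specific aggregation rule): each data holder trains locally and shares only model parameters, never raw records.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return sum(w * p for w, p in zip(weights, client_params))

# three data holders, each with locally fitted linear-model parameters
clients = [np.array([0.9, 2.1]), np.array([1.1, 1.9]), np.array([1.0, 2.0])]
print(fed_avg(clients, client_sizes=[1000, 4000, 2500]))
```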

AbDiffuser: Full-Atom Generation of In-Vitro Functioning Antibodies

  • paper_url: http://arxiv.org/abs/2308.05027
  • repo_url: None
  • paper_authors: Karolis Martinkus, Jan Ludwiczak, Kyunghyun Cho, Wei-Ching Liang, Julien Lafrance-Vanasse, Isidro Hotzel, Arvind Rajpal, Yan Wu, Richard Bonneau, Vladimir Gligorijevic, Andreas Loukas
  • for: This work proposes an equivariant, physics-informed diffusion model for the joint generation of antibody 3D structures and sequences.
  • methods: AbDiffuser builds on a new representation of protein structure and a novel architecture for aligned proteins, and uses strong diffusion priors to improve the denoising process; it handles sequence-length changes and reduces memory complexity by an order of magnitude, enabling backbone and side-chain generation.
  • results: In silico, AbDiffuser generates antibodies that closely track the sequence and structural properties of a reference set; in the laboratory, all 16 discovered HER2 antibodies were expressed at high levels and 57.1% of the selected designs were tight binders.
    Abstract We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of selected designs were tight binders.

Curiosity-Driven Reinforcement Learning based Low-Level Flight Control

  • paper_url: http://arxiv.org/abs/2307.15724
  • repo_url: https://github.com/a-ramezani/cdrl-l2fc_u_hcm
  • paper_authors: Amir Ramezani Dooraki, Alexandros Iosifidis
  • for: This paper studies curiosity-driven reinforcement learning for low-level flight control, learning to generate proper motor speeds from odometry data so that a quadcopter can pass through obstacles while steering its yaw toward the desired location.
  • methods: The approach combines reinforcement learning with a new curiosity bonus based on prediction error; on-policy, off-policy, on-policy plus curiosity, and the proposed algorithm are tested, with visualizations of how curiosity shapes the evolving exploration patterns.
  • results: The proposed algorithm learns the optimal policy and maximizes reward where the other algorithms fail to do so.
    Abstract Curiosity is one of the main motives in many of the natural creatures with measurable levels of intelligence for exploration and, as a result, more efficient learning. It makes it possible for humans and many animals to explore efficiently by searching for being in states that make them surprised with the goal of learning more about what they do not know. As a result, while being curious, they learn better. In the machine learning literature, curiosity is mostly combined with reinforcement learning-based algorithms as an intrinsic reward. This work proposes an algorithm based on the drive of curiosity for autonomous learning to control by generating proper motor speeds from odometry data. The quadcopter controlled by our proposed algorithm can pass through obstacles while controlling the Yaw direction of the quad-copter toward the desired location. To achieve that, we also propose a new curiosity approach based on prediction error. We ran tests using on-policy, off-policy, on-policy plus curiosity, and the proposed algorithm and visualized the effect of curiosity in evolving exploration patterns. Results show the capability of the proposed algorithm to learn optimal policy and maximize reward where other algorithms fail to do so.
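A minimal sketch of a prediction-error curiosity bonus of the kind the bullets describe: a forward model predicts the next observation, and its error is paid out as intrinsic reward added to the extrinsic one. The linear forward model and scaling constant are illustrative assumptions.

```python
import numpy as np

class PredictionErrorCuriosity:
    def __init__(self, obs_dim, act_dim, lr=1e-2, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(obs_dim + act_dim, obs_dim))
        self.lr, self.scale = lr, scale

    def reward_and_update(self, obs, act, next_obs):
        x = np.concatenate([obs, act])
        err = next_obs - x @ self.W             # forward-model surprise
        self.W += self.lr * np.outer(x, err)    # one SGD step on squared error
        return self.scale * float((err ** 2).mean())

cur = PredictionErrorCuriosity(obs_dim=6, act_dim=4)
bonus = cur.reward_and_update(np.zeros(6), np.ones(4), np.ones(6))
print(bonus)   # added to the extrinsic reward during training
```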

From continuous-time formulations to discretization schemes: tensor trains and robust regression for BSDEs and parabolic PDEs

  • paper_url: http://arxiv.org/abs/2307.15496
  • repo_url: https://github.com/lorenzrichter/PDE-backward-solver
  • paper_authors: Lorenz Richter, Leon Sallandt, Nikolas Nüsken
  • for: The paper addresses the numerical approximation of high-dimensional parabolic partial differential equations (PDEs), where classical grid-based methods suffer from the curse of dimensionality.
  • methods: Combining Monte Carlo methods with variational formulations, it reformulates the PDEs in terms of backward stochastic differential equations and applies regression-type methods within a tensor-train framework, exploiting latent low-rank structure for both compression and efficient computation; iterative schemes are developed from a continuous-time viewpoint.
  • results: The proposed numerical strategy achieves a favorable trade-off, often combining the accuracy and the computational efficiency that previous methods attained only separately, as shown both theoretically and numerically.
    Abstract The numerical approximation of partial differential equations (PDEs) poses formidable challenges in high dimensions since classical grid-based methods suffer from the so-called curse of dimensionality. Recent attempts rely on a combination of Monte Carlo methods and variational formulations, using neural networks for function approximation. Extending previous work (Richter et al., 2021), we argue that tensor trains provide an appealing framework for parabolic PDEs: The combination of reformulations in terms of backward stochastic differential equations and regression-type methods holds the promise of leveraging latent low-rank structures, enabling both compression and efficient computation. Emphasizing a continuous-time viewpoint, we develop iterative schemes, which differ in terms of computational efficiency and robustness. We demonstrate both theoretically and numerically that our methods can achieve a favorable trade-off between accuracy and computational efficiency. While previous methods have been either accurate or fast, we have identified a novel numerical strategy that can often combine both of these aspects.

FeedbackLogs: Recording and Incorporating Stakeholder Feedback into Machine Learning Pipelines

  • paper_url: http://arxiv.org/abs/2307.15475
  • repo_url: None
  • paper_authors: Matthew Barker, Emma Kallina, Dhananjay Ashok, Katherine M. Collins, Ashley Casovan, Adrian Weller, Ameet Talwalkar, Valerie Chen, Umang Bhatt
  • for: This paper aims to make machine learning (ML) pipelines more accountable by systematically recording and incorporating stakeholder feedback.
  • methods: It proposes FeedbackLogs, addenda to existing documentation of ML pipelines that track the input of multiple stakeholders: each log records important details of the feedback-collection process, the feedback itself, and how the feedback is used to update the pipeline; a process for collecting a FeedbackLog is introduced and formalized.
  • results: Concrete use cases show FeedbackLogs can serve as evidence for algorithmic auditing and as a tool to record updates based on stakeholder feedback.
    Abstract Even though machine learning (ML) pipelines affect an increasing array of stakeholders, there is little work on how input from stakeholders is recorded and incorporated. We propose FeedbackLogs, addenda to existing documentation of ML pipelines, to track the input of multiple stakeholders. Each log records important details about the feedback collection process, the feedback itself, and how the feedback is used to update the ML pipeline. In this paper, we introduce and formalise a process for collecting a FeedbackLog. We also provide concrete use cases where FeedbackLogs can be employed as evidence for algorithmic auditing and as a tool to record updates based on stakeholder feedback.
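A minimal sketch of what a machine-readable FeedbackLog entry could look like; the field names are our assumption, not the schema proposed in the paper.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class FeedbackEntry:
    stakeholder: str        # who gave the feedback
    elicitation: str        # how it was collected (interview, survey, ...)
    feedback: str           # the feedback itself
    pipeline_update: str    # how the ML pipeline changed in response

@dataclass
class FeedbackLog:
    pipeline: str
    entries: List[FeedbackEntry] = field(default_factory=list)

log = FeedbackLog(pipeline="loan-approval-v3")   # hypothetical pipeline name
log.entries.append(FeedbackEntry(
    stakeholder="domain expert",
    elicitation="structured interview",
    feedback="income feature is unreliable for self-employed applicants",
    pipeline_update="added employment-type interaction; re-ran fairness audit",
))
print(json.dumps(asdict(log), indent=2))
```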

Rethinking Noisy Label Learning in Real-world Annotation Scenarios from the Noise-type Perspective

  • paper_url: http://arxiv.org/abs/2307.16889
  • repo_url: https://github.com/fuxiailab/protosemi
  • paper_authors: Renyu Zhu, Haoyu Liu, Runze Wu, Minmin Lin, Tangjie Lv, Changjie Fan, Haobo Wang
  • for: This paper studies learning with noisy labels in real-world annotation scenarios, where noise falls into two types: factual noise and ambiguity noise.
  • methods: The proposed sample-selection approach, Proto-semi, first splits all samples into confident and unconfident sets via warm-up; prototype vectors built from the confident set capture class characteristics, and distances between unconfident samples and the prototypes drive noise classification, after which labels are either corrected or retained, refining both sets; a semi-supervised learning method then enhances training.
  • results: Experiments on a real-world annotated dataset show that Proto-semi handles learning from noisy labels robustly, and the prototype-based repartitioning strategy mitigates the adverse impact of label noise; code and data are available at https://github.com/fuxiAIlab/ProtoSemi.
    Abstract In this paper, we investigate the problem of learning with noisy labels in real-world annotation scenarios, where noise can be categorized into two types: factual noise and ambiguity noise. To better distinguish these noise types and utilize their semantics, we propose a novel sample selection-based approach for noisy label learning, called Proto-semi. Proto-semi initially divides all samples into the confident and unconfident datasets via warm-up. By leveraging the confident dataset, prototype vectors are constructed to capture class characteristics. Subsequently, the distances between the unconfident samples and the prototype vectors are calculated to facilitate noise classification. Based on these distances, the labels are either corrected or retained, resulting in the refinement of the confident and unconfident datasets. Finally, we introduce a semi-supervised learning method to enhance training. Empirical evaluations on a real-world annotated dataset substantiate the robustness of Proto-semi in handling the problem of learning from noisy labels. Meanwhile, the prototype-based repartitioning strategy is shown to be effective in mitigating the adverse impact of label noise. Our code and data are available at https://github.com/fuxiAIlab/ProtoSemi.
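A minimal sketch of the prototype step described above: class prototypes are the mean confident-set features per class, and an unconfident sample's label is corrected when another prototype is clearly closer. The feature extractor is assumed given, and the margin is a hypothetical knob.

```python
import numpy as np

def prototypes(feats, labels, n_classes):
    """Class prototypes: mean feature of confident samples per class."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def relabel_unconfident(feats_u, labels_u, protos, margin=0.2):
    """Correct a label when another class's prototype is closer by > margin."""
    d = np.linalg.norm(feats_u[:, None, :] - protos[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    idx = np.arange(len(labels_u))
    return np.where(d[idx, labels_u] - d[idx, nearest] > margin,
                    nearest, labels_u)

rng = np.random.default_rng(6)
protos = prototypes(rng.normal(size=(100, 16)), np.repeat(np.arange(4), 25), 4)
print(relabel_unconfident(rng.normal(size=(5, 16)),
                          np.array([0, 1, 2, 3, 0]), protos))
```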

Testing the Depth of ChatGPT’s Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5’s Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking

  • paper_url: http://arxiv.org/abs/2307.16806
  • repo_url: None
  • paper_authors: David Bayani
  • for: This study probes the depth of GPT3.5's comprehension on visual tasks where inputs are provided as ASCII art, without overt distillation into a lingual summary.
  • methods: Experiments analyze the model's performance on image-recognition tasks after various transforms typical in visual settings, trials investigating knowledge of image parts, and tasks covering image generation.
  • results: As the title indicates, GPT3.5's abilities to recognize and generate ASCII art are limited but not totally lacking.
    Abstract Over the eight months since its release, ChatGPT and its underlying model, GPT3.5, have garnered massive attention, due to their potent mix of capability and accessibility. While a niche-industry of papers have emerged examining the scope of capabilities these models possess, the information fed to and extracted from these networks has been either natural language text or stylized, code-like language. Drawing inspiration from the prowess we expect a truly human-level intelligent agent to have across multiple signal modalities, in this work we examine GPT3.5's aptitude for visual tasks, where the inputs feature content provided as ASCII-art without overt distillation into a lingual summary. We conduct experiments analyzing the model's performance on image recognition tasks after various transforms typical in visual settings, trials investigating knowledge of image parts, and tasks covering image generation.

LUCID-GAN: Conditional Generative Models to Locate Unfairness

  • paper_url: http://arxiv.org/abs/2307.15466
  • repo_url: https://github.com/integrated-intelligence-lab/canonical_sets
  • paper_authors: Andres Algaba, Carmen Mazijn, Carina Prunkl, Jan Danckaert, Vincent Ginis
  • for: The goal is to locate unfair biases in black-box models.
  • methods: LUCID-GAN generates canonical inputs via a conditional generative model instead of gradient-based inverse design, exposing potential unethical biases in the model's internal logic; it applies to non-differentiable models, ensures that canonical sets consist of realistic inputs, and allows assessment of proxy and intersectional discrimination.
  • results: Empirical evaluation on the UCI Adult and COMPAS datasets shows that LUCID-GAN can detect unethical biases in black-box models without requiring access to the training data.
    Abstract Most group fairness notions detect unethical biases by computing statistical parity metrics on a model's output. However, this approach suffers from several shortcomings, such as philosophical disagreement, mutual incompatibility, and lack of interpretability. These shortcomings have spurred the research on complementary bias detection methods that offer additional transparency into the sources of discrimination and are agnostic towards an a priori decision on the definition of fairness and choice of protected features. A recent proposal in this direction is LUCID (Locating Unfairness through Canonical Inverse Design), where canonical sets are generated by performing gradient descent on the input space, revealing a model's desired input given a preferred output. This information about the model's mechanisms, i.e., which feature values are essential to obtain specific outputs, allows exposing potential unethical biases in its internal logic. Here, we present LUCID-GAN, which generates canonical inputs via a conditional generative model instead of gradient-based inverse design. LUCID-GAN has several benefits, including that it applies to non-differentiable models, ensures that canonical sets consist of realistic inputs, and allows to assess proxy and intersectional discrimination. We empirically evaluate LUCID-GAN on the UCI Adult and COMPAS data sets and show that it allows for detecting unethical biases in black-box models without requiring access to the training data.

Unsupervised machine-learning shock-capturing technique for high-order solvers

  • paper_url: http://arxiv.org/abs/2308.00086
  • repo_url: None
  • paper_authors: Andrés Mateo-Gabín, Kenza Tlales, Eusebio Valero, Esteban Ferrer, Gonzalo Rubio
  • for: The aim is to improve the robustness and efficiency of CFD codes for complex geometries and varied flow configurations.
  • methods: An unsupervised machine-learning shock-capturing sensor based on Gaussian Mixture Models (GMMs) detects shocks accurately and robustly without parameter tuning; all compared sensors are integrated into a high-order compressible discontinuous Galerkin solver in which artificial viscosity is modulated to capture shocks.
  • results: On supersonic test cases, including high Reynolds numbers, the GMM sensor matches fine-tuned state-of-the-art sensors, and its adaptive, training-free nature suits supersonic flows and related applications.
    Abstract We present a novel unsupervised machine learning shock capturing algorithm based on Gaussian Mixture Models (GMMs). The proposed GMM sensor demonstrates remarkable accuracy in detecting shocks and is robust across diverse test cases without the need for parameter tuning. We compare the GMM-based sensor with state-of-the-art alternatives. All methods are integrated into a high-order compressible discontinuous Galerkin solver where artificial viscosity can be modulated to capture shocks. Supersonic test cases, including high Reynolds numbers, showcase the sensor's performance, demonstrating the same effectiveness as fine-tuned state-of-the-art sensors. The nodal DG approach allows for potential applications in sub-cell flux-differencing formulations, supersonic feature detection, and mesh refinement. The adaptive nature and ability to function without extensive training datasets make this GMM-based sensor suitable for complex geometries and varied flow configurations. Our study reveals the potential of unsupervised machine learning methods, exemplified by the GMM sensor, to improve the robustness and efficiency of advanced CFD codes.
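A minimal sketch of the clustering idea behind the GMM sensor: fit a two-component mixture to a per-element indicator and flag the higher-mean cluster as shocked, with no hand-tuned threshold. The synthetic indicator below stands in for the solver's modal features, which this sketch does not compute; scikit-learn is assumed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
smooth = rng.normal(-6.0, 0.5, size=900)    # log-indicator, smooth elements
shocked = rng.normal(-1.0, 0.7, size=100)   # log-indicator near discontinuities
feature = np.concatenate([smooth, shocked]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(feature)
shock_comp = int(np.argmax(gmm.means_.ravel()))   # higher-mean component
troubled = gmm.predict(feature) == shock_comp
print(troubled.sum(), "elements flagged for artificial viscosity")
```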

Worrisome Properties of Neural Network Controllers and Their Symbolic Representations

  • paper_url: http://arxiv.org/abs/2307.15456
  • repo_url: https://github.com/mimuw-rl/worrisome-nn
  • paper_authors: Jacek Cyranka, Kevin E M Church, Jean-Philippe Lessard
  • for: The paper raises concerns about controllers' robustness in simple reinforcement learning benchmark problems.
  • methods: It studies neural network controllers and their low-neuron and symbolic abstractions, provides an algorithm for a systematic robustness study, and uses a computer-assisted proof methodology.
  • results: Typical controllers reaching high mean return still generate an abundance of persistent low-return solutions (a highly undesirable property, easily exploitable by an adversary), and simpler controllers admit more persistent bad solutions; the existence of persistent solutions and, in some cases, periodic orbits is proven.
    Abstract We raise concerns about controllers' robustness in simple reinforcement learning benchmark problems. We focus on neural network controllers and their low neuron and symbolic abstractions. A typical controller reaching high mean return values still generates an abundance of persistent low-return solutions, which is a highly undesirable property, easily exploitable by an adversary. We find that the simpler controllers admit more persistent bad solutions. We provide an algorithm for a systematic robustness study and prove existence of persistent solutions and, in some cases, periodic orbits, using a computer-assisted proof methodology.

Autonomous Payload Thermal Control

  • paper_url: http://arxiv.org/abs/2307.15438
  • repo_url: None
  • paper_authors: Alejandro D. Mousist
  • for: This work addresses thermal control in small satellites, where limited room for heat-control equipment and the close proximity of electronics make power dissipation difficult, risking component lifetime and mission performance.
  • methods: A deep reinforcement learning framework based on the Soft Actor-Critic algorithm is proposed for learning the thermal control policy onboard.
  • results: Evaluated both in a simulated environment and on the real space edge-processing computer to be shipped on the future IMAGIN-e mission and hosted on the ISS, the framework learns to control payload processing power so that temperature stays within operational ranges, complementing traditional thermal control systems.
    Abstract In small satellites there is less room for heat control equipment, scientific instruments, and electronic components. Furthermore, the near proximity of the electronics makes power dissipation difficult, with the risk of not being able to control the temperature appropriately, reducing component lifetime and mission performance. To address this challenge, taking advantage of the advent of increasing intelligence on board satellites, a deep reinforcement learning based framework that uses Soft Actor-Critic algorithm is proposed for learning the thermal control policy onboard. The framework is evaluated both in a naive simulated environment and in a real space edge processing computer that will be shipped in the future IMAGIN-e mission and hosted in the ISS. The experiment results show that the proposed framework is able to learn to control the payload processing power to maintain the temperature under operational ranges, complementing traditional thermal control systems.

Improvable Gap Balancing for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2307.15429
  • repo_url: https://github.com/yanqidai/igb4mtl
  • paper_authors: Yanqi Dai, Nanyi Fei, Zhiwu Lu
  • for: This work aims to improve multi-task learning (MTL) by balancing the improvable gap, defined as the distance between a task's current training progress and its desired final training progress, since performance imbalance often persists after loss balancing.
  • methods: Two novel improvable gap balancing (IGB) algorithms are proposed: one uses a simple heuristic, and the other (for the first time) deploys deep reinforcement learning for MTL; both dynamically assign task weights, and IGB is further combined with gradient balancing to show the complementarity of the two types of algorithms.
  • results: Extensive experiments on two benchmark datasets show the IGB algorithms achieve the best MTL results via loss balancing, with further improvements when combined with gradient balancing.
    Abstract In multi-task learning (MTL), gradient balancing has recently attracted more research interest than loss balancing since it often leads to better performance. However, loss balancing is much more efficient than gradient balancing, and thus it is still worth further exploration in MTL. Note that prior studies typically ignore that there exist varying improvable gaps across multiple tasks, where the improvable gap per task is defined as the distance between the current training progress and desired final training progress. Therefore, after loss balancing, the performance imbalance still arises in many cases. In this paper, following the loss balancing framework, we propose two novel improvable gap balancing (IGB) algorithms for MTL: one takes a simple heuristic, and the other (for the first time) deploys deep reinforcement learning for MTL. Particularly, instead of directly balancing the losses in MTL, both algorithms choose to dynamically assign task weights for improvable gap balancing. Moreover, we combine IGB and gradient balancing to show the complementarity between the two types of algorithms. Extensive experiments on two benchmark datasets demonstrate that our IGB algorithms lead to the best results in MTL via loss balancing and achieve further improvements when combined with gradient balancing. Code is available at https://github.com/YanqiDai/IGB4MTL.
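A minimal sketch of the heuristic variant's idea: weight each task by its relative improvable gap. Treating the desired final losses as known inputs is a simplifying assumption made here for illustration.

```python
import torch

def improvable_gap_weights(current_losses, target_losses, eps=1e-8):
    """Tasks with larger improvable gaps receive larger weights
    (normalized so the weights average to one)."""
    gaps = torch.clamp(current_losses - target_losses, min=0.0)
    w = gaps / (gaps.sum() + eps)
    return gaps.numel() * w

cur = torch.tensor([0.9, 0.4, 0.7])   # current per-task losses
tgt = torch.tensor([0.2, 0.3, 0.1])   # assumed desired final losses
w = improvable_gap_weights(cur, tgt)
total_loss = (w * cur).sum()          # weighted MTL objective for backprop
print(w, total_loss)
```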

Implicit neural representation for change detection

  • paper_url: http://arxiv.org/abs/2307.15428
  • repo_url: None
  • paper_authors: Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci
  • for: The task is detecting changes between a pair of 3D airborne LiDAR point clouds acquired at two different times over the same geographical area, despite unmatched spatial supports and acquisition-system noise.
  • methods: The unsupervised approach comprises a Neural Field (NF) for continuous shape reconstruction (a grid-agnostic representation that can be regularized to increase high-frequency detail and reduce noise) and a Gaussian Mixture Model (GMM) for categorizing changes; reconstructions at each timestamp are compared at arbitrary spatial scales, significantly increasing detection capability.
  • results: On a benchmark of simulated LiDAR point clouds for urban sprawl, with varied resolutions, input modalities, and noise levels, the method beats previous approaches by a 10% margin in the intersection-over-union metric; applied to a real-world scenario, it identifies illegal excavation (looting) of archaeological sites, matching field experts' findings.
    Abstract Detecting changes that occurred in a pair of 3D airborne LiDAR point clouds, acquired at two different times over the same geographical area, is a challenging task because of unmatching spatial supports and acquisition system noise. Most recent attempts to detect changes on point clouds are based on supervised methods, which require large labelled data unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. NF offer a grid-agnostic representation to encode bi-temporal point clouds with unmatched spatial support that can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset of simulated LiDAR point clouds for urban sprawling. The dataset offers different challenging scenarios with different resolutions, input modalities and noise levels, allowing a multi-scenario comparison of our method with the current state-of-the-art. We boast the previous methods on this dataset by a 10% margin in intersection over union metric. In addition, we apply our methods to a real-world scenario to identify illegal excavation (looting) of archaeological sites and confirm that they match findings from field experts.

Deep Generative Models, Synthetic Tabular Data, and Differential Privacy: An Overview and Synthesis

  • paper_url: http://arxiv.org/abs/2307.15424
  • repo_url: None
  • paper_authors: Conor Hassan, Robert Salomone, Kerrie Mengersen
  • for: This article provides a comprehensive synthesis of recent developments in synthetic data generation via deep generative models, focusing on tabular datasets and the importance of synthetic data for privacy-sensitive settings.
  • methods: It explains the underlying concepts, including unsupervised learning, neural networks, and generative models, and the advantages of deep generative models over other methods for generating tabular data.
  • results: The review details the challenges and considerations involved, such as data normalization, privacy concerns, and model evaluation, providing a valuable resource for researchers and practitioners.
    Abstract This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models, focusing on tabular datasets. We specifically outline the importance of synthetic data generation in the context of privacy-sensitive data. Additionally, we highlight the advantages of using deep generative models over other methods and provide a detailed explanation of the underlying concepts, including unsupervised learning, neural networks, and generative models. The paper covers the challenges and considerations involved in using deep generative models for tabular datasets, such as data normalization, privacy concerns, and model evaluation. This review provides a valuable resource for researchers and practitioners interested in synthetic data generation and its applications.

Is One Epoch All You Need For Multi-Fidelity Hyperparameter Optimization?

  • paper_url: http://arxiv.org/abs/2307.15422
  • repo_url: https://github.com/deephyper/benchmark
  • paper_authors: Romain Egele, Isabelle Guyon, Yixuan Sun, Prasanna Balaprakash
  • for: The goal is to fine-tune machine learning models via hyperparameter optimization (HPO) at reduced computational cost.
  • methods: Multi-fidelity HPO (MF-HPO) leverages intermediate accuracy levels during learning to discard low-performing models early; representative MF-HPO methods are compared against a simple baseline that trains every model for one epoch, keeps only the Top-K, and trains those further to select the best.
  • results: The simple baseline achieves results similar to its MF-HPO counterparts while requiring an order of magnitude less computation; analysis of the benchmark learning curves reveals a few dominant curves that explain the baseline's success, suggesting benchmarks should always include this baseline and cover more complex cases.
    Abstract Hyperparameter optimization (HPO) is crucial for fine-tuning machine learning models but can be computationally expensive. To reduce costs, Multi-fidelity HPO (MF-HPO) leverages intermediate accuracy levels in the learning process and discards low-performing models early on. We compared various representative MF-HPO methods against a simple baseline on classical benchmark data. The baseline involved discarding all models except the Top-K after training for only one epoch, followed by further training to select the best model. Surprisingly, this baseline achieved similar results to its counterparts, while requiring an order of magnitude less computation. Upon analyzing the learning curves of the benchmark data, we observed a few dominant learning curves, which explained the success of our baseline. This suggests that researchers should (1) always use the suggested baseline in benchmarks and (2) broaden the diversity of MF-HPO benchmarks to include more complex cases.
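A minimal sketch of the baseline described above: train every configuration for one epoch, keep only the Top-K by validation score, and train those to completion. The callables and toy scores are placeholders.

```python
import numpy as np

def one_epoch_topk_hpo(configs, train_one_epoch, train_full, k=3):
    early = np.array([train_one_epoch(c) for c in configs])
    top_k = np.argsort(early)[-k:]                 # higher score = better
    final = {int(i): train_full(configs[i]) for i in top_k}
    best = max(final, key=final.get)
    return configs[best], final[best]

# toy stand-in: the one-epoch score is a noisy version of the final score
rng = np.random.default_rng(4)
truth = rng.random(20)
cfg, score = one_epoch_topk_hpo(
    list(range(20)),
    train_one_epoch=lambda c: truth[c] + 0.05 * rng.normal(),
    train_full=lambda c: truth[c],
)
print(cfg, score)
```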

The Initial Screening Order Problem

  • paper_url: http://arxiv.org/abs/2307.15398
  • repo_url: None
  • paper_authors: Jose M. Alvarez, Salvatore Ruggieri
  • for: The paper addresses the initial screening order problem in candidate screening, where the goal is to find the first k suitable candidates rather than the best k candidates in a candidate pool.
  • methods: It uses formal methods to prove the effects of initial screening orders on the selected set of k candidates, particularly when the candidate pool is unbalanced (e.g., having more male than female candidates).
  • results: Under unbalanced candidate pools, the human-like screener can suffer from uneven efforts that hinder decision-making over the protected, under-represented group relative to the non-protected, over-represented group; other fairness results are also proven under the human-like screener.
    Abstract In this paper we present the initial screening order problem, a crucial step within candidate screening. It involves a human-like screener with an objective to find the first k suitable candidates rather than the best k suitable candidates in a candidate pool given an initial screening order. The initial screening order represents the way in which the human-like screener arranges the candidate pool prior to screening. The choice of initial screening order has considerable effects on the selected set of k candidates. We prove that under an unbalanced candidate pool (e.g., having more male than female candidates), the human-like screener can suffer from uneven efforts that hinder its decision-making over the protected, under-represented group relative to the non-protected, over-represented group. Other fairness results are proven under the human-like screener. This research is based on a collaboration with a large company to better understand its hiring process for potential automation. Our main contribution is the formalization of the initial screening order problem which, we argue, opens the path for future extensions of the current works on ranking algorithms, fairness, and automation for screening procedures.

Noisy Interpolation Learning with Shallow Univariate ReLU Networks

  • paper_url: http://arxiv.org/abs/2307.15396
  • repo_url: None
  • paper_authors: Nirmit Joshi, Gal Vardi, Nathan Srebro
  • for: The paper studies the asymptotic overfitting behavior of interpolation with minimum-norm ($\ell_2$ of the weights) two-layer ReLU networks for noisy univariate regression.
  • methods: It analyzes how the choice of test loss affects the overfitting of the minimum-norm interpolant, considering the $L_1$ loss and general $L_p$ losses.
  • results: Overfitting is tempered for the $L_1$ loss and any $L_p$ loss with $p<2$, but catastrophic for $p\geq 2$.
    Abstract We study the asymptotic overfitting behavior of interpolation with minimum norm ($\ell_2$ of the weights) two-layer ReLU networks for noisy univariate regression. We show that overfitting is tempered for the $L_1$ loss, and any $L_p$ loss for $p<2$, but catastrophic for $p\geq 2$.

Does Full Waveform Inversion Benefit from Big Data?

  • paper_url: http://arxiv.org/abs/2307.15388
  • repo_url: None
  • paper_authors: Peng Jin, Yinan Feng, Shihang Feng, Hanchen Wang, Yinpeng Chen, Benjamin Consolvo, Zicheng Liu, Youzuo Lin
  • for: This paper investigates the impact of big data on deep learning models for full waveform inversion (FWI).
  • methods: An empirical study evaluates deep learning models for FWI trained on OpenFWI, a recently released collection of large-scale, multi-structural datasets, using a combination of 10 2D subsets containing 470K data pairs in total.
  • results: Larger datasets lead to better performance and generalization of deep learning models for FWI, and model capacity needs to scale in accordance with data size for optimal improvement.
    Abstract This paper investigates the impact of big data on deep learning models for full waveform inversion (FWI). While it is well known that big data can boost the performance of deep learning models in many tasks, its effectiveness has not been validated for FWI. To address this gap, we present an empirical study that investigates how deep learning models in FWI behave when trained on OpenFWI, a collection of large-scale, multi-structural datasets published recently. Particularly, we train and evaluate the FWI models on a combination of 10 2D subsets in OpenFWI that contain 470K data pairs in total. Our experiments demonstrate that larger datasets lead to better performance and generalization of deep learning models for FWI. We further demonstrate that model capacity needs to scale in accordance with data size for optimal improvement.

Co-attention Graph Pooling for Efficient Pairwise Graph Interaction Learning

  • paper_url: http://arxiv.org/abs/2307.15377
  • repo_url: https://github.com/leejunhyun/coattentiongraphpooling
  • paper_authors: Junhyun Lee, Bumsoo Kim, Minji Jeon, Jaewoo Kang
  • for: This work targets pairwise analysis of graph-structured data (e.g., scene graph matching, code searching, and drug-drug interaction prediction), where prior node-level interaction learning incurs high computational costs and suboptimal performance.
  • methods: It proposes Co-Attention Graph Pooling (CAGPool), a novel and efficient graph-level approach that extracts interaction representations using co-attention within graph pooling.
  • results: CAGPool exhibits competitive performance relative to existing methods on real-world classification and regression datasets while maintaining lower computational complexity.
    Abstract Graph Neural Networks (GNNs) have proven to be effective in processing and learning from graph-structured data. However, previous works mainly focused on understanding single graph inputs while many real-world applications require pair-wise analysis for graph-structured data (e.g., scene graph matching, code searching, and drug-drug interaction prediction). To this end, recent works have shifted their focus to learning the interaction between pairs of graphs. Despite their improved performance, these works were still limited in that the interactions were considered at the node-level, resulting in high computational costs and suboptimal performance. To address this issue, we propose a novel and efficient graph-level approach for extracting interaction representations using co-attention in graph pooling. Our method, Co-Attention Graph Pooling (CAGPool), exhibits competitive performance relative to existing methods in both classification and regression tasks using real-world datasets, while maintaining lower computational complexity.

Conflict-free joint decision by lag and zero-lag synchronization in laser network

  • paper_url: http://arxiv.org/abs/2307.15373
  • repo_url: None
  • paper_authors: Hisako Ito, Takatomo Mihana, Ryoichi Horisaki, Makoto Naruse
  • for: This paper applies a laser network, acting as a photonic accelerator, to the competitive multi-armed bandit problem, where conflict avoidance is key to maximizing environmental rewards.
  • methods: Cooperative decision-making is demonstrated experimentally using zero-lag and lag synchronization within a network of four semiconductor lasers: lag synchronization of chaos realizes effective decision-making, while zero-lag synchronization realizes the collision-avoidance function.
  • results: The experiments verify a low collision rate and high reward in a fundamental 2-player, 2-slot scenario and show the scalability of the system, opening new possibilities for intelligent functionalities in laser dynamics.
    Abstract With the end of Moore's Law and the increasing demand for computing, photonic accelerators are garnering considerable attention. This is due to the physical characteristics of light, such as high bandwidth and multiplicity, and the various synchronization phenomena that emerge in the realm of laser physics. These factors come into play as computer performance approaches its limits. In this study, we explore the application of a laser network, acting as a photonic accelerator, to the competitive multi-armed bandit problem. In this context, conflict avoidance is key to maximizing environmental rewards. We experimentally demonstrate cooperative decision-making using zero-lag and lag synchronization within a network of four semiconductor lasers. Lag synchronization of chaos realizes effective decision-making and zero-delay synchronization is responsible for the realization of the collision avoidance function. We experimentally verified a low collision rate and high reward in a fundamental 2-player, 2-slot scenario, and showed the scalability of this system. This system architecture opens up new possibilities for intelligent functionalities in laser dynamics.

Toward Transparent Sequence Models with Model-Based Tree Markov Model

  • paper_url: http://arxiv.org/abs/2307.15367
  • repo_url: None
  • paper_authors: Chan Hsu, Wei-Chun Huang, Jun-Ting Wu, Chih-Yuan Li, Yihuang Kang
  • for: The goal is to address the interpretability problem of complex black-box machine learning models applied to sequence data.
  • methods: The study proposes the Model-Based tree Hidden Semi-Markov Model (MOB-HSMM), which detects high-mortality-risk events and uncovers hidden patterns associated with mortality risk in the ICU; it distills knowledge from deep neural networks (DNNs) to improve predictive performance while providing clear explanations.
  • results: Experiments show that learning sequential patterns with an LSTM and transferring them to model-based (MOB) trees improves their performance, and integrating MOB trees with the Hidden Semi-Markov Model uncovers potential and explainable sequences from the available information.
    Abstract In this study, we address the interpretability issue in complex, black-box Machine Learning models applied to sequence data. We introduce the Model-Based tree Hidden Semi-Markov Model (MOB-HSMM), an inherently interpretable model aimed at detecting high mortality risk events and discovering hidden patterns associated with the mortality risk in Intensive Care Units (ICU). This model leverages knowledge distilled from Deep Neural Networks (DNN) to enhance predictive performance while offering clear explanations. Our experimental results indicate the improved performance of Model-Based trees (MOB trees) via employing LSTM for learning sequential patterns, which are then transferred to MOB trees. Integrating MOB trees with the Hidden Semi-Markov Model (HSMM) in the MOB-HSMM enables uncovering potential and explainable sequences using available information.

Confident Feature Ranking

  • paper_url: http://arxiv.org/abs/2307.15361
  • repo_url: None
  • paper_authors: Bitya Neuhof, Yuval Benjamini
  • for: This work proposes a post-hoc method, based on pairwise comparisons, for producing stable rankings of feature-importance values.
  • methods: The method compares feature-importance values pairwise and produces a ranking together with simultaneous confidence intervals for the ranks.
  • results: The procedure is guaranteed to include the true (infinite-sample) ranking with high probability and allows for selecting top-k sets.
    Abstract Interpretation of feature importance values often relies on the relative order of the features rather than on the value itself, referred to as ranking. However, the order may be unstable due to the small sample sizes used in calculating the importance values. We propose that post-hoc importance methods produce a ranking and simultaneous confident intervals for the rankings. Based on pairwise comparisons of the feature importance values, our method is guaranteed to include the ``true'' (infinite sample) ranking with high probability and allows for selecting top-k sets.
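The pairwise-comparison construction can be sketched in a few lines. The recipe below (paired t-tests with a Bonferroni correction over all pairs) captures the spirit of the procedure; the paper's actual tests and its coverage guarantee may differ.

```python
import numpy as np
from scipy import stats

def rank_intervals(imp, alpha=0.05):
    """Simultaneous confidence intervals for feature-importance ranks.

    imp: (B, p) importance values from B repeated estimates, e.g.,
         permutation importance computed over B resamples.
    A feature's lower rank bound counts features significantly more
    important than it; the upper bound subtracts those significantly
    less important (rank 1 = most important).
    """
    B, p = imp.shape
    thresh = alpha / (p * (p - 1) / 2)   # Bonferroni over all pairs
    lo, hi = np.ones(p, dtype=int), np.full(p, p, dtype=int)
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            t, pval = stats.ttest_rel(imp[:, j], imp[:, i])
            if pval < thresh and t > 0:  # j significantly beats i
                lo[i] += 1
            if pval < thresh and t < 0:  # j significantly loses to i
                hi[i] -= 1
    return lo, hi

# Two near-tied top features end up sharing the rank interval [1, 2].
rng = np.random.default_rng(0)
fake = rng.normal(loc=[3.0, 2.98, 1.0, 0.1], scale=0.3, size=(200, 4))
print(rank_intervals(fake))
```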

Med-HALT: Medical Domain Hallucination Test for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.15343
  • repo_url: None
  • paper_authors: Logesh Kumar Umapathi, Ankit Pal, Malaikannan Sankarasubbu
  • for: This research addresses hallucination in large language models (LLMs), particularly in the medical domain, where hallucinated output can have serious consequences in healthcare applications.
  • methods: We propose a new benchmark and dataset, the Medical Domain Hallucination Test (Med-HALT), designed to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities.
  • results: We evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, and found significant differences in their susceptibility to hallucination. The paper provides a detailed description of the dataset to promote transparency and reproducibility; through this work, we aim to contribute to safer and more reliable language models in healthcare.
    Abstract This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning and memory-based hallucination tests, designed to assess LLMs's problem-solving and information retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at medhalt.github.io

The Radon Signed Cumulative Distribution Transform and its applications in classification of Signed Images

  • paper_url: http://arxiv.org/abs/2307.15339
  • repo_url: https://github.com/rohdelab/PyTransKit
  • paper_authors: Le Gong, Shiying Li, Naqib Sad Pathan, Mohammad Shifat-E-Rabbi, Gustavo K. Rohde, Abu Hasnat Mohammad Rubaiyat, Sumati Thareja
  • for: This paper proposes a new image representation technique based on the mathematics of transport and optimal transport.
  • methods: The new representation combines the widely used Radon transform with a recent signal representation method called the Signed Cumulative Distribution Transform.
  • results: The new representation captures the information content of signed images more accurately than existing transport transforms and deep-learning-based classification methods, yielding higher classification accuracy.
    Abstract Here we describe a new image representation technique based on the mathematics of transport and optimal transport. The method relies on the combination of the well-known Radon transform for images and a recent signal representation method called the Signed Cumulative Distribution Transform. The newly proposed method generalizes previous transport-related image representation methods to arbitrary functions (images), and thus can be used in more applications. We describe the new transform, and some of its mathematical properties and demonstrate its ability to partition image classes with real and simulated data. In comparison to existing transport transform methods, as well as deep learning-based classification methods, the new transform more accurately represents the information content of signed images, and thus can be used to obtain higher classification accuracies. The implementation of the proposed method in Python language is integrated as a part of the software package PyTransKit, available on Github.

Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation

  • paper_url: http://arxiv.org/abs/2307.15326
  • repo_url: None
  • paper_authors: Yueh-Ning Ku, Mikhail Kuznetsov, Shaunak Mishra, Paloma de Juan
  • for: Improving the appeal and effectiveness of dynamic product ad (DPA) images so that users are more likely to click on the ads.
  • methods: Generative adversarial networks (GANs) and retrieval-assisted GANs are used to generate staged backgrounds that make product images more enticing and realistic.
  • results: Offline metrics and human evaluation show that the copy-paste staging approach improves the appeal and usefulness of DPA images, and the staging approach also enables animating products to produce a video ad from a single product image.
    Abstract Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be in the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product in the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics, and human evaluation. In addition, we show how our staging approach can enable animations of moving products leading to a video ad from a product image.

Partial observations, coarse graining and equivariance in Koopman operator theory for large-scale dynamical systems

  • paper_url: http://arxiv.org/abs/2307.15325
  • repo_url: None
  • paper_authors: Sebastian Peitz, Hans Harder, Feliks Nüske, Friedrich Philipp, Manuel Schaller, Karl Worthmann
  • for: This paper addresses a pitfall in data-driven analysis, prediction, and control of large-scale systems: with only partial observations, the classical EDMD algorithm does not automatically provide an approximation of the underlying system's Koopman operator.
  • methods: The nonlinear dynamics of large-scale systems are studied through the Koopman operator, and the paper shows how symmetries in the system dynamics carry over to the Koopman operator, which can massively increase model efficiency.
  • results: Numerical experiments show that the approach reduces the required amount of data while preserving accuracy, and it can be combined with domain decomposition techniques for further efficiency.
    Abstract The Koopman operator has become an essential tool for data-driven analysis, prediction and control of complex systems, the main reason being the enormous potential of identifying linear function space representations of nonlinear dynamics from measurements. Until now, the situation where for large-scale systems, we (i) only have access to partial observations (i.e., measurements, as is very common for experimental data) or (ii) deliberately perform coarse graining (for efficiency reasons) has not been treated to its full extent. In this paper, we address the pitfall associated with this situation, that the classical EDMD algorithm does not automatically provide a Koopman operator approximation for the underlying system if we do not carefully select the number of observables. Moreover, we show that symmetries in the system dynamics can be carried over to the Koopman operator, which allows us to massively increase the model efficiency. We also briefly draw a connection to domain decomposition techniques for partial differential equations and present numerical evidence using the Kuramoto--Sivashinsky equation.
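For readers unfamiliar with EDMD, the vanilla algorithm the abstract builds on fits in a few lines. Under full-state observations with a well-chosen dictionary it recovers Koopman spectra, which is exactly the property the paper shows can fail under partial observations or coarse graining. The monomial dictionary below is our illustrative choice.

```python
import numpy as np

def edmd(x, y, dictionary):
    """Plain EDMD: least-squares Koopman matrix on a dictionary.

    x, y: (m, d) snapshot pairs with y[k] = F(x[k]).
    dictionary: maps (m, d) states to an (m, n) feature matrix Psi.
    Returns K solving Psi(x) @ K ~= Psi(y) in least squares.
    """
    px, py = dictionary(x), dictionary(y)
    k, *_ = np.linalg.lstsq(px, py, rcond=None)
    return k

def monomials(x):  # all monomials of degree <= 2 in two variables
    return np.column_stack([np.ones(len(x)), x[:, 0], x[:, 1],
                            x[:, 0] ** 2, x[:, 0] * x[:, 1], x[:, 1] ** 2])

# Linear test system: the dictionary space is invariant, so EDMD is exact
# and the Koopman eigenvalues {1, 0.9, 0.81, 0.8, 0.72, 0.64} appear.
a = np.array([[0.9, 0.1], [0.0, 0.8]])
x = np.random.default_rng(1).normal(size=(500, 2))
k = edmd(x, x @ a.T, monomials)
print(np.round(sorted(np.linalg.eigvals(k).real, reverse=True), 3))
```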

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2307.15320
  • repo_url: None
  • paper_authors: Ricardo Garcia, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
  • for: The paper focuses on bridging the visual sim-to-real domain gap in robotic manipulation using domain randomization (DR), evaluating the effectiveness of DR methods on challenging manipulation tasks and developing a systematic approach to selecting DR parameters.
  • methods: An off-line proxy task of cube localization is proposed to select DR parameters for texture randomization, lighting randomization, object-color variations, and camera parameters. Visuomotor policies are then trained in simulation with the off-line-optimized DR parameters and applied directly to a real robot.
  • results: The approach achieves an average success rate of 93% on a diverse set of challenging manipulation tasks, and the simulator-trained policies outperform policies learned from real but limited data, demonstrating robustness to visual variations in real scenes.
    Abstract Learning visuomotor policies in simulation is much safer and cheaper than in the real world. However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots. One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR). While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks. In particular, we propose an off-line proxy task of cube localization to select DR parameters for texture randomization, lighting randomization, variations of object colors and camera parameters. Notably, we demonstrate that DR parameters have similar impact on our off-line proxy task and on-line policies. We, hence, use off-line optimized DR parameters to train visuomotor policies in simulation and directly apply such policies to a real robot. Our approach achieves 93% success rate on average when tested on a diverse set of challenging manipulation tasks. Moreover, we evaluate the robustness of policies to visual variations in real scenes and show that our simulator-trained policies outperform policies learned using real but limited data. Code, simulation environment, real robot datasets and trained models are available at https://www.di.ens.fr/willow/research/robust_s2r/.

SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text

  • paper_url: http://arxiv.org/abs/2308.01420
  • repo_url: None
  • paper_authors: Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez
  • for: explore text corpora and learn topics that preserve semantically meaningful relationships between documents
  • methods: semi-supervised human-in-the-loop LDA-based method
  • results: more interpretable projections than baseline methods with only a fraction of labels provided, qualitatively similar results on a real corpus
    Abstract A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for dimensionality reduction of text corpora, like Latent Dirichlet Allocation (LDA), often produce projections that do not capture human notions of document similarity. We propose a semi-supervised human-in-the-loop LDA-based method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections. On synthetic corpora, our method yields more interpretable projections than baseline methods with only a fraction of labels provided. On a real corpus, we obtain qualitatively similar results.

DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall’s Rank Correlation

  • paper_url: http://arxiv.org/abs/2307.15317
  • repo_url: None
  • paper_authors: Kaipeng Zheng, Huishuai Zhang, Weiran Huang
  • for: Improving few-shot learning performance, especially on datasets from different domains.
  • methods: Kendall's rank correlation is used in place of geometric similarity metrics during inference, and a carefully designed differentiable loss is proposed for meta-training to address the non-differentiability of Kendall's rank correlation.
  • results: The rank-correlation-based approach substantially enhances few-shot learning performance across a wide range of datasets with different domains.
    Abstract Few-shot learning aims to adapt models trained on the base dataset to novel tasks where the categories are not seen by the model before. This often leads to a relatively uniform distribution of feature values across channels on novel classes, posing challenges in determining channel importance for novel tasks. Standard few-shot learning methods employ geometric similarity metrics such as cosine similarity and negative Euclidean distance to gauge the semantic relatedness between two features. However, features with high geometric similarities may carry distinct semantics, especially in the context of few-shot learning. In this paper, we demonstrate that the importance ranking of feature channels is a more reliable indicator for few-shot learning than geometric similarity metrics. We observe that replacing the geometric similarity metric with Kendall's rank correlation only during inference is able to improve the performance of few-shot learning across a wide range of datasets with different domains. Furthermore, we propose a carefully designed differentiable loss for meta-training to address the non-differentiability issue of Kendall's rank correlation. Extensive experiments demonstrate that the proposed rank-correlation-based approach substantially enhances few-shot learning performance.
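Kendall's tau counts concordant minus discordant channel pairs, and its sign(.) terms block gradients; a common smooth relaxation replaces sign with tanh. The sketch below illustrates that idea — the paper's meta-training loss is in this spirit, but its exact form is its own.

```python
import torch

def soft_kendall(x, y, temperature=0.1):
    """Differentiable surrogate of Kendall's rank correlation.

    x, y: (d,) feature vectors (e.g., query embedding and class
    prototype). Hard tau averages sign(dx)*sign(dy) over channel
    pairs; tanh(./t) makes that average differentiable.
    """
    d = x.numel()
    i, j = torch.triu_indices(d, d, offset=1)  # all channel pairs
    dx, dy = x[i] - x[j], y[i] - y[j]
    return (torch.tanh(dx / temperature) * torch.tanh(dy / temperature)).mean()

# Usage: rank-based similarity between a query and a class prototype.
q = torch.randn(64, requires_grad=True)
proto = torch.randn(64)
soft_kendall(q, proto).backward()  # gradients flow, unlike hard tau
```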

Differential Evolution Algorithm based Hyper-Parameters Selection of Transformer Neural Network Model for Load Forecasting

  • paper_url: http://arxiv.org/abs/2307.15299
  • repo_url: https://github.com/anuvabsen1/meta-transformer
  • paper_authors: Anuvab Sen, Arul Rhik Mazumder, Udayon Sen
  • for: Forecasting electric grid load to reduce energy waste and improve the reliability of supply.
  • methods: Time-series models (ARIMA) and deep-learning models (ANN, LSTM, GRU, etc.) are considered, and several metaheuristics (notably differential evolution) are applied to find the optimal hyperparameters of the Transformer model.
  • results: The study shows that optimizing the Transformer model with metaheuristics improves forecasting accuracy, and it compares how different metaheuristics affect model performance.
    Abstract Accurate load forecasting plays a vital role in numerous sectors, but accurately capturing the complex dynamics of dynamic power systems remains a challenge for traditional statistical models. For these reasons, time-series models (ARIMA) and deep-learning models (ANN, LSTM, GRU, etc.) are commonly deployed and often experience higher success. In this paper, we analyze the efficacy of the recently developed Transformer-based Neural Network model in Load forecasting. Transformer models have the potential to improve Load forecasting because of their ability to learn long-range dependencies derived from their Attention Mechanism. We apply several metaheuristics namely Differential Evolution to find the optimal hyperparameters of the Transformer-based Neural Network to produce accurate forecasts. Differential Evolution provides scalable, robust, global solutions to non-differentiable, multi-objective, or constrained optimization problems. Our work compares the proposed Transformer based Neural Network model integrated with different metaheuristic algorithms by their performance in Load forecasting based on numerical metrics such as Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE). Our findings demonstrate the potential of metaheuristic-enhanced Transformer-based Neural Network models in Load forecasting accuracy and provide optimal hyperparameters for each model.
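Differential evolution itself is easy to run; what is expensive is the objective, a full training run per candidate. The sketch below uses scipy's implementation with a cheap synthetic stand-in for "train the Transformer, return validation MSE"; the hyperparameter names and bounds are illustrative, not the paper's search space.

```python
import numpy as np
from scipy.optimize import differential_evolution

def validation_mse(hparams):
    """Stand-in objective. In practice: build the forecaster with these
    hyperparameters, train it, and return MSE on a validation split."""
    lr, heads, layers = hparams
    return (np.log10(lr) + 3) ** 2 + (heads - 4) ** 2 + (layers - 2) ** 2

# Bounds for (learning rate, attention heads, encoder layers). Integer
# hyperparameters would be rounded inside the objective in practice.
bounds = [(1e-5, 1e-1), (1, 8), (1, 6)]
result = differential_evolution(validation_mse, bounds, seed=0, maxiter=50)
print(result.x, result.fun)  # best candidate found and its (stand-in) MSE
```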

Learning Nonlinear Projections for Reduced-Order Modeling of Dynamical Systems using Constrained Autoencoders

  • paper_url: http://arxiv.org/abs/2307.15288
  • repo_url: https://github.com/grmacchio/romnet_chaos2023
  • paper_authors: Samuel E. Otto, Gregory R. Macchio, Clarence W. Rowley
  • for: This work concerns recently developed reduced-order modeling techniques that approximate nonlinear dynamical systems on low-dimensional manifolds learned from data.
  • methods: Constrained autoencoder neural networks are used to learn both the manifold and the projection fibers from data, with the latent dimension lower than that of the full system.
  • results: The paper proposes a new class of nonlinear projections that learn oblique projection fibers, together with techniques for modeling high-dimensional systems, including a novel sparsity-promoting penalty.
    Abstract Recently developed reduced-order modeling techniques aim to approximate nonlinear dynamical systems on low-dimensional manifolds learned from data. This is an effective approach for modeling dynamics in a post-transient regime where the effects of initial conditions and other disturbances have decayed. However, modeling transient dynamics near an underlying manifold, as needed for real-time control and forecasting applications, is complicated by the effects of fast dynamics and nonnormal sensitivity mechanisms. To begin to address these issues, we introduce a parametric class of nonlinear projections described by constrained autoencoder neural networks in which both the manifold and the projection fibers are learned from data. Our architecture uses invertible activation functions and biorthogonal weight matrices to ensure that the encoder is a left inverse of the decoder. We also introduce new dynamics-aware cost functions that promote learning of oblique projection fibers that account for fast dynamics and nonnormality. To demonstrate these methods and the specific challenges they address, we provide a detailed case study of a three-state model of vortex shedding in the wake of a bluff body immersed in a fluid, which has a two-dimensional slow manifold that can be computed analytically. In anticipation of future applications to high-dimensional systems, we also propose several techniques for constructing computationally efficient reduced-order models using our proposed nonlinear projection framework. This includes a novel sparsity-promoting penalty for the encoder that avoids detrimental weight matrix shrinkage via computation on the Grassmann manifold.

Optimal Approximation of Zonoids and Uniform Approximation by Shallow Neural Networks

  • paper_url: http://arxiv.org/abs/2307.15285
  • repo_url: None
  • paper_authors: Jonathan W. Siegel
  • for: The paper solves two related problems: determining to what error an arbitrary zonoid in $\mathbb{R}^{d+1}$ can be approximated in the Hausdorff distance by a sum of $n$ line segments, and determining optimal uniform-norm approximation rates for shallow ReLU$^k$ neural networks on their variation spaces.
  • methods: New techniques close the remaining logarithmic gap in the first problem and significantly improve the approximation rates for the second, enabling uniform approximation of both the target function and its derivatives.
  • results: The first problem is solved in all dimensions, and improved approximation rates are obtained for the second, with uniform approximation of the target function and its derivatives.
    Abstract We study the following two related problems. The first is to determine to what error an arbitrary zonoid in $\mathbb{R}^{d+1}$ can be approximated in the Hausdorff distance by a sum of $n$ line segments. The second is to determine optimal approximation rates in the uniform norm for shallow ReLU$^k$ neural networks on their variation spaces. The first of these problems has been solved for $d\neq 2,3$, but when $d=2,3$ a logarithmic gap between the best upper and lower bounds remains. We close this gap, which completes the solution in all dimensions. For the second problem, our techniques significantly improve upon existing approximation rates when $k\geq 1$, and enable uniform approximation of both the target function and its derivatives.

VeriGen: A Large Language Model for Verilog Code Generation

  • paper_url: http://arxiv.org/abs/2308.00708
  • repo_url: None
  • paper_authors: Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, Siddharth Garg
  • for: This study explores whether large language models (LLMs) can automate hardware design by generating high-quality Verilog code.
  • methods: The researchers fine-tune existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks, and evaluate functional correctness with a specially designed test suite featuring a custom problem set and testing benches.
  • results: The fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall improvement. On a more diverse and complex problem set, the fine-tuned model is competitive with GPT-3.5-turbo and excels in certain scenarios; notably, it generates syntactically correct Verilog 41% more often than its pre-trained counterpart across problem categories, highlighting the potential of smaller, in-house LLMs for hardware design automation.
    Abstract In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by generating high-quality Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against state-of-the-art gpt-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation.

Recovering high-quality FODs from a reduced number of diffusion-weighted images using a model-driven deep learning architecture

  • paper_url: http://arxiv.org/abs/2307.15273
  • repo_url: https://github.com/jbartlett6/sdnet
  • paper_authors: J Bartlett, C E Davey, L A Johnston, J Duan
  • for: The paper proposes a deep-learning-based fibre orientation distribution (FOD) reconstruction method that produces accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time.
  • methods: A spherical deconvolution network takes diffusion-acquisition-invariant representations of the DWI signals as input, so it can be applied flexibly to data with different b-vectors and b-values, while ensuring that intermediate and output FODs remain consistent with the input DWI signals.
  • results: The model-driven architecture achieves performance competitive with the state-of-the-art FOD super-resolution network FOD-Net, and a fixel classification penalty in the loss can be tuned to improve downstream fixel-based analysis. Code is available at https://github.com/Jbartlett6/SDNet.
    Abstract Fibre orientation distribution (FOD) reconstruction using deep learning has the potential to produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time. Diffusion acquisition invariant representations of the DWI signals are typically used as input to these methods to ensure that they can be applied flexibly to data with different b-vectors and b-values; however, this means the network cannot condition its output directly on the DWI signal. In this work, we propose a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture, that ensures intermediate and output FODs produced by the network are consistent with the input DWI signals. Furthermore, we implement a fixel classification penalty within our loss function, encouraging the network to produce FODs that can subsequently be segmented into the correct number of fixels and improve downstream fixel-based analysis. Our results show that the model-based deep learning architecture achieves competitive performance compared to a state-of-the-art FOD super-resolution network, FOD-Net. Moreover, we show that the fixel classification penalty can be tuned to offer improved performance with respect to metrics that rely on accurately segmented of FODs. Our code is publicly available at https://github.com/Jbartlett6/SDNet .

An Overview Of Temporal Commonsense Reasoning and Acquisition

  • paper_url: http://arxiv.org/abs/2308.00002
  • repo_url: None
  • paper_authors: Georg Wenzel, Adam Jatowt
  • for: This work surveys approaches to improving language models' temporal commonsense reasoning, particularly through a variety of augmentations and their evaluation across a growing number of datasets.
  • methods: The surveyed work employs a variety of augmentation methods, including random, logical, and knowledge-based augmentation, to strengthen language models' temporal commonsense reasoning.
  • results: Despite these augmentations, the models still struggle to approach human performance on reasoning tasks over temporal commonsense properties, such as the typical occurrence times, orderings, or durations of events.
    Abstract Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events, and use it to reason over problems requiring such knowledge. This trait is essential in temporal natural language processing tasks, with possible applications such as timeline summarization, temporal question answering, and temporal natural language inference. Recent research on the performance of large language models suggests that, although they are adept at generating syntactically correct sentences and solving classification tasks, they often take shortcuts in their reasoning and fall prey to simple linguistic traps. This article provides an overview of research in the domain of temporal commonsense reasoning, particularly focusing on enhancing language model performance through a variety of augmentations and their evaluation across a growing number of datasets. However, these augmented models still struggle to approach human performance on reasoning tasks over temporal common sense properties, such as the typical occurrence times, orderings, or durations of events. We further emphasize the need for careful interpretation of research to guard against overpromising evaluation results in light of the shallow reasoning present in transformers. This can be achieved by appropriately preparing datasets and suitable evaluation metrics.

Is this model reliable for everyone? Testing for strong calibration

  • paper_url: http://arxiv.org/abs/2307.15247
  • repo_url: https://github.com/jjfeng/testing_strong_calibration
  • paper_authors: Jean Feng, Alexej Gossmann, Romain Pirracchio, Nicholas Petrick, Gene Pennello, Berkman Sahiner
  • for: The paper is written for auditing a risk prediction model for strong calibration, particularly for machine learning algorithms, and for identifying poorly calibrated subgroups.
  • methods: The paper proposes a new testing procedure based on the insight that if observations can be reordered by their expected residuals, there should be a change in the association between the predicted and observed residuals if a poorly calibrated subgroup exists. The procedure uses a sample-splitting method, cross-validation, and a score-based cumulative sum (CUSUM) test to detect changes in the association.
  • results: The proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model compared to existing methods.
    Abstract In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well-known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing into one of changepoint detection, for which powerful methods already exist. We begin with introducing a sample-splitting procedure where a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
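The reordering insight translates directly into code: sort the audit split by a candidate model's predicted residuals and check whether the cumulative sum of observed residuals drifts. The sketch below is a simplification under our own choices — a permutation null in place of the paper's analytic Type-I-error control, and none of the sample-splitting or cross-validation machinery.

```python
import numpy as np

def cusum_stat(pred_resid, obs_resid):
    """Max CUSUM drift of observed residuals ordered by predicted ones.
    If some subgroup is miscalibrated, a good candidate model pushes it
    to one end of the ordering and the cumulative sum drifts there."""
    s = np.cumsum(obs_resid[np.argsort(pred_resid)] - obs_resid.mean())
    return np.abs(s).max() / (obs_resid.std() * np.sqrt(len(obs_resid)))

def permutation_pvalue(pred_resid, obs_resid, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)
    stat = cusum_stat(pred_resid, obs_resid)
    null = [cusum_stat(rng.permutation(pred_resid), obs_resid)
            for _ in range(n_perm)]
    return (1 + sum(s >= stat for s in null)) / (n_perm + 1)

# A miscalibrated subgroup (x > 0.7) that a candidate model has learned
# to flag via its predicted residuals (here simply x itself).
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 2000)
obs = rng.normal(size=2000) + 0.8 * (x > 0.7)
print(permutation_pvalue(x, obs))  # small p-value: miscalibration detected
```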

A Practical Recipe for Federated Learning Under Statistical Heterogeneity Experimental Design

  • paper_url: http://arxiv.org/abs/2307.15245
  • repo_url: https://github.com/mmorafah/fedzoo-bench
  • paper_authors: Mahdi Morafah, Weijia Wang, Bill Lin
  • for: This work investigates the success of Federated Learning (FL) under data heterogeneity and provides a systematic study with practical recommendations.
  • methods: The study examines a range of FL-specific experimental variables, including client-side and server-side techniques as well as different data preparation and evaluation choices.
  • results: The study identifies key experimental variables that affect FL performance, offers practical recommendations and a standardized experimental setup, and releases FedZoo-Bench, an open-source PyTorch library with pre-implementations of 22 state-of-the-art methods, available at https://github.com/MMorafah/FedZoo-Bench.
    Abstract Federated Learning (FL) has been an area of active research in recent years. There have been numerous studies in FL to make it more successful in the presence of data heterogeneity. However, despite the existence of many publications, the state of progress in the field is unknown. Many of the works use inconsistent experimental settings and there are no comprehensive studies on the effect of FL-specific experimental variables on the results and practical insights for a more comparable and consistent FL experimental setup. Furthermore, the existence of several benchmarks and confounding variables has further complicated the issue of inconsistency and ambiguity. In this work, we present the first comprehensive study on the effect of FL-specific experimental variables in relation to each other and performance results, bringing several insights and recommendations for designing a meaningful and well-incentivized FL experimental setup. We further aid the community by releasing FedZoo-Bench, an open-source library based on PyTorch with pre-implementation of 22 state-of-the-art methods, and a broad set of standardized and customizable features available at https://github.com/MMorafah/FedZoo-Bench. We also provide a comprehensive comparison of several state-of-the-art (SOTA) methods to better understand the current state of the field and existing limitations.

Sustainable Transparency in Recommender Systems: Bayesian Ranking of Images for Explainability

  • paper_url: http://arxiv.org/abs/2308.01196
  • repo_url: None
  • paper_authors: Jorge Paz-Ruza, Amparo Alonso-Betanzos, Berta Guijarro-Berdiñas, Brais Cancela, Carlos Eiras-Franco
  • for: Improving the transparency of recommender systems and user trust in them.
  • methods: Personalized explanations for recommendations are generated using visual content created by the users themselves.
  • results: BRIE outperforms previous state-of-the-art models while being far more efficient, emitting up to 75% less CO2 during training and inference and using a model up to 64 times smaller.
    Abstract Recommender Systems have become crucial in the modern world, commonly guiding users towards relevant content or products, and having a large influence over the decisions of users and citizens. However, ensuring transparency and user trust in these systems remains a challenge; personalized explanations have emerged as a solution, offering justifications for recommendations. Among the existing approaches for generating personalized explanations, using visual content created by the users is one particularly promising option, showing a potential to maximize transparency and user trust. Existing models for explaining recommendations in this context face limitations: sustainability has been a critical concern, as they often require substantial computational resources, leading to significant carbon emissions comparable to the Recommender Systems where they would be integrated. Moreover, most models employ surrogate learning goals that do not align with the objective of ranking the most effective personalized explanations for a given recommendation, leading to a suboptimal learning process and larger model sizes. To address these limitations, we present BRIE, a novel model designed to tackle the existing challenges by adopting a more adequate learning goal based on Bayesian Pairwise Ranking, enabling it to achieve consistently superior performance than state-of-the-art models in six real-world datasets, while exhibiting remarkable efficiency, emitting up to 75% less CO${_2}$ during training and inference with a model up to 64 times smaller than previous approaches.
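The "Bayesian Ranking" in the title refers to a pairwise ranking objective. A minimal sketch of a BPR-style loss follows; the scoring model that BRIE trains on user images is not shown, and the pairing scheme here is an assumption.

```python
import torch
import torch.nn.functional as F

def bpr_loss(score_pos, score_neg):
    """Bayesian Personalized Ranking loss: push the image that better
    explains a recommendation to score above a less suitable one, by
    maximizing log sigmoid of the score margin."""
    return -F.logsigmoid(score_pos - score_neg).mean()

# Toy usage: scores for 32 (more-explanatory, less-explanatory) pairs.
pos = torch.randn(32, requires_grad=True)
neg = torch.randn(32)
bpr_loss(pos, neg).backward()
```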

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2307.15217
  • repo_url: None
  • paper_authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
  • for: This paper examines how AI systems are aligned with human goals via RLHF, and the open problems and limitations of RLHF in practice.
  • methods: The paper surveys RLHF as used to fine-tune large language models and overviews techniques to understand, improve, and complement it in practice.
  • results: The paper identifies open problems and fundamental limitations of RLHF and proposes auditing and disclosure standards to improve societal oversight of RLHF systems.
    Abstract Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.15199
  • repo_url: None
  • paper_authors: Junhyeong Cho, Gilhyun Nam, Sungyeon Kim, Hunmin Yang, Suha Kwak
  • for: This work aims to improve performance on source-free domain generalization, without using any images.
  • methods: The proposed PromptStyler simulates various distribution shifts in a joint vision-language space by synthesizing diverse styles via prompts, learning diverse style features through learnable style word vectors for pseudo-words.
  • results: PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome, and DomainNet, even though it requires no images for training.
    Abstract In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenomenon of this joint space. From these observations, we propose PromptStyler which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts without using any images to deal with source-free domain generalization. The proposed method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*. To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located nearby their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, even though it does not require any images for training.

Identifying acute illness phenotypes via deep temporal interpolation and clustering network on physiologic signatures

  • paper_url: http://arxiv.org/abs/2307.15719
  • repo_url: None
  • paper_authors: Yuanfang Ren, Yanjun Li, Tyler J. Loftus, Jeremy Balch, Kenneth L. Abbott, Shounak Datta, Matthew M. Ruppert, Ziyuan Guan, Benjamin Shickel, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac
  • for: This study examines how the initial hours of hospital admission shape clinical trajectories and aims to support early clinical decisions despite data paucity.
  • methods: A deep temporal interpolation and clustering network extracts latent representations from sparse, irregularly sampled vital sign data and derives distinct patient phenotypes in a training cohort (n=41,502).
  • results: Four phenotypes were identified, each with distinct disease patterns and outcomes. Phenotype A (18%) had the most comorbid disease, with higher rates of prolonged respiratory insufficiency, acute kidney injury, sepsis, and three-year mortality. Phenotypes B (33%) and C (31%) showed diffuse patterns of mild organ dysfunction; B had favorable short-term outcomes but the second-highest three-year mortality, while C had favorable clinical outcomes. Phenotype D (17%) had early/persistent hypotension, a high rate of early surgery, and substantial inflammatory biomarkers, yet the second-lowest three-year mortality.
    Abstract Initial hours of hospital admission impact clinical trajectory, but early clinical decisions often suffer due to data paucity. With clustering analysis for vital signs within six hours of admission, patient phenotypes with distinct pathophysiological signatures and outcomes may support early clinical decisions. We created a single-center, longitudinal EHR dataset for 75,762 adults admitted to a tertiary care center for 6+ hours. We proposed a deep temporal interpolation and clustering network to extract latent representations from sparse, irregularly sampled vital sign data and derived distinct patient phenotypes in a training cohort (n=41,502). Model and hyper-parameters were chosen based on a validation cohort (n=17,415). Test cohort (n=16,845) was used to analyze reproducibility and correlation with biomarkers. The training, validation, and testing cohorts had similar distributions of age (54-55 yrs), sex (55% female), race, comorbidities, and illness severity. Four clusters were identified. Phenotype A (18%) had most comorbid disease with higher rate of prolonged respiratory insufficiency, acute kidney injury, sepsis, and three-year mortality. Phenotypes B (33%) and C (31%) had diffuse patterns of mild organ dysfunction. Phenotype B had favorable short-term outcomes but second-highest three-year mortality. Phenotype C had favorable clinical outcomes. Phenotype D (17%) had early/persistent hypotension, high rate of early surgery, and substantial biomarker rate of inflammation but second-lowest three-year mortality. After comparing phenotypes' SOFA scores, clustering results did not simply repeat other acuity assessments. In a heterogeneous cohort, four phenotypes with distinct categories of disease and outcomes were identified by a deep temporal interpolation and clustering network. This tool may impact triage decisions and clinical decision-support under time constraints.

The Marginal Value of Momentum for Small Learning Rate SGD

  • paper_url: http://arxiv.org/abs/2307.15196
  • repo_url: None
  • paper_authors: Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li
  • for: The paper aims to clarify the role of momentum in stochastic gradient descent (SGD), particularly in the regime of small learning rates where gradient noise dominates.
  • methods: Theoretical analysis and experiments are used to study the effect of momentum.
  • results: In practical training regimes where the optimal learning rate is not very large, momentum offers no clear benefit for either optimization or generalization.
    Abstract Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
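For reference, the update being analyzed is heavy-ball SGD: v_{t+1} = beta*v_t + g_t, theta_{t+1} = theta_t - eta*v_{t+1}. A quick numerical illustration of the paper's message on a noisy quadratic is below; matching the plain-SGD step to the effective rate eta/(1-beta) is standard folklore, not a recipe taken from the paper.

```python
import torch

torch.manual_seed(0)
w_mom, w_sgd, buf = torch.tensor([2.0]), torch.tensor([2.0]), torch.zeros(1)
grad = lambda w: w + 0.5 * torch.randn(1)   # gradient of w^2/2 plus noise
for _ in range(2000):
    buf = 0.9 * buf + grad(w_mom)           # momentum buffer, beta = 0.9
    w_mom = w_mom - 0.001 * buf             # lr 0.001, effective 0.001/(1-0.9)
    w_sgd = w_sgd - 0.01 * grad(w_sgd)      # plain SGD at the matched lr 0.01
print(w_mom.item(), w_sgd.item())           # both hover near the optimum 0
```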

Learning in Repeated Multi-Unit Pay-As-Bid Auctions

  • paper_url: http://arxiv.org/abs/2307.15193
  • repo_url: None
  • paper_authors: Rigel Galgana, Negin Golrezaei
  • for: This paper studies how to learn to bid in repeated multi-unit pay-as-bid auctions so as to maximize profit.
  • methods: A polynomial-time dynamic programming (DP) scheme solves the offline problem, and its structure is leveraged to design online learning algorithms under full-information and bandit feedback.
  • results: The online algorithms have polynomial time and space complexity with sublinear regret, and experiments show that when all bidders follow the proposed no-regret learning algorithms, the market dynamics converge to a welfare-maximizing equilibrium. The pay-as-bid auction also generates significantly higher revenue than its popular alternative, the uniform price auction.
    Abstract Motivated by Carbon Emissions Trading Schemes, Treasury Auctions, and Procurement Auctions, which all involve the auctioning of homogeneous multiple units, we consider the problem of learning how to bid in repeated multi-unit pay-as-bid auctions. In each of these auctions, a large number of (identical) items are to be allocated to the largest submitted bids, where the price of each of the winning bids is equal to the bid itself. The problem of learning how to bid in pay-as-bid auctions is challenging due to the combinatorial nature of the action space. We overcome this challenge by focusing on the offline setting, where the bidder optimizes their vector of bids while only having access to the past submitted bids by other bidders. We show that the optimal solution to the offline problem can be obtained using a polynomial time dynamic programming (DP) scheme. We leverage the structure of the DP scheme to design online learning algorithms with polynomial time and space complexity under full information and bandit feedback settings. We achieve an upper bound on regret of $O(M\sqrt{T\log |\mathcal{B}|})$ and $O(M\sqrt{|\mathcal{B}|T\log |\mathcal{B}|})$ respectively, where $M$ is the number of units demanded by the bidder, $T$ is the total number of auctions, and $|\mathcal{B}|$ is the size of the discretized bid space. We accompany these results with a regret lower bound, which match the linear dependency in $M$. Our numerical results suggest that when all agents behave according to our proposed no regret learning algorithms, the resulting market dynamics mainly converge to a welfare maximizing equilibrium where bidders submit uniform bids. Lastly, our experiments demonstrate that the pay-as-bid auction consistently generates significantly higher revenue compared to its popular alternative, the uniform price auction.
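To fix the setting: in a k-unit pay-as-bid auction the k highest bids win and each winning bid pays itself (a uniform price auction would instead charge every winner the single clearing price). A small sketch of that outcome rule, which is the environment the learning algorithms act in; the DP and no-regret algorithms themselves are not reproduced here.

```python
import numpy as np

def pay_as_bid(my_bids, competing_bids, k):
    """Units won and payment for one bidder in a k-unit pay-as-bid auction.
    The k highest bids overall win; each winning bid pays itself.
    (Tie-breaking is ignored for simplicity.)"""
    all_bids = np.concatenate([my_bids, competing_bids])
    winners = np.argsort(all_bids)[::-1][:k]
    mine = winners[winners < len(my_bids)]
    return len(mine), all_bids[mine].sum()

won, paid = pay_as_bid(np.array([0.9, 0.7, 0.4]),
                       np.array([0.8, 0.6, 0.5, 0.3]), k=5)
print(won, paid)  # 2 units won, total payment 1.6 (= 0.9 + 0.7)
```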

f-Divergence Minimization for Sequence-Level Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2307.15190
  • repo_url: https://github.com/manga-uofa/fdistill
  • paper_authors: Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou
  • for: This work proposes a framework, f-DISTILL, for knowledge distillation of language models.
  • methods: Sequence-level knowledge distillation is formulated as minimizing a generalized f-divergence function, and four distilling variants are proposed under this framework.
  • results: Experiments on four datasets show that f-DISTILL outperforms the existing SeqKD and ENGINE approaches, and that the symmetric distilling losses better force the student to learn from the teacher distribution.
    Abstract Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
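The step-wise decomposition reduces a sequence-level divergence to per-position word-level losses between teacher and student next-token distributions. A sketch with two f-divergence choices (forward KL and symmetric JS) follows; it illustrates the decomposition idea, not the paper's exact derivation.

```python
import torch
import torch.nn.functional as F

def word_level_divergence(student_logits, teacher_logits, kind="js"):
    """Per-position f-divergence between teacher and student, averaged
    over the sequence. *_logits: (seq_len, vocab) unnormalized scores."""
    p = F.softmax(teacher_logits, dim=-1)   # teacher next-token distribution
    q = F.softmax(student_logits, dim=-1)   # student next-token distribution
    kl = lambda a, b: (a * (a.clamp_min(1e-9).log()
                            - b.clamp_min(1e-9).log())).sum(-1)
    if kind == "kl":                        # forward KL(teacher || student)
        return kl(p, q).mean()
    m = 0.5 * (p + q)                       # symmetric Jensen-Shannon
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)).mean()

# Distill a 10-token sequence over a 100-word vocabulary.
t = torch.randn(10, 100)
s = torch.randn(10, 100, requires_grad=True)
word_level_divergence(s, t).backward()
```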

Rotation-Invariant Random Features Provide a Strong Baseline for Machine Learning on 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2308.06271
  • repo_url: https://github.com/meliao/rotation-invariant-random-features
  • paper_authors: Owen Melia, Eric Jonas, Rebecca Willett
  • for: This paper studies rotation-invariant machine learning methods for 3D point cloud data and their performance on molecular property prediction and 3D shape classification tasks.
  • methods: A simple and general-purpose random features approach maps 3D point clouds to rotation-invariant features; it matches or outperforms general-purpose rotation-invariant neural networks on the standard molecular property prediction benchmarks QM7 and QM9.
  • results: The method is general-purpose and efficient, providing a rotation-invariant baseline on the ModelNet40 shape classification task, with prediction latency an order of magnitude smaller than competing kernel methods.
    Abstract Rotational invariance is a popular inductive bias used by many fields in machine learning, such as computer vision and machine learning for quantum chemistry. Rotation-invariant machine learning methods set the state of the art for many tasks, including molecular property prediction and 3D shape classification. These methods generally either rely on task-specific rotation-invariant features, or they use general-purpose deep neural networks which are complicated to design and train. However, it is unclear whether the success of these methods is primarily due to the rotation invariance or the deep neural networks. To address this question, we suggest a simple and general-purpose method for learning rotation-invariant functions of three-dimensional point cloud data using a random features approach. Specifically, we extend the random features method of Rahimi & Recht 2007 by deriving a version that is invariant to three-dimensional rotations and showing that it is fast to evaluate on point cloud data. We show through experiments that our method matches or outperforms the performance of general-purpose rotation-invariant neural networks on standard molecular property prediction benchmark datasets QM7 and QM9. We also show that our method is general-purpose and provides a rotation-invariant baseline on the ModelNet40 shape classification task. Finally, we show that our method has an order of magnitude smaller prediction latency than competing kernel methods.
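As a flavour of the random-features recipe, the sketch below builds a rotation-invariant descriptor from pairwise distances and passes it through random Fourier features (Rahimi & Recht, 2007). The paper derives invariance directly inside its random-features construction; this stand-in only demonstrates the idea and verifies invariance under a random rotation.

```python
import numpy as np

def invariant_random_features(points, n_features=256, sigma=1.0, n_dists=45, seed=0):
    """Rotation-invariant random features for a 3D point cloud (n, 3).

    Pairwise distances are unchanged by rotations, and random Fourier
    features of a fixed-length distance descriptor give a cheap kernel
    approximation. The fixed seed makes the random map identical
    across calls.
    """
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    desc = np.sort(d[np.triu_indices_from(d, k=1)])          # invariant descriptor
    desc = np.pad(desc, (0, max(0, n_dists - desc.size)))[:n_dists]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(n_features, n_dists))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(W @ desc + b)

rng = np.random.default_rng(1)
cloud = rng.normal(size=(10, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))                 # random 3D rotation
assert np.allclose(invariant_random_features(cloud),
                   invariant_random_features(cloud @ R.T))
```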

RCT Rejection Sampling for Causal Estimation Evaluation

  • paper_url: http://arxiv.org/abs/2307.15176
  • repo_url: https://github.com/kakeith/rct_rejection_sampling
  • paper_authors: Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit Bhattacharya
  • for: The paper addresses confounding in observational data, particularly in high-dimensional settings such as text data, genomics, or the behavioral social sciences.
  • methods: It proposes RCT rejection sampling, which subsamples randomized controlled trials (RCTs) to create confounded observational datasets while the RCTs' average causal effects serve as ground truth, with theoretical guarantees that causal identification holds (a schematic follows the abstract below).
  • results: The algorithm yields low bias when oracle estimators are evaluated on synthetic confounded samples; the paper also highlights several finite-data considerations for evaluation designers and provides a proof of concept on a novel, real-world RCT of approximately 70k observations with text data as high-dimensional covariates.
    Abstract Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
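A schematic of the evaluation strategy: start from a toy RCT where treatment is randomized, then reject units with a probability that couples treatment and covariates, producing a confounded observational sample while the RCT ATE stays the ground truth. The acceptance rule below is illustrative; the paper's algorithm chooses it so that causal identification provably holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy RCT: binary treatment assigned at random, outcome depends on X and T.
n = 100_000
X = rng.normal(size=n)                      # covariate (stands in for text features)
T = rng.integers(0, 2, size=n)              # randomized treatment
Y = 1.5 * T + X + rng.normal(size=n)        # true ATE = 1.5

# Rejection sampling: keep each unit with a probability that couples T and X,
# inducing confounding while the RCT's ATE remains the ground truth.
p_t1_given_x = 1 / (1 + np.exp(-2 * X))     # target propensity P(T=1 | X)
accept_prob = np.where(T == 1, p_t1_given_x, 1 - p_t1_given_x)
keep = rng.uniform(size=n) < accept_prob
X_obs, T_obs, Y_obs = X[keep], T[keep], Y[keep]

# The naive difference-in-means is now biased; the RCT ATE (1.5) is the reference.
naive = Y_obs[T_obs == 1].mean() - Y_obs[T_obs == 0].mean()
print(f"kept {keep.mean():.0%} of units, naive estimate = {naive:.2f} (truth 1.5)")
```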

Causative Cyberattacks on Online Learning-based Automated Demand Response Systems

  • paper_url: http://arxiv.org/abs/2307.15175
  • repo_url: None
  • paper_authors: Samrat Acharya, Yury Dvorkin, Ramesh Karri
  • for: This paper studies AI-based automated demand response (DR), in which utilities and third-party aggregators learn the energy-usage patterns of small-scale electrical loads, and shows that the open communication channels these pipelines rely on are vulnerable to causative data-integrity cyberattacks.
  • methods: It designs a data-driven attack strategy informed by DR data collected from New York University (NYU) campus buildings.
  • results: The case study demonstrates the feasibility and effects of maliciously tampering with real-time DR incentives, with DR event data sent to customers, and with customers' responses to those incentives.
    Abstract Power utilities are adopting Automated Demand Response (ADR) to replace the costly fuel-fired generators and to preempt congestion during peak electricity demand. Similarly, third-party Demand Response (DR) aggregators are leveraging controllable small-scale electrical loads to provide on-demand grid support services to the utilities. Some aggregators and utilities have started employing Artificial Intelligence (AI) to learn the energy usage patterns of electricity consumers and use this knowledge to design optimal DR incentives. Such AI frameworks use open communication channels between the utility/aggregator and the DR customers, which are vulnerable to \textit{causative} data integrity cyberattacks. This paper explores vulnerabilities of AI-based DR learning and designs a data-driven attack strategy informed by DR data collected from the New York University (NYU) campus buildings. The case study demonstrates the feasibility and effects of maliciously tampering with (i) real-time DR incentives, (ii) DR event data sent to DR customers, and (iii) responses of DR customers to the DR incentives.

PredictChain: Empowering Collaboration and Data Accessibility for AI in a Decentralized Blockchain-based Marketplace

  • paper_url: http://arxiv.org/abs/2307.15168
  • repo_url: https://github.com/ai-and-blockchain/s23_predictchain
  • paper_authors: Matthew T. Pisano, Connor J. Patterson, Oshani Seneviratne
  • for: The paper presents PredictChain, a blockchain-based marketplace where users can upload datasets for training predictive machine learning models, request training on previously uploaded datasets, or submit queries to trained models.
  • methods: Nodes in the blockchain network with available computing resources operate a range of archetype machine learning models with varying characteristics such as cost, speed, simplicity, power, and cost-effectiveness.
  • results: This decentralized approach empowers users to develop improved models accessible to the public, promotes data sharing, and reduces reliance on centralized cloud providers.
    Abstract Limited access to computing resources and training data poses significant challenges for individuals and groups aiming to train and utilize predictive machine learning models. Although numerous publicly available machine learning models exist, they are often unhosted, necessitating end-users to establish their computational infrastructure. Alternatively, these models may only be accessible through paid cloud-based mechanisms, which can prove costly for general public utilization. Moreover, model and data providers require a more streamlined approach to track resource usage and capitalize on subsequent usage by others, both financially and otherwise. An effective mechanism is also lacking to contribute high-quality data for improving model performance. We propose a blockchain-based marketplace called "PredictChain" for predictive machine-learning models to address these issues. This marketplace enables users to upload datasets for training predictive machine learning models, request model training on previously uploaded datasets, or submit queries to trained models. Nodes within the blockchain network, equipped with available computing resources, will operate these models, offering a range of archetype machine learning models with varying characteristics, such as cost, speed, simplicity, power, and cost-effectiveness. This decentralized approach empowers users to develop improved models accessible to the public, promotes data sharing, and reduces reliance on centralized cloud providers.

VISU at WASSA 2023 Shared Task: Detecting Emotions in Reaction to News Stories Leveraging BERT and Stacked Embeddings

  • paper_url: http://arxiv.org/abs/2307.15164
  • repo_url: None
  • paper_authors: Vivek Kumar, Sushmita Singh, Prayag Tiwari
  • for: This work develops deep learning models that infer the emotions expressed in essays written in reaction to news articles.
  • methods: It combines word-embedding representations with tailored preprocessing strategies to capture the nuances of expressed emotion, experimenting with static and contextual embeddings (individual and stacked) fed into BiLSTM and Transformer-based models (a minimal sketch of the stacked-embedding BiLSTM follows the abstract).
  • results: The system placed tenth in the WASSA 2023 emotion classification task with a macro F1-score of 0.2717, validating the approach on a small, imbalanced dataset with mixed target-emotion categories.
    Abstract Our system, VISU, participated in the WASSA 2023 Shared Task (3) of Emotion Classification from essays written in reaction to news articles. Emotion detection from complex dialogues is challenging and often requires context/domain understanding. Therefore in this research, we have focused on developing deep learning (DL) models using the combination of word embedding representations with tailored prepossessing strategies to capture the nuances of emotions expressed. Our experiments used static and contextual embeddings (individual and stacked) with Bidirectional Long short-term memory (BiLSTM) and Transformer based models. We occupied rank tenth in the emotion detection task by scoring a Macro F1-Score of 0.2717, validating the efficacy of our implemented approaches for small and imbalanced datasets with mixed categories of target emotions.
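A minimal sketch of the "stacked embeddings into BiLSTM" recipe described above, with two randomly initialised embedding tables standing in for the static and contextual embedders (all sizes and names here are illustrative assumptions, not the authors' configuration):

```python
import torch
import torch.nn as nn

class StackedEmbeddingBiLSTM(nn.Module):
    """Sketch of the 'stacked embeddings -> BiLSTM -> classifier' recipe.
    Two randomly initialised tables stand in for a static and a contextual
    embedder; 'stacking' means concatenating the per-token vectors."""
    def __init__(self, vocab=5000, d_static=100, d_ctx=128, hidden=64, n_classes=7):
        super().__init__()
        self.emb_static = nn.Embedding(vocab, d_static)
        self.emb_ctx = nn.Embedding(vocab, d_ctx)      # placeholder embedder
        self.bilstm = nn.LSTM(d_static + d_ctx, hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):                          # (batch, seq_len)
        x = torch.cat([self.emb_static(tokens), self.emb_ctx(tokens)], dim=-1)
        h, _ = self.bilstm(x)
        return self.head(h.mean(dim=1))                 # mean-pool over tokens

logits = StackedEmbeddingBiLSTM()(torch.randint(0, 5000, (4, 32)))
print(logits.shape)                                     # torch.Size([4, 7])
```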

R-LPIPS: An Adversarially Robust Perceptual Similarity Metric

  • paper_url: http://arxiv.org/abs/2307.15157
  • repo_url: https://github.com/saraghazanfari/r-lpips
  • paper_authors: Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo
  • for: The paper addresses the security weaknesses of the Learned Perceptual Image Patch Similarity (LPIPS) metric by proposing Robust Learned Perceptual Image Patch Similarity (R-LPIPS), a metric that is robust to adversarial examples.
  • methods: R-LPIPS leverages adversarially trained deep features in place of standard ones (a simplified perceptual-distance sketch follows the abstract).
  • results: A comprehensive set of experiments demonstrates the superiority of R-LPIPS over the classical LPIPS metric.
    Abstract Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. The code is available at https://github.com/SaraGhazanfari/R-LPIPS.
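For intuition, here is a stripped-down LPIPS-style distance: compare unit-normalized deep features of two images. Real LPIPS aggregates several layers with learned per-channel weights, and R-LPIPS swaps in adversarially trained features; this single-layer sketch with an untrained VGG backbone only shows the mechanics.

```python
import torch
import torch.nn.functional as F
import torchvision.models as tvm

# Feature extractor; R-LPIPS would load adversarially trained weights here.
backbone = tvm.vgg16(weights=None).features[:16].eval()

@torch.no_grad()
def perceptual_distance(x, y):
    """Single-layer LPIPS-style distance: unit-normalize deep features
    along channels, then average squared differences over channels and
    space. Real LPIPS also applies learned per-channel weights."""
    fx, fy = backbone(x), backbone(y)
    fx = fx / (fx.norm(dim=1, keepdim=True) + 1e-8)
    fy = fy / (fy.norm(dim=1, keepdim=True) + 1e-8)
    return ((fx - fy) ** 2).mean(dim=(1, 2, 3))

x, y = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(perceptual_distance(x, y))
```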

A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity

  • paper_url: http://arxiv.org/abs/2307.15154
  • repo_url: None
  • paper_authors: Zhihan Xiong, Romain Camilleri, Maryam Fazel, Lalit Jain, Kevin Jamieson
  • for: The paper addresses fixed-budget best-arm identification for linear bandits in a potentially non-stationary environment.
  • methods: It proposes a novel algorithm, $\mathsf{P1}$-$\mathsf{RAGE}$, that combines the robustness of non-adaptive G-optimal designs with the fast identification rates of adaptive algorithms designed for stationary settings.
  • results: $\mathsf{P1}$-$\mathsf{RAGE}$ never performs worse than G-optimal design in non-stationary environments while comparing favorably to the best algorithms in benign, stationary settings.
    Abstract We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbrace\theta_t\right\rbrace_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}x^\top\sum_{t=1}^{T}\theta_t$ with probability as high as possible. Prior work has addressed the stationary setting where $\theta_t = \theta_1$ for all $t$ and demonstrated that the error probability decreases as $\exp(-T /\rho^*)$ for a problem-dependent constant $\rho^*$. But in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary and an algorithm expecting a stationary setting can easily fail. For robust identification, it is well-known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time then the error probability decreases as $\exp(-T\Delta^2_{(1)}/d)$, where $\Delta_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T \theta_t$. As there exist environments where $\Delta_{(1)}^2/ d \ll 1/ \rho^*$, we are motivated to propose a novel algorithm $\mathsf{P1}$-$\mathsf{RAGE}$ that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of $\mathsf{P1}$-$\mathsf{RAGE}$ and demonstrate empirically that the algorithm indeed never performs worse than G-optimal design but compares favorably to the best algorithms in the stationary setting.

R-Block: Regularized Block of Dropout for convolutional networks

  • paper_url: http://arxiv.org/abs/2307.15150
  • repo_url: None
  • paper_authors: Liqi Wang, Qiya Hu
  • for: This paper targets dropout-style regularization for convolutional neural networks (CNNs), where standard dropout is less effective than in fully connected layers and structured variants introduce inconsistency between training and inference.
  • methods: It proposes R-Block, a mutual-learning training strategy that builds two sub-models with different drop regions for each training sample and forces their output distributions to be consistent by minimizing the loss between them; two approaches for constructing such sub-models are designed (a sketch of the objective follows the abstract).
  • results: Experiments show that R-Block outperforms existing structured dropout variants and that the proposed sub-model constructions outperform alternatives.
    Abstract Dropout as a regularization technique is widely used in fully connected layers while is less effective in convolutional layers. Therefore more structured forms of dropout have been proposed to regularize convolutional networks. The disadvantage of these methods is that the randomness introduced causes inconsistency between training and inference. In this paper, we apply a mutual learning training strategy for convolutional layer regularization, namely R-Block, which forces two outputs of the generated difference maximizing sub models to be consistent with each other. Concretely, R-Block minimizes the losses between the output distributions of two sub models with different drop regions for each sample in the training dataset. We design two approaches to construct such sub models. Our experiments demonstrate that R-Block achieves better performance than other existing structured dropout variants. We also demonstrate that our approaches to construct sub models outperforms others.
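A hedged sketch of the mutual-learning objective: two forward passes draw two independent dropout masks, each gets a cross-entropy term, and a symmetric KL term forces the two output distributions to agree. The paper constructs its two sub-models with different structured drop regions; plain dropout stands in here, and `alpha` is our name for the consistency weight.

```python
import torch
import torch.nn.functional as F

def r_block_style_loss(model, x, targets, alpha=1.0):
    """Mutual-learning sketch: two forward passes through the same network
    in train mode draw two independent dropout masks; a symmetric KL term
    forces the two sub-models' output distributions to agree."""
    logits1, logits2 = model(x), model(x)        # independent masks per pass
    ce = F.cross_entropy(logits1, targets) + F.cross_entropy(logits2, targets)
    lp1, lp2 = F.log_softmax(logits1, -1), F.log_softmax(logits2, -1)
    consistency = 0.5 * (F.kl_div(lp1, lp2.exp(), reduction="batchmean")
                         + F.kl_div(lp2, lp1.exp(), reduction="batchmean"))
    return ce + alpha * consistency

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Dropout(0.3),
                            torch.nn.Linear(32, 10)).train()
loss = r_block_style_loss(model, torch.randn(8, 32), torch.randint(0, 10, (8,)))
print(loss)
```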

Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

  • paper_url: http://arxiv.org/abs/2308.07931
  • repo_url: None
  • paper_authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
  • for: bridges the 2D-to-3D gap for robotic manipulation
  • methods: leverages distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models
  • results: achieves in-the-wild generalization to unseen objects using few-shot learning method for 6-DOF grasping and placing
    Abstract Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

On (Normalised) Discounted Cumulative Gain as an Offline Evaluation Metric for Top-$n$ Recommendation

  • paper_url: http://arxiv.org/abs/2307.15053
  • repo_url: None
  • paper_authors: Olivier Jeunen, Ivan Potapov, Aleksei Ustimenko
  • for: This work takes a critical look at (normalised) Discounted Cumulative Gain as an offline evaluation metric for top-$n$ recommendation, asking when it can be expected to approximate the gold-standard outcome of an online experiment.
  • methods: It formally states the assumptions under which DCG is an unbiased estimator of online reward, derives the metric from first principles, and runs a correlation analysis between offline estimates and online experiments on a large-scale recommendation platform (a worked example of the normalisation pitfall follows the abstract).
  • results: Unbiased DCG estimates correlate strongly with online reward even when some of the metric's assumptions are violated, but normalising the metric renders it inconsistent: ranking competing methods by nDCG can invert their relative order, suggesting that nDCG's practical utility may be limited.
    Abstract Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-$n$ recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
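The inconsistency the paper highlights is easy to reproduce: per-user normalisation can invert which of two methods looks better. A small worked example:

```python
import numpy as np

def dcg(relevance):
    """DCG with the standard 1/log2(rank+1) position discount."""
    rel = np.asarray(relevance, dtype=float)
    return float(rel @ (1.0 / np.log2(np.arange(2, rel.size + 2))))

def ndcg(relevance):
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# Relevances listed in the order each method ranked them, for two users.
method_a = [[3, 2, 0], [0, 0, 1]]
method_b = [[2, 3, 0], [0, 1, 0]]

print(sum(dcg(r) for r in method_a), sum(dcg(r) for r in method_b))
# A wins on total DCG (~4.76 vs ~4.52) ...
print(sum(ndcg(r) for r in method_a), sum(ndcg(r) for r in method_b))
# ... but B wins on total nDCG (~1.54 vs 1.50): normalisation flipped the order.
```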

A Transformer-based Approach for Arabic Offline Handwritten Text Recognition

  • paper_url: http://arxiv.org/abs/2307.15045
  • repo_url: None
  • paper_authors: Saleh Momeni, Bagher BabaAli
  • for: The paper tackles offline Arabic handwritten text recognition, a challenging problem in pattern recognition and machine learning with applications across many domains.
  • methods: It introduces two alternative architectures, a Transformer Transducer and a standard sequence-to-sequence Transformer, which rely only on attention, making them more parallelizable and simpler than recurrent pipelines while still modeling language dependencies; pre-trained Transformers are employed for both image understanding and language modeling.
  • results: On the Arabic KHATT dataset, the proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.
    Abstract Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.

Universal and Transferable Adversarial Attacks on Aligned Language Models

  • paper_url: http://arxiv.org/abs/2307.15043
  • repo_url: https://github.com/llm-attacks/llm-attacks
  • paper_authors: Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
  • for: This paper develops adversarial attacks that cause aligned language models to generate objectionable content.
  • methods: It automatically produces adversarial suffixes for prompts via a combination of greedy and gradient-based search techniques, rather than manual engineering.
  • results: Suffixes trained on multiple prompts and multiple models (Vicuna-7B and 13B) transfer to black-box, publicly released models, inducing objectionable content in the public interfaces of ChatGPT, Bard, and Claude as well as in open-source LLMs such as LLaMA-2-Chat, Pythia, and Falcon.
    Abstract Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

Detecting Morphing Attacks via Continual Incremental Training

  • paper_url: http://arxiv.org/abs/2307.15105
  • repo_url: None
  • paper_authors: Lorenzo Pellegrini, Guido Borghi, Annalisa Franco, Davide Maltoni
  • for: This work enables incremental training when restrictions on data transfer and storage make it impossible to compose a single dataset, possibly from multiple sources, for batch-based training.
  • methods: It evaluates several Continual Learning (CL) methods, simulating a model that is updated whenever a new chunk of data, even of variable size, becomes available (a sketch of the best performer follows the abstract).
  • results: Learning without Forgetting (LwF) is among the best-performing algorithms in this scenario; its usage and parametrization are further studied on Morphing Attack Detection and Object Classification tasks with respect to the amount of newly available training data.
    Abstract Scenarios in which restrictions in data transfer and storage limit the possibility to compose a single dataset -- also exploiting different data sources -- to perform a batch-based training procedure, make the development of robust models particularly challenging. We hypothesize that the recent Continual Learning (CL) paradigm may represent an effective solution to enable incremental training, even through multiple sites. Indeed, a basic assumption of CL is that once a model has been trained, old data can no longer be used in successive training iterations and in principle can be deleted. Therefore, in this paper, we investigate the performance of different Continual Learning methods in this scenario, simulating a learning model that is updated every time a new chunk of data, even of variable size, is available. Experimental results reveal that a particular CL method, namely Learning without Forgetting (LwF), is one of the best-performing algorithms. Then, we investigate its usage and parametrization in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that became available.
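For reference, a minimal sketch of the Learning-without-Forgetting objective in the incremental setting: cross-entropy on the new chunk plus a temperature-softened distillation term toward a frozen copy of the previous model, so old data never needs to be replayed. Hyperparameter names and values are ours.

```python
import copy
import torch
import torch.nn.functional as F

def lwf_loss(student, teacher, x, y, T=2.0, lam=1.0):
    """Learning-without-Forgetting step for one new data chunk: cross-entropy
    on the new labels plus a temperature-softened distillation term toward a
    frozen copy of the model trained on the previous chunks."""
    new_logits = student(x)
    with torch.no_grad():
        old_logits = teacher(x)                   # frozen previous-chunk model
    ce = F.cross_entropy(new_logits, y)
    distill = F.kl_div(F.log_softmax(new_logits / T, dim=-1),
                       F.softmax(old_logits / T, dim=-1),
                       reduction="batchmean") * T * T
    return ce + lam * distill

student = torch.nn.Linear(16, 4)
teacher = copy.deepcopy(student).eval()           # snapshot before the new chunk
print(lwf_loss(student, teacher, torch.randn(8, 16), torch.randint(0, 4, (8,))))
```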

Speeding up Fourier Neural Operators via Mixed Precision

  • paper_url: http://arxiv.org/abs/2307.15034
  • repo_url: https://github.com/neuraloperator/neuraloperator
  • paper_authors: Colin White, Renbo Tu, Jean Kossaifi, Gennady Pekhimenko, Kamyar Azizzadenesheli, Anima Anandkumar
  • for: learning fast surrogates for partial differential equation (PDE) solution operators with the Fourier neural operator (FNO)
  • methods: mixed-precision training of FNO, with a focus on reducing memory usage and training time (a generic mixed-precision step is sketched after the abstract)
  • results: up to 34% reduction in training time and memory usage, with little or no reduction in accuracy, on the Navier-Stokes and Darcy flow equations.
    Abstract The Fourier neural operator (FNO) is a powerful technique for learning surrogate maps for partial differential equation (PDE) solution operators. For many real-world applications, which often require high-resolution data points, training time and memory usage are significant bottlenecks. While there are mixed-precision training techniques for standard neural networks, those work for real-valued datatypes on finite dimensions and therefore cannot be directly applied to FNO, which crucially operates in the (complex-valued) Fourier domain and in function spaces. On the other hand, since the Fourier transform is already an approximation (due to discretization error), we do not need to perform the operation at full precision. In this work, we (i) profile memory and runtime for FNO with full and mixed-precision training, (ii) conduct a study on the numerical stability of mixed-precision training of FNO, and (iii) devise a training routine which substantially decreases training time and memory usage (up to 34%), with little or no reduction in accuracy, on the Navier-Stokes and Darcy flow equations. Combined with the recently proposed tensorized FNO (Kossaifi et al., 2023), the resulting model has far better performance while also being significantly faster than the original FNO.
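For orientation, here is a generic PyTorch mixed-precision training step. Autocast plus gradient scaling covers only the standard part of the recipe; the paper's contribution is additionally running and stabilising the Fourier-domain operations of the FNO in reduced precision, which plain autocast does not do by itself. The snippet assumes a CUDA device.

```python
import torch
import torch.nn.functional as F

def amp_train_step(model, x, y, optimizer, scaler):
    """One standard automatic-mixed-precision step (requires CUDA).
    Forward pass runs eligible ops in float16; the GradScaler's loss
    scaling protects small fp16 gradients from underflow."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

scaler = torch.cuda.amp.GradScaler()
```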

Self-Supervised Graph Transformer for Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.15019
  • repo_url: None
  • paper_authors: Aminollah Khormali, Jiann-Shiun Yuan
  • for: This study proposes a reliable deepfake detection system that generalizes across forgery types, appearance, and quality.
  • methods: The framework couples a vision-Transformer feature extractor pre-trained via self-supervised contrastive learning with a graph convolutional network and a Transformer discriminator, plus a graph Transformer relevancy map that highlights manipulated regions and explains the model's decisions.
  • results: Across in-distribution, cross-dataset, and cross-manipulation experiments, and under common post-production perturbations such as video compression and blur, the framework surpasses current state-of-the-art approaches.
    Abstract Deepfake detection methods have shown promising results in recognizing forgeries within a given dataset, where training and testing take place on the in-distribution dataset. However, their performance deteriorates significantly when presented with unseen samples. As a result, a reliable deepfake detection system must remain impartial to forgery types, appearance, and quality for guaranteed generalizable detection performance. Despite various attempts to enhance cross-dataset generalization, the problem remains challenging, particularly when testing against common post-processing perturbations, such as video compression or blur. Hence, this study introduces a deepfake detection framework, leveraging a self-supervised pre-training model that delivers exceptional generalization ability, withstanding common corruptions and enabling feature explainability. The framework comprises three key components: a feature extractor based on vision Transformer architecture that is pre-trained via self-supervised contrastive learning methodology, a graph convolution network coupled with a Transformer discriminator, and a graph Transformer relevancy map that provides a better understanding of manipulated regions and further explains the model's decision. To assess the effectiveness of the proposed framework, several challenging experiments are conducted, including in-data distribution performance, cross-dataset, cross-manipulation generalization, and robustness against common post-production perturbations. The results achieved demonstrate the remarkable effectiveness of the proposed deepfake detection framework, surpassing the current state-of-the-art approaches.

Samplable Anonymous Aggregation for Private Federated Data Analysis

  • paper_url: http://arxiv.org/abs/2307.15017
  • repo_url: None
  • paper_authors: Kunal Talwar, Shan Wang, Audra McMillan, Vojta Jina, Vitaly Feldman, Bailey Basile, Aine Cahill, Yi Sheng Chan, Mike Chatzidakis, Junye Chen, Oliver Chick, Mona Chitnis, Suman Ganta, Yusuf Goren, Filip Granqvist, Kristine Guo, Frederic Jacobs, Omid Javidbakht, Albert Liu, Richard Low, Dan Mascenik, Steve Myers, David Park, Wonhee Park, Gianni Parsa, Tommy Pauly, Christian Priebe, Rehan Rishi, Guy Rothblum, Michael Scaria, Linmao Song, Congzheng Song, Karl Tarbe, Sebastian Vogt, Luke Winstrom, Shundong Zhou
  • for: This paper designs scalable protocols for private statistics and private federated learning in which each device keeps its data private.
  • methods: It proposes a simple primitive that supports efficient implementation of several commonly used algorithms and permits privacy accounting close to that of the central setting, without the strong trust assumptions the central setting entails.
  • results: The paper presents a system architecture implementing this primitive and provides a security analysis of the proposed system.
    Abstract We revisit the problem of designing scalable protocols for private statistics and private federated learning when each device holds its private data. Our first contribution is to propose a simple primitive that allows for efficient implementation of several commonly used algorithms, and allows for privacy accounting that is close to that in the central setting without requiring the strong trust assumptions it entails. Second, we propose a system architecture that implements this primitive and perform a security analysis of the proposed system.

How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges

  • paper_url: http://arxiv.org/abs/2307.15016
  • repo_url: https://github.com/htqin/googlebard-visunderstand
  • paper_authors: Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool
  • for: The paper aims to evaluate the performance of Google’s Bard in understanding and interpreting visual data (images) conditioned by text questions, and to identify the gaps in Bard’s vision-based understanding.
  • methods: The paper uses 15 diverse task scenarios to comprehensively evaluate Bard’s performance in handling visual data, including regular, camouflaged, medical, under-water, and remote sensing data.
  • results: The primary finding of the study is that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments.
    Abstract Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal Generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, under-water and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released on https://github.com/htqin/GoogleBard-VisUnderstand

Harnessing Synthetic Active Particles for Physical Reservoir Computing

  • paper_url: http://arxiv.org/abs/2307.15010
  • repo_url: None
  • paper_authors: Xiangzun Wang, Frank Cichos
  • for: This paper demonstrates physical reservoir computing with a system of synthetic active microparticles, aiming at efficient information processing in self-organized matter.
  • methods: Microswimmers with a delayed propulsion toward a passive target self-organize into inherently noisy nonlinear dynamical units; a reservoir of such self-coupled units performs predictive tasks, with a special architecture that reads out from historical reservoir states to suppress noise (an abstract analogue is sketched after the abstract).
  • results: Despite the strong noise from the microswimmers' Brownian motion, the reservoir performs predictive tasks, paving the way for studying information processing in synthetic self-organized active particle systems.
    Abstract The processing of information is an indispensable property of living systems realized by networks of active processes with enormous complexity. They have inspired many variants of modern machine learning one of them being reservoir computing, in which stimulating a network of nodes with fading memory enables computations and complex predictions. Reservoirs are implemented on computer hardware, but also on unconventional physical substrates such as mechanical oscillators, spins, or bacteria often summarized as physical reservoir computing. Here we demonstrate physical reservoir computing with a synthetic active microparticle system that self-organizes from an active and passive component into inherently noisy nonlinear dynamical units. The self-organization and dynamical response of the unit is the result of a delayed propulsion of the microswimmer to a passive target. A reservoir of such units with a self-coupling via the delayed response can perform predictive tasks despite the strong noise resulting from Brownian motion of the microswimmers. To achieve efficient noise suppression, we introduce a special architecture that uses historical reservoir states for output. Our results pave the way for the study of information processing in synthetic self-organized active particle systems.
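An abstract echo-state analogue of the setup (the real reservoir is made of physical microswimmers): noisy nonlinear units driven by an input, with a ridge-regression readout over a short history of reservoir states, mirroring the paper's use of historical states for noise suppression. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_steps, washout = 100, 2000, 200
W = 0.9 * rng.normal(size=(N, N)) / np.sqrt(N)   # recurrent weights, stable scale
w_in = rng.normal(size=N)

u = np.sin(np.linspace(0.0, 60.0, T_steps + 1))  # input; task: one-step prediction
x = np.zeros(N)
states = []
for t in range(T_steps):
    # noisy nonlinear units; the 0.05 noise stands in for Brownian motion
    x = np.tanh(W @ x + w_in * u[t]) + 0.05 * rng.normal(size=N)
    states.append(x.copy())
S = np.array(states)

# Readout over a short history of reservoir states (cf. the paper's use of
# historical states to suppress noise), fitted by ridge regression.
H = np.hstack([S[2:], S[1:-1], S[:-2]])
y = u[1:][2:]                                    # one-step-ahead targets
A = H[washout:]
w_out = np.linalg.solve(A.T @ A + 1e-2 * np.eye(A.shape[1]), A.T @ y[washout:])
print("train MSE:", np.mean((A @ w_out - y[washout:]) ** 2))
```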

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability

  • paper_url: http://arxiv.org/abs/2307.15007
  • repo_url: None
  • paper_authors: Usha Bhalla, Suraj Srinivas, Himabindu Lakkaraju
  • for: The paper aims to explain machine learning model behavior with feature attributions that are both faithful and verifiable.
  • methods: It proposes Verifiability Tuning (VerT), which transforms black-box models into models that naturally yield faithful and verifiable feature attributions, grounded in a formal theoretical framework showing that attributions produced by standard models cannot be verified.
  • results: Extensive experiments on semi-synthetic and real-world datasets show that VerT produces models whose explanations are correct and verifiable while remaining faithful to the original black-box models they are meant to explain.
    Abstract With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by highlighting features that are critical to model predictions; however, prior work has shown that these explanations may not be faithful, and even more concerning is our inability to verify them. Specifically, it is nontrivial to evaluate if a given attribution is correct with respect to the underlying model. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful and verifiable, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we aim to bridge the gap between the aforementioned strategies by proposing Verifiability Tuning (VerT), a method that transforms black-box models into models that naturally yield faithful and verifiable feature attributions. We begin by introducing a formal theoretical framework to understand verifiability and show that attributions produced by standard models cannot be verified. We then leverage this framework to propose a method to build verifiable models and feature attributions out of fully trained black-box models. Finally, we perform extensive experiments on semi-synthetic and real-world datasets, and show that VerT produces models that (1) yield explanations that are correct and verifiable and (2) are faithful to the original black-box models they are meant to explain.

Improved Neural Radiance Fields Using Pseudo-depth and Fusion

  • paper_url: http://arxiv.org/abs/2308.03772
  • repo_url: None
  • paper_authors: Jingliang Li, Qiang Zhou, Chaohui Yu, Zhengda Lu, Jun Xiao, Zhibin Wang, Fan Wang
  • for: This work improves novel view synthesis and dense geometry modeling for NeRF by supplying multi-scale encoding volumes and multi-scale geometry information.
  • methods: It constructs multi-scale encoding volumes as additional input to the NeRF model, performs depth prediction and radiance field reconstruction simultaneously (the predicted depth map supervises the rendered depth, narrows the depth range, and guides point sampling), and enhances point-volume features via depth-guided neighbor feature fusion.
  • results: Experiments demonstrate superior performance in both novel view synthesis and dense geometry modeling without per-scene optimization.
    Abstract Since the advent of Neural Radiance Fields, novel view synthesis has received tremendous attention. The existing approach for the generalization of radiance field reconstruction primarily constructs an encoding volume from nearby source images as additional inputs. However, these approaches cannot efficiently encode the geometric information of real scenes with various scale objects/structures. In this work, we propose constructing multi-scale encoding volumes and providing multi-scale geometry information to NeRF models. To make the constructed volumes as close as possible to the surfaces of objects in the scene and the rendered depth more accurate, we propose to perform depth prediction and radiance field reconstruction simultaneously. The predicted depth map will be used to supervise the rendered depth, narrow the depth range, and guide points sampling. Finally, the geometric information contained in point volume features may be inaccurate due to occlusion, lighting, etc. To this end, we propose enhancing the point volume feature from depth-guided neighbor feature fusion. Experiments demonstrate the superior performance of our method in both novel view synthesis and dense geometry modeling without per-scene optimization.

Detection of Children Abuse by Voice and Audio Classification by Short-Time Fourier Transform Machine Learning implemented on Nvidia Edge GPU device

  • paper_url: http://arxiv.org/abs/2307.15101
  • repo_url: None
  • paper_authors: Jiuqi Yan, Yingxian Chen, W. W. T. Fok
  • for: This work aims to increase the safety of children in children's homes by detecting scenarios of child abuse.
  • methods: Machine learning classifies a child's voice to predict whether the current sound is crying, screaming, or laughing; crying or screaming triggers an immediate alert to the relevant personnel, and a hybrid use of video image classification further raises detection accuracy (the audio preprocessing is sketched after the abstract).
  • results: Sounds recorded on site at the children's home were converted to spectrograms with the Short-Time Fourier Transform and classified by a CNN, with the trained model reaching an accuracy of about 92% for sound detection.
    Abstract The safety of children in children home has become an increasing social concern, and the purpose of this experiment is to use machine learning applied to detect the scenarios of child abuse to increase the safety of children. This experiment uses machine learning to classify and recognize a child's voice and predict whether the current sound made by the child is crying, screaming or laughing. If a child is found to be crying or screaming, an alert is immediately sent to the relevant personnel so that they can perceive what the child may be experiencing in a surveillance blind spot and respond in a timely manner. Together with a hybrid use of video image classification, the accuracy of child abuse detection can be significantly increased. This greatly reduces the likelihood that a child will receive violent abuse in the nursery and allows personnel to stop an imminent or incipient child abuse incident in time. The datasets collected from this experiment is entirely from sounds recorded on site at the children home, including crying, laughing, screaming sound and background noises. These sound files are transformed into spectrograms using Short-Time Fourier Transform, and then these image data are imported into a CNN neural network for classification, and the final trained model can achieve an accuracy of about 92% for sound detection.
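The preprocessing step described above is straightforward to sketch: compute a Short-Time Fourier Transform, convert it to a normalised log-magnitude image, and feed that to a CNN. Window size, hop, and normalisation below are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.signal import stft

def spectrogram_image(audio, sr=16000, n_fft=512, hop=128):
    """Mono audio clip -> normalised log-magnitude spectrogram for CNN input."""
    _, _, Z = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    spec = 20.0 * np.log10(np.abs(Z) + 1e-10)            # dB scale
    spec = (spec - spec.mean()) / (spec.std() + 1e-8)    # per-clip normalisation
    return spec.astype(np.float32)                       # (freq_bins, frames)

clip = np.random.randn(16000)                            # 1 s stand-in clip
print(spectrogram_image(clip).shape)
```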

Thinker: Learning to Plan and Act

  • paper_url: http://arxiv.org/abs/2307.14993
  • repo_url: https://github.com/anonymous-scrl/thinker
  • paper_authors: Stephen Chung, Ivan Anokhin, David Krueger
  • for: This work introduces the Thinker algorithm, which lets reinforcement learning agents autonomously interact with and utilize a learned world model, eliminating the need for hand-crafted planning algorithms.
  • methods: The algorithm wraps the environment with a world model and adds model-interaction actions, so agents can plan by proposing alternative plans to the world model before selecting a final action to execute in the environment; the learned plans are easy to interpret through visualization.
  • results: Thinker achieves state-of-the-art performance in Sokoban and competitive results on the Atari 2600 benchmark, and visualizations show that trained agents plan effectively with the world model to select better actions.
    Abstract We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for hand-crafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. The algorithm's generality opens a new research direction on how a world model can be used in reinforcement learning and how planning can be seamlessly integrated into an agent's decision-making process.

Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

  • paper_url: http://arxiv.org/abs/2307.14988
  • repo_url: None
  • paper_authors: Or Sharir, Anima Anandkumar
  • for: This paper proposes an incremental computing approach so that deep learning models can process dynamic inputs efficiently, such as a document being edited in real time.
  • methods: It uses vector quantization to discretize intermediate values in the network, filtering out noisy and unnecessary modifications to hidden neurons so their values can be reused across input changes; applied to the Transformer architecture, this yields an incremental inference algorithm whose complexity is proportional to the fraction of modified inputs (a toy illustration follows the abstract).
  • results: Adapting the pre-trained OPT-125M language model, the approach matches the accuracy of full recomputation on document classification while requiring 12.1x (median) fewer operations when processing sequences of atomic edits.
    Abstract Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.
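A toy illustration of the caching idea for a per-position layer: quantise the layer's inputs and recompute only positions whose quantised vector changed since the last call. Attention layers, which mix positions, need the paper's additional machinery; everything here (step size, layer shape, class names) is an illustrative assumption.

```python
import numpy as np

def quantize(x, step=0.1):
    """Uniform quantisation: edits that do not push a hidden vector across a
    quantisation boundary leave it unchanged, so cached work can be reused."""
    return np.round(x / step) * step

class IncrementalLayer:
    """Per-position layer that recomputes only positions whose quantised
    input changed since the previous call (toy shapes, fixed seq length)."""
    def __init__(self, weight):
        self.weight = weight
        self.cache_in = None
        self.cache_out = None

    def __call__(self, x):                      # x: (seq_len, d_in)
        q = quantize(x)
        if self.cache_in is None:
            changed = np.ones(len(q), dtype=bool)
            self.cache_out = np.zeros((len(q), self.weight.shape[1]))
        else:
            changed = np.any(q != self.cache_in, axis=1)
        self.cache_out[changed] = np.tanh(q[changed] @ self.weight)
        self.cache_in = q
        return self.cache_out, int(changed.sum())

layer = IncrementalLayer(np.random.default_rng(0).normal(size=(8, 8)))
x = np.random.default_rng(1).normal(size=(100, 8))
_, n_first = layer(x)
x[3] += 0.5                                     # one atomic edit
_, n_second = layer(x)
print(n_first, n_second)                        # 100, then only the edited row
```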

Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models

  • paper_url: http://arxiv.org/abs/2307.14971
  • repo_url: https://github.com/wangzy22/tap
  • paper_authors: Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu
  • for: This work improves the performance of 3D point cloud models through generative pre-training.
  • methods: It proposes a 3D-to-2D generative pre-training scheme, adaptable to any point cloud model, that generates view images from different instructed poses via a cross-attention mechanism; view images offer more precise supervision than point clouds, helping 3D backbones grasp geometric structure and stereoscopic relations.
  • results: The method outperforms previous pre-training approaches and boosts architecture-oriented models, achieving state-of-the-art performance when fine-tuned on ScanObjectNN classification and ShapeNetPart segmentation.
    Abstract With the overwhelming trend of mask image modeling led by MAE, generative pre-training has shown a remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images has more precise supervision than its point cloud counterpart, thus assisting 3D backbones to have a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results have proved the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP.
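
A rough PyTorch sketch of the pose-conditioned cross-attention idea described above; the module name, pose encoding, and output resolution are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseQueryDecoder(nn.Module):
    def __init__(self, dim=256, img_hw=32):
        super().__init__()
        self.img_hw = img_hw
        self.pixel_queries = nn.Parameter(torch.randn(img_hw * img_hw, dim))
        self.pose_mlp = nn.Linear(4, dim)      # e.g. sin/cos of azimuth, elevation
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_pixel = nn.Linear(dim, 3)      # RGB per pixel query

    def forward(self, point_feats, pose):
        # point_feats: (B, N, dim) from any 3D backbone; pose: (B, 4)
        B = point_feats.shape[0]
        q = self.pixel_queries.unsqueeze(0).expand(B, -1, -1) + \
            self.pose_mlp(pose).unsqueeze(1)   # inject the instructed pose
        out, _ = self.cross_attn(q, point_feats, point_feats)
        return self.to_pixel(out).view(B, self.img_hw, self.img_hw, 3)

decoder = PoseQueryDecoder()
img = decoder(torch.randn(2, 1024, 256), torch.rand(2, 4))  # (2, 32, 32, 3)
# The predicted view is supervised against a rendered image of the point cloud.
```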

Learning locally dominant force balances in active particle systems

  • paper_url: http://arxiv.org/abs/2307.14970
  • repo_url: None
  • paper_authors: Dominik Sturm, Suryanarayana Maddu, Ivo F. Sbalzarini
  • for: Explaining macroscopic pattern formation in self-organized active particle systems.
  • methods: Combines unsupervised clustering with sparsity-promoting inference to learn the locally dominant force balances in active particle systems.
  • results: Finds that propagating density bands are formed by local alignment interactions driven by density gradients, while steady-state asters are shaped by splay-induced negative compressibility arising from strong particle interactions; the method also reveals analogous principles in a system where particle speed depends on local density, and the inferred mechanisms agree with analytical scaling arguments and experimental observations.
    Abstract We use a combination of unsupervised clustering and sparsity-promoting inference algorithms to learn locally dominant force balances that explain macroscopic pattern formation in self-organized active particle systems. The self-organized emergence of macroscopic patterns from microscopic interactions between self-propelled particles can be widely observed nature. Although hydrodynamic theories help us better understand the physical basis of this phenomenon, identifying a sufficient set of local interactions that shape, regulate, and sustain self-organized structures in active particle systems remains challenging. We investigate a classic hydrodynamic model of self-propelled particles that produces a wide variety of patterns, like asters and moving density bands. Our data-driven analysis shows that propagating bands are formed by local alignment interactions driven by density gradients, while steady-state asters are shaped by a mechanism of splay-induced negative compressibility arising from strong particle interactions. Our method also reveals analogous physical principles of pattern formation in a system where the speed of the particle is influenced by local density. This demonstrates the ability of our method to reveal physical commonalities across models. The physical mechanisms inferred from the data are in excellent agreement with analytical scaling arguments and experimental observations.
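
A toy illustration of the pipeline's two stages under simplifying assumptions (synthetic stand-in features instead of hydrodynamic field terms): cluster space-time points by the activity of candidate force terms, then run sparse regression within each cluster to expose the locally dominant balance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 2000
# Columns = candidate force terms evaluated at each space-time point,
# e.g. [pressure gradient, alignment, advection, noise]; synthetic here.
Theta = rng.normal(size=(n, 4))
dudt = 2.0 * Theta[:, 0] - 1.5 * Theta[:, 1]      # ground-truth balance

# 1) Group points with similar term activity (locally dominant regimes).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.abs(Theta))

# 2) Sparse regression within each cluster keeps only the dominant terms.
for k in np.unique(labels):
    m = labels == k
    coef = Lasso(alpha=0.1).fit(Theta[m], dudt[m]).coef_
    print(f"cluster {k}: active terms -> {np.flatnonzero(np.abs(coef) > 1e-2)}")
```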

eess.IV - 2023-07-28

A Survey on Deep Learning in Medical Image Registration: New Technologies, Uncertainty, Evaluation Metrics, and Beyond

  • paper_url: http://arxiv.org/abs/2307.15615
  • repo_url: None
  • paper_authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du
  • for: Providing an up-to-date overview of deep learning-based medical image registration, covering recent network architectures, registration-specific loss functions, and methods for estimating registration uncertainty.
  • methods: Surveys innovative network architectures, registration-specific loss functions, uncertainty-estimation techniques, and evaluation metrics appropriate for assessing deep learning models on registration tasks.
  • results: Summarizes the latest advances in deep learning-based registration, highlights their practical applications in medical imaging, and discusses the future prospects of the field.
    Abstract Over the past decade, deep learning technologies have greatly advanced the field of medical image registration. The initial developments, such as ResNet-based and U-Net-based networks, laid the groundwork for deep learning-driven image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, and uncertainty estimation. These advancements have not only enriched the field of deformable image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.
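
As a concrete example of the registration-specific losses such surveys cover, here is a minimal composite loss: an image similarity term plus a diffusion regularizer on the displacement field. Real systems often substitute NCC or mutual information for the MSE term.

```python
import torch

def registration_loss(warped, fixed, disp, lam=0.01):
    """warped, fixed: (B, 1, H, W) images; disp: (B, 2, H, W) displacement."""
    # Similarity term: how well the warped moving image matches the fixed one.
    sim = torch.mean((warped - fixed) ** 2)
    # Smoothness term: penalize spatial gradients of the displacement field.
    dx = disp[:, :, :, 1:] - disp[:, :, :, :-1]
    dy = disp[:, :, 1:, :] - disp[:, :, :-1, :]
    reg = dx.pow(2).mean() + dy.pow(2).mean()
    return sim + lam * reg
```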

Integrated Digital Reconstruction of Welded Components: Supporting Improved Fatigue Life Prediction

  • paper_url: http://arxiv.org/abs/2307.15604
  • repo_url: None
  • paper_authors: Anders Faarbæk Mikkelstrup, Morten Kristiansen
  • for: The paper aims to improve the fatigue performance of welded joints in offshore jacket foundations by enhancing the quality of post-weld treatment through digital reconstruction of the weld.
  • methods: The paper proposes an industrial manipulator combined with a line scanner to integrate digital reconstruction into the automated HFMI treatment setup, using standard image processing, simple filtering techniques, and non-linear optimization to align and merge overlapping scans.
  • results: The proposed framework enables generic digital reconstruction of welded parts, aiding component design, overall quality assurance, and documentation of the HFMI treatment, and improves fatigue life and possible crack-location prediction.
    Abstract In the design of offshore jacket foundations, fatigue life is crucial. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, where particularly high-frequency mechanical impact (HFMI) treatment has been shown to improve fatigue performance significantly. Automated HFMI treatment has improved quality assurance and can lead to cost-effective design when combined with accurate fatigue life prediction. However, the finite element method (FEM), commonly used for predicting fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld, failing to consider the actual weld geometry and defects. Including the actual weld geometry in the FE model improves fatigue life prediction and possible crack location prediction but requires a digital reconstruction of the weld. Current digital reconstruction methods are time-consuming or require specialised scanning equipment and potential component relocation. The proposed framework instead uses an industrial manipulator combined with a line scanner to integrate digital reconstruction as part of the automated HFMI treatment setup. This approach applies standard image processing, simple filtering techniques, and non-linear optimisation for aligning and merging overlapping scans. A screened Poisson surface reconstruction finalises the 3D model to create a meshed surface. The outcome is a generic, cost-effective, flexible, and rapid method that enables generic digital reconstruction of welded parts, aiding in component design, overall quality assurance, and documentation of the HFMI treatment.
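
A hedged sketch of the described steps using Open3D (the paper's exact parameters and filtering are not given): ICP aligns and merges overlapping line-scanner sweeps, then screened Poisson reconstruction produces the meshed surface.

```python
import open3d as o3d

def reconstruct(scans, voxel=0.5):
    """scans: list of o3d.geometry.PointCloud from overlapping sweeps."""
    merged = scans[0]
    for scan in scans[1:]:
        # Refine the overlap alignment with ICP before merging.
        reg = o3d.pipelines.registration.registration_icp(
            scan, merged, max_correspondence_distance=2.0)
        merged += scan.transform(reg.transformation)
    merged = merged.voxel_down_sample(voxel_size=voxel)
    # Normals are required by the Poisson reconstruction step.
    merged.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=2.0, max_nn=30))
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        merged, depth=9)
    return mesh
```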

OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation of Road Scenes

  • paper_url: http://arxiv.org/abs/2307.15588
  • repo_url: https://github.com/feibryantkit/oafuser
  • paper_authors: Fei Teng, Jiaming Zhang, Kunyu Peng, Kailun Yang, Yaonan Wang, Rainer Stiefelhagen
  • for: Enhancing image semantic segmentation for autonomous-driving scene understanding by exploiting the rich spatial and angular information of light field cameras.
  • methods: Proposes the Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and angular information from sub-aperture images, with a Sub-Aperture Fusion Module (SAFM) that embeds sub-aperture images into angular features without additional memory cost and a Center Angular Rectification Module (CARM) that resolves mismatched spatial information across viewpoints.
  • results: Achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% mIoU on the UrbanLF-Real Extended dataset (+4.53% over the previous record).
    Abstract Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resource of intelligent vehicles. Besides, inappropriate compression leads to information corruption and data loss. To excavate representative information, we propose an Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and discovers the angular information from sub-aperture images to generate a semantically-consistent result. To avoid feature loss during network propagation and simultaneously streamline the redundant information from the light field camera, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular features without any additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we present Center Angular Rectification Module (CARM) realized feature resorting and prevent feature occlusion caused by asymmetric information. Our proposed OAFuser achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +4.53%. The source code of OAFuser will be made publicly available at https://github.com/FeiBryantkit/OAFuser.
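
An illustrative fusion module in the spirit of SAFM (not the paper's exact design): a view encoder shared across sub-aperture images keeps the parameter count flat while their pooled angular features are mixed with the central view.

```python
import torch
import torch.nn as nn

class SubApertureFusion(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.view_enc = nn.Conv2d(3, c, 3, padding=1)    # shared across views
        self.center_enc = nn.Conv2d(3, c, 3, padding=1)
        self.mix = nn.Conv2d(2 * c, c, 1)

    def forward(self, center, views):
        # center: (B, 3, H, W); views: (B, V, 3, H, W) sub-aperture images
        B, V, C, H, W = views.shape
        ang = self.view_enc(views.reshape(B * V, C, H, W))
        ang = ang.reshape(B, V, -1, H, W).mean(dim=1)    # pool the angular axis
        return self.mix(torch.cat([self.center_enc(center), ang], dim=1))

m = SubApertureFusion()
out = m(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 3, 64, 64))  # (2, 64, 64, 64)
```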

Defocus Blur Synthesis and Deblurring via Interpolation and Extrapolation in Latent Space

  • paper_url: http://arxiv.org/abs/2307.15461
  • repo_url: https://github.com/nis-research/linear-latent-blur
  • paper_authors: Ioana Mazilu, Shunxin Wang, Sven Dummer, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
  • for: Improving the quality of microscopy images by both deblurring them and synthesizing defocus blur, making them more suitable for further processing and disease analysis.
  • methods: Trains autoencoders with implicit and explicit regularization to enforce linear relations among the latent representations of different blur levels, so that blur can be synthesized or removed by linearly interpolating/extrapolating the latent codes of images taken at different focal planes.
  • results: Effectively synthesizes images with flexible blur levels, increasing data variety as a data augmentation technique and improving microscopy image quality.
    Abstract Though modern microscopes have an autofocusing system to ensure optimal focus, out-of-focus images can still occur when cells within the medium are not all in the same focal plane, affecting the image quality for medical diagnosis and analysis of diseases. We propose a method that can deblur images as well as synthesize defocus blur. We train autoencoders with implicit and explicit regularization techniques to enforce linearity relations among the representations of different blur levels in the latent space. This allows for the exploration of different blur levels of an object by linearly interpolating/extrapolating the latent representations of images taken at different focal planes. Compared to existing works, we use a simple architecture to synthesize images with flexible blur levels, leveraging the linear latent space. Our regularized autoencoders can effectively mimic blur and deblur, increasing data variety as a data augmentation technique and improving the quality of microscopic images, which would be beneficial for further processing and analysis.
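
The latent-space mechanics reduce to a few lines; `encoder` and `decoder` below are assumed to be a trained, regularized autoencoder pair of the kind described above.

```python
import torch

def change_blur(encoder, decoder, sharp, blurred, alpha):
    """Walk along the blur axis in latent space.

    alpha in (0, 1): interpolate to synthesize an intermediate blur level;
    alpha < 0: extrapolate past the sharp image, i.e. deblur.
    """
    z_sharp, z_blur = encoder(sharp), encoder(blurred)
    z = (1 - alpha) * z_sharp + alpha * z_blur   # linear relation in latent space
    return decoder(z)
```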

ERCPMP: An Endoscopic Image and Video Dataset for Colorectal Polyps Morphology and Pathology

  • paper_url: http://arxiv.org/abs/2307.15444
  • repo_url: None
  • paper_authors: Mojgan Forootan, Mohsen Rajabnia, Ahmad R Mafi, Hamed Azhdari Tehrani, Erfan Ghadirzadeh, Mahziar Setayeshfar, Zahra Ghaffari, Mohammad Tashakoripour, Mohammad Reza Zali, Hamidreza Bolhasani
  • for: Supporting the development of accurate medical algorithms for prediction, detection, diagnosis, treatment, and prognosis, where data availability is the main challenge.
  • methods: Presents ERCPMP, an endoscopic image and video dataset of 191 patients with colorectal polyps, containing demographic, morphological (based on the latest international Paris, Pit, and JNET classifications), and pathological data alongside the images and videos.
  • results: The pathological data covers diagnoses including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory, and Adenocarcinoma with dysplasia grade and differentiation; the current version is published on Elsevier Mendeley Dataverse, and the latest version is accessible via https://databiox.com.
    Abstract In the recent years, artificial intelligence (AI) and its leading subtypes, machine learning (ML) and deep learning (DL) and their applications are spreading very fast in various aspects such as medicine. Today the most important challenge of developing accurate algorithms for medical prediction, detection, diagnosis, treatment and prognosis is data. ERCPMP is an Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology. This dataset contains demographic, morphological and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is included based on the latest international gastroenterology classification references such as Paris, Pit and JNET classification. Pathological data includes the diagnosis of the polyps including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse and since it is under development, the latest version is accessible via: https://databiox.com.
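
A hypothetical access pattern for the dataset's tabular metadata; the file and column names below are assumptions for illustration, not documented fields.

```python
import pandas as pd

# Assumed CSV export of the per-polyp metadata described above.
meta = pd.read_csv("ercpmp_metadata.csv")

# Filter by pathology, then inspect the morphology distribution.
adenomas = meta[meta["pathology"].isin(["Tubular", "Villous", "Tubulovillous"])]
print(adenomas.groupby("paris_classification").size())
```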

RAWIW: RAW Image Watermarking Robust to ISP Pipeline

  • paper_url: http://arxiv.org/abs/2307.15443
  • repo_url: None
  • paper_authors: Kang Fu, Xiaohong Liu, Jun Jia, Zicheng Zhang, Yicong Peng, Jia Wang, Guangtao Zhai
  • for: Protecting the copyright of RAW images with a deep learning-based invisible watermarking framework.
  • methods: Embeds copyright information directly into RAW images and extracts it from the corresponding RGB images; a neural network simulating the ISP pipeline enables end-to-end training, a distortion network models traditional ISP and transmission noise, and a three-stage training strategy balances robustness and concealment.
  • results: Experiments show that RAWIW achieves cross-domain copyright protection that survives different ISP pipelines and transmission distortion while maintaining visual quality.
    Abstract Invisible image watermarking is essential for image copyright protection. Compared to RGB images, RAW format images use a higher dynamic range to capture the radiometric characteristics of the camera sensor, providing greater flexibility in post-processing and retouching. Similar to the master recording in the music industry, RAW images are considered the original format for distribution and image production, thus requiring copyright protection. Existing watermarking methods typically target RGB images, leaving a gap for RAW images. To address this issue, we propose the first deep learning-based RAW Image Watermarking (RAWIW) framework for copyright protection. Unlike RGB image watermarking, our method achieves cross-domain copyright protection. We directly embed copyright information into RAW images, which can be later extracted from the corresponding RGB images generated by different post-processing methods. To achieve end-to-end training of the framework, we integrate a neural network that simulates the ISP pipeline to handle the RAW-to-RGB conversion process. To further validate the generalization of our framework to traditional ISP pipelines and its robustness to transmission distortion, we adopt a distortion network. This network simulates various types of noises introduced during the traditional ISP pipeline and transmission. Furthermore, we employ a three-stage training strategy to strike a balance between robustness and concealment of watermarking. Our extensive experiments demonstrate that RAWIW successfully achieves cross-domain copyright protection for RAW images while maintaining their visual quality and robustness to ISP pipeline distortions.
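
A schematic training step for the described pipeline; the encoder, ISP surrogate, distortion, and decoder networks and the loss weighting are assumptions standing in for the paper's modules.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, isp_net, distort, decoder, raw, bits, opt):
    """raw: (B, 1, H, W) RAW mosaic; bits: (B, n_bits) float 0/1 payload."""
    watermarked_raw = encoder(raw, bits)        # embed in the RAW domain
    rgb = distort(isp_net(watermarked_raw))     # differentiable RAW->RGB + noise
    pred = decoder(rgb)                         # recover the payload from RGB
    loss = F.binary_cross_entropy_with_logits(pred, bits) \
         + 0.1 * torch.mean((watermarked_raw - raw) ** 2)   # concealment term
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```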

MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2307.15421
  • repo_url: https://github.com/jiangweibeta/mlic
  • paper_authors: Wei Jiang, Ronggang Wang
  • for: Improving the efficiency and quality of learned image compression through multi-reference entropy modeling.
  • methods: Captures global correlations with linear complexity by decomposing the softmax operation, yielding MLIC++, a learned image compression model with linear-complexity multi-reference entropy modeling.
  • results: Compared to VTM-17.0, MLIC++ reduces BD-rate by 12.44% on the Kodak dataset when measured in PSNR.
    Abstract Recently, multi-reference entropy model has been proposed, which captures channel-wise, local spatial, and global spatial correlations. Previous works adopt attention for global correlation capturing, however, the quadratic cpmplexity limits the potential of high-resolution image coding. In this paper, we propose the linear complexity global correlations capturing, via the decomposition of softmax operation. Based on it, we propose the MLIC$^{++}$, a learned image compression with linear complexity for multi-reference entropy modeling. Our MLIC$^{++}$ is more efficient and it reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be available at https://github.com/JiangWeibeta/MLIC.
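
One common way to obtain linear complexity by decomposing the softmax (in the spirit of the abstract, though not necessarily the paper's exact formulation) is to normalize queries and keys separately, so the small d-by-d context matrix is formed first:

```python
import torch

def linear_attention(q, k, v):
    """q, k, v: (B, N, d). Cost O(N * d^2) instead of O(N^2 * d)."""
    q = torch.softmax(q, dim=-1)        # row-wise, over the feature dimension
    k = torch.softmax(k, dim=1)         # column-wise, over the token dimension
    context = k.transpose(1, 2) @ v     # (B, d, d): global summary of values
    return q @ context                  # (B, N, d)

out = linear_attention(*(torch.randn(2, 4096, 64) for _ in range(3)))
```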

Fast Dust Sand Image Enhancement Based on Color Correction and New Membership Function

  • paper_url: http://arxiv.org/abs/2307.15230
  • repo_url: None
  • paper_authors: Ali Hakem Alsaeedi, Suha Mohammed Hadi, Yarub Alazzawi
  • for: 提高灰尘照片的质量和可见度
  • methods: 基于色彩修正和新成员函数,提出了一种新的增强灰尘照片的模型,包括三个阶段:色彩shift的 corrections、雾气去除和对比和亮度的提高
  • results: 对多个真实的灰尘照片进行测试和评估,研究表明,提出的解决方案在去除红色和黄色投影方面表现出色,并提供了高质量和量的灰尘照片
    Abstract Images captured in dusty environments suffering from poor visibility and quality. Enhancement of these images such as sand dust images plays a critical role in various atmospheric optics applications. In this work, proposed a new model based on Color Correction and new membership function to enhance san dust images. The proposed model consists of three phases: correction of color shift, removal of haze, and enhancement of contrast and brightness. The color shift is corrected using a new membership function to adjust the values of U and V in the YUV color space. The Adaptive Dark Channel Prior (A-DCP) is used for haze removal. The stretching contrast and improving image brightness are based on Contrast Limited Adaptive Histogram Equalization (CLAHE). The proposed model tests and evaluates through many real sand dust images. The experimental results show that the proposed solution is outperformed the current studies in terms of effectively removing the red and yellow cast and provides high quality and quantity dust images.
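
A hedged sketch of the three-phase pipeline with OpenCV; a simple chroma recentring stands in for the paper's membership function, and the A-DCP haze-removal stage is omitted for brevity.

```python
import cv2
import numpy as np

def enhance_dust_image(bgr):
    yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    # 1) Color-shift correction: recentre U and V around 128 to suppress the
    #    yellow/red cast (stand-in for the paper's membership function).
    for c in (1, 2):
        yuv[:, :, c] += 128.0 - yuv[:, :, c].mean()
    yuv = np.clip(yuv, 0, 255).astype(np.uint8)
    # 2) Haze removal via A-DCP would go here (omitted in this sketch).
    # 3) Contrast and brightness: CLAHE on the luma channel.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    yuv[:, :, 0] = clahe.apply(yuv[:, :, 0])
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)
```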

Generative AI for Medical Imaging: extending the MONAI Framework

  • paper_url: http://arxiv.org/abs/2307.15208
  • repo_url: https://github.com/project-monai/generativemodels
  • paper_authors: Walter H. L. Pinaya, Mark S. Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F. da Costa, Ashay Patel, Hyungjin Chung, Can Zhao, Wei Peng, Zelong Liu, Xueyan Mei, Oeslle Lucena, Jong Chul Ye, Sotirios A. Tsaftaris, Prerna Dogra, Andrew Feng, Marc Modat, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
  • for: This paper is written for researchers and developers who want to easily train, evaluate, and deploy generative models and related applications in medical imaging.
  • methods: The paper uses a variety of generative models, including diffusion models, autoregressive transformers, and GANs, and implements them in a generalizable fashion for 2D and 3D medical images with different modalities and anatomical areas.
  • results: The paper provides pre-trained models for the community and demonstrates the reproducibility of state-of-the-art studies using a standardized approach, as well as the extension of current applications to future features through a modular and extensible approach.
    Abstract Recent advances in generative AI have brought incredible breakthroughs in several areas, including medical imaging. These generative models have tremendous potential not only to help safely share medical data via synthetic datasets but also to perform an array of diverse applications, such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction. However, due to the complexity of these models, their implementation and reproducibility can be difficult. This complexity can hinder progress, act as a use barrier, and dissuade the comparison of new methods with existing works. In this study, we present MONAI Generative Models, a freely available open-source platform that allows researchers and developers to easily train, evaluate, and deploy generative models and related applications. Our platform reproduces state-of-art studies in a standardised way involving different architectures (such as diffusion models, autoregressive transformers, and GANs), and provides pre-trained models for the community. We have implemented these models in a generalisable fashion, illustrating that their results can be extended to 2D or 3D scenarios, including medical images with different modalities (like CT, MRI, and X-Ray data) and from different anatomical areas. Finally, we adopt a modular and extensible approach, ensuring long-term maintainability and the extension of current applications for future features.
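
The diffusion-model training loop that such a platform standardizes boils down to a few lines; the sketch below is plain PyTorch rather than MONAI Generative's own network and scheduler wrappers, and the model's `(noisy, t)` signature is an assumption.

```python
import torch

def ddpm_step(model, images, opt, T=1000):
    """One DDPM training step: model(noisy, t) predicts the added noise."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (images.shape[0],))
    noise = torch.randn_like(images)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * images + (1.0 - a).sqrt() * noise   # forward process
    loss = torch.mean((model(noisy, t) - noise) ** 2)      # noise prediction
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```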

Sparsity aware coding for single photon sensitive vision using Selective Sensing

  • paper_url: http://arxiv.org/abs/2307.15184
  • repo_url: None
  • paper_authors: Yizhou Lu, Trevor Seets, Ehsan Ahmadi, Felipe Gutierrez-Barragan, Andreas Velten
  • for: Improving the performance of coded imaging techniques under Poisson noise.
  • methods: Introduces selective sensing, which learns priors from training data and optimizes coding strategies for downstream classification tasks, adapting to the characteristics of photon-counting sensors.
  • results: Experiments and simulations show that selective sensing improves coding performance and overall classification accuracy in the Poisson-noise scenarios typical of photon counting.
    Abstract Optical coding has been widely adopted to improve the imaging techniques. Traditional coding strategies developed under additive Gaussian noise fail to perform optimally in the presence of Poisson noise. It has been observed in previous studies that coding performance varies significantly between these two noise models. In this work, we introduce a novel approach called selective sensing, which leverages training data to learn priors and optimizes the coding strategies for downstream classification tasks. By adapting to the specific characteristics of photon-counting sensors, the proposed method aims to improve coding performance under Poisson noise and enhance overall classification accuracy. Experimental and simulated results demonstrate the effectiveness of selective sensing in comparison to traditional coding strategies, highlighting its potential for practical applications in photon counting scenarios where Poisson noise are prevalent.
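
A toy end-to-end sketch of learning a coding matrix for a downstream classifier under Poisson noise; the Gaussian approximation N(rate, rate) keeps the noise differentiable, and the sensor model and classifier are deliberately simplistic stand-ins.

```python
import torch

code = torch.nn.Parameter(torch.rand(16, 64))        # 16 coded measurements
classifier = torch.nn.Linear(16, 10)
opt = torch.optim.Adam([code, *classifier.parameters()], lr=1e-3)

def step(scene_flux, labels):
    """scene_flux: (B, 64) non-negative photon rates; labels: (B,) classes."""
    rate = scene_flux @ torch.sigmoid(code).T        # keep the code in [0, 1]
    y = rate + rate.sqrt() * torch.randn_like(rate)  # Poisson ~ N(rate, rate)
    loss = torch.nn.functional.cross_entropy(classifier(y), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```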

cs.SD - 2023-07-27

Mitigating Cross-Database Differences for Learning Unified HRTF Representation

  • paper_url: http://arxiv.org/abs/2307.14547
  • repo_url: https://github.com/yutongwen/hrtf_field_norm
  • paper_authors: Yutong Wen, You Zhang, Zhiyao Duan
  • for: Accurate sound positioning in virtual auditory displays requires individualized head-related transfer functions (HRTFs).
  • methods: Uses machine learning models to predict individualized HRTFs, which requires a unified HRTF representation across multiple databases to exploit their individually limited samples.
  • results: Identifies consistent cross-database differences in HRTFs, attributes them to variations in the measurement setup, and proposes a frequency-response normalization after which HRTFs can no longer be classified by database, yielding a more unified representation than prior art.
    Abstract Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learning models is a promising approach at scale. Training such models require a unified HRTF representation across multiple databases to utilize their respectively limited samples. However, in addition to differences on the spatial sampling locations, recent studies have shown that, even for the common location, HRTFs across databases manifest consistent differences that make it trivial to tell which databases they come from. This poses a significant challenge for learning a unified HRTF representation across databases. In this work, we first identify the possible causes of these cross-database differences, attributing them to variations in the measurement setup. Then, we propose a novel approach to normalize the frequency responses of HRTFs across databases. We show that HRTFs from different databases cannot be classified by their database after normalization. We further show that these normalized HRTFs can be used to learn a more unified HRTF representation across databases than the prior art. We believe that this normalization approach paves the road to many data-intensive tasks on HRTF modeling.
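
A sketch of one normalization in the spirit described above: equalizing each database by its own average magnitude response (a diffuse-field-style correction) cancels setup-dependent coloration shared by all measurements in that database.

```python
import numpy as np

def normalize_database(hrtfs):
    """hrtfs: (subjects, directions, freq_bins) magnitude responses
    for one database; returns responses with the common factor removed."""
    mean_response = np.mean(np.abs(hrtfs), axis=(0, 1), keepdims=True)
    return hrtfs / (mean_response + 1e-12)
```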

Modality-Agnostic Audio-Visual Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.14491
  • repo_url: None
  • paper_authors: Cai Yu, Peng Chen, Jiahe Tian, Jin Liu, Jiao Dai, Xi Wang, Yesheng Chai, Jizhong Han
  • for: Detecting multimodal deepfakes while handling missing-modality cases where either the audio or the visual stream is unavailable.
  • methods: Proposes a fake-modality-agnostic audio-visual framework that uses audio-visual speech recognition (AVSR) as a preceding task to extract speech correlations across modalities, which are difficult for deepfakes to reproduce, plus a dual-label detection approach that supports independent real/fake decisions for each modality.
  • results: The scheme outperforms state-of-the-art binary detection methods on all three audio-visual datasets, performs well on modality-agnostic audio/video fakes, and even surpasses the joint use of two unimodal methods when a modality is missing.
    Abstract As AI-generated content (AIGC) thrives, Deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and require the co-occurrence of both modalities. However, in real-world multi-modal applications, missing modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector can not always obtain the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we propose a unified fake-modality-agnostic scenarios framework that enables the detection of multimodal deepfakes and handles missing modalities cases, no matter the manipulation hidden in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we choose audio-visual speech recognition (AVSR) as a preceding task, which effectively extracts speech correlation across modalities, which is difficult for deepfakes to reproduce. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments show that our scheme not only outperforms other state-of-the-art binary detection methods across all three audio-visual datasets but also achieves satisfying performance on detection modality-agnostic audio/video fakes. Moreover, it even surpasses the joint use of two unimodal methods in the presence of missing modality cases.
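
A minimal sketch of the dual-label idea: independent real/fake heads per modality tolerate a missing stream; the upstream audio and visual feature encoders are assumed and omitted.

```python
import torch
import torch.nn as nn

class DualLabelDetector(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.audio_head = nn.Linear(d, 2)   # real/fake for the audio track
        self.video_head = nn.Linear(d, 2)   # real/fake for the video track

    def forward(self, a_feat=None, v_feat=None):
        # Either feature may be absent; each present stream is judged alone.
        out = {}
        if a_feat is not None:
            out["audio"] = self.audio_head(a_feat)
        if v_feat is not None:
            out["video"] = self.video_head(v_feat)
        return out

det = DualLabelDetector()
preds = det(v_feat=torch.randn(4, 256))    # video-only case still works
```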

Single Channel Speech Enhancement Using U-Net Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2307.14464
  • repo_url: None
  • paper_authors: Abir Riahi, Éric Plourde
  • for: Improving the reliability of communication devices and robust speech recognition systems through energy-efficient speech enhancement.
  • methods: Uses a spiking neural network (SNN) based on a U-Net architecture, trained with surrogate-gradient-based optimization.
  • results: The energy-efficient SNN model outperforms the Intel N-DNS Challenge baseline solution and achieves acceptable performance compared to an equivalent ANN under various signal-to-noise ratios and real-world noise conditions.
    Abstract Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems. Although conventional artificial neural networks (ANN) have demonstrated remarkable performance in SE, they require significant computational power, along with high energy costs. In this paper, we propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture. SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware. As such, SNNs are thus interesting candidates for real-time applications on devices with limited resources. The primary objective of the current work is to develop an SNN-based model with comparable performance to a state-of-the-art ANN model for SE. We train a deep SNN using surrogate-gradient-based optimization and evaluate its performance using perceptual objective tests under different signal-to-noise ratios and real-world noise conditions. Our results demonstrate that the proposed energy-efficient SNN model outperforms the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge) baseline solution and achieves acceptable performance compared to an equivalent ANN model.
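
The key ingredient of surrogate-gradient training is a spiking nonlinearity with a smooth backward pass; the sketch below shows a leaky integrate-and-fire layer of the kind such an SNN U-Net would stack (illustrative, not the paper's architecture).

```python
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, mem):
        ctx.save_for_backward(mem)
        return (mem > 1.0).float()                     # fire at threshold 1
    @staticmethod
    def backward(ctx, grad):
        mem, = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + 10.0 * (mem - 1.0).abs()) ** 2
        return grad * surrogate                        # smooth stand-in gradient

def lif_forward(inputs, beta=0.9):
    """inputs: (T, B, C) input current over T timesteps; returns spike trains."""
    mem, spikes = torch.zeros_like(inputs[0]), []
    for x in inputs:
        mem = beta * mem + x                           # leaky integration
        s = SpikeFn.apply(mem)
        mem = mem - s                                  # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes)
```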

WavJourney: Compositional Audio Creation with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.14335
  • repo_url: https://github.com/audio-agi/wavjourney
  • paper_authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
  • for: Creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions.
  • methods: Uses LLMs to connect various audio models: given a text description of an auditory scene, the LLM first generates a structured audio script that organizes diverse audio elements by their spatio-temporal relationships; a script compiler then converts it into a computer program whose lines call task-specific audio generation models or computational operations (e.g., concatenate, mix), and executing the program yields the audio.
  • results: Demonstrates practicality across diverse real-world scenarios, including science fiction, education, and radio plays; the explainable and interactive design supports human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production.
    Abstract Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
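
A toy version of the script-compiler step: each scripted element dispatches to a generation model or a computational operation. The operation names and stub backends are placeholders, not WavJourney's real API.

```python
import numpy as np

def tts(text):            # stub standing in for a speech-synthesis model
    return np.zeros(16000)

def text_to_audio(text):  # stub standing in for a sound-effect/music model
    return np.zeros(16000)

OPS = {
    "speech": tts,
    "sfx": text_to_audio,
    "mix": lambda *tracks: np.mean(tracks, axis=0),
    "concat": lambda *tracks: np.concatenate(tracks),
}

def execute(script):
    """script: list of (op, arg) tuples derived from the LLM's audio script."""
    stack = []
    for op, arg in script:
        if op in ("mix", "concat"):
            stack = [OPS[op](*stack)]     # combine everything produced so far
        else:
            stack.append(OPS[op](arg))
    return stack[0]

audio = execute([("speech", "A storm approaches."),
                 ("sfx", "thunder rumbling"),
                 ("mix", None)])
```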

eess.AS - 2023-07-27

Audio Inputs for Active Speaker Detection and Localization via Microphone Array

  • paper_url: http://arxiv.org/abs/2307.14739
  • repo_url: None
  • paper_authors: Davide Berghi, Philip J. B. Jackson
  • for: Studying the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array, referred to as active speaker detection and localization (ASDL).
  • methods: Uses a convolutional recurrent neural network (CRNN) fed with spatial acoustic features extracted from the multichannel audio, comparing GCC-PHAT, SALSA features, and a recently proposed beamforming method across different channel counts, array apertures, and additive-noise intensities.
  • results: Experiments and tests of statistical significance on the TragicTalkers dataset demonstrate the microphones' contribution to performance and the features' robustness to various noise levels.
    Abstract This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently-proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance demonstrate the microphones' contribution to performance on the TragicTalkers dataset, which offers opportunities to investigate audio-visual approaches in the future.
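
For reference, the first of the compared input features, GCC-PHAT, has a standard textbook implementation:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the inter-channel time delay between two microphone signals."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # phase transform weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau, cc
```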

Physics-informed neural network for head-related transfer function upsampling

  • paper_url: http://arxiv.org/abs/2307.14650
  • repo_url: https://github.com/feima0011/physics-informed-neural-network-for-head-related-transfer-function-upsampling
  • paper_authors: Fei Ma, Thushara D. Abhayapala, Prasanga N. Samarasinghe, Xingyu Chen
  • for: Improving the realism of virtual auditory experiences by upsampling HRTFs with a physics-informed neural network (PINN).
  • methods: A PINN that exploits the Helmholtz equation as additional information to constrain the upsampling process, with the network's width and depth set according to the dimensionality of HRTFs under spherical harmonic (SH) decomposition, avoiding under-fitting and over-fitting.
  • results: Numerical experiments on multiple datasets show the PINN method outperforms SH methods in both interpolation and extrapolation scenarios.
    Abstract Head-related transfer functions (HRTFs) capture the spatial and spectral features that a person uses to localize sound sources in space and thus are vital for creating an authentic virtual acoustic experience. However, practical HRTF measurement systems can only provide an incomplete measurement of a person's HRTFs, and this necessitates HRTF upsampling. This paper proposes a physics-informed neural network (PINN) method for HRTF upsampling. Unlike other upsampling methods which are based on the measured HRTFs only, the PINN method exploits the Helmholtz equation as additional information for constraining the upsampling process. This helps the PINN method to generate physically amiable upsamplings which generalize beyond the measured HRTFs. Furthermore, the width and the depth of the PINN are set according to the dimensionality of HRTFs under spherical harmonic (SH) decomposition and the Helmholtz equation. This makes the PINN have an appropriate level of expressiveness and thus does not suffer from under-fitting and over-fitting problems. Numerical experiments confirm the superior performance of the PINN method for HRTF upsampling in both interpolation and extrapolation scenarios over several datasets in comparison with the SH methods.
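
The physics-informed term amounts to penalizing the Helmholtz residual, $\nabla^2 p + k^2 p = 0$, at collocation points alongside the data-fitting loss on measured HRTFs; a minimal autograd sketch (the network and sampling are assumptions):

```python
import torch

def helmholtz_residual(net, xyz, k):
    """net maps (N, 3) coordinates to a scalar pressure component."""
    xyz = xyz.requires_grad_(True)
    p = net(xyz)
    grad = torch.autograd.grad(p.sum(), xyz, create_graph=True)[0]
    lap = 0.0
    for d in range(3):                     # Laplacian = trace of the Hessian
        lap = lap + torch.autograd.grad(
            grad[:, d].sum(), xyz, create_graph=True)[0][:, d]
    return ((lap + (k ** 2) * p.squeeze(-1)) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
loss_pde = helmholtz_residual(net, torch.rand(128, 3),
                              k=2 * torch.pi * 1000 / 343)  # 1 kHz in air
```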

NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals

  • paper_url: http://arxiv.org/abs/2307.14303
  • repo_url: None
  • paper_authors: Zexu Pan, Marvin Borsdorf, Siqi Cai, Tanja Schultz, Haizhou Li
  • for: Developing an EEG-based selective auditory attention model that extracts the attended speaker's speech from a noisy multi-talker (cocktail party) mixture.
  • methods: Uses EEG signals to establish a neuronal attractor that is temporally associated with the speech stimulus; both offline and online variants are proposed, with the online NeuroHeed adding an autoregressive speaker encoder that accumulates past extracted speech for self-enrollment of the attended speaker, retaining attentional momentum over time.
  • results: Experiments show NeuroHeed effectively extracts brain-attended speech with high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.
    Abstract Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities, which the latter can be measured using affordable and non-intrusive electroencephalography (EEG) devices. In this study, we present NeuroHeed, a speaker extraction model that leverages EEG signals to establish a neuronal attractor which is temporally associated with the speech stimulus, facilitating the extraction of the attended speech signal in a cocktail party scenario. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results demonstrate that NeuroHeed effectively extracts brain-attended speech signals, achieving high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.
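
A schematic of the neuro-steering idea: an EEG encoder yields an attractor embedding that modulates a speech-extraction network. FiLM-style conditioning and the toy convolutional front-end are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NeuroSteeredExtractor(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.eeg_enc = nn.GRU(64, d, batch_first=True)   # 64 EEG channels
        self.film = nn.Linear(d, 2 * d)                  # attractor -> (gamma, beta)
        self.mix_enc = nn.Conv1d(1, d, 16, stride=8)
        self.mask = nn.Conv1d(d, d, 1)
        self.dec = nn.ConvTranspose1d(d, 1, 16, stride=8)

    def forward(self, mixture, eeg):
        # mixture: (B, 1, T) audio; eeg: (B, T_eeg, 64)
        _, h = self.eeg_enc(eeg)
        gamma, beta = self.film(h[-1]).chunk(2, dim=-1)  # EEG-derived attractor
        feats = self.mix_enc(mixture)
        feats = gamma.unsqueeze(-1) * feats + beta.unsqueeze(-1)
        return self.dec(torch.sigmoid(self.mask(feats)) * feats)

model = NeuroSteeredExtractor()
est = model(torch.randn(2, 1, 16000), torch.randn(2, 200, 64))
```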