cs.SD - 2023-09-22

Massive End-to-end Models for Short Search Queries

  • paper_url: http://arxiv.org/abs/2309.12963
  • repo_url: None
  • paper_authors: Weiran Wang, Rohit Prabhavalkar, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li, James Qin, Xingyu Cai, Adam Stooke, Zhong Meng, CJ Zheng, Yanzhang He, Tara Sainath, Pedro Moreno Mengibar
  • for: investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries.
  • methods: use the neural architecture of Google’s universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference.
  • results: despite the speculation that larger CTC models can perform as well as RNN-T models, the authors observe that a 900M RNN-T model outperforms a 1.8B CTC model and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
    Abstract In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
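    The funnel pooling layers mentioned above reduce the encoder frame rate to speed up training and inference. Below is a minimal sketch of time pooling between encoder blocks; the pooling type (average), stride, and placement are illustrative assumptions, not the USM paper's exact design.

```python
import torch
import torch.nn as nn

class FunnelTimePooling(nn.Module):
    """Pool an encoder sequence over time to cut its frame rate.

    A minimal sketch: average-pools every `stride` frames so that all later
    encoder blocks and the decoder process a shorter sequence.
    """
    def __init__(self, stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); AvgPool1d pools the last axis, so transpose
        x = x.transpose(1, 2)        # (batch, dim, time)
        x = self.pool(x)             # (batch, dim, time // stride)
        return x.transpose(1, 2)     # (batch, time // stride, dim)

# 4x total time reduction from two 2x pooling stages between encoder blocks
feats = torch.randn(8, 1600, 512)    # e.g. ~16 s of 10 ms frames
reduced = FunnelTimePooling(2)(FunnelTimePooling(2)(feats))
print(reduced.shape)                 # torch.Size([8, 400, 512])
```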

VIC-KD: Variance-Invariance-Covariance Knowledge Distillation to Make Keyword Spotting More Robust Against Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2309.12914
  • repo_url: None
  • paper_authors: Heitor R. Guimarães, Arthur Pimentel, Anderson Avila, Tiago H. Falk
  • for: propose a robust distillation recipe for model compression that also improves adversarial robustness.
  • methods: use self-supervised speech representations and impose geometric priors on the latent representations of both the Teacher and Student models, leading to more robust target models.
  • results: experiments show the proposed method improves upon the current state-of-the-art robust distillation methods ARD and RSLAD by 12% and 8% in robust accuracy, respectively.
    Abstract Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
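    The title suggests VICReg-style variance-invariance-covariance terms imposed on the teacher and student embeddings. A sketch of such a loss follows; the term weights `w_inv`, `w_var`, `w_cov` and the exact pairing of terms are assumptions, not VIC-KD's published recipe.

```python
import torch
import torch.nn.functional as F

def vic_terms(z: torch.Tensor, eps: float = 1e-4):
    """Variance and covariance regularizers on a batch of embeddings z: (N, D)."""
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std).mean()            # keep per-dim std >= 1
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / d           # decorrelate dimensions
    return var_loss, cov_loss

def vic_kd_loss(teacher_z, student_z, w_inv=25.0, w_var=25.0, w_cov=1.0):
    inv = F.mse_loss(student_z, teacher_z)         # invariance: match the teacher
    var_t, cov_t = vic_terms(teacher_z)
    var_s, cov_s = vic_terms(student_z)
    return w_inv * inv + w_var * (var_t + var_s) + w_cov * (cov_t + cov_s)
```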

DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2309.12792
  • repo_url: None
  • paper_authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su
  • for: propose an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis.
  • methods: adopt the auto-regressive structure of the original DurIAN, use multiple stacked SwishRNN-based Transformer blocks as linguistic encoders, and add Style-Adaptive Instance Normalization (SAIN) layers to the frame-level encoders to improve expressiveness; a denoiser combining a DDPM for mel-spectrograms with SAIN modules further improves speech quality.
  • results: experimental results show that the proposed expressive TTS model outperforms state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
    Abstract This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are exploited into frame-level encoders to improve the modeling ability of expressiveness. A denoiser incorporating both denoising diffusion probabilistic model (DDPM) for mel-spectrograms and SAIN modules is conducted to further improve the synthetic speech quality and expressiveness. Experimental results prove that the proposed expressive TTS model in this paper can achieve better performance than the state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
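    A sketch of a Style-Adaptive Instance Normalization layer as described: instance normalization whose scale and shift are predicted from a style embedding. The single-linear-layer affine predictor and the (batch, channels, time) layout are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SAIN(nn.Module):
    """Style-Adaptive Instance Normalization (sketch).

    Instance-normalizes frame-level features, then applies a scale and shift
    predicted from a style embedding, injecting expressiveness per utterance.
    """
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        x = self.norm(x)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

out = SAIN(80, 128)(torch.randn(4, 80, 200), torch.randn(4, 128))
```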

A Study on Incorporating Whisper for Robust Speech Assessment

  • paper_url: http://arxiv.org/abs/2309.12766
  • repo_url: None
  • paper_authors: Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh
  • for: propose an enhanced multi-objective speech assessment model, MOSA-Net+, which creates embedding features from the acoustic representations of Whisper, a large-scale, weakly supervised pre-trained model.
  • methods: the first part of the study investigates how strongly the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores; the second part evaluates Whisper's effectiveness for building a more robust speech assessment model; the third part analyzes combining Whisper and SSL representations within MOSA-Net+.
  • results: Whisper's embedding features correlate more strongly with subjective quality and intelligibility scores, improving MOSA-Net+'s prediction performance; combining Whisper and SSL representations yields only marginal gains. Compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ achieves notable improvements across all evaluation metrics and obtained the top rank on Track 3 of the VoiceMOS Challenge 2023.
    Abstract This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, by leveraging the acoustic features from large pre-trained weakly supervised models, namely Whisper, to create embedding features. The first part of this study investigates the correlation between the embedding features of Whisper and two self-supervised learning (SSL) models with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than other SSL's embedding features, contributing to more accurate prediction performance achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and SSL models only leads to marginal improvement. As compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
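    A sketch of extracting utterance-level embedding features from Whisper's encoder with the Hugging Face `transformers` API; the `openai/whisper-base` checkpoint and mean-pooling over time are assumptions, not necessarily what MOSA-Net+ uses.

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def whisper_embedding(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    """Return a fixed-size utterance embedding from Whisper's encoder."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model.encoder(inputs.input_features).last_hidden_state
    return hidden.mean(dim=1)  # (1, d_model): average-pool over time

emb = whisper_embedding(np.random.randn(16000))  # 1 s of dummy audio
print(emb.shape)  # torch.Size([1, 512]) for whisper-base
```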

CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers

  • paper_url: http://arxiv.org/abs/2309.12672
  • repo_url: None
  • paper_authors: Xintong Wang, Chang Zeng, Jun Chen, Chunhui Wang
  • for: build a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability using only monolingual singers at training time.
  • methods: build on Xiaoicesing2 and use the International Phonetic Alphabet to unify the representation of all languages in the training data; apply conditional layer normalization to inject language information for better pronunciation when singers meet unseen languages; use a gradient reversal layer (GRL) to remove singer biases from the lyrics, since all singers are monolingual and singer identity is therefore implicitly associated with the text.
  • results: experiments show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switch cases.
    Abstract It is challenging to build a multi-singer high-fidelity singing voice synthesis system with cross-lingual ability by only using monolingual singers in the training stage. In this paper, we propose CrossSinger, which is a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we utilize International Phonetic Alphabet to unify the representation for all languages of the training data. Moreover, we leverage conditional layer normalization to incorporate the language information into the model for better pronunciation when singers meet unseen languages. Additionally, gradient reversal layer (GRL) is utilized to remove singer biases included in lyrics since all singers are monolingual, which indicates singer's identity is implicitly associated with the text. The experiment is conducted on a combination of three singing voice datasets containing Japanese Kiritan dataset, English NUS-48E dataset, and one internal Chinese dataset. The result shows CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switch cases.
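    The gradient reversal layer used to strip singer identity from the text features is a standard construction: identity in the forward pass, negated gradient in the backward pass. A minimal PyTorch sketch follows; the `singer_classifier` in the usage comment is hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scaled, negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage: features -> GRL -> singer classifier. The reversed gradient trains
# the encoder to *remove* singer identity, because making the downstream
# singer classifier succeed now increases the encoder's loss.
# singer_logits = singer_classifier(grad_reverse(text_features))
```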

NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization

  • paper_url: http://arxiv.org/abs/2309.12656
  • repo_url: None
  • paper_authors: Naohiro Tawara, Marc Delcroix, Atsushi Ando, Atsunori Ogawa
  • for: describe a speaker diarization system designed for multi-domain, multi-microphone casual conversations.
  • methods: the pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately, and integrates the per-channel results with diarization output voting error reduction plus overlap (DOVER-LAP); each session is further adapted in a self-supervised manner by retraining EEND-VC on pseudo-labels derived from DOVER-LAP.
  • results: submitted as part of NTT's entry to the distant automatic speech recognition task of the CHiME-7 challenge, the system achieved 65% and 62% relative improvements on the development and evaluation sets over the organizer-provided VC-based baseline, securing third place in diarization performance.
    Abstract This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65 % and 62 % relative improvements on development and eval sets compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.

SPGM: Prioritizing Local Features for enhanced speech separation performance

  • paper_url: http://arxiv.org/abs/2309.12608
  • repo_url: None
  • paper_authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma
  • for: improve the performance of speech separation models such as Sepformer while reducing the parameter count.
  • methods: replace the inter-blocks with a Single-Path Global Modulation (SPGM) block, consisting of a parameter-free global pooling module followed by a modulation module that accounts for only 2% of the model's total parameters, making the overall model single-path.
  • results: SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding Sepformer by 0.5 dB and 0.3 dB respectively, and matches the performance of recent SOTA models with up to 8 times fewer parameters.
    Abstract Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters.
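    A rough sketch of what a global-pooling-plus-modulation block could look like; the mean/std pooling statistics and the sigmoid-gated modulation here are assumptions for illustration, not SPGM's published internals.

```python
import torch
import torch.nn as nn

class GlobalModulationBlock(nn.Module):
    """Global pooling + modulation (sketch, SPGM-style).

    A parameter-free global pooling summarizes the whole sequence; a small
    modulation module turns that summary into per-channel gates applied back
    onto the local features, standing in for heavy inter-block attention.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.modulation = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        g = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)  # parameter-free pooling
        gate = self.modulation(g).unsqueeze(1)                # (batch, 1, dim)
        return x * gate

out = GlobalModulationBlock(256)(torch.randn(2, 1000, 256))
```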

ICASSP 2023 Acoustic Echo Cancellation Challenge

  • paper_url: http://arxiv.org/abs/2309.12553
  • repo_url: https://github.com/microsoft/AEC-Challenge
  • paper_authors: Ross Cutler, Ando Saabas, Tanel Parnamaa, Marju Purin, Evgenii Indenbom, Nicolae-Catalin Ristea, Jegor Gužvin, Hannes Gamper, Sebastian Braun, Robert Aichner
  • for: stimulate research in acoustic echo cancellation (AEC), an important area of speech enhancement that remains a top issue in audio communication.
  • methods: this fourth AEC challenge adds a second track for personalized acoustic echo cancellation, reduces the algorithmic + buffering latency to 20 ms, and includes a full-band version of AECMOS.
  • results: two large training datasets are open-sourced, covering single-talk and double-talk scenarios, with recordings from more than 10,000 real audio devices and human speakers in real environments plus a synthetic dataset; an online subjective test framework and an objective metric are also released, and winners were selected based on the average mean opinion score (MOS) across all scenarios and the word accuracy (WAcc) rate.
    Abstract The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.

cs.CV - 2023-09-22

ClusterFormer: Clustering As A Universal Visual Learner

  • paper_url: http://arxiv.org/abs/2309.13196
  • repo_url: https://github.com/clusterformer/clusterformer
  • paper_authors: James C. Liang, Yiming Cui, Qifan Wang, Tong Geng, Wenguan Wang, Dongfang Liu
  • for: propose CLUSTERFORMER, a universal vision model based on the CLUSTERing paradigm with TransFORMER, applicable to heterogeneous vision tasks such as image classification, object detection, and image segmentation.
  • methods: two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, yielding a transparent pipeline.
  • results: empirical results show that CLUSTERFORMER outperforms various well-known specialized architectures across tasks with varying levels of clustering granularity (image-, box-, and pixel-level).
    Abstract This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
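    Recurrent cross-attention clustering can be pictured as an EM-like loop in which cluster centers attend over features and are replaced by the attention-weighted mean. A simplified sketch without learned query/key/value projections; the temperature `tau`, iteration count, and random initialization are assumptions.

```python
import torch
import torch.nn.functional as F

def recurrent_cross_attention_clustering(feats, centers, iters=3, tau=0.07):
    """Iteratively refine cluster centers by cross-attending to features.

    feats: (N, D) image features; centers: (K, D) initial cluster centers.
    Each iteration, centers act as queries and features as keys/values; the
    attention-weighted average of features becomes the updated center.
    """
    for _ in range(iters):
        attn = F.softmax(centers @ feats.T / tau, dim=-1)  # (K, N)
        centers = attn @ feats                             # (K, D)
    assign = (centers @ feats.T).argmax(dim=0)             # hard label per feature
    return centers, assign

feats = torch.randn(4096, 256)              # e.g. flattened pixel embeddings
centers = feats[torch.randperm(4096)[:8]]   # 8 random initial centers
centers, assign = recurrent_cross_attention_clustering(feats, centers)
```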

Spatial-frequency channels, shape bias, and adversarial robustness

  • paper_url: http://arxiv.org/abs/2309.13190
  • repo_url: https://github.com/ajaysub110/critical-band-masking
  • paper_authors: Ajay Subramanian, Elena Sizikova, Najib J. Majaj, Denis G. Pelli
  • for: explore what spatial-frequency information humans and neural networks use to recognize objects.
  • methods: use critical band masking, an established neuroscience tool that reveals the frequency-selective filters ("channels") used for object recognition by measuring the sensitivity of recognition performance to noise added at each spatial frequency; 14 humans and 76 neural networks are tested on 16-way ImageNet categorization in the presence of narrowband noise.
  • results: humans recognize objects in natural images using the same one-octave-wide channel they use for letters and gratings, making it a canonical feature of human object recognition; the network channel, across architectures and training strategies, is 2-4 times as wide, so networks are vulnerable to high- and low-frequency noise that does not affect humans. Three channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with the robustness of adversarially trained networks (74% variance explained); adversarial training increases robustness but pushes the channel bandwidth even further from the human bandwidth.
    Abstract What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel'') that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. On the other hand, the neural network channel, across various architectures and training strategies, is 2-4 times as wide as the human channel. In other words, networks are vulnerable to high and low frequency noise that does not affect human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with robustness of adversarially-trained networks (74% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further away from the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only increases this difference.
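    Critical band masking adds noise restricted to a narrow spatial-frequency band. A sketch of generating such narrowband noise with an FFT-domain annular mask; the pixels-per-degree value, noise RMS, and final clipping are illustrative assumptions.

```python
import numpy as np

def add_narrowband_noise(img, center_cpd, bandwidth_oct=1.0, ppd=32.0, rms=0.1):
    """Add noise confined to a one-octave spatial-frequency band (sketch).

    img: 2D array in [0, 1]; center_cpd: band center in cycles per degree;
    ppd: pixels per degree of visual angle.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None] * ppd        # cycles/degree along y
    fx = np.fft.fftfreq(w)[None, :] * ppd        # cycles/degree along x
    r = np.hypot(fx, fy)
    lo = center_cpd * 2 ** (-bandwidth_oct / 2)
    hi = center_cpd * 2 ** (bandwidth_oct / 2)
    mask = (r >= lo) & (r < hi)                  # annular band-pass mask
    noise = np.fft.ifft2(np.fft.fft2(np.random.randn(h, w)) * mask).real
    noise *= rms / (noise.std() + 1e-8)
    return np.clip(img + noise, 0.0, 1.0)

noisy = add_narrowband_noise(np.random.rand(224, 224), center_cpd=4.0)
```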

Flow Factorized Representation Learning

  • paper_url: http://arxiv.org/abs/2309.13167
  • repo_url: https://github.com/kingjamessong/latent-flow
  • paper_authors: Yue Song, T. Anderson Keller, Nicu Sebe, Max Welling
  • for: learn representations that are factorized in a useful manner with respect to the ground-truth factors of variation.
  • methods: propose Flow Factorized Representation Learning, a generative model that specifies a distinct set of latent probability paths defining different input transformations; each latent flow is generated by the gradient field of a learned potential following dynamic optimal transport.
  • results: the model achieves higher likelihoods on standard representation learning benchmarks while being closer to approximately equivariant models; the learned transformations are flexibly composable and extrapolate to new data, indicating robustness and generalizability.
    Abstract A prominent goal of representation learning research is to achieve representations which are factorized in a useful manner with respect to the ground truth factors of variation. The fields of disentangled and equivariant representation learning have approached this ideal from a range of complimentary perspectives; however, to date, most approaches have proven to either be ill-specified or insufficiently flexible to effectively separate all realistic factors of interest in a learned latent space. In this work, we propose an alternative viewpoint on such structured representation learning which we call Flow Factorized Representation Learning, and demonstrate it to learn both more efficient and more usefully structured representations than existing frameworks. Specifically, we introduce a generative model which specifies a distinct set of latent probability paths that define different input transformations. Each latent flow is generated by the gradient field of a learned potential following dynamic optimal transport. Our novel setup brings new understandings to both \textit{disentanglement} and \textit{equivariance}. We show that our model achieves higher likelihoods on standard representation learning benchmarks while simultaneously being closer to approximately equivariant models. Furthermore, we demonstrate that the transformations learned by our model are flexibly composable and can also extrapolate to new data, implying a degree of robustness and generalizability approaching the ultimate goal of usefully factorized representation learning.
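    The core mechanism, latents advected along the gradient field of a learned scalar potential, can be sketched with a few Euler steps and autograd. The potential network, step size, and step count below are toy assumptions, and the paper's dynamic-optimal-transport training is omitted.

```python
import torch
import torch.nn as nn

# One learned potential defines one "transformation path" through latent space.
potential = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))

def flow(z: torch.Tensor, steps: int = 10, dt: float = 0.1) -> torch.Tensor:
    """Transport latents along the gradient field of a learned potential.

    Each Euler step moves z in the direction grad_z u(z).
    """
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        u = potential(z).sum()
        (grad,) = torch.autograd.grad(u, z)
        z = z + dt * grad
    return z.detach()

z0 = torch.randn(4, 16)
z1 = flow(z0)   # latents advected along this potential's flow
```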

Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations

  • paper_url: http://arxiv.org/abs/2309.13150
  • repo_url: None
  • paper_authors: Hanjiang Hu, Zuxin Liu, Linyi Li, Jiacheng Zhu, Ding Zhao
  • for: certify the robustness of deep-learning-based visual perception models against camera motion perturbations.
  • methods: a novel, efficient, and practical framework that applies a smoothing distribution over the 2D pixel space instead of the 3D physical space, eliminating the cost of camera motion sampling; projection errors are fully upper-bounded via uniform partitioning of the camera motion space, and the framework extends, through Lipschitz-based approximated partition intervals, to the more general case where only a single-frame point cloud is available in the projection oracle.
  • results: extensive experiments validate the trade-off between effectiveness and efficiency; the method achieves approximately 80% certified accuracy while using only 30% of the projected image frames.
    Abstract In recent years, computer vision has made remarkable advancements in autonomous driving and robotics. However, it has been observed that deep learning-based visual perception models lack robustness when faced with camera motion perturbations. The current certification process for assessing robustness is costly and time-consuming due to the extensive number of image projections required for Monte Carlo sampling in the 3D camera motion space. To address these challenges, we present a novel, efficient, and practical framework for certifying the robustness of 3D-2D projective transformations against camera motion perturbations. Our approach leverages a smoothing distribution over the 2D pixel space instead of in the 3D physical space, eliminating the need for costly camera motion sampling and significantly enhancing the efficiency of robustness certifications. With the pixel-wise smoothed classifier, we are able to fully upper bound the projection errors using a technique of uniform partitioning in camera motion space. Additionally, we extend our certification framework to a more general scenario where only a single-frame point cloud is required in the projection oracle. This is achieved by deriving Lipschitz-based approximated partition intervals. Through extensive experimentation, we validate the trade-off between effectiveness and efficiency enabled by our proposed method. Remarkably, our approach achieves approximately 80% certified accuracy while utilizing only 30% of the projected image frames.
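    For intuition, here is the generic prediction step of a randomized-smoothing classifier operating in pixel space: a majority vote over Gaussian-perturbed copies of the input. This is standard background, not the paper's camera-motion-specific certification or its uniform-partitioning bound.

```python
import torch

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n=100, num_classes=10):
    """Majority-vote prediction of a pixel-space smoothed classifier.

    Adds i.i.d. Gaussian noise to image x (C, H, W) n times and returns the
    most frequent class; certified bounds build on this vote's margin.
    """
    counts = torch.zeros(num_classes, dtype=torch.long)
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)
        pred = model(noisy.unsqueeze(0)).argmax(dim=1)
        counts[pred.item()] += 1
    return counts.argmax().item(), counts
```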

Trading-off Mutual Information on Feature Aggregation for Face Recognition

  • paper_url: http://arxiv.org/abs/2309.13137
  • repo_url: None
  • paper_authors: Mohammad Akyash, Ali Zafari, Nasser M. Nasrabadi
  • for: improve face recognition accuracy.
  • methods: aggregate the outputs of two state-of-the-art deep face recognition models, ArcFace and AdaFace, using a transformer attention mechanism augmented to capture both local and global dependencies between the two feature maps; the principle of Information Bottleneck is leveraged to obtain a maximally informative facial representation and suppress the redundancy introduced by the models' similar backbones.
  • results: experiments on popular benchmarks show consistent improvements over state-of-the-art algorithms on face recognition tasks.
    Abstract Despite the advances in the field of Face Recognition (FR), the precision of these methods is not yet sufficient. To improve the FR performance, this paper proposes a technique to aggregate the outputs of two state-of-the-art (SOTA) deep FR models, namely ArcFace and AdaFace. In our approach, we leverage the transformer attention mechanism to exploit the relationship between different parts of two feature maps. By doing so, we aim to enhance the overall discriminative power of the FR system. One of the challenges in feature aggregation is the effective modeling of both local and global dependencies. Conventional transformers are known for their ability to capture long-range dependencies, but they often struggle with modeling local dependencies accurately. To address this limitation, we augment the self-attention mechanism to capture both local and global dependencies effectively. This allows our model to take advantage of the overlapping receptive fields present in corresponding locations of the feature maps. However, fusing two feature maps from different FR models might introduce redundancies to the face embedding. Since these models often share identical backbone architectures, the resulting feature maps may contain overlapping information, which can mislead the training process. To overcome this problem, we leverage the principle of Information Bottleneck to obtain a maximally informative facial representation. This ensures that the aggregated features retain the most relevant and discriminative information while minimizing redundant or misleading details. To evaluate the effectiveness of our proposed method, we conducted experiments on popular benchmarks and compared our results with state-of-the-art algorithms. The consistent improvement we observed in these benchmarks demonstrates the efficacy of our approach in enhancing FR performance.

Understanding Calibration of Deep Neural Networks for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2309.13132
  • repo_url: None
  • paper_authors: Abhishek Singh Sambyal, Usma Niyaz, Narayanan C. Krishnan, Deepti R. Bathula
  • for: study model calibration of deep neural networks in medical image classification, where accurate and reliable (well-calibrated) predictions are crucial.
  • methods: a comprehensive empirical study of model performance and calibration under different training regimes, comparing fully supervised training with rotation-based self-supervised pretraining (with and without transfer learning) across various datasets and architecture sizes, using multiple calibration metrics.
  • results: factors such as weight distributions and the similarity of learned representations correlate with the observed calibration trends; models trained with rotation-based self-supervised pretraining exhibit significantly better calibration while achieving comparable or even superior performance to fully supervised models across medical imaging datasets.
    Abstract In the field of medical image analysis, achieving high accuracy is not enough; ensuring well-calibrated predictions is also crucial. Confidence scores of a deep neural network play a pivotal role in explainability by providing insights into the model's certainty, identifying cases that require attention, and establishing trust in its predictions. Consequently, the significance of a well-calibrated model becomes paramount in the medical imaging domain, where accurate and reliable predictions are of utmost importance. While there has been a significant effort towards training modern deep neural networks to achieve high accuracy on medical imaging tasks, model calibration and factors that affect it remain under-explored. To address this, we conducted a comprehensive empirical study that explores model performance and calibration under different training regimes. We considered fully supervised training, which is the prevailing approach in the community, as well as rotation-based self-supervised method with and without transfer learning, across various datasets and architecture sizes. Multiple calibration metrics were employed to gain a holistic understanding of model calibration. Our study reveals that factors such as weight distributions and the similarity of learned representations correlate with the calibration trends observed in the models. Notably, models trained using rotation-based self-supervised pretrained regime exhibit significantly better calibration while achieving comparable or even superior performance compared to fully supervised models across different medical imaging datasets. These findings shed light on the importance of model calibration in medical image analysis and highlight the benefits of incorporating self-supervised learning approach to improve both performance and calibration.
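    A standard calibration metric such studies rely on is the Expected Calibration Error (ECE); the paper's exact metric set is not listed in this summary, so the following is a generic sketch.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: average |accuracy - confidence| over equal-width confidence bins.

    confidences: (N,) max softmax probabilities; correct: (N,) 0/1 outcomes.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight each bin by its population
    return ece

conf = np.random.uniform(0.5, 1.0, 1000)
corr = (np.random.uniform(size=1000) < conf).astype(float)
print(expected_calibration_error(conf, corr))
```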

Robotic Offline RL from Internet Videos via Value-Function Pre-Training

  • paper_url: http://arxiv.org/abs/2309.13041
  • repo_url: None
  • paper_authors: Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar
  • for: enable robotic offline reinforcement learning (RL) to benefit from large-scale human video data, which lacks the action and reward annotations RL methods need.
  • methods: a system (V-PTR) that pre-trains on video by learning value functions via temporal-difference learning, then combines this with robotic offline RL approaches that train on diverse robot data.
  • results: on several manipulation tasks on a real WidowX robot, the framework produces value functions and policies that perform better, act more robustly, and generalize more broadly than prior methods.
    Abstract Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/
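    A generic TD(0) value-function update on video frame embeddings, for intuition. The 512-d embeddings, network sizes, discount, and reward definition are assumptions, not V-PTR's actual objective.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-4)
gamma = 0.98

def td_update(phi_t, phi_t1, reward):
    """One TD(0) step on consecutive frame embeddings phi_t, phi_t1.

    V(s_t) chases r + gamma * V(s_{t+1}); on action-free video, the reward
    could e.g. mark task completion in the final frames (an assumption here).
    """
    with torch.no_grad():
        target = reward + gamma * value_net(phi_t1)
    loss = (value_net(phi_t) - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

td_update(torch.randn(32, 512), torch.randn(32, 512), torch.zeros(32, 1))
```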

NeRRF: 3D Reconstruction and View Synthesis for Transparent and Specular Objects with Neural Refractive-Reflective Fields

  • paper_url: http://arxiv.org/abs/2309.13039
  • repo_url: https://github.com/dawning77/nerrf
  • paper_authors: Xiaoxue Chen, Junchen Liu, Hao Zhao, Guyue Zhou, Ya-Qin Zhang
  • for: address image-based view synthesis for transparent and specular objects, which NeRF cannot handle because its straight rays fail to model the light-path changes caused by refraction and reflection.
  • methods: introduce the refractive-reflective field: taking the object silhouette as input, marching tetrahedra with progressive encoding reconstruct the geometry of non-Lambertian objects, and refraction and reflection are modeled in a unified framework using Fresnel terms; a virtual cone supersampling technique provides efficient and effective anti-aliasing.
  • results: benchmarks on different shapes, backgrounds, and Fresnel terms on both real-world and synthetic datasets, plus qualitative and quantitative evaluation of rendering results for editing applications including material editing, object replacement/insertion, and environment illumination estimation.
    Abstract Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and A/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Codes and data are publicly available at https://github.com/dawning77/NeRRF.
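    Modeling refraction requires bending rays at the surface via Snell's law. A minimal sketch of the standard refracted-direction formula; the NeRRF pipeline around it (e.g. Fresnel-weighted blending of reflection and refraction) is not shown.

```python
import numpy as np

def refract(d, n, eta):
    """Refract unit direction d at a surface with unit normal n (Snell's law).

    eta = n_incident / n_transmitted; returns None on total internal reflection,
    in which case the ray should be reflected instead.
    """
    cos_i = -np.dot(n, d)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

d = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])  # 45-degree incoming ray
n = np.array([0.0, 0.0, 1.0])
print(refract(d, n, 1.0 / 1.5))  # air -> glass bends the ray toward the normal
```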

Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception?

  • paper_url: http://arxiv.org/abs/2309.13038
  • repo_url: None
  • paper_authors: Xiaoxiao Sun, Nidham Gazagnadou, Vivek Sharma, Lingjuan Lyu, Hongdong Li, Liang Zheng
  • for: study whether existing hand-crafted image quality metrics faithfully reflect human perception of privacy leakage from reconstructed images.
  • methods: use 4 existing attack methods to reconstruct images from many different classification models across 5 datasets (natural images, faces, fine-grained classes) and ask multiple human annotators to judge whether each reconstructed image is recognizable.
  • results: the hand-crafted metrics correlate only weakly with human evaluation of privacy leakage and often contradict each other; a proposed learning-based measure, SemSim, trained with a triplet loss on human annotations to evaluate the semantic similarity between original and reconstructed images, shows significantly higher correlation with human judgment, and this correlation generalizes to unseen datasets, models, and attack methods.
    Abstract Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
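    SemSim's training setup maps directly onto a standard triplet margin loss: the original image is the anchor, a human-recognizable reconstruction the positive, and an unrecognizable one the negative. A sketch with a toy encoder (the real SemSim backbone is not specified in this summary):

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy encoder
triplet = nn.TripletMarginLoss(margin=1.0)

# anchor: original image; positive: a reconstruction annotators recognized;
# negative: a reconstruction they did not recognize.
anchor = torch.randn(16, 3, 64, 64)
positive = torch.randn(16, 3, 64, 64)
negative = torch.randn(16, 3, 64, 64)

loss = triplet(embed(anchor), embed(positive), embed(negative))
loss.backward()
```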

Performance Analysis of UNet and Variants for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.13013
  • repo_url: None
  • paper_authors: Walid Ehab, Yongmin Li
  • for: explore the application of deep learning models, focusing on the UNet architecture and its variants, to medical image segmentation.
  • methods: evaluate the standard UNet, Res-UNet, and Attention Res-UNet architectures across various challenging medical image segmentation tasks, addressing image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning.
  • results: the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while Res-UNet and Attention Res-UNet show smoother convergence and superior performance, particularly when handling fine image details.
    Abstract Medical imaging plays a crucial role in modern healthcare by providing non-invasive visualisation of internal structures and abnormalities, enabling early disease detection, accurate diagnosis, and treatment planning. This study aims to explore the application of deep learning models, particularly focusing on the UNet architecture and its variants, in medical image segmentation. We seek to evaluate the performance of these models across various challenging medical image segmentation tasks, addressing issues such as image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning. The findings reveal that the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while the Res-UNet and Attention Res-UNet architectures demonstrate smoother convergence and superior performance, particularly when handling fine image details. The study also addresses the challenge of high class imbalance through careful preprocessing and loss function definitions. We anticipate that the results of this study will provide useful insights for researchers seeking to apply these models to new medical imaging problems and offer guidance and best practices for their implementation.
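    For the class-imbalance issue the study raises, overlap-based losses are a common choice in segmentation; a soft Dice loss sketch follows (whether the paper uses exactly this loss is an assumption).

```python
import torch

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for binary segmentation.

    Scores region overlap rather than per-pixel accuracy, so it stays
    informative even when the foreground occupies a tiny fraction of pixels.
    logits: (N, 1, H, W) raw outputs; target: (N, 1, H, W) in {0, 1}.
    """
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    inter = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

loss = dice_loss(torch.randn(4, 1, 128, 128),
                 (torch.rand(4, 1, 128, 128) > 0.95).float())
```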

Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches

  • paper_url: http://arxiv.org/abs/2309.13006
  • repo_url: None
  • paper_authors: Tianrun Chen, Chenglong Fu, Ying Zang, Lanyun Zhu, Jia Zhang, Papa Mao, Lingyun Sun
  • for: This paper aims to provide an end-to-end approach for 3D modeling using only a single free-hand sketch, without requiring multiple sketches or view information.
  • methods: The proposed approach, called Deep3DSketch+, uses a lightweight generation network for efficient inference in real-time, and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information and facilitate learning of realistic and fine-detailed shape structures.
  • results: The proposed approach achieved state-of-the-art (SOTA) performance on both synthetic and real datasets, demonstrating its effectiveness in generating high-fidelity 3D models from a single free-hand sketch.
    Abstract The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.

Point Cloud Network: An Order of Magnitude Improvement in Linear Layer Parameter Count

  • paper_url: http://arxiv.org/abs/2309.12996
  • repo_url: https://gitlab.com/chetterich/pcn-paper-and-materials
  • paper_authors: Charles Hetterich
  • for: introduce the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, with empirical evidence for preferring it over the Multilayer Perceptron (MLP) in linear layers.
  • methods: train several models, including the original AlexNet, using both MLP and PCN architectures for a direct comparison of linear layers, collecting model parameter count and top-1 test accuracy on CIFAR-10 and CIFAR-100.
  • results: AlexNet-PCN16, the PCN equivalent of AlexNet, achieves comparable test accuracy to the original architecture with a 99.5% reduction of parameters in its linear layers; all training was done on cloud RTX 4090 GPUs using pytorch, and code is provided for reproduction.
    Abstract This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence to advocate for its preference over the Multilayer Perceptron (MLP) in linear layers. We train several models, including the original AlexNet, using both MLP and PCN architectures for direct comparison of linear layers (Krizhevsky et al., 2012). The key results collected are model parameter count and top-1 test accuracy over the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). AlexNet-PCN16, our PCN equivalent to AlexNet, achieves comparable efficacy (test accuracy) to the original architecture with a 99.5% reduction of parameters in its linear layers. All training is done on cloud RTX 4090 GPUs, leveraging pytorch for model construction and training. Code is provided for anyone to reproduce the trials from this paper.

License Plate Recognition Based On Multi-Angle View Model

  • paper_url: http://arxiv.org/abs/2309.12972
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Dat Tran-Anh, Khanh Linh Tran, Hoai-Nam Vu
  • for: address text detection within license plates, a challenging problem for camera-captured images/videos.
  • methods: combine multiple frames of a license plate from distinct perspectives; for each of three viewpoints (view-1, view-2, view-3), extract descriptive features of the text components (corner points and area) and identify the nearest neighboring components to restore text components from the same license plate line, based on similarity levels and distance metrics; text recognition is then performed with the CnOCR method.
  • results: experiments on the self-collected PTITPlates dataset, comprising pairs of images in various scenarios, and on the publicly available Stanford Cars Dataset, demonstrate the superiority of the proposed method over existing approaches.
    Abstract In the realm of research, the detection/recognition of text within images/videos captured by cameras constitutes a highly challenging problem for researchers. Despite certain advancements achieving high accuracy, current methods still require substantial improvements to be applicable in practical scenarios. Diverging from text detection in images/videos, this paper addresses the issue of text detection within license plates by amalgamating multiple frames of distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints: view-1, view-2, and view-3, to identify the nearest neighboring components facilitating the restoration of text components from the same license plate line based on estimations of similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset, demonstrate the superiority of the proposed method over existing approaches.

PI-RADS v2 Compliant Automated Segmentation of Prostate Zones Using co-training Motivated Multi-task Dual-Path CNN

  • paper_url: http://arxiv.org/abs/2309.12970
  • repo_url: None
  • paper_authors: Arnab Das, Suhita Ghosh, Sebastian Stober
  • for: provide automated, PI-RADS v2 compliant segmentation of prostate zones in MRI to facilitate consistent and precise lesion detection, staging, and treatment of prostate cancer.
  • methods: a dual-branch convolutional neural network (CNN) in which each branch separately captures the representations of the connected zones (PZ, TZ, DPU, and AFS); in a second training stage, the branch representations act complementarily and are fine-tuned through an unsupervised loss that penalizes the difference between the two branches' predictions for the same class, with multi-task learning further improving segmentation accuracy.
  • results: the approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43%, and 19.67% for the PZ, TZ, DPU, and AFS zones, respectively.
    Abstract The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others and is present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches act complementary to each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
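    The second-stage unsupervised loss penalizes disagreement between the two branches' predictions for the same class. A minimal sketch using an MSE between softmaxed predictions; the paper's exact penalty form is not given in this summary.

```python
import torch
import torch.nn.functional as F

def branch_consistency_loss(logits_a: torch.Tensor,
                            logits_b: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two branches on the same input.

    logits_a, logits_b: (N, C, H, W) per-branch zone predictions.
    """
    p_a = F.softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1)
    return F.mse_loss(p_a, p_b)

loss = branch_consistency_loss(torch.randn(2, 4, 96, 96),
                               torch.randn(2, 4, 96, 96))
```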

Detect Every Thing with Few Examples

  • paper_url: http://arxiv.org/abs/2309.12969
  • repo_url: https://github.com/mlzxy/devit
  • paper_authors: Xinyu Zhang, Yuting Wang, Abdeslam Boularias
  • for: develop an open-set object detector that can detect arbitrary categories beyond those seen during training.
  • methods: DE-ViT employs vision-only DINOv2 backbones and learns new categories through example images instead of language; multi-classification tasks are transformed into binary classification tasks while bypassing per-class inference, and a novel region propagation technique improves localization.
  • results: on COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 on novel classes; it surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr.
    Abstract Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13101
  • repo_url: https://github.com/ingra14m/Deformable-3D-Gaussians
  • paper_authors: Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, Xiaogang Jin
  • for: address the shortcomings of existing dynamic scene reconstruction and rendering methods, whose implicit representations struggle to capture intricate object details and cannot render in real time.
  • methods: a deformable 3D Gaussian splatting method that reconstructs scenes using explicit 3D Gaussians learned in canonical space, with a deformation field to model monocular dynamic scenes; a smoothing training mechanism with no extra overhead mitigates the impact of inaccurate poses in real datasets on the smoothness of time interpolation tasks.
  • results: through differential Gaussian rasterization, the method significantly outperforms existing methods in both rendering quality and speed, making it well suited for novel-view synthesis, time synthesis, and real-time rendering.
    Abstract Implicit neural representation has opened up new avenues for dynamic scene reconstruction and rendering. Nonetheless, state-of-the-art methods of dynamic neural rendering rely heavily on these implicit representations, which frequently struggle with accurately capturing the intricate details of objects in the scene. Furthermore, implicit methods struggle to achieve real-time rendering in general dynamic scenes, limiting their use in a wide range of tasks. To address the issues, we propose a deformable 3D Gaussians Splatting method that reconstructs scenes using explicit 3D Gaussians and learns Gaussians in canonical space with a deformation field to model monocular dynamic scenes. We also introduced a smoothing training mechanism with no extra overhead to mitigate the impact of inaccurate poses in real datasets on the smoothness of time interpolation tasks. Through differential gaussian rasterization, the deformable 3D Gaussians not only achieve higher rendering quality but also real-time rendering speed. Experiments show that our method outperforms existing methods significantly in terms of both rendering quality and speed, making it well-suited for tasks such as novel-view synthesis, time synthesis, and real-time rendering.
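    The deformation field can be sketched as an MLP mapping a canonical Gaussian center and a timestamp to position, rotation, and scale offsets; the hidden sizes and the absence of positional encoding here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DeformField(nn.Module):
    """Deformation field (sketch): maps a canonical Gaussian center x and a
    time t to offsets for position, rotation, and scale, so that static
    canonical Gaussians can represent a monocular dynamic scene.
    """
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # d_xyz, d_rotation (quat), d_scale
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        out = self.net(torch.cat([x, t], dim=-1))
        return out.split([3, 4, 3], dim=-1)

field = DeformField()
x = torch.rand(1024, 3)           # canonical Gaussian centers
t = torch.full((1024, 1), 0.5)    # normalized timestamp
d_xyz, d_rot, d_scale = field(x, t)
```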
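
A minimal sketch of the deformation-field idea, assuming a small MLP that maps a canonical Gaussian center and a timestamp to offsets in position, rotation, and scale. The layer sizes and output split are hypothetical; the actual method adds positional encoding and differential Gaussian rasterization.

```python
import torch
import torch.nn as nn

class DeformField(nn.Module):
    """Tiny MLP: (canonical center, time) -> (position, rotation, scale offsets).

    The canonical 3D Gaussians stay fixed; at render time each Gaussian is
    warped by these per-timestep offsets before splatting."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # xyz offset, quaternion offset, scale offset
        )

    def forward(self, xyz, t):
        return self.net(torch.cat([xyz, t], dim=-1)).split([3, 4, 3], dim=-1)

field = DeformField()
xyz = torch.randn(1000, 3)          # canonical Gaussian centers
t = torch.full((1000, 1), 0.25)     # normalized timestamp
dx, drot, dscale = field(xyz, t)
print(dx.shape, drot.shape, dscale.shape)
```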

On Data Fabrication in Collaborative Vehicular Perception: Attacks and Countermeasures

  • paper_url: http://arxiv.org/abs/2309.12955
  • repo_url: https://github.com/zqzqz/advcollaborativeperception
  • paper_authors: Qingzhao Zhang, Shuowei Jin, Ruiyang Zhu, Jiachen Sun, Xumiao Zhang, Qi Alfred Chen, Z. Morley Mao
  • for: Examines the security risks that collaborative perception introduces for connected and autonomous vehicles (CAVs), whose driving decisions depend on remote, untrusted data shared within the collaborative perception system.
  • methods: Studies real-time data-fabrication attacks and corresponding defenses through both high-fidelity simulation and real-world experiments.
  • results: Attackers can feed crafted false data that perturbs victims' perception results, triggering hard brakes or raising collision risk, with a success rate above 86% in simulation; the proposed anomaly-detection countermeasure detects 91.5% of attacks at a 3% false-positive rate and significantly mitigates attack impact in real-world scenarios.
    Abstract Collaborative perception, which greatly enhances the sensing capability of connected and autonomous vehicles (CAVs) by incorporating data from external resources, also brings forth potential security risks. CAVs' driving decisions rely on remote untrusted data, making them susceptible to attacks carried out by malicious participants in the collaborative perception system. However, security analysis and countermeasures for such threats are absent. To understand the impact of the vulnerability, we break the ground by proposing various real-time data fabrication attacks in which the attacker delivers crafted malicious data to victims in order to perturb their perception results, leading to hard brakes or increased collision risks. Our attacks demonstrate a high success rate of over 86% on high-fidelity simulated scenarios and are realizable in real-world experiments. To mitigate the vulnerability, we present a systematic anomaly detection approach that enables benign vehicles to jointly reveal malicious fabrication. It detects 91.5% of attacks with a false positive rate of 3% in simulated scenarios and significantly mitigates attack impacts in real-world scenarios.

Inter-vendor harmonization of Computed Tomography (CT) reconstruction kernels using unpaired image translation

  • paper_url: http://arxiv.org/abs/2309.12953
  • repo_url: None
  • paper_authors: Aravind R. Krishnan, Kaiwen Xu, Thomas Li, Chenyu Gao, Lucas W. Remedios, Praitayini Kanakaraj, Ho Hin Lee, Shunxing Bao, Kim L. Sandler, Fabien Maldonado, Ivana Isgum, Bennett A. Landman
  • for: This paper aims to investigate the harmonization of computed tomography (CT) scans from different manufacturers using an unpaired image translation approach.
  • methods: The authors use a multipath cycle generative adversarial network (GAN) to harmonize the CT scans and evaluate the effect of harmonization on the reconstruction kernels (a toy sketch of the statistical evaluation step follows the abstract below).
  • results: The authors find that their approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status, and vendor on emphysema quantification.
    Abstract The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans in single or multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification.
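
The evaluation step (fitting a linear model of emphysema score on age, sex, smoking status, and vendor, then running ANOVA) can be reproduced in outline with statsmodels. The data frame below is a synthetic stand-in; column names and coding are assumptions, not the study's data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "emphysema": rng.normal(5.0, 2.0, n),            # percent emphysema (synthetic)
    "age": rng.integers(55, 75, n),
    "sex": rng.choice(["M", "F"], n),
    "smoking": rng.choice(["current", "former"], n),
    "vendor": rng.choice(["Siemens", "GE"], n),
})

model = smf.ols("emphysema ~ age + C(sex) + C(smoking) + C(vendor)", data=df).fit()
print(anova_lm(model, typ=2))  # per-covariate contribution to emphysema variance
```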

Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.12943
  • repo_url: https://github.com/wpy1999/bas-extension
  • paper_authors: Wei Zhai, Pingyu Wu, Kai Zhu, Yang Cao, Feng Wu, Zheng-Jun Zha
  • for: Improve weakly supervised object localization and semantic segmentation by generating a foreground prediction map (FPM) for pixel-level localization.
  • methods: Two experimental observations motivate the design: (1) for a trained network, the cross-entropy converges to zero while the foreground mask still covers only part of the object region; (2) the activation value keeps increasing until the mask expands to the object boundary. The proposed Background Activation Suppression (BAS) method therefore uses an Activation Map Constraint (AMC) module to suppress background activation values, while foreground-region guidance and an area constraint drive learning of the whole object region (a toy loss sketch follows the abstract below).
  • results: Extensive experiments show significant and consistent improvement over baseline methods on the CUB-200-2011 and ILSVRC datasets, plus state-of-the-art weakly supervised semantic segmentation on PASCAL VOC 2012 and MS COCO 2014.
    Abstract Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two astonishing experimental observations on the object localization learning process: For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region. 2) The activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve a more effective localization performance, we argue for the usage of activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at https://github.com/wpy1999/BAS-Extension.
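
A hedged sketch of the suppression idea: penalize the activation mass that falls outside the predicted foreground mask (normalized so the network cannot cheat by shrinking all activations), plus an area term that keeps the mask from collapsing. The normalization, target ratio, and weights are illustrative guesses, not the paper's exact AMC formulation.

```python
import torch

def bas_style_loss(activation_map, fg_mask, area_target=0.35, w_area=1.0):
    """activation_map, fg_mask: (B, 1, H, W) with fg_mask in [0, 1]."""
    act = activation_map.flatten(1)
    bg = (activation_map * (1.0 - fg_mask)).flatten(1)
    bg_term = bg.sum(dim=1) / (act.sum(dim=1) + 1e-6)      # background activation ratio
    area_term = (fg_mask.flatten(1).mean(dim=1) - area_target).abs()
    return (bg_term + w_area * area_term).mean()

loss = bas_style_loss(torch.rand(2, 1, 14, 14), torch.rand(2, 1, 14, 14))
print(loss.item())
```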

Zero-Shot Object Counting with Language-Vision Models

  • paper_url: http://arxiv.org/abs/2309.13097
  • repo_url: None
  • paper_authors: Jingyi Xu, Hieu Le, Dimitris Samaras
  • for: Count object instances of arbitrary classes at test time without human-annotated exemplars.
  • methods: Proposes zero-shot object counting (ZSC), a new setting where only the class name is available at test time, enabling fully automated operation. The method retrieves a few object crops from the input image to serve as counting exemplars: class prototypes built with large language-vision models (CLIP and Stable Diffusion) select patches containing the target objects, and a ranking model estimates each patch's counting error to pick the most suitable exemplars (a toy CLIP-ranking sketch follows the abstract below).
  • results: Experiments on the recent class-agnostic counting dataset FSC-147 validate the effectiveness of the method.
    Abstract Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. This obviates the need for human annotators and enables automated operation. To perform ZSC, we propose finding a few object crops from the input image and use them as counting exemplars. The goal is to identify patches containing the objects of interest while also being visually representative for all instances in the image. To do this, we first construct class prototypes using large language-vision models, including CLIP and Stable Diffusion, to select the patches containing the target objects. Furthermore, we propose a ranking model that estimates the counting error of each patch to select the most suitable exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method.
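
A toy sketch of the exemplar-selection step using OpenAI's clip package: candidate crops are ranked by cosine similarity between their image embeddings and the class-name text embedding. The crop sampling, the prompt template, and the final error-ranking model are all simplified away here.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rank_crops(image_path, class_name, boxes):
    """Rank candidate crops by similarity to the class name; the top-scoring
    crops would then serve as counting exemplars."""
    image = Image.open(image_path).convert("RGB")
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    text = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(crops)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    sims = (img_f @ txt_f.t()).squeeze(-1)
    return sorted(zip(boxes, sims.tolist()), key=lambda p: -p[1])

# e.g. rank_crops("scene.jpg", "grape", [(0, 0, 64, 64), (64, 64, 128, 128)])
```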

Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification

  • paper_url: http://arxiv.org/abs/2309.12865
  • repo_url: https://github.com/cecilia-xue/hyt-nas
  • paper_authors: Xizhe Xue, Haokui Zhang, Ying Li, Liuwei Wan, Zongwen Bai, Mike Zheng Shou
  • for: This paper aims to address the challenge of training ViT models on hyperspectral images (HSIs) with limited training samples.
  • methods: The proposed method is called single-direction tuning (SDT) and it leverages existing labeled HSI datasets and RGB datasets to enhance the performance on new HSI datasets. SDT uses a parallel architecture, asynchronous cold-hot gradient update strategy, and unidirectional interaction.
  • results: The proposed Triplet-structured transformer (Tri-Former) outperforms several state-of-the-art methods on three representative HSI datasets, and homologous, heterologous, and cross-modal tuning experiments verify the effectiveness of the proposed SDT.
    Abstract Recently, some researchers started exploring the use of ViTs in tackling HSI classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge, allowing us to leverage existing labeled HSI datasets even RGB datasets to enhance the performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed to accommodate the characteristics of HSIs. The proposed SDT utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction. It aims to fully harness the potent representation learning capabilities derived from training on heterologous, even cross-modal datasets. In addition, we also introduce a novel Triplet-structured transformer (Tri-Former), where spectral attention and spatial attention modules are merged in parallel to construct the token mixing component for reducing computation cost and a 3D convolution-based channel mixer module is integrated to enhance stability and keep structure information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed SDT.

Associative Transformer Is A Sparse Representation Learner

  • paper_url: http://arxiv.org/abs/2309.12862
  • repo_url: None
  • paper_authors: Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai
  • for: Explores sparse interactions that align more closely with biological principles, proposing the Associative Transformer (AiT) based on Global Workspace Theory and associative memory.
  • methods: AiT induces low-rank explicit memory that serves both as priors guiding bottleneck attention in a shared workspace and as attractors within the associative memory of a Hopfield network; joint end-to-end training lets these priors develop module specialization, each contributing a distinct inductive bias, and the bottleneck fosters competition among inputs for writing into memory (an illustrative Hopfield-retrieval sketch follows the abstract below).
  • results: AiT is a sparse representation learner whose learned priors are complexity-invariant to input quantities and dimensions, and it outperforms methods such as the Set Transformer, Vision Transformer, and Coordination across various vision tasks.
    Abstract Emerging from the monolithic pairwise attention mechanism in conventional Transformer models, there is a growing interest in leveraging sparse interactions that align more closely with biological principles. Approaches including the Set Transformer and the Perceiver employ cross-attention consolidated with a latent space that forms an attention bottleneck with limited capacity. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves as both priors to guide bottleneck attention in the shared workspace and attractors within associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner, learning distinct priors through the bottlenecks that are complexity-invariant to input quantities and dimensions. AiT demonstrates its superiority over methods such as the Set Transformer, Vision Transformer, and Coordination in various vision tasks.
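
The associative-memory component can be illustrated with a modern-Hopfield-style retrieval loop, in which a noisy cue is repeatedly replaced by an attention-weighted mix of stored patterns until it settles on an attractor. This shows the retrieval mechanic only, not AiT's workspace architecture; beta and the step count are arbitrary.

```python
import torch

def hopfield_retrieve(queries, memory, beta=8.0, steps=3):
    """Attractor update: queries converge toward stored patterns."""
    for _ in range(steps):
        attn = torch.softmax(beta * queries @ memory.t(), dim=-1)
        queries = attn @ memory
    return queries

memory = torch.randn(32, 64)                 # stored patterns (the "priors")
cue = memory[3] + 0.3 * torch.randn(64)      # noisy version of pattern 3
out = hopfield_retrieve(cue[None], memory)
print(torch.nn.functional.cosine_similarity(out, memory[3][None]))  # ~1.0
```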

Cross-Modal Translation and Alignment for Survival Analysis

  • paper_url: http://arxiv.org/abs/2309.12855
  • repo_url: https://github.com/ft-zhou-zzz/cmta
  • paper_authors: Fengtao Zhou, Hao Chen
  • for: Proposes a Cross-Modal Translation and Alignment (CMTA) framework for survival analysis that exploits the intrinsic cross-modal correlations between genomic profiles and pathological images and transfers potentially complementary information between them, improving the accuracy of the analysis.
  • methods: Two parallel encoder-decoder structures integrate intra-modal information and generate cross-modal representations; the generated cross-modal representation is then used to enhance and recalibrate the intra-modal representation. A cross-modal attention module serves as the information bridge between modalities, performing cross-modal interactions and transferring complementary information (a toy attention-bridge sketch follows the abstract below).
  • results: Extensive experiments on five public TCGA datasets show that the framework outperforms state-of-the-art methods.
    Abstract With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic crossmodal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
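
A minimal sketch of a cross-modal attention bridge, assuming token sequences of equal width from each modality: genomic tokens query pathology tokens and vice versa, and each modality's representation is recalibrated with the result. Dimensions and the residual form are illustrative, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossModalBridge(nn.Module):
    """Each modality attends over the other and is enhanced by the result."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.g2p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.p2g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, genomic, pathology):
        g_enh, _ = self.g2p(genomic, pathology, pathology)  # genomics queries images
        p_enh, _ = self.p2g(pathology, genomic, genomic)    # images query genomics
        return genomic + g_enh, pathology + p_enh

bridge = CrossModalBridge()
g, p = torch.randn(2, 50, 256), torch.randn(2, 300, 256)
g2, p2 = bridge(g, p)
print(g2.shape, p2.shape)
```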

SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

  • paper_url: http://arxiv.org/abs/2309.12842
  • repo_url: https://github.com/Tianbo-Pan/SRFNet
  • paper_authors: Tianbo Pan, Zidong Cao, Lin Wang
  • for: Improve the accuracy of monocular depth estimation, with fine-grained structure at both daytime and nighttime, for applications such as robot navigation and self-driving.
  • methods: Proposes SRFNet with two key components: an attention-based interactive fusion (AIF) module that applies the spatial priors of events and frames as initial masks, learns consensus regions to guide inter-modal feature fusion, and feeds the fused features back to enhance frame and event feature learning; and a reliability-oriented depth refinement (RDR) module that estimates dense depth with fine-grained structure from the fused features and masks.
  • results: On synthetic and real-world datasets, SRFNet outperforms prior methods such as RAMNet even without pretraining, especially in night scenes.
    Abstract Monocular depth estimation is a crucial task to measure distance relative to a camera, which is important for applications, such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to the limited dynamic range and motion blur. Therefore, recent works leverage novel event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams exhibit spatial sparsity, leaving some areas unperceived, especially in regions with marginal light changes. Therefore, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading the depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet), that can estimate depth with fine-grained structure at both daytime and nighttime. Our method consists of two key technical components. Firstly, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused feature are then fed back to enhance the frame and event feature learning. Meanwhile, it utilizes an output head to generate a fused mask, which is iteratively updated for learning consensual spatial priors. Secondly, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with the fine-grained structure based on the fused features and masks. We evaluate the effectiveness of our method on the synthetic and real-world datasets, which shows that, even without pretraining, our method outperforms the prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: https://vlislab22.github.io/SRFNet.

Domain Adaptive Few-Shot Open-Set Learning

  • paper_url: http://arxiv.org/abs/2309.12814
  • repo_url: https://github.com/debabratapal7/dafosnet
  • paper_authors: Debabrata Pal, Deeptej More, Sai Bhargav, Dipesh Tamboli, Vaneet Aggarwal, Biplab Banerjee
  • for: Handle unknown samples in target query sets under visual domain shift while adapting quickly to new scenarios.
  • methods: Proposes Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) with a meta-learning-based architecture, DAFOS-NET. During training, the model learns a shared, discriminative embedding space and creates a pseudo open-space decision boundary, given a fully supervised source domain and a label-disjoint few-shot target domain. A pair of conditional adversarial networks with tunable noise variances augments the closed and pseudo-open spaces of both domains to increase data density, and a domain-specific batch-normalized class-prototype alignment strategy aligns the two domains globally while keeping classes discriminative through novel metric objectives.
  • results: The model generalizes well to new target-domain scenarios; three benchmarks built on Office-Home, mini-ImageNet/CUB, and DomainNet demonstrate the efficacy of DAFOS-NET.
    Abstract Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOSNET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototypes alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation

Automatic view plane prescription for cardiac magnetic resonance imaging via supervision by spatial relationship between views

  • paper_url: http://arxiv.org/abs/2309.12805
  • repo_url: https://github.com/wd111624/cmr_plan
  • paper_authors: Dong Wei, Yawen Huang, Donghuan Lu, Yuexiang Li, Yefeng Zheng
  • for: Automate view-plane prescription for cardiac magnetic resonance (CMR) imaging so that clinicians and technologists can plan acquisitions faster and more consistently, without extra volumetric scans or manual landmark annotation.
  • methods: The system mines the spatial relationship between target planes and source views by locating their intersection lines (the prescription lines technologists drew using cardiac landmarks, recovered retrospectively from properly stored data) and trains deep networks to regress heatmaps defined by distances from those lines (a toy heatmap sketch follows the abstract below). A stacked hourglass architecture exploits the interplay of multiple target planes predicted in a source view to gradually refine the regression, and a multi-view planning strategy aggregates the predicted heatmaps across all source views of a target plane for a globally optimal prescription, mimicking skilled human prescribers.
  • results: Across 181 CMR exams, the system achieves a mean angular difference of 5.68 degrees and a point-to-plane distance of 3.12 mm, outperforming conventional atlas-based and newer deep-learning approaches on the four standard CMR planes, and it can also prescribe the first cardiac-anatomy-oriented plane(s) from the body-oriented scout.
    Abstract Background: View planning for the acquisition of cardiac magnetic resonance (CMR) imaging remains a demanding task in clinical practice. Purpose: Existing approaches to its automation relied either on an additional volumetric image not typically acquired in clinic routine, or on laborious manual annotations of cardiac structural landmarks. This work presents a clinic-compatible, annotation-free system for automatic CMR view planning. Methods: The system mines the spatial relationship, more specifically, locates the intersecting lines, between the target planes and source views, and trains deep networks to regress heatmaps defined by distances from the intersecting lines. The intersection lines are the prescription lines prescribed by the technologists at the time of image acquisition using cardiac landmarks, and retrospectively identified from the spatial relationship. As the spatial relationship is self-contained in properly stored data, the need for additional manual annotation is eliminated. In addition, the interplay of multiple target planes predicted in a source view is utilized in a stacked hourglass architecture to gradually improve the regression. Then, a multi-view planning strategy is proposed to aggregate information from the predicted heatmaps for all the source views of a target plane, for a globally optimal prescription, mimicking the similar strategy practiced by skilled human prescribers. Results: The experiments include 181 CMR exams. Our system yields the mean angular difference and point-to-plane distance of 5.68 degrees and 3.12 mm, respectively. It not only achieves superior accuracy to existing approaches including conventional atlas-based and newer deep-learning-based in prescribing the four standard CMR planes but also demonstrates prescription of the first cardiac-anatomy-oriented plane(s) from the body-oriented scout.
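
A toy version of the regression target: given an intersection (prescription) line, each pixel's heatmap value decays with its perpendicular distance to the line. The Gaussian form and sigma are assumptions for illustration.

```python
import numpy as np

def line_heatmap(h, w, point, direction, sigma=8.0):
    """Heatmap peaking on the line through `point` along `direction`.

    Each pixel's value decays with its perpendicular distance to the line,
    the kind of distance-defined target a heatmap regressor can learn."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    ys, xs = np.mgrid[0:h, 0:w]
    vx, vy = xs - point[0], ys - point[1]
    dist = np.abs(vx * d[1] - vy * d[0])   # 2D cross product = perpendicular distance
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))

hm = line_heatmap(256, 256, point=(128, 128), direction=(1.0, 0.5))
print(hm.shape, hm.max())
```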

Scalable Semantic 3D Mapping of Coral Reefs with Deep Learning

  • paper_url: http://arxiv.org/abs/2309.12804
  • repo_url: None
  • paper_authors: Jonathan Sauder, Guilhem Banc-Prandi, Anders Meibom, Devis Tuia
  • for: This paper aims to develop a new method for mapping underwater environments from ego-motion video, with a focus on coral reef monitoring.
  • methods: The method uses machine learning to adapt to challenging underwater conditions and combines 3D mapping with semantic segmentation of images.
  • results: The method achieves high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs, making it possible to monitor coral reefs more efficiently and effectively.
    Abstract Coral reefs are among the most diverse ecosystems on our planet, and are depended on by hundreds of millions of people. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging conditions under water, combined with a modern approach for semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced required labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing cost. This can help to inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast low-cost mapping of underwater environments other than coral reefs.

NOC: High-Quality Neural Object Cloning with 3D Lifting of Segment Anything

  • paper_url: http://arxiv.org/abs/2309.12790
  • repo_url: None
  • paper_authors: Xiaobao Wei, Renrui Zhang, Jiarui Wu, Jiaming Liu, Ming Lu, Yandong Guo, Shanghang Zhang
  • for: Propose a neural-field-based method for high-quality 3D reconstruction of a target object that users indicate on the fly.
  • methods: Combines neural fields with the Segment Anything Model (SAM): multi-view 2D segmentation masks from SAM are lifted into a unified 3D variation field, which is projected back to 2D space to generate new prompts for SAM, iterating until the target object is separated from the scene; the 2D features of the SAM encoder are additionally lifted into a 3D SAM field to improve the reconstruction quality of the target object.
  • results: Detailed experiments on several benchmark datasets demonstrate high-quality reconstruction of the target object.
    Abstract With the development of the neural field, reconstructing the 3D model of a target object from multi-view inputs has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while it is still under-explored how to reconstruct a certain object indicated by users on-the-fly. Considering the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in this paper, we propose Neural Object Cloning (NOC), a novel high-quality 3D object reconstruction method, which leverages the benefits of both neural field and SAM from two aspects. Firstly, to separate the target object from the scene, we propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D variation field. The 3D variation field is then projected into 2D space and generates the new prompts for SAM. This process is iterative until convergence to separate the target object from the scene. Then, apart from 2D masks, we further lift the 2D features of the SAM encoder into a 3D SAM field in order to improve the reconstruction quality of the target object. NOC lifts the 2D masks and features of SAM into the 3D neural field for high-quality target object reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be released.

EMS: 3D Eyebrow Modeling from Single-view Images

  • paper_url: http://arxiv.org/abs/2309.12787
  • repo_url: None
  • paper_authors: Chenghong Li, Leyang Jin, Yujian Zheng, Yizhou Yu, Xiaoguang Han
  • for: Propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction.
  • methods: Represents the eyebrow as a set of fiber curves and casts reconstruction as a fiber-growing problem with three modules: RootFinder localizes fiber root positions (formulated as density-map estimation with density-based clustering, to cope with severely occluded roots), OriPredictor predicts a 3D orientation field that guides fiber growth, and FiberEnder, a pixel-aligned RNN binary classifier, decides when each fiber stops growing. Training is supported by a new synthetic dataset of 400 high-quality, artist-created 3D eyebrow models.
  • results: The method is effective across a variety of eyebrow styles and lengths, from short and sparse to long bushy eyebrows.
    Abstract Eyebrows play a critical role in facial expression and appearance. Although the 3D digitization of faces is well explored, less attention has been drawn to 3D eyebrow modeling. In this work, we propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction. Following the methods of scalp hair reconstruction, we also represent the eyebrow as a set of fiber curves and convert the reconstruction to fibers growing problem. Three modules are then carefully designed: RootFinder firstly localizes the fiber root positions which indicates where to grow; OriPredictor predicts an orientation field in the 3D space to guide the growing of fibers; FiberEnder is designed to determine when to stop the growth of each fiber. Our OriPredictor is directly borrowing the method used in hair reconstruction. Considering the differences between hair and eyebrows, both RootFinder and FiberEnder are newly proposed. Specifically, to cope with the challenge that the root location is severely occluded, we formulate root localization as a density map estimation task. Given the predicted density map, a density-based clustering method is further used for finding the roots. For each fiber, the growth starts from the root point and moves step by step until the ending, where each step is defined as an oriented line with a constant length according to the predicted orientation field. To determine when to end, a pixel-aligned RNN architecture is designed to form a binary classifier, which outputs stop or not for each growing step. To support the training of all proposed networks, we build the first 3D synthetic eyebrow dataset that contains 400 high-quality eyebrow models manually created by artists. Extensive experiments have demonstrated the effectiveness of the proposed EMS pipeline on a variety of different eyebrow styles and lengths, ranging from short and sparse to long bushy eyebrows.

LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition

  • paper_url: http://arxiv.org/abs/2309.12780
  • repo_url: https://github.com/harryqu123/lmc
  • paper_authors: Haoxuan Qu, Xiaofei Hui, Yujun Cai, Jun Liu
  • for: Perform open-set object recognition accurately by reducing reliance on spurious-discriminative features.
  • methods: Proposes Large Model Collaboration (LMC), a training-free framework that collaborates multiple off-the-shelf large models pre-trained under different paradigms, together with several novel designs to effectively extract the models' implicit knowledge.
  • results: Extensive experiments demonstrate the efficacy of the framework. Code is available at https://github.com/Harryqu123/LMC.
    Abstract Open-set object recognition aims to identify if an object is from a class that has been encountered during training or not. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious-discriminative features. In this paper, motivated by that different large models pre-trained through different paradigms can possess very rich while distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge via collaborating different off-the-shelf large models in a training-free manner. Moreover, we also incorporate the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available https://github.com/Harryqu123/LMC

WiCV@CVPR2023: The Eleventh Women In Computer Vision Workshop at the Annual CVPR Conference

  • paper_url: http://arxiv.org/abs/2309.12768
  • repo_url: None
  • paper_authors: Doris Antensteiner, Marah Halawa, Asra Aslam, Ivaxi Sheth, Sachini Herath, Ziqi Huang, Sunnie S. Y. Kim, Aparna Akula, Xin Wang
  • for: The paper is written to present the details of the Women in Computer Vision Workshop - WiCV 2023, which aims to amplify the voices of underrepresented women in the computer vision community.
  • methods: The paper reports on the workshop program, historical trends from past WiCV@CVPR events, and statistics on presenters, attendees, and sponsorship for the WiCV 2023 workshop.
  • results: The paper presents a detailed report on the WiCV 2023 workshop, including the program, historical trends, and statistics on presenters, attendees, and sponsorship, and highlights the importance of such events in addressing gender imbalances within the field of computer vision.
    Abstract In this paper, we present the details of Women in Computer Vision Workshop - WiCV 2023, organized alongside the hybrid CVPR 2023 in Vancouver, Canada. WiCV aims to amplify the voices of underrepresented women in the computer vision community, fostering increased visibility in both academia and industry. We believe that such events play a vital role in addressing gender imbalances within the field. The annual WiCV@CVPR workshop offers a) opportunity for collaboration between researchers from minority groups, b) mentorship for female junior researchers, c) financial support to presenters to alleviate finanacial burdens and d) a diverse array of role models who can inspire younger researchers at the outset of their careers. In this paper, we present a comprehensive report on the workshop program, historical trends from the past WiCV@CVPR events, and a summary of statistics related to presenters, attendees, and sponsorship for the WiCV 2023 workshop.

S3TC: Spiking Separated Spatial and Temporal Convolutions with Unsupervised STDP-based Learning for Action Recognition

  • paper_url: http://arxiv.org/abs/2309.12761
  • repo_url: None
  • paper_authors: Mireille El-Assal, Pierre Tirilly, Ioan Marius Bilasco
  • for: Develop a more efficient video analysis method using Spiking Neural Networks (SNNs) with Spiking Separated Spatial and Temporal Convolutions (S3TCs).
  • methods: Uses unsupervised learning with the Spike Timing-Dependent Plasticity (STDP) rule, which avoids the need for large amounts of labeled data, and factorizes each spatio-temporal spiking convolution into a spatial and a temporal spiking convolution to reduce the number of parameters required for video analysis (a parameter-count sketch follows the abstract below).
  • results: The proposed method successfully extracts spatio-temporal information from videos, increases the output spiking activity, and outperforms spiking 3D convolutions on the KTH, Weizmann, and IXMAS datasets.
    Abstract Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters compared with spiking 2D CSNN. This, not only increases the computational costs, but also makes these networks more difficult to implement with neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) for the sake of reducing the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network with the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos, while increasing the output spiking activity, and outperforming spiking 3D convolutions.
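
The parameter saving from the separation can be checked directly: a joint k×k×k spatio-temporal convolution versus a spatial (1,k,k) convolution followed by a temporal (k,1,1) one, in the spirit of (2+1)D factorization. The spiking dynamics (neuron model, STDP updates) are omitted; this only counts the convolutional kernel parameters.

```python
import torch
import torch.nn as nn

def param_count(m):
    return sum(p.numel() for p in m.parameters())

cin, cout, k = 64, 128, 3

# one joint spatio-temporal kernel: k*k*k weights per channel pair
conv3d = nn.Conv3d(cin, cout, kernel_size=(k, k, k), padding=1, bias=False)

# separated: spatial (1,k,k) then temporal (k,1,1)
separated = nn.Sequential(
    nn.Conv3d(cin, cout, kernel_size=(1, k, k), padding=(0, 1, 1), bias=False),
    nn.Conv3d(cout, cout, kernel_size=(k, 1, 1), padding=(1, 0, 0), bias=False),
)

x = torch.randn(1, cin, 8, 32, 32)  # (batch, channels, time, H, W)
print(conv3d(x).shape, separated(x).shape)          # identical output shapes
print(param_count(conv3d), param_count(separated))  # 221184 vs 122880
```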

Transformer-based Image Compression with Variable Image Quality Objectives

  • paper_url: http://arxiv.org/abs/2309.12717
  • repo_url: None
  • paper_authors: Chia-Hao Kao, Yi-Hsin Chen, Cheng Chien, Wei-Chen Chiu, Wen-Hsiao Peng
  • for: Provide a Transformer-based image compression system that lets the user vary the image quality objective according to their preference, choosing a trade-off between two objectives with a single shared model.
  • methods: Motivated by the success of prompt-tuning, the method conditions a Transformer-based autoencoder on prompt tokens that are generated adaptively from the user's preference and the input image by a learned prompt generation network (a toy conditioning sketch follows the abstract below).
  • results: Extensive experiments on commonly used quality metrics show the method adapts the encoding and/or decoding processes to a variable quality objective while performing comparably to single-objective methods in rate-distortion performance.
    Abstract This paper presents a Transformer-based image compression system that allows for a variable image quality objective according to the user's preference. Optimizing a learned codec for different quality objectives leads to reconstructed images with varying visual characteristics. Our method provides the user with the flexibility to choose a trade-off between two image quality objectives using a single, shared model. Motivated by the success of prompt-tuning techniques, we introduce prompt tokens to condition our Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of our method in adapting the encoding and/or decoding processes to a variable quality objective. While offering the additional flexibility, our proposed method performs comparably to the single-objective methods in terms of rate-distortion performance.
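
A minimal sketch of prompt-token conditioning: a tiny generator maps a user-preference scalar to a few tokens that are prepended to the patch-token sequence of a shared Transformer encoder. The real system also conditions on the input image and applies this inside a learned codec; every dimension here is a placeholder.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Prepend preference-conditioned prompt tokens to the token sequence.

    `pref` in [0, 1] trades off two quality objectives; a small generator
    maps it to n_prompt tokens that steer the shared encoder."""
    def __init__(self, dim=192, n_prompt=4, depth=4, heads=4):
        super().__init__()
        self.gen = nn.Linear(1, n_prompt * dim)
        self.n_prompt, self.dim = n_prompt, dim
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens, pref):
        prompts = self.gen(pref).view(-1, self.n_prompt, self.dim)
        out = self.encoder(torch.cat([prompts, tokens], dim=1))
        return out[:, self.n_prompt:]  # drop the prompt positions

enc = PromptedEncoder()
tokens = torch.randn(2, 196, 192)      # e.g. 14x14 image patch tokens
pref = torch.tensor([[0.0], [1.0]])    # fully favor one objective or the other
print(enc(tokens, pref).shape)         # torch.Size([2, 196, 192])
```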

Mixed Attention Auto Encoder for Multi-Class Industrial Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.12700
  • repo_url: None
  • paper_authors: Jiangqi Liu, Feng Wang
  • for: Propose a single unified model for multi-class anomaly detection, avoiding the high storage cost and low training efficiency of training a separate model per object category.
  • methods: A mixed-attention autoencoder (MAAE) employs spatial attentions and channel attentions to capture global category information and model the feature distributions of multiple classes (a generic channel-plus-spatial attention sketch follows the abstract below); an adaptive noise generator simulates realistic feature noise, and a multi-scale fusion module preserves the surface semantics of objects from different categories, which is essential for detecting subtle anomalies.
  • results: MAAE delivers remarkable performance on the benchmark dataset compared with state-of-the-art methods.
    Abstract Most existing methods for unsupervised industrial anomaly detection train a separate model for each object category. This kind of approach can easily capture the category-specific feature distributions, but results in high storage cost and low training efficiency. In this paper, we propose a unified mixed-attention auto encoder (MAAE) to implement multi-class anomaly detection with a single model. To alleviate the performance degradation due to the diverse distribution patterns of different categories, we employ spatial attentions and channel attentions to effectively capture the global category information and model the feature distributions of multiple classes. Furthermore, to simulate the realistic noises on features and preserve the surface semantics of objects from different categories which are essential for detecting the subtle anomalies, we propose an adaptive noise generator and a multi-scale fusion module for the pre-trained features. MAAE delivers remarkable performances on the benchmark dataset compared with the state-of-the-art methods.
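
A CBAM-style stand-in for mixing channel and spatial attention over autoencoder features: channel attention first, to weight feature maps by global category cues, then spatial attention over pooled statistics. This is a generic attention pattern, not the paper's exact module, and all sizes are placeholders.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Channel attention followed by spatial attention on (B, C, H, W) features."""
    def __init__(self, c):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // 4, 1), nn.ReLU(),
            nn.Conv2d(c // 4, c, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)  # reweight channels by global statistics
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)  # reweight spatial locations

m = MixedAttention(64)
print(m(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```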

eWand: A calibration framework for wide baseline frame-based and event-based camera systems

  • paper_url: http://arxiv.org/abs/2309.12685
  • repo_url: None
  • paper_authors: Thomas Gossard, Andreas Ziegler, Levin Kolmar, Jonas Tebbe, Andreas Zell
  • for: Provide accurate multi-camera extrinsic calibration for precisely triangulating object positions, even with highly different viewpoints and a wide baseline.
  • methods: Uses blinking LEDs inside opaque spheres instead of a printed or displayed pattern, making the markers detectable from all directions by both frame- and event-based cameras.
  • results: Offers a faster, easier-to-use extrinsic calibration approach that maintains high accuracy for both frame- and event-based cameras; a toy DLT-triangulation sketch using the calibrated projection matrices follows the abstract below.
    Abstract Accurate calibration is crucial for using multiple cameras to triangulate the position of objects precisely. However, it is also a time-consuming process that needs to be repeated for every displacement of the cameras. The standard approach is to use a printed pattern with known geometry to estimate the intrinsic and extrinsic parameters of the cameras. The same idea can be applied to event-based cameras, though it requires extra work. By using frame reconstruction from events, a printed pattern can be detected. A blinking pattern can also be displayed on a screen. Then, the pattern can be directly detected from the events. Such calibration methods can provide accurate intrinsic calibration for both frame- and event-based cameras. However, using 2D patterns has several limitations for multi-camera extrinsic calibration, with cameras possessing highly different points of view and a wide baseline. The 2D pattern can only be detected from one direction and needs to be of significant size to compensate for its distance to the camera. This makes the extrinsic calibration time-consuming and cumbersome. To overcome these limitations, we propose eWand, a new method that uses blinking LEDs inside opaque spheres instead of a printed or displayed pattern. Our method provides a faster, easier-to-use extrinsic calibration approach that maintains high accuracy for both event- and frame-based cameras.
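
The downstream payoff of a calibrated rig is precise triangulation. The sketch below performs linear (DLT) triangulation of one marker observed in two views, assuming 3x4 projection matrices recovered by the calibration; the synthetic cameras exist only for the self-check.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point.

    P1, P2: 3x4 camera projection matrices from calibration.
    x1, x2: (u, v) pixel observations of the same marker in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# self-check: project a known point with two synthetic cameras, then recover it
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.array([0.2, -0.1, 4.0, 1.0])
x1 = (P1 @ X)[:2] / (P1 @ X)[2]
x2 = (P2 @ X)[:2] / (P2 @ X)[2]
print(triangulate(P1, P2, x1, x2))  # ~ [0.2, -0.1, 4.0]
```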

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

  • paper_url: http://arxiv.org/abs/2309.12657
  • repo_url: None
  • paper_authors: Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, Nenghai Yu
  • for: Propose a simple and effective transformer-based framework for detecting and grounding multi-modal manipulation.
  • methods: Builds on visual/language pre-trained encoders with dual-branch cross-attention (DCA) to extract and fuse modality-unique features; decoupled fine-grained classifiers (DFC) enhance modality-specific feature mining and mitigate modality competition; and an implicit manipulation query (IMQ) adaptively aggregates global contextual cues within each modality using learnable queries, improving the discovery of forged details.
  • results: Extensive experiments on the $\rm DGM^4$ dataset demonstrate superior performance over state-of-the-art methods.
    Abstract AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.

FP-PET: Large Model, Multiple Loss And Focused Practice

  • paper_url: http://arxiv.org/abs/2309.12650
  • repo_url: None
  • paper_authors: Yixin Chen, Ourui Fu, Wenrui Shao, Zhaoheng Xie
  • for: This study presents FP-PET, an approach to medical image segmentation with a focus on CT and PET images.
  • methods: The study employs several machine learning models, including STUNet-large, SwinUNETR, and VNet, to achieve state-of-the-art segmentation performance.
  • results: It introduces an aggregated score that combines multiple evaluation metrics, such as Dice score, false positive volume, and false negative volume, into a holistic measure of model effectiveness.
    Abstract This study presents FP-PET, a comprehensive approach to medical image segmentation with a focus on CT and PET images. Utilizing a dataset from the AutoPet2023 Challenge, the research employs a variety of machine learning models, including STUNet-large, SwinUNETR, and VNet, to achieve state-of-the-art segmentation performance. The paper introduces an aggregated score that combines multiple evaluation metrics such as Dice score, false positive volume (FPV), and false negative volume (FNV) to provide a holistic measure of model effectiveness. The study also discusses the computational challenges and solutions related to model training, which was conducted on high-performance GPUs. Preprocessing and postprocessing techniques, including gaussian weighting schemes and morphological operations, are explored to further refine the segmentation output. The research offers valuable insights into the challenges and solutions for advanced medical image segmentation.
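
A minimal sketch of an aggregated segmentation score combining Dice score, false positive volume (FPV), and false negative volume (FNV) follows; the weights and the normalization by ground-truth volume are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def aggregated_score(pred, gt, voxel_vol=1.0, w=(0.5, 0.25, 0.25)):
    """pred, gt: binary segmentation masks; higher score is better."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    dice = 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)
    fpv = np.logical_and(pred, ~gt).sum() * voxel_vol   # volume predicted but absent
    fnv = np.logical_and(~pred, gt).sum() * voxel_vol   # lesion volume that was missed
    gt_vol = gt.sum() * voxel_vol + 1e-8
    return w[0] * dice - w[1] * fpv / gt_vol - w[2] * fnv / gt_vol
```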

RHINO: Regularizing the Hash-based Implicit Neural Representation

  • paper_url: http://arxiv.org/abs/2309.12642
  • repo_url: None
  • paper_authors: Hao Zhu, Fengyi Liu, Qi Zhang, Xun Cao, Zhan Ma
  • for: Strengthening regularization in hash-based implicit neural representations so that interpolation becomes reliable and stable.
  • methods: A continuous analytical function is introduced to additionally connect the input coordinates to the network, adding regularization without modifying current hash-based INR architectures.
  • results: RHINO performs strongly across tasks such as image fitting, signed distance function representation, and optimization of 5D static / 6D dynamic neural radiance fields, surpassing the state of the art in both quality and speed.
    Abstract The use of Implicit Neural Representation (INR) through a hash-table has demonstrated impressive effectiveness and efficiency in characterizing intricate signals. However, current state-of-the-art methods exhibit insufficient regularization, often yielding unreliable and noisy results during interpolations. We find that this issue stems from broken gradient flow between input coordinates and indexed hash-keys, where the chain rule attempts to model discrete hash-keys, rather than the continuous coordinates. To tackle this concern, we introduce RHINO, in which a continuous analytical function is incorporated to facilitate regularization by connecting the input coordinate and the network additionally without modifying the architecture of current hash-based INRs. This connection ensures a seamless backpropagation of gradients from the network's output back to the input coordinates, thereby enhancing regularization. Our experimental results not only showcase the broadened regularization capability across different hash-based INRs like DINER and Instant NGP, but also across a variety of tasks such as image fitting, representation of signed distance functions, and optimization of 5D static / 6D dynamic neural radiance fields. Notably, RHINO outperforms current state-of-the-art techniques in both quality and speed, affirming its superiority.
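
One plausible reading of the fix, sketched below under strong assumptions: the discrete hash lookup blocks the gradient from the output back to the coordinate, so a small continuous function of the raw coordinate is attached to the output, restoring a differentiable path without touching the hash architecture. The toy hashing scheme and the product form are illustrative, not RHINO's actual formulation:

```python
import torch
import torch.nn as nn

class HashINRWithContinuousPath(nn.Module):
    def __init__(self, n_entries=2**14, feat_dim=2, hidden=64):
        super().__init__()
        self.table = nn.Embedding(n_entries, feat_dim)      # hash-table features
        self.decoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))
        self.analytic = nn.Linear(2, 1)                     # continuous in x

    def forward(self, x):                                   # x: (B, 2) in [0, 1)
        idx = (x * 127).long()                              # toy spatial hashing
        key = (idx[:, 0] * 2654435761 ^ idx[:, 1] * 805459861) % self.table.num_embeddings
        feats = self.table(key)                             # no gradient to x here
        return self.decoder(feats) * self.analytic(x)       # gradient now reaches x
```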

Global Context Aggregation Network for Lightweight Saliency Detection of Surface Defects

  • paper_url: http://arxiv.org/abs/2309.12641
  • repo_url: None
  • paper_authors: Feng Yan, Xiaoheng Jiang, Yang Lu, Lisha Cui, Shupan Li, Jiale Cao, Mingliang Xu, Dacheng Tao
  • for: This paper proposes a lightweight saliency detection method for surface defects that improves both detection efficiency and accuracy.
  • methods: It presents a Global Context Aggregation Network (GCANet) on an encoder-decoder structure, with a novel transformer encoder based on Depth-wise Self-Attention and a Channel Reference Attention (CRA) module that strengthens multi-level feature representation.
  • results: Experiments on three public defect datasets show that GCANet achieves a better trade-off between accuracy and running efficiency than 17 state-of-the-art methods; on SD-saliency-900 it reaches 91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$ while running at 272 fps on a single GPU.
    Abstract Surface defect inspection is a very challenging task in which surface defects usually show weak appearances or exist under complex backgrounds. Most high-accuracy defect detection methods require expensive computation and storage overhead, making them less practical in some resource-constrained defect detection applications. Although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects on the encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA performs element-wise similarity in channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. The experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency compared with other 17 state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running 272fps on a single gpu.
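
The sketch below shows channel-wise self-attention in the spirit of the described DSA module: similarity is computed between channels, giving a C x C attention map whose cost grows linearly with the number of spatial positions. All details beyond that idea are assumptions:

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)   # each (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, C, C)
        return (attn @ v).view(b, c, h, w) + x             # residual connection
```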
    摘要 surface defect inspection 是一项非常具有挑战性的任务, surface defects 通常会出现弱化的外观或者在复杂的背景下出现。大多数高精度的缺陷检测方法需要昂贵的计算和存储开销,使其在一些资源受限的缺陷检测应用中不实用。 although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. 为了解决这个问题,我们开发了一个全球上下文聚合网络(GCANet),用于轻量级的缺陷检测。我们在轻量级的后ION上加入了一个新的 transformer Encoder,以便在全球上下文信息中捕捉全球上下文信息。我们引入了一种新的 Depth-wise Self-Attention(DSA)模块,用于在通道维度进行元素对元素的相似性检测,同时保持线性复杂度。此外,我们在每个解码块前加入了一个 Channel Reference Attention(CRA)模块,以强化底层特征表示。CRA模块利用不同层次特征之间的通道相关性来适应性地增强特征表示。我们在三个公共缺陷数据集上进行了实验,结果显示,我们的网络在缺陷检测精度和运行效率之间做出了更好的平衡,相比于其他 17 种国际前沿方法。具体来说,GCANet 在 SD-saliency-900 上达到了同等精度(91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, 和 97.35% $E_\phi$),而且在单个 GPU 上运行速度为 272 fps。

CINFormer: Transformer network with multi-stage CNN feature injection for surface defect segmentation

  • paper_url: http://arxiv.org/abs/2309.12639
  • repo_url: None
  • paper_authors: Xiaoheng Jiang, Kaiyi Guo, Yang Lu, Feng Yan, Hao Liu, Jiale Cao, Mingliang Xu, Dacheng Tao
  • for: This work aims to improve surface defect segmentation accuracy in industrial production and to address challenges for deep learning methods such as weak defects and defect-like interference in the background.
  • methods: It proposes CINFormer, a transformer network with multi-stage CNN feature injection and a Top-K self-attention module, which keeps the CNN's strength at capturing detailed features while exploiting the transformer's ability to suppress background noise.
  • results: Experiments on the DAGM 2007, Magnetic tile, and NEU surface defect datasets show that CINFormer achieves state-of-the-art detection performance across defect types and levels of background interference.
    Abstract Surface defect inspection is of great importance for industrial manufacture and production. Though defect inspection methods based on deep learning have made significant progress, there are still some challenges for these methods, such as indistinguishable weak defects and defect-like interference in the background. To address these issues, we propose a transformer network with multi-stage CNN (Convolutional Neural Network) feature injection for surface defect segmentation, which is a UNet-like structure named CINFormer. CINFormer presents a simple yet effective feature integration mechanism that injects the multi-level CNN features of the input image into different stages of the transformer network in the encoder. This can maintain the merit of CNN capturing detailed features and that of transformer depressing noises in the background, which facilitates accurate defect detection. In addition, CINFormer presents a Top-K self-attention module to focus on tokens with more important information about the defects, so as to further reduce the impact of the redundant background. Extensive experiments conducted on the surface defect datasets DAGM 2007, Magnetic tile, and NEU show that the proposed CINFormer achieves state-of-the-art performance in defect detection.
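
A minimal sketch of Top-K self-attention as described: for each query, only the K largest attention logits are kept and the rest are masked before the softmax, concentrating attention on the most defect-relevant tokens. The sizes and scaling are illustrative:

```python
import torch

def topk_attention(q, k, v, top_k=16):
    # q, k, v: (B, N, D)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N)
    kth = logits.topk(top_k, dim=-1).values[..., -1:]       # K-th largest per query
    logits = logits.masked_fill(logits < kth, float('-inf'))
    return torch.softmax(logits, dim=-1) @ v
```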

Auto-Lesion Segmentation with a Novel Intensity Dark Channel Prior for COVID-19 Detection

  • paper_url: http://arxiv.org/abs/2309.12638
  • repo_url: None
  • paper_authors: Basma Jumaa Saleh, Zaid Omar, Vikrant Bhateja, Lila Iznita Izhar
  • for: This study aims to develop a computed tomography (CT)-based method for COVID-19 diagnosis to assist in assessing suspected COVID-19 patients.
  • methods: The study uses radiomic features together with enhancement auto-segmentation based on an intensity dark channel prior and deep neural networks (ALS-IDCP-DNN), classifying images within a defined range of analysis thresholds.
  • results: On a validation dataset, the proposed model achieves an average accuracy of 98.8%, precision of 99%, recall of 98%, and an F1-score of 98%, showing that it can accurately classify COVID-19 images and assist radiologists, while outperforming more than 10 state-of-the-art studies on the same dataset.
    Abstract During the COVID-19 pandemic, medical imaging techniques like computed tomography (CT) scans have demonstrated effectiveness in combating the rapid spread of the virus. Therefore, it is crucial to conduct research on computerized models for the detection of COVID-19 using CT imaging. A novel processing method has been developed, utilizing radiomic features, to assist in the CT-based diagnosis of COVID-19. Given the lower specificity of traditional features in distinguishing between different causes of pulmonary diseases, the objective of this study is to develop a CT-based radiomics framework for the differentiation of COVID-19 from other lung diseases. The model is designed to focus on outlining COVID-19 lesions, as traditional features often lack specificity in this aspect. The model categorizes images into three classes: COVID-19, non-COVID-19, or normal. It employs enhancement auto-segmentation principles using intensity dark channel prior (IDCP) and deep neural networks (ALS-IDCP-DNN) within a defined range of analysis thresholds. A publicly available dataset comprising COVID-19, normal, and non-COVID-19 classes was utilized to validate the proposed model's effectiveness. The best performing classification model, Residual Neural Network with 50 layers (Resnet-50), attained an average accuracy, precision, recall, and F1-score of 98.8%, 99%, 98%, and 98% respectively. These results demonstrate the capability of our model to accurately classify COVID-19 images, which could aid radiologists in diagnosing suspected COVID-19 patients. Furthermore, our model's performance surpasses that of more than 10 current state-of-the-art studies conducted on the same dataset.
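
For reference, the classic dark channel computation that an intensity dark channel prior builds on is the per-pixel minimum over color channels followed by a local minimum filter; the patch size below is an assumption:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """image: (H, W, C) float array in [0, 1]; returns (H, W) dark channel."""
    per_pixel_min = image.min(axis=2)                 # darkest channel per pixel
    return minimum_filter(per_pixel_min, size=patch)  # darkest value in each patch
```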

Learning Actions and Control of Focus of Attention with a Log-Polar-like Sensor

  • paper_url: http://arxiv.org/abs/2309.12634
  • repo_url: None
  • paper_authors: Robin Göransson, Volker Krueger
  • for: Reducing the image processing time on an autonomous mobile robot.
  • methods: The paper explores log-polar-like image data with gaze control, extending an A3C deep RL approach with an LSTM network to learn policies for playing three Atari games and for gaze control.
  • results: The amount of image pixels is reduced by a further factor of 5 without losing any gaming performance.
    Abstract With the long-term goal of reducing the image processing time on an autonomous mobile robot in mind, we explore in this paper the use of log-polar-like image data with gaze control. The gaze control is not done on the Cartesian image but on the log-polar-like image data. For this we start out from the classic deep reinforcement learning approach for Atari games. We extend an A3C deep RL approach with an LSTM network, and we learn the policy for playing three Atari games and a policy for gaze control. While the Atari games already use low-resolution images of 80 by 80 pixels, we are able to further reduce the amount of image pixels by a factor of 5 without losing any gaming performance.
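
A sketch of log-polar-like subsampling: sampling density falls off exponentially with distance from a fixation point, mimicking a foveated sensor. Resampling an 80x80 frame onto a 32x32 ring-angle grid cuts the pixel count by roughly a factor of six; the grid sizes here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def log_polar_sample(img, center, n_rings=32, n_angles=32):
    h, w = img.shape[:2]
    r_max = np.hypot(h, w) / 2
    radii = np.exp(np.linspace(0, np.log(r_max), n_rings))          # log-spaced rings
    thetas = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    ys = np.clip(center[0] + radii[:, None] * np.sin(thetas), 0, h - 1).astype(int)
    xs = np.clip(center[1] + radii[:, None] * np.cos(thetas), 0, w - 1).astype(int)
    return img[ys, xs]                                              # (n_rings, n_angles)
```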

Decision Fusion Network with Perception Fine-tuning for Defect Classification

  • paper_url: http://arxiv.org/abs/2309.12630
  • repo_url: None
  • paper_authors: Xiaoheng Jiang, Shilong Tian, Zhiwen Zhu, Yang Lu, Hao Liu, Li Chen, Shupan Li, Mingliang Xu
  • for: Detection and classification of surface defects in industrial inspection.
  • methods: The paper proposes a decision fusion network (DFNet) that fuses the semantic decision with the feature decision to strengthen the network's decision ability, together with a perception fine-tuning module (PFM) that refines foreground and background during the segmentation stage.
  • results: Experiments on the public datasets KolektorSDD2 and Magnetic-tile-defect-datasets reach 96.1% AP and 94.6% mAP, respectively.
    Abstract Surface defect inspection is an important task in industrial inspection. Deep learning-based methods have demonstrated promising performance in this domain. Nevertheless, these methods still suffer from misjudgment when encountering challenges such as low-contrast defects and complex backgrounds. To overcome these issues, we present a decision fusion network (DFNet) that incorporates the semantic decision with the feature decision to strengthen the decision ability of the network. In particular, we introduce a decision fusion module (DFM) that extracts a semantic vector from the semantic decision branch and a feature vector for the feature decision branch and fuses them to make the final classification decision. In addition, we propose a perception fine-tuning module (PFM) that fine-tunes the foreground and background during the segmentation stage. PFM generates the semantic and feature outputs that are sent to the classification decision stage. Furthermore, we present an inner-outer separation weight matrix to address the impact of label edge uncertainty during segmentation supervision. Our experimental results on the publicly available datasets including KolektorSDD2 (96.1% AP) and Magnetic-tile-defect-datasets (94.6% mAP) demonstrate the effectiveness of the proposed method.

DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image

  • paper_url: http://arxiv.org/abs/2309.12594
  • repo_url: None
  • paper_authors: Di Liu, Xiang Yu, Meng Ye, Qilong Zhangli, Zhuowei Li, Zhixing Zhang, Dimitris N. Metaxas
  • for: The paper proposes a novel bi-channel Transformer architecture for simultaneously estimating the global and local deformations of primitives.
  • methods: It integrates the Transformer with parameterized deformable models, termed DeFormer, so that complex object shapes can be abstracted with a small number of primitives offering broad geometric coverage and fine detail.
  • results: Extensive experiments on ShapeNet show better reconstruction accuracy than the previous state of the art, with visualizations exhibiting more consistent semantic correspondences for improved interpretability.
    Abstract Accurate 3D shape abstraction from a single 2D image is a long-standing problem in computer vision and graphics. By leveraging a set of primitives to represent the target shape, recent methods have achieved promising results. However, these methods either use a relatively large number of primitives or lack geometric flexibility due to the limited expressibility of the primitives. In this paper, we propose a novel bi-channel Transformer architecture, integrated with parameterized deformable models, termed DeFormer, to simultaneously estimate the global and local deformations of primitives. In this way, DeFormer can abstract complex object shapes while using a small number of primitives which offer a broader geometry coverage and finer details. Then, we introduce a force-driven dynamic fitting and a cycle-consistent re-projection loss to optimize the primitive parameters. Extensive experiments on ShapeNet across various settings show that DeFormer achieves better reconstruction accuracy over the state-of-the-art, and visualizes with consistent semantic correspondences for improved interpretability.

Improving Machine Learning Robustness via Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.12593
  • repo_url: None
  • paper_authors: Long Dang, Thushari Hapuarachchi, Kaiqi Xiong, Jing Lin
  • for: This study investigates machine learning (ML) robustness in centralized and decentralized environments to help design more robust ML algorithms.
  • methods: Adversarial training is applied in both environments, with adversarial examples generated by the Fast Gradient Sign Method (FGSM) and DeepFool.
  • results: In the centralized environment, test accuracies of 65.41% and 83.0% are achieved when classifying FGSM and DeepFool adversarial examples, improvements of 18.41% and 47% over existing studies. In the decentralized environment, federated learning (FL) robustness is studied under IID and non-IID data: with IID data the robust accuracy is comparable to the centralized setting, while with non-IID data natural accuracy drops from 66.23% to 57.82% and robust accuracy decreases by 25% and 23.4% under C&W and PGD attacks, respectively. A proposed IID data-sharing approach raises natural accuracy to 85.04% and robust accuracy from 57% to 72% in C&W attacks and from 59% to 67% in PGD attacks.
    Abstract As Machine Learning (ML) is increasingly used in solving various tasks in real-world applications, it is crucial to ensure that ML algorithms are robust to any potential worst-case noises, adversarial attacks, and highly unusual situations when they are designed. Studying ML robustness will significantly help in the design of ML algorithms. In this paper, we investigate ML robustness using adversarial training in centralized and decentralized environments, where ML training and testing are conducted in one or multiple computers. In the centralized environment, we achieve a test accuracy of 65.41% and 83.0% when classifying adversarial examples generated by Fast Gradient Sign Method and DeepFool, respectively. Comparing to existing studies, these results demonstrate an improvement of 18.41% for FGSM and 47% for DeepFool. In the decentralized environment, we study Federated learning (FL) robustness by using adversarial training with independent and identically distributed (IID) and non-IID data, respectively, where CIFAR-10 is used in this research. In the IID data case, our experimental results demonstrate that we can achieve such a robust accuracy that it is comparable to the one obtained in the centralized environment. Moreover, in the non-IID data case, the natural accuracy drops from 66.23% to 57.82%, and the robust accuracy decreases by 25% and 23.4% in C&W and Projected Gradient Descent (PGD) attacks, compared to the IID data case, respectively. We further propose an IID data-sharing approach, which allows for increasing the natural accuracy to 85.04% and the robust accuracy from 57% to 72% in C&W attacks and from 59% to 67% in PGD attacks.
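
A minimal FGSM adversarial-training step of the kind the paper builds on: each batch is perturbed along the sign of the loss gradient and the model is then trained on the perturbed examples. The epsilon and the [0, 1] pixel range are illustrative choices:

```python
import torch
import torch.nn.functional as F

def fgsm_train_step(model, optimizer, x, y, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()  # FGSM perturbation
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```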

BGF-YOLO: Enhanced YOLOv8 with Multiscale Attentional Feature Fusion for Brain Tumor Detection

  • paper_url: http://arxiv.org/abs/2309.12585
  • repo_url: https://github.com/mkang315/bgf-yolo
  • paper_authors: Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C. -W. Phan
  • for: automated brain tumor detection
  • methods: integrate Bi-level Routing Attention (BRA), Generalized feature pyramid networks (GFPN), and a fourth detecting head into YOLOv8
  • results: a 4.7% absolute increase in mAP$_{50}$ compared to YOLOv8x, achieving state-of-the-art results on the brain tumor detection dataset Br35H
    Abstract You Only Look Once (YOLO)-based object detectors have shown remarkable accuracy for automated brain tumor detection. In this paper, we develop a novel BGF-YOLO architecture by incorporating Bi-level Routing Attention (BRA), Generalized feature pyramid networks (GFPN), and Fourth detecting head into YOLOv8. BGF-YOLO contains an attention mechanism to focus more on important features, and feature pyramid networks to enrich feature representation by merging high-level semantic features with spatial details. Furthermore, we investigate the effect of different attention mechanisms and feature fusions, detection head architectures on brain tumor detection accuracy. Experimental results show that BGF-YOLO gives a 4.7% absolute increase of mAP$_{50}$ compared to YOLOv8x, and achieves state-of-the-art on the brain tumor detection dataset Br35H. The code is available at https://github.com/mkang315/BGF-YOLO.

Classification of Alzheimers Disease with Deep Learning on Eye-tracking Data

  • paper_url: http://arxiv.org/abs/2309.12574
  • repo_url: None
  • paper_authors: Harshinee Sriram, Cristina Conati, Thalia Field
  • for: This paper aims to classify Alzheimer's Disease (AD) from eye-tracking (ET) data using a deep learning classifier trained end-to-end on raw ET data.
  • methods: The proposed method, VTNet, uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data.
  • results: VTNet outperforms state-of-the-art approaches in AD classification, providing encouraging evidence of this model's generality for making predictions from ET data.
    Abstract Existing research has shown the potential of classifying Alzheimers Disease (AD) from eye-tracking (ET) data with classifiers that rely on task-specific engineered features. In this paper, we investigate whether we can improve on existing results by using a Deep-Learning classifier trained end-to-end on raw ET data. This classifier (VTNet) uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data and was previously used to detect user confusion while processing visual displays. A main challenge in applying VTNet to our target AD classification task is that the available ET data sequences are much longer than those used in the previous confusion detection task, pushing the limits of what is manageable by LSTM-based models. We discuss how we address this challenge and show that VTNet outperforms the state-of-the-art approaches in AD classification, providing encouraging evidence on the generality of this model to make predictions from ET data.
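
A sketch of the described parallel design: a GRU consumes the raw gaze sequence (the temporal view) while a CNN consumes a rendered scanpath image (the visual view), and their embeddings are concatenated for classification. All layer sizes are assumptions, not VTNet's actual configuration:

```python
import torch
import torch.nn as nn

class TemporalVisualNet(nn.Module):
    def __init__(self, seq_feats=4, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(seq_feats, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(hidden + 32, n_classes)

    def forward(self, seq, img):            # seq: (B, T, 4), img: (B, 1, H, W)
        _, h_n = self.gru(seq)              # h_n: (num_layers, B, hidden)
        return self.head(torch.cat([h_n[-1], self.cnn(img)], dim=1))
```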

Interpretable 3D Multi-Modal Residual Convolutional Neural Network for Mild Traumatic Brain Injury Diagnosis

  • paper_url: http://arxiv.org/abs/2309.12572
  • repo_url: None
  • paper_authors: Hanem Ellethy, Viktor Vegh, Shekhar S. Chandra
  • for: This study aims to improve the diagnostic accuracy of mild traumatic brain injury (mTBI) and to make the diagnostic model more interpretable.
  • methods: It uses an interpretable 3D Multi-Modal Residual Convolutional Neural Network (MRCNN) enhanced with Occlusion Sensitivity Maps (OSM).
  • results: The MRCNN model achieves an average accuracy of 82.4%, sensitivity of 82.6%, and specificity of 81.6% for mTBI diagnosis, improving on a CT-based Residual Convolutional Neural Network (RCNN) by 4.4% in specificity and 9.0% in accuracy.
    Abstract Mild Traumatic Brain Injury (mTBI) is a significant public health challenge due to its high prevalence and potential for long-term health effects. Despite Computed Tomography (CT) being the standard diagnostic tool for mTBI, it often yields normal results in mTBI patients despite symptomatic evidence. This fact underscores the complexity of accurate diagnosis. In this study, we introduce an interpretable 3D Multi-Modal Residual Convolutional Neural Network (MRCNN) for mTBI diagnostic model enhanced with Occlusion Sensitivity Maps (OSM). Our MRCNN model exhibits promising performance in mTBI diagnosis, demonstrating an average accuracy of 82.4%, sensitivity of 82.6%, and specificity of 81.6%, as validated by a five-fold cross-validation process. Notably, in comparison to the CT-based Residual Convolutional Neural Network (RCNN) model, the MRCNN shows an improvement of 4.4% in specificity and 9.0% in accuracy. We show that the OSM offers superior data-driven insights into CT images compared to the Grad-CAM approach. These results highlight the efficacy of the proposed multi-modal model in enhancing the diagnostic precision of mTBI.
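
A sketch of an occlusion sensitivity map for a 3D classifier: a cube of baseline values is slid over the input volume and the drop in the predicted class probability is recorded at each position. Patch size, stride, and the zero baseline are illustrative:

```python
import torch

@torch.no_grad()
def occlusion_sensitivity_3d(model, vol, target, patch=16, stride=8):
    # vol: (1, C, D, H, W); returns a coarse (D', H', W') sensitivity grid
    base = torch.softmax(model(vol), dim=1)[0, target]
    d, h, w = vol.shape[2:]
    out = []
    for z in range(0, d - patch + 1, stride):
        plane = []
        for y in range(0, h - patch + 1, stride):
            row = []
            for x in range(0, w - patch + 1, stride):
                occluded = vol.clone()
                occluded[:, :, z:z+patch, y:y+patch, x:x+patch] = 0
                p = torch.softmax(model(occluded), dim=1)[0, target]
                row.append((base - p).item())   # large drop = important region
            plane.append(row)
        out.append(plane)
    return torch.tensor(out)
```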

Wave-informed dictionary learning for high-resolution imaging in complex media

  • paper_url: http://arxiv.org/abs/2310.12990
  • repo_url: None
  • paper_authors: Miguel Moscoso, Alexei Novikov, George Papanicolaou, Chrysoula Tsogka
  • for: This paper proposes a method for imaging in scattering media when large and diverse data sets are available.
  • methods: The method has two steps: first, a dictionary learning algorithm estimates the true Green's function vectors as columns of an unordered sensing matrix; second, the columns of the estimated sensing matrix are ordered for imaging using Multi-Dimensional Scaling, with connectivity information derived from cross-correlations of its columns, as in time reversal.
  • results: Simulation experiments show that the method provides images in complex media whose resolution is that of a homogeneous medium.
    Abstract We propose an approach for imaging in scattering media when large and diverse data sets are available. It has two steps. Using a dictionary learning algorithm the first step estimates the true Green's function vectors as columns in an unordered sensing matrix. The array data comes from many sparse sets of sources whose location and strength are not known to us. In the second step, the columns of the estimated sensing matrix are ordered for imaging using Multi-Dimensional Scaling with connectivity information derived from cross-correlations of its columns, as in time reversal. For these two steps to work together we need data from large arrays of receivers so the columns of the sensing matrix are incoherent for the first step, as well as from sub-arrays so that they are coherent enough to obtain the connectivity needed in the second step. Through simulation experiments, we show that the proposed approach is able to provide images in complex media whose resolution is that of a homogeneous medium.
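
A sketch of the second step, under an assumption about how correlations become distances: column cross-correlations are turned into dissimilarities and embedded with Multi-Dimensional Scaling, whose layout recovers the ordering of the columns:

```python
import numpy as np
from sklearn.manifold import MDS

def order_columns(A):
    """A: sensing matrix with unordered columns; returns a 2D MDS layout."""
    A = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    corr = np.abs(A.conj().T @ A)            # pairwise column cross-correlation
    dist = 1.0 - np.clip(corr, 0.0, 1.0)     # high correlation -> small distance
    return MDS(n_components=2, dissimilarity='precomputed').fit_transform(dist)
```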

Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.12557
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Ping Li, Junjie Chen, Li Yuan, Xianghua Xu, Mingli Song
  • for: Improving semi-supervised semantic segmentation, which predicts pixel-level label maps from a few labeled images and an abundance of unlabeled images.
  • methods: Tri-training with a triple-view encoder captures diverse features whose complementary semantics are learned through knowledge distillation; a dual-frequency decoder selects important features by projecting them from the spatial domain to the frequency domain, where a dual-frequency channel attention mechanism models feature importance.
  • results: Extensive experiments on the Pascal VOC 2012 and Cityscapes benchmarks show that the proposed method achieves a good trade-off between precision and inference speed and outperforms competing methods.
    Abstract To alleviate the expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundant of unlabeled images to predict the pixel-level label map with the same size. Previous methods often adopt co-training using two convolutional networks with the same architecture but different initialization, which fails to capture the sufficiently diverse features. This motivates us to use tri-training and develop the triple-view encoder to utilize the encoders with different architectures to derive diverse features, and exploit the knowledge distillation skill to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that require large memory cost. This inspires us to devise a dual-frequency decoder that selects those important features by projecting the features from the spatial domain to the frequency domain, where the dual-frequency channel attention mechanism is introduced to model the feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, \ie, Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.

cs.AI - 2023-09-22

Poster: Self-Supervised Quantization-Aware Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.13220
  • repo_url: None
  • paper_authors: Kaiqi Zhao, Ming Zhao
  • for: Improving the performance of quantized models.
  • methods: The paper proposes a Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework that unifies the forward and backward dynamics of various quantization functions and reframes quantization-aware training as a co-optimization problem, simultaneously minimizing the KL-loss and the discretization error in a self-supervised manner.
  • results: Evaluations against various state-of-the-art QAT works show that SQAKD significantly improves the performance of quantized models without requiring extensive labeled training data.
    Abstract Quantization-aware training (QAT) starts with a pre-trained full-precision model and performs quantization during retraining. However, existing QAT works require supervision from the labels and they suffer from accuracy loss due to reduced precision. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation framework (SQAKD). SQAKD first unifies the forward and backward dynamics of various quantization functions and then reframes QAT as a co-optimization problem that simultaneously minimizes the KL-Loss and the discretization error, in a self-supervised manner. The evaluation shows that SQAKD significantly improves the performance of various state-of-the-art QAT works. SQAKD establishes stronger baselines and does not require extensive labeled training data, potentially making state-of-the-art QAT research more accessible.
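
A sketch of the described co-optimization: a temperature-scaled KL distillation loss between full-precision teacher and quantized student logits, plus a penalty pulling weights toward their quantized values, with no ground-truth labels. The uniform quantizer and the loss weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, n_bits=4):
    scale = w.abs().max() / (2 ** (n_bits - 1) - 1) + 1e-12
    return torch.round(w / scale).clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale

def sqakd_loss(student_logits, teacher_logits, weights, lam=1e-4, T=4.0):
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='batchmean') * T * T
    # Discretization error: pull full-precision weights toward their quantized values.
    disc = sum(((w - uniform_quantize(w).detach()) ** 2).mean() for w in weights)
    return kl + lam * disc                   # no ground-truth labels needed
```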

AI-Copilot for Business Optimisation: A Framework and A Case Study in Production Scheduling

  • paper_url: http://arxiv.org/abs/2309.13218
  • repo_url: None
  • paper_authors: Pivithuru Thejan Amarasinghe, Su Nguyen, Yuan Sun, Damminda Alahakoon
  • for: This paper proposes a large language model (LLM)-based synthesizer of business optimization problem formulations, reducing the human expertise required.
  • methods: It adopts an LLM fine-tuning approach and proposes an AI-Copilot design that uses modularization and prompt engineering to fit complex problem formulations within LLM token limits.
  • results: Experiments show that the approach can synthesize large and complex problem formulations for a typical business optimization problem in production scheduling.
    Abstract Business optimisation refers to the process of finding and implementing efficient and cost-effective means of operation to bring a competitive advantage for businesses. Synthesizing problem formulations is an integral part of business optimisation, which relies on human expertise to construct problem formulations using optimisation languages. Interestingly, with advancements in Large Language Models (LLMs), the human expertise needed in problem formulation can be minimized. However, developing an LLM for problem formulation is challenging, due to training data, token limitations, and lack of appropriate performance metrics. For the requirement of training data, recent attention has been directed towards fine-tuning pre-trained LLMs for downstream tasks rather than training an LLM from scratch for a specific task. In this paper, we adopt an LLM fine-tuning approach and propose an AI-Copilot for business optimisation problem formulation. For token limitations, we introduce modularization and prompt engineering techniques to synthesize complex problem formulations as modules that fit into the token limits of LLMs. Additionally, we design performance evaluation metrics that are better suited for assessing the accuracy and quality of problem formulations. The experiment results demonstrate that with this approach we can synthesize complex and large problem formulations for a typical business optimisation problem in production scheduling.

MISFIT-V: Misaligned Image Synthesis and Fusion using Information from Thermal and Visual

  • paper_url: http://arxiv.org/abs/2309.13216
  • repo_url: https://github.com/Aadharc/Visual_Thermal_Image_Fusion
  • paper_authors: Aadhar Chauhan, Isaac Remy, Danny Broyles, Karen Leung
  • for: This work aims to improve the detection of humans from airborne visual and thermal imagery for Wilderness Search-and-Rescue (WiSAR) teams, increasing search effectiveness and accuracy.
  • methods: It proposes Misaligned Image Synthesis and Fusion using Information from Thermal and Visual (MISFIT-V), a two-pronged unsupervised deep learning approach that combines a Generative Adversarial Network (GAN) with a cross-attention mechanism to capture the most relevant features from each modality.
  • results: Experimental results show that MISFIT-V offers enhanced robustness against misalignment and poor lighting/thermal environmental conditions compared to existing visual-thermal image fusion methods.
    Abstract Detecting humans from airborne visual and thermal imagery is a fundamental challenge for Wilderness Search-and-Rescue (WiSAR) teams, who must perform this function accurately in the face of immense pressure. The ability to fuse these two sensor modalities can potentially reduce the cognitive load on human operators and/or improve the effectiveness of computer vision object detection models. However, the fusion task is particularly challenging in the context of WiSAR due to hardware limitations and extreme environmental factors. This work presents Misaligned Image Synthesis and Fusion using Information from Thermal and Visual (MISFIT-V), a novel two-pronged unsupervised deep learning approach that utilizes a Generative Adversarial Network (GAN) and a cross-attention mechanism to capture the most relevant features from each modality. Experimental results show MISFIT-V offers enhanced robustness against misalignment and poor lighting/thermal environmental conditions compared to existing visual-thermal image fusion methods.

Assessing the Impact of Personality on Affective States from Video Game Communication

  • paper_url: http://arxiv.org/abs/2309.13214
  • repo_url: None
  • paper_authors: Atieh Kashani, Johannes Pfau, Magy Seif El-Nasr
  • for: This paper explores the impact of personality on the way players express themselves affectively in a team-based collaborative alternate reality game.
  • methods: The authors collected chat logs from eleven players over two weeks, labeled them according to their affective state, and assessed the connection between them and the five-factor personality domains and facets using multi-linear regression.
  • results: The study found a series of reasonable correlations between (combinations of) personality variables and expressed affect, including increased confusion predicted by lower self-competence, personal annoyance predicted by vulnerability to stress, and expressing anger more often in players prone to anxiety.
    Abstract Individual differences in personality determine our preferences, traits and values, which should similarly hold for the way we express ourselves. With current advancements and transformations of technology and society, text-based communication has become ordinary and often even surpasses natural voice conversations -- with distinct challenges and opportunities. In this exploratory work, we investigate the impact of personality on how players of a team-based collaborative alternate reality game tend to express themselves affectively. We collected chat logs from eleven players over two weeks, labeled them according to their affective state, and assessed the connection between them and the five-factor personality domains and facets. After applying multi-linear regression, we found a series of reasonable correlations between (combinations of) personality variables and expressed affect -- as increased confusion could be predicted by lower self-competence (C1), personal annoyance by vulnerability to stress (N6), and expressing anger occurred more often in players who are prone to anxiety (N1), less humble and modest (A5), think less carefully before they act (C6), and have higher neuroticism (N). Expanding the data set, sample size and input modalities in subsequent work, we aim to confirm these findings and reveal even more interesting connections that could inform affective computing and games user research equally.

Intent-Aware Autonomous Driving: A Case Study on Highway Merging Scenarios

  • paper_url: http://arxiv.org/abs/2309.13206
  • repo_url: None
  • paper_authors: Nishtha Mahajan, Qi Zhang
  • for: This work uses the communication of intent to facilitate cooperation between autonomous vehicle agents.
  • methods: An intent-sharing task is implemented atop the merging environment in the highway-env simulator, in a simple two-agent setting.
  • results: Intent-sharing is shown to help the receiving vehicle adjust its behavior in highway merging scenarios, improving merging efficiency and safety.
    Abstract In this work, we use the communication of intent as a means to facilitate cooperation between autonomous vehicle agents. Generally speaking, intents can be any reliable information about its future behavior that a vehicle communicates with another vehicle. We implement this as an intent-sharing task atop the merging environment in the simulator of highway-env, which provides a collection of environments for learning decision-making strategies for autonomous vehicles. Under a simple setting between two agents, we carefully investigate how intent-sharing can aid the receiving vehicle in adjusting its behavior in highway merging scenarios.
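
A generic sketch of the setup: the merging vehicle broadcasts its intended next action, and the receiving vehicle's observation is augmented with that intent before its policy acts. The wrapper below is hypothetical and not part of highway-env's API:

```python
import numpy as np

class IntentSharingWrapper:
    """Hypothetical wrapper: augments observations with a partner's intent."""
    def __init__(self, env, n_actions=5):
        self.env, self.n_actions = env, n_actions
        self.last_intent = np.zeros(n_actions)

    def receive_intent(self, action_id):
        self.last_intent = np.eye(self.n_actions)[action_id]   # one-hot intent

    def step(self, action):
        # Gymnasium-style step; the partner's intent is appended to the observation.
        obs, reward, terminated, truncated, info = self.env.step(action)
        obs = np.concatenate([np.asarray(obs, dtype=float).ravel(), self.last_intent])
        return obs, reward, terminated, truncated, info
```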

A Practical Survey on Zero-shot Prompt Design for In-context Learning

  • paper_url: http://arxiv.org/abs/2309.13205
  • repo_url: None
  • paper_authors: Yinheng Li
  • for: This paper surveys prompting techniques, including discrete, continuous, few-shot, and zero-shot prompts, and their impact on large language model (LLM) performance.
  • methods: It reviews approaches to prompt design, including manual design, optimization algorithms, and evaluation methods, for optimizing LLM performance across diverse tasks.
  • results: It summarizes key research findings, covering the methodologies and contributions of prompt engineering as well as the challenges of evaluating prompt performance.
    Abstract The remarkable advancements in large language models (LLMs) have brought about significant improvements in Natural Language Processing(NLP) tasks. This paper presents a comprehensive review of in-context learning techniques, focusing on different types of prompts, including discrete, continuous, few-shot, and zero-shot, and their impact on LLM performance. We explore various approaches to prompt design, such as manual design, optimization algorithms, and evaluation methods, to optimize LLM performance across diverse tasks. Our review covers key research studies in prompt engineering, discussing their methodologies and contributions to the field. We also delve into the challenges faced in evaluating prompt performance, given the absence of a single "best" prompt and the importance of considering multiple metrics. In conclusion, the paper highlights the critical role of prompt design in harnessing the full potential of LLMs and provides insights into the combination of manual design, optimization techniques, and rigorous evaluation for more effective and efficient use of LLMs in various NLP tasks.
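
For concreteness, here are illustrative zero-shot and few-shot prompts of the kinds surveyed; the task and examples are made up for demonstration:

```python
zero_shot = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The battery dies within an hour.\nSentiment:"
)

few_shot = (
    "Review: Great screen, fast shipping.\nSentiment: positive\n"
    "Review: Stopped working after a week.\nSentiment: negative\n"
    "Review: The battery dies within an hour.\nSentiment:"
)
```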

Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts

  • paper_url: http://arxiv.org/abs/2309.13202
  • repo_url: https://github.com/hecta-uom/plaba-mu
  • paper_authors: Zihao Li, Samuel Belkadi, Nicolo Micheletti, Lifeng Han, Matthew Shardlow, Goran Nenadic
  • for: This work aims to improve the readability of biomedical literature by automating abstract simplification with natural language processing (NLP) models, thereby improving public health literacy.
  • methods: It investigates state-of-the-art large language models (LLMs) on biomedical abstract simplification using domain fine-tuning, prompt-based learning (PBL), and control-token mechanisms.
  • results: Evaluated with automatic metrics (BLEU, ROUGE, SARI, and BERTscore) and human judgments, BART-Large with Control Token (BART-L-w-CT) achieves the highest SARI score of 46.54 and T5-base the highest BERTscore of 72.62; in human evaluation, BART-L-w-CT scores better on simplicity (2.9 vs. 2.2), while T5-base scores better on meaning preservation (3.1 vs. 2.6).
    Abstract Biomedical literature often uses complex language and inaccessible professional terminologies. That is why simplification plays an important role in improving public health literacy. Applying Natural Language Processing (NLP) models to automate such tasks allows for quick and direct accessibility for lay readers. In this work, we investigate the ability of state-of-the-art large language models (LLMs) on the task of biomedical abstract simplification, using the publicly available dataset for plain language adaptation of biomedical abstracts (\textbf{PLABA}). The methods applied include domain fine-tuning and prompt-based learning (PBL) on: 1) Encoder-decoder models (T5, SciFive, and BART), 2) Decoder-only GPT models (GPT-3.5 and GPT-4) from OpenAI and BioGPT, and 3) Control-token mechanisms on BART-based models. We used a range of automatic evaluation metrics, including BLEU, ROUGE, SARI, and BERTscore, and also conducted human evaluations. BART-Large with Control Token (BART-L-w-CT) mechanisms reported the highest SARI score of 46.54 and T5-base reported the highest BERTscore 72.62. In human evaluation, BART-L-w-CTs achieved a better simplicity score over T5-Base (2.9 vs. 2.2), while T5-Base achieved a better meaning preservation score over BART-L-w-CTs (3.1 vs. 2.6). We also categorised the system outputs with examples, hoping this will shed some light for future research on this task. Our code, fine-tuned models, and data splits are available at \url{https://github.com/HECTA-UoM/PLABA-MU}

Towards Green AI in Fine-tuning Large Language Models via Adaptive Backpropagation

  • paper_url: http://arxiv.org/abs/2309.13192
  • repo_url: https://github.com/pittisl/greentrainer
  • paper_authors: Kai Huang, Hanyun Yin, Heng Huang, Wei Gao
  • for: This work aims to make fine-tuning large language models (LLMs) more energy-efficient, reducing its environmental impact.
  • methods: It proposes GreenTrainer, a fine-tuning technique that adaptively evaluates the backpropagation cost and accuracy contribution of different tensors and selects the most appropriate set of tensors to train, minimizing total training FLOPs under a given reduction objective.
  • results: Experiments show that, compared to fine-tuning the whole LLM, GreenTrainer saves up to 64% of fine-tuning FLOPs without noticeable accuracy loss, and achieves up to 4% higher model accuracy than existing techniques such as LoRA with on-par FLOPs reduction.
    Abstract Fine-tuning is the most effective way of adapting pre-trained large language models (LLMs) to downstream applications. With the fast growth of LLM-enabled AI applications and democratization of open-sourced LLMs, fine-tuning has become possible for non-expert individuals, but intensively performed LLM fine-tuning worldwide could result in significantly high energy consumption and carbon footprint, which may bring large environmental impact. Mitigating such environmental impact towards Green AI directly correlates to reducing the FLOPs of fine-tuning, but existing techniques on efficient LLM fine-tuning can only achieve limited reduction of such FLOPs, because they ignore the backpropagation cost in fine-tuning. To address this limitation, in this paper we present GreenTrainer, a new LLM fine-tuning technique that adaptively evaluates different tensors' backpropagation costs and contributions to the fine-tuned model accuracy, to minimize the fine-tuning cost by selecting the most appropriate set of tensors in training. Such selection in GreenTrainer is made based on a given objective of FLOPs reduction, which can flexibly adapt to the carbon footprint in energy supply and the need in Green AI. Experiment results over multiple open-sourced LLM models and abstractive summarization datasets show that, compared to fine-tuning the whole LLM model, GreenTrainer can save up to 64% FLOPs in fine-tuning without any noticeable model accuracy loss. Compared to the existing fine-tuning techniques such as LoRA, GreenTrainer can achieve up to 4% improvement on model accuracy with on-par FLOPs reduction.
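
A sketch of the selection idea: freeze every parameter tensor, then re-enable gradients for the tensors with the best estimated accuracy contribution per backward FLOP until a FLOPs budget is exhausted. The scoring and cost estimators (score_of, flops_of) are hypothetical stand-ins for GreenTrainer's actual evaluation:

```python
def select_trainable_tensors(model, flops_of, score_of, budget_flops):
    """flops_of(name) and score_of(name) are assumed estimators per tensor."""
    ranked = sorted(model.named_parameters(),
                    key=lambda kv: score_of(kv[0]) / max(flops_of(kv[0]), 1),
                    reverse=True)
    spent = 0
    for name, param in ranked:
        cost = flops_of(name)
        keep = spent + cost <= budget_flops
        param.requires_grad_(keep)           # skip backprop for unselected tensors
        spent += cost if keep else 0
```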

Masked Discriminators for Content-Consistent Unpaired Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2309.13188
  • repo_url: https://github.com/bonifazstuhr/feamgan
  • paper_authors: Bonifaz Stuhr, Jürgen Brauer, Bernhard Schick, Jordi Gonzàlez
  • for: This paper aims to improve unpaired image-to-image translation, particularly the content inconsistencies encountered in practical applications such as sim-to-real and weather translation.
  • methods: It masks the inputs of a global discriminator for both domains with a content-based mask, adds a local discriminator that operates on pairs of small crops selected with a similarity sampling strategy, and applies feature-attentive denormalization to selectively incorporate content-based statistics into the generator stream.
  • results: Experiments show state-of-the-art performance in photorealistic sim-to-real and weather translation and good performance in day-to-night translation; the paper also proposes the cKVD metric, which builds on sKVD to examine translation quality at the class or category level.
    Abstract A common goal of unpaired image-to-image translation is to preserve content consistency between source images and translated images while mimicking the style of the target domain. Due to biases between the datasets of both domains, many methods suffer from inconsistencies caused by the translation process. Most approaches introduced to mitigate these inconsistencies do not constrain the discriminator, leading to an even more ill-posed training setup. Moreover, none of these approaches is designed for larger crop sizes. In this work, we show that masking the inputs of a global discriminator for both domains with a content-based mask is sufficient to reduce content inconsistencies significantly. However, this strategy leads to artifacts that can be traced back to the masking process. To reduce these artifacts, we introduce a local discriminator that operates on pairs of small crops selected with a similarity sampling strategy. Furthermore, we apply this sampling strategy to sample global input crops from the source and target dataset. In addition, we propose feature-attentive denormalization to selectively incorporate content-based statistics into the generator stream. In our experiments, we show that our method achieves state-of-the-art performance in photorealistic sim-to-real translation and weather translation and also performs well in day-to-night translation. Additionally, we propose the cKVD metric, which builds on the sKVD metric and enables the examination of translation quality at the class or category level.
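
A sketch of the first ingredient: both the real and translated images are multiplied by a shared content-based binary mask before reaching the global discriminator, so it judges matching content regions. How the mask is built is an assumption here:

```python
import torch
import torch.nn.functional as F

def masked_discriminator_loss(disc, real, fake, mask):
    """mask: (B, 1, H, W), 1 where the shared content class is present."""
    real_logits = disc(real * mask)           # judge only the unmasked content
    fake_logits = disc(fake.detach() * mask)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
```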

Diagnosing and exploiting the computational demands of videos games for deep reinforcement learning

  • paper_url: http://arxiv.org/abs/2309.13181
  • repo_url: None
  • paper_authors: Lakshmi Narasimhan Govindarajan, Rex G Liu, Drew Linsley, Alekh Karkada Ashok, Max Reuter, Michael J Frank, Thomas Serre
  • for: This paper asks whether the successes of deep reinforcement learning (dRL) in video games reflect advances in visual representation learning, the effectiveness of reinforcement learning algorithms at discovering better policies, or both.
  • methods: The authors introduce the Learning Challenge Diagnosticator (LCD), a tool that separately measures the perceptual and reinforcement learning demands of a task; applying LCD to the Procgen benchmark yields a novel taxonomy of challenges whose predictions are highly reliable and can instruct algorithmic development.
  • results: LCD reveals multiple failure cases that can occur when optimizing dRL algorithms over entire video game benchmarks like Procgen, and provides a pathway towards more efficient progress.
    Abstract Humans learn by interacting with their environments and perceiving the outcomes of their actions. A landmark in artificial intelligence has been the development of deep reinforcement learning (dRL) algorithms capable of doing the same in video games, on par with or better than humans. However, it remains unclear whether the successes of dRL models reflect advances in visual representation learning, the effectiveness of reinforcement learning algorithms at discovering better policies, or both. To address this question, we introduce the Learning Challenge Diagnosticator (LCD), a tool that separately measures the perceptual and reinforcement learning demands of a task. We use LCD to discover a novel taxonomy of challenges in the Procgen benchmark, and demonstrate that these predictions are both highly reliable and can instruct algorithmic development. More broadly, the LCD reveals multiple failure cases that can occur when optimizing dRL algorithms over entire video game benchmarks like Procgen, and provides a pathway towards more efficient progress.
    摘要 人类通过与环境互动并感知行为结果来学习。人工智能领域的一个里程碑是深度强化学习(dRL)算法的出现,它们能够在电子游戏中做到同样的事情,表现与人类相当甚至更好。然而,dRL 模型的成功究竟反映的是视觉表征学习的进步、强化学习算法发现更优策略的能力,还是两者兼有,目前尚不清楚。为回答这个问题,我们提出了学习挑战诊断器(LCD),一种能够分别度量任务的感知需求和强化学习需求的工具。我们利用 LCD 在 Procgen 基准中发现了一种新的挑战分类体系,并证明这些预测具有高可靠性,能够指导算法开发。更广泛地说,LCD 揭示了在 Procgen 这类完整电子游戏基准上优化 dRL 算法时可能出现的多种失败情况,并为更高效的研究进展提供了路径。

AI Risk Profiles: A Standards Proposal for Pre-Deployment AI Risk Disclosures

  • paper_url: http://arxiv.org/abs/2309.13176
  • repo_url: None
  • paper_authors: Eli Sherman, Ian W. Eisenberg
  • for: 本研究旨在提出一种风险画像标准,用于指导下游决策,包括对进一步风险评估进行分流、为采购和部署提供信息,以及引导监管框架。
  • methods: 本研究使用作者提出的 AI 风险分类法,将文献中提出的各类风险归入高层类别;此外,作者还提出了一种基于模板的方法,将风险信息整理成标准而灵活的结构。
  • results: 作者利用公开可得的信息,将该方法应用于若干知名 AI 系统进行风险画像。结果显示,这种方法有助于消费者更好地理解 AI 系统的风险,并可用于指导下游决策。
    Abstract As AI systems' sophistication and proliferation have increased, awareness of the risks has grown proportionally (Sorkin et al. 2023). In response, calls have grown for stronger emphasis on disclosure and transparency in the AI industry (NTIA 2023; OpenAI 2023b), with proposals ranging from standardizing use of technical disclosures, like model cards (Mitchell et al. 2019), to yet-unspecified licensing regimes (Sindhu 2023). Since the AI value chain is complicated, with actors representing various expertise, perspectives, and values, it is crucial that consumers of a transparency disclosure be able to understand the risks of the AI system the disclosure concerns. In this paper we propose a risk profiling standard which can guide downstream decision-making, including triaging further risk assessment, informing procurement and deployment, and directing regulatory frameworks. The standard is built on our proposed taxonomy of AI risks, which reflects a high-level categorization of the wide variety of risks proposed in the literature. We outline the myriad data sources needed to construct informative Risk Profiles and propose a template-based methodology for collating risk information into a standard, yet flexible, structure. We apply this methodology to a number of prominent AI systems using publicly available information. To conclude, we discuss design decisions for the profiles and future work.
    摘要 随着人工智能系统的复杂性和普及程度不断提高,对其风险的关注也相应增长(Sorkin et al. 2023)。作为回应,业界对人工智能披露与透明度的呼声日益高涨(NTIA 2023;OpenAI 2023b),相关提案从标准化使用模型卡(Mitchell et al. 2019)等技术披露,到尚未明确的许可制度(Sindhu 2023)不等。由于人工智能价值链十分复杂,参与者具有不同的专业知识、视角和价值观,透明度披露的使用者必须能够理解披露所涉及的人工智能系统的风险。在本文中,我们提出了一种风险画像标准,可用于指导下游决策,包括对进一步风险评估进行分流、为采购和部署提供信息,以及引导监管框架。该标准建立在我们提出的人工智能风险分类法之上,该分类法对文献中提出的各类风险进行了高层归类。我们梳理了构建有用风险画像所需的各种数据来源,并提出了一种基于模板的方法,将风险信息整理成标准而灵活的结构。我们利用公开可得的信息,将该方法应用于若干知名的人工智能系统。最后,我们讨论了风险画像的设计决策和未来工作。
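
As an illustration of the template-based methodology, the sketch below shows one plausible data structure for collating risk information into a standard, yet flexible, profile. The category names and fields are hypothetical stand-ins, not the paper's actual taxonomy.

```python
# Illustrative sketch only: a template-style structure for risk profiles.
from dataclasses import dataclass, field

@dataclass
class RiskEntry:
    category: str          # e.g. "misuse", "privacy", "bias" (illustrative)
    severity: str          # e.g. "low" / "medium" / "high"
    evidence: list[str] = field(default_factory=list)  # public sources

@dataclass
class RiskProfile:
    system_name: str
    developer: str
    entries: list[RiskEntry] = field(default_factory=list)

    def add(self, category, severity, *sources):
        self.entries.append(RiskEntry(category, severity, list(sources)))

# usage sketch with made-up values
profile = RiskProfile("ExampleLLM", "ExampleCorp")
profile.add("privacy", "medium", "https://example.com/model-card")
```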

Investigating Efficient Deep Learning Architectures For Side-Channel Attacks on AES

  • paper_url: http://arxiv.org/abs/2309.13170
  • repo_url: None
  • paper_authors: Yohaï-Eliel Berreby, Laurent Sauvage
  • for: 这项研究旨在提高基于深度学习的侧信道攻击在嵌入式密码应用中的效率,并降低所需的计算资源和数据量。
  • methods: 这项研究基于 JAX 框架构建了深度学习侧信道攻击框架,并研究了不同的 Transformer 模型,以复现并改进先前的结果。
  • results: 研究人员在 ANSSI Side-Channel Attack Database (ASCAD) 上实现了一些先前已知的攻击结果,并在这些结果的基础之上做出了进一步的改进。
    Abstract Over the past few years, deep learning has been getting progressively more popular for the exploitation of side-channel vulnerabilities in embedded cryptographic applications, as it offers advantages in terms of the amount of attack traces required for effective key recovery. A number of effective attacks using neural networks have already been published, but reducing their cost in terms of the amount of computing resources and data required is an ever-present goal, which we pursue in this work. We focus on the ANSSI Side-Channel Attack Database (ASCAD), and produce a JAX-based framework for deep-learning-based SCA, with which we reproduce a selection of previous results and build upon them in an attempt to improve their performance. We also investigate the effectiveness of various Transformer-based models.
    摘要 在过去几年中,深度学习在利用嵌入式密码应用的侧信道漏洞方面越来越受欢迎,因为它在有效恢复密钥所需的攻击轨迹数量上具有优势。目前已经发表了许多基于神经网络的有效攻击,但降低其在计算资源和数据量方面的成本始终是一个目标,也是本工作所追求的。我们聚焦于 ANSSI 侧信道攻击数据库(ASCAD),构建了一个基于 JAX 的深度学习侧信道攻击框架,用它复现了一些已有结果,并在此基础上尝试进一步提升性能。我们还研究了多种基于 Transformer 的模型的有效性。

Large Language Models Are Also Good Prototypical Commonsense Reasoners

  • paper_url: http://arxiv.org/abs/2309.13165
  • repo_url: None
  • paper_authors: Chenin Li, Qianglong Chen, Yin Zhang, Yifei Zhang, Hongxiang Yao
  • for: The paper aims to improve the performance of large language models on complex reasoning tasks by developing novel prompts that better support the models' commonsense reasoning abilities.
  • methods: The authors draw inspiration from the outputs of large models for tailored tasks and semi-automatically develop a set of novel prompts from multiple perspectives, including task-relevance, supportive evidence generation, and diverse path decoding.
  • results: The experimental results on the ProtoQA dataset demonstrate that the proposed prompts achieve a new state-of-the-art (SOTA) on the ProtoQA leaderboard, with improvements of 8% and 4% in the Max Answer@1 and Max Incorrect@1 scores, respectively, compared to the previous SOTA model. The generated Chain-of-Thought and knowledge also improve the interpretability of the model.
    Abstract Commonsense reasoning is a pivotal skill for large language models, yet it presents persistent challenges in specific tasks requiring this competence. Traditional fine-tuning approaches can be resource-intensive and potentially compromise a model's generalization capacity. Furthermore, state-of-the-art language models like GPT-3.5 and Claude are primarily accessible through API calls, which makes fine-tuning models challenging. To address these challenges, we draw inspiration from the outputs of large models for tailored tasks and semi-automatically developed a set of novel prompts from several perspectives, including task-relevance, supportive evidence generation (e.g. chain-of-thought and knowledge), diverse path decoding to aid the model. Experimental results on ProtoQA dataset demonstrate that with better designed prompts we can achieve the new state-of-art(SOTA) on the ProtoQA leaderboard, improving the Max Answer@1 score by 8%, Max Incorrect@1 score by 4% (breakthrough 50% for the first time) compared to the previous SOTA model and achieved an improvement on StrategyQA and CommonsenseQA2.0 (3% and 1%, respectively). Furthermore, with the generated Chain-of-Thought and knowledge, we can improve the interpretability of the model while also surpassing the previous SOTA models. We hope that our work can provide insight for the NLP community to develop better prompts and explore the potential of large language models for more complex reasoning tasks.
    摘要 常识推理是大型语言模型的一项关键能力,但在需要这种能力的特定任务中,模型仍然面临持续的挑战。传统的微调方法可能消耗大量资源,并有损模型的泛化能力。此外,GPT-3.5 和 Claude 等最先进的语言模型主要通过 API 调用访问,这使得微调变得困难。为了应对这些挑战,我们从大型模型在定制任务上的输出中获得启发,从多个角度半自动地构建了一组新的提示,包括任务相关性、支持性证据生成(如思维链和知识)以及多路径解码,以辅助模型。在 ProtoQA 数据集上的实验结果表明,通过更好的提示设计,我们在 ProtoQA 排行榜上取得了新的最先进(SOTA)成绩:与之前的 SOTA 模型相比,Max Answer@1 得分提高了 8%,Max Incorrect@1 得分提高了 4%(首次突破 50%),并在 StrategyQA 和 CommonsenseQA2.0 上分别取得了 3% 和 1% 的提升。此外,借助生成的思维链和知识,我们在超越以往 SOTA 模型的同时提高了模型的可解释性。我们希望这项工作能为 NLP 社区设计更好的提示、探索大型语言模型在更复杂推理任务上的潜力提供启发。

GAMIX-VAE: A VAE with Gaussian Mixture Based Posterior

  • paper_url: http://arxiv.org/abs/2309.13160
  • repo_url: None
  • paper_authors: Mariano Rivera
  • for: 这篇论文探讨了变分自编码器(VAE)中的一个关键组成部分——KL 散度(Kullback-Leibler divergence),它在机器学习的生成建模与表征学习中起着重要作用。
  • methods: 该论文用高斯混合描述后验概率分布,重新定义了 ELBO,并添加正则项以防止方差坍缩,同时使用 PatchGAN 判别器来增强纹理的真实感。
  • results: 实验表明,该方法能够生成逼真的人脸,为增强基于 VAE 的生成模型提供了一条可行途径。
    Abstract Variational Autoencoders (VAEs) have become a cornerstone in generative modeling and representation learning within machine learning. This paper explores a nuanced aspect of VAEs, focusing on interpreting the Kullback Leibler (KL) Divergence, a critical component within the Evidence Lower Bound (ELBO) that governs the trade-off between reconstruction accuracy and regularization. While the KL Divergence enforces alignment between latent variable distributions and a prior imposing a structure on the overall latent space but leaves individual variable distributions unconstrained. The proposed method redefines the ELBO with a mixture of Gaussians for the posterior probability, introduces a regularization term to prevent variance collapse, and employs a PatchGAN discriminator to enhance texture realism. Implementation details involve ResNetV2 architectures for both the Encoder and Decoder. The experiments demonstrate the ability to generate realistic faces, offering a promising solution for enhancing VAE based generative models.
    摘要 变分自编码器(VAE)已成为机器学习中生成建模与表征学习的基石。本文探讨 VAE 中一个细微的方面,聚焦于对 KL 散度的解读,它是证据下界(ELBO)中的关键组成部分,决定着重构精度与正则化之间的权衡。KL 散度促使潜变量分布与先验对齐,从而为整个潜空间施加结构,但并不约束单个变量的分布。所提方法用高斯混合后验概率重新定义了 ELBO,引入正则项以防止方差坍缩,并采用 PatchGAN 判别器来增强纹理真实感。实现细节上,编码器和解码器均采用 ResNetV2 结构。实验表明,该方法能够生成逼真的人脸,为增强基于 VAE 的生成模型提供了一条有前景的途径。
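
A small sketch of the central modeling choice, under our own assumptions rather than the paper's code: with a Gaussian-mixture posterior, the KL term of the ELBO has no closed form against a standard-normal prior, so one practical option is a Monte Carlo estimate.

```python
# Sketch: Monte Carlo KL between a mixture-of-Gaussians posterior and a
# standard-normal prior. Shapes and sampling are illustrative assumptions.
import torch
import torch.distributions as D

def mog_kl_mc(logits, mu, logvar, n_samples=8):
    """logits: (B, K); mu, logvar: (B, K, D). Returns a scalar KL estimate."""
    B, K, Dim = mu.shape
    mix = D.MixtureSameFamily(
        D.Categorical(logits=logits),                        # mixture weights
        D.Independent(D.Normal(mu, (0.5 * logvar).exp()), 1))  # K components
    prior = D.Independent(
        D.Normal(torch.zeros(B, Dim), torch.ones(B, Dim)), 1)
    z = mix.sample((n_samples,))                             # (S, B, D)
    return (mix.log_prob(z) - prior.log_prob(z)).mean()      # E_q[log q - log p]

kl = mog_kl_mc(torch.randn(4, 3), torch.randn(4, 3, 16), torch.zeros(4, 3, 16))
```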

Contextual Emotion Estimation from Image Captions

  • paper_url: http://arxiv.org/abs/2309.13136
  • repo_url: None
  • paper_authors: Vera Yang, Archita Srivastava, Yasaman Etesam, Chuxuan Zhang, Angelica Lim
  • for: 这篇论文探索了大型语言模型(LLM)能否支持情境化情绪估计任务:先为图像生成描述,再用 LLM 进行推理。
  • methods: 这篇论文提出了一组针对人脸、身体、互动和环境的自然语言描述词,用其为图像人工生成描述与情绪标注,再由 LLM 基于描述进行情绪推断。
  • results: 研究发现,GPT-3.5(text-davinci-003)能给出与人工标注相当一致、出乎意料地合理的情绪预测,但准确度随情绪概念而异。总体而言,结果表明图像描述加 LLM 的方法具有前景。
    Abstract Emotion estimation in images is a challenging task, typically using computer vision methods to directly estimate people's emotions using face, body pose and contextual cues. In this paper, we explore whether Large Language Models (LLMs) can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference. First, we must understand: how well do LLMs perceive human emotions? And which parts of the information enable them to determine emotions? One initial challenge is to construct a caption that describes a person within a scene with information relevant for emotion perception. Towards this goal, we propose a set of natural language descriptors for faces, bodies, interactions, and environments. We use them to manually generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset. These captions offer an interpretable representation for emotion estimation, towards understanding how elements of a scene affect emotion perception in LLMs and beyond. Secondly, we test the capability of a large language model to infer an emotion from the resulting image captions. We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations, but accuracy can depend on the emotion concept. Overall, the results suggest promise in the image captioning and LLM approach.
    摘要 图像中的情绪估计是一项具有挑战性的任务,通常使用计算机视觉方法,借助人脸、身体姿态和上下文线索直接估计人们的情绪。在本文中,我们探索大型语言模型(LLM)能否支持情境化情绪估计任务:先为图像生成描述,再用 LLM 进行推理。首先,我们需要了解:LLM 对人类情绪的感知能力如何?哪些信息使它们能够判断情绪?一个初始挑战是构建一段描述场景中人物、且包含与情绪感知相关信息的图像描述。为此,我们提出了一组针对人脸、身体、互动和环境的自然语言描述词,并用它们为 EMOTIC 数据集中 331 张图像的子集人工生成了描述和情绪标注。这些描述为情绪估计提供了可解释的表示,有助于理解场景要素如何影响 LLM 及其他方法中的情绪感知。其次,我们测试了大型语言模型从图像描述中推断情绪的能力。我们发现 GPT-3.5(具体为 text-davinci-003 模型)能够给出与人工标注相当一致、出乎意料地合理的情绪预测,但准确度可能随情绪概念而异。总体而言,这些结果表明图像描述加 LLM 的方法具有前景。

Insights from an OTTR-centric Ontology Engineering Methodology

  • paper_url: http://arxiv.org/abs/2309.13130
  • repo_url: None
  • paper_authors: Moritz Blum, Basil Ell, Philipp Cimiano
  • for: This paper is written for the purpose of discussing the use of OTTR templates in ontology engineering for the domain of Material Science.
  • methods: The paper uses a bottom-up and top-down approach to ontology engineering, starting with existing data and using OTTR templates to feed the data into a knowledge graph.
  • results: The paper finds that OTTR templates are useful for communicating with domain experts, and that the engineering process becomes flexible as a result of encapsulating modeling decisions.
    Abstract OTTR is a language for representing ontology modeling patterns, which enables to build ontologies or knowledge bases by instantiating templates. Thereby, particularities of the ontological representation language are hidden from the domain experts, and it enables ontology engineers to, to some extent, separate the processes of deciding about what information to model from deciding about how to model the information, e.g., which design patterns to use. Certain decisions can thus be postponed for the benefit of focusing on one of these processes. To date, only few works on ontology engineering where ontology templates are applied are described in the literature. In this paper, we outline our methodology and report findings from our ontology engineering activities in the domain of Material Science. In these activities, OTTR templates play a key role. Our ontology engineering process is bottom-up, as we begin modeling activities from existing data that is then, via templates, fed into a knowledge graph, and it is top-down, as we first focus on which data to model and postpone the decision of how to model the data. We find, among other things, that OTTR templates are especially useful as a means of communication with domain experts. Furthermore, we find that because OTTR templates encapsulate modeling decisions, the engineering process becomes flexible, meaning that design decisions can be changed at little cost.
    摘要 OTTR 是一种用于表示本体建模模式的语言,可以通过实例化模板来构建本体或知识库。这样,本体表示语言的特殊性对领域专家是隐藏的,并使本体工程师能够在一定程度上将"决定建模哪些信息"与"决定如何建模这些信息"(例如使用哪些设计模式)这两个过程分开。某些决策因此可以推迟,以便专注于其中一个过程。迄今为止,文献中描述的应用本体模板的本体工程工作还很少。在本文中,我们概述了我们的方法,并报告了在材料科学领域本体工程活动中的发现。在这些活动中,OTTR 模板发挥着关键作用。我们的本体工程过程既是自底向上的——建模活动从现有数据出发,再通过模板将数据导入知识图谱;又是自顶向下的——先确定要建模哪些数据,而推迟如何建模这些数据的决策。我们发现,OTTR 模板作为与领域专家沟通的手段特别有用。此外,由于 OTTR 模板封装了建模决策,工程过程变得灵活,即设计决策可以低成本地更改。
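
A toy illustration of how template instantiation can hide modeling decisions: a named template expands arguments into RDF-style triples. This is not the OTTR language itself; the template and IRIs below are made up.

```python
# Sketch: a named template maps arguments to triples, so the caller never
# sees the modeling decisions baked into the template body.
TEMPLATES = {
    "ex:Material": [
        ("{m}", "rdf:type", "ex:Material"),
        ("{m}", "ex:hasDensity", "{density}"),
    ],
}

def expand(template, **args):
    """Instantiate every triple pattern of `template` with `args`."""
    return [tuple(s.format(**args) for s in triple)
            for triple in TEMPLATES[template]]

triples = expand("ex:Material", m="ex:Steel_S235", density='"7850"^^xsd:double')
for t in triples:
    print(" ".join(t), ".")
```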

E(2)-Equivariant Graph Planning for Navigation

  • paper_url: http://arxiv.org/abs/2309.13043
  • repo_url: None
  • paper_authors: Linfeng Zhao, Hongyu Li, Taskin Padir, Huaizu Jiang, Lawson L. S. Wong
  • for: 提高机器人导航的学习效率和稳定性,满足实际应用中的需求。
  • methods: 利用规划中的欧几里得对称性(源自参考系之间的欧几里得变换),实现参数共享和稳定的训练。针对非结构化环境,将导航问题表述为几何图上的规划,并用保持对称性的消息传递网络执行值迭代。此外,还提出了一个可学习的等变层,将特征提升到期望的空间。
  • results: 在五种多样化任务中,涵盖结构化与非结构化环境、已知与未知地图,以及点目标或语义目标,实现了训练效率、稳定性和泛化性的显著改进。
    Abstract Learning for robot navigation presents a critical and challenging task. The scarcity and costliness of real-world datasets necessitate efficient learning approaches. In this letter, we exploit Euclidean symmetry in planning for 2D navigation, which originates from Euclidean transformations between reference frames and enables parameter sharing. To address the challenges of unstructured environments, we formulate the navigation problem as planning on a geometric graph and develop an equivariant message passing network to perform value iteration. Furthermore, to handle multi-camera input, we propose a learnable equivariant layer to lift features to a desired space. We conduct comprehensive evaluations across five diverse tasks encompassing structured and unstructured environments, along with maps of known and unknown, given point goals or semantic goals. Our experiments confirm the substantial benefits on training efficiency, stability, and generalization.
    摘要 机器人导航的学习是一项关键且具有挑战性的任务。真实世界数据稀缺且获取成本高,因此需要高效的学习方法。在本文中,我们利用二维导航规划中的欧几里得对称性——它源自参考系之间的欧几里得变换——来实现参数共享。为了应对非结构化环境,我们将导航问题表述为几何图上的规划,并开发了一个等变消息传递网络来执行值迭代。此外,针对多相机输入,我们提出了一个可学习的等变层,将特征提升到期望的空间。我们在五类不同任务上进行了全面评估,涵盖结构化与非结构化环境、已知与未知地图,以及点目标或语义目标。实验证实了该方法在训练效率、稳定性和泛化方面的显著优势。
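
The discrete core of such a planner — value iteration as message passing on a geometric graph — can be sketched as follows. This omits the paper's E(2)-equivariant layers and learned features; the edge costs and max-aggregation are our simplifying assumptions.

```python
# Sketch: value iteration on a graph, where each node repeatedly aggregates
# "messages" (discounted neighbor values minus edge costs).
import numpy as np

def graph_value_iteration(adj_cost, goal, gamma=0.95, iters=100):
    """adj_cost[i, j]: edge cost (np.inf if no edge); returns node values."""
    n = adj_cost.shape[0]
    v = np.full(n, -1e9)
    v[goal] = 0.0
    for _ in range(iters):
        msg = gamma * v[None, :] - adj_cost     # (n, n) messages from neighbors
        v = np.maximum(v, msg.max(axis=1))      # keep the best incoming value
        v[goal] = 0.0                           # goal value is fixed
    return v

A = np.full((4, 4), np.inf)
for i, j, c in [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]:
    A[i, j] = A[j, i] = c                       # tiny chain graph
print(graph_value_iteration(A, goal=3))
```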

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.13042
  • repo_url: https://github.com/jiahao000/mosaicfusion
  • paper_authors: Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy
  • for: 这篇论文提出了一种新的数据增强方法,以提升大词汇量实例分割器的性能。
  • methods: 该方法基于扩散模型进行数据生成,无需训练,也不依赖任何标注监督,可直接利用现成的文本到图像扩散模型同时生成多个实例及其掩码标注。
  • results: 实验结果表明,该方法能够生成大量合成标注数据,尤其是针对罕见和新类别,有助于提升现有实例分割器的性能。
    Abstract We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion.
    摘要 我们介绍 MosaicFusion,一种简单而有效、基于扩散模型的数据增强方法,用于大词汇量实例分割。我们的方法无需训练,也不依赖任何标注监督。两个关键设计使我们能够把现成的文本到图像扩散模型用作对象实例及其掩码标注的数据集生成器。首先,我们将图像画布划分为若干区域,并以不同文本提示为条件,在一轮扩散过程中同时生成多个实例。其次,我们通过聚合各层和各扩散时间步中与对象提示相关的交叉注意力图,再经过简单的阈值化和边缘感知细化处理,获得相应的实例掩码。无需任何花哨设计,MosaicFusion 就能够生成大量合成标注数据,尤其是针对罕见和新类别。在具有挑战性的 LVIS 长尾与开放词汇基准上的实验结果表明,MosaicFusion 能够显著提升现有实例分割模型的性能,尤其是在罕见和新类别上。代码将在 https://github.com/Jiahao000/MosaicFusion 发布。
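
A sketch of the mask-extraction step only, with random placeholders instead of real cross-attention: aggregate the maps associated with an object prompt across layers and diffusion steps, then threshold. MosaicFusion additionally applies edge-aware refinement, which is omitted here.

```python
# Sketch: turn aggregated cross-attention maps for one text token into a
# binary instance mask via normalization and thresholding.
import numpy as np

def mask_from_attention(attn_maps, thresh=0.5):
    """attn_maps: (layers, steps, H, W) cross-attention for one object token."""
    agg = attn_maps.mean(axis=(0, 1))                    # average layers & steps
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)
    return (agg > thresh).astype(np.uint8)               # binary instance mask

mask = mask_from_attention(np.random.rand(4, 10, 64, 64))
print(mask.sum(), "foreground pixels")
```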

Memory-augmented conformer for improved end-to-end long-form ASR

  • paper_url: http://arxiv.org/abs/2309.13029
  • repo_url: https://github.com/miamoto/conformer-ntm
  • paper_authors: Carlos Carvalho, Alberto Abad
  • for: 改进端到端自动语音识别(ASR)模型,尤其是其在长语句上的表现。
  • methods: 在编码器-解码器结构之间引入外部可微记忆网络(神经图灵机,NTM),以增强模型对长语句的泛化能力。
  • results: 在 Librispeech 的 train-clean-100 和 train-960 训练集上,所提模型在长语句上的表现优于不带记忆的基线 conformer。
    Abstract Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, in general, the performance of these end-to-end models, especially attention-based models, is particularly degraded in the case of long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can enrich the generalization for longer utterances since it allows the system to store and retrieve more information recurrently. Notably, we explore the neural Turing machine (NTM) that results in our proposed Conformer-NTM model architecture for ASR. Experimental results using Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
    摘要 Conformer 最近被提出作为一种有前景的自动语音识别(ASR)建模方法,性能超过了基于循环神经网络的方法和 Transformer。然而,总体而言,这些端到端模型(尤其是基于注意力的模型)在长语句情况下性能明显下降。为了解决这一限制,我们提议在 conformer 的编码器和解码器之间加入一个完全可微的记忆增强神经网络。这种外部记忆允许系统反复存储和检索更多信息,从而增强对较长语句的泛化能力。特别地,我们研究了神经图灵机(NTM),由此得到我们提出的用于 ASR 的 Conformer-NTM 模型结构。在 Librispeech train-clean-100 和 train-960 集上的实验结果表明,所提系统在长语句上的表现优于不带记忆的基线 conformer。
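
A minimal sketch of a content-addressed read from an external memory placed between encoder and decoder, in the spirit of an NTM; the shapes and sharpening factor are assumptions, not the released Conformer-NTM code.

```python
# Sketch: NTM-style content-based addressing. The encoder state queries the
# memory by cosine similarity; the read vector can be fed to the decoder.
import torch
import torch.nn.functional as F

def ntm_read(memory, key, beta=5.0):
    """memory: (N, D) slots; key: (B, D) query from the encoder stream."""
    sim = F.cosine_similarity(key.unsqueeze(1), memory.unsqueeze(0), dim=-1)
    w = F.softmax(beta * sim, dim=-1)        # (B, N) addressing weights
    return w @ memory                        # (B, D) read vector

mem = torch.randn(128, 256)                  # external memory slots
enc_state = torch.randn(4, 256)              # batch of encoder states
read = ntm_read(mem, enc_state)              # combined with the decoder input
```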

OpportunityFinder: A Framework for Automated Causal Inference

  • paper_url: http://arxiv.org/abs/2309.13103
  • repo_url: None
  • paper_authors: Huy Nguyen, Prince Grover, Devashish Khatwani
  • for: 一个面向非专家用户的无代码框架,用于对面板数据(panel data)执行多种因果推断研究。
  • methods: 用户只需提供原始观察数据和配置文件;随后触发的管道会检查/处理数据,选择合适的算法执行因果研究,并返回处理对目标结果的因果影响,以及敏感性和稳健性结果。
  • results: 框架返回处理对所配置结果的因果影响,并附带敏感性与稳健性分析结果。
    Abstract We introduce OpportunityFinder, a code-less framework for performing a variety of causal inference studies with panel data for non-expert users. In its current state, OpportunityFinder only requires users to provide raw observational data and a configuration file. A pipeline is then triggered that inspects/processes data, chooses the suitable algorithm(s) to execute the causal study. It returns the causal impact of the treatment on the configured outcome, together with sensitivity and robustness results. Causal inference is widely studied and used to estimate the downstream impact of individual's interactions with products and features. It is common that these causal studies are performed by scientists and/or economists periodically. Business stakeholders are often bottle-necked on scientist or economist bandwidth to conduct causal studies. We offer OpportunityFinder as a solution for commonly performed causal studies with four key features: (1) easy to use for both Business Analysts and Scientists, (2) abstraction of multiple algorithms under a single I/O interface, (3) support for causal impact analysis under binary treatment with panel data and (4) dynamic selection of algorithm based on scale of data.
    摘要 我们介绍OpportunityFinder,一个无程式码框架,用于实现各种对组合数据进行可能性推论的不专家用户。目前情况下,OpportunityFinder只需用户提供原始观察数据和配置文件,然后触发一个管道,将数据进行检查和处理,选择适当的算法来执行可能性研究。它返回对定结果的影响,以及敏感度和稳定性结果。可能性推论广泛研究和使用,用于估计个人对产品和功能互动所产生的下游影响。这些可能性研究通常由科学家和/或经济学家定期进行。企业决策者往往因为科学家或经济学家的专业压力而受到瓶颈。我们提供OpportunityFinder作为常见的可能性研究解决方案,具有以下四个关键特点:1. 易用,适合商业分析师和科学家使用。2. 多种算法的抽象,通过单一的输入界面进行处理。3. 支持对组合数据进行可能性影响分析,并且仅需进行二进制对待。4. 基于数据的尺度进行动态算法选择。

A Hybrid Deep Learning-based Approach for Optimal Genotype by Environment Selection

  • paper_url: http://arxiv.org/abs/2309.13021
  • repo_url: None
  • paper_authors: Zahra Khalilzadeh, Motahareh Kashanian, Saeed Khaki, Lizhi Wang
  • for: The paper aims to improve crop yield prediction by integrating weather data across the growing season, especially for different crop varieties, to understand their adaptability in the face of climate change.
  • methods: The authors used a dataset of 93,028 training records and 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). They developed two novel convolutional neural network (CNN) architectures, the CNN-DNN model and the CNN-LSTM-DNN model, and used the Generalized Ensemble Method (GEM) to determine optimal model weights.
  • results: The GEM model achieved lower RMSE (5.55% to 39.88%), reduced MAE (5.34% to 43.76%), and higher correlation coefficients (1.1% to 10.79%) compared to baseline models. The CNN-DNN model was used to identify top-performing genotypes for various locations and weather conditions, aiding genotype selection based on weather variables.
    Abstract Precise crop yield prediction is essential for improving agricultural practices and ensuring crop resilience in varying climates. Integrating weather data across the growing season, especially for different crop varieties, is crucial for understanding their adaptability in the face of climate change. In the MLCAS2021 Crop Yield Prediction Challenge, we utilized a dataset comprising 93,028 training records to forecast yields for 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). This dataset included details on 5,838 distinct genotypes and daily weather data for a 214-day growing season, enabling comprehensive analysis. As one of the winning teams, we developed two novel convolutional neural network (CNN) architectures: the CNN-DNN model, combining CNN and fully-connected networks, and the CNN-LSTM-DNN model, with an added LSTM layer for weather variables. Leveraging the Generalized Ensemble Method (GEM), we determined optimal model weights, resulting in superior performance compared to baseline models. The GEM model achieved lower RMSE (5.55% to 39.88%), reduced MAE (5.34% to 43.76%), and higher correlation coefficients (1.1% to 10.79%) when evaluated on test data. We applied the CNN-DNN model to identify top-performing genotypes for various locations and weather conditions, aiding genotype selection based on weather variables. Our data-driven approach is valuable for scenarios with limited testing years. Additionally, a feature importance analysis using RMSE change highlighted the significance of location, MG, year, and genotype, along with the importance of weather variables MDNI and AP.
    摘要 精确的作物产量预测对于改进农业实践、确保作物在多变气候下的韧性至关重要。整合整个生长季的气象数据(尤其是针对不同作物品种)对于理解其在气候变化面前的适应性十分关键。在 MLCAS2021 作物产量预测挑战赛中,我们使用了包含 93,028 条训练记录的数据集,为 10,337 条测试记录预测产量,覆盖 28 个美国州和加拿大省份的 159 个地点、共 13 年(2003-2015)。该数据集包含 5,838 个不同基因型的信息以及 214 天生长季的逐日气象数据,使全面分析成为可能。作为获胜队伍之一,我们开发了两种新的卷积神经网络(CNN)架构:结合 CNN 与全连接网络的 CNN-DNN 模型,以及为气象变量增加 LSTM 层的 CNN-LSTM-DNN 模型。借助广义集成方法(GEM),我们确定了最优模型权重,性能优于基线模型。在测试数据上,GEM 模型取得了更低的 RMSE(5.55% 至 39.88%)、更小的 MAE(5.34% 至 43.76%)和更高的相关系数(1.1% 至 10.79%)。我们应用 CNN-DNN 模型为不同地点和气象条件识别表现最佳的基因型,从而辅助基于气象变量的基因型选择。这种数据驱动的方法对测试年份有限的场景尤为有价值。此外,基于 RMSE 变化的特征重要性分析突出了地点、MG、年份和基因型的重要性,以及气象变量 MDNI 和 AP 的重要性。
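
One simple way to realize a GEM-style weighting is a random search over the probability simplex for convex model weights that minimize validation RMSE; the paper's exact procedure may differ, so treat this as an assumption-laden sketch.

```python
# Sketch: search convex ensemble weights minimizing RMSE on held-out data.
import numpy as np

def gem_weights(preds, y, n_trials=20000, seed=0):
    """preds: (M, N) predictions of M base models; y: (N,) targets."""
    rng = np.random.default_rng(seed)
    best_w, best_rmse = None, np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(preds.shape[0]))       # random simplex point
        rmse = np.sqrt(np.mean((w @ preds - y) ** 2))
        if rmse < best_rmse:
            best_w, best_rmse = w, rmse
    return best_w, best_rmse

preds = np.vstack([np.random.rand(100), np.random.rand(100)])  # two toy models
w, rmse = gem_weights(preds, np.random.rand(100))
```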

Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design

  • paper_url: http://arxiv.org/abs/2309.13015
  • repo_url: None
  • paper_authors: Chao Fang, Wei Sun, Aojun Zhou, Zhongfeng Wang
  • for: 这篇论文主要研究如何利用稀疏训练来降低深度神经网络(DNN)的计算成本,同时保持高精度。
  • methods: 论文提出了一种计算高效的训练方案,涵盖算法、体系结构和数据流的协同设计。在算法层面,提出了双向权重剪枝方法(BDWP),在前向和反向传播中利用 N:M 稀疏性来降低计算成本,同时保持模型精度。在体系结构层面,提出了面向 DNN 训练的稀疏加速器(SAT),既支持常规的稠密运算,也支持计算高效的 N:M 稀疏运算。在数据流层面,提出了交错映射、N:M 稀疏权重的预生成和离线调度等多种优化方法,以提升 SAT 的计算效率。
  • results: 实验结果显示,在 Xilinx VCU1525 FPGA 卡上对多种 DNN 模型和数据集进行测试,采用 BDWP 稀疏训练方法的 SAT 加速器在 2:8 稀疏比下相比稠密训练平均加速 1.75 倍,平均精度损失仅 0.56%。此外,与先前的 FPGA 加速器相比,该训练方案将训练吞吐量提高 2.97~25.22 倍,能效提升 1.36~3.58 倍。
    Abstract Sparse training is one of the promising techniques to reduce the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of consecutive M elements can be nonzero, has attracted attention due to its hardware-friendly pattern and capability of achieving a high sparse ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design. At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both forward and backward passes of DNN training, which can significantly reduce the computational cost while maintaining model accuracy. At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to neatly support both the regular dense operations and the computation-efficient N:M sparse operations. At the dataflow level, multiple optimization methods ranging from interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally, the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA card using various DNN models and datasets. Experimental results show the SAT accelerator with the BDWP sparse training method under 2:8 sparse ratio achieves an average speedup of 1.75x over that with the dense training, accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our proposed training scheme significantly improves the training throughput by 2.97~25.22x and the energy efficiency by 1.36~3.58x over prior FPGA-based accelerators.
    摘要 稀疏训练是一种有前途的技术,能够在保持高精度的同时降低 DNN 的计算成本。特别是 N:M 细粒度结构化稀疏(每 M 个连续元素中最多 N 个非零)因其硬件友好的模式和可实现高稀疏比的能力而受到关注。然而,N:M 稀疏 DNN 训练的加速潜力尚未被充分挖掘,也缺乏支持 N:M 稀疏训练的高效硬件。为了解决这些挑战,本文通过算法、体系结构和数据流的协同设计,提出了一种计算高效的 N:M 稀疏 DNN 训练方案。在算法层面,提出了双向权重剪枝方法 BDWP,可以在 DNN 训练的前向和反向传播中利用权重的 N:M 稀疏性,大幅降低计算成本并保持模型精度。在体系结构层面,开发了面向 DNN 训练的稀疏加速器 SAT,既支持常规稠密运算,也支持计算高效的 N:M 稀疏运算。在数据流层面,提出了交错映射、N:M 稀疏权重预生成和离线调度等多种优化方法,以提升 SAT 的计算效率。实验结果表明,在 Xilinx VCU1525 FPGA 卡上,采用 BDWP 稀疏训练方法的 SAT 加速器在 2:8 稀疏比下相比稠密训练实现了平均 1.75 倍的加速,平均精度损失仅为 0.56%。此外,与先前的 FPGA 加速器相比,我们的训练方案将训练吞吐量提高 2.97~25.22 倍,能效提升 1.36~3.58 倍。
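
A minimal sketch of N:M fine-grained structured sparsity as used by BDWP-style pruning: within every group of M consecutive weights, keep the N largest magnitudes and zero the rest.

```python
# Sketch: build an N:M sparsity mask (e.g., 2:8) over a weight tensor whose
# element count is divisible by M.
import torch

def nm_sparsify(w, n=2, m=8):
    flat = w.reshape(-1, m)                                  # groups of M
    idx = flat.abs().topk(n, dim=1).indices                  # top-N per group
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(w.shape)

w = torch.randn(4, 16)
print((nm_sparsify(w) != 0).float().mean())                  # ~ n/m nonzeros
```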

ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs

  • paper_url: http://arxiv.org/abs/2309.13007
  • repo_url: https://github.com/dinobby/reconcile
  • paper_authors: Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal
  • for: 提升大型语言模型(LLM)的复杂推理能力
  • methods: 提出 ReConcile 框架,让多个 LLM 智能体以圆桌会议的形式进行多轮讨论,促进智能体之间多样化的思考与沟通,并采用按置信度加权的投票机制
  • results: 在多个基准上的实验表明,ReConcile 可显著提升 LLM 的复杂推理能力,比先前的单智能体和多智能体基线高出 7.7%,并在部分数据集上超过 GPT-4 的表现。
    Abstract Large Language Models (LLMs) still struggle with complex reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents to foster diverse thoughts and discussion for improved consensus. ReConcile enhances the reasoning capabilities of LLMs by holding multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. This discussion prompt enables each agent to revise their responses in light of insights from other agents. Once a consensus is reached and the discussion ends, ReConcile determines the final answer by leveraging the confidence of each agent in a weighted voting scheme. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Our experimental results on various benchmarks demonstrate that ReConcile significantly enhances the reasoning performance of the agents (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and also outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 itself as one of the agents in ReConcile and demonstrate that its initial performance also improves by absolute 10.0% through discussion and feedback from other agents. Finally, we also analyze the accuracy after every round and observe that ReConcile achieves better and faster consensus between agents, compared to a multi-agent debate baseline. Our code is available at: https://github.com/dinobby/ReConcile
    摘要 大型语言模型(LLM)在复杂推理任务上仍有困难。受"心智社会"(Minsky, 1988)启发,我们提出 ReConcile:一个多模型多智能体框架,其设计类似于多个不同 LLM 智能体之间的圆桌会议,以促进多样化的思考和讨论,从而改进共识。ReConcile 通过进行多轮讨论、学会说服其他智能体改进答案,以及采用按置信度加权的投票机制,来增强 LLM 的推理能力。在每一轮中,ReConcile 通过"讨论提示"发起智能体之间的讨论,该提示包括:(a)上一轮各智能体生成的分组答案和解释;(b)它们的不确定性;(c)用于说服其他智能体的、能纠正答案的人工解释示例。这种讨论提示使每个智能体都能根据其他智能体的见解修改自己的回答。当达成共识、讨论结束后,ReConcile 利用各智能体的置信度,通过加权投票确定最终答案。我们用 ChatGPT、Bard 和 Claude2 作为三个智能体实现了 ReConcile。在多个基准上的实验结果表明,ReConcile 显著提升了智能体(个体及整体)的推理性能,比先前的单智能体和多智能体基线高出 7.7%,并在其中一些数据集上超过了 GPT-4。我们还将 GPT-4 本身作为 ReConcile 中的一个智能体进行实验,结果表明,通过与其他智能体的讨论和反馈,其初始性能也绝对提升了 10.0%。最后,我们分析了每轮讨论后的准确率,发现与多智能体辩论基线相比,ReConcile 能让智能体之间更快地达成更好的共识。我们的代码见:https://github.com/dinobby/ReConcile
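
The final aggregation step can be sketched in a few lines: each agent reports an answer with a self-estimated confidence, and the answer with the largest total confidence wins. The agent outputs below are placeholders.

```python
# Sketch: confidence-weighted voting over answers from multiple agents.
from collections import defaultdict

def confidence_weighted_vote(agent_outputs):
    """agent_outputs: list of (answer, confidence in [0, 1]) tuples."""
    scores = defaultdict(float)
    for answer, conf in agent_outputs:
        scores[answer] += conf                 # accumulate per-answer weight
    return max(scores, key=scores.get)

print(confidence_weighted_vote([("A", 0.9), ("B", 0.6), ("A", 0.3)]))  # -> "A"
```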

Pursuing Counterfactual Fairness via Sequential Autoencoder Across Domains

  • paper_url: http://arxiv.org/abs/2309.13005
  • repo_url: None
  • paper_authors: Yujie Lin, Chen Zhao, Minglai Shao, Baoluo Meng, Xujiang Zhao, Haifeng Chen
  • for: 提升机器学习系统在分布外数据上的性能,并在数据分布跨序列域逐渐演化的过程中保持公平性。
  • methods: 提出了名为基于序列自编码器的反事实公平感知域泛化(CDSAE)的创新框架,将环境信息和敏感属性从分类特征的嵌入表示中分离,从而提升模型在多样且陌生域上的泛化性能,并有效解决不公平分类问题。
  • results: 通过在合成和真实世界数据集上的验证,证明了该方法能够在数据分布逐渐演化的过程中提升准确率,同时保持公平性。
    Abstract Recognizing the prevalence of domain shift as a common challenge in machine learning, various domain generalization (DG) techniques have been developed to enhance the performance of machine learning systems when dealing with out-of-distribution (OOD) data. Furthermore, in real-world scenarios, data distributions can gradually change across a sequence of sequential domains. While current methodologies primarily focus on improving model effectiveness within these new domains, they often overlook fairness issues throughout the learning process. In response, we introduce an innovative framework called Counterfactual Fairness-Aware Domain Generalization with Sequential Autoencoder (CDSAE). This approach effectively separates environmental information and sensitive attributes from the embedded representation of classification features. This concurrent separation not only greatly improves model generalization across diverse and unfamiliar domains but also effectively addresses challenges related to unfair classification. Our strategy is rooted in the principles of causal inference to tackle these dual issues. To examine the intricate relationship between semantic information, sensitive attributes, and environmental cues, we systematically categorize exogenous uncertainty factors into four latent variables: 1) semantic information influenced by sensitive attributes, 2) semantic information unaffected by sensitive attributes, 3) environmental cues influenced by sensitive attributes, and 4) environmental cues unaffected by sensitive attributes. By incorporating fairness regularization, we exclusively employ semantic information for classification purposes. Empirical validation on synthetic and real-world datasets substantiates the effectiveness of our approach, demonstrating improved accuracy levels while ensuring the preservation of fairness in the evolving landscape of continuous domains.
    摘要 域偏移是机器学习中的普遍挑战,为此人们提出了多种域泛化(DG)技术,以提升机器学习系统在分布外(OOD)数据上的表现。此外,在真实场景中,数据分布可能沿着一系列先后出现的域逐渐变化。现有方法主要关注在这些新域中提升模型效果,却往往忽视了学习过程中的公平性问题。为此,我们提出了一个创新框架:基于序列自编码器的反事实公平感知域泛化(CDSAE)。该方法将环境信息和敏感属性从分类特征的嵌入表示中有效分离,这种并行分离不仅显著提升了模型在多样且陌生域上的泛化能力,还有效解决了不公平分类的相关挑战。我们的策略根植于因果推断原理,以同时应对这两个问题。为了考察语义信息、敏感属性与环境线索之间的复杂关系,我们将外生不确定性因素系统地划分为四个潜变量:(1)受敏感属性影响的语义信息;(2)不受敏感属性影响的语义信息;(3)受敏感属性影响的环境线索;(4)不受敏感属性影响的环境线索。通过引入公平性正则化,我们仅使用语义信息进行分类。在合成数据集和真实数据集上的实证验证表明了该方法的有效性:在持续演化的连续域环境中,模型在提升准确率的同时保持了公平性。

Audience-specific Explanations for Machine Translation

  • paper_url: http://arxiv.org/abs/2309.12998
  • repo_url: None
  • paper_authors: Renhan Lou, Jan Niehues
  • for: 解决机器翻译中的一个问题:某些词语即便翻译正确,也会因文化背景差异而让目标语言受众难以理解。
  • methods: 提出了一种半自动技术,从大规模平行语料库中抽取这类解释。
  • results: 该方法在英语->德语、英语->法语和英语->中文语对上均取得了较好的结果:抽取出的句子中有超过 10% 包含解释,而原始句子中只有 1.9% 包含解释。
    Abstract In machine translation, a common problem is that the translation of certain words even if translated can cause incomprehension of the target language audience due to different cultural backgrounds. A solution to solve this problem is to add explanations for these words. In a first step, we therefore need to identify these words or phrases. In this work we explore techniques to extract example explanations from a parallel corpus. However, the sparsity of sentences containing words that need to be explained makes building the training dataset extremely difficult. In this work, we propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on English->German language pair show that our method is able to extract sentence so that more than 10% of the sentences contain explanation, while only 1.9% of the original sentences contain explanations. In addition, experiments on English->French and English->Chinese language pairs also show similar conclusions. This is therefore an essential first automatic step to create a explanation dataset. Furthermore we show that the technique is robust for all three language pairs.
    摘要 在机器翻译中,一个常见的问题是:某些词语即便翻译正确,也可能因文化背景不同而令目标语言受众难以理解。解决该问题的一种方案是为这些词语添加解释。因此,第一步需要识别这些词语或短语。在这项工作中,我们探索了从平行语料库中抽取示例解释的技术。然而,包含需要解释的词语的句子十分稀少,使得构建训练数据集极为困难。为此,我们提出了一种半自动技术,从大规模平行语料库中抽取这些解释。在英语->德语语对上的实验表明,我们的方法抽取出的句子中有超过 10% 包含解释,而原始句子中只有 1.9% 包含解释。此外,在英语->法语和英语->中文语对上的实验也得出了类似结论。因此,这是构建解释数据集的关键的第一步自动化工作。我们还表明该技术对这三种语对都具有鲁棒性。

Higher-order Graph Convolutional Network with Flower-Petals Laplacians on Simplicial Complexes

  • paper_url: http://arxiv.org/abs/2309.12971
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Yiming Huang, Yujie Zeng, Qiang Wu, Linyuan Lü
  • for: 这篇论文旨在利用单纯复形(SCs)建模高阶交互,以增强图神经网络(GNN)的表达能力。
  • methods: 该方法基于花瓣(Flower-Petals, FP)模型,将 FP 拉普拉斯算子引入 SCs,并使用可学习的图滤波器来识别不同高阶交互的强度。
  • results: 实验结果表明,所提模型在多种图任务上达到了最先进(SOTA)性能,并为探索图中的高阶交互提供了可扩展且灵活的解决方案。
    Abstract Despite the recent successes of vanilla Graph Neural Networks (GNNs) on many tasks, their foundation on pairwise interaction networks inherently limits their capacity to discern latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach exploiting the rich mathematical theory of simplicial complexes (SCs) - a robust tool for modeling higher-order interactions. Current SC-based GNNs are burdened by high complexity and rigidity, and quantifying higher-order interaction strengths remains challenging. Innovatively, we present a higher-order Flower-Petals (FP) model, incorporating FP Laplacians into SCs. Further, we introduce a Higher-order Graph Convolutional Network (HiGCN) grounded in FP Laplacians, capable of discerning intrinsic features across varying topological scales. By employing learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns where the filters' weights serve as a quantifiable measure of higher-order interaction strengths. The theoretical underpinnings of HiGCN's advanced expressiveness are rigorously demonstrated. Additionally, our empirical investigations reveal that the proposed model accomplishes state-of-the-art (SOTA) performance on a range of graph tasks and provides a scalable and flexible solution to explore higher-order interactions in graphs.
    摘要 尽管近来原始图神经网络(GNN)在许多任务上表现出色,但它们以成对交互网络为基础,天然限制了其辨别复杂系统中潜在高阶交互的能力。为弥补这一能力差距,我们提出了一种新方法,利用单纯复形(SC)这一建模高阶交互的有力工具所具有的丰富数学理论。现有基于 SC 的 GNN 受制于高复杂度和僵化性,并且高阶交互强度的量化仍然是一大挑战。我们创新性地提出了高阶花瓣(FP)模型,将 FP 拉普拉斯算子引入 SC;进而提出了基于 FP 拉普拉斯算子的高阶图卷积网络(HiGCN),能够在不同拓扑尺度上辨别内在特征。通过在每个 FP 拉普拉斯域内使用一组可学习的图滤波器参数,我们可以识别多种交互模式,而滤波器的权重则可作为高阶交互强度的量化度量。我们严格论证了 HiGCN 更强表达能力的理论基础。此外,实证研究表明,所提模型在一系列图任务上取得了最先进(SOTA)性能,并为探索图中的高阶交互提供了可扩展且灵活的解决方案。
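
A sketch of the filtering idea under our assumptions: a graph convolution with a learnable polynomial filter over a Laplacian, where the learned coefficients play the role of quantifiable interaction strengths per propagation scale. The toy matrix below merely stands in for a Flower-Petals Laplacian.

```python
# Sketch: y = sum_k theta_k * L^k x, with learnable coefficients theta_k.
import torch
import torch.nn as nn

class PolyGraphFilter(nn.Module):
    def __init__(self, order=3):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(order + 1) / (order + 1))

    def forward(self, L, x):
        out, h = self.theta[0] * x, x
        for k in range(1, len(self.theta)):
            h = L @ h                            # propagate one more hop
            out = out + self.theta[k] * h
        return out

L = torch.eye(5) - torch.ones(5, 5) / 5          # toy Laplacian-like matrix
y = PolyGraphFilter()(L, torch.randn(5, 8))
```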

Trusta: Reasoning about Assurance Cases with Formal Methods and Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12941
  • repo_url: None
  • paper_authors: Zezhong Chen, Yuxin Deng, Wenjie Du
  • for: This paper focuses on the development of a tool called Trustworthiness Derivation Tree Analyzer (Trusta) that automates the construction and verification of assurance cases for safety-critical systems.
  • methods: The tool uses formal methods, such as Prolog and constraint solvers like Z3 and MONA, to automatically reason about assurance cases. It also utilizes large language models like ChatGPT-3.5, ChatGPT-4, and PaLM 2 to generate and evaluate assurance cases, allowing for interactive human examination and modification.
  • results: The paper presents several industrial case studies that demonstrate the practical value of Trusta in finding subtle issues that are typically missed in manual inspection, and shows that the tool can quickly and efficiently enhance the assurance case development process.
    Abstract Assurance cases can be used to argue for the safety of products in safety engineering. In safety-critical areas, the construction of assurance cases is indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases by incorporating formal methods, rendering it possible for automatic reasoning about assurance cases. We present Trustworthiness Derivation Tree Analyzer (Trusta), a desktop application designed to automatically construct and verify TDTs. The tool has a built-in Prolog interpreter in its backend, and is supported by the constraint solvers Z3 and MONA. Therefore, it can solve constraints about logical formulas involving arithmetic, sets, Horn clauses etc. Trusta also utilizes large language models to make the creation and evaluation of assurance cases more convenient. It allows for interactive human examination and modification. We evaluated top language models like ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests showed a 50%-80% similarity between machine-generated and human-created cases. In addition, Trusta can extract formal constraints from text in natural languages, facilitating an easier interpretation and validation process. This extraction is subject to human review and correction, blending the best of automated efficiency with human insight. To our knowledge, this marks the first integration of large language models in automatic creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. Through several industrial case studies, Trusta has proven to quickly find some subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
    摘要 在安全工程中,保证案例(assurance case)可用于论证产品的安全性。在安全关键领域,构建保证案例是不可或缺的。可信性推导树(TDT)通过引入形式化方法来增强保证案例,使对保证案例的自动推理成为可能。我们介绍 Trustworthiness Derivation Tree Analyzer(Trusta),一款用于自动构建和验证 TDT 的桌面应用。该工具后端内置 Prolog 解释器,并由约束求解器 Z3 和 MONA 提供支持,因此能够求解涉及算术、集合、Horn 子句等逻辑公式的约束。Trusta 还利用大型语言模型,使保证案例的创建与评估更加便捷,并支持交互式的人工检查和修改。我们评估了 ChatGPT-3.5、ChatGPT-4 和 PaLM 2 等顶尖语言模型生成保证案例的能力,测试显示机器生成的案例与人工创建的案例有 50%-80% 的相似度。此外,Trusta 能够从自然语言文本中提取形式化约束,使解释和验证过程更为容易。这一提取过程须经人工审核和修正,将自动化的效率与人类的洞察力相结合。据我们所知,这是首次将大型语言模型集成到保证案例的自动创建与推理中,为这一传统难题带来了新方法。多个工业案例研究表明,Trusta 能快速发现一些人工检查通常会遗漏的细微问题,证明了它在改进保证案例开发过程中的实用价值。
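
To illustrate the kind of backend reasoning such a tool can delegate to a solver, here is a small Z3 check (pip install z3-solver) of a made-up arithmetic claim, verified by asserting its negation and testing satisfiability. The claim and bounds are invented for illustration.

```python
# Sketch: verify a toy safety claim with Z3 by refuting its negation.
from z3 import Int, Solver, sat

hazard_rate, mitigations = Int("hazard_rate"), Int("mitigations")
s = Solver()
s.add(hazard_rate >= 0, hazard_rate <= 12, mitigations >= 0)  # assumed domain
# Claim: with >= 3 mitigations, the residual rate stays below 10.
# Assert the claim's negation; `sat` would yield a counterexample.
s.add(mitigations >= 3, hazard_rate - 2 * mitigations >= 10)
print("counterexample found" if s.check() == sat else "claim holds")
```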

Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12940
  • repo_url: None
  • paper_authors: Haoyu Gao, Ting-En Lin, Hangyu Li, Min Yang, Yuchuan Wu, Wentao Ma, Yongbin Li
  • for: 提升大型语言模型在多轮对话中的理解能力
  • methods: 采用自我解释提示策略,要求模型在执行任务之前先分析对话中的每条语句,从而提升各类以对话为中心的任务表现
  • results: 在六个基准数据集上的实验表明,该策略持续优于其他零样本提示,并达到或超过少样本提示的效果,显示出其在复杂对话任务中增强大型语言模型理解能力的潜力。
    Abstract Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
    摘要 任务型对话(TOD)系统通过多轮对话帮助用户完成各类活动,但大型语言模型(LLM)往往难以理解这些复杂的上下文。在本研究中,我们提出了一种新颖的"自我解释"提示策略,以增强 LLM 在多轮对话中的理解能力。这种与任务无关的方法要求模型在执行任务之前先分析每条对话语句,从而提升各类以对话为中心的任务的表现。在六个基准数据集上的实验结果证实,我们的方法持续优于其他零样本提示,并达到或超过少样本提示的效果,显示出其作为增强 LLM 复杂对话任务理解能力的有力工具的潜力。
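
A minimal sketch of what a task-agnostic self-explanation prompt could look like: ask the model to explain each utterance before producing the task output. The wording is illustrative, not the paper's exact prompt.

```python
# Sketch: build a self-explanation prompt for a multi-turn dialogue task.
def self_explanation_prompt(dialogue, task_instruction):
    turns = "\n".join(f"{spk}: {utt}" for spk, utt in dialogue)
    return (
        f"Dialogue:\n{turns}\n\n"
        "First, explain the intent of each utterance above, one per line.\n"
        f"Then, based on your explanations, {task_instruction}"
    )

prompt = self_explanation_prompt(
    [("User", "I need a table for two tonight."), ("Agent", "Which area?")],
    "fill the dialogue state slots as JSON.")
print(prompt)
```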

Frustrated with Code Quality Issues? LLMs can Help!

  • paper_url: http://arxiv.org/abs/2309.12938
  • repo_url: None
  • paper_authors: Nalin Wadhwa, Jui Pradhan, Atharv Sonwane, Surya Prakash Sahu, Nagarajan Natarajan, Aditya Kanade, Suresh Parthasarathy, Sriram Rajamani
  • for: 这篇论文旨在帮助开发者解决代码质量问题,从而提升软件的可靠性、可维护性和安全性。
  • methods: 这篇论文使用大型语言模型(LLM)帮助开发者修复代码质量问题。具体而言,该工具由一对 LLM 组成"提议者-评分者"结构:一个 LLM 提出修复建议,另一个 LLM 按照贴近开发者验收标准的准则为建议打分。
  • results: 实验结果显示,CORE 可使 59.2% 的 Python 文件同时通过工具和人工审查,并将误报降低 25.8%;在 Java 文件中,CORE 达到 76.8% 的修复率,与专门的程序修复工具相当,而所需工程投入显著更少。
    Abstract As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra efforts to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The \emph{proposer LLM} of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go un-detected by the static analysis. The \emph{ranker LLM} evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE could revise 59.2% Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM is able to reduce false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% Java files (across 10 quality checks) comparable to 78.3% of a specialized program repair tool, with significantly much less engineering efforts.
    摘要 随着软件项目的推进,代码质量愈发重要,因为它直接影响软件的可靠性、可维护性和安全性。为此,开发者会在工作流中使用静态分析工具来标记代码质量问题。然而,开发者需要额外付出努力,根据工具的发现修改代码以提升质量。在这项工作中,我们研究了使用(遵循指令的)大型语言模型(LLM)来协助开发者修改代码、解决代码质量问题。我们提出了一个名为 CORE(COde REvisions 的缩写)的工具,其核心是由"提议者"和"评分者"两个 LLM 组成的组合。静态分析工具的提供方会推荐缓解工具告警的方法,开发者据此修改代码。CORE 的提议者 LLM 接收同样的推荐,并据此生成候选代码修订;通过静态质量检查的候选会被保留。然而,LLM 可能引入细微的、非预期的功能变化,而这些变化可能不会被静态分析检测到。评分者 LLM 使用一套贴近开发者验收标准的评分准则来评估提议者所做的修改。CORE 利用评分者 LLM 给出的分数对候选修订排序,然后再呈现给开发者。CORE 能够修订 59.2% 的 Python 文件(覆盖 52 项质量检查),使其同时通过工具和人工审查者的检验;在这些案例中,评分者 LLM 将误报降低了 25.8%。CORE 生成的修订在 76.8% 的 Java 文件(覆盖 10 项质量检查)中通过了静态分析工具,与专门的程序修复工具的 78.3% 相当,而所需的工程投入要少得多。

On Separate Normalization in Self-supervised Transformers

  • paper_url: http://arxiv.org/abs/2309.12931
  • repo_url: None
  • paper_authors: Xiaohui Chen, Yinkai Wang, Yuanqi Du, Soha Hassoun, Li-Ping Liu
  • for: 本文提出了一种简单的修改,即在masked autoencoders(MAE)中使用分开的normalization层来更好地捕捉token和[CLS]符号的不同特征,以提高下游任务性能。
  • methods: 本文提出的方法是,在MAE模型中,为token和[CLS]符号分别使用分开的normalization层,以便更好地捕捉它们的不同特征。
  • results: 实验表明,使用分开的归一化层后,[CLS] 嵌入能更好地编码全局上下文信息,并在其各向异性空间中分布得更均匀;将常规归一化层替换为两个独立的层后,在图像、自然语言和图领域平均获得了 2.7% 的性能提升。
    Abstract Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. We propose in this paper a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in its anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement over the image, natural language, and graph domains.
    摘要 Transformer 的自监督训练方法已在多个领域展现出卓越的性能。以往基于 Transformer 的模型(如掩码自编码器 MAE)通常对 [CLS] 符号和普通 token 使用同一个归一化层。本文提出一个简单的修改:为 token 和 [CLS] 符号分别使用独立的归一化层,以更好地捕捉二者的不同特性并提升下游任务表现。我们的方法旨在缓解对两类 token 使用相同归一化统计量可能带来的负面影响,因为这些统计量未必与它们各自的角色最优对齐。实验表明,使用独立的归一化层后,[CLS] 嵌入能够更好地编码全局上下文信息,并在其各向异性空间中分布得更均匀。将常规归一化层替换为两个独立的层后,我们在图像、自然语言和图领域观察到平均 2.7% 的性能提升。
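
The proposed change is easy to express in PyTorch, assuming a ViT-style layout with the [CLS] token at position 0: use two LayerNorms instead of one shared layer.

```python
# Sketch: separate normalization statistics for [CLS] vs. patch tokens.
import torch
import torch.nn as nn

class SeparateNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)    # statistics for [CLS] only
        self.tok_norm = nn.LayerNorm(dim)    # statistics for patch tokens

    def forward(self, x):                    # x: (B, 1 + num_tokens, dim)
        cls, toks = x[:, :1], x[:, 1:]
        return torch.cat([self.cls_norm(cls), self.tok_norm(toks)], dim=1)

y = SeparateNorm(768)(torch.randn(2, 197, 768))
```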

Lamarck’s Revenge: Inheritance of Learned Traits Can Make Robot Evolution Better

  • paper_url: http://arxiv.org/abs/2309.13099
  • repo_url: None
  • paper_authors: Jie Luo, Karine Miras, Jakub Tomczak, Agoston E. Eiben
  • for: 研究"如果 18 世纪生物学家拉马克并非完全错误,个体在一生中习得的特征能够通过遗传传给后代,会怎样?"这一问题。
  • methods: 使用进化机器人框架进行仿真,其中机器人的形态(身体)和控制器(大脑)均可进化,同时机器人还能在一生中通过学习改进控制器。
  • results: 对拉马克式系统与达尔文式系统进行比较,发现拉马克主义会放大机器人"形态智能"的涌现,并确定了这种成功的来源:新生机器人继承的大脑与身体匹配得更好,因此其适应度高于达尔文式系统中的机器人。
    Abstract Evolutionary robot systems offer two principal advantages: an advanced way of developing robots through evolutionary optimization and a special research platform to conduct what-if experiments regarding questions about evolution. Our study sits at the intersection of these. We investigate the question ``What if the 18th-century biologist Lamarck was not completely wrong and individual traits learned during a lifetime could be passed on to offspring through inheritance?'' We research this issue through simulations with an evolutionary robot framework where morphologies (bodies) and controllers (brains) of robots are evolvable and robots also can improve their controllers through learning during their lifetime. Within this framework, we compare a Lamarckian system, where learned bits of the brain are inheritable, with a Darwinian system, where they are not. Analyzing simulations based on these systems, we obtain new insights about Lamarckian evolution dynamics and the interaction between evolution and learning. Specifically, we show that Lamarckism amplifies the emergence of `morphological intelligence', the ability of a given robot body to acquire a good brain by learning, and identify the source of this success: `newborn' robots have a higher fitness because their inherited brains match their bodies better than those in a Darwinian system.
    摘要 进化机器人系统具有两大优势:一是通过进化优化来开发机器人的先进方式,二是为关于进化问题的"假设"实验提供了一个特殊的研究平台。我们的研究正处于二者的交汇处。我们研究"如果 18 世纪生物学家拉马克并非完全错误,个体在一生中习得的特征能够通过遗传传给后代,会怎样?"这一问题。我们借助一个进化机器人框架进行仿真,其中机器人的形态(身体)和控制器(大脑)均可进化,并且机器人还能在其一生中通过学习改进控制器。在这一框架内,我们比较了拉马克式系统(大脑中习得的部分可遗传)与达尔文式系统(不可遗传)。通过分析基于这两种系统的仿真,我们获得了关于拉马克式进化动力学以及进化与学习之间相互作用的新见解。具体而言,我们表明拉马克主义会放大"形态智能"(给定机器人身体通过学习获得良好大脑的能力)的涌现,并确定了这一成功的来源:"新生"机器人的适应度更高,因为其继承的大脑与身体的匹配程度优于达尔文式系统中的机器人。

A matter of attitude: Focusing on positive and active gradients to boost saliency maps

  • paper_url: http://arxiv.org/abs/2309.12913
  • repo_url: https://github.com/oscarllorente/positive_active_saliency_maps
  • paper_authors: Oscar Llorente, Jaime Boal, Eugenio F. Sánchez-Úbeda
  • for: 这篇论文旨在探讨如何通过保留显著图中梯度的正负号,加深对多类分类问题中神经网络决策依据的理解。
  • methods: 在预训练和从零训练的 CNN 上,同时考虑梯度的符号,以及正确类别与其他类别各自的影响。
  • results: 研究发现,综合考虑正确类别与其他类别的影响,能更准确地定位网络真正关注的图像像素;同时,遮挡或改变这些像素将如何影响结果也变得更加清晰。
    Abstract Saliency maps have become one of the most widely used interpretability techniques for convolutional neural networks (CNN) due to their simplicity and the quality of the insights they provide. However, there are still some doubts about whether these insights are a trustworthy representation of what CNNs use to come up with their predictions. This paper explores how rescuing the sign of the gradients from the saliency map can lead to a deeper understanding of multi-class classification problems. Using both pretrained and trained from scratch CNNs we unveil that considering the sign and the effect not only of the correct class, but also the influence of the other classes, allows to better identify the pixels of the image that the network is really focusing on. Furthermore, how occluding or altering those pixels is expected to affect the outcome also becomes clearer.
    摘要 显著图(saliency map)因其简单性和所提供见解的质量,已成为卷积神经网络(CNN)最广泛使用的可解释性技术之一。然而,这些见解是否真实反映了 CNN 得出预测所依据的信息,仍然存在疑问。本文探讨如何通过保留显著图中梯度的正负号,加深对多类分类问题的理解。我们在预训练和从零训练的 CNN 上均发现:不仅考虑正确类别的符号与影响,还考虑其他类别的影响,能够更准确地识别网络真正关注的图像像素;同时,遮挡或改变这些像素预期将如何影响结果也变得更加清晰。
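
A minimal sketch of the idea under our assumptions: keep only the positive input gradients of the target-class score (pixels whose increase supports that class), optionally subtracting the scores of competing classes before differentiating. The tiny model is a stand-in.

```python
# Sketch: positive-gradient saliency that also accounts for competing classes.
import torch

def positive_saliency(model, x, target, competitors=()):
    x = x.clone().requires_grad_(True)
    logits = model(x)
    # Score of the target class minus scores of competing classes.
    score = logits[:, target].sum() - sum(logits[:, c].sum() for c in competitors)
    score.backward()
    return x.grad.clamp(min=0).amax(dim=1)       # (B, H, W) saliency map

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
sal = positive_saliency(model, torch.randn(1, 3, 8, 8), target=3, competitors=(5,))
```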

KG-MDL: Mining Graph Patterns in Knowledge Graphs with the MDL Principle

  • paper_url: http://arxiv.org/abs/2309.12908
  • repo_url: None
  • paper_authors: Francesco Bariatti, Peggy Cellier, Sébastien Ferré
  • for: 本文针对从大型知识图谱(KG)中挖掘有意义图模式的问题;由于 KG 规模庞大、结构复杂,这类挖掘十分困难。
  • methods: 本文提出了一种基于最小描述长度(MDL)原理的图模式挖掘方法 KG-MDL,能针对给定的 KG 生成一组规模适合人工解读且具有描述性的图模式。
  • results: 实验表明,KG-MDL 生成的模式集既小到足以供人工解读,又能刻画整个 KG,同时凸显了创建数据所用的模式(schema)及其包含的具体事实。
    Abstract Nowadays, increasingly more data are available as knowledge graphs (KGs). While this data model supports advanced reasoning and querying, KGs remain difficult to mine due to their size and complexity. Graph mining approaches can be used to extract patterns from KGs. However, this presents two main issues. First, graph mining approaches tend to extract too many patterns for a human analyst to interpret (pattern explosion). Second, real-life KGs tend to differ from the graphs usually treated in graph mining: they are multigraphs, their vertex degrees tend to follow a power-law, and the way in which they model knowledge can produce spurious patterns. Recently, a graph mining approach named GraphMDL+ has been proposed to tackle the problem of pattern explosion, using the Minimum Description Length (MDL) principle. However, GraphMDL+, like other graph mining approaches, is not suited for KGs without adaptations. In this paper, we propose KG-MDL, a graph pattern mining approach based on the MDL principle that, given a KG, generates a human-sized and descriptive set of graph patterns, and does so in a parameter-less and anytime way. We report on experiments on medium-sized KGs showing that our approach generates sets of patterns that are both small enough to be interpreted by humans and descriptive of the KG. We show that the extracted patterns highlight relevant characteristics of the data: both of the schema used to create the data, and of the concrete facts it contains. We also discuss the issues related to mining graph patterns on knowledge graphs, as opposed to other types of graph data.
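
The MDL selection criterion behind approaches like KG-MDL can be sketched as a two-part code: a pattern set is good when the bits needed to describe the patterns plus the bits needed to describe the KG given the patterns is minimal. The toy encoders below are illustrative stand-ins under our own naming; KG-MDL defines concrete codes for multigraph KGs:

```python
import math

def code_bits(count, total):
    """Optimal code length (in bits) for an event with empirical
    probability count/total -- a toy stand-in for KG-MDL's encodings."""
    return -math.log2(count / total)

def mdl_score(pattern_usages, n_residual_edges, pattern_definition_bits):
    """Two-part MDL: L(M) + L(D|M). L(M) is the cost of describing the
    pattern set; L(D|M) describes the KG as pattern occurrences plus
    the leftover edges not covered by any pattern."""
    total = sum(pattern_usages.values()) + n_residual_edges
    data_bits = sum(c * code_bits(c, total) for c in pattern_usages.values())
    if n_residual_edges:
        data_bits += n_residual_edges * code_bits(n_residual_edges, total)
    return pattern_definition_bits + data_bits

# The pattern set with the lowest total wins: broad, frequently used
# patterns shrink L(D|M) faster than they grow L(M).
print(mdl_score({"worksAt(person, org)": 40, "locatedIn(city, country)": 25},
                n_residual_edges=5, pattern_definition_bits=120.0))
```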

ProtoEM: A Prototype-Enhanced Matching Framework for Event Relation Extraction

  • paper_url: http://arxiv.org/abs/2309.12892
  • repo_url: None
  • paper_authors: Zhilei Hu, Zixuan Li, Daozhu Xu, Long Bai, Cheng Jin, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
  • for: extract multiple kinds of relations among events in texts while better capturing the intrinsic semantics of these relations.
  • methods: a Prototype-Enhanced Matching (ProtoEM) framework with two steps: prototype representing, which uses examples to represent the connotations of each relation type, and prototype matching, which uses a Graph Neural Network (GNN)-based module over a dependency graph of prototypes to model the interdependence among event relations.
  • results: experiments on the MAVEN-ERE dataset demonstrate that ProtoEM can effectively represent the prototypes of event relations and obtains a significant improvement over baseline models.
    Abstract Event Relation Extraction (ERE) aims to extract multiple kinds of relations among events in texts. However, existing methods singly categorize event relations as different classes, which inadequately captures the intrinsic semantics of these relations. To comprehensively understand their intrinsic semantics, in this paper, we obtain prototype representations for each type of event relation and propose a Prototype-Enhanced Matching (ProtoEM) framework for the joint extraction of multiple kinds of event relations. Specifically, ProtoEM extracts event relations in a two-step manner, i.e., prototype representing and prototype matching. In the first step, to capture the connotations of different event relations, ProtoEM utilizes examples to represent the prototypes corresponding to these relations. Subsequently, to capture the interdependence among event relations, it constructs a dependency graph for the prototypes corresponding to these relations and utilizes a Graph Neural Network (GNN)-based module for modeling. In the second step, it obtains the representations of new event pairs and calculates their similarity with those prototypes obtained in the first step to evaluate which types of event relations they belong to. Experimental results on the MAVEN-ERE dataset demonstrate that the proposed ProtoEM framework can effectively represent the prototypes of event relations and further obtain a significant improvement over baseline models.
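
The two steps can be sketched in a few lines; the mean-pooled prototypes and cosine matching below are simplifying assumptions, and ProtoEM's GNN-based modeling of inter-relation dependencies is omitted:

```python
import torch
import torch.nn.functional as F

def build_prototypes(example_embeddings, relation_ids, n_relations):
    """Prototype representing: average the embeddings of example event
    pairs per relation type (a simple stand-in for ProtoEM's prototype
    construction; assumes every relation has at least one example).
    example_embeddings: (n, d), relation_ids: (n,) long tensor."""
    d = example_embeddings.size(1)
    protos = torch.zeros(n_relations, d)
    for r in range(n_relations):
        protos[r] = example_embeddings[relation_ids == r].mean(dim=0)
    return protos

def match_prototypes(pair_embedding, prototypes):
    """Prototype matching: score a new event-pair embedding against the
    prototypes and return the closest relation type and all scores."""
    sims = F.cosine_similarity(pair_embedding.unsqueeze(0), prototypes, dim=-1)
    return int(sims.argmax()), sims
```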

Gravity Network for end-to-end small lesion detection

  • paper_url: http://arxiv.org/abs/2309.12876
  • repo_url: https://github.com/cirorusso2910/gravitynet
  • paper_authors: Ciro Russo, Alessandro Bria, Claudio Marrocco
  • for: detecting small lesions in medical images.
  • methods: a novel one-stage end-to-end detector that introduces pixel-based anchors ("gravity points") which dynamically move towards the targeted lesion during detection.
  • results: on two well-established medical imaging tasks involving small lesions, microcalcification detection in digital mammograms and microaneurysm detection in digital fundus images, the method effectively detects small lesions.
    Abstract This paper introduces a novel one-stage end-to-end detector specifically designed to detect small lesions in medical images. Precise localization of small lesions presents challenges due to their appearance and the diverse contextual backgrounds in which they are found. To address this, our approach introduces a new type of pixel-based anchor that dynamically moves towards the targeted lesion for detection. We refer to this new architecture as GravityNet, and the novel anchors as gravity points since they appear to be "attracted" by the lesions. We conducted experiments on two well-established medical problems involving small lesions to evaluate the performance of the proposed approach: microcalcifications detection in digital mammograms and microaneurysms detection in digital fundus images. Our method demonstrates promising results in effectively detecting small lesions in these medical imaging tasks.

AnglE-optimized Text Embeddings

  • paper_url: http://arxiv.org/abs/2309.12871
  • repo_url: https://github.com/SeanLee97/AnglE
  • paper_authors: Xianming Li, Jing Li
  • for: producing high-quality text embeddings for semantic textual similarity (STS) tasks, which are crucial components of Large Language Model (LLM) applications.
  • methods: AnglE, a novel angle-optimized text embedding model that introduces angle optimization in a complex space to mitigate the saturation zones of the cosine function, which impede gradients and hinder optimization.
  • results: AnglE outperforms state-of-the-art STS models on short-text STS tasks, a newly collected long-text STS dataset from GitHub Issues, and domain-specific STS tasks, demonstrating its ability to generate high-quality text embeddings and the usefulness of angle optimization in STS.
    Abstract High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.
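
The core trick, optimizing angle differences in a complex space rather than raw cosine similarity, can be sketched as follows. Splitting the embedding into real and imaginary halves and the CoSENT-style ranking loss are our assumptions about the details; see the linked repo for the authors' implementation:

```python
import torch

def angle_difference(u, v):
    """Treat each embedding as a complex vector (first half = real part,
    second half = imaginary part) and return a per-pair angle-difference
    score. u, v: (batch, d) with d even."""
    ur, ui = u.chunk(2, dim=-1)
    vr, vi = v.chunk(2, dim=-1)
    # u / v in C: ((ur*vr + ui*vi) + i(ui*vr - ur*vi)) / |v|^2
    denom = vr.pow(2) + vi.pow(2) + 1e-8
    re = (ur * vr + ui * vi) / denom
    im = (ui * vr - ur * vi) / denom
    return torch.atan2(im, re).abs().mean(dim=-1)   # (batch,)

def angle_ranking_loss(u, v, labels, tau=20.0):
    """Ranking loss on angle differences: pairs with a higher similarity
    label should have a *smaller* angle difference. Unlike cosine, the
    angle does not saturate, so gradients stay informative."""
    dz = angle_difference(u, v)                       # smaller = more similar
    diff = tau * (dz.unsqueeze(1) - dz.unsqueeze(0))  # dz[i] - dz[j]
    # include (i, j) only where pair i is labeled more similar than pair j
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)
    return torch.log1p(torch.exp(diff[mask]).sum())
```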

Accurate and Fast Compressed Video Captioning

  • paper_url: http://arxiv.org/abs/2309.12867
  • repo_url: https://github.com/acherstyx/CoCap
  • paper_authors: Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang
  • for: a new video captioning approach that avoids the drawbacks of manually sampling frames from decoded videos.
  • methods: operate directly in the compressed video domain, using I-frames, motion vectors and residuals, with a specially designed end-to-end transformer that learns captions from the compressed video.
  • results: state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches.
    Abstract Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. Addressing this, we study video captioning from a different perspective in compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.

Domain Adaptation for Arabic Machine Translation: The Case of Financial Texts

  • paper_url: http://arxiv.org/abs/2309.12863
  • repo_url: None
  • paper_authors: Emad A. Alghamdi, Jezia Zakraoui, Fares A. Abanmy
  • for: explore the effectiveness of domain-specific adaptation for Arabic machine translation (AMT) in a previously unexplored domain, financial news articles.
  • methods: develop a carefully curated Arabic-English (AR-EN) parallel corpus in the financial domain and fine-tune several pre-trained neural machine translation and large language models, including ChatGPT-3.5 Turbo, on it.
  • results: fine-tuning succeeds with just a few well-aligned in-domain AR-EN segments; ChatGPT's translation quality surpasses the other models under both automatic and human evaluation. This is the first work on fine-tuning ChatGPT for financial-domain transfer learning.
    Abstract Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora. However, generic NMT systems have demonstrated poor performance on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed, which often lead to better translation quality than generic NMT systems. While there has been some continuous progress in NMT for English and other European languages, domain adaptation in Arabic has received little attention in the literature. The current study, therefore, aims to explore the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain, financial news articles. To this end, we carefully developed a parallel corpus for Arabic-English (AR-EN) translation in the financial domain for benchmarking different domain adaptation methods. We then fine-tuned several pre-trained NMT and Large Language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that fine-tuning is successful using just a few well-aligned in-domain AR-EN segments. The quality of ChatGPT's translations was superior to that of the other models based on automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT towards financial domain transfer learning. To contribute to research in domain translation, we made our datasets and fine-tuned models available at https://huggingface.co/asas-ai/.
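
The recipe itself, fine-tuning a pretrained NMT checkpoint on a handful of well-aligned in-domain segments, is straightforward; a minimal sketch with Hugging Face transformers is below. The checkpoint name and the toy sentence pair are placeholders, not the paper's setup:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Example public AR->EN checkpoint (an assumption: the paper fine-tunes
# several pretrained NMT models and LLMs, not necessarily this one).
name = "Helsinki-NLP/opus-mt-ar-en"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A few well-aligned in-domain AR-EN segments (toy placeholder pair).
pairs = [
    ("ارتفع المؤشر الرئيسي للسوق المالية اليوم",
     "The main index of the financial market rose today"),
]

model.train()
for epoch in range(3):
    for src, tgt in pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt",
                          padding=True, truncation=True)
        loss = model(**batch).loss   # cross-entropy on target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```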

Diffusion Augmentation for Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2309.12858
  • repo_url: https://github.com/liuqidong07/diffuasr
  • paper_authors: Qidong Liu, Fan Yan, Xiangyu Zhao, Zhaocheng Du, Huifeng Guo, Ruiming Tang, Feng Tian
  • for: alleviating the data sparsity problem and the long-tail user problem in sequential recommendation (SRS).
  • methods: DiffuASR, a diffusion-based augmentation method that generates high-quality pseudo interaction sequences which can be used directly to train sequential recommendation models, with a sequential U-Net adapted to discrete sequence generation and two guide strategies that align the preferences of generated and original sequences.
  • results: extensive experiments on three real-world datasets with three sequential recommendation models illustrate the effectiveness of DiffuASR in relieving data sparsity and the long-tail user problem.
    Abstract Sequential recommendation (SRS), which aims to recommend the next item based on the user's historical interactions, has recently become the technical foundation of many applications. However, sequential recommendation often faces the problem of data sparsity, which widely exists in recommender systems. Besides, most users only interact with a few items, but existing SRS models often underperform for these users. Such a problem, named the long-tail user problem, is still to be resolved. Data augmentation is a distinct way to alleviate these two problems, but existing methods often need fabricated training strategies or are hindered by poor-quality generated interactions. To address these problems, we propose a Diffusion Augmentation for Sequential Recommendation (DiffuASR) for higher-quality generation. The dataset augmented by DiffuASR can be used to train sequential recommendation models directly, free from complex training procedures. To make the best of the generation ability of the diffusion model, we first propose a diffusion-based pseudo sequence generation framework to fill the gap between image and sequence generation. Then, a sequential U-Net is designed to adapt the diffusion noise prediction model U-Net to the discrete sequence generation task. Finally, we develop two guide strategies to assimilate the preference between generated and original sequences. To validate the proposed DiffuASR, we conduct extensive experiments on three real-world datasets with three sequential recommendation models. The experimental results illustrate the effectiveness of DiffuASR. As far as we know, DiffuASR is among the first to introduce the diffusion model to recommendation.

AxOCS: Scaling FPGA-based Approximate Operators using Configuration Supersampling

  • paper_url: http://arxiv.org/abs/2309.12830
  • repo_url: None
  • paper_authors: Siva Satyendra Sahoo, Salim Ullah, Soumyo Bhattacharjee, Akash Kumar
  • for: a machine-learning-based methodology for designing FPGA-based approximate arithmetic operators, reducing the cost of ML implementations on resource-constrained embedded systems.
  • methods: AxOCS, an ML-based design space exploration method using configuration supersampling: it exploits the correlation of PPA (power, performance, area) and BEHAV (behavioral accuracy) metrics across operators of varying bit-widths to generate larger bit-width operators from the design space of smaller ones.
  • results: experiments on FPGA-optimized 8x8 signed approximate multipliers show that AxOCS significantly improves the quality-resulting hypervolume of multi-objective PPA-BEHAV optimization.
    Abstract The rising usage of AI and ML-based processing across application domains has exacerbated the need for low-cost ML implementation, specifically for resource-constrained embedded systems. To this end, approximate computing, an approach that explores the power, performance, area (PPA), and behavioral accuracy (BEHAV) trade-offs, has emerged as a possible solution for implementing embedded machine learning. Due to the predominance of MAC operations in ML, designing platform-specific approximate arithmetic operators forms one of the major research problems in approximate computing. Recently there has been a rising usage of AI/ML-based design space exploration techniques for implementing approximate operators. However, most of these approaches are limited to using ML-based surrogate functions for predicting the PPA and BEHAV impact of a set of related design decisions. While this approach leverages the regression capabilities of ML methods, it does not exploit the more advanced approaches in ML. To this end, we propose AxOCS, a methodology for designing approximate arithmetic operators through ML-based supersampling. Specifically, we present a method to leverage the correlation of PPA and BEHAV metrics across operators of varying bit-widths for generating larger bit-width operators. The proposed approach involves traversing the relatively smaller design space of smaller bit-width operators and employing its associated Design-PPA-BEHAV relationship to generate initial solutions for metaheuristics-based optimization for larger operators. The experimental evaluation of AxOCS for FPGA-optimized approximate operators shows that the proposed approach significantly improves the quality-resulting hypervolume for multi-objective optimization-of 8x8 signed approximate multipliers.

Synthetic Boost: Leveraging Synthetic Data for Enhanced Vision-Language Segmentation in Echocardiography

  • paper_url: http://arxiv.org/abs/2309.12829
  • repo_url: https://github.com/naamiinepal/synthetic-boost
  • paper_authors: Rabin Adhikari, Manish Dhakal, Safal Thapaliya, Kanchan Poudel, Prasiddha Bhandari, Bishesh Khanal
  • for: improving vision-language segmentation of echocardiography images by leveraging synthetic images generated by Semantic Diffusion Models (SDMs).
  • methods: evaluate two popular Vision-Language Segmentation Models (CLIPSeg and CRIS) using seven kinds of language prompts derived from attributes automatically extracted from echocardiography images, segmentation masks, and their metadata.
  • results: pretraining VLSMs on SDM-generated synthetic images before fine-tuning on real images improves the metrics and speeds up convergence. Code, configs, and prompts are available at https://github.com/naamiinepal/synthetic-boost.
    Abstract Accurate segmentation is essential for echocardiography-based assessment of cardiovascular diseases (CVDs). However, the variability among sonographers and the inherent challenges of ultrasound images hinder precise segmentation. By leveraging the joint representation of image and text modalities, Vision-Language Segmentation Models (VLSMs) can incorporate rich contextual information, potentially aiding in accurate and explainable segmentation. However, the lack of readily available data in echocardiography hampers the training of VLSMs. In this study, we explore using synthetic datasets from Semantic Diffusion Models (SDMs) to enhance VLSMs for echocardiography segmentation. We evaluate results for two popular VLSMs (CLIPSeg and CRIS) using seven different kinds of language prompts derived from several attributes, automatically extracted from echocardiography images, segmentation masks, and their metadata. Our results show improved metrics and faster convergence when pretraining VLSMs on SDM-generated synthetic images before finetuning on real images. The code, configs, and prompts are available at https://github.com/naamiinepal/synthetic-boost.

OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control

  • paper_url: http://arxiv.org/abs/2309.12825
  • repo_url: None
  • paper_authors: Botian Xu, Feng Gao, Chao Yu, Ruize Zhang, Yi Wu, Yu Wang
  • for: an efficient and flexible platform for reinforcement learning in drone control, built on Nvidia's Omniverse Isaac Sim.
  • methods: a bottom-up design approach that lets users easily design and experiment with various application scenarios on top of GPU-parallelized simulations.
  • results: the platform offers 4 drone models, 5 sensor modalities, 4 control modes, over 10 benchmark tasks ranging from single-drone hovering to over-actuated system tracking, and widely used RL baselines; preliminary results on these tasks showcase its capabilities and support future research.
    Abstract In this work, we introduce OmniDrones, an efficient and flexible platform tailored for reinforcement learning in drone control, built on Nvidia's Omniverse Isaac Sim. It employs a bottom-up design approach that allows users to easily design and experiment with various application scenarios on top of GPU-parallelized simulations. It also offers a range of benchmark tasks, presenting challenges ranging from single-drone hovering to over-actuated system tracking. In summary, we propose an open-sourced drone simulation platform, equipped with an extensive suite of tools for drone learning. It includes 4 drone models, 5 sensor modalities, 4 control modes, over 10 benchmark tasks, and a selection of widely used RL baselines. To showcase the capabilities of OmniDrones and to support future research, we also provide preliminary results on these benchmark tasks. We hope this platform will encourage further studies on applying RL to practical drone systems.

A Spectral Theory of Neural Prediction and Alignment

  • paper_url: http://arxiv.org/abs/2309.12821
  • repo_url: None
  • paper_authors: Abdulkadir Canatar, Jenelle Feather, Albert Wakhloo, SueYeon Chung
  • for: understanding how deep neural networks differ in the way they predict neural activity.
  • methods: a theoretical framework relating the generalization error of regression to the spectral bias of the model activations and the alignment of the neural responses onto the model's learnable subspace, extended to regression between model activations and neural responses, with geometrical properties describing the error embedding geometry.
  • results: across a large number of deep networks predicting visual cortical activity, multiple types of geometries result in low neural prediction error, showing that carefully decomposing representational metrics reveals how models capture neural activity and points the way towards improved models.
    Abstract The representations of neural networks are often compared to those of biological systems by performing regression between the neural network responses and those measured from biological systems. Many different state-of-the-art deep neural networks yield similar neural predictions, but it remains unclear how to differentiate among models that perform equally well at predicting neural responses. To gain insight into this, we use a recent theoretical framework that relates the generalization error from regression to the spectral bias of the model activations and the alignment of the neural responses onto the learnable subspace of the model. We extend this theory to the case of regression between model activations and neural responses, and define geometrical properties describing the error embedding geometry. We test a large number of deep neural networks that predict visual cortical activity and show that there are multiple types of geometries that result in low neural prediction error as measured via regression. The work demonstrates that carefully decomposing representational metrics can provide interpretability of how models are capturing neural activity and points the way towards improved models of neural activity.

Computational Natural Philosophy: A Thread from Presocratics through Turing to ChatGPT

  • paper_url: http://arxiv.org/abs/2309.13094
  • repo_url: None
  • paper_authors: Gordana Dodig-Crnkovic
  • for: examining how computational natural philosophy conceptualizes the natural world, and how computation and AI can be used to study cognition and intelligence.
  • methods: an interdisciplinary perspective spanning computer science and AI, covering deep neural networks and reinforcement learning with human feedback (RLHF).
  • results: Large Language Models (LLMs) such as ChatGPT, based on deep neural networks and trained with RLHF, exemplify this approach; current research initiatives aim to integrate neural networks with symbolic computing into a new generation of hybrid computational models.
    Abstract Modern computational natural philosophy conceptualizes the universe in terms of information and computation, establishing a framework for the study of cognition and intelligence. Despite some critiques, this computational perspective has significantly influenced our understanding of the natural world, leading to the development of AI systems like ChatGPT based on deep neural networks. Advancements in this domain have been facilitated by interdisciplinary research, integrating knowledge from multiple fields to simulate complex systems. Large Language Models (LLMs), such as ChatGPT, represent this approach's capabilities, utilizing reinforcement learning with human feedback (RLHF). Current research initiatives aim to integrate neural networks with symbolic computing, introducing a new generation of hybrid computational models.

Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where

  • paper_url: http://arxiv.org/abs/2309.12757
  • repo_url: None
  • paper_authors: Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
  • for: making the masking operation benefit contrastive self-supervised learning for ConvNets, as it does for self-supervised vision transformers.
  • methods: introduce masking into the contrastive-learning framework for ConvNets with an explicit saliency constraint so that masked regions are distributed more evenly between foreground and background, and introduce hard negative samples by masking larger regions of salient patches.
  • results: extensive experiments on various datasets, contrastive learning mechanisms, and downstream tasks show that the proposed method outperforms several state-of-the-art baselines.
    Abstract While image data has started to enjoy the simple-but-effective self-supervised learning scheme built upon masking and a self-reconstruction objective, thanks to the introduction of the tokenization procedure and the vision transformer backbone, convolutional neural networks, another important and widely adopted architecture for image data, rely on contrastive-learning techniques to drive self-supervised learning and still face the difficulty of leveraging such a straightforward and general masking operation to benefit their learning process significantly. In this work, we aim to ease the inclusion of the masking operation into the contrastive-learning framework for convolutional neural networks as an extra augmentation method. In addition to the additive but unwanted edges (between masked and unmasked regions) as well as other adverse effects caused by masking operations for ConvNets, which have been discussed in prior works, we particularly identify the potential problem where, for one view in a contrastive sample-pair, the randomly-sampled masking regions can be overly concentrated on important/salient objects, resulting in misleading contrastiveness to the other view. To this end, we propose to explicitly take a saliency constraint into consideration, so that the masked regions are distributed more evenly between foreground and background when realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments conducted on various datasets, contrastive learning mechanisms, and downstream tasks verify the efficacy and the superior performance of our proposed method with respect to several state-of-the-art baselines.
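
A saliency-constrained mask sampler in the spirit of the paper can be sketched as follows: split the grid cells into salient and non-salient halves and spend the masking budget evenly on both, so that masks never concentrate on the main object. Parameter names and the half/half split are our assumptions:

```python
import torch
import torch.nn.functional as F

def saliency_balanced_mask(saliency, mask_ratio=0.4, grid=7):
    """Sample a boolean (grid x grid) mask whose True cells are spread
    over both salient (foreground) and non-salient (background) regions
    instead of clustering on the salient object. Assumes mask_ratio <= 1.
    saliency: (H, W) non-negative saliency map."""
    cells = F.adaptive_avg_pool2d(saliency[None, None], grid).flatten()
    n_cells = cells.numel()
    n_mask = int(mask_ratio * n_cells)

    order = torch.argsort(cells, descending=True)
    fg, bg = order[: n_cells // 2], order[n_cells // 2:]

    # spend half of the masking budget on salient cells, half on background
    pick = torch.cat([fg[torch.randperm(fg.numel())[: n_mask // 2]],
                      bg[torch.randperm(bg.numel())[: n_mask - n_mask // 2]]])
    mask = torch.zeros(n_cells, dtype=torch.bool)
    mask[pick] = True
    return mask.view(grid, grid)
```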

Towards an MLOps Architecture for XAI in Industrial Applications

  • paper_url: http://arxiv.org/abs/2309.12756
  • repo_url: None
  • paper_authors: Leonhard Faubel, Thomas Woudsma, Leila Methnani, Amir Ghorbani Ghezeljhemeidan, Fabian Buelow, Klaus Schmid, Willem D. van Driel, Benjamin Kloepper, Andreas Theodorou, Mohsen Nosratinia, Magnus Bång
  • for: integrating explanation and feedback capabilities into Machine Learning Operations (MLOps) to increase user trust and adoption in industrial applications.
  • methods: a novel MLOps software architecture, implemented in a series of industrial use cases in the project EXPLAIN.
  • results: the architecture provides an efficient way to manage ML models in production environments while allowing explanations and feedback to be integrated into the development and deployment processes.
    Abstract Machine learning (ML) has become a popular tool in the industrial sector as it helps to improve operations, increase efficiency, and reduce costs. However, deploying and managing ML models in production environments can be complex. This is where Machine Learning Operations (MLOps) comes in. MLOps aims to streamline this deployment and management process. One of the remaining MLOps challenges is the need for explanations. These explanations are essential for understanding how ML models reason, which is key to trust and acceptance. Better identification of errors and improved model accuracy are only two resulting advantages. An often neglected fact is that deployed models are bypassed in practice when accuracy and especially explainability do not meet user expectations. We developed a novel MLOps software architecture to address the challenge of integrating explanations and feedback capabilities into the ML development and deployment processes. In the project EXPLAIN, our architecture is implemented in a series of industrial use cases. The proposed MLOps software architecture has several advantages. It provides an efficient way to manage ML models in production environments. Further, it allows for integrating explanations into the development and deployment processes.

OpenAI's GPT4 as coding assistant

  • paper_url: http://arxiv.org/abs/2309.12732
  • repo_url: https://github.com/lmous/openai-gpt4-coding-assistant
  • paper_authors: Lefteris Moussiades, George Zografos
  • for: examine GPT3.5 and GPT4 as coding assistants throughout code development.
  • methods: construct appropriate tests to check whether the two systems can (a) answer typical questions that arise during code development, (b) produce reliable code, and (c) contribute to code debugging.
  • results: the test results are impressive; GPT4's performance is outstanding and signals an increase in programmer productivity and a reorganization of software development procedures around these new tools.
    Abstract Lately, Large Language Models have been widely used in code generation. GPT4 is considered the most potent Large Language Model from Openai. In this paper, we examine GPT3.5 and GPT4 as coding assistants. More specifically, we have constructed appropriate tests to check whether the two systems can a) answer typical questions that can arise during the code development, b) produce reliable code, and c) contribute to code debugging. The test results are impressive. The performance of GPT4 is outstanding and signals an increase in the productivity of programmers and the reorganization of software development procedures based on these new tools.

Defeasible Reasoning with Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2309.12731
  • repo_url: None
  • paper_authors: Dave Raggett
  • for: addressing the uncertainty, imprecision, incompleteness and inconsistency of human knowledge, and the challenge this poses for the Semantic Web.
  • methods: an intuitive notation and model (PKN) for defeasible reasoning with imperfect knowledge, related to previous work on argumentation theory; PKN is to N3 as defeasible reasoning is to deductive logic.
  • results: the paper closes with ideas on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing upon the AIF ontology for inspiration, and with observations on symbolic approaches in the era of large language models.
    Abstract Human knowledge is subject to uncertainties, imprecision, incompleteness and inconsistencies. Moreover, the meaning of many everyday terms is dependent on the context. That poses a huge challenge for the Semantic Web. This paper introduces work on an intuitive notation and model for defeasible reasoning with imperfect knowledge, and relates it to previous work on argumentation theory. PKN is to N3 as defeasible reasoning is to deductive logic. Further work is needed on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing upon the AIF ontology for inspiration. The paper closes with observations on symbolic approaches in the era of large language models.

In-context Interference in Chat-based Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12727
  • repo_url: None
  • paper_authors: Eric Nuertey Coleman, Julio Hurtado, Vincenzo Lomonaco
  • for: This paper aims to study the limitations of in-context learning in large language models (LLMs) and its impact on the model’s performance.
  • methods: The paper uses a black-box scenario to evaluate the in-context learning ability of LLMs, and proposes an evaluation benchmark based on the bAbI dataset.
  • results: The study shows that in-context learning can lead to interference between information continually flowing in the context, causing the model to forget previously learned knowledge and reducing its performance.
    Abstract Large language models (LLMs) have had a huge impact on society due to their impressive capabilities and vast knowledge of the world. Various applications and tools have been created that allow users to interact with these models in a black-box scenario. However, one limitation of this scenario is that users cannot modify the internal knowledge of the model, and the only way to add or modify internal knowledge is by explicitly mentioning it to the model during the current interaction. This learning process is called in-context training, and it refers to training that is confined to the user's current session or context. In-context learning has significant applications, but also has limitations that are seldom studied. In this paper, we present a study that shows how the model can suffer from interference between information that continually flows in the context, causing it to forget previously learned knowledge, which can reduce the model's performance. Along with showing the problem, we propose an evaluation benchmark based on the bAbI dataset.

H2O+: An Improved Framework for Hybrid Offline-and-Online RL with Dynamics Gaps

  • paper_url: http://arxiv.org/abs/2309.12716
  • repo_url: None
  • paper_authors: Haoyi Niu, Tianying Ji, Bingqi Liu, Haocheng Zhao, Xiangyu Zhu, Jianying Zheng, Pengfei Huang, Guyue Zhou, Jianming Hu, Xianyuan Zhan
  • for: This paper focuses on solving real-world complex tasks using reinforcement learning (RL) in imperfect simulation environments and with limited data.
  • methods: The authors propose a new algorithm called H2O+, which combines offline and online learning methods to address the challenges of sim-to-real transfer and dynamics gaps.
  • results: The proposed algorithm demonstrates superior performance and flexibility in both simulation and real-world robotics experiments compared to advanced cross-domain online and offline RL algorithms.
    Abstract Solving real-world complex tasks using reinforcement learning (RL) without high-fidelity simulation environments or large amounts of offline data can be quite challenging. Online RL agents trained in imperfect simulation environments can suffer from severe sim-to-real issues. Offline RL approaches, although they bypass the need for simulators, often pose demanding requirements on the size and quality of the offline datasets. The recently emerged hybrid offline-and-online RL provides an attractive framework that enables the joint use of limited offline data and an imperfect simulator for transferable policy learning. In this paper, we develop a new algorithm, called H2O+, which offers great flexibility to bridge various choices of offline and online learning methods, while also accounting for dynamics gaps between the real and simulation environments. Through extensive simulation and real-world robotics experiments, we demonstrate superior performance and flexibility over advanced cross-domain online and offline RL algorithms.

The Mathematical Game

  • paper_url: http://arxiv.org/abs/2309.12711
  • repo_url: https://github.com/xploitspeeds/Bookmarklet-Hacks-For-School
  • paper_authors: Marc Pierre, Quentin Cohen-Solal, Tristan Cazenave
  • for: improving the performance of the Holophrasm theorem prover by using other game tree search algorithms.
  • methods: Monte Carlo Tree Search (MCTS) combined with neural networks for the policy and the evaluation, applied to automated theorem proving.
  • results: the proposed search variants aim to improve the performance of the Holophrasm neural theorem prover.
    Abstract Monte Carlo Tree Search can be used for automated theorem proving. Holophrasm is a neural theorem prover using MCTS combined with neural networks for the policy and the evaluation. In this paper we propose to improve the performance of the Holophrasm theorem prover using other game tree search algorithms.
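
Provers of this kind select the next proof step with a PUCT-style rule that mixes a neural policy prior with running value estimates, and the paper studies swapping in other game tree search algorithms around this core. A generic sketch of the standard selection rule (the tuple layout is our assumption):

```python
import math

def uct_select(children, c=1.4):
    """PUCT-style selection over a node's children, as used in
    AlphaZero-like searches and neural provers. Each child is a tuple
    (visits, total_value, prior_probability)."""
    parent_visits = sum(v for v, _, _ in children) or 1

    def score(child):
        visits, total_value, prior = child
        q = total_value / visits if visits else 0.0              # exploitation
        u = c * prior * math.sqrt(parent_visits) / (1 + visits)  # exploration
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

# Example: an unvisited child with a strong prior gets explored first.
print(uct_select([(10, 6.0, 0.2), (0, 0.0, 0.7), (3, 2.4, 0.1)]))  # -> 1
```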

PointSSC: A Cooperative Vehicle-Infrastructure Point Cloud Benchmark for Semantic Scene Completion

  • paper_url: http://arxiv.org/abs/2309.12708
  • repo_url: None
  • paper_authors: Yuxiang Yan, Boda Liu, Jianfei Ai, Qinbu Li, Ru Wan, Jian Pu
  • for: PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion, to drive advances in semantic point cloud completion for real-world navigation.
  • methods: an automated annotation pipeline leveraging Segment Anything to efficiently assign semantics, and a LiDAR-based baseline with a Spatial-Aware Transformer for global and local feature extraction plus a Completion and Segmentation Cooperative Module for joint completion and segmentation.
  • results: the benchmark's scenes exhibit long-range perception and minimal occlusion, providing a challenging testbed for semantic point cloud completion.
    Abstract Semantic Scene Completion (SSC) aims to jointly generate space occupancies and semantic labels for complex 3D scenes. Most existing SSC models focus on volumetric representations, which are memory-inefficient for large outdoor spaces. Point clouds provide a lightweight alternative but existing benchmarks lack outdoor point cloud scenes with semantic labels. To address this, we introduce PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. These scenes exhibit long-range perception and minimal occlusion. We develop an automated annotation pipeline leveraging Segment Anything to efficiently assign semantics. To benchmark progress, we propose a LiDAR-based model with a Spatial-Aware Transformer for global and local feature extraction and a Completion and Segmentation Cooperative Module for joint completion and segmentation. PointSSC provides a challenging testbed to drive advances in semantic point cloud completion for real-world navigation.

Multi-Label Noise Transition Matrix Estimation with Label Correlations: Theory and Algorithm

  • paper_url: http://arxiv.org/abs/2309.12706
  • repo_url: https://github.com/tmllab/Multi-Label-T
  • paper_authors: Shikun Li, Xiaobo Xia, Hansong Zhang, Shiming Ge, Tongliang Liu
  • for: noisy multi-label learning, where collecting large-scale accurate labels is difficult and transition matrices can model the label noise.
  • methods: a novel estimator of the multi-label noise transition matrix that leverages label correlations, without anchor points or precise fitting of noisy class posteriors: it estimates the occurrence probabilities of pairs of noisy labels, uses sample selection to extract information implying clean label correlations, and recovers the transition matrix by solving a bilinear decomposition problem.
  • results: identifiability results for class-dependent transition matrices, an estimation error bound, and a generalization error bound for the resulting statistically consistent algorithm; empirically, the estimator yields accurate transition matrices and excellent classification performance.
    Abstract Noisy multi-label learning has garnered increasing attention due to the challenges posed by collecting large-scale accurate labels, making noisy labels a more practical alternative. Motivated by noisy multi-class learning, the introduction of transition matrices can help model multi-label noise and enable the development of statistically consistent algorithms for noisy multi-label learning. However, estimating multi-label noise transition matrices remains a challenging task, as most existing estimators in noisy multi-class learning rely on anchor points and accurate fitting of noisy class posteriors, which is hard to satisfy in noisy multi-label learning. In this paper, we address this problem by first investigating the identifiability of class-dependent transition matrices in noisy multi-label learning. Building upon the identifiability results, we propose a novel estimator that leverages label correlations without the need for anchor points or precise fitting of noisy class posteriors. Specifically, we first estimate the occurrence probability of two noisy labels to capture noisy label correlations. Subsequently, we employ sample selection techniques to extract information implying clean label correlations, which are then used to estimate the occurrence probability of one noisy label when a certain clean label appears. By exploiting the mismatches in label correlations implied by these occurrence probabilities, we demonstrate that the transition matrix becomes identifiable and can be acquired by solving a bilinear decomposition problem. Theoretically, we establish an estimation error bound for our multi-label transition matrix estimator and derive a generalization error bound for our statistically consistent algorithm. Empirically, we validate the effectiveness of our estimator in estimating multi-label noise transition matrices, leading to excellent classification performance.
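
Once a transition matrix has been estimated, the standard way to exploit it is forward loss correction: map the clean-label posteriors through the per-label transition matrices before computing the loss, which yields a statistically consistent objective. The sketch below shows that usage step, not the paper's estimator itself:

```python
import torch

def forward_corrected_bce(logits, noisy_targets, T):
    """Forward loss correction with per-label noise transition matrices
    (a schematic usage of an estimated T, not the paper's estimator).

    logits:        (batch, n_labels) clean-posterior logits
    noisy_targets: (batch, n_labels) observed noisy labels in {0, 1}
    T:             (n_labels, 2, 2) with T[j, a, b] = P(noisy=b | clean=a)
    """
    p_clean1 = torch.sigmoid(logits)                       # P(y_j = 1 | x)
    # Map clean posteriors through T to get noisy-label posteriors.
    p_noisy1 = (1 - p_clean1) * T[:, 0, 1] + p_clean1 * T[:, 1, 1]
    eps = 1e-7
    loss = -(noisy_targets * torch.log(p_noisy1 + eps)
             + (1 - noisy_targets) * torch.log(1 - p_noisy1 + eps))
    return loss.mean()
```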

Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.12696
  • repo_url: https://github.com/thu-rllab/CFCQL
  • paper_authors: Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, Xiangyang Ji
  • for: offline multi-agent reinforcement learning, where the coupling of distribution shift and high dimensionality makes out-of-distribution (OOD) actions and value overestimation especially severe.
  • methods: CounterFactual Conservative Q-Learning (CFCQL), which computes a conservative regularization for each agent separately in a counterfactual way and linearly combines them to realize an overall conservative value estimation.
  • results: on four environments covering both discrete and continuous actions, using existing and self-made datasets, CFCQL outperforms existing methods on most datasets, sometimes by a remarkable margin, especially when the number of agents is large.
    Abstract Offline multi-agent reinforcement learning is challenging due to the coupling effect of both the distribution shift issue common in the offline setting and the high dimension issue common in the multi-agent setting, making the action out-of-distribution (OOD) and value overestimation phenomena excessively severe. To mitigate this problem, we propose a novel multi-agent offline RL algorithm, named CounterFactual Conservative Q-Learning (CFCQL), to conduct conservative value estimation. Rather than regarding all the agents as a high-dimensional single one and directly applying single-agent methods to it, CFCQL calculates conservative regularization for each agent separately in a counterfactual way and then linearly combines them to realize an overall conservative value estimation. We prove that it still enjoys the underestimation property and the performance guarantee as those single-agent conservative methods do, but the induced regularization and safe policy improvement bound are independent of the agent number, which is therefore theoretically superior to the direct treatment referred to above, especially when the agent number is large. We further conduct experiments on four environments, including both discrete and continuous action settings, on both existing and our man-made datasets, demonstrating that CFCQL outperforms existing methods on most datasets and even with a remarkable margin on some of them.
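
For discrete actions, the counterfactual flavor of the per-agent conservative term can be sketched as below: each agent's Q-values are pushed down over its own counterfactual actions while the other agents' actions stay fixed to the dataset ones, and the per-agent penalties are then combined. The interface and the exact regularizer are our assumptions; CFCQL's formulation differs in detail:

```python
import torch

def cfcql_penalty(q_net, state, data_actions, n_actions, alpha=1.0):
    """Rough sketch of a per-agent counterfactual conservative term.

    q_net(state, joint_action) -> scalar Q tensor (assumed interface)
    data_actions: (n_agents,) long tensor of dataset actions
    """
    penalties = []
    for i in range(data_actions.numel()):
        q_cf = []
        for a in range(n_actions):
            actions = data_actions.clone()
            actions[i] = a                  # agent i acts counterfactually
            q_cf.append(q_net(state, actions))
        q_cf = torch.stack(q_cf)            # (n_actions,)
        # push down OOD counterfactual actions, push up the dataset action
        penalties.append(torch.logsumexp(q_cf, dim=0)
                         - q_net(state, data_actions))
    return alpha * torch.stack(penalties).mean()
```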

Enhancing Graph Representation of the Environment through Local and Cloud Computation

  • paper_url: http://arxiv.org/abs/2309.12692
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Francesco Argenziano, Vincenzo Suriani, Daniele Nardi
  • for: bridging the gap between low-level sensor readings and high-level semantic understanding
  • methods: combines classical computer vision tools with modern computer vision cloud services, incorporates an ontology hierarchy with over 800 object classes
  • results: allows for the handling of small objects and integration into the semantic representation of the environment
    Abstract Enriching the robot representation of the operational environment is a challenging task that aims at bridging the gap between low-level sensor readings and high-level semantic understanding. Having a rich representation often requires computationally demanding architectures and pure point cloud based detection systems that struggle when dealing with everyday objects that have to be handled by the robot. To overcome these issues, we propose a graph-based representation that addresses this gap by providing a semantic representation of robot environments from multiple sources. In fact, to acquire information from the environment, the framework combines classical computer vision tools with modern computer vision cloud services, ensuring computational feasibility on onboard hardware. By incorporating an ontology hierarchy with over 800 object classes, the framework achieves cross-domain adaptability, eliminating the need for environment-specific tools. The proposed approach allows us to handle also small objects and integrate them into the semantic representation of the environment. The approach is implemented in the Robot Operating System (ROS) using the RViz visualizer for environment representation. This work is a first step towards the development of a general-purpose framework, to facilitate intuitive interaction and navigation across different domains.

QAL-BP: An Augmented Lagrangian Quantum Approach for Bin Packing Problem

  • paper_url: http://arxiv.org/abs/2309.12678
  • repo_url: https://github.com/lorenz92/qal-bp
  • paper_authors: Lorenzo Cellini, Antonio Macaluso, Michele Lombardi
  • for: finding efficient solutions to the bin packing problem, a well-known NP-hard problem in combinatorial optimization.
  • methods: QAL-BP, a novel Quadratic Unconstrained Binary Optimization (QUBO) formulation designed specifically for bin packing and suitable for quantum computation; it uses the augmented Lagrangian method to incorporate the bin packing constraints into the objective function and allows an analytical estimation of heuristic but empirically robust penalty multipliers.
  • results: experiments on a real quantum annealing device, compared against simulated annealing and Gurobi, confirm the correctness of the formulation and demonstrate the potential of quantum computation for bin packing as more reliable quantum technology becomes available.
    Abstract The bin packing is a well-known NP-Hard problem in the domain of artificial intelligence, posing significant challenges in finding efficient solutions. Conversely, recent advancements in quantum technologies have shown promising potential for achieving substantial computational speedup, particularly in certain problem classes, such as combinatorial optimization. In this study, we introduce QAL-BP, a novel Quadratic Unconstrained Binary Optimization (QUBO) formulation designed specifically for bin packing and suitable for quantum computation. QAL-BP utilizes the augmented Lagrangian method to incorporate the bin packing constraints into the objective function while also facilitating an analytical estimation of heuristic, but empirically robust, penalty multipliers. This approach leads to a more versatile and generalizable model that eliminates the need for empirically calculating instance-dependent Lagrangian coefficients, a requirement commonly encountered in alternative QUBO formulations for similar problems. To assess the effectiveness of our proposed approach, we conduct experiments on a set of bin-packing instances using a real Quantum Annealing device. Additionally, we compare the results with those obtained from two different classical solvers, namely simulated annealing and Gurobi. The experimental findings not only confirm the correctness of the proposed formulation but also demonstrate the potential of quantum computation in effectively solving the bin-packing problem, particularly as more reliable quantum technology becomes available.
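
The general shape of such a formulation can be sketched as a QUBO matrix with quadratic penalties folded into the objective. The toy model below only enforces the one-bin-per-item constraint and a crude capacity bias; QAL-BP's augmented-Lagrangian treatment of capacity and its analytically estimated multipliers are not reproduced here:

```python
import numpy as np

def bin_packing_qubo(weights, capacity, n_bins, lam=1.0, rho=2.0):
    """Toy QUBO for bin packing (a simplified stand-in for QAL-BP).

    Binary variables, flattened into one vector:
      x[i*n_bins + b] = 1 iff item i goes into bin b,
      y[n_items*n_bins + b] = 1 iff bin b is used.
    Objective: number of used bins + rho*(sum_b x[i,b] - 1)^2 per item
    (one bin per item), plus a soft linear capacity bias.
    """
    n = len(weights)
    N = n * n_bins + n_bins
    Q = np.zeros((N, N))
    y0 = n * n_bins                          # offset of the y block

    for b in range(n_bins):
        Q[y0 + b, y0 + b] += 1.0             # cost of opening bin b

    for i in range(n):
        for b in range(n_bins):
            xi = i * n_bins + b
            # rho*(sum_b x - 1)^2 -> -rho diagonal, +2*rho off-diagonal
            Q[xi, xi] += -rho
            for b2 in range(b + 1, n_bins):
                Q[xi, i * n_bins + b2] += 2 * rho
            # soft capacity bias: heavier items cost more to place
            Q[xi, xi] += lam * weights[i] / capacity

    return Q

# Energy of a candidate binary assignment z: z @ Q @ z (lower is better);
# a matrix of this form can be handed to a quantum annealer or a sampler.
```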

TrTr: A Versatile Pre-Trained Large Traffic Model based on Transformer for Capturing Trajectory Diversity in Vehicle Population

  • paper_url: http://arxiv.org/abs/2309.12677
  • repo_url: None
  • paper_authors: Ruyi Feng, Zhibin Li, Bowen Liu, Yan Ding, Ou Zheng
  • for: learning the diversity of trajectories within vehicle populations using the Transformer architecture, which can handle large-scale parameters and capture the spatial distribution of vehicles.
  • methods: apply the Transformer architecture to traffic tasks with specifically designed pre-training tasks, a data structure tailored to the attention mechanism, and noises corresponding to spatio-temporal demands injected into the structured data.
  • results: the pre-trained model captures the spatial distribution of the vehicle population with no vehicle overlap and an RMSE of 0.6059 against ground truth; in time series prediction, about 95% of predicted trajectory speeds closely align with the true speeds within a deviation of 7.5144 m/s; in the stability test the model predicts series ten times longer than the input with smooth trajectories and diverse driving behaviors, providing a good basis for downstream fine-tuning.
    Abstract Understanding trajectory diversity is a fundamental aspect of addressing practical traffic tasks. However, capturing the diversity of trajectories presents challenges, particularly with traditional machine learning and recurrent neural networks due to the requirement of large-scale parameters. The emerging Transformer technology, renowned for its parallel computation capabilities enabling the utilization of models with hundreds of millions of parameters, offers a promising solution. In this study, we apply the Transformer architecture to traffic tasks, aiming to learn the diversity of trajectories within vehicle populations. We analyze the Transformer's attention mechanism and its adaptability to the goals of traffic tasks, and subsequently, design specific pre-training tasks. To achieve this, we create a data structure tailored to the attention mechanism and introduce a set of noises that correspond to spatio-temporal demands, which are incorporated into the structured data during the pre-training process. The designed pre-training model demonstrates excellent performance in capturing the spatial distribution of the vehicle population, with no instances of vehicle overlap and an RMSE of 0.6059 when compared to the ground truth values. In the context of time series prediction, approximately 95% of the predicted trajectories' speeds closely align with the true speeds, within a deviation of 7.5144m/s. Furthermore, in the stability test, the model exhibits robustness by continuously predicting a time series ten times longer than the input sequence, delivering smooth trajectories and showcasing diverse driving behaviors. The pre-trained model also provides a good basis for downstream fine-tuning tasks. The number of parameters of our model is over 50 million.

Vision Transformers for Computer Go

  • paper_url: http://arxiv.org/abs/2309.12675
  • repo_url: https://github.com/assasinator/Swin_Transformers
  • paper_authors: Amani Sagri, Tristan Cazenave, Jérôme Arjonilla, Abdallah Saffidine
  • for: investigating the application of transformers, in particular the Transformer in Vision, to the game of Go.
  • methods: compare transformers to the usual Residual Networks.
  • results: a detailed analysis of prediction accuracy, win rates, memory, speed, size, and learning rate highlights the substantial role transformers can play in the game of Go.
    Abstract Motivated by the success of transformers in various fields, such as language understanding and image analysis, this investigation explores their application in the context of the game of Go. In particular, our study focuses on the analysis of the Transformer in Vision. Through a detailed analysis of numerous points such as prediction accuracy, win rates, memory, speed, size, or even learning rate, we have been able to highlight the substantial role that transformers can play in the game of Go. This study was carried out by comparing them to the usual Residual Networks.

On Sparse Modern Hopfield Model

  • paper_url: http://arxiv.org/abs/2309.12673
  • repo_url: None
  • paper_authors: Jerry Yao-Chieh Hu, Donglin Yang, Dennis Wu, Chenwei Xu, Bo-Yu Chen, Han Liu
  • for: The paper introduces the Sparse Modern Hopfield Model, a sparse extension of the modern Hopfield model.
  • methods: It uses a sparse attention mechanism, derives a closed-form sparse Hopfield energy from the convex conjugate of the sparse entropic regularizer, and obtains the sparse memory-retrieval dynamics from this energy.
  • results: The one-step approximation of the retrieval dynamics is shown to be equivalent to sparse-structured attention, and a sparsity-dependent memory retrieval error bound is provided that is provably tighter than its dense analog.
    Abstract We introduce the sparse modern Hopfield model as a sparse extension of the modern Hopfield model. Like its dense counterpart, the sparse modern Hopfield model equips a memory-retrieval dynamics whose one-step approximation corresponds to the sparse attention mechanism. Theoretically, our key contribution is a principled derivation of a closed-form sparse Hopfield energy using the convex conjugate of the sparse entropic regularizer. Building upon this, we derive the sparse memory retrieval dynamics from the sparse energy function and show its one-step approximation is equivalent to the sparse-structured attention. Importantly, we provide a sparsity-dependent memory retrieval error bound which is provably tighter than its dense analog. The conditions for the benefits of sparsity to arise are therefore identified and discussed. In addition, we show that the sparse modern Hopfield model maintains the robust theoretical properties of its dense counterpart, including rapid fixed point convergence and exponential memory capacity. Empirically, we use both synthetic and real-world datasets to demonstrate that the sparse Hopfield model outperforms its dense counterpart in many situations.
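The core update the abstract describes, memory retrieval whose one-step approximation is sparse attention, can be illustrated by swapping softmax for sparsemax. Below is a minimal NumPy sketch, not the authors' implementation; patterns are assumed to be stored as columns of `X` and the function names are illustrative.

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex
    # (Martins & Astudillo, 2016); yields exactly sparse weights.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

def sparse_hopfield_retrieve(X, xi, beta=1.0, steps=3):
    # X: (d, M) stored patterns as columns; xi: (d,) query state.
    # A single update step corresponds to sparse-structured attention.
    for _ in range(steps):
        xi = X @ sparsemax(beta * (X.T @ xi))
    return xi

X = np.random.randn(16, 5)                       # five stored patterns
noisy_query = X[:, 0] + 0.1 * np.random.randn(16)
retrieved = sparse_hopfield_retrieve(X, noisy_query)
```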

How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization

  • paper_url: http://arxiv.org/abs/2309.12671
  • repo_url: https://github.com/betray12138/unified-model-shift-and-model-bias-policy-optimization
  • paper_authors: Hai Zhang, Hang Yu, Junqiao Zhao, Di Zhang, Chang Huang, Hongtu Zhou, Xiao Zhang, Chen Ye
  • for: The paper proposes a design approach for model-based reinforcement learning (MBRL) algorithms that carries a performance improvement guarantee.
  • methods: It derives an optimization objective that unifies model shift and model bias, and formulates a fine-tuning process that adaptively adjusts model updates to guarantee improvement while avoiding model overfitting.
  • results: The resulting algorithm, USB-PO, achieves state-of-the-art performance on several challenging benchmark tasks.
    Abstract Designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance improvement guarantee is challenging, mainly attributed to the high coupling between model learning and policy optimization. Many prior methods that rely on return discrepancy to guide model learning ignore the impacts of model shift, which can lead to performance deterioration due to excessive model updates. Other methods use performance difference bound to explicitly consider model shift. However, these methods rely on a fixed threshold to constrain model shift, resulting in a heavy dependence on the threshold and a lack of adaptability during the training process. In this paper, we theoretically derive an optimization objective that can unify model shift and model bias and then formulate a fine-tuning process. This process adaptively adjusts the model updates to get a performance improvement guarantee while avoiding model overfitting. Based on these, we develop a straightforward algorithm USB-PO (Unified model Shift and model Bias Policy Optimization). Empirical results show that USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.

Natural revision is contingently-conditionalized revision

  • paper_url: http://arxiv.org/abs/2309.12655
  • repo_url: None
  • paper_authors: Paolo Liberatore
  • for: The paper investigates extending natural revision from simple formulae expressing universal truths to conditionals expressing conditional truths.
  • methods: The extension builds on the basic principles natural revision follows: minimal change, indifference, and naivety.
  • results: The extension shows that natural revision restricts changes to the current conditions, which resolves some counterexamples while exposing its limits.
    Abstract Natural revision seems so natural: it changes beliefs as little as possible to incorporate new information. Yet, some counterexamples show it wrong. It is so conservative that it never fully believes. It only believes in the current conditions. This is right in some cases and wrong in others. Which is which? The answer requires extending natural revision from simple formulae expressing universal truths (something holds) to conditionals expressing conditional truth (something holds in certain conditions). The extension is based on the basic principles natural revision follows, identified as minimal change, indifference and naivety: change beliefs as little as possible; equate the likeliness of scenarios by default; believe all until contradicted. The extension says that natural revision restricts changes to the current conditions. A comparison with an unrestricting revision shows what exactly the current conditions are. It is not what currently considered true if it contradicts the new information. It includes something more and more unlikely until the new information is at least possible.
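For readers unfamiliar with the baseline operator, Boutilier-style natural revision over a ranked set of possible worlds can be written in a few lines: the most plausible worlds satisfying the new formula become strictly most plausible, and everything else keeps its relative order. This is a toy sketch of the classical operator the paper starts from, not of its conditional extension; `satisfies` is an assumed helper.

```python
def natural_revision(rank, satisfies, formula):
    # rank: dict mapping world -> plausibility rank (0 = most plausible)
    # satisfies(world, formula) -> bool
    a_worlds = [w for w in rank if satisfies(w, formula)]
    if not a_worlds:
        return dict(rank)                 # formula impossible: no change
    best = min(rank[w] for w in a_worlds)
    new_rank = {}
    for w in rank:
        if satisfies(w, formula) and rank[w] == best:
            new_rank[w] = 0               # minimal formula-worlds move to top
        else:
            new_rank[w] = rank[w] + 1     # shift preserves relative order
    return new_rank

# worlds encoded as (rain, wet) truth pairs, revised by "it rains"
rank = {(True, True): 2, (True, False): 3, (False, True): 1, (False, False): 0}
revised = natural_revision(rank, lambda w, _: w[0], "rain")
```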

Are Deep Learning Classification Results Obtained on CT Scans Fair and Interpretable?

  • paper_url: http://arxiv.org/abs/2309.12632
  • repo_url: None
  • paper_authors: Mohamad M. A. Ashames, Ahmet Demir, Omer N. Gerek, Mehmet Fidan, M. Bilginer Gulmezoglu, Semih Ergin, Mehmet Koc, Atalay Barkana, Cuneyt Calisir
  • for: The study asks whether deep learning classification results obtained on CT scans are fair and interpretable.
  • methods: Deep learning models are trained with strict patient-level separation of the training, validation, and test sets.
  • results: Models trained with the traditional, unfair data shuffling can report misleading accuracy and learn irrelevant features, whereas models trained with strict patient-level separation maintain their accuracy on new patient images and, per heat-map visualizations, focus more on the relevant nodules.
    Abstract Following the great success of various deep learning methods in image and object classification, the biomedical image processing society is also overwhelmed with their applications to various automatic diagnosis cases. Unfortunately, most of the deep learning-based classification attempts in the literature solely focus on the aim of extreme accuracy scores, without considering interpretability, or patient-wise separation of training and test data. For example, most lung nodule classification papers using deep learning randomly shuffle data and split it into training, validation, and test sets, causing certain images from the CT scan of a person to be in the training set, while other images of the exact same person to be in the validation or testing image sets. This can result in reporting misleading accuracy rates and the learning of irrelevant features, ultimately reducing the real-life usability of these models. When the deep neural networks trained on the traditional, unfair data shuffling method are challenged with new patient images, it is observed that the trained models perform poorly. In contrast, deep neural networks trained with strict patient-level separation maintain their accuracy rates even when new patient images are tested. Heat-map visualizations of the activations of the deep neural networks trained with strict patient-level separation indicate a higher degree of focus on the relevant nodules. We argue that the research question posed in the title has a positive answer only if the deep neural networks are trained with images of patients that are strictly isolated from the validation and testing patient sets.
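Enforcing the patient-level separation the paper argues for is straightforward with scikit-learn's group-aware splitters. A minimal sketch, with illustrative toy arrays rather than the authors' pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(images, labels, patient_ids, test_size=0.2, seed=0):
    """Guarantee that no patient's slices appear in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(images, labels,
                                              groups=patient_ids))
    leaked = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    assert not leaked, "patient appears in both splits"
    return train_idx, test_idx

# toy example: six CT slices from three patients
images = np.arange(6)
labels = np.array([0, 0, 1, 1, 0, 1])
patients = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])
train_idx, test_idx = patient_level_split(images, labels, patients)
```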

A Quantum Computing-based System for Portfolio Optimization using Future Asset Values and Automatic Reduction of the Investment Universe

  • paper_url: http://arxiv.org/abs/2309.12627
  • repo_url: None
  • paper_authors: Eneko Osaba, Guillaume Gelabert, Esther Villar-Rodriguez, Antón Asla, Izaskun Oregi
  • for: The study addresses the portfolio optimization problem using quantum computing techniques.
  • methods: The system is modeled to work with future predictions of asset values instead of historical values, and includes an automatic universe reduction module conceived to intelligently reduce the complexity of the problem.
  • results: The authors give a preliminary discussion of the performance of the different modules composing the prototypical version of Q4FuturePOP.
    Abstract One of the problems in quantitative finance that has received the most attention is the portfolio optimization problem. Regarding its solving, this problem has been approached using different techniques, with those related to quantum computing being especially prolific in recent years. In this study, we present a system called Quantum Computing-based System for Portfolio Optimization with Future Asset Values and Automatic Universe Reduction (Q4FuturePOP), which deals with the Portfolio Optimization Problem considering the following innovations: i) the developed tool is modeled for working with future prediction of assets, instead of historical values; and ii) Q4FuturePOP includes an automatic universe reduction module, which is conceived to intelligently reduce the complexity of the problem. We also introduce a brief discussion about the preliminary performance of the different modules that compose the prototypical version of Q4FuturePOP.
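For context, binary portfolio selection is typically handed to quantum (annealing or QAOA) solvers as a QUBO. The sketch below builds a generic Markowitz-style QUBO from predicted returns; the numbers and the objective are illustrative and are not Q4FuturePOP's actual formulation.

```python
import numpy as np

def portfolio_qubo(mu, Sigma, risk_aversion=0.5):
    """Q for  min_x x^T Q x  with x in {0,1}^n: reward expected return
    (linear terms on the diagonal), penalize covariance risk."""
    Q = risk_aversion * np.asarray(Sigma, dtype=float).copy()
    Q[np.diag_indices(len(mu))] -= np.asarray(mu, dtype=float)
    return Q

mu = np.array([0.12, 0.10, 0.07])             # predicted future returns
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.05]])         # covariance estimate
Q = portfolio_qubo(mu, Sigma)

best_x, best_val = None, float("inf")          # brute-force check, 3 assets
for b in range(2 ** len(mu)):
    x = np.array([(b >> i) & 1 for i in range(len(mu))])
    val = x @ Q @ x
    if val < best_val:
        best_x, best_val = x, val
```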

Construction contract risk identification based on knowledge-augmented language model

  • paper_url: http://arxiv.org/abs/2309.12626
  • repo_url: None
  • paper_authors: Saika Wong, Chunmo Zheng, Xing Su, Yinqiu Tang
  • for: The study aims to make construction contract review more effective so that potential losses are avoided.
  • methods: A large language model (LLM) is augmented with construction contract domain knowledge, expressed in natural language, to improve risk identification without any fine-tuning.
  • results: The method achieves solid performance on real construction contracts; the authors also investigate how LLMs employ logical thinking during the task and offer insights and recommendations for future research.
    Abstract Contract review is an essential step in construction projects to prevent potential losses. However, the current methods for reviewing construction contracts lack effectiveness and reliability, leading to time-consuming and error-prone processes. While large language models (LLMs) have shown promise in revolutionizing natural language processing (NLP) tasks, they struggle with domain-specific knowledge and addressing specialized issues. This paper presents a novel approach that leverages LLMs with construction contract knowledge to emulate the process of contract review by human experts. Our tuning-free approach incorporates construction contract domain knowledge to enhance language models for identifying construction contract risks. The use of a natural language when building the domain knowledge base facilitates practical implementation. We evaluated our method on real construction contracts and achieved solid performance. Additionally, we investigated how large language models employ logical thinking during the task and provide insights and recommendations for future research.
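The tuning-free pattern the abstract describes, an LLM paired with a natural-language knowledge base, amounts to retrieve-then-prompt. In the sketch below, the knowledge entries, retrieval scoring, prompt wording, and the `llm` callable are all illustrative assumptions, not the paper's implementation.

```python
KNOWLEDGE_BASE = [
    "Clauses shifting all differing-site-condition risk to the contractor are high risk.",
    "Missing caps on liquidated damages expose the contractor to unbounded liability.",
    "Pay-when-paid clauses create cash-flow risk for subcontractors.",
]

def retrieve(clause, kb, top_k=2):
    """Toy lexical-overlap retrieval; a real system could use embeddings."""
    def score(doc):
        return len(set(clause.lower().split()) & set(doc.lower().split()))
    return sorted(kb, key=score, reverse=True)[:top_k]

def identify_risks(clause, llm):
    knowledge = "\n".join(retrieve(clause, KNOWLEDGE_BASE))
    prompt = ("You are reviewing a construction contract.\n"
              f"Relevant domain knowledge:\n{knowledge}\n\n"
              f"Clause:\n{clause}\n\n"
              "List any contract risks present in this clause.")
    return llm(prompt)
```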

DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients

  • paper_url: http://arxiv.org/abs/2309.12625
  • repo_url: https://github.com/hanyin88/drg-llama
  • paper_authors: Hanyin Wang, Chufan Gao, Christopher Dantona, Bryan Hull, Jimeng Sun
  • for: This paper aims to improve the efficiency of the Diagnosis-Related Group (DRG) assignment process in the U.S. inpatient payment system by using an advanced large language model (LLM) fine-tuned on clinical notes.
  • methods: The paper introduces DRG-LLaMA, a LLM fine-tuned on 236,192 MIMIC-IV discharge summaries using Low-Rank Adaptation (LoRA) to enhance DRG assignment, with a maximum input token length of 512.
  • results: DRG-LLaMA surpassed prior leading models in DRG prediction with a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged AUC of 0.986, relative improvements of 40.3% and 35.7% in macro-averaged F1 over ClinicalBERT and CAML, respectively; for base DRG and CC/MCC prediction it achieved top-1 accuracies of 67.8% and 67.5%.
    Abstract In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. The study introduces DRG-LLaMA, an advanced large language model (LLM) fine-tuned on clinical notes to enhance DRGs assignment. Utilizing LLaMA as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our DRG-LLaMA-7B model exhibited a noteworthy macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. This model surpassed the performance of prior leading models in DRG prediction, showing a relative improvement of 40.3% and 35.7% in macro-averaged F1 score compared to ClinicalBERT and CAML, respectively. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, DRG-LLaMA achieved a top-1 prediction accuracy of 67.8% and 67.5%, respectively. Additionally, our findings indicate that DRG-LLaMA's performance correlates with increased model parameters and input context lengths.
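Low-Rank Adaptation of the kind described is commonly set up with the `peft` library. A minimal sketch follows; the checkpoint name, LoRA rank, target modules, and label count are illustrative assumptions rather than the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# one output per DRG code; 738 is illustrative of the MS-DRG label space
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", num_labels=738)
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="SEQ_CLS")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the low-rank adapters are trained
```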

  • paper_url: http://arxiv.org/abs/2309.12579
  • repo_url: None
  • paper_authors: Parag Saxena
  • for: The report aims to bring data-driven insights to modern agriculture by improving education and outreach in horticulture through a machine learning framework.
  • methods: Machine learning models and natural language processing (NLP) techniques classify homeowners' questions and forecast the volume and topics of future questions.
  • results: The results show that machine learning can predict horticulture trends, including the topics of future queries; they suggest that large-scale agriculture industries that curate and maintain comparable repositories of textual data could use them for trend prediction and strategic agricultural planning.
    Abstract Data-driven insights are essential for modern agriculture. This research paper introduces a machine learning framework designed to improve how we educate and reach out to people in the field of horticulture. The framework relies on data from the Horticulture Online Help Desk (HOHD), which is like a big collection of questions from people who love gardening and are part of the Extension Master Gardener Program (EMGP). This framework has two main parts. First, it uses special computer programs (machine learning models) to sort questions into categories. This helps us quickly send each question to the right expert, so we can answer it faster. Second, it looks at when questions are asked and uses that information to guess how many questions we might get in the future and what they will be about. This helps us plan on topics that will be really important. It's like knowing what questions will be popular in the coming months. We also take into account where the questions come from by looking at the Zip Code. This helps us make research that fits the challenges faced by gardeners in different places. In this paper, we demonstrate the potential of machine learning techniques to predict trends in horticulture by analyzing textual queries from homeowners. We show that NLP, classification, and time series analysis can be used to identify patterns in homeowners' queries and predict future trends in horticulture. Our results suggest that machine learning could be used to predict trends in other agricultural sectors as well. If large-scale agriculture industries curate and maintain a comparable repository of textual data, the potential for trend prediction and strategic agricultural planning could be revolutionized. This convergence of technology and agriculture offers a promising pathway for the future of sustainable farming and data-informed agricultural practices
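The query-classification half of such a framework can be prototyped in a few lines of scikit-learn; the categories and example queries below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = ["my tomato leaves are curling and yellow",
           "when should I prune apple trees",
           "white grubs destroying my lawn"]
labels = ["plant disease", "pruning", "pests"]

# TF-IDF features feed a simple linear classifier that routes each
# question to the right topic (and, in practice, the right expert)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(queries, labels)
print(clf.predict(["brown spots on my rose leaves"]))
```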

  • paper_url: http://arxiv.org/abs/2309.12576
  • repo_url: None
  • paper_authors: Robert Underwood, Meghana Madhastha, Randal Burns, Bogdan Nicolae
  • for: The paper studies how deep learning model architectures empirically evolve over time under Regularized Evolution, which has design implications for caching policies, refining the search algorithm for particular applications, and other use cases.
  • methods: The authors algorithmically analyze and quantitatively characterize the evolution patterns of models from the Candle project and the NASBench-201 search space.
  • results: They show how the regularized evolution algorithm influences the evolution of model structure, describe evolutionary patterns in distributed settings along with opportunities for caching and improved scheduling, and identify the conditions under which particular architectures rise and fall in popularity based on how often they act as donors within a sliding window.
    Abstract Network Architecture Search and specifically Regularized Evolution is a common way to refine the structure of a deep learning model.However, little is known about how models empirically evolve over time which has design implications for designing caching policies, refining the search algorithm for particular applications, and other important use cases.In this work, we algorithmically analyze and quantitatively characterize the patterns of model evolution for a set of models from the Candle project and the Nasbench-201 search space.We show how the evolution of the model structure is influenced by the regularized evolution algorithm. We describe how evolutionary patterns appear in distributed settings and opportunities for caching and improved scheduling. Lastly, we describe the conditions that affect when particular model architectures rise and fall in popularity based on their frequency of acting as a donor in a sliding window.
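For reference, regularized (aging) evolution, the search procedure whose dynamics the paper characterizes, selects parents by tournament and retires the oldest member rather than the worst. A minimal sketch in the style of Real et al. (2019), with placeholder `random_arch`, `mutate`, and `fitness` callables standing in for a concrete search space such as NASBench-201:

```python
import collections
import random

def regularized_evolution(random_arch, mutate, fitness,
                          cycles=1000, population_size=50, sample_size=10):
    population = collections.deque()
    history = []
    for _ in range(population_size):           # seed with random architectures
        arch = random_arch()
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])   # tournament selection
        child = mutate(parent[0])                  # parent acts as a donor
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                       # kill the oldest, not the worst
    return max(history, key=lambda p: p[1])
```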

Creativity Support in the Age of Large Language Models: An Empirical Study Involving Emerging Writers

  • paper_url: http://arxiv.org/abs/2309.12570
  • repo_url: None
  • paper_authors: Tuhin Chakrabarty, Vishakh Padmakumar, Faeze Brahman, Smaranda Muresan
  • for: The paper investigates the utility of modern large language models (LLMs) as writing-support tools for professional writers.
  • methods: An empirical user study (n=30) built on a collaborative writing interface grounded in the cognitive process model of writing, which views writing as a goal-oriented thinking process encompassing planning, translating, and reviewing.
  • results: Writers seek LLM help across all three types of cognitive activities but find LLMs most helpful in translating and reviewing.
    Abstract The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions sparked increased interest in their utilization across various support tools. We investigate the utility of modern LLMs in assisting professional writers via an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing that views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to submit a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Upon analyzing the writer-LLM interactions, we find that while writers seek LLM's help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.

A Study on Learning Social Robot Navigation with Multimodal Perception

  • paper_url: http://arxiv.org/abs/2309.12568
  • repo_url: https://github.com/robotixx/multimodal-fusion-network
  • paper_authors: Bhabaranjan Panigrahi, Amir Hossain Raj, Mohammad Nazeri, Xuesu Xiao
  • for: The study aims at robots that navigate autonomously in human-inhabited public spaces while accounting for surrounding humans and their intentions, i.e., being socially compliant.
  • methods: Machine learning is used to capture complex and subtle social interactions in a data-driven manner, with multiple perception modalities (LiDAR and RGB cameras) compared across different social scenarios on a large-scale real-world dataset.
  • results: Multimodal learning shows a clear advantage over unimodal learning for social navigation decision making, in both the dataset experiments and a human study; training and generalizability performance are analyzed, and the code is open-sourced for future research on multimodal perception for social robot navigation.
    Abstract Autonomous mobile robots need to perceive the environments with their onboard sensors (e.g., LiDARs and RGB cameras) and then make appropriate navigation decisions. In order to navigate human-inhabited public spaces, such a navigation task becomes more than only obstacle avoidance, but also requires considering surrounding humans and their intentions to somewhat change the navigation behavior in response to the underlying social norms, i.e., being socially compliant. Machine learning methods are shown to be effective in capturing those complex and subtle social interactions in a data-driven manner, without explicitly hand-crafting simplified models or cost functions. Considering multiple available sensor modalities and the efficiency of learning methods, this paper presents a comprehensive study on learning social robot navigation with multimodal perception using a large-scale real-world dataset. The study investigates social robot navigation decision making on both the global and local planning levels and contrasts unimodal and multimodal learning against a set of classical navigation approaches in different social scenarios, while also analyzing the training and generalizability performance from the learning perspective. We also conduct a human study on how learning with multimodal perception affects the perceived social compliance. The results show that multimodal learning has a clear advantage over unimodal learning in both dataset and human studies. We open-source our code for the community's future use to study multimodal perception for learning social robot navigation.
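A simple late-fusion policy over the two modalities gives the flavor of such a model. The dimensions, encoders, and two-branch design below are illustrative assumptions, not the released architecture from the repository.

```python
import torch
import torch.nn as nn

class MultimodalFusionPolicy(nn.Module):
    def __init__(self, lidar_dim=720, img_feat_dim=512, hidden=256, action_dim=2):
        super().__init__()
        self.lidar_enc = nn.Sequential(nn.Linear(lidar_dim, hidden), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_feat_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))   # e.g. linear and angular velocity

    def forward(self, lidar_scan, img_features):
        # concatenate per-modality embeddings, then decode an action
        fused = torch.cat([self.lidar_enc(lidar_scan),
                           self.img_enc(img_features)], dim=-1)
        return self.head(fused)

policy = MultimodalFusionPolicy()
action = policy(torch.randn(1, 720), torch.randn(1, 512))
```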

Machine Learning Meets Advanced Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2309.12560
  • repo_url: None
  • paper_authors: Saeid Nahavandi, Roohallah Alizadehsani, Darius Nahavandi, Chee Peng Lim, Kevin Kelly, Fernando Bello
  • for: The paper surveys the application of machine learning methods in automation and robotics, specifically in manipulation tasks, with the goal of improving the quality, efficiency, and safety of automation systems.
  • methods: It reviews cutting-edge technologies and recent trends in machine learning methods applied to real-world manipulation tasks, covering applications in industry, healthcare, agriculture, space, military, and search and rescue.
  • results: The survey highlights the potential of machine learning methods to improve the safety, reliability, and efficiency of automation systems, gives an overview of the current state of the field, and identifies important research directions for future work.
    Abstract Automated industries lead to high quality production, lower manufacturing cost and better utilization of human resources. Robotic manipulator arms have major role in the automation process. However, for complex manipulation tasks, hard coding efficient and safe trajectories is challenging and time consuming. Machine learning methods have the potential to learn such controllers based on expert demonstrations. Despite promising advances, better approaches must be developed to improve safety, reliability, and efficiency of ML methods in both training and deployment phases. This survey aims to review cutting edge technologies and recent trends on ML methods applied to real-world manipulation tasks. After reviewing the related background on ML, the rest of the paper is devoted to ML applications in different domains such as industry, healthcare, agriculture, space, military, and search and rescue. The paper is closed with important research directions for future works.

Invariant Learning via Probability of Sufficient and Necessary Causes

  • paper_url: http://arxiv.org/abs/2309.12559
  • repo_url: https://github.com/ymy4323460/casn
  • paper_authors: Mengyue Yang, Zhen Fang, Yonggang Zhang, Yali Du, Furui Liu, Jean-Francois Ton, Jun Wang
  • for: Improving out-of-distribution (OOD) generalization, where the test distribution is typically unknown and differs from the training distribution.
  • methods: A causality-based approach that uses the probability of necessity and sufficiency (PNS) to capture features that are both necessary and sufficient causes, formulating a PNS risk and an algorithm to learn representations with a high PNS value.
  • results: Experiments on synthetic and real-world benchmarks demonstrate the method's effectiveness, and the generalizability of the PNS risk is analyzed and proven theoretically; implementation details are at the GitHub repository: https://github.com/ymy4323460/CaSN.
    Abstract Out-of-distribution (OOD) generalization is indispensable for learning models in the wild, where testing distribution typically unknown and different from the training. Recent methods derived from causality have shown great potential in achieving OOD generalization. However, existing methods mainly focus on the invariance property of causes, while largely overlooking the property of \textit{sufficiency} and \textit{necessity} conditions. Namely, a necessary but insufficient cause (feature) is invariant to distribution shift, yet it may not have required accuracy. By contrast, a sufficient yet unnecessary cause (feature) tends to fit specific data well but may have a risk of adapting to a new domain. To capture the information of sufficient and necessary causes, we employ a classical concept, the probability of sufficiency and necessary causes (PNS), which indicates the probability of whether one is the necessary and sufficient cause. To associate PNS with OOD generalization, we propose PNS risk and formulate an algorithm to learn representation with a high PNS value. We theoretically analyze and prove the generalizability of the PNS risk. Experiments on both synthetic and real-world benchmarks demonstrate the effectiveness of the proposed method. The details of the implementation can be found at the GitHub repository: https://github.com/ymy4323460/CaSN.
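For reference, the probability of necessity and sufficiency that the method builds on is defined counterfactually. A standard statement from the causality literature (not reproduced from the paper), for binary treatment X and outcome Y, is:

```latex
\mathrm{PNS} = P\big(Y_{X=x}=y,\; Y_{X=x'}=y'\big),
\qquad
\max\{0,\, P(y_x)-P(y_{x'})\} \;\le\; \mathrm{PNS} \;\le\; \min\{P(y_x),\, P(y'_{x'})\}.
```

A feature with high PNS is thus one whose presence makes the outcome likely and whose absence makes it unlikely, which is the intuition the PNS risk optimizes for.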

PlanFitting: Tailoring Personalized Exercise Plans with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12555
  • repo_url: None
  • paper_authors: Donghoon Shin, Gary Hsieh, Young-Ho Kim
  • for: The study aims to help users create personalized exercise plans that suit their specific circumstances while staying grounded in foundational principles.
  • methods: PlanFitting leverages the generative capabilities of large language models, letting users describe constraints and queries in natural language to create and refine their weekly exercise plan.
  • results: A user study (N=18) and an expert evaluation (N=3) indicate that PlanFitting generates personalized, actionable, and evidence-based exercise plans; the authors discuss design opportunities for AI assistants that better comply with exercise principles and accommodate personal constraints.
    Abstract A personally tailored exercise regimen is crucial to ensuring sufficient physical activities, yet challenging to create as people have complex schedules and considerations and the creation of plans often requires iterations with experts. We present PlanFitting, a conversational AI that assists in personalized exercise planning. Leveraging generative capabilities of large language models, PlanFitting enables users to describe various constraints and queries in natural language, thereby facilitating the creation and refinement of their weekly exercise plan to suit their specific circumstances while staying grounded in foundational principles. Through a user study where participants (N=18) generated a personalized exercise plan using PlanFitting and expert planners (N=3) evaluated these plans, we identified the potential of PlanFitting in generating personalized, actionable, and evidence-based exercise plans. We discuss future design opportunities for AI assistants in creating plans that better comply with exercise principles and accommodate personal constraints.

Provably Robust and Plausible Counterfactual Explanations for Neural Networks via Robust Optimisation

  • paper_url: http://arxiv.org/abs/2309.12545
  • repo_url: https://github.com/junqi-jiang/proplace
  • paper_authors: Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, Francesca Toni
  • for: The paper addresses counterfactual explanations (CEs) for neural network classifiers that remain valid under model parameter changes.
  • methods: It proposes PROPLACE, which leverages robust optimisation techniques and an iterative algorithm, proven to converge and to be sound and complete, to compute provably robust and plausible CEs.
  • results: In a comparison against six baselines, five of which target robustness, PROPLACE achieves state-of-the-art performance on metrics covering three evaluation aspects.
    Abstract Counterfactual Explanations (CEs) have received increasing interest as a major methodology for explaining neural network classifiers. Usually, CEs for an input-output pair are defined as data points with minimum distance to the input that are classified with a different label than the output. To tackle the established problem that CEs are easily invalidated when model parameters are updated (e.g. retrained), studies have proposed ways to certify the robustness of CEs under model parameter changes bounded by a norm ball. However, existing methods targeting this form of robustness are not sound or complete, and they may generate implausible CEs, i.e., outliers wrt the training dataset. In fact, no existing method simultaneously optimises for proximity and plausibility while preserving robustness guarantees. In this work, we propose Provably RObust and PLAusible Counterfactual Explanations (PROPLACE), a method leveraging on robust optimisation techniques to address the aforementioned limitations in the literature. We formulate an iterative algorithm to compute provably robust CEs and prove its convergence, soundness and completeness. Through a comparative experiment involving six baselines, five of which target robustness, we show that PROPLACE achieves state-of-the-art performances against metrics on three evaluation aspects.

cs.CL - 2023-09-22

Document Understanding for Healthcare Referrals

  • paper_url: http://arxiv.org/abs/2309.13184
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jimit Mistry, Natalia M. Arzeno
  • for: Improving the efficiency of healthcare referral management and reducing administrative costs and errors.
  • methods: The paper proposes a hybrid model that combines LayoutLMv3 with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents.
  • results: Adding domain-specific rules to the transformer model yields greatly increased precision and F1 scores, suggesting that a hybrid model trained on a curated dataset can increase efficiency in referral management.
    Abstract Reliance on scanned documents and fax communication for healthcare referrals leads to high administrative costs and errors that may affect patient care. In this work we propose a hybrid model leveraging LayoutLMv3 along with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents. We explore some of the challenges in applying a document understanding model to referrals, which have formats varying by medical practice, and evaluate model performance using MUC-5 metrics to obtain appropriate metrics for the practical use case. Our analysis shows the addition of domain-specific rules to the transformer model yields greatly increased precision and F1 scores, suggesting a hybrid model trained on a curated dataset can increase efficiency in referral management.
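The hybrid pattern, transformer predictions backed by high-precision rules, can be sketched as a post-processing step. The rule patterns and the `model_entities` interface below are illustrative, not the paper's implementation.

```python
import re

RULES = {
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "DATE_OF_BIRTH": re.compile(r"DOB[:\s]+\d{1,2}/\d{1,2}/\d{2,4}"),
}

def hybrid_extract(text, model_entities):
    """Start from model predictions, then add high-precision rule hits."""
    entities = list(model_entities(text))  # e.g. LayoutLMv3 token-level output
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            span = (match.start(), match.end())
            if not any(e["span"] == span for e in entities):
                entities.append({"label": label, "span": span,
                                 "text": match.group()})
    return entities
```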

Effective Distillation of Table-based Reasoning Ability from LLMs

  • paper_url: http://arxiv.org/abs/2309.13182
  • repo_url: None
  • paper_authors: Bohao Yang, Chen Tang, Kun Zhao, Chenghao Xiao, Chenghua Lin
  • for: This paper aims to specialize table reasoning skills in smaller models for table-to-text generation tasks.
  • methods: The proposed method uses distillation to transfer specific capabilities of large language models (LLMs) to smaller models specifically tailored for table-based reasoning.
  • results: The fine-tuned model (Flan-T5-base) achieves significant improvement over traditionally fine-tuned baselines and outperforms specific LLMs like gpt-3.5-turbo on the scientific table-to-text generation dataset (SciGen).
    Abstract Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their large parameter size and high demand for computing resources pose challenges for their practical deployment. Recent research has revealed that specific capabilities of LLMs, such as numerical reasoning, can be transferred to smaller models through distillation. Some studies explore the potential of leveraging LLMs to perform table-based reasoning. Nevertheless, prior to our work, there has been no investigation into the prospect of specialising table reasoning skills in smaller models specifically tailored for table-to-text generation tasks. In this paper, we propose a novel table-based reasoning distillation approach, with the aim of distilling LLMs into tailored, smaller models specifically designed for table-based reasoning tasks. Experimental results have shown that a 0.22 billion parameter model (Flan-T5-base) fine-tuned using distilled data not only achieves a significant improvement compared to traditionally fine-tuned baselines but also surpasses specific LLMs like gpt-3.5-turbo on the scientific table-to-text generation dataset (SciGen). The code and data are released in https://github.com/Bernard-Yang/TableDistill.
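Fine-tuning a small seq2seq model on teacher-distilled data follows the standard `transformers` recipe. A minimal sketch; the data fields and hyperparameters are illustrative assumptions, not the paper's setup.

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def preprocess(example):
    # input: a linearized table plus reasoning elicited from a teacher LLM;
    # target: the reference table description
    x = tok(example["table_with_reasoning"], truncation=True, max_length=1024)
    x["labels"] = tok(example["description"], truncation=True).input_ids
    return x

args = Seq2SeqTrainingArguments(output_dir="distilled-table-t5",
                                per_device_train_batch_size=8,
                                num_train_epochs=3, learning_rate=3e-4)
# with a dataset of distilled examples in hand:
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=dataset.map(preprocess))
# trainer.train()
```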

BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

  • paper_url: http://arxiv.org/abs/2309.13173
  • repo_url: None
  • paper_authors: Mohsinul Kabir, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, M Saiful Bari, Enamul Hoque
  • for: The study evaluates large language models (LLMs) on the low-resource Bengali (Bangla) language.
  • methods: Various important and diverse Bangla NLP tasks, including abstractive summarization, question answering, paraphrasing, natural language inference, text classification, and sentiment analysis, are used for zero-shot evaluation of ChatGPT, LLaMA-2, and Claude-2, and performance is compared with state-of-the-art fine-tuned models.
  • results: The experiments show inferior LLM performance across the different Bangla NLP tasks, calling for further effort to develop a better understanding of LLMs in low-resource languages like Bangla.
    Abstract Large Language Models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP) for their impressive skills in language generation and other language-specific tasks. Though LLMs have been evaluated in various tasks, mostly in English, they have not yet undergone thorough evaluation in under-resourced languages such as Bengali (Bangla). In this paper, we evaluate the performance of LLMs for the low-resourced Bangla language. We select various important and diverse Bangla NLP tasks, such as abstractive summarization, question answering, paraphrasing, natural language inference, text classification, and sentiment analysis for zero-shot evaluation with ChatGPT, LLaMA-2, and Claude-2 and compare the performance with state-of-the-art fine-tuned models. Our experimental results demonstrate an inferior performance of LLMs for different Bangla NLP tasks, calling for further effort to develop better understanding of LLMs in low-resource languages like Bangla.

Cardiovascular Disease Risk Prediction via Social Media

  • paper_url: http://arxiv.org/abs/2309.13147
  • repo_url: None
  • paper_authors: Al Zadid Sultan Bin Habib, Md Asif Bin Syed, Md Tanvirul Islam, Donald A. Adjeroh
  • for: Predicting cardiovascular disease (CVD) risk.
  • methods: Tweets and sentiment analysis are used to predict CVD risk: a new dictionary of CVD-related keywords was developed, the VADER model performed sentiment analysis, and machine learning models classified users as potentially at CVD risk, with results compared against a CDC dataset with demographic information.
  • results: Analyzing the emotions expressed in tweets surpassed the predictive power of demographic data alone and enabled the identification of individuals at potential risk of developing CVD, highlighting the potential of NLP and machine learning for public health monitoring via Twitter.
    Abstract Researchers use Twitter and sentiment analysis to predict Cardiovascular Disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing emotions expressed in tweets. Tweets from eighteen US states, including the Appalachian region, were collected. Using the VADER model for sentiment analysis, users were classified as potentially at CVD risk. Machine Learning (ML) models were employed to classify individuals' CVD risk and applied to a CDC dataset with demographic information to make the comparison. Performance evaluation metrics such as Test Accuracy, Precision, Recall, F1 score, Mathew's Correlation Coefficient (MCC), and Cohen's Kappa (CK) score were considered. Results demonstrated that analyzing tweets' emotions surpassed the predictive power of demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of Natural Language Processing (NLP) and ML techniques in using tweets to identify individuals with CVD risks, providing an alternative approach to traditional demographic information for public health monitoring.
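The sentiment side of such a pipeline uses the off-the-shelf VADER analyzer. A minimal sketch; the keyword list and risk threshold are illustrative, not the study's dictionary or criteria.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

CVD_KEYWORDS = {"chest pain", "hypertension", "shortness of breath"}
analyzer = SentimentIntensityAnalyzer()

def flag_tweet(tweet, threshold=-0.05):
    """Flag tweets that mention CVD-related terms with negative sentiment."""
    mentions_cvd = any(k in tweet.lower() for k in CVD_KEYWORDS)
    compound = analyzer.polarity_scores(tweet)["compound"]
    return mentions_cvd and compound <= threshold

print(flag_tweet("Woke up with chest pain again, this is exhausting"))
```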

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

  • paper_url: http://arxiv.org/abs/2309.13018
  • repo_url: None
  • paper_authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli
  • for: The study seeks to prune a multilingual automatic speech recognition (ASR) model efficiently, yielding either sparse monolingual models or a sparse multilingual model (Dynamic ASR Pathways).
  • methods: An adaptive masking approach is applied in two scenarios, dynamically adapting the sub-network instead of committing prematurely to a fixed sub-network structure; for the multilingual case, sub-networks (pathways) are jointly discovered and trained from different initializations.
  • results: The approach outperforms existing pruning methods when targeting sparse monolingual models, and Dynamic ASR Pathways reduces the need for language-specific pruning by jointly discovering and training better sub-networks of a single multilingual model.
    Abstract Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
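The general idea of adaptive masking, re-deriving the pruning mask from the current weights each round instead of freezing a sub-network early, can be sketched with magnitude pruning in PyTorch. Details below are illustrative, not the paper's method.

```python
import torch

def update_masks(model, sparsity=0.7):
    """Recompute a binary mask per weight matrix from current magnitudes."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                     # skip biases / norm parameters
            continue
        k = max(1, int(sparsity * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])         # zero out the pruned weights
```

Calling `update_masks` before each pruning round lets the retained sub-network shift as training progresses, rather than being fixed after the first round.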

Nested Event Extraction upon Pivot Element Recogniton

  • paper_url: http://arxiv.org/abs/2309.12960
  • repo_url: None
  • paper_authors: Weicheng Ren, Zixuan Li, Xiaolong Jin, Long Bai, Miao Su, Yantao Liu, Saiping Guan, Jiafeng Guo, Xueqi Cheng
  • for: Improving the extraction of complex nested event structures, where existing methods cope poorly with pivot elements (PEs) that simultaneously act as arguments of outer events and triggers of inner events.
  • methods: PerNee first recognizes the triggers of both inner and outer events, then recognizes PEs by classifying the relation type between trigger pairs, and incorporates event-type and argument-role information via prompt learning to obtain better representations of triggers and arguments.
  • results: PerNee achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13.
    Abstract Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer events and as triggers of inner events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and further recognizes the PEs via classifying the relation type between trigger pairs. In order to obtain better representations of triggers and arguments to further improve NEE performance, it incorporates the information of both event types and argument roles into PerNee through prompt learning. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in generic domain and construct a new NEE dataset, namely ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11 and Genia13.

TopRoBERTa: Topology-Aware Authorship Attribution of Deepfake Texts

  • paper_url: http://arxiv.org/abs/2309.12934
  • repo_url: None
  • paper_authors: Adaku Uchendu, Thai Le, Dongwon Lee
  • for: The study develops a computational method for attributing deepfake texts to their author LLM, in order to help mitigate the spread of harmful machine-generated text at scale.
  • methods: A Topological Data Analysis (TDA) layer is added to RoBERTa, so that RoBERTa captures contextual (semantic and syntactic) linguistic features while TDA captures the shape and structure of the data, which helps on noisy, imbalanced, and heterogeneous datasets.
  • results: TopRoBERTa outperforms the vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7% increase in macro F1 score.
    Abstract Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended high-quality texts, that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as \emph{deepfake texts}. There are currently over 11K text generation models in the huggingface model repo. As such, users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired--i.e., Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as \emph{Authorship Attribution (AA)}, in a multi-class setting--i.e., not only determining if a given text is a deepfake text or not but also being able to pinpoint which LLM is the author. We propose \textbf{TopRoBERTa} to improve existing AA solutions by capturing more linguistic patterns in deepfake texts by including a Topological Data Analysis (TDA) layer in the RoBERTa model. We show the benefits of having a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets, by extracting TDA features from the reshaped $pooled\_output$ of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features), while using TDA to capture the shape and structure of data (i.e., linguistic structures). Finally, \textbf{TopRoBERTa}, outperforms the vanilla RoBERTa in 2/3 datasets, achieving up to 7\% increase in Macro F1 score.
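Extracting TDA features from a reshaped `pooled_output` can be sketched with a persistent-homology library such as `ripser`; the reshape and the summary statistics below are illustrative choices, not the paper's exact feature set.

```python
import numpy as np
from ripser import ripser

def tda_features(pooled_output, shape=(24, 32)):
    """pooled_output: 1D vector (e.g. 768-dim for RoBERTa-base), reshaped
    into a point cloud of 24 points in R^32 before persistence analysis."""
    cloud = np.asarray(pooled_output, dtype=float).reshape(shape)
    dgm0 = ripser(cloud, maxdim=0)["dgms"][0]   # 0-dim persistence diagram
    finite = np.isfinite(dgm0[:, 1])            # drop the infinite bar
    lifetimes = dgm0[finite, 1] - dgm0[finite, 0]
    return np.array([lifetimes.sum(), lifetimes.mean(), lifetimes.max()])

features = tda_features(np.random.randn(768))
```

These summary features would then be concatenated with the contextual representation before the classification head.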

PopBERT. Detecting populism and its host ideologies in the German Bundestag

  • paper_url: http://arxiv.org/abs/2309.14355
  • repo_url: None
  • paper_authors: L. Erhard, S. Hanke, U. Remer, A. Falenska, R. Heiberger
  • for: The study aims to provide a reliable, valid, and scalable approach to measuring populist stances.
  • methods: An annotated dataset was created from parliamentary speeches of the German Bundestag (2013 to 2021), labeling moralizing references to the virtuous people or the corrupt elite as core dimensions of populist language together with their attachment to left-wing or right-wing host ideologies; a transformer-based model (PopBERT) is trained as a multilabel classifier to detect and quantify each dimension.
  • results: Validation checks show strong predictive accuracy, high qualitative face validity, agreement with party rankings from expert surveys, and correct detection of out-of-sample text snippets; PopBERT enables dynamic analyses of how German-speaking politicians and parties use populist language as a strategic device, and the annotator-level data may be applied in cross-domain applications or to develop related classifiers.
    Abstract The rise of populism concerns many political scientists and practitioners, yet the detection of its underlying language remains fragmentary. This paper aims to provide a reliable, valid, and scalable approach to measure populist stances. For that purpose, we created an annotated dataset based on parliamentary speeches of the German Bundestag (2013 to 2021). Following the ideational definition of populism, we label moralizing references to the virtuous people or the corrupt elite as core dimensions of populist language. To identify, in addition, how the thin ideology of populism is thickened, we annotate how populist statements are attached to left-wing or right-wing host ideologies. We then train a transformer-based model (PopBERT) as a multilabel classifier to detect and quantify each dimension. A battery of validation checks reveals that the model has a strong predictive accuracy, provides high qualitative face validity, matches party rankings of expert surveys, and detects out-of-sample text snippets correctly. PopBERT enables dynamic analyses of how German-speaking politicians and parties use populist language as a strategic device. Furthermore, the annotator-level data may also be applied in cross-domain applications or to develop related classifiers.
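A multilabel setup differs from ordinary classification in that the populism dimensions are not mutually exclusive, so each label gets an independent sigmoid. A minimal sketch; the German BERT checkpoint and label names are illustrative assumptions, not PopBERT's released configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

LABELS = ["people-centrism", "anti-elitism", "left-wing host", "right-wing host"]

class MultilabelStanceModel(torch.nn.Module):
    def __init__(self, base="deepset/gbert-base", n_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, n_labels)

    def forward(self, **enc):
        h = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] representation
        return self.head(h)   # train with BCEWithLogitsLoss: labels co-occur

tok = AutoTokenizer.from_pretrained("deepset/gbert-base")
model = MultilabelStanceModel()
logits = model(**tok("Die da oben betrügen das Volk.", return_tensors="pt"))
probs = torch.sigmoid(logits)  # independent probability per dimension
```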

Affect Recognition in Conversations Using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12881
  • repo_url: None
  • paper_authors: Shutong Feng, Guangzhi Sun, Nurul Lubis, Chao Zhang, Milica Gašić
  • for: The study examines the capacity of large language models (LLMs) to recognise human affect in conversations, covering both open-domain chit-chat dialogues and task-oriented dialogues.
  • methods: Three diverse datasets (IEMOCAP, EmoWOZ, and DAIC-WOZ), ranging from casual conversations to clinical interviews, are used to evaluate zero-shot and few-shot in-context learning as well as task-specific fine-tuning, while also accounting for the impact of automatic speech recognition (ASR) errors on LLM predictions.
  • results: The work sheds light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
    Abstract Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.

StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors

  • paper_url: http://arxiv.org/abs/2309.12810
  • repo_url: None
  • paper_authors: Inez Okulska, Daria Stetsenko, Anna Kołos, Agnieszka Karlińska, Kinga Głąbińska, Adam Nowakowski
  • for: The paper gives an overview of StyloMetrix, an open-source multilingual tool that provides stylometric text representations covering various aspects of grammar, syntax, and lexicon for four languages: Polish as the primary language, English, Ukrainian, and Russian.
  • methods: The normalized StyloMetrix vectors are evaluated as input for machine learning models and as an addition to the embeddings layer of deep learning algorithms, and the developed sets of linguistic features are explained.
  • results: Experiments show promising results in supervised content classification with simple algorithms such as Random Forest Classifier, Voting Classifier, and Logistic Regression, and the deep learning assessments reveal the usefulness of the StyloMetrix vectors for enhancing an embedding layer extracted from Transformer architectures.
    Abstract This work aims to provide an overview on the open-source multilanguage tool called StyloMetrix. It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon. StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian and Russian. The normalized output of each feature can become a fruitful course for machine learning models and a valuable addition to the embeddings layer for any deep learning algorithm. We strive to provide a concise, but exhaustive overview on the application of the StyloMetrix vectors as well as explain the sets of the developed linguistic features. The experiments have shown promising results in supervised content classification with simple algorithms as Random Forest Classifier, Voting Classifier, Logistic Regression and others. The deep learning assessments have unveiled the usefulness of the StyloMetrix vectors at enhancing an embedding layer extracted from Transformer architectures. The StyloMetrix has proven itself to be a formidable source for the machine learning and deep learning algorithms to execute different classification tasks.

ChatPRCS: A Personalized Support System for English Reading Comprehension based on ChatGPT

  • paper_url: http://arxiv.org/abs/2309.12808
  • repo_url: None
  • paper_authors: Xizhe Wang, Yihua Zhong, Changqin Huang, Xiaodi Huang
  • for: 提高学生的阅读理解能力
  • methods: 使用大语言模型技术,包括预测学生阅读理解水平、生成问题和自动评估等方法
  • results: 实验结果显示,ChatPRCS可以为学生提供高质量的阅读理解问题,这些问题在统计意义上与专家编写的问题高度一致
    Abstract As a common approach to learning English, reading comprehension primarily entails reading articles and answering related questions. However, the complexity of designing effective exercises results in students encountering standardized questions, making it challenging to align with individualized learners' reading comprehension ability. By leveraging the advanced capabilities offered by large language models, exemplified by ChatGPT, this paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that can predict learners' reading comprehension abilities using their historical data as the foundation for generating questions at an appropriate level of difficulty. Second, a series of new ChatGPT prompt patterns is proposed to address two key aspects of reading comprehension objectives: question generation, and automated evaluation. These patterns further improve the quality of generated questions. Finally, by integrating personalized ability and reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results demonstrate that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
    摘要 通常来说,学习英语中的阅读理解主要涉及阅读文章并回答相关问题。然而,设计有效练习的复杂性导致学生面对的是标准化的问题,难以与学习者的个性化阅读理解能力相匹配。本文基于大语言模型(以ChatGPT为代表)的高级能力,提出了一种新的阅读理解个性化支持系统,称为ChatPRCS,其理论基础是最近发展区(Zone of Proximal Development)理论。ChatPRCS采用阅读理解能力预测、问题生成和自动评估等方法来改进阅读理解教学。首先,我们开发了一种新算法,能够根据学习者的历史数据预测其阅读理解能力,并以此为基础生成难度适宜的题目。其次,我们提出了一系列新的ChatGPT提示模式,用于解决阅读理解目标的两个关键方面:问题生成和自动评估,从而进一步提高生成题目的质量。最后,通过结合个性化能力与阅读理解提示模式,我们对ChatPRCS进行了系统性的实验验证。实验结果表明,它能够为学习者提供高质量的阅读理解题目,在统计意义上与专家编写的题目基本一致。
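
As a concrete illustration of the difficulty-conditioned question generation described above, here is a minimal Python sketch of a ChatPRCS-style prompt pattern. The template wording, the 0-1 proficiency scale, and the difficulty bands are illustrative assumptions, not the paper's exact prompts.

```python
# A minimal sketch of a difficulty-conditioned question-generation prompt,
# in the spirit of ChatPRCS's prompt patterns. The template wording and the
# `proficiency` scale are illustrative assumptions, not the paper's prompts.

def build_question_prompt(article: str, proficiency: float, n_questions: int = 3) -> str:
    """Map a predicted proficiency score (0-1) to a target difficulty band
    and assemble a generation prompt for a chat LLM."""
    if proficiency < 0.4:
        difficulty = "basic factual recall"
    elif proficiency < 0.7:
        difficulty = "inference within a single paragraph"
    else:
        difficulty = "synthesis across paragraphs and author intent"
    return (
        f"You are an English reading-comprehension tutor.\n"
        f"Article:\n{article}\n\n"
        f"Write {n_questions} multiple-choice questions at the level of "
        f"{difficulty}, each with four options and the correct answer marked."
    )

print(build_question_prompt("The quick brown fox ...", proficiency=0.55))
```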

Furthest Reasoning with Plan Assessment: Stable Reasoning Path with Retrieval-Augmented Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12767
  • repo_url: None
  • paper_authors: Yin Zhu, Zhiling Luo, Gong Cheng
  • for: 本研究旨在解决现有多跳问答(MHQA)方法中的两个主要缺陷:一是信息检索器(IR)受限于大语言模型(LLM)在迭代过程中生成的低质量查询;二是 LLM 容易被 IR 检索到的无关知识误导而产生偏差。
  • methods: 本研究提出了一种新的管道方法,即 Furthest-Reasoning-with-Plan-Assessment(FuRePA),其包括一个改进的框架(Furthest Reasoning)和一个附加的模块(Plan Assessor)。Furthest Reasoning 的做法是在每次迭代中屏蔽先前的推理路径和已生成的查询,促使 LLM 从头开始生成思维链;Plan Assessor 是一个训练好的评估器,用于从 LLM 提出的候选计划中选出合适的计划。
  • results: 本研究在三个公认的多跳问答数据集上进行了评估,并与现有最佳方法进行比较。结果显示,FuRePA 在大多数指标上表现出色,答案准确率提升了10%-12%。
    Abstract Large Language Models (LLMs), acting as a powerful reasoner and generator, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ LLM to generate reasoning paths and plans, and utilize IR to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, Information Retriever (IR) is hindered by the low quality of generated queries by LLM. On the other hand, LLM is easily misguided by the irrelevant knowledge by IR. These inaccuracies, accumulated by the iterative interaction between IR and LLM, lead to a disaster in effectiveness at the end. To overcome above barriers, in this paper, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), including an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest reasoning operates by masking previous reasoning path and generated queries for LLM, encouraging LLM generating chain of thought from scratch in each iteration. This approach enables LLM to break the shackle built by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform state-of-the-art on most metrics (achieving a 10%-12% in answer accuracy).
    摘要 大型语言模型(LLM)作为强大的推理器和生成器,在问答(QA)等各类自然语言任务中表现出色。其中,多跳问答(MHQA)是被广泛讨论的一类任务,要求 LLM 与外部知识检索无缝结合。现有方法使用 LLM 生成推理路径和计划,并利用信息检索器(IR)迭代获取相关知识,但这些方法存在固有缺陷:一方面,IR 受到 LLM 生成的低质量查询的限制;另一方面,LLM 容易被 IR 检索到的无关知识误导。这些误差在 IR 与 LLM 的迭代交互中不断累积,最终严重损害整体效果。为克服上述障碍,本文提出了一个新的 MHQA 管道,称为 Furthest-Reasoning-with-Plan-Assessment(FuRePA),包括一个改进的框架(Furthest Reasoning)和一个附加的模块(Plan Assessor)。1)Furthest Reasoning 在每次迭代中屏蔽先前的推理路径和已生成的查询,促使 LLM 从头开始生成思维链,从而摆脱先前误导性思路和查询(如果有)的束缚。2)Plan Assessor 是一个训练好的评估器,从 LLM 提出的一组候选计划中选择合适的计划。我们的方法在三个公认的公开多跳问答数据集上进行评估,并在大多数指标上超越现有最佳方法(答案准确率提升10%-12%)。
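
The masking idea in Furthest Reasoning can be sketched as a loop in which the prompt never carries over the previous chain of thought. In the sketch below, `llm`, `retrieve`, and `score_plan` are hypothetical stand-ins for a chat model, an information retriever, and the trained Plan Assessor; it is an illustration of the described control flow, not the authors' implementation.

```python
# Sketch of the Furthest-Reasoning loop with a Plan Assessor, under heavy
# assumptions: `llm`, `retrieve`, and `score_plan` are hypothetical stand-ins
# for a chat model, an information retriever, and the trained plan evaluator.

def furepa(question: str, llm, retrieve, score_plan, max_hops: int = 4):
    evidence = []                      # accumulated retrieved passages
    for _ in range(max_hops):
        # Masking: the prompt contains only the question + evidence, never
        # the previous chain of thought or previously generated queries.
        prompt = (f"Question: {question}\nEvidence: {evidence}\n"
                  "Reason from scratch and propose 3 next-step plans.")
        plans = llm(prompt, n_samples=3)          # candidate plans
        best = max(plans, key=score_plan)         # Plan Assessor picks one
        if best.startswith("ANSWER:"):
            return best[len("ANSWER:"):].strip()
        evidence.append(retrieve(best))           # use the chosen plan as a query
    return llm(f"Question: {question}\nEvidence: {evidence}\nFinal answer:")
```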

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

  • paper_url: http://arxiv.org/abs/2309.12763
  • repo_url: None
  • paper_authors: Asad Ullah, Alessandro Ragano, Andrew Hines
  • for: 本研究旨在提升低资源语言下自监督表示学习(SSRL)模型的表现,并以下游音素识别任务评估其性能。
  • methods: 本研究使用音频增强来预训练SSRL模型,并在音素识别任务上评估其表现。我们系统地比较了多种增强技术,包括音高变化、噪声添加、带口音的目标语言语音以及其他语言语音。我们发现噪声与音高相结合(噪声/音高)是最佳增强策略,优于口音与跨语言的知识迁移。
  • results: 我们发现,在不同数量和类型的预训练数据下,增强后的SSRL模型在音素识别任务中的表现均有所提升。此外,我们还考察了增强数据需要的规模因子,以达到与目标域语音预训练数据相当的性能。我们的发现表明,对于资源受限的语言,域内合成增强数据可以优于来自带口音语音或其他语言语音的知识迁移。
    Abstract Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme recognition versus supervised models. Training SSRL models requires a large amount of pre-training data and this poses a challenge for low resource languages. A common approach is transferring knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low resource condition and evaluate phoneme recognition as downstream task. We performed a systematic comparison of augmentation techniques, namely: pitch variation, noise addition, accented target-language speech and other language speech. We found combined augmentations (noise/pitch) was the best augmentation strategy outperforming accent and language knowledge transfer. We compared the performance with various quantities and types of pre-training data. We examined the scaling factor of augmented data to achieve equivalent performance to models pre-trained with target domain speech. Our findings suggest that for resource constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other language speech.
    摘要 自监督表示学习(SSRL)相比监督模型提升了下游音素识别的性能,但训练SSRL模型需要大量预训练数据,这对低资源语言构成挑战。常见做法是从其他语言迁移知识;我们则提议在低资源条件下使用音频增强来预训练SSRL模型,并以音素识别作为下游任务进行评估。我们系统地比较了多种增强技术,包括音高变化、噪声添加、带口音的目标语言语音和其他语言语音。我们发现噪声与音高相结合的增强是最佳策略,优于口音和跨语言的知识迁移。我们比较了不同数量和类型的预训练数据下的性能,并考察了增强数据需要的规模因子,以达到与目标域语音预训练相当的性能。我们的发现表明,对于资源受限的语言,域内合成增强可以优于来自带口音或其他语言语音的知识迁移。
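
A minimal sketch of the combined noise/pitch augmentation that the paper finds most effective, assuming a recent torchaudio with `transforms.PitchShift`; the SNR and semitone ranges below are illustrative choices, not the paper's settings.

```python
# Combined noise/pitch augmentation sketch. Parameter ranges (±2 semitones,
# 5-20 dB SNR) are illustrative assumptions.
import torch
import torchaudio.transforms as T

def noise_pitch_augment(wave: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # Random pitch shift in [-2, +2] semitones.
    n_steps = int(torch.randint(-2, 3, (1,)))
    if n_steps != 0:
        wave = T.PitchShift(sample_rate, n_steps=n_steps)(wave)
    # Additive white noise at a random SNR in [5, 20] dB.
    snr_db = 5 + 15 * torch.rand(1)
    noise = torch.randn_like(wave)
    signal_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise
```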

Semantic similarity prediction is better than other semantic similarity measures

  • paper_url: http://arxiv.org/abs/2309.12697
  • repo_url: https://github.com/aieng-lab/stsscore
  • paper_authors: Steffen Herbold
  • for: 度量自然语言文本之间的语义相似性
  • methods: 使用针对该任务微调的模型直接预测相似性
  • results: 得到了比其他方法更稳健、更符合预期的相似性度量
    Abstract Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the STS-B from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
    摘要 自然语言文本之间的语义相似性通常通过子序列重叠(例如 BLEU)或使用嵌入(例如 BERTScore、S-BERT)来衡量。在这篇论文中,我们认为当只需度量语义相似性时,直接使用针对该任务微调的模型来预测相似性是更好的方法。基于 GLUE 基准中 STS-B 任务微调的模型,我们定义了 STSScore 方法,并表明其给出的相似性比其他方法更符合我们对稳健语义相似性度量的预期。
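
One plausible realization of STSScore is to query a cross-encoder fine-tuned on STS-B; the checkpoint name below is our assumption, not necessarily the authors' exact fine-tuned model.

```python
# A hedged sketch of STSScore-style similarity prediction with a model
# fine-tuned on STS-B. The checkpoint name is an assumption.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")  # STS-B fine-tuned

def sts_score(text_a: str, text_b: str) -> float:
    # The cross-encoder reads both texts jointly and regresses a similarity
    # score, rather than comparing independently computed embeddings.
    return float(model.predict([(text_a, text_b)])[0])

print(sts_score("A man is playing guitar.", "Someone strums a guitar."))
```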

AMPLIFY:Attention-based Mixup for Performance Improvement and Label Smoothing in Transformer

  • paper_url: http://arxiv.org/abs/2309.12689
  • repo_url: https://github.com/kiwi-lilo/amplify
  • paper_authors: Leixin Yang, Yaping Zhang, Haoyu Xiong, Yu Xiang
  • for: 提高文本分类 task 的性能,降低模型对噪音和异常值的敏感性。
  • methods: 提出了一种称为 AMPLIFY 的新 Mixup 方法,通过 Transformer 自带的注意力机制来减少原始样本中噪声和异常值的影响,不增加额外可训练参数,计算成本很低。
  • results: 在 7 个 benchmark 数据集上,AMPLIFY 在文本分类任务中以更小的计算成本取得了优于其他 Mixup 方法的性能。
    Abstract Mixup is an effective data augmentation method that generates new augmented samples by aggregating linear combinations of different original samples. However, if there are noises or aberrant features in the original samples, Mixup may propagate them to the augmented samples, leading to over-sensitivity of the model to these outliers . To solve this problem, this paper proposes a new Mixup method called AMPLIFY. This method uses the Attention mechanism of Transformer itself to reduce the influence of noises and aberrant values in the original samples on the prediction results, without increasing additional trainable parameters, and the computational cost is very low, thereby avoiding the problem of high resource consumption in common Mixup methods such as Sentence Mixup . The experimental results show that, under a smaller computational resource cost, AMPLIFY outperforms other Mixup methods in text classification tasks on 7 benchmark datasets, providing new ideas and new ways to further improve the performance of pre-trained models based on the Attention mechanism, such as BERT, ALBERT, RoBERTa, and GPT. Our code can be obtained at https://github.com/kiwi-lilo/AMPLIFY.
    摘要 混合(Mixup)是一种有效的数据增强方法,可以通过不同原始样本的线性组合生成新的增强样本。但是,如果原始样本中存在噪声或异常特征,混合可能会将其传递到增强样本中,导致模型对这些异常值过度敏感。为解决这个问题,本文提出了一种称为 AMPLIFY 的新混合方法。这种方法使用 Transformer 自带的注意力机制来减少原始样本中噪声或异常值对预测结果的影响,无需增加额外可训练参数,计算成本非常低,因此可以避免 Sentence Mixup 等常见混合方法中的高资源消耗问题。实验结果表明,在较小的计算资源成本下,AMPLIFY 在 7 个 benchmark 数据集的文本分类任务中表现优于其他混合方法,为进一步提高基于注意力机制的预训练模型(如BERT、ALBERT、RoBERTa和GPT)的性能提供了新的思路和方法。我们的代码可以在 https://github.com/kiwi-lilo/AMPLIFY 获取。
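
A simplified reading of attention-based Mixup: pool token states with the model's own attention (down-weighting noisy tokens), then mix pooled features and labels. This is an illustrative sketch, not AMPLIFY's exact formulation.

```python
# Illustrative attention-weighted pooling followed by Mixup of pooled
# features and labels. Shapes and the Beta prior are assumptions.
import torch

def attention_mixup(hidden, attn_cls, labels, alpha: float = 0.2):
    """hidden: (B, L, D) token states; attn_cls: (B, L) attention from [CLS];
    labels: (B, C) one-hot. Returns mixed features and labels."""
    weights = attn_cls / attn_cls.sum(dim=1, keepdim=True)    # normalize
    pooled = torch.einsum("bl,bld->bd", weights, hidden)      # attention pooling
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(pooled.size(0))
    mixed_x = lam * pooled + (1 - lam) * pooled[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y
```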

JCoLA: Japanese Corpus of Linguistic Acceptability

  • paper_url: http://arxiv.org/abs/2309.12676
  • repo_url: https://github.com/osekilab/jcola
  • paper_authors: Taiga Someya, Yushi Sugimoto, Yohei Oseki
  • for: 评估不同类型的日语语言模型在语法可接受性方面的性能
  • methods: 构建了包含10,020个句子的人工标注二值可接受性判断数据集,其中86%为来自语言学教科书和手册的较简单判断(域内数据),其余14%取自语言学期刊文章并按12种语言现象分类(域外数据);随后以该数据评估9种日语语言模型的语法知识
  • results: 部分模型在域内数据上可超越人类表现,但没有模型能在域外数据上超越人类;按语言现象的错误分析进一步表明,神经语言模型擅长论元结构等局部句法依赖,而在动词一致和NPI许可等长距离句法依赖上表现不佳
    Abstract Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
    摘要 神经语言模型在多种下游任务中表现出色,但我们对这些模型内化句法知识的程度了解有限,因此近来各语言都构建了用于句法评估的数据集。本文介绍日语语言可接受性语料库 JCoLA(Japanese Corpus of Linguistic Acceptability),包含10,020个带二值可接受性判断的句子。具体来说,这些句子人工摘自语言学教科书、手册和期刊文章,并划分为域内数据(86%;来自教科书和手册的相对简单的可接受性判断)和域外数据(14%;来自期刊文章、具有理论意义的可接受性判断),后者按12种语言现象分类。随后,我们在 JCoLA 上评估了9种日语语言模型的句法知识。结果显示,一些模型在域内数据上可以超越人类表现,而没有模型能在域外数据上超越人类。按语言现象的错误分析进一步表明,尽管神经语言模型擅长处理论元结构等局部句法依赖,但在面对动词一致和NPI许可等长距离句法依赖时表现下降。
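
A common recipe for scoring acceptability benchmarks like JCoLA is to threshold a language model's length-normalized log-probability; the sketch below assumes a Japanese GPT-2 checkpoint and an arbitrary threshold, neither taken from the paper.

```python
# Thresholded, length-normalized LM log-probability as an acceptability
# judgment. Model name and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
lm = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium").eval()

@torch.no_grad()
def normalized_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)          # cross-entropy over next-token targets
    return -out.loss.item()            # mean log-probability per token

accept = normalized_logprob("猫が魚を食べた。") > -5.0  # threshold is a guess
```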

HRoT: Hybrid prompt strategy and Retrieval of Thought for Table-Text Hybrid Question Answering

  • paper_url: http://arxiv.org/abs/2309.12669
  • repo_url: None
  • paper_authors: Tongxu Luo, Fangyu Lei, Jiahe Lei, Weihao Liu, Shihu He, Jun Zhao, Kang Liu
  • for: 这篇论文旨在解决基于给定表格和文本的混合内容回答数值问题(TextTableQA)的任务。
  • methods: 这篇论文使用了Large Language Models (LLMs)和In-Context Learning技术,以及Chain-of-Thought prompting。
  • results: 这篇论文的方法在 MultiHiertt 数据集中的少量学习情况下达到了State-of-the-Art (SOTA) 性能。
    Abstract Answering numerical questions over hybrid contents from the given tables and text(TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy called Hybrid prompt strategy and Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt the model to develop the ability of retrieval thinking when dealing with hybrid data. Our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset in the few-shot setting.
    摘要 基于给定表格和文本的混合内容回答数值问题(TextTableQA)是一项具有挑战性的任务。近年来,大型语言模型(LLM)在NLP社区受到广泛关注;随着大模型的出现,上下文学习(In-Context Learning)和思维链提示(Chain-of-Thought prompting)成为该领域两个特别热门的研究方向。本文提出了一种新的提示策略,即面向TextTableQA的混合提示策略与思维检索(HRoT)。通过上下文学习,我们提示模型在处理混合数据时发展出检索式思维能力。我们的方法在少样本设置下,在 MultiHiertt 数据集上超越了全监督的最先进(SOTA)方法。

Decoding Affect in Dyadic Conversations: Leveraging Semantic Similarity through Sentence Embedding

  • paper_url: http://arxiv.org/abs/2309.12646
  • repo_url: None
  • paper_authors: Chen-Wei Yu, Yun-Shiuan Chuang, Alexandros N. Lotsos, Claudia M. Haase
  • for: 这研究旨在利用句子嵌入来分析现实生活中的对话和预测对话参与者的情感。
  • methods: 该研究使用基于Transformer的模型,从50对已婚夫妇关于冲突和愉悦活动的对话中获取每位说话者话语的句子嵌入。
  • results: 研究发现,在冲突对话中语义相似度与妻子的情感呈正相关,但与丈夫的情感无关;在愉悦对话中也未观察到这种关联。
    Abstract Recent advancements in Natural Language Processing (NLP) have highlighted the potential of sentence embeddings in measuring semantic similarity. Yet, its application in analyzing real-world dyadic interactions and predicting the affect of conversational participants remains largely uncharted. To bridge this gap, the present study utilizes verbal conversations within 50 married couples talking about conflicts and pleasant activities. Transformer-based model all-MiniLM-L6-v2 was employed to obtain the embeddings of the utterances from each speaker. The overall similarity of the conversation was then quantified by the average cosine similarity between the embeddings of adjacent utterances. Results showed that semantic similarity had a positive association with wives' affect during conflict (but not pleasant) conversations. Moreover, this association was not observed with husbands' affect regardless of conversation types. Two validation checks further provided support for the validity of the similarity measure and showed that the observed patterns were not mere artifacts of data. The present study underscores the potency of sentence embeddings in understanding the association between interpersonal dynamics and individual affect, paving the way for innovative applications in affective and relationship sciences.
    摘要 自然语言处理(NLP)的最新进展凸显了句子嵌入在度量语义相似性方面的潜力,但其在分析真实世界双人互动、预测对话参与者情感方面的应用仍有待探索。为弥补这一空白,本研究使用50对已婚夫妇围绕冲突和愉悦活动展开的对话。我们采用基于Transformer的模型 all-MiniLM-L6-v2 获取每位说话者话语的嵌入,再以相邻话语嵌入的余弦相似度的平均值来量化整段对话的相似性。结果表明,在冲突对话(而非愉悦对话)中,语义相似度与妻子的情感呈正相关;无论对话类型如何,这种关联在丈夫的情感上均未出现。两项验证检查进一步支持了该相似性度量的有效性,并表明所观察到的模式并非数据的伪影。本研究凸显了句子嵌入在理解人际动态与个体情感之间关联方面的潜力,为情感科学与关系科学的创新应用铺平了道路。
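
The similarity measure itself is straightforward to reproduce: embed utterances with all-MiniLM-L6-v2 and average the cosine similarity of adjacent turns (sentence-transformers assumed installed; the example conversation is invented).

```python
# Average cosine similarity of adjacent utterance embeddings, as described
# in the abstract. The sample utterances are invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def conversation_similarity(utterances: list[str]) -> float:
    emb = model.encode(utterances, convert_to_tensor=True)
    sims = [float(util.cos_sim(emb[i], emb[i + 1]))
            for i in range(len(emb) - 1)]          # adjacent-turn cosines
    return sum(sims) / len(sims)

print(conversation_similarity([
    "You never listen to me.",
    "I do listen, I just disagree.",
    "Then show me that you heard what I said.",
]))
```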

Learning to Diversify Neural Text Generation via Degenerative Model

  • paper_url: http://arxiv.org/abs/2309.12619
  • repo_url: None
  • paper_authors: Jimin Hong, ChaeHun Park, Jaegul Choo
  • for: 提高神经语言模型生成文本的多样性和信息量,以扩展其应用范围。
  • methods: 提出一种新方法,基于对模型学习特性的观察:模型主要学习样例中容易引发退化问题的特征。该方法训练两个模型:首先训练一个用于放大不良模式的模型,然后让第二个模型专注于第一个模型未能学到的模式,以提升其多样性。
  • results: 通过两个任务, namely语言模型和对话生成,进行了广泛的实验,证明了该方法的有效性。
    Abstract Neural language models often fail to generate diverse and informative texts, limiting their applicability in real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) from language models, we propose an alternative approach based on an observation: models primarily learn attributes within examples that are likely to cause degeneration problems. Based on this observation, we propose a new approach to prevent degeneration problems by training two models. Specifically, we first train a model that is designed to amplify undesirable patterns. We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn. Extensive experiments on two tasks, namely language modeling and dialogue generation, demonstrate the effectiveness of our approach.
    摘要 神经语言模型经常无法生成多样化且信息丰富的文本,限制了它们在实际问题中的应用。以前的方法提议通过识别并惩罚语言模型的不良行为(例如重复、过度使用高频词)来解决这些问题;我们则基于一个观察提出另一种思路:模型主要学习样例中那些容易引发退化问题的特征。基于这一观察,我们提出一种通过训练两个模型来防止退化问题的新方法。具体来说,我们首先训练一个用于放大不良模式的模型,然后让第二个模型专注于第一个模型未能学到的模式,以提升其多样性。我们在语言建模和对话生成两个任务上进行了广泛实验,证明了该方法的有效性。

Unlocking Model Insights: A Dataset for Automated Model Card Generation

  • paper_url: http://arxiv.org/abs/2309.12616
  • repo_url: None
  • paper_authors: Shruti Singh, Hitesh Lodwal, Husain Malwat, Rakesh Thakur, Mayank Singh
  • for: 这篇论文旨在自动生成模型卡,以记录机器学习模型的关键信息并减少人工整理成本。
  • methods: 这篇论文构建了涵盖25个机器学习模型的500个问答对,覆盖模型的训练配置、数据集、偏见、结构细节和训练资源等关键方面,答案由标注者从原论文中抽取。
  • results: 这篇论文发现,现有的语言模型(如ChatGPT-3.5、LLaMa和Galactica)在理解研究论文和生成事实性文本回答方面仍存在明显差距,并表明该数据集可用于训练模型以自动生成模型卡。
    Abstract Language models (LMs) are no longer restricted to ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica showcase a significant gap in the understanding of research papers by these aforementioned LMs as well as generating factual textual responses. We posit that our dataset can be used to train models to automate the generation of model cards from paper text and reduce human effort in the model card curation process. The complete dataset is available on https://osf.io/hqt7p/?view_only=3b9114e3904c4443bcd9f5c270158d37
    摘要 语言模型(LM)已不再局限于机器学习社区,指令微调的LM催生了大量自主AI代理。随着LM的可及性不断提高,理解其能力、适用范围和开发周期也变得非常重要。模型卡是一种流行的实践,用于记录ML模型的详细信息。为了自动生成模型卡,我们提出了一个涵盖25个ML模型的500个问答对数据集,覆盖模型的训练配置、数据集、偏见、结构细节和训练资源等关键方面,并请标注者从原始论文中提取答案。此外,我们还探索了LM通过回答问题来生成模型卡的能力。我们对ChatGPT-3.5、LLaMa和Galactica的初步实验表明,这些LM在理解研究论文和生成事实性文本响应方面仍存在较大差距。我们认为,该数据集可以用于训练模型,从论文文本自动生成模型卡,并减少模型卡整理过程中的人力成本。完整的数据集可以在 https://osf.io/hqt7p/?view_only=3b9114e3904c4443bcd9f5c270158d37 找到。

Is it Possible to Modify Text to a Target Readability Level? An Initial Investigation Using Zero-Shot Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12551
  • repo_url: None
  • paper_authors: Asma Farajidizaji, Vatsal Raina, Mark Gales
  • for: 本研究旨在提出一种新的文本修改任务,即将文本独立地修改到指定的目标可读性水平。
  • methods: 本研究使用ChatGPT和Llama-2作为基线模型,并提出一种扩展方法,即两次通过语言模型生成释义(paraphrase)的两步流程。
  • results: 研究发现,零样本方法能够将释义的可读性推向目标方向,但最终可读性仍与原始文本的可读性相关。此外,研究还发现,可读性变化越大,源文本与目标文本之间的语义和词汇相似度下降越多。
    Abstract Text simplification is a common task where the text is adapted to make it easier to understand. Similarly, text elaboration can make a passage more sophisticated, offering a method to control the complexity of reading comprehension tests. However, text simplification and elaboration tasks are limited to only relatively alter the readability of texts. It is useful to directly modify the readability of any text to an absolute target readability level to cater to a diverse audience. Ideally, the readability of readability-controlled generated text should be independent of the source text. Therefore, we propose a novel readability-controlled text modification task. The task requires the generation of 8 versions at various target readability levels for each input text. We introduce novel readability-controlled text modification metrics. The baselines for this task use ChatGPT and Llama-2, with an extension approach introducing a two-step process (generating paraphrases by passing through the language model twice). The zero-shot approaches are able to push the readability of the paraphrases in the desired direction but the final readability remains correlated with the original text's readability. We also find greater drops in semantic and lexical similarity between the source and target texts with greater shifts in the readability.
    摘要 文本简化是一项常见任务,通过改写使文本更易理解;类似地,文本复杂化可以使段落更为高深,从而为控制阅读理解测试的难度提供手段。然而,简化和复杂化任务只能相对地改变文本的可读性。直接将任意文本修改到指定的绝对可读性水平,以服务多样化的读者群体,是更有用的做法;理想情况下,可读性受控生成文本的可读性应与源文本无关。为此,我们提出了一个新的可读性受控文本修改任务,要求对每个输入文本生成8个不同目标可读性水平的版本,并引入了新的可读性受控文本修改评估指标。该任务的基线使用 ChatGPT 和 Llama-2,外加一种扩展方法,即两次通过语言模型生成释义的两步流程。零样本方法能够将释义的可读性推向目标方向,但最终可读性仍与原文的可读性相关。我们还发现,可读性变化越大,源文本与目标文本之间的语义和词汇相似度下降越多。
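
A hedged sketch of readability-targeted rewriting with the two-step extension, using Flesch-Kincaid grade as the readability proxy; `llm` is a hypothetical chat-model callable and the prompt wording is ours, not the paper's.

```python
# Two-step readability-targeted paraphrasing sketch. The grade metric
# (Flesch-Kincaid via textstat) and prompt wording are assumptions.
import textstat

def modify_to_grade(text: str, target_grade: float, llm, steps: int = 2) -> str:
    out = text
    for _ in range(steps):  # passing through the LM twice pushes further
        direction = ("simpler"
                     if textstat.flesch_kincaid_grade(out) > target_grade
                     else "more sophisticated")
        out = llm(f"Rewrite the passage to be {direction}, targeting a "
                  f"US grade level of {target_grade}, preserving meaning:\n{out}")
    return out
```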

Automatic Answerability Evaluation for Question Generation

  • paper_url: http://arxiv.org/abs/2309.12546
  • repo_url: None
  • paper_authors: Zifan Wang, Kotaro Funakoshi, Manabu Okumura
  • for: 本论文旨在提出一种新的自动评价指标,以评估生成的问题能否由参考答案回答。
  • methods: 本论文提出一种名为 PMAN 的基于提示的评价指标,通过判断生成的问题是否可由参考答案回答,来评估问题的可回答性。
  • results: 经过广泛的实验,该评价指标被证明可靠,且与人工评价结果一致。此外,论文还应用该指标评估问题生成模型的性能,发现其与传统评价指标互为补充。最后,作者基于 ChatGPT 实现了一个在生成可回答问题方面达到 SOTA 的问题生成模型。
    Abstract Conventional automatic evaluation metrics, such as BLEU and ROUGE, developed for natural language generation (NLG) tasks, are based on measuring the n-gram overlap between the generated and reference text. These simple metrics may be insufficient for more complex tasks, such as question generation (QG), which requires generating questions that are answerable by the reference answers. Developing a more sophisticated automatic evaluation metric, thus, remains as an urgent problem in QG research. This work proposes a Prompting-based Metric on ANswerability (PMAN), a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers for the QG tasks. Extensive experiments demonstrate that its evaluation results are reliable and align with human evaluations. We further apply our metric to evaluate the performance of QG models, which shows our metric complements conventional metrics. Our implementation of a ChatGPT-based QG model achieves state-of-the-art (SOTA) performance in generating answerable questions.
    摘要 为自然语言生成(NLG)任务开发的传统自动评价指标(如BLEU和ROUGE)基于生成文本与参考文本之间的n-gram重叠。对于更复杂的任务(如问题生成,QG),这些简单指标可能不够,因为QG要求生成可由参考答案回答的问题。因此,开发更完善的自动评价指标仍是QG研究中的紧迫问题。本工作提出了基于提示的可回答性评价指标(Prompting-based Metric on ANswerability,PMAN),这是一种新的自动评价指标,用于评估QG任务中生成的问题能否由参考答案回答。大量实验表明,其评价结果可靠,并与人工评价一致。我们进一步应用该指标评估QG模型的性能,结果表明它与传统指标互为补充。我们基于ChatGPT实现的QG模型在生成可回答问题方面达到了最先进(SOTA)的性能。
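
In the spirit of PMAN, an answerability check can be phrased as a single yes/no prompt; the sketch below uses a hypothetical `llm` callable and our own prompt wording, not the paper's exact pattern.

```python
# A hedged sketch of a prompting-based answerability check. `llm` is a
# hypothetical chat-model callable; the prompt wording is an assumption.
def answerability(question: str, reference_answer: str, llm) -> bool:
    verdict = llm(
        "Does the following answer fully answer the question? "
        "Reply with exactly Yes or No.\n"
        f"Question: {question}\nAnswer: {reference_answer}"
    )
    return verdict.strip().lower().startswith("yes")
```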

cs.LG - 2023-09-22

The LHCb ultra-fast simulation option, Lamarr: design and validation

  • paper_url: http://arxiv.org/abs/2309.13213
  • repo_url: None
  • paper_authors: Lucio Anderlini, Matteo Barbetti, Simone Capelli, Gloria Corti, Adam Davis, Denis Derkach, Nikita Kazeev, Artem Maevskiy, Maurizio Martinelli, Sergei Mokonenko, Benedetto Gianluca Siddi, Zehua Xu
  • for: 用于提高LHCb实验中的详细探测器模拟,以满足Run 3中的数据收集需求。
  • methods: 使用Gaudi框架,并利用深度生成模型和梯度提升决策树来参数化探测器响应和重建算法。
  • results: 比较详细模拟和Lamarr模拟的结果,发现Lamarr可以提供两个数量级的速度提升,同时保持与详细模拟的一致性。
    Abstract Detailed detector simulation is the major consumer of CPU resources at LHCb, having used more than 90% of the total computing budget during Run 2 of the Large Hadron Collider at CERN. As data is collected by the upgraded LHCb detector during Run 3 of the LHC, larger requests for simulated data samples are necessary, and will far exceed the pledged resources of the experiment, even with existing fast simulation options. An evolution of technologies and techniques to produce simulated samples is mandatory to meet the upcoming needs of analysis to interpret signal versus background and measure efficiencies. In this context, we propose Lamarr, a Gaudi-based framework designed to offer the fastest solution for the simulation of the LHCb detector. Lamarr consists of a pipeline of modules parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Most of the parameterizations are made of Deep Generative Models and Gradient Boosted Decision Trees trained on simulated samples or alternatively, where possible, on real data. Embedding Lamarr in the general LHCb Gauss Simulation framework allows combining its execution with any of the available generators in a seamless way. Lamarr has been validated by comparing key reconstructed quantities with Detailed Simulation. Good agreement of the simulated distributions is obtained with two-order-of-magnitude speed-up of the simulation phase.
    摘要 详细探测器模拟是 LHCb 最主要的 CPU 资源消耗者,在 CERN 大型强子对撞机二期运行期间占用了实验计算预算的90%以上。随着升级后的 LHCb 探测器在三期运行中采集数据,对模拟样本的需求将远超实验既有资源,即使采用现有的快速模拟方案也不够,因此必须发展新的技术来生成模拟样本。为此,我们提出 Lamarr:一个基于 Gaudi 的框架,旨在为 LHCb 探测器模拟提供最快的解决方案。Lamarr 由一系列模块组成,对 LHCb 实验的探测器响应和重建算法进行参数化;大多数参数化采用深度生成模型和梯度提升决策树,在模拟样本上训练,或在可能时直接在真实数据上训练。将 Lamarr 嵌入 LHCb 通用的 Gauss 模拟框架后,可以无缝地与任何可用的事件生成器组合执行。通过与详细模拟比较关键重建量,Lamarr 得到了验证:模拟分布吻合良好,同时模拟阶段获得了两个数量级的加速。

Evidential Deep Learning: Enhancing Predictive Uncertainty Estimation for Earth System Science Applications

  • paper_url: http://arxiv.org/abs/2309.13207
  • repo_url: https://github.com/AI2ES/miles-guess
  • paper_authors: John S. Schreck, David John Gagne II, Charlie Becker, William E. Chapman, Kim Elmore, Gabrielle Gantos, Eliot Kim, Dhamma Kimpara, Thomas Martin, Maria J. Molina, Vanessa M. Pryzbylo, Jacob Radford, Belen Saavedra, Justin Willson, Christopher Wirz
  • for: 这个研究旨在提供一个可靠且实用的深度学习方法来量化气候和天气预测结果的不确定性。
  • methods: 这个研究使用参数化深度学习(parametric deep learning)与证据深度学习(evidential deep learning):前者通过预测概率分布的参数来估计不确定性,后者将其推广到高阶分布,可同时刻画偶然(aleatoric)与认知(epistemic)不确定性。
  • results: 这个研究发现,使用 evidential neural networks 可以实现预测精度与 ensemble 方法相当,同时可以严谨地量化预测结果的不确定性。
    Abstract Robust quantification of predictive uncertainty is critical for understanding factors that drive weather and climate outcomes. Ensembles provide predictive uncertainty estimates and can be decomposed physically, but both physics and machine learning ensembles are computationally expensive. Parametric deep learning can estimate uncertainty with one model by predicting the parameters of a probability distribution but do not account for epistemic uncertainty.. Evidential deep learning, a technique that extends parametric deep learning to higher-order distributions, can account for both aleatoric and epistemic uncertainty with one model. This study compares the uncertainty derived from evidential neural networks to those obtained from ensembles. Through applications of classification of winter precipitation type and regression of surface layer fluxes, we show evidential deep learning models attaining predictive accuracy rivaling standard methods, while robustly quantifying both sources of uncertainty. We evaluate the uncertainty in terms of how well the predictions are calibrated and how well the uncertainty correlates with prediction error. Analyses of uncertainty in the context of the inputs reveal sensitivities to underlying meteorological processes, facilitating interpretation of the models. The conceptual simplicity, interpretability, and computational efficiency of evidential neural networks make them highly extensible, offering a promising approach for reliable and practical uncertainty quantification in Earth system science modeling. In order to encourage broader adoption of evidential deep learning in Earth System Science, we have developed a new Python package, MILES-GUESS (https://github.com/ai2es/miles-guess), that enables users to train and evaluate both evidential and ensemble deep learning.
    摘要 稳健地量化预测不确定性,对于理解驱动天气和气候结果的因素至关重要。集合方法可以提供预测不确定性估计并可作物理分解,但无论物理集合还是机器学习集合在计算上都很昂贵。参数化深度学习可以用单个模型通过预测概率分布的参数来估计不确定性,但无法刻画认知不确定性;证据深度学习将参数化深度学习推广到高阶分布,可用单个模型同时刻画偶然与认知不确定性。本研究将证据神经网络得到的不确定性与集合方法得到的不确定性进行比较。通过冬季降水类型分类和表面层通量回归两个应用,我们表明证据深度学习模型在达到与标准方法相当的预测精度的同时,能够稳健地量化两种不确定性来源。我们从预测的校准程度以及不确定性与预测误差的相关性两方面评估不确定性,并结合输入进行分析,揭示了模型对底层气象过程的敏感性,便于模型解释。证据神经网络概念简单、可解释、计算高效,具有很强的可扩展性,为地球系统科学建模提供了可靠且实用的不确定性量化途径。为推动其在地球系统科学中的广泛应用,我们开发了新的 Python 包 MILES-GUESS(https://github.com/ai2es/miles-guess),支持训练和评估证据与集合深度学习模型。
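
A minimal sketch of an evidential regression head in the Normal-Inverse-Gamma style that this line of work builds on: one network outputs (mu, nu, alpha, beta), and both uncertainty types follow in closed form. Layer sizes and activations are illustrative, not the paper's architecture.

```python
# Normal-Inverse-Gamma evidential head sketch: aleatoric and epistemic
# uncertainties in closed form from a single forward pass.
import torch
import torch.nn.functional as F

class EvidentialHead(torch.nn.Module):
    def __init__(self, d_in: int):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, 4)    # mu, nu, alpha, beta

    def forward(self, x):
        mu, nu, alpha, beta = self.linear(x).unbind(-1)
        nu = F.softplus(nu)                       # nu > 0
        alpha = F.softplus(alpha) + 1.0           # alpha > 1 (finite moments)
        beta = F.softplus(beta)                   # beta > 0
        aleatoric = beta / (alpha - 1)            # E[sigma^2]
        epistemic = beta / (nu * (alpha - 1))     # Var[mu]
        return mu, aleatoric, epistemic
```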

Federated Short-Term Load Forecasting with Personalization Layers for Heterogeneous Clients

  • paper_url: http://arxiv.org/abs/2309.13194
  • repo_url: None
  • paper_authors: Shourya Bose, Kibaek Kim
  • for: 这篇论文是为了提高 Federated Learning(FL)的精度和减少资料隐私问题。
  • methods: 本论文使用Argonne Privacy-Preserving Federated Learning套件,并提出一种能处理个性化层的个性化联邦学习(PL-FL)算法,以提升模型精度。
  • results: 在NREL ComStock数据集上的实验结果显示,PL-FL算法能提升模型的预测性能,表明个性化层使经典联邦学习算法能够应对数据异质的客户端。
    Abstract The advent of smart meters has enabled pervasive collection of energy consumption data for training short-term load forecasting (STLF) models. In response to privacy concerns, federated learning (FL) has been proposed as a privacy-preserving approach for training, but the quality of trained models degrades as client data becomes heterogeneous. In this paper we alleviate this drawback using personalization layers, wherein certain layers of an STLF model in an FL framework are trained exclusively on the clients' own data. To that end, we propose a personalized FL algorithm (PL-FL) enabling FL to handle personalization layers. The PL-FL algorithm is implemented by using the Argonne Privacy-Preserving Federated Learning package. We test the forecast performance of models trained on the NREL ComStock dataset, which contains heterogeneous energy consumption data of multiple commercial buildings. Superior performance of models trained with PL-FL demonstrates that personalization layers enable classical FL algorithms to handle clients with heterogeneous data.
    摘要 智能电表的出现使得能源消耗数据得以普遍收集,用于训练短期负荷预测(STLF)模型。为了保护隐私,联邦学习(FL)被提议作为隐私保护的训练方法,但当客户数据异质时,训练所得模型的质量会下降。在本文中,我们通过个性化层来缓解这个缺点,即在联邦学习框架中,STLF模型的某些层仅使用客户自己的数据进行训练。为此,我们提出了个性化联邦学习算法(PL-FL),使联邦学习能够处理个性化层。PL-FL算法使用Argonne隐私保护联邦学习包实现。我们在NREL ComStock数据集上测试了由PL-FL训练的预测模型的性能,该数据集包含多个商业建筑物的异质能源消耗数据。PL-FL训练的模型表现更优,说明个性化层使得经典联邦学习算法能够处理数据异质的客户端。
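
The core of PL-FL-style training reduces to averaging only the shared parameters across clients; a sketch follows (the personalization-layer naming convention below is our assumption, not the paper's):

```python
# Federated averaging with personalization layers: shared layers are
# averaged, personal layers never leave the client. Key names are illustrative.
import torch

PERSONAL_KEYS = ("head.", "norm_out.")   # hypothetical personalization layers

def aggregate(client_states: list[dict]) -> dict:
    global_state = {}
    for key in client_states[0]:
        if key.startswith(PERSONAL_KEYS):
            continue                               # stays local to each client
        stacked = torch.stack([s[key].float() for s in client_states])
        global_state[key] = stacked.mean(dim=0)    # FedAvg on shared layers
    return global_state

def load_round(model: torch.nn.Module, global_state: dict):
    # Clients overwrite shared layers; strict=False keeps personal layers intact.
    model.load_state_dict(global_state, strict=False)
```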

Visualizing Topological Importance: A Class-Driven Approach

  • paper_url: http://arxiv.org/abs/2309.13185
  • repo_url: None
  • paper_authors: Yu Qin, Brittany Terese Fasy, Carola Wenk, Brian Summa
  • for: 本研究首次用图像化方法来显示数据中重要的拓扑特征,以便更好地分析和理解数据的结构。
  • methods: 本研究使用了已经证明的可解释深度学习方法,并将其应用于拓扑分类任务。这种方法可以在每个数据集中找出重要的拓扑结构,并为每个类别分配不同的权重。
  • results: 本研究通过创建持续点密度(persistence point density)上的重要性场来显示数据中重要的拓扑特征。该方法可以应用于图、3D形状和医学图像等数据,并提供了真实世界中的应用示例。
    Abstract This paper presents the first approach to visualize the importance of topological features that define classes of data. Topological features, with their ability to abstract the fundamental structure of complex data, are an integral component of visualization and analysis pipelines. Although not all topological features present in data are of equal importance. To date, the default definition of feature importance is often assumed and fixed. This work shows how proven explainable deep learning approaches can be adapted for use in topological classification. In doing so, it provides the first technique that illuminates what topological structures are important in each dataset in regards to their class label. In particular, the approach uses a learned metric classifier with a density estimator of the points of a persistence diagram as input. This metric learns how to reweigh this density such that classification accuracy is high. By extracting this weight, an importance field on persistent point density can be created. This provides an intuitive representation of persistence point importance that can be used to drive new visualizations. This work provides two examples: Visualization on each diagram directly and, in the case of sublevel set filtrations on images, directly on the images themselves. This work highlights real-world examples of this approach visualizing the important topological features in graph, 3D shape, and medical image data.
    摘要 本文提出了首个可视化界定数据类别的拓扑特征重要性的方法。拓扑特征能够抽象复杂数据的基本结构,是可视化与分析流程中的重要组成部分,但数据中的拓扑特征并非同等重要。迄今为止,特征重要性的定义往往被默认假定且固定不变。本工作展示了如何将成熟的可解释深度学习方法改造用于拓扑分类,从而首次提供了一种能揭示每个数据集中哪些拓扑结构对其类别标签重要的技术。具体而言,该方法以持续图(persistence diagram)点的密度估计作为输入,学习一个度量分类器;该度量学习如何重新加权这一密度,使分类精度最大化。提取该权重即可构建持续点密度上的重要性场,为持续点的重要性提供直观表示,并可驱动新的可视化。本工作给出两类示例:直接在持续图上可视化,以及在图像的下水平集过滤情形下直接在图像上可视化。文中以图、3D形状和医学图像数据为例,展示了该方法在真实世界中可视化重要拓扑特征的效果。

Enhancing Multi-Objective Optimization through Machine Learning-Supported Multiphysics Simulation

  • paper_url: http://arxiv.org/abs/2309.13179
  • repo_url: None
  • paper_authors: Diego Botache, Jens Decke, Winfried Ripken, Abhinay Dornipati, Franz Götz-Hahn, Mohamed Ayeb, Bernhard Sick
  • for: 这篇论文旨在提出一个加速多物理场仿真的方法论框架,以满足多目标优化的需求。
  • methods: 这篇论文将四种机器学习和深度学习算法与两种优化算法相结合,并组成一个完整的训练与优化管线。
  • results: 经过实验和评估,这篇论文发现可以使用相对少量的数据来训练高精度的代理模型,并且可以快速地获得多个目标的Pareto优化结果。
    Abstract Multiphysics simulations that involve multiple coupled physical phenomena quickly become computationally expensive. This imposes challenges for practitioners aiming to find optimal configurations for these problems satisfying multiple objectives, as optimization algorithms often require querying the simulation many times. This paper presents a methodological framework for training, self-optimizing, and self-organizing surrogate models to approximate and speed up Multiphysics simulations. We generate two real-world tabular datasets, which we make publicly available, and show that surrogate models can be trained on relatively small amounts of data to approximate the underlying simulations accurately. We conduct extensive experiments combining four machine learning and deep learning algorithms with two optimization algorithms and a comprehensive evaluation strategy. Finally, we evaluate the performance of our combined training and optimization pipeline by verifying the generated Pareto-optimal results using the ground truth simulations. We also employ explainable AI techniques to analyse our surrogates and conduct a preselection strategy to determine the most relevant features in our real-world examples. This approach lets us understand the underlying problem and identify critical partial dependencies.
    摘要 涉及多个耦合物理现象的多物理场仿真很快会变得计算代价高昂。由于优化算法往往需要多次查询仿真,这给希望找到满足多个目标的最优配置的实践者带来了挑战。本文提出了一个训练、自优化和自组织代理模型的方法论框架,用以近似并加速多物理场仿真。我们生成了两个真实世界的表格数据集并公开发布,并表明代理模型只需相对少量的数据即可准确地近似底层仿真。我们将四种机器学习和深度学习算法与两种优化算法相结合,进行了广泛实验,并采用全面的评估策略。最后,我们用真实仿真验证所得的Pareto最优结果,以评估训练与优化整体流程的性能。我们还运用可解释AI技术分析代理模型,并通过预选策略确定真实案例中最相关的特征,从而理解底层问题并识别关键的部分依赖关系。

Invisible Watermarking for Audio Generation Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.13166
  • repo_url: https://github.com/mikiyaxi/watermark-audio-diffusion
  • paper_authors: Xirong Cao, Xiang Li, Divyesh Jadav, Yanzhao Wu, Zhehui Chen, Chen Zeng, Wenqi Wei
  • for: 保护音频扩散模型的完整性和数据版权
  • methods: 基于mel-spectrogram的音频扩散模型水印技术
  • results: 实现了不可见的水印触发机制,可验证模型所有权并保障其完整性,同时保持良性音频生成的高质量。
    Abstract Diffusion models have gained prominence in the image domain for their capabilities in data generation and transformation, achieving state-of-the-art performance in various tasks in both image and audio domains. In the rapidly evolving field of audio-based machine learning, safeguarding model integrity and establishing data copyright are of paramount importance. This paper presents the first watermarking technique applied to audio diffusion models trained on mel-spectrograms. This offers a novel approach to the aforementioned challenges. Our model excels not only in benign audio generation, but also incorporates an invisible watermarking trigger mechanism for model verification. This watermark trigger serves as a protective layer, enabling the identification of model ownership and ensuring its integrity. Through extensive experiments, we demonstrate that invisible watermark triggers can effectively protect against unauthorized modifications while maintaining high utility in benign audio generation tasks.
    摘要 扩散模型凭借其数据生成与转换能力在图像领域广受关注,并在图像和音频领域的各类任务中取得了最先进的性能。在快速发展的音频机器学习领域,保护模型完整性和确立数据版权至关重要。本文提出了首个应用于基于mel-spectrogram训练的音频扩散模型的水印技术,为上述挑战提供了一条新途径。我们的模型不仅在良性音频生成任务中表现出色,还包含一个不可见的水印触发机制用于模型验证。该水印触发器作为一层保护,可用于识别模型所有权并确保其完整性。通过大量实验,我们证明了不可见水印触发器能够有效防止未经授权的修改,同时在良性音频生成任务中保持高实用性。

Forecasting Response to Treatment with Global Deep Learning and Patient-Specific Pharmacokinetic Priors

  • paper_url: http://arxiv.org/abs/2309.13135
  • repo_url: None
  • paper_authors: Willa Potosnak, Cristian Challu, Kin G. Olivares, Artur Dubrawski
  • for: 预测医疗时序数据,以早发现不良结果和监测病人状况。
  • methods: 提议一种新的混合全局-本地架构和药代动力学编码器,将患者特异的治疗效应信息提供给深度学习模型。
  • results: 相比患者特异模型,全局-本地架构将准确率提升了9.2%-14.6%;相比其他编码技术,药代动力学编码器在模拟数据上提升了4.4%,在真实数据上提升了2.1%。
    Abstract Forecasting healthcare time series is crucial for early detection of adverse outcomes and for patient monitoring. Forecasting, however, can be difficult in practice due to noisy and intermittent data. The challenges are often exacerbated by change points induced via extrinsic factors, such as the administration of medication. To address these challenges, we propose a novel hybrid global-local architecture and a pharmacokinetic encoder that informs deep learning models of patient-specific treatment effects. We showcase the efficacy of our approach in achieving significant accuracy gains for a blood glucose forecasting task using both realistically simulated and real-world data. Our global-local architecture improves over patient-specific models by 9.2-14.6%. Additionally, our pharmacokinetic encoder improves over alternative encoding techniques by 4.4% on simulated data and 2.1% on real-world data. The proposed approach can have multiple beneficial applications in clinical practice, such as issuing early warnings about unexpected treatment responses, or helping to characterize patient-specific treatment effects in terms of drug absorption and elimination characteristics.
    摘要 预测医疗时序数据对于早期发现不良结果和监测病人状况十分重要。然而,由于数据噪声大且采样断续,实际预测往往困难;药物给药等外部因素引起的变化点更会加剧这些挑战。为了解决这些问题,我们提出了一种新的混合全局-本地架构和一个药代动力学编码器,后者向深度学习模型提供患者特异的治疗效应信息。我们在血糖预测任务中使用真实模拟数据和真实世界数据展示了该方法带来的显著精度提升:全局-本地架构相比患者特异模型提升9.2%-14.6%;药代动力学编码器相比其他编码技术在模拟数据上提升4.4%,在真实数据上提升2.1%。该方法在临床实践中可以有多种有益应用,例如对意外的治疗反应发出早期预警,或帮助从药物吸收和消除特性的角度刻画患者特异的治疗效应。

AntiBARTy Diffusion for Property Guided Antibody Design

  • paper_url: http://arxiv.org/abs/2309.13129
  • repo_url: None
  • paper_authors: Jordan Venderley
  • for: 这篇论文旨在探讨利用机器学习技术进行抗体设计与工程的可能性。
  • methods: 这篇论文训练了一种基于 BART 的抗体语言模型 AntiBARTy,并在其潜在空间上训练了一个属性条件扩散模型,用于引导 IgG 抗体的 de novo 设计。
  • results: 实验结果表明,该方法可以生成在计算模拟(in-silico)中溶解性更优的新抗体,同时保持抗体的有效性并控制序列多样性。
    Abstract Over the past decade, antibodies have steadily grown in therapeutic importance thanks to their high specificity and low risk of adverse effects compared to other drug modalities. While traditional antibody discovery is primarily wet lab driven, the rapid improvement of ML-based generative modeling has made in-silico approaches an increasingly viable route for discovery and engineering. To this end, we train an antibody-specific language model, AntiBARTy, based on BART (Bidirectional and Auto-Regressive Transformer) and use its latent space to train a property-conditional diffusion model for guided IgG de novo design. As a test case, we show that we can effectively generate novel antibodies with improved in-silico solubility while maintaining antibody validity and controlling sequence diversity.
    摘要 过去十年,得益于相比其他药物形式更高的特异性和更低的不良反应风险,抗体在治疗领域的重要性稳步提升。传统的抗体发现主要依靠湿实验,而随着基于机器学习的生成建模快速进步,计算机模拟(in-silico)方法正日益成为抗体发现与工程的可行途径。为此,我们基于BART(双向自回归Transformer)训练了一个抗体专用语言模型 AntiBARTy,并利用其潜在空间训练了一个属性条件扩散模型,用于引导 IgG 的 de novo 设计。作为测试案例,我们展示了该方法能够有效生成计算模拟溶解性更优的新抗体,同时保持抗体有效性并控制序列多样性。

Data is often loadable in short depth: Quantum circuits from tensor networks for finance, images, fluids, and proteins

  • paper_url: http://arxiv.org/abs/2309.13108
  • repo_url: None
  • paper_authors: Raghav Jumade, Nicolas PD Sawaya
  • for: This paper addresses the “input problem” of loading classical data into a quantum computer, which has been an obstacle to achieving quantum advantage.
  • methods: The paper introduces a circuit compilation method based on tensor network (TN) theory, called AMLET (Automatic Multi-layer Loader Exploiting TNs), which can be tailored to arbitrary circuit depths.
  • results: The paper performs numerical experiments on real-world classical data from four distinct areas and shows that the required circuit depths are often several orders of magnitude lower than the exponentially-scaling general loading algorithm would require, demonstrating that many classical datasets can be loaded into a quantum computer in much shorter depth than previously expected, with positive implications for speeding up classical workloads on quantum computers.
    Abstract Though there has been substantial progress in developing quantum algorithms to study classical datasets, the cost of simply loading classical data is an obstacle to quantum advantage. When the amplitude encoding is used, loading an arbitrary classical vector requires up to exponential circuit depths with respect to the number of qubits. Here, we address this ``input problem'' with two contributions. First, we introduce a circuit compilation method based on tensor network (TN) theory. Our method -- AMLET (Automatic Multi-layer Loader Exploiting TNs) -- proceeds via careful construction of a specific TN topology and can be tailored to arbitrary circuit depths. Second, we perform numerical experiments on real-world classical data from four distinct areas: finance, images, fluid mechanics, and proteins. To the best of our knowledge, this is the broadest numerical analysis to date of loading classical data into a quantum computer. Consistent with other recent work in this area, the required circuit depths are often several orders of magnitude lower than the exponentially-scaling general loading algorithm would require. Besides introducing a more efficient loading algorithm, this work demonstrates that many classical datasets are loadable in depths that are much shorter than previously expected, which has positive implications for speeding up classical workloads on quantum computers.
    摘要 尽管在开发研究经典数据集的量子算法方面已取得重要进展,但仅仅将经典数据加载到量子计算机上的成本就阻碍了量子优势的实现。采用振幅编码时,加载任意经典向量所需的电路深度可能随量子比特数呈指数级增长。针对这一“输入问题”,我们做出两项贡献。首先,我们提出了一种基于张量网络(TN)理论的电路编译方法:AMLET(Automatic Multi-layer Loader Exploiting TNs),它通过精心构造特定的TN拓扑来实现,并可适配任意电路深度。其次,我们在来自金融、图像、流体力学和蛋白质四个领域的真实经典数据上进行了数值实验;据我们所知,这是迄今为止关于将经典数据加载到量子计算机的最广泛的数值分析。与该领域其他近期工作一致,所需电路深度往往比指数级缩放的通用加载算法低若干个数量级。除提出更高效的加载算法外,这项工作还表明许多经典数据集可以在远比以往预期更浅的深度内完成加载,这对在量子计算机上加速经典工作负载具有积极意义。
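
The tensor-network intuition can be sketched numerically: reshape a length-2^n vector into a matrix product state and truncate each SVD to a bond dimension chi; low chi on real-world data is what permits shallow circuits. The AMLET compilation step itself is not reproduced here, and the truncation scheme below is a generic textbook construction.

```python
# Generic MPS decomposition of a classical vector with bond-dimension
# truncation (illustrative; not the paper's AMLET compiler).
import numpy as np

def vector_to_mps(x: np.ndarray, chi: int):
    n = int(np.log2(x.size))
    assert 2 ** n == x.size, "length must be a power of two"
    cores, rest, rank = [], x.reshape(1, -1), 1
    for _ in range(n - 1):
        rest = rest.reshape(rank * 2, -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        keep = min(chi, len(s))                    # truncate the bond
        cores.append(u[:, :keep].reshape(rank, 2, keep))
        rest = np.diag(s[:keep]) @ vt[:keep]
        rank = keep
    cores.append(rest.reshape(rank, 2, 1))
    return cores

mps = vector_to_mps(np.random.rand(2 ** 8), chi=4)   # 8 "qubits", bond dim 4
```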

Graph Neural Network for Stress Predictions in Stiffened Panels Under Uniform Loading

  • paper_url: http://arxiv.org/abs/2309.13022
  • repo_url: None
  • paper_authors: Yuecheng Cai, Jasmin Jelovica
  • for: 本研究旨在提出一种新的图嵌入技术,将各独立板域视为顶点,以高效表示3D加劲板。
  • methods: 本研究使用图采样与聚合(GraphSAGE)技术预测不同几何形状加劲板中的应力分布,并与有限元-顶点(finite-element-vertex)图表示方法进行比较。
  • results: 研究结果表明,使用所提图嵌入方法可以更准确地预测3D加劲板的应力分布,并能高效开展不同结构几何的参数研究。
    Abstract Machine learning (ML) and deep learning (DL) techniques have gained significant attention as reduced order models (ROMs) to computationally expensive structural analysis methods, such as finite element analysis (FEA). Graph neural network (GNN) is a particular type of neural network which processes data that can be represented as graphs. This allows for efficient representation of complex geometries that can change during conceptual design of a structure or a product. In this study, we propose a novel graph embedding technique for efficient representation of 3D stiffened panels by considering separate plate domains as vertices. This approach is considered using Graph Sampling and Aggregation (GraphSAGE) to predict stress distributions in stiffened panels with varying geometries. A comparison between a finite-element-vertex graph representation is conducted to demonstrate the effectiveness of the proposed approach. A comprehensive parametric study is performed to examine the effect of structural geometry on the prediction performance. Our results demonstrate the immense potential of graph neural networks with the proposed graph embedding method as robust reduced-order models for 3D structures.
    摘要 机器学习(ML)和深度学习(DL)技术作为计算代价高昂的结构分析方法(如有限元分析,FEA)的降阶模型(ROM),受到了广泛关注。图神经网络(GNN)是一类处理可表示为图的数据的神经网络,能够高效表示在结构或产品概念设计阶段可能变化的复杂几何形状。本研究提出了一种新的图嵌入技术,将各独立板域视为顶点,以高效表示3D加劲板,并采用图采样与聚合(GraphSAGE)预测不同几何形状加劲板中的应力分布。我们与有限元-顶点图表示进行了比较,以证明所提方法的有效性,并开展了全面的参数研究,考察结构几何对预测性能的影响。结果表明,配合所提图嵌入方法的图神经网络有望成为3D结构的稳健降阶模型。
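
A minimal GraphSAGE regressor for per-vertex stress prediction, assuming PyTorch Geometric is available; feature sizes and the two-layer depth are illustrative, not the paper's architecture.

```python
# Per-node stress regression with GraphSAGE, where each plate domain of a
# stiffened panel is a vertex. Channel sizes are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class PanelStressGNN(torch.nn.Module):
    def __init__(self, in_feats: int = 8, hidden: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_feats, hidden)   # sample-and-aggregate layer
        self.conv2 = SAGEConv(hidden, hidden)
        self.out = torch.nn.Linear(hidden, 1)     # one stress value per vertex

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.out(x).squeeze(-1)
```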

Brain Age Revisited: Investigating the State vs. Trait Hypotheses of EEG-derived Brain-Age Dynamics with Deep Learning

  • paper_url: http://arxiv.org/abs/2310.07029
  • repo_url: https://github.com/gemeinl/eeg-brain-age
  • paper_authors: Lukas AW Gemein, Robin T Schirrmeister, Joschka Boedecker, Tonio Ball
  • for: investigate the relationship between brain age and brain pathology using clinical EEG recordings
  • methods: use a state-of-the-art Temporal Convolutional Network (TCN) for age regression, trained on recordings from the Temple University Hospital EEG Corpus (TUEG) with explicit labels for non-pathological and pathological recordings
  • results: the TCN achieves state-of-the-art age decoding with a mean absolute error of 6.6 years; the brain age gap biomarker is not indicative of pathological EEG, and the model significantly underestimates the age of both non-pathological and pathological subjects
    Abstract The brain's biological age has been considered as a promising candidate for a neurologically significant biomarker. However, recent results based on longitudinal magnetic resonance imaging data have raised questions on its interpretation. A central question is whether an increased biological age of the brain is indicative of brain pathology and if changes in brain age correlate with diagnosed pathology (state hypothesis). Alternatively, could the discrepancy in brain age be a stable characteristic unique to each individual (trait hypothesis)? To address this question, we present a comprehensive study on brain aging based on clinical EEG, which is complementary to previous MRI-based investigations. We apply a state-of-the-art Temporal Convolutional Network (TCN) to the task of age regression. We train on recordings of the Temple University Hospital EEG Corpus (TUEG) explicitly labeled as non-pathological and evaluate on recordings of subjects with non-pathological as well as pathological recordings, both with examinations at a single point in time and repeated examinations over time. Therefore, we created four novel subsets of TUEG that include subjects with multiple recordings: I) all labeled non-pathological; II) all labeled pathological; III) at least one recording labeled non-pathological followed by at least one recording labeled pathological; IV) similar to III) but with opposing transition (first pathological then non-pathological). The results show that our TCN reaches state-of-the-art performance in age decoding with a mean absolute error of 6.6 years. Our extensive analyses demonstrate that the model significantly underestimates the age of non-pathological and pathological subjects (-1 and -5 years, paired t-test, p <= 0.18 and p <= 0.0066). Furthermore, the brain age gap biomarker is not indicative of pathological EEG.
    摘要 大脑的生物学年龄一直被视为具有神经学意义的候选生物标志物。然而,基于纵向磁共振成像数据的最新结果对其解读提出了质疑。一个核心问题是:大脑生物学年龄的升高是否指示脑部病变,脑龄变化是否与确诊的病变相关(状态假设)?抑或,脑龄差异是每个个体特有的稳定特征(特质假设)?为回答这一问题,我们基于临床EEG开展了一项全面的大脑老化研究,与此前基于MRI的研究互为补充。我们采用最先进的时间卷积网络(TCN)完成年龄回归任务:在Temple University Hospital EEG Corpus(TUEG)中明确标注为非病理的记录上训练,并在非病理与病理受试者的记录上评估,其中既包括单次检查,也包括随时间的重复检查。为此,我们构建了TUEG的四个新子集,均包含有多次记录的受试者:I)全部标注为非病理;II)全部标注为病理;III)至少一次非病理记录之后跟随至少一次病理记录;IV)与III类似但转变方向相反(先病理后非病理)。结果显示,我们的TCN在年龄解码上达到最先进水平,平均绝对误差为6.6年。大量分析表明,模型显著低估了非病理和病理受试者的年龄(分别为-1岁和-5岁,配对t检验,p <= 0.18 和 p <= 0.0066)。此外,脑龄差距生物标志物并不能指示病理性EEG。

Understanding Deep Gradient Leakage via Inversion Influence Functions

  • paper_url: http://arxiv.org/abs/2309.13016
  • repo_url: https://github.com/illidanlab/inversion-influence-function
  • paper_authors: Haobo Zhang, Junyuan Hong, Yuyang Deng, Mehrdad Mahdavi, Jiayu Zhou
  • for: 防止分布式学习中的隐私泄露,尤其是在客户端存储敏感数据时。
  • methods: 提出了一种新的反演影响函数(I$^2$F),通过隐式求解DGL问题,在恢复图像与私有梯度之间建立闭式联系;相比直接求解DGL,它只需访问梯度和雅可比-向量积,可扩展地用于分析深度网络。
  • results: 在不同的网络架构、数据集、攻击实现和基于噪声的防御方法下,I$^2$F都能有效地近似DGL并预测潜在的隐私泄露。代码见 https://github.com/illidanlab/inversion-influence-function 。
    Abstract Deep Gradient Leakage (DGL) is a highly effective attack that recovers private training images from gradient vectors. This attack casts significant privacy challenges on distributed learning from clients with sensitive data, where clients are required to share gradients. Defending against such attacks requires but lacks an understanding of when and how privacy leakage happens, mostly because of the black-box nature of deep networks. In this paper, we propose a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem. Compared to directly solving DGL, I$^2$F is scalable for analyzing deep networks, requiring only oracle access to gradients and Jacobian-vector products. We empirically demonstrate that I$^2$F effectively approximated the DGL generally on different model architectures, datasets, attack implementations, and noise-based defenses. With this novel tool, we provide insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization. Our codes are provided in https://github.com/illidanlab/inversion-influence-function.
    摘要 深度梯度泄露(DGL)是一种非常有效的攻击,可以从梯度向量中恢复私有训练图像。这对需要客户端共享梯度的分布式学习(尤其当客户端持有敏感数据时)构成了重大隐私挑战。防御此类攻击需要理解隐私泄露发生的时间和方式,但由于深度网络的黑盒特性,这种理解一直缺失。在本文中,我们提出了一种新的反演影响函数(I$^2$F),通过隐式求解DGL问题,在恢复图像与私有梯度之间建立了闭式联系。与直接求解DGL相比,I$^2$F只需要对梯度和雅可比-向量积的oracle访问,因而可扩展地用于分析深度网络。我们的实验表明,I$^2$F在不同的模型架构、数据集、攻击实现和基于噪声的防御下都能较好地近似DGL。借助这一新工具,我们对有效的梯度扰动方向、隐私保护的不公平性以及有利于隐私的模型初始化提供了新的见解。我们的代码可以在 https://github.com/illidanlab/inversion-influence-function 中找到。
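
For reference, the DGL-style attack that I$^2$F analyzes can be sketched as gradient matching (in the spirit of Zhu et al.'s DLG); the model, loss function, and observed gradients are assumed given, and the optimizer settings are illustrative.

```python
# Gradient-matching reconstruction sketch: optimize dummy data so that its
# gradients match the shared ones. Shapes and hyperparameters are assumptions;
# soft-label cross-entropy requires a recent PyTorch (>= 1.10).
import torch

def dlg_attack(model, loss_fn, true_grads, x_shape, y_shape, steps: int = 300):
    x = torch.randn(x_shape, requires_grad=True)   # dummy image, e.g. (1,1,28,28)
    y = torch.randn(y_shape, requires_grad=True)   # dummy (soft) label, e.g. (1,10)
    opt = torch.optim.LBFGS([x, y])

    def closure():
        opt.zero_grad()
        grads = torch.autograd.grad(
            loss_fn(model(x), y.softmax(-1)), model.parameters(),
            create_graph=True)
        # Gradient-matching objective.
        diff = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
        diff.backward()
        return diff

    for _ in range(steps):
        opt.step(closure)
    return x.detach(), y.detach()
```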

Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR

  • paper_url: http://arxiv.org/abs/2309.13102
  • repo_url: None
  • paper_authors: Sheikh Shams Azam, Tatiana Likhomanenko, Martin Pelikan, Jan “Honza” Silovsky
  • for: Trains End-to-End Automatic Speech Recognition (ASR) models with Federated Learning (FL) and studies how to minimize the word error rate gap between FL-trained models and their centralized counterparts.
  • methods: Examines (i) adaptive optimizers, (ii) loss characteristics via the Connectionist Temporal Classification (CTC) weight, (iii) model initialization through seed start, (iv) carrying over centralized-training practices such as pre-layer or post-layer normalization, and (v) FL-specific hyperparameters (number of local epochs, client sampling size, learning rate scheduler) for ASR under heterogeneous data distributions.
  • results: Finds that some optimizers adapt better to the FL setting by inducing smoothness, analyzes the effect of client sampling size and learning rate schedulers, and summarizes algorithms, trends, and best practices from prior FL work toward better End-to-End ASR performance.
    Abstract In this paper, we start by training End-to-End Automatic Speech Recognition (ASR) models using Federated Learning (FL) and examining the fundamental considerations that can be pivotal in minimizing the performance gap in terms of word error rate between models trained using FL versus their centralized counterpart. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characteristics via altering Connectionist Temporal Classification (CTC) weight, (iii) model initialization through seed start, (iv) carrying over modeling setup from experiences in centralized training to FL, e.g., pre-layer or post-layer normalization, and (v) FL-specific hyperparameters, such as number of local epochs, client sampling size, and learning rate scheduler, specifically for ASR under heterogeneous data distribution. We shed light on how some optimizers work better than others via inducing smoothness. We also summarize the applicability of algorithms, trends, and propose best practices from prior works in FL (in general) toward End-to-End ASR models.
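As one concrete instance of the adaptive server-side optimizers the paper studies, a hedged FedAdam-style update sketch; the pseudo-gradient is the average client delta, and all names and constants here are illustrative:

```python
import numpy as np

def fedadam_step(w, client_deltas, m, v, lr=0.01, b1=0.9, b2=0.99, eps=1e-3):
    """One server round: treat the mean client weight delta as a pseudo-gradient."""
    g = np.mean(client_deltas, axis=0)
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2       # second-moment estimate
    w = w + lr * m / (np.sqrt(v) + eps)  # deltas already point "downhill", so add
    return w, m, v
```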

Expressive variational quantum circuits provide inherent privacy in federated learning

  • paper_url: http://arxiv.org/abs/2309.13002
  • repo_url: None
  • paper_authors: Niraj Kumar, Jamie Heredge, Changhao Li, Shaltiel Eloul, Shree Hari Sureshbabu, Marco Pistoia
  • for: Proposes a federated learning approach built on quantum machine learning models to protect data privacy.
  • methods: Uses variational quantum circuit models with expressive encoding maps coupled with overparameterized ansätze to guard data privacy.
  • results: Shows that expressive variational quantum circuits prevent data leakage under gradient inversion attacks while overparameterization keeps the model trainable.
    Abstract Federated learning has emerged as a viable distributed solution to train machine learning models without the actual need to share data with the central aggregator. However, standard neural network-based federated learning models have been shown to be susceptible to data leakage from the gradients shared with the server. In this work, we introduce federated learning with variational quantum circuit model built using expressive encoding maps coupled with overparameterized ansätze. We show that expressive maps lead to inherent privacy against gradient inversion attacks, while overparameterization ensures model trainability. Our privacy framework centers on the complexity of solving the system of high-degree multivariate Chebyshev polynomials generated by the gradients of quantum circuit. We present compelling arguments highlighting the inherent difficulty in solving these equations, both in exact and approximate scenarios. Additionally, we delve into machine learning-based attack strategies and establish a direct connection between overparameterization in the original federated learning model and underparameterization in the attack model. Furthermore, we provide numerical scaling arguments showcasing that underparameterization of the expressive map in the attack model leads to the loss landscape being swamped with exponentially many spurious local minima points, thus making it extremely hard to realize a successful attack. This provides a strong claim, for the first time, that the nature of quantum machine learning models inherently helps prevent data leakage in federated learning.

Deep learning probability flows and entropy production rates in active matter

  • paper_url: http://arxiv.org/abs/2309.12991
  • repo_url: None
  • paper_authors: Nicholas M. Boffi, Eric Vanden-Eijnden
  • for: Understanding the nature of the nonequilibrium states of active matter systems.
  • methods: Uses a deep learning method to estimate the score of the probability density, giving access to the entropy production rate and the probability current.
  • results: Obtains direct access to the entropy production rate and the probability current, decomposed into local contributions from individual particles, spatial regions, and degrees of freedom.
    Abstract Active matter systems, from self-propelled colloids to motile bacteria, are characterized by the conversion of free energy into useful work at the microscopic scale. These systems generically involve physics beyond the reach of equilibrium statistical mechanics, and a persistent challenge has been to understand the nature of their nonequilibrium states. The entropy production rate and the magnitude of the steady-state probability current provide quantitative ways to do so by measuring the breakdown of time-reversal symmetry and the strength of nonequilibrium transport of measure. Yet, their efficient computation has remained elusive, as they depend on the system's unknown and high-dimensional probability density. Here, building upon recent advances in generative modeling, we develop a deep learning framework that estimates the score of this density. We show that the score, together with the microscopic equations of motion, gives direct access to the entropy production rate, the probability current, and their decomposition into local contributions from individual particles, spatial regions, and degrees of freedom. To represent the score, we introduce a novel, spatially-local transformer-based network architecture that learns high-order interactions between particles while respecting their underlying permutation symmetry. We demonstrate the broad utility and scalability of the method by applying it to several high-dimensional systems of interacting active particles undergoing motility-induced phase separation (MIPS). We show that a single instance of our network trained on a system of 4096 particles at one packing fraction can generalize to other regions of the phase diagram, including systems with as many as 32768 particles. We use this observation to quantify the spatial structure of the departure from equilibrium in MIPS as a function of the number of particles and the packing fraction.
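For overdamped dynamics dx = b(x)dt + sqrt(2D)dW, the standard identities give a current velocity v = b - D*score and an entropy production rate E[|v|^2]/D; a hedged sketch of that estimator given a learned score function (the formulas and the sanity check are illustrative, not the paper's full method):

```python
import torch

def entropy_production_rate(samples, score_fn, drift_fn, D=1.0):
    """EPR = E[|v|^2] / D with current velocity v = b - D * score."""
    v = drift_fn(samples) - D * score_fn(samples)
    return (v ** 2).sum(dim=-1).mean() / D

# Sanity check: equilibrium Ornstein-Uhlenbeck process (b = -x and score = -x
# for a unit Gaussian), where the entropy production rate should be ~0.
x = torch.randn(10_000, 2)
print(entropy_production_rate(x, lambda s: -s, lambda s: -s))
```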

BayesDLL: Bayesian Deep Learning Library

  • paper_url: http://arxiv.org/abs/2309.12928
  • repo_url: https://github.com/samsunglabs/bayesdll
  • paper_authors: Minyoung Kim, Timothy Hospedales
  • for: Describes a PyTorch library for Bayesian neural networks that scales to large deep networks.
  • methods: Implements mainstream approximate Bayesian inference algorithms: variational inference, MC-dropout, stochastic-gradient MCMC, and the Laplace approximation.
  • results: Unlike other existing Bayesian neural network libraries, it handles very large networks including Vision Transformers (ViTs), requires virtually no code modification from users (backbone network definitions stay untouched), and allows pre-trained model weights to serve as the prior mean, which is useful for Bayesian inference with large foundation models such as ViTs that are hard to optimize from scratch on downstream data alone.
    Abstract We release a new Bayesian neural network library for PyTorch for large-scale deep networks. Our library implements mainstream approximate Bayesian inference algorithms: variational inference, MC-dropout, stochastic-gradient MCMC, and Laplace approximation. The main differences from other existing Bayesian neural network libraries are as follows: 1) Our library can deal with very large-scale deep networks including Vision Transformers (ViTs). 2) We need virtually zero code modifications for users (e.g., the backbone network definition codes do not need to be modified at all). 3) Our library also allows the pre-trained model weights to serve as a prior mean, which is very useful for performing Bayesian inference with the large-scale foundation models like ViTs that are hard to optimise from scratch with the downstream data alone. Our code is publicly available at: \url{https://github.com/SamsungLabs/BayesDLL}\footnote{A mirror repository is also available at: \url{https://github.com/minyoungkim21/BayesDLL}.}.
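A minimal MC-dropout sketch, one of the approximate inference methods listed above; the model and data are placeholders, and this is not the library's own API:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                      nn.Dropout(p=0.1), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout active at inference time
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # predictive mean and uncertainty

mean, std = mc_dropout_predict(model, torch.randn(8, 16))
```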

Topological Data Mapping of Online Hate Speech, Misinformation, and General Mental Health: A Large Language Model Based Study

  • paper_url: http://arxiv.org/abs/2309.13098
  • repo_url: None
  • paper_authors: Andrew Alexander, Hongbin Wang
  • for: Studies how posting hate speech and misinformation on social media relates to the posters' mental health.
  • methods: Uses OpenAI's GPT-3 to derive embeddings of posts and applies machine-learning classification to understand the role of hate speech/misinformation across different communities.
  • results: Finds close links between hate speech/misinformation and psychiatric disorders, and produces, via topological data analysis, a visual map connecting online hate speech/misinformation with general mental health.
    Abstract The advent of social media has led to an increased concern over its potential to propagate hate speech and misinformation, which, in addition to contributing to prejudice and discrimination, has been suspected of playing a role in increasing social violence and crimes in the United States. While literature has shown the existence of an association between posting hate speech and misinformation online and certain personality traits of posters, the general relationship and relevance of online hate speech/misinformation in the context of overall psychological wellbeing of posters remain elusive. One difficulty lies in the lack of adequate data analytics tools capable of adequately analyzing the massive amount of social media posts to uncover the underlying hidden links. Recent progresses in machine learning and large language models such as ChatGPT have made such an analysis possible. In this study, we collected thousands of posts from carefully selected communities on the social media site Reddit. We then utilized OpenAI's GPT3 to derive embeddings of these posts, which are high-dimensional real-numbered vectors that presumably represent the hidden semantics of posts. We then performed various machine-learning classifications based on these embeddings in order to understand the role of hate speech/misinformation in various communities. Finally, a topological data analysis (TDA) was applied to the embeddings to obtain a visual map connecting online hate speech, misinformation, various psychiatric disorders, and general mental health.
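A hedged sketch of the downstream analysis: classify posts from precomputed embeddings and project them to 2-D for visualization; the embedding matrix and labels are random placeholders, and t-SNE stands in here for the paper's topological mapping step:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X = np.random.randn(500, 1536)         # placeholder post embeddings
y = np.random.randint(0, 2, size=500)  # placeholder community labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

coords = TSNE(n_components=2, random_state=0).fit_transform(X)  # 2-D map
```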

FairComp: Workshop on Fairness and Robustness in Machine Learning for Ubiquitous Computing

  • paper_url: http://arxiv.org/abs/2309.12877
  • repo_url: None
  • paper_authors: Sofia Yfantidou, Dimitris Spathis, Marios Constantinides, Tong Xia, Niels van Berkel
  • for: This workshop aims to discuss fairness in UbiComp research and its social, technical, and legal implications.
  • methods: From a social perspective, it examines the relationship between fairness and UbiComp research and identifies pathways to ensure ubiquitous technologies do not cause harm or infringe on individual rights; from a technical perspective, it initiates a discussion on data practices and bias-mitigation approaches tailored to UbiComp research.
  • results: The workshop seeks to foster a vibrant community centered on responsible UbiComp while charting a clear path for future research in this field.
    Abstract How can we ensure that Ubiquitous Computing (UbiComp) research outcomes are both ethical and fair? While fairness in machine learning (ML) has gained traction in recent years, fairness in UbiComp remains unexplored. This workshop aims to discuss fairness in UbiComp research and its social, technical, and legal implications. From a social perspective, we will examine the relationship between fairness and UbiComp research and identify pathways to ensure that ubiquitous technologies do not cause harm or infringe on individual rights. From a technical perspective, we will initiate a discussion on data practices to develop bias mitigation approaches tailored to UbiComp research. From a legal perspective, we will examine how new policies shape our community's work and future research. We aim to foster a vibrant community centered around the topic of responsible UbiComp, while also charting a clear path for future research endeavours in this field.

Robotic Handling of Compliant Food Objects by Robust Learning from Demonstration

  • paper_url: http://arxiv.org/abs/2309.12856
  • repo_url: None
  • paper_authors: Ekrem Misimi, Alexander Olofsson, Aleksander Eilertsen, Elling Ruud Øye, John Reidar Mathiassen
  • for: Robotic grasping of compliant food objects, improving the consistency of robot learning despite the high variability of human demonstrators.
  • methods: A Learning from Demonstration (LfD) approach that merges RGB-D images and tactile data to estimate the gripper pose, finger configuration, and forces needed for effective robot handling.
  • results: The proposed approach automatically removes inconsistent demonstrations and estimates the teacher's intended policy, with performance validated on fragile and compliant food objects with complex 3D shapes.
    Abstract The robotic handling of compliant and deformable food raw materials, characterized by high biological variation, complex geometrical 3D shapes, and mechanical structures and texture, is currently in huge demand in the ocean space, agricultural, and food industries. Many tasks in these industries are performed manually by human operators who, due to the laborious and tedious nature of their tasks, exhibit high variability in execution, with variable outcomes. The introduction of robotic automation for most complex processing tasks has been challenging due to current robot learning policies. A more consistent learning policy involving skilled operators is desired. In this paper, we address the problem of robot learning when presented with inconsistent demonstrations. To this end, we propose a robust learning policy based on Learning from Demonstration (LfD) for robotic grasping of food compliant objects. The approach uses a merging of RGB-D images and tactile data in order to estimate the necessary pose of the gripper, gripper finger configuration and forces exerted on the object in order to achieve effective robot handling. During LfD training, the gripper pose, finger configurations and tactile values for the fingers, as well as RGB-D images are saved. We present an LfD learning policy that automatically removes inconsistent demonstrations, and estimates the teacher's intended policy. The performance of our approach is validated and demonstrated for fragile and compliant food objects with complex 3D shapes. The proposed approach has a vast range of potential applications in the aforementioned industry sectors.
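A hedged sketch of the "remove inconsistent demonstrations" idea: score each demonstration by its distance to the others in a summary feature space and drop outliers (a simple z-score rule, not the paper's exact criterion):

```python
import numpy as np

def filter_demos(demo_feats, z_thresh=2.0):
    """demo_feats: (n_demos, d) array of per-demonstration summary features."""
    dists = np.linalg.norm(demo_feats - demo_feats.mean(axis=0), axis=1)
    z = (dists - dists.mean()) / (dists.std() + 1e-8)
    return demo_feats[z < z_thresh]  # keep the consistent demonstrations
```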

DeepOPF-U: A Unified Deep Neural Network to Solve AC Optimal Power Flow in Multiple Networks

  • paper_url: http://arxiv.org/abs/2309.12849
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Heng Liang, Changhong Zhao
  • for: Solving AC optimal power flow (OPF) problems across different power networks.
  • methods: Uses a single unified deep neural network (DNN) with elastic input and output layers to accommodate load vectors and OPF solutions of varying lengths in different networks, including networks that successively expand.
  • results: Outperforms existing DNN-based methods on the IEEE 57/118/300-bus test systems and on a network growing from 73 to 118 buses, handling varying numbers of buses, lines, loads, and distributed energy resources.
    Abstract The traditional machine learning models to solve optimal power flow (OPF) are mostly trained for a given power network and lack generalizability to today's power networks with varying topologies and growing plug-and-play distributed energy resources (DERs). In this paper, we propose DeepOPF-U, which uses one unified deep neural network (DNN) to solve alternating-current (AC) OPF problems in different power networks, including a set of power networks that is successively expanding. Specifically, we design elastic input and output layers for the vectors of given loads and OPF solutions with varying lengths in different networks. The proposed method, using a single unified DNN, can deal with different and growing numbers of buses, lines, loads, and DERs. Simulations of IEEE 57/118/300-bus test systems and a network growing from 73 to 118 buses verify the improved performance of DeepOPF-U compared to existing DNN-based solution methods.
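A hedged PyTorch sketch of the "elastic" input/output idea: zero-pad variable-length load vectors to a shared maximum size and mask the outputs back to the active buses (the sizes, layer widths, and output convention are assumptions):

```python
import torch
import torch.nn as nn

MAX_BUS = 300  # assumed upper bound across all supported networks

class ElasticOPFNet(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(2 * MAX_BUS, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * MAX_BUS))  # e.g., voltage magnitude/angle per bus

    def forward(self, loads, n_bus):
        x = torch.zeros(loads.shape[0], 2 * MAX_BUS)
        x[:, : loads.shape[1]] = loads  # zero-pad smaller networks
        out = self.body(x)
        return out[:, : 2 * n_bus]      # keep only the active buses
```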

Multiple Independent DE Optimizations to Tackle Uncertainty and Variability in Demand in Inventory Management

  • paper_url: http://arxiv.org/abs/2309.13095
  • repo_url: None
  • paper_authors: Sarit Maitra, Sukanya Kundu, Vivek Mishra
  • for: Identifying the most effective metaheuristic Differential Evolution optimization strategy for minimizing inventory costs under uncertain demand patterns.
  • methods: Combines a continuous review of inventory management (IM) policies with Monte Carlo Simulation (MCS) and compares multiple meta-heuristic algorithms to find the optimal solution.
  • results: The Differential Evolution (DE) algorithm outperforms its counterparts in optimizing IM; parameters are fine-tuned with Latin Hypercube Sampling (LHS), and the final solution combines the outcomes of multiple independent DE optimizations started from different random initial conditions, improving performance and cost efficiency under stochastic demand.
    Abstract To determine the effectiveness of the metaheuristic Differential Evolution optimization strategy for inventory management (IM) in the context of stochastic demand, this empirical study undertakes a thorough investigation. The primary objective is to discern the most effective strategy for minimizing inventory costs under uncertain demand patterns. Inventory costs refer to the expenses associated with holding and managing inventory within a business. The approach combines a continuous review of IM policies with a Monte Carlo Simulation (MCS). To find the optimal solution, the study focuses on meta-heuristic approaches and compares multiple algorithms. The outcomes reveal that the Differential Evolution (DE) algorithm outperforms its counterparts in optimizing IM. To fine-tune the parameters, the study employs the Latin Hypercube Sampling (LHS) statistical method. To determine the final solution, the study combines the outcomes of multiple independent DE optimizations, each initiated with different random initial conditions. This approach introduces a novel and promising dimension to the field of inventory management, offering potential enhancements in performance and cost efficiency, especially in the presence of stochastic demand patterns.
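A hedged SciPy sketch of the multi-start recipe: several independent differential evolution runs, each seeded with a Latin Hypercube initial population, keeping the best result; the inventory cost function is a toy placeholder, not the paper's model:

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import qmc

def inventory_cost(x):
    s, S = x  # reorder point and order-up-to level (toy cost model)
    return (S - s) ** 2 + 5.0 * abs(s - 40) + 0.1 * S

bounds = [(0, 100), (0, 200)]
runs = []
for seed in range(5):  # independent optimizations with different initial conditions
    lhs = qmc.LatinHypercube(d=len(bounds), seed=seed)
    init = qmc.scale(lhs.random(n=20), *zip(*bounds))
    runs.append(differential_evolution(inventory_cost, bounds,
                                       init=init, seed=seed))

best = min(runs, key=lambda r: r.fun)
print(best.x, best.fun)
```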

Reward Function Design for Crowd Simulation via Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.12841
  • repo_url: None
  • paper_authors: Ariel Kwiatkowski, Vicky Kalogeiton, Julien Pettré, Marie-Paule Cani
  • for: Explores reinforcement-learning-based crowd simulation and the proper design of reward functions.
  • methods: Uses reinforcement learning and analyzes the effect of reward functions both theoretically and empirically, with energy efficiency as the metric.
  • results: Directly minimizing energy consumption is a viable strategy provided it is paired with an appropriately scaled guiding potential; the findings can inform new crowd simulation techniques and the wider study of human-like navigation.
    Abstract Crowd simulation is important for video-games design, since it enables to populate virtual worlds with autonomous avatars that navigate in a human-like manner. Reinforcement learning has shown great potential in simulating virtual crowds, but the design of the reward function is critical to achieving effective and efficient results. In this work, we explore the design of reward functions for reinforcement learning-based crowd simulation. We provide theoretical insights on the validity of certain reward functions according to their analytical properties, and evaluate them empirically using a range of scenarios, using the energy efficiency as the metric. Our experiments show that directly minimizing the energy usage is a viable strategy as long as it is paired with an appropriately scaled guiding potential, and enable us to study the impact of the different reward components on the behavior of the simulated crowd. Our findings can inform the development of new crowd simulation techniques, and contribute to the wider study of human-like navigation.
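A hedged sketch of the reward shape discussed above: an energy penalty paired with a scaled guiding potential toward the goal (coefficients and terms are illustrative, not the paper's exact formulation):

```python
import numpy as np

def reward(pos, action, goal, c_energy=1.0, c_guide=0.1):
    energy = np.sum(action ** 2)             # locomotion-energy proxy
    potential = -np.linalg.norm(goal - pos)  # guiding potential toward the goal
    return -c_energy * energy + c_guide * potential
```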

Doubly Robust Proximal Causal Learning for Continuous Treatments

  • paper_url: http://arxiv.org/abs/2309.12819
  • repo_url: None
  • paper_authors: Yong Wu, Yanwei Fu, Shouyan Wang, Xinwei Sun
  • for: Proposes a proximal causal learning framework that handles continuous treatments, enabling causal effect estimation in real-world applications.
  • methods: A kernel-based doubly robust (DR) estimator that smooths away the delta function of the original DR estimator, together with a new approach to efficiently solve the nuisance functions.
  • results: Evaluations on synthetic data and real-world applications demonstrate the estimator's accuracy and stability, supported by a convergence analysis in terms of mean squared error.
    Abstract Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.
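For intuition on why a smoothing kernel replaces the delta function for continuous treatments, a toy kernel-weighted estimate (this is a Nadaraya-Watson sketch, not the paper's full doubly robust construction):

```python
import numpy as np

def kernel_weighted_outcome(T, Y, t0, h=0.5):
    """Kernel-smoothed estimate of E[Y | T = t0] with bandwidth h."""
    w = np.exp(-0.5 * ((T - t0) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)
```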

Improving Generalization in Game Agents with Data Augmentation in Imitation Learning

  • paper_url: http://arxiv.org/abs/2309.12815
  • repo_url: None
  • paper_authors: Derek Yadgaroff, Alessandro Sestini, Konrad Tollmar, Linus Gisslén
  • for: Improving the generalization ability of game-playing agents.
  • methods: Applies data augmentation to the training data of imitation learning agents so that the distribution of states and actions better represents the real state-action distribution.
  • results: Data augmentation effectively improves the generalization of imitation learning agents, with a performance benchmark of the augmentations across several 3D environments.
    Abstract Imitation learning is an effective approach for training game-playing agents and, consequently, for efficient game production. However, generalization - the ability to perform well in related but unseen scenarios - is an essential requirement that remains an unsolved challenge for game AI. Generalization is difficult for imitation learning agents because it requires the algorithm to take meaningful actions outside of the training distribution. In this paper we propose a solution to this challenge. Inspired by the success of data augmentation in supervised learning, we augment the training data so the distribution of states and actions in the dataset better represents the real state-action distribution. This study evaluates methods for combining and applying data augmentations to observations, to improve generalization of imitation learning agents. It also provides a performance benchmark of these augmentations across several 3D environments. These results demonstrate that data augmentation is a promising framework for improving generalization in imitation learning agents.
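A hedged sketch of observation augmentation for behavioral cloning: random shifts plus Gaussian noise applied to image observations before the imitation loss (the specific augmentations and constants are illustrative):

```python
import torch
import torch.nn.functional as F

def augment_obs(obs, pad=4, noise=0.05):
    """Random shift + noise on a batch of (B, C, H, W) observations."""
    b, c, h, w = obs.shape
    padded = F.pad(obs, (pad,) * 4, mode="replicate")
    dx, dy = torch.randint(0, 2 * pad + 1, (2,)).tolist()
    shifted = padded[:, :, dy:dy + h, dx:dx + w]  # random crop back to HxW
    return shifted + noise * torch.randn_like(shifted)
```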

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

  • paper_url: http://arxiv.org/abs/2309.12802
  • repo_url: None
  • paper_authors: Alexandre R. Ferreira, Cláudio E. C. Campelo
  • for: Improving the robustness of speech-to-text transcription models, which requires a large and diverse labeled dataset.
  • methods: Proposes a data augmentation framework based on deepfake audio, using a voice cloner and an English-language dataset produced by Indian speakers to keep a single accent in the data.
  • results: Experiments validate the framework, using the augmented data to train speech-to-text models in various scenarios with good results.
    Abstract To train transcriptor models that produce robust results, a large and diverse labeled dataset is required. Finding such data with the necessary characteristics is a challenging task, especially for languages less popular than English. Moreover, producing such data requires significant effort and often money. Therefore, a strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a framework that approaches data augmentation based on deepfake audio. To validate the produced framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and a dataset produced by Indians (in English) were selected, ensuring the presence of a single accent in the dataset. Subsequently, the augmented data was used to train speech to text models in various scenarios.

An Intelligent Approach to Detecting Novel Fault Classes for Centrifugal Pumps Based on Deep CNNs and Unsupervised Methods

  • paper_url: http://arxiv.org/abs/2309.12765
  • repo_url: None
  • paper_authors: Mahdi Abdollah Chalaki, Daniyal Maroufi, Mahdi Robati, Mohammad Javad Karimi, Ali Sadighi
  • for: Addresses a remaining challenge in data-driven fault diagnosis of rotating machines: the lack of information about the variety of faults a system may encounter in the field.
  • methods: Trains a convolutional neural network (CNN) on data from partially known system faults, then detects novel faults with a combination of t-SNE and clustering; upon detection, the network is augmented with the new data.
  • results: A test setup on a centrifugal pump validates the two-stage methodology, showing high accuracy in detecting novel faults.
    Abstract Despite the recent success in data-driven fault diagnosis of rotating machines, there are still remaining challenges in this field. Among the issues to be addressed is the lack of information about the variety of faults the system may encounter in the field. In this paper, we assume partial knowledge of the system faults and use the corresponding data to train a convolutional neural network. A combination of the t-SNE method and clustering techniques is then employed to detect novel faults. Upon detection, the network is augmented using the new data. Finally, a test setup is used to validate this two-stage methodology on a centrifugal pump, and experimental results show high accuracy in detecting novel faults.
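A hedged sketch of the second stage: project CNN embeddings with t-SNE, cluster with DBSCAN, and flag clusters that contain no known-fault samples; the embeddings, thresholds, and clustering parameters are placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def detect_novel(emb_known, emb_new):
    z = TSNE(n_components=2, random_state=0).fit_transform(
        np.vstack([emb_known, emb_new]))
    labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(z)
    n_known = len(emb_known)
    return [lbl for lbl in set(labels) - {-1}
            if not np.any(labels[:n_known] == lbl)]  # clusters w/o known data

emb_known = np.random.randn(200, 64)     # embeddings of known fault classes
emb_new = np.random.randn(50, 64) + 6.0  # incoming data, possibly a new fault
print(detect_novel(emb_known, emb_new))
```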

Prototype-Enhanced Hypergraph Learning for Heterogeneous Information Networks

  • paper_url: http://arxiv.org/abs/2309.13092
  • repo_url: None
  • paper_authors: Shuai Wang, Jiayi Shen, Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
  • for: Proposes a prototype-enhanced hypergraph learning approach for node classification in heterogeneous information networks (HINs), which arise from the variety and complexity of relations in multimedia data.
  • methods: Uses hypergraphs instead of graphs to capture higher-order relationships among nodes and extract semantic information without relying on predefined metapaths, and leverages prototypes to improve the robustness of the hypergraph learning process while offering human-interpretable insights into the underlying network structure.
  • results: Extensive experiments on three real-world HINs demonstrate the effectiveness of the method.
    Abstract The variety and complexity of relations in multimedia data lead to Heterogeneous Information Networks (HINs). Capturing the semantics from such networks requires approaches capable of utilizing the full richness of the HINs. Existing methods for modeling HINs employ techniques originally designed for graph neural networks, and HINs decomposition analysis, like using manually predefined metapaths. In this paper, we introduce a novel prototype-enhanced hypergraph learning approach for node classification in HINs. Using hypergraphs instead of graphs, our method captures higher-order relationships among nodes and extracts semantic information without relying on metapaths. Our method leverages the power of prototypes to improve the robustness of the hypergraph learning process and creates the potential to provide human-interpretable insights into the underlying network structure. Extensive experiments on three real-world HINs demonstrate the effectiveness of our method.

Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.12742
  • repo_url: https://github.com/yue-zhongqi/icon
  • paper_authors: Zhongqi Yue, Hanwang Zhang, Qianru Sun
  • for: Addresses the spurious correlation in Unsupervised Domain Adaptation (UDA) between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment), which prevents models from generalizing to the target domain.
  • methods: Proposes Invariant CONsistency learning (ICON), which gives equal status to both domains by learning an invariant classifier whose predictions are simultaneously consistent with the labels in the source domain and the clusters in the target domain, thereby removing the spurious correlation that is inconsistent in the target domain.
  • results: ICON achieves state-of-the-art performance on the classic UDA benchmarks Office-Home and VisDA-2017 and outperforms all conventional methods on the challenging WILDS 2.0 benchmark.
    Abstract Domain Adaptation (DA) is always challenged by the spurious correlation between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment) that does not generalize to the target domain. Unfortunately, even enriched with additional unsupervised target domains, existing Unsupervised DA (UDA) methods still suffer from it. This is because the source domain supervision only considers the target domain samples as auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the target domain -- where the valuable de-correlation clues hide -- is disregarded. We propose to make the U in UDA matter by giving equal status to the two domains. Specifically, we learn an invariant classifier whose prediction is simultaneously consistent with the labels in the source domain and clusters in the target domain, hence the spurious correlation inconsistent in the target domain is removed. We dub our approach "Invariant CONsistency learning" (ICON). Extensive experiments show that ICON achieves the state-of-the-art performance on the classic UDA benchmarks: Office-Home and VisDA-2017, and outperforms all the conventional methods on the challenging WILDS 2.0 benchmark. Codes are in https://github.com/yue-zhongqi/ICON.

Optimal Dynamic Fees for Blockchain Resources

  • paper_url: http://arxiv.org/abs/2309.12735
  • repo_url: None
  • paper_authors: Davide Crapis, Ciamac C. Moallemi, Shouqiao Wang
  • for: Develops a general and practical framework for the optimal design of dynamic fee mechanisms for multiple blockchain resources.
  • methods: Computes policies that optimally trade off adjusting resource prices to handle persistent demand shifts against robustness to local noise in observed block demand; with more than one resource, the optimal policies correctly handle cross-effects (complementarity and substitutability) in resource demands, which can also inform resource design: bundling resources with low demand-side cross-effects yields simpler and more efficient price-update rules.
  • results: The framework can refine or inform heuristic fee-update rules such as EIP-1559 or EIP-4844, demonstrated in two case studies; a uni-dimensional version of the model is estimated on real Ethereum market data, empirically comparing the optimal policies against EIP-1559.
    Abstract We develop a general and practical framework to address the problem of the optimal design of dynamic fee mechanisms for multiple blockchain resources. Our framework allows to compute policies that optimally trade-off between adjusting resource prices to handle persistent demand shifts versus being robust to local noise in the observed block demand. In the general case with more than one resource, our optimal policies correctly handle cross-effects (complementarity and substitutability) in resource demands. We also show how these cross-effects can be used to inform resource design, i.e. combining resources into bundles that have low demand-side cross-effects can yield simpler and more efficient price-update rules. Our framework is also practical, we demonstrate how it can be used to refine or inform the design of heuristic fee update rules such as EIP-1559 or EIP-4844 with two case studies. We then estimate a uni-dimensional version of our model using real market data from the Ethereum blockchain and empirically compare the performance of our optimal policies to EIP-1559.
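For reference, a simplified sketch of the kind of heuristic fee-update rule the framework can refine, following the EIP-1559 base-fee recurrence (the actual spec's integer-math edge cases are glossed over here):

```python
def eip1559_update(base_fee, gas_used, gas_target, denom=8):
    """next = base_fee * (1 + (gas_used - gas_target) / gas_target / denom)."""
    delta = base_fee * (gas_used - gas_target) // (gas_target * denom)
    return max(base_fee + delta, 1)

fee = 100 * 10**9  # 100 gwei, illustrative
fee = eip1559_update(fee, gas_used=20_000_000, gas_target=15_000_000)
```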

Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2309.12714
  • repo_url: None
  • paper_authors: Amirali Soltani Tehrani, Niloufar Faridani, Ramin Toosi
  • for: Deepens human-computer interaction by recognizing human emotional states, making communication more effective and empathetic.
  • methods: Combines self-supervised feature extraction with supervised classification: in the preprocessing step, a Wav2Vec-based self-supervised extractor captures acoustic features from audio data, and the resulting feature maps are fed to a custom convolutional neural network (CNN) for emotion classification.
  • results: On the ShEMO dataset, the method surpasses two baselines, a support vector machine classifier and transfer learning of a pretrained CNN, and compares favorably with state-of-the-art SER methods.
    Abstract Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need of crafting audio features, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data. Then, the output featuremaps of the preprocessing step are fed to a custom designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Utilizing the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e. support vector machine classifier and transfer learning of a pretrained CNN. comparing the propose method to the state-of-the-art methods in SER task indicates the superiority of the proposed method. Our findings underscore the pivotal role of deep unsupervised feature learning in elevating the landscape of SER, offering enhanced emotional comprehension in the realm of human-computer interactions.
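A hedged sketch of the pipeline: self-supervised features from a pretrained wav2vec 2.0 model feeding a small CNN classifier; the torchaudio bundle, layer choice, and class count are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
extractor = bundle.get_model().eval()

classifier = nn.Sequential(
    nn.Conv1d(768, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 6))  # e.g., six emotion classes

wav = torch.randn(1, bundle.sample_rate)        # placeholder 1-second clip
with torch.no_grad():
    feats, _ = extractor.extract_features(wav)  # per-layer feature maps
logits = classifier(feats[-1].transpose(1, 2))  # (B, T, 768) -> class logits
```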

Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences

  • paper_url: http://arxiv.org/abs/2309.12712
  • repo_url: https://github.com/hugomalard/big-model-only-for-hard-audios
  • paper_authors: Hugo Malard, Salah Zaiem, Robin Algayres
  • for: Proposes a decision module that, given an audio sample, selects the smallest sufficient model, enabling efficient automatic speech recognition (ASR) under varying memory and hardware constraints.
  • methods: Uses two Whisper models of different sizes and builds a decision module on top of them, keeping the decision process itself computationally cheap.
  • results: Experiments show that the decision module yields substantial computational savings with only small performance drops, since the smaller model suffices for large parts of the test data.
    Abstract Recent progress in Automatic Speech Recognition (ASR) has been coupled with a substantial increase in the model sizes, which may now contain billions of parameters, leading to slow inferences even with adapted hardware. In this context, several ASR models exist in various sizes, with different inference costs leading to different performance levels. Based on the observation that smaller models perform optimally on large parts of testing corpora, we propose to train a decision module, that would allow, given an audio sample, to use the smallest sufficient model leading to a good transcription. We apply our approach to two Whisper models with different sizes. By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with reduced performance drops.
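A hedged sketch of sample-dependent routing with the openai-whisper package: transcribe with the small model first, and fall back to the large one when the average segment log-probability is low (this confidence score and threshold are assumptions standing in for the paper's trained decision module):

```python
import numpy as np
import whisper

small, big = whisper.load_model("tiny"), whisper.load_model("large")

def transcribe(path, thresh=-0.8):
    out = small.transcribe(path)
    segs = out["segments"]
    conf = np.mean([s["avg_logprob"] for s in segs]) if segs else -10.0
    return out["text"] if conf > thresh else big.transcribe(path)["text"]
```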

Discovering the Interpretability-Performance Pareto Front of Decision Trees with Dynamic Programming

  • paper_url: http://arxiv.org/abs/2309.12701
  • repo_url: None
  • paper_authors: Hector Kohler, Riad Akrour, Philippe Preux
  • for: Proposes a new Markov Decision Problem (MDP) formulation for finding optimal decision trees.
  • methods: Solves a single dynamic program to compute optimal decision trees for several interpretability-performance trade-offs.
  • results: Experiments show accuracy and runtime competitive with state-of-the-art algorithms while returning a whole set of trees on the interpretability-performance Pareto front, from which the user can pick the tree that best suits their needs.
    Abstract Decision trees are known to be intrinsically interpretable as they can be inspected and interpreted by humans. Furthermore, recent hardware advances have rekindled an interest in optimal decision tree algorithms, which produce more accurate trees than the usual greedy approaches. However, these optimal algorithms return a single tree optimizing a hand-defined interpretability-performance trade-off, obtained by specifying a maximum number of decision nodes, giving no further insight into the quality of this trade-off. In this paper, we propose a new Markov Decision Problem (MDP) formulation for finding optimal decision trees. The main interest of this formulation is that we can compute the optimal decision trees for several interpretability-performance trade-offs by solving a single dynamic program, letting the user choose a posteriori the tree that best suits their needs. Empirically, we show that our method is competitive with state-of-the-art algorithms in terms of accuracy and runtime while returning a whole set of trees on the interpretability-performance Pareto front.
    摘要 In this paper, we propose a new Markov Decision Problem (MDP) formulation for finding optimal decision trees. Our approach allows us to compute the optimal decision trees for multiple interpretability-performance trade-offs by solving a single dynamic program. This enables the user to choose the tree that best suits their needs after the fact. Empirical results show that our method is competitive with state-of-the-art algorithms in terms of accuracy and runtime, while providing a set of trees on the interpretability-performance Pareto front.

Recurrent Temporal Revision Graph Networks

  • paper_url: http://arxiv.org/abs/2309.12694
  • repo_url: None
  • paper_authors: Yizhou Chen, Anxiang Zeng, Guangda Huzhang, Qingtao Yu, Kerui Zhang, Cao Yuanpeng, Kangle Wu, Han Yu, Zhiming Zhou
  • for: Aims to model temporal graphs more accurately via recurrent-neural-network-based temporal neighbor aggregation, better capturing the relationships among nodes.
  • methods: Uses an RNN with node-wise hidden states to integrate information from all historical neighbors for each node, providing complete rather than subsampled neighbor information.
  • results: Achieves a significant +9.6% improvement in averaged precision over existing methods on 2-layer models in a real-world e-commerce dataset, showing that the method better captures node relationships in temporal graphs.
    Abstract Temporal graphs offer more accurate modeling of many real-world scenarios than static graphs. However, neighbor aggregation, a critical building block of graph networks, for temporal graphs, is currently straightforwardly extended from that of static graphs. It can be computationally expensive when involving all historical neighbors during such aggregation. In practice, typically only a subset of the most recent neighbors are involved. However, such subsampling leads to incomplete and biased neighbor information. To address this limitation, we propose a novel framework for temporal neighbor aggregation that uses the recurrent neural network with node-wise hidden states to integrate information from all historical neighbors for each node to acquire the complete neighbor information. We demonstrate the superior theoretical expressiveness of the proposed framework as well as its state-of-the-art performance in real-world applications. Notably, it achieves a significant +9.6% improvement on averaged precision in a real-world Ecommerce dataset over existing methods on 2-layer models.
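A hedged sketch of node-wise recurrent aggregation: each node keeps a GRU hidden state updated on every interaction, so all historical neighbors contribute (a generic memory-module construction, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class NodeMemory(nn.Module):
    def __init__(self, n_nodes, msg_dim, hid_dim):
        super().__init__()
        self.h = torch.zeros(n_nodes, hid_dim)   # per-node hidden states
        self.cell = nn.GRUCell(msg_dim, hid_dim)

    def update(self, node_ids, messages):
        """messages: (B, msg_dim) interaction events touching `node_ids`."""
        new_h = self.cell(messages, self.h[node_ids])
        self.h[node_ids] = new_h.detach()  # memory kept out of autograd
        return new_h                       # differentiable current embeddings
```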

OneNet: Enhancing Time Series Forecasting Models under Concept Drift by Online Ensembling

  • paper_url: http://arxiv.org/abs/2309.12659
  • repo_url: https://github.com/yfzhang114/onenet
  • paper_authors: Yi-Fan Zhang, Qingsong Wen, Xue Wang, Weiqi Chen, Liang Sun, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan
  • for: Proposes an efficient way to update time series forecasting models online, addressing the concept drift problem.
  • methods: Within an online convex programming framework augmented with a reinforcement-learning-based approach, dynamically updates and combines two models, one focusing on dependency across the time dimension and the other on cross-variate dependency, with dynamically adjusted combination weights.
  • results: OneNet reduces online forecasting error by more than 50% compared to the state-of-the-art method.
    Abstract Online updating of time series forecasting models aims to address the concept drifting problem by efficiently updating forecasting models based on streaming data. Many algorithms are designed for online time series forecasting, with some exploiting cross-variable dependency while others assume independence among variables. Given every data assumption has its own pros and cons in online time series modeling, we propose \textbf{On}line \textbf{e}nsembling \textbf{Net}work (OneNet). It dynamically updates and combines two models, with one focusing on modeling the dependency across the time dimension and the other on cross-variate dependency. Our method incorporates a reinforcement learning-based approach into the traditional online convex programming framework, allowing for the linear combination of the two models with dynamically adjusted weights. OneNet addresses the main shortcoming of classical online learning methods that tend to be slow in adapting to the concept drift. Empirical results show that OneNet reduces online forecasting error by more than $\mathbf{50\%}$ compared to the State-Of-The-Art (SOTA) method. The code is available at \url{https://github.com/yfzhang114/OneNet}.
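A hedged sketch of the combination step: two forecasters mixed with weights updated from their recent losses; this exponential-weights rule stands in for the paper's reinforcement-learning-based weighting:

```python
import numpy as np

class OnlineEnsembler:
    """Exponential-weights mixing of two forecasters (illustrative)."""
    def __init__(self, eta=0.1):
        self.w = np.array([0.5, 0.5])  # time-wise and variable-wise models
        self.eta = eta

    def predict(self, pred_time, pred_var):
        return self.w @ np.stack([pred_time, pred_var])

    def update(self, pred_time, pred_var, y_true):
        losses = ((np.stack([pred_time, pred_var]) - y_true) ** 2).mean(axis=1)
        self.w = self.w * np.exp(-self.eta * losses)  # down-weight lossy model
        self.w /= self.w.sum()
```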

Neural Operator Variational Inference based on Regularized Stein Discrepancy for Deep Gaussian Processes

  • paper_url: http://arxiv.org/abs/2309.12658
  • repo_url: None
  • paper_authors: Jian Xu, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng
  • for: Proposes a neural-network-based variational inference method for Bayesian inference in Deep Gaussian Process (DGP) models.
  • methods: Uses a neural generator to obtain a sampler and minimizes the Regularized Stein Discrepancy in L2 space between the generated distribution and the true posterior, solving the resulting minimax problem with Monte Carlo estimation and subsampling stochastic optimization.
  • results: Achieves accurate and fast inference, reaching a classification accuracy of 93.56 on CIFAR10 and outperforming state-of-the-art Gaussian process methods; the bias introduced by the method can be theoretically controlled by scaling the Fisher divergence, ensuring stability and precision across various datasets.
    Abstract Deep Gaussian Process (DGP) models offer a powerful nonparametric approach for Bayesian inference, but exact inference is typically intractable, motivating the use of various approximations. However, existing approaches, such as mean-field Gaussian assumptions, limit the expressiveness and efficacy of DGP models, while stochastic approximation can be computationally expensive. To tackle these challenges, we introduce Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes. NOVI uses a neural generator to obtain a sampler and minimizes the Regularized Stein Discrepancy in L2 space between the generated distribution and true posterior. We solve the minimax problem using Monte Carlo estimation and subsampling stochastic optimization techniques. We demonstrate that the bias introduced by our method can be controlled by multiplying the Fisher divergence with a constant, which leads to robust error control and ensures the stability and precision of the algorithm. Our experiments on datasets ranging from hundreds to tens of thousands demonstrate the effectiveness and the faster convergence rate of the proposed method. We achieve a classification accuracy of 93.56 on the CIFAR10 dataset, outperforming SOTA Gaussian process methods. Furthermore, our method guarantees theoretically controlled prediction error for DGP models and demonstrates remarkable performance on various datasets. We are optimistic that NOVI has the potential to enhance the performance of deep Bayesian nonparametric models and could have significant implications for various practical applications

Sequential Action-Induced Invariant Representation for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.12628
  • repo_url: https://github.com/dmu-xmu/sar
  • paper_authors: Dayang Liang, Qihang Chen, Yunlong Liu
  • for: Improves the learning of task-relevant state representations in visual reinforcement learning, achieving better performance in environments with visual distractions.
  • methods: Building on bisimulation-metric, prediction, contrast, and reconstruction approaches, proposes Sequential Action-induced invariant Representation (SAR), in which the encoder is optimized by an auxiliary learner to preserve only the components that follow the control signals of sequential actions, inducing representations robust to distractions.
  • results: Achieves the best performance over strong baselines on DeepMind Control suite tasks with distractions, and demonstrates effectiveness on real-world CARLA-based autonomous driving with natural distractions. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.
    Abstract How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrast, prediction, and reconstruction have shown the ability for task-relevant information extraction. However, due to the lack of appropriate mechanisms for the extraction of task information in the prediction, contrast, and reconstruction-related approaches and the limitations of bisimulation-related methods in domains with sparse rewards, it is still difficult for these methods to be effectively extended to environments with distractions. To alleviate these problems, in the paper, the action sequences, which contain task-intensive signals, are incorporated into representation learning. Specifically, we propose a Sequential Action--induced invariant Representation (SAR) method, in which the encoder is optimized by an auxiliary learner to only preserve the components that follow the control signals of sequential actions, so the agent can be induced to learn the robust representation against distractions. We conduct extensive experiments on the DeepMind Control suite tasks with distractions while achieving the best performance over strong baselines. We also demonstrate the effectiveness of our method at disregarding task-irrelevant information by deploying SAR to real-world CARLA-based autonomous driving with natural distractions. Finally, we provide the analysis results of generalization drawn from the generalization decay and t-SNE visualization. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.

Data-driven Preference Learning Methods for Multiple Criteria Sorting with Temporal Criteria

  • paper_url: http://arxiv.org/abs/2309.12620
  • repo_url: None
  • paper_authors: Li Yijun, Guo Mengzhuo, Zhang Qingpeng
  • for: Proposes new preference learning methods for multiple criteria sorting problems with temporal criteria.
  • methods: A convex quadratic programming model with fixed time discount factors within a regularization framework; an ensemble learning algorithm that consolidates the outputs of multiple, potentially weaker optimizers and runs efficiently via parallel computation; and a novel monotonic Recurrent Neural Network (mRNN) with learnable time discount factors that captures preference dynamics over time while preserving key properties of multiple criteria sorting, such as criteria monotonicity, preference independence, and the natural ordering of classes.
  • results: On synthetic data and a real case study (classifying valuable users of a mobile gaming app from their historical in-app behavioral sequences), the proposed models show notable performance improvements over baselines spanning machine learning, deep learning, and conventional multiple criteria sorting methods.
    Abstract The advent of predictive methodologies has catalyzed the emergence of data-driven decision support across various domains. However, developing models capable of effectively handling input time series data presents an enduring challenge. This study presents novel preference learning approaches to multiple criteria sorting problems in the presence of temporal criteria. We first formulate a convex quadratic programming model characterized by fixed time discount factors, operating within a regularization framework. Additionally, we propose an ensemble learning algorithm designed to consolidate the outputs of multiple, potentially weaker, optimizers, a process executed efficiently through parallel computation. To enhance scalability and accommodate learnable time discount factors, we introduce a novel monotonic Recurrent Neural Network (mRNN). It is designed to capture the evolving dynamics of preferences over time while upholding critical properties inherent to MCS problems, including criteria monotonicity, preference independence, and the natural ordering of classes. The proposed mRNN can describe the preference dynamics by depicting marginal value functions and personalized time discount factors along with time, effectively amalgamating the interpretability of traditional MCS methods with the predictive potential offered by deep preference learning models. Comprehensive assessments of the proposed models are conducted, encompassing synthetic data scenarios and a real-case study centered on classifying valuable users within a mobile gaming app based on their historical in-app behavioral sequences. Empirical findings underscore the notable performance improvements achieved by the proposed models when compared to a spectrum of baseline methods, spanning machine learning, deep learning, and conventional multiple criteria sorting approaches.
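A hedged sketch of one way to obtain a recurrent scorer that is monotone in every criterion: route inputs through nonnegative (softplus-reparameterized) weights and increasing activations (an illustrative construction, not the paper's exact mRNN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneCell(nn.Module):
    def __init__(self, n_criteria, hid):
        super().__init__()
        self.w_in = nn.Parameter(torch.randn(n_criteria, hid))
        self.w_h = nn.Parameter(torch.randn(hid, hid))
        self.out = nn.Parameter(torch.randn(hid))

    def forward(self, x_seq):  # x_seq: (T, n_criteria)
        h = torch.zeros(self.w_h.shape[0])
        for x in x_seq:  # nonnegative weights + increasing activations
            h = torch.sigmoid(x @ F.softplus(self.w_in)
                              + h @ F.softplus(self.w_h))
        return h @ F.softplus(self.out)  # score monotone in each criterion

score = MonotoneCell(n_criteria=3, hid=8)(torch.rand(5, 3))  # 5 time steps
```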

Zero-Regret Performative Prediction Under Inequality Constraints

  • paper_url: http://arxiv.org/abs/2309.12618
  • repo_url: None
  • paper_authors: Wenjing Yan, Xuanyu Cao
  • for: Studies performative prediction, where predictions influence future data distributions, under inequality constraints.
  • methods: Develops a robust primal-dual framework that needs only approximate gradients up to a certain accuracy while matching the performance order of the stochastic primal-dual algorithm without performativity, and builds on it an adaptive primal-dual algorithm for location families.
  • results: The adaptive primal-dual algorithm attains $\mathcal{O}(\sqrt{T})$ regret and constraint violations using only $\sqrt{T} + 2T$ samples; this is the first optimality analysis for the performative prediction problem under inequality constraints.
    Abstract Performative prediction is a recently proposed framework where predictions guide decision-making and hence influence future data distributions. Such performative phenomena are ubiquitous in various areas, such as transportation, finance, public policy, and recommendation systems. To date, work on performative prediction has only focused on unconstrained scenarios, neglecting the fact that many real-world learning problems are subject to constraints. This paper bridges this gap by studying performative prediction under inequality constraints. Unlike most existing work that provides only performative stable points, we aim to find the optimal solutions. Anticipating performative gradients is a challenging task, due to the agnostic performative effect on data distributions. To address this issue, we first develop a robust primal-dual framework that requires only approximate gradients up to a certain accuracy, yet delivers the same order of performance as the stochastic primal-dual algorithm without performativity. Based on this framework, we then propose an adaptive primal-dual algorithm for location families. Our analysis demonstrates that the proposed adaptive primal-dual algorithm attains $\mathcal{O}(\sqrt{T})$ regret and constraint violations, using only $\sqrt{T} + 2T$ samples, where $T$ is the time horizon. To our best knowledge, this is the first study and analysis on the optimality of the performative prediction problem under inequality constraints. Finally, we validate the effectiveness of our algorithm and theoretical results through numerical simulations.

ARRQP: Anomaly Resilient Real-time QoS Prediction Framework with Graph Convolution

  • paper_url: http://arxiv.org/abs/2310.02269
  • repo_url: None
  • paper_authors: Suraj Kumar, Soumi Chattopadhyay
  • for: Improving the accuracy of Quality of Service (QoS) prediction in modern service-oriented architectures, so that users can make informed decisions from the predicted values.
  • methods: The prediction framework (named ARRQP) uses graph convolution to capture intricate relationships and dependencies between users and services even when data is limited or sparse, and integrates contextual information with collaborative insights for a comprehensive view of user-service interactions.
  • results: Experiments on the WS-DREAM benchmark show accurate and timely QoS prediction that stays robust under various anomalies.
    Abstract In the realm of modern service-oriented architecture, ensuring Quality of Service (QoS) is of paramount importance. The ability to predict QoS values in advance empowers users to make informed decisions. However, achieving accurate QoS predictions in the presence of various issues and anomalies, including outliers, data sparsity, grey-sheep instances, and cold-start scenarios, remains a challenge. Current state-of-the-art methods often fall short when addressing these issues simultaneously, resulting in performance degradation. In this paper, we introduce a real-time QoS prediction framework (called ARRQP) with a specific emphasis on improving resilience to anomalies in the data. ARRQP utilizes the power of graph convolution techniques to capture intricate relationships and dependencies among users and services, even when the data is limited or sparse. ARRQP integrates both contextual information and collaborative insights, enabling a comprehensive understanding of user-service interactions. By utilizing robust loss functions, ARRQP effectively reduces the impact of outliers during the model training. Additionally, we introduce a sparsity-resilient grey-sheep detection method, which is subsequently treated separately for QoS prediction. Furthermore, we address the cold-start problem by emphasizing contextual features over collaborative features. Experimental results on the benchmark WS-DREAM dataset demonstrate the framework's effectiveness in achieving accurate and timely QoS predictions.
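The abstract mentions robust loss functions to limit the influence of outliers during training. One common choice is the Huber loss, sketched below; this is an illustrative stand-in, not necessarily ARRQP's exact loss:

```python
import torch

def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear in the tails, so outlying QoS records
    contribute bounded gradients during training."""
    err = pred - target
    abs_err = err.abs()
    quadratic = 0.5 * err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return torch.where(abs_err <= delta, quadratic, linear).mean()
```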

Multiply Robust Federated Estimation of Targeted Average Treatment Effects

  • paper_url: http://arxiv.org/abs/2309.12600
  • repo_url: None
  • paper_authors: Larry Han, Zhu Shen, Jose Zubizarreta
  • for: Deriving valid causal inferences for a target population using multi-site data.
  • methods: A novel federated approach that adjusts for covariate shift and covariate mismatch between sites through multiply-robust, privacy-preserving nuisance function estimation, and uses transfer learning to estimate ensemble weights for combining information from source sites.
  • results: The learned weights are shown to be efficient and optimal under different scenarios, and the method offers better finite-sample efficiency and robustness than existing approaches.
    Abstract Federated or multi-site studies have distinct advantages over single-site studies, including increased generalizability, the ability to study underrepresented populations, and the opportunity to study rare exposures and outcomes. However, these studies are challenging due to the need to preserve the privacy of each individual's data and the heterogeneity in their covariate distributions. We propose a novel federated approach to derive valid causal inferences for a target population using multi-site data. We adjust for covariate shift and covariate mismatch between sites by developing multiply-robust and privacy-preserving nuisance function estimation. Our methodology incorporates transfer learning to estimate ensemble weights to combine information from source sites. We show that these learned weights are efficient and optimal under different scenarios. We showcase the finite sample advantages of our approach in terms of efficiency and robustness compared to existing approaches.
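In generic form, the federated ensemble step the abstract describes combines site-specific estimates with learned nonnegative weights; the notation here is ours and omits the paper's privacy-preserving details:

$$\hat{\tau}_{\text{target}}=\sum_{k=1}^{K}\hat{\eta}_k\,\hat{\tau}_k,\qquad \hat{\eta}_k\ge 0,\quad \sum_{k=1}^{K}\hat{\eta}_k=1,$$

where $\hat{\tau}_k$ is the treatment-effect estimate transported from source site $k$ and the weights $\hat{\eta}$ are chosen (e.g., by minimizing an estimated discrepancy with the target site) so that more relevant sites receive more weight.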

Learning algorithms for identification of whisky using portable Raman spectroscopy

  • paper_url: http://arxiv.org/abs/2309.13087
  • repo_url: None
  • paper_authors: Kwang Jun Lee, Alexander C. Trowbridge, Graham D. Bruce, George O. Dwapanyin, Kylie R. Dunning, Kishan Dholakia, Erik P. Schartner
  • for: Reliable identification of high-value spirits such as whisky, where brand substitution (i.e., fraudulent products) and quality control are critical to the industry.
  • methods: A range of machine learning algorithms interfaced directly with a portable Raman spectroscopy device to both identify commercial whisky samples and characterize their ethanol/methanol concentrations.
  • results: Machine learning models achieve over 99% brand-identification accuracy across twenty-eight commercial samples; the same samples and algorithms also quantify ethanol concentrations and measure methanol levels in spiked samples, and combining the models with a through-the-bottle method allows spectral analysis and identification without decanting, showing practical potential for detecting counterfeit or adulterated spirits and other high-value liquids.
    Abstract Reliable identification of high-value products such as whisky is an increasingly important area, as issues such as brand substitution (i.e. fraudulent products) and quality control are critical to the industry. We have examined a range of machine learning algorithms and interfaced them directly with a portable Raman spectroscopy device to both identify and characterize the ethanol/methanol concentrations of commercial whisky samples. We demonstrate that machine learning models can achieve over 99% accuracy in brand identification across twenty-eight commercial samples. To demonstrate the flexibility of this approach we utilised the same samples and algorithms to quantify ethanol concentrations, as well as measuring methanol levels in spiked whisky samples. Our machine learning techniques are then combined with a through-the-bottle method to perform spectral analysis and identification without requiring the sample to be decanted from the original container, showing the practical potential of this approach to the detection of counterfeit or adulterated spirits and other high value liquid samples.
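The abstract reports examining "a range of machine learning algorithms" on spectra; a minimal sketch of one plausible baseline pipeline (an RBF-kernel SVM on standardized Raman spectra, with placeholder data) is:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((56, 1024))          # placeholder spectra: (samples, wavenumbers)
y = np.repeat(np.arange(28), 2)     # 28 brands, 2 replicates each (placeholder)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=2).mean())  # chance-level on random data
```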

Sampling-Frequency-Independent Universal Sound Separation

  • paper_url: http://arxiv.org/abs/2309.12581
  • repo_url: None
  • paper_authors: Tomohiko Nakamura, Kohei Yatabe
  • for: A universal sound separation (USS) method that can handle sampling frequencies (SFs) unseen during training, for separating arbitrary sources of different types.
  • methods: An SF-independent (SFI) extension of a computationally efficient USS network, using previously proposed SFI convolutional layers that generate convolution kernels according to the input SF.
  • results: Experiments show that signal resampling can degrade USS performance, whereas the proposed method performs more consistently across SFs.
    Abstract This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). The USS aims at separating arbitrary sources of different types and can be the key technique to realize a source separator that can be universally used as a preprocessor for any downstream tasks. To realize a universal source separator, there are two essential properties: universalities with respect to source types and recording conditions. The former property has been studied in the USS literature, which has greatly increased the number of source types that can be handled by a single neural network. However, the latter property (e.g., SF) has received less attention despite its necessity. Since the SF varies widely depending on the downstream tasks, the universal source separator must handle a wide variety of SFs. In this paper, to encompass the two properties, we propose an SF-independent (SFI) extension of a computationally efficient USS network, SuDoRM-RF. The proposed network uses our previously proposed SFI convolutional layers, which can handle various SFs by generating convolutional kernels in accordance with an input SF. Experiments show that signal resampling can degrade the USS performance and the proposed method works more consistently than signal-resampling-based methods for various SFs.
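The key idea of SFI convolutional layers is to define filters in continuous time and sample them at whatever SF the input arrives with. A sketch under our own assumptions (Gabor-like analog filters; the paper's parameterization differs):

```python
import torch

def sfi_kernels(center_freqs_hz, sf_hz, kernel_len=65, sigma=0.002):
    """Sample a bank of latent analog (continuous-time) filters at the input
    sampling frequency, so one layer serves any SF. Returns (n_filters, L);
    unsqueeze(1) to use as conv1d weights with one input channel."""
    t = (torch.arange(kernel_len) - kernel_len // 2) / sf_hz  # time axis in seconds
    kernels = []
    for f in center_freqs_hz:
        envelope = torch.exp(-0.5 * (t / sigma) ** 2)          # fixed analog bandwidth
        kernels.append(envelope * torch.cos(2 * torch.pi * f * t))
    return torch.stack(kernels)
```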

SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

  • paper_url: http://arxiv.org/abs/2309.12578
  • repo_url: None
  • paper_authors: Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon
  • for: Improving the training efficiency and memory footprint of Transformer models while preserving evaluation quality.
  • methods: A novel layer-wise sparsification scheme that integrates convolution filters with a flood-filling method to efficiently capture the sparse pattern of attention operations at each layer.
  • results: Efficient GPU implementations of the layer-wise sparsified attention achieve up to 3.08x speedup over existing state-of-the-art sparse Transformer models while maintaining better evaluation quality.
    Abstract Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. Efficient implementations of the layer-wise sparsified attention algorithm on GPUs are developed, demonstrating a new SPION that achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.
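A toy version of the flood-filling step on an attention-score map, to make the idea concrete; the connectivity, seeding, and thresholding rules are our simplifications, not SPION's exact procedure:

```python
import numpy as np
from collections import deque

def flood_fill_mask(scores, seed, thresh):
    """Grow a sparsity mask from a high-score seed, keeping 4-connected
    entries whose attention score stays above a threshold."""
    n, m = scores.shape
    mask = np.zeros_like(scores, dtype=bool)
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        if 0 <= i < n and 0 <= j < m and not mask[i, j] and scores[i, j] >= thresh:
            mask[i, j] = True
            queue.extend([(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)])
    return mask  # attention is then computed only where mask is True
```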

Enhancing Network Resilience through Machine Learning-powered Graph Combinatorial Optimization: Applications in Cyber Defense and Information Diffusion

  • paper_url: http://arxiv.org/abs/2310.10667
  • repo_url: None
  • paper_authors: Diksha Goel
  • for: Developing effective approaches for enhancing network resilience in the cyber defense and information diffusion application domains.
  • methods: The problems of discovering bottleneck edges and structural hole spanner nodes are transformed into graph-combinatorial optimization problems, and machine learning-based approaches are designed to discover the bottleneck points vital for network resilience.
  • results: Effective, efficient, and scalable techniques for enhancing network resilience in these specific application domains.
    Abstract With the burgeoning advancements of computing and network communication technologies, network infrastructures and their application environments have become increasingly complex. Due to the increased complexity, networks are more prone to hardware faults and highly susceptible to cyber-attacks. Therefore, for rapidly growing network-centric applications, network resilience is essential to minimize the impact of attacks and to ensure that the network provides an acceptable level of services during attacks, faults or disruptions. In this regard, this thesis focuses on developing effective approaches for enhancing network resilience. Existing approaches for enhancing network resilience emphasize on determining bottleneck nodes and edges in the network and designing proactive responses to safeguard the network against attacks. However, existing solutions generally consider broader application domains and possess limited applicability when applied to specific application areas such as cyber defense and information diffusion, which are highly popular application domains among cyber attackers. This thesis aims to design effective, efficient and scalable techniques for discovering bottleneck nodes and edges in the network to enhance network resilience in cyber defense and information diffusion application domains. We first investigate a cyber defense graph optimization problem, i.e., hardening active directory systems by discovering bottleneck edges in the network. We then study the problem of identifying bottleneck structural hole spanner nodes, which are crucial for information diffusion in the network. We transform both problems into graph-combinatorial optimization problems and design machine learning based approaches for discovering bottleneck points vital for enhancing network resilience.
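For intuition about what a "bottleneck edge" is, the classical (non-learned) proxy is edge betweenness centrality; the thesis instead learns such points with ML, so this snippet is purely illustrative:

```python
import networkx as nx

G = nx.karate_club_graph()                  # stand-in network
eb = nx.edge_betweenness_centrality(G)      # fraction of shortest paths through each edge
bottlenecks = sorted(eb, key=eb.get, reverse=True)[:5]
print(bottlenecks)                          # edges whose removal most disrupts flow
```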

A Simple Illustration of Interleaved Learning using Kalman Filter for Linear Least Squares

  • paper_url: http://arxiv.org/abs/2310.03751
  • repo_url: None
  • paper_authors: Majnu John, Yihren Wu
  • for: Illustrating an interleaved learning mechanism for machine learning algorithms via the Kalman filter for linear least squares.
  • methods: A simple statistical and optimization framework based on the Kalman filter for linear least squares.
  • results: The interleaving mechanism and its effect are demonstrated within this framework.
    Abstract Interleaved learning in machine learning algorithms is a biologically inspired training method with promising results. In this short note, we illustrate the interleaving mechanism via a simple statistical and optimization framework based on Kalman Filter for Linear Least Squares.
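Since the note's framework is the Kalman filter applied to linear least squares (i.e., recursive least squares), a self-contained sketch of one update plus interleaved feeding of two data streams may help; the step ordering and toy data are our own:

```python
import numpy as np

def kalman_ls_update(theta, P, x, y, r=1.0):
    """One Kalman-filter update for the static linear model y = x^T theta + noise,
    i.e., recursive least squares. theta: parameter estimate; P: its covariance."""
    x = x.reshape(-1, 1)
    S = float(x.T @ P @ x) + r                       # innovation variance
    K = (P @ x) / S                                  # Kalman gain
    theta = theta + (K * (y - float(x.T @ theta))).ravel()
    P = P - K @ (x.T @ P)
    return theta, P

# Interleaved training: alternate samples from two synthetic data streams A and B.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
Xa, Xb = rng.normal(size=(50, 3)), rng.normal(size=(50, 3))
ya, yb = Xa @ true_theta, Xb @ true_theta
theta, P = np.zeros(3), np.eye(3) * 10.0
for i in range(50):
    theta, P = kalman_ls_update(theta, P, Xa[i], ya[i])  # stream A
    theta, P = kalman_ls_update(theta, P, Xb[i], yb[i])  # stream B (interleaved)
print(np.round(theta, 2))  # approaches true_theta
```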

eess.SP - 2023-09-22

A Survey of Brain Computer Interface Using Non-Invasive Methods

  • paper_url: http://arxiv.org/abs/2309.13151
  • repo_url: None
  • paper_authors: Ritam Ghosh
  • for: Surveying the advantages and disadvantages of brain-computer interface (BCI) technologies, with a focus on the application scenarios of non-invasive methods.
  • methods: The survey covers electroencephalography (EEG), functional magnetic resonance imaging (fMRI), near-infrared spectroscopy (NIRS), and hybrid systems.
  • results: A summary of the strengths and weaknesses of these non-invasive technologies and their uses across applications such as assistive devices, mental-state monitoring, hands-free input, marketing, education, security, games, and entertainment.
    Abstract Research on Brain-Computer Interface (BCI) began in the 1970s and has increased in volume and diversified significantly since then. Today BCI is widely used for applications like assistive devices for physically challenged users, mental state monitoring, input devices for hands-free applications, marketing, education, security, games and entertainment. This article explores the advantages and disadvantages of invasive and non-invasive BCI technologies and focuses on use cases of several non-invasive technologies, namely electroencephalogram (EEG), functional Magnetic Resonance Imaging (fMRI), Near Infrared Spectroscopy (NIRs) and hybrid systems.

Performance Evaluation for Subarray-based Reconfigurable Intelligent Surface-Aided Wireless Communication Systems

  • paper_url: http://arxiv.org/abs/2309.12977
  • repo_url: None
  • paper_authors: Xinyi Yang, Weicong Chen, Xiao Li, Shi Jin
  • for: Improving wireless communication system performance by studying subarray-based reconfigurable intelligent surface (RIS)-aided systems.
  • methods: Analysis and optimization of a subarray-based scheme in which adjacent RIS elements grouped into a subarray are controlled by one signal and share the same reflection coefficient; an upper bound on the ergodic spectral efficiency (SE) is derived and an optimal phase shift design is proposed.
  • results: The subarray-based scheme reduces SE and power consumption simultaneously, so its effect on energy efficiency (EE) is analyzed; numerical results verify the tightness of the upper bound and the effectiveness of the optimal phase shift design.
    Abstract Reconfigurable intelligent surfaces (RISs) have received extensive concern to improve the performance of wireless communication systems. In this paper, a subarray-based scheme is investigated in terms of its effects on ergodic spectral efficiency (SE) and energy efficiency (EE) in RIS-assisted systems. In this scheme, the adjacent elements divided into a subarray are controlled by one signal and share the same reflection coefficient. An upper bound of ergodic SE is derived and an optimal phase shift design is proposed for the subarray-based RIS. Based on the upper bound and optimal design, we obtain the maximum of the upper bound. In particular, we analytically evaluate the effect of the subarray-based RIS on EE since it reduces SE and power consumption simultaneously. Numerical results verify the tightness of the upper bound, demonstrate the effectiveness of the optimal phase shift design for the subarray-based RIS, and reveal the effects of the subarray-based scheme on SE and EE.
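The subarray constraint can be written compactly: with $N$ elements grouped into $S$ subarrays that each share one reflection coefficient, the RIS reflection matrix takes the block-constant form below (our notation):

$$\boldsymbol{\Theta}=\operatorname{diag}\Big(\underbrace{e^{j\theta_1},\dots,e^{j\theta_1}}_{N/S\ \text{elements}},\;\dots,\;\underbrace{e^{j\theta_S},\dots,e^{j\theta_S}}_{N/S\ \text{elements}}\Big),$$

so only $S$ phase shifts $\{\theta_s\}$ (rather than $N$) need to be optimized and controlled, which is the source of the power savings analyzed in the paper.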

Guaranteed Private Communication with Secret Block Structure

  • paper_url: http://arxiv.org/abs/2309.12949
  • repo_url: None
  • paper_authors: Maxime Ferreira Da Costa, Jianxiu Li, Urbashi Mitra
  • for: A new private communication framework in which privacy is induced by transmitting over channel instances of linear inverse problems that are identifiable to the legitimate receiver but unidentifiable to an eavesdropper, with security resting on secret knowledge shared between transmitter and legitimate receiver.
  • methods: A protocol based on a secret block structure that lets the receiver decode a block-sparse message from underdetermined linear measurements in conditions where classical compressed sensing provably fails; applicability to practical multiple-access wireless systems is discussed.
  • results: Under a specific scaling of the channel dimensions and transmission parameters, an eavesdropper can attempt to recover the secret block structure from the fourth-order moments of the channel output; a statistical lower bound suggests this fourth-order-moment estimation strategy is near optimal, and the study of a spectral clustering algorithm yields scaling laws for the lifespan of the secret key before the communication is compromised.
    Abstract A novel private communication framework is proposed where privacy is induced by transmitting over channel instances of linear inverse problems that are identifiable to the legitimate receiver, but unidentifiable to an eavesdropper. The gap in identifiability is created in the framework by leveraging secret knowledge between the transmitter and the legitimate receiver. Specifically, the case where the legitimate receiver harnesses a secret block structure to decode a transmitted block-sparse message from underdetermined linear measurements in conditions where classical compressed sensing would provably fail is examined. The applicability of the proposed scheme to practical multiple access wireless communication systems is discussed. The protocol's privacy is studied under a single transmission, and under multiple transmissions without refreshing the secret block structure. It is shown that, under a specific scaling of the channel dimensions and transmission parameters, the eavesdropper can attempt to overhear the block structure from the fourth-order moments of the channel output. Computation of a statistical lower bound, suggests that the proposed fourth-order moment secret block estimation strategy is near optimal. The performance of a spectral clustering algorithm is studied to that end, defining scaling laws on the lifespan of the secret key before the communication is compromised. Finally, numerical experiments corroborating the theoretical findings are conducted.
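The eavesdropper's attack analyzed in the paper can be caricatured as follows: estimate a fourth-order moment matrix of the channel output and spectrally cluster coordinates into candidate blocks. This is a rough sketch under our own assumptions (the placeholder Gaussian outputs carry no real block structure):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

Y = np.random.randn(10000, 12)        # observed channel outputs (placeholder)
M4 = (Y ** 2).T @ (Y ** 2) / len(Y)   # estimate of E[y_i^2 y_j^2], nonnegative
# With a genuine block-sparse structure, within-block entries of M4 are
# elevated; clustering on M4 as an affinity matrix recovers candidate blocks.
labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit_predict(M4)
print(labels)
```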

A Proof of Concept for OTFS Resilience in Doubly-Selective Channels by GPU-Enabled Real-Time SDR

  • paper_url: http://arxiv.org/abs/2309.12861
  • repo_url: None
  • paper_authors: Yi Xien Yap, Neil Bhushan, Onur Dizdar, Ata Sattarzadeh, David Redgate, Venkateswara Battula, Stephen Wang
  • for: An experimental study of Orthogonal Time Frequency Space (OTFS) modulation in a practical real-time Software Defined Radio (SDR) setup.
  • methods: A Graphical Processing Unit (GPU)-based signal-processing chain programmed with Sionna and TensorFlow, with Universal Software Radio Peripheral (USRP) devices as the air interface, implementing a low-latency transceiver structure whose performance is investigated under various Doppler values.
  • results: OTFS is shown to be highly robust, outperforming OFDM against the disruptive effects of doubly-selective channels in the real-time experimental setup.
    Abstract Orthogonal time frequency space (OTFS) is a modulation technique which is robust against the disruptive effects of doubly-selective channels. In this paper, we perform an experimental study of OTFS by a real-time software defined radio (SDR) setup. Our SDR consists of a Graphical Processing Unit (GPU) for signal processing programmed using Sionna and TensorFlow, and Universal Software Radio Peripheral (USRP) devices for air interface. We implement a low-latency transceiver structure for OTFS and investigate its performance under various Doppler values. By comparing the performance of OTFS with Orthogonal Frequency Division Multiplexing (OFDM), we demonstrate that OTFS is highly robust against the disruptive effects of doubly-selective channels in a real-time experimental setup.
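For readers unfamiliar with OTFS, a minimal modulator sketch: with rectangular transmit pulses, the ISFFT followed by the Heisenberg transform reduces to an IFFT along the Doppler axis plus serialization. Grid size, BPSK symbols, and normalization are our choices, and sign/normalization conventions vary across papers:

```python
import numpy as np

M, N = 32, 16                                                # delay bins, Doppler bins
rng = np.random.default_rng(0)
x_dd = (rng.integers(0, 2, (M, N)) * 2 - 1).astype(complex)  # (delay, Doppler) BPSK grid
X = np.fft.ifft(x_dd, axis=1)                                # inverse DFT across Doppler
s = X.T.reshape(-1) * np.sqrt(N)                             # time frame: s[n*M + l] = X[l, n]
```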

Multiple Satellites Collaboration for Joint Code-aided CFOs and CPOs Estimation

  • paper_url: http://arxiv.org/abs/2309.12828
  • repo_url: None
  • paper_authors: Pingyue Yue, Yixuan Li, Yue Li, Rui Zhang, Shuai Wang, Jianping An
  • for: Improving the security and reliability of Low Earth Orbit (LEO) satellite networks for the Internet of Remote Things.
  • methods: Cooperative combining of signals received by multiple satellites, together with an iterative code-aided algorithm for jointly estimating Carrier Frequency Offsets (CFOs) and Carrier Phase Offsets (CPOs), which tackles parameter estimation at low SNR for short bursts without training sequences.
  • results: Simulation results show that the proposed algorithm approaches the Bit Error Rate (BER) performance bound within 0.4 dB for four-satellite collaboration.
    Abstract Low Earth Orbit (LEO) satellites are being extensively researched in the development of secure Internet of Remote Things (IoRT). In scenarios with miniaturized terminals, the limited transmission power and long transmission distance often lead to low Signal-to-Noise Ratio (SNR) at the satellite receiver, which degrades communication performance. A solution to address this issue is the utilization of cooperative satellites, which can combine signals received from multiple satellites, thereby significantly improve SNR. However, in order to maximize the combination gain, the signal coherent combining is necessary, which requires the carrier frequency and phase of each receiving signal to be aligned. Under low SNR circumstances, carrier parameter estimation can be a significant challenge, especially for short burst transmission with no training sequence. In order to tackle it, we propose an iterative code-aided estimation algorithm for joint Carrier Frequency Offset (CFO) and Carrier Phase Offset (CPO). The Cram\'er-Rao Lower Bound (CRLB) is suggested as the limit on the parameter estimation performance. Simulation results demonstrate that the proposed algorithm can approach Bit Error Rate (BER) performance bound within 0.4 dB with regards to four-satellite collaboration.
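The payoff of the joint CFO/CPO alignment is the standard coherent-combining gain: once $K$ satellite branches are phase-aligned, amplitudes add while noise powers add, so (in our simplified equal-branch notation)

$$\mathrm{SNR}_{\text{comb}}=\frac{\big(\sum_{k=1}^{K}A_k\big)^{2}}{\sum_{k=1}^{K}\sigma_k^{2}}\;\xrightarrow{\;A_k=A,\ \sigma_k=\sigma\;}\;K\cdot\frac{A^{2}}{\sigma^{2}},$$

i.e., a $K$-fold SNR gain over a single link, which is why residual CFO/CPO errors that break coherence are so costly at low SNR.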

Alteration of skeletal muscle energy metabolism assessed by 31P MRS in clinical routine, part 1: Advanced Quality Control pipeline

  • paper_url: http://arxiv.org/abs/2309.12796
  • repo_url: None
  • paper_authors: Antoine Naëgel, Hélène Ratiney, Jabrane Karkouri, Djahid Kennouche, Nicolas Royer, Jill M Slade, Jérôme Morel, Pierre Croisille, Magalie Viallon
  • for: Providing a data quality control method, built on current recommendations from the literature enriched by clinical experience, to obtain reliable results when processing dynamic 31P-MRS data.
  • methods: A three-part quality control approach combining automated checks of the data, a Quality Control Score (QCS) computed from multiple objective criteria, and guided manual review.
  • results: The QCS rapidly flags subjects with data anomalies and guides objective data editing; it automatically classified 45% of subjects (58 participants with no criterion violation and 21 whose dynamic data were fully rejected), and guided manual review additionally accepted full datasets from 80 participants and recovery-phase data from 16 more. More anomalies occurred in patient data (35% of datasets) than in healthy controls (15%).
    Abstract Background: Implementing a standardized 31P-MRS dynamic acquisition protocol to evaluate skeletal muscle energy metabolism and monitor muscle fatigability, while being compatible with various longitudinal clinical studies on diversified patient cohorts, requires a high level of technicality and expertise. Furthermore, processing data to obtain reliable results also demands a great degree of expertise from the operator. In this two-part article, we present an advanced quality control approach for data acquired using a dynamic 31P-MRS protocol. The aim is to provide decision support to the operator in order to assist in data processing and obtain reliable results based on objective criteria. We present first in part one, an advanced data quality control (QC) approach of a dynamic 31P-MRS protocol. Part two is an impact study demonstrating the added value of the QC approach to explore clinical results derived from two patient populations with significant fatigue: COVID-19 and multiple sclerosis (MS). Experimental: 31P-MRS was performed on a 3T clinical MRI in 175 subjects from clinical and healthy control populations conducted in a University Hospital. An advanced data QC Score (QCS) was developed using multiple objective criteria. The criteria were based on current recommendations from the literature enriched by new proposals based on clinical experience. The QCS was designed to indicate valid and corrupt data and guide necessary objective data editing to extract as much valid physiological data as possible. Dynamic acquisitions using an MR-compatible ergometer ran over a rest (40 s), exercise (2 min), and recovery (6 min) phase. Results: Using the QCS enabled rapid identification of subjects with data anomalies, allowing the user to correct the data series or reject them partially or entirely, as well as identify fully valid datasets. Overall, the use of the QCS resulted in the automatic classification of 45% of the subjects, including 58 participants that had data with no criterion violation and 21 participants with violations that resulted in the rejection of all dynamic data. The remaining datasets were inspected manually with guidance, allowing acceptance of full datasets from an additional 80 participants and recovery phase data from an additional 16 subjects. Overall, more anomalies occurred with patient data (35% of datasets) compared to healthy controls (15% of datasets). Conclusion: This paper describes typical difficulties encountered during the dynamic acquisition of 31P-MRS. Based on these observations, a standardized data quality control pipeline was created and implemented in both healthy and patient populations. The QC scoring ensures a standardized data rejection procedure and rigorous objective analysis of dynamic 31P-MRS data obtained from patients. This methodology adds to the efforts made over the past decade to standardize 31P-MRS practices, with the ultimate goal of making it an empowered tool for clinical research.

Multi-objective Optimization of Space-Air-Ground Integrated Network Slicing Relying on a Pair of Central and Distributed Learning Algorithms

  • paper_url: http://arxiv.org/abs/2309.12783
  • repo_url: None
  • paper_authors: Guorong Zhou, Liqiang Zhao, Gan Zheng, Shenghui Song, Jiankang Zhang, Lajos Hanzo
  • for: Dynamically supporting three typical classes of Radio Access Network (RAN) slices in the space-air-ground integrated network (SAGIN), to improve the availability and efficiency of diverse customized services.
  • methods: A joint central and distributed multi-agent deep deterministic policy gradient (CDMADDPG) algorithm that jointly optimizes the throughput, service delay, and coverage area of the three slice classes.
  • results: Simulations show that the proposed method approaches Pareto-optimal exploitation of multiple RAN slices and outperforms the benchmarks.
    Abstract As an attractive enabling technology for next-generation wireless communications, network slicing supports diverse customized services in the global space-air-ground integrated network (SAGIN) with diverse resource constraints. In this paper, we dynamically consider three typical classes of radio access network (RAN) slices, namely high-throughput slices, low-delay slices and wide-coverage slices, under the same underlying physical SAGIN. The throughput, the service delay and the coverage area of these three classes of RAN slices are jointly optimized in a non-scalar form by considering the distinct channel features and service advantages of the terrestrial, aerial and satellite components of SAGINs. A joint central and distributed multi-agent deep deterministic policy gradient (CDMADDPG) algorithm is proposed for solving the above problem to obtain the Pareto optimal solutions. The algorithm first determines the optimal virtual unmanned aerial vehicle (vUAV) positions and the inter-slice sub-channel and power sharing by relying on a centralized unit. Then it optimizes the intra-slice sub-channel and power allocation, and the virtual base station (vBS)/vUAV/virtual low earth orbit (vLEO) satellite deployment in support of three classes of slices by three separate distributed units. Simulation results verify that the proposed method approaches the Pareto-optimal exploitation of multiple RAN slices, and outperforms the benchmarkers.

Green Holographic MIMO Communications With A Few Transmit Radio Frequency Chains

  • paper_url: http://arxiv.org/abs/2309.12688
  • repo_url: None
  • paper_authors: Shuaishuai Guo, Jia Ye, Kaiqian Qu, Shuping Dang
  • for: Exploring green holographic MIMO communications that keep the number of transmit RF chains small while sustaining high-rate transmission.
  • methods: An effective transmission scheme, non-uniform holographic pattern modulation (NUHPM), which exploits the additional spatial degrees of freedom as modulation resources to approach the capacity limit in the high-SNR regime.
  • results: Analysis shows that green gains can be realized by enlarging the antenna aperture without increasing the number of transmit RF chains; numerical results verify the analysis and the performance gain.
    Abstract Holographic multiple-input multiple-output (MIMO) communications are widely recognized as a promising candidate for the next-generation air interface. With holographic MIMO surface, the number of the spatial degrees-of-freedom (DoFs) considerably increases and also significantly varies as the user moves. To fully employ the large and varying number of spatial DoFs, the number of equipped RF chains has to be larger than or equal to the largest number of spatial DoFs. However, this causes much waste as radio frequency (RF) chains (especially the transmit RF chains) are costly and power-hungry. To avoid the heavy burden, this paper investigates green holographic MIMO communications with a few transmit RF chains under an electromagnetic-based communication model. We not only look at the fundamental capacity limits but also propose an effective transmission, namely non-uniform holographic pattern modulation (NUHPM), to achieve the capacity limit in the high signal-to-noise (SNR) regime. The analytical result sheds light on the green evaluation of MIMO communications, which can be realized by increasing the size of the antenna aperture without increasing the number of transmit RF chains. Numerical results are provided to verify our analysis and to show the great performance gain by employing the additional spatial DoFs as modulation resources.

ViT-MDHGR: Cross-day Reliability and Agility in Dynamic Hand Gesture Prediction via HD-sEMG Signal Decoding

  • paper_url: http://arxiv.org/abs/2309.12602
  • repo_url: None
  • paper_authors: Qin Hu, Golara Ahmadi Azar, Alyson Fletcher, Sundeep Rangan, S. Farokh Atashzar
  • for: Improving the accuracy and reliability of multi-day (cross-day) hand gesture recognition, where conventional approaches degrade because training-day and testing-day data distributions differ.
  • methods: A compact ViT-based network operating on very short HD-sEMG signal windows (only 50 ms, one-sixth of the convention for real-time myoelectric control), which boosts agility and responsiveness.
  • results: The proposed model predicts 11 dynamic gestures for 20 subjects with over 71% average accuracy on a testing day 3-25 days after training, and reaches over 92% accuracy when calibrated on a small portion of testing-day data by retraining less than 10% of the parameters.
    Abstract Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) biosignals have been extensively investigated for myoelectric control of prosthetic devices, neurorobotics, and more recently human-computer interfaces because of their capability for hand gesture recognition/prediction in a wearable and non-invasive manner. High intraday (same-day) performance has been reported. However, the interday performance (separating training and testing days) is substantially degraded due to the poor generalizability of conventional approaches over time, hindering the application of such techniques in real-life practices. There are limited recent studies on the feasibility of multi-day hand gesture recognition. The existing studies face a major challenge: the need for long sEMG epochs makes the corresponding neural interfaces impractical due to the induced delay in myoelectric control. This paper proposes a compact ViT-based network for multi-day dynamic hand gesture prediction. We tackle the main challenge as the proposed model only relies on very short HD-sEMG signal windows (i.e., 50 ms, accounting for only one-sixth of the convention for real-time myoelectric implementation), boosting agility and responsiveness. Our proposed model can predict 11 dynamic gestures for 20 subjects with an average accuracy of over 71% on the testing day, 3-25 days after training. Moreover, when calibrated on just a small portion of data from the testing day, the proposed model can achieve over 92% accuracy by retraining less than 10% of the parameters for computational efficiency.

Movable Antenna-Empowered AirComp

  • paper_url: http://arxiv.org/abs/2309.12596
  • repo_url: None
  • paper_authors: Zhenqiao Cheng, Nanxi Li, Jianchi Zhu, Xiaoming She, Chongjun Ouyang, Peng Chen
  • for: Enhancing over-the-air computation accuracy.
  • methods: Joint optimization of transmit power control, antenna positioning, and receive combining, tackled via alternating optimization.
  • results: An effective method for minimizing the computation mean-squared error; numerical results show a clear advantage over benchmark systems employing conventional fixed-position antennas.
    Abstract A novel over-the-air computation (AirComp) framework, empowered by the incorporation of movable antennas (MAs), is proposed to significantly enhance computation accuracy. Within this framework, the joint optimization of transmit power control, antenna positioning, and receive combining is investigated. An efficient method is proposed to tackle the problem of computation mean-squared error (MSE) minimization, capitalizing on the approach of alternating optimization. Numerical results are provided to substantiate the superior MSE performance of the proposed framework, which establish its clear advantage over benchmark systems employing conventional fixed-position antennas (FPAs).
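In generic AirComp notation (ours, not the paper's), the objective being minimized over transmit powers $p_k$, movable-antenna positions $\mathbf{t}$, and receive combiner $\mathbf{w}$ is the computation MSE:

$$\mathrm{MSE}=\mathbb{E}\Big[\big|\hat{s}-\sum_{k=1}^{K}s_k\big|^{2}\Big],\qquad \hat{s}=\frac{1}{\eta}\,\mathbf{w}^{\mathsf H}\Big(\sum_{k=1}^{K}\mathbf{h}_k(\mathbf{t})\,p_k s_k+\mathbf{n}\Big),$$

where the channels $\mathbf{h}_k(\mathbf{t})$ depend on the antenna positions; this dependence couples the three sets of variables and motivates the alternating optimization.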

Passive Reflection Codebook Design for IRS-Integrated Access Point

  • paper_url: http://arxiv.org/abs/2309.12563
  • repo_url: None
  • paper_authors: Yuwei Huang, Lipeng Zhu, Rui Zhang
  • for: Extending the wireless signal coverage of the access point (AP) and improving communication performance cost-effectively using intelligent reflecting surface (IRS) technology with the AP antenna array.
  • methods: A novel codebook-based IRS reflection design for the IRS-integrated AP, which deploys the IRSs and the AP antenna array within the same antenna radome to reduce the path loss of the cascaded user-IRS-AP channels; a small codebook is designed offline via an azimuth-based sector division strategy, with each codeword optimized for the sector-min-average-effective-channel-power (SMAECP) using alternating optimization and semidefinite relaxation.
  • results: The design enhances the average channel power over the whole coverage area compared to a system without IRS, and achieves significant gains over benchmark schemes in both single-user and multi-user transmissions.
    Abstract Intelligent reflecting surface (IRS) has emerged as a promising technique to extend the wireless signal coverage of access point (AP) and improve the communication performance cost-effectively. In order to reduce the path-loss of the cascaded user-IRS-AP channels, the IRS-integrated AP architecture has been proposed to deploy the IRSs and the antenna array of the AP within the same antenna radome. To reduce the pilot overhead for estimating all IRS-involved channels, in this paper, we propose a novel codebook-based IRS reflection design for the IRS-integrated AP to enhance the coverage performance in a given area. In particular, the codebook consisting of a small number of codewords is designed offline by employing an efficient sector division strategy based on the azimuth angle. To ensure the performance of each sector, we optimize its corresponding codeword for IRS reflection pattern to maximize the sector-min-average-effective-channel-power (SMAECP) by applying the alternating optimization (AO) and semidefinite relaxation (SDR) methods. With the designed codebook, the AP performs the IRS reflection training by sequentially applying all codewords and selects the one achieving the best communication performance for data transmission. Numerical results show that our proposed codebook design can enhance the average channel power of the whole coverage area, as compared to the system without IRS. Moreover, our proposed codebook-based IRS reflection design is shown to achieve significant performance gain over other benchmark schemes in both single-user and multi-user transmissions.

cs.SD - 2023-09-21

Profile-Error-Tolerant Target-Speaker Voice Activity Detection

  • paper_url: http://arxiv.org/abs/2309.12521
  • repo_url: None
  • paper_authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu
  • for: Making target-speaker voice activity detection (TS-VAD) robust and reliable against errors in the speaker profiles it relies on.
  • methods: A transformer-based TS-VAD that can handle a variable number of speakers, extended with a set of additional pseudo-speaker profiles to cover speakers undetected during first-pass diarization; during training, speaker profiles estimated by multiple different clustering algorithms reduce the mismatch between training and testing profile conditions.
  • results: Experiments show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets, tolerating speaker profile errors better.
    Abstract Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.

Variational Quantum Harmonizer: Generating Chord Progressions and Other Sonification Methods with the VQE Algorithm

  • paper_url: http://arxiv.org/abs/2309.12254
  • repo_url: None
  • paper_authors: Paulo Vitor Itaboraí, Tim Schwägerl, María Aguado Yáñez, Arianna Crippa, Karl Jansen, Eduardo Reck Miranda, Peter Thomas
  • for: A case study of physically based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems optimized with the Variational Quantum Eigensolver (VQE) algorithm.
  • methods: The intermediate statevectors produced in each iteration of the VQE loop between the quantum computer and a classical optimizer are used as the material for sonifying the optimization process itself.
  • results: A musical interface prototype, the Variational Quantum Harmonizer (VQH), offering design strategies for chords, chord progressions, and arpeggios; it can enhance data visualization or create artistic pieces, helps artists build intuition for designing QUBO cost functions, and supplies a broad portfolio of sounds for QUBO and quantum-inspired compositions, as demonstrated in the case study composition "Dependent Origination" by Peter Thomas and Paulo Itaborai.
    Abstract This work investigates a case study of using physical-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems, optimized by the Variational Quantum Eigensolver (VQE) algorithm. The VQE approximates the solution of the problem by using an iterative loop between the quantum computer and a classical optimization routine. This work explores the intermediary statevectors found in each VQE iteration as the means of sonifying the optimization process itself. The implementation was realised in the form of a musical interface prototype named Variational Quantum Harmonizer (VQH), providing potential design strategies for musical applications, focusing on chords, chord progressions, and arpeggios. The VQH can be used both to enhance data visualization or to create artistic pieces. The methodology is also relevant in terms of how an artist would gain intuition towards achieving a desired musical sound by carefully designing QUBO cost functions. Flexible mapping strategies could supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions, as demonstrated in a case study composition, "Dependent Origination" by Peter Thomas and Paulo Itaborai.
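A toy version of the sonification mapping: read the probabilities of a (mock) VQE statevector and switch on the notes of the basis states that exceed a threshold. The scale, threshold, and three-qubit example are our choices, not the VQH's actual mapping:

```python
import numpy as np

NOTES = ["C4", "D4", "E4", "F4", "G4", "A4", "B4", "C5"]  # one note per 3-qubit basis state

def statevector_to_chord(state, thresh=0.1):
    probs = np.abs(state) ** 2
    return [NOTES[i] for i, p in enumerate(probs) if p >= thresh]

state = np.array([0.7, 0, 0.5, 0, 0, 0.5, 0, 0.1], dtype=complex)
state /= np.linalg.norm(state)        # mock intermediate VQE statevector
print(statevector_to_chord(state))    # e.g. ['C4', 'E4', 'A4'] for this iteration
```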

A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.12121
  • repo_url: None
  • paper_authors: Bengt J. Borgstrom, Michael S. Brandstein
  • for: Improving single-channel speech enhancement performance.
  • methods: A multiscale autoencoder (MSAE) that performs spectral decomposition of the input waveform in separate band-limited branches, each operating at a different rate and scale, to extract a sequence of multiscale embeddings.
  • results: Clear performance benefits over conventional single-branch autoencoders, and better results than a variety of state-of-the-art enhancement systems on both objective speech quality metrics and automatic speech recognition accuracy.
    Abstract Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
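A sketch of the multiscale branch idea: parallel band-limited encoders operating at different rates and scales over the same waveform. Branch count, kernel sizes, and strides here are illustrative, not the MSAE's published configuration:

```python
import torch
import torch.nn as nn

class MultiscaleEncoder(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Short to long analysis windows, each with a matching decimation rate.
        self.branches = nn.ModuleList(
            nn.Conv1d(1, channels, kernel_size=k, stride=k // 2, padding=k // 2)
            for k in (16, 64, 256)
        )

    def forward(self, wav):  # wav: (batch, 1, time)
        return [torch.relu(branch(wav)) for branch in self.branches]  # multiscale embeddings
```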

Is the Ideal Ratio Mask Really the Best? – Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers

  • paper_url: http://arxiv.org/abs/2309.12065
  • repo_url: None
  • paper_authors: Atsuo Hiroe, Katsutoshi Itoyama, Kazuhiro Nakadai
  • for: Investigating mask-based beamformers (BFs), which estimate filters for extracting target speech from time-frequency masks. Three questions have not been comprehensively explored: 1) which BF provides the best extraction performance in terms of closeness of the output to the target speech, 2) whether the optimal mask for peak performance is common to all BFs, and 3) whether the ideal ratio mask (IRM) coincides with the optimal mask.
  • methods: Four mask-based BFs are examined: the maximum signal-to-noise ratio BF, two variants of it, and the multichannel Wiener filter (MWF) BF; the optimal mask for each BF is obtained by minimizing the mean square error between the BF output and the target speech for each utterance.
  • results: Experiments on the CHiME-3 dataset show that all four BFs reach the same peak performance as the upper bound provided by the ideal MWF BF, while the optimal mask depends on the adopted BF and differs from the IRM. This contradicts the conventional idea that the optimal mask is common to all BFs, and contributes to the design of mask-based BFs.
    Abstract This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with the CHiME-3 dataset, we verify that the four BFs have the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations differ from the conventional idea that the optimal mask is common for all BFs and that peak performance differs for each BF. Hence, this study contributes to the design of mask-based BFs.
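For concreteness, a sketch of how a time-frequency mask turns into a multichannel Wiener filter: mask-weighted spatial covariance estimates followed by the MWF solution. The reference channel, diagonal loading, and per-bin loop are our simplifications:

```python
import numpy as np

def mwf_from_mask(Y, mask, eps=1e-6):
    """Y: (freq, time, mics) multichannel STFT; mask: (freq, time) in [0, 1].
    Returns the enhanced single-channel STFT (freq, time)."""
    F, T, M = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                                          # rows are y_t^T
        ws, wn = mask[f][:, None], (1.0 - mask[f])[:, None]
        Phi_s = Yf.T @ (ws * Yf.conj()) / T                # target spatial covariance
        Phi_n = Yf.T @ (wn * Yf.conj()) / T + eps * np.eye(M)
        A = Phi_s @ np.linalg.inv(Phi_s + Phi_n)           # MWF matrix
        out[f] = (Yf @ A.T)[:, 0]                          # reference mic 0
    return out
```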

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

  • paper_url: http://arxiv.org/abs/2309.11977
  • repo_url: None
  • paper_authors: Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng
  • for: A language model-based zero-shot text-to-speech (TTS) system that clones any unseen speaker's voice, including personal speaking style, without adaptation parameters.
  • methods: Building on the neural codec language model VALL-E, a speaker-aware text encoder learns the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences, and a VALL-E based acoustic decoder models timbre at the frame level from a timbre prompt to generate speech.
  • results: The model outperforms baselines in naturalness and speaker similarity, and achieves better performance by scaling out to a longer style prompt.
    Abstract Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model

  • paper_url: http://arxiv.org/abs/2309.11976
  • repo_url: None
  • paper_authors: Jozef Coldenhoff, Andrew Harper, Paul Kendrick, Tijana Stojkovic, Milos Cernak
  • for: To predict room acoustic parameters and speech quality metrics.
  • methods: A multi-channel model jointly predicts MOS and room acoustic parameters for multiple recording devices in parallel.
  • results: Improves prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model while requiring roughly 5x less computation, with only minimal losses on the remaining metrics.
    Abstract Previous methods for predicting room acoustic parameters and speech quality metrics have focused on the single-channel case, where room acoustics and Mean Opinion Score (MOS) are predicted for a single recording device. However, quality-based device selection for rooms with multiple recording devices may benefit from a multi-channel approach where the descriptive metrics are predicted for multiple devices in parallel. Following our hypothesis that a model may benefit from multi-channel training, we develop a multi-channel model for joint MOS and room acoustics prediction (MOSRA) for five channels in parallel. The lack of multi-channel audio data with ground truth labels necessitated the creation of simulated data using an acoustic simulator with room acoustic labels extracted from the generated impulse responses and labels for MOS generated in a student-teacher setup using a wav2vec2-based MOS prediction model. Our experiments show that the multi-channel model improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model with roughly 5$\times$ less computation while suffering minimal losses in the performance of the other metrics.

Cluster-based pruning techniques for audio data

  • paper_url: http://arxiv.org/abs/2309.11922
  • repo_url: https://github.com/boris-bergsma/audio_pruning
  • paper_authors: Boris Bergsma, Marta Brzezinska, Oleg V. Yazyev, Milos Cernak
  • for: To improve the efficiency of deep learning across domains by reducing dataset size without sacrificing performance.
  • methods: k-means clustering groups similar samples together, enabling effective data selection that shrinks the dataset while preserving its representative characteristics.
  • results: Clustering analysis on a keyword spotting (KWS) dataset shows that k-means clustering can reduce the size of audio datasets while maintaining classification performance across NNs with different architectures.
    Abstract Deep learning models have become widely adopted in various domains, but their performance heavily relies on a vast amount of data. Datasets often contain a large number of irrelevant or redundant samples, which can lead to computational inefficiencies during the training. In this work, we introduce, for the first time in the context of the audio domain, the k-means clustering as a method for efficient data pruning. K-means clustering provides a way to group similar samples together, allowing the reduction of the size of the dataset while preserving its representative characteristics. As an example, we perform clustering analysis on the keyword spotting (KWS) dataset. We discuss how k-means clustering can significantly reduce the size of audio datasets while maintaining the classification performance across neural networks (NNs) with different architectures. We further comment on the role of scaling analysis in identifying the optimal pruning strategies for a large number of samples. Our studies serve as a proof-of-principle, demonstrating the potential of data selection with distance-based clustering algorithms for the audio domain and highlighting promising research avenues.
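
As a rough illustration of distance-based data selection, the sketch below prunes a feature matrix with scikit-learn's k-means, keeping a fixed fraction of each cluster. The nearest-to-centroid selection rule and all parameter values are assumptions for illustration; the abstract does not pin down the exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_prune(features, n_clusters=100, keep_fraction=0.5, seed=0):
    """Return indices of a pruned subset, chosen cluster by cluster.

    features : (N, D) array of per-sample embeddings (e.g., audio features)
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(km.labels_ == c)
        if idx.size == 0:
            continue
        # Assumed rule: keep the samples closest to the cluster centroid
        dist = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        n_keep = max(1, int(round(idx.size * keep_fraction)))
        keep.extend(idx[np.argsort(dist)[:n_keep]].tolist())
    return np.sort(np.array(keep))
```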

The Impact of Silence on Speech Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2309.11827
  • repo_url: None
  • paper_authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Hua Hua, Wenchao Wang, Pengyuan Zhang
  • for: To analyze the impact of silence on speech anti-spoofing countermeasures (CMs).
  • methods: Voice Activity Detection (VAD) and class activation mapping (CAM) are used to analyze how silence affects anti-spoofing CMs.
  • results: Removing silence from spoofed speech can degrade CM performance; both the proportion and the content of silence matter, and masking silence or non-silence can improve CM robustness.
    Abstract The current speech anti-spoofing countermeasures (CMs) show excellent performance on specific datasets. However, removing the silence of test speech through Voice Activity Detection (VAD) can severely degrade performance. In this paper, the impact of silence on speech anti-spoofing is analyzed. First, the reasons for the impact are explored, including the proportion of silence duration and the content of silence. The proportion of silence duration in spoof speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech. And the content of silence generated by different waveform generators varies compared to bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, the spoof speech generated by neural network based end-to-end TTS algorithms suffers a significant rise in error rates when the silence is removed. To demonstrate the reasons for the impact of silence on CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, the implementation and analysis of the experiments masking silence or non-silence demonstrates the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, improving the robustness of CMs against unknown spoofing attacks by masking silence is also proposed. Finally, the attacks on anti-spoofing CMs through concatenating silence, and the mitigation of VAD and silence attack through low-pass filtering are introduced.
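
The VAD step behind these observations can be approximated with a simple frame-energy gate. The sketch below is a hedged stand-in for a real detector (the paper's experiments would use a proper VAD); the frame length and threshold are illustrative.

```python
import numpy as np

def trim_silence(wav, sr, frame_ms=25, thresh_db=-40.0):
    """Drop non-overlapping frames whose energy is far below the loudest frame."""
    frame = int(sr * frame_ms / 1000)
    n = len(wav) // frame
    frames = wav[: n * frame].reshape(n, frame)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    keep = energy_db > energy_db.max() + thresh_db   # within 40 dB of the peak
    return frames[keep].reshape(-1)
```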

Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection

  • paper_url: http://arxiv.org/abs/2309.11783
  • repo_url: None
  • paper_authors: Rui Tao, Yuxing Huang, Xiangdong Wang, Long Yan, Lufeng Zhai, Kazushige Ouchi, Taihao Li
  • for: bridging the gap between fully supervised methods and unsupervised techniques in various domains, specifically for detecting sound events with limited labeled data.
  • methods: introducing a Frame Pairwise Distance (FPD) loss branch, along with a minimal amount of synthesized data and corresponding sampling and label processing strategies.
  • results: validated on the standard DCASE dataset, the proposed approach showed efficacy and improved the recognition rate of weakly-supervised sound event detection.
    Abstract Weakly-supervised learning has emerged as a promising approach to leverage limited labeled data in various domains by bridging the gap between fully supervised methods and unsupervised techniques. Acquisition of strong annotations for detecting sound events is prohibitively expensive, making weakly supervised learning a more cost-effective and broadly applicable alternative. In order to enhance the recognition rate of the learning of detection of weakly-supervised sound events, we introduce a Frame Pairwise Distance (FPD) loss branch, complemented with a minimal amount of synthesized data. The corresponding sampling and label processing strategies are also proposed. Two distinct distance metrics are employed to evaluate the proposed approach. Finally, the method is validated on the standard DCASE dataset. The obtained experimental results corroborated the efficacy of this approach.
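
The abstract does not spell out the exact form of the FPD loss, but a standard pairwise contrastive formulation over frame embeddings conveys the idea: frames sharing an event label are pulled together, other pairs are pushed apart up to a margin. The sketch below is an assumed formulation, not the paper's equation.

```python
import torch

def frame_pairwise_distance_loss(emb, labels, margin=1.0):
    """emb: (T, D) frame embeddings; labels: (T,) integer frame-level event labels."""
    d = torch.cdist(emb, emb)                                   # (T, T) Euclidean distances
    same = (labels[:, None] == labels[None, :]).float()
    eye = torch.eye(len(emb), device=emb.device)
    pos = (same - eye).clamp(min=0) * d.pow(2)                  # pull same-label pairs together
    neg = (1.0 - same) * torch.clamp(margin - d, min=0).pow(2)  # push different-label pairs apart
    return (pos + neg).sum() / max(len(emb) * (len(emb) - 1), 1)
```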

CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning

  • paper_url: http://arxiv.org/abs/2309.11768
  • repo_url: https://github.com/louislau1129/comflp
  • paper_authors: Wei Liu, Zhiyuan Peng, Tan Lee
  • for: To make Transformer-based speech recognition (ASR) models practical on resource-constrained devices.
  • methods: Layer pruning (LP) removes redundant layers, using a correlation measure between layers to quantify their redundancy.
  • results: CoMFLP selects better pruning proposals than existing LP methods while requiring only constant time complexity. Code is available at https://github.com/louislau1129/CoMFLP.
    Abstract Transformer-based speech recognition (ASR) model with deep layers exhibited significant performance improvement. However, the model is inefficient for deployment on resource-constrained devices. Layer pruning (LP) is a commonly used compression method to remove redundant layers. Previous studies on LP usually identify the redundant layers according to a task-specific evaluation metric. They are time-consuming for models with a large number of layers, even in a greedy search manner. To address this problem, we propose CoMFLP, a fast search LP algorithm based on correlation measure. The correlation between layers is computed to generate a correlation matrix, which identifies the redundancy among layers. The search process is carried out in two steps: (1) coarse search: to determine top $K$ candidates by pruning the most redundant layers based on the correlation matrix; (2) fine search: to select the best pruning proposal among $K$ candidates using a task-specific evaluation metric. Experiments on an ASR task show that the pruning proposal determined by CoMFLP outperforms existing LP methods while only requiring constant time complexity. The code is publicly available at https://github.com/louislau1129/CoMFLP.
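
The coarse step can be pictured as: run probe data through the network once, measure how strongly each layer's output correlates with its input, and flag the layers that change their input least. The sketch below assumes per-layer activations of equal width (true for stacked Transformer blocks) and uses plain Pearson correlation as the redundancy measure; the paper's exact correlation measure may differ.

```python
import numpy as np

def coarse_prune_candidates(layer_acts, num_prune):
    """layer_acts: list of (N, D) activation arrays; layer_acts[i] feeds layer i + 1.

    Returns indices of the layers whose output is most correlated with
    their input, i.e. the most redundant candidates for pruning.
    """
    scores = []
    for i in range(1, len(layer_acts)):
        a = layer_acts[i - 1].reshape(len(layer_acts[i - 1]), -1)
        b = layer_acts[i].reshape(len(layer_acts[i]), -1)
        a = (a - a.mean(1, keepdims=True)) / (a.std(1, keepdims=True) + 1e-8)
        b = (b - b.mean(1, keepdims=True)) / (b.std(1, keepdims=True) + 1e-8)
        scores.append(float((a * b).mean()))      # high score => layer i is redundant
    order = np.argsort(scores)[::-1]              # most redundant first
    return (order[:num_prune] + 1).tolist()
```

The fine step would then re-score the surviving candidates with a task-specific metric such as WER, as the abstract describes.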

Sparsely Shared LoRA on Whisper for Child Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.11756
  • repo_url: None
  • paper_authors: Wei Liu, Ying Qin, Zhiyuan Peng, Tan Lee
  • for: To improve the zero-shot performance of the Whisper automatic speech recognition (ASR) model on low-resource speech.
  • methods: Parameter-efficient fine-tuning (PEFT) methods such as LoRA are examined, and a novel Sparsely Shared LoRA (S2-LoRA) approach is proposed.
  • results: On low-resource Chinese child speech, S2-LoRA matches AdaLoRA's in-domain adaptation with far fewer trainable parameters and generalizes better out of domain; its automatically learned rank distribution shows patterns similar to AdaLoRA's allocation.
    Abstract Whisper is a powerful automatic speech recognition (ASR) model. Nevertheless, its zero-shot performance on low-resource speech requires further improvement. Child speech, as a representative type of low-resource speech, is leveraged for adaptation. Recently, parameter-efficient fine-tuning (PEFT) in NLP was shown to be comparable and even better than full fine-tuning, while only needing to tune a small set of trainable parameters. However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter composition types of PEFT approaches such as LoRA and Bitfit are investigated as they do not bring extra inference costs. Different popular PEFT methods are examined. Particularly, we compare LoRA and AdaLoRA and figure out the learnable rank coefficient is a good design. Inspired by the sparse rank distribution allocated by AdaLoRA, a novel PEFT approach Sparsely Shared LoRA (S2-LoRA) is proposed. The two low-rank decomposed matrices are globally shared. Each weight matrix only has to maintain its specific rank coefficients that are constrained to be sparse. Experiments on low-resource Chinese child speech show that with much fewer trainable parameters, S2-LoRA can achieve comparable in-domain adaptation performance to AdaLoRA and exhibit better generalization ability on out-of-domain data. In addition, the rank distribution automatically learned by S2-LoRA is found to have similar patterns to AdaLoRA's allocation.
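
The description (two globally shared low-rank matrices plus per-layer sparse rank coefficients) maps naturally onto a small wrapper module. The sketch below is one assumed reading of that design, with an L1 penalty standing in for the sparsity constraint; the names and the exact constraint are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S2LoRALinear(nn.Module):
    """Frozen linear layer plus a globally shared low-rank update.

    shared_A (r, d_in) and shared_B (d_out, r) are shared across all wrapped
    layers; only the rank coefficients `coef` are layer-specific.
    """

    def __init__(self, base: nn.Linear, shared_A: nn.Parameter, shared_B: nn.Parameter):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A, self.B = shared_A, shared_B
        self.coef = nn.Parameter(torch.zeros(shared_A.shape[0]))

    def forward(self, x):
        # Delta-W = B @ diag(coef) @ A, applied without materializing it
        h = F.linear(x, self.A) * self.coef
        return self.base(x) + F.linear(h, self.B)

    def sparsity_penalty(self):
        return self.coef.abs().sum()   # assumed L1 surrogate for sparse ranks
```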

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

  • paper_url: http://arxiv.org/abs/2309.11730
  • repo_url: https://github.com/wenet-e2e/wespeaker
  • paper_authors: Shuai Wang, Qibing Bai, Qi Liu, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li
  • for: To improve speaker recognition systems, both by transferring large pretrained models (e.g., WavLM) to the downstream task and by applying self-supervised methods (e.g., DINO) directly to speaker embedding learning.
  • methods: DINO self-supervised pretraining on the large-scale WenetSpeech dataset, combined with a proposed confidence-based data filtering algorithm that removes unreliable data from the pretraining set.
  • results: DINO pretraining with confidence-based filtering improves supervised performance on CNCeleb while using less training data, and the learned representations transfer well across datasets.
    Abstract Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, pretraining and finetuning scripts will be made available in the Wespeaker toolkit.

cs.CV - 2023-09-21

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

  • paper_url: http://arxiv.org/abs/2309.12530
  • repo_url: https://github.com/oodbag/rise
  • paper_authors: Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee
  • for: To distill a large vision-language model (a CLIP teacher) into a smaller model that generalizes to unseen domains.
  • methods: A new method, RISE (Regularized Invariance with Semantic Embeddings), regularizes the student's learned image representations using the CLIP teacher's text representations.
  • results: RISE achieves better domain generalization than previous methods on multiple benchmark datasets.
    Abstract Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
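
The two loss designs can be sketched directly: an absolute version that pulls each student image embedding toward the teacher's text embedding for the corresponding description, and a relative version that preserves the pairwise-distance structure. Both sketches below are assumed formulations consistent with the abstract's wording, not the paper's exact equations.

```python
import torch
import torch.nn.functional as F

def rise_absolute(img_emb, txt_emb):
    """Cosine distance between student image and teacher text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return (1.0 - (img * txt).sum(-1)).mean()

def rise_relative(img_emb, txt_emb):
    """Match the pairwise-distance structure of the two embedding spaces."""
    d_img = torch.cdist(img_emb, img_emb)
    d_txt = torch.cdist(txt_emb, txt_emb)
    return F.mse_loss(d_img / (d_img.mean() + 1e-8),
                      d_txt / (d_txt.mean() + 1e-8))
```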

License Plate Super-Resolution Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.12506
  • repo_url: None
  • paper_authors: Sawsan AlHalawani, Bilel Benjdira, Adel Ammar, Anis Koubaa, Anas M. Ali
  • for: To improve the accuracy of license plate recognition, with practical application to surveillance systems.
  • methods: A cutting-edge diffusion model is trained on a curated dataset of Saudi license plates, in both low and high resolutions, to restore license plate images.
  • results: The diffusion model outperforms SwinIR and ESRGAN, improving PSNR by 12.55% and 37.32% and SSIM by 4.89% and 17.66% respectively; 92% of human evaluators preferred its images.
    Abstract In surveillance, accurately recognizing license plates is hindered by their often low quality and small dimensions, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model using a curated dataset of Saudi license plates, both in low and high resolutions, we discovered the diffusion model's superior efficacy. The method achieves a 12.55\% and 37.32% improvement in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of Structural Similarity Index (SSIM), registering a 4.89% and 17.66% improvement over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
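
The reported gains are in PSNR and SSIM, which are straightforward to reproduce for any candidate super-resolution output. The snippet below illustrates the evaluation protocol with NumPy and scikit-image; it says nothing about the paper's diffusion model itself.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref, est, max_val=255.0):
    """Peak signal-to-noise ratio between two images of the same shape."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

# Usage on two uint8 license-plate crops of identical size:
# score_psnr = psnr(hr_image, sr_image)
# score_ssim = structural_similarity(hr_image, sr_image, channel_axis=-1)
```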

Impact of architecture on robustness and interpretability of multispectral deep neural networks

  • paper_url: http://arxiv.org/abs/2309.12463
  • repo_url: https://github.com/hendrycks/robustness
  • paper_authors: Charles Godfrey, Elise Bishoff, Myles McKay, Eleanor Byler
  • for: To study how the fusion strategy affects the performance of multispectral deep learning models on vision tasks.
  • methods: Models with different fusion approaches are compared: early fusion stacks additional spectral bands with the RGB channels to form a many-channel input image, while late fusion passes RGB and non-RGB bands through separate branches of the model that are merged just before the final classification or segmentation layer.
  • results: The models' performance is evaluated along with their robustness to naturalistic image corruptions affecting one or more input channels; early and late fusion behave quite differently, and how the input bands are fused changes their influence on model performance.
    Abstract Including information from additional spectral bands (e.g., near-infrared) can improve deep learning model performance for many vision-oriented tasks. There are many possible ways to incorporate this additional information into a deep learning model, but the optimal fusion strategy has not yet been determined and can vary between applications. At one extreme, known as "early fusion," additional bands are stacked as extra channels to obtain an input image with more than three channels. At the other extreme, known as "late fusion," RGB and non-RGB bands are passed through separate branches of a deep learning model and merged immediately before a final classification or segmentation layer. In this work, we characterize the performance of a suite of multispectral deep learning models with different fusion approaches, quantify their relative reliance on different input bands and evaluate their robustness to naturalistic image corruptions affecting one or more input channels.
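
The two fusion extremes are simple to write down. The PyTorch sketch below contrasts an early-fusion model, which stacks the extra bands as additional input channels, with a late-fusion model, which merges two per-modality branches just before the classifier; the tiny backbones and feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Stack RGB and k extra bands into one (3 + k)-channel input."""
    def __init__(self, extra_bands: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + extra_bands, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, rgb, extra):
        return self.net(torch.cat([rgb, extra], dim=1))

class LateFusion(nn.Module):
    """Separate branch per modality, merged just before the final layer."""
    def __init__(self, extra_bands: int, num_classes: int):
        super().__init__()
        def branch(channels):
            return nn.Sequential(nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb_branch, self.extra_branch = branch(3), branch(extra_bands)
        self.head = nn.Linear(128, num_classes)

    def forward(self, rgb, extra):
        feats = torch.cat([self.rgb_branch(rgb), self.extra_branch(extra)], dim=1)
        return self.head(feats)
```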

DIOR: Dataset for Indoor-Outdoor Reidentification – Long Range 3D/2D Skeleton Gait Collection Pipeline, Semi-Automated Gait Keypoint Labeling and Baseline Evaluation Methods

  • paper_url: http://arxiv.org/abs/2309.12429
  • repo_url: None
  • paper_authors: Yuyang Chen, Praveen Raj Masilamani, Bhavin Jawade, Srirangaraj Setlur, Karthik Dantu
  • for: To provide a data collection framework, a semi-automated annotation method, and a dataset of 14 subjects and 1.649 million RGB frames for person identification and re-identification.
  • methods: Advanced 3D computer vision techniques achieve pixel-level accuracy indoors using a motion capture system; for outdoor long-range settings, a low-cost hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras yields precise skeleton labels even when subjects are just 20-25 pixels tall in the frame.
  • results: Precise skeleton gait labels, including 200,000 frames from a long-range camera.
    Abstract In recent times, there is an increased interest in the identification and re-identification of people at long distances, such as from rooftop cameras, UAV cameras, street cams, and others. Such recognition needs to go beyond face and use whole-body markers such as gait. However, datasets to train and test such recognition algorithms are not widely prevalent, and fewer are labeled. This paper introduces DIOR -- a framework for data collection, semi-automated annotation, and also provides a dataset with 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200 thousands frames from a long range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. Additionally, for outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras, successfully achieving precise skeleton labeling on far-away subjects, even when their height is limited to a mere 20-25 pixels within an RGB frame. On publication, we will make our pipeline open for others to use.

Synthetic Image Detection: Highlights from the IEEE Video and Image Processing Cup 2022 Student Competition

  • paper_url: http://arxiv.org/abs/2309.12428
  • repo_url: None
  • paper_authors: Davide Cozzolino, Koki Nagano, Lucas Thomaz, Angshul Majumdar, Luisa Verdoliva
  • for: To develop systems that distinguish pristine images from generated ones, motivated by the rapid progress of AI-based image generation and its impact on the trustworthiness of media content.
  • methods: Synthetic image detection approaches targeting images produced by recent generative tools, including diffusion models.
  • results: The resulting detectors can separate pristine from generated images, with broad application to verifying media content and curbing the spread of disinformation.
    Abstract The Video and Image Processing (VIP) Cup is a student competition that takes place each year at the IEEE International Conference on Image Processing. The 2022 IEEE VIP Cup asked undergraduate students to develop a system capable of distinguishing pristine images from generated ones. The interest in this topic stems from the incredible advances in the AI-based generation of visual data, with tools that allows the synthesis of highly realistic images and videos. While this opens up a large number of new opportunities, it also undermines the trustworthiness of media content and fosters the spread of disinformation on the internet. Recently there was strong concern about the generation of extremely realistic images by means of editing software that includes the recent technology on diffusion models. In this context, there is a need to develop robust and automatic tools for synthetic image detection.

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

  • paper_url: http://arxiv.org/abs/2309.12424
  • repo_url: None
  • paper_authors: Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
  • for: To propose a lightweight and efficient vision transformer (ViT) model for better results on computer vision tasks.
  • methods: The proposed DualToken-ViT fuses a token carrying local information from a convolution-based structure with a token carrying global information from a self-attention-based structure, and uses position-aware global tokens throughout all stages to enrich global information and encode image position.
  • results: On ImageNet-1K, models at 0.5G and 1.0G FLOPs reach 75.4% and 79.4% top-1 accuracy respectively, and the 1.0G-FLOPs model outperforms LightViT-T (which uses global tokens) by 0.7%.
    Abstract Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various structures of ViTs, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of inductive biases of locality and translation equivariance demands larger model sizes compared to CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by convolution-based structure and the token with global information obtained by self-attention-based structure to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthening the effect of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better for vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T using global tokens by 0.7%.

Speeding up Resnet Architecture with Layers Targeted Low Rank Decomposition

  • paper_url: http://arxiv.org/abs/2309.12412
  • repo_url: None
  • paper_authors: Walid Ahmed, Habib Hajimolahoseini, Austin Wen, Yang Liu
  • for: To speed up both the training and the inference of neural networks.
  • methods: Network layers are compressed with low-rank decomposition, with the choice of which layers to compress guided by the underlying hardware.
  • results: On two hardware systems, Nvidia V100 and Huawei Ascend910, hardware-targeted compression of ResNet50 yields a 5.36% training speedup and 15.79% faster inference (on Ascend310) with only a 1% accuracy drop relative to the uncompressed model.
    Abstract Compression of a neural network can help in speeding up both the training and the inference of the network. In this research, we study applying compression using low rank decomposition on network layers. Our research demonstrates that to acquire a speed up, the compression methodology should be aware of the underlying hardware as analysis should be done to choose which layers to compress. The advantage of our approach is demonstrated via a case study of compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two different hardware systems Nvidia V100 and Huawei Ascend910. With hardware targeted compression, results on Ascend910 showed 5.36% training speedup and 15.79% inference speed on Ascend310 with only 1% drop in accuracy compared to the original uncompressed model
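
Low-rank compression of a single layer reduces to a truncated SVD of its weight matrix. The sketch below factorizes an nn.Linear into two thinner layers; deciding which layers to compress and at what rank is where the paper's hardware-aware analysis comes in, which this snippet does not attempt.

```python
import torch
import torch.nn as nn

def low_rank_decompose(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W (out x in) with B (out x r) @ A (r x in) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (r, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()              # (out, r)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)
```

The factorization is lossy for ranks below the weight's full rank, which is why the reported results trade a small accuracy drop for speed.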

POLAR3D: Augmenting NASA’s POLAR Dataset for Data-Driven Lunar Perception and Rover Simulation

  • paper_url: http://arxiv.org/abs/2309.12397
  • repo_url: https://github.com/uwsbel/polar-digital
  • paper_authors: Bo-Hsun Chen, Peter Negrut, Thomas Liang, Nevindu Batagoda, Harry Zhang, Dan Negrut
  • for: To provide a 3D dataset built on NASA's POLAR dataset for lunar perception and for synthesizing high-quality photorealistic images.
  • methods: Two contributions: every photo in the POLAR dataset is annotated, yielding roughly 23,000 labels for rocks and their shadows; and the lunar terrain scenarios are digitized by combining the lunar photos with POLAR's LiDAR point clouds to construct detailed obj files for all identifiable assets.
  • results: POLAR3D, a set of digital assets comprising rock/shadow labels and digitized lunar terrain scenarios, usable for training perception algorithms, synthesizing photorealistic images, and simulating lunar environments.
    Abstract We report on an effort that led to POLAR3D, a set of digital assets that enhance the POLAR dataset of stereo images generated by NASA to mimic lunar lighting conditions. Our contributions are twofold. First, we have annotated each photo in the POLAR dataset, providing approximately 23 000 labels for rocks and their shadows. Second, we digitized several lunar terrain scenarios available in the POLAR dataset. Specifically, by utilizing both the lunar photos and the POLAR's LiDAR point clouds, we constructed detailed obj files for all identifiable assets. POLAR3D is the set of digital assets comprising of rock/shadow labels and obj files associated with the digital twins of lunar terrain scenarios. This new dataset can be used for training perception algorithms for lunar exploration and synthesizing photorealistic images beyond the original POLAR collection. Likewise, the obj assets can be integrated into simulation environments to facilitate realistic rover operations in a digital twin of a POLAR scenario. POLAR3D is publicly available at https://github.com/uwsbel/POLAR-digital to aid perception algorithm development, camera simulation efforts, and lunar simulation exercises.

Active Stereo Without Pattern Projector

  • paper_url: http://arxiv.org/abs/2309.12315
  • repo_url: https://github.com/bartn8/vppstereo
  • paper_authors: Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti, Stefano Mattoccia
  • for: To bring the benefits of active stereo to standard passive camera systems without a physical pattern projector.
  • methods: A pattern is virtually projected onto the left and right images according to sparse measurements from a depth sensor. Any such device can be seamlessly plugged into the framework, enabling a virtual active stereo setup in any environment and overcoming pattern projector limitations such as working range or environmental conditions.
  • results: Experiments on indoor/outdoor datasets, at both close and long range, demonstrate the seamless effectiveness of the approach, boosting the accuracy of both stereo algorithms and deep networks.
    Abstract This paper proposes a novel framework integrating the principles of active stereo in standard passive camera systems without a physical pattern projector. We virtually project a pattern over the left and right images according to the sparse measurements obtained from a depth sensor. Any such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environment, overcoming the limitation of pattern projectors, such as limited working range or environmental conditions. Experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks.
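
The core trick of painting the same pattern at corresponding pixels in both views follows from basic stereo geometry. The snippet below assumes a rectified pair with known focal length and baseline, converting each sparse depth to a disparity; the patch size and pattern statistics are illustrative choices, not the paper's.

```python
import numpy as np

def virtually_project(left, right, depth_pts, fx, baseline, patch=3, seed=0):
    """Paint random patches consistently in both rectified grayscale views.

    depth_pts : iterable of (row, col, depth) sparse measurements in the left view
    """
    rng = np.random.default_rng(seed)
    L, R = left.astype(np.float32).copy(), right.astype(np.float32).copy()
    h, (H, W) = patch // 2, L.shape
    for r, c, z in depth_pts:
        r, c = int(round(r)), int(round(c))
        d = int(round(fx * baseline / z))       # disparity implied by the depth
        cr = c - d                              # corresponding column on the right
        if not (h <= r < H - h and h <= c < W - h and h <= cr < W - h):
            continue
        pat = rng.uniform(0, 255, size=(patch, patch)).astype(np.float32)
        L[r - h:r + h + 1, c - h:c + h + 1] = pat
        R[r - h:r + h + 1, cr - h:cr + h + 1] = pat
    return L, R
```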

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

  • paper_url: http://arxiv.org/abs/2309.12314
  • repo_url: https://github.com/microsoft/Cream/tree/main/TinyCLIP
  • paper_authors: Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi, Chen, Xinggang Wang, Hongyang Chao, Han Hu
  • for: To propose a new cross-modal distillation method, TinyCLIP, for large-scale language-image pretrained models.
  • methods: TinyCLIP introduces two core techniques: affinity mimicking, which has the student imitate the teacher's cross-modal feature alignment in a visual-linguistic affinity space during distillation, and weight inheritance, which transfers the teacher's pretrained weights to the student to improve distillation efficiency; the method is further extended to multi-stage progressive distillation for extreme compression.
  • results: TinyCLIP shrinks the pretrained CLIP ViT-B/32 by 50% while keeping comparable zero-shot performance, and distillation with weight inheritance speeds up training by 1.4-7.8x versus training from scratch. TinyCLIP ViT-8M/16, trained on YFCC-15M, reaches 41.1% zero-shot top-1 accuracy on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while using only 8.9% of the parameters, and transfers well to downstream tasks. Code and models will be released at https://aka.ms/tinyclip.
    Abstract In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
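
Affinity mimicking can be read as: compute the image-text similarity matrix in both teacher and student, then make the student's softened affinities match the teacher's. The sketch below is one assumed formulation using a KL divergence; the temperature and the exact divergence are illustrative.

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(s_img, s_txt, t_img, t_txt, tau=0.07):
    """All inputs are (B, D) embeddings from the student (s_) and teacher (t_)."""
    def affinity(img, txt):
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        return img @ txt.t() / tau                    # (B, B) similarity logits

    p_teacher = F.softmax(affinity(t_img, t_txt), dim=-1)
    log_q_student = F.log_softmax(affinity(s_img, s_txt), dim=-1)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```

Weight inheritance, the second technique, is simply initializing the student from a subset of the teacher's pretrained weights before this distillation begins.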

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.12306
  • repo_url: None
  • paper_authors: Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung
  • for: Active Speaker Detection (ASD): determining whether a person is speaking in a series of video frames. Prior work focused on network architectures, while learning effective representations has received less attention.
  • methods: A novel talk-aware contrastive loss, TalkNCE, applied only to the parts of each segment where the on-screen person is actually speaking, so the model learns effective representations through the natural correspondence of speech and facial motion. The loss can be optimized jointly with existing ASD training objectives, with no extra supervision or training data.
  • results: State-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
    Abstract The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
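
A plausible reading of a talk-aware contrastive objective is an InfoNCE loss computed only on frames where the on-screen person is actually speaking. The sketch below pairs frame-level audio and visual embeddings under that assumption; it is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def talk_aware_nce(audio, visual, speaking, tau=0.07):
    """audio, visual: (T, D) frame embeddings; speaking: (T,) boolean mask."""
    a = F.normalize(audio[speaking], dim=-1)
    v = F.normalize(visual[speaking], dim=-1)
    if len(a) < 2:
        return audio.new_zeros(())                 # nothing to contrast against
    logits = a @ v.t() / tau                        # cross-modal similarities
    target = torch.arange(len(a), device=a.device)
    # Symmetric InfoNCE: each audio frame should match its own visual frame
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```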

SlowFast Network for Continuous Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2309.12304
  • repo_url: None
  • paper_authors: Junseok Ahn, Youngjoon Jang, Joon Son Chung
  • for: To effectively extract spatial and dynamic features for continuous sign language recognition (CSLR).
  • methods: A two-pathway SlowFast network whose pathways operate at distinct temporal resolutions, separately capturing spatial information (hand shapes, facial expressions) and dynamic information (movements). Two fusion methods tailored to CSLR are also proposed: Bi-directional Feature Fusion (BFF), which transfers dynamic semantics into spatial semantics and vice versa, and Pathway Feature Enhancement (PFE), which enriches the dynamic and spatial representations through auxiliary subnetworks without adding inference time.
  • results: State-of-the-art results on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
    Abstract The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks, while avoiding the need for extra inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.

PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.12303
  • repo_url: https://github.com/shilinyan99/panovos
  • paper_authors: Shilin Yan, Xiaohao Xu, Lingyi Hong, Wenchao Chen, Wenqiang Zhang, Wei Zhang
  • for: To present a new panoramic video segmentation dataset, PanoVOS, and a video object segmentation method built on it.
  • methods: 15 off-the-shelf video object segmentation models are evaluated on the dataset; error analysis shows that none of them handles the pixel-level content discontinuities of panoramic videos, motivating a pixel-level matching method that exploits semantic boundary information from the previous frame.
  • results: The proposed PSCFormer network clearly outperforms the previous state-of-the-art models in segmentation quality under the panoramic setting.
    Abstract Panoramic videos contain richer spatial information and have attracted tremendous amounts of attention due to their exceptional experience in some fields such as autonomous driving and virtual reality. However, existing datasets for video segmentation only focus on conventional planar images. To address the challenge, in this paper, we present a panoramic video dataset, PanoVOS. The dataset provides 150 videos with high video resolutions and diverse motions. To quantify the domain gap between 2D planar videos and panoramic videos, we evaluate 15 off-the-shelf video object segmentation (VOS) models on PanoVOS. Through error analysis, we found that all of them fail to tackle pixel-level content discontinues of panoramic videos. Thus, we present a Panoramic Space Consistency Transformer (PSCFormer), which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame. Extensive experiments demonstrate that compared with the previous SOTA models, our PSCFormer network exhibits a great advantage in terms of segmentation results under the panoramic setting. Our dataset poses new challenges in panoramic VOS and we hope that our PanoVOS can advance the development of panoramic segmentation/tracking.

Text-Guided Vector Graphics Customization

  • paper_url: http://arxiv.org/abs/2309.12302
  • repo_url: None
  • paper_authors: Peiying Zhang, Nanxuan Zhao, Jing Liao
  • for: To generate high-quality customized vector graphics from text prompts.
  • methods: A large pretrained text-to-image model is fine-tuned (via its cross-attention layers) to produce prompt-guided raster images, and a semantic-based path alignment method initializes the SVG; path parameters are then optimized with both image-level and vector-level losses.
  • results: Diverse, high-quality customized vector graphics, with strong results across vector-level, image-level, and text-level evaluation metrics.
    Abstract Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while preserving the properties and layer-wise information of a given exemplar SVG. Our method harnesses the capabilities of large pre-trained text-to-image models. By fine-tuning the cross-attention layers of the model, we generate customized raster images guided by textual prompts. To initialize the SVG, we introduce a semantic-based path alignment method that preserves and transforms crucial paths from the exemplar SVG. Additionally, we optimize path parameters using both image-level and vector-level losses, ensuring smooth shape deformation while aligning with the customized raster image. We extensively evaluate our method using multiple metrics from vector-level, image-level, and text-level perspectives. The evaluation results demonstrate the effectiveness of our pipeline in generating diverse customizations of vector graphics with exceptional quality. The project page is https://intchous.github.io/SVGCustomization.

Adaptive Input-image Normalization for Solving the Mode Collapse Problem in GAN-based X-ray Images

  • paper_url: http://arxiv.org/abs/2309.12245
  • repo_url: None
  • paper_authors: Muhammad Muneeb Saad, Mubashir Husain Rehmani, Ruairi O’Reilly
  • for: To increase the diversity of synthetic X-ray datasets generated for augmentation, improving the performance of machine learning classifiers.
  • methods: Generative adversarial networks (DCGAN and ACGAN) generate augmented X-ray datasets, with adaptive input-image normalization used to alleviate the mode collapse problem.
  • results: DCGAN and ACGAN with adaptive input-image normalization outperform their counterparts trained on un-normalized X-rays on both diversity and classification metrics.
    Abstract Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment datasets. It is important to generate synthetic images that incorporate a diverse range of features to accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem impacts Generative Adversarial Networks' capacity to generate diversified images. Mode collapse comes in two varieties: intra-class and inter-class. In this paper, both varieties of the mode collapse problem are investigated, and their subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization with the Deep Convolutional GAN and Auxiliary Classifier GAN to alleviate the mode collapse problems. Synthetically generated images are utilized for data augmentation and training a Vision Transformer model. The classification performance of the model is evaluated using accuracy, recall, and precision scores. Results demonstrate that the DCGAN and the ACGAN with adaptive input-image normalization outperform the DCGAN and ACGAN with un-normalized X-ray images as evidenced by the superior diversity scores and classification scores.

Can We Reliably Improve the Robustness to Image Acquisition of Remote Sensing of PV Systems?

  • paper_url: http://arxiv.org/abs/2309.12214
  • repo_url: None
  • paper_authors: Gabriel Kasmi, Laurent Dubus, Yves-Marie Saint-Drenan, Philippe Blanc
  • for: To monitor the evolution of the rooftop PV installed fleet at a regional scale.
  • methods: The wavelet scale attribution method (WCAM) is used to assess the robustness and reliability of deep learning models by decomposing their predictions in the space-scale domain.
  • results: Insights that improve robustness to acquisition conditions, increasing trust in deep learning systems for the safe integration of clean energy into electric power systems.
    Abstract Photovoltaic (PV) energy is crucial for the decarbonization of energy systems. Due to the lack of centralized data, remote sensing of rooftop PV installations is the best option to monitor the evolution of the rooftop PV installed fleet at a regional scale. However, current techniques lack reliability and are notably sensitive to shifts in the acquisition conditions. To overcome this, we leverage the wavelet scale attribution method (WCAM), which decomposes a model's prediction in the space-scale domain. The WCAM enables us to assess on which scales the representation of a PV model rests and provides insights to derive methods that improve the robustness to acquisition conditions, thus increasing trust in deep learning systems to encourage their use for the safe integration of clean energy in electric systems.
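
The space-scale decomposition at the heart of WCAM can be pictured by pushing an attribution map through a 2-D discrete wavelet transform and measuring how much of its energy sits at each scale. The snippet below is a simplified illustration with PyWavelets, not the paper's full method.

```python
import numpy as np
import pywt

def scale_energy(attribution, wavelet="haar", levels=3):
    """Share of attribution energy per dyadic scale, coarsest first."""
    coeffs = pywt.wavedec2(attribution, wavelet, level=levels)
    energies = [np.sum(coeffs[0] ** 2)]            # approximation band
    for detail in coeffs[1:]:                      # (cH, cV, cD) per level
        energies.append(sum(np.sum(c ** 2) for c in detail))
    total = sum(energies)
    return [float(e / total) for e in energies]
```

A model whose attribution energy concentrates at scales that shift with acquisition conditions would be a candidate for the robustness interventions the paper motivates.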

Brain Tumor Detection Using Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2309.12193
  • repo_url: https://github.com/Arminsbss/tumor-classification
  • paper_authors: Razia Sultana Misu
  • for: To automate brain tumor detection using deep learning techniques.
  • methods: Five transfer learning models are used: VGG16, VGG19, DenseNet121, ResNet50, and YOLO V4, with ResNet50 achieving the highest accuracy of 99.54%.
  • results: Deep learning can detect brain tumors accurately from MRI, with ResNet50 giving the best accuracy.
    Abstract Brain tumors are collections of abnormal cells that can develop into masses or clusters. Because they have the potential to infiltrate other tissues, they pose a risk to the patient. The main imaging technique used, MRI, may be able to identify a brain tumor with accuracy. The fast development of Deep Learning methods for use in computer vision applications has been facilitated by a vast amount of training data and improvements in model construction that offer better approximations in a supervised setting. The need for these approaches has been the main driver of this expansion. Deep learning methods have shown promise in improving the precision of brain tumor detection and classification using magnetic resonance imaging (MRI). The study on the use of deep learning techniques, especially ResNet50, for brain tumor identification is presented in this abstract. As a result, this study investigates the possibility of automating the detection procedure using deep learning techniques. In this study, I utilized five transfer learning models which are VGG16, VGG19, DenseNet121, ResNet50 and YOLO V4 where ResNet50 provide the best or highest accuracy 99.54%. The goal of the study is to guide researchers and medical professionals toward powerful brain tumor detecting systems by employing deep learning approaches by way of this evaluation and analysis.
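
The transfer-learning recipe the study relies on is the standard one: load ImageNet weights and swap the classification head. A minimal torchvision sketch, with the class count and freezing policy as assumptions:

```python
import torch.nn as nn
import torchvision

def make_tumor_classifier(num_classes: int = 2, freeze_backbone: bool = True):
    """ResNet50 with an ImageNet-pretrained backbone and a fresh head."""
    model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False                # fine-tune only the new head
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```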

SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs

  • paper_url: http://arxiv.org/abs/2309.12188
  • repo_url: None
  • paper_authors: Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di, Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam
  • for: To provide a lightweight, real-time, user-controllable object rearrangement framework for embodied-AI interaction with environments.
  • methods: A coarse-to-fine scheme with a scene graph as the scene representation, carried out in three stages: observation, imagination, and execution.
  • results: Experiments show that SG-Bot outperforms competitors by a large margin.
    Abstract Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure--observation, imagination, and execution--to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph subsequently informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.

ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

  • paper_url: http://arxiv.org/abs/2309.12183
  • repo_url: None
  • paper_authors: Yu Cheng, Bo Wang, Robby T. Tan
  • for: addressed the problem of occlusion in 3D human shape and pose estimation from monocular videos, which is common in real-world scenarios.
  • methods: proposed an occlusion-robust temporal method called ORTexME, which utilizes temporal information from the input video to better regularize occluded body parts. The method is based on NeRF, and uses a novel average texture learning approach and human body mesh to guide the opacity-field updates and suppress blur and noise.
  • results: achieved significant improvement on the challenging multi-person 3DPW dataset, with 1.8 P-MPJPE error reduction compared to the state-of-the-art rendering-based methods, which enlarged the error up to 5.6 on the same dataset.
    Abstract In 3D human shape and pose estimation from a monocular video, models trained with limited labeled data cannot generalize well to videos with occlusion, which is common in the wild videos. The recent human neural rendering approaches focusing on novel view synthesis initialized by the off-the-shelf human shape and pose methods have the potential to correct the initial human shape. However, the existing methods have some drawbacks such as, erroneous in handling occlusion, sensitive to inaccurate human segmentation, and ineffective loss computation due to the non-regularized opacity field. To address these problems, we introduce ORTexME, an occlusion-robust temporal method that utilizes temporal information from the input video to better regularize the occluded body parts. While our ORTexME is based on NeRF, to determine the reliable regions for the NeRF ray sampling, we utilize our novel average texture learning approach to learn the average appearance of a person, and to infer a mask based on the average texture. In addition, to guide the opacity-field updates in NeRF to suppress blur and noise, we propose the use of human body mesh. The quantitative evaluation demonstrates that our method achieves significant improvement on the challenging multi-person 3DPW dataset, where our method achieves 1.8 P-MPJPE error reduction. The SOTA rendering-based methods fail and enlarge the error up to 5.6 on the same dataset.

Autoregressive Sign Language Production: A Gloss-Free Approach with Discrete Representations

  • paper_url: http://arxiv.org/abs/2309.12179
  • repo_url: None
  • paper_authors: Eui Jun Hwang, Huije Lee, Jong C. Park
  • for: provides a method to translate spoken language sentences directly into sign language, without intermediate glosses.
  • methods: proposes the Sign language Vector Quantization Network, a novel SLP approach that leverages vector quantization to derive discrete representations from sign pose sequences.
  • results: demonstrates superior performance over prior SLP methods in comprehensive evaluations, and highlights the reliability of Back-Translation and Fréchet Gesture Distance as evaluation metrics.
    Abstract Gloss-free Sign Language Production (SLP) offers a direct translation of spoken language sentences into sign language, bypassing the need for gloss intermediaries. This paper presents the Sign language Vector Quantization Network, a novel approach to SLP that leverages Vector Quantization to derive discrete representations from sign pose sequences. Our method, rooted in both manual and non-manual elements of signing, supports advanced decoding methods and integrates latent-level alignment for enhanced linguistic coherence. Through comprehensive evaluations, we demonstrate superior performance of our method over prior SLP methods and highlight the reliability of Back-Translation and Fr\'echet Gesture Distance as evaluation metrics.
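The discrete-representation step can be sketched as a standard vector-quantization layer with a straight-through gradient; the codebook size and feature dimension below are placeholders, and the paper's actual network additionally performs autoregressive decoding and latent-level alignment on top.

```python
# A minimal VQ-VAE-style quantizer over pose features (a sketch; the paper's
# network differs in architecture and adds latent-level alignment).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z):                                # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        d = torch.cdist(flat, self.codebook.weight)      # distances to all codes
        idx = d.argmin(dim=1)                            # nearest code per vector
        z_q = self.codebook(idx).view_as(z)
        # codebook + commitment losses; straight-through estimator for gradients
        loss = ((z_q - z.detach()) ** 2).mean() \
             + self.beta * ((z - z_q.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss
```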

SANPO: A Scene Understanding, Accessibility, Navigation, Pathfinding, Obstacle Avoidance Dataset

  • paper_url: http://arxiv.org/abs/2309.12172
  • repo_url: None
  • paper_authors: Sagar M. Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko
  • For: researchers and developers working on video segmentation, depth estimation, multi-task visual modeling, and synthetic-to-real domain adaptation.
  • Methods: introduces SANPO, a large-scale egocentric video dataset of stereo video sessions collected in diverse outdoor environments, together with rendered synthetic sessions; all sessions carry dense depth and odometry labels, and some carry temporally consistent dense panoptic segmentation labels.
  • Results: provides zero-shot baselines and SANPO benchmarks for future research, with the goal of advancing the state of the art in the areas above while enabling human navigation systems.
    Abstract We introduce SANPO, a large-scale egocentric video dataset focused on dense prediction in outdoor environments. It contains stereo video sessions collected across diverse outdoor environments, as well as rendered synthetic video sessions. (Synthetic data was provided by Parallel Domain.) All sessions have (dense) depth and odometry labels. All synthetic sessions and a subset of real sessions have temporally consistent dense panoptic segmentation labels. To our knowledge, this is the first human egocentric video dataset with both large scale dense panoptic segmentation and depth annotations. In addition to the dataset we also provide zero-shot baselines and SANPO benchmarks for future research. We hope that the challenging nature of SANPO will help advance the state-of-the-art in video segmentation, depth estimation, multi-task visual modeling, and synthetic-to-real domain adaptation, while enabling human navigation systems. SANPO is available here: https://google-research-datasets.github.io/sanpo_dataset/

Information Forensics and Security: A quarter-century-long journey

  • paper_url: http://arxiv.org/abs/2309.12159
  • repo_url: None
  • paper_authors: Mauro Barni, Patrizio Campisi, Edward J. Delp, Gwenael Doërr, Jessica Fridrich, Nasir Memon, Fernando Pérez-González, Anderson Rocha, Luisa Verdoliva, Min Wu
  • for: Ensuring that people use devices, data, and intellectual properties for authorized purposes, and facilitating the gathering of solid evidence to hold perpetrators accountable.
  • methods: Technological advances in various focus areas, including but not limited to signal processing, data analysis, and machine learning, to address the societal needs of the digital information era.
  • results: Landmark technical contributions and future trends in the field of Information Forensics and Security (IFS) over the last 25 years, as celebrated by the IEEE Signal Processing Society (SPS).
    Abstract Information Forensics and Security (IFS) is an active R&D area whose goal is to ensure that people use devices, data, and intellectual properties for authorized purposes and to facilitate the gathering of solid evidence to hold perpetrators accountable. For over a quarter century since the 1990s, the IFS research area has grown tremendously to address the societal needs of the digital information era. The IEEE Signal Processing Society (SPS) has emerged as an important hub and leader in this area, and the article below celebrates some landmark technical contributions. In particular, we highlight the major technological advances on some selected focus areas in the field developed in the last 25 years from the research community and present future trends.

Vulnerability of 3D Face Recognition Systems to Morphing Attacks

  • paper_url: http://arxiv.org/abs/2309.12118
  • repo_url: None
  • paper_authors: Sanjeet Vardam, Luuk Spreeuwers
  • For: investigates the robustness of 3D face recognition (3DFR) systems against 3D face morphing attacks.
  • Methods: describes methods for generating high-quality 3D face morphs; the generated morphs are compared to the contributing faces to obtain similarity scores.
  • Results: when 3DFR systems are attacked with look-a-like morphs, the highest Mated Morph Presentation Match Rate (MMPMR) reaches about 40%, with a Relative Morph Match Rate (RMMR) of 41.76%.
    Abstract In recent years face recognition systems have been brought to the mainstream due to development in hardware and software. Consistent efforts are being made to make them better and more secure. This has also brought developments in 3D face recognition systems at a rapid pace. These 3DFR systems are expected to overcome certain vulnerabilities of 2DFR systems. One such problem that the domain of 2DFR systems face is face image morphing. A substantial amount of research is being done for generation of high quality face morphs along with detection of attacks from these morphs. Comparatively the understanding of vulnerability of 3DFR systems against 3D face morphs is less. But at the same time an expectation is set from 3DFR systems to be more robust against such attacks. This paper attempts to research and gain more information on this matter. The paper describes a couple of methods that can be used to generate 3D face morphs. The face morphs that are generated using this method are then compared to the contributing faces to obtain similarity scores. The highest MMPMR is obtained around 40% with RMMR of 41.76% when 3DFRS are attacked with look-a-like morphs.
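The vulnerability metric reported above can be computed as follows. This sketch assumes a similarity-score matrix between each morph and probe images of its contributing subjects, and follows the standard MMPMR definition: a morph counts as a successful attack only if its minimum score across contributing subjects exceeds the match threshold.

```python
import numpy as np

def mmpmr(scores, threshold):
    """Mated Morph Presentation Match Rate.

    scores: (num_morphs, num_contributing_subjects) similarity scores between
            each morph and probe images of its contributing subjects.
    A morph succeeds only if it matches *all* contributors, i.e. its minimum
    score over contributors exceeds the threshold.
    """
    return float(np.mean(scores.min(axis=1) > threshold))

# toy example: 3 morphs, 2 contributing subjects each (illustrative numbers)
scores = np.array([[0.71, 0.64], [0.55, 0.80], [0.68, 0.69]])
print(mmpmr(scores, threshold=0.6))  # 2 of 3 morphs match both subjects -> 0.667
```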

AutoPET Challenge 2023: Sliding Window-based Optimization of U-Net

  • paper_url: http://arxiv.org/abs/2309.12114
  • repo_url: https://github.com/matt3o/autopet2-submission
  • paper_authors: Matthias Hadlich, Zdravko Marinov, Rainer Stiefelhagen
  • for: improving the accuracy of tumor segmentation in medical imaging by combining PET and CT to integrate metabolic and anatomical information.
  • methods: uses FDG-PET/CT scans, in which radiolabeled fluorodeoxyglucose highlights metabolically active regions, and automated segmentation to distinguish tumor-specific uptake from physiological uptake in normal tissue.
  • results: the AutoPET challenge provides a dataset of 1014 FDG-PET/CT studies, encouraging advances in accurate tumor segmentation and analysis within the FDG-PET/CT domain.
    Abstract Tumor segmentation in medical imaging is crucial and relies on precise delineation. Fluorodeoxyglucose Positron-Emission Tomography (FDG-PET) is widely used in clinical practice to detect metabolically active tumors. However, FDG-PET scans may misinterpret irregular glucose consumption in healthy or benign tissues as cancer. Combining PET with Computed Tomography (CT) can enhance tumor segmentation by integrating metabolic and anatomic information. FDG-PET/CT scans are pivotal for cancer staging and reassessment, utilizing radiolabeled fluorodeoxyglucose to highlight metabolically active regions. Accurately distinguishing tumor-specific uptake from physiological uptake in normal tissues is a challenging aspect of precise tumor segmentation. The AutoPET challenge addresses this by providing a dataset of 1014 FDG-PET/CT studies, encouraging advancements in accurate tumor segmentation and analysis within the FDG-PET/CT domain. Code: https://github.com/matt3o/AutoPET2-Submission/

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

  • paper_url: http://arxiv.org/abs/2309.12110
  • repo_url: None
  • paper_authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
  • for: investigates how recent multimodal image pretraining models can be applied to tasks in the artwork domain.
  • methods: uses visual models trained with semantically dense textual supervision (CLIP), which tend to generalize better than models trained with categorical attributes or through unsupervised techniques.
  • results: in extensive experiments on the NoisyArt dataset, CLIP achieves impressive zero-shot classification results and promising results on both artwork-to-artwork and description-to-artwork retrieval.
    Abstract Given the recent advances in multimodal image pretraining where visual models trained with semantically dense textual supervision tend to have better generalization capabilities than those trained using categorical attributes or through unsupervised techniques, in this work we investigate how recent CLIP model can be applied in several tasks in artwork domain. We perform exhaustive experiments on the NoisyArt dataset which is a dataset of artwork images crawled from public resources on the web. On such dataset CLIP achieves impressive results on (zero-shot) classification and promising results in both artwork-to-artwork and description-to-artwork domain.
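A zero-shot classifier of the kind evaluated here can be sketched with OpenAI's public CLIP package; the class names and prompt template below are placeholders, and the paper's exact prompting and retrieval setup may differ.

```python
# Zero-shot artwork classification sketch with the public CLIP package
# (pip install git+https://github.com/openai/CLIP.git); prompts are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["Mona Lisa", "The Starry Night", "Girl with a Pearl Earring"]
prompts = clip.tokenize([f"a photo of the artwork {c}" for c in class_names]).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)   # cosine similarity
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])   # predicted artwork class
```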

FourierLoss: Shape-Aware Loss Function with Fourier Descriptors

  • paper_url: http://arxiv.org/abs/2309.12106
  • repo_url: None
  • paper_authors: Mehmet Bahadir Erden, Selahattin Cansiz, Onur Caki, Haya Khattak, Durmus Etiz, Melek Cosar Yakar, Kerem Duruer, Berke Barut, Cigdem Gunduz-Demir
  • for: improving shape integrity and accuracy in medical image segmentation.
  • methods: uses encoder-decoder networks with a new shape-aware loss function, named FourierLoss, which quantifies the shape dissimilarity between the ground truth and the predicted segmentation maps through Fourier descriptors calculated on their objects, and penalizes this dissimilarity in network training.
  • results: the proposed adaptive loss update mechanism lets the network dynamically shift its attention between learning an object's general outline and the details of its contour points; on 2879 computed tomography images of 93 subjects, it yields statistically significantly better liver segmentation than its counterparts.
    Abstract Encoder-decoder networks become a popular choice for various medical image segmentation tasks. When they are trained with a standard loss function, these networks are not explicitly enforced to preserve the shape integrity of an object in an image. However, this ability of the network is important to obtain more accurate results, especially when there is a low-contrast difference between the object and its surroundings. In response to this issue, this work introduces a new shape-aware loss function, which we name FourierLoss. This loss function relies on quantifying the shape dissimilarity between the ground truth and the predicted segmentation maps through the Fourier descriptors calculated on their objects, and penalizing this dissimilarity in network training. Different than the previous studies, FourierLoss offers an adaptive loss function with trainable hyperparameters that control the importance of the level of the shape details that the network is enforced to learn in the training process. This control is achieved by the proposed adaptive loss update mechanism, which end-to-end learns the hyperparameters simultaneously with the network weights by backpropagation. As a result of using this mechanism, the network can dynamically change its attention from learning the general outline of an object to learning the details of its contour points, or vice versa, in different training epochs. Working on 2879 computed tomography images of 93 subjects, our experiments revealed that the proposed adaptive shape-aware loss function led to statistically significantly better results for liver segmentation, compared to its counterparts.
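The shape-dissimilarity term at the heart of FourierLoss is built on Fourier descriptors of object contours; a minimal NumPy sketch of descriptor computation and comparison follows. The paper additionally makes the descriptor truncation level a trainable hyperparameter, which this sketch does not implement.

```python
import numpy as np

def fourier_descriptors(contour, k=16):
    """contour: (N, 2) boundary points, ordered along the object outline.
    Returns magnitudes of the first k Fourier coefficients, normalized for
    translation (drop the DC term) and scale (divide by |F_1|)."""
    z = contour[:, 0] + 1j * contour[:, 1]   # boundary as a complex signal
    F = np.fft.fft(z)
    mag = np.abs(F[1:k + 1])                 # skipping F_0 removes translation
    return mag / (mag[0] + 1e-8)             # dividing by |F_1| removes scale

def shape_dissimilarity(contour_gt, contour_pred, k=16):
    # L2 distance between descriptors; a FourierLoss-style term would
    # penalize this quantity during training
    d_gt = fourier_descriptors(contour_gt, k)
    d_pr = fourier_descriptors(contour_pred, k)
    return float(np.sum((d_gt - d_pr) ** 2))

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
ellipse = np.stack([2 * np.cos(theta), np.sin(theta)], axis=1)
print(shape_dissimilarity(circle, ellipse))  # > 0: the shapes differ
```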

Bayesian sparsification for deep neural networks with Bayesian model reduction

  • paper_url: http://arxiv.org/abs/2309.12095
  • repo_url: https://github.com/dimarkov/bmr4pml
  • paper_authors: Dimitrije Marković, Karl J. Friston, Stefan J. Kiebel
  • for: explores Bayesian sparsification techniques for deep learning, to improve the computational efficiency and performance of deep learning models.
  • methods: combines structural shrinkage priors on model weights with approximate inference based on stochastic variational inference, and advocates Bayesian model reduction (BMR) as a more efficient alternative for pruning model weights.
  • results: a comparative study shows the BMR method outperforms approaches based on hierarchical horseshoe priors across a range of deep learning architectures, while being simpler and more efficient.
    Abstract Deep learning's immense capabilities are often constrained by the complexity of its models, leading to an increasing demand for effective sparsification techniques. Bayesian sparsification for deep learning emerges as a crucial approach, facilitating the design of models that are both computationally efficient and competitive in terms of performance across various deep learning applications. The state-of-the-art -- in Bayesian sparsification of deep neural networks -- combines structural shrinkage priors on model weights with an approximate inference scheme based on stochastic variational inference. However, model inversion of the full generative model is exceptionally computationally demanding, especially when compared to standard deep learning of point estimates. In this context, we advocate for the use of Bayesian model reduction (BMR) as a more efficient alternative for pruning of model weights. As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc elimination of redundant model weights based on the posterior estimates under a straightforward (non-hierarchical) generative model. Our comparative study highlights the advantages of the BMR method relative to established approaches based on hierarchical horseshoe priors over model weights. We illustrate the potential of BMR across various deep learning architectures, from classical networks like LeNet to modern frameworks such as Vision Transformers and MLP-Mixers.

Multi-Task Cooperative Learning via Searching for Flat Minima

  • paper_url: http://arxiv.org/abs/2309.12090
  • repo_url: None
  • paper_authors: Fuping Wu, Le Zhang, Yang Sun, Yuanhan Mo, Thomas Nichols, Bartlomiej W. Papiez
  • for: medical image analysis; uses multi-task learning (MTL) to improve the generalizability of learned features and the performance on individual tasks.
  • methods: formulates MTL as a multi/bi-level optimization problem in which features are learned cooperatively: the sub-model for each task is updated alternately, taking advantage of the sub-models learned for the other tasks; to alleviate negative transfer, the method searches for flat minima with respect to features from the other tasks.
  • results: validated on three publicly available datasets, showing promising results compared to state-of-the-art MTL approaches and demonstrating the effectiveness of cooperative learning in medical image analysis.
    Abstract Multi-task learning (MTL) has shown great potential in medical image analysis, improving the generalizability of the learned features and the performance in individual tasks. However, most of the work on MTL focuses on either architecture design or gradient manipulation, while in both scenarios, features are learned in a competitive manner. In this work, we propose to formulate MTL as a multi/bi-level optimization problem, and therefore force features to learn from each task in a cooperative approach. Specifically, we update the sub-model for each task alternatively taking advantage of the learned sub-models of the other tasks. To alleviate the negative transfer problem during the optimization, we search for flat minima for the current objective function with regard to features from other tasks. To demonstrate the effectiveness of the proposed approach, we validate our method on three publicly available datasets. The proposed method shows the advantage of cooperative learning, and yields promising results when compared with the state-of-the-art MTL approaches. The code will be available online.
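The alternating, cooperative update scheme can be sketched as below. The shared-encoder/per-task-head split, the loss names, and the flat-minimum perturbation are simplifying assumptions; the paper's actual bi-level formulation and flat-minima search are more involved.

```python
# Sketch of one round of alternating cooperative MTL updates (simplified;
# illustrative of the idea, not the paper's exact optimization).
import torch

def cooperative_step(encoder, heads, optimizers, batches, losses, rho=0.05):
    for t, head in enumerate(heads):     # update one task's sub-model at a time
        x, y = batches[t]
        feats = encoder(x)
        # crude flat-minimum proxy (an assumption): perturb shared features
        # and ask the task loss to stay low under the perturbation
        noise = rho * torch.randn_like(feats)
        loss = losses[t](head(feats), y) + losses[t](head(feats + noise), y)
        optimizers[t].zero_grad()        # optimizer t holds encoder + head t params
        loss.backward()
        optimizers[t].step()             # the other tasks' heads stay fixed
```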

Self-Calibrating, Fully Differentiable NLOS Inverse Rendering

  • paper_url: http://arxiv.org/abs/2309.12047
  • repo_url: None
  • paper_authors: Kiseok Choi, Inchul Kim, Dongyoung Choi, Julio Marco, Diego Gutierrez, Min H. Kim
  • for: This paper aims to improve the accuracy and robustness of non-line-of-sight (NLOS) imaging methods for reconstructing hidden scenes.
  • methods: The proposed method uses a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates imaging parameters during the reconstruction process, using measured illumination in both the time and frequency domains.
  • results: The method is able to consistently reconstruct detailed geometry and albedo of hidden scenes, even under significant noise levels, by using a combination of diffraction-based volumetric NLOS reconstruction, path-space light transport, and a simple ray marching technique.
    Abstract Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.

Beyond Image Borders: Learning Feature Extrapolation for Unbounded Image Composition

  • paper_url: http://arxiv.org/abs/2309.12042
  • repo_url: https://github.com/liuxiaoyu1104/unic
  • paper_authors: Xiaoyu Liu, Ming Liu, Junyi Li, Shuai Liu, Xiaotao Wang, Lei Lei, Wangmeng Zuo
  • for: improving image composition and aesthetic quality; most existing methods crop the captured image, but such cropping is limited in the range of views it can produce.
  • methods: proposes a joint framework for unbounded recommendation of camera view and image composition (UNIC), so that the cropped image is a sub-image of the view acquired by the predicted camera and is therefore guaranteed to be real and of high image quality.
  • results: extensive experiments on datasets built upon existing image cropping datasets demonstrate the effectiveness of UNIC for unbounded camera-view and image composition recommendation.
    Abstract For improving image composition and aesthetic quality, most existing methods modulate the captured images by striking out redundant content near the image borders. However, such image cropping methods are limited in the range of image views. Some methods have been suggested to extrapolate the images and predict cropping boxes from the extrapolated image. Nonetheless, the synthesized extrapolated regions may be included in the cropped image, making the image composition result not real and potentially with degraded image quality. In this paper, we circumvent this issue by presenting a joint framework for both unbounded recommendation of camera view and image composition (i.e., UNIC). In this way, the cropped image is a sub-image of the image acquired by the predicted camera view, and thus can be guaranteed to be real and consistent in image quality. Specifically, our framework takes the current camera preview frame as input and provides a recommendation for view adjustment, which contains operations unlimited by the image borders, such as zooming in or out and camera movement. To improve the prediction accuracy of view adjustment prediction, we further extend the field of view by feature extrapolation. After one or several times of view adjustments, our method converges and results in both a camera view and a bounding box showing the image composition recommendation. Extensive experiments are conducted on the datasets constructed upon existing image cropping datasets, showing the effectiveness of our UNIC in unbounded recommendation of camera view and image composition. The source code, dataset, and pretrained models is available at https://github.com/liuxiaoyu1104/UNIC.

BASE: Probably a Better Approach to Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2309.12035
  • repo_url: None
  • paper_authors: Martin Vonheim Larsen, Sigmund Rolfsjord, Daniel Gusland, Jörgen Ahlberg, Kim Mathiassen
  • for: explores reliable probabilistic methods for visual object tracking, which are surprisingly absent from current leaderboards.
  • methods: proposes BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant, and easily extendible probabilistic visual tracker that accounts for distance in target kinematics, detector confidence, and non-uniform clutter characteristics.
  • results: achieves state-of-the-art tracking performance on MOT17 and MOT20 without using Re-Id.
    Abstract The field of visual object tracking is dominated by methods that combine simple tracking algorithms and ad hoc schemes. Probabilistic tracking algorithms, which are leading in other fields, are surprisingly absent from the leaderboards. We found that accounting for distance in target kinematics, exploiting detector confidence and modelling non-uniform clutter characteristics is critical for a probabilistic tracker to work in visual tracking. Previous probabilistic methods fail to address most or all these aspects, which we believe is why they fall so far behind current state-of-the-art (SOTA) methods (there are no probabilistic trackers in the MOT17 top 100). To rekindle progress among probabilistic approaches, we propose a set of pragmatic models addressing these challenges, and demonstrate how they can be incorporated into a probabilistic framework. We present BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant and easily extendible visual tracker, achieving state-of-the-art (SOTA) on MOT17 and MOT20, without using Re-Id. Code will be made available at https://github.com/ffi-no
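The probabilistic backbone of such a tracker is a per-target Bayesian filter; a constant-velocity Kalman filter sketch is given below as a baseline illustration. BASE's actual models (detector confidence, non-uniform clutter, distance-dependent kinematics) go well beyond this minimal predict/update loop.

```python
import numpy as np

# Constant-velocity Kalman filter for one target in image coordinates
# (a minimal Bayesian ingredient, not BASE's full model).
dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]])   # state transition (x, y, vx, vy)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])   # we only observe position
Q = 0.01 * np.eye(4)                          # process noise (illustrative)
R = 1.0 * np.eye(2)                           # measurement noise (illustrative)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 0.0, 1.0, 0.5]), np.eye(4)
for z in [np.array([1.1, 0.4]), np.array([2.0, 1.1])]:  # detections per frame
    x, P = predict(x, P)
    x, P = update(x, P, z)
print(x[:2])  # filtered position estimate
```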

Face Identity-Aware Disentanglement in StyleGAN

  • paper_url: http://arxiv.org/abs/2309.12033
  • repo_url: None
  • paper_authors: Adrian Suwała, Bartosz Wójcik, Magdalena Proszewska, Jacek Tabor, Przemysław Spurek, Marek Śmieja
  • for: addresses the problem that conditional GANs, when manipulating face attributes (e.g., expression, hairstyle, pose, age), simultaneously change the person's identity.
  • methods: proposes PluGeN4Faces, a StyleGAN plugin that explicitly disentangles face attributes from a person's identity; the key idea is to train on images retrieved from movie frames, where the same person appears in various poses and with different attributes, and to apply a contrastive loss that encourages the model to place images of the same person in similar regions of latent space.
  • results: experiments show that attribute modifications performed by PluGeN4Faces are significantly less invasive to the remaining characteristics of the image than existing state-of-the-art models.
    Abstract Conditional GANs are frequently used for manipulating the attributes of face images, such as expression, hairstyle, pose, or age. Even though the state-of-the-art models successfully modify the requested attributes, they simultaneously modify other important characteristics of the image, such as a person's identity. In this paper, we focus on solving this problem by introducing PluGeN4Faces, a plugin to StyleGAN, which explicitly disentangles face attributes from a person's identity. Our key idea is to perform training on images retrieved from movie frames, where a given person appears in various poses and with different attributes. By applying a type of contrastive loss, we encourage the model to group images of the same person in similar regions of latent space. Our experiments demonstrate that the modifications of face attributes performed by PluGeN4Faces are significantly less invasive on the remaining characteristics of the image than in the existing state-of-the-art models.
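The identity-grouping objective can be sketched as a supervised contrastive loss over latent codes, where positives are latents of the same person from different movie frames. This is a standard supervised-contrastive form written as an illustration; the exact loss used by PluGeN4Faces may differ.

```python
import torch
import torch.nn.functional as F

def identity_contrastive_loss(latents, person_ids, tau=0.1):
    """latents: (N, D) latent codes; person_ids: (N,) integer identity labels.
    Pulls together latents of the same person, pushes apart different people."""
    z = F.normalize(latents, dim=1)
    sim = z @ z.T / tau                                    # cosine similarities
    mask_pos = (person_ids[:, None] == person_ids[None, :]).float()
    mask_pos.fill_diagonal_(0)                             # exclude self-pairs
    eye = torch.eye(len(z), device=z.device)
    logits = sim - 1e9 * eye                               # mask self in denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = mask_pos.sum(1).clamp(min=1)                   # positives per anchor
    return -(mask_pos * log_prob).sum(1).div(denom).mean()
```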

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

  • paper_url: http://arxiv.org/abs/2309.12029
  • repo_url: https://github.com/cyfml/opstl
  • paper_authors: Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: integrating action recognition into autonomous robotic systems so they can handle target occlusions.
  • methods: uses self-supervised pre-training on occluded skeleton sequences, KMeans clustering of sequence embeddings, and K-nearest-neighbor imputation of missing skeleton data, plus an Occluded Partial Spatio-Temporal Learning (OPSTL) framework with Adaptive Spatial Masking (ASM).
  • results: equips skeleton-based self-supervised models to handle occlusion, achieving higher recognition accuracy on the challenging occluded versions of NTURGB+D 60 and NTURGB+D 120.
    Abstract To integrate action recognition methods into autonomous robotic systems, it is crucial to consider adverse situations involving target occlusions. Such a scenario, despite its practical relevance, is rarely addressed in existing self-supervised skeleton-based action recognition methods. To empower robots with the capacity to address occlusion, we propose a simple and effective method. We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples. Next, we employ K-nearest-neighbor (KNN) to fill in missing skeleton data based on the closest sample neighbors. Imputing incomplete skeleton sequences to create relatively complete sequences as input provides significant benefits to existing skeleton-based self-supervised models. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning (OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) for better use of high-quality, intact skeletons. The effectiveness of our imputation methods is verified on the challenging occluded versions of the NTURGB+D 60 and NTURGB+D 120. The source code will be made publicly available at https://github.com/cyfml/OPSTL.
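The imputation step described above (cluster sequence embeddings, then fill missing joints from nearest neighbors) can be sketched with scikit-learn. Embedding extraction is assumed to be done by the pre-trained encoder, and the joint-level copy rule below is a simplification of the paper's recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def impute_skeletons(embeddings, skeletons, n_clusters=32, k=1):
    """embeddings: (N, D) sequence embeddings from a pre-trained encoder.
    skeletons: (N, T, J, 3) joint coordinates with NaN for occluded joints.
    Groups semantically similar sequences with KMeans, then copies missing
    joints from the nearest neighbor within the cluster (simplified sketch)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    filled = skeletons.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(embeddings[idx])
        _, nbrs = nn.kneighbors(embeddings[idx])
        for row, neighbor_rows in zip(idx, nbrs):
            # nbrs[0] is the sample itself; take the next-closest as donor
            donor = idx[neighbor_rows[1]] if len(neighbor_rows) > 1 else row
            missing = np.isnan(filled[row])
            filled[row][missing] = skeletons[donor][missing]
    return filled
```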

Precision in Building Extraction: Comparing Shallow and Deep Models using LiDAR Data

  • paper_url: http://arxiv.org/abs/2309.12027
  • repo_url: None
  • paper_authors: Muhammad Sulaiman, Mina Farmanbar, Ahmed Nabil Belbachir, Chunming Rong
  • for: detecting buildings from LiDAR data with learned models, assessing whether LiDAR improves segmentation accuracy.
  • methods: uses shallow models, chosen for their interpretability, and generates boundary masks from the original masks to improve boundary accuracy.
  • results: shallow models outperform deep learning models in IoU by 8% using aerial images alone and by 2% with combined aerial images and LiDAR data, whereas deep learning models score better on BIoU; boundary masks improve BIoU by 4% in both tasks, and LightGBM outperforms RF and XGBoost.
    Abstract Building segmentation is essential in infrastructure development, population management, and geological observations. This article targets shallow models due to their interpretable nature to assess the presence of LiDAR data for supervised segmentation. The benchmark data used in this article are published in NORA MapAI competition for deep learning model. Shallow models are compared with deep learning models based on Intersection over Union (IoU) and Boundary Intersection over Union (BIoU). In the proposed work, boundary masks from the original mask are generated to improve the BIoU score, which relates to building shapes' borderline. The influence of LiDAR data is tested by training the model with only aerial images in task 1 and a combination of aerial and LiDAR data in task 2 and then compared. shallow models outperform deep learning models in IoU by 8% using aerial images (task 1) only and 2% in combined aerial images and LiDAR data (task 2). In contrast, deep learning models show better performance on BIoU scores. Boundary masks improve BIoU scores by 4% in both tasks. Light Gradient-Boosting Machine (LightGBM) performs better than RF and Extreme Gradient Boosting (XGBoost).
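The two metrics compared above can be sketched as follows. The boundary band is taken here as the pixels removed by a d-pixel morphological erosion, which is one common BIoU formulation; the competition's exact definition may differ.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def boundary(mask, d=2):
    # boundary band: mask minus its d-pixel erosion (one common formulation)
    eroded = binary_erosion(mask, iterations=d)
    return np.logical_and(mask, ~eroded)

def biou(pred, gt, d=2):
    # IoU restricted to the boundary bands of both masks
    return iou(boundary(pred, d), boundary(gt, d))

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool); gt[12:42, 12:42] = True
print(f"IoU={iou(pred, gt):.3f}  BIoU={biou(pred, gt, d=2):.3f}")
```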

Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection

  • paper_url: http://arxiv.org/abs/2309.12010
  • repo_url: https://github.com/summitgao/camixer
  • paper_authors: Haopeng Zhang, Zijing Lin, Feng Gao, Junyu Dong, Qian Du, Heng-Chao Li
  • for: proposes a Transformer-like architecture for SAR image change detection, to improve accuracy and efficiency.
  • methods: introduces a Convolution and Attention Mixer (CAMixer) that combines self-attention with shift convolution in parallel to capture global semantic information alongside local features, and adopts a gating mechanism in the feed-forward network to enhance the non-linear feature transformation.
  • results: experiments on three SAR datasets show that CAMixer achieves higher accuracy and efficiency and is more robust for SAR change detection.
    Abstract Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attentions in the remote sensing community. However, existing SAR change detection methods are mainly based on convolutional neural networks (CNNs), with limited consideration of global attention mechanism. In this letter, we explore Transformer-like architecture for SAR change detection to incorporate global attention. To this end, we propose a convolution and attention mixer (CAMixer). First, to compensate the inductive bias for Transformer, we combine self-attention with shift convolution in a parallel way. The parallel design effectively captures the global semantic information via the self-attention and performs local feature extraction through shift convolution simultaneously. Second, we adopt a gating mechanism in the feed-forward network to enhance the non-linear feature transformation. The gating mechanism is formulated as the element-wise multiplication of two parallel linear layers. Important features can be highlighted, leading to high-quality representations against speckle noise. Extensive experiments conducted on three SAR datasets verify the superior performance of the proposed CAMixer. The source codes will be publicly available at https://github.com/summitgao/CAMixer .
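A minimal PyTorch rendering of the two ideas (parallel self-attention plus shifted convolution, and a feed-forward gate formed by the element-wise product of two parallel linear layers) is sketched below. Channel counts and the shift implementation are assumptions; see the linked repository for the released code.

```python
import torch
import torch.nn as nn

class CAMixerBlock(nn.Module):
    """Illustrative block, not the released CAMixer implementation
    (see https://github.com/summitgao/CAMixer for the real code)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.g1 = nn.Linear(dim, dim)   # gating branch
        self.g2 = nn.Linear(dim, dim)   # value branch
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, HW, C)
        # parallel paths: global self-attention + local (shifted) convolution
        a, _ = self.attn(tokens, tokens, tokens)
        shifted = torch.roll(x, shifts=(1, 1), dims=(2, 3)) # simple shift op
        c = self.conv(shifted).flatten(2).transpose(1, 2)
        y = self.norm(tokens + a + c)
        # gated FFN: element-wise product of two parallel linear layers
        y = y + self.g1(y) * self.g2(y)
        return y.transpose(1, 2).reshape(B, C, H, W)
```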

Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

  • paper_url: http://arxiv.org/abs/2309.12009
  • repo_url: https://github.com/desehuileng0o0/ikem
  • paper_authors: Yiping Wei, Kunyu Peng, Alina Roitberg, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: improving self-supervised learning for human action recognition, particularly in multi-modality setups.
  • methods: proposes an Implicit Knowledge Exchange Module (IKEM) to reduce the propagation of erroneous knowledge between low-performance modalities; introduces three new modalities to enrich the complementary information between modalities; and proposes a teacher-student framework, relational cross-modality knowledge distillation, that distills knowledge from the secondary modalities into the mandatory modalities under constraints defined by anchors, positives, and negatives, maintaining efficiency when new modalities are introduced.
  • results: experiments demonstrate the effectiveness of the approach, unlocking the efficient use of skeleton-based multi-modality data.
    Abstract Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data while using a multi-modality setup. These works overlooked the differences in performance among modalities, which led to the propagation of erroneous knowledge between modalities while only three fundamental modalities, i.e., joints, bones, and motions are used, hence no additional modalities are explored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM) which alleviates the propagation of erroneous knowledge between low-performance modalities. Then, we further propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework to distill the knowledge from the secondary modalities into the mandatory modalities considering the relationship constrained by anchors, positives, and negatives, named relational cross-modality knowledge distillation. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.

Identification of pneumonia on chest x-ray images through machine learning

  • paper_url: http://arxiv.org/abs/2309.11995
  • repo_url: https://github.com/Nabeel-105/Covid-19-and-Pneumonia-Detection-Using-Chest-Xray-Images-Full-Desktop-Application-
  • paper_authors: Eduardo Augusto Roeder
  • for: developing software that identifies the presence or absence of pneumonia in children's chest X-ray images.
  • methods: a machine-learning computational model trained with the transfer learning technique on chest X-rays of children taken at a hospital in China.
  • results: on new images, the trained model reached 98% sensitivity and 97.3% specificity.
    Abstract Pneumonia is the leading infectious cause of infant death in the world. When identified early, it is possible to alter the prognosis of the patient, one could use imaging exams to help in the diagnostic confirmation. Performing and interpreting the exams as soon as possible is vital for a good treatment, with the most common exam for this pathology being chest X-ray. The objective of this study was to develop a software that identify the presence or absence of pneumonia in chest radiographs. The software was developed as a computational model based on machine learning using transfer learning technique. For the training process, images were collected from a database available online with children's chest X-rays images taken at a hospital in China. After training, the model was then exposed to new images, achieving relevant results on identifying such pathology, reaching 98% sensitivity and 97.3% specificity for the sample used for testing. It can be concluded that it is possible to develop a software that identifies pneumonia in chest X-ray images.
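The paper does not specify the backbone; as an illustration, a transfer-learning setup of the kind described can be sketched with torchvision by freezing a pretrained CNN and retraining its classification head for the two-class (normal vs. pneumonia) problem. The backbone choice and hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Transfer-learning sketch for chest X-ray classification (backbone and
# hyperparameters are illustrative; the paper does not specify them).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                      # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, 2)    # normal vs. pneumonia head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):                  # images: (B, 3, 224, 224)
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```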

Neural Stochastic Screened Poisson Reconstruction

  • paper_url: http://arxiv.org/abs/2309.11993
  • repo_url: None
  • paper_authors: Silvia Sellán, Alec Jacobson
  • for: reconstructing 3D surfaces from point cloud data.
  • methods: uses a neural network to study and quantify reconstruction uncertainty under a Poisson smoothness prior.
  • results: addresses the main limitations of existing work and integrates fully into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction as more data is captured.
    Abstract Reconstructing a surface from a point cloud is an underdetermined problem. We use a neural network to study and quantify this reconstruction uncertainty under a Poisson smoothness prior. Our algorithm addresses the main limitations of existing work and can be fully integrated into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction upon capturing more data.

Crop Row Switching for Vision-Based Navigation: A Comprehensive Approach for Efficient Crop Field Navigation

  • paper_url: http://arxiv.org/abs/2309.11989
  • repo_url: None
  • paper_authors: Rajitha de Silva, Grzegorz Cielniak, Junfeng Gao
  • for: developing a vision-based navigation system that allows a mobile robot to cover an entire arable field for agricultural use.
  • methods: uses deep-learning-based RGB image segmentation and depth data to detect the end of a crop row and the re-entry point into the next row, within a vision-based crop-row-switching strategy.
  • results: tested in a real sugar beet field, the robot successfully exits one crop row and re-enters the next, with absolute median errors of 19.25 cm and 6.77°.
    Abstract Vision-based mobile robot navigation systems in arable fields are mostly limited to in-row navigation. The process of switching from one crop row to the next in such systems is often aided by GNSS sensors or multiple camera setups. This paper presents a novel vision-based crop row-switching algorithm that enables a mobile robot to navigate an entire field of arable crops using a single front-mounted camera. The proposed row-switching manoeuvre uses deep learning-based RGB image segmentation and depth data to detect the end of the crop row, and re-entry point to the next crop row which would be used in a multi-state row switching pipeline. Each state of this pipeline use visual feedback or wheel odometry of the robot to successfully navigate towards the next crop row. The proposed crop row navigation pipeline was tested in a real sugar beet field containing crop rows with discontinuities, varying light levels, shadows and irregular headland surfaces. The robot could successfully exit from one crop row and re-enter the next crop row using the proposed pipeline with absolute median errors averaging at 19.25 cm and 6.77{\deg} for linear and rotational steps of the proposed manoeuvre.

ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.11986
  • repo_url: None
  • paper_authors: Philipp Ausserlechner, David Haberger, Stefan Thalhammer, Jean-Baptiste Weibel, Markus Vincze
  • for: zero-shot 6D object pose estimation.
  • methods: uses visual descriptors extracted by pre-trained Vision Transformers (ViT) to match rendered templates against query images and to establish local correspondences, then estimates the object's 6D pose with a RANSAC-based PnP algorithm.
  • results: improves Average Recall over two state-of-the-art novel-object pose estimation methods across the three evaluated datasets (LMO, YCBV, and TLESS).
    Abstract As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a demand to recognize diverse objects. The state-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel object pose estimation methods are solving this issue using task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures. MegaPose for example is trained on a dataset consisting of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming we introduce ZS6D, for zero-shot novel object 6D pose estimation. Visual descriptors, extracted using pre-trained Vision Transformers (ViT), are used for matching rendered templates against query images of objects and for establishing local correspondences. These local correspondences enable deriving geometric correspondences and are used for estimating the object's 6D pose with RANSAC-based PnP. This approach showcases that the image descriptors extracted by pre-trained ViTs are well-suited to achieve a notable improvement over two state-of-the-art novel object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS. In comparison to one of the two methods we improve the Average Recall on all three datasets and compared to the second method we improve on two datasets.
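Once 2D-3D correspondences are established from the ViT descriptor matches, the pose step is a standard RANSAC PnP solve; an OpenCV sketch follows. The correspondence arrays below are synthetic placeholders standing in for the template-matching stage, and the camera intrinsics are illustrative.

```python
import numpy as np
import cv2

# Pose from 2D-3D correspondences via RANSAC PnP. The descriptor-matching
# stage that produces real correspondences is assumed, not shown; random
# points here just exercise the call.
object_points = np.random.rand(30, 3).astype(np.float32)       # 3D model points
image_points = np.random.rand(30, 2).astype(np.float32) * 480  # matched pixels
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, distCoeffs=None,
    reprojectionError=8.0, iterationsCount=100)
if ok:
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix of the 6D pose
    print("R:\n", R, "\nt:", tvec.ravel())
```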

Spatially Guiding Unsupervised Semantic Segmentation Through Depth-Informed Feature Distillation and Sampling

  • paper_url: http://arxiv.org/abs/2309.12378
  • repo_url: None
  • paper_authors: Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski
  • for: reducing the labor cost of human annotation by training semantic segmentation models without supervision.
  • methods: learns by correlating randomly sampled image features across the entire dataset, and injects knowledge of scene structure through depth information.
  • results: extensive experiments on multiple benchmark datasets show significant performance improvements.
    Abstract Traditionally, training neural networks to perform semantic segmentation required expensive human-made annotations. But more recently, advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlate the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) implementing farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
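Farthest-point sampling over depth-derived 3D points, used above to select relevant features, is a classic greedy procedure; a NumPy sketch is given below. How the sampled points map back to feature-map locations is assumed to be handled elsewhere.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """points: (N, 3) 3D points back-projected from the depth map.
    Greedily picks m points, each maximizing its distance to those already
    chosen, giving good spatial coverage of the scene for feature sampling."""
    n = len(points)
    chosen = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        # shrink each point's distance-to-set using the newest selection
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

pts = np.random.rand(1000, 3)
idx = farthest_point_sampling(pts, 16)   # indices of 16 well-spread points
```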

NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.11966
  • repo_url: https://github.com/FlorisE/neural-labeling
  • paper_authors: Floris Erich, Naoya Chiba, Yusuke Yoshiyasu, Noriaki Ando, Ryo Hanai, Yukiyasu Domae
  • for: proposes a Neural Radiance Field (NeRF)-based scene labeling approach and toolset for generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps, and object meshes.
  • methods: uses NeRF as the renderer, allowing labeling to be performed with 3D spatial tools while exploiting geometric cues such as occlusion, and requiring only images captured from multiple viewpoints as input, with no special labeling tools or scanners.
  • results: adding ground-truth depth maps to 30000 frames of transparent-object RGB and noisy depth captured by an RGBD sensor in a dishwasher yields the Dishwasher30k dataset; training a simple deep network supervised with the annotated depth maps achieves higher reconstruction performance than the previously applied weakly supervised approach.
    Abstract We present NeuralLabeling, a labeling approach and toolset for annotating a scene using either bounding boxes or meshes and generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps and object meshes. NeuralLabeling uses Neural Radiance Fields (NeRF) as renderer, allowing labeling to be performed using 3D spatial tools while incorporating geometric clues such as occlusions, relying only on images captured from multiple viewpoints as input. To demonstrate the applicability of NeuralLabeling to a practical problem in robotics, we added ground truth depth maps to 30000 frames of transparent object RGB and noisy depth maps of glasses placed in a dishwasher captured using an RGBD sensor, yielding the Dishwasher30k dataset. We show that training a simple deep neural network with supervision using the annotated depth maps yields a higher reconstruction performance than training with the previously applied weakly supervised approach.

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

  • paper_url: http://arxiv.org/abs/2309.11962
  • repo_url: https://github.com/tho-kn/Ego3DPose
  • paper_authors: Taeho Kang, Kyungjin Lee, Jinrui Zhang, Youngki Lee
  • methods: a two-path network architecture that estimates per-limb pose independently from binocular heatmaps, plus a perspective-aware representation using trigonometry.
  • results: outperforms state-of-the-art models by 23.1% in MPJPE reduction on the UnrealEgo dataset, with superior performance in challenging occlusion cases and on visible joint positions.
    Abstract We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup offers practicality and usefulness in various applications, however, it remains largely under-explored. It has been suffering from low pose estimation accuracy due to viewing distortion, severe self-occlusion, and limited field-of-view of the joints in egocentric 2D images. Here, we notice that two important 3D cues, stereo correspondences, and perspective, contained in the egocentric binocular input are neglected. Current methods heavily rely on 2D image features, implicitly learning 3D information, which introduces biases towards commonly observed motions and leads to low overall accuracy. We observe that they not only fail in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates pose per limb independently with its binocular heatmaps. Without full-body information provided, it alleviates bias toward trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a significantly large-size hand when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models by a pose estimation error (i.e., MPJPE) reduction of 23.1% in the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.

A Study of Forward-Forward Algorithm for Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.11955
  • repo_url: None
  • paper_authors: Jonas Brenig, Radu Timofte
  • for: investigates the performance of the forward-forward algorithm versus backpropagation for self-supervised representation learning and provides insights into the learned representation spaces.
  • methods: benchmarks four standard datasets (MNIST, F-MNIST, SVHN, and CIFAR-10) and three commonly used self-supervised representation learning techniques (rotation, flip, and jigsaw).
  • results: while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, its transfer performance lags significantly in all studied settings, possibly because each layer has its own loss function and because of how supervised training is realized in the forward-forward paradigm; compared to backpropagation, forward-forward focuses more on boundaries and discards information unnecessary for decisions, which harms the representation learning goal. Further investigation is needed to stabilize the forward-forward strategy for self-supervised learning beyond the demonstrated datasets and configurations.
    Abstract Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance is significantly lagging behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning, to work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
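As a concrete reference for the recipe the abstract describes, below is a minimal sketch of one forward-forward layer with a local goodness objective, in the spirit of Hinton's proposal. The layer width, threshold, optimizer, and learning rate are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer, trained with its own local loss:
    no gradients flow between layers."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Length-normalize so the previous layer's goodness cannot leak through.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).mean(dim=1)  # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).mean(dim=1)  # goodness of negative data
        # Push positive goodness above the threshold, negative goodness below it.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detached outputs become the next layer's training inputs.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

A full pipeline stacks several such layers, feeding each the detached outputs of the previous one; for the self-supervised variants studied here, the negatives would come from the pretext task (e.g., rotated, flipped, or jigsawed images).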

Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.11933
  • repo_url: None
  • paper_authors: Ping Li, Yu Zhang, Li Yuan, Xianghua Xu
  • for: This work proposes a Referring Video Object Segmentation (RVOS) framework built entirely upon transformers, addressing object localization in this cross-modal task.
  • methods: The framework treats the RVOS task as a mask sequence learning problem and regards all objects in the video as candidate objects.
  • results: The proposed method performs strongly on three benchmarks, e.g., 45.1% and 38.7% mAP on A2D Sentences and J-HMDB Sentences, respectively, and 56.6% $\mathcal{J\&F}$ on Ref-YouTube-VOS. Compared with the best competing method, it gains 2.1% and 3.2% in P$@$0.5 on the former two benchmarks and 2.9% in $\mathcal{J}$ on the latter.
    Abstract Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task, and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework completely built upon transformers, termed \textit{Fully Transformer-Equipped Architecture} (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in the video as candidate objects. Given a video clip with a text query, the visual-textual features are yielded by an encoder, while the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we have developed the Stacked Transformer, which individually characterizes the visual appearance of each candidate object, whose feature map is directly decoded into the binary mask sequence. Finally, the model finds the best matching between the mask sequence and the text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model for capturing a more accurate mask of the referred object. Empirical studies have shown the superiority of the proposed method on three benchmarks, e.g., FTEA achieves 45.1% and 38.7% in terms of mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively; it achieves 56.6% in terms of $\mathcal{J\&F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects). Particularly, compared to the best candidate method, it has a gain of 2.1% and 3.2% in terms of P$@$0.5 on the former two, respectively, while it has a gain of 2.9% in terms of $\mathcal{J}$ on the latter one.
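The abstract names a diversity loss that spreads the candidate masks apart but does not give its form; one plausible reading is a pairwise soft-IoU penalty between candidate masks, sketched below under that assumption (the paper's exact formulation may differ).

```python
import torch

def mask_diversity_loss(masks):
    """Penalize pairwise soft-IoU between candidate object masks so they
    do not collapse onto the same region.
    masks: (N, T, H, W) tensor of per-candidate masks with values in [0, 1]."""
    n = masks.shape[0]
    m = masks.flatten(1)                          # (N, T*H*W)
    inter = m @ m.T                               # pairwise soft intersections
    area = m.sum(dim=1, keepdim=True)
    iou = inter / (area + area.T - inter + 1e-8)
    off_diag = iou - torch.diag(torch.diag(iou))  # zero out self-similarity
    return off_diag.sum() / (n * (n - 1) + 1e-8)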

Bridging the Gap: Learning Pace Synchronization for Open-World Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.11930
  • repo_url: None
  • paper_authors: Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang
  • for: This paper tackles novel class discovery in open-world semi-supervised learning, where unlabeled data must be used to uncover novel categories while maintaining performance on seen categories learned from labeled data.
  • methods: Two techniques are proposed: 1) an adaptive margin loss based on the estimated class distribution, which enforces a large negative margin for samples in seen classes to synchronize the learning paces of seen and novel classes; 2) pseudo-label contrastive clustering, which pulls together samples likely belonging to the same class in the output space to enhance novel class discovery.
  • results: Extensive evaluations on multiple datasets show that existing models still struggle to learn novel classes, whereas this approach balances seen and novel classes and achieves a 3% average accuracy increase on ImageNet over the prior state of the art. In addition, fine-tuning the self-supervised pre-trained backbone significantly boosts performance over the defaults used in prior literature.
    Abstract In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data while maintaining performance on seen categories from labeled data. The central challenge is the substantial learning gap between seen and novel categories, as the model learns the former faster due to accurate supervisory information. To address this, we introduce 1) an adaptive margin loss based on estimated class distribution, which encourages a large negative margin for samples in seen classes, to synchronize learning paces, and 2) pseudo-label contrastive clustering, which pulls together samples which are likely from the same class in the output space, to enhance novel class discovery. Our extensive evaluations on multiple datasets demonstrate that existing models still hinder novel class learning, whereas our approach strikingly balances both seen and novel classes, achieving a remarkable 3% average accuracy increase on the ImageNet dataset compared to the prior state-of-the-art. Additionally, we find that fine-tuning the self-supervised pre-trained backbone significantly boosts performance over the default in prior literature. After our paper is accepted, we will release the code.
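The adaptive margin loss is described only at a high level. A common way to realize a margin driven by an estimated class distribution is logit adjustment, sketched here as one possible reading; the prior-estimation step and the temperature `tau` are assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_ce(logits, targets, class_prior, tau=1.0):
    """Cross-entropy with a distribution-aware margin: adding the log of the
    estimated class prior to the logits imposes a larger margin penalty on
    classes the model already favors (the seen classes), which slows their
    learning pace relative to novel classes.
    logits: (B, C); targets: (B,); class_prior: (C,) summing to 1."""
    adjusted = logits + tau * torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, targets)
```

Here `class_prior` would be re-estimated during training (e.g., from pseudo-label frequencies over the unlabeled data), so the margin adapts as novel classes emerge.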

Video Scene Location Recognition with Neural Networks

  • paper_url: http://arxiv.org/abs/2309.11928
  • repo_url: None
  • paper_authors: Lukáš Korel, Petr Pulc, Jiří Tumpach, Martin Holeňa
  • for: This paper explores scene recognition from video sequences with a small set of repeated shooting locations (such as in television series) using artificial neural networks.
  • methods: The approach selects a set of frames from each scene, transforms them with a pre-trained single-image pre-processing convolutional network, and classifies the scene location with subsequent layers of the neural network.
  • results: The authors compared different neural network layers for combining individual frames and found that only some of the approaches are suitable for this task.
    Abstract This paper provides an insight into the possibility of scene recognition from a video sequence with a small set of repeated shooting locations (such as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them by a pre-trained single-image pre-processing convolutional network, and classify the scene location with subsequent layers of the neural network. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. We have investigated different neural network layers to combine individual frames, particularly AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We have observed that only some of the approaches are suitable for the task at hand.
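To make the pipeline concrete, here is a minimal sketch of the frame-aggregation idea: a frozen single-image backbone per frame, followed by one of the aggregation layers the paper compares. The ResNet-18 backbone, feature sizes, and head widths are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class SceneLocationNet(nn.Module):
    """Per-frame features from a frozen pre-trained CNN, combined across
    frames by an aggregation layer, then classified into scene locations."""
    def __init__(self, n_locations, agg="avg"):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()              # 512-d feature per frame
        for p in backbone.parameters():
            p.requires_grad = False              # pre-trained network stays fixed
        self.backbone, self.agg = backbone, agg
        self.lstm = nn.LSTM(512, 128, batch_first=True)
        self.fc = nn.Linear(128 if agg == "lstm" else 512, n_locations)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        if self.agg == "lstm":
            out, _ = self.lstm(feats)
            pooled = out[:, -1]                  # last time step
        elif self.agg == "max":
            pooled = feats.max(dim=1).values     # MaxPooling over frames
        else:
            pooled = feats.mean(dim=1)           # AveragePooling over frames
        return self.fc(pooled)
```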

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.11923
  • repo_url: None
  • paper_authors: Xiaozhou You, Jian Zhang
  • for: This paper proposes a text-guided face image generation and manipulation method that requires no adversarial training.
  • methods: The method combines the generative power of StyleGAN with the text-image representation capabilities of CLIP through a carefully designed mapping network, covering both generation and manipulation.
  • results: Extensive experiments on the Multi-modal CelebA-HQ dataset show that the proposed method outperforms existing approaches on both generation and manipulation tasks.
    Abstract Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, while struggling to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise corresponding to these two different tasks, and under the condition of the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images at resolutions of up to $1024\times1024$. Extensive experiments on the Multi-modal CelebA-HQ dataset have demonstrated that our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.

Spatial-Temporal Transformer based Video Compression Framework

  • paper_url: http://arxiv.org/abs/2309.11913
  • repo_url: None
  • paper_authors: Yanbo Gao, Wenjia Huang, Shuai Li, Hui Yuan, Mao Ye, Siwei Ma
  • for: Improving the efficiency of learned video compression (LVC) by stabilizing motion estimation and jointly reducing spatial-temporal redundancy.
  • methods: A transformer-based framework comprising a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multiple reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for temporal-spatial joint residual compression.
  • results: Achieves a 13.5% BD-rate reduction over VTM.
    Abstract Learned video compression (LVC) has witnessed remarkable advancements in recent years. Similar to traditional video coding, LVC inherits motion estimation/compensation, residual coding and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and its training mechanism using gradient backpropagation, most existing works often struggle to consistently generate stable motion information, which is in the form of geometric features, from the input color features. Moreover, the modules such as the inter-prediction and residual coding are independent from each other, making it inefficient to fully reduce the spatial-temporal redundancy. To address the above problems, in this paper, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer based offsets estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between the similarity based geometric motion feature extraction and self-attention. MGP is designed to fuse the multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T is to compress the residual information by jointly exploring the spatial feature distributions in both residual and temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result with 13.5% BD-Rate saving over VTM.

Heart Rate Detection Using an Event Camera

  • paper_url: http://arxiv.org/abs/2309.11891
  • repo_url: None
  • paper_authors: Aniket Jagtap, RamaKrishna Venkatesh Saripalli, Joe Lemley, Waseem Shariff, Alan F. Smeaton
  • for: Continuous non-invasive heart rate monitoring.
  • methods: An event camera captures subtle changes in the skin surface of the wrist caused by the pulsatile flow of blood.
  • results: Demonstrated the feasibility of contactless heart rate measurement, while acknowledging challenges such as light-induced flickering and naturally occurring tremors during data capture.
    Abstract Event cameras, also known as neuromorphic cameras, are an emerging technology that offer advantages over traditional shutter and frame-based cameras, including high temporal resolution, low power consumption, and selective data acquisition. In this study, we propose to harness the capabilities of event-based cameras to capture subtle changes in the surface of the skin caused by the pulsatile flow of blood in the wrist region. We investigate whether an event camera could be used for continuous noninvasive monitoring of heart rate (HR). Event camera video data from 25 participants, comprising varying age groups and skin colours, was collected and analysed. Ground-truth HR measurements obtained using conventional methods were used to evaluate the accuracy of automatic detection of HR from event camera data. Our experimental results and comparison to the performance of other non-contact HR measurement methods demonstrate the feasibility of using event cameras for pulse detection. We also acknowledge the challenges and limitations of our method, such as light-induced flickering and the sub-conscious but naturally-occurring tremors of an individual during data capture.
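The abstract does not spell out the signal-processing chain, so the sketch below shows one plausible pipeline under stated assumptions: events from a wrist region of interest are binned into an event-rate signal, band-passed to the physiological pulse band, and the dominant frequency is read off as the heart rate.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr_bpm(event_timestamps_us, duration_s, fs=100.0):
    """Estimate heart rate from event-camera timestamps (microseconds)
    recorded over a wrist ROI. Assumed pipeline, not the paper's exact method."""
    n_bins = int(duration_s * fs)
    rate, _ = np.histogram(event_timestamps_us * 1e-6,
                           bins=n_bins, range=(0.0, duration_s))
    rate = rate.astype(float) - rate.mean()      # zero-mean event-rate signal
    # Band-pass to the plausible pulse band: 0.7-3.5 Hz, i.e. 42-210 bpm.
    b, a = butter(3, [0.7 / (fs / 2), 3.5 / (fs / 2)], btype="band")
    pulse = filtfilt(b, a, rate)
    spectrum = np.abs(np.fft.rfft(pulse))
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fs)
    return 60.0 * freqs[np.argmax(spectrum)]     # beats per minute
```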

On-the-Fly SfM: What you capture is What you get

  • paper_url: http://arxiv.org/abs/2309.11883
  • repo_url: None
  • paper_authors: Zongqian Zhan, Rui Xia, Yifei Yu, Yibo Xu, Xin Wang
  • for: Real-time structure from motion (SfM): registering images and estimating poses and 3D points online, while the images are being captured.
  • methods: The approach uses an unsupervised vocabulary tree built on learning-based global features for fast retrieval of each newly captured image, a least squares matching (LSM) mechanism to improve image registration, and an efficient hierarchical weighted local bundle adjustment for optimization.
  • results: Experiments show that on-the-fly SfM robustly registers images while they are captured online.
    Abstract Over the last decades, ample achievements have been made in structure from motion (SfM). However, the vast majority of methods work in an offline manner: images are first captured and then fed together into an SfM pipeline to obtain poses and a sparse point cloud. In this work, on the contrary, we present an on-the-fly SfM: running online SfM while images are being captured, each newly taken image is estimated online together with its corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree, trained without supervision on learning-based global features, for fast retrieval of the newly arrived image. Then, a robust feature matching mechanism with least squares (LSM) is presented to improve image registration performance. Finally, by investigating the influence of the newly arrived image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can meet the goal of robustly registering images while capturing them in an online way.

Using Saliency and Cropping to Improve Video Memorability

  • paper_url: http://arxiv.org/abs/2309.11881
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Vaibhav Mudgal, Qingyang Wang, Lorin Sweeney, Alan F. Smeaton
  • for: Improving video memorability, since more memorable videos are more likely to be shared, viewed, and discussed.
  • methods: Selective cropping of frames based on image saliency. Experiments include a basic fixed crop as well as dynamic cropping, where both the size and the position of the crop move as the video plays and saliency is tracked.
  • results: Results indicate that, especially for videos of low initial memorability, the memorability score can be improved.
    Abstract Video memorability is a measure of how likely a particular video is to be remembered by a viewer when that viewer has no emotional connection with the video content. It is an important characteristic as videos that are more memorable are more likely to be shared, viewed, and discussed. This paper presents results of a series of experiments where we improved the memorability of a video by selectively cropping frames based on image saliency. We present results of a basic fixed cropping as well as the results from dynamic cropping where both the size of the crop and the position of the crop within the frame, move as the video is played and saliency is tracked. Our results indicate that especially for videos of low initial memorability, the memorability score can be improved.
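A minimal sketch of the dynamic cropping idea follows: center a fixed-size crop on the saliency peak of each frame, clamped to the image bounds. The saliency predictor itself is assumed given (any off-the-shelf model), and the crop size is an illustrative parameter.

```python
import numpy as np

def saliency_crop(frame, saliency, crop_h, crop_w):
    """Crop `frame` around the peak of its `saliency` map.
    Assumes crop_h <= H and crop_w <= W."""
    H, W = saliency.shape
    cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
    y0 = int(np.clip(cy - crop_h // 2, 0, H - crop_h))
    x0 = int(np.clip(cx - crop_w // 2, 0, W - crop_w))
    return frame[y0:y0 + crop_h, x0:x0 + crop_w]
```

Applied per frame with a tracked saliency map, this yields the moving crop window described in the abstract; the fixed-crop baseline simply reuses one window for the whole video.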

TCOVIS: Temporally Consistent Online Video Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.11857
  • repo_url: https://github.com/jun-long-li/tcovis
  • paper_authors: Junlong Li, Bingyao Yu, Yongming Rao, Jie Zhou, Jiwen Lu
  • for: This paper proposes a novel online video instance segmentation method (TCOVIS) that tackles the temporal consistency problem in the video instance segmentation task.
  • methods: TCOVIS consists of a global instance assignment strategy and a spatio-temporal enhancement module, both of which improve the temporal consistency of the features.
  • results: On four widely adopted video instance segmentation benchmarks (YouTube-VIS 2019/2021/2022 and OVIS), TCOVIS achieves state-of-the-art performance without extra tricks; for example, on YouTube-VIS 2021 it achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively.
    Abstract In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at https://github.com/jun-long-li/TCOVIS.
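One way to read the "global optimal matching between the predictions and ground truth across the whole video clip" is a single Hungarian assignment over costs summed across frames, sketched below under that assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_instance_assignment(cost_per_frame):
    """cost_per_frame: (T, N_pred, N_gt) array of per-frame matching costs
    (e.g., combined mask and classification costs). Summing over the clip
    before solving one assignment keeps each prediction bound to the same
    ground-truth identity across all frames."""
    total_cost = np.sum(cost_per_frame, axis=0)   # (N_pred, N_gt)
    pred_idx, gt_idx = linear_sum_assignment(total_cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))
```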

DEYOv3: DETR with YOLO for Real-time Object Detection

  • paper_url: http://arxiv.org/abs/2309.11851
  • repo_url: None
  • paper_authors: Haodong Ouyang
  • for: Proposes a new training method that improves the performance of real-time object detectors while reducing training cost.
  • methods: A step-by-step training method: in the first stage, a pre-trained YOLO detector initializes the end-to-end detector; in the second stage, the backbone and encoder are kept consistent with the DETR-like model, and only the detector is trained from scratch.
  • results: Proposes a brand-new real-time object detection model called DEYOv3: DEYOv3-N achieves 41.1% on COCO val2017 at 270 FPS on a T4 GPU, while DEYOv3-L achieves 51.3% AP at 102 FPS. Without additional training data, DEYOv3 surpasses all existing real-time object detectors in both speed and accuracy, and the N, S, and M scale models can be trained on the COCO dataset with a single 24GB RTX 3090 GPU.
    Abstract Recently, end-to-end object detectors have gained significant attention from the research community due to their outstanding performance. However, DETR typically relies on supervised pretraining of the backbone on ImageNet, which limits the practical application of DETR and the design of the backbone, affecting the model's potential generalization ability. In this paper, we propose a new training method called step-by-step training. Specifically, in the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector. In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch. Due to this training method, the object detector does not need the additional dataset (ImageNet) to train the backbone, which makes the design of the backbone more flexible and dramatically reduces the training cost of the detector, which is helpful for the practical application of the object detector. At the same time, compared with the DETR-like model, the step-by-step training method can achieve higher accuracy than the traditional training method of the DETR-like model. With the aid of this novel training method, we propose a brand-new end-to-end real-time object detection model called DEYOv3. DEYOv3-N achieves 41.1% on COCO val2017 and 270 FPS on T4 GPU, while DEYOv3-L achieves 51.3% AP and 102 FPS. Without the use of additional training data, DEYOv3 surpasses all existing real-time object detectors in terms of both speed and accuracy. It is worth noting that for models of N, S, and M scales, the training on the COCO dataset can be completed using a single 24GB RTX3090 GPU. Code will be released at https://github.com/ouyanghaodong/DEYOv3.

MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion

  • paper_url: http://arxiv.org/abs/2309.11847
  • repo_url: https://github.com/hedlen/meflut
  • paper_authors: Ting Jiang, Chuan Wang, Xinpeng Li, Ru Li, Haoqiang Fan, Shuaicheng Liu
  • for: High-quality multi-exposure image fusion (MEF).
  • methods: A new approach that encodes fusion weights into 1D lookup tables (LUTs), one per exposure, and incorporates attention mechanisms across frame, channel, and spatial dimensions during training to improve fusion quality.
  • results: Outperforms state-of-the-art methods on two datasets in both quality and efficiency, running a 4K image in under 4 ms on a PC GPU; the method has been shipped in millions of Android phones across multiple brands.
    Abstract In this paper, we introduce a new approach for high-quality multi-exposure image fusion (MEF). We show that the fusion weights of an exposure can be encoded into a 1D lookup table (LUT), which takes pixel intensity value as input and produces fusion weight as output. We learn one 1D LUT for each exposure, then all the pixels from different exposures can query 1D LUT of that exposure independently for high-quality and efficient fusion. Specifically, to learn these 1D LUTs, we involve attention mechanism in various dimensions including frame, channel and spatial ones into the MEF task so as to bring us significant quality improvement over the state-of-the-art (SOTA). In addition, we collect a new MEF dataset consisting of 960 samples, 155 of which are manually tuned by professionals as ground-truth for evaluation. Our network is trained by this dataset in an unsupervised manner. Extensive experiments are conducted to demonstrate the effectiveness of all the newly proposed components, and results show that our approach outperforms the SOTA in our and another representative dataset SICE, both qualitatively and quantitatively. Moreover, our 1D LUT approach takes less than 4ms to run a 4K image on a PC GPU. Given its high quality, efficiency and robustness, our method has been shipped into millions of Android mobiles across multiple brands world-wide. Code is available at: https://github.com/Hedlen/MEFLUT.
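Inference with the learned tables is simple enough to sketch: each exposure's 1D LUT maps pixel intensity to a fusion weight, the weights are normalized across exposures, and a weighted sum produces the fused image. The sketch assumes grayscale uint8 inputs; the learned LUT values themselves come from the training the paper describes.

```python
import numpy as np

def fuse_with_luts(exposures, luts):
    """exposures: list of K (H, W) uint8 images; luts: list of K (256,) float
    arrays mapping intensity -> fusion weight."""
    weights = [lut[img] for img, lut in zip(exposures, luts)]  # table lookups
    w = np.stack(weights)                                      # (K, H, W)
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)              # normalize per pixel
    return np.sum(w * np.stack(exposures).astype(float), axis=0)
```

The lookup makes per-pixel inference essentially free, which is consistent with the reported sub-4 ms runtime on 4K images.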

MoPA: Multi-Modal Prior Aided Domain Adaptation for 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.11839
  • repo_url: None
  • paper_authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Shenghai Yuan, Lihua Xie
  • for: This study aims to improve segmentation of rare object classes in 3D semantic segmentation domain adaptation, without requiring costly point-wise annotations.
  • methods: Multi-modal Prior Aided (MoPA) domain adaptation, which introduces Valid Ground-based Insertion (VGI) to rectify the imbalanced supervision signals during self-training and a SAM consistency loss, with the knowledge learned from modality-specific priors shared across modalities.
  • results: Experiments show that the method outperforms existing approaches on the MM-UDA benchmark, with higher accuracy on rare object classes.
    Abstract Multi-modal unsupervised domain adaptation (MM-UDA) for 3D semantic segmentation is a practical solution to embed semantic understanding in autonomous systems without expensive point-wise annotations. While previous MM-UDA methods can achieve overall improvement, they suffer from significant class-imbalanced performance, restricting their adoption in real applications. This imbalanced performance is mainly caused by: 1) self-training with imbalanced data and 2) the lack of pixel-wise 2D supervision signals. In this work, we propose Multi-modal Prior Aided (MoPA) domain adaptation to improve the performance of rare objects. Specifically, we develop Valid Ground-based Insertion (VGI) to rectify the imbalance supervision signals by inserting prior rare objects collected from the wild while avoiding introducing artificial artifacts that lead to trivial solutions. Meanwhile, our SAM consistency loss leverages the 2D prior semantic masks from SAM as pixel-wise supervision signals to encourage consistent predictions for each object in the semantic mask. The knowledge learned from modal-specific prior is then shared across modalities to achieve better rare object segmentation. Extensive experiments show that our method achieves state-of-the-art performance on the challenging MM-UDA benchmark. Code will be available at https://github.com/AronCao49/MoPA.

Automatic Endoscopic Ultrasound Station Recognition with Limited Data

  • paper_url: http://arxiv.org/abs/2309.11820
  • repo_url: https://github.com/amrita-medical-ai/eusml-labeller
  • paper_authors: Abhijit Ramesh, Anantha Nandanan, Nikhil Boggavarapu, Priya Nair MD, Gilad Gressel
  • for: This study aims to help physicians diagnose pancreatic cancer more efficiently by using AI to recognize "EUS stations" (the anatomical locations viewed during endoscopic ultrasound) in real time.
  • methods: A deep-learning-based computer-assisted diagnostic (CAD) tool that identifies EUS stations during procedures, together with an open-source labeling web app that streamlines annotation with minimal effort from clinicians, and Grad-CAM for interpretable and explainable visualizations.
  • results: Using only 43 procedures and no hyperparameter fine-tuning, the system achieves a balanced accuracy of 90%, comparable to the state of the art, while providing interpretable visualizations for clinicians.
    Abstract Pancreatic cancer is a lethal form of cancer that significantly contributes to cancer-related deaths worldwide. Early detection is essential to improve patient prognosis and survival rates. Despite advances in medical imaging techniques, pancreatic cancer remains a challenging disease to detect. Endoscopic ultrasound (EUS) is the most effective diagnostic tool for detecting pancreatic cancer. However, it requires expert interpretation of complex ultrasound images to complete a reliable patient scan. To obtain complete imaging of the pancreas, practitioners must learn to guide the endoscope into multiple "EUS stations" (anatomical locations), which provide different views of the pancreas. This is a difficult skill to learn, involving over 225 proctored procedures with the support of an experienced doctor. We build an AI-assisted tool that utilizes deep learning techniques to identify these stations of the stomach in real time during EUS procedures. This computer-assisted diagnostic (CAD) will help train doctors more efficiently. Historically, the challenge faced in developing such a tool has been the amount of retrospective labeling required by trained clinicians. To solve this, we developed an open-source user-friendly labeling web app that streamlines the process of annotating stations during the EUS procedure with minimal effort from the clinicians. Our research shows that employing only 43 procedures with no hyperparameter fine-tuning obtained a balanced accuracy of 90%, comparable to the current state of the art. In addition, we employ Grad-CAM, a visualization technology that provides clinicians with interpretable and explainable visualizations.
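Grad-CAM, the visualization the paper reports using, can be sketched in a few lines: the chosen convolutional layer's activations are weighted by the spatially averaged gradients of the target class score. The model, layer choice, and preprocessing are assumptions here.

```python
import torch

def grad_cam(model, layer, image, class_idx):
    """Minimal Grad-CAM: `layer` is a conv module inside `model`,
    `image` is a preprocessed (3, H, W) tensor."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads["v"].mean(dim=(2, 3), keepdim=True)            # channel weights
    cam = torch.relu((w * acts["v"]).sum(dim=1)).squeeze(0)  # (h, w) heatmap
    return (cam / (cam.max() + 1e-8)).detach()
```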

FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.11804
  • repo_url: https://github.com/xaviergrool/fgfusion
  • paper_authors: Zixuan Yin, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
  • for: This study aims to improve 3D detection accuracy in autonomous driving, where cameras and lidars serve as complementary sensors.
  • methods: Fine-Grained Lidar-Camera Fusion (FGFusion), which exploits multi-scale features of images and point clouds and fuses them in a fine-grained way: a dual-pathway hierarchy extracts high-level semantic and low-level detailed image features, an auxiliary network guides point cloud features to learn fine-grained spatial information, and multi-scale fusion (MSF) fuses the last N feature maps of image and point cloud.
  • results: Experiments on two popular autonomous driving benchmarks, KITTI and Waymo, demonstrate the effectiveness of the method.
    Abstract Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While most prevalent methods progressively downscale the 3D point clouds and camera images and then fuse the high-level features, the downscaled features inevitably lose low-level detailed information. In this paper, we propose Fine-Grained Lidar-Camera Fusion (FGFusion) that make full use of multi-scale features of image and point cloud and fuse them in a fine-grained way. First, we design a dual pathway hierarchy structure to extract both high-level semantic and low-level detailed features of the image. Second, an auxiliary network is introduced to guide point cloud features to better learn the fine-grained spatial information. Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of image and point cloud. Extensive experiments on two popular autonomous driving benchmarks, i.e. KITTI and Waymo, demonstrate the effectiveness of our method.

A Real-Time Multi-Task Learning System for Joint Detection of Face, Facial Landmark and Head Pose

  • paper_url: http://arxiv.org/abs/2309.11773
  • repo_url: None
  • paper_authors: Qingtian Wu, Liming Zhang
  • for: This study proposes a real-time multi-task detection system that jointly detects faces, facial landmarks, and head poses.
  • methods: The system builds on the widely adopted YOLOv8 detection framework, extending the original detection head with an additional landmark regression head to localize facial landmarks efficiently, alongside optimizations and enhancements to several modules of the original YOLOv8 framework.
  • results: Extensive experiments on the 300W-LP and AFLW2000-3D datasets verify that the model handles large-angle face poses while delivering real-time performance across these interconnected tasks.
    Abstract Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent, where accurate FLD relies on robust face detection, and HPE is intricately associated with these key points. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is the proposal of a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. This system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head by incorporating additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we conduct optimizations and enhancements on various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on 300W-LP and AFLW2000-3D datasets. The results obtained verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.

Fast Satellite Tensorial Radiance Field for Multi-date Satellite Imagery of Large Size

  • paper_url: http://arxiv.org/abs/2309.11767
  • repo_url: None
  • paper_authors: Tongtong Zhang, Yuanxiang Li
  • for: Reconstruction and novel view synthesis from satellite imagery, addressing the slow speed of existing NeRF models, their mandatory solar information input, and their limitations in handling large satellite images.
  • methods: Multiscale tensor decomposition to model color, volume density, and auxiliary variables, with a total variation loss that treats inconsistencies across multi-date imagery as a denoising task.
  • results: SatensoRF surpasses the state-of-the-art Sat-NeRF series in novel view synthesis while requiring fewer training parameters, yielding faster training and inference and lower computational demands.
    Abstract Existing NeRF models for satellite images suffer from slow speeds, mandatory solar information as input, and limitations in handling large satellite images. In response, we present SatensoRF, which significantly accelerates the entire process while employing fewer parameters for satellite imagery of large size. Besides, we observed that the prevalent assumption of Lambertian surfaces in neural radiance fields falls short for vegetative and aquatic elements. In contrast to the traditional hierarchical MLP-based scene representation, we have chosen a multiscale tensor decomposition approach for color, volume density, and auxiliary variables to model the lightfield with specular color. Additionally, to rectify inconsistencies in multi-date imagery, we incorporate total variation loss to restore the density tensor field and treat the problem as a denoising task. To validate our approach, we conducted assessments of SatensoRF using subsets from the SpaceNet multi-view dataset, which includes both multi-date and single-date multi-view RGB images. Our results clearly demonstrate that SatensoRF surpasses the state-of-the-art Sat-NeRF series in terms of novel view synthesis performance. Significantly, SatensoRF requires fewer parameters for training, resulting in faster training and inference speeds and reduced computational demands.
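The total variation regularizer applied to the density tensor field is standard enough to sketch directly: it penalizes differences between neighboring grid cells along each axis, smoothing out the inconsistencies introduced by multi-date imagery.

```python
import torch

def tv_loss_3d(density):
    """density: (X, Y, Z) tensor of volume densities on a regular grid."""
    dx = (density[1:, :, :] - density[:-1, :, :]).pow(2).mean()
    dy = (density[:, 1:, :] - density[:, :-1, :]).pow(2).mean()
    dz = (density[:, :, 1:] - density[:, :, :-1]).pow(2).mean()
    return dx + dy + dz
```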

Dictionary Attack on IMU-based Gait Authentication

  • paper_url: http://arxiv.org/abs/2309.11766
  • repo_url: https://github.com/rajeshjnu2006/dictionaryattackonimugait
  • paper_authors: Rajesh Kumar, Can Isik, Chilukuri K. Mohan
  • for: The paper investigates the vulnerability of gait-pattern-based authentication systems that use the inertial measurement units (IMUs) built into smartphones, and develops a dictionary attack on these systems.
  • methods: A dataset of 178 unique IMUGait patterns collected from nine physically and demographically diverse individuals, walking at various levels of four controllable and adaptable gait factors (speed, step length, step width, and thigh-lift); the attack idea is tested against a variety of user authentication models.
  • results: It is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce patterns matching the target's; the error rates before and after the attack challenge the belief that these systems are the most difficult to spoof.
    Abstract We present a novel adversarial model for authentication systems that use gait patterns recorded by the inertial measurement unit (IMU) built into smartphones. The attack idea is inspired by and named after the concept of a dictionary attack on knowledge (PIN or password) based authentication systems. In particular, this work investigates whether it is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce IMUGait patterns that match the target's IMUGait pattern. Nine physically and demographically diverse individuals walked at various levels of four predefined controllable and adaptable gait factors (speed, step length, step width, and thigh-lift), producing 178 unique IMUGait patterns. Each pattern attacked a wide variety of user authentication models. The deeper analysis of error rates (before and after the attack) challenges the belief that authentication systems based on IMUGait patterns are the most difficult to spoof; further research is needed on adversarial models and associated countermeasures.

SAM-OCTA: A Fine-Tuning Strategy for Applying Foundation Model to OCTA Image Segmentation Tasks

  • paper_url: http://arxiv.org/abs/2309.11758
  • repo_url: https://github.com/shellredia/sam-octa
  • paper_authors: Chengliang Wang, Xinrun Chen, Haojian Ning, Shiying Li
  • for: This paper addresses segmentation of specific targets in optical coherence tomography angiography (OCTA) images.
  • methods: Low-rank adaptation is used to fine-tune a foundation model, together with prompt point generation strategies tailored to the different segmentation tasks on OCTA datasets.
  • results: The method achieves state-of-the-art performance on the OCTA-500 dataset and additionally accomplishes local vessel segmentation and effective artery-vein segmentation, problems not well solved in previous works.
    Abstract In the analysis of optical coherence tomography angiography (OCTA) images, the operation of segmenting specific targets is necessary. Existing methods typically train on supervised datasets with limited samples (approximately a few hundred), which can lead to overfitting. To address this, the low-rank adaptation technique is adopted for foundation model fine-tuning and proposed corresponding prompt point generation strategies to process various segmentation tasks on OCTA datasets. This method is named SAM-OCTA and has been experimented on the publicly available OCTA-500 dataset. While achieving state-of-the-art performance metrics, this method accomplishes local vessel segmentation as well as effective artery-vein segmentation, which was not well-solved in previous works. The code is available at: https://github.com/ShellRedia/SAM-OCTA.
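Low-rank adaptation, the fine-tuning technique the paper applies, can be sketched as a frozen base layer plus a trainable low-rank update; the rank and scaling below are typical values, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank residual:
    y = W x + (alpha / r) * B A x, with only A and B updated."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False    # foundation weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In a SAM-style model, such wrappers would replace the projection layers of the image encoder, so only a small fraction of parameters is trained on the few hundred labeled OCTA samples.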

A Vision-Centric Approach for Static Map Element Annotation

  • paper_url: http://arxiv.org/abs/2309.11754
  • repo_url: https://github.com/manymuch/cama
  • paper_authors: Jiaxin Zhang, Shiyuan Chen, Haoran Yin, Ruohong Mei, Xuan Liu, Cong Yang, Qian Zhang, Wei Sui
  • for: Providing high-quality annotations of static map elements to improve the consistency and accuracy of static map construction algorithms.
  • methods: A vision-centric approach that generates high-quality 3D annotations of static map elements without LiDAR inputs.
  • results: Applied to the popular nuScenes dataset, CAMA provides efficient and highly accurate annotations; models trained with CAMA annotations achieve lower reprojection errors than those trained with the original nuScenes static map elements (e.g., 4.73 vs. 8.03 pixels).
    Abstract The recent development of online static map element (a.k.a. HD Map) construction algorithms has raised a vast demand for data with ground truth annotations. However, available public datasets currently cannot provide high-quality training data regarding consistency and accuracy. To this end, we present CAMA: a vision-centric approach for Consistent and Accurate Map Annotation. Without LiDAR inputs, our proposed framework can still generate high-quality 3D annotations of static map elements. Specifically, the annotation can achieve high reprojection accuracy across all surrounding cameras and is spatial-temporal consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with the original nuScenes static map element, models trained with annotations from CAMA achieve lower reprojection errors (e.g., 4.73 vs. 8.03 pixels).

PIE: Simulating Disease Progression via Progressive Image Editing

  • paper_url: http://arxiv.org/abs/2309.11745
  • repo_url: https://github.com/irohxu/pie
  • paper_authors: Kaizhao Liang, Xu Cao, Kuei-Da Liao, Tianren Gao, Wenqian Ye, Zhengyu Chen, Jianguo Cao, Tejas Nama, Jimeng Sun
  • for: Simulating disease progression and supporting clinical diagnosis and prognosis.
  • methods: Progressive Image Editing (PIE), which leverages text-to-image generative models for controlled manipulation of disease-related image features, personalized to each patient.
  • results: Generated progressions outperform methods such as Stable Diffusion Walk and Style-Based Manifold Extrapolation on CLIP score (realism) and disease classification confidence (alignment).
    Abstract Disease progression simulation is a crucial area of research that has significant implications for clinical diagnosis, prognosis, and treatment. One major challenge in this field is the lack of continuous medical imaging monitoring of individual patients over time. To address this issue, we develop a novel framework termed Progressive Image Editing (PIE) that enables controlled manipulation of disease-related image features, facilitating precise and realistic disease progression simulation. Specifically, we leverage recent advancements in text-to-image generative models to simulate disease progression accurately and personalize it for each patient. We theoretically analyze the iterative refining process in our framework as a gradient descent with an exponentially decayed learning rate. To validate our framework, we conduct experiments in three medical imaging domains. Our results demonstrate the superiority of PIE over existing methods such as Stable Diffusion Walk and Style-Based Manifold Extrapolation based on CLIP score (Realism) and Disease Classification Confidence (Alignment). Our user study collected feedback from 35 veteran physicians to assess the generated progressions. Remarkably, 76.2% of the feedback agrees with the fidelity of the generated progressions. To our best knowledge, PIE is the first of its kind to generate disease progression images meeting real-world standards. It is a promising tool for medical research and clinical practice, potentially allowing healthcare providers to model disease trajectories over time, predict future treatment responses, and improve patient outcomes.
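The paper analyzes its iterative refining process as gradient descent with an exponentially decayed learning rate; that schedule corresponds to the sketch below, where the decay factor and step count are illustrative.

```python
def exp_decayed_gd(x0, grad_fn, lr0=0.1, gamma=0.9, steps=20):
    """Gradient descent with lr_t = lr0 * gamma**t: each round makes a
    geometrically smaller change, mirroring how successive editing passes
    refine the image less and less."""
    x = x0
    for t in range(steps):
        x = x - (lr0 * gamma ** t) * grad_fn(x)
    return x
```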

CPR-Coach: Recognizing Composite Error Actions based on Single-class Training

  • paper_url: http://arxiv.org/abs/2309.11718
  • repo_url: None
  • paper_authors: Shunli Wang, Qing Yu, Shuaibing Wang, Dingkang Yang, Liuzhen Su, Xiao Zhao, Haopeng Kuang, Peixuan Zhang, Peng Zhai, Lihua Zhang
  • for: This paper aims to improve the assessment of cardiopulmonary resuscitation (CPR) skills in emergency treatment by building a vision-based system that recognizes error actions and assesses skills.
  • methods: The authors define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression, build a video dataset named CPR-Coach, and benchmark existing action recognition models on it; to handle the Single-class Training & Multi-class Testing problem, they propose a human-cognition-inspired framework named ImagineNet.
  • results: Extensive experiments show that ImagineNet improves the model's multi-error recognition performance under restricted supervision.
    Abstract The fine-grained medical action analysis task has received considerable attention from pattern recognition communities recently, but it faces the problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently, the assessment of CPR skills mainly depends on dummies and trainers, leading to high training costs and low efficiency. For the first time, this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically, we define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression and then develop a video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this paper thoroughly investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable Single-class Training & Multi-class Testing problem, we propose a human-cognition-inspired framework named ImagineNet to improve the model's multi-error recognition performance under restricted supervision. Extensive experiments verify the effectiveness of the framework. We hope this work could advance research toward fine-grained medical action analysis and skill assessment. The CPR-Coach dataset and the code of ImagineNet are publicly available on Github.

Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

  • paper_url: http://arxiv.org/abs/2309.11715
  • repo_url: None
  • paper_authors: Xiao Feng Zhang, Tian Yi Song, Jia Wei Yao
  • for: Image shadow removal and image restoration.
  • methods: Deshadow-Anything, a diffusion model that diffuses along the edges and textures of an image to remove shadows while preserving image details, together with Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training of diffusion.
  • results: Effective improvement in image restoration performance on shadow removal tasks.
    Abstract Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, which accounts for the generalization of large-scale datasets and is fine-tuned on them to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.

MoDA: Leveraging Motion Priors from Videos for Advancing Unsupervised Domain Adaptation in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.11711
  • repo_url: None
  • paper_authors: Fei Pan, Xu Yin, Seokju Lee, Sungeui Yoon, In So Kweon
  • for: This paper proposes a practical unsupervised domain adaptation (UDA) setting for semantic segmentation, where the unlabeled target domain consists of sequential video frames that are easy to collect in practice.
  • methods: It uses self-supervised learning of object motion from unlabeled videos and applies the learned motion priors to align the unlabeled target domain for semantic segmentation.
  • results: Experiments show that the method outperforms existing approaches on multiple benchmarks and can be combined with existing state-of-the-art methods to further improve performance.
    Abstract Unsupervised domain adaptation (UDA) is an effective approach to handle the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting where the target domain contains sequential frames of the unlabeled videos which are easy to collect in practice. A recent study suggests self-supervised learning of the object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA), that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment on the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking the instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA against existing approaches in the domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.

Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.11707
  • repo_url: None
  • paper_authors: Ping Li, Yu Zhang, Li Yuan, Huaxin Xiao, Binbin Lin, Xianghua Xu
  • for: Unsupervised Video Object Segmentation (VOS), i.e., segmenting the primary foreground objects in videos efficiently and without any prior knowledge.
  • methods: An efficient Long-Short Temporal Attention network (LSTA) with two main modules: Long Temporal Memory, which captures long-term global pixel relations between past frames and the current frame to model constantly present objects by their appearance pattern, and Short Temporal Attention, which captures short-term local pixel relations between a nearby frame and the current frame to model moving objects by their motion pattern. Efficient projection and a locality-based sliding window yield nearly linear time complexity for fast inference.
  • results: Promising performance with high efficiency on several benchmarks.
    Abstract Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge. However, previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time. This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view. Specifically, LSTA consists of two dominant modules, i.e., Long Temporal Memory and Short Temporal Attention. The former captures the long-term global pixel relations of the past frames and the current frame, which models constantly present objects by encoding appearance pattern. Meanwhile, the latter reveals the short-term local pixel relations of one nearby frame and the current frame, which models moving objects by encoding motion pattern. To speed up inference, the efficient projection and the locality-based sliding window are adopted to achieve nearly linear time complexity for the two light modules, respectively. Extensive empirical studies on several benchmarks have demonstrated promising performances of the proposed method with high efficiency.
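The two-branch design is easy to picture in code. Below is a minimal, self-contained sketch, not the authors' implementation: it attends densely to both the memory and the neighboring frame, whereas the paper uses efficient projection and a locality-based sliding window to reach near-linear cost; the function name, feature shapes, and fusion by concatenation are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def long_short_attention(curr, memory, neighbor):
    """Toy LSTA-style attention over flattened frame features.

    curr:     (N, C) features of the current frame (N = H*W pixels)
    memory:   (M, C) pooled features aggregated from past frames
    neighbor: (N, C) features of one nearby frame
    """
    scale = curr.shape[-1] ** 0.5

    # Long Temporal Memory: global attention from current pixels to the memory,
    # modeling constantly present objects via appearance similarity.
    attn_long = F.softmax(curr @ memory.t() / scale, dim=-1)
    long_feat = attn_long @ memory

    # Short Temporal Attention: this toy version attends densely to the
    # neighboring frame; the paper restricts it to a local sliding window.
    attn_short = F.softmax(curr @ neighbor.t() / scale, dim=-1)
    short_feat = attn_short @ neighbor

    return torch.cat([long_feat, short_feat], dim=-1)  # fused (N, 2C) features

feats = long_short_attention(torch.randn(64, 32), torch.randn(16, 32), torch.randn(64, 32))
print(feats.shape)  # torch.Size([64, 64])
```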

Meta OOD Learning for Continuously Adaptive OOD Detection

  • paper_url: http://arxiv.org/abs/2309.11705
  • repo_url: None
  • paper_authors: Xinheng Wu, Jie Lu, Zhen Fang, Guangquan Zhang
  • for: This work addresses reliably detecting out-of-distribution (OOD) samples for deep learning models deployed in the real world, where in-distribution (ID) and OOD distributions continuously vary and shift over time.
  • methods: It formalizes this setting as continuously adaptive out-of-distribution (CAOOD) detection and proposes meta out-of-distribution learning (MOL), a learning-to-adapt scheme that trains a well-initialized OOD detection model which can quickly adapt to newly arriving distributions at both training and test time.
  • results: Experiments on several OOD benchmarks show that MOL preserves both ID classification accuracy and OOD detection performance under continuously shifting distributions, making it more reliable and effective for real-world applications.
    Abstract Out-of-distribution (OOD) detection is crucial to modern deep learning applications by identifying and alerting about the OOD samples that should not be tested or used for making predictions. Current OOD detection methods have made significant progress when in-distribution (ID) and OOD samples are drawn from static distributions. However, this can be unrealistic when applied to real-world systems which often undergo continuous variations and shifts in ID and OOD distributions over time. Therefore, for an effective application in real-world systems, the development of OOD detection methods that can adapt to these dynamic and evolving distributions is essential. In this paper, we propose a novel and more realistic setting called continuously adaptive out-of-distribution (CAOOD) detection which targets on developing an OOD detection model that enables dynamic and quick adaptation to a new arriving distribution, with insufficient ID samples during deployment time. To address CAOOD, we develop meta OOD learning (MOL) by designing a learning-to-adapt diagram such that a good initialized OOD detection model is learned during the training process. In the testing process, MOL ensures OOD detection performance over shifting distributions by quickly adapting to new distributions with a few adaptations. Extensive experiments on several OOD benchmarks endorse the effectiveness of our method in preserving both ID classification accuracy and OOD detection performance on continuously shifting distributions.

cs.AI - 2023-09-21

Curriculum Reinforcement Learning via Morphology-Environment Co-Evolution

  • paper_url: http://arxiv.org/abs/2309.12529
  • repo_url: None
  • paper_authors: Shuang Ao, Tianyi Zhou, Guodong Long, Xuan Song, Jing Jiang
  • for: This work aims to train RL agents whose morphologies and policies learn and adapt across varying environments, improving their generality and transferability.
  • methods: It proposes morphology-environment co-evolution (MECE), in which the agent's morphology is continually updated to adapt to the changing environment while the environment is progressively modified to pose new challenges and stimulate morphology improvement.
  • results: Experiments show that morphologies and policies trained via MECE generalize significantly better to unseen test environments, and ablations show the co-evolution between morphology and environment is directly responsible for this success.
    Abstract Throughout long history, natural species have learned to survive by evolving their physical structures adaptive to the environment changes. In contrast, current reinforcement learning (RL) studies mainly focus on training an agent with a fixed morphology (e.g., skeletal structure and joint attributes) in a fixed environment, which can hardly generalize to changing environments or new tasks. In this paper, we optimize an RL agent and its morphology through ``morphology-environment co-evolution (MECE)'', in which the morphology keeps being updated to adapt to the changing environment, while the environment is modified progressively to bring new challenges and stimulate the improvement of the morphology. This leads to a curriculum to train generalizable RL, whose morphology and policy are optimized for different environments. Instead of hand-crafting the curriculum, we train two policies to automatically change the morphology and the environment. To this end, (1) we develop two novel and effective rewards for the two policies, which are solely based on the learning dynamics of the RL agent; (2) we design a scheduler to automatically determine when to change the environment and the morphology. In experiments on two classes of tasks, the morphology and RL policies trained via MECE exhibit significantly better generalization performance in unseen test environments than SOTA morphology optimization methods. Our ablation studies on the two MECE policies further show that the co-evolution between the morphology and environment is the key to the success.

Knowledge Graph Embedding: An Overview

  • paper_url: http://arxiv.org/abs/2309.12501
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Xiou Ge, Yun-Cheng Wang, Bin Wang, C. -C. Jay Kuo
  • for: This paper surveys the current state of research in knowledge graph completion (KGC), focusing on the two main branches of knowledge graph embedding (KGE) design: distance-based methods and semantic matching-based methods.
  • methods: It summarizes the connections between recently proposed models and uncovers an underlying trend among them; it also covers an emerging approach to KGC that leverages pre-trained language models (PLMs) together with textual descriptions of entities and relations.
  • results: It presents CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations respectively, and offers insights into integrating KGE methods with PLMs for KGC, with benefits in explainability and scalability.
    Abstract Many mathematical models have been leveraged to design embeddings for representing Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically-inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns that can be validated through both formal proofs and empirical results. In this paper, we make a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We discover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques including distance-based and semantic-based methods. We will also discuss an emerging approach for KG completion which leverages pre-trained language models (PLMs) and textual descriptions of entities and relations and offer insights into the integration of KGE embedding methods with PLMs for KG completion.
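To make the two branches concrete, here is a toy illustration of a canonical scoring function from each family: TransE for the distance-based branch and DistMult for semantic matching. The random embeddings are placeholders; real systems learn them from the KG.

```python
import numpy as np

def transe_score(h, r, t):
    # Distance-based: a triple (h, r, t) is plausible when h + r lands near t.
    return -float(np.linalg.norm(h + r - t))

def distmult_score(h, r, t):
    # Semantic matching: plausibility as a trilinear product of the embeddings.
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 50))  # placeholder 50-d embeddings
print(transe_score(h, r, t), distmult_score(h, r, t))
```

Link prediction then amounts to ranking candidate tails t by such a score for a given (h, r) query, which is why both families scale well to inference over large KGs.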

Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

  • paper_url: http://arxiv.org/abs/2309.12491
  • repo_url: https://github.com/tomlimi/MT-Tokenizer-Bias
  • paper_authors: Bar Iluz, Tomasz Limisiewicz, Gabriel Stanovsky, David Mareček
  • for: This paper focuses on the effect of tokenization on gender bias in machine translation, specifically examining the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer’s vocabulary, and gender bias.
  • methods: The authors use a combination of data analysis and machine learning techniques to study the impact of tokenization on gender bias in machine translation. They analyze the subword splits of gendered profession names in the training data and use fine-tuning of the token embedding layer to decrease the gender bias in the model.
  • results: The authors find that the imbalance of gender forms in the model’s training corpus is a major factor contributing to gender bias, and that analyzing subword splits provides good estimates of gender-form imbalance in the training data. They also show that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing translation quality.
    Abstract We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality.
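A quick way to see the subword-splitting effect the paper analyzes is to tokenize gendered profession pairs and count the pieces. The sketch below uses a multilingual BERT tokenizer purely as an assumed stand-in for an MT system's subword vocabulary; the observation to look for is feminine forms receiving more pieces than masculine ones.

```python
from transformers import AutoTokenizer

# Assumed stand-in vocabulary; the paper studies MT models' own tokenizers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for word in ["doctor", "doctora", "profesor", "profesora"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces} ({len(pieces)} subwords)")
```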

Studying and improving reasoning in humans and machines

  • paper_url: http://arxiv.org/abs/2309.12485
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Nicolas Yax, Hernan Anlló, Stefano Palminteri
  • for: investigate and compare reasoning in large language models (LLM) and humans
  • methods: used cognitive psychology tools traditionally dedicated to the study of (bounded) rationality
  • results: most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning; despite this superficial similarity, there were important differences from human-like reasoning, with model limitations disappearing almost entirely in more recent LLM releases.
    Abstract In the present study, we investigate and compare reasoning in large language models (LLM) and humans using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. To do so, we presented to human participants and an array of pretrained LLMs new variants of classical cognitive experiments, and cross-compared their performances. Our results showed that most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an in-depth comparison between humans and LLMs indicated important differences with human-like reasoning, with models limitations disappearing almost entirely in more recent LLMs releases. Moreover, we show that while it is possible to devise strategies to induce better performance, humans and machines are not equally-responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding

  • paper_url: http://arxiv.org/abs/2309.12482
  • repo_url: None
  • paper_authors: Devleena Das, Sonia Chernova, Been Kim
  • for: This research develops a method that helps non-AI experts understand AI decision making, so they can use AI systems effectively in daily tasks.
  • methods: It uses concept-based explanations, with concepts defined for action-selection (sequential decision-making) settings, and proposes a joint embedding model that maps state-action pairs to concept-based explanations.
  • results: Experiments show that the State2Explanation (S2E) framework improves the agent’s learning rate and task performance while also providing explanations that improve end-users’ task completion.
    Abstract With more complex AI systems used by non-AI experts to complete daily tasks, there is an increasing effort to develop methods that produce explanations of AI decision making understandable by non-AI experts. Towards this effort, leveraging higher-level concepts and producing concept-based explanations have become a popular method. Most concept-based explanations have been developed for classification techniques, and we posit that the few existing methods for sequential decision making are limited in scope. In this work, we first contribute a desiderata for defining "concepts" in sequential decision making settings. Additionally, inspired by the Protege Effect which states explaining knowledge often reinforces one's self-learning, we explore the utility of concept-based explanations providing a dual benefit to the RL agent by improving agent learning rate, and to the end-user by improving end-user understanding of agent decision making. To this end, we contribute a unified framework, State2Explanation (S2E), that involves learning a joint embedding model between state-action pairs and concept-based explanations, and leveraging such learned model to both (1) inform reward shaping during an agent's training, and (2) provide explanations to end-users at deployment for improved task performance. Our experimental validations, in Connect 4 and Lunar Lander, demonstrate the success of S2E in providing a dual-benefit, successfully informing reward shaping and improving agent learning rate, as well as significantly improving end user task performance at deployment time.
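As a rough sketch of the idea of a joint embedding between state-action pairs and concept-based explanations, with the learned similarity reused for reward shaping, one could write something like the following. The architecture, dimensions, and the `shaped_reward` helper are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Toy joint embedding of (state, action) pairs and concept explanations."""
    def __init__(self, state_dim, n_actions, n_concepts, dim=64):
        super().__init__()
        self.sa_net = nn.Sequential(nn.Linear(state_dim + n_actions, 128),
                                    nn.ReLU(), nn.Linear(128, dim))
        self.concept_emb = nn.Embedding(n_concepts, dim)

    def forward(self, state, action_onehot):
        return self.sa_net(torch.cat([state, action_onehot], dim=-1))

    def shaped_reward(self, state, action_onehot, concept_id):
        # Reward shaping: similarity between the state-action embedding and the
        # embedding of the concept its explanation refers to.
        z = self.forward(state, action_onehot)
        c = self.concept_emb(concept_id)
        return F.cosine_similarity(z, c, dim=-1)

model = JointEmbedder(state_dim=8, n_actions=4, n_concepts=10)
s = torch.randn(2, 8)
a = F.one_hot(torch.tensor([1, 3]), 4).float()
print(model.shaped_reward(s, a, torch.tensor([2, 5])))  # one shaped reward per sample
```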

HANS, are you clever? Clever Hans Effect Analysis of Neural Systems

  • paper_url: http://arxiv.org/abs/2309.12481
  • repo_url: None
  • paper_authors: Leonardo Ranaldi, Fabio Massimo Zanzotto
  • for: This paper examines the robustness of instruction-tuned large language models (It-LLMs) when the order of answer choices varies.
  • methods: The authors use several multiple-choice question (MCQ) benchmarks to build solid assessments of model abilities, and introduce adversarial examples to probe model reliability.
  • results: The models exhibit a selection bias tied to the order of the choices, in particular a positional bias toward the first option, and show structural heuristics in their decision making under few-shot settings; applying the Chain-of-Thought (CoT) technique elicits more deliberate reasoning and mitigates the bias, yielding more robust models.
    Abstract Instruction-tuned Large Language Models (It-LLMs) have been exhibiting outstanding abilities to reason around cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. In fact, several multiple-choice questions (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities. However, earlier works are demonstrating the presence of inherent "order bias" in It-LLMs, posing challenges to the appropriate evaluation. In this paper, we investigate It-LLMs' resilience abilities towards a series of probing tests using four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when varying the order of the choices, which reveals a selection bias and brings into discussion reasoning abilities. Following a correlation between first positions and model choices due to positional bias, we hypothesized the presence of structural heuristics in the decision-making process of the It-LLMs, strengthened by including significant examples in few-shot scenarios. Finally, by using the Chain-of-Thought (CoT) technique, we elicit the model to reason and mitigate the bias by obtaining more robust models.
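The order-bias probe is simple to reproduce in spirit: re-ask the same question under every permutation of the options and histogram the position the model picks; a skewed histogram reveals positional bias. The `model_choose` callable below is a hypothetical stand-in for querying an It-LLM.

```python
import itertools
from collections import Counter

def positional_bias(model_choose, question, options, answer):
    """Ask under every ordering of the options and record the chosen position."""
    picks = Counter()
    correct = 0
    perms = list(itertools.permutations(options))
    for perm in perms:
        idx = model_choose(question, list(perm))  # returns the chosen position
        picks[idx] += 1
        correct += perm[idx] == answer
    return picks, correct / len(perms)

# Stand-in "model" that always prefers the first option, the failure mode probed here.
first_option_model = lambda q, opts: 0
print(positional_bias(first_option_model, "2+2=?", ["4", "5", "6", "7"], "4"))
# Counter({0: 24}) with accuracy 0.25: every pick is pinned to position 0.
```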

SAVME: Efficient Safety Validation for Autonomous Systems Using Meta-Learning

  • paper_url: http://arxiv.org/abs/2309.12474
  • repo_url: None
  • paper_authors: Marc R. Schlichting, Nina V. Boord, Anthony L. Corso, Mykel J. Kochenderfer
  • for: This work aims to quickly discover potential failures of autonomous systems so that risk can be assessed prior to deployment.
  • methods: The authors propose a Bayesian approach that combines meta-learning strategies with a multi-armed bandit framework to accelerate validation: it learns distributions over scenario parameters that are prone to triggering failures in the system under test, as well as a distribution over simulator fidelity settings that enable fast and accurate simulation. In the spirit of meta-learning, they also assess whether the learned fidelity distribution speeds up learning the scenario-parameter distributions for new scenarios.
  • results: Using a cutting-edge 3D driving simulator with 16 fidelity settings and an autonomous vehicle stack, evaluated on scenarios drawn from an autonomous vehicle pre-crash typology, the approach achieves up to an 18x speedup over traditional methods that rely solely on a high-fidelity simulator.
    Abstract Discovering potential failures of an autonomous system is important prior to deployment. Falsification-based methods are often used to assess the safety of such systems, but the cost of running many accurate simulation can be high. The validation can be accelerated by identifying critical failure scenarios for the system under test and by reducing the simulation runtime. We propose a Bayesian approach that integrates meta-learning strategies with a multi-armed bandit framework. Our method involves learning distributions over scenario parameters that are prone to triggering failures in the system under test, as well as a distribution over fidelity settings that enable fast and accurate simulations. In the spirit of meta-learning, we also assess whether the learned fidelity settings distribution facilitates faster learning of the scenario parameter distributions for new scenarios. We showcase our methodology using a cutting-edge 3D driving simulator, incorporating 16 fidelity settings for an autonomous vehicle stack that includes camera and lidar sensors. We evaluate various scenarios based on an autonomous vehicle pre-crash typology. As a result, our approach achieves a significant speedup, up to 18 times faster compared to traditional methods that solely rely on a high-fidelity simulator.
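One minimal instantiation of a multi-armed bandit over fidelity settings is Thompson sampling with a cost penalty, sketched below. The arm statistics and costs are made-up numbers, and the paper's actual method additionally learns distributions over scenario parameters; this only shows the fidelity-selection trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical failure-finding rates per fidelity setting (unknown to the agent):
# higher fidelity finds failures more reliably but costs more simulation time.
true_rate = np.array([0.05, 0.15, 0.30])
cost = np.array([1.0, 4.0, 16.0])

alpha = np.ones(3)  # Beta posterior over each arm's failure-finding rate
beta = np.ones(3)

found, spent = 0, 0.0
for _ in range(200):
    # Thompson sampling, discounted by cost: sample a rate per arm and pick
    # the best expected failures found per unit of simulation time.
    sample = rng.beta(alpha, beta) / cost
    arm = int(np.argmax(sample))
    hit = rng.random() < true_rate[arm]
    alpha[arm] += hit
    beta[arm] += 1 - hit
    found += hit
    spent += cost[arm]

print(f"failures found: {found}, simulation cost: {spent:.0f}")
```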

Multimodal Deep Learning for Scientific Imaging Interpretation

  • paper_url: http://arxiv.org/abs/2309.12460
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Abdulelah S. Alshehri, Franklin L. Lee, Shihu Wang
  • for: This study aims to linguistically emulate human-like interactions with Scanning Electron Microscopy (SEM) images, specifically of glass materials, and to evaluate the accuracy of such interactions.
  • methods: The approach is a multimodal deep learning framework that distills insights from textual and visual data harvested from peer-reviewed articles, further augmented by GPT-4 for refined data synthesis and evaluation, to extract key features and detect defects in images.
  • results: The model (GlassLLaVA) crafts accurate interpretations, identifies key features, and detects defects in previously unseen SEM images; the authors also introduce versatile evaluation metrics suited to a range of scientific imaging applications, allowing benchmarking against research-grounded answers.
    Abstract In the domain of scientific imaging, interpreting visual data often demands an intricate combination of human expertise and deep comprehension of the subject materials. This study presents a novel methodology to linguistically emulate and subsequently evaluate human-like interactions with Scanning Electron Microscopy (SEM) images, specifically of glass materials. Leveraging a multimodal deep learning framework, our approach distills insights from both textual and visual data harvested from peer-reviewed articles, further augmented by the capabilities of GPT-4 for refined data synthesis and evaluation. Despite inherent challenges--such as nuanced interpretations and the limited availability of specialized datasets--our model (GlassLLaVA) excels in crafting accurate interpretations, identifying key features, and detecting defects in previously unseen SEM images. Moreover, we introduce versatile evaluation metrics, suitable for an array of scientific imaging applications, which allows for benchmarking against research-grounded answers. Benefiting from the robustness of contemporary Large Language Models, our model adeptly aligns with insights from research papers. This advancement not only underscores considerable progress in bridging the gap between human and machine interpretation in scientific imaging, but also hints at expansive avenues for future research and broader application.

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

  • paper_url: http://arxiv.org/abs/2309.12455
  • repo_url: https://github.com/jbshp/longdocfactscore
  • paper_authors: Jennifer A Bishop, Qianqian Xie, Sophia Ananiadou
  • for: This study evaluates automatic text summarization metrics on long-document data sets and proposes a new evaluation framework called LongDocFACTScore.
  • methods: It uses pre-trained language models and human-annotated data sets to assess how well automatic metrics measure factual consistency.
  • results: LongDocFACTScore outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when evaluating long document summarization data sets, and performs comparably to state-of-the-art metrics against human measures of factual consistency on short document data sets.
    Abstract Maintaining factual consistency is a critical issue in abstractive text summarisation, however, it cannot be assessed by traditional automatic metrics used for evaluating text summarisation, such as ROUGE scoring. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits, and are therefore not suitable for evaluating long document text summarisation. Moreover, there is limited research evaluating whether existing automatic evaluation metrics are fit for purpose when applied to long document data sets. In this work, we evaluate the efficacy of automatic metrics at assessing factual consistency in long document text summarisation and propose a new evaluation framework LongDocFACTScore. This framework allows metrics to be extended to any length document. This framework outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. Furthermore, we show LongDocFACTScore has performance comparable to state-of-the-art metrics when evaluated against human measures of factual consistency on short document data sets. We make our code and annotated data publicly available: https://github.com/jbshp/LongDocFACTScore.
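The core trick that lets a short-context factuality scorer handle arbitrarily long documents can be sketched as follows: score each summary sentence against only its most relevant source chunks, then aggregate. This is a hedged reconstruction, not the metric's exact definition; `pair_scorer` and the crude lexical `retrieve` ranker are placeholders.

```python
def long_doc_fact_score(summary_sentences, source_chunks, pair_scorer, k=3):
    """Score each summary sentence against its k most relevant source chunks,
    sidestepping the token limit of `pair_scorer` (any short-context
    factuality metric), then average over sentences."""
    def retrieve(sentence, chunks, k):
        # Crude lexical-overlap ranking; a real system would use embeddings.
        words = set(sentence.lower().split())
        overlap = lambda c: len(words & set(c.lower().split()))
        return sorted(chunks, key=overlap, reverse=True)[:k]

    scores = []
    for sent in summary_sentences:
        best_chunks = retrieve(sent, source_chunks, k)
        scores.append(max(pair_scorer(sent, chunk) for chunk in best_chunks))
    return sum(scores) / len(scores)

# Usage with a dummy scorer that just checks for any word overlap.
dummy = lambda s, c: float(len(set(s.split()) & set(c.split())) > 0)
print(long_doc_fact_score(["the cat sat"], ["a cat sat on the mat", "dogs bark"], dummy))
```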

Ensemble Neural Networks for Remaining Useful Life (RUL) Prediction

  • paper_url: http://arxiv.org/abs/2309.12445
  • repo_url: None
  • paper_authors: Ahbishek Srinivasan, Juan Carlos Andresen, Anders Holst
  • for: The paper proposes an ensemble neural network approach for probabilistic remaining useful life (RUL) prediction that considers both aleatoric and epistemic uncertainties and decouples them, yielding more accurate and interpretable predictions.
  • methods: Ensemble neural networks model the probabilistic nature of RUL predictions, and the aleatoric and epistemic uncertainties are decoupled to convey how confident each prediction is.
  • results: Tested on NASA’s turbofan jet engine CMAPSS data set, the approach shows how the uncertainties can be modeled and disentangled, and it is evaluated on several metrics and compared against current state-of-the-art methods.
    Abstract A core part of maintenance planning is a monitoring system that provides a good prognosis on health and degradation, often expressed as remaining useful life (RUL). Most of the current data-driven approaches for RUL prediction focus on single-point prediction. These point prediction approaches do not include the probabilistic nature of the failure. The few probabilistic approaches to date either include the aleatoric uncertainty (which originates from the system), or the epistemic uncertainty (which originates from the model parameters), or both simultaneously as a total uncertainty. Here, we propose ensemble neural networks for probabilistic RUL predictions which considers both uncertainties and decouples these two uncertainties. These decoupled uncertainties are vital in knowing and interpreting the confidence of the predictions. This method is tested on NASA's turbofan jet engine CMAPSS data-set. Our results show how these uncertainties can be modeled and how to disentangle the contribution of aleatoric and epistemic uncertainty. Additionally, our approach is evaluated on different metrics and compared against the current state-of-the-art methods.
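In the standard ensemble formulation, the decoupling the authors describe corresponds to the law of total variance: the mean of the members' predicted variances is the aleatoric part, and the variance of their predicted means is the epistemic part. A toy numerical example with made-up predictions from five ensemble members:

```python
import numpy as np

# Each ensemble member predicts a mean RUL and an aleatoric variance for one input.
mu = np.array([102.0, 98.0, 105.0, 100.0, 95.0])    # per-member mean RUL
sigma2 = np.array([16.0, 20.0, 18.0, 15.0, 22.0])   # per-member predicted variance

rul_estimate = mu.mean()
aleatoric = sigma2.mean()   # irreducible noise originating from the system
epistemic = mu.var()        # disagreement among members (model uncertainty)
total = aleatoric + epistemic

print(f"RUL = {rul_estimate:.1f}, aleatoric = {aleatoric:.1f}, "
      f"epistemic = {epistemic:.1f}, total = {total:.1f}")
```

High epistemic variance flags inputs the model has not seen enough data for, while high aleatoric variance flags inherently noisy degradation, which is exactly the interpretability benefit the paper argues for.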

Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges

  • paper_url: http://arxiv.org/abs/2309.12426
  • repo_url: None
  • paper_authors: Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha
  • for: This paper explores using large language models (LLMs) to augment existing extractive reading comprehension data sets, in order to improve downstream task performance.
  • methods: GPT-4 is used to automatically generate question-answer annotations for the data sets, which are then used to fine-tune models for the target tasks.
  • results: Using GPT-4 for data generation and fine-tuning improves performance on low-resource reading comprehension tasks while substantially reducing the cost of human annotation; the study also surfaces distinctive opportunities and challenges that call for further research.
    Abstract Large Language Models (LLMs) have demonstrated impressive zero shot performance on a wide range of NLP tasks, demonstrating the ability to reason and apply commonsense. A relevant application is to use them for creating high quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation processes has the potential to save large amounts of time, money and effort that goes into manually labelling datasets. In this paper, we evaluate the performance of GPT-4 as a replacement for human annotators for low resource reading comprehension tasks, by comparing performance after fine tuning, and the cost associated with annotation. This work serves to be the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges. Additionally, we release augmented versions of low resource datasets, that will allow the research community to create further benchmarks for evaluation of generated datasets.

Event Prediction using Case-Based Reasoning over Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2309.12423
  • repo_url: https://github.com/solashirai/www-evcbr
  • paper_authors: Sola Shirai, Debarun Bhattacharjya, Oktie Hassanzadeh
  • for: Predicting properties of new consequent events from cause-effect relations in a knowledge graph.
  • methods: A case-based reasoning model (EvCBR) that requires no retraining: it uses statistical measures to identify similar events and performs path-based predictions.
  • results: On a novel dataset of newsworthy events with causal relations curated from Wikidata, EvCBR outperforms baselines including translational-distance-based, GNN-based, and rule-based link prediction models.
    Abstract Applying link prediction (LP) methods over knowledge graphs (KG) for tasks such as causal event prediction presents an exciting opportunity. However, typical LP models are ill-suited for this task as they are incapable of performing inductive link prediction for new, unseen event entities and they require retraining as knowledge is added or changed in the underlying KG. We introduce a case-based reasoning model, EvCBR, to predict properties about new consequent events based on similar cause-effect events present in the KG. EvCBR uses statistical measures to identify similar events and performs path-based predictions, requiring no training step. To generalize our methods beyond the domain of event prediction, we frame our task as a 2-hop LP task, where the first hop is a causal relation connecting a cause event to a new effect event and the second hop is a property about the new event which we wish to predict. The effectiveness of our method is demonstrated using a novel dataset of newsworthy events with causal relations curated from Wikidata, where EvCBR outperforms baselines including translational-distance-based, GNN-based, and rule-based LP models.

Constraints First: A New MDD-based Model to Generate Sentences Under Constraints

  • paper_url: http://arxiv.org/abs/2309.12415
  • repo_url: None
  • paper_authors: Alexandre Bonlarron, Aurélie Calabrèse, Pierre Kornprobst, Jean-Charles Régin
  • for: This paper develops a new method for generating strongly constrained texts, targeting standardized sentence generation for vision screening.
  • methods: The problem is formalized as a discrete combinatorial optimization problem solved with multivalued decision diagrams (MDD), after which a language model (GPT-2) is applied to keep the best candidate sentences.
  • results: The method produces hundreds of bona-fide candidate sentences, far more than the few dozen usually available in the well-known MNREAD vision screening test, a major breakthrough for standardized sentence generation; since it adapts easily to other languages (it is detailed for English and French), it could make the MNREAD test even more valuable and usable.
    Abstract This paper introduces a new approach to generating strongly constrained texts. We consider standardized sentence generation for the typical application of vision screening. To solve this problem, we formalize it as a discrete combinatorial optimization problem and utilize multivalued decision diagrams (MDD), a well-known data structure to deal with constraints. In our context, one key strength of MDD is to compute an exhaustive set of solutions without performing any search. Once the sentences are obtained, we apply a language model (GPT-2) to keep the best ones. We detail this for English and also for French where the agreement and conjugation rules are known to be more complex. Finally, with the help of GPT-2, we get hundreds of bona-fide candidate sentences. When compared with the few dozen sentences usually available in the well-known vision screening test (MNREAD), this brings a major breakthrough in the field of standardized sentence generation. Also, as it can be easily adapted for other languages, it has the potential to make the MNREAD test even more valuable and usable. More generally, this paper highlights MDD as a convincing alternative for constrained text generation, especially when the constraints are hard to satisfy, but also for many other prospects.
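The final LM-ranking step can be sketched with an off-the-shelf GPT-2: score each constraint-satisfying candidate from the MDD by its mean negative log-likelihood and keep the most fluent ones. This is an illustrative reconstruction, not the paper's exact scoring procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def nll(sentence):
    # Mean negative log-likelihood under GPT-2; lower means more fluent.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Two hypothetical MDD outputs that both satisfy hard constraints (e.g., length):
candidates = ["the boy eats a red apple in the garden",
              "garden a apple red in eats boy the"]
print(sorted(candidates, key=nll))  # the fluent sentence ranks first
```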

ForceSight: Text-Guided Mobile Manipulation with Visual-Force Goals

  • paper_url: http://arxiv.org/abs/2309.12312
  • repo_url: https://github.com/force-sight/forcesight
  • paper_authors: Jeremy A. Collins, Cody Houff, You Liang Tan, Charles C. Kemp
  • for: This paper studies a text-guided mobile manipulation system that predicts visual-force goals with a deep neural network.
  • methods: Given a single RGBD image and a text prompt, a deep model predicts a target end-effector pose in the camera frame (kinematic goal) and the associated forces (force goal).
  • results: Deployed on a mobile manipulator with an eye-in-hand RGBD camera, ForceSight performed tasks such as precision grasps, drawer opening, and object handovers with an 81% success rate in unseen environments; in a separate experiment, relying exclusively on visual servoing and ignoring force goals dropped the success rate from 90% to 45%, showing that force goals significantly enhance performance.
    Abstract We present ForceSight, a system for text-guided mobile manipulation that predicts visual-force goals using a deep neural network. Given a single RGBD image combined with a text prompt, ForceSight determines a target end-effector pose in the camera frame (kinematic goal) and the associated forces (force goal). Together, these two components form a visual-force goal. Prior work has demonstrated that deep models outputting human-interpretable kinematic goals can enable dexterous manipulation by real robots. Forces are critical to manipulation, yet have typically been relegated to lower-level execution in these systems. When deployed on a mobile manipulator equipped with an eye-in-hand RGBD camera, ForceSight performed tasks such as precision grasps, drawer opening, and object handovers with an 81% success rate in unseen environments with object instances that differed significantly from the training data. In a separate experiment, relying exclusively on visual servoing and ignoring force goals dropped the success rate from 90% to 45%, demonstrating that force goals can significantly enhance performance. The appendix, videos, code, and trained models are available at https://force-sight.github.io/.

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

  • paper_url: http://arxiv.org/abs/2309.12311
  • repo_url: None
  • paper_authors: Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
  • for: To improve 3D visual grounding for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
  • methods: A large language model (LLM) decomposes complex natural-language queries into semantic constituents; a visual grounding tool such as OpenScene or LERF localizes candidate objects in the 3D scene; the LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision.
  • results: Evaluated on the ScanRefer benchmark without any labeled training data, LLM-Grounder generalizes to novel 3D scenes and arbitrary text queries and achieves state-of-the-art zero-shot grounding accuracy; the study shows LLMs significantly improve grounding, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics.
    Abstract 3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website https://chat-with-nerf.github.io/ .

Rehearsal: Simulating Conflict to Teach Conflict Resolution

  • paper_url: http://arxiv.org/abs/2309.12309
  • repo_url: None
  • paper_authors: Omar Shaikh, Valentino Chai, Michele J. Gelfand, Diyi Yang, Michael S. Bernstein
  • for: This paper aims to provide a system for users to practice and learn effective conflict resolution strategies through simulated conversations with a believable interlocutor.
  • methods: The paper introduces a system called Rehearsal, which uses a large language model conditioned on the Interest-Rights-Power (IRP) theory to generate counterfactual scenarios and guide users towards de-escalating difficult conversations.
  • results: In a between-subjects evaluation, participants who received simulated training from Rehearsal significantly improved their performance in unaided conflicts, reducing their use of escalating competitive strategies by 67% and doubling their use of cooperative strategies.
    Abstract Interpersonal conflict is an uncomfortable but unavoidable fact of life. Navigating conflict successfully is a skill -- one that can be learned through deliberate practice -- but few have access to effective training or feedback. To expand this access, we introduce Rehearsal, a system that allows users to rehearse conflicts with a believable simulated interlocutor, explore counterfactual "what if?" scenarios to identify alternative conversational paths, and learn through feedback on how and when to apply specific conflict strategies. Users can utilize Rehearsal to practice handling a variety of predefined conflict scenarios, from office disputes to relationship issues, or they can choose to create their own. To enable Rehearsal, we develop IRP prompting, a method of conditioning output of a large language model on the influential Interest-Rights-Power (IRP) theory from conflict resolution. Rehearsal uses IRP to generate utterances grounded in conflict resolution theory, guiding users towards counterfactual conflict resolution strategies that help de-escalate difficult conversations. In a between-subjects evaluation, 40 participants engaged in an actual conflict with a confederate after training. Compared to a control group with lecture material covering the same IRP theory, participants with simulated training from Rehearsal significantly improved their performance in the unaided conflict: they reduced their use of escalating competitive strategies by an average of 67%, while doubling their use of cooperative strategies. Overall, Rehearsal highlights the potential effectiveness of language models as tools for learning and practicing interpersonal skills.

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12307
  • repo_url: https://github.com/dvlab-research/longlora
  • paper_authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia
  • for: This work aims to extend the context sizes of pre-trained large language models (LLMs) at limited computation cost while retaining their original architectures.
  • methods: Two techniques enable efficient context extension: fine-tuning with sparse local attention (shift short attention), which saves substantial computation while matching the performance of fine-tuning with vanilla attention; and a revisited parameter-efficient fine-tuning regime in which LoRA works well for context extension provided the embedding and normalization layers are trainable.
  • results: LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B, extending LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine; the authors also collect LongQA, a dataset of more than 3,000 long-context question-answer pairs for supervised fine-tuning.
    Abstract We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA adopts LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.
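The paper notes that shift short attention needs only about two lines of training code. Below is a hedged reconstruction of that trick inferred from the description; the tensor layout, group size, and the choice to shift half the heads are assumptions, not the released implementation.

```python
import torch

def shift_short_attention_inputs(qkv, group_size):
    """Sketch of the shifted-group trick: qkv has shape
    (batch, seq_len, 3, num_heads, head_dim). Half of the heads are shifted by
    half a group so information flows between the local attention groups;
    attention is then computed within each group as usual."""
    b, n, three, h, d = qkv.shape
    # Shift the second half of the heads by half the group size along the sequence.
    qkv[:, :, :, h // 2:] = qkv[:, :, :, h // 2:].roll(-group_size // 2, dims=1)
    # Fold the sequence into independent groups for local (sparse) attention.
    return qkv.reshape(b * n // group_size, group_size, three, h, d)

qkv = torch.randn(1, 8192, 3, 32, 64)
print(shift_short_attention_inputs(qkv, group_size=2048).shape)
# torch.Size([4, 2048, 3, 32, 64]): four independent 2048-token attention groups
```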

Environment-biased Feature Ranking for Novelty Detection Robustness

  • paper_url: http://arxiv.org/abs/2309.12301
  • repo_url: None
  • paper_authors: Stefan Smeu, Elena Burceanu, Emanuela Haller, Andrei Liviu Nicolicioiu
  • for: robust novelty detection, aiming to detect novelties in terms of semantic content while being invariant to changes in other, irrelevant factors.
  • methods: propose a method that starts with a pretrained embedding and a multi-env setup, and ranks features based on their environment-focus, using a per-feature score based on feature distribution variance between envs.
  • results: improve the overall performance by up to 6%, both in covariance and sub-population shift cases, both for a real and a synthetic benchmark.
    Abstract We tackle the problem of robust novelty detection, where we aim to detect novelties in terms of semantic content while being invariant to changes in other, irrelevant factors. Specifically, we operate in a setup with multiple environments, where we determine the set of features that are associated more with the environments, rather than to the content relevant for the task. Thus, we propose a method that starts with a pretrained embedding and a multi-env setup and manages to rank the features based on their environment-focus. First, we compute a per-feature score based on the feature distribution variance between envs. Next, we show that by dropping the highly scored ones, we manage to remove spurious correlations and improve the overall performance by up to 6%, both in covariance and sub-population shift cases, both for a real and a synthetic benchmark, that we introduce for this task.
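The per-feature score can be approximated very directly: compare how each feature's statistics move across environments. A minimal sketch follows, using the variance of per-environment means as the environment-focus score; the paper's exact statistic may differ.

```python
import numpy as np

def environment_focus_scores(features, env_ids):
    """For each embedding dimension, measure how much its mean varies across
    environments. High-variance features are environment-biased candidates
    for dropping."""
    envs = np.unique(env_ids)
    per_env_means = np.stack([features[env_ids == e].mean(axis=0) for e in envs])
    return per_env_means.var(axis=0)  # one score per feature dimension

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 16))
envs = rng.integers(0, 3, size=300)
feats[:, 0] += envs * 2.0  # make feature 0 track the environment

scores = environment_focus_scores(feats, envs)
keep = np.argsort(scores)[:-4]  # drop the 4 most environment-biased features
print(scores.argmax())          # 0: the planted spurious feature scores highest
```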

See to Touch: Learning Tactile Dexterity through Visual Incentives

  • paper_url: http://arxiv.org/abs/2309.12300
  • repo_url: None
  • paper_authors: Irmak Guzey, Yinlong Dai, Ben Evans, Soumith Chintala, Lerrel Pinto
  • for: To improve the dexterous and precise manipulation capabilities of multi-fingered robots.
  • methods: Tactile-based policies are optimized using vision-based rewards: visual representations are learned with a contrastive objective, a reward function is constructed from them via optimal-transport-based matching on one human demonstration, and online reinforcement learning then optimizes the tactile policy to maximize the visual reward.
  • results: On six challenging tasks, such as peg pick-and-place, unstacking bowls, and flipping slender objects, TAVI achieves a 73% success rate with a four-fingered Allegro robot hand, a 108% improvement over policies using tactile and vision-based rewards and a 135% improvement over policies without tactile observational input.
    Abstract Equipping multi-fingered robots with tactile sensing is crucial for achieving the precise, contact-rich, and dexterous manipulation that humans excel at. However, relying solely on tactile sensing fails to provide adequate cues for reasoning about objects' spatial configurations, limiting the ability to correct errors and adapt to changing situations. In this paper, we present Tactile Adaptation from Visual Incentives (TAVI), a new framework that enhances tactile-based dexterity by optimizing dexterous policies using vision-based rewards. First, we use a contrastive-based objective to learn visual representations. Next, we construct a reward function using these visual representations through optimal-transport based matching on one human demonstration. Finally, we use online reinforcement learning on our robot to optimize tactile-based policies that maximize the visual reward. On six challenging tasks, such as peg pick-and-place, unstacking bowls, and flipping slender objects, TAVI achieves a success rate of 73% using our four-fingered Allegro robot hand. The increase in performance is 108% higher than policies using tactile and vision-based rewards and 135% higher than policies without tactile observational input. Robot videos are best viewed on our project website: https://see-to-touch.github.io/.

Learning to Drive Anywhere

  • paper_url: http://arxiv.org/abs/2309.12295
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Ruizhao Zhu, Peng Huang, Eshed Ohn-Bar, Venkatesh Saligrama
  • for: This work develops a single autonomous driving model that adapts its driving decisions across geographical locations with diverse conditions and rules of the road.
  • methods: AnyD is a geographically-aware conditional imitation learning (CIL) model with a high-capacity geo-location-based channel attention mechanism that adapts to local nuances while flexibly modeling similarities among regions in a data-driven manner; it is trained by optimizing a contrastive imitation objective.
  • results: AnyD outperforms CIL baselines by over 14% in open-loop evaluation and over 30% in closed-loop testing on CARLA, across multiple datasets, cities, and scalable deployment paradigms (centralized, semi-supervised, and distributed training).
    Abstract Human drivers can seamlessly adapt their driving decisions across geographical locations with diverse conditions and rules of the road, e.g., left vs. right-hand traffic. In contrast, existing models for autonomous driving have been thus far only deployed within restricted operational domains, i.e., without accounting for varying driving behaviors across locations or model scalability. In this work, we propose AnyD, a single geographically-aware conditional imitation learning (CIL) model that can efficiently learn from heterogeneous and globally distributed data with dynamic environmental, traffic, and social characteristics. Our key insight is to introduce a high-capacity geo-location-based channel attention mechanism that effectively adapts to local nuances while also flexibly modeling similarities among regions in a data-driven manner. By optimizing a contrastive imitation objective, our proposed approach can efficiently scale across inherently imbalanced data distributions and location-dependent events. We demonstrate the benefits of our AnyD agent across multiple datasets, cities, and scalable deployment paradigms, i.e., centralized, semi-supervised, and distributed agent training. Specifically, AnyD outperforms CIL baselines by over 14% in open-loop evaluation and 30% in closed-loop testing on CARLA.
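One plausible reading of the geo-location-based channel attention is a learned per-region gate over backbone channels. The module below is an illustrative guess at that mechanism, not the released architecture; the region count, embedding gate, and feature shapes are assumptions.

```python
import torch
import torch.nn as nn

class GeoChannelAttention(nn.Module):
    """Per-location embeddings produce a sigmoid gate over feature channels,
    so the driving backbone is modulated by where in the world it operates."""
    def __init__(self, num_regions, channels):
        super().__init__()
        self.region_emb = nn.Embedding(num_regions, channels)

    def forward(self, features, region_id):
        # features: (B, C, H, W) from the driving backbone
        gate = torch.sigmoid(self.region_emb(region_id))  # (B, C)
        return features * gate[:, :, None, None]          # channel-wise re-weighting

attn = GeoChannelAttention(num_regions=11, channels=256)
x = torch.randn(2, 256, 32, 32)
print(attn(x, torch.tensor([3, 7])).shape)  # torch.Size([2, 256, 32, 32])
```

Because regions share the same gate parameterization, similar locations can learn similar gates, which matches the paper's goal of modeling cross-region similarity in a data-driven way.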
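
The geo-location-based channel attention can be pictured as a learned per-region gate over feature channels. Below is a hedged PyTorch sketch of that idea; the embedding size, gating network, and discrete region ids are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GeoChannelAttention(nn.Module):
    """Channel attention gated by a learned region embedding (a sketch of
    the geo-conditional idea; layer sizes are illustrative assumptions)."""
    def __init__(self, num_regions, channels, emb_dim=64):
        super().__init__()
        self.region_emb = nn.Embedding(num_regions, emb_dim)
        self.gate = nn.Sequential(
            nn.Linear(emb_dim, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, feats, region_id):
        # feats: (B, C, H, W) perception features; region_id: (B,) ints.
        w = self.gate(self.region_emb(region_id))     # (B, C)
        return feats * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

x = torch.randn(2, 128, 16, 16)
attn = GeoChannelAttention(num_regions=10, channels=128)
print(attn(x, torch.tensor([0, 3])).shape)  # torch.Size([2, 128, 16, 16])
```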

The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”

  • paper_url: http://arxiv.org/abs/2309.12288
  • repo_url: https://github.com/lukasberglund/reversal_curse
  • paper_authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
  • for: The paper reveals a surprising failure of generalization in auto-regressive large language models (LLMs) when it comes to reversals of statements.
  • methods: The authors fine-tune GPT-3 and Llama-1 on fictitious statements, and also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities.
  • results: The models fail to answer questions whose information was only seen in the reverse order, demonstrating a basic failure of logical deduction. GPT-4 correctly answers forward questions about real-world celebrities 79% of the time, but only 33% of the time for the reversed questions. This failure, called the "Reversal Curse", is robust across model sizes and model families.
    Abstract We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B'' occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of 'Abyssal Melodies'" and showing that they fail to correctly answer "Who composed 'Abyssal Melodies?'". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse. Code is available at https://github.com/lukasberglund/reversal_curse.
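
The core measurement is whether a causal LM assigns higher likelihood to the trained direction than to its reversal. Below is a small sketch of that comparison using the Hugging Face `transformers` API; GPT-2 is a stand-in for the much larger models studied in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for GPT-3 / Llama-1
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_logprob(text):
    """Mean per-token log-likelihood the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)
    return -out.loss.item()  # HF loss is the mean token NLL

forward = "Uriah Hawthorne is the composer of Abyssal Melodies."
reverse = "The composer of Abyssal Melodies is Uriah Hawthorne."
print(avg_logprob(forward), avg_logprob(reverse))
```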

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12284
  • repo_url: https://github.com/meta-math/MetaMath
  • paper_authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
  • for: Improving the mathematical reasoning ability of large language models (LLMs).
  • methods: The authors bootstrap mathematical questions by rewriting them from multiple perspectives without extra knowledge, producing a new dataset called MetaMathQA, and then fine-tune LLaMA-2 models on it.
  • results: On two popular mathematical reasoning benchmarks (GSM8K and MATH), the MetaMath models outperform a suite of open-source LLMs by a significant margin: MetaMath-7B reaches 66.4% on GSM8K and 19.4% on MATH, and MetaMath-70B reaches 82.3% on GSM8K, slightly better than GPT-3.5-Turbo.
    Abstract Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems due to the complex reasoning procedures required. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

LLMR: Real-time Prompting of Interactive Worlds using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12276
  • repo_url: https://github.com/asem010/legend-pice
  • paper_authors: Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores Fernandez, Jaron Lanier
  • for: This work develops Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences.
  • methods: LLMR uses novel strategies for cases where ideal training data is scarce or the design goal requires synthesizing internal dynamics, intuitive analysis, or advanced interactivity. It relies on text interaction and the Unity game engine, incorporating techniques for scene understanding, task planning, self-debugging, and memory management.
  • results: LLMR outperforms standard GPT-4 by 4x in average error rate. It demonstrates cross-platform interoperability on several example worlds and handles diverse creation and modification tasks; a usability study (N=11) found that participants had positive experiences with the system and would use it again.
    Abstract We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set that revealed participants had positive experiences with the system and would use it again.

Enabling Quartile-based Estimated-Mean Gradient Aggregation As Baseline for Federated Image Classifications

  • paper_url: http://arxiv.org/abs/2309.12267
  • repo_url: None
  • paper_authors: Yusen Wu, Jamie Deng, Hao Chen, Phuong Nguyen, Yelena Yesha
  • for: This paper proposes a new method addressing data heterogeneity and security in federated learning, while providing a fundamental reference point (baseline) for advanced aggregation techniques.
  • methods: The estimated mean aggregation (EMA) method handles malicious outliers through trimmed means and uncovers data heterogeneity, ensuring that trained models remain adaptable across the various client datasets.
  • results: Extensive experiments show that EMA consistently achieves high accuracy and area under the curve (AUC) compared with alternative methods, establishing it as a robust baseline for evaluating the effectiveness and security of federated learning aggregation.
    Abstract Federated Learning (FL) has revolutionized how we train deep neural networks by enabling decentralized collaboration while safeguarding sensitive data and improving model performance. However, FL faces two crucial challenges: the diverse nature of data held by individual clients and the vulnerability of the FL system to security breaches. This paper introduces an innovative solution named Estimated Mean Aggregation (EMA) that not only addresses these challenges but also provides a fundamental reference point as a $\mathsf{baseline}$ for advanced aggregation techniques in FL systems. EMA's significance lies in its dual role: enhancing model security by effectively handling malicious outliers through trimmed means and uncovering data heterogeneity to ensure that trained models are adaptable across various client datasets. Through a wealth of experiments, EMA consistently demonstrates high accuracy and area under the curve (AUC) compared to alternative methods, establishing itself as a robust baseline for evaluating the effectiveness and security of FL aggregation methods. EMA's contributions thus offer a crucial step forward in advancing the efficiency, security, and versatility of decentralized deep learning in the context of FL.
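
The heart of EMA is a trimmed mean over client updates, which discards the most extreme (potentially malicious) contributions before averaging. A minimal NumPy sketch follows; the trim ratio and the flat-gradient layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def trimmed_mean_aggregate(client_grads, trim_ratio=0.1):
    """Coordinate-wise trimmed mean over client gradient vectors:
    drop the k smallest and k largest values per coordinate, then
    average what remains (trim_ratio is an illustrative choice)."""
    G = np.stack(client_grads)              # (num_clients, dim)
    k = int(trim_ratio * G.shape[0])
    G_sorted = np.sort(G, axis=0)           # sort each coordinate
    trimmed = G_sorted[k:G.shape[0] - k]    # discard extremes
    return trimmed.mean(axis=0)

# Toy usage: 10 honest clients plus one outlier pushing a huge update.
rng = np.random.default_rng(1)
grads = [rng.normal(size=5) for _ in range(10)] + [np.full(5, 100.0)]
print(trimmed_mean_aggregate(grads, trim_ratio=0.1))  # outlier suppressed
```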

SALSA-CLRS: A Sparse and Scalable Benchmark for Algorithmic Reasoning

  • paper_url: http://arxiv.org/abs/2309.12253
  • repo_url: https://github.com/jkminder/salsa-clrs
  • paper_authors: Julian Minder, Florian Grötschla, Joël Mathys, Roger Wattenhofer
  • for: This work extends the CLRS algorithmic learning benchmark with an emphasis on scalability and the use of sparse representations.
  • methods: The benchmark adapts algorithms from the original CLRS suite and introduces new problems drawn from distributed and randomized algorithms.
  • results: A thorough empirical evaluation shows that SALSA-CLRS is markedly more scalable and better suited to sparse representations than the original CLRS execution model.
    Abstract We introduce an extension to the CLRS algorithmic learning benchmark, prioritizing scalability and the utilization of sparse representations. Many algorithms in CLRS require global memory or information exchange, mirrored in its execution model, which constructs fully connected (not sparse) graphs based on the underlying problem. Despite CLRS's aim of assessing how effectively learned algorithms can generalize to larger instances, the existing execution model becomes a significant constraint due to its demanding memory requirements and runtime (hard to scale). However, many important algorithms do not demand a fully connected graph; these algorithms, primarily distributed in nature, align closely with the message-passing paradigm employed by Graph Neural Networks. Hence, we propose SALSA-CLRS, an extension of the current CLRS benchmark specifically with scalability and sparseness in mind. Our approach includes adapted algorithms from the original CLRS benchmark and introduces new problems from distributed and randomized algorithms. Moreover, we perform a thorough empirical evaluation of our benchmark. Code is publicly available at https://github.com/jkminder/SALSA-CLRS.

Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection

  • paper_url: http://arxiv.org/abs/2309.12247
  • repo_url: https://github.com/ictmcg/arg
  • paper_authors: Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, Peng Qi
  • for: This paper investigates whether and how large language models (LLMs) can help with fake news detection.
  • methods: The authors propose the adaptive rationale guidance network (ARG), in which a fine-tuned small language model (BERT) selectively acquires insights from the LLM's multi-perspective rationales.
  • results: Experiments show that ARG and its distilled, rationale-free variant ARG-D outperform three types of baselines (SLM-based, LLM-based, and combinations of small and large language models), with ARG-D serving cost-sensitive scenarios that cannot query LLMs.
    Abstract Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT 3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly to conclude. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which services cost-sensitive scenarios without inquiring LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.

ChaCha: Leveraging Large Language Models to Prompt Children to Share Their Emotions about Personal Events

  • paper_url: http://arxiv.org/abs/2309.12244
  • repo_url: None
  • paper_authors: Woosuk Seo, Chanmo Yang, Young-Ho Kim
  • for: This study explores how children, who typically learn to express emotions by sharing stories and feelings with others, can be prompted to share personal events and the emotions attached to them.
  • methods: ChaCha combines a state machine with large language models (LLMs) to keep the dialogue on track while still carrying on free-form conversations.
  • results: In an exploratory study with 20 children (aged 8-12), participants perceived ChaCha as a close friend and shared stories on various topics, such as family trips and personal achievements.
    Abstract Children typically learn to identify and express emotions through sharing their stories and feelings with others, particularly their family. However, it is challenging for parents or siblings to have emotional communication with children since children are still developing their communication skills. We present ChaCha, a chatbot that encourages and guides children to share personal events and associated emotions. ChaCha combines a state machine and large language models (LLMs) to keep the dialogue on track while carrying on free-form conversations. Through an exploratory study with 20 children (aged 8-12), we examine how ChaCha prompts children to share personal events and guides them to describe associated emotions. Participants perceived ChaCha as a close friend and shared their stories on various topics, such as family trips and personal achievements. Based on the quantitative and qualitative findings, we discuss opportunities for leveraging LLMs to design child-friendly chatbots to support children in sharing their emotions.
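
The design couples a fixed conversational state machine with free-form LLM generation: the state supplies the instruction, the LLM supplies the wording. The sketch below illustrates that pattern; the state names, prompts, and the `llm` stub are hypothetical, not ChaCha's actual implementation.

```python
# Minimal sketch of a state machine steering an LLM conversation.
# `llm` is a placeholder for any chat-completion call; the states and
# prompts below are illustrative assumptions, not the paper's design.
STATES = {
    "explore": "Ask the child, warmly, what happened to them recently.",
    "label":   "Help the child name the emotion behind the shared event.",
    "find":    "Gently ask why they think they felt that way.",
    "wrap_up": "Thank the child and summarize what they shared.",
}
ORDER = ["explore", "label", "find", "wrap_up"]

def llm(system_prompt, history):
    # Stand-in for an API call such as an OpenAI/HF chat completion.
    return f"[reply generated under instruction: {system_prompt!r}]"

def run_session(user_turns):
    history, state_idx = [], 0
    for turn in user_turns:
        history.append(("child", turn))
        reply = llm(STATES[ORDER[state_idx]], history)
        history.append(("bot", reply))
        state_idx = min(state_idx + 1, len(ORDER) - 1)  # advance the FSM
    return history

for speaker, text in run_session(["We went camping!", "It was fun", "ok"]):
    print(speaker, ":", text)
```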

Explainable Artificial Intelligence for Drug Discovery and Development – A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2309.12177
  • repo_url: None
  • paper_authors: Roohallah Alizadehsani, Sadiq Hussain, Rene Ripardo Calixto, Victor Hugo C. de Albuquerque, Mohamad Roshanzamir, Mohamed Rahouti, Senthil Kumar Jagatheesaperumal
  • for: This article provides a comprehensive overview of the current state of explainable artificial intelligence (XAI) in drug discovery, covering the various XAI methods, their applications, and the challenges and limitations of XAI techniques in this domain.
  • methods: The survey details how XAI techniques are applied across drug discovery, including target identification, compound design, and toxicity prediction.
  • results: The article summarizes the state of the art, highlights the challenges and limitations of XAI in drug discovery, and suggests potential future research directions for the field.
    Abstract The field of drug discovery has experienced a remarkable transformation with the advent of artificial intelligence (AI) and machine learning (ML) technologies. However, as these AI and ML models are becoming more complex, there is a growing need for transparency and interpretability of the models. Explainable Artificial Intelligence (XAI) is a novel approach that addresses this issue and provides a more interpretable understanding of the predictions made by machine learning models. In recent years, there has been an increasing interest in the application of XAI techniques to drug discovery. This review article provides a comprehensive overview of the current state-of-the-art in XAI for drug discovery, including various XAI methods, their application in drug discovery, and the challenges and limitations of XAI techniques in drug discovery. The article also covers the application of XAI in drug discovery, including target identification, compound design, and toxicity prediction. Furthermore, the article suggests potential future research directions for the application of XAI in drug discovery. The aim of this review article is to provide a comprehensive understanding of the current state of XAI in drug discovery and its potential to transform the field.

SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap

  • paper_url: http://arxiv.org/abs/2309.12382
  • repo_url: None
  • paper_authors: Daehee Kim, Yoonsik Kim, DongHyun Kim, Yumin Lim, Geewook Kim, Taeho Kil
  • for: This work aims to improve the effectiveness of language-model (LM)-based pre-training for visual document understanding.
  • methods: The authors propose SCOB, a pre-training method that uses character-wise supervised contrastive learning with online text rendering to bridge the gap between the document and scene-text domains; it also enables weakly supervised learning, reducing annotation costs.
  • results: Extensive benchmarks show that SCOB generally improves vanilla pre-training methods and performs comparably to state-of-the-art approaches, suggesting it can serve read-type pre-training broadly.
    Abstract Inspired by the great success of language model (LM)-based pre-training, recent studies in visual document understanding have explored LM-based pre-training methods for modeling text within document images. Among them, pre-training that reads all text from an image has shown promise, but often exhibits instability and even fails when applied to broader domains, such as those involving both visual documents and scene text images. This is a substantial limitation for real-world scenarios, where the processing of text image inputs in diverse domains is essential. In this paper, we investigate effective pre-training tasks in the broader domains and also propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering to effectively pre-train document and scene text domains by bridging the domain gap. Moreover, SCOB enables weakly supervised learning, significantly reducing annotation costs. Extensive benchmarks demonstrate that SCOB generally improves vanilla pre-training methods and achieves comparable performance to state-of-the-art methods. Our findings suggest that SCOB can be served generally and effectively for read-type pre-training methods. The code will be available at https://github.com/naver-ai/scob.
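
The character-wise supervised contrastive objective pulls together embeddings of the same character class regardless of whether they come from rendered documents or scene text. Below is a PyTorch sketch of such a loss; the temperature and batch layout are illustrative assumptions rather than SCOB's exact formulation.

```python
import torch
import torch.nn.functional as F

def char_supcon_loss(embs, char_labels, tau=0.1):
    """Supervised contrastive loss in which embeddings with the same
    character label attract each other across rendering domains.
    embs: (N, D) character embeddings; char_labels: (N,) int ids."""
    z = F.normalize(embs, dim=1)
    logits = z @ z.t() / tau
    self_mask = torch.eye(len(z), dtype=torch.bool)
    logits = logits.masked_fill(self_mask, float("-inf"))  # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = (char_labels[:, None] == char_labels[None, :]) & ~self_mask
    # Average log-probability over each anchor's positives, then negate.
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()   # skip anchors with no positive

embs = torch.randn(8, 32, requires_grad=True)
labels = torch.tensor([0, 1, 0, 2, 1, 2, 0, 1])
loss = char_supcon_loss(embs, labels)
loss.backward()
print(loss.item())
```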

Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features

  • paper_url: http://arxiv.org/abs/2309.12140
  • repo_url: None
  • paper_authors: Travis Zhang, Katie Luo, Cheng Perng Phoo, Yurong You, Wei-Lun Chao, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger
  • for: Improving the accuracy and generalization of 3D object detection systems for self-driving cars.
  • methods: The method adapts object detectors to new driving environments using unlabeled repeated traversals of multiple locations; statistics computed from the repeated LiDAR scans guide the adaptation, and the detector is enhanced with spatially quantized historical features, a lightweight regression head for feature regularization, and a statistics-driven self-training procedure to stabilize training.
  • results: The approach is detector-agnostic and achieves improvements of up to 20 points in detection performance, especially for pedestrians and distant objects.
    Abstract The rapid development of 3D object detection systems for self-driving cars has significantly improved accuracy. However, these systems struggle to generalize across diverse driving environments, which can lead to safety-critical failures in detecting traffic participants. To address this, we propose a method that utilizes unlabeled repeated traversals of multiple locations to adapt object detectors to new driving environments. By incorporating statistics computed from repeated LiDAR scans, we guide the adaptation process effectively. Our approach enhances LiDAR-based detection models using spatial quantized historical features and introduces a lightweight regression head to leverage the statistics for feature regularization. Additionally, we leverage the statistics for a novel self-training process to stabilize the training. The framework is detector model-agnostic and experiments on real-world datasets demonstrate significant improvements, achieving up to a 20-point performance gain, especially in detecting pedestrians and distant objects. Code is available at https://github.com/zhangtravis/Hist-DA.

On the relationship between Benchmarking, Standards and Certification in Robotics and AI

  • paper_url: http://arxiv.org/abs/2309.12139
  • repo_url: None
  • paper_authors: Alan F. T. Winfield, Matthew Studley
  • for: This paper examines the closely related processes of benchmarking, standards, and certification in robotics and AI, and their role in Responsible Innovation.
  • methods: The authors develop these themes through examples drawn from benchmarking, standards, and certification practice.
  • results: The paper argues that these three linked processes are not only useful but vital to the broader practice of Responsible Innovation.
    Abstract Benchmarking, standards and certification are closely related processes. Standards can provide normative requirements that robotics and AI systems may or may not conform to. Certification generally relies upon conformance with one or more standards as the key determinant of granting a certificate to operate. And benchmarks are sets of standardised tests against which robots and AI systems can be measured. Benchmarks therefore can be thought of as informal standards. In this paper we will develop these themes with examples from benchmarking, standards and certification, and argue that these three linked processes are not only useful but vital to the broader practice of Responsible Innovation.

OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media

  • paper_url: http://arxiv.org/abs/2309.12137
  • repo_url: None
  • paper_authors: Fatimah Alzamzami, Abdulmotaleb El Saddik
  • for: This work aims to improve machine translation for dialectal Arabic, meeting the linguistic reality of social media platforms where Modern Standard Arabic is rarely used.
  • methods: The authors follow a contextual translation strategy, translating English tweets into four Arabic dialects (Gulf, Yemeni, Iraqi, and Levantine) under a proposed guideline framework for content translation.
  • results: The authors validate the dataset by training neural machine translation models for the four dialects, which show superior performance when trained on it.
    Abstract While resources for English language are fairly sufficient to understand content on social media, similar resources in Arabic are still immature. The main reason that the resources in Arabic are insufficient is that Arabic has many dialects in addition to the standard version (MSA). Arabs do not use MSA in their daily communications; rather, they use dialectal versions. Unfortunately, social users transfer this phenomenon into their use of social media platforms, which in turn has raised an urgent need for building suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA fail to work well with Arabic dialects. In light of this, it is necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various dialects of Arabic. Unlike for MSA that shows advanced progress in MT systems, little effort has been exerted to utilize Arabic dialects for MT systems. While few attempts have been made to build translation datasets for dialectal Arabic, they are domain dependent and are not OSN cultural-language friendly. In this work, we attempt to alleviate these limitations by proposing an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be universally applicable for translation between foreign languages and local dialects. We validated the authenticity of our proposed dataset by developing neural MT models for four Arabic dialects. Our results have shown a superior performance of our NMT models trained using our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.

A knowledge representation approach for construction contract knowledge modeling

  • paper_url: http://arxiv.org/abs/2309.12132
  • repo_url: None
  • paper_authors: Chunmo Zheng, Saika Wong, Xing Su, Yinqiu Tang
  • for: This study uses large language models (LLMs) to raise the automation level of construction contract management, reducing human error, time, and cost.
  • methods: It proposes the Nested Contract Knowledge Graph (NCKG), a structured representation of expert contract knowledge that constrains the automated process and guards against convincing but inaccurate or misleading LLM output.
  • results: An LLM-assisted contract review pipeline enhanced with NCKG knowledge achieves promising performance in contract risk review, pointing toward more reliable and interpretable contract management.
    Abstract The emergence of large language models (LLMs) presents an unprecedented opportunity to automate construction contract management, reducing human errors and saving significant time and costs. However, LLMs may produce convincing yet inaccurate and misleading content due to a lack of domain expertise. To address this issue, expert-driven contract knowledge can be represented in a structured manner to constrain the automatic contract management process. This paper introduces the Nested Contract Knowledge Graph (NCKG), a knowledge representation approach that captures the complexity of contract knowledge using a nested structure. It includes a nested knowledge representation framework, a NCKG ontology built on the framework, and an implementation method. Furthermore, we present the LLM-assisted contract review pipeline enhanced with external knowledge in NCKG. Our pipeline achieves a promising performance in contract risk reviewing, shedding light on the combination of LLM and KG towards more reliable and interpretable contract management.
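
The nested structure can be pictured as a knowledge graph whose triples may point at whole subgraphs, letting clause-level knowledge sit inside contract-level knowledge. A minimal Python sketch of such a container is below; the field names and example clauses are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class NestedGraph:
    """A graph whose triple tails may themselves be graphs (nesting)."""
    name: str
    triples: list = field(default_factory=list)  # (head, relation, tail)

    def add(self, head, relation, tail):
        # `tail` may be a plain literal or another NestedGraph.
        self.triples.append((head, relation, tail))

payment_clause = NestedGraph("PaymentClause")
payment_clause.add("Owner", "shall_pay_within", "30 days of invoice")
payment_clause.add("LatePayment", "accrues_interest_at", "1.5% per month")

contract = NestedGraph("ConstructionContract")
contract.add("Contract", "has_party", "Owner")
contract.add("Contract", "has_party", "Contractor")
contract.add("Contract", "has_clause", payment_clause)  # the nesting step

def walk(g, depth=0):
    """Print the graph, recursing into nested subgraphs."""
    for h, r, t in g.triples:
        if isinstance(t, NestedGraph):
            print("  " * depth + f"({h}, {r}, <{t.name}>)")
            walk(t, depth + 1)
        else:
            print("  " * depth + f"({h}, {r}, {t})")

walk(contract)
```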

Incentivizing Massive Unknown Workers for Budget-Limited Crowdsensing: From Off-Line and On-Line Perspectives

  • paper_url: http://arxiv.org/abs/2309.12113
  • repo_url: None
  • paper_authors: Feng Li, Yuqi Chai, Huan Yang, Pengfei Hu, Lingjie Duan
  • for: Incentivizing massive numbers of unknown workers in budget-limited crowdsensing, addressing the challenges of a limited budget and a dynamic worker population.
  • methods: A Context-Aware Combinatorial Multi-Armed Bandit (CACI) mechanism that exploits the exploration-exploitation trade-off in a partitioned context space rather than over individual workers, with both off-line and on-line variants to handle workers joining and departing.
  • results: The paper derives theoretical upper bounds on the regrets, proves truthfulness and individual rationality, and verifies the mechanisms' efficacy through experiments on both synthetic and real datasets.
    Abstract Although the uncertainties of the workers can be addressed by the standard Combinatorial Multi-Armed Bandit (CMAB) framework in existing proposals through a trade-off between exploration and exploitation, we may not have sufficient budget to enable the trade-off among the individual workers, especially when the number of the workers is huge while the budget is limited. Moreover, the standard CMAB usually assumes the workers always stay in the system, whereas the workers may join in or depart from the system over time, such that what we have learnt for an individual worker cannot be applied after the worker leaves. To address the above challenging issues, in this paper, we first propose an off-line Context-Aware CMAB-based Incentive (CACI) mechanism. We innovate in leveraging the exploration-exploitation trade-off in a elaborately partitioned context space instead of the individual workers, to effectively incentivize the massive unknown workers with very limited budget. We also extend the above basic idea to the on-line setting where unknown workers may join in or depart from the systems dynamically, and propose an on-line version of the CACI mechanism. Specifically, by the exploitation-exploration trade-off in the context space, we learn to estimate the sensing ability of any unknown worker (even it never appeared in the system before) according to its context information. We perform rigorous theoretical analysis to reveal the upper bounds on the regrets of our CACI mechanisms and to prove their truthfulness and individual rationality, respectively. Extensive experiments on both synthetic and real datasets are also conducted to verify the efficacy of our mechanisms.
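
The key move is to run the bandit over cells of a partitioned context space rather than over individual workers, so a never-before-seen worker is scored through its cell's statistics. A toy NumPy sketch of UCB over such a partition is below; the cell granularity, quality model, and unit payments are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
BINS = 5                                     # partition each context dim
counts = np.zeros((BINS, BINS))              # recruitments per context cell
means = np.zeros((BINS, BINS))               # estimated sensing quality

def cell(ctx):
    """Map a context in [0, 1]^2 to its partition cell."""
    return tuple(np.clip((np.asarray(ctx) * BINS).astype(int), 0, BINS - 1))

def ucb_score(ctx, t):
    i, j = cell(ctx)
    if counts[i, j] == 0:
        return np.inf                        # force exploring unseen cells
    return means[i, j] + np.sqrt(2 * np.log(t + 1) / counts[i, j])

def update(ctx, reward):
    i, j = cell(ctx)
    counts[i, j] += 1
    means[i, j] += (reward - means[i, j]) / counts[i, j]

budget, t = 200, 0
while budget > 0:
    workers = rng.uniform(size=(20, 2))       # fresh unknown workers' contexts
    pick = max(workers, key=lambda c: ucb_score(c, t))
    reward = float(rng.uniform() < pick.mean())  # toy ground-truth quality
    update(pick, reward)
    budget -= 1                               # unit payment per recruitment
    t += 1
print(np.round(means, 2))
```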

PEFTT: Parameter-Efficient Fine-Tuning for low-resource Tibetan pre-trained language models

  • paper_url: http://arxiv.org/abs/2309.12109
  • repo_url: None
  • paper_authors: Zhou Mingjun, Daiqing Zhuoma, Qun Nuo, Nyima Tashi
  • for: This study explores parameter-efficient fine-tuning so that low-resource Tibetan pre-trained language models can be adapted without the prohibitive cost of full training.
  • methods: Three efficient fine-tuning strategies are evaluated on the publicly available TNCC-title dataset: "prompt-tuning", "Adapter lightweight fine-tuning", and "prompt-tuning + Adapter fine-tuning".
  • results: The experiments demonstrate significant improvements from these strategies, offering valuable insights for Tibetan language applications built on pre-trained models.
    Abstract In this era of large language models (LLMs), the traditional training of models has become increasingly unimaginable for regular users and institutions. The exploration of efficient fine-tuning for high-resource languages on these models is an undeniable trend that is gradually gaining popularity. However, there has been very little exploration for various low-resource languages, such as Tibetan. Research in Tibetan NLP is inherently scarce and limited. While there is currently no existing large language model for Tibetan due to its low-resource nature, that day will undoubtedly arrive. Therefore, research on efficient fine-tuning for low-resource language models like Tibetan is highly necessary. Our research can serve as a reference to fill this crucial gap. Efficient fine-tuning strategies for pre-trained language models (PLMs) in Tibetan have seen minimal exploration. We conducted three types of efficient fine-tuning experiments on the publicly available TNCC-title dataset: "prompt-tuning," "Adapter lightweight fine-tuning," and "prompt-tuning + Adapter fine-tuning." The experimental results demonstrate significant improvements using these methods, providing valuable insights for advancing Tibetan language applications in the context of pre-trained models.
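
For readers unfamiliar with Adapter-style lightweight fine-tuning, the idea is to insert small trainable bottleneck blocks into a frozen backbone. A generic PyTorch sketch follows; the layer sizes and placement are illustrative, not the paper's exact configuration for Tibetan PLMs.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small down-project/up-project block with a
    residual connection, trained while the backbone stays frozen
    (hidden and bottleneck sizes are illustrative)."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps PLM signal

# Usage sketch: freeze the backbone, train only the adapter (+ task head).
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter()
h = backbone(torch.randn(2, 16, 768))  # (batch, seq, hidden)
print(adapter(h).shape)                # torch.Size([2, 16, 768])
```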

Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation

  • paper_url: http://arxiv.org/abs/2309.12075
  • repo_url: https://github.com/eqtpartners/ptec
  • paper_authors: Valentin Leonhard Buchner, Lele Cao, Jan-Christoph Kalo, Vilhelm von Ehrenheim
  • for: This work benchmarks the performance and computational efficiency of Prompt Tuning against baselines for multi-label text classification, applied to classifying companies into an investment firm's proprietary industry taxonomy in support of its thematic investment strategy.
  • methods: Limitation (a) of text-to-text classification is addressed with constrained decoding via Trie Search; limitations (a), (b), and (c) are addressed by replacing the PLM's language head with a classification head, yielding Prompt Tuned Embedding Classification (PTEC).
  • results: PTEC significantly improves classification performance while reducing computational costs at inference, and its performance is consistent across both well-known and less-known companies.
    Abstract Prompt Tuning is emerging as a scalable and cost-effective method to fine-tune Pretrained Language Models (PLMs), which are often referred to as Large Language Models (LLMs). This study benchmarks the performance and computational efficiency of Prompt Tuning and baselines for multi-label text classification. This is applied to the challenging task of classifying companies into an investment firm's proprietary industry taxonomy, supporting their thematic investment strategy. Text-to-text classification is frequently reported to outperform task-specific classification heads, but has several limitations when applied to a multi-label classification problem where each label consists of multiple tokens: (a) Generated labels may not match any label in the label taxonomy; (b) The fine-tuning process lacks permutation invariance and is sensitive to the order of the provided labels; (c) The model provides binary decisions rather than appropriate confidence scores. Limitation (a) is addressed by applying constrained decoding using Trie Search, which slightly improves classification performance. All limitations (a), (b), and (c) are addressed by replacing the PLM's language head with a classification head, which is referred to as Prompt Tuned Embedding Classification (PTEC). This improves performance significantly, while also reducing computational costs during inference. In our industrial application, the training data is skewed towards well-known companies. We confirm that the model's performance is consistent across both well-known and less-known companies. Our overall results indicate the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of PLMs with strong generalization abilities. We release our codebase and a benchmarking dataset at https://github.com/EQTPartners/PTEC.
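
Structurally, PTEC combines trainable soft-prompt tokens with a classification head in place of the language head, yielding proper per-label confidence scores. The PyTorch sketch below captures that shape; the encoder, pooling, and dimensions are simplifying assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class PromptTunedEmbeddingClassifier(nn.Module):
    """Sketch of the PTEC shape: soft-prompt tokens prepended to the
    input, the PLM body used as a frozen encoder, and a classification
    head replacing the language head (sizes are illustrative)."""
    def __init__(self, encoder, hidden, num_labels, prompt_len=20):
        super().__init__()
        self.encoder = encoder                             # frozen PLM body
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
        self.head = nn.Linear(hidden, num_labels)          # replaces LM head

    def forward(self, token_embs):                         # (B, T, H)
        B = token_embs.size(0)
        soft = self.prompt.unsqueeze(0).expand(B, -1, -1)  # (B, P, H)
        h = self.encoder(torch.cat([soft, token_embs], dim=1))
        return self.head(h.mean(dim=1))                    # one logit per label

encoder = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False                  # train only prompt + head
model = PromptTunedEmbeddingClassifier(encoder, hidden=256, num_labels=12)
probs = torch.sigmoid(model(torch.randn(4, 32, 256)))  # per-sector confidences
print(probs.shape)                           # torch.Size([4, 12])
```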

Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam

  • paper_url: http://arxiv.org/abs/2309.12071
  • repo_url: None
  • paper_authors: Matheus L. O. Santos, Cláudio E. C. Campelo
  • for: This study evaluates large language models based on the 7B and 13B LLaMA models, quantized and run on home hardware.
  • methods: The models considered (Alpaca, Koala, and Vicuna) are evaluated on a database of 1,006 questions from the ENEM (Brazilian National Secondary School Exam), and their computational efficiency is also measured.
  • results: The best-performing models reach roughly 46% accuracy on the original Portuguese questions and 49% on their English translations; on a machine with an AMD Ryzen 5 3600x processor, the 7B and 13B models take about 20 and 50 seconds per query, respectively.
    Abstract Although Large Language Models (LLMs) represent a revolution in the way we interact with computers, allowing the construction of complex questions and the ability to reason over a sequence of statements, their use is restricted due to the need for dedicated hardware for execution. In this study, we evaluate the performance of LLMs based on the 7 and 13 billion LLaMA models, subjected to a quantization process and run on home hardware. The models considered were Alpaca, Koala, and Vicuna. To evaluate the effectiveness of these models, we developed a database containing 1,006 questions from the ENEM (Brazilian National Secondary School Exam). Our analysis revealed that the best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. In addition, we evaluated the computational efficiency of the models by measuring the time required for execution. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.

Survey of Action Recognition, Spotting and Spatio-Temporal Localization in Soccer – Current Trends and Research Perspectives

  • paper_url: http://arxiv.org/abs/2309.12067
  • repo_url: None
  • paper_authors: Karolina Seweryn, Anna Wróblewska, Szymon Łukasik
  • for: This survey provides a comprehensive overview of action scene understanding in soccer, divided into action recognition, spotting, and spatio-temporal action localization, with particular emphasis on the modalities used and on multimodal methods.
  • methods: The article reviews the publicly available data sources and the metrics used to evaluate model performance, and surveys recent state-of-the-art approaches, both deep learning-based and traditional, focusing on methods that integrate multiple sources (such as video and audio) or represent one source in various ways.
  • results: The survey discusses the advantages and limitations of existing methods and their potential for improving model accuracy and robustness, and highlights open research questions and future directions, including the potential of multimodal methods to advance the field.
    Abstract Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.

An Efficient Consolidation of Word Embedding and Deep Learning Techniques for Classifying Anticancer Peptides: FastText+BiLSTM

  • paper_url: http://arxiv.org/abs/2309.12058
  • repo_url: None
  • paper_authors: Onur Karakaya, Zeynep Hilal Kilimci
  • for: The goal of this paper is an accurate prediction model for classifying anticancer peptides (ACPs).
  • methods: Word2Vec and FastText are evaluated as word embedding techniques for peptide sequences, and their outputs are fed into CNN, LSTM, and BiLSTM deep learning models for classification.
  • results: The proposed FastText+BiLSTM combination reaches 92.50% accuracy on the ACPs250 dataset and 96.15% on the Independent dataset, setting a new state of the art on these widely used benchmarks.
    Abstract Anticancer peptides (ACPs) are a group of peptides that exhibit antineoplastic properties. The utilization of ACPs in cancer prevention can present a viable substitute for conventional cancer therapeutics, as they possess a higher degree of selectivity and safety. Recent scientific advancements have generated interest in peptide-based therapies, which offer the advantage of efficiently treating intended cells without negatively impacting normal cells. However, as the number of peptide sequences continues to increase rapidly, developing a reliable and precise prediction model becomes a challenging task. In this work, our motivation is to advance an efficient model for categorizing anticancer peptides by consolidating word embedding and deep learning models. First, Word2Vec and FastText are evaluated as word embedding techniques for the purpose of extracting peptide sequences. Then, the outputs of the word embedding models are fed into the deep learning approaches CNN, LSTM, and BiLSTM. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on widely used datasets in the literature, ACPs250 and Independent. Experiment results show that the proposed model enhances classification accuracy compared to state-of-the-art studies. The proposed combination, FastText+BiLSTM, exhibits 92.50% accuracy on the ACPs250 dataset and 96.15% accuracy on the Independent dataset, thereby setting a new state of the art.
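
The winning pipeline is conceptually simple: residue tokens pass through a FastText-initialized embedding table, a BiLSTM encodes the sequence, and a linear layer emits the ACP logit. A hedged PyTorch sketch is below; the dimensions and the randomly initialized embedding are stand-ins for the actual FastText matrix.

```python
import torch
import torch.nn as nn

class PeptideBiLSTM(nn.Module):
    """Sketch of the FastText+BiLSTM pipeline: an embedding table (which
    would be loaded from FastText vectors), a BiLSTM encoder, and a
    binary ACP classifier. Sizes are illustrative assumptions."""
    def __init__(self, vocab_size=25, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # In the real pipeline the table would be FastText-initialized:
        # self.emb.weight.data.copy_(torch.tensor(fasttext_matrix))
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)        # binary ACP logit

    def forward(self, tokens):                    # (B, L) residue ids
        out, _ = self.lstm(self.emb(tokens))      # (B, L, 2*hidden)
        return self.fc(out[:, -1])                # last step's state

model = PeptideBiLSTM()
logits = model(torch.randint(0, 25, (8, 40)))     # 8 peptides, length 40
print(torch.sigmoid(logits).shape)                # torch.Size([8, 1])
```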

BELT:Bootstrapping Electroencephalography-to-Language Decoding and Zero-Shot Sentiment Classification by Natural Language Supervision

  • paper_url: http://arxiv.org/abs/2309.12056
  • repo_url: None
  • paper_authors: Jinzhao Zhou, Yiqun Duan, Yu-Cheng Chang, Yu-Kai Wang, Chin-Teng Lin
  • for: This paper proposes a new model and learning framework for brain-to-language translation, decoding noninvasive brain signals into readable natural language.
  • methods: BELT bootstraps EEG representation learning with large pretrained language models (LMs), using a contrastive learning step with natural language supervision to obtain semantically meaningful EEG representations.
  • results: BELT achieves state-of-the-art results on two brain decoding tasks, surpassing baseline models by 5.45% and over 10%, with a 42.31% BLEU-1 score on brain-to-language translation and 67.32% precision on zero-shot sentiment classification.
    Abstract This paper presents BELT, a novel model and learning framework for the pivotal topic of brain-to-language translation research. The translation from noninvasive brain signals into readable natural language has the potential to promote the application scenario as well as the development of brain-computer interfaces (BCI) as a whole. The critical problem in brain signal decoding or brain-to-language translation is the acquisition of semantically appropriate and discriminative EEG representation from a dataset of limited scale and quality. The proposed BELT method is a generic and efficient framework that bootstraps EEG representation learning using off-the-shelf large-scale pretrained language models (LMs). With a large LM's capacity for understanding semantic information and zero-shot generalization, BELT utilizes large LMs trained on Internet-scale datasets to bring significant improvements to the understanding of EEG signals. In particular, the BELT model is composed of a deep conformer encoder and a vector quantization encoder. Semantical EEG representation is achieved by a contrastive learning step that provides natural language supervision. We achieve state-of-the-art results on two featuring brain decoding tasks including the brain-to-language translation and zero-shot sentiment classification. Specifically, our model surpasses the baseline model on both tasks by 5.45% and over 10% and archives a 42.31% BLEU-1 score and 67.32% precision on the main evaluation metrics for translation and zero-shot sentiment classification respectively.

SCVCNet: Sliding cross-vector convolution network for cross-task and inter-individual-set EEG-based cognitive workload recognition

  • paper_url: http://arxiv.org/abs/2310.03749
  • repo_url: None
  • paper_authors: Qi Wang, Li Chen, Zhiyuan Zhan, Jianhua Zhang, Zhong Yin
  • for: This paper applies a cognitive workload recognizer across different human-machine tasks and individual sets by exploiting the EEG patterns common to them, yielding a generic approach.
  • methods: The proposed SCVCNet neural network removes task- and individual-set-related interference by analyzing finer-grained frequency structures in the power spectral densities, using a sliding cross-vector convolution (SCVC) operation over paired input layers representing theta and alpha power.
  • results: Trained with the regularized least-square method with ridge regression and extreme learning machine theory, and validated on three databases of distinct tasks performed by independent participant groups, SCVCNet achieves average accuracies of 0.6813 and 0.6229 and F1 scores of 0.6743 and 0.6076 under two validation paradigms, partially exceeding previous works.
    Abstract This paper presents a generic approach for applying the cognitive workload recognizer by exploiting common electroencephalogram (EEG) patterns across different human-machine tasks and individual sets. We propose a neural network called SCVCNet, which eliminates task- and individual-set-related interferences in EEGs by analyzing finer-grained frequency structures in the power spectral densities. The SCVCNet utilizes a sliding cross-vector convolution (SCVC) operation, where paired input layers representing the theta and alpha power are employed. By extracting the weights from a kernel matrix's central row and column, we compute the weighted sum of the two vectors around a specified scalp location. Next, we introduce an inter-frequency-point feature integration module to fuse the SCVC feature maps. Finally, we combined the two modules with the output-channel pooling and classification layers to construct the model. To train the SCVCNet, we employ the regularized least-square method with ridge regression and the extreme learning machine theory. We validate its performance using three databases, each consisting of distinct tasks performed by independent participant groups. The average accuracy (0.6813 and 0.6229) and F1 score (0.6743 and 0.6076) achieved in two different validation paradigms show partially higher performance than the previous works. All features and algorithms are available on website:https://github.com/7ohnKeats/SCVCNet.
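
Following the paper's description, one SCVC step weights a horizontal theta-power vector by the kernel's central row and a vertical alpha-power vector by its central column, then sums the two. The NumPy sketch below illustrates a single such step; the padding, striding, and map layout are simplified assumptions.

```python
import numpy as np

def scvc(theta_map, alpha_map, kernel, i, j):
    """One sliding cross-vector convolution step at scalp location (i, j):
    the kernel's central row weights a theta-power vector and its central
    column weights an alpha-power vector (a sketch of the described
    operation, with boundary handling omitted)."""
    k = kernel.shape[0]                          # odd kernel size, e.g. 5
    r = k // 2
    row_w = kernel[r, :]                         # central row -> theta weights
    col_w = kernel[:, r]                         # central column -> alpha weights
    theta_vec = theta_map[i, j - r:j + r + 1]    # horizontal theta vector
    alpha_vec = alpha_map[i - r:i + r + 1, j]    # vertical alpha vector
    return row_w @ theta_vec + col_w @ alpha_vec

rng = np.random.default_rng(3)
theta = rng.uniform(size=(9, 9))                 # theta-band PSD "map"
alpha = rng.uniform(size=(9, 9))                 # alpha-band PSD "map"
K = rng.normal(size=(5, 5))
print(scvc(theta, alpha, K, i=4, j=4))
```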

Uncertainty-driven Exploration Strategies for Online Grasp Learning

  • paper_url: http://arxiv.org/abs/2309.12038
  • repo_url: None
  • paper_authors: Yitian Shi, Philipp Schillinger, Miroslav Gabriel, Alexander Kuss, Zohar Feldman, Hanna Ziesche, Ngo Anh Vien
  • for: Improving the success rate and adaptability of robotic bin picking.
  • methods: Online grasp learning is formulated as a reinforcement learning problem that adapts both grasp reward prediction and grasp poses, with uncertainty estimation schemes based on Bayesian uncertainty quantification and distributional ensembles driving exploration.
  • results: Experiments on real-world bin picking scenes of varying difficulty show that the proposed method significantly improves grasp prediction and adaptability over conventional online learning methods with naive exploration strategies.
    Abstract Existing grasp prediction approaches are mostly based on offline learning and ignore exploratory grasp learning during online adaptation to new picking scenarios, i.e., unseen object portfolios, camera and bin settings, etc. In this paper, we present a novel method for online learning of grasp predictions for robotic bin picking in a principled way. Specifically, an online learning algorithm with an effective exploration strategy can significantly improve its adaptation performance to unseen environment settings. To this end, we first propose to formulate online grasp learning as an RL problem that allows adapting both the grasp reward prediction and the grasp poses. We propose various uncertainty estimation schemes based on Bayesian Uncertainty Quantification and Distributional Ensembles. We carry out evaluations on real-world bin picking scenes of varying difficulty. The objects in the bin have various challenging physical and perceptual characteristics, including semi- or total transparency and irregular or curved surfaces. The results of our experiments demonstrate a notable improvement of the suggested approach over conventional online learning methods that incorporate only naive exploration strategies.
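
One simple reading of the distributional-ensemble scheme: several heads predict grasp success, and their disagreement acts as an epistemic-uncertainty bonus steering exploration toward poorly understood candidates. The PyTorch sketch below shows that pattern; the network sizes and the bonus weight are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# An ensemble of small heads predicting grasp success probability.
ensemble = nn.ModuleList(
    nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(5)
)

@torch.no_grad()
def pick_grasp(candidate_feats, beta=1.0):
    """candidate_feats: (N, 16) features of N grasp candidates.
    Selects by mean predicted success plus an ensemble-disagreement
    bonus (UCB-style exploration)."""
    preds = torch.stack([torch.sigmoid(h(candidate_feats)).squeeze(-1)
                         for h in ensemble])        # (5, N)
    mean, std = preds.mean(0), preds.std(0)
    return int(torch.argmax(mean + beta * std))

feats = torch.randn(32, 16)
print("chosen grasp:", pick_grasp(feats))
```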

Dynamic Hypergraph Structure Learning for Traffic Flow Forecasting

  • paper_url: http://arxiv.org/abs/2309.12028
  • repo_url: None
  • paper_authors: Yusheng Zhao, Xiao Luo, Wei Ju, Chong Chen, Xian-Sheng Hua, Ming Zhang
  • for: Traffic flow forecasting: predicting future traffic conditions from road networks and past traffic conditions.
  • methods: The proposed Dynamic Hypergraph Structure Learning (DyHSL) model captures spatio-temporal correlations beyond pairwise relations: it extracts hypergraph structural information, updates each node by aggregating messages from its associated hyperedges, adds an interactive graph convolution block for high-order neighborhood interactions, and integrates both views in a multi-scale temporal pooling module.
  • results: Extensive experiments on four popular traffic benchmark datasets demonstrate that DyHSL outperforms a broad range of baselines.
    Abstract This paper studies the problem of traffic flow forecasting, which aims to predict future traffic conditions on the basis of road networks and traffic conditions in the past. The problem is typically solved by modeling complex spatio-temporal correlations in traffic data using spatio-temporal graph neural networks (GNNs). However, the performance of these methods is still far from satisfactory since GNNs usually have limited representation capacity when it comes to complex traffic networks. Graphs, by nature, fall short in capturing non-pairwise relations. Even worse, existing methods follow the paradigm of message passing that aggregates neighborhood information linearly, which fails to capture complicated spatio-temporal high-order interactions. To tackle these issues, in this paper, we propose a novel model named Dynamic Hypergraph Structure Learning (DyHSL) for traffic flow prediction. To learn non-pairwise relationships, our DyHSL extracts hypergraph structural information to model dynamics in the traffic networks, and updates each node representation by aggregating messages from its associated hyperedges. Additionally, to capture high-order spatio-temporal relations in the road network, we introduce an interactive graph convolution block, which further models the neighborhood interaction for each node. Finally, we integrate these two views into a holistic multi-scale correlation extraction module, which conducts temporal pooling with different scales to model different temporal patterns. Extensive experiments on four popular traffic benchmark datasets demonstrate the effectiveness of our proposed DyHSL compared with a broad range of competing baselines.
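
The hyperedge message passing described above can be sketched with an incidence matrix: nodes are pooled into their hyperedges, then each node aggregates from the hyperedges it belongs to. Below is a hedged PyTorch sketch; it omits the paper's dynamic structure learning, interactive graph convolutions, and multi-scale temporal pooling.

```python
import torch
import torch.nn as nn

class HypergraphLayer(nn.Module):
    """Sketch of hyperedge message passing: node -> hyperedge pooling
    followed by hyperedge -> node aggregation with a residual update.
    The incidence matrix and sizes are illustrative assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.node_to_edge = nn.Linear(dim, dim)
        self.edge_to_node = nn.Linear(dim, dim)

    def forward(self, x, H):
        # x: (N, D) node features; H: (N, E) incidence (node i in edge e).
        deg_e = H.sum(0).clamp(min=1)                   # nodes per hyperedge
        edge_feats = (H.t() @ self.node_to_edge(x)) / deg_e[:, None]
        deg_n = H.sum(1).clamp(min=1)                   # edges per node
        msgs = (H @ self.edge_to_node(edge_feats)) / deg_n[:, None]
        return torch.relu(x + msgs)                     # residual update

N, E, D = 6, 3, 8
H = torch.tensor([[1, 0, 0], [1, 1, 0], [0, 1, 0],
                  [0, 1, 1], [0, 0, 1], [1, 0, 1]], dtype=torch.float)
layer = HypergraphLayer(D)
print(layer(torch.randn(N, D), H).shape)  # torch.Size([6, 8])
```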

Demystifying Visual Features of Movie Posters for Multi-Label Genre Identification

  • paper_url: http://arxiv.org/abs/2309.12022
  • repo_url: None
  • paper_authors: Utsav Kumar Nareti, Chandranath Adak, Soumi Chattopadhyay
  • for: Automated multi-label genre identification of movies from poster images.
  • methods: Deep transformer network with a probabilistic module.
  • results: Encouraging performance, outperforming some major contemporary architectures in experiments on 13882 posters from IMDb.
    Abstract In the film industry, movie posters have been an essential part of advertising and marketing for many decades, and they continue to play a vital role even today in the form of digital posters on online, social media and OTT platforms. Typically, movie posters can effectively promote and communicate the essence of a film, such as its genre, visual style/tone, vibe and storyline cue/theme, which are essential to attract potential viewers. Identifying the genres of a movie often has significant practical applications in recommending the film to target audiences. Previous studies on movie genre identification are limited to subtitles, plot synopses, and movie scenes that are mostly accessible after the movie release. Posters, however, usually contain pre-release implicit information to generate mass interest. In this paper, we work on automated multi-label genre identification from movie poster images alone, without any additional textual/meta-data information about the movies, which is one of the earliest attempts of its kind. We present a deep transformer network with a probabilistic module to identify the movie genres exclusively from the poster. For experimental analysis, we procured 13882 posters of 13 genres from the Internet Movie Database (IMDb), where our model's performance was encouraging and even outperformed some major contemporary architectures.
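
Multi-label genre identification is commonly implemented with one sigmoid output per genre rather than a softmax, since a poster may carry several genres at once. A minimal hedged sketch (toy backbone and dimensions, not the paper's transformer with its probabilistic module):

```python
import torch
import torch.nn as nn

NUM_GENRES = 13  # matches the 13 genres used in the paper's experiments

# Hypothetical backbone: anything that maps a poster to a feature vector.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512), nn.ReLU())
head = nn.Linear(512, NUM_GENRES)

posters = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8, NUM_GENRES)).float()  # multi-hot genre labels

logits = head(backbone(posters))
loss = nn.BCEWithLogitsLoss()(logits, labels)  # independent sigmoid per genre
predicted = torch.sigmoid(logits) > 0.5        # a poster may get several genres
```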

Safe Hierarchical Reinforcement Learning for CubeSat Task Scheduling Based on Energy Consumption

  • paper_url: http://arxiv.org/abs/2309.12004
  • repo_url: None
  • paper_authors: Mahya Ramezani, M. Amin Alandihallaj, Jose Luis Sanchez-Lopez, Andreas Hein
  • for: Optimizing CubeSat task scheduling in Low Earth Orbit (LEO).
  • methods: A hierarchical reinforcement learning approach with a high-level policy for global task distribution and a low-level policy acting as a safety mechanism, using a Similarity Attention-based Encoder (SABE) for task prioritization and an MLP estimator for energy-consumption forecasting.
  • results: Simulations across multiple CubeSat configurations demonstrate superior convergence and task success rates compared with a MADDPG model and random scheduling.
    Abstract This paper presents a Hierarchical Reinforcement Learning methodology tailored for optimizing CubeSat task scheduling in Low Earth Orbit (LEO). Incorporating a high-level policy for global task distribution and a low-level policy for real-time adaptations as a safety mechanism, our approach integrates the Similarity Attention-based Encoder (SABE) for task prioritization and an MLP estimator for energy consumption forecasting. Integrating this mechanism creates a safe and fault-tolerant system for CubeSat task scheduling. Simulation results validate the superior convergence and task success rate of the Hierarchical Reinforcement Learning approach, which outperforms both the MADDPG model and traditional random scheduling across multiple CubeSat configurations.

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

  • paper_url: http://arxiv.org/abs/2309.11998
  • repo_url: https://github.com/lm-sys/fastchat
  • paper_authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
  • for: The paper is written for researchers and developers who want to understand and advance the capabilities of large language models (LLMs) in real-world scenarios.
  • methods: The paper introduces a large-scale dataset called LMSYS-Chat-1M, which contains one million real-world conversations with 25 state-of-the-art LLMs. The dataset is collected from 210K unique IP addresses in the wild and includes a curation process, basic statistics, and topic distribution.
  • results: The paper demonstrates the versatility of the dataset through four use cases: developing content moderation models, building a safety benchmark, training instruction-following models, and creating challenging benchmark questions. The dataset is publicly available and is expected to serve as a valuable resource for understanding and advancing LLM capabilities.
    Abstract Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
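
Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library; the field names below are assumptions based on the dataset card, so check them against the actual schema:

```python
from datasets import load_dataset

# The dataset may be gated, so `huggingface-cli login` could be required first.
# Streaming avoids downloading all one million conversations up front.
ds = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)

for record in ds.take(3):
    # Assumed fields: "model", "language", and "conversation"
    # (a list of {"role": ..., "content": ...} turns).
    print(record["model"], record["language"])
    for turn in record["conversation"]:
        print(f'  {turn["role"]}: {turn["content"][:60]}')
```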

Predictability and Comprehensibility in Post-Hoc XAI Methods: A User-Centered Analysis

  • paper_url: http://arxiv.org/abs/2309.11987
  • repo_url: None
  • paper_authors: Anahid Jalali, Bernhard Haslhofer, Simone Kriglstein, Andreas Rauber
  • for: Evaluating whether explanations of black-box machine learning predictions improve users' ability to predict model behavior.
  • methods: A user study of two widely used tools, LIME and SHAP, also examining how counterfactual explanations and misclassifications affect users' understanding and prediction of model behavior.
  • results: The comprehensibility of SHAP explanations drops significantly for samples near the model's decision boundary, while counterfactual explanations and misclassifications substantially improve users' understanding; design recommendations for future post-hoc explanation methods with higher comprehensibility and predictability are derived.
    Abstract Post-hoc explainability methods aim to clarify predictions of black-box machine learning models. However, it is still largely unclear how well users comprehend the provided explanations and whether these increase the users' ability to predict the model behavior. We approach this question by conducting a user study to evaluate comprehensibility and predictability in two widely used tools: LIME and SHAP. Moreover, we investigate the effect of counterfactual explanations and misclassifications on users' ability to understand and predict the model behavior. We find that the comprehensibility of SHAP is significantly reduced when explanations are provided for samples near a model's decision boundary. Furthermore, we find that counterfactual explanations and misclassifications can significantly increase the users' understanding of how a machine learning model is making decisions. Based on our findings, we also derive design recommendations for future post-hoc explainability methods with increased comprehensibility and predictability.
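
For readers unfamiliar with the tools under study, a minimal hedged usage sketch (scikit-learn model on synthetic data; APIs can differ across shap versions) shows how per-sample SHAP attributions are obtained, including for the near-boundary samples the study found hardest to comprehend:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP: additive per-feature attributions for individual predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])

# Samples near the decision boundary -- where the user study found SHAP
# explanations least comprehensible -- have p(class=1) close to 0.5.
proba = model.predict_proba(X)[:, 1]
near_boundary = np.argsort(np.abs(proba - 0.5))[:10]
boundary_shap = explainer.shap_values(X[near_boundary])
```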

Representation Abstractions as Incentives for Reinforcement Learning Agents: A Robotic Grasping Case Study

  • paper_url: http://arxiv.org/abs/2309.11984
  • repo_url: https://github.com/PetropoulakisPanagiotis/igae
  • paper_authors: Panagiotis Petropoulakis, Ludwig Gräf, Josip Josifovski, Mohammadhossein Malmir, Alois Knoll
  • for: Studying how different state representations affect an RL agent's ability to solve a specific robotic task: antipodal and planar object grasping.
  • methods: A continuum of state representation abstractions, from a model-based approach with complete system knowledge, through hand-crafted numerical representations, to image-based representations with decreasing task-specific knowledge, evaluated in simulation and on the real robot.
  • results: RL agents using numerical state representations perform on par with non-learning baselines, and image-based representations from pre-trained environment embeddings improve the agents' success and transfer rates.
    Abstract Choosing an appropriate representation of the environment for the underlying decision-making process of the RL agent is not always straightforward. The state representation should be inclusive enough to allow the agent to informatively decide on its actions and compact enough to increase sample efficiency for policy training. Given this outlook, this work examines the effect of various state representations in incentivizing the agent to solve a specific robotic task: antipodal and planar object grasping. A continuum of state representation abstractions is defined, starting from a model-based approach with complete system knowledge, through hand-crafted numerical, to image-based representations with decreasing level of induced task-specific knowledge. We examine the effects of each representation in the ability of the agent to solve the task in simulation and the transferability of the learned policy to the real robot. The results show that RL agents using numerical states can perform on par with non-learning baselines. Furthermore, we find that agents using image-based representations from pre-trained environment embedding vectors perform better than end-to-end trained agents, and hypothesize that task-specific knowledge is necessary for achieving convergence and high success rates in robot control.

Rethinking the Evaluating Framework for Natural Language Understanding in AI Systems: Language Acquisition as a Core for Future Metrics

  • paper_url: http://arxiv.org/abs/2309.11981
  • repo_url: None
  • paper_authors: Patricio Vera, Pedro Moya, Lisa Barraza
  • for: Examining how large language models (LLMs) have changed natural language processing within artificial intelligence (AI), and revisiting traditional metrics of machine intelligence.
  • methods: Proposes a new evaluation framework, inspired by recent advances in LLMs, that hinges on language acquisition and understanding.
  • results: Argues that the proposed framework can better assess language understanding and learning in machines and helps address the limitations of traditional evaluation approaches.
    Abstract In the burgeoning field of artificial intelligence (AI), the unprecedented progress of large language models (LLMs) in natural language processing (NLP) offers an opportunity to revisit the entire approach of traditional metrics of machine intelligence, both in form and content. As the realm of machine cognitive evaluation has already reached Imitation, the next step is efficient Language Acquisition and Understanding. Our paper proposes a paradigm shift from the established Turing Test towards an all-embracing framework that hinges on language acquisition, taking inspiration from the recent advancements in LLMs. The present contribution draws deeply on excellent work from various disciplines, points out the need to keep interdisciplinary bridges open, and delineates a more robust and sustainable approach.

Inferring Capabilities from Task Performance with Bayesian Triangulation

  • paper_url: http://arxiv.org/abs/2309.11975
  • repo_url: None
  • paper_authors: John Burden, Konstantinos Voudouris, Ryan Burnell, Danaja Rutar, Lucy Cheke, José Hernández-Orallo
  • for: Characterizing machine learning models in richer, more meaningful ways, using diverse experimental data to infer the cognitive profile of a system.
  • methods: Uses the PyMC probabilistic programming library and introduces measurement layouts that model how task-instance features interact with system capabilities, triangulating features to infer capabilities from non-populational data.
  • results: Successfully infers distinct cognitive profiles for 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, showcasing the potential of capability-oriented evaluation.
    Abstract As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.
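
The measurement layouts themselves are paper-specific, but the underlying move (inferring a latent capability from per-instance successes whose demands are known) can be sketched in PyMC as an IRT-style toy model; everything below is an illustrative assumption, not the authors' actual layouts:

```python
import numpy as np
import pymc as pm

# Toy data: one agent attempts 50 task instances of known difficulty.
rng = np.random.default_rng(0)
difficulty = rng.uniform(0, 10, size=50)
true_capability = 6.0
p_true = 1 / (1 + np.exp(difficulty - true_capability))
success = (rng.random(50) < p_true).astype(int)

with pm.Model():
    capability = pm.Normal("capability", mu=5.0, sigma=3.0)
    # Success probability falls off as difficulty exceeds capability.
    p = pm.math.sigmoid(capability - difficulty)
    pm.Bernoulli("obs", p=p, observed=success)
    trace = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(float(trace.posterior["capability"].mean()))  # posterior capability estimate
```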

A Comprehensive Review on Financial Explainable AI

  • paper_url: http://arxiv.org/abs/2309.11960
  • repo_url: None
  • paper_authors: Wei Jie Yeo, Wihan van der Heever, Rui Mao, Erik Cambria, Ranjan Satapathy, Gianmarco Mengaldo
  • for: Reviewing and comparing explainability methods for deep learning models in finance, where decision-making transparency and trustworthiness are paramount.
  • methods: A comparative survey of explainable AI methods, categorized according to their characteristics.
  • results: An assessment of the transparency and trustworthiness these methods offer, a review of the concerns and challenges of adopting explainable AI, and a discussion of future directions.
    Abstract The success of artificial intelligence (AI), and deep learning models in particular, has led to their widespread adoption across various industries due to their ability to process huge amounts of data and learn complex patterns. However, due to their lack of explainability, there are significant concerns regarding their use in critical sectors, such as finance and healthcare, where decision-making transparency is of paramount importance. In this paper, we provide a comparative survey of methods that aim to improve the explainability of deep learning models within the context of finance. We categorize the collection of explainable AI methods according to their corresponding characteristics, and we review the concerns and challenges of adopting explainable AI methods, together with future directions we deemed appropriate and important.

On the Definition of Appropriate Trust and the Tools that Come with it

  • paper_url: http://arxiv.org/abs/2309.11937
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Helena Löfström
  • For: Evaluating the efficiency of human-AI interactions, specifically the human experience of explanations and the user's appropriate trust in the model.
  • Methods: Compares definitions of appropriate trust from the literature with model performance evaluation and offers a novel approach to evaluating appropriate trust by exploiting the likenesses between definitions, with several straightforward evaluation methods for different aspects of user performance, including measuring uncertainty and appropriate trust in regression.
  • Results: The main contribution is a novel approach to evaluating appropriate trust that offers a more objective and comparative evaluation of explanation methods.
    Abstract Evaluating the efficiency of human-AI interactions is challenging, including subjective and objective quality aspects. With the focus on the human experience of the explanations, evaluations of explanation methods have become mostly subjective, making comparative evaluations almost impossible and highly linked to the individual user. However, it is commonly agreed that one aspect of explanation quality is how effectively the user can detect if the predictions are trustworthy and correct, i.e., if the explanations can increase the user's appropriate trust in the model. This paper starts with the definitions of appropriate trust from the literature. It compares the definitions with model performance evaluation, showing the strong similarities between appropriate trust and model performance evaluation. The paper's main contribution is a novel approach to evaluating appropriate trust by taking advantage of the likenesses between definitions. The paper offers several straightforward evaluation methods for different aspects of user performance, including suggesting a method for measuring uncertainty and appropriate trust in regression.

Learning to Recover for Safe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.11907
  • repo_url: None
  • paper_authors: Haoyu Wang, Xin Yuan, Qinqing Ren
  • for: Achieving safe reinforcement learning, including constructing safety constraints automatically in sophisticated environments.
  • methods: A three-stage architecture, the TU-Recovery Architecture, in which a safety critic and a recovery policy are learned and combined into a safety controller.
  • results: Experiments show that TU-Recovery outperforms its unconstrained counterpart in both reward gain and constraint violations, and an auxiliary reward further improves its reward-to-cost ratio.
    Abstract Safety controllers are widely used to achieve safe reinforcement learning. Most methods that apply a safety controller construct it from handcrafted safety constraints. However, when the environment dynamics are sophisticated, handcrafted safety constraints become unavailable, so it is worthwhile to research constructing safety controllers with learning algorithms. We propose a three-stage architecture for safe reinforcement learning, the TU-Recovery Architecture. A safety critic and a recovery policy are learned before task training; together they form a safety controller that ensures safety during task training. We then describe a phenomenon induced by disagreement between the task policy and the recovery policy, called the adversarial phenomenon, which reduces learning efficiency and model performance. An auxiliary reward is proposed to mitigate the adversarial phenomenon while helping the task policy learn to recover from high-risk states. A series of experiments conducted in a robot navigation environment demonstrate that TU-Recovery outperforms its unconstrained counterpart in both reward gain and constraint violations during task training, and that the auxiliary reward further improves TU-Recovery's reward-to-cost ratio by significantly reducing constraint violations.
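
The paper does not spell out the controller's interface; under the assumption of a learned safety critic and recovery policy, the gating it describes might look like the following hedged sketch, overriding the task action whenever the critic flags it as risky:

```python
import torch

RISK_THRESHOLD = 0.5  # hypothetical tolerance on predicted constraint violation

def safe_action(state, task_policy, recovery_policy, safety_critic):
    """Return the task action unless the safety critic deems it too risky,
    in which case fall back to the recovery policy. Illustrative sketch."""
    a_task = task_policy(state)
    risk = safety_critic(state, a_task)      # predicted future constraint cost
    if risk.item() > RISK_THRESHOLD:
        return recovery_policy(state), True  # recovery engaged
    return a_task, False

# Toy stand-ins for the learned components.
state = torch.randn(4)
task_policy = lambda s: torch.tanh(s[:2])
recovery_policy = lambda s: torch.zeros(2)           # e.g. stop / back off
safety_critic = lambda s, a: torch.sigmoid(s.sum())  # fake risk score
action, recovered = safe_action(state, task_policy, recovery_policy, safety_critic)
```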

Unlocking the Heart Using Adaptive Locked Agnostic Networks

  • paper_url: http://arxiv.org/abs/2309.11899
  • repo_url: https://github.com/AstraZeneca/UnlockingHeart
  • paper_authors: Sylwia Majchrowska, Anders Hildeman, Philip Teare, Tom Diethe
  • For: Medical imaging applications, specifically echocardiography datasets.
  • Methods: The Adaptive Locked Agnostic Network (ALAN), which uses self-supervised visual feature extraction with a large backbone model to produce anatomically robust semantic self-segmentation.
  • Results: The self-supervised backbone robustly identifies anatomical subregions of the heart in an apical four-chamber view; these features are then used to design two downstream models, one segmenting a target anatomical region and one classifying the echocardiogram view.
    Abstract Supervised training of deep learning models for medical imaging applications requires a significant amount of labeled data. This is posing a challenge as the images are required to be annotated by medical professionals. To address this limitation, we introduce the Adaptive Locked Agnostic Network (ALAN), a concept involving self-supervised visual feature extraction using a large backbone model to produce anatomically robust semantic self-segmentation. In the ALAN methodology, this self-supervised training occurs only once on a large and diverse dataset. Due to the intuitive interpretability of the segmentation, downstream models tailored for specific tasks can be easily designed using white-box models with few parameters. This, in turn, opens up the possibility of communicating the inner workings of a model with domain experts and introducing prior knowledge into it. It also means that the downstream models become less data-hungry compared to fully supervised approaches. These characteristics make ALAN particularly well-suited for resource-scarce scenarios, such as costly clinical trials and rare diseases. In this paper, we apply the ALAN approach to three publicly available echocardiography datasets: EchoNet-Dynamic, CAMUS, and TMED-2. Our findings demonstrate that the self-supervised backbone model robustly identifies anatomical subregions of the heart in an apical four-chamber view. Building upon this, we design two downstream models, one for segmenting a target anatomical region, and a second for echocardiogram view classification.

MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models

  • paper_url: http://arxiv.org/abs/2309.13079
  • repo_url: None
  • paper_authors: Yidong Liu, FuKai Shang, Fang Wang, Rui Xu, Jun Wang, Wei Li, Yao Li, Conghui He
  • for: Providing high-quality, domain-specific pre-training data for specialized domains such as healthcare, law, and finance.
  • methods: Evaluates existing large models on specialized domains, discusses their limitations, and introduces the ``MiChao-HuaFen 1.0'' pre-trained corpus dataset for the news and governmental sectors, sourced from publicly available 2022 internet data and put through multiple rounds of cleansing and processing to ensure quality and reliable origins, with provisions for consistent and stable updates.
  • results: The dataset supports pre-training of large models for Chinese vertical domains and helps propel deep learning research and applications in related fields.
    Abstract With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have demonstrated exceptional capabilities across various domains. Nevertheless, there remains a demand for high-quality, domain-specific outputs in areas like healthcare, law, and finance. This paper first evaluates the existing large models for specialized domains and discusses their limitations. To cater to the specific needs of certain domains, we introduce the ``MiChao-HuaFen 1.0'' pre-trained corpus dataset, tailored for the news and governmental sectors. The dataset, sourced from publicly available internet data from 2022, underwent multiple rounds of cleansing and processing to ensure high quality and reliable origins, with provisions for consistent and stable updates. This dataset not only supports the pre-training of large models for Chinese vertical domains but also aids in propelling deep learning research and applications in related fields.

Audio Contrastive based Fine-tuning

  • paper_url: http://arxiv.org/abs/2309.11895
  • repo_url: None
  • paper_authors: Yang Wang, Qibin Liang, Chenghao Xiao, Yizhi Li, Noura Al Moubayed, Chenghua Lin
  • for: Audio classification tasks with a wide range of applications, such as speech and sound processing.
  • methods: Contrastive learning and fine-tuning.
  • results: State-of-the-art results in various settings, with robust generalisability.
    Abstract Audio classification plays a crucial role in speech and sound processing tasks with a wide range of applications. There still remains a challenge of striking the right balance between fitting the model to the training data (avoiding overfitting) and enabling it to generalise well to a new domain. Leveraging the transferability of contrastive learning, we introduce Audio Contrastive-based Fine-tuning (AudioConFit), an efficient approach characterised by robust generalisability. Empirical experiments on a variety of audio classification tasks demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings.
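
The abstract leaves the exact objective to the paper; a hedged sketch of the generic InfoNCE-style contrastive loss such approaches build on, applied to two augmented views of the same audio clips, is:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             tau: float = 0.1) -> torch.Tensor:
    """Each anchor's positive is the same clip under another augmentation;
    the other clips in the batch act as negatives. Illustrative only --
    not necessarily AudioConFit's exact objective."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / tau              # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0))     # the diagonal entries are positives
    return F.cross_entropy(logits, targets)

emb1 = torch.randn(16, 128)  # embeddings of 16 clips, augmentation view 1
emb2 = torch.randn(16, 128)  # the same clips, augmentation view 2
loss = info_nce(emb1, emb2)
```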

A Knowledge-Driven Cross-view Contrastive Learning for EEG Representation

  • paper_url: http://arxiv.org/abs/2310.03747
  • repo_url: None
  • paper_authors: Weining Weng, Yang Gu, Qihui Zhang, Yingying Huang, Chunyan Miao, Yiqiang Chen
  • for: This paper is written for researchers and practitioners working with electroencephalogram (EEG) signals and deep learning methods, particularly those interested in developing supervised learning methods for EEG signals with limited labels.
  • methods: The paper proposes a knowledge-driven cross-view contrastive learning framework (KDC2) that integrates neurological theory to extract effective representations from EEG signals with limited labels. The KDC2 method creates scalp and neural views of EEG signals, simulating the internal and external representation of brain activity, and uses inter-view and cross-view contrastive learning pipelines in combination with various augmentation methods to capture neural features from different views.
  • results: The experimental results on different downstream tasks demonstrate that the proposed method outperforms state-of-the-art methods, highlighting the superior generalization of neural knowledge-supported EEG representations across various brain tasks.
    Abstract Due to the abundant neurophysiological information in the electroencephalogram (EEG) signal, EEG signals integrated with deep learning methods have gained substantial traction across numerous real-world tasks. However, the development of supervised learning methods based on EEG signals has been hindered by the high cost and significant label discrepancies to manually label large-scale EEG datasets. Self-supervised frameworks are adopted in vision and language fields to solve this issue, but the lack of EEG-specific theoretical foundations hampers their applicability across various tasks. To solve these challenges, this paper proposes a knowledge-driven cross-view contrastive learning framework (KDC2), which integrates neurological theory to extract effective representations from EEG with limited labels. The KDC2 method creates scalp and neural views of EEG signals, simulating the internal and external representation of brain activity. Sequentially, inter-view and cross-view contrastive learning pipelines in combination with various augmentation methods are applied to capture neural features from different views. By modeling prior neural knowledge based on homologous neural information consistency theory, the proposed method extracts invariant and complementary neural knowledge to generate combined representations. Experimental results on different downstream tasks demonstrate that our method outperforms state-of-the-art methods, highlighting the superior generalization of neural knowledge-supported EEG representations across various brain tasks.

Multi-level Asymmetric Contrastive Learning for Medical Image Segmentation Pre-training

  • paper_url: http://arxiv.org/abs/2309.11876
  • repo_url: None
  • paper_authors: Shuang Zeng, Lei Zhu, Xinliang Zhang, Zifeng Tian, Qian Chen, Lujia Jin, Jiayi Wang, Yanye Lu
  • for: Proposing a novel contrastive learning framework, JCL, for medical image segmentation with self-supervised pre-training.
  • methods: An asymmetric contrastive learning strategy that pre-trains the encoder and decoder simultaneously in one stage, providing better initialization for segmentation models, together with a multi-level contrastive loss over feature-level, image-level and pixel-level projections so that multi-level representations are learned during pre-training.
  • results: Experiments on multiple medical image datasets show that the JCL framework outperforms existing SOTA contrastive learning strategies.
    Abstract Contrastive learning, a powerful technique for learning image-level representations from unlabeled data, offers a promising direction for dealing with the dilemma between large-scale pre-training and limited labeled data. However, most existing contrastive learning strategies are designed mainly for downstream tasks on natural images, so they are sub-optimal, and even worse than learning from scratch, when directly applied to medical images whose downstream tasks are usually segmentation. In this work, we propose a novel asymmetric contrastive learning framework named JCL for medical image segmentation with self-supervised pre-training. Specifically, (1) a novel asymmetric contrastive learning strategy is proposed to pre-train both encoder and decoder simultaneously in one stage to provide better initialization for segmentation models; (2) a multi-level contrastive loss is designed to take into account the correspondence among feature-level, image-level and pixel-level projections, ensuring that multi-level representations can be learned by the encoder and decoder during pre-training; (3) experiments on multiple medical image datasets indicate that our JCL framework outperforms existing SOTA contrastive learning strategies.

Stochastic stiffness identification and response estimation of Timoshenko beams via physics-informed Gaussian processes

  • paper_url: http://arxiv.org/abs/2309.11875
  • repo_url: https://github.com/gledsonrt/pigptimoshenkobeam
  • paper_authors: Gledson Rodrigo Tondo, Sebastian Rau, Igor Kavrakov, Guido Morgenthal
  • For: A machine-learning-based structural health monitoring approach for structural parameter identification and response estimation.
  • Methods: A physics-informed, multi-output Gaussian process (GP) model for Timoshenko beam elements, with covariance and cross-covariance kernels relating deflections, rotations, strains, bending moments, shear forces and applied loads; stiffness is identified in a Bayesian manner by maximising the posterior model via a Markov chain Monte Carlo method.
  • Results: Experimental validation shows the approach effectively identifies structural parameters, can fuse data from heterogeneous and multi-fidelity sensors, and yields probabilistic predictions of structural responses and internal forces in closer agreement with measured data.
    Abstract Machine learning models trained with structural health monitoring data have become a powerful tool for system identification. This paper presents a physics-informed Gaussian process (GP) model for Timoshenko beam elements. The model is constructed as a multi-output GP with covariance and cross-covariance kernels analytically derived based on the differential equations for deflections, rotations, strains, bending moments, shear forces and applied loads. Stiffness identification is performed in a Bayesian format by maximising a posterior model through a Markov chain Monte Carlo method, yielding a stochastic model for the structural parameters. The optimised GP model is further employed for probabilistic predictions of unobserved responses. Additionally, an entropy-based method for physics-informed sensor placement optimisation is presented, exploiting heterogeneous sensor position information and structural boundary conditions built into the GP model. Results demonstrate that the proposed approach is effective at identifying structural parameters and is capable of fusing data from heterogeneous and multi-fidelity sensors. Probabilistic predictions of structural responses and internal forces are in closer agreement with measured data. We validate our model with an experimental setup and discuss the quality and uncertainty of the obtained results. The proposed approach has potential applications in the field of structural health monitoring (SHM) for both mechanical and structural systems.
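
The full multi-output GP is beyond a short sketch, but the Bayesian stiffness-identification step can be illustrated on a toy problem: a random-walk Metropolis sampler inferring the bending stiffness EI of a simply supported beam from noisy deflection measurements. For brevity this uses the closed-form Euler-Bernoulli deflection rather than the paper's Timoshenko kernels, and all numbers are hypothetical:

```python
import numpy as np

L, q = 5.0, 1e3               # span [m] and uniform load [N/m] (hypothetical)
x = np.linspace(0.5, 4.5, 9)  # sensor positions along the beam

def deflection(EI):
    # Euler-Bernoulli deflection of a simply supported beam, uniform load q.
    return q * x * (L**3 - 2 * L * x**2 + x**3) / (24 * EI)

rng = np.random.default_rng(1)
EI_true, sigma = 2e6, 1e-4
w_obs = deflection(EI_true) + rng.normal(0, sigma, x.size)

def log_post(EI):
    if EI <= 0:
        return -np.inf  # flat prior restricted to EI > 0
    resid = w_obs - deflection(EI)
    return -0.5 * np.sum(resid**2) / sigma**2

samples, EI, lp = [], 1e6, log_post(1e6)
for _ in range(5000):  # random-walk Metropolis over EI
    prop = EI + rng.normal(0, 5e4)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        EI, lp = prop, lp_prop
    samples.append(EI)

post = np.array(samples[1000:])  # drop burn-in
print(f"EI posterior mean = {post.mean():.3g}, std = {post.std():.3g}")
```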

OSNet & MNetO: Two Types of General Reconstruction Architectures for Linear Computed Tomography in Multi-Scenarios

  • paper_url: http://arxiv.org/abs/2309.11858
  • repo_url: None
  • paper_authors: Zhisheng Wang, Zihan Deng, Fenglin Liu, Yixing Huang, Haijun Yu, Junning Cui
  • For: Two novel reconstruction architectures for linear computed tomography (LCT) systems that weaken projection truncation and image the region of interest (ROI).
  • Methods: Backprojection filtration (BPF) with two reconstruction architectures, Overlay-Single Network (OSNet) and Multiple Networks Overlaying (MNetO), to achieve stable interior reconstruction and avoid the rotation operations of Hilbert filtering.
  • Results: Both methods recover images; OSNet outperforms BPF in various scenarios, ST-pix2pixGAN is superior to pix2pixGAN and CycleGAN, and MNetO exhibits a few artifacts due to differences among its multiple models but can image the exterior edge in a given direction.
    Abstract Recently, linear computed tomography (LCT) systems have actively attracted attention. To weaken projection truncation and image the region of interest (ROI) for LCT, the backprojection filtration (BPF) algorithm is an effective solution. However, in BPF for LCT, it is difficult to achieve stable interior reconstruction, and for differentiated backprojection (DBP) images of LCT, multiple rotation-finite inversion of Hilbert transform (Hilbert filtering)-inverse rotation operations will blur the image. To satisfy multiple reconstruction scenarios for LCT, including interior ROI, complete object, and exterior region beyond field-of-view (FOV), and avoid the rotation operations of Hilbert filtering, we propose two types of reconstruction architectures. The first overlays multiple DBP images to obtain a complete DBP image, then uses a network to learn the overlying Hilbert filtering function, referred to as the Overlay-Single Network (OSNet). The second uses multiple networks to train different directional Hilbert filtering models for DBP images of multiple linear scannings, respectively, and then overlays the reconstructed results, i.e., Multiple Networks Overlaying (MNetO). In two architectures, we introduce a Swin Transformer (ST) block to the generator of pix2pixGAN to extract both local and global features from DBP images at the same time. We investigate two architectures from different networks, FOV sizes, pixel sizes, number of projections, geometric magnification, and processing time. Experimental results show that two architectures can both recover images. OSNet outperforms BPF in various scenarios. For the different networks, ST-pix2pixGAN is superior to pix2pixGAN and CycleGAN. MNetO exhibits a few artifacts due to the differences among the multiple models, but any one of its models is suitable for imaging the exterior edge in a certain direction.

BitCoin: Bidirectional Tagging and Supervised Contrastive Learning based Joint Relational Triple Extraction Framework

  • paper_url: http://arxiv.org/abs/2309.11853
  • repo_url: None
  • paper_authors: Luyao He, Zhongbao Zhang, Sen Su, Yuxin Chen
  • for: Improving the accuracy and efficiency of relational triple extraction (RTE) and addressing the limitations of existing methods.
  • methods: BitCoin, a bidirectional tagging and supervised contrastive learning based joint relational triple extraction framework, with taggers in both directions so triples can be extracted from subject to object and from object to subject.
  • results: State-of-the-art results on the benchmark datasets, with significantly improved F1 scores on Normal, SEO, EPO and multiple relation extraction tasks.
    Abstract Relation triple extraction (RTE) is an essential task in information extraction and knowledge graph construction. Despite recent advancements, existing methods still exhibit certain limitations. They just employ generalized pre-trained models and do not consider the specificity of RTE tasks. Moreover, existing tagging-based approaches typically decompose the RTE task into two subtasks, initially identifying subjects and subsequently identifying objects and relations. They solely focus on extracting relational triples from subject to object, neglecting that once the extraction of a subject fails, it fails in extracting all triples associated with that subject. To address these issues, we propose BitCoin, an innovative Bidirectional tagging and supervised Contrastive learning based joint relational triple extraction framework. Specifically, we design a supervised contrastive learning method that considers multiple positives per anchor rather than restricting it to just one positive. Furthermore, a penalty term is introduced to prevent excessive similarity between the subject and object. Our framework implements taggers in two directions, enabling triples extraction from subject to object and object to subject. Experimental results show that BitCoin achieves state-of-the-art results on the benchmark datasets and significantly improves the F1 score on Normal, SEO, EPO, and multiple relation extraction tasks.

How Prevalent is Gender Bias in ChatGPT? – Exploring German and English ChatGPT Responses

  • paper_url: http://arxiv.org/abs/2310.03031
  • repo_url: https://github.com/Ognatai/bias_chatGPT
  • paper_authors: Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
  • for: Exploring how OpenAI's ChatGPT can help users with limited IT expertise draft texts for their daily work, and examining the model's limitations and biases.
  • methods: A systematic analysis of prompts and generated responses, with a special focus on gender biases, examining how ChatGPT reacts in English and German when prompted from a female, male, or neutral perspective.
  • results: ChatGPT is useful for helping non-IT users draft texts, but the system's responses must be checked carefully for biases as well as syntactic and grammatical mistakes.
    Abstract With the introduction of ChatGPT, OpenAI made large language models (LLM) accessible to users with limited IT expertise. However, users with no background in natural language processing (NLP) might lack a proper understanding of LLMs and thus of their inherent limitations, and will therefore take the systems' output at face value. In this paper, we systematically analyse prompts and the generated responses to identify possible problematic issues, with a special focus on gender biases that users need to be aware of when processing the system's output. We explore how ChatGPT reacts in English and German if prompted to answer from a female, male, or neutral perspective. In an in-depth investigation, we examine selected prompts and analyse to what extent responses differ if the system is prompted several times in an identical way. On this basis, we show that ChatGPT is indeed useful for helping non-IT users draft texts for their daily work. However, it is absolutely crucial to thoroughly check the system's responses for biases as well as for syntactic and grammatical mistakes.

Evaluating Large Language Models for Document-grounded Response Generation in Information-Seeking Dialogues

  • paper_url: http://arxiv.org/abs/2309.11838
  • repo_url: None
  • paper_authors: Norbert Braunschweiler, Rama Doddipatla, Simon Keizer, Svetlana Stoyanchev
  • for: Investigating the use of large language models (LLMs) such as ChatGPT for document-grounded response generation in information-seeking dialogues.
  • methods: Two approaches: ChatCompletion, which uses knowledge from ChatGPT model pretraining, and LlamaIndex, which additionally extracts relevant information from the documents.
  • results: Because document-grounded responses from LLMs are much more verbose, automatic metrics cannot assess them adequately, so a human evaluation compared the shared-task winning system, the two ChatGPT variants, and human responses; both ChatGPT variants were rated higher than the shared-task winner and the human responses.
    Abstract In this paper, we investigate the use of large language models (LLMs) like ChatGPT for document-grounded response generation in the context of information-seeking dialogues. For evaluation, we use the MultiDoc2Dial corpus of task-oriented dialogues in four social service domains previously used in the DialDoc 2022 Shared Task. Information-seeking dialogue turns are grounded in multiple documents providing relevant information. We generate dialogue completion responses by prompting a ChatGPT model, using two methods: Chat-Completion and LlamaIndex. ChatCompletion uses knowledge from ChatGPT model pretraining while LlamaIndex also extracts relevant information from documents. Observing that document-grounded response generation via LLMs cannot be adequately assessed by automatic evaluation metrics as they are significantly more verbose, we perform a human evaluation where annotators rate the output of the shared task winning system, the two Chat-GPT variants outputs, and human responses. While both ChatGPT variants are more likely to include information not present in the relevant segments, possibly including a presence of hallucinations, they are rated higher than both the shared task winning system and human responses.
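
A hedged sketch of the retrieve-then-prompt pattern that both variants instantiate, written against the OpenAI Python SDK v1 (model name and prompt wording are placeholders, not the paper's exact setup):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grounded_response(question: str, passages: list[str]) -> str:
    """Generate a dialogue response grounded in retrieved document passages."""
    context = "\n\n".join(passages)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided documents."},
            {"role": "user",
             "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```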

Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction

  • paper_url: http://arxiv.org/abs/2309.11811
  • repo_url: https://github.com/itu-ai-ml-in-5g-challenge/deepsense6g_tii
  • paper_authors: Yu Tian, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, Faouzi Bader
  • For: Improving beam management in high-frequency wireless communications with large antenna arrays by exploiting multimodal sensing information from cameras, LiDAR, radar, and GPS.
  • Methods: A multimodal transformer deep learning framework that treats images, point clouds, and radar raw data as time series, extracts features with convolutional neural networks, and uses transformer encoders to learn hidden relations between feature tokens from different modalities and time instances.
  • Results: Trained on image and GPS data, the solution achieves the best distance-based beam prediction accuracy of 78.44%, generalizing to unseen day scenarios near 73% and night scenarios over 84%, outperforming other modality combinations and arbitrary data processing techniques.
    Abstract Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained from large sequences of multimodality wireless data, on fine-tuning for multiple downstream radio network tasks.
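
Among the training details mentioned, focal loss is the easiest to make concrete; a minimal hedged sketch in its binary form (hyper-parameters are common defaults, not necessarily the paper's) is:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples so that training
    concentrates on hard, under-represented beams."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(32, 1)
targets = torch.randint(0, 2, (32, 1)).float()
loss = focal_loss(logits, targets)
```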

JobRecoGPT – Explainable job recommendations using LLMs

  • paper_url: http://arxiv.org/abs/2309.11805
  • repo_url: None
  • paper_authors: Preetam Ghosh, Vaishali Sadaphal
  • for: Proposing a job recommendation method based on natural language understanding that avoids the information loss incurred by converting unstructured job descriptions and resumes into structured form.
  • methods: Uses Large Language Models (LLMs) to capture the information in raw text and compares four approaches: (i) content-based deterministic, (ii) LLM guided, (iii) LLM unguided, and (iv) hybrid.
  • results: The LLM-guided and hybrid approaches perform better than the content-based deterministic and unguided LLM approaches, and also have lower time requirements; the advantages and limitations of each method are discussed.
    Abstract In today's rapidly evolving job market, finding the right opportunity can be a daunting challenge. With advancements in the field of AI, computers can now recommend suitable jobs to candidates. However, the task of recommending jobs is not the same as recommending movies to viewers. Apart from must-have criteria, like skills and experience, there are many subtle aspects of a job that can decide whether it is a good fit for a given candidate. Traditional approaches can capture the quantifiable aspects of jobs and candidates, but a substantial portion of the data present in unstructured form in job descriptions and resumes is lost in the conversion to a structured format. Of late, Large Language Models (LLMs) have taken the AI field by storm with extraordinary performance in fields where text-based data is available. Inspired by the superior performance of LLMs, we leverage their capability to understand natural language to capture the information that was previously lost during the conversion of unstructured data to structured form. To this end, we compare the performance of four different approaches for job recommendations, namely (i) content-based deterministic, (ii) LLM guided, (iii) LLM unguided, and (iv) hybrid. In this study, we present the advantages and limitations of each method and evaluate their performance in terms of time requirements.

DimCL: Dimensional Contrastive Learning For Improving Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.11782
  • repo_url: None
  • paper_authors: Thanh Nguyen, Trung Pham, Chaoning Zhang, Tung Luu, Thang Vu, Chang D. Yoo
  • for: Improving the performance of self-supervised learning (SSL), in particular enhancing non-contrastive-learning frameworks.
  • methods: Proposes Dimensional Contrastive Learning (DimCL), which performs contrastive learning along the dimensional direction rather than the batch direction, enhancing feature diversity and serving as a regularizer for prior SSL frameworks.
  • results: Extensive experiments on various datasets and backbone architectures show that DimCL improves SSL performance, with the hardness-aware property identified as a key reason for its success.
    Abstract Self-supervised learning (SSL) has gained remarkable success, for which contrastive learning (CL) plays a key role. However, the recent development of new non-CL frameworks has achieved comparable or better performance with high improvement potential, prompting researchers to enhance these frameworks further. Assimilating CL into non-CL frameworks has been thought to be beneficial, but empirical evidence indicates no visible improvements. In view of that, this paper proposes a strategy of performing CL along the dimensional direction instead of along the batch direction as done in conventional contrastive learning, named Dimensional Contrastive Learning (DimCL). DimCL aims to enhance the feature diversity, and it can serve as a regularizer to prior SSL frameworks. DimCL has been found to be effective, and the hardness-aware property is identified as a critical reason for its success. Extensive experimental results reveal that assimilating DimCL into SSL frameworks leads to performance improvement by a non-trivial margin on various datasets and backbone architectures.
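
A hedged sketch of the core move, contrasting along the feature dimension instead of the batch dimension, can be obtained by transposing the embedding matrices before an InfoNCE-style loss (illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def dimensional_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                                 tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two views. Conventional CL contrasts the
    B rows; here we transpose and contrast the D feature dimensions instead,
    encouraging each dimension to stay distinctive across views."""
    d1 = F.normalize(z1.t(), dim=1)       # (D, B): one vector per dimension
    d2 = F.normalize(z2.t(), dim=1)
    logits = d1 @ d2.t() / tau            # (D, D) similarities of dimensions
    targets = torch.arange(d1.size(0))    # dimension i matches dimension i
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)  # batch 256, 64 dims
loss = dimensional_contrastive_loss(z1, z2)
```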

2DDATA: 2D Detection Annotations Transmittable Aggregation for Semantic Segmentation on Point Cloud

  • paper_url: http://arxiv.org/abs/2309.11755
  • repo_url: None
  • paper_authors: Guan-Cheng Lee
  • for: Addressing the complicated and costly cross-modality calibration that existing multi-modality models require, which hinders their use in practical scenarios.
  • methods: 2D Detection Annotations Transmittable Aggregation (2DDATA), with a data-specific Local Object Branch that handles points inside a given bounding box, exploiting the ease of acquiring 2D bounding box annotations.
  • results: Demonstrates that the simple design can transmit bounding box prior information to the 3D encoder model, proving the feasibility of large multi-modality models fused with modality-specific data.
    Abstract Recently, multi-modality models have been introduced to exploit the complementary information from different sensors such as LiDAR and cameras. They require paired data along with precise calibration for all modalities; the complicated cross-modality calibration hugely increases the cost of collecting such high-quality datasets and hinders their application to practical scenarios. Inheriting from previous works, we not only fuse multi-modality information without the above issues but also fully exploit the information in the RGB modality. We introduce 2D Detection Annotations Transmittable Aggregation (\textbf{2DDATA}), designing a data-specific branch, called the \textbf{Local Object Branch}, which aims to deal with points inside a certain bounding box, exploiting the ease of acquiring 2D bounding box annotations. We demonstrate that our simple design can transmit bounding box prior information to the 3D encoder model, proving the feasibility of large multi-modality models fused with modality-specific data.
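
The paper describes the Local Object Branch only at a high level; as a rough picture of its input, the sketch below masks the points whose image-plane projections fall inside a given 2D box. The projection step and all names are our assumptions.

```python
import numpy as np

def points_in_2d_box(uv: np.ndarray, box) -> np.ndarray:
    """Boolean mask over projected points (u, v) inside a 2D bounding box
    given as (x_min, y_min, x_max, y_max). Projection is assumed upstream."""
    x0, y0, x1, y1 = box
    return (uv[:, 0] >= x0) & (uv[:, 0] <= x1) & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)

uv = np.random.uniform(0, 640, size=(1000, 2))   # fake projected LiDAR points
mask = points_in_2d_box(uv, (100, 100, 300, 250))
print(mask.sum(), "points routed to the local object branch")
```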

Improve the efficiency of deep reinforcement learning through semantic exploration guided by natural language

  • paper_url: http://arxiv.org/abs/2309.11753
  • repo_url: None
  • paper_authors: Zhourui Guo, Meng Yao, Yang Yu, Qiyue Yin
  • for: Propose a new RL method that uses an oracle more efficiently to improve reinforcement learning performance.
  • methods: Interacts with the oracle selectively via a retrieval-based approach: a neural network encodes the current state of the agent and the oracle, the most relevant templated question is retrieved from a corpus of previous interactions, and the oracle's answer is used to update the agent's policy and value function.
  • results: On an object manipulation task, the method significantly improves RL efficiency, reducing the number of oracle interactions needed to reach a given performance level compared to baseline methods.
    Abstract Reinforcement learning is a powerful technique for learning from trial and error, but it often requires a large number of interactions to achieve good performance. In some domains, such as sparse-reward tasks, an oracle that can provide useful feedback or guidance to the agent during the learning process is of great importance. However, querying the oracle too frequently may be costly or impractical, and the oracle may not always have a clear answer for every situation. Therefore, we propose a novel method for interacting with the oracle in a selective and efficient way, using a retrieval-based approach. We assume that the interaction can be modeled as a sequence of templated questions and answers, and that there is a large corpus of previous interactions available. We use a neural network to encode the current state of the agent and the oracle, and retrieve the most relevant question from the corpus to ask the oracle. We then use the oracle's answer to update the agent's policy and value function. We evaluate our method on an object manipulation task. We show that our method can significantly improve the efficiency of RL by reducing the number of interactions needed to reach a certain level of performance, compared to baselines that do not use the oracle or use it in a naive way.
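
The retrieval step amounts to nearest-neighbor search over embeddings of past templated questions. A minimal sketch, with all interfaces assumed rather than taken from the paper; the oracle's answer to the retrieved question would then drive the policy and value updates.

```python
import numpy as np

def retrieve_question(state_emb, question_embs, questions, k=1):
    """Return the k corpus questions most relevant to the current state.
    Embeddings are assumed L2-normalized, so dot product = cosine similarity."""
    scores = question_embs @ state_emb            # (num_questions,)
    top = np.argsort(-scores)[:k]
    return [questions[i] for i in top]

questions = ["Is the object graspable?", "Is the gripper aligned?", "Is the path clear?"]
question_embs = np.random.randn(3, 16)
question_embs /= np.linalg.norm(question_embs, axis=1, keepdims=True)
state_emb = question_embs[1] + 0.1 * np.random.randn(16)  # state near question 1
print(retrieve_question(state_emb, question_embs, questions))
```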

How Robust is Google’s Bard to Adversarial Image Attacks?

  • paper_url: http://arxiv.org/abs/2309.11751
  • repo_url: https://github.com/thu-ml/attack-bard
  • paper_authors: Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
  • for: Studies the adversarial robustness of Google's Bard chatbot to better understand the vulnerabilities of commercial multimodal large language models (MLLMs).
  • methods: Attacks white-box surrogate vision encoders or MLLMs to generate adversarial examples, and shows that these examples can mislead Bard into outputting wrong image descriptions.
  • results: The attack succeeds against Bard in 22% of cases based solely on transferability, and the same examples also attack other MLLMs (26% success against Bing Chat, 86% against ERNIE bot). The paper further identifies two of Bard's defense mechanisms, face detection and image toxicity detection, and designs attacks that evade both.
    Abstract Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks by introducing the vision inputs. In this work, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that released its multimodal capability recently, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard to output wrong image descriptions with a 22% success rate based solely on the transferability. We show that the adversarial examples can also attack other MLLMs, e.g., a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding on the robustness of MLLMs and facilitate future research on defenses. Our code is available at https://github.com/thu-ml/Attack-Bard. Update: GPT-4V became available in October 2023. We further evaluate its robustness under the same set of adversarial examples, achieving a 45% attack success rate.
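
Transfer attacks of this kind are typically crafted with projected gradient descent on a white-box surrogate; the sketch below is a generic feature-space PGD, not the authors' exact recipe (their code is in the linked repository).

```python
import torch

def pgd_on_surrogate(encoder, image, eps=8/255, alpha=1/255, steps=40):
    """Push the adversarial image's features away from the clean features of
    a white-box surrogate encoder, hoping the perturbation transfers to a
    black-box MLLM. `encoder` is any differentiable image encoder."""
    clean_feat = encoder(image).detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = (encoder(adv) - clean_feat).norm()      # maximize feature drift
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)    # project to L-inf ball
        adv = adv.clamp(0, 1)                           # keep a valid image
    return adv

# Toy usage with a stand-in encoder
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
adv = pgd_on_surrogate(encoder, torch.rand(1, 3, 32, 32))
```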

LPML: LLM-Prompting Markup Language for Mathematical Reasoning

  • paper_url: http://arxiv.org/abs/2309.13078
  • repo_url: None
  • paper_authors: Ryutaro Yamauchi, Sho Sonoda, Akiyoshi Sannai, Wataru Kumagai
  • for: Use large language models (LLMs) for mathematical reasoning while correcting the reasoning and calculation errors present in LLM-generated text.
  • methods: Proposes a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (a Python REPL), prompting the LLM to generate structured text in an XML-like markup language that controls undesired behaviors and lets Python computation rectify errors within the CoT.
  • results: Applied to ChatGPT (GPT-3.5) on challenging mathematical problems, combining CoT and the Python REPL through the markup language enhances the reasoning capability of LLMs and enables advanced mathematical reasoning with zero-shot prompting alone.
    Abstract In utilizing large language models (LLMs) for mathematical reasoning, addressing the errors in the reasoning and calculation present in the generated text by LLMs is a crucial challenge. In this paper, we propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL). We discovered that by prompting LLMs to generate structured text in XML-like markup language, we could seamlessly integrate CoT and the external tool and control the undesired behaviors of LLMs. With our approach, LLMs can utilize Python computation to rectify errors within CoT. We applied our method to ChatGPT (GPT-3.5) to solve challenging mathematical problems and demonstrated that combining CoT and Python REPL through the markup language enhances the reasoning capability of LLMs. Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
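
One way to picture the framework: the model's output interleaves reasoning and code tags, and a controller executes the code blocks and splices the results back in. The tag names below are our own illustration; the paper defines its own markup schema.

```python
import re, io, contextlib

def run_python_tags(llm_output: str) -> str:
    """Execute each <PYTHON>...</PYTHON> block in the LLM's markup output and
    append its stdout as an <OUTPUT> tag, so later reasoning can rely on the
    verified result. No sandboxing here; illustration only."""
    def _exec(match):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(match.group(1), {})               # isolated namespace
        return match.group(0) + f"<OUTPUT>{buf.getvalue().strip()}</OUTPUT>"
    return re.sub(r"<PYTHON>(.*?)</PYTHON>", _exec, llm_output, flags=re.S)

text = "<THINK>Compute 127 * 38.</THINK><PYTHON>print(127 * 38)</PYTHON>"
print(run_python_tags(text))   # ...<OUTPUT>4826</OUTPUT>
```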

Choice-75: A Dataset on Decision Branching in Script Learning

  • paper_url: http://arxiv.org/abs/2309.11737
  • repo_url: None
  • paper_authors: Zhaoyi Joey Hou, Li Zhang, Chris Callison-Burch
  • for: Study how daily events unfold, including the decision branches that arise from people's circumstantial choices.
  • methods: Proposes Choice-75, the first benchmark that challenges intelligent systems to predict decisions given descriptive scenarios, comprising 75 scripts and more than 600 scenarios.
  • results: Large language models perform decently overall, but there remains notable room for improvement in many hard scenarios.
    Abstract Script learning studies how daily events unfold. Previous works tend to consider a script as a linear sequence of events while ignoring the potential branches that arise due to people's circumstantial choices. We hence propose Choice-75, the first benchmark that challenges intelligent systems to predict decisions given descriptive scenarios, containing 75 scripts and more than 600 scenarios. While large language models demonstrate overall decent performances, there is still notable room for improvement in many hard scenarios.

A Differentiable Framework for End-to-End Learning of Hybrid Structured Compression

  • paper_url: http://arxiv.org/abs/2309.13077
  • repo_url: None
  • paper_authors: Moonjung Eo, Suhyun Kang, Wonjong Rhee
  • for: Improve the performance of structured compression techniques.
  • methods: A Differentiable Framework (DF) trained with gradient-based optimization, combining DML-S for filter selection (integrating scheduling into mask learning) and DTL-S for rank selection (via a singular value thresholding operator).
  • results: Experiments demonstrate that DF surpasses state-of-the-art structured compression methods.
    Abstract Filter pruning and low-rank decomposition are two of the foundational techniques for structured compression. Although recent efforts have explored hybrid approaches aiming to integrate the advantages of both techniques, their performance gains have been modest at best. In this study, we develop a \textit{Differentiable Framework~(DF)} that can express filter selection, rank selection, and budget constraint into a single analytical formulation. Within the framework, we introduce DML-S for filter selection, integrating scheduling into existing mask learning techniques. Additionally, we present DTL-S for rank selection, utilizing a singular value thresholding operator. The framework with DML-S and DTL-S offers a hybrid structured compression methodology that facilitates end-to-end learning through gradient-base optimization. Experimental results demonstrate the efficacy of DF, surpassing state-of-the-art structured compression methods. Our work establishes a robust and versatile avenue for advancing structured compression techniques.
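
The singular value thresholding operator behind DTL-S has a compact generic form: shrink the singular values and drop those that reach zero, lowering the effective rank. A sketch under that reading (the paper's scheduling and differentiable-masking machinery is omitted):

```python
import torch

def singular_value_threshold(W: torch.Tensor, tau: float):
    """Soft-threshold the singular values of a weight matrix; values below
    tau vanish, which lowers the effective rank of the layer."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S_shrunk = torch.clamp(S - tau, min=0.0)
    kept = int((S_shrunk > 0).sum())               # surviving rank
    A = U[:, :kept] * S_shrunk[:kept]              # (out, r)
    B = Vh[:kept]                                  # (r, in)
    return A, B, kept

A, B, rank = singular_value_threshold(torch.randn(256, 512), tau=5.0)
print(rank, (A @ B).shape)   # low-rank factors replace the dense weight
```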

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

  • paper_url: http://arxiv.org/abs/2309.11725
  • repo_url: https://github.com/ai-s2-lab/fluenteditor
  • paper_authors: Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li
  • for: Improve fluency in text-based speech editing, letting users edit speech by modifying the input text transcript rather than the audio itself.
  • methods: A neural text-based speech editing model trained with fluency-aware criteria: an acoustic consistency constraint smooths the transition between the edited region and its neighboring acoustic segments, while a prosody consistency constraint keeps the prosody of the edited region consistent with the overall style of the original utterance.
  • results: Subjective and objective experiments on VCTK show that FluentEditor surpasses all advanced baselines in naturalness and fluency.
    Abstract Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE techniques, the current techniques have focused on reducing the difference between the generated speech segment and the reference target in the editing region, ignoring its local and global fluency in the context and original utterance. To maintain the speech fluency, we propose a fluency speech editing model, termed \textit{FluentEditor}, by considering fluency-aware training criterion in the TSE training. Specifically, the \textit{acoustic consistency constraint} aims to smooth the transition between the edited region and its neighboring acoustic segments consistent with the ground truth, while the \textit{prosody consistency constraint} seeks to ensure that the prosody attributes within the edited regions remain consistent with the overall style of the original utterance. The subjective and objective experimental results on VCTK demonstrate that our \textit{FluentEditor} outperforms all advanced baselines in terms of naturalness and fluency. The audio samples and code are available at \url{https://github.com/Ai-S2-Lab/FluentEditor}.

Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech

  • paper_url: http://arxiv.org/abs/2309.11724
  • repo_url: https://github.com/ai-s2-lab/emopp
  • paper_authors: Rui Liu, Bin Liu, Haizhou Li
  • for: Study prosodic phrasing for expressive emotion rendering in text-to-speech synthesis.
  • methods: Proposes an emotion-aware prosodic phrasing model, Emotion-Aware Prosodic Phrasing (EmoPP), that accurately mines the emotional cues of an utterance and predicts appropriate phrase breaks.
  • results: Objective and subjective evaluations on the ESD dataset show that EmoPP outperforms all baselines and achieves remarkable emotion expressiveness. Audio samples and code are available at https://github.com/AI-S2-Lab/EmoPP.
    Abstract Prosodic phrasing is crucial to the naturalness and intelligibility of end-to-end Text-to-Speech (TTS). There exist both linguistic and emotional prosody in natural speech. As the study of prosodic phrasing has been linguistically motivated, prosodic phrasing for expressive emotion rendering has not been well studied. In this paper, we propose an emotion-aware prosodic phrasing model, termed \textit{EmoPP}, to mine the emotional cues of utterance accurately and predict appropriate phrase breaks. We first conduct objective observations on the ESD dataset to validate the strong correlation between emotion and prosodic phrasing. Then the objective and subjective evaluations show that the EmoPP outperforms all baselines and achieves remarkable performance in terms of emotion expressiveness. The audio samples and the code are available at \url{https://github.com/AI-S2-Lab/EmoPP}.

A Dynamic Domain Adaptation Deep Learning Network for EEG-based Motor Imagery Classification

  • paper_url: http://arxiv.org/abs/2309.11714
  • repo_url: None
  • paper_authors: Jie Jiao, Meiyan Xu, Qingqing Chen, Hefan Zhou, Wangliang Zhou
  • for: Improve the accuracy and stability of BCI systems while reducing calibration time for new subjects.
  • methods: A Dynamic Domain Adaptation Based Deep Learning Network (DADL-Net) that maps EEG data into a 3D geometric space, learns temporal-spatial features through 3D convolution modules strengthened by a spatial-channel attention mechanism, and applies a dynamic domain-adaptation strategy with a Maximum Mean Discrepancy loss plus fine-tuning of the classification layer on part of the target-domain data.
  • results: Validated on the BCI Competition IV 2a and OpenBMI datasets, achieving accuracies of 70.42% (OpenBMI) and 73.91% (BCIC IV 2a).
    Abstract There is a correlation between adjacent channels of electroencephalogram (EEG), and how to represent this correlation is an issue currently being explored. In addition, because of inter-individual differences in EEG signals, new subjects need to spend considerable calibration time on an EEG-based motor imagery brain-computer interface. To solve the above problems, we propose a Dynamic Domain Adaptation Based Deep Learning Network (DADL-Net). First, the EEG data is mapped to a three-dimensional geometric space and its temporal-spatial features are learned through a 3D convolution module; a spatial-channel attention mechanism then strengthens the features, and a final convolution module further learns their spatial-temporal information. Finally, to account for inter-subject and cross-session differences, we employ a dynamic domain-adaptive strategy: the distance between features is reduced by introducing a Maximum Mean Discrepancy loss function, and the classification layer is fine-tuned using part of the target domain data. We verify the performance of the proposed method on the BCI Competition IV 2a and OpenBMI datasets. Under the intra-subject experiment, accuracies of 70.42% and 73.91% were achieved on the OpenBMI and BCIC IV 2a datasets.
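
The domain-adaptation term is a standard Maximum Mean Discrepancy; a minimal RBF-kernel version (the bandwidth choice is ours) looks like this:

```python
import torch

def mmd_loss(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0):
    """Biased MMD^2 estimate with an RBF kernel: pulls the source- and
    target-domain feature distributions together when minimized."""
    def rbf(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return (rbf(source, source).mean()
            + rbf(target, target).mean()
            - 2 * rbf(source, target).mean())

# Toy usage: EEG features from a source subject vs. a new target subject
print(mmd_loss(torch.randn(64, 128), torch.randn(64, 128) + 0.5))
```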

cs.CL - 2023-09-21

Towards Lexical Analysis of Dog Vocalizations via Online Videos

  • paper_url: http://arxiv.org/abs/2309.13086
  • repo_url: None
  • paper_authors: Yufei Wang, Chunhao Zhang, Jieyi Huang, Mengyue Wu, Kenny Zhu
  • for: Address the grand challenge of deciphering the semantics of dog vocalizations.
  • methods: A data-driven approach that analyzes the conditional probability between dog vocalizations and the corresponding location and activity, using a new dataset of Shiba Inu sounds collected from YouTube with a well-constructed pipeline.
  • results: Finds supporting evidence for previous heuristic research on the semantics of dog sounds (e.g., growls can signify interactions), and yields new insights, such as that whimpers can be subdivided into two finer-grained types: attention-seeking and discomfort.
    Abstract Deciphering the semantics of animal language has been a grand challenge. This study presents a data-driven investigation into the semantics of dog vocalizations via correlating different sound types with consistent semantics. We first present a new dataset of Shiba Inu sounds, along with contextual information such as location and activity, collected from YouTube with a well-constructed pipeline. The framework is also applicable to other animal species. Based on the analysis of conditioned probability between dog vocalizations and corresponding location and activity, we discover supporting evidence for previous heuristic research on the semantic meaning of various dog sounds. For instance, growls can signify interactions. Furthermore, our study yields new insights that existing word types can be subdivided into finer-grained subtypes and minimal semantic unit for Shiba Inu is word-related. For example, whimper can be subdivided into two types, attention-seeking and discomfort.
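
The core analysis reduces to estimating P(context | sound type) from labeled clips. A toy version with fabricated records (the real dataset pairs Shiba Inu sounds with location and activity):

```python
import pandas as pd

# Fabricated examples purely for illustration
df = pd.DataFrame({
    "sound":    ["growl", "growl", "whimper", "bark", "whimper", "bark"],
    "activity": ["play",  "play",  "alone",   "walk", "alone",   "play"],
})
# P(activity | sound): rows are sound types, columns are activities
cond = pd.crosstab(df["sound"], df["activity"], normalize="index")
print(cond)  # a sound type concentrated on one activity hints at its meaning
```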

Foundation Metrics: Quantifying Effectiveness of Healthcare Conversations powered by Generative AI

  • paper_url: http://arxiv.org/abs/2309.12444
  • repo_url: None
  • paper_authors: Mahyar Abbasian, Elahe Khatibi, Iman Azimi, David Oniani, Zahra Shakeri Hossein Abad, Alexander Thieme, Ram Sriram, Zhongqi Yang, Yanshan Wang, Bryant Lin, Olivier Gevaert, Li-Jia Li, Ramesh Jain, Amir M. Rahmani
  • for: Establish evaluation metrics for healthcare chatbots in order to improve patient health outcomes.
  • methods: Reviews existing evaluation metrics for large language models and adapts and extends them to the specific characteristics of healthcare conversational models.
  • results: Finds that existing metrics cannot fully assess healthcare chatbots because they lack an understanding of medical concepts and patient-centered needs; proposes a comprehensive metric set covering language processing ability, impact on real-world clinical tasks, and effectiveness in user-interactive conversations.
    Abstract Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially augment patient health outcomes, all the while mitigating the workload burden on healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present an comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we engage in a discussion concerning the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.

Active Learning for Multilingual Fingerspelling Corpora

  • paper_url: http://arxiv.org/abs/2309.12443
  • repo_url: None
  • paper_authors: Shuai Wang, Eric Nalisnick
  • for: Help with data scarcity problems in sign languages.
  • methods: Active learning, combined with a novel analysis of the effect of pre-training.
  • results: Observes a benefit from pre-training, plausibly exploiting the hand configurations shared across sign languages descended from French Sign Language, though the benefit may stem from visual rather than linguistic similarities.
    Abstract We apply active learning to help with data scarcity problems in sign languages. In particular, we perform a novel analysis of the effect of pre-training. Since many sign languages are linguistic descendants of French sign language, they share hand configurations, which pre-training can hopefully exploit. We test this hypothesis on American, Chinese, German, and Irish fingerspelling corpora. We do observe a benefit from pre-training, but this may be due to visual rather than linguistic similarities.
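
For context, the uncertainty-sampling loop that active learning typically uses can be sketched generically; nothing below is specific to the paper's sign-language models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_batch(model, pool_X, k=5):
    """Indices of the k least-confident pool examples (lowest max probability)."""
    probs = model.predict_proba(pool_X)
    return np.argsort(probs.max(axis=1))[:k]

rng = np.random.default_rng(0)
X_seed = rng.normal(size=(40, 16))
y_seed = rng.integers(0, 5, size=40)          # e.g. 5 fingerspelled letters
pool_X = rng.normal(size=(500, 16))
clf = LogisticRegression(max_iter=500).fit(X_seed, y_seed)
print(select_batch(clf, pool_X))              # send these for annotation
```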

Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12294
  • repo_url: None
  • paper_authors: Levon Haroutunian, Zhuang Li, Lucian Galescu, Philip Cohen, Raj Tumuluri, Gholamreza Haffari
  • for: Improve the quality of natural language generated by large language models (LLMs) from logical forms.
  • methods: A generate-and-rerank approach: first prompt an LLM to produce a set of candidate outputs, then rerank them with a task-specific reranker model.
  • results: Extensive experiments on three diverse datasets show that the reranked candidates outperform baseline selections in semantic consistency and fluency.
    Abstract Large language models (LLMs) have demonstrated impressive capabilities in natural language generation. However, their output quality can be inconsistent, posing challenges for generating natural language from logical forms (LFs). This task requires the generated outputs to embody the exact semantics of LFs, without missing any LF semantics or creating any hallucinations. In this work, we tackle this issue by proposing a novel generate-and-rerank approach. Our approach involves initially generating a set of candidate outputs by prompting an LLM and subsequently reranking them using a task-specific reranker model. In addition, we curate a manually collected dataset to evaluate the alignment between different ranking metrics and human judgements. The chosen ranking metrics are utilized to enhance the training and evaluation of the reranker model. By conducting extensive experiments on three diverse datasets, we demonstrate that the candidates selected by our reranker outperform those selected by baseline methods in terms of semantic consistency and fluency, as measured by three comprehensive metrics. Our findings provide strong evidence for the effectiveness of our approach in improving the quality of generated outputs.
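
The pipeline itself is two calls: sample candidates, keep the reranker's favorite. A schematic version with stub functions standing in for the LLM and the trained reranker:

```python
import random

def generate_and_rerank(sample_fn, score_fn, logical_form, n=8):
    """Sample n candidate verbalizations, return the highest-scoring one."""
    candidates = [sample_fn(logical_form) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(logical_form, c))

# Stubs: a real system would call an LLM and a fine-tuned reranker model
sample_fn = lambda lf: f"Paraphrase #{random.randint(0, 99)} of {lf}"
score_fn = lambda lf, cand: -abs(len(cand) - 30)   # toy fluency proxy
print(generate_and_rerank(sample_fn, score_fn, "count(rivers_in(Texas))"))
```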

Inspire the Large Language Model by External Knowledge on BioMedical Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2309.12278
  • repo_url: None
  • paper_authors: Junyi Bian, Jiaxuan Zheng, Yuyi Zhang, Shanfeng Zhu
  • for: Tackle the biomedical named entity recognition (BioNER) task, in particular by leveraging large language models (LLMs).
  • methods: A two-step approach that decomposes NER into entity span extraction and entity type determination; for type determination, entity knowledge is injected into the LLM to compensate for its lack of domain knowledge when predicting entity categories.
  • results: The two-step BioNER approach improves significantly over previous few-shot LLM baselines, and injecting external knowledge further enhances entity type determination.
    Abstract Large language models (LLMs) have demonstrated dominating performance in many NLP tasks, especially on generative tasks. However, they often fall short in some information extraction tasks, particularly those requiring domain-specific knowledge, such as Biomedical Named Entity Recognition (NER). In this paper, inspired by Chain-of-thought, we leverage the LLM to solve the Biomedical NER step-by-step: break down the NER task into entity span extraction and entity type determination. Additionally, for entity type determination, we inject entity knowledge to address the problem that LLM's lack of domain knowledge when predicting entity category. Experimental results show a significant improvement in our two-step BioNER approach compared to previous few-shot LLM baseline. Additionally, the incorporation of external knowledge significantly enhances entity category determination performance.

Improving VTE Identification through Adaptive NLP Model Selection and Clinical Expert Rule-based Classifier from Radiology Reports

  • paper_url: http://arxiv.org/abs/2309.12273
  • repo_url: None
  • paper_authors: Jamie Deng, Yusen Wu, Hilary Hayssen, Brain Englum, Aman Kankaria, Minerva Mayorga-Carlin, Shalini Sahoo, John Sorkin, Brajesh Lal, Yelena Yesha, Phuong Nguyen
  • for: Improve the identification of deep vein thrombosis (DVT) and pulmonary embolism (PE) in unstructured (free-text) radiology reports to support effective treatment of this cardiovascular condition.
  • methods: Natural language processing combining deep learning (DL) with data augmentation, adaptive selection of pre-trained NLP models, and a clinical-expert rule-based classifier.
  • results: Achieves 97% accuracy and a 97% F1 score for predicting DVT, and 98.3% accuracy and a 98.4% F1 score for predicting PE.
    Abstract Rapid and accurate identification of Venous thromboembolism (VTE), a severe cardiovascular condition including deep vein thrombosis (DVT) and pulmonary embolism (PE), is important for effective treatment. Leveraging Natural Language Processing (NLP) on radiology reports, automated methods have shown promising advancements in identifying VTE events from retrospective data cohorts or aiding clinical experts in identifying VTE events from radiology reports. However, effectively training Deep Learning (DL) and the NLP models is challenging due to limited labeled medical text data, the complexity and heterogeneity of radiology reports, and data imbalance. This study proposes novel method combinations of DL methods, along with data augmentation, adaptive pre-trained NLP model selection, and a clinical expert NLP rule-based classifier, to improve the accuracy of VTE identification in unstructured (free-text) radiology reports. Our experimental results demonstrate the model's efficacy, achieving an impressive 97\% accuracy and 97\% F1 score in predicting DVT, and an outstanding 98.3\% accuracy and 98.4\% F1 score in predicting PE. These findings emphasize the model's robustness and its potential to significantly contribute to VTE research.

The Cambridge Law Corpus: A Corpus for Legal AI Research

  • paper_url: http://arxiv.org/abs/2309.12269
  • repo_url: None
  • paper_authors: Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek
  • for: Introduce the Cambridge Law Corpus (CLC) to advance legal AI research. The CLC contains over 250,000 court cases from the UK; most are from the 21st century, but the corpus includes cases as old as the 16th century.
  • methods: This first release provides the raw text and metadata, together with annotations of case outcomes for 638 cases made by legal experts. Using the annotated data, GPT-3, GPT-4, and RoBERTa models are trained and evaluated for case outcome extraction.
  • results: The trained models provide benchmarks for future legal AI research. The paper also includes an extensive legal and ethical discussion addressing the potentially sensitive nature of the material; as a consequence, the corpus is released for research purposes only, under certain restrictions.
    Abstract We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

On the Relationship between Skill Neurons and Robustness in Prompt Tuning

  • paper_url: http://arxiv.org/abs/2309.12263
  • repo_url: None
  • paper_authors: Leon Ackermann, Xenia Ohmer
  • for: Studies the robustness of prompt tuning on pre-trained large language models (PLMs) and its relation to the task-specific "skill neurons" this method activates.
  • methods: Experiments with RoBERTa and T5 across tasks of the same and different types, including adversarial data.
  • results: Prompts tuned for a task transfer to tasks of the same type but are not very robust to adversarial data, with T5 more robust than RoBERTa. Skill neurons are replicated in RoBERTa and also found in T5; notably, T5's skill neurons identified on non-adversarial data remain among the most predictive neurons on adversarial data, which is not the case for RoBERTa.
    Abstract Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Recently, based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks, that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data, with higher robustness for T5 than RoBERTa. At the same time, we replicate the existence of skill neurons in RoBERTa and further show that skill neurons also seem to exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to activate the relevant skill neurons on adversarial data.
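
A rough picture of how skill neurons are identified: score each neuron by how well thresholding its activation predicts the task label. The sketch below is a simplified stand-in for the original protocol:

```python
import numpy as np

def neuron_predictivity(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-neuron accuracy when predicting a binary label by thresholding
    the activation at its mean; either polarity counts.
    acts: (num_examples, num_neurons), labels: (num_examples,) in {0, 1}."""
    preds = (acts > acts.mean(axis=0)).astype(int)      # (N, D)
    acc = (preds == labels[:, None]).mean(axis=0)       # (D,)
    return np.maximum(acc, 1.0 - acc)

acts = np.random.randn(1000, 3072)                      # FFN activations
labels = np.random.randint(0, 2, size=1000)
print(np.argsort(-neuron_predictivity(acts, labels))[:5])  # skill-neuron candidates
```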

SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References

  • paper_url: http://arxiv.org/abs/2309.12250
  • repo_url: None
  • paper_authors: Matteo Gabburo, Siddhant Garg, Rik Koncel Kedziorski, Alessandro Moschitti
  • for: Propose a new question answering evaluation metric for assessing the correctness of sentence-level QA systems.
  • methods: Evaluates answers against multiple references (combining several correct and incorrect ones) using transformer LM encoder-based similarity metrics.
  • results: SQuArE outperforms previous baselines and obtains the highest correlation with human annotations on both sentence-level extractive (answer selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets.
    Abstract Evaluation of QA systems is very challenging and expensive, with the most reliable approach being human annotations of correctness of answers for questions. Recent works (AVA, BEM) have shown that transformer LM encoder based similarity metrics transfer well for QA evaluation, but they are limited by the usage of a single correct reference answer. We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation), using multiple reference answers (combining multiple correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets, and show that it outperforms previous baselines and obtains the highest correlation with human annotations.
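
The key design is scoring a candidate against several correct and incorrect references at once. One plausible aggregation, our assumption rather than the paper's exact formula, rewards closeness to any positive reference and penalizes closeness to any negative one:

```python
import numpy as np

def multi_reference_score(cand, positives, negatives):
    """cand / references are L2-normalized embedding vectors; higher is better."""
    best_pos = max(float(cand @ p) for p in positives)
    best_neg = max(float(cand @ n) for n in negatives)
    return best_pos - best_neg

def unit(v):
    return v / np.linalg.norm(v)

cand = unit(np.random.randn(32))
positives = [unit(np.random.randn(32)) for _ in range(3)]
negatives = [unit(np.random.randn(32)) for _ in range(3)]
print(multi_reference_score(cand, positives, negatives))
```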

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.12234
  • repo_url: https://github.com/xuchennlp/s2t
  • paper_authors: Chen Xu, Xiaoqian Liu, Erfeng He, Yuhao Zhang, Qianqian Dong, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang
  • for: Bridge the gaps of both modality (speech vs. text) and language (source vs. target) in the speech translation task with a synchronous bilingual Connectionist Temporal Classification (CTC) framework.
  • methods: Uses the transcript and the translation as concurrent CTC objectives during training, so one model bridges audio and text as well as source and target languages; an enhanced variant, BiL-CTC+, builds on recent advances in CTC application.
  • results: BiL-CTC+ sets new state-of-the-art performance on the MuST-C speech translation benchmarks under resource-constrained scenarios, and also yields significant improvements in speech recognition, revealing a cross-lingual learning effect.
    Abstract In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
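
At its core the objective is two CTC losses sharing one encoder: one against the source transcript, one against the target translation. A shape-correct sketch with equal weighting (the paper's exact weighting and architecture may differ):

```python
import torch
import torch.nn.functional as F

def bilingual_ctc_loss(lp_src, lp_tgt, src_tokens, tgt_tokens,
                       in_lens, src_lens, tgt_lens):
    """lp_src / lp_tgt: (T, N, C) log-probabilities from two output heads on
    the same encoder; arguments follow torch.nn.functional.ctc_loss."""
    return (F.ctc_loss(lp_src, src_tokens, in_lens, src_lens)
            + F.ctc_loss(lp_tgt, tgt_tokens, in_lens, tgt_lens))

T, N, C = 50, 4, 100
lp_src = torch.randn(T, N, C).log_softmax(-1)
lp_tgt = torch.randn(T, N, C).log_softmax(-1)
src = torch.randint(1, C, (N, 12))   # transcript token ids (0 = blank)
tgt = torch.randint(1, C, (N, 15))   # translation token ids
in_lens = torch.full((N,), T)
print(bilingual_ctc_loss(lp_src, lp_tgt, src, tgt,
                         in_lens, torch.full((N,), 12), torch.full((N,), 15)))
```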

  • paper_url: http://arxiv.org/abs/2309.12224
  • repo_url: None
  • paper_authors: Deepak Gupta, Kush Attal, Dina Demner-Fushman
  • for: Answer health-related questions asked by the public by providing visual answers from medical videos.
  • methods: A pipelined approach creates two large-scale datasets, HealthVidQA-CRF and HealthVidQA-Prompt; monomodal and multimodal approaches then extract visual answers from medical videos for natural language questions.
  • results: The created datasets improve model training for the medical visual answer localization task, and visual features enhance the performance of both the monomodal and multimodal approaches.
    Abstract The increase in the availability of online videos has transformed the way we access information and knowledge. A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks. The instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. Toward this, this paper is focused on answering health-related questions asked by the public by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions. To address this issue, we first proposed a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. Later, we proposed monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We conducted a comprehensive analysis of the results, focusing on the impact of the created datasets on model training and the significance of visual features in enhancing the performance of the monomodal and multi-modal approaches. Our findings suggest that these datasets have the potential to enhance the performance of medical visual answer localization tasks and provide a promising future direction to further enhance the performance by using pre-trained language-vision models.

Code Soliloquies for Accurate Calculations in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12161
  • repo_url: https://github.com/luffycodes/tutorbot-spock-phys
  • paper_authors: Shashank Sonkar, MyCo Le, Xinghe Chen, Naiming Liu, Debshila Basu Mallick, Richard G. Baraniuk
  • for: Improve the quality of intelligent tutoring systems (ITS) that use a large language model (LLM) backend by producing higher-quality synthetic dialogue datasets for fine-tuning.
  • methods: Uses GPT-4 to simulate both the student and the tutorbot, with a novel stateful prompt design: each student response triggers an inner soliloquy in which the tutorbot assesses whether its reply requires calculation, and if so scripts the needed Python code and builds its response from the output.
  • results: The approach markedly improves the quality of synthetic conversation datasets for calculation-intensive subjects; fine-tuning with these code-soliloquy-enriched datasets improves both the accuracy and the computational reliability of the resulting Higgs model's responses.
    Abstract High-quality conversational datasets are integral to the successful development of Intelligent Tutoring Systems (ITS) that employ a Large Language Model (LLM) backend. These datasets, when used to fine-tune the LLM backend, significantly enhance the quality of interactions between students and ITS. A common strategy for developing these datasets involves generating synthetic student-teacher dialogues using advanced GPT-4 models. However, challenges arise when these dialogues demand complex calculations, common in subjects like physics. Despite its advanced capabilities, GPT-4's performance falls short in reliably handling even simple multiplication tasks, marking a significant limitation in its utility for these subjects. To address these challenges, this paper introduces an innovative stateful prompt design. Our approach generates a mock conversation between a student and a tutorbot, both roles simulated by GPT-4. Each student response triggers a soliloquy (an inner monologue) in the GPT-tutorbot, which assesses whether its response would necessitate calculations. If so, it proceeds to script the required code in Python and then uses the resulting output to construct its response to the student. Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive. Our findings show that our Higgs model -- a LLaMA finetuned with datasets generated through our novel stateful prompt design -- proficiently utilizes Python for computations. Consequently, finetuning with our datasets enriched with code soliloquies enhances not just the accuracy but also the computational reliability of Higgs' responses.

How-to Guides for Specific Audiences: A Corpus and Initial Findings

  • paper_url: http://arxiv.org/abs/2309.12117
  • repo_url: None
  • paper_authors: Nicola Fanton, Agnieszka Falenska, Michael Roth
  • for: Investigate whether how-to guides on wikiHow differ in practice depending on the intended audience.
  • methods: Two qualitative case studies of texts written for specific audiences, plus a generalization study using computational methods to detect systematic differences.
  • results: Guides from wikiHow, like other text genres, are subject to subtle biases reflecting disparate social norms and stereotypes; the authors aim to raise awareness of these inequalities as a first step toward addressing them.
    Abstract Instructional texts for specific target groups should ideally take into account the prior knowledge and needs of the readers in order to guide them efficiently to their desired goals. However, targeting specific groups also carries the risk of reflecting disparate social norms and subtle stereotypes. In this paper, we investigate the extent to which how-to guides from one particular platform, wikiHow, differ in practice depending on the intended audience. We conduct two case studies in which we examine qualitative features of texts written for specific audiences. In a generalization study, we investigate which differences can also be systematically demonstrated using computational methods. The results of our studies show that guides from wikiHow, like other text genres, are subject to subtle biases. We aim to raise awareness of these inequalities as a first step to addressing them in future work.

A Computational Analysis of Vagueness in Revisions of Instructional Texts

  • paper_url: http://arxiv.org/abs/2309.12107
  • repo_url: None
  • paper_authors: Alok Debnath, Michael Roth
  • for: Analyze revision histories on wikiHow to identify edits that resolve vagueness in instructions.
  • methods: Extracts pairwise versions of instructions before and after revision from a noisy dataset of revision histories, and evaluates a neural model on a pairwise ranking task from previous work.
  • results: The neural model distinguishes revised from original instructions, improving over existing baselines.
    Abstract WikiHow is an open-domain repository of instructional articles for a variety of tasks, which can be revised by users. In this paper, we extract pairwise versions of an instruction before and after a revision was made. Starting from a noisy dataset of revision histories, we specifically extract and analyze edits that involve cases of vagueness in instructions. We further investigate the ability of a neural model to distinguish between two versions of an instruction in our data by adopting a pairwise ranking task from previous work and showing improvements over existing baselines.

SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts

  • paper_url: http://arxiv.org/abs/2309.12102
  • repo_url: https://github.com/acidann/claire
  • paper_authors: Michael Roth, Talita Anthonio, Anna Sauer
  • for: A shared task on rating the plausibility of clarifications in instructional texts.
  • methods: Manually clarified how-to guides for which alternative clarifications were generated and human plausibility judgements collected; participating systems had to automatically determine the plausibility of a clarification in context.
  • results: 21 participants took part, with the best system achieving an accuracy of 68.9%; in an additional evaluation, the top team's predictions identify contexts with multiple plausible clarifications with an accuracy of 75.2%.
    Abstract We describe SemEval-2022 Task 7, a shared task on rating the plausibility of clarifications in instructional texts. The dataset for this task consists of manually clarified how-to guides for which we generated alternative clarifications and collected human plausibility judgements. The task of participating systems was to automatically determine the plausibility of a clarification in the respective context. In total, 21 participants took part in this task, with the best system achieving an accuracy of 68.9%. This report summarizes the results and findings from 8 teams and their system descriptions. Finally, we show in an additional evaluation that predictions by the top participating team make it possible to identify contexts with multiple plausible clarifications with an accuracy of 75.2%.

AceGPT, Localizing Large Language Models in Arabic

  • paper_url: http://arxiv.org/abs/2309.12053
  • repo_url: https://github.com/freedomintelligence/acegpt
  • paper_authors: Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Juncai He, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu
  • for: Develop a large language model (LLM) localized for Arabic, a language whose unique cultural characteristics are inadequately addressed by current mainstream models.
  • methods: A comprehensive solution comprising further pre-training on Arabic texts, supervised fine-tuning (SFT) with native Arabic instructions and Arabic GPT-4 responses, and reinforcement learning with AI feedback (RLAIF) using a reward model attuned to local culture and values.
  • results: The resulting model, AceGPT, is culturally cognizant and value-aligned, serving the diverse, application-specific needs of Arabic-speaking communities. It sets the state of the art for open Arabic LLMs on benchmarks including instruction following (Arabic Vicuna-80 and Arabic AlpacaEval), knowledge (Arabic MMLU and EXAMs), and the newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo on the Vicuna-80 benchmark when evaluated with GPT-4, despite that benchmark's limited scale. Code, data, and models are at https://github.com/FreedomIntelligence/AceGPT.
    Abstract This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed 'AceGPT', sets the state-of-the-art standard for open Arabic LLMs across various benchmarks, including the instruction-following benchmark (i.e., Arabic Vicuna-80 and Arabic AlpacaEval), knowledge benchmark (i.e., Arabic MMLU and EXAMs), and the newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo in the popular Vicuna-80 benchmark when evaluated with GPT-4, despite the benchmark's limited scale. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.

CAMERA: A Multimodal Dataset and Benchmark for Ad Text Generation

  • paper_url: http://arxiv.org/abs/2309.12030
  • repo_url: None
  • paper_authors: Masato Mita, Soichiro Murakami, Akihiko Kato, Peinan Zhang
  • for: Advance research on automatic ad text generation (ATG) by providing a comprehensive benchmark and a well-defined problem set for comparing different methods.
  • methods: Redefines ATG as a cross-application task covering various aspects of Internet advertising and leveraging multimodal information; introduces the first benchmark dataset for it, CA Multimodal Evaluation for Ad Text GeneRAtion (CAMERA), carefully designed for ATG and supporting industry-wise evaluation.
  • results: Evaluation experiments with multiple baseline models, varying the pre-trained language model and the incorporation of multimodal information, demonstrate the usefulness of the proposed benchmark; the paper also discusses the current state of the task and future challenges.
    Abstract In response to the limitations of manual online ad production, significant research has been conducted in the field of automatic ad text generation (ATG). However, comparing different methods has been challenging because of the lack of benchmarks encompassing the entire field and the absence of well-defined problem sets with clear model inputs and outputs. To address these challenges, this paper aims to advance the field of ATG by introducing a redesigned task and constructing a benchmark. Specifically, we defined ATG as a cross-application task encompassing various aspects of the Internet advertising. As part of our contribution, we propose a first benchmark dataset, CA Multimodal Evaluation for Ad Text GeneRAtion (CAMERA), carefully designed for ATG to be able to leverage multi-modal information and conduct an industry-wise evaluation. Furthermore, we demonstrate the usefulness of our proposed benchmark through evaluation experiments using multiple baseline models, which vary in terms of the type of pre-trained language model used and the incorporation of multi-modal information. We also discuss the current state of the task and the future challenges.

Stock Market Sentiment Classification and Backtesting via Fine-tuned BERT

  • paper_url: http://arxiv.org/abs/2309.11979
  • repo_url: None
  • paper_authors: Jiashu Lou
  • for: Study quantitative trading on low-latency automatic trading platforms based on real-time information acquisition, and how sentiment factors can improve trading performance.
  • methods: Builds and fine-tunes a BERT natural language processing model on existing annotated datasets, then combines the resulting sentiment labels with the Alpha191 model in a regression used to generate trading signals.
  • results: Incorporating the sentiment factor increased the return rate by 73.8% compared to the baseline and by 32.41% compared to the original Alpha191 model over the trading period.
    Abstract With the rapid development of big data and computing devices, low-latency automatic trading platforms based on real-time information acquisition have become the main components of the stock trading market, so the topic of quantitative trading has received widespread attention. And for non-strongly efficient trading markets, human emotions and expectations always dominate market trends and trading decisions. Therefore, this paper starts from the theory of emotion, taking East Money as an example, crawling user comment titles data from its corresponding stock bar and performing data cleaning. Subsequently, a natural language processing model BERT was constructed, and the BERT model was fine-tuned using existing annotated data sets. The experimental results show that the fine-tuned model has different degrees of performance improvement compared to the original model and the baseline model. Subsequently, based on the above model, the user comment data crawled is labeled with emotional polarity, and the obtained label information is combined with the Alpha191 model to participate in regression, and significant regression results are obtained. Subsequently, the regression model is used to predict the average price change for the next five days, and use it as a signal to guide automatic trading. The experimental results show that the incorporation of emotional factors increased the return rate by 73.8\% compared to the baseline during the trading period, and by 32.41\% compared to the original alpha191 model. Finally, we discuss the advantages and disadvantages of incorporating emotional factors into quantitative trading, and give possible directions for further research in the future.
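
The backtest logic reduces to turning lagged sentiment into a position signal. A toy illustration with random numbers standing in for BERT-derived sentiment and real returns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "ret": rng.normal(0.0, 0.01, 250),        # daily returns (fabricated)
    "sentiment": rng.uniform(-1, 1, 250),     # daily mean comment polarity
})
# Trade on yesterday's sentiment to avoid look-ahead bias
signal = (df["sentiment"].shift(1) > 0).astype(int)
strategy_ret = signal * df["ret"]
print("cumulative return:", float((1 + strategy_ret.fillna(0)).prod() - 1))
```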

SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

  • paper_url: http://arxiv.org/abs/2309.13080
  • repo_url: None
  • paper_authors: Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya
  • for: Propose a new news-similarity dataset, together with four methods for generating news pairs, to support training for the news similarity detection task.
  • methods: News is organized into seven topics, and four distinct approaches are used to generate news pairs; these include text summarization, text classification, named entity recognition, and text comparison.
  • results: MinHash, BERT, SBERT, and SimCSE models are benchmarked on the created datasets with good results: they detect news similarity accurately, and topic segmentation enables more precise detection within narrower domains.
    Abstract Nowadays, the use of intelligent systems to detect redundant information in news articles has become especially prevalent with the proliferation of news media outlets in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a new dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four distinct approaches for generating news pairs, which are used in the creation of datasets specifically designed for news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.

Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task

  • paper_url: http://arxiv.org/abs/2309.11925
  • repo_url: None
  • paper_authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, André F. T. Martins
  • for: This paper describes the team's participation in the WMT 2023 Shared Task on Quality Estimation (QE).
  • methods: The authors build on the COMETKIWI-22 model (Rei et al., 2022b) and adopt a multilingual approach.
  • results: The approach ranks first on all tasks, reaching state-of-the-art performance for sentence- and word-level quality prediction. Compared to the previous state-of-the-art COMETKIWI-22, the authors show large improvements in correlation with human judgements (up to 10 Spearman points) and surpass the second-best multilingual submission to the shared task by up to 3.8 absolute points.
    Abstract We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated on all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state-of-the-art COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared-task with up to 3.8 absolute points.
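For readers who want to try reference-free QE scoring, the released predecessor model can be driven through the `unbabel-comet` package roughly as follows. The checkpoint name and its gating on Hugging Face are assumptions about the public release; the WMT23 submission itself is a scaled-up variant and may not be available.

```python
# Hedged sketch of reference-free quality estimation with COMETKIWI-22.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")  # may require HF license acceptance
model = load_from_checkpoint(model_path)

data = [  # QE input: source + machine translation, no reference needed
    {"src": "Dem Feuer konnte Einhalt geboten werden", "mt": "The fire could be stopped"},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one sentence-level quality score per segment
```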

InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-task LLMs Framework

  • paper_url: http://arxiv.org/abs/2309.11911
  • repo_url: https://github.com/LIN-SHANG/InstructERC
  • paper_authors: Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, Sirui Wang
  • for: This paper aims to improve the development of emotion recognition in dialogue (ERC) by proposing a novel approach called InstructERC, which transforms the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs).
  • methods: The proposed InstructERC approach uses a simple yet effective retrieval template module to explicitly integrate multi-granularity dialogue supervision information, as well as two additional emotion alignment tasks (speaker identification and emotion prediction) to implicitly model dialogue role relationships and future emotional tendencies.
  • results: The LLM-based plug-and-play plugin framework achieved comprehensive SOTA on three commonly used ERC datasets, outperforming all previous models. Extensive analysis of parameter-efficient and data-scaling experiments provides empirical guidance for applying InstructERC in practical scenarios.
    Abstract The development of emotion recognition in dialogue (ERC) has been consistently hindered by the complexity of pipeline designs, leading to ERC models that often overfit to specific datasets and dialogue patterns. In this study, we propose a novel approach, InstructERC, which reformulates the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs). InstructERC makes two significant contributions. First, it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information by concatenating the historical dialogue content, a label statement, and emotional-domain demonstrations with high semantic similarity. Second, we introduce two additional emotion alignment tasks, namely speaker identification and emotion prediction, to implicitly model the dialogue role relationships and future emotional tendencies in conversations. Our LLM-based plug-and-play plugin framework significantly outperforms all previous models and achieves comprehensive SOTA on three commonly used ERC datasets. Extensive analysis of parameter-efficient and data-scaling experiments provides empirical guidance for applying InstructERC in practical scenarios. Our code will be released after blind review.
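A simplified sketch of the retrieval template idea: retrieve the training demonstrations most semantically similar to the current dialogue, then concatenate history, label statement, and demonstrations into one generative prompt. The encoder choice and the template wording below are hypothetical; see the linked repository for the actual format.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # retrieval encoder (assumption)

demos = [
    "Speaker A: I can't believe we won! -> joyful",
    "Speaker B: Leave me alone. -> angry",
    "Speaker A: It is what it is, I guess. -> neutral",
]
history = "Speaker A: They cancelled my flight again.\nSpeaker B: You're kidding me!"
labels = ["neutral", "joyful", "angry", "sad", "surprised"]

# Pick the top-k demonstrations by cosine similarity to the dialogue history.
scores = util.cos_sim(encoder.encode(history, convert_to_tensor=True),
                      encoder.encode(demos, convert_to_tensor=True))[0]
top = [demos[int(i)] for i in scores.argsort(descending=True)[:2]]

# Assemble a generative prompt (wording is illustrative, not the paper's).
prompt = (
    f"Dialogue history:\n{history}\n"
    f"Candidate labels: {', '.join(labels)}\n"
    f"Similar demonstrations:\n" + "\n".join(top) +
    "\nPredict the emotion of the last utterance:"
)
print(prompt)
```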

Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit Hate Speech Detection

  • paper_url: http://arxiv.org/abs/2309.11896
  • repo_url: https://github.com/lcs2-iiitd/fiadd
  • paper_authors: Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty
  • for: This work aims to improve the ability of pre-trained language models (PLMs) to recognize text containing implicit expressions of hate speech.
  • methods: Two approaches, augmenting external context and enforcing label separation via distance-based metrics, are combined into the novel Focused Inferential Adaptive Density Discrimination (FiADD) framework.
  • results: Tested on three implicit hate datasets, FiADD shows clear improvements on both two-way and three-way hate classification tasks, with similar performance gains on detecting sarcasm, irony, and stance.
    Abstract Although pre-trained large language models (PLMs) have achieved state-of-the-art performance on many NLP tasks, they lack understanding of subtle expressions of implicit hate speech. Such nuanced and implicit hate is often misclassified as non-hate. Various attempts have been made to enhance the detection of (implicit) hate content by augmenting external context or enforcing label separation via distance-based metrics. We combine these two approaches and introduce FiADD, a novel Focused Inferential Adaptive Density Discrimination framework. FiADD enhances the PLM fine-tuning pipeline by bringing the surface form of an implicit hate speech closer to its implied form while increasing the inter-cluster distance among the various class labels. We test FiADD on three implicit hate datasets and observe significant improvement in the two-way and three-way hate classification tasks. We further experiment on the generalizability of FiADD on three other tasks, namely detecting sarcasm, irony, and stance, in which surface and implied forms differ, and observe similar performance improvement. We analyze the generated latent space to understand its evolution under FiADD, which corroborates the advantage of employing FiADD for implicit hate speech detection.
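One loose reading of the two FiADD objectives can be written as a pair of contrastive terms in PyTorch: pull each post's surface-form embedding toward its implied-meaning embedding, and push apart class centroids. This is a deliberate simplification of adaptive density discrimination, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def fiadd_style_loss(surface, implied, labels, margin=1.0):
    # Term 1: surface form close to implied form (per-example alignment).
    align = F.mse_loss(surface, implied)
    # Term 2: inter-cluster separation between class centroids.
    centroids = torch.stack([surface[labels == c].mean(0) for c in labels.unique()])
    dists = torch.cdist(centroids, centroids)
    off_diag = dists[~torch.eye(len(centroids), dtype=torch.bool)]
    separation = F.relu(margin - off_diag).mean()
    return align + separation

surface = torch.randn(8, 32)   # PLM embeddings of posts (synthetic)
implied = torch.randn(8, 32)   # embeddings of implied-meaning text (synthetic)
labels = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2])
print(fiadd_style_loss(surface, implied, labels))
```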

Is It Really Useful to Jointly Parse Constituency and Dependency Trees? A Revisit

  • paper_url: http://arxiv.org/abs/2309.11888
  • repo_url: None
  • paper_authors: Yanggang Gu, Yang Hou, Zhefeng Wang, Xinyu Duan, Zhenghua Li
  • for: This paper studies joint parsing of constituency and dependency trees, i.e., simultaneously producing compatible constituency and dependency trees for an input sentence.
  • methods: The work adopts a more efficient decoding algorithm, performs joint modeling during the training phase, and proposes high-order scoring components to capture constituent-dependency interactions.
  • results: Progress is made in four aspects: (1) a much more efficient decoding algorithm, (2) joint modeling at the training phase, (3) high-order scoring components, and (4) further insights gained through in-depth experiments and analysis.
    Abstract This work revisits the topic of jointly parsing constituency and dependency trees, i.e., producing compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. Compared with previous works, we make progress in four aspects: (1) adopting a much more efficient decoding algorithm, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components for constituent-dependency interaction, and (4) gaining more insights via in-depth experiments and analysis.

Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System

  • paper_url: http://arxiv.org/abs/2309.11869
  • repo_url: None
  • paper_authors: Jonathan Dunn
  • for: This study aims to quantify variation in the language system by classifying 49 local varieties of English, characterizing the syntactic differences between dialects.
  • methods: Dialects are classified using both the entire grammar and individual grammatical structures in isolation.
  • results: Many individual structures are subject to variation, but in isolation none performs as well as the grammar as a whole, indicating that an important part of syntactic variation consists of interactions between different grammatical structures. Moreover, the similarity between dialects depends heavily on which subset of the grammar is observed.
    Abstract While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar. This means that the grammar, a network which connects thousands of structures at different levels of abstraction, is reduced to a few disconnected variables. This paper quantifies the impact of such reductions by systematically modelling dialectal variation across 49 local populations of English speakers in 16 countries. We perform dialect classification with both the entire grammar and isolated nodes within the grammar in order to characterize the syntactic differences between these dialects. The results show, first, that many individual nodes within the grammar are subject to variation but, in isolation, none performs as well as the grammar as a whole. This indicates that an important part of syntactic variation consists of interactions between different parts of the grammar. Second, the results show that the similarity between dialects depends heavily on the sub-set of the grammar being observed: for example, New Zealand English could be more similar to Australian English in phrasal verbs but at the same time more similar to UK English in dative phrases.
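The experimental design (whole grammar versus isolated nodes) can be mimicked on synthetic data: classify dialects from construction-frequency vectors using all features, then using a single feature in isolation. Everything below is synthetic; the paper uses corpus counts of learned grammatical constructions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_docs, n_constructions, n_dialects = 600, 200, 4
y = rng.integers(n_dialects, size=n_docs)
# Dialect-specific construction preferences plus per-document noise.
prefs = rng.normal(size=(n_dialects, n_constructions))
X = prefs[y] + rng.normal(scale=2.0, size=(n_docs, n_constructions))

# Whole grammar (all features) vs. one isolated node (single feature).
whole = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
single = cross_val_score(LogisticRegression(max_iter=1000), X[:, :1], y, cv=5).mean()
print(f"whole grammar: {whole:.2f}  vs  one isolated node: {single:.2f}")
```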

Knowledge Sanitization of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.11852
  • repo_url: None
  • paper_authors: Yoichi Ishibashi, Hidetoshi Shimodaira
  • for: Preventing language models from leaking sensitive information.
  • methods: Fine-tuning language models so that they generate harmless responses when queried about specific information.
  • results: Experiments show the method not only reduces the leakage of specific knowledge but also preserves the overall performance of the language model, strengthening defenses against extraction attacks and reducing the generation of harmful content.
    Abstract We explore a knowledge sanitization approach to mitigate the privacy concerns associated with large language models (LLMs). LLMs trained on a large corpus of Web data can memorize and potentially reveal sensitive or confidential information, raising critical security concerns. Our technique fine-tunes these models, prompting them to generate harmless responses such as "I don't know" when queried about specific information. Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of the LLM. These two advantages strengthen the defense against extraction attacks and reduce the emission of harmful content such as hallucinations.
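A minimal sketch of how such sanitization fine-tuning data might be assembled: queries touching designated sensitive knowledge get "I don't know" as their target response, while ordinary queries keep their original answers. Field names, topics, and the refusal string are illustrative assumptions, not the paper's setup.

```python
# Build a sanitized supervised fine-tuning set from raw QA pairs.
sensitive_topics = {"employee salaries", "internal passwords"}

def sanitize(example: dict) -> dict:
    hits = any(t in example["question"].lower() for t in sensitive_topics)
    return {
        "question": example["question"],
        "answer": "I don't know." if hits else example["answer"],
    }

raw = [
    {"question": "What are the internal passwords for the admin portal?",
     "answer": "hunter2"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
finetune_set = [sanitize(ex) for ex in raw]
print(finetune_set)  # feed into a standard supervised fine-tuning loop
```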

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

  • paper_url: http://arxiv.org/abs/2309.11849
  • repo_url: None
  • paper_authors: Xianhao Wei, Jia Jia, Xiang Li, Zhiyong Wu, Ziyi Wang
  • for: This work aims to predict fine-grained emotional prosodic features from discourse-level text, improving the expressiveness of speech synthesis models.
  • methods: A style transfer model is used to extract a phoneme-level Local Prosody Embedding sequence and a global style embedding, and a Discourse-level Multi-scale text Prosodic Model (D-MPM) is proposed to predict these two features.
  • results: Experiments show that multi-scale text information effectively helps predict prosodic features, and that discourse-level text improves overall coherence and user experience. Moreover, since speech synthesized from the model's predictions outperforms style transfer from the original speech, the approach may help speech synthesis models express emotion more effectively.
    Abstract This paper explores predicting suitable prosodic features for fine-grained emotion analysis from discourse-level text. To obtain fine-grained emotional prosodic features as predictive targets for our model, we extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding as prosodic speech features from the speech with the help of a style transfer model. We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features. The proposed model can be used to analyze emotional prosodic features and thus guide a speech synthesis model to synthesize more expressive speech. To quantitatively evaluate the proposed model, we contribute a new, large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than 13,000 annotated utterances. Experimental results on the DCA dataset show that multi-scale text information effectively helps to predict prosodic features, and that discourse-level text improves both overall coherence and the user experience. More interestingly, although we aim at the synthesis effect of the style transfer model, speech synthesized with the proposed text prosodic analysis model even surpasses style transfer from the original speech on some user evaluation indicators.

A Chinese Prompt Attack Dataset for LLMs with Evil Content

  • paper_url: http://arxiv.org/abs/2309.11830
  • repo_url: None
  • paper_authors: Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu
  • for: This work presents a Chinese Prompt Attack Dataset (CPAD) for evaluating how well large language models (LLMs) can withstand prompt attacks.
  • methods: A variety of attack approaches are adopted, including prompt attacks, malicious prompts, and goal-directed attacks, to evaluate the safety of LLMs.
  • results: Running several popular Chinese LLMs on the dataset shows that the prompts cause the models to fail, with an attack success rate of around 70%.
    Abstract Large Language Models (LLMs) have shown remarkable ability in text understanding and generation. However, LLMs risk generating harmful content, especially when employed in applications. Several black-box attack methods, such as prompt attacks, can change the behaviour of LLMs and induce them to generate unexpected answers with harmful content. Researchers are interested in prompt attack and defense for LLMs, yet there is no publicly available dataset to evaluate the ability to defend against prompt attacks. In this paper, we introduce a Chinese Prompt Attack Dataset for LLMs, called CPAD. Our prompts aim to induce LLMs to generate unexpected outputs via several carefully designed prompt attack approaches and widely concerning attack content. Unlike previous datasets involving safety estimation, we construct the prompts along three dimensions: contents, attacking methods, and goals, so that the responses can be easily evaluated and analysed. We run several well-known Chinese LLMs on our dataset, and the results show that our prompts are significantly harmful to LLMs, with an attack success rate of around 70%. We will release CPAD to encourage further studies on prompt attack and defense.
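A sketch of the evaluation loop implied by the abstract: send each attack prompt to a target model and count the fraction that elicit a non-refusal (i.e., successful) response. The `query_model` stub and the refusal markers are placeholders; the paper's actual success criteria may differ.

```python
REFUSAL_MARKERS = ("cannot", "sorry", "unable", "不能", "抱歉")

def query_model(prompt: str) -> str:
    # Stub: replace with a real API call to the target LLM.
    return "抱歉,我不能协助此请求。"

def attack_success_rate(prompts: list[str]) -> float:
    successes = 0
    for p in prompts:
        reply = query_model(p).lower()
        if not any(m in reply for m in REFUSAL_MARKERS):
            successes += 1  # model complied instead of refusing
    return successes / len(prompts)

print(attack_success_rate(["<attack prompt 1>", "<attack prompt 2>"]))
```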

Word Embedding with Neural Probabilistic Prior

  • paper_url: http://arxiv.org/abs/2309.11824
  • repo_url: None
  • paper_authors: Shaogang Ren, Dingcheng Li, Ping Li
  • for: Improving word representation learning.
  • methods: Using a probabilistic prior to regularize word representation learning.
  • results: Improved representation quality along with better model robustness and stability.
    Abstract To improve word representation learning, we propose a probabilistic prior that can be seamlessly integrated with word embedding models. Unlike previous methods, word embedding is treated as a probabilistic generative model, which enables us to impose a prior that regularizes word representation learning. The proposed prior not only enhances the representation of embedding vectors but also improves the model's robustness and stability. The structure of the proposed prior is simple and effective, and it can be easily implemented and flexibly plugged into most existing word embedding models. Extensive experiments show the proposed method improves word representation on various tasks.
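One concrete instance of a probabilistic prior on embeddings, offered purely as an illustration: a Gaussian prior p(w) = N(mu, sigma^2 I) turns MAP estimation into the task loss plus an L2 penalty pulling each vector toward the prior mean. The paper's neural prior is more elaborate; this shows only the simplest form of the idea.

```python
import torch

emb = torch.nn.Embedding(10_000, 300)
mu = torch.zeros(300)            # prior mean (assumption: zero-centered)
sigma2 = 1.0                     # prior variance

def prior_penalty(embedding: torch.nn.Embedding) -> torch.Tensor:
    # -log p(W) up to constants: sum ||w - mu||^2 / (2 sigma^2)
    return ((embedding.weight - mu) ** 2).sum() / (2 * sigma2)

task_loss = torch.tensor(0.0)    # stands in for e.g. a skip-gram loss
loss = task_loss + 1e-6 * prior_penalty(emb)   # weighted MAP objective
loss.backward()                  # gradients flow into the embedding table
```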

SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features

  • paper_url: http://arxiv.org/abs/2309.11791
  • repo_url: None
  • paper_authors: Zhaoyi Wang, Zhenyang Zhang, Jiaxin Qin, Mizuho Iwaihara
  • for: Improving the accuracy of mapping Wikipedia categories and lists to DBpedia classes (as in CaLiGraph), enabling large-scale ontology mapping.
  • methods: Training data is generated automatically by exploiting knowledge graph structure, semantic similarity, and named entity typing, and the pre-trained language model BERT is fine-tuned in a distant supervision fashion.
  • results: The model outperforms the baseline by 25% in accuracy, offering a practical solution for large-scale ontology mapping.
    Abstract Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomies, but their open creation causes redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate the problem, realizing a large knowledge graph that is essential for categorizing digital content through entity linking and typing. However, the existing CaLiGraph approach produces incomplete and coarse-grained mappings. In this paper, we tackle the problem as ontology alignment, where structural information of knowledge graphs and lexical and semantic features of ontology class names are utilized to discover confident mappings, which are in turn utilized for fine-tuning pretrained language models in a distant supervision fashion. Our method, SLHCat, consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; 2) fine-tuning and prompt-tuning the pre-trained language model BERT over the training data, to capture semantic and syntactic properties of class names. Our model SLHCat is evaluated over a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. SLHCat outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
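The distant-supervision step can be sketched with a purely lexical signal: score string similarity between category names and class names, and keep confident matches as silver positives for BERT fine-tuning. The real system also exploits graph structure and entity typing; the names and threshold below are illustrative.

```python
from difflib import SequenceMatcher

categories = ["American rock bands", "Rivers of Germany", "2020 films"]
dbo_classes = ["Band", "River", "Film", "Person"]

def lexical_sim(a: str, b: str) -> float:
    # Character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

silver_pairs = []
for cat in categories:
    best = max(dbo_classes, key=lambda c: lexical_sim(cat, c))
    if lexical_sim(cat, best) > 0.3:          # confidence threshold (assumption)
        silver_pairs.append((cat, best, 1))   # positive example for fine-tuning
print(silver_pairs)
```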

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

  • paper_url: http://arxiv.org/abs/2309.11710
  • repo_url: https://github.com/elisakreiss/contextref
  • paper_authors: Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber
  • for: This paper evaluates whether referenceless metrics (e.g., CLIPScore) for image description generation align with human preference judgments.
  • methods: The ContextRef benchmark is used, combining human ratings along a variety of established quality dimensions with ten diverse robustness checks.
  • results: None of the evaluated referenceless methods performs well on ContextRef, but careful fine-tuning yields substantial improvements.
    Abstract Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly, without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef nonetheless remains a challenging benchmark, in large part due to context dependence.
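For reference, the CLIPScore metric mentioned above (Hessel et al., 2021) is simply a rescaled, clipped cosine similarity between CLIP image and text embeddings; a self-contained version using the Hugging Face `transformers` CLIP classes follows. Note that this bare metric ignores the surrounding context that ContextRef argues is essential.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, description: str) -> float:
    # CLIPScore = 2.5 * max(cos(image, text), 0)
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)

image = Image.new("RGB", (224, 224), "white")  # placeholder image
print(clipscore(image, "A plain white square."))
```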

Memory-Augmented LLM Personalization with Short- and Long-Term Memory Coordination

  • paper_url: http://arxiv.org/abs/2309.11696
  • repo_url: None
  • paper_authors: Kai Zhang, Fubang Zhao, Yangyang Kang, Xiaozhong Liu
  • for: This study aims to improve the personalized generation ability of large language models (LLMs) for better user-specific results.
  • methods: A novel computational bionic memory mechanism, combined with a parameter-efficient fine-tuning scheme, is proposed to personalize LLMs.
  • results: Experiments show that the approach effectively improves the personalized generation ability of LLMs, outperforming previous methods.
    Abstract Large Language Models (LLMs), such as GPT3.5, have exhibited remarkable proficiency in comprehending and generating natural language. However, their unpersonalized generation paradigm may result in suboptimal user-specific outcomes. Typically, users converse differently based on their knowledge and preferences, which necessitates enhancing user-oriented LLMs, a task that remains largely unexplored. While one could fully train an LLM for this objective, the resource consumption is unaffordable. Prior research has explored memory-based methods that store and retrieve knowledge to enhance generation without retraining for new queries. However, we contend that a mere memory module is inadequate to comprehend a user's preference, and fully training an LLM can be excessively costly. In this study, we propose a novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning schema, to personalize LLMs. Our extensive experimental results demonstrate the effectiveness and superiority of the proposed approach. To encourage further research into this area, we are releasing a new conversation dataset generated entirely by an LLM based on an open-source medical corpus, as well as our implementation code.
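A toy sketch of short- and long-term memory coordination: recent turns are kept verbatim in a small buffer, older turns overflow into a long-term store and are recalled by embedding similarity when building the prompt. The `embed` stub is a stand-in for a real sentence encoder, and the whole design is an illustration, not the paper's mechanism.

```python
from collections import deque
import numpy as np

def embed(text: str) -> np.ndarray:
    # Deterministic pseudo-embedding; replace with a real sentence encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

class Memory:
    def __init__(self, short_capacity: int = 4):
        self.short = deque(maxlen=short_capacity)     # recent turns, verbatim
        self.long: list[tuple[np.ndarray, str]] = []  # (embedding, turn)

    def add(self, turn: str):
        if len(self.short) == self.short.maxlen:
            old = self.short[0]
            self.long.append((embed(old), old))       # overflow to long-term
        self.short.append(turn)

    def context(self, query: str, k: int = 2) -> str:
        # Recall the k long-term turns most similar to the query.
        q = embed(query)
        ranked = sorted(self.long, key=lambda e: -float(q @ e[0]))
        recalled = [t for _, t in ranked[:k]]
        return "\n".join(recalled + list(self.short))

mem = Memory()
for t in ["I'm allergic to peanuts.", "I live in Oslo.", "I like hiking.",
          "My dog is called Rex.", "I work as a nurse."]:
    mem.add(t)
print(mem.context("What should I avoid eating?"))
```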