eess.AS - 2023-11-22

End-to-end Transfer Learning for Speaker-independent Cross-language Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2311.13678
  • repo_url: None
  • paper_authors: Duowei Tang, Peter Kuppens, Luc Geurts, Toon van Waterschoot
  • for: This study aims to improve the performance of cross-language speech emotion recognition (SER) by proposing a deep neural network (DNN) model based on transfer learning.
  • methods: The wav2vec 2.0 pre-trained model is used to transform audio time-domain waveforms from different languages, speakers, and recording conditions into a feature space shared across languages, reducing language variability. A new Deep Within-Class Covariance Normalisation (Deep-WCCN) layer, which can be inserted into the DNN model, is proposed to reduce further sources of variability such as speaker and channel variability. The whole model is fine-tuned end-to-end on a combined loss and validated on datasets in three languages (English, German, and Chinese).
  • results: The proposed method not only outperforms a baseline model based on common acoustic feature sets in the within-language setting, but also outperforms it by a statistically significant margin in the cross-language setting. Experiments further validate the effectiveness of Deep-WCCN, which yields additional performance gains. Finally, compared with recent results in the literature that use the same testing datasets, the proposed model shows significantly better cross-language SER performance.
    Abstract Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are based on general acoustic features or end-to-end approaches, show poor performance when the testing set has a different language than the training set (i.e., the cross-language setting) or comes from a different dataset (i.e., the cross-corpus setting). To alleviate this problem, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing the language variabilities in the speech features. Next, we propose a new Deep Within-Class Co-variance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other variabilities including speaker variability, channel variability and so on. The whole model is fine-tuned in an end-to-end manner on a combined loss and is validated on datasets from three languages (i.e., English, German, Chinese). Experimental results show that our proposed method not only outperforms the baseline model that is based on common acoustic feature sets for SER in the within-language setting, but also significantly outperforms the baseline model in the cross-language setting. In addition, we experimentally validate the effectiveness of Deep-WCCN, which can further improve the model performance. Finally, comparing with results in the recent literature that use the same testing datasets, our proposed model shows significantly better performance than other state-of-the-art models in cross-language SER.
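    The summary above does not spell out the Deep-WCCN layer. For intuition, the following is a minimal NumPy sketch of classical Within-Class Covariance Normalisation, which whitens embeddings by the inverse of the average within-class covariance; the function name and toy data are illustrative, and the paper's Deep-WCCN is a trainable in-network variant of this idea, not this exact code.

      import numpy as np

      def wccn_transform(features, labels):
          # Classical WCCN (illustrative, not the paper's trainable
          # Deep-WCCN layer): whiten embeddings by the inverse of the
          # average within-class covariance to suppress within-class
          # (speaker/channel) variability.
          classes = np.unique(labels)
          d = features.shape[1]
          W = np.zeros((d, d))
          for c in classes:
              W += np.cov(features[labels == c], rowvar=False, bias=True)
          W /= len(classes)
          # Cholesky factor B with B @ B.T = inv(W); project with B.
          B = np.linalg.cholesky(np.linalg.inv(W + 1e-6 * np.eye(d)))
          return features @ B

      # toy usage: two classes of 8-dimensional embeddings
      rng = np.random.default_rng(0)
      feats = rng.normal(size=(100, 8))
      labs = rng.integers(0, 2, size=100)
      normed = wccn_transform(feats, labs)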

Sparsity-Driven EEG Channel Selection for Brain-Assisted Speech Enhancement

  • paper_url: http://arxiv.org/abs/2311.13436
  • repo_url: None
  • paper_authors: Jie Zhang, Qing-Tian Xu, Zhen-Hua Ling
  • for: To improve speech quality in multi-talker conditions with a brain-assisted speech enhancement network (BASEN) that exploits the listener's electroencephalogram (EEG) signals, which implicitly reveal the attended speaker.
  • methods: A temporal convolutional network and a convolutional multi-layer cross-attention module are used to fuse EEG-audio features, and two channel selection methods (residual Gumbel selection and convolutional regularization selection) are proposed to tackle training instability and duplicated channel selections, respectively.
  • results: Experiments on a public dataset show that the proposed BASEN baseline outperforms existing methods, and the two channel selection methods can significantly reduce the number of informative EEG channels with a negligible impact on performance.
    Abstract Speech enhancement is widely used as a front-end to improve the speech quality in many audio systems, while it is still hard to extract the target speech in multi-talker conditions without prior information on the speaker identity. It was shown by auditory attention decoding that the attended speaker can be revealed by the electroencephalogram (EEG) of the listener implicitly. In this work, we therefore propose a novel end-to-end brain-assisted speech enhancement network (BASEN), which incorporates the listeners' EEG signals and adopts a temporal convolutional network together with a convolutional multi-layer cross attention module to fuse EEG-audio features. Considering that an EEG cap with sparse channels exhibits multiple benefits and in practice many electrodes might contribute marginally, we further propose two channel selection methods, called residual Gumbel selection and convolutional regularization selection. They are dedicated to tackling the issues of training instability and duplicated channel selections, respectively. Experimental results on a public dataset show the superiority of the proposed baseline BASEN over existing approaches. The proposed channel selection methods can significantly reduce the amount of informative EEG channels with a negligible impact on the performance.
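    The residual Gumbel selection method is not detailed here; a common way to make discrete channel selection differentiable is a Gumbel-softmax relaxation over per-slot channel logits, sketched below in PyTorch. The class and parameter names are assumptions, not the paper's.

      import torch
      import torch.nn.functional as F

      class GumbelChannelSelector(torch.nn.Module):
          # Differentiable selection of k EEG channels out of C via
          # Gumbel-softmax; an illustrative stand-in for the paper's
          # residual Gumbel selection, whose exact form may differ.
          def __init__(self, num_channels, k, tau=1.0):
              super().__init__()
              # one logit vector per selected-channel slot
              self.logits = torch.nn.Parameter(torch.zeros(k, num_channels))
              self.tau = tau

          def forward(self, eeg):  # eeg: (batch, C, time)
              # hard=True gives discrete one-hot picks in the forward
              # pass with straight-through gradients in the backward pass
              w = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
              return torch.einsum('kc,bct->bkt', w, eeg)

      # toy usage: pick 8 of 64 channels
      selector = GumbelChannelSelector(num_channels=64, k=8)
      print(selector(torch.randn(2, 64, 256)).shape)  # (2, 8, 256)

    Note that independent slots in this naive relaxation can collapse onto the same channel, which is precisely the duplicated-selection issue the paper's convolutional regularization selection is designed to address.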

Performance Analysis Of Binaural Signal Matching (BSM) in the Time-Frequency Domain

  • paper_url: http://arxiv.org/abs/2311.13390
  • repo_url: None
  • paper_authors: Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely
  • for: To study a binaural reproduction method suited to wearable and mobile arrays, improving sound quality in virtual reality, teleconferencing, and entertainment applications.
  • methods: A method called binaural signal matching (BSM) is analyzed with a parameterized sound-field in the time-frequency domain, with the aim of making it more adaptive to dynamic environments.
  • results: For binaural reproduction of reverberant speech in a simulated environment, the BSM method improves reproduction quality, and different parameterization schemes are compared.
    Abstract The capture and reproduction of spatial audio is becoming increasingly popular, with the mushrooming of applications in teleconferencing, entertainment and virtual reality. Many binaural reproduction methods have been developed and studied extensively for spherical and other specially designed arrays. However, the recent increased popularity of wearable and mobile arrays requires the development of binaural reproduction methods for these arrays. One such method is binaural signal matching (BSM). However, to date this method has only been investigated with fixed matched filters designed for long audio recordings. With the aim of making the BSM method more adaptive to dynamic environments, this paper analyzes BSM with a parameterized sound-field in the time-frequency domain. The paper presents results of implementing the BSM method on a sound-field that was decomposed into its direct and reverberant components, and compares this implementation with the BSM computed for the entire sound-field, to compare performance for binaural reproduction of reverberant speech in a simulated environment.
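    As background, the core of (non-parameterized) BSM is a per-frequency regularised least-squares fit of array filters to a target HRTF; the sketch below illustrates that baseline, while the paper's contribution is the parameterized time-frequency variant with a direct/reverberant decomposition. All variable names and shapes are assumptions.

      import numpy as np

      def bsm_filters(H, p, reg=1e-3):
          # Per-frequency binaural signal matching by regularised least
          # squares: minimise ||H[f] c - p[f]||^2 + reg ||c||^2 per bin.
          # H: (freqs, directions, mics) array transfer functions
          # p: (freqs, directions) target HRTFs for one ear
          n_freqs, _, n_mics = H.shape
          c = np.zeros((n_freqs, n_mics), dtype=complex)
          for f in range(n_freqs):
              A = H[f]
              c[f] = np.linalg.solve(A.conj().T @ A + reg * np.eye(n_mics),
                                     A.conj().T @ p[f])
          return c

      # toy usage: 4 mics, 36 directions, 128 frequency bins
      rng = np.random.default_rng(0)
      H = rng.normal(size=(128, 36, 4)) + 1j * rng.normal(size=(128, 36, 4))
      p = rng.normal(size=(128, 36)) + 1j * rng.normal(size=(128, 36))
      filters = bsm_filters(H, p)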

Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer

  • paper_url: http://arxiv.org/abs/2311.13075
  • repo_url: None
  • paper_authors: Meng Yu, Dong Yu
  • for: This paper proposes a simple yet effective field-of-view (FOV) feature for selectively enhancing sound signals within a user-defined field.
  • methods: The FOV feature amalgamates all directional attributes within the user-defined field, and a complementary counter-FOV feature captures directional aspects outside the desired field, ensuring refined sound capture at the FOV's boundaries.
  • results: Experimental results demonstrate the efficacy of the proposed angular FOV feature and its seamless incorporation into a low-power subband model suited for real-time applications.
    Abstract Audio zooming, a signal processing technique, enables selective focusing and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV) that defines an angular area. In this paper, we propose a simple yet effective FOV feature, amalgamating all directional attributes within the user-defined field. In conjunction, we introduce a counter FOV feature capturing directional aspects outside the desired field. Such advancements ensure refined sound capture, particularly emphasizing the FOV's boundaries, and guarantee the enhanced capture of all desired sound sources inside the user-defined field. The results from the experiment demonstrate the efficacy of the introduced angular FOV feature and its seamless incorporation into a low-power subband model suited for real-time applications.

cs.CV - 2023-11-22

Importance of Feature Extraction in the Calculation of Fréchet Distance for Medical Imaging

  • paper_url: http://arxiv.org/abs/2311.13717
  • repo_url: https://github.com/mckellwoodland/fid-med-eval
  • paper_authors: McKell Woodland, Mais Al Taie, Jessica Albuquerque Marques Silva, Mohamed Eltaher, Frank Mohn, Alexander Shieh, Austin Castelo, Suprateek Kundu, Joshua P. Yung, Ankit B. Patel, Kristy K. Brock
  • for: The paper aims to compare state-of-the-art feature extractors for computing Fréchet Distances (FDs) in medical imaging.
  • methods: A StyleGAN2 network is trained with data augmentation techniques tailored for limited data domains on datasets comprising three medical imaging modalities and four anatomical locations; human evaluation of generative quality (via a visual Turing test) is compared with FDs calculated using ImageNet-trained InceptionV3, ResNet50, SwAV, DINO, and Swin Transformer architectures, as well as an InceptionV3 network trained on a large medical dataset, RadImageNet.
  • results: All ImageNet-based extractors were consistent with each other, but only SwAV was significantly correlated with medical expert judgment; the RadImageNet-based FD showed volatility and lacked correlation with human judgment.
    Abstract Fréchet Inception Distance is a widely used metric for evaluating synthetic image quality that utilizes an ImageNet-trained InceptionV3 network as a feature extractor. However, its application in medical imaging lacks a standard feature extractor, leading to biased and inconsistent comparisons. This study aimed to compare state-of-the-art feature extractors for computing Fréchet Distances (FDs) in medical imaging. A StyleGAN2 network was trained with data augmentation techniques tailored for limited data domains on datasets comprising three medical imaging modalities and four anatomical locations. Human evaluation of generative quality (via a visual Turing test) was compared to FDs calculated using ImageNet-trained InceptionV3, ResNet50, SwAV, DINO, and Swin Transformer architectures, in addition to an InceptionV3 network trained on a large medical dataset, RadImageNet. All ImageNet-based extractors were consistent with each other, but only SwAV was significantly correlated with medical expert judgment. The RadImageNet-based FD showed volatility and lacked correlation with human judgment. Caution is advised when using medical image-trained extraction networks in the FD calculation. These networks should be rigorously evaluated on the imaging modality under consideration and publicly released. ImageNet-based extractors, while imperfect, are consistent and widely understood. Training extraction networks with SwAV is a promising approach for synthetic medical image evaluation.
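    The Fréchet Distance itself is fixed across all the compared extractors: fit a Gaussian to each feature set and compute FD = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2(C1 C2)^{1/2}). A minimal NumPy/SciPy version follows; the feature extractor that produces feat_real and feat_gen is the paper's variable of interest and is not shown.

      import numpy as np
      from scipy import linalg

      def frechet_distance(feat_real, feat_gen):
          # FD between Gaussian fits of two feature sets; the network
          # producing feat_real/feat_gen is what the paper studies.
          mu1, mu2 = feat_real.mean(0), feat_gen.mean(0)
          c1 = np.cov(feat_real, rowvar=False)
          c2 = np.cov(feat_gen, rowvar=False)
          covmean = linalg.sqrtm(c1 @ c2)
          if np.iscomplexobj(covmean):
              covmean = covmean.real  # drop tiny numerical imaginary parts
          diff = mu1 - mu2
          return diff @ diff + np.trace(c1 + c2 - 2 * covmean)

      # toy usage with random 64-d "features"
      rng = np.random.default_rng(0)
      print(frechet_distance(rng.normal(size=(500, 64)),
                             rng.normal(0.5, 1.0, size=(500, 64))))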

DiverseNet: Decision Diversified Semi-supervised Semantic Segmentation Networks for Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2311.13716
  • repo_url: None
  • paper_authors: Wanli Ma, Oktay Karakus, Paul L. Rosin
  • for: To reduce the cost of manual pixel-level labelling for large-scale remote sensing imagery by exploiting useful features from a large quantity of unlabelled data during training.
  • methods: Multi-head and multi-model semi-supervised learning algorithms are proposed that simultaneously promote precision and diversity during training.
  • results: The proposed multi-head and multi-model methods achieve the best semantic segmentation performance on four widely used remote sensing imagery datasets, and the proposed DiverseHead architecture is relatively lightweight in parameter space compared with prior methods while still reaching high performance.
    Abstract Semi-supervised learning is designed to help reduce the cost of the manual labelling process by exploiting the use of useful features from a large quantity of unlabelled data during training. Since pixel-level manual labelling in large-scale remote sensing imagery is expensive, semi-supervised learning becomes an appropriate solution to this. However, most of the existing semi-supervised learning methods still lack efficient perturbation methods to promote diversity of features and the precision of pseudo labels during training. In order to fill this gap, we propose DiverseNet architectures which explore multi-head and multi-model semi-supervised learning algorithms by simultaneously promoting precision and diversity during training. The two proposed methods of DiverseNet, namely the DiverseHead and DiverseModel, achieve the highest semantic segmentation performance in four widely utilised remote sensing imagery data sets compared to state-of-the-art semi-supervised learning methods. Meanwhile, the proposed DiverseHead architecture is relatively lightweight in terms of parameter space compared to the state-of-the-art methods whilst reaching high-performance results for all the tested data sets.

A Somewhat Robust Image Watermark against Diffusion-based Editing Models

  • paper_url: http://arxiv.org/abs/2311.13713
  • repo_url: None
  • paper_authors: Mingtian Tan, Tianhao Wang, Somesh Jha
  • for: This paper addresses image copyright infringement and malicious editing enabled by diffusion-model (DM) based image synthesis and editing.
  • methods: Adversarial example techniques are used to develop Robust Invisible Watermarking (RIW), a technique that embeds invisible watermarks into images.
  • results: RIW achieves an extraction accuracy of 96% for the invisible watermark after editing, whereas conventional methods offer 0%.
    Abstract Recently, diffusion models (DMs) have become the state-of-the-art method for image synthesis. Editing models based on DMs, known for their high fidelity and precision, have inadvertently introduced new challenges related to image copyright infringement and malicious editing. Our work is the first to formalize and address this issue. After assessing and attempting to enhance traditional image watermarking techniques, we recognize their limitations in this emerging context. In response, we develop a novel technique, RIW (Robust Invisible Watermarking), to embed invisible watermarks leveraging adversarial example techniques. Our technique ensures a high extraction accuracy of $96\%$ for the invisible watermark after editing, compared to the $0\%$ offered by conventional methods. We provide access to our code at https://github.com/BennyTMT/RIW.

A Comprehensive Review of Artificial Intelligence Applications in Major Retinal Conditions

  • paper_url: http://arxiv.org/abs/2311.13710
  • repo_url: None
  • paper_authors: Hina Raja, Taimur Hassan, Bilal Hassan, Muhammad Usman Akram, Hira Raja, Alaa A Abd-alrazaq, Siamak Yousefi, Naoufel Werghi
  • for: This paper provides a systematic survey of retinal diseases, emphasizing the importance of early detection for effective treatment.
  • methods: It covers both clinical and automated approaches for detecting retinal disease, focusing on studies from the past decade.
  • results: The survey evaluates algorithms across different modalities for identifying structural abnormalities and diagnosing retinal diseases, and identifies future research directions based on a critical analysis of the existing literature.
    Abstract This paper provides a systematic survey of retinal diseases that cause visual impairments or blindness, emphasizing the importance of early detection for effective treatment. It covers both clinical and automated approaches for detecting retinal disease, focusing on studies from the past decade. The survey evaluates various algorithms for identifying structural abnormalities and diagnosing retinal diseases, and it identifies future research directions based on a critical analysis of existing literature. This comprehensive study, which reviews both clinical and automated detection methods using different modalities, appears to be unique in its scope. Additionally, the survey serves as a helpful guide for researchers interested in digital retinopathy.

Multi-view Hybrid Graph Convolutional Network for Volume-to-mesh Reconstruction in Cardiovascular MRI

  • paper_url: http://arxiv.org/abs/2311.13706
  • repo_url: None
  • paper_authors: Nicolás Gaggion, Benjamin A. Matheson, Yan Xia, Rodrigo Bonazzola, Nishant Ravikumar, Zeike A. Taylor, Diego H. Milone, Alejandro F. Frangi, Enzo Ferrante
  • for: This paper aims to advance cardiovascular imaging, enabling better study of cardiac morphology and function.
  • methods: A novel direct image-to-mesh extraction architecture, HybridVNet, is proposed, combining standard convolutional neural networks with graph convolutions, together with deep supervision and mesh-specific regularisation; a multiview variant processes both long-axis and short-axis CMR.
  • results: Experiments on a comprehensive UK Biobank dataset show that HybridVNet efficiently generates high-fidelity, simulation-ready meshes from CMR images, and the multiview architecture further improves cardiac MR mesh generation performance.
    Abstract Cardiovascular magnetic resonance imaging is emerging as a crucial tool to examine cardiac morphology and function. Essential to this endeavour are anatomical 3D surface and volumetric meshes derived from CMR images, which facilitate computational anatomy studies, biomarker discovery, and in-silico simulations. However, conventional surface mesh generation methods, such as active shape models and multi-atlas segmentation, are highly time-consuming and require complex processing pipelines to generate simulation-ready 3D meshes. In response, we introduce HybridVNet, a novel architecture for direct image-to-mesh extraction seamlessly integrating standard convolutional neural networks with graph convolutions, which we prove can efficiently handle surface and volumetric meshes by encoding them as graph structures. To further enhance accuracy, we propose a multiview HybridVNet architecture which processes both long axis and short axis CMR, showing that it can increase the performance of cardiac MR mesh generation. Our model combines traditional convolutional networks with variational graph generative models, deep supervision and mesh-specific regularisation. Experiments on a comprehensive dataset from the UK Biobank confirm the potential of HybridVNet to significantly advance cardiac imaging and computational cardiology by efficiently generating high-fidelity and simulation ready meshes from CMR images.
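    The abstract combines standard CNNs with graph convolutions over mesh vertices. As a hedged illustration of the graph-convolution half only, a basic layer that mixes neighbour features through a normalised adjacency matrix might look as follows; HybridVNet's actual variant is not specified here and may differ.

      import torch

      class GraphConv(torch.nn.Module):
          # Basic mesh graph convolution: mix neighbour features through
          # a (row-normalised) adjacency matrix, then apply a shared
          # linear map. Illustrative only.
          def __init__(self, in_dim, out_dim):
              super().__init__()
              self.lin = torch.nn.Linear(in_dim, out_dim)

          def forward(self, x, adj_norm):  # x: (V, in_dim), adj_norm: (V, V)
              return torch.relu(self.lin(adj_norm @ x))

      # toy usage: 10 vertices with 3-d features, self-loop adjacency
      layer = GraphConv(3, 16)
      print(layer(torch.randn(10, 3), torch.eye(10)).shape)  # (10, 16)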

Masked Conditional Diffusion Models for Image Analysis with Application to Radiographic Diagnosis of Infant Abuse

  • paper_url: http://arxiv.org/abs/2311.13688
  • repo_url: None
  • paper_authors: Shaoju Wu, Sila Kurugol, Andy Tsai
  • for: To help radiologists detect the classic metaphyseal lesion (CML), a subtle fracture highly specific for infant abuse, on distal tibial radiographs.
  • methods: A masked conditional diffusion model (MaC-DM) is used for data augmentation, combining weighted segmentation masks of the tibias and CML fracture sites as additional conditions for classifier guidance.
  • results: The images and associated segmentation labels generated by MaC-DM improve the performance of ResNet-34 in classifying normal radiographs and those with CMLs, and enhance the performance of U-Net in labeling CML areas on distal tibial radiographs.
    Abstract The classic metaphyseal lesion (CML) is a distinct injury that is highly specific for infant abuse. It commonly occurs in the distal tibia. To aid radiologists detect these subtle fractures, we need to develop a model that can flag abnormal distal tibial radiographs (i.e. those with CMLs). Unfortunately, the development of such a model requires a large and diverse training database, which is often not available. To address this limitation, we propose a novel generative model for data augmentation. Unlike previous models that fail to generate data that span the diverse radiographic appearance of the distal tibial CML, our proposed masked conditional diffusion model (MaC-DM) not only generates realistic-appearing and wide-ranging synthetic images of the distal tibial radiographs with and without CMLs, it also generates their associated segmentation labels. To achieve these tasks, MaC-DM combines the weighted segmentation masks of the tibias and the CML fracture sites as additional conditions for classifier guidance. The augmented images from our model improved the performances of ResNet-34 in classifying normal radiographs and those with CMLs. Further, the augmented images and their associated segmentation masks enhanced the performance of the U-Net in labeling areas of the CMLs on distal tibial radiographs.

Single-Shot Plug-and-Play Methods for Inverse Problems

  • paper_url: http://arxiv.org/abs/2311.13682
  • repo_url: None
  • paper_authors: Yanqi Cheng, Lipei Zhang, Zhenda Shen, Shujun Wang, Lequan Yu, Raymond H. Chan, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
  • for: This work addresses the application of Plug-and-Play (PnP) methods to inverse problems when only minimal data is available.
  • methods: Single-Shot PnP methods (SS-PnP) are proposed: first, a Single-Shot proximal denoiser is integrated into iterative methods, enabling training with single instances; second, implicit neural priors are proposed based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients.
  • results: Extensive numerical and visual experiments show that the method leads to better approximations.
    Abstract The utilisation of Plug-and-Play (PnP) priors in inverse problems has become increasingly prominent in recent years. This preference is based on the mathematical equivalence between the general proximal operator and the regularised denoiser, facilitating the adaptation of various off-the-shelf denoiser priors to a wide range of inverse problems. However, existing PnP models predominantly rely on pre-trained denoisers using large datasets. In this work, we introduce Single-Shot PnP methods (SS-PnP), shifting the focus to solving inverse problems with minimal data. First, we integrate Single-Shot proximal denoisers into iterative methods, enabling training with single instances. Second, we propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients. We demonstrate, through extensive numerical and visual experiments, that our method leads to better approximations.
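    For context, the PnP idea the paper builds on replaces the proximal operator in an iterative solver with an off-the-shelf denoiser. A minimal proximal-gradient sketch follows; the single-shot proximal denoiser and implicit neural prior that constitute the paper's contribution are abstracted behind the denoiser callable, and all names are illustrative.

      import numpy as np

      def pnp_pgm(y, A, At, denoiser, step=1.0, iters=50):
          # Plug-and-Play proximal gradient for min 0.5||A x - y||^2 + R(x):
          # the proximal map of R is replaced by a denoiser. A and At are
          # the forward operator and its adjoint (callables).
          x = At(y)
          for _ in range(iters):
              grad = At(A(x) - y)            # data-fidelity gradient step
              x = denoiser(x - step * grad)  # denoiser as proximal operator
          return x

      # toy usage: identity forward model with a box-filter "denoiser"
      y = np.random.default_rng(0).normal(size=64)
      identity = lambda v: v
      box = lambda v: np.convolve(v, np.ones(3) / 3, mode='same')
      x_hat = pnp_pgm(y, identity, identity, box, step=0.5, iters=10)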

Compact 3D Gaussian Representation for Radiance Field

  • paper_url: http://arxiv.org/abs/2311.13681
  • repo_url: None
  • paper_authors: Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, Eunbyung Park
  • for: Efficient 3D scene representation and rendering.
  • methods: A 3D Gaussian splatting method with a learnable mask to reduce the number of Gaussians, a grid-based neural field to represent view-dependent color, and vector quantization to compress the Gaussians' geometric attributes.
  • results: Over 10x reduction in storage and enhanced rendering speed while maintaining scene representation quality, making the approach more efficient, faster, and more compact than 3DGS while remaining real-time.
    Abstract Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However, one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and adopts the rasterization pipeline to render the images rather than volumetric rendering, achieving very fast rendering speed and promising image quality. However, a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric attributes of Gaussian by vector quantization. In our extensive experiments, we consistently show over 10$\times$ reduced storage and enhanced rendering speed, while maintaining the quality of the scene representation, compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/.
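    The learnable mask strategy can be illustrated with a straight-through binary gate over per-Gaussian scores: a hard 0/1 mask in the forward pass with soft sigmoid gradients in the backward pass. The sketch below is an assumption-laden reading of that idea, not the paper's exact formulation; a sparsity penalty on the soft scores (omitted) would drive Gaussians to be pruned.

      import torch

      class LearnableMask(torch.nn.Module):
          # Straight-through binary gate over per-Gaussian scores:
          # hard 0/1 mask forward, soft sigmoid gradient backward.
          def __init__(self, num_gaussians):
              super().__init__()
              self.scores = torch.nn.Parameter(torch.zeros(num_gaussians))

          def forward(self, opacity):  # opacity: (N,) per-Gaussian
              soft = torch.sigmoid(self.scores)
              hard = (soft > 0.5).float()          # non-differentiable
              mask = hard + soft - soft.detach()   # straight-through trick
              return opacity * mask

      # toy usage
      mask = LearnableMask(1000)
      print(mask(torch.rand(1000)).shape)  # torch.Size([1000])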

BenthIQ: a Transformer-Based Benthic Classification Model for Coral Restoration

  • paper_url: http://arxiv.org/abs/2311.13661
  • repo_url: None
  • paper_authors: Rupa Kurinchi-Vendhan, Drew Gray, Elijah Cole
  • for: To improve the accuracy of benthic classification in shallow reef imagery, supporting the management and restoration of coral reef ecosystems.
  • methods: A multi-label semantic segmentation network is proposed, using a hierarchical Swin Transformer backbone in a U-shaped encoder-decoder architecture for local-global semantic feature learning.
  • results: In a real-world case study in French Polynesia, the approach outperforms traditional CNN and attention-based models on pixel-wise classification of shallow reef imagery.
    Abstract Coral reefs are vital for marine biodiversity, coastal protection, and supporting human livelihoods globally. However, they are increasingly threatened by mass bleaching events, pollution, and unsustainable practices with the advent of climate change. Monitoring the health of these ecosystems is crucial for effective restoration and management. Current methods for creating benthic composition maps often compromise between spatial coverage and resolution. In this paper, we introduce BenthIQ, a multi-label semantic segmentation network designed for high-precision classification of underwater substrates, including live coral, algae, rock, and sand. Although commonly deployed CNNs are limited in learning long-range semantic information, transformer-based models have recently achieved state-of-the-art performance in vision tasks such as object detection and image classification. We integrate the hierarchical Swin Transformer as the backbone of a U-shaped encoder-decoder architecture for local-global semantic feature learning. Using a real-world case study in French Polynesia, we demonstrate that our approach outperforms traditional CNN and attention-based models on pixel-wise classification of shallow reef imagery.

Panda or not Panda? Understanding Adversarial Attacks with Interactive Visualization

  • paper_url: http://arxiv.org/abs/2311.13656
  • repo_url: None
  • paper_authors: Yuzhe You, Jarvis Tse, Jian Zhao
  • for: To help adversarial machine learning (AML) learners understand the mechanisms and impacts of attacks on machine learning models, as well as the defenses that strengthen model robustness.
  • methods: A multi-level interactive visualization system, AdvEx, comprehensively presents the properties and impacts of evasion attacks on different image classifiers for novice AML learners.
  • results: A two-part evaluation with user studies and expert interviews shows that AdvEx is not only highly effective as a visualization tool for understanding AML mechanisms, but also provides an engaging and enjoyable learning experience.
    Abstract Adversarial machine learning (AML) studies attacks that can fool machine learning algorithms into generating incorrect outcomes as well as the defenses against worst-case attacks to strengthen model robustness. Specifically for image classification, it is challenging to understand adversarial attacks due to their use of subtle perturbations that are not human-interpretable, as well as the variability of attack impacts influenced by diverse methodologies, instance differences, and model architectures. Through a design study with AML learners and teachers, we introduce AdvEx, a multi-level interactive visualization system that comprehensively presents the properties and impacts of evasion attacks on different image classifiers for novice AML learners. We quantitatively and qualitatively assessed AdvEx in a two-part evaluation including user studies and expert interviews. Our results show that AdvEx is not only highly effective as a visualization tool for understanding AML mechanisms, but also provides an engaging and enjoyable learning experience, thus demonstrating its overall benefits for AML learners.
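    The evasion attacks a system like AdvEx visualizes include the canonical Fast Gradient Sign Method behind the famous "panda" example alluded to in the title; a minimal PyTorch version is sketched below for reference (AdvEx itself is a visualization system, not an attack implementation, and the epsilon value here is just a common choice).

      import torch

      def fgsm_attack(model, x, y, eps=8 / 255):
          # Fast Gradient Sign Method: one step in the input-gradient
          # sign direction that maximally increases the loss.
          x = x.clone().detach().requires_grad_(True)
          loss = torch.nn.functional.cross_entropy(model(x), y)
          loss.backward()
          return (x + eps * x.grad.sign()).clamp(0, 1).detach()

      # toy usage with a tiny linear classifier on 3x8x8 "images"
      model = torch.nn.Sequential(torch.nn.Flatten(),
                                  torch.nn.Linear(3 * 8 * 8, 10))
      x = torch.rand(4, 3, 8, 8)
      y = torch.randint(0, 10, (4,))
      x_adv = fgsm_attack(model, x, y)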

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar

  • paper_url: http://arxiv.org/abs/2311.13655
  • repo_url: None
  • paper_authors: Berna Kabadayi, Wojciech Zielonka, Bharat Lal Bhatnagar, Gerard Pons-Moll, Justus Thies
  • for: To build high-fidelity, controllable 3D facial avatars, overcoming the limitation of prior methods that rely on face tracking and therefore fail for profile and back views.
  • methods: A 3D-aware generative appearance model is trained from 2D images with corresponding camera parameters, without requiring precise facial expression tracking; control is obtained by learning a mapping from 3DMM facial expression parameters to the latent space of the generative model.
  • results: Compared with state-of-the-art monocular methods, the approach yields higher-quality image synthesis while not requiring expression tracking of the training data.
    Abstract Digital humans and, especially, 3D facial avatars have raised a lot of attention in the past years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss out on parts of the side and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from their requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming to have access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we only assume to have a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data.

Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

  • paper_url: http://arxiv.org/abs/2311.13602
  • repo_url: None
  • paper_authors: Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, Kiyoharu Aizawa
  • for: This paper proposes an automated content-aware layout generation method that arranges visual elements along with given content, such as an e-commerce product image.
  • methods: A model named Retrieval-Augmented Layout Transformer (RALF) retrieves nearest-neighbor layout examples based on an input image and feeds these results into an autoregressive generator.
  • results: Extensive experiments show that RALF generates high-quality content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
    Abstract Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content, such as an e-commerce product image. In this paper, we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model, which is named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.

Diffusion models meet image counter-forensics

  • paper_url: http://arxiv.org/abs/2311.13629
  • repo_url: https://github.com/mtailanian/diff-cf
  • paper_authors: Matías Tailanian, Marina Gardella, Álvaro Pardo, Pablo Musé
  • for: This work assesses counter-forensic techniques based on diffusion models, which erase the traces left by forgers in order to deceive forensic methods.
  • methods: Diffusion purification is used to remove forgery traces from images, hiding tampering from forensic detectors.
  • results: Diffusion purification outperforms existing counter-forensic techniques both in deceiving forensic methods and in preserving the natural look of the purified images.
    Abstract From its acquisition in the camera sensors to its storage, different operations are performed to generate the final image. This pipeline imprints specific traces into the image to form a natural watermark. Tampering with an image disturbs these traces; these disruptions are clues that are used by most methods to detect and locate forgeries. In this article, we assess the capabilities of diffusion models to erase the traces left by forgers and, therefore, deceive forensics methods. Such an approach has been recently introduced for adversarial purification, achieving significant performance. We show that diffusion purification methods are well suited for counter-forensics tasks. Such approaches outperform already existing counter-forensics techniques both in deceiving forensics methods and in preserving the natural look of the purified images. The source code is publicly available at https://github.com/mtailanian/diff-cf.
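    The diffusion purification recipe is to diffuse the image forward to an intermediate timestep and then run the reverse chain of a pretrained DDPM back to t = 0, so that high-frequency forensic traces do not survive the round trip. The sketch below is a hedged illustration of that general recipe: the noise predictor eps_model, its call signature, and the choice of t_star are all assumptions, not the paper's settings.

      import torch

      @torch.no_grad()
      def diffusion_purify(x, eps_model, alphas_cumprod, t_star=200):
          # Forward-diffuse x (in [-1, 1]) to timestep t_star, then run
          # the reverse DDPM chain with a pretrained noise predictor
          # eps_model(x_t, t). alphas_cumprod: (T+1,) tensor, ac[0] ~ 1.
          ac = alphas_cumprod
          x_t = ac[t_star].sqrt() * x + (1 - ac[t_star]).sqrt() * torch.randn_like(x)
          for t in range(t_star, 0, -1):
              a_t = ac[t] / ac[t - 1]              # per-step alpha_t
              eps = eps_model(x_t, torch.tensor([t]))
              # DDPM posterior mean
              x_t = (x_t - (1 - a_t) / (1 - ac[t]).sqrt() * eps) / a_t.sqrt()
              if t > 1:                            # add posterior noise
                  sigma = ((1 - ac[t - 1]) / (1 - ac[t]) * (1 - a_t)).sqrt()
                  x_t = x_t + sigma * torch.randn_like(x_t)
          return x_t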

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

  • paper_url: http://arxiv.org/abs/2311.13600
  • repo_url: https://github.com/mkshing/ziplora-pytorch
  • paper_authors: Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani
  • for: This paper proposes a concept-driven personalization method for generative models, enabling generation of any user-provided subject in any user-provided style.
  • methods: The method, ZipLoRA, cheaply and effectively merges independently trained style and subject low-rank adaptations (LoRAs).
  • results: Experiments on a wide range of subject and style combinations show that ZipLoRA generates compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io
    Abstract Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io
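    Conceptually, merging two LoRAs amounts to combining per-layer low-rank weight deltas; the sketch below shows such a merge with per-column coefficient vectors, as a hedged reading of ZipLoRA's merge step. The dict layout and all names are assumptions, and the paper's full method, which optimises the coefficients against subject/style reconstruction losses with a similarity penalty between the scaled columns, is omitted here.

      import torch

      def merge_loras(subject_lora, style_lora, m_subject, m_style):
          # Merge per-layer LoRA deltas with per-column coefficients.
          # *_lora: dict layer_name -> (A, B) factors, delta W = B @ A
          # m_*:    dict layer_name -> (out_dim,) coefficient vectors
          merged = {}
          for name in subject_lora:
              A1, B1 = subject_lora[name]
              A2, B2 = style_lora[name]
              merged[name] = (m_subject[name].unsqueeze(1) * (B1 @ A1)
                              + m_style[name].unsqueeze(1) * (B2 @ A2))
          return merged

      # toy usage: one layer with rank-4 LoRAs on a 16x32 weight
      sub = {'attn': (torch.randn(4, 32), torch.randn(16, 4))}
      sty = {'attn': (torch.randn(4, 32), torch.randn(16, 4))}
      out = merge_loras(sub, sty, {'attn': torch.ones(16)},
                        {'attn': torch.ones(16)})
      print(out['attn'].shape)  # torch.Size([16, 32])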

T-Rex: Counting by Visual Prompting

  • paper_url: http://arxiv.org/abs/2311.13596
  • repo_url: None
  • paper_authors: Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang
  • for: This work designs T-Rex, an interactive object counting model that first detects and then counts any objects.
  • methods: Object counting is formulated as an open-set object detection task with visual prompts: users mark points or boxes on a reference image to specify the objects of interest, and T-Rex then detects all objects with a similar pattern; guided by T-Rex's visual feedback, users can interactively refine the counting results by prompting on missing or falsely detected objects.
  • results: T-Rex achieves state-of-the-art performance on several class-agnostic counting benchmarks and shows exceptional zero-shot counting capabilities on a newly established benchmark encompassing diverse scenarios and challenges; various practical application scenarios illustrate its potential for visual prompting.
    Abstract We introduce T-Rex, an interactive object counting model designed to first detect and then count any objects. We formulate object counting as an open-set object detection task with the integration of visual prompts. Users can specify the objects of interest by marking points or boxes on a reference image, and T-Rex then detects all objects with a similar pattern. Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects. T-Rex has achieved state-of-the-art performance on several class-agnostic counting benchmarks. To further exploit its potential, we established a new counting benchmark encompassing diverse scenarios and challenges. Both quantitative and qualitative results show that T-Rex possesses exceptional zero-shot counting capabilities. We also present various practical application scenarios for T-Rex, illustrating its potential in the realm of visual prompting.

XAGen: 3D Expressive Human Avatars Generation

  • paper_url: http://arxiv.org/abs/2311.13574
  • repo_url: https://github.com/magic-research/xagen
  • paper_authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Jiashi Feng, Mike Zheng Shou
  • for: This paper presents XAGen, a 3D generative model for human avatars with expressive control over body, face, and hands, improving the realism and controllability of generated human images.
  • methods: The model adopts a multi-scale and multi-part 3D representation that models fine details in small-scale regions, together with a multi-part rendering technique that disentangles the synthesis of body, face, and hands.
  • results: Experiments show that XAGen surpasses state-of-the-art methods in realism, diversity, and expressive control abilities.
    Abstract Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressive control over body, face, and hands. To enhance the fidelity of small-scale regions like face and hands, we devise a multi-scale and multi-part 3D representation that models fine details. Based on this representation, we propose a multi-part rendering technique that disentangles the synthesis of body, face, and hands to ease model training and enhance geometric quality. Furthermore, we design multi-part discriminators that evaluate the quality of the generated avatars with respect to their appearance and fine-grained control capabilities. Experiments show that XAGen surpasses state-of-the-art methods in terms of realism, diversity, and expressive control abilities. Code and data will be made available at https://showlab.github.io/xagen.

WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

  • paper_url: http://arxiv.org/abs/2311.13570
  • repo_url: None
  • paper_authors: Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis
  • for: This work targets learning-based 3D-aware image synthesis with high photorealism and 3D-consistent viewpoint changes.
  • methods: Latent diffusion models (LDMs) are used for 3D-aware image synthesis: an autoencoder infers a compressed latent representation that captures the images' underlying 3D structure, with cues from monocular depth prediction used to learn a faithful 3D representation.
  • results: The method generates high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods; it requires no posed images and no learned pose or camera distributions, learning a 3D representation directly from in-the-wild image data.
    Abstract Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D results.

ADriver-I: A General World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.13549
  • repo_url: None
  • paper_authors: Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, Tiancai Wang
  • for: This paper proposes an autonomous driving model based on multimodal large language models and diffusion techniques, aiming to improve both performance and interpretability.
  • methods: Visual features and control signals are unified as interleaved vision-action pairs; a general world model built on a multimodal large language model and a diffusion model autoregressively predicts the control signal of the current frame and, conditioned on the history, the future frames.
  • results: Extensive experiments on nuScenes and a large-scale private dataset show that ADriver-I achieves impressive performance compared with several constructed baselines.
    Abstract Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated infinite times, ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence.
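    The autoregressive world-model loop described in the abstract can be summarised in a few lines; mllm and vdm below are stand-ins for the multimodal LLM and the frame-prediction diffusion model, and their interfaces are assumptions for illustration only.

      def adriver_rollout(mllm, vdm, frames, actions, steps=10):
          # mllm: predicts the current control signal from the history of
          # vision-action pairs; vdm: predicts the next frame conditioned
          # on that history. Both are assumed callables.
          for _ in range(steps):
              actions.append(mllm(frames, actions))  # control for current frame
              frames.append(vdm(frames, actions))    # predicted next frame
          return frames, actions

      # toy usage with trivial stand-ins
      frames, actions = adriver_rollout(lambda f, a: 0.0,
                                        lambda f, a: f[-1],
                                        frames=[0], actions=[], steps=3)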

DiffusionMat: Alpha Matting as Sequential Refinement Learning

  • paper_url: http://arxiv.org/abs/2311.13535
  • repo_url: None
  • paper_authors: Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo
  • for: The purpose of this paper is to propose a new image matting framework that better solves the image matting problem.
  • methods: A diffusion model is used to refine the alpha matte from a coarse estimate to a refined one, treating image matting as a sequential refinement learning process.
  • results: Evaluated on several image matting benchmarks, DiffusionMat consistently outperforms existing methods.
    Abstract In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page at https://cnnlstm.github.io/DiffusionMat

Leveraging CNNs and Ensemble Learning for Automated Disaster Image Classification

  • paper_url: http://arxiv.org/abs/2311.13531
  • repo_url: None
  • paper_authors: Archit Rathod, Veer Pariawala, Mokshit Surana, Kumkum Saxena
  • for: This paper addresses the classification of natural disaster images using Convolutional Neural Networks (CNNs).
  • methods: Multiple CNN architectures were built and trained on a dataset containing images of earthquakes, floods, wildfires, and volcanoes; hyperparameters of individual models were tuned, and a stacked CNN ensemble with XGBoost as the meta-model was used to combine the strengths of the CNN and ResNet models.
  • results: The stacked CNN ensemble proved the most effective, achieving 95% accuracy and an F1 score of up to 0.96 for individual classes, demonstrating the reliability and effectiveness of CNN-based models for automated disaster image classification.
    Abstract Natural disasters act as a serious threat globally, requiring effective and efficient disaster management and recovery. This paper focuses on classifying natural disaster images using Convolutional Neural Networks (CNNs). Multiple CNN architectures were built and trained on a dataset containing images of earthquakes, floods, wildfires, and volcanoes. A stacked CNN ensemble approach proved to be the most effective, achieving 95% accuracy and an F1 score going up to 0.96 for individual classes. Tuning hyperparameters of individual models for optimization was critical to maximize the models' performance. The stacking of CNNs with XGBoost acting as the meta-model utilizes the strengths of the CNN and ResNet models to improve the overall accuracy of the classification. Results obtained from the models illustrated the potency of CNN-based models for automated disaster image classification. This lays the foundation for expanding these techniques to build robust systems for disaster response, damage assessment, and recovery management.
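    The stacking step can be sketched as training an XGBoost meta-model on concatenated class probabilities from the base CNNs; in practice the meta-features should come from out-of-fold predictions to avoid leakage. Hyperparameters and function names below are illustrative, not the paper's.

      import numpy as np
      from xgboost import XGBClassifier

      def fit_meta_model(base_probs, labels):
          # Train an XGBoost meta-model on the concatenated class
          # probabilities of the base CNNs.
          X_meta = np.concatenate(base_probs, axis=1)
          meta = XGBClassifier(n_estimators=200, max_depth=3)
          meta.fit(X_meta, labels)
          return meta

      # toy usage: three base models, four classes, 1000 samples
      rng = np.random.default_rng(0)
      probs = [rng.dirichlet(np.ones(4), size=1000) for _ in range(3)]
      meta = fit_meta_model(probs, rng.integers(0, 4, size=1000))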

Hybrid Whale-Mud-Ring Optimization for Precise Color Skin Cancer Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13512
  • repo_url: None
  • paper_authors: Amir Hamza, Badis Lekouaghet, Yassine Himeur
  • for: To improve the accuracy of skin cancer detection, contributing to the preservation of patients' health and well-being.
  • methods: A multilevel thresholding approach based on the Mud Ring Algorithm hybridized with the Whale Optimization Algorithm (WMRA) is proposed, using a bubble-net attack and mud ring strategy to escape stagnation in local optima and obtain optimal thresholds.
  • results: Experimental results show that the proposed WMRA outperforms a cluster of recent methods in terms of fitness, Peak Signal to Noise Ratio (PSNR), and Mean Square Error (MSE).
    Abstract Timely identification and treatment of rapidly progressing skin cancers can significantly contribute to the preservation of patients' health and well-being. Dermoscopy, a dependable and accessible tool, plays a pivotal role in the initial stages of skin cancer detection. Consequently, the effective processing of digital dermoscopy images holds significant importance in elevating the accuracy of skin cancer diagnoses. Multilevel thresholding is a key tool in medical imaging that extracts objects within the image to facilitate its analysis. In this paper, an enhanced version of the Mud Ring Algorithm hybridized with the Whale Optimization Algorithm, named WMRA, is proposed. The proposed approach utilizes bubble-net attack and mud ring strategy to overcome stagnation in local optima and obtain optimal thresholds. The experimental results show that WMRA is powerful against a cluster of recent methods in terms of fitness, Peak Signal to Noise Ratio (PSNR), and Mean Square Error (MSE).
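    Metaheuristics such as WMRA search for the threshold vector that maximises a segmentation fitness; a standard choice is Otsu's between-class variance, sketched below as the objective such an optimizer would call. The paper's exact fitness function may differ.

      import numpy as np

      def otsu_fitness(hist, thresholds):
          # Between-class variance for a candidate threshold vector,
          # the quantity a metaheuristic like WMRA maximises.
          # hist: normalised 256-bin grey-level histogram
          levels = np.arange(hist.size)
          mu_total = (levels * hist).sum()
          edges = [0, *sorted(int(t) for t in thresholds), hist.size]
          var_between = 0.0
          for lo, hi in zip(edges[:-1], edges[1:]):
              w = hist[lo:hi].sum()                 # class probability
              if w > 0:
                  mu = (levels[lo:hi] * hist[lo:hi]).sum() / w
                  var_between += w * (mu - mu_total) ** 2
          return var_between

      # toy usage: random histogram, two thresholds
      h = np.random.default_rng(0).random(256)
      print(otsu_fitness(h / h.sum(), [85, 170]))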

Deep-learning-based acceleration of MRI for radiotherapy planning of pediatric patients with brain tumors

  • paper_url: http://arxiv.org/abs/2311.13485
  • repo_url: https://github.com/stjude/deepmrirec
  • paper_authors: Shahinur Alam, Jinsoo Uh, Alexander Dresner, Chia-ho Hua, Khaled Khairy
  • for: To accelerate MRI, a non-invasive tool for diagnosis and radiotherapy (RT) planning that offers detailed insights into human anatomy but requires long scan times.
  • methods: A deep learning-based method (DeepMRIRec) reconstructs MRI images from undersampled data acquired with RT-specific receiver coil arrangements.
  • results: DeepMRIRec reduced scanning time by a factor of four while achieving a structural similarity score of 0.960 against fully sampled data, surpassing the evaluated state-of-the-art method (0.896) and demonstrating its potential for accelerating MRI scanning for RT planning.
    Abstract Magnetic Resonance Imaging (MRI) is a non-invasive diagnostic and radiotherapy (RT) planning tool, offering detailed insights into the anatomy of the human body. The extensive scan time is stressful for patients, who must remain motionless in a prolonged imaging procedure that prioritizes reduction of imaging artifacts. This is challenging for pediatric patients who may require measures for managing voluntary motions such as anesthesia. Several computational approaches reduce scan time (fast MRI), by recording fewer measurements and digitally recovering full information via post-acquisition reconstruction. However, most fast MRI approaches were developed for diagnostic imaging, without addressing reconstruction challenges specific to RT planning. In this work, we developed a deep learning-based method (DeepMRIRec) for MRI reconstruction from undersampled data acquired with RT-specific receiver coil arrangements. We evaluated our method against fully sampled data of T1-weighted MR images acquired from 73 children with brain tumors/surgical beds using loop and posterior coils (12 channels), with and without applying virtual compression of coil elements. DeepMRIRec reduced scanning time by a factor of four producing a structural similarity score surpassing the evaluated state-of-the-art method (0.960 vs 0.896), thereby demonstrating its potential for accelerating MRI scanning for RT planning.
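    As a toy illustration of the acquisition side, fourfold Cartesian undersampling keeps every fourth phase-encode line of k-space plus a fully sampled centre; the reconstruction task is to recover a clean image from such data. The mask layout and autocalibration width below are assumptions, not the paper's protocol.

      import numpy as np

      def undersample_kspace(image, accel=4, acs_lines=16):
          # Simulate Cartesian undersampling: keep every accel-th
          # phase-encode line plus a fully sampled centre region.
          k = np.fft.fftshift(np.fft.fft2(image))
          mask = np.zeros(image.shape[0], dtype=bool)
          mask[::accel] = True
          centre = image.shape[0] // 2
          mask[centre - acs_lines // 2:centre + acs_lines // 2] = True
          k_under = k * mask[:, None]
          # naive zero-filled reconstruction, a typical network input
          zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(k_under)))
          return k_under, zero_filled

      # toy usage on a random 64x64 "image"
      k_u, zf = undersample_kspace(np.random.default_rng(0).random((64, 64)))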
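For readers unfamiliar with fast-MRI training setups, the sketch below shows how a fourfold-undersampled, zero-filled network input is typically simulated from a fully sampled slice. The Cartesian random-line mask and fully sampled center band are common conventions assumed here, not details taken from the paper; see the linked DeepMRIRec repository for the actual pipeline.

```python
import numpy as np

def undersampled_input(image, acceleration=4, center_fraction=0.08, seed=0):
    """Simulate accelerated acquisition and zero-filled reconstruction.

    The fully sampled 2D image goes to k-space; phase-encode lines are
    kept with probability 1/acceleration, a center band stays fully
    sampled, and the inverse FFT of the masked k-space is the aliased
    image a reconstruction network learns to de-alias.
    """
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))
    ny = kspace.shape[0]
    mask = rng.random(ny) < 1.0 / acceleration        # random phase-encode lines
    pad = int(ny * center_fraction / 2)
    mask[ny // 2 - pad:ny // 2 + pad] = True          # fully sampled k-space center
    zero_filled = np.fft.ifft2(np.fft.ifftshift(kspace * mask[:, None]))
    return np.abs(zero_filled), mask
```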

SkeletonGait: Gait Recognition Using Skeleton Maps

  • paper_url: http://arxiv.org/abs/2311.13444
  • repo_url: https://github.com/shiqiyu/opengait
  • paper_authors: Chao Fan, Jingzhe Ma, Dongyang Jin, Chuanfu Shen, Shiqi Yu
  • for: To propose a novel skeletal gait representation and skeleton-based methods that improve deep gait recognition.
  • methods: The skeleton map renders human joint coordinates as a Gaussian-approximated heatmap; SkeletonGait exploits the structural information it carries, and the multi-branch SkeletonGait++ combines complementary features from skeletons and silhouettes.
  • results: SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin across scenarios, e.g., rank-1 accuracy above 85% on the challenging GREW dataset.
    Abstract The choice of the representations is essential for deep gait recognition methods. The binary silhouettes and skeletal coordinates are two dominant representations in recent literature, achieving remarkable advances in many scenarios. However, inherent challenges remain, in which silhouettes are not always guaranteed in unconstrained scenes, and structural cues have not been fully utilized from skeletons. In this paper, we introduce a novel skeletal gait representation named Skeleton Map, together with SkeletonGait, a skeleton-based method to exploit structural information from human skeleton maps. Specifically, the skeleton map represents the coordinates of human joints as a heatmap with Gaussian approximation, exhibiting a silhouette-like image devoid of exact body structure. Beyond achieving state-of-the-art performances over five popular gait datasets, more importantly, SkeletonGait uncovers novel insights about how important structural features are in describing gait and when do they play a role. Furthermore, we propose a multi-branch architecture, named SkeletonGait++, to make use of complementary features from both skeletons and silhouettes. Experiments indicate that SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin in various scenarios. For instance, it achieves an impressive rank-1 accuracy of over $85\%$ on the challenging GREW dataset. All the source code will be available at https://github.com/ShiqiYu/OpenGait.
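The skeleton map described in the abstract is straightforward to reproduce: each joint becomes a Gaussian blob on an empty canvas, giving a silhouette-like image without exact body structure. A minimal sketch follows; the resolution, sigma, and max-merge of overlapping blobs are our assumptions rather than the paper's exact recipe.

```python
import numpy as np

def skeleton_map(joints, size=64, sigma=2.0):
    """Render 2D joint coordinates as a Gaussian heatmap.

    joints: (N, 2) array of (x, y) coordinates normalized to [0, 1].
    Each joint contributes a Gaussian blob; overlapping blobs are
    merged with a max so the map stays in [0, 1].
    """
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float64)
    for x, y in joints * (size - 1):
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)
    return heatmap
```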

CompenHR: Efficient Full Compensation for High-resolution Projector

  • paper_url: http://arxiv.org/abs/2311.13409
  • repo_url: https://github.com/cyxwang/compenhr
  • paper_authors: Yuxi Wang, Haibin Ling, Bingyao Huang
  • for: This paper proposes a practical full-compensation solution for high-resolution projector-camera systems, addressing the long training time and high memory cost of state-of-the-art methods.
  • methods: An attention-based grid refinement network improves geometric correction quality; an end-to-end compensation network integrates a novel sampling scheme and attention blocks to reduce computation while preserving key features; a benchmark dataset for high-resolution projector full compensation is constructed.
  • results: Experiments show clear advantages in both efficiency and quality over state-of-the-art methods.
    Abstract Full projector compensation is a practical task of projector-camera systems. It aims to find a projector input image, named compensation image, such that when projected it cancels the geometric and photometric distortions due to the physical environment and hardware. State-of-the-art methods use deep learning to address this problem and show promising performance for low-resolution setups. However, directly applying deep learning to high-resolution setups is impractical due to the long training time and high memory cost. To address this issue, this paper proposes a practical full compensation solution. Firstly, we design an attention-based grid refinement network to improve geometric correction quality. Secondly, we integrate a novel sampling scheme into an end-to-end compensation network to alleviate computation and introduce attention blocks to preserve key features. Finally, we construct a benchmark dataset for high-resolution projector full compensation. In experiments, our method demonstrates clear advantages in both efficiency and quality.

Animatable 3D Gaussians for High-fidelity Synthesis of Human Motions

  • paper_url: http://arxiv.org/abs/2311.13404
  • repo_url: None
  • paper_authors: Keyang Ye, Tianjia Shao, Kun Zhou
  • for: Synthesizing high-fidelity free-view human motions in real time.
  • methods: Novel animatable 3D Gaussian model with learnable code and alpha loss for refining appearance, and joint optimization of human joint parameters.
  • results: Superior performance over NeRF-based methods, with the ability to synthesize new human motions in real time (66 fps on average).
    Abstract We present a novel animatable 3D Gaussian model for rendering high-fidelity free-view human motions in real time. Compared to existing NeRF-based methods, the model has better capability in synthesizing high-frequency details without the jittering problem across video frames. The core of our model is a novel augmented 3D Gaussian representation, which attaches each Gaussian with a learnable code. The learnable code serves as a pose-dependent appearance embedding for refining the erroneous appearance caused by geometric transformation of Gaussians, based on which an appearance refinement model is learned to produce residual Gaussian properties to match the appearance in the target pose. To force the Gaussians to learn the foreground human only without background interference, we further design a novel alpha loss to explicitly constrain the Gaussians within the human body. We also propose to jointly optimize the human joint parameters to improve the appearance accuracy. The animatable 3D Gaussian model can be learned with shallow MLPs, so new human motions can be synthesized in real time (66 fps on average). Experiments show that our model has superior performance over NeRF-based methods.

Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images

  • paper_url: http://arxiv.org/abs/2311.13398
  • repo_url: None
  • paper_authors: Jaeyoung Chung, Jeongtaek Oh, Kyoung Mu Lee
  • for: Optimizing 3D Gaussian splatting from a limited number of images while avoiding overfitting.
  • methods: A dense depth map, estimated by a pre-trained monocular depth model and aligned in scale and offset to sparse COLMAP feature points, serves as a geometry guide for the color-based optimization.
  • results: On the NeRF-LLFF dataset with varying numbers of few images, the method shows more robust geometry than the original image-only optimization, mitigating floating artifacts.
    Abstract In this paper, we present a method to optimize Gaussian splatting with a limited number of images while avoiding overfitting. Representing a 3D scene by combining numerous Gaussian splats has yielded outstanding visual quality. However, it tends to overfit the training views when only a small number of images are available. To address this issue, we introduce a dense depth map as a geometry guide to mitigate overfitting. We obtained the depth map using a pre-trained monocular depth estimation model and aligning the scale and offset using sparse COLMAP feature points. The adjusted depth aids in the color-based optimization of 3D Gaussian splatting, mitigating floating artifacts, and ensuring adherence to geometric constraints. We verify the proposed method on the NeRF-LLFF dataset with varying numbers of few images. Our approach demonstrates robust geometry compared to the original method that relies solely on images.
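Aligning the scale and offset of relative monocular depth against sparse COLMAP depths, as the abstract describes, admits a closed-form least-squares fit. A minimal sketch under that assumption follows; the authors may use a more robust estimator.

```python
import numpy as np

def align_depth(mono_depth, sparse_depth, pixels):
    """Fit scale and offset mapping monocular depth to COLMAP depth.

    mono_depth: H x W relative depth map from a monocular estimator.
    pixels: (N, 2) integer (row, col) positions of sparse COLMAP points.
    sparse_depth: (N,) triangulated depths at those pixels.
    Solves min_{s,t} || s * d_mono + t - d_colmap ||^2 in closed form.
    """
    d = mono_depth[pixels[:, 0], pixels[:, 1]]
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return s * mono_depth + t
```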

SegVol: Universal and Interactive Volumetric Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13385
  • repo_url: https://github.com/baai-dcai/segvol
  • paper_authors: Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao
  • for: To provide a foundation model for volumetric medical image segmentation that supplies clinical studies with meaningful, well-structured information.
  • methods: SegVol is trained on 90k unlabeled Computed Tomography (CT) volumes and 6k labeled CTs, supporting segmentation of over 200 anatomical categories via semantic and spatial prompts.
  • results: SegVol outperforms the state of the art by a large margin on multiple segmentation benchmarks; on three challenging lesion datasets it achieves around 20% higher Dice score than nnU-Net.
    Abstract Precise image segmentation provides clinical study with meaningful and well-structured information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a universal and interactive volumetric medical image segmentation model, named SegVol. By training on 90k unlabeled Computed Tomography (CT) volumes and 6k labeled CTs, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. Extensive experiments verify that SegVol outperforms the state of the art by a large margin on multiple segmentation benchmarks. Notably, on three challenging lesion datasets, our method achieves around 20% higher Dice score than nnU-Net. The model and data are publicly available at: https://github.com/BAAI-DCAI/SegVol.
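For reference, the Dice score behind the roughly 20% improvement claim is the standard overlap metric between predicted and ground-truth masks; a minimal volumetric implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks of identical shape.

    Dice = 2|P ∩ T| / (|P| + |T|); works for 2D slices or full
    D x H x W CT volumes alike.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)
```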

LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes

  • paper_url: http://arxiv.org/abs/2311.13384
  • repo_url: https://github.com/luciddreamer-cvlab/luciddreamer-cvlab.github.io
  • paper_authors: Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, Kyoung Mu Lee
  • for: To free 3D scene generation from domain-specific training data and produce highly detailed, multi-view-consistent 3D scenes.
  • methods: The domain-free pipeline LucidDreamer alternates two steps: Dreaming, which projects a portion of the point cloud to the desired view and inpaints it with a large-scale diffusion-based generative model, and Alignment, which harmoniously integrates the newly generated points into the 3D scene.
  • results: LucidDreamer produces Gaussian splats that are highly detailed compared to previous 3D scene generation methods, with no constraint on the domain of the target scene. Project page: https://luciddreamer-cvlab.github.io/
    Abstract With the widespread usage of VR devices and contents, demand for 3D scene generation techniques has grown. Existing 3D scene generation models, however, limit the target scene to a specific domain, primarily due to their training strategies using 3D scan datasets that are far from the real world. To address such limitation, we propose LucidDreamer, a domain-free scene generation pipeline fully leveraging the power of existing large-scale diffusion-based generative models. Our LucidDreamer has two alternating steps: Dreaming and Alignment. First, to generate multi-view consistent images from inputs, we set the point cloud as a geometrical guideline for each image generation. Specifically, we project a portion of point cloud to the desired view and provide the projection as a guidance for inpainting using the generative model. The inpainted images are lifted to 3D space with estimated depth maps, composing new points. Second, to aggregate the new points into the 3D scene, we propose an aligning algorithm which harmoniously integrates the portions of newly generated 3D scenes. The final 3D scene serves as initial points for optimizing Gaussian splats. LucidDreamer produces Gaussian splats that are highly detailed compared to the previous 3D scene generation methods, with no constraint on the domain of the target scene. Project page: https://luciddreamer-cvlab.github.io/

Point Projection Mapping System for Tracking, Registering, Labeling and Validating Optical Tissue Measurements

  • paper_url: http://arxiv.org/abs/2311.13378
  • repo_url: None
  • paper_authors: Lianne Feenstra, Stefan D. van der Stel, Marcos Da Silva Guimaraes, Theo J. M Ruers, Behdad Dashtbozorg
  • for: To validate newly developed optical tissue-sensing techniques for tumor detection during cancer surgery through accurate correlation with histological results.
  • methods: A non-destructive Point Projection Mapping system tracks measurement locations on tissue specimens, combined with a framework for accurate registration, validation, and labeling against histopathology results, validated on a case study.
  • results: The framework tracks and validates optical tissue-sensing techniques more robustly and accurately than conventional approaches, saving time and resources.
    Abstract Validation of newly developed optical tissue sensing techniques for tumor detection during cancer surgery requires an accurate correlation with histological results. Additionally, such accurate correlation facilitates precise data labeling for developing high-performance machine-learning tissue classification models. In this paper, a newly developed Point Projection Mapping system will be introduced, which allows non-destructive tracking of the measurement locations on tissue specimens. Additionally, a framework for accurate registration, validation, and labeling with histopathology results is proposed and validated on a case study. The proposed framework provides a more robust and accurate method for tracking and validation of optical tissue sensing techniques, which saves time and resources compared to conventional techniques available.

MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space

  • paper_url: http://arxiv.org/abs/2311.13372
  • repo_url: None
  • paper_authors: Xiuwen Wu, Rongjie Hu, Jie Liang, Yanming Wang, Bensheng Qiu, Xiaoxiao Wang
  • for: To propose a deep-learning method that predicts eye gaze points directly from fMRI data in individual space.
  • methods: The MRGazer framework comprises an eyeball extraction module and a residual-network-based gaze prediction module; by skipping the fMRI co-registration step it simplifies the processing protocol and achieves end-to-end gaze regression.
  • results: The method outperforms the co-registration-based approach on a variety of eye-movement tasks and delivers results in about 0.02 seconds per volume, versus about 0.3 seconds for the prior method.
    Abstract Eye-tracking research has proven valuable in understanding numerous cognitive functions. Recently, Frey et al. provided an exciting deep learning method for learning eye movements from fMRI data. However, it needed to co-register fMRI into standard space to obtain eyeball masks, and thus required additional templates and was time-consuming. To resolve this issue, in this paper, we propose a framework named MRGazer for predicting eye gaze points from fMRI in individual space. The MRGazer consists of an eyeball extraction module and a residual network-based eye gaze prediction module. Compared to the previous method, the proposed framework skips the fMRI co-registration step, simplifies the processing protocol, and achieves end-to-end eye gaze regression. The proposed method achieved superior performance in a variety of eye movement tasks than the co-registration-based method, and delivered objective results within a shorter time (~0.02 seconds per volume) than the prior method (~0.3 seconds per volume).

Unified Classification and Rejection: A One-versus-All Framework

  • paper_url: http://arxiv.org/abs/2311.13355
  • repo_url: None
  • paper_authors: Zhen Cheng, Xu-Yao Zhang, Cheng-Lin Liu
  • for: To build a unified framework for open-set classification and rejection of ambiguous and out-of-distribution (OOD) inputs.
  • methods: Open-set recognition of K known classes is formulated as a (K+1)-class classification problem: the K-class problem is decomposed into K one-versus-all (OVA) binary tasks whose combined scores yield (K+1)-class posterior probabilities, and a hybrid training strategy couples OVA loss with multi-class cross-entropy loss to maintain closed-set accuracy.
  • results: Experiments on popular OSR and OOD detection datasets show that the framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.
    Abstract Classifying patterns of known classes and rejecting ambiguous and novel (also called as out-of-distribution (OOD)) inputs are involved in open world pattern recognition. Deep neural network models usually excel in closed-set classification while performing poorly in rejecting OOD. To tackle this problem, numerous methods have been designed to perform open set recognition (OSR) or OOD rejection/detection tasks. Previous methods mostly take post-training score transformation or hybrid models to ensure low scores on OOD inputs while separating known classes. In this paper, we attempt to build a unified framework for building open set classifiers for both classification and OOD rejection. We formulate the open set recognition of $ K $-known-class as a $ (K + 1) $-class classification problem with model trained on known-class samples only. By decomposing the $ K $-class problem into $ K $ one-versus-all (OVA) binary classification tasks and binding some parameters, we show that combining the scores of OVA classifiers can give $ (K + 1) $-class posterior probabilities, which enables classification and OOD rejection in a unified framework. To maintain the closed-set classification accuracy of the OVA trained classifier, we propose a hybrid training strategy combining OVA loss and multi-class cross-entropy loss. We implement the OVA framework and hybrid training strategy on the recently proposed convolutional prototype network. Experiments on popular OSR and OOD detection datasets demonstrate that the proposed framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.
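The abstract states that combining the K OVA scores yields (K+1)-class posteriors but leaves the combination implicit. One standard construction consistent with that description treats the extra class as "every OVA head rejects"; the sketch below is this assumed combination, not the paper's exact parameter binding.

```python
import numpy as np

def ova_posteriors(logits):
    """Combine K one-vs-all logits into (K+1)-class posterior scores.

    One plausible combination: class k means "head k accepts, all
    others reject", and the extra (K+1)-th class (OOD / reject)
    means "every OVA head rejects".
    """
    sig = 1.0 / (1.0 + np.exp(-logits))                      # per-class acceptance
    rej = 1.0 - sig
    known = sig * np.prod(rej) / np.clip(rej, 1e-12, None)   # sigma_k * prod_{j!=k}(1 - sigma_j)
    ood = np.prod(rej)                                       # prod_j (1 - sigma_j)
    scores = np.append(known, ood)
    return scores / scores.sum()
```

An argmax over the (K+1) entries then performs closed-set classification and OOD rejection in a single step.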

High-Quality Face Caricature via Style Translation

  • paper_url: http://arxiv.org/abs/2311.13338
  • repo_url: None
  • paper_authors: Lamyanba Laishram, Muhammad Shaheryar, Jong Taek Lee, Soon Ki Jung
  • for: To propose a high-quality, unpaired face caricature method suitable for real-world use, built on computer vision techniques and GAN models.
  • methods: A two-step process: face caricature generation, which creates new caricature datasets from real images and trains a generative model; and face caricature projection, which uses an encoder with the pretrained generator to exaggerate facial features incrementally in latent space.
  • results: The method exaggerates facial features and stylizes appearance while preserving identity, attributes, and expressions, handles occlusions such as glasses, and shows distinct realism compared with state-of-the-art face caricature methods.
    Abstract Caricature is an exaggerated form of artistic portraiture that accentuates unique yet subtle characteristics of human faces. Recently, advancements in deep end-to-end techniques have yielded encouraging outcomes in capturing both style and elevated exaggerations in creating face caricatures. Most of these approaches tend to produce cartoon-like results that could be more practical for real-world applications. In this study, we proposed a high-quality, unpaired face caricature method that is appropriate for use in the real world and uses computer vision techniques and GAN models. We attain the exaggeration of facial features and the stylization of appearance through a two-step process: Face caricature generation and face caricature projection. The face caricature generation step creates new caricature face datasets from real images and trains a generative model using the real and newly created caricature datasets. The Face caricature projection employs an encoder trained with real and caricature faces with the pretrained generator to project real and caricature faces. We perform an incremental facial exaggeration from the real image to the caricature faces using the encoder and generator's latent space. Our projection preserves the facial identity, attributes, and expressions from the input image. Also, it accounts for facial occlusions, such as reading glasses or sunglasses, to enhance the robustness of our model. Furthermore, we conducted a comprehensive comparison of our approach with various state-of-the-art face caricature methods, highlighting our process's distinctiveness and exceptional realism.

Revisiting Supervision for Continual Representation Learning

  • paper_url: http://arxiv.org/abs/2311.13321
  • repo_url: None
  • paper_authors: Daniel Marczak, Sebastian Cygert, Tomasz Trzciński, Bartłomiej Twardowski
  • for: To reexamine the role of supervision in continual representation learning.
  • methods: Supervised and self-supervised continual representation learning are compared, focusing on the role of the multi-layer perceptron projector.
  • results: Supervised models enhanced with a multi-layer perceptron head can outperform self-supervised models in continual representation learning.
    Abstract In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, recent studies have highlighted the strengths of self-supervised continual representation learning. The improved transferability of representations built with self-supervised methods is often associated with the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We reckon that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning.

Deep Learning for Vascular Segmentation and Applications in Phase Contrast Tomography Imaging

  • paper_url: http://arxiv.org/abs/2311.13319
  • repo_url: None
  • paper_authors: Ekin Yagis, Shahab Aslani, Yashvardhan Jain, Yang Zhou, Shahrokh Rahmani, Joseph Brunet, Alexandre Bellier, Christopher Werlein, Maximilian Ackermann, Danny Jonigk, Paul Tafforeau, Peter D Lee, Claire Walsh
  • for: To provide a thorough literature review of automated vessel segmentation, establishing a foundation on the topic and identifying a robust baseline model for vascular segmentation in a new imaging modality, Hierarchical Phase Contrast Tomography (HiP-CT).
  • methods: Machine learning techniques across diverse organs are reviewed, and the nnU-Net model is trained on a double-annotator-validated vascular dataset from three kidneys imaged with HiP-CT within the Human Organ Atlas Project.
  • results: Although segmentation scores were reasonably high (e.g., clDice values of 0.82-0.88), errors persisted: large vessels that collapsed due to the lack of hydrostatic pressure (HiP-CT is an ex vivo technique) were segmented poorly, and finer vessels showed decreased connectivity and higher segmentation errors at vessel boundaries.
    Abstract Automated blood vessel segmentation is vital for biomedical imaging, as vessel changes indicate many pathologies. Still, precise segmentation is difficult due to the complexity of vascular structures, anatomical variations across patients, the scarcity of annotated public datasets, and the quality of images. We present a thorough literature review, highlighting the state of machine learning techniques across diverse organs. Our goal is to provide a foundation on the topic and identify a robust baseline model for application to vascular segmentation in a new imaging modality, Hierarchical Phase Contrast Tomography (HiP CT). Introduced in 2020 at the European Synchrotron Radiation Facility, HiP CT enables 3D imaging of complete organs at an unprecedented resolution of ca. 20 µm per voxel, with the capability for localized zooms in selected regions down to 1 µm per voxel without sectioning. We have created a training dataset with double annotator validated vascular data from three kidneys imaged with HiP CT in the context of the Human Organ Atlas Project. Finally, utilising the nnU Net model, we conduct experiments to assess the model's performance on both familiar and unseen samples, employing vessel-specific metrics. Our results show that while segmentations yielded reasonably high scores such as clDice values ranging from 0.82 to 0.88, certain errors persisted. Large vessels that collapsed due to the lack of hydrostatic pressure (HiP CT is an ex vivo technique) were segmented poorly. Moreover, decreased connectivity in finer vessels and higher segmentation errors at vessel boundaries were observed. Such errors obstruct the understanding of the structures by interrupting vascular tree connectivity. Through our review and outputs, we aim to set a benchmark for subsequent model evaluations using various modalities, especially with the HiP CT imaging database.

Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.13317
  • repo_url: https://github.com/shercoo/RGDiffSR
  • paper_authors: Yuxuan Zhou, Liangcai Gao, Zhi Tang, Baole Wei
  • for: To enhance the resolution and legibility of text in low-resolution images, improving the accuracy of Scene Text Recognition (STR).
  • methods: A Recognition-Guided Diffusion model (RGDiffSR) with a Recognition-Guided Denoising Network generates super-resolved text images under succinct semantic guidance.
  • results: On the TextZoom dataset, RGDiffSR surpasses prior state-of-the-art methods in both text recognition accuracy and image fidelity.
    Abstract Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images, consequently elevating recognition accuracy in Scene Text Recognition (STR). Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance to address this issue. Nevertheless, they remain deficient when confronted with severely blurred images, due to their insufficient generation capability when little structural or semantic information can be extracted from original images. Therefore, we introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising Network, to guide the diffusion model generating LR-consistent results through succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the superiority of RGDiffSR over prior state-of-the-art methods in both text recognition accuracy and image fidelity.

Retargeting Visual Data with Deformation Fields

  • paper_url: http://arxiv.org/abs/2311.13297
  • repo_url: None
  • paper_authors: Tim Elsner, Julia Berger, Tong Wu, Victor Czech, Lin Gao, Leif Kobbelt
  • for: To generalize content-aware retargeting beyond seam carving to broader visual data formats and more degrees of freedom for editing.
  • methods: A neural network learns a deformation (displacement) field that keeps the output plausible while deforming it only in places with low information content.
  • results: Experiments on different kinds of visual data, including images, neural radiance fields, and polygon meshes, show better content-aware retargeting than previous methods.
    Abstract Seam carving is an image editing method that enables content-aware resizing, including operations like removing objects. However, the seam-finding strategy based on dynamic programming or graph-cut limits its applications to broader visual data formats and degrees of freedom for editing. Our observation is that describing the editing and retargeting of images more generally by a displacement field yields a generalisation of content-aware deformations. We propose to learn a deformation with a neural network that keeps the output plausible while trying to deform it only in places with low information content. This technique applies to different kinds of visual data, including images, 3D scenes given as neural radiance fields, or even polygon meshes. Experiments conducted on different visual data show that our method achieves better content-aware retargeting compared to previous methods.
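At inference, retargeting with a deformation field reduces to backward warping: each output pixel samples the input at its displaced location. A minimal grayscale sketch follows; the learned network would predict `disp`, which here is an arbitrary input.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_displacement(image, disp):
    """Resample an image through a dense displacement field.

    disp: (2, H, W) per-pixel (dy, dx) offsets. Each output pixel p
    takes the input value at p + disp[p] (backward warping), the
    basic operation a learned deformation field drives.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([ys + disp[0], xs + disp[1]])
    return map_coordinates(image, coords, order=1, mode='nearest')
```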

CMFDFormer: Transformer-based Copy-Move Forgery Detection with Continual Learning

  • paper_url: http://arxiv.org/abs/2311.13263
  • repo_url: None
  • paper_authors: Yaqi Liu, Chao Xia, Song Xiao, Qingxiao Guan, Wenqian Dong, Yifan Zhang, Nenghai Yu
  • for: To improve deep-learning-based copy-move forgery detection, whose reliance on synthetic training data degrades performance on new tasks.
  • methods: A Transformer-style detection network, CMFDFormer, combines a MiT (Mix Transformer) backbone with a Pluggable Hybrid Decoder (PHD) mask prediction network, and a PCSD (Pooled Cube and Strip Distillation) continual learning framework helps the model handle new tasks.
  • results: Experiments on publicly available datasets show good detection performance, and the PCSD framework avoids catastrophic forgetting when handling new tasks.
    Abstract Copy-move forgery detection aims at detecting duplicated regions in a suspected forged image, and deep learning based copy-move forgery detection methods are in the ascendant. These deep learning based methods heavily rely on synthetic training data, and the performance will degrade when facing new tasks. In this paper, we propose a Transformer-style copy-move forgery detection network named as CMFDFormer, and provide a novel PCSD (Pooled Cube and Strip Distillation) continual learning framework to help CMFDFormer handle new tasks. CMFDFormer consists of a MiT (Mix Transformer) backbone network and a PHD (Pluggable Hybrid Decoder) mask prediction network. The MiT backbone network is a Transformer-style network which is adopted on the basis of comprehensive analyses with CNN-style and MLP-style backbones. The PHD network is constructed based on self-correlation computation, hierarchical feature integration, a multi-scale cycle fully-connected block and a mask reconstruction block. The PHD network is applicable to feature extractors of different styles for hierarchical multi-scale information extraction, achieving comparable performance. Last but not least, we propose a PCSD continual learning framework to improve the forgery detectability and avoid catastrophic forgetting when handling new tasks. Our continual learning framework restricts intermediate features from the PHD network, and takes advantage of both cube pooling and strip pooling. Extensive experiments on publicly available datasets demonstrate the good performance of CMFDFormer and the effectiveness of the PCSD continual learning framework.

Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides

  • paper_url: http://arxiv.org/abs/2311.13261
  • repo_url: https://github.com/aican-research/breast-epithelium-segmentation
  • paper_authors: Maren Høibø, André Pedersen, Vibeke Grotnes Dale, Sissel Marie Berget, Borgny Ytterhus, Cecilia Lindskog, Elisabeth Wik, Lars A. Akslen, Ingerid Reinertsen, Erik Smistad, Marit Valla
  • for: To develop an AI model that segments benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer sections.
  • methods: A convolutional neural network is trained with data augmentation on HE/CK image pairs; ground-truth epithelial masks are generated by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3 and by pathologists' annotations.
  • results: Trained and evaluated on tissue microarrays from 839 patients and whole slide images from two patients, the model achieved mean Dice scores of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, with pathologists' qualitative scores (0-5) of 4.7 and 4.4 for all epithelium and invasive epithelium.
    Abstract Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort were used as a second test set. In quantitative evaluation, mean Dice scores of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, were achieved. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation

Density Distribution-based Learning Framework for Addressing Online Continual Learning Challenges

  • paper_url: http://arxiv.org/abs/2311.13623
  • repo_url: None
  • paper_authors: Shilin Zhang, Jiahui Wang
  • for: To address the challenges of online continual learning (CL), in particular class-incremental learning, which must adapt to new test distributions while learning from a single-pass training stream, matching real-world deployment; existing CL methods often suffer catastrophic forgetting and high computing costs, limiting practical use.
  • methods: An independent Generative Kernel Density Estimation (GKDE) model is kept for each CL task; at test time, the GKDE that self-reports the maximum probability density predicts the incoming instance, and a GKDE-based learning objective groups same-label samples together while pushing dissimilar instances farther apart.
  • results: Extensive experiments on multiple CL datasets show the framework outperforms popular CL approaches by a significant margin while maintaining competitive time-space efficiency, bridging the performance gap between CL and classical machine learning.
    Abstract In this paper, we address the challenges of online Continual Learning (CL) by introducing a density distribution-based learning framework. CL, especially the Class Incremental Learning, enables adaptation to new test distributions while continuously learning from a single-pass training data stream, which is more in line with the practical application requirements of real-world scenarios. However, existing CL methods often suffer from catastrophic forgetting and higher computing costs due to complex algorithm designs, limiting their practical use. Our proposed framework overcomes these limitations by achieving superior average accuracy and time-space efficiency, bridging the performance gap between CL and classical machine learning. Specifically, we adopt an independent Generative Kernel Density Estimation (GKDE) model for each CL task. During the testing stage, the GKDEs utilize a self-reported max probability density value to determine which one is responsible for predicting incoming test instances. A GKDE-based learning objective can ensure that samples with the same label are grouped together, while dissimilar instances are pushed farther apart. Extensive experiments conducted on multiple CL datasets validate the effectiveness of our proposed framework. Our method outperforms popular CL approaches by a significant margin, while maintaining competitive time-space efficiency, making our framework suitable for real-world applications. Code will be available at https://github.com/xxxx/xxxx.
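The test-time rule in the abstract, routing each instance to the task whose density model self-reports the maximum probability density, can be sketched with off-the-shelf kernel density estimators. The bandwidth, feature space, and sklearn estimator below are our assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class DensityRouter:
    """Route test instances to the per-task model with highest density.

    One KDE is fit per task on that task's feature embeddings; an
    incoming embedding is handled by the task whose KDE reports the
    maximum log-density.
    """
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.kdes = []

    def add_task(self, task_features):
        # task_features: (num_samples, feature_dim) array for one task
        kde = KernelDensity(bandwidth=self.bandwidth).fit(task_features)
        self.kdes.append(kde)

    def route(self, x):
        # returns the index of the task model responsible for x
        x = np.atleast_2d(x)
        densities = [kde.score_samples(x)[0] for kde in self.kdes]
        return int(np.argmax(densities))
```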

Towards Hetero-Client Federated Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2311.13250
  • repo_url: None
  • paper_authors: Yuxiang Lu, Suizhi Huang, Yuwen Yang, Shalayiding Sirejiding, Yue Ding, Hongtao Lu
  • for: To introduce Hetero-Client Federated Multi-Task Learning (HC-FMTL), a new problem setting that accommodates diverse task setups across clients and extends real-world applicability beyond the model-congruity assumption of FMTL.
  • methods: The FedHCA$^2$ framework enables federated training of personalized models by modeling relationships among heterogeneous clients: Hyper Conflict-Averse Aggregation mitigates conflicts during encoder updates, Hyper Cross Attention Aggregation uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity, and learnable Hyper Aggregation Weights customize per-client parameter updates.
  • results: Extensive experiments show FedHCA$^2$ outperforms representative methods in various HC-FMTL scenarios; the code will be made publicly available.
    Abstract Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks, assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability, we introduce a novel problem setting, Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse task setups. The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in accurate model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges, we propose the FedHCA$^2$ framework, which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization, we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally, inspired by task interaction in MTL, the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios compared to representative methods. Our code will be made publicly available.

TDiffDe: A Truncated Diffusion Model for Remote Sensing Hyperspectral Image Denoising

  • paper_url: http://arxiv.org/abs/2311.13622
  • repo_url: None
  • paper_authors: Jiang He, Yajie Li, Jie L, Qiangqiang Yuan
  • for: Denoising remote sensing hyperspectral images corrupted by various noise.
  • methods: A truncated diffusion model, TDiffDe, gradually recovers the useful information in hyperspectral images.
  • results: Because the noisy input already contains image information, the trained diffusion process is truncated to start from a small step rather than pure noise, avoiding the destruction of valid information.
    Abstract Hyperspectral images play a crucial role in precision agriculture, environmental monitoring and ecological analysis. However, due to sensor equipment and the imaging environment, the observed hyperspectral images are often inevitably corrupted by various noise. In this study, we propose a truncated diffusion model, called TDiffDe, to gradually recover the useful information in hyperspectral images. Rather than starting from pure noise, the input in hyperspectral image denoising already contains image information. Thus, we truncate the trained diffusion model to start from a small step, avoiding the destruction of valid information.
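The truncation idea is generic to diffusion samplers: treat the noisy observation as the state at a small step t_start instead of sampling pure noise at t = T, then run only the remaining reverse steps. Below is a DDIM-style sketch under assumed DDPM conventions (`alphas_cumprod` is the cumulative product of the noise schedule, and the model signature is hypothetical); TDiffDe's actual sampler and schedule may differ.

```python
import torch

@torch.no_grad()
def truncated_reverse(model, noisy_image, alphas_cumprod, t_start=100):
    """Run a reverse diffusion process from an intermediate step.

    The observed noisy image is treated as the state at step t_start,
    so only t_start reverse steps are executed instead of the full T.
    """
    x = noisy_image
    for t in range(t_start, 0, -1):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1]
        eps = model(x, torch.tensor([t]))                   # predicted noise at step t
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x
```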

Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation for Balanced Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2311.13621
  • repo_url: https://github.com/cpsu00/er-kd
  • paper_authors: Chi-Ping Su, Ching-Hsun Tseng, Shin-Jye Lee
  • for: To close the knowledge gap in distillation caused by the student's overconfident predictions, which over-emphasize pronounced features while overlooking the subtle yet crucial dark knowledge in the teacher's soft predictions.
  • methods: Entropy-Reweighted Knowledge Distillation (ER-KD) uses the entropy of the teacher's predictions to reweight the distillation loss on a sample-wise basis, refocusing the student on challenging instances rich in the teacher's nuanced insights.
  • results: ER-KD is compatible with various state-of-the-art KD methods and further enhances their performance at negligible cost.
    Abstract Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model to a compact "student" model, guiding the student with the "dark knowledge" $\unicode{x2014}$ the implicit insights present in the teacher's soft predictions. Although existing KDs have shown the potential of transferring knowledge, the gap between the two parties still exists. With a series of investigations, we argue the gap is the result of the student's overconfidence in prediction, signaling an imbalanced focus on pronounced features while overlooking the subtle yet crucial dark knowledge. To overcome this, we introduce the Entropy-Reweighted Knowledge Distillation (ER-KD), a novel approach that leverages the entropy in the teacher's predictions to reweight the KD loss on a sample-wise basis. ER-KD precisely refocuses the student on challenging instances rich in the teacher's nuanced insights while reducing the emphasis on simpler cases, enabling a more balanced knowledge transfer. Consequently, ER-KD not only demonstrates compatibility with various state-of-the-art KD methods but also further enhances their performance at negligible cost. This approach offers a streamlined and effective strategy to refine the knowledge transfer process in KD, setting a new paradigm in the meticulous handling of dark knowledge. Our code is available at https://github.com/cpsu00/ER-KD.
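The reweighting described above can be sketched directly: the per-sample KL term of vanilla KD is scaled by the entropy of the teacher's softened prediction, so ambiguous samples, which carry more dark knowledge, contribute more. The exact normalization in ER-KD may differ from this minimal version (see the linked repository for the reference code).

```python
import torch
import torch.nn.functional as F

def er_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Entropy-reweighted distillation loss (a minimal sketch).

    Both logit tensors have shape (batch, num_classes). Each sample's
    KL term is weighted by the entropy of the teacher's softened
    distribution before averaging.
    """
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    entropy = -(p_teacher * p_teacher.clamp_min(1e-12).log()).sum(-1)   # per sample
    kl = F.kl_div(log_p_student, p_teacher, reduction='none').sum(-1)   # per sample
    return (entropy * kl).mean() * t * t   # usual T^2 scaling of soft-label KD
```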

Towards Detecting, Recognizing, and Parsing the Address Information from Bangla Signboard: A Deep Learning-based Approach

  • paper_url: http://arxiv.org/abs/2311.13222
  • repo_url: None
  • paper_authors: Hasan Murad, Mohammed Eunus Ali
  • for: To improve address-information extraction from Bangla signboards with an end-to-end deep-learning system that detects, recognizes, corrects, and parses address text.
  • methods: Signboard detection, address text detection, recognition, correction, and parsing models are trained on manually annotated and synthetic datasets; CTC-based and encoder-decoder architectures are compared for Bangla address text recognition, and a novel sequence-to-sequence transformer-based model post-corrects the recognized text.
  • results: The comparative study identifies effective recognition architectures, the correction model improves recognition performance, and a state-of-the-art transformer-based pre-trained language model powers the Bangla address text parser.
    Abstract Retrieving textual information from natural scene images is an active research area in the field of computer vision with numerous practical applications. Detecting text regions and extracting text from signboards is a challenging problem due to special characteristics like reflecting lights, uneven illumination, or shadows found in real-life natural scene images. With the advent of deep learning-based methods, different sophisticated techniques have been proposed for text detection and text recognition from the natural scene. Though a significant amount of effort has been devoted to extracting natural scene text for resourceful languages like English, little has been done for low-resource languages like Bangla. In this research work, we have proposed an end-to-end system with deep learning-based models for efficiently detecting, recognizing, correcting, and parsing address information from Bangla signboards. We have created manually annotated datasets and synthetic datasets to train signboard detection, address text detection, address text recognition, address text correction, and address text parser models. We have conducted a comparative study among different CTC-based and Encoder-Decoder model architectures for Bangla address text recognition. Moreover, we have designed a novel address text correction model using a sequence-to-sequence transformer-based network to improve the performance of Bangla address text recognition model by post-correction. Finally, we have developed a Bangla address text parser using the state-of-the-art transformer-based pre-trained language model.

Test-time Adaptive Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2311.13209
  • repo_url: https://github.com/Feliciaxyao/FSTTA
  • paper_authors: Junyu Gao, Xuan Yao, Changsheng Xu
  • for: To enhance the generalization of Vision-and-Language Navigation (VLN) models so they adapt to diverse, dynamically changing environments.
  • methods: Fast-Slow Test-Time Adaptation (FSTTA) performs a unified decomposition-accumulation analysis of both gradients and parameters: the fast phase accumulates consistent components of recent navigation gradients for quick adaptation, while the slow phase gathers historical parameters to revert the model to a stable state.
  • results: Extensive experiments on four popular benchmarks show impressive performance gains.
    Abstract Vision-and-Language Navigation (VLN) has witnessed significant advancements in recent years, largely attributed to meticulously curated datasets and proficiently trained models. Nevertheless, when tested in diverse environments, the trained models inevitably encounter significant shifts in data distribution, highlighting that relying solely on pre-trained and fixed navigation models is insufficient. To enhance models' generalization ability, test-time adaptation (TTA) demonstrates significant potential in the computer vision field by leveraging unlabeled test samples for model updates. However, simply applying existing TTA methods to the VLN task cannot well handle the adaptability-stability dilemma of VLN models, i.e., frequent updates can result in drastic changes in model parameters, while occasional updates can make the models ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for VLN by performing decomposition-accumulation analysis for both gradients and parameters in a unified framework. Specifically, in the fast update phase, gradients generated during the recent multi-step navigation process are decomposed into components with varying levels of consistency. Then, these components are adaptively accumulated to pinpoint a concordant direction for fast model adaptation. In the slow update phase, historically recorded parameters are gathered, and a similar decomposition-accumulation analysis is conducted to revert the model to a stable state. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks.
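The paper's decomposition-accumulation analysis is more elaborate than can be reproduced from the abstract; as a rough stand-in for the fast phase, the sketch below weights each recent-step gradient by its agreement with the mean direction, so consistent components dominate and conflicting ones are suppressed. Everything here is our simplification, not FSTTA's actual procedure.

```python
import torch

def concordant_direction(gradients):
    """Accumulate multi-step gradients into one concordant update direction.

    gradients: list of same-shaped tensors, one per recent navigation step.
    Each gradient is weighted by its (non-negative) cosine similarity to
    the mean gradient before averaging.
    """
    g = torch.stack([gr.flatten() for gr in gradients])          # (steps, dim)
    mean = g.mean(dim=0)
    weights = torch.cosine_similarity(g, mean.unsqueeze(0), dim=1).clamp(min=0)
    direction = (weights.unsqueeze(1) * g).sum(dim=0) / (weights.sum() + 1e-12)
    return direction.view_as(gradients[0])
```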

The Challenges of Image Generation Models in Generating Multi-Component Images

  • paper_url: http://arxiv.org/abs/2311.13620
  • repo_url: None
  • paper_authors: Tham Yik Foong, Shashank Kotyan, Po Yuan Mao, Danilo Vasconcellos Vargas
  • for: To examine how prompt complexity limits modern text-to-image generators, particularly the quality of images comprising multiple prompted components.
  • methods: Several generative models are evaluated with the proposed Components Inclusion Score (CIS); Stable Diffusion V2 is additionally fine-tuned on a custom-created multi-component test dataset.
  • results: Evaluated models struggle to incorporate all visual elements of multi-component prompts (an 8.53% drop in CIS per component), and image quality and context awareness decline as components increase (15.91% lower Inception Score, 9.62% higher Frechet Inception Distance); the fine-tuned Stable Diffusion V2 outperforms its vanilla counterpart.
    Abstract Recent advances in text-to-image generators have led to substantial capabilities in image generation. However, the complexity of prompts acts as a bottleneck in the quality of images generated. A particular under-explored facet is the ability of generative models to create high-quality images comprising multiple components given as a prior. In this paper, we propose and validate a metric called Components Inclusion Score (CIS) to evaluate the extent to which a model can correctly generate multiple components. Our results reveal that the evaluated models struggle to incorporate all the visual elements from prompts with multiple components (8.53% drop in CIS per component for all evaluated models). We also identify a significant decline in the quality of the images and context awareness within an image as the number of components increased (15.91% decrease in inception Score and 9.62% increase in Frechet Inception Distance). To remedy this issue, we fine-tuned Stable Diffusion V2 on a custom-created test dataset with multiple components, outperforming its vanilla counterpart. To conclude, these findings reveal a critical limitation in existing text-to-image generators, shedding light on the challenge of generating multiple components within a single image using a complex prompt.
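The abstract names the Components Inclusion Score without defining it; one plausible instantiation, purely our assumption, is the fraction of prompted components that an off-the-shelf detector (or VQA model) finds in the generated image.

```python
def components_inclusion_score(prompt_components, detected_objects):
    """A hypothetical Components Inclusion Score.

    prompt_components: component names requested in the prompt.
    detected_objects: set of names a detector found in the image.
    Returns the fraction of requested components that appear.
    """
    if not prompt_components:
        return 1.0
    found = sum(c in detected_objects for c in prompt_components)
    return found / len(prompt_components)

# e.g. prompt "a dog, a bicycle and a lamppost" scored against detections
score = components_inclusion_score(["dog", "bicycle", "lamppost"],
                                   {"dog", "bicycle"})   # -> 0.667
```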

Steal My Artworks for Fine-tuning? A Watermarking Framework for Detecting Art Theft Mimicry in Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2311.13619
  • repo_url: None
  • paper_authors: Ge Luo, Junqiang Huang, Manman Zhang, Zhenxing Qian, Sheng Li, Xinpeng Zhang
  • for: To protect artists' copyrights and preserve their motivation to produce original works.
  • methods: Subtle watermarks are embedded into digital artworks while preserving the artist's visual expression; if watermarked artworks are used as fine-tuning data to mimic the artist's style, analyzing the watermark distribution across generated images exposes the unauthorized mimicry.
  • results: Across various fine-tuning scenarios and watermark attack methods, analyzing the watermark distribution in artificially generated images reliably detects unauthorized mimicry.
    Abstract The advancement in text-to-image models has led to astonishing artistic performances. However, several studios and websites illegally fine-tune these models using artists' artworks to mimic their styles for profit, which violates the copyrights of artists and diminishes their motivation to produce original works. Currently, there is a notable lack of research focusing on this issue. In this paper, we propose a novel watermarking framework that detects mimicry in text-to-image models through fine-tuning. This framework embeds subtle watermarks into digital artworks to protect their copyrights while still preserving the artist's visual expression. If someone takes watermarked artworks as training data to mimic an artist's style, these watermarks can serve as detectable indicators. By analyzing the distribution of these watermarks in a series of generated images, acts of fine-tuning mimicry using stolen victim data will be exposed. In various fine-tune scenarios and against watermark attack methods, our research confirms that analyzing the distribution of watermarks in artificially generated images reliably detects unauthorized mimicry.
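As a rough illustration of the detection logic, the sketch below runs a hypothesis test over watermark hits in a batch of generated images. `decode_watermark` stands in for the paper's watermark extractor and is assumed to exist; the false-positive rate and significance level are placeholders.

```python
from scipy.stats import binomtest

def detect_mimicry(generated_images, decode_watermark, fp_rate=0.01, alpha=1e-3):
    """Test whether watermark detections in generated images occur far more
    often than the decoder's false-positive rate; if so, the generator was
    likely fine-tuned on the watermarked (stolen) artworks.

    decode_watermark(img) -> True if the artist's watermark is detected
    (the decoder itself is assumed, not provided by this sketch)."""
    hits = sum(bool(decode_watermark(img)) for img in generated_images)
    test = binomtest(hits, len(generated_images), fp_rate, alternative="greater")
    return {"hits": hits, "n": len(generated_images),
            "p_value": test.pvalue, "mimicry_detected": test.pvalue < alpha}
```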

Self-guided Few-shot Semantic Segmentation for Remote Sensing Imagery Based on Large Vision Models

  • paper_url: http://arxiv.org/abs/2311.13200
  • repo_url: None
  • paper_authors: Xiyu Qi, Yifan Wu, Yongqiang Mao, Wenhui Zhang, Yidan Zhang
  • for: This paper proposes a framework to automate few-shot semantic segmentation of remote sensing imagery.
  • methods: The framework builds on the SAM model and introduces a novel automatic prompt learning approach that leverages prior guided masks to produce coarse pixel-wise prompts.
  • results: Experiments on the DLRSD dataset show that the method outperforms other available few-shot methodologies.
    Abstract The Segment Anything Model (SAM) exhibits remarkable versatility and zero-shot learning abilities, owing largely to its extensive training data (SA-1B). Recognizing SAM's dependency on manual guidance given its category-agnostic nature, we identified unexplored potential within few-shot semantic segmentation tasks for remote sensing imagery. This research introduces a structured framework designed for the automation of few-shot semantic segmentation. It utilizes the SAM model and facilitates a more efficient generation of semantically discernible segmentation outcomes. Central to our methodology is a novel automatic prompt learning approach, leveraging prior guided masks to produce coarse pixel-wise prompts for SAM. Extensive experiments on the DLRSD datasets underline the superiority of our approach, outperforming other available few-shot methodologies.
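A minimal sketch of the prompting step, assuming the official `segment_anything` package: foreground points are sampled from a prior guided (coarse) mask and fed to SAM as point prompts. The checkpoint path, threshold, and random sampling scheme are illustrative; the paper's automatic prompt learning is more involved.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def prompts_from_prior_mask(coarse_mask, n_points=5):
    """Sample foreground point prompts from a prior guided mask.
    coarse_mask: (H, W) array of per-pixel foreground probabilities
    produced by the prompt-learning stage (assumed available)."""
    ys, xs = np.where(coarse_mask > 0.5)
    idx = np.random.choice(len(ys), size=min(n_points, len(ys)), replace=False)
    coords = np.stack([xs[idx], ys[idx]], axis=1)   # SAM expects (x, y)
    labels = np.ones(len(idx), dtype=int)           # 1 = foreground point
    return coords, labels

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # path assumed
predictor = SamPredictor(sam)
predictor.set_image(image)            # image: HxWx3 uint8 RGB, assumed loaded
coords, labels = prompts_from_prior_mask(coarse_mask)
masks, scores, _ = predictor.predict(point_coords=coords, point_labels=labels)
```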

DRIFu: Differentiable Rendering and Implicit Function-based Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2311.13199
  • repo_url: https://github.com/kuangzijian/drifu-for-animals
  • paper_authors: Zijian Kuang, Lihang Ying, Shi Jin, Li Cheng
  • for: This paper aims to develop a novel 3D digitization technique for live animals, specifically tailored for avian forms.
  • methods: The proposed method, called DRIFu, leverages a curated set of synthetic 3D animal models, innovative alignment tools, and a shared shape space to enable precise predictions of animal shape and texture.
  • results: The DRIFu model has the potential to revolutionize our understanding and representation of avian forms, enabling realistic posing, animation, and alignment with real-world data.
    Abstract The Differentiable Rendering and Implicit Function-based model (DRIFu) draws its roots from the Pixel-aligned Implicit Function (PIFU), a pioneering 3D digitization technique initially designed for clothed human bodies. PIFU excels in capturing nuanced body shape variations within a low-dimensional space and has been extensively trained on human 3D scans. However, the application of PIFU to live animals poses significant challenges, primarily due to the inherent difficulty in obtaining the cooperation of animals for 3D scanning. In response to this challenge, we introduce the DRIFu model, specifically tailored for animal digitization. To train DRIFu, we employ a curated set of synthetic 3D animal models, encompassing diverse shapes, sizes, and even accounting for variations such as baby birds. Our innovative alignment tools play a pivotal role in mapping these diverse synthetic animal models onto a unified template, facilitating precise predictions of animal shape and texture. Crucially, our template alignment strategy establishes a shared shape space, allowing for the seamless sampling of new animal shapes, posing them realistically, animating them, and aligning them with real-world data. This groundbreaking approach revolutionizes our capacity to comprehensively understand and represent avian forms. For further details and access to the project, the project website can be found at https://github.com/kuangzijian/drifu-for-animals
    摘要 DRIFu模型(Differentiable Rendering and Implicit Function-based model)的起源可以追溯到Pixel-aligned Implicit Function(PIFU),这是一种推进人体三维化技术的先驱技术。 PIFU excellently captures the nuanced variations of body shape in a low-dimensional space and has been extensively trained on human 3D scans. However, applying PIFU to live animals poses significant challenges, primarily due to the difficulty in obtaining the cooperation of animals for 3D scanning. In response to this challenge, we introduce the DRIFu model, specifically tailored for animal digitization.To train DRIFu, we use a curated set of synthetic 3D animal models that encompass diverse shapes, sizes, and even account for variations such as baby birds. Our innovative alignment tools play a crucial role in mapping these diverse synthetic animal models onto a unified template, facilitating precise predictions of animal shape and texture. Importantly, our template alignment strategy establishes a shared shape space, allowing for seamless sampling of new animal shapes, posing them realistically, animating them, and aligning them with real-world data. This groundbreaking approach revolutionizes our capacity to comprehensively understand and represent avian forms.For more details and access to the project, please visit the project website at .

DoubleAUG: Single-domain Generalized Object Detector in Urban via Color Perturbation and Dual-style Memory

  • paper_url: http://arxiv.org/abs/2311.13198
  • repo_url: None
  • paper_authors: Lei Qi, Peng Dong, Tan Xiong, Hui Xue, Xin Geng
  • for: solve the single-domain generalizable object detection task in urban scenarios
  • methods: Double AUGmentation (DoubleAUG) method that includes image- and feature-level augmentation schemes, Color Perturbation (CP) method, and Dual-Style Memory (DSM)
  • results: outperforms state-of-the-art methods, effective in enhancing the model's generalization capability, and can be integrated into existing methods to further improve model performance.
    Abstract Object detection in urban scenarios is crucial for autonomous driving in intelligent traffic systems. However, unlike conventional object detection tasks, urban-scene images vary greatly in style. For example, images taken on sunny days differ significantly from those taken on rainy days. Therefore, models trained on sunny day images may not generalize well to rainy day images. In this paper, we aim to solve the single-domain generalizable object detection task in urban scenarios, meaning that a model trained on images from one weather condition should be able to perform well on images from any other weather conditions. To address this challenge, we propose a novel Double AUGmentation (DoubleAUG) method that includes image- and feature-level augmentation schemes. In the image-level augmentation, we consider the variation in color information across different weather conditions and propose a Color Perturbation (CP) method that randomly exchanges the RGB channels to generate various images. In the feature-level augmentation, we propose to utilize a Dual-Style Memory (DSM) to explore the diverse style information on the entire dataset, further enhancing the model's generalization capability. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods. Furthermore, ablation studies confirm the effectiveness of each module in our proposed method. Moreover, our method is plug-and-play and can be integrated into existing methods to further improve model performance.
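The image-level CP operation is simple enough to sketch directly: it randomly permutes the RGB channels, as the abstract describes. The per-sample application probability is an assumption.

```python
import random
import numpy as np

def color_perturbation(image):
    """Image-level augmentation in the spirit of DoubleAUG's CP: randomly
    exchange the RGB channels to simulate color shifts across weather
    conditions. image: (H, W, 3) array in RGB order."""
    perm = list(range(3))
    random.shuffle(perm)
    return image[..., perm]

# During training, apply with some probability per sample (value assumed):
# img = color_perturbation(img) if random.random() < 0.5 else img
```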

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

  • paper_url: http://arxiv.org/abs/2311.13617
  • repo_url: None
  • paper_authors: Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie
  • For: This paper proposes a multi-stage single-image-to-3D generation method that can robustly generate reasonable 3D objects across different data domains.
  • Methods: A better 3D prior is used to train the NeRF. Specifically, the authors train an object-level LoRA for the target object using the original image and the rendering output of the NeRF, and then train the LoRA and NeRF with a progressive training strategy.
  • Results: Experiments show that the proposed method learns an object-specific 3D prior that goes beyond the capability of pre-trained diffusion priors and achieves state-of-the-art performance on the single-image-to-3D generation task.
    Abstract We present Boosting3D, a multi-stage single image-to-3D generation method that can robustly generate reasonable 3D objects in different data domains. The point of this work is to solve the view consistency problem in single image-guided 3D generation by modeling a reasonable geometric structure. For this purpose, we propose to utilize better 3D prior to training the NeRF. More specifically, we train an object-level LoRA for the target object using original image and the rendering output of NeRF. And then we train the LoRA and NeRF using a progressive training strategy. The LoRA and NeRF will boost each other while training. After the progressive training, the LoRA learns the 3D information of the generated object and eventually turns to an object-level 3D prior. In the final stage, we extract the mesh from the trained NeRF and use the trained LoRA to optimize the structure and appearance of the mesh. The experiments demonstrate the effectiveness of the proposed method. Boosting3D learns object-specific 3D prior which is beyond the ability of pre-trained diffusion priors and achieves state-of-the-art performance in the single image-to-3d generation task.

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

  • paper_url: http://arxiv.org/abs/2311.13616
  • repo_url: None
  • paper_authors: Zefan Qu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Cairong Zhao
  • for: Improving online video quality for applications such as video conferencing and cloud gaming, which demand low latency and high-quality video.
  • methods: A new method, STLVQE, is proposed, comprising a Module-Agnostic Feature Extractor and Spatial-Temporal Look-up Tables; it greatly reduces redundant computation and redesigns the propagation, alignment, and enhancement modules of the network.
  • results: Extensive experiments on the MFQE 2.0 dataset show that STLVQE achieves a satisfactory performance-speed trade-off.
    Abstract Low latency rates are crucial for online video-based applications, such as video conferencing and cloud gaming, which make improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and the requirement for temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. Our STLVQE designs a new VQE framework which contains a Module-Agnostic Feature Extractor that greatly reduces the redundant computations and redesign the propagation, alignment, and enhancement module of the network. A Spatial-Temporal Look-up Tables (STL) is proposed, which extracts spatial-temporal information in videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that our STLVQE achieves a satisfactory performance-speed trade-off.

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs

  • paper_url: http://arxiv.org/abs/2311.13194
  • repo_url: None
  • paper_authors: Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li
  • for: Improving the performance of document understanding models in text-rich scenarios.
  • methods: Multimodal Large Language Models (MLLMs) are fine-tuned with instructions that integrate text-location data, enhancing the models' ability to discern the spatial positioning of text within images.
  • results: Experiments show that the method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
    Abstract In the field of document understanding, significant advances have been made in the fine-tuning of Multimodal Large Language Models (MLLMs) with instruction-following data. Nevertheless, the potential of text-grounding capability within text-rich scenarios remains underexplored. In this paper, we present a text-grounding document understanding model, termed TGDoc, which addresses this deficiency by enhancing MLLMs with the ability to discern the spatial positioning of text within images. Empirical evidence suggests that text-grounding improves the model's interpretation of textual content, thereby elevating its proficiency in comprehending text-rich images. Specifically, we compile a dataset containing 99K PowerPoint presentations sourced from the internet. We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model. Moreover, we curate a collection of text-rich images and prompt the text-only GPT-4 to generate 12K high-quality conversations, featuring textual locations within text-rich scenarios. By integrating text location data into the instructions, TGDoc is adept at discerning text locations during the visual question process. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.13615
  • repo_url: https://github.com/T1sweet/HEViTPose
  • paper_authors: Chengpeng Wu, Guangxing Tan, Chunyu Li
  • for: Improving the efficiency and accuracy of human pose estimation, especially in complicated scenes.
  • methods: A High-Efficiency Vision Transformer (HEViTPose) is proposed. It includes a Cascaded Group Spatial Reduction Multi-Head Attention module (CGSR-MHA) that reduces computational cost while preserving feature diversity, and it defines a Patch Embedded Overlap Width (PEOW) concept to capture the relationship between the amount of overlap and local continuity; optimizing PEOW yields gains in performance, parameters, and GFLOPs.
  • results: On the MPII and COCO benchmarks, the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight, with clear gains in PCK@0.5 and AP. Specifically, HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set, while reducing parameters (by 62.1% and 80.4%) and GFLOPs (by 43.4% and 63.8%) relative to HRNet-W32 and Swin-S.
    Abstract Human pose estimation in complicated situations has always been a challenging task. Many Transformer-based pose networks have been proposed recently, achieving encouraging progress in improving performance. However, the remarkable performance of pose networks is always accompanied by heavy computation costs and large network scale. In order to deal with this problem, this paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose). In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost through feature grouping and spatial degradation mechanisms, while preserving feature diversity through multiple low-dimensional attention heads. Moreover, a concept of Patch Embedded Overlap Width (PEOW) is defined to help understand the relationship between the amount of overlap and local continuity. By optimising PEOW, our model gains improvements in performance, parameters and GFLOPs. Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight. Specifically, HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set. Compared with HRNet-W32 and Swin-S, our HEViTPose-B significantly reducing Params ($\downarrow$62.1%,$\downarrow$80.4%,) and GFLOPs ($\downarrow$43.4%,$\downarrow$63.8%,). Code and models are available at \url{here}.
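Below is a simplified PyTorch sketch of grouped spatial-reduction attention in the spirit of CGSR-MHA: channels are split into groups, keys and values are spatially downsampled to cut cost, and each group runs its own low-dimensional multi-head attention. The cascading structure and exact hyperparameters of the paper are omitted; this only conveys the cost-reduction idea, and the shared reduction conv is a simplification.

```python
import torch
import torch.nn as nn

class GroupedSRAttention(nn.Module):
    """Grouped spatial-reduction attention sketch (not the paper's exact
    CGSR-MHA). Keys/values are downsampled by a strided conv so each of
    the low-dimensional per-group attention heads attends over far fewer
    tokens than a full-resolution attention would."""
    def __init__(self, dim, groups=2, heads_per_group=2, sr_ratio=2):
        super().__init__()
        assert dim % groups == 0
        gdim = dim // groups
        self.groups = groups
        self.sr = nn.Conv2d(gdim, gdim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(gdim, heads_per_group, batch_first=True)
            for _ in range(groups))

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        gdim, outs = C // self.groups, []
        for g, attn in enumerate(self.attn):
            xg = x[:, g * gdim:(g + 1) * gdim]
            q = xg.flatten(2).transpose(1, 2)           # (B, HW, gdim)
            kv = self.sr(xg).flatten(2).transpose(1, 2) # reduced token set
            out, _ = attn(q, kv, kv)
            outs.append(out.transpose(1, 2).reshape(B, gdim, H, W))
        return torch.cat(outs, dim=1)
```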

NeISF: Neural Incident Stokes Field for Geometry and Material Estimation

  • paper_url: http://arxiv.org/abs/2311.13187
  • repo_url: None
  • paper_authors: Chenhao Li, Taishi Ono, Takeshi Uemori, Hajime Mihara, Alexander Gatto, Hajime Nagahara, Yusuke Moriuchi
  • for: Addressing multi-view inverse rendering, i.e., estimating scene parameters (shape, material, illumination) from a sequence of images captured under different viewpoints.
  • methods: A Neural Incident Stokes Field (NeISF) reduces ambiguities using polarization cues: since polarization is the accumulation of multi-bounced light, it carries rich information about geometry and material, and the incident Stokes field efficiently models this accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer.
  • results: Experimental results show that the method outperforms existing works in both synthetic and real scenarios.
    Abstract Multi-view inverse rendering is the problem of estimating the scene parameters such as shapes, materials, or illuminations from a sequence of images captured under different viewpoints. Many approaches, however, assume single light bounce and thus fail to recover challenging scenarios like inter-reflections. On the other hand, simply extending those methods to consider multi-bounced light requires more assumptions to alleviate the ambiguity. To address this problem, we propose Neural Incident Stokes Fields (NeISF), a multi-view inverse rendering framework that reduces ambiguities using polarization cues. The primary motivation for using polarization cues is that it is the accumulation of multi-bounced light, providing rich information about geometry and material. Based on this knowledge, the proposed incident Stokes field efficiently models the accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer. Lastly, experimental results show that our method outperforms the existing works in synthetic and real scenarios.

Applications of Spiking Neural Networks in Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2311.13186
  • repo_url: https://github.com/qvpr/vprsnn
  • paper_authors: Somayeh Hussaini, Michael Milford, Tobias Fischer
  • for: Exploring the potential of Spiking Neural Networks (SNNs) for Visual Place Recognition (VPR) in robotic tasks, particularly when implemented on neuromorphic hardware.
  • methods: Three advances are proposed: Modular SNNs, where each SNN represents a set of non-overlapping, geographically distinct places and scales to large environments; Ensembles of Modular SNNs, where multiple networks represent the same place for higher accuracy; and sequence matching, which refines place recognition using consecutive images.
  • results: Modular SNNs and their ensembles improve the accuracy and robustness of VPR and demonstrate the benefit of sequence matching. The SNNs are compact, with only 1500 neurons and 474k synapses, which makes them well suited for ensembling.
    Abstract In robotics, Spiking Neural Networks (SNNs) are increasingly recognized for their largely-unrealized potential energy efficiency and low latency particularly when implemented on neuromorphic hardware. Our paper highlights three advancements for SNNs in Visual Place Recognition (VPR). First, we propose Modular SNNs, where each SNN represents a set of non-overlapping geographically distinct places, enabling scalable networks for large environments. Secondly, we present Ensembles of Modular SNNs, where multiple networks represent the same place, significantly enhancing accuracy compared to single-network models. Our SNNs are compact and small, comprising only 1500 neurons and 474k synapses, which makes them ideally suited for ensembling due to this small size. Lastly, we investigate the role of sequence matching in SNN-based VPR, a technique where consecutive images are used to refine place recognition. We analyze the responsiveness of SNNs to ensembling and sequence matching compared to other VPR techniques. Our contributions highlight the viability of SNNs for VPR, offering scalable and robust solutions, paving the way for their application in various energy-sensitive robotic tasks.
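To illustrate the ensembling and sequence-matching ideas, here is a hedged sketch: output-neuron spike counts from several modular SNNs are summed before taking the argmax place, and a short history of frames can be accumulated the same way. The data layout and the plain summation rule are assumptions.

```python
import numpy as np

def ensemble_place_prediction(spike_counts):
    """Combine predictions from an ensemble of modular SNNs.
    spike_counts: list of (n_places,) arrays, one per ensemble member,
    holding output-neuron spike counts for a query image. Since each
    place is represented by multiple networks, summing their responses
    before the argmax gives the ensemble decision."""
    total = np.sum(np.stack(spike_counts), axis=0)
    return int(np.argmax(total))

def sequence_matched_prediction(history, current):
    """Sequence matching sketch: accumulate responses over consecutive
    frames (a simple sum over the recent window) before deciding."""
    acc = np.sum(np.stack(history + [current]), axis=0)
    return int(np.argmax(acc))
```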

Differentiable Radio Frequency Ray Tracing for Millimeter-Wave Sensing

  • paper_url: http://arxiv.org/abs/2311.13182
  • repo_url: None
  • paper_authors: Xingyu Chen, Xinyu Zhang, Qiyue Xia, Xinmin Fang, Chris Xiaoxuan Lu, Zhengxiong Li
  • for: Achieving fine-grained 3D reconstruction from millimeter wave (mmWave) sensing.
  • methods: A differentiable ray tracing engine simulates radar point clouds from virtual 3D models, and a gradient-based optimizer refines the model parameters to minimize the discrepancy between simulated and real point clouds.
  • results: Experiments with various radar hardware show that DiffSBR reconstructs 3D objects with high fidelity, including novel objects never seen by the radar before.
    Abstract Millimeter wave (mmWave) sensing is an emerging technology with applications in 3D object characterization and environment mapping. However, realizing precise 3D reconstruction from sparse mmWave signals remains challenging. Existing methods rely on data-driven learning, constrained by dataset availability and difficulty in generalization. We propose DiffSBR, a differentiable framework for mmWave-based 3D reconstruction. DiffSBR incorporates a differentiable ray tracing engine to simulate radar point clouds from virtual 3D models. A gradient-based optimizer refines the model parameters to minimize the discrepancy between simulated and real point clouds. Experiments using various radar hardware validate DiffSBR's capability for fine-grained 3D reconstruction, even for novel objects unseen by the radar previously. By integrating physics-based simulation with gradient optimization, DiffSBR transcends the limitations of data-driven approaches and pioneers a new paradigm for mmWave sensing.
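A minimal sketch of the fitting loop, assuming a differentiable simulator `simulate_radar(params)` is available (standing in for the paper's ray-tracing engine): the model parameters are refined by gradient descent on a Chamfer-style discrepancy between simulated and real point clouds. The loss and optimizer choices are illustrative.

```python
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds a: (N, 3), b: (M, 3)."""
    d = torch.cdist(a, b)                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def fit_model(params, simulate_radar, real_pc, steps=200, lr=1e-2):
    """DiffSBR-style fitting sketch: gradients of the point-cloud
    discrepancy flow through the (assumed differentiable) radar
    simulator to refine mesh/pose parameters."""
    params = params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = chamfer(simulate_radar(params), real_pc)
        loss.backward()
        opt.step()
    return params.detach()
```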

Volumetric Reconstruction Resolves Off-Resonance Artifacts in Static and Dynamic PROPELLER MRI

  • paper_url: http://arxiv.org/abs/2311.13177
  • repo_url: https://github.com/sarafridov/volumetric-propeller
  • paper_authors: Annesha Ghosh, Gordon Wetzstein, Mert Pilanci, Sara Fridovich-Keil
  • for: Resolving off-resonance artifacts in MRI to improve the diagnostic quality of static and dynamic images.
  • methods: The 2D MRI reconstruction problem is lifted to 3D by introducing an additional "spectral" dimension to model off-resonance, building on recent progress in modeling radiance fields.
  • results: The approach reconstructs both static and dynamic MR images and separates fat and water, which is of independent clinical interest.
    Abstract Off-resonance artifacts in magnetic resonance imaging (MRI) are visual distortions that occur when the actual resonant frequencies of spins within the imaging volume differ from the expected frequencies used to encode spatial information. These discrepancies can be caused by a variety of factors, including magnetic field inhomogeneities, chemical shifts, or susceptibility differences within the tissues. Such artifacts can manifest as blurring, ghosting, or misregistration of the reconstructed image, and they often compromise its diagnostic quality. We propose to resolve these artifacts by lifting the 2D MRI reconstruction problem to 3D, introducing an additional "spectral" dimension to model this off-resonance. Our approach is inspired by recent progress in modeling radiance fields, and is capable of reconstructing both static and dynamic MR images as well as separating fat and water, which is of independent clinical interest. We demonstrate our approach in the context of PROPELLER (Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction) MRI acquisitions, which are popular for their robustness to motion artifacts. Our method operates in a few minutes on a single GPU, and to our knowledge is the first to correct for chemical shift in gradient echo PROPELLER MRI reconstruction without additional measurements or pretraining data.

Learning to Complement with Multiple Humans (LECOMH): Integrating Multi-rater and Noisy-Label Learning into Human-AI Collaboration

  • paper_url: http://arxiv.org/abs/2311.13172
  • repo_url: None
  • paper_authors: Zheng Zhang, Kevin Wells, Gustavo Carneiro
  • For: Developing robust classifiers that can address the challenges posed by different types of data imperfections and complex decision processes in real-world applications.
  • Methods: The paper integrates noisy-label learning, multi-rater learning, and human-AI collaboration with new benchmarks and the innovative Learning to Complement with Multiple Humans (LECOMH) approach.
  • Results: LECOMH consistently outperforms leading human-AI collaboration methods on the proposed benchmarks, with accuracy improving as collaboration costs increase; it is also the only method that enhances human labeller performance across all benchmarks.
    Abstract The advent of learning with noisy labels (LNL), multi-rater learning, and human-AI collaboration has revolutionised the development of robust classifiers, enabling them to address the challenges posed by different types of data imperfections and complex decision processes commonly encountered in real-world applications. While each of these methodologies has individually made significant strides in addressing their unique challenges, the development of techniques that can simultaneously tackle these three problems remains underexplored. This paper addresses this research gap by integrating noisy-label learning, multi-rater learning, and human-AI collaboration with new benchmarks and the innovative Learning to Complement with Multiple Humans (LECOMH) approach. LECOMH optimises the level of human collaboration during testing, aiming to optimise classification accuracy while minimising collaboration costs that vary from 0 to M, where M is the maximum number of human collaborators. We quantitatively compare LECOMH with leading human-AI collaboration methods using our proposed benchmarks. LECOMH consistently outperforms the competition, with accuracy improving as collaboration costs increase. Notably, LECOMH is the only method enhancing human labeller performance across all benchmarks.

3D Face Style Transfer with a Hybrid Solution of NeRF and Mesh Rasterization

  • paper_url: http://arxiv.org/abs/2311.13168
  • repo_url: None
  • paper_authors: Jianwei Feng, Prateek Singhal
  • for: Achieving 3D face style transfer, i.e., generating stylized novel views of a 3D human face with multi-view consistency given a style reference image.
  • methods: A neural radiance field (NeRF) represents the 3D face and is combined with 2D style transfer. Directly training a NeRF on stylized images causes 3D inconsistency and blurriness, while jointly training a NeRF with 2D style transfer objectives converges poorly due to the identity and head-pose gap between style and content images, and is costly in training time and memory because full-image volume rendering is needed for the style loss. A hybrid framework of NeRF and mesh rasterization is therefore proposed, combining the high-fidelity geometry reconstruction of NeRF with the fast rendering speed of meshes.
  • results: The method produces high-quality 3D face style transfer with strong multi-view consistency while enabling flexible style control.
    Abstract Style transfer for human face has been widely researched in recent years. Majority of the existing approaches work in 2D image domain and have 3D inconsistency issue when applied on different viewpoints of the same face. In this paper, we tackle the problem of 3D face style transfer which aims at generating stylized novel views of a 3D human face with multi-view consistency. We propose to use a neural radiance field (NeRF) to represent 3D human face and combine it with 2D style transfer to stylize the 3D face. We find that directly training a NeRF on stylized images from 2D style transfer brings in 3D inconsistency issue and causes blurriness. On the other hand, training a NeRF jointly with 2D style transfer objectives shows poor convergence due to the identity and head pose gap between style image and content image. It also poses challenge in training time and memory due to the need of volume rendering for full image to apply style transfer loss functions. We therefore propose a hybrid framework of NeRF and mesh rasterization to combine the benefits of high fidelity geometry reconstruction of NeRF and fast rendering speed of mesh. Our framework consists of three stages: 1. Training a NeRF model on input face images to learn the 3D geometry; 2. Extracting a mesh from the trained NeRF model and optimizing it with style transfer objectives via differentiable rasterization; 3. Training a new color network in NeRF conditioned on a style embedding to enable arbitrary style transfer to the 3D face. Experiment results show that our approach generates high quality face style transfer with great 3D consistency, while also enabling a flexible style control.

Test-Time Augmentation for 3D Point Cloud Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2311.13152
  • repo_url: None
  • paper_authors: Tuan-Anh Vu, Srinjay Sarkar, Zhiyuan Zhang, Binh-Son Hua, Sai-Kit Yeung
  • For: The paper is written for improving the performance of 3D deep learning tasks, specifically addressing the issue of sparse point cloud representation.
  • Methods: The paper explores test-time augmentation (TTA) for 3D point clouds, leveraging implicit field reconstruction and point cloud upsampling techniques to augment point cloud data.
  • Results: The paper shows that both strategies are effective in improving accuracy, with point cloud upsampling leading to more significant performance improvement on downstream tasks such as object classification and segmentation.
    Abstract Data augmentation is a powerful technique to enhance the performance of a deep learning task but has received less attention in 3D deep learning. It is well known that when 3D shapes are sparsely represented with low point density, the performance of the downstream tasks drops significantly. This work explores test-time augmentation (TTA) for 3D point clouds. We are inspired by the recent revolution of learning implicit representation and point cloud upsampling, which can produce high-quality 3D surface reconstruction and proximity-to-surface, respectively. Our idea is to leverage the implicit field reconstruction or point cloud upsampling techniques as a systematic way to augment point cloud data. Mainly, we test both strategies by sampling points from the reconstructed results and using the sampled point cloud as test-time augmented data. We show that both strategies are effective in improving accuracy. We observed that point cloud upsampling for test-time augmentation can lead to more significant performance improvement on downstream tasks such as object classification and segmentation on the ModelNet40, ShapeNet, ScanObjectNN, and SemanticKITTI datasets, especially for sparse point clouds.
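The upsampling-based TTA strategy can be sketched in a few lines, assuming a point cloud upsampler (or implicit-surface sampler) is given: draw several random subsets of the densified cloud as augmented views and average the classifier logits. The view count and subset size are placeholders.

```python
import torch

def tta_predict(model, points, upsampler, n_views=8, n_points=1024):
    """Test-time augmentation sketch for point clouds.
    points:    (N, 3) sparse input cloud.
    upsampler: any point cloud upsampling / implicit-field sampler
               (assumed given) producing a denser (M, 3) cloud, M >> N.
    Random subsets of the dense cloud act as augmented test views."""
    dense = upsampler(points)
    logits = []
    for _ in range(n_views):
        idx = torch.randperm(dense.shape[0])[:n_points]
        logits.append(model(dense[idx].unsqueeze(0)))
    return torch.stack(logits).mean(0)       # averaged prediction
```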

Single Image Compressed Sensing MRI via a Self-Supervised Deep Denoising Approach

  • paper_url: http://arxiv.org/abs/2311.13144
  • repo_url: None
  • paper_authors: Marlon Bran Lorenzana, Feng Liu, Shekhar S. Chandra
  • for: This work proposes a single-image, self-supervised (SS) compressed sensing (CS)-MRI framework, avoiding the challenges of guaranteeing generalization across multiple datasets and of data access in real-world applications.
  • methods: Typical deep learning (DL) approaches train non-linear reconstruction models on large datasets, which makes generalization and data access hard to realize in practice. The proposed framework instead enables joint deep and sparse regularisation of CS artefacts from a single image, dampening structured artefacts that are difficult to remove with sparse reconstruction alone or with the inductive biases of a CNN alone.
  • results: Evaluated with Cartesian 1D masks on brain and knee datasets, PSNR improves by 2-4 dB on average.
    Abstract Popular methods in compressed sensing (CS) are dependent on deep learning (DL), where large amounts of data are used to train non-linear reconstruction models. However, ensuring generalisability over and access to multiple datasets is challenging to realise for real-world applications. To address these concerns, this paper proposes a single image, self-supervised (SS) CS-MRI framework that enables a joint deep and sparse regularisation of CS artefacts. The approach effectively dampens structured CS artefacts, which can be difficult to remove assuming sparse reconstruction, or relying solely on the inductive biases of CNN to produce noise-free images. Image quality is thereby improved compared to either approach alone. Metrics are evaluated using Cartesian 1D masks on a brain and knee dataset, with PSNR improving by 2-4dB on average.

Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.13141
  • repo_url: https://github.com/archerfmy/sd-t2i-360panoimage
  • paper_authors: Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie
  • for: This technical report addresses 360-degree panoramic image generation based on diffusion models.
  • methods: A circular blending strategy is proposed for both the denoising and VAE decoding stages to maintain geometric continuity between the leftmost and rightmost edges of the panorama.
  • results: Two models are presented, covering the Text-to-360-panoramas and Single-Image-to-360-panoramas tasks.
    Abstract This is a technical report on the 360-degree panoramic image generation task based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic images capture the entire $360^\circ\times 180^\circ$ field of view. So the rightmost and the leftmost sides of the 360 panoramic image should be continued, which is the main challenge in this field. However, the current diffusion pipeline is not appropriate for generating such a seamless 360-degree panoramic image. To this end, we propose a circular blending strategy on both the denoising and VAE decoding stages to maintain the geometry continuity. Based on this, we present two models for \textbf{Text-to-360-panoramas} and \textbf{Single-Image-to-360-panoramas} tasks. The code has been released as an open-source project at \href{https://github.com/ArcherFMY/SD-T2I-360PanoImage}{https://github.com/ArcherFMY/SD-T2I-360PanoImage} and \href{https://www.modelscope.cn/models/damo/cv_diffusion_text-to-360panorama-image_generation/summary}{ModelScope}
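One simple way to realize circular blending is sketched below: the columns adjacent across the wrap-around seam are pulled toward their shared mean, with the blend weight decaying into the interior, so the two ends agree exactly at the seam. The paper applies its blending during both denoising and VAE decoding; the exact scheme may differ from this sketch.

```python
import torch

def circular_blend(x, w=16):
    """Illustrative circular blending for a seamless 360-degree wrap.
    x: (B, C, H, W) latent or image tensor. Column 0 and column W-1 are
    neighbors once the panorama wraps, so they are blended toward their
    shared mean; the weight ramps from 1 at the seam to 0 inside."""
    ramp = torch.linspace(1.0, 0.0, w, device=x.device).view(1, 1, 1, w)
    left = x[..., :w]
    right = x[..., -w:].flip(-1)        # pair column 0 with column W-1, etc.
    mean = 0.5 * (left + right)
    out = x.clone()
    out[..., :w] = ramp * mean + (1 - ramp) * left
    out[..., -w:] = (ramp * mean + (1 - ramp) * right).flip(-1)
    return out
```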

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning

  • paper_url: http://arxiv.org/abs/2311.13613
  • repo_url: https://github.com/zhangxin-xd/Dataset-Pruning-TDDS
  • paper_authors: Xin Zhang, Jiawei Du, Yunsong Li, Weiying Xie, Joey Tianyi Zhou
  • for: Proposing a new dataset pruning method that constructs a coreset achieving performance comparable to the original, full dataset.
  • methods: A dual-depth strategy, Temporal Dual-Depth Scoring (TDDS), balances the broad incorporation of training dynamics across the whole training process with the identification of well-generalized, representative samples, rather than relying on snapshot criteria or simple averaging.
  • results: On CIFAR and ImageNet, the method clearly outperforms previous state-of-the-art (SOTA) approaches; on CIFAR-100 it reaches 54.51% accuracy with only 10% of the training data, exceeding random selection by 7.83% and other comparison methods by at least 12.69%.
    Abstract Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
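A hedged sketch of the two depths: given a matrix of per-sample contributions recorded across training checkpoints (the paper defines the exact contribution quantity), depth one spans the training progress, and depth two scores the variability of each sample's series so that well-generalized samples stand out. The window size and the std-based variability measure are assumptions.

```python
import numpy as np

def tdds_scores(contribs, window=5):
    """contribs: (T, N) array, entry [t, i] = estimated contribution of
    sample i at checkpoint t (e.g., a loss change or margin). Depth 1:
    the series spans all T checkpoints. Depth 2: a sliding-window
    variability of each sample's series is averaged into one score."""
    T, N = contribs.shape
    var = np.stack([contribs[t:t + window].std(axis=0)
                    for t in range(T - window + 1)])
    return var.mean(axis=0)                  # (N,) pruning score

def prune(dataset_indices, scores, keep_frac=0.1):
    """Keep the top-scoring fraction of samples as the coreset."""
    k = int(len(scores) * keep_frac)
    keep = np.argsort(scores)[-k:]
    return [dataset_indices[i] for i in keep]
```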

Lightweight High-Speed Photography Built on Coded Exposure and Implicit Neural Representation of Videos

  • paper_url: http://arxiv.org/abs/2311.13134
  • repo_url: https://github.com/zhihongz/bdinr
  • paper_authors: Zhihong Zhang, Runzhao Yang, Jinli Suo, Yuxiao Cheng, Qionghai Dai
  • for: High-speed scene capture at high resolution normally requires high bandwidth, resulting in bulky, heavy systems that are hard to deploy on low-capacity platforms.
  • methods: A coded exposure setup encodes a frame sequence into a single blurry snapshot; motion-direction cues embedded during imaging and a self-recursive neural network built on an implicit neural representation of videos then retrieve the latent sharp video from the blurry image.
  • results: The approach is more efficient and flexible than existing methods, enabling high-speed photography on low-capacity platforms.
    Abstract The compact cameras recording high-speed scenes with high resolution are highly demanded, but the required high bandwidth often leads to bulky, heavy systems, which limits their applications on low-capacity platforms. Adopting a coded exposure setup to encode a frame sequence into a blurry snapshot and retrieve the latent sharp video afterward can serve as a lightweight solution. However, restoring motion from blur is quite challenging due to the high ill-posedness of motion blur decomposition, intrinsic ambiguity in motion direction, and diverse motions in natural videos. In this work, by leveraging classical coded exposure imaging technique and emerging implicit neural representation for videos, we tactfully embed the motion direction cues into the blurry image during the imaging process and develop a novel self-recursive neural network to sequentially retrieve the latent video sequence from the blurry image utilizing the embedded motion direction cues. To validate the effectiveness and efficiency of the proposed framework, we conduct extensive experiments on benchmark datasets and real-captured blurry images. The results demonstrate that our proposed framework significantly outperforms existing methods in quality and flexibility. The code for our work is available at https://github.com/zhihongz/BDINR
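The coded-exposure forward model itself is compact; here is a sketch assuming a binary per-frame shutter code: the snapshot is the code-weighted sum of the latent frames, i.e., a coded motion blur, which the decoding network later inverts.

```python
import numpy as np

def coded_exposure_snapshot(frames, code):
    """Forward model of coded-exposure imaging (a sketch, assuming a
    binary shutter code). The shutter follows the code during the
    exposure, so the capture is the sum of the frames kept open;
    normalizing by the open time keeps brightness comparable.

    frames: (T, H, W) latent sharp frames.
    code:   (T,) values in {0, 1}."""
    code = np.asarray(code, dtype=frames.dtype)
    return np.tensordot(code, frames, axes=1) / max(code.sum(), 1)
```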

P2RBox: A Single Point is All You Need for Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2311.13128
  • repo_url: None
  • paper_authors: Guangming Cao, Xuehui Yu, Wenwen Yu, Xumeng Han, Xue Yang, Guorong Li, Jianbin Jiao, Zhenjun Han
  • for: Training oriented object detectors from cost-effective single-point annotations instead of rotated or horizontal bounding boxes.
  • methods: Point annotations and a mask generator produce mask proposals, which are filtered by an Inspector Module rooted in multi-instance learning and a Constrainer Module; the selected high-quality masks are converted into rotated box annotations via a Symmetry Axis Estimation (SAE) Module for training a fully supervised detector.
  • results: The method combines well with several fully supervised rotated object detectors and, paired with Oriented R-CNN, achieves 62.26% on the DOTA-v1.0 test set; to the authors' knowledge, this is the first attempt at training an oriented object detector with point supervision.
    Abstract Oriented object detection, a specialized subfield in computer vision, finds applications across diverse scenarios, excelling particularly when dealing with objects of arbitrary orientations. Conversely, point annotation, which treats objects as single points, offers a cost-effective alternative to rotated and horizontal bounding boxes but sacrifices performance due to the loss of size and orientation information. In this study, we introduce the P2RBox network, which leverages point annotations and a mask generator to create mask proposals, followed by filtration through our Inspector Module and Constrainer Module. This process selects high-quality masks, which are subsequently converted into rotated box annotations for training a fully supervised detector. Specifically, we've thoughtfully crafted an Inspector Module rooted in multi-instance learning principles to evaluate the semantic score of masks. We've also proposed a more robust mask quality assessment in conjunction with the Constrainer Module. Furthermore, we've introduced a Symmetry Axis Estimation (SAE) Module inspired by the spectral theorem for symmetric matrices to transform the top-performing mask proposal into rotated bounding boxes. P2RBox performs well with three fully supervised rotated object detectors: RetinaNet, Rotated FCOS, and Oriented R-CNN. By combining with Oriented R-CNN, P2RBox achieves 62.26% on DOTA-v1.0 test dataset. As far as we know, this is the first attempt at training an oriented object detector with point supervision.
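The SAE idea rests on the spectral theorem for symmetric matrices: the covariance of a mask's foreground coordinates is symmetric, so its eigenvectors form orthogonal principal axes. Below is a sketch of extracting a box orientation from a mask this way; the paper's module may refine this further.

```python
import numpy as np

def symmetry_axis(mask):
    """Estimate a mask's dominant (symmetry) axis via eigen-decomposition
    of the coordinate covariance matrix. The returned angle can seed the
    orientation of a rotated bounding box.

    mask: (H, W) binary array of the mask proposal."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)                    # center the coordinates
    cov = pts.T @ pts / len(pts)               # symmetric 2x2 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    axis = eigvecs[:, -1]                      # dominant direction
    angle = np.arctan2(axis[1], axis[0])       # orientation in radians
    return axis, angle
```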

DAE-Net: Deforming Auto-Encoder for fine-grained shape co-segmentation

  • paper_url: http://arxiv.org/abs/2311.13125
  • repo_url: https://github.com/czq142857/dae-net
  • paper_authors: Zhiqin Chen, Qimin Chen, Hang Zhou, Hao Zhang
  • for: This work develops an unsupervised 3D shape co-segmentation method that learns a set of deformable part templates from a shape collection.
  • methods: The network is a branched autoencoder: a CNN encoder takes a voxel shape as input and produces per-part transformation matrices, latent codes, and part existence scores, while the decoder outputs point occupancies to define the reconstruction loss. Each shape is composed from a selected, affine-transformed subset of template parts, and a per-part deformation network models diverse parts with substantial geometric variation under constrained deformation capacity.
  • results: The network achieves unsupervised 3D shape co-segmentation with fine-grained, compact, and meaningful parts that are consistent across diverse shapes; extensive experiments on ShapeNet Part, DFAUST, and an animal subset of Objaverse show superior performance over prior methods.
    Abstract We present an unsupervised 3D shape co-segmentation method which learns a set of deformable part templates from a shape collection. To accommodate structural variations in the collection, our network composes each shape by a selected subset of template parts which are affine-transformed. To maximize the expressive power of the part templates, we introduce a per-part deformation network to enable the modeling of diverse parts with substantial geometry variations, while imposing constraints on the deformation capacity to ensure fidelity to the originally represented parts. We also propose a training scheme to effectively overcome local minima. Architecturally, our network is a branched autoencoder, with a CNN encoder taking a voxel shape as input and producing per-part transformation matrices, latent codes, and part existence scores, and the decoder outputting point occupancies to define the reconstruction loss. Our network, coined DAE-Net for Deforming Auto-Encoder, can achieve unsupervised 3D shape co-segmentation that yields fine-grained, compact, and meaningful parts that are consistent across diverse shapes. We conduct extensive experiments on the ShapeNet Part dataset, DFAUST, and an animal subset of Objaverse to show superior performance over prior methods.

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

  • paper_url: http://arxiv.org/abs/2311.13120
  • repo_url: None
  • paper_authors: Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, Yuan Xie
  • for: The paper is written for recognizing scene text in the wild, which frequently encounters challenges such as domain variations, font diversity, and shape deformations.
  • methods: The paper proposes a novel training strategy called in-context training to generate context-rich scene text sequences, which are used to train a scene text recognizer.
  • results: The proposed method, called E$^2$STR, achieves effective in-context learning capabilities in scene text recognition and outperforms state-of-the-art approaches on public benchmarks, even with a regular-sized model.
    Abstract Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.

Automated Measurement of Pericoronary Adipose Tissue Attenuation and Volume in CT Angiography

  • paper_url: http://arxiv.org/abs/2311.13100
  • repo_url: None
  • paper_authors: Andrew M. Nguyen, Tejas Sudharshan Mathai, Liangchen Liu, Jianfei Liu, Ronald M. Summers
  • for: This study develops a fully automated method for measuring pericoronary adipose tissue (PCAT), enabling better assessment of coronary artery disease (CAD) risk.
  • methods: A 3D full-resolution nnUNet segments the two coronary arteries (RCA and LCA), after which PCAT mean attenuation and volume are measured automatically in the surrounding arterial regions.
  • results: The automated approach accurately measures PCAT attenuation and volume at both coronary arteries simultaneously, underscoring its promise as a biomarker for coronary inflammation and cardiac disease.
    Abstract Pericoronary adipose tissue (PCAT) is the deposition of fat in the vicinity of the coronary arteries. It is an indicator of coronary inflammation and associated with coronary artery disease. Non-invasive coronary CT angiography (CCTA) is presently used to obtain measures of the thickness, volume, and attenuation of fat deposition. However, prior works solely focus on measuring PCAT using semi-automated approaches at the right coronary artery (RCA) over the left coronary artery (LCA). In this pilot work, we developed a fully automated approach for the measurement of PCAT mean attenuation and volume in the region around both coronary arteries. First, we used a large subset of patients from the public ImageCAS dataset (n = 735) to train a 3D full resolution nnUNet to segment LCA and RCA. Then, we automatically measured PCAT in the surrounding arterial regions. We evaluated our method on a held-out test set of patients (n = 183) from the same dataset. A mean Dice score of 83% and PCAT attenuation of -73.81 $\pm$ 12.69 HU was calculated for the RCA, while a mean Dice score of 81% and PCAT attenuation of -77.51 $\pm$ 7.94 HU was computed for the LCA. To the best of our knowledge, we are the first to develop a fully automated method to measure PCAT attenuation and volume at both the RCA and LCA. Our work underscores how automated PCAT measurement holds promise as a biomarker for identification of inflammation and cardiac disease.
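Here is a hedged sketch of the measurement step that follows segmentation: dilate the artery mask to obtain the pericoronary region, keep voxels within an adipose HU window, and report mean attenuation and volume. The dilation radius and the (-190, -30) HU fat window are common choices in the literature, not necessarily the paper's exact values.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def measure_pcat(ct_hu, artery_mask, radius_vox=3, fat_range=(-190, -30)):
    """Measure PCAT around a segmented coronary artery (sketch).

    ct_hu:       (D, H, W) CT volume in Hounsfield units.
    artery_mask: (D, H, W) boolean nnUNet segmentation of the artery."""
    artery = artery_mask.astype(bool)
    # Pericoronary region: a shell around (but excluding) the artery.
    region = binary_dilation(artery, iterations=radius_vox) & ~artery
    # Keep only voxels in the adipose attenuation window.
    fat = region & (ct_hu >= fat_range[0]) & (ct_hu <= fat_range[1])
    mean_hu = float(ct_hu[fat].mean()) if fat.any() else float("nan")
    volume_vox = int(fat.sum())   # multiply by voxel volume for mm^3
    return mean_hu, volume_vox
```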

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise

  • paper_url: http://arxiv.org/abs/2311.13091
  • repo_url: https://github.com/liuyixin-louis/stable-unlearnable-example
  • paper_authors: Yixin Liu, Kaidi Xu, Xun Chen, Lichao Sun
  • for: This work aims to strengthen privacy-preserving deep learning, preventing unauthorized third parties from exploiting publicly released image datasets to train models for commercial or illegal use.
  • methods: The study builds on the poisoning technique of "unlearnable examples", which adds imperceptible noise to data so that a model's generalization degrades. By analyzing adversarial training of both the defensive noise and the surrogate model, the robustness contributions of each are disentangled, and stable error-minimizing noise (SEM) is introduced, trained against random rather than time-consuming adversarial perturbations for greater efficiency and stability.
  • results: For protective performance, the stable noise (SEM) achieves higher efficiency and stability: extensive experiments show that SEM sets a new state of the art on CIFAR-10, CIFAR-100, and the ImageNet subset, balancing effectiveness and efficiency.
    Abstract The open source of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these open-source image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of public data, a poisoning-based technique, the unlearnable example, is proposed to significantly degrade the generalization performance of models by adding a kind of imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model. However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the effect of enhancement in the surrogate model or the defensive noise. Observing that simply removing the adversarial noise on the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we found a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost the robust unlearnable example, we introduce stable error-minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through extensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset in terms of both effectiveness and efficiency. The code is available at https://github.com/liuyixin-louis/Stable-Unlearnable-Example.
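A minimal sketch of the core SEM idea follows: the defensive noise is updated to *minimize* the surrogate model's error under a cheap random perturbation, in place of the adversarial perturbation used by prior work. The uniform perturbation and signed-gradient step are illustrative assumptions.

```python
import torch

def sem_noise_step(model, x, y, delta, eps, lr, criterion):
    """One stable error-minimizing noise update: evaluate the surrogate
    model's loss on x + (delta + random perturbation), take a signed
    gradient step that *decreases* the loss, then project delta back
    into the L_inf ball of radius eps."""
    delta = delta.detach().requires_grad_(True)
    perturbed = delta + eps * torch.empty_like(delta).uniform_(-1, 1)
    criterion(model(x + perturbed), y).backward()
    with torch.no_grad():
        delta = (delta - lr * delta.grad.sign()).clamp(-eps, eps)
    return delta.detach()
```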

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

  • paper_url: http://arxiv.org/abs/2311.13073
  • repo_url: https://github.com/ai-forever/kandinskyvideo
  • paper_authors: Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
  • for: This paper proposes a two-stage latent-diffusion text-to-video generation architecture built on a text-to-image diffusion model, aimed at generating high-quality video.
  • methods: The architecture has two stages: keyframe generation, which sets out the video's storyline, and interpolation-frame generation, which makes the motion of the scene and its objects smooth. The paper also compares several temporal conditioning approaches and evaluates different MoVQ-based video decoding configurations.
  • results: The experiments show that separate temporal blocks outperform temporal layers on video-quality metrics and human preference. The interpolation model's design also reduces computational cost relative to other masked-frame-interpolation methods. Finally, compared with existing solutions, the architecture achieves top-2 scores overall and top-1 among open-source solutions.
    Abstract Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/

FuseNet: Self-Supervised Dual-Path Network for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13069
  • repo_url: https://github.com/xmindflow/fusenet
  • paper_authors: Amirhossein Kazerouni, Sanaz Karimijafarbigloo, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • For: Semantic segmentation has historically required large annotated datasets; this method removes that requirement.
  • Methods: A dual-stream framework replaces manual annotation with automatically generated augmented images, and a cross-modal fusion technique extends CLIP's principles by substituting augmented images for textual data, strengthening the learned representations.
  • Results: Experiments show that the method achieves accurate semantic segmentation while improving the model's clustering ability and data representation.
    Abstract Semantic segmentation, a crucial task in computer vision, often relies on labor-intensive and costly annotated datasets for training. In response to this challenge, we introduce FuseNet, a dual-stream framework for self-supervised semantic segmentation that eliminates the need for manual annotation. FuseNet leverages the shared semantic dependencies between the original and augmented images to create a clustering space, effectively assigning pixels to semantically related clusters, and ultimately generating the segmentation map. Additionally, FuseNet incorporates a cross-modal fusion technique that extends the principles of CLIP by replacing textual data with augmented images. This approach enables the model to learn complex visual representations, enhancing robustness against variations similar to CLIP's text invariance. To further improve edge alignment and spatial consistency between neighboring pixels, we introduce an edge refinement loss. This loss function considers edge information to enhance spatial coherence, facilitating the grouping of nearby pixels with similar visual features. Extensive experiments on skin lesion and lung segmentation datasets demonstrate the effectiveness of our method. \href{https://github.com/xmindflow/FuseNet}{Codebase.}
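The edge refinement loss can be illustrated with a simple edge-aware smoothness term: neighbouring pixels are pushed toward the same cluster assignment except across strong image edges. This captures the stated goal of edge-guided spatial coherence; the exact form used in FuseNet may differ.

```python
import torch

def edge_refinement_loss(assign, image):
    """Edge-aware smoothness: penalise differences in soft cluster
    assignments between neighbouring pixels, down-weighted where the
    image itself has a strong intensity edge.
    assign: (B, C, H, W) soft assignments; image: (B, 3, H, W)."""
    def dx(t): return t[..., :, 1:] - t[..., :, :-1]
    def dy(t): return t[..., 1:, :] - t[..., :-1, :]
    wx = torch.exp(-dx(image).abs().mean(1, keepdim=True))  # ~0 at strong edges
    wy = torch.exp(-dy(image).abs().mean(1, keepdim=True))
    return (wx * dx(assign).abs()).mean() + (wy * dy(assign).abs()).mean()
```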

cs.AI - 2023-11-22

A Survey of Blockchain, Artificial Intelligence, and Edge Computing for Web 3.0

  • paper_url: http://arxiv.org/abs/2311.13731
  • repo_url: None
  • paper_authors: Jianjun Zhu, Fan Li, Jinyuan Chen
  • For: The paper explores the intersection of blockchain, artificial intelligence, and edge computing in the context of Web 3.0, with a focus on their potential to improve data privacy and security.
  • Methods: The paper provides an in-depth analysis of each of these technologies, including their relevance to Web 3.0, key components, and practical applications. The authors also propose decentralized storage and computing solutions by exploring the integration of these technologies.
  • Results: The paper highlights the potential of these technologies to return control and ownership of data and digital assets back to users, and outlines key challenges and research directions for realizing this vision.
    Abstract Web 3.0, as the third generation of the World Wide Web, aims to solve contemporary problems of trust, centralization, and data ownership. Driven by the latest advances in cutting-edge technologies, Web 3.0 is moving towards a more open, decentralized, intelligent, and interconnected network. However, increasingly widespread data breaches have raised awareness of online privacy and security of personal data. Additionally, since Web 3.0 is a sophisticated and complex convergence, the technical details behind it are not as clear as the characteristics it presents. In this survey, we conduct an in-depth exploration of Web 3.0 from the perspectives of blockchain, artificial intelligence, and edge computing. Specifically, we begin with summarizing the evolution of the Internet and providing an overview of these three key technological factors. Afterward, we provide a thorough analysis of each technology separately, including its relevance to Web 3.0, key technology components, and practical applications. We also propose decentralized storage and computing solutions by exploring the integration of technologies. Finally, we highlight the key challenges alongside potential research directions. Through the combination and mutual complementation of multiple technologies, Web 3.0 is expected to return more control and ownership of data and digital assets back to users.

Studying Artist Sentiments around AI-generated Artwork

  • paper_url: http://arxiv.org/abs/2311.13725
  • repo_url: None
  • paper_authors: Safinah Ali, Cynthia Breazeal
  • for: This paper studies artists' sentiments toward AI-generated artwork, to inform the responsible development and use of creativity-support tools.
  • methods: The study combines interviews with an analysis of artists' public posts on the social media platforms Reddit, Twitter, and Artstation, identifying their main concerns and hopes around AI-generated art.
  • results: Artists' main concerns center on their artworks and styles being plagiarized for training datasets, while their hopes center on novel creative uses. These findings can guide collaboration between artists and developers toward responsible development and use of creativity-support tools.
    Abstract Art created using generated Artificial Intelligence has taken the world by storm and generated excitement for many digital creators and technologists. However, the reception and reaction from artists have been mixed. Concerns about plagiarizing their artworks and styles for datasets and uncertainty around the future of digital art sparked movements in artist communities shunning the use of AI for generating art and protecting artists' rights. Collaborating with these tools for novel creative use cases also sparked hope from some creators. Artists are an integral stakeholder in the rapidly evolving digital creativity industry and understanding their concerns and hopes inform responsible development and use of creativity support tools. In this work, we study artists' sentiments about AI-generated art. We interviewed 7 artists and analyzed public posts from artists on social media platforms Reddit, Twitter and Artstation. We report artists' main concerns and hopes around AI-generated artwork, informing a way forward for inclusive development of these tools.

Nova$^+$: Generative Language Models for Binaries

  • paper_url: http://arxiv.org/abs/2311.13721
  • repo_url: None
  • paper_authors: Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang
  • for: This work extends generative large language models (LLMs) to the binary domain, bringing their strengths in code generation, program repair, and document analysis to binaries.
  • methods: Two LLMs pre-trained on binary corpora, Nova and Nova$^+$, are developed. Nova is pre-trained with the standard language-modeling task and outperforms GPT-3.5 and other existing techniques across five benchmarks on three downstream tasks: binary code similarity detection, binary code translation, and binary code recovery. Nova$^+$ adds two new pre-training tasks, optimization generation and optimization-level prediction, designed to learn binary optimization and align equivalent binaries.
  • results: Nova and Nova$^+$ both perform strongly on the five benchmarks, with Nova$^+$ best overall on all three downstream tasks, demonstrating the contribution of the new pre-training tasks.
    Abstract Generative large language models (LLMs) pre-trained on code have shown impressive effectiveness in code generation, program repair, and document analysis. However, existing generative LLMs focus on source code and are not specialized for binaries. There are three main challenges for LLMs to model and learn binary code: hex-decimal values, complex global dependencies, and compiler optimization levels. To bring the benefit of LLMs to the binary domain, we develop Nova and Nova$^+$, which are LLMs pre-trained on binary corpora. Nova is pre-trained with the standard language modeling task, showing significantly better capability on five benchmarks for three downstream tasks: binary code similarity detection (BCSD), binary code translation (BCT), and binary code recovery (BCR), over GPT-3.5 and other existing techniques. We build Nova$^+$ to further boost Nova using two new pre-training tasks, i.e., optimization generation and optimization level prediction, which are designed to learn binary optimization and align equivalent binaries. Nova$^+$ shows overall the best performance for all three downstream tasks on five benchmarks, demonstrating the contributions of the new pre-training tasks.

Towards More Likely Models for AI Planning

  • paper_url: http://arxiv.org/abs/2311.13720
  • repo_url: None
  • paper_authors: Turgay Caglar, Sirine Belhaj, Tathagata Chakraborti, Michael Katz, Sarath Sreedharan
  • for: This study explores the application of large language models (LLMs) to automated planning tasks, particularly model-space edits.
  • methods: Two different flavors of model-space problems from the AI planning literature are studied, and the effect of an LLM on these tasks is examined. The LLM is compared against combinatorial search (CS), both as a standalone model-space reasoner and as a statistical signal combined with CS in a two-stage process.
  • results: The experiments show the LLM performing favorably against CS on model-space edits, achieving higher success rates, which points to the potential of LLMs in automated planning tasks.
    Abstract This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this sangam, we explore two different flavors of model space problems that have been studied in the AI planning literature and explore the effect of an LLM on those tasks. We empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) - an approach that has been traditionally used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical signal in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future.

Deep learning-based instance segmentation for the precise automated quantification of digital breast cancer immunohistochemistry images

  • paper_url: http://arxiv.org/abs/2311.13719
  • repo_url: None
  • paper_authors: Blanca Maria Priego-Torresa, Barbara Lobato-Delgado, Lidia Atienza-Cuevas, Daniel Sanchez-Morillo
  • for: automatic quantification of biomarkers on immunohistochemistry breast cancer images for appropriate therapy and disease prognosis
  • methods: deep learning-based instance segmentation architecture for automatic segmentation of nuclear and membrane biomarkers
  • results: promising method to segment nuclei instances in IHC-stained images, integrated into a web platform for decision-support by pathologists
    Abstract The quantification of biomarkers on immunohistochemistry breast cancer images is essential for defining appropriate therapy for breast cancer patients, as well as for extracting relevant information on disease prognosis. This is an arduous and time-consuming task that may introduce a bias in the results due to intra- and inter-observer variability which could be alleviated by making use of automatic quantification tools. However, this is not a simple processing task given the heterogeneity of breast tumors that results in non-uniformly distributed tumor cells exhibiting different staining colors and intensity, size, shape, and texture, of the nucleus, cytoplasm and membrane. In this research work, we demonstrate the feasibility of using a deep learning-based instance segmentation architecture for the automatic quantification of both nuclear and membrane biomarkers applied to IHC-stained slides. We have solved the cumbersome task of training set generation with the design and implementation of a web platform, which has served as a hub for communication and feedback between researchers and pathologists as well as a system for the validation of the automatic image processing models. Through this tool, we have collected annotations over samples of HE, ER and Ki-67 (nuclear biomarkers) and HER2 (membrane biomarker) IHC-stained images. Using the same deep learning network architecture, we have trained two models, so-called nuclei- and membrane-aware segmentation models, which, once successfully validated, have revealed to be a promising method to segment nuclei instances in IHC-stained images. The quantification method proposed in this work has been integrated into the developed web platform and is currently being used as a decision-support tool by pathologists.

A Unified Approach to Count-Based Weakly-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.13718
  • repo_url: None
  • paper_authors: Vinay Shukla, Zhe Zeng, Kareem Ahmed, Guy Van den Broeck
  • for: Learning high-quality models from weakly-labeled data, where high-quality labels are scarce.
  • methods: A unified count-based weakly-supervised learning approach that computes, exactly and differentiably, the probability that exactly k of n outputs are true, and derives from it a count loss penalizing deviations of the model's distribution from an arithmetic constraint over label counts.
  • results: Evaluated on three common weakly-supervised learning paradigms, the proposed approach achieves state-of-the-art or highly competitive results across all three.
    Abstract High-quality labels are often very scarce, whereas unlabeled data with inferred weak labels occurs more naturally. In many cases, these weak labels dictate the frequency of each respective class over a set of instances. In this paper, we develop a unified approach to learning from such weakly-labeled data, which we call count-based weakly-supervised learning. At the heart of our approach is the ability to compute the probability of exactly k out of n outputs being set to true. This computation is differentiable, exact, and efficient. Building upon the previous computation, we derive a count loss penalizing the model for deviations in its distribution from an arithmetic constraint defined over label counts. We evaluate our approach on three common weakly-supervised learning paradigms and observe that our proposed approach achieves state-of-the-art or highly competitive results across all three of the paradigms.
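The central computation, the probability that exactly k of n outputs are true, admits a short differentiable dynamic program. The sketch below also shows a count loss in the paper's spirit; the NLL form and the bag-of-10 example are illustrative assumptions, not necessarily the paper's exact objective.

```python
import torch

def count_distribution(probs: torch.Tensor) -> torch.Tensor:
    """Return dp where dp[k] = P(exactly k of the n independent
    Bernoulli outputs are true). O(n^2) dynamic programming built from
    differentiable ops, so a loss on the result backpropagates to probs."""
    dp = torch.zeros(probs.numel() + 1, dtype=probs.dtype)
    dp[0] = 1.0
    zero = torch.zeros(1, dtype=probs.dtype)
    for p in probs:
        # output false: count unchanged; output true: count shifts up by one
        dp = dp * (1 - p) + torch.cat([zero, dp[:-1]]) * p
    return dp

# A count loss: NLL of an observed bag count of 4 positives out of 10.
logits = torch.randn(10, requires_grad=True)
loss = -torch.log(count_distribution(torch.sigmoid(logits))[4])
loss.backward()   # gradients reach `logits` through the exact count DP
```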

Data Acquisition: A New Frontier in Data-centric AI

  • paper_url: http://arxiv.org/abs/2311.13712
  • repo_url: None
  • paper_authors: Lingjiao Chen, Bilge Acun, Newsha Ardalani, Yifan Sun, Feiyang Kang, Hanrui Lyu, Yongchan Kwon, Ruoxi Jia, Carole-Jean Wu, Matei Zaharia, James Zou
  • for: This work examines the challenges encountered in data acquisition and introduces the Data Acquisition (DAM) challenge to spur participation in data acquisition research.
  • methods: A survey of current data marketplaces, which reveals a lack of platforms offering detailed dataset information, transparent pricing, and standardized data formats.
  • results: The study identifies the challenges of the data acquisition process, and the DAM challenge, released as part of DataPerf, has drawn engagement with the data acquisition field.
    Abstract As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes imperative. There is limited study on the challenges of data acquisition due to ad-hoc processes and lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing lack of platforms offering detailed information about datasets, transparent pricing, standardized data formats. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. The benchmark was released as a part of DataPerf. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.

Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores

  • paper_url: http://arxiv.org/abs/2311.13693
  • repo_url: None
  • paper_authors: Zeliang Zhang, Zhuo Liu, Susan Liang, Zhiyuan Wang, Yifan Zhu, Chen Ding, Chenliang Xu
  • for: Supporting exascale data analysis, particularly in gene analysis, deep learning, and quantum computation.
  • methods: A compression-based tensor decomposition framework, exascale-tensor, is proposed to support exascale tensor decomposition, together with a bag of strategies for improving computational efficiency.
  • results: Compared to the baselines, exascale-tensor supports 8,000x larger tensors and a speedup of up to 6.95x. Applied to two real-world problems, gene analysis and tensor-layer neural networks, the numerical results demonstrate the method's scalability and effectiveness.
    Abstract CP decomposition is a powerful tool for data science, especially gene analysis, deep learning, and quantum computation. However, the application of tensor decomposition is largely hindered by the exponential increment of the computational complexity and storage consumption with the size of tensors. While the data in our real world is usually presented as trillion- or even exascale-scale tensors, existing work can only support billion-scale scale tensors. In our work, we propose the Exascale-Tensor to mitigate the significant gap. Specifically, we propose a compression-based tensor decomposition framework, namely the exascale-tensor, to support exascale tensor decomposition. Then, we carefully analyze the inherent parallelism and propose a bag of strategies to improve computational efficiency. Last, we conduct experiments to decompose tensors ranging from million-scale to trillion-scale for evaluation. Compared to the baselines, the exascale-tensor supports 8,000x larger tensors and a speedup up to 6.95x. We also apply our method to two real-world applications, including gene analysis and tensor layer neural networks, of which the numeric results demonstrate the scalability and effectiveness of our method.
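For orientation, below is the classic CP-ALS scheme that exascale-tensor scales up; the paper's contribution is the compression framework and GPU-tensor-core strategies layered on top, not the base algorithm sketched here in plain NumPy.

```python
import numpy as np

def cp_als(T, rank, n_iter=50):
    """Reference CP decomposition of a 3-way tensor by alternating least
    squares. Returns factor matrices A[0], A[1], A[2] with
    T[i,j,k] ~ sum_r A[0][i,r] * A[1][j,r] * A[2][k,r]."""
    dims = T.shape
    rng = np.random.default_rng(0)
    A = [rng.standard_normal((d, rank)) for d in dims]

    def khatri_rao(X, Y):   # row (i * Y.shape[0] + j) = X[i] * Y[j]
        return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

    for _ in range(n_iter):
        for m in range(3):
            a, b = [A[i] for i in range(3) if i != m]   # ascending mode order
            unfolding = np.moveaxis(T, m, 0).reshape(dims[m], -1)
            gram = (a.T @ a) * (b.T @ b)                # Hadamard of Grams
            A[m] = unfolding @ khatri_rao(a, b) @ np.linalg.pinv(gram)
    return A
```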

Next-Generation Earth System Models: Towards Reliable Hybrid Models for Weather and Climate Applications

  • paper_url: http://arxiv.org/abs/2311.13691
  • repo_url: None
  • paper_authors: Tom Beucler, Erwan Koch, Sven Kotlarski, David Leutwyler, Adrien Michel, Jonathan Koh
  • for: This paper reviews how machine learning is transforming our ability to model the Earth system, and how recent breakthroughs are expected to benefit end-users in Switzerland in the near future.
  • methods: A review of machine-learning approaches, including deep learning, for hybrid weather and climate modelling of the Earth system.
  • results: The review argues that machine learning can improve the accuracy and reliability of Earth system models, and can deliver better weather forecasts and environmental monitoring for users in Switzerland.
    Abstract We review how machine learning has transformed our ability to model the Earth system, and how we expect recent breakthroughs to benefit end-users in Switzerland in the near future.

MAIRA-1: A specialised large multimodal model for radiology report generation

  • paper_url: http://arxiv.org/abs/2311.13668
  • repo_url: None
  • paper_authors: Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, Javier Alvarez-Valle
  • for: Generating radiology reports from chest X-ray (CXR) images.
  • methods: The model pairs a CXR-specific image encoder with a fine-tuned large language model based on Vicuna-7B, plus text-based data augmentation, to produce high-quality reports.
  • results: The model generates reports of state-of-the-art quality, significantly improving on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review shows promising fluency and accuracy while uncovering failure modes not captured by existing evaluation practices. More information and resources are available on the project website: https://aka.ms/maira.
    Abstract We present a radiology-specific multimodal model for the task for generating radiological reports from chest X-rays (CXRs). Our work builds on the idea that large language model(s) can be equipped with multimodal capabilities through alignment with pre-trained vision encoders. On natural images, this has been shown to allow multimodal models to gain image understanding and description capabilities. Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality. In particular, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review of model outputs demonstrates promising fluency and accuracy of generated reports while uncovering failure modes not captured by existing evaluation practices. More information and resources can be found on the project website: https://aka.ms/maira.

Sample as You Infer: Predictive Coding With Langevin Dynamics

  • paper_url: http://arxiv.org/abs/2311.13664
  • repo_url: None
  • paper_authors: Umais Zahid, Qinghai Guo, Zafeirios Fountas
  • for: This paper proposes a new parameter-learning algorithm for generic deep generative models, building on the predictive coding (PC) framework from computational neuroscience.
  • methods: The standard PC algorithm is modified to match or exceed standard variational autoencoder (VAE) training. Injecting Gaussian noise into the PC inference procedure re-envisions it as overdamped Langevin sampling, which facilitates optimisation with respect to a tight evidence lower bound (ELBO).
  • results: An encoder network provides an amortised warm start for the Langevin sampling, with three different objectives tested for doing so. A lightweight, easily computable preconditioner, inspired by Riemann-manifold Langevin and adaptive SGD optimizers, increases robustness to the sampling step size and reduces sensitivity to curvature. Against like-for-like VAE training, the method matches or outperforms on several metrics, including sample quality, while converging in a fraction of the SGD iterations.
    Abstract We present a novel algorithm for parameter learning in generic deep generative models that builds upon the predictive coding (PC) framework of computational neuroscience. Our approach modifies the standard PC algorithm to bring performance on-par and exceeding that obtained from standard variational auto-encoder (VAE) training. By injecting Gaussian noise into the PC inference procedure we re-envision it as an overdamped Langevin sampling, which facilitates optimisation with respect to a tight evidence lower bound (ELBO). We improve the resultant encoder-free training method by incorporating an encoder network to provide an amortised warm-start to our Langevin sampling and test three different objectives for doing so. Finally, to increase robustness to the sampling step size and reduce sensitivity to curvature, we validate a lightweight and easily computable form of preconditioning, inspired by Riemann Manifold Langevin and adaptive optimizers from the SGD literature. We compare against VAEs by training like-for-like generative models using our technique against those trained with standard reparameterisation-trick-based ELBOs. We observe our method out-performs or matches performance across a number of metrics, including sample quality, while converging in a fraction of the number of SGD training iterations.
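The key move, re-reading noisy predictive-coding inference as overdamped Langevin sampling, fits in a few lines. In this sketch `log_joint` stands in for the model's log p(x, z); the step size and iteration count are placeholders.

```python
import torch

def langevin_posterior_sample(z0, x, log_joint, step=1e-3, n_steps=100):
    """Unadjusted overdamped Langevin dynamics over the latents: the
    usual predictive-coding gradient ascent on log p(x, z), plus
    injected Gaussian noise scaled by sqrt(step). Without the noise
    term this reduces to plain PC inference (a MAP estimate); with it,
    z is approximately a posterior sample."""
    z = z0.detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(log_joint(x, z).sum(), z)[0]
        with torch.no_grad():
            z = z + 0.5 * step * grad + step ** 0.5 * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()
```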

Visual In-Context Prompting

  • paper_url: http://arxiv.org/abs/2311.13601
  • repo_url: https://github.com/ux-decoder/dinov
  • paper_authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
  • for: Improving zero-shot capability on vision tasks, including open-set segmentation and detection.
  • methods: Built on an encoder-decoder architecture, a versatile prompt encoder supports a variety of prompt types such as strokes, boxes, and points, and is further extended to take an arbitrary number of reference image segments as context.
  • results: With joint training on the COCO and SA-1B datasets, the model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K.
    Abstract In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.

Labeling Neural Representations with Inverse Recognition

  • paper_url: http://arxiv.org/abs/2311.13594
  • repo_url: https://github.com/lapalap/invert
  • paper_authors: Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, Marina M. -C. Höhne
  • for: This paper investigates the complex hierarchical data representations learned by deep neural networks (DNNs), and how those representations can be connected to human-understandable concepts.
  • methods: A scalable method called Inverse Recognition (INVERT) links learned DNN representations to human-understandable concepts by leveraging their capacity to discriminate between those concepts. INVERT does not rely on segmentation masks, has comparatively low computational complexity, and can handle diverse types of neurons.
  • results: INVERT is applied in several scenarios, including detecting representations affected by spurious correlations and interpreting the hierarchical structure of decision-making within models. It provides an interpretable metric assessing the alignment between a representation and its corresponding explanation, together with a statistical significance test.
    Abstract Deep Neural Networks (DNNs) demonstrated remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance, emphasizing its utility and credibility. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.
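One way to realise "discrimination between concepts" with an interpretable metric and a significance test is AUROC plus a Mann-Whitney U test, sketched below; the exact statistic INVERT uses may differ from this.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def concept_alignment(activations, has_concept):
    """How well one neuron's activations discriminate inputs containing
    a concept from those that do not. AUROC is the interpretable
    alignment score (0.5 = chance) and the Mann-Whitney U test supplies
    the significance."""
    pos = activations[has_concept == 1]
    neg = activations[has_concept == 0]
    u, p_value = mannwhitneyu(pos, neg, alternative="greater")
    return u / (len(pos) * len(neg)), p_value   # normalised U == AUROC
```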

Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.13628
  • repo_url: https://github.com/thomaspzollo/prompt_risk
  • paper_authors: Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, Richard Zemel
  • for: Guarding against unexpectedly poor responses from deployed large language models, improving the quality and reliability of their outputs.
  • methods: Prompts are selected on the basis of rigorous upper bounds on families of informative risk measures, covering multiple dimensions including worst-case responses and disparities in generation quality across the population of users; the underlying bounding techniques are extended to accommodate distribution shift at deployment.
  • results: Experiments on applications such as open-ended chat, medical question summarization, and code generation show that the framework improves reliability and quality, reducing the risk of the worst outcomes and of disparities across user groups.
    Abstract The recent explosion in the capabilities of large language models has led to a wave of interest in how best to prompt a model to perform a given task. While it may be tempting to simply choose a prompt based on average performance on a validation set, this can lead to a deployment where unexpectedly poor responses are generated, especially for the worst-off users. To mitigate this prospect, we propose Prompt Risk Control, a lightweight framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures. We offer methods for producing bounds on a diverse set of metrics, including quantities that measure worst-case responses and disparities in generation quality across the population of users. In addition, we extend the underlying statistical bounding techniques to accommodate the possibility of distribution shifts in deployment. Experiments on applications such as open-ended chat, medical question summarization, and code generation highlight how such a framework can foster responsible deployment by reducing the risk of the worst outcomes.
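As a concrete instance of "a rigorous upper bound on a risk measure", the sketch below certifies each candidate prompt's mean loss with a one-sided Hoeffding bound and picks the prompt with the smallest certificate. Hoeffding on a bounded loss is a stand-in for the paper's richer family of bounds and risk measures.

```python
import numpy as np

def hoeffding_upper_bound(losses, delta=0.05):
    """One-sided Hoeffding bound: with probability >= 1 - delta the true
    mean loss lies below this value, assuming losses are in [0, 1]."""
    n = len(losses)
    return float(np.mean(losses)) + np.sqrt(np.log(1.0 / delta) / (2 * n))

def select_prompt(loss_table, delta=0.05):
    """Pick the prompt whose certified worst-case mean loss is smallest.
    `loss_table` maps prompt -> array of per-example validation losses."""
    bounds = {p: hoeffding_upper_bound(l, delta) for p, l in loss_table.items()}
    return min(bounds, key=bounds.get), bounds
```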

$σ$-PCA: a unified neural model for linear and nonlinear principal component analysis

  • paper_url: http://arxiv.org/abs/2311.13580
  • repo_url: None
  • paper_authors: Fahdi Kanavati, Lucy Katsnith, Masayuki Tsuneki
  • for: Learning linear transformations from data.
  • methods: Single-layer autoencoder formulations covering linear PCA, nonlinear PCA, and linear ICA, unified into a single model.
  • results: The result is $\sigma$-PCA, a single-layer autoencoder that unifies linear and nonlinear PCA. Like linear PCA, it can learn a semi-orthogonal transformation that reduces dimensionality and orders components by variance, but unlike linear PCA it does not suffer from rotational indeterminacy.
    Abstract Linear principal component analysis (PCA), nonlinear PCA, and linear independent component analysis (ICA) -- those are three methods with single-layer autoencoder formulations for learning linear transformations from data. Linear PCA learns orthogonal transformations (rotations) that orient axes to maximise variance, but it suffers from a subspace rotational indeterminacy: it fails to find a unique rotation for axes that share the same variance. Both nonlinear PCA and linear ICA reduce the subspace indeterminacy from rotational to permutational by maximising statistical independence under the assumption of unit variance. The relationship between all three can be understood by the singular value decomposition of the linear ICA transformation into a sequence of rotation, scale, rotation. Linear PCA learns the first rotation; nonlinear PCA learns the second. The scale is simply the inverse of the standard deviations. The problem is that, in contrast to linear PCA, conventional nonlinear PCA cannot be used directly on the data to learn the first rotation, the first being special as it reduces dimensionality and orders by variances. In this paper, we have identified the cause, and as a solution we propose $\sigma$-PCA: a unified neural model for linear and nonlinear PCA as single-layer autoencoders. One of its key ingredients: modelling not just the rotation but also the scale -- the variances. This model bridges the disparity between linear and nonlinear PCA. And so, like linear PCA, it can learn a semi-orthogonal transformation that reduces dimensionality and orders by variances, but, unlike linear PCA, it does not suffer from rotational indeterminacy.
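The rotation-scale-rotation reading of the linear ICA transformation is just the singular value decomposition, which a few lines of NumPy make explicit:

```python
import numpy as np

# Any linear ICA unmixing matrix factors as rotation * scale * rotation
# via the SVD. Reading right to left: Vt is the first rotation (what
# linear PCA learns, ordering axes by variance), the scale s is the
# inverse of the component standard deviations, and U is the second
# rotation (what nonlinear PCA learns).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))           # a stand-in ICA transformation
U, s, Vt = np.linalg.svd(W)
assert np.allclose(W, U @ np.diag(s) @ Vt)
```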

Physical Reasoning and Object Planning for Household Embodied Agents

  • paper_url: http://arxiv.org/abs/2311.13577
  • repo_url: https://github.com/com-phy-affordance/coat
  • paper_authors: Ayush Agrawal, Raghav Prabhakar, Anirudh Goyal, Dianbo Liu
  • for: This study examines how household embodied agents can plan tasks accurately and flexibly, with particular emphasis on selecting substitute objects.
  • methods: The CommonSense Object Affordance Task (COAT) framework is introduced to analyze language models' commonsense reasoning in everyday scenarios, centered on how agents effectively identify and utilize alternative objects when executing household tasks.
  • results: State-of-the-art language models are evaluated on three carefully crafted commonsense question-and-answer datasets (15k and 130k questions). The contributions include abstract variables capturing an object's physical states and mappings from objects to their task utility, laying groundwork for real-world object-substitution decisions and for future household-agent intelligence.
    Abstract In this study, we explore the sophisticated domain of task planning for robust household embodied agents, with a particular emphasis on the intricate task of selecting substitute objects. We introduce the CommonSense Object Affordance Task (COAT), a novel framework designed to analyze reasoning capabilities in commonsense scenarios. This approach is centered on understanding how these agents can effectively identify and utilize alternative objects when executing household tasks, thereby offering insights into the complexities of practical decision-making in real-world environments.Drawing inspiration from human decision-making, we explore how large language models tackle this challenge through three meticulously crafted commonsense question-and-answer datasets, featuring refined rules and human annotations. Our evaluation of state-of-the-art language models on these datasets sheds light on three pivotal considerations: 1) aligning an object's inherent utility with the task at hand, 2) navigating contextual dependencies (societal norms, safety, appropriateness, and efficiency), and 3) accounting for the current physical state of the object. To maintain accessibility, we introduce five abstract variables reflecting an object's physical condition, modulated by human insights to simulate diverse household scenarios. Our contributions include insightful Object-Utility mappings addressing the first consideration and two extensive QA datasets (15k and 130k questions) probing the intricacies of contextual dependencies and object states. The datasets, along with our findings, are accessible at: \url{https://github.com/com-phy-affordance/COAT}. This research not only advances our understanding of physical commonsense reasoning in language models but also paves the way for future improvements in household agent intelligence.

Drilling Down into the Discourse Structure with LLMs for Long Document Question Answering

  • paper_url: http://arxiv.org/abs/2311.13565
  • repo_url: None
  • paper_authors: Inderjeet Nair, Shwetha Somasundaram, Apoorv Saxena, Koustava Goswami
  • for: Evidence retrieval for long-document question answering, i.e., locating the paragraphs in a document that answer a question.
  • methods: Large language models (LLMs) are applied to zero-shot long-document evidence retrieval, with techniques that exploit the document's discourse structure to build a condensed representation, enabling a more comprehensive understanding of the relationships between different parts of the document.
  • results: The approach retains 99.6% of the best zero-shot method's performance while processing only 26% of the tokens, and, combined with a self-ask reasoning agent, achieves the best zero-shot performance on complex multi-hop question answering, roughly 4% short of the zero-shot performance obtained with gold evidence.
    Abstract We address the task of evidence retrieval for long document question answering, which involves locating relevant paragraphs within a document to answer a question. We aim to assess the applicability of large language models (LLMs) in the task of zero-shot long document evidence retrieval, owing to their unprecedented performance across various NLP tasks. However, currently the LLMs can consume limited context lengths as input, thus providing document chunks as inputs might overlook the global context while missing out on capturing the inter-segment dependencies. Moreover, directly feeding the large input sets can incur significant computational costs, particularly when processing the entire document (and potentially incurring monetary expenses with enterprise APIs like OpenAI's GPT variants). To address these challenges, we propose a suite of techniques that exploit the discourse structure commonly found in documents. By utilizing this structure, we create a condensed representation of the document, enabling a more comprehensive understanding and analysis of relationships between different parts. We retain $99.6\%$ of the best zero-shot approach's performance, while processing only $26\%$ of the total tokens used by the best approach in the information seeking evidence retrieval setup. We also show how our approach can be combined with \textit{self-ask} reasoning agent to achieve best zero-shot performance in complex multi-hop question answering, just $\approx 4\%$ short of zero-shot performance using gold evidence.
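A hedged sketch of discourse-guided retrieval: show the LLM a condensed outline, let it pick a section, and recurse. The prompt wording, the condensation rule (title plus a 120-character snippet), and the integer-reply convention below are all assumptions, not the paper's recipe.

```python
def drill_down(llm, sections, question, budget=3):
    """Descend the document's discourse tree: at each level, present a
    condensed outline and ask the LLM which section is most likely to
    contain the evidence. `sections` is a list of dicts with keys
    'title', 'text', and optional 'children'; `llm` is a prompt ->
    completion callable expected to reply with an index."""
    for _ in range(budget):
        outline = "\n".join(f"[{i}] {s['title']}: {s['text'][:120]}"
                            for i, s in enumerate(sections))
        choice = int(llm(f"Question: {question}\n"
                         f"Pick one section index:\n{outline}"))
        if not sections[choice].get("children"):
            return sections[choice]            # leaf paragraph = evidence
        sections = sections[choice]["children"]
    return sections[0]
```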

Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object

  • paper_url: http://arxiv.org/abs/2311.13562
  • repo_url: https://github.com/yisuanwang/soulstyler
  • paper_authors: Junhao Chen, Peng Rong, Jingbo Sun, Chao Li, Xiang Li, Hongwu Lv
  • for: Making image style transfer more precise, so that a specific object in an image can be stylized according to a textual description without affecting the background style.
  • methods: The "Soulstyler" framework lets users specify the stylization target via a simple textual description; a large language model parses the text to identify the stylization goal and style, combined with a CLIP-based semantic visual embedding encoder that matches text and image content.
  • results: Experiments show that Soulstyler accurately performs style transfer on target objects according to textual descriptions, without affecting the style of background regions.
    Abstract Image style transfer occupies an important place in both computer graphics and computer vision. However, most current methods require reference to stylized images and cannot individually stylize specific objects. To overcome this limitation, we propose the "Soulstyler" framework, which allows users to guide the stylization of specific objects in an image through simple textual descriptions. We introduce a large language model to parse the text and identify stylization goals and specific styles. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that ensures that style transfer is performed only on specified target objects, while non-target regions remain in their original style. Experimental results demonstrate that our model is able to accurately perform style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.
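A localized objective of the kind described might look like the following: CLIP features of the masked target region are pulled toward the style text embedding, while non-target pixels are anchored to the original image. The CLIP interface, masking strategy, and loss weighting are assumptions; Soulstyler's actual localized block-matching loss is more elaborate.

```python
import torch

def localized_style_loss(stylized, original, target_mask, clip_model, text_feat):
    """Sketch: cosine-align the target region's CLIP image features with
    the style text embedding; keep everything outside the mask close to
    the original. `clip_model` is any CLIP-style model exposing
    encode_image; inputs are assumed preprocessed to CLIP resolution."""
    region = stylized * target_mask
    img_feat = clip_model.encode_image(region)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    style_term = 1 - (img_feat * text_feat).sum(-1).mean()      # cosine distance
    preserve = ((stylized - original) * (1 - target_mask)).pow(2).mean()
    return style_term + 10.0 * preserve     # weighting is an assumption
```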

Transfer Learning-based Real-time Handgun Detection

  • paper_url: http://arxiv.org/abs/2311.13559
  • repo_url: None
  • paper_authors: Youssef Elmir, Sid Ahmed Laouar, Larbi Hamdaoui
  • for: Strengthening security measures and reducing reliance on human surveillance.
  • methods: A real-time computer vision system for handgun detection built with convolutional neural networks and transfer learning.
  • results: The proposed automatic handgun detection system is reliable and efficient, reaching a precision of 84.74%, comparable to related work, and can reduce manual monitoring time and the number of false positives.
    Abstract Traditional surveillance systems rely on human attention, limiting their effectiveness. This study employs convolutional neural networks and transfer learning to develop a real-time computer vision system for automatic handgun detection. Comprehensive analysis of online handgun detection methods is conducted, emphasizing reducing false positives and learning time. Transfer learning is demonstrated as an effective approach. Despite technical challenges, the proposed system achieves a precision rate of 84.74%, demonstrating promising performance comparable to related works, enabling faster learning and accurate automatic handgun detection for enhanced security. This research advances security measures by reducing human monitoring dependence, showcasing the potential of transfer learning-based approaches for efficient and reliable handgun detection.
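A minimal transfer-learning setup in the spirit of the paper freezes an ImageNet backbone and retrains only a binary classification head; the backbone choice and hyperparameters below are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse ImageNet features; train only a new handgun / no-handgun head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False                    # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Standard loop: forward frames, compute criterion, step optimizer --
# only the head's weights change, which is what makes learning fast.
```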

Vamos: Versatile Action Models for Video Understanding

  • paper_url: http://arxiv.org/abs/2311.13627
  • repo_url: https://github.com/brown-palm/Vamos
  • paper_authors: Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun
  • for: This paper proposes text-based representations for video understanding, aiming to better model actions and context in video.
  • methods: A large language model (LLM) serves as the "reasoner" and can flexibly consume visual embeddings, discrete action labels, and free-form text descriptions extracted from videos as its input.
  • results: Text-based representations achieve competitive performance across four diverse video understanding benchmarks, while visual embeddings provide marginal or no performance improvement.
    Abstract What makes good video representations for video understanding, such as anticipating future activities, or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as discrete action labels, or free-form video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model as the "reasoner", and can flexibly leverage visual embeddings, action labels, and free-form descriptions extracted from videos as its input. We evaluate Vamos on four complementary video understanding benchmarks, Ego4D, Next-QA, IntentQA, and EgoSchema, on its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representation in the LLM era. We perform extensive ablation study and qualitative analysis to support our observations, and achieve state-of-the-art performance on three benchmarks.

Medical Image Retrieval Using Pretrained Embeddings

  • paper_url: http://arxiv.org/abs/2311.13547
  • repo_url: None
  • paper_authors: Farnaz Khun Jush, Tuan Truong, Steffen Vogler, Matthias Lenga
  • for: This paper explores the feasibility of medical image retrieval, evaluating four state-of-the-art pretrained networks.
  • methods: Four state-of-the-art pretrained networks are compared under two similarity-indexing approaches. Since the networks take 2D images, the paper also analyzes the impact of weighting and sampling strategies for incorporating 3D information when retrieving 3D volumes.
  • results: The study achieves a recall of 1 for retrieval tasks at the modality, body region, and organ levels. Pretrained embeddings make medical image retrieval feasible without any additional training or fine-tuning steps.
    Abstract A wide range of imaging techniques and data formats available for medical images make accurate retrieval from image databases challenging. Efficient retrieval systems are crucial in advancing medical research, enabling large-scale studies and innovative diagnostic tools. Thus, addressing the challenges of medical image retrieval is essential for the continued enhancement of healthcare and research. In this study, we evaluated the feasibility of employing four state-of-the-art pretrained models for medical image retrieval at modality, body region, and organ levels and compared the results of two similarity indexing approaches. Since the employed networks take 2D images, we analyzed the impacts of weighting and sampling strategies to incorporate 3D information during retrieval of 3D volumes. We showed that medical image retrieval is feasible using pretrained networks without any additional training or fine-tuning steps. Using pretrained embeddings, we achieved a recall of 1 for various tasks at modality, body region, and organ level.
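Retrieval with frozen pretrained embeddings reduces to nearest-neighbour search. A minimal cosine-similarity version, with the class-level recall used here (does any top-k hit share the query's modality / body region / organ label?), looks like this:

```python
import numpy as np

def retrieve(query_emb, db_embs, k=5):
    """Rank database images by cosine similarity to the query embedding.
    Embeddings come from a frozen pretrained network; no fine-tuning."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:k]        # indices of the top-k matches

def recall_at_k(retrieved_labels, query_label):
    """1 if any top-k result shares the query's class label, else 0."""
    return float(query_label in retrieved_labels)
```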

Enigma: Privacy-Preserving Execution of QAOA on Untrusted Quantum Computers

  • paper_url: http://arxiv.org/abs/2311.13546
  • repo_url: None
  • paper_authors: Ramin Ayanzadeh, Ahmad Mousavi, Narges Alavisamani, Moinuddin Qureshi
  • for: Enabling low-cost, privacy-preserving quantum computation that can be integrated with current systems.
  • methods: Enigma, a suite of privacy-preserving schemes designed specifically for the Quantum Approximate Optimization Algorithm (QAOA). Unlike prior privacy-preserving techniques that obfuscate quantum circuits, Enigma transforms the input problem so that the resulting circuit and its outcomes are unintelligible to the server. Three variants are proposed: Enigma-I, Enigma-II, and Enigma-III.
  • results: Evaluation on IBM quantum devices shows that Enigma preserves privacy at the cost of only a small reduction in fidelity (1%-13%).
    Abstract Quantum computers can solve problems that are beyond the capabilities of conventional computers. As quantum computers are expensive and hard to maintain, the typical model for performing quantum computation is to send the circuit to a quantum cloud provider. This leads to privacy concerns for commercial entities as an untrusted server can learn protected information from the provided circuit. Current proposals for Secure Quantum Computing (SQC) either rely on emerging technologies (such as quantum networks) or incur prohibitive overheads (for Quantum Homomorphic Encryption). The goal of our paper is to enable low-cost privacy-preserving quantum computation that can be used with current systems. We propose Enigma, a suite of privacy-preserving schemes specifically designed for the Quantum Approximate Optimization Algorithm (QAOA). Unlike previous SQC techniques that obfuscate quantum circuits, Enigma transforms the input problem of QAOA, such that the resulting circuit and the outcomes are unintelligible to the server. We introduce three variants of Enigma. Enigma-I protects the coefficients of QAOA using random phase flipping and fudging of values. Enigma-II protects the nodes of the graph by introducing decoy qubits, which are indistinguishable from primary ones. Enigma-III protects the edge information of the graph by modifying the graph such that each node has an identical number of connections. For all variants of Enigma, we demonstrate that we can still obtain the solution for the original problem. We evaluate Enigma using IBM quantum devices and show that the privacy improvements of Enigma come at only a small reduction in fidelity (1%-13%).
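A classical analogue of Enigma-I's random phase flipping can be shown on the Ising formulation that QAOA optimises: flipping random spins before submission yields a problem the server cannot relate to the original, and the client undoes the flips on the returned solution. Enigma-I additionally fudges coefficient values, which this sketch omits.

```python
import numpy as np

def obfuscate_ising(h, J, rng):
    """Flip random spins of an Ising problem E(s) = sum h_i s_i +
    sum J_ij s_i s_j before sending it out. Substituting s_i = f_i s'_i
    (f_i in {+1, -1}) negates h[i] and the i-th row/column of J, so the
    server sees an unrelated-looking problem with the same structure."""
    flips = rng.integers(0, 2, size=len(h)) * 2 - 1   # +1 keep, -1 flip
    return h * flips, J * np.outer(flips, flips), flips

def decode(solution_pm1, flips):
    """Map the server's +/-1 solution back to the original problem."""
    return solution_pm1 * flips
```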

Physics-driven generative adversarial networks empower single-pixel infrared hyperspectral imaging

  • paper_url: http://arxiv.org/abs/2311.13626
  • repo_url: None
  • paper_authors: Dong-Yin Wang, Shu-Hang Bie, Xi-Hao Chen, Wen-Kai Yu
  • for: Eliminating the extensive data-training work that traditional data-driven models require for single-pixel hyperspectral imaging (HSI) in the infrared spectrum.
  • methods: A physics-driven generative adversarial network (GAN) in which the physical process of single-pixel imaging (SPI) is integrated into the generator; the actual and estimated one-dimensional (1D) bucket signals serve as constraints in the objective function, updating the network's parameters and optimizing the generator with the assistance of the discriminator.
  • results: Compared with single-pixel infrared HSI methods based on compressed sensing and on physics-driven convolutional neural networks, the physics-driven GAN achieves higher imaging performance with fewer measurements.
    Abstract A physics-driven generative adversarial network (GAN) was established here for single-pixel hyperspectral imaging (HSI) in the infrared spectrum, to eliminate the extensive data training work required by traditional data-driven model. Within the GAN framework, the physical process of single-pixel imaging (SPI) was integrated into the generator, and the actual and estimated one-dimensional (1D) bucket signals were employed as constraints in the objective function to update the network's parameters and optimize the generator with the assistance of the discriminator. In comparison to single-pixel infrared HSI methods based on compressed sensing and physics-driven convolution neural networks, our physics-driven GAN-based single-pixel infrared HSI can achieve higher imaging performance but with fewer measurements. We believe that this physics-driven GAN will promote practical applications of computational imaging, especially various SPI-based techniques.
    摘要 本文建立了一种物理驱动的生成对抗网络(GAN),用于红外波段的单像素高光谱成像(HSI),以消除传统数据驱动模型所需的大量数据训练工作。在 GAN 框架中,单像素成像(SPI)的物理过程被集成到生成器中,并将真实与估计的一维(1D)桶信号作为目标函数中的约束,在判别器的辅助下更新网络参数、优化生成器。与基于压缩感知和物理驱动卷积神经网络的单像素红外 HSI 方法相比,我们的物理驱动 GAN 方法能以更少的测量获得更高的成像性能。我们相信这种物理驱动 GAN 将推动计算成像,尤其是各类基于 SPI 的技术的实际应用。
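
The physics constraint described above is easy to state concretely: single-pixel imaging measures inner products between known modulation patterns and the scene, so the generator's estimate can be penalized for failing to reproduce the recorded 1D bucket signal. A minimal PyTorch sketch of such a constraint term follows; the function name and shapes are our assumptions, not the paper's code.

```python
import torch

def bucket_signal_loss(x_hat, patterns, y_measured):
    """Re-project the generator's image estimate through the known SPI
    forward model and compare with the measured bucket signal.

    x_hat:      (N,) flattened image estimate from the generator
    patterns:   (M, N) modulation patterns of the single-pixel camera
    y_measured: (M,) recorded 1D bucket signal
    """
    y_est = patterns @ x_hat               # linear SPI forward model
    return torch.mean((y_est - y_measured) ** 2)
```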

Linear Log-Normal Attention with Unbiased Concentration

  • paper_url: http://arxiv.org/abs/2311.13541
  • repo_url: None
  • paper_authors: Yury Nahshan, Joseph Kampeas, Emir Haleva
  • for: 本研究旨在解决 Transformer 模型在处理长文档或高分辨率图像时遇到的可扩展性问题,即自注意力机制的时间和内存复杂度随序列长度呈二次增长。
  • methods: 本研究通过分析注意力矩阵的分布及其集中能力来研究自注意力机制,提出了度量这些量的工具,并引入一种新的自注意力机制,即线性对数正态(Linear Log-Normal)注意力,用以模拟原始自注意力的分布与集中行为。
  • results: 我们在常用的自然语言基准上进行了实验,发现所提出的线性对数正态注意力优于其他线性化注意力方案,为提升 Transformer 模型的可扩展性提供了一条可行途径。代码见补充材料。
    Abstract Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models. Our code is available in supplementary materials.
    摘要 Transformer 模型在众多应用中取得了显著成果,但其自注意力机制的时间和内存复杂度随序列长度呈二次增长,限制了可扩展性,在处理长文档或高分辨率图像时尤为突出。本文通过分析注意力矩阵的分布及其集中能力来研究自注意力机制,提出了度量这些量的工具,并引入一种新的自注意力机制,即线性对数正态注意力,用以模拟原始自注意力的分布与集中行为。在常用自然语言基准上的实验表明,所提出的方法优于其他线性化注意力方案,为提升 Transformer 模型的可扩展性提供了一条可行途径。代码见补充材料。
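
For intuition, the generic linearized-attention recipe that such methods build on replaces the softmax with positive feature maps, dropping the cost from O(n²d) to O(nd²). The sketch below uses a plain exp(.) feature map purely as an illustration; the paper's Linear Log-Normal design chooses the map so the resulting score distribution matches softmax attention's log-normal statistics, which we do not reproduce here.

```python
import torch

def linear_attention(q, k, v, feature_map=torch.exp):
    """Linearized attention sketch: O(n d^2) instead of O(n^2 d).
    q, k: (n, d); v: (n, d_v). In practice one would subtract the
    row-wise max before exp(.) for numerical stability."""
    qf, kf = feature_map(q), feature_map(k)        # positive features
    kv = kf.T @ v                                  # (d, d_v), sum over keys
    z = qf @ kf.sum(dim=0, keepdim=True).T         # (n, 1) normalizer
    return (qf @ kv) / (z + 1e-6)
```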

Speak Like a Native: Prompting Large Language Models in a Native Style

  • paper_url: http://arxiv.org/abs/2311.13538
  • repo_url: https://github.com/yangzhch6/aligncot
  • paper_authors: Zhicheng Yang, Yiwei Wang, Yinya Huang, Jing Xiong, Xiaodan Liang, Jing Tang
  • for: 提高大语言模型(LLM)的理解能力
  • methods: 使用链式思维(CoT)技术,并对具有不同文本风格的示例进行对齐,以提高 LLM 的表现
  • results: 对多个 benchmark 进行了广泛和全面的实验,结果表明,使用我们的 AlignCoT 技术可以在 GPT-3.5-turbo 上提高表现,比如在 GSM8K 上提高了 +2.5%。此外,我们的 AlignCoT 技术可以轻松地与现有的state-of-the-art技术结合使用,以进一步提高 LLM 的表现。
    Abstract Existing work has found that the prompt engineering heavily influences the performance of large language models (LLMs). Chain-of-thought (CoT), as a popular prompt engineering technique, prompted LLMs using in-context examples with reasoning steps. In current studies, the few-shot examples of CoT are generally handcrafted by humans. However, how the text style of in-context examples influences the outputs of LLMs still remains under-explored. This paper presents a novel and effective approach, named \textbf{AlignCoT}, to improve the reasoning capability of LLMs by aligning the in-context examples with the native style of LLMs. ``Native'' refers to the inherent characteristic style of LLMs which can be probed by original zero-shot scenarios. AlignCoT is orthogonal to other prompt engineering methods, making it easy to combine with state-of-the-art techniques to further improve the LLMs' performance. We conduct extensive and comprehensive experiments on several benchmarks. The empirical results demonstrate that our AlignCoT significantly improves performance over the carefully handcrafted in-context examples. For instance, with GPT-3.5-turbo, we observed a +2.5\% improvement on GSM8K. Furthermore, our AlignCoT consistently improves the performance when combined with other state-of-the-art prompt engineering methods. The source code and dataset will be available at \href{https://github.com/yangzhch6/AlignCoT}{https://github.com/yangzhch6/AlignCoT}.
    摘要 已有研究发现,提示工程对大型语言模型(LLM)的性能影响很大。链式思维(CoT)是一种流行的提示工程技术,通过带有推理步骤的上下文示例来提示 LLM。现有研究中,CoT 的少样本示例通常由人工精心编写,但上下文示例的文本风格如何影响 LLM 的输出仍缺乏探索。本文提出一种新颖且有效的方法 AlignCoT,通过将上下文示例与 LLM 的"原生"风格对齐来提升其推理能力。"原生"指 LLM 固有的风格特征,可通过原始零样本场景探测得到。AlignCoT 与其他提示工程方法正交,易于与最先进技术结合以进一步提升 LLM 的性能。我们在多个基准上进行了广泛而全面的实验,结果表明 AlignCoT 显著优于精心手工编写的上下文示例,例如在 GPT-3.5-turbo 上,GSM8K 的性能提升 +2.5%;与其他最先进提示工程方法结合时,AlignCoT 也能持续带来提升。代码与数据集见 https://github.com/yangzhch6/AlignCoT 。
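
Operationally, the idea can be approximated in a few lines: probe the model's zero-shot "native" style, then ask the model itself to restate each handcrafted chain-of-thought in that style before using it as a demonstration. The sketch below is our reading of the recipe, not the authors' released code; `llm` stands for any text-completion callable.

```python
def align_examples(llm, questions, handcrafted_cots):
    """Rewrite handcrafted CoT demonstrations into the model's native style."""
    aligned = []
    for q, cot in zip(questions, handcrafted_cots):
        # 1) Probe the native style with an original zero-shot prompt.
        native = llm(f"Q: {q}\nA: Let's think step by step.")
        # 2) Restate the handcrafted solution in that style.
        prompt = ("Rewrite the following solution so its style matches the "
                  f"style example.\nStyle example:\n{native}\n\n"
                  f"Solution to rewrite:\n{cot}")
        aligned.append(llm(prompt))
    return aligned
```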

LM-Cocktail: Resilient Tuning of Language Models via Model Merging

  • paper_url: http://arxiv.org/abs/2311.13534
  • repo_url: https://github.com/flagopen/flagembedding
  • paper_authors: Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing
  • for: 提高微调后语言模型在通用任务上的表现,而不导致其在目标领域的表现下降。
  • methods: 提出 LM-Cocktail 方法,通过将微调后的模型与基座模型或其他领域的同类模型按权重取平均来实现模型融合。
  • results: 在 LLama 和 BGE 模型上的实验表明,该方法在各种通用任务上均取得强劲的实证表现,同时保持了目标领域的出色能力。
    Abstract The pre-trained language models are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail which enables the fine-tuned model to stay resilient in general perspectives. Our method is conducted in the form of model merging, where the fine-tuned language model is merged with the pre-trained base model or the peer models from other domains through weighted average. Despite simplicity, LM-Cocktail is surprisingly effective: the resulted model is able to achieve a strong empirical performance in the whole scope of general tasks while preserving a superior capacity in its targeted domain. We conduct comprehensive experiments with LLama and BGE model on popular benchmarks, including FLAN, MMLU, MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.
    摘要 预训练语言模型会被持续微调以更好地支持下游应用,但这种操作可能导致其在目标领域之外的通用任务上性能显著退化。为了解决这个问题,我们提出 LM-Cocktail,使微调后的模型在通用能力上保持稳健。我们的方法以模型合并的形式进行:将微调后的语言模型与预训练基座模型或来自其他领域的同类模型按权重取平均。方法虽简单,LM-Cocktail 却出人意料地有效:得到的模型在整个通用任务范围内具有强劲的实证性能,同时在目标领域保持出色的能力。我们在 LLama 和 BGE 模型上、于 FLAN、MMLU、MTEB 等常用基准上进行了全面实验,结果验证了所提方法的有效性。代码和检查点见 https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail 。
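
The merging step itself is a parameter-wise weighted average, which makes the method cheap to reproduce. A minimal PyTorch sketch under that reading (weights summing to one; all models sharing one architecture) is:

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """LM-Cocktail-style merging sketch: weighted average of the
    fine-tuned model with its base and/or peer models, parameter-wise."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float()
                           for w, sd in zip(weights, state_dicts))
    return merged

# e.g. 70% fine-tuned model, 30% base model:
# model.load_state_dict(merge_state_dicts(
#     [ft_model.state_dict(), base_model.state_dict()], [0.7, 0.3]))
```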

Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices

  • paper_url: http://arxiv.org/abs/2311.13502
  • repo_url: None
  • paper_authors: Gaoxiang Duan, Junkai Zhang, Xiaoying Zheng, Yongxin Zhu
  • for: 本研究旨在解决高性能模型(尤其是 Transformer)在低成本、低精度设备上应用时面临的计算复杂度高与依赖高精度浮点运算的问题。
  • methods: 本研究提出一种新的注意力机制,用位运算取代浮点矩阵乘法,将注意力运算的复杂度从 $O(n^2d)$ 降至 $O(n^2T)$(其中 $T$ 明显小于 $d$)。
  • results: 与传统浮点矩阵乘法相比,位运算注意力在显著降低计算复杂度的同时,保留了注意力机制捕捉长距离信息依赖的能力。
    Abstract In the current landscape of large models, the Transformer stands as a cornerstone, playing a pivotal role in shaping the trajectory of modern models. However, its application encounters challenges attributed to the substantial computational intricacies intrinsic to its attention mechanism. Moreover, its reliance on high-precision floating-point operations presents specific hurdles, particularly evident in computation-intensive scenarios such as edge computing environments. These environments, characterized by resource-constrained devices and a preference for lower precision, necessitate innovative solutions. To tackle the exacting data processing demands posed by edge devices, we introduce the Bitformer model, an inventive extension of the Transformer paradigm. Central to this innovation is a novel attention mechanism that adeptly replaces conventional floating-point matrix multiplication with bitwise operations. This strategic substitution yields dual advantages. Not only does it maintain the attention mechanism's prowess in capturing intricate long-range information dependencies, but it also orchestrates a profound reduction in the computational complexity inherent in the attention operation. The transition from an $O(n^2d)$ complexity, typical of floating-point operations, to an $O(n^2T)$ complexity characterizing bitwise operations, substantiates this advantage. Notably, in this context, the parameter $T$ remains markedly smaller than the conventional dimensionality parameter $d$. The Bitformer model in essence endeavors to reconcile the indomitable requirements of modern computing landscapes with the constraints posed by edge computing scenarios. By forging this innovative path, we bridge the gap between high-performing models and resource-scarce environments, thus unveiling a promising trajectory for further advancements in the field.
    摘要 在当前大模型的格局中,Transformer 是奠定现代模型发展轨迹的基石。然而,其注意力机制固有的巨大计算复杂度给应用带来挑战;对高精度浮点运算的依赖也构成了特殊障碍,这在边缘计算等计算密集场景中尤为明显:这类环境设备资源受限且偏好低精度运算,亟需创新的解决方案。为应对边缘设备苛刻的数据处理需求,我们提出 Bitformer 模型,它是对 Transformer 范式的创新扩展。其核心是一种新型注意力机制,巧妙地用位运算取代传统的浮点矩阵乘法。这一替换带来双重优势:既保持了注意力机制捕捉复杂长距离信息依赖的能力,又大幅降低了注意力运算固有的计算复杂度,即从浮点运算典型的 $O(n^2d)$ 降至位运算的 $O(n^2T)$,其中参数 $T$ 明显小于常规的维度参数 $d$。Bitformer 模型致力于调和现代计算环境的严苛需求与边缘计算场景的种种限制,弥合高性能模型与资源匮乏环境之间的差距,为该领域的进一步发展开辟了一条可期的道路。
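
To make the O(n²T)-style scoring concrete, here is a toy NumPy sketch of bitwise attention scores: queries and keys are binarized and pairs are scored by XNOR agreement counts instead of floating-point dot products. This is our illustration of the flavor of the operation, not the paper's exact kernel (real implementations pack bits into machine words and use popcount).

```python
import numpy as np

def bitwise_attention_scores(q, k):
    """Score query/key pairs by the number of agreeing sign bits.
    q, k: (n, d) float arrays -> (n, n) integer similarity matrix."""
    qb = q > 0                                                # 1-bit quantization
    kb = k > 0
    agree = ~np.logical_xor(qb[:, None, :], kb[None, :, :])   # XNOR
    return agree.sum(-1)
```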

Complexity-Guided Curriculum Learning for Text Graphs

  • paper_url: http://arxiv.org/abs/2311.13472
  • repo_url: None
  • paper_authors: Nidhi Vakil, Hadi Amiri
  • for: 这篇论文的目的是提出一种课程学习(curriculum learning)方法,用于基于文本图(text graph)数据的训练。
  • methods: 本研究使用一种新颖的数据调度器,它利用"间隔重复"(spaced repetition)与复杂度形式化指标来引导训练过程。
  • results: 研究发现,所提方法以更少的数据获得更大收益,并能在不同的 GNN 模型与数据集之间学习可迁移的课程;此外,节点级(局部)与图级(全局)的图复杂度指标,以及浅层与传统的文本复杂度指标,都对有效的课程学习起关键作用。
    Abstract Curriculum learning provides a systematic approach to training. It refines training progressively, tailors training to task requirements, and improves generalization through exposure to diverse examples. We present a curriculum learning approach that builds on existing knowledge about text and graph complexity formalisms for training with text graph data. The core part of our approach is a novel data scheduler, which employs "spaced repetition" and complexity formalisms to guide the training process. We demonstrate the effectiveness of the proposed approach on several text graph tasks and graph neural network architectures. The proposed model gains more and uses less data; consistently prefers text over graph complexity indices throughout training, while the best curricula derived from text and graph complexity indices are equally effective; and it learns transferable curricula across GNN models and datasets. In addition, we find that both node-level (local) and graph-level (global) graph complexity indices, as well as shallow and traditional text complexity indices play a crucial role in effective curriculum learning.
    摘要 课程学习提供了一种系统化的训练方法:它循序渐进地细化训练,按任务需求加以定制,并通过接触多样化的示例提升泛化能力。我们提出一种建立在已有文本与图复杂度形式化指标之上的课程学习方法,用于文本图数据的训练。方法的核心是一种新颖的数据调度器,利用"间隔重复"与复杂度形式化指标引导训练过程。我们在多个文本图任务和图神经网络架构上验证了该方法的有效性:所提模型以更少的数据获得更大收益;训练全程更偏好文本复杂度指标而非图复杂度指标,而由两类指标得到的最优课程同样有效;它还能在不同 GNN 模型与数据集之间学习可迁移的课程。此外,我们发现节点级(局部)与图级(全局)的图复杂度指标,以及浅层与传统的文本复杂度指标,都在有效的课程学习中扮演关键角色。
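
A scheduler combining the two ingredients named above, complexity ordering and spaced repetition, can be sketched in a few lines. The pacing and revisit rule below are our own simple choices (items enter easiest-first and are revisited at doubling intervals); the paper's scheduler is more elaborate.

```python
def spaced_repetition_schedule(num_samples, complexity, epochs):
    """Return, per epoch, the sample indices to train on."""
    order = sorted(range(num_samples), key=lambda i: complexity[i])
    intro = {i: (rank * epochs) // num_samples for rank, i in enumerate(order)}
    schedule = []
    for epoch in range(epochs):
        batch = []
        for i in order:
            age = epoch - intro[i]                 # epochs since introduction
            if age >= 0 and age & (age + 1) == 0:  # revisit at ages 0,1,3,7,...
                batch.append(i)
        schedule.append(batch or order[:1])        # never train on nothing
    return schedule
```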

Generation of Explanations for Logic Reasoning

  • paper_url: http://arxiv.org/abs/2311.13455
  • repo_url: None
  • paper_authors: Yanyi Pu
  • for: 本论文探讨演绎推理中的 a fortiori 论证,强调其在法律、哲学和人工智能等领域的重要性。
  • methods: 本研究使用 GPT-3.5-turbo 自动分析 a fortiori 论证,主要目标是理解复杂的推理过程、生成清晰连贯的解释以及构造新的论证;研究方法涵盖对此类论证的细致识别、解读与扩充。
  • results: 实验表明,GPT-3.5-turbo 在准确检测和分类 a fortiori 论证方面仍有挑战,但在提取关键要素和理解底层属性方面,模型的性能可与专门模型相媲美;将外部信息整合进模型处理显著提高了生成解释的质量;此外,模型还在扩充论证方面展现出可观的能力。
    Abstract This thesis delves into a fortiori arguments in deductive reasoning, underscoring their relevance in various domains such as law, philosophy, and artificial intelligence. The research is centred on employing GPT-3.5-turbo to automate the analysis of these arguments, with a focus on understanding intricate reasoning processes, generating clear and coherent explanations, and creating novel arguments. The methodology encompasses a series of tasks including detailed reasoning, interpretation, and the augmentation of a fortiori arguments. It involves meticulously identifying these arguments in diverse contexts, differentiating comparative elements, and categorizing them based on their logical structure. Extensive experiments reveals the challenges encountered by GPT-3.5-turbo in accurately detecting and classifying a fortiori arguments. Nevertheless, the model demonstrates a performance that rivals specialized models, particularly in extracting key components and interpreting underlying properties. The integration of external information into the model's processing significantly elevates the quality of the generated explanations. Additionally, the model exhibits a noteworthy capability in augmenting arguments, thus contributing to the enrichment of the data set. Despite facing certain limitations, this thesis makes significant contributions to the fields of artificial intelligence and logical reasoning. It introduces novel methodologies, establishes a rigorous evaluation framework, and provides deep insights that set the stage for future advancements in automated logical reasoning. The findings and methodologies presented herein not only underscore the potential of AI in complex reasoning tasks but also highlight areas for future research and development.
    摘要 本论文深入研究演绎推理中的 a fortiori 论证,强调其在法律、哲学和人工智能等领域的重要性。研究围绕使用 GPT-3.5-turbo 自动分析此类论证展开,重点在于理解复杂的推理过程、生成清晰连贯的解释以及构造新的论证;研究方法包括在多样语境中细致识别这类论证、区分其中的比较要素,并依据逻辑结构对其分类。实验表明,GPT-3.5-turbo 在准确检测和分类 a fortiori 论证方面面临挑战,但在提取关键要素和解读底层属性上仍可与专门模型相媲美;将外部信息整合进模型处理显著提升了生成解释的质量;模型还展现出扩充论证、丰富数据集的能力。尽管存在一定局限,本论文为人工智能与逻辑推理领域做出了重要贡献:提出了新的方法论,建立了严格的评估框架,并提供了深刻洞见,为自动逻辑推理的未来发展奠定基础。

Guided Flows for Generative Modeling and Decision Making

  • paper_url: http://arxiv.org/abs/2311.13443
  • repo_url: None
  • paper_authors: Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen
  • for: 这篇论文研究 classifier-free guidance 能否应用于 Flow Matching 模型,以提升条件生成模型的性能。
  • methods: 论文使用 Guided Flows 方法,通过对向量场进行回归来训练连续正规化流(Continuous Normalizing Flows)。
  • results: 论文表明 Guided Flows 能显著提升条件图像生成和零样本文本转语音合成的样本质量,并能在极低计算量下运行而不影响智能体的整体表现;特别地,这是首次将流模型应用于离线强化学习设定。
    Abstract Classifier-free guidance is a key component for improving the performance of conditional generative models for many downstream tasks. It drastically improves the quality of samples produced, but has so far only been used for diffusion models. Flow Matching (FM), an alternative simulation-free approach, trains Continuous Normalizing Flows (CNFs) based on regressing vector fields. It remains an open question whether classifier-free guidance can be performed for Flow Matching models, and to what extent does it improve performance. In this paper, we explore the usage of Guided Flows for a variety of downstream applications involving conditional image generation, speech synthesis, and reinforcement learning. In particular, we are the first to apply flow models to the offline reinforcement learning setting. We also show that Guided Flows significantly improves the sample quality in image generation and zero-shot text-to-speech synthesis, and can make use of drastically low amounts of computation without affecting the agent's overall performance.
    摘要 无分类器引导(classifier-free guidance)是提升条件生成模型在众多下游任务中性能的关键组件。它能大幅提高生成样本的质量,但迄今只被用于扩散模型。流匹配(FM)是一种无需模拟的替代方法,通过对向量场进行回归来训练连续正规化流(CNF)。无分类器引导能否用于流匹配模型、以及能在多大程度上提升性能,仍是悬而未决的问题。本文探索了 Guided Flows 在条件图像生成、语音合成和强化学习等多种下游应用中的使用,并首次将流模型应用于离线强化学习设定。我们还表明,Guided Flows 能显著提升图像生成和零样本文本转语音合成的样本质量,并能在极低计算量下运行而不影响智能体的整体表现。
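
The guidance rule carries over from diffusion almost verbatim: blend the conditional and unconditional vector fields and integrate the resulting ODE. A minimal PyTorch sketch under the assumed interface `model(x, t, cond)` (with `cond=None` for the unconditional field) is:

```python
import torch

def guided_velocity(model, x, t, cond, w):
    """Classifier-free guided vector field:
    v = (1 + w) * v(x, t | cond) - w * v(x, t)."""
    return (1 + w) * model(x, t, cond) - w * model(x, t, None)

def sample(model, x0, cond, w=1.5, steps=50):
    """Integrate the guided flow ODE from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for s in range(steps):
        t = torch.full((x.shape[0],), s * dt)
        x = x + dt * guided_velocity(model, x, t, cond, w)
    return x
```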

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

  • paper_url: http://arxiv.org/abs/2311.13435
  • repo_url: https://github.com/mbzuai-oryx/video-llava
  • paper_authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
  • for: 本研究旨在将基于图像的大型多模态模型(LMM)扩展到视频领域,以提升视频理解与对话能力。
  • methods: 我们提出名为 Video-LLaVA 的新方法,利用现成的跟踪器和一个新颖的定位(grounding)模块实现像素级定位能力,并将音频线索转写为文本以丰富对视频上下文的理解。
  • results: 我们在基于视频的生成与问答基准上进行了评估,并提出了专门衡量视频中按指令定位物体性能的新基准;结果表明,Video-LLaVA 在基于视频的对话和定位任务上明显优于 Video-ChatGPT、Video-LLaMA 等方法。项目页面:https://github.com/mbzuai-oryx/Video-LLaVA
    Abstract Extending image-based Large Multimodal Models (LMM) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMM to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results which is a concern with the proprietary nature of GPT-3.5. Our framework builds on SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
    摘要 由于视频数据固有的复杂性,将基于图像的大型多模态模型(LMM)扩展到视频领域颇具挑战。现有的扩展方法要么缺乏定位能力(如 VideoChat、Video-ChatGPT、Video-LLaMA),要么未利用音频信号来改善视频理解(如 Video-ChatGPT)。针对这些不足,我们提出 Video-LLaVA,它是首个具备像素级定位能力的 LMM,并通过将音频线索转写为文本来丰富视频上下文理解。我们的框架使用现成的跟踪器和一个新颖的定位模块,能按用户指令在空间和时间上定位视频中的物体。我们在基于视频的生成与问答基准上评估了 Video-LLaVA,并提出了专门衡量视频中按提示定位物体性能的新基准。此外,我们提议在视频对话基准测试中用 Vicuna 取代 Video-ChatGPT 所用的 GPT-3.5,以避免 GPT-3.5 专有性质给结果可复现性带来的隐忧。我们的框架建立在最先进的图像版 LLaVA 模型之上,将其优势延伸到视频领域,在基于视频的对话与定位任务上取得了可喜的提升。项目页面:https://github.com/mbzuai-oryx/Video-LLaVA

From Images to Connections: Can DQN with GNNs learn the Strategic Game of Hex?

  • paper_url: http://arxiv.org/abs/2311.13414
  • repo_url: https://github.com/yannikkellerde/gnn_hex
  • paper_authors: Yannik Keller, Jannis Blüml, Gopika Sudhakaran, Kristian Kersting
  • for: 研究图神经网络(GNN)能否在自博弈强化学习(RL)中取代卷积神经网络(CNN),用于棋盘游戏的自动对弈。
  • methods: 以抽象而策略丰富的棋盘游戏 Hex 为实验平台,比较 GNN 与 CNN 在自博弈强化学习中的表现。
  • results: GNN 在处理游戏状态中的长距离依赖方面表现出色,且更不易过拟合,但在辨别局部模式方面能力有所下降;这预示着一种潜在的范式转变,即利用游戏特有的结构来重塑自博弈强化学习。
    Abstract The gameplay of strategic board games such as chess, Go and Hex is often characterized by combinatorial, relational structures -- capturing distinct interactions and non-local patterns -- and not just images. Nonetheless, most common self-play reinforcement learning (RL) approaches simply approximate policy and value functions using convolutional neural networks (CNN). A key feature of CNNs is their relational inductive bias towards locality and translational invariance. In contrast, graph neural networks (GNN) can encode more complicated and distinct relational structures. Hence, we investigate the crucial question: Can GNNs, with their ability to encode complex connections, replace CNNs in self-play reinforcement learning? To this end, we do a comparison with Hex -- an abstract yet strategically rich board game -- serving as our experimental platform. Our findings reveal that GNNs excel at dealing with long range dependency situations in game states and are less prone to overfitting, but also showing a reduced proficiency in discerning local patterns. This suggests a potential paradigm shift, signaling the use of game-specific structures to reshape self-play reinforcement learning.
    摘要 国际象棋、围棋和 Hex 等策略棋盘游戏的对弈过程往往体现出组合性的关系结构,刻画独特的交互和非局部模式,而不仅仅是图像。然而,大多数常见的自博弈强化学习(RL)方法只是用卷积神经网络(CNN)来近似策略函数和价值函数。CNN 的一个关键特点是其偏向局部性和平移不变性的关系归纳偏置;相比之下,图神经网络(GNN)能够编码更复杂、更独特的关系结构。因此,我们研究一个关键问题:具备编码复杂关联能力的 GNN,能否在自博弈强化学习中取代 CNN?为此,我们以抽象而策略丰富的棋盘游戏 Hex 为实验平台进行了比较。我们的发现表明,GNN 擅长处理游戏状态中的长距离依赖情形,且更不易过拟合,但在辨别局部模式方面能力有所下降。这预示着一种潜在的范式转变:利用游戏特有的结构来重塑自博弈强化学习。

Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training

  • paper_url: http://arxiv.org/abs/2311.13381
  • repo_url: None
  • paper_authors: Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Jiming Chen
  • for: 这篇论文提出一个多后端协作训练框架,以便在智能手机等常见移动设备上定制最先进的大型语言模型(LLM)。
  • methods: 论文将 LLM 划分为多个子模型,使每个子模型都能装入单台移动设备的内存,并设计了流水线并行训练机制以确保快速高效的分布式训练;此外,还提出一种新颖的后端调度器,将不同的注意力头分配到移动 CPU 与 GPU 等异构计算硬件上,以最大化每台边缘设备的计算资源利用率。
  • results: 初步实验结果显示,Confidant 在实际场景中最多可减少 45.3% 的内存占用,并获得 8.03 倍的推理加速。
    Abstract Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks. Nonetheless, it is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets. In this paper, we propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices like smartphones. Confidant partitions an LLM into several sub-models so that each fits into a mobile device's memory. A pipeline parallel training mechanism is further developed to ensure fast and efficient distributed training. In addition, we propose a novel backend scheduler to allocate different attention heads to heterogeneous compute hardware, including mobile CPU and GPUs, to maximize the compute resource utilization on each edge device. Our preliminary experimental results show that Confidant achieves at most 45.3% memory reduction and 8.03x inference speedup in practical settings.
    摘要 基于 Transformer 的大型语言模型(LLM)已在多种自然语言处理(NLP)任务中展现出令人瞩目的能力。然而,在计算、存储和能耗预算都有限的移动边缘设备上部署和微调 LLM 仍然充满挑战。本文提出名为 Confidant 的多后端协作训练框架,用于在智能手机等常见移动设备上定制最先进的 LLM。Confidant 将 LLM 划分为多个子模型,使每个子模型都能装入单台移动设备的内存;并进一步设计了流水线并行训练机制,以确保快速高效的分布式训练。此外,我们提出一种新颖的后端调度器,将不同的注意力头分配到移动 CPU 和 GPU 等异构计算硬件上,以最大化每台边缘设备的计算资源利用率。初步实验结果表明,Confidant 在实际场景中最多可减少 45.3% 的内存占用,并获得 8.03 倍的推理加速。
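
The partitioning idea (cut the layer stack into consecutive sub-models that each fit one device's memory) can be sketched with a greedy packer. This is our toy illustration of the concept, not Confidant's actual planner, which also schedules attention heads across CPU/GPU backends.

```python
def partition_layers(layer_mem_gb, device_budgets_gb):
    """Greedily pack consecutive layers into per-device sub-models."""
    partitions, current, used, dev = [], [], 0.0, 0
    for i, mem in enumerate(layer_mem_gb):
        if current and used + mem > device_budgets_gb[dev]:
            partitions.append(current)            # close this sub-model
            current, used, dev = [], 0.0, dev + 1
            if dev >= len(device_budgets_gb):
                raise ValueError("model does not fit on the given devices")
        current.append(i)
        used += mem
    partitions.append(current)
    return partitions

# 12 equal-size layers across three phones with different free memory:
print(partition_layers([1.0] * 12, [5.0, 4.0, 4.0]))
# -> [[0, 1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
```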

Deriving Comprehensible Theories from Probabilistic Circuits

  • paper_url: http://arxiv.org/abs/2311.13379
  • repo_url: None
  • paper_authors: Sieben Bocklandt, Wannes Meert, Koen Vanderstraeten, Wouter Pijpops, Kurt Jaspers
  • for: 这个论文的目的是提高可解释 AI 模型的可理解性,具体来说是计算出一个可读的逻辑理论,来刻画概率电路所生成的高密度区域。
  • methods: 这个论文使用名为 PUTPUT(Probabilistic circuit Understanding Through Pruning Underlying logical Theories)的新方法,采用基于生成重要性的剪枝策略。
  • results: 论文的结果表明,这种方法可以有效地生成描述概率电路高密度区域的可读逻辑理论,并在探索性能与可理解性的权衡时优于现有最先进方法。
    Abstract The field of Explainable AI (XAI) is seeking to shed light on the inner workings of complex AI models and uncover the rationale behind their decisions. One of the models gaining attention are probabilistic circuits (PCs), which are a general and unified framework for tractable probabilistic models that support efficient computation of various probabilistic queries. Probabilistic circuits guarantee inference that is polynomial in the size of the circuit. In this paper, we improve the explainability of probabilistic circuits by computing a comprehensible, readable logical theory that covers the high-density regions generated by a PC. To achieve this, pruning approaches based on generative significance are used in a new method called PUTPUT (Probabilistic circuit Understanding Through Pruning Underlying logical Theories). The method is applied to a real world use case where music playlists are automatically generated and expressed as readable (database) queries. Evaluation shows that this approach can effectively produce a comprehensible logical theory that describes the high-density regions of a PC and outperforms state of the art methods when exploring the performance-comprehensibility trade-off.
    摘要 可解释 AI(XAI)领域致力于揭示复杂 AI 模型的内部机理,并阐明其决策背后的依据。其中一类受到关注的模型是概率电路(PC):它是可精确推断的概率模型的通用统一框架,支持对多种概率查询的高效计算,且保证推理复杂度与电路规模成多项式关系。本文通过计算一个可读、可理解的逻辑理论来覆盖概率电路生成的高密度区域,从而提升其可解释性。为此,我们提出名为 PUTPUT(Probabilistic circuit Understanding Through Pruning Underlying logical Theories)的新方法,采用基于生成重要性的剪枝策略。该方法被应用于一个真实场景:自动生成音乐播放列表并将其表达为可读的(数据库)查询。评估表明,这一方法能有效产出描述概率电路高密度区域的可理解逻辑理论,并在探索性能与可理解性的权衡时优于现有最先进方法。

Large Language Model is a Good Policy Teacher for Training Reinforcement Learning Agents

  • paper_url: http://arxiv.org/abs/2311.13373
  • repo_url: https://github.com/zjlab-ammi/llm4teach
  • paper_authors: Zihao Zhou, Bin Hu, Pu Zhang, Chenyang Zhao, Bin Liu
  • for: 这篇论文旨在解决大型语言模型(LLM)智能体在复杂序列决策任务中的应用限制,以及在实际动态环境中部署此类智能体的高成本与耗时问题。
  • methods: 作者提出一种新框架:由基于 LLM 的教师智能体提供高层指令,训练一个规模更小的专门学生智能体;借助教师给出的引导动作,LLM 的先验知识被蒸馏到本地学生模型中,学生随后还可通过环境反馈继续训练,从而超越教师的能力。
  • results: 作者在三个具有挑战性的 MiniGrid 环境中进行了实验,结果显示该方法能提升样本效率,且优于基线方法。代码可在 https://github.com/ZJLAB-AMMI/LLM4Teach 下载。
    Abstract Recent studies have shown that Large Language Models (LLMs) can be utilized for solving complex sequential decision-making tasks by providing high-level instructions. However, LLM-based agents face limitations in real-time dynamic environments due to their lack of specialization in solving specific target problems. Moreover, the deployment of such LLM-based agents is both costly and time-consuming in practical scenarios. In this paper, we introduce a novel framework that addresses these challenges by training a smaller scale specialized student agent using instructions from an LLM-based teacher agent. By leveraging guided actions provided by the teachers, the prior knowledge of the LLM is distilled into the local student model. Consequently, the student agent can be trained with significantly less data. Furthermore, subsequent training with environment feedback empowers the student agents to surpass the capabilities of their teachers. We conducted experiments on three challenging MiniGrid environments to evaluate the effectiveness of our framework. The results demonstrate that our approach enhances sample efficiency and achieves superior performance compared to baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
    摘要 近期研究表明,大型语言模型(LLM)可以通过提供高层指令来求解复杂的序列决策任务。然而,基于 LLM 的智能体缺乏针对特定目标问题的专门化能力,在实时动态环境中存在局限,而且在实际场景中部署此类智能体既昂贵又耗时。本文提出一种新框架来应对这些挑战:利用基于 LLM 的教师智能体的指令,训练一个规模更小的专门学生智能体。借助教师提供的引导动作,LLM 的先验知识被蒸馏到本地学生模型中,因而学生智能体只需少得多的数据即可完成训练;随后结合环境反馈的继续训练,还能使学生智能体超越教师的能力。我们在三个具有挑战性的 MiniGrid 环境中评估了该框架的有效性,结果表明我们的方法提升了样本效率,并取得优于基线方法的性能。代码见 https://github.com/ZJLAB-AMMI/LLM4Teach 。
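
The distillation step can be pictured as a KL term that pulls the small student policy toward the action distribution suggested by the LLM teacher, later mixed with an ordinary RL loss so the student can outgrow its teacher. A minimal PyTorch sketch (our reading, not the released code) is:

```python
import torch.nn.functional as F

def teach_loss(student_logits, teacher_probs, rl_loss=None, beta=1.0):
    """KL between the student policy and the teacher's guidance;
    optionally mix in an RL loss computed from environment feedback
    so the student can surpass the teacher once rewards are informative."""
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")
    return kl if rl_loss is None else kl + beta * rl_loss
```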

Applying Large Language Models to Power Systems: Potential Security Threats

  • paper_url: http://arxiv.org/abs/2311.13361
  • repo_url: None
  • paper_authors: Jiaqi Ruan, Gaoqi Liang, Huan Zhao, Guolong Liu, Jing Qiu, Junhua Zhao, Zhao Xu, Fushuan Wen, Zhao Yang Dong
  • for: 这篇论文旨在探讨应用大型自然语言模型(LLM)到电力系统中的潜在安全隐患。
  • methods: 该论文使用了可观察性分析和攻击检测技术来检测 LLM 应用中的安全隐患。
  • results: 论文发现了一些可能的安全隐患,包括数据隐私泄露和攻击者可能利用 LLM 进行攻击。
    Abstract Applying large language models (LLMs) to power systems presents a promising avenue for enhancing decision-making and operational efficiency. However, this action may also incur potential security threats, which have not been fully recognized so far. To this end, this letter analyzes potential threats incurred by applying LLMs to power systems, emphasizing the need for urgent research and development of countermeasures.
    摘要 将大型语言模型(LLM)应用于电力系统,有望提升决策水平与运行效率,但也可能带来迄今尚未被充分认识的安全威胁。为此,本文分析了将 LLM 应用于电力系统时可能引发的潜在威胁,并强调亟需开展相应防范措施的研究与开发。

Uncertainty Estimation in Multi-Agent Distributed Learning

  • paper_url: http://arxiv.org/abs/2311.13356
  • repo_url: None
  • paper_authors: Gleb Radchenko, Victoria Andrea Fill
  • for: 本研究旨在通过开源框架和新的训练策略,为 IoT 边缘设备赋予更强的自主运行能力,推进边缘侧 AI 应用。
  • methods: KDT NEUROKIT2E 项目拟开发量化、剪枝感知训练和稀疏化等新方法,以增强边缘设备的机器学习(ML)能力。
  • results: 研究聚焦于使边缘网络中的智能体能够在分布式环境中开展协作学习的机制与方法,其中的一个关键问题是:考虑到各独立智能体所感知数据集的时空局部性,如何确定学习结果的置信度。
    Abstract Traditionally, IoT edge devices have been perceived primarily as low-power components with limited capabilities for autonomous operations. Yet, with emerging advancements in embedded AI hardware design, a foundational shift paves the way for future possibilities. Thus, the aim of the KDT NEUROKIT2E project is to establish a new open-source framework to further facilitate AI applications on edge devices by developing new methods in quantization, pruning-aware training, and sparsification. These innovations hold the potential to expand the functional range of such devices considerably, enabling them to manage complex Machine Learning (ML) tasks utilizing local resources and laying the groundwork for innovative learning approaches. In the context of 6G's transformative potential, distributed learning among independent agents emerges as a pivotal application, attributed to 6G networks' support for ultra-reliable low-latency communication, enhanced data rates, and advanced edge computing capabilities. Our research focuses on the mechanisms and methodologies that allow edge network-enabled agents to engage in collaborative learning in distributed environments. Particularly, one of the key issues within distributed collaborative learning is determining the degree of confidence in the learning results, considering the spatio-temporal locality of data sets perceived by independent agents.
    摘要 传统上,IoT 边缘设备被视为低功耗组件,自主运行能力有限。然而,随着嵌入式 AI 硬件设计的进步,一场根本性的转变正为未来开辟可能。因此,KDT NEUROKIT2E 项目旨在建立一个新的开源框架,通过开发量化、剪枝感知训练和稀疏化等新方法,进一步促进边缘设备上的 AI 应用。这些创新有望大幅拓展此类设备的功能范围,使其能够利用本地资源处理复杂的机器学习(ML)任务,并为创新的学习方式奠定基础。在 6G 的变革潜力背景下,独立智能体之间的分布式学习成为一项关键应用,这得益于 6G 网络对超可靠低时延通信、更高数据速率和先进边缘计算能力的支持。我们的研究聚焦于使边缘网络中的智能体能够在分布式环境中开展协作学习的机制与方法。特别地,分布式协作学习中的一个关键问题是:考虑到各独立智能体所感知数据集的时空局部性,如何确定学习结果的置信度。

Fact-based Court Judgment Prediction

  • paper_url: http://arxiv.org/abs/2311.13350
  • repo_url: None
  • paper_authors: Shubham Kumar Nigam, Aniket Deroy
  • for: 这个研究旨在提高初期案件结果预测,帮助法律专业人士和公众快速了解案件结果。
  • methods: 该研究使用了两种不同的问题变体:一种基于事实,另一种将事实与下级法院的判决(RLC)结合。研究使用了DELSumm算法和多种权重方案。
  • results: 结果表明,与原始 ILDC for CJPE 研究相比,仅基于事实的法律判决预测性能有所下降;即便换用不同的 Transformer 模型,也未能达到此前报告的最先进水平。
    Abstract This extended abstract extends the research presented in "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" \cite{malik-etal-2021-ildc}, focusing on fact-based judgment prediction within the context of Indian legal documents. We introduce two distinct problem variations: one based solely on facts, and another combining facts with rulings from lower courts (RLC). Our research aims to enhance early-phase case outcome prediction, offering significant benefits to legal professionals and the general public. The results, however, indicated a performance decline compared to the original ILDC for CJPE study, even after implementing various weightage schemes in our DELSumm algorithm. Additionally, using only facts for legal judgment prediction with different transformer models yielded results inferior to the state-of-the-art outcomes reported in the "ILDC for CJPE" study.
    摘要 本扩展摘要在 "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" \cite{malik-etal-2021-ildc} 研究的基础上,聚焦印度法律文书语境下基于事实的判决预测。我们引入两种不同的问题变体:一种仅基于事实,另一种将事实与下级法院的裁决(RLC)相结合。本研究旨在提升案件初期的结果预测,为法律从业者和公众带来显著益处。然而,结果显示,即便在 DELSumm 算法中采用多种加权方案,性能仍不及原始 ILDC for CJPE 研究;此外,仅用事实、配合不同的 Transformer 模型进行法律判决预测,其结果也逊于 "ILDC for CJPE" 研究中报告的最先进水平。

Learning principle and mathematical realization of the learning mechanism in the brain

  • paper_url: http://arxiv.org/abs/2311.13341
  • repo_url: None
  • paper_authors: Taisuke Katayose
  • for: 本研究的目的是提供一个数学框架,用于解释深度学习为何有效,并将其应用于现实的机器学习模型。
  • methods: 本研究提出"学习原理":一切学习都等价于估计输入数据的概率;并提出一种用微分来定义估计概率取值的新方法,使得无需任何先验知识即可对任意数据集进行无监督学习。
  • results: 本研究得到多项有价值的结果,包括:1)一切学习都可视为对输入数据概率的估计;2)普通的监督学习可视为估计条件概率,并据此使监督学习更有效、更具泛化性;3)无需先验知识即可对任意数据集进行无监督学习。此外,通过考虑全连接或部分连接模型的时间演化,本研究还成功刻画了大脑中的学习机制。
    Abstract While deep learning has achieved remarkable success, there is no clear explanation about why it works so well. In order to discuss this question quantitatively, we need a mathematical framework that explains what learning is in the first place. After several considerations, we succeeded in constructing a mathematical framework that can provide a unified understanding of all types of learning, including deep learning and learning in the brain. We call it learning principle, and it follows that all learning is equivalent to estimating the probability of input data. We not only derived this principle, but also mentioned its application to actual machine learning models. For example, we found that conventional supervised learning is equivalent to estimating conditional probabilities, and succeeded in making supervised learning more effective and generalized. We also proposed a new method of defining the values of estimated probability using differentiation, and showed that unsupervised learning can be performed on arbitrary dataset without any prior knowledge. Namely, this method is a general-purpose machine learning in the true sense. Moreover, we succeeded in describing the learning mechanism in the brain by considering the time evolution of a fully or partially connected model and applying this new method. The learning principle provides solutions to many unsolved problems in deep learning and cognitive neuroscience.
    摘要 尽管深度学习取得了显著成功,但它为何如此有效仍缺乏明确的解释。要定量地讨论这个问题,我们首先需要一个能够说明"学习究竟是什么"的数学框架。经过若干考量,我们成功构建了一个能统一理解包括深度学习和大脑学习在内的所有学习类型的数学框架。我们称之为学习原理,由此可得:一切学习都等价于估计输入数据的概率。我们不仅推导出了这一原理,还将其应用于实际的机器学习模型。例如,我们发现传统的监督学习等价于估计条件概率,并据此使监督学习更有效、更具泛化性。我们还提出了一种用微分来定义估计概率取值的新方法,并证明无需任何先验知识即可对任意数据集进行无监督学习,也就是说,这是一种真正意义上的通用机器学习。此外,通过考虑全连接或部分连接模型的时间演化并应用这一新方法,我们成功刻画了大脑中的学习机制。学习原理为深度学习和认知神经科学中的许多未解问题提供了答案。

Quantum learning and essential cognition under the traction of meta-characteristics in an open world

  • paper_url: http://arxiv.org/abs/2311.13335
  • repo_url: None
  • paper_authors: Jin Wang, Changlin Song
  • for: This paper aims to improve the ability of artificial intelligence (AI) to recognize and explore new knowledge in the Open World problem.
  • methods: The proposed model and elemental feature system focus on recognizing the distribution differences in objective features between the new and old worlds, using the quantum tunneling effect of learning ability and the tractive force of meta-characteristic.
  • results: The model system achieves outstanding performance in learning new knowledge, with an accuracy of up to 96.71% (using pedestrian re-identification datasets as an example), demonstrating that AI has acquired the ability to recognize the new world and explore new knowledge, similar to humans.
    Abstract Artificial intelligence has made significant progress in the Close World problem, being able to accurately recognize old knowledge through training and classification. However, AI faces significant challenges in the Open World problem, as it involves a new and unknown exploration journey. AI is not inherently proactive in exploration, and its challenge lies in not knowing how to approach and adapt to the unknown world. How do humans acquire knowledge of the unknown world. Humans identify new knowledge through intrinsic cognition. In the process of recognizing new colors, the cognitive cues are different from known color features and involve hue, saturation, brightness, and other characteristics. When AI encounters objects with different features in the new world, it faces another challenge: where are the distinguishing features between influential features of new and old objects? AI often mistakes a new world's brown bear for a known dog because it has not learned the differences in feature distributions between knowledge systems. This is because things in the new and old worlds have different units and dimensions for their features. This paper proposes an open-world model and elemental feature system that focuses on fundamentally recognizing the distribution differences in objective features between the new and old worlds. The quantum tunneling effect of learning ability in the new and old worlds is realized through the tractive force of meta-characteristic. The outstanding performance of the model system in learning new knowledge (using pedestrian re-identification datasets as an example) demonstrates that AI has acquired the ability to recognize the new world with an accuracy of $96.71\%$ at most and has gained the capability to explore new knowledge, similar to humans.
    摘要 人工智能在封闭世界问题上已取得重要进展,能够通过训练和分类准确识别已有知识。然而,AI 在开放世界问题上面临巨大挑战,因为这涉及一段全新而未知的探索旅程。AI 并不天生具备主动探索的能力,其难点在于不知道如何接近并适应未知世界。人类是如何获得未知世界的知识的?人类通过内在认知来识别新知识:在认识新颜色的过程中,认知线索不同于已知的颜色特征,涉及色相、饱和度、亮度等特性。当 AI 在新世界中遇到特征不同的对象时,又面临另一个挑战:新旧对象的关键特征之间的区分点在哪里?AI 常把新世界的棕熊误认为已知的狗,因为它没有学习过不同知识体系之间特征分布的差异:新旧世界中事物的特征具有不同的单位和维度。本文提出一种开放世界模型与元素特征体系,着眼于从根本上识别新旧世界客观特征的分布差异,并通过元特征的牵引力实现学习能力在新旧世界之间的量子隧穿效应。模型系统在学习新知识上的出色表现(以行人重识别数据集为例,准确率最高达 96.71%)表明,AI 已获得了识别新世界的能力,并像人类一样具备了探索新知识的能力。

Curriculum Learning and Imitation Learning for Model-free Control on Financial Time-series

  • paper_url: http://arxiv.org/abs/2311.13326
  • repo_url: None
  • paper_authors: Woosung Koh, Insu Choi, Yuntae Jang, Gimin Kang, Woo Chang Kim
  • for: 提升复杂时间序列数据上控制任务的性能
  • methods: 通过数据增强实现课程学习(curriculum learning),通过从 oracle 进行策略蒸馏实现模仿学习(imitation learning)
  • results: 发现课程学习是一个有前景的方向,而模仿学习的应用需要谨慎
    Abstract Curriculum learning and imitation learning have been leveraged extensively in the robotics domain. However, minimal research has been done on leveraging these ideas on control tasks over highly stochastic time-series data. Here, we theoretically and empirically explore these approaches in a representative control task over complex time-series data. We implement the fundamental ideas of curriculum learning via data augmentation, while imitation learning is implemented via policy distillation from an oracle. Our findings reveal that curriculum learning should be considered a novel direction in improving control-task performance over complex time-series. Our ample random-seed out-sample empirics and ablation studies are highly encouraging for curriculum learning for time-series control. These findings are especially encouraging as we tune all overlapping hyperparameters on the baseline -- giving an advantage to the baseline. On the other hand, we find that imitation learning should be used with caution.
    摘要 课程学习与模仿学习在机器人领域已得到广泛应用,然而在高度随机的时间序列数据上的控制任务中,利用这些思想的研究却寥寥无几。本文在一个具有代表性的复杂时间序列控制任务上,从理论与实证两方面探索这些方法:通过数据增强实现课程学习的基本思想,通过从 oracle 进行策略蒸馏实现模仿学习。我们的发现表明,课程学习应被视为提升复杂时间序列控制任务性能的一个新方向;大量随机种子外样实证与消融研究的结果令人鼓舞,尤其是我们把所有重叠的超参数都在基线上调优,这本是对基线有利的设置。另一方面,我们发现模仿学习的使用需要谨慎。

Probabilistic Inference in Reinforcement Learning Done Right

  • paper_url: http://arxiv.org/abs/2311.13294
  • repo_url: None
  • paper_authors: Jean Tarbouriech, Tor Lattimore, Brendan O’Donoghue
  • for: 这篇论文旨在把马尔可夫决策过程(MDP)中的强化学习问题正确地表述为概率推断,并提出相应的贝叶斯方法。
  • methods: 论文对状态-动作最优性的后验概率进行严格的贝叶斯处理,并推导出一种新的变分贝叶斯近似,将其转化为可求解的凸优化问题。
  • results: 研究者通过理论与实验表明,VAPOR 方法得到的策略能够高效探索(以遗憾度量),其深度 RL 版本亦展现出性能优势。
    Abstract A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.
    摘要 强化学习(RL)中一种流行的视角,是将问题视为在马尔可夫决策过程(MDP)图模型上的概率推断,其核心研究对象是最优策略下每个状态-动作对被访问的概率。以往近似该量的方法可能任意地差,导致算法并未实现真正的统计推断,因而在困难问题上表现不佳。在本文中,我们对状态-动作最优性的后验概率进行了严格的贝叶斯处理,并阐明它如何在 MDP 中传播。我们首先揭示,这一量确实可用于生成一个高效探索的策略(以遗憾度量)。遗憾的是,它难以精确计算,因此我们推导出一种新的变分贝叶斯近似,得到一个可求解的凸优化问题,并证明由此得到的策略同样能够高效探索。我们将这一方法称为 VAPOR,并说明它与 Thompson 采样、K-learning 以及最大熵探索有着紧密联系。最后,我们通过若干实验展示了深度 RL 版 VAPOR 的性能优势。

Algorithmic Transparency and Manipulation

  • paper_url: http://arxiv.org/abs/2311.13286
  • repo_url: https://github.com/Piyushrai558/voting-via-blockchain
  • paper_authors: Michael Klenk
  • for: 这篇论文旨在探讨算法透明性潜在的操纵风险。
  • methods: 论文辨析了对"操纵"的不同理解,指出近期相关文献所依赖的操纵概念存在问题。
  • results: 论文指出,"无所谓"(indifference)观点比"脆弱性"(vulnerability)观点更能解释算法透明性为何具有操纵潜力,并为未来研究算法透明性语境下的操纵提出了相关问题。
    Abstract A series of recent papers raises worries about the manipulative potential of algorithmic transparency. But while the concern is apt and relevant, it is based on a fraught understanding of manipulation. Therefore, this paper draws attention to the indifference view of manipulation, which explains better than the vulnerability view why algorithmic transparency has manipulative potential. The paper also raises pertinent research questions for future studies of manipulation in the context of algorithmic transparency.
    摘要 近期一系列论文对算法透明性的操纵潜力表示担忧。这种担忧切题且重要,但它建立在一种有问题的"操纵"理解之上。因此,本文引入"无所谓"(indifference)的操纵观:它比"脆弱性"(vulnerability)观点更能解释算法透明性为何具有操纵潜力。本文还为未来研究算法透明性语境下的操纵提出了相关的研究问题。

FedFN: Feature Normalization for Alleviating Data Heterogeneity Problem in Federated Learning

  • paper_url: http://arxiv.org/abs/2311.13267
  • repo_url: None
  • paper_authors: Seongyoon Kim, Gihun Lee, Jaehoon Oh, Se-Young Yun
  • for: 这篇论文旨在缓解联邦学习(FL)中的数据异质性问题,以提升模型训练的性能。
  • methods: 本研究提出带特征归一化更新的联邦平均方法(Federated Averaging with Feature Normalization Update, FedFN)来缓解数据异质性问题。
  • results: 大量实验显示 FedFN 性能更优,且同样适用于预训练的 ResNet18 以及基础模型。
    Abstract Federated Learning (FL) is a collaborative method for training models while preserving data privacy in decentralized settings. However, FL encounters challenges related to data heterogeneity, which can result in performance degradation. In our study, we observe that as data heterogeneity increases, feature representation in the FedAVG model deteriorates more significantly compared to classifier weight. Additionally, we observe that as data heterogeneity increases, the gap between higher feature norms for observed classes, obtained from local models, and feature norms of unobserved classes widens, in contrast to the behavior of classifier weight norms. This widening gap extends to encompass the feature norm disparities between local and the global models. To address these issues, we introduce Federated Averaging with Feature Normalization Update (FedFN), a straightforward learning method. We demonstrate the superior performance of FedFN through extensive experiments, even when applied to pretrained ResNet18. Subsequently, we confirm the applicability of FedFN to foundation models.
    摘要 联邦学习(FL)是一种在去中心化环境中保护数据隐私的协作训练方法,但它面临数据异质性带来的性能退化问题。在我们的研究中,我们观察到:随着数据异质性加剧,FedAVG 模型中特征表示的退化远比分类器权重显著;同时,本地模型中已观测类的特征范数与未观测类的特征范数之间的差距不断扩大,而分类器权重范数并无此现象;这种差距还进一步延伸到本地模型与全局模型之间的特征范数差异。为解决这些问题,我们提出一种简洁的学习方法:带特征归一化更新的联邦平均(FedFN)。大量实验表明 FedFN 性能更优,即便应用于预训练的 ResNet18 也是如此;随后我们还验证了 FedFN 对基础模型的适用性。
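
One way to realize the feature-normalization idea is to L2-normalize both the penultimate features and the classifier rows, so that no client can drift toward inflated norms for the classes it happens to observe. The module below is a hedged sketch of that reading, not FedFN's exact update rule; the scale factor is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

class NormalizedHead(torch.nn.Module):
    """Cosine classifier: normalized features against normalized weights."""
    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, features):
        f = F.normalize(features, dim=-1)      # unit-norm features
        w = F.normalize(self.weight, dim=-1)   # unit-norm class vectors
        return self.scale * f @ w.T            # norm-free logits
```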

Improved identification accuracy in equation learning via comprehensive $\boldsymbol{R^2}$-elimination and Bayesian model selection

  • paper_url: http://arxiv.org/abs/2311.13265
  • repo_url: None
  • paper_authors: Daniel Nickelsen, Bubacarr Bah
  • for: 方程学习(equation learning)中的模型识别:在保持搜索全面性的同时,提升识别准确率与效率。
  • methods: 我们提出一种新方法,以新颖的方式结合决定系数 $R^2$ 与贝叶斯模型证据 $p(\mathbf{y}|\mathcal{M})$,在每个迭代步骤只对模型空间做轻微裁剪,从而保持全面搜索;结合该方法的两种变体以及将 $p(\mathbf{y}|\mathcal{M})$ 用于双向逐步回归,我们共给出三条方程学习的新途径。
  • results: 在涉及随机多项式和动力系统的三组大规模数值实验中,与四种最先进方法和两种标准方法相比,我们的全面搜索方法在识别准确率上全面胜出;特别地,第二种变体仅基于 $R^2$ 便建立了高效的过拟合惩罚,取得了最高的方程精确恢复率。
    Abstract In the field of equation learning, exhaustively considering all possible equations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of multicollinearity poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude terms of the true equation, leading to reduced identification accuracy. In this article, we present an approach that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(\boldsymbol y|\mathcal M)$, in a novel way. Our procedure is characterized by a comprehensive search with just a minor reduction of the model space at each iteration step. With two flavors of our approach and the adoption of $p(\boldsymbol y|\mathcal M)$ for bi-directional stepwise regression, we present a total of three new avenues for equation learning. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our approach against four state-of-the-art methods and two standard approaches. The results demonstrate that our comprehensive search approach surpasses all other methods in terms of identification accuracy. In particular, the second flavor of our approach establishes an efficient overfitting penalty solely based on $R^2$, which achieves highest rates of exact equation recovery.
    摘要 在方程学习领域,穷举考虑由基函数词典导出的所有可能方程是不可行的。稀疏回归和贪心算法已成为应对这一挑战的流行方法。然而,多重共线性的存在给稀疏回归技术带来困难,而贪心步骤可能会无意中排除真实方程中的项,导致识别准确率下降。在本文中,我们提出一种在全面性与效率之间取得平衡的方程学习方法。受逐步回归的启发,我们的方法以新颖的方式结合决定系数 $R^2$ 与贝叶斯模型证据 $p(\mathbf{y}|\mathcal{M})$:在每个迭代步骤只对模型空间做轻微裁剪,从而保持全面搜索。结合该方法的两种变体以及将 $p(\mathbf{y}|\mathcal{M})$ 用于双向逐步回归,我们共给出三条方程学习的新途径。在涉及随机多项式和动力系统的三组大规模数值实验中,我们将该方法与四种最先进方法和两种标准方法进行了比较,结果表明我们的全面搜索方法在识别准确率上超越所有其他方法。特别地,第二种变体仅基于 $R^2$ 便建立了高效的过拟合惩罚,取得了最高的方程精确恢复率。
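
A toy version of the R²-scored stepwise search helps fix ideas: at each step, add the dictionary term that most improves R². The paper's method keeps several candidates per step and re-ranks them with the Bayesian evidence p(y|M); the sketch below (our simplification) shows only the R² part.

```python
import numpy as np

def r2(y, X, cols):
    """Coefficient of determination of a least-squares fit on columns `cols`."""
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def forward_stepwise(y, X, max_terms=3):
    """Greedy forward selection of dictionary terms scored by R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_terms:
        best = max(remaining, key=lambda j: r2(y, X, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```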

The Rise of Creative Machines: Exploring the Impact of Generative AI

  • paper_url: http://arxiv.org/abs/2311.13262
  • repo_url: None
  • paper_authors: Saad Shaikh, Rajat bendre, Sakshi Mhaske
  • for: 这项研究探讨生成式人工智能(AI)如何变革市场营销、产品开发和研究。
  • methods: 这篇论文介绍了该领域的最新进展、易于使用的资源,以及相关的道德与社会风险。
  • results: 论文强调负责任的开发:通过与利益相关方的持续沟通和伦理原则来应对问题,并探讨了针对偏见与虚假信息等风险的缓解技术。
    Abstract This study looks at how generative artificial intelligence (AI) can revolutionize marketing, product development, and research. It discusses the latest developments in the field, easy-to-use resources, and moral and social hazards. In addition to addressing mitigating techniques for issues like prejudice and disinformation, the debate emphasizes the significance of responsible development through continual stakeholder communication and ethical principles.
    摘要 这项研究探讨生成式人工智能(AI)如何变革市场营销、产品开发和研究。它讨论了该领域的最新进展、易于使用的资源,以及道德与社会层面的风险。此外,论文还论述了针对偏见与虚假信息等问题的缓解技术,并强调通过与利益相关方的持续沟通和伦理原则实现负责任的开发。

DA-STC: Domain Adaptive Video Semantic Segmentation via Spatio-Temporal Consistency

  • paper_url: http://arxiv.org/abs/2311.13254
  • repo_url: https://github.com/ZHE-SAPI/STCL
  • paper_authors: Zhe Zhang, Gaochang Wu, Jing Zhang, Chunhua Shen, Dacheng Tao, Tianyou Chai
  • for: 这篇论文主要解决视频语义分割中的域偏移问题,即如何在有标注的源域与无标注的目标域之间学习域不变的时空特征。
  • methods: 论文提出 DA-STC 方法,包含一个双向多层次时空融合模块和一个类别感知的时空特征对齐模块,以促进域不变特征的一致学习。
  • results: 实验结果表明,DA-STC 在多个具有挑战性的基准上取得了最先进的 mIOU;该方法还可扩展到图像域的域自适应语义分割。代码和模型将发布于 \url{https://github.com/ZHE-SAPI/DA-STC}。
    Abstract Video semantic segmentation is a pivotal aspect of video representation learning. However, significant domain shifts present a challenge in effectively learning invariant spatio-temporal features across the labeled source domain and unlabeled target domain for video semantic segmentation. To solve the challenge, we propose a novel DA-STC method for domain adaptive video semantic segmentation, which incorporates a bidirectional multi-level spatio-temporal fusion module and a category-aware spatio-temporal feature alignment module to facilitate consistent learning for domain-invariant features. Firstly, we perform bidirectional spatio-temporal fusion at the image sequence level and shallow feature level, leading to the construction of two fused intermediate video domains. This prompts the video semantic segmentation model to consistently learn spatio-temporal features of shared patch sequences which are influenced by domain-specific contexts, thereby mitigating the feature gap between the source and target domain. Secondly, we propose a category-aware feature alignment module to promote the consistency of spatio-temporal features, facilitating adaptation to the target domain. Specifically, we adaptively aggregate the domain-specific deep features of each category along spatio-temporal dimensions, which are further constrained to achieve cross-domain intra-class feature alignment and inter-class feature separation. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art mIOUs on multiple challenging benchmarks. Furthermore, we extend the proposed DA-STC to the image domain, where it also exhibits superior performance for domain adaptive semantic segmentation. The source code and models will be made available at \url{https://github.com/ZHE-SAPI/DA-STC}.
    摘要 视频语义分割是视频表示学习的关键环节。然而,显著的域偏移使得在有标注的源域和无标注的目标域之间有效学习域不变的时空特征成为挑战。为此,我们提出一种新颖的 DA-STC 方法用于域自适应视频语义分割,其中包含双向多层次时空融合模块和类别感知的时空特征对齐模块,以促进域不变特征的一致学习。首先,我们在图像序列层面和浅层特征层面进行双向时空融合,构建出两个融合的中间视频域,促使视频语义分割模型一致地学习受域特定上下文影响的共享 patch 序列的时空特征,从而缩小源域与目标域之间的特征差距。其次,我们提出类别感知的特征对齐模块,以促进时空特征的一致性、便于向目标域适应:具体而言,我们沿时空维度自适应地聚合每个类别的域特定深层特征,并进一步约束其实现跨域的类内特征对齐与类间特征分离。大量实验证明了方法的有效性,在多个具有挑战性的基准上取得了最先进的 mIOU。此外,我们将 DA-STC 扩展到图像域,它在域自适应语义分割上同样表现优异。代码和模型将发布于 \url{https://github.com/ZHE-SAPI/DA-STC}。

@ve: A Chatbot for Latin

  • paper_url: http://arxiv.org/abs/2311.14741
  • repo_url: None
  • paper_authors: Oliver Bendel, Karim N’diaye
  • for: 保护与推广死亡、灭绝及濒危语言
  • methods: 传统手段包括音频保存、文本的收集与数字化,以及有针对性的语言习得推广;本文则构建能够掌握这些语言的对话代理
  • results: 开发了可用拉丁语交流的聊天机器人 @ve,它有望成为一种通过对话进行、令人印象深刻且有趣的拉丁语教学新工具;但现有实现仍易出错,尚不宜脱离教师单独使用,改用 GPT-4 及扩充知识库可能是解决之道
    Abstract Dead, extinct, and endangered languages have been preserved primarily through audio conservation and the collection and digitization of scripts and have been promoted through targeted language acquisition efforts. Another possibility would be to build conversational agents that can master these languages. This would provide an artificial, active conversational partner which has knowledge of the vocabulary and grammar, and one learns with it in a different way. The chatbot @ve, with which one can communicate in Latin, was developed in 2022/2023 based on GPT-3.0. It was additionally equipped with a manually created knowledge base. After conceptual groundwork, this paper presents the preparation and implementation of the project. In addition, it summarizes the test that a Latin expert conducted with the chatbot. A critical discussion elaborates advantages and disadvantages. @ve could be a new tool for teaching Latin in a memorable and entertaining way through dialogue. However, the present implementation is still too prone to glitches for stand-alone use - i.e., without the accompaniment of a teacher. The use of GPT-4 could be a solution as well as the extension of the knowledge base. In conclusion, it can be argued that conversational agents are an innovative approach to promoting and preserving languages.
    摘要 死亡、灭绝和濒危语言的保护,主要依靠音频保存以及文本的收集与数字化,并通过有针对性的语言习得活动加以推广。另一种可能是构建能够掌握这些语言的对话代理:它提供一个了解词汇和语法的人工的、主动的对话伙伴,使人以另一种方式随之学习。可用拉丁语交流的聊天机器人 @ve 于 2022/2023 年基于 GPT-3.0 开发,并额外配备了人工构建的知识库。在概念铺垫之后,本文介绍了项目的准备与实施,总结了一位拉丁语专家对该聊天机器人的测试,并通过批判性讨论阐述其优缺点。@ve 有望成为一种通过对话进行、令人印象深刻且有趣的拉丁语教学新工具;但目前的实现仍过于容易出错,尚不适合脱离教师陪同单独使用。改用 GPT-4 以及扩充知识库可能是解决之道。总之,可以认为对话代理是推广和保护语言的一种创新途径。

TSegFormer: 3D Tooth Segmentation in Intraoral Scans with Geometry Guided Transformer

  • paper_url: http://arxiv.org/abs/2311.13234
  • repo_url: https://github.com/huiminxiong/tsegformer
  • paper_authors: Huimin Xiong, Kunle Li, Kaiyuan Tan, Yang Feng, Joey Tianyi Zhou, Jin Hao, Haochao Ying, Jian Wu, Zuozhu Liu
  • for: 该研究旨在提升数字牙科中光学口内扫描(IOS)数据的 3D 牙齿分割精度,以支撑各类牙科应用。
  • methods: 研究提出 TSegFormer:一种基于多任务 3D Transformer 架构的分割方法,能同时刻画不同牙齿与牙龈之间的局部与全局依赖关系;此外还设计了基于新颖点曲率的几何引导损失,以端到端方式细化边界,避免耗时的后处理。
  • results: 实验结果表明,TSegFormer 始终优于现有的最先进基线,其优越性经由大量分析、可视化和真实临床适用性测试得到佐证。
    Abstract Optical Intraoral Scanners (IOS) are widely used in digital dentistry to provide detailed 3D information of dental crowns and the gingiva. Accurate 3D tooth segmentation in IOSs is critical for various dental applications, while previous methods are error-prone at complicated boundaries and exhibit unsatisfactory results across patients. In this paper, we propose TSegFormer which captures both local and global dependencies among different teeth and the gingiva in the IOS point clouds with a multi-task 3D transformer architecture. Moreover, we design a geometry-guided loss based on a novel point curvature to refine boundaries in an end-to-end manner, avoiding time-consuming post-processing to reach clinically applicable segmentation. In addition, we create a dataset with 16,000 IOSs, the largest ever IOS dataset to the best of our knowledge. The experimental results demonstrate that our TSegFormer consistently surpasses existing state-of-the-art baselines. The superiority of TSegFormer is corroborated by extensive analysis, visualizations and real-world clinical applicability tests. Our code is available at https://github.com/huiminxiong/TSegFormer.

A Survey of Adversarial CAPTCHAs on its History, Classification and Generation

  • paper_url: http://arxiv.org/abs/2311.13233
  • repo_url: None
  • paper_authors: Zisheng Xu, Qiao Yan, F. Richard Yu, Victor C. M. Leung
  • for: This survey examines adversarial CAPTCHAs, which integrate adversarial examples into CAPTCHA images so that websites and applications are protected against automated attacks by bots.
  • methods: It reviews commonly used methods for generating adversarial examples, along with methods that have been successfully applied to generate adversarial CAPTCHAs, and analyzes defense methods against adversarial CAPTCHAs together with the threats they pose.
  • results: Combining adversarial examples with CAPTCHAs yields images that deep learning models cannot recognize correctly; the survey also classifies adversarial CAPTCHAs and identifies possible future research directions.
    Abstract Completely Automated Public Turing test to tell Computers and Humans Apart, short for CAPTCHA, is an essential and relatively easy way to defend against malicious attacks implemented by bots. The security and usability trade-off limits the use of massive geometric transformations to interfere deep model recognition and deep models even outperformed humans in complex CAPTCHAs. The discovery of adversarial examples provides an ideal solution to the security and usability trade-off by integrating adversarial examples and CAPTCHAs to generate adversarial CAPTCHAs that can fool the deep models. In this paper, we extend the definition of adversarial CAPTCHAs and propose a classification method for adversarial CAPTCHAs. Then we systematically review some commonly used methods to generate adversarial examples and methods that are successfully used to generate adversarial CAPTCHAs. Also, we analyze some defense methods that can be used to defend adversarial CAPTCHAs, indicating potential threats to adversarial CAPTCHAs. Finally, we discuss some possible future research directions for adversarial CAPTCHAs at the end of this paper.
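Many of the generation methods surveyed start from gradient-based adversarial examples. As a point of reference, here is a minimal PyTorch sketch of the fast gradient sign method (FGSM), one of the classic techniques such surveys cover; the model, label, and epsilon budget are hypothetical placeholders, not artifacts of this particular survey.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=8 / 255):
    """One-step FGSM: move the input along the sign of the loss gradient.

    Adversarial CAPTCHAs apply perturbations like this to clean CAPTCHA
    images so that deep recognizers misclassify them while humans can
    still read them.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Perturb in the direction that increases the loss, then clamp to [0, 1].
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```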

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

  • paper_url: http://arxiv.org/abs/2311.13231
  • repo_url: https://github.com/yk7333/d3po
  • paper_authors: Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li
  • for: fine-tuning diffusion models with human feedback
  • methods: Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method, eliminates the need for a reward model, using relative scale of objectives as a proxy for human preference
  • results: comparable results to methods using ground-truth rewards, reduces image distortion rates, generates safer images
    Abstract Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available in https://github.com/yk7333/D3PO/tree/main.
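D3PO's key move is applying a DPO-style preference objective directly to denoising steps, so no reward model is trained. The following is a simplified, hypothetical sketch of such a loss in PyTorch, assuming per-step log-probabilities for preferred and dispreferred denoising actions are available from the policy and a frozen reference model; the actual implementation lives in the authors' repository.

```python
import torch
import torch.nn.functional as F

def d3po_style_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style objective on denoising steps (illustrative sketch).

    logp_* are log-probabilities the current policy assigns to the
    human-preferred (win) and dispreferred (lose) denoising actions;
    ref_logp_* come from a frozen reference model.
    """
    ratio_win = logp_win - ref_logp_win
    ratio_lose = logp_lose - ref_logp_lose
    # Maximize the margin between preferred and dispreferred trajectories.
    return -F.logsigmoid(beta * (ratio_win - ratio_lose)).mean()
```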

Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus

  • paper_url: http://arxiv.org/abs/2311.13230
  • repo_url: https://github.com/zthang/focus
  • paper_authors: Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, Luoyi Fu
  • for: This work investigates how to detect hallucinations in LLMs, improving their reliability in real-world applications.
  • methods: It proposes a novel reference-free, uncertainty-based method that imitates how humans focus when fact-checking, from three aspects: 1) the most informative and important keywords in the given text; 2) unreliable tokens in the historical context that may trigger a cascade of hallucinations; and 3) token properties such as token type and token frequency.
  • results: Experiments show the proposed method detects hallucinations accurately, achieving state-of-the-art performance across all evaluation metrics without requiring any additional information, which makes it markedly cheaper than reference-retrieval or multi-sample consistency approaches.
    Abstract Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple responses from the LLM for consistency verification, making these methods costly and inefficient. In this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs. Our approach imitates human focus in factuality checking from three aspects: 1) focus on the most informative and important keywords in the given text; 2) focus on the unreliable tokens in historical context which may lead to a cascade of hallucinations; and 3) focus on the token properties such as token type and token frequency. Experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
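The first "focus" above, keyword importance, can be illustrated with a toy score: weight each generated token's negative log-probability by a salience weight so that uncertainty on informative keywords dominates the hallucination score. The sketch below is an illustrative simplification, not the paper's exact formulation; the log-probabilities and weights are hypothetical inputs.

```python
def keyword_weighted_uncertainty(token_logprobs, keyword_weights):
    """Aggregate token-level uncertainty, emphasizing salient keywords.

    token_logprobs: log p(token) under the LLM for each generated token.
    keyword_weights: importance in [0, 1] per token (e.g., from tf-idf
    or attention); hypothetical inputs for illustration.
    """
    scores = [-lp * w for lp, w in zip(token_logprobs, keyword_weights)]
    total_weight = sum(keyword_weights) or 1.0
    return sum(scores) / total_weight  # higher => more likely hallucinated

# Toy usage: the rare named entity carries most of the weight.
logprobs = [-0.1, -0.2, -3.5, -0.3]  # e.g., "The", "capital", "Atlantis", "is"
weights = [0.05, 0.3, 1.0, 0.05]
print(keyword_weighted_uncertainty(logprobs, weights))
```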

Robot at the Mirror: Learning to Imitate via Associating Self-supervised Models

  • paper_url: http://arxiv.org/abs/2311.13226
  • repo_url: https://github.com/andylucny/learningimitation
  • paper_authors: Andrej Lúčny, Kristína Malinovská, Igor Farkaš
  • for: The goal is to build a custom model from ready-made self-supervised models by associating them, without training or fine-tuning.
  • methods: The authors associate the pre-trained models by mapping the latent space of one set of feature vectors onto another, via a sample-efficient self-exploration of the robot at the mirror.
  • results: With few samples and no training time, the robot obtains a high-quality 3D pose detector, which can then be tuned and systematically evaluated by deploying the model to a simulated robot.
    Abstract We introduce an approach to building a custom model from ready-made self-supervised models via their associating instead of training and fine-tuning. We demonstrate it with an example of a humanoid robot looking at the mirror and learning to detect the 3D pose of its own body from the image it perceives. To build our model, we first obtain features from the visual input and the postures of the robot's body via models prepared before the robot's operation. Then, we map their corresponding latent spaces by a sample-efficient robot's self-exploration at the mirror. In this way, the robot builds the solicited 3D pose detector, whose quality is immediately perfect on the acquired samples instead of improving gradually. The mapping, which employs associating the pairs of feature vectors, is then implemented in the same way as the key-value mechanism of the famous transformer models. Finally, deploying our model for imitation to a simulated robot allows us to study, tune up, and systematically evaluate its hyperparameters without the involvement of the human counterpart, advancing our previous research.
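The association step works like the key-value mechanism of transformers: feature pairs gathered during self-exploration act as keys and values, and a new visual feature retrieves a similarity-weighted blend of stored posture features. A minimal sketch under those assumptions (the dimensions and softmax retrieval are illustrative, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

class AssociativeMap:
    """Key-value association between two latent spaces (sketch)."""

    def __init__(self, keys, values, temperature=0.1):
        # keys: visual features collected at the mirror, (N, d_vis)
        # values: matching posture features, (N, d_pose)
        self.keys = F.normalize(keys, dim=-1)
        self.values = values
        self.temperature = temperature

    def __call__(self, query):
        # Attention-style retrieval: similarity-weighted blend of values.
        sims = F.normalize(query, dim=-1) @ self.keys.T
        weights = torch.softmax(sims / self.temperature, dim=-1)
        return weights @ self.values

# Toy usage with random stand-in features.
keys, values = torch.randn(100, 64), torch.randn(100, 16)
mapper = AssociativeMap(keys, values)
print(mapper(torch.randn(1, 64)).shape)  # -> torch.Size([1, 16])
```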

Artificial Intelligence in the Service of Entrepreneurial Finance: Knowledge Structure and the Foundational Algorithmic Paradigm

  • paper_url: http://arxiv.org/abs/2311.13213
  • repo_url: None
  • paper_authors: Robert Kudelić, Tamara Šmaguc, Sherry Robinson
  • For: The paper is written to explore the potential of Artificial Intelligence in Entrepreneurship, specifically in the field of Entrepreneurial Finance.
  • Methods: The paper uses a bibliometric review of relevant journal articles to analyze the current state of research in the field, covering methods such as Artificial Neural Networks, Deep Neural Networks, Support Vector Machines, Topic Modeling, Fuzzy Neural Networks, and Growing Hierarchical Self-organizing Maps.
  • Results: The paper identifies nascent and underdeveloped research directions in the field, and finds that Artificial Neural Networks, Deep Neural Networks, and Support Vector Machines are highly represented in almost all identified topic niches, while the use of other methods such as Topic Modeling, Fuzzy Neural Networks, and Growing Hierarchical Self-organizing Maps is quite rare.
    Abstract While the application of Artificial Intelligence in Finance has a long tradition, its potential in Entrepreneurship has been intensively explored only recently. In this context, Entrepreneurial Finance is a particularly fertile ground for future Artificial Intelligence proliferation. To support the latter, the study provides a bibliometric review of Artificial Intelligence applications in (1) entrepreneurial finance literature, and (2) corporate finance literature with implications for Entrepreneurship. Rigorous search and screening procedures of the scientific database Web of Science Core Collection resulted in the identification of 1890 relevant journal articles subjected to analysis. The bibliometric analysis gives a rich insight into the knowledge field's conceptual, intellectual, and social structure, indicating nascent and underdeveloped research directions. As far as we were able to identify, this is the first study to map and bibliometrically analyze the academic field concerning the relationship between Artificial Intelligence, Entrepreneurship, and Finance, and the first review that deals with Artificial Intelligence methods in Entrepreneurship. According to the results, Artificial Neural Network, Deep Neural Network and Support Vector Machine are highly represented in almost all identified topic niches. At the same time, applying Topic Modeling, Fuzzy Neural Network and Growing Hierarchical Self-organizing Map is quite rare. As an element of the research, and before final remarks, the article deals as well with a discussion of certain gaps in the relationship between Computer Science and Economics. These gaps do represent problems in the application of Artificial Intelligence in Economic Science. As a way to at least in part remedy this situation, the foundational paradigm and the bespoke demonstration of the Monte Carlo randomized algorithm are presented.
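The article closes with a demonstration of the Monte Carlo randomized algorithm as a foundational paradigm. The paper's bespoke demonstration is not reproduced here; the sketch below shows only the generic Monte Carlo idea, estimating a quantity from random samples with error that shrinks as the sample count grows.

```python
import random

def monte_carlo_pi(n_samples=1_000_000, seed=42):
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction landing inside the quarter circle approaches pi/4,
    illustrating the Monte Carlo paradigm: trade exactness for a fast,
    probabilistically accurate estimate.
    """
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(monte_carlo_pi())  # ~3.1415, error shrinks as O(1/sqrt(n))
```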

Breast Cancer classification by adaptive weighted average ensemble of previously trained models

  • paper_url: http://arxiv.org/abs/2311.13206
  • repo_url: None
  • paper_authors: Mosab S. M. Farea, zhe chen
  • for: Research on techniques for detecting breast cancer.
  • methods: A CAD system for histopathology images that combines already fully trained models through an adaptive weighted average ensemble applied after training; this differs from the literature, where the average ensemble is formed before training and trained simultaneously.
  • results: Improves the evaluation metrics, achieving 98% accuracy, 1 percentage point above the best participating model (97%), while reducing false positives and false negatives.
    Abstract Breast cancer is a serious disease that afflicts millions of people each year, and the number of cases is increasing. Early detection is the best way to reduce the impact of the disease. Researchers have developed many techniques to detect breast cancer, including the use of histopathology images in CAD systems. This research proposes a technique that combines already fully trained models using an adaptive average ensemble; this differs from the literature, which uses an average ensemble formed before training and trained simultaneously. Our approach applies the adaptive average ensemble after training, which improves the evaluation metrics. It averages the outputs of every trained model, and every model is weighted according to its accuracy. The adaptive weighted ensemble model achieves 98% accuracy, 1 percentage point better than the best participating model in the ensemble (97%). It also reduces the number of false positives and false negatives, enhancing the performance metrics.
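The ensembling rule is straightforward to sketch: average the trained models' predicted probabilities, weighting each model by its accuracy. The minimal NumPy sketch below assumes the per-model probabilities and validation accuracies are given; it illustrates the idea and is not the paper's code.

```python
import numpy as np

def adaptive_weighted_average(probabilities, accuracies):
    """Combine per-model class probabilities, weighted by accuracy.

    probabilities: list of arrays of shape (n_samples, n_classes),
    one per already-trained model; accuracies: validation accuracy
    of each model, used as its ensemble weight.
    """
    weights = np.asarray(accuracies, dtype=float)
    weights /= weights.sum()  # normalize so weights sum to 1
    stacked = np.stack(probabilities)            # (n_models, n_samples, n_classes)
    blended = np.tensordot(weights, stacked, 1)  # weighted average over models
    return blended.argmax(axis=1)                # final class predictions

# Toy usage: three models, two classes, four samples.
probs = [np.random.dirichlet([1, 1], size=4) for _ in range(3)]
print(adaptive_weighted_average(probs, accuracies=[0.97, 0.95, 0.96]))
```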

Cracking the Code of Negative Transfer: A Cooperative Game Theoretic Approach for Cross-Domain Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2311.13188
  • repo_url: None
  • paper_authors: Chung Park, Taesan Kim, Taekyoon Choi, Junui Hong, Yelim Yu, Mincheol Cho, Kyunam Lee, Sungil Ryu, Hyungjun Yoon, Minsung Choi, Jaegul Choo
  • for: This paper investigates Cross-Domain Sequential Recommendation (CDSR), which uses information from multiple domains (more than three) to generate accurate and diverse recommendations while accounting for the sequential nature of user interactions.
  • methods: A new CDSR framework that tackles negative transfer across domains by estimating how much negative transfer each domain causes and adaptively assigning low weights to the corresponding prediction losses. It also proposes a hierarchical contrastive learning method that combines coarse-grained (category-level) sequence information with fine-grained (item-level) information to further mitigate negative transfer.
  • results: The model outperforms prior work on two real-world datasets spanning ten domains.
    Abstract This paper investigates Cross-Domain Sequential Recommendation (CDSR), a promising method that uses information from multiple domains (more than three) to generate accurate and diverse recommendations, and takes into account the sequential nature of user interactions. The effectiveness of these systems often depends on the complex interplay among the multiple domains. In this dynamic landscape, the problem of negative transfer arises, where heterogeneous knowledge between dissimilar domains leads to performance degradation due to differences in user preferences across these domains. As a remedy, we propose a new CDSR framework that addresses the problem of negative transfer by assessing the extent of negative transfer from one domain to another and adaptively assigning low weight values to the corresponding prediction losses. To this end, the amount of negative transfer is estimated by measuring the marginal contribution of each domain to model performance based on a cooperative game theory. In addition, a hierarchical contrastive learning approach that incorporates information from the sequence of coarse-level categories into that of fine-level categories (e.g., item level) when implementing contrastive learning was developed to mitigate negative transfer. Despite the potentially low relevance between domains at the fine-level, there may be higher relevance at the category level due to its generalised and broader preferences. We show that our model is superior to prior works in terms of model performance on two real-world datasets across ten different domains.
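The negative-transfer estimate rests on a cooperative-game view of each domain's marginal contribution to model performance. Below is a minimal sketch of that idea as a Shapley-style Monte Carlo estimate over random domain subsets; the `evaluate` callback (train and score a model on a subset of domains) is a hypothetical stand-in for the paper's procedure.

```python
import random

def marginal_contribution(domain, domains, evaluate, n_permutations=200, seed=0):
    """Shapley-style estimate of one domain's marginal contribution.

    evaluate(subset) -> performance of a model trained on that subset
    of domains (hypothetical and expensive in practice). Averaging the
    gain from adding `domain` over random subsets approximates its
    cooperative-game value; a low or negative value signals negative
    transfer, so the corresponding loss can be down-weighted.
    """
    rng = random.Random(seed)
    others = [d for d in domains if d != domain]
    total = 0.0
    for _ in range(n_permutations):
        k = rng.randint(0, len(others))
        subset = frozenset(rng.sample(others, k))
        total += evaluate(subset | {domain}) - evaluate(subset)
    return total / n_permutations
```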

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

  • paper_url: http://arxiv.org/abs/2311.13171
  • repo_url: https://github.com/prateeky2806/compeft
  • paper_authors: Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal
  • for: Compressing parameter-efficient fine-tuning (PEFT) models so that experts can be retrieved cheaply over high-latency networks and multiple experts can be served on a single GPU
  • methods: Uses sparsification and ternary quantization to shrink PEFT modules without any additional retraining, while preserving or enhancing model performance
  • results: Achieves compression ratios of 8x-50x on T5, T0, and LLaMA-based models, with stronger models exhibiting higher compressibility and better performance
    Abstract Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5, T0, and LLaMA-based models with 200M - 65B parameters, ComPEFT achieves compression ratios of 8x - 50x. In particular, we show that ComPEFT improves with scale - stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare it with other PEFT methods, and test ComPEFT's efficacy for compressing the residual of full-finetuning. Our code is available at https://github.com/prateeky2806/compeft.
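ComPEFT compresses the fine-tuning residual (task vector) via sparsification plus ternary quantization. A minimal PyTorch sketch of that recipe follows; the top-k magnitude pruning and the single shared scale are simplifications for illustration, not the paper's exact scheme.

```python
import torch

def compress_task_vector(residual, density=0.1):
    """Sparsify + ternarize a fine-tuning residual (illustrative sketch).

    residual: finetuned_weights - base_weights, as a 1-D tensor.
    Keeps only the top `density` fraction of entries by magnitude,
    then stores just their signs plus one shared scale, so the update
    compresses to a sparse ternary {-s, 0, +s} tensor.
    """
    n = residual.numel()
    k = max(1, int(density * n))
    threshold = residual.abs().flatten().kthvalue(n - k + 1).values
    mask = residual.abs() >= threshold
    scale = residual[mask].abs().mean()    # one scalar per tensor
    ternary = torch.sign(residual) * mask  # entries in {-1, 0, +1}
    return ternary, scale                  # reconstruct as ternary * scale

delta = torch.randn(1000)
tern, s = compress_task_vector(delta)
print((tern != 0).float().mean().item(), s.item())
```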

SiGeo: Sub-One-Shot NAS via Information Theory and Geometry of Loss Landscape

  • paper_url: http://arxiv.org/abs/2311.13169
  • repo_url: None
  • paper_authors: Hua Zheng, Kuang-Hung Liu, Igor Fedorov, Xin Zhang, Wen-Yen Chen, Wei Wen
  • for: automating neural network design, particularly in complex domains like RecSys
  • methods: combines zero-shot and one-shot NAS methods, using a “sub-one-shot” paradigm and a novel theoretical framework called SiGeo
  • results: outperforms state-of-the-art NAS proxies on various established NAS benchmarks, with a significant reduction in computational costs compared to weight-sharing one-shot NAS methods
    Abstract Neural Architecture Search (NAS) has become a widely used tool for automating neural network design. While one-shot NAS methods have successfully reduced computational requirements, they often require extensive training. On the other hand, zero-shot NAS utilizes training-free proxies to evaluate a candidate architecture's test performance but has two limitations: (1) inability to use the information gained as a network improves with training and (2) unreliable performance, particularly in complex domains like RecSys, due to the multi-modal data inputs and complex architecture configurations. To synthesize the benefits of both methods, we introduce a "sub-one-shot" paradigm that serves as a bridge between zero-shot and one-shot NAS. In sub-one-shot NAS, the supernet is trained using only a small subset of the training data, a phase we refer to as "warm-up." Within this framework, we present SiGeo, a proxy founded on a novel theoretical framework that connects the supernet warm-up with the efficacy of the proxy. Extensive experiments have shown that SiGeo, with the benefit of warm-up, consistently outperforms state-of-the-art NAS proxies on various established NAS benchmarks. When a supernet is warmed up, it can achieve comparable performance to weight-sharing one-shot NAS methods, but with a significant reduction ($\sim 60$\%) in computational costs.

Multimodal Large Language Models: A Survey

  • paper_url: http://arxiv.org/abs/2311.13165
  • repo_url: https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving
  • paper_authors: Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, Philip S. Yu
  • for: This paper surveys the development and applications of multimodal large language models, which integrate data types such as images, text, language, and audio.
  • methods: It introduces the historical development of multimodal algorithms and the products of major technology companies, and provides a practical guide plus a compilation of the latest algorithms and commonly used datasets for experimentation and evaluation.
  • results: It summarizes the applications of multimodal models and the challenges in their development, with practical scenarios that illustrate their potential across domains.
    Abstract The exploration of multimodal language models integrates multiple data types, such as images, text, language, audio, and other heterogeneity. While the latest large language models excel in text-based tasks, they often struggle to understand and process other data types. Multimodal models address this limitation by combining various modalities, enabling a more comprehensive understanding of diverse data. This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms. Furthermore, we introduce a range of multimodal products, focusing on the efforts of major technology companies. A practical guide is provided, offering insights into the technical aspects of multimodal models. Moreover, we present a compilation of the latest algorithms and commonly used datasets, providing researchers with valuable resources for experimentation and evaluation. Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development. By addressing these aspects, this paper aims to facilitate a deeper understanding of multimodal models and their potential in various domains.

Large Language Models in Education: Vision and Opportunities

  • paper_url: http://arxiv.org/abs/2311.13160
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Wensheng Gan, Zhenlian Qi, Jiayang Wu, Jerry Chun-Wei Lin
  • for: This study surveys and summarizes the application of large language models (LLMs) in smart education.
  • methods: The article introduces the research background and motivation of LLMs, discusses the relationship between digital education and educational large models (EduLLMs), and summarizes the current research status of educational large models.
  • results: It provides a systematic summary and vision that helps educators, researchers, and policy-makers understand the potential and challenges of LLM4Edu, along with guidance for further advancing its development and application.
    Abstract With the rapid development of artificial intelligence technology, large language models (LLMs) have become a hot research topic. Education plays an important role in human social development and progress. Traditional education faces challenges such as individual student differences, insufficient allocation of teaching resources, and assessment of teaching effectiveness. Therefore, the applications of LLMs in the field of digital/smart education have broad prospects. The research on educational large models (EduLLMs) is constantly evolving, providing new methods and approaches to achieve personalized learning, intelligent tutoring, and educational assessment goals, thereby improving the quality of education and the learning experience. This article aims to investigate and summarize the application of LLMs in smart education. It first introduces the research background and motivation of LLMs and explains the essence of LLMs. It then discusses the relationship between digital education and EduLLMs and summarizes the current research status of educational large models. The main contributions are the systematic summary and vision of the research background, motivation, and application of large models for education (LLM4Edu). By reviewing existing research, this article provides guidance and insights for educators, researchers, and policy-makers to gain a deep understanding of the potential and challenges of LLM4Edu. It further provides guidance for further advancing the development and application of LLM4Edu, while still facing technical, ethical, and practical challenges requiring further research and exploration.

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

  • paper_url: http://arxiv.org/abs/2311.13614
  • repo_url: https://github.com/yuqifan1117/hallucidoctor
  • paper_authors: Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang
  • for: This work investigates the hallucinations that machine-generated instruction-following data induce in multi-modal large language models (MLLMs) and how to mitigate that hallucinatory toxicity.
  • methods: It uses HalluciDoctor, a cross-checking-based hallucination detection and elimination framework, to automatically identify and remove hallucinations from the training data, and expands the data counterfactually to balance long-tail object co-occurrences.
  • results: HalluciDoctor successfully mitigates 44.6% of hallucinations relatively while maintaining performance competitive with LLaVA.
    Abstract Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.The source code will be released at \url{https://github.com/Yuqifan1117/HalluciDoctor}.

Building the Future of Responsible AI: A Reference Architecture for Designing Large Language Model based Agents

  • paper_url: http://arxiv.org/abs/2311.13148
  • repo_url: None
  • paper_authors: Qinghua Lu, Liming Zhu, Xiwei Xu, Zhenchang Xing, Stefan Harrer, Jon Whittle
  • for: This paper provides design guidance for foundation-model-based autonomous agents so that responsible AI can be achieved by design.
  • methods: It proposes a pattern-oriented reference architecture as the design guidance, and maps it to the architectures of two real-world foundation-model-based agents to evaluate its completeness and utility.
  • results: The mapping to two real-world agents demonstrates that the reference architecture is complete and useful as responsible-AI-by-design guidance for foundation-model-based autonomous agents.
    Abstract Large language models (LLMs) have been widely recognised as transformative artificial generative intelligence (AGI) technologies due to their capabilities to understand and generate content, including plans with reasoning capabilities. Foundation model based agents derive their autonomy from the capabilities of foundation models, which enable them to autonomously break down a given goal into a set of manageable tasks and orchestrate task execution to meet the goal. Despite the huge efforts put into building foundation model based autonomous agents, the architecture design of the agents has not yet been systematically explored. Also, while there are significant benefits of using autonomous agents for planning and execution, there are serious considerations regarding responsible AI related software quality attributes, such as security and accountability. Therefore, this paper presents a pattern-oriented reference architecture that serves as architecture design guidance and enables responsible-AI-by-design when designing foundation model based autonomous agents. We evaluate the completeness and utility of the proposed reference architecture by mapping it to the architecture of two real-world agents.

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

  • paper_url: http://arxiv.org/abs/2311.13133
  • repo_url: https://github.com/97aditi/LIMIT
  • paper_authors: Aditi Jha, Sam Havens, Jeremey Dohmann, Alex Trott, Jacob Portes
  • for: This study examines whether a small number of diverse finetuning samples can improve performance on both traditional NLP benchmarks and open-ended, model-based evaluation.
  • methods: The authors finetune the open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes, ranging from 1k to 60k samples.
  • results: Subsets of 1k-6k instruction finetuning samples are sufficient for good performance on both traditional NLP benchmarks and model-based evaluation, and mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both.
    Abstract Large Language Models are traditionally finetuned on large instruction datasets. However recent studies suggest that small, high-quality datasets can suffice for general purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small amount of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks, and on open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.

Conditions for Length Generalization in Learning Reasoning Skills

  • paper_url: http://arxiv.org/abs/2311.16173
  • repo_url: None
  • paper_authors: Changnan Xiao, Bing Liu
  • for: This paper studies the theoretical foundations of the reasoning capabilities of AI agents.
  • methods: It examines reasoning tasks that can be formulated as Markov dynamic processes (MDPs) and/or directed acyclic graphs (DAGs), identifying and proving conditions that decide whether length generalization is solvable for a reasoning task in a particular representation.
  • results: It characterizes the length generalization problem, where models trained on reasoning problems of smaller lengths or sizes struggle with larger ones, suggesting theoretical limits on generalization in learning reasoning skills; experiments verify the theoretical results.
    Abstract Reasoning is a fundamental capability of AI agents. Recently, large language models (LLMs) have shown remarkable abilities to perform reasoning tasks. However, numerous evaluations of the reasoning capabilities of LLMs have also showed some limitations. An outstanding limitation is length generalization, meaning that when trained on reasoning problems of smaller lengths or sizes, the resulting models struggle with problems of larger sizes or lengths. This potentially indicates some theoretical limitations of generalization in learning reasoning skills. These evaluations and their observations motivated us to perform a theoretical study of the length generalization problem. This work focused on reasoning tasks that can be formulated as Markov dynamic processes (MDPs) and/or directed acyclic graphs (DAGs). It identifies and proves conditions that decide whether the length generalization problem can be solved or not for a reasoning task in a particular representation. Experiments are also conducted to verify the theoretical results.

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

  • paper_url: http://arxiv.org/abs/2311.13127
  • repo_url: https://github.com/liuyixin-louis/metacloak
  • paper_authors: Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, Lichao Sun
  • For: Protecting portrait images from being used by unauthorized text-to-image diffusion models to fabricate misleading or harmful content, improving the safety of content generation.
  • Methods: A meta-learning framework with an additional transformation sampling process that solves the bi-level poisoning problem, replacing hand-crafted heuristics, and crafts transferable and robust perturbations.
  • Results: On the VGGFace2 and CelebA-HQ datasets, MetaCloak outperforms prior methods and successfully fools online training services like Replicate, demonstrating its effectiveness in real-world scenarios.
    Abstract Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, these tools, in the wrong hands, can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services like Replicate, in a black-box manner, demonstrating the effectiveness of MetaCloak in real-world scenarios. Our code is available at https://github.com/liuyixin-louis/MetaCloak.

Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements

  • paper_url: http://arxiv.org/abs/2311.13118
  • repo_url: None
  • paper_authors: Alejandro Rodriguez Perez, Pablo Rivas
  • for: This research targets the pressing issue of human trafficking in online C2C marketplaces, using advanced NLP techniques to combat human exploitation.
  • methods: It introduces a novel methodology for generating pseudo-labeled datasets with minimal supervision, employs cutting-edge Transformer models for analysis, and uses Integrated Gradients for explainable insights.
  • results: It offers a scalable, machine-learning-driven approach to combat human trafficking and a rich resource for training state-of-the-art NLP models, filling a critical gap in the literature.
    Abstract This project tackles the pressing issue of human trafficking in online C2C marketplaces through advanced Natural Language Processing (NLP) techniques. We introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art NLP models. Focusing on tasks like Human Trafficking Risk Prediction (HTRP) and Organized Activity Detection (OAD), we employ cutting-edge Transformer models for analysis. A key contribution is the implementation of an interpretability framework using Integrated Gradients, providing explainable insights crucial for law enforcement. This work not only fills a critical gap in the literature but also offers a scalable, machine learning-driven approach to combat human exploitation online. It serves as a foundation for future research and practical applications, emphasizing the role of machine learning in addressing complex social issues.
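Integrated Gradients, used here for explainability, attributes a prediction to input features by accumulating gradients along a straight-line path from a baseline to the input. Below is a minimal PyTorch sketch of the standard formulation; the model and inputs are hypothetical placeholders, and libraries such as Captum provide production implementations.

```python
import torch

def integrated_gradients(model, inputs, baseline, target, steps=50):
    """Approximate IG: (x - x') * mean of gradients along the path x' -> x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * inputs.dim()))
    path = baseline + alphas * (inputs - baseline)   # (steps, *input_shape)
    path.requires_grad_(True)
    scores = model(path)[:, target].sum()
    grads = torch.autograd.grad(scores, path)[0]
    return (inputs - baseline) * grads.mean(dim=0)   # attribution per feature

# Toy usage: attribute a linear classifier's class-0 score to 4 features.
model = torch.nn.Linear(4, 2)
x, x0 = torch.randn(4), torch.zeros(4)
print(integrated_gradients(model, x, x0, target=0))
```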

PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF

  • paper_url: http://arxiv.org/abs/2311.13099
  • repo_url: None
  • paper_authors: Yutao Feng, Yintong Shang, Xuan Li, Tianjia Shao, Chenfanfu Jiang, Yin Yang
  • for: This work aims to broaden physics-based simulation by integrating it with NeRF to generate high-quality elastodynamics of real-world objects.
  • methods: It uses a meshless nonlinear hyperelastic material model, applying a quadratic generalized moving least square (Q-GMLS) scheme to capture nonlinear dynamics and large deformation on the implicit model.
  • results: The authors integrate physics-based simulation with NeRF and synthesize physically realistic elastodynamic animations of complex and codimensional shapes at an interactive rate.
    Abstract We show that physics-based simulations can be seamlessly integrated with NeRF to generate high-quality elastodynamics of real-world objects. Unlike existing methods, we discretize nonlinear hyperelasticity in a meshless way, obviating the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh or voxel grid. A quadratic generalized moving least square (Q-GMLS) is employed to capture nonlinear dynamics and large deformation on the implicit model. Such meshless integration enables versatile simulations of complex and codimensional shapes. We adaptively place the least-square kernels according to the NeRF density field to significantly reduce the complexity of the nonlinear simulation. As a result, physically realistic animations can be conveniently synthesized using our method for a wide range of hyperelastic materials at an interactive rate. For more information, please visit our project page at https://fytalon.github.io/pienerf/.

  • paper_url: http://arxiv.org/abs/2311.13095
  • repo_url: None
  • paper_authors: Ha-Thanh Nguyen, Wachara Fungwacharakorn, Ken Satoh
  • For: This paper aims to improve the logical reasoning capabilities of Large Language Models (LLMs) in order to expand their applicability in law and other logic-intensive disciplines.
  • Methods: The proposed Reinforcement Learning from Logical Feedback (RLLF) approach and a revised evaluation methodology are used to refine LLMs' reasoning capacities.
  • Results: The RLLF approach and revised evaluation methodology are shown to be effective in improving LLMs' logical reasoning abilities, opening up new avenues for research in this domain and contributing to the development of LLMs capable of handling complex legal reasoning tasks.
    Abstract Language serves as a vehicle for conveying thought, enabling communication among individuals. The ability to distinguish between diverse concepts, identify fairness and injustice, and comprehend a range of legal notions fundamentally relies on logical reasoning. Large Language Models (LLMs) attempt to emulate human language understanding and generation, but their competency in logical reasoning remains limited. This paper seeks to address the philosophical question: How can we effectively teach logical reasoning to LLMs while maintaining a deep understanding of the intricate relationship between language and logic? By focusing on bolstering LLMs' capabilities in logical reasoning, we aim to expand their applicability in law and other logic-intensive disciplines. To this end, we propose a Reinforcement Learning from Logical Feedback (RLLF) approach, which serves as a potential framework for refining LLMs' reasoning capacities. Through RLLF and a revised evaluation methodology, we explore new avenues for research in this domain and contribute to the development of LLMs capable of handling complex legal reasoning tasks while acknowledging the fundamental connection between language and logic.

On the Limitation of Diffusion Models for Synthesizing Training Datasets

  • paper_url: http://arxiv.org/abs/2311.13090
  • repo_url: None
  • paper_authors: Shin’ya Yamaguchi, Takuma Fukuda
  • for: This work examines whether modern diffusion models can perfectly replicate the training data distribution, and whether the data they generate can cover that distribution when used to train discriminative models.
  • methods: It analyzes synthetic samples reconstructed from real samples through the diffusion and reverse processes; by varying the time step at which the reverse process starts, it controls the trade-off between information retained from the original real data and information added by the diffusion model, then assesses the reconstructed samples and the models trained on them.
  • results: Synthetic data concentrate in the modes of the training distribution as the reverse step increases and struggle to cover its outer edges, indicating that modern diffusion models cannot perfectly replicate training data distributions and that generative modeling still has room for improvement in dataset replication.
    Abstract Synthetic samples from diffusion models are promising for leveraging in training discriminative models as replications of real training datasets. However, we found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for training discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process. By varying the time steps starting the reverse process in the reconstruction, we can control the trade-off between the information in the original real data and the information added by diffusion models. Through assessing the reconstructed samples and trained models, we found that the synthetic data are concentrated in modes of the training data distribution as the reverse step increases, and thus, they are difficult to cover the outer edges of the distribution. Our findings imply that modern diffusion models are insufficient to replicate training data distribution perfectly, and there is room for the improvement of generative modeling in the replication of training datasets.
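The reconstruction probe is easy to state: forward-diffuse a real sample to an intermediate step t, then run the reverse process from there, so that a larger t hands more of the reconstruction over to the model. A minimal DDPM-style sketch under standard schedule assumptions follows; `eps_model` (a trained noise predictor) is a hypothetical stand-in.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def reconstruct(x0, t_start, eps_model):
    """Diffuse a real sample to step t_start, then denoise it back.

    A small t_start keeps most of the real image's information; a large
    t_start lets the model fill in more, exposing how well it covers
    the data distribution away from the modes.
    """
    eps = torch.randn_like(x0)
    x = alpha_bars[t_start].sqrt() * x0 + (1 - alpha_bars[t_start]).sqrt() * eps
    for t in range(t_start, -1, -1):  # standard DDPM reverse steps
        pred_eps = eps_model(x, t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * pred_eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```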

Predict-Then-Optimize by Proxy: Learning Joint Models of Prediction and Optimization

  • paper_url: http://arxiv.org/abs/2311.13087
  • repo_url: None
  • paper_authors: James Kotary, Vincenzo Di Vito, Jacob Christopher, Pascal Van Hentenryck, Ferdinando Fioretto
  • for: This paper proposes learning the solutions of optimization problems directly from observable features, complementing the traditional Predict-Then-Optimize framework.
  • methods: It uses machine learning models to predict the solutions of optimization problems, adapting the Learning-to-Optimize paradigm so that a rich variety of existing techniques can be employed without backpropagating through the optimization step during training.
  • results: Experimental evaluations show that the approach provides efficient, accurate, and flexible solutions to an array of challenging Predict-Then-Optimize problems.
    Abstract Many real-world decision processes are modeled by optimization problems whose defining parameters are unknown and must be inferred from observable data. The Predict-Then-Optimize framework uses machine learning models to predict unknown parameters of an optimization problem from features before solving. Recent works show that decision quality can be improved in this setting by solving and differentiating the optimization problem in the training loop, enabling end-to-end training with loss functions defined directly on the resulting decisions. However, this approach can be inefficient and requires handcrafted, problem-specific rules for backpropagation through the optimization step. This paper proposes an alternative method, in which optimal solutions are learned directly from the observable features by predictive models. The approach is generic, and based on an adaptation of the Learning-to-Optimize paradigm, from which a rich variety of existing techniques can be employed. Experimental evaluations show the ability of several Learning-to-Optimize methods to provide efficient, accurate, and flexible solutions to an array of challenging Predict-Then-Optimize problems.
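In a learning-to-optimize proxy of this kind, a predictive model is trained to map features directly to (near-)optimal decisions, with training targets produced by solving the optimization problem offline. The toy sketch below uses a newsvendor-style problem whose optimum has a closed form, so targets are trivial to generate; the problem and all names are illustrative, not the paper's benchmarks.

```python
import torch

# Toy problem: choose order quantity q for demand d ~ Normal(mu(z), 1)
# with underage cost 4 and overage cost 1; the optimum is the
# 0.8-quantile of demand, giving closed-form training targets.
def optimal_decision(mu):
    return mu + 0.8416  # z-score of the 0.8 quantile for unit variance

features = torch.randn(2048, 3)
mu = features @ torch.tensor([1.0, -2.0, 0.5])  # hidden demand model
targets = optimal_decision(mu).unsqueeze(1)

proxy = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(proxy.parameters(), lr=1e-2)
for _ in range(500):  # learn features -> decision directly
    loss = torch.nn.functional.mse_loss(proxy(features), targets)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())  # the proxy now maps features to near-optimal decisions
```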

Learning to Fly in Seconds

  • paper_url: http://arxiv.org/abs/2311.13081
  • repo_url: https://github.com/arplaboratory/learning-to-fly
  • paper_authors: Jonas Eschmann, Dario Albani, Giuseppe Loianno
  • for: This paper proposes a learning-based control method, using deep reinforcement learning to control autonomous multirotor aerial vehicles.
  • methods: It combines an asymmetric actor-critic architecture with a highly reliable RL training paradigm for direct RPM control, plus curriculum learning and a highly optimized simulator to improve sample complexity.
  • results: The method trains rapidly on a consumer-grade laptop, deploys to microcontrollers to control a multirotor under real-time guarantees, and achieves trajectory-tracking performance competitive with existing state-of-the-art control schemes in experiments.
    Abstract Learning-based methods, particularly Reinforcement Learning (RL), hold great promise for streamlining deployment, enhancing performance, and achieving generalization in the control of autonomous multirotor aerial vehicles. Deep RL has been able to control complex systems with impressive fidelity and agility in simulation but the simulation-to-reality transfer often brings a hard-to-bridge reality gap. Moreover, RL is commonly plagued by prohibitively long training times. In this work, we propose a novel asymmetric actor-critic-based architecture coupled with a highly reliable RL-based training paradigm for end-to-end quadrotor control. We show how curriculum learning and a highly optimized simulator enhance sample complexity and lead to fast training times. To precisely discuss the challenges related to low-level/end-to-end multirotor control, we also introduce a taxonomy that classifies the existing levels of control abstractions as well as non-linearities and domain parameters. Our framework enables Simulation-to-Reality (Sim2Real) transfer for direct RPM control after only 18 seconds of training on a consumer-grade laptop as well as its deployment on microcontrollers to control a multirotor under real-time guarantees. Finally, our solution exhibits competitive performance in trajectory tracking, as demonstrated through various experimental comparisons with existing state-of-the-art control solutions using a real Crazyflie nano quadrotor. We open source the code including a very fast multirotor dynamics simulator that can simulate about 5 months of flight per second on a laptop GPU. The fast training times and deployment to a cheap, off-the-shelf quadrotor lower the barriers to entry and help democratize the research and development of these systems.

Positional Description Matters for Transformers Arithmetic

  • paper_url: http://arxiv.org/abs/2311.14737
  • repo_url: None
  • paper_authors: Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang
  • for: This paper addresses Transformers' struggles with arithmetic tasks, especially their poor performance beyond small digit counts.
  • methods: The authors propose several fixes, either modifying the positional encoding directly or changing the representation of the arithmetic task to leverage standard positional encoding differently.
  • results: Across three tasks, (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context, a small model (100M parameters, 300k samples) achieves high accuracy on direct, no-scratchpad 15-digit multiplication and is essentially perfect up to 12 digits, whereas standard training fails at 4-digit multiplication. With only 120k addition samples, it (ii) extrapolates from 10-digit training to 12-digit tests where standard training shows no extrapolation, and (iii) is almost perfectly accurate up to 5 digits while standard training is correct only up to 3 digits (essentially memorization).
    Abstract Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities --which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15 digits multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4 digits multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12 digits numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).
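One representation change in the spirit of this paper is to reserialize numbers so that digit position aligns with significance, for example emitting digits least-significant-first so a left-to-right decoder handles carries in generation order. A minimal data-formatting sketch (the exact format string is illustrative, not necessarily the paper's):

```python
def format_addition_example(a: int, b: int) -> str:
    """Serialize an addition example with the least-significant digit first.

    Writing '345 + 678' as '543 + 876 = 3201' lets a left-to-right
    decoder produce each answer digit right after the operand digits
    that determine it, instead of relying on absolute positions.
    """
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)} + {rev(b)} = {rev(a + b)}"

print(format_addition_example(345, 678))  # -> 543 + 876 = 3201
```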

cs.CL - 2023-11-22

Surpassing GPT-4 Medical Coding with a Two-Stage Approach

  • paper_url: http://arxiv.org/abs/2311.13735
  • repo_url: None
  • paper_authors: Zhichao Yang, Sanjit Singh Batra, Joel Stremmel, Eran Halperin
  • for: Clinical applications such as clinical decision support and trial recommendations.
  • methods: A two-stage approach: an LLM first generates evidence proposals, then an LSTM-based verification stage learns from both the LLM's high recall and human experts' high precision via a custom loss function.
  • results: Achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification, without training on human-annotated evidence.
    Abstract Recent advances in large language models (LLMs) show potential for clinical applications, such as clinical decision support and trial recommendations. However, the GPT-4 LLM predicts an excessive number of ICD codes for medical coding tasks, leading to high recall but low precision. To tackle this challenge, we introduce LLM-codex, a two-stage approach to predict ICD codes that first generates evidence proposals using an LLM and then employs an LSTM-based verification stage. The LSTM learns from both the LLM's high recall and human expert's high precision, using a custom loss function. Our model is the only approach that simultaneously achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification to support coding decisions without training on human-annotated evidence according to experiments on the MIMIC dataset.
    摘要 近期大语言模型(LLM)的进步显示它们在临床应用中具有潜力,如临床决策支持和临床试验推荐。然而,GPT-4 在医疗编码任务中预测的 ICD 编码数量过多,导致高召回率但低精确率。为解决这个挑战,我们介绍了 LLM-codex,一种两阶段方法:首先使用 LLM 生成证据提案,然后使用基于 LSTM 的验证阶段。LSTM 通过自定义损失函数,同时学习 LLM 的高召回率和人工专家的高精确率。根据 MIMIC 数据集上的实验,我们的模型是唯一一种同时在医疗编码准确率、罕见编码准确率和支持编码决策的句子级证据识别三个方面达到最优水平的方法,且无需在人工标注的证据上训练。
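A minimal sketch of what the LSTM-based verification stage could look like structurally (our assumption of the architecture's shape; the paper's features and custom loss are not reproduced here):

```python
import torch
import torch.nn as nn

# Hedged sketch of the verification stage's shape: an LSTM reads the token
# embeddings of an LLM-proposed evidence sentence and emits a probability
# that the proposed ICD code is correct. All dimensions are illustrative.
class EvidenceVerifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                 # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(x)                # final hidden state per sequence
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)  # accept probability

verifier = EvidenceVerifier(vocab_size=30000)
probs = verifier(torch.randint(0, 30000, (4, 64)))  # 4 proposals, 64 tokens each
keep = probs > 0.5  # accept high-confidence proposals, filtering low precision
```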

Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case

  • paper_url: http://arxiv.org/abs/2311.13729
  • repo_url: https://github.com/shashank140195/raredis
  • paper_authors: Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru
  • for: 这篇论文的目的是在一个聚焦于罕见疾病(涉及不连续和嵌套实体)的复杂资料集上,比较三种常见的端到端关系抽取(E2ERE)方法。
  • methods: 比较三种不同的方法,包括 NER ⇒ RE 管道模型、联合序列到序列模型,以及生成式预训练 Transformer(GPT)模型。
  • results: 结果显示,管道模型仍然是最好的,联合序列到序列模型仅稍微落后;GPT 模型尽管参数量是前者的八倍,却不如序列到序列模型,并以超过 10 个 F1 分落后于管道模型。这些结论在第二个 E2ERE 资料集(化学物质-蛋白质交互)上也得到了验证。
    Abstract End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we aim to compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches (for E2ERE): NER $\rightarrow$ RE pipelines, joint sequence to sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors contributing to lower overall E2E performances. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, when training data is available, our results show that it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of the both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.
    摘要 端到端关系抽取(E2ERE)是生物医学自然语言处理(NLP)中重要且现实的应用。在这篇论文中,我们使用一个涉及罕见疾病、包含不连续和嵌套实体的复杂数据集,比较三种常见的 E2ERE 方法。我们使用 RareDis 信息抽取数据集来评估三种相互竞争的方法:NER⇒RE 管道模型、联合序列到序列模型和生成式预训练变换器(GPT)模型。我们为每种方法选用了可比的最先进模型和最佳实践,并进行错误分析以评估它们的失败模式。我们的发现是,管道模型仍然是最好的,序列到序列模型也相去不远;参数量为其八倍的 GPT 模型甚至不如序列到序列模型,并以超过 10 个 F1 分落后于管道模型。部分匹配和不连续实体导致了大量 NER 错误,从而降低了整体端到端性能。我们还在第二个关于化学物质-蛋白质交互的 E2ERE 数据集上验证了这些发现。虽然基于生成式语言模型的方法更适合零样本设定,但在有训练数据时,我们的结果表明,使用为 E2ERE 训练和定制的传统模型效果更好。还需要更多创新的方法,把较小的编码器-解码器管道模型与更大的 GPT 模型的优点结合起来,以改进 E2ERE。就目前而言,设计良好的管道模型能以更低的成本和碳足迹带来可观的性能提升。我们的贡献也是首次在 RareDis 数据集上进行 E2ERE。

Dynamic Analysis Method for Hidden Dangers in Substation Based on Knowledge Graph

  • paper_url: http://arxiv.org/abs/2311.13708
  • repo_url: None
  • paper_authors: Weiwei Li, Xing Liu, Wei Wang, Lu Chen, Sizhe Li, Hui Fan
  • for: 本研究旨在解决从非结构化文本数据中识别和理解变电站隐患的挑战。
  • methods: 该方法采用动态分析:首先从与隐患相关的非结构化文本中提取数据,然后使用基于 Elastic-Search 的灵活分布式数据搜索引擎处理这些信息。接着,使用隐马尔可夫模型对引擎中的数据进行训练,并结合 Viterbi 算法解码隐藏状态序列,以便对与隐患相关的实体进行切分和标注。最后,使用 Neo4j 图数据库动态构建可视化变电站隐患的知识图谱。
  • results: 通过对某变电站隐患数据的示例分析,验证了该方法能够准确地识别和理解隐患。
    Abstract To address the challenge of identifying and understanding hidden dangers in substations from unstructured text data, a novel dynamic analysis method is proposed. This approach begins by analyzing and extracting data from the unstructured text related to hidden dangers. It then leverages a flexible, distributed data search engine built on Elastic-Search to handle this information. Following this, the hidden Markov model is employed to train the data within the engine. The Viterbi algorithm is integrated to decipher the hidden state sequences, facilitating the segmentation and labeling of entities related to hidden dangers. The final step involves using the Neo4j graph database to dynamically create a knowledge map that visualizes hidden dangers in the substation. This method's effectiveness is demonstrated through an example analysis using data from a specific substation's hidden dangers.
    摘要 为解决从非结构化文本数据中识别和理解变电站隐患的挑战,我们提出了一种新的动态分析方法。这种方法首先从非结构化文本数据中提取和分析与隐患相关的信息。然后,我们使用基于 Elastic-Search 的灵活分布式数据搜索引擎来处理这些信息。接着,我们使用隐马尔可夫模型对引擎中的数据进行训练,并结合 Viterbi 算法解码隐藏状态序列,从而对与隐患相关的实体进行切分和标注。最后,我们使用 Neo4j 图数据库动态创建一个知识图谱,可视化变电站中的隐患。这种方法的效果通过一个使用某变电站隐患数据的具体例子分析得到了验证。
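The decoding step named in the abstract is the standard Viterbi algorithm; a self-contained sketch with toy HMM parameters (real transition and emission matrices would be learned from the hidden-danger corpus):

```python
import numpy as np

# Minimal Viterbi decoder for an HMM, as used in the described pipeline to
# recover the most likely hidden label sequence for entity segmentation.
def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,);
    A: transition probs (S,S); B: emission probs (S,V)."""
    S, T = A.shape[0], len(obs)
    delta = np.zeros((T, S))           # best log-prob of any path ending in state s
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (S, S): prev -> cur
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])                       # toy parameters
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))             # most likely state sequence
```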

Efficient Transformer Knowledge Distillation: A Performance Review

  • paper_url: http://arxiv.org/abs/2311.13657
  • repo_url: None
  • paper_authors: Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence
  • for: 本研究旨在评估通过知识蒸馏进行模型压缩的影响,以在降低计算成本的同时保持效果。
  • methods: 本研究对最先进的高效注意力模型进行知识蒸馏压缩,并对其进行成本-性能权衡评估。
  • results: 研究发现,蒸馏后的高效注意力模型可以保留大部分原始模型性能(短上下文任务最高保留 98.6%),同时将推理时间最多降低 57.8%。此外,本研究还引入了一个新的长上下文命名实体识别数据集 GONERD,用于训练和测试长序列 NER 模型。
    Abstract As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Despite these separate efforts, no investigation has been done into the intersection of these two fields. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for the compression of state-of-the-art efficient attention architectures and the gains made in performance in comparison to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled efficient attention transformers can preserve a significant amount of original model performance, preserving up to 98.6% across short-context tasks (GLUE, SQUAD, CoNLL-2003), up to 94.6% across long-context Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context Named Entity Recognition (GONERD), while decreasing inference times by up to 57.8%. We find that, for most models on most tasks, performing knowledge distillation is an effective method to yield high-performing efficient attention models with low costs.
    摘要 随着预训练变换器语言模型不断取得最先进的性能,自然语言处理社区一直在推动模型压缩和高效注意力机制的发展,以应对高计算需求和有限的输入序列长度。然而,这两个领域的交叉尚未得到研究。在这项工作中,我们评估了通过知识蒸馏对高效注意力变换器进行模型压缩的效果。我们给出了压缩最先进高效注意力架构的成本-性能权衡,以及与全注意力模型相比的性能变化。此外,我们还引入了一个新的长上下文命名实体识别数据集 GONERD,用于训练和测试长序列 NER 模型。我们发现,蒸馏后的高效注意力变换器可以保留大量原始模型性能:在短上下文任务(GLUE、SQUAD、CoNLL-2003)上最高保留 98.6%,在长上下文问答任务(HotpotQA、TriviaQA)上最高保留 94.6%,在长上下文命名实体识别(GONERD)上最高保留 98.8%,同时将推理时间最多降低 57.8%。我们发现,对大多数模型和任务来说,知识蒸馏是一种以低成本获得高性能高效注意力模型的有效方法。
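The distillation objective evaluated here is, in its common form, a temperature-softened KL term mixed with cross-entropy; a sketch with illustrative hyperparameters (T and alpha are our choices, not the paper's):

```python
import torch
import torch.nn.functional as F

# Standard knowledge-distillation loss: soften teacher and student logits
# with temperature T, then mix the KL term with ordinary cross-entropy on
# the gold labels.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale gradients for the temperature
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

s = torch.randn(8, 10)                 # student logits (batch, classes)
t = torch.randn(8, 10)                 # teacher logits
y = torch.randint(0, 10, (8,))         # gold labels
print(distillation_loss(s, t, y).item())
```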

Language Model Inversion

  • paper_url: http://arxiv.org/abs/2311.13647
  • repo_url: https://github.com/jxmorris12/vec2text
  • paper_authors: John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, Alexander M. Rush
  • for: 本研究旨在利用语言模型的下一个 token 分布来恢复隐藏的提示 token。
  • methods: 作者研究语言模型反演问题,并证明下一个 token 的概率分布包含了关于前文的大量信息。他们提出了一种方法,即便无法获得词表中每个 token 的预测,也可以通过搜索从模型的当前分布输出中恢复概率向量,进而恢复未知的提示。
  • results: 在 Llama-2 7b 上,该反演方法重建提示的 BLEU 分数为 59,token 级 F1 分数为 78,并能完全恢复 27% 的提示。
    Abstract Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.
    摘要 语言模型会生成下一个 token 的概率分布;我们能否利用这一信息恢复提示 token?我们研究语言模型反演问题,并证明下一个 token 的概率包含了关于前文的大量信息。在文本对用户隐藏的情况下,我们往往仍能将其恢复,这启发了一种仅凭模型当前的分布输出恢复未知提示的方法。我们考虑了多种模型访问场景,并展示了即便无法获得词表中每个 token 的预测,也可以通过搜索恢复概率向量。在 Llama-2 7b 上,我们的反演方法重建提示的 BLEU 分数为 59,token 级 F1 分数为 78,并能完全恢复 27% 的提示。复现所有实验的代码见 http://github.com/jxmorris12/vec2text 。
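The inversion method conditions only on the model's next-token distribution; a minimal sketch of extracting that probability vector with Hugging Face transformers (GPT-2 is used purely as a small stand-in for Llama-2 7b):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Obtain the next-token probability vector that the inverter conditions on.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The hidden prompt ends here:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)      # distribution over next token
# `probs` (one vector per generation step) is the only signal used to
# reconstruct the hidden prompt in this line of work.
print(probs.shape, probs.sum().item())
```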

PaSS: Parallel Speculative Sampling

  • paper_url: http://arxiv.org/abs/2311.13581
  • repo_url: None
  • paper_authors: Giovanni Monea, Armand Joulin, Edouard Grave
  • for: 在不增加计算成本、不依赖第二个草稿模型的前提下,加速大语言模型的自回归生成。
  • methods: 使用并行解码从同一个模型中起草多个 token,仅需额外引入标记并行生成位置的输入 token。
  • results: 可以达到约 30% 的速度提升,且仅需 $O(d_{emb})$ 的额外参数。
    Abstract Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations lead to the development of speculative sampling, where a second smaller model is used to draft a few tokens, that are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer and thus limits its adoption. As an alternative, we propose to use parallel decoding as a way to draft multiple tokens from a single model with no computational cost, nor the need for a second model. Our approach only requires an additional input token that marks the words that will be generated simultaneously. We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
    摘要 将语言模型的参数规模扩大到数百亿级别,带来了在广泛任务上的出色表现。在生成时,这些模型以自回归方式运行,每生成一个 token 就需要一次前向传播,因而要从内存中读取全部参数。这种内存访问是生成的主要瓶颈,并随模型规模增大而加剧。此外,对多个 token 并行执行一次前向传播所需的时间,通常与处理单个 token 相当。这两个观察催生了推测采样:用一个较小的第二模型起草若干 token,再通过大模型的一次前向传播进行验证或拒绝。然而,这种方法需要两个共享同一分词器的模型,限制了其应用。作为替代,我们提出使用并行解码,从单个模型中起草多个 token,既无额外计算成本,也无需第二个模型。我们的方法只需添加一个标记将被同时生成的词的额外输入 token。实验显示出可观的性能(最高 30% 的速度提升),且仅需 $O(d_{emb})$ 的额外参数。
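A hedged sketch of the validation step shared by speculative schemes, simplified to greedy acceptance (PaSS's look-ahead drafting mechanism is abstracted behind the `draft_ids` argument, and `model` is any callable returning logits):

```python
import torch

# Given k drafted tokens, one forward pass over (prefix + draft) yields
# logits at every position; keep the longest prefix of the draft that the
# model itself would have produced greedily. All accepted tokens therefore
# cost a single forward pass instead of one pass each.
def validate_draft(model, prefix_ids: torch.Tensor, draft_ids: torch.Tensor):
    seq = torch.cat([prefix_ids, draft_ids])        # (L + k,)
    logits = model(seq.unsqueeze(0))[0]             # (L + k, vocab)
    accepted = []
    for i, tok in enumerate(draft_ids.tolist()):
        # the token at position L+i is predicted by logits at index L+i-1
        pred = int(logits[len(prefix_ids) + i - 1].argmax())
        if pred != tok:
            break
        accepted.append(tok)
    return accepted

dummy = lambda x: torch.randn(x.shape[0], x.shape[1], 100)  # stand-in model
print(validate_draft(dummy, torch.tensor([1, 2, 3]), torch.tensor([4, 5])))
```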

Efficient Deep Speech Understanding at the Edge

  • paper_url: http://arxiv.org/abs/2311.17065
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Rongxiang Wang, Felix Lin
  • for: 加速边缘设备上的语音理解(SU)执行,并有效处理超出设备端模型能力的输入。
  • methods: 提出了三种创新方案:1. 延迟上下文化(late contextualization):在输入摄取期间并行执行模型的注意力编码器;2. 先导解码(pilot decoding):缓解时间上的负载不均衡;3. 自回归下匝口(autoregression offramps):基于部分输出序列做出卸载决策。
  • results: 在配备 6-8 核 Arm 处理器的平台上实现了最先进(SOTA)精度,将端到端延迟降低 2 倍,并将卸载需求减半。
    Abstract Contemporary Speech Understanding (SU) involves a sophisticated pipeline: capturing real-time voice input, the pipeline encompasses a deep neural network with an encoder-decoder architecture enhanced by beam search. This network periodically assesses attention and Connectionist Temporal Classification (CTC) scores in its autoregressive output. This paper aims to enhance SU performance on edge devices with limited resources. It pursues two intertwined goals: accelerating on-device execution and efficiently handling inputs that surpass the on-device model's capacity. While these objectives are well-established, we introduce innovative solutions that specifically address SU's distinctive challenges: 1. Late contextualization: Enables the parallel execution of a model's attentive encoder during input ingestion. 2. Pilot decoding: Alleviates temporal load imbalances. 3. Autoregression offramps: Facilitate offloading decisions based on partial output sequences. Our techniques seamlessly integrate with existing SU models, pipelines, and frameworks, allowing for independent or combined application. Together, they constitute a hybrid solution for edge SU, exemplified by our prototype, XYZ. Evaluated on platforms equipped with 6-8 Arm cores, our system achieves State-of-the-Art (SOTA) accuracy, reducing end-to-end latency by 2x and halving offloading requirements.
    摘要 现代语音理解(SU)包含一条复杂的流水线:在摄取实时语音输入时,流水线包括一个采用编码器-解码器架构并辅以束搜索(beam search)的深度神经网络。该网络在自回归输出中不断评估注意力和连接时序分类(CTC)分数。本文旨在在资源有限的边缘设备上提升 SU 性能,追求两个相互关联的目标:加速设备端执行,以及高效处理超出设备端模型能力的输入。虽然这些目标已有大量研究,但我们针对 SU 的独特挑战提出了创新方案:1. 延迟上下文化:允许在输入摄取期间并行执行模型的注意力编码器;2. 先导解码:缓解时间上的负载不均衡;3. 自回归下匝口:便于基于部分输出序列做出卸载决策。我们的技术可以与现有的 SU 模型、流水线和框架无缝集成,可独立或组合应用,共同构成一个边缘 SU 的混合解决方案,以我们的原型 XYZ 为例。在配备 6-8 个 Arm 核心的平台上测试,我们的系统实现了最先进(SOTA)的精度,将端到端延迟降低 2 倍,并将卸载需求减半。

Current Topological and Machine Learning Applications for Bias Detection in Text

  • paper_url: http://arxiv.org/abs/2311.13495
  • repo_url: None
  • paper_authors: Colleen Farrelly, Yashbir Singh, Quincy A. Hathaway, Gunnar Carlsson, Ashok Choudhary, Rahul Paul, Gianfranco Doretto, Yassine Himeur, Shadi Atalls, Wathiq Mansoor
  • for: 本研究旨在探讨语言模型嵌入和几何模型如何影响偏见模型的准确性。
  • methods: 本研究使用 RedditBias 数据集分析文本偏见,并比较四种 transformer 模型,包括 BERT 和 RoBERTa 变体。得到嵌入后,用 t-SNE 将数据可视化到二维平面,再用 KNN 分类器区分偏见类型。
  • results: 研究发现,BERT 尤其是 mini BERT 在偏见分类中表现出色,而多语言模型则落后于其他模型。结论是应进一步改进单语言模型,并探索特定领域的偏见。
    Abstract Institutional bias can impact patient outcomes, educational attainment, and legal system navigation. Written records often reflect bias, and once bias is identified; it is possible to refer individuals for training to reduce bias. Many machine learning tools exist to explore text data and create predictive models that can search written records to identify real-time bias. However, few previous studies investigate large language model embeddings and geometric models of biased text data to understand geometry's impact on bias modeling accuracy. To overcome this issue, this study utilizes the RedditBias database to analyze textual biases. Four transformer models, including BERT and RoBERTa variants, were explored. Post-embedding, t-SNE allowed two-dimensional visualization of data. KNN classifiers differentiated bias types, with lower k-values proving more effective. Findings suggest BERT, particularly mini BERT, excels in bias classification, while multilingual models lag. The recommendation emphasizes refining monolingual models and exploring domain-specific biases.
    摘要 机构偏见可能影响病人结局、教育水平和法律系统导航。书面记录往往反映偏见;一旦识别出偏见,就可以安排相关人员接受培训以减少偏见。已有许多机器学习工具可以探索文本数据并构建预测模型,在书面记录中实时检索偏见。然而,此前少有研究考察大语言模型嵌入和偏见文本数据的几何模型,以理解几何结构对偏见建模准确性的影响。为此,本研究利用 RedditBias 数据集分析文本偏见,考察了四种变换器模型,包括 BERT 和 RoBERTa 变体。得到嵌入后,用 t-SNE 将数据可视化到二维平面;KNN 分类器用于区分偏见类型,其中较低的 k 值更为有效。结果表明,BERT(特别是 mini BERT)在偏见分类方面表现出色,而多语言模型相对落后。建议进一步完善单语言模型,并探索特定领域的偏见。
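The analysis pipeline described above maps directly onto scikit-learn; a sketch with random placeholder arrays standing in for the BERT-variant embeddings and bias-type labels:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Project sentence embeddings to 2-D with t-SNE, then separate bias types
# with a small-k KNN classifier (lower k worked better per the paper).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 768))       # stand-in for BERT embeddings
bias_labels = rng.integers(0, 4, size=600)     # stand-in bias-type labels

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
X_tr, X_te, y_tr, y_te = train_test_split(coords, bias_labels, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_tr, y_tr)
print("accuracy:", knn.score(X_te, y_te))
```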

Machine Translation to Control Formality Features in the Target Language

  • paper_url: http://arxiv.org/abs/2311.13475
  • repo_url: None
  • paper_authors: Harshita Tyagi, Prashasta Jung, Hyowon Lee
  • for: 本研究旨在解决从不区分语体正式度的源语言(如英语)机器翻译到区分正式度的目标语言(以印地语为例)时,正式度信息缺失的问题。
  • methods: 本研究在正式度受控的设定下训练双语模型,并与类似设定下的预训练多语言模型进行比较;由于带真实标注的训练数据有限,采用自动标注技术扩充数据规模。主要建模方法为变换器模型,该类模型已在多种自然语言处理任务中证明了有效性。
  • results: 研究显示,考虑目标语言正式度的正式度受控翻译策略能够更好地适应多样的语言交流需求和场景。
    Abstract Formality plays a significant role in language communication, especially in low-resource languages such as Hindi, Japanese and Korean. These languages utilise formal and informal expressions to convey messages based on social contexts and relationships. When a language translation technique is used to translate from a source language that does not pertain the formality (e.g. English) to a target language that does, there is a missing information on formality that could be a challenge in producing an accurate outcome. This research explores how this issue should be resolved when machine learning methods are used to translate from English to languages with formality, using Hindi as the example data. This was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model in a similar setting. Since there are not a lot of training data with ground truth, automated annotation techniques were employed to increase the data size. The primary modeling approach involved leveraging transformer models, which have demonstrated effectiveness in various natural language processing tasks. We evaluate the official formality accuracy(ACC) by comparing the predicted masked tokens with the ground truth. This metric provides a quantitative measure of how well the translations align with the desired outputs. Our study showcases a versatile translation strategy that considers the nuances of formality in the target language, catering to diverse language communication needs and scenarios.
    摘要 正式度在语言交流中扮演重要角色,尤其是在印地语、日语和韩语等低资源语言中。这些语言依据社交情境和人际关系,使用正式与非正式表达来传递信息。当从不体现正式度的源语言(如英语)翻译到体现正式度的目标语言时,正式度信息的缺失会成为产出准确结果的挑战。本研究以印地语为例,探讨在使用机器学习方法从英语翻译到具有正式度的语言时应如何解决这一问题。我们在正式度受控的设定下训练了一个双语模型,并与类似设定下的预训练多语言模型比较其表现。由于带真实标注的训练数据不多,我们采用自动标注技术来扩充数据规模。主要建模方法为变换器模型,该类模型已在多种自然语言处理任务中证明了有效性。我们通过将预测的掩码 token 与真实标注进行比较来评估正式度准确率(ACC),该指标定量衡量译文与期望输出的吻合程度。本研究展示了一种考虑目标语言正式度细微差别的通用翻译策略,可满足多样化的语言交流需求和场景。

Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting

  • paper_url: http://arxiv.org/abs/2311.13314
  • repo_url: None
  • paper_authors: Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
  • for: Mitigating the hallucination of large language models (LLMs) during the reasoning process.
  • methods: Knowledge Graph-based Retrofitting (KGR) framework that incorporates LLMs with knowledge graphs (KGs) to retrofit the initial draft responses of LLMs based on factual knowledge stored in KGs.
  • results: Significant improvement in the performance of LLMs on factual question answering benchmarks, especially for complex reasoning processes, demonstrating the effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.
    Abstract Incorporating factual knowledge in knowledge graph is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually only use the user's input to query the knowledge graph, thus failing to address the factual hallucination generated by LLMs during its reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that incorporates LLMs with KGs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of LLMs based on the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual efforts. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks especially when involving complex reasoning processes, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.
    摘要 在知识图谱(KG)中引入事实知识被认为是缓解大语言模型(LLM)幻觉的一种有前途的方法。现有方法通常仅基于用户输入来查询知识图谱,因此无法处理 LLM 在推理过程中产生的事实幻觉。为解决这个问题,本文提出了基于知识图谱的修正框架(KGR),将 LLM 与 KG 结合,依据 KG 中存储的事实知识对 LLM 的初稿回答进行修正,从而缓解推理过程中的事实幻觉。具体而言,KGR 利用 LLM 对模型生成回答中的事实陈述进行提取、选择、验证和修正,实现无需额外人工干预的自主知识验证与精炼过程。实验显示,KGR 可以显著提高 LLM 在事实问答基准上的表现,尤其是在涉及复杂推理过程时,这表明了 KGR 在缓解幻觉和提高 LLM 可靠性方面的必要性和有效性。
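A high-level sketch of the retrofit loop as described in the abstract; every helper here is a hypothetical stand-in for an LLM prompt or KG query (not the paper's API), stubbed minimally so the sketch runs:

```python
# Hedged sketch of the KGR loop: extract claims from a draft answer, check
# them against KG triples, and retrofit unsupported claims.
def extract_claims(llm, draft):            # stub: LLM extracts factual statements
    return [s for s in draft.split(". ") if s]

def kg_lookup(kg, claim):                  # stub: retrieve relevant KG triples
    return [t for t in kg if t[0] in claim]

def supported(claim, facts):               # stub: LLM-validated in the paper
    return any(t[2] in claim for t in facts)

def revise_response(llm, draft, corrections):
    for claim, facts in corrections:       # retrofit each unsupported claim
        draft = draft.replace(claim, f"{facts[0][0]} {facts[0][1]} {facts[0][2]}")
    return draft

def kg_retrofit(llm, kg, question):
    draft = llm(question)                                  # initial draft answer
    corrections = [(c, f) for c in extract_claims(llm, draft)
                   if (f := kg_lookup(kg, c)) and not supported(c, f)]
    return revise_response(llm, draft, corrections) if corrections else draft

kg = [("Paris", "is capital of", "France")]
llm = lambda q: "Paris is the capital of Germany. It is in Europe."
print(kg_retrofit(llm, kg, "What is Paris?"))  # the false claim gets corrected
```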

Rethinking Radiology Report Generation via Causal Reasoning and Counterfactual Augmentation

  • paper_url: http://arxiv.org/abs/2311.13307
  • repo_url: None
  • paper_authors: Xiao Song, Jiafan Liu, Yun Li, Wenbin Lei, Ruxin Wang
  • for: 这篇论文旨在消除疾病共现带来的虚假混杂,使放射学报告生成(RRG)更加准确和可靠。
  • methods: 从统计与因果的新视角分析其成因与影响,指出联合视觉耦合(Joint Vision Coupling)与条件句连贯耦合(Conditional Sentence Coherence Coupling)是两个容易隐性降低报告准确性的方面;进而提出一种反事实增强策略,包含反事实样本合成(Counterfactual Sample Synthesis)和反事实报告重建(Counterfactual Report Reconstruction)两个子方法,以打破这两方面的虚假效应。
  • results: 在两个常用数据集上的实验结果和进一步分析验证了所提方法的合理性与有效性。
    Abstract Radiology Report Generation (RRG) draws attention as an interaction between vision and language fields. Previous works inherited the ideology of vision-to-language generation tasks,aiming to generate paragraphs with high consistency as reports. However, one unique characteristic of RRG, the independence between diseases, was neglected, leading to the injection of the spurious confounder, i.e., the disease co-occurrence. Unfortunately, this confounder confuses the process of report generation worse because of the biased RRG data distribution. In this paper, to rethink this issue thoroughly, we reason about its causes and effects from a novel perspective of statistics and causality, where the Joint Vision Coupling and the Conditional Sentence Coherence Coupling are two aspects prone to implicitly decrease the accuracy of reports. Then, a counterfactual augmentation strategy that contains the Counterfactual Sample Synthesis and the Counterfactual Report Reconstruction sub-methods is proposed to break these two aspects of spurious effects. Experimental results and further analyses on two widely used datasets justify our reasoning and proposed methods.
    摘要 放射学报告生成(RRG)作为视觉与语言领域的交叉问题受到关注。以往工作沿用了视觉到语言生成任务的思路,旨在生成与报告高度一致的段落。然而,RRG 的一个独特性质——疾病之间的相互独立——被忽视了,导致疾病共现这一虚假混杂因素被引入。由于 RRG 数据分布的偏差,这一混杂因素使报告生成过程更加混乱。本文从统计与因果的新视角对其成因与影响进行了深入分析,指出联合视觉耦合与条件句连贯耦合是两个容易隐性降低报告准确性的方面;随后提出了一种包含反事实样本合成与反事实报告重建两个子方法的反事实增强策略,以打破这两方面的虚假效应。在两个常用数据集上的实验结果和进一步分析验证了我们的分析与所提方法。

  • paper_url: http://arxiv.org/abs/2311.13281
  • repo_url: None
  • paper_authors: Nick Goodson, Rongfei Lu
  • for: 这种研究旨在提高法律援助机构的工作效率和成本,使法律援助更加可 accessible для更广泛的人群。
  • methods: 该研究探讨使用大语言模型(LLM)和聊天机器人来简化法律接案(legal intake)流程,以解决当前 LLM 倾向于在未充分了解客户真实意图和具体法律情况时,过于自信地立即给出“最佳猜测”答案的问题。
  • results: 研究人员通过自由形式的语言交互来引出并推断客户的潜在意图和具体法律情况,验证了概念可行性,并提出未来研究方向:利用监督微调或离线强化学习,使聊天机器人无需显式提示即可自动进行意图与情境的询问。
    Abstract Large Language Models (LLMs) and chatbots show significant promise in streamlining the legal intake process. This advancement can greatly reduce the workload and costs for legal aid organizations, improving availability while making legal assistance more accessible to a broader audience. However, a key challenge with current LLMs is their tendency to overconfidently deliver an immediate 'best guess' to a client's question based on the output distribution learned over the training data. This approach often overlooks the client's actual intentions or the specifics of their legal situation. As a result, clients may not realize the importance of providing essential additional context or expressing their underlying intentions, which are crucial for their legal cases. Traditionally, logic based decision trees have been used to automate intake for specific access to justice issues, such as immigration and eviction. But those solutions lack scalability. We demonstrate a proof-of-concept using LLMs to elicit and infer clients' underlying intentions and specific legal circumstances through free-form, language-based interactions. We also propose future research directions to use supervised fine-tuning or offline reinforcement learning to automatically incorporate intention and context elicitation in chatbots without explicit prompting.
    摘要 大语言模型(LLM)和聊天机器人在简化法律接案流程方面展现出显著潜力。这一进展可以大幅降低法律援助机构的工作量和成本,提高服务可得性,使更广泛的人群能够获得法律援助。然而,当前 LLM 的一个关键问题是,它们倾向于依据训练数据学到的输出分布,过于自信地对客户的问题立即给出“最佳猜测”,往往忽视了客户的真实意图或其法律情况的具体细节。结果,客户可能意识不到提供关键补充背景或表达潜在意图的重要性,而这些对其法律案件至关重要。传统上,基于逻辑的决策树被用于自动化特定司法可及性问题(如移民和驱逐)的接案流程,但这类方案缺乏可扩展性。我们展示了一个概念验证:利用 LLM 通过自由形式的语言交互,引出并推断客户的潜在意图和具体法律情况。我们还提出了未来研究方向:利用监督微调或离线强化学习,使聊天机器人无需显式提示即可自动进行意图与情境的询问。

Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting

  • paper_url: http://arxiv.org/abs/2311.13274
  • repo_url: None
  • paper_authors: Daphne van Zandvoort, Laura Wiersema, Tom Huibers, Sandra van Dulmen, Sjaak Brinkkemper
  • for: 这个研究旨在提高自动医疗报告的品质和相关性,以减少医疗专业人员的时间投入。
  • methods: 研究结合使用两种不同的提示策略,即示例提示(shot prompting)和模式提示(pattern prompting),以提升自动医疗报告的效果。
  • results: 结合范围与领域上下文的 two-shot 提示方法优于其他方法,在与全科医生撰写的人工参考对比时取得最高的 ROUGE 分数,并在专家小组的人工评估中表现最佳;但由于加入了冗余和相关的陈述,自动生成的报告长度约为人工参考的两倍。
    Abstract Customized medical prompts enable Large Language Models (LLM) to effectively address medical dialogue summarization. The process of medical reporting is often time-consuming for healthcare professionals. Implementing medical dialogue summarization techniques presents a viable solution to alleviate this time constraint by generating automated medical reports. The effectiveness of LLMs in this process is significantly influenced by the formulation of the prompt, which plays a crucial role in determining the quality and relevance of the generated reports. In this research, we used a combination of two distinct prompting strategies, known as shot prompting and pattern prompting to enhance the performance of automated medical reporting. The evaluation of the automated medical reports is carried out using the ROUGE score and a human evaluation with the help of an expert panel. The two-shot prompting approach in combination with scope and domain context outperforms other methods and achieves the highest score when compared to the human reference set by a general practitioner. However, the automated reports are approximately twice as long as the human references, due to the addition of both redundant and relevant statements that are added to the report.
    摘要 定制化的医疗提示可以使大语言模型(LLM)有效地完成医疗对话摘要。撰写医疗报告对医疗专业人员来说往往十分耗时,实施医疗对话摘要技术、自动生成医疗报告是缓解这一时间压力的可行方案。LLM 在这一过程中的效果在很大程度上取决于提示的表述方式,它对生成报告的质量和相关性起着关键作用。在这项研究中,我们结合使用两种不同的提示策略,即示例提示(shot prompting)和模式提示(pattern prompting),以提升自动医疗报告的性能。我们使用 ROUGE 分数和由专家小组参与的人工评估来评价自动生成的医疗报告。结合范围与领域上下文的 two-shot 提示方法优于其他方法,在与全科医生撰写的人工参考对比时取得最高分。然而,由于报告中加入了冗余和相关的陈述,自动生成的报告长度约为人工参考的两倍。

Comparative Experimentation of Accuracy Metrics in Automated Medical Reporting: The Case of Otitis Consultations

  • paper_url: http://arxiv.org/abs/2311.13273
  • repo_url: None
  • paper_authors: Wouter Faber, Renske Eline Bootsma, Tom Huibers, Sandra van Dulmen, Sjaak Brinkkemper
  • for: 这项研究旨在通过使用生成式人工智能自动生成医疗报告,减轻医疗专业人员的行政负担。
  • methods: 该研究使用多种精度度量来评估生成的医疗报告相对于全科医生报告的准确性,并引入一个综合精度分数(Composite Accuracy Score)。
  • results: 基于中耳炎(Otitis)问诊报告的对比实验发现,ROUGE-L 和 Word Mover's Distance 是首选度量,这与之前的研究结论不同。这些结果有助于确定生成的医疗报告的准确性,从而推动医疗报告自动生成系统的开发。
    Abstract Generative Artificial Intelligence (AI) can be used to automatically generate medical reports based on transcripts of medical consultations. The aim is to reduce the administrative burden that healthcare professionals face. The accuracy of the generated reports needs to be established to ensure their correctness and usefulness. There are several metrics for measuring the accuracy of AI generated reports, but little work has been done towards the application of these metrics in medical reporting. A comparative experimentation of 10 accuracy metrics has been performed on AI generated medical reports against their corresponding General Practitioner's (GP) medical reports concerning Otitis consultations. The number of missing, incorrect, and additional statements of the generated reports have been correlated with the metric scores. In addition, we introduce and define a Composite Accuracy Score which produces a single score for comparing the metrics within the field of automated medical reporting. Findings show that based on the correlation study and the Composite Accuracy Score, the ROUGE-L and Word Mover's Distance metrics are the preferred metrics, which is not in line with previous work. These findings help determine the accuracy of an AI generated medical report, which aids the development of systems that generate medical reports for GPs to reduce the administrative burden.
    摘要 生成式人工智能(AI)可以基于医疗问诊的转录文本自动生成医疗报告,以减轻医疗专业人员的行政负担。为确保报告的正确性和实用性,需要评估生成报告的准确性。衡量 AI 生成报告准确性的度量有多种,但将这些度量应用于医疗报告的研究很少。我们在中耳炎(Otitis)问诊场景下,对 AI 生成的医疗报告与对应的全科医生(GP)报告进行了 10 种精度度量的对比实验,并将生成报告中缺失、错误和多余陈述的数量与各度量分数进行相关分析。此外,我们提出并定义了综合精度分数(Composite Accuracy Score),为自动医疗报告领域内的度量比较给出单一评分。基于相关性研究和综合精度分数的结果表明,ROUGE-L 和 Word Mover's Distance 是首选度量,这与之前的研究结论不同。这些发现有助于确定 AI 生成医疗报告的准确性,从而推动为全科医生自动生成医疗报告、减轻行政负担的系统的开发。
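The correlation analysis described (error counts vs. metric scores) is straightforward to reproduce in outline; a sketch with illustrative numbers, not the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Each generated report has a count of missing/incorrect/additional
# statements and a score per metric; correlating the two shows which metric
# best tracks report errors.
error_counts = np.array([0, 1, 1, 2, 3, 5, 6, 8])                 # errors per report
rouge_l = np.array([0.91, 0.85, 0.83, 0.74, 0.66, 0.51, 0.45, 0.32])  # metric scores

r, p = pearsonr(error_counts, rouge_l)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a strong negative r is desirable
```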

ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

  • paper_url: http://arxiv.org/abs/2311.13258
  • repo_url: https://github.com/yangyi-chen/vi-struct
  • paper_authors: Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji
  • for: 这个研究是为了提高视觉结构知识提取的效果,特别是对象之间的关系。
  • methods: 该研究采用两种新设计:其一,利用编程语言固有的结构来表达视觉结构信息,从而以组织良好的结构化格式,显式而一致地表示概念、关系、事件等多种粒度的视觉结构信息;其二,引入课程学习(curriculum learning),使模型从基础视觉概念逐步理解复杂的事件结构,其直觉是低层知识有助于复杂视觉结构的理解。
  • results: 在视觉结构预测任务中,ViStruct表现出色,证明了其在视觉结构理解方面的效果。
    Abstract State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is public at \url{https://github.com/Yangyi-Chen/vi-struct}.
    摘要 现有最先进的视觉语言模型(VLM)在结构化知识提取(如对象之间的关系)方面表现仍然有限。在这项工作中,我们提出了 ViStruct 训练框架,用于学习能够有效提取视觉结构知识的 VLM。我们引入了两项新设计:首先,我们提议利用编程语言的内在结构来表示视觉结构信息。这种方法能够以组织良好的结构化格式,显式而一致地表示概念、关系和事件等多种粒度的视觉结构信息。其次,我们为 VLM 引入基于课程的学习,使其从基础视觉概念逐步理解复杂的事件结构。我们的直觉是,低层知识可能有助于复杂视觉结构的理解。此外,我们编制并发布了一组面向视觉结构知识提取的数据集,并采用弱监督方式直接从图像描述生成视觉事件结构用于 ViStruct 训练,充分利用网络上丰富的图文对。实验表明,ViStruct 在视觉结构预测任务中表现出色,证明其能有效提升对视觉结构的理解。代码公开于 https://github.com/Yangyi-Chen/vi-struct 。

Automatic Instruction Optimization for Open-source LLM Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.13246
  • repo_url: https://github.com/lunyiliu/coachlm
  • paper_authors: Yilun Liu, Shimin Tao, Xiaofeng Zhao, Ming Zhu, Wenbing Ma, Junhao Zhu, Chang Su, Yutai Hou, Miao Zhang, Min Zhang, Hongxia Ma, Li Zhang, Hao Yang, Yanfei Jiang
  • for: 本文旨在提升大语言模型(LLM)响应人类指令的能力。
  • methods: 提出 CoachLM:不丢弃低质量样本,而是通过对数据集中的样本进行自动修订来提升指令数据集的质量;CoachLM 基于人类专家修订过的样本训练。
  • results: 这 paper 表明,使用 CoachLM 可以大幅提高 LLM 生成的指令集的质量,从而提高 LLM 的指令遵从能力,并在实际应用中实现了20%的效率提高。
    Abstract Instruction tuning is crucial for enabling Language Learning Models (LLMs) in responding to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative in the training of open-source LLMs. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release the training data and code of CoachLM (https://github.com/lunyiliu/CoachLM).
    摘要 指令微调对于使大语言模型(LLM)能够响应人类指令至关重要,用于微调的指令对质量会极大影响 LLM 的性能。然而,人工创建高质量指令数据集成本高昂,因此由 LLM 自动生成指令对成为训练开源 LLM 的流行替代方案。为保证 LLM 生成的指令数据集质量,已有多种方法被提出。然而,现有方法要么通过过滤大量样本而损害数据集完整性,要么不适用于工业应用。在本文中,我们不丢弃低质量样本,而是提出 CoachLM,一种通过对数据集中样本进行自动修订来提升指令数据集质量的新方法。CoachLM 基于人类专家修订过的样本训练,将数据集中高质量样本的比例从 17.7% 显著提升到 78.9%。我们进一步在多个真实世界指令测试集上评估了 CoachLM 的效果。结果显示,CoachLM 使经指令微调的 LLM 的指令遵循能力平均提升 29.9%,甚至超过了参数量接近其两倍的更大 LLM。此外,CoachLM 已成功部署在华为的 LLM 数据管理系统中,在清洗 4 万条真实世界指令对时带来最高 20% 的效率提升。我们公开了 CoachLM 的训练数据和代码(https://github.com/lunyiliu/CoachLM)。

AutoKG: Efficient Automated Knowledge Graph Generation for Language Models

  • paper_url: http://arxiv.org/abs/2311.14740
  • repo_url: https://github.com/wispcarey/autokg
  • paper_authors: Bohan Chen, Andrea L. Bertozzi
  • for: 提高大语言模型(LLM)与知识库(KG)之间的关系表示,以增强 LLM 生成的相关性和有用性。
  • methods: 使用自动建构知识图(AutoKG),一种轻量级、高效的方法,通过文本块对关键词EXTRACTING和关系权重评估来构建 KG。
  • results: 比较传统的semantic similarity搜索方法,AutoKG 可以更好地捕捉复杂的关系动态,提供更全面和相互连接的知识检索机制,使 LLM 的输出更加有用和相关。
    Abstract Traditional methods of linking large language models (LLMs) to knowledge bases via the semantic similarity search often fall short of capturing complex relational dynamics. To address these limitations, we introduce AutoKG, a lightweight and efficient approach for automated knowledge graph (KG) construction. For a given knowledge base consisting of text blocks, AutoKG first extracts keywords using a LLM and then evaluates the relationship weight between each pair of keywords using graph Laplace learning. We employ a hybrid search scheme combining vector similarity and graph-based associations to enrich LLM responses. Preliminary experiments demonstrate that AutoKG offers a more comprehensive and interconnected knowledge retrieval mechanism compared to the semantic similarity search, thereby enhancing the capabilities of LLMs in generating more insightful and relevant outputs.
    摘要 通过语义相似度搜索将大语言模型(LLM)连接到知识库的传统方法,往往难以捕捉复杂的关系动态。为解决这些限制,我们提出 AutoKG,一种轻量且高效的自动知识图谱(KG)构建方法。对于由文本块组成的知识库,AutoKG 首先使用 LLM 提取关键词,然后通过图拉普拉斯学习(graph Laplace learning)评估每对关键词之间的关系权重。我们采用结合向量相似度与基于图的关联的混合搜索方案来增强 LLM 的回答。初步实验表明,与语义相似度搜索相比,AutoKG 提供了更全面、更互联的知识检索机制,从而提升 LLM 生成更有洞察力、更相关输出的能力。
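A hedged sketch of AutoKG's overall shape: embed extracted keywords, weight keyword pairs by similarity, and form the graph Laplacian that graph-based learning of relation weights operates on (the paper's exact graph Laplace learning procedure may differ; embeddings here are placeholders):

```python
import numpy as np

# Build a similarity graph over LLM-extracted keywords and form its
# unnormalized Laplacian, the object graph-based learning works with.
keywords = ["transformer", "attention", "retrieval", "knowledge graph"]
rng = np.random.default_rng(0)
E = rng.normal(size=(len(keywords), 384))          # stand-in keyword embeddings
E /= np.linalg.norm(E, axis=1, keepdims=True)

W = np.clip(E @ E.T, 0, None)                      # cosine-similarity edge weights
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))                         # degree matrix
L = D - W                                          # unnormalized graph Laplacian
print(L.round(2))
```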

On the Calibration of Large Language Models and Alignment

  • paper_url: http://arxiv.org/abs/2311.13240
  • repo_url: None
  • paper_authors: Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, Zhendong Mao
  • for: 这篇论文对对齐(alignment)过程中语言模型的校准(calibration)进行系统分析。
  • methods: 在包括预训练和对齐训练在内的整个构建流程的各个阶段,考察参数规模、训练数据等不同训练设置对模型校准的影响。
  • results: 从生成、事实性和理解三个最受关注的方面全面评估了模型校准,揭示了各对齐训练阶段和训练设置对模型校准的影响。
    Abstract As large language models attract increasing attention and find widespread application, concurrent challenges of reliability also arise at the same time. Confidence calibration, an effective analysis method for gauging the reliability of deep models, serves as a crucial tool for assessing and improving their reliability. However, such investigation has been comparatively underexplored. In this work, we conduct a systematic examination of the calibration of aligned language models throughout the entire construction process, including pretraining and alignment training. At each stage, we investigate how different training settings, such as parameter scales and training data, affect model calibration. To thoroughly assess model calibration, we evaluate models on three most concerned aspects: generation, factuality and understanding. Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
    摘要 随着大语言模型受到越来越多的关注并得到广泛应用,可靠性方面的挑战也随之而来。置信度校准是衡量深度模型可靠性的一种有效分析方法,是评估和改进其可靠性的重要工具。然而,相关研究至今尚未得到充分探讨。在这项工作中,我们对对齐语言模型在整个构建过程(包括预训练和对齐训练)中的校准进行了系统考察。在每个阶段,我们研究参数规模、训练数据等不同训练设置对模型校准的影响。为全面评估模型校准,我们在三个最受关注的方面对模型进行评估:生成、事实性和理解。我们的工作揭示了流行的 LLM 是否具有良好的校准,以及训练过程如何影响模型校准。
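Calibration studies of this kind typically quantify miscalibration with the Expected Calibration Error (ECE); a standard self-contained sketch:

```python
import numpy as np

# Expected Calibration Error: bin predictions by confidence and compare
# average confidence to empirical accuracy in each bin, weighting bins by
# the fraction of samples they contain.
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55])  # model confidences
hit = np.array([1, 1, 1, 0, 1, 0])                  # whether each answer was right
print(expected_calibration_error(conf, hit))
```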

AS-LLM: When Algorithm Selection Meets Large Language Model

  • paper_url: http://arxiv.org/abs/2311.13184
  • repo_url: None
  • paper_authors: Xingyu Wu, Yan Zhong, Jibin Wu, Kay Chen Tan
  • for: 本研究的目的是解决自动机器学习(AutoML)中的算法选择问题,即在解决特定问题之前,可以快速和准确地选择最适合的算法。
  • methods: 本研究提出一种将算法表征整合进算法选择过程的方法,利用预训练的大语言模型(LLM)在代码理解方面的能力来提取算法特征。
  • results: 实验结果不仅证明了提议的模型的有效性,还展示了不同的预训练LLMs的表现,表明该方法有可能成为自动机器学习中的基准任务,用于评估代码表示能力。
    Abstract Algorithm selection aims to identify the most suitable algorithm for solving a specific problem before execution, which has become a critical process of the AutoML. Current mainstream algorithm selection techniques rely heavily on feature representations of various problems and employ the performance of each algorithm as supervised information. However, there is a significant research gap concerning the consideration of algorithm features. This gap is primarily attributed to the inherent complexity of algorithms, making it particularly challenging to find a universally effective feature extraction method that is applicable across a diverse range of algorithms. Unfortunately, neglecting this aspect undoubtedly impacts the accuracy of algorithm selection and indirectly necessitates an increased volume of problem data for training purposes. This paper takes a significant stride towards addressing this gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, our proposed model employs distinct modules to extract representations of both problems and algorithms, where the algorithm representation leverages the capabilities of pre-trained LLMs in the realm of code comprehension. Following the extraction of embedding vectors for both algorithms and problems, the most suitable algorithm is determined through calculations of matching degrees. Our experiments not only validate the effectiveness of the proposed model but also showcase the performance of different embedded pre-trained LLMs, which suggests that the proposed algorithm selection framework holds the potential to serve as a baseline task for evaluating the code representation capabilities of LLMs.
    摘要 算法选择的目标是在执行之前确定最适合解决特定问题的算法,这已成为自动化机器学习(AutoML)的关键环节。当前主流的算法选择技术主要依赖各类问题的特征表示,并以每个算法的性能作为监督信息。然而,关于算法特征的考量仍存在明显的研究空白,这主要归因于算法本身的复杂性,使得找到一种适用于多种算法的通用有效特征提取方法极其困难。忽略这一点会影响算法选择的准确性,并间接导致需要更多的问题数据用于训练。本文提出了一种将算法表征整合进算法选择过程的方法,以缓解这一研究空白。具体来说,我们的模型使用独立的模块分别提取问题和算法的表征,其中算法表征利用了预训练 LLM 在代码理解领域的能力。在提取出算法和问题的嵌入向量后,通过计算匹配度来确定最适合的算法。我们的实验不仅验证了所提模型的有效性,还展示了不同预训练 LLM 嵌入的表现,这表明所提出的算法选择框架有潜力成为评估 LLM 代码表征能力的基准任务。
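A sketch of the final selection step, with cosine similarity as an illustrative stand-in for the paper's matching-degree computation; the embedding vectors are random placeholders:

```python
import numpy as np

# After separate modules embed the problem and each candidate algorithm
# (the latter via a code-pretrained LLM), the best algorithm is the one
# with the highest matching degree with the problem embedding.
rng = np.random.default_rng(0)
problem_vec = rng.normal(size=256)
algorithm_vecs = rng.normal(size=(5, 256))     # 5 candidate algorithms

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

scores = np.array([cosine(problem_vec, a) for a in algorithm_vecs])
print("selected algorithm:", int(scores.argmax()), "scores:", scores.round(3))
```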

Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper

  • paper_url: http://arxiv.org/abs/2311.13126
  • repo_url: None
  • paper_authors: Chengyu Wang, Junbing Yan, Wei Zhang, Jun Huang
  • for: 本研究聚焦在大语言模型(LLM)中的参数效率精细调整(PEFT)问题上,尤其是LLM的实际应用中的可行性和扩展性问题。
  • methods: 本研究使用了现有的PEFT架构和学习环境,并探讨了PEFT与模型压缩技术的结合以及多模态LLM中PEFT的应用。
  • results: 本研究认为,PEFT 是一个有前景的研究方向,但仍有许多挑战和开放性问题需要解决,包括新型 PEFT 架构、不同学习设定下的 PEFT、PEFT 与模型压缩技术的结合,以及多模态 LLM 中 PEFT 的探索。
    Abstract This paper delves into the pressing need in Parameter-Efficient Fine-Tuning (PEFT) for Large Language Models (LLMs). While LLMs possess remarkable capabilities, their extensive parameter requirements and associated computational demands hinder their practicality and scalability for real-world applications. Our position paper highlights current states and the necessity of further studying into the topic, and recognizes significant challenges and open issues that must be addressed to fully harness the powerful abilities of LLMs. These challenges encompass novel efficient PEFT architectures, PEFT for different learning settings, PEFT combined with model compression techniques, and the exploration of PEFT for multi-modal LLMs. By presenting this position paper, we aim to stimulate further research and foster discussions surrounding more efficient and accessible PEFT for LLMs.
    摘要 这篇论文探讨了大语言模型(LLM)的参数高效微调(PEFT)的迫切需求。虽然 LLM 具有卓越的能力,但其庞大的参数需求和相关的计算需求限制了它们在实际应用中的实用性和可扩展性。我们的立场论文指出了当前研究状况和进一步研究的必要性,并认可了若干必须解决的重要挑战和开放性问题,以充分发挥 LLM 的强大能力。这些挑战包括新型高效 PEFT 架构、不同学习设定下的 PEFT、PEFT 与模型压缩技术的结合,以及多模态 LLM 中 PEFT 的探索。通过发表这篇立场论文,我们希望激发更多研究,并促进围绕更高效、更易用的 LLM PEFT 的讨论。

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

  • paper_url: http://arxiv.org/abs/2311.13110
  • repo_url: None
  • paper_authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
  • for: 本研究的目标是学习既简约又结构化的数据表示:将数据分布压缩并变换为支撑在互不相干子空间上的低维高斯混合。
  • methods: 提出以稀疏率减约(sparse rate reduction)为原则性度量的表示学习目标,并通过对该目标的交替优化推导出名为 CRATE 的白盒、类 Transformer 深度网络架构:多头自注意力算子通过对特征编码率近似执行梯度下降来压缩表示,随后的多层感知机对特征进行稀疏化;该架构在数学上完全可解释。
  • results: 实验表明,CRATE 网络尽管简单,却能在大规模真实世界图像和文本数据集上学习压缩且稀疏的表示,性能与精心设计的 ViT、MAE、DINO、BERT、GPT2 等基于 Transformer 的模型十分接近。
    Abstract In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
    摘要 在这篇论文中,我们认为表示学习的一个自然目标是压缩并变换数据(如 token 集合)的分布,使其趋向支撑在互不相干子空间上的低维高斯混合。这种表示的优劣可以用一种原则性的度量来评估,即稀疏率减约(sparse rate reduction),它同时最大化所学表示的内在信息增益和外在稀疏性。从这个角度来看,包括 Transformer 在内的流行深度网络架构,可以被视为实现优化该度量的迭代方案。特别地,我们通过对该目标各部分的交替优化推导出一个 Transformer 块:多头自注意力算子通过对特征编码率近似执行一步梯度下降来压缩表示,随后的多层感知机对特征进行稀疏化。由此得到一族名为 CRATE 的白盒、类 Transformer 深度网络架构,它们在数学上完全可解释。我们通过去噪与压缩之间的一种新联系证明,上述压缩编码的逆过程可以由同一类 CRATE 架构实现,因此这类白盒架构对编码器和解码器都是通用的。实验表明,这些网络尽管简单,确实能学习压缩并稀疏化大规模真实世界图像和文本数据集的表示,性能与高度工程化的基于 Transformer 的模型(ViT、MAE、DINO、BERT 和 GPT2)非常接近。我们认为所提出的计算框架展示了从数据压缩的统一视角弥合深度学习理论与实践差距的巨大潜力。代码可在 https://ma-lab-berkeley.github.io/CRATE 获取。

Perceptual Structure in the Absence of Grounding for LLMs: The Impact of Abstractedness and Subjectivity in Color Language

  • paper_url: http://arxiv.org/abs/2311.13105
  • repo_url: None
  • paper_authors: Pablo Loyola, Edison Marrese-Taylor, Andres Hoyos-Idobro
  • for: investigate the problem of grounding in language understanding, using color perception and color language as a test bed
  • methods: collect a large dataset of colors and their descriptions, and perform an empirical analysis comparing two types of alignments (inter-space and intra-space)
  • results: find that while color space alignment holds for simple, pragmatic color descriptions, it drops significantly in the presence of examples with subjective or abstract elements, suggesting that grounding may be necessary in such cases.
    Abstract The need for grounding in language understanding is an active research topic. Previous work has suggested that color perception and color language appear as a suitable test bed to empirically study the problem, given its cognitive significance and showing that there is considerable alignment between a defined color space and the feature space defined by a language model. To further study this issue, we collect a large scale source of colors and their descriptions, containing almost a 1 million examples , and perform an empirical analysis to compare two kinds of alignments: (i) inter-space, by learning a mapping between embedding space and color space, and (ii) intra-space, by means of prompting comparatives between color descriptions. Our results show that while color space alignment holds for monolexemic, highly pragmatic color descriptions, this alignment drops considerably in the presence of examples that exhibit elements of real linguistic usage such as subjectivity and abstractedness, suggesting that grounding may be required in such cases.
    摘要 语言理解是否需要基础(grounding)是一个活跃的研究课题。先前的工作表明,鉴于其认知上的重要性,颜色感知和颜色语言是实证研究这一问题的合适测试平台,并显示既定的颜色空间与语言模型定义的特征空间之间存在相当程度的对齐。为进一步研究这一问题,我们收集了一个包含近一百万条颜色及其描述的大规模数据集,并开展实证分析,比较两种对齐方式:(i)空间间对齐,即学习嵌入空间与颜色空间之间的映射;(ii)空间内对齐,即通过提示对颜色描述进行比较。我们的结果表明,对于单词式、高度语用的颜色描述,颜色空间对齐是成立的;但当样本呈现主观性、抽象性等真实语言使用特征时,这种对齐显著下降,这表明在此类情况下可能需要引入基础。
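The inter-space alignment can be probed with a simple linear map from description embeddings to color coordinates; a sketch with placeholder arrays (RGB chosen for concreteness; the paper's color space may differ):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Learn a linear map from language-model embeddings of color descriptions
# to points in a color space and measure held-out error. Arrays here are
# placeholders for the paper's ~1M description/color pairs.
rng = np.random.default_rng(0)
desc_embeddings = rng.normal(size=(1000, 768))   # embeddings of descriptions
colors = rng.uniform(size=(1000, 3))             # matching RGB coordinates

X_tr, X_te, y_tr, y_te = train_test_split(desc_embeddings, colors, random_state=0)
mapper = Ridge(alpha=1.0).fit(X_tr, y_tr)
mse = np.mean((mapper.predict(X_te) - y_te) ** 2)
print(f"held-out MSE of the embedding-to-color map: {mse:.4f}")
```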

Detecting out-of-distribution text using topological features of transformer-based language models

  • paper_url: http://arxiv.org/abs/2311.13102
  • repo_url: https://github.com/andrespollano/neural_nets-tda
  • paper_authors: Andres Pollano, Anupam Chaudhuri, Anj Simmons
  • for: 检测transformer语言模型中的异常输入(Out-of-distribution,OOD)样本
  • methods: 应用Topological Data Analysis(TDA)方法对transformer语言模型的注意力地图进行分析
  • results: 在 BERT 语言模型上的评估表明,与基于 BERT CLS 嵌入的传统 OOD 方法相比,TDA 方法在区分域内数据(HuffPost 的政治和娱乐新闻)与远域外样本(IMDB 影评)时效果更好;但在近域外(CNN/Dailymail)或同域(HuffPost 商业新闻)数据集上,其效果明显下降。
    Abstract We attempt to detect out-of-distribution (OOD) text samples though applying Topological Data Analysis (TDA) to attention maps in transformer-based language models. We evaluate our proposed TDA-based approach for out-of-distribution detection on BERT, a transformer-based language model, and compare the to a more traditional OOD approach based on BERT CLS embeddings. We found that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.
    摘要 我们尝试通过对基于 Transformer 的语言模型的注意力图应用拓扑数据分析(TDA)来检测分布外(OOD)文本样本。我们在 BERT 语言模型上评估了所提出的基于 TDA 的 OOD 检测方法,并与基于 BERT CLS 嵌入的更传统的 OOD 方法进行比较。我们发现,在区分域内数据(HuffPost 的政治和娱乐新闻)与远域外样本(IMDB 影评)时,TDA 方法优于 CLS 嵌入方法;但在近域外(CNN/Dailymail)或同域(HuffPost 商业新闻)数据集上,其效果会下降。
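A hedged sketch of the TDA step using ripser, treating an attention map's rows as a point cloud and summarizing persistence lifetimes into features an OOD detector can threshold or classify on (the paper's exact feature design may differ):

```python
import numpy as np
from ripser import ripser

# Compute persistence diagrams from one attention head's map and summarize
# each diagram into simple lifetime statistics.
def tda_features(attention_map: np.ndarray) -> np.ndarray:
    """attention_map: (seq_len, seq_len) attention weights from one head."""
    dgms = ripser(attention_map)["dgms"]           # H0 and H1 diagrams
    feats = []
    for dgm in dgms:
        finite = dgm[np.isfinite(dgm[:, 1])]       # drop the infinite H0 bar
        lifetimes = finite[:, 1] - finite[:, 0]    # persistence of each feature
        feats += [lifetimes.sum(), lifetimes.max(initial=0.0), float(len(finite))]
    return np.array(feats)

attn = np.random.default_rng(0).random((32, 32))   # stand-in attention map
attn /= attn.sum(axis=1, keepdims=True)            # row-normalize like softmax
print(tda_features(attn))
```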

cs.LG - 2023-11-22

Deep Learning as a Method for Inversion of NMR Signals

  • paper_url: http://arxiv.org/abs/2311.13722
  • repo_url: None
  • paper_authors: Julian B. B. Beckmann, Mick D. Mantle, Andrew J. Sederman, Lynn F. Gladden
  • for: 将深度学习用于核磁共振(NMR)信号的反演。
  • methods: 将 NMR 信号反演视为图像到图像的回归问题,用卷积神经网络求解。
  • results: 与常用的 Tikhonov 正则化和改进的全广义变差(MTGV)正则化方法相比,基于深度学习的反演速度明显更快,并且在几乎所有情形下都优于这两种正则化方法。
    Abstract The concept of deep learning is employed for the inversion of NMR signals and it is shown that NMR signal inversion can be considered as an image-to-image regression problem, which can be treated with a convolutional neural net. It is further outlined, that inversion through deep learning provides a clear efficiency and usability advantage compared to regularization techniques such as Tikhonov and modified total generalized variation (MTGV), because no hyperparemeter selection prior to reconstruction is necessary. The inversion network is applied to simulated NMR signals and the results compared with Tikhonov- and MTGV-regularization. The comparison shows that inversion via deep learning is significantly faster than the latter regularization methods and also outperforms both regularization techniques in nearly all instances.
    摘要 本文将深度学习概念应用于核磁共振(NMR)信号的反演,并表明 NMR 信号反演可以视为一个图像到图像的回归问题,可用卷积神经网络处理。文中进一步指出,与 Tikhonov 和改进的全广义变差(MTGV)等正则化技术相比,通过深度学习进行反演在效率和易用性上具有明显优势,因为无需在重建之前选择超参数。我们将反演网络应用于模拟的 NMR 信号,并与 Tikhonov 和 MTGV 正则化的结果比较。比较表明,基于深度学习的反演显著快于后两种正则化方法,并且在几乎所有情形下表现更优。
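A minimal sketch of the idea: a small 1-D convolutional network regressing from a measured decay signal to a relaxation-time distribution (architecture details are our illustration, not the paper's):

```python
import torch
import torch.nn as nn

# Learned inversion: map an NMR decay signal to a non-negative distribution,
# replacing Tikhonov/MTGV-regularized inversion with a direct regression.
class InversionNet(nn.Module):
    def __init__(self, n_in: int = 256, n_out: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * n_in, n_out),
            nn.Softplus(),                          # distributions are non-negative
        )

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        return self.net(signal.unsqueeze(1))        # (batch, n_in) -> (batch, n_out)

model = InversionNet()
decay = torch.randn(8, 256)                         # batch of simulated decay curves
print(model(decay).shape)                           # torch.Size([8, 128])
```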

Bayes-xG: Player and Position Correction on Expected Goals (xG) using Bayesian Hierarchical Approach

  • paper_url: http://arxiv.org/abs/2311.13707
  • repo_url: None
  • paper_authors: Alexander Scholtes, Oktay Karakuş
  • for: 本研究采用贝叶斯方法,探索球员与位置因素对射门转化为进球概率(即预期进球 xG 指标)的影响。
  • methods: 利用 StatsBomb 公开数据,构建贝叶斯层次逻辑回归模型,分析英超联赛约 10,000 次射门,以确定位置或球员层面的效应是否影响 xG。
  • results: 研究发现,在仅以距球门距离和射门角度为预测变量的基本模型中存在位置效应,前锋和进攻型中场的得分概率更高;但引入更多信息量大的预测变量后,这些效应随之减弱。即便如此,球员层面的效应依然存在,表明某些球员具有显著的正向或负向 xG 调整,影响其把握机会得分的概率。研究还扩展到西班牙西甲和德国德甲的数据,得到了相似的结果;此外,论文还评估了先验分布选择对结果的影响。
    Abstract This study employs Bayesian methodologies to explore the influence of player or positional factors in predicting the probability of a shot resulting in a goal, measured by the expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian hierarchical logistic regressions are constructed, analysing approximately 10,000 shots from the English Premier League to ascertain whether positional or player-level effects impact xG. The findings reveal positional effects in a basic model that includes only distance to goal and shot angle as predictors, highlighting that strikers and attacking midfielders exhibit a higher likelihood of scoring. However, these effects diminish when more informative predictors are introduced. Nevertheless, even with additional predictors, player-level effects persist, indicating that certain players possess notable positive or negative xG adjustments, influencing their likelihood of scoring a given chance. The study extends its analysis to data from Spain's La Liga and Germany's Bundesliga, yielding comparable results. Additionally, the paper assesses the impact of prior distribution choices on outcomes, concluding that the priors employed in the models provide sound results but could be refined to enhance sampling efficiency for constructing more complex and extensive models feasibly.
    摘要 本研究采用贝叶斯方法,探索球员或位置因素在预测射门转化为进球概率(以预期进球 xG 指标衡量)中的影响。利用 StatsBomb 的公开数据,我们构建了贝叶斯层次逻辑回归模型,分析英超联赛约 10,000 次射门,以确定位置或球员层面的效应是否影响 xG。结果显示,在仅包含距球门距离和射门角度两个预测变量的基本模型中存在位置效应:前锋和进攻型中场的得分概率更高。然而,当引入更多信息量大的预测变量后,这些效应随之减弱。即便如此,球员层面的效应依然存在,表明某些球员具有显著的正向或负向 xG 调整,影响其把握机会得分的概率。研究还将分析扩展到西班牙西甲和德国德甲的数据,得到了相似的结果。此外,论文还评估了先验分布选择对结果的影响,结论是模型所采用的先验能给出可靠的结果,但仍可进一步改进以提升采样效率,从而可行地构建更复杂、更大规模的模型。
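A hedged PyMC sketch of a Bayes-xG-style hierarchical logistic regression with partially pooled positional intercepts; priors and data arrays are illustrative placeholders for StatsBomb shot records:

```python
import numpy as np
import pymc as pm

# Toy stand-ins for shot records: distance/angle predictors, a position
# index per shot, and the binary goal outcome.
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(5, 35, n)
angle = rng.uniform(0.1, 1.4, n)
position = rng.integers(0, 4, n)          # e.g. striker / AM / CM / defender
goal = rng.integers(0, 2, n)

with pm.Model() as xg_model:
    beta_d = pm.Normal("beta_d", 0.0, 1.0)
    beta_a = pm.Normal("beta_a", 0.0, 1.0)
    sigma_pos = pm.HalfNormal("sigma_pos", 1.0)
    alpha_pos = pm.Normal("alpha_pos", 0.0, sigma_pos, shape=4)  # pooled intercepts
    logit_p = alpha_pos[position] + beta_d * distance + beta_a * angle
    pm.Bernoulli("goal", logit_p=logit_p, observed=goal)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
# Posterior spread of alpha_pos indicates how much position shifts xG.
```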

BackboneLearn: A Library for Scaling Mixed-Integer Optimization-Based Machine Learning

  • paper_url: http://arxiv.org/abs/2311.13695
  • repo_url: https://github.com/chziakas/backbone_learn
  • paper_authors: Vassilis Digalakis Jr, Christos Ziakas
  • for: 这篇论文提出了一个开源软件包和框架,用于将带指示变量的混合整数优化(MIO)问题扩展到高维问题。
  • methods: 使用骨干(backbone)算法求解混合整数优化问题,能够快速求解高维问题。
  • results: 实验结果表明,该方法比精确方法更快,且比常用的启发式方法更准确。
    Abstract We present BackboneLearn: an open-source software package and framework for scaling mixed-integer optimization (MIO) problems with indicator variables to high-dimensional problems. This optimization paradigm can naturally be used to formulate fundamental problems in interpretable supervised learning (e.g., sparse regression and decision trees), in unsupervised learning (e.g., clustering), and beyond; BackboneLearn solves the aforementioned problems faster than exact methods and with higher accuracy than commonly used heuristics. The package is built in Python and is user-friendly and easily extensible: users can directly implement a backbone algorithm for their MIO problem at hand. The source code of BackboneLearn is available on GitHub (link: https://github.com/chziakas/backbone_learn).
    摘要 我们介绍BackboneLearn:一个开源软件套件和框架,用于将带指示变量的混合整数优化(MIO)问题扩展至高维。这种优化范式可自然地用来表述可解释监督学习中的基本问题(例如:稀疏回归和决策树)、无监督学习问题(例如:聚类)及其他领域;BackboneLearn 比精确方法更快地解决上述问题,并且比常用的启发式方法具有更高的精度。BackboneLearn 以 Python 构建,易用且可扩展:用户可以直接为自己的 MIO 问题实现 backbone 算法。BackboneLearn 的源代码可以在 GitHub 上获取(链接:https://github.com/chziakas/backbone_learn)。
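
To illustrate the backbone idea generically (this is not BackboneLearn's API, whose exact interface is documented in the repository above), here is a sketch for sparse regression: screen variables cheaply on subsamples, take the union of their supports as the backbone, then solve the small exact subset-selection problem on it:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))
w_true = np.zeros(200); w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=500)

# 1) Backbone construction: union of Lasso supports over random subsamples.
backbone = set()
for _ in range(5):
    idx = rng.choice(500, 250, replace=False)
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    backbone |= set(np.flatnonzero(np.abs(coef) > 1e-6))
backbone = sorted(backbone)

# 2) Exact best-subset search (k=3) restricted to the small backbone only.
best, best_err = None, np.inf
for S in combinations(backbone, 3):
    cols = list(S)
    pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
    err = np.mean((pred - y) ** 2)
    if err < best_err:
        best, best_err = cols, err
print("selected features:", best)
```

The exact search becomes feasible because it runs over the backbone (tens of variables) rather than the full 200-dimensional problem.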

Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts

  • paper_url: http://arxiv.org/abs/2311.13687
  • repo_url: https://github.com/stet-stet/goct_ismir2023
  • paper_authors: Jayeon Yi, Sungho Lee, Kyogu Lee
  • for: The paper targets chart generation for rhythm games, where players must perform actions in sync with a piece of music following on-screen directives ("charts").
  • methods: Chart generation is newly formulated as a sequence generation task; a Transformer is trained on a large dataset, with tempo-informed preprocessing and training procedures, some of which are suggested to be integral to successful training.
  • results: The model outperforms the baselines on a large dataset and also benefits from pretraining and finetuning.
    Abstract In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a successful training. Our model is found to outperform the baselines on a large dataset, and is also found to benefit from pretraining and finetuning.
    摘要 在“节奏游戏”的心脏地带 - 游戏中玩家需要按照音乐的节奏进行动作 - 有“Chart”,它们是玩家需要遵循的指令。我们新的形式化chart生成为序列生成任务,使用大量数据训练Transformer。我们还提出了根据节奏情况进行预处理和训练方法,其中一些方法被认为是成功训练的重要组成部分。我们的模型在大量数据上表现出色,并且也受益于预训练和精度调整。
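
A toy sketch of what tempo-informed preprocessing can look like: snapping note onsets (in seconds) to a beat-subdivision grid so that targets become beat-aligned token positions. The grid resolution and token scheme are assumptions for illustration, not the paper's exact pipeline:

```python
import numpy as np

def beat_align(onsets_sec, bpm, subdivisions=4):
    """Quantize onset times to the nearest 1/subdivisions of a beat."""
    beat_len = 60.0 / bpm
    grid = beat_len / subdivisions
    ticks = np.round(np.asarray(onsets_sec) / grid).astype(int)
    return ticks  # integer positions on the beat grid

onsets = [0.51, 0.98, 1.27, 1.49]       # detected note onsets (seconds)
print(beat_align(onsets, bpm=120))      # -> [ 4  8 10 12] at 120 BPM
```

Quantizing to the beat grid makes the output sequence invariant to tempo, which is one plausible reason tempo information matters for training such a spectrogram-to-sequence model.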

A Joint Gradient and Loss Based Clustered Federated Learning Design

  • paper_url: http://arxiv.org/abs/2311.13665
  • repo_url: None
  • paper_authors: Licheng Lin, Mingzhe Chen, Zhaohui Yang, Yusen Wu, Yuchen Liu
  • for: Proposes a clustered federated learning (FL) framework that lets distributed edge devices with non-IID data independently form several clusters in a distributed manner and run FL training within each cluster.
  • methods: The clustered FL algorithm must overcome two challenges. First, the server has limited FL training information (the parameter server only obtains each device's FL model) and limited computational power for finding differences among a large number of devices. Second, each device has no data information about other devices and can only use the global FL model parameters received from the server, together with its own data, to determine its cluster identity, which makes device clustering harder. To address both challenges, a joint gradient- and loss-based distributed clustering method is proposed in which each device determines its own cluster identity from gradient similarity and training loss.
  • results: The proposed clustered FL algorithm reduces clustering iterations by up to 99% compared with the existing baseline.
    Abstract In this paper, a novel clustered FL framework that enables distributed edge devices with non-IID data to independently form several clusters in a distributed manner and implement FL training within each cluster is proposed. In particular, our designed clustered FL algorithm must overcome two challenges associated with FL training. First, the server has limited FL training information (i.e., the parameter server can only obtain the FL model information of each device) and limited computational power for finding the differences among a large amount of devices. Second, each device does not have the data information of other devices for device clustering and can only use global FL model parameters received from the server and its data information to determine its cluster identity, which will increase the difficulty of device clustering. To overcome these two challenges, we propose a joint gradient and loss based distributed clustering method in which each device determines its cluster identity considering the gradient similarity and training loss. The proposed clustering method not only considers how a local FL model of one device contributes to each cluster but also the direction of gradient descent thus improving clustering speed. By delegating clustering decisions to edge devices, each device can fully leverage its private data information to determine its own cluster identity, thereby reducing clustering overhead and improving overall clustering performance. Simulation results demonstrate that our proposed clustered FL algorithm can reduce clustering iterations by up to 99% compared to the existing baseline.
    摘要 在本文中,我们提出了一种新的分布式FL框架,允许分布式的边缘设备,具有非相同数据,独立形成多个分布式集群,并在每个集群中进行FL训练。具体来说,我们设计的分布式FL算法需要解决FL训练中两个挑战。首先,服务器有限制FL训练信息(即参数服务器只能获得每个设备的FL模型信息)和计算能力,用于找出多个设备之间的差异。其次,每个设备没有其他设备的数据信息,只能使用服务器发送的全局FL模型参数和自己的数据信息,确定自己的集群标识,这会增加设备归类的困难度。为了解决这两个挑战,我们提出了一种基于梯度和损失的分布式归类方法。在这种方法中,每个设备根据自己的梯度相似性和训练损失来确定自己的集群标识。我们的归类方法不仅考虑了一个本地FL模型如何贡献到每个集群中,还考虑了梯度下降的方向,从而提高归类速度。通过委托归类决策给边缘设备,每个设备可以完全利用自己的私有数据信息来确定自己的集群标识,从而减少归类负担和提高总的归类性能。实验结果表明,我们提出的分布式FL算法可以相比现有基线,减少归类迭代次数达99%。
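
A sketch of the device-side rule implied by the paper: each device picks the cluster whose global model produces the most similar gradient direction and the lowest local training loss. The scoring weight `alpha` and the synthetic inputs are assumptions for illustration:

```python
import numpy as np

def choose_cluster(local_grad, local_losses, cluster_grads, alpha=1.0):
    """Pick a cluster identity from gradient similarity and training loss.

    local_grad: the device's flattened gradient; cluster_grads[k] is a
    reference gradient for cluster k; local_losses[k] is the loss of cluster
    k's global model on this device's data. Lower score wins.
    """
    scores = []
    for g_k, loss_k in zip(cluster_grads, local_losses):
        denom = np.linalg.norm(local_grad) * np.linalg.norm(g_k) + 1e-12
        cos = float(local_grad @ g_k) / denom
        scores.append(alpha * (1.0 - cos) + loss_k)
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
g = rng.normal(size=100)
clusters = [g + 0.1 * rng.normal(size=100), rng.normal(size=100)]
print(choose_cluster(g, local_losses=[0.4, 0.9], cluster_grads=clusters))  # -> 0
```

Because the decision uses only quantities the device can compute locally, no raw data leaves the device and the server never has to compare all devices pairwise.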

Evaluating Pretrained models for Deployable Lifelong Learning

  • paper_url: http://arxiv.org/abs/2311.13648
  • repo_url: None
  • paper_authors: Kiran Lekkala, Eshan Bhargava, Laurent Itti
  • for: This paper proposes a deployable lifelong learning system for visual reinforcement learning (RL) that retains previously learned knowledge while performing new tasks.
  • methods: The system combines a Few-Shot Class-Incremental Learning (FSCIL) based task mapper with an encoder/backbone trained entirely on a pretraining dataset; the policy parameters corresponding to the recognized task are then loaded to perform it.
  • results: Experiments on the DeLL (Deployment for Lifelong Learning) benchmark with Atari games show the system scales to a large number of tasks owing to its small memory footprint and modest computational requirements.
    Abstract We create a novel benchmark for evaluating a Deployable Lifelong Learning system for Visual Reinforcement Learning (RL) that is pretrained on a curated dataset, and propose a novel Scalable Lifelong Learning system capable of retaining knowledge from the previously learnt RL tasks. Our benchmark measures the efficacy of a deployable Lifelong Learning system that is evaluated on scalability, performance and resource utilization. Our proposed system, once pretrained on the dataset, can be deployed to perform continual learning on unseen tasks. Our proposed method consists of a Few Shot Class Incremental Learning (FSCIL) based task-mapper and an encoder/backbone trained entirely using the pretrain dataset. The policy parameters corresponding to the recognized task are then loaded to perform the task. We show that this system can be scaled to incorporate a large number of tasks due to the small memory footprint and fewer computational resources. We perform experiments on our DeLL (Deployment for Lifelong Learning) benchmark on the Atari games to determine the efficacy of the system.
    摘要 我们创建了一个新的标准测试,用于评估面向视觉强化学习(RL)的可部署终身学习系统,该系统预训练在一个精心选择的数据集上;我们并提出了一种可扩展的终身学习系统,能够保留之前学习的 RL 任务知识。我们的标准测试量化系统的可扩展性、性能和资源利用率。我们提出的系统,一旦预训练在数据集上,即可部署以对未见任务执行持续学习。我们的方法包括基于少样本类增量学习(FSCIL)的任务映射器和完全由预训练数据集训练的encoder/骨干网络。随后加载与识别出的任务相应的策略参数以执行任务。我们证明该系统可以扩展到包含大量任务,因为它具有小内存占用量和较少的计算资源。我们在DeLL(Deployment for Lifelong Learning)标准测试上进行了Atari游戏的实验,以评估系统的有效性。
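
A minimal sketch of an FSCIL-style task mapper: a frozen encoder embeds observations, each task is registered by the mean embedding of a few shots, and inference routes to the nearest task prototype, whose policy parameters are then loaded. The placeholder encoder and names are illustrative assumptions:

```python
import numpy as np

class TaskMapper:
    def __init__(self, encoder):
        self.encoder = encoder
        self.prototypes, self.policies = [], []

    def register_task(self, few_shot_obs, policy_params):
        z = self.encoder(few_shot_obs).mean(axis=0)
        self.prototypes.append(z / np.linalg.norm(z))
        self.policies.append(policy_params)

    def route(self, obs):
        z = self.encoder(obs[None]).mean(axis=0)
        z /= np.linalg.norm(z)
        k = int(np.argmax([p @ z for p in self.prototypes]))
        return self.policies[k]          # load these weights, then act

encoder = lambda x: x.reshape(len(x), -1)        # stand-in for the backbone
mapper = TaskMapper(encoder)
mapper.register_task(np.random.rand(5, 8, 8), policy_params="pong_head")
print(mapper.route(np.random.rand(8, 8)))        # -> "pong_head"
```

Storing only one prototype vector and one policy head per task is what keeps the memory footprint small as the number of tasks grows.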

Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

  • paper_url: http://arxiv.org/abs/2311.13595
  • repo_url: None
  • paper_authors: Yanjun Han, Philippe Rigollet, George Stepaniants
  • for: This paper studies feature alignment methods, used across scientific disciplines for data pooling, annotation, and comparison; as an instance of a permutation learning problem, feature alignment poses significant statistical and computational challenges.
  • methods: It proposes the covariance alignment model to study and compare alignment methods and establishes a minimax lower bound for covariance alignment whose dimension scaling is non-standard because of a nuisance parameter. The bound is minimax optimal and is achieved by a natural quasi-MLE; however, that estimator involves a search over all permutations, which is computationally infeasible even for moderately sized problems.
  • results: The paper shows that the Gromov-Wasserstein algorithm from optimal transport, which admits fast implementations even at large scale, also achieves the minimax rate, giving the first statistical justification for deploying it in practice.
    Abstract Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport which is more amenable to fast implementation even on large-scale problems is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice.
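
A small sketch of aligning two feature sets with the Gromov-Wasserstein algorithm via the POT library (`pip install pot`); sizes and noise level are illustrative, and on harder instances the recovered coupling may only be approximately a permutation:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
perm = rng.permutation(30)
Y = X[perm] + 0.01 * rng.normal(size=(30, 5))   # permuted, noisy copy

# GW compares intra-domain geometry, so it needs only within-set distances.
C1 = ot.dist(X, X)
C2 = ot.dist(Y, Y)
p, q = ot.unif(30), ot.unif(30)
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")

recovered = T.argmax(axis=1)                    # hard assignment from the plan
print("correctly matched:", np.mean(perm[recovered] == np.arange(30)))
```

The appeal, which the paper now backs statistically, is that this solver scales to problems where the quasi-MLE's search over all permutations is hopeless.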

Risk-sensitive Markov Decision Process and Learning under General Utility Functions

  • paper_url: http://arxiv.org/abs/2311.13589
  • repo_url: None
  • paper_authors: Zhengqi Wu, Renyuan Xu
  • for: This paper develops risk-sensitive learning in Markov decision processes (MDPs) for decision makers whose heterogeneous risk preferences are captured by a general utility function of the cumulative reward.
  • methods: It enlarges the state space with an extra dimension that tracks the cumulative reward so that the dynamic programming principle and Bellman equation hold, proposes a tractable discretized approximation of the enlarged MDP, and designs a modified value iteration algorithm using an epsilon-cover over the space of cumulative rewards, plus an upper-confidence-bound exploration variant for the simulator-free setting.
  • results: With a simulator, the algorithm learns a near-optimal policy with guaranteed sample complexity; without one, it attains a guaranteed regret bound. Both algorithms match the theoretical lower bounds for the risk-neutral setting.
    Abstract Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often persist in heterogeneous risk preferences subject to outcome uncertainties, which can not be well-captured by the risk-neural framework. Incorporating these preferences can be approached through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To facilitate the Dynamic Programming Principle and Bellman equation, we enlarge the state space with an additional dimension that accounts for the cumulative reward. We propose a discretized approximation scheme to the MDP under enlarged state space, which is tractable and key for algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. For both algorithms, we match the theoretical lower bounds for the risk-neutral setting.
    摘要 强化学习(RL)在多种应用领域和理论研究中备受关注。现有RL理论文献主要关注风险中性情形,即决策者学习最大化期望累积回报。然而,在诸如投资组合管理和电商推荐等实际应用中,决策者往往具有多样化的风险偏好并面临结果不确定性,这些不能由风险中性框架很好地捕捉。通过效用理论可以处理这些偏好,但针对一般效用函数的风险敏感强化学习的发展仍然是一个开放的理论问题。在这篇论文中,我们考虑决策者在 Markov decision process(MDP)框架中优化累积回报的一般效用函数。为了使动态规划原理和 Bellman 方程成立,我们为状态空间增加一个额外维度,用于跟踪累积回报。我们对扩展状态空间下的MDP提出了一种离散化近似方案,该方案易于处理且是算法设计的关键。随后,我们提出了一种修改的值迭代算法,在累积回报空间上使用 epsilon 覆盖。当有可用的 simulator 时,我们的算法可以高效地学习近似最优策略,并具有有保证的样本复杂度。在 simulator 缺失时,我们的算法借助 upper-confidence-bound 探索方法识别近似最优策略,同时保证有界的遗憾(regret)。两种算法均与风险中性情形下的理论下界匹配。
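
A sketch of value iteration on the enlarged state space (state, cumulative reward) for maximizing E[U(total reward)] over a finite horizon. The tiny MDP, integer reward grid, and exponential utility are toy assumptions standing in for the paper's epsilon-cover:

```python
import numpy as np

S, A, H = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] -> next-state dist
R = rng.integers(0, 2, size=(S, A))             # integer rewards in {0, 1}
U = lambda w: 1.0 - np.exp(-w)                  # a risk-averse utility
wealth = np.arange(H + 1)                       # reachable cumulative rewards

# V[h, s, w] = optimal expected utility from step h with cumulative reward w.
V = np.zeros((H + 1, S, H + 1))
V[H] = U(wealth)[None, :]                       # terminal: utility of total
for h in range(H - 1, -1, -1):
    for s in range(S):
        for w in range(h + 1):                  # at step h, w <= h
            q = [P[s, a] @ V[h + 1, :, w + R[s, a]] for a in range(A)]
            V[h, s, w] = max(q)
print("value at start:", V[0, :, 0])
```

With continuous rewards the wealth axis cannot be enumerated exactly, which is where the paper's epsilon-covering of the cumulative-reward space comes in.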

A Survey of Serverless Machine Learning Model Inference

  • paper_url: http://arxiv.org/abs/2311.13587
  • repo_url: None
  • paper_authors: Kamil Kojs
  • for: This survey examines the development and optimization of large-scale deep learning serving systems, focusing on serverless machine learning model inference.
  • methods: It introduces a novel taxonomy that summarizes and categorizes the emerging challenges and optimization opportunities of large-scale deep learning serving systems.
  • results: By analyzing recent trends and techniques, the survey offers new optimization perspectives and aims to motivate novel work on large-scale deep learning serving systems.
    Abstract Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these models in production environments. When hosting machine learning models for real-time predictions, it is important to meet defined Service Level Objectives (SLOs), ensuring reliability, minimal downtime, and optimizing operational costs of the underlying infrastructure. Large machine learning models often demand GPU resources for efficient inference to meet SLOs. In the context of these trends, there is growing interest in hosting AI models in a serverless architecture while still providing GPU access for inference tasks. This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. By providing a novel taxonomy and summarizing recent trends, we hope that this survey could shed light on new optimization perspectives and motivate novel works in large-scale deep learning serving systems.
    摘要 最近的生成AI、计算机视觉和自然语言处理技术的发展,导致了AI模型在各种产品中的广泛应用。这种普遍应用的AI需要大量的努力来部署这些模型在生产环境中。在托管机器学习模型进行实时预测时,要保证定义的服务水平目标(SLOs),确保可靠性、最小的下时间和基础设施的运营成本优化。大型机器学习模型通常需要GPU资源来实现高效的推断,以达到SLOs。在这些趋势下,有增长的兴趣在托管AI模型在无服务器架构中提供GPU访问,以便进行推断任务。这项调查旨在总结和分类emerging挑战和优化机会,以便在大规模深度学习服务系统中提高性能。我们希望通过这项调查,为大规模深度学习服务系统的优化提供新的视角,并促进新的研究。

On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates

  • paper_url: http://arxiv.org/abs/2311.13584
  • repo_url: None
  • paper_authors: Stefano Bruno, Ying Zhang, Dong-Young Lim, Ömer Deniz Akyildiz, Sotirios Sabanis
  • for: This paper provides full theoretical convergence guarantees for diffusion-based generative models under the assumption of strongly log-concave data distributions.
  • methods: The approximating class of functions used for score estimation consists of Lipschitz continuous functions, and the analysis rests on an $L^2$-accurate score estimation assumption formed in expectation over the stochastic optimizer and a novel auxiliary process that uses only known information.
  • results: The paper obtains the best known upper bounds on key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution and the sampling algorithm, with explicit estimates for the motivating example of sampling from a Gaussian distribution with unknown mean.
    Abstract We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly logconcave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.
    摘要 我们提供了对于扩散基础的生成模型的完整理论保证,假设 données 是强力的凹陷分布,而我们使用的推估函数是 lipschitz 连续函数。我们通过一个推奋例子,批评 Gaussian 分布的未知均值,说明了我们的方法的强大性。在这个情况下,我们提供了明确的估计,包括推奋估计和采样估计,并且这些估计在适当的情况下组合,从而获得了关键量之间的最好知道上限估计,例如 Wasserstein-2 距离之间的距离。而我们的结果使用 $L^2$ 精度的推奋估计假设,这个假设是基于随机估计和我们的新 auxiliary 过程,这个过程只使用了已知的信息。这种方法将提供最好的对于采样算法的对应率。
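
A toy sketch of the motivating example: sampling from N(mu, I) with unknown mu using Langevin dynamics driven by an estimated score. The plug-in score s(x) = -(x - mu_hat) stands in for a learned score network; step size and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
data = mu + rng.normal(size=(500, 2))           # observed samples
mu_hat = data.mean(axis=0)                      # "score estimation" step
score = lambda x: -(x - mu_hat)                 # grad log density of N(mu_hat, I)

x = rng.normal(size=(1000, 2))                  # initial particles
eta = 0.05
for _ in range(500):                            # unadjusted Langevin algorithm
    x = x + eta * score(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)
print("sample mean:", x.mean(axis=0), "target:", mu_hat)
```

The paper's contribution is to quantify exactly how the score-estimation error and the sampler's discretization error combine into a Wasserstein-2 bound in this kind of setting.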

Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies

  • paper_url: http://arxiv.org/abs/2311.13583
  • repo_url: None
  • paper_authors: Shabnam Daghaghi, Benjamin Coleman, Benito Geordie, Anshumali Shrivastava
  • for: Improving the training speed of neural networks through data sampling.
  • methods: A novel sampling distribution based on nonparametric kernel regression that learns an effective importance score as the neural network trains, combined with an efficient sketch-based approximation to the Nadaraya-Watson estimator that is proven to carry exponential convergence guarantees.
  • results: The proposed sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy on four datasets.
    Abstract Data sampling is an effective method to improve the training speed of neural networks, with recent results demonstrating that it can even break the neural scaling laws. These results critically rely on high-quality scores to estimate the importance of an input to the network. We observe that there are two dominant strategies: static sampling, where the scores are determined before training, and dynamic sampling, where the scores can depend on the model weights. Static algorithms are computationally inexpensive but less effective than their dynamic counterparts, which can cause end-to-end slowdown due to their need to explicitly compute losses. To address this problem, we propose a novel sampling distribution based on nonparametric kernel regression that learns an effective importance score as the neural network trains. However, nonparametric regression models are too computationally expensive to accelerate end-to-end training. Therefore, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator. Using recent techniques from high-dimensional statistics and randomized algorithms, we prove that our Nadaraya-Watson sketch approximates the estimator with exponential convergence guarantees. Our sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy on four datasets.
    摘要 “数据采样是一种有效的方法来加速神经网络训练,最新的研究表明它可以甚至超越神经网络法律。这些结果归功于高质量的输入重要性分值。我们发现有两种主要策略:静态采样,其中分值在训练前已经确定,和动态采样,其中分值可以随模型参数的变化而变化。静态算法是计算效率低下的,但是它们比其他动态策略效果更差,可能会导致整个训练过程延迟。为了解决这个问题,我们提出了一种基于非参数kernel回归的采样分布,该分布在神经网络训练过程中学习有效的输入重要性分值。但是,非参数回归模型太 computationally expensive,无法加速整个训练过程。因此,我们开发了一种高效的笔记本采样方法。使用最新的高维统计技术和随机化算法,我们证明了我们的笔记本采样方法可以快速地近似 Nadaraya-Watson estimator,并且有 exponential convergence guarantee。我们的采样算法在四个 dataset 上比基eline 具有更好的墙 clock time 和准确性。”
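
A sketch of a Nadaraya-Watson importance score: predict each example's loss from its neighbours in feature space, then sample a batch proportionally. This shows the exact (un-sketched) estimator; the bandwidth and the choice of features are assumptions, and the paper's contribution is a fast sketch-based approximation of this computation:

```python
import numpy as np

def nadaraya_watson_scores(feats, losses, bandwidth=1.0):
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))      # Gaussian kernel matrix
    return (K @ losses) / K.sum(axis=1)         # smoothed loss per example

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))              # e.g. penultimate activations
losses = rng.gamma(2.0, size=200)               # current per-example losses
scores = nadaraya_watson_scores(feats, losses)
p = scores / scores.sum()
batch = rng.choice(200, size=32, replace=False, p=p)  # importance-sampled batch
```

The exact version above is O(n^2) per refresh, which is why a sketch with convergence guarantees is needed before this can accelerate end-to-end training.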

A Unified Framework for Trace-induced Quantum Kernels

  • paper_url: http://arxiv.org/abs/2311.13552
  • repo_url: None
  • paper_authors: Beng Yee Gan, Daniel Leykam, Supanut Thanasilp
  • for: This paper explores quantum kernel methods, promising candidates for achieving a practical quantum advantage on certain machine learning tasks.
  • methods: It unifies all trace-induced quantum kernels, including the commonly used global fidelity and local projected quantum kernels, in a common framework, constructing them as combinations of fundamental "Lego" kernels that impose an inductive bias on the resulting quantum models; expressive power and generalization ability are related to the number of non-zero-weight Lego kernels, with a systematic approach to increasing model complexity.
  • results: Numerically, models based on local projected kernels achieve performance comparable to the global fidelity quantum kernel while requiring fewer quantum resources in terms of quantum gates and measurement shots.
    Abstract Quantum kernel methods are promising candidates for achieving a practical quantum advantage for certain machine learning tasks. Similar to classical machine learning, an exact form of a quantum kernel is expected to have a great impact on the model performance. In this work we combine all trace-induced quantum kernels, including the commonly-used global fidelity and local projected quantum kernels, into a common framework. We show how generalized trace-induced quantum kernels can be constructed as combinations of the fundamental building blocks we coin "Lego" kernels, which impose an inductive bias on the resulting quantum models. We relate the expressive power and generalization ability to the number of non-zero weight Lego kernels and propose a systematic approach to increase the complexity of a quantum kernel model, leading to a new form of the local projected kernels that require fewer quantum resources in terms of the number of quantum gates and measurement shots. We show numerically that models based on local projected kernels can achieve comparable performance to the global fidelity quantum kernel. Our work unifies existing quantum kernels and provides a systematic framework to compare their properties.
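
A minimal sketch of the global fidelity quantum kernel, computed classically from simulated statevectors as K_ij = |<psi(x_i)|psi(x_j)>|^2. The single-qubit angle-encoding feature map is an assumption for illustration; on hardware these overlaps would be estimated from measurement shots:

```python
import numpy as np

def feature_state(x):
    """Encode a scalar as the single-qubit state cos(x/2)|0> + sin(x/2)|1>."""
    return np.array([np.cos(x / 2), np.sin(x / 2)], dtype=complex)

def fidelity_kernel(xs):
    states = np.stack([feature_state(x) for x in xs])
    overlaps = states.conj() @ states.T          # <psi_i|psi_j> for all pairs
    return np.abs(overlaps) ** 2

xs = np.array([0.1, 0.5, 2.0])
print(np.round(fidelity_kernel(xs), 3))          # symmetric, ones on diagonal
```

A local projected kernel would instead compare reduced density matrices of small subsystems, which is where the paper's gate and shot savings come from.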

Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling

  • paper_url: http://arxiv.org/abs/2311.13548
  • repo_url: https://gitlab.com/achatali/efficient-numerical-integration-in-rkhs-via-ls-sampling
  • paper_authors: Antoine Chatalic, Nicolas Schreuder, Ernesto De Vito, Lorenzo Rosasco
  • for: This paper targets numerical integration: approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand, in the setting where the target distribution is accessible only through $n$ i.i.d. observations.
  • methods: It proposes an efficient procedure that exploits a small i.i.d. random subset of $m$ of the observations, applicable when the integrand belongs to a reproducing kernel Hilbert space, thereby avoiding a large number of function evaluations.
  • results: The main result is an upper bound on the error of the method that decreases with the subset size $m$ and matches known optimal rates in Sobolev spaces; numerical experiments confirm the practicality and accuracy of the approach.
    Abstract In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of $n$ i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of $m
    摘要 在这个工作中,我们考虑了数值 интеграル的问题,即使用只有点值评估函数的方式来近似一个目标概率分布中的积分。我们关注的是在目标分布仅通过一组 $n$ 独立同分布的观察值来访问,并且函数属于一个 reproduce kernel Hilbert space 中的问题。我们提出了一种高效的过程,利用一个小的独立同分布的随机子组 $m
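
A sketch of kernel quadrature with leverage-score sampling in the spirit of the paper: sample a few landmark points with probability proportional to their ridge leverage scores, then integrate a kernel interpolant through them. The kernel, regularization, and weight construction are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 400, 20, 1e-3
X = rng.normal(size=(n, 1))                     # i.i.d. draws from the target

k = lambda a, b: np.exp(-0.5 * (a - b.T) ** 2)  # Gaussian kernel
K = k(X, X)

# Ridge leverage scores: diag( K (K + n*lam*I)^{-1} ).
scores = np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))
p = scores / scores.sum()
idx = rng.choice(n, size=m, replace=False, p=p)

# Quadrature weights: make the rule match the empirical kernel mean embedding.
Kmm = K[np.ix_(idx, idx)]
w = np.linalg.solve(Kmm + 1e-10 * np.eye(m), K[idx].mean(axis=1))

f = np.sin(X[:, 0])                             # integrand evaluations
print("subset estimate:", w @ f[idx], " full MC:", f.mean())
```

Only the $m$ sampled points ever require integrand evaluations, which is the computational saving the paper quantifies.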

Learned Nonlinear Predictor for Critically Sampled 3D Point Cloud Attribute Compression

  • paper_url: http://arxiv.org/abs/2311.13539
  • repo_url: None
  • paper_authors: Tam Thuc Do, Philip A. Chou, Gene Cheung
  • for: 3D point cloud attribute compression.
  • methods: A volumetric approach built on nested B-spline function subspaces, with the projection of the attribute function encoded as low-pass coefficients and the residual in the orthogonal subspace encoded as high-pass coefficients; for the RAHT(1) case, a new nonlinear predictor based on a polynomial of a bilateral filter predicts the next level's projection, and equations are derived to efficiently compute the critically sampled high-pass coefficients for encoding.
  • results: With parameters optimized on a large training set by minimizing a rate-distortion Lagrangian, the improved framework achieves an 11-12% bitrate reduction over the MPEG G-PCC predictor.
    Abstract We study 3D point cloud attribute compression via a volumetric approach: assuming point cloud geometry is known at both encoder and decoder, parameters $\theta$ of a continuous attribute function $f: \mathbb{R}^3 \mapsto \mathbb{R}$ are quantized to $\hat{\theta}$ and encoded, so that discrete samples $f_{\hat{\theta}}(\mathbf{x}_i)$ can be recovered at known 3D points $\mathbf{x}_i \in \mathbb{R}^3$ at the decoder. Specifically, we consider a nested sequence of function subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_L$, where $\mathcal{F}_l^{(p)}$ is a family of functions spanned by B-spline basis functions of order $p$, $f_l^*$ is the projection of $f$ on $\mathcal{F}_l^{(p)}$ and encoded as low-pass coefficients $F_l^*$, and $g_l^*$ is the residual function in orthogonal subspace $\mathcal{G}_l^{(p)}$ (where $\mathcal{G}_l^{(p)} \oplus \mathcal{F}_l^{(p)} = \mathcal{F}_{l+1}^{(p)}$) and encoded as high-pass coefficients $G_l^*$. In this paper, to improve coding performance over [1], we study predicting $f_{l+1}^*$ at level $l+1$ given $f_l^*$ at level $l$ and encoding of $G_l^*$ for the $p=1$ case (RAHT($1$)). For the prediction, we formalize RAHT(1) linear prediction in MPEG-PCC in a theoretical framework, and propose a new nonlinear predictor using a polynomial of a bilateral filter. We derive equations to efficiently compute the critically sampled high-pass coefficients $G_l^*$ amenable to encoding. We optimize parameters in our resulting feed-forward network on a large training set of point clouds by minimizing a rate-distortion Lagrangian. Experimental results show that our improved framework outperformed the MPEG G-PCC predictor by $11$ to $12\%$ in bit rate reduction.
    摘要 我们研究3D点云属性压缩,采用一种体积方法:假设点云几何在编码器和解码器均已知,将连续属性函数 $f: \mathbb{R}^3 \mapsto \mathbb{R}$ 的参数 $\theta$ 量化为 $\hat{\theta}$ 并编码,以便在解码器端已知的3D点 $\mathbf{x}_i \in \mathbb{R}^3$ 上恢复离散样本 $f_{\hat{\theta}}(\mathbf{x}_i)$。具体来说,我们考虑嵌套的函数子空间序列 $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_L$,其中 $\mathcal{F}_l^{(p)}$ 是由 $p$ 阶 B-spline 基函数张成的函数族,$f_l^*$ 是 $f$ 在 $\mathcal{F}_l^{(p)}$ 上的投影,编码为低通系数 $F_l^*$;$g_l^*$ 是正交子空间 $\mathcal{G}_l^{(p)}$(其中 $\mathcal{G}_l^{(p)} \oplus \mathcal{F}_l^{(p)} = \mathcal{F}_{l+1}^{(p)}$)中的残差函数,编码为高通系数 $G_l^*$。本文为提高编码性能,研究在 $p=1$(RAHT(1))情形下,给定第 $l$ 层的 $f_l^*$ 预测第 $l+1$ 层的 $f_{l+1}^*$,并对 $G_l^*$ 进行编码。对于预测,我们在理论框架中形式化 MPEG-PCC 中的 RAHT(1) 线性预测,并提出一种基于双边滤波器多项式的新的非线性预测器。我们推导了高效计算可供编码的临界采样高通系数 $G_l^*$ 的公式。我们在大规模点云训练集上通过最小化率失真拉格朗日函数优化所得前馈网络的参数。实验结果表明,改进后的框架在码率上比 MPEG G-PCC 预测器降低 11% 至 12%。
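
For intuition, here is the one-level butterfly underlying RAHT(1): two co-located attributes with occupancy weights are split into a low-pass (DC) coefficient carried up a level and a high-pass coefficient sent to the entropy coder. The concrete attribute values and weights are illustrative:

```python
import numpy as np

def raht_pair(a1, a2, w1, w2):
    """Weighted Haar-style split used by RAHT on a pair of nodes."""
    s = np.sqrt(w1 + w2)
    low = (np.sqrt(w1) * a1 + np.sqrt(w2) * a2) / s    # passed to next level
    high = (np.sqrt(w2) * a1 - np.sqrt(w1) * a2) / s   # entropy-coded
    return low, high, w1 + w2

low, high, w = raht_pair(a1=100.0, a2=104.0, w1=1, w2=3)
print(low, high, w)
```

The paper's predictor aims to shrink exactly these high-pass coefficients by predicting the next level's low-pass field nonlinearly before the split.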

Naturalness of Attention: Revisiting Attention in Code Language Models

  • paper_url: http://arxiv.org/abs/2311.13508
  • repo_url: None
  • paper_authors: Mootez Saad, Tushar Sharma
  • for: This paper aims to explain the attention mechanism in code language models such as CodeBERT, to better understand how they capture the structure of code.
  • methods: It conducts an initial empirical interpretability study of CodeBERT, analyzing both attention distributions and the transformed representations, across two programming languages, Java and Python.
  • results: The scaled transformation norms of the input capture syntactic structure better than attention weights alone, characterizing how CodeBERT embeds syntactic code properties and demonstrating the importance of factors beyond attention weights for rigorously understanding neural code models.
    Abstract Language models for code such as CodeBERT offer the capability to learn advanced source code representation, but their opacity poses barriers to understanding of captured properties. Recent attention analysis studies provide initial interpretability insights by focusing solely on attention weights rather than considering the wider context modeling of Transformers. This study aims to shed some light on the previously ignored factors of the attention mechanism beyond the attention weights. We conduct an initial empirical study analyzing both attention distributions and transformed representations in CodeBERT. Across two programming languages, Java and Python, we find that the scaled transformation norms of the input better capture syntactic structure compared to attention weights alone. Our analysis reveals characterization of how CodeBERT embeds syntactic code properties. The findings demonstrate the importance of incorporating factors beyond just attention weights for rigorously understanding neural code models. This lays the groundwork for developing more interpretable models and effective uses of attention mechanisms in program analysis.
    摘要 CodeBERT等代码语言模型可以学习高级代码表示,但它们的不透明性阻碍了对其所捕捉特性的理解。最近的注意力分析研究只关注Transformer模型中的注意力权重,而不考虑更广泛的上下文建模,仅提供了初步的可解释性见解。这项研究的目标是探讨注意力机制中除注意力权重之外尚未被考虑的因素。我们通过对CodeBERT模型的注意力分布和变换表示进行初步实验分析,发现在Java和Python两种编程语言中,输入的缩放变换范数比单纯的注意力权重更好地捕捉语法结构。我们的分析刻画了CodeBERT模型如何嵌入语法代码特性。这些发现表明,在严谨理解神经代码模型时需要考虑注意力权重以外的因素,为开发更可解释的模型和在程序分析中更有效地使用注意力机制奠定了基础。
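
A toy contrast between the two quantities the study compares, on a single self-attention head: the raw attention weights alpha_ij versus the norm-scaled contributions ||alpha_ij v_j||. Random weights stand in for a trained model such as CodeBERT:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                     # tokens, head dimension
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
logits = Q @ K.T / np.sqrt(d)
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Attention weights alone vs. the norm of each weighted value vector.
contrib_norms = alpha * np.linalg.norm(V, axis=1)[None, :]   # ||alpha_ij v_j||
print("attention row 0:   ", np.round(alpha[0], 2))
print("norm-scaled row 0: ", np.round(contrib_norms[0], 2))
```

Because value-vector norms vary across tokens, the two rows can rank source tokens differently, which is the gap the paper exploits when relating attention to syntax.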

Applying Dimensionality Reduction as Precursor to LSTM-CNN Models for Classifying Imagery and Motor Signals in ECoG-Based BCIs

  • paper_url: http://arxiv.org/abs/2311.13507
  • repo_url: https://github.com/bafanas/dim-reduction-with-cnn-lstm
  • paper_authors: Soham Bafana
  • for: Improving motor rehabilitation outcomes by optimizing motor imagery classification algorithms within brain-computer interfaces (BCIs).
  • methods: Unsupervised dimensionality reduction, namely Uniform Manifold Approximation and Projection (UMAP) coupled with K-Nearest Neighbors (KNN), is used to evaluate the necessity of supervised methods such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) for classification tasks.
  • results: Participants who exhibited high KNN scores after UMAP dimensionality reduction also achieved high accuracy in supervised deep learning (DL) models; given individualized model requirements and massive neural training data, dimensionality reduction is an effective preprocessing step that minimizes the need for extensive data labeling and supervised deep learning techniques.
    Abstract Motor impairments, frequently caused by neurological incidents like strokes or traumatic brain injuries, present substantial obstacles in rehabilitation therapy. This research aims to elevate the field by optimizing motor imagery classification algorithms within Brain-Computer Interfaces (BCIs). By improving the efficiency of BCIs, we offer a novel approach that holds significant promise for enhancing motor rehabilitation outcomes. Utilizing unsupervised techniques for dimensionality reduction, namely Uniform Manifold Approximation and Projection (UMAP) coupled with K-Nearest Neighbors (KNN), we evaluate the necessity of employing supervised methods such as Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs) for classification tasks. Importantly, participants who exhibited high KNN scores following UMAP dimensionality reduction also achieved high accuracy in supervised deep learning (DL) models. Due to individualized model requirements and massive neural training data, dimensionality reduction becomes an effective preprocessing step that minimizes the need for extensive data labeling and supervised deep learning techniques. This approach has significant implications not only for targeted therapies in motor dysfunction but also for addressing regulatory, safety, and reliability concerns in the rapidly evolving BCI field.
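
A sketch of the UMAP + KNN screening pipeline described above (`pip install umap-learn scikit-learn`); random arrays stand in for preprocessed ECoG trial features, and the component count and neighbour count are illustrative choices:

```python
import numpy as np
import umap
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))                  # trials x ECoG features
y = rng.integers(0, 2, size=300)                # imagery vs. motor labels

Z = umap.UMAP(n_components=3, random_state=0).fit_transform(X)
knn_score = cross_val_score(KNeighborsClassifier(5), Z, y, cv=5).mean()
print(f"KNN accuracy after UMAP: {knn_score:.2f}")
# Per the paper's finding, a high score here suggests a supervised LSTM/CNN
# trained on the same participant's data would also reach high accuracy.
```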

Grad-Shafranov equilibria via data-free physics informed neural networks

  • paper_url: http://arxiv.org/abs/2311.13491
  • repo_url: None
  • paper_authors: Byoungchan Jang, Alan A. Kaptanoglu, Rahul Gaur, Shaowu Pan, Matt Landreman, William Dorland
  • for: Solving the Grad-Shafranov equation, the MHD equilibrium equation required in large numbers for uncertainty quantification, optimization, and real-time diagnostics in plasma physics.
  • methods: Physics-Informed Neural Networks (PINNs) optimized by directly minimizing the PDE residual as the loss function; the parameter space (model size, learning rate, boundary conditions) is explored, and a parameterized PINN framework extends the input space with variables such as pressure, aspect ratio, elongation, and triangularity.
  • results: PINNs solve the Grad-Shafranov equation accurately and effectively under several different boundary conditions; the parameter study maps trade-offs such as reconstruction error versus computational speed, and parameterized PINNs could be used in future work to solve inverse problems such as shape optimization.
    Abstract A large number of magnetohydrodynamic (MHD) equilibrium calculations are often required for uncertainty quantification, optimization, and real-time diagnostic information, making MHD equilibrium codes vital to the field of plasma physics. In this paper, we explore a method for solving the Grad-Shafranov equation by using Physics-Informed Neural Networks (PINNs). For PINNs, we optimize neural networks by directly minimizing the residual of the PDE as a loss function. We show that PINNs can accurately and effectively solve the Grad-Shafranov equation with several different boundary conditions. We also explore the parameter space by varying the size of the model, the learning rate, and boundary conditions to map various trade-offs such as between reconstruction error and computational speed. Additionally, we introduce a parameterized PINN framework, expanding the input space to include variables such as pressure, aspect ratio, elongation, and triangularity in order to handle a broader range of plasma scenarios within a single network. Parametrized PINNs could be used in future work to solve inverse problems such as shape optimization.
    摘要 很多磁流体动力学(MHD)平衡计算需要不确定性量化、优化和实时诊断信息,因此MHD平衡代码是等离子体物理学中的关键工具。在这篇论文中,我们探讨使用 Physics-Informed Neural Networks(PINNs)求解 Grad-Shafranov 方程。对于PINNs,我们通过直接将PDE的残差作为损失函数来优化神经网络。结果表明,PINNs 可以在多种不同的边界条件下准确而高效地求解 Grad-Shafranov 方程。我们还探索了参数空间,改变模型大小、学习率和边界条件,以映射诸如重构误差与计算速度之间的各种权衡。此外,我们引入参数化PINN框架,扩展输入空间以包括压力、纵横比、拉长度和三角度等变量,以便在单个网络中处理更广泛的等离子体情形。参数化PINNs 可以在未来的工作中用于求解形状优化等反问题。
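
A minimal PyTorch sketch of a PINN loss for the Grad-Shafranov operator, Delta* psi = psi_RR - psi_R / R + psi_ZZ = rhs(R, Z, psi), with the residual minimized directly as in the paper. The network size, collocation domain, and the constant Solov'ev-style right-hand side are illustrative assumptions:

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

def gs_residual(R, Z):
    coords = torch.stack([R, Z], dim=1).requires_grad_(True)
    psi = net(coords).squeeze(-1)
    grads = torch.autograd.grad(psi.sum(), coords, create_graph=True)[0]
    psi_R, psi_Z = grads[:, 0], grads[:, 1]
    psi_RR = torch.autograd.grad(psi_R.sum(), coords, create_graph=True)[0][:, 0]
    psi_ZZ = torch.autograd.grad(psi_Z.sum(), coords, create_graph=True)[0][:, 1]
    Rc = coords[:, 0]
    delta_star = psi_RR - psi_R / Rc + psi_ZZ     # the Delta* operator
    rhs = -(1.0 + Rc ** 2)                        # assumed source term
    return ((delta_star - rhs) ** 2).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
R = torch.rand(256) + 0.5                         # collocation points in R
Z = torch.rand(256) * 2 - 1                       # collocation points in Z
for step in range(200):
    opt.zero_grad()
    loss = gs_residual(R, Z)    # + a boundary-condition loss term in practice
    loss.backward()
    opt.step()
print("PDE residual loss:", float(loss))
```

The parameterized variant simply feeds extra inputs (pressure, aspect ratio, elongation, triangularity) alongside (R, Z) so one network covers a family of equilibria.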

Benchmarking Toxic Molecule Classification using Graph Neural Networks and Few Shot Learning

  • paper_url: http://arxiv.org/abs/2311.13490
  • repo_url: None
  • paper_authors: Bhavya Mehta, Kush Kothari, Reshmika Nambiar, Seema Shrawne
  • for: Improving toxicity prediction of whole molecules, with applications to drug discovery and environmental risk assessment.
  • methods: Graph Isomorphic Networks, multi-headed attention, and Free Large-scale Adversarial Augmentation are harnessed separately on graphs to precisely capture the structural data of molecules and their toxicological properties, and Few-Shot Learning is incorporated to improve generalization with limited annotated samples.
  • results: Extensive experiments on a diverse toxicology dataset achieve a state-of-the-art AUC-ROC of 0.816, surpassing the baseline GCN model by 11.4%, highlighting the value of the proposed methodology and Few-Shot Learning for toxic molecule classification.
    Abstract Traditional methods like Graph Convolutional Networks (GCNs) face challenges with limited data and class imbalance, leading to suboptimal performance in graph classification tasks during toxicity prediction of molecules as a whole. To address these issues, we harness the power of Graph Isomorphic Networks, Multi Headed Attention and Free Large-scale Adversarial Augmentation separately on Graphs for precisely capturing the structural data of molecules and their toxicological properties. Additionally, we incorporate Few-Shot Learning to improve the model's generalization with limited annotated samples. Extensive experiments on a diverse toxicology dataset demonstrate that our method achieves an impressive state-of-art AUC-ROC value of 0.816, surpassing the baseline GCN model by 11.4%. This highlights the significance of our proposed methodology and Few Shot Learning in advancing Toxic Molecular Classification, with the potential to enhance drug discovery and environmental risk assessment processes.
    摘要 传统方法如图 convolutional neural networks (GCNs) 面临有限数据和分类不均衡的挑战,导致在分子毒性预测任务中表现不佳。为解决这些问题,我们利用图同态网络、多头注意力和自由大规模对抗增强网络分别在图上准确捕捉分子的结构数据和毒理性质。此外,我们还采用少样学习来提高模型的泛化性。广泛的实验表明,我们的方法在多样化毒理学数据集上达到了非常出色的 AUC-ROC 值为 0.816,比基eline GCN 模型高出 11.4%。这些结果表明我们提出的方法和少样学习在毒分子分类方面具有重要意义,有助于提高药物发现和环境风险评估过程。
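
A sketch of a GIN-based graph classifier of the kind the paper builds on, using PyTorch Geometric; layer sizes and feature dimensions are illustrative, not the paper's configuration:

```python
import torch
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.nn import GINConv, global_add_pool

class ToxGIN(torch.nn.Module):
    def __init__(self, in_dim=9, hidden=64):
        super().__init__()
        mlp1 = Sequential(Linear(in_dim, hidden), ReLU(), Linear(hidden, hidden))
        mlp2 = Sequential(Linear(hidden, hidden), ReLU(), Linear(hidden, hidden))
        self.conv1, self.conv2 = GINConv(mlp1), GINConv(mlp2)
        self.head = Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()       # atom-level message passing
        h = self.conv2(h, edge_index).relu()
        return self.head(global_add_pool(h, batch))  # graph-level toxicity logit

# Usage: pass a molecular graph's node features, bond edge_index, and the
# batch vector from a torch_geometric DataLoader; train with BCEWithLogitsLoss.
```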

Comparative Analysis of Linear Regression, Gaussian Elimination, and LU Decomposition for CT Real Estate Purchase Decisions

  • paper_url: http://arxiv.org/abs/2311.13471
  • repo_url: None
  • paper_authors: Xilin Cheng
  • for: This paper evaluates three computational algorithms applied to the decision of whether to purchase a house in the State of Connecticut.
  • methods: Linear Regression from the Scikit-learn library, Gaussian Elimination with partial pivoting, and LU Decomposition, compared on a dataset of town-specific details, yearly data, interest rates, and median sale ratios.
  • results: Linear Regression and LU Decomposition provide the most reliable predictions, while Gaussian Elimination shows limitations in stability and performance.
    Abstract This paper presents a comprehensive evaluation of three distinct computational algorithms applied to the decision-making process of real estate purchases. Specifically, we analyze the efficacy of Linear Regression from Scikit-learn library, Gaussian Elimination with partial pivoting, and LU Decomposition in predicting the advisability of buying a house in the State of Connecticut based on a set of financial and market-related parameters. The algorithms' performances were compared using a dataset encompassing town-specific details, yearly data, interest rates, and median sale ratios. Our results demonstrate significant differences in predictive accuracy, with Linear Regression and LU Decomposition providing the most reliable recommendations and Gaussian Elimination showing limitations in stability and performance. The study's findings emphasize the importance of algorithm selection in predictive analytic and offer insights into the practical applications of computational methods in real estate investment strategies. By evaluating model efficacy through metrics such as R-squared scores and Mean Squared Error, we provide a nuanced understanding of each method's strengths and weaknesses, contributing valuable knowledge to the fields of real estate analysis and predictive modeling.
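
For concreteness, here is how the three approaches relate on a least-squares fit: both elimination-style methods solve the normal equations X^T X w = X^T y that linear regression solves internally. The synthetic data stands in for the housing features; `np.linalg.solve` is used as the Gaussian-elimination-style direct solver:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # rates, ratios, etc. (toy)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=200)

A, b = X.T @ X, X.T @ y                         # normal equations
w_lu = lu_solve(lu_factor(A), b)                # LU decomposition route
w_ge = np.linalg.solve(A, b)                    # direct elimination route
w_lr = LinearRegression(fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_lu, w_lr, atol=1e-6), np.allclose(w_ge, w_lr, atol=1e-6))
```

On well-conditioned data all three agree to numerical precision; differences of the kind the paper reports emerge with ill-conditioned feature matrices, where pivoting strategy and factorization reuse matter.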

Span-Based Optimal Sample Complexity for Average Reward MDPs

  • paper_url: http://arxiv.org/abs/2311.13469
  • repo_url: None
  • paper_authors: Matthew Zurek, Yudong Chen
  • for: Learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
  • methods: Reducing the average-reward MDP to a discounted MDP, supported by improved bounds for $\gamma$-discounted MDPs in the regime $\gamma \geq 1 - \frac{1}{H}$.
  • results: Establishes the sample complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the optimal policy's bias function and $SA$ is the cardinality of the state-action space. This is the first result that is minimax optimal (up to log factors) in all parameters $S$, $A$, $H$, and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters.
    Abstract We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ samples suffice to learn a $\varepsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma \geq 1 - \frac{1}{H}$, circumventing the well-known lower bound of $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$ for general $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
    摘要 我们研究了在生成模型下学习平均回报 Markov 决策过程(MDP)中 $\varepsilon$-最优策略的样本复杂度。我们建立了复杂度上界 $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$,其中 $H$ 是最优策略偏置函数的跨度(span),$SA$ 是状态-动作空间的基数。我们的结果是首个在所有参数 $S$、$A$、$H$ 和 $\varepsilon$ 上达到极小极大最优(至多相差对数因子)的结果,超越了现有工作:这些工作要么假设所有策略的混合时间一致有界,要么对参数的依赖不是最优。我们的结果基于将平均回报MDP归约为折扣MDP。为了证明这种归约的最优性,我们给出了 $\gamma$-折扣MDP 的改进界,证明在 $\gamma \geq 1 - \frac{1}{H}$ 的情形下,$\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ 个样本足以在弱连通MDP中学习 $\varepsilon$-最优策略,从而绕开了一般 $\gamma$-折扣MDP 的著名下界 $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$。我们的分析以跨度参数为尺度,给出了某些依赖于实例的方差参数的上界。这些上界比基于MDP混合时间或直径的上界更紧,可能具有更广泛的用途。
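
A toy sketch of the reduction at the heart of the paper: treat the average-reward MDP as a discounted one with gamma = 1 - 1/H and solve it by value iteration, reading off the gain as (1-gamma) times the discounted value. The tiny random MDP is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 10
gamma = 1 - 1 / H                    # effective horizon tied to the span H
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

V = np.zeros(S)
for _ in range(500):
    Q = R + gamma * P @ V            # shape (S, A); P @ V sums over next states
    V = Q.max(axis=1)
pi = Q.argmax(axis=1)
gain_estimate = (1 - gamma) * V      # rho ~ (1 - gamma) * V_gamma
print("policy:", pi, "avg-reward estimate:", np.round(gain_estimate, 3))
```

The paper's contribution is showing that in this regime the discounted problem can be solved with far fewer generative-model samples than the general-gamma lower bound would suggest.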

Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure

  • paper_url: http://arxiv.org/abs/2311.13466
  • repo_url: https://github.com/dunni3/keypoint-diffusion
  • paper_authors: Ian Dunn, David Ryan Koes
  • for: This paper proposes a GNN-based architecture for learning latent representations of protein structure, to accelerate molecular diffusion models for structural biology and structure-based drug design.
  • methods: The latent representations are trained end-to-end with a diffusion model for de novo ligand design; unlike the coarse-grained protein representations many diffusion models rely on, which discard information essential for modeling molecular interactions, the learned representation preserves that information while keeping training and inference feasible.
  • results: Compared with an all-atom protein representation, the model achieves comparable performance while reducing inference time threefold.
    Abstract Diffusion generative models have emerged as a powerful framework for addressing problems in structural biology and structure-based drug design. These models operate directly on 3D molecular structures. Due to the unfavorable scaling of graph neural networks (GNNs) with graph size as well as the relatively slow inference speeds inherent to diffusion models, many existing molecular diffusion models rely on coarse-grained representations of protein structure to make training and inference feasible. However, such coarse-grained representations discard essential information for modeling molecular interactions and impair the quality of generated structures. In this work, we present a novel GNN-based architecture for learning latent representations of molecular structure. When trained end-to-end with a diffusion model for de novo ligand design, our model achieves comparable performance to one with an all-atom protein representation while exhibiting a 3-fold reduction in inference time.
    摘要 随机生成模型在生物结构学和基于结构药物设计方面已经出现为一种强大的框架。这些模型直接操作于三维分子结构。由于图神经网络(GNNs)的可插入性和扩展尺度的不利扩展性,许多现有的分子扩散模型依赖于粗略化蛋白结构的表示来使训练和推断可行。然而,这些粗略化表示会产生蛋白结构中关键的信息损失,从而降低生成结构的质量。在这项工作中,我们提出了一种基于GNN的新建立 latent表示方法。当与一种用于新药设计的扩散模型进行端到端训练时,我们的模型可以与使用全原子蛋白表示达到相同的性能,同时展现出3倍的推断时间减少。

Multi-Objective Bayesian Optimization with Active Preference Learning

  • paper_url: http://arxiv.org/abs/2311.13460
  • repo_url: None
  • paper_authors: Ryota Ozaki, Kazuki Ishikawa, Youhei Kanzaki, Shinya Suzuki, Shion Takeno, Ichiro Takeuchi, Masayuki Karasuyama
  • for: This paper addresses multi-objective optimization (MOO) problems in which identifying the whole Pareto front requires a prohibitive search cost, while in many practical scenarios the decision maker (DM) only needs one specific solution among the set of Pareto-optimal solutions.
  • methods: A Bayesian optimization (BO) approach for MOO with expensive objective functions, in which a Bayesian preference model of the DM is estimated adaptively and interactively from two types of supervision, pairwise preferences and improvement requests; the acquisition function incorporates uncertainty in both the objective functions and the DM preference, and an active learning strategy for preference estimation minimizes the interaction cost with the DM.
  • results: Effectiveness is demonstrated empirically on benchmark function optimization and on hyper-parameter optimization problems for machine learning models.
    Abstract There are a lot of real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires the prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated by an interactive manner based on the two types of supervisions called the pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty both in the objective functions and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through the benchmark function optimization and the hyper-parameter optimization problems for machine learning models.
    摘要 有很多现实世界中的黑盒优化问题需要同时优化多个目标函数。然而,在多目标优化(MOO)问题中,确定整个帕瑞托前缘需要昂贵的搜索成本,而在许多实践中,决策者(DM)只需要在多个优化解的集合中选择特定的解决方案。我们提议使用 bayesian 优化(BO)方法来实现MOO问题中的多个目标函数同时优化,并且在DM的偏好模型中进行了交互式地 Adaptive 估计,基于两种类型的监督:对比 preference 和改进请求。为了探索最优解,我们定义了一个获取函数,其中包含目标函数的不确定性以及DM的偏好不确定性。此外,为了最小化与DM的交互成本,我们还提出了一种活动学习策略来估计DM的偏好。我们通过 benchmark 函数优化和机器学习模型的 hyper-parameter 优化问题进行了实验,并证明了我们的提议的有效性。
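
A small sketch of the pairwise-preference ingredient: assume a linear utility u(y) = w . y over objective vectors and fit w by logistic regression on "y_a preferred to y_b" comparisons. The linear utility form and noise model are illustrative assumptions, not the paper's full Bayesian preference model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
w_true = np.array([0.8, 0.2])                    # hidden DM preference
Y = rng.uniform(size=(40, 2))                    # candidate objective vectors
pairs = rng.integers(0, 40, size=(100, 2))
diffs = Y[pairs[:, 0]] - Y[pairs[:, 1]]
labels = (diffs @ w_true + 0.05 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(fit_intercept=False).fit(diffs, labels)
w_hat = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
print("estimated preference direction:", np.round(w_hat, 2))
```

In the paper this estimate is Bayesian, so its posterior uncertainty can enter the acquisition function alongside the objective-function uncertainty, and active learning chooses which comparisons to ask the DM next.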

The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs

  • paper_url: http://arxiv.org/abs/2311.13459
  • repo_url: None
  • paper_authors: Ehsan Amid, Frank Nielsen, Richard Nock, Manfred K. Warmuth
  • for: This work studies Tempered Exponential Measures (TEMs), a parametric generalization of the exponential family of distributions that maximizes the tempered entropy function among positive measures subject to a probability normalization of their power densities.
  • methods: Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy; the paper introduces three parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function and establishes an isometry between them through a tempered generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance, characterized as a $t$-symmetrization of the oriented tempered Funk distance and motivated by $t$-lengths of smooth curves in a tautological Finsler manifold.
  • results: The properties of the generalized structure are demonstrated in different settings, and the quality of its differentiable approximations for optimization in machine learning settings is examined numerically.
    Abstract Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings.
    摘要 温和指数测度(TEMs)是指数族分布的参数化推广,在满足幂密度概率归一化约束的正测度中最大化温和熵函数。TEMs 上的微积分依赖于由定义温和熵所用的形变对数诱导的形变算术运算。在本工作中,我们通过负温和熵函数的 Legendre 函数引入有限离散 TEMs 的三种不同参数化。特别地,我们建立了这些参数化之间的等距关系,将 Hilbert 对数交比单纯形距离推广为温和 Hilbert 余单纯形距离。与 Hilbert 几何类似,温和 Hilbert 距离可刻画为有向温和 Funk 距离的 $t$-对称化。我们通过引入 tautological Finsler 流形中光滑曲线的 $t$-长度概念来引出这一构造。随后,我们在不同设置下展示了该推广结构的性质,并在机器学习优化场景中数值检验了其可微近似的质量。
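
For intuition about the deformed algebra, here is the tempered logarithm/exponential pair that underlies TEM calculus, log_t(x) = (x^(1-t) - 1) / (1 - t), which recovers the ordinary log/exp as t -> 1; the test values are arbitrary:

```python
import numpy as np

def log_t(x, t):
    return np.log(x) if t == 1 else (x ** (1 - t) - 1) / (1 - t)

def exp_t(x, t):
    if t == 1:
        return np.exp(x)
    return np.maximum(1 + (1 - t) * x, 0.0) ** (1 / (1 - t))

x = np.array([0.5, 1.0, 2.0])
print(np.allclose(exp_t(log_t(x, 0.7), 0.7), x))   # inverse pair -> True
```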

Explaining high-dimensional text classifiers

  • paper_url: http://arxiv.org/abs/2311.13454
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Odelia Melamed, Rich Caruana
  • for: This paper presents a new explainability method for neural network classifiers with high-dimensional inputs, where classic explainability tools are often quite limited.
  • methods: The method exploits theoretically proven high-dimensional properties of neural network classifiers.
  • results: It performs well on the classical sentiment analysis task for the IMDB reviews dataset and on a malware-detection task for a PowerShell scripts dataset.
    Abstract Explainability has become a valuable tool in the last few years, helping humans better understand AI-guided decisions. However, the classic explainability tools are sometimes quite limited when considering high-dimensional inputs and neural network classifiers. We present a new explainability method using theoretically proven high-dimensional properties in neural network classifiers. We present two usages of it: 1) On the classical sentiment analysis task for the IMDB reviews dataset, and 2) our Malware-Detection task for our PowerShell scripts dataset.
    摘要 过去几年,可解释性已成为人工智能决策的有价值工具。然而,经典的解释工具在考虑高维输入和神经网络分类器时有限。我们介绍了一种新的可解释方法,利用神经网络分类器中已经证明的高维性质。我们在以下两个应用中使用了它:1)传统的情感分析任务中IMDB评论集合,2)我们的Malware检测任务中PowerShell脚本集合。

Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates

  • paper_url: http://arxiv.org/abs/2311.13447
  • repo_url: None
  • paper_authors: Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, Cristóbal Guzmán
  • for: This paper studies private empirical risk minimization (ERM) problems under the constraint of $\rho$ zero-concentrated differential privacy (zCDP).
  • methods: The paper proposes new algorithms based on variance reduced gradient descent and proximal point method for solving private ERM problems, which achieve near-optimal convergence rates.
  • results: The paper shows that the proposed algorithms can achieve convergence rates of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ and $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$ for private ERM problems, which are near-optimal and improve upon existing results.
    Abstract We study the private empirical risk minimization (ERM) problem for losses satisfying the $(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz (KL) condition. The Polyak-{\L}ojasiewicz (PL) condition is a special case of this condition when $\kappa=2$. Specifically, we study this problem under the constraint of $\rho$ zero-concentrated differential privacy (zCDP). When $\kappa\in[1,2]$ and the loss function is Lipschitz and smooth over a sufficiently large region, we provide a new algorithm based on variance reduced gradient descent that achieves the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ on the excess empirical risk, where $n$ is the dataset size and $d$ is the dimension. We further show that this rate is nearly optimal. When $\kappa \geq 2$ and the loss is instead Lipschitz and weakly convex, we show it is possible to achieve the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ with a private implementation of the proximal point method. When the KL parameters are unknown, we provide a novel modification and analysis of the noisy gradient descent algorithm and show that this algorithm achieves a rate of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$ adaptively, which is nearly optimal when $\kappa = 2$. We further show that, without assuming the KL condition, the same gradient descent algorithm can achieve fast convergence to a stationary point when the gradient stays sufficiently large during the run of the algorithm. Specifically, we show that this algorithm can approximate stationary points of Lipschitz, smooth (and possibly nonconvex) objectives with rate as fast as $\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ and never worse than $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$. The latter rate matches the best known rate for methods that do not rely on variance reduction.
    摘要 我们研究损失函数满足$(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz(KL)条件的私有经验风险最小化(ERM)问题;当$\kappa=2$时,该条件退化为Polyak-{\L}ojasiewicz(PL)条件。具体来说,我们在$\rho$ zero-concentrated differential privacy(zCDP)约束下研究此问题。当$\kappa\in[1,2]$且损失函数在足够大的区域内是Lipschitz且光滑时,我们提出了一种基于方差缩减梯度下降的新算法,其超额经验风险达到$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$,其中$n$为数据集大小,$d$为维度;我们进一步证明该速率近乎最优。当$\kappa \geq 2$且损失函数是Lipschitz且弱凸时,我们证明通过邻近点法的私有实现可以达到$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$的速率。当KL参数未知时,我们对带噪梯度下降算法给出了一种新的修改与分析,证明其能够自适应地达到$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$的速率,在$\kappa = 2$时近乎最优。此外,我们还证明在不假设KL条件时,只要算法运行过程中梯度保持足够大,同一梯度下降算法即可快速收敛到驻点:对Lipschitz、光滑(可能非凸)的目标函数,其逼近驻点的速率最快可达$\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$,且不会差于$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$;后者与不依赖方差缩减的方法的已知最佳速率相当。
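To make the privacy mechanism concrete, below is a minimal sketch of the noisy gradient descent baseline that this line of work builds on: Gaussian noise calibrated to a per-step share of the $\rho$-zCDP budget. The variance-reduction, proximal-point, and adaptive components of the paper's actual algorithms are not reproduced; `grad_fn`, the Lipschitz bound `L`, and the step count `T` are illustrative assumptions.

```python
import numpy as np

def noisy_gradient_descent(grad_fn, theta0, n, rho, L, T, eta):
    """Sketch of noisy GD under rho-zCDP via the Gaussian mechanism.

    grad_fn(theta) returns the average gradient over n examples; each
    per-example gradient is assumed bounded in norm by L, so the L2
    sensitivity of the average gradient is 2L/n. Splitting the budget
    evenly, a Gaussian with sigma^2 = sensitivity^2 * T / (2 * rho)
    makes each step (rho/T)-zCDP, and T steps compose to rho-zCDP.
    """
    sensitivity = 2.0 * L / n
    sigma = sensitivity * np.sqrt(T / (2.0 * rho))
    theta = theta0.copy()
    for _ in range(T):
        g = grad_fn(theta) + np.random.normal(0.0, sigma, size=theta.shape)
        theta = theta - eta * g
    return theta

# Toy usage: gradient of f(theta) = 0.5 * ||theta||^2 averaged over the data.
theta = noisy_gradient_descent(lambda t: t, np.ones(5), n=10_000,
                               rho=1.0, L=1.0, T=50, eta=0.1)
```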

Transfer Attacks and Defenses for Large Language Models on Coding Tasks

  • paper_url: http://arxiv.org/abs/2311.13445
  • repo_url: None
  • paper_authors: Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina Pasareanu
  • for: This paper aims to investigate the effect of adversarial perturbations on coding tasks with large language models (LLMs).
  • methods: The paper uses white-box attacks to generate adversarial examples for smaller code models, and then studies the transferability of these examples to LLMs. The paper also proposes prompt-based defenses to improve the resilience of LLMs against adversarial perturbations.
  • results: The paper shows that adversarial examples obtained with a smaller code model are transferable to LLMs, weakening their performance. The proposed defenses show promise in improving the model’s resilience, paving the way to more robust defensive solutions for LLMs in code-related applications.
    Abstract Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown vulnerable to adversarial examples, i.e. small syntactic perturbations that do not change the program's semantics, such as the inclusion of "dead code" through false conditions or the addition of inconsequential print statements, designed to "fool" the models. LLMs can also be vulnerable to the same adversarial perturbations but a detailed study on this concern has been lacking so far. In this paper we aim to investigate the effect of adversarial perturbations on coding tasks with LLMs. In particular, we study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. Furthermore, to make the LLMs more robust against such adversaries without incurring the cost of retraining, we propose prompt-based defenses that involve modifying the prompt to include additional information such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the LLMs' performance. The proposed defenses show promise in improving the model's resilience, paving the way to more robust defensive solutions for LLMs in code-related applications.
    摘要 现代大语言模型(LLM),如ChatGPT,已在编程任务中表现出卓越能力,包括编写和推理代码。它们超越了之前的代码神经网络模型(如code2seq或seq2seq),后者在代码摘要和代码漏洞检测等任务上已取得有竞争力的结果。然而,这些早期代码模型被证明容易受到对抗样本的攻击,即不改变程序语义的微小句法扰动,例如通过恒假条件插入"死代码"或添加无关紧要的打印语句,以"欺骗"模型。LLM也可能受到同样的对抗扰动,但迄今尚缺乏对这一问题的详细研究。本文旨在考察对抗扰动对LLM编程任务的影响。特别地,我们研究了通过对较小代码模型进行白盒攻击生成的对抗样本向LLM的可迁移性。此外,为了在不付出重新训练代价的前提下增强LLM对此类攻击的鲁棒性,我们提出了基于提示的防御方法,即在提示中加入对抗扰动代码的示例以及逆转对抗扰动的明确指令。实验表明,由较小代码模型获得的对抗样本确实可以迁移,从而削弱LLM的性能;所提出的防御方法在提升模型鲁棒性方面展现出良好前景,为代码相关应用中LLM的更稳健防御方案铺平了道路。
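For intuition, here is a hypothetical sketch of the kind of semantics-preserving "dead code" perturbation the abstract describes; the always-false guard never executes, so the program's behavior is unchanged while its token sequence (what the model sees) shifts. The paper's white-box attacks search for such perturbations adversarially rather than inserting them at a fixed position.

```python
def insert_dead_code(source: str) -> str:
    """Insert an unreachable branch after each function signature (naive sketch)."""
    out = []
    for line in source.splitlines():
        out.append(line)
        if line.lstrip().startswith("def "):  # simplistic anchor, illustration only
            indent = " " * (len(line) - len(line.lstrip()) + 4)
            out.append(indent + "if 1 == 2:              # always false: dead code")
            out.append(indent + "    print('unreachable')")
    return "\n".join(out)

victim = "def add(a, b):\n    return a + b"
print(insert_dead_code(victim))   # same semantics, different token sequence
```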

Recurrent neural networks and transfer learning for elasto-plasticity in woven composites

  • paper_url: http://arxiv.org/abs/2311.13434
  • repo_url: None
  • paper_authors: Ehsan Ghane, Martin Fagerström, Mohsen Mirkhalaf
  • for: 作为织物复合材料计算密集的细观模拟的代理模型。
  • methods: 使用循环神经网络(RNN)模型,利用迁移学习解决初始化难题与数据稀缺问题;由一个平均场模型生成表示弹塑性行为的全面数据集。
  • results: 以随机游走应变历史为源任务进行训练后,实现了对循环加载条件下应力的准确预测,并通过引入亚尺度特性增强了RNN的通用性。
    Abstract As a surrogate for computationally intensive meso-scale simulation of woven composites, this article presents Recurrent Neural Network (RNN) models. Leveraging the power of transfer learning, the initialization challenges and sparse data issues inherent in cyclic shear strain loads are addressed in the RNN models. A mean-field model generates a comprehensive data set representing elasto-plastic behavior. In simulations, arbitrary six-dimensional strain histories are used to predict stresses under random walking as the source task and cyclic loading conditions as the target task. Incorporating sub-scale properties enhances RNN versatility. In order to achieve accurate predictions, the model uses a grid search method to tune network architecture and hyper-parameter configurations. The results of this study demonstrate that transfer learning can be used to effectively adapt the RNN to varying strain conditions, which establishes its potential as a useful tool for modeling path-dependent responses in woven composites.
    摘要 本文提出循环神经网络(RNN)模型,作为织物复合材料计算密集的细观模拟的代理模型。借助迁移学习的能力,RNN模型解决了循环剪切应变载荷所固有的初始化难题和数据稀缺问题。一个平均场模型生成了表示弹塑性行为的全面数据集。在模拟中,以任意六维应变历史为输入,将随机游走加载作为源任务、循环加载条件作为目标任务来预测应力。引入亚尺度特性增强了RNN的通用性。为实现准确预测,模型采用网格搜索方法调整网络结构和超参数配置。研究结果表明,迁移学习可以有效地使RNN适应不同的应变条件,这确立了其作为织物复合材料路径依赖响应建模工具的潜力。
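A minimal sketch of the surrogate-plus-transfer idea (assumed architecture, not the paper's exact network): a GRU maps a six-dimensional strain history to a stress history; after pretraining on random-walk strain paths (source task), the recurrent core is frozen and only the head is fine-tuned on cyclic-loading data (target task).

```python
import torch
import torch.nn as nn

class StressRNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=6, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)

    def forward(self, strain):           # strain: (batch, time, 6)
        h, _ = self.rnn(strain)
        return self.head(h)              # predicted stress: (batch, time, 6)

model = StressRNN()
# ... pretrain on random-walk strain histories (source task) ...
for p in model.rnn.parameters():         # freeze the recurrent core
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)  # fine-tune head
```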

Extracting individual variable information for their decoupling, direct mutual information and multi-feature Granger causality

  • paper_url: http://arxiv.org/abs/2311.13431
  • repo_url: None
  • paper_authors: Jarek Duda
  • for: 这篇论文旨在提出一种方法,用于分离多变量之间的复杂依赖关系,以提高变量之间的独立性。
  • methods: 该方法使用 conditional probability distributions 的减少方法,将多变量转化为独立的变量,并且可以用于评估直接相关性和 causality 方向。
  • results: 该方法可以帮助分离多变量之间的复杂依赖关系,提高变量之间的独立性,并且可以用于评估直接相关性和 causality 方向。
    Abstract Working with multiple variables they usually contain difficult to control complex dependencies. This article proposes extraction of their individual information, e.g. $\overline{X|Y}$ as random variable containing information from $X$, but with removed information about $Y$, by using $(x,y) \leftrightarrow (\bar{x}=\textrm{CDF}_{X|Y=y}(x),y)$ reversible normalization. One application can be decoupling of individual information of variables: reversibly transform $(X_1,\ldots,X_n)\leftrightarrow(\tilde{X}_1,\ldots \tilde{X}_n)$ together containing the same information, but being independent: $\forall_{i\neq j} \tilde{X}_i\perp \tilde{X}_j, \tilde{X}_i\perp X_j$. It requires detailed models of complex conditional probability distributions - it is generally a difficult task, but here can be done through multiple dependency reducing iterations, using imperfect methods (here HCR: Hierarchical Correlation Reconstruction). It could be also used for direct mutual information - evaluating direct information transfer: without use of intermediate variables. For causality direction there is discussed multi-feature Granger causality, e.g. to trace various types of individual information transfers between such decoupled variables, including propagation time (delay).
    摘要 处理多个变量时,它们通常包含难以控制的复杂依赖关系。本文提出提取各变量的个体信息,例如将$\overline{X|Y}$作为一个随机变量,它保留来自$X$的信息,但剔除了关于$Y$的信息,方法是使用$(x,y) \leftrightarrow (\bar{x}=\textrm{CDF}_{X|Y=y}(x),y)$这一可逆归一化。其一个应用是变量个体信息的解耦:可逆地将$(X_1,\ldots,X_n)\leftrightarrow(\tilde{X}_1,\ldots,\tilde{X}_n)$变换为共同包含相同信息但相互独立的变量:$\forall_{i\neq j}\ \tilde{X}_i\perp \tilde{X}_j,\ \tilde{X}_i\perp X_j$。这需要复杂条件概率分布的精细模型;这通常是一项困难任务,但此处可以通过多次降低依赖的迭代,借助并不完美的方法(这里采用HCR:层次相关重构)来完成。该方法还可用于直接互信息,即在不借助中间变量的情况下评估直接的信息传递。对于因果方向,文中讨论了多特征Granger因果,例如追踪此类解耦变量之间各种类型的个体信息传递,包括传播时间(延迟)。
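The normalization itself is easy to illustrate in closed form if one assumes $(X, Y)$ jointly Gaussian, in which case the conditional CDF is a shifted and scaled normal CDF; the paper instead fits the conditional distribution with HCR. A sketch under that Gaussian assumption:

```python
import numpy as np
from scipy.stats import norm

def forward(x, y, mx, my, sx, sy, r):
    """(x, y) -> (x_bar, y) with x_bar = CDF_{X|Y=y}(x), Gaussian case."""
    mu = mx + r * (sx / sy) * (y - my)       # conditional mean of X given Y=y
    sd = sx * np.sqrt(1.0 - r**2)            # conditional std of X given Y=y
    return norm.cdf((x - mu) / sd)           # uniform on (0,1), independent of Y

def inverse(x_bar, y, mx, my, sx, sy, r):
    mu = mx + r * (sx / sy) * (y - my)
    sd = sx * np.sqrt(1.0 - r**2)
    return mu + sd * norm.ppf(x_bar)         # recovers x exactly (reversible)

x, y = 0.7, -0.2
xb = forward(x, y, 0.0, 0.0, 1.0, 1.0, 0.6)
assert np.isclose(inverse(xb, y, 0.0, 0.0, 1.0, 1.0, 0.6), x)
```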

Bayesian inference of a new Mallows model for characterising symptom sequences applied in primary progressive aphasia

  • paper_url: http://arxiv.org/abs/2311.13411
  • repo_url: None
  • paper_authors: Beatrice Taylor, Cameron Shand, Chris J. D. Hardy, Neil Oxtoby
  • for: 本研究利用贝叶斯推断刻画症状出现顺序,以便更好地理解个体疾病经历并保障医疗公平。
  • methods: 本研究对Mallows模型进行了改造,以处理部分排序和右删失数据,并采用自定义的MCMC拟合。
  • results: 在合成数据和一个原发性进行性失语症数据集上的评估表明,该方法能够揭示平均顺序并估计排序方差,具有潜在的临床价值;但研究也受限于模型可扩展性和较小的样本量。
    Abstract Machine learning models offer the potential to understand diverse datasets in a data-driven way, powering insights into individual disease experiences and ensuring equitable healthcare. In this study, we explore Bayesian inference for characterising symptom sequences, and the associated modelling challenges. We adapted the Mallows model to account for partial rankings and right-censored data, employing custom MCMC fitting. Our evaluation, encompassing synthetic data and a primary progressive aphasia dataset, highlights the model's efficacy in revealing mean orderings and estimating ranking variance. This holds the potential to enhance clinical comprehension of symptom occurrence. However, our work encounters limitations concerning model scalability and small dataset sizes.
    摘要 机器学习模型有望以数据驱动的方式理解多样化的数据集,为个体疾病经历提供洞见并保障医疗公平。本研究探索了用贝叶斯推断刻画症状出现顺序的方法及其建模挑战。我们对Mallows模型进行了改造,以处理部分排序和右删失数据,并采用自定义的MCMC拟合。在合成数据和一个原发性进行性失语症数据集上的评估突显了该模型在揭示平均顺序和估计排序方差方面的有效性,这有望增进临床对症状出现规律的理解。不过,我们的工作在模型可扩展性和小样本量方面仍存在局限。
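For readers unfamiliar with the base model, a brute-force sketch of the standard full-ranking Mallows probability mass function with Kendall-tau distance follows; the paper's contribution lies in the extension to partial rankings and right-censored data with a custom MCMC, which this sketch does not attempt.

```python
import itertools
import math

def kendall_tau(a, b):
    """Number of discordant pairs between two rankings of the same items."""
    pos = {v: i for i, v in enumerate(b)}
    r = [pos[v] for v in a]
    return sum(1 for i in range(len(r))
                 for j in range(i + 1, len(r)) if r[i] > r[j])

def mallows_pmf(sigma, sigma0, theta):
    """P(sigma | sigma0, theta) proportional to exp(-theta * d(sigma, sigma0))."""
    items = list(sigma0)
    z = sum(math.exp(-theta * kendall_tau(list(p), sigma0))
            for p in itertools.permutations(items))   # brute-force normalizer
    return math.exp(-theta * kendall_tau(sigma, sigma0)) / z

print(mallows_pmf([0, 1, 2], [0, 1, 2], theta=1.0))   # central ranking is modal
```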

An Empirical Study of Uncertainty Estimation Techniques for Detecting Drift in Data Streams

  • paper_url: http://arxiv.org/abs/2311.13374
  • repo_url: None
  • paper_authors: Anton Winter, Nicolas Jourdan, Tristan Wirth, Volker Knauthe, Arjan Kuijper
  • for: 提高机器学习模型的可靠性在安全关键领域,如自动驾驶和医疗诊断。
  • methods: 使用uncertainty值作为代替错误率的检测方法,以减少在部署后获取真实标签的需求。
  • results: 通过在七个真实数据集上对五种不确定性估计方法与ADWIN检测器进行比较,发现SWAG方法具有更好的校准性,但不确定性估计方法的选择对漂移检测的整体准确率影响不大,即使最基础的方法也能取得有竞争力的表现。这些发现为基于不确定性的漂移检测在实际中的应用提供了有价值的参考。
    Abstract In safety-critical domains such as autonomous driving and medical diagnosis, the reliability of machine learning models is crucial. One significant challenge to reliability is concept drift, which can cause model deterioration over time. Traditionally, drift detectors rely on true labels, which are often scarce and costly. This study conducts a comprehensive empirical evaluation of using uncertainty values as substitutes for error rates in detecting drifts, aiming to alleviate the reliance on labeled post-deployment data. We examine five uncertainty estimation methods in conjunction with the ADWIN detector across seven real-world datasets. Our results reveal that while the SWAG method exhibits superior calibration, the overall accuracy in detecting drifts is not notably impacted by the choice of uncertainty estimation method, with even the most basic method demonstrating competitive performance. These findings offer valuable insights into the practical applicability of uncertainty-based drift detection in real-world, safety-critical applications.
    摘要 在自动驾驶和医疗诊断等安全关键领域,机器学习模型的可靠性至关重要。可靠性面临的一个重要挑战是概念漂移,它会导致模型性能随时间退化。传统的漂移检测器依赖真实标签,而这些标签往往稀缺且获取成本高。本研究对使用不确定性值替代错误率进行漂移检测开展了全面的实证评估,以减少对部署后带标签数据的依赖。我们在七个真实数据集上将五种不确定性估计方法与ADWIN检测器结合进行考察。结果表明,尽管SWAG方法具有更好的校准性,但漂移检测的整体准确率并不明显受不确定性估计方法选择的影响,即使最基础的方法也能取得有竞争力的表现。这些发现为基于不确定性的漂移检测在真实世界安全关键应用中的实用性提供了有价值的见解。
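A minimal sketch of the paper's core idea: feed per-sample predictive uncertainty (here, entropy of the softmax output) into a drift test instead of label-based error rates. A simple two-window mean-shift check stands in for ADWIN; the thresholds and window lengths are illustrative assumptions.

```python
import numpy as np

def entropy(probs):
    """Predictive entropy of softmax outputs, shape (n_samples, n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def drifted(uncertainties, ref_len=500, cur_len=100, k=3.0):
    """Flag drift if the recent mean uncertainty departs from the reference."""
    ref, cur = uncertainties[:ref_len], uncertainties[-cur_len:]
    return abs(cur.mean() - ref.mean()) > k * ref.std() / np.sqrt(cur_len)
```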

REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints

  • paper_url: http://arxiv.org/abs/2311.13349
  • repo_url: https://github.com/fracorti/reds
  • paper_authors: Francesco Corti, Balz Maag, Joachim Schauer, Ulrich Pferschy, Olga Saukh
  • for: 这篇论文旨在解决Edge设备上深度模型的资源不稳定性问题,以提高模型在不同硬件平台上的可扩展性和效率。
  • methods: 这篇论文提出了一种名为资源高效深度子网络(REDS)的新方法,它建设性地利用结构化稀疏来实现模型的灵活化,并通过一种新的迭代背包优化器跳过代价高昂的计算块。
  • results: 在六个基准架构上进行了训练,并在四个现成的移动与嵌入式硬件平台上进行了评估,结果显示REDS具有出色的可扩展性和效率;在Arduino Nano 33 BLE Sense上,一个两层全连接的简单深度模型可在40微秒内完成对动态资源约束的适应。
    Abstract Deep models deployed on edge devices frequently encounter resource variability, which arises from fluctuating energy levels, timing constraints, or prioritization of other critical tasks within the system. State-of-the-art machine learning pipelines generate resource-agnostic models, not capable to adapt at runtime. In this work we introduce Resource-Efficient Deep Subnetworks (REDS) to tackle model adaptation to variable resources. In contrast to the state-of-the-art, REDS use structured sparsity constructively by exploiting permutation invariance of neurons, which allows for hardware-specific optimizations. Specifically, REDS achieve computational efficiency by (1) skipping sequential computational blocks identified by a novel iterative knapsack optimizer, and (2) leveraging simple math to re-arrange the order of operations in REDS computational graph to take advantage of the data cache. REDS support conventional deep networks frequently deployed on the edge and provide computational benefits even for small and simple networks. We evaluate REDS on six benchmark architectures trained on the Google Speech Commands, FMNIST and CIFAR10 datasets, and test on four off-the-shelf mobile and embedded hardware platforms. We provide a theoretical result and empirical evidence for REDS outstanding performance in terms of submodels' test set accuracy, and demonstrate an adaptation time in response to dynamic resource constraints of under 40$\mu$s, utilizing a 2-layer fully-connected network on Arduino Nano 33 BLE Sense.
    摘要 部署在边缘设备上的深度模型经常遇到资源波动,这源于能量水平变化、时序约束或系统中其他关键任务的优先级。最先进的机器学习流水线生成的是对资源无感知的模型,无法在运行时自适应。本文提出资源高效深度子网络(REDS)来解决模型对可变资源的适应问题。与现有技术不同,REDS通过利用神经元的置换不变性来建设性地使用结构化稀疏,从而允许针对特定硬件的优化。具体而言,REDS通过以下方式实现计算效率:(1)跳过由一种新的迭代背包优化器识别出的顺序计算块;(2)利用简单的数学重排REDS计算图中的运算顺序,以充分利用数据缓存。REDS支持常见的边缘端深度网络,即使对小而简单的网络也能带来计算收益。我们在Google Speech Commands、FMNIST和CIFAR10数据集上训练的六个基准架构上评估了REDS,并在四个现成的移动与嵌入式硬件平台上进行了测试。我们为REDS子模型在测试集准确率方面的出色表现提供了理论结果与实证证据,并在Arduino Nano 33 BLE Sense上用一个两层全连接网络演示了对动态资源约束低于40微秒的适应时间。
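The block-skipping step can be pictured as a classic 0/1 knapsack: keep the subset of sequential blocks that maximizes an (assumed) importance score under a compute budget. The sketch below is plain dynamic programming; REDS's actual optimizer is iterative and hardware-aware, which this does not capture.

```python
def select_blocks(costs, scores, budget):
    """0/1 knapsack over blocks: integer costs, additive importance scores."""
    n = len(costs)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                       # skip block i-1
            if costs[i - 1] <= b:                             # or keep it
                cand = best[i - 1][b - costs[i - 1]] + scores[i - 1]
                best[i][b] = max(best[i][b], cand)
    keep, b = [], budget                                      # backtrack
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            keep.append(i - 1)
            b -= costs[i - 1]
    return sorted(keep)

print(select_blocks(costs=[4, 3, 2, 5], scores=[7.0, 4.0, 2.5, 8.0], budget=9))
```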

MergeSFL: Split Federated Learning with Feature Merging and Batch Size Regulation

  • paper_url: http://arxiv.org/abs/2311.13348
  • repo_url: None
  • paper_authors: Yunming Liao, Yang Xu, Hongli Xu, Lun Wang, Zhiwei Yao, Chunming Qiao
  • for: 这篇论文旨在解决edge AI系统中的 federated learning(FL)中的计算和通信负担过重和模型隐私问题,通过结合数据和模型并行的 split federated learning(SFL)技术。
  • methods: 该论文提出了一种新的SFL框架,称为MergeSFL,它通过结合特征合并和批处理大小调节来解决EC系统中的统计差异和系统差异问题,并且同时优化这两种策略的关系以提高SFL的性能。
  • results: 实验结果表明,MergeSFL可以将最终模型准确率提高5.82%到26.22%,并将训练效率相比基线提升约1.74倍到4.14倍。
    Abstract Recently, federated learning (FL) has emerged as a popular technique for edge AI to mine valuable knowledge in edge computing (EC) systems. To mitigate the computing/communication burden on resource-constrained workers and protect model privacy, split federated learning (SFL) has been released by integrating both data and model parallelism. Despite resource limitations, SFL still faces two other critical challenges in EC, i.e., statistical heterogeneity and system heterogeneity. To address these challenges, we propose a novel SFL framework, termed MergeSFL, by incorporating feature merging and batch size regulation in SFL. Concretely, feature merging aims to merge the features from workers into a mixed feature sequence, which is approximately equivalent to the features derived from IID data and is employed to promote model accuracy. While batch size regulation aims to assign diverse and suitable batch sizes for heterogeneous workers to improve training efficiency. Moreover, MergeSFL explores to jointly optimize these two strategies upon their coupled relationship to better enhance the performance of SFL. Extensive experiments are conducted on a physical platform with 80 NVIDIA Jetson edge devices, and the experimental results show that MergeSFL can improve the final model accuracy by 5.82% to 26.22%, with a speedup by about 1.74x to 4.14x, compared to the baselines.
    摘要 最近,联邦学习(FL)已成为边缘计算(EC)系统中边缘智能挖掘有价值知识的流行技术。为了减轻资源受限工作节点的计算/通信负担并保护模型隐私,分割联邦学习(SFL)应运而生,它同时融合了数据并行与模型并行。尽管如此,SFL在EC中仍面临另外两个关键挑战,即统计异构性和系统异构性。为了解决这些挑战,我们提出了一个新的SFL框架,称为MergeSFL,它在SFL中引入了特征合并与批大小调节。具体来说,特征合并旨在将各工作节点的特征合并为一个混合特征序列,该序列近似等价于从IID数据得到的特征,用以提升模型精度;批大小调节则旨在为异构的工作节点分配多样且合适的批大小,以提高训练效率。此外,MergeSFL还探索了基于二者耦合关系的联合优化,以进一步增强SFL的性能。我们在配有80台NVIDIA Jetson边缘设备的物理平台上开展了大量实验,结果表明,相比基线,MergeSFL可将最终模型准确率提高5.82%到26.22%,并带来约1.74倍到4.14倍的加速。

The Influence of Neural Networks on Hydropower Plant Management in Agriculture: Addressing Challenges and Exploring Untapped Opportunities

  • paper_url: http://arxiv.org/abs/2311.13293
  • repo_url: None
  • paper_authors: C. Coelho, M. Fernanda P. Costa, L. L. Ferrás
  • for: 这项研究旨在改进水力发电厂的管理方式,以确保稳定的可再生能源和可持续的农业发展。
  • methods: 该研究考察了人工神经网络在当前水电厂管理软件中的应用现状,并提出了一种具有农业意识的水电厂管理框架,在最大化发电量的同时保障农业的稳定供水。
  • results: 研究发现,当前的水电厂管理软件可能导致农业用水需求得不到满足,尤其是在干旱时期;所提出的农业意识管理框架可在最大化发电量的同时保障农业稳定供水。此外,研究还提出了一些政策措施,以提高模型的透明度和可靠性,保障农业免受不当压力。
    Abstract Hydropower plants are crucial for stable renewable energy and serve as vital water sources for sustainable agriculture. However, it is essential to assess the current water management practices associated with hydropower plant management software. A key concern is the potential conflict between electricity generation and agricultural water needs. Prioritising water for electricity generation can reduce irrigation availability in agriculture during crucial periods like droughts, impacting crop yields and regional food security. Coordination between electricity and agricultural water allocation is necessary to ensure optimal and environmentally sound practices. Neural networks have become valuable tools for hydropower plant management, but their black-box nature raises concerns about transparency in decision making. Additionally, current approaches often do not take advantage of their potential to create a system that effectively balances water allocation. This work is a call for attention and highlights the potential risks of deploying neural network-based hydropower plant management software without proper scrutiny and control. To address these concerns, we propose the adoption of the Agriculture Conscious Hydropower Plant Management framework, aiming to maximise electricity production while prioritising stable irrigation for agriculture. We also advocate reevaluating government-imposed minimum water guidelines for irrigation to ensure flexibility and effective water allocation. Additionally, we suggest a set of regulatory measures to promote model transparency and robustness, certifying software that makes conscious and intelligent water allocation decisions, ultimately safeguarding agriculture from undue strain during droughts.
    摘要 水力发电厂对稳定的可再生能源至关重要,同时也是可持续农业的重要水源。然而,有必要评估与水电厂管理软件相关的现行水资源管理实践。一个关键问题是发电与农业用水需求之间的潜在冲突:优先将水用于发电,可能在干旱等关键时期减少农业灌溉供水,影响作物产量和区域粮食安全。发电与农业用水分配之间需要协调,以确保做法既优化又符合环境要求。神经网络已成为水电厂管理的有价值工具,但其黑箱特性引发了决策透明度方面的担忧;此外,现有方法往往没有利用其潜力去构建一个能有效平衡水资源分配的系统。本文旨在提请关注,并指出在缺乏适当审查与管控的情况下部署基于神经网络的水电厂管理软件的潜在风险。为应对这些问题,我们提议采用"农业意识水电厂管理"框架,在优先保障农业灌溉稳定的同时最大化发电量。我们还主张重新评估政府规定的灌溉最低用水标准,以确保水资源分配的灵活性与有效性。此外,我们建议一系列监管措施,以促进模型的透明度与稳健性,对做出有意识、智能化水资源分配决策的软件进行认证,最终保障农业在干旱期间免受不当压力。

Improving performance of heart rate time series classification by grouping subjects

  • paper_url: http://arxiv.org/abs/2311.13285
  • repo_url: None
  • paper_authors: Michael Beekhuizen, Arman Naseri, David Tax, Ivo van der Bilt, Marcel Reinders
  • for: 利用心率时间序列完成活动预测等分类任务。
  • methods: 使用了归一化、受试者分组以及手工特征,并将其作为深度学习(DL)网络的输入。
  • results: 发现心率时间序列可用于分类任务,但需要谨慎选择归一化或分组技术,以缓解受试者差异带来的问题。
    Abstract Unlike the more commonly analyzed ECG or PPG data for activity classification, heart rate time series data is less detailed, often noisier and can contain missing data points. Using the BigIdeasLab_STEP dataset, which includes heart rate time series annotated with specific tasks performed by individuals, we sought to determine if general classification was achievable. Our analyses showed that the accuracy is sensitive to the choice of window/stride size. Moreover, we found variable classification performances between subjects due to differences in the physical structure of their hearts. Various techniques were used to minimize this variability. First of all, normalization proved to be a crucial step and significantly improved the performance. Secondly, grouping subjects and performing classification inside a group helped to improve performance and decrease inter-subject variability. Finally, we show that including handcrafted features as input to a deep learning (DL) network improves the classification performance further. Together, these findings indicate that heart rate time series can be utilized for classification tasks like predicting activity. However, normalization or grouping techniques need to be chosen carefully to minimize the issue of subject variability.
    摘要 与更常被分析的ECG或PPG数据不同,心率时间序列数据细节较少、噪声往往更大,且可能包含缺失数据点。我们使用BigIdeasLab_STEP数据集(其中的心率时间序列标注了个体所执行的具体任务),试图确定通用分类是否可行。我们的分析表明,准确率对窗口/步长大小的选择较为敏感。此外,由于个体心脏物理结构的差异,不同受试者之间的分类性能存在波动。我们采用了多种技术来尽量减小这种波动。首先,归一化被证明是关键步骤,能显著提升性能;其次,将受试者分组并在组内进行分类,有助于提升性能并降低受试者间的差异;最后,我们表明将手工特征作为深度学习(DL)网络的输入可以进一步提升分类性能。总之,这些发现表明心率时间序列可用于活动预测等分类任务,但需要谨慎选择归一化或分组技术,以尽量缓解受试者差异问题。

A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer

  • paper_url: http://arxiv.org/abs/2311.13624
  • repo_url: None
  • paper_authors: Chenyang Li, Zhao Song, Weixin Wang, Chiwun Yang
  • for: 本研究旨在探讨Gradient Leakage攻击的特点和防范策略,以便更好地保护对私人数据的访问。
  • methods: 本研究使用Transformer模型进行探讨,并通过仔细分析Gradient Leakage攻击的方法,证明可以准确地从Gradient中提取敏感数据。
  • results: 本研究发现,Gradient Leakage攻击可以准确地提取Transformer模型中的敏感数据,并且可以在某些情况下进行攻击。此外,本研究还证明了在Gradient Leakage攻击下,SGD算法可以保持稳定的性能。
    Abstract The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attack and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
    摘要 深度梯度泄露(DLG)攻击已成为一种普遍且高效的方法,可通过检查交换的梯度来提取敏感训练数据,这对个人和组织的隐私都构成了重大威胁。本研究对梯度泄露方法在基于Transformer的模型上的应用进行了全面分析。通过细致的考察,我们展示了仅凭梯度即可准确恢复数据的能力,并严格研究了梯度攻击可以执行的条件,提供了有力的证据。此外,我们重新评估了在梯度上添加额外噪声作为抵御梯度攻击的保护措施。为此,我们给出了一个理论证明,在差分隐私框架下分析相应的隐私代价,并确认了随机梯度下降(SGD)算法在受扰动梯度下的收敛性。本研究的主要目标是增进对梯度泄露攻击与防御策略的理解,并为基于Transformer的模型量身定制隐私保护技术的发展作出积极贡献。通过揭示与梯度泄露相关的脆弱性和对策,本研究旨在推动在基于Transformer的模型场景中保护敏感数据、维护隐私方面的进展。
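The attack the abstract analyzes follows the well-known gradient-matching recipe: given a gradient shared during training, optimize dummy inputs and labels until their gradient matches it. A sketch follows (model, loss, and shapes are illustrative assumptions; the loss function must accept soft labels):

```python
import torch

def dlg_attack(model, loss_fn, g_true, x_shape, n_classes, steps=300):
    """Recover data from a shared gradient by gradient matching (sketch)."""
    x = torch.randn(x_shape, requires_grad=True)
    y = torch.randn(1, n_classes, requires_grad=True)     # soft dummy label
    opt = torch.optim.LBFGS([x, y])

    def closure():
        opt.zero_grad()
        loss = loss_fn(model(x), torch.softmax(y, dim=-1))
        g = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        match = sum(((gi - ti) ** 2).sum() for gi, ti in zip(g, g_true))
        match.backward()                                  # grads w.r.t. x and y
        return match

    for _ in range(steps):
        opt.step(closure)
    return x.detach(), y.detach()
```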

Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective

  • paper_url: http://arxiv.org/abs/2311.13279
  • repo_url: None
  • paper_authors: Hao Yuan, Yajiong Liu, Yanfeng Zhang, Xin Ai, Qiange Wang, Chaoyi Chen, Yu Gu, Ge Yu
  • for: 本研究旨在从数据管理角度对GNNS的训练进行复杂分析和评估,并提供了一系列实验结果和实践建议,帮助未来GNNS训练系统的设计。
  • methods: 本文使用了多种代表性方法来评估GNNS训练的数据管理方法,包括数据分区、批处理准备、数据传输等方面的优化。
  • results: 本文通过对多个benchmark数据集的广泛实验,显示了许多有趣和有价值的结果,并提供了一些实践建议,帮助未来GNNS训练系统的设计。
    Abstract Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN training in data management, such as data partitioning, batch preparation for mini-batch training, and data transferring between CPUs and GPUs. These factors, which take up a large proportion of training time, make data management in GNN training more significant. This paper reviews GNN training from a data management perspective and provides a comprehensive analysis and evaluation of the representative approaches. We conduct extensive experiments on various benchmark datasets and show many interesting and valuable results. We also provide some practical tips learned from these experiments, which are helpful for designing GNN training systems in the future.
    摘要 多种图神经网络(GNN)训练系统在最近出现,以支持高效的GNN训练。由于GNN包含复杂的训练样本之间的数据依赖关系,因此GNN训练需要解决不同于深度神经网络(DNN)训练的数据管理问题,如数据分区、批处理 для小批训练和CPU和GPU之间数据传输。这些因素占用了训练时间的大部分,因此数据管理在GNN训练中更加重要。本文从数据管理角度对GNN训练进行了全面的分析和评估,并在各种标准数据集上进行了广泛的实验。我们通过这些实验获得了许多有趣和有价值的结果,并提供了一些实践的建议,可以帮助未来设计GNN训练系统。

Hard Label Black Box Node Injection Attack on Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.13244
  • repo_url: https://github.com/bryanzhou008/hard_label_black_box_attack_gnn
  • paper_authors: Yu Zhou, Zihao Dong, Guofeng Zhang, Jingchen Tang
  • for: 本研究旨在攻击图神经网络(Graph Neural Network,GNN),具体来说是一种非目标性的硬标签黑盒节点注入攻击(Hard Label Black Box Node Injection Attack)。
  • methods: 本研究基于现有的边扰动攻击,通过限制优化过程来实现节点插入攻击。
  • results: 研究使用了三个数据集(COIL-DEL、IMDB-BINARY、NCI1)进行评估攻击性能。
    Abstract While graph neural networks have achieved state-of-the-art performances in many real-world tasks including graph classification and node classification, recent works have demonstrated they are also extremely vulnerable to adversarial attacks. Most previous works have focused on attacking node classification networks under impractical white-box scenarios. In this work, we will propose a non-targeted Hard Label Black Box Node Injection Attack on Graph Neural Networks, which to the best of our knowledge, is the first of its kind. Under this setting, more real world tasks can be studied because our attack assumes no prior knowledge about (1): the model architecture of the GNN we are attacking; (2): the model's gradients; (3): the output logits of the target GNN model. Our attack is based on an existing edge perturbation attack, from which we restrict the optimization process to formulate a node injection attack. In the work, we will evaluate the performance of the attack using three datasets, COIL-DEL, IMDB-BINARY, and NCI1.
    摘要 尽管图神经网络在包括图分类和节点分类在内的许多真实任务中取得了最先进的性能,近期工作表明它们同样极易受到对抗攻击。此前的大多数工作都聚焦于在不切实际的白盒场景下攻击节点分类网络。在本工作中,我们将提出一种非目标性的硬标签黑盒节点注入攻击(Hard Label Black Box Node Injection Attack),据我们所知,这是同类攻击中的首例。在这一设定下,可以研究更多的真实任务,因为我们的攻击不假设任何关于以下信息的先验知识:(1)所攻击GNN的模型架构;(2)模型的梯度;(3)目标GNN模型的输出logits。我们的攻击基于一种已有的边扰动攻击,通过限制其优化过程来构造节点注入攻击。在本工作中,我们将使用COIL-DEL、IMDB-BINARY和NCI1三个数据集评估攻击性能。

NeutronOrch: Rethinking Sample-based GNN Training under CPU-GPU Heterogeneous Environments

  • paper_url: http://arxiv.org/abs/2311.13225
  • repo_url: None
  • paper_authors: Xin Ai, Qiange Wang, Chunyu Cao, Yanfeng Zhang, Chaoyi Chen, Hao Yuan, Yu Gu, Ge Yu
  • for: 提高 GPU 训练 GNN 模型的效率
  • methods: 采用按层的任务编排策略,将底层训练任务下放到CPU,且仅对被频繁访问的顶点进行CPU端训练,以避免低效的CPU处理。
  • results: 与当前最佳 GNN 系统进行比较,NeutronOrch 可以实现高达 4.61 倍的性能提升
    Abstract Graph Neural Networks (GNNs) have demonstrated outstanding performance in various applications. Existing frameworks utilize CPU-GPU heterogeneous environments to train GNN models and integrate mini-batch and sampling techniques to overcome the GPU memory limitation. In CPU-GPU heterogeneous environments, we can divide sample-based GNN training into three steps: sample, gather, and train. Existing GNN systems use different task orchestrating methods to employ each step on CPU or GPU. After extensive experiments and analysis, we find that existing task orchestrating methods fail to fully utilize the heterogeneous resources, limited by inefficient CPU processing or GPU resource contention. In this paper, we propose NeutronOrch, a system for sample-based GNN training that incorporates a layer-based task orchestrating method and ensures balanced utilization of the CPU and GPU. NeutronOrch decouples the training process by layer and pushes down the training task of the bottom layer to the CPU. This significantly reduces the computational load and memory footprint of GPU training. To avoid inefficient CPU processing, NeutronOrch only offloads the training of frequently accessed vertices to the CPU and lets GPU reuse their embeddings with bounded staleness. Furthermore, NeutronOrch provides a fine-grained pipeline design for the layer-based task orchestrating method, fully overlapping different tasks on heterogeneous resources while strictly guaranteeing bounded staleness. The experimental results show that compared with the state-of-the-art GNN systems, NeutronOrch can achieve up to 4.61x performance speedup.
    摘要 图神经网络(GNN)在多种应用中展现出色的表现。现有框架利用CPU-GPU异构环境训练GNN模型,并结合小批量与采样技术来克服GPU内存限制。在CPU-GPU异构环境中,基于采样的GNN训练可分为三步:采样、聚集和训练。现有GNN系统采用不同的任务编排方法,将各步骤分配到CPU或GPU上执行。经过大量实验与分析,我们发现现有的任务编排方法受限于低效的CPU处理或GPU资源竞争,无法充分利用异构资源。本文提出NeutronOrch,一个面向基于采样的GNN训练的系统,它采用按层的任务编排方法,确保CPU与GPU的均衡利用。NeutronOrch将训练过程按层解耦,把底层的训练任务下放到CPU,从而显著降低GPU训练的计算负载与内存占用。为避免低效的CPU处理,NeutronOrch仅将被频繁访问的顶点的训练下放到CPU,并让GPU在有界陈旧度(bounded staleness)内复用它们的嵌入。此外,NeutronOrch为按层任务编排方法提供了细粒度的流水线设计,在严格保证有界陈旧度的同时,使不同任务在异构资源上充分重叠。实验结果显示,与最先进的GNN系统相比,NeutronOrch可实现最高4.61倍的性能加速。

Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

  • paper_url: http://arxiv.org/abs/2311.13180
  • repo_url: None
  • paper_authors: Jianqing Fan, Zhaoran Wang, Zhuoran Yang, Chenlu Ye
  • for: 这 paper Focuses on high-dimensional multi-armed contextual bandits with batched feedback, where the rewards are revealed only at the end of each batch.
  • methods: The paper proposes a provably sample-efficient algorithm that achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches.
  • results: The algorithm achieves regret bounds comparable to those in fully sequential settings with only $\mathcal{O}( \log T)$ batches, and features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret.
    Abstract We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Such a feedback structure is popular in applications such as personalized medicine and online advertisement, where the online data often do not arrive in a fully serial manner. We consider high-dimensional and linear settings where the reward function of the bandit model admits either a sparse or low-rank structure and ask how small a number of batches are needed for a comparable performance with fully dynamic data in which $L = T$. For these settings, we design a provably sample-efficient algorithm which achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in sparse and low-rank cases, respectively, and $ \mathcal{\tilde O}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in fully sequential setting with only $\mathcal{O}( \log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret. Furthermore, we also conduct experiments with synthetic and real-world data to validate our theory.
    摘要 我们研究带批量反馈的高维多臂上下文赌博机问题,其中$T$步在线交互被划分为$L$个批次。具体来说,每个批次依照一个依赖于先前批次的策略收集数据,而奖励只在批次结束时揭示。这种反馈结构在个性化医疗和在线广告等应用中十分常见,因为在线数据往往并非完全串行到达。我们考虑高维线性设定,其中赌博机模型的奖励函数具有稀疏或低秩结构,并探讨需要多少个批次才能取得与完全动态数据($L=T$)相当的性能。针对这些设定,我们设计了一个可证样本高效的算法:仅用$L = \mathcal{O}(\log T)$个批次,即可在稀疏情形下达到$\mathcal{\tilde O}(s_0^2 \log^2 T)$的遗憾界,在低秩情形下达到$\mathcal{\tilde O}(r^2 \log^2 T)$的遗憾界,其中$s_0$与$r$分别为稀疏与低秩情形下奖励参数的稀疏度与秩,$\mathcal{\tilde O}(\cdot)$省略了与特征维度相关的对数因子。换言之,我们的算法仅用$\mathcal{O}(\log T)$个批次即可取得与完全序贯设定相当的遗憾界。我们的算法包含一种新颖的批量分配方法,根据各批次内的估计精度与累积遗憾来调整批次大小。此外,我们还在合成数据与真实数据上开展了实验,以验证我们的理论。
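The $L = \mathcal{O}(\log T)$ regime is easy to picture with a doubling grid: commit to batch endpoints $2, 4, 8, \ldots, T$, so feedback arrives only about $\log_2 T$ times. This sketch shows only the grid; the paper's method additionally adapts batch sizes to within-batch estimation accuracy and cumulative regret.

```python
def doubling_schedule(T):
    """Batch endpoints 2, 4, 8, ..., T: about log2(T) batches in total."""
    ends, t = [], 2
    while t < T:
        ends.append(t)
        t *= 2
    ends.append(T)
    return ends

print(len(doubling_schedule(10_000)))   # 14 batches for T = 10,000
```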

SecureCut: Federated Gradient Boosting Decision Trees with Efficient Machine Unlearning

  • paper_url: http://arxiv.org/abs/2311.13174
  • repo_url: None
  • paper_authors: Jian Zhang, Bowen Li Jie Li, Chentao Wu
  • for: 本研究旨在解决纵向联邦学习(VFL)场景下的数据删除难题(在该场景中,多方为模型训练提供私有特征),以落实"被遗忘权"相关法规。
  • methods: 我们提出了一个新的梯度提升决策树(GBDT)框架,无需从头重新训练即可同时实现实例遗忘(instance unlearning)和特征遗忘(feature unlearning)。
  • results: 我们的方法可以在减少模型性能损失的同时实现有效的数据删除。实验结果显示,在各个评测数据集上,我们的方法都取得了比最先进方法更高的模型效用和遗忘效果。
    Abstract In response to legislation mandating companies to honor the \textit{right to be forgotten} by erasing user data, it has become imperative to enable data removal in Vertical Federated Learning (VFL) where multiple parties provide private features for model training. In VFL, data removal, i.e., \textit{machine unlearning}, often requires removing specific features across all samples under privacy guarentee in federated learning. To address this challenge, we propose \methname, a novel Gradient Boosting Decision Tree (GBDT) framework that effectively enables both \textit{instance unlearning} and \textit{feature unlearning} without the need for retraining from scratch. Leveraging a robust GBDT structure, we enable effective data deletion while reducing degradation of model performance. Extensive experimental results on popular datasets demonstrate that our method achieves superior model utility and forgetfulness compared to \textit{state-of-the-art} methods. To our best knowledge, this is the first work that investigates machine unlearning in VFL scenarios.
    摘要 鉴于法律规定公司必须尊重用户的"被遗忘权"并删除用户数据,在多方为模型训练提供私有特征的纵向联邦学习(VFL)中实现数据删除已势在必行。在VFL中,数据删除(即"机器遗忘")往往需要在联邦学习的隐私保证下,对所有样本移除特定特征。为应对这一挑战,我们提出了一种新的梯度提升决策树(GBDT)框架,无需从头重新训练即可有效实现实例遗忘与特征遗忘。借助稳健的GBDT结构,我们能够在降低模型性能损失的同时实现有效的数据删除。在常用数据集上的大量实验结果表明,与最先进方法相比,我们的方法取得了更优的模型效用与遗忘效果。据我们所知,这是首个研究VFL场景下机器遗忘的工作。

AdaptiveFL: Adaptive Heterogeneous Federated Learning for Resource-Constrained AIoT Systems

  • paper_url: http://arxiv.org/abs/2311.13166
  • repo_url: None
  • paper_authors: Chentao Jia, Ming Hu, Zekai Chen, Yanxin Yang, Xiaofei Xie, Yang Liu, Mingsong Chen
  • for: 本研究旨在解决 Federated Learning (FL) 在 Artificial Intelligence of Things (AIoT) 设备之间的协同学习中存在的分类性能低下问题,以及各种设备之间的差异性因素 (如计算能力、存储大小) 的影响。
  • methods: 本研究提出了一种有效的联邦学习方法,名为AdaptiveFL,该方法基于一种新的细粒度宽度方向模型剪枝策略,可以为异构的AIoT设备生成多种异构的本地模型。此外,我们还提出了一种基于强化学习的设备选择机制,可以在运行时根据设备的可用资源,为其选派合适的异构模型进行本地训练。
  • results: 对比 state-of-the-art 方法,AdaptiveFL 可以在 IID 和非 IID 场景中实现最高达 16.83% 的推理改进。
    Abstract Although Federated Learning (FL) is promising to enable collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from the problem of low classification performance due to various heterogeneity factors (e.g., computing capacity, memory size) of devices and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named AdaptiveFL based on a novel fine-grained width-wise model pruning strategy, which can generate various heterogeneous local models for heterogeneous AIoT devices. By using our proposed reinforcement learning-based device selection mechanism, AdaptiveFL can adaptively dispatch suitable heterogeneous models to corresponding AIoT devices on the fly based on their available resources for local training. Experimental results show that, compared to state-of-the-art methods, AdaptiveFL can achieve up to 16.83% inference improvements for both IID and non-IID scenarios.
    摘要 尽管联邦学习(FL)有望实现物联网人工智能(AIoT)设备之间的协同学习,但由于设备的各种异构因素(如计算能力、内存大小)和不确定的运行环境,其分类性能较低。为解决这些问题,本文提出了一种名为AdaptiveFL的有效联邦学习方法,它基于一种新的细粒度宽度方向模型剪枝策略,可以为异构的AIoT设备生成多种异构的本地模型。借助我们提出的基于强化学习的设备选择机制,AdaptiveFL能够在运行时根据AIoT设备的可用资源,自适应地为其分发合适的异构模型进行本地训练。实验结果表明,与最先进方法相比,AdaptiveFL在IID与非IID场景下均可带来最高16.83%的推理精度提升。
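Width-wise sub-model extraction can be illustrated by weight slicing: a narrow sub-model keeps the first fraction of each layer's units and shares those weights with the full model. This is only a sketch of the slicing idea; AdaptiveFL's actual fine-grained pruning strategy and RL-based dispatch are not reproduced here.

```python
import torch.nn as nn

def slice_linear(layer: nn.Linear, in_keep: int, out_keep: int) -> nn.Linear:
    """Sub-layer that shares the top-left block of the full layer's weights."""
    sub = nn.Linear(in_keep, out_keep)
    sub.weight.data = layer.weight.data[:out_keep, :in_keep].clone()
    sub.bias.data = layer.bias.data[:out_keep].clone()
    return sub

full = nn.Linear(128, 64)
narrow = slice_linear(full, in_keep=64, out_keep=32)   # a 0.5-width sub-model
```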

Have Your Cake and Eat It Too: Toward Efficient and Accurate Split Federated Learning

  • paper_url: http://arxiv.org/abs/2311.13163
  • repo_url: None
  • paper_authors: Dengke Yan, Ming Hu, Zeke Xia, Yanxin Yang, Jun Xia, Xiaofei Xie, Mingsong Chen
  • for: This paper aims to address the challenges of low inference accuracy and low efficiency in Split Federated Learning (SFL) for AIoT systems.
  • methods: The proposed approach, named Sliding Split Federated Learning (S$^2$FL), adopts an adaptive sliding model split strategy and a data balance-based training mechanism to alleviate the issues of stragglers and data heterogeneity.
  • results: Compared to conventional SFL, S$^2$FL achieves up to 16.5% inference accuracy improvement and 3.54X training acceleration, as demonstrated by experimental results.
    Abstract Due to its advantages in resource constraint scenarios, Split Federated Learning (SFL) is promising in AIoT systems. However, due to data heterogeneity and stragglers, SFL suffers from the challenges of low inference accuracy and low efficiency. To address these issues, this paper presents a novel SFL approach, named Sliding Split Federated Learning (S$^2$FL), which adopts an adaptive sliding model split strategy and a data balance-based training mechanism. By dynamically dispatching different model portions to AIoT devices according to their computing capability, S$^2$FL can alleviate the low training efficiency caused by stragglers. By combining features uploaded by devices with different data distributions to generate multiple larger batches with a uniform distribution for back-propagation, S$^2$FL can alleviate the performance degradation caused by data heterogeneity. Experimental results demonstrate that, compared to conventional SFL, S$^2$FL can achieve up to 16.5\% inference accuracy improvement and 3.54X training acceleration.
    摘要 凭借其在资源受限场景中的优势,分割联邦学习(SFL)在AIoT系统中展现出良好前景。然而,受数据异构性与掉队者(stragglers)影响,SFL面临推理精度低和效率低的挑战。为了解决这些问题,本文提出了一种新的SFL方法,即滑动分割联邦学习(S$^2$FL),它采用自适应滑动模型分割策略和基于数据均衡的训练机制。通过依据AIoT设备的计算能力动态下发不同的模型部分,S$^2$FL可以缓解掉队者导致的训练效率低下;通过将来自不同数据分布设备上传的特征组合成多个分布均匀的更大批次用于反向传播,S$^2$FL可以缓解数据异构性引起的性能退化。实验结果表明,与传统SFL相比,S$^2$FL可带来最高16.5%的推理精度提升和3.54倍的训练加速。

Multi-Objective Optimization via Wasserstein-Fisher-Rao Gradient Flow

  • paper_url: http://arxiv.org/abs/2311.13159
  • repo_url: None
  • paper_authors: Yinuo Ren, Tesi Xiao, Tanmay Gangwani, Anshuka Rangi, Holakou Rahmanian, Lexing Ying, Subhajit Sanyal
  • for: 该研究旨在提出一种基于分子动力学 simulations的多目标优化(MOO)方法,用于优化多个可能存在冲突的目标。
  • methods: 该方法结合过阻尼Langevin动力学与生灭动力学,并引入"支配势"(dominance potential)引导粒子走向全局帕累托最优。
  • results: 与先前方法相比,该方法能够更好地处理几何形态复杂的帕累托前沿,大量实验证明其在合成与真实数据上表现出色。
    Abstract Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a "dominance potential" to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.
    摘要 多目标优化(MOO)旨在同时优化多个可能相互冲突的目标,具有广泛的应用。受分子动力学模拟的启发,我们提出了一种新颖的交互粒子MOO方法。该方法结合过阻尼Langevin动力学与生灭动力学,并引入"支配势"(dominance potential)来引导粒子走向全局帕累托最优。与以往方法不同,我们的方法能够重新安置被支配的粒子,因而特别擅长处理几何形态复杂的帕累托前沿。我们的方法还具有理论基础,可被视为一种带收敛保证的Wasserstein-Fisher-Rao梯度流。大量实验证实,我们的方法在具有挑战性的合成与真实数据集上优于最先进的方法。

Testing Closeness of Multivariate Distributions via Ramsey Theory

  • paper_url: http://arxiv.org/abs/2311.13154
  • repo_url: None
  • paper_authors: Ilias Diakonikolas, Daniel M. Kane, Sihan Liu
  • for: This paper is written for the statistical task of closeness testing for multidimensional distributions.
  • methods: The paper uses a computationally efficient closeness tester with sub-learning sample complexity in any fixed dimension, and establishes a qualitatively matching sample complexity lower bound. The method makes essential use of tools from Ramsey theory.
  • results: The paper obtains a sample complexity bound of $O\left((k^{6/7}/\text{poly}_d(\epsilon)) \log^d(k)\right)$ for the closeness testing problem, and a lower bound of $\Omega(k^{6/7}/\text{poly}(\epsilon))$, even for $d=2$. The results show a substantial increase in sample complexity when moving from one to two dimensions, while increases beyond that do not.
    Abstract We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
    摘要 我们研究多维分布的贴近性(或等价性)检验这一统计任务:给定对$\mathbb R^d$上两个未知分布$\mathbf p$和$\mathbf q$的采样访问,我们希望区分$\mathbf p = \mathbf q$与$\|\mathbf p - \mathbf q\|_{A_k} > \epsilon$两种情形,其中$\|\mathbf p - \mathbf q\|_{A_k}$表示广义$A_k$距离,即两个分布在任意$k$个互不相交的轴对齐矩形集合上的最大差异。我们的主要结果是针对该问题的首个在任意固定维度下具有亚学习(sub-learning)样本复杂度的贴近性检验器,以及一个在数量级上相匹配的样本复杂度下界。更具体地说,我们给出了一个计算高效的贴近性检验器,其样本复杂度为$O\left((k^{6/7}/\text{poly}_d(\epsilon)) \log^d(k)\right)$;在下界方面,我们建立了数量级相匹配的样本复杂度下界$\Omega(k^{6/7}/\text{poly}(\epsilon))$,即便在$d=2$时也成立。这些样本复杂度界出人意料,因为该问题在一维设定下的样本复杂度为$\Theta(k^{4/5}/\text{poly}(\epsilon))$。由此得到一个有趣的推论:从一维到二维会带来样本复杂度的大幅增长,而维度继续升高则不会。作为一般$A_k$检验器的推论,我们还得到了两类$d_{\mathrm{TV}}$-贴近性检验器:针对$\mathbb R^d$上共享未知划分的两个$k$-直方图,以及针对支撑在$k$个未知互不相交轴对齐矩形之并上的两个均匀分布。我们的算法和下界都在本质上使用了Ramsey理论的工具。

Optimal Transport with Cyclic Symmetry

  • paper_url: http://arxiv.org/abs/2311.13147
  • repo_url: None
  • paper_authors: Shoichiro Takeda, Yasunori Akagi, Naoki Marumo, Kenta Niwa
  • for: 本文旨在提出一类新的最优传输(OT)快速算法,利用输入数据中的循环对称结构。
  • methods: 本文利用循环对称性与多种优化技术,将OT规约为一个变量数显著更少的小型优化问题,并通过求解该小问题获得原OT的最优解。
  • results: 实验表明,这些算法在具有严格/近似循环对称结构的合成/真实数据上对LOT和EROT都有效,并且比直接求解原OT更快。
    Abstract We propose novel fast algorithms for optimal transport (OT) utilizing a cyclic symmetry structure of input data. Such OT with cyclic symmetry appears universally in various real-world examples: image processing, urban planning, and graph processing. Our main idea is to reduce OT to a small optimization problem that has significantly fewer variables by utilizing cyclic symmetry and various optimization techniques. On the basis of this reduction, our algorithms solve the small optimization problem instead of the original OT. As a result, our algorithms obtain the optimal solution and the objective function value of the original OT faster than solving the original OT directly. In this paper, our focus is on two crucial OT formulations: the linear programming OT (LOT) and the strongly convex-regularized OT, which includes the well-known entropy-regularized OT (EROT). Experiments show the effectiveness of our algorithms for LOT and EROT in synthetic/real-world data that has a strict/approximate cyclic symmetry structure. Through theoretical and experimental results, this paper successfully introduces the concept of symmetry into the OT research field for the first time.
    摘要 我们提出了一类新颖的最优传输(OT)快速算法,它利用输入数据的循环对称结构。这种具有循环对称性的OT普遍出现在各种真实场景中:图像处理、城市规划和图处理。我们的核心思想是利用循环对称性与多种优化技术,将OT规约为一个变量数显著更少的小型优化问题;在此规约的基础上,我们的算法改为求解该小型优化问题,从而比直接求解原OT更快地获得原OT的最优解和目标函数值。本文聚焦于两种重要的OT形式:线性规划OT(LOT)和强凸正则化OT(后者包含著名的熵正则化OT,EROT)。实验表明,在具有严格/近似循环对称结构的合成/真实数据上,我们的算法对LOT和EROT均有效。通过理论与实验结果,本文首次将对称性概念引入OT研究领域。

Newton-CG methods for nonconvex unconstrained optimization with Hölder continuous Hessian

  • paper_url: http://arxiv.org/abs/2311.13094
  • repo_url: None
  • paper_authors: Chuan He, Zhaosong Lu
  • for: 本文研究一个非凸无约束优化问题,目标是最小化一个二阶可微、且Hessian满足Hölder连续条件的目标函数。
  • methods: 我们首先提出一种Newton-共轭梯度(Newton-CG)方法,在假设相关Hölder参数已知的情况下寻找该问题的近似一阶驻点(FOSP);随后,我们提出一种无需任何参数先验知识的无参数Newton-CG方法。据我们所知,该方法在寻找此问题的近似FOSP方面达到了已知最优的迭代与运算复杂度。
  • results: 我们还提出了一种Newton-CG方法,能够以高概率寻找该问题的近似二阶驻点(SOSP),并确定了其迭代与运算复杂度。此外,初步数值结果表明,我们的无参数Newton-CG方法在实际表现上优于一种知名的正则化牛顿方法。
    Abstract In this paper we consider a nonconvex unconstrained optimization problem minimizing a twice differentiable objective function with H\"older continuous Hessian. Specifically, we first propose a Newton-conjugate gradient (Newton-CG) method for finding an approximate first-order stationary point (FOSP) of this problem, assuming the associated the H\"older parameters are explicitly known. Then we develop a parameter-free Newton-CG method without requiring any prior knowledge of these parameters. To the best of our knowledge, this method is the first parameter-free second-order method achieving the best-known iteration and operation complexity for finding an approximate FOSP of this problem. Furthermore, we propose a Newton-CG method for finding an approximate second-order stationary point (SOSP) of the considered problem with high probability and establish its iteration and operation complexity. Finally, we present preliminary numerical results to demonstrate the superior practical performance of our parameter-free Newton-CG method over a well-known regularized Newton method.
    摘要 本文考虑一个非凸无约束优化问题,其目标函数二阶可微且Hessian满足Hölder连续条件。具体来说,我们首先提出一种Newton-共轭梯度(Newton-CG)方法,在假设相关Hölder参数显式已知的情况下寻找该问题的近似一阶驻点(FOSP)。随后,我们提出了一种无需任何Hölder参数先验知识的无参数Newton-CG方法。据我们所知,这是首个在寻找该问题近似FOSP方面达到已知最优迭代与运算复杂度的无参数二阶方法。此外,我们提出了一种能以高概率寻找该问题近似二阶驻点(SOSP)的Newton-CG方法,并确定了其迭代与运算复杂度。最后,我们给出初步数值结果,展示了我们的无参数Newton-CG方法相对于一种知名的正则化牛顿方法在实际中的优越表现。
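The Newton-CG template the paper refines is standard: approximately solve the Newton system $Hp = -g$ with conjugate gradient, and bail out to the current direction when nonpositive curvature is detected. A bare-bones sketch follows (the paper's Hölder-aware step control and complexity machinery are not reproduced):

```python
import numpy as np

def newton_cg_direction(H, g, tol=1e-8, max_iter=None):
    """CG on H p = -g with a nonpositive-curvature early exit (sketch)."""
    n = g.shape[0]
    max_iter = max_iter or 2 * n
    p = np.zeros(n)
    r = g.copy()                          # residual of H p + g = 0
    d = -r
    for _ in range(max_iter):
        Hd = H @ d
        dHd = d @ Hd
        if dHd <= tol * (d @ d):          # curvature too small or negative
            return d if not p.any() else p
        alpha = (r @ r) / dHd
        p = p + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) < tol:
            return p
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return p
```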

eess.SP - 2023-11-22

Single-shot Quantum Signal Processing Interferometry

  • paper_url: http://arxiv.org/abs/2311.13703
  • repo_url: https://github.com/yuanliu1/qsp-interferometry
  • paper_authors: Jasmine Sinanan-Singh, Gabriel L. Mintzer, Isaac L. Chuang, Yuan Liu
  • for: 这篇论文旨在探讨量子系统的传感问题,具体来说是利用量子信号处理技术在量子力学的基本极限(海森堡极限)下实现量子传感。
  • methods: 论文将量子信号处理(QSP)技术从量子比特推广到量子比特-振子混合系统,并将其应用于量子传感问题。
  • results: 结果表明,利用QSP技术可以在单次测量(single-shot)情形下逼近海森堡极限的精度,并可通过串联多个二元判决来实现参数估计。
    Abstract Quantum systems of infinite dimension, such as bosonic oscillators, provide vast resources for quantum sensing. Yet, a general theory on how to manipulate such bosonic modes for sensing beyond parameter estimation is unknown. We present a general algorithmic framework, quantum signal processing interferometry (QSPI), for quantum sensing at the fundamental limits of quantum mechanics, i.e., the Heisenberg sensing limit, by generalizing Ramsey-type interferometry. Our QSPI sensing protocol relies on performing nonlinear polynomial transformations on the oscillator's quadrature operators by generalizing quantum signal processing (QSP) from qubits to hybrid qubit-oscillator systems. We use our QSPI sensing framework to make binary decisions on a displacement channel in the single-shot limit. Theoretical analysis suggests the sensing accuracy given a single-shot qubit measurement can approach the Heisenberg-limit scaling. We further concatenate a series of such binary decisions to perform parameter estimation in a bit-by-bit fashion. Numerical simulations are performed to support these statements. Our QSPI protocol offers a unified framework for quantum sensing using continuous-variable bosonic systems beyond parameter estimation and establishes a promising avenue toward efficient and scalable quantum control and quantum sensing schemes beyond the NISQ era.
    摘要 诸如玻色振子这样的无穷维量子系统为量子传感提供了丰富的资源。然而,关于如何操控这类玻色模式以实现超越参数估计的传感,目前尚无一般性理论。我们提出了一个一般性的算法框架,即量子信号处理干涉测量(QSPI),通过推广Ramsey型干涉测量,在量子力学的基本极限(即海森堡传感极限)下实现量子传感。我们的QSPI传感协议依赖于对振子的正交分量算符执行非线性多项式变换,方法是将量子信号处理(QSP)从量子比特推广到量子比特-振子混合系统。我们利用QSPI传感框架,在单次测量(single-shot)极限下对一个位移信道作出二元判决。理论分析表明,单次量子比特测量所能达到的传感精度可逼近海森堡极限的标度。我们进一步将一系列这样的二元判决串联起来,以逐比特的方式进行参数估计,并通过数值模拟支持上述结论。我们的QSPI协议为利用连续变量玻色系统实现超越参数估计的量子传感提供了统一框架,并为NISQ时代之后高效、可扩展的量子控制与量子传感方案开辟了一条有希望的途径。

Cramér-Rao Bounds for the Simultaneous Estimation of Power System Electromechanical Modes and Forced Oscillations

  • paper_url: http://arxiv.org/abs/2311.13598
  • repo_url: None
  • paper_authors: Luke Dosiek
  • for: 本文推导了同时估计电力系统机电模式与受迫振荡(FO)的Cramér-Rao界(CRB)。
  • methods: 本文考虑了两种情况:第一种情况下,估计算法所用的系统输出测量中仅存在对FO的稳态响应;第二种情况下,测量中还额外包含FO的启动暂态。
  • results: 1)FO参数的CRB不受启动暂态存在与否的影响;2)系统模式的CRB不受稳态FO存在与否的影响;3)FO启动暂态的存在可以显著降低系统模式的CRB。
    Abstract In this paper, the Cram\'{e}r-Rao Bounds (CRB) for the simultaneous estimation of power system electromechanical modes and forced oscillations (FO) are derived. Two cases are considered; in the first case only the steady-state response to the FO is present in the measured system output used by estimation algorithms. In the second, the startup transient of the FO is present in addition to the steady-state response. The CRBs are analyzed numerically to explore sensitivities to FO frequency, signal-to-noise ratio (SNR) and observation window length. It is demonstrated that 1) the CRB of FO parameters is not affected by the presence of the transient response, 2) the CRB of the system modes is not affected by the presence of an FO in steady-state and 3) the CRB of the system modes can be drastically reduced by the presence of a FO startup transient.
    摘要 本文推导了同时估计电力系统机电模式与受迫振荡(FO)的Cramér-Rao界(CRB)。我们考虑了两种情况:在第一种情况下,估计算法所用的系统输出测量中只包含对FO的稳态响应;在第二种情况下,除稳态响应外还包含FO的启动暂态。我们对CRB进行了数值分析,考察其对FO频率、信噪比(SNR)和观测窗口长度的敏感性。结果表明:1)FO参数的CRB不受暂态响应存在与否的影响;2)系统模式的CRB不受稳态FO存在与否的影响;3)FO启动暂态的存在可以显著降低系统模式的CRB。
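For context, the standard machinery behind such bounds (assumed form, for illustration): the covariance of any unbiased estimator is lower-bounded by the inverse Fisher information, and for a deterministic signal in white Gaussian noise the Fisher information reduces to a Jacobian product:

```latex
% CRB: estimator covariance is bounded below by the inverse Fisher information.
\mathrm{cov}(\hat{\boldsymbol\theta}) \succeq \mathcal{I}(\boldsymbol\theta)^{-1},
\qquad
[\mathcal{I}(\boldsymbol\theta)]_{ij}
  = \mathbb{E}\!\left[
      \frac{\partial \ln p(\mathbf{y};\boldsymbol\theta)}{\partial \theta_i}\,
      \frac{\partial \ln p(\mathbf{y};\boldsymbol\theta)}{\partial \theta_j}
    \right].
% For y = s(theta) + w with white Gaussian noise of variance sigma^2:
\mathbf{y} = \mathbf{s}(\boldsymbol\theta) + \mathbf{w},\;
\mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})
\;\Longrightarrow\;
\mathcal{I}(\boldsymbol\theta)
  = \frac{1}{\sigma^2}\,\mathbf{J}^{\mathsf T}\mathbf{J},
\qquad
\mathbf{J} = \frac{\partial \mathbf{s}(\boldsymbol\theta)}{\partial \boldsymbol\theta^{\mathsf T}}.
```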

Volumetric 3D Point Cloud Attribute Compression: Learned polynomial bilateral filter for prediction

  • paper_url: http://arxiv.org/abs/2311.13533
  • repo_url: None
  • paper_authors: Tam Thuc Do, Philip A. Chou, Gene Cheung
  • for: 这篇论文旨在提出一种基于体积化方法的属性压缩算法,用于压缩三维点云属性。
  • methods: 该算法对B样条基函数的系数进行量化编码,按从粗到细的分辨率进行编码;此外,论文还提出一种基于多项式双边滤波器(PBF)的学习方法,用于从较粗粒度系数预测更细粒度的系数。
  • results: 论文通过在多个点云数据集上的测试,证明PBF的预测性能超过了受MPEG标准启发的线性预测器。
    Abstract We extend a previous study on 3D point cloud attribute compression scheme that uses a volumetric approach: given a target volumetric attribute function $f : \mathbb{R}^3 \mapsto \mathbb{R}$, we quantize and encode parameters $\theta$ that characterize $f$ at the encoder, for reconstruction $f_{\hat{\theta}(\mathbf(x))$ at known 3D points $\mathbf(x)$ at the decoder. Specifically, parameters $\theta$ are quantized coefficients of B-spline basis vectors $\mathbf{\Phi}_l$ (for order $p \geq 2$) that span the function space $\mathcal{F}_l^{(p)}$ at a particular resolution $l$, which are coded from coarse to fine resolutions for scalability. In this work, we focus on the prediction of finer-grained coefficients given coarser-grained ones by learning parameters of a polynomial bilateral filter (PBF) from data. PBF is a pseudo-linear filter that is signal-dependent with a graph spectral interpretation common in the graph signal processing (GSP) field. We demonstrate PBF's predictive performance over a linear predictor inspired by MPEG standardization over a wide range of point cloud datasets.
    摘要 我们扩展了此前关于采用体积化方法的三维点云属性压缩方案的研究:给定目标体积属性函数$f : \mathbb{R}^3 \mapsto \mathbb{R}$,我们在编码端对刻画$f$的参数$\theta$进行量化与编码,以便在解码端于已知三维点$\mathbf{x}$处重建$f_{\hat{\theta}}(\mathbf{x})$。具体来说,参数$\theta$是B样条基向量$\mathbf{\Phi}_l$(阶数$p \geq 2$)的量化系数,这些基向量张成特定分辨率$l$下的函数空间$\mathcal{F}_l^{(p)}$,并按从粗到细的分辨率进行编码以获得可伸缩性。在本工作中,我们聚焦于在给定较粗粒度系数的情况下预测更细粒度的系数,方法是从数据中学习一个多项式双边滤波器(PBF)的参数。PBF是一种依赖于信号的伪线性滤波器,具有图信号处理(GSP)领域常见的图谱解释。在多种点云数据集上,我们展示了PBF相对于受MPEG标准化启发的线性预测器的预测性能优势。
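A plain (non-learned) bilateral filter on point-cloud attributes is easy to state and shows why the filter is signal-dependent: weights fall off in both spatial distance and attribute difference. The paper learns a polynomial generalization of these weights from data; the fixed Gaussian kernels below are a stand-in.

```python
import numpy as np

def bilateral(points, values, sigma_s=0.1, sigma_r=0.5):
    """Signal-dependent smoothing of per-point attributes (dense sketch)."""
    diff_p = points[:, None, :] - points[None, :, :]          # (N, N, 3)
    diff_v = values[:, None] - values[None, :]                # (N, N)
    w = np.exp(-np.sum(diff_p**2, -1) / (2 * sigma_s**2)) \
        * np.exp(-(diff_v**2) / (2 * sigma_r**2))
    return (w @ values) / w.sum(axis=1)

pts = np.random.rand(100, 3)      # toy point cloud
attr = np.random.rand(100)        # one attribute per point
smoothed = bilateral(pts, attr)
```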

Real-time unobtrusive sleep monitoring of in-patients with affective disorders: a feasibility study

  • paper_url: http://arxiv.org/abs/2311.13457
  • repo_url: None
  • paper_authors: Samuel Askjer, Kim Mathiasen, Ali Amidi, Christine Parsons, Nicolai Ladegaard
  • for: 这项研究旨在检验心冲击描记法(ballistocardiography)能否有效测量精神科住院患者睡眠的关键方面。
  • methods: 研究者使用心冲击描记技术对睡眠进行非侵扰式测量,评估对象为确诊抑郁症和双相情感障碍的住院患者。
  • results: 研究发现,使用心冲击描记设备测量睡眠切实可行,数据质量与完整性均无明显问题;在临床方面,未发现住院时长与卧床时间减少之间的关联。
    Abstract Sleep and mental health are highly related concepts, and it is an important research and clinical priority to understand their interactions. In-bed sensors using ballistocardiography provide the possibility of unobtrusive measurements of sleep. In this study, we examined the feasibility of ballistocardiography in measuring key aspects of sleep in psychiatric in-patients. Specifically, we examined a sample of patients diagnosed with depression and bipolar disorder. The subjective experiences of the researchers conducting the study are explored and descriptive analyses of patient sleep are subsequently presented. The practicalities of using the ballistocardiography device seem to be favourable. There were no apparent issues regarding data quality or data integrity. Of clinical interest, we found no link between length of stay and reduced time in bed (b = -0.06, SE = 0.03, t = -1.76, p = .08). Using ballistocardiography for measurements on in-patients with affective disorders seems to be a feasible approach.
    摘要 睡眠和心理健康是高度相关的概念,理解它们之间的交互是研究和临床上的重要优先事项。基于 ballistocardiography 的床内传感器提供了非侵入式测量睡眠的可能性。本研究考察了 ballistocardiography 在测量精神科住院患者关键睡眠指标方面的可行性,具体评估了被诊断为抑郁症和双相情感障碍的患者样本。文中首先探讨了开展该研究的研究人员的主观经验,随后给出了患者睡眠的描述性分析。使用 ballistocardiography 设备在实际操作上较为有利,未发现明显的数据质量或数据完整性问题。在临床意义上,我们未发现住院时长与卧床时间减少之间的关联(b = -0.06, SE = 0.03, t = -1.76, p = .08)。使用 ballistocardiography 测量患有情感障碍的住院患者的睡眠似乎是一种可行的方法。

Millimeter Wave Thin-Film Bulk Acoustic Resonator in Sputtered Scandium Aluminum Nitride Using Platinum Electrodes

  • paper_url: http://arxiv.org/abs/2311.13448
  • repo_url: None
  • paper_authors: Sinwoo Cho, Omar Barrera, Pietro Simeoni, Ellie Y. Wang, Jack Kramer, Vakhtang Chulukhadze, Joshua Campbell, Matteo Rinaldi, Ruochen Lu
  • for: 这篇论文描述了在毫米波(mmWave)频段使用溅射钪铝氮化物(ScAlN)薄膜体声波谐振器(FBAR),通过改进制造工艺和声学/电磁设计实现高品质因子(Q)。
  • methods: 这些 FBAR 使用铂(Pt)电极,并对不同的电极组合(Al顶/Al底、Pt顶/Al底、Al顶/Pt底、Pt顶/Pt底)进行了研究,以了解电极对 mmWave FBAR 的影响。
  • results: 研究发现,使用 Pt 顶部和底部电极的 FBAR 在 13.7 GHz 的一阶对称模(S1)上实现了 4.0% 的机电耦合系数(k2)和 116 的 Q 因子;在 61.6 GHz 的三阶对称模(S3)上实现了 1.8% 的 k2 和 94 的 Q 因子(k2 与 Q 的一种常见提取方式见下方示意代码)。这些结果表明,即使在约 60 GHz 的频段,经过优化制造和声学/电磁设计的 ScAlN FBAR 也能实现接近 100 的 Q 因子。
    Abstract This work describes sputtered scandium aluminum nitride (ScAlN) thin-film bulk acoustic resonators (FBAR) at millimeter wave (mmWave) with high quality factor (Q) using platinum (Pt) electrodes. FBARs with combinations of Pt and aluminum (Al) electrodes, i.e., Al top Al bottom, Pt top Al bottom, Al top Pt bottom, and Pt top Pt bottom, are built to study the impact of electrodes on mmWave FBARs. The demonstrated FBAR with Pt top and bottom electrodes achieve electromechanical coupling (k2) of 4.0% and Q of 116 for the first-order symmetric (S1) mode at 13.7 GHz, and k2 of 1.8% and Q of 94 for third-order symmetric (S3) mode at 61.6 GHz. Through these results, we confirmed that even in the frequency band of approximately 60 GHz, ScAlN FBAR can achieve a Q factor approaching 100 with optimized fabrication and acoustic/EM design. Further development calls for stacks with better quality in piezoelectric and metallic layers.
    摘要 本工作描述了在毫米波(mmWave)频段使用铂(Pt)电极的溅射钪铝氮化物(ScAlN)薄膜体声波谐振器(FBAR),其具有高品质因子(Q)。为研究电极对 mmWave FBAR 的影响,我们制备了 Pt 与铝(Al)电极的不同组合(Al顶/Al底、Pt顶/Al底、Al顶/Pt底、Pt顶/Pt底)。采用 Pt 顶部和底部电极的 FBAR 在 13.7 GHz 的一阶对称模(S1)上实现了 4.0% 的机电耦合系数(k2)和 116 的 Q,在 61.6 GHz 的三阶对称模(S3)上实现了 1.8% 的 k2 和 94 的 Q。这些结果证实,即使在约 60 GHz 频段,经过优化制造和声学/电磁设计的 ScAlN FBAR 也能实现接近 100 的 Q 因子。进一步的发展需要质量更好的压电层和金属层堆叠。
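
For orientation, a tiny sketch of one common way to extract these figures of merit from a measured resonance: k_eff^2 from the series/parallel resonance frequencies and Q from the 3-dB bandwidth. Both the definitions and the numbers below are assumptions for illustration, not the paper's exact extraction procedure.

```python
def k_eff_squared(f_s, f_p):
    """Effective electromechanical coupling from series (f_s) and
    parallel (f_p) resonance frequencies (one common definition)."""
    return (f_p**2 - f_s**2) / f_p**2

def quality_factor(f_r, bw_3db):
    """Q from resonance frequency and 3-dB bandwidth."""
    return f_r / bw_3db

# Illustrative numbers only (not measured values from the paper):
f_s, f_p = 13.42e9, 13.70e9
print(f"k_eff^2 = {100 * k_eff_squared(f_s, f_p):.1f}%")      # ~4.0%
print(f"Q       = {quality_factor(13.7e9, 13.7e9 / 116):.0f}")  # 116 by construction
```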

Timely and Efficient Information Delivery in Real-Time Industrial IoT Networks

  • paper_url: http://arxiv.org/abs/2311.13329
  • repo_url: None
  • paper_authors: Hossam Farag, Dejan Vukobratovic, Andrea Munari, Cedomir Stefanovic
  • for: 在工业物联网(IIoT)网络中实现实时通信,以支持工业4.0及即将到来的工业5.0所需的自主、自组织和可重构的工业自动化。
  • methods: 考虑一种 SIC 辅助的实时 IIoT 网络:传感器节点根据所监测现象特定的事件生成概率产生报告,并以时隙 ALOHA 方式在块衰落信道上发送至公共接入点(AP);AP 利用竞争用户接收功率的差异,采用连续干扰消除(SIC)从碰撞中解码用户数据包(见下方仿真示例)。
  • results: 推导了信息年龄(AoI)、吞吐量和截止期限违反概率,结果表明采用 SIC 相比标准时隙 ALOHA 以及年龄相关的接入方法可改善所有性能指标;解析结果与仿真结果吻合,证明在接收端投入 SIC 能力可使这种简单的接入方法支持及时、高效的信息传输。
    Abstract Enabling real-time communication in Industrial Internet of Things (IIoT) networks is crucial to support autonomous, self-organized and re-configurable industrial automation for Industry 4.0 and the forthcoming Industry 5.0. In this paper, we consider a SIC-assisted real-time IIoT network, in which sensor nodes generate reports according to an event-generation probability that is specific for the monitored phenomena. The reports are delivered over a block-fading channel to a common Access Point (AP) in slotted ALOHA fashion, which leverages the imbalances in the received powers among the contending users and applies successive interference cancellation (SIC) to decode user packets from the collisions. We provide an extensive analytical treatment of the setup, deriving the Age of Information (AoI), throughput and deadline violation probability, when the AP has access to both the perfect as well as the imperfect channel-state information. We show that adopting SIC improves all the performance parameters with respect to the standard slotted ALOHA, as well as to an age-dependent access method. The analytical results agree with the simulation based ones, demonstrating that investing in the SIC capability at the receiver enables this simple access method to support timely and efficient information delivery in IIoT networks.
    摘要 在工业物联网(IIoT)网络中实现实时通信,是支持工业4.0及即将到来的工业5.0所需的自主、自组织和可重构工业自动化的关键。本文考虑一种 SIC 辅助的实时 IIoT 网络:传感器节点根据所监测现象特定的事件生成概率产生报告,并以时隙 ALOHA 方式在块衰落信道上发送到公共接入点(AP);AP 利用竞争用户之间接收功率的差异,采用连续干扰消除(SIC)从碰撞中解码用户数据包。我们对该系统进行了细致的解析,推导了 AP 拥有完美与非完美信道状态信息时的信息年龄(AoI)、吞吐量和截止期限违反概率。结果表明,采用 SIC 相比标准时隙 ALOHA 以及一种年龄相关的接入方法可改善所有性能指标。解析结果与仿真结果一致,表明在接收端投入 SIC 能力可使这种简单的接入方法在 IIoT 网络中支持及时而高效的信息传输。
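
A minimal Monte-Carlo sketch of the access scheme described above: users transmit in slots with a fixed event-generation probability over Rayleigh block fading, and the AP peels off the strongest users via SIC while the average AoI is tracked. The SINR threshold, fading model, and SIC depth are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_USERS, SLOTS, P_GEN, SINR_TH, MAX_SIC = 20, 20000, 0.05, 1.0, 4

aoi = np.zeros(N_USERS)          # current Age of Information per user
aoi_sum = 0.0

for _ in range(SLOTS):
    aoi += 1                     # age grows by one slot everywhere
    active = rng.random(N_USERS) < P_GEN             # users with a fresh report
    powers = rng.exponential(1.0, N_USERS) * active  # Rayleigh block fading
    # Successive interference cancellation: repeatedly decode the strongest
    # user whose SINR exceeds the threshold, then remove it from the sum.
    for _ in range(MAX_SIC):
        total = powers.sum()
        if total == 0:
            break
        k = int(np.argmax(powers))
        sinr = powers[k] / (total - powers[k] + 1.0)  # unit noise power
        if sinr < SINR_TH:
            break
        aoi[k] = 0               # fresh packet delivered: age resets
        powers[k] = 0.0          # cancelled from the collision
    aoi_sum += aoi.mean()

print(f"average AoI with SIC: {aoi_sum / SLOTS:.1f} slots")
```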

AA-DL: AoI-Aware Deep Learning Approach for D2D-Assisted Industrial IoT

  • paper_url: http://arxiv.org/abs/2311.13325
  • repo_url: None
  • paper_authors: Hossam Farag, Mohamed Ragab, Cedomir Stefanovic
  • for: 本研究提出了一种信息年龄(AoI)感知的深度学习方法,以最小化 D2D 辅助工业物联网中的峰值信息年龄(PAoI)。
  • methods: 通过随机几何分析成功概率和平均 PAoI,并将调度问题表述为优化问题;为求解该非凸问题,开发了一种利用地理位置信息(GLI)并结合反馈阶段的神经网络结构,在随机部署的网络上进行无监督学习(由 GLI 计算路径损耗特征的示意见下方)。
  • results: 数值结果表明,AA-DL 方法相比近期的基准方法能够改善 PAoI 性能,且相比传统迭代优化方法具有更低的复杂度。
    Abstract In real-time Industrial Internet of Things (IIoT), e.g., monitoring and control scenarios, the freshness of data is crucial to maintain the system functionality and stability. In this paper, we propose an AoI-Aware Deep Learning (AA-DL) approach to minimize the Peak Age of Information (PAoI) in D2D-assisted IIoT networks. Particularly, we analyzed the success probability and the average PAoI via stochastic geometry, and formulate an optimization problem with the objective to find the optimal scheduling policy that minimizes PAoI. In order to solve the non-convex scheduling problem, we develop a Neural Network (NN) structure that exploits the Geographic Location Information (GLI) along with feedback stages to perform unsupervised learning over randomly deployed networks. Our motivation is based on the observation that in various transmission contexts, the wireless channel intensity is mainly influenced by distancedependant path loss, which could be calculated using the GLI of each link. The performance of the AA-DL method is evaluated via numerical results that demonstrate the effectiveness of our proposed method to improve the PAoI performance compared to a recent benchmark while maintains lower complexity against the conventional iterative optimization method.
    摘要 在实时工业互联网(IIoT)应用中,例如监控和控制场景中,数据的新鲜度是维护系统功能和稳定性的关键。在这篇论文中,我们提出了一种基于AoI-Aware Deep Learning(AA-DL)的方法,以最小化在D2D协助IIoT网络中的峰值信息年龄(PAoI)。我们通过杂因统计学分析成功概率和平均PAoI,并将优化问题转化为找到最佳调度策略,以最小化PAoI。由于调度问题是非凸问题,我们开发了一种基于地理位置信息(GLI)和反馈stage的神经网络结构,以进行无监督学习。我们的动机是基于不同传输上下文中的无线通信频率强度主要受到距离参差的路径损失影响,这可以通过每个链接的GLI来计算。我们的AA-DL方法的性能通过数字结果展示,与最近的参考方法相比,可以提高PAoI性能,同时与传统迭代优化方法相比,具有较低的复杂性。

Modulation For Modulo: A Sampling-Efficient High-Dynamic Range ADC

  • paper_url: http://arxiv.org/abs/2311.13282
  • repo_url: None
  • paper_authors: Satish Mulleti, Resham Yashwanth Kumar, Laxmeesha Somappa
  • for: This paper aims to develop an efficient high dynamic range (HDR) analog-to-digital converter (ADC) with fewer bits by using phase modulation (PM) instead of oversampling.
  • methods: The paper proposes using PM to modulate the phase of a carrier signal with the analog input, which allows for HDR-ADC functionality with a low-dynamic range (DR) ADC and fewer bits (see the demodulation sketch below). The authors also derive identifiability results for reconstruction of the original signal from PM samples acquired at the Nyquist rate.
  • results: The authors demonstrate the efficiency of their PM-based approach with lower reconstruction errors and reduced sampling rates, and show that their hardware prototype can reconstruct signals ten times greater than the ADC’s DR from Nyquist rate samples. This has the potential to replace high-bit rate HDR-ADCs while meeting existing bit rate needs.
    Abstract In high-dynamic range (HDR) analog-to-digital converters (ADCs), having many quantization bits minimizes quantization errors but results in high bit rates, limiting their application scope. A strategy combining modulo-folding with a low-DR ADC can create an efficient HDR-ADC with fewer bits. However, this typically demands oversampling, increasing the overall bit rate. An alternative method using phase modulation (PM) achieves HDR-ADC functionality by modulating the phase of a carrier signal with the analog input. This allows a low-DR ADC with fewer bits. We've derived identifiability results enabling reconstruction of the original signal from PM samples acquired at the Nyquist rate, adaptable to various signals and non-uniform sampling. Using discrete phase demodulation algorithms for practical implementation, our PM-based approach doesn't require oversampling in noise-free conditions, contrasting with modulo-based ADCs. With noise, our PM-based HDR method demonstrates efficiency with lower reconstruction errors and reduced sampling rates. Our hardware prototype illustrates reconstructing signals ten times greater than the ADC's DR from Nyquist rate samples, potentially replacing high-bit rate HDR-ADCs while meeting existing bit rate needs.
    摘要 在高动态范围(HDR)模数转换器(ADC)中,较多的量化位数可以减小量化误差,但会导致高比特率,限制其应用范围。将模折叠(modulo-folding)与低动态范围(DR)ADC 相结合的策略可以用更少的位数构建高效的 HDR-ADC,但通常需要过采样,从而提高总比特率。另一种使用相位调制(PM)的方法通过用模拟输入调制载波信号的相位来实现 HDR-ADC 功能,从而允许使用位数更少的低 DR ADC。我们推导了可辨识性结果,可以从以奈奎斯特速率采集的 PM 样本中重建原始信号,并可适配各种信号和非均匀采样。通过离散相位解调算法的实际实现,我们基于 PM 的方法在无噪条件下不需要过采样,这与基于模折叠的 ADC 形成对比。在有噪情况下,我们基于 PM 的 HDR 方法表现出高效性,具有更低的重建误差和更低的采样率。我们的硬件原型演示了仅凭奈奎斯特速率样本即可重建幅度达 ADC 动态范围十倍的信号,有望在满足现有比特率需求的同时取代高比特率的 HDR-ADC。
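
A toy end-to-end sketch of the PM idea: a signal exceeding a notional 1 V ADC range is phase-modulated onto a bounded-amplitude carrier and recovered by discrete phase demodulation (analytic signal, unwrap, remove the carrier ramp). The carrier frequency, modulation index, and Hilbert-transform demodulator are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.signal import hilbert

fs, f_sig, f_c, beta = 2000.0, 20.0, 400.0, 6.0   # sample rate, signal, carrier, mod. index
t = np.arange(0, 1.0, 1 / fs)
x = 3.0 * np.sin(2 * np.pi * f_sig * t)           # input far exceeding a notional 1 V range

# Phase modulation keeps the carrier amplitude bounded regardless of x's
# amplitude, so a low-dynamic-range ADC can digitise s directly.
x_norm = x / np.abs(x).max()
s = np.cos(2 * np.pi * f_c * t + beta * x_norm)

# Discrete phase demodulation: instantaneous phase of the analytic signal,
# unwrapped, minus the carrier ramp, rescaled back to volts.
inst_phase = np.unwrap(np.angle(hilbert(s)))
resid = inst_phase - 2 * np.pi * f_c * t
x_hat = (resid - resid.mean()) / beta * np.abs(x).max()

err = np.sqrt(np.mean((x_hat[100:-100] - x[100:-100]) ** 2))  # skip edge effects
print(f"RMS reconstruction error: {err:.3f} (signal RMS {np.std(x):.3f})")
```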

A model-free approach to fingertip slip and disturbance detection for grasp stability inference

  • paper_url: http://arxiv.org/abs/2311.13245
  • repo_url: None
  • paper_authors: Dounia Kitouni, Mahdi Khoramshahi, Veronique Perdereau
  • for: 本研究旨在利用触觉传感提供的丰富信息,提升机器人对物体抓取稳定性的推断能力。
  • methods: 本研究提出了一种方法,通过提取任务相关特征并设计高效的分类器,检测物体相对于各个指尖的滑动。我们比较了支持向量机和逻辑回归两种分类模型(见下方示例)。
  • results: 我们使用安装在 Allegro 机械手上的高灵敏度 Uskin 触觉传感器进行测试和验证,结果表明所提方法可以在线进行滑动检测,有效支持机器人抓取稳定性的推断。
    Abstract Robotic capacities in object manipulation are incomparable to those of humans. Besides years of learning, humans rely heavily on the richness of information from physical interaction with the environment. In particular, tactile sensing is crucial in providing such rich feedback. Despite its potential contributions to robotic manipulation, tactile sensing is less exploited; mainly due to the complexity of the time series provided by tactile sensors. In this work, we propose a method for assessing grasp stability using tactile sensing. More specifically, we propose a methodology to extract task-relevant features and design efficient classifiers to detect object slippage with respect to individual fingertips. We compare two classification models: support vector machine and logistic regression. We use highly sensitive Uskin tactile sensors mounted on an Allegro hand to test and validate our method. Our results demonstrate that the proposed method is effective in slippage detection in an online fashion.
    摘要 机器人在物体操作方面的能力远不及人类。除了多年的学习之外,人类还高度依赖与环境物理交互所获得的丰富信息,其中触觉感知对于提供这种丰富反馈至关重要。尽管触觉感知对机器人操作有潜在贡献,但由于触觉传感器产生的时间序列较为复杂,它的利用程度仍然较低。在本工作中,我们提出了一种利用触觉感知评估抓取稳定性的方法。更具体地说,我们提出了一种提取任务相关特征并设计高效分类器的方法,用于检测物体相对于各个指尖的滑动。我们比较了两种分类模型:支持向量机和逻辑回归。我们使用安装在 Allegro 机械手上的高灵敏度 Uskin 触觉传感器来测试和验证该方法。结果表明,所提方法能够在线有效地检测滑动。
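
A minimal sketch of the classifier comparison on synthetic stand-ins for the task-relevant tactile features (the real features come from the Uskin time series, which is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for tactile features (e.g. per-taxel shear statistics);
# label 1 = slip, 0 = stable grasp.
rng = np.random.default_rng(0)
X_stable = rng.normal(0.0, 1.0, size=(500, 12))
X_slip = rng.normal(1.2, 1.5, size=(500, 12))   # slips shift/spread the features
X = np.vstack([X_stable, X_slip])
y = np.r_[np.zeros(500), np.ones(500)]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(Xtr, ytr)
    print(f"{name}: accuracy = {clf.score(Xte, yte):.2f}")
```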

Optimal Time of Arrival Estimation for MIMO Backscatter Channels

  • paper_url: http://arxiv.org/abs/2311.13196
  • repo_url: None
  • paper_authors: Chen He, Luyang Han, Z. Jane Wang
  • for: 这篇论文提出了一种闭式的多输入多输出(MIMO)反向散射信道到达时间(TOA)估计器,以提高估计精度。
  • methods: 该估计器利用 MIMO 反向散射信道的拓扑结构细化估计精度,可显著提高估计准确性(结构化最小二乘的数值验证见下方示例)。
  • results: 研究表明,对于一般的 $M\times N$ 双基地拓扑,估计误差的均方误差(MSE)为 $\frac{M+N-1}{MN}\sigma^2_0$;对于一般的 $M\times M$ 单基地拓扑,对角子信道的 MSE 为 $\frac{2M-1}{M^2}\sigma^2_0$,非对角子信道的 MSE 为 $\frac{M-1}{M^2}\sigma^2_0$,其中 $\sigma^2_0$ 是传统最小二乘估计器的 MSE。此外,我们还推导了 MIMO 反向散射 TOA 估计的 Cramer-Rao 下界(CRLB),表明该估计器是最优的。仿真结果表明,该 TOA 估计器可以显著提高估计和定位精度,尤其是在大规模 MIMO 系统中。
    Abstract In this paper, we propose a novel time of arrival (TOA) estimator for multiple-input-multiple-output (MIMO) backscatter channels in closed form. The proposed estimator refines the estimation precision from the topological structure of the MIMO backscatter channels, and can considerably enhance the estimation accuracy. Particularly, we show that for the general $M \times N$ bistatic topology, the mean square error (MSE) is $\frac{M+N-1}{MN}\sigma^2_0$, and for the general $M \times M$ monostatic topology, it is $\frac{2M-1}{M^2}\sigma^2_0$ for the diagonal subchannels, and $\frac{M-1}{M^2}\sigma^2_0$ for the off-diagonal subchannels, where $\sigma^2_0$ is the MSE of the conventional least square estimator. In addition, we derive the Cramer-Rao lower bound (CRLB) for MIMO backscatter TOA estimation which indicates that the proposed estimator is optimal. Simulation results verify that the proposed TOA estimator can considerably improve both estimation and positioning accuracy, especially when the MIMO scale is large.
    摘要 在这篇论文中,我们为多输入多输出(MIMO)反向散射信道提出了一种新的闭式到达时间(TOA)估计器。该估计器利用 MIMO 反向散射信道的拓扑结构细化估计精度,可显著提高估计准确性。特别地,我们证明:对于一般的 $M \times N$ 双基地拓扑,均方误差(MSE)为 $\frac{M+N-1}{MN}\sigma^2_0$;对于一般的 $M \times M$ 单基地拓扑,对角子信道的 MSE 为 $\frac{2M-1}{M^2}\sigma^2_0$,非对角子信道的 MSE 为 $\frac{M-1}{M^2}\sigma^2_0$,其中 $\sigma^2_0$ 是传统最小二乘估计器的 MSE。此外,我们推导了 MIMO 反向散射 TOA 估计的 Cramer-Rao 下界(CRLB),表明所提估计器是最优的。仿真结果验证了该 TOA 估计器可以显著提高估计和定位精度,特别是当 MIMO 规模较大时。
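
The bistatic MSE factor can be checked numerically under a simple additive model in which each subchannel TOA decomposes as tau[m, n] = a[m] + b[n]; projecting noisy per-subchannel estimates onto this structure shrinks the per-entry MSE to (M+N-1)/(MN) of the conventional value. The decomposition and noise model are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma, trials = 4, 6, 0.1, 2000

# Design matrix of the bistatic topology: tau[m, n] = a[m] + b[n].
A = np.zeros((M * N, M + N))
for m in range(M):
    for n in range(N):
        A[m * N + n, m] = 1.0
        A[m * N + n, M + n] = 1.0
P = A @ np.linalg.pinv(A)        # projector onto the topology's column space

a, b = rng.uniform(1, 2, M), rng.uniform(1, 2, N)
tau = (a[:, None] + b[None, :]).ravel()

mse_ls, mse_proj = 0.0, 0.0
for _ in range(trials):
    y = tau + sigma * rng.standard_normal(M * N)   # conventional per-subchannel LS
    mse_ls += np.mean((y - tau) ** 2)
    mse_proj += np.mean((P @ y - tau) ** 2)        # structure-aware estimate

print(f"conventional MSE ratio: {mse_ls / trials / sigma**2:.2f}")    # ~1.00
print(f"structured   MSE ratio: {mse_proj / trials / sigma**2:.2f}")  # ~(M+N-1)/(M*N)=0.375
```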

Joint Distributed Precoding and Beamforming for RIS-aided Cell-Free Massive MIMO Systems

  • paper_url: http://arxiv.org/abs/2311.13139
  • repo_url: None
  • paper_authors: Peng Zhang, Jiayi Zhang, Huahua Xiao, Xiaodan Zhang, Derrick Wing Kwan Ng, Bo Ai
  • for: 这篇论文主要针对第六代无线通信系统中无蜂窝(cell-free)网络与可重构智能表面(RIS)的融合。
  • methods: 该论文提出了一种分布式预编码与波束赋形方法,联合优化合并向量、有源预编码和无源 RIS 波束赋形的设计,以最小化用户加权均方误差之和。
  • results: 数值结果验证了所提分布式预编码与波束赋形框架的有效性,并表明其相比集中式方法具有更低的复杂度和更好的可扩展性。
    Abstract The amalgamation of cell-free networks and reconfigurable intelligent surface (RIS) has become a prospective technique for future sixth-generation wireless communication systems. In this paper, we focus on the precoding and beamforming design for a downlink RIS-aided cell-free network. The design is formulated as a non-convex optimization problem by jointly optimizing the combining vector, active precoding, and passive RIS beamforming for minimizing the weighted sum of users' mean square error. A novel joint distributed precoding and beamforming framework is proposed to decentralize the alternating optimization method for acquiring a suboptimal solution to the design problem. Finally, numerical results validate the effectiveness of the proposed distributed precoding and beamforming framework, showing its low-complexity and improved scalability compared with the centralized method.
    摘要 无蜂窝(cell-free)网络与可重构智能表面(RIS)的结合已成为未来第六代无线通信系统的一项有前景的技术。本文关注下行 RIS 辅助无蜂窝网络的预编码与波束赋形设计,将设计问题表述为一个非凸优化问题,联合优化合并向量、有源预编码和无源 RIS 波束赋形,以最小化用户加权均方误差之和。我们提出了一种新的联合分布式预编码与波束赋形框架,将交替优化方法去中心化,从而获得设计问题的次优解。最后,数值结果验证了所提框架的有效性,表明其相比集中式方法具有更低的复杂度和更好的可扩展性。

cs.SD - 2023-11-21

Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates

  • paper_url: http://arxiv.org/abs/2311.13058
  • repo_url: None
  • paper_authors: Marco Pasini, Stefan Lattner, George Fazekas
  • for: This paper is focused on developing a self-supervised music source separation system that does not require audio queries during inference time, making it more suitable for genres with varied timbres and effects.
  • methods: The proposed method uses a query-based approach during training, where the continuous embedding of query audios is substituted with Vector Quantized (VQ) representations (see the lookup sketch below). The model is trained end-to-end with up to N classes, as determined by the VQ's codebook size, and seeks to effectively categorize instrument classes.
  • results: The proposed method is demonstrated to be effective in separating music sources, even for genres with diverse instrumentation and effects. The authors provide examples and additional results online.
    Abstract Music source separation is focused on extracting distinct sonic elements from composite tracks. Historically, many methods have been grounded in supervised learning, necessitating labeled data, which is occasionally constrained in its diversity. More recent methods have delved into N-shot techniques that utilize one or more audio samples to aid in the separation. However, a challenge with some of these methods is the necessity for an audio query during inference, making them less suited for genres with varied timbres and effects. This paper offers a proof-of-concept for a self-supervised music source separation system that eliminates the need for audio queries at inference time. In the training phase, while it adopts a query-based approach, we introduce a modification by substituting the continuous embedding of query audios with Vector Quantized (VQ) representations. Trained end-to-end with up to N classes as determined by the VQ's codebook size, the model seeks to effectively categorise instrument classes. During inference, the input is partitioned into N sources, with some potentially left unutilized based on the mix's instrument makeup. This methodology suggests an alternative avenue for considering source separation across diverse music genres. We provide examples and additional results online.
    摘要 音乐源分离专注于从混合音轨中提取各个独立的声音元素。历史上,许多方法基于监督学习,需要标注数据,而标注数据的多样性有时较为有限。较新的方法则探索了 N-shot 技术,利用一个或多个音频样本来辅助分离。然而,其中一些方法的挑战在于推理时需要音频查询,这使它们不太适合音色与效果多样的音乐流派。本文为一种自监督音乐源分离系统提供了概念验证,该系统在推理时无需音频查询。在训练阶段,虽然采用基于查询的方法,但我们引入了一处修改:用向量量化(VQ)表示替代查询音频的连续嵌入。模型以端到端方式训练,类别数最多为 VQ 码本大小所决定的 N 类,以期有效地对乐器类别进行归类。在推理阶段,输入被划分为 N 个源,其中部分源可能因混音的乐器构成而未被使用。这一方法为跨多种音乐流派的源分离提供了另一条可行途径。我们在线提供了示例和更多结果。
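
The substitution at the heart of the method, replacing a continuous query embedding with its nearest codebook entry, is a standard VQ lookup. A minimal sketch with assumed dimensions (codebook size K=8, embedding dimension 16):

```python
import numpy as np

def vector_quantize(embeddings, codebook):
    """Replace each continuous query embedding with its nearest codebook
    entry; the codebook index doubles as a source-category estimate."""
    # Pairwise squared distances between embeddings (B, D) and codes (K, D).
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 16))   # K = 8 source categories (assumed size)
queries = rng.standard_normal((4, 16))    # continuous query-audio embeddings
quantized, classes = vector_quantize(queries, codebook)
print(classes)                             # estimated source-category indices
```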

Adapting pretrained speech model for Mandarin lyrics transcription and alignment

  • paper_url: http://arxiv.org/abs/2311.12488
  • repo_url: https://github.com/navi0105/lyricalignment
  • paper_authors: Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang
  • for: 本研究主要针对歌词转录与歌词对齐问题;该方向近年来性能显著提升,但之前的大多数研究只关注拥有大规模数据的英语。
  • methods: 我们采用预训练的 Whisper 模型,在单声部普通话演唱数据集上进行微调,并使用数据增强和源分离模型来缓解数据稀缺问题。
  • results: 结果显示,我们的方法在普通话多声部数据集上实现了低于 18% 的字错误率(character error rate,计算方式见下方示例),歌词对齐的平均绝对误差为 0.071 秒。这些结果表明,采用预训练语音模型可以在低资源场景中实现有效的歌词转录与对齐。
    摘要 自动歌词转录与歌词对齐任务在过去几年取得了显著的性能提升,但之前的大多数工作只关注拥有大规模数据集的英语。本文研究低资源条件下多声部普通话流行音乐的歌词转录与对齐。为了应对数据稀缺问题,我们采用预训练的 Whisper 模型,并在单声部普通话演唱数据集上对其进行微调。借助数据增强和源分离模型,结果显示所提方法在普通话多声部数据集上实现了低于 18% 的字错误率,歌词对齐的平均绝对误差为 0.071 秒。我们的结果展示了在低资源场景中适配预训练语音模型进行歌词转录与对齐的潜力。
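
For reference, the character error rate reported above is Levenshtein edit distance normalised by reference length; a self-contained sketch (the example strings are made up):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    r, h = list(reference), list(hypothesis)
    # One-row dynamic-programming table of edit distances.
    dp = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(h, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (rc != hc))    # substitution / match
            prev = cur
    return dp[-1] / max(len(r), 1)

print(cer("月亮代表我的心", "月亮代表我地心"))  # 1 substitution over 7 chars ~ 0.143
```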

HPCNeuroNet: Advancing Neuromorphic Audio Signal Processing with Transformer-Enhanced Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2311.12449
  • repo_url: None
  • paper_authors: Murat Isik, Hiruna Vishwamith, Kayode Inadagbo, I. Can Dikmen
  • for: 这项研究旨在开发一种神经形态音频信号处理架构,整合脉冲神经网络(SNN)、Transformer 与高性能计算(HPC)的优点,以提升对多种语言和噪声背景下人类语音录音的处理能力。
  • methods: 本研究使用 Intel N-DNS 数据集,利用短时傅里叶变换(STFT)得到时频表示(见下方示例),使用 Transformer 嵌入生成稠密向量,并通过 SNN 编码/解码机制完成脉冲序列转换。
  • results: 结果显示,所提加速器在 100 MHz 下可达到 71.11 Giga-Operations Per Second(GOP/s)的吞吐量,片上功耗仅 3.55 W。此外,通过设计空间探索,我们为不同音频任务下核心容量的优化提供了指导。
    Abstract This paper presents a novel approach to neuromorphic audio processing by integrating the strengths of Spiking Neural Networks (SNNs), Transformers, and high-performance computing (HPC) into the HPCNeuroNet architecture. Utilizing the Intel N-DNS dataset, we demonstrate the system's capability to process diverse human vocal recordings across multiple languages and noise backgrounds. The core of our approach lies in the fusion of the temporal dynamics of SNNs with the attention mechanisms of Transformers, enabling the model to capture intricate audio patterns and relationships. Our architecture, HPCNeuroNet, employs the Short-Time Fourier Transform (STFT) for time-frequency representation, Transformer embeddings for dense vector generation, and SNN encoding/decoding mechanisms for spike train conversions. The system's performance is further enhanced by leveraging the computational capabilities of NVIDIA's GeForce RTX 3060 GPU and Intel's Core i9 12900H CPU. Additionally, we introduce a hardware implementation on the Xilinx VU37P HBM FPGA platform, optimizing for energy efficiency and real-time processing. The proposed accelerator achieves a throughput of 71.11 Giga-Operations Per Second (GOP/s) with a 3.55 W on-chip power consumption at 100 MHz. The comparison results with off-the-shelf devices and recent state-of-the-art implementations illustrate that the proposed accelerator has obvious advantages in terms of energy efficiency and design flexibility. Through design-space exploration, we provide insights into optimizing core capacities for audio tasks. Our findings underscore the transformative potential of integrating SNNs, Transformers, and HPC for neuromorphic audio processing, setting a new benchmark for future research and applications.
    摘要 本文提出了一种神经形态音频处理的新方法,在 HPCNeuroNet 架构中整合了脉冲神经网络(SNN)、Transformer 与高性能计算(HPC)的优势。基于 Intel N-DNS 数据集,我们展示了该系统处理多语言、多噪声背景下人类语音录音的能力。方法的核心在于将 SNN 的时间动态与 Transformer 的注意力机制相融合,使模型能够捕捉复杂的音频模式与关系。HPCNeuroNet 采用短时傅里叶变换(STFT)进行时频表示,使用 Transformer 嵌入生成稠密向量,并通过 SNN 编码/解码机制完成脉冲序列转换;系统性能进一步借助 NVIDIA GeForce RTX 3060 GPU 与 Intel Core i9 12900H CPU 的算力得到提升。此外,我们在 Xilinx VU37P HBM FPGA 平台上实现了硬件部署,针对能效与实时处理进行优化:所提加速器在 100 MHz 下实现 71.11 GOP/s 的吞吐量,片上功耗为 3.55 W。与现成器件及近期最先进实现的比较表明,该加速器在能效与设计灵活性方面具有明显优势。通过设计空间探索,我们为音频任务的核心容量优化提供了见解。这些发现凸显了融合 SNN、Transformer 与 HPC 进行神经形态音频处理的变革潜力,为后续研究与应用树立了新基准。
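
The STFT front end is standard; a minimal sketch with assumed window parameters and a random placeholder waveform:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
audio = rng.standard_normal(fs)      # 1 s placeholder for a vocal recording

# Time-frequency representation fed to the downstream embedding stage.
f, t, Z = stft(audio, fs=fs, nperseg=512, noverlap=384)
log_mag = np.log1p(np.abs(Z))        # magnitude, compressed for stability
print(log_mag.shape)                  # (257 frequency bins, ~126 frames)
```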

Equipping Pretrained Unconditional Music Transformers with Instrument and Genre Controls

  • paper_url: http://arxiv.org/abs/2311.12257
  • repo_url: None
  • paper_authors: Weihan Xu, Julian McAuley, Shlomo Dubnov, Hao-Wen Dong
  • for: 这项研究旨在探讨"预训练-微调"范式在符号音乐生成中的效果。
  • methods: 作者使用 150 万首歌曲的数据集,首先预训练一个大型无条件 Transformer 模型,然后提出一种简单的技术,通过额外的控制 token 对该预训练模型进行微调,从而赋予其乐器与曲风控制能力,实现更高层次的可控性与表达力。
  • results: 实验结果显示,所提模型可以成功地生成符合用户指定乐器与曲风的符号音乐,并在主观听测中于连贯性、和声、编排和整体质量方面优于预训练基线模型。
    Abstract The ''pretraining-and-finetuning'' paradigm has become a norm for training domain-specific models in natural language processing and computer vision. In this work, we aim to examine this paradigm for symbolic music generation through leveraging the largest ever symbolic music dataset sourced from the MuseScore forum. We first pretrain a large unconditional transformer model using 1.5 million songs. We then propose a simple technique to equip this pretrained unconditional music transformer model with instrument and genre controls by finetuning the model with additional control tokens. Our proposed representation offers improved high-level controllability and expressiveness against two existing representations. The experimental results show that the proposed model can successfully generate music with user-specified instruments and genre. In a subjective listening test, the proposed model outperforms the pretrained baseline model in terms of coherence, harmony, arrangement and overall quality.
    摘要 "预训练-微调"范式已成为自然语言处理和计算机视觉领域训练领域特定模型的标准方法。在这项工作中,我们借助迄今规模最大的、来自 MuseScore 论坛的符号音乐数据集,检验该范式是否适用于符号音乐生成。我们首先使用 150 万首歌曲预训练一个大型无条件 Transformer 模型,然后提出一种简单的技术:通过额外的控制 token 对该预训练模型进行微调,从而为其加入乐器与曲风控制。与现有的两种表示相比,我们提出的表示具有更好的高层可控性与表达力。实验结果表明,所提模型可以成功地生成符合用户指定乐器与曲风的音乐;在主观听测中,该模型在连贯性、和声、编排和整体质量方面优于预训练基线模型。

eess.AS - 2023-11-21

FedCPC: An Effective Federated Contrastive Learning Method for Privacy Preserving Early-Stage Alzheimer’s Speech Detection

  • paper_url: http://arxiv.org/abs/2311.13043
  • repo_url: None
  • paper_authors: Wenqing Wei, Zhengdong Yang, Yuan Gao, Jiyi Li, Chenhui Chu, Shogo Okada, Sheng Li
  • for: 早期阿尔茨海默病(AD)检测被认为是医学研究领域中的一个重要问题。
  • methods: 我们提议使用联邦对比预训练(FedCPC),在联邦训练之前先进行联邦式对比预训练,以便从原始数据中学习更好的表示,并在保护用户隐私的同时使不同客户端能够在预训练和训练阶段共享数据(联邦平均聚合步骤的示意见下方)。
  • results: 实验结果表明,我们提议的方法可以达到满意的性能,同时保护用户隐私。
    Abstract The early-stage Alzheimer's disease (AD) detection has been considered an important field of medical studies. Like traditional machine learning methods, speech-based automatic detection also suffers from data privacy risks because the data of specific patients are exclusive to each medical institution. A common practice is to use federated learning to protect the patients' data privacy. However, its distributed learning process also causes performance reduction. To alleviate this problem while protecting user privacy, we propose a federated contrastive pre-training (FedCPC) performed before federated training for AD speech detection, which can learn a better representation from raw data and enables different clients to share data in the pre-training and training stages. Experimental results demonstrate that the proposed methods can achieve satisfactory performance while preserving data privacy.
    摘要 早期阿尔茨海默病(AD)检测被认为是医学研究的一个重要领域。与传统机器学习方法一样,基于语音的自动检测也面临数据隐私风险,因为特定患者的数据为各医疗机构所独有。常见的做法是使用联邦学习来保护患者的数据隐私,然而其分布式学习过程也会导致性能下降。为在保护用户隐私的同时缓解这一问题,我们提出在联邦训练之前进行联邦对比预训练(FedCPC)用于 AD 语音检测,它可以从原始数据中学习更好的表示,并使不同客户端能够在预训练和训练阶段共享数据。实验结果表明,所提方法可以在保护数据隐私的同时取得令人满意的性能。
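
FedCPC's contrastive objective is not reproduced here, but the aggregation step that keeps raw recordings inside each institution is easy to sketch: clients train locally and only parameters are averaged, weighted by local dataset size (a FedAvg-style sketch; the model shapes and sizes are made up):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: aggregate client model parameters weighted by
    local dataset size, so raw speech data never leaves each institution."""
    total = sum(client_sizes)
    return [
        sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in range(len(client_weights[0]))
    ]

# Three hospitals with differently sized local datasets (toy 2-layer model).
rng = np.random.default_rng(0)
clients = [[rng.standard_normal((4, 4)), rng.standard_normal(4)] for _ in range(3)]
global_model = fed_avg(clients, client_sizes=[120, 80, 200])
print([p.shape for p in global_model])   # [(4, 4), (4,)]
```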

Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation

  • paper_url: http://arxiv.org/abs/2311.12706
  • repo_url: None
  • paper_authors: Yicheng Hsu, Mingsian R. Bai
  • for: 提供一种可在信号增强与环境氛围保留两端之间伸缩的音频临场(Audio Telepresence,AT)系统,为近端用户营造远端声学场景的沉浸式体验。
  • methods: 以 DeepFilterNet 为骨干网络,将阵列麦克风信号转换为经头相关传递函数(HRTF)滤波的双耳信号,并可调节信号增强与环境保留之间的权重。
  • results: 提出一种与阵列配置无关的空间相干表示(Spatial COherence REpresentation,SCORE)特征用于模型训练,使网络对不同的阵列几何和传感器数量保持鲁棒(麦克风对之间相干性的计算见下方示例);以幅度加权双耳相位差误差(mw-IPDe)、幅度加权双耳声级差误差(mw-ILDe)和改进的尺度不变信号失真比(mSI-SDR)作为客观评价指标,并进行了主观听测。结果表明,所提 BAT 系统即使在训练阶段未见过的阵列配置下,也能在信号增强与环境保留之间取得理想平衡,实现出色的临场感性能。
    Abstract Audio Telepresence (AT) aims to create an immersive experience of the audio scene at the far end for the user(s) at the near end. The application of AT could encompass scenarios with varying degrees of emphasis on signal enhancement and ambience preservation. It is desirable for an AT system to be scalable between these two extremes. To this end, we propose an array-based Binaural AT (BAT) system using the DeepFilterNet as the backbone to convert the array microphone signals into the Head-Related Transfer Function (HRTF)-filtered signals, with a tunable weighting between signal enhancement and ambience preservation. An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed for the model training so that the network remains robust to different array geometries and sensor counts. magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR) are defined as performance metrics for objective evaluation. Subjective listening tests were also performed to validate the proposed BAT system. The results have shown that the proposed BAT system can achieve superior telepresence performance with the desired balance between signal enhancement and ambience preservation, even when the array configurations are unseen in the training phase.
    摘要 音频临场(AT)旨在为近端用户营造远端声学场景的沉浸式体验。AT 的应用场景对信号增强和环境氛围保留的侧重程度各不相同,因此理想的 AT 系统应能在这两个极端之间伸缩。为此,我们提出一种基于阵列的双耳 AT(BAT)系统,以 DeepFilterNet 为骨干,将阵列麦克风信号转换为经头相关传递函数(HRTF)滤波的信号,并可调节信号增强与环境保留之间的权重。我们提出了一种与阵列配置无关的空间相干表示(SCORE)特征用于模型训练,使网络对不同的阵列几何与传感器数量保持鲁棒。我们定义了 mw-IPDe、mw-ILDe 和 mSI-SDR 作为客观评价指标,并进行了主观听测。结果表明,即使阵列配置在训练阶段未曾出现,所提 BAT 系统也能在信号增强与环境保留之间取得理想平衡,实现出色的临场感性能。
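
Spatial coherence between microphone pairs, the quantity the SCORE feature builds on, can be estimated with a standard Welch-style routine. The exact feature construction in the paper is not reproduced; the signal, delay, and noise level below are synthetic:

```python
import numpy as np
from scipy.signal import coherence

fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)                      # 1 s of a common source signal
mic1 = src + 0.1 * rng.standard_normal(fs)         # two mics: shared source,
mic2 = np.roll(src, 8) + 0.1 * rng.standard_normal(fs)  # small delay + independent noise

f, Cxy = coherence(mic1, mic2, fs=fs, nperseg=512)
print(f"mean magnitude-squared coherence: {Cxy.mean():.2f}")  # high: same source
```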

A Distributed Algorithm for Personal Sound Zones Systems

  • paper_url: http://arxiv.org/abs/2311.12427
  • repo_url: None
  • paper_authors: Sipei Zhao, Guoqiang Zhang, Eva Cheng, Ian S. Burnett
  • for: 提供了一个可以在共享空间中实现多个独立的听音区域,并且不需要使用头戴式耳机的Personal Sound Zones(PSZ)系统。
  • methods: 使用了分布式算法,以对多个扩音器进行分布式处理,以降低处理器的负载,并且降低了总的计算复杂度,但是对性能有一定的损害。
  • results: 通过使用在半消声室中实测的真实房间脉冲响应进行仿真,验证了所提分布式 PSZ 系统的有效性:它在将计算负担分摊到多个节点的同时降低了总体计算复杂度,代价仅是性能的轻微下降。
    Abstract A Personal Sound Zones (PSZ) system aims to generate two or more independent listening zones that allow multiple users to listen to different music/audio content in a shared space without the need for wearing headphones. Most existing studies assume that the acoustic paths between loudspeakers and microphones are measured beforehand in a stationary environment. Recently, adaptive PSZ systems have been explored to adapt the system in a time-varying acoustic environment. However, because a PSZ system usually requires multiple loudspeakers, the multichannel adaptive algorithms impose a high computational load on the processor. To overcome that problem, this paper proposes an efficient distributed algorithm for PSZ systems, which not only spreads the computational burden over multiple nodes but also reduces the overall computational complexity, at the expense of a slight decrease in performance. Simulation results with true room impulse responses measured in a Hemi-Anechoic chamber are performed to verify the proposed distributed PSZ system.
    摘要 个人声区(PSZ)系统旨在生成两个或多个独立的聆听区域,使多名用户无需佩戴耳机即可在共享空间中收听不同的音乐/音频内容。现有研究大多假设扬声器与麦克风之间的声学路径已在静态环境中预先测得;近来,自适应 PSZ 系统开始被用于应对时变声学环境。然而,由于 PSZ 系统通常需要多个扬声器,多通道自适应算法会给处理器带来很高的计算负担。为克服该问题,本文为 PSZ 系统提出一种高效的分布式算法,它不仅将计算负担分摊到多个节点,还降低了总体计算复杂度,代价仅是性能的轻微下降。基于在半消声室中实测的真实房间脉冲响应的仿真结果验证了所提分布式 PSZ 系统。

AudioLog: LLMs-Powered Long Audio Logging with Acoustic Scenes and Events Joint Estimation

  • paper_url: http://arxiv.org/abs/2311.12371
  • repo_url: https://github.com/jishengbai/audiolog
  • paper_authors: Jisheng Bai, Han Yin, Mou Wang, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen
  • for: 这篇论文旨在开发一个由大语言模型(LLM)驱动的长音频日志系统,对声学场景与声音事件进行多任务联合估计。
  • methods: 通过微调基于预训练分层 token-语义音频 Transformer 的大型音频模型构建联合训练网络,再利用 LLM 生成总结声学环境文本描述的音频日志。
  • results: 实验结果显示,所提系统在声学场景分类和声音事件检测方面表现出色,超越了现有方法;进一步分析表明 AudioLog 能有效地总结长音频序列。
    Abstract Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with multi-task learning of acoustic tasks. Specifically, we propose a joint training network, achieved by fine-tuning a large audio model based on the pre-trained hierarchical token-semantic audio Transformer. We then leverage LLMs to craft audio logs that summarize textual descriptions of the acoustic environment. Experiments show that the proposed system attains exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analyses demonstrate AudioLog's power in effectively summarizing long audio sequences.
    摘要 以往的自动音频描述研究难以准确刻画长音频序列中声学场景与事件的完整时间细节。本文提出了 AudioLog,一个由大语言模型(LLM)驱动、对声学任务进行多任务学习的音频日志系统。具体而言,我们通过微调基于预训练分层 token-语义音频 Transformer 的大型音频模型,构建联合训练网络;随后利用 LLM 生成总结声学环境文本描述的音频日志。实验表明,所提系统在声学场景分类和声音事件检测上取得出色性能,超越了该领域现有方法;进一步分析展示了 AudioLog 在总结长音频序列方面的能力。

Rethinking the Output Architecture for Sound Source Localization

  • paper_url: http://arxiv.org/abs/2311.12305
  • repo_url: None
  • paper_authors: Linfeng Feng, Xiao-Lei Zhang, Xuelong Li
  • for: 本研究的目的是提出一种基于软标签分布的声源定位(SSL)方法,以提高 SSL 的性能和鲁棒性。
  • methods: 本研究提出了无偏标签分布(ULD)软标签,引入两种新的损失函数(NLAE 与无激活的 MSE(wo)),并设计了一种新的解码方法,即加权相邻解码(Weighted Adjacent Decoding,WAD),以优化 SSL 模型(软标签构造与 WAD 解码的示意见下方)。
  • results: 实验结果显示,所提方法达到了最先进的性能,且 WAD 解码方法能够突破现有解码方法的量化误差极限。
    Abstract Sound source localization (SSL) involves estimating the direction of arrival (DOA) of a sound signal. The output space of the DOA estimation is continuous, suggesting that regression may be the most appropriate formulation for DOA. However, in practice, converting the DOA estimation into a classification problem often results in better performance than the regression formulation, since that classification problems are generally easier to model, and are more robust in handling noise and uncertainty than regression problems. In the classification formulation of DOA, the output space is discretized into several intervals, each of which is treated as a class. These classes exhibit strong inter-class correlation, with their mutual-similarity increasing when they approach each other and being ordered. However, this property is not sufficiently explored. To exploit these property, we propose a soft label distribution, named Unbiased Label Distribution (ULD), for eliminating the quantization error of the training target and further taking the inter-class similarity into strong consideration. We further introduce two loss functions, named the Negative Log Absolute Error (NLAE) loss function and {Mean Squared Error loss function without activation (MSE(wo))}, for the soft label family. Finally, we design a new decoding method to map the predicted distribution to sound source locations, called Weighted Adjacent Decoding (WAD). It uses the weighted sum of the probabilities of the peak classes and their adjacent classes in the predicted distribution for decoding. Experimental results show that the proposed method achieves the state-of-the-art performance, and the WAD decoding method is able to even breakthrough the quantization error limits of existing decoding methods.
    摘要 声源定位(SSL)涉及估计声信号的到达方向(DOA)。DOA 估计的输出空间是连续的,这表明回归或许是 DOA 最合适的建模形式。然而在实践中,将 DOA 估计转化为分类问题往往比回归形式取得更好的性能,因为分类问题通常更容易建模,并且在处理噪声与不确定性时比回归问题更加稳健。在 DOA 的分类形式下,输出空间被离散化为若干区间,每个区间被视为一个类别。这些类别之间存在强相关性,彼此越接近相似度越高,并具有有序性;然而这一性质尚未被充分利用。为利用该性质,我们提出了一种软标签分布,即无偏标签分布(ULD),用于消除训练目标的量化误差,并充分考虑类间相似性。我们进一步为该软标签家族引入两种损失函数:负对数绝对误差(NLAE)损失函数和无激活的均方误差损失函数(MSE(wo))。最后,我们设计了一种将预测分布映射到声源位置的新解码方法,即加权相邻解码(WAD),它使用预测分布中峰值类别及其相邻类别概率的加权和进行解码。实验结果表明,所提方法达到了最先进的性能,且 WAD 解码方法甚至能够突破现有解码方法的量化误差极限。
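
A compact sketch of the two ideas: a soft target whose mass follows the circular distance from each class centre to the true continuous DOA, and WAD decoding that averages the peak class with its neighbours to recover sub-grid angles. The 10-degree grid and Gaussian label width are illustrative assumptions.

```python
import numpy as np

ANGLES = np.arange(0, 360, 10.0)    # class centres: 36 DOA intervals of 10 degrees

def soft_label(doa_deg, sigma=10.0):
    """Soft target over DOA classes: probability mass decays with circular
    distance to the true continuous angle, so the target itself carries no
    quantisation error."""
    d = np.abs((ANGLES - doa_deg + 180) % 360 - 180)   # circular distance
    p = np.exp(-d**2 / (2 * sigma**2))
    return p / p.sum()

def wad_decode(probs, width=1):
    """Weighted Adjacent Decoding: probability-weighted average of the peak
    class and its neighbours, decoding below the grid resolution."""
    k = int(probs.argmax())
    idx = [(k + o) % len(ANGLES) for o in range(-width, width + 1)]
    w = probs[idx]
    ang = ANGLES[k] + 10.0 * np.arange(-width, width + 1)  # unwrapped neighbours
    return float(w @ ang / w.sum()) % 360

target = soft_label(123.0)
print(f"argmax decoding: {ANGLES[target.argmax()]:.1f} deg")  # 120.0, quantised
print(f"WAD decoding:    {wad_decode(target):.1f} deg")       # ~121.6, sub-grid
```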

cs.CV - 2023-11-21

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

  • paper_url: http://arxiv.org/abs/2311.13612
  • repo_url: https://github.com/chris210634/word_soups
  • paper_authors: Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis
  • for: 本研究的目的是提出两种更加灵活的方法,即描述符汤(descriptor soup)和单词汤(word soup),以提高分布外(OOD)目标上的零样本准确率。
  • methods: 这两种方法在测试时无需 LLM,并可利用训练数据提升零样本方法的 OOD 目标准确率。描述符汤以贪心方式选出一小组文本描述符,用于计算稳健的类嵌入;单词汤则以类似方式贪心拼接一串单词。
  • results: 与现有的少样本软提示方法相比,单词汤按构造需要更少的参数和 GPU 显存;它在跨数据集与领域泛化基准上表现出色,并且与 SoTA 的描述符和提示集成方法相比,能以更少的集成成员达到更高的 OOD 准确率。
    Abstract Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are un-trainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please checkout our code: github.com/Chris210634/word_soups
    摘要 在过去一年中,围绕使用 GPT 描述符进行零样本评估涌现出大量多模态研究:这些工作利用 GPT 为各类别生成的文本集成来提升预训练视觉-语言模型的零样本准确率。近期的 WaffleCLIP 研究表明,使用随机描述符集成也能达到类似的零样本准确率。然而,这两类零样本方法都不可训练,因此在存在少量分布外(OOD)训练数据时并非最优。受这些工作启发,我们提出了两种更灵活的方法:描述符汤(descriptor soup)与单词汤(word soup),它们在测试时无需 LLM,并能利用训练数据提高 OOD 目标准确率。描述符汤利用通用的少样本训练数据贪心选出一小组文本描述符,再用所选描述符计算稳健的类嵌入;单词汤以类似方式贪心拼接一串单词。与现有的少样本软提示调优方法相比,单词汤按构造需要更少的参数,且由于无需反向传播而占用更少的 GPU 显存。两种"汤"即使与 SoTA 零样本方法结合,也在跨数据集与领域泛化基准上超越了当前已发表的少样本方法。与 SoTA 的提示与描述符集成方法(如 ProDA 和 WaffleCLIP)相比,单词汤以更少的集成成员实现了更高的 OOD 准确率。代码见:github.com/Chris210634/word_soups

Novel OCT mosaicking pipeline with Feature- and Pixel-based registration

  • paper_url: http://arxiv.org/abs/2311.13052
  • repo_url: None
  • paper_authors: Jiacheng Wang, Hao Li, Dewei Hu, Yuankai K. Tao, Ipek Oguz
  • for: This paper aims to improve the field of view (FoV) of high-resolution Optical Coherence Tomography (OCT) images for ophthalmology studies by stitching multiple overlapping images together.
  • methods: The proposed method combines learning-based feature matching and robust pixel-based registration to align multiple images effectively (a classical stand-in is sketched below), and also utilizes a trained foundational model (Segment Anything Model, SAM) to validate mosaicking results in an unsupervised manner.
  • results: The authors evaluate the efficacy of their pipeline using two datasets and show superior performance in terms of both accuracy and computational efficiency compared to current mosaicking pipelines.
    Abstract High-resolution Optical Coherence Tomography (OCT) images are crucial for ophthalmology studies but are limited by their relatively narrow field of view (FoV). Image mosaicking is a technique for aligning multiple overlapping images to obtain a larger FoV. Current mosaicking pipelines often struggle with substantial noise and considerable displacement between the input sub-fields. In this paper, we propose a versatile pipeline for stitching multi-view OCT/OCTA \textit{en face} projection images. Our method combines the strengths of learning-based feature matching and robust pixel-based registration to align multiple images effectively. Furthermore, we advance the application of a trained foundational model, Segment Anything Model (SAM), to validate mosaicking results in an unsupervised manner. The efficacy of our pipeline is validated using an in-house dataset and a large public dataset, where our method shows superior performance in terms of both accuracy and computational efficiency. We also made our evaluation tool for image mosaicking and the corresponding pipeline publicly available at \url{https://github.com/MedICL-VU/OCT-mosaicking}.
    摘要 高分辨率光学相干断层扫描(OCT)图像对眼科研究至关重要,但其视场(FoV)相对较窄。图像拼接是一种将多幅相互重叠的图像对齐以获得更大视场的技术,而现有拼接流程常常难以应对输入子视场之间的严重噪声与较大位移。本文提出一种用于拼接多视角 OCT/OCTA 正面投影(en face)图像的通用流程。该方法结合了基于学习的特征匹配与稳健的像素级配准的优势,以有效对齐多幅图像;此外,我们进一步利用训练好的基础模型 Segment Anything Model(SAM)以无监督方式验证拼接结果。我们在一个内部数据集和一个大型公共数据集上验证了该流程的有效性,结果显示其在精度和计算效率两方面均表现更优。我们还将图像拼接评估工具与相应流程公开于 https://github.com/MedICL-VU/OCT-mosaicking。
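
As background, a purely classical pairwise alignment sketch in OpenCV (ORB features plus RANSAC homography), standing in for the paper's learning-based matcher and pixel-based refinement; the synthetic shifted image only makes the script self-contained:

```python
import cv2
import numpy as np

def stitch_pair(img_a, img_b):
    """Align two overlapping en-face projections: ORB feature matching
    followed by a RANSAC homography (classical stand-in pipeline)."""
    orb = cv2.ORB_create(2000)
    ka, da = orb.detectAndCompute(img_a, None)
    kb, db = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(da, db), key=lambda m: m.distance)[:200]
    src = np.float32([ka[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kb[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, (int(mask.sum()) if mask is not None else 0)

# Synthetic test: img_b is img_a translated by (30, 12) pixels.
rng = np.random.default_rng(0)
img_a = cv2.GaussianBlur((rng.random((256, 256)) * 255).astype(np.uint8), (7, 7), 0)
img_b = cv2.warpAffine(img_a, np.float32([[1, 0, 30], [0, 1, 12]]), (256, 256))
H, n_inliers = stitch_pair(img_a, img_b)
print(np.round(H, 2), n_inliers)  # translation ~(30, 12) in the last column
```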

Camera-Independent Single Image Depth Estimation from Defocus Blur

  • paper_url: http://arxiv.org/abs/2311.13045
  • repo_url: https://github.com/sleekeagle/defocus_camind
  • paper_authors: Lahiru Wijayasingha, Homa Alemzadeh, John A. Stankovic
  • for: 改进单目散焦深度估计方法,使其不依赖特定相机,从而提升机器视觉中若干下游任务的性能。
  • methods: 基于光学物理方程分析相机参数对散焦模糊的影响(弥散圆的薄透镜计算见下方示例),并提出一种简单的修正方法,使模型无需重新训练即可适应不同相机。
  • results: 在使用不同相机采集的合成与真实数据集上进行了评估,结果显示该方法对相机变化明显更为鲁棒。
    Abstract Monocular depth estimation is an important step in many downstream tasks in machine vision. We address the topic of estimating monocular depth from defocus blur which can yield more accurate results than the semantic based depth estimation methods. The existing monocular depth from defocus techniques are sensitive to the particular camera that the images are taken from. We show how several camera-related parameters affect the defocus blur using optical physics equations and how they make the defocus blur depend on these parameters. The simple correction procedure we propose can alleviate this problem which does not require any retraining of the original model. We created a synthetic dataset which can be used to test the camera independent performance of depth from defocus blur models. We evaluate our model on both synthetic and real datasets (DDFF12 and NYU depth V2) obtained with different cameras and show that our methods are significantly more robust to the changes of cameras. Code: https://github.com/sleekEagle/defocus_camind.git
    摘要 单目深度估计是机器视觉中许多下游任务的重要一步。我们研究从散焦模糊中估计单目深度的问题,它可以比基于语义的深度估计方法获得更准确的结果。现有的单目散焦深度估计技术对拍摄图像所用的具体相机十分敏感。我们利用光学物理方程说明若干相机相关参数如何影响散焦模糊,并据此提出一个简单的修正流程,可在无需重新训练原模型的情况下缓解该问题。我们构建了一个可用于测试散焦深度模型相机无关性能的合成数据集,并在使用不同相机获得的合成与真实数据集(DDFF12 和 NYU depth V2)上评估了我们的模型,结果表明我们的方法对相机变化明显更为鲁棒。代码:https://github.com/sleekEagle/defocus_camind.git
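
The camera dependence the correction accounts for is visible directly in the thin-lens circle-of-confusion formula; a sketch with two made-up camera configurations viewing the same geometry:

```python
def coc_mm(obj_dist_mm, focus_dist_mm, focal_mm, f_number):
    """Circle-of-confusion diameter (mm) from the thin-lens model; this is
    the camera-parameter dependence a correction must account for."""
    aperture = focal_mm / f_number                    # aperture diameter A = f/N
    return (aperture * abs(obj_dist_mm - focus_dist_mm) / obj_dist_mm
            * focal_mm / (focus_dist_mm - focal_mm))

# Same scene geometry, two hypothetical cameras: the blur differs, so a model
# trained on one camera misreads depth on the other unless corrected.
for f, N in [(26.0, 1.8), (50.0, 2.8)]:
    c = coc_mm(obj_dist_mm=3000.0, focus_dist_mm=1500.0, focal_mm=f, f_number=N)
    print(f"f={f} mm, f/{N}: CoC = {c:.4f} mm")
```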

Unsupervised Multimodal Surface Registration with Geometric Deep Learning

  • paper_url: http://arxiv.org/abs/2311.13022
  • repo_url: https://github.com/mohamedasuliman/geomorph
  • paper_authors: Mohamed A. Suliman, Logan Z. J. Williams, Abdulah Fawaz, Emma C. Robinson
  • for: 这篇论文旨在提出一种基于几何深度学习框架的大脑皮层表面图像配准方法,以支持脑研究。
  • methods: 该方法包括两个主要步骤:首先,使用图卷积对每个输入表面独立提取特征,生成刻画皮层表面重要特征的低维表示;然后,以深度离散方式进行特征配准,通过学习一组控制点的位移来优化各表面间共同结构的重叠,并通过深度条件随机场进行正则化,以确保形变平滑且在生物学上合理。
  • results: 实验结果显示,GeoMorph 方法可以在 cortical surface 图像 registration 中超过现有的深度学习方法,获得更好的对齐和平滑的变换。此外,GeoMorph 方法也可以与传统方法相比,展示出类似的性能。这种多样性和稳定性表明 GeoMorph 方法在 neuroscience 应用中具有强大的潜力。
    Abstract This paper introduces GeoMorph, a novel geometric deep-learning framework designed for image registration of cortical surfaces. The registration process consists of two main steps. First, independent feature extraction is performed on each input surface using graph convolutions, generating low-dimensional feature representations that capture important cortical surface characteristics. Subsequently, features are registered in a deep-discrete manner to optimize the overlap of common structures across surfaces by learning displacements of a set of control points. To ensure smooth and biologically plausible deformations, we implement regularization through a deep conditional random field implemented with a recurrent neural network. Experimental results demonstrate that GeoMorph surpasses existing deep-learning methods by achieving improved alignment with smoother deformations. Furthermore, GeoMorph exhibits competitive performance compared to classical frameworks. Such versatility and robustness suggest strong potential for various neuroscience applications.
    摘要 本文提出了 GeoMorph,这是一种面向大脑皮层表面图像配准的新型几何深度学习框架。配准过程包含两个主要步骤:首先,利用图卷积对每个输入表面独立提取特征,生成能够刻画皮层表面重要特征的低维表示;随后,以深度离散的方式对特征进行配准,通过学习一组控制点的位移来优化各表面间共同结构的重叠。为确保形变平滑且在生物学上合理,我们通过以循环神经网络实现的深度条件随机场进行正则化。实验结果表明,GeoMorph 以更平滑的形变实现了更好的对齐,超越了现有深度学习方法,并取得了与经典框架相当的性能。这种多样性与稳健性表明其在各类神经科学应用中具有巨大潜力。

Image-Based Soil Organic Carbon Remote Sensing from Satellite Images with Fourier Neural Operator and Structural Similarity

  • paper_url: http://arxiv.org/abs/2311.13016
  • repo_url: None
  • paper_authors: Ken C. L. Wong, Levente Klein, Ademir Ferreira da Silva, Hongzhi Wang, Jitendra Singh, Tanveer Syeda-Mahmood
  • for: 这篇论文的目的是研究卷积神经网络在土壤有机碳(SOC)遥感中的应用,提出基于傅里叶神经算子(FNO)的 FNO-DenseNet,以便在区域或全球尺度上估计 SOC 浓度。
  • methods: FNO-DenseNet 结合了 FNO(其核心谱卷积见下方示例)与 DenseNet 的优点;在我们的实验中,它以比 FNO 少数百倍的参数量取得了更优的性能。
  • results: 在平均绝对百分比误差上,FNO-DenseNet 比基于像素的随机森林方法好 18%。
    Abstract Soil organic carbon (SOC) sequestration is the transfer and storage of atmospheric carbon dioxide in soils, which plays an important role in climate change mitigation. SOC concentration can be improved by proper land use, thus it is beneficial if SOC can be estimated at a regional or global scale. As multispectral satellite data can provide SOC-related information such as vegetation and soil properties at a global scale, estimation of SOC through satellite data has been explored as an alternative to manual soil sampling. Although existing studies show promising results, they are mainly based on pixel-based approaches with traditional machine learning methods, and convolutional neural networks (CNNs) are uncommon. To study the use of CNNs on SOC remote sensing, here we propose the FNO-DenseNet based on the Fourier neural operator (FNO). By combining the advantages of the FNO and DenseNet, the FNO-DenseNet outperformed the FNO in our experiments with hundreds of times fewer parameters. The FNO-DenseNet also outperformed a pixel-based random forest by 18% in the mean absolute percentage error.
    摘要 土壤有机碳(SOC)封存指将大气二氧化碳转移并储存于土壤之中,对减缓气候变化具有重要作用。SOC 浓度可以通过合理的土地利用得到提升,因此若能在区域或全球尺度上估计 SOC 将大有裨益。由于多光谱卫星数据可在全球尺度提供与 SOC 相关的信息(如植被与土壤性质),通过卫星数据估计 SOC 已被作为人工土壤采样之外的替代途径加以探索。尽管现有研究显示出可喜结果,但它们主要基于逐像素方法与传统机器学习方法,卷积神经网络(CNN)并不常见。为研究 CNN 在 SOC 遥感中的应用,我们提出基于傅里叶神经算子(FNO)的 FNO-DenseNet。通过结合 FNO 与 DenseNet 的优点,FNO-DenseNet 在我们的实验中以少数百倍的参数量超越了 FNO,并在平均绝对百分比误差上比基于像素的随机森林好 18%。
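
The spectral mixing at the core of an FNO layer fits in a few lines; a 1-D NumPy sketch (channel dimensions and the pointwise convolution path of a full FNO block are omitted, and the weights are random placeholders for learned ones):

```python
import numpy as np

def fourier_layer(x, weights, modes):
    """Core FNO operation on a 1-D signal: FFT, mix the lowest `modes`
    frequencies with learned complex weights, inverse FFT (illustrative)."""
    xf = np.fft.rfft(x)
    out = np.zeros_like(xf)
    out[:modes] = weights * xf[:modes]   # learned spectral multiplication
    return np.fft.irfft(out, n=len(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                        # e.g. one spectral-band row
w = rng.standard_normal(16) + 1j * rng.standard_normal(16)
print(fourier_layer(x, w, modes=16).shape)          # (128,)
```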

3D Compression Using Neural Fields

  • paper_url: http://arxiv.org/abs/2311.13009
  • repo_url: None
  • paper_authors: Janis Postels, Yannick Strümpler, Klara Reichard, Luc Van Gool, Federico Tombari
  • for: This paper proposes a novel Neural Fields (NF)-based compression algorithm for 3D data, which can compress both geometry and attribute of 3D data.
  • methods: The proposed method uses Signed Distance Fields (SDFs) for watertight shapes and Unsigned Distance Fields (UDFs) for arbitrary non-watertight shapes.
  • results: The method demonstrates excellent geometry compression on 3D point clouds and meshes, and it is straightforward to extend the compression algorithm to compress both geometry and attribute of 3D data.
    Abstract Neural Fields (NFs) have gained momentum as a tool for compressing various data modalities - e.g. images and videos. This work leverages previous advances and proposes a novel NF-based compression algorithm for 3D data. We derive two versions of our approach - one tailored to watertight shapes based on Signed Distance Fields (SDFs) and, more generally, one for arbitrary non-watertight shapes using Unsigned Distance Fields (UDFs). We demonstrate that our method excels at geometry compression on 3D point clouds as well as meshes. Moreover, we show that, due to the NF formulation, it is straightforward to extend our compression algorithm to compress both geometry and attribute (e.g. color) of 3D data.
    摘要 神经场(NF)作为一种压缩各类数据模态(如图像和视频)的工具日益受到关注。这项工作基于先前的进展,提出了一种基于 NF 的 3D 数据压缩算法。我们给出了该方法的两个版本:一个针对水密形状、基于符号距离场(SDF);另一个更一般地针对任意非水密形状、使用无符号距离场(UDF)。我们展示了该方法在 3D 点云和网格上出色的几何压缩表现。此外,得益于 NF 的表述形式,我们可以直接将压缩算法扩展为同时压缩 3D 数据的几何与属性(例如颜色)。

TRIDENT: The Nonlinear Trilogy for Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2311.13610
  • repo_url: None
  • paper_authors: Zhenda Shen, Yanqi Cheng, Raymond H. Chan, Pietro Liò, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
  • for: 这篇论文研究隐式神经表示(INR)在无需显式参数化的情况下建模复杂高维数据的能力。
  • methods: 这篇论文为隐式神经表示提出了一个新的函数 TRIDENT,其特点是三重非线性:其一,通过阶次紧致性(order compactness)表示高阶特征;其二,高效捕捉频率信息,即频率紧致性;其三,可表示能量集中于有限空间区域的信号或图像,即空间紧致性。
  • results: 经过广泛的实验研究,该提出的函数在多个逆问题中表现出色,超越了现有的 implicit neural representation 函数。
    Abstract Implicit neural representations (INRs) have garnered significant interest recently for their ability to model complex, high-dimensional data without explicit parameterisation. In this work, we introduce TRIDENT, a novel function for implicit neural representations characterised by a trilogy of nonlinearities. Firstly, it is designed to represent high-order features through order compactness. Secondly, TRIDENT efficiently captures frequency information, a feature called frequency compactness. Thirdly, it has the capability to represent signals or images such that most of its energy is concentrated in a limited spatial region, denoting spatial compactness. We demonstrated through extensive experiments on various inverse problems that our proposed function outperforms existing implicit neural representation functions.
    摘要 隐式神经表示(INR)近来引起了广泛关注,因为它能够在无需显式参数化的情况下建模复杂的高维数据。在这项工作中,我们提出了一种用于隐式神经表示的新函数 TRIDENT,其特点是三重非线性。其一,它通过阶次紧致性来表示高阶特征;其二,TRIDENT 可以高效捕捉频率信息,这一特性称为频率紧致性;其三,它能够表示能量集中于有限空间区域的信号或图像,即空间紧致性。我们在多种逆问题上进行了大量实验,证明所提函数优于现有的隐式神经表示函数。

AI for Agriculture: the Comparison of Semantic Segmentation Methods for Crop Mapping with Sentinel-2 Imagery

  • paper_url: http://arxiv.org/abs/2311.12993
  • repo_url: None
  • paper_authors: Irina Korotkova, Natalia Efremova
  • for: 这篇研究旨在探讨可用于葡萄园分割问题的主要机器学习方法,并评估这些方法在不同情况下的效果。
  • methods: 本研究基于免费可得的卫星影像,评估了多种广泛使用的机器学习技术在葡萄园分割上的适用性。
  • results: 研究比较了这些机器学习方法在葡萄园分割问题上的表现,并就不同场景下如何选择最合适的模型给出指导。
    Abstract Crop mapping is one of the most common tasks in artificial intelligence for agriculture due to higher food demands from a growing population and increased awareness of climate change. In case of vineyards, the texture is very important for crop segmentation: with higher resolution satellite imagery the texture is easily detected by majority of state-of-the-art algorithms. However, this task becomes increasingly more difficult as the resolution of satellite imagery decreases and the information about the texture becomes unavailable. In this paper we aim to explore the main machine learning methods that can be used with freely available satellite imagery and discuss how and when they can be applied for vineyard segmentation problem. We assess the effectiveness of various widely-used machine learning techniques and offer guidance on selecting the most suitable model for specific scenarios.
    摘要 随着人口增长带来的粮食需求上升以及对气候变化认识的提高,作物制图已成为农业人工智能中最常见的任务之一。对葡萄园而言,纹理对作物分割非常重要:在高分辨率卫星影像中,多数最先进的算法都能轻松捕捉纹理;然而,随着卫星影像分辨率下降、纹理信息不可获得,该任务会变得越来越困难。本文旨在探讨可与免费卫星影像配合使用的主要机器学习方法,并讨论它们在葡萄园分割问题中何时以及如何适用。我们评估了多种广泛使用的机器学习技术的有效性,并就特定场景下如何选择最合适的模型给出指导。

FollowMe: a Robust Person Following Framework Based on Re-Identification and Gestures

  • paper_url: http://arxiv.org/abs/2311.12992
  • repo_url: https://github.com/federicorollo/followme
  • paper_authors: Federico Rollo, Andrea Zunino, Gennaro Raiola, Fabio Amadio, Arash Ajoudani, Nikolaos Tsagarakis
  • for: 这项研究旨在提高人机合作Robot的操作灵活性,使其在各种工作场景中提供个性化协助。
  • methods: 该研究使用了视觉Re-ID、手势检测和避免碰撞导航来识别和跟踪人类对象。
  • results: 实验结果表明,该框架能够在实验室中成功地识别和跟踪人类对象,并在有些不确定的动态障碍物的情况下保持稳定 Navigation。
    Abstract Human-robot interaction (HRI) has become a crucial enabler in houses and industries for facilitating operational flexibility. When it comes to mobile collaborative robots, this flexibility can be further increased due to the autonomous mobility and navigation capacity of the robotic agents, expanding their workspace and consequently, the personalizable assistance they can provide to the human operators. This however requires that the robot is capable of detecting and identifying the human counterpart in all stages of the collaborative task, and in particular while following a human in crowded workplaces. To respond to this need, we developed a unified perception and navigation framework, which enables the robot to identify and follow a target person using a combination of visual Re-Identification (Re-ID), hand gestures detection, and collision-free navigation. The Re-ID module can autonomously learn the features of a target person and use the acquired knowledge to visually re-identify the target. The navigation stack is used to follow the target avoiding obstacles and other individuals in the environment. Experiments are conducted with few subjects in a laboratory setting where some unknown dynamic obstacles are introduced.
    摘要 人机交互(HRI)已成为家庭和工业领域的关键促进者,以增加操作灵活性。当涉及到移动协作机器人时,这种灵活性可以得到进一步提高,因为机器人代理人的自主移动和导航能力,扩大了它们的工作空间,从而提供更多的个性化协助给人类操作者。但是,需要机器人能够在协作任务的所有阶段探测和识别人类对象,特别是在拥挤的工作场所中跟随人类。为回应这个需求,我们开发了一个简化的识别和导航框架,该框架使机器人能够通过视觉重新识别目标人物,以及检测手势和避免碰撞 navigation。识别模块可以自动学习目标人物的特征,并使用已得知知识进行视觉重新识别。导航堆栈则用于跟随目标,避免障碍物和其他环境中的人员。在实验室设置中,我们进行了一些人员的测试,并在其中引入了一些不确定的动态障碍物。

SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion

  • paper_url: http://arxiv.org/abs/2311.12981
  • repo_url: None
  • paper_authors: Yueqian Lin, Jingyang Zhang, Yiran Chen, Hai Li
  • for: This paper aims to provide a valuable method for obtaining challenging evaluation data to advance the development of more robust deep learning models.
  • methods: The proposed method, called SD-NAE, actively synthesizes natural adversarial examples (NAEs) using the state-of-the-art Stable Diffusion. The generation is guided by the gradient of loss from the target classifier.
  • results: The proposed method is effective in producing valid and useful NAEs, as demonstrated through a meticulously designed experiment. The generated NAEs can be used to evaluate the robustness of deep learning models and potentially advance their development.
    Abstract Robustly evaluating deep learning image classifiers is challenging due to some limitations of standard datasets. Natural Adversarial Examples (NAEs), arising naturally from the environment and capable of deceiving classifiers, are instrumental in identifying vulnerabilities in trained models. Existing works collect such NAEs by filtering from a huge set of real images, a process that is passive and lacks control. In this work, we propose to actively synthesize NAEs with the state-of-the-art Stable Diffusion. Specifically, our method formulates a controlled optimization process, where we perturb the token embedding that corresponds to a specified class to synthesize NAEs. The generation is guided by the gradient of loss from the target classifier so that the created image closely mimics the ground-truth class yet fools the classifier. Named SD-NAE (Stable Diffusion for Natural Adversarial Examples), our innovative method is effective in producing valid and useful NAEs, which is demonstrated through a meticulously designed experiment. Our work thereby provides a valuable method for obtaining challenging evaluation data, which in turn can potentially advance the development of more robust deep learning models. Code is available at https://github.com/linyueqian/SD-NAE.
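As a rough, hedged sketch of the optimization SD-NAE describes, the loop below perturbs a class-token embedding under gradient guidance from the victim classifier. `generate` and `classifier` are placeholders: in the paper the generator is Stable Diffusion with gradients flowing through sampling, which this sketch abstracts away.

```python
import torch
import torch.nn.functional as F

def synthesize_nae(token_embedding, true_class, generate, classifier,
                   steps=50, lr=1e-3):
    """Sketch of gradient-guided NAE synthesis (illustrative, not the paper's code).

    token_embedding: (D,) embedding of the specified class's token.
    generate: placeholder for a differentiable text-to-image generator.
    classifier: the target (victim) image classifier.
    """
    delta = torch.zeros_like(token_embedding, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        image = generate(token_embedding + delta)      # stays close to the class
        logits = classifier(image)
        # Maximize the classifier's loss on the ground-truth class, so the image
        # still mimics the class yet fools the classifier.
        loss = -F.cross_entropy(logits, torch.tensor([true_class]))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if logits.argmax(dim=1).item() != true_class:  # classifier is fooled
            break
    return generate(token_embedding + delta).detach()
```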

Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models

  • paper_url: http://arxiv.org/abs/2311.12796
  • repo_url: https://github.com/davidstotko/physics-guided-shape-from-template
  • paper_authors: David Stotko, Nils Wandel, Reinhard Klein
  • for: 3D reconstruction of dynamic scenes is a long-standing problem in computer graphics that becomes harder as less information is available. Shape-from-Template (SfT) methods reconstruct template-based geometry from RGB images or video sequences, typically without depth information, e.g., from ordinary monocular smartphone recordings. Existing reconstruction methods, however, are either physically implausible and noisy or slow to optimize.
  • methods: The authors propose a novel SfT reconstruction algorithm for cloth that uses a pre-trained neural surrogate model, which is fast to evaluate, stable, and produces smooth reconstructions thanks to a regularizing physics simulation. Differentiable rendering of the simulated mesh enables pixel-wise comparison between the reconstruction and a target video sequence, so a gradient-based optimization can recover not only shape but also physical parameters such as stretching, shearing, and bending stiffness.
  • results: Compared with $\phi$-SfT, a state-of-the-art physics-based SfT approach, the method reduces runtime by a factor of roughly 400-500 while retaining a precise, stable, and smooth reconstructed geometry.
    Abstract 3D reconstruction of dynamic scenes is a long-standing problem in computer graphics and increasingly difficult the less information is available. Shape-from-Template (SfT) methods aim to reconstruct a template-based geometry from RGB images or video sequences, often leveraging just a single monocular camera without depth information, such as regular smartphone recordings. Unfortunately, existing reconstruction methods are either unphysical and noisy or slow in optimization. To solve this problem, we propose a novel SfT reconstruction algorithm for cloth using a pre-trained neural surrogate model that is fast to evaluate, stable, and produces smooth reconstructions due to a regularizing physics simulation. Differentiable rendering of the simulated mesh enables pixel-wise comparisons between the reconstruction and a target video sequence that can be used for a gradient-based optimization procedure to extract not only shape information but also physical parameters such as stretching, shearing, or bending stiffness of the cloth. This allows to retain a precise, stable, and smooth reconstructed geometry while reducing the runtime by a factor of 400-500 compared to $\phi$-SfT, a state-of-the-art physics-based SfT approach.
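The gradient-based optimization the abstract describes can be sketched generically: physical parameters are updated so that rendered simulated frames match the target video pixel-wise. `simulate_cloth` and `render` below are placeholders for the paper's neural surrogate simulator and differentiable renderer; the three-parameter set and the learning rate are illustrative assumptions.

```python
import torch

def fit_cloth_parameters(target_frames, template_mesh, simulate_cloth, render,
                         iters=200, lr=1e-2):
    """Sketch of the SfT loop: optimize stretch/shear/bend stiffness so the
    rendered simulation matches the target video pixel-wise.

    simulate_cloth and render are placeholders for a differentiable surrogate
    simulator and a differentiable renderer, respectively.
    """
    log_params = torch.zeros(3, requires_grad=True)  # log-space keeps values positive
    opt = torch.optim.Adam([log_params], lr=lr)
    for _ in range(iters):
        params = log_params.exp()
        meshes = simulate_cloth(template_mesh, params)   # one mesh per frame
        loss = sum(((render(m) - f) ** 2).mean()
                   for m, f in zip(meshes, target_frames))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_params.exp().detach()
```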

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

  • paper_url: http://arxiv.org/abs/2311.12793
  • repo_url: https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V
  • paper_authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin
  • for: To improve modality alignment in large multi-modal models (LMMs) by addressing the bottleneck caused by the scarcity of high-quality image-text data.
  • methods: The authors introduce the ShareGPT4V dataset, a large-scale resource of 1.2 million highly descriptive captions that surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. It originates from 100K curated GPT4-Vision captions and is expanded to 1.2M with a caption model trained on that subset.
  • results: In the Supervised Fine-Tuning (SFT) stage, substituting an equivalent quantity of detailed captions in existing SFT datasets with these high-quality captions significantly improves LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. Incorporating ShareGPT4V into both pre-training and SFT yields ShareGPT4V-7B, an LMM with strong performance across a majority of multi-modal benchmarks.
    Abstract In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering

  • paper_url: http://arxiv.org/abs/2311.12775
  • repo_url: https://github.com/Anttwo/SuGaR
  • paper_authors: Antoine Guédon, Vincent Lepetit
  • for: We propose a method for precise and extremely fast mesh extraction from 3D Gaussian Splatting.
  • methods: Our method includes a regularization term that encourages the gaussians to align well with the surface of the scene, and introduces a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction.
  • results: Our method can retrieve an editable mesh for realistic rendering within minutes, compared to hours with the state-of-the-art methods on neural SDFs, while providing a better rendering quality.
    Abstract We propose a method to allow precise and extremely fast mesh extraction from 3D Gaussian Splatting. Gaussian Splatting has recently become very popular as it yields realistic rendering while being significantly faster to train than NeRFs. It is however challenging to extract a mesh from the millions of tiny 3D gaussians as these gaussians tend to be unorganized after optimization and no method has been proposed so far. Our first key contribution is a regularization term that encourages the gaussians to align well with the surface of the scene. We then introduce a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction, which is fast, scalable, and preserves details, in contrast to the Marching Cubes algorithm usually applied to extract meshes from Neural SDFs. Finally, we introduce an optional refinement strategy that binds gaussians to the surface of the mesh, and jointly optimizes these Gaussians and the mesh through Gaussian splatting rendering. This enables easy editing, sculpting, rigging, animating, compositing and relighting of the Gaussians using traditional softwares by manipulating the mesh instead of the gaussians themselves. Retrieving such an editable mesh for realistic rendering is done within minutes with our method, compared to hours with the state-of-the-art methods on neural SDFs, while providing a better rendering quality. Our project page is the following: https://imagine.enpc.fr/~guedona/sugar/
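Once the Gaussians are regularized to lie on the surface, the meshing step amounts to Poisson reconstruction over an oriented point cloud. The sketch below uses Open3D's screened Poisson reconstruction as an off-the-shelf stand-in for the paper's pipeline, with random stand-in inputs.

```python
import numpy as np
import open3d as o3d

# Stand-ins for the optimized Gaussian centers and their surface normals
# (in SuGaR the regularization term makes the Gaussians align with the surface).
centers = np.random.rand(10000, 3)
normals = np.tile([0.0, 0.0, 1.0], (10000, 1))

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(centers)
pcd.normals = o3d.utility.Vector3dVector(normals)

# Screened Poisson reconstruction: fast, scalable, and detail-preserving,
# in contrast to Marching Cubes over a neural SDF.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)

# Trim low-density vertices (reconstruction artifacts far from the points).
dens = np.asarray(densities)
mesh.remove_vertices_by_mask(dens < np.quantile(dens, 0.05))
o3d.io.write_triangle_mesh("sugar_mesh.ply", mesh)
```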

Iris Presentation Attack: Assessing the Impact of Combining Vanadium Dioxide Films with Artificial Eyes

  • paper_url: http://arxiv.org/abs/2311.12773
  • repo_url: None
  • paper_authors: Darshika Jauhari, Renu Sharma, Cunjian Chen, Nelson Sepulveda, Arun Ross
  • for: This study examines presentation attacks (PAs) against iris recognition systems and investigates the effect of affixing Vanadium Dioxide (VO2) films to the surface of artificial eyes.
  • methods: VO2 films, which selectively transmit NIR light, are placed on artificial eyes in various spatial configurations, and the resulting sensor images are evaluated against two state-of-the-art iris PA detection methods.
  • results: The VO2 films can cause the PA detection methods to misclassify artificial eyes as bonafide eyes in some cases, exposing a vulnerability that must be systematically analyzed and addressed.
    Abstract Iris recognition systems, operating in the near infrared spectrum (NIR), have demonstrated vulnerability to presentation attacks, where an adversary uses artifacts such as cosmetic contact lenses, artificial eyes or printed iris images in order to circumvent the system. At the same time, a number of effective presentation attack detection (PAD) methods have been developed. These methods have demonstrated success in detecting artificial eyes (e.g., fake Van Dyke eyes) as presentation attacks. In this work, we seek to alter the optical characteristics of artificial eyes by affixing Vanadium Dioxide (VO2) films on their surface in various spatial configurations. VO2 films can be used to selectively transmit NIR light and can, therefore, be used to regulate the amount of NIR light from the object that is captured by the iris sensor. We study the impact of such images produced by the sensor on two state-of-the-art iris PA detection methods. We observe that the addition of VO2 films on the surface of artificial eyes can cause the PA detection methods to misclassify them as bonafide eyes in some cases. This represents a vulnerability that must be systematically analyzed and effectively addressed.

Swift Parameter-free Attention Network for Efficient Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.12770
  • repo_url: https://github.com/hongyuanyu/span
  • paper_authors: Cheng Wan, Hongyuan Yu, Zhiqi Li, Yihang Chen, Yajun Zou, Yuqing Liu, Xuanwu Yin, Kunlong Zuo
  • for: To achieve a better balance between image quality and inference speed in single image super-resolution (SISR), enabling high-quality, fast reconstruction in resource-constrained scenarios.
  • methods: The authors propose the Swift Parameter-free Attention Network (SPAN), an efficient SISR model with a novel parameter-free attention mechanism that uses symmetric activation functions and residual connections to enhance high-contribution information and suppress redundant information.
  • results: Supported by theoretical analysis and experiments on multiple benchmarks, SPAN achieves a favorable quality-speed trade-off, attaining the best PSNR of 27.09 dB and a test runtime reduced by 7.08 ms in the NTIRE 2023 efficient super-resolution challenge.
    Abstract Single Image Super-Resolution (SISR) is a crucial task in low-level computer vision, aiming to reconstruct high-resolution images from low-resolution counterparts. Conventional attention mechanisms have significantly improved SISR performance but often result in complex network structures and a large number of parameters, leading to slow inference speed and large model size. To address this issue, we propose the Swift Parameter-free Attention Network (SPAN), a highly efficient SISR model that balances parameter count, inference speed, and image quality. SPAN employs a novel parameter-free attention mechanism, which leverages symmetric activation functions and residual connections to enhance high-contribution information and suppress redundant information. Our theoretical analysis demonstrates the effectiveness of this design in achieving the attention mechanism's purpose. We evaluate SPAN on multiple benchmarks, showing that it outperforms existing efficient super-resolution models in terms of both image quality and inference speed, achieving a significant quality-speed trade-off. This makes SPAN highly suitable for real-world applications, particularly in resource-constrained scenarios. Notably, our model attains the best PSNR of 27.09 dB, and the test runtime of our team is reduced by 7.08ms in the NTIRE 2023 efficient super-resolution challenge. Our code and models are made publicly available at https://github.com/hongyuanyu/SPAN.
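The parameter-free attention idea can be sketched as a PyTorch module: the attention map is derived from the features themselves through a symmetric activation, so the block adds no learnable weights, and a residual connection preserves the signal. The re-centered sigmoid below is an illustrative choice, not the paper's exact activation.

```python
import torch
import torch.nn as nn

class ParameterFreeAttention(nn.Module):
    """Minimal sketch of a parameter-free attention block (not SPAN's exact design)."""
    def forward(self, x):
        # Symmetric (odd) activation around 0: emphasizes large-magnitude
        # (high-contribution) responses and suppresses near-zero ones.
        attn = torch.sigmoid(x) - 0.5
        return x + x * attn  # residual connection keeps the original signal

class TinySRBlock(nn.Module):
    """Conv -> parameter-free attention -> conv, as one stage of an SR network."""
    def __init__(self, channels=48):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = ParameterFreeAttention()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.attn(self.conv1(x)))

x = torch.randn(1, 48, 64, 64)
print(TinySRBlock()(x).shape)  # torch.Size([1, 48, 64, 64])
```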

Investigating Weight-Perturbed Deep Neural Networks With Application in Iris Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2311.12764
  • repo_url: https://github.com/redwankarimsony/weightperturbation-msu
  • paper_authors: Renu Sharma, Redwan Sony, Arun Ross
  • for: To analyze the sensitivity of deep neural networks (DNNs) to parameter perturbations before deploying them in real-world applications.
  • methods: The study covers three DNN architectures (VGG, ResNet, and DenseNet), three types of parameter perturbation (Gaussian noise, weight zeroing, and weight scaling), and two settings (entire network and layer-wise). Experiments are conducted on iris presentation attack detection using two publicly available datasets (LivDet-Iris-2017 and LivDet-Iris-2020).
  • results: Guided by the sensitivity analysis, the authors obtain improved models simply by perturbing network parameters without retraining, and further combine the perturbed models at the score level and the parameter level. The parameter-level ensemble yields an average improvement of 43.58% on LivDet-Iris-2017 and 9.25% on LivDet-Iris-2020. The source code is available on GitHub.
    Abstract Deep neural networks (DNNs) exhibit superior performance in various machine learning tasks, e.g., image classification, speech recognition, biometric recognition, object detection, etc. However, it is essential to analyze their sensitivity to parameter perturbations before deploying them in real-world applications. In this work, we assess the sensitivity of DNNs against perturbations to their weight and bias parameters. The sensitivity analysis involves three DNN architectures (VGG, ResNet, and DenseNet), three types of parameter perturbations (Gaussian noise, weight zeroing, and weight scaling), and two settings (entire network and layer-wise). We perform experiments in the context of iris presentation attack detection and evaluate on two publicly available datasets: LivDet-Iris-2017 and LivDet-Iris-2020. Based on the sensitivity analysis, we propose improved models simply by perturbing parameters of the network without undergoing training. We further combine these perturbed models at the score-level and at the parameter-level to improve the performance over the original model. The ensemble at the parameter-level shows an average improvement of 43.58% on the LivDet-Iris-2017 dataset and 9.25% on the LivDet-Iris-2020 dataset. The source code is available at https://github.com/redwankarimsony/WeightPerturbation-MSU.
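The three perturbation types are straightforward to reproduce. The sketch below applies them to a torchvision model and forms a parameter-level ensemble by averaging weights; the perturbation strengths are illustrative assumptions, not the paper's tuned values.

```python
import copy
import torch
import torchvision.models as models

def perturb(model, mode="gaussian", sigma=0.01, zero_frac=0.05, scale=1.05):
    """Return a perturbed copy of `model`.

    mode: 'gaussian' adds N(0, sigma^2) noise, 'zero' zeroes a random
    fraction of parameters, 'scale' multiplies parameters by a constant.
    """
    m = copy.deepcopy(model)
    with torch.no_grad():
        for p in m.parameters():
            if mode == "gaussian":
                p.add_(torch.randn_like(p) * sigma)
            elif mode == "zero":
                p.mul_((torch.rand_like(p) > zero_frac).float())
            elif mode == "scale":
                p.mul_(scale)
    return m

base = models.vgg16(weights=None)
ensemble = [perturb(base, mode) for mode in ("gaussian", "zero", "scale")]

# Parameter-level ensemble: average the perturbed models' weights.
avg = copy.deepcopy(base)
with torch.no_grad():
    for name, p in avg.named_parameters():
        p.copy_(torch.stack([dict(m.named_parameters())[name]
                             for m in ensemble]).mean(0))
```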

High-resolution Image-based Malware Classification using Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2311.12760
  • repo_url: https://github.com/timppeters/mil-malware-images
  • paper_authors: Tim Peters, Hikmat Farhat
  • for: To classify malware into families while resisting adversarial binary enlargement.
  • methods: High-resolution greyscale images are divided into patches and classified with embedding-based multiple instance learning, using a convolutional neural network and an attention aggregation function, avoiding the lossy resizing used by prior visualization-based approaches.
  • results: On the Microsoft Malware Classification dataset, the method reaches accuracies of up to $96.6\%$ on adversarially enlarged samples, far above the baseline of $22.8\%$.
    Abstract This paper proposes a novel method of classifying malware into families using high-resolution greyscale images and multiple instance learning to overcome adversarial binary enlargement. Current methods of visualisation-based malware classification largely rely on lossy transformations of inputs such as resizing to handle the large, variable-sized images. Through empirical analysis and experimentation, it is shown that these approaches cause crucial information loss that can be exploited. The proposed solution divides the images into patches and uses embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function for classification. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples compared to the baseline of $22.8\%$. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images .
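The embedding-based MIL aggregation can be sketched with attention pooling in the style of Ilse et al. (2018): patch embeddings are weighted by learned attention scores and summed into one bag representation before classification. The tiny patch encoder and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Sketch of embedding-based MIL with attention aggregation.

    A bag contains all patches of one (arbitrarily large) malware image,
    so no lossy resizing of the whole binary image is needed.
    """
    def __init__(self, emb_dim=128, attn_dim=64, n_classes=9):
        super().__init__()
        self.encoder = nn.Sequential(   # tiny CNN patch encoder (illustrative)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim))
        self.attn = nn.Sequential(      # one attention score per patch
            nn.Linear(emb_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, patches):            # patches: (N, 1, 64, 64), one bag
        h = self.encoder(patches)          # (N, emb_dim)
        a = torch.softmax(self.attn(h), dim=0)  # (N, 1), sums to 1 over the bag
        bag = (a * h).sum(dim=0)           # attention-weighted bag embedding
        return self.head(bag), a           # logits + per-patch attention

logits, attn = AttentionMIL()(torch.randn(37, 1, 64, 64))
```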

Breathing Life Into Sketches Using Text-to-Video Priors

  • paper_url: http://arxiv.org/abs/2311.13608
  • repo_url: None
  • paper_authors: Rinon Gal, Yael Vinker, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Ariel Shamir, Gal Chechik
  • for: To animate a single-subject sketch into a short animation, using a text prompt to guide the motion.
  • methods: The text prompt steers the stroke motion by distilling the motion prior of a large pretrained text-to-video diffusion model through a score-distillation loss; the learned motion is modeled as small local deformations plus global affine transformations, keeping the motion natural and smooth while preserving the sketch's appearance.
  • results: The method produces high-quality vector animations that remain easy to edit.
    Abstract A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
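The two-component motion model can be sketched directly: per-frame displacements of the sketch's control points are the sum of a per-frame global affine transform and small per-point local offsets. The parameterization below is a generic illustration of that split, not the authors' code; in the paper both parts would be optimized against a score-distillation loss from the text-to-video prior.

```python
import torch
import torch.nn as nn

class TwoLevelMotion(nn.Module):
    """Map stroke control points (P, 2) to animated points (T, P, 2)."""
    def __init__(self, n_points, n_frames):
        super().__init__()
        # Global component: one 2x3 affine per frame, initialized to identity.
        eye = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
        self.affine = nn.Parameter(eye.repeat(n_frames, 1, 1))
        # Local component: small per-point, per-frame deformations.
        self.local = nn.Parameter(torch.zeros(n_frames, n_points, 2))

    def forward(self, points):
        ones = torch.ones(points.shape[0], 1)
        homog = torch.cat([points, ones], dim=1)               # (P, 3)
        global_part = torch.einsum('tij,pj->tpi', self.affine, homog)
        return global_part + self.local                        # (T, P, 2)

# Regularizing `local` (e.g. with an L2 penalty) would keep motion smooth.
frames = TwoLevelMotion(n_points=32, n_frames=24)(torch.rand(32, 2))
```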

Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches

  • paper_url: http://arxiv.org/abs/2311.12914
  • repo_url: None
  • paper_authors: Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh
  • for: This paper targets the adversarial robustness of transformer models on vision tasks, in particular deformable vision transformers.
  • methods: Deformable transformers use sparse attention structures to reduce the quadratic complexity of attention modeling; the authors study adversarial patch attacks, including collaborative attacks, that manipulate this sparse attention.
  • results: Existing transformer attacks do not transfer to deformable transformers because of their sparse attention structure, but the proposed attacks succeed: patching only 1% of the input area can reduce AP to 0%, and the collaborative attacks can redirect attention under attacker control.
    Abstract The latest generation of transformer-based vision models have proven to be superior to Convolutional Neural Network (CNN)-based models across several vision tasks, largely attributed to their remarkable prowess in relation modeling. Deformable vision transformers significantly reduce the quadratic complexity of modeling attention by using sparse attention structures, enabling them to be used in larger scale applications such as multi-view vision systems. Recent work demonstrated adversarial attacks against transformers; we show that these attacks do not transfer to deformable transformers due to their sparse attention structure. Specifically, attention in deformable transformers is modeled using pointers to the most relevant other tokens. In this work, we contribute for the first time adversarial attacks that manipulate the attention of deformable transformers, distracting them to focus on irrelevant parts of the image. We also develop new collaborative attacks where a source patch manipulates attention to point to a target patch that adversarially attacks the system. In our experiments, we find that only 1% patched area of the input field can lead to 0% AP. We also show that the attacks provide substantial versatility to support different attacker scenarios because of their ability to redirect attention under the attacker control.

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatially Relation Matching

  • paper_url: http://arxiv.org/abs/2311.12751
  • repo_url: None
  • paper_authors: Meng Chu, Zhedong Zheng, Wei Ji, Tat-Seng Chua
  • for: To integrate natural language commands into drone control and navigation.
  • methods: A large language model (LLM)-based data generation framework and pre-trained vision models are used to extend the University-1652 image dataset with spatially aware text annotations, including image-text-bounding box associations; a new optimization objective, blending spatial matching, is introduced for region-level spatial relation matching.
  • results: The approach maintains an exceptional recall rate under varying description complexities, indicating its promise for seamlessly integrating natural language commands into real-world drone scenarios.
    Abstract Drone navigation through natural language commands remains a significant challenge due to the lack of publicly available multi-modal datasets and the intricate demands of fine-grained visual-text alignment. In response to this pressing need, we present a new human-computer interaction annotation benchmark called GeoText-1652, meticulously curated through a robust Large Language Model (LLM)-based data generation framework and the expertise of pre-trained vision models. This new dataset seamlessly extends the existing image dataset, i.e., University-1652, with spatial-aware text annotations, encompassing intricate image-text-bounding box associations. Besides, we introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains an exceptional recall rate under varying description complexities. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.

Q-Seg: Quantum Annealing-based Unsupervised Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.12912
  • repo_url: None
  • paper_authors: Supreeth Mysore Venkatesh, Antonio Macaluso, Marlon Nuske, Matthias Klusch, Andreas Dengel
  • for: This study proposes an unsupervised image segmentation method based on quantum annealing, tailored to existing quantum hardware.
  • methods: The pixel-wise segmentation problem, which combines spectral and spatial information, is cast as a graph-cut optimization task that exploits the interconnected qubit topology of the D-Wave Advantage device, offering better scalability than existing quantum approaches.
  • results: On synthetic datasets, Q-Seg offers better runtime performance than the classical optimizer Gurobi; on Earth Observation imagery it achieves near-optimal flood-mapping results relative to classical supervised machine learning methods and improved forest-coverage segmentation compared to existing annotated masks.
    Abstract In this study, we present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming state-of-the-art classical methods. Our empirical evaluations on synthetic datasets reveal that Q-Seg offers better runtime performance against the classical optimizer Gurobi. Furthermore, we evaluate our method on segmentation of Earth Observation images, an area of application where the amount of labeled data is usually very limited. In this case, Q-Seg demonstrates near-optimal results in flood mapping detection with respect to classical supervised state-of-the-art machine learning methods. Also, Q-Seg provides enhanced segmentation for forest coverage compared to existing annotated masks. Thus, Q-Seg emerges as a viable alternative for real-world applications using available quantum hardware, particularly in scenarios where the lack of labeled data and computational runtime are critical.
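The graph-cut formulation can be sketched as a QUBO with one binary variable per pixel: a unary term pulls bright pixels toward the foreground label, and a pairwise term penalizes label disagreement between 4-neighbors. The toy sketch below solves the QUBO classically with dimod's reference simulated-annealing sampler as a stand-in for the D-Wave annealer; the unary cost and the smoothness weight are illustrative assumptions.

```python
import numpy as np
import dimod

img = np.random.rand(4, 4)      # toy single-band image in [0, 1]
lam = 0.5                       # smoothness weight (illustrative)

bqm = dimod.BinaryQuadraticModel('BINARY')
H, W = img.shape
for i in range(H):
    for j in range(W):
        v = (i, j)
        # Unary term: (1 - 2*img) * x_v is minimized by x_v = 1 for bright pixels.
        bqm.add_variable(v, 1.0 - 2.0 * img[i, j])
        # Pairwise smoothness over right/down neighbors:
        # lam * (x_v + x_u - 2 x_v x_u) costs lam iff the labels differ.
        for u in [(i + 1, j), (i, j + 1)]:
            if u[0] < H and u[1] < W:
                bqm.add_variable(v, lam)
                bqm.add_variable(u, lam)
                bqm.add_interaction(v, u, -2.0 * lam)

# Classical stand-in for the quantum annealer used in the paper.
sampleset = dimod.SimulatedAnnealingSampler().sample(bqm, num_reads=10)
best = sampleset.first.sample
mask = np.array([[best[(i, j)] for j in range(W)] for i in range(H)])
```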

Attacking Motion Planners Using Adversarial Perception Errors

  • paper_url: http://arxiv.org/abs/2311.12722
  • repo_url: None
  • paper_authors: Jonathan Sadeghi, Nicholas A. Lord, John Redford, Romain Mueller
  • for: To examine how the modules of an autonomous driving (AD) system interact, and how the performance of each module affects the system as a whole.
  • methods: A simple boundary-attack algorithm is used to systematically construct adversarial perception errors that cause planning failures.
  • results: The algorithm finds many such perception errors for two different black-box planners across several urban and highway driving scenarios in the CARLA simulator, and these errors are shown to be isolated in the planner's input space.
    Abstract Autonomous driving (AD) systems are often built and tested in a modular fashion, where the performance of different modules is measured using task-specific metrics. These metrics should be chosen so as to capture the downstream impact of each module and the performance of the system as a whole. For example, high perception quality should enable prediction and planning to be performed safely. Even though this is true in general, we show here that it is possible to construct planner inputs that score very highly on various perception quality metrics but still lead to planning failures. In an analogy to adversarial attacks on image classifiers, we call such inputs \textbf{adversarial perception errors} and show they can be systematically constructed using a simple boundary-attack algorithm. We demonstrate the effectiveness of this algorithm by finding attacks for two different black-box planners in several urban and highway driving scenarios using the CARLA simulator. Finally, we analyse the properties of these attacks and show that they are isolated in the input space of the planner, and discuss their implications for AD system deployment and testing.
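Treating the planner as a black box, a boundary attack reduces to bisection between a benign perception output and one that triggers a planning failure. The sketch below is a generic illustration of that idea, not the paper's algorithm; `planner_fails` and the scene parameterization are placeholders.

```python
import numpy as np

def boundary_attack(benign, failing, planner_fails, iters=30):
    """Bisect between a benign and a failure-inducing planner input.

    benign, failing: flattened perception outputs (e.g. object positions/sizes)
    planner_fails: black-box predicate, True if the planner output is unsafe
    Returns a failing input arbitrarily close to the decision boundary.
    """
    lo, hi = np.asarray(benign, float), np.asarray(failing, float)
    assert not planner_fails(lo) and planner_fails(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if planner_fails(mid):
            hi = mid        # shrink toward the smallest failing perturbation
        else:
            lo = mid
    return hi

# Toy usage: the "planner" fails when a phantom obstacle is within 2 m laterally.
fails = lambda x: abs(x[0]) < 2.0
adv = boundary_attack(benign=[10.0], failing=[0.0], planner_fails=fails)
```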

Cascade Learning Localises Discriminant Features in Visual Scene Classification

  • paper_url: http://arxiv.org/abs/2311.12704
  • repo_url: None
  • paper_authors: Junwen Wang, Katayoun Farrahi
  • for: To improve the interpretability of deep convolutional neural networks (DCNNs), particularly in the medical domain, where clinicians require trustworthy automated decisions.
  • methods: The study compares how well feature representations are localized with respect to expert-labeled regions of interest under two learning paradigms: traditional end-to-end (E2E) learning and layer-wise cascade learning (CL).
  • results: On medical and natural datasets, E2E learning shows a limited ability to localize discriminative features across multiple network layers, whereas cascade learning produces more localized features; on the YOLO object detection framework, the best result shows CL outperforming the E2E scheme by $2\%$ in mAP.
    Abstract Lack of interpretability of deep convolutional neural networks (DCNN) is a well-known problem particularly in the medical domain as clinicians want trustworthy automated decisions. One way to improve trust is to demonstrate the localisation of feature representations with respect to expert labeled regions of interest. In this work, we investigate the localisation of features learned via two varied learning paradigms and demonstrate the superiority of one learning approach with respect to localisation. Our analysis on medical and natural datasets show that the traditional end-to-end (E2E) learning strategy has a limited ability to localise discriminative features across multiple network layers. We show that a layer-wise learning strategy, namely cascade learning (CL), results in more localised features. Considering localisation accuracy, we not only show that CL outperforms E2E but that it is a promising method of predicting regions. On the YOLO object detection framework, our best result shows that CL outperforms the E2E scheme by $2\%$ in mAP.
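Cascade (layer-wise) learning can be sketched as training one block at a time, each with its own temporary head, while earlier blocks stay frozen. This is a generic illustration of the paradigm with illustrative layer sizes and step counts, not the paper's training recipe.

```python
import torch
import torch.nn as nn

blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
    for c_in, c_out in [(3, 16), (16, 32), (32, 64)]
])

def make_head(channels, n_classes=10):
    """Temporary classification head attached to the block being trained."""
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(channels, n_classes))

x = torch.randn(8, 3, 32, 32)             # stand-in batch
y = torch.randint(0, 10, (8,))
criterion = nn.CrossEntropyLoss()

for k, (block, ch) in enumerate(zip(blocks, [16, 32, 64])):
    head = make_head(ch)
    opt = torch.optim.Adam(list(block.parameters()) + list(head.parameters()), 1e-3)
    for _ in range(5):                     # a few steps per stage, illustrative
        with torch.no_grad():              # earlier blocks are frozen
            h = x
            for frozen in blocks[:k]:
                h = frozen(h)
        loss = criterion(head(block(h)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```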

Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation

  • paper_url: http://arxiv.org/abs/2311.12682
  • repo_url: None
  • paper_authors: Mu Chen, Zhedong Zheng, Yi Yang
  • for: This paper targets scene segmentation via unsupervised domain adaptation (UDA), i.e., transferring knowledge learned from source-domain data to target-domain data, which reduces the need for manual pixel-level annotation.
  • methods: The method infers scene layout from depth estimation, performing data augmentation with a Depth-guided Contextual Filter (DCF) and contextual learning with a cross-task encoder.
  • results: Using only pseudo depth, the approach achieves competitive performance on two widely used benchmarks: 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes.
    Abstract Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from both the source domain and target domain by simply copying and pasting the pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios. Real-world scenarios come with an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions, and could be clearly distinguished in a depth map. Based on such observation, we propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates the real-world layouts, while the cross-task encoder further adaptively fuses the complementing features between two tasks. Besides, it is worth noting that several public datasets do not provide depth annotation. Therefore, we leverage the off-the-shelf depth estimation network to generate the pseudo depth. Extensive experiments show that our proposed methods, even with pseudo depth, achieve competitive performance on two widely-used benchmarks, i.e. 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes.
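The depth-guided mixing idea can be sketched as occlusion-aware copy-paste: source pixels of selected classes are pasted onto the target image only where the source content is nearer to the camera than the target content, so the mixed layout respects depth ordering. This simplified sketch is not the paper's DCF.

```python
import numpy as np

def depth_guided_mix(src_img, src_lbl, src_depth,
                     tgt_img, tgt_depth, tgt_pseudo_lbl, classes):
    """Paste `classes` from source into target while keeping depth ordering."""
    mixed_img = tgt_img.copy()
    mixed_lbl = tgt_pseudo_lbl.copy()
    class_mask = np.isin(src_lbl, classes)
    # Paste only where the source object would occlude the target scene.
    paste = class_mask & (src_depth < tgt_depth)
    mixed_img[paste] = src_img[paste]
    mixed_lbl[paste] = src_lbl[paste]
    return mixed_img, mixed_lbl

H, W = 64, 64
mix_img, mix_lbl = depth_guided_mix(
    src_img=np.random.rand(H, W, 3), src_lbl=np.random.randint(0, 19, (H, W)),
    src_depth=np.random.rand(H, W), tgt_img=np.random.rand(H, W, 3),
    tgt_depth=np.random.rand(H, W), tgt_pseudo_lbl=np.random.randint(0, 19, (H, W)),
    classes=[5, 11])
```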

BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos

  • paper_url: http://arxiv.org/abs/2311.12679
  • repo_url: None
  • paper_authors: Georgios Albanis, Nikolaos Zioulis, Kostas Kolomvatsos
  • for: To capture smooth motion from videos without markers, avoiding the laborious multi-stage pipelines of data-driven regression, optimization, and bundle solving over temporal windows used by traditional markerless techniques.
  • methods: BundleMoCap solves motion capture in a single stage, eliminating temporal smoothness objectives while still delivering smooth motion. Its key idea is manifold interpolation between latent keyframes: under a local manifold smoothness assumption, a bundle of frames is solved efficiently with a single code. The method can be implemented as a sliding-window optimization that only requires proper initialization of the first frame, reducing the overall computational burden.
  • results: BundleMoCap outperforms the state of the art without increasing complexity, delivering high-quality motion capture with simplicity and efficiency. More details can be found at https://moverseai.github.io/bundle/.
    Abstract Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap's strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency. More details can be found at https://moverseai.github.io/bundle/.

Similar Document Template Matching Algorithm

  • paper_url: http://arxiv.org/abs/2311.12663
  • repo_url: None
  • paper_authors: Harshitha Yenigalla, Bommareddy Revanth Srinivasa Reddy, Batta Venkata Rahul, Nannapuraju Hemanth Raju
  • for: 这份研究旨在提供一个整体的医疗文书验证方法,整合先进的模板抽出、比较和骗征检测技术。
  • methods: 这个方法包括先进的模板抽出技术,使用区域 интерес (ROI) 方法,包括条件分析和边缘识别。预处理步骤使用形式学操作和自适应阈值,以确保模板的清晰度。模板比较算法使用先进的特征匹配,包括关键点和描述子,并通过对数 histogram-based 分析以增强抗变性。骗征检测则使用 SSIM 量化结构相似性,以帮助确定可能的匹配。
  • results: 这个方法可以实现高度的医疗文书验证精度,涵盖模板抽出、比较、骗征检测和灵活性对多种文书结构的适应。
    Abstract This study outlines a comprehensive methodology for verifying medical documents, integrating advanced techniques in template extraction, comparison, and fraud detection. It begins with template extraction using sophisticated region-of-interest (ROI) methods, incorporating contour analysis and edge identification. Pre-processing steps ensure template clarity through morphological operations and adaptive thresholding. The template comparison algorithm utilizes advanced feature matching with key points and descriptors, enhancing robustness through histogram-based analysis for accounting variations. Fraud detection involves the SSIM computation and OCR for textual information extraction. The SSIM quantifies structural similarity, aiding in potential match identification. OCR focuses on critical areas like patient details, provider information, and billing amounts. Extracted information is compared with a reference dataset, and confidence thresholding ensures reliable fraud detection. Adaptive parameters enhance system flexibility for dynamic adjustments to varying document layouts. This methodology provides a robust approach to medical document verification, addressing complexities in template extraction, comparison, fraud detection, and adaptability to diverse document structures.
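The SSIM stage is easy to reproduce with scikit-image. In the sketch below, the similarity score is turned into a match decision with an assumed confidence threshold, and the per-pixel SSIM map flags dissimilar regions for inspection.

```python
import numpy as np
from skimage.metrics import structural_similarity

def template_match_score(doc_gray, template_gray, threshold=0.85):
    """Compare a document region against a reference template via SSIM.

    Both inputs are 2-D float arrays of equal shape (resize beforehand).
    The 0.85 threshold is an illustrative confidence cutoff, not a tuned value.
    """
    score, ssim_map = structural_similarity(
        doc_gray, template_gray, data_range=1.0, full=True)
    suspicious = ssim_map < 0.5   # per-pixel dissimilarity flags for inspection
    return score >= threshold, score, suspicious

doc = np.random.rand(256, 256)
match, score, suspicious = template_match_score(doc, doc.copy())
print(match, round(score, 3))     # True 1.0
```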

Visually Guided Object Grasping

  • paper_url: http://arxiv.org/abs/2311.12660
  • repo_url: None
  • paper_authors: Radu Horaud, Fadi Dornaika, Bernard Espiau
  • for: This paper presents a visual servoing approach to object grasping and, more generally, to aligning an end-effector with an object.
  • methods: The method of Espiau et al. is first extended to the case of a camera that is not mounted on the robot being controlled, stressing the importance of real-time estimation of the image Jacobian.
  • results: The paper shows how an uncalibrated stereo rig can represent a grasp, or more generally an alignment between two solids, in 3-D projective space. This 3-D projective representation is view-invariant in the sense that it can easily be mapped into an image set-point without any knowledge of the camera parameters.
    Abstract In this paper we present a visual servoing approach to the problem of object grasping and more generally, to the problem of aligning an end-effector with an object. First we extend the method proposed by Espiau et al. [1] to the case of a camera which is not mounted onto the robot being controlled and we stress the importance of the real-time estimation of the image Jacobian. Second, we show how to represent a grasp or more generally, an alignment between two solids in 3-D projective space using an uncalibrated stereo rig. Such a 3-D projective representation is view-invariant in the sense that it can be easily mapped into an image set-point without any knowledge about the camera parameters. Third, we perform an analysis of the performances of the visual servoing algorithm and of the grasping precision that can be expected from this type of approach.
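The classical image-based control law behind such approaches is $v = -\lambda L^{+} e$, where $L$ is the image Jacobian (interaction matrix) of the point features and $e$ the feature error; the abstract stresses estimating this Jacobian in real time. Below is a minimal numpy sketch for point features under a standard pinhole model; the gain and depth values are illustrative.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Image Jacobian of one normalized image point (x, y) at depth Z."""
    return np.array([
        [-1 / Z, 0.0, x / Z, x * y, -(1 + x * x), y],
        [0.0, -1 / Z, y / Z, 1 + y * y, -x * y, -x],
    ])

def servo_velocity(points, desired, depths, gain=0.5):
    """Camera velocity (vx, vy, vz, wx, wy, wz) driving features to their goal.

    points, desired: (N, 2) current / desired normalized image coordinates.
    depths: (N,) feature depths, estimated online in practice.
    """
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])
    e = (points - desired).reshape(-1)
    return -gain * np.linalg.pinv(L) @ e    # v = -lambda * L^+ e

v = servo_velocity(points=np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, -0.1]]),
                   desired=np.zeros((3, 2)), depths=np.array([1.0, 1.0, 1.2]))
```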

Hand-Eye Calibration

  • paper_url: http://arxiv.org/abs/2311.12655
  • repo_url: https://github.com/IFL-CAMP/easy_handeye
  • paper_authors: Radu Horaud, Fadi Dornaika
  • for: This work addresses hand-eye calibration: determining the relationship between a sensor mounted on a robot hand and the hand itself.
  • methods: Two formulations are considered: the classical homogeneous matrix equation AX=XB, and a second formulation MY=M'YB that directly uses the 3x4 perspective matrices (M and M') associated with two camera positions, so the extrinsic and intrinsic camera parameters need not be made explicit. Within a common mathematical framework, two solution methods are developed: (i) rotation then translation, and (ii) a non-linear solver for rotation and translation simultaneously.
  • results: A stability analysis comparing both methods with the classical linear method of Tsai and Lenz (1989) indicates that the non-linear optimization method, which solves for rotation and translation simultaneously, is the most robust to noise and measurement errors.
    Abstract Whenever a sensor is mounted on a robot hand it is important to know the relationship between the sensor and the hand. The problem of determining this relationship is referred to as hand-eye calibration, which is important in at least two types of tasks: (i) map sensor centered measurements into the robot workspace and (ii) allow the robot to precisely move the sensor. In the past some solutions were proposed in the particular case of a camera. With almost no exception, all existing solutions attempt to solve the homogeneous matrix equation AX=XB. First we show that there are two possible formulations of the hand-eye calibration problem. One formulation is the classical one that we just mentioned. A second formulation takes the form of the following homogeneous matrix equation: MY=M'YB. The advantage of the latter is that the extrinsic and intrinsic camera parameters need not be made explicit. Indeed, this formulation directly uses the 3 by 4 perspective matrices (M and M') associated with two positions of the camera. Moreover, this formulation together with the classical one cover a wider range of camera-based sensors to be calibrated with respect to the robot hand. Second, we develop a common mathematical framework to solve for the hand-eye calibration problem using either of the two formulations. We present two methods, (i) a rotation then translation and (ii) a non-linear solver for rotation and translation. Third, we perform a stability analysis both for our two methods and for the classical linear method of Tsai and Lenz (1989). In the light of this comparison, the non-linear optimization method, that solves for rotation and translation simultaneously, seems to be the most robust one with respect to noise and to measurement errors.
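A compact linear solver for the classical formulation $AX=XB$ follows the "rotation then translation" route: stacking the quaternion identity $q_A \otimes q_X = q_X \otimes q_B$ over motion pairs gives a homogeneous system solved by SVD, after which the translation satisfies $(R_A - I)\,t_X = R_X t_B - t_A$. The numpy/scipy sketch below is a generic solver of this kind, not the paper's exact algorithm; in practice quaternion signs must also be kept consistent across pairs.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def quat_mult_matrices(q):
    """Left/right quaternion multiplication matrices for q = (w, x, y, z)."""
    w, x, y, z = q
    L = np.array([[w, -x, -y, -z], [x, w, -z, y], [y, z, w, -x], [z, -y, x, w]])
    R = np.array([[w, -x, -y, -z], [x, w, z, -y], [y, -z, w, x], [z, y, -x, w]])
    return L, R

def solve_ax_xb(As, Bs):
    """As, Bs: lists of 4x4 hand / camera motions. Returns X with AX = XB."""
    M = []
    for A, B in zip(As, Bs):
        # scipy stores quaternions as (x, y, z, w); reorder to (w, x, y, z).
        qa = Rotation.from_matrix(A[:3, :3]).as_quat()[[3, 0, 1, 2]]
        qb = Rotation.from_matrix(B[:3, :3]).as_quat()[[3, 0, 1, 2]]
        # q_A (x) q_X = q_X (x) q_B  <=>  (L(q_A) - R(q_B)) q_X = 0
        M.append(quat_mult_matrices(qa)[0] - quat_mult_matrices(qb)[1])
    q_x = np.linalg.svd(np.vstack(M))[2][-1]          # null-space direction
    R_x = Rotation.from_quat(q_x[[1, 2, 3, 0]]).as_matrix()
    # Translation part of AX = XB: (R_A - I) t_X = R_X t_B - t_A, least squares.
    C = np.vstack([A[:3, :3] - np.eye(3) for A in As])
    d = np.concatenate([R_x @ B[:3, 3] - A[:3, 3] for A, B in zip(As, Bs)])
    t_x = np.linalg.lstsq(C, d, rcond=None)[0]
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_x, t_x
    return X
```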

Polyhedral Object Recognition by Indexing

  • paper_url: http://arxiv.org/abs/2311.12641
  • repo_url: https://github.com/crowdbotics-dev/abd-231102-s1-dev-126419
  • paper_authors: Radu Horaud, Humberto Sossa
  • for: This work addresses the indexing problem in computer vision: recognizing 3-D polyhedral objects from 2-D images without resorting to the classical image-feature-to-object-feature matching paradigm.
  • methods: A novel graph indexing method is introduced, based on a polynomial characterization of binary and weighted graphs combined with hashing, to decide quickly whether a graph extracted from the image is present in a database of model graphs.
  • results: Experimental results demonstrate that the proposed method can effectively recognize 3-D polyhedral objects from 2-D images.
    Abstract In computer vision, the indexing problem is the problem of recognizing a few objects in a large database of objects while avoiding the help of the classical image-feature-to-object-feature matching paradigm. In this paper we address the problem of recognizing 3-D polyhedral objects from 2-D images by indexing. Both the objects to be recognized and the images are represented by weighted graphs. The indexing problem is therefore the problem of determining whether a graph extracted from the image is present or absent in a database of model graphs. We introduce a novel method for performing this graph indexing process which is based both on polynomial characterization of binary and weighted graphs and on hashing. We describe in detail this polynomial characterization and then we show how it can be used in the context of polyhedral object recognition. Next we describe a practical recognition-by-indexing system that includes the organization of the database, the representation of polyhedral objects in terms of 2-D characteristic views, the representation of this views in terms of weighted graphs, and the associated image processing. Finally, some experimental results allow the evaluation of the system performance.
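The polynomial-characterization-plus-hashing idea can be sketched with the characteristic polynomial of the weighted adjacency matrix: its coefficients are invariant to vertex relabeling, so rounding and hashing them yields a database key for constant-time lookup. This is an illustrative stand-in for the paper's specific polynomial characterization, and such a key is a filter rather than a proof, since distinct cospectral graphs can collide.

```python
import numpy as np

def graph_key(adjacency, decimals=6):
    """Permutation-invariant hash key for a weighted graph.

    The characteristic polynomial of the adjacency matrix is preserved by
    vertex relabeling (a similarity transform by a permutation matrix),
    so isomorphic graphs map to the same key.
    """
    coeffs = np.poly(adjacency)        # characteristic polynomial coefficients
    return hash(tuple(np.round(coeffs, decimals)))

# Indexing: hash every model graph once...
database = {}
model = np.array([[0, 1.0, 0], [1.0, 0, 2.0], [0, 2.0, 0]])
database[graph_key(model)] = "polyhedron-17"

# ...then a graph extracted from the image is looked up in O(1);
# here the same graph with vertices relabeled (0, 1, 2) -> (2, 0, 1).
P = np.eye(3)[[2, 0, 1]]
query = P @ model @ P.T
print(database.get(graph_key(query)))   # polyhedron-17
```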

GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

  • paper_url: http://arxiv.org/abs/2311.12631
  • repo_url: None
  • paper_authors: Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen
  • for: To improve text-to-video generation quality, particularly the physical accuracy and temporal coherence of motion.
  • methods: The large language model GPT-4 generates a Blender script from the user's text prompt; Blender's built-in physics engine then produces fundamental scene components encapsulating coherent physical motion across frames, which are fed into Stable Diffusion to generate a video aligned with the prompt.
  • results: On three basic physical motion scenarios (rigid object drop and collision, cloth draping and swinging, and liquid flow), GPT4Motion efficiently generates high-quality videos while maintaining motion coherency and entity consistency.
    Abstract Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently in maintaining motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for future explorations.

Bridging Generalization Gaps in High Content Imaging Through Online Self-Supervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.12623
  • repo_url: None
  • paper_authors: Johan Fredin Haslum, Christos Matsoukas, Karl-Johan Leuchowius, Kevin Smith
  • for: To improve the applicability of High Content Imaging (HCI) in modern drug discovery and development pipelines, from hit identification to candidate drug characterization.
  • methods: The authors propose CODA, an online self-supervised domain adaptation approach that divides the classifier into a generic feature extractor and a task-specific model. The feature extractor's weights are adapted to the new domain with cross-batch self-supervision while the task-specific model remains unchanged.
  • results: This strategy significantly reduces the generalization gap, achieving up to a 300% improvement on data from different labs using different microscopes. CODA can be applied to new, unlabeled out-of-domain data sources of varying size, from a single plate to multiple experimental batches.
    Abstract High Content Imaging (HCI) plays a vital role in modern drug discovery and development pipelines, facilitating various stages from hit identification to candidate drug characterization. Applying machine learning models to these datasets can prove challenging as they typically consist of multiple batches, affected by experimental variation, especially if different imaging equipment have been used. Moreover, as new data arrive, it is preferable that they are analyzed in an online fashion. To overcome this, we propose CODA, an online self-supervised domain adaptation approach. CODA divides the classifier's role into a generic feature extractor and a task-specific model. We adapt the feature extractor's weights to the new domain using cross-batch self-supervision while keeping the task-specific model unchanged. Our results demonstrate that this strategy significantly reduces the generalization gap, achieving up to a 300% improvement when applied to data from different labs utilizing different microscopes. CODA can be applied to new, unlabeled out-of-domain data sources of different sizes, from a single plate to multiple experimental batches.
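The split-and-adapt scheme can be sketched in a few lines: the classifier is divided into a feature extractor and a task head, and on each new unlabeled batch only the extractor is updated with a self-supervised loss while the head stays frozen. The contrastive objective below (augment twice, pull matching views together) is a generic stand-in for the paper's cross-batch self-supervision.

```python
import torch
import torch.nn.functional as F

def adapt_online(extractor, head, unlabeled_batches, augment, lr=1e-4, tau=0.1):
    """Online self-supervised adaptation of the extractor; head stays fixed."""
    for p in head.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(extractor.parameters(), lr=lr)
    for batch in unlabeled_batches:          # arrives online, no labels needed
        z1 = F.normalize(extractor(augment(batch)), dim=1)
        z2 = F.normalize(extractor(augment(batch)), dim=1)
        logits = z1 @ z2.t() / tau           # (B, B) similarity matrix
        labels = torch.arange(len(batch))    # positives lie on the diagonal
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return lambda x: head(extractor(x))      # adapted classifier
```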

Crowd management, crime detection, work monitoring using aiml

  • paper_url: http://arxiv.org/abs/2311.12621
  • repo_url: None
  • paper_authors: P. R. Adithya, Dheepak. S, B. Akash, Harshini. V, Sai Lakshana
  • for: This research harnesses existing Closed-Circuit Television (CCTV) networks for a comprehensive approach to crowd management, crime prevention, and workplace monitoring through the integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies.
  • methods: AI/ML algorithms analyze video feeds in real time, enabling identification and assessment of crowd dynamics, early detection of potential criminal activity, and continuous monitoring of workplace environments.
  • results: By building on existing infrastructure, the project enhances public safety measures and improves organizational productivity without extensive system overhauls.
    Abstract This research endeavors to harness the potential of existing Closed-Circuit Television (CCTV) networks for a comprehensive approach to crowd management, crime prevention, and workplace monitoring through the integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. The primary objective is to develop and implement advanced algorithms capable of real-time analysis of video feeds, enabling the identification and assessment of crowd dynamics, early detection of potential criminal activities, and continuous monitoring of workplace environments. By leveraging AI/ML, the project aims to optimize surveillance capabilities, thereby enhancing public safety measures and improving organizational productivity. This initiative underscores the transformative impact that intelligent video analytics can have on existing infrastructure, mitigating the need for extensive system overhauls while significantly advancing security and operational efficiency.

Leveraging Unlabeled Data for 3D Medical Image Segmentation through Self-Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.12617
  • repo_url: https://github.com/xmindflow/SSL-contrastive
  • paper_authors: Sanaz Karimijafarbigloo, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • for: To improve the accuracy and reliability of 3D semi-supervised segmentation, addressing the limitations of current methods, such as limited contextual information and unreliable pseudo-labels.
  • methods: Two distinct subnetworks are introduced to explore and exploit prediction discrepancies, with a targeted verification training process that corrects erroneous predictions. In addition, a self-supervised contrastive learning paradigm adaptively fine-tunes the network's representational capacity and reduces prediction uncertainty.
  • results: Experiments on organ segmentation from clinical MRI and CT scans show that the approach outperforms state-of-the-art methods. The codebase is accessible on GitHub.
    Abstract Current 3D semi-supervised segmentation methods face significant challenges such as limited consideration of contextual information and the inability to generate reliable pseudo-labels for effective unsupervised data use. To address these challenges, we introduce two distinct subnetworks designed to explore and exploit the discrepancies between them, ultimately correcting the erroneous prediction results. More specifically, we identify regions of inconsistent predictions and initiate a targeted verification training process. This procedure strategically fine-tunes and harmonizes the predictions of the subnetworks, leading to enhanced utilization of contextual information. Furthermore, to adaptively fine-tune the network's representational capacity and reduce prediction uncertainty, we employ a self-supervised contrastive learning paradigm. For this, we use the network's confidence to distinguish between reliable and unreliable predictions. The model is then trained to effectively minimize unreliable predictions. Our experimental results for organ segmentation, obtained from clinical MRI and CT scans, demonstrate the effectiveness of our approach when compared to state-of-the-art methods. The codebase is accessible at https://github.com/xmindflow/SSL-contrastive.

Adaptive Dense Pseudo Label Selection for Semi-supervised Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2311.12608
  • repo_url: None
  • paper_authors: Tong Zhao, Qiang Fang, Shuohao Shi, Xin Xu
  • for: To improve the performance of semi-supervised oriented object detection, a setting common in aerial scenes with multi-oriented and dense objects.
  • methods: The authors propose Adaptive Dense Pseudo Label Selection (ADPLS), a simple but effective adaptive mechanism for selecting dense pseudo-labels. Specifically, a mean Feature-Richness Score (mFRS) estimates the density of potential objects and is used to adjust the number of dense pseudo-labels.
  • results: On the DOTA-v1.5 benchmark, the method outperforms previous methods, especially when labeled data are scarce: with only 5% of annotated data it achieves 49.78 mAP, surpassing the previous state-of-the-art method given 10% of annotated data by 1.15 mAP.
    Abstract Recently, dense pseudo-label, which directly selects pseudo labels from the original output of the teacher model without any complicated post-processing steps, has received considerable attention in semi-supervised object detection (SSOD). However, for the multi-oriented and dense objects that are common in aerial scenes, existing dense pseudo-label selection methods are inefficient and impede the performance in semi-supervised oriented object detection. Therefore, we propose Adaptive Dense Pseudo Label Selection (ADPLS) for semi-supervised oriented object detection. In ADPLS, we design a simple but effective adaptive mechanism to guide the selection of dense pseudo labels. Specifically, we propose the mean Feature-Richness Score (mFRS) to estimate the density of potential objects and use this score to adjust the number of dense pseudo labels. On the DOTA-v1.5 benchmark, the proposed method outperforms previous methods especially when labeled data are scarce. For example, it achieves 49.78 mAP given only 5% of annotated data, which surpasses previous state-of-the-art method given 10% of annotated data by 1.15 mAP. Our codes will be available soon.
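The adaptive selection can be sketched as follows: a feature-richness score averaged over the image estimates how many objects it likely contains, and that score scales how many dense teacher predictions are kept as pseudo-labels. The richness definition used here (maximum class probability per location) is an illustrative proxy, not the paper's exact mFRS.

```python
import torch

def adaptive_dense_selection(cls_scores, base_ratio=0.01):
    """Select top-k dense predictions, with k adapted by feature richness.

    cls_scores: (N, C) per-location class probabilities from the teacher.
    Returns indices of locations kept as dense pseudo-labels.
    """
    richness = cls_scores.max(dim=1).values        # per-location richness proxy
    mfrs = richness.mean()                         # mean Feature-Richness Score
    # A denser image (higher mFRS) keeps proportionally more pseudo-labels.
    k = max(1, int(base_ratio * mfrs * len(cls_scores)))
    return richness.topk(k).indices

teacher_scores = torch.rand(10000, 16).softmax(dim=1)   # toy dense predictions
keep = adaptive_dense_selection(teacher_scores)
```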
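
As a concrete picture of the adaptive mechanism described above: the abstract does not define the mean Feature-Richness Score, so the richness proxy (mean activation magnitude) and the scaling rule in this Python sketch are illustrative assumptions, not the authors' implementation.

```python
import torch

def mean_feature_richness_score(feat: torch.Tensor) -> torch.Tensor:
    """Proxy mFRS: mean channel-wise activation magnitude over the map.

    feat: (C, H, W) feature map from the teacher. The abstract only says
    mFRS estimates the density of potential objects; this proxy is an
    illustrative assumption.
    """
    richness = feat.abs().mean(dim=0)      # (H, W) per-location richness
    return richness.mean()                 # scalar summary

def num_pseudo_labels(feat: torch.Tensor, base_k: int = 100) -> int:
    """Scale the number of dense pseudo labels by the (clamped) mFRS."""
    score = mean_feature_richness_score(feat)
    return max(1, int(base_k * score.clamp(0.0, 2.0).item()))

# toy usage: denser feature responses yield more pseudo labels
feat = torch.randn(256, 32, 32).relu()
print(num_pseudo_labels(feat))
```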

Thinking Outside the Box: Orthogonal Approach to Equalizing Protected Attributes

  • paper_url: http://arxiv.org/abs/2311.14733
  • repo_url: None
  • paper_authors: Jiahui Liu, Xiaohao Cai, Mahesan Niranjan
  • for: This work examines how black-box AI may exacerbate health-related disparities and biases (e.g., gender and ethnicity) in clinical decision-making, and how machine learning can analyze and suppress these confounding effects.
  • methods: A machine-learning-based orthogonal approach combining discriminant dimensionality reduction with orthogonalization of the protected attributes against the primary attribute information, in order to analyze and suppress the effect of protected attributes on disease diagnosis.
  • results: The study shows that the orthogonal approach can realize the impact of protected attributes on disease diagnosis, mitigate undesirable feature correlations, and enhance model prediction performance.
    Abstract There is growing concern that the potential of black box AI may exacerbate health-related disparities and biases such as gender and ethnicity in clinical decision-making. Biased decisions can arise from data availability and collection processes, as well as from the underlying confounding effects of the protected attributes themselves. This work proposes a machine learning-based orthogonal approach aiming to analyze and suppress the effect of the confounder through discriminant dimensionality reduction and orthogonalization of the protected attributes against the primary attribute information. By doing so, the impact of the protected attributes on disease diagnosis can be realized, undesirable feature correlations can be mitigated, and the model prediction performance can be enhanced.
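
A standard way to orthogonalize features against protected attributes is to regress the attributes out and keep the residuals; the numpy sketch below shows that generic projection (the paper's discriminant dimensionality reduction step is not reproduced here):

```python
import numpy as np

def orthogonalize(X: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Remove the component of each feature column explained by the
    protected attribute(s) via least squares.

    X: (n_samples, n_features) primary features
    a: (n_samples, n_protected) protected attributes (e.g., one-hot sex)
    """
    A = np.column_stack([np.ones(len(a)), a])     # include an intercept
    beta, *_ = np.linalg.lstsq(A, X, rcond=None)  # fit X from a
    return X - A @ beta                           # residual is orthogonal to span(A)

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(200, 1)).astype(float)
X = rng.normal(size=(200, 5)) + 0.8 * a           # features leak the attribute
X_orth = orthogonalize(X, a)
print(np.abs(np.corrcoef(X_orth[:, 0], a[:, 0])[0, 1]))  # ~0 after projection
```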

Surgical Temporal Action-aware Network with Sequence Regularization for Phase Recognition

  • paper_url: http://arxiv.org/abs/2311.12603
  • repo_url: None
  • paper_authors: Zhen Chen, Yuhao Zhai, Jun Zhang, Jinqiao Wang
  • for: To help surgeons in the operating theatre by recognizing surgical phases more accurately, towards computer-assisted surgical systems.
  • methods: We propose STAR-Net, which integrates an efficient multi-scale surgical temporal action (MS-STA) module that combines visual features with spatial and temporal knowledge of surgical actions at the cost of 2D networks, and a dual-classifier sequence regularization (DSR) that guides training with the sequence guidance of an auxiliary classifier of smaller capacity.
  • results: STAR-Net achieves state-of-the-art surgical phase recognition on a large-scale gastrectomy surgery dataset and the public Cholec80 benchmark.
    Abstract To assist surgeons in the operating theatre, surgical phase recognition is critical for developing computer-assisted surgical systems, which requires comprehensive understanding of surgical videos. Although existing studies made great progress, there are still two significant limitations worthy of improvement. First, due to the compromise of resource consumption, frame-wise visual features are extracted by 2D networks and disregard spatial and temporal knowledge of surgical actions, which hinders subsequent inter-frame modeling for phase prediction. Second, these works simply utilize ordinary classification loss with one-hot phase labels to optimize the phase predictions, and cannot fully explore surgical videos under inadequate supervision. To overcome these two limitations, we propose a Surgical Temporal Action-aware Network with sequence Regularization, named STAR-Net, to recognize surgical phases more accurately from input videos. Specifically, we propose an efficient multi-scale surgical temporal action (MS-STA) module, which integrates visual features with spatial and temporal knowledge of surgical actions at the cost of 2D networks. Moreover, we devise the dual-classifier sequence regularization (DSR) to facilitate the training of STAR-Net by the sequence guidance of an auxiliary classifier with a smaller capacity. Our STAR-Net with MS-STA and DSR can exploit visual features of surgical actions with effective regularization, thereby leading to the superior performance of surgical phase recognition. Extensive experiments on a large-scale gastrectomy surgery dataset and the public Cholec80 benchmark prove that our STAR-Net significantly outperforms state-of-the-arts of surgical phase recognition.

TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction using Vision-Based Tactile Sensing

  • paper_url: http://arxiv.org/abs/2311.12602
  • repo_url: None
  • paper_authors: Mauro Comi, Yijiong Lin, Alex Church, Alessio Tonioni, Laurence Aitchison, Nathan F. Lepora
  • for: This paper explores how data-driven approaches can reconstruct the 3D shape of objects using high-resolution vision-based tactile sensing.
  • methods: A deep-learning approach to tactile 3D shape reconstruction that combines the rich local information provided by a vision-based tactile sensor with the expressivity of the implicit neural representation DeepSDF.
  • results: The method reconstructs smooth, accurate 3D shapes from tactile inputs in both simulation and real-world settings, and is robust across different shapes and conditions, opening avenues for 3D-aware representations and multimodal perception.
    Abstract Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a Deep Learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available at: https://touchsdf.github.io/
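
The second component follows the DeepSDF pattern the abstract names: an MLP conditioned on a latent code maps a 3D query point to a signed distance. A minimal PyTorch sketch, where the layer sizes, latent dimension, and Tanh output bound are assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    """DeepSDF-style decoder: f(latent, xyz) -> signed distance."""
    def __init__(self, latent_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),   # bounded SDF values
        )

    def forward(self, latent: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        z = latent.expand(xyz.shape[0], -1)    # share one code per object
        return self.net(torch.cat([z, xyz], dim=-1)).squeeze(-1)

model = SDFNet()
latent = torch.randn(1, 128)                   # per-object shape code
xyz = torch.rand(1024, 3) * 2 - 1              # query points in [-1, 1]^3
sdf = model(latent, xyz)                       # (1024,) signed distances
print(sdf.shape)
```

The zero level set of such a function can then be meshed (e.g., with marching cubes) to recover the surface.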

Deep learning-based detection of morphological features associated with hypoxia in H&E breast cancer whole slide images

  • paper_url: http://arxiv.org/abs/2311.12601
  • repo_url: None
  • paper_authors: Petru Manescu, Joseph Geradts, Delmiro Fernandez-Reyes
  • for: To demonstrate the use of deep learning for evaluating hypoxia in breast cancer histomorphology.
  • methods: Weakly supervised deep learning (WSDL) models are used to detect hypoxia-associated features in routine Hematoxylin and Eosin (H&E) whole slide images (WSI).
  • results: The WSDL models accurately detect hypoxia-associated features in H&E WSI, with an average AUC of 0.87 on a left-out test set, and reveal significant differences between features of hypoxic and normoxic tissue regions.
    Abstract Hypoxia occurs when tumour cells outgrow their blood supply, leading to regions of low oxygen levels within the tumour. Calculating hypoxia levels can be an important step in understanding the biology of tumours, their clinical progression and response to treatment. This study demonstrates a novel application of deep learning to evaluate hypoxia in the context of breast cancer histomorphology. More precisely, we show that Weakly Supervised Deep Learning (WSDL) models can accurately detect hypoxia associated features in routine Hematoxylin and Eosin (H&E) whole slide images (WSI). We trained and evaluated a deep Multiple Instance Learning model on tiles from WSI H&E tissue from breast cancer primary sites (n=240) obtaining on average an AUC of 0.87 on a left-out test set. We also showed significant differences between features of hypoxic and normoxic tissue regions as distinguished by the WSDL models. Such DL hypoxia H&E WSI detection models could potentially be extended to other tumour types and easily integrated into the pathology workflow without requiring additional costly assays.
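
The abstract names a "deep Multiple Instance Learning model" over WSI tiles without further detail; a common instantiation is attention-based MIL pooling, sketched here with assumed feature and attention dimensions:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Bag of tile embeddings -> slide-level hypoxia probability."""
    def __init__(self, feat_dim: int = 512, attn_dim: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, feat_dim) embeddings from a CNN backbone
        w = torch.softmax(self.attn(tiles), dim=0)      # (n_tiles, 1) weights
        slide = (w * tiles).sum(dim=0)                  # attention pooling
        return torch.sigmoid(self.classifier(slide))    # slide-level score

bag = torch.randn(300, 512)       # 300 tiles from one WSI
print(AttentionMIL()(bag).item())
```

Only a slide-level label is needed for training, which is what makes the setup weakly supervised.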

HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.12588
  • repo_url: None
  • paper_authors: Yongliang Lin, Yongzhi Su, Praveen Nathan, Sandeep Inuganti, Yan Di, Martin Sundermeyer, Fabian Manhardt, Didier Stricker, Jason Rambach, Yu Zhang
  • for: A new method for highly accurate 6DoF object pose estimation from a single RGB-D image.
  • methods: HiPose establishes 3D-3D correspondences in a coarse-to-fine manner with a hierarchical binary surface encoding. Unlike previous dense-correspondence methods, it uses point-to-surface matching and iteratively constricts the surface over multiple levels until it becomes a correspondence point, while gradually removing outliers.
  • results: Extensive experiments on the public LM-O, YCB-V, and T-Less benchmarks show that the method surpasses all refinement-free methods and is even on par with far more expensive refinement-based approaches, while remaining computationally efficient enough for real-time applications with high accuracy requirements.
    Abstract In this work, we present a novel dense-correspondence method for 6DoF object pose estimation from a single RGB-D image. While many existing data-driven methods achieve impressive performance, they tend to be time-consuming due to their reliance on rendering-based refinement approaches. To circumvent this limitation, we present HiPose, which establishes 3D-3D correspondences in a coarse-to-fine manner with a hierarchical binary surface encoding. Unlike previous dense-correspondence methods, we estimate the correspondence surface by employing point-to-surface matching and iteratively constricting the surface until it becomes a correspondence point while gradually removing outliers. Extensive experiments on public benchmarks LM-O, YCB-V, and T-Less demonstrate that our method surpasses all refinement-free methods and is even on par with expensive refinement-based approaches. Crucially, our approach is computationally efficient and enables real-time critical applications with high accuracy requirements. Code and models will be released.

A Region of Interest Focused Triple UNet Architecture for Skin Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2311.12581
  • repo_url: None
  • paper_authors: Guoqing Liu, Yu Guo, Caiying Wu, Guoqing Chen, Barintag Saheya, Qiyu Jin
  • for: automatic skin lesion segmentation
  • methods: Triple-UNet architecture with region of interest enhancement module (ROIE)
  • results: outperforms state-of-the-art on skin lesion segmentation
    Abstract Skin lesion segmentation is of great significance for skin lesion analysis and subsequent treatment. It is still a challenging task due to the irregular and fuzzy lesion borders, and diversity of skin lesions. In this paper, we propose Triple-UNet to automatically segment skin lesions. It is an organic combination of three UNet architectures with suitable modules. In order to concatenate the first and second sub-networks more effectively, we design a region of interest enhancement module (ROIE). The ROIE enhances the target object region of the image by using the predicted score map of the first UNet. The features learned by the first UNet and the enhanced image help the second UNet obtain a better score map. Finally, the results are fine-tuned by the third UNet. We evaluate our algorithm on a publicly available dataset of skin lesion segmentation. Experiments show that Triple-UNet outperforms the state-of-the-art on skin lesion segmentation.
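
The abstract says only that ROIE "enhances the target object region of the image by using the predicted score map of the first UNet"; one plausible minimal reading, offered purely as an assumption, is to modulate the input with the normalized score map before the second UNet:

```python
import torch

def roie_enhance(image: torch.Tensor, score_map: torch.Tensor) -> torch.Tensor:
    """Brighten likely-lesion regions before the second UNet.

    image:     (B, C, H, W) input image
    score_map: (B, 1, H, W) logits predicted by the first UNet
    """
    p = torch.sigmoid(score_map)         # lesion probability in [0, 1]
    return image * (1.0 + p)             # emphasize the ROI, keep background

img = torch.rand(2, 3, 128, 128)
logits = torch.randn(2, 1, 128, 128)
print(roie_enhance(img, logits).shape)   # same shape, ROI amplified
```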

Multi-Resolution Planar Region Extraction for Uneven Terrains

  • paper_url: http://arxiv.org/abs/2311.12562
  • repo_url: None
  • paper_authors: Yinghan Sun, Linfang Zheng, Hua Chen, Wei Zhang
  • for: Extracting planar regions in uneven terrains from unordered point cloud measurements, a problem critical to robotic applications such as perceptive locomotion.
  • methods: A multi-resolution planar region extraction strategy that balances boundary accuracy against computational efficiency. The pipeline consists of a pointwise classification preprocessing module, which categorizes all sampled points according to their local geometric properties, followed by an octree-based multi-resolution plane segmentation module.
  • results: Synthetic and real-world experiments show efficient and robust planar region extraction that generalizes across various uneven terrains while maintaining real-time performance, with frame rates exceeding 35 FPS.
    Abstract This paper studies the problem of extracting planar regions in uneven terrains from unordered point cloud measurements. Such a problem is critical in various robotic applications such as robotic perceptive locomotion. While existing approaches have shown promising results in effectively extracting planar regions from the environment, they often suffer from issues such as low computational efficiency or loss of resolution. To address these issues, we propose a multi-resolution planar region extraction strategy in this paper that balances the accuracy in boundaries and computational efficiency. Our method begins with a pointwise classification preprocessing module, which categorizes all sampled points according to their local geometric properties to facilitate multi-resolution segmentation. Subsequently, we arrange the categorized points using an octree, followed by an in-depth analysis of nodes to finish multi-resolution plane segmentation. The efficiency and robustness of the proposed approach are verified via synthetic and real-world experiments, demonstrating our method's ability to generalize effectively across various uneven terrains while maintaining real-time performance, achieving frame rates exceeding 35 FPS.
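
A standard local geometric property for the pointwise classification stage is the surface-variation ratio of the eigenvalues of each point's neighborhood covariance; the exact features and thresholds the paper uses are not given in the abstract, so this numpy sketch is a generic illustration:

```python
import numpy as np

def planarity(neighbors: np.ndarray) -> float:
    """Surface variation lambda_min / (l1 + l2 + l3); near 0 for planar patches.

    neighbors: (k, 3) local neighborhood of a query point.
    """
    cov = np.cov(neighbors.T)                 # 3x3 local covariance
    eig = np.sort(np.linalg.eigvalsh(cov))    # ascending eigenvalues
    return eig[0] / max(eig.sum(), 1e-12)

rng = np.random.default_rng(0)
plane = rng.uniform(size=(50, 3))
plane[:, 2] = 0.01 * rng.normal(size=50)      # nearly flat neighborhood
blob = rng.normal(size=(50, 3))               # isotropic neighborhood
print(planarity(plane), planarity(blob))      # small vs large
```

Points scored this way can then be routed into octree cells at coarser or finer resolution before the region-growing step.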

Convolutional Neural Networks for Neuroimaging in Parkinson’s Disease: Is Preprocessing Needed?

  • paper_url: http://arxiv.org/abs/2311.12561
  • repo_url: None
  • paper_authors: Francisco J. Martinez-Murcia, Juan M. Górriz, Javier Ramírez, Andrés Ortiz
  • for: To investigate the effectiveness of Convolutional Neural Networks (CNNs) in accounting for spatial and intensity differences in nuclear brain imaging, and to determine whether spatial and intensity normalization is still necessary for accurate diagnosis.
  • methods: The authors trained four different CNN models based on well-established architectures, using or not different spatial and intensity normalization preprocessing.
  • results: A sufficiently complex model such as the three-dimensional version of the ALEXNET can effectively account for spatial differences, achieving a diagnosis accuracy of 94.1% with an area under the ROC curve of 0.984; however, the intensity normalization and its type are revealed as very influential in the results and accuracy of the trained model.
    Abstract Spatial and intensity normalization are nowadays a prerequisite for neuroimaging analysis. Influenced by voxel-wise and other univariate comparisons, where these corrections are key, they are commonly applied to any type of analysis and imaging modalities. Nuclear imaging modalities such as PET-FDG or FP-CIT SPECT, a common modality used in Parkinson's Disease diagnosis, are especially dependent on intensity normalization. However, these steps are computationally expensive and furthermore, they may introduce deformations in the images, altering the information contained in them. Convolutional Neural Networks (CNNs), for their part, introduce position invariance to pattern recognition, and have been proven to classify objects regardless of their orientation, size, angle, etc. Therefore, a question arises: how well can CNNs account for spatial and intensity differences when analysing nuclear brain imaging? Are spatial and intensity normalization still needed? To answer this question, we have trained four different CNN models based on well-established architectures, using or not different spatial and intensity normalization preprocessing. The results show that a sufficiently complex model such as our three-dimensional version of the ALEXNET can effectively account for spatial differences, achieving a diagnosis accuracy of 94.1% with an area under the ROC curve of 0.984. The visualization of the differences via saliency maps shows that these models are correctly finding patterns that match those found in the literature, without the need of applying any complex spatial normalization procedure. However, the intensity normalization -- and its type -- is revealed as very influential in the results and accuracy of the trained model, and therefore must be well accounted.
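
For context on why the type of intensity normalization matters: FP-CIT SPECT scans are often rescaled against a reference region with low specific binding so that intensities become comparable across subjects. The reference mask and rescaling rule in this sketch are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def intensity_normalize(volume: np.ndarray, ref_mask: np.ndarray) -> np.ndarray:
    """Divide a scan by the mean uptake inside a non-specific reference region."""
    ref_mean = volume[ref_mask].mean()
    return volume / max(ref_mean, 1e-12)

vol = np.random.rand(16, 16, 16) * 100.0
mask = np.zeros_like(vol, dtype=bool)
mask[:4] = True                          # toy reference region
norm = intensity_normalize(vol, mask)
print(norm[mask].mean())                 # ~1.0 by construction
```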

Benchmarking bias: Expanding clinical AI model card to incorporate bias reporting of social and non-social factors

  • paper_url: http://arxiv.org/abs/2311.12560
  • repo_url: None
  • paper_authors: Carolina A. M. Heming, Mohamed Abdalla, Monish Ahluwalia, Linglin Zhang, Hari Trivedi, MinJae Woo, Benjamin Fine, Judy Wawira Gichoya, Leo Anthony Celi, Laleh Seyyed-Kalantari
  • for: To expand clinical AI model reporting cards to incorporate broad bias reporting of both social and non-social factors.
  • methods: Analysis of AI model bias that, beyond social attributes, considers non-social factors such as disease-dependent, anatomic, and instrument factors, which are essential to ensure safe deployment.
  • results: The paper proposes an expanded clinical AI model card and, through consideration of multiple diseases and different AI models, identifies factors that can lead to AI model bias.
    Abstract Clinical AI model reporting cards should be expanded to incorporate a broad bias reporting of both social and non-social factors. Non-social factors consider the role of other factors, such as disease dependent, anatomic, or instrument factors on AI model bias, which are essential to ensure safe deployment.

“HoVer-UNet”: Accelerating HoVerNet with UNet-based multi-class nuclei segmentation via knowledge distillation

  • paper_url: http://arxiv.org/abs/2311.12553
  • repo_url: https://github.com/diagnijmegen/hover-unet
  • paper_authors: Cristian Tommasino, Cristiano Russo, Antonio Maria Rinaldi, Francesco Ciompi
  • for: To distill the knowledge of the multi-branch HoVerNet framework for nuclei instance segmentation and classification in histopathology.
  • methods: A compact, streamlined single UNet with a Mix Vision Transformer backbone, equipped with a custom loss function to optimally encode the distilled knowledge of HoVerNet.
  • results: On the public PanNuke and Consep datasets, the model achieves results comparable to HoVerNet with a three-fold reduction in inference time.
    Abstract We present "HoVer-UNet", an approach to distill the knowledge of the multi-branch HoVerNet framework for nuclei instance segmentation and classification in histopathology. We propose a compact, streamlined single UNet network with a Mix Vision Transformer backbone, and equip it with a custom loss function to optimally encode the distilled knowledge of HoVerNet, reducing computational requirements without compromising performances. We show that our model achieved results comparable to HoVerNet on the public PanNuke and Consep datasets with a three-fold reduction in inference time. We make the code of our model publicly available at https://github.com/DIAGNijmegen/HoVer-UNet.
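
The custom distillation loss is not spelled out in the abstract; a common recipe for dense-prediction distillation combines a supervised term on ground-truth labels with a teacher-matching term, as in this hedged sketch (the weighting and the choice of matched maps are assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_maps, teacher_maps, target, alpha=0.5):
    """Supervised CE on ground-truth nuclei classes plus MSE to the
    teacher's (HoVerNet-style) output maps.

    student_maps / teacher_maps: (B, C, H, W) dense predictions
    target: (B, H, W) integer class labels
    """
    ce = F.cross_entropy(student_maps, target)
    kd = F.mse_loss(student_maps, teacher_maps.detach())  # teacher is frozen
    return alpha * ce + (1 - alpha) * kd

s = torch.randn(2, 6, 64, 64, requires_grad=True)
t = torch.randn(2, 6, 64, 64)
y = torch.randint(0, 6, (2, 64, 64))
print(distillation_loss(s, t, y).item())
```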

GMISeg: General Medical Image Segmentation without Re-Training

  • paper_url: http://arxiv.org/abs/2311.12539
  • repo_url: None
  • paper_authors: Jing Xu
  • for: Solving new medical image segmentation tasks without requiring additional training.
  • methods: A novel low-rank fine-tuning strategy applied to the SAM (Segment Anything Model) image encoder, working together with the prompt encoder and mask decoder; given an example set of images and prompts defining a new task, the labeled data are fine-tuned without retraining from scratch.
  • results: GMISeg outperforms the latest methods on unknown tasks, with a comprehensive analysis and summary of its performance.
    Abstract Although deep learning models have become the main method for medical image segmentation, they often cannot be extended to unknown segmentation tasks involving new anatomical structures, image shapes, or labels. For new segmentation tasks, researchers often have to retrain or fine-tune the model, which is time-consuming and poses a significant obstacle to clinical researchers, who often lack the resources and professional knowledge to train neural networks. Therefore, we proposed a general method that can solve unknown medical image segmentation tasks without requiring additional training. Given an example set of images and prompts for defining new segmentation tasks, GMISeg applies a novel low-rank fine-tuning strategy based on the proposed approach to the SAM (Segment Anything Model) image encoder, and works with the prompt encoder and mask decoder to fine-tune the labeled dataset without the need for additional training. To achieve generalization of new tasks, we used medical image datasets with different imaging modes for different parts. We trained and generalized GMISeg on a different set of anatomical and imaging modes using cardiac images on other site datasets. We have demonstrated that GMISeg outperforms the latest methods on unknown tasks and have conducted a comprehensive analysis and summary of the important performance of the proposed method.
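
Low-rank fine-tuning of a frozen encoder is typically done LoRA-style: keep the pretrained weight fixed and learn a rank-r additive update. The sketch shows the pattern on a single linear layer; where exactly GMISeg inserts such adapters in the SAM image encoder is not stated in the abstract, so the placement, rank, and scaling below are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + scale * (B A) x, with W frozen and A, B low-rank trainable."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pretrained weights fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)                # torch.Size([2, 768])
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one and only drifts as far as the new task requires.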

Hyb-NeRF: A Multiresolution Hybrid Encoding for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.12490
  • repo_url: None
  • paper_authors: Yifan Wang, Yi Gong, Yuan Zeng
  • for: High-fidelity scene reconstruction and novel view synthesis.
  • methods: A neural radiance field with a multi-resolution hybrid encoding: memory-efficient learnable positional features at coarse resolutions, fast-to-optimize hash-based feature grids with local detail at fine resolutions, and cone-tracing-based features embedded in the learnable positional encoding to eliminate encoding ambiguity and reduce aliasing artifacts.
  • results: Faster rendering with better image quality and even a lower memory footprint than previous state-of-the-art methods.
    Abstract Recent advances in Neural radiance fields (NeRF) have enabled high-fidelity scene reconstruction for novel view synthesis. However, NeRF requires hundreds of network evaluations per pixel to approximate a volume rendering integral, making it slow to train. Caching NeRFs into explicit data structures can effectively enhance rendering speed but at the cost of higher memory usage. To address these issues, we present Hyb-NeRF, a novel neural radiance field with a multi-resolution hybrid encoding that achieves efficient neural modeling and fast rendering, which also allows for high-quality novel view synthesis. The key idea of Hyb-NeRF is to represent the scene using different encoding strategies from coarse-to-fine resolution levels. Hyb-NeRF exploits memory-efficiency learnable positional features at coarse resolutions and the fast optimization speed and local details of hash-based feature grids at fine resolutions. In addition, to further boost performance, we embed cone tracing-based features in our learnable positional encoding that eliminates encoding ambiguity and reduces aliasing artifacts. Extensive experiments on both synthetic and real-world datasets show that Hyb-NeRF achieves faster rendering speed with better rending quality and even a lower memory footprint in comparison to previous state-of-the-art methods.
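
The fine-resolution half of the hybrid encoding is a hash-based feature grid in the Instant-NGP tradition: each query point is snapped to grid vertices whose hashed indices address a small learnable table. This single-level sketch uses nearest-vertex lookup instead of trilinear interpolation for brevity; the table size, feature width, and hash constants are conventional choices, not the paper's:

```python
import torch
import torch.nn as nn

class HashGridLevel(nn.Module):
    """One resolution level of a hash-encoded feature grid."""
    def __init__(self, table_size: int = 2**16, feat_dim: int = 2, res: int = 64):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-4)
        self.res = res

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz in [0, 1]^3 -> nearest grid vertex -> hashed table index
        x, y, z = (xyz * self.res).long().unbind(-1)
        h = (x ^ (y * 2654435761) ^ (z * 805459861)) % self.table.shape[0]
        return self.table[h]                       # (N, feat_dim) features

enc = HashGridLevel()
feats = enc(torch.rand(1024, 3))
print(feats.shape)                                 # torch.Size([1024, 2])
```

In the full model, features from several such levels are concatenated with the coarse learnable positional features before the radiance MLP.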

HCA-Net: Hierarchical Context Attention Network for Intervertebral Disc Semantic Labeling

  • paper_url: http://arxiv.org/abs/2311.12486
  • repo_url: https://github.com/xmindflow/hca-net
  • paper_authors: Afshin Bozorgpour, Bobby Azad, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • for: Accurate, automated labeling of intervertebral discs (IVDs) in spine imaging, for assessing spine-related disorders such as osteoporosis, vertebral fractures, or IVD herniation.
  • methods: HCA-Net, a novel hierarchical contextual attention network architecture with a special focus on exploiting prior geometric information. It processes features across different scales and consolidates them to capture fine spatial relationships within the spinal cord, models IVD labeling as a pose estimation problem, and adds a skeletal loss term that constrains predictions to the general structure of the human vertebral skeleton.
  • results: On multi-center spine datasets, HCA-Net consistently outperforms previous state-of-the-art methods on both MRI T1w and T2w modalities.
    Abstract Accurate and automated segmentation of intervertebral discs (IVDs) in medical images is crucial for assessing spine-related disorders, such as osteoporosis, vertebral fractures, or IVD herniation. We present HCA-Net, a novel contextual attention network architecture for semantic labeling of IVDs, with a special focus on exploiting prior geometric information. Our approach excels at processing features across different scales and effectively consolidating them to capture the intricate spatial relationships within the spinal cord. To achieve this, HCA-Net models IVD labeling as a pose estimation problem, aiming to minimize the discrepancy between each predicted IVD location and its corresponding actual joint location. In addition, we introduce a skeletal loss term to reinforce the model's geometric dependence on the spine. This loss function is designed to constrain the model's predictions to a range that matches the general structure of the human vertebral skeleton. As a result, the network learns to reduce the occurrence of false predictions and adaptively improves the accuracy of IVD location estimation. Through extensive experimental evaluation on multi-center spine datasets, our approach consistently outperforms previous state-of-the-art methods on both MRI T1w and T2w modalities. The codebase is accessible to the public on \href{https://github.com/xmindflow/HCA-Net}{GitHub}.

MaskFlow: Object-Aware Motion Estimation

  • paper_url: http://arxiv.org/abs/2311.12476
  • repo_url: None
  • paper_authors: Aria Ahmadi, David R. Walton, Tim Atherton, Cagatay Dikici
  • for: MaskFlow, a novel motion estimation method that estimates accurate motion fields even in very challenging cases with small objects, large displacements, and drastic appearance changes.
  • methods: In addition to the lower-level features used by other DNN-based motion estimation methods, MaskFlow draws on object-level features and segmentations, which are used to approximate the objects' translation motion fields; the incomplete translation motion field is then incorporated into a subsequent motion estimation network for refinement and completion.
  • results: MaskFlow outperforms state-of-the-art methods on a newly produced challenging synthetic dataset (with motion field ground truth plus extra object-instance matching and segmentation annotations), while producing comparable results on the popular FlyingThings3D benchmark.
    Abstract We introduce a novel motion estimation method, MaskFlow, that is capable of estimating accurate motion fields, even in very challenging cases with small objects, large displacements and drastic appearance changes. In addition to lower-level features, that are used in other Deep Neural Network (DNN)-based motion estimation methods, MaskFlow draws from object-level features and segmentations. These features and segmentations are used to approximate the objects' translation motion field. We propose a novel and effective way of incorporating the incomplete translation motion field into a subsequent motion estimation network for refinement and completion. We also produced a new challenging synthetic dataset with motion field ground truth, and also provide extra ground truth for the object-instance matchings and corresponding segmentation masks. We demonstrate that MaskFlow outperforms state of the art methods when evaluated on our new challenging dataset, whilst still producing comparable results on the popular FlyingThings3D benchmark dataset.

GLAD: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap

  • paper_url: http://arxiv.org/abs/2311.12467
  • repo_url: https://github.com/khuvll/glad
  • paper_authors: Hyogun Lee, Kyungho Bae, Seong Jong Ha, Yumin Ko, Gyeong-Moon Park, Jinwoo Choi
  • for: Unsupervised video domain adaptation (UVDA) for action recognition, specifically in scenarios with a substantial domain gap.
  • methods: A global-local view alignment approach to handle the temporal shift (action duration differences between source and target domains), together with temporal order learning for temporal-order-sensitive representations and background augmentation for background-invariant representations, to mitigate the background shift.
  • results: Experiments show significant improvement over existing methods on the newly introduced Kinetics->BABEL scenario with a large domain gap. Code is available at https://github.com/KHUVLL/GLAD.
    Abstract In this work, we tackle the challenging problem of unsupervised video domain adaptation (UVDA) for action recognition. We specifically focus on scenarios with a substantial domain gap, in contrast to existing works primarily deal with small domain gaps between labeled source domains and unlabeled target domains. To establish a more realistic setting, we introduce a novel UVDA scenario, denoted as Kinetics->BABEL, with a more considerable domain gap in terms of both temporal dynamics and background shifts. To tackle the temporal shift, i.e., action duration difference between the source and target domains, we propose a global-local view alignment approach. To mitigate the background shift, we propose to learn temporal order sensitive representations by temporal order learning and background invariant representations by background augmentation. We empirically validate that the proposed method shows significant improvement over the existing methods on the Kinetics->BABEL dataset with a large domain gap. The code is available at https://github.com/KHUVLL/GLAD.

HiFi-Syn: Hierarchical Granularity Discrimination for High-Fidelity Synthesis of MR Images with Structure Preservation

  • paper_url: http://arxiv.org/abs/2311.12461
  • repo_url: None
  • paper_authors: Ziqi Yu, Botao Zhao, Shengjie Zhang, Xiang Chen, Jianfeng Feng, Tingying Peng, Xiao-Yong Zhang
  • for: Synthesizing medical images while preserving the structural information that matters in medical research.
  • methods: A hierarchical granularity discrimination strategy that exploits various levels of semantic information in medical images: pixel-level discrimination using a Brain Memory Bank, structure-level discrimination on each brain structure with a re-weighting strategy to focus on hard samples, and global-level discrimination to ensure anatomical consistency during translation.
  • results: Image translation performance evaluated on three independent datasets (UK Biobank, IXI, and BraTS 2018) outperforms state-of-the-art algorithms. Notably, the model handles not only normal structures but also abnormal (pathological) structures such as brain tumors, despite contrast variations across imaging modalities, and radiologists evaluated the diagnostic value of the synthesized MR images containing brain tumors.
    Abstract Synthesizing medical images while preserving their structural information is crucial in medical research. In such scenarios, the preservation of anatomical content becomes especially important. Although recent advances have been made by incorporating instance-level information to guide translation, these methods overlook the spatial coherence of structural-level representation and the anatomical invariance of content during translation. To address these issues, we introduce hierarchical granularity discrimination, which exploits various levels of semantic information present in medical images. Our strategy utilizes three levels of discrimination granularity: pixel-level discrimination using a Brain Memory Bank, structure-level discrimination on each brain structure with a re-weighting strategy to focus on hard samples, and global-level discrimination to ensure anatomical consistency during translation. The image translation performance of our strategy has been evaluated on three independent datasets (UK Biobank, IXI, and BraTS 2018), and it has outperformed state-of-the-art algorithms. Particularly, our model excels not only in synthesizing normal structures but also in handling abnormal (pathological) structures, such as brain tumors, despite the variations in contrast observed across different imaging modalities due to their pathological characteristics. The diagnostic value of synthesized MR images containing brain tumors has been evaluated by radiologists. This indicates that our model may offer an alternative solution in scenarios where specific MR modalities of patients are unavailable. Extensive experiments further demonstrate the versatility of our method, providing unique insights into medical image translation.

Learning Site-specific Styles for Multi-institutional Unsupervised Cross-modality Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.12437
  • repo_url: https://github.com/medicl-vu/crossmoda2023
  • paper_authors: Han Liu, Yubo Fan, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz
  • for: Unsupervised cross-modality domain adaptation in medical image analysis, made harder when source and target domain data are collected from multiple institutions.
  • methods: Unpaired image translation to convert source-domain images to the target modality, with a dynamic network that generates synthetic target-domain images with controllable, site-specific styles; a segmentation model is then trained on the synthetic images and the domain gap is further reduced by self-training.
  • results: The solution achieved 1st place during both the validation and testing phases of the crossMoDA 2023 challenge. The code repository is publicly available at https://github.com/MedICL-VU/crossmoda2023.
    Abstract Unsupervised cross-modality domain adaptation is a challenging task in medical image analysis, and it becomes more challenging when source and target domain data are collected from multiple institutions. In this paper, we present our solution to tackle the multi-institutional unsupervised domain adaptation for the crossMoDA 2023 challenge. First, we perform unpaired image translation to translate the source domain images to the target domain, where we design a dynamic network to generate synthetic target domain images with controllable, site-specific styles. Afterwards, we train a segmentation model using the synthetic images and further reduce the domain gap by self-training. Our solution achieved the 1st place during both the validation and testing phases of the challenge. The code repository is publicly available at https://github.com/MedICL-VU/crossmoda2023.

AR Visualization System for Ship Detection and Recognition Based on AI

  • paper_url: http://arxiv.org/abs/2311.12430
  • repo_url: None
  • paper_authors: Ziqi Ye, Limin Huang, Yongji Wu, Min Hu
  • for: An AR visualization system for ship detection and recognition based on artificial intelligence and augmented reality technology.
  • methods: The system consists of three parts: an artificial intelligence module, a Unity development module, and a Hololens2 AR module. Ship detection and recognition in remote sensing images is based on the R3Det algorithm, with the model trained on an RTX 2080Ti reaching a recognition rate of 96%. A 3D model of each ship is then generated in the virtual scene from the recognized category and information, a voice module and UI interaction module are added, and the project is deployed to Hololens2 through MRTK.
  • results: The system realizes the fusion of computer vision and augmented reality by mapping object detection results into the AR scene, a step toward future technological trends and intelligent applications.
    Abstract Augmented reality technology has been widely used in industrial design interaction, exhibition guide, information retrieval and other fields. The combination of artificial intelligence and augmented reality technology has also become a future development trend. This project is an AR visualization system for ship detection and recognition based on AI, which mainly includes three parts: artificial intelligence module, Unity development module and Hololens2AR module. This project is based on R3Det algorithm to complete the detection and recognition of ships in remote sensing images. The recognition rate of model detection trained on RTX 2080Ti can reach 96%. Then, the 3D model of the ship is obtained by ship categories and information and generated in the virtual scene. At the same time, voice module and UI interaction module are added. Finally, we completed the deployment of the project on Hololens2 through MRTK. The system realizes the fusion of computer vision and augmented reality technology, which maps the results of object detection to the AR field, and makes a brave step toward the future technological trend and intelligent application.

Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency

  • paper_url: http://arxiv.org/abs/2311.12421
  • repo_url: None
  • paper_authors: Christian Keilstrup Ingwersen, Anders Bjorholm Dahl, Janus Nørtoft Jensen, Morten Rieger Hannemose
  • for: Improving monocular 3D human pose estimation by fine-tuning with multiview data when no 3D ground truth is available.
  • methods: A novel loss function, multiview consistency, which enables adding training data with only 2D supervision by requiring that the 3D pose inferred from one view aligns with the 3D pose inferred from another view under similarity transformations.
  • results: Experiments show that two views offset by 90 degrees suffice for good performance, with only marginal improvements from adding more views. This enables acquiring domain-specific data with off-the-shelf cameras, eliminating elaborate calibration procedures, and offers a practical, cost-effective route to domain adaptation in 3D pose estimation for specific applications.
    Abstract Deducing a 3D human pose from a single 2D image or 2D keypoints is inherently challenging, given the fundamental ambiguity wherein multiple 3D poses can correspond to the same 2D representation. The acquisition of 3D data, while invaluable for resolving pose ambiguity, is expensive and requires an intricate setup, often restricting its applicability to controlled lab environments. We improve performance of monocular human pose estimation models using multiview data for fine-tuning. We propose a novel loss function, multiview consistency, to enable adding additional training data with only 2D supervision. This loss enforces that the inferred 3D pose from one view aligns with the inferred 3D pose from another view under similarity transformations. Our consistency loss substantially improves performance for fine-tuning with no available 3D data. Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements by adding more views. Thus, we enable the acquisition of domain-specific data by capturing activities with off-the-shelf cameras, eliminating the need for elaborate calibration procedures. This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective solution to customize models for specific applications. The used dataset, featuring additional views, will be made publicly available.
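
Alignment "under similarity transformations" has a closed form via Procrustes analysis, and the consistency loss can be read as the residual after optimally aligning the two predicted skeletons. The numpy sketch below illustrates the geometry without gradients; the paper's differentiable formulation may differ:

```python
import numpy as np

def similarity_align(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Align P to Q with the best scale/rotation/translation (Procrustes)."""
    muP, muQ = P.mean(0), Q.mean(0)
    P0, Q0 = P - muP, Q - muQ
    U, S, Vt = np.linalg.svd(P0.T @ Q0)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                       # avoid reflections
        Vt[-1] *= -1
        R = (U @ Vt).T
    s = S.sum() / (P0 ** 2).sum()                  # optimal isotropic scale
    return s * P0 @ R.T + muQ

def multiview_consistency(P: np.ndarray, Q: np.ndarray) -> float:
    """Mean joint error after optimal similarity alignment."""
    return float(np.linalg.norm(similarity_align(P, Q) - Q, axis=1).mean())

pose_a = np.random.rand(17, 3)                     # 3D pose from view 1
pose_b = 2.0 * pose_a + 0.1                        # same pose, scaled and shifted
print(multiview_consistency(pose_b, pose_a))       # ~0: the views are consistent
```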

Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval

  • paper_url: http://arxiv.org/abs/2311.12894
  • repo_url: None
  • paper_authors: Xiu-Shen Wei, Yang Shen, Xuhao Sun, Peng Wang, Yuxin Peng
  • for: An attribute-aware hashing network to make large-scale fine-grained image retrieval both efficient and reliable.
  • methods: Based on attention-captured visual representations, an encoder-decoder reconstruction network unsupervisedly distills high-level attribute-specific vectors from appearance-specific representations without attribute annotations, with a feature decorrelation constraint on the attribute vectors to strengthen their representative abilities; hash codes are then generated from these attribute-specific vectors while preserving original entities' similarity, and an additional image reconstruction path enhances the model's self-consistency to combat simplicity bias.
  • results: Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of the models over competing methods.
    Abstract Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interests (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both fine-grained nature of small inter-class variations with large intra-class variations and explosive growth of fine-grained data for such a practical task. In this paper, we propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the captured visual representations by attention, we develop an encoder-decoder structure network of a reconstruction task to unsupervisedly distill high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. Our models are also equipped with a feature decorrelation constraint upon these attribute vectors to strengthen their representative abilities. Then, driven by preserving original entities' similarity, the required hash codes can be generated from these attribute-specific vectors and thus become attribute-aware. Furthermore, to combat simplicity bias in deep hashing, we consider the model design from the perspective of the self-consistency principle and propose to further enhance models' self-consistency by equipping an additional image reconstruction path. Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of our models over competing methods.
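
A feature decorrelation constraint is commonly written as a penalty on the off-diagonal entries of a similarity (Gram) matrix over the attribute-specific vectors; the sketch below shows that generic form, which may differ from the paper's exact constraint:

```python
import torch

def decorrelation_loss(V: torch.Tensor) -> torch.Tensor:
    """Penalize correlations between attribute-specific vectors.

    V: (n_attributes, dim) one learned vector per attribute.
    """
    Vn = torch.nn.functional.normalize(V, dim=1)   # unit-length rows
    G = Vn @ Vn.T                                  # cosine-similarity Gram matrix
    off = G - torch.eye(V.shape[0])                # keep only cross terms
    return (off ** 2).sum() / (V.shape[0] * (V.shape[0] - 1))

V = torch.randn(8, 256, requires_grad=True)
print(decorrelation_loss(V).item())
```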

Board-to-Board: Evaluating Moonboard Grade Prediction Generalization

  • paper_url: http://arxiv.org/abs/2311.12419
  • repo_url: https://github.com/a1773620/moonboard-grade-prediction
  • paper_authors: Daniel Petashvili, Matthew Rodda
  • for: Predicting the grade (difficulty) of bouldering routes.
  • methods: Classical and deep-learning modelling techniques applied to the 2016, 2017, and 2019 Moonboard datasets.
  • results: State-of-the-art grade prediction performance (0.87 MAE, 1.12 RMSE) on a feature set that does not require decomposing routes into individual moves, a method common in the literature that introduces bias; the model also generalizes between board editions, and a novel vision-based grade prediction method is introduced.
    Abstract Bouldering is a sport where athletes aim to climb up an obstacle using a set of defined holds called a route. Typically routes are assigned a grade to inform climbers of its difficulty and allow them to more easily track their progression. However, the variation in individual climbers technical and physical attributes and many nuances of an individual route make grading a difficult and often biased task. In this work, we apply classical and deep-learning modelling techniques to the 2016, 2017 and 2019 Moonboard datasets, achieving state of the art grade prediction performance with 0.87 MAE and 1.12 RMSE. We achieve this performance on a feature-set that does not require decomposing routes into individual moves, which is a method common in literature and introduces bias. We also demonstrate the generalization capability of this model between editions and introduce a novel vision-based method of grade prediction. While the generalization performance of these techniques is below human level performance currently, we propose these methods as a basis for future work. Such a tool could be implemented in pre-existing mobile applications and would allow climbers to better track their progress and assess new routes with reduced bias.

Learning Part Motion of Articulated Objects Using Spatially Continuous Neural Implicit Representations

  • paper_url: http://arxiv.org/abs/2311.12407
  • repo_url: https://github.com/Yushi-Du/PartMotion
  • paper_authors: Yushi Du, Ruihai Wu, Yan Shen, Hao Dong
  • for: A novel framework for understanding and manipulating articulated objects (e.g., doors and drawers), which have high degrees of freedom and rich geometries, semantics, and part functions.
  • methods: The framework explicitly disentangles the part motion of articulated objects by predicting the transformation matrix of points on the part surface, using spatially continuous neural implicit representations to model the part motion smoothly in space; because a transformation matrix can model diverse kinds of joint motions, the framework is generic rather than tied to one joint type.
  • results: Quantitative and qualitative experiments over diverse categories of articulated objects demonstrate the effectiveness of the proposed framework.
    Abstract Articulated objects (e.g., doors and drawers) exist everywhere in our life. Different from rigid objects, articulated objects have higher degrees of freedom and are rich in geometries, semantics, and part functions. Modeling different kinds of parts and articulations with nerual networks plays an essential role in articulated object understanding and manipulation, and will further benefit 3D vision and robotics communities. To model articulated objects, most previous works directly encode articulated objects into feature representations, without specific designs for parts, articulations and part motions. In this paper, we introduce a novel framework that explicitly disentangles the part motion of articulated objects by predicting the transformation matrix of points on the part surface, using spatially continuous neural implicit representations to model the part motion smoothly in the space. More importantly, while many methods could only model a certain kind of joint motion (such as the revolution in the clockwise order), our proposed framework is generic to different kinds of joint motions in that transformation matrix can model diverse kinds of joint motions in the space. Quantitative and qualitative results of experiments over diverse categories of articulated objects demonstrate the effectiveness of our proposed framework.

CASR: Refining Action Segmentation via Marginalizing Frame-level Causal Relationships

  • paper_url: http://arxiv.org/abs/2311.12401
  • repo_url: None
  • paper_authors: Keqing Du, Xinyu Yang, Hang Chen
  • for: Improving the interpretability of Temporal Action Segmentation (TAS) by integrating deep learning with causal discovery.
  • methods: Causal Abstraction Segmentation Refiner (CASR), which refines the pre-segmentation results of various backbone models by enhancing video causality through marginalizing frame-level causal relationships. Equivalent frame-level and segment-level causal models are defined so that the causal adjacency matrix constructed from marginalized frame-level relationships can represent segment-level causal relationships, and CASR works by reducing the difference between this matrix and the backbone's pre-segmentation results. A novel evaluation metric, Causal Edit Distance (CED), is proposed to evaluate causal interpretability.
  • results: Extensive results on mainstream datasets show that CASR significantly surpasses existing methods in action segmentation performance, as well as in causal explainability and generalization.
    Abstract Integrating deep learning and causal discovery has increased the interpretability of Temporal Action Segmentation (TAS) tasks. However, frame-level causal relationships exist many complicated noises outside the segment-level, making it infeasible to directly express macro action semantics. Thus, we propose Causal Abstraction Segmentation Refiner (CASR), which can refine TAS results from various models by enhancing video causality in marginalizing frame-level casual relationships. Specifically, we define the equivalent frame-level casual model and segment-level causal model, so that the causal adjacency matrix constructed from marginalized frame-level causal relationships has the ability to represent the segmnet-level causal relationships. CASR works out by reducing the difference in the causal adjacency matrix between we constructed and pre-segmentation results of backbone models. In addition, we propose a novel evaluation metric Causal Edit Distance (CED) to evaluate the causal interpretability. Extensive experimental results on mainstream datasets indicate that CASR significantly surpasses existing various methods in action segmentation performance, as well as in causal explainability and generalization.

IMJENSE: Scan-specific Implicit Representation for Joint Coil Sensitivity and Image Estimation in Parallel MRI

  • paper_url: http://arxiv.org/abs/2311.12892
  • repo_url: https://github.com/amri-lab/imjense
  • paper_authors: Ruimin Feng, Qing Wu, Jie Feng, Huajun She, Chunlei Liu, Yuyao Zhang, Hongjiang Wei
  • for: Accelerating magnetic resonance imaging (MRI) data acquisition via parallel imaging.
  • methods: IMJENSE, a scan-specific implicit neural representation (INR)-based method for parallel MRI reconstruction: the underlying MRI image and coil sensitivities are modeled as continuous functions of spatial coordinates, parameterized by neural networks and polynomials respectively, with the network weights and polynomial coefficients learned jointly and directly from the sparsely acquired k-space measurements, without fully sampled ground truth for training.
  • results: IMJENSE robustly reconstructs images from 2D Cartesian acquisitions at 5x and 6x accelerations with only 4 or 8 calibration lines (22.0% and 19.5% undersampling rates), outperforming conventional image- or k-space-domain reconstruction algorithms and remaining more stable than supervised calibrationless and calibration-based deep-learning methods under extremely limited calibration data.
    Abstract Parallel imaging is a commonly used technique to accelerate magnetic resonance imaging (MRI) data acquisition. Mathematically, parallel MRI reconstruction can be formulated as an inverse problem relating the sparsely sampled k-space measurements to the desired MRI image. Despite the success of many existing reconstruction algorithms, it remains a challenge to reliably reconstruct a high-quality image from highly reduced k-space measurements. Recently, implicit neural representation has emerged as a powerful paradigm to exploit the internal information and the physics of partially acquired data to generate the desired object. In this study, we introduced IMJENSE, a scan-specific implicit neural representation-based method for improving parallel MRI reconstruction. Specifically, the underlying MRI image and coil sensitivities were modeled as continuous functions of spatial coordinates, parameterized by neural networks and polynomials, respectively. The weights in the networks and coefficients in the polynomials were simultaneously learned directly from sparsely acquired k-space measurements, without fully sampled ground truth data for training. Benefiting from the powerful continuous representation and joint estimation of the MRI image and coil sensitivities, IMJENSE outperforms conventional image or k-space domain reconstruction algorithms. With extremely limited calibration data, IMJENSE is more stable than supervised calibrationless and calibration-based deep-learning methods. Results show that IMJENSE robustly reconstructs the images acquired at 5$\mathbf{\times}$ and 6$\mathbf{\times}$ accelerations with only 4 or 8 calibration lines in 2D Cartesian acquisitions, corresponding to 22.0% and 19.5% undersampling rates. The high-quality results and scanning specificity make the proposed method hold the potential for further accelerating the data acquisition of parallel MRI.
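
The core mechanism, fitting coordinate-based representations so that their simulated acquisition matches the sparsely sampled k-space, can be sketched compactly. The tiny MLP, the degree-1 polynomial coil sensitivities, and the masked FFT loss below are illustrative stand-ins for the authors' architecture and optimization:

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Implicit image: f(x, y) -> complex intensity (real/imag channels)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        out = self.net(xy)
        return torch.complex(out[..., 0], out[..., 1])

def fit_step(model, coeffs, coords, kspace, mask, opt):
    """One self-supervised step: render the image, apply polynomial coil
    sensitivities and the FFT, and compare only at acquired k-space samples."""
    H = W = int(len(coords) ** 0.5)
    img = model(coords).reshape(H, W)
    # degree-1 polynomial sensitivity per coil: c0 + c1*x + c2*y (assumed degree)
    basis = torch.stack(
        [torch.ones(len(coords)), coords[:, 0], coords[:, 1]], dim=-1)
    sens = (basis.to(coeffs.dtype) @ coeffs).T.reshape(-1, H, W)
    pred_k = torch.fft.fft2(sens * img)                  # simulate acquisition
    loss = ((pred_k - kspace).abs() ** 2 * mask).mean()  # only sampled lines count
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

coords = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 32),
                                    torch.linspace(-1, 1, 32),
                                    indexing="ij"), dim=-1).reshape(-1, 2)
model = CoordMLP()
coeffs = torch.full((3, 4), 0.1 + 0j, dtype=torch.complex64, requires_grad=True)
opt = torch.optim.Adam(list(model.parameters()) + [coeffs], lr=1e-3)
kspace = torch.randn(4, 32, 32, dtype=torch.complex64)  # toy 4-coil measurements
mask = (torch.rand(32, 32) < 0.25).float()              # ~4x undersampling
print(fit_step(model, coeffs, coords, kspace, mask, opt))
```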

RFTrans: Leveraging Refractive Flow of Transparent Objects for Surface Normal Estimation and Manipulation

  • paper_url: http://arxiv.org/abs/2311.12398
  • repo_url: None
  • paper_authors: Tutian Tang, Jiyu Liu, Jieyi Zhang, Haoyuan Fu, Wenqiang Xu, Cewu Lu
  • for: 这篇论文的目的是教机器人如何与透明物体交互,以解决直接从 RGB 图像中提取几何信息的困难。
  • methods: 论文提出了一种基于 RGB-D 摄像头的方法,称为 RFTrans,用于估计透明物体的表面法向。该方法利用折射流(refractive flow)作为中间表示,从而规避直接从 RGB 图像预测几何信息的缺陷。
  • results: 在合成数据集上训练后,RFTrans 在合成与真实世界基准上均以较大优势持续超越基线 ClearGrasp;在真实世界的机器人抓取任务中,RFTrans 实现了 83% 的成功率。
    Abstract Transparent objects are widely used in our daily lives, making it important to teach robots to interact with them. However, it's not easy because the reflective and refractive effects can make RGB-D cameras fail to give accurate geometry measurements. To solve this problem, this paper introduces RFTrans, an RGB-D-based method for surface normal estimation and manipulation of transparent objects. By leveraging refractive flow as an intermediate representation, RFTrans circumvents the drawbacks of directly predicting the geometry (e.g. surface normal) from RGB images and helps bridge the sim-to-real gap. RFTrans integrates the RFNet, which predicts refractive flow, object mask, and boundaries, followed by the F2Net, which estimates surface normal from the refractive flow. To make manipulation possible, a global optimization module will take in the predictions, refine the raw depth, and construct the point cloud with normal. An analytical grasp planning algorithm, ISF, is followed to generate the grasp poses. We build a synthetic dataset with physically plausible ray-tracing rendering techniques to train the networks. Results show that the RFTrans trained on the synthetic dataset can consistently outperform the baseline ClearGrasp in both synthetic and real-world benchmarks by a large margin. Finally, a real-world robot grasping task witnesses an 83% success rate, proving that refractive flow can help enable direct sim-to-real transfer. The code, data, and supplementary materials are available at https://rftrans.robotflow.ai.
    摘要 透明物体在日常生活中广泛使用,因此教机器人如何与它们交互非常重要。然而,由于反射和折射效应,RGB-D 摄像头往往无法给出准确的几何测量。为解决这一问题,本文介绍了 RFTrans,一种基于 RGB-D 的透明物体表面法向估计与操作方法。通过利用折射流作为中间表示,RFTrans 规避了直接从 RGB 图像预测几何信息(如表面法向)的缺陷,并有助于弥合仿真到现实(sim-to-real)的差距。RFTrans 集成了 RFNet 和 F2Net:前者预测折射流、物体掩码和边界,后者从折射流中估计表面法向。为了实现操作,全局优化模块会接收这些预测结果,细化原始深度,并构建带法向量的点云;随后使用解析抓取规划算法 ISF 生成抓取姿态。我们使用物理合理的光线追踪渲染技术构建了合成数据集来训练这些网络。结果表明,在合成数据集上训练的 RFTrans 在合成与真实世界基准上均以较大优势持续超越基线 ClearGrasp。最后,真实世界的机器人抓取任务达到了 83% 的成功率,证明折射流有助于实现直接的仿真到现实迁移。代码、数据和补充材料可在 https://rftrans.robotflow.ai 获取。

Rich and Poor Texture Contrast: A Simple yet Effective Approach for AI-generated Image Detection

  • paper_url: http://arxiv.org/abs/2311.12397
  • repo_url: None
  • paper_authors: Nan Zhong, Yiran Xu, Zhenxing Qian, Xinpeng Zhang
  • for: 本研究旨在开发一种可以识别基于多种生成模型生成的假图像的检测器。
  • methods: 我们的方法利用图像中富纹理与贫纹理区域之间的像素间相关性差异来检测假图像。我们将图像划分为多个图块,并将其重组为两幅图像:一幅富纹理图像和一幅贫纹理图像。然后,我们提取富、贫纹理区域之间的相关性差异特征,使其作为识别生成模型所产图像的通用指纹。
  • results: 我们的方法在16种常见的生成模型下进行了广泛的实验,并证明了与现有基线相比,我们的方法能够显著提高检测精度。
    Abstract Recent generative models show impressive performance in generating photographic images. Humans can hardly distinguish such incredibly realistic-looking AI-generated images from real ones. AI-generated images may lead to ubiquitous disinformation dissemination. Therefore, it is of utmost urgency to develop a detector to identify AI-generated images. Most existing detectors suffer from sharp performance drops over unseen generative models. In this paper, we propose a novel AI-generated image detector capable of identifying fake images created by a wide range of generative models. Our approach leverages the inter-pixel correlation contrast between rich and poor texture regions within an image. Pixels in rich texture regions exhibit more significant fluctuations than those in poor texture regions. This discrepancy reflects that the entropy of rich texture regions is larger than that of poor ones. Consequently, synthesizing realistic rich texture regions proves to be more challenging for existing generative models. Based on this principle, we divide an image into multiple patches and reconstruct them into two images, comprising rich-texture and poor-texture patches respectively. Subsequently, we extract the inter-pixel correlation discrepancy feature between rich and poor texture regions. This feature serves as a universal fingerprint used for AI-generated image forensics across different generative models. In addition, we build a comprehensive AI-generated image detection benchmark, which includes 16 kinds of prevalent generative models, to evaluate the effectiveness of existing baselines and our approach. Our benchmark provides a leaderboard for follow-up studies. Extensive experimental results show that our approach outperforms state-of-the-art baselines by a significant margin. Our project: https://fdmas.github.io/AIGCDetect/
    摘要 现代生成模型在生成照片级图像方面表现惊人,人类几乎无法分辨这些逼真的AI生成图像与真实图像。AI生成图像可能导致虚假信息的广泛传播,因此亟需开发识别AI生成图像的检测器。现有检测器在面对未见过的生成模型时往往性能骤降。在本文中,我们提出了一种新的AI生成图像检测器,能够识别由多种生成模型创建的假图像。我们的方法利用图像中富纹理区域与贫纹理区域之间的像素间相关性差异:富纹理区域的像素波动比贫纹理区域更显著,这反映出富纹理区域的熵大于贫纹理区域,因此对现有生成模型来说,合成逼真的富纹理区域更具挑战性。基于这一原理,我们将图像划分为多个图块,并将其重组为两幅图像,分别由富纹理图块和贫纹理图块构成;随后提取富、贫纹理区域之间的像素间相关性差异特征。该特征可作为跨不同生成模型进行AI生成图像取证的通用指纹。此外,我们构建了一个包含16种流行生成模型的综合AI生成图像检测基准,用于评估现有基线和我们方法的有效性,并为后续研究提供排行榜。大量实验结果表明,我们的方法显著优于当前最先进的基线。我们的项目:https://fdmas.github.io/AIGCDetect/
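A rough sketch of the rich/poor-texture fingerprint, assuming grayscale input and using patch-wise pixel fluctuation as the richness score (the paper's exact filters and feature extractor differ):

```python
import numpy as np

def texture_contrast_feature(img, patch=32, k=16):
    """img: 2-D grayscale array. Returns a scalar rich-vs-poor texture discrepancy."""
    H, W = img.shape
    patches = [img[i:i+patch, j:j+patch]
               for i in range(0, H - patch + 1, patch)
               for j in range(0, W - patch + 1, patch)]

    # Pixel fluctuation per patch: mean absolute horizontal + vertical differences.
    def fluct(p):
        return np.abs(np.diff(p, axis=0)).mean() + np.abs(np.diff(p, axis=1)).mean()

    order = np.argsort([fluct(p) for p in patches])
    poor = [patches[i] for i in order[:k]]    # smoothest (poor-texture) patches
    rich = [patches[i] for i in order[-k:]]   # most textured (rich-texture) patches

    # Inter-pixel correlation proxy: high-frequency energy per group; the gap
    # between the two groups serves as the detector's input feature.
    energy = lambda group: np.mean([fluct(p) for p in group])
    return energy(rich) - energy(poor)
```

On real photographs this gap tends to be larger than on synthesized images, since generators struggle to reproduce rich-texture statistics, which is the intuition the detector exploits.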

From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation

  • paper_url: http://arxiv.org/abs/2311.12391
  • repo_url: None
  • paper_authors: Jiaxin Ge, Sanjay Subramanian, Trevor Darrell, Boyi Li
  • for: 提高视觉理解任务中的解释质量,减少人工标注量。
  • methods: 采用递归式视觉-语言模型,在多步过程中迭代计算视觉特征、答案和解释,以提高解释质量。
  • results: 相比单步生成,多步生成能够更好地引导模型纠正自己的答案,生成的解释也可作为宝贵的标注。我们的方法仅使用5%的人工标注解释,就在10项指标上超越此前的方法,在 VCR 和 VQA-X 数据集上分别取得最高4.2和1.3的 BLEU-1 分数提升。
    Abstract Addressing the challenge of adapting pre-trained vision-language models for generating insightful explanations for visual reasoning tasks with limited annotations, we present ReVisE: a $\textbf{Re}$cursive $\textbf{Vis}$ual $\textbf{E}$xplanation algorithm. Our method iteratively computes visual features (conditioned on the text input), an answer, and an explanation, to improve the explanation quality step by step until the answer converges. We find that this multi-step approach guides the model to correct its own answers and outperforms single-step explanation generation. Furthermore, explanations generated by ReVisE also serve as valuable annotations for few-shot self-training. Our approach outperforms previous methods while utilizing merely 5% of the human-annotated explanations across 10 metrics, demonstrating up to a 4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets, underscoring the efficacy and data-efficiency of our method.
    摘要 针对在标注有限的情况下,如何调整预训练的视觉-语言模型以便为视觉推理任务生成有洞察力的解释这一挑战,我们提出了 ReVisE:一种递归视觉解释算法。我们的方法迭代地计算视觉特征(以文本输入为条件)、答案和解释,逐步提升解释质量,直至答案收敛。我们发现这种多步方法能够引导模型纠正自己的答案,并优于单步解释生成。此外,ReVisE 生成的解释还可以作为少样本自训练的宝贵标注。我们的方法仅使用5%的人工标注解释,就在10项指标上超越此前的方法,在 VCR 和 VQA-X 数据集上分别取得最高4.2和1.3的 BLEU-1 分数提升,凸显了该方法的有效性与数据效率。

Point, Segment and Count: A Generalized Framework for Object Counting

  • paper_url: http://arxiv.org/abs/2311.12386
  • repo_url: None
  • paper_authors: Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan
  • for: 本文提出了一种通用的对象计数框架,用于依据示例框或类别名称统计图像中的对象数量,即少样本和零样本计数。
  • methods: 本文结合 SAM 和 CLIP 两种基础模型:SAM 将所有可能的对象分割为掩码提议,CLIP 对提议进行分类,从而获得准确的对象计数。
  • results: 实验结果表明,PseCo在FSC-147数据集上达到了当前最佳性能,并在COCO和LVIS数据集上也取得了优秀的结果。
    Abstract Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot counting. Current state-of-the-art methods highly rely on density maps to predict object counts, which lacks model interpretability. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate object counts. However, this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization to provide accurate but least point prompts for SAM, which consequently not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147 dataset demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection, with additional results on large-scale COCO and LVIS datasets. The source code is available at \url{https://github.com/Hzzone/PseCo}.
    摘要 类别无关的对象计数旨在根据示例框或类别名称统计图像中的所有对象,即少样本计数和零样本计数。当前最先进的方法高度依赖密度图来预测对象数量,缺乏模型可解释性。在本文中,我们提出了一个基于检测的通用框架,同时适用于少样本和零样本对象计数。我们的框架结合了两个基础模型的优势而不损害其零样本能力:(i)SAM 用于分割所有可能的对象作为掩码提议,(ii)CLIP 用于对提议进行分类以获得准确的对象计数。然而,这一策略面临效率开销过高、以及小而密集的对象难以定位和区分的障碍。为解决这些问题,我们的框架 PseCo 遵循三个步骤:点、分割、计数。具体而言,我们首先提出一种类别无关的对象定位,为 SAM 提供准确而最少的点提示,从而既降低计算成本,又避免漏检小对象。其次,我们提出一种通用的对象分类,以 CLIP 的图像/文本嵌入作为分类器,并通过层次知识蒸馏在层次掩码提议之间获得有判别力的分类。在 FSC-147 数据集上的大量实验结果表明,PseCo 在少样本/零样本对象计数与检测上均达到当前最佳性能,并在大规模的 COCO 和 LVIS 数据集上给出了补充结果。源代码见 https://github.com/Hzzone/PseCo。

Text-Guided Texturing by Synchronized Multi-View Diffusion

  • paper_url: http://arxiv.org/abs/2311.12891
  • repo_url: None
  • paper_authors: Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong
  • for: 本研究旨在利用预训练的 T2I 扩散模型,为 3D 物体合成纹理。
  • methods: 我们提出了一种同步多视角扩散方法,使不同视角的扩散过程在去噪过程中就生成内容达成共识,从而保证纹理的一致性。在每个去噪步骤中,我们将不同视角的潜变量在纹理域中进行混合,特别是在视角之间的重叠区域。
  • results: 与当前最先进方法相比,我们的方法在生成一致、无缝、高细节纹理方面表现更优。
    Abstract This paper introduces a novel approach to synthesize texture to dress up a given 3D object, given a text prompt. Based on the pretrained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. But it tends to generate inconsistent texture due to the asynchronous diffusion of multiple views. We believe such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistent artifact. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes from different views to reach a consensus of the generated content early in the process, and hence ensures the texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically blending the latent content in the texture domain from views with overlap. Our method demonstrates superior performance in generating consistent, seamless, highly detailed textures, comparing to state-of-the-art methods.
    摘要 本文提出了一种新方法,可根据文本提示为给定的3D物体合成纹理。基于预训练的文本到图像(T2I)扩散模型,现有方法通常采用“投影-修补”的方式:先生成物体的一个视角,再将其变换到另一个视角进行修补。但由于多个视角的扩散过程不同步,这种方式容易产生不一致的纹理。我们认为,这种不同步的扩散以及视角之间信息共享不足,正是不一致伪影的根源。本文提出一种同步多视角扩散方法,使不同视角的扩散过程在早期就对生成内容达成共识,从而保证纹理一致性。为同步扩散过程,我们在每个去噪步骤中在不同视角之间共享去噪内容,具体做法是在纹理域中混合存在重叠的视角的潜变量。与现有最先进方法相比,我们的方法在生成一致、无缝、高细节纹理方面表现更优。
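One plausible reading of the synchronization step, sketched below (the `uv_maps` unprojection is assumed to be precomputed from the mesh; this is not the authors' code): per-view latents are splatted into the shared texture domain, averaged where views overlap, then read back per view at each denoising step.

```python
import numpy as np

def blend_in_texture_domain(view_latents, uv_maps, tex_hw):
    """view_latents: list of (H, W, C) latents, one per view.
    uv_maps: list of (H, W, 2) integer texture coordinates per view pixel.
    tex_hw: (Th, Tw) size of the shared texture (UV) domain."""
    C = view_latents[0].shape[-1]
    tex_sum = np.zeros((*tex_hw, C)); tex_cnt = np.zeros((*tex_hw, 1))
    for lat, uv in zip(view_latents, uv_maps):       # splat each view into texture space
        u, v = uv[..., 0].ravel(), uv[..., 1].ravel()
        np.add.at(tex_sum, (u, v), lat.reshape(-1, C))
        np.add.at(tex_cnt, (u, v), 1.0)
    tex = tex_sum / np.maximum(tex_cnt, 1.0)         # consensus where views overlap
    return [tex[uv[..., 0], uv[..., 1]] for uv in uv_maps]  # read blended latents back per view
```

Calling this once per denoising step forces all views toward the same texture content early in sampling, which is the mechanism the abstract credits for seam-free results.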

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

  • paper_url: http://arxiv.org/abs/2311.12890
  • repo_url: None
  • paper_authors: Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
  • for: 解决复杂多步视觉任务
  • methods: 使用自动分解和自动反馈机制,将复杂任务分解为更简单的子任务,并通过多模型的集成来提高逻辑推理性能
  • results: 在多种视觉任务上树立新的基准,生成更加准确和鲁棒的程序
    Abstract Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it advances in performing visual processing and reasoning in an unsupervised manner. Current visual programming methods generate programs in a single pass for each task where the ability to evaluate and optimize based on feedback, unfortunately, is lacking, which consequentially limits their effectiveness for complex, multi-step problems. Drawing inspiration from benders decomposition, we introduce De-fine, a general framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more accurate and robust programs, setting new benchmarks in the field.
    摘要 “视觉编程是一种模块化且可泛化的范式,它将不同的模块与 Python 运算符组合起来,以解决各种视觉-语言任务。与需要任务特定数据的端到端模型不同,它能够以无监督的方式进行视觉处理和推理。现有的视觉编程方法对每个任务只一次性生成程序,缺乏基于反馈进行评估和优化的能力,这限制了它们在复杂多步任务中的效果。受 Benders 分解的启发,我们提出了 De-fine,一个能够自动将复杂任务分解为更简单的子任务、并通过自动反馈改进程序的通用框架。这种与模型无关的方法可以融合多个模型的优势,从而提升逻辑推理性能。我们在多种视觉任务上的实验表明,De-fine 能生成更准确、更鲁棒的程序,在该领域树立了新的基准。”

Semi-supervised Medical Image Segmentation via Query Distribution Consistency

  • paper_url: http://arxiv.org/abs/2311.12364
  • repo_url: None
  • paper_authors: Rong Wu, Dehua Li, Cong Zhang
  • for: 这篇论文主要针对医疗影像分割中的半监督学习问题进行研究,以提高医疗影像分割的精度和效率。
  • methods: 论文提出了一个基于相互学习策略的 Dual KMax UX-Net 框架,它结合 3D UX-Net 元架构和 KMax 解码器,以提升医疗影像分割性能。
  • results: 实验结果显示,该方法可以通过融合无标签数据来显著提升分割性能;在10%和20%标注设定下,它也超越了现有的半监督学习方法。
    Abstract Semi-supervised learning is increasingly popular in medical image segmentation due to its ability to leverage large amounts of unlabeled data to extract additional information. However, most existing semi-supervised segmentation methods focus only on extracting information from unlabeled data. In this paper, we propose a novel Dual KMax UX-Net framework that leverages labeled data to guide the extraction of information from unlabeled data. Our approach is based on a mutual learning strategy that incorporates two modules: 3D UX-Net as our backbone meta-architecture and KMax decoder to enhance the segmentation performance. Extensive experiments on the Atrial Segmentation Challenge dataset have shown that our method can significantly improve performance by merging unlabeled data. Meanwhile, our framework outperforms state-of-the-art semi-supervised learning methods on 10\% and 20\% labeled settings. Code located at: https://github.com/Rows21/DK-UXNet.
    摘要 半监督学习在医疗影像分割中日益受欢迎,因为它可以利用大量无标签数据来提取更多信息。然而,现有大多数半监督分割方法都只关注从无标签数据中提取信息。在这篇论文中,我们提出了一种新的 Dual KMax UX-Net 框架,该框架利用标签数据来引导从无标签数据中提取信息。我们的方法基于一种相互学习策略,包括两个模块:作为基础元架构的 3D UX-Net 和用于提升分割性能的 KMax 解码器。我们在 Atrial Segmentation Challenge 数据集上进行了广泛的实验,发现我们的方法可以通过融合无标签数据显著提高性能,并且在 10% 和 20% 标注设置下超越了现有的半监督学习方法。代码位于:https://github.com/Rows21/DK-UXNet。

Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition

  • paper_url: http://arxiv.org/abs/2311.12344
  • repo_url: None
  • paper_authors: Sumin Lee, Sangmin Woo, Muhammad Adi Nugroho, Changick Kim
  • for: 本研究提出了一种新的网络模型,即 Modality Mixer(M-Mixer)网络,用于多模态人体动作识别。M-Mixer 网络可以有效利用不同模态之间的互补信息,并将其与动作的时间上下文相结合。
  • methods: 本研究提出了一个新的模块,即 Complementary Feature Extraction Module(CFEM),用于提取不同模态之间的互补信息。CFEM 为每个模态使用单独的可学习查询嵌入,以引导其提取互补信息和全局动作内容。
  • results: 实验结果表明,我们提出的方法在 NTU RGB+D 60、NTU RGB+D 120 和 NW-UCLA 数据集上达到了当前最优性能。此外,我们还进行了全面的消融研究,以验证所提方法的有效性。
    Abstract Due to the distinctive characteristics of sensors, each modality exhibits unique physical properties. For this reason, in the context of multi-modal action recognition, it is important to consider not only the overall action content but also the complementary nature of different modalities. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, which effectively leverages and incorporates the complementary information across modalities with the temporal context of actions for action recognition. A key component of our proposed M-Mixer is the Multi-modal Contextualization Unit (MCU), a simple yet effective recurrent unit. Our MCU is responsible for temporally encoding a sequence of one modality (e.g., RGB) with action content features of other modalities (e.g., depth and infrared modalities). This process encourages M-Mixer network to exploit global action content and also to supplement complementary information of other modalities. Furthermore, to extract appropriate complementary information regarding to the given modality settings, we introduce a new module, named Complementary Feature Extraction Module (CFEM). CFEM incorporates sepearte learnable query embeddings for each modality, which guide CFEM to extract complementary information and global action content from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, through comprehensive ablation studies, we further validate the effectiveness of our proposed method.
    摘要 由于各种传感器具有不同的特点,每种模态都具有独特的物理特性。因此,在多模态动作识别场景下,不仅要考虑整体的动作内容,还要考虑不同模态之间的互补性。在本文中,我们提出了一种新的网络,名为模态混合器(M-Mixer)网络,它能够有效地利用并融合跨模态的互补信息与动作的时间上下文,用于动作识别。M-Mixer 的关键组件是多模态上下文化单元(MCU),一个简单而有效的循环单元。MCU 负责将某一模态(例如 RGB)的序列与其他模态(例如深度和红外模态)的动作内容特征在时间维度上进行编码。这一过程促使 M-Mixer 网络既利用全局动作内容,又补充其他模态的互补信息。此外,为了针对给定的模态设置提取适当的互补信息,我们引入了一个新模块,名为互补特征提取模块(CFEM)。CFEM 为每个模态使用单独的可学习查询嵌入,引导其从其他模态中提取互补信息和全局动作内容。因此,我们提出的方法在 NTU RGB+D 60、NTU RGB+D 120 和 NW-UCLA 数据集上超越了当前最先进的方法。此外,我们还通过全面的消融研究进一步验证了所提方法的有效性。
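A hedged sketch of what the Multi-modal Contextualization Unit could look like, based only on the description above: a recurrent cell stepping through one modality's sequence while conditioning each step on pooled action-content features from the other modalities. The GRU choice and tensor shapes are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class MCU(nn.Module):
    """Recurrent unit that encodes one modality conditioned on the others."""
    def __init__(self, d_main, d_other, d_hidden):
        super().__init__()
        self.cell = nn.GRUCell(d_main + d_other, d_hidden)

    def forward(self, main_seq, other_feat):
        # main_seq: (T, B, d_main) sequence of one modality (e.g., RGB features)
        # other_feat: (B, d_other) pooled action-content features of other modalities
        h = main_seq.new_zeros(main_seq.shape[1], self.cell.hidden_size)
        for t in range(main_seq.shape[0]):
            h = self.cell(torch.cat([main_seq[t], other_feat], dim=-1), h)
        return h  # temporally encoded, cross-modally contextualized representation

# Example usage with hypothetical feature sizes:
mcu = MCU(d_main=256, d_other=128, d_hidden=512)
out = mcu(torch.randn(16, 4, 256), torch.randn(4, 128))
```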

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.12342
  • repo_url: None
  • paper_authors: Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
  • for: 本研究旨在提出一种无需训练的方法,用于将文本描述和空间布局转化为高质量图像。
  • methods: 本方法引入了一种局部注意力约束,以准确地将各个对象放置在指定的区域中。此外,我们还提出了一种填充令牌约束,以利用此前被忽略的填充令牌中蕴含的语义信息,从而避免生成对象之间的不良融合。
  • results: 多个基准测试表明,我们的方法能有效解决先前方法中常见的语义失败问题;在无需训练的情况下,它可以与现有的文本到图像和布局到图像模型集成,大幅提升其性能。
    Abstract Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in accurately conveying fine-grained spatial compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and spatial layouts. Our method introduces a Localized Attention Constraint to refine cross-attention for individual objects, ensuring their precise placement in designated regions. We further propose a Padding Token Constraint to leverage the semantic information embedded in previously neglected padding tokens, thereby preventing the undesired fusion of synthesized objects. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, significantly amplifying their performance and effectively addressing semantic failures observed in prior methods. Through extensive experiments, we showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.
    摘要 最近的文本到图像扩散模型在生成高质量图像方面达到了前所未有的水平。然而,它们仅依赖文本提示,往往难以准确表达细粒度的空间构图。在本文中,我们提出了 LoCo,一种无需训练的布局到图像合成方法,能够生成与文本提示和空间布局都对齐的高质量图像。我们的方法引入了局部注意力约束,以细化各个对象的交叉注意力,确保其精确地放置在指定区域内。我们进一步提出了填充令牌约束,以利用此前被忽略的填充令牌中蕴含的语义信息,从而避免合成对象之间的不良融合。LoCo 可无缝集成到现有的文本到图像和布局到图像模型中,显著提升其性能,并有效解决先前方法中出现的语义失败。通过大量实验,我们展示了该方法的优越性:在多个基准上,无论定性还是定量,均超越了现有最先进的无训练布局到图像方法。
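A minimal sketch of a localized attention constraint in the spirit described above (the exact loss form is an assumption; `attn` and `boxes` are hypothetical inputs): the loss penalizes cross-attention mass that leaks outside each object's layout box, and its gradient can steer the latents during sampling without any training.

```python
import torch

def localized_attention_loss(attn, boxes):
    """attn: (T, H, W) cross-attention maps for T object tokens;
    boxes: list of T layout boxes as (y0, x0, y1, x1) pixel indices."""
    loss = 0.0
    for t, (y0, x0, y1, x1) in enumerate(boxes):
        a = attn[t] / (attn[t].sum() + 1e-8)   # normalize map to a distribution
        inside = a[y0:y1, x0:x1].sum()         # attention mass inside the box
        loss = loss + (1.0 - inside)           # penalize mass outside the box
    return loss / len(boxes)

# Example: two object tokens on a 64x64 attention grid.
loss = localized_attention_loss(torch.rand(2, 64, 64), [(0, 0, 32, 32), (32, 32, 64, 64)])
```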

Fine-Grained Open Domain Image Animation with Motion Guidance

  • paper_url: http://arxiv.org/abs/2311.12886
  • repo_url: None
  • paper_authors: Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
  • for: 本研究旨在提供一种基于视频扩散模型的开放域图像动画方法,以便对在多样真实环境中拍摄的图像实现细粒度且可控的动画生成。
  • methods: 本方法使用目标运动区域指导和运动强度指导,以精细控制动画中的可动区域及其运动速度。
  • results: 经过在开放域数据集上的严格实验验证,本方法表现出色,能够生成细粒度且可交互的动画序列。
    Abstract Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open domain image animation method that leverages the motion prior of video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance. The source code and model will be made publicly available upon publication.
    摘要 Image 动画是计算机视觉中关键任务之一,旨在将静止图像转化为动态视觉内容。现代图像动画技术使用神经网络基于渲染技术来生成真实的动画。然而,在文本指导下实现细化和可控的图像动画仍然是一项挑战,特别是在开放领域中拍摄的多样化实际环境中。在这篇论文中,我们提出了一种开放领域图像动画方法,利用视频扩散模型的运动先验。我们的方法包括targeted Motion Area指导和运动强度指导,以实现精细控制可动区域和其运动速度。这会使动画内的视觉元素更加与提示文本相对应,从而实现细化和交互的动画生成过程。我们通过对开放领域数据集进行严格的实验,证明了我们的方法的效果。代码和模型将在发表后公开。

ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability

  • paper_url: http://arxiv.org/abs/2311.12327
  • repo_url: https://github.com/anonymgiant/vilam
  • paper_authors: Xiaoyu Yang, Lijian Xu, Hongsheng Li, Shaoting Zhang
  • for: 本研究旨在应用语言模型于医学图像分析任务中,以提高人机交互和多模态任务的表现。
  • methods: 本研究提出了一种统一的视觉-语言 Transformer 模型(ViLaM),通过在大型预训练语言模型上进行指令微调来充分利用其知识和推理能力。文章使用冻结的预训练编码器来编码并对齐图像和文本特征,以便执行多种视觉任务。此外,文章还针对指代表达提出了一种循环训练策略,以提高指代表达数据的数量和质量。
  • results: 研究表明,ViLaM 在公共通用数据集上表现出色,并在医学数据集上验证了其泛化能力。此外,研究还观察到模型出色的零样本学习能力,表明 ViLaM 未来有望应用于医学领域。
    Abstract Vision-language models have revolutionized human-computer interaction and shown significant progress in multi-modal tasks. However, applying these models to complex visual tasks like medical image analysis remains challenging. In this study, we propose ViLaM, a unified Vision-Language transformer model that integrates instruction tuning predicated on a large language model. This approach enables us to optimally utilize the knowledge and reasoning capacities of large pre-trained language models for an array of tasks encompassing both language and vision. We employ frozen pre-trained encoders to encode and align both image and text features, enabling ViLaM to handle a variety of visual tasks following textual instructions. Besides, we've designed cycle training for referring expressions to address the need for high-quality, paired referring expression datasets for training large models in terms of both quantity and quality. We evaluated ViLaM's exceptional performance on public general datasets and further confirmed its generalizability on medical datasets. Importantly, we've observed the model's impressive zero-shot learning ability, indicating the potential future application of ViLaM in the medical field.
    摘要 视觉-语言模型已经改变了人机交互方式,并在多模态任务上取得显著进展。然而,将这些模型应用于医学图像分析等复杂视觉任务仍然是一个挑战。在这项研究中,我们提出了 ViLaM,一种统一的视觉-语言 Transformer 模型,它在大型语言模型的基础上引入指令微调。这种方法使我们能够最大限度地利用大型预训练语言模型的知识和推理能力,以涵盖语言与视觉的多种任务。我们使用冻结的预训练编码器来编码并对齐图像和文本特征,使 ViLaM 能够依照文本指令处理多种视觉任务。此外,我们针对指代表达设计了循环训练,以在数量和质量两方面满足训练大模型所需的高质量成对指代表达数据。我们在公共通用数据集上评估了 ViLaM 的出色性能,并进一步在医学数据集上确认了其泛化能力。最重要的是,我们观察到模型令人印象深刻的零样本学习能力,表明 ViLaM 未来有望应用于医学领域。

Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis

  • paper_url: http://arxiv.org/abs/2311.12885
  • repo_url: None
  • paper_authors: Honglin Li, Yunlong Zhang, Chenglu Zhu, Jiatong Cai, Sunyi Zheng, Lin Yang
  • for: This paper is written for developing a computer-aided diagnosis tool for histopathology images, specifically addressing the problem of shape varying large Whole Slide Images (WSIs) in previous Multi-Instance Learning (MIL) models.
  • methods: The proposed method, Long-contextual MIL (Long-MIL), introduces Linear Bias into Attention to amend position embedding for shape varying long-contextual WSIs, and utilizes Flash-Attention module to tackle computational complexity while maintaining full self-attention performance.
  • results: The proposed Long-MIL method is evaluated in extensive experiments on 4 datasets, covering WSI classification and survival prediction tasks, and demonstrates superiority on shape-varying WSIs compared to previous MIL models.
    Abstract Histopathology image analysis is the golden standard of clinical diagnosis for Cancers. In doctors daily routine and computer-aided diagnosis, the Whole Slide Image (WSI) of histopathology tissue is used for analysis. Because of the extremely large scale of resolution, previous methods generally divide the WSI into a large number of patches, then aggregate all patches within a WSI by Multi-Instance Learning (MIL) to make the slide-level prediction when developing computer-aided diagnosis tools. However, most previous WSI-MIL models using global-attention without pairwise interaction and any positional information, or self-attention with absolute position embedding can not well handle shape varying large WSIs, e.g. testing WSIs after model deployment may be larger than training WSIs, since the model development set is always limited due to the difficulty of histopathology WSIs collection. To deal with the problem, in this paper, we propose to amend position embedding for shape varying long-contextual WSI by introducing Linear Bias into Attention, and adapt it from 1-d long sequence into 2-d long-contextual WSI which helps model extrapolate position embedding to unseen or under-fitted positions. We further utilize Flash-Attention module to tackle the computational complexity of Transformer, which also keep full self-attention performance compared to previous attention approximation work. Our method, Long-contextual MIL (Long-MIL) are evaluated on extensive experiments including 4 dataset including WSI classification and survival prediction tasks to validate the superiority on shape varying WSIs. The source code will be open-accessed soon.
    摘要 组织病理图像分析是癌症临床诊断的金标准。在医生的日常工作和计算机辅助诊断中,分析对象是组织病理切片的全切片图像(Whole Slide Image, WSI)。由于其分辨率极大,以往方法通常先将WSI划分为大量图块(patch),再通过多示例学习(MIL)聚合同一张WSI内的所有图块,从而在开发计算机辅助诊断工具时做出切片级预测。然而,以往的WSI-MIL模型大多使用不带成对交互和位置信息的全局注意力,或使用带绝对位置嵌入的自注意力,难以很好地处理形状多变的大尺寸WSI;例如,模型部署后的测试WSI可能比训练WSI更大,而由于组织病理WSI采集困难,模型开发集总是有限的。为了解决这一问题,本文提出在注意力中引入线性偏置(Linear Bias)来修正位置嵌入,使其适应形状多变的长上下文WSI,并将其从一维长序列推广到二维长上下文WSI,从而帮助模型将位置嵌入外推到未见或欠拟合的位置。我们进一步利用 Flash-Attention 模块来应对Transformer的计算复杂度,同时保持与此前注意力近似工作相比的完整自注意力性能。我们的方法 Long-contextual MIL(Long-MIL)在包括WSI分类和生存预测任务在内的4个数据集上进行了广泛实验,验证了其在形状多变WSI上的优越性。源代码即将开放。
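A sketch of one way to realize "Linear Bias into Attention" for 2-D patch grids, ALiBi-style (the slope value and exact bias form are assumptions, not the paper's specification). Because the bias depends only on pairwise patch distance, it extrapolates naturally to slide positions unseen at training time:

```python
import torch

def attention_with_2d_linear_bias(q, k, v, pos, slope=0.1):
    """q, k, v: (N, D) patch tokens from one WSI; pos: (N, 2) patch grid coordinates."""
    dist = torch.cdist(pos.float(), pos.float())          # pairwise 2-D patch distance
    scores = q @ k.T / q.shape[-1] ** 0.5 - slope * dist  # farther patches attend less
    return torch.softmax(scores, dim=-1) @ v

# Example: 500 patches with random positions on an irregularly shaped slide.
q = k = v = torch.randn(500, 64)
pos = torch.randint(0, 100, (500, 2))
out = attention_with_2d_linear_bias(q, k, v, pos)
```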

ABFL: Angular Boundary Discontinuity Free Loss for Arbitrary Oriented Object Detection in Aerial Images

  • paper_url: http://arxiv.org/abs/2311.12311
  • repo_url: None
  • paper_authors: Zifei Zhao, Shengyang Li
  • for: 在航空图像中检测任意方向物体(AOOD)是一项广受关注且极具挑战性的任务,在许多场景中发挥重要作用。
  • methods: 本文提出了一种基于 von Mises 分布的角度边界无损失(ABFL),用于解决任意方向物体检测中的角度边界不连续(ABD)问题。
  • results: 实验表明,所提出的 ABFL 损失能够超越一些专门针对 ABD 问题的方法,且不需要额外的编码-解码结构。
    Abstract Arbitrary oriented object detection (AOOD) in aerial images is a widely concerned and highly challenging task, and plays an important role in many scenarios. The core of AOOD involves the representation, encoding, and feature augmentation of oriented bounding-boxes (Bboxes). Existing methods lack intuitive modeling of angle difference measurement in oriented Bbox representations. Oriented Bboxes under different representations exhibit rotational symmetry with varying periods due to angle periodicity. The angular boundary discontinuity (ABD) problem at periodic boundary positions is caused by rotational symmetry in measuring angular differences. In addition, existing methods also use additional encoding-decoding structures for oriented Bboxes. In this paper, we design an angular boundary free loss (ABFL) based on the von Mises distribution. The ABFL aims to solve the ABD problem when detecting oriented objects. Specifically, ABFL proposes to treat angles as circular data rather than linear data when measuring angle differences, aiming to introduce angle periodicity to alleviate the ABD problem and improve the accuracy of angle difference measurement. In addition, ABFL provides a simple and effective solution for various periodic boundary discontinuities caused by rotational symmetry in AOOD tasks, as it does not require additional encoding-decoding structures for oriented Bboxes. Extensive experiments on the DOTA and HRSC2016 datasets show that the proposed ABFL loss outperforms some state-of-the-art methods focused on addressing the ABD problem.
    摘要 “在航空图像中检测任意方向物体(AOOD)是一项广受关注且极具挑战性的任务,在许多场景中发挥重要作用。AOOD 的核心在于有向包围框(Bbox)的表示、编码与特征增强。现有方法在有向 Bbox 表示中缺乏对角度差度量的直观建模。由于角度的周期性,不同表示下的有向 Bbox 呈现出周期各异的旋转对称性;在度量角度差时,这种旋转对称性会在周期边界位置引发角度边界不连续(ABD)问题。此外,现有方法还为有向 Bbox 引入了额外的编码-解码结构。在本文中,我们基于 von Mises 分布设计了一种角度边界无损失(ABFL),旨在解决检测有向物体时的 ABD 问题。具体而言,ABFL 提出在度量角度差时将角度视为圆形数据而非线性数据,通过引入角度周期性来缓解 ABD 问题并提高角度差度量的准确性。同时,对于 AOOD 任务中由旋转对称性引起的各种周期边界不连续问题,ABFL 提供了一种简单有效的解决方案,因为它不需要为有向 Bbox 引入额外的编码-解码结构。在 DOTA 和 HRSC2016 数据集上的大量实验表明,所提出的 ABFL 损失优于一些专注于解决 ABD 问题的最新方法。”
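A sketch of a von Mises-based angular loss consistent with the description above (the concrete negative-log-likelihood form and the `kappa` value are our assumptions). Since the penalty depends on the angle difference only through a cosine, it is periodic and has no discontinuity at angular boundaries:

```python
import torch

def von_mises_angle_loss(theta_pred, theta_gt, kappa=2.0):
    # Negative log-likelihood of theta_gt under vonMises(theta_pred, kappa):
    #   -log p = -kappa * cos(theta_pred - theta_gt) + log(2*pi*I0(kappa)).
    # The log-normalizer is constant in theta, so only the cosine term carries gradient.
    const = torch.log(2 * torch.pi * torch.special.i0(torch.tensor(kappa)))
    return (-kappa * torch.cos(theta_pred - theta_gt) + const).mean()

# The loss is identical at theta and theta + 2*pi, avoiding the ABD jump:
print(von_mises_angle_loss(torch.tensor([0.01]), torch.tensor([2 * torch.pi - 0.01])))
```

For box representations with period pi rather than 2*pi, angles can be rescaled by 2 before applying the loss; the appropriate period depends on the chosen Bbox parameterization.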

Challenges in Video-Based Infant Action Recognition: A Critical Examination of the State of the Art

  • paper_url: http://arxiv.org/abs/2311.12300
  • repo_url: https://github.com/ostadabbas/video-based-infant-action-recognition
  • paper_authors: Elaheh Hatamimajoumerd, Pooria Daneshvar Kakhaki, Xiaofei Huang, Lingfei Luan, Somaieh Amraee, Sarah Ostadabbas
  • for: 这篇论文旨在提高婴儿动作识别的精度,以支持婴儿安全监测、发育里程碑跟踪、发育迟缓的早期发现、促进亲子关系、改进计算机辅助诊断,并帮助科学界更好地理解婴儿发育。
  • methods: 这篇论文构建了新的数据集“InfActPrimitive”,包含5类重要的婴儿里程碑动作,并对婴儿数据采用了专门的预处理技术。作者们使用当前最先进的基于骨架的动作识别模型进行了广泛的对比分析。
  • results: 研究发现,尽管 PoseC3D 模型的准确率达到了约71%,但其他模型几乎无法正确地捕捉婴儿动作的动态特征。这表明了成人动作识别领域和婴儿动作识别领域之间的知识差距很大,并且需要开发数据效率的管道模型。
    Abstract Automated human action recognition, a burgeoning field within computer vision, boasts diverse applications spanning surveillance, security, human-computer interaction, tele-health, and sports analysis. Precise action recognition in infants serves a multitude of pivotal purposes, encompassing safety monitoring, developmental milestone tracking, early intervention for developmental delays, fostering parent-infant bonds, advancing computer-aided diagnostics, and contributing to the scientific comprehension of child development. This paper delves into the intricacies of infant action recognition, a domain that has remained relatively uncharted despite the accomplishments in adult action recognition. In this study, we introduce a groundbreaking dataset called ``InfActPrimitive'', encompassing five significant infant milestone action categories, and we incorporate specialized preprocessing for infant data. We conducted an extensive comparative analysis employing cutting-edge skeleton-based action recognition models using this dataset. Our findings reveal that, although the PoseC3D model achieves the highest accuracy at approximately 71%, the remaining models struggle to accurately capture the dynamics of infant actions. This highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.
    摘要 自动人体动作识别是计算机视觉中一个蓬勃发展的领域,其应用涵盖监控、安全、人机交互、远程医疗和体育分析。准确识别婴儿的动作可服务于多个关键目标,包括安全监测、发育里程碑跟踪、发育迟缓的早期干预、促进亲子关系、推进计算机辅助诊断,以及加深对儿童发育的科学理解。本文深入探讨婴儿动作识别这一领域;尽管成人动作识别已取得诸多成果,该领域仍相对缺乏研究。在这项研究中,我们发布了一个名为“InfActPrimitive”的开创性数据集,涵盖5类重要的婴儿里程碑动作,并对婴儿数据采用了专门的预处理。我们使用最先进的基于骨架的动作识别模型进行了广泛的对比分析。结果表明,尽管 PoseC3D 模型以约71%的准确率取得最佳成绩,其余模型大多难以准确捕捉婴儿动作的动态特征。这凸显了婴儿与成人动作识别领域之间存在巨大的知识鸿沟,以及对数据高效的管线模型的迫切需求。

Instance-aware 3D Semantic Segmentation powered by Shape Generators and Classifiers

  • paper_url: http://arxiv.org/abs/2311.12291
  • repo_url: None
  • paper_authors: Bo Sun, Qixing Huang, Xiangru Huang
  • for: 提高3D semantic segmentation的精度和效果,增强模型对形状信息的感知和利用。
  • methods: 结合多个在实例级别受到监督的几何处理任务,包括形状生成和形状分类任务,以提高特征表示的精度和效果。
  • results: 在多个公共benchmark上显著超越现有方法,包括Waymo Open Dataset、SemanticKITTI和ScanNetV2等。
    Abstract Existing 3D semantic segmentation methods rely on point-wise or voxel-wise feature descriptors to output segmentation predictions. However, these descriptors are often supervised at point or voxel level, leading to segmentation models that can behave poorly at instance-level. In this paper, we proposed a novel instance-aware approach for 3D semantic segmentation. Our method combines several geometry processing tasks supervised at instance-level to promote the consistency of the learned feature representation. Specifically, our methods use shape generators and shape classifiers to perform shape reconstruction and classification tasks for each shape instance. This enforces the feature representation to faithfully encode both structural and local shape information, with an awareness of shape instances. In the experiments, our method significantly outperform existing approaches in 3D semantic segmentation on several public benchmarks, such as Waymo Open Dataset, SemanticKITTI and ScanNetV2.
    摘要 现有的3D语义分割方法通常基于逐点或逐体素的特征描述子来输出分割预测。然而,这些描述子往往只在点级或体素级受到监督,导致分割模型在实例级别可能表现不佳。在这篇论文中,我们提出了一种新的实例感知方法用于3D语义分割。我们的方法结合多个在实例级别受到监督的几何处理任务,以提升所学特征表示的一致性。特别是,我们的方法使用形状生成器和形状分类器,为每个形状实例执行形状重建和分类任务。这迫使特征表示忠实地编码结构信息和局部形状信息,并具备对形状实例的感知。在实验中,我们的方法在多个公共基准(包括 Waymo Open Dataset、SemanticKITTI 和 ScanNetV2)上显著超越了现有方法。

Procedural Generation of Grain Orientations using the Wave Function Collapse Algorithm

  • paper_url: http://arxiv.org/abs/2311.12272
  • repo_url: None
  • paper_authors: G. Magny-Fokam, D. Madisetti, J. El-Awady
  • for: 这篇论文旨在研究如何利用参考电子背散射衍射(EBSD)图生成统计上相似的晶粒微观组织,以便进一步分析金属(如316L不锈钢)的变形与失效行为。
  • methods: 这 paper 使用了两种方法来生成粒体结构:Wave Function Collapse (WFC) 算法和 Markov Junior。
  • results: 研究发现,MarkovJunior 算法比 WFC 算法更有效,能够生成与 EBSD 图相符的 Voronoi 划分,并可在 Python 中用于构建 EBSD 图。对比参考 EBSD 与生成 EBSD 的结果发现,二者的取向与体积分数极为相似,因此可以得出结论:MarkovJunior 是一种能够重建具有代表性晶粒微观组织的有效机器学习工具。
    Abstract Statistics of grain sizes and orientations in metals correlate to the material's mechanical properties. Reproducing representative volume elements for further analysis of deformation and failure in metals, like 316L stainless steel, is particularly important due to their wide use in manufacturing goods today. Two approaches, initially created for video games, were considered for the procedural generation of representative grain microstructures. The first is the Wave Function Collapse (WFC) algorithm, and the second is constraint propagation and probabilistic inference through Markov Junior, a free and open-source software. This study aimed to investigate these two algorithms' effectiveness in using reference electron backscatter diffraction (EBSD) maps and recreating a statistically similar one that could be used in further research. It utilized two stainless steel EBSD maps as references to test both algorithms. First, the WFC algorithm was too constricting and, thus, incapable of producing images that resembled EBSDs. The second, MarkovJunior, was much more effective in creating a Voronoi tessellation that could be used to create an EBSD map in Python. When comparing the results between the reference and the generated EBSD, we discovered that the orientation and volume fractions were extremely similar. With the study, it was concluded that MarkovJunior is an effective machine learning tool that can reproduce representative grain microstructures.
    摘要 金属中晶粒尺寸与取向的统计特性与材料的力学性能相关。对于像316L不锈钢这样在当今制造业中广泛使用的材料,重建具有代表性的体积单元以进一步分析其变形与失效行为尤为重要。本研究考察了两种最初为电子游戏开发的程序化生成方法,用于生成具有代表性的晶粒微观组织:其一是波函数坍缩(WFC)算法,其二是通过免费开源软件 Markov Junior 进行约束传播与概率推断。本研究旨在考察这两种算法能否利用参考电子背散射衍射(EBSD)图重建统计上相似、可用于后续研究的EBSD图,并以两张不锈钢EBSD图作为参考对二者进行了测试。结果表明,WFC算法约束过强,无法生成与EBSD相似的图像;而 MarkovJunior 则更为有效,能生成可在 Python 中用于构建EBSD图的Voronoi划分。对比参考EBSD与生成EBSD的结果发现,二者的取向与体积分数极为相似。由此可以得出结论:MarkovJunior 是一种能够重建具有代表性晶粒微观组织的有效机器学习工具。
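A minimal illustration of the Voronoi-tessellation grain structure that the study extracts from MarkovJunior's output (grid size, grain count, and the nearest-seed construction are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 256; n_grains = 40
seeds = rng.uniform(0, [H, W], size=(n_grains, 2))        # grain nuclei positions
angles = rng.uniform(0, 2 * np.pi, size=n_grains)         # one orientation per grain

yy, xx = np.mgrid[0:H, 0:W]
pts = np.stack([yy.ravel(), xx.ravel()], axis=1)          # every pixel position
d2 = ((pts[:, None, :] - seeds[None, :, :]) ** 2).sum(-1) # squared distance to each seed
grain_id = d2.argmin(axis=1).reshape(H, W)                # nearest-seed Voronoi tessellation
orientation_map = angles[grain_id]                        # EBSD-like orientation field
```

Matching the seed density and the per-grain orientation distribution to a reference EBSD map is what makes such a tessellation statistically representative.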

Boosting Audio-visual Zero-shot Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12268
  • repo_url: https://github.com/chenhaoxing/KDA
  • paper_authors: Haoxing Chen, Yaohui Li, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang
  • for: 该论文主要针对音视频零样本学习问题,即基于成对的音视频序列识别未见过的类别。
  • methods: 该论文提出了一种简单而有效的框架,即知识感知分布适应(KDA),用于帮助模型更好地理解未见过的动作内容。具体而言,该框架使用大型语言模型从类别名称生成详细描述,并引入分布对齐损失和知识感知自适应间隔损失,以进一步提高对未见类别的泛化能力。
  • results: 实验结果表明,与现有方法相比,我们提出的 KDA 在三个流行的音视频零样本学习数据集上取得了更高的识别性能。
    Abstract Audio-visual zero-shot learning aims to recognize unseen categories based on paired audio-visual sequences. Recent methods mainly focus on learning aligned and discriminative multi-modal features to boost generalization towards unseen categories. However, these approaches ignore the obscure action concepts in category names and may inevitably introduce complex network structures with difficult training objectives. In this paper, we propose a simple yet effective framework named Knowledge-aware Distribution Adaptation (KDA) to help the model better grasp the novel action contents with an external knowledge base. Specifically, we first propose using large language models to generate rich descriptions from category names, which leads to a better understanding of unseen categories. Additionally, we propose a distribution alignment loss as well as a knowledge-aware adaptive margin loss to further improve the generalization ability towards unseen categories. Extensive experimental results demonstrate that our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be avaliable at \url{https://github.com/chenhaoxing/KDA}.
    摘要 音视频零样本学习的目标是基于成对的音视频序列识别未见过的类别。最近的方法主要集中在学习对齐且有判别力的多模态特征,以提升对未见类别的泛化能力。然而,这些方法忽略了类别名称中隐含的动作概念,并且往往引入结构复杂、训练目标困难的网络。在这篇论文中,我们提出了一个简单而有效的框架,名为知识感知分布适应(KDA),以借助外部知识库帮助模型更好地理解新的动作内容。具体而言,我们首先提出使用大型语言模型从类别名称生成丰富的描述,从而更好地理解未见类别。此外,我们还提出了分布对齐损失和知识感知自适应间隔损失,以进一步提高对未见类别的泛化能力。大量实验结果表明,我们提出的 KDA 在三个流行的音视频零样本学习数据集上超越了当前最先进的方法。我们的代码将在 https://github.com/chenhaoxing/KDA 上提供。

Virtual Home Staging: Inverse Rendering and Editing an Indoor Panorama under Natural Illumination

  • paper_url: http://arxiv.org/abs/2311.12265
  • repo_url: https://github.com/gzhji/cali-hdr-dataset
  • paper_authors: Guanzhou Ji, Azadeh O. Sawyer, Srinivasa G. Narasimhan
  • for: 这篇论文旨在解决在自然光照下更新现有室内全景图中家具布局的问题。
  • methods: 该方法包括三个关键组成部分:(1)检测和移除panorama中的家具,(2)自动生成室内地板设计,(3)使用场景几何、新家具对象和实时outdoor图像进行全景渲染。
  • results: 该方法可以在不同的outdoor照明条件下实现室内场景的重新渲染,并且提供了一个新的标准化HDR(Cali-HDR)数据集,包括137个标准化室内panorama和其相应的outdoor图像。
    Abstract We propose a novel inverse rendering method that enables the transformation of existing indoor panoramas with new indoor furniture layouts under natural illumination. To achieve this, we captured indoor HDR panoramas along with real-time outdoor hemispherical HDR photographs. Indoor and outdoor HDR images were linearly calibrated with measured absolute luminance values for accurate scene relighting. Our method consists of three key components: (1) panoramic furniture detection and removal, (2) automatic floor layout design, and (3) global rendering with scene geometry, new furniture objects, and a real-time outdoor photograph. We demonstrate the effectiveness of our workflow in rendering indoor scenes under different outdoor illumination conditions. Additionally, we contribute a new calibrated HDR (Cali-HDR) dataset that consists of 137 calibrated indoor panoramas and their associated outdoor photographs. The source code and dataset are available: https://github.com/Gzhji/Cali-HDR-Dataset.
    摘要 我们提出了一种新的逆向渲染方法,可以在自然光照下将现有的室内全景图转换为带有新室内家具布局的图像。为实现这一点,我们采集了室内HDR全景图以及实时的室外半球HDR照片。室内与室外HDR图像均通过实测的绝对亮度值进行线性标定,以实现准确的场景重光照。我们的方法包括三个关键组件:(1)全景图中家具的检测与移除,(2)自动化的室内地板布局设计,(3)结合场景几何、新家具对象和实时室外照片的全局渲染。我们验证了该工作流程在不同室外光照条件下渲染室内场景的有效性。此外,我们还贡献了一个新的标定HDR(Cali-HDR)数据集,包含137张标定的室内全景图及其对应的室外照片。源代码和数据集可在以下地址获取:https://github.com/Gzhji/Cali-HDR-Dataset。

cs.AI - 2023-11-21

From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models

  • paper_url: http://arxiv.org/abs/2311.13063
  • repo_url: None
  • paper_authors: Zachary Englhardt, Chengqian Ma, Margaret E. Morris, Xuhai “Orson” Xu, Chun-Cheng Chang, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, Vikram Iyer
  • for: 为心理健康专业人员提供来自患者日常生活的洞察
  • methods: 使用大型自然语言模型(LLMs)将多感器数据转化为临床实用信息
  • results: 成功实现了对抑郁症的二分类预测(精度达61.1%),并发现了一种新的人工智能协作方法,在临床决策过程中,临床专家和人工智能工具之间进行互动,以增强诊断的准确性和有用性。
    Abstract Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals insights from patient's daily lives; however, developing analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental health. To address these challenges, we take a novel approach that leverages large language models (LLMs) to synthesize clinically useful insights from multi-sensor data. We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data such as step count and sleep relate to conditions like depression and anxiety. We first demonstrate binary depression classification with LLMs achieving accuracies of 61.1% which exceed the state of the art. While it is not robust for clinical use, this leads us to our key finding: even more impactful and valued than classification is a new human-AI collaboration approach in which clinician experts interactively query these tools and combine their domain expertise and context about the patient with AI generated reasoning to support clinical decision-making. We find models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.
    摘要 由无处不在的传感器被动收集的行为健康数据,有望为心理健康专业人员提供来自患者日常生活的洞察;然而,要将这些数据用于临床实践,必须解决跨设备泛化的挑战,以及测量信号与个体心理健康之间相关性弱或含糊的问题。为此,我们采用一种新方法,利用大型语言模型(LLM)从多传感器数据中综合出具有临床价值的洞察。我们开发了思维链提示方法,使用LLM推理步数、睡眠等数据趋势与抑郁、焦虑等状况之间的关系。我们首先展示了使用LLM进行抑郁二分类,准确率达61.1%,超过了现有技术水平。尽管其稳健性尚不足以直接用于临床,这却引出了我们的关键发现:比分类更有影响力、更受重视的是一种新的人机协作方式——临床专家交互式地查询这些工具,将其领域专长和对患者的背景了解与AI生成的推理相结合,以支持临床决策。我们发现 GPT-4 等模型在75%的情况下能正确引用数值数据,临床参与者也对使用这种方法解读自我追踪数据表达了浓厚兴趣。

Latent Lab: Large Language Models for Knowledge Exploration

  • paper_url: http://arxiv.org/abs/2311.13051
  • repo_url: None
  • paper_authors: Kevin Dunnell, Trudy Painter, Andrew Stoddard, Andy Lippman
  • for: 本研究探讨了人工智能模型,特别是大型语言模型(LLMs),在创新过程中支持知识探索和人类创造力的潜力。
  • methods: 本研究使用了“潜在实验室”(Latent Lab),一种互动工具,用于探索MIT媒体实验室研究项目之间的连接,强调“探索”而不是搜索。
  • results: 在用户研究中,“潜在实验室”成功地引导用户探索到新的知识基础,并为人类-AI知识探索系统的进一步发展提供了基础。
    Abstract This paper investigates the potential of AI models, particularly large language models (LLMs), to support knowledge exploration and augment human creativity during ideation. We present "Latent Lab" an interactive tool for discovering connections among MIT Media Lab research projects, emphasizing "exploration" over search. The work offers insights into collaborative AI systems by addressing the challenges of organizing, searching, and synthesizing content. In a user study, the tool's success was evaluated based on its ability to introduce users to an unfamiliar knowledge base, ultimately setting the groundwork for the ongoing advancement of human-AI knowledge exploration systems.
    摘要 本文研究了人工智能模型,特别是大型语言模型(LLM),在构思阶段支持知识探索并增强人类创造力的潜力。我们提出了“Latent Lab”,一种用于发现MIT媒体实验室研究项目之间联系的交互式工具,强调“探索”而非搜索。这项工作通过应对内容组织、搜索与综合的挑战,为协作式人工智能系统提供了洞察。在用户研究中,该工具的成功以其能否将用户引入陌生知识库来评估,最终为人机协同知识探索系统的持续发展奠定了基础。

Do we listen to what we are told? An empirical study on human behaviour during the COVID-19 pandemic: neural networks vs. regression analysis

  • paper_url: http://arxiv.org/abs/2311.13046
  • repo_url: None
  • paper_authors: Yuxi Heluo, Kexin Wang, Charles W. Robson
  • for: 这项研究的目的是调查2020年COVID-19大流行期间,公众行为在多大程度上遵守与口罩佩戴相关的公共卫生政策。
  • methods: 研究结合基于目标检测的卷积神经网络、回归分析和多层感知器,分析了2020年维也纳公众的视觉数据,以考察人类行为。
  • results: 研究发现,与口罩佩戴相关的政府规定和公共交通通告促使人们在COVID-19大流行期间正确佩戴口罩;通告与规定内容的变化对人们行为产生了异质性影响。在预测人类行为方面,神经网络比回归分析更准确;同时,回归分析还使我们得以探究人类行为背后可能的因果路径。
    Abstract In this work, we contribute the first visual open-source empirical study on human behaviour during the COVID-19 pandemic, in order to investigate how compliant a general population is to mask-wearing-related public-health policy. Object-detection-based convolutional neural networks, regression analysis and multilayer perceptrons are combined to analyse visual data of the Viennese public during 2020. We find that mask-wearing-related government regulations and public-transport announcements encouraged correct mask-wearing-behaviours during the COVID-19 pandemic. Importantly, changes in announcement and regulation contents led to heterogeneous effects on people's behaviour. Comparing the predictive power of regression analysis and neural networks, we demonstrate that the latter produces more accurate predictions of population reactions during the COVID-19 pandemic. Our use of regression modelling also allows us to unearth possible causal pathways underlying societal behaviour. Since our findings highlight the importance of appropriate communication contents, our results will facilitate more effective non-pharmaceutical interventions to be developed in future. Adding to the literature, we demonstrate that regression modelling and neural networks are not mutually exclusive but instead complement each other.
    摘要 在这项研究中,我们提供了COVID-19疫情期间人类行为的首个视觉化开源实证研究,以考察公众在多大程度上遵守与口罩佩戴相关的公共卫生政策。我们结合基于目标检测的卷积神经网络、回归分析和多层感知器,分析了2020年维也纳公众的视觉数据。我们发现,与口罩佩戴相关的政府规定和公共交通通告促进了COVID-19疫情期间的正确佩戴行为;此外,通告与规定内容的变化对人们的行为产生了异质性影响。比较回归分析与神经网络的预测能力后,我们证明后者对疫情期间公众反应的预测更为准确。回归建模还使我们得以发掘社会行为背后可能的因果路径。由于我们的发现强调了恰当的宣传内容的重要性,这些结果将有助于未来制定更有效的非药物干预措施。作为对现有文献的补充,我们证明回归建模与神经网络并非互斥,而是可以相互补充。

Synaptic Sampling of Neural Networks

  • paper_url: http://arxiv.org/abs/2311.13038
  • repo_url: https://github.com/Fidan-Sadig/neural-network
  • paper_authors: James B. Aimone, William Severa, J. Darby Smith
  • for: 本文旨在描述一种能够直接通过伯努利掷硬币对神经网络进行采样的方法,以便在概率计算(probabilistic computing)技术上实现更好的不确定性量化。
  • methods: 本文使用了一种名为 scANN 的方法,即 \textit{sampling (by coinflips) artificial neural networks},将神经网络的权重视为伯努利掷硬币,从而通过采样来量化不确定性。
  • results: 实验结果表明,scANN 方法可以准确地描述神经网络输出的不确定性,同时几乎达到完全确定性方案的性能。
    Abstract Probabilistic artificial neural networks offer intriguing prospects for enabling the uncertainty of artificial intelligence methods to be described explicitly in their function; however, the development of techniques that quantify uncertainty by well-understood methods such as Monte Carlo sampling has been limited by the high costs of stochastic sampling on deterministic computing hardware. Emerging computing systems that are amenable to hardware-level probabilistic computing, such as those that leverage stochastic devices, may make probabilistic neural networks more feasible in the not-too-distant future. This paper describes the scANN technique -- \textit{sampling (by coinflips) artificial neural networks} -- which enables neural networks to be sampled directly by treating the weights as Bernoulli coin flips. This method is natively well suited for probabilistic computing techniques that focus on tunable stochastic devices, nearly matches fully deterministic performance while also describing the uncertainty of correct and incorrect neural network outputs.
    摘要 概率人工神经网络为在模型函数中显式描述人工智能方法的不确定性提供了诱人的前景;然而,在确定性计算硬件上进行随机采样的高昂成本,限制了以蒙特卡洛采样等成熟方法量化不确定性的技术发展。新兴的、支持硬件级概率计算的计算系统(例如利用随机器件的系统),可能在不远的将来使概率神经网络变得更加可行。本文描述了 scANN 技术——通过掷硬币对人工神经网络进行采样——该技术将权重视为伯努利掷硬币,从而可以直接对神经网络进行采样。这种方法天然适合以可调随机器件为核心的概率计算技术,在性能上几乎与完全确定性的方案相当,同时还能描述神经网络输出正确与否的不确定性。
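A toy NumPy sketch of sampling a layer "by coin flips" as described (the keep-probability, rescaling, and activation are illustrative assumptions): repeated Bernoulli draws over the weights yield an output distribution whose spread serves as an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(x, W, p=0.8, n_samples=100):
    """Treat each weight as a Bernoulli coin flip with keep-probability p."""
    outs = []
    for _ in range(n_samples):
        coins = rng.random(W.shape) < p            # one coin flip per weight
        outs.append(np.tanh(x @ (W * coins) / p))  # rescale by p to preserve the mean
    outs = np.stack(outs)
    return outs.mean(0), outs.std(0)               # prediction and its uncertainty

x = rng.normal(size=(1, 8)); W = rng.normal(size=(8, 4))
mean, std = sample_layer(x, W)   # std quantifies the layer's output uncertainty
```

On stochastic hardware each coin flip would come from a tunable random device rather than a software RNG, which is what makes the approach attractive for such platforms.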

DMLR: Data-centric Machine Learning Research – Past, Present and Future

  • paper_url: http://arxiv.org/abs/2311.13028
  • repo_url: None
  • paper_authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, Ahmed Alaa, Adji Bousso Dieng, Natasha Noy, Vijay Janapa Reddi, James Zou, Praveen Paritosh, Mihaela van der Schaar, Kurt Bollacker, Lora Aroyo, Ce Zhang, Joaquin Vanschoren, Isabelle Guyon, Peter Mattson
  • for: 本文提出了创建下一代公共数据集的需求,以推动机器学习科学的发展。
  • methods: 本文使用了社区参与和基础设施建设的方法,以创建和维护这些数据集和方法。
  • results: 本文规划了一条共同努力的前进道路,以持续这些数据集与方法的创建和维护,从而产生积极的科学、社会与商业影响。
    Abstract Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
    摘要 根据第一届DMLR会议在ICML2023年举行的讨论和前一些会议,本报告探讨了公共数据集的创建和维护对机器学习科学的发展 relevance,并提出了共同努力的道路,以实现这些数据集和方法对科学、社会和商业产生积极影响。

Attention: Large Multimodal Model is Watching your Geo-privacy

  • paper_url: http://arxiv.org/abs/2311.13018
  • repo_url: None
  • paper_authors: Yifan Yang, Yixian Zhang, Daoyang Li, Shuju Sun, Junhong Duan, Junzhou He, Qingyang Wu, Hao Liu
  • for: 本研究旨在探讨在在线数据共享日益增多、信息收集技术不断进步的背景下,个人地理隐私如何遭到泄露,尤其是在社交媒体和大数据时代,这一风险显著增加。
  • methods: 本研究使用了一种基于GPT-4的模型,名为“Dr. Watson”,来分析和提取社交媒体和公共数据源中的地理信息。
  • results: 实验结果表明,“Dr. Watson”能够成功地提取特定的地理信息,从而暴露了当前的地理隐私措施的漏洞。这些发现强调了在社交媒体和大数据时代,个人隐私泄露的容易性。
    Abstract Geographic privacy, a crucial aspect of personal security, often goes unnoticed in daily activities. This paper addresses the underestimation of this privacy in the context of increasing online data sharing and the advancements in information gathering technologies. With the surge in the use of Large Multimodal Models, such as GPT-4, for Open Source Intelligence (OSINT), the potential risks associated with geographic privacy breaches have intensified. This study highlights the criticality of these developments, focusing on their implications for individual privacy. The primary objective is to demonstrate the capabilities of advanced AI tools, specifically a GPT-4 based model named "Dr. Watson," in identifying and potentially compromising geographic privacy through online shared content. We developed "Dr. Watson" to analyze and extract geographic information from publicly available data sources. The study involved five experimental cases, each offering different perspectives on the tool's application in extracting precise location data from partial images and social media content. The experiments revealed that "Dr. Watson" could successfully identify specific geographic details, thereby exposing the vulnerabilities in current geo-privacy measures. These findings underscore the ease with which geographic information can be unintentionally disclosed. The paper concludes with a discussion on the broader implications of these findings for individuals and the community at large. It emphasizes the urgency for enhanced awareness and protective measures against geo-privacy leakage in the era of advanced AI and widespread social media usage.
    摘要 地理隐私是个人安全的重要方面,却常在日常活动中被忽视。本文针对在线数据共享日益增多、信息收集技术不断进步的背景下,这一隐私被低估的问题展开研究。随着GPT-4等大型多模态模型在开源情报(OSINT)中的广泛使用,地理隐私泄露的潜在风险愈发加剧。本研究强调这些进展的严重性,并聚焦其对个人隐私的影响。主要目标是展示先进AI工具——具体而言是一个名为“Dr. Watson”的基于GPT-4的模型——通过线上共享内容识别并可能侵犯地理隐私的能力。我们开发了“Dr. Watson”来分析并从公开数据源中提取地理信息。研究包含五个实验案例,从不同角度考察该工具从局部图像和社交媒体内容中提取精确位置数据的应用。实验表明,“Dr. Watson”能成功识别特定的地理细节,暴露了当前地理隐私保护措施的薄弱之处。这些发现凸显了地理信息被无意泄露的容易程度。文章最后讨论了这些发现对个人和整个社会的更广泛影响,强调在先进AI与社交媒体普及的时代,迫切需要提升防范地理隐私泄露的意识与保护措施。

CovarNav: Machine Unlearning via Model Inversion and Covariance Navigation

  • paper_url: http://arxiv.org/abs/2311.12999
  • repo_url: None
  • paper_authors: Ali Abbasi, Chayne Thrash, Elaheh Akbari, Daniel Zhang, Soheil Kolouri
  • for: This work addresses AI models' dependence on their training data by selectively removing the influence of specific training data points from a trained model.
  • methods: A model inversion attack first recovers a proxy for the model's training data; the forget set is then mislabeled, and a gradient projection method minimizes the cross-entropy loss on the mislabeled forget set while preventing forgetting of the inverted samples.
  • results: Rigorous evaluation on the CIFAR-10 and Vggface2 datasets against recent benchmarks demonstrates the efficacy of the proposed approach.
    Abstract The rapid progress of AI, combined with its unprecedented public adoption and the propensity of large neural networks to memorize training data, has given rise to significant data privacy concerns. To address these concerns, machine unlearning has emerged as an essential technique to selectively remove the influence of specific training data points on trained models. In this paper, we approach the machine unlearning problem through the lens of continual learning. Given a trained model and a subset of training data designated to be forgotten (i.e., the "forget set"), we introduce a three-step process, named CovarNav, to facilitate this forgetting. Firstly, we derive a proxy for the model's training data using a model inversion attack. Secondly, we mislabel the forget set by selecting the most probable class that deviates from the actual ground truth. Lastly, we deploy a gradient projection method to minimize the cross-entropy loss on the modified forget set (i.e., learn incorrect labels for this set) while preventing forgetting of the inverted samples. We rigorously evaluate CovarNav on the CIFAR-10 and Vggface2 datasets, comparing our results with recent benchmarks in the field and demonstrating the efficacy of our proposed approach.
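The gradient-projection step can be sketched concretely. Below is a minimal PyTorch illustration, not the authors' implementation: it assumes a classifier `model`, a forget batch with deliberately wrong labels, and a proxy batch recovered by model inversion, and uses a PCGrad-style projection (an assumption; the paper's exact projection may differ) to keep the forget update from degrading the proxy samples.

```python
import torch
import torch.nn.functional as F

def covarnav_style_step(model, optimizer, forget_x, wrong_y, proxy_x, proxy_y):
    """One hypothetical unlearning step: fit the incorrect labels on the
    forget set while projecting out the gradient component that conflicts
    with retaining the inverted (proxy) samples."""
    params = list(model.parameters())
    # Gradient of the forget objective (learn the wrong labels).
    loss_forget = F.cross_entropy(model(forget_x), wrong_y)
    g_forget = torch.autograd.grad(loss_forget, params)
    # Gradient of the retain objective (keep fitting the proxy samples).
    loss_retain = F.cross_entropy(model(proxy_x), proxy_y)
    g_retain = torch.autograd.grad(loss_retain, params)
    gf = torch.cat([g.reshape(-1) for g in g_forget])
    gr = torch.cat([g.reshape(-1) for g in g_retain])
    dot = torch.dot(gf, gr)
    if dot < 0:  # descending the forget loss would ascend the retain loss
        gf = gf - (dot / gr.pow(2).sum()) * gr  # PCGrad-style projection
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = gf[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```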

RLIF: Interactive Imitation Learning as Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.12996
  • repo_url: None
  • paper_authors: Jianlan Luo, Perry Dong, Yuexiang Zhai, Yi Ma, Sergey Levine
  • for: This work proposes a reinforcement-learning-based method for automatic skill acquisition that addresses the distributional shift problems arising in practical, learning-based robotic control.
  • methods: Reinforcement learning is used with online human intervention signals themselves serving as rewards, collecting correction data to address the distributional shift challenges that afflict naive behavioral cloning, without assuming the intervening expert is near-optimal.
  • results: The proposed method strongly outperforms DAgger-like approaches across different tasks, especially when the intervening experts are suboptimal. Code and videos can be found on the project website: rlif-page.github.io
    Abstract Although reinforcement learning methods offer a powerful framework for automatic skill acquisition, for practical learning-based control problems in domains such as robotics, imitation learning often provides a more convenient and accessible alternative. In particular, an interactive imitation learning method such as DAgger, which queries a near-optimal expert to intervene online to collect correction data for addressing the distributional shift challenges that afflict na\"ive behavioral cloning, can enjoy good performance both in theory and practice without requiring manually specified reward functions and other components of full reinforcement learning methods. In this paper, we explore how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning. Our proposed method uses reinforcement learning with user intervention signals themselves as rewards. This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potential suboptimal human expert. We also provide a unified framework to analyze our RL method and DAgger; for which we present the asymptotic analysis of the suboptimal gap for both methods as well as the non-asymptotic sample complexity bound of our method. We then evaluate our method on challenging high-dimensional continuous control simulation benchmarks as well as real-world robotic vision-based manipulation tasks. The results show that it strongly outperforms DAgger-like approaches across the different tasks, especially when the intervening experts are suboptimal. Code and videos can be found on the project website: rlif-page.github.io
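The core reward construction is simple enough to sketch. A minimal, hypothetical relabeling routine, assuming each trajectory step records whether the human intervened (the field names are illustrative):

```python
def relabel_with_intervention_rewards(trajectory):
    """Hypothetical RLIF-style relabeling: the agent is penalized exactly
    where the human chose to intervene and receives zero reward elsewhere.
    `trajectory` is assumed to be a list of dicts with keys
    'obs', 'action', and 'intervened'."""
    return [
        (step["obs"], step["action"], -1.0 if step["intervened"] else 0.0)
        for step in trajectory
    ]
```

An off-policy RL algorithm can then be trained on these relabeled transitions, so the agent learns to avoid the states that trigger interventions.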

NERIF: GPT-4V for Automatic Scoring of Drawn Models

  • paper_url: http://arxiv.org/abs/2311.12990
  • repo_url: None
  • paper_authors: Gyeong-Geon Lee, Xiaoming Zhai
  • for: The paper aims to advance scientific modeling practices by leveraging the powerful image processing capability of GPT-4V to automatically score student-drawn models of science phenomena.
  • methods: The paper develops NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning), which employs instructional notes and rubrics to prompt GPT-4V to score student-drawn models.
  • results: GPT-4V's average scoring accuracy was .51, higher for the 'Beginning' and 'Developing' classes and lower for the 'Proficient' class. The study also reveals how GPT-4V retrieves information from image input, narrates student-drawn models in natural language, and assigns scores according to the given scoring rubric and instructional notes.
    Abstract Scoring student-drawn models is time-consuming. Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices by leveraging the powerful image processing capability. To test this ability specifically for automatic scoring, we developed a method NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning) employing instructional note and rubrics to prompt GPT-4V to score students' drawn models for science phenomena. We randomly selected a set of balanced data (N = 900) that includes student-drawn models for six modeling assessment tasks. Each model received a score from GPT-4V ranging at three levels: 'Beginning,' 'Developing,' or 'Proficient' according to scoring rubrics. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. Results show that GPT-4V's average scoring accuracy was mean =.51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are more challenging to score. Further qualitative study reveals how GPT-4V retrieves information from image input, including problem context, example evaluations provided by human coders, and students' drawing models. We also uncovered how GPT-4V catches the characteristics of student-drawn models and narrates them in natural language. At last, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. Our findings suggest that the NERIF is an effective approach for employing GPT-4V to score drawn models. Even though there is space for GPT-4V to improve scoring accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study show that utilizing GPT-4V for automatic scoring of student-drawn models is promising.
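Since NERIF is a prompting strategy, its structure can be suggested with a small sketch. The template below is an illustrative assumption, not the authors' exact instructional notes or rubric wording:

```python
def build_nerif_style_prompt(rubric, notes, examples):
    """Minimal sketch of assembling a notation-enhanced rubric prompt for a
    vision-language scorer; all field names here are hypothetical."""
    example_text = "\n".join(
        f"Example ({e['level']}): {e['rationale']}" for e in examples
    )
    return (
        "You will score a student-drawn model of a science phenomenon.\n"
        f"Scoring rubric:\n{rubric}\n"
        f"Instructional notes:\n{notes}\n"
        f"{example_text}\n"
        "Assign exactly one level: Beginning, Developing, or Proficient."
    )
```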

Unsupervised Graph Attention Autoencoder for Attributed Networks using K-means Loss

  • paper_url: http://arxiv.org/abs/2311.12986
  • repo_url: None
  • paper_authors: Abdelfateh Bekkaira, Slimane Bellaouar, Slimane Oulad-Naoui
  • for: This work proposes a simple, efficient, clustering-oriented model based on an unsupervised graph attention autoencoder for community detection in attributed networks.
  • methods: The model learns representations from both the network's topology and its attribute information, jointly addressing reconstruction and community discovery; it uses a k-means objective and a multi-head graph attention autoencoder for decoding.
  • results: Experiments on three attributed-network datasets show the method surpasses state-of-the-art algorithms in NMI and ARI and scales effectively with network size, aiding interpretation of community structure in biological networks, social networks, and other domains.
    Abstract Several natural phenomena and complex systems are often represented as networks. Discovering their community structure is a fundamental task for understanding these networks. Many algorithms have been proposed, but recently, Graph Neural Networks (GNN) have emerged as a compelling approach for enhancing this task.In this paper, we introduce a simple, efficient, and clustering-oriented model based on unsupervised \textbf{G}raph Attention \textbf{A}uto\textbf{E}ncoder for community detection in attributed networks (GAECO). The proposed model adeptly learns representations from both the network's topology and attribute information, simultaneously addressing dual objectives: reconstruction and community discovery. It places a particular emphasis on discovering compact communities by robustly minimizing clustering errors. The model employs k-means as an objective function and utilizes a multi-head Graph Attention Auto-Encoder for decoding the representations. Experiments conducted on three datasets of attributed networks show that our method surpasses state-of-the-art algorithms in terms of NMI and ARI. Additionally, our approach scales effectively with the size of the network, making it suitable for large-scale applications. The implications of our findings extend beyond biological network interpretation and social network analysis, where knowledge of the fundamental community structure is essential.
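The dual objective, reconstruction plus a k-means term, can be sketched in PyTorch. This is an illustrative assumption of how such a loss might be combined, not the paper's code; `z`, `adj`, and `centroids` are hypothetical inputs:

```python
import torch
import torch.nn.functional as F

def gaeco_style_loss(z, adj, centroids, alpha=1.0):
    """Joint objective sketch: adjacency reconstruction from node embeddings
    plus a k-means style term pulling each embedding toward its nearest
    centroid. z: (N, d) embeddings, adj: dense (N, N) 0/1 float adjacency,
    centroids: (k, d)."""
    adj_hat = torch.sigmoid(z @ z.t())                   # inner-product decoder
    recon = F.binary_cross_entropy(adj_hat, adj)         # reconstruction term
    d2 = torch.cdist(z, centroids).pow(2)                # (N, k) squared dists
    kmeans = d2.min(dim=1).values.mean()                 # clustering term
    return recon + alpha * kmeans
```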

GAIA: a benchmark for General AI Assistants

  • paper_url: http://arxiv.org/abs/2311.12983
  • repo_url: None
  • paper_authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
  • for: The paper introduces GAIA, a benchmark for General AI Assistants; solving it would represent a milestone in AI research.
  • methods: GAIA poses real-world questions that require fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency.
  • results: Human respondents answer 92% of GAIA questions correctly versus 15% for GPT-4 equipped with plugins, a disparity that contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills, e.g., in law or chemistry.
    Abstract We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92\% vs. 15\% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Neural Approximate Dynamic Programming for the Ultra-fast Order Dispatching Problem

  • paper_url: http://arxiv.org/abs/2311.12975
  • repo_url: None
  • paper_authors: Arash Dehghan, Mucahit Cevik, Merve Bodur
  • for: The paper aims to enhance the operational efficiency of same-day delivery (SDD) services by solving the ultra-fast order dispatching problem (ODP) within a centralized warehouse setting.
  • methods: The paper introduces extensions to the ultra-fast ODP, such as order batching and explicit courier assignments, and uses NeurADP (a combination of Approximate Dynamic Programming and Deep Reinforcement Learning) as the solution method.
  • results: NeurADP significantly outperforms myopic and DRL baselines, and the inclusion of order batching and courier queues enhances the efficiency of delivery operations; detailed sensitivity analysis confirms the robustness of NeurADP under different scenarios.
    Abstract Same-Day Delivery (SDD) services aim to maximize the fulfillment of online orders while minimizing delivery delays but are beset by operational uncertainties such as those in order volumes and courier planning. Our work aims to enhance the operational efficiency of SDD by focusing on the ultra-fast Order Dispatching Problem (ODP), which involves matching and dispatching orders to couriers within a centralized warehouse setting, and completing the delivery within a strict timeline (e.g., within minutes). We introduce important extensions to ultra-fast ODP such as order batching and explicit courier assignments to provide a more realistic representation of dispatching operations and improve delivery efficiency. As a solution method, we primarily focus on NeurADP, a methodology that combines Approximate Dynamic Programming (ADP) and Deep Reinforcement Learning (DRL), and our work constitutes the first application of NeurADP outside of the ride-pool matching problem. NeurADP is particularly suitable for ultra-fast ODP as it addresses complex one-to-many matching and routing intricacies through a neural network-based VFA that captures high-dimensional problem dynamics without requiring manual feature engineering as in generic ADP methods. We test our proposed approach using four distinct realistic datasets tailored for ODP and compare the performance of NeurADP against myopic and DRL baselines by also making use of non-trivial bounds to assess the quality of the policies. Our numerical results indicate that the inclusion of order batching and courier queues enhances the efficiency of delivery operations and that NeurADP significantly outperforms other methods. Detailed sensitivity analysis with important parameters confirms the robustness of NeurADP under different scenarios, including variations in courier numbers, spatial setup, vehicle capacity, and permitted delay time.

Clustered Policy Decision Ranking

  • paper_url: http://arxiv.org/abs/2311.12970
  • repo_url: None
  • paper_authors: Mark Levin, Hana Chockler
  • for: This paper ranks the importance of the decisions made by a policy trained via RL.
  • methods: A black-box method based on statistical covariance estimation clusters the states of the environment and ranks each cluster according to the importance of the decisions made in its states.
  • results: The method infers the importance of the different decisions made during an episode and compares favorably with a previous statistical fault-localization-based ranking procedure.
    Abstract Policies trained via reinforcement learning (RL) are often very complex even for simple tasks. In an episode with n time steps, a policy will make n decisions on actions to take, many of which may appear non-intuitive to the observer. Moreover, it is not clear which of these decisions directly contribute towards achieving the reward and how significant their contribution is. Given a trained policy, we propose a black-box method based on statistical covariance estimation that clusters the states of the environment and ranks each cluster according to the importance of decisions made in its states. We compare our measure against a previous statistical fault localization based ranking procedure.

Robustifying Generalizable Implicit Shape Networks with a Tunable Non-Parametric Model

  • paper_url: http://arxiv.org/abs/2311.12967
  • repo_url: https://github.com/Ouasfi/Feat-NKRR-adaptation
  • paper_authors: Amine Ouasfi, Adnane Boukhayma
  • for: implicit shape reconstruction from unoriented point cloud
  • methods: feedforward generalizable models with inter-shape data prior and intra-shape regularization prior
  • results: improved performance and efficiency, with an adaptive expressiveness-robustness trade-off, demonstrated on synthetic and real data.
    Abstract Feedforward generalizable models for implicit shape reconstruction from unoriented point cloud present multiple advantages, including high performance and inference speed. However, they still suffer from generalization issues, ranging from underfitting the input point cloud, to misrepresenting samples outside of the training data distribution, or with toplogies unseen at training. We propose here an efficient mechanism to remedy some of these limitations at test time. We combine the inter-shape data prior of the network with an intra-shape regularization prior of a Nystr\"om Kernel Ridge Regression, that we further adapt by fitting its hyperprameters to the current shape. The resulting shape function defined in a shape specific Reproducing Kernel Hilbert Space benefits from desirable stability and efficiency properties and grants a shape adaptive expressiveness-robustness trade-off. We demonstrate the improvement obtained through our method with respect to baselines and the state-of-the-art using synthetic and real data.
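The intra-shape prior is a kernel ridge regression fit to the current shape. A toy sketch using scikit-learn's exact `KernelRidge` (the paper uses a Nyström approximation and fits the hyperparameters to each shape, both omitted here; the data below is synthetic):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Fit a kernel ridge regressor to (point, signed-distance) pairs as an
# intra-shape prior; illustrative only, with a toy SDF of a sphere.
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(500, 3))          # surface-adjacent samples
sdf_targets = np.linalg.norm(points, axis=1) - 0.5  # toy ground-truth SDF

krr = KernelRidge(kernel="rbf", alpha=1e-3, gamma=10.0)
krr.fit(points, sdf_targets)
queries = rng.uniform(-1, 1, size=(5, 3))
print(krr.predict(queries))  # approximate signed distances at query points
```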

Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection

  • paper_url: http://arxiv.org/abs/2311.12956
  • repo_url: https://github.com/sashamatsun/lskdiffdet
  • paper_authors: Ahmed Sharshar, Aleksandr Matsun
  • for: This paper focuses on improving the accuracy and efficiency of object detection in aerial images, with a specific emphasis on small objects and diverse orientations.
  • methods: The proposed approach utilizes the Large Selective Kernel Network (LSKNet) as the backbone and combines it with the DiffusionDet head, along with several novel methodologies and ablation studies to refine the model's performance.
  • results: The proposed model achieves a mean average precision (mAP) of approximately 45.7%, outperforming the RCNN model by 4.7% on the same dataset, a significant improvement in object detection accuracy and efficiency.
    Abstract In the realm of aerial image analysis, object detection plays a pivotal role, with significant implications for areas such as remote sensing, urban planning, and disaster management. This study addresses the inherent challenges in this domain, notably the detection of small objects, managing densely packed elements, and accounting for diverse orientations. We present an in-depth evaluation of an object detection model that integrates the Large Selective Kernel Network (LSKNet)as its backbone with the DiffusionDet head, utilizing the iSAID dataset for empirical analysis. Our approach encompasses the introduction of novel methodologies and extensive ablation studies. These studies critically assess various aspects such as loss functions, box regression techniques, and classification strategies to refine the model's precision in object detection. The paper details the experimental application of the LSKNet backbone in synergy with the DiffusionDet heads, a combination tailored to meet the specific challenges in aerial image object detection. The findings of this research indicate a substantial enhancement in the model's performance, especially in the accuracy-time tradeoff. The proposed model achieves a mean average precision (MAP) of approximately 45.7%, which is a significant improvement, outperforming the RCNN model by 4.7% on the same dataset. This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis, paving the way for more accurate and efficient object detection methodologies. The code is publicly available at https://github.com/SashaMatsun/LSKDiffDet

PINNs-Based Uncertainty Quantification for Transient Stability Analysis

  • paper_url: http://arxiv.org/abs/2311.12947
  • repo_url: None
  • paper_authors: Ren Wang, Ming Zhong, Kaidi Xu, Lola Giráldez Sánchez-Cortés, Ignacio de Cominges Guerra
  • for: This paper addresses the challenge of transient stability in power systems with missing parameters and uncertainty propagation in swing equations.
  • methods: A novel Ensemble of Physics-Informed Neural Networks (E-PINNs) estimates critical parameters such as the rotor angle and inertia coefficient with enhanced accuracy and reduced computational load, capitalizing on the physical principles underlying the swing equations to provide a robust solution.
  • results: The approach enables efficient parameter estimation and quantifies uncertainties, delivering probabilistic insights into system behavior; analyses of $1$-bus and $2$-bus systems demonstrate the model's ability to handle parameter variability and data scarcity, paving the way for reliable and computationally efficient transient stability analysis.
    Abstract This paper addresses the challenge of transient stability in power systems with missing parameters and uncertainty propagation in swing equations. We introduce a novel application of Physics-Informed Neural Networks (PINNs), specifically an Ensemble of PINNs (E-PINNs), to estimate critical parameters like rotor angle and inertia coefficient with enhanced accuracy and reduced computational load. E-PINNs capitalize on the underlying physical principles of swing equations to provide a robust solution. Our approach not only facilitates efficient parameter estimation but also quantifies uncertainties, delivering probabilistic insights into the system behavior. The efficacy of E-PINNs is demonstrated through the analysis of $1$-bus and $2$-bus systems, highlighting the model's ability to handle parameter variability and data scarcity. The study advances the application of machine learning in power system stability, paving the way for reliable and computationally efficient transient stability analysis.
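A physics-informed loss for the swing equation can be sketched directly. The snippet below is a minimal single-network illustration rather than the paper's ensemble; it penalizes the residual of M·θ'' + D·θ' = P_m − P_max·sin(θ) at collocation points, with all constants being assumed values:

```python
import torch

# Physics-informed residual for the classical swing equation
#   M * theta'' + D * theta' = P_m - P_max * sin(theta).
# The constants are illustrative assumptions, not values from the paper.
M, D, P_m, P_max = 1.0, 0.1, 0.8, 1.0
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def swing_residual(t):
    """Residual of the swing equation at collocation times t of shape (N, 1)."""
    t = t.clone().requires_grad_(True)
    theta = net(t)
    dtheta = torch.autograd.grad(theta.sum(), t, create_graph=True)[0]
    ddtheta = torch.autograd.grad(dtheta.sum(), t, create_graph=True)[0]
    return M * ddtheta + D * dtheta - (P_m - P_max * torch.sin(theta))

t_colloc = torch.linspace(0.0, 1.0, 100).unsqueeze(1)
loss = swing_residual(t_colloc).pow(2).mean()  # driven to zero during training
loss.backward()
```

An E-PINN would train several such networks and use the spread of their parameter estimates to quantify uncertainty.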

DroneOptiNet: A Framework for Optimal Drone-based Load Redistribution Mechanism for 5G and Beyond Solar Small Cell Networks

  • paper_url: http://arxiv.org/abs/2311.12944
  • repo_url: None
  • paper_authors: Daksh Dave, Vinay Chamola, Sandeep Joshi, Sherali Zeadally
  • for: Improving the reliability and resilience of wireless communication systems.
  • methods: Drone-mounted aerial base stations (BS) redistribute power across a micro-grid network of green small-cell BSs, with an evolutionary neural network and long short-term memory managing energy and load redistribution.
  • results: The framework reduces power outages at base stations and maintains consistent throughput stability, boosting the reliability and robustness of wireless communication systems.
    Abstract The power requirements posed by the fifth-generation and beyond cellular networks are an important constraint in network deployment and require energy-efficient solutions. In this work, we propose a novel user load transfer approach using airborne base stations (BS), mounted on drones, for reliable and secure power redistribution across the micro-grid network comprising green small cell BSs. Depending on the user density and the availability of an aerial BS, the energy requirement of a cell with an energy deficit is accommodated by migrating the aerial BS from a high-energy to a low-energy cell. The proposed hybrid drone-based framework integrates long short-term memory with unique cost functions using an evolutionary neural network for drones and BSs, and efficiently manages energy and load redistribution. The proposed algorithm reduces power outages at BSs and maintains consistent throughput stability, thereby demonstrating its capability to boost the reliability and robustness of wireless communication systems.

InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions

  • paper_url: http://arxiv.org/abs/2311.12943
  • repo_url: None
  • paper_authors: Kushal Kedia, Atiksh Bhardwaj, Prithwish Dan, Sanjiban Choudhury
  • for: This work addresses the problem of predicting human intent in collaborative human-robot manipulation so that robots can adapt their actions and execute tasks smoothly.
  • methods: A novel architecture, InteRACT, exploits a correspondence between human and robot actions to transfer-learn a conditional intent prediction model from large human-human interaction datasets and fine-tune it on a small human-robot dataset.
  • results: On a set of real-world collaborative human-robot manipulation tasks, the conditional model improves over various marginal prediction baselines. New tele-operation techniques for a 7-DoF robot arm are also introduced to collect a diverse human-robot collaborative manipulation dataset, which is open-sourced.
    Abstract In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.

Intrinsic Image Decomposition via Ordinal Shading

  • paper_url: http://arxiv.org/abs/2311.12792
  • repo_url: https://github.com/compphoto/Intrinsic
  • paper_authors: Chris Careaga, Yağız Aksoy
  • for: High-accuracy intrinsic decomposition for use in inverse rendering and computational photography pipelines.
  • methods: The problem is split into two parts: a dense ordinal shading formulation with a shift- and scale-invariant loss estimates ordinal shading cues, and a second network combines low- and high-resolution ordinal estimations into a shading estimate with both global coherency and local details.
  • results: Computing losses on the estimated shading and on the albedo implied by the intrinsic model lets the model learn an accurate decomposition, enabling otherwise difficult editing tasks such as recoloring and relighting on in-the-wild imagery.
    Abstract Intrinsic decomposition is a fundamental mid-level vision problem that plays a crucial role in various inverse rendering and computational photography pipelines. Generating highly accurate intrinsic decompositions is an inherently under-constrained task that requires precisely estimating continuous-valued shading and albedo. In this work, we achieve high-resolution intrinsic decomposition by breaking the problem into two parts. First, we present a dense ordinal shading formulation using a shift- and scale-invariant loss in order to estimate ordinal shading cues without restricting the predictions to obey the intrinsic model. We then combine low- and high-resolution ordinal estimations using a second network to generate a shading estimate with both global coherency and local details. We encourage the model to learn an accurate decomposition by computing losses on the estimated shading as well as the albedo implied by the intrinsic model. We develop a straightforward method for generating dense pseudo ground truth using our model's predictions and multi-illumination data, enabling generalization to in-the-wild imagery. We present an exhaustive qualitative and quantitative analysis of our predicted intrinsic components against state-of-the-art methods. Finally, we demonstrate the real-world applicability of our estimations by performing otherwise difficult editing tasks such as recoloring and relighting.
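A shift- and scale-invariant loss of the kind used for the ordinal shading estimate can be written compactly. A minimal sketch (the paper's exact formulation may differ): align the prediction to the target with a per-image least-squares scale and shift before measuring the error.

```python
import torch

def scale_shift_invariant_loss(pred, target):
    """Closed-form least-squares fit of scale s and shift b minimizing
    ||s * pred + b - target||^2 per image, then an L1 error on the aligned
    prediction. pred and target are (B, N) flattened shading maps."""
    p_mean = pred.mean(dim=1, keepdim=True)
    t_mean = target.mean(dim=1, keepdim=True)
    p_c, t_c = pred - p_mean, target - t_mean
    s = (p_c * t_c).sum(dim=1, keepdim=True) \
        / p_c.pow(2).sum(dim=1, keepdim=True).clamp_min(1e-8)
    b = t_mean - s * p_mean
    return (s * pred + b - target).abs().mean()
```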

Quantifying Impairment and Disease Severity Using AI Models Trained on Healthy Subjects

  • paper_url: http://arxiv.org/abs/2311.12781
  • repo_url: https://github.com/fishneck/cobra
  • paper_authors: Boyang Yu, Aakash Kaku, Kangning Liu, Avinash Parnandi, Emily Fokas, Anita Venkatesan, Natasha Pandit, Rajesh Ranganath, Heidi Schambra, Carlos Fernandez-Granda
  • for: This study proposes a new way to quantify impairment and disease severity using AI models trained exclusively on healthy individuals.
  • methods: Models trained on healthy-subject data are applied to patients, and the COBRA score exploits the resulting decrease in model confidence to quantify each patient's deviation from the healthy population.
  • results: The COBRA score is strongly correlated with the gold-standard Fugl-Meyer Assessment (FMA) on an independent test cohort for two data modalities, wearable sensors ($\rho = 0.845$, 95% CI [0.743, 0.908]) and video ($\rho = 0.746$, 95% CI [0.594, 0.847]); it also correlates with an independent clinical assessment of knee osteoarthritis severity from magnetic-resonance imaging ($\rho = 0.644$, 95% CI [0.585, 0.696]).
    Abstract Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a novel framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30-45 minutes, which restricts monitoring frequency and precludes physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors ($\rho = 0.845$, 95% CI [0.743,0.908]) and video ($\rho = 0.746$, 95% C.I [0.594, 0.847]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment ($\rho = 0.644$, 95% C.I [0.585,0.696]).
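At its core, the COBRA score measures a confidence drop relative to the healthy population. A minimal sketch of that idea (the paper's exact aggregation and normalization may differ):

```python
import numpy as np

def cobra_style_score(confidences_patient, confidences_healthy_ref):
    """Sketch of a confidence-based anomaly score: how far a patient's mean
    model confidence falls below the mean confidence on healthy reference
    data. A larger score suggests a larger deviation from the healthy
    population."""
    return float(np.mean(confidences_healthy_ref) - np.mean(confidences_patient))
```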

SPOT! Revisiting Video-Language Models for Event Understanding

  • paper_url: http://arxiv.org/abs/2311.12919
  • repo_url: None
  • paper_authors: Gengyuan Zhang, Jinhe Bi, Jindong Gu, Volker Tresp
  • for: Studying video understanding in multimodal learning.
  • methods: Large-scale web-crawled video-text pairs are leveraged as weak supervision to pre-train video-language models, which have shown remarkable potential in video understanding tasks.
  • results: Existing video-language models are found unable to distinguish fine-grained event differences, since video-text pairs usually contain only broad-level captions. To address this, SPOT Prober benchmarks existing models' ability to distinguish event-level discrepancies, and plugging manipulated event captions in as hard negative samples proves effective in enhancing models' event understanding.
    Abstract Understanding videos is an important research topic for multimodal learning. Leveraging large-scale datasets of web-crawled video-text pairs as weak supervision has become a pre-training paradigm for learning joint representations and showcased remarkable potential in video understanding tasks. However, videos can be multi-event and multi-grained, while these video-text pairs usually contain only broad-level video captions. This raises a question: with such weak supervision, can video representation in video-language models gain the ability to distinguish even factual discrepancies in textual description and understand fine-grained events? To address this, we introduce SPOT Prober, to benchmark existing video-language models's capacities of distinguishing event-level discrepancies as an indicator of models' event understanding ability. Our approach involves extracting events as tuples (<subject, predicate, object, attribute, timestamps>) from videos and generating false event tuples by manipulating tuple components systematically. We reevaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events. Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
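The negative-caption generation is straightforward to sketch: corrupt one tuple component at a time. The routine below is illustrative, with a hypothetical `vocab` of candidate replacements per component:

```python
import random

def make_false_event_tuples(event, vocab):
    """Systematic negatives: corrupt one component of an event tuple
    (subject, predicate, object, attribute, timestamps) at a time.
    `vocab` maps each component name to candidate replacement values."""
    negatives = []
    for key in ("subject", "predicate", "object", "attribute", "timestamps"):
        candidates = [v for v in vocab[key] if v != event[key]]
        if not candidates:
            continue
        corrupted = dict(event)
        corrupted[key] = random.choice(candidates)
        negatives.append(corrupted)
    return negatives
```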

  • paper_url: http://arxiv.org/abs/2311.12917
  • repo_url: None
  • paper_authors: E. Kulman, R. Kuang, Q. Morris
  • for: This paper aims to reconstruct cancer phylogenies, i.e., the evolutionary histories of genetically heterogeneous subpopulations of cells from the same cancer, which provide useful insights about cancer development and inform treatment.
  • methods: The Orchard algorithm reconstructs cancer phylogenies using point mutations detected in bulk DNA sequencing data, adding one mutation at a time and sampling complete phylogenies from the implied posterior distribution.
  • results: Orchard reconstructs more plausible phylogenies than state-of-the-art methods on 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs), accurately handling up to 300 mutations.
    Abstract Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer i.e., cancer phylogenies, provide useful insights about cancer development and inform treatment. Cancer phylogenies can be reconstructed using data obtained from bulk DNA sequencing of multiple tissue samples from the same cancer. We introduce Orchard, a fast algorithm that reconstructs cancer phylogenies using point mutations detected in bulk DNA sequencing data. Orchard constructs cancer phylogenies progressively, one point mutation at a time, ultimately sampling complete phylogenies from a posterior distribution implied by the bulk DNA data. Orchard reconstructs more plausible phylogenies than state-of-the-art cancer phylogeny reconstruction methods on 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs). These results demonstrate that Orchard accurately reconstructs cancer phylogenies with up to 300 mutations. We then introduce a simple graph based clustering algorithm that uses a reconstructed phylogeny to infer unique groups of mutations i.e., mutation clusters, that characterize the genetic differences between cancer cell populations, and show that this approach is competitive with state-of-the-art mutation clustering methods.

Digital Twin Framework for Optimal and Autonomous Decision-Making in Cyber-Physical Systems: Enhancing Reliability and Adaptability in the Oil and Gas Industry

  • paper_url: http://arxiv.org/abs/2311.12755
  • repo_url: None
  • paper_authors: Carine Menezes Rebello, Johannes Jäschkea, Idelfonso B. R. Nogueira
  • for: This study proposes a digital twin framework for optimal and autonomous decision-making applied to a gas-lift process in the oil and gas industry, with a focus on enhancing the robustness and adaptability of the digital twin.
  • methods: The framework combines Bayesian inference, Monte Carlo simulations, transfer learning, online learning, and novel strategies to confer cognition to the digital twin, including model hyperdimensional reduction and cognitive tack.
  • results: The result is an efficient, reliable, and trustworthy digital twin identification framework that adapts to changing environments and incorporates prediction uncertainty, improving the reliability and efficiency of the overall decision-making process.
    Abstract The concept of creating a virtual copy of a complete Cyber-Physical System opens up numerous possibilities, including real-time assessments of the physical environment and continuous learning from the system to provide reliable and precise information. This process, known as the twinning process or the development of a digital twin (DT), has been widely adopted across various industries. However, challenges arise when considering the computational demands of implementing AI models, such as those employed in digital twins, in real-time information exchange scenarios. This work proposes a digital twin framework for optimal and autonomous decision-making applied to a gas-lift process in the oil and gas industry, focusing on enhancing the robustness and adaptability of the DT. The framework combines Bayesian inference, Monte Carlo simulations, transfer learning, online learning, and novel strategies to confer cognition to the DT, including model hyperdimensional reduction and cognitive tack. Consequently, creating a framework for efficient, reliable, and trustworthy DT identification was possible. The proposed approach addresses the current gap in the literature regarding integrating various learning techniques and uncertainty management in digital twin strategies. This digital twin framework aims to provide a reliable and efficient system capable of adapting to changing environments and incorporating prediction uncertainty, thus enhancing the overall decision-making process in complex, real-world scenarios. Additionally, this work lays the foundation for further developments in digital twins for process systems engineering, potentially fostering new advancements and applications across various industrial sectors.

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

  • paper_url: http://arxiv.org/abs/2311.12754
  • repo_url: https://github.com/huang-yh/selfocc
  • paper_authors: Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, Jiwen Lu
  • for: Improving the robustness of vision-centric autonomous driving by predicting whether each point in the surrounding 3D space is occupied.
  • methods: A self-supervised approach learns 3D occupancy from video sequences alone, treating 3D representations as signed distance fields and rendering 2D images of previous and future frames as supervision signals.
  • results: SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work to produce reasonable 3D occupancy for surround cameras on nuScenes, while also achieving high-quality depth results.
    Abstract 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc.
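Treating the 3D representation as a signed distance field is what lets 2D frames be rendered for self-supervision. A minimal sketch of one common SDF-to-rendering-weight conversion (the logistic density mapping here is an assumption; SelfOcc's exact formulation may differ):

```python
import torch

def render_weights_from_sdf(sdf_along_ray, beta=50.0):
    """Turn signed-distance samples along a camera ray into volume-rendering
    weights: opacity rises as the SDF crosses zero, and weights are obtained
    by standard alpha compositing. `sdf_along_ray` is a 1D tensor of samples
    ordered from near to far."""
    alpha = torch.sigmoid(-beta * sdf_along_ray)        # near 1 inside surfaces
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )                                                   # accumulated transmittance
    return trans * alpha                                # per-sample weights
```

Colors or depths weighted by these values can then be compared against the observed previous and future frames.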

Image Transformation for IoT Time-Series Data: A Review

  • paper_url: http://arxiv.org/abs/2311.12742
  • repo_url: None
  • paper_authors: Duygu Altunkaya, Feyza Yildirim Okay, Suat Ozdemir
  • for: This review targets classification and regression of high-dimensional, high-frequency time-series data in the IoT domain.
  • methods: Studies that apply image transformation/encoding techniques to IoT time-series data are surveyed and examined according to their encoding techniques, data types, and application areas.
  • results: Transforming IoT time-series data into images is shown to improve the performance of deep learning models, while the challenges and future dimensions of image transformation remain to be explored.
    Abstract In the era of the Internet of Things (IoT), where smartphones, built-in systems, wireless sensors, and nearly every smart device connect through local networks or the internet, billions of smart things communicate with each other and generate vast amounts of time-series data. As IoT time-series data is high-dimensional and high-frequency, time-series classification or regression has been a challenging issue in IoT. Recently, deep learning algorithms have demonstrated superior performance results in time-series data classification in many smart and intelligent IoT applications. However, it is hard to explore the hidden dynamic patterns and trends in time-series. Recent studies show that transforming IoT data into images improves the performance of the learning model. In this paper, we present a review of these studies which use image transformation/encoding techniques in IoT domain. We examine the studies according to their encoding techniques, data types, and application areas. Lastly, we emphasize the challenges and future dimensions of image transformation.
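As a concrete example of the encoding techniques such reviews cover, the Gramian Angular Summation Field turns a 1D series into a 2D image that convolutional models can consume (a minimal sketch, assuming a non-constant input series):

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field: min-max scale the series to [-1, 1],
    map values to angles, and expand into a 2D image of pairwise cosines."""
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1.0   # scale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))              # angular encoding
    return np.cos(phi[:, None] + phi[None, :])          # GASF image

image = gramian_angular_field(np.sin(np.linspace(0, 6.28, 64)))
print(image.shape)  # (64, 64)
```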

Content Augmented Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.12741
  • repo_url: https://github.com/fatemehgholamzadeh/augss-gnn
  • paper_authors: Fatemeh Gholamzadeh Nasrabadi, AmirHossein Kashani, Pegah Zahedi, Mostafa Haghir Chehreghani
  • for: This paper aims to improve the performance of graph neural networks (GNNs) by incorporating content information into the node embeddings.
  • methods: The proposed method combines structural embeddings generated by a GNN with content embeddings generated by an auto-encoder or a content graph to form the final node embeddings.
  • results: The proposed method achieves high accuracy and performance on several real-world datasets.
    Abstract In recent years, graph neural networks (GNNs) have become a popular tool for solving various problems over graphs. In these models, the link structure of the graph is typically exploited and nodes' embeddings are iteratively updated based on adjacent nodes. Nodes' contents are used solely in the form of feature vectors, served as nodes' first-layer embeddings. However, the filters or convolutions, applied during iterations/layers to these initial embeddings lead to their impact diminish and contribute insignificantly to the final embeddings. In order to address this issue, in this paper we propose augmenting nodes' embeddings by embeddings generating from their content, at higher GNN layers. More precisely, we propose models wherein a structural embedding using a GNN and a content embedding are computed for each node. These two are combined using a combination layer to form the embedding of a node at a given layer. We suggest methods such as using an auto-encoder or building a content graph, to generate content embeddings. In the end, by conducting experiments over several real-world datasets, we demonstrate the high accuracy and performance of our models.
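The combination layer that fuses the two embeddings can be sketched minimally. Concatenation followed by a linear projection is an assumption here; the paper may use a different fusion:

```python
import torch

class CombinationLayer(torch.nn.Module):
    """Fuse a structural embedding (from a GNN layer) with a content
    embedding (e.g., from an auto-encoder) into one node embedding."""
    def __init__(self, d_struct, d_content, d_out):
        super().__init__()
        self.proj = torch.nn.Linear(d_struct + d_content, d_out)

    def forward(self, h_struct, h_content):
        return torch.relu(self.proj(torch.cat([h_struct, h_content], dim=-1)))
```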

  • paper_url: http://arxiv.org/abs/2311.12719
  • repo_url: None
  • paper_authors: Pranav Nataraj Devaraj, Rakesh Teja P V, Aaryav Gangrade, Manoj Kumar R
  • for: The paper is written for those interested in creating a Legal Documentation AI Chatbot with relevant features.
  • methods: The paper combines AI technologies, including chatbots, to streamline the handling of legal documents, describing the development of each component in detail, from the Android app to the Langchain query processing code.
  • results: The paper presents the integration of the chatbot components through a Flask backend and REST API methods, and discusses the functionality of each component.
    Abstract With the exponential growth of digital data and the increasing complexity of legal documentation, there is a pressing need for efficient and intelligent tools to streamline the handling of legal documents. With the recent developments in the AI field, especially in chatbots, they cannot be ignored as a very compelling solution to this problem. An insight into the process of creating a Legal Documentation AI Chatbot with as many relevant features as possible within the given time frame is presented. The development of each component of the chatbot is presented in detail, and each component's workings and functionality are discussed, from the build of the Android app and the Langchain query processing code to the integration of both through a Flask backend and REST API methods.
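The Flask/REST integration described above can be illustrated with a minimal sketch; the endpoint path, payload schema, and stub processor below are assumptions, not the project's actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def process_legal_query(text: str) -> str:
    # Stand-in for the Langchain query processing pipeline.
    return f"stub answer for: {text}"

@app.route("/query", methods=["POST"])
def query():
    payload = request.get_json(force=True)
    return jsonify({"answer": process_legal_query(payload["question"])})

if __name__ == "__main__":
    app.run()  # the Android app would POST user questions to /query
```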

minimax: Efficient Baselines for Autocurricula in JAX

  • paper_url: http://arxiv.org/abs/2311.12716
  • repo_url: https://github.com/facebookresearch/minimax
  • paper_authors: Minqi Jiang, Michael Dennis, Edward Grefenstette, Tim Rocktäschel
  • for: This paper presents a fast implementation of unsupervised environment design (UED) training for producing robust decision-making agents that zero-shot transfer to unseen environments.
  • methods: Fully-tensorized environments and autocurriculum algorithms are implemented in JAX so that the entire training loop can be compiled for hardware acceleration.
  • results: The minimax library provides strong UED baselines on accelerated hardware, including new parallelized variants that achieve over 120x speedups in wall time compared to previous implementations when training with equal batch sizes.
    Abstract Unsupervised environment design (UED) is a form of automatic curriculum learning for training robust decision-making agents to zero-shot transfer into unseen environments. Such autocurricula have received much interest from the RL community. However, UED experiments, based on CPU rollouts and GPU model updates, have often required several weeks of training. This compute requirement is a major obstacle to rapid innovation for the field. This work introduces the minimax library for UED training on accelerated hardware. Using JAX to implement fully-tensorized environments and autocurriculum algorithms, minimax allows the entire training loop to be compiled for hardware acceleration. To provide a petri dish for rapid experimentation, minimax includes a tensorized grid-world based on MiniGrid, in addition to reusable abstractions for conducting autocurricula in procedurally-generated environments. With these components, minimax provides strong UED baselines, including new parallelized variants, which achieve over 120$\times$ speedups in wall time compared to previous implementations when training with equal batch sizes. The minimax library is available under the Apache 2.0 license at https://github.com/facebookresearch/minimax.
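The speedups come from writing environments as pure tensor programs. A toy illustration of the underlying JAX pattern (this trivial environment is an assumption, not minimax's actual API):

```python
import jax
import jax.numpy as jnp

# With the step function written in jax.numpy, rollouts can be vmapped over
# a batch of environments and jit-compiled end-to-end for hardware
# acceleration. Dynamics and reward here are deliberately trivial.
def env_step(state, action):
    new_state = state + action
    reward = -jnp.abs(new_state).sum()
    return new_state, reward

batched_step = jax.jit(jax.vmap(env_step))
states = jnp.zeros((1024, 4))   # 1024 parallel environments
actions = jnp.ones((1024, 4))
states, rewards = batched_step(states, actions)
```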

Alpha Zero for Physics: Application of Symbolic Regression with Alpha Zero to find the analytical methods in physics

  • paper_url: http://arxiv.org/abs/2311.12713
  • repo_url: None
  • paper_authors: Yoshihiro Michishita
  • for: 这篇论文探讨机器学习与神经网络如何应用于物理领域;以往研究多集中在数值计算与辅助实验检测,而利用机器学习寻找解析方法的研究仍然很少。
  • methods: 该论文提出了一种将符号回归与Alpha Zero算法相结合、用于发展物理解析方法的框架(AZfP)。作为示例,论文展示了AZfP可以推导出Floquet系统中的高频展开。
  • results: 该论文表明,AZfP可能有助于发展物理学中的新理论框架。
    Abstract Machine learning with neural networks is now becoming a more and more powerful tool for various tasks, such as natural language processing, image recognition, winning the game, and even for the issues of physics. Although there are many studies on the application of machine learning to numerical calculation and the assistance of experimental detection, the methods of applying machine learning to find the analytical method are poorly studied. In this paper, we propose the frameworks of developing analytical methods in physics by using the symbolic regression with the Alpha Zero algorithm, that is Alpha Zero for physics (AZfP). As a demonstration, we show that AZfP can derive the high-frequency expansion in the Floquet systems. AZfP may have the possibility of developing a new theoretical framework in physics.
    摘要 基于神经网络的机器学习正成为越来越强大的工具,可用于自然语言处理、图像识别、博弈对局乃至物理问题等多种任务。尽管已有许多研究将机器学习应用于数值计算和辅助实验检测,但利用机器学习寻找解析方法的研究仍然很少。本文提出了利用符号回归与Alpha Zero算法发展物理解析方法的框架,即面向物理的Alpha Zero(AZfP)。作为示例,我们展示了AZfP可以推导出Floquet系统中的高频展开。AZfP有望为物理学发展新的理论框架。
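
As a hedged sketch of the symbolic-regression component only (the Alpha Zero search itself is omitted), the code below enumerates small expression trees over a toy grammar and scores them against data by squared error; AZfP replaces this brute-force enumeration with a learned MCTS policy.

```python
# Toy symbolic regression by exhaustive search over tiny expression trees.
# AZfP replaces this enumeration with an Alpha Zero-style MCTS; this sketch
# only illustrates the search space and the fitting objective.
import itertools
import math

UNARY = {"sin": math.sin, "cos": math.cos}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def candidates():
    # Depth-2 expressions built from x, constants, unary ops, and binary ops.
    leaves = [("x", lambda x: x), ("1", lambda x: 1.0)]
    exprs = list(leaves)
    for name, fn in UNARY.items():
        for lname, lfn in leaves:
            exprs.append((f"{name}({lname})", lambda x, fn=fn, lfn=lfn: fn(lfn(x))))
    for name, fn in BINARY.items():
        for (an, af), (bn, bf) in itertools.product(exprs, repeat=2):
            exprs.append((f"({an}{name}{bn})", lambda x, fn=fn, af=af, bf=bf: fn(af(x), bf(x))))
    return exprs

def fit(xs, ys):
    best = min(candidates(),
               key=lambda e: sum((e[1](x) - y) ** 2 for x, y in zip(xs, ys)))
    return best[0]

xs = [i / 10 for i in range(-20, 21)]
ys = [x * math.sin(x) for x in xs]
print(fit(xs, ys))  # expected to recover something like (x*sin(x))
```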

Keeping Users Engaged During Repeated Administration of the Same Questionnaire: Using Large Language Models to Reliably Diversify Questions

  • paper_url: http://arxiv.org/abs/2311.12707
  • repo_url: None
  • paper_authors: Hye Sun Yun, Mehdi Arjmand, Phillip Raymond Sherlock, Michael Paasche-Orlow, James W. Griffith, Timothy Bickmore
  • for: 这篇论文提出利用大语言模型(LLM)生成多个问卷版本,在保持良好心理测量特性的同时,缓解反复作答带来的疲劳与反应偏差。
  • methods: 在一项纵向研究中,参与者连续两周每天回答标准化抑郁问卷或两种LLM生成的问卷变体之一,同时完成一份已验证的抑郁问卷作为外部效标。
  • results: 心理测量检验显示,三种条件下外部效标与所施测量之间的协变关系一致,表明LLM生成的变体具有信度与效度;参与者认为反复填写标准化问卷明显比LLM变体更令人厌倦。这些发现表明,LLM可以在不牺牲效度的前提下为问卷注入活力,提升参与度与兴趣。
    Abstract Standardized, validated questionnaires are vital tools in HCI research and healthcare, offering dependable self-report data. However, their repeated use in longitudinal or pre-post studies can induce respondent fatigue, impacting data quality via response biases and decreased response rates. We propose utilizing large language models (LLMs) to generate diverse questionnaire versions while retaining good psychometric properties. In a longitudinal study, participants engaged with our agent system and responded daily for two weeks to either a standardized depression questionnaire or one of two LLM-generated questionnaire variants, alongside a validated depression questionnaire. Psychometric testing revealed consistent covariation between the external criterion and the focal measure administered across the three conditions, demonstrating the reliability and validity of the LLM-generated variants. Participants found the repeated administration of the standardized questionnaire significantly more repetitive compared to the variants. Our findings highlight the potential of LLM-generated variants to invigorate questionnaires, fostering engagement and interest without compromising validity.
    摘要 标准化、已验证的问卷是人机交互研究和医疗保健中的重要工具,可提供可靠的自报数据。然而,在纵向或前后测研究中反复使用同一问卷可能引起作答疲劳,通过反应偏差和应答率下降影响数据质量。我们提议使用大语言模型(LLM)生成多样化的问卷版本,同时保持良好的心理测量特性。在一项纵向研究中,参与者与我们的代理系统交互,连续两周每天回答标准化抑郁问卷或两种LLM生成的问卷变体之一,并同时完成一份已验证的抑郁问卷。心理测量检验显示,三种条件下外部效标与所施测量之间的协变关系一致,证明了LLM生成变体的信度与效度。参与者认为反复填写标准化问卷明显比变体更令人厌倦。我们的发现凸显了LLM生成变体为问卷注入活力、在不损害效度的情况下提升参与度与兴趣的潜力。
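
A minimal sketch of how such LLM-generated variants might be requested (the prompt wording here is ours, not the authors'): each item is paraphrased while explicitly pinning the construct, response scale, and reading level.

```python
# Hypothetical prompt builder for diversifying questionnaire items
# (illustrative only; the paper's actual prompts are not reproduced here).
def variant_prompt(item: str, scale: str) -> str:
    return (
        "Rewrite the following depression-questionnaire item so that it asks "
        "about the same underlying construct, keeps the same response scale, "
        "and stays at a similar reading level, but uses different wording.\n"
        f"Response scale: {scale}\n"
        f"Item: {item}\n"
        "Rewritten item:"
    )

prompt = variant_prompt(
    "Little interest or pleasure in doing things",
    "0 = not at all ... 3 = nearly every day",
)
# `prompt` would then be sent to an LLM; each day's session samples a fresh
# variant, while a validated questionnaire is kept as the external criterion.
```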

Can Large Language Models Understand Content and Propagation for Misinformation Detection: An Empirical Study

  • paper_url: http://arxiv.org/abs/2311.12699
  • repo_url: None
  • paper_authors: Mengyang Chen, Lingwei Wei, Han Cao, Wei Zhou, Songlin Hu
  • for: 本研究旨在探讨大语言模型(LLMs)在误信息探测任务中的表现。
  • methods: 本研究采用多种提示来评估多种LLMs的理解能力,并设计了四种指令调整策略以提高LLMs的误信息探测性能。
  • results: 实验结果表明,采用多样化提示的LLM在基于文本的误信息检测中可达到与现有模型相当的性能,但在理解传播结构方面明显受限;所设计的指令微调策略能显著提升LLM在基于内容与基于传播的误信息检测上的表现。
    Abstract Large Language Models (LLMs) have garnered significant attention for their powerful ability in natural language understanding and reasoning. In this paper, we present a comprehensive empirical study to explore the performance of LLMs on misinformation detection tasks. This study stands as the pioneering investigation into the understanding capabilities of multiple LLMs regarding both content and propagation across social media platforms. Our empirical studies on five misinformation detection datasets show that LLMs with diverse prompts achieve comparable performance in text-based misinformation detection but exhibit notably constrained capabilities in comprehending propagation structure compared to existing models in propagation-based misinformation detection. Besides, we further design four instruction-tuned strategies to enhance LLMs for both content and propagation-based misinformation detection. These strategies boost LLMs to actively learn effective features from multiple instances or hard instances, and eliminate irrelevant propagation structures, thereby achieving better detection performance. Extensive experiments further demonstrate LLMs would play a better capacity in content and propagation structure under these proposed strategies and achieve promising detection performance. These findings highlight the potential ability of LLMs to detect misinformation.
    摘要 大语言模型(LLM)因其在自然语言理解与推理方面的强大能力而受到广泛关注。本文对LLM在误信息检测任务上的表现进行了全面的实证研究,首次考察了多种LLM对社交媒体内容及其传播结构的理解能力。在五个误信息检测数据集上的实验表明,采用不同提示的LLM在基于文本的误信息检测中表现相当,但与现有的基于传播的检测模型相比,其理解传播结构的能力明显受限。为此,我们设计了四种指令微调策略,帮助LLM从多个样本或困难样本中主动学习有效特征,并剔除无关的传播结构,从而取得更好的检测性能。大量实验进一步表明,在这些策略下LLM能更好地利用内容与传播结构,取得可观的检测性能。这些发现凸显了LLM检测误信息的潜力。

Diffusion Model Alignment Using Direct Preference Optimization

  • paper_url: http://arxiv.org/abs/2311.12908
  • repo_url: None
  • paper_authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
  • for: 本研究旨在利用人类比较数据,将扩散模型直接与人类偏好对齐,作用类似于LLM中的人类反馈强化学习(RLHF)。
  • methods: 本研究提出了一种直接在人类比较数据上优化扩散模型以符合人类偏好的方法,称为Diffusion-DPO。该方法改编自直接偏好优化(DPO),后者是一种比RLHF更简单、可在分类目标下直接优化策略的替代方案。
  • results: 使用Pick-a-Pic数据集中85.1万条众包成对偏好数据微调SDXL-1.0基础模型后,其视觉吸引力与提示对齐均显著改善。此外,我们还开发了一种使用AI反馈的变体,其性能与基于人类反馈的训练相当,为扩散模型对齐方法的规模化打开了大门。
    Abstract Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods.
    摘要 大语言模型(LLM)通常利用人类比较数据,通过人类反馈强化学习(RLHF)方法进行微调,使其更符合用户偏好。与LLM不同,文本到图像扩散模型中的人类偏好学习尚未得到广泛探索;现有的最佳做法是使用精心挑选的高质量图像与文字描述微调预训练模型,以提升视觉吸引力和文本对齐。我们提出Diffusion-DPO方法,通过直接在人类比较数据上优化,使扩散模型与人类偏好对齐。Diffusion-DPO改编自最近提出的直接偏好优化(DPO),后者是RLHF的一种更简单的替代方案,可在分类目标下直接优化最符合人类偏好的策略。我们针对扩散模型的似然概念重新表述DPO,利用证据下界推导出可微分的目标函数。基于Pick-a-Pic数据集中85.1万条众包成对偏好,我们用Diffusion-DPO微调了最先进的Stable Diffusion XL(SDXL)-1.0基础模型。微调后的基础模型在人工评估中显著优于基础SDXL-1.0以及带有额外细化模型的更大SDXL-1.0模型,视觉吸引力与提示对齐均得到提升。我们还开发了一种使用AI反馈的变体,其性能与基于人类偏好的训练相当,为扩散模型对齐方法的规模化打开了大门。
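
The core objective can be sketched as follows (a simplified reading of the paper's loss, with the timestep weighting folded into a single constant `beta_t`; variable names are ours): for a preferred/rejected image pair, the trained model's denoising error is compared against a frozen reference model's.

```python
# Simplified sketch of a Diffusion-DPO-style objective (names and the folded
# timestep weighting `beta_t` are our simplifications, not the paper's code).
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_model_w, eps_ref_w, eps_model_l, eps_ref_l,
                       noise_w, noise_l, beta_t: float = 500.0):
    """All tensors are predicted/true noises of shape (B, C, H, W) for the
    preferred (w) and rejected (l) images at the same diffusion timestep."""
    def err(pred, noise):
        return ((pred - noise) ** 2).flatten(1).sum(dim=1)

    # How much better (lower denoising error) the trained model is than the
    # frozen reference, on the winner minus on the loser.
    model_diff = err(eps_model_w, noise_w) - err(eps_model_l, noise_l)
    ref_diff = err(eps_ref_w, noise_w) - err(eps_ref_l, noise_l)
    inside = -beta_t * (model_diff - ref_diff)
    return -F.logsigmoid(inside).mean()
```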

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

  • paper_url: http://arxiv.org/abs/2311.12668
  • repo_url: None
  • paper_authors: Cyril Picard, Kristen M. Edwards, Anna C. Doris, Brandon Man, Giorgio Giannone, Md Ferdous Alam, Faez Ahmed
  • for: 本文旨在评估视觉语言模型GPT-4V在各类工程设计任务中的能力。
  • methods: 论文在工程设计的四大领域对GPT-4V进行了全面评估:概念设计、系统级与详细设计、制造与检验、以及工程教育任务。
  • results: 研究考察了GPT-4V在草图相似性分析、基于Pugh矩阵的概念选择、材料选择、工程图纸分析、CAD生成、拓扑优化、增材与减材制造设计、空间推理挑战及教材习题等任务上的能力,指出了其在复杂工程设计应用中的局限,为今后评估视觉语言模型奠定了基础;论文还贡献了一个包含1000多条查询的基准测试数据集。
    Abstract Engineering Design is undergoing a transformative shift with the advent of AI, marking a new era in how we approach product, system, and service planning. Large language models have demonstrated impressive capabilities in enabling this shift. Yet, with text as their only input modality, they cannot leverage the large body of visual artifacts that engineers have used for centuries and are accustomed to. This gap is addressed with the release of multimodal vision language models, such as GPT-4V, enabling AI to impact many more types of tasks. In light of these advancements, this paper presents a comprehensive evaluation of GPT-4V, a vision language model, across a wide spectrum of engineering design tasks, categorized into four main areas: Conceptual Design, System-Level and Detailed Design, Manufacturing and Inspection, and Engineering Education Tasks. Our study assesses GPT-4V's capabilities in design tasks such as sketch similarity analysis, concept selection using Pugh Charts, material selection, engineering drawing analysis, CAD generation, topology optimization, design for additive and subtractive manufacturing, spatial reasoning challenges, and textbook problems. Through this structured evaluation, we not only explore GPT-4V's proficiency in handling complex design and manufacturing challenges but also identify its limitations in complex engineering design applications. Our research establishes a foundation for future assessments of vision language models, emphasizing their immense potential for innovating and enhancing the engineering design and manufacturing landscape. It also contributes a set of benchmark testing datasets, with more than 1000 queries, for ongoing advancements and applications in this field.
    摘要 随着人工智能的出现,工程设计正在经历变革,开启了产品、系统与服务规划方式的新纪元。大语言模型在推动这一转变方面展现出了令人印象深刻的能力;然而,由于其输入只有文本这一种模态,它们无法利用工程师数百年来习惯使用的大量视觉资料。多模态视觉语言模型(如GPT-4V)的发布弥补了这一缺口,使AI能够影响更多类型的任务。鉴于这些进展,本文对视觉语言模型GPT-4V在工程设计四大领域进行了全面评估:概念设计、系统级与详细设计、制造与检验、以及工程教育任务。我们考察了GPT-4V在草图相似性分析、基于Pugh矩阵的概念选择、材料选择、工程图纸分析、CAD生成、拓扑优化、增材与减材制造设计、空间推理挑战及教材习题等任务上的能力。通过这种结构化评估,我们不仅探究了GPT-4V处理复杂设计与制造挑战的水平,也指出了其在复杂工程设计应用中的局限。本研究为今后评估视觉语言模型奠定了基础,强调其在创新与提升工程设计与制造领域方面的巨大潜力,并贡献了一套包含1000多条查询的基准测试数据集,以支持该领域的持续进步与应用。

The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change

  • paper_url: http://arxiv.org/abs/2311.12664
  • repo_url: None
  • paper_authors: Dominik Schlechtweg, Shafqat Mumtaz Virk, Pauline Sander, Emma Sköldberg, Lukas Theuer Linke, Tuo Zhang, Nina Tahmasebi, Jonas Kuhn, Sabine Schulte im Walde
  • for: 这篇论文是用于介绍一个名为DURel的工具,该工具可以在在线开源界面上进行 semantic proximity 的注释。
  • methods: 该论文将标准化的人工标注与基于最新Word-in-Context模型的计算标注相结合;标注者的判断通过自动图聚类技术进行聚合,并可视化以供分析。
  • results: 借助DURel工具,只需极少的准备工作,即可通过对用例对的简单直观微任务判断来测量词义。工具还支持比较标注者间一致性以保证判断的主体间性,并可计算词义频率分布、语义变异及词义随时间变化等汇总统计。
    Abstract We present the DURel tool that implements the annotation of semantic proximity between uses of words into an online, open source interface. The tool supports standardized human annotation as well as computational annotation, building on recent advances with Word-in-Context models. Annotator judgments are clustered with automatic graph clustering techniques and visualized for analysis. This allows to measure word senses with simple and intuitive micro-task judgments between use pairs, requiring minimal preparation efforts. The tool offers additional functionalities to compare the agreement between annotators to guarantee the inter-subjectivity of the obtained judgments and to calculate summary statistics giving insights into sense frequency distributions, semantic variation or changes of senses over time.
    摘要 我们介绍DURel工具,它将词语用例间语义邻近度的标注实现为一个在线开源界面。该工具既支持标准化的人工标注,也支持基于最新Word-in-Context模型的计算标注。标注者的判断通过自动图聚类技术进行聚合,并可视化以供分析。由此,只需极少的准备工作,就能通过对用例对进行简单直观的微任务判断来测量词义。工具还提供额外功能,用于比较标注者间的一致性以保证所得判断的主体间性,并计算汇总统计,揭示词义频率分布、语义变异以及词义随时间的变化。
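
To illustrate the clustering step, here is a toy stand-in (not DURel's actual algorithm): pairwise proximity judgments on the 1-4 relatedness scale become weighted edges between word uses, edges below a threshold are dropped, and connected components serve as sense clusters.

```python
# Toy sense clustering from pairwise proximity judgments (illustrative only;
# DURel's production clustering is more sophisticated than this threshold +
# connected-components stand-in).
import networkx as nx

# (use_id_a, use_id_b, median judgment on the 1-4 DURel relatedness scale)
judgments = [
    ("u1", "u2", 4), ("u1", "u3", 4), ("u2", "u3", 3),  # one sense
    ("u4", "u5", 4),                                     # another sense
    ("u1", "u4", 1), ("u2", "u5", 2),                    # cross-sense pairs
]

G = nx.Graph()
for a, b, score in judgments:
    G.add_node(a)
    G.add_node(b)
    if score >= 3:  # keep only "closely related" edges
        G.add_edge(a, b, weight=score)

clusters = list(nx.connected_components(G))
print(clusters)  # e.g. [{'u1', 'u2', 'u3'}, {'u4', 'u5'}]
```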

PARK: Parkinson’s Analysis with Remote Kinetic-tasks

  • paper_url: http://arxiv.org/abs/2311.12654
  • repo_url: None
  • paper_authors: Md Saiful Islam, Sangwu Lee, Abdelrahman Abdelkader, Sooyong Park, Ehsan Hoque
  • for: 一个基于网页的帕金森病(PD)筛查框架,让用户在家中完成神经学测试。
  • methods: 引导用户完成涉及语音、面部表情和手指运动的三项任务,并分析任务视频以判断是否存在PD迹象。
  • results: 以易于理解的方式呈现结果,并提供个性化的治疗与护理资源。
    Abstract We present a web-based framework to screen for Parkinson's disease (PD) by allowing users to perform neurological tests in their homes. Our web framework guides the users to complete three tasks involving speech, facial expression, and finger movements. The task videos are analyzed to classify whether the users show signs of PD. We present the results in an easy-to-understand manner, along with personalized resources to further access to treatment and care. Our framework is accessible by any major web browser, improving global access to neurological care.
    摘要 我们提出一个基于网页的帕金森病(PD)筛查框架,让用户在家中完成神经学测试。该框架引导用户完成涉及语音、面部表情和手指运动的三项任务,并对任务视频进行分析,以判断用户是否表现出PD迹象。结果以易于理解的方式呈现,并附有个性化资源,帮助用户进一步获得治疗与护理。该框架可通过任何主流浏览器访问,从而改善全球神经科医疗的可及性。

Mobile-Seed: Joint Semantic Segmentation and Boundary Detection for Mobile Robots

  • paper_url: http://arxiv.org/abs/2311.12651
  • repo_url: https://github.com/whu-usi3dv/mobile-seed
  • paper_authors: Youqi Liao, Shuhao Kang, Jianping Li, Yang Liu, Yun Liu, Zhen Dong, Bisheng Yang, Xieyuanli Chen
  • for: 这篇论文提出了一种轻量级框架,用于同时进行语义分割和边界检测。
  • methods: 该框架基于两个核心设计:双流编码器和动态加权的主动融合解码器(AFD)。双流编码器使模型能够同时学习类别感知的语义信息和来自多尺度特征的边界信息;AFD通过学习通道间关系,动态地为语义与边界信息分配融合权重。
  • results: 实验结果表明,Mobile-Seed在Cityscapes数据集上比SOTA基线提高2.2个百分点(pp)的mIoU和4.2个百分点(pp)的mF-score,同时在RTX 2080 Ti GPU上以1024x2048分辨率输入保持23.9帧/秒(FPS)的在线推理速度。此外,在CamVid和PASCAL Context数据集上的实验也验证了该方法的泛化能力。
    Abstract Precise and rapid delineation of sharp boundaries and robust semantics is essential for numerous downstream robotic tasks, such as robot grasping and manipulation, real-time semantic mapping, and online sensor calibration performed on edge computing units. Although boundary detection and semantic segmentation are complementary tasks, most studies focus on lightweight models for semantic segmentation but overlook the critical role of boundary detection. In this work, we introduce Mobile-Seed, a lightweight, dual-task framework tailored for simultaneous semantic segmentation and boundary detection. Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach. The encoder is divided into two pathways: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The AFD module dynamically adapts the fusion of semantic and boundary information by learning channel-wise relationships, allowing for precise weight assignment of each channel. Furthermore, we introduce a regularization loss to mitigate the conflicts in dual-task learning and deep diversity supervision. Compared to existing methods, the proposed Mobile-Seed offers a lightweight framework to simultaneously improve semantic segmentation performance and accurately locate object boundaries. Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline by 2.2 percentage points (pp) in mIoU and 4.2 pp in mF-score, while maintaining an online inference speed of 23.9 frames-per-second (FPS) with 1024x2048 resolution input on an RTX 2080 Ti GPU. Additional experiments on CamVid and PASCAL Context datasets confirm our method's generalizability. Code and additional results are publicly available at https://whu-usi3dv.github.io/Mobile-Seed/.
    摘要 精确快速的边界勾勒和鲁棒的语义理解对诸多下游机器人任务至关重要,例如机器人抓取与操作、实时语义建图以及在边缘计算单元上进行的在线传感器标定。尽管边界检测与语义分割是互补的任务,多数研究只关注轻量级语义分割模型,而忽视了边界检测的关键作用。本文提出Mobile-Seed,一个面向同时语义分割与边界检测的轻量级双任务框架。该框架包含双流编码器、主动融合解码器(AFD)和双任务正则化方法:编码器分为两条通路,一条捕获类别感知的语义信息,另一条从多尺度特征中辨别边界;AFD模块通过学习通道间关系,动态自适应地融合语义与边界信息,为每个通道精确分配权重;此外,我们引入正则化损失,以缓解双任务学习中的冲突并实现深层多样性监督。在Cityscapes数据集上,Mobile-Seed比SOTA基线提升2.2个百分点的mIoU和4.2个百分点的mF-score,同时在RTX 2080 Ti GPU上以1024x2048分辨率输入保持23.9 FPS的在线推理速度;在CamVid与PASCAL Context数据集上的实验进一步验证了方法的泛化性。代码与更多结果见 https://whu-usi3dv.github.io/Mobile-Seed/ 。
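
A minimal sketch of channel-wise gated fusion in the spirit of the AFD module (our simplification, not the authors' implementation; see their repo for the real module): channel weights are predicted from globally pooled statistics of both streams and used to blend semantic and boundary features.

```python
# Minimal channel-wise gated fusion in the spirit of Mobile-Seed's AFD
# (our simplification; see the authors' repository for the real module).
import torch
import torch.nn as nn

class ChannelGatedFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, semantic: torch.Tensor, boundary: torch.Tensor):
        # Predict a per-channel weight in [0, 1] from pooled statistics of
        # both streams, then blend the two feature maps channel by channel.
        stats = self.pool(torch.cat([semantic, boundary], dim=1))
        w = self.gate(stats)                      # (B, C, 1, 1)
        return w * semantic + (1.0 - w) * boundary

fuse = ChannelGatedFusion(channels=64)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```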

KNVQA: A Benchmark for evaluation knowledge-based VQA

  • paper_url: http://arxiv.org/abs/2311.12639
  • repo_url: None
  • paper_authors: Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan
  • for: 本研究旨在提供一种可靠的评估方法,以评估大型视语模型(LVLM)在多模态场景中的实用性。
  • methods: 本研究使用了一种新的评估方法——KNVQA-Eval,通过人工标注和感知来评估知识基于VQA任务中LVLM的准确性。
  • results: 研究发现,现有LVLM存在对象幻觉与事实准确性两大问题;以往的评估方法更多关注对语言内容的理解与推理,而缺乏对多模态交互的全面评估。
    Abstract Within the multimodal field, large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. However, LVLMs are still plagued by the two critical issues of object hallucination and factual accuracy, which limit the practicality of LVLMs in different scenarios. Furthermore, previous evaluation methods focus more on the comprehension and reasoning of language content but lack a comprehensive evaluation of multimodal interactions, thereby resulting in potential limitations. To this end, we propose a novel KNVQA-Eval, which is devoted to knowledge-based VQA task evaluation to reflect the factuality of multimodal LVLMs. To ensure the robustness and scalability of the evaluation, we develop a new KNVQA dataset by incorporating human judgment and perception, aiming to evaluate the accuracy of standard answers relative to AI-generated answers in knowledge-based VQA. This work not only comprehensively evaluates the contextual information of LVLMs using reliable human annotations, but also further analyzes the fine-grained capabilities of current methods to reveal potential avenues for subsequent optimization of LVLMs-based estimators. Our proposed VQA-Eval and corresponding dataset KNVQA will facilitate the development of automatic evaluation tools with the advantages of low cost, privacy protection, and reproducibility. Our code will be released upon publication.
    摘要 在多模态领域,大型视觉语言模型(LVLM)凭借其在视觉与语言系统中强大的感知和推理能力取得了显著进展。然而,LVLM仍受对象幻觉与事实准确性两个关键问题困扰,限制了其在不同场景中的实用性。此外,以往的评估方法更关注语言内容的理解与推理,缺乏对多模态交互的全面评估,因而可能存在局限。为此,我们提出KNVQA-Eval,一种面向基于知识的VQA任务的新评估方法,以反映多模态LVLM的事实性。为保证评估的鲁棒性与可扩展性,我们结合人类判断与感知构建了新的KNVQA数据集,用于评估基于知识的VQA中标准答案相对于AI生成答案的准确性。这项工作不仅利用可靠的人工标注全面评估LVLM的上下文信息,还进一步分析了现有方法的细粒度能力,为后续优化基于LVLM的评估器揭示可能的方向。我们提出的KNVQA-Eval及相应的KNVQA数据集将促进具有低成本、隐私保护和可复现优势的自动评估工具的发展。代码将在论文发表后公开。

ChessVision – A Dataset for Logically Coherent Multi-label Classification

  • paper_url: http://arxiv.org/abs/2311.12610
  • repo_url: https://github.com/espressovi/chessvisionchallenge
  • paper_authors: Soumadeep Saha, Utpal Garain
  • for: 这篇论文的目的是探讨深度学习技术在棋盘检测任务中的应用,以及这些技术如何处理棋盘检测任务中的语义上下文和逻辑约束。
  • methods: 这篇论文使用了深度学习技术,并提供了一个大量的规则集来检测棋盘检测任务中的语义上下文和逻辑约束。
  • results: 研究发现,现有深度学习模型在该任务的标准指标上表现可观,但往往产生大量逻辑上不连贯的预测结果,表明该数据集对未来工作构成重大挑战。
    Abstract Starting with early successes in computer vision tasks, deep learning based techniques have since overtaken state of the art approaches in a multitude of domains. However, it has been demonstrated time and again that these techniques fail to capture semantic context and logical constraints, instead often relying on spurious correlations to arrive at the answer. Since application of deep learning techniques to critical scenarios are dependent on adherence to domain specific constraints, several attempts have been made to address this issue. One limitation holding back a thorough exploration of this area, is a lack of suitable datasets which feature a rich set of rules. In order to address this, we present the ChessVision Dataset, consisting of 200,000+ images of annotated chess games in progress, requiring recreation of the game state from its corresponding image. This is accompanied by a curated set of rules which constrains the set of predictions to "reasonable" game states, and are designed to probe key semantic abilities like localization and enumeration. Alongside standard metrics, additional metrics to measure performance with regards to logical consistency is presented. We analyze several popular and state of the art vision models on this task, and show that, although their performance on standard metrics are laudable, they produce a plethora of incoherent results, indicating that this dataset presents a significant challenge for future works.
    摘要 自在计算机视觉任务上取得早期成功以来,基于深度学习的技术已在众多领域超越了此前的最先进方法。然而,一再有研究表明,这些技术无法捕捉语义上下文和逻辑约束,而常常依赖虚假相关来得出答案。由于将深度学习技术应用于关键场景有赖于对领域特定约束的遵循,已有多项工作尝试解决这一问题;但缺乏带有丰富规则集的合适数据集,阻碍了对该领域的深入探索。为此,我们提出ChessVision数据集,包含20余万张进行中棋局的标注图像,要求从图像重建对应的棋局状态。数据集附带一套精心整理的规则,将预测约束到“合理”的棋局状态,并用于检验定位与计数等关键语义能力。除标准指标外,我们还提出了衡量逻辑一致性的附加指标。我们在该任务上分析了多种流行的最先进视觉模型,结果显示,尽管它们在标准指标上的表现值得称道,却产生了大量不连贯的结果,表明该数据集对未来工作构成重大挑战。

Trustworthy AI: Deciding What to Decide

  • paper_url: http://arxiv.org/abs/2311.12604
  • repo_url: None
  • paper_authors: Caesar Wu, Yuan-Fang Li, Jian Li, Jingjing Xu, Bouvry Pascal
  • for: The paper aims to address the challenge of determining which information can be trusted when using Artificial Intelligence (AI) systems for decision-making, known as deciding what to decide, or Trustworthy AI (TAI).
  • methods: The paper proposes a new TAI framework built on three crucial components of AI (representation space, loss function, and optimizer), each loosely coupled with four TAI properties, for twelve properties in total. The authors plan to use this framework to conduct experiments with quantitative and qualitative research methods that evaluate the TAI properties in a decision-making context.
  • results: Using the framework, the paper formulates an optimal prediction model, trained on a given dataset, for strategic investment decisions on credit default swaps (CDS) in the technology sector, and outlines a future direction for TAI research.
    Abstract When engaging in strategic decision-making, we are frequently confronted with overwhelming information and data. The situation can be further complicated when certain pieces of evidence contradict each other or become paradoxical. The primary challenge is how to determine which information can be trusted when we adopt Artificial Intelligence (AI) systems for decision-making. This issue is known as deciding what to decide or Trustworthy AI. However, the AI system itself is often considered an opaque black box. We propose a new approach to address this issue by introducing a novel framework of Trustworthy AI (TAI) encompassing three crucial components of AI: representation space, loss function, and optimizer. Each component is loosely coupled with four TAI properties. Altogether, the framework consists of twelve TAI properties. We aim to use this framework to conduct the TAI experiments by quantitive and qualitative research methods to satisfy TAI properties for the decision-making context. The framework allows us to formulate an optimal prediction model trained by the given dataset for applying the strategic investment decision of credit default swaps (CDS) in the technology sector. Finally, we provide our view of the future direction of TAI research
    摘要 在进行战略决策时,我们经常面对海量的信息与数据,而当某些证据相互矛盾甚至自相悖谬时,情况会更加复杂。采用人工智能(AI)系统进行决策时,首要挑战是如何判断哪些信息值得信任。这一问题被称为“决定要决定什么”,即可信AI(TAI);然而,AI系统本身往往被视为不透明的黑箱。我们提出一种新的方法来应对这一问题:引入一个涵盖AI三个关键组件(表示空间、损失函数和优化器)的TAI框架,每个组件与四个TAI属性松散耦合,共计十二个TAI属性。我们计划借助该框架,采用定量与定性研究方法开展TAI实验,以在决策情境下满足各TAI属性。该框架使我们能够基于给定数据集构建最优预测模型,应用于科技行业信用违约互换(CDS)的战略投资决策。最后,我们给出了对TAI研究未来方向的看法。

Visual tracking brain computer interface

  • paper_url: http://arxiv.org/abs/2311.12592
  • repo_url: None
  • paper_authors: Changxing Huang, Nanlin Shi, Yining Miao, Xiaogang Chen, Yijun Wang, Xiaorong Gao
  • for: 这项研究旨在超越传统的离散指令,实现基于神经活动的自然连续控制。
  • methods: 研究人员设计了一种新的空间编码刺激范式,并提出相应的投影方法,实现对解码速度的连续调制。
  • results: 在17名被试参与的实验中,固定跟踪任务取得0.55 bps、随机跟踪任务取得0.37 bps的Fitts ITR,表明该BCI兼具高精度与可靠性;该BCI还被集成到绘画与游戏两个应用中。
    Abstract Brain-computer interfaces (BCIs) offer a way to interact with computers without relying on physical movements. Non-invasive electroencephalography (EEG)-based visual BCIs, known for efficient speed and calibration ease, face limitations in continuous tasks due to discrete stimulus design and decoding methods. To achieve continuous control, we implemented a novel spatial encoding stimulus paradigm and devised a corresponding projection method to enable continuous modulation of decoded velocity. Subsequently, we conducted experiments involving 17 participants and achieved Fitt's ITR of 0.55 bps for the fixed tracking task and 0.37 bps for the random tracking task. The proposed BCI with a high Fitt's ITR was then integrated into two applications, including painting and gaming. In conclusion, this study proposed a visual BCI-based control method to go beyond discrete commands, allowing natural continuous control based on neural activity.
    摘要 脑机接口(BCI)提供了一种不依赖肢体动作与计算机交互的方式。基于非侵入式脑电(EEG)的视觉BCI以速度快、校准简便著称,但由于刺激设计与解码方法的离散性,在连续任务中存在局限。为实现连续控制,我们设计了一种新的空间编码刺激范式,并提出相应的投影方法,使解码速度可以被连续调制。随后,我们开展了17名被试参与的实验,在固定跟踪任务中取得0.55 bps、在随机跟踪任务中取得0.37 bps的Fitts ITR。该高ITR的视觉BCI随后被集成到绘画与游戏两个应用中。总之,本研究提出了一种超越离散指令的视觉BCI控制方法,实现了基于神经活动的自然连续控制。

Improving Source-Free Target Adaptation with Vision Transformers Leveraging Domain Representation Images

  • paper_url: http://arxiv.org/abs/2311.12589
  • repo_url: None
  • paper_authors: Gauransh Sawhney, Daksh Dave, Adeel Ahmed, Jiechao Gao, Khalid Saleem
  • for: 本文旨在提升视觉Transformer(ViT)在无源目标域适应中的性能,通过注意力中的Key元素增强域表示。
  • methods: 本研究结合ViT与域表示图像(DRI)实现无源目标适应:DRI作为域特定标记,以嵌入形式经由注意力的Key元素送入模型,从而增强域表示,并能无缝融入训练流程。
  • results: 实验表明,不使用DRI时ViT的提升有限;而将DRI加入Key部分后平均精度得到提升,说明DRI能有效增强ViT在域适应中的表现。
    Abstract Unsupervised Domain Adaptation (UDA) methods facilitate knowledge transfer from a labeled source domain to an unlabeled target domain, navigating the obstacle of domain shift. While Convolutional Neural Networks (CNNs) are a staple in UDA, the rise of Vision Transformers (ViTs) provides new avenues for domain generalization. This paper presents an innovative method to bolster ViT performance in source-free target adaptation, beginning with an evaluation of how key, query, and value elements affect ViT outcomes. Experiments indicate that altering the key component has negligible effects on Transformer performance. Leveraging this discovery, we introduce Domain Representation Images (DRIs), feeding embeddings through the key element. DRIs act as domain-specific markers, effortlessly merging with the training regimen. To assess our method, we perform target adaptation tests on the Cross Instance DRI source-only (SO) control. We measure the efficacy of target adaptation with and without DRIs, against existing benchmarks like SHOT-B* and adaptations via CDTrans. Findings demonstrate that excluding DRIs offers limited gains over SHOT-B*, while their inclusion in the key segment boosts average precision promoting superior domain generalization. This research underscores the vital role of DRIs in enhancing ViT efficiency in UDA scenarios, setting a precedent for further domain adaptation explorations.
    摘要 无监督域适应(UDA)方法帮助将知识从有标注的源域迁移到无标注的目标域,以克服域偏移带来的障碍。卷积神经网络(CNN)是UDA中的主力工具,而视觉Transformer(ViT)的兴起为域泛化提供了新途径。本文提出一种在无源目标适应中增强ViT性能的创新方法:首先评估注意力中的Key、Query、Value元素对ViT结果的影响,实验表明改动Key元素对Transformer性能的影响可以忽略。基于这一发现,我们引入域表示图像(DRI),将其嵌入经由Key元素送入模型。DRI充当域特定标记,可无缝融入训练流程。为评估该方法,我们在Cross Instance DRI的仅源域(SO)对照上进行目标适应测试,并与SHOT-B*及基于CDTrans的适应等现有基准比较有无DRI时的目标适应效果。结果显示,不使用DRI相对SHOT-B*的收益有限,而在Key部分加入DRI则提升了平均精度,带来更优的域泛化。本研究强调了DRI在UDA场景中提升ViT效率的关键作用,为后续域适应探索树立了先例。
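
A hedged sketch of one way to feed a domain embedding "through the key element" (our reading of the idea, not the authors' implementation): the DRI embedding is appended as an extra key/value token, so every query can attend to a domain-specific marker.

```python
# Hedged sketch: injecting a domain-representation embedding via the
# attention keys (our interpretation; not the authors' implementation).
import torch
import torch.nn.functional as F

def attention_with_dri(q, k, v, dri):
    """q, k, v: (B, N, D) token projections; dri: (B, 1, D) domain embedding.
    The DRI is appended as an extra key/value token, so every query can
    attend to a domain-specific marker."""
    k = torch.cat([k, dri], dim=1)
    v = torch.cat([v, dri], dim=1)
    attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

B, N, D = 2, 16, 64
out = attention_with_dri(torch.randn(B, N, D), torch.randn(B, N, D),
                         torch.randn(B, N, D), torch.randn(B, 1, D))
```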

Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer

  • paper_url: http://arxiv.org/abs/2311.12905
  • repo_url: None
  • paper_authors: Wenqiao Zhang, Zheqi Lv, Hao Zhou, Jia-Wei Liu, Juncheng Li, Mengze Li, Siliang Tang, Yueting Zhuang
  • for: 通过主动选择少量目标样本进行标注,最大化模型在新目标域的适应能力。
  • methods: 多源主动域适应(MADA)设定与动态整合不确定性评估框架(Detective)。
  • results: 在三个域适应基准上显著优于现有方法。
    Abstract Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a new target domain by actively selecting a limited number of target data to annotate.This setting neglects the more practical scenario where training data are collected from multiple sources. This motivates us to target a new and challenging setting of knowledge transfer that extends ADA from a single source domain to multiple source domains, termed Multi-source Active Domain Adaptation (MADA). Not surprisingly, we find that most traditional ADA methods cannot work directly in such a setting, mainly due to the excessive domain gap introduced by all the source domains and thus their uncertainty-aware sample selection can easily become miscalibrated under the multi-domain shifts. Considering this, we propose a Dynamic integrated uncertainty valuation framework(Detective) that comprehensively consider the domain shift between multi-source domains and target domain to detect the informative target samples. Specifically, the leverages a dynamic Domain Adaptation(DA) model that learns how to adapt the model's parameters to fit the union of multi-source domains. This enables an approximate single-source domain modeling by the dynamic model. We then comprehensively measure both domain uncertainty and predictive uncertainty in the target domain to detect informative target samples using evidential deep learning, thereby mitigating uncertainty miscalibration. Furthermore, we introduce a contextual diversity-aware calculator to enhance the diversity of the selected samples. Experiments demonstrate that our solution outperforms existing methods by a considerable margin on three domain adaptation benchmarks.
    摘要 为了解决这个问题,我们提出了一个动态整合不确定性评估框架(Detective),该框架全面考虑多源领域和目标领域之间的领域差。具体来说,我们利用动态适应模型(DA),该模型学习如何适应模型参数以适应多源领域的联合。这使得模型能够在动态模型中实现约等于单个源领域模型。然后,我们全面测量目标领域中的领域不确定性和预测不确定性,以探测有用的目标样本。此外,我们引入了上下文多样化识别器,以提高选择的样本多样性。实验表明,我们的解决方案在三个领域适应 benchmark 上出现了明显的提高,相比之下,现有方法的性能较差。

Echocardiogram Foundation Model – Application 1: Estimating Ejection Fraction

  • paper_url: http://arxiv.org/abs/2311.12582
  • repo_url: None
  • paper_authors: Adil Dahlan, Cyril Zakka, Abhinav Kumar, Laura Tang, Rohan Shad, Robyn Fong, William Hiesinger
  • for: 评估心脏功能,提高诊断的准确性与效率。
  • methods: 使用自监督学习(SSL)方法,在150万份超声心动图上训练基础模型。
  • results: 对射血分数的估计取得9.40%的平均绝对百分比误差,与专业超声医师的水平相当。
    Abstract Cardiovascular diseases stand as the primary global cause of mortality. Among the various imaging techniques available for visualising the heart and evaluating its function, echocardiograms emerge as the preferred choice due to their safety and low cost. Quantifying cardiac function based on echocardiograms is very laborious, time-consuming and subject to high interoperator variability. In this work, we introduce EchoAI, an echocardiogram foundation model, that is trained using self-supervised learning (SSL) on 1.5 million echocardiograms. We evaluate our approach by fine-tuning EchoAI to estimate the ejection fraction achieving a mean absolute percentage error of 9.40%. This level of accuracy aligns with the performance of expert sonographers.
    摘要 心血管疾病是全球首要死因。在可用于观察心脏并评估其功能的多种影像技术中,超声心动图因其安全性和低成本成为首选。然而,基于超声心动图量化心脏功能非常费时费力,且操作者间差异很大。本工作提出EchoAI,一个在150万份超声心动图上以自监督学习(SSL)训练的超声心动图基础模型。我们通过微调EchoAI来估计射血分数以评估该方法,取得9.40%的平均绝对百分比误差,这一精度与专业超声医师的水平相当。

IMGTB: A Framework for Machine-Generated Text Detection Benchmarking

  • paper_url: http://arxiv.org/abs/2311.12574
  • repo_url: None
  • paper_authors: Michal Spiegel, Dominik Macko
  • for: 本研究旨在提供一个简单易用的基准测试框架,以便测试和比较新的机器生成文本检测方法。
  • methods: 本研究提出IMGTB框架,可以轻松集成自定义(新)的检测方法与评估数据集。
  • results: 结果显示,IMGTB便于对机器生成文本检测方法进行比较与研究,尤其是与现有最先进检测器的对比;其默认提供的分析、指标与可视化遵循现有文献中的基准测试惯例,并具有高度的可配置性与灵活性。
    Abstract In the era of large language models generating high quality texts, it is a necessity to develop methods for detection of machine-generated text to avoid harmful use or simply due to annotation purposes. It is, however, also important to properly evaluate and compare such developed methods. Recently, a few benchmarks have been proposed for this purpose; however, integration of newest detection methods is rather challenging, since new methods appear each month and provide slightly different evaluation pipelines. In this paper, we present the IMGTB framework, which simplifies the benchmarking of machine-generated text detection methods by easy integration of custom (new) methods and evaluation datasets. Its configurability and flexibility makes research and development of new detection methods easier, especially their comparison to the existing state-of-the-art detectors. The default set of analyses, metrics and visualizations offered by the tool follows the established practices of machine-generated text detection benchmarking found in state-of-the-art literature.
    摘要 在大语言模型能够生成高质量文本的时代,为避免有害使用或仅仅出于标注目的,开发机器生成文本的检测方法已成为必需;同样重要的是对这些方法进行恰当的评估与比较。近来已有若干基准被提出用于此目的,然而集成最新的检测方法相当困难,因为新方法每月涌现,且各自提供略有不同的评估流程。本文介绍IMGTB框架,它通过轻松集成自定义(新)方法与评估数据集,简化了机器生成文本检测方法的基准测试。其可配置性与灵活性使新检测方法的研究与开发更加容易,尤其是与现有最先进检测器的比较。该工具默认提供的分析、指标与可视化遵循了最新文献中机器生成文本检测基准测试的既定惯例。
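
A toy sketch of what such a benchmarking harness does under the hood (our illustration, not the IMGTB API): every detector maps texts to scores, and a shared evaluation routine compares detectors on a labeled machine/human dataset.

```python
# Toy illustration of a detector-benchmarking loop (not the IMGTB API).
from sklearn.metrics import roc_auc_score

def length_detector(texts):
    # Naive baseline "detector": longer texts get higher machine-scores.
    return [len(t) / 1000.0 for t in texts]

def benchmark(detectors, texts, labels):
    """labels: 1 = machine-generated, 0 = human-written."""
    results = {}
    for name, detect in detectors.items():
        scores = detect(texts)
        results[name] = roc_auc_score(labels, scores)
    return results

texts = ["short human note",
         "a much longer, suspiciously fluent passage " * 5]
labels = [0, 1]
print(benchmark({"length-baseline": length_detector}, texts, labels))
```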

Moderating Model Marketplaces: Platform Governance Puzzles for AI Intermediaries

  • paper_url: http://arxiv.org/abs/2311.12573
  • repo_url: None
  • paper_authors: Robert Gorwa, Michael Veale
  • for: The paper discusses the challenges of governing AI model marketplaces, such as Hugging Face, GitHub, and Civitai, where users can upload and share their own models and training data.
  • methods: The paper examines several case studies of incidents on these platforms to explore how they moderate models, and analyzes the practices industry has been developing to respond to moderation demands, including licensing, access and use restrictions, automated content moderation, and open policy development.
  • results: The paper concludes that the policy challenge of governing AI model marketplaces is considerable, and suggests ways platforms could better mobilize resources to act as a careful, fair, and proportionate regulatory access point.
    Abstract The AI development community is increasingly making use of hosting intermediaries such as Hugging Face provide easy access to user-uploaded models and training data. These model marketplaces lower technical deployment barriers for hundreds of thousands of users, yet can be used in numerous potentially harmful and illegal ways. In this article, we explain ways in which AI systems, which can both `contain' content and be open-ended tools, present one of the trickiest platform governance challenges seen to date. We provide case studies of several incidents across three illustrative platforms -- Hugging Face, GitHub and Civitai -- to examine how model marketplaces moderate models. Building on this analysis, we outline important (and yet nevertheless limited) practices that industry has been developing to respond to moderation demands: licensing, access and use restrictions, automated content moderation, and open policy development. While the policy challenge at hand is a considerable one, we conclude with some ideas as to how platforms could better mobilize resources to act as a careful, fair, and proportionate regulatory access point.
    摘要 AI开发社区越来越多地利用Hugging Face等托管中介,以便捷地获取用户上传的模型与训练数据。这些模型市场为数十万用户降低了技术部署门槛,但也可能被用于多种潜在有害乃至非法的用途。本文阐述了AI系统作为既能“承载”内容、又是开放式工具的存在,为何构成了迄今最棘手的平台治理挑战之一。我们通过Hugging Face、GitHub和Civitai三个代表性平台上的若干事件案例,考察模型市场如何对模型进行治理。在此分析基础上,我们梳理了业界为应对治理需求而发展出的重要但仍有限的实践:许可证、访问与使用限制、自动化内容审核,以及开放式政策制定。尽管当前的政策挑战相当艰巨,我们在结尾提出了一些设想,说明平台如何更好地调动资源,充当审慎、公平、适度的监管接入点。

Scheduling Distributed Flexible Assembly Lines using Safe Reinforcement Learning with Soft Shielding

  • paper_url: http://arxiv.org/abs/2311.12572
  • repo_url: None
  • paper_authors: Lele Li, Liyong Lin
  • for: 提高装配线的效率与可靠性,并实现作业的实时调度。
  • methods: 使用优势演员-评论家(advantage actor-critic)强化学习方法,提出一种更紧凑的环境表示方法,并设计了基于蒙特卡洛树搜索的软屏蔽(soft shielding)组件。
  • results: 性能评估表明,所提算法及其软屏蔽组件能够有效提升作业调度的效率与安全可靠性。
    Abstract Highly automated assembly lines enable significant productivity gains in the manufacturing industry, particularly in mass production condition. Nonetheless, challenges persist in job scheduling for make-to-job and mass customization, necessitating further investigation to improve efficiency, reduce tardiness, promote safety and reliability. In this contribution, an advantage actor-critic based reinforcement learning method is proposed to address scheduling problems of distributed flexible assembly lines in a real-time manner. To enhance the performance, a more condensed environment representation approach is proposed, which is designed to work with the masks made by priority dispatching rules to generate fixed and advantageous action space. Moreover, a Monte-Carlo tree search based soft shielding component is developed to help address long-sequence dependent unsafe behaviors and monitor the risk of overdue scheduling. Finally, the proposed algorithm and its soft shielding component are validated in performance evaluation.
    摘要 高度自动化的装配线可为制造业带来显著的生产率提升,尤其是在大批量生产条件下。然而,面向按单生产与大规模定制的作业调度仍存在挑战,需要进一步研究以提高效率、减少延期、增进安全性与可靠性。本文提出一种基于优势演员-评论家的强化学习方法,用于实时求解分布式柔性装配线的调度问题。为提升性能,我们提出了一种更紧凑的环境表示方法,与优先级调度规则生成的掩码配合,构造固定且有利的动作空间。此外,我们开发了基于蒙特卡洛树搜索的软屏蔽组件,以应对长序列依赖的不安全行为,并监控调度超期风险。最后,我们通过性能评估验证了所提算法及其软屏蔽组件的有效性。
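
A hedged sketch of the "soft shielding" idea (our simplification of the Monte-Carlo-tree-search-based component): instead of hard-masking unsafe actions, estimated risk from short simulations is used to downweight action probabilities before sampling.

```python
# Toy soft shield over a policy's action distribution (our simplification of
# the Monte-Carlo-tree-search-based component described above).
import numpy as np

def soft_shield(action_probs, risk, temperature=5.0):
    """action_probs: policy distribution over actions; risk: estimated
    probability (e.g., from short Monte-Carlo simulations) that each action
    leads to an unsafe or overdue schedule. Risky actions are exponentially
    downweighted rather than forbidden outright."""
    shielded = action_probs * np.exp(-temperature * risk)
    return shielded / shielded.sum()

probs = np.array([0.5, 0.3, 0.2])
risk = np.array([0.9, 0.1, 0.0])   # action 0 looks unsafe in simulation
print(soft_shield(probs, risk))     # mass shifts toward the safer actions
```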

Multi-Session Budget Optimization for Forward Auction-based Federated Learning

  • paper_url: http://arxiv.org/abs/2311.12548
  • repo_url: None
  • paper_authors: Xiaoli Tang, Han Yu
  • for: 在基于正向拍卖的联邦学习(AFL)中,模型用户(MU)可跨多个训练会话逐步招募数据拥有者(DO),因此需要相应策略来优化预算分配,以最大化总效用。
  • methods: 提出基于分层强化学习的多会话预算优化策略MultiBOS-AFL,联合优化会话间的预算节奏(budget pacing)与会话内的竞价,以最大化总效用。
  • results: 在六个基准数据集上与七种最新方法进行了大量对比实验。结果表明,在给定预算下,MultiBOS-AFL平均获得高出12.28%的效用、多出14.52%的拍卖所得数据,以及所得FL模型高出1.23%的测试准确率。
    Abstract Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL model users (MUs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, an MU can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL MUs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MultiBOS-AFL). Based on hierarchical reinforcement learning, MultiBOS-AFL jointly optimizes inter-session budget pacing and intra-session bidding for AFL MUs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MultiBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for MUs in multi-session forward auction-based federated learning
    摘要 基于拍卖的联邦学习(AFL)近年来已成为重要的研究领域。现有面向联邦学习模型用户(MU)的策略假设在训练开始前必须一次性组建完成FL任务所需的全部数据拥有者(DO)团队。而在实践中,MU可以多次触发FL训练过程,DO因此可以在多个训练会话中逐步招募。现有AFL竞价策略并未针对这种情形设计,多会话AFL问题仍然悬而未决。为此,我们提出面向正向拍卖联邦学习的多会话预算优化策略(MultiBOS-AFL)。MultiBOS-AFL基于分层强化学习,联合优化MU在会话间的预算节奏与会话内的竞价,目标是最大化总效用。在六个基准数据集上的大量实验表明,它显著优于七种最新方法:在给定预算下,MultiBOS-AFL平均获得高出12.28%的效用、多出14.52%的拍卖所得数据,以及所得FL模型高出1.23%的测试准确率。据我们所知,这是首个为多会话正向拍卖联邦学习中的MU设计、具备预算节奏能力的预算优化决策支持方法。

In-Context Learning Functions with Varying Number of Minima

  • paper_url: http://arxiv.org/abs/2311.12538
  • repo_url: https://github.com/pittnail/icl-minima
  • paper_authors: David Oniani, Yanshan Wang
  • for: 本研究探讨大语言模型(LLM)的上下文学习(ICL)能力,以及ICL与其所拟合函数特性之间的相互作用。
  • methods: 本研究在一个形式化框架下考察ICL,提出了拟合具有不同极小值个数函数的新任务,并实现了一种可按给定输入位置构造极小值的函数生成方法。
  • results: 我们发现,增加函数的极小值个数会降低ICL性能;同时,评估表明ICL在所有设置下都优于两层神经网络(2NN)模型,且学习速度更快。我们通过在多种超参数配置下的少样本实验验证了这些发现。
    Abstract Large Language Models (LLMs) have proven effective at In-Context Learning (ICL), an ability that allows them to create predictors from labeled examples. Few studies have explored the interplay between ICL and specific properties of functions it attempts to approximate. In our study, we use a formal framework to explore ICL and propose a new task of approximating functions with varying number of minima. We implement a method that allows for producing functions with given inputs as minima. We find that increasing the number of minima degrades ICL performance. At the same time, our evaluation shows that ICL outperforms 2-layer Neural Network (2NN) model. Furthermore, ICL learns faster than 2NN in all settings. We validate the findings through a set of few-shot experiments across various hyperparameter configurations.
    摘要 大语言模型(LLM)已被证明擅长上下文学习(ICL),即能够从带标注的示例中构建预测器。鲜有研究探讨ICL与其所拟合函数特性之间的相互作用。本研究在一个形式化框架下考察ICL,提出了拟合具有不同极小值个数函数的新任务,并实现了一种可按给定输入位置构造极小值的函数生成方法。我们发现,增加极小值个数会降低ICL性能;与此同时,ICL在所有设置下都优于两层神经网络(2NN)模型,且学习速度更快。我们通过多种超参数配置下的少样本实验验证了上述发现。
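
One simple way to realize "functions with given inputs as minima" (a construction of our own; the paper's method may differ): f(x) = prod_i (x - m_i)^2 is non-negative and vanishes exactly at the chosen points, so each m_i is a local (indeed global) minimum, with local maxima in between.

```python
# Constructing a 1-D function whose minima sit at chosen inputs
# (a simple construction of our own; the paper's method may differ).
import numpy as np

def make_function(minima):
    minima = np.asarray(minima, dtype=float)
    def f(x):
        x = np.asarray(x, dtype=float)
        # Product over (x - m_i)^2: non-negative, zero exactly at each m_i,
        # hence every m_i is a (global) minimum; local maxima lie between.
        return np.prod((x[..., None] - minima) ** 2, axis=-1)
    return f

f = make_function([-1.0, 0.5, 2.0])    # three minima
xs = np.linspace(-2, 3, 7)
print(f(xs))                           # values dip to zero near the minima
print(f(np.array([-1.0, 0.5, 2.0])))   # exactly [0. 0. 0.]
```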

Oasis: Data Curation and Assessment System for Pretraining of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12537
  • repo_url: https://github.com/tongzhou21/oasis
  • paper_authors: Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, Shengping Liu
  • for: This paper presents a pretraining corpus curation and assessment platform called Oasis, which aims to improve the quality of large language models by customizing a corpus curation pipeline and leveraging comprehensive corpus assessment for iterative optimization.
  • methods: The Oasis platform includes a customized data curation module with an interactive modular rule filter, a debiased neural filter, and an adaptive document deduplication module. It also features a holistic data assessment module with human, GPT-4, and heuristic metrics.
  • results: The authors exhibit a complete process for using Oasis to curate and assess pretraining data, and publicly release an 800GB bilingual corpus curated with Oasis.
    Abstract Data is one of the most critical elements in building a large language model. However, existing systems either fail to customize a corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation. To this end, we present a pretraining corpus curation and assessment platform called Oasis -- a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module can devise customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove the undesired bias. The adaptive document deduplication module could execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. And in the holistic data assessment module, a corpus can be assessed in local and global views, with three evaluation means including human, GPT-4, and heuristic metrics. We exhibit a complete process to use Oasis for the curation and assessment of pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.
    摘要 数据是构建大语言模型最关键的要素之一。然而,现有系统要么无法定制语料整编流水线,要么忽视了利用全面的语料评估来迭代优化整编过程。为此,我们提出Oasis,一个集数据质量改进与量化于一体、带有友好交互界面的一站式预训练语料整编与评估平台。具体而言,交互式模块化规则过滤模块可根据显式反馈制定定制化规则;去偏神经过滤模块以负例为中心构建质量分类数据集,以去除不希望的偏差;自适应文档去重模块可在有限内存资源下执行大规模去重。这三部分构成了定制化数据整编模块。在整体数据评估模块中,语料可从局部和全局视角进行评估,并提供人工、GPT-4与启发式指标三种评估手段。我们展示了使用Oasis整编与评估预训练数据的完整流程,并公开发布了一个由Oasis整编的800GB双语语料。
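
As a toy stand-in for the deduplication module (illustrative only; Oasis's adaptive module is designed for limited memory), the sketch below detects near-duplicate documents by hashing word n-gram "shingles" and comparing Jaccard overlap.

```python
# Toy near-duplicate detection via shingle Jaccard similarity (illustrative
# only; Oasis's adaptive deduplication targets limited-memory settings).
def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedup(docs, threshold: float = 0.7):
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = ["the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "an entirely different sentence about corpora"]
print(len(dedup(docs)))  # 2: the first two documents are near-duplicates
```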

Neural Network Pruning by Gradient Descent

  • paper_url: http://arxiv.org/abs/2311.12526
  • repo_url: https://github.com/3riccc/neural_pruning
  • paper_authors: Zhang Zhang, Ruyi Tao, Jiang Zhang
  • for: 这篇论文提出了一种新的神经网络剪枝框架,联合优化网络的权重与拓扑,以提升计算效率和模型可解释性。
  • methods: 该框架利用Gumbel-Softmax技术,在随机梯度下降过程中以端到端方式同时优化网络权重与结构。
  • results: 实验结果表明,该框架仅保留原网络0.15%的参数即可在MNIST数据集上保持高准确率。此外,该框架还增强了神经网络的可解释性:不仅可以直接从剪枝后的网络中提取特征重要性,还能可视化特征对称性以及信息从特征到结果的传播路径。
    Abstract The rapid increase in the parameters of deep learning models has led to significant costs, challenging computational efficiency and model interpretability. In this paper, we introduce a novel and straightforward neural network pruning framework that incorporates the Gumbel-Softmax technique. This framework enables the simultaneous optimization of a network's weights and topology in an end-to-end process using stochastic gradient descent. Empirical results demonstrate its exceptional compression capability, maintaining high accuracy on the MNIST dataset with only 0.15\% of the original network parameters. Moreover, our framework enhances neural network interpretability, not only by allowing easy extraction of feature importance directly from the pruned network but also by enabling visualization of feature symmetry and the pathways of information propagation from features to outcomes. Although the pruning strategy is learned through deep learning, it is surprisingly intuitive and understandable, focusing on selecting key representative features and exploiting data patterns to achieve extreme sparse pruning. We believe our method opens a promising new avenue for deep learning pruning and the creation of interpretable machine learning systems.
    摘要 深度学习模型参数的快速增长带来了显著成本,并对计算效率和模型可解释性提出挑战。本文提出一种新颖而简洁的神经网络剪枝框架,引入Gumbel-Softmax技术,使网络的权重与拓扑能够通过随机梯度下降以端到端方式同时优化。实验结果表明其压缩能力出众:仅保留原网络0.15%的参数,即可在MNIST数据集上保持高准确率。此外,该框架增强了神经网络的可解释性:不仅可以直接从剪枝后的网络中提取特征重要性,还能可视化特征对称性以及信息从特征到结果的传播路径。尽管剪枝策略是通过深度学习学得的,它却出乎意料地直观易懂,聚焦于选择关键代表性特征,并利用数据模式实现极度稀疏的剪枝。我们相信该方法为深度学习剪枝及可解释机器学习系统的构建开辟了有前景的新方向。
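
A minimal sketch of the idea (our simplification, not the authors' code): each weight gets a learnable keep/drop logit; a straight-through Gumbel-Softmax sample turns the logits into a nearly binary mask that stays differentiable, so mask and weights train together under SGD with a sparsity penalty.

```python
# Minimal Gumbel-Softmax weight pruning (our simplification of the idea:
# learn a differentiable keep/drop mask jointly with the weights).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunableLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        # Two logits per weight: index 0 = drop, index 1 = keep.
        self.mask_logits = nn.Parameter(torch.zeros(d_out, d_in, 2))

    def forward(self, x, tau=1.0):
        # Straight-through Gumbel-Softmax: hard 0/1 mask in the forward pass,
        # soft differentiable probabilities in the backward pass.
        mask = F.gumbel_softmax(self.mask_logits, tau=tau, hard=True)[..., 1]
        return x @ (self.weight * mask).t()

    def keep_prob(self):
        return self.mask_logits.softmax(dim=-1)[..., 1]

layer = PrunableLinear(8, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(32, 8), torch.randn(32, 4)
for _ in range(100):
    opt.zero_grad()
    # Task loss plus a sparsity penalty on the expected keep probability.
    loss = F.mse_loss(layer(x), y) + 0.01 * layer.keep_prob().mean()
    loss.backward()
    opt.step()
print(f"kept fraction: {(layer.keep_prob() > 0.5).float().mean().item():.2f}")
```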

ALPHA: AnomaLous Physiological Health Assessment Using Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12524
  • repo_url: https://github.com/mcjacktang/llm-healthassistant
  • paper_authors: Jiankai Tang, Kegang Wang, Hongming Hu, Xiyuxing Zhang, Peiyu Wang, Xin Liu, Yuntao Wang
  • for: 本研究旨在评估大型语言模型(LLM)在医疗领域的有效性,尤其是其在个人异常健康监测中的应用。我们主要考察LLM解读和分析来自FDA批准设备的生理数据的能力。
  • methods: 我们利用在模拟高原低气压环境中采集的异常生理数据进行了广泛分析,以评估LLM在识别和评价用户健康状况方面的精度与可靠性。
  • results: 我们发现LLM在评估心率和血氧饱和度(SpO2)方面表现出色:心率的MAE低于每分钟1次,SpO2误差低于1%,MAPE保持在1%以下,总体准确率超过85%。在图像分析任务中,我们专门适配的GPT模型在解读光电容积脉搏波(PPG)数据方面表现突出,周期计数误差低于1 bpm,心率估计MAE为7.28。这项研究凸显了LLM作为健康数据分析工具和先进AI医疗助手核心组件的双重角色,可在未来医疗助手框架中提供个性化健康洞见与建议。
    Abstract This study concentrates on evaluating the efficacy of Large Language Models (LLMs) in healthcare, with a specific focus on their application in personal anomalous health monitoring. Our research primarily investigates the capabilities of LLMs in interpreting and analyzing physiological data obtained from FDA-approved devices. We conducted an extensive analysis using anomalous physiological data gathered in a simulated low-air-pressure plateau environment. This allowed us to assess the precision and reliability of LLMs in understanding and evaluating users' health status with notable specificity. Our findings reveal that LLMs exhibit exceptional performance in determining medical indicators, including a Mean Absolute Error (MAE) of less than 1 beat per minute for heart rate and less than 1% for oxygen saturation (SpO2). Furthermore, the Mean Absolute Percentage Error (MAPE) for these evaluations remained below 1%, with the overall accuracy of health assessments surpassing 85%. In image analysis tasks, such as interpreting photoplethysmography (PPG) data, our specially adapted GPT models demonstrated remarkable proficiency, achieving less than 1 bpm error in cycle count and 7.28 MAE for heart rate estimation. This study highlights LLMs' dual role as health data analysis tools and pivotal elements in advanced AI health assistants, offering personalized health insights and recommendations within the future health assistant framework.
    摘要 本研究聚焦于评估大型语言模型(LLM)在医疗领域的有效性,特别关注其在个人异常健康监测中的应用。我们的研究主要考察LLM解读和分析来自FDA批准设备的生理数据的能力。我们利用在模拟高原低气压环境中采集的异常生理数据进行了广泛分析,以评估LLM理解和评价用户健康状态的精度与可靠性。结果显示,LLM在判定医学指标方面表现出色:心率的平均绝对误差(MAE)低于每分钟1次,血氧饱和度(SpO2)的误差低于1%;平均绝对百分比误差(MAPE)保持在1%以下,健康评估的总体准确率超过85%。在图像分析任务(如解读光电容积脉搏波 PPG 数据)中,我们专门适配的GPT模型表现突出,周期计数误差低于1 bpm,心率估计MAE为7.28。本研究凸显了LLM作为健康数据分析工具和先进AI健康助手核心组件的双重角色,可在未来健康助手框架中提供个性化的健康洞见与建议。
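
As a quick reference for the reported error metrics, a small sketch of how MAE and MAPE are computed; the toy numbers below are illustrative, not the study's data.

```python
import numpy as np

truth = np.array([72, 75, 80, 78], dtype=float)  # reference heart rate (bpm)
pred = np.array([71, 76, 79, 78], dtype=float)   # LLM-derived estimate

mae = np.mean(np.abs(pred - truth))                    # mean absolute error (bpm)
mape = np.mean(np.abs((pred - truth) / truth)) * 100   # mean absolute percentage error (%)
print(f"MAE = {mae:.2f} bpm, MAPE = {mape:.2f}%")
```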

Classification of Tabular Data by Text Processing

  • paper_url: http://arxiv.org/abs/2311.12521
  • repo_url: None
  • paper_authors: Keshav Ramani, Daniel Borrajo
  • for: 这篇论文提出了一种基于文本处理技术的分类方法,即基于文本的分类(Text Based Classification, TBC),用于解决表格数据上的分类任务。
  • methods: 该方法利用当前最先进的文本处理技术完成分类任务,包括文本特征提取、文本分类和模型训练等步骤。
  • results: 在多个数据集上,TBC方法在准确率、精度和召回率方面可与当前最先进的模型相当。
    Abstract Natural Language Processing technology has advanced vastly in the past decade. Text processing has been successfully applied to a wide variety of domains. In this paper, we propose a novel framework, Text Based Classification(TBC), that uses state of the art text processing techniques to solve classification tasks on tabular data. We provide a set of controlled experiments where we present the benefits of using this approach against other classification methods. Experimental results on several data sets also show that this framework achieves comparable performance to that of several state of the art models in accuracy, precision and recall of predicted classes.
    摘要 自然语言处理技术在过去十年间取得了长足进步,文本处理已成功应用于各种领域。在这篇论文中,我们提出了一种新的框架——基于文本的分类(TBC),利用当前最先进的文本处理技术来解决表格数据上的分类任务。我们提供了一组对照实验,展示了该方法相对于其他分类方法的优势。在多个数据集上的实验结果表明,该框架在预测类别的准确率、精度和召回率方面可达到与多个最先进模型相当的水平。
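
The core move in text-based classification of tabular data is serialising each row into a sentence that a text model can consume. The template and column names below are hypothetical; the paper does not specify its exact serialisation here.

```python
def row_to_text(row: dict) -> str:
    """Serialise one tabular record into a sentence for a text classifier."""
    return ". ".join(f"{col} is {val}" for col, val in row.items())

row = {"age": 42, "income": "55k", "occupation": "engineer"}
print(row_to_text(row))  # "age is 42. income is 55k. occupation is engineer"
# The resulting string can be fed to any fine-tuned text classifier
# (e.g. a BERT model with a classification head) to predict the row's label.
```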

Fin-QD: A Computational Design Framework for Soft Grippers: Integrating MAP-Elites and High-fidelity FEM

  • paper_url: http://arxiv.org/abs/2311.12477
  • repo_url: None
  • paper_authors: Yue Xie, Xing Wang, Fumiya Iida, David Howard
  • for: 本研究旨在通过计算设计激发软体机器人的潜力,以应对其在材料、结构和接触方面的高度非线性问题。
  • methods: 本研究提出了一个自动化的计算设计优化框架,基于质量-多样性(quality-diversity)方法生成多样化的软体夹爪设计,使其能够分别抓取几何特征各异的物体。
  • results: 本研究实现了多样化软体夹爪设计的自动生成,并通过高保真的有限元建模(FEM)评估抓取性能与特征。这些设计兼顾了夹爪的体积与工作空间,且仅需简单的控制方案即可实现。这些结果有助于弥合软体夹爪的计算设计空间与抓取几何各异物体之间的差距。
    Abstract Computational design can excite the full potential of soft robotics that has the drawbacks of being highly nonlinear from material, structure, and contact. Up to date, enthusiastic research interests have been demonstrated for individual soft fingers, but the frame design space (how each soft finger is assembled) remains largely unexplored. Computationally design remains challenging for the finger-based soft gripper to grip across multiple geometrical-distinct object types successfully. Including the design space for the gripper frame can bring huge difficulties for conventional optimisation algorithms and fitness calculation methods due to the exponential growth of high-dimensional design space. This work proposes an automated computational design optimisation framework that generates gripper diversity to individually grasp geometrically distinct object types based on a quality-diversity approach. This work first discusses a significantly large design space (28 design parameters) for a finger-based soft gripper, including the rarely-explored design space of finger arrangement that is converted to various configurations to arrange individual soft fingers. Then, a contact-based Finite Element Modelling (FEM) is proposed in SOFA to output high-fidelity grasping data for fitness evaluation and feature measurements. Finally, diverse gripper designs are obtained from the framework while considering features such as the volume and workspace of grippers. This work bridges the gap of computationally exploring the vast design space of finger-based soft grippers while grasping large geometrically distinct object types with a simple control scheme.
    摘要 计算设计能够充分释放软体机器人的潜力,但软体机器人在材料、结构和接触方面高度非线性,给计算设计带来极大挑战。迄今为止,研究者们对单个软指表现出了浓厚兴趣,但夹爪框架的设计空间(即各软指如何装配)仍在很大程度上未被探索。由于高维设计空间呈指数级增长,将夹爪框架纳入设计空间会给传统优化算法和适应度计算方法带来巨大困难,使基于软指的夹爪难以通过计算设计成功抓取多种几何各异的物体。本工作提出了一种基于质量-多样性方法的自动化计算设计优化框架,可生成多样化的夹爪设计,分别抓取几何特征各异的物体。本工作首先讨论了基于软指的软体夹爪的一个相当庞大的设计空间(28个设计参数),其中包括鲜有探索的软指排布设计空间,可转换为多种配置来布置各个软指。随后,在SOFA中构建了基于接触的有限元模型(FEM),输出高保真的抓取数据用于适应度评估与特征测量。最终,该框架在考虑夹爪体积与工作空间等特征的同时,得到了多样化的夹爪设计。本工作以简单的控制方案弥合了计算探索软体夹爪庞大设计空间与抓取多种几何各异物体之间的差距。
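
A minimal sketch of the MAP-Elites (quality-diversity) loop at the heart of such a framework, with stubs standing in for the 28-parameter design space and the SOFA FEM evaluation; all numbers here are illustrative.

```python
import random

GRID = {}  # feature cell -> (fitness, design): one elite per behavioural niche

def random_design():
    return [random.uniform(0, 1) for _ in range(28)]       # 28 design parameters

def mutate(design):
    return [min(1.0, max(0.0, x + random.gauss(0, 0.05))) for x in design]

def evaluate(design):
    # Placeholder for the contact-based FEM simulation in SOFA: it would
    # return a grasp-quality fitness plus low-dimensional feature descriptors
    # (e.g. gripper volume and workspace) that index the archive.
    fitness = sum(design) / len(design)
    features = (round(design[0], 1), round(design[1], 1))
    return fitness, features

for _ in range(10_000):
    parent = random_design() if not GRID else mutate(random.choice(list(GRID.values()))[1])
    fit, cell = evaluate(parent)
    if cell not in GRID or fit > GRID[cell][0]:   # keep only the elite per cell
        GRID[cell] = (fit, parent)

print(f"{len(GRID)} diverse elite gripper designs found")
```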

PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

  • paper_url: http://arxiv.org/abs/2311.12475
  • repo_url: https://github.com/clicknext-ai/phayathaibert
  • paper_authors: Panyut Sriwirote, Jalinee Thapiang, Vasan Timtong, Attapol T. Rutherford
  • for: 本研究旨在解决 WangchanBERTa 模型在理解外语词汇方面的缺陷,尤其是英语词汇常被直接借入泰语而未经正字法同化的情况。
  • methods: 本研究使用 vocabulary transfer 技术将 XLM-R 的预训练tokenizer 中的外语词语添加到 WangchanBERTa 的tokenizer中,然后从 WangchanBERTa 的Checkpoint开始,在一个更大的数据集上进行预训练。
  • results: 我们的新预训练模型 PhayaThaiBERT 在多个下游任务和数据集上表现出色,超越 WangchanBERTa 的表现。
    Abstract While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still has shortcomings in regard to the understanding of foreign words, most notably English words, which are often borrowed without orthographic assimilation into Thai in many contexts. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model using the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that our new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa in many downstream tasks and datasets.
    摘要 尽管 WangchanBERTa 已成为基于Transformer的泰语语言建模的事实标准,它在理解外语词汇(尤其是常被直接借入泰语而未经正字法同化的英语词汇)方面仍有不足。我们认为 WangchanBERTa 分词器中缺乏外语词汇是造成这些缺陷的主要原因。为此,我们通过词表迁移将 XLM-R 预训练分词器中的词汇扩充到 WangchanBERTa 的分词器中,并从 WangchanBERTa 的检查点出发,使用扩充后的分词器,在一个比 WangchanBERTa 训练集更大的数据集上预训练新模型。结果显示,我们的新预训练模型 PhayaThaiBERT 在许多下游任务和数据集上超越了 WangchanBERTa。
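
A sketch of the vocabulary-transfer step using Hugging Face Transformers. The model identifiers are the public checkpoints; initialising the new embedding rows randomly (rather than, say, copying XLM-R's embeddings) is a simplifying assumption of this sketch, not necessarily the authors' procedure.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

src_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
tgt_tok = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")

# Copy over tokens that exist in XLM-R's vocabulary but not in WangchanBERTa's,
# then resize the embedding matrix before continuing pretraining from the checkpoint.
new_tokens = [t for t in src_tok.get_vocab() if t not in tgt_tok.get_vocab()]
added = tgt_tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tgt_tok))  # new rows are randomly initialised here
print(f"added {added} tokens; continue pretraining on the larger corpus")
```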

Self-Supervised Deconfounding Against Spatio-Temporal Shifts: Theory and Modeling

  • paper_url: http://arxiv.org/abs/2311.12472
  • repo_url: https://github.com/shotdowndiane/steve
  • paper_authors: Jiahao Ji, Wentao Zhang, Jingyuan Wang, Yue He, Chao Huang
  • for: 本研究旨在提高城市交通效率和推动可持续发展,通过减轻外部因素对于预测交通流量的影响。
  • methods: 我们首先构建了涵盖过去交通数据、未来交通数据和外部时空(ST)上下文的因果图,发现先前方法在OOD交通数据上失效的原因在于ST上下文充当了混杂因素,即过去数据与未来数据的共同原因。随后,我们从因果视角提出了一种名为分离式上下文调整(DCA)的理论解决方案,通过区分不变的因果相关性与易变的伪相关性,对ST上下文的影响进行去混杂。
  • results: 我们的 STEVE 框架在四个大规模 benchmark 数据集上进行了广泛的实验,并显示其在不同的 OOD 交通预测场景下的稳定性和可靠性。
    Abstract As an important application of spatio-temporal (ST) data, ST traffic forecasting plays a crucial role in improving urban travel efficiency and promoting sustainable development. In practice, the dynamics of traffic data frequently undergo distributional shifts attributed to external factors such as time evolution and spatial differences. This entails forecasting models to handle the out-of-distribution (OOD) issue where test data is distributed differently from training data. In this work, we first formalize the problem by constructing a causal graph of past traffic data, future traffic data, and external ST contexts. We reveal that the failure of prior arts in OOD traffic data is due to ST contexts acting as a confounder, i.e., the common cause for past data and future ones. Then, we propose a theoretical solution named Disentangled Contextual Adjustment (DCA) from a causal lens. It differentiates invariant causal correlations against variant spurious ones and deconfounds the effect of ST contexts. On top of that, we devise a Spatio-Temporal sElf-superVised dEconfounding (STEVE) framework. It first encodes traffic data into two disentangled representations for associating invariant and variant ST contexts. Then, we use representative ST contexts from three conceptually different perspectives (i.e., temporal, spatial, and semantic) as self-supervised signals to inject context information into both representations. In this way, we improve the generalization ability of the learned context-oriented representations to OOD ST traffic forecasting. Comprehensive experiments on four large-scale benchmark datasets demonstrate that our STEVE consistently outperforms the state-of-the-art baselines across various ST OOD scenarios.
    摘要 作为时空(ST)数据的重要应用,ST交通预测在提高城市出行效率和推动可持续发展方面发挥着关键作用。在实践中,交通数据的动态特性常因时间演化、空间差异等外部因素而发生分布偏移,这要求预测模型能够处理测试数据与训练数据分布不同的分布外(OOD)问题。在本工作中,我们首先构建了涵盖过去交通数据、未来交通数据与外部ST上下文的因果图,将该问题形式化。我们发现,先前方法在OOD交通数据上失效的原因在于ST上下文充当了混杂因素,即过去数据与未来数据的共同原因。随后,我们从因果视角提出了名为分离式上下文调整(DCA)的理论解决方案,区分不变的因果相关性与易变的伪相关性,并对ST上下文的影响进行去混杂。在此基础上,我们设计了时空自监督去混杂(STEVE)框架:先将交通数据编码为两个解耦的表示,分别关联不变与易变的ST上下文;再从时间、空间和语义三个概念各异的视角选取代表性ST上下文作为自监督信号,将上下文信息注入两种表示。如此,我们提升了所学的面向上下文的表示在OOD ST交通预测上的泛化能力。在四个大规模基准数据集上的综合实验表明,STEVE在各种ST OOD场景下持续优于最先进的基线方法。

Towards a Gateway for Knowledge Graph Schemas Collection, Analysis, and Embedding

  • paper_url: http://arxiv.org/abs/2311.12465
  • repo_url: None
  • paper_authors: Mattia Fumagalli, Marco Boffo, Daqian Shi, Mayukh Bagchi, Fausto Giunchiglia
  • for: 本文旨在利用现有知识图目录中汇集的数据,为统计模型的训练提供支持。
  • methods: 本文构建了一个汇聚现有知识图目录的网关(gateway),以便对这些关系数据进行查询、分析和可视化。
  • results: 本文给出了 LiveSchema 计划的首个版本,它将现有知识图目录集成到一个网关中,并提供针对这些数据的查询、转换和分析服务。
    Abstract One of the significant barriers to the training of statistical models on knowledge graphs is the difficulty that scientists have in finding the best input data to address their prediction goal. In addition to this, a key challenge is to determine how to manipulate these relational data, which are often in the form of particular triples (i.e., subject, predicate, object), to enable the learning process. Currently, many high-quality catalogs of knowledge graphs, are available. However, their primary goal is the re-usability of these resources, and their interconnection, in the context of the Semantic Web. This paper describes the LiveSchema initiative, namely, a first version of a gateway that has the main scope of leveraging the gold mine of data collected by many existing catalogs collecting relational data like ontologies and knowledge graphs. At the current state, LiveSchema contains - 1000 datasets from 4 main sources and offers some key facilities, which allow to: i) evolving LiveSchema, by aggregating other source catalogs and repositories as input sources; ii) querying all the collected resources; iii) transforming each given dataset into formal concept analysis matrices that enable analysis and visualization services; iv) generating models and tensors from each given dataset.
    摘要 在知识图上训练统计模型的一个重要障碍是,研究者难以找到最适合其预测目标的输入数据。此外,一个关键挑战在于如何处理这些通常以三元组(主语、谓语、宾语)形式存在的关系数据,以支撑学习过程。目前已有许多高质量的知识图目录,但它们的主要目标是在语义网背景下实现资源的复用与互联。本文介绍了 LiveSchema 计划,即一个网关的首个版本,其主要目的是充分利用众多现有目录所汇集的本体、知识图等关系数据这座“金矿”。当前,LiveSchema 包含来自4个主要来源的约1000个数据集,并提供若干关键功能:i) 通过聚合其他源目录和仓库作为输入来源,使 LiveSchema 不断演化;ii) 查询所有已收集的资源;iii) 将每个数据集转换为形式概念分析(FCA)矩阵,以支持分析与可视化服务;iv) 从每个数据集生成模型与张量。
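
A toy sketch of step iii), turning relational triples into a binary formal-concept-analysis matrix. The triples and the object/attribute choice (entity types as rows, properties they use as columns) are illustrative assumptions, not LiveSchema's actual encoding.

```python
# From (subject, predicate, object) triples to a binary FCA matrix.
triples = [
    ("Person", "hasName", "str"),
    ("Person", "worksFor", "Organization"),
    ("Organization", "hasName", "str"),
]

rows = sorted({s for s, _, _ in triples})   # FCA "objects" (entity types)
cols = sorted({p for _, p, _ in triples})   # FCA "attributes" (properties)
used = {(s, p) for s, p, _ in triples}
matrix = [[int((r, c) in used) for c in cols] for r in rows]

for r, row in zip(rows, matrix):
    print(r, row)   # Organization [1, 0] / Person [1, 1]
```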

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

  • paper_url: http://arxiv.org/abs/2311.12454
  • repo_url: https://github.com/sh-lee-prml/hierspeechpp
  • paper_authors: Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee
  • for: This paper proposes a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC) tasks.
  • methods: The proposed method, called HierSpeech++, uses a hierarchical speech synthesis framework that significantly improves the robustness and expressiveness of the synthetic speech. It also adopts the text-to-vec framework to generate a self-supervised speech representation and an F0 representation based on text representations and prosody prompts.
  • results: The experimental results demonstrated that HierSpeech++ outperforms LLM-based and diffusion-based models in zero-shot speech synthesis tasks, and achieves human-level quality zero-shot speech synthesis.
    Abstract Large language models (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, they require a large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a high-efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.
    摘要 基于大型语言模型(LLM)的语音合成已广泛应用于零样本语音合成。然而,这类方法需要大规模数据,且与以往的自回归语音模型存在同样的局限,包括推理速度慢和鲁棒性不足。本文提出了 HierSpeech++,一种面向文本转语音(TTS)和语音转换(VC)的快速而强大的零样本语音合成器。我们验证了分层语音合成框架能够显著提升合成语音的鲁棒性和表现力。此外,即使在零样本语音合成场景下,我们也显著提升了合成语音的自然度和说话人相似度。对于文本转语音,我们采用 text-to-vec 框架,基于文本表示和韵律提示生成自监督语音表示和F0表示;随后,HierSpeech++ 根据生成的向量、F0 和语音提示合成语音。我们还引入了从 16 kHz 到 48 kHz 的高效语音超分辨率框架。实验结果表明,分层变分自编码器可以成为强大的零样本语音合成器,其表现超过了基于LLM和基于扩散的模型。此外,我们首次实现了人类水平质量的零样本语音合成。音频样例和源代码可在 https://github.com/sh-lee-prml/HierSpeechpp 获取。

Extracting Definienda in Mathematical Scholarly Articles with Transformers

  • paper_url: http://arxiv.org/abs/2311.12448
  • repo_url: https://github.com/sufianj/def_extraction
  • paper_authors: Shufan Jiang, Pierre Senellart
  • for: 本研究旨在从学术论文文本中自动识别数学定义中的被定义项(definienda)。
  • methods: 本研究将该问题分别建模为:使用微调的预训练Transformer完成词元级(token-level)分类任务,以及使用通用大型语言模型(GPT)完成问答任务。
  • results: 实验结果显示,无论是使用最新(但昂贵)的GPT-4,还是在本任务上微调的较简单的预训练模型,都能达到较高的精确率与召回率。
    Abstract We consider automatically identifying the defined term within a mathematical definition from the text of an academic article. Inspired by the development of transformer-based natural language processing applications, we pose the problem as (a) a token-level classification task using fine-tuned pre-trained transformers; and (b) a question-answering task using a generalist large language model (GPT). We also propose a rule-based approach to build a labeled dataset from the LATEX source of papers. Experimental results show that it is possible to reach high levels of precision and recall using either recent (and expensive) GPT 4 or simpler pre-trained models fine-tuned on our task.
    摘要 我们考虑从学术论文文本中自动识别数学定义中的被定义项。受基于Transformer的自然语言处理应用发展的启发,我们将该问题建模为:(a) 使用微调的预训练Transformer完成词元级分类任务;(b) 使用通用大型语言模型(GPT)完成问答任务。我们还提出了一种基于规则的方法,从论文的LaTeX源码中构建标注数据集。实验结果表明,无论是使用最新(且昂贵)的GPT-4,还是在我们任务上微调的较简单的预训练模型,都能达到较高的精确率与召回率。
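
A minimal sketch of the token-level formulation with a fine-tuned pretrained transformer. The BIO label scheme and the base checkpoint are assumptions of this sketch; before fine-tuning on the LaTeX-derived dataset, the predictions are of course meaningless.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DEF", "I-DEF"]   # assumed BIO scheme for definienda
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

sent = "A group G is called abelian if its operation is commutative."
enc = tok(sent, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]   # one label id per subword

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
print(list(zip(tokens, (labels[i] for i in pred.tolist()))))
```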

Designing Long-term Group Fair Policies in Dynamical Systems

  • paper_url: http://arxiv.org/abs/2311.12447
  • repo_url: None
  • paper_authors: Miriam Rateike, Isabel Valera, Patrick Forré
  • for: 这个论文旨在提出一种新的框架,以实现在动态系统中长期保证群体公正。
  • methods: 该框架将系统动态建模为时齐(time-homogeneous)马尔可夫链,并利用马尔可夫链收敛定理确保唯一收敛。
  • results: 该框架能够识别一种与时间无关的策略,使系统在长期内收敛到目标公平稳态,而与初始数据分布无关。它还支持评估不同的长期目标,并分析这些目标在长期内对按群体划分的人群分布的影响。
    Abstract Neglecting the effect that decisions have on individuals (and thus, on the underlying data distribution) when designing algorithmic decision-making policies may increase inequalities and unfairness in the long term - even if fairness considerations were taken in the policy design process. In this paper, we propose a novel framework for achieving long-term group fairness in dynamical systems, in which current decisions may affect an individual's features in the next step, and thus, future decisions. Specifically, our framework allows us to identify a time-independent policy that converges, if deployed, to the targeted fair stationary state of the system in the long term, independently of the initial data distribution. We model the system dynamics with a time-homogeneous Markov chain and optimize the policy leveraging the Markov chain convergence theorem to ensure unique convergence. We provide examples of different targeted fair states of the system, encompassing a range of long-term goals for society and policymakers. Furthermore, we show how our approach facilitates the evaluation of different long-term targets by examining their impact on the group-conditional population distribution in the long term and how it evolves until convergence.
    摘要 在设计算法决策策略时,若忽略决策对个体(进而对底层数据分布)的影响,即使在策略设计过程中考虑了公平性,长期来看仍可能加剧不平等和不公。在本文中,我们提出了一种在动态系统中实现长期群体公平的新框架:当前决策可能影响个体在下一步的特征,进而影响未来的决策。具体而言,该框架可以识别一种与时间无关的策略,一经部署,便能使系统在长期内收敛到目标公平稳态,而与初始数据分布无关。我们用时齐马尔可夫链对系统动态建模,并利用马尔可夫链收敛定理优化策略,以确保唯一收敛。我们给出了系统的多种目标公平状态的示例,涵盖社会和政策制定者的一系列长期目标。此外,我们还展示了该方法如何通过考察不同长期目标对按群体划分的人群分布的长期影响及其收敛过程,来帮助评估这些目标。
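
A small numerical illustration of the convergence guarantee: under a fixed policy inducing a time-homogeneous transition matrix (the matrix below is made up), every initial distribution reaches the same stationary state.

```python
import numpy as np

# Hypothetical policy-induced transition matrix over individual states.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# Stationary distribution pi: the left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

# Any initial population distribution converges to pi under repeated steps,
# which is exactly the "independent of the initial data distribution" claim.
mu0 = np.array([1.0, 0.0, 0.0])
print(np.allclose(mu0 @ np.linalg.matrix_power(P, 200), pi))  # True
```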

Knowledge Base Enabled Semantic Communication: A Generative Perspective

  • paper_url: http://arxiv.org/abs/2311.12443
  • repo_url: None
  • paper_authors: Jinke Ren, Zezhong Zhang, Jie Xu, Guanying Chen, Yaping Sun, Ping Zhang, Shuguang Cui
    for: This paper aims to explore the use of semantic knowledge base (KB) to improve the efficiency of semantic communication in 6G wireless networks.methods: The paper proposes a generative semantic communication architecture that utilizes three sub-KBs: source, task, and channel KBs. The construction approaches for each sub-KB are also presented, along with their use in semantic coding and transmission.results: The paper demonstrates the superiority of generative semantic communication over conventional syntactic communication and classical semantic communication through a case study. The results show that generative semantic communication can significantly enhance the communication efficiency while maintaining the desired meaning of the source messages.
    Abstract Semantic communication is widely touted as a key technology for propelling the sixth-generation (6G) wireless networks. However, providing effective semantic representation is quite challenging in practice. To address this issue, this article takes a crack at exploiting semantic knowledge base (KB) to usher in a new era of generative semantic communication. Via semantic KB, source messages can be characterized in low-dimensional subspaces without compromising their desired meaning, thus significantly enhancing the communication efficiency. The fundamental principle of semantic KB is first introduced, and a generative semantic communication architecture is developed by presenting three sub-KBs, namely source, task, and channel KBs. Then, the detailed construction approaches for each sub-KB are described, followed by their utilization in terms of semantic coding and transmission. A case study is also provided to showcase the superiority of generative semantic communication over conventional syntactic communication and classical semantic communication. In a nutshell, this article establishes a scientific foundation for the exciting uncharted frontier of generative semantic communication.
    摘要 语义通信被广泛视为推动第六代(6G)无线网络的关键技术。然而,在实践中提供有效的语义表示颇具挑战。为此,本文尝试利用语义知识库(KB)开启生成式语义通信的新时代。借助语义知识库,信源消息可以在低维子空间中得到表征而不损失其预期含义,从而显著提升通信效率。文章首先介绍了语义知识库的基本原理,随后通过信源、任务和信道三个子知识库构建了生成式语义通信架构;接着阐述了各子知识库的具体构建方法及其在语义编码与传输中的应用;并通过案例研究展示了生成式语义通信相较传统语法通信和经典语义通信的优越性。总之,本文为生成式语义通信这一激动人心的未知前沿奠定了科学基础。

Fair Enough? A map of the current limitations of the requirements to have “fair” algorithms

  • paper_url: http://arxiv.org/abs/2311.12435
  • repo_url: None
  • paper_authors: Alessandro Castelnovo, Nicole Inverardi, Gabriele Nanino, Ilaria Giuseppina Penco, Daniele Regoli
  • for: The paper focuses on the issue of bias and unfairness in automated decision-making systems, and the need for a more nuanced understanding of what “fairness” means in real-world scenarios.
  • methods: The paper highlights the existing research on assessing and mitigating bias in AI systems, but argues that these efforts are insufficient without a broader societal understanding of what fairness means in practice.
  • results: The paper identifies a list of fundamental ambiguities and attention points that must be addressed in order to give concrete meaning to the demand for fairness in AI systems.
    Abstract In the recent years, the raise in the usage and efficiency of Artificial Intelligence and, more in general, of Automated Decision-Making systems has brought with it an increasing and welcome awareness of the risks associated with such systems. One of such risks is that of perpetuating or even amplifying bias and unjust disparities present in the data from which many of these systems learn to adjust and optimise their decisions. This awareness has on one side encouraged several scientific communities to come up with more and more appropriate ways and methods to assess, quantify, and possibly mitigate such biases and disparities. On the other hand, it has prompted more and more layers of society, including policy makers, to call for ``fair'' algorithms. We believe that while a lot of excellent and multidisciplinary research is currently being conducted, what is still fundamentally missing is the awareness that having ``fair'' algorithms is per s\'e a nearly meaningless requirement, that needs to be complemented with a lot of additional societal choices to become actionable. Namely, there is a hiatus between what the society is demanding from Automated Decision-Making systems, and what this demand actually means in real-world scenarios. In this work, we outline the key features of such a hiatus, and pinpoint a list of fundamental ambiguities and attention points that we as a society must address in order to give a concrete meaning to the increasing demand of fairness in Automated Decision-Making systems.
    摘要 近年来,人工智能乃至更广义的自动决策系统在使用范围和效率上的提升,带来了对这类系统相关风险的日益增长且值得欢迎的关注。其中一种风险是,这些系统从数据中学习并优化其决策,可能延续甚至放大数据中已有的偏见和不公。这种关注一方面促使多个科学社区提出越来越合适的方法来评估、量化并尽可能缓解这些偏见与不公;另一方面,也促使包括政策制定者在内的越来越多的社会群体呼吁“公平”的算法。我们认为,尽管目前正在开展大量优秀的多学科研究,但根本上仍然缺失的一点是:仅要求算法“公平”本身几乎没有意义,必须辅以大量额外的社会选择才能变得可操作。换言之,社会对自动决策系统的诉求与这种诉求在现实场景中的实际含义之间存在断层。在本文中,我们勾勒了这一断层的关键特征,并列出了一系列根本性的歧义与关注点;只有解决这些问题,社会才能赋予对自动决策系统日益增长的公平诉求以具体含义。

A recurrent connectionist model of melody perception : An exploration using TRACX2

  • paper_url: http://arxiv.org/abs/2311.12431
  • repo_url: None
  • paper_authors: Daniel Defays, Robert French, Barbara Tillmann
  • for: 探究语音切分、序列图像处理与音乐加工是否采用相同甚至完全一致的机制。
  • methods: 使用TRACX2模型——一种基于识别的递归连接主义自编码网络,该模型此前已成功模拟语音和序列图像处理。
  • results: TRACX2 在旋律简单的法国儿童歌曲的音程序列上训练后,其内部表示聚类出人类可识别的旋律类别。
    Abstract Are similar, or even identical, mechanisms used in the computational modeling of speech segmentation, serial image processing and music processing? We address this question by exploring how TRACX2, (French et al., 2011; French \& Cottrell, 2014; Mareschal \& French, 2017), a recognition-based, recursive connectionist autoencoder model of chunking and sequence segmentation, which has successfully simulated speech and serial-image processing, might be applied to elementary melody perception. The model, a three-layer autoencoder that recognizes ''chunks'' of short sequences of intervals that have been frequently encountered on input, is trained on the tone intervals of melodically simple French children's songs. It dynamically incorporates the internal representations of these chunks into new input. Its internal representations cluster in a manner that is consistent with ''human-recognizable'' melodic categories. TRACX2 is sensitive to both contour and proximity information in the musical chunks that it encounters in its input. It shows the ''end-of-word'' superiority effect demonstrated by Saffran et al. (1999) for short musical phrases. The overall findings suggest that the recursive autoassociative chunking mechanism, as implemented in TRACX2, may be a general segmentation and chunking mechanism, underlying not only word-and imagechunking, but also elementary melody processing.
    摘要 语音切分、序列图像处理和音乐加工的计算建模是否使用了相似甚至完全相同的机制?我们通过探讨 TRACX2(French et al., 2011;French & Cottrell, 2014;Mareschal & French, 2017)——一种已成功模拟语音和序列图像处理、基于识别的递归连接主义自编码模型——能否应用于基本旋律感知来回答这一问题。该模型是一个三层自编码器,能够识别输入中频繁出现的短音程序列“组块”,并动态地将这些组块的内部表示并入新的输入。我们在旋律简单的法国儿童歌曲的音程上训练该模型,其内部表示的聚类方式与“人类可识别”的旋律类别一致。TRACX2 对其输入音乐组块中的旋律轮廓和邻近信息都很敏感,并在短乐句上再现了 Saffran 等人(1999)所展示的“词尾”优势效应。总体结果表明,TRACX2 所实现的递归自联想组块机制可能是一种通用的切分与组块机制,不仅支撑词语和图像的组块,也支撑基本的旋律加工。
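
A toy sketch of the TRACX2-style recursion: encode a pair of items, and if reconstruction error is low enough, treat the hidden code as a recognised chunk that replaces the item on the next step. Dimensions and the threshold are illustrative, not the published parameters.

```python
import torch
import torch.nn as nn

DIM = 16
enc = nn.Linear(2 * DIM, DIM)   # encoder: item pair -> hidden "chunk" code
dec = nn.Linear(DIM, 2 * DIM)   # decoder: hidden code -> reconstructed pair

def step(left, right, threshold=0.1):
    pair = torch.cat([left, right])
    hidden = torch.tanh(enc(pair))
    err = ((dec(hidden) - pair) ** 2).mean().item()
    # Low error means the pair is a familiar chunk: its internal
    # representation is passed on as the left input of the next step.
    new_left = hidden.detach() if err < threshold else right
    return new_left, err

x1, x2 = torch.randn(DIM), torch.randn(DIM)  # e.g. two tone-interval codes
new_left, err = step(x1, x2)
```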

How Far Have We Gone in Vulnerability Detection Using Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12420
  • repo_url: None
  • paper_authors: Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, Chao Zhang
  • for: 本研究的目的是探讨大型语言模型(LLM)在漏洞检测中的潜力。
  • methods: 本研究对16个LLM以及6种最先进的深度学习模型和静态分析器进行实验,以评估LLM在漏洞检测中的表现。
  • results: 实验结果显示,若干LLM在漏洞检测中的表现超过了传统深度学习方法,这表明LLM在软件安全方面存在尚未开发的潜力。
    Abstract As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of Large Language Models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.
    摘要 随着软件日益复杂且易受漏洞影响,自动化漏洞检测变得至关重要,但也极具挑战。鉴于大型语言模型(LLM)在各类任务中取得的显著成功,人们对其在漏洞检测中的效果寄予厚望。然而,目前仍缺乏对其漏洞检测潜力的定量认识。为弥补这一空白,我们提出了一个全面的漏洞基准 VulBench。该基准汇聚了来自大量 CTF(夺旗赛)挑战和真实应用的高质量数据,并为每个存在漏洞的函数标注了漏洞类型及其根本原因。通过涵盖16个LLM以及6个最先进的基于深度学习的模型和静态分析器的实验,我们发现若干LLM在漏洞检测上优于传统深度学习方法,揭示了LLM尚未开发的潜力。这项工作有助于理解并利用LLM来增强软件安全。

A Safer Vision-based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs

  • paper_url: http://arxiv.org/abs/2311.12893
  • repo_url: None
  • paper_authors: Jiageng Zhong, Ming Li, Yinliang Chen, Zihang Wei, Fan Yang, Haoran Shen
  • for: 这篇论文旨在提出一种基于视觉的自主规划系统,以实现四旋翼无人机(UAV)高效且可靠的自主飞行。
  • methods: 该系统采用轻量级目标检测算法识别动态障碍物,再利用卡尔曼滤波跟踪并估计其运动状态。在规划阶段,不仅考虑静态障碍物,还考虑动态障碍物的潜在运动。轨迹生成采用基于B样条的轨迹搜索算法,并结合多种约束进一步优化,以提升安全性并契合无人机的运动特性。
  • results: 我们在仿真与真实环境中进行了实验,结果表明该方法能够实时检测并规避动态环境中的障碍物,与现有方法相比具有更高的可靠性。此外,鉴于自然语言处理(NLP)技术展现出卓越的零样本泛化能力,更友好的人机交互已成为可能,本研究还探讨了自主规划系统与大型语言模型(LLM)的集成。
    Abstract For intelligent quadcopter UAVs, a robust and reliable autonomous planning system is crucial. Most current trajectory planning methods for UAVs are suitable for static environments but struggle to handle dynamic obstacles, which can pose challenges and even dangers to flight. To address this issue, this paper proposes a vision-based planning system that combines tracking and trajectory prediction of dynamic obstacles to achieve efficient and reliable autonomous flight. We use a lightweight object detection algorithm to identify dynamic obstacles and then use Kalman Filtering to track and estimate their motion states. During the planning phase, we not only consider static obstacles but also account for the potential movements of dynamic obstacles. For trajectory generation, we use a B-spline-based trajectory search algorithm, which is further optimized with various constraints to enhance safety and alignment with the UAV's motion characteristics. We conduct experiments in both simulation and real-world environments, and the results indicate that our approach can successfully detect and avoid obstacles in dynamic environments in real-time, offering greater reliability compared to existing approaches. Furthermore, with the advancements in Natural Language Processing (NLP) technology demonstrating exceptional zero-shot generalization capabilities, more user-friendly human-machine interactions have become feasible, and this study also explores the integration of autonomous planning systems with Large Language Models (LLMs).
    摘要 更进一步,随着自然语言处理(NLP)技术的发展,我们可以通过与大型语言模型(LLM)的集成,实现更加人性化的人机交互。这种方法可以让用户通过自然语言输入指令,实现对UAV的自主规划和控制。在这种情况下,我们可以通过NLP技术来解析和理解用户的输入,并根据这些输入生成合适的轨迹规划方案。在实验阶段,我们在模拟环境和实际环境中进行了详细的测试和评估。结果表明,我们的方法可以实时检测和避免动态环境中的障碍物,并且比现有方法更加可靠。此外,我们还发现,通过与LLM的集成,我们的方法可以在不同的语言环境下进行自主规划和控制,提高了UAV的可靠性和安全性。
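
A standard constant-velocity Kalman filter of the kind used to track a detected obstacle's motion state; the time step and noise covariances below are illustrative, not the paper's tuning.

```python
import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0],   # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
H = np.array([[1, 0, 0, 0],    # the detector observes position only
              [0, 1, 0, 0]])
Q = np.eye(4) * 1e-3           # process noise covariance
R = np.eye(2) * 1e-2           # measurement noise covariance

def kf_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q                 # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Kalman gain
    x = x + K @ (z - H @ x)                       # update with detection z
    P = (np.eye(4) - K @ H) @ P
    return x, P                                   # position + velocity estimate

x, P = np.zeros(4), np.eye(4)
x, P = kf_step(x, P, np.array([1.0, 2.0]))  # feed each new detection in turn
```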

nach0: Multimodal Natural and Chemical Languages Foundation Model

  • paper_url: http://arxiv.org/abs/2311.12410
  • repo_url: None
  • paper_authors: Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alex Zhavoronkov
  • for: The paper introduces a new foundation model called nach0, which can solve various chemical and biological tasks such as biomedical question answering, named entity recognition, molecular generation, and others.
  • methods: The model is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings, with fine-tuning using specific task-related instructions.
  • results: The model outperforms state-of-the-art baselines on single-domain and cross-domain tasks, and can generate high-quality outputs in molecular and textual formats, demonstrating its effectiveness in multi-domain setups.
    Abstract Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
    摘要 大型语言模型(LLM)已在多个领域有力推动了科学进步,许多论文展示了其以创新方案解决复杂问题的能力。我们的论文介绍了一个新的基础模型 nach0,可以解决多种化学和生物任务:生物医学问答、命名实体识别、分子生成、分子合成、属性预测等。nach0 是一个多领域、多任务的编码器-解码器LLM,通过在科学文献、专利和分子字符串的无标签文本上预训练,融汇了化学与语言知识。我们采用指令微调,利用特定任务相关的指令对 nach0 进行微调,以适配最终的任务集合。为了高效训练 nach0,我们借助 NeMo 框架,对基础版与大型版模型进行高效的并行优化。大量实验表明,我们的模型在单领域和跨领域任务上均优于最先进的基线,并能以分子和文本两种格式生成高质量输出,展示了其在多领域设置中的有效性。

Infinite forecast combinations based on Dirichlet process

  • paper_url: http://arxiv.org/abs/2311.12379
  • repo_url: None
  • paper_authors: Yinuo Ren, Feng Li, Yanfei Kang, Jue Wang
  • for: This paper proposes a deep learning ensemble forecasting model based on the Dirichlet process to integrate information from multiple sources and improve prediction accuracy.
  • methods: The method uses a deep learning sub-model pool, weight adjustment, and diversity strategies during the combination process. It also utilizes a decaying strategy to tackle the challenge of stochastic gradient descent in determining the optimal learning rate.
  • results: The proposed ensemble model demonstrates substantial improvements in prediction accuracy and stability compared to a single benchmark model, as demonstrated through an empirical analysis using the weekly dataset from the M4 competition.
    Abstract Forecast combination integrates information from various sources by consolidating multiple forecast results from the target time series. Instead of the need to select a single optimal forecasting model, this paper introduces a deep learning ensemble forecasting model based on the Dirichlet process. Initially, the learning rate is sampled with three basis distributions as hyperparameters to convert the infinite mixture into a finite one. All checkpoints are collected to establish a deep learning sub-model pool, and weight adjustment and diversity strategies are developed during the combination process. The main advantage of this method is its ability to generate the required base learners through a single training process, utilizing the decaying strategy to tackle the challenge posed by the stochastic nature of gradient descent in determining the optimal learning rate. To ensure the method's generalizability and competitiveness, this paper conducts an empirical analysis using the weekly dataset from the M4 competition and explores sensitivity to the number of models to be combined. The results demonstrate that the ensemble model proposed offers substantial improvements in prediction accuracy and stability compared to a single benchmark model.
    摘要 预测组合通过整合目标时间序列的多个预测结果,将来自不同来源的信息融为一体。本文提出了一种基于 Dirichlet 过程的深度学习集成预测模型,从而无需挑选单一的最优预测模型。首先,以三个基础分布作为超参数对学习率进行采样,将无限混合转化为有限混合;然后收集所有检查点以建立深度学习子模型池,并在组合过程中引入权重调整与多样性策略。该方法的主要优点在于:仅需一次训练即可生成所需的基学习器,并利用衰减策略应对随机梯度下降在确定最优学习率上的困难。为检验方法的泛化能力与竞争力,本文基于 M4 竞赛的周频数据集进行了实证分析,并探讨了待组合模型数量的敏感性。结果表明,所提出的集成模型相较单一基准模型在预测精度和稳定性方面均有显著提升。
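
For intuition, here is the stick-breaking construction of Dirichlet-process weights truncated to finitely many components — the mechanism that lets an "infinite" combination be realised with a finite sub-model pool. The concentration parameter and truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 1.0, 20                     # concentration, truncation level

betas = rng.beta(1.0, alpha, size=K)   # v_k ~ Beta(1, alpha)
remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
weights = betas * remaining            # w_k = v_k * prod_{j<k} (1 - v_j)
weights /= weights.sum()               # renormalise after truncation

print(weights.round(3))                # combination weights over K sub-models
```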

Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

  • paper_url: http://arxiv.org/abs/2311.12889
  • repo_url: None
  • paper_authors: Bowen Jiang, Zhijun Zhuang, Camillo Jose Taylor
  • for: 这篇论文旨在提出一种基于关系层次结构和常识知识的场景图生成方法,以提高场景图生成模型的性能。
  • methods: 该方法使用一个贝叶斯分类头,利用信息丰富的层次结构,同时预测场景中两个物体之间关系的超类(类型),以及每个超类下的详细关系。
  • results: 实验结果表明,在 Visual Genome 和 OpenImage V6 数据集上,通过利用层次结构和常识知识,可以大幅提高场景图生成模型的性能。此外,该方法还可以作为现有场景图生成算法的可移植模块,以提高其结果。
    Abstract This work presents an enhanced approach to generating scene graphs by incorporating a relationship hierarchy and commonsense knowledge. Specifically, we propose a Bayesian classification head that exploits an informative hierarchical structure. It jointly predicts the super-category or type of relationship between the two objects, along with the detailed relationship under each super-category. We design a commonsense validation pipeline that uses a large language model to critique the results from the scene graph prediction system and then use that feedback to enhance the model performance. The system requires no external large language model assistance at test time, making it more convenient for practical applications. Experiments on the Visual Genome and the OpenImage V6 datasets demonstrate that harnessing hierarchical relationships enhances the model performance by a large margin. The proposed Bayesian head can also be incorporated as a portable module in existing scene graph generation algorithms to improve their results. In addition, the commonsense validation enables the model to generate an extensive set of reasonable predictions beyond dataset annotations.
    摘要 这项工作提出了一种改进的场景图生成方法,融入了关系层次结构和常识知识。具体而言,我们提出了一种利用信息丰富的层次结构的贝叶斯分类头:它同时预测两个物体之间关系的超类(类型),以及每个超类下的详细关系。我们设计了一条常识验证流水线,利用大型语言模型评判场景图预测系统的结果,并利用该反馈提升模型性能;系统在测试时无需外部大型语言模型的协助,更便于实际应用。在 Visual Genome 和 OpenImage V6 数据集上的实验表明,利用层次关系可大幅提升模型性能。所提出的贝叶斯分类头还可以作为可移植模块嵌入现有的场景图生成算法以改进其结果。此外,常识验证使模型能够在数据集标注之外生成大量合理的预测。
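
A sketch of a hierarchical classification head of this kind, factorising the predicate distribution as p(super-category) × p(detail | super-category). The feature size and per-super-category counts are made up, not the paper's actual split.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalRelHead(nn.Module):
    def __init__(self, feat_dim=512, supers=(5, 7, 4)):   # details per super-category
        super().__init__()
        self.super_fc = nn.Linear(feat_dim, len(supers))
        self.detail_fcs = nn.ModuleList(nn.Linear(feat_dim, n) for n in supers)

    def forward(self, feats):
        p_super = F.softmax(self.super_fc(feats), dim=-1)            # (B, S)
        # Joint probability per detailed predicate: p(super) * p(detail | super).
        p_detail = [F.softmax(fc(feats), dim=-1) * p_super[:, i:i + 1]
                    for i, fc in enumerate(self.detail_fcs)]
        return torch.cat(p_detail, dim=-1)   # distribution over all predicates

head = HierarchicalRelHead()
joint = head(torch.randn(2, 512))
assert torch.allclose(joint.sum(-1), torch.ones(2))  # a proper distribution
```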

Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs

  • paper_url: http://arxiv.org/abs/2311.12359
  • repo_url: None
  • paper_authors: Shivam Aggarwal, Alessandro Pappalardo, Hans Jakob Damsgaard, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra
  • for: 本研究旨在提出一种新的量化策略,在不引入额外训练开销的情况下降低神经网络的数值精度。
  • methods: 本研究在3至8位的权重与激活位宽范围内比较了minifloat与整数量化方案,并考察了权重均衡(weight equalization)、偏置校正(bias correction)、SmoothQuant、基于梯度的可学习舍入以及GPTQ等多种PTQ技术对minifloat的适用性。
  • results: 实验结果表明,在一系列精度-位宽权衡点上,低位宽minifloat可与其整数量化对应方案相媲美。此外,我们还基于FPGA硬件成本模型进行了评估,发现整数量化因其硬件资源占用相对较小,通常仍是帕累托最优的选择。
    Abstract Post-Training Quantization (PTQ) is a powerful technique for model compression, reducing the precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point quantization (FP8) in the context of PTQ for model inference. However, the exploration of floating-point formats smaller than 8 bits and their comparison with integer quantization remains relatively limited. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. Our work presents a novel PTQ design-space exploration, comparing minifloat and integer quantization schemes across a range of 3 to 8 bits for both weights and activations. We examine the applicability of various PTQ techniques to minifloats, including weight equalization, bias correction, SmoothQuant, gradient-based learned rounding, and the GPTQ method. Our experiments validate the effectiveness of low-precision minifloats when compared to their integer counterparts across a spectrum of accuracy-precision trade-offs on a set of reference deep learning vision workloads. Finally, we evaluate our results against an FPGA-based hardware cost model, showing that integer quantization often remains the Pareto-optimal option, given its relatively smaller hardware resource footprint.
    摘要 训练后量化(PTQ)是一种强大的模型压缩技术,可在不引入额外训练开销的情况下降低神经网络的数值精度。近期的研究已探讨在PTQ中采用8位浮点量化(FP8)进行模型推理,但对小于8位的浮点格式及其与整数量化的比较仍相对有限。在本工作中,我们提出了minifloat——一类低位宽浮点格式,能够进一步降低模型的内存占用、时延与能耗,同时逼近全精度模型的准确率。我们开展了一项新的PTQ设计空间探索,在3至8位的权重与激活位宽范围内比较minifloat与整数量化方案,并考察了权重均衡、偏置校正、SmoothQuant、基于梯度的可学习舍入以及GPTQ方法等多种PTQ技术对minifloat的适用性。在一组参考深度学习视觉任务上的实验验证了低位宽minifloat在一系列精度-位宽权衡点上相对其整数量化对应方案的有效性。最后,我们将结果对照一个基于FPGA的硬件成本模型进行评估,结果显示,鉴于整数量化的硬件资源占用相对较小,它通常仍是帕累托最优的选择。
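
A simplified sketch of snapping a tensor onto a minifloat grid (here E3M2-like: 3 exponent bits, 2 mantissa bits). It ignores subnormals, saturation, and NaNs, which a real PTQ flow like the one studied here would handle.

```python
import numpy as np

def minifloat_quantize(x, exp_bits=3, mant_bits=2):
    """Round values onto a reduced-precision floating-point grid (simplified)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = np.sign(x)
    mag = np.abs(x) + 1e-45                                  # avoid log2(0)
    e = np.clip(np.floor(np.log2(mag)), -bias + 1, bias)     # exponent range
    scale = 2.0 ** (e - mant_bits)                           # mantissa step size
    return sign * np.round(mag / scale) * scale

w = np.array([0.337, -1.92, 0.051])
print(minifloat_quantize(w))   # values snapped onto the minifloat grid
```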

Stable Diffusion For Aerial Object Detection

  • paper_url: http://arxiv.org/abs/2311.12345
  • repo_url: None
  • paper_authors: Yanan Jian, Fuxun Yu, Simranjit Singh, Dimitrios Stamoulis
  • for: 提升航拍目标检测的精度与效率,缓解大规模数据采集的局限
  • methods: 使用稳定扩散(SD)生成合成增强数据,并通过稀疏到稠密的感兴趣区域(ROI)提取、低秩适应(LoRA)微调,以及Copy-Paste方法将生成的目标与背景融合
  • results: 通过这种方法,可以在缓解大规模数据采集局限的同时,适应航拍目标的稀疏特性,提升航拍目标检测的精度与效率
    Abstract Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data.
    摘要 航拍目标检测是一项具有挑战性的任务,其主要障碍在于大规模数据采集的局限性以及某些类别的长尾分布。合成数据提供了一条有前景的解决途径,尤其是随着稳定扩散(SD)等基于扩散的方法的最新进展。然而,将扩散方法直接应用于航拍领域面临独特挑战:稳定扩散针对丰富的地面级语义进行优化,与航拍目标的稀疏特性并不匹配;且合成后目标坐标的提取仍然成问题。为应对这些挑战,我们提出了一个面向航拍图像的合成数据增强框架,包含三个部分:i) 稀疏到稠密的感兴趣区域(ROI)提取,以弥合语义差距;ii) 采用低秩适应(LoRA)微调扩散模型,避免代价高昂的完整重训练;iii) 通过Copy-Paste方法将合成目标与背景组合。该框架为基于合成数据的航拍目标检测提供了一种细致的方案。
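
A minimal sketch of the Copy-Paste composition step: paste synthesised object crops onto an aerial background and record the matching boxes. Real pipelines would additionally blend edges and check overlaps; sizes and colours here are placeholders.

```python
import random
from PIL import Image

def copy_paste(background: Image.Image, objects: list) -> tuple:
    """Paste object crops at random positions; return image + xyxy boxes."""
    bg = background.copy()
    boxes = []
    for obj in objects:
        x = random.randint(0, bg.width - obj.width)
        y = random.randint(0, bg.height - obj.height)
        bg.paste(obj, (x, y), obj if obj.mode == "RGBA" else None)  # use alpha if present
        boxes.append((x, y, x + obj.width, y + obj.height))         # detection label
    return bg, boxes

bg = Image.new("RGB", (512, 512), "gray")            # stand-in aerial background
plane = Image.new("RGBA", (40, 24), (255, 0, 0, 255))  # stand-in synthesised object
img, labels = copy_paste(bg, [plane, plane])
```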

A Survey on Large Language Models for Personalized and Explainable Recommendations

  • paper_url: http://arxiv.org/abs/2311.12338
  • repo_url: None
  • paper_authors: Junyi Chen
  • for: 这份调查旨在分析如何使用大型自然语言模型(LLM)来改进个性化和可靠的推荐系统。
  • methods: 调查描述了如何使用LLM来提高个性化和可靠的推荐,包括处理大量文本数据。
  • results: 调查强调了个性化解释生成(PEG)任务中的主要挑战,包括冷启动问题、不公正和偏见在推荐系统中。
    Abstract In recent years, Recommender Systems(RS) have witnessed a transformative shift with the advent of Large Language Models(LLMs) in the field of Natural Language Processing(NLP). These models such as OpenAI's GPT-3.5/4, Llama from Meta, have demonstrated unprecedented capabilities in understanding and generating human-like text. This has led to a paradigm shift in the realm of personalized and explainable recommendations, as LLMs offer a versatile toolset for processing vast amounts of textual data to enhance user experiences. To provide a comprehensive understanding of the existing LLM-based recommendation systems, this survey aims to analyze how RS can benefit from LLM-based methodologies. Furthermore, we describe major challenges in Personalized Explanation Generating(PEG) tasks, which are cold-start problems, unfairness and bias problems in RS.
    摘要 近年来,随着自然语言处理(NLP)领域中大型语言模型(LLM)的出现,推荐系统(RS)经历了一场变革。OpenAI 的 GPT-3.5/4、Meta 的 Llama 等模型在理解和生成类人文本方面展现出前所未有的能力,引发了个性化与可解释推荐范式的转变:LLM 为处理海量文本数据、提升用户体验提供了一套灵活的工具。为了全面梳理现有的基于LLM的推荐系统,本综述分析了 RS 如何从基于LLM的方法中获益。此外,我们还阐述了个性化解释生成(PEG)任务中的主要挑战,包括RS中的冷启动、不公平与偏见问题。

Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?

  • paper_url: http://arxiv.org/abs/2311.12337
  • repo_url: None
  • paper_authors: Tim Hartill, Joshua Bensemann, Michael Witbrock, Patricia J. Riddle
  • for: 这篇论文旨在判断语言模型回答带上下文的问题时,究竟依靠的是记忆还是泛化。
  • methods: 论文基于训练样本与评估样本之间输入词元和标签词元的语义相似度,识别出模型极不可能靠记忆作答的评估样本(即“不可记忆”子集)。
  • results: 研究发现,在加入简单数值推理训练数据后,模型在预期受益的评估数据集的不可记忆子集上性能显著提升:DROP和ROPES上分别提升9.0%和25.7%,而其他评估数据集无显著变化。
    Abstract A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus an ability to predict the label via some method of generalisation. In the context of using Language Models for question-answering, discussion continues to occur as to the extent to which questions are answered through memorisation. We consider this issue for questions that would ideally be answered through reasoning over an associated context. We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers. Our method is based on semantic similarity of input tokens and label tokens between training and evaluation samples. We show that our method offers advantages upon some prior approaches in that it is able to surface evaluation-train pairs that have overlap in either contiguous or discontiguous sequences of tokens. We use this method to identify unmemorisable subsets of our evaluation datasets. We train two Language Models in a multitask fashion whereby the second model differs from the first only in that it has two additional datasets added to the training regime that are designed to impart simple numerical reasoning strategies of a sort known to improve performance on some of our evaluation datasets but not on others. We then show that there is performance improvement between the two models on the unmemorisable subsets of the evaluation datasets that were expected to benefit from the additional training datasets. Specifically, performance on unmemorisable subsets of two of our evaluation datasets, DROP and ROPES significantly improves by 9.0%, and 25.7% respectively while other evaluation datasets have no significant change in performance.
    摘要 人们通常区分两种能力:模型直接记忆与评估样本高度相似的训练样本从而预测其标签,与模型通过某种泛化方式预测标签。在使用语言模型进行问答的背景下,问题究竟在多大程度上靠记忆作答仍存在争议。我们针对理想情况下应通过对相关上下文进行推理来回答的问题考察了这一议题。我们提出了一种方法,用于识别模型极不可能记住答案的评估样本。该方法基于训练样本与评估样本之间输入词元和标签词元的语义相似度,其相较以往一些方法的优势在于,能够发现词元序列在连续或非连续位置上存在重叠的评估-训练样本对。我们利用该方法识别出评估数据集中的不可记忆子集。我们以多任务方式训练了两个语言模型,二者的唯一区别在于:第二个模型的训练中额外加入了两个数据集,用于传授已知能提升部分评估数据集(而非全部)表现的简单数值推理策略。结果显示,在预期受益于额外训练数据的评估数据集的不可记忆子集上,两个模型之间存在性能差异:DROP和ROPES的不可记忆子集上性能分别显著提升9.0%和25.7%,而其他评估数据集无显著变化。
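
A simplified stand-in for the detection idea using sentence embeddings (the paper measures token-level similarity of input and label tokens between training and evaluation samples; the model name and threshold below are assumptions of this sketch).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

train = ["The bridge is 120 m long. How long is the bridge? 120 m"]
evals = ["The bridge spans 120 metres. What is its length? 120 m",
         "Alice has 3 apples and buys 2 more. How many? 5"]

# Cosine similarity of each evaluation sample against every training sample;
# samples above the threshold are flagged as potentially memorisable.
sim = util.cos_sim(model.encode(evals), model.encode(train))   # (n_eval, n_train)
memorisable = sim.max(dim=1).values > 0.8
print(memorisable.tolist())   # e.g. [True, False] -- second sample is "unmemorisable"
```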

Classification of Instagram fake users using supervised machine learning algorithms

  • paper_url: http://arxiv.org/abs/2311.12336
  • repo_url: None
  • paper_authors: Vertika Singh, Naman Tolasaria, Patel Meet Alpeshkumar, Shreyash Bartwal
  • for: 这篇论文提出了一款用于检测并清除社交媒体上虚假账户与网络冒充行为的应用程序,以保护公司免受潜在欺诈风险。
  • methods: 该应用程序采用以用户为中心的设计,便于与现有调查流程集成,并为调查机构(尤其是刑侦部门)提供易用的界面。
  • results: 该应用程序有助于检测并清除社交媒体上的虚假账户与网络冒充行为,从而帮助公司降低欺诈风险。
    Abstract In the contemporary era, online social networks have become integral to social life, revolutionizing the way individuals manage their social connections. While enhancing accessibility and immediacy, these networks have concurrently given rise to challenges, notably the proliferation of fraudulent profiles and online impersonation. This paper proposes an application designed to detect and neutralize such dishonest entities, with a focus on safeguarding companies from potential fraud. The user-centric design of the application ensures accessibility for investigative agencies, particularly the criminal branch, facilitating navigation of complex social media landscapes and integration with existing investigative procedures
    摘要 在当代,在线社交网络已成为社会生活不可或缺的组成部分,深刻改变了人们管理社交关系的方式。这些网络在提升可及性与即时性的同时,也带来了诸多挑战,尤以虚假账户泛滥和网络冒充为甚。本文提出了一款旨在检测并清除此类不诚实实体的应用程序,重点在于保护公司免受潜在欺诈。该应用程序以用户为中心的设计确保了调查机构(尤其是刑侦部门)的可用性,便于其穿行于复杂的社交媒体环境,并与现有调查流程相衔接。
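
An illustrative supervised pipeline of the kind the title describes. The feature schema (followers, following, posts, has_profile_pic, bio_length) and the toy rows are assumptions, since the exact feature set is not given here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = [[10, 900, 1, 0, 0],      # few followers, mass-following, empty bio -> fake
     [340, 300, 85, 1, 60],   # balanced profile with posts and a bio -> genuine
     [5, 1200, 0, 0, 0],
     [150, 180, 40, 1, 45]]
y = [1, 0, 1, 0]              # 1 = fake, 0 = genuine

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.predict(X_te))      # predicted labels for the held-out profiles
```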

Quantum-Enhanced Support Vector Machine for Large-Scale Stellar Classification with GPU Acceleration

  • paper_url: http://arxiv.org/abs/2311.12328
  • repo_url: None
  • paper_authors: Kuan-Cheng Chen, Xiaotian Xu, Henry Makhanov, Hui-Hsuan Chung, Chen-Yu Liu
  • for: 这项研究旨在开发一种基于量子计算和GPU加速的量子增强支持向量机(QSVM)方法,用于stellar classification。
  • methods: 该方法利用量子计算原理和GPU加速,对复杂的 binary 和多类stellar classification系统进行处理,并显著超过了传统方法(如K-Nearest Neighbors和Logistic Regression)。
  • results: 研究发现,量子增强支持向量机方法能够显著提高分类精度,特别是在处理复杂的stellar classification问题时。此外,量子计算和GPU加速的结合使得计算效率得到了提高,并且可以处理大量数据。这些发现示示了量子机器学习在天体物理研究中的潜在应用前景。
    Abstract In this study, we introduce an innovative Quantum-enhanced Support Vector Machine (QSVM) approach for stellar classification, leveraging the power of quantum computing and GPU acceleration. Our QSVM algorithm significantly surpasses traditional methods such as K-Nearest Neighbors (KNN) and Logistic Regression (LR), particularly in handling complex binary and multi-class scenarios within the Harvard stellar classification system. The integration of quantum principles notably enhances classification accuracy, while GPU acceleration using the cuQuantum SDK ensures computational efficiency and scalability for large datasets in quantum simulators. This synergy not only accelerates the processing process but also improves the accuracy of classifying diverse stellar types, setting a new benchmark in astronomical data analysis. Our findings underscore the transformative potential of quantum machine learning in astronomical research, marking a significant leap forward in both precision and processing speed for stellar classification. This advancement has broader implications for astrophysical and related scientific fields
    摘要 在这项研究中,我们提出了一种创新的量子增强支持向量机(QSVM)方法用于恒星分类,充分利用量子计算与GPU加速的优势。我们的QSVM算法显著超越了K近邻(KNN)和逻辑回归(LR)等传统方法,尤其是在哈佛恒星分类体系下处理复杂的二分类与多分类场景时。量子原理的引入显著提升了分类准确率,而借助 cuQuantum SDK 的GPU加速则确保了量子模拟器处理大规模数据集时的计算效率与可扩展性。这种协同不仅加快了处理过程,还提高了对多种恒星类型的分类准确率,为天文数据分析树立了新的基准。我们的发现凸显了量子机器学习在天文研究中的变革潜力,标志着恒星分类在精度与处理速度上的重大跃升。这一进展对天体物理及相关科学领域具有更广泛的意义。
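
Structurally, the QSVM pattern reduces to an SVM over a precomputed kernel matrix. In the sketch below a classical RBF kernel stands in for the quantum fidelity kernel that a (e.g. cuQuantum-accelerated) simulator would supply; the data and gamma are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def kernel_matrix(A, B, gamma=0.5):
    # Placeholder for K_ij = |<phi(x_i)|phi(x_j)>|^2 from a quantum feature map;
    # here an ordinary RBF kernel is used instead.
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 4)), rng.integers(0, 2, 40)
X_test = rng.normal(size=(10, 4))

clf = SVC(kernel="precomputed").fit(kernel_matrix(X_train, X_train), y_train)
pred = clf.predict(kernel_matrix(X_test, X_train))  # test-vs-train kernel rows
```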

A Survey on Multimodal Large Language Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.12320
  • repo_url: https://github.com/irohxu/awesome-multimodal-llm-autonomous-driving
  • paper_authors: Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, Tianren Gao, Erlong Li, Kun Tang, Zhipeng Cao, Tong Zhou, Ao Liu, Xinrui Yan, Shuqi Mei, Jianguo Cao, Ziran Wang, Chao Zheng
  • for: This paper aims to provide a comprehensive understanding of the challenges, opportunities, and future endeavors in applying Large Language Models (LLMs) to autonomous driving systems.
  • methods: The paper reviews existing Multimodal Large Language Model (MLLM) tools for driving, transportation, and map systems, as well as existing datasets and benchmarks. It also discusses several important problems that need to be solved by both academia and industry to further promote the development of this field.
  • results: The paper presents a systematic investigation of the application of LLMs in autonomous driving systems, including a review of existing MLLM tools, datasets, and benchmarks, as well as a discussion of the challenges and opportunities in this field.
    Abstract With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems benefiting from large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite their immense potential, there is still a lack of comprehensive understanding of the key challenges, opportunities, and future endeavors involved in applying LLMs to driving systems. In this paper, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. Then, we overview existing MLLM tools for driving, transportation, and map systems together with existing datasets and benchmarks. Moreover, we summarize the work presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), which is the first workshop of its kind regarding LLMs in autonomous driving. To further promote the development of this field, we also discuss several important problems regarding using MLLMs in autonomous driving systems that need to be solved by both academia and industry.

Overcoming Pathology Image Data Deficiency: Generating Images from Pathological Transformation Process

  • paper_url: http://arxiv.org/abs/2311.12316
  • repo_url: https://github.com/rowerliu/adbd
  • paper_authors: Zeyu Liu, Yufang He, Yu Zhao, Yunlu Feng, Guanglei Zhang
  • for: Histopathology, the gold standard of medical diagnosis, faces application limitations due to the shortage of medical resources. Deep-learning-based computer-aided diagnosis can address the shortage of pathologists and provide timely clinical analysis, but building reliable models usually requires large amounts of training data, which is a challenge in pathology.
  • methods: The authors propose an adaptive depth-controlled bidirectional diffusion (ADBD) network for image data generation. The domain-migration approach works with small training sets and overcomes diffusion overfitting under source-information guidance; a hybrid attention strategy blends global and local attention priorities to guide the bidirectional diffusion and ensure successful migration (a minimal sketch of this blending idea follows the abstract below).
  • results: ADBD effectively addresses the shortage of pathological image data and can support further pathology-related research. Experiments show that ADBD generates unlimited cross-domain intermediate images with corresponding diagnosis-relevant soft labels.
    Abstract Histopathology serves as the gold standard for medical diagnosis but faces application limitations due to the shortage of medical resources. Leveraging deep learning, computer-aided diagnosis has the potential to alleviate the pathologist scarcity and provide timely clinical analysis. However, developing a reliable model generally necessitates substantial data for training, which is challenging in the pathology field. In response, we propose an adaptive depth-controlled bidirectional diffusion (ADBD) network for image data generation. The domain migration approach can work with small trainset and overcome the diffusion overfitting by source information guidance. Specifically, we developed a hybrid attention strategy to blend global and local attention priorities, which guides the bidirectional diffusion and ensures the migration success. In addition, we developed the adaptive depth-controlled strategy to simulate physiological transformations, capable of yielding unlimited cross-domain intermediate images with corresponding soft labels. ADBD is effective for overcoming pathological image data deficiency and supportable for further pathology-related research.
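
A minimal PyTorch sketch of one plausible reading of the hybrid attention strategy (the window size, dimensions, and gating form are assumptions, not the paper's specification): global and windowed local self-attention outputs are blended by a learned per-token gate.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Blend global attention with windowed local attention via a learned gate."""
    def __init__(self, dim=64, heads=4, window=8):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.window = window

    def forward(self, x):                      # x: (batch, seq, dim)
        n = x.size(1)
        g, _ = self.global_attn(x, x, x)       # unrestricted attention
        # Local mask: position i may only attend within +/- window positions
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        l, _ = self.local_attn(x, x, x, attn_mask=mask)
        a = self.gate(x)                       # per-token blending weight in (0, 1)
        return a * g + (1 - a) * l

out = HybridAttention()(torch.randn(2, 32, 64))   # -> (2, 32, 64)
```

In ADBD this kind of blended attention would guide the bidirectional diffusion; here it is shown as a standalone module.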

IEKM: A Model Incorporating External Keyword Matrices

  • paper_url: http://arxiv.org/abs/2311.12310
  • repo_url: None
  • paper_authors: Cheng Luo, Qin Li, Zhao Yan, Mengliang Rao, Yunbo Cao
  • for: To solve two pressing challenges in the core semantic textual similarity (STS) task of customer-service platform systems: first, adapting one platform to customers from different domains (different domains adaptation, DDA); second, the difficulty of distinguishing sentence pairs that are literally close but semantically different, i.e., hard negative samples.
  • methods: The authors propose an Incorporating External Keyword Matrices (IEKM) model combined with the Transformer architecture: external matrices built from tools or dictionaries are fused into the self-attention layers through gating units, enabling flexible corrections to the model's results (a gated-fusion sketch follows the abstract below).
  • results: Evaluated on multiple datasets, the method improves performance across all of them. To show that it effectively addresses all of the above challenges, a flexible-correction experiment was conducted, raising the F1 score from 56.61 to 73.53.
    Abstract A customer service platform system with a core text semantic similarity (STS) task faces two urgent challenges: Firstly, one platform system needs to adapt to different domains of customers, i.e., different domains adaptation (DDA). Secondly, it is difficult for the model of the platform system to distinguish sentence pairs that are literally close but semantically different, i.e., hard negative samples. In this paper, we propose an Incorporating External Keyword Matrices (IEKM) model to address these challenges. The model uses external tools or dictionaries to construct external matrices and fuses them to the self-attention layers of the Transformer structure through gating units, thus enabling flexible corrections to the model results. We evaluate the method on multiple datasets and the results show that our method has improved performance on all datasets. To demonstrate that our method can effectively solve all the above challenges, we conduct a flexible correction experiment, which results in an increase in the F1 value from 56.61 to 73.53. Our code will be publicly available.
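
One way to picture the gated fusion (an assumption about the mechanism, not the authors' code): an external keyword-affinity matrix is added to the self-attention logits, scaled by a learned gate initialized so that training starts with no correction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeywordGatedAttention(nn.Module):
    """Single-head self-attention whose logits are corrected by an
    external keyword matrix E (seq x seq), weighted by a learned gate."""
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> no correction at init
        self.scale = dim ** -0.5

    def forward(self, x, ext):                     # x: (B, N, D); ext: (B, N, N)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale
        logits = logits + torch.tanh(self.gate) * ext   # gated external correction
        return F.softmax(logits, dim=-1) @ v

# ext could be 1.0 where both tokens match a dictionary keyword, else 0.0
x, ext = torch.randn(2, 10, 64), torch.zeros(2, 10, 10)
out = KeywordGatedAttention()(x, ext)              # -> (2, 10, 64)
```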

Causality is all you need

  • paper_url: http://arxiv.org/abs/2311.12307
  • repo_url: None
  • paper_authors: Ning Xu, Yifei Gao, Hongshuo Tian, Yongdong Zhang, An-An Liu
  • for: This paper aims to build an integrated causal framework for revealing cause-effect forces hidden in data, which can improve the machine’s comprehension of causal relationships within a broader semantic space.
  • methods: The proposed Causal Graph Routing (CGR) framework relies entirely on intervention mechanisms and includes a stack of causal layers, each with a set of parallel deconfounding blocks from different causal graphs. The sufficient cause concept is used to dynamically select the suitable deconfounding methods in each layer.
  • results: The CGR framework was evaluated on two classical tasks of CV and NLP and surpassed the current state-of-the-art methods on both Visual Question Answer and Long Document Classification tasks. It has great potential in building the “causal” pre-training large-scale model that effectively generalizes to diverse tasks.
    Abstract In the fundamental statistics course, students are taught to remember the well-known saying: "Correlation is not Causation". Till now, statistics (i.e., correlation) have developed various successful frameworks, such as Transformer and Pre-training large-scale models, which have stacked multiple parallel self-attention blocks to imitate a wide range of tasks. However, in the causation community, how to build an integrated causal framework remains an untouched domain despite its excellent intervention capabilities. In this paper, we propose the Causal Graph Routing (CGR) framework, an integrated causal scheme relying entirely on the intervention mechanisms to reveal the cause-effect forces hidden in data. Specifically, CGR is composed of a stack of causal layers. Each layer includes a set of parallel deconfounding blocks from different causal graphs. We combine these blocks via the concept of the proposed sufficient cause, which allows the model to dynamically select the suitable deconfounding methods in each layer. CGR is implemented as the stacked networks, integrating no confounder, back-door adjustment, front-door adjustment, and probability of sufficient cause. We evaluate this framework on two classical tasks of CV and NLP. Experiments show CGR can surpass the current state-of-the-art methods on both Visual Question Answer and Long Document Classification tasks. In particular, CGR has great potential in building the "causal" pre-training large-scale model that effectively generalizes to diverse tasks. It will improve the machines' comprehension of causal relationships within a broader semantic space.
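
A toy PyTorch sketch of the routing idea as described, with ordinary MLPs standing in for the deconfounding blocks; the softmax routing below is an illustrative stand-in for the paper's sufficient-cause selection, not its actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalLayer(nn.Module):
    """Parallel 'deconfounding' blocks mixed by learned routing weights."""
    def __init__(self, dim=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_blocks)
        ])
        self.route = nn.Linear(dim, n_blocks)       # per-example routing logits

    def forward(self, x):                           # x: (batch, dim)
        w = F.softmax(self.route(x), dim=-1)        # (batch, n_blocks)
        outs = torch.stack([b(x) for b in self.blocks], dim=1)  # (batch, n_blocks, dim)
        return (w.unsqueeze(-1) * outs).sum(dim=1) + x          # residual mix

model = nn.Sequential(CausalLayer(), CausalLayer())  # a stack of causal layers
y = model(torch.randn(8, 64))                        # -> (8, 64)
```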

Discovering Effective Policies for Land-Use Planning

  • paper_url: http://arxiv.org/abs/2311.12304
  • repo_url: None
  • paper_authors: Risto Miikkulainen, Olivier Francon, Daniel Young, Elliot Meyerson, Babak Hodjat
  • for: The aim of this work is an effective land-use planning tool that lets decision-makers quickly and efficiently evaluate different land-use options and their effect on climate change.
  • methods: A surrogate model is learned from available historical data on land-use changes together with a simulation of carbon emissions/absorption; an evolutionary search process then finds effective land-use policies for specific locations (a toy Pareto-search sketch follows the abstract below).
  • results: Built on the Project Resilience platform with the Land-Use Harmonization dataset and the BLUE simulator, the system generates Pareto fronts that trade off carbon impact and amount of change, customized to different locations, providing a potentially useful tool for land-use planning.
    Abstract How areas of land are allocated for different uses, such as forests, urban, and agriculture, has a large effect on carbon balance, and therefore climate change. Based on available historical data on changes in land use and a simulation of carbon emissions/absorption, a surrogate model can be learned that makes it possible to evaluate the different options available to decision-makers efficiently. An evolutionary search process can then be used to discover effective land-use policies for specific locations. Such a system was built on the Project Resilience platform and evaluated with the Land-Use Harmonization dataset and the BLUE simulator. It generates Pareto fronts that trade off carbon impact and amount of change customized to different locations, thus providing a potentially useful tool for land-use planning.
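
The Pareto-front mechanics can be illustrated with a tiny evolutionary loop over a made-up two-objective surrogate (carbon impact vs. amount of change, both minimized); the surrogate, mutation scheme, and population sizes below are placeholders, not the paper's learned model.

```python
import numpy as np

rng = np.random.default_rng(42)

def surrogate(policy):
    # Placeholder for the learned surrogate: maps a land-use allocation
    # vector to (carbon_impact, amount_of_change)
    return (float(np.sum(policy ** 2)), float(np.sum(np.abs(policy - 0.5))))

def dominated(p, objs):
    # p is dominated if some distinct point is at least as good on both objectives
    return any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in objs)

pop = [rng.uniform(0, 1, 5) for _ in range(30)]       # candidate policies
for _ in range(50):
    children = [np.clip(p + rng.normal(0, 0.1, p.shape), 0, 1) for p in pop]
    pool = pop + children
    objs = [surrogate(p) for p in pool]
    front = [p for p, o in zip(pool, objs) if not dominated(o, objs)]
    # Keep the non-dominated set, refill with random immigrants to keep exploring
    pop = front[:30] + [rng.uniform(0, 1, 5) for _ in range(max(0, 30 - len(front)))]

print(sorted(surrogate(p) for p in front)[:5])        # sample of the final Pareto front
```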

Detecting subtle macroscopic changes in a finite temperature classical scalar field with machine learning

  • paper_url: http://arxiv.org/abs/2311.12303
  • repo_url: None
  • paper_authors: Jiming Yang, Yutong Zheng, Jiahong Zhou, Huiyu Li, Jun Yin
  • for: This study explores how detecting macroscopic changes can probe the behavior of experimental many-body systems, from the classical to the quantum realm.
  • methods: Different approaches to differentiating scalar-field samples at varying temperatures are compared: a physics method, a statistical method, and an AI method (a schematic comparison in code follows the abstract below).
  • results: The AI method proves more sensitive than both the physics and statistical methods at detecting subtle macroscopic changes, providing a proof of concept that AI may detect macroscopic changes in many-body systems that elude physical measures.
    Abstract The ability to detect macroscopic changes is important for probing the behaviors of experimental many-body systems from the classical to the quantum realm. Although abrupt changes near phase boundaries can easily be detected, subtle macroscopic changes are much more difficult to detect as the changes can be obscured by noise. In this study, as a toy model for detecting subtle macroscopic changes in many-body systems, we try to differentiate scalar field samples at varying temperatures. We compare different methods for making such differentiations, from physics method, statistics method, to AI method. Our finding suggests that the AI method outperforms both the statistical method and the physics method in its sensitivity. Our result provides a proof-of-concept that AI can potentially detect macroscopic changes in many-body systems that elude physical measures.
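
A schematic of the statistics-vs-AI comparison, using i.i.d. Gaussian "field" samples whose variance plays the role of temperature (a toy model assumed purely for illustration, not the paper's scalar-field simulation): a variance-threshold statistic against a small neural network on the raw samples.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 400, 64                       # samples per class, sites per field sample
T_cold, T_hot = 1.00, 1.05           # close temperatures -> subtle change

X = np.concatenate([rng.normal(0, np.sqrt(T_cold), (n, d)),
                    rng.normal(0, np.sqrt(T_hot), (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Statistical baseline: threshold on the per-sample variance,
# with the threshold fit on the training half (even rows)
thresh = np.median(X[::2].var(axis=1))
stat_acc = ((X[1::2].var(axis=1) > thresh) == y[1::2]).mean()

# "AI" method: a small neural network on the raw field values
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X[::2], y[::2])              # even rows train, odd rows test
ai_acc = clf.score(X[1::2], y[1::2])
print(f"statistic: {stat_acc:.2f}  vs  AI: {ai_acc:.2f}")
```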

Noise in Relation Classification Dataset TACRED: Characterization and Reduction

  • paper_url: http://arxiv.org/abs/2311.12298
  • repo_url: None
  • paper_authors: Akshay Parekh, Ashish Anand, Amit Awekar
  • for: The paper has two goals: first, to explore model-based approaches to characterizing the primary cause of noise in the relation classification dataset TACRED; second, to identify potentially noisy instances.
  • methods: Analyzing the predictions and performance of existing state-of-the-art models locates the root cause of the noise: most of it originates from instances labeled no-relation, i.e., negative examples. Two nearest-neighbor-based strategies are proposed to automatically flag likely-noisy instances for elimination and reannotation: the Intrinsic Strategy (IS) assumes positive examples are clean and uses false-negative predictions to flag noisy negatives, while the Extrinsic Strategy (ES) uses a clean subset of the dataset to flag potentially noisy negatives (a nearest-neighbor sketch follows the abstract below).
  • results: Retraining two state-of-the-art models after IS-based elimination improves F1 by 4% on average, while reannotation (TACRED-R) does not improve on the original results. With ES, the models gain 3.8% and 4.4% average F1 when trained on the eliminated (TACRED-EN) and reannotated (TACRED-RN) datasets respectively; extending ES to clean positive examples as well yields average gains of 5.8% and 5.6% on the eliminated (TACRED-ENP) and reannotated (TACRED-RNP) datasets.
    Abstract The overarching objective of this paper is two-fold. First, to explore model-based approaches to characterize the primary cause of the noise in the RE dataset TACRED. Second, to identify the potentially noisy instances. Towards the first objective, we analyze predictions and performance of state-of-the-art (SOTA) models to identify the root cause of noise in the dataset. Our analysis of TACRED shows that the majority of the noise in the dataset originates from the instances labeled as no-relation, which are negative examples. For the second objective, we explore two nearest-neighbor-based strategies to automatically identify potentially noisy examples for elimination and reannotation. Our first strategy, referred to as Intrinsic Strategy (IS), is based on the assumption that positive examples are clean. Thus, we have used false-negative predictions to identify noisy negative examples. Our second approach, referred to as Extrinsic Strategy (ES), is based on using a clean subset of the dataset to identify potentially noisy negative examples. Finally, we retrained the SOTA models on the eliminated and reannotated datasets. Our empirical results based on two SOTA models trained on TACRED-E following the IS show an average 4% F1-score improvement, whereas reannotation (TACRED-R) does not improve the original results. However, following ES, SOTA models show the average F1-score improvement of 3.8% and 4.4% when trained on respective eliminated (TACRED-EN) and reannotated (TACRED-RN) datasets respectively. We further extended the ES for cleaning positive examples as well, which resulted in an average performance improvement of 5.8% and 5.6% for the eliminated (TACRED-ENP) and reannotated (TACRED-RNP) datasets respectively.
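
A hedged sketch of an Extrinsic-Strategy-style nearest-neighbor check (the embedding source, sizes, and vote threshold below are invented for illustration): a negative example is flagged as suspicious when most of its nearest neighbors in a trusted clean subset carry a relation label.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical sentence-pair embeddings; in practice these would come
# from an encoder such as a fine-tuned relation-classification model
clean_emb = rng.normal(size=(1000, 128))         # trusted, correctly labeled subset
clean_is_relation = rng.integers(0, 2, 1000)     # 1 = some relation, 0 = no-relation

neg_emb = rng.normal(size=(200, 128))            # candidates labeled "no-relation"

nn = NearestNeighbors(n_neighbors=5).fit(clean_emb)
_, idx = nn.kneighbors(neg_emb)                  # (200, 5) neighbor indices

# If >= 4 of 5 clean neighbors express a relation, the "no-relation"
# label is suspicious -> send it for reannotation or drop it
votes = clean_is_relation[idx].sum(axis=1)
suspicious = np.where(votes >= 4)[0]
print(f"{len(suspicious)} of {len(neg_emb)} negatives flagged")
```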

ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science

  • paper_url: http://arxiv.org/abs/2311.12289
  • repo_url: None
  • paper_authors: Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana
  • for: To improve language model performance on scientific tasks; retrieval augmentation in particular can supplement a language model's limited knowledge capacity.
  • methods: A novel structure-aware retrieval-augmented language model is proposed that accounts for structural relationships between documents during retrieval, improving the accuracy and usefulness of the retrieved passages (a fusion sketch follows the abstract below).
  • results: Experiments show that structure-aware retrieval fetches more accurate, faithful, and contextually relevant passages, while maintaining comparable overall accuracy.
    Abstract Large language models record impressive performance on many natural language processing tasks. However, their knowledge capacity is limited to the pretraining corpus. Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources to complement the language model. However, existing retrieval augmentation techniques ignore the structural relationships between these documents. Furthermore, retrieval models are not explored much in scientific tasks, especially in regard to the faithfulness of retrieved documents. In this paper, we propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation. We create a heterogeneous document graph capturing multiple types of relationships (e.g., citation, co-authorship, etc.) that connect documents from more than 15 scientific disciplines (e.g., Physics, Medicine, Chemistry, etc.). We train a graph neural network on the curated document graph to act as a structural encoder for the corresponding passages retrieved during the model pretraining. Particularly, along with text embeddings of the retrieved passages, we obtain structural embeddings of the documents (passages) and fuse them together before feeding them to the language model. We evaluate our model extensively on various scientific benchmarks that include science question-answering and scientific document classification tasks. Experimental results demonstrate that structure-aware retrieval improves retrieving more coherent, faithful and contextually relevant passages, while showing a comparable performance in the overall accuracy.
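
A minimal sketch of the fusion step as described: a passage's text embedding is combined with a structural embedding from a graph encoder before reaching the language model. The shapes, the concatenate-and-project operator, and the layer itself are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StructureFusedPassage(nn.Module):
    """Fuse a passage's text embedding with a structural embedding
    (e.g., from a GNN over a citation/co-authorship document graph)."""
    def __init__(self, text_dim=768, struct_dim=128):
        super().__init__()
        self.proj = nn.Linear(text_dim + struct_dim, text_dim)

    def forward(self, text_emb, struct_emb):
        # Concatenate, then project back to the LM's embedding width
        return self.proj(torch.cat([text_emb, struct_emb], dim=-1))

fuse = StructureFusedPassage()
fused = fuse(torch.randn(4, 768), torch.randn(4, 128))   # 4 retrieved passages
```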

Adapting LLMs for Efficient, Personalized Information Retrieval: Methods and Implications

  • paper_url: http://arxiv.org/abs/2311.12287
  • repo_url: None
  • paper_authors: Samira Ghodratnama, Mehrdad Zakershahrak
  • for: This paper examines the application of Large Language Models (LLMs) to information retrieval: how to optimize the retrieval process, select optimal models, and scale and orchestrate LLMs efficiently.
  • methods: The paper presents methods for optimizing the retrieval process, covering model, data, and system optimization, and discusses how to select the best LLM for the task.
  • results: Through analysis of multiple experiments and case studies, the paper demonstrates the effectiveness of LLMs in information retrieval and addresses issues such as model hallucination.
    Abstract The advent of Large Language Models (LLMs) heralds a pivotal shift in online user interactions with information. Traditional Information Retrieval (IR) systems primarily relied on query-document matching, whereas LLMs excel in comprehending and generating human-like text, thereby enriching the IR experience significantly. While LLMs are often associated with chatbot functionalities, this paper extends the discussion to their explicit application in information retrieval. We explore methodologies to optimize the retrieval process, select optimal models, and effectively scale and orchestrate LLMs, aiming for cost-efficiency and enhanced result accuracy. A notable challenge, model hallucination-where the model yields inaccurate or misinterpreted data-is addressed alongside other model-specific hurdles. Our discourse extends to crucial considerations including user privacy, data optimization, and the necessity for system clarity and interpretability. Through a comprehensive examination, we unveil not only innovative strategies for integrating Language Models (LLMs) with Information Retrieval (IR) systems, but also the consequential considerations that underline the need for a balanced approach aligned with user-centric principles.

Probabilistic Forecast Reconciliation with Kullback-Leibler Divergence Regularization

  • paper_url: http://arxiv.org/abs/2311.12279
  • repo_url: https://github.com/guanyu0316/Probabilistic-Forecast-Reconciliation-with-DL
  • paper_authors: Guanyu Zhang, Feng Li, Yanfei Kang
  • for: This work proposes a new approach to probabilistic forecast reconciliation that resolves the trade-off between accuracy and coherency in existing methods, which treat reconciliation as a fixed, hard post-processing step.
  • methods: Probabilistic forecast reconciliation is implemented within a deep learning framework, introducing a Kullback-Leibler divergence regularization term that makes the reconciliation step more flexible and soft (a minimal loss sketch follows the abstract below).
  • results: Evaluated on three hierarchical time series datasets, the proposed approach shows several advantages over other probabilistic forecast reconciliation methods.
    Abstract As the popularity of hierarchical point forecast reconciliation methods increases, there is a growing interest in probabilistic forecast reconciliation. Many studies have utilized machine learning or deep learning techniques to implement probabilistic forecasting reconciliation and have made notable progress. However, these methods treat the reconciliation step as a fixed and hard post-processing step, leading to a trade-off between accuracy and coherency. In this paper, we propose a new approach for probabilistic forecast reconciliation. Unlike existing approaches, our proposed approach fuses the prediction step and reconciliation step into a deep learning framework, making the reconciliation step more flexible and soft by introducing the Kullback-Leibler divergence regularization term into the loss function. The approach is evaluated using three hierarchical time series datasets, which shows the advantages of our approach over other probabilistic forecast reconciliation methods.
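
A minimal sketch of such a composite loss in PyTorch (the Gaussian parameterization and the weight lambda are assumptions): the reconciled forecast is trained for accuracy while a KL term softly penalizes deviation from the base forecast, replacing a hard reconciliation step.

```python
import torch
import torch.distributions as D

def reconciliation_loss(mu, sigma, mu_rec, sigma_rec, y, lam=0.1):
    """Negative log-likelihood of the reconciled forecast plus a KL term
    that keeps it close to the base forecast (the soft reconciliation)."""
    base = D.Normal(mu, sigma)
    rec = D.Normal(mu_rec, sigma_rec)
    nll = -rec.log_prob(y).mean()
    kl = D.kl_divergence(rec, base).mean()
    return nll + lam * kl

y = torch.randn(16, 3)                        # 3 series in the hierarchy
mu, sigma = torch.zeros(16, 3), torch.ones(16, 3)
mu_rec = torch.zeros(16, 3, requires_grad=True)
loss = reconciliation_loss(mu, sigma, mu_rec, torch.ones(16, 3), y)
loss.backward()                               # trains the reconciled parameters
```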

Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity

  • paper_url: http://arxiv.org/abs/2311.12267
  • repo_url: None
  • paper_authors: Jikai Jin, Vasilis Syrgkanis
  • for: This paper studies causal representation learning: recovering high-level latent variables and their causal relationships from low-level data, assuming access to observations generated from multiple environments.
  • methods: The paper treats linear causal models as well as general non-parametric causal models, and provides an algorithm, LiNGCReL, that is guaranteed to recover the ground-truth model (up to an effect-domination ambiguity).
  • results: Identifiability is guaranteed without assuming hard interventions, and numerical experiments demonstrate the effectiveness of the method.
    Abstract This paper studies causal representation learning, the task of recovering high-level latent variables and their causal relationships from low-level data that we observe, assuming access to observations generated from multiple environments. While existing works are able to prove full identifiability of the underlying data generating process, they typically assume access to single-node, hard interventions which is rather unrealistic in practice. The main contribution of this paper is to characterize a notion of identifiability which is provably the best one can achieve when hard interventions are not available. First, for linear causal models, we provide an identifiability guarantee for data observed from general environments without assuming any similarities between them. While the causal graph is shown to be fully recovered, the latent variables are only identified up to an effect-domination ambiguity (EDA). We then propose an algorithm, LiNGCReL, which is guaranteed to recover the ground-truth model up to EDA, and we demonstrate its effectiveness via numerical experiments. Moving on to general non-parametric causal models, we prove the same identifiability guarantee assuming access to groups of soft interventions. Finally, we provide counterparts of our identifiability results, indicating that EDA is basically inevitable in our setting.

Resilient Control of Networked Microgrids using Vertical Federated Reinforcement Learning: Designs and Real-Time Test-Bed Validations

  • paper_url: http://arxiv.org/abs/2311.12264
  • repo_url: None
  • paper_authors: Sayak Mukherjee, Ramij R. Hossain, Sheik M. Mohiuddin, Yuan Liu, Wei Du, Veronica Adetola, Rohit A. Jinsiwale, Qiuhua Huang, Tianzhixi Yin, Ankit Singhal
  • for: To improve the system-level resiliency of networked microgrids as the population of inverter-based resources (IBRs) grows.
  • methods: A federated reinforcement learning (Fed-RL) approach is proposed to tackle model complexity and the unknown dynamical behavior of IBR devices, as well as the privacy issues of data sharing in multi-party-owned networked grids; a vertical variant, Federated Soft Actor-Critic (FedSAC), captures the interconnected dynamics of networked microgrids (a toy vertical-federation sketch follows the abstract below).
  • results: Policies learned in simulation are transferred to a real-time hardware-in-the-loop test bed, bridging the gap between simulation and the real world. Experiments show that the simulator-trained RL controllers produce convincing results on the real-time hardware.
    Abstract Improving system-level resiliency of networked microgrids is an important aspect with the increased population of inverter-based resources (IBRs). This paper (1) presents resilient control design in the presence of adversarial cyber-events, and proposes a novel federated reinforcement learning (Fed-RL) approach to tackle (a) model complexities, unknown dynamical behaviors of IBR devices, (b) privacy issues regarding data sharing in multi-party-owned networked grids, and (2) transfers learned controls from simulation to a hardware-in-the-loop test-bed, thereby bridging the gap between simulation and the real world. With these multi-prong objectives, first, we formulate a reinforcement learning (RL) training setup generating episodic trajectories with adversaries (attack signal) injected at the primary controllers of the grid forming (GFM) inverters where RL agents (or controllers) are being trained to mitigate the injected attacks. For networked microgrids, the horizontal Fed-RL method involving distinct independent environments is not appropriate, leading us to develop the vertical variant Federated Soft Actor-Critic (FedSAC) algorithm to grasp the interconnected dynamics of networked microgrid. Next, utilizing the OpenAI Gym interface, we built a custom simulation set-up in the GridLAB-D/HELICS co-simulation platform, named Resilient RL Co-simulation (ResRLCoSIM), to train the RL agents with IEEE 123-bus benchmark test systems comprising 3 interconnected microgrids. Finally, the learned policies in the simulation world are transferred to the real-time hardware-in-the-loop test-bed set-up developed using the high-fidelity Hypersim platform. Experiments show that the simulator-trained RL controllers produce convincing results with the real-time test-bed set-up, validating the minimization of the sim-to-real gap.
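
A toy sketch of the vertical-federated flavor (an interpretation for illustration, not the paper's FedSAC algorithm): each microgrid keeps its raw observations private and shares only a learned embedding with a central critic, which sees the interconnected system through the concatenated embeddings.

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Owned by one microgrid; raw measurements never leave the party."""
    def __init__(self, obs_dim=12, emb_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                 nn.Linear(32, emb_dim))

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Sees only the concatenated embeddings from all microgrids."""
    def __init__(self, n_grids=3, emb_dim=16):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(n_grids * emb_dim, 64), nn.ReLU(),
                               nn.Linear(64, 1))

    def forward(self, embs):                        # list of (batch, emb_dim)
        return self.v(torch.cat(embs, dim=-1))

encoders = [LocalEncoder() for _ in range(3)]       # one per microgrid
critic = CentralCritic()
obs = [torch.randn(8, 12) for _ in range(3)]        # private local observations
value = critic([enc(o) for enc, o in zip(encoders, obs)])   # -> (8, 1)
```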
    摘要 提高网络化微型电网的系统级坚韧性是一项非常重要的任务,随着人口的增加,增加了基于散列器(IBR)的资源。这篇文章(1)提出了在抗击敌意网络事件时的坚韧控制设计,并提出了一种新的联邦学习(Fed-RL)方法来解决模型复杂性、不确定的IBR设备动态行为和数据共享问题。此外,文章还解决了在多方所有的网络化电网中数据共享的隐私问题。通过以下多元目标,文章首先形成了一种学习控制(RL)训练设置,生成了抗击攻击信号插入Grid Forming Inverter(GFM)的primary控制器的epsidic trajectory。为了满足网络微型电网的特殊需求,我们不采用了水平的Fed-RL方法,而是开发了垂直的Federated Soft Actor-Critic(FedSAC)算法,以捕捉网络微型电网的水平连接动态。接着,我们使用OpenAI Gym接口,建立了一个自定义的GridLAB-D/HELICS合作 simulate平台,名为Resilient RL Co-simulation(ResRLCoSIM),用于训练RL代理。最后,文章通过将 simulate世界中学习的策略转移到实时硬件实验室中, Validated the minimization of sim-to-real gap。实验结果表明,由 simulate世界中学习的RL控制器在实时硬件实验室中产生了令人满意的结果,验证了 sim-to-real gap的最小化。