2023-09-15

cs.CV

EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding

paper_url: http://arxiv.org/abs/2309.08816
repo_url: https://github.com/facebookresearch/egoobjects
paper_authors: Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang Culatana, Roshan Sumbaly, Zhicheng Yan
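for: This work addresses object understanding in egocentric visual data, a fundamental problem in egocentric vision research.
methods: The work introduces EgoObjects, a large-scale egocentric dataset. Its Pilot version contains over 9K videos recorded by 250 participants from 50+ countries using 4 wearable devices, with over 650K object annotations across 368 object categories.
results: EgoObjects annotates every object with an instance-level identifier and covers over 14K unique object instances. It captures the same object under varying background complexity, surrounding objects, distance, lighting, and camera motion. To bootstrap research on the dataset, the paper presents 4 benchmark tasks: a novel instance-level object detection task, the classical category-level object detection task, and 2 novel continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects.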

Abstract
Object understanding in egocentric visual data is arguably a fundamental research topic in egocentric vision. However, existing object datasets are either non-egocentric or have limitations in object categories, visual content, and annotation granularities. In this work, we introduce EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. Its Pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. Unlike prior datasets containing only object category labels, EgoObjects also annotates each object with an instance-level identifier, and includes over 14K unique object instances. EgoObjects was designed to capture the same object under diverse background complexities, surrounding objects, distance, lighting and camera motion. In parallel to the data collection, we conducted data annotation by developing a multi-stage federated annotation process to accommodate the growing nature of the dataset. To bootstrap the research on EgoObjects, we present a suite of 4 benchmark tasks around the egocentric object understanding, including a novel instance level- and the classical category level object detection. Moreover, we also introduce 2 novel continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects.

paper_url: http://arxiv.org/abs/2309.08769
repo_url: None
paper_authors: Jongwon Lee, Su Yeon Choi, Timothy Bretl
for: This paper is written to quantify the impact of adverse environmental conditions on the detection of fiducial markers by color cameras mounted on rotorcraft.
methods: The paper uses image sequences collected outdoors with cameras mounted on a quadrotor during semi-autonomous takeoff and landing operations under adverse environmental conditions.
results: The paper evaluates the performance of the marker detection system using various performance measures such as precision, recall, continuity, availability, robustness, resiliency, and coverage volume.

Abstract
This paper quantifies the impact of adverse environmental conditions on the detection of fiducial markers (i.e., artificial landmarks) by color cameras mounted on rotorcraft. We restrict our attention to square markers with a black-and-white pattern of grid cells that can be nested to allow detection at multiple scales. These markers have the potential to enhance the reliability of precision takeoff and landing at vertiports by flying vehicles in urban settings. Prior work has shown, in particular, that these markers can be detected with high precision (i.e., few false positives) and high recall (i.e., few false negatives). However, most of this prior work has been based on image sequences collected indoors with hand-held cameras. Our work is based on image sequences collected outdoors with cameras mounted on a quadrotor during semi-autonomous takeoff and landing operations under adverse environmental conditions that include variations in temperature, illumination, wind speed, humidity, visibility, and precipitation. In addition to precision and recall, performance measures include continuity, availability, robustness, resiliency, and coverage volume. We release both our dataset and the code we used for analysis to the public as open source.
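The abstract defines precision through false positives and recall through false negatives. A minimal sketch of those two measures follows; the detection counts are hypothetical and are not taken from the paper.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) for a set of marker detections."""
    predicted = true_positives + false_positives   # everything the detector reported
    actual = true_positives + false_negatives      # every marker that was really visible
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 90 correct detections, 10 spurious ones, 30 missed markers.
precision, recall = precision_recall(90, 10, 30)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few false positives and high recall means few false negatives, so the two can move independently as conditions degrade.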

Biased Attention: Do Vision Transformers Amplify Gender Bias More than Convolutional Neural Networks?

paper_url: http://arxiv.org/abs/2309.08760
repo_url: https://github.com/aibhishek/Biased-Attention
paper_authors: Abhishek Mandal, Susan Leavy, Suzanne Little
for: This paper aims to evaluate the potential for bias amplification in vision transformers (ViTs) and compare them to convolutional neural networks (CNNs) in the context of large multimodal models.
methods: The authors introduce a novel metric called Accuracy Difference to measure bias in architectures and use it to evaluate the performance of CNNs and ViTs in image classification tasks.
results: The results show that ViTs amplify gender bias to a greater extent than CNNs, highlighting the importance of considering the potential for bias amplification in the design and deployment of multimodal models.

Abstract
Deep neural networks used in computer vision have been shown to exhibit many social biases such as gender bias. Vision Transformers (ViTs) have become increasingly popular in computer vision applications, outperforming Convolutional Neural Networks (CNNs) in many tasks such as image classification. However, given that research on mitigating bias in computer vision has primarily focused on CNNs, it is important to evaluate the effect of a different network architecture on the potential for bias amplification. In this paper we therefore introduce a novel metric to measure bias in architectures, Accuracy Difference. We examine bias amplification when models belonging to these two architectures are used as a part of large multimodal models, evaluating the different image encoders of Contrastive Language Image Pretraining (CLIP), an important model used in many generative models such as DALL-E and Stable Diffusion. Our experiments demonstrate that architecture can play a role in amplifying social biases due to the different techniques employed by the models for feature extraction and embedding as well as their different learning properties. This research found that ViTs amplified gender bias to a greater extent than CNNs.

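Summary
Deep neural networks used in computer vision have been shown to exhibit social biases such as gender bias. Vision Transformers (ViTs) are increasingly popular in computer vision and outperform Convolutional Neural Networks (CNNs) on many tasks such as image classification. Because research on mitigating bias in computer vision has focused mainly on CNNs, the effect of a different network architecture on bias amplification needs to be evaluated. The paper therefore introduces a new metric for measuring bias, Accuracy Difference. It examines models of both architectures as parts of large multimodal models by evaluating the different image encoders of Contrastive Language Image Pretraining, a model used in many generative models such as DALL-E and Stable Diffusion. The experiments indicate that architecture can contribute to amplifying social biases, because the models differ in their feature extraction and embedding techniques and in their learning properties. The study found that ViTs amplify gender bias more than CNNs.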

paper_url: http://arxiv.org/abs/2309.08747
repo_url: None
paper_authors: Reuben Dorent, Nazim Haouchine, Fryderyk Kögl, Samuel Joutard, Parikshit Juvekar, Erickson Torio, Alexandra Golby, Sebastien Ourselin, Sarah Frisken, Tom Vercauteren, Tina Kapur, William M. Wells
for: The paper aims to synthesize missing images from various modalities using a deep hierarchical variational auto-encoder (VAE) with a probabilistic formulation for fusing multi-modal images in a common latent representation.
methods: The approach extends multi-modal VAEs with a hierarchical latent structure, adversarial learning, and a principled probabilistic fusion operation.
results: In the experiments, the model synthesizes missing images better than multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT), demonstrating the advantage of using a hierarchical latent representation and a principled probabilistic fusion operation.

Abstract
We introduce MHVAE, a deep hierarchical variational auto-encoder (VAE) that synthesizes missing images from various modalities. Extending multi-modal VAEs with a hierarchical latent structure, we introduce a probabilistic formulation for fusing multi-modal images in a common latent representation while having the flexibility to handle incomplete image sets as input. Moreover, adversarial learning is employed to generate sharper images. Extensive experiments are performed on the challenging problem of joint intra-operative ultrasound (iUS) and Magnetic Resonance (MR) synthesis. Our model outperformed multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT) for synthesizing missing images, demonstrating the advantage of using a hierarchical latent representation and a principled probabilistic fusion operation. Our code is publicly available at https://github.com/ReubenDo/MHVAE.

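Summary
The paper introduces MHVAE, a deep hierarchical multi-modal VAE that synthesizes missing images from various modalities. It fuses multi-modal images probabilistically in a common hierarchical latent representation and can take incomplete image sets as input. Adversarial learning is used to generate sharper images. Extensive experiments on the challenging problem of joint intra-operative ultrasound (iUS) and Magnetic Resonance (MR) synthesis show that MHVAE synthesizes missing images better than multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT). The code is available at https://github.com/ReubenDo/MHVAE.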

Improved Breast Cancer Diagnosis through Transfer Learning on Hematoxylin and Eosin Stained Histology Images

paper_url: http://arxiv.org/abs/2309.08745
repo_url: None
paper_authors: Fahad Ahmed, Reem Abdel-Salam, Leon Hamnett, Mary Adewunmi, Temitope Ayano
for: The paper aims to classify breast cancer tumors into seven subtypes using deep learning models on histological images.
methods: The authors use pre-trained deep learning models such as Xception, EfficientNet, ResNet50, and InceptionResNet, and pre-process the BRACS ROI images with image augmentation, upsampling, and dataset split strategies.
results: The best results were obtained by ResNet50, achieving a 66% f1-score for the default dataset split and a 96.2% f1-score for the custom dataset split, with a significant reduction in false positive and false negative classifications.

Abstract
Breast cancer is one of the leading causes of death for women worldwide. Early screening is essential for early identification, because the chance of survival declines as the cancer progresses into advanced stages. For this study, the most recent BRACS dataset of hematoxylin and eosin (H&E) stained histological images was used to classify breast cancer tumours. The dataset contains both whole-slide images (WSI) and region-of-interest (ROI) images; for our study we considered only the ROI images. We experimented with different deep learning models pre-trained on ImageNet, such as Xception, EfficientNet, ResNet50, and InceptionResNet. We pre-processed the BRACS ROI images along with image augmentation, upsampling, and dataset split strategies. For the default dataset split, the best results were obtained by ResNet50, achieving a 66% f1-score. For the custom dataset split, the best results were obtained by performing upsampling and image augmentation, which results in a 96.2% f1-score. Our second approach also reduced the number of false positive and false negative classifications to less than 3% for each class. We believe that our study significantly impacts the early diagnosis and identification of breast cancer tumors and their subtypes, especially atypical and malignant tumors, thus improving patient outcomes and reducing patient mortality rates. Overall, this study has primarily focused on identifying seven (7) breast cancer tumor subtypes, and we believe that the experimental models can be fine-tuned further to generalize over previous breast cancer histology datasets as well.

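Summary
Breast cancer is one of the leading causes of death for women worldwide. Early detection is essential, and the chance of survival declines as the cancer advances. The study uses the most recent BRACS dataset, which contains whole-slide images (WSI) and region-of-interest (ROI) images, and considers only the ROI images. Several pre-trained deep learning models are tested, including Xception, EfficientNet, ResNet50, and InceptionResNet, all pre-trained on ImageNet. The BRACS ROI images are pre-processed and combined with image augmentation, upsampling, and dataset split strategies. For the default dataset split, ResNet50 gives the best result with a 66% f1-score. For the custom dataset split, upsampling and image augmentation yield a 96.2% f1-score. The second approach also reduces false positive and false negative classifications to less than 3% for each class. The authors consider the study significant for the early diagnosis and identification of breast cancer tumors and their subtypes, which would improve patient outcomes and reduce mortality. The study focuses on identifying seven (7) breast cancer tumor subtypes, and the authors believe the models can be fine-tuned further to generalize to earlier breast cancer histology datasets.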

Personalized Food Image Classification: Benchmark Datasets and New Baseline

paper_url: http://arxiv.org/abs/2309.08744
repo_url: None
paper_authors: Xinyue Pan, Jiangpeng He, Fengqing Zhu
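for: The study addresses personalized food image classification, in support of automated nutrient analysis from food images.
methods: The study uses deep neural networks with self-supervised learning and exploits temporal image feature information to improve the accuracy of personalized food classification.
results: Evaluated on two personalized food classification benchmark datasets, the method shows improved performance over existing methods. The dataset is available at https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html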

Abstract
Food image classification is a fundamental step of image-based dietary assessment, enabling automated nutrient analysis from food images. Many current methods employ deep neural networks to train on generic food image datasets that do not reflect the dynamism of real-life food consumption patterns, in which food images appear sequentially over time, reflecting the progression of what an individual consumes. Personalized food classification aims to address this problem by training a deep neural network using food images that reflect the consumption pattern of each individual. However, this problem is under-explored and there is a lack of benchmark datasets with individualized food consumption patterns due to the difficulty in data collection. In this work, we first introduce two benchmark personalized datasets including the Food101-Personal, which is created based on surveys of daily dietary patterns from participants in the real world, and the VFNPersonal, which is developed based on a dietary study. In addition, we propose a new framework for personalized food image classification by leveraging self-supervised learning and temporal image feature information. Our method is evaluated on both benchmark datasets and shows improved performance compared to existing works. The dataset has been made available at: https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html

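Summary
Food image classification is a fundamental step of image-based dietary assessment and enables automated nutrient analysis from food images. Many current methods train deep neural networks on generic food image datasets that do not reflect the dynamics of real-life food consumption, in which food images appear sequentially over time and follow each individual's consumption pattern. Personalized food classification aims to address this by training a deep neural network on food images that reflect each individual's consumption pattern. The problem remains under-explored, and benchmark datasets with individualized consumption patterns are lacking. The paper first introduces two personalized benchmark datasets: Food101-Personal, based on surveys of participants' daily dietary patterns in the real world, and VFNPersonal, based on a dietary study. It also proposes a new framework for personalized food image classification that leverages self-supervised learning and temporal image feature information. The method is evaluated on both benchmark datasets and compared with existing methods. The dataset is available at https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html.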

Active Learning for Fine-Grained Sketch-Based Image Retrieval

paper_url: http://arxiv.org/abs/2309.08743
repo_url: None
paper_authors: Himanshu Thakur, Soumitri Chattopadhyay
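for: The paper aims to improve the practical adoption and scalability of Fine-grained sketch-based image retrieval (FG-SBIR) without requiring large numbers of faithful sketches.
methods: It proposes a new active learning sampling technique that reduces the effort of drawing photo sketches by exploiting the relationship between existing photo-sketch pairs and their intermediate representations.
results: Experiments on two publicly available fine-grained SBIR datasets, ChairV2 and ShoeV2, show that the method outperforms adapted baselines.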

Abstract
The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of Fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is Active Learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it for reducing sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need for drawing photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between the existing photo-sketch pair to a photo that does not have its sketch and augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic of the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. With experimentation over two publicly available fine-grained SBIR datasets ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

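Summary
The ability to retrieve a photo from a free-hand sketch highlights the great potential of Fine-grained sketch-based image retrieval (FG-SBIR). Its practical adoption and scalability, however, are limited by the cost of acquiring faithful sketches. Active learning is one solution: it can minimize the need for sketches while maximizing performance. Despite extensive research in the field, no prior work uses active learning to reduce sketching effort in FG-SBIR tasks. The paper proposes a new active learning sampling technique that reduces the need for drawn sketches. The approach handles the trade-off between uncertainty and diversity by using the relationship between existing photo-sketch pairs and their intermediate representations. Because it relies only on the data distribution, the approach is agnostic of the modelling approach and can be applied to other cross-modal instance-level retrieval tasks. Experiments on two publicly available fine-grained SBIR datasets, ChairV2 and ShoeV2, validate the approach and show its advantage over adapted baselines.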

Concept explainability for plant diseases classification

paper_url: http://arxiv.org/abs/2309.08739
repo_url: None
paper_authors: Jihen Amara, Birgitta König-Ries, Sheeba Samuel
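for: The paper aims to improve the explainability of plant disease classification, in support of food security and agricultural sustainability.
methods: The study classifies plant diseases with deep convolutional neural networks (CNN) and applies a method called Testing with Concept Activation Vectors (TCAV), which helps explain the decision process of deep learning models in terms of user-defined concepts.
results: The results suggest that concept-based explanations obtained with TCAV make the decisions of deep learning models easier to understand and can significantly benefit automated plant disease identification.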

Abstract
Plant diseases remain a considerable threat to food security and agricultural sustainability. Rapid and early identification of these diseases has become a significant concern motivating several studies to rely on the increasing global digitalization and the recent advances in computer vision based on deep learning. In fact, plant disease classification based on deep convolutional neural networks has shown impressive performance. However, these methods have yet to be adopted globally due to concerns regarding their robustness, transparency, and the lack of explainability compared with their human experts counterparts. Methods such as saliency-based approaches associating the network output to perturbations of the input pixels have been proposed to give insights into these algorithms. Still, they are not easily comprehensible and not intuitive for human users and are threatened by bias. In this work, we deploy a method called Testing with Concept Activation Vectors (TCAV) that shifts the focus from pixels to user-defined concepts. To the best of our knowledge, our paper is the first to employ this method in the field of plant disease classification. Important concepts such as color, texture and disease related concepts were analyzed. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification.

Summary
The paper applies Testing with Concept Activation Vectors (TCAV), a method that shifts the focus of explanations from pixels to user-defined concepts, with the aim of giving explanations that are more intuitive and comprehensible than pixel-based approaches. TCAV was applied to plant disease classification, analyzing important concepts such as color, texture, and disease-related concepts. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification. To the authors' knowledge, this is the first study to employ TCAV in the field of plant disease classification.

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

paper_url: http://arxiv.org/abs/2309.08738
repo_url: None
paper_authors: Xingjian Diao, Ming Cheng, Shitong Cheng
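for: Learning high-quality video representations has broad applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE and VideoMAE, has shown that reconstruction strategies are effective for learning representations in the visual modality. These models have inherent limitations when features can only be extracted from the visual modality, for example with low-resolution and blurry source videos.
methods: The paper proposes AV-MaskEnhancer, which combines visual and audio information to learn high-quality video representations. The method exploits the complementary nature of visual and audio information to improve representation quality.
results: On the video classification task on the UCF101 dataset, the method outperforms existing work and reaches the state of the art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.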

Abstract
Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on mask autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset outperforms the existing work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.

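Summary
Learning high-quality video representations has significant applications in computer vision but remains challenging. Previous work based on masked autoencoders such as ImageMAE and VideoMAE has demonstrated the effectiveness of learning representations in the visual modality through a reconstruction strategy. These models have inherent limitations, particularly when features must be extracted from the visual modality alone and the source videos are low-resolution and blurry. The paper therefore proposes AV-MaskEnhancer, which learns high-quality video representations by combining visual and audio information. The approach addresses the challenge by demonstrating the complementary nature of audio and video content. On the video classification task on the UCF101 dataset, the result surpasses existing work, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.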
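The reported results use top-1 and top-5 accuracy. As an illustrative sketch of how top-k accuracy is usually computed, the function below counts a prediction as correct when the true label is among the k highest-scoring classes. The scores and labels are made up and are unrelated to the paper's UCF101 experiments.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the k does not matter).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Hypothetical class scores for 4 clips over 3 classes.
scores = np.array([
    [0.1, 0.7, 0.2],  # label 1: correct at top-1
    [0.5, 0.3, 0.2],  # label 1: wrong at top-1, correct at top-2
    [0.2, 0.3, 0.5],  # label 0: wrong even at top-2
    [0.6, 0.3, 0.1],  # label 0: correct at top-1
])
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```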

Segmentation of Tubular Structures Using Iterative Training with Tailored Samples

paper_url: http://arxiv.org/abs/2309.08727
repo_url: None
paper_authors: Wei Liao
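for: The paper proposes a minimal path method that simultaneously computes segmentation masks and centerlines of tubular structures.
methods: The method uses CNN-based feature extraction and introduces a new iterative training scheme that generates better-suited training samples, addressing the discrepancy between samples seen in training and in inference.
results: Compared with seven previous approaches on three public datasets, including satellite images and medical images, the method achieves state-of-the-art results for both segmentation masks and centerlines.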

Abstract
We propose a minimal path method to simultaneously compute segmentation masks and extract centerlines of tubular structures with line-topology. Minimal path methods are commonly used for the segmentation of tubular structures in a wide variety of applications. Recent methods use features extracted by CNNs, and often outperform methods using hand-tuned features. However, for CNN-based methods, the samples used for training may be generated inappropriately, so that they can be very different from samples encountered during inference. We approach this discrepancy by introducing a novel iterative training scheme, which enables generating better training samples specifically tailored for the minimal path methods without changing existing annotations. In our method, segmentation masks and centerlines are not determined after one another by post-processing, but obtained using the same steps. Our method requires only very few annotated training images. Comparison with seven previous approaches on three public datasets, including satellite images and medical images, shows that our method achieves state-of-the-art results both for segmentation masks and centerlines.

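Summary
The paper proposes a minimal path method that simultaneously computes segmentation masks and centerlines of tubular structures. Minimal path methods are widely used for segmenting tubular structures across many applications. Recent methods use features extracted by CNNs and often outperform methods based on hand-tuned features. For CNN-based methods, however, training samples may be generated inappropriately, so they can differ substantially from the samples encountered during inference. The paper addresses this discrepancy with a new iterative training scheme that generates training samples better tailored to minimal path methods without changing existing annotations. In this method, segmentation masks and centerlines are not determined one after the other by post-processing but are obtained with the same steps. The method needs only very few annotated training images. Compared with seven previous approaches on three public datasets, including satellite images and medical images, it achieves state-of-the-art results.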
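For readers unfamiliar with minimal path methods, the sketch below shows the basic idea on a small grid: Dijkstra's algorithm finds the cheapest 4-connected path between two pixels, where low cost marks pixels likely to lie on the tubular structure. This is a generic illustration only. The cost grid is hand-made, whereas the paper derives its features from a CNN, and the iterative training scheme is not reproduced here.

```python
import heapq


def minimal_path(cost, start, goal):
    """Cheapest 4-connected path on a 2D cost grid (Dijkstra). Returns (path, total_cost)."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}  # path cost includes the start pixel
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


# Hypothetical cost map: 1 = likely on the structure, 9 = background.
cost = [
    [1, 9, 9, 9],
    [1, 1, 9, 9],
    [9, 1, 1, 9],
    [9, 9, 1, 1],
]
path, total = minimal_path(cost, (0, 0), (3, 3))
print(path, total)
```

The recovered path follows the low-cost staircase, which plays the role of a centerline in this toy setting.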

Performance Metrics for Probabilistic Ordinal Classifiers

paper_url: http://arxiv.org/abs/2309.08701
repo_url: None
paper_authors: Adrian Galdran
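for: The paper examines how to evaluate the probabilistic predictions of ordinal classifiers.
methods: The paper uses the Ranked Probability Score (RPS), a scoring rule that is popular in the forecasting field but has received little attention in the image analysis community.
results: Experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a more suitable performance metric for the probabilistic predictions of ordinal classifiers.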

Abstract
Ordinal classification models assign higher penalties to predictions further away from the true class. As a result, they are appropriate for relevant diagnostic tasks like disease progression prediction or medical image grading. The consensus for assessing their categorical predictions dictates the use of distance-sensitive metrics like the Quadratic-Weighted Kappa score or the Expected Cost. However, there has been little discussion regarding how to measure performance of probabilistic predictions for ordinal classifiers. In conventional classification, common measures for probabilistic predictions are Proper Scoring Rules (PSR) like the Brier score, or Calibration Errors like the ECE, yet these are not optimal choices for ordinal classification. A PSR named Ranked Probability Score (RPS), widely popular in the forecasting field, is more suitable for this task, but it has received no attention in the image analysis community. This paper advocates the use of the RPS for image grading tasks. In addition, we demonstrate a counter-intuitive and questionable behavior of this score, and propose a simple fix for it. Comprehensive experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a more suitable performance metric for probabilistic ordinal predictions. Code to reproduce our experiments can be found at https://github.com/agaldran/prob_ord_metrics .

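Summary
Ordinal classification models penalize predictions more heavily the further they are from the true class. They are therefore suited to diagnostic tasks such as disease progression prediction or medical image grading. There has been little discussion, however, of how to measure the performance of their probabilistic predictions. In conventional classification, common measures include Proper Scoring Rules (PSR) such as the Brier score, or Calibration Errors such as the ECE, but these are not well suited to ordinal classification. A PSR named Ranked Probability Score (RPS), widely used in the forecasting field, fits this task better, yet it has received little attention in the image analysis community. The paper advocates the RPS for image grading tasks. It also identifies a counter-intuitive and questionable behavior of this score and proposes a simple fix. Experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a suitable performance metric for probabilistic ordinal predictions. The code for the experiments is available on GitHub.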
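To illustrate why a distance-sensitive score suits ordinal predictions, the sketch below compares the Brier score with the standard Ranked Probability Score on two hypothetical forecasts. It assumes the common RPS definition based on squared differences of cumulative distributions, normalized by the number of classes minus one. It does not include the fix proposed in the paper, whose details are not given in the abstract.

```python
import numpy as np


def ranked_probability_score(probs, true_class):
    """Standard RPS: squared distance between predicted and true cumulative distributions."""
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[0]
    target = np.zeros(n_classes)
    target[true_class] = 1.0
    cdf_diff = np.cumsum(probs) - np.cumsum(target)
    # Dividing by (n_classes - 1) is one common normalization (an assumption here).
    return float(np.sum(cdf_diff ** 2) / (n_classes - 1))


def brier_score(probs, true_class):
    """Multi-class Brier score: squared distance between probabilities and the one-hot target."""
    probs = np.asarray(probs, dtype=float)
    target = np.zeros(probs.shape[0])
    target[true_class] = 1.0
    return float(np.sum((probs - target) ** 2))


# Hypothetical 4-grade problem, true grade 0. Both forecasts put 0.9 on a wrong grade,
# but "near" misses by one grade and "far" misses by three.
near = [0.1, 0.9, 0.0, 0.0]
far = [0.1, 0.0, 0.0, 0.9]

brier_near, brier_far = brier_score(near, 0), brier_score(far, 0)
rps_near, rps_far = ranked_probability_score(near, 0), ranked_probability_score(far, 0)
print(brier_near, brier_far)  # identical: the Brier score ignores class order
print(rps_near, rps_far)      # the RPS penalizes the distant miss more
```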

BANSAC: A dynamic BAyesian Network for adaptive SAmple Consensus

paper_url: http://arxiv.org/abs/2309.08690
repo_url: None
paper_authors: Valter Piedade, Pedro Miraldo
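for: The paper aims to make robust estimation algorithms more efficient so that they can be applied more effectively in computer vision.
methods: RANSAC alternates between random sampling of data, computing hypotheses, and inlier counting. The paper derives a dynamic Bayesian network that updates each data point's inlier score during the iterations, applies weighted sampling using the updated scores, and derives a new stopping criterion from them.
results: On multiple real-world datasets, the method needs less computational time than the baselines while outperforming them in accuracy.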

Abstract
RANSAC-based algorithms are the standard techniques for robust estimation in computer vision. These algorithms are iterative and computationally expensive; they alternate between random sampling of data, computing hypotheses, and running inlier counting. Many authors tried different approaches to improve efficiency. One of the major improvements is having a guided sampling, letting the RANSAC cycle stop sooner. This paper presents a new adaptive sampling process for RANSAC. Previous methods either assume no prior information about the inlier/outlier classification of data points or use some previously computed scores in the sampling. In this paper, we derive a dynamic Bayesian network that updates individual data points' inlier scores while iterating RANSAC. At each iteration, we apply weighted sampling using the updated scores. Our method works with or without prior data point scorings. In addition, we use the updated inlier/outlier scoring for deriving a new stopping criterion for the RANSAC loop. We test our method in multiple real-world datasets for several applications and obtain state-of-the-art results. Our method outperforms the baselines in accuracy while needing less computational time.
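The loop described above (sample, fit a hypothesis, count inliers, update per-point scores, resample with weights) can be sketched for 2D line fitting as follows. The score update used here is a simple exponential moving average chosen purely for illustration. It is an assumption and not the dynamic Bayesian network derived in the paper, and the paper's stopping criterion is omitted.

```python
import numpy as np


def weighted_ransac_line(points, n_iters=200, threshold=0.1, seed=0):
    """RANSAC line fit with score-weighted sampling. Returns (best_inlier_mask, scores)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    scores = np.full(n, 0.5)  # initial inlier score per point (no prior knowledge)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        # Weighted sampling of a minimal set (2 points define a line).
        i, j = rng.choice(n, size=2, replace=False, p=scores / scores.sum())
        p, direction = points[i], points[j] - points[i]
        norm = np.linalg.norm(direction)
        if norm < 1e-12:
            continue  # degenerate sample
        normal = np.array([-direction[1], direction[0]]) / norm
        residuals = np.abs((points - p) @ normal)  # point-to-line distances
        inliers = residuals < threshold
        # Illustrative heuristic update (NOT the paper's Bayesian update).
        scores = np.clip(0.9 * scores + 0.1 * inliers, 0.05, 1.0)
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers, scores


# Hypothetical data: 30 points on y = 2x + 1, plus 10 outliers far from the line.
x_in = np.linspace(0.0, 10.0, 30)
x_out = np.linspace(0.0, 10.0, 10)
offsets = np.array([5.0 * (-1) ** k * (1 + k % 3) for k in range(10)])
points = np.vstack([
    np.column_stack([x_in, 2 * x_in + 1]),
    np.column_stack([x_out, 2 * x_out + 1 + offsets]),
])

best_inliers, scores = weighted_ransac_line(points)
print(best_inliers.sum(), scores[:30].mean(), scores[30:].mean())
```

As scores concentrate on consistent points, later iterations sample outliers less often, which is the intuition behind guided sampling.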

Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion

paper_url: http://arxiv.org/abs/2309.08596
repo_url: https://github.com/wengflow/robust-e-nerf
paper_authors: Weng Fei Low, Gim Hee Lee
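for: The paper investigates how to reconstruct NeRFs directly and robustly from moving event cameras.
methods: The method uses a new event generation model that accounts for various intrinsic parameters (such as a time-independent, asymmetric threshold and a refractory period) and non-idealities (such as pixel-to-pixel threshold variation), together with a complementary pair of normalized reconstruction losses that generalize to arbitrary speed profiles and intrinsic parameter values.
results: Experiments on real sequences and on new, realistically simulated sequences verify the effectiveness of the method.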

Abstract
Event cameras offer many advantages over standard cameras due to their distinctive principle of operation: low power, low latency, high temporal resolution and high dynamic range. Nonetheless, the success of many downstream visual applications also hinges on an efficient and effective scene representation, where Neural Radiance Field (NeRF) is seen as the leading candidate. Such promise and potential of event cameras and NeRF inspired recent works to investigate on the reconstruction of NeRF from moving event cameras. However, these works are mainly limited in terms of the dependence on dense and low-noise event streams, as well as generalization to arbitrary contrast threshold values and camera speed profiles. In this work, we propose Robust e-NeRF, a novel method to directly and robustly reconstruct NeRFs from moving event cameras under various real-world conditions, especially from sparse and noisy events generated under non-uniform motion. It consists of two key components: a realistic event generation model that accounts for various intrinsic parameters (e.g. time-independent, asymmetric threshold and refractory period) and non-idealities (e.g. pixel-to-pixel threshold variation), as well as a complementary pair of normalized reconstruction losses that can effectively generalize to arbitrary speed profiles and intrinsic parameter values without such prior knowledge. Experiments on real and novel realistically simulated sequences verify our effectiveness. Our code, synthetic dataset and improved event simulator are public.

Summary
In this work, we propose Robust e-NeRF, a novel method to directly and robustly reconstruct NeRFs from moving event cameras under various real-world conditions, including sparse and noisy events generated under non-uniform motion. Our method consists of two key components: a realistic event generation model that accounts for various intrinsic parameters (such as time-independent, asymmetric threshold, and refractory period) and non-idealities (such as pixel-to-pixel threshold variation), and a complementary pair of normalized reconstruction losses that can effectively generalize to arbitrary speed profiles and intrinsic parameter values without prior knowledge. Experiments on real and novel realistically simulated sequences have verified our effectiveness. Our code, synthetic dataset, and improved event simulator are publicly available.
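The intrinsic parameters named above (asymmetric contrast thresholds and a refractory period) can be illustrated with a simplified single-pixel event model. This is a textbook-style sketch under stated assumptions, not the event generation model of the paper: it emits at most one event per sample, resets the reference to the current log intensity after each event, and ignores noise and pixel-to-pixel threshold variation. All parameter values are hypothetical.

```python
def generate_events(timestamps, log_intensity, pos_threshold=0.2,
                    neg_threshold=0.3, refractory_period=0.0):
    """Simplified event generation for one pixel. Returns a list of (time, polarity)."""
    events = []
    reference = log_intensity[0]          # log intensity at the last event (or at start)
    last_event_time = float("-inf")
    for t, value in zip(timestamps[1:], log_intensity[1:]):
        # The pixel cannot fire again during the refractory period.
        if t - last_event_time < refractory_period:
            continue
        change = value - reference
        if change >= pos_threshold:       # brightness increase
            polarity = 1
        elif change <= -neg_threshold:    # brightness decrease (asymmetric threshold)
            polarity = -1
        else:
            continue
        events.append((t, polarity))
        reference = value
        last_event_time = t
    return events


timestamps = [0, 1, 2, 3, 4, 5]
log_intensity = [0.0, 0.1, 0.25, 0.3, 0.05, -0.1]

events = generate_events(timestamps, log_intensity)
events_refractory = generate_events(timestamps, log_intensity, refractory_period=4)
print(events)
print(events_refractory)
```

With a long refractory period the second event is suppressed, which shows one way slow pixels produce sparser event streams.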

Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes

paper_url: http://arxiv.org/abs/2309.08588
repo_url: https://github.com/Neabfi/robust-rotation-estimation
paper_authors: Fabien Delattre, David Dirnfeld, Phat Nguyen, Stephen Scarano, Michael J. Jones, Pedro Miraldo, Erik Learned-Miller
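for: Estimating camera rotation in crowded real-world scenes from handheld monocular video, a setting in which no previous method is both accurate and acceptably fast.
methods: The authors propose a generalization of the Hough transform on SO(3) that efficiently and robustly finds the camera rotation most compatible with optical flow.
results: Among comparably fast methods, the approach reduces error by almost 50% over the next best, and it is more accurate than any other method regardless of speed. This is a strong new performance point for crowded scenes.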

Abstract
We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at https://fabiendelattre.com/robust-rotation-estimation.
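The voting idea behind a Hough-style rotation search can be illustrated with a small, brute-force sketch. This is not the authors' algorithm: it assumes a small-angle rotational flow model in normalized image coordinates (one common sign convention), a coarse hypothetical grid of candidate rotations, and synthetic data. The function names `rotational_flow` and `vote_rotation` are made up for illustration. Each flow vector votes for every candidate rotation that explains it, and the candidate with the most votes wins, so flow vectors on moving objects act as outliers that spread their votes thinly.

```python
import itertools
import math
import random


def rotational_flow(x, y, w):
    """Flow at normalized point (x, y) induced by a small rotation w = (wx, wy, wz).

    Assumption: small-angle model in one common sign convention.
    """
    wx, wy, wz = w
    u = wx * x * y - wy * (1.0 + x * x) + wz * y
    v = wx * (1.0 + y * y) - wy * x * y - wz * x
    return u, v


def vote_rotation(points, flows, grid, tol=2e-3):
    """Return the candidate rotation on grid^3 that is compatible with the most flow vectors."""
    best, best_votes = None, -1
    for w in itertools.product(grid, repeat=3):
        votes = 0
        for (x, y), (fu, fv) in zip(points, flows):
            pu, pv = rotational_flow(x, y, w)
            # A flow vector votes for w if w predicts it within the tolerance.
            if math.hypot(pu - fu, pv - fv) < tol:
                votes += 1
        if votes > best_votes:
            best, best_votes = w, votes
    return best, best_votes


def demo():
    """Synthetic scene: 40 flow vectors follow the true rotation, 20 are random outliers."""
    rng = random.Random(0)
    grid = [i * 0.01 for i in range(-5, 6)]   # candidate values per axis, in radians
    true_w = (grid[7], grid[4], grid[8])      # the true rotation lies on the grid
    points = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(60)]
    flows = []
    for i, (x, y) in enumerate(points):
        if i < 40:
            flows.append(rotational_flow(x, y, true_w))
        else:
            flows.append((rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)))
    return true_w, vote_rotation(points, flows, grid)
```

Exhaustive voting like this scales poorly with grid resolution. It is only meant to show why a mode-seeking vote tolerates many outliers without the repeated sampling that RANSAC needs.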

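Summary
We present a method for estimating camera rotation in crowded real-world scenes from handheld monocular video. Although camera rotation estimation is a well-studied problem, no previous method offers both high accuracy and acceptable speed in this setting. Because existing datasets do not cover this setting well, we provide a new dataset and benchmark with high-accuracy, rigorously verified ground truth on 17 video sequences. Methods developed for wide-baseline stereo (e.g., 5-point methods) perform poorly on monocular video, while methods used in autonomous driving (e.g., SLAM) rely on specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. For dynamic scenes, common robustification techniques such as RANSAC need many iterations and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) that efficiently and robustly finds the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best and is more accurate than any method regardless of speed. This is a strong new performance point for crowded scenes. Our code and dataset are available at https://fabiendelattre.com/robust-rotation-estimation.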

Replacing softmax with ReLU in Vision Transformers

paper_url: http://arxiv.org/abs/2309.08586
repo_url: None
paper_authors: Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
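for: This study examines the accuracy loss caused by replacing the attention softmax with a point-wise activation, and finds that in vision transformers the degradation is mitigated when ReLU attention is divided by the sequence length.
methods: Small to large vision transformers are trained on ImageNet-21k, and the performance of ReLU-attention is compared with that of softmax-attention.
results: ReLU-attention approaches or matches softmax-attention in scaling behavior as a function of compute.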

Abstract
Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
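The change described in the abstract is small enough to show in code. The following is a minimal single-head sketch, assuming the usual scaled dot-product scores and plain Python lists to stay dependency-free. The function name `relu_attention` is illustrative and not taken from the paper.

```python
import math


def relu_attention(q, k, v):
    """Single-head attention with ReLU in place of softmax.

    q and k are lists of L vectors of size d; v is a list of L value vectors.
    The weights are relu(q . k / sqrt(d)) / L, so they are non-negative
    but, unlike softmax weights, do not sum to one.
    """
    seq_len = len(k)
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # ReLU instead of softmax, then divide by the sequence length.
        weights = [max(0.0, s) / seq_len for s in scores]
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

Because no normalization over the keys is needed, each weight depends only on its own query-key pair. The division by sequence length keeps the output magnitude roughly comparable to that of softmax attention, whose weights average 1/L.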

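Summary
Previous research observed that accuracy degrades when the attention softmax is replaced with a point-wise activation such as ReLU. For vision transformers, we find that dividing by the sequence length mitigates this degradation. Our experiments training small to large vision transformers on ImageNet-21k show that ReLU-attention can approach or match softmax-attention in scaling behavior as a function of compute.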

Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

paper_url: http://arxiv.org/abs/2309.08585
repo_url: None
paper_authors: Xiaonan Lu, Jianlong Yuan, Ruigang Niu, Yuan Hu, Fan Wang
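for: This work addresses a shortcoming of existing vision language foundation models (VLFMs) on the image change understanding (ICU) task: they handle single images well but cannot understand changes between multiple images.
methods: The authors propose a Viewpoint Integration and Registration method. Designed trainable adapters and fused adapters are inserted into pre-trained encoders to capture nuances between image pairs, and a viewpoint registration flow and a semantic emphasizing module reduce the effect of viewpoint changes in the visual and semantic spaces.
results: On CLEVR-Change and Spot-the-Diff, the method achieves state-of-the-art performance in all metrics.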

Abstract
Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.

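Summary
In recent years, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance on many tasks. These models have strong single-image understanding but lack the ability to understand multiple images, so they cannot be applied directly to image change understanding (ICU), which requires capturing the actual changes between multiple images and describing them in language. In this paper, we find that existing VLFMs perform poorly when applied directly to ICU for two reasons: (1) VLFMs generally learn a global representation of a single image, whereas ICU requires capturing nuances between multiple images; (2) the ICU performance of VLFMs is significantly affected by viewpoint variations, because the relationships between objects change when the viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Specifically, we introduce a fused adapter image encoder that inserts designed trainable adapters and fused adapters to effectively capture nuances between image pairs. We also design a viewpoint registration flow and a semantic emphasizing module to reduce the performance degradation caused by viewpoint variations in the visual and semantic spaces. Experimental results show that our method achieves state-of-the-art performance in all metrics on CLEVR-Change and Spot-the-Diff.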

The Impact of Different Backbone Architecture on Autonomous Vehicle Dataset

paper_url: http://arxiv.org/abs/2309.08564
repo_url: None
paper_authors: Ning Ding, Azim Eskandarian
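for: This study evaluates the object detection performance of different backbone architectures in an autonomous driving setting.
methods: Different backbone architectures are compared on three well-known autonomous driving datasets, KITTI, NuScenes, and BDD, to assess their performance on object detection tasks.
results: Performance varies considerably across backbone architectures and datasets, and some backbones perform better on particular datasets.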

Abstract
Object detection is a crucial component of autonomous driving, and many detection applications have been developed to address this task. These applications often rely on backbone architectures, which extract representation features from inputs to perform the object detection task. The quality of the features extracted by the backbone architecture can have a significant impact on the overall detection performance. Many researchers have focused on developing new and improved backbone architectures to enhance the efficiency and accuracy of object detection applications. While these backbone architectures have shown state-of-the-art performance on generic object detection datasets like MS-COCO and PASCAL-VOC, evaluating their performance under an autonomous driving environment has not been previously explored. To address this, our study evaluates three well-known autonomous vehicle datasets, namely KITTI, NuScenes, and BDD, to compare the performance of different backbone architectures on object detection tasks.

Summary
Object detection is a crucial component of autonomous driving, and many detection applications have been developed for it. These applications usually rely on backbone architectures, which extract representation features from the input for the detection task. The quality of the features extracted by the backbone directly affects overall detection performance. Many researchers have focused on developing new and improved backbone architectures to raise the efficiency and accuracy of object detection. Although these backbones achieve state-of-the-art performance on generic object detection datasets, their performance in an autonomous driving environment has not previously been evaluated. To address this, our study uses three well-known autonomous vehicle datasets, namely KITTI, NuScenes, and BDD, to compare the performance of different backbone architectures on object detection tasks.

Automated dermatoscopic pattern discovery by clustering neural network output for human-computer interaction

paper_url: http://arxiv.org/abs/2309.08533
repo_url: None
paper_authors: Lidia Talavera-Martinez, Philipp Tschandl
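for: This study aims to automatically discover human-interpretable visual patterns in large image datasets so that knowledge can be extracted from them.
methods: Image tiles are clustered automatically with k-means using neural network-extracted image features.
results: Choosing the cluster cutoff with the elbow method or a compactness metric yields highly interpretable visual patterns, and most clusters can be mapped to previously described dermatoscopic diagnostic patterns.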

Abstract
Background: As available medical image datasets increase in size, it becomes infeasible for clinicians to review content manually for knowledge extraction. The objective of this study was to create an automated clustering resulting in human-interpretable pattern discovery. Methods: Images from the public HAM10000 dataset, including 7 common pigmented skin lesion diagnoses, were tiled into 29420 tiles and clustered via k-means using neural network-extracted image features. The final number of clusters per diagnosis was chosen by either the elbow method or a compactness metric balancing intra-lesion variance and cluster numbers. The amount of resulting non-informative clusters, defined as those containing less than six image tiles, was compared between the two methods. Results: Applying k-means, the optimal elbow cutoff resulted in a mean of 24.7 (95%-CI: 16.4-33) clusters for every included diagnosis, including 14.9% (95% CI: 0.8-29.0) non-informative clusters. The optimal cutoff, as estimated by the compactness metric, resulted in significantly fewer clusters (13.4; 95%-CI 11.8-15.1; p=0.03) and less non-informative ones (7.5%; 95% CI: 0-19.5; p=0.017). The majority of clusters (93.6%) from the compactness metric could be manually mapped to previously described dermatoscopic diagnostic patterns. Conclusions: Automatically constraining unsupervised clustering can produce an automated extraction of diagnostically relevant and human-interpretable clusters of visual patterns from a large image dataset.
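Two quantities in the abstract are easy to make concrete: the share of non-informative clusters (fewer than six tiles) and an elbow cutoff on the k-means inertia curve. The sketch below is an illustration under stated assumptions. The elbow rule used here (the point farthest from the chord joining the ends of the curve) is one common heuristic and may differ from the paper's, and the compactness metric is not reproduced because the abstract does not define it. Both function names are hypothetical.

```python
from collections import Counter


def noninformative_fraction(labels, min_tiles=6):
    """Fraction of clusters that contain fewer than min_tiles image tiles."""
    sizes = Counter(labels)
    small = sum(1 for n in sizes.values() if n < min_tiles)
    return small / len(sizes)


def elbow_k(ks, inertias):
    """Pick the k whose (k, inertia) point lies farthest from the chord
    joining the first and last points of the curve (a common elbow heuristic)."""
    x0, y0, x1, y1 = ks[0], inertias[0], ks[-1], inertias[-1]

    def chord_distance(k, value):
        # Unnormalized distance to the chord; the missing normalizer is the
        # same for every point, so it does not change which point wins.
        return abs((y1 - y0) * (k - x0) - (x1 - x0) * (value - y0))

    return max(zip(ks, inertias), key=lambda p: chord_distance(*p))[0]
```

For example, with cluster labels of sizes 10, 3, and 6, only the 3-tile cluster counts as non-informative, so the fraction is one third. Note that this chord rule depends on the relative scaling of the two axes, so in practice the curve is often normalized first.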

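Summary
Background: As medical image datasets grow, clinicians can no longer review their content manually to extract knowledge. The goal of this study was to achieve human-interpretable pattern discovery through automated clustering. Methods: Images of pigmented skin lesion diagnoses from the public HAM10000 dataset were split into 29420 image tiles and clustered with k-means using neural network-extracted image features. The final number of clusters per diagnosis was chosen by either the elbow method or a compactness metric. The proportion of non-informative clusters, defined as those containing fewer than six image tiles, was computed for each diagnosis. Results: With k-means and the optimal elbow cutoff, the mean number of clusters per diagnosis was 24.7 (95% CI: 16.4-33), of which 14.9% (95% CI: 0.8-29.0) were non-informative. Choosing the cutoff with the compactness metric gave fewer clusters (13.4; 95% CI: 11.8-15.1) and a lower proportion of non-informative clusters (7.5%; 95% CI: 0-19.5). Most clusters (93.6%) could be manually mapped to known dermatoscopic diagnostic patterns. Conclusions: Automatically constraining unsupervised clustering can produce an automated extraction of diagnostically relevant, human-interpretable clusters of visual patterns.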

Breathing New Life into 3D Assets with Generative Repainting

paper_url: http://arxiv.org/abs/2309.08523
repo_url: https://github.com/toshas/remesh_isotropic_planar
paper_authors: Tianfu Wang, Menelaos Kanakis, Konrad Schindler, Luc Van Gool, Anton Obukhov
for: The paper proposes a diffusion-based method for repainting existing 3D assets.
methods: The method uses pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools that work together in a non-learned fashion, which keeps the pipeline modular.
results: A large-scale study on a wide range of objects and categories demonstrates the advantages of the approach, both qualitatively and quantitatively.

Abstract
Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators. Broad adoption of these models is due to significant improvement in the quality of generations and efficient conditioning on various modalities, not just text. However, lifting the rich generative priors of these 2D models into 3D is challenging. Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields. We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools and demonstrate their ability to work together in a non-learned fashion. Such modularity has the intrinsic advantage of eased partial upgrades, which became an important property in such a fast-paced domain. Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools, and outputs a painted input geometry in several formats. We conduct a large-scale study on a wide range of objects and categories from the ShapeNetSem dataset and demonstrate the advantages of our approach, both qualitatively and quantitatively. Project page: https://www.obukhov.ai/repainting_3d_assets

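Summary
Diffusion-based text-to-image models have attracted broad attention from the vision community, artists, and content creators. Their wide adoption stems from significant improvements in generation quality and in conditioning on multiple modalities. However, lifting the rich generative priors of these 2D models into 3D is challenging. Recent works have proposed pipelines built on the entanglement of diffusion models and neural fields. We study pretrained 2D diffusion models and standard 3D neural radiance fields as independent tools and show that they can work together in a non-learned fashion. This modular design has the built-in advantage that individual modules can be upgraded easily. Our pipeline accepts any renderable geometry, such as textured or untextured meshes, orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools, and outputs a painted version of the input geometry. We conduct a large-scale study on a wide range of objects and categories from the ShapeNetSem dataset and demonstrate the advantages of our approach, both qualitatively and quantitatively. Project page: https://www.obukhov.ai/repainting_3d_assets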

Generalised Probabilistic Diffusion Scale-Spaces

paper_url: http://arxiv.org/abs/2309.08511
repo_url: None
paper_authors: Pascal Peter
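for: The paper studies probabilistic diffusion models, which are used to sample new images.
methods: Motivated by drift-diffusion concepts from physics, these models apply perturbations such as noise and blur to images and describe the result with a tractable probability distribution.
results: The learned reverse process generates images and can be conditioned on side information. Because most research focuses on practice-oriented extensions while the theoretical background, in particular the relation to classical image filtering, remains largely unexplored, the paper proposes a generalised scale-space theory and shows conceptual and empirical connections to diffusion and osmosis filters.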

Abstract
Probabilistic diffusion models excel at sampling new images from learned distributions. Originally motivated by drift-diffusion concepts from physics, they apply image perturbations such as noise and blur in a forward process that results in a tractable probability distribution. A corresponding learned reverse process generates images and can be conditioned on side information, which leads to a wide variety of practical applications. Most of the research focus currently lies on practice-oriented extensions. In contrast, the theoretical background remains largely unexplored, in particular the relations to drift-diffusion. In order to shed light on these connections to classical image filtering, we propose a generalised scale-space theory for probabilistic diffusion models. Moreover, we show conceptual and empirical connections to diffusion and osmosis filters.
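As a point of reference for the forward process mentioned above, here is a minimal sketch of the widely used Gaussian-noise formulation, in which a sample at step t can be drawn in closed form from the clean image. This is the standard noise-only case and an assumption on our part. The paper's generalised theory also covers other perturbations such as blur, which this sketch does not model. The names `alpha_bar` and `forward_sample` and the noise schedule are illustrative.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for b in betas[: t + 1]:
        prod *= 1.0 - b
    return prod


def forward_sample(x0, betas, t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.

    x0 is a flat list of pixel values; the noise is standard Gaussian per pixel.
    """
    a = alpha_bar(betas, t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]
```

The distribution of x_t given x0 is Gaussian with a known mean and variance, which is what makes the forward process tractable. As t grows, alpha_bar shrinks and the sample approaches pure noise.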


OccupancyDETR: Making Semantic Scene Completion as Straightforward as Object Detection

paper_url: http://arxiv.org/abs/2309.08504
repo_url: https://github.com/jypjypjypjyp/occupancydetr
paper_authors: Yupeng Jia, Jie He, Runze Chen, Fang Zhao, Haiyong Luo
for: The paper proposes a novel 3D semantic occupancy perception method for robotic applications such as autonomous driving, aiming to improve a robot's understanding of its surroundings while reducing computational demand.
methods: The proposed method, OccupancyDETR, consists of a DETR-like object detection module and a 3D occupancy decoder module. It identifies objects in the scene and predicts a 3D occupancy grid for each, which simplifies the method and improves performance on small objects.
results: On the SemanticKITTI dataset, the method achieves an mIoU of 23 at a processing speed of 6 frames per second, showing its potential for real-time 3D semantic scene completion.

Abstract
Visual-based 3D semantic occupancy perception (also known as 3D semantic scene completion) is a new perception paradigm for robotic applications like autonomous driving. Compared with Bird's Eye View (BEV) perception, it extends the vertical dimension, significantly enhancing the ability of robots to understand their surroundings. However, due to this very reason, the computational demand for current 3D semantic occupancy perception methods generally surpasses that of BEV perception methods and 2D perception methods. We propose a novel 3D semantic occupancy perception method, OccupancyDETR, which consists of a DETR-like object detection module and a 3D occupancy decoder module. The integration of object detection simplifies our method structurally - instead of predicting the semantics of each voxel, it identifies objects in the scene and their respective 3D occupancy grids. This speeds up our method, reduces required resources, and leverages object detection algorithms, giving our approach notable performance on small objects. We demonstrate the effectiveness of our proposed method on the SemanticKITTI dataset, showcasing an mIoU of 23 and a processing speed of 6 frames per second, thereby presenting a promising solution for real-time 3D semantic scene completion.
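The mIoU figure quoted above is the mean, over semantic classes, of the intersection-over-union between predicted and ground-truth voxel labels. The following is a minimal sketch of that metric on flattened label lists. It assumes that classes absent from both prediction and ground truth are skipped, which is a common convention but may differ from the exact SemanticKITTI evaluation protocol. The function name `mean_iou` is illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        # Voxels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Voxels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

The function returns a value in [0, 1]. A reported score such as 23 is presumably the same quantity expressed as a percentage.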

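Summary
Visual-based 3D semantic occupancy perception (also known as 3D semantic scene completion) is a new perception paradigm for robotic applications such as autonomous driving. Compared with Bird's Eye View (BEV) perception, it extends the vertical dimension, which markedly improves a robot's understanding of its surroundings. For this very reason, however, the computational demand of current 3D semantic occupancy perception methods generally exceeds that of BEV and 2D perception methods. We propose a new 3D semantic occupancy perception method, OccupancyDETR, which consists of a DETR-like object detection module and a 3D occupancy decoder module. Integrating object detection makes the method structurally simpler: instead of predicting the semantics of every voxel, it identifies the objects in the scene and their respective 3D occupancy grids. This makes the method faster, reduces the resources it needs, and lets it leverage object detection algorithms, giving notable performance on small objects. On the SemanticKITTI dataset, the method reaches an mIoU of 23 at a processing speed of 6 frames per second, making it a promising solution for real-time 3D semantic scene completion.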

YCB-Ev: Event-vision dataset for 6DoF object pose estimation

paper_url: http://arxiv.org/abs/2309.08482
repo_url: https://github.com/paroj/ycbev
paper_authors: Pavel Rojtberg, Thomas Pöllabauer
for: Introduces the YCB-Ev dataset, a dataset of synchronized RGB-D and event data for evaluating 6DoF object pose estimation algorithms.
methods: Provides ground truth 6DoF object poses for the same 21 YCB objects as the YCB-Video dataset, enabling evaluation of algorithm performance when transferred across datasets.
results: Evaluates the generalization capabilities of two state-of-the-art algorithms using the novel YCB-V sequences in the dataset.

Abstract
Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data that enables evaluating 6DoF object pose estimation algorithms using these modalities. This dataset provides ground truth 6DoF object poses for the same 21 YCB objects \cite{calli2017yale} that were used in the YCB-Video (YCB-V) dataset, enabling the evaluation of algorithm performance when transferred across datasets. The dataset consists of 21 synchronized event and RGB-D sequences, amounting to a total of 7:43 minutes of video. Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Our dataset is the first to provide ground truth 6DoF pose data for event streams. Furthermore, we evaluate the generalization capabilities of two state-of-the-art algorithms, which were pre-trained for the BOP challenge, using our novel YCB-V sequences. The proposed dataset is available at https://github.com/paroj/ycbev.

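Summary
Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data and can be used to evaluate 6DoF object pose estimation algorithms on these modalities. The dataset provides ground truth 6DoF poses for the same 21 YCB objects used in the YCB-Video (YCB-V) dataset, so algorithm performance can be evaluated when transferred across datasets. It consists of 21 synchronized event and RGB-D sequences, 7 minutes and 43 seconds of video in total. Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Our dataset is the first to provide ground truth 6DoF pose data for event streams. In addition, we use our novel YCB-V sequences to evaluate the generalization capabilities of two state-of-the-art algorithms. The dataset is available at https://github.com/paroj/ycbev.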

3D Arterial Segmentation via Single 2D Projections and Depth Supervision in Contrast-Enhanced CT Images

paper_url: http://arxiv.org/abs/2309.08481
repo_url: https://github.com/alinafdima/3dseg-mip-depth
paper_authors: Alina F. Dima, Veronika A. Zimmer, Martin J. Menten, Hongwei Bran Li, Markus Graf, Tristan Lemke, Philipp Raffler, Robert Graf, Jan S. Kirschke, Rickmer Braren, Daniel Rueckert
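for: The paper proposes a new 3D arterial segmentation method to support the diagnosis and treatment of vascular diseases.
methods: The method uses deep learning and needs only one annotated 2D projection per training image, combined with depth supervision.
results: With a single annotated projection per sample, the method performs comparably to annotating multiple 2D projections, which reduces the annotation effort, and depth supervision almost closes the gap to full 3D supervision.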

Abstract
Automated segmentation of the blood vessels in 3D volumes is an essential step for the quantitative diagnosis and treatment of many vascular diseases. 3D vessel segmentation is being actively investigated in existing works, mostly in deep learning approaches. However, training 3D deep networks requires large amounts of manual 3D annotations from experts, which are laborious to obtain. This is especially the case for 3D vessel segmentation, as vessels are sparse yet spread out over many slices and disconnected when visualized in 2D slices. In this work, we propose a novel method to segment the 3D peripancreatic arteries solely from one annotated 2D projection per training image with depth supervision. We perform extensive experiments on the segmentation of peripancreatic arteries on 3D contrast-enhanced CT images and demonstrate how well we capture the rich depth information from 2D projections. We demonstrate that by annotating a single, randomly chosen projection for each training sample, we obtain comparable performance to annotating multiple 2D projections, thereby reducing the annotation effort. Furthermore, by mapping the 2D labels to the 3D space using depth information and incorporating this into training, we almost close the performance gap between 3D supervision and 2D supervision. Our code is available at: https://github.com/alinafdima/3Dseg-mip-depth.
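To make the idea of lifting 2D labels into 3D with depth concrete, the sketch below builds a projection and a depth map from a volume and then places each annotated 2D pixel at its recorded depth. It assumes a maximum intensity projection along the first axis, which the repository name ("mip") suggests but the abstract does not state. How the resulting sparse 3D labels enter the training loss is not covered here. The function names `mip_with_depth` and `lift_labels` are hypothetical.

```python
def mip_with_depth(volume):
    """Project volume[z][y][x] along z.

    Returns (mip, depth): the maximum intensity per (y, x) column and the
    z index at which that maximum occurs (the first one on ties).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    mip = [[0.0] * nx for _ in range(ny)]
    depth = [[0] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            column = [volume[z][y][x] for z in range(nz)]
            best_z = max(range(nz), key=column.__getitem__)
            mip[y][x] = column[best_z]
            depth[y][x] = best_z
    return mip, depth


def lift_labels(label_2d, depth, nz):
    """Place each foreground pixel of a 2D annotation at its depth in a 3D label volume."""
    ny, nx = len(label_2d), len(label_2d[0])
    label_3d = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for y in range(ny):
        for x in range(nx):
            if label_2d[y][x]:
                label_3d[depth[y][x]][y][x] = 1
    return label_3d
```

The lifted labels are sparse, with at most one labeled voxel per projected pixel, so they can supervise only part of the volume. Contrast-enhanced arteries are bright, which is why a maximum intensity projection is a plausible way to keep them visible in 2D.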

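Summary
Automated segmentation of blood vessels in 3D volumes is a key step in the diagnosis and treatment of many vascular diseases. Many existing works study 3D vessel segmentation with deep learning, but training 3D deep networks requires large amounts of manual 3D annotations from experts, which are hard to obtain. This is especially true for 3D vessel segmentation, because vessels are sparse, spread over many slices, and appear disconnected in 2D slices. We propose a new method that segments 3D vessels from a single annotated 2D projection, requiring only one randomly chosen 2D projection per training sample, which reduces the annotation effort. Extensive experiments on 3D contrast-enhanced CT images show that rich depth information can be captured from 2D projections. In addition, we map the 2D labels into 3D space using depth information and incorporate this into training. Our code is available at: https://github.com/alinafdima/3Dseg-mip-depth.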

PoseFix: Correcting 3D Human Poses with Natural Language

paper_url: http://arxiv.org/abs/2309.08480
repo_url: None
paper_authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez
for: Correcting 3D human poses with natural language.
methods: Introduces the PoseFix dataset and demonstrates it on two tasks, text-based pose editing and correctional text generation.
results: Shows potential for assisted 3D character animation and robot teaching.

Abstract
Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback, that describe how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, that aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses.

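Summary
Automatically generating instructions for modifying one's posture could open up countless applications, such as personalized coaching and in-home physical therapy. Solving the reverse problem (i.e., refining a 3D human pose based on natural language feedback) could help in areas such as assisted 3D character animation or robot teaching. Although a few recent works explore the connection between natural language and 3D human pose, none of them focuses on describing differences between 3D body poses. In this paper, we tackle the problem of correcting 3D human poses and introduce the PoseFix dataset, which consists of several thousand paired 3D poses with corresponding natural language feedback describing how the source pose must be modified to obtain the target pose. We demonstrate the dataset on two tasks: (1) text-based pose editing, which generates a corrected 3D body pose from a query pose and a text modifier; and (2) correctional text generation, which generates instructions from the differences between two body poses.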

TreeLearn: A Comprehensive Deep Learning Method for Segmenting Individual Trees from Forest Point Clouds

paper_url: http://arxiv.org/abs/2309.08471
repo_url: https://github.com/ecker-lab/treelearn
paper_authors: Jonathan Henrich, Jan van Delden, Dominik Seidel, Thomas Kneib, Alexander Ecker
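for: The study proposes a deep learning-based method for semantic and instance segmentation of forest point clouds, to improve information extraction for forest management.
methods: The method is trained on already segmented point clouds and learns semantic and instance features in a data-driven manner, without hand-designed features and algorithms.
results: On the benchmark dataset, the method performs equally well or better than the Lidar360 algorithm that labeled its training data, and fine-tuning on the cleanly labeled benchmark data improves performance substantially.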

Abstract
Laser-scanned point clouds of forests make it possible to extract valuable information for forest management. To consider single trees, a forest point cloud needs to be segmented into individual tree point clouds. Existing segmentation methods are usually based on hand-crafted algorithms, such as identifying trunks and growing trees from them, and face difficulties in dense forests with overlapping tree crowns. In this study, we propose TreeLearn, a deep learning-based approach for semantic and instance segmentation of forest point clouds. Unlike previous methods, TreeLearn is trained on already segmented point clouds in a data-driven manner, making it less reliant on predefined features and algorithms. Additionally, we introduce a new manually segmented benchmark forest dataset containing 156 full trees and 79 partial trees that have been cleanly segmented by hand. This enables the evaluation of instance segmentation performance going beyond just evaluating the detection of individual trees. We trained TreeLearn on forest point clouds of 6665 trees, labeled using the Lidar360 software. An evaluation on the benchmark dataset shows that TreeLearn performs equally well or better than the algorithm used to generate its training data. Furthermore, the method's performance can be vastly improved by fine-tuning on the cleanly labeled benchmark dataset. The TreeLearn code is available from https://github.com/ecker-lab/TreeLearn. The data as well as trained models can be found at https://doi.org/10.25625/VPMPID.

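Summary
Laser-scanned point clouds of forests can provide valuable information for forest management. To consider single trees, a forest point cloud needs to be segmented into individual tree point clouds. Existing segmentation methods are usually based on hand-crafted algorithms, such as identifying trunks and growing trees from them, and they struggle in dense forests where tree crowns overlap. In this study, we propose TreeLearn, a deep learning-based method for semantic and instance segmentation of forest point clouds. Unlike previous methods, TreeLearn is trained in a data-driven manner, so it does not depend on predefined features and algorithms. We also provide a new manually segmented forest dataset containing 156 full trees and 79 partial trees that have been cleanly segmented by hand. This makes it possible to evaluate instance segmentation performance rather than only the detection of individual trees. We trained TreeLearn on point clouds of 6665 trees labeled with the Lidar360 software. An evaluation on the benchmark dataset shows that TreeLearn performs equally well or better than the algorithm that generated its training data. Fine-tuning on the cleanly labeled benchmark dataset improves the method's performance substantially. The TreeLearn code is available from https://github.com/ecker-lab/TreeLearn, and the data and trained models can be found at https://doi.org/10.25625/VPMPID.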

Segment Anything Model for Brain Tumor Segmentation

paper_url: http://arxiv.org/abs/2309.08434
repo_url: None
paper_authors: Peng Zhang, Yaping Wang
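for: Accurate segmentation of glioma brain tumors is essential for clinical diagnosis and treatment.
methods: The study applies the Segment Anything Model (SAM), released by Meta AI, to image segmentation without any model fine-tuning.
results: Without fine-tuning, SAM still lags behind the current state-of-the-art (SOTA) model on brain tumor segmentation.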

Abstract
Glioma is a prevalent brain tumor that poses a significant health risk to individuals. Accurate segmentation of brain tumors is essential for clinical diagnosis and treatment. The Segment Anything Model (SAM), released by Meta AI, is a fundamental model in image segmentation and has excellent zero-sample generalization capabilities. Thus, it is interesting to apply SAM to the task of brain tumor segmentation. In this study, we evaluated the performance of SAM on brain tumor segmentation and found that without any model fine-tuning, there is still a gap between SAM and the current state-of-the-art (SOTA) model.

Summary
Glioma is a common brain tumor that poses a significant risk to individual health. Accurate segmentation of brain tumors is key to diagnosis and treatment. The Segment Anything Model (SAM), released by Meta AI, is a fundamental image segmentation model with excellent zero-sample generalization capabilities, so we apply it to the task of brain tumor segmentation. In this study, we evaluated the performance of SAM on brain tumor segmentation and found that, without any model fine-tuning, there is still a gap between SAM and the current state-of-the-art (SOTA) model.

X-PDNet: Accurate Joint Plane Instance Segmentation and Monocular Depth Estimation with Cross-Task Distillation and Boundary Correction

paper_url: http://arxiv.org/abs/2309.08424
repo_url: https://github.com/caodinhduc/x-pdnet-official
paper_authors: Duc Cao Dinh, J Lim
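for: The paper proposes a multitask learning framework for joint plane instance segmentation and depth estimation.
methods: Building on approaches that use feature fusion mechanisms and geometric constraint losses to exploit both visual and geometric properties of images, the paper adds a cross-task distillation design that promotes early information sharing between the two tasks to improve each of them.
results: Experiments show that the proposed method outperforms the baseline by large margins in the quantitative results on the ScanNet and Stanford 2D-3D-S datasets.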

Abstract
Segmentation of planar regions from a single RGB image is a particularly important task in the perception of complex scenes. To utilize both visual and geometric properties in images, recent approaches often formulate the problem as a joint estimation of planar instances and dense depth through feature fusion mechanisms and geometric constraint losses. Despite promising results, these methods do not consider cross-task feature distillation and perform poorly in boundary regions. To overcome these limitations, we propose X-PDNet, a framework for the multitask learning of plane instance segmentation and depth estimation with improvements in the following two aspects. Firstly, we construct the cross-task distillation design which promotes early information sharing between dual-tasks for specific task improvements. Secondly, we highlight the current limitations of using the ground truth boundary to develop boundary regression loss, and propose a novel method that exploits depth information to support precise boundary region segmentation. Finally, we manually annotate more than 3000 images from Stanford 2D-3D-Semantics dataset and make available for evaluation of plane instance segmentation. Through the experiments, our proposed methods prove the advantages, outperforming the baseline with large improvement margins in the quantitative results on the ScanNet and the Stanford 2D-3D-S dataset, demonstrating the effectiveness of our proposals.

MIML: Multiplex Image Machine Learning for High Precision Cell Classification via Mechanical Traits within Microfluidic Systems

paper_url: http://arxiv.org/abs/2309.08421
repo_url: None
paper_authors: Khayrul Islam, Ratul Paul, Shen Wang, Yaling Liu
for: This paper aims to develop a novel machine learning framework for label-free cell classification, addressing the limitations of existing techniques in terms of specificity and speed.
methods: The proposed framework, called Multiplex Image Machine Learning (MIML), combines label-free cell images with biomechanical property data to offer a more holistic understanding of cellular properties.
results: The MIML approach achieves a remarkable 98.3% accuracy in cell classification, outperforming models that only consider a single data type. It has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its flexibility and transfer learning capability.

Abstract
Label-free cell classification is advantageous for supplying pristine cells for further use or examination, yet existing techniques frequently fall short in terms of specificity and speed. In this study, we address these limitations through the development of a novel machine learning framework, Multiplex Image Machine Learning (MIML). This architecture uniquely combines label-free cell images with biomechanical property data, harnessing the vast, often underutilized morphological information intrinsic to each cell. By integrating both types of data, our model offers a more holistic understanding of the cellular properties, utilizing morphological information typically discarded in traditional machine learning models. This approach has led to a remarkable 98.3% accuracy in cell classification, a substantial improvement over models that only consider a single data type. MIML has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its inherent flexibility and transfer learning capability. It's particularly effective for cells with similar morphology but distinct biomechanical properties. This innovative approach has significant implications across various fields, from advancing disease diagnostics to understanding cellular behavior.

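Summary
Label-free cell classification has the advantage of supplying undamaged cells for further use or examination, yet existing techniques frequently fall short in specificity and speed. In this study, we address these limitations by developing a new machine learning framework, Multiplex Image Machine Learning (MIML). The architecture combines label-free cell images with biomechanical property data, exploiting the often underutilized morphological information intrinsic to each cell. By integrating both data types, the model gains a more holistic understanding of cellular properties and reaches 98.3% classification accuracy. MIML has been shown to classify white blood cells and tumor cells effectively, and its inherent flexibility and transfer learning capability give it potential for broader application. It is particularly effective for cells with similar morphology but distinct biomechanical properties. This approach has implications across fields, from advancing disease diagnostics to understanding cellular behavior.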

Deformable Neural Radiance Fields using RGB and Event Cameras

paper_url: http://arxiv.org/abs/2309.08416
repo_url: None
paper_authors: Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
for: Modeling fast-moving deformable objects from visual data with neural radiance fields.
methods: The method uses RGB and event cameras, taking the asynchronous event stream together with calibrated sparse RGB frames, and jointly optimizes the unknown camera poses at individual events and the radiance field.
results: Experiments on realistically rendered graphics and real-world datasets show a significant benefit over state-of-the-art methods and the compared baseline.

Abstract
Modeling Neural Radiance Fields for fast-moving deformable objects from visual data alone is a challenging problem. A major issue arises due to the high deformation and low acquisition rates. To address this problem, we propose to use event cameras that offer very fast acquisition of visual change in an asynchronous manner. In this work, we develop a novel method to model the deformable neural radiance fields using RGB and event cameras. The proposed method uses the asynchronous stream of events and calibrated sparse RGB frames. In our setup, the camera pose at the individual events required to integrate them into the radiance fields remains unknown. Our method jointly optimizes these poses and the radiance field. This happens efficiently by leveraging the collection of events at once and actively sampling the events during learning. Experiments conducted on both realistically rendered graphics and real-world datasets demonstrate a significant benefit of the proposed method over the state-of-the-art and the compared baseline. This shows a promising direction for modeling deformable neural radiance fields in real-world dynamic scenes.

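Summary
Modeling neural radiance fields for fast-moving deformable objects from visual data alone is challenging, mainly because of high deformation and low acquisition rates. To address this, we propose using event cameras, which capture visual change very quickly and asynchronously. In this work, we develop a new method for modeling deformable neural radiance fields with RGB and event cameras. The method uses the asynchronous event stream and calibrated sparse RGB frames. In our setup, the camera pose at each individual event, which is required to integrate the event into the radiance field, is unknown. Our method jointly optimizes these poses and the radiance field, and does so efficiently by leveraging the collection of events at once and actively sampling events during learning. Experiments on realistically rendered graphics and real-world datasets show a significant benefit over the state of the art and the compared baseline, indicating a promising direction for modeling deformable neural radiance fields in real-world dynamic scenes.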

3D SA-UNet: 3D Spatial Attention UNet with 3D ASPP for White Matter Hyperintensities Segmentation

paper_url: http://arxiv.org/abs/2309.08402
repo_url: https://github.com/hjkuijf/wmhchallenge
paper_authors: Changlu Guo
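for: This study aims to improve the accuracy of automatic white matter hyperintensity (WMH) segmentation in FLAIR images, in support of early diagnosis of various diseases.
methods: We propose a deep learning model, 3D Spatial Attention U-Net (3D SA-UNet), that segments WMH automatically using only FLAIR scans. It introduces a 3D spatial attention module that highlights important lesion features such as WMH while suppressing unimportant regions. We also extend the Atrous Spatial Pyramid Pooling (ASPP) module to 3D to capture features at different scales and improve segmentation performance.
results: Evaluation on a publicly available dataset demonstrates the effectiveness of the 3D spatial attention module and 3D ASPP for WMH segmentation. The proposed 3D SA-UNet achieves higher accuracy than other state-of-the-art 3D convolutional neural networks.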

Abstract
White Matter Hyperintensity (WMH) is an imaging feature related to various diseases such as dementia and stroke. Accurately segmenting WMH using computer technology is crucial for early disease diagnosis. However, this task remains challenging due to the small lesions with low contrast and high discontinuity in the images, which contain limited contextual and spatial information. To address this challenge, we propose a deep learning model called 3D Spatial Attention U-Net (3D SA-UNet) for automatic WMH segmentation using only Fluid Attenuation Inversion Recovery (FLAIR) scans. The 3D SA-UNet introduces a 3D Spatial Attention Module that highlights important lesion features, such as WMH, while suppressing unimportant regions. Additionally, to capture features at different scales, we extend the Atrous Spatial Pyramid Pooling (ASPP) module to a 3D version, enhancing the segmentation performance of the network. We evaluate our method on publicly available dataset and demonstrate the effectiveness of 3D spatial attention module and 3D ASPP in WMH segmentation. Through experimental results, it has been demonstrated that our proposed 3D SA-UNet model achieves higher accuracy compared to other state-of-the-art 3D convolutional neural networks.
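
To make the atrous idea behind ASPP concrete, here is a minimal sketch in plain Python, assuming a 1D signal for brevity (the paper works in 3D). The names `atrous_conv1d` and `multi_rate_features` and the dilation rates are hypothetical and not taken from the paper. ASPP-style modules run several such convolutions in parallel at different rates and combine the outputs.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid-mode 1D atrous (dilated) convolution, computed as cross-correlation.

    Kernel taps are spaced `rate` samples apart, so a larger rate widens the
    receptive field without adding weights.
    """
    span = (len(kernel) - 1) * rate  # distance covered by the dilated kernel
    return [
        sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def multi_rate_features(signal, kernel, rates=(1, 2)):
    """Apply one kernel at several hypothetical rates, one output list per rate.

    Real ASPP branches learn separate weights; sharing the kernel here is a
    simplification for illustration.
    """
    return {rate: atrous_conv1d(signal, kernel, rate) for rate in rates}
```

With the kernel `[1, 0, -1]`, rate 1 compares samples two positions apart, while rate 2 compares samples four positions apart using the same three weights.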

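Summary
White matter hyperintensity (WMH) is an imaging feature associated with various diseases, such as dementia and stroke. Accurate computer-based WMH segmentation is crucial for early disease diagnosis. The task remains challenging, however, because the lesions are small, low in contrast, and highly discontinuous, and the images carry limited contextual and spatial information. To address this, we propose a deep learning model called 3D Spatial Attention U-Net (3D SA-UNet) for automatic WMH segmentation using only Fluid Attenuation Inversion Recovery (FLAIR) scans. Its 3D Spatial Attention Module highlights important lesion features, such as WMH, while suppressing unimportant regions. In addition, to capture features at different scales, we extend the Atrous Spatial Pyramid Pooling (ASPP) module to 3D, which improves the network's segmentation performance. We evaluate the method on a publicly available dataset and demonstrate the effectiveness of the 3D spatial attention module and 3D ASPP for WMH segmentation. Experimental results show that the proposed 3D SA-UNet achieves higher accuracy than other state-of-the-art 3D convolutional neural networks.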

An inspection technology of inner surface of the fine hole based on machine vision

paper_url: http://arxiv.org/abs/2309.08649
repo_url: None
paper_authors: Rongfang He, Weibin Zhang, Guofang Gao
for: detect the quality of the inner surface of fine holes in industrial components
methods: uses a special optical measurement system with a sight pipe and flexible light array to guide external illumination light into the fine hole and output relevant images
results: can measure the inner surface quality of fine holes with a minimum diameter of 4 mm and a depth of up to 47 mm, with a maximum measurement error standard deviation of about 10 μm on the tested defects

Abstract
Fine holes are an important structural component of industrial components, and their inner surface quality is closely related to their function. In order to detect the quality of the inner surface of the fine hole, a special optical measurement system was investigated in this paper. A sight pipe is employed to guide the external illumination light into the fine hole and output the relevant images simultaneously. A flexible light array is introduced to suit the narrow space, and the effective field of view is analyzed. Besides, the arc surface projection error and manufacturing assembly error of the device are analyzed, then compensated or ignored if small enough. In the test of prefabricated circular defects with the diameter φ0.1 mm, φ0.2 mm, 0.4 mm distance distribution and the fissure defects with the width 0.3 mm, the maximum measurement error standard deviations are all about 10 μm. The minimum diameter of the measured fine hole is 4 mm and the depth can reach 47 mm.

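Summary
Fine holes are an important structural component of industrial parts, and their inner surface quality is closely related to their function. To detect the quality of the inner surface of a fine hole, this paper investigates a special optical measurement system. A sight pipe guides external illumination light into the fine hole and outputs the relevant images at the same time. A flexible light array is introduced to suit the narrow space, and the effective field of view is analyzed. The arc surface projection error and the manufacturing assembly error of the device are also analyzed, then compensated, or ignored if small enough. In tests on prefabricated circular defects with the diameter φ0.1 mm, φ0.2 mm, 0.4 mm distance distribution and fissure defects with the width 0.3 mm, the maximum measurement error standard deviations are all about 10 μm. The minimum diameter of a measured fine hole is 4 mm, and the depth can reach 47 mm.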

Double Domain Guided Real-Time Low-Light Image Enhancement for Ultra-High-Definition Transportation Surveillance

paper_url: http://arxiv.org/abs/2309.08382
repo_url: https://github.com/qujx/ddnet
paper_authors: Jingxiang Qu, Ryan Wen Liu, Yuan Gao, Yu Guo, Fenghua Zhu, Fei-yue Wang
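for: This paper proposes an efficient and reliable low-light image enhancement network (DDNet) for real-time, ultra-high-definition transportation surveillance in intelligent transportation systems (ITS).
methods: The network uses an encoder-decoder structure as its main architecture and divides enhancement into two subtasks, color enhancement and gradient enhancement, handled by the proposed coarse enhancement module (CEM) and LoG-based gradient enhancement module (GEM), so that color and edge features are enhanced simultaneously.
results: Evaluation experiments show that DDNet provides better enhancement quality and efficiency than state-of-the-art methods. Object detection and scene segmentation experiments also indicate its practical benefit for higher-level image analysis in low-light environments.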

Abstract
Real-time transportation surveillance is an essential part of the intelligent transportation system (ITS). However, images captured under low-light conditions often suffer the poor visibility with types of degradation, such as noise interference and vague edge features, etc. With the development of imaging devices, the quality of the visual surveillance data is continually increasing, like 2K and 4K, which has more strict requirements on the efficiency of image processing. To satisfy the requirements on both enhancement quality and computational speed, this paper proposes a double domain guided real-time low-light image enhancement network (DDNet) for ultra-high-definition (UHD) transportation surveillance. Specifically, we design an encoder-decoder structure as the main architecture of the learning network. In particular, the enhancement processing is divided into two subtasks (i.e., color enhancement and gradient enhancement) via the proposed coarse enhancement module (CEM) and LoG-based gradient enhancement module (GEM), which are embedded in the encoder-decoder structure. It enables the network to enhance the color and edge features simultaneously. Through the decomposition and reconstruction on both color and gradient domains, our DDNet can restore the detailed feature information concealed by the darkness with better visual quality and efficiency. The evaluation experiments on standard and transportation-related datasets demonstrate that our DDNet provides superior enhancement quality and efficiency compared with the state-of-the-art methods. Besides, the object detection and scene segmentation experiments indicate the practical benefits for higher-level image analysis under low-light environments in ITS.
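
The gradient enhancement module is described as LoG-based, that is, built on the Laplacian of Gaussian operator. As a rough illustration of that operator only (not the authors' GEM), the sketch below builds a discrete LoG kernel in plain Python. The function name `log_kernel` and the parameters `size` and `sigma` are hypothetical, and `size` is assumed to be odd.

```python
import math


def log_kernel(size=5, sigma=1.0):
    """Return a size x size Laplacian-of-Gaussian kernel, shifted to zero sum."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            scaled = r2 / (2.0 * sigma ** 2)
            # LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2))
            row.append(-(1.0 / (math.pi * sigma ** 4)) * (1.0 - scaled) * math.exp(-scaled))
        kernel.append(row)
    # Subtract the mean so that flat image regions give a zero response.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[value - mean for value in row] for row in kernel]
```

Convolving an image with such a kernel responds strongly at edges and stays near zero in flat regions, which is why LoG filters are a common starting point for edge and gradient features.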

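Summary
Real-time transportation surveillance is an essential part of the intelligent transportation system (ITS). However, images captured under low-light conditions often suffer from degraded quality, such as noise interference and blurred edges. As imaging devices develop, the quality of visual surveillance data keeps increasing, for example to 2K and 4K, which places stricter requirements on image processing efficiency. To meet the requirements on both enhancement quality and computational speed, this paper proposes a double domain guided real-time low-light image enhancement network (DDNet) for ultra-high-definition (UHD) transportation surveillance. Specifically, we design an encoder-decoder structure as the main architecture of the learning network. The enhancement processing is divided into two subtasks (color enhancement and gradient enhancement) through the proposed coarse enhancement module (CEM) and LoG-based gradient enhancement module (GEM), both embedded in the encoder-decoder structure. This lets the network enhance color and edge features simultaneously. Through decomposition and reconstruction in both the color and gradient domains, DDNet restores detail hidden by darkness with better visual quality and efficiency. Experiments on standard and transportation-related datasets show that DDNet provides better enhancement quality and efficiency than state-of-the-art methods. Object detection and scene segmentation experiments further indicate practical benefits for higher-level image analysis in low-light ITS environments.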

Reconsidering evaluation practices in modular systems: On the propagation of errors in MRI prostate cancer detection

paper_url: http://arxiv.org/abs/2309.08381
repo_url: None
paper_authors: Erlend Sortland Rolfsnes, Philip Thangngat, Trygve Eftestøl, Tobias Nordström, Fredrik Jäderling, Martin Eklund, Alvaro Fernandez-Quilez
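for: This paper examines how errors propagate between modules in AI systems for MRI-based prostate cancer (PCa) detection, and argues for evaluating the system as a whole.
methods: The paper evaluates the effect of two segmentation networks (s1 and s2) with heterogeneous performance on the downstream lesion detection stage, and compares the results with an idealistic setting that assumes a highly accurate segmentation.
results: Detection results differ between the segmentation networks and the idealistic setting (s1: 89.90+-2.23 vs 88.97+-3.06 ncsPCa, P<.001; 89.30+-4.07 and 88.12+-2.71 csPCa, P<.001). These results show the importance of evaluating the whole system rather than a single module.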

Abstract
Magnetic resonance imaging has evolved as a key component for prostate cancer (PCa) detection, substantially increasing the radiologist workload. Artificial intelligence (AI) systems can support radiological assessment by segmenting and classifying lesions in clinically significant (csPCa) and non-clinically significant (ncsPCa). Commonly, AI systems for PCa detection involve an automatic prostate segmentation followed by the lesion detection using the extracted prostate. However, evaluation reports are typically presented in terms of detection under the assumption of the availability of a highly accurate segmentation and an idealistic scenario, omitting the propagation of errors between modules. For that purpose, we evaluate the effect of two different segmentation networks (s1 and s2) with heterogeneous performances in the detection stage and compare it with an idealistic setting (s1:89.90+-2.23 vs 88.97+-3.06 ncsPCa, P<.001, 89.30+-4.07 and 88.12+-2.71 csPCa, P<.001). Our results depict the relevance of a holistic evaluation, accounting for all the sub-modules involved in the system.
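
One way to see why upstream errors matter is to score a predicted segmentation against a reference before it is passed on to detection. The sketch below computes the Dice coefficient on flat binary masks. Dice is an assumption chosen for illustration (the abstract does not name its metrics), and `dice` is a hypothetical helper.

```python
def dice(pred, ref):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(pred) + sum(ref)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A detector evaluated only on reference masks implicitly assumes a Dice of 1.0 upstream, whereas a deployed pipeline receives whatever the segmentation network actually produces.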

Beyond Domain Gap: Exploiting Subjectivity in Sketch-Based Person Retrieval

paper_url: http://arxiv.org/abs/2309.08372
repo_url: https://github.com/lin-kayla/subjectivity-sketch-reid
paper_authors: Kejun Lin, Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Shin’ichi Satoh
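for: This study targets person re-identification (re-ID) when sketches are the only available information, and treats subjectivity as a challenge alongside the domain gap.
methods: Two new designs address subjectivity: 1) a non-local (NL) fusion module that gathers sketches of the same identity from different witnesses to reduce the effect of subjectivity; 2) an AttrAlign module that uses attributes as an implicit mask to align cross-domain features.
results: The method achieves leading performance on three benchmarks: large-scale, multi-style, and cross-style.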

Abstract
Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by cameras and, therefore, needs to be retrieved using subjective information (e.g., sketches from witnesses). Previous research defines this case using the sketch as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. Actually, subjectivity is another significant challenge. We model and investigate it by posing a new dataset with multi-witness descriptions. It features two aspects. 1) Large-scale. It contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style. Our dataset offers multiple sketches for each identity. Witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch styles. We further have two novel designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity. We propose a non-local (NL) fusion module that gathers sketches from different witnesses for the same identity. 2) Introducing objectivity. An AttrAlign module utilizes attributes as an implicit mask to align cross-domain features. To push forward the advance of Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate our leading performance in these benchmarks. Dataset and Codes are publicly available at: https://github.com/Lin-Kayla/subjectivity-sketch-reid

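Summary
Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by any camera and therefore has to be retrieved using subjective information, such as sketches from witnesses. Previous research defines this case as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. Subjectivity, however, is another significant challenge. We model and investigate it with a new dataset of multi-witness descriptions, which has two features. 1) Large-scale: it contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style: the dataset offers multiple sketches for each identity; witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch style. We also present two new designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity: a non-local (NL) fusion module gathers sketches of the same identity from different witnesses. 2) Introducing objectivity: an AttrAlign module uses attributes as an implicit mask to align cross-domain features. To advance Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate leading performance on these benchmarks. The dataset and code are publicly available at https://github.com/Lin-Kayla/subjectivity-sketch-reid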

An Efficient Wide-Range Pseudo-3D Vehicle Detection Using A Single Camera

paper_url: http://arxiv.org/abs/2309.08369
repo_url: None
paper_authors: Zhupeng Ye, Yinqi Li, Zejian Yuan
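for: This paper presents a new wide-range Pseudo-3D vehicle detection method that supports active safety features in intelligent driving systems.
methods: The method uses images from a single camera. Two sub-window images taken from a high-resolution image are spliced into one input, which maximizes the use of limited image resolution. Specifically designed detection heads then detect wide-range vehicle objects, simultaneously outputting extended BBox and Side Projection Line (SPL) representations that capture vehicle shape and pose.
results: Experiments on the authors' self-built dataset show favorable performance in wide-range pseudo-3D vehicle detection across multiple evaluation metrics.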

Abstract
Wide-range and fine-grained vehicle detection plays a critical role in enabling active safety features in intelligent driving systems. However, existing vehicle detection methods based on rectangular bounding boxes (BBox) often struggle with perceiving wide-range objects, especially small objects at long distances. And BBox expression cannot provide detailed geometric shape and pose information of vehicles. This paper proposes a novel wide-range Pseudo-3D Vehicle Detection method based on images from a single camera and incorporates efficient learning methods. This model takes a spliced image as input, which is obtained by combining two sub-window images from a high-resolution image. This image format maximizes the utilization of limited image resolution to retain essential information about wide-range vehicle objects. To detect pseudo-3D objects, our model adopts specifically designed detection heads. These heads simultaneously output extended BBox and Side Projection Line (SPL) representations, which capture vehicle shapes and poses, enabling high-precision detection. To further enhance the performance of detection, a joint constraint loss combining both the object box and SPL is designed during model training, improving the efficiency, stability, and prediction accuracy of the model. Experimental results on our self-built dataset demonstrate that our model achieves favorable performance in wide-range pseudo-3D vehicle detection across multiple evaluation metrics. Our demo video has been placed at https://www.youtube.com/watch?v=1gk1PmsQ5Q8.
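
The spliced-input idea can be sketched in a few lines of plain Python, assuming an image stored as a list of pixel rows. The abstract does not say where the two sub-windows are taken or how they are arranged, so the window coordinates, the vertical stacking, and the helper names `crop` and `splice` are all hypothetical.

```python
def crop(image, top, left, height, width):
    """Cut a height x width sub-window out of an image stored as a list of rows."""
    return [row[left:left + width] for row in image[top:top + height]]


def splice(image, window_a, window_b):
    """Stack two equally wide sub-windows vertically into one input image.

    Each window is a (top, left, height, width) tuple.
    """
    a = crop(image, *window_a)
    b = crop(image, *window_b)
    if len(a[0]) != len(b[0]):
        raise ValueError("sub-windows must have the same width")
    return a + b
```

For example, one window could cover the distant road region at full resolution and the other the near field, so that small far-away vehicles keep their pixels instead of being lost to downscaling.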

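Summary
Wide-range, fine-grained vehicle detection plays a critical role in enabling active safety features in intelligent driving systems. However, existing vehicle detection methods based on rectangular bounding boxes (BBox) often struggle to perceive wide-range objects, especially small objects at long distances, and the BBox representation cannot provide detailed geometric shape and pose information about vehicles. This paper proposes a new wide-range Pseudo-3D vehicle detection method based on images from a single camera. The model takes as input a spliced image obtained by combining two sub-window images from a high-resolution image. This image format maximizes the use of limited image resolution to retain essential information about wide-range vehicle objects. To detect pseudo-3D objects, the model adopts specifically designed detection heads that simultaneously output extended BBox and Side Projection Line (SPL) representations, capturing vehicle shape and pose and enabling high-precision detection. To further improve detection performance, a joint constraint loss combining the object box and the SPL is designed for model training. Experimental results show that the model performs well across multiple evaluation metrics. A demo video is available at https://www.youtube.com/watch?v=1gk1PmsQ5Q8.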

Robust Burned Area Delineation through Multitask Learning

paper_url: http://arxiv.org/abs/2309.08368
repo_url: https://github.com/links-ads/burned-area-seg
paper_authors: Edoardo Arnaudo, Luca Barco, Matteo Merlo, Claudio Rossi
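for: This paper addresses accurate delineation of areas burned by wildfires, for environmental monitoring and post-fire assessment.
methods: We use a multitask learning framework that adds land cover classification as an auxiliary task to improve the robustness and performance of burned area segmentation models. We also build an ad-hoc dataset that combines Sentinel-2 feeds with Copernicus activations and other data sources, with annotations for multiple tasks, including burned area delineation and land cover segmentation.
results: Compared with standard binary segmentation, the approach proves more effective; the comparison covers different models, including UPerNet and SegFormer.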

Abstract
In recent years, wildfires have posed a significant challenge due to their increasing frequency and severity. For this reason, accurate delineation of burned areas is crucial for environmental monitoring and post-fire assessment. However, traditional approaches relying on binary segmentation models often struggle to achieve robust and accurate results, especially when trained from scratch, due to limited resources and the inherent imbalance of this segmentation task. We propose to address these limitations in two ways: first, we construct an ad-hoc dataset to cope with the limited resources, combining information from Sentinel-2 feeds with Copernicus activations and other data sources. In this dataset, we provide annotations for multiple tasks, including burned area delineation and land cover segmentation. Second, we propose a multitask learning framework that incorporates land cover classification as an auxiliary task to enhance the robustness and performance of the burned area segmentation models. We compare the performance of different models, including UPerNet and SegFormer, demonstrating the effectiveness of our approach in comparison to standard binary segmentation.
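
A multitask objective of this kind is commonly written as a main loss plus a weighted auxiliary loss. The per-pixel sketch below assumes binary cross-entropy for the burned area task and categorical cross-entropy for land cover; the loss choices, the weight `aux_weight`, and all function names are assumptions for illustration, not details given in the abstract.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(prob, target):
    """Loss for one pixel of the burned / not-burned task (target is 0 or 1)."""
    return -(target * math.log(prob + EPS) + (1 - target) * math.log(1 - prob + EPS))


def cross_entropy(probs, target_index):
    """Loss for one pixel of the land cover task, given class probabilities."""
    return -math.log(probs[target_index] + EPS)


def multitask_loss(burn_prob, burn_label, cover_probs, cover_label, aux_weight=0.5):
    """Main burned area loss plus a weighted auxiliary land cover loss."""
    main = binary_cross_entropy(burn_prob, burn_label)
    aux = cross_entropy(cover_probs, cover_label)
    return main + aux_weight * aux
```

Setting `aux_weight` to zero recovers plain binary segmentation, which makes the auxiliary task easy to ablate.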

Continual Learning with Deep Streaming Regularized Discriminant Analysis

paper_url: http://arxiv.org/abs/2309.08353
repo_url: https://github.com/sonycslparis/deep_srda
paper_authors: Joe Khawand, Peter Hanappe, David Colliaux
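for: Enabling continual learning in real-world machine learning applications, so that models learn in a more human-like manner.
methods: The paper proposes a streaming version of regularized discriminant analysis and combines it with a convolutional neural network.
results: On the ImageNet ILSVRC-2012 dataset, the method outperforms both batch learning and existing streaming learning algorithms.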

Abstract
Continual learning is increasingly sought after in real world machine learning applications, as it enables learning in a more human-like manner. Conventional machine learning approaches fail to achieve this, as incrementally updating the model with non-identically distributed data leads to catastrophic forgetting, where existing representations are overwritten. Although traditional continual learning methods have mostly focused on batch learning, which involves learning from large collections of labeled data sequentially, this approach is not well-suited for real-world applications where we would like new data to be integrated directly. This necessitates a paradigm shift towards streaming learning. In this paper, we propose a streaming version of regularized discriminant analysis as a solution to this challenge. We combine our algorithm with a convolutional neural network and demonstrate that it outperforms both batch learning and existing streaming learning algorithms on the ImageNet ILSVRC-2012 dataset.
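
Streaming discriminant analysis relies on statistics that can be updated one sample at a time. The sketch below keeps running per-class means and a pooled within-class covariance, then shrinks the covariance toward the identity. It is a simplified illustration under stated assumptions, not the paper's exact update rule: the class name `StreamingStats`, the identity shrinkage target, and the `shrinkage` value are hypothetical.

```python
class StreamingStats:
    """Running per-class means and a shared covariance, updated per sample."""

    def __init__(self, dim):
        self.dim = dim
        self.means = {}   # class label -> running mean vector
        self.counts = {}  # class label -> number of samples seen
        self.cov = [[0.0] * dim for _ in range(dim)]
        self.total = 0

    def update(self, x, label):
        mean = self.means.setdefault(label, [0.0] * self.dim)
        count = self.counts.get(label, 0)
        # Deviation from the class mean before the mean is updated.
        delta = [xi - mi for xi, mi in zip(x, mean)]
        # Welford-style scatter increment, folded into the pooled covariance.
        weight = count / (count + 1)
        for i in range(self.dim):
            for j in range(self.dim):
                increment = weight * delta[i] * delta[j]
                self.cov[i][j] = (self.total * self.cov[i][j] + increment) / (self.total + 1)
        # Incremental mean update for this class.
        self.means[label] = [mi + di / (count + 1) for mi, di in zip(mean, delta)]
        self.counts[label] = count + 1
        self.total += 1

    def regularized_cov(self, shrinkage=0.1):
        """Blend the covariance with the identity: (1 - s) * cov + s * I."""
        s = shrinkage
        return [
            [(1 - s) * self.cov[i][j] + (s if i == j else 0.0) for j in range(self.dim)]
            for i in range(self.dim)
        ]
```

Because each update touches only the current sample and the stored statistics, no past data has to be replayed, which is the property that makes this family of methods suitable for streaming.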

Summary
Continual learning is increasingly sought after in real-world machine learning applications, because it enables learning in a more human-like manner. Conventional machine learning approaches fail to achieve this: incrementally updating a model with non-identically distributed data leads to catastrophic forgetting, in which existing representations are overwritten. Traditional continual learning methods have mostly focused on batch learning, which learns sequentially from large collections of labeled data, but this is not well suited to real-world applications where new data should be integrated directly. This calls for a paradigm shift towards streaming learning. In this paper, we propose a streaming version of regularized discriminant analysis as a solution. We combine the algorithm with a convolutional neural network and show that it outperforms both batch learning and existing streaming learning algorithms on the ImageNet ILSVRC-2012 dataset.

T-UDA: Temporal Unsupervised Domain Adaptation in Sequential Point Clouds

paper_url: http://arxiv.org/abs/2309.08302
repo_url: https://github.com/ctu-vras/t-uda
paper_authors: Awet Haileslassie Gebrehiwot, David Hurych, Karel Zimmermann, Patrick Pérez, Tomáš Svoboda
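for: This study aims to make 3D semantic segmentation models for driving scenes cope reliably with open-world domain shifts caused by different geographic regions, sensor properties, mounting positions, and other factors.
methods: The paper proposes a new domain adaptation method, Temporal UDA (T-UDA), which combines the temporal and cross-sensor geometric consistency of the input data with the mean teacher method.
results: Experiments on Waymo Open Dataset, nuScenes, and SemanticKITTI show large performance gains in 3D semantic segmentation for two popular 3D point cloud architectures, Cylinder3D and MinkowskiNet.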

Abstract
Deep perception models have to reliably cope with an open-world setting of domain shifts induced by different geographic regions, sensor properties, mounting positions, and several other reasons. Since covering all domains with annotated data is technically intractable due to the endless possible variations, researchers focus on unsupervised domain adaptation (UDA) methods that adapt models trained on one (source) domain with annotations available to another (target) domain for which only unannotated data are available. Current predominant methods either leverage semi-supervised approaches, e.g., teacher-student setup, or exploit privileged data, such as other sensor modalities or temporal data consistency. We introduce a novel domain adaptation method that leverages the best of both trends. Our approach combines input data's temporal and cross-sensor geometric consistency with the mean teacher method. Dubbed T-UDA for "temporal UDA", such a combination yields massive performance gains for the task of 3D semantic segmentation of driving scenes. Experiments are conducted on Waymo Open Dataset, nuScenes and SemanticKITTI, for two popular 3D point cloud architectures, Cylinder3D and MinkowskiNet. Our codes are publicly available at https://github.com/ctu-vras/T-UDA.
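
The mean teacher method keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. Below is a minimal sketch of that update on plain lists of floats. The function name `ema_update` and the `alpha` value are hypothetical, and T-UDA's temporal and cross-sensor consistency terms are not shown.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]
```

In a training loop this update would typically run after each student optimization step, so the teacher changes slowly and provides more stable targets on unannotated target-domain data.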

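Summary
Deep perception models must cope reliably with an open-world setting of domain shifts caused by different geographic regions, sensor properties, mounting positions, and other factors. Because covering all domains with annotated data is technically intractable, researchers focus on unsupervised domain adaptation (UDA) methods, which adapt a model trained on an annotated (source) domain to another (target) domain for which only unannotated data are available. Current predominant methods either use semi-supervised approaches, such as a teacher-student setup, or exploit privileged data, such as other sensor modalities or temporal data consistency. We introduce a new domain adaptation method that combines the temporal and cross-sensor geometric consistency of the input data with the mean teacher method. We call it T-UDA, for "temporal UDA". Experiments are conducted on Waymo Open Dataset, nuScenes, and SemanticKITTI with two popular 3D point cloud architectures, Cylinder3D and MinkowskiNet. The code is publicly available at https://github.com/ctu-vras/T-UDA.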

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

paper_url: http://arxiv.org/abs/2309.08295
repo_url: None
paper_authors: Ilya Gurvich, Ido Leichter, Dharmendar Reddy Palle, Yossi Asher, Alon Vinnikov, Igor Abramovski, Vishak Gopal, Ross Cutler, Eyal Krupka
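for: This paper presents a real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing.
methods: The system uses data from a microphone array and a 360-degree camera, and its network requires only 127 MFLOPs per participant for a meeting with 14 participants. Unlike previous work, the authors examine the network's error rate when the computational budget is exhausted and find graceful degradation, so the system still operates reasonably well in that case.
results: The algorithm is trained and evaluated on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.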

Abstract
We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. This system drives a virtual cinematography module and is deployed on a commercial device. The system uses data originating from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant, for a meeting with 14 participants. Unlike previous work, we examine the error rate of our network when the computational budget is exhausted, and find that it exhibits graceful degradation, allowing the system to operate reasonably well even in this case. Departing from conventional DOA estimation approaches, our network learns to query the available acoustic data, considering the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.

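Summary
We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. The system drives a virtual cinematography module and is deployed on a commercial device. It uses data from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant for a meeting with 14 participants. Unlike previous work, we examine the error rate of the network when the computational budget is exhausted and find that it degrades gracefully, so the system still operates reasonably well in this case. Departing from conventional DOA estimation approaches, the network learns to query the available acoustic data, taking the detected head locations into account. We train and evaluate the algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.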

Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models

paper_url: http://arxiv.org/abs/2309.08273
repo_url: None
paper_authors: Ruian He, Zhen Xing, Weimin Tan, Bo Yan
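for: The paper proposes an unsupervised method for learning facial representations, improving face understanding without heavy reliance on large-scale annotated datasets.
methods: The paper proposes LatentFace, an unsupervised disentangling framework built on a 3D-aware latent diffusion model. A 3D-aware autoencoder encodes face images into 3D latent embeddings, and a representation diffusion model (RDM) disentangles the latent into facial identity and expression.
results: The method achieves state-of-the-art performance in facial expression recognition and face verification among unsupervised facial representation learning models.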

Abstract
Unsupervised learning of facial representations has gained increasing attention for face understanding ability without heavily relying on large-scale annotated datasets. However, it remains unsolved due to the coupling of facial identities, expressions, and external factors like pose and light. Prior methods primarily focus on 2D factors and pixel-level consistency, leading to incomplete disentangling and suboptimal performance in downstream tasks. In this paper, we propose LatentFace, a novel unsupervised disentangling framework for facial expression and identity representation. We suggest the disentangling problem should be performed in latent space and propose the solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model (RDM) to disentangle 3D latent into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition and face verification among unsupervised facial representation learning models.

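Summary
Unsupervised learning of facial representations has drawn growing attention because it supports face understanding without heavy reliance on large annotated datasets. The problem remains unsolved, however, because facial identity, expression, and external factors such as pose and lighting are coupled. Prior methods focus mainly on 2D factors and pixel-level consistency, which leads to incomplete disentangling and suboptimal downstream performance. In this paper we argue that disentangling should be performed in latent space and propose a 3D-aware latent diffusion model to do so. First, a 3D-aware autoencoder encodes face images into 3D latent embeddings. Second, a novel representation diffusion model (RDM) disentangles the 3D latent into facial identity and expression. As a result, the method achieves state-of-the-art facial expression recognition and face verification performance among unsupervised facial representation learning models.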

Edge Based Oriented Object Detection

paper_url: http://arxiv.org/abs/2309.08265
repo_url: https://github.com/pratishtha-agarwal/Automation-of-Attendance-montoring-system
paper_authors: Jianghu Shen, Xiaojun Wu
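for: Improving the detection accuracy of oriented objects in remote sensing.
methods: Bounds objects with oriented bounding boxes (OBB), designs a new loss function based on a similarity measure over edge gradient vectors, and adds an edge-based self-attention module that strengthens attention to object edges.
results: Compared with the Smooth L1 loss of the baseline algorithm, the proposed loss improves mAP by 0.6%. Combined with the edge-based self-attention module, the mAP gain reaches 1.3% on the DOTA dataset.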

Abstract
In the field of remote sensing, we often utilize oriented bounding boxes (OBB) to bound the objects. This approach significantly reduces the overlap among dense detection boxes and minimizes the inclusion of background content within the bounding boxes. To enhance the detection accuracy of oriented objects, we propose a unique loss function based on edge gradients, inspired by the similarity measurement function used in the template matching task. During this process, we address the issues of non-differentiability of the function and the semantic alignment between gradient vectors in ground truth (GT) boxes and predicted boxes (PB). Experimental results show that our proposed loss function achieves a 0.6% mAP improvement compared to the commonly used Smooth L1 loss in the baseline algorithm. Additionally, we design an edge-based self-attention module to encourage the detection network to focus more on the object edges. Leveraging these two innovations, we achieve an mAP increase of 1.3% on the DOTA dataset.

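Summary
In remote sensing, oriented bounding boxes (OBB) are commonly used to bound objects. This reduces the overlap among dense detection boxes and the amount of background included in each box. To improve the detection accuracy of oriented objects, we propose a loss function based on edge gradients, inspired by the similarity measurement function used in template matching. Along the way we address the non-differentiability of that function and the semantic alignment between gradient vectors in ground truth (GT) boxes and predicted boxes (PB). Experiments show that the proposed loss improves mAP by 0.6% over the Smooth L1 loss used in the baseline algorithm. We also design an edge-based self-attention module that makes the detection network attend more to object edges. With both innovations, mAP increases by 1.3% on the DOTA dataset.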
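To make the comparison in this entry concrete, the sketch below shows the standard Smooth L1 penalty that the baseline uses, next to a hypothetical similarity-style loss over edge gradient vectors. The functions `gradient_cosine_similarity` and `edge_similarity_loss` are illustrative assumptions only; the abstract does not give the actual form of the proposed loss.

```python
import math


def smooth_l1(error, beta=1.0):
    """Standard Smooth L1 penalty for a single regression residual."""
    abs_err = abs(error)
    if abs_err < beta:
        # Quadratic near zero, so the gradient shrinks smoothly.
        return 0.5 * abs_err * abs_err / beta
    # Linear for large residuals, so outliers are penalised less harshly.
    return abs_err - 0.5 * beta


def gradient_cosine_similarity(g_true, g_pred, eps=1e-8):
    """Cosine similarity of two 2-D edge gradient vectors (hypothetical helper)."""
    dot = g_true[0] * g_pred[0] + g_true[1] * g_pred[1]
    norm = math.hypot(*g_true) * math.hypot(*g_pred)
    return dot / (norm + eps)


def edge_similarity_loss(gt_gradients, pred_gradients):
    """Assumed illustration: 1 - mean cosine similarity over paired edge gradients."""
    sims = [
        gradient_cosine_similarity(g, p)
        for g, p in zip(gt_gradients, pred_gradients)
    ]
    return 1.0 - sum(sims) / len(sims)
```

Under these assumptions, perfectly aligned gradients give a loss near 0 and perpendicular gradients give a loss near 1, whereas Smooth L1 only looks at per-coordinate box residuals.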

Leveraging the Power of Data Augmentation for Transformer-based Tracking

paper_url: http://arxiv.org/abs/2309.08264
repo_url: None
paper_authors: Jie Zhao, Johan Edstedt, Michael Felsberg, Dong Wang, Huchuan Lu
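for: Improving visual object tracking performance and examining the impact of data augmentation.
methods: Builds on transformer-based trackers and proposes two data augmentation methods: a dynamic search radius mechanism for random cropping and a token-level feature mixing augmentation strategy.
results: Experiments show that both augmentation methods improve the tracking performance of transformer-based trackers, especially in challenging settings such as one-shot tracking and small image resolutions.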

Abstract
Due to long-distance correlation and powerful pretrained models, transformer-based methods have initiated a breakthrough in visual object tracking performance. Previous works focus on designing effective architectures suited for tracking, but ignore that data augmentation is equally crucial for training a well-performing model. In this paper, we first explore the impact of general data augmentations on transformer-based trackers via systematic experiments, and reveal the limited effectiveness of these common strategies. Motivated by experimental observations, we then propose two data augmentation methods customized for tracking. First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples. Second, we propose a token-level feature mixing augmentation strategy, which strengthens the model against challenges like background interference. Extensive experiments on two transformer-based trackers and six benchmarks demonstrate the effectiveness and data efficiency of our methods, especially under challenging settings, like one-shot tracking and small image resolutions.

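Summary
Thanks to long-range correlation modeling and powerful pretrained models, transformer-based methods have brought a breakthrough in visual object tracking performance. Previous work concentrates on designing architectures suited to tracking and overlooks the importance of data augmentation. In this paper we first study, through systematic experiments, how general data augmentations affect transformer-based trackers, and find that these common strategies have limited effectiveness. Motivated by these observations, we propose two data augmentation methods tailored to tracking. First, we optimize existing random cropping with a dynamic search radius mechanism and simulation of boundary samples. Second, we propose a token-level feature mixing augmentation strategy that makes the model more robust to challenges such as background interference. Extensive experiments on two transformer-based trackers and six benchmarks demonstrate the effectiveness and data efficiency of the methods, especially in challenging settings such as one-shot tracking and small image resolutions.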
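As a rough illustration of what token-level feature mixing could look like, the sketch below swaps a random subset of token positions in one sequence for the tokens at the same positions in another. The function name `mix_tokens`, the `mix_ratio` parameter, and the position-wise swap are all assumptions made for illustration; the abstract does not specify how the paper selects or mixes tokens.

```python
import random


def mix_tokens(target_tokens, source_tokens, mix_ratio=0.25, seed=None):
    """Replace a random fraction of target tokens with source tokens.

    Hypothetical sketch: both inputs are equal-length sequences of token
    features, and mixing is a position-wise swap.
    """
    if len(target_tokens) != len(source_tokens):
        raise ValueError("token sequences must have the same length")
    rng = random.Random(seed)
    n_tokens = len(target_tokens)
    n_mix = int(round(mix_ratio * n_tokens))
    # Pick which positions receive features from the other sequence.
    mixed_positions = set(rng.sample(range(n_tokens), n_mix))
    mixed = [
        source_tokens[i] if i in mixed_positions else target_tokens[i]
        for i in range(n_tokens)
    ]
    return mixed, sorted(mixed_positions)
```

In a tracker, the source tokens might come from a distractor or background region, so that the model sees interference during training; that pairing is also an assumption here.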

BROW: Better featuRes fOr Whole slide image based on self-distillation

paper_url: http://arxiv.org/abs/2309.08259
repo_url: None
paper_authors: Yuanfeng Wu, Shaojie Li, Zhiqiang Du, Wentao Zhu
for: This paper aims to propose a foundation model for extracting better feature representations for whole slide images (WSIs) in clinical diagnosis.
methods: The proposed model, called BROW, uses a transformer architecture and is pretrained using a self-distillation framework. It also employs techniques such as patch shuffling to improve the model’s robustness.
results: The proposed model achieves high performance on various downstream tasks, including slide-level subtyping, patch-level classification, and nuclei instance segmentation. The results confirm the efficacy, robustness, and good generalization ability of the model, making it a promising foundation model for WSI feature extraction.

Abstract
Whole slide image (WSI) processing is becoming one of the key components of standard clinical diagnosis for various diseases. However, the direct application of conventional image processing algorithms to WSI faces certain obstacles because of WSIs' distinct property: the super-high resolution. The performance of most WSI-related tasks relies on the efficacy of the backbone which extracts WSI patch feature representations. Hence, we propose BROW, a foundation model for extracting better feature representations for WSIs, which can be conveniently adapted to downstream tasks with little or no fine-tuning. The model adopts a transformer architecture and is pretrained using a self-distillation framework. To improve the model's robustness, techniques such as patch shuffling have been employed. Additionally, the model leverages the unique properties of WSIs, utilizing WSI's multi-scale pyramid to incorporate an additional global view, thereby further enhancing its performance. We used both private and public data to make up a large pretraining dataset, containing more than 11000 slides and over 180M extracted patches, encompassing WSIs related to various organs and tissues. To assess the effectiveness of BROW, we run a wide range of downstream tasks, including slide-level subtyping, patch-level classification and nuclei instance segmentation. The results confirm the efficacy, robustness and good generalization ability of the proposed model. This substantiates its potential as a foundation model for WSI feature extraction and highlights promising prospects for its application in WSI processing.

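Summary
Whole slide image (WSI) processing has become a key component of standard clinical diagnosis for many diseases. Applying conventional image processing algorithms directly to WSIs is difficult, however, because of their super-high resolution. The performance of most WSI-related tasks depends on the backbone that extracts WSI patch feature representations. We therefore propose BROW, a foundation model for extracting better WSI feature representations that can be adapted to downstream tasks with little or no fine-tuning. The model uses a transformer architecture pretrained with a self-distillation framework, and patch shuffling is used to improve robustness. It also exploits the multi-scale pyramid of WSIs to add a global view, which further improves performance. The pretraining dataset combines private and public data and contains more than 11,000 slides and over 180M extracted patches covering WSIs of various organs and tissues. To evaluate BROW, we run a range of downstream tasks, including slide-level subtyping, patch-level classification, and nuclei instance segmentation. The results confirm the effectiveness, robustness, and good generalization ability of the model, supporting its potential as a foundation model for WSI feature extraction and its prospects in WSI processing.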
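Patch shuffling is named as a robustness technique but not described further. A minimal sketch, assuming it means cutting an image into a grid of equal square patches and permuting them at random, could look like this. The function name `shuffle_patches` and the list-of-rows image representation are illustrative choices, not the paper's implementation.

```python
import random


def shuffle_patches(image, patch_size, seed=None):
    """Randomly permute the square patches of a 2-D image (list of rows).

    Assumes height and width are multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    rows, cols = height // patch_size, width // patch_size

    # Cut the image into patches, stored in row-major order.
    patches = [
        [
            row[c * patch_size:(c + 1) * patch_size]
            for row in image[r * patch_size:(r + 1) * patch_size]
        ]
        for r in range(rows)
        for c in range(cols)
    ]
    random.Random(seed).shuffle(patches)

    # Stitch the permuted patches back into a full image.
    shuffled = []
    for r in range(rows):
        for y in range(patch_size):
            line = []
            for c in range(cols):
                line.extend(patches[r * cols + c][y])
            shuffled.append(line)
    return shuffled
```

The pixel content is unchanged and only the spatial arrangement of patches differs, which is why such an augmentation can discourage a model from relying on absolute patch position.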

Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

paper_url: http://arxiv.org/abs/2309.08251
repo_url: None
paper_authors: Feihong He, Gang Li, Lingyu Si, Leilei Yan, Shimeng Hou, Hongwei Dong, Fanzhang Li
for: Proposes a training-free image cartoonization method based on diffusion transformer models.
methods: The method builds on the reverse process of diffusion transformer models, which is split into a semantic generation phase and a detail generation phase. Cartoonization is performed by normalizing the high-frequency signal of the noisy image at specific denoising steps.
results: Extensive experimental results show that CartoonDiff is highly capable at image cartoonization, without requiring additional reference images, complex model designs, or tedious adjustment of multiple parameters.

Abstract
Image cartoonization has attracted significant interest in the field of image generation. However, most of the existing image cartoonization techniques require re-training models using images of cartoon style. In this paper, we present CartoonDiff, a novel training-free sampling approach which generates image cartoonization using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into the semantic generation phase and the detail generation phase. Furthermore, we implement the image cartoonization process by normalizing high-frequency signal of the noisy image in specific denoising steps. CartoonDiff doesn't require any additional reference images, complex model designs, or the tedious adjustment of multiple parameters. Extensive experimental results show the powerful ability of our CartoonDiff. The project page is available at: https://cartoondiff.github.io/

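Summary
Image cartoonization has attracted considerable interest in image generation. Most existing cartoonization techniques, however, require retraining models on cartoon-style images. In this paper we present CartoonDiff, a novel training-free sampling approach that performs image cartoonization with diffusion transformer models. We decompose the reverse process of diffusion models into a semantic generation phase and a detail generation phase, and implement cartoonization by normalizing the high-frequency signal of the noisy image at specific denoising steps. CartoonDiff requires no additional reference images, complex model designs, or tedious tuning of multiple parameters. Experimental results show that CartoonDiff is highly capable. The project page is available at: https://cartoondiff.github.io/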

Optimization of Rank Losses for Image Retrieval

paper_url: http://arxiv.org/abs/2309.08250
repo_url: https://github.com/cvdfoundation/google-landmark
paper_authors: Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome
for: This paper focuses on improving the training of deep neural networks for image retrieval tasks using a new framework for robust and decomposable rank losses optimization.
methods: The proposed framework includes a general surrogate for ranking operators called SupRank, which provides an upper bound for rank losses and ensures robust training, as well as a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set.
results: The authors apply their framework to two standard metrics for image retrieval (AP and R@k) and introduce an extension of AP called hierarchical average precision $\mathcal{H}$-AP, which is optimized as well. Additionally, they create the first hierarchical landmarks retrieval dataset using a semi-automatic pipeline to create hierarchical labels, and release the code at https://github.com/elias-ramzi/SupRank.

Abstract
In image retrieval, standard evaluation metrics rely on score ranking, e.g., average precision (AP), recall at k (R@k), normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable rank losses optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. Firstly we propose a general surrogate for ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upper bound for rank losses and ensures robust training. Secondly, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank.

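Summary
In image retrieval, standard evaluation metrics rely on score ranking, for example average precision (AP) and recall at k (R@k). In this work we introduce a general framework for robust and decomposable rank loss optimization. It addresses two major challenges in training deep neural networks with rank losses: non-differentiability and non-decomposability. First, we propose SupRank, a general surrogate for the ranking operator that is amenable to stochastic gradient descent; it provides an upper bound for rank losses and ensures robust training. Second, we use a simple yet effective loss function to reduce the decomposability gap of rank losses. We apply the framework to two standard metrics, AP and R@k, and to hierarchical image retrieval. We introduce hierarchical average precision ($\mathcal{H}$-AP) and optimize it along with NDCG. Finally, we create the first hierarchical landmarks retrieval dataset, using a semi-automatic pipeline to create hierarchical labels that extend the Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank.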
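The non-differentiability mentioned in this entry comes from how ranks are computed. The sketch below writes the rank of an item as a sum of Heaviside step functions over score differences, then builds AP and R@k on top of it. This is the textbook formulation of the metrics, not the SupRank surrogate, whose form the abstract does not give. R@k is taken here as 1 when at least one positive appears in the top k, which is an assumption about the variant in use.

```python
def heaviside(x):
    """Step function: its gradient is zero almost everywhere."""
    return 1.0 if x > 0 else 0.0


def rank_of(index, scores, mask=None):
    """1 + number of items (optionally restricted by mask) scored strictly higher."""
    return 1.0 + sum(
        heaviside(scores[j] - scores[index])
        for j in range(len(scores))
        if j != index and (mask is None or mask[j])
    )


def average_precision(scores, labels):
    """AP as the mean over positives of rank-among-positives / rank-among-all."""
    positives = [i for i, y in enumerate(labels) if y]
    if not positives:
        raise ValueError("at least one positive item is required")
    return sum(
        rank_of(i, scores, labels) / rank_of(i, scores) for i in positives
    ) / len(positives)


def recall_at_k(scores, labels, k):
    """1.0 if any positive item is ranked within the top k, else 0.0."""
    hit = any(y and rank_of(i, scores) <= k for i, y in enumerate(labels))
    return 1.0 if hit else 0.0
```

For scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]`, the positives sit at ranks 1 and 3, so AP is (1/1 + 2/3) / 2 = 5/6. Because every rank is built from step functions, small score changes leave the metric unchanged, which is why a smooth surrogate is needed for gradient-based training.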

A Real-time Faint Space Debris Detector With Learning-based LCM

paper_url: http://arxiv.org/abs/2309.08244
repo_url: None
paper_authors: Zherui Lu, Gangyi Wang, Xinguo Wei, Jian Li
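for: Improving the sensitivity and efficiency of space situational awareness (SSA) systems in response to the growing space debris problem.
methods: Proposes a low-SNR streak extraction method based on local contrast and maximum likelihood estimation (MLE) that detects low-SNR targets quickly and precisely.
results: Experiments on simulated streaks and real star tracker images show that the method is fast and precise. Compared with the existing ODCC method, it is more efficient with a comparable centroid error.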

Abstract
With the development of aerospace technology, the increasing population of space debris has posed a great threat to the safety of spacecraft. However, the low intensity of reflected light and high angular velocity of space debris impede the extraction. Besides, due to the limitations of the ground observation methods, small space debris can hardly be detected, making it necessary to enhance the spacecraft's capacity for space situational awareness (SSA). Considering that traditional methods have some defects in low-SNR target detection, such as low effectiveness and large time consumption, this paper proposes a method for low-SNR streak extraction based on local contrast and maximum likelihood estimation (MLE), which can detect space objects with SNR 2.0 efficiently. In the proposed algorithm, local contrast will be applied for crude classifications, which will return connected components as preliminary results, and then MLE will be performed to reconstruct the connected components of targets via orientated growth, further improving the precision. The algorithm has been verified with both simulated streaks and real star tracker images, and the average centroid error of the proposed algorithm is close to the state-of-the-art method like ODCC. At the same time, the algorithm in this paper has significant advantages in efficiency compared with ODCC. In conclusion, the algorithm in this paper is of high speed and precision, which guarantees its promising applications in the extraction of high dynamic targets.

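Summary
As aerospace technology develops, the growing population of space debris poses a serious threat to spacecraft safety. The low intensity of reflected light and the high angular velocity of space debris make extraction difficult. In addition, small debris can hardly be detected because of the limitations of ground observation methods, so spacecraft need stronger space situational awareness (SSA). Because traditional methods have shortcomings in low signal-to-noise ratio (SNR) target detection, this paper proposes a low-SNR streak extraction method based on local contrast and maximum likelihood estimation (MLE) that can efficiently detect space objects at SNR 2.0. In the proposed algorithm, local contrast is applied for crude classification and returns connected components as preliminary results; MLE is then performed to reconstruct the connected components of targets via oriented growth, further improving precision. The algorithm has been verified on simulated streaks and real star tracker images, and its centroid error is close to that of ODCC. It also has a significant efficiency advantage. The algorithm therefore has promising applications in detecting high dynamic targets.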
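The first stage of the pipeline described in this entry is a crude classification by local contrast. A simplified sketch of that idea, assuming contrast is the ratio of a pixel to the mean of its eight neighbours, is shown below. The ratio definition, the `threshold` value, and both function names are illustrative assumptions; the paper's learning-based LCM, the connected-component step, and the MLE reconstruction are not reproduced here.

```python
def local_contrast_map(image, eps=1e-6):
    """Ratio of each interior pixel to the mean of its 8 neighbours.

    Border pixels are left at 0.0 because they lack a full neighbourhood.
    """
    height, width = len(image), len(image[0])
    contrast = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbours = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            ]
            background = sum(neighbours) / 8.0
            contrast[y][x] = image[y][x] / (background + eps)
    return contrast


def candidate_pixels(image, threshold=2.0):
    """Coordinates whose local contrast exceeds an assumed threshold."""
    cmap = local_contrast_map(image)
    return [
        (y, x)
        for y, row in enumerate(cmap)
        for x, value in enumerate(row)
        if value > threshold
    ]
```

On a flat background with a single bright pixel, only that pixel exceeds the threshold; in the described method such candidates would then be grouped into connected components and refined.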

Human-Inspired Topological Representations for Visual Object Recognition in Unseen Environments

paper_url: http://arxiv.org/abs/2309.08239
repo_url: None
paper_authors: Ekta U. Samani, Ashis G. Banerjee
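for: Improving the accuracy of object recognition by mobile robots in unseen and cluttered indoor environments.
methods: Uses the TOPS2 descriptor and the THOR2 recognition framework, which is inspired by a human reasoning mechanism; color embeddings obtained with the Mapper algorithm are interleaved with the shape-based TOPS descriptor.
results: On two real-world datasets (OCID and UW-IS Occluded), THOR2 achieves higher recognition accuracy than the shape-based THOR framework and RGB-D ViT, making it a promising step toward robust recognition in low-cost robots.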

Abstract
Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. Toward this goal, we extend our previous work to propose the TOPS2 descriptor, and an accompanying recognition framework, THOR2, inspired by a human reasoning mechanism known as object unity. We interleave color embeddings obtained using the Mapper algorithm for topological soft clustering with the shape-based TOPS descriptor to obtain the TOPS2 descriptor. THOR2, trained using synthetic data, achieves substantially higher recognition accuracy than the shape-based THOR framework and outperforms RGB-D ViT on two real-world datasets: the benchmark OCID dataset and the UW-IS Occluded dataset. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots.

Summary
Mobile robots may need to perform visual object recognition in unseen and cluttered indoor environments. To this end, we extend our previous work on the TOPS descriptor and, inspired by a human reasoning mechanism known as object unity, propose the TOPS2 descriptor. We combine color embeddings obtained with the Mapper algorithm with the TOPS descriptor to obtain the TOPS2 descriptor. THOR2, trained on synthetic data, achieves substantially higher recognition accuracy on two real-world datasets than the shape-based THOR framework and RGB-D ViT. THOR2 is therefore a promising step toward robust recognition in low-cost robots.

Efficient Polyp Segmentation Via Integrity Learning

paper_url: http://arxiv.org/abs/2309.08234
repo_url: None
paper_authors: Ziqiang Chen, Kang Wang, Yun Liu
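for: Improving the accuracy of polyp delineation in colonoscopy, which assists diagnosis and guides interventions and treatments, by addressing integrity deficiency.
methods: Proposes the Integrity Capturing Polyp Segmentation (IC-PolypSeg) network, which uses lightweight backbones and three key components to alleviate integrity deficiency: 1) a pixel-wise feature redistribution (PFR) module, 2) a cross-stage pixel-wise feature redistribution (CPFR) module, and 3) a coarse-to-fine calibration module.
results: On 5 public datasets, the proposed IC-PolypSeg outperforms 8 existing methods in both precision and computational efficiency. IC-PolypSeg-EF0 uses 300 times fewer parameters than PraNet, runs in real time at 235 FPS, and reduces the false negative ratio.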

Abstract
Accurate polyp delineation in colonoscopy is crucial for assisting in diagnosis, guiding interventions, and treatments. However, current deep-learning approaches fall short due to integrity deficiency, which often manifests as missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network utilizes lightweight backbones and 3 key components for integrity ameliorating: 1) Pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features. 2) Cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information. 3) Coarse-to-fine calibration module combines PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on 5 public datasets demonstrate that the proposed IC-PolypSeg outperforms 8 state-of-the-art methods in terms of higher precision and significantly improved computational efficiency with lower computational consumption. IC-PolypSeg-EF0 employs 300 times fewer parameters than PraNet while achieving a real-time processing speed of 235 FPS. Importantly, IC-PolypSeg reduces the false negative ratio on five datasets, meeting clinical requirements.

Summary
Accurate polyp segmentation in colonoscopy is important for assisting diagnosis and guiding interventions and treatments. However, current deep-learning approaches often lack integrity, which can lead to missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network uses lightweight backbones and 3 key components to improve integrity: 1) The pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features. 2) The cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information. 3) The coarse-to-fine calibration module combines the PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on 5 public datasets show that IC-PolypSeg improves on 8 existing methods in both precision and computational efficiency. IC-PolypSeg-EF0 uses 300 times fewer parameters than PraNet while running in real time at 235 FPS. IC-PolypSeg also reduces the false negative ratio on the five datasets, meeting clinical requirements.

UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

paper_url: http://arxiv.org/abs/2309.08220
repo_url: None
paper_authors: Junwen Xiong, Peng Zhang, Chuanyue Li, Wei Huang, Yufei Zha, Tao You
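for: Building a generalized saliency modeling framework that unifies video saliency prediction and video salient object detection.
methods: Proposes a saliency-aware transformer that learns spatio-temporal representations at progressively increased resolutions to obtain a more robust saliency representation, together with a task-specific decoder for the final prediction.
results: Experiments show that the proposed UniST performs strongly on seven challenging benchmarks and outperforms other existing methods.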

Abstract
Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention akin to how humans perceive dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection tasks, little attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges both these distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn the spatio-temporal representations at progressively increased resolutions, while incorporating effective cross-scale saliency information to produce a robust representation. Furthermore, a task-specific decoder is proposed to perform the final prediction for each task. To the best of our knowledge, this is the first work that explores designing a transformer structure for both saliency modeling tasks. Convincing experiments demonstrate that the proposed UniST achieves superior performance across seven challenging benchmarks for two tasks, and significantly outperforms the other state-of-the-art methods.

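Summary
Video saliency prediction and detection are thriving research areas in computer vision that let computers simulate how human visual attention is distributed over dynamic scenes. Although many approaches have designed task-specific training paradigms for one of the two tasks, little attention has been paid to a generalized saliency modeling framework that covers both. In this study we introduce the Unified Saliency Transformer (UniST) framework, which makes comprehensive use of the essential attributes of video saliency prediction and video salient object detection. A task-specific decoder performs the final prediction for each task. To the best of our knowledge, this is the first work to explore a transformer structure for both saliency modeling tasks. Experiments show that the proposed UniST performs strongly on seven challenging benchmarks and significantly outperforms other existing methods.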

Salient Object Detection in Optical Remote Sensing Images Driven by Transformer

paper_url: http://arxiv.org/abs/2309.08206
repo_url: https://github.com/mathlee/gelenet
paper_authors: Gongyang Li, Zhen Bai, Zhi Liu, Xinpeng Zhang, Haibin Ling
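for: Proposes a novel Global Extraction Local Exploration Network (GeleNet) for salient object detection in optical remote sensing images (ORSI-SOD).
methods: GeleNet uses a transformer backbone to generate four-level feature embeddings, and enhances them with a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM), its simplified version (SWSAM), and a Knowledge Transfer Module (KTM).
results: Extensive experiments on three public datasets show that GeleNet outperforms relevant state-of-the-art methods and detects salient objects in optical remote sensing images better.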

Abstract
Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD following the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM comprehensively perceives the orientation information in the lowest-level features through directional convolutions to adapt to various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of two middle-level features of different scales based on the self-attention mechanism, and transfers the knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor is used to generate the saliency map based on the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/GeleNet.

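Summary
Existing methods for salient object detection in optical remote sensing images (ORSI-SOD) mainly adopt convolutional neural networks (CNNs) such as VGG and ResNet as the backbone. Because CNNs can only extract features within certain receptive fields, most ORSI-SOD methods follow the local-to-contextual paradigm. In this paper we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD that follows the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings, then uses a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM perceives the orientation information in the lowest-level features through directional convolutions to adapt to the various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects. SWSAM drops the direction-aware part of D-SWSAM to focus on localizing salient objects. KTM models the contextual correlation knowledge of two middle-level features based on the self-attention mechanism and transfers it to the raw features to generate more discriminative features. Finally, a saliency predictor generates the saliency map from the outputs of the three modules. Extensive experiments on three public datasets show that GeleNet outperforms relevant state-of-the-art methods. The code and results are available at https://github.com/MathLee/GeleNet.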

One-stage Modality Distillation for Incomplete Multimodal Learning

paper_url: http://arxiv.org/abs/2309.08204
repo_url: None
paper_authors: Shicai Wei, Yang Luo, Chunbo Luo
for: Addresses the challenge of inferring with incomplete modality in multimodal learning.
methods: Proposes a one-stage modality distillation framework that combines privileged knowledge transfer and modality information fusion via multi-task learning.
results: Achieves state-of-the-art performance on RGB-D classification and segmentation tasks despite incomplete modality input in various scenes.

Abstract
Learning based on multimodal data has attracted increasing interest recently. While a variety of sensory modalities can be collected for training, not all of them are always available in development scenarios, which raises the challenge to infer with incomplete modality. To address this issue, this paper presents a one-stage modality distillation framework that unifies the privileged knowledge transfer and modality information fusion into a single optimization procedure via multi-task learning. Compared with the conventional modality distillation that performs them independently, this helps to capture the valuable representation that can assist the final model inference directly. Specifically, we propose the joint adaptation network for the modality transfer task to preserve the privileged information. This addresses the representation heterogeneity caused by input discrepancy via the joint distribution adaptation. Then, we introduce the cross translation network for the modality fusion task to aggregate the restored and available modality features. It leverages the parameters-sharing strategy to capture the cross-modal cues explicitly. Extensive experiments on RGB-D classification and segmentation tasks demonstrate the proposed multimodal inheritance framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.

Summary
Learning based on multimodal data has attracted increasing interest. Although a variety of sensory modalities can be collected for training, not all of them are always available in development scenarios, which raises the challenge of inference with incomplete modality. To address this issue, this paper presents a one-stage modality distillation framework. Compared with conventional modality distillation, which performs knowledge transfer and fusion independently, this helps capture representations that directly assist the final model's inference. Specifically, a joint adaptation network is proposed for the modality transfer task to preserve the privileged information, and a cross translation network is introduced for the modality fusion task to aggregate the restored and available modality features. Extensive experiments on RGB-D classification and segmentation tasks show that the proposed framework can overcome incomplete modality input in various scenes and achieve state-of-the-art performance.

Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks

paper_url: http://arxiv.org/abs/2309.08197
repo_url: https://github.com/orhan-t/sm-cnn
paper_authors: Orhan Torun, Seniha Esen Yuksel, Erkut Erdem, Nevrez Imamoglu, Aykut Erdem
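for: Proposes a hyperspectral image denoising method based on adaptive spectral-spatial information, to improve on the performance of existing hyperspectral image denoising methods under real-life complex noise.
methods: The method is built on a novel spectral self-modulating residual block (SSMRB), which adaptively transforms the features of the input hyperspectral image based on adjacent spectral data, improving the network's ability to handle real-life complex noise.
results: Experiments show that the proposed SM-CNN outperforms other state-of-the-art hyperspectral image denoising methods on public benchmark datasets, both quantitatively and qualitatively.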

Abstract
Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Hence, hyperspectral image denoising has attracted lots of attention by the community lately. While recent deep HSI denoising methods have provided effective solutions, their performance under real-life complex noise remains suboptimal, as they lack adaptability to new data. To overcome these limitations, in our work, we introduce a self-modulating convolutional neural network which we refer to as SM-CNN, which utilizes correlated spectral and spatial information. At the core of the model lies a novel block, which we call spectral self-modulating residual block (SSMRB), that allows the network to transform the features in an adaptive manner based on the adjacent spectral data, enhancing the network's ability to handle complex noise. In particular, the introduction of SSMRB transforms our denoising network into a dynamic network that adapts its predicted features while denoising every input HSI with respect to its spatio-spectral characteristics. Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods both quantitatively and qualitatively on public benchmark datasets.

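Summary
Compared with natural images, hyperspectral images (HSIs) contain many bands, each capturing spectral information at a certain wavelength, including some beyond the visible spectrum. These characteristics make HSIs highly effective for remote sensing applications. However, existing hyperspectral imaging devices introduce severe degradation in HSIs, so hyperspectral image denoising has attracted wide attention from the community. Although recent deep HSI denoising methods provide effective solutions, their performance under real-life complex noise remains suboptimal because they lack adaptability to new data. To overcome these limitations, we introduce a self-modulating convolutional neural network, SM-CNN, which uses correlated spectral and spatial information. At its core is a novel spectral self-modulating residual block (SSMRB), which lets the network transform features adaptively based on adjacent spectral data and improves its ability to handle complex noise. In particular, SSMRB turns the denoising network into a dynamic network that adapts its predicted features to each input HSI. Experimental analysis shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods on both synthetic and real data, both quantitatively and qualitatively.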

ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection

paper_url: http://arxiv.org/abs/2309.08196
repo_url: None
paper_authors: Zhimeng Xin, Tianxu Wu, Shiming Chen, Yixiong Zou, Ling Shao, Xinge You
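for: Improving the accuracy of few-shot object detection (FSOD). Existing methods follow a two-stage learning strategy but overlook localizing objects from local parts to the global whole.
methods: Proposes an Extensible Co-Existing Attention (ECEA) module. It learns the extensible ability on the base stage, where data are abundant, and transfers it to the novel stage, so the model quickly adapts to extending local regions to co-existing regions.
results: Extensive experiments on the PASCAL VOC and COCO datasets show that the ECEA module helps few-shot detectors predict complete objects even when some regions do not appear in the training samples. Compared with existing FSOD methods, it achieves a new state of the art.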

Abstract
Few-shot object detection (FSOD) identifies objects from extremely few annotated samples. Most recent FSOD methods apply the two-stage learning paradigm, which transfers the knowledge learned from abundant base classes to assist the few-shot detectors by learning the global features. However, such existing FSOD approaches seldom consider the localization of objects from local to global. Limited by the scarce training data in FSOD, the training samples of novel classes typically capture only part of an object, so such FSOD methods cannot detect completely unseen objects during testing. To tackle this problem, we propose an Extensible Co-Existing Attention (ECEA) module to enable the model to infer the global object according to the local parts. Essentially, the proposed module continuously learns the extensible ability on the base stage with abundant samples and transfers it to the novel stage, which helps the few-shot model quickly adapt to extending local regions to co-existing regions. Specifically, we first devise an extensible attention mechanism that starts with a local region and extends attention to co-existing regions that are similar and adjacent to the given local region. We then implement the extensible attention mechanism at different feature scales to progressively discover the full object in various receptive fields. Extensive experiments on the PASCAL VOC and COCO datasets show that our ECEA module helps the few-shot detector predict the complete object even when some regions fail to appear in the training samples, and that it achieves a new state of the art compared with existing FSOD methods.

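Summary
Few-shot object detection (FSOD) identifies objects from extremely few annotated samples. Most existing FSOD methods adopt a two-stage learning paradigm that transfers knowledge from abundant base classes to help few-shot detectors learn global features. However, these methods seldom consider localizing objects from local to global. Because training samples of novel classes usually capture only part of an object, such methods cannot detect completely unseen objects at test time. To address this, we propose an Extensible Co-Existing Attention (ECEA) module that lets the model infer the global object from its local parts. Specifically, we first design an extensible attention mechanism that starts from a local region and extends attention to co-existing regions that are similar and adjacent to it. We then apply the mechanism at different feature scales to progressively discover the full object in various receptive fields. Extensive experiments on the PASCAL VOC and COCO datasets show that the ECEA module helps few-shot detectors quickly adapt to extending local regions to co-existing regions and achieves a new state of the art compared with existing FSOD methods.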

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

paper_url: http://arxiv.org/abs/2309.08644
repo_url: None
paper_authors: Sungchan Park, Eunyi You, Inhoe Lee, Joonseok Lee
for: 3D multi-person pose estimation from a monocular video (3DMPPE). Existing methods are not yet robust enough for in-the-wild use, and the paper identifies three unresolved issues: lack of robustness to views unseen during training, vulnerability to occlusion, and severe jitter in the output.
methods: A sequence-to-sequence 2D-to-3D lifting model with a novel geometry-aware data augmentation strategy that can generate unbounded data across a variety of views while accounting for the ground plane and occlusions.
results: Robust generalization to diverse unseen views, robust pose recovery under heavy occlusion, and more natural and smoother outputs. The method achieves state-of-the-art performance on public benchmarks and good qualitative results on more challenging in-the-wild videos.

Abstract
3D pose estimation is an invaluable task in computer vision with various practical applications. Especially, 3D pose estimation for multi-person from a monocular video (3DMPPE) is particularly challenging and is still largely uncharted, far from applying to in-the-wild scenarios yet. We pose three unresolved issues with the existing methods: lack of robustness on unseen views during training, vulnerability to occlusion, and severe jittering in the output. As a remedy, we propose POTR-3D, the first realization of a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy, capable of generating unbounded data with a variety of views while caring about the ground plane and occlusions. Through extensive experiments, we verify that the proposed model and data augmentation robustly generalizes to diverse unseen views, robustly recovers the poses against heavy occlusions, and reliably generates more natural and smoother outputs. The effectiveness of our approach is verified not only by achieving the state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos. Demo videos are available at https://www.youtube.com/@potr3d.

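Summary
3D pose estimation is an invaluable task in computer vision with many practical applications. 3D multi-person pose estimation from a monocular video (3DMPPE) is especially challenging and is not yet applicable to in-the-wild scenarios. We identify three unresolved issues with existing methods: lack of robustness to views unseen during training, vulnerability to occlusion, and severe jitter in the output. As a remedy, we propose POTR-3D, the first sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy that can generate unbounded data with a variety of views while accounting for the ground plane and occlusions. Extensive experiments verify that the model and augmentation generalize robustly to diverse unseen views, recover poses under heavy occlusion, and produce more natural and smoother outputs. The approach achieves state-of-the-art performance on public benchmarks, and qualitative results on more challenging in-the-wild videos further confirm its effectiveness. Demo videos are available at https://www.youtube.com/@potr3d.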
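The view-generation part of the augmentation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the axis convention (y is vertical), the pinhole camera, and the names `rotate_about_vertical`, `project`, `focal`, and `depth_offset` are all assumptions made for illustration, and occlusion handling is left out entirely.

```python
from math import cos, sin, radians

def rotate_about_vertical(joints, angle_deg):
    """Rotate 3D joints (x, y, z) around the vertical y axis.

    Heights (y) are untouched, so a flat ground plane y = const stays flat.
    """
    a = radians(angle_deg)
    return [(x * cos(a) + z * sin(a), y, -x * sin(a) + z * cos(a))
            for x, y, z in joints]

def project(joints, focal=1000.0, depth_offset=5.0):
    """Pinhole projection to 2D; depth_offset places the camera in front of the scene."""
    return [(focal * x / (z + depth_offset), focal * y / (z + depth_offset))
            for x, y, z in joints]

# One hypothetical joint, 1.7 m above the ground, seen from a new view.
pose_3d = [(1.0, 1.7, 0.0)]
rotated = rotate_about_vertical(pose_3d, 90)
pose_2d = project(rotated)
```

Sampling many values of `angle_deg` yields many (2D input, 3D target) training pairs from a single captured pose, which is the sense in which the amount of data becomes unbounded.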

STDG: Semi-Teacher-Student Training Paradigram for Depth-guided One-stage Scene Graph Generation

paper_url: http://arxiv.org/abs/2309.08179
repo_url: None
paper_authors: Xukun Zhou, Zhenbo Song, Jun He, Hongyan Liu, Zhaoxin Fan
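for: Improving environmental understanding for autonomous robotic systems, particularly under complex backgrounds.
methods: A one-stage scene graph generation method built from three custom modules: a depth-guided HHA representation generation module, a depth-guided semi-teaching network learning module, and a depth-guided scene graph generation module.
results: Compared with the baselines, the method significantly improves performance on one-stage scene graph generation.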

Abstract
Scene Graph Generation is a critical enabler of environmental comprehension for autonomous robotic systems. Most of existing methods, however, are often thwarted by the intricate dynamics of background complexity, which limits their ability to fully decode the inherent topological information of the environment. Additionally, the wealth of contextual information encapsulated within depth cues is often left untapped, rendering existing approaches less effective. To address these shortcomings, we present STDG, an avant-garde Depth-Guided One-Stage Scene Graph Generation methodology. The innovative architecture of STDG is a triad of custom-built modules: The Depth Guided HHA Representation Generation Module, the Depth Guided Semi-Teaching Network Learning Module, and the Depth Guided Scene Graph Generation Module. This trifecta of modules synergistically harnesses depth information, covering all aspects from depth signal generation and depth feature utilization, to the final scene graph prediction. Importantly, this is achieved without imposing additional computational burden during the inference phase. Experimental results confirm that our method significantly enhances the performance of one-stage scene graph generation baselines.

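Summary
Scene graph generation is a key enabler of environmental understanding for autonomous robotic systems, but most existing methods are limited by background complexity and cannot fully decode the inherent topological information of the environment. Existing methods also tend to leave the contextual information in depth cues unused, which makes them less effective. To address these shortcomings, we propose STDG, a depth-guided one-stage scene graph generation method. Its core consists of three custom modules: a depth-guided HHA representation generation module, a depth-guided semi-teaching network learning module, and a depth-guided scene graph generation module. Together they exploit depth information from depth signal generation through depth feature utilization to the final scene graph prediction, without adding computational burden at inference time. Experimental results show that the method significantly improves the performance of one-stage scene graph generation baselines.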

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

paper_url: http://arxiv.org/abs/2309.08167
repo_url: https://github.com/dun-research/drca
paper_authors: Rui Deng, Qian Wu, Yuke Li, Haoran Fu
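for: Improving video inference efficiency while coping with fast-changing and fine-grained scenarios.
methods: An efficient video representation network that compresses and aligns frame features at different resolutions, cutting computational cost from the early stage of the network while keeping temporal correlations consistent.
results: The method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared with state-of-the-art methods. Code: https://github.com/dun-research/DRCA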

Abstract
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicit discard of spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be end-to-end optimized via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code:https://github.com/dun-research/DRCA

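Summary
Optimizing video inference efficiency has become increasingly important as demand for video analysis grows across fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which causes problems in fast-changing and fine-grained scenarios. To address this, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information early in the network to reduce computational cost while keeping temporal correlations consistent. Specifically, a Differentiable Context-aware Compression Module encodes saliency and non-saliency frame features, refining and updating them into a high-low resolution video sequence. To process the new sequence, a new Resolution-Align Transformer Layer captures global temporal correlations among frame features of different resolutions, while reducing spatial computation cost quadratically by using fewer spatial tokens in low-resolution non-saliency frames. The whole network can be optimized end to end thanks to the differentiable compression module. Experiments show that the method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared with state-of-the-art methods. Code: https://github.com/dun-research/DRCA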
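The claim that fewer spatial tokens reduce spatial computation quadratically can be checked with simple arithmetic. The sketch below assumes plain self-attention within a single frame, a 224x224 frame, and 16x16 patches; these numbers and the function name `attention_pair_count` are illustrative assumptions, not values taken from the paper.

```python
def attention_pair_count(height: int, width: int, patch: int) -> int:
    """Number of token pairs that self-attention compares within one frame."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens

# High-resolution saliency frame versus a non-saliency frame at half resolution.
full = attention_pair_count(224, 224, 16)   # 14 * 14 = 196 tokens
half = attention_pair_count(112, 112, 16)   # 7 * 7 = 49 tokens
ratio = full / half
```

Halving the side length cuts the token count by 4 and the pairwise attention cost by 16, which is why compressing only the non-saliency frames can save a large share of the spatial computation.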

A Ground Segmentation Method Based on Point Cloud Map for Unstructured Roads

paper_url: http://arxiv.org/abs/2309.08164
repo_url: None
paper_authors: Zixuan Li, Haiying Lin, Zhangyu Wang, Huazhi Li, Miao Yu, Jie Wang
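for: Improving ground segmentation accuracy in unstructured road scenes.
methods: A point-cloud-map-based method with three parts: region-of-interest extraction, location association between the point cloud map and the real-time point cloud, and background subtraction with a background model based on a Gaussian distribution.
results: Experiments show a correct segmentation rate of 99.95% for ground points with a running time of 26 ms. Compared with the state-of-the-art method Patchwork++, the average ground point segmentation accuracy improves by 7.43%, while the running time increases by 17 ms.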

Abstract
Ground segmentation, as a basic task of unmanned intelligent perception, provides important support for the target detection task. Unstructured road scenes, represented by open-pit mines, have irregular boundary lines and uneven road surfaces, which lead to segmentation errors in current ground segmentation methods. To solve this problem, a ground segmentation method based on a point cloud map is proposed, which involves three parts: region of interest extraction, point cloud registration, and background subtraction. First, boundary semantic associations are established to obtain regions of interest in unstructured roads. Second, the location association between the point cloud map and the real-time point cloud of the region of interest is established using semantic information. Third, a background model based on a Gaussian distribution is built according to the location association, and the ground is segmented in the real-time point cloud by background subtraction. Experimental results show that the correct segmentation rate of ground points is 99.95%, and the running time is 26 ms. Compared with the state-of-the-art ground segmentation algorithm Patchwork++, the average accuracy of ground point segmentation is increased by 7.43%, and the running time is increased by 17 ms. Furthermore, the proposed method is practically applied to unstructured road scenarios represented by open-pit mines.

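Summary
Ground segmentation, as a basic task of unmanned intelligent perception, provides important support for target detection. Unstructured road scenes such as open-pit mines have irregular boundary lines and uneven road surfaces, which cause segmentation errors in current ground segmentation methods. To solve this problem, a ground segmentation method based on a point cloud map is proposed, with three parts: region of interest extraction, point cloud registration, and background subtraction. First, boundary semantic associations are established to obtain regions of interest in unstructured roads. Second, semantic information is used to associate locations in the point cloud map with the real-time point cloud of the region of interest. Finally, a Gaussian-distribution background model is built from this association, and the ground is segmented in the real-time point cloud by background subtraction. Experiments show a correct segmentation rate of 99.95% for ground points and a running time of 26 ms. Compared with the state-of-the-art ground segmentation algorithm Patchwork++, the average accuracy of ground point segmentation improves by 7.43% and the running time increases by 17 ms. The method has also been applied in practice to unstructured road scenes such as open-pit mines.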
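The third step, background subtraction against a Gaussian model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the map is assumed to be binned into grid cells keyed by `(row, col)`, only point height `z` is modeled, and `k` and `min_std` are hypothetical tuning parameters.

```python
from statistics import mean, pstdev

def build_background(map_heights):
    """Fit a per-cell Gaussian (mean, std) to ground heights taken from the map."""
    return {cell: (mean(zs), pstdev(zs)) for cell, zs in map_heights.items()}

def is_ground(model, cell, z, k=3.0, min_std=0.02):
    """A real-time point is ground if its height is within k std of the cell mean."""
    if cell not in model:
        return False  # no map evidence for this cell
    mu, sigma = model[cell]
    return abs(z - mu) <= k * max(sigma, min_std)

# Toy map: one flat cell and one cell on a raised, uneven surface (heights in meters).
map_heights = {(0, 0): [0.00, 0.02, -0.02, 0.01], (0, 1): [0.50, 0.55, 0.45]}
model = build_background(map_heights)
labels = [
    is_ground(model, (0, 0), 0.03),  # close to the flat ground
    is_ground(model, (0, 0), 0.80),  # an obstacle above the flat cell
    is_ground(model, (0, 1), 0.52),  # ground on the raised cell
]
```

Because each cell carries its own mean and spread, a point 0.5 m high can be ground in one cell and an obstacle in another, which is what makes a map-based model suited to uneven surfaces.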

paper_url: http://arxiv.org/abs/2309.08160
repo_url: None
paper_authors: Yuda Bi, Anees Abrol, Jing Sui, Vince Calhoun
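for: Synthesizing functional network connectivity (FNC) data with a conditional Vision Transformer generative adversarial network (cViT-GAN) and examining its relationship with structural magnetic resonance imaging (sMRI).
methods: The cViT-GAN takes sMRI data as input and generates an FNC matrix for each subject. From these, a group-difference FNC matrix is formed, which reaches a Pearson correlation of 0.73 with the actual FNC matrix.
results: The FNC visualizations show significant correlations in particular subcortical brain regions, indicating that the model captures structural-functional associations. This performance distinguishes the model from conditional CNN-based GAN alternatives such as Pix2Pix.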

Abstract
The cross-modal synthesis between structural magnetic resonance imaging (sMRI) and functional network connectivity (FNC) is a relatively unexplored area in medical imaging, especially with respect to schizophrenia. This study employs conditional Vision Transformer Generative Adversarial Networks (cViT-GANs) to generate FNC data based on sMRI inputs. After training on a comprehensive dataset that included both individuals with schizophrenia and healthy control subjects, our cViT-GAN model effectively synthesized the FNC matrix for each subject, and then formed a group difference FNC matrix, obtaining a Pearson correlation of 0.73 with the actual FNC matrix. In addition, our FNC visualization results demonstrate significant correlations in particular subcortical brain regions, highlighting the model's capability of capturing detailed structural-functional associations. This performance distinguishes our model from conditional CNN-based GAN alternatives such as Pix2Pix. Our research is one of the first attempts to link sMRI and FNC synthesis, setting it apart from other cross-modal studies that concentrate on T1- and T2-weighted MR images or the fusion of MRI and CT scans.

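Summary
Cross-modal synthesis between structural magnetic resonance imaging (sMRI) and functional network connectivity (FNC) is a relatively unexplored area in medical imaging, especially with respect to schizophrenia. This study uses conditional Vision Transformer Generative Adversarial Networks (cViT-GANs) to generate FNC data from sMRI inputs. After training on a comprehensive dataset that included both individuals with schizophrenia and healthy controls, the cViT-GAN model synthesized the FNC matrix for each subject and then formed a group-difference FNC matrix, obtaining a Pearson correlation of 0.73 with the actual FNC matrix. In addition, the FNC visualizations show significant correlations in particular subcortical brain regions, highlighting the model's ability to capture detailed structural-functional associations. This performance distinguishes the model from conditional CNN-based GAN alternatives such as Pix2Pix. The research is one of the first attempts to link sMRI and FNC synthesis, which sets it apart from other cross-modal studies that concentrate on T1- and T2-weighted MR images or the fusion of MRI and CT scans.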
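The evaluation described above, a group-difference FNC matrix compared by Pearson correlation, can be sketched in a few lines. The choices here are assumptions for illustration: the difference is taken as patient mean minus control mean, only the upper triangle is correlated because FNC matrices are symmetric, and all function names and toy values are hypothetical.

```python
from math import sqrt

def mean_matrix(mats):
    """Element-wise mean of a list of equally sized square matrices."""
    n, size = len(mats), len(mats[0])
    return [[sum(m[i][j] for m in mats) / n for j in range(size)]
            for i in range(size)]

def group_difference(patients, controls):
    """Patient mean minus control mean, element by element."""
    p, c = mean_matrix(patients), mean_matrix(controls)
    size = len(p)
    return [[p[i][j] - c[i][j] for j in range(size)] for i in range(size)]

def upper_triangle(mat):
    """Off-diagonal upper-triangle entries, since FNC matrices are symmetric."""
    return [mat[i][j] for i in range(len(mat)) for j in range(i + 1, len(mat))]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy 2x2 connectivity matrices, one subject per group.
diff = group_difference([[[0.0, 0.4], [0.4, 0.0]]], [[[0.0, 0.1], [0.1, 0.0]]])
```

In use, `pearson(upper_triangle(real_diff), upper_triangle(synthetic_diff))` would give a single score comparable in spirit to the 0.73 reported in the abstract.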

AdSEE: Investigating the Impact of Image Style Editing on Advertisement Attractiveness

paper_url: http://arxiv.org/abs/2309.08159
repo_url: https://github.com/liyaojiang1998/adsee
paper_authors: Liyao Jiang, Chenglin Li, Haolan Chen, Xiaodong Gao, Xinwang Zhong, Yang Qiu, Shani Ye, Di Niu
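for: Investigating whether semantic editing of ad images affects the click rates of online advertisements.
methods: StyleGAN-based semantic editing and inversion applied to ad images, plus a click rate predictor that uses GAN-based latent representations alongside traditional visual and textual features.
results: Extensive offline tests on a large collected dataset (QQ-AD) show that different semantic directions and edit intensities can affect click rates. The paper also designs an evolutionary ad editor that efficiently searches for the best edit directions and intensity. In online A/B tests, AdSEE-edited samples achieved higher click-through rates than the control group of original ads.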

Abstract
Online advertisements are important elements in e-commerce sites, social media platforms, and search engines. With the increasing popularity of mobile browsing, many online ads are displayed with visual information in the form of a cover image in addition to text descriptions to grab the attention of users. Various recent studies have focused on predicting the click rates of online advertisements aware of visual features or composing optimal advertisement elements to enhance visibility. In this paper, we propose Advertisement Style Editing and Attractiveness Enhancement (AdSEE), which explores whether semantic editing to ads images can affect or alter the popularity of online advertisements. We introduce StyleGAN-based facial semantic editing and inversion to ads images and train a click rate predictor attributing GAN-based face latent representations in addition to traditional visual and textual features to click rates. Through a large collected dataset named QQ-AD, containing 20,527 online ads, we perform extensive offline tests to study how different semantic directions and their edit coefficients may impact click rates. We further design a Genetic Advertisement Editor to efficiently search for the optimal edit directions and intensity given an input ad cover image to enhance its projected click rates. Online A/B tests performed over a period of 5 days have verified the increased click-through rates of AdSEE-edited samples as compared to a control group of original ads, verifying the relation between image styles and ad popularity. We open source the code for AdSEE research at https://github.com/LiyaoJiang1998/adsee.

Summary
Online advertisements are important elements of e-commerce sites, social media platforms, and search engines. With the growing popularity of mobile browsing, many online ads show visual information as a cover image in addition to text, to grab users' attention. Recent studies have focused on predicting the click rates of online advertisements from visual features, or on composing optimal advertisement elements to enhance visibility. This paper proposes Advertisement Style Editing and Attractiveness Enhancement (AdSEE), which explores whether semantic editing of ad images can affect the popularity of online advertisements. It applies StyleGAN-based facial semantic editing and inversion to ad images and trains a click rate predictor that uses GAN-based face latent representations in addition to traditional visual and textual features. Using a large collected dataset named QQ-AD, containing 20,527 online ads, extensive offline tests study how different semantic directions and their edit coefficients affect click rates. A Genetic Advertisement Editor then searches efficiently for the optimal edit directions and intensity for an input ad cover image to enhance its projected click rate. Online A/B tests over a period of 5 days verified higher click-through rates for AdSEE-edited samples than for a control group of original ads, confirming the relation between image styles and ad popularity. The code is open source at https://github.com/LiyaoJiang1998/adsee.

Uncertainty-Aware Multi-View Visual Semantic Embedding

paper_url: http://arxiv.org/abs/2309.08154
repo_url: None
paper_authors: Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Xingguang Wang, Huayi Wu
for: This paper aims to improve image-text retrieval by leveraging semantic information and modeling uncertainty in multi-modal understanding.
methods: The proposed Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE) framework decomposes the overall image-text matching into multiple view-text matchings and uses an uncertainty-aware loss function (UALoss) to adaptively model the uncertainty in each view-text correspondence.
results: Experimental results on the Flickr30k and MS-COCO datasets demonstrate that UAMVSE outperforms state-of-the-art models.

Abstract
The key challenge in image-text retrieval is effectively leveraging semantic information to measure the similarity between vision and language data. However, using instance-level binary labels, where each image is paired with a single text, fails to capture multiple correspondences between different semantic units, leading to uncertainty in multi-modal semantic understanding. Although recent research has captured fine-grained information through more complex model structures or pre-training techniques, few studies have directly modeled uncertainty of correspondence to fully exploit binary labels. To address this issue, we propose an Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE) framework that decomposes the overall image-text matching into multiple view-text matchings. Our framework introduces an uncertainty-aware loss function (UALoss) to compute the weighting of each view-text loss by adaptively modeling the uncertainty in each view-text correspondence. Different weightings guide the model to focus on different semantic information, enhancing the model's ability to comprehend the correspondence of images and texts. We also design an optimized image-text matching strategy by normalizing the similarity matrix to improve model performance. Experimental results on the Flickr30k and MS-COCO datasets demonstrate that UAMVSE outperforms state-of-the-art models.

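Summary
The key challenge in image-text retrieval is using semantic information effectively to measure the similarity between vision and language data. Instance-level binary labels, where each image is paired with a single text, cannot capture the multiple correspondences between different semantic units, which leads to uncertainty in multi-modal semantic understanding. Recent research has captured fine-grained information through more complex model structures or pre-training techniques, but few studies model correspondence uncertainty directly to make full use of binary labels. To address this, we propose an Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE) framework that decomposes overall image-text matching into multiple view-text matchings. The framework introduces an uncertainty-aware loss function (UALoss) that computes the weight of each view-text loss by adaptively modeling the uncertainty in each view-text correspondence. Different weights guide the model to focus on different semantic information, improving its understanding of how images and texts correspond. We also design an optimized image-text matching strategy that normalizes the similarity matrix to improve performance. Experiments on the Flickr30k and MS-COCO datasets show that UAMVSE outperforms state-of-the-art models.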
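The abstract does not give the formula for UALoss, so the sketch below substitutes a common uncertainty-weighting form in which each loss is scaled by exp(-s) and the log-variance s is added as a penalty. It is a stand-in to show how learned uncertainty can turn into per-view weights, not the paper's loss; `view_losses` and `log_variances` are hypothetical names.

```python
from math import exp

def view_weights(log_variances):
    """Weight of each view-text loss: larger uncertainty gives a smaller weight."""
    return [exp(-s) for s in log_variances]

def uncertainty_weighted_loss(view_losses, log_variances):
    """Sum of exp(-s_i) * L_i + s_i over views.

    The + s_i term stops the model from inflating every uncertainty
    just to shrink the weighted losses toward zero.
    """
    return sum(exp(-s) * loss + s
               for loss, s in zip(view_losses, log_variances))

# Two views with equal certainty, then one confident and one uncertain view.
equal = uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0])
weights = view_weights([0.0, 1.0])
```

In a real model the log-variances would be trainable parameters or network outputs, so the weighting adapts during training instead of being fixed by hand.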

DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions

paper_url: http://arxiv.org/abs/2309.08152
repo_url: None
paper_authors: Minsik Jeon, Junwon Seo, Jihong Min
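for: Making object detection more reliable in adverse weather conditions.
methods: An unsupervised domain adaptation approach that handles the style gap and the weather gap separately to improve detection reliability.
results: Compared with other methods, the approach achieves better object detection performance in adverse weather conditions.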

Abstract
Despite the success of deep learning-based object detection methods in recent years, it is still challenging to make the object detector reliable in adverse weather conditions such as rain and snow. For the robust performance of object detectors, unsupervised domain adaptation has been utilized to adapt the detection network trained on clear weather images to adverse weather images. While previous methods do not explicitly address weather corruption during adaptation, the domain gap between clear and adverse weather can be decomposed into two factors with distinct characteristics: a style gap and a weather gap. In this paper, we present an unsupervised domain adaptation framework for object detection that can more effectively adapt to real-world environments with adverse weather conditions by addressing these two gaps separately. Our method resolves the style gap by concentrating on style-related information of high-level features using an attention module. Using self-supervised contrastive learning, our framework then reduces the weather gap and acquires instance features that are robust to weather corruption. Extensive experiments demonstrate that our method outperforms other methods for object detection in adverse weather conditions.

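Summary
Despite the recent success of deep-learning-based object detection, making a detector reliable in adverse weather such as rain and snow remains a challenge. To make detectors more robust, unsupervised domain adaptation has been used to adapt a detection network trained on clear-weather images to adverse-weather images. Previous methods do not explicitly address weather corruption during adaptation, yet the domain gap between clear and adverse weather can be decomposed into two factors with distinct characteristics: a style gap and a weather gap. This paper presents an unsupervised domain adaptation framework for object detection that adapts more effectively to real-world adverse weather by addressing the two gaps separately. The style gap is resolved by using an attention module to concentrate on the style-related information in high-level features. Self-supervised contrastive learning then reduces the weather gap and yields instance features that are robust to weather corruption. Extensive experiments show that the method outperforms other methods for object detection in adverse weather conditions.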
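The self-supervised contrastive step mentioned in the abstract is sketched here with the widely used InfoNCE loss. Whether the paper uses exactly this loss is an assumption; the feature vectors, the `temperature` value, and the function names are illustrative only.

```python
from math import exp, log, sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Low when the anchor is close to its positive and far from the negatives."""
    pos = exp(cosine(anchor, positive) / temperature)
    neg = sum(exp(cosine(anchor, n) / temperature) for n in negatives)
    return -log(pos / (pos + neg))

# Hypothetical instance features: the same object in clear and rainy views
# should match; a different object serves as the negative.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls the features of an instance under different weather together, which is one way to obtain instance features that are robust to weather corruption.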

Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs

paper_url: http://arxiv.org/abs/2309.08146
repo_url: https://github.com/awsaf49/synatt
paper_authors: Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Shaikh Anowarul Fattah, Mohammad Saquib
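for: Detecting synthetic speech and attributing a synthetic speech track to the tool that generated it.
methods: The audio is transformed into a log-mel spectrogram, features are extracted with a CNN, and semi-supervision and ensembling are used to improve robustness and generalizability.
results: In the IEEE SP Cup challenge, the method outperforms other top teams in accuracy by 12-13% on the strongly perturbed dataset (Eval 2) and by 1-2% on the weakly perturbed dataset (Eval 1).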

Abstract
With the huge technological advances introduced by deep learning in audio and speech processing, many novel synthetic speech techniques have achieved incredibly realistic results. As these methods generate realistic fake human voices, they can be used in malicious acts such as impersonation, fake news spreading, spoofing, and media manipulation. Hence, the ability to distinguish synthetic from natural speech has become an urgent necessity. Moreover, being able to tell which algorithm has been used to generate a synthetic speech track can be of preeminent importance to track down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator that is used to synthesize it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it between five known and unknown algorithms, utilizing semi-supervision and ensemble to improve its robustness and generalizability significantly. The proposed detector is validated on two evaluation datasets consisting of a total of 18,000 weakly perturbed (Eval 1) and 10,000 strongly perturbed (Eval 2) synthetic speeches. The proposed method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.

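Summary
With the major advances that deep learning has brought to audio and speech processing, many new synthetic speech techniques achieve strikingly realistic results. Because these methods generate realistic fake human voices, they can be used for malicious acts such as impersonation, fake news spreading, spoofing, and media manipulation, so the ability to distinguish synthetic from natural speech has become urgent. Being able to tell which algorithm generated a synthetic speech track is also important. This paper proposes a new strategy for attributing a synthetic speech track to the generator that produced it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features with a CNN, and classifies the track among five known algorithms and unknown ones, using semi-supervision and ensembling to improve robustness and generalizability. It is validated on two evaluation datasets containing 18,000 weakly perturbed (Eval 1) and 10,000 strongly perturbed (Eval 2) synthetic speech samples. In the IEEE SP Cup challenge at ICASSP 2022, the method outperforms other top teams in accuracy by 12-13% on Eval 2 and by 1-2% on Eval 1.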
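The ensemble step can be illustrated by averaging the class probabilities of several models over six labels, the five known generators plus an unknown class. Treating "unknown" as an ordinary sixth class, the class names, and the probability values below are all assumptions for illustration; the abstract does not say how the paper represents unknown generators.

```python
CLASSES = ["alg0", "alg1", "alg2", "alg3", "alg4", "unknown"]

def ensemble_predict(prob_rows):
    """Average per-model class probabilities and return the top label."""
    n = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n for i in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=lambda i: avg[i])
    return CLASSES[best], avg

# Hypothetical softmax outputs of three CNNs for one speech track.
rows = [
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.10, 0.10, 0.10, 0.40],  # this model leans toward "unknown"
    [0.05, 0.50, 0.10, 0.10, 0.05, 0.20],
]
label, avg = ensemble_predict(rows)
```

Averaging lets two confident models outvote one that is unsure, which is the usual reason ensembling improves robustness on perturbed inputs.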

Multi-Scale Estimation for Omni-Directional Saliency Maps Using Learnable Equator Bias

paper_url: http://arxiv.org/abs/2309.08139
repo_url: https://github.com/islab-sophia/odisal
paper_authors: Takao Yamanaka, Tatsuya Suzuki, Taiki Nobutsune, Chenjunlin Wu
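for: Estimating saliency maps for omni-directional images, which represent the probability distribution of gaze points with a head-mounted display, in order to detect important regions.
methods: A new saliency-map estimation model that extracts overlapping 2D plane images from the omni-directional image at various directions and angles of view.
results: Experiments show that the model improves the accuracy of saliency maps for omni-directional images.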

Abstract
Omni-directional images have been used in a wide range of applications. For these applications, it would be useful to estimate saliency maps representing probability distributions of gazing points with a head-mounted display, to detect important regions in the omni-directional images. This paper proposes a novel saliency-map estimation model for omni-directional images by extracting overlapping 2-dimensional (2D) plane images from omni-directional images at various directions and angles of view. While 2D saliency maps tend to have high probability at the center of images (center bias), the high-probability region appears at horizontal directions in omni-directional saliency maps when a head-mounted display is used (equator bias). Therefore, the 2D saliency model with a center-bias layer was fine-tuned with an omni-directional dataset by replacing the center-bias layer with an equator-bias layer conditioned on the elevation angle for the extraction of the 2D plane image. The limited availability of omni-directional images in saliency datasets can be compensated for by using the well-established 2D saliency model pretrained on a large number of training images with the ground truth of 2D saliency maps. In addition, this paper proposes a multi-scale estimation method by extracting 2D images at multiple angles of view to detect objects of various sizes with variable receptive fields. The saliency maps estimated from the multiple angles of view were integrated by using pixel-wise attention weights calculated in an integration layer for weighting the optimal scale to each object. The proposed method was evaluated using a publicly available dataset with evaluation metrics for omni-directional saliency maps. It was confirmed that the accuracy of the saliency maps was improved by the proposed method.

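Summary
Omni-directional images are used in a wide range of applications. For these applications it is useful to estimate saliency maps, which represent the probability distribution of gaze points with a head-mounted display, in order to detect important regions in the image. This paper proposes a new saliency-map estimation model for omni-directional images that extracts overlapping 2D plane images at various directions and angles of view. 2D saliency maps tend to have high probability at the image center (center bias), whereas with a head-mounted display the high-probability region of an omni-directional saliency map appears along horizontal directions (equator bias). The 2D saliency model is therefore fine-tuned on an omni-directional dataset, with its center-bias layer replaced by an equator-bias layer conditioned on the elevation angle of the extracted 2D plane image. Because omni-directional images are scarce in saliency datasets, the method relies on a well-established 2D saliency model pretrained on a large number of images with ground-truth 2D saliency maps. The paper also proposes a multi-scale estimation method that extracts 2D images at multiple angles of view to detect objects of various sizes, and integrates the resulting saliency maps using pixel-wise attention weights that select the best scale for each object. Evaluation on a publicly available dataset confirmed that the method improves the accuracy of the saliency maps.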
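The idea of an equator bias can be sketched as a prior that depends only on the elevation angle of each image row. The paper's layer is learnable and conditioned on elevation; the fixed Gaussian shape, `sigma_deg`, and the function names below are assumptions chosen to keep the illustration small.

```python
from math import exp

def equator_bias(elevation_deg, sigma_deg=20.0):
    """Prior weight that peaks at the equator (0 degrees) and decays toward the poles."""
    return exp(-0.5 * (elevation_deg / sigma_deg) ** 2)

def apply_bias(saliency_rows, elevations_deg, sigma_deg=20.0):
    """Scale each row by its elevation prior, then renormalize to sum to 1."""
    weighted = [[v * equator_bias(e, sigma_deg) for v in row]
                for row, e in zip(saliency_rows, elevations_deg)]
    total = sum(sum(row) for row in weighted)
    return [[v / total for v in row] for row in weighted]

# A uniform 3-row map: rows at +60, 0, and -60 degrees of elevation.
biased = apply_bias([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], [60.0, 0.0, -60.0])
```

A center-bias layer would instead weight by distance from the image center in both axes; making the prior a function of elevation alone is what lets the same 2D model be reused for plane images extracted at any elevation.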

Let’s Roll: Synthetic Dataset Analysis for Pedestrian Detection Across Different Shutter Types

paper_url: http://arxiv.org/abs/2309.08136
repo_url: None
paper_authors: Yue Hu, Gourav Datta, Kira Beerel, Peter Beerel
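for: Examining how different shutter mechanisms affect machine learning object detection models, and whether detection accuracy differs significantly between them.
methods: Synthetic data generated with Unreal Engine 5, used to train and evaluate mainstream detection models.
results: The effect of the shutter mechanism on detection accuracy depends on the metric, in a setting focused on low-speed objects such as pedestrians. For coarse-grained detection (mAP at IoU=0.5) the two shutter types perform almost identically, but for fine-grained detection (mAP at IoU=0.5:0.95) they differ significantly. This suggests that ML pipelines may not need explicit rolling shutter (RS) correction for many applications, but that further research may be needed to improve fine-grained localization under RS.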

Abstract
Computer vision (CV) pipelines are typically evaluated on datasets processed by image signal processing (ISP) pipelines even though, for resource-constrained applications, an important research goal is to avoid as many ISP steps as possible. In particular, most CV datasets consist of global shutter (GS) images even though most cameras today use a rolling shutter (RS). This paper studies the impact of different shutter mechanisms on machine learning (ML) object detection models on a synthetic dataset that we generate using the advanced simulation capabilities of Unreal Engine 5 (UE5). In particular, we train and evaluate mainstream detection models with our synthetically-generated paired GS and RS datasets to ascertain whether there exists a significant difference in detection accuracy between these two shutter modalities, especially when capturing low-speed objects (e.g., pedestrians). The results of this emulation framework indicate the performance between them are remarkably congruent for coarse-grained detection (mean average precision (mAP) for IOU=0.5), but have significant differences for fine-grained measures of detection accuracy (mAP for IOU=0.5:0.95). This implies that ML pipelines might not need explicit correction for RS for many object detection applications, but mitigating RS effects in ISP-less ML pipelines that target fine-grained location of the objects may need additional research.

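Summary
Computer vision (CV) pipelines are usually evaluated on datasets processed by image signal processing (ISP) pipelines, even though an important research goal for resource-constrained applications is to avoid as many ISP steps as possible. In particular, most CV datasets consist of global shutter (GS) images, while most cameras today use a rolling shutter (RS). This paper studies the impact of different shutter mechanisms on machine learning (ML) object detection models, using a synthetic dataset generated with the advanced simulation capabilities of Unreal Engine 5 (UE5). Mainstream detection models are trained and evaluated on the paired GS and RS datasets to determine whether detection accuracy differs between the two shutter modalities, especially when capturing low-speed objects such as pedestrians. The results indicate that performance is very similar for coarse-grained detection (mean average precision (mAP) at IoU=0.5) but differs significantly for fine-grained measures of detection accuracy (mAP at IoU=0.5:0.95). This implies that ML pipelines may not need explicit RS correction for many object detection applications, but mitigating RS effects in ISP-less ML pipelines that target fine-grained object localization may need further research.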
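The gap between the coarse metric (IoU=0.5) and the fine metric (IoU=0.5:0.95) can be made concrete with a single box pair. In this sketch a 10-pixel horizontal shift stands in for rolling-shutter skew; that substitution, the box sizes, and the function names are assumptions for illustration, while the ten thresholds from 0.50 to 0.95 in steps of 0.05 follow the usual COCO convention.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, gt):
    """Thresholds at which the prediction still counts as a true positive."""
    score = iou(pred, gt)
    return [t for t in THRESHOLDS if score >= t]

gt_box = (0.0, 0.0, 100.0, 200.0)      # a pedestrian-shaped ground-truth box
pred_box = (10.0, 0.0, 110.0, 200.0)   # same size, shifted 10 px sideways
score = iou(pred_box, gt_box)
passed = thresholds_passed(pred_box, gt_box)
```

The shifted box is a clear hit at IoU=0.5 but misses the three strictest thresholds, which shows how a small geometric distortion can leave the coarse metric unchanged while lowering the averaged fine-grained one.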

AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT

paper_url: http://arxiv.org/abs/2309.08134
repo_url: None
paper_authors: Fangbo Qin, Taogang Hou, Shan Lin, Kaiyuan Wang, Michael C. Yip, Shan Yu
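for: Proposing AnyOKP, a one-shot, instance-aware object keypoint extraction method that works across object categories, for flexible object-centric visual perception.
methods: The method leverages the representation ability of a pretrained vision transformer (ViT) and learns keypoints from a support image. Best-prototype pairs (BPPs) are searched for by appearance similarity to yield candidate keypoints, and the graph of all candidates is then divided into sub-graphs, each representing an object instance.
results: Experiments show cross-category flexibility and instance awareness, along with remarkable robustness to domain shift and viewpoint change.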

Abstract
Towards flexible object-centric visual perception, we propose a one-shot instance-aware object keypoint (OKP) extraction approach, AnyOKP, which leverages the powerful representation ability of a pretrained vision transformer (ViT), and can obtain keypoints on multiple object instances of arbitrary category after learning from a support image. An off-the-shelf pretrained ViT is directly deployed for generalizable and transferable feature extraction, which is followed by training-free feature enhancement. The best-prototype pairs (BPPs) are searched for in support and query images based on appearance similarity, to yield instance-unaware candidate keypoints. Then, the entire graph with all candidate keypoints as vertices is divided into sub-graphs according to the feature distributions on the graph edges. Finally, each sub-graph represents an object instance. AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot, which not only demonstrates the cross-category flexibility and instance awareness, but also shows remarkable robustness to domain shift and viewpoint change.

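Summary
Towards flexible object-centric visual perception, we propose AnyOKP, a one-shot, instance-aware object keypoint (OKP) extraction method. It leverages the strong representation ability of a pretrained vision transformer (ViT) and, after learning from a support image, can extract keypoints on multiple object instances. An off-the-shelf pretrained ViT is used directly for generalizable and transferable feature extraction, followed by training-free feature enhancement. Best-prototype pairs (BPPs) are searched for in the support and query images based on appearance similarity, and all candidate keypoints are then organized as the vertices of a graph that is divided into sub-graphs, each representing one object instance. Evaluated on real object images, AnyOKP shows cross-category flexibility and instance awareness, as well as remarkable robustness to domain shift and viewpoint change.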
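The search for matching features between support and query images can be sketched with mutual nearest neighbors under cosine similarity. Reading "best-prototype pairs" as mutual best matches is an assumption made for this illustration, as are the toy 2D feature vectors and the function names; real ViT features would have hundreds of dimensions.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def best_match(vec, pool):
    """Index of the pool vector most similar to vec."""
    return max(range(len(pool)), key=lambda j: cosine(vec, pool[j]))

def mutual_best_pairs(support, query):
    """Keep (i, j) only when support i and query j each pick the other."""
    pairs = []
    for i, s in enumerate(support):
        j = best_match(s, query)
        if best_match(query[j], support) == i:
            pairs.append((i, j))
    return pairs

# Two support keypoint features and three query patch features.
support_feats = [[1.0, 0.0], [0.0, 1.0]]
query_feats = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]
pairs = mutual_best_pairs(support_feats, query_feats)
```

The ambiguous query feature `[0.7, 0.7]` is matched by neither keypoint, which shows how a mutual check filters out weak candidates without any training.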

Increasing diversity of omni-directional images generated from single image using cGAN based on MLPMixer

paper_url: http://arxiv.org/abs/2309.08129
repo_url: https://github.com/islab-sophia/odigen-mlpmixer
paper_authors: Atsuya Nakata, Ryuto Miyazaki, Takao Yamanaka
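for: This paper proposes a new method for generating omni-directional images from a single image.
methods: The method replaces the convolutional neural network (CNN) used in the conventional GAN-based generator with an MLPMixer, which propagates information across the image more effectively and thereby increases the diversity of the generated omni-directional images.
results: The method reduces memory consumption and computational cost while achieving performance competitive with the conventional method, and it generates more diverse omni-directional images.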

Abstract
This paper proposes a novel approach to generating omni-directional images from a single snapshot picture. The previous method has relied on generative adversarial networks based on convolutional neural networks (CNN). Although this method has successfully generated omni-directional images, CNN has two drawbacks for this task. First, since a convolutional layer only processes a local area, it is difficult to propagate the information of an input snapshot picture embedded in the center of the omni-directional image to the edges of the image. Thus, the omni-directional images created by the CNN-based generator tend to have less diversity at the edges of the generated images, creating similar scene images. Second, the CNN-based model requires large video memory in graphics processing units due to the nature of the deep structure in CNN, since shallow-layer networks only receive signals from a limited range of the receptive field. To solve these problems, an MLPMixer-based method is proposed in this paper. The MLPMixer has been proposed as an alternative to the self-attention in the transformer, which captures long-range dependencies and contextual information. This enables information to be propagated efficiently in the omni-directional image generation task. As a result, competitive performance has been achieved with reduced memory consumption and computational cost, in addition to increasing the diversity of the generated omni-directional images.

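Summary
This paper proposes a new method for generating omni-directional images from a single snapshot picture. Previous methods have relied on generative adversarial networks (GANs) built on convolutional neural networks (CNNs), which have two drawbacks for this task. First, a CNN has difficulty propagating the information of the snapshot picture at the center of the omni-directional image to the edges, so the edges of CNN-generated images tend to look similar. Second, a CNN-based model requires a large amount of GPU memory, because its deep structure leads to a large computational and memory footprint. To solve these problems, the paper proposes a method based on MLPMixer. MLPMixer captures long-range dependencies and contextual information, so it can propagate information efficiently in the omni-directional image generation task. As a result, the method achieves performance comparable to the conventional method while reducing memory consumption and computational cost, and it also increases the diversity of the generated omni-directional images.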
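
The locality argument can be sketched in a few lines of Python. The example compares one 1-D convolution with one token-mixing linear layer on a row of tokens where only the center carries information. It is a deliberately simplified illustration: a real MLP-Mixer token-mixing block is an MLP applied across patches, and the uniform mixing weights here are hypothetical rather than learned.

```python
def conv1d_same(x, kernel):
    """1-D convolution with zero padding: each output sees only a local window."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def token_mix(x, weights):
    """Token-mixing linear layer: each output token is a weighted sum of all tokens."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


n = 9
tokens = [0.0] * n
tokens[n // 2] = 1.0  # information only in the central token, like the embedded snapshot

local = conv1d_same(tokens, [1 / 3, 1 / 3, 1 / 3])  # one convolutional layer
dense = [[1.0 / n] * n for _ in range(n)]           # hypothetical uniform mixing weights
mixed = token_mix(tokens, dense)                    # one token-mixing layer

print([round(v, 2) for v in local])
print([round(v, 2) for v in mixed])
```

After a single layer the convolution has only reached the immediate neighbours of the center, while the token-mixing layer has already reached the edge tokens. A CNN needs additional depth to cover the same range.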

MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces

paper_url: http://arxiv.org/abs/2309.08113
repo_url: https://github.com/yinzhicun/metaf2n
paper_authors: Zhicun Yin, Ming Liu, Xiaoming Li, Hui Yang, Longan Xiao, Wangmeng Zuo
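for: Improving blind super-resolution of real-world low-quality images.
methods: Uses the faces contained in an image to fine-tune the model in a meta-learning framework, which circumvents degradation extraction and low-quality image synthesis; a MaskNet predicts position-wise loss weights to reduce the impact of low-confidence areas.
results: MetaF2N needs only one fine-tuning step to reach decent performance and achieves superior results on both synthetic and real-world datasets.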

Abstract
Due to their highly structured characteristics, faces are easier to recover than natural scenes for blind image super-resolution. Therefore, we can extract the degradation representation of an image from the low-quality and recovered face pairs. Using the degradation representation, realistic low-quality images can then be synthesized to fine-tune the super-resolution model for the real-world low-quality image. However, such a procedure is time-consuming and laborious, and the gaps between recovered faces and the ground-truths further increase the optimization uncertainty. To facilitate efficient model adaptation towards image-specific degradations, we propose a method dubbed MetaF2N, which leverages the contained Faces to fine-tune model parameters for adapting to the whole Natural image in a Meta-learning framework. The degradation extraction and low-quality image synthesis steps are thus circumvented in our MetaF2N, and it requires only one fine-tuning step to get decent performance. Considering the gaps between the recovered faces and ground-truths, we further deploy a MaskNet for adaptively predicting loss weights at different positions to reduce the impact of low-confidence areas. To evaluate our proposed MetaF2N, we have collected a real-world low-quality dataset with one or multiple faces in each image, and our MetaF2N achieves superior performance on both synthetic and real-world datasets. Source code, pre-trained models, and collected datasets are available at https://github.com/yinzhicun/MetaF2N.

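Summary
Because of their highly structured characteristics, faces are easier to recover than natural scenes in blind image super-resolution. We can therefore extract a degradation representation from pairs of low-quality and recovered faces. Using this representation, realistic low-quality images can be synthesized to fine-tune the super-resolution model for the real-world low-quality image. However, this procedure is time-consuming and laborious, and the gaps between the recovered faces and the ground truths further increase the optimization uncertainty. To adapt quickly and efficiently to image-specific degradations, we propose MetaF2N, which uses the faces contained in an image to fine-tune the model parameters so that they adapt to the whole natural image within a meta-learning framework. MetaF2N circumvents the degradation extraction and low-quality image synthesis steps and needs only one fine-tuning step. Considering the gaps between the recovered faces and the ground truths, we further deploy a MaskNet that adaptively predicts loss weights at different positions to reduce the impact of low-confidence areas. To evaluate MetaF2N, we collected a real-world low-quality dataset, and MetaF2N achieves superior performance on both synthetic and real-world datasets. Source code, pre-trained models, and the collected dataset are available at https://github.com/yinzhicun/MetaF2N.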
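
The idea of position-wise loss weights can be sketched as a weighted L1 loss. This is a minimal illustration under stated assumptions: the pixel values and weights are hypothetical, the weights are given by hand rather than predicted by a network, and normalizing by the sum of the weights is a choice made here, not something the abstract specifies.

```python
def weighted_l1(pred, target, weights):
    """Weighted mean absolute error over flattened pixel lists."""
    total = sum(w * abs(p - t) for p, t, w in zip(pred, target, weights))
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0


# Three flattened "pixels"; the last one is assumed to lie in a low-confidence
# area where the recovered face deviates strongly from the ground truth.
pred = [0.20, 0.50, 0.90]
target = [0.20, 0.40, 0.10]

uniform = weighted_l1(pred, target, [1.0, 1.0, 1.0])
masked = weighted_l1(pred, target, [1.0, 1.0, 0.1])  # down-weight the unreliable pixel

print(f"uniform weights: {uniform:.4f}")
print(f"masked weights:  {masked:.4f}")
```

Down-weighting the unreliable position keeps a single large error from dominating the fine-tuning signal.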

Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

paper_url: http://arxiv.org/abs/2309.08097
repo_url: None
paper_authors: Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng, Xinge You
for: Addressing the challenge of fine-grained visual categorization with limited data.
methods: Proposes a novel approach called the detail reinforcement diffusion model (DRDM), which leverages rich knowledge from large models for data augmentation and includes two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR).
results: Demonstrates improved performance for fine-grained visual recognition tasks through effective utilization of knowledge from large models, despite limited data availability.

Abstract
The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve the objective. However, when only a limited number of samples is available, similar methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

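Summary
The challenge in fine-grained visual categorization lies in exploring the subtle differences between subclasses and achieving accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models. However, when only a limited number of samples is available, such methods may become less effective. Diffusion models offer outstanding diversity in data generation, but the high level of detail required for fine-grained images makes it difficult to apply existing methods directly. To address this issue, we propose a new approach called the detail reinforcement diffusion model (DRDM). DRDM leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences. In addition, the SKR module incorporates the distributions of different datasets as references in the feature space. This allows SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two components, we effectively use the knowledge of large models to address data scarcity, improving performance on fine-grained visual recognition tasks. Extensive experiments show that DRDM offers a consistent performance gain.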

Hear-Your-Action: Human Action Recognition by Ultrasound Active Sensing

paper_url: http://arxiv.org/abs/2309.08087
repo_url: None
paper_authors: Risako Tanigawa, Yasunori Ishii
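for: This study proposes a privacy-preserving action recognition method that avoids the leakage of private information that occurs when actions are recognized from visual data.
methods: The study uses ultrasound active sensing for action recognition and creates a new dataset to support it. Features are computed from the temporal variation of the amplitude of the reflected ultrasound waves and classified with a support vector machine and VGG.
results: The method reaches 97.9% accuracy when trained and tested on the same person in the same environment, and 89.5% when trained and tested on different people. Analyses of accuracy under various conditions, and limitations, are also reported.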

Abstract
Action recognition is a key technology for many industrial applications. Methods using visual information such as images are very popular. However, privacy issues prevent widespread usage due to the inclusion of private information, such as visible faces and scene backgrounds, which are not necessary for recognizing user action. In this paper, we propose a privacy-preserving action recognition by ultrasound active sensing. As action recognition from ultrasound active sensing in a non-invasive manner is not well investigated, we create a new dataset for action recognition and conduct a comparison of features for classification. We calculated feature values by focusing on the temporal variation of the amplitude of ultrasound reflected waves and performed classification using a support vector machine and VGG for eight fundamental action classes. We confirmed that our method achieved an accuracy of 97.9% when trained and evaluated on the same person and in the same environment. Additionally, our method achieved an accuracy of 89.5% even when trained and evaluated on different people. We also report the analyses of accuracies in various conditions and limitations.

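Summary
Action recognition is a key technology for many industrial applications. Methods based on visual information such as images are very popular, but privacy issues limit their widespread use, because images contain private information, such as visible faces and scene backgrounds, that is not needed to recognize user actions. In this paper, we propose a privacy-preserving action recognition method based on ultrasound active sensing. Because non-invasive action recognition from ultrasound active sensing has not been well investigated, we create a new dataset and compare different features for classification. We compute features from the temporal variation of the amplitude of the reflected ultrasound waves and classify them with a support vector machine and VGG. Our method achieves an accuracy of 97.9% when trained and evaluated on the same person in the same environment, and 89.5% when trained and evaluated on different people. We also report analyses of accuracy under various conditions, and limitations.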

2023-09-15

cs.AI

cs.AI - 2023-09-15

URA*: Uncertainty-aware Path Planning using Image-based Aerial-to-Ground Traversability Estimation for Off-road Environments

paper_url: http://arxiv.org/abs/2309.08814
repo_url: https://github.com/shaswata09/offroad-path-planning
paper_authors: Charles Moore, Shaswata Mitra, Nisha Pillai, Marc Moore, Sudip Mittal, Cindy Bethel, Jingdao Chen
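for: Improving the navigation capability of autonomous robots in off-road environments that lack maps or road markings.
methods: An ensemble convolutional neural network (CNN) performs pixel-level traversability estimation from aerial images, and an uncertainty-aware planning algorithm computes the best path from these noisy traversability predictions.
results: Evaluated on the Massachusetts Road Dataset, the DeepGlobe dataset, and aerial images of the off-road proving grounds at Mississippi State University, the proposed segmentation and planning methods outperform conventional planning algorithms in initial path quality and in replanning during online robot operation.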

Abstract
A major challenge with off-road autonomous navigation is the lack of maps or road markings that can be used to plan a path for autonomous robots. Classical path planning methods mostly assume a perfectly known environment without accounting for the inherent perception and sensing uncertainty from detecting terrain and obstacles in off-road environments. Recent work in computer vision and deep neural networks has advanced the capability of terrain traversability segmentation from raw images; however, the feasibility of using these noisy segmentation maps for navigation and path planning has not been adequately explored. To address this problem, this research proposes an uncertainty-aware path planning method, URA* using aerial images for autonomous navigation in off-road environments. An ensemble convolutional neural network (CNN) model is first used to perform pixel-level traversability estimation from aerial images of the region of interest. The traversability predictions are represented as a grid of traversal probability values. An uncertainty-aware planner is then applied to compute the best path from a start point to a goal point given these noisy traversal probability estimates. The proposed planner also incorporates replanning techniques to allow rapid replanning during online robot operation. The proposed method is evaluated on the Massachusetts Road Dataset, the DeepGlobe dataset, as well as a dataset of aerial images from off-road proving grounds at Mississippi State University. Results show that the proposed image segmentation and planning methods outperform conventional planning algorithms in terms of the quality and feasibility of the initial path, as well as the quality of replanned paths.

Summary
A major challenge in off-road autonomous navigation is the lack of maps or road markings that can be used to plan a path for autonomous robots. Classical path planning methods mostly assume a perfectly known environment and do not account for the perception and sensing uncertainty involved in detecting terrain and obstacles. Recent work in computer vision and deep neural networks has advanced terrain traversability segmentation from raw images, but the use of these noisy segmentation maps for navigation and path planning has not been fully explored. To address this problem, this research proposes an uncertainty-aware path planning method, URA*, which uses aerial images for autonomous navigation in off-road environments. An ensemble convolutional neural network (CNN) model is first used to perform pixel-level traversability estimation from aerial images of the region of interest. The traversability predictions are represented as a grid of traversal probability values. An uncertainty-aware planner is then applied to compute the best path from a start point to a goal point given these noisy traversal probability estimates. The proposed planner also incorporates replanning techniques to allow rapid replanning during online robot operation. The proposed method is evaluated on the Massachusetts Road Dataset, the DeepGlobe dataset, and a dataset of aerial images from off-road proving grounds at Mississippi State University. Results show that the proposed image segmentation and planning methods outperform conventional planning algorithms in terms of the quality and feasibility of the initial path, as well as the quality of replanned paths.
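
Planning over a grid of traversal probabilities can be sketched as a shortest-path search in which entering a cell with probability p costs -log(p), so that the cheapest path is the one with the highest product of probabilities. The sketch below uses plain Dijkstra on a 4-connected grid; it is not the authors' URA* planner, it omits replanning, and the grid values, the function name `plan`, and the `eps` floor are assumptions made for illustration.

```python
import heapq
import math


def plan(prob, start, goal, eps=1e-6):
    """Dijkstra on a 4-connected grid; entering a cell costs -log(probability)."""
    rows, cols = len(prob), len(prob[0])

    def cost(cell):
        r, c = cell
        return -math.log(max(prob[r][c], eps))  # floor avoids log(0)

    dist = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                nd = d + cost(nxt)
                if nd < dist.get(nxt, math.inf):
                    dist[nxt] = nd
                    parent[nxt] = cell
                    heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Hypothetical traversal probabilities: the middle column is risky terrain.
grid = [
    [0.9, 0.9, 0.9],
    [0.9, 0.1, 0.9],
    [0.9, 0.1, 0.9],
]
path = plan(grid, start=(2, 0), goal=(2, 2))
print(path)
```

On this toy grid the planner takes the longer route around the low-probability cells instead of the two-step shortcut through them, because the detour has a higher overall probability of being traversable.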

SHAPNN: Shapley Value Regularized Tabular Neural Network

paper_url: http://arxiv.org/abs/2309.08799
repo_url: None
paper_authors: Qisen Cheng, Shuhui Qu, Janghwan Lee
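for: This paper proposes a new deep tabular data modeling architecture for supervised learning.
methods: The method leverages Shapley values, a well-established technique for explaining black-box models. The neural network is trained with standard backpropagation and regularized with Shapley values estimated in real time.
results: The method provides valid explanations for data instances and datasets with no computational overhead. Prediction with explanation acts as a regularizer that improves model performance, and the regularized prediction also improves continual learning. Compared with state-of-the-art deep neural network models on various public datasets, SHAPNN performs better in AUROC, transparency, and robustness to streaming data.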

Abstract
We present SHAPNN, a novel deep tabular data modeling architecture designed for supervised learning. Our approach leverages Shapley values, a well-established technique for explaining black-box models. Our neural network is trained using standard backward propagation optimization methods, and is regularized with realtime estimated Shapley values. Our method offers several advantages, including the ability to provide valid explanations with no computational overhead for data instances and datasets. Additionally, prediction with explanation serves as a regularizer, which improves the model's performance. Moreover, the regularized prediction enhances the model's capability for continual learning. We evaluate our method on various publicly available datasets and compare it with state-of-the-art deep neural network models, demonstrating the superior performance of SHAPNN in terms of AUROC, transparency, as well as robustness to streaming data.

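Summary
We present SHAPNN, a new deep tabular data modeling architecture for supervised learning. Our method leverages Shapley values, a well-established technique for explaining black-box models. The neural network is trained with standard backpropagation and regularized with Shapley values estimated in real time. The method has several advantages: it provides explanations for data instances and datasets with no computational overhead, prediction with explanation serves as a regularizer that improves model performance, and the regularized prediction strengthens the model's capability for continual learning. We evaluate the method on several public datasets and compare it with state-of-the-art deep neural network models, demonstrating the superior performance of SHAPNN in AUROC, transparency, and robustness to streaming data.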
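
For readers unfamiliar with Shapley values, the sketch below computes them exactly for a three-feature toy model by averaging each feature's marginal contribution over all feature orderings. This brute-force enumeration only illustrates the definition; SHAPNN estimates Shapley values in real time during training, and how it does so is not described in the abstract. The toy model and its contribution values are hypothetical.

```python
from itertools import permutations


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_features
    orders = list(permutations(range(n_features)))
    for order in orders:
        present = set()
        prev = value_fn(frozenset(present))
        for f in order:
            present.add(f)
            cur = value_fn(frozenset(present))
            phi[f] += cur - prev  # marginal contribution of f in this ordering
            prev = cur
    return [p / len(orders) for p in phi]


def toy_model(present):
    """Hypothetical model output given the set of features that are 'switched on'."""
    score = 0.0
    if 0 in present:
        score += 2.0  # feature 0 contributes on its own
    if 1 in present:
        score += 1.0  # feature 1 contributes on its own
    if 1 in present and 2 in present:
        score += 3.0  # features 1 and 2 interact
    return score


phi = shapley_values(toy_model, 3)
print(phi)
```

The interaction term is split evenly between features 1 and 2, and the three values sum to the difference between the model output with all features and with none. Exact enumeration grows factorially with the number of features, which is why practical methods rely on estimates.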

D3: Data Diversity Design for Systematic Generalization in Visual Question Answering

paper_url: http://arxiv.org/abs/2309.08798
repo_url: https://github.com/amiroor/d3questiongenerationclevr
paper_authors: Amir Rahimi, Vanessa D’Amario, Moyuru Yamada, Kentaro Takemoto, Tomotake Sasaki, Xavier Boix
for: This paper is written to investigate the role of data diversity in achieving systematic generalization in Visual Question Answering (VQA) tasks.
methods: The paper uses a combination of simple tasks and neural network architectures to study the effect of data diversity on systematic generalization.
results: The paper finds that the diversity of simple tasks is a key factor in achieving systematic generalization, and that neural module networks are better able to leverage all forms of data diversity than monolithic architectures.

Abstract
Systematic generalization is a crucial aspect of intelligence, which refers to the ability to generalize to novel tasks by combining known subtasks and concepts. One critical factor that has been shown to influence systematic generalization is the diversity of training data. However, diversity can be defined in various ways, as data have many factors of variation. A more granular understanding of how different aspects of data diversity affect systematic generalization is lacking. We present new evidence in the problem of Visual Question Answering (VQA) that reveals that the diversity of simple tasks (i.e. tasks formed by a few subtasks and concepts) plays a key role in achieving systematic generalization. This implies that it may not be essential to gather a large and varied number of complex tasks, which could be costly to obtain. We demonstrate that this result is independent of the similarity between the training and testing data and applies to well-known families of neural network architectures for VQA (i.e. monolithic architectures and neural module networks). Additionally, we observe that neural module networks leverage all forms of data diversity we evaluated, while monolithic architectures require more extensive amounts of data to do so. These findings provide a first step towards understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.

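Summary
Systematic generalization, a key aspect of intelligence, is the ability to handle novel tasks by combining known subtasks and concepts. One important factor is the diversity of the training data, but diversity can be defined in many ways, and how different aspects of data diversity affect systematic generalization is not well understood. We present new evidence in Visual Question Answering (VQA) showing that the diversity of simple tasks plays a key role in achieving systematic generalization. This implies that it may not be necessary to collect a large and varied set of complex tasks, which can be costly to obtain. We show that this result is independent of the similarity between the training and testing data and holds for well-known families of neural network architectures for VQA (i.e., monolithic architectures and neural module networks). In addition, neural module networks leverage all the forms of data diversity we evaluated, whereas monolithic architectures need more data to do so. These findings are a first step toward understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.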

Privacy-preserving Early Detection of Epileptic Seizures in Videos

paper_url: http://arxiv.org/abs/2309.08794
repo_url: https://github.com/DevD1092/seizure-detection
paper_authors: Deval Mehta, Shobi Sivathamboo, Hugh Simpson, Patrick Kwan, Terence O`Brien, Zongyuan Ge
for: The purpose of this paper is to develop a privacy-preserving video-based epileptic seizure classification method.
methods: The method uses optical flow features extracted from epileptic seizure videos and utilizes a transformer-based progressive knowledge distillation to preserve privacy.
results: The method can accurately detect tonic-clonic seizures (TCSs) in a privacy-preserving manner, with an accuracy of 83.9%.

Abstract
In this work, we contribute towards the development of video-based epileptic seizure classification by introducing a novel framework (SETR-PKD), which could achieve privacy-preserved early detection of seizures in videos. Specifically, our framework has two significant components: (1) it is built upon optical flow features extracted from the video of a seizure, which encode the seizure motion semiotics while preserving the privacy of the patient; (2) it utilizes a transformer-based progressive knowledge distillation, where the knowledge is gradually distilled from networks trained on a longer portion of video samples to the ones which will operate on shorter portions. Thus, our proposed framework addresses the limitations of the current approaches, which compromise the privacy of the patients by directly operating on the RGB video of a seizure and impede real-time detection of a seizure by utilizing the full video sample to make a prediction. Our SETR-PKD framework could detect tonic-clonic seizures (TCSs) in a privacy-preserving manner with an accuracy of 83.9% while they are only half-way into their progression. Our data and code are available at https://github.com/DevD1092/seizure-detection

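Summary
In this work, we contribute to the development of video-based epileptic seizure classification by introducing a new framework (SETR-PKD) that achieves privacy-preserving early detection of seizures. Specifically, the framework has two important components: (1) it is built on optical flow features extracted from the seizure video, which encode the seizure motion semiotics while preserving the patient's privacy; (2) it uses transformer-based progressive knowledge distillation, in which knowledge is gradually distilled from networks trained on longer portions of the video samples to networks that operate on shorter portions. As a result, SETR-PKD can detect tonic-clonic seizures in a privacy-preserving manner with an accuracy of 83.9% while they are only halfway into their progression. Our data and code are available at https://github.com/DevD1092/seizure-detection.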

Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation

paper_url: http://arxiv.org/abs/2309.08793
repo_url: https://github.com/iit-dm/fin-fact
paper_authors: Aman Rangapur, Haoran Wang, Kai Shu
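for: This paper provides a multimodal fact-checking benchmark dataset aimed at combating misinformation in finance and improving trust in financial reporting and news dissemination.
methods: The dataset combines professional fact-checker annotations and justifications with multimodal content, both textual and visual, to strengthen factuality analysis.
results: The resulting dataset, Fin-Fact, offers explanations that help users understand the reasoning behind fact-checking decisions and build trust in the fact-checking process.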

Abstract
Fact-checking in the financial domain is under-explored, and there is a shortage of quality datasets in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and visual content, Fin-Fact provides complementary information sources to enhance factuality analysis. Its primary objective is combating misinformation in finance, fostering transparency, and building trust in financial reporting and news dissemination. By offering insightful explanations, Fin-Fact empowers users, including domain experts and end-users, to understand the reasoning behind fact-checking decisions, validate claim credibility, and foster trust in the fact-checking process. The Fin-Fact dataset, along with our experimental code, is available at https://github.com/IIT-DM/Fin-Fact/.


Projected Task-Specific Layers for Multi-Task Reinforcement Learning

paper_url: http://arxiv.org/abs/2309.08776
repo_url: https://github.com/josselinsomervilleroberts/ptsl
paper_authors: Josselin Somerville Roberts, Julia Di
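for: This paper explores how multi-task reinforcement learning can help robots scale across a wide variety of manipulation tasks in homes and workplaces.
methods: It introduces a new architecture, Projected Task-Specific Layers (PTSL), which combines a common policy with dense task-specific corrections to better express shared and variable task information.
results: PTSL outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World, which consist of 10 and 50 goal-conditioned tasks for a Sawyer arm.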

Abstract
Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces. However, generalizing from one task to another and mitigating negative task interference still remains a challenge. Addressing this challenge by successfully sharing information across tasks will depend on how well the structure underlying the tasks is captured. In this work, we introduce our new architecture, Projected Task-Specific Layers (PTSL), that leverages a common policy with dense task-specific corrections through task-specific layers to better express shared and variable task information. We then show that our model outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World consisting of 10 and 50 goal-conditioned tasks for a Sawyer arm.

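Summary
Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces. However, generalizing from one task to another and mitigating negative task interference remain challenging. Successfully sharing information across tasks to address this challenge depends on how well the structure underlying the tasks is captured. In this work, we introduce a new architecture, Projected Task-Specific Layers (PTSL), which uses a common policy with dense task-specific corrections to better express shared and variable task information. We then show that our model outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World, which consist of 10 and 50 goal-conditioned tasks.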

Enhance audio generation controllability through representation similarity regularization

paper_url: http://arxiv.org/abs/2309.08773
repo_url: None
paper_authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra
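for: This paper aims to improve control over audio generation by making audio and text representations more consistent.
methods: The language model predicts audio tokens from text and audio token representations. The paper adds a representation regularization during the classifier-free guidance (CFG) phase to reduce discrepancies between audio and text similarity.
results: Experiments show improved objective metrics for both audio and music generation, and improved human perception for audio generation.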

Abstract
This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regularization to ensure the alignment between the chosen text representation and the language model's predictions. Our proposal involves the incorporation of audio and text representation regularization, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross attention during language model training. The aim of this proposed representation regularization is to minimize discrepancies in audio and text similarity compared to other samples within the same training batch. Experimental results on both music and audio generation tasks demonstrate that our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as an enhancement in the human perception for audio generation.

Summary
Our proposed method adds regularization to the audio and text representations, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross attention during language model training. The goal of this regularization is to minimize the differences in audio and text similarity compared to other samples within the same training batch. Experimental results on both music and audio generation tasks show that our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as an enhancement in human perception for audio generation.
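
One plausible reading of this regularization can be sketched as follows: compute the pairwise cosine similarities between the text representations in a batch, do the same for the audio representations, and penalize the mean squared difference between the two similarity structures. The exact loss in the paper is not given in the abstract, so the loss form, the function names, and the toy pooled vectors below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(reps):
    """Pairwise cosine similarities between all samples in a batch."""
    return [[cosine(a, b) for b in reps] for a in reps]


def similarity_regularizer(audio_reps, text_reps):
    """Mean squared gap between audio-audio and text-text similarities."""
    sa = similarity_matrix(audio_reps)
    st = similarity_matrix(text_reps)
    n = len(audio_reps)
    diffs = [(sa[i][j] - st[i][j]) ** 2
             for i in range(n) for j in range(n) if i != j]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical pooled representations for a batch of two samples.
text = [[1.0, 0.0], [0.0, 1.0]]           # two unrelated prompts
audio_aligned = [[2.0, 0.0], [0.0, 3.0]]  # audio is equally dissimilar
audio_collapsed = [[1.0, 0.0], [1.0, 0.1]]  # audio is nearly identical

aligned_loss = similarity_regularizer(audio_aligned, text)
collapsed_loss = similarity_regularizer(audio_collapsed, text)
print(f"aligned: {aligned_loss:.3f}, collapsed: {collapsed_loss:.3f}")
```

The penalty is zero when samples that differ in text also differ to the same degree in audio, and it grows when two different prompts map to nearly the same audio representation.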

Rethinking Cross-Domain Pedestrian Detection: A Background-Focused Distribution Alignment Framework for Instance-Free One-Stage Detectors

paper_url: http://arxiv.org/abs/2309.08771
repo_url: https://github.com/caiyancheng/bfda
paper_authors: Yancheng Cai, Bo Zhang, Baopu Li, Tao Chen, Hongliang Yan, Jingdong Zhang, Jiahao Xu
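for: This work aims to generalize pedestrian detectors from a label-rich domain to a label-scarce domain, which matters for real-world applications.
methods: The work applies domain alignment to one-stage detectors, which lack instance-level proposals and can only perform image-level feature alignment. The alignment focuses on background features while minimizing the influence of foreground features.
results: The paper proposes background-focused distribution alignment (BFDA), a new framework for training domain-adaptive one-stage pedestrian detectors. BFDA first decouples the background features from the whole-image feature maps and then aligns them with a novel long-short-range discriminator.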

Abstract
Cross-domain pedestrian detection aims to generalize pedestrian detectors from one label-rich domain to another label-scarce domain, which is crucial for various real-world applications. Most recent works focus on domain alignment to train domain-adaptive detectors either at the instance level or image level. From a practical point of view, one-stage detectors are faster. Therefore, we concentrate on designing a cross-domain algorithm for rapid one-stage detectors that lack instance-level proposals and can only perform image-level feature alignment. However, pure image-level feature alignment causes the foreground-background misalignment issue to arise, i.e., the foreground features in the source domain image are falsely aligned with background features in the target domain image. To address this issue, we systematically analyze the importance of foreground and background in image-level cross-domain alignment, and learn that background plays a more critical role in image-level cross-domain alignment. Therefore, we focus on cross-domain background feature alignment while minimizing the influence of foreground features on the cross-domain alignment stage. This paper proposes a novel framework, namely, background-focused distribution alignment (BFDA), to train domain-adaptive one-stage pedestrian detectors. Specifically, BFDA first decouples the background features from the whole image feature maps and then aligns them via a novel long-short-range discriminator.


AlbNER: A Corpus for Named Entity Recognition in Albanian

paper_url: http://arxiv.org/abs/2309.08741
repo_url: None
paper_authors: Erion Çano
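for: This paper addresses the scarcity of resources for natural language processing and computational linguistics research on Albanian.
methods: It collects 900 sentences from Albanian Wikipedia articles and annotates them with named entities.
results: BERT and RoBERTa variants fine-tuned and tested on AlbNER data show that model size has a slight impact on NER performance, whereas language transfer has a significant one. The AlbNER dataset and these results can serve as baselines for future experiments.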

Abstract
Scarcity of resources such as annotated text corpora for under-resourced languages like Albanian is a serious impediment in computational linguistics and natural language processing research. This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles. Preliminary results with BERT and RoBERTa variants fine-tuned and tested with AlbNER data indicate that model size has slight impact on NER performance, whereas language transfer has a significant one. AlbNER corpus and these obtained results should serve as baselines for future experiments.

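Summary
The scarcity of resources such as annotated text corpora for Albanian is a serious impediment to computational linguistics and natural language processing research. This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles. Preliminary results with BERT and RoBERTa variants fine-tuned and tested on AlbNER data indicate that model size has a slight impact on NER performance, whereas language transfer has a significant one. The AlbNER corpus and these results can serve as baselines for future experiments.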

OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?

paper_url: http://arxiv.org/abs/2309.09992
repo_url: None
paper_authors: Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme
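for: This paper explains where OpenAI got the tax law example used in its livestream demonstration of GPT-4, why GPT-4 got the wrong answer, and how it fails to reliably calculate taxes.
methods: The abstract does not describe the method.
results: The abstract reports that GPT-4 got the wrong answer in the demonstration and does not calculate taxes reliably.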

Abstract
The authors explain where OpenAI got the tax law example in its livestream demonstration of GPT-4, why GPT-4 got the wrong answer, and how it fails to reliably calculate taxes.

Summary
The authors explain where OpenAI obtained the tax law example shown in its GPT-4 livestream, why GPT-4 got the wrong answer, and how it fails to calculate taxes reliably.

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

paper_url: http://arxiv.org/abs/2309.08730
repo_url: https://github.com/zihaod/musilingo
paper_authors: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos
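for: This study explores the potential of large language models (LLMs) in multimodal applications, in particular the convergence of the textual and musical domains.
methods: The study presents MusiLingo, a system for music caption generation and music-related query responses. It uses a single projection layer to align music representations from the pre-trained, frozen MERT music audio model with the frozen LLaMA language model.
results: After training and fine-tuning, MusiLingo performs competitively on an extensive music caption dataset and on the MusicInstruct (MI) dataset, generating music captions and music-related Q&A pairs.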

Abstract
Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from MusicCaps, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.

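Summary
Large language models (LLMs) have shown great potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a new system for music caption generation and music-related question answering. MusiLingo uses a single projection layer to align the pre-trained, frozen music audio model MERT with the frozen LLaMA language model, bridging music audio and textual contexts. We train MusiLingo on an extensive music caption dataset and fine-tune it on instructional data. Because high-quality music Q&A datasets are scarce, we created the MusicInstruct (MI) dataset from MusicCaps, tailored to open-ended music questions. Our experiments show that MusiLingo is competitive at generating music captions and composing music-related Q&A pairs. The dataset we introduce enables advances beyond previous ones.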
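
The single projection layer described above can be pictured as one linear map from the music encoder's feature size to the language model's embedding size. The sketch below is only an illustration under stated assumptions: the dimensions, the random initialization, and the function names `make_projection` and `project` are hypothetical, and it uses plain Python lists rather than the authors' actual framework or the real MERT and LLaMA models.

```python
import random


def make_projection(d_music, d_llm, seed=0):
    """Create a (d_llm x d_music) weight matrix with small random values and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_music)] for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weights, bias


def project(frames, weights, bias):
    """Map each music-encoder frame (length d_music) to an LLM-sized vector (length d_llm)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weights, bias)]
        for frame in frames
    ]


# Hypothetical sizes, chosen only to keep the example small.
D_MUSIC, D_LLM, N_FRAMES = 4, 6, 3
weights, bias = make_projection(D_MUSIC, D_LLM)
music_frames = [[float(i + j) for j in range(D_MUSIC)] for i in range(N_FRAMES)]

# Each projected frame could then be placed alongside text token embeddings.
music_tokens = project(music_frames, weights, bias)
print(len(music_tokens), len(music_tokens[0]))
```

In this picture only `weights` and `bias` would be trained, while both the audio encoder and the language model stay frozen, which is what keeps the number of trainable parameters small.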

SculptBot: Pre-Trained Models for 3D Deformable Object Manipulation

paper_url: http://arxiv.org/abs/2309.08728
repo_url: None
paper_authors: Alison Bartsch, Charlotte Avra, Amir Barati Farimani
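for: This paper addresses the particular challenges of robotic manipulation of plastic materials, including high degrees of freedom and severe self-occlusion.
methods: The paper uses point clouds as the state representation and leverages a pre-trained point cloud reconstruction Transformer to learn a latent dynamics model that predicts material deformation after a grasp action.
results: Experiments show that the proposed system successfully captures the dynamics of clay and can create a variety of simple shapes.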

Abstract
Deformable object manipulation presents a unique set of challenges in robotic manipulation by exhibiting high degrees of freedom and severe self-occlusion. State representation for materials that exhibit plastic behavior, like modeling clay or bread dough, is also difficult because they permanently deform under stress and are constantly changing shape. In this work, we investigate each of these challenges using the task of robotic sculpting with a parallel gripper. We propose a system that uses point clouds as the state representation and leverages pre-trained point cloud reconstruction Transformer to learn a latent dynamics model to predict material deformations given a grasp action. We design a novel action sampling algorithm that reasons about geometrical differences between point clouds to further improve the efficiency of model-based planners. All data and experiments are conducted entirely in the real world. Our experiments show the proposed system is able to successfully capture the dynamics of clay, and is able to create a variety of simple shapes.

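Summary
Deformable object manipulation poses unique challenges in robotic manipulation because of high degrees of freedom and severe self-occlusion. State representation for materials that behave plastically, such as modeling clay or bread dough, is also difficult because they deform permanently under stress and constantly change shape. In this work, we study these challenges through the task of robotic sculpting with a parallel gripper. We propose a system that uses point clouds as the state representation and leverages a pre-trained point cloud reconstruction Transformer to learn a latent dynamics model that predicts material deformation given a grasp action. We design a new action sampling algorithm that reasons about geometric differences between point clouds to further improve the efficiency of model-based planners. All data collection and experiments are conducted in the real world. Our experiments show that the proposed system successfully captures the dynamics of clay and can create a variety of simple shapes.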

Modelling Irregularly Sampled Time Series Without Imputation

paper_url: http://arxiv.org/abs/2309.08698
repo_url: https://github.com/rohit102497/slan
paper_authors: Rohit Agarwal, Aman Sinha, Dilip K. Prasad, Marianne Clausel, Alexander Horsch, Mathieu Constant, Xavier Coubez
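for: Modelling irregularly sampled time series (ISTS) is difficult because of missing values. Most existing methods convert irregularly sampled data into regularly sampled data via imputation and assume an underlying missing mechanism. The paper presents SLAN (Switch LSTM Aggregate Network), which models ISTS without imputation and without assuming any underlying process.
methods: SLAN uses a pack of LSTMs and adapts its architecture on the fly based on the measured sensors, capturing each sensor's local summary explicitly while maintaining a global summary state.
results: The authors demonstrate the efficacy of SLAN on the public datasets MIMIC-III, Physionet 2012, and Physionet 2019, and release the code at https://github.com/Rohit102497/SLAN.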

Abstract
Modelling irregularly-sampled time series (ISTS) is challenging because of missing values. Most existing methods focus on handling ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism leading to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a pack of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors. SLAN exploits the irregularity information to capture each sensor's local summary explicitly and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on publicly available datasets, namely, MIMIC-III, Physionet 2012 and Physionet 2019. The code is available at https://github.com/Rohit102497/SLAN.

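Summary
Modelling irregularly sampled time series (ISTS) is challenging because of missing values. Existing methods usually convert irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism, which leads to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which uses a pack of LSTMs to model ISTS without imputation and without assuming any underlying process. It adapts its architecture on the fly based on the measured sensors. SLAN exploits the irregularity information to capture each sensor's local summary explicitly and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on the publicly available datasets MIMIC-III, Physionet 2012, and Physionet 2019. The code is available at https://github.com/Rohit102497/SLAN.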

Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents

paper_url: http://arxiv.org/abs/2309.08695
repo_url: https://github.com/ramonachristen/multilingual_negation_scope_resolution_on_legal_data
paper_authors: Ramona Christen, Anastassia Shaitarova, Matthias Stürmer, Joel Niklaus
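for: This paper addresses negation scope resolution in legal texts.
methods: The paper fine-tunes and evaluates language models on texts in several languages.
results: The experiments reach token-level F1-scores of up to 86.7% for negation scope resolution on legal texts in zero-shot cross-lingual settings, and the paper compares performance across languages.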

Abstract
Resolving the scope of a negation within a sentence is a challenging NLP task. The complexity of legal texts and the lack of annotated in-domain negation corpora pose challenges for state-of-the-art (SotA) models when performing negation scope resolution on multilingual legal data. Our experiments demonstrate that models pre-trained without legal data underperform in the task of negation scope resolution. Our experiments, using language models exclusively fine-tuned on domains like literary texts and medical data, yield inferior results compared to the outcomes documented in prior cross-domain experiments. We release a new set of annotated court decisions in German, French, and Italian and use it to improve negation scope resolution in both zero-shot and multilingual settings. We achieve token-level F1-scores of up to 86.7% in our zero-shot cross-lingual experiments, where the models are trained on two languages of our legal datasets and evaluated on the third. Our multilingual experiments, where the models were trained on all available negation data and evaluated on our legal datasets, resulted in F1-scores of up to 91.1%.
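
The token-level F1-score reported above can be illustrated with a small worked example. This is a minimal sketch, assuming binary per-token labels (1 means the token lies inside the negation scope); the sentence and its labels are invented for illustration, and the authors' exact evaluation script may differ.

```python
def token_f1(gold, pred):
    """Token-level F1 for binary scope labels (1 = token is inside the negation scope)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "The court did not accept the appeal ."
# Hypothetical gold scope: "accept the appeal"; the prediction misses "appeal".
gold = [0, 0, 0, 0, 1, 1, 1, 0]
pred = [0, 0, 0, 0, 1, 1, 0, 0]
score = token_f1(gold, pred)
print(round(score, 3))
```

Here precision is 2/2 and recall is 2/3, so the F1-score is 0.8.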

Fake News Detectors are Biased against Texts Generated by Large Language Models

paper_url: http://arxiv.org/abs/2309.08674
repo_url: None
paper_authors: Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, Preslav Nakov
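for: This study evaluates fake news detectors in scenarios involving both human-written and LLM-generated misinformation.
methods: The study introduces a new evaluation paradigm and a mitigation strategy based on adversarial training with LLM-paraphrased genuine news.
results: Many existing detectors are biased: they tend to flag LLM-generated content as fake news while often misclassifying human-written fake news as genuine. The bias appears to stem from linguistic patterns of LLM outputs. Adversarial training with LLM-paraphrased genuine news improves detection accuracy for both human-written and LLM-generated news.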

Abstract
The spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. In the era of Large Language Models (LLMs), the capability to generate believable fake content has intensified these concerns. In this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and LLM-generated misinformation. Intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from distinct linguistic patterns inherent to LLM outputs. To address this, we introduce a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. The resulting model yielded marked improvements in detection accuracy for both human and LLM-generated news. To further catalyze research in this domain, we release two comprehensive datasets, GossipCop++ and PolitiFact++, thus amalgamating human-validated articles with LLM-generated fake and real news.

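Summary
The spread of fake news has become a critical challenge, undermining trust and threatening society. In the era of large language models (LLMs), the ability to generate believable fake content has intensified these concerns. In this study, we present a new paradigm for evaluating fake news detectors on both human-written and LLM-generated misinformation. Our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news, while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from linguistic patterns specific to LLM outputs. To address it, we introduce a mitigation strategy that uses adversarial training with LLM-paraphrased genuine news. The resulting model shows marked improvements in detection accuracy for both human-written and LLM-generated news. To further research in this area, we release two datasets, GossipCop++ and PolitiFact++, which combine human-validated articles with LLM-generated fake and real news.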

Chain-of-Thought Reasoning is a Policy Improvement Operator

paper_url: http://arxiv.org/abs/2309.08589
repo_url: None
paper_authors: Hugh Zhang, David C. Parkes
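for: This study aims to demonstrate that language models can teach themselves new skills without human demonstrations.
methods: The study uses chain-of-thought reasoning to let the language model work through problems, then fine-tunes the model to generate the same answers without chain-of-thought reasoning.
results: With SECToR, language models autonomously learn to add numbers of up to 29 digits without access to any ground-truth examples beyond an initial supervised fine-tuning phase on numbers with 6 or fewer digits.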

Abstract
Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on being trained on large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can successfully teach themselves new skills using chain-of-thought reasoning. Inspired by previous work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to slowly think its way through problems. SECToR then fine-tunes the model to generate those same answers, this time without using chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without any access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogously to how Monte-Carlo Tree Search is used in AlphaZero. We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.

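Summary
Large language models have astounded the world with new capabilities. However, they still cannot teach themselves new skills and instead rely on training on large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. The idea draws on earlier work in reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011). SECToR first uses chain-of-thought reasoning to think its way through problems, then fine-tunes the model to produce the same answers without chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add numbers of up to 29 digits without access to any ground-truth examples beyond an initial supervised fine-tuning phase that contains only numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogous to Monte-Carlo Tree Search in AlphaZero. We hope this research leads to new directions in which language models learn to teach themselves without human demonstrations.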
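
The two phases described above (slow step-by-step reasoning, then training on the final answers alone) can be sketched with ordinary digit-by-digit addition standing in for the model's chain of thought. This is only an analogy under stated assumptions: `add_with_steps` and `build_self_training_pairs` are hypothetical names, the step format is invented, and no language model or fine-tuning is involved.

```python
def add_with_steps(a, b):
    """Add two non-negative integers digit by digit, recording each intermediate step."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, digits, steps = 0, [], []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        steps.append(f"position {i}: {x} + {y} + carry {carry} = {total}")
        digits.append(total % 10)
        carry = total // 10
    if carry:
        digits.append(carry)
    answer = int("".join(str(d) for d in reversed(digits)))
    return answer, steps


def build_self_training_pairs(problems):
    """Keep only (question, final answer) pairs, dropping the reasoning steps."""
    pairs = []
    for a, b in problems:
        answer, _steps = add_with_steps(a, b)
        pairs.append((f"{a} + {b} =", str(answer)))
    return pairs


answer, steps = add_with_steps(999, 1)
pairs = build_self_training_pairs([(12345678, 87654321)])
print(answer, len(steps), pairs)
```

The pairs returned by `build_self_training_pairs` play the role of the fine-tuning targets: they contain the answer reached through the slow process but none of the intermediate steps.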

Compositional Foundation Models for Hierarchical Planning

paper_url: http://arxiv.org/abs/2309.08587
repo_url: None
paper_authors: Anurag Ajay, Seungwook Han, Yilun Du, Shaung Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, Pulkit Agrawal
for: solves long-horizon tasks with hierarchical reasoning across spatial and temporal scales
methods: leverages multiple expert foundation models trained on language, vision, and action data; constructs symbolic plans grounded in the environment; infers actions from generated videos
results: illustrates efficacy and adaptability in three different long-horizon table-top manipulation tasks

Abstract
To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model which leverages multiple expert foundation models, trained individually on language, vision and action data, jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach in three different long-horizon table-top manipulation tasks.

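Summary
To make effective decisions in novel environments with long-horizon goals, it is crucial to reason hierarchically across spatial and temporal scales. This entails planning abstract subgoal sequences, reasoning visually about the underlying plans, and executing actions according to the plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that combines multiple expert foundation models, trained individually on language, vision, and action data, to solve long-horizon tasks. We use a large language model to construct symbolic plans, which are grounded in the environment through a large video diffusion model. The generated video plans are then grounded to visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We demonstrate the efficacy and adaptability of our approach on three different long-horizon table-top manipulation tasks.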

How Transferable are Attribute Controllers on Pretrained Multilingual Translation Models?

paper_url: http://arxiv.org/abs/2309.08565
repo_url: https://github.com/dannigt/attribute-controller-transfer
paper_authors: Danni Liu, Jan Niehues
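for: This paper studies how to transfer attribute controllers for machine translation, such as formality control, to new languages.
methods: The paper builds on a pretrained massively multilingual translation model (NLLB-200) to transfer attribute-controlling capabilities to languages without supervised data, studying both training-time and inference-time control techniques.
results: Analysis under various data scenarios shows consistent improvements in zero-shot settings and across domains, and a human evaluation confirms the findings.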

Abstract
Customizing machine translation models to comply with fine-grained attributes such as formality has seen tremendous progress recently. However, current approaches mostly rely on at least some supervised data with attribute annotation. Data scarcity therefore remains a bottleneck to democratizing such customization possibilities to a wider range of languages, lower-resource ones in particular. Given recent progress in pretrained massively multilingual translation models, we use them as a foundation to transfer the attribute controlling capabilities to languages without supervised data. In this work, we present a comprehensive analysis of transferring attribute controllers based on a pretrained NLLB-200 model. We investigate both training- and inference-time control techniques under various data scenarios, and uncover their relative strengths and weaknesses in zero-shot performance and domain robustness. We show that both paradigms are complementary, as shown by consistent improvements on 5 zero-shot directions. Moreover, a human evaluation on a real low-resource language, Bengali, confirms our findings on zero-shot transfer to new target languages. The code is available at https://github.com/dannigt/attribute-controller-transfer.

Summary
Recently, there has been significant progress in customizing machine translation models to accommodate fine-grained attributes, such as formality. However, most current approaches rely on at least some supervised data with attribute annotation, which can be a limiting factor in democratizing these customization possibilities to a wider range of languages, particularly lower-resource ones. To address this issue, we use pretrained massively multilingual translation models as a foundation to transfer the attribute controlling capabilities to languages without supervised data. In this study, we conduct a comprehensive analysis of transferring attribute controllers based on a pretrained NLLB-200 model. We investigate both training- and inference-time control techniques under various data scenarios and compare their performance in zero-shot settings and domain robustness. Our results show that both paradigms are complementary, as evidenced by consistent improvements in five zero-shot directions. Additionally, a human evaluation on a real low-resource language, Bengali, confirms our findings on zero-shot transfer to new target languages. The code for this study is available at https://github.com/dannigt/attribute-controller-transfer.

Deep Reinforcement Learning for Efficient and Fair Allocation of Health Care Resources

paper_url: http://arxiv.org/abs/2309.08560
repo_url: None
paper_authors: Yikuan Li, Chengsheng Mao, Kaixuan Huang, Hanyin Wang, Zheng Yu, Mengdi Wang, Yuan Luo
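for: This study proposes a reinforcement-learning-based method for optimizing health care resource allocation policies, so that resources are rationed fairly and effectively, especially when they are scarce.
methods: The study uses a transformer-based deep Q-network to integrate the disease progression of individual patients and the interaction effects among patients when optimizing the allocation policy.
results: The method reduces excess deaths and achieves a more equitable distribution under different levels of ventilator shortage, compared with existing severity-based and comorbidity-based methods.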

Abstract
Scarcity of health care resources could result in the unavoidable consequence of rationing. For example, ventilators are often limited in supply, especially during public health emergencies or in resource-constrained health care settings, such as amid the pandemic of COVID-19. Currently, there is no universally accepted standard for health care resource allocation protocols, resulting in different governments prioritizing patients based on various criteria and heuristic-based protocols. In this study, we investigate the use of reinforcement learning for critical care resource allocation policy optimization to fairly and effectively ration resources. We propose a transformer-based deep Q-network to integrate the disease progression of individual patients and the interaction effects among patients during the critical care resource allocation. We aim to improve both fairness of allocation and overall patient outcomes. Our experiments demonstrate that our method significantly reduces excess deaths and achieves a more equitable distribution under different levels of ventilator shortage, when compared to existing severity-based and comorbidity-based methods in use by different governments. Our source code is included in the supplement and will be released on Github upon publication.

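Summary
Scarcity of health care resources can make rationing unavoidable. For example, ventilators are often in limited supply during public health emergencies or in resource-constrained health care settings, as during the COVID-19 pandemic. There is currently no universally accepted standard for health care resource allocation protocols, so different governments prioritize patients according to different criteria and heuristic-based protocols. In this study, we investigate reinforcement learning for optimizing critical care resource allocation policies so that resources are rationed fairly and effectively. We propose a transformer-based deep Q-network that integrates the disease progression of individual patients and the interaction effects among patients. We aim to improve both the fairness of allocation and overall patient outcomes. Our experiments show that, under different levels of ventilator shortage, our method reduces excess deaths and achieves a more equitable distribution than existing severity-based and comorbidity-based methods. Our source code is included in the supplement and will be released on GitHub upon publication.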

HINT: Healthy Influential-Noise based Training to Defend against Data Poisoning Attacks

paper_url: http://arxiv.org/abs/2309.08549
repo_url: None
paper_authors: Minh-Hao Van, Alycia N. Carey, Xintao Wu
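for: Defending against data poisoning attacks and improving the security of deep learning models.
methods: Healthy Influential-Noise based Training (HINT) uses influence functions to craft healthy noise that hardens the model against poisoning attacks, and it remains effective when only a subset of the training data is modified.
results: Extensive experiments on two image datasets under different realistic attack scenarios show that HINT effectively protects deep learning models against poisoning attacks and is more robust than traditional methods.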

Abstract
While numerous defense methods have been proposed to prohibit potential poisoning attacks from untrusted data sources, most research works only defend against specific attacks, which leaves many avenues for an adversary to exploit. In this work, we propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions, named Healthy Influential-Noise based Training. Using influence functions, we craft healthy noise that helps to harden the classification model against poisoning attacks without significantly affecting the generalization ability on test data. In addition, our method can perform effectively when only a subset of the training data is modified, instead of the current method of adding noise to all examples that has been used in several previous works. We conduct comprehensive evaluations over two image datasets with state-of-the-art poisoning attacks under different realistic attack scenarios. Our empirical results show that HINT can efficiently protect deep learning models against the effect of both untargeted and targeted poisoning attacks.

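Summary
Many defense methods have been proposed against poisoning attacks from untrusted data sources, but most research defends only against specific attacks, which leaves many avenues for an adversary to exploit. In this work, we propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions, named Healthy Influential-Noise based Training (HINT). Using influence functions, we craft healthy noise that hardens the classification model against poisoning attacks without significantly affecting generalization on test data. In addition, our method works effectively when only a subset of the training data is modified, rather than adding noise to all examples as in several previous works. We conduct comprehensive evaluations on two image datasets with state-of-the-art poisoning attacks under different realistic attack scenarios. Our empirical results show that HINT efficiently protects deep learning models against the effects of poisoning attacks.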

When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

paper_url: http://arxiv.org/abs/2309.08541
repo_url: None
paper_authors: Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini
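for: This study examines the use of large language models (LMs) for query or document expansion, which can improve generalization in information retrieval.
methods: The study evaluates eleven expansion techniques on twelve datasets with diverse distribution shifts.
results: Expansion improves scores for weaker models but generally harms stronger ones. Based on qualitative error analysis, the authors suggest a recipe: use expansions for weaker models or when the target dataset differs significantly from the training corpus; otherwise avoid them to keep the relevance signal clear.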

Abstract
Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.

Visual Speech Recognition for Low-resource Languages with Automatic Labels From Whisper Model

paper_url: http://arxiv.org/abs/2309.08535
repo_url: https://github.com/jeonghun0716/visual-speech-recognition-for-low-resource-languages
paper_authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro
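for: This paper proposes a powerful visual speech recognition (VSR) method for multiple languages, especially low-resource languages with limited labeled data.
methods: The authors employ a Whisper model, which performs both language identification and audio-based speech recognition. It filters data of the desired languages and transcribes labels from an unannotated, multilingual audio-visual data pool.
results: Comparing VSR models trained on automatic labels and on human-annotated labels shows that similar VSR performance can be reached without human annotation. The automated labeling process labels the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data. With the automatic labels, the method achieves new state-of-the-art performance in four low-resource languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages.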

Abstract
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlabeled multilingual databases, VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low VSR resource languages, French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing the previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages

Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

paper_url: http://arxiv.org/abs/2309.08532
repo_url: https://github.com/kyegomez/EAOT
paper_authors: Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang
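for: This paper proposes EvoPrompt, a new framework for discrete prompt optimization that improves the performance of large language models (LLMs).
methods: The paper uses evolutionary algorithms (EAs) to optimize prompts and connects LLMs with EAs, leveraging both the language processing capabilities of LLMs and the efficient optimization of EAs.
results: On 9 datasets spanning language understanding and generation tasks, EvoPrompt reduces the effort of hand-crafting prompts and, on both closed- and open-source LLMs, outperforms human-engineered prompts and existing automatic prompt generation methods by up to 25% and 14% respectively.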

Abstract
Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 9 datasets spanning language understanding and generation tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation by up to 25% and 14% respectively. Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.

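Summary
Large language models (LLMs) excel at many tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, we propose EvoPrompt, a new framework for discrete prompt optimization that borrows the idea of evolutionary algorithms (EAs) for their good performance and fast convergence. To let EAs work on discrete prompts, we connect LLMs with EAs. This approach lets us leverage both the language processing capabilities of LLMs and the efficient optimization of EAs. Specifically, without any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs, including GPT-3.5 and Alpaca, on 9 datasets spanning language understanding and generation tasks. EvoPrompt significantly outperforms human-engineered prompts and existing automatic prompt generation methods, by up to 25% and 14% respectively. Furthermore, EvoPrompt shows that connecting LLMs with EAs creates synergies, which could inspire further research on combining LLMs with conventional algorithms.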
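
The population loop described above can be sketched as a generic evolutionary search over prompt strings. This is a minimal sketch, not the authors' algorithm: `evolve_prompts`, `toy_score`, and `toy_vary` are hypothetical names, the variation step is a random string edit standing in for an LLM call, and the score is a keyword count standing in for accuracy on a development set.

```python
import random


def evolve_prompts(population, score, vary, generations=5, seed=0):
    """Evolve a fixed-size population: create children, then keep the best scorers."""
    rng = random.Random(seed)
    population = list(dict.fromkeys(population))  # drop duplicates, keep order
    if len(population) < 2:
        raise ValueError("need at least two distinct prompts")
    size = len(population)
    for _ in range(generations):
        children = []
        for _ in range(size):
            parent_a, parent_b = rng.sample(population, 2)
            children.append(vary(parent_a, parent_b, rng))
        # Survivor selection among parents and children, best first.
        merged = list(dict.fromkeys(population + children))
        population = sorted(merged, key=score, reverse=True)[:size]
    return population


# Hypothetical stand-ins: a real system would call an LLM inside `vary` and
# measure task performance on a development set inside `score`.
KEYWORDS = ["step by step", "concisely", "cite evidence"]


def toy_score(prompt):
    return sum(1 for keyword in KEYWORDS if keyword in prompt)


def toy_vary(parent_a, parent_b, rng):
    base = rng.choice([parent_a, parent_b])
    return base + " " + rng.choice(KEYWORDS)


initial = ["Answer the question.", "Classify the review."]
final = evolve_prompts(initial, toy_score, toy_vary)
print(final[0])
```

Because parents compete with their children for survival, the best score in the population never decreases from one generation to the next.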

SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

paper_url: http://arxiv.org/abs/2309.08513
repo_url: https://github.com/showlab/sct
paper_authors: Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou
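for: Improving representations for downstream tasks while lowering parameter costs.
methods: Salient Channel Tuning (SCT) forwards the task images through the model to select salient channels in a feature map and tunes only those channels.
results: On the VTAB-1K benchmark, SCT outperforms full fine-tuning on 18 out of 19 tasks while adding only 0.11M parameters to ViT-B, 780× fewer than full fine-tuning, and it also performs well in domain generalization and few-shot learning.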

Abstract
Pre-trained vision transformers have strong representation benefits to various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% of extra parameters could surpass full fine-tuning in low-data resource scenarios. However, these methods overlook the task-specific information when fine-tuning diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT) to leverage the task-specific information by forwarding the model with the task images to select partial channels in a feature map that enables us to tune only 1/8 channels leading to significantly lower parameter costs. Experiments outperform full fine-tuning on 18 out of 19 tasks in the VTAB-1K benchmark by adding only 0.11M parameters of the ViT-B, which is 780× fewer than its full fine-tuning counterpart. Furthermore, experiments on domain generalization and few-shot learning surpass other PEFT methods with lower parameter costs, demonstrating our proposed tuning technique's strong capability and effectiveness in the low-data regime.
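
The channel selection step described above can be sketched as ranking channels by how strongly they respond to the task images and keeping the top 1/8. The salience criterion used here (mean absolute activation) is an assumption made for illustration, since the abstract does not state how channels are scored; `select_salient_channels` and the activation values are likewise hypothetical.

```python
def select_salient_channels(activations, keep_fraction=0.125):
    """Return sorted indices of the channels with the largest mean absolute activation.

    `activations` is a list of per-image feature vectors, one value per channel.
    """
    num_channels = len(activations[0])
    num_keep = max(1, int(num_channels * keep_fraction))
    salience = [
        sum(abs(sample[c]) for sample in activations) / len(activations)
        for c in range(num_channels)
    ]
    ranked = sorted(range(num_channels), key=lambda c: salience[c], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical activations for 2 task images and 8 channels.
acts = [
    [0.1, 0.0, 2.0, 0.2, 0.1, 0.3, 0.0, 0.1],
    [0.2, 0.1, 1.5, 0.1, 0.0, 0.2, 0.1, 0.0],
]
trainable = select_salient_channels(acts)
print(trainable)
```

With 8 channels and a keep fraction of 1/8, only channel 2 is selected; in a tuning setup, only the parameters attached to the selected channels would be updated.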

HealthFC: A Dataset of Health Claims for Evidence-Based Medical Fact-Checking

paper_url: http://arxiv.org/abs/2309.08503
repo_url: https://github.com/jvladika/healthfc
paper_authors: Juraj Vladika, Phillip Schneider, Florian Matthes
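for: This paper aims to advance automated medical fact-checking by providing a dataset of health-related claims together with an analysis of it.
methods: The paper introduces a new dataset of 750 health-related claims, each labeled for veracity by medical experts and backed with evidence from clinical studies. The authors also provide baseline machine learning models for automated fact-checking.
results: The analysis highlights the dataset's characteristics and challenges, such as the complexity and diversity of health-related claims. The authors also examine the performance of the baseline models.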

Abstract
Seeking health-related advice on the internet has become a common practice in the digital era. Determining the trustworthiness of medical claims found online and finding appropriate evidence for this information is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of factual claims using evidence from credible knowledge sources. To help advance the automation of this task, in this paper, we introduce a novel dataset of 750 health-related claims, labeled for veracity by medical experts and backed with evidence from appropriate clinical studies. We provide an analysis of the dataset, highlighting its characteristics and challenges. The dataset can be used for Machine Learning tasks related to automated fact-checking such as evidence retrieval, veracity prediction, and explanation generation. For this purpose, we provide baseline models based on different approaches, examine their performance, and discuss the findings.

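Summary
Seeking health-related advice on the internet has become common practice in the digital era. Determining the trustworthiness of medical claims found online and finding appropriate evidence for them is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of claims using evidence from credible knowledge sources. In this paper, we introduce a new dataset of 750 health-related claims, labeled for veracity by medical experts and backed with evidence from relevant clinical studies. We analyze the dataset, describing its characteristics and challenges. The dataset can be used for automated fact-checking tasks such as evidence retrieval, veracity prediction, and explanation generation. For this purpose, we provide baseline models, evaluate their performance, and discuss the findings.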

P-ROCKET: Pruning Random Convolution Kernels for Time Series Classification

paper_url: http://arxiv.org/abs/2309.08499
repo_url: https://github.com/shaowuchen/p-rocket
paper_authors: Shaowu Chen, Weize Sun, Lei Huang, Xiaopeng Li, Qingyuan Wang, Deepu John
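for: This paper addresses time series classification and proposes P-ROCKET, a method that prunes random convolution kernels to improve efficiency while preserving accuracy.
methods: The paper builds on the ROCKET and MINIROCKET time series classifiers. Unlike the earlier S-ROCKET, which prunes kernels with an evolutionary algorithm, P-ROCKET formulates pruning as a Group Elastic Net classification problem solved with ADMM, then splits the $l_{2,1}$ and $l_2$ regularizations into two sequential stages.
results: The results show that P-ROCKET classifies time series accurately, is more time-efficient than ROCKET and MINIROCKET, and enables time series classification on resource-constrained devices.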

Abstract
In recent years, two time series classification models, ROCKET and MINIROCKET, have attracted much attention for their low training cost and state-of-the-art accuracy. Utilizing random 1-D convolutional kernels without training, ROCKET and MINIROCKET can rapidly extract features from time series data, allowing for the efficient fitting of linear classifiers. However, to comprehensively capture useful features, a large number of random kernels are required, which is incompatible for resource-constrained devices. Therefore, a heuristic evolutionary algorithm named S-ROCKET is devised to recognize and prune redundant kernels. Nevertheless, the inherent nature of evolutionary algorithms renders the evaluation of kernels within S-ROCKET an unacceptable time-consuming process. In this paper, diverging from S-ROCKET, which directly evaluates random kernels with nonsignificant differences, we remove kernels from a feature selection perspective by eliminating associating connections in the sequential classification layer. To this end, we start by formulating the pruning challenge as a Group Elastic Net classification problem and employ the ADMM method to arrive at a solution. Sequentially, we accelerate the aforementioned time-consuming solving process by bifurcating the $l_{2,1}$ and $l_2$ regularizations into two sequential stages and solve them separately, which ultimately forms our core algorithm, named P-ROCKET. Stage 1 of P-ROCKET employs group-wise regularization similarly to our initial ADMM-based Algorithm, but introduces dynamically varying penalties to greatly accelerate the process. To mitigate overfitting, Stage 2 of P-ROCKET implements element-wise regularization to refit a linear classifier, utilizing the retained features.

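Summary
In recent years, two time series classification models, ROCKET and MINIROCKET, have attracted much attention because they extract features from time series quickly using random 1-D convolutional kernels that require no training. Capturing useful features comprehensively, however, takes a large number of random kernels, which is unacceptable on resource-constrained devices. The heuristic evolutionary algorithm S-ROCKET was devised to identify and prune redundant kernels, but the nature of evolutionary algorithms makes its kernel evaluation very time-consuming. Unlike S-ROCKET, this paper prunes kernels from a feature selection perspective. The pruning problem is cast as a Group Elastic Net classification problem and solved with ADMM. The solver is then accelerated by splitting the $l_{2,1}$ and $l_2$ regularizations into two stages that are solved separately, which yields the core algorithm, P-ROCKET. Stage 1 applies group-wise regularization, as in the initial ADMM-based algorithm, but with dynamically varying penalties that greatly speed up the process. To avoid overfitting, Stage 2 applies element-wise regularization to refit a linear classifier on the retained features.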
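
The group-wise regularization described for P-ROCKET can be illustrated with the standard proximal operator of the $l_{2,1}$ penalty, often called group soft-thresholding. This is a minimal sketch, not the authors' implementation: it assumes that each kernel owns one group of classifier weights, and the function names and toy numbers are hypothetical. A group whose norm falls below the penalty is zeroed as a whole, which is what allows the corresponding kernel to be pruned.

```python
import math


def group_soft_threshold(groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 (the l_{2,1} penalty).

    `groups` is a list of weight vectors, one per kernel. This layout is an
    assumption made for illustration.
    """
    shrunk = []
    for w in groups:
        norm = math.sqrt(sum(x * x for x in w))
        if norm <= lam:
            # The whole group is zeroed, so this kernel can be pruned.
            shrunk.append([0.0] * len(w))
        else:
            # Shrink the group towards zero without changing its direction.
            scale = 1.0 - lam / norm
            shrunk.append([scale * x for x in w])
    return shrunk


def kept_kernels(groups):
    """Indices of kernels whose weight group is not entirely zero."""
    return [i for i, w in enumerate(groups) if any(x != 0.0 for x in w)]


# Toy weights for three kernels (made-up numbers).
weights = [[3.0, 4.0], [0.3, 0.4], [0.0, 2.0]]
pruned = group_soft_threshold(weights, lam=1.0)
print(kept_kernels(pruned))
```

Raising `lam` removes more kernels at once, which is the lever a dynamically varying penalty would act on; how P-ROCKET schedules that penalty is not specified in the abstract.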

Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata

paper_url: http://arxiv.org/abs/2309.08491
repo_url: https://github.com/bohuizhang/llmke
paper_authors: Bohui Zhang, Ioannis Reklos, Nitisha Jain, Albert Meroño Peñuela, Elena Simperl
for: The paper is written for knowledge engineering tasks in the context of the ISWC 2023 LM-KBC Challenge.
methods: The paper uses pre-trained Large Language Models (LLMs) to produce relevant objects in string format and link them to their respective Wikidata QIDs. The pipeline developed in the paper is called LLMKE, which combines knowledge probing and Wikidata entity mapping.
results: The paper achieved a macro-averaged F1-score of 0.701 across the properties, with scores varying from 1.00 to 0.328. The results demonstrate that the knowledge of LLMs varies significantly depending on the domain and that further experimentation is required to determine the circumstances under which LLMs can be used for automatic Knowledge Base completion and correction.

Abstract
In this work, we explore the use of Large Language Models (LLMs) for knowledge engineering tasks in the context of the ISWC 2023 LM-KBC Challenge. For this task, given subject and relation pairs sourced from Wikidata, we utilize pre-trained LLMs to produce the relevant objects in string format and link them to their respective Wikidata QIDs. We developed a pipeline using LLMs for Knowledge Engineering (LLMKE), combining knowledge probing and Wikidata entity mapping. The method achieved a macro-averaged F1-score of 0.701 across the properties, with the scores varying from 1.00 to 0.328. These results demonstrate that the knowledge of LLMs varies significantly depending on the domain and that further experimentation is required to determine the circumstances under which LLMs can be used for automatic Knowledge Base (e.g., Wikidata) completion and correction. The investigation of the results also suggests the promising contribution of LLMs in collaborative knowledge engineering. LLMKE won Track 2 of the challenge. The implementation is available at https://github.com/bohuizhang/LLMKE.

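Summary
In this work, we explore the potential of large language models (LLMs) for knowledge engineering tasks. In the context of the ISWC 2023 LM-KBC Challenge, we use pre-trained LLMs to generate the relevant objects for subject and relation pairs drawn from Wikidata and link them to their respective Wikidata QIDs. We developed an LLM-based knowledge engineering pipeline (LLMKE) that combines knowledge probing and Wikidata entity mapping. The method achieved a macro-averaged F1-score of 0.701 across the properties, with scores ranging from 1.00 to 0.328. These results show that the knowledge of LLMs varies widely by domain and that further experiments are needed to determine when LLMs can be used to automatically complete and correct the Wikidata knowledge base. The results also suggest that LLMs are promising for collaborative knowledge engineering. LLMKE won Track 2 of the challenge. The implementation is available on GitHub: https://github.com/bohuizhang/LLMKE.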
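
To make the headline metric concrete, the sketch below computes a macro-averaged F1-score over properties from predicted and gold object sets. It is an illustration under stated assumptions, not the challenge's official scorer: the treatment of empty sets, the per-subject averaging inside each property, and the property names and QIDs in the toy data are all hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between two sets of object identifiers (e.g. Wikidata QIDs)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        # Assumed convention: predicting nothing for an empty gold set is perfect.
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_property):
    """Average the per-property F1, each averaged over that property's subjects.

    `per_property` maps a property name to a list of (predicted, gold) pairs.
    """
    property_scores = []
    for pairs in per_property.values():
        scores = [f1_score(p, g) for p, g in pairs]
        property_scores.append(sum(scores) / len(scores))
    return sum(property_scores) / len(property_scores)


# Toy data; property names and identifiers are made up for illustration.
example = {
    "PropertyA": [(["Q1", "Q2"], ["Q1", "Q2"]), (["Q3"], ["Q3"])],
    "PropertyB": [(["Q4", "Q5"], ["Q4"]), ([], ["Q6"])],
}
print(round(macro_f1(example), 3))
```

Because every property contributes equally to the macro average, a property where the model knows very little can pull the overall score down sharply, which is consistent with the wide 1.00 to 0.328 spread reported above.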

XFedHunter: An Explainable Federated Learning Framework for Advanced Persistent Threat Detection in SDN

paper_url: http://arxiv.org/abs/2309.08485
repo_url: None
paper_authors: Huynh Thai Thi, Ngo Duc Hoang Son, Phan The Duy, Nghi Hoang Khoa, Khoa Ngo-Khanh, Van-Hau Pham
for: This paper proposes a framework for detecting Advanced Persistent Threats (APTs) in Software-Defined Networking (SDN) using Federated Learning (FL) and Explainable Artificial Intelligence (XAI).
methods: The proposed framework, called XFedHunter, utilizes a combination of Graph Neural Networks (GNNs) and Deep Learning models to detect APTs in the network system. It also leverages local cyber threat knowledge from multiple training collaborators to improve the accuracy of APT detection.
results: The experimental results on two datasets (NF-ToN-IoT and DARPA TCE3) show that XFedHunter can effectively enhance the trust and accountability of Machine Learning (ML)-based systems for cybersecurity purposes without compromising privacy.

Abstract
Advanced Persistent Threat (APT) attacks are highly sophisticated and employ a multitude of advanced methods and techniques to target organizations and steal sensitive and confidential information. APT attacks consist of multiple stages and have a defined strategy, utilizing new and innovative techniques and technologies developed by hackers to evade security software monitoring. To effectively protect against APTs, detecting and predicting APT indicators with an explanation from Machine Learning (ML) prediction is crucial to reveal the characteristics of attackers lurking in the network system. Meanwhile, Federated Learning (FL) has emerged as a promising approach for building intelligent applications without compromising privacy. This is particularly important in cybersecurity, where sensitive data and high-quality labeling play a critical role in constructing effective machine learning models for detecting cyber threats. Therefore, this work proposes XFedHunter, an explainable federated learning framework for APT detection in Software-Defined Networking (SDN) leveraging local cyber threat knowledge from many training collaborators. In XFedHunter, Graph Neural Network (GNN) and Deep Learning model are utilized to reveal the malicious events effectively in the large number of normal ones in the network system. The experimental results on NF-ToN-IoT and DARPA TCE3 datasets indicate that our framework can enhance the trust and accountability of ML-based systems utilized for cybersecurity purposes without privacy leakage.

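Summary
Advanced Persistent Threat (APT) attacks are highly sophisticated and use many advanced techniques and methods to target organizations and steal sensitive and confidential information. They consist of multiple stages and follow a defined strategy, using new techniques and technologies to evade security software monitoring. To defend against APTs effectively, detecting and predicting APT indicators is essential for revealing the characteristics of attackers hidden in the network system. Federated Learning (FL) has also emerged as a promising way to build intelligent applications without compromising privacy. This matters especially in cybersecurity, where sensitive data and high-quality labels play an important role in building effective machine learning models for detecting cyber threats. This work therefore proposes XFedHunter, an explainable federated learning framework for APT detection in SDN. XFedHunter uses a Graph Neural Network (GNN) and a deep learning model to reveal malicious events in the network system, drawing on the local cyber threat knowledge of multiple training collaborators. The experimental results indicate that XFedHunter can enhance the trust and accountability of machine learning systems used for cybersecurity without privacy leakage.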
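
The abstract does not say how XFedHunter aggregates the collaborators' models. As a hedged illustration of the general federated learning idea, the sketch below uses FedAvg-style weighted averaging, a common choice that is assumed here rather than taken from the paper. Only parameter vectors and sample counts leave each client, never the raw data; the function name and numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style aggregation).

    Each client contributes in proportion to its number of local samples.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, value in enumerate(weights):
            global_weights[i] += value * size / total
    return global_weights


# Two hypothetical collaborators with locally trained parameters.
clients = [[1.0, 0.0], [3.0, 2.0]]
sizes = [100, 300]
global_model = federated_average(clients, sizes)
print(global_model)
```

In a full training loop the server would send `global_model` back to the clients for another round of local training.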

VulnSense: Efficient Vulnerability Detection in Ethereum Smart Contracts by Multimodal Learning with Graph Neural Network and Language Model

paper_url: http://arxiv.org/abs/2309.08474
repo_url: None
paper_authors: Phan The Duy, Nghi Hoang Khoa, Nguyen Huu Quyen, Le Cong Trinh, Vu Trung Kien, Trinh Minh Hoang, Van-Hau Pham
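for: This study proposes the VulnSense framework for efficiently detecting vulnerabilities in Ethereum smart contracts, using a multimodal learning approach on graph-based and natural language processing (NLP) models.
methods: The proposed approach combines three types of smart contract features: source code, opcode sequences, and the control flow graph (CFG) extracted from bytecode. BERT, BiLSTM, and GNN models extract and analyze these features.
results: The multimodal method reaches an average accuracy of 77.96% in detecting vulnerabilities in Ethereum smart contracts, overcoming the limitations of single-feature or single-model deep learning techniques.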

Abstract
This paper presents the VulnSense framework, a comprehensive approach to efficiently detect vulnerabilities in Ethereum smart contracts using a multimodal learning approach on graph-based and natural language processing (NLP) models. Our proposed framework combines three types of features from smart contracts comprising source code, opcode sequences, and control flow graph (CFG) extracted from bytecode. We employ Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Long Short-Term Memory (BiLSTM) and Graph Neural Network (GNN) models to extract and analyze these features. The final layer of our multimodal approach consists of a fully connected layer used to predict vulnerabilities in Ethereum smart contracts. Addressing limitations of existing vulnerability detection methods relying on single-feature or single-model deep learning techniques, our method surpasses accuracy and effectiveness constraints. We assess VulnSense using a collection of 1,769 smart contracts derived from the combination of three datasets: Curated, SolidiFI-Benchmark, and Smartbugs Wild. We then make a comparison with various unimodal and multimodal learning techniques contributed by GNN, BiLSTM and BERT architectures. The experimental outcomes demonstrate the superior performance of our proposed approach, achieving an average accuracy of 77.96% across all three categories of vulnerable smart contracts.

Summary
Existing vulnerability detection methods have limitations, such as relying on single-feature or single-model deep learning techniques, which can lead to accuracy and effectiveness constraints. In contrast, VulnSense surpasses these limitations by using a multimodal approach that combines multiple features and models to improve accuracy and effectiveness. The framework is evaluated using a collection of 1,769 smart contracts derived from three datasets: Curated, SolidiFI-Benchmark, and Smartbugs Wild. The experimental results demonstrate the superior performance of VulnSense, achieving an average accuracy of 77.96% across all three categories of vulnerable smart contracts. In comparison with various unimodal and multimodal learning techniques contributed by GNN, BiLSTM, and BERT architectures, VulnSense outperforms them all, showcasing its effectiveness in detecting vulnerabilities in Ethereum smart contracts.
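
The multimodal design described above, with three feature extractors feeding one fully connected prediction layer, can be sketched as concatenation followed by a dense layer and a softmax. This is only an illustration: the embedding sizes, weights, and the choice of plain concatenation as the fusion step are assumptions, since the abstract does not give the exact fusion mechanism.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(source_vec, opcode_vec, cfg_vec, weights, biases):
    """Concatenate three modality embeddings and apply one dense layer.

    In a real system the three vectors would come from the source code,
    opcode, and CFG encoders; here they are plain lists.
    """
    fused = list(source_vec) + list(opcode_vec) + list(cfg_vec)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy one-dimensional embeddings and a 2-class dense layer (made-up values).
probs = fuse_and_classify(
    source_vec=[1.0],
    opcode_vec=[0.0],
    cfg_vec=[2.0],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    biases=[0.0, 0.0],
)
print(probs)
```

Because the dense layer sees all three embeddings at once, it can weigh evidence from one modality against another, which a single-feature model cannot do.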

Explaining Search Result Stances to Opinionated People

paper_url: http://arxiv.org/abs/2309.08460
repo_url: None
paper_authors: Z. Wu, T. Draws, F. Cau, F. Barile, A. Rieger, N. Tintarev
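for: This study investigates whether stance labels and their explanations can help users avoid cognitive biases when searching for information.
methods: The study automatically classifies and labels search results and generates explanations for the labels. A user study (N = 203) tests how the degree of stance bias in the search results affects which results users read.
results: Stance labels and explanations lead users to consume more diverse search results, but there is no evidence of systematic opinion change in this setting. These results can help search engine designers make more informed design decisions.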

Abstract
People use web search engines to find information before forming opinions, which can lead to practical decisions with different levels of impact. The cognitive effort of search can leave opinionated users vulnerable to cognitive biases, e.g., the confirmation bias. In this paper, we investigate whether stance labels and their explanations can help users consume more diverse search results. We automatically classify and label search results on three topics (i.e., intellectual property rights, school uniforms, and atheism) as against, neutral, and in favor, and generate explanations for these labels. In a user study (N = 203), we then investigate whether search result stance bias (balanced vs biased) and the level of explanation (plain text, label only, label and explanation) influence the diversity of search results clicked. We find that stance labels and explanations lead to a more diverse search result consumption. However, we do not find evidence for systematic opinion change among users in this context. We believe these results can help designers of search engines to make more informed design decisions.

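Summary
People use web search engines to find information before forming opinions, which can lead to decisions with different levels of impact. The cognitive effort of searching can leave users vulnerable to cognitive biases such as confirmation bias. This paper investigates whether stance labels and their explanations can help users consume more diverse search results. Search results on three topics (intellectual property rights, school uniforms, and atheism) are automatically classified and labeled as against, neutral, or in favor, and explanations are generated for these labels. A user study (N = 203) examines how search result stance bias (balanced vs. biased) and the level of explanation (plain text, label only, label and explanation) affect the diversity of the results clicked. Stance labels and explanations lead to more diverse search result consumption. However, there is no evidence of systematic opinion change among users in this context. The authors believe these results can help search engine designers make more informed design decisions.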

Adversarial Attacks on Tables with Entity Swap

paper_url: http://arxiv.org/abs/2309.08650
repo_url: None
paper_authors: Aneta Koleva, Martin Ringsquandl, Volker Tresp
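for: This paper studies the reliability and security of language models for table understanding.
methods: The paper introduces a new black-box attack, the entity-swap attack, to test the reliability of tabular language models.
results: Experiments show that the proposed attack causes a performance drop of up to 70% in tabular language models.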

Abstract
The capabilities of large language models (LLMs) have been successfully applied in the context of table representation learning. The recently proposed tabular language models have reported state-of-the-art results across various tasks for table interpretation. However, a closer look into the datasets commonly used for evaluation reveals an entity leakage from the train set into the test set. Motivated by this observation, we explore adversarial attacks that represent a more realistic inference setup. Adversarial attacks on text have been shown to greatly affect the performance of LLMs, but currently, there are no attacks targeting tabular language models. In this paper, we propose an evasive entity-swap attack for the column type annotation (CTA) task. Our CTA attack is the first black-box attack on tables, where we employ a similarity-based sampling strategy to generate adversarial examples. The experimental results show that the proposed attack generates up to a 70% drop in performance.

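Summary
The capabilities of large language models (LLMs) have been applied successfully to table representation learning. Recently proposed tabular language models report state-of-the-art results on a range of table interpretation tasks. A closer look at the datasets commonly used for evaluation, however, reveals entity leakage: entities that appear in the training set also appear in the test set. This observation motivates the study of adversarial attacks that represent a more realistic inference setup. Adversarial attacks on text have been studied widely, but no attack currently targets tabular language models. This paper proposes an evasive entity-swap attack for the column type annotation (CTA) task, a black-box attack that generates adversarial examples with a similarity-based sampling strategy. Experimental results show that the attack causes a performance drop of up to 70% in tabular language models.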
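
The entity-swap idea can be sketched as follows: cells in a column are replaced by other entities of the same type, chosen by embedding similarity, so the column's true type is unchanged while its surface content differs. Several details here are assumptions for illustration only: the abstract does not say whether the most or least similar candidate is sampled (this sketch picks the least similar), which cells are swapped (this sketch takes the first `num_swaps`), or which embeddings are used (the toy vectors are made up).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entity_swap(column, candidates, embeddings, num_swaps):
    """Return a copy of `column` with up to `num_swaps` cells replaced.

    Each replaced cell receives the unused same-type candidate whose
    embedding is least similar to the original entity (assumed strategy).
    """
    swapped = list(column)
    available = [c for c in candidates if c not in column]
    for i in range(min(num_swaps, len(swapped))):
        if not available:
            break
        original = swapped[i]
        replacement = min(
            available,
            key=lambda c: cosine(embeddings[original], embeddings[c]),
        )
        swapped[i] = replacement
        available.remove(replacement)
    return swapped


# Toy "city" column with hypothetical 2-D embeddings.
embeddings = {
    "Paris": [1.0, 0.0],
    "Berlin": [0.9, 0.1],
    "Lima": [0.0, 1.0],
    "Oslo": [0.7, 0.7],
}
column = ["Paris", "Berlin"]
candidates = ["Lima", "Oslo"]
print(entity_swap(column, candidates, embeddings, num_swaps=2))
```

The attack is black-box in the sense that it only edits the input table and never inspects the victim model's weights or gradients.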

Toward responsible face datasets: modeling the distribution of a disentangled latent space for sampling face images from demographic groups

paper_url: http://arxiv.org/abs/2309.08442
repo_url: None
paper_authors: Parsa Rahimi, Christophe Ecabert, Sebastien Marcel
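for: This study aims to generate a balanced and possibly bias-free synthetic dataset for training, regularizing, or evaluating face recognition models.
methods: The authors model and sample a disentangled projection of a StyleGAN latent space to generate faces for any combination of demographic groups (e.g., Hispanic female).
results: Experiments show that any combination of demographic groups can be synthesized effectively and that the identities differ from those in the original training dataset. The source code has also been released.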

Abstract
Recently, it has been exposed that some modern facial recognition systems could discriminate specific demographic groups and may lead to unfair attention with respect to various facial attributes such as gender and origin. The main reason is the biases inside the datasets, with unbalanced demographics, used to train these models. Unfortunately, collecting a large-scale balanced dataset with respect to various demographics is impracticable. In this paper, we investigate as an alternative the generation of a balanced and possibly bias-free synthetic dataset that could be used to train, to regularize or to evaluate deep learning-based facial recognition models. We propose to use a simple method for modeling and sampling a disentangled projection of a StyleGAN latent space to generate any combination of demographic groups (e.g. $hispanic-female$). Our experiments show that we can synthesize any combination of demographic groups effectively and the identities are different from the original training dataset. We also released the source code.

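Summary
Recently, concerns have been raised that some modern face recognition systems may treat certain ethnic groups or genders unfairly. The main cause is bias in the datasets used to train the models, whose demographics are unbalanced. Collecting a large-scale balanced dataset, however, is impractical. As an alternative, this paper uses a simple method to model and sample the StyleGAN latent space and generate any combination of demographic groups (e.g., hispanic-female). Experiments show that any combination can be generated effectively and that the generated faces differ from the original training dataset. The source code has also been released.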

Learning by Self-Explaining

paper_url: http://arxiv.org/abs/2309.08395
repo_url: https://github.com/abusufyanvu/6S191_MIT_DeepLearning
paper_authors: Wolfgang Stammer, Felix Friedrich, David Steinmann, Hikaru Shindo, Kristian Kersting
for: The paper aims to improve the learning process of artificial intelligence (AI) models by incorporating self-explaining mechanisms, which are inspired by human psychology and have been neglected in current AI research.
methods: The proposed Learning by Self-Explaining (LSX) paradigm involves a learning module performing a base task and providing explanations for its decisions, which are then evaluated by an internal critic module. The learner is refined with the critic’s feedback, and the loop is repeated as needed. The paper provides distinct instantiations of LSX for two different learner models.
results: The paper shows that LSX not only boosts the generalization abilities of AI models, particularly in small-data regimes, but also aids in mitigating the influence of confounding factors and leads to more task-specific and faithful model explanations. The results provide experimental evidence of the potential of self-explaining within the learning phase of an AI model.

Abstract
Artificial intelligence (AI) research has a long track record of drawing inspiration from findings in biology, in particular human intelligence. In contrast to current AI research that mainly treats explanations as a means for model inspection, a somewhat neglected finding from human psychology is the benefit of self-explaining in an agent's learning process. Motivated by this, we introduce a novel learning paradigm, termed Learning by Self-Explaining (LSX). The underlying idea is that a learning module (learner) performs a base task, e.g. image classification, and provides explanations for its decisions. An internal critic module next evaluates the quality of these explanations given the original task. Finally, the learner is refined with the critic's feedback and the loop is repeated as required. The intuition behind this is that an explanation is considered "good" if the critic can perform the same task given the respective explanation. Despite many implementation possibilities, the structure of any LSX instantiation can be taxonomized based on four learning modules which we identify as: Fit, Explain, Reflect and Revise. In our work, we provide distinct instantiations of LSX for two different learner models, each illustrating different choices for the various LSX components. We broadly evaluate these on several datasets and show that Learning by Self-Explaining not only boosts the generalization abilities of AI models, particularly in small-data regimes, but also aids in mitigating the influence of confounding factors, as well as leading to more task-specific and faithful model explanations. Overall, our results provide experimental evidence of the potential of self-explaining within the learning phase of an AI model.

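Summary
AI research has a long tradition of drawing new ideas from findings in biology, in particular human intelligence. While current AI research mainly treats explanations as a tool for model inspection, a somewhat neglected finding from human psychology is the benefit of self-explaining in an agent's learning process. Motivated by this, the paper introduces a new learning paradigm called Learning by Self-Explaining (LSX). A learning module (the learner) performs a base task, such as image classification, and explains its decisions. An internal critic module then evaluates the quality of these explanations given the original task. Finally, the learner is refined with the critic's feedback and the loop repeats. An explanation counts as "good" if the critic can perform the same task given that explanation. Any LSX instantiation can be organized around four learning modules: Fit, Explain, Reflect, and Revise. The paper provides distinct instantiations for two different learner models and evaluates them broadly. LSX improves the generalization of AI models, particularly in small-data regimes, reduces the influence of confounding factors, and leads to more task-specific and faithful explanations. Overall, the results provide experimental evidence of the potential of self-explaining within the learning phase of an AI model.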

MAPLE: Mobile App Prediction Leveraging Large Language model Embeddings

paper_url: http://arxiv.org/abs/2309.08648
repo_url: None
paper_authors: Yonchanok Khaokaew, Hao Xue, Flora D. Salim
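for: This paper aims to predict mobile app usage in order to improve the user experience and functionality of apps.
methods: The paper proposes MAPLE, a mobile app prediction model based on large language models (LLMs), which uses LLMs to predict app usage accurately.
results: Rigorous testing on two public datasets shows that MAPLE captures intricate patterns in user behaviour and context and remains robust across conditions. These results confirm MAPLE's versatility and resilience across various scenarios.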

Abstract
Despite the rapid advancement of mobile applications, predicting app usage remains a formidable challenge due to intricate user behaviours and ever-evolving contexts. To address these issues, this paper introduces the Mobile App Prediction Leveraging Large Language Model Embeddings (MAPLE) model. This innovative approach utilizes Large Language Models (LLMs) to predict app usage accurately. Rigorous testing on two public datasets highlights MAPLE's capability to decipher intricate patterns and comprehend user contexts. These robust results confirm MAPLE's versatility and resilience across various scenarios. While its primary design caters to app prediction, the outcomes also emphasize the broader applicability of LLMs in different domains. Through this research, we emphasize the potential of LLMs in app usage prediction and suggest their transformative capacity in modelling human behaviours across diverse fields.

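Summary
Despite the rapid development of mobile applications, predicting app usage remains challenging because user behaviour is complex and contexts keep evolving. To address this, the paper proposes the Mobile App Prediction Leveraging Large Language Model Embeddings (MAPLE) model, which uses large language models (LLMs) to predict app usage accurately. Rigorous testing on two public datasets shows that MAPLE can decipher intricate patterns and understand user contexts. The results also confirm MAPLE's versatility and resilience across scenarios. Although MAPLE is designed primarily for app prediction, the outcomes point to the broader applicability of LLMs in other domains. The research highlights the potential of LLMs for app usage prediction and suggests their capacity to model human behaviour across diverse fields.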

Intent Detection at Scale: Tuning a Generic Model using Relevant Intents

paper_url: http://arxiv.org/abs/2309.08647
repo_url: None
paper_authors: Nichal Narotamo, David Aparicio, Tiago Mesquita, Mariana Almeida
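for: Improve the accuracy of intent prediction for customer support requests, so that support systems run efficiently and agents can quickly understand messages and prioritize responses.
methods: Combine a single generic model with a per-client list of relevant intents, which lowers training and maintenance costs while giving clients a personalized experience that adapts to their changing needs.
results: Compared with industry-specific models, the final system performs significantly better, demonstrating its flexibility and ability to cater to diverse client needs.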

Abstract
Accurately predicting the intent of customer support requests is vital for efficient support systems, enabling agents to quickly understand messages and prioritize responses accordingly. While different approaches exist for intent detection, maintaining separate client-specific or industry-specific models can be costly and impractical as the client base expands. This work proposes a system to scale intent predictions to various clients effectively, by combining a single generic model with a per-client list of relevant intents. Our approach minimizes training and maintenance costs while providing a personalized experience for clients, allowing for seamless adaptation to changes in their relevant intents. Furthermore, we propose a strategy for using the clients' relevant intents as model features that proves to be resilient to changes in the relevant intents of clients -- a common occurrence in production environments. The final system exhibits significantly superior performance compared to industry-specific models, showcasing its flexibility and ability to cater to diverse client needs.

Summary
Accurately predicting the intent of customer support requests is key to efficient support systems, letting agents quickly understand messages and prioritize responses. Several approaches to intent detection exist, but maintaining separate client-specific or industry-specific models becomes costly and impractical as the client base expands. This work scales intent prediction across clients by combining a single generic model with a per-client list of relevant intents, which minimizes training and maintenance costs while still adapting to changes in each client's relevant intents. It also proposes using the clients' relevant intents as model features, a strategy that proves resilient to changes in those intents, a common occurrence in production environments. The final system performs significantly better than industry-specific models.
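
One simple way to combine a generic model with a per-client list of relevant intents is to restrict the generic model's scores to the intents a client has enabled and take the best remaining one. The sketch below shows that output-side filtering only; it is an assumed illustration, and it does not implement the paper's other strategy of feeding the relevant intents to the model as features. The intent names and scores are hypothetical.

```python
def predict_intent(scores, relevant_intents):
    """Pick the highest-scoring intent among those relevant to one client.

    `scores` maps intent name to the generic model's score. Returns None
    when no scored intent is relevant to the client.
    """
    allowed = {k: v for k, v in scores.items() if k in relevant_intents}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


# Scores from a hypothetical generic model for one support message.
generic_scores = {"refund": 0.5, "cancel_order": 0.3, "track_order": 0.2}

# Two hypothetical clients with different relevant-intent lists.
client_a = {"cancel_order", "track_order"}
client_b = {"refund", "cancel_order", "track_order"}

print(predict_intent(generic_scores, client_a))
print(predict_intent(generic_scores, client_b))
```

With this design, a client changing its relevant intents only requires updating a list, not retraining a model.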

M$^3$Net: Multilevel, Mixed and Multistage Attention Network for Salient Object Detection

paper_url: http://arxiv.org/abs/2309.08365
repo_url: https://github.com/I2-Multimedia-Lab/M3Net
paper_authors: Yao Yuan, Pan Gao, XiaoYang Tan
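for: Improve the accuracy of salient object detection (SOD).
methods: The paper proposes M$^3$Net, a multilevel, mixed, and multistage attention network that consists of a Multiscale Interaction Block, a Mixed Attention Block, and a multilevel supervision strategy, to improve prediction accuracy.
results: Experiments on six challenging datasets show that M$^3$Net surpasses recent CNN- and Transformer-based SOD methods on four metrics.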

Abstract
Most existing salient object detection methods use U-Net or feature pyramid structure, which simply aggregates feature maps of different scales, ignoring the uniqueness and interdependence of them and their respective contributions to the final prediction. To overcome these limitations, we propose the M$^3$Net, i.e., the Multilevel, Mixed and Multistage attention network for Salient Object Detection (SOD). Firstly, we propose Multiscale Interaction Block which innovatively introduces the cross-attention approach to achieve the interaction between multilevel features, allowing high-level features to guide low-level feature learning and thus enhancing salient regions. Secondly, considering the fact that previous Transformer based SOD methods locate salient regions only using global self-attention while inevitably overlooking the details of complex objects, we propose the Mixed Attention Block. This block combines global self-attention and window self-attention, aiming at modeling context at both global and local levels to further improve the accuracy of the prediction map. Finally, we propose a multilevel supervision strategy to optimize the aggregated feature stage-by-stage. Experiments on six challenging datasets demonstrate that the proposed M$^3$Net surpasses recent CNN and Transformer-based SOD methods in terms of four metrics. Codes are available at https://github.com/I2-Multimedia-Lab/M3Net.

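Summary
Most existing salient object detection methods use a U-Net or feature pyramid structure, which simply aggregates feature maps of different scales and ignores their uniqueness, their interdependence, and their respective contributions to the final prediction. To address this, the paper proposes M$^3$Net, the Multilevel, Mixed and Multistage attention network for Salient Object Detection (SOD). First, a Multiscale Interaction Block introduces cross-attention to let multilevel features interact, so that high-level features guide low-level feature learning and salient regions are enhanced. Second, because earlier Transformer-based SOD methods locate salient regions with global self-attention only and overlook the details of complex objects, a Mixed Attention Block combines global self-attention with window self-attention to model context at both global and local levels and further improve the accuracy of the prediction map. Finally, a multilevel supervision strategy optimizes the aggregated features stage by stage. Experiments on six challenging datasets show that M$^3$Net surpasses recent CNN- and Transformer-based SOD methods on four metrics. Code is available at https://github.com/I2-Multimedia-Lab/M3Net.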

Data Distribution Bottlenecks in Grounding Language Models to Knowledge Bases

paper_url: http://arxiv.org/abs/2309.08345
repo_url: None
paper_authors: Yiheng Shu, Zhiwei Yu
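for: This study examines the robustness of language models (LMs) in large-scale knowledge base (KB) environments.
methods: The study applies several data augmentation techniques intended to improve the robustness and generalization of LMs.
results: Experiments show that even with the proposed data augmentation techniques, current LMs perform poorly across settings, especially under language variation and data distribution shift.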

Abstract
Language models (LMs) have already demonstrated remarkable abilities in understanding and generating both natural and formal language. Despite these advances, their integration with real-world environments such as large-scale knowledge bases (KBs) remains an underdeveloped area, affecting applications such as semantic parsing and indulging in "hallucinated" information. This paper is an experimental investigation aimed at uncovering the robustness challenges that LMs encounter when tasked with knowledge base question answering (KBQA). The investigation covers scenarios with inconsistent data distribution between training and inference, such as generalization to unseen domains, adaptation to various language variations, and transferability across different datasets. Our comprehensive experiments reveal that even when employed with our proposed data augmentation techniques, advanced small and large language models exhibit poor performance in various dimensions. While the LM is a promising technology, the robustness of the current form in dealing with complex environments is fragile and of limited practicality because of the data distribution issue. This calls for future research on data collection and LM learning paradigms.

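Summary
Language models (LMs) have shown strong abilities in understanding and generating both natural and formal language. Their integration with large-scale knowledge bases (KBs), however, remains underdeveloped, which affects applications such as semantic parsing and leads to "hallucinated" information. This paper is an experimental investigation of the robustness challenges LMs face in knowledge base question answering (KBQA). It covers scenarios where the data distribution differs between training and inference, such as generalization to unseen domains, adaptation to language variations, and transferability across datasets. The experiments show that even with the proposed data augmentation techniques, advanced small and large language models perform poorly in several dimensions. The robustness of current LMs in complex environments is fragile, which limits their practical use. This calls for future research on data collection and LM learning paradigms.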

Let’s Predict Who Will Move to a New Job

paper_url: http://arxiv.org/abs/2309.08333
repo_url: https://github.com/MarkipTheMudkip/in-class-project-2
paper_authors: Rania Mkhinini Gahar, Adel Hidri, Minyar Sassi Hidri
for: The paper is written to discuss the use of machine learning (ML) to predict whether an applicant will search for a new job or stay with the company.
methods: The paper uses data pre-processing, data encoding, and several ML algorithms such as Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and eXtreme Gradient Boosting (XGBoost) to predict job mobility. The synthetic minority oversampling technique (SMOTE) is also used to retain minority classes.
results: The paper assesses the performance of the ML models using decision support metrics such as precision, recall, F1-Score, and accuracy.

Abstract
Any company's human resources department faces the challenge of predicting whether an applicant will search for a new job or stay with the company. In this paper, we discuss how machine learning (ML) is used to predict who will move to a new job. First, the data is pre-processed into a suitable format for ML models. To deal with categorical features, data encoding is applied and several MLA (ML Algorithms) are performed including Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and eXtreme Gradient Boosting (XGBoost). To improve the performance of ML models, the synthetic minority oversampling technique (SMOTE) is used to retain them. Models are assessed using decision support metrics such as precision, recall, F1-Score, and accuracy.

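Summary
Any company's human resources department faces the challenge of predicting whether an applicant will look for a new job. This paper discusses how machine learning (ML) can be used to predict who will leave the company. First, the data is pre-processed into a format suitable for ML models. Data encoding handles the categorical features. Several ML algorithms (MLA) are then applied, including Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and eXtreme Gradient Boosting (XGBoost). To improve model performance, the synthetic minority oversampling technique (SMOTE) is used to retain the minority classes. Models are assessed with decision support metrics such as precision, recall, F1-Score, and accuracy.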
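
SMOTE, mentioned above, balances a dataset by creating synthetic minority samples on the line segment between a minority point and one of its nearest minority neighbours. The sketch below shows that interpolation step in plain Python. It is a simplified illustration rather than the paper's pipeline: the function name, the default `k`, and the toy points are assumptions, and a production workflow would normally rely on an established library.

```python
import random


def smote(minority, num_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by neighbour interpolation.

    `minority` is a list of feature vectors (lists of floats) with at least
    two distinct points. Each synthetic sample lies between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(num_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Three minority-class points in a toy 2-D feature space.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_points, num_synthetic=5)
print(len(new_points))
```

Oversampling should be applied to the training split only; otherwise synthetic points derived from test data inflate precision, recall, and F1-Score.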

paper_url: http://arxiv.org/abs/2309.08289
repo_url: None
paper_authors: Kaouther Mouheb, Mobina Ghojogh Nejad, Lavsen Dahal, Ehsan Samei, W. Paul Segars, Joseph Y. Lo
for: precisely modeling the surface of the large intestine for virtual imaging trials
methods: using geometric deep learning and denoising diffusion probabilistic models to refine segmentation results, and incorporating a state-of-the-art surface reconstruction model
results: significantly improved surface representation compared to initial segmentation, with a 70% reduction in Chamfer distance, a 32% reduction in Hausdorff distance, and a 6% reduction in Earth Mover’s distance

Abstract
Accurate 3D modeling of human organs plays a crucial role in building computational phantoms for virtual imaging trials. However, generating anatomically plausible reconstructions of organ surfaces from computed tomography scans remains challenging for many structures in the human body. This challenge is particularly evident when dealing with the large intestine. In this study, we leverage recent advancements in geometric deep learning and denoising diffusion probabilistic models to refine the segmentation results of the large intestine. We begin by representing the organ as point clouds sampled from the surface of the 3D segmentation mask. Subsequently, we employ a hierarchical variational autoencoder to obtain global and local latent representations of the organ's shape. We train two conditional denoising diffusion models in the hierarchical latent space to perform shape refinement. To further enhance our method, we incorporate a state-of-the-art surface reconstruction model, allowing us to generate smooth meshes from the obtained complete point clouds. Experimental results demonstrate the effectiveness of our approach in capturing both the global distribution of the organ's shape and its fine details. Our complete refinement pipeline demonstrates remarkable enhancements in surface representation compared to the initial segmentation, reducing the Chamfer distance by 70%, the Hausdorff distance by 32%, and the Earth Mover's distance by 6%. By combining geometric deep learning, denoising diffusion models, and advanced surface reconstruction techniques, our proposed method offers a promising solution for accurately modeling the large intestine's surface and can easily be extended to other anatomical structures.

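Summary
Accurate 3D models of human organs play a key role in building computational phantoms. Generating anatomically plausible reconstructions of organ surfaces from computed tomography scans, however, remains challenging for many structures, and especially for the large intestine. This study uses recent advances in geometric deep learning and denoising diffusion probabilistic models to refine large intestine segmentation results. The organ is first represented as a point cloud sampled from the surface of the 3D segmentation mask. A hierarchical variational autoencoder then yields global and local latent representations of the organ's shape. Two conditional denoising diffusion models are trained in this hierarchical latent space to perform shape refinement. A state-of-the-art surface reconstruction model then generates smooth meshes from the completed point clouds. Experimental results show that the method captures both the global distribution of the organ's shape and its fine details. Compared with the initial segmentation, the full refinement pipeline reduces the Chamfer distance by 70%, the Hausdorff distance by 32%, and the Earth Mover's distance by 6%. The method can easily be extended to other anatomical structures.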
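
The Chamfer distance used in the evaluation compares two point clouds by matching every point to its nearest neighbour in the other cloud. The sketch below uses one common definition, the sum of the two mean squared nearest-neighbour distances; other variants exist (unsquared distances, sums instead of means), and the abstract does not say which one the authors use. The helper `relative_reduction` and the toy clouds are hypothetical.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    Uses squared Euclidean distances averaged over each cloud (one common
    convention, assumed here). Brute force, O(len(a) * len(b)).
    """
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def relative_reduction(before, after):
    """Fractional improvement of a distance metric, e.g. 0.7 for a 70% drop."""
    return (before - after) / before


cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(cloud_a, cloud_b))
```

A reported "70% reduction" then corresponds to `relative_reduction(before, after)` being 0.7, where `before` is the distance of the initial segmentation to the reference and `after` that of the refined shape.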

Cure the headache of Transformers via Collinear Constrained Attention

paper_url: http://arxiv.org/abs/2309.08646
repo_url: https://github.com/luban-agi/coca
paper_authors: Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li
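for: This study addresses a previously overlooked anomalous behavior in Transformer models, which the authors call the "headache of Transformers", in order to improve extrapolation performance.
methods: The authors propose a new self-attention structure, Collinear Constrained Attention (CoCA), which integrates easily with existing extrapolation and interpolation methods.
results: CoCA achieves excellent extrapolation performance at 16 to 24 times the sequence length during inference, without any fine-tuning. The authors also improve CoCA's computational and spatial efficiency to keep it practical.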

Abstract
As the rapid progression of practical applications based on Large Language Models continues, the importance of extrapolating performance has grown exponentially in the research domain. In our study, we identified an anomalous behavior in Transformer models that had been previously overlooked, leading to a chaos around closest tokens which carried the most important information. We've coined this discovery the "headache of Transformers". To address this at its core, we introduced a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation, interpolation methods, and other optimization strategies designed for traditional Transformer models. We have achieved excellent extrapolating performance even for 16 times to 24 times of sequence lengths during inference without any fine-tuning on our model. We have also enhanced CoCA's computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, we've made our code available in the appendix for reappearing experiments.

Summary
As practical applications of large language models advance, the extrapolation performance of Transformer models is becoming increasingly important in research. The authors identify a problem in Transformer models, namely chaos around the closest tokens, which carry the most important information, and call it the "headache of Transformers". To address it at its core, they propose a new self-attention structure named Collinear Constrained Attention (CoCA). The structure can be combined with the extrapolation methods, interpolation methods, and other optimization strategies designed for traditional Transformer models. It achieves excellent extrapolation performance without any fine-tuning. The authors also optimize CoCA's computational and spatial efficiency to ensure its practicality. They plan to open-source CoCA shortly; in the meantime, the code is available in the paper's appendix for reproducing the experiments.

Quantitative and Qualitative Evaluation of Reinforcement Learning Policies for Autonomous Vehicles

paper_url: http://arxiv.org/abs/2309.08254
repo_url: None
paper_authors: Laura Ferrarotti, Massimiliano Luca, Gabriele Santin, Giorgio Previati, Gianpiero Mastinu, Elena Campi, Lorenzo Uccello, Antonino Albanese, Praveen Zalaya, Alessandro Roccasalva, Bruno Lepri
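for: Optimizing traffic dynamics in an evolving transportation landscape is important, particularly where autonomous vehicles (AVs) coexist with human-driven cars. The paper uses Proximal Policy Optimization (PPO), a reinforcement learning algorithm, to optimize the choices of AVs, learning a policy that minimizes traffic jams (that is, the time to cross the scenario) and pollution.
methods: A policy is learned with PPO, and empirical analysis shows that the approach reduces time and pollution levels. The learned policy is also evaluated qualitatively with a cutting-edge cockpit under near-real-world conditions.
results: Experiments show reduced time and pollution levels. Evaluations with human participants in the simulator indicate that human-driven vehicles benefit from optimizing AV dynamics, and participants perceived the scenario with 80% AVs as safer and smoother than the one with 20% AVs.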

Abstract
Optimizing traffic dynamics in an evolving transportation landscape is crucial, particularly in scenarios where autonomous vehicles (AVs) with varying levels of autonomy coexist with human-driven cars. This paper presents a novel approach to optimizing choices of AVs using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. We learned a policy to minimize traffic jams (i.e., minimize the time to cross the scenario) and to minimize pollution in a roundabout in Milan, Italy. Through empirical analysis, we demonstrate that our approach can reduce time and pollution levels. Furthermore, we qualitatively evaluate the learned policy using a cutting-edge cockpit to assess its performance in near-real-world conditions. To gauge the practicality and acceptability of the policy, we conducted evaluations with human participants using the simulator, focusing on a range of metrics like traffic smoothness and safety perception. In general, our findings show that human-driven vehicles benefit from optimizing AVs dynamics. Also, participants in the study highlighted that the scenario with 80% AVs is perceived as safer than the scenario with 20%. The same result is obtained for traffic smoothness perception.

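Summary
Optimizing traffic dynamics in an evolving transportation landscape is important, particularly where autonomous vehicles (AVs) coexist with human-driven cars. The paper uses Proximal Policy Optimization (PPO) to optimize the choices of AVs. A policy is learned that minimizes traffic jams (that is, the time to cross the scenario) and pollution at a roundabout in Milan, and empirical analysis shows that the approach reduces both time and pollution levels. The learned policy is evaluated qualitatively with a cutting-edge cockpit to assess its performance in near-real-world conditions. Human participants also evaluated the policy in the simulator against metrics such as traffic smoothness and perceived safety. The findings show that human-driven vehicles benefit from optimizing AV dynamics, and participants perceived the scenario with 80% AVs as safer and smoother than the one with 20% AVs.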
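
For readers unfamiliar with PPO, its core is a clipped surrogate objective that limits how far a policy update can move from the previous policy. The sketch below is the standard textbook form in plain Python. It is not the authors' implementation, and the reward design for travel time and pollution is not described in the abstract, so it is left out.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios: new-policy probability / old-policy probability for each action.
    advantages: estimated advantage of each action.
    clip_eps: clipping range; 0.2 is a common default, assumed here.
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective a pessimistic bound.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# With ratio 1.0 (policy unchanged) the objective is the mean advantage.
print(ppo_clipped_objective([1.0, 1.0], [2.0, 4.0]))  # 3.0
```

Clipping removes the incentive to push the probability ratio outside the range [1 - clip_eps, 1 + clip_eps], which keeps updates conservative.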

A Geometric Perspective on Autoencoders

paper_url: http://arxiv.org/abs/2309.08247
repo_url: https://github.com/clementchadebec/geometric_perspective_on_vaes
paper_authors: Yonghyeon Lee
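for: The paper examines the geometric aspect of the autoencoder framework, which has received relatively little attention despite its importance.
methods: Given a set of high-dimensional data points that lie approximately on a lower-dimensional manifold, an autoencoder is viewed as learning the manifold and its coordinate chart simultaneously.
results: Multiple autoencoders can fit the same dataset, and they may produce incorrect manifolds and distorted latent space representations. The paper introduces recent geometric approaches that address these issues.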

Abstract
This paper presents the geometric aspect of the autoencoder framework, which, despite its importance, has been relatively less recognized. Given a set of high-dimensional data points that approximately lie on some lower-dimensional manifold, an autoencoder learns the manifold and its coordinate chart, simultaneously. This geometric perspective naturally raises inquiries like "Does a finite set of data points correspond to a single manifold?" or "Is there only one coordinate chart that can represent the manifold?". The responses to these questions are negative, implying that there are multiple solution autoencoders given a dataset. Consequently, they sometimes produce incorrect manifolds with severely distorted latent space representations. In this paper, we introduce recent geometric approaches that address these issues.

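Summary
The paper examines the geometric aspect of the autoencoder framework, which has not been sufficiently recognized despite its importance. Given a set of high-dimensional data points that lie approximately on a lower-dimensional manifold, an autoencoder learns the manifold and its coordinate chart simultaneously. This geometric perspective naturally raises questions such as "Does a finite set of data points correspond to a single manifold?" and "Is there only one coordinate chart that can represent the manifold?". The answer to both is no, which means that a dataset admits multiple solution autoencoders, and these may produce incorrect manifolds and distorted latent space representations. The paper introduces recent geometric approaches that address these issues.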

VERSE: Virtual-Gradient Aware Streaming Lifelong Learning with Anytime Inference

paper_url: http://arxiv.org/abs/2309.08227
repo_url: None
paper_authors: Soumya Banerjee, Vinay K. Verma, Avideep Mukherjee, Deepak Gupta, Vinay P. Namboodiri, Piyush Rai
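for: The study proposes a streaming lifelong learning method that learns continuously in a dynamic, non-stationary environment without forgetting previously acquired knowledge.
methods: Virtual gradients are used for continual representation learning to prevent catastrophic forgetting, and an exponential-moving-average-based semantic memory further improves performance.
results: Extensive experiments on diverse datasets show superior performance over existing methods.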

Abstract
Lifelong learning, also referred to as continual learning, is the problem of training an AI agent continuously while also preventing it from forgetting its previously acquired knowledge. Most of the existing methods primarily focus on lifelong learning within a static environment and lack the ability to mitigate forgetting in a quickly-changing dynamic environment. Streaming lifelong learning is a challenging setting of lifelong learning with the goal of continuous learning in a dynamic non-stationary environment without forgetting. We introduce a novel approach to lifelong learning, which is streaming, requires a single pass over the data, can learn in a class-incremental manner, and can be evaluated on-the-fly (anytime inference). To accomplish these, we propose virtual gradients for continual representation learning to prevent catastrophic forgetting and leverage an exponential-moving-average-based semantic memory to further enhance performance. Extensive experiments on diverse datasets demonstrate our method's efficacy and superior performance over existing methods.

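Summary
Lifelong learning, also called continual learning, is the problem of training an AI agent continuously while preventing it from forgetting previously acquired knowledge. Existing methods mostly address lifelong learning in a static environment and cannot mitigate forgetting in a quickly changing dynamic environment. Streaming lifelong learning is a challenging setting that aims at continuous learning in a dynamic, non-stationary environment without forgetting. The authors propose a lifelong learning method that is streaming, needs only a single pass over the data, learns in a class-incremental manner, and can be evaluated on the fly (anytime inference). To achieve this, they use virtual gradients for continual representation learning to prevent catastrophic forgetting, and an exponential-moving-average-based semantic memory to further improve performance. Extensive experiments on diverse datasets demonstrate the method's efficacy and its superior performance over existing methods.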
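
The abstract does not say how the exponential-moving-average-based semantic memory is updated, so the following is only a generic sketch of an exponential moving average over model weights. The function name, the flat list of weights, and the decay value are all assumptions made for illustration.

```python
def ema_update(memory, weights, decay=0.99):
    """Blend current weights into a slowly changing copy (the memory).

    memory: list of floats holding the averaged weights so far.
    weights: list of floats holding the current (fast-changing) weights.
    decay: closer to 1.0 means the memory changes more slowly (assumed value).
    """
    return [decay * m + (1.0 - decay) * w for m, w in zip(memory, weights)]


memory = [1.0, 0.0]
current = [0.0, 1.0]
memory = ema_update(memory, current, decay=0.5)
print(memory)  # [0.5, 0.5]
```

Because the averaged copy moves slowly, it retains information from earlier data even while the current weights adapt to the newest samples in the stream.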

Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level

paper_url: http://arxiv.org/abs/2309.08182
repo_url: None
paper_authors: Jingzhe Ding, Yan Cen, Xinyuan Wei
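for: The paper shows that large language models (LLMs) pre-trained on text can solve physics word problems, not only pure math word problems.
methods: The authors use OpenAI's GPT3.5 with zero-shot and few-shot learning to solve physics word problems.
results: GPT3.5 automatically solves 49.3% of the problems with zero-shot learning and 73.2% with few-shot learning. It can also summarize the knowledge the problems cover and provide relevant explanations.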

Abstract
Our work demonstrates that large language models (LLMs) pre-trained on texts can not only solve pure math word problems, but also physics word problems, whose solution requires calculation and inference based on prior physical knowledge. We collect and annotate the first physics word problem dataset, PhysQA, which contains over 1000 junior high school physics word problems (covering Kinematics, Mass & Density, Mechanics, Heat, Electricity). Then we use OpenAI's GPT3.5 to generate the answers to these problems and find that GPT3.5 could automatically solve 49.3% of the problems through zero-shot learning and 73.2% through few-shot learning. This result demonstrates that by using similar problems and their answers as prompt, LLM could solve elementary physics word problems approaching human level performance. In addition to solving problems, GPT3.5 can also summarize the knowledge or topics covered by the problems, provide relevant explanations, and generate new physics word problems based on the input. Our work is the first research to focus on the automatic solving, explanation, and generation of physics word problems across various types and scenarios, and we achieve an acceptable and state-of-the-art accuracy. This underscores the potential of LLMs for further applications in secondary education.

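Summary
The work shows that large language models (LLMs) pre-trained on text can solve not only pure math word problems but also physics word problems, whose solution requires calculation and inference based on prior physical knowledge. The authors collect and annotate the first physics word problem dataset, PhysQA, which contains over 1000 junior high school physics word problems covering kinematics, mass and density, mechanics, heat, and electricity. Using OpenAI's GPT3.5 to generate answers, they find that it solves 49.3% of the problems with zero-shot learning and 73.2% with few-shot learning. This indicates that, with similar problems and their answers as the prompt, an LLM can solve elementary physics word problems at close to human-level performance. GPT3.5 can also summarize the knowledge and topics the problems cover, provide relevant explanations, and generate new physics word problems from the input. The authors describe the work as the first to focus on automatically solving, explaining, and generating physics word problems, with acceptable and state-of-the-art accuracy, which underscores the potential of LLMs in secondary education.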
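
The few-shot setting described here uses similar problems and their answers as the prompt. The sketch below shows one way such a prompt could be assembled. The template, the function name, and the example problems are hypothetical and are not taken from PhysQA or from the paper.

```python
def build_few_shot_prompt(examples, question):
    """Format solved (problem, answer) pairs, then the unsolved question."""
    parts = []
    for problem, answer in examples:
        parts.append(f"Problem: {problem}\nAnswer: {answer}\n")
    # The final block is left open so the model completes the answer.
    parts.append(f"Problem: {question}\nAnswer:")
    return "\n".join(parts)


# Hypothetical solved examples, chosen to resemble the target question.
examples = [
    ("A car travels 100 m in 20 s at constant speed. What is its speed?", "5 m/s"),
    ("A block has mass 200 g and volume 100 cm^3. What is its density?", "2 g/cm^3"),
]
prompt = build_few_shot_prompt(
    examples, "A runner covers 60 m in 12 s at constant speed. What is the speed?"
)
print(prompt)
```

Passing an empty list of examples yields the zero-shot prompt, which makes the two settings in the abstract easy to compare with the same code path.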

Unveiling Invariances via Neural Network Pruning

paper_url: http://arxiv.org/abs/2309.08171
repo_url: None
paper_authors: Derek Xu, Yizhou Sun, Wei Wang
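for: The paper aims to learn data-dependent invariances.
methods: Pruning is used to learn novel network architectures that capture data-dependent invariances.
results: The learned architectures outperform dense neural networks on vision and tabular datasets in both efficiency and effectiveness.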

Abstract
Invariance describes transformations that do not alter data's underlying semantics. Neural networks that preserve natural invariance capture good inductive biases and achieve superior performance. Hence, modern networks are handcrafted to handle well-known invariances (ex. translations). We propose a framework to learn novel network architectures that capture data-dependent invariances via pruning. Our learned architectures consistently outperform dense neural networks on both vision and tabular datasets in both efficiency and effectiveness. We demonstrate our framework on multiple deep learning models across 3 vision and 40 tabular datasets.

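Summary
Invariance describes transformations that leave the underlying semantics of data unchanged. Neural networks that preserve natural invariances capture good inductive biases and achieve better performance, so modern networks are usually hand-designed around well-known invariances. The authors propose a framework that learns network architectures capturing data-dependent invariances through pruning. The learned architectures perform well across multiple deep learning models on vision and tabular datasets, in both efficiency and effectiveness.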

To Predict or to Reject: Causal Effect Estimation with Uncertainty on Networked Data

paper_url: http://arxiv.org/abs/2309.08165
repo_url: None
paper_authors: Hechuan Wen, Tong Chen, Li Kheng Chai, Shazia Sadiq, Kai Zheng, Hongzhi Yin
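for: The paper proposes an uncertainty-aware framework for imbalanced networked observational data, so that individual-level treatment effects can be estimated more reliably.
methods: The graph deep kernel learning (GraphDKL) framework, with a Lipschitz constraint, models prediction uncertainty with a Gaussian process and identifies unreliable estimates.
results: Extensive experiments demonstrate the method's superiority in uncertainty-aware causal effect estimation on networked data.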

Abstract
Due to the imbalanced nature of networked observational data, the causal effect predictions for some individuals can severely violate the positivity/overlap assumption, rendering unreliable estimations. Nevertheless, this potential risk of individual-level treatment effect estimation on networked data has been largely under-explored. To create a more trustworthy causal effect estimator, we propose the uncertainty-aware graph deep kernel learning (GraphDKL) framework with Lipschitz constraint to model the prediction uncertainty with Gaussian process and identify unreliable estimations. To the best of our knowledge, GraphDKL is the first framework to tackle the violation of positivity assumption when performing causal effect estimation with graphs. With extensive experiments, we demonstrate the superiority of our proposed method in uncertainty-aware causal effect estimation on networked data.

Investigating the Applicability of Self-Assessment Tests for Personality Measurement of Large Language Models

paper_url: http://arxiv.org/abs/2309.08163
repo_url: None
paper_authors: Akshat Gupta, Xiaoyang Song, Gopala Anumanchipalli
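for: The paper examines whether personality self-assessment tests, created to study human behavior, are applicable to measuring the "personality" of large language models (LLMs).
methods: Prompts from three earlier studies on LLM personality measurement are used to measure the personality of the same LLM, and the order of the answer options is varied to test option-order symmetry.
results: The three prompts yield very different personality scores, and the answers are not robust to option order, indicating that self-assessment tests designed for humans are not appropriate for LLMs.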

Abstract
As large language models (LLM) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of "personality" of LLMs using personality self-assessment tests. In this paper, we take three such studies on personality measurement of LLMs that use personality self-assessment tests created to study human behavior. We use the prompts used in these three different papers to measure the personality of the same LLM. We find that all three prompts lead very different personality scores. This simple test reveals that personality self-assessment scores in LLMs depend on the subjective choice of the prompter. Since we don't know the ground truth value of personality scores for LLMs as there is no correct answer to such questions, there's no way of claiming if one prompt is more or less correct than the other. We then introduce the property of option order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple choice question (MCQ) questions, we argue that the scores should also be robust to not just the prompt template but also the order in which the options are presented. This test unsurprisingly reveals that the answers to the self-assessment tests are not robust to the order of the options. These simple tests, done on ChatGPT and Llama2 models show that self-assessment personality tests created for humans are not appropriate for measuring personality in LLMs.

Summary
As large language models (LLMs) grow more capable, recent studies have tried to quantify their behavior with psychological tools created to study human behavior, for example by measuring LLM "personality" with personality self-assessment tests. The paper takes the prompts from three such studies and uses them to measure the personality of the same LLM. The three prompts lead to very different personality scores, so the scores depend on the subjective choice of the prompter. Because there is no ground-truth personality score for an LLM, no prompt can be claimed to be more correct than another. The authors then introduce the property of option-order symmetry: since most self-assessment tests are multiple choice questions (MCQ), scores should be robust not only to the prompt template but also to the order in which the options are presented. The answers turn out not to be robust to option order. These tests, run on ChatGPT and Llama2 models, show that self-assessment personality tests created for humans are not appropriate for measuring personality in LLMs.
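
The option-order symmetry property can be checked mechanically: present the same question with every ordering of its options and see whether the selected option text stays the same. The sketch below assumes a callable `answer_fn(question, options)` that returns the index of the chosen option. That callable stands in for an LLM query and is hypothetical, as are the two toy responders.

```python
from itertools import permutations


def is_option_order_symmetric(answer_fn, question, options):
    """Return True if the same option text is chosen for every ordering."""
    chosen = set()
    for ordering in permutations(options):
        index = answer_fn(question, list(ordering))
        chosen.add(ordering[index])
    return len(chosen) == 1


def position_biased_model(question, options):
    """Toy responder that always picks the first option shown."""
    return 0


def content_based_model(question, options):
    """Toy responder that picks 'Agree' wherever it appears."""
    return options.index("Agree")


options = ["Disagree", "Neutral", "Agree"]
question = "I see myself as someone who is talkative."
print(is_option_order_symmetric(position_biased_model, question, options))
print(is_option_order_symmetric(content_based_model, question, options))
```

Enumerating all permutations is feasible for the short option lists typical of self-assessment items; for longer lists, sampling a subset of orderings would be a reasonable substitute.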

paper_url: http://arxiv.org/abs/2309.08138
repo_url: None
paper_authors: Hongcheng Wang, Andy Guan Hong Chen, Xiaoqi Li, Mingdong Wu, Hao Dong
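for: Improve the accuracy and efficiency of agents in Visual Object Navigation (VON), especially in real-world settings where users may not know which objects are in the scene or may specify an object that is not present.
methods: The proposed Demand-driven Navigation (DDN) uses the user's demand as the task instruction and has the agent find an object that satisfies it. Textual attribute features of objects are extracted from the common knowledge of a large language model and aligned with visual attribute features using Contrastive Language-Image Pre-training (CLIP). The visual attribute features then serve as prior knowledge for navigation.
results: Experiments on AI2Thor show that DDN improves the agent's navigation performance and outperforms commonly used VON methods.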

Abstract
The task of Visual Object Navigation (VON) involves an agent's ability to locate a particular object within a given scene. In order to successfully accomplish the VON task, two essential conditions must be fulfilled:1) the user must know the name of the desired object; and 2) the user-specified object must actually be present within the scene. To meet these conditions, a simulator can incorporate pre-defined object names and positions into the metadata of the scene. However, in real-world scenarios, it is often challenging to ensure that these conditions are always met. Human in an unfamiliar environment may not know which objects are present in the scene, or they may mistakenly specify an object that is not actually present. Nevertheless, despite these challenges, human may still have a demand for an object, which could potentially be fulfilled by other objects present within the scene in an equivalent manner. Hence, we propose Demand-driven Navigation (DDN), which leverages the user's demand as the task instruction and prompts the agent to find the object matches the specified demand. DDN aims to relax the stringent conditions of VON by focusing on fulfilling the user's demand rather than relying solely on predefined object categories or names. We propose a method first acquire textual attribute features of objects by extracting common knowledge from a large language model. These textual attribute features are subsequently aligned with visual attribute features using Contrastive Language-Image Pre-training (CLIP). By incorporating the visual attribute features as prior knowledge, we enhance the navigation process. Experiments on AI2Thor with the ProcThor dataset demonstrate the visual attribute features improve the agent's navigation performance and outperform the baseline methods commonly used in VON.

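Summary
Visual Object Navigation (VON) requires an agent to locate a particular object within a given scene. Two conditions must hold for the task to succeed: 1) the user must know the name of the desired object; and 2) the specified object must actually be present in the scene. A simulator can satisfy these conditions by adding pre-defined object names and positions to the scene metadata, but in real-world scenarios they are hard to guarantee. People in an unfamiliar environment may not know which objects are present, or may specify an object that is not there. They may still have a demand that other objects in the scene could fulfil equally well. The authors therefore propose Demand-driven Navigation (DDN), which takes the user's demand as the task instruction and has the agent find an object that matches it. DDN relaxes the strict conditions of VON by focusing on fulfilling the user's demand. The proposed method first extracts textual attribute features of objects from a large language model and then aligns them with visual attribute features using Contrastive Language-Image Pre-training (CLIP). Using the visual attribute features as prior knowledge improves the navigation process. Experiments on AI2Thor with the ProcThor dataset show that the visual attribute features improve the agent's navigation performance and outperform the baseline methods commonly used in VON.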
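
CLIP-style alignment compares text and image features by cosine similarity in a shared space. As a rough illustration of matching a demand to the objects in view, the sketch below picks the object whose feature vector is most similar to a demand feature vector. The vectors, object names, and function names are made up for the example; the paper's actual feature extraction and policy are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(demand_feature, object_features):
    """Name of the object whose feature is most similar to the demand."""
    return max(
        object_features,
        key=lambda name: cosine_similarity(demand_feature, object_features[name]),
    )


# Hypothetical 3-dimensional attribute features.
demand_feature = [1.0, 0.0, 1.0]
object_features = {"cup": [0.9, 0.1, 0.8], "lamp": [0.0, 1.0, 0.1]}
print(best_match(demand_feature, object_features))  # cup
```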

“I’m Not Confident in Debiasing AI Systems Since I Know Too Little”: Teaching AI Creators About Gender Bias Through Hands-on Tutorials

paper_url: http://arxiv.org/abs/2309.08121
repo_url: None
paper_authors: Kyrie Zhixuan Zhou, Jiaxun Cao, Xiaowen Yuan, Daniel E. Weissglass, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Xin Tong
for: Help AI creators understand and mitigate gender bias in AI systems, improving user experience and reducing injustices and mental harm to women.
methods: Practice-oriented hands-on tutorials raise AI creators' awareness of gender bias in AI and enhance their knowledge of sources of gender bias and debiasing techniques.
results: The tutorials were evaluated with 18 AI creators, including AI researchers, AI industrial practitioners, and students who had learned AI. Their improved awareness and knowledge demonstrated the effectiveness of the tutorials.

Abstract
Gender bias is rampant in AI systems, causing bad user experience, injustices, and mental harm to women. School curricula fail to educate AI creators on this topic, leaving them unprepared to mitigate gender bias in AI. In this paper, we designed hands-on tutorials to raise AI creators' awareness of gender bias in AI and enhance their knowledge of sources of gender bias and debiasing techniques. The tutorials were evaluated with 18 AI creators, including AI researchers, AI industrial practitioners (i.e., developers and product managers), and students who had learned AI. Their improved awareness and knowledge demonstrated the effectiveness of our tutorials, which have the potential to complement the insufficient AI gender bias education in CS/AI courses. Based on the findings, we synthesize design implications and a rubric to guide future research, education, and design efforts.

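Summary
Gender bias is rampant in AI systems, causing bad user experience, injustices, and mental harm to women. School curricula do not educate AI creators on the topic, leaving them unprepared to mitigate gender bias in AI. The authors designed hands-on tutorials to raise AI creators' awareness of gender bias in AI and to improve their knowledge of sources of bias and debiasing techniques. The tutorials were evaluated with 18 AI creators, including AI researchers, AI industrial practitioners (developers and product managers), and students who had learned AI. Their improved awareness and knowledge demonstrated the tutorials' effectiveness, and the tutorials could complement the insufficient gender bias education in CS/AI courses. Based on the findings, the authors synthesize design implications and a rubric to guide future research, education, and design efforts.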

Data-Driven Goal Recognition in Transhumeral Prostheses Using Process Mining Techniques

paper_url: http://arxiv.org/abs/2309.08106
repo_url: None
paper_authors: Zihang Su, Tianshi Yu, Nir Lipovetzky, Alireza Mohammadi, Denny Oetomo, Artem Polyvyanyy, Sebastian Sardina, Ying Tan, Nick van Beest
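for: The paper studies how to use time series data to recognize the target poses (goals) of transhumeral prosthesis users.
methods: Time series data from surface electromyography electrodes and kinematic sensors are transformed into discrete events, which are then used to train an existing process mining-based goal recognition system.
results: On data collected in a virtual reality setting, the approach achieves significantly better precision and recall than state-of-the-art machine learning techniques and is less confident when wrong, which helps approximate smoother prosthesis movements.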

Abstract
A transhumeral prosthesis restores missing anatomical segments below the shoulder, including the hand. Active prostheses utilize real-valued, continuous sensor data to recognize patient target poses, or goals, and proactively move the artificial limb. Previous studies have examined how well the data collected in stationary poses, without considering the time steps, can help discriminate the goals. In this case study paper, we focus on using time series data from surface electromyography electrodes and kinematic sensors to sequentially recognize patients' goals. Our approach involves transforming the data into discrete events and training an existing process mining-based goal recognition system. Results from data collected in a virtual reality setting with ten subjects demonstrate the effectiveness of our proposed goal recognition approach, which achieves significantly better precision and recall than the state-of-the-art machine learning techniques and is less confident when wrong, which is beneficial when approximating smoother movements of prostheses.

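Summary
A transhumeral prosthesis restores the missing anatomical segments below the shoulder, including the hand. Active prostheses use real-valued, continuous sensor data to recognize the patient's target poses, or goals, and proactively move the artificial limb. Earlier studies examined how well data collected in stationary poses, without considering time steps, can discriminate between goals. This case study instead uses time series data from surface electromyography electrodes and kinematic sensors to recognize patients' goals sequentially. The approach transforms the data into discrete events and trains an existing process mining-based goal recognition system. Data collected from ten subjects in a virtual reality setting show that the approach achieves significantly better precision and recall than state-of-the-art machine learning techniques and is less confident when wrong, which is beneficial when approximating smoother prosthesis movements.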
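
Process mining operates on sequences of discrete events, so continuous sensor readings have to be discretized first. The abstract does not describe how the authors do this. The sketch below shows one simple possibility, threshold binning with consecutive repeats collapsed, and the thresholds, labels, and signal values are hypothetical.

```python
def to_events(samples, thresholds, labels):
    """Map each sample to the label of its bin and drop consecutive repeats.

    thresholds must be sorted ascending; labels has len(thresholds) + 1 entries.
    """
    events = []
    for value in samples:
        # The bin index is the number of thresholds the value reaches.
        bin_index = sum(value >= t for t in thresholds)
        label = labels[bin_index]
        if not events or events[-1] != label:
            events.append(label)
    return events


# Hypothetical normalized muscle-activation signal.
signal = [0.05, 0.10, 0.45, 0.50, 0.90, 0.85, 0.20]
events = to_events(signal, thresholds=[0.3, 0.7], labels=["low", "mid", "high"])
print(events)  # ['low', 'mid', 'high', 'low']
```

Collapsing repeats turns the signal into a trace of state changes, which is the kind of event log a process mining tool expects.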

Research on Joint Representation Learning Methods for Entity Neighborhood Information and Description Information

paper_url: http://arxiv.org/abs/2309.08100
repo_url: None
paper_authors: Le Xiao, Xin Shan, Yuhua Wang, Miaolei Deng
for: Address the issue of poor embedding performance in the knowledge graph of a programming design course.
methods: A joint representation learning model combines entity neighborhood information and description information.
results: Experiments show that the proposed model achieves favorable performance on the knowledge graph dataset of the programming design course, outperforming other baseline models.

Abstract
To address the issue of poor embedding performance in the knowledge graph of a programming design course, a joint representation learning model that combines entity neighborhood information and description information is proposed. Firstly, a graph attention network is employed to obtain the features of entity neighboring nodes, incorporating relationship features to enrich the structural information. Next, the BERT-WWM model is utilized in conjunction with attention mechanisms to obtain the representation of entity description information. Finally, the final entity vector representation is obtained by combining the vector representations of entity neighborhood information and description information. Experimental results demonstrate that the proposed model achieves favorable performance on the knowledge graph dataset of the programming design course, outperforming other baseline models.

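Summary
To address poor embedding performance in the knowledge graph of a programming design course, the authors propose a joint representation learning model that combines entity neighborhood information and description information. First, a graph attention network obtains the features of an entity's neighboring nodes and incorporates relationship features to enrich the structural information. Next, the BERT-WWM model is used together with attention mechanisms to obtain the representation of the entity description. Finally, the entity vector representation is obtained by combining the vector representations of the neighborhood information and the description information. Experiments show that the model performs well on the programming design course knowledge graph dataset and outperforms other baseline models.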

Fast and Accurate Deep Loop Closing and Relocalization for Reliable LiDAR SLAM

paper_url: http://arxiv.org/abs/2309.08086
repo_url: None
paper_authors: Chenghao Shi, Xieyuanli Chen, Junhao Xiao, Bin Dai, Huimin Lu
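for: The article proposes loop closing and relocalization techniques for reliable and robust long-term SLAM, addressing pose estimation drift and degeneration.
methods: A multi-head network, LCR-Net, handles loop closing and relocalization together. It uses novel feature extraction and a pose-aware attention mechanism to estimate similarities and 6-DoF poses accurately.
results: LCR-Net excels in all three evaluation setups, surpasses state-of-the-art methods, and generalizes well. It does not need a time-consuming robust pose estimator, which makes it suitable for online SLAM applications. To the authors' knowledge, integrating LCR-Net yields the first LiDAR SLAM with deep loop closing and relocalization.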

Abstract
Loop closing and relocalization are crucial techniques to establish reliable and robust long-term SLAM by addressing pose estimation drift and degeneration. This article begins by formulating loop closing and relocalization within a unified framework. Then, we propose a novel multi-head network LCR-Net to tackle both tasks effectively. It exploits novel feature extraction and pose-aware attention mechanism to precisely estimate similarities and 6-DoF poses between pairs of LiDAR scans. In the end, we integrate our LCR-Net into a SLAM system and achieve robust and accurate online LiDAR SLAM in outdoor driving environments. We thoroughly evaluate our LCR-Net through three setups derived from loop closing and relocalization, including candidate retrieval, closed-loop point cloud registration, and continuous relocalization using multiple datasets. The results demonstrate that LCR-Net excels in all three tasks, surpassing the state-of-the-art methods and exhibiting a remarkable generalization ability. Notably, our LCR-Net outperforms baseline methods without using a time-consuming robust pose estimator, rendering it suitable for online SLAM applications. To our best knowledge, the integration of LCR-Net yields the first LiDAR SLAM with the capability of deep loop closing and relocalization. The implementation of our methods will be made open-source.

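Summary
Loop closing and relocalization are crucial techniques for reliable, robust long-term SLAM because they address pose estimation drift and degeneration. The article first formulates loop closing and relocalization within a unified framework and then proposes a multi-head network, LCR-Net, that handles both tasks. LCR-Net uses novel feature extraction and a pose-aware attention mechanism to estimate similarities and 6-DoF poses between pairs of LiDAR scans. It is integrated into a SLAM system, giving robust and accurate online LiDAR SLAM in outdoor driving environments. LCR-Net is evaluated on multiple datasets in three setups: candidate retrieval, closed-loop point cloud registration, and continuous relocalization. It excels in all three, surpasses state-of-the-art methods, and generalizes well. Notably, it outperforms the baselines without a time-consuming robust pose estimator, which makes it suitable for online SLAM applications. To the authors' knowledge, integrating LCR-Net yields the first LiDAR SLAM with deep loop closing and relocalization. The implementation will be made open-source.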

A Stochastic Online Forecast-and-Optimize Framework for Real-Time Energy Dispatch in Virtual Power Plants under Uncertainty

paper_url: http://arxiv.org/abs/2309.08642
repo_url: None
paper_authors: Wei Jiang, Zhongkai Yi, Li Wang, Hanwei Zhang, Jihai Zhang, Fangquan Lin, Cheng Yang
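for: The paper aims to improve the aggregation and management of distributed energy resources under uncertainty, in particular the fluctuation of renewable energy generation.
methods: The paper proposes a real-time uncertainty-aware energy dispatch framework with two key elements: (i) a hybrid forecast-and-optimize sequential task that integrates deep learning-based forecasting and stochastic optimization, with the two stages connected by uncertainty estimation at multiple temporal resolutions; (ii) an efficient online data augmentation scheme that jointly involves model pre-training and online fine-tuning.
results: The framework won the CityLearn Challenge 2022, and experiments evaluate its effectiveness in a real-life application.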

Abstract
Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular caused by the fluctuation of renewable energy generation. This issue has driven the necessity of widely exploiting advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dispatch framework, which is composed of two key elements: (i) A hybrid forecast-and-optimize sequential task, integrating deep learning-based forecasting and stochastic optimization, where these two stages are connected by the uncertainty estimation at multiple temporal resolutions; (ii) An efficient online data augmentation scheme, jointly involving model pre-training and online fine-tuning stages. In this way, the proposed framework is capable to rapidly adapt to the real-time data distribution, as well as to target on uncertainties caused by data drift, model discrepancy and environment perturbations in the control process, and finally to realize an optimal and robust dispatch solution. The proposed framework won the championship in CityLearn Challenge 2022, which provided an influential opportunity to investigate the potential of AI application in the energy domain. In addition, comprehensive experiments are conducted to interpret its effectiveness in the real-life scenario of smart building energy management.

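Summary
Aggregating distributed energy resources in power systems significantly increases uncertainty, especially because renewable energy generation fluctuates. This makes it necessary to apply advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. The paper proposes a real-time uncertainty-aware energy dispatch framework with two key elements: (i) a hybrid forecast-and-optimize sequential task that integrates deep learning-based forecasting and stochastic optimization, connected by uncertainty estimation at multiple temporal resolutions; and (ii) an efficient online data augmentation scheme that involves model pre-training and online fine-tuning. The framework can adapt rapidly to the real-time data distribution, target uncertainties caused by data drift, model discrepancy, and environment perturbations in the control process, and reach an optimal and robust dispatch solution. It won the CityLearn Challenge 2022, which offered an influential opportunity to investigate the potential of AI in the energy domain. Comprehensive experiments further examine its effectiveness in the real-life scenario of smart building energy management.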

2023-09-15

cs.CL

cs.CL - 2023-09-15

An Empirical Study on Instance Selection Strategies in Self-training for Sentiment Analysis

paper_url: http://arxiv.org/abs/2309.08777
repo_url: None
paper_authors: Haochen Liu, Sai Krishna Rallabandi, Yijing Wu, Parag Pravin Dakle, Preethi Raghavan
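for: The study investigates how instance selection strategies and hyper-parameters influence the performance of self-training in various few-shot settings for sentiment analysis.
methods: The study applies self-training and runs an empirical comparison of different instance selection strategies and hyper-parameters.
results: The instance selection strategy and hyper-parameters strongly affect self-training performance, and the best-performing strategy and hyper-parameters differ across few-shot settings.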

Abstract
Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a larger amount of unlabeled data. However, the performance of a self-training procedure heavily relies on the choice of the instance selection strategy, which has not been studied thoroughly. This paper presents an empirical study on various instance selection strategies for self-training on two public sentiment datasets, and investigates the influence of the strategy and hyper-parameters on the performance of self-training in various few-shot settings.

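Summary
Sentiment analysis is an important task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training is an economical and efficient technique for developing sentiment analysis models from a small amount of labeled data and a larger amount of unlabeled data. However, its performance depends heavily on the instance selection strategy, which has not been studied thoroughly. The paper presents an empirical study of instance selection strategies for self-training on two public sentiment datasets and investigates how the strategy and hyper-parameters affect performance in various few-shot settings.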
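
In self-training, a model labels unlabeled text and a selection strategy decides which pseudo-labeled instances are added to the training set. The abstract does not name the strategies the paper compares, so the sketch below only illustrates two common ones, a confidence threshold and top-k selection. The function names, data, and default threshold are assumptions.

```python
def select_by_threshold(predictions, threshold=0.9):
    """Keep pseudo-labeled instances whose confidence meets the threshold."""
    return [(text, label) for text, label, conf in predictions if conf >= threshold]


def select_top_k(predictions, k):
    """Keep the k most confident pseudo-labeled instances."""
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return [(text, label) for text, label, _ in ranked[:k]]


# Hypothetical model outputs: (text, predicted label, confidence).
predictions = [
    ("great movie", "positive", 0.97),
    ("not sure about this", "negative", 0.55),
    ("terrible service", "negative", 0.93),
]
print(select_by_threshold(predictions))
print(select_top_k(predictions, k=1))
```

A threshold adds a variable number of instances per round, while top-k adds a fixed number. Both the threshold and k are the kind of hyper-parameters whose influence an empirical study would need to measure.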

Generating Semantic Graph Corpora with Graph Expansion Grammar

paper_url: http://arxiv.org/abs/2309.08714
repo_url: None
paper_authors: Eric Andersson, Johanna Björklund, Frank Drewes, Anna Jonsson
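for: Creating corpora of semantic graphs.
methods: The system uses graph expansion grammar as its representational language, so users control the generated graph set by defining a grammar.
results: The system generates graph banks that are well-formed with respect to the grammar, for augmenting existing corpora and for teaching formal language theory.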

Abstract
We introduce Lovelace, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as a representational language, thus allowing users to craft a grammar that describes a corpus with desired properties. When given such a grammar as input, the system generates a set of output graphs that are well-formed according to the grammar, i.e., a graph bank. The generation process can be controlled via a number of configurable parameters that allow the user to, for example, specify a range of desired output graph sizes. Central use cases are the creation of synthetic data to augment existing corpora, and use as a pedagogical tool for teaching formal language theory.

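Summary
We introduce Lovelace, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as its representational language, so users can craft a grammar that describes a corpus with the desired properties. Given such a grammar as input, the system generates a set of graphs that are well-formed according to the grammar, i.e., a graph bank. Generation can be controlled through configurable parameters, for example to specify a range of output graph sizes. The main use cases are creating synthetic data to augment existing corpora and teaching formal language theory.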

Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning

paper_url: http://arxiv.org/abs/2309.08708
repo_url: https://github.com/mlsw/dynamic-embedding-pruning
paper_authors: Miles Williams, Nikolaos Aletras
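for: This paper aims to reduce the memory footprint of pre-trained language models (PLMs) so that they can be deployed in memory-constrained settings such as cloud environments or on-device.
methods: PLMs use embedding matrices to represent extensive vocabularies, and these matrices form a large proportion of the model parameters. Prior work has pruned parameters within the transformer layers, but pruning the embedding matrix during fine-tuning or inference had not been explored.
results: The authors first show that a significant proportion of the vocabulary goes unused in these scenarios. They then propose a simple yet effective approach that uses this finding to shrink the embedding matrix. The approach gives substantial memory reductions across a wide range of models and tasks while maintaining downstream task performance and making more efficient use of compute resources.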

Abstract
The extensive memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings, such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, forming a large proportion of the model parameters. While previous work towards parameter-efficient PLM development has considered pruning parameters within the transformer layers, pruning the embedding matrix as part of fine-tuning or inference has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix. We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach maintains equivalent downstream task performance while allowing a more efficient use of compute resources.

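Summary
The large memory footprint of PLMs can hinder deployment in memory-constrained settings such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, and these matrices account for a large proportion of the model parameters. Previous work on parameter-efficient PLMs has considered pruning parameters within the transformer layers, but pruning the embedding matrix during fine-tuning or inference has not yet been explored. We first show that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that uses this finding to reduce the memory footprint of the embedding matrix. The approach provides substantial memory reductions across a wide range of models and tasks while maintaining equivalent downstream task performance and making more efficient use of compute resources.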
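
The core observation, that only part of the vocabulary occurs in a given fine-tuning or inference dataset, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). The function names and the toy data are assumptions, and the embedding matrix is a plain list of rows to keep the example dependency-free.

```python
from typing import Dict, List, Sequence, Tuple


def prune_embeddings(
    embedding: Sequence[Sequence[float]],
    dataset_token_ids: Sequence[Sequence[int]],
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the embedding rows whose token ids occur in the dataset."""
    used_ids = sorted({tok for seq in dataset_token_ids for tok in seq})
    # Map each surviving original id to its row index in the smaller matrix.
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    pruned = [list(embedding[old]) for old in used_ids]
    return pruned, old_to_new


def remap(seq: Sequence[int], old_to_new: Dict[int, int]) -> List[int]:
    """Rewrite token ids so they index into the pruned matrix."""
    return [old_to_new[tok] for tok in seq]


# Toy vocabulary of 6 tokens with 2-dimensional embeddings (made-up values).
full_embedding = [[float(i), float(i) * 10.0] for i in range(6)]
dataset = [[0, 3, 3], [5, 0]]

pruned_embedding, id_map = prune_embeddings(full_embedding, dataset)
remapped = [remap(seq, id_map) for seq in dataset]
```

Here only three of the six rows survive, and every remapped id looks up the same vector it did before pruning, which is why downstream behavior on this dataset is unchanged.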

Sparse Autoencoders Find Highly Interpretable Features in Language Models

paper_url: http://arxiv.org/abs/2309.08600
repo_url: https://github.com/hoagyc/sparse_coding
paper_authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
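for: This study addresses polysemanticity inside neural networks in order to improve understanding of how they work internally.
methods: Sparse autoencoders are used to reconstruct the internal activations of a language model, yielding a set of features that is more interpretable and monosemantic.
results: Sparse autoencoders can resolve superposition in language models; the learned features are more interpretable than those from alternative approaches and allow precise model editing.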

Abstract
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Ablating these features enables precise model editing, for example, by removing capabilities such as pronoun prediction, while disrupting model behaviour less than prior techniques. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

Summary
To address polysemanticity, we use sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by other approaches. By ablating these features, we can precisely edit the model, for example by removing capabilities such as pronoun prediction, while disrupting the model's behavior less than prior techniques. This work demonstrates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our approach may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
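
A sparse autoencoder objective can be sketched as a reconstruction term plus an L1 penalty on the feature activations. The version below is a minimal illustration: the tied encoder/decoder weights, the ReLU encoder, the names, and the toy dictionary are assumptions for this example rather than details confirmed by the abstract.

```python
from typing import List, Sequence


def encode(x: Sequence[float], dictionary: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """ReLU(D x + b): one non-negative activation per dictionary feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(dictionary, bias)
    ]


def decode(features: Sequence[float], dictionary: Sequence[Sequence[float]]) -> List[float]:
    """x_hat = D^T f: weighted sum of dictionary directions (tied weights)."""
    dim = len(dictionary[0])
    return [
        sum(f * row[j] for f, row in zip(features, dictionary))
        for j in range(dim)
    ]


def sae_loss(x: Sequence[float], dictionary: Sequence[Sequence[float]],
             bias: Sequence[float], l1_coeff: float) -> float:
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    features = encode(x, dictionary, bias)
    x_hat = decode(features, dictionary)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = l1_coeff * sum(abs(f) for f in features)
    return reconstruction + sparsity


# Overcomplete toy dictionary: 3 feature directions in a 2-dimensional activation space.
toy_dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_bias = [0.0, 0.0, 0.0]
activation = [2.0, 0.0]

toy_features = encode(activation, toy_dictionary, toy_bias)
toy_loss = sae_loss(activation, toy_dictionary, toy_bias, l1_coeff=0.1)
```

In this toy case a single feature fires and the activation is reconstructed exactly, so the loss consists only of the sparsity term. Training would adjust the dictionary and bias by gradient descent, which is omitted here.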

“Merge Conflicts!” Exploring the Impacts of External Distractors to Parametric Knowledge Graphs

paper_url: http://arxiv.org/abs/2309.08594
repo_url: https://github.com/qiancheng0/ekd_impacts_pkg
paper_authors: Cheng Qian, Xinran Zhao, Sherry Tongshuang Wu
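for: This paper examines how large language models (LLMs) handle external knowledge during interactions with users.
methods: The authors propose a framework that systematically explores the interaction between an LLM's parametric knowledge and external knowledge. They construct a parametric knowledge graph to reveal the different knowledge structures of LLMs and introduce external knowledge through distractors of varying methods, positions, and formats.
results: Experiments show that LLMs tend to deviate from their parametric knowledge when they encounter direct conflicts or confounding changes of information. Although LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when current LLMs integrate external knowledge during interactions. All data and results are publicly available.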

Abstract
Large language models (LLMs) acquire extensive knowledge during pre-training, known as their parametric knowledge. However, in order to remain up-to-date and align with human instructions, LLMs inevitably require external knowledge during their interactions with users. This raises a crucial question: How will LLMs respond when external knowledge interferes with their parametric knowledge? To investigate this question, we propose a framework that systematically elicits LLM parametric knowledge and introduces external knowledge. Specifically, we uncover the impacts by constructing a parametric knowledge graph to reveal the different knowledge structures of LLMs, and introduce external knowledge through distractors of varying degrees, methods, positions, and formats. Our experiments on both black-box and open-source models demonstrate that LLMs tend to produce responses that deviate from their parametric knowledge, particularly when they encounter direct conflicts or confounding changes of information within detailed contexts. We also find that while LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when integrating external knowledge, even indirectly, during interactions with current LLMs. All the data and results are publicly available.

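Summary
We construct a parametric knowledge graph to reveal the different knowledge structures of LLMs, and introduce external knowledge of varying degrees to see how the models respond in different situations. Our experiments show that LLMs tend to produce responses that deviate from their parametric knowledge when they encounter direct conflicts or confounding changes of information. We also find that although LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These results point to a risk of hallucination when current LLMs integrate external knowledge. All data and results are publicly available.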

Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings

paper_url: http://arxiv.org/abs/2309.08591
repo_url: https://github.com/UKPLab/maps
paper_authors: Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, Iryna Gurevych
for: This paper investigates whether multilingual large language models (mLLMs) can reason with proverbs and sayings in a conversational context, and how well they understand these cultural references.
methods: The authors use a variety of state-of-the-art mLLMs to test the models’ ability to reason with proverbs and sayings, and they create a new evaluation dataset called MAPS (MulticultrAl Proverbs and Sayings) for six different languages.
results: The authors find that mLLMs have limited knowledge of proverbs and struggle to reason with figurative proverbs and sayings, and there is a “culture gap” in mLLMs when reasoning about proverbs and sayings translated from other languages.

Abstract
Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in situational context, human expectations vary depending on the relevant cultural common ground. As human languages are associated with diverse cultures, LLMs should also be culturally-diverse reasoners. In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs 'know' only a limited number of proverbs, and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of being asked to select the correct answer); and (3) there is a "culture gap" in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticultrAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages.

Summary
1. mLLMs have limited knowledge of proverbs, and memorizing proverbs does not mean understanding them in a conversational context.
2. mLLMs struggle to reason with figurative proverbs and sayings, and also when asked to select the wrong answer rather than the correct one.
3. There is a “culture gap” in mLLMs when reasoning about proverbs and sayings translated from other languages.
To address these challenges, we have created and released an evaluation dataset called MAPS (MulticultrAl Proverbs and Sayings) for proverb understanding in six different languages. This dataset will help researchers better understand the limitations and potential of mLLMs when reasoning across cultures.

Neural Machine Translation Models Can Learn to be Few-shot Learners

paper_url: http://arxiv.org/abs/2309.08590
repo_url: None
paper_authors: Raphael Reinauer, Patrick Simianer, Kaden Uhlig, Johannes E. M. Mosig, Joern Wuebker
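for: This paper examines the ability of large language models to learn quickly in new domains and tasks, and shows how a specialized training objective lets a much smaller model do the same.
methods: A carefully designed training objective is used to fine-tune the model for domain adaptation from a few examples.
results: The approach achieves high translation quality and a high immediate adaptation rate, and allows efficient batch inference on a mix of domains.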

Abstract
Large Language Models show an emergent ability to use a small number of examples to learn to perform in novel domains and tasks, also called in-context learning (ICL). In this work, we show that a much smaller model can be trained to perform ICL by fine-tuning towards a specialized training objective, exemplified on the task of domain adaptation for neural machine translation. With this capacity for ICL, the model can take advantage of relevant few-shot examples to adapt its output towards the domain. We compare the quality of this domain adaptation to traditional supervised techniques and ICL with a 40B-parameter Large Language Model. Our approach allows efficient batch inference on a mix of domains and outperforms state-of-the-art baselines in terms of both translation quality and immediate adaptation rate, i.e. the ability to reproduce a specific term after being shown a single example.

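Summary
The emergent ability of large language models to learn to perform in novel domains and tasks from a small number of examples is called in-context learning (ICL). In this work, we show that a much smaller model can be trained to perform ICL by fine-tuning towards a specialized training objective, exemplified on domain adaptation for neural machine translation. With this capacity for ICL, the model can use relevant few-shot examples to adapt its output to the domain. We compare against traditional supervised techniques and against ICL with a 40B-parameter large language model. Our approach allows efficient batch inference on a mix of domains and outperforms current baselines in both translation quality and immediate adaptation rate, i.e., the ability to reproduce a specific term after being shown a single example.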
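
The immediate adaptation rate is described only informally, as the ability to reproduce a specific term after one example. One simple way to operationalize it, shown below, is the fraction of outputs that contain the target term. The case-insensitive substring match, the function name, and the toy cases are assumptions for illustration; the paper's exact matching rule is not given in the abstract.

```python
from typing import Sequence, Tuple


def immediate_adaptation_rate(cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (target_term, model_output) pairs whose output contains the term.

    Each output is assumed to have been produced after the model was shown
    a single example that uses the target term.
    """
    if not cases:
        return 0.0
    hits = sum(1 for term, output in cases if term.lower() in output.lower())
    return hits / len(cases)


# Made-up cases: (term from the single shown example, model translation).
toy_cases = [
    ("torque wrench", "Tighten the bolt with a torque wrench."),
    ("gasket", "Replace the seal."),
    ("firmware", "Update the Firmware before use."),
    ("heat sink", "Attach the cooler."),
]

rate = immediate_adaptation_rate(toy_cases)
```

Substring matching is a crude proxy, since it ignores inflection and can match inside longer words, so a real evaluation would likely need a stricter term-matching rule.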

ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer

paper_url: http://arxiv.org/abs/2309.08583
repo_url: https://github.com/asaakyan/explain-st
paper_authors: Arkadiy Saakyan, Smaranda Muresan
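for: This paper proposes a framework that augments and improves a formality style transfer dataset with explanations, using model distillation from ChatGPT and refining the generated explanations with expert human feedback.
methods: Explanations are distilled from ChatGPT, and expert feedback is incorporated through ICLEF (In-Context Learning from Expert Feedback).
results: Current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, while fine-tuning on the authors' high-quality dataset yields significant improvements. In human evaluation, models much smaller than ChatGPT that are fine-tuned on the dataset align better with expert preferences. The paper also discusses two applications of models fine-tuned on explainable style transfer: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.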

Abstract
While state-of-the-art language models excel at the style transfer task, current work does not address explainability of style transfer systems. Explanations could be generated using large language models such as GPT-3.5 and GPT-4, but the use of such complex systems is inefficient when smaller, widely distributed, and transparent alternatives are available. We propose a framework to augment and improve a formality style transfer dataset with explanations via model distillation from ChatGPT. To further refine the generated explanations, we propose a novel way to incorporate scarce expert human feedback using in-context learning (ICLEF: In-Context Learning from Expert Feedback) by prompting ChatGPT to act as a critic to its own outputs. We use the resulting dataset of 9,960 explainable formality style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements as shown by automatic evaluation. In human evaluation, we show that models much smaller than ChatGPT fine-tuned on our data align better with expert preferences. Finally, we discuss two potential applications of models fine-tuned on the explainable style transfer task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.

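Summary
State-of-the-art language models excel at the style transfer task, but current work does not address the explainability of style transfer systems. We propose a framework that augments and improves a formality style transfer dataset with explanations via model distillation from ChatGPT. To further refine the generated explanations, we propose a novel way to incorporate expert human feedback through in-context learning (ICLEF: In-Context Learning from Expert Feedback). We use the resulting explainable style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements, as shown by automatic evaluation. In human evaluation, models fine-tuned on our data align better with expert preferences than ChatGPT does. Finally, we discuss two potential applications of models fine-tuned on the explainable style transfer task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.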

Casteist but Not Racist? Quantifying Disparities in Large Language Model Bias between India and the West

paper_url: http://arxiv.org/abs/2309.08573
repo_url: None
paper_authors: Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, Scott A. Hale
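for: This study quantifies stereotypical bias in large language models (LLMs) and how it manifests in the Indian context.
methods: The study introduces a new dataset, the Indian Bias Evaluation Dataset (Indian-BhED), containing stereotypical and anti-stereotypical examples for caste and religion contexts, and uses it to test a range of popular LLMs.
results: The majority of LLMs tested are strongly biased towards stereotypes in the Indian context, for both caste and religion, especially compared with the Western context. Instruction Prompting, a simple intervention, significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5.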

Abstract
Large Language Models (LLMs), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame and compare bias levels between the Indian and Western contexts. To do this, we develop a novel dataset which we call Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially as compared to the Western context. We finally investigate Instruction Prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5. The findings of this work highlight the need for including more diverse voices when evaluating LLMs.

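Summary
Large language models (LLMs), now used daily by millions of users, can encode societal biases and expose their users to representational harms. A large body of scholarship on LLM bias exists, but it predominantly adopts a Western-centric frame and pays comparatively less attention to bias levels and potential harms in the Global South. Using a new dataset, the Indian Bias Evaluation Dataset (Indian-BhED), we measure bias levels of LLMs in the Indian and Western contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially compared with the Western context. Finally, we investigate Instruction Prompting as a simple mitigation and find that it reduces both stereotypical and anti-stereotypical biases in the majority of cases. These findings highlight the importance of including more diverse voices when evaluating LLMs.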

Augmenting conformers with structured state space models for online speech recognition

paper_url: http://arxiv.org/abs/2309.08551
repo_url: None
paper_authors: Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara Sainath
for: This study investigates neural network models for online speech recognition, where the model only accesses left context.
methods: The paper augments neural encoders with structured state-space sequence models (S4) to improve online ASR. The authors perform systematic ablation studies to compare S4 variants and propose two novel approaches that combine S4 with convolutions.
results: The most effective design stacks a small S4 using real-valued recurrent weights with a local convolution, allowing the two to work complementarily. The best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.

Abstract
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), which are a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.

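Summary
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR with structured state-space sequence models (S4), which provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.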
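
The idea of pairing a long-memory recurrence with a short-range convolution can be sketched on a scalar signal. The code below is a heavily simplified stand-in, not S4 itself: it uses a diagonal, real-valued linear recurrence with hand-picked weights and omits S4's parameterization and discretization. The stacking order, function names, and values are assumptions for illustration.

```python
from typing import List, Sequence


def diagonal_ssm(xs: Sequence[float], a: Sequence[float],
                 b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Causal scan: h_t = a * h_{t-1} + b * x_t (elementwise), y_t = sum(c * h_t)."""
    state = [0.0] * len(a)
    ys = []
    for x in xs:
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, state)))
    return ys


def causal_conv(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Local convolution that only sees the current and previous frames."""
    ys = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * xs[t - k]
        ys.append(acc)
    return ys


# An impulse shows the recurrence's long (geometrically decaying) memory.
signal = [1.0, 0.0, 0.0, 0.0]
ssm_out = diagonal_ssm(signal, a=[0.5], b=[1.0], c=[1.0])
stacked_out = causal_conv(ssm_out, kernel=[1.0, 1.0])
```

Both functions read only current and past inputs, which matches the left-context constraint of online ASR. The recurrence carries information from arbitrarily far back, while the convolution mixes a fixed local window.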

paper_url: http://arxiv.org/abs/2309.08531
repo_url: https://github.com/ms-dot-k/Image-to-Speech-Captioning
paper_authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
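for: This paper proposes methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model.
methods: The paper imports image comprehension and language modeling knowledge from a large-scale pre-trained vision-language model into Im2Sp, and sets the output of Im2Sp to discretized speech units so that the language modeling capability can be incorporated.
results: With the vision-language pre-training strategy, the paper sets new state-of-the-art Im2Sp performance on two widely used benchmark databases, COCO and Flickr8k. It also proposes a way to improve the efficiency of the Im2Sp model.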

Abstract
In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.

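Summary
In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we import rich knowledge of image comprehension and language modeling from a large-scale pre-trained vision-language model. We set the output of Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. These units mainly contain linguistic information while suppressing other characteristics of speech, which lets us incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performance on two widely used databases, COCO and Flickr8k. We then further improve the efficiency of the Im2Sp model. As with speech units, we convert the original image into image units, derived through vector quantization of the raw image. With these image units, the storage required for image data is reduced by 99.2% in bits compared with the original image data. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.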

SilverRetriever: Advancing Neural Passage Retrieval for Polish Question Answering

paper_url: http://arxiv.org/abs/2309.08469
repo_url: None
paper_authors: Piotr Rybak, Maciej Ogrodniczuk
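for: This paper develops a neural passage retriever for Polish to improve the accuracy and efficiency of question answering systems.
methods: A neural retriever is trained on a diverse collection of manually or weakly labeled datasets.
results: SilverRetriever achieves much better results than other Polish models and is competitive with larger multilingual models. The authors also open-source five new passage retrieval datasets.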

Abstract
Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present SilverRetriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. SilverRetriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.

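Summary
Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the information needed to answer the question. Neural retrievers have recently gained popularity, but most work concerns popular languages such as English or Chinese; for others, such as Polish, few models are available. In this work, we present SilverRetriever, a neural retriever for Polish trained on manually or weakly labeled datasets. SilverRetriever achieves much better results than other Polish models and is competitive with larger multilingual models. We also open-source five new passage retrieval datasets.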

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

paper_url: http://arxiv.org/abs/2309.08454
repo_url: None
paper_authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
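for: This work aims to improve automatic speech recognition (ASR) performance on overlapped speech.
methods: The work extends a mixture encoder, which uses the original overlapped speech to mitigate the effect of artifacts introduced by speech separation, to meetings with an arbitrary number of speakers.
results: Experiments show state-of-the-art performance on the LibriCSS dataset and demonstrate the strong separation of the TF-GridNet model, which largely closes the gap between previous methods and oracle separation.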

Abstract
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet which largely closes the gap between previous methods and oracle separation.

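Summary
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method first separates the speech into overlap-free streams and then performs ASR on the resulting signals. Recently, adding a mixture encoder to the ASR model has been proposed. This mixture encoder uses the original overlapped speech to mitigate the effect of artifacts introduced by speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend the approach to more natural meeting contexts with an arbitrary number of speakers and dynamic overlaps. We evaluate with different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. They also demonstrate the strong separation of TF-GridNet, which largely closes the gap between previous methods and oracle separation.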

Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite

paper_url: http://arxiv.org/abs/2309.08448
repo_url: https://github.com/mtkresearch/mr-models
paper_authors: Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-shan Shiu
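for: Evaluating the capabilities of large language models is an essential task in language understanding and generation; this paper targets evaluation in Traditional Chinese.
methods: The authors propose a new evaluation framework that leverages existing English datasets to create a set of benchmarks for evaluating language models' capabilities in Traditional Chinese.
results: GPT-3.5, Taiwan-LLaMa-v1.0, and the authors' own Model 7-C are evaluated on these benchmarks; Model 7-C achieves performance comparable to GPT-3.5 on part of the evaluated capabilities.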

Abstract
The evaluation of large language models is an essential task in the field of language understanding and generation. As language models continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of language models, despite the existence of certain benchmarks such as DRCD, TTQA, CMDQA, and FGC dataset. To address this gap, we propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese. These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding. The proposed benchmarks offer a comprehensive evaluation framework, enabling the assessment of language models' capabilities across different tasks. In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The evaluation results highlight that our model, Model 7-C, achieves performance comparable to GPT-3.5 with respect to a part of the evaluated capabilities. In an effort to advance the evaluation of language models in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmark and opened the model for trial.

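Summary
The evaluation of large language models is an essential task in language understanding and generation. As language models continue to advance, the need for effective benchmarks has grown. For Traditional Chinese, only a limited set of benchmarks exists, such as DRCD, TTQA, CMDQA, and the FGC dataset. To address this gap, we propose a new set of benchmarks that leverage existing English datasets and are tailored to evaluating language models in Traditional Chinese. The benchmarks cover a range of tasks, including contextual question answering, summarization, classification, and table understanding, and together provide a comprehensive framework for assessing language models across tasks. In this paper, we evaluate GPT-3.5, Taiwan-LLaMa-v1.0, and our proprietary model, Model 7-C. The results show that Model 7-C achieves performance comparable to GPT-3.5 on part of the evaluated capabilities. To advance the evaluation of language models in Traditional Chinese and stimulate further research, we have open-sourced our benchmark and opened the model for trial.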

Unleashing Potential of Evidence in Knowledge-Intensive Dialogue Generation

paper_url: http://arxiv.org/abs/2309.08380
repo_url: None
paper_authors: Xianjie Wu, Jian Yang, Tongliang Li, Di Liang, Shiwei Zhang, Yiyang Du, Zhoujun Li
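for: Improving the correctness of dialogue responses by better incorporating external knowledge into dialogue generation.
methods: Large language models are used to mine reliable evidence veracity labels, which are then used during generation for reliable evidence identification and evidence-focused attention.
results: Experiments on MultiDoc2Dial show that evidence label augmentation and the refined attention mechanism improve model performance, outperforming baselines by 3 to 5 points in coherence and factual consistency.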

Abstract
Incorporating external knowledge into dialogue generation (KIDG) is crucial for improving the correctness of response, where evidence fragments serve as knowledgeable snippets supporting the factual dialogue replies. However, introducing irrelevant content often adversely impacts reply quality and easily leads to hallucinated responses. Prior work on evidence retrieval and integration in dialogue systems falls short of fully leveraging existing evidence since the model fails to locate useful fragments accurately and overlooks hidden evidence labels within the KIDG dataset. To fully Unleash the potential of evidence, we propose a framework to effectively incorporate Evidence in knowledge-Intensive Dialogue Generation (u-EIDG). Specifically, we introduce an automatic evidence generation framework that harnesses the power of Large Language Models (LLMs) to mine reliable evidence veracity labels from unlabeled data. By utilizing these evidence labels, we train a reliable evidence indicator to effectively identify relevant evidence from retrieved passages. Furthermore, we propose an evidence-augmented generator with an evidence-focused attention mechanism, which allows the model to concentrate on evidenced segments. Experimental results on MultiDoc2Dial demonstrate the efficacy of evidential label augmentation and refined attention mechanisms in improving model performance. Further analysis confirms that the proposed method outperforms other baselines (+3~+5 points) regarding coherence and factual consistency.

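Summary
Incorporating external knowledge into dialogue generation (KIDG) can improve the correctness of responses. However, introducing irrelevant content often harms reply quality and leads to hallucinated responses. Existing approaches to evidence retrieval and integration in dialogue systems do not fully exploit the available evidence, because the model fails to locate useful fragments accurately and overlooks hidden evidence labels within the KIDG dataset. To unleash the full potential of evidence, we propose u-EIDG, a framework for incorporating evidence in knowledge-intensive dialogue generation. Specifically, we introduce an automatic evidence generation framework that uses large language models (LLMs) to mine reliable evidence veracity labels. With these labels, we train a reliable evidence indicator to identify relevant evidence in retrieved passages. We also propose an evidence-augmented generator with an evidence-focused attention mechanism, which lets the model concentrate on evidenced segments. Experimental results show that evidence label augmentation and the focused attention mechanism improve model performance. Further analysis shows that our method outperforms other baselines (+3~+5 points) in coherence and factual consistency.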

PatFig: Generating Short and Long Captions for Patent Figures

paper_url: http://arxiv.org/abs/2309.08379
repo_url: None
paper_authors: Dana Aubakirova, Kim Gerdes, Lufei Liu
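for: This paper introduces a new large-scale patent figure dataset comprising 30,000+ patent figures from over 11,000 European patent applications.
methods: For each figure, the dataset provides short and long captions, reference numerals, their corresponding terms, and the minimal claim set that describes the interactions between the components of the image.
results: An LVLM model is fine-tuned on Qatent PatFig to generate short and long descriptions, and the paper investigates the effects of incorporating various text-based cues in the patent figure captioning process.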

Abstract
This paper introduces Qatent PatFig, a novel large-scale patent figure dataset comprising 30,000+ patent figures from over 11,000 European patent applications. For each figure, this dataset provides short and long captions, reference numerals, their corresponding terms, and the minimal claim set that describes the interactions between the components of the image. To assess the usability of the dataset, we finetune an LVLM model on Qatent PatFig to generate short and long descriptions, and we investigate the effects of incorporating various text-based cues at the prediction stage of the patent figure captioning process.

DiaCorrect: Error Correction Back-end For Speaker Diarization

paper_url: http://arxiv.org/abs/2309.08377
repo_url: None
paper_authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky
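for: This paper proposes an error correction framework to refine the output of a diarization system.
methods: Inspired by error correction techniques in automatic speech recognition, the model uses two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, it automatically corrects the initial speaker activities to minimize diarization errors.
results: Experiments on 2-speaker telephony data show that the proposed DiaCorrect effectively improves the initial model's results.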

Abstract
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect.

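Summary
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple way. The method draws inspiration from error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize diarization errors. Experiments show that the proposed DiaCorrect effectively improves the initial model's results. Our source code is available at https://github.com/BUTSpeechFIT/diacorrect.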

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

paper_url: http://arxiv.org/abs/2309.08351
repo_url: https://github.com/NathanGodey/headless-lm
paper_authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot
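for: This study proposes a new self-supervised pre-training method for language models that, instead of predicting probability distributions over tokens, reconstructs input embeddings in a contrastive fashion.
methods: The method, Contrastive Weight Tying (CWT), is used to pretrain Headless Language Models in both monolingual and multilingual contexts.
results: The method yields a significant increase in GLUE score and LAMBADA accuracy compared with classical language models within similar compute budgets, with better downstream performance and data efficiency.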

Abstract
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
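As a rough illustration of the contrastive objective described in the abstract, the sketch below computes an in-batch InfoNCE-style loss between output vectors and the input embeddings of their target tokens. The function names, the cosine similarity, and the temperature value are assumptions made for illustration; the abstract does not specify the exact loss used by CWT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(outputs, targets, temperature=0.1):
    """Mean InfoNCE-style loss over a batch (hypothetical helper).

    outputs[i] is the model's output vector at position i.
    targets[i] is the input embedding of the token expected at position i.
    The other target embeddings in the batch act as negatives.
    """
    total = 0.0
    for i, out in enumerate(outputs):
        logits = [cosine(out, tgt) / temperature for tgt in targets]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(outputs)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each output matches its own target embedding and large when the targets are swapped, which is the behaviour a contrastive objective is meant to reward.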

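Summary
Self-supervised pre-training of language models usually consists of predicting probability distributions over token vocabularies. In this study, we propose an innovative method that instead reconstructs input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. The method offers practical advantages: it reduces training compute requirements while improving downstream performance and data efficiency. Within similar compute budgets, it yields a +1.6 GLUE score increase and a +2.7 LAMBADA accuracy improvement.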

Reward Engineering for Generating Semi-structured Explanation

paper_url: http://arxiv.org/abs/2309.08347
repo_url: https://github.com/jiuzhouh/reward-engineering-for-generating-seg
paper_authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
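for: This study addresses the challenge of generating structured explanations with language models, which is particularly pronounced for not-so-large LMs.
methods: The study uses reinforcement learning (RL) with a carefully crafted reward engineering method and investigates multiple reward aggregation methods.
results: RL addresses the generation of structured explanations better than supervised fine-tuning alone, and the proposed reward achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).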

Abstract
Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs, as the reasoner is expected to couple a sequential answer with a structured explanation which embodies both the correct presentation and the correct reasoning process. In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed reward on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.

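Summary
A semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. It highlights how the information available in a specific query is supplemented with information the reasoner produces from its internal weights to generate an answer. Despite recent improvements in the generative capabilities of language models, producing structured explanations to verify a model's true reasoning capabilities remains a challenge. The problem is particularly pronounced for smaller LMs, since the reasoner must produce both a sequential answer and a structured explanation. In this work, we first underscore the limitations of supervised fine-tuning (SFT) and then introduce a reward engineering method in reinforcement learning (RL) to better address the problem. We investigate multiple reward aggregation methods and provide a detailed discussion that sheds light on the potential of RL for future research. Our proposed reward achieves new state-of-the-art results on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE).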

Distributional Inclusion Hypothesis and Quantifications: Probing Hypernymy in Functional Distributional Semantics

paper_url: http://arxiv.org/abs/2309.08325
repo_url: None
paper_authors: Chun Hei Lo, Guy Emerson
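for: This paper examines how Functional Distributional Semantics (FDS) models word meaning and whether FDS models learn hypernymy when trained on a corpus.
methods: The authors train FDS models and introduce a training objective that allows FDS to handle simple universal quantifications.
results: Experiments on synthetic and real data sets show that FDS models learn hypernymy when the corpus strictly follows the Distributional Inclusion Hypothesis (DIH), and that the proposed objective enables hypernymy learning under the reverse of the DIH.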

Abstract
Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions. This provides a natural representation for hypernymy, but no guarantee that it is learnt when FDS models are trained on a corpus. We demonstrate that FDS models learn hypernymy when a corpus strictly follows the Distributional Inclusion Hypothesis. We further introduce a training objective that allows FDS to handle simple universal quantifications, thus enabling hypernymy learning under the reverse of DIH. Experimental results on both synthetic and real data sets confirm our hypotheses and the effectiveness of our proposed objective.

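Summary
Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions. This provides a natural representation for hypernymy, but there is no guarantee that hypernymy is learnt when FDS models are trained on a corpus. We demonstrate that FDS models learn hypernymy when the corpus strictly follows the Distributional Inclusion Hypothesis (DIH). We further introduce a training objective that allows FDS to handle simple universal quantifications, thus enabling hypernymy learning under the reverse of the DIH. Experimental results confirm our hypotheses and the effectiveness of the proposed objective.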

Bridging Topic, Domain, and Language Shifts: An Evaluation of Comprehensive Out-of-Distribution Scenarios

paper_url: http://arxiv.org/abs/2309.08316
repo_url: None
paper_authors: Andreas Waldis, Iryna Gurevych
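for: This study evaluates the generalization abilities of language models (LMs) in out-of-distribution (OOD) scenarios covering topic, domain, and language shifts.
methods: The authors simulate distribution shifts by deliberately withholding specific instances for testing, define three metrics to pinpoint generalization flaws, and propose eleven classification tasks. They compare prompt-based fine-tuning, vanilla fine-tuning, and in-context learning.
results: Prompt-based fine-tuning performs best overall, notably when train and test splits primarily differ semantically. In-context learning is more effective than prompt-based or vanilla fine-tuning when the label distribution of the training data differs heavily from that of the testing data. This reveals a drawback of gradient-based learning: it biases LMs regarding such structural obstacles.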

Abstract
Language models (LMs) excel in in-distribution (ID) scenarios where train and test data are independent and identically distributed. However, their performance often degrades in real-world applications like argument mining. Such degradation happens when new topics emerge, or other text domains and languages become relevant. To assess LMs' generalization abilities in such out-of-distribution (OOD) scenarios, we simulate such distribution shifts by deliberately withholding specific instances for testing, as from the social media domain or the topic Solar Energy. Unlike prior studies focusing on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization. We define three metrics to pinpoint generalization flaws and propose eleven classification tasks covering topic, domain, and language shifts. Overall, we find superior performance of prompt-based fine-tuning, notably when train and test splits primarily differ semantically. Simultaneously, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks when training data embodies heavy discrepancies in label distribution compared to testing data. This reveals a crucial drawback of gradient-based learning: it biases LMs regarding such structural obstacles.

Summary
Unlike previous studies that focused on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization by defining three metrics to identify generalization flaws. We also propose eleven classification tasks covering topic, domain, and language shifts. Our results show that prompt-based fine-tuning performs better than other methods, especially when the train and test splits differ semantically. Additionally, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks with significant differences in label distribution between training and testing data. This highlights a limitation of gradient-based learning, as it can bias LMs towards such structural obstacles.

Self-Consistent Narrative Prompts on Abductive Natural Language Inference

paper_url: http://arxiv.org/abs/2309.08303
repo_url: https://github.com/hkust-knowcomp/alpha-pace
paper_authors: Chunkit Chan, Xin Liu, Tsz Ho Chan, Jiayang Cheng, Yangqiu Song, Ginny Wong, Simon See
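for: This study aims to improve self-consistency and inter-sentential coherence in the abductive natural language inference (αNLI) task.
methods: The study proposes a prompt tuning model, α-PACE, that takes self-consistency and inter-sentential coherence into consideration. It also proposes a general self-consistent framework that considers various narrative sequences to guide the pre-trained language model in understanding the narrative context of the input.
results: Extensive experiments and thorough ablation studies illustrate the necessity and effectiveness of α-PACE, which shows significant improvement over extensive competitive baselines.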

Abstract
Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The abductive natural language inference ($\alpha$NLI) task has been proposed, and this narrative text-based task aims to infer the most plausible hypothesis from the candidates given two observations. However, the inter-sentential coherence and the model consistency have not been well exploited in the previous works on this task. In this work, we propose a prompt tuning model $\alpha$-PACE, which takes self-consistency and inter-sentential coherence into consideration. Besides, we propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) for guiding the pre-trained language model in understanding the narrative context of input. We conduct extensive experiments and thorough ablation studies to illustrate the necessity and effectiveness of $\alpha$-PACE. The performance of our method shows significant improvement against extensive competitive baselines.

Summary
Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The $\alpha$NLI task has been proposed as a narrative text-based task that aims to select the most plausible hypothesis from the candidates. However, previous work has not fully exploited inter-sentential coherence and model consistency. In this work, we propose a prompt tuning model, $\alpha$-PACE, that takes self-consistency and inter-sentential coherence into consideration. We also propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) to help the pre-trained language model understand the narrative context of the input. We conduct extensive experiments and thorough ablation studies to demonstrate the necessity and effectiveness of $\alpha$-PACE. Our method shows significant improvement over a range of competitive baselines.

Structural Self-Supervised Objectives for Transformers

paper_url: http://arxiv.org/abs/2309.08272
repo_url: https://github.com/lucadiliello/transformers-framework
paper_authors: Luca Di Liello
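for: This thesis aims to improve the pre-training of natural language models, making them more efficient and better aligned with downstream applications.
methods: The first part introduces three alternatives to BERT's Masked Language Modeling (MLM) objective: Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives use token swapping instead of masking; RTS and C-RTS predict token originality, and SLM predicts the original token values. The second part proposes self-supervised pre-training tasks that align structurally with downstream applications: using large corpora such as Wikipedia and CC-News, models are trained to recognize whether text spans originate from the same paragraph or document, with continuous pre-training starting from existing models such as RoBERTa, ELECTRA, DeBERTa, BART, and T5.
results: RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM, and SLM outperforms MLM on certain tasks under the same computational budget. The structural objectives bring significant improvements in tasks such as Fact Verification, Answer Sentence Selection, and Summarization, especially when limited annotated data is available. They also achieve state-of-the-art results on benchmark datasets including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, and improve summary quality.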

Abstract
This thesis focuses on improving the pre-training of natural language models using unsupervised raw data to make them more efficient and aligned with downstream applications. In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives involve token swapping instead of masking, with RTS and C-RTS aiming to predict token originality and SLM predicting the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks despite using the same computational budget. In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. We use large corpora like Wikipedia and CC-News to train models to recognize if text spans originate from the same paragraph or document in several ways. By doing continuous pre-training, starting from existing models like RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements in tasks like Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when limited annotation data is available. The proposed objectives also achieve state-of-the-art results on various benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, as well as enhancing the quality of summaries. Importantly, these techniques can be easily integrated with other methods without altering the internal structure of Transformer models, making them versatile for various NLP applications.
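To make the RTS objective more concrete, here is a minimal sketch of how training pairs could be built: some tokens are swapped for random vocabulary items, and the label for each position records whether the token is original. The function name, the 15% default rate, and the uniform sampling are assumptions for illustration and are not taken from the thesis.

```python
import random


def random_token_substitution(tokens, vocab, rate=0.15, seed=0):
    """Build one RTS-style example (hypothetical helper).

    Returns (corrupted_tokens, labels), where labels[i] is 1 if position i
    was substituted and 0 if the original token was kept.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < rate:
            # Sample a replacement that differs from the original token,
            # so a label of 1 always means the token really changed.
            candidates = [v for v in vocab if v != token]
            corrupted.append(rng.choice(candidates))
            labels.append(1)
        else:
            corrupted.append(token)
            labels.append(0)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens)) + ["dog"]
corrupted, labels = random_token_substitution(tokens, vocab, rate=0.5, seed=1)
```

A classifier trained on such pairs would predict `labels` from `corrupted`, which is a binary decision per token rather than a prediction over the whole vocabulary. That difference is one plausible reason for the lower pre-training cost the abstract reports.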

Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

paper_url: http://arxiv.org/abs/2309.08255
repo_url: None
paper_authors: Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa
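for: This study proposes a framework for cross-lingual speech synthesis that combines an upstream Voice Conversion (VC) model with a downstream Text-To-Speech (TTS) model.
methods: The framework has four stages. In the first two stages, a VC model converts utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with linguistic features and durations from recordings in the target language to train a single-speaker acoustic model. The last stage trains a locale-independent vocoder.
results: The proposed framework outperforms state-of-the-art approaches based on training a large multilingual TTS model and is robust across different model architectures, languages, speakers, and amounts of data. It is especially beneficial in low-resource settings.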

Abstract
In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model. Finally, the last stage entails the training of a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches which are based on training a large multilingual TTS model. In addition, our experiments demonstrate the robustness of our approach with different model architectures, languages, speakers and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.

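Summary
In this work, we introduce a framework for cross-lingual speech synthesis that consists of four stages. In the first two stages, we use a Voice Conversion (VC) model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with linguistic features and durations from recordings in the target language and used to train a single-speaker acoustic model. The last stage trains a locale-independent vocoder. Our evaluations show that the approach outperforms state-of-the-art methods based on training a large multilingual TTS model. Our experiments also demonstrate its robustness across different model architectures, languages, speakers, and amounts of data. Moreover, the solution is especially beneficial in low-resource settings.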

Investigating Answerability of LLMs for Long-Form Question Answering

paper_url: http://arxiv.org/abs/2309.08210
repo_url: None
paper_authors: Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, Semih Yavuz
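for: This study seeks to understand the gaps between massive LLMs (e.g., ChatGPT) and smaller open-source LLMs and their distilled counterparts.
methods: A question-generation method based on abstractive summaries is used to test the ability of LLMs to reason and infer from long contexts.
results: Generating questions from abstractive summaries creates a challenging setting for LLMs and exposes performance gaps between massive LLMs and open-source LLMs, especially for longer contexts (>1024 tokens).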

Abstract
As we embark on a new era of LLMs, it becomes increasingly crucial to understand their capabilities, limitations, and differences. Toward making further progress in this direction, we strive to build a deeper understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. To this end, we specifically focus on long-form question answering (LFQA) because it has several practical and impactful applications (e.g., troubleshooting, customer service, etc.) yet is still understudied and challenging for LLMs. We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting for LLMs to reason and infer from long contexts. Our experimental results confirm that: (1) our proposed method of generating questions from abstractive summaries pose a challenging setup for LLMs and shows performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama) (2) open-source LLMs exhibit decreased reliance on context for generated questions from the original document, but their generation capabilities drop significantly on generated questions from summaries -- especially for longer contexts (>1024 tokens)

Summary
As we enter a new era of LLMs, it becomes increasingly important to understand their capabilities, limitations, and differences. To make further progress in this area, we aim to deepen our understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. Specifically, we focus on long-form question answering (LFQA) as it has many practical and impactful applications (e.g., troubleshooting, customer service) yet is still understudied and challenging for LLMs. We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting for LLMs to reason and infer from long contexts. Our experimental results confirm that: (1) our proposed method of generating questions from abstractive summaries poses a challenging setup for LLMs and shows performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama); (2) open-source LLMs exhibit decreased reliance on context for generated questions from the original document, but their generation capabilities drop significantly on generated questions from summaries, especially for longer contexts (>1024 tokens).

Encoded Summarization: Summarizing Documents into Continuous Vector Space for Legal Case Retrieval

paper_url: http://arxiv.org/abs/2309.08187
repo_url: None
paper_authors: Vu Tran, Minh Le Nguyen, Satoshi Tojo, Ken Satoh
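for: This paper presents a method for the legal case retrieval task.
methods: The method encodes documents by summarizing them into a continuous vector space via a phrase scoring framework built on deep neural networks. It also combines lexical features with latent features generated by neural networks to improve the retrieval system.
results: Experiments show that using the provided summaries and performing encoded summarization both matter for retrieval performance, and that lexical features and neural latent features complement each other. The approach achieved F1 of 65.6% and 57.6% on the experimental datasets of legal case retrieval tasks.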

Abstract
We present our method for tackling a legal case retrieval task by introducing our method of encoding documents by summarizing them into continuous vector space via our phrase scoring framework utilizing deep neural networks. On the other hand, we explore the benefits from combining lexical features and latent features generated with neural networks. Our experiments show that lexical features and latent features generated with neural networks complement each other to improve the retrieval system performance. Furthermore, our experimental results suggest the importance of case summarization in different aspects: using provided summaries and performing encoded summarization. Our approach achieved F1 of 65.6% and 57.6% on the experimental datasets of legal case retrieval tasks.

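Summary
We present a method for the legal case retrieval task that encodes documents by summarizing them into a continuous vector space. We use deep neural networks to perform the encoding and explore the benefits of combining lexical features with latent features. Our experiments show that lexical and latent features complement each other and improve the performance of the retrieval system. Our results also suggest the importance of case summarization in different aspects: using provided summaries and performing encoded summarization. Our approach achieved F1 of 65.6% and 57.6% on the legal case retrieval tasks.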

Multilingual Sentence-Level Semantic Search using Meta-Distillation Learning

paper_url: http://arxiv.org/abs/2309.08185
repo_url: None
paper_authors: Meryem M’hamdi, Jonathan May, Franck Dernoncourt, Trung Bui, Seunghyun Yoon
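for: This study aims to improve multilingual semantic search, which requires a better semantic understanding of the user's intent and its contextual meaning.
methods: The study uses meta-distillation learning: MAML-Align distills knowledge from a Teacher model, T-MAML, to a Student model, S-MAML, to improve the student's performance in multilingual semantic search.
results: On top of a strong baseline based on sentence transformers, meta-distillation boosts the gains provided by MAML, significantly outperforms naive fine-tuning, and improves generalization even to unseen languages.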

Abstract
Multilingual semantic search is the task of retrieving relevant contents to a query expressed in different language combinations. This requires a better semantic understanding of the user's intent and its contextual meaning. Multilingual semantic search is less explored and more challenging than its monolingual or bilingual counterparts, due to the lack of multilingual parallel resources for this task and the need to circumvent "language bias". In this work, we propose an alignment approach: MAML-Align, specifically for low-resource scenarios. Our approach leverages meta-distillation learning based on MAML, an optimization-based Model-Agnostic Meta-Learner. MAML-Align distills knowledge from a Teacher meta-transfer model T-MAML, specialized in transferring from monolingual to bilingual semantic search, to a Student model S-MAML, which meta-transfers from bilingual to multilingual semantic search. To the best of our knowledge, we are the first to extend meta-distillation to a multilingual search application. Our empirical results show that on top of a strong baseline based on sentence transformers, our meta-distillation approach boosts the gains provided by MAML and significantly outperforms naive fine-tuning methods. Furthermore, multilingual meta-distillation learning improves generalization even to unseen languages.

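Summary
Multilingual semantic search is the task of retrieving content relevant to a query expressed in different language combinations. This requires a better understanding of the user's intent and its contextual meaning. Multilingual semantic search is less explored and more challenging than its monolingual or bilingual counterparts, due to the lack of multilingual parallel resources for this task and the need to circumvent "language bias". In this work, we propose an alignment approach, MAML-Align, specifically for low-resource scenarios. Our approach leverages meta-distillation learning based on MAML, an optimization-based Model-Agnostic Meta-Learner. MAML-Align distills knowledge from a Teacher meta-transfer model, T-MAML, which specializes in transferring from monolingual to bilingual semantic search, to a Student model, S-MAML, which meta-transfers from bilingual to multilingual semantic search. To the best of our knowledge, we are the first to extend meta-distillation to a multilingual search application. Our empirical results show that, on top of a strong baseline based on sentence transformers, our meta-distillation approach boosts the gains provided by MAML and significantly outperforms naive fine-tuning methods. Furthermore, multilingual meta-distillation learning improves generalization even to unseen languages.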

Large Language Models for Failure Mode Classification: An Investigation

paper_url: http://arxiv.org/abs/2309.08181
repo_url: https://github.com/nlp-tlp/chatgpt-fmc
paper_authors: Michael Stewart, Melinda Hodkiewicz, Sirui Li
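for: This study evaluates the effectiveness of Large Language Models (LLMs) for Failure Mode Classification (FMC).
methods: The authors use prompt engineering to enable an LLM to predict the failure mode of a given observation using a restricted code list, and fine-tune a GPT-3.5 model on annotated data.
results: The GPT-3.5 model fine-tuned on annotated data (F1=0.80) significantly improves on a currently available text classification model trained on the same data (F1=0.60) and on the out-of-the-box GPT-3.5 (F1=0.46).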

Abstract
In this paper we present the first investigation into the effectiveness of Large Language Models (LLMs) for Failure Mode Classification (FMC). FMC, the task of automatically labelling an observation with a corresponding failure mode code, is a critical task in the maintenance domain as it reduces the need for reliability engineers to spend their time manually analysing work orders. We detail our approach to prompt engineering to enable an LLM to predict the failure mode of a given observation using a restricted code list. We demonstrate that the performance of a GPT-3.5 model (F1=0.80) fine-tuned on annotated data is a significant improvement over a currently available text classification model (F1=0.60) trained on the same annotated data set. The fine-tuned model also outperforms the out-of-the box GPT-3.5 (F1=0.46). This investigation reinforces the need for high quality fine-tuning data sets for domain-specific tasks using LLMs.

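Summary
In this paper, we present an investigation into the effectiveness of Large Language Models (LLMs) for Failure Mode Classification (FMC). FMC is an important task in the maintenance domain, as it reduces the time reliability engineers spend manually analysing work orders. We detail our approach to prompt engineering, which enables an LLM to predict the failure mode of a given observation using a restricted code list. We show that a GPT-3.5 model fine-tuned on annotated data (F1=0.80) clearly outperforms a currently available text classification model trained on the same annotated data set (F1=0.60). The fine-tuned model also outperforms the out-of-the-box GPT-3.5 (F1=0.46). This investigation shows the need for high-quality fine-tuning data sets for domain-specific tasks using LLMs.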

FedJudge: Federated Legal Large Language Model

paper_url: http://arxiv.org/abs/2309.08173
repo_url: https://github.com/yuelinan/fedjudge
paper_authors: Linan Yue, Qi Liu, Yichao Du, Weibo Gao, Ye Liu, Fangzhou Yao
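for: This paper addresses the data privacy concerns raised by centralized training of LLMs in the field of Legal Intelligence by integrating Legal LLMs with Federated Learning (FL).
methods: The paper proposes the FedJudge framework, which uses parameter-efficient fine-tuning to update only a few additional parameters during FL training, and continual learning methods to preserve the global model's important parameters when training local clients.
results: Experiments on three real-world datasets validate the effectiveness of FedJudge.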

Abstract
Large Language Models (LLMs) have gained prominence in the field of Legal Intelligence, offering potential applications in assisting legal professionals and laymen. However, the centralized training of these Legal LLMs raises data privacy concerns, as legal data is distributed among various institutions containing sensitive individual information. This paper addresses this challenge by exploring the integration of Legal LLMs with Federated Learning (FL) methodologies. By employing FL, Legal LLMs can be fine-tuned locally on devices or clients, and their parameters are aggregated and distributed on a central server, ensuring data privacy without directly sharing raw data. However, computation and communication overheads hinder the full fine-tuning of LLMs under the FL setting. Moreover, the distribution shift of legal data reduces the effectiveness of FL methods. To this end, in this paper, we propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently and effectively. Specifically, FedJudge utilizes parameter-efficient fine-tuning methods to update only a few additional parameters during the FL training. Besides, we explore the continual learning methods to preserve the global model's important parameters when training local clients to mitigate the problem of data shifts. Extensive experimental results on three real-world datasets clearly validate the effectiveness of FedJudge. Code is released at https://github.com/yuelinan/FedJudge.
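The server-side aggregation step mentioned in the abstract can be sketched as a weighted average of the small set of trainable parameters each client sends back. This is a minimal sketch assuming standard FedAvg weighting by local example count; the abstract does not state which aggregation rule FedJudge uses, and the parameter name `adapter` is hypothetical.

```python
def federated_average(client_updates):
    """Weighted average of client parameters (FedAvg-style, assumed here).

    client_updates is a list of (num_examples, params) pairs, where params
    maps a parameter name to a flat list of floats. Only the lightweight
    trainable parameters are exchanged; raw data never leaves the client.
    """
    total_examples = sum(count for count, _ in client_updates)
    reference = client_updates[0][1]
    averaged = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(count * params[name][i] for count, params in client_updates)
            / total_examples
            for i in range(len(values))
        ]
    return averaged


updates = [
    (1, {"adapter": [0.0, 2.0]}),  # client with 1 local example
    (3, {"adapter": [4.0, 2.0]}),  # client with 3 local examples
]
global_params = federated_average(updates)
```

Because only the additional parameters are averaged, the communication cost per round scales with the adapter size rather than with the full model, which matches the motivation given for parameter-efficient fine-tuning under FL.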

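Summary
Large Language Models (LLMs) have gained prominence in the field of Legal Intelligence, where they can assist legal professionals and laymen. However, the centralized training of these Legal LLMs raises data privacy concerns, since legal data is distributed among various institutions and contains sensitive individual information. This paper addresses the challenge by exploring the integration of Legal LLMs with Federated Learning (FL). With FL, Legal LLMs can be fine-tuned locally on devices or clients while their parameters are aggregated on a central server, ensuring data privacy without directly sharing raw data. However, computation and communication overheads hinder full fine-tuning of LLMs under the FL setting, and the distribution shift of legal data reduces the effectiveness of FL methods. We therefore propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently. Specifically, FedJudge uses parameter-efficient fine-tuning methods to update only a few additional parameters during FL training. We also explore continual learning methods that preserve the global model's important parameters to mitigate the problem of data shifts. Experimental results on three real-world datasets validate the effectiveness of FedJudge. Code is released at https://github.com/yuelinan/FedJudge.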

LASER: LLM Agent with State-Space Exploration for Web Navigation

paper_url: http://arxiv.org/abs/2309.08172
repo_url: None
paper_authors: Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu
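for: This paper addresses the limitations of large language models (LLMs) in interactive decision-making tasks such as web navigation.
methods: The paper models the interactive task as state space exploration: the LLM agent transitions among a pre-defined set of states by performing actions to complete the task, which enables flexible back-tracking.
results: On the WebShop task, the LASER agent significantly outperforms previous methods and closes the gap with human performance.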

Abstract
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. While achieving decent performance, previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples to teach the model how to reason in the interactive environment. Consequently, the model could not handle more challenging scenarios not covered in the in-context examples, e.g., mistakes, leading to sub-optimal performance. To address this issue, we propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible back-tracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.
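The state-space formulation can be illustrated with a toy state machine in which each state exposes only its own actions, including actions that return to an earlier state. The state names, action names, and the scripted policy below are hypothetical stand-ins; in the paper the decisions are made by an LLM, and the actual state set is not given in the abstract.

```python
# Each state lists the actions allowed in it and the state each action leads to.
TRANSITIONS = {
    "search": {"submit_query": "results"},
    "results": {"open_item": "item", "back_to_search": "search"},
    "item": {"buy": "done", "back_to_results": "results"},
}


def run_episode(policy, max_steps=10):
    """Run one episode and return the sequence of visited states."""
    state = "search"
    trace = [state]
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = sorted(TRANSITIONS[state])
        action = policy(state, allowed, trace)
        if action not in allowed:
            raise ValueError(f"{action!r} is not allowed in state {state!r}")
        state = TRANSITIONS[state][action]
        trace.append(state)
    return trace


def scripted_policy(state, allowed, trace):
    """Stand-in for the LLM: rejects the first item, then buys the second."""
    if state == "search":
        return "submit_query"
    if state == "results":
        return "open_item"
    # In the "item" state: back-track on the first visit, buy on the second.
    return "buy" if trace.count("item") >= 2 else "back_to_results"


trace = run_episode(scripted_policy)
```

The back-tracking action is what lets the agent recover after opening an unsuitable item, whereas a forward-only agent would have to commit to it.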

Summary
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks such as web navigation. Although they achieve decent performance, previous methods implicitly assume a forward-only execution mode in which only oracle trajectories are provided as in-context examples to teach the model how to reason in the interactive environment. This limits the model's ability to handle more challenging scenarios, leading to sub-optimal performance. To address this issue, we propose modeling the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible back-tracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

paper_url: http://arxiv.org/abs/2309.08168
repo_url: None
paper_authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra
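for: Accelerating inference for Large Language Models (LLMs) without the need for an auxiliary model.
methods: The paper proposes a new inference scheme, self-speculative decoding, which quickly generates draft tokens by selectively skipping certain intermediate layers and then uses the original LLM to verify them in one forward pass.
results: Benchmarks with LLaMA-2 and its fine-tuned models demonstrate a speedup of up to 1.73 times.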

Abstract
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup up to 1.73$\times$.
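The draft-then-verify loop can be sketched with two toy next-token functions: `full_next` stands in for the original LLM and `draft_next` for the same model with some layers skipped. Both functions, the draft length `k`, and the greedy acceptance rule are assumptions for illustration. A real implementation would verify all draft tokens in a single batched forward pass, whereas this sketch calls the full model once per position for clarity.

```python
def full_next(prefix):
    """Toy stand-in for the original LLM's greedy next token."""
    return (3 * prefix[-1] + len(prefix)) % 11


def draft_next(prefix):
    """Toy stand-in for the layer-skipping draft: usually right, sometimes wrong."""
    token = full_next(prefix)
    return (token + 1) % 11 if len(prefix) % 4 == 0 else token


def greedy_decode(next_fn, prompt, n_new):
    """Plain autoregressive greedy decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_fn(seq))
    return seq


def self_speculative_decode(prompt, n_new, k=3):
    """Draft k tokens cheaply, then keep the prefix the full model agrees with."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        # Drafting stage: extend a copy of the sequence with k cheap tokens.
        draft = list(seq)
        for _ in range(k):
            draft.append(draft_next(draft))
        # Verification stage: accept draft tokens until the first disagreement.
        for i in range(len(seq), len(draft)):
            expected = full_next(draft[:i])
            if draft[i] == expected:
                seq.append(draft[i])
            else:
                seq.append(expected)  # replace the rejected token and stop
                break
        else:
            # Every draft token was accepted; the full model adds one more.
            seq.append(full_next(seq))
    return seq[:target_len]


reference = greedy_decode(full_next, [1, 2], 10)
accelerated = self_speculative_decode([1, 2], 10)
```

Since every accepted token equals what the full model would have produced at that position, the two sequences match, which mirrors the abstract's claim that the output is identical to that of the unaltered LLM.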

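Summary
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. The approach has two stages: drafting and verification. In the drafting stage, certain intermediate layers are selectively skipped to generate draft tokens more quickly, at slightly lower quality. The verification stage then uses the original LLM to validate the draft tokens in one forward pass. This process keeps the final output identical to that of the original LLM. The method requires no additional neural network training and no extra memory footprint. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup of up to 1.73$\times$.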

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue

paper_url: http://arxiv.org/abs/2309.08156
repo_url: None
paper_authors: Zhengliang Shi, Weiwei Sun, Shuo Zhang, Zhen Zhang, Pengjie Ren, Zhaochun Ren
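for: Automatic evaluation of open-domain dialogue systems, addressing the one-to-many problem, i.e., many appropriate responses exist besides the golden response.
methods: The paper proposes Reference-Assisted Dialogue Evaluation (RADE), which uses a pre-created utterance as a reference, rather than only the gold response, to relieve the one-to-many problem. RADE explicitly compares the reference and the candidate response to predict their overall scores, and an auxiliary response generation task enhances prediction via a shared encoder.
results: In experiments on three datasets and two existing benchmarks, the Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.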

Abstract
Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., many appropriate responses other than just the golden response. As of now, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages the pre-created utterance as reference other than the gold response to relieve the one-to-many problem. Specifically, RADE explicitly compares reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses other than just a golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method, where Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.

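Summary
Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., there are many appropriate responses besides the golden response. Currently, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach, which uses a pre-created utterance as a reference, rather than the golden response alone, to address the one-to-many problem. Specifically, RADE explicitly compares the reference and the candidate response to predict their overall scores. In addition, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional human-rated responses. Experiments on our three datasets and two existing benchmarks show that our method achieves higher Pearson, Spearman, and Kendall correlations with human evaluation than state-of-the-art baselines.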

Unimodal Aggregation for CTC-based Speech Recognition

paper_url: http://arxiv.org/abs/2309.08150
repo_url: https://github.com/Audio-WestlakeU/UMA-ASR
paper_authors: Ying Fang, Xiaofei Li
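for: This paper studies non-autoregressive automatic speech recognition, aiming to learn better feature representations for text tokens and thereby lower recognition error and computational complexity.
methods: An encoder produces frame-wise features and weights; feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training.
results: Experiments on three Mandarin datasets show that the proposed method achieves superior or comparable performance to other advanced non-autoregressive methods, while lowering recognition error and computational complexity relative to regular CTC. Integrating self-conditioned CTC into the framework further improves performance.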

Abstract
This paper works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. Then, the feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Compared to the regular CTC, the proposed method learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity. Experiments on three Mandarin datasets show that UMA demonstrates superior or comparable performance to other advanced non-autoregressive methods, such as self-conditioned CTC. Moreover, by integrating self-conditioned CTC into the proposed framework, the performance can be further noticeably improved.

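Summary
This paper addresses non-autoregressive automatic speech recognition. We propose unimodal aggregation (UMA), which segments and integrates the feature frames belonging to the same text token and thereby learns better feature representations for text tokens. Both the frame-wise features and the weights come from an encoder; the feature frames with unimodal weights are integrated and then processed by a decoder. Training uses the connectionist temporal classification (CTC) loss. Compared with regular CTC, the method learns better feature representations and shortens the sequence, which lowers both recognition error and computational complexity. On three Mandarin datasets, UMA performs better than or comparably to other advanced non-autoregressive methods such as self-conditioned CTC, and integrating self-conditioned CTC into the framework improves performance further.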
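
The aggregation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says that frames with unimodal weights are integrated, so the boundary rule (a new segment starts at each valley of the weights) and the weighted average within a segment are assumptions, and the function name `unimodal_aggregate` is hypothetical.

```python
def unimodal_aggregate(frames, weights):
    """Merge consecutive frames into one vector per unimodal weight segment.

    frames:  list of equal-length feature vectors (lists of floats)
    weights: one positive scalar weight per frame
    Assumption: a new segment starts at each valley (local minimum) of the
    weights, and frames inside a segment are combined by a weighted average.
    """
    if len(frames) != len(weights) or not frames:
        raise ValueError("frames and weights must be non-empty and aligned")

    # Segment starts: frame 0 plus every interior valley of the weight curve.
    starts = [0]
    for t in range(1, len(weights) - 1):
        if weights[t] < weights[t - 1] and weights[t] <= weights[t + 1]:
            starts.append(t)
    ends = starts[1:] + [len(frames)]

    aggregated = []
    for s, e in zip(starts, ends):
        total = sum(weights[s:e])
        dim = len(frames[s])
        aggregated.append(
            [sum(weights[t] * frames[t][d] for t in range(s, e)) / total
             for d in range(dim)]
        )
    return aggregated


# Six one-dimensional frames with two weight peaks collapse to two vectors.
frames = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
weights = [0.2, 0.8, 0.2, 0.1, 0.9, 0.3]
tokens = unimodal_aggregate(frames, weights)
```

The output sequence is shorter than the input, which is the property the abstract credits for the reduced computational cost of the decoder.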

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

paper_url: http://arxiv.org/abs/2309.08140
repo_url: None
paper_authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana
for: This paper is written for researchers and developers interested in text-to-speech (TTS) synthesis, particularly those looking to control speaker identity using natural language descriptions.
methods: The paper proposes a prompt-based TTS synthesis system called PromptTTS++, which utilizes a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. The system also introduces the concept of speaker prompts, which describe voice characteristics such as gender-neutral, young, old, and muffled.
results: The subjective evaluation results show that the proposed method can better control speaker characteristics than previous methods without the speaker prompt. The authors also provide audio samples to demonstrate the effectiveness of their approach.

Abstract
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.

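Summary
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows speaker identity to be controlled with natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Because no large-scale dataset with speaker prompts exists, we first construct one from the LibriTTS-R corpus with manually annotated speaker prompts. We then use a diffusion-based acoustic model with mixture density networks to model the diverse speaker factors in the training data. Previous studies rely on style prompts that describe only limited aspects of speaker individuality, such as pitch, speaking speed, and energy; our method adds a speaker prompt to learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Subjective evaluation shows that the method controls speaker characteristics better than methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.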

Audio Difference Learning for Audio Captioning

paper_url: http://arxiv.org/abs/2309.08141
repo_url: None
paper_authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda
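for: This study proposes a new training paradigm, audio difference learning, to improve audio captioning.
methods: The method builds a feature representation space that preserves the relationships between audio clips, so that captions can describe detailed audio information. A reference audio and the input audio are both converted into feature representations by a shared encoder, and captions are generated from their differential features. In addition, the input audio is mixed with additional audio that also serves as the reference, so that the difference between the mixed audio and the reference reverts to the original input audio.
results: In experiments on the Clotho and ESC50 datasets, the method improves the SPIDEr score by 7% over conventional methods.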

Abstract
This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, both of which are transformed into feature representations via a shared encoder. Captions are then generated from these differential features to describe their differences. Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference. This results in the difference between the mixed audio and the reference audio reverting back to the original input audio. This allows the original input's caption to be used as the caption for their difference, eliminating the need for additional annotations for the differences. In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.

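Summary
This study introduces a new training paradigm, audio difference learning, to improve audio captioning. The core idea is to create a feature representation space that preserves the relationships between audio clips, which makes it possible to generate captions that describe intricate audio information. The method takes a reference audio together with the input audio and converts both into feature representations with a shared encoder; captions are then generated from the differential features to describe how the two differ. The study also proposes mixing the input audio with additional audio and using that additional audio as the reference. The difference between the mixed audio and the reference then reverts to the original input audio, so the original input's caption can serve as the caption for the difference and no extra annotation of differences is needed. In experiments on the Clotho and ESC50 datasets, the method improves the SPIDEr score by 7% over conventional methods.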
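
The mixing trick can be illustrated with a toy example. This is only a sketch under strong assumptions: the encoder here is a fixed linear filter and the difference is a plain subtraction of features, both chosen for illustration. The paper's encoder is learned, so the identity below would hold only approximately there. All names (`encode`, `mix`, `difference_features`) are hypothetical.

```python
def encode(signal, kernel=(0.5, 0.3, 0.2)):
    """Toy linear encoder: a short weighted sum slid over the waveform."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def mix(a, b):
    """Sample-wise sum of two equal-length waveforms."""
    return [x + y for x, y in zip(a, b)]


def difference_features(mixed, reference):
    """Feature-space difference between the mixed audio and the reference."""
    return [m - r for m, r in zip(encode(mixed), encode(reference))]


input_audio = [0.1, -0.2, 0.4, 0.0, 0.3]
extra_audio = [0.5, 0.5, -0.1, 0.2, -0.3]  # also used as the reference

mixed = mix(input_audio, extra_audio)
diff = difference_features(mixed, extra_audio)
target = encode(input_audio)  # features of the original input alone
```

Because the difference matches the features of the original input, the existing caption of `input_audio` can supervise the difference without any new annotation.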

Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection

paper_url: http://arxiv.org/abs/2309.08099
repo_url: https://github.com/zhu00121/universal-representation-dynamics-of-deepfake-speech
paper_authors: Yi Zhu, Saurabh Powar, Tiago H. Falk
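for: This study aims to improve the generalizability of deepfake speech detection systems to attacks unseen during training.
methods: The study proposes a new method for assessing the long-term temporal dynamics of universal speech representations.
results: Experiments on the ASVspoof 2019 and 2021 datasets show that the method improves detection of deepfakes from generative methods unseen during training, significantly improving on several benchmark methods.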

Abstract
Existing deepfake speech detection systems lack generalizability to unseen attacks (i.e., samples generated by generative algorithms not seen during training). Recent studies have explored the use of universal speech representations to tackle this issue and have obtained inspiring results. These works, however, have focused on innovating downstream classifiers while leaving the representation itself untouched. In this study, we argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability and propose a new method to assess representation dynamics. Indeed, we show that different generative models generate similar representation dynamics patterns with our proposed method. Experiments on the ASVspoof 2019 and 2021 datasets validate the benefits of the proposed method to detect deepfakes from methods unseen during training, significantly improving on several benchmark methods.

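Summary
Existing deepfake speech detection systems generalize poorly to unseen attacks, i.e., samples produced by generative algorithms not seen during training. Recent studies have used universal speech representations to tackle this issue and obtained encouraging results, but they focused on innovating downstream classifiers while leaving the representation itself untouched. We argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability, and we propose a new method to assess representation dynamics. With this method, we show that different generative models produce similar patterns of representation dynamics. Experiments on the ASVspoof 2019 and 2021 datasets confirm that the method helps detect deepfakes from generation methods unseen during training, significantly improving on several benchmark methods.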

2023-09-15

cs.LG

cs.LG - 2023-09-15

BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials

paper_url: http://arxiv.org/abs/2309.08788
repo_url: None
paper_authors: Rachel K. Luu, Markus J. Buehler
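for: To accelerate discovery and guide insights, this study reports BioinspiredLLM, an open-source autoregressive transformer large language model for structural biological and bio-inspired materials.
methods: The model was fine-tuned on a corpus of over a thousand peer-reviewed articles and can be prompted to recall information actively and interactively, assist with research tasks, and serve as an engine for creativity.
results: The model accurately recalls information about biological materials, formulates biomaterials questions and answers that can be used to evaluate its own performance, and develops sound hypotheses about biological materials design. It also shows promise in collaborating with other generative AI models in a workflow that can reshape the traditional materials design process.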

Abstract
The study of biological materials and bio-inspired materials science is well established; however, surprisingly little knowledge has been systematically translated to engineering solutions. To accelerate discovery and guide insights, an open-source autoregressive transformer large language model, BioinspiredLLM, is reported. The model was finetuned with a corpus of over a thousand peer-reviewed articles in the field of structural biological and bio-inspired materials and can be prompted to actively and interactively recall information, assist with research tasks, and function as an engine for creativity. The model has proven by example that it is not only able to accurately recall information about biological materials when queried but also formulate biomaterials questions and answers that can evaluate its own performance. BioinspiredLLM also has been shown to develop sound hypotheses regarding biological materials design and remarkably so for materials that have never been explicitly studied before. Lastly, the model showed impressive promise in collaborating with other generative artificial intelligence models in a workflow that can reshape the traditional materials design process. This collaborative generative artificial intelligence method can stimulate and enhance bio-inspired materials design workflows. Biological materials is at a critical intersection of multiple scientific fields and models like BioinspiredLLM help to connect knowledge domains.

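Summary
The study of biological and bio-inspired materials is well established, yet surprisingly little of this knowledge has been systematically translated into engineering solutions. To accelerate discovery and guide insights, we report BioinspiredLLM, an open-source autoregressive transformer large language model. The model was fine-tuned on a corpus of over a thousand peer-reviewed articles on structural biological and bio-inspired materials, and it can be prompted to recall information actively and interactively, assist with research tasks, and act as an engine for creativity. It accurately recalls information about biological materials when queried and also formulates biomaterials questions and answers that can be used to evaluate its own performance. BioinspiredLLM also develops sound hypotheses about biological materials design, notably for materials that have never been explicitly studied. Finally, the model shows promise in collaborating with other generative AI models in a workflow that can reshape the traditional materials design process and enhance bio-inspired design workflows. Biological materials sit at the intersection of multiple scientific fields, and models like BioinspiredLLM help connect these knowledge domains.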

Beyond Labels: Leveraging Deep Learning and LLMs for Content Metadata

paper_url: http://arxiv.org/abs/2309.08787
repo_url: None
paper_authors: Saurabh Agrawal, John Trenkle, Jaya Kawale
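for: This work examines the use of content metadata in movie recommender systems, focusing on the genre labels of movies and TV series, to support personalized recommendations and item cold starting.
methods: It proposes a new way of examining genre information, called the Genre Spectrum, which captures the various nuanced genres in a title.
results: Offline and online experiments corroborate the effectiveness of the approach. The work also discusses using LLMs to augment content metadata, which could eventually help organize recommendations in a user's 2-D home grid.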

Abstract
Content metadata plays a very important role in movie recommender systems as it provides valuable information about various aspects of a movie such as genre, cast, plot synopsis, box office summary, etc. Analyzing the metadata can help understand the user preferences to generate personalized recommendations and item cold starting. In this talk, we will focus on one particular type of metadata - genre labels. Genre labels associated with a movie or a TV series help categorize a collection of titles into different themes and correspondingly setting up the audience expectation. We present some of the challenges associated with using genre label information and propose a new way of examining the genre information that we call as the Genre Spectrum. The Genre Spectrum helps capture the various nuanced genres in a title and our offline and online experiments corroborate the effectiveness of the approach. Furthermore, we also talk about applications of LLMs in augmenting content metadata which could eventually be used to achieve effective organization of recommendations in user's 2-D home-grid.

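Summary
Content metadata is central to movie recommender systems because it describes many aspects of a title, such as genre, cast, plot synopsis, and box office summary. Analyzing it helps in understanding user preferences, generating personalized recommendations, and handling item cold starts. This talk focuses on one type of metadata: the genre labels of movies and TV series. These labels sort a collection of titles into themes and set audience expectations accordingly. We describe the challenges of using genre label information and propose a new way of examining it, which we call the Genre Spectrum. The Genre Spectrum captures the nuanced genres within a title, and our offline and online experiments corroborate its effectiveness. We also discuss using large language models (LLMs) to augment content metadata, which could eventually help organize recommendations in a user's 2-D home grid.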

Long-term Neurological Sequelae in Post-COVID-19 Patients: A Machine Learning Approach to Predict Outcomes

paper_url: http://arxiv.org/abs/2309.09993
repo_url: None
paper_authors: Hayder A. Albaqer, Kadhum J. Al-Jibouri, John Martin, Fadhil G. Al-Amran, Salman Rawaf, Maitham G. Yousif
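for: This study investigates long-term neurological sequelae in post-COVID-19 patients.
methods: A machine learning approach based on diverse clinical data and neuroimaging parameters is used to predict long-term neurological outcomes.
results: 68% of the post-COVID-19 patients reported neurological symptoms, most commonly fatigue, headache, and anosmia, and 22% exhibited more severe neurological complications, including encephalopathy and stroke. Machine learning models showed promise in predicting long-term neurological outcomes.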

Abstract
The COVID-19 pandemic has brought to light a concerning aspect of long-term neurological complications in post-recovery patients. This study delved into the investigation of such neurological sequelae in a cohort of 500 post-COVID-19 patients, encompassing individuals with varying illness severity. The primary aim was to predict outcomes using a machine learning approach based on diverse clinical data and neuroimaging parameters. The results revealed that 68% of the post-COVID-19 patients reported experiencing neurological symptoms, with fatigue, headache, and anosmia being the most common manifestations. Moreover, 22% of the patients exhibited more severe neurological complications, including encephalopathy and stroke. The application of machine learning models showed promising results in predicting long-term neurological outcomes. Notably, the Random Forest model achieved an accuracy of 85%, sensitivity of 80%, and specificity of 90% in identifying patients at risk of developing neurological sequelae. These findings underscore the importance of continuous monitoring and follow-up care for post-COVID-19 patients, particularly in relation to potential neurological complications. The integration of machine learning-based outcome prediction offers a valuable tool for early intervention and personalized treatment strategies, aiming to improve patient care and clinical decision-making. In conclusion, this study sheds light on the prevalence of long-term neurological complications in post-COVID-19 patients and demonstrates the potential of machine learning in predicting outcomes, thereby contributing to enhanced patient management and better health outcomes. Further research and larger studies are warranted to validate and refine these predictive models and to gain deeper insights into the underlying mechanisms of post-COVID-19 neurological sequelae.

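Summary
The COVID-19 pandemic has exposed a worrying pattern of long-term neurological complications in recovered patients. This study investigated such sequelae in a cohort of 500 post-COVID-19 patients with varying illness severity, with the aim of predicting outcomes from diverse clinical data and neuroimaging parameters using machine learning. 68% of the patients reported neurological symptoms, most commonly fatigue, headache, and anosmia, and 22% exhibited more severe complications, including encephalopathy and stroke. Machine learning models showed promise in predicting long-term neurological outcomes; the Random Forest model reached an accuracy of 85%, a sensitivity of 80%, and a specificity of 90% in identifying patients at risk. These findings underline the importance of continuous monitoring and follow-up care for post-COVID-19 patients, especially with respect to potential neurological complications. Machine-learning-based outcome prediction offers a tool for early intervention and personalized treatment strategies that can support patient care and clinical decision-making. Further research and larger studies are needed to validate and refine these predictive models and to better understand the mechanisms behind post-COVID-19 neurological sequelae.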
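
For readers unfamiliar with the three reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a confusion matrix. The counts are hypothetical and chosen only so that the result reproduces the reported 85% / 80% / 90%; the study's actual confusion matrix is not given in the abstract.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on positives), specificity (recall on negatives)."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts (not from the paper): 100 at-risk and 100 not-at-risk patients.
metrics = classification_metrics(tp=80, fn=20, tn=90, fp=10)
```

Accuracy is the prevalence-weighted mean of sensitivity and specificity, so 0.80p + 0.90(1 - p) = 0.85 holds only for p = 0.5: the three reported figures are mutually consistent only if the evaluation set contained equal numbers of at-risk and not-at-risk patients.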

Mining Patents with Large Language Models Demonstrates Congruence of Functional Labels and Chemical Structures

paper_url: http://arxiv.org/abs/2309.08765
repo_url: https://github.com/kosonocky/chef
paper_authors: Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte, Andrew D. Ellington
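for: The paper aims to build a model that predicts chemical function from chemical structure.
methods: Large language models are applied to chemical patents to extract information about chemical function. A ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline produces the Chemical Function (CheF) dataset of 100K molecules with patent-derived functional labels.
results: The paper finds a strong relationship between functional labels and chemical structural space, and the co-occurrence graph of the labels reveals functional relatedness among compounds. A model trained on CheF assigns functional labels to new compounds; it retrodicted approved Hepatitis C antivirals, uncovered an antiviral mechanism undisclosed in the patent, and identified plausible serotonin-related drugs.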

Abstract
Predicting chemical function from structure is a major goal of the chemical sciences, from the discovery and repurposing of novel drugs to the creation of new materials. Recently, new machine learning algorithms are opening up the possibility of general predictive models spanning many different chemical functions. Here, we consider the challenge of applying large language models to chemical patents in order to consolidate and leverage the information about chemical functionality captured by these resources. Chemical patents contain vast knowledge on chemical function, but their usefulness as a dataset has historically been neglected due to the impracticality of extracting high-quality functional labels. Using a scalable ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline, we derive a Chemical Function (CheF) dataset, containing 100K molecules and their patent-derived functional labels. The functional labels were validated to be of high quality, allowing us to detect a strong relationship between functional label and chemical structural spaces. Further, we find that the co-occurrence graph of the functional labels contains a robust semantic structure, which allowed us in turn to examine functional relatedness among the compounds. We then trained a model on the CheF dataset, allowing us to assign new functional labels to compounds. Using this model, we were able to retrodict approved Hepatitis C antivirals, uncover an antiviral mechanism undisclosed in the patent, and identify plausible serotonin-related drugs. The CheF dataset and associated model offers a promising new approach to predict chemical functionality.

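Summary
Predicting chemical function from structure is a major goal of the chemical sciences, from discovering and repurposing drugs to creating new materials. New machine learning algorithms are opening up the possibility of general predictive models that span many chemical functions. Here we apply large language models to chemical patents in order to consolidate and use the information about chemical functionality that they contain. Patents hold vast knowledge about chemical function, but they have been neglected as a dataset because extracting high-quality functional labels was impractical. Using a scalable ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline, we derive the Chemical Function (CheF) dataset, which contains 100K molecules and their patent-derived functional labels. The labels were validated to be of high quality, which let us detect a strong relationship between functional-label space and chemical structural space. The co-occurrence graph of the labels also has a robust semantic structure, which let us examine functional relatedness among the compounds. We then trained a model on CheF to assign functional labels to new compounds. With this model we retrodicted approved Hepatitis C antivirals, uncovered an antiviral mechanism undisclosed in the patent, and identified plausible serotonin-related drugs. The CheF dataset and its model offer a promising new approach to predicting chemical functionality.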
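
A label co-occurrence graph of the kind mentioned above can be built by counting how often two labels are attached to the same molecule. The sketch below is a generic illustration, not the CheF pipeline; the label strings are made up for the example and the function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(label_sets):
    """Count how many molecules carry each unordered pair of labels.

    Each key is an alphabetically sorted (label_a, label_b) tuple and can be
    read as a weighted edge of the co-occurrence graph.
    """
    counts = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical labels for three molecules (not taken from the CheF dataset).
label_sets = [
    ["antiviral", "protease inhibitor"],
    ["antiviral", "protease inhibitor", "hcv"],
    ["serotonin", "antidepressant"],
]
edges = cooccurrence_counts(label_sets)
```

Labels that frequently share molecules end up connected by heavy edges, which is one simple way to surface related functions.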

Circular Clustering with Polar Coordinate Reconstruction

paper_url: http://arxiv.org/abs/2309.08757
repo_url: None
paper_authors: Xiaoxiao Sun, Paul Sajda
for: Characterizing circular data found in biological systems, such as signal phase in neural recordings and nucleotide sequences in round genomes.
methods: Proposes a new analysis framework that utilizes projections onto a cylindrical coordinate system to better represent objects in a polar coordinate system, leveraging the mathematical properties of circular data to always find the correct clustering result within the reconstructed dataset.
results: Demonstrates more appropriate and consistent clustering results compared to standard methods on synthetic and real data, providing a more accurate and efficient way to cluster circular data.

Abstract
There is a growing interest in characterizing circular data found in biological systems. Such data are wide ranging and varied, from signal phase in neural recordings to nucleotide sequences in round genomes. Traditional clustering algorithms are often inadequate due to their limited ability to distinguish differences in the periodic component. Current clustering schemes that work in a polar coordinate system have limitations, such as being only angle-focused or lacking generality. To overcome these limitations, we propose a new analysis framework that utilizes projections onto a cylindrical coordinate system to better represent objects in a polar coordinate system. Using the mathematical properties of circular data, we show our approach always finds the correct clustering result within the reconstructed dataset, given sufficient periodic repetitions of the data. Our approach is generally applicable and adaptable and can be incorporated into most state-of-the-art clustering algorithms. We demonstrate on synthetic and real data that our method generates more appropriate and consistent clustering results compared to standard methods. In summary, our proposed analysis framework overcomes the limitations of existing polar coordinate-based clustering methods and provides a more accurate and efficient way to cluster circular data.

Summary
Interest is growing in characterizing the circular data found in biological systems, which range from signal phase in neural recordings to nucleotide sequences in round genomes. Traditional clustering algorithms often fall short because they cannot adequately distinguish differences in the periodic component, and existing schemes that work in a polar coordinate system are either only angle-focused or lack generality. We propose an analysis framework that uses projections onto a cylindrical coordinate system to better represent objects in a polar coordinate system. Using the mathematical properties of circular data, we show that the approach always finds the correct clustering result within the reconstructed dataset, given sufficient periodic repetitions of the data. The approach is generally applicable and can be incorporated into most state-of-the-art clustering algorithms. On synthetic and real data, it produces more appropriate and consistent clustering results than standard methods.
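
The difficulty that motivates this work can be seen in a few lines: a distance that ignores periodicity treats two nearly identical angles on either side of the wrap-around point as far apart. The sketch below only illustrates that problem with a standard shortest-arc distance; it is not the paper's cylindrical reconstruction, and the function names are hypothetical.

```python
import math


def linear_distance(a, b):
    """Distance that ignores periodicity (what a naive clusterer would use)."""
    return abs(a - b)


def circular_distance(a, b):
    """Shortest arc length between two angles given in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


# Two phases just either side of the 0 / 2*pi wrap-around point.
a, b = 0.1, 2 * math.pi - 0.1
naive = linear_distance(a, b)
wrapped = circular_distance(a, b)
```

Here `naive` is close to a full turn while `wrapped` is small, so a clustering algorithm fed the naive distance would split points that belong together.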

Diverse Neural Audio Embeddings – Bringing Features back !

paper_url: http://arxiv.org/abs/2309.08751
repo_url: None
paper_authors: Prateek Verma
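for: This work learns audio embeddings from diverse feature representations, with separate embeddings for audio properties such as pitch and timbre.
methods: The study combines an end-to-end architecture with embeddings learned from domain-specific, handcrafted features such as pitch and timbre.
results: Handcrafted embeddings alone do not beat a fully end-to-end representation, but combining them with the end-to-end embedding significantly improves performance over the end-to-end model alone.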

Abstract
With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. We in this paper, learn audio embeddings via diverse feature representations, in this case, domain-specific. For the case of audio classification over hundreds of categories of sound, we learn robust separate embeddings for diverse audio properties such as pitch, timbre, and neural representation, along with also learning it via an end-to-end architecture. We observe handcrafted embeddings, e.g., pitch and timbre-based, although on their own, are not able to beat a fully end-to-end representation, yet adding these together with end-to-end embedding helps us, significantly improve performance. This work would pave the way to bring some domain expertise with end-to-end models to learn robust, diverse representations, surpassing the performance of just training end-to-end models.

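Summary
With the advent of modern AI architectures, the field has shifted towards end-to-end architectures. As a result, neural networks are trained without domain-specific biases or knowledge and are optimized purely for the task. In this paper we learn audio embeddings from diverse, domain-specific feature representations. For audio classification over hundreds of sound categories, we learn robust separate embeddings for audio properties such as pitch, timbre, and a neural representation, and we also learn an embedding with an end-to-end architecture. Handcrafted embeddings, such as those based on pitch and timbre, cannot beat a fully end-to-end representation on their own, but adding them to the end-to-end embedding significantly improves performance. This work points towards combining domain expertise with end-to-end models to learn robust, diverse representations that outperform end-to-end training alone.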

Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

paper_url: http://arxiv.org/abs/2309.08748
repo_url: None
paper_authors: Yi Shen, Pan Xu, Michael M. Zavlanos
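for: This study proposes a distributionally robust optimization method based on the Wasserstein distance for evaluating and learning policies from offline data without direct interaction with the environment.
methods: The study uses distributionally robust optimization with a Wasserstein uncertainty set, together with a regularized method that reduces computational cost and a practical (biased) stochastic gradient descent method that optimizes the policy efficiently.
results: The study provides a theoretical analysis of the finite sample complexity and iteration complexity of the method and validates it on a public dataset recorded in a randomized stroke trial.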

Abstract
Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive compared to KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.

Summary
Off-policy evaluation and learning assess a given policy and learn an optimal policy from offline data, without interacting directly with the environment. The environment in which the data were collected often differs from the one in which the learned policy is applied. To account for this mismatch, distributionally robust optimization (DRO) methods compute worst-case bounds on policy values, assuming that the distribution of the new environment lies within an uncertainty set. This set is usually defined by the KL divergence around the empirical distribution of the logging dataset. The KL uncertainty set, however, cannot contain distributions with different support and is unaware of the geometry of the support, so KL approaches handle practical environment mismatches poorly and over-fit to worst-case scenarios. To overcome these limitations, we propose a DRO approach that uses the Wasserstein distance instead. Wasserstein DRO is generally more expensive to compute than KL DRO, so we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also analyze the finite sample complexity and iteration complexity of the method, and we validate it on a public dataset recorded in a randomized stroke trial.
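
The contrast between the two uncertainty sets can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $\hat{P}_n$ is the empirical distribution of the logged data, $V(\pi; Q)$ is the expected reward of policy $\pi$ when the environment follows $Q$, and $\delta$ is the radius of the uncertainty set.

```latex
% Robust value of a policy: the worst case over an uncertainty set around \hat{P}_n.
\[
  V_{\mathrm{rob}}(\pi) \;=\; \inf_{Q \in \mathcal{U}_\delta(\hat{P}_n)} V(\pi; Q)
\]
% Two choices of uncertainty set with radius \delta.
\[
  \mathcal{U}^{\mathrm{KL}}_\delta(\hat{P}_n)
    = \bigl\{\, Q : D_{\mathrm{KL}}(Q \,\|\, \hat{P}_n) \le \delta \,\bigr\},
  \qquad
  \mathcal{U}^{W}_\delta(\hat{P}_n)
    = \bigl\{\, Q : W(Q, \hat{P}_n) \le \delta \,\bigr\}
\]
```

A finite KL divergence requires $Q$ to be absolutely continuous with respect to $\hat{P}_n$, so every distribution in the KL set can only reweight the logged samples. The Wasserstein set also contains distributions that move probability mass to nearby points that were never observed, at a cost set by the ground metric, which is how it accounts for changes of support and for the geometry of the sample space.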

Experimental Assessment of a Forward-Collision Warning System Fusing Deep Learning and Decentralized Radio Sensing

paper_url: http://arxiv.org/abs/2309.08737
repo_url: None
paper_authors: Jorge D. Cardenas, Omar Contreras-Ponce, Carlos A. Gutierrez, Ruth Aguilar-Ponce, Francisco R. Castillo-Soria, Cesar A. Azurdia-Meza
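for: This paper presents an automatic forward-collision warning system based on a decentralized radio sensing (RS) approach.
methods: A vehicle in receiving mode uses a continuous waveform (CW) transmitted by a second vehicle as a probe signal to detect oncoming vehicles and warn the driver of a potential forward collision. The CW can easily be incorporated as a pilot signal in the data frame of current multicarrier vehicular communication systems. A deep learning (DL) module detects oncoming vehicles by analyzing the Doppler signature that a rapidly approaching vehicle imprints on the CW probe signal.
results: The approach was assessed with data from a series of field trials on a two-lane high-speed highway. Detection performance was evaluated for two DL models, a long short-term memory network and a convolutional neural network, and the results demonstrate the feasibility of a forward-collision warning system that fuses DL with decentralized CW RS.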

Abstract
This paper presents the idea of an automatic forward-collision warning system based on a decentralized radio sensing (RS) approach. In this framework, a vehicle in receiving mode employs a continuous waveform (CW) transmitted by a second vehicle as a probe signal to detect oncoming vehicles and warn the driver of a potential forward collision. Such a CW can easily be incorporated as a pilot signal within the data frame of current multicarrier vehicular communication systems. Detection of oncoming vehicles is performed by a deep learning (DL) module that analyzes the features of the Doppler signature imprinted on the CW probe signal by a rapidly approaching vehicle. This decentralized CW RS approach was assessed experimentally using data collected by a series of field trials conducted in a two-lane high-speed highway. Detection performance was evaluated for two different DL models: a long short-term memory network and a convolutional neural network. The obtained results demonstrate the feasibility of the envisioned forward-collision warning system based on the fusion of DL and decentralized CW RS.

Summary
This paper presents an automatic forward-collision warning system based on decentralized radio sensing (RS). A vehicle in receiving mode uses a continuous waveform (CW) transmitted by a second vehicle as a probe signal to detect oncoming vehicles and warn the driver of a potential forward collision. Detection is performed by a deep learning (DL) module that analyzes the features of the Doppler signature imprinted on the CW probe signal by a rapidly approaching vehicle. The approach was tested experimentally using data collected from field trials on a two-lane high-speed highway, and detection performance was evaluated with two DL models: a long short-term memory network and a convolutional neural network. The results show that a forward-collision warning system based on the fusion of DL and decentralized CW RS is feasible.
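
To give a sense of scale for the Doppler signature, the sketch below evaluates the standard first-order Doppler relation for a single propagation path whose length shrinks at a given rate. The carrier frequency and speed are assumptions picked for illustration (5.9 GHz is a band commonly associated with vehicular communication); the abstract does not state the values used in the trials, and the function name is hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def doppler_shift_hz(path_shortening_rate_mps, carrier_hz):
    """First-order Doppler shift for one propagation path.

    path_shortening_rate_mps: rate at which the transmitter-to-receiver path
    length decreases (positive when the path gets shorter). The formula is
    valid when this rate is much smaller than the speed of light.
    """
    return path_shortening_rate_mps * carrier_hz / SPEED_OF_LIGHT


# Assumed example values, not taken from the paper.
carrier_hz = 5.9e9          # hypothetical CW probe frequency
closing_rate_mps = 60.0     # e.g. two vehicles approaching at 30 m/s each
shift = doppler_shift_hz(closing_rate_mps, carrier_hz)
```

With these assumed numbers the shift is on the order of a kilohertz, small relative to the carrier but easy to resolve in a spectrogram of the received CW, which is the kind of input a DL detector can work from.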

Pointing the Way: Refining Radar-Lidar Localization Using Learned ICP Weights

paper_url: http://arxiv.org/abs/2309.08731
repo_url: None
paper_authors: Daniil Lisus, Johann Laconte, Keenan Burnett, Timothy D. Barfoot
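for: Improves the localization of radar measurements against lidar maps, with the aim of making autonomous driving systems more robust and safe.
methods: Uses deep learning to weight radar points before they are matched to high-quality lidar maps in an ICP-based radar-lidar localization system.
results: On real-world autonomous driving data, the method reduces localization errors in radar-lidar ICP results by up to 54.94% in translation and 68.39% in rotation, while maintaining interpretability and robustness.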

Abstract
This paper presents a novel deep-learning-based approach to improve localizing radar measurements against lidar maps. Although the state of the art for localization is matching lidar data to lidar maps, radar has been considered as a promising alternative, as it is potentially more resilient against adverse weather such as precipitation and heavy fog. To make use of existing high-quality lidar maps, while maintaining performance in adverse weather, matching radar data to lidar maps is of interest. However, owing in part to the unique artefacts present in radar measurements, radar-lidar localization has struggled to achieve comparable performance to lidar-lidar systems, preventing it from being viable for autonomous driving. This work builds on an ICP-based radar-lidar localization system by including a learned preprocessing step that weights radar points based on high-level scan information. Combining a proven analytical approach with a learned weight reduces localization errors in radar-lidar ICP results run on real-world autonomous driving data by up to 54.94% in translation and 68.39% in rotation, while maintaining interpretability and robustness.
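
The weighting idea can be illustrated with a minimal sketch, assuming matched 2D point pairs and weights that are already known. Each ICP iteration solves a weighted rigid alignment, and down-weighting unreliable radar points changes that solution. The function name `weighted_rigid_align_2d` and the hand-set weights are illustrative assumptions: in the paper the weights come from a learned network, and the nearest-neighbour matching step of ICP is omitted here.

```python
import math


def weighted_rigid_align_2d(src, dst, weights):
    """Closed-form weighted 2D rigid alignment of matched point pairs.

    Returns (theta, tx, ty) minimizing
    sum_i w_i * ||R(theta) @ src_i + t - dst_i||^2.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    # Weighted centroids of both point sets.
    sx = sum(w * p[0] for w, p in zip(weights, src)) / total
    sy = sum(w * p[1] for w, p in zip(weights, src)) / total
    dx = sum(w * q[0] for w, q in zip(weights, dst)) / total
    dy = sum(w * q[1] for w, q in zip(weights, dst)) / total

    # Weighted dot and cross products of the centered pairs.
    dot = 0.0
    cross = 0.0
    for w, (px, py), (qx, qy) in zip(weights, src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += w * (ax * bx + ay * by)
        cross += w * (ax * by - ay * bx)

    # The optimal rotation angle maximizes cos(theta)*dot + sin(theta)*cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # The translation maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty
```

Setting a weight to zero removes that pair from the alignment, which is how a spurious radar return stops pulling the estimate.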

Summary
This work builds upon an ICP-based radar-lidar localization system by incorporating a learned preprocessing step that weights radar points based on high-level scan information. By combining a proven analytical approach with a learned weight, the proposed method reduces localization errors in radar-lidar ICP results on real-world autonomous driving data by up to 54.94% in translation and 68.39% in rotation, while maintaining interpretability and robustness.

Clustered Multi-Agent Linear Bandits

paper_url: http://arxiv.org/abs/2309.08710
repo_url: None
paper_authors: Hamza Cherkaoui, Merwan Barlier, Igor Colin
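for: This paper addresses the multi-agent linear stochastic bandit problem with clustered structure.
methods: The paper proposes a novel algorithm in which agents collaborate to accelerate the overall optimization problem. A network controller estimates the underlying cluster structure of the network and optimizes experience sharing among agents in the same group.
results: Experiments on synthetic and real data show that the algorithm reduces regret while correctly recovering the true underlying cluster structure.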

Abstract
We address in this paper a particular instance of the multi-agent linear stochastic bandit problem, called clustered multi-agent linear bandits. In this setting, we propose a novel algorithm leveraging an efficient collaboration between the agents in order to accelerate the overall optimization problem. In this contribution, a network controller is responsible for estimating the underlying cluster structure of the network and optimizing the experiences sharing among agents within the same groups. We provide a theoretical analysis for both the regret minimization problem and the clustering quality. Through empirical evaluation against state-of-the-art algorithms on both synthetic and real data, we demonstrate the effectiveness of our approach: our algorithm significantly improves regret minimization while managing to recover the true underlying cluster partitioning.

Summary
We address a particular instance of the multi-agent linear stochastic bandit problem, called clustered multi-agent linear bandits. In this setting, we propose a novel algorithm that leverages efficient collaboration between agents to accelerate the overall optimization problem. In this contribution, a network controller is responsible for estimating the underlying cluster structure of the network and optimizing the experience sharing among agents within the same groups. We provide a theoretical analysis for both regret minimization and clustering quality. Through empirical evaluation against state-of-the-art algorithms on both synthetic and real data, we demonstrate the effectiveness of our approach: our algorithm significantly improves regret minimization while managing to recover the true underlying cluster partitioning.

Price of Safety in Linear Best Arm Identification

paper_url: http://arxiv.org/abs/2309.08709
repo_url: None
paper_authors: Xuedong Shang, Igor Colin, Merwan Barlier, Hamza Cherkaoui
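for: Introduces a safe best-arm identification framework with linear feedback, in which the agent must act conservatively at every round so that the safety constraint is not violated.
methods: Proposes a gap-based algorithm that identifies the best arm while guaranteeing safety, exploiting the linear structure of the problem.
results: The algorithm achieves meaningful sample complexity while ensuring stage-wise safety, at the price of an extra term in the sample complexity caused by the forced exploration phase that the safety constraint imposes. Experiments support the design of the algorithm.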

Abstract
We introduce the safe best-arm identification framework with linear feedback, where the agent is subject to some stage-wise safety constraint that linearly depends on an unknown parameter vector. The agent must take actions in a conservative way so as to ensure that the safety constraint is not violated with high probability at each round. Ways of leveraging the linear structure for ensuring safety have been studied for regret minimization, but not for best-arm identification to the best of our knowledge. We propose a gap-based algorithm that achieves meaningful sample complexity while ensuring the stage-wise safety. We show that we pay an extra term in the sample complexity due to the forced exploration phase incurred by the additional safety constraint. Experimental illustrations are provided to justify the design of our algorithm.

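Summary
We introduce a safe best-arm identification framework with linear feedback, in which the agent is subject to a stage-wise safety constraint that depends linearly on an unknown parameter vector. The agent must act conservatively so that, with high probability, the safety constraint is not violated at any round. Ways of exploiting the linear structure to ensure safety have been studied for regret minimization but, to the best of our knowledge, not for best-arm identification. We propose a gap-based algorithm that achieves meaningful sample complexity while ensuring stage-wise safety. We show that the forced exploration phase induced by the safety constraint costs an extra term in the sample complexity. Experiments are provided to justify the design of the algorithm.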

Wasserstein Distributionally Robust Control Barrier Function using Conditional Value-at-Risk with Differentiable Convex Programming

paper_url: http://arxiv.org/abs/2309.08700
repo_url: None
paper_authors: Alaa Eddine Chriat, Chuangchuang Sun
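for: Designs safe controllers for real-world safety-critical systems whose perception of the environment is subject to stochasticity and distributional shift.
methods: Uses a distributionally robust control barrier function (DR-CBF) to achieve resilience under distributional shift while keeping the computational efficiency and forward invariance of CBFs.
results: Simulations validate the chance-constrained safety guarantee of DR-CBF under distributional shift in first- and second-order systems, and an approximate variant is provided for higher-order systems.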

Abstract
Control Barrier functions (CBFs) have attracted extensive attention for designing safe controllers for their deployment in real-world safety-critical systems. However, the perception of the surrounding environment is often subject to stochasticity and further distributional shift from the nominal one. In this paper, we present distributional robust CBF (DR-CBF) to achieve resilience under distributional shift while keeping the advantages of CBF, such as computational efficacy and forward invariance. To achieve this goal, we first propose a single-level convex reformulation to estimate the conditional value at risk (CVaR) of the safety constraints under distributional shift measured by a Wasserstein metric, which is by nature tri-level programming. Moreover, to construct a control barrier condition to enforce the forward invariance of the CVaR, the technique of differentiable convex programming is applied to enable differentiation through the optimization layer of CVaR estimation. We also provide an approximate variant of DR-CBF for higher-order systems. Simulation results are presented to validate the chance-constrained safety guarantee under the distributional shift in both first and second-order systems.
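
As a point of reference for the CVaR quantity used above, here is a minimal sketch of the empirical CVaR of a set of loss samples, computed with the Rockafellar-Uryasev formula CVaR_alpha(L) = min_t { t + E[(L - t)^+] / (1 - alpha) }. It assumes larger values are worse and a plain empirical distribution; the Wasserstein ambiguity set and the convex reformulation from the paper are not reproduced. The function name `empirical_cvar` is illustrative.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha via the Rockafellar-Uryasev formula.

    For an empirical distribution the objective is piecewise linear and
    convex in t with breakpoints at the samples, so it is enough to
    evaluate it at each sample and keep the minimum.
    """
    if not losses:
        raise ValueError("need at least one sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(losses)

    def objective(t):
        excess = sum(max(loss - t, 0.0) for loss in losses)
        return t + excess / ((1.0 - alpha) * n)

    return min(objective(t) for t in losses)
```

With `alpha = 0` this reduces to the sample mean, and as `alpha` approaches 1 it approaches the worst sample, which is why CVaR is a natural tail-risk measure for a safety constraint.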

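Summary
Control barrier functions (CBFs) have attracted extensive attention for designing safe controllers for real-world safety-critical systems. However, perception of the surrounding environment is often subject to stochasticity and to distributional shift from the nominal distribution. In this paper, we present the distributionally robust CBF (DR-CBF), which achieves resilience under distributional shift while keeping the advantages of CBFs, such as computational efficiency and forward invariance. To this end, we first propose a single-level convex reformulation for estimating the conditional value at risk (CVaR) of the safety constraints under distributional shift measured by a Wasserstein metric. We then use differentiable convex programming to construct a control barrier condition that enforces forward invariance of the CVaR. We also provide an approximate variant of DR-CBF for higher-order systems. Simulations validate the chance-constrained safety guarantee under distributional shift in first- and second-order systems.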

Evaluating the Impact of Local Differential Privacy on Utility Loss via Influence Functions

paper_url: http://arxiv.org/abs/2309.08678
repo_url: None
paper_authors: Alycia N. Carey, Minh-Hao Van, Xintao Wu
for: This paper focuses on improving the privacy parameter setting in differential privacy (DP) for randomized response-based local DP.
methods: The paper uses influence functions to approximate the change in test loss when randomized response is applied over features and/or labels, and provides a data curator with a way to select the best privacy parameter without requiring extensive computation.
results: The paper shows that influence functions can approximate the true change in test loss with small mean absolute error, especially when noise correction methods are applied, and provides an empirical analysis of the computational complexity of the proposed approach.

Abstract
How to properly set the privacy parameter in differential privacy (DP) has been an open question in DP research since it was first proposed in 2006. In this work, we demonstrate the ability of influence functions to offer insight into how a specific privacy parameter value will affect a model's test loss in the randomized response-based local DP setting. Our proposed method allows a data curator to select the privacy parameter best aligned with their allowed privacy-utility trade-off without requiring heavy computation such as extensive model retraining and data privatization. We consider multiple common randomization scenarios, such as performing randomized response over the features, and/or over the labels, as well as the more complex case of applying a class-dependent label noise correction method to offset the noise incurred by randomization. Further, we provide a detailed discussion over the computational complexity of our proposed approach inclusive of an empirical analysis. Through empirical evaluations we show that for both binary and multi-class settings, influence functions are able to approximate the true change in test loss that occurs when randomized response is applied over features and/or labels with small mean absolute error, especially in cases where noise correction methods are applied.
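
For readers unfamiliar with the mechanism being tuned, the following is a minimal sketch of binary randomized response under a privacy parameter `epsilon`, together with the standard debiasing of an observed frequency. It does not implement the influence-function estimate from the paper; the function names are illustrative assumptions.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability p, otherwise report the flipped bit."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def debias_frequency(observed_fraction, epsilon):
    """Invert E[observed] = pi * p + (1 - pi) * (1 - p) to recover pi.

    Requires epsilon > 0, otherwise p = 0.5 and the reports carry no signal.
    """
    p = keep_probability(epsilon)
    return (observed_fraction - (1.0 - p)) / (2.0 * p - 1.0)
```

A smaller `epsilon` pushes the keep probability towards 0.5, so more labels are flipped and the debiased estimate becomes noisier. That is the privacy-utility trade-off the paper tries to predict without retraining.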

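Summary
How to properly set the privacy parameter in differential privacy (DP) is a long-standing open question. In this work, we show that influence functions can offer insight into how a specific privacy parameter value will affect a model's test loss in the randomized response-based local DP setting, which helps a data curator select the privacy parameter that best fits their privacy-utility trade-off. The proposed method does not require heavy computation such as extensive model retraining and data privatization. We consider several common randomization scenarios, including randomized response over the features, over the labels, or both, as well as the more complex case of applying a class-dependent label noise correction method to offset the noise introduced by randomization. We also provide a detailed analysis of the computational complexity of the approach, together with an empirical evaluation. Our experiments show that, in both binary and multi-class settings, influence functions approximate the true change in test loss caused by randomized response with small mean absolute error, especially when label noise correction methods are applied.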

Attention-Only Transformers and Implementing MLPs with Attention Heads

paper_url: http://arxiv.org/abs/2309.08593
repo_url: None
paper_authors: Robert Huben, Valerie Morris
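for: The paper studies the multi-head attention mechanism of the transformer architecture and how attention can implement the linear transformations and activation functions of a conventional MLP.
methods: The paper uses masked attention heads to implement the linear transformations and activation functions of an MLP, and proves that attention heads can reproduce an MLP's functionality at the cost of a large increase in the number of attention heads.
results: The paper proves that attention heads can implement an MLP's linear transformations and activation functions separately, and that they can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.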

Abstract
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
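
One way to see why SiLU fits this result is a small numerical sketch: a softmax over two attention scores, `x` and `0`, puts weight `sigmoid(x)` on the first position, so a scalar head whose value at that position is also `x` outputs `x * sigmoid(x) = SiLU(x)`. The second position plays the role of a fixed reference token. That token, the zero score, and the function names are assumptions of this illustration and are not claimed to be the paper's exact construction.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_head_neuron(x):
    """Single query attending over two positions with scalar keys and values."""
    scores = [x, 0.0]  # score x on the current token, 0 on the reference token
    values = [x, 0.0]  # value x on the current token, 0 on the reference token
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))
```

Comparing `attention_head_neuron(x)` with `silu(x)` over a range of inputs shows that the two agree up to floating-point error.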

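Summary
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and multilayer perceptrons (MLPs). We prove that an MLP neuron can be implemented by a masked attention head, provided the MLP's activation function comes from a restricted class that includes SiLU and close approximations of ReLU and GeLU. An MLP-and-attention transformer can therefore be converted into an attention-only transformer, at the cost of greatly increasing the number of attention heads. We also prove that attention heads can separately perform the components of an MLP, namely linear transformations and activation functions. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.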

A Bayesian Approach to Robust Inverse Reinforcement Learning

paper_url: http://arxiv.org/abs/2309.08571
repo_url: https://github.com/rw422scarlet/bmirl_tf
paper_authors: Ran Wei, Siliang Zeng, Chenliang Li, Alfredo Garcia, Anthony McDonald, Mingyi Hong
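for: Proposes a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
methods: The method simultaneously estimates the expert's reward function and the expert's subjective model of the environment dynamics. A class of prior distributions, which parameterizes how accurate the expert's model of the environment is, is used to develop efficient estimation algorithms.
results: The analysis shows that the estimated policy performs robustly when the expert is believed a priori to have a highly accurate model of the environment. This is verified in MuJoCo environments, where the algorithms outperform state-of-the-art offline IRL algorithms.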

Abstract
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL). The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics. We make use of a class of prior distributions which parameterizes how accurate the expert's model of the environment is to develop efficient algorithms to estimate the expert's reward and subjective dynamics in high-dimensional settings. Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed (a priori) to have a highly accurate model of the environment. We verify this observation in the MuJoCo environments and show that our algorithms outperform state-of-the-art offline IRL algorithms.

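Summary
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL). Our framework differs from existing offline model-based IRL approaches in that it simultaneously estimates the expert's reward function and the expert's subjective model of the environment dynamics. We use a class of prior distributions that parameterizes how accurate the expert's model of the environment is, and from it develop efficient algorithms for estimating the expert's reward and subjective dynamics in high-dimensional settings. Our analysis reveals a novel insight: the estimated policy performs robustly when the expert is believed, a priori, to have a highly accurate model of the environment. We verify this observation in MuJoCo environments and show that our algorithms outperform state-of-the-art offline IRL algorithms.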

Neural Network Driven, Interactive Design for Nonlinear Optical Molecules Based on Group Contribution Method

paper_url: http://arxiv.org/abs/2309.08570
repo_url: None
paper_authors: Jinming Fan, Chao Qian, Shaodong Zhou
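for: Rational design of organic small-molecule nonlinear optical materials.
methods: Combines a multi-stage Bayesian neural network (msBNN) with a corrected Lewis-mode group contribution method (cLGC) to predict the optical properties of molecules accurately and efficiently.
results: Optical properties are predicted accurately and efficiently from only a small training data set, and structural search is achieved with a purpose-built evolutionary algorithm (EA).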

Abstract
A Lewis-mode group contribution method (LGC) -- multi-stage Bayesian neural network (msBNN) -- evolutionary algorithm (EA) framework is reported for rational design of D-Pi-A type organic small-molecule nonlinear optical materials. Upon combination of msBNN and corrected Lewis-mode group contribution method (cLGC), different optical properties of molecules are afforded accurately and efficiently, using only a small data set for training. Moreover, by employing the EA model designed specifically for LGC, structural search is well achievable. The logical origins of the good performance of the framework are discussed in detail. Considering that such a theory-guided machine learning framework combines chemical principles and data-driven tools, it will most likely prove efficient for solving molecular design problems in wider fields.

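Summary
A framework combining a Lewis-mode group contribution method (LGC), a multi-stage Bayesian neural network (msBNN), and an evolutionary algorithm (EA) is reported for the rational design of organic small-molecule nonlinear optical materials. Combining msBNN with the corrected Lewis-mode group contribution method (cLGC) yields the different optical properties of molecules accurately and efficiently, using only a small training data set. In addition, structural search is achievable with a specifically designed EA model. The article discusses in detail the logical origins of the framework's performance, and the framework may help solve molecular design problems in wider fields.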

Local Differential Privacy in Graph Neural Networks: a Reconstruction Approach

paper_url: http://arxiv.org/abs/2309.08569
repo_url: https://github.com/karuna-bhaila/rgnn
paper_authors: Karuna Bhaila, Wen Huang, Yongkai Wu, Xintao Wu
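for: Provides node privacy at the user level in graph neural networks while incurring low utility loss.
methods: Adopts the decentralized notion of Local Differential Privacy (LDP) and perturbs feature and label data at the node level. For high-dimensional feature settings, an LDP protocol is proposed, and reconstruction methods based on frequency estimation over the randomized data approximate the original features and labels.
results: Extensive experiments on real-world and semi-synthetic datasets demonstrate the validity of the proposed model.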

Abstract
Graph Neural Networks have achieved tremendous success in modeling complex graph data in a variety of applications. However, there are limited studies investigating privacy protection in GNNs. In this work, we propose a learning framework that can provide node privacy at the user level, while incurring low utility loss. We focus on a decentralized notion of Differential Privacy, namely Local Differential Privacy, and apply randomization mechanisms to perturb both feature and label data at the node level before the data is collected by a central server for model training. Specifically, we investigate the application of randomization mechanisms in high-dimensional feature settings and propose an LDP protocol with strict privacy guarantees. Based on frequency estimation in statistical analysis of randomized data, we develop reconstruction methods to approximate features and labels from perturbed data. We also formulate this learning framework to utilize frequency estimates of graph clusters to supervise the training procedure at a sub-graph level. Extensive experiments on real-world and semi-synthetic datasets demonstrate the validity of our proposed model.
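
The frequency-estimation step can be sketched for categorical labels, assuming generalized (k-ary) randomized response as the perturbation mechanism. This is a common LDP mechanism chosen here for illustration; the paper's own protocol for high-dimensional features is not reproduced, and the function names are hypothetical.

```python
import math
import random


def grr_probabilities(epsilon, k):
    """Return (p, q): probability of keeping the true label, and of each other label."""
    denom = math.exp(epsilon) + k - 1
    return math.exp(epsilon) / denom, 1.0 / denom


def perturb_label(label, epsilon, k, rng):
    """Keep the label with probability p, else pick uniformly among the other k - 1."""
    p, _ = grr_probabilities(epsilon, k)
    if rng.random() < p:
        return label
    other = rng.randrange(k - 1)
    return other if other < label else other + 1


def estimate_frequencies(reports, epsilon, k):
    """Unbiased label-frequency estimates from perturbed reports.

    Inverts E[observed_j] = pi_j * p + (1 - pi_j) * q for each label j.
    """
    p, q = grr_probabilities(epsilon, k)
    n = len(reports)
    counts = [0] * k
    for r in reports:
        counts[r] += 1
    return [(c / n - q) / (p - q) for c in counts]
```

Estimates of this kind, aggregated over a group of nodes, are the sort of signal the abstract describes using to supervise training at the sub-graph level.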

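Summary
Graph Neural Networks (GNNs) have achieved tremendous success in a variety of applications, but few studies have investigated privacy protection in GNNs. In this work, we propose a learning framework that can protect node privacy at the user level. We focus on a decentralized notion of Differential Privacy, namely Local Differential Privacy (LDP), and apply randomization mechanisms at the node level to perturb feature and label data before the data is collected by a central server for model training. We study high-dimensional feature settings and propose an LDP protocol with strict privacy guarantees. Based on frequency estimation in the statistical analysis of randomized data, we develop methods that reconstruct features and labels from the perturbed data. We further design the learning framework to use frequency estimates at the sub-graph level to supervise the training procedure. Extensive experiments on real-world and semi-synthetic datasets demonstrate the validity of the proposed model.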

Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization

paper_url: http://arxiv.org/abs/2309.08546
repo_url: None
paper_authors: Jack Foster, Alexandra Brintrup
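for: Addresses the need, in pursuit of long-term autonomy, for robotic agents to continually adapt to changing environments and learn to solve new tasks.
methods: Uses a prior-based continual learning approach to mitigate catastrophic forgetting.
results: Introduces Bayesian adaptive moment regularization (BAdam), a novel prior-based method that better constrains parameter growth and thereby lowers catastrophic forgetting. BAdam is lightweight and task-label-free, converges quickly, and offers calibrated uncertainty, which matters for safe real-world deployment. It achieves state-of-the-art performance among prior-based methods on single-headed class-incremental experiments such as Split MNIST and Split FashionMNIST, without relying on task labels or discrete task boundaries.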

Abstract
The pursuit of long-term autonomy mandates that robotic agents must continuously adapt to their changing environments and learn to solve new tasks. Continual learning seeks to overcome the challenge of catastrophic forgetting, where learning to solve new tasks causes a model to forget previously learnt information. Prior-based continual learning methods are appealing for robotic applications as they are space efficient and typically do not increase in computational complexity as the number of tasks grows. Despite these desirable properties, prior-based approaches typically fail on important benchmarks and consequently are limited in their potential applications compared to their memory-based counterparts. We introduce Bayesian adaptive moment regularization (BAdam), a novel prior-based method that better constrains parameter growth, leading to lower catastrophic forgetting. Our method boasts a range of desirable properties for robotic applications such as being lightweight and task label-free, converging quickly, and offering calibrated uncertainty that is important for safe real-world deployment. Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments such as Split MNIST and Split FashionMNIST, and does so without relying on task labels or discrete task boundaries.

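Summary
Long-term autonomy requires robotic agents to continually adapt to changing environments and learn to solve new tasks. Continual learning seeks to overcome catastrophic forgetting, where learning new tasks causes previously learnt information to be forgotten. Prior-based continual learning methods are attractive for robotic applications because they are space efficient and their computational complexity typically does not grow with the number of tasks. However, these methods usually perform poorly on important benchmarks, which limits their applications. We introduce Bayesian adaptive moment regularization (BAdam), a novel prior-based method that better constrains parameter growth and so reduces catastrophic forgetting. BAdam has many properties suited to robotic applications: it is lightweight, needs no task labels, converges quickly, and provides calibrated uncertainty, which is important for real-world deployment. Results show that BAdam achieves state-of-the-art performance for prior-based methods on single-headed class-incremental experiments such as Split MNIST and Split FashionMNIST, without relying on task labels or discrete task boundaries.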

Efficient and robust Sensor Placement in Complex Environments

paper_url: http://arxiv.org/abs/2309.08545
repo_url: None
paper_authors: Lukas Taus, Yen-Hsi Richard Tsai
for: solves the problem of efficient and unobstructed surveillance or communication in complex environments with a minimal number of sensors.
methods: proposes a greedy algorithm and explores deep learning techniques to accelerate the evaluation of the objective function formulated in the greedy algorithm.
results: discusses the differences in using greedy and $\epsilon$-greedy algorithms to generate data and their impact on the robustness of the network.

Abstract
We address the problem of efficient and unobstructed surveillance or communication in complex environments. On one hand, one wishes to use a minimal number of sensors to cover the environment. On the other hand, it is often important to consider solutions that are robust against sensor failure or adversarial attacks. This paper addresses these challenges of designing minimal sensor sets that achieve multi-coverage constraints -- every point in the environment is covered by a prescribed number of sensors. We propose a greedy algorithm to achieve the objective. Further, we explore deep learning techniques to accelerate the evaluation of the objective function formulated in the greedy algorithm. The training of the neural network reveals that the geometric properties of the data significantly impact the network's performance, particularly at the end stage. By taking into account these properties, we discuss the differences in using greedy and $\epsilon$-greedy algorithms to generate data and their impact on the robustness of the network.
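
The greedy idea can be sketched on a discretized environment, assuming the visibility of each candidate sensor is given as a set of point indices. The gain used here (the number of points a candidate sees that still need coverage) is a simple stand-in for the paper's objective function, and `greedy_multi_cover` is a hypothetical name.

```python
def greedy_multi_cover(coverage, num_points, k):
    """Greedily pick sensors until every point is seen by at least k of them.

    coverage maps a sensor id to the set of point indices it can see.
    Raises ValueError if the requirement cannot be met.
    """
    need = [k] * num_points        # remaining coverage required per point
    remaining = dict(coverage)     # candidates not yet selected
    chosen = []
    while any(n > 0 for n in need):
        best, best_gain = None, 0
        for sensor, points in remaining.items():
            # Gain: points this sensor sees that are still under-covered.
            gain = sum(1 for p in points if need[p] > 0)
            if gain > best_gain:
                best, best_gain = sensor, gain
        if best is None:
            raise ValueError("multi-coverage requirement cannot be met")
        chosen.append(best)
        for p in remaining.pop(best):
            if need[p] > 0:
                need[p] -= 1
    return chosen
```

Evaluating the gain for every candidate at every step is the expensive part, which is the evaluation the abstract proposes to accelerate with a neural network.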

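Summary
We address the problem of efficient and unobstructed surveillance or communication in complex environments. On one hand, we want to cover the environment with a minimal number of sensors. On the other hand, it is often necessary to consider robustness against sensor failure or adversarial attacks. This paper addresses the design of minimal sensor sets that satisfy multi-coverage constraints, meaning that every point in the environment is covered by at least a prescribed number of sensors. We propose a greedy algorithm for this problem. We also explore deep learning techniques to accelerate the evaluation of the objective function used in the greedy algorithm. Training the neural network reveals that the geometric properties of the data significantly affect the network's performance, particularly at the end stage. Taking these properties into account, we discuss the differences between using greedy and $\epsilon$-greedy algorithms to generate data, and their impact on the robustness of the network.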

Towards Last-layer Retraining for Group Robustness with Fewer Annotations

paper_url: http://arxiv.org/abs/2309.08534
repo_url: https://github.com/tmlabonte/last-layer-retraining
paper_authors: Tyler LaBonte, Vidya Muthukumar, Abhishek Kumar
for: Studies empirical risk minimization (ERM) of deep neural networks, in particular its poor accuracy on minority groups.
methods: Uses last-layer retraining and selective last-layer finetuning (SELF), and evaluates both on several benchmarks.
results: Last-layer retraining substantially improves worst-group accuracy without group annotations and with only a few class annotations. SELF further improves group robustness, again without group annotations.

Abstract
Empirical risk minimization (ERM) of neural networks is prone to over-reliance on spurious correlations and poor generalization on minority groups. The recent deep feature reweighting (DFR) technique achieves state-of-the-art group robustness via simple last-layer retraining, but it requires held-out group and class annotations to construct a group-balanced reweighting dataset. In this work, we examine this impractical requirement and find that last-layer retraining can be surprisingly effective with no group annotations (other than for model selection) and only a handful of class annotations. We first show that last-layer retraining can greatly improve worst-group accuracy even when the reweighting dataset has only a small proportion of worst-group data. This implies a "free lunch" where holding out a subset of training data to retrain the last layer can substantially outperform ERM on the entire dataset with no additional data or annotations. To further improve group robustness, we introduce a lightweight method called selective last-layer finetuning (SELF), which constructs the reweighting dataset using misclassifications or disagreements. Our empirical and theoretical results present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations. Our code is available at https://github.com/tmlabonte/last-layer-retraining.
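
The selection step behind SELF can be sketched as follows, assuming two models have already produced class-probability vectors for each held-out point. Points are ranked by how strongly the two models disagree and the top `n` are kept for the reweighting dataset. The total-variation score and the function names are assumptions made for this sketch; the paper may measure disagreement differently.

```python
def disagreement_scores(probs_a, probs_b):
    """Total-variation distance between two models' predicted distributions."""
    return [
        0.5 * sum(abs(x - y) for x, y in zip(pa, pb))
        for pa, pb in zip(probs_a, probs_b)
    ]


def select_for_reweighting(probs_a, probs_b, n):
    """Indices of the n held-out points on which the two models disagree most."""
    scores = disagreement_scores(probs_a, probs_b)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]
```

Only the selected points would then need class labels before the last layer is finetuned on them, which is where the annotation saving comes from.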

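Summary
Empirical risk minimization (ERM) of neural networks is prone to over-reliance on spurious correlations and generalizes poorly on minority groups. The recent deep feature reweighting (DFR) technique achieves state-of-the-art group robustness through simple last-layer retraining, but it requires held-out group and class annotations to construct a group-balanced reweighting dataset. In this work, we examine this impractical requirement and find that last-layer retraining can be surprisingly effective with no group annotations (other than for model selection) and only a handful of class annotations. We first show that last-layer retraining can greatly improve worst-group accuracy even when the reweighting dataset contains only a small proportion of worst-group data. This means that holding out a subset of the training data to retrain the last layer can substantially outperform ERM on the entire dataset, with no additional data or annotations. To further improve group robustness, we introduce a lightweight method called selective last-layer finetuning (SELF), which constructs the reweighting dataset from misclassifications or disagreements. Our empirical and theoretical results show that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations. Our code is available at the repository linked above.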

Scaling Laws for Sparsely-Connected Foundation Models

paper_url: http://arxiv.org/abs/2309.08520
repo_url: None
paper_authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
for: This paper investigates the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets, specifically in the vision and language domains.
methods: The paper explores the relationship between weight sparsity, number of non-zero parameters, and amount of training data, and identifies a scaling law that describes this relationship. The authors also experiment with different sparsity structures and strategies.
results: The paper finds that the “optimal sparsity” (the sparsity level that yields the best performance for a given effective model size and training budget) increases with the amount of data used for training. The authors also extend their study to different sparsity structures and strategies, and shed light on the power and limitations of weight sparsity across various parameter and computational settings.

Abstract
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales; on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
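
The hardware-friendly n:m pattern mentioned above can be illustrated with a minimal magnitude-pruning sketch, assuming "n:m" means that at most `n` weights are non-zero in every consecutive group of `m`. The function name `prune_n_m` and the flat-list weight layout are illustrative choices, not the paper's pruning procedure.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in each group of m; zero the rest."""
    if m <= 0 or len(weights) % m != 0:
        raise ValueError("length of weights must be a positive multiple of m")
    if not 0 <= n <= m:
        raise ValueError("n must satisfy 0 <= n <= m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned
```

For example, a 2:4 pattern keeps half of the weights, so the non-zero parameter count is half the dense count regardless of how the magnitudes are distributed.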

Summary

Deep-learning-powered data analysis in plankton ecology

paper_url: http://arxiv.org/abs/2309.08500
repo_url: https://github.com/softmatterlab/deep-learning-in-plankton-ecology
paper_authors: Harshith Bachimanchi, Matthew I. M. Pinder, Chloé Robert, Pierre De Wit, Jonathan Havenhand, Alexandra Kinnby, Daniel Midtvedt, Erik Selander, Giovanni Volpe
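for: Reviews the application of deep learning algorithms in plankton ecology, along with their advantages and limitations.
methods: Covers deep-learning-based detection and classification of plankton images, analysis of foraging and swimming behaviour, and ecological modelling.
results: Summarizes the opportunities and challenges of deep learning for plankton research, and provides detailed tutorials and code samples so that readers can apply the methods to their own data.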

Abstract
The implementation of deep learning algorithms has brought new perspectives to plankton ecology. Emerging as an alternative approach to established methods, deep learning offers objective schemes to investigate plankton organisms in diverse environments. We provide an overview of deep-learning-based methods including detection and classification of phyto- and zooplankton images, foraging and swimming behaviour analysis, and finally ecological modelling. Deep learning has the potential to speed up the analysis and reduce the human experimental bias, thus enabling data acquisition at relevant temporal and spatial scales with improved reproducibility. We also discuss shortcomings and show how deep learning architectures have evolved to mitigate imprecise readouts. Finally, we suggest opportunities where deep learning is particularly likely to catalyze plankton research. The examples are accompanied by detailed tutorials and code samples that allow readers to apply the methods described in this review to their own data.

Summary
Deep learning algorithms have brought new perspectives to plankton ecology. As an alternative to established methods, deep learning offers objective schemes for investigating plankton organisms in diverse environments. This review gives an overview of deep-learning-based methods, including detection and classification of phyto- and zooplankton images, analysis of foraging and swimming behaviour, and ecological modelling. Deep learning can speed up analysis and reduce human experimental bias, enabling data acquisition at relevant temporal and spatial scales with improved reproducibility. We also discuss shortcomings and describe how deep learning architectures have evolved to mitigate imprecise readouts. Finally, we suggest opportunities where deep learning is particularly likely to catalyze plankton research. The examples are accompanied by detailed tutorials and code samples that let readers apply the methods to their own data.

Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

paper_url: http://arxiv.org/abs/2309.08489
repo_url: None
paper_authors: Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang
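for: Proposes a word-level end-to-end neural diarization system that performs speech recognition and speaker labeling simultaneously.
methods: The system uses a multi-task learning algorithm that performs ASR and speaker diarization within a single neural architecture.
results: Experiments show that WEEND outperforms the baseline system on all 2-speaker short-form scenarios and can generalize to audio lengths of 5 minutes.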

Abstract
While standard speaker diarization attempts to answer the question "who spoke when", most relevant applications in reality are more interested in determining "who spoke what". Whether it is the conventional modularized approach or the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate the speaker labels with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same neural architecture. That is, while speech is being recognized, speaker labels are predicted simultaneously for each recognized word. Experimental results demonstrate that WEEND outperforms the turn-based diarization baseline system on all 2-speaker short-form scenarios and has the capability to generalize to audio lengths of 5 minutes. Although 3+speaker conversations are harder, we find that with enough in-domain training data, WEEND has the potential to deliver high quality diarized text.

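Summary
Standard speaker diarization tries to answer "who spoke when", but most real applications care more about "who spoke what". Both the conventional modular approach and the more recent end-to-end neural diarization (EEND) need an additional automatic speech recognition (ASR) model and an orchestration algorithm to associate speaker labels with recognized words. The paper proposes Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same neural architecture, so that a speaker label is predicted for each word as it is recognized. Experiments show that WEEND outperforms the turn-based diarization baseline on all 2-speaker short-form scenarios and generalizes to audio lengths of 5 minutes. Conversations with 3 or more speakers are harder, but with enough in-domain training data WEEND has the potential to deliver high-quality diarized text.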

On the limitations of data-driven weather forecasting models

paper_url: http://arxiv.org/abs/2309.08473
repo_url: None
paper_authors: Massimo Bonavita
for: This paper examines the forecasts produced by the Pangu-Weather machine learning model and compares them to traditional physics-based models in terms of fidelity and physical consistency.
methods: The paper uses Pangu-Weather forecasts and compares them to traditional physics-based models to evaluate their accuracy and physical consistency.
results: The paper finds that Pangu-Weather forecasts lack the fidelity and physical consistency of physics-based models, and that their accuracy advantage on traditional deterministic metrics can largely be attributed to these peculiarities. ML models still appear able to add value to standard NWP outputs for specific forecast applications, at very low computational cost during deployment.

Abstract
As in many other areas of engineering and applied science, Machine Learning (ML) is having a profound impact in the domain of Weather and Climate Prediction. A very recent development in this area has been the emergence of fully data-driven ML prediction models which routinely claim superior performance to that of traditional physics-based models. In this work, we examine some aspects of the forecasts produced by an exemplar of the current generation of ML models, Pangu-Weather, with a focus on the fidelity and physical consistency of those forecasts and how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and by extension those of similar ML models, do not have the fidelity and physical consistency of physics-based models and their advantage in accuracy on traditional deterministic metrics of forecast skill can be attributed, to a large extent, to these peculiarities. Similarly to other current post-processing technologies, ML models appear to be able to add value to standard NWP outputs for specific forecast applications and combined with their extremely low computational cost during deployment, will likely provide an additional, useful source of forecast information.

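Summary
As in many other areas of engineering and applied science, machine learning (ML) is having a profound impact on weather and climate prediction. A very recent development is the emergence of fully data-driven ML prediction models that routinely claim performance superior to that of traditional physics-based models. The paper examines forecasts produced by one exemplar of the current generation of ML models, Pangu-Weather, focusing on their fidelity and physical consistency and on how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and by extension those of similar ML models, lack the fidelity and physical consistency of physics-based models, and that their accuracy advantage on traditional deterministic metrics of forecast skill can largely be attributed to these peculiarities. Like other current post-processing technologies, ML models appear able to add value to standard NWP outputs for specific forecast applications and, given their extremely low computational cost during deployment, will likely provide an additional, useful source of forecast information.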

Quantifying Credit Portfolio sensitivity to asset correlations with interpretable generative neural networks

paper_url: http://arxiv.org/abs/2309.08652
repo_url: None
paper_authors: Sergio Caprioli, Emanuele Cagliero, Riccardo Crupi
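for: This work proposes a new approach for quantifying the sensitivity of credit portfolio Value-at-Risk (VaR) to asset correlations, using synthetic financial correlation matrices generated with deep learning models.
methods: Instead of the Generative Adversarial Networks (GANs) used in previous work, the authors employ Variational Autoencoders (VAE) to obtain a more interpretable latent space representation.
results: The analysis shows that the VAE latent space can capture the crucial factors affecting portfolio diversification, particularly the sensitivity of the credit portfolio to changes in asset correlations.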

Abstract
In this research, we propose a novel approach for the quantification of credit portfolio Value-at-Risk (VaR) sensitivity to asset correlations with the use of synthetic financial correlation matrices generated with deep learning models. In previous work Generative Adversarial Networks (GANs) were employed to demonstrate the generation of plausible correlation matrices, that capture the essential characteristics observed in empirical correlation matrices estimated on asset returns. Instead of GANs, we employ Variational Autoencoders (VAE) to achieve a more interpretable latent space representation. Through our analysis, we reveal that the VAE latent space can be a useful tool to capture the crucial factors impacting portfolio diversification, particularly in relation to credit portfolio sensitivity to asset correlations changes.

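Summary
The paper proposes a novel approach for quantifying the sensitivity of credit portfolio Value-at-Risk (VaR) to asset correlations, using synthetic financial correlation matrices generated with deep learning models. Previous work employed Generative Adversarial Networks (GANs) to generate plausible correlation matrices that capture the essential characteristics observed in empirical correlation matrices estimated on asset returns. Here, Variational Autoencoders (VAE) are used instead to obtain a more interpretable latent space representation. The analysis reveals that the VAE latent space can be a useful tool for capturing the crucial factors that affect portfolio diversification, particularly in relation to the sensitivity of the credit portfolio to changes in asset correlations.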

FedDCSR: Federated Cross-domain Sequential Recommendation via Disentangled Representation Learning

paper_url: http://arxiv.org/abs/2309.08420
repo_url: https://github.com/orion-orion/FedDCSR
paper_authors: Hongyu Zhang, Dongyi Zheng, Xu Yang, Jiyuan Feng, Qing Liao
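for: The paper proposes a privacy-preserving recommendation method that combines federated learning with cross-domain sequential recommendation, so that user sequence data from different domains can be fully used without sharing raw data.
methods: The method combines federated learning (FL) and Cross-domain Sequential Recommendation (CSR). To handle sequence feature heterogeneity across domains, inter-intra domain sequence representation disentanglement (SRD) splits user sequence features into domain-shared and domain-exclusive features, and an intra-domain contrastive infomax (CIM) strategy learns richer domain-exclusive features through data augmentation on user sequences.
results: On three real-world scenarios, FedDCSR achieves significant improvements over existing baselines.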

Abstract
Cross-domain Sequential Recommendation (CSR) which leverages user sequence data from multiple domains has received extensive attention in recent years. However, the existing CSR methods require sharing origin user data across domains, which violates the General Data Protection Regulation (GDPR). Thus, it is necessary to combine federated learning (FL) and CSR to fully utilize knowledge from different domains while preserving data privacy. Nonetheless, the sequence feature heterogeneity across different domains significantly impacts the overall performance of FL. In this paper, we propose FedDCSR, a novel federated cross-domain sequential recommendation framework via disentangled representation learning. Specifically, to address the sequence feature heterogeneity across domains, we introduce an approach called inter-intra domain sequence representation disentanglement (SRD) to disentangle the user sequence features into domain-shared and domain-exclusive features. In addition, we design an intra domain contrastive infomax (CIM) strategy to learn richer domain-exclusive features of users by performing data augmentation on user sequences. Extensive experiments on three real-world scenarios demonstrate that FedDCSR achieves significant improvements over existing baselines.

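Summary
Cross-domain Sequential Recommendation (CSR), which leverages user sequence data from multiple domains, has received extensive attention in recent years. Existing CSR methods, however, require sharing original user data across domains, which violates the General Data Protection Regulation (GDPR). It is therefore necessary to combine federated learning (FL) and CSR to fully use knowledge from different domains while preserving data privacy. The heterogeneity of sequence features across domains nevertheless has a significant impact on the overall performance of FL. The paper proposes FedDCSR, a novel federated cross-domain sequential recommendation framework based on disentangled representation learning. To address sequence feature heterogeneity, it introduces inter-intra domain sequence representation disentanglement (SRD), which separates user sequence features into domain-shared and domain-exclusive features. It also designs an intra-domain contrastive infomax (CIM) strategy that learns richer domain-exclusive user features by performing data augmentation on user sequences. Extensive experiments on three real-world scenarios show that FedDCSR achieves significant improvements over existing baselines.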

A new method of modeling the multi-stage decision-making process of CRT using machine learning with uncertainty quantification

paper_url: http://arxiv.org/abs/2309.08415
repo_url: https://github.com/miilab-mtu/crt_multistageml_uncertainty
paper_authors: Kristoffer Larsen, Chen Zhao, Joyce Keyak, Qiuying Sha, Diana Paez, Xinwei Zhang, Jiangang Zou, Amalia Peix, Weihua Zhou
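for: The study aims to create a multi-stage machine learning model that predicts cardiac resynchronization therapy (CRT) response for heart failure (HF) patients.
methods: The study enrolled 218 patients who underwent rest-gated SPECT MPI. CRT response was defined as an increase in left ventricular ejection fraction (LVEF) > 5% at a 6-month follow-up. The multi-stage ML model was created by combining two ensemble models.
results: The CRT response rate was 55.5% (n = 121); 61.0% of patients were male (n = 133), the average age was 62.0, and LVEF was 27.7. The multi-stage model performed similarly to Ensemble 2 (which used the additional SPECT data), with AUC of 0.75 vs. 0.77, accuracy of 0.71 vs. 0.69, sensitivity of 0.70 vs. 0.72, and specificity of 0.72 vs. 0.65, respectively. However, the multi-stage model required SPECT MPI data for only 52.7% of the patients across all folds.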

Abstract
Aims. The purpose of this study is to create a multi-stage machine learning model to predict cardiac resynchronization therapy (CRT) response for heart failure (HF) patients. This model exploits uncertainty quantification to recommend additional collection of single-photon emission computed tomography myocardial perfusion imaging (SPECT MPI) variables if baseline clinical variables and features from electrocardiogram (ECG) are not sufficient. Methods. 218 patients who underwent rest-gated SPECT MPI were enrolled in this study. CRT response was defined as an increase in left ventricular ejection fraction (LVEF) > 5% at a 6 month follow-up. A multi-stage ML model was created by combining two ensemble models. Results. The response rate for CRT was 55.5% (n = 121) with overall male gender 61.0% (n = 133), an average age of 62.0, and LVEF of 27.7. The multi-stage model performed similarly to Ensemble 2 (which utilized the additional SPECT data) with AUC of 0.75 vs. 0.77, accuracy of 0.71 vs. 0.69, sensitivity of 0.70 vs. 0.72, and specificity 0.72 vs. 0.65, respectively. However, the multi-stage model only required SPECT MPI data for 52.7% of the patients across all folds. Conclusions. By using rule-based logic stemming from uncertainty quantification, the multi-stage model was able to reduce the need for additional SPECT MPI data acquisition without sacrificing performance.

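Summary
Aims: The study creates a multi-stage machine learning model to predict cardiac resynchronization therapy (CRT) response for heart failure (HF) patients. The model uses uncertainty quantification to recommend collecting additional single-photon emission computed tomography myocardial perfusion imaging (SPECT MPI) variables when baseline clinical variables and electrocardiogram (ECG) features are not sufficient. Methods: 218 patients who underwent rest-gated SPECT MPI were enrolled. CRT response was defined as an increase in left ventricular ejection fraction (LVEF) > 5% at a 6-month follow-up. The multi-stage ML model was built by combining two ensemble models. Results: The CRT response rate was 55.5% (n = 121); 61.0% of patients were male (n = 133), the average age was 62.0, and LVEF was 27.7. The multi-stage model performed similarly to Ensemble 2 (which used the additional SPECT data), with AUC of 0.75 vs. 0.77, accuracy of 0.71 vs. 0.69, sensitivity of 0.70 vs. 0.72, and specificity of 0.72 vs. 0.65, respectively. However, the multi-stage model required SPECT MPI data for only 52.7% of the patients across all folds. Conclusions: By using rule-based logic derived from uncertainty quantification, the multi-stage model reduced the need for additional SPECT MPI data acquisition without sacrificing performance.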
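
The staged decision described in this entry can be illustrated with a minimal Python sketch. It is not the authors' implementation: the uncertainty rule (a probability within `band` of 0.5 counts as uncertain), the `band` value, and all function names are hypothetical stand-ins for the paper's rule-based logic and its two ensemble models. The last two lines only re-derive the cohort percentages quoted in the abstract.

```python
from typing import Callable, Sequence, Tuple


def multi_stage_predict(
    clinical_ecg: Sequence[float],
    acquire_spect: Callable[[], Sequence[float]],
    stage1: Callable[[Sequence[float]], float],
    stage2: Callable[[Sequence[float]], float],
    band: float = 0.15,
) -> Tuple[float, bool]:
    """Return (response probability, whether SPECT MPI was requested)."""
    p1 = stage1(clinical_ecg)
    # Hypothetical uncertainty rule: far enough from 0.5 means "confident".
    if abs(p1 - 0.5) >= band:
        return p1, False
    # Otherwise request the extra imaging variables and use the second model.
    spect = acquire_spect()
    return stage2(list(clinical_ecg) + list(spect)), True


# Cohort percentages quoted in the abstract, recomputed from the counts.
response_rate = round(100 * 121 / 218, 1)  # responders among 218 patients
male_share = round(100 * 133 / 218, 1)     # male patients among 218 patients
```

A caller would pass its own fitted models as `stage1` and `stage2`; the point of the sketch is only that imaging is acquired for the uncertain subset rather than for every patient.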

Make Deep Networks Shallow Again

paper_url: http://arxiv.org/abs/2309.08414
repo_url: None
paper_authors: Bernhard Bermeitinger, Tomas Hrycej, Siegfried Handschuh
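for: The paper investigates whether a deep (sequential) stack of residual layers can be replaced by a wide, shallow (parallel) architecture.
methods: A stack of residual connection layers is expanded in a way similar to a Taylor expansion; truncating the higher-order terms gives a single broad layer composed of all initially stacked layers in parallel. Both architectures are trained on MNIST and CIFAR10 over 6912 hyperparameter combinations.
results: Wide, shallow networks match the deep networks in training and validation loss, which could simplify network architectures, improve optimization efficiency, and accelerate training.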

Abstract
Deep neural networks have a good success record and are thus viewed as the best architecture choice for complex applications. Their main shortcoming has been, for a long time, the vanishing gradient which prevented the numerical optimization algorithms from acceptable convergence. A breakthrough has been achieved by the concept of residual connections -- an identity mapping parallel to a conventional layer. This concept is applicable to stacks of layers of the same dimension and substantially alleviates the vanishing gradient problem. A stack of residual connection layers can be expressed as an expansion of terms similar to the Taylor expansion. This expansion suggests the possibility of truncating the higher-order terms and receiving an architecture consisting of a single broad layer composed of all initially stacked layers in parallel. In other words, a sequential deep architecture is substituted by a parallel shallow one. Prompted by this theory, we investigated the performance capabilities of the parallel architecture in comparison to the sequential one. The computer vision datasets MNIST and CIFAR10 were used to train both architectures for a total of 6912 combinations of varying numbers of convolutional layers, numbers of filters, kernel sizes, and other meta parameters. Our findings demonstrate a surprising equivalence between the deep (sequential) and shallow (parallel) architectures. Both layouts produced similar results in terms of training and validation set loss. This discovery implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such substitution has the potential to simplify network architectures, improve optimization efficiency, and accelerate the training process.

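Summary
Deep neural networks have a good success record and are therefore viewed as the best architecture choice for complex applications. Their main shortcoming has long been the vanishing gradient, which prevented numerical optimization algorithms from converging acceptably. Residual connections, an identity mapping parallel to a conventional layer, brought a breakthrough. The concept applies to stacks of layers of the same dimension and substantially alleviates the vanishing gradient problem. A stack of residual connection layers can be expressed as an expansion of terms similar to a Taylor expansion, which suggests truncating the higher-order terms to obtain a single broad layer composed of all initially stacked layers in parallel. In other words, a sequential deep architecture is replaced by a parallel shallow one. Prompted by this theory, the authors compare the two architectures on the computer vision datasets MNIST and CIFAR10, training both for a total of 6912 combinations of numbers of convolutional layers, numbers of filters, kernel sizes, and other meta parameters. The findings show a surprising equivalence between the deep (sequential) and shallow (parallel) architectures: both layouts produce similar training and validation set loss. This implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such a substitution could simplify network architectures, improve optimization efficiency, and accelerate training.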
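
The truncation argument can be checked numerically in a simplified setting. The sketch below assumes purely linear residual layers with small weights, which is a simplification of the convolutional layers studied in the paper; the function names and sizes are illustrative. Expanding the product (I + W_L)...(I + W_1) gives I + sum of W_k plus products of two or more weight matrices, and dropping those products leaves the parallel form.

```python
import numpy as np


def sequential_residual(x, weights):
    """Deep form: apply x <- x + W_k x layer by layer."""
    for w in weights:
        x = x + w @ x
    return x


def parallel_shallow(x, weights):
    """Shallow form: keep only the first-order terms, x + sum_k W_k x."""
    return x + sum(w @ x for w in weights)


rng = np.random.default_rng(0)
dim, depth, scale = 8, 4, 1e-2
weights = [scale * rng.standard_normal((dim, dim)) for _ in range(depth)]
x = rng.standard_normal(dim)

# The gap consists only of higher-order products such as W_2 W_1 x.
gap = np.linalg.norm(sequential_residual(x, weights) - parallel_shallow(x, weights))
first_order = np.linalg.norm(parallel_shallow(x, weights) - x)
```

With small weights the gap is much smaller than the first-order term; how well this carries over to trained nonlinear layers is the empirical question the paper addresses.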

Constraint-Free Structure Learning with Smooth Acyclic Orientations

paper_url: http://arxiv.org/abs/2309.08406
repo_url: None
paper_authors: Riccardo Massidda, Francesco Landolfi, Martina Cinquini, Davide Bacciu
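for: The paper addresses structure learning, i.e. fitting data generated by a Directed Acyclic Graph (DAG) so as to correctly reconstruct its arcs.
methods: The authors introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. It defines a differentiable approximation of an orientation matrix parameterized by a single priority vector, which yields a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step.
results: The authors prove that COSMO always converges to an acyclic solution. It is asymptotically faster than approaches that evaluate acyclicity, and its graph reconstruction performance compares favorably with competing structure learning methods.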

Abstract
The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) to correctly reconstruct its arcs. In this context, differentiable approaches constrain or regularize the optimization problem using a continuous relaxation of the acyclicity property. The computational cost of evaluating graph acyclicity is cubic on the number of nodes and significantly affects scalability. In this paper we introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At the core of our method, we define a differentiable approximation of an orientation matrix parameterized by a single priority vector. Differently from previous work, our parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite the absence of explicit constraints, we prove that COSMO always converges to an acyclic solution. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performance on graph reconstruction compares favorably with competing structure learning methods.

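Summary
The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) so as to correctly reconstruct its arcs. In this setting, differentiable approaches constrain or regularize the optimization problem using a continuous relaxation of the acyclicity property. Evaluating graph acyclicity costs cubic time in the number of nodes, which significantly affects scalability. The paper introduces COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At its core is a differentiable approximation of an orientation matrix parameterized by a single priority vector. Unlike previous work, this parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite the absence of explicit constraints, the authors prove that COSMO always converges to an acyclic solution. Besides being asymptotically faster, COSMO compares favorably with competing structure learning methods on graph reconstruction in the empirical analysis.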
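
One way to picture an orientation matrix driven by a single priority vector is sketched below in Python. The specific sigmoid form, the temperature parameter, and the thresholding step are assumptions made for illustration and are not taken from the abstract; `smooth_orientation` and `is_acyclic` are hypothetical helper names. The idea shown is that an arc u -> v is favoured when node v has higher priority than node u, so at low temperature the orientation follows a strict ordering and cannot contain cycles.

```python
import numpy as np


def smooth_orientation(priority, temperature):
    """S[u, v] = sigmoid((p[v] - p[u]) / temperature), written with tanh for stability."""
    diff = priority[None, :] - priority[:, None]
    return 0.5 * (1.0 + np.tanh(0.5 * diff / temperature))


def is_acyclic(adjacency):
    """A directed graph on n nodes is acyclic iff it has no walk of length n."""
    n = adjacency.shape[0]
    step = (adjacency != 0).astype(int)
    walks = np.eye(n, dtype=int)
    for _ in range(n):
        walks = ((walks @ step) > 0).astype(int)
    return not walks.any()


rng = np.random.default_rng(0)
priority = np.array([0.3, -1.2, 2.0, 0.7])
weights = rng.standard_normal((4, 4))

# Smooth (dense) adjacency at a moderate temperature, usable for gradient steps.
soft = weights * smooth_orientation(priority, temperature=0.5)
# Near-zero temperature: arcs only point from lower to higher priority.
hard = weights * (smooth_orientation(priority, temperature=1e-3) > 0.5)
```

The acyclicity check here is only used to inspect the result; the appeal of the priority parameterization is that training itself never needs such a check.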

Neural Metamaterial Networks for Nonlinear Material Design

paper_url: http://arxiv.org/abs/2309.10600
repo_url: https://github.com/asem010/legend-pice
paper_authors: Yue Li, Stelian Coros, Bernhard Thomaszewski
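for: The work develops a neural-network-based design method for nonlinear metamaterials, with applications in engineering, medicine, robotics, and beyond.
methods: Neural Metamaterial Networks (NMN) are smooth neural representations that encode the nonlinear mechanics of entire metamaterial families: given structure parameters, they return continuously differentiable strain energy density functions. Inverse material design is formulated as a nonlinear programming problem that uses neural networks for both objective functions and constraints and is solved with gradient-based optimization.
results: The approach automatically designs materials with desired strain-stress curves and with prescribed directional stiffness and Poisson ratio profiles. The authors also report ablation studies on network nonlinearities and advantages over native-scale optimization.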

Abstract
Nonlinear metamaterials with tailored mechanical properties have applications in engineering, medicine, robotics, and beyond. While modeling their macromechanical behavior is challenging in itself, finding structure parameters that lead to ideal approximation of high-level performance goals is a challenging task. In this work, we propose Neural Metamaterial Networks (NMN) -- smooth neural representations that encode the nonlinear mechanics of entire metamaterial families. Given structure parameters as input, NMN return continuously differentiable strain energy density functions, thus guaranteeing conservative forces by construction. Though trained on simulation data, NMN do not inherit the discontinuities resulting from topological changes in finite element meshes. They instead provide a smooth map from parameter to performance space that is fully differentiable and thus well-suited for gradient-based optimization. On this basis, we formulate inverse material design as a nonlinear programming problem that leverages neural networks for both objective functions and constraints. We use this approach to automatically design materials with desired strain-stress curves, prescribed directional stiffness and Poisson ratio profiles. We furthermore conduct ablation studies on network nonlinearities and show the advantages of our approach compared to native-scale optimization.

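Summary
Nonlinear metamaterials with tailored mechanical properties have applications in engineering, medicine, robotics, and beyond. Modeling their macromechanical behavior is challenging in itself, and finding structure parameters that best approximate high-level performance goals is a further challenge. The paper proposes Neural Metamaterial Networks (NMN), smooth neural representations that encode the nonlinear mechanics of entire metamaterial families. Given structure parameters as input, NMN return continuously differentiable strain energy density functions, which guarantees conservative forces by construction. Although trained on simulation data, NMN do not inherit the discontinuities that result from topological changes in finite element meshes. Instead, they provide a smooth map from parameter space to performance space that is fully differentiable and therefore well suited to gradient-based optimization. On this basis, inverse material design is formulated as a nonlinear programming problem that uses neural networks for both objective functions and constraints. The approach is used to automatically design materials with desired strain-stress curves and with prescribed directional stiffness and Poisson ratio profiles. The authors also conduct ablation studies on network nonlinearities and show the advantages of the approach compared to native-scale optimization.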

Optimizing Modular Robot Composition: A Lexicographic Genetic Algorithm Approach

paper_url: http://arxiv.org/abs/2309.08399
repo_url: None
paper_authors: Jonathan Külz, Matthias Althoff
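for: The work targets task-tailored modular industrial robots that adapt to changing task requirements and environments.
methods: A genetic algorithm is combined with a lexicographic evaluation of solution candidates.
results: The approach outperforms a state-of-the-art baseline and synthesizes modular robots for industrial tasks in cluttered environments.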

Abstract
Industrial robots are designed as general-purpose hardware, which limits their ability to adapt to changing task requirements or environments. Modular robots, on the other hand, offer flexibility and can be easily customized to suit diverse needs. The morphology, i.e., the form and structure of a robot, significantly impacts the primary performance metrics acquisition cost, cycle time, and energy efficiency. However, identifying an optimal module composition for a specific task remains an open problem, presenting a substantial hurdle in developing task-tailored modular robots. Previous approaches either lack adequate exploration of the design space or the possibility to adapt to complex tasks. We propose combining a genetic algorithm with a lexicographic evaluation of solution candidates to overcome this problem and navigate search spaces exceeding those in prior work by magnitudes in the number of possible compositions. We demonstrate that our approach outperforms a state-of-the-art baseline and is able to synthesize modular robots for industrial tasks in cluttered environments.

Summary
Industrial robots are designed as general-purpose hardware, which limits their ability to adapt to changing task requirements or environments. Modular robots, in contrast, are flexible and can easily be customized to suit diverse needs. The morphology of a robot, i.e. its form and structure, significantly affects the primary performance metrics: acquisition cost, cycle time, and energy efficiency. Identifying an optimal module composition for a specific task nevertheless remains an open problem and a substantial hurdle in developing task-tailored modular robots. Previous approaches either explore the design space inadequately or cannot adapt to complex tasks. The paper proposes combining a genetic algorithm with a lexicographic evaluation of solution candidates to overcome this problem and to navigate search spaces that exceed those of prior work by orders of magnitude in the number of possible compositions. The approach outperforms a state-of-the-art baseline and can synthesize modular robots for industrial tasks in cluttered environments.
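
A lexicographic evaluation can be expressed compactly in Python because tuples compare element by element from left to right. The sketch below is a toy genetic algorithm under invented assumptions: the module names, costs, reach values, the task requirement, and the crossover and mutation settings are all hypothetical and do not come from the paper. It only illustrates ranking candidates first by task feasibility and then by cost.

```python
import random

# Hypothetical module library: name -> acquisition cost and reach contribution.
COST = {"A": 2.5, "B": 1.0, "C": 2.0}
REACH = {"A": 3, "B": 1, "C": 2}
TARGET_REACH = 5  # hypothetical task requirement


def lexicographic_fitness(robot):
    """Lower is better. The reach gap dominates; cost only breaks ties."""
    reach_gap = max(0, TARGET_REACH - sum(REACH[m] for m in robot))
    cost = sum(COST[m] for m in robot)
    return (reach_gap, cost)


def evolve(pop_size=20, length=4, generations=30, seed=0):
    rng = random.Random(seed)
    names = sorted(COST)
    pop = [tuple(rng.choice(names) for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lexicographic_fitness)      # tuple comparison is lexicographic
        parents = pop[: pop_size // 2]           # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)       # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.3:               # point mutation
                child[rng.randrange(length)] = rng.choice(names)
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=lexicographic_fitness)


best = evolve()
```

Ordering the objectives in a tuple avoids hand-tuned weights between feasibility and cost: an infeasible but cheap composition can never outrank a feasible one.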

Exploring Meta Information for Audio-based Zero-shot Bird Classification

paper_url: http://arxiv.org/abs/2309.08398
repo_url: None
paper_authors: Alexander Gebhard, Andreas Triantafyllopoulos, Teresa Bez, Lukas Christ, Alexander Kathan, Björn W. Schuller
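for: The study investigates how meta-information can improve zero-shot audio classification, using bird species as a case study because rich and diverse metadata are available for them.
methods: Three sources of metadata are used: textual bird sound descriptions encoded via (S)BERT, functional traits (AVONET), and bird life-history (BLH) characteristics. Audio spectrogram transformer (AST) embeddings are projected to the dimension of the auxiliary information with a single linear layer; the dot product serves as the compatibility function, trained with a standard zero-shot learning ranking hinge loss.
results: Concatenating the AVONET and BLH features gives the best results, with a mean F1-score of 0.233 over five different test sets with 8 to 10 classes.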

Abstract
Advances in passive acoustic monitoring and machine learning have led to the procurement of vast datasets for computational bioacoustic research. Nevertheless, data scarcity is still an issue for rare and underrepresented species. This study investigates how meta-information can improve zero-shot audio classification, utilising bird species as an example case study due to the availability of rich and diverse metadata. We investigate three different sources of metadata: textual bird sound descriptions encoded via (S)BERT, functional traits (AVONET), and bird life-history (BLH) characteristics. As audio features, we extract audio spectrogram transformer (AST) embeddings and project them to the dimension of the auxiliary information by adopting a single linear layer. Then, we employ the dot product as compatibility function and a standard zero-shot learning ranking hinge loss to determine the correct class. The best results are achieved by concatenating the AVONET and BLH features attaining a mean F1-score of .233 over five different test sets with 8 to 10 classes.

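Summary
Advances in passive acoustic monitoring and machine learning have produced vast datasets for computational bioacoustic research, yet data scarcity remains an issue for rare and underrepresented species. The study investigates how meta-information can improve zero-shot audio classification, using bird species as a case study because rich and diverse metadata are available. Three sources of metadata are examined: textual bird sound descriptions encoded via (S)BERT, functional traits (AVONET), and bird life-history (BLH) characteristics. Audio spectrogram transformer (AST) embeddings are extracted as audio features and projected to the dimension of the auxiliary information with a single linear layer. The dot product is then used as the compatibility function, together with a standard zero-shot learning ranking hinge loss, to determine the correct class. The best results are achieved by concatenating the AVONET and BLH features, with a mean F1-score of 0.233 over five different test sets with 8 to 10 classes.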
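
The scoring step described in this entry (a linear projection of the audio embedding, a dot-product compatibility with each class's metadata vector, and a ranking hinge loss) can be sketched as follows. The dimensions, the random stand-in data, the margin of 1.0, and the choice to sum the violations are assumptions for illustration; the projection here is untrained.

```python
import numpy as np


def compatibility(audio_emb, projection, class_meta):
    """Dot product between the projected audio embedding and each class's metadata."""
    return class_meta @ (projection @ audio_emb)


def ranking_hinge_loss(scores, true_class, margin=1.0):
    """Penalize every wrong class that scores within `margin` of the true class."""
    scores = np.asarray(scores, dtype=float)
    violations = np.maximum(0.0, margin + scores - scores[true_class])
    violations[true_class] = 0.0
    return float(violations.sum())


rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(16)        # stand-in for an AST embedding
class_meta = rng.standard_normal((8, 5))   # 8 classes, 5 metadata features each
projection = rng.standard_normal((5, 16))  # the single linear layer

scores = compatibility(audio_emb, projection, class_meta)
predicted = int(np.argmax(scores))         # class with the highest compatibility
```

Because classes are represented only through their metadata vectors, a class unseen during training can be scored at test time by appending its metadata row to `class_meta`.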

A Unified View Between Tensor Hypergraph Neural Networks And Signal Denoising

paper_url: http://arxiv.org/abs/2309.08385
repo_url: None
paper_authors: Fuli Wang, Karelia Pena-Pena, Wei Qian, Gonzalo R. Arce
for: The paper studies the connection between hypergraph neural networks (HyperGNNs) and hypergraph signal denoising (HyperGSD), and how to design novel HyperGNNs from a HyperGSD perspective.
methods: The paper shows an equivalence relation between a HyperGSD problem and the tensor-hypergraph convolutional network (T-HGCN), and builds on it to design a tensor-hypergraph iterative network (T-HGIN) that uses a multi-step updating scheme in every layer.
results: Numerical experiments show promising applications of the proposed T-HGIN approach.

Abstract
Hypergraph Neural networks (HyperGNNs) and hypergraph signal denoising (HyperGSD) are two fundamental topics in higher-order network modeling. Understanding the connection between these two domains is particularly useful for designing novel HyperGNNs from a HyperGSD perspective, and vice versa. In particular, the tensor-hypergraph convolutional network (T-HGCN) has emerged as a powerful architecture for preserving higher-order interactions on hypergraphs, and this work shows an equivalence relation between a HyperGSD problem and the T-HGCN. Inspired by this intriguing result, we further design a tensor-hypergraph iterative network (T-HGIN) based on the HyperGSD problem, which takes advantage of a multi-step updating scheme in every single layer. Numerical experiments are conducted to show the promising applications of the proposed T-HGIN approach.

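Summary
Hypergraph neural networks (HyperGNNs) and hypergraph signal denoising (HyperGSD) are two fundamental topics in higher-order network modeling. Understanding the connection between the two is particularly useful for designing novel HyperGNNs from a HyperGSD perspective, and vice versa. The tensor-hypergraph convolutional network (T-HGCN) has emerged as a powerful architecture for preserving higher-order interactions on hypergraphs, and the paper shows an equivalence relation between a HyperGSD problem and the T-HGCN. Building on this result, the authors design a tensor-hypergraph iterative network (T-HGIN) based on the HyperGSD problem, which uses a multi-step updating scheme in every layer. Numerical experiments show the promising applications of the proposed T-HGIN approach.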

Adaptive Priority Reweighing for Generalizing Fairness Improvement

paper_url: http://arxiv.org/abs/2309.08375
repo_url: https://github.com/che2198/apw
paper_authors: Zhihao Hu, Yiran Xu, Mengnan Du, Jindong Gu, Xinmei Tian, Fengxiang He
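for: The paper aims to improve algorithmic fairness in machine learning applications with better generalization from training to test data.
methods: It proposes an adaptive reweighing method that models the distance from each sample's prediction to the decision boundary, prioritizing samples closer to the boundary with higher weights to improve the generalizability of fair classifiers.
results: Extensive experiments on tabular benchmarks validate the generalizability of the adaptive priority reweighing method for accuracy and fairness measures (equal opportunity, equalized odds, and demographic parity). The method also improves the fairness of language and vision models. The code is available at https://github.com/che2198/APW.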

Abstract
With the increasing penetration of machine learning applications in critical decision-making areas, calls for algorithmic fairness are more prominent. Although there have been various modalities to improve algorithmic fairness through learning with fairness constraints, their performance does not generalize well in the test set. A performance-promising fair algorithm with better generalizability is needed. This paper proposes a novel adaptive reweighing method to eliminate the impact of the distribution shifts between training and test data on model generalizability. Most previous reweighing methods propose to assign a unified weight for each (sub)group. Rather, our method granularly models the distance from the sample predictions to the decision boundary. Our adaptive reweighing method prioritizes samples closer to the decision boundary and assigns a higher weight to improve the generalizability of fair classifiers. Extensive experiments are performed to validate the generalizability of our adaptive priority reweighing method for accuracy and fairness measures (i.e., equal opportunity, equalized odds, and demographic parity) in tabular benchmarks. We also highlight the performance of our method in improving the fairness of language and vision models. The code is available at https://github.com/che2198/APW.

Understanding the limitations of self-supervised learning for tabular anomaly detection

paper_url: http://arxiv.org/abs/2309.08374
repo_url: None
paper_authors: Kimberly T. Mai, Toby Davies, Lewis D. Griffin
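for: The paper explores whether self-supervised learning can improve anomaly detection on tabular data.
methods: The authors run experiments spanning various pretext tasks on 26 benchmark datasets.
results: Representations derived from self-supervision do not improve tabular anomaly detection compared to the raw data representations, because neural networks introduce irrelevant features; using a subspace of the neural network's representation recovers performance.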

Abstract
While self-supervised learning has improved anomaly detection in computer vision and natural language processing, it is unclear whether tabular data can benefit from it. This paper explores the limitations of self-supervision for tabular anomaly detection. We conduct several experiments spanning various pretext tasks on 26 benchmark datasets to understand why this is the case. Our results confirm representations derived from self-supervision do not improve tabular anomaly detection performance compared to using the raw representations of the data. We show this is due to neural networks introducing irrelevant features, which reduces the effectiveness of anomaly detectors. However, we demonstrate that using a subspace of the neural network's representation can recover performance.

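Summary
Self-supervised learning has improved anomaly detection in computer vision and natural language processing, but it is unclear whether tabular data can benefit from it. The paper explores the limitations of self-supervision for tabular anomaly detection through experiments spanning various pretext tasks on 26 benchmark datasets. The results confirm that representations derived from self-supervision do not improve tabular anomaly detection performance compared to the raw representations of the data. The authors show that this is because neural networks introduce irrelevant features, which reduce the effectiveness of anomaly detectors. They also demonstrate that using a subspace of the neural network's representation can recover performance.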

Convergence of ADAM with Constant Step Size in Non-Convex Settings: A Simple Proof

paper_url: http://arxiv.org/abs/2309.08339
repo_url: None
paper_authors: Alokendu Mazumder, Bhartendu Kumar, Manan Tayal, Punit Rathore
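for: The paper analyzes the convergence of ADAM with a constant step size in the non-convex setting.
methods: The analysis is theoretical and covers a constant step size version of ADAM applied to smooth, non-convex functions.
results: The paper gives sufficient conditions on the step size for almost sure asymptotic convergence of the gradients to zero under minimal assumptions, and provides runtime bounds for deterministic ADAM to reach approximate criticality.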

Abstract
In neural network training, RMSProp and ADAM remain widely favoured optimization algorithms. One of the keys to their performance lies in selecting the correct step size, which can significantly influence their effectiveness: the performance of these algorithms can vary considerably depending on the chosen step sizes. Additionally, questions about their theoretical convergence properties continue to be a subject of interest. In this paper, we theoretically analyze a constant stepsize version of ADAM in the non-convex setting. We show sufficient conditions for the stepsize to achieve almost sure asymptotic convergence of the gradients to zero with minimal assumptions. We also provide runtime bounds for deterministic ADAM to reach approximate criticality when working with smooth, non-convex functions.

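Summary
RMSProp and ADAM remain widely favoured optimization algorithms in neural network training. Choosing the correct step size is one of the keys to their performance, which can vary considerably with the chosen step sizes. Questions about their theoretical convergence properties also remain a subject of interest. The paper theoretically analyzes a constant step size version of ADAM in the non-convex setting. It gives sufficient conditions on the step size for almost sure asymptotic convergence of the gradients to zero under minimal assumptions, and it provides runtime bounds for deterministic ADAM to reach approximate criticality on smooth, non-convex functions.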
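
For readers who want to see the object of study, here is a minimal sketch of deterministic ADAM with a constant step size, using the standard update with bias correction. The test function f(x) = x^4/4 - x^2/2, the starting point, and the hyperparameter values are illustrative choices, not taken from the paper, and the sketch says nothing about which step sizes satisfy the paper's conditions.

```python
import numpy as np


def adam_constant_step(grad, x0, step=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, iters=3000):
    """Deterministic ADAM: full gradients, step size held constant across iterations."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first-moment estimate
    v = np.zeros_like(x)  # second-moment estimate
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - step * m_hat / (np.sqrt(v_hat) + eps)
    return x


def grad_f(x):
    """Gradient of the non-convex function f(x) = x**4 / 4 - x**2 / 2."""
    return x ** 3 - x


x_final = adam_constant_step(grad_f, np.array([2.0]))
grad_norm = float(np.abs(grad_f(x_final)).max())  # criticality measure
```

The gradient norm at the final iterate is the quantity the convergence statements are about: approximate criticality means this norm falls below a chosen tolerance.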

Estimation of Counterfactual Interventions under Uncertainties

paper_url: http://arxiv.org/abs/2309.08332
repo_url: None
paper_authors: Juliane Weilbach, Sebastian Gerwinn, Melih Kandemir, Martin Fraenzle
for: The work addresses the inherent ambiguity of counterfactual distributions, i.e. of inferring how a system would have behaved under hypothetical interventions, especially in continuous settings.
methods: The authors follow a hierarchical Bayesian approach that explicitly models this uncertainty. Specifically, they derive counterfactual distributions for a Bayesian Warped Gaussian Process, allowing for non-Gaussian distributions and non-additive noise.
results: The properties of the approach are illustrated on a synthetic and a semi-synthetic example, and its performance is shown within an algorithmic recourse downstream task.

Abstract
Counterfactual analysis is intuitively performed by humans on a daily basis, e.g. "What should I have done differently to get the loan approved?". Such counterfactual questions also steer the formulation of scientific hypotheses. More formally, it provides insights about potential improvements of a system by inferring the effects of hypothetical interventions into a past observation of the system's behaviour, which plays a prominent role in a variety of industrial applications. Due to the hypothetical nature of such analysis, counterfactual distributions are inherently ambiguous. This ambiguity is particularly challenging in continuous settings in which a continuum of explanations exist for the same observation. In this paper, we address this problem by following a hierarchical Bayesian approach which explicitly models such uncertainty. In particular, we derive counterfactual distributions for a Bayesian Warped Gaussian Process, thereby allowing for non-Gaussian distributions and non-additive noise. We illustrate the properties of our approach on a synthetic and on a semi-synthetic example and show its performance when used within an algorithmic recourse downstream task.

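Summary
Humans perform counterfactual analysis intuitively every day, for example: "What should I have done differently to get the loan approved?". Such counterfactual questions also steer the formulation of scientific hypotheses. More formally, counterfactual analysis gives insight into potential improvements of a system by inferring the effects of hypothetical interventions into a past observation of the system's behaviour, which plays a prominent role in a variety of industrial applications. Because the analysis is hypothetical, counterfactual distributions are inherently ambiguous. This ambiguity is particularly challenging in continuous settings, where a continuum of explanations exists for the same observation. The paper addresses the problem with a hierarchical Bayesian approach that explicitly models this uncertainty. In particular, it derives counterfactual distributions for a Bayesian Warped Gaussian Process, allowing for non-Gaussian distributions and non-additive noise. The properties of the approach are illustrated on a synthetic and a semi-synthetic example, and its performance is shown within an algorithmic recourse downstream task.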

Heteroskedastic conformal regression

paper_url: http://arxiv.org/abs/2309.08313
repo_url: https://github.com/nmdwolf/heteroskedasticconformalregression
paper_authors: Nicolas Dewolf, Bernard De Baets, Willem Waegeman
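for: The paper examines how adaptive prediction intervals with statistical guarantees can be constructed for regression problems with heteroskedastic noise.
methods: It studies split conformal prediction together with adaptive variants such as normalized and Mondrian conformal prediction.
results: The paper presents theoretical and experimental results in which these methods are investigated systematically.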

Abstract
Conformal prediction, and split conformal prediction as a specific implementation, offer a distribution-free approach to estimating prediction intervals with statistical guarantees. Recent work has shown that split conformal prediction can produce state-of-the-art prediction intervals when focusing on marginal coverage, i.e., on a calibration dataset the method produces on average prediction intervals that contain the ground truth with a predefined coverage level. However, such intervals are often not adaptive, which can be problematic for regression problems with heteroskedastic noise. This paper tries to shed new light on how adaptive prediction intervals can be constructed using methods such as normalized and Mondrian conformal prediction. We present theoretical and experimental results in which these methods are investigated in a systematic way.

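Summary
Conformal prediction, and split conformal prediction as a specific implementation, offer a distribution-free approach to estimating prediction intervals with statistical guarantees. Recent work has shown that split conformal prediction can produce state-of-the-art prediction intervals when the focus is on marginal coverage, i.e. on a calibration dataset the method produces prediction intervals that on average contain the ground truth with a predefined coverage level. Such intervals are often not adaptive, however, which can be problematic for regression problems with heteroskedastic noise. The paper tries to shed new light on how adaptive prediction intervals can be constructed using methods such as normalized and Mondrian conformal prediction, and presents theoretical and experimental results in which these methods are investigated systematically.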
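
Normalized split conformal prediction can be sketched in a few lines. The version below uses the common textbook recipe: absolute residuals on a calibration set are divided by a per-sample scale estimate, and the score of rank ceil((n + 1)(1 - alpha)) becomes the interval half-width multiplier. The function names and the toy numbers are illustrative, and the point predictor and scale estimator are assumed to have been fitted elsewhere; this is not the paper's code.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """Calibration score of rank ceil((n + 1) * (1 - alpha)); infinite if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(np.sort(scores)[k - 1])


def normalized_intervals(pred_cal, sigma_cal, y_cal, pred_test, sigma_test, alpha=0.1):
    """Intervals pred +/- q * sigma, so the width adapts to the estimated noise level."""
    scores = np.abs(y_cal - pred_cal) / sigma_cal
    q = conformal_quantile(scores, alpha)
    return pred_test - q * sigma_test, pred_test + q * sigma_test


# Toy calibration set: predictions of 0, unit scale, targets 1..9.
lower, upper = normalized_intervals(
    np.zeros(9), np.ones(9), np.arange(1.0, 10.0),
    np.array([10.0]), np.array([2.0]), alpha=0.5,
)
```

Setting every scale estimate to 1 recovers plain split conformal prediction with constant-width intervals, which is the non-adaptive case the entry contrasts against.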

Sampling-Free Probabilistic Deep State-Space Models

paper_url: http://arxiv.org/abs/2309.08256
repo_url: https://github.com/djdprogramming/adfa2
paper_authors: Andreas Look, Melih Kandemir, Barbara Rakitsch, Jan Peters
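for: This paper addresses real-world dynamical systems described as state-space models (SSMs), in which each observation is emitted by a latent state.
methods: It uses the Probabilistic Deep SSM (ProDSSM), which generalizes SSMs to dynamical systems of unknown parametric form; the transition and emission models are neural networks with uncertain weights.
results: The paper proposes a deterministic inference algorithm for this model class. It allows efficient approximations for training and testing, and the experiments show a superior balance between predictive performance and computational budget.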

Abstract
Many real-world dynamical systems can be described as State-Space Models (SSMs). In this formulation, each observation is emitted by a latent state, which follows first-order Markovian dynamics. A Probabilistic Deep SSM (ProDSSM) generalizes this framework to dynamical systems of unknown parametric form, where the transition and emission models are described by neural networks with uncertain weights. In this work, we propose the first deterministic inference algorithm for models of this type. Our framework allows efficient approximations for training and testing. We demonstrate in our experiments that our new method can be employed for a variety of tasks and enjoys a superior balance between predictive performance and computational budget.

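Summary
Many real-world dynamical systems can be described by state-space models (SSMs). In this formulation each observation is emitted by a latent state that follows first-order Markovian dynamics. A Probabilistic Deep SSM (ProDSSM) extends the framework to dynamical systems of unknown parametric form, with transition and emission models described by neural networks with uncertain weights. This work proposes the first deterministic inference algorithm for models of this type. The framework allows efficient approximations for training and testing, and the experiments show that the method applies to a variety of tasks and offers a better balance between predictive performance and computational budget.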

Deep Nonnegative Matrix Factorization with Beta Divergences

paper_url: http://arxiv.org/abs/2309.08249
repo_url: https://github.com/vleplat/deep-kl-nmf-public
paper_authors: Valentin Leplat, Le Thi Khanh Hien, Akwum Onwunta, Nicolas Gillis
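for: Extracting multiple layers of features from data types such as images, audio signals, and documents.
methods: Uses $\beta$-divergences instead of the least squares error to assess approximation quality, and develops new deep nonnegative matrix factorization models and algorithms.
results: The techniques are applied to the extraction of facial features, the identification of topics within document collections, and the identification of materials within hyperspectral images.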

Abstract
Deep Nonnegative Matrix Factorization (deep NMF) has recently emerged as a valuable technique for extracting multiple layers of features across different scales. However, all existing deep NMF models and algorithms have primarily centered their evaluation on the least squares error, which may not be the most appropriate metric for assessing the quality of approximations on diverse datasets. For instance, when dealing with data types such as audio signals and documents, it is widely acknowledged that $\beta$-divergences offer a more suitable alternative. In this paper, we develop new models and algorithms for deep NMF using $\beta$-divergences. Subsequently, we apply these techniques to the extraction of facial features, the identification of topics within document collections, and the identification of materials within hyperspectral images.
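
For readers unfamiliar with the $\beta$-divergence family, the sketch below evaluates the standard element-wise definition, which reduces to half the squared error at $\beta = 2$, the Kullback-Leibler divergence at $\beta = 1$, and the Itakura-Saito divergence at $\beta = 0$. It only illustrates the loss; the function names are hypothetical and it does not reproduce the deep NMF models or algorithms of the paper.

```python
import math


def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x | y) for positive scalars x and y."""
    if beta == 1:  # generalized Kullback-Leibler divergence
        return x * math.log(x / y) - x + y
    if beta == 0:  # Itakura-Saito divergence
        return x / y - math.log(x / y) - 1
    num = x**beta + (beta - 1) * y**beta - beta * x * y ** (beta - 1)
    return num / (beta * (beta - 1))


def total_divergence(X, Y, beta):
    """Sum of element-wise divergences between two equally shaped matrices."""
    return sum(
        beta_divergence(x, y, beta)
        for row_x, row_y in zip(X, Y)
        for x, y in zip(row_x, row_y)
    )


X = [[1.0, 2.0], [3.0, 4.0]]      # data matrix (toy values)
X_hat = [[1.5, 2.0], [2.0, 4.5]]  # some nonnegative approximation of X
for beta in (0, 1, 2):
    print(beta, total_divergence(X, X_hat, beta))
```

Smaller values of $\beta$ penalize relative rather than absolute errors, which is one common argument for preferring them on audio spectrograms and word-count data.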

Topological Node2vec: Enhanced Graph Embedding via Persistent Homology

paper_url: http://arxiv.org/abs/2309.08241
repo_url: https://github.com/killianfmeehan/topological_node2vec
paper_authors: Yasuaki Hiraoka, Yusuke Imoto, Killian Meehan, Théo Lacombe, Toshiaki Yachimura
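for: This paper studies graph embedding: learning a vector representation for each node of a weighted graph while preserving relative proximity between nodes and global structure.
methods: It builds on Node2vec, which numerical experiments suggest struggles to recreate the topology of the input graph. A topological loss term is added to the training loss to align the persistence diagram of the output embedding as closely as possible with that of the input graph.
results: Demonstrative synthetic examples show the benefits of the approach: the modified loss can be minimized by gradient descent to reconstruct both the geometry and the topology of the input graph.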

Abstract
Node2vec is a graph embedding method that learns a vector representation for each node of a weighted graph while seeking to preserve relative proximity and global structure. Numerical experiments suggest Node2vec struggles to recreate the topology of the input graph. To resolve this we introduce a topological loss term to be added to the training loss of Node2vec which tries to align the persistence diagram (PD) of the resulting embedding as closely as possible to that of the input graph. Following results in computational optimal transport, we carefully adapt entropic regularization to PD metrics, allowing us to measure the discrepancy between PDs in a differentiable way. Our modified loss function can then be minimized through gradient descent to reconstruct both the geometry and the topology of the input graph. We showcase the benefits of this approach using demonstrative synthetic examples.

Summary
Node2vec is a graph embedding method that learns a vector representation for each node of a weighted graph while trying to preserve relative proximity and global structure. Experiments suggest, however, that Node2vec struggles to recreate the topology of the input graph. To resolve this, a topological loss term is added to the Node2vec training loss so that the persistence diagram (PD) of the resulting embedding matches that of the input graph as closely as possible. Following results in computational optimal transport, entropic regularization is carefully adapted to PD metrics, so the discrepancy between PDs can be measured in a differentiable way. The modified loss is then minimized by gradient descent to reconstruct both the geometry and the topology of the input graph. Demonstrative synthetic examples show the benefits of the approach.

Ensuring Toplogical Data-Structure Preservation under Autoencoder Compression due to Latent Space Regularization in Gauss–Legendre nodes

paper_url: http://arxiv.org/abs/2309.08228
repo_url: https://github.com/casus/autoencoder-regularisation
paper_authors: Chethan Krishnamurthy Ramanaik, Juan-Esteban Suarez Cardona, Anna Willmann, Pia Hanfeld, Nico Hoffmann, Michael Hecht
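for: This paper formulates a data-independent latent space regularisation constraint so that general unsupervised autoencoders deliver reliable low-dimensional representations.
methods: The regularisation samples the autoencoder Jacobian at Legendre nodes, the nodes at the centre of Gauss-Legendre quadrature. It is proven to ensure a one-to-one re-embedding of the initial data manifold into the latent space.
results: Demonstrations show that previously proposed regularisation strategies, such as contractive autoencoding, and convolutional (variational) autoencoders cause topological defects even on simple examples. With the proposed regularisation, standard multilayer perceptron networks already preserve topology, from the FashionMNIST dataset up to real-world encoding of MRI brain scans.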

Abstract
We formulate a data independent latent space regularisation constraint for general unsupervised autoencoders. The regularisation rests on sampling the autoencoder Jacobian in Legendre nodes, being the centre of the Gauss-Legendre quadrature. Revisiting this classic enables to prove that regularised autoencoders ensure a one-to-one re-embedding of the initial data manifold to its latent representation. Demonstrations show that prior proposed regularisation strategies, such as contractive autoencoding, cause topological defects already for simple examples, and so do convolutional based (variational) autoencoders. In contrast, topological preservation is ensured already by standard multilayer perceptron neural networks when being regularised due to our contribution. This observation extends through the classic FashionMNIST dataset up to real world encoding problems for MRI brain scans, suggesting that, across disciplines, reliable low dimensional representations of complex high-dimensional datasets can be delivered due to this regularisation technique.

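Summary
We formulate a data-independent latent space regularisation constraint for general unsupervised autoencoders. The constraint is based on sampling the autoencoder Jacobian at Legendre nodes, which are at the centre of Gauss-Legendre quadrature. Revisiting this classic method makes it possible to prove that regularised autoencoders ensure a one-to-one re-embedding of the initial data manifold into its latent representation. Demonstrations show that previously proposed regularisation strategies, such as contractive autoencoding, cause topological defects even on simple examples, and so do convolutional (variational) autoencoders. In contrast, standard multilayer perceptron networks already preserve topology when regularised in this way. The observation holds from the classic FashionMNIST dataset up to real-world encoding problems for MRI brain scans, suggesting that the technique can deliver reliable low-dimensional representations of complex high-dimensional datasets across disciplines.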
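
The Legendre nodes referred to above are the roots of the Legendre polynomial $P_n$ on $[-1, 1]$. As a small self-contained illustration (not the authors' regulariser), the sketch below computes the nodes and Gauss-Legendre weights with Newton's method and checks the quadrature rule on a polynomial. The function names are hypothetical; in the paper's setting these would be the points at which the autoencoder Jacobian is sampled.

```python
import math


def legendre(n, x):
    """Return P_n(x) and its derivative, using the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def gauss_legendre(n):
    """Nodes (roots of P_n) and weights of the n-point Gauss-Legendre rule."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess
        for _ in range(100):  # Newton iterations on P_n(x) = 0
            p, dp = legendre(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        _, dp = legendre(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


nodes, weights = gauss_legendre(3)
# An n-point rule integrates polynomials up to degree 2n - 1 exactly on [-1, 1].
integral_x4 = sum(w * x**4 for x, w in zip(nodes, weights))
print(nodes, weights, integral_x4)
```

For a latent space of dimension greater than one, a tensor grid of such nodes is a natural choice, although how the paper arranges the nodes is not stated in the abstract.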

Unified Risk Analysis for Weakly Supervised Learning

paper_url: http://arxiv.org/abs/2309.08216
repo_url: None
paper_authors: Chao-Kai Chiang, Masashi Sugiyama
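for: To provide a framework that offers a comprehensive understanding and a unified methodology for weakly supervised learning (WSL), covering both its mechanisms and the risk rewrite problem.
methods: A contamination perspective is used to build the formulation, which subsumes fifteen existing WSL settings; a novel strategy called marginal chain is proposed for conducting risk rewrite.
results: The feasibility of the framework is justified by recovering existing rewrites reported in the literature.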

Abstract
Among the flourishing research of weakly supervised learning (WSL), we recognize the lack of a unified interpretation of the mechanism behind the weakly supervised scenarios, let alone a systematic treatment of the risk rewrite problem, a crucial step in the empirical risk minimization approach. In this paper, we introduce a framework providing a comprehensive understanding and a unified methodology for WSL. The formulation component of the framework, leveraging a contamination perspective, provides a unified interpretation of how weak supervision is formed and subsumes fifteen existing WSL settings. The induced reduction graphs offer comprehensive connections over WSLs. The analysis component of the framework, viewed as a decontamination process, provides a systematic method of conducting risk rewrite. In addition to the conventional inverse matrix approach, we devise a novel strategy called marginal chain aiming to decontaminate distributions. We justify the feasibility of the proposed framework by recovering existing rewrites reported in the literature.

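Summary
Amid the flourishing research on weakly supervised learning (WSL), there is no unified interpretation of the mechanism behind weakly supervised scenarios, let alone a systematic treatment of the risk rewrite problem. This paper proposes a framework that provides a comprehensive understanding of WSL and a unified methodology. The formulation component, based on a contamination perspective, gives a unified interpretation of how weak supervision is formed and subsumes fifteen existing WSL settings. The induced reduction graphs offer comprehensive connections over WSLs. The analysis component, viewed as a decontamination process, provides a systematic method for risk rewrite that includes the conventional inverse matrix approach and a novel marginal chain strategy for decontaminating distributions. The feasibility of the framework is justified by recovering existing rewrites reported in the literature.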
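
The conventional inverse matrix approach to decontamination can be illustrated with a toy two-class example. The setting is a generic mixture chosen for illustration, not taken from the paper: two observed data sources are mixtures of the clean class-conditional distributions with known proportions, so any expectation under the clean distributions is recovered by inverting the 2x2 mixing matrix. All names and numbers below are hypothetical.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources are not identifiable")
    return [[d / det, -b / det], [-c / det, a / det]]


def decontaminate(contaminated_means, mixing):
    """Recover clean expectations from expectations under contaminated sources."""
    inv = invert_2x2(mixing)
    return [
        sum(inv[i][j] * contaminated_means[j] for j in range(2)) for i in range(2)
    ]


# Expected loss of some classifier under the clean positive / negative classes.
clean = [0.2, 0.7]
# Row k: proportions of (positive, negative) data in contaminated source k.
mixing = [[0.8, 0.2], [0.3, 0.7]]
# What can actually be estimated from the two contaminated sources.
observed = [sum(m * c for m, c in zip(row, clean)) for row in mixing]

recovered = decontaminate(observed, mixing)
print(observed, recovered)
```

The rewritten risk is then a weighted sum of the recovered expectations, with every term estimable from contaminated data alone.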

HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

paper_url: http://arxiv.org/abs/2309.08208
repo_url: None
paper_authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu
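for: Audio deepfake detection (ADD), i.e. detecting spoofing attacks generated by text-to-speech or voice conversion systems.
methods: Two components are proposed: (1) a hierarchical pooling method that progressively reduces the sequence length to eliminate duplicated information, and (2) a multi-level classification token aggregation method that uses classification tokens to gather information from different blocks.
results: In experiments on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, a competitive performance compared to recent systems.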

Abstract
Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.

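Summary
Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish spoofed from bona fide utterances, may exist locally or globally in the input features. The Conformer, which combines Transformers and a CNN, has a suitable structure for capturing both. However, because the Conformer was designed for sequence-to-sequence tasks, applying it directly to ADD may be sub-optimal. To address this limitation, HM-Conformer adopts two components: (1) a hierarchical pooling method that progressively reduces the sequence length to eliminate duplicated information, and (2) a multi-level classification token aggregation method that uses classification tokens to gather information from different blocks. With these components, HM-Conformer can detect spoofing evidence efficiently by processing various sequence lengths and aggregating them. In experiments on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, competitive with recent systems.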

Gaussian Processes with Linear Multiple Kernel: Spectrum Design and Distributed Learning for Multi-Dimensional Data

paper_url: http://arxiv.org/abs/2309.08201
repo_url: https://github.com/richardcsuwandi/distributed-gsm
paper_authors: Richard Cornelius Suwandi, Zhidi Lin, Feng Yin
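for: This paper concerns Gaussian processes (GPs) for machine learning and signal processing, in particular kernel selection and modeling with linear multiple kernels (LMKs).
methods: It proposes a novel grid spectral mixture (GSM) kernel formulation for multi-dimensional data. The GSM kernel can approximate arbitrary stationary kernels, and the new formulation reduces the number of hyper-parameters while retaining a favorable optimization structure and approximation capability. To make large-scale hyper-parameter optimization tractable, the distributed SCA (DSCA) algorithm is introduced first. Building on it, the doubly distributed SCA (D$^2$SCA) algorithm, based on the ADMM framework, learns the GSM kernel cooperatively on big data while maintaining data privacy. Finally, the communication bandwidth restriction of distributed frameworks is addressed by quantizing the hyper-parameters, giving the quantized doubly distributed SCA (QD$^2$SCA) algorithm.
results: Theoretical analysis establishes convergence guarantees for the proposed algorithms, and experiments on diverse datasets demonstrate superior prediction performance and efficiency.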

Abstract
Gaussian processes (GPs) have emerged as a prominent technique for machine learning and signal processing. A key component in GP modeling is the choice of kernel, and linear multiple kernels (LMKs) have become an attractive kernel class due to their powerful modeling capacity and interpretability. This paper focuses on the grid spectral mixture (GSM) kernel, an LMK that can approximate arbitrary stationary kernels. Specifically, we propose a novel GSM kernel formulation for multi-dimensional data that reduces the number of hyper-parameters compared to existing formulations, while also retaining a favorable optimization structure and approximation capability. In addition, to make the large-scale hyper-parameter optimization in the GSM kernel tractable, we first introduce the distributed SCA (DSCA) algorithm. Building on this, we propose the doubly distributed SCA (D$^2$SCA) algorithm based on the alternating direction method of multipliers (ADMM) framework, which allows us to cooperatively learn the GSM kernel in the context of big data while maintaining data privacy. Furthermore, we tackle the inherent communication bandwidth restriction in distributed frameworks, by quantizing the hyper-parameters in D$^2$SCA, resulting in the quantized doubly distributed SCA (QD$^2$SCA) algorithm. Theoretical analysis establishes convergence guarantees for the proposed algorithms, while experiments on diverse datasets demonstrate the superior prediction performance and efficiency of our methods.
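
To make the "linear multiple kernel" idea concrete, here is a one-dimensional sketch of a spectral-mixture-style kernel whose component frequencies and bandwidth are fixed on a grid, so that only the nonnegative weights remain to be learned. The exact GSM formulation of the paper, in particular its multi-dimensional form, is not given in the abstract; the component shape below follows the standard spectral mixture kernel, and the grid values and names are assumptions.

```python
import math


def sm_component(tau, mu, sigma):
    """One stationary spectral mixture component evaluated at lag tau."""
    envelope = math.exp(-2.0 * math.pi**2 * sigma**2 * tau**2)
    return envelope * math.cos(2.0 * math.pi * mu * tau)


def gsm_kernel(tau, weights, mus, sigma):
    """Kernel that is linear in `weights`; frequencies `mus` sit on a fixed grid."""
    return sum(w * sm_component(tau, mu, sigma) for w, mu in zip(weights, mus))


mus = [0.1 * i for i in range(1, 6)]  # hypothetical fixed frequency grid
sigma = 0.05                          # hypothetical fixed bandwidth
weights = [0.5, 0.0, 1.5, 0.0, 0.0]   # the only hyper-parameters left to learn

inputs = [0.0, 0.5, 1.0]
K = [[gsm_kernel(a - b, weights, mus, sigma) for b in inputs] for a in inputs]
for row in K:
    print(row)
```

Because the kernel is a weighted sum of fixed sub-kernels, the hyper-parameter optimization has a much more regular structure than when frequencies and bandwidths are also free, which is the kind of structure the abstract describes as favorable.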

An Explainable Deep-learning Model of Proton Auroras on Mars

paper_url: http://arxiv.org/abs/2309.08195
repo_url: None
paper_authors: Dattaraj B. Dhuri, Dimitra Atri, Ahmed AlHantoobi
for: This paper studies proton auroras on Mars.
methods: It uses Mars Atmosphere and Volatile EvolutioN (MAVEN) in situ observations and limb scans of Ly alpha emission, and trains an artificial neural network to model the proton auroras.
results: A SHapley Additive exPlanations (SHAP) analysis finds that solar zenith angle, seasonal CO2 atmosphere variability, solar wind temperature, and density are the most important features for the modelled proton auroras. The paper also demonstrates that the model can serve as an inexpensive tool for simulating and characterizing the Ly alpha response under various seasonal and upstream solar wind conditions.

Abstract
Proton auroras are widely observed on the day side of Mars, identified as a significant intensity enhancement in the hydrogen Ly alpha (121.6 nm) emission between 120 and 150~km altitudes. Solar wind protons penetrating as energetic neutral atoms into the Martian thermosphere are thought to be responsible for these auroras. Understanding proton auroras is therefore important for characterizing the solar wind interaction with the atmosphere of Mars. Recent observations of spatially localized "patchy" proton auroras suggest a possible direct deposition of protons into the atmosphere of Mars during unstable solar wind conditions. Here, we develop a purely data-driven model of proton auroras using Mars Atmosphere and Volatile EvolutioN (MAVEN) in situ observations and limb scans of Ly alpha emissions between 2014 and 2022. We train an artificial neural network that reproduces individual Ly alpha intensities with a Pearson correlation of 0.95 along with a faithful reconstruction of the observed Ly alpha emission altitude profiles. By performing a SHapley Additive exPlanations (SHAP) analysis, we find that Solar Zenith Angle, seasonal CO2 atmosphere variability, solar wind temperature, and density are the most important features for the modelled proton auroras. We also demonstrate that our model can serve as an inexpensive tool for simulating and characterizing Ly alpha response under a variety of seasonal and upstream solar wind conditions.


A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

paper_url: http://arxiv.org/abs/2309.08186
repo_url: None
paper_authors: Longwei Huang, Chao Fang, Qiong Li, Jun Lin, Zhongfeng Wang
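for: To improve the inference throughput and energy efficiency of quantized deep learning models on edge devices with limited energy, memory, and computing resources.
methods: Several methods, such as FP16 multiplier reuse, multi-precision integer multiplier reuse, and balanced mapping of FPGA resources, are used to significantly improve hardware resource utilization.
results: Experiments on the Xilinx ZCU102 FPGA show that the processor improves inference throughput by 1.6 to 14.6 times and energy efficiency by 1.1 to 14.6 times compared with the prior art, XpulpNN. Its FP16 support also enables on-device learning.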

Abstract
Extreme edge platforms, such as in-vehicle smart devices, require efficient deployment of quantized deep neural networks (DNNs) to enable intelligent applications with limited amounts of energy, memory, and computing resources. However, many edge devices struggle to boost inference throughput of various quantized DNNs due to the varying quantization levels, and these devices lack floating-point (FP) support for on-device learning, which prevents them from improving model accuracy while ensuring data privacy. To tackle the challenges above, we propose a precision-scalable RISC-V DNN processor with on-device learning capability. It facilitates diverse precision levels of fixed-point DNN inference, spanning from 2-bit to 16-bit, and enhances on-device learning through improved support with FP16 operations. Moreover, we employ multiple methods such as FP16 multiplier reuse and multi-precision integer multiplier reuse, along with balanced mapping of FPGA resources, to significantly improve hardware resource utilization. Experimental results on the Xilinx ZCU102 FPGA show that our processor significantly improves inference throughput by 1.6$\sim$14.6$\times$ and energy efficiency by 1.1$\sim$14.6$\times$ across various DNNs, compared to the prior art, XpulpNN. Additionally, our processor achieves a 16.5$\times$ higher FP throughput for on-device learning.

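Summary
Extreme edge platforms, such as in-vehicle smart devices, need efficient deployment of quantized deep neural networks (DNNs) to enable intelligent applications with limited energy, memory, and computing resources. Many edge devices, however, struggle to boost inference throughput across varying quantization levels, and they lack floating-point (FP) support for on-device learning, which prevents them from improving model accuracy while keeping data private. To address these problems, the paper proposes a precision-scalable RISC-V DNN processor with on-device learning capability. The processor supports fixed-point DNN inference at precision levels from 2-bit to 16-bit and improves on-device learning through FP16 operations. Methods such as FP16 multiplier reuse, multi-precision integer multiplier reuse, and balanced mapping of FPGA resources significantly improve hardware resource utilization. Experimental results show 1.6 to 14.6 times higher inference throughput and 1.1 to 14.6 times higher energy efficiency than the baseline XpulpNN, and a 16.5 times higher FP throughput for on-device learning.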
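
As background for the precision levels discussed above, the following sketch shows symmetric uniform quantization of a few values to signed fixed-point integers of a chosen bit width. It is a software illustration with hypothetical names only; it says nothing about how the processor implements multi-precision arithmetic in hardware.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.25, 1.0]
for bits in (2, 4, 8, 16):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, q, f"max error = {worst:.5f}")
```

Lower bit widths shrink memory and arithmetic cost at the price of a larger rounding error, which is why a processor that can switch precision per network is attractive at the edge.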

A Testbed for Automating and Analysing Mobile Devices and their Applications

paper_url: http://arxiv.org/abs/2309.08158
repo_url: None
paper_authors: Lachlan Simpson, Kyle Millar, Adriel Cheng, Hong Gunn Chew, Cheng-Chew Lim
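for: To improve network situational awareness and thereby strengthen network security.
methods: Machine learning techniques give administrators insight into the devices and activities on a network; the testbed automates applications on mobile devices to generate and label network traffic.
results: Two labelled datasets of network traffic were created. The reliability of the testbed automation is analysed, and the datasets are benchmarked on the application classification task.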

Abstract
The need for improved network situational awareness has been highlighted by the growing complexity and severity of cyber-attacks. Mobile phones pose a significant risk to network situational awareness due to their dynamic behaviour and lack of visibility on a network. Machine learning techniques enhance situational awareness by providing administrators insight into the devices and activities which form their network. Developing machine learning techniques for situational awareness requires a testbed to generate and label network traffic. Current testbeds, however, are unable to automate the generation and labelling of realistic network traffic. To address this, we describe a testbed which automates applications on mobile devices to generate and label realistic traffic. From this testbed, two labelled datasets of network traffic have been created. We provide an analysis of the testbed automation reliability and benchmark the datasets for the task of application classification.

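Summary
The growing complexity and severity of cyber-attacks have highlighted the need for better network situational awareness. Mobile devices pose a significant risk to situational awareness because of their dynamic behaviour and lack of visibility on a network. Machine learning techniques can improve situational awareness by giving administrators information about devices and activities. Developing such techniques requires a testbed that generates and labels network traffic, but existing testbeds cannot automate the generation and labelling of realistic traffic. To solve this, the paper describes a testbed that automates applications on mobile devices to generate and label realistic network traffic. Two labelled datasets of network traffic were created from the testbed. The paper analyses the reliability of the testbed automation and benchmarks the two datasets for application classification.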

Two-Step Knowledge Distillation for Tiny Speech Enhancement

paper_url: http://arxiv.org/abs/2309.08144
repo_url: None
paper_authors: Rayan Daod Nathoo, Mikolaj Kegler, Marko Stamenovic
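for: This work proposes a novel two-step distillation approach for tiny speech enhancement models, compressing a model by distilling knowledge from a large teacher into a smaller student.
methods: The student is first pre-trained using only the knowledge distillation (KD) objective and then switched to a fully supervised training regime. A novel fine-grained similarity-preserving KD loss is also proposed, which matches the student's intra-activation Gram matrices to those of the teacher.
results: The method shows broad improvements and is particularly strong in adverse conditions, including high compression and low signal-to-noise ratio (SNR): at -5 dB input SNR and 63x compression it yields signal-to-distortion ratio gains of 0.9 dB and 1.1 dB, respectively, over the baseline.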

Abstract
Tiny, causal models are crucial for embedded audio machine learning applications. Model compression can be achieved via distilling knowledge from a large teacher into a smaller student model. In this work, we propose a novel two-step approach for tiny speech enhancement model distillation. In contrast to the standard approach of a weighted mixture of distillation and supervised losses, we firstly pre-train the student using only the knowledge distillation (KD) objective, after which we switch to a fully supervised training regime. We also propose a novel fine-grained similarity-preserving KD loss, which aims to match the student's intra-activation Gram matrices to that of the teacher. Our method demonstrates broad improvements, but particularly shines in adverse conditions including high compression and low signal to noise ratios (SNR), yielding signal to distortion ratio gains of 0.9 dB and 1.1 dB, respectively, at -5 dB input SNR and 63x compression compared to baseline.

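Summary
Tiny, causal models are very important for embedded audio machine learning applications. Model compression can be achieved by distilling knowledge from a large teacher model into a small student model. This work proposes a novel two-step approach for distilling tiny speech enhancement models. In contrast to the standard approach, the student is first pre-trained with only the knowledge distillation (KD) objective and then switched to a fully supervised training regime. A novel fine-grained similarity-preserving KD loss is also proposed, which aims to match the student's intra-activation Gram matrices to those of the teacher. The method shows broad improvements across conditions, particularly under high compression and low signal-to-noise ratio (SNR), with signal-to-distortion ratio gains of 0.9 dB and 1.1 dB, respectively, at -5 dB input SNR and 63x compression.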
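
The idea of matching Gram matrices can be sketched with the generic similarity-preserving distillation loss: build the Gram matrix of a batch of activations for teacher and student, normalize each row, and penalize the squared difference. This is a simplified stand-in under stated assumptions; the paper's fine-grained, intra-activation variant is not specified in the abstract, and the function names are hypothetical.

```python
import math


def gram(acts):
    """Gram matrix of a batch: entry (i, j) is the dot product of samples i and j."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in acts] for r1 in acts]


def row_normalize(matrix):
    """Scale every row to unit L2 norm (rows of zeros are left unchanged)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def sp_kd_loss(student_acts, teacher_acts):
    """Mean squared difference between normalized student and teacher Gram matrices."""
    gs = row_normalize(gram(student_acts))
    gt = row_normalize(gram(teacher_acts))
    b = len(gs)
    total = sum((s - t) ** 2 for rs, rt in zip(gs, gt) for s, t in zip(rs, rt))
    return total / (b * b)


teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # batch of 2, feature width 3
student_good = [[2.0, 0.0], [0.0, 3.0]]       # batch of 2, feature width 2
student_bad = [[1.0, 1.0], [1.0, 1.0]]
print(sp_kd_loss(student_good, teacher), sp_kd_loss(student_bad, teacher))
```

Because the Gram matrix has size batch x batch, teacher and student may have different layer widths, which is convenient when the student is heavily compressed.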

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

paper_url: http://arxiv.org/abs/2309.08125
repo_url: https://github.com/symbioticlab/oobleck
paper_authors: Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury
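for: This paper enables resilient distributed training of large deep neural network models with guaranteed fault tolerance.
methods: It takes a planning-execution co-design approach: it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on model states already replicated across the replicas for fast recovery.
results: Evaluation on large DNN models shows that Oobleck avoids resource idling and provides consistently high throughput, outperforming existing fault tolerance solutions by up to 13.9x.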

Abstract
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $13.9x$.

Supervised Stochastic Neighbor Embedding Using Contrastive Learning

paper_url: http://arxiv.org/abs/2309.08077
repo_url: https://github.com/imyizhang/manifold-learn
paper_authors: Yi Zhang
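for: This work connects stochastic neighbor embedding (SNE) methods with contrastive learning, so that dimensionality reduction can use label information in the fully supervised setting.
methods: It builds on the SNE methods $t$-SNE and UMAP and extends the self-supervised contrastive approach to the fully supervised setting.
results: While preserving the neighboring information of a dataset, the approach makes effective use of label information: clusters of samples from the same class are pulled together in the low-dimensional embedding space, and clusters of samples from different classes are pushed apart.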

Abstract
Stochastic neighbor embedding (SNE) methods $t$-SNE, UMAP are two most popular dimensionality reduction methods for data visualization. Contrastive learning, especially self-supervised contrastive learning (SSCL), has showed great success in embedding features from unlabeled data. The conceptual connection between SNE and SSCL has been exploited. In this work, within the scope of preserving neighboring information of a dataset, we extend the self-supervised contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of samples belonging to the same class are pulled together in low-dimensional embedding space, while simultaneously pushing apart clusters of samples from different classes.

2023-09-15

eess.IV

eess.IV - 2023-09-15

Regularised Diffusion-Shock Inpainting

paper_url: http://arxiv.org/abs/2309.08761
repo_url: None
paper_authors: Kristina Schaefer, Joachim Weickert
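for: This paper presents a new image inpainting method, regularised diffusion-shock (RDS) inpainting.
methods: RDS inpainting combines two carefully chosen components: homogeneous diffusion and coherence-enhancing shock filtering. It benefits from their synergy: the shock filter propagates sharp edge data over large distances, while homogeneous diffusion fills large areas efficiently.
results: RDS inpainting is extended to vector-valued data, and the experiments show performance comparable to or better than many other inpainting models.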

Abstract
We introduce regularised diffusion--shock (RDS) inpainting as a modification of diffusion--shock inpainting from our SSVM 2023 conference paper. RDS inpainting combines two carefully chosen components: homogeneous diffusion and coherence-enhancing shock filtering. It benefits from the complementary synergy of its building blocks: The shock term propagates edge data with perfect sharpness and directional accuracy over large distances due to its high degree of anisotropy. Homogeneous diffusion fills large areas efficiently. The second order equation underlying RDS inpainting inherits a maximum--minimum principle from its components, which is also fulfilled in the discrete case, in contrast to competing anisotropic methods. The regularisation addresses the largest drawback of the original model: It allows a drastic reduction in model parameters without any loss in quality. Furthermore, we extend RDS inpainting to vector-valued data. Our experiments show a performance that is comparable to or better than many inpainting models, including anisotropic processes of second or fourth order.

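Summary
We introduce regularised diffusion-shock (RDS) inpainting, a modification of the diffusion-shock inpainting from our SSVM 2023 conference paper. RDS inpainting combines two carefully chosen components: homogeneous diffusion and coherence-enhancing shock filtering. It benefits from the complementary synergy of the two: the shock filter, thanks to its high degree of anisotropy, propagates edge data over large distances, while homogeneous diffusion quickly fills large areas. The second-order equation underlying RDS inpainting inherits a maximum-minimum principle from its components, which competing anisotropic methods do not fulfil. In addition, RDS inpainting is extended to vector-valued data. The experiments show performance comparable to or better than many inpainting models, including anisotropic processes of second or fourth order.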
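
The homogeneous diffusion component on its own is easy to sketch: known samples are kept fixed and every missing sample is repeatedly replaced by the average of its neighbours until the result is stationary. The one-dimensional toy below shows only this building block under simplifying assumptions (Jacobi iterations, reflecting boundaries, hypothetical function name); it does not implement shock filtering or the regularised RDS model.

```python
def diffusion_inpaint_1d(values, known, iterations=2000):
    """Fill unknown samples of a 1D signal (length >= 2) by homogeneous diffusion."""
    u = [v if k else 0.0 for v, k in zip(values, known)]
    n = len(u)
    for _ in range(iterations):
        new = u[:]
        for i in range(n):
            if known[i]:
                continue  # known data act as fixed (Dirichlet) values
            left = u[i - 1] if i > 0 else u[i + 1]       # reflecting boundary
            right = u[i + 1] if i < n - 1 else u[i - 1]  # reflecting boundary
            new[i] = 0.5 * (left + right)
        u = new
    return u


signal = [0.0, 0.0, 0.0, 0.0, 4.0]
mask = [True, False, False, False, True]  # only the two end points are known
filled = diffusion_inpaint_1d(signal, mask)
print([round(v, 6) for v in filled])
```

Every update is an average of neighbouring values, so the result never leaves the range of the known data; this is the discrete maximum-minimum principle that the abstract highlights.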

Motion rejection and spectral unmixing for accurate estimation of in vivo oxygen saturation using multispectral optoacoustic tomography

paper_url: http://arxiv.org/abs/2309.08223
repo_url: None
paper_authors: Mitradeep Sarkar, Mailyn Pérez-Liva, Gilles Renault, Bertrand Tavitian, Jérôme Gateau
for: This paper aims to develop a two-step image processing method for estimating oxygen saturation (SO$_2$) in deep tissues using multispectral optoacoustic tomography (MSOT).
methods: The proposed method consists of two steps. The first step is to mitigate motion artifacts by selecting OA images acquired during the respiratory pause of the animal using ultrafast ultrasound images (USIs). The second step is to estimate directly the SO$_2$ value of each pixel and evaluate the amount of noise present in that pixel, thereby eliminating pixels dominated by noise from the final SO$_2$ map.
results: The proposed method was validated by in vivo oxygen challenge experiments and shown to outperform conventional methods for SO$_2$ estimation.

Abstract
Multispectral Optoacoustic Tomography (MSOT) uniquely enables spatial mapping in high resolution of oxygen saturation (SO$_2$), with potential applications in studying pathological complications and therapy efficacy. MSOT offers seamless integration with ultrasonography, by using a common ultrasound detector array. However, MSOT relies on multiple successive acquisitions of optoacoustic (OA) images at different optical wavelengths and the low frame rate of OA imaging makes the MSOT acquisition sensitive to body/respiratory motion. Moreover, estimation of SO$_2$ is highly sensitive to noise, and artefacts related to the respiratory motion of the animal were identified as the primary source of noise in MSOT. In this work, we propose a two-step image processing method for SO$_2$ estimation in deep tissues. First, to mitigate motion artefacts, we propose a method of selection of OA images acquired only during the respiratory pause of the animal, using ultrafast ultrasound images (USIs) acquired immediately after each OA acquisition (USI acquisition duration of 1.4 ms and a total delay of 7 ms). We show that gating is more effective using USIs than OA images at different optical wavelengths. Secondly, we propose a novel method which can estimate directly the SO$_2$ value of a pixel and at the same time evaluate the amount of noise present in that pixel. Hence, the method can efficiently eliminate the pixels dominated by noise from the final SO$_2$ map. Our post-processing method is shown to outperform conventional methods for SO$_2$ estimation, and the method was validated by in vivo oxygen challenge experiments.

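Summary
Multispectral optoacoustic tomography (MSOT) enables high-resolution spatial mapping of oxygen saturation (SO$_2$), with a range of potential applications. MSOT integrates with ultrasound imaging by sharing a common ultrasound detector array, but it requires several successive optoacoustic (OA) image acquisitions, and the low frame rate of OA imaging makes the acquisition sensitive to body and respiratory motion. In addition, SO$_2$ estimation is sensitive to noise, and respiratory motion of the animal is the primary noise source. The paper therefore proposes a two-step image processing method to improve SO$_2$ estimation. The first step mitigates motion artefacts by selecting OA images acquired during the respiratory pause of the animal, using ultrafast ultrasound images (USIs) with an acquisition duration of 1.4 ms and a total delay of 7 ms. Gating with USIs is shown to be more effective than gating with OA images at different optical wavelengths. The second step directly estimates the SO$_2$ value of each pixel while evaluating the amount of noise in that pixel, so pixels dominated by noise can be removed efficiently from the SO$_2$ map. The post-processing method outperforms conventional SO$_2$ estimation methods and was validated by in vivo oxygen challenge experiments.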
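
For context, the conventional baseline that the paper compares against can be sketched as linear spectral unmixing: the per-pixel OA spectrum is modelled as a weighted sum of the absorption spectra of oxy- and deoxyhaemoglobin, the two weights are fitted by least squares, and SO$_2$ is the oxygenated fraction. This is not the paper's new estimator, and the absorption values below are made-up placeholders rather than physical coefficients; the function name is hypothetical.

```python
def unmix_so2(spectrum, eps_hbo2, eps_hb):
    """Least-squares fit of spectrum ~ c1 * eps_hbo2 + c2 * eps_hb; returns c1 / (c1 + c2)."""
    # Normal equations of the two-parameter least-squares problem.
    a11 = sum(e * e for e in eps_hbo2)
    a12 = sum(e1 * e2 for e1, e2 in zip(eps_hbo2, eps_hb))
    a22 = sum(e * e for e in eps_hb)
    b1 = sum(e * s for e, s in zip(eps_hbo2, spectrum))
    b2 = sum(e * s for e, s in zip(eps_hb, spectrum))
    det = a11 * a22 - a12 * a12
    c_hbo2 = (a22 * b1 - a12 * b2) / det
    c_hb = (a11 * b2 - a12 * b1) / det
    return c_hbo2 / (c_hbo2 + c_hb)


# Placeholder absorption spectra at three wavelengths (NOT real coefficients).
eps_hbo2 = [1.0, 2.0, 3.0]
eps_hb = [3.0, 2.0, 1.5]
# Synthetic noise-free pixel with concentrations 3 (oxy) and 1 (deoxy).
pixel = [3.0 * a + 1.0 * b for a, b in zip(eps_hbo2, eps_hb)]
so2 = unmix_so2(pixel, eps_hbo2, eps_hb)
print(f"SO2 = {so2:.2f}")
```

With noisy or motion-corrupted spectra the fitted concentrations can become very small or negative, which makes the ratio unstable; that sensitivity is the problem the paper's gating and noise-aware estimation steps are meant to address.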

NISF: Neural Implicit Segmentation Functions

paper_url: http://arxiv.org/abs/2309.08643
repo_url: https://github.com/niloide/implicit_segmentation
paper_authors: Nil Stolt-Ansó, Julian McGinnis, Jiazhen Pan, Kerstin Hammernik, Daniel Rueckert
for: This new family of image segmentation models is intended as a tool for automating clinical measurements.
methods: The model builds on neural implicit functions, in which a network learns a mapping from a real-valued coordinate space to a shape representation, so anatomical shapes can be segmented in high-dimensional continuous spaces.
results: On a 3D+t short-axis cardiac segmentation task using the UK Biobank dataset, the model achieves a dice score of 0.87 ± 0.045.

Abstract
Segmentation of anatomical shapes from medical images has taken an important role in the automation of clinical measurements. While typical deep-learning segmentation approaches are performed on discrete voxels, the underlying objects being analysed exist in a real-valued continuous space. Approaches that rely on convolutional neural networks (CNNs) are limited to grid-like inputs and not easily applicable to sparse or partial measurements. We propose a novel family of image segmentation models that tackle many of CNNs' shortcomings: Neural Implicit Segmentation Functions (NISF). Our framework takes inspiration from the field of neural implicit functions where a network learns a mapping from a real-valued coordinate-space to a shape representation. NISFs have the ability to segment anatomical shapes in high-dimensional continuous spaces. Training is not limited to voxelized grids, and covers applications with sparse and partial data. Interpolation between observations is learnt naturally in the training procedure and requires no post-processing. Furthermore, NISFs allow the leveraging of learnt shape priors to make predictions for regions outside of the original image plane. We go on to show the framework achieves dice scores of 0.87 $\pm$ 0.045 on a (3D+t) short-axis cardiac segmentation task using the UK Biobank dataset. We also provide a qualitative analysis on our frameworks ability to perform segmentation and image interpolation on unseen regions of an image volume at arbitrary resolutions.

Summary
Segmentation of anatomical shapes from medical images has become increasingly important in the automation of clinical measurements. While traditional deep-learning segmentation methods operate on discrete voxels, the objects being analyzed exist in a real-valued continuous space. Approaches based on convolutional neural networks (CNNs) are limited to grid-like inputs and are not easily applicable to sparse or partial measurements. We propose a novel family of image segmentation models that address many of the shortcomings of CNNs: Neural Implicit Segmentation Functions (NISF). Our framework takes inspiration from the field of neural implicit functions, where a network learns a mapping from a real-valued coordinate space to a shape representation. NISFs can segment anatomical shapes in high-dimensional continuous spaces, and training is not limited to voxelized grids, covering applications with sparse and partial data. Interpolation between observations is learned naturally during the training process and requires no post-processing. Furthermore, NISFs allow the leveraging of learned shape priors to make predictions for regions outside of the original image plane. We demonstrate the effectiveness of our framework on a 3D+t short-axis cardiac segmentation task using the UK Biobank dataset, achieving dice scores of 0.87 ± 0.045. We also provide a qualitative analysis of our framework's ability to perform segmentation and image interpolation on unseen regions of an image volume at arbitrary resolutions.

Finite Compressive Sensing

paper_url: http://arxiv.org/abs/2309.08641
repo_url: https://github.com/jwkanggist/ssamp
paper_authors: Marlon Bran Lorenzana, Benjamin Cottier, Matthew Marques, Andrew Kingston, Shekhar S. Chandra
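for: This study reviews common sampling strategies for compressed-sensing magnetic resonance imaging (CS-MRI) to motivate a new projective and Cartesian sampling scheme.
methods: The study uses a projection-based pseudo-random sampling scheme (p.frac) and proposes finite Fourier reconstruction (FFR), a reconstruction method based on the exact projections of the discrete Radon transform (DRT).
results: Experiments on measured and simulated data indicate that FCS yields a 3-5 dB PSNR gain over 1D Cartesian random sampling at 2-, 4- and 8-fold under-sampling; comparisons with 2D Cartesian random sampling and radial under-sampling are also reported.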

Abstract
This paper introduces a sparse projection matrix composed of discrete (digital) periodic lines that create a pseudo-random (p.frac) sampling scheme. Our approach enables random Cartesian sampling whilst employing deterministic and one-dimensional (1D) trajectories derived from the discrete Radon transform (DRT). Unlike radial trajectories, DRT projections can be back-projected without interpolation. Thus, we also propose a novel reconstruction method based on the exact projections of the DRT called finite Fourier reconstruction (FFR). We term this combined p.frac and FFR strategy Finite Compressive Sensing (FCS), with image recovery demonstrated on experimental and simulated data; image quality comparisons are made with Cartesian random sampling in 1D and two-dimensional (2D), as well as radial under-sampling in a more constrained experiment. Our experiments indicate FCS enables 3-5dB gain in peak signal-to-noise ratio (PSNR) for 2-, 4- and 8-fold under-sampling compared to 1D Cartesian random sampling. This paper aims to: Review common sampling strategies for compressed sensing (CS)-magnetic resonance imaging (MRI) to inform the motivation of a projective and Cartesian sampling scheme. Compare the incoherence of these sampling strategies and the proposed p.frac. Compare reconstruction quality of the sampling schemes under various reconstruction strategies to determine the suitability of p.frac for CS-MRI. It is hypothesised that because p.frac is a highly incoherent sampling scheme, that reconstructions will be of high quality compared to 1D Cartesian phase-encode under-sampling.

Summary
This paper aims to:
1. Review common sampling strategies for compressed sensing (CS)-magnetic resonance imaging (MRI) to inform the motivation of a projective and Cartesian sampling scheme.
2. Compare the incoherence of these sampling strategies and the proposed p.frac.
3. Compare reconstruction quality of the sampling schemes under various reconstruction strategies to determine the suitability of p.frac for CS-MRI.
It is hypothesized that, because p.frac is a highly incoherent sampling scheme, its reconstructions will be of high quality compared to 1D Cartesian phase-encode under-sampling.
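
The discrete periodic lines mentioned in the abstract can be illustrated on a small grid. The following is a minimal sketch, assuming a p x p sampling grid with p prime and lines that pass through the origin. The function names are hypothetical, and the slopes chosen here are arbitrary; they are not the paper's p.frac selection rule.

```python
def periodic_line(p, m):
    """Return the p points of a discrete periodic line through the origin.

    Slopes m = 0..p-1 give the points (k, m*k mod p); m = p denotes the
    remaining line (0, k). p is assumed to be prime.
    """
    if m == p:
        return [(0, k) for k in range(p)]
    return [(k, (m * k) % p) for k in range(p)]


def line_mask(p, slopes):
    """Binary p x p sampling mask covered by the chosen periodic lines."""
    mask = [[0] * p for _ in range(p)]
    for m in slopes:
        for u, v in periodic_line(p, m):
            mask[u][v] = 1
    return mask


def sampled_fraction(mask):
    """Fraction of grid points that the mask samples."""
    p = len(mask)
    return sum(map(sum, mask)) / (p * p)


if __name__ == "__main__":
    p = 7
    partial = line_mask(p, [0, 1, 3])    # three arbitrary slopes
    full = line_mask(p, range(p + 1))    # all p + 1 slopes
    print(sampled_fraction(partial), sampled_fraction(full))
```

For prime p, two distinct lines of this kind meet only at the origin, so n slopes sample n(p-1)+1 points and all p+1 slopes together cover the whole grid.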

2023-09-15

eess.SP

eess.SP - 2023-09-15

Robust Indoor Localization with Ranging-IMU Fusion

paper_url: http://arxiv.org/abs/2309.08803
repo_url: None
paper_authors: Fan Jiang, David Caruso, Ashutosh Dhekne, Qi Qu, Jakob Julian Engel, Jing Dong
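for: This study targets low-power, high-accuracy indoor wireless ranging localization for wearable devices.
methods: Range measurements are fused with inertial measurements from a low-cost IMU, using a novel asymmetric noise model designed for non-Gaussian multipath disturbances. A Levenberg-Marquardt (LM)-family trust-region adaptation of the iSAM2 fusion algorithm is used to improve robustness.
results: In a densely occupied real office environment, the method achieves temporally consistent localization with an average absolute accuracy of about 0.3 m, and comparable accuracy even with infrequent (1 Hz) range measurements.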

Abstract
Indoor wireless ranging localization is a promising approach for low-power and high-accuracy localization of wearable devices. A primary challenge in this domain stems from non-line of sight propagation of radio waves. This study tackles a fundamental issue in wireless ranging: the unpredictability of real-time multipath determination, especially in challenging conditions such as when there is no direct line of sight. We achieve this by fusing range measurements with inertial measurements obtained from a low cost Inertial Measurement Unit (IMU). For this purpose, we introduce a novel asymmetric noise model crafted specifically for non-Gaussian multipath disturbances. Additionally, we present a novel Levenberg-Marquardt (LM)-family trust-region adaptation of the iSAM2 fusion algorithm, which is optimized for robust performance for our ranging-IMU fusion problem. We evaluate our solution in a densely occupied real office environment. Our proposed solution can achieve temporally consistent localization with an average absolute accuracy of $\sim$0.3m in real-world settings. Furthermore, our results indicate that we can achieve comparable accuracy even with infrequent (1Hz) range measurements.

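Summary
Indoor wireless ranging localization is a promising approach for low-power, high-accuracy localization of wearable devices. The main challenge in this area is non-line-of-sight propagation of radio waves. This study addresses a fundamental problem in wireless ranging: the unpredictability of real-time multipath determination, especially in difficult conditions with no direct line of sight. We do this by fusing range measurements with inertial measurements from a low-cost IMU. To that end, we propose a novel asymmetric noise model designed for non-Gaussian multipath disturbances. We also propose an LM-family trust-region adaptation of the iSAM2 fusion algorithm, optimized for robust performance on our ranging-IMU fusion problem. We evaluate the solution in a densely occupied real office environment, where it achieves temporally consistent localization with an average absolute accuracy of about 0.3 m. Our results further indicate that comparable accuracy is achievable even with infrequent (1 Hz) range measurements.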
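
To make the idea of an asymmetric noise model concrete, here is a minimal sketch of one possible asymmetric cost on a range residual. It rests on the physical observation that a reflected path is longer than the direct one, so multipath tends to lengthen measured ranges. The function name, the Huber-like shape, and the parameters `sigma` and `delta` are illustrative assumptions; the abstract does not specify the paper's actual model.

```python
def asymmetric_range_loss(residual, sigma=0.1, delta=1.0):
    """Asymmetric robust cost for a range residual (measured - predicted), in metres.

    Hypothetical shape: quadratic for negative and small positive residuals,
    linear (Huber-like) for large positive residuals, which is where
    multipath-lengthened ranges would fall.
    """
    z = residual / sigma
    if z <= delta:
        return 0.5 * z * z
    return delta * (z - 0.5 * delta)


if __name__ == "__main__":
    for r in (-0.5, -0.1, 0.0, 0.1, 0.5):
        print(f"residual {r:+.1f} m -> cost {asymmetric_range_loss(r):.2f}")
```

With this shape, a range that is 0.5 m too long costs less than one that is 0.5 m too short, so a factor-graph optimizer using it would be pulled less by multipath outliers than by geometrically implausible short ranges.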

Towards Robust and Efficient Communications for Urban Air Mobility

paper_url: http://arxiv.org/abs/2309.08796
repo_url: None
paper_authors: Dennis Becker, Lukas Schalk
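for: This work develops DroneCAST, a Drone-to-Drone (D2D) communication and surveillance system tailored to the requirements of a future urban airspace, to help ensure safe operations.
methods: DroneCAST is designed as part of a multi-link approach; two drones were equipped with hardware prototypes of the experimental communication system.
results: Several flights around a model city were carried out to evaluate the hardware performance and to demonstrate applications that rely on robust and efficient communications.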

Abstract
For the realization of the future urban air mobility, reliable information exchange based on robust and efficient communication between all airspace participants will be one of the key factors to ensure safe operations. Especially in dense urban scenarios, the direct and fast information exchange between drones based on Drone-to-Drone communications is a promising technology for enabling reliable collision avoidance systems. However, to mitigate collisions and to increase overall reliability, unmanned aircraft still lack a redundant, higher-level safety net to coordinate and monitor traffic, as is common in today's civil aviation. In addition, direct and fast information exchange based on ad hoc communication is needed to cope with the very short reaction times required to avoid collisions and to cope with the high traffic densities. Therefore, we are developing a Drone-to-Drone (D2D) communication and surveillance system, called DroneCAST, which is specifically tailored to the requirements of a future urban airspace and will be part of a multi-link approach. In this work we discuss challenges and expected safety-critical applications that will have to rely on communications for urban air mobility (UAM) and present our communication concept and necessary steps towards DroneCAST. As a first step towards an implementation, we equipped two drones with hardware prototypes of the experimental communication system and performed several flights around the model city to evaluate the performance of the hardware and to demonstrate different applications that will rely on robust and efficient communications.

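Summary
For future urban air mobility, reliable information exchange based on robust and efficient communication between all airspace participants will be a key factor in ensuring safe operations. Especially in dense urban scenarios, direct and fast Drone-to-Drone communication is a promising technology for reliable collision avoidance systems. However, to mitigate collisions and increase overall reliability, unmanned aircraft still lack a redundant, higher-level safety net that coordinates and monitors traffic, as is common in today's civil aviation. In addition, direct and fast information exchange based on ad hoc communication is needed to cope with the very short reaction times and the high traffic densities. We are therefore developing a D2D communication and surveillance system called DroneCAST, which is tailored to the requirements of a future urban airspace and will be part of a multi-link approach. In this work we discuss the challenges and expected safety-critical applications of communications for urban air mobility, and present our communication concept and the steps needed towards DroneCAST. As a first step towards an implementation, we equipped two drones with hardware prototypes of the experimental communication system and performed several flights around a model city to evaluate the hardware and to demonstrate different applications that rely on robust and efficient communications.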

Stein Variational Gradient Descent-based Detection For Random Access With Preambles In MTC

paper_url: http://arxiv.org/abs/2309.08782
repo_url: None
paper_authors: Xin Zhu, Hongyi Pan, Salih Atici, Ahmet Enis Cetin
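for: Improve preamble detection accuracy in the grant-based random access scheme of massive machine-type communication (mMTC).
methods: A novel preamble detection algorithm based on Stein variational gradient descent (SVGD), which uses deterministic particle updates for continuous inference, together with a normalized variant with momentum.
results: Higher detection accuracy than Markov chain Monte Carlo-based approaches.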

Abstract
Traditional preamble detection algorithms have low accuracy in the grant-based random access scheme in massive machine-type communication (mMTC). We present a novel preamble detection algorithm based on Stein variational gradient descent (SVGD) at the second step of the random access procedure. It efficiently leverages deterministic updates of particles for continuous inference. To further enhance the performance of the SVGD detector, especially in a dense user scenario, we propose a normalized SVGD detector with momentum. It utilizes the momentum and a bias correction term to reduce the preamble estimation errors during the gradient descent process. Simulation results show that the proposed algorithm performs better than Markov Chain Monte Carlo-based approaches in terms of detection accuracy.

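Summary
Traditional preamble detection algorithms have low accuracy in the grant-based random access scheme of massive machine-type communication (mMTC). We propose a new preamble detection algorithm based on Stein variational gradient descent (SVGD), applied at the second step of the random access procedure, which efficiently uses deterministic particle updates for continuous inference. To further improve the SVGD detector, especially in dense user scenarios, we propose a normalized SVGD detector with momentum, which uses the momentum and a bias correction term to reduce preamble estimation errors during gradient descent. Simulation results show that the proposed algorithm achieves better detection accuracy than Markov chain Monte Carlo-based approaches.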
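
The deterministic particle updates that SVGD relies on can be sketched on a toy problem. The code below is plain SVGD with a fixed-bandwidth RBF kernel on a one-dimensional Gaussian target. It is an illustration under those assumptions only: it does not include the paper's normalization, momentum, or bias correction, and the target and function names are hypothetical.

```python
import math


def rbf(a, b, h):
    """RBF kernel between two scalar particles with bandwidth h."""
    return math.exp(-((a - b) ** 2) / (2.0 * h * h))


def svgd_step(particles, grad_log_p, step=0.1, h=1.0):
    """One deterministic SVGD update of all particles (1-D case)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            attract = k * grad_log_p(xj)        # pulls towards high density
            repel = -(xj - xi) / (h * h) * k    # gradient of k with respect to xj
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(particles, grad_log_p, iters=500, step=0.1, h=1.0):
    """Apply svgd_step repeatedly and return the final particles."""
    for _ in range(iters):
        particles = svgd_step(particles, grad_log_p, step, h)
    return particles


if __name__ == "__main__":
    target_mean = 3.0
    grad = lambda x: -(x - target_mean)  # log-density gradient of N(3, 1)
    final = run_svgd([-1.0, -0.5, 0.0, 0.5, 1.0], grad)
    print(sum(final) / len(final))
```

The attraction term moves particles towards regions of high target density, while the kernel-gradient term keeps them apart, so the particle set approximates the distribution rather than collapsing onto its mode.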

Probabilistic Constellation Shaping With Denoising Diffusion Probabilistic Models: A Novel Approach

paper_url: http://arxiv.org/abs/2309.08688
repo_url: None
paper_authors: Mehdi Letafati, Samad Ali, Matti Latva-aho
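for: This paper describes how to use denoising diffusion probabilistic models (DDPM) for probabilistic constellation shaping, in order to improve the information rate and communication performance in wireless communications.
methods: Using the "denoise-and-generate" property of DDPMs, the scheme learns to generate constellation symbols out of noise, mimicking the way the receiver reconstructs symbols.
results: Simulations show that the proposed scheme outperforms a deep neural network (DNN)-based benchmark and uniform shaping, with network resilience and robust out-of-distribution performance under low-SNR regimes and non-Gaussian noise. For 64-QAM geometry, mutual information improves threefold over the DNN-based approach.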

Abstract
With the incredible results achieved from generative pre-trained transformers (GPT) and diffusion models, generative AI (GenAI) is envisioned to yield remarkable breakthroughs in various industrial and academic domains. In this paper, we utilize denoising diffusion probabilistic models (DDPM), as one of the state-of-the-art generative models, for probabilistic constellation shaping in wireless communications. While the geometry of constellations is predetermined by the networking standards, probabilistic constellation shaping can help enhance the information rate and communication performance by designing the probability of occurrence (generation) of constellation symbols. Unlike conventional methods that deal with an optimization problem over the discrete distribution of constellations, we take a radically different approach. Exploiting the "denoise-and-generate" characteristic of DDPMs, the key idea is to learn how to generate constellation symbols out of noise, "mimicking" the way the receiver performs symbol reconstruction. By doing so, we make the constellation symbols sent by the transmitter, and what is inferred (reconstructed) at the receiver become as similar as possible. Our simulations show that the proposed scheme outperforms deep neural network (DNN)-based benchmark and uniform shaping, while providing network resilience as well as robust out-of-distribution performance under low-SNR regimes and non-Gaussian noise. Notably, a threefold improvement in terms of mutual information is achieved compared to DNN-based approach for 64-QAM geometry.

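Summary
Following the striking results of generative pre-trained transformers (GPT) and diffusion models, generative AI (GenAI) is expected to bring major breakthroughs across industrial and academic domains. In this paper, we use denoising diffusion probabilistic models (DDPM), a state-of-the-art class of generative models, for probabilistic constellation shaping in wireless communications. Although the geometry of a constellation is fixed by networking standards, probabilistic shaping can still improve the information rate and communication performance by designing the probability of occurrence of the constellation symbols. Unlike conventional methods, we exploit the "denoise-and-generate" property of DDPMs and learn to generate constellation symbols out of noise, mimicking the way the receiver reconstructs symbols, so that the symbols sent by the transmitter and those reconstructed at the receiver become as similar as possible. Our simulations show that the scheme outperforms a DNN-based benchmark and uniform shaping, while providing network resilience and robust out-of-distribution performance under low-SNR regimes and non-Gaussian noise. For 64-QAM geometry, it achieves a threefold improvement in mutual information over the DNN-based approach.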
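
The noising half of a DDPM, which the learned denoiser is trained to invert, can be sketched for a single constellation symbol. This is a minimal illustration of the standard forward process with a linear beta schedule. The schedule values, the symbol, and the function names are assumptions for the example, and the paper's trained reverse (generation) network is not shown.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    abar, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar


def noise_symbol(x0, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for a complex symbol."""
    eps = complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps


if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bar_schedule(1000)
    symbol = complex(1.0, -1.0)  # one illustrative constellation point
    for t in (0, 499, 999):
        print(t, noise_symbol(symbol, t, abar, rng))
```

At early steps the sample stays close to the transmitted symbol, and at the final step it is almost pure noise; generation runs this chain in reverse, starting from noise.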

Enhancing Near-Field Sensing and Communications with Sparse Arrays: Potentials, Challenges, and Emerging Trends

paper_url: http://arxiv.org/abs/2309.08681
repo_url: None
paper_authors: Songjie Yang, Wanting Lyu, Zhongpei Zhang, Chau Yuen
for: The paper explores the potential of extremely large-scale (XL)-arrays for overcoming the severe path loss in millimeter-wave (mmWave) and TeraHertz (THz) channels, which is crucial for enabling 6G.
methods: The paper uses the spherical-wave (SW) model to accurately represent the near-field propagation characteristics of XL-arrays, which significantly increases signal processing complexity.
results: The paper identifies the potential benefits of near-field sensing and communications (S&C) enabled by XL-arrays, including improved communication multiplexing capability, spatial resolution, and degrees of freedom. It also proposes sparse arrays (SAs) as a way to overcome the limitations of existing XL-arrays with dense uniform array layouts, and explores XL-SAs for mmWave/THz systems using multi-subarray designs.

Abstract
As a promising technique, extremely large-scale (XL)-arrays offer potential solutions for overcoming the severe path loss in millimeter-wave (mmWave) and TeraHertz (THz) channels, crucial for enabling 6G. Nevertheless, XL-arrays introduce deviations in electromagnetic propagation compared to traditional arrays, fundamentally challenging the assumption of the planar-wave model. Instead, this ushers in the spherical-wave (SW) model to accurately represent the near-field propagation characteristics, significantly increasing signal processing complexity. Fortunately, the SW model shows remarkable benefits for sensing and communications (S&C), e.g., improving communication multiplexing capability, spatial resolution, and degrees of freedom. In this context, this article first overviews hardware/algorithm challenges, fundamental potentials, and promising applications of near-field S&C enabled by XL-arrays. To overcome the limitations of existing XL-arrays with dense uniform array layouts and improve S&C applications, we introduce sparse arrays (SAs). Exploring their potential, we propose XL-SAs for mmWave/THz systems using multi-subarray designs. Finally, several applications, challenges and research directions are identified.

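Summary
Extremely large-scale (XL) arrays are a promising technique for overcoming the severe path loss in mmWave and THz channels, which is crucial for enabling 6G. However, XL-arrays introduce deviations in electromagnetic propagation compared to traditional arrays, which challenges the planar-wave assumption. The spherical-wave (SW) model is needed instead to represent near-field propagation accurately, at the cost of higher signal processing complexity. The SW model brings notable benefits for sensing and communications (S&C), such as improved communication multiplexing capability, spatial resolution, and degrees of freedom. This article first reviews the hardware/algorithm challenges, fundamental potentials, and promising applications of near-field S&C enabled by XL-arrays. To overcome the limitations of existing XL-arrays with dense uniform layouts, we then introduce sparse arrays (SAs) and propose XL-SAs for mmWave/THz systems using multi-subarray designs. Finally, several applications, challenges, and research directions are identified.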
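
The gap between the planar-wave and spherical-wave models can be made concrete with a small numerical sketch. The code assumes a uniform linear array along one axis and a point source at range `r` and angle `theta` measured from broadside. All numbers (a 1 m aperture, a 1 cm wavelength, 65 elements) and all function names are illustrative assumptions, not values from the article.

```python
import math


def spherical_distance(r, theta, x):
    """Exact distance from an array element at offset x to a source at (r, theta)."""
    return math.sqrt(r * r + x * x - 2.0 * r * x * math.sin(theta))


def planar_distance(r, theta, x):
    """Planar-wave (far-field) approximation of the same distance."""
    return r - x * math.sin(theta)


def max_phase_error(r, theta, aperture, num_elements, wavelength):
    """Largest phase mismatch (radians) between the two models across the array."""
    worst = 0.0
    for n in range(num_elements):
        x = -aperture / 2.0 + aperture * n / (num_elements - 1)
        err = abs(spherical_distance(r, theta, x) - planar_distance(r, theta, x))
        worst = max(worst, 2.0 * math.pi * err / wavelength)
    return worst


if __name__ == "__main__":
    for r in (5.0, 50.0, 500.0):
        print(r, max_phase_error(r, 0.0, aperture=1.0, num_elements=65, wavelength=0.01))
```

In this setup the mismatch is many radians at 5 m and a fraction of a radian at 500 m, which is why a large aperture at short range calls for the spherical-wave model.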

Denoising Diffusion Probabilistic Models for Hardware-Impaired Communications

paper_url: http://arxiv.org/abs/2309.08568
repo_url: None
paper_authors: Mehdi Letafati, Samad Ali, Matti Latva-aho
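for: This paper explores applications of denoising diffusion probabilistic models (DDPMs) in wireless communication systems under practical assumptions: hardware impairments (HWI), low-SNR regime, and quantization error.
methods: The paper proposes a denoising diffusion model-based receiver that provides network resilience under low SNR, non-Gaussian noise, different HWI levels, and quantization error.
results: Compared to deep neural network (DNN)-based receivers, the scheme achieves 30% and 20% improvement in bit error rate (BER) in AWGN and non-Gaussian scenarios, respectively.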

Abstract
Generative AI has received significant attention among a spectrum of diverse industrial and academic domains, thanks to the magnificent results achieved from deep generative models such as generative pre-trained transformers (GPT) and diffusion models. In this paper, we explore the applications of denoising diffusion probabilistic models (DDPMs) in wireless communication systems under practical assumptions such as hardware impairments (HWI), low-SNR regime, and quantization error. Diffusion models are a new class of state-of-the-art generative models that have already showcased notable success with some of the popular examples by OpenAI and Google Brain. The intuition behind DDPM is to decompose the data generation process over small "denoising" steps. Inspired by this, we propose using denoising diffusion model-based receiver for a practical wireless communication scheme, while providing network resilience in low-SNR regimes, non-Gaussian noise, different HWI levels, and quantization error. We evaluate the reconstruction performance of our scheme in terms of bit error rate (BER) and mean-squared error (MSE). Our results show that 30% and 20% improvement in BER could be achieved compared to deep neural network (DNN)-based receivers in AWGN and non-Gaussian scenarios, respectively.

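Summary
Generative AI has received wide attention across diverse industrial and academic domains, largely thanks to deep generative models such as generative pre-trained transformers (GPT) and diffusion models. In this paper, we explore the application of denoising diffusion probabilistic models (DDPMs) to wireless communication systems under practical assumptions such as hardware impairments (HWI), low SNR, and quantization error. Diffusion models are a new class of generative models that have already shown notable success in several well-known examples. The basic idea is to decompose the data generation process into small "denoising" steps. We propose a diffusion model-based receiver for a practical wireless communication scheme that provides network resilience under low SNR, non-Gaussian noise, different HWI levels, and quantization error. We evaluate reconstruction performance in terms of BER and MSE. Our results show BER improvements of 30% and 20% over DNN-based receivers in AWGN and non-Gaussian scenarios, respectively.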

Robust IRS-Element Activation for Energy Efficiency Optimization in IRS-Assisted Communication Systems With Imperfect CSI

paper_url: http://arxiv.org/abs/2309.08526
repo_url: None
paper_authors: Christos N. Efrem, Ioannis Krikidis
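for: This paper studies an intelligent reflecting surface (IRS)-aided communication system with a single-antenna transmitter and receiver under imperfect channel state information (CSI), focusing on the robust selection of binary (on/off) states of the IRS elements to maximize worst-case energy efficiency.
methods: Closed-form expressions of the worst-case signal-to-noise ratio (SNR) are derived, and robust (discrete) optimization problems are formulated given a bounded CSI uncertainty, for both continuous and discrete IRS phase shifts.
results: For continuous phase shifts, a dynamic programming (DP) algorithm reaches the global maximum with complexity O(L log L), where L is the number of IRS elements. For discrete phase shifts, a convex-relaxation-based method (CRBM) obtains a feasible (sub-optimal) solution in O(L^3.5) time. Numerical simulations show that the algorithms are several orders of magnitude faster than exhaustive search when L is large, and that both outperform activating all IRS elements.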

Abstract
In this paper, we study an intelligent reflecting surface (IRS)-aided communication system with single-antenna transmitter and receiver, under imperfect channel state information (CSI). More specifically, we deal with the robust selection of binary (on/off) states of the IRS elements in order to maximize the worst-case energy efficiency (EE), given a bounded CSI uncertainty, while satisfying a minimum signal-to-noise ratio (SNR). In addition, we consider not only continuous but also discrete IRS phase shifts. First, we derive closed-form expressions of the worst-case SNRs, and then formulate the robust (discrete) optimization problems for each case. In the case of continuous phase shifts, we design a dynamic programming (DP) algorithm that is theoretically guaranteed to achieve the global maximum with polynomial complexity $O(L\,{\log L})$, where $L$ is the number of IRS elements. In the case of discrete phase shifts, we develop a convex-relaxation-based method (CRBM) to obtain a feasible (sub-optimal) solution in polynomial time $O(L^{3.5})$, with a posteriori performance guarantee. Furthermore, numerical simulations provide useful insights and confirm the theoretical results. In particular, the proposed algorithms are several orders of magnitude faster than the exhaustive search when $L$ is large, thus being highly scalable and suitable for practical applications. Moreover, both algorithms outperform a baseline scheme, namely, the activation of all IRS elements.

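Summary
In this paper, we study an intelligent reflecting surface (IRS)-aided communication system in which the transmitter and the receiver each have a single antenna. Specifically, under imperfect channel state information (CSI), we address the robust selection of the binary (on/off) states of the IRS elements to maximize the worst-case energy efficiency (EE) while guaranteeing a minimum signal-to-noise ratio (SNR). We consider both continuous and discrete IRS phase shifts. We first derive closed-form expressions of the worst-case SNR and then formulate the robust optimization problems. For continuous phase shifts, we design a dynamic programming (DP) algorithm that attains the global maximum with complexity $O(L\log L)$, where $L$ is the number of IRS elements. For discrete phase shifts, we develop a convex-relaxation-based method (CRBM) that obtains a feasible (sub-optimal) solution with complexity $O(L^{3.5})$. Numerical simulations show that the proposed algorithms are several orders of magnitude faster than exhaustive search when $L$ is large, making them highly scalable and suitable for practical applications. Both algorithms also outperform a baseline scheme that activates all IRS elements.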

Novel Expressions for the Outage Probability and Diversity Gains in Fluid Antenna System

paper_url: http://arxiv.org/abs/2309.08441
repo_url: None
paper_authors: José David Vega-Sánchez, Arianna Estefanía López-Ramírez, Luis Urquiza-Aguiar, Diana Pamela Moya Osorio
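for: This study examines the fluid antenna system (FAS) as a way to provide remarkable diversity gains in networks with small and constrained devices.
methods: The letter compares the outage probability (OP) performance of non-diversity and diversity FAS receivers over spatially correlated Nakagami-$m$ fading channels.
results: Using a novel asymptotic matching method, a simple yet accurate closed-form approximation is derived for the OP of maximum-gain combining FAS (MGC-FAS). The approximation proceeds in two stages: first the CDF of each MGC-FAS branch, then the end-to-end CDF of the MGC-FAS scheme. Closed-form expressions for the OP and the asymptotic OP follow, and numerical results validate the approximation under different diversity FAS scenarios.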

Abstract
The flexibility and reconfigurability at the radio frequency (RF) front-end offered by the fluid antenna system (FAS) make this technology promising for providing remarkable diversity gains in networks with small and constrained devices. Toward this direction, this letter compares the outage probability (OP) performance of non-diversity and diversity FAS receivers undergoing spatially correlated Nakagami-$m$ fading channels. Although the system properties of FAS incur in complex analysis, we derive a simple yet accurate closed-form approximation by relying on a novel asymptotic matching method for the OP of a maximum-gain combining-FAS (MGC-FAS). The approximation is performed in two stages, the approximation of the cumulative density function (CDF) of each MGC-FAS branch, and then the approximation of the end-to-end CDF of the MGC-FAS scheme. With these results, closed-form expressions for the OP and the asymptotic OP are derived. Finally, numerical results validate our approximation of the MGC-FAS scheme and demonstrate its accuracy under different diversity FAS scenarios.

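Summary
The flexibility and reconfigurability of the fluid antenna system (FAS) make it a promising technology for providing remarkable diversity gains with constrained devices. To this end, this letter compares the outage probability (OP) performance of non-diversity and diversity FAS receivers over spatially correlated Nakagami-$m$ fading channels. Although the system properties of FAS make the analysis complex, we rely on a novel asymptotic matching method to derive a simple yet accurate closed-form approximation for the OP of maximum-gain combining FAS (MGC-FAS). The approximation is carried out in two stages: first the cumulative density function (CDF) of each MGC-FAS branch, then the end-to-end CDF of the MGC-FAS scheme. From these results, closed-form expressions for the OP and the asymptotic OP are obtained. Finally, numerical results validate the approximation and demonstrate its accuracy under different diversity FAS scenarios.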
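
As a point of reference for what an outage probability is, the sketch below evaluates it for a single, uncorrelated Nakagami-$m$ branch with integer $m$, where the channel power gain follows a Gamma distribution and the CDF has a finite-sum closed form. This is only a baseline illustration: it does not model FAS port selection, spatial correlation, or the paper's asymptotic matching method, and the function names are hypothetical.

```python
import math
import random


def nakagami_outage_exact(m, omega, threshold):
    """P(g < threshold) for power gain g ~ Gamma(shape=m, scale=omega/m), integer m."""
    x = m * threshold / omega
    tail = sum(x ** k / math.factorial(k) for k in range(m))
    return 1.0 - math.exp(-x) * tail


def nakagami_outage_mc(m, omega, threshold, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(rng.gammavariate(m, omega / m) < threshold for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    m, omega, threshold = 2, 1.0, 0.5
    print(nakagami_outage_exact(m, omega, threshold))
    print(nakagami_outage_mc(m, omega, threshold))
```

Comparing a closed form against simulation in this way mirrors, at a much smaller scale, how numerical results are typically used to validate an approximation.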

IHT-Inspired Neural Network for Single-Snapshot DOA Estimation with Sparse Linear Arrays

paper_url: http://arxiv.org/abs/2309.08429
repo_url: None
paper_authors: Yunqiao Hu, Shunqiao Sun
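for: This paper proposes IHT-Net, a neural network for single-snapshot direction-of-arrival (DOA) estimation with sparse linear arrays, as an alternative to iterative hard thresholding (IHT) solvers that rely on truncated singular value decomposition (t-SVD).
methods: A recurrent neural network structure parameterizes the IHT algorithm, so that the network layer operations correspond to IHT iterations, and shallow-layer autoencoders replace the t-SVD to reduce computational overhead.
results: Numerical results indicate that the method retains strong interpretability and that the learned optimizer converges faster and is more accurate in full array signal reconstruction followed by single-snapshot DOA estimation.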

Abstract
Single-snapshot direction-of-arrival (DOA) estimation using sparse linear arrays (SLAs) has gained significant attention in the field of automotive MIMO radars. This is due to the dynamic nature of automotive settings, where multiple snapshots aren't accessible, and the importance of minimizing hardware costs. Low-rank Hankel matrix completion has been proposed to interpolate the missing elements in SLAs. However, the solvers of matrix completion, such as iterative hard thresholding (IHT), heavily rely on expert knowledge of hyperparameter tuning and lack task-specificity. Besides, IHT involves truncated-singular value decomposition (t-SVD), which has high computational cost in each iteration. In this paper, we propose an IHT-inspired neural network for single-snapshot DOA estimation with SLAs, termed IHT-Net. We utilize a recurrent neural network structure to parameterize the IHT algorithm. Additionally, we integrate shallow-layer autoencoders to replace t-SVD, reducing computational overhead while generating a novel optimizer through supervised learning. IHT-Net maintains strong interpretability as its network layer operations align with the iterations of the IHT algorithm. The learned optimizer exhibits fast convergence and higher accuracy in the full array signal reconstruction followed by single-snapshot DOA estimation. Numerical results validate the effectiveness of the proposed method.

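Summary
Single-snapshot direction-of-arrival (DOA) estimation with sparse linear arrays (SLAs) has gained significant attention in automotive MIMO radar. This is because automotive settings are dynamic, so multiple snapshots are not available, and because hardware cost must be kept low. Low-rank Hankel matrix completion has been proposed to interpolate the missing elements in SLAs. However, matrix completion solvers such as iterative hard thresholding (IHT) rely on expert knowledge for hyperparameter tuning and lack task-specificity. In addition, IHT involves a truncated singular value decomposition (t-SVD), which is computationally expensive in every iteration. In this paper, we propose an IHT-inspired neural network, termed IHT-Net. We use a recurrent neural network structure to parameterize the IHT algorithm. We also integrate shallow-layer autoencoders to replace the t-SVD, which reduces computational cost and yields a novel optimizer through supervised learning. IHT-Net remains strongly interpretable because its network layer operations align with the iterations of the IHT algorithm. The learned optimizer shows fast convergence and higher accuracy in full array signal reconstruction followed by single-snapshot DOA estimation. Numerical results validate the effectiveness of the proposed method.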

A Simple Method for the Performance Analysis of Fluid Antenna Systems under Correlated Nakagami-$m$ Fading

paper_url: http://arxiv.org/abs/2309.08423
repo_url: None
paper_authors: José David Vega-Sánchez, Luis Urquiza-Aguiar, Martha Cecilia Paredes Paredes, Diana Pamela Moya Osorio
for: investigate the performance of a single-antenna fluid antenna system (FAS) over spatially correlated Nakagami-$m$ fading channels
methods: employ a novel asymptotic matching method to obtain simple and highly accurate closed-form approximations for the cumulative density function of the FAS channel and the outage probability of the proposed system
results: the proposed approximations are validated, and it is shown that the FAS can meet or even exceed the performance attained by the conventional maximal ratio combining (MRC) technique.

Abstract
By recognizing the tremendous flexibility of the emerging fluid antenna system (FAS), which allows dynamic reconfigurability of the location of the antenna within a given space, this paper investigates the performance of a single-antenna FAS over spatially correlated Nakagami-$m$ fading channels. Specifically, simple and highly accurate closed-form approximations for the cumulative density function of the FAS channel and the outage probability of the proposed system are obtained by employing a novel asymptotic matching method, which is an improved version of the well-known moment matching. With this method, the outage probability can be computed simply without incurring complex multi-fold integrals, thus requiring negligible computational effort. Finally, the accuracy of the proposed approximations is validated, and it is shown that the FAS can meet or even exceed the performance attained by the conventional maximal ratio combining (MRC) technique.

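Summary
Recognizing the great flexibility of the emerging fluid antenna system (FAS), this paper studies the performance of a single-antenna FAS over spatially correlated Nakagami-$m$ fading channels. In particular, a novel asymptotic matching method, an improved version of moment matching, is used to obtain simple and highly accurate closed-form approximations for the cumulative density function of the FAS channel and the outage probability of the proposed system. The method avoids complex multi-fold integrals, so the computational effort is very low. Finally, the approximations are validated, and the FAS is shown to meet or even exceed the performance of the conventional maximal ratio combining (MRC) technique.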

Channel Estimation in Underdetermined Systems Utilizing Variational Autoencoders

paper_url: http://arxiv.org/abs/2309.08411
repo_url: None
paper_authors: Michael Baur, Nurettin Turan, Benedikt Fesl, Wolfgang Utschick
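for: This paper applies a variational autoencoder (VAE) to channel estimation (CE) in underdetermined (UD) systems.
methods: A VAE is trained on channel state information (CSI) data and used to parameterize an approximation to the mean squared error (MSE)-optimal estimator.
results: Numerical simulations for hybrid and wideband systems show excellent performance compared to related estimators, including for an estimator variant that does not require perfect CSI during training, unlike most other deep learning (DL)-based CE methods.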

Abstract
In this work, we propose to utilize a variational autoencoder (VAE) for channel estimation (CE) in underdetermined (UD) systems. The basis of the method forms a recently proposed concept in which a VAE is trained on channel state information (CSI) data and used to parameterize an approximation to the mean squared error (MSE)-optimal estimator. The contributions in this work extend the existing framework from fully-determined (FD) to UD systems, which are of high practical relevance. Particularly noteworthy is the extension of the estimator variant, which does not require perfect CSI during its offline training phase. This is a significant advantage compared to most other deep learning (DL)-based CE methods, where perfect CSI during the training phase is a crucial prerequisite. Numerical simulations for hybrid and wideband systems demonstrate the excellent performance of the proposed methods compared to related estimators.

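Summary
In this work, we propose using a variational autoencoder (VAE) for channel estimation (CE) in underdetermined (UD) systems. The method builds on a recently proposed concept in which a VAE is trained on channel state information (CSI) data and used to parameterize an approximation to the MSE-optimal estimator. Our contribution extends the existing framework from fully-determined (FD) systems to UD systems, which are of high practical relevance. Particularly noteworthy is the extension of the estimator variant that does not require perfect CSI during training, in contrast to most deep learning (DL)-based CE methods, which usually do. Numerical simulations for hybrid and wideband systems show the excellent performance of the proposed methods compared to related estimators.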
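
The idea of parameterizing an approximation to the MSE-optimal estimator can be written out as follows. This is a sketch under assumed notation that does not appear in the abstract: an observation $y = Ah + n$ with a wide (underdetermined) matrix $A$, noise $n \sim \mathcal{N}_{\mathbb{C}}(0, \sigma^2 I)$, and a VAE that, for a latent sample $z$, outputs a conditional mean $\mu(z)$ and covariance $C(z)$ so that $h \mid z$ is modelled as Gaussian.

```latex
\hat{h}(y, z) = \mathbb{E}[h \mid y, z]
  = \mu(z) + C(z) A^{\mathsf{H}}
    \left( A\, C(z)\, A^{\mathsf{H}} + \sigma^2 I \right)^{-1}
    \left( y - A\, \mu(z) \right)
```

Under these assumptions the expression is the standard conditional-mean formula for jointly Gaussian variables. How $z$ is inferred from $y$, and how training proceeds without perfect CSI, are left to the paper.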

Bayes-Optimal Estimation in Generalized Linear Models via Spatial Coupling

paper_url: http://arxiv.org/abs/2309.08404
repo_url: None
paper_authors: Pablo Pascual Cobo, Kuan Hsieh, Ramji Venkataramanan
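for: This paper addresses signal estimation in generalized linear models (GLMs), which include canonical problems such as linear regression, phase retrieval, and 1-bit compressed sensing.
methods: The paper defines GLMs via spatially coupled sensing matrices and proposes an efficient approximate message passing (AMP) algorithm for estimation.
results: With a simple spatially coupled design, the MSE of a carefully tuned AMP estimator approaches the asymptotic MMSE in the high-dimensional limit. Numerical results for phase retrieval and rectified linear regression also show that spatially coupled designs can yield substantially lower MSE than i.i.d. Gaussian designs at finite dimensions when used with AMP algorithms.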

Abstract
We consider the problem of signal estimation in a generalized linear model (GLM). GLMs include many canonical problems in statistical estimation, such as linear regression, phase retrieval, and 1-bit compressed sensing. Recent work has precisely characterized the asymptotic minimum mean-squared error (MMSE) for GLMs with i.i.d. Gaussian sensing matrices. However, in many models there is a significant gap between the MMSE and the performance of the best known feasible estimators. In this work, we address this issue by considering GLMs defined via spatially coupled sensing matrices. We propose an efficient approximate message passing (AMP) algorithm for estimation and prove that with a simple choice of spatially coupled design, the MSE of a carefully tuned AMP estimator approaches the asymptotic MMSE in the high-dimensional limit. To prove the result, we first rigorously characterize the asymptotic performance of AMP for a GLM with a generic spatially coupled design. This characterization is in terms of a deterministic recursion ("state evolution") that depends on the parameters defining the spatial coupling. Then, using a simple spatially coupled design and judicious choice of functions defining the AMP, we analyze the fixed points of the resulting state evolution and show that it achieves the asymptotic MMSE. Numerical results for phase retrieval and rectified linear regression show that spatially coupled designs can yield substantially lower MSE than i.i.d. Gaussian designs at finite dimensions when used with AMP algorithms.

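Summary
We consider signal estimation in a generalized linear model (GLM). GLMs include many canonical problems in statistical estimation, such as linear regression, phase retrieval, and 1-bit compressed sensing. Existing work has precisely characterized the asymptotic minimum mean-squared error (MMSE) of GLMs with i.i.d. Gaussian sensing matrices. In many models, however, there is a clear gap between the MMSE and the performance of the best known feasible estimators. We address this by considering GLMs defined via spatially coupled sensing matrices. We propose an efficient approximate message passing (AMP) algorithm for estimation and prove that, with a simple spatially coupled design, the MSE of the AMP estimator approaches the asymptotic MMSE in the high-dimensional limit. To prove this, we first rigorously characterize the asymptotic performance of AMP for a GLM with a generic spatially coupled design through a deterministic recursion (state evolution) that depends on the parameters defining the spatial coupling. Then, using a simple spatially coupled design and a judicious choice of the functions defining AMP, we analyze the fixed points of the resulting state evolution and show that it achieves the asymptotic MMSE. Numerical results for phase retrieval and rectified linear regression show that spatially coupled designs can give lower MSE than i.i.d. Gaussian designs at finite dimensions when used with AMP.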
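
The canonical problems listed in the abstract differ only in the output function applied to the linear measurement $z = \langle a_i, x \rangle$. The sketch below generates measurements for four such output functions. The dictionary keys, the function names, and the choice of the magnitude $|z|$ for phase retrieval are assumptions made for illustration; the sensing rows are passed in directly, with no spatial coupling.

```python
import random


def inner(a, x):
    """Inner product of a sensing row and the signal."""
    return sum(ai * xi for ai, xi in zip(a, x))


def sign(z):
    """Sign function with sign(0) taken as +1."""
    return 1.0 if z >= 0 else -1.0


# Output functions y = q(z, noise) applied to z = <a_i, x>.
GLM_OUTPUTS = {
    "linear_regression": lambda z, w: z + w,
    "phase_retrieval": lambda z, w: abs(z) + w,   # magnitude-only measurement
    "one_bit_cs": lambda z, w: sign(z + w),
    "rectified_linear": lambda z, w: max(z, 0.0) + w,
}


def glm_measurements(model, sensing_rows, signal, noise_std=0.0, seed=0):
    """Generate one measurement per sensing row for the chosen GLM."""
    rng = random.Random(seed)
    q = GLM_OUTPUTS[model]
    return [q(inner(a, signal), rng.gauss(0.0, noise_std)) for a in sensing_rows]


if __name__ == "__main__":
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    x = [2.0, -3.0]
    for name in GLM_OUTPUTS:
        print(name, glm_measurements(name, rows, x))
```

In phase retrieval and 1-bit compressed sensing the sign or magnitude of $z$ is lost, which is what makes estimation harder than in linear regression.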

Investigation of mmWave Radar Technology For Non-contact Vital Sign Monitoring

paper_url: http://arxiv.org/abs/2309.08317
repo_url: None
paper_authors: Steven Marty, Federico Pantanella, Andrea Ronco, Kanika Dheman, Michele Magno
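for: Non-contact vital sign monitoring.
methods: Millimeter-wave (mmWave) radar technology.
results: The 120 GHz radar system performed best in tests on human subjects and was able to identify heartbeat and breathing activity.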

Abstract
Non-contact vital sign monitoring has many advantages over conventional methods in being comfortable, unobtrusive and without any risk of spreading infection. The use of millimeter-wave (mmWave) radars is one of the most promising approaches that enable contact-less monitoring of vital signs. Novel low-power implementations of this technology promise to enable vital sign sensing in embedded, battery-operated devices. The nature of these new low-power sensors exacerbates the challenges of accurate and robust vital sign monitoring and especially the problem of heart-rate tracking. This work focuses on the investigation and characterization of three Frequency Modulated Continuous Wave (FMCW) low-power radars with different carrier frequencies of 24 GHz, 60 GHz and 120 GHz. The evaluation platforms were first tested on phantom models that emulated human bodies to accurately evaluate the baseline noise, error in range estimation, and error in displacement estimation. Additionally, the systems were also used to collect data from three human subjects to gauge the feasibility of identifying heartbeat peaks and breathing peaks with simple and lightweight algorithms that could potentially run in low-power embedded processors. The investigation revealed that the 24 GHz radar has the highest baseline noise level, 0.04 mm at 0° angle of incidence, and an error in range estimation of 3.45 ± 1.88 cm at a distance of 60 cm. At the same distance, the 60 GHz and the 120 GHz radar systems show the lowest noise level, 0.01 mm at 0° angle of incidence, and errors in range estimation of 0.64 ± 0.01 cm and 0.04 ± 0.0 cm respectively. Additionally, tests on humans showed that all three radar systems were able to identify heart and breathing activity but the 120 GHz radar system outperformed the other two.

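Summary
Non-contact vital sign monitoring has several advantages: it is comfortable, unobtrusive, and carries no risk of spreading infection. Millimeter-wave (mmWave) radar is one of the most promising approaches to contactless vital sign monitoring. New low-power implementations may enable vital sign sensing in embedded, battery-operated devices, but these low-power sensors make accurate and robust monitoring harder, especially heart-rate tracking. This work investigates and characterizes three low-power Frequency Modulated Continuous Wave (FMCW) radars with carrier frequencies of 24 GHz, 60 GHz, and 120 GHz. The evaluation platforms were first tested on phantom models that emulate the human body to assess baseline noise, range estimation error, and displacement estimation error. The systems were then used to collect data from three human subjects to gauge whether heartbeat and breathing peaks can be identified with simple, lightweight algorithms that could run on low-power embedded processors. The 24 GHz radar has the highest baseline noise (0.04 mm at 0° angle of incidence) and a range estimation error of 3.45 ± 1.88 cm at a distance of 60 cm. At the same distance, the 60 GHz and 120 GHz systems have the lowest baseline noise (0.01 mm at 0°) and range estimation errors of 0.64 ± 0.01 cm and 0.04 ± 0.0 cm, respectively. In tests on human subjects, all three systems identified heartbeat and breathing activity, with the 120 GHz system performing best.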

User Power Measurement Based IRS Channel Estimation via Single-Layer Neural Network

paper_url: http://arxiv.org/abs/2309.08275
repo_url: None
paper_authors: He Sun, Weidong Mei, Lipeng Zhu, Rui Zhang
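for: Obtaining channel knowledge for the BS-IRS-user cascaded links in IRS-aided communication systems, which is needed to design high-performance IRS reflection.
methods: A single-layer neural network (NN)-enabled IRS channel estimation method based on users' received signal power measurements.
results: Compared with existing power-measurement-based designs, the NN-enabled IRS channel estimation method significantly improves the minimum received signal-to-noise ratio (SNR) in a multiuser communication system.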

Abstract
One main challenge for implementing intelligent reflecting surface (IRS) aided communications lies in the difficulty to obtain the channel knowledge for the base station (BS)-IRS-user cascaded links, which is needed to design high-performance IRS reflection in practice. Traditional methods for estimating IRS cascaded channels are usually based on the additional pilot signals received at the BS/users, which increase the system training overhead and also may not be compatible with the current communication protocols. To tackle this challenge, we propose in this paper a new single-layer neural network (NN)-enabled IRS channel estimation method based on only the knowledge of users' individual received signal power measurements corresponding to different IRS random training reflections, which are easily accessible in current wireless systems. To evaluate the effectiveness of the proposed channel estimation method, we design the IRS reflection for data transmission based on the estimated cascaded channels in an IRS-aided multiuser communication system. Numerical results show that the proposed IRS channel estimation and reflection design can significantly improve the minimum received signal-to-noise ratio (SNR) among all users, as compared to existing power measurement based designs.

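Summary
A main challenge in implementing intelligent reflecting surface (IRS)-aided communications is obtaining channel knowledge for the base station (BS)-IRS-user cascaded links, which is a prerequisite for designing high-performance IRS reflection in practice. Traditional methods for estimating IRS cascaded channels usually rely on additional pilot signals received at the BS or the users, which increases the training overhead and may be incompatible with current communication protocols. To tackle this challenge, we propose a single-layer neural network (NN)-enabled IRS channel estimation method that needs only the users' individual received signal power measurements under different random IRS training reflections, which are easy to obtain in existing wireless systems. To evaluate the proposed method, we design the IRS reflection for data transmission based on the estimated cascaded channels in an IRS-aided multiuser communication system. Numerical results show that the proposed IRS channel estimation and reflection design can significantly improve the minimum received signal-to-noise ratio (SNR) among all users, as compared to existing power measurement based designs.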

Trajectory Tracking Control of UAV Bicopter using Linear Quadratic Gaussian

paper_url: http://arxiv.org/abs/2309.08226
repo_url: None
paper_authors: Fahmizal, Hanung Adi Nugroho, Adha Imam Cahyadi, Igi Ardiyanto
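for: Designing a linear quadratic Gaussian (LQG) controller for a UAV Bicopter to ensure proper trajectory tracking control.
methods: An LQG controller, evaluated in simulation against a linear quadratic regulator (LQR) controller.
results: Simulation results show that the LQG controller overcomes inertial disturbances on the UAV Bicopter and achieves a lower root mean square error (RMSE) than the LQR controller during trajectory tracking.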

Abstract
This paper describes the design of a linear quadratic gaussian (LQG) for trajectory tracking control of UAV Bicopter. In this work, disturbance in the form of payload significantly affects the trajectory tracking control process on the UAV Bicopter when using a linear quadratic regulator (LQR) controller. The use of a LQR control will be optimal in the case of a state regulator towards an equilibrium point in a system, but for the tracking case, the LQR controller is not capable of optimally, especially in systems that have high levels of nonlinearity and system dynamic changes such as inertial disturbances. Therefore, this paper proposes the design of a LQG control that is expected to overcome system dynamic changes, in this case in the form of inertial disturbances to the UAV Bicopter when carrying a payload. The success of LQG control was tested in two scenarios, the first trajectory tracking at a circular position and the second with the position of the trajectory number "8". The simulation results show that the proposed LQG controller successfully overcame inertial disturbances when the UAV Bicopter performs trajectory tracking. When given an inertial disturbance, the trajectory tracking test results show that the LQG control has a lower root mean square error (RMSE) value than the LQR control.
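The LQG structure described above (an LQR state-feedback gain combined with a Kalman estimator) can be illustrated on a scalar toy system. This is a minimal sketch under stated assumptions, not the paper's controller: the plant `x[k+1] = a*x[k] + b*u[k] + w[k]`, `y[k] = c*x[k] + v[k]`, all numeric values, and the function names are hypothetical, and the sketch regulates to the origin rather than tracking a trajectory.

```python
import random


def dare_scalar(a, b, q, r, iters=500):
    """Iterate the scalar discrete-time Riccati recursion to its fixed point."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def lqr_gain(a, b, q, r):
    """State-feedback gain K for the control law u = -K * x_hat."""
    p = dare_scalar(a, b, q, r)
    return a * b * p / (r + b * b * p)


def kalman_gain(a, c, w_var, v_var):
    """Steady-state measurement-update gain, obtained by control/estimation duality."""
    p = dare_scalar(a, c, w_var, v_var)  # steady-state prior error variance
    return p * c / (c * c * p + v_var)


def simulate_lqg(a=1.1, b=1.0, c=1.0, q=1.0, r=1.0,
                 w_var=0.01, v_var=0.01, steps=200, seed=0):
    """Run the LQG loop on the toy plant and return the RMS of the true state."""
    rng = random.Random(seed)
    k = lqr_gain(a, b, q, r)
    gain = kalman_gain(a, c, w_var, v_var)
    x, x_pred, sq_sum = 1.0, 0.0, 0.0
    for _ in range(steps):
        y = c * x + rng.gauss(0.0, v_var ** 0.5)       # noisy measurement
        x_hat = x_pred + gain * (y - c * x_pred)       # Kalman measurement update
        u = -k * x_hat                                 # LQR feedback on the estimate
        sq_sum += x * x
        x = a * x + b * u + rng.gauss(0.0, w_var ** 0.5)  # true plant step
        x_pred = a * x_hat + b * u                     # estimator time update
    return (sq_sum / steps) ** 0.5
```

The open-loop plant here is unstable (`a = 1.1`), so a bounded RMS value from `simulate_lqg()` indicates that the estimator and the feedback gain together stabilize it. The controller only ever sees the noisy measurement `y`, which is the point of adding the Gaussian estimator to an LQR design.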


Attitude Control and Low Cost Design of UAV Bicopter

paper_url: http://arxiv.org/abs/2309.08209
repo_url: None
paper_authors: Fahmizal, Hanung Adi Nugroho, Adha Imam Cahyadi, Igi Ardiyanto
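for: Developing a control system for a Bicopter to improve its attitude stability.
methods: A PID controller that takes feedback from an IMU and computes control inputs to adjust the Bicopter's attitude (roll, pitch, and yaw angles) while rejecting wind disturbances.
results: On a test bed, the control system successfully controlled the Bicopter's attitude and was robust to wind disturbances.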

Abstract
This paper presents a control system for the attitude and low cost design of a Bicopter. The control system uses a PID controller that receives feedback from an IMU to calculate control inputs that adjust the Bicopter's attitude (roll, pitch and yaw angles) which is resistant to disturbances (wind noise) on a test bed. The control system is implemented on a hardware platform consisting of a Bicopter, an IMU sensor, and a microcontroller with low cost design. In mechanical design, the Bicopter is designed to more closely resemble the letter "V" so that the distribution of the centre of mass (CoM) of the Bicopter can be such that the servomotor torque reaction is parallel to the axis of rotation of the Bicopter during the movement of the pitch angle attitude. In electronic design, the Bicopter was developed using the ATmega328P microcontroller.

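Summary
This paper presents a control system for the attitude of a Bicopter together with a low-cost design. The control system uses a PID controller that receives feedback from an IMU sensor and computes control inputs to adjust the attitude (roll, pitch, and yaw angles) while resisting disturbances (wind noise). It is implemented on a low-cost hardware platform consisting of the Bicopter, an IMU sensor, and a microcontroller. Mechanically, the Bicopter is shaped to resemble the letter "V" so that the centre of mass (CoM) is distributed such that the servomotor torque reaction is parallel to the Bicopter's axis of rotation during pitch motion. Electronically, the Bicopter is built around the ATmega328P microcontroller.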
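The PID attitude loop described in this entry can be sketched as follows. This is a textbook discrete PID on a single-axis toy model, not the authors' firmware: the gains, the time step, the constant disturbance, and the double-integrator dynamics are all assumptions chosen for illustration.

```python
class PID:
    """Textbook discrete PID controller with a fixed time step."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None  # skip the derivative term on the first sample

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_roll(setpoint=0.2, disturbance=0.5, steps=2000, dt=0.01):
    """Toy roll axis: angular acceleration = control torque + constant disturbance."""
    pid = PID(kp=8.0, ki=4.0, kd=4.0, dt=dt)
    angle, rate = 0.0, 0.0
    for _ in range(steps):
        torque = pid.update(setpoint, angle)  # 'angle' stands in for the IMU reading
        rate += (torque + disturbance) * dt
        angle += rate * dt
    return angle
```

Calling `simulate_roll()` runs 20 simulated seconds and returns the final roll angle in radians. The integral term is what cancels the constant disturbance in this model; with `ki = 0` the angle would settle with a steady offset from the setpoint.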

Privacy-Aware Joint Source-Channel Coding for image transmission based on Disentangled Information Bottleneck

paper_url: http://arxiv.org/abs/2309.08188
repo_url: None
paper_authors: Lunan Sun, Caili Guo, Mingzhe Chen, Yang Yang
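for: Privacy-aware joint source-channel coding (JSCC) for image transmission that avoids leaking private information during transmission.
methods: A disentangled information bottleneck (DIB) objective that separates private information from public information.
results: Experiments show that the method reduces eavesdropping accuracy on private information by up to 20% compared with existing methods.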

Abstract
Current privacy-aware joint source-channel coding (JSCC) works aim at avoiding private information transmission by adversarially training the JSCC encoder and decoder under specific signal-to-noise ratios (SNRs) of eavesdroppers. However, these approaches incur additional computational and storage requirements as multiple neural networks must be trained for various eavesdroppers' SNRs to determine the transmitted information. To overcome this challenge, we propose a novel privacy-aware JSCC for image transmission based on disentangled information bottleneck (DIB-PAJSCC). In particular, we derive a novel disentangled information bottleneck objective to disentangle private and public information. Given the separate information, the transmitter can transmit only public information to the receiver while minimizing reconstruction distortion. Since DIB-PAJSCC transmits only public information regardless of the eavesdroppers' SNRs, it can eliminate additional training adapted to eavesdroppers' SNRs. Experimental results show that DIB-PAJSCC can reduce the eavesdropping accuracy on private information by up to 20% compared to existing methods.

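Summary
Current privacy-aware joint source-channel coding (JSCC) methods try to avoid transmitting private information by adversarially training the JSCC encoder and decoder under specific eavesdropper signal-to-noise ratios (SNRs). These methods add computational and storage cost, because multiple neural networks must be trained for different eavesdropper SNRs to determine the transmitted information. To overcome this challenge, we propose a privacy-aware JSCC for image transmission based on a disentangled information bottleneck (DIB-PAJSCC). In particular, we derive a new disentangled information bottleneck objective that separates private information from public information. Given the separated information, the transmitter sends only public information to the receiver while minimizing reconstruction distortion. Because DIB-PAJSCC transmits only public information regardless of the eavesdroppers' SNRs, it needs no additional training adapted to those SNRs. Experimental results show that DIB-PAJSCC reduces the eavesdropping accuracy on private information by up to 20% compared with existing methods.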

Message Passing-Based Joint Channel Estimation and Signal Detection for OTFS with Superimposed Pilots

paper_url: http://arxiv.org/abs/2309.08177
repo_url: None
paper_authors: Fupeng Huang, Qinghua Guo, Youwen Zhang, Yuriy Zakharov
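for: Improving transmission efficiency and reducing computational complexity in OTFS systems.
methods: A message passing-based receiver that uses superimposed pilots (SP) in the delay-Doppler (DD) domain and generalized complex exponential (GCE) basis expansion modeling (BEM) of the channel to reduce computational complexity.
results: A low-complexity SP-DD-D receiver that effectively reduces the pilot signal power with almost no loss in channel estimation accuracy.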

Abstract
Receivers with joint channel estimation and signal detection using superimposed pilots (SP) can achieve high transmission efficiency in orthogonal time frequency space (OTFS) systems. However, existing receivers have high computational complexity, hindering their practical applications. In this work, with SP in the delay-Doppler (DD) domain and the generalized complex exponential (GCE) basis expansion modeling (BEM) for channels, a message passing-based SP-DD iterative receiver is proposed, which drastically reduces the computational complexity while with marginal performance loss, compared to existing ones. To facilitate channel estimation (CE) in the proposed receiver, we design pilot signal to achieve pilot power concentration in the frequency domain, thereby developing an SP-DD-D receiver that can effectively reduce the power of the pilot signal and almost no loss of CE accuracy. Extensive simulation results are provided to demonstrate the superiority of the proposed SP-DD-D receiver.

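Summary
Receivers that perform joint channel estimation and signal detection with superimposed pilots (SP) can achieve high transmission efficiency in orthogonal time frequency space (OTFS) systems. Existing receivers, however, have high computational complexity, which limits their practical use. In this work, we use SP in the delay-Doppler (DD) domain and generalized complex exponential (GCE) basis expansion modeling (BEM) of the channel. We propose a message passing-based SP-DD iterative receiver that drastically reduces computational complexity with only marginal performance loss compared with existing receivers. To facilitate channel estimation (CE) in the proposed receiver, we design the pilot signal so that pilot power is concentrated in the frequency domain, yielding an SP-DD-D receiver that effectively reduces the pilot signal power with almost no loss of CE accuracy. Extensive simulation results demonstrate the superiority of the proposed SP-DD-D receiver.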

TransMUSIC: A Transformer-Aided Subspace Method for DOA Estimation with Low-Resolution ADCs

paper_url: http://arxiv.org/abs/2309.08174
repo_url: https://github.com/jijunkai/transformer_music
paper_authors: Junkai Ji, Wei Mao, Feng Xi, Shengyao Chen
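for: Direction of arrival (DOA) estimation with large-scale arrays and low-resolution ADCs, where quantization distortion makes the signal and noise subspaces hard to separate.
methods: A Transformer model aids subspace estimation by processing multiple snapshots in parallel to capture global correlations. The learned subspace is used to construct the MUSIC spectrum and to perform gridless DOA estimation with a neural network-based peak finder.
results: Numerical results show that the TransMUSIC algorithm outperforms traditional methods even on one-bit quantized data, highlighting the potential of Transformer-based techniques for DOA estimation.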

Abstract
Direction of arrival (DOA) estimation employing low-resolution analog-to-digital convertors (ADCs) has emerged as a challenging and intriguing problem, particularly with the rise in popularity of large-scale arrays. The substantial quantization distortion complicates the extraction of signal and noise subspaces from the quantized data. To address this issue, this paper introduces a novel approach that leverages the Transformer model to aid the subspace estimation. In this model, multiple snapshots are processed in parallel, enabling the capture of global correlations that span them. The learned subspace empowers us to construct the MUSIC spectrum and perform gridless DOA estimation using a neural network-based peak finder. Additionally, the acquired subspace encodes the vital information of model order, allowing us to determine the exact number of sources. These integrated components form a unified algorithmic framework referred to as TransMUSIC. Numerical results demonstrate the superiority of the TransMUSIC algorithm, even when dealing with one-bit quantized data. The results highlight the potential of Transformer-based techniques in DOA estimation.

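Summary
Direction of arrival (DOA) estimation with low-resolution analog-to-digital converters (ADCs) has become a challenging and intriguing problem, particularly with the growing popularity of large-scale arrays. The substantial quantization distortion complicates the extraction of signal and noise subspaces from the quantized data. To address this, the paper introduces an approach that uses a Transformer model to aid subspace estimation. In this model, multiple snapshots are processed in parallel, which captures the global correlations that span them. The learned subspace is used to construct the MUSIC spectrum and to perform gridless DOA estimation with a neural network-based peak finder. The learned subspace also encodes the model order, which makes it possible to determine the exact number of sources. These components form a unified algorithmic framework called TransMUSIC. Numerical results demonstrate the superiority of TransMUSIC even with one-bit quantized data and highlight the potential of Transformer-based techniques in DOA estimation.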

Exploration into Optimal State Estimation with Event-triggered Communication

paper_url: http://arxiv.org/abs/2309.08070
repo_url: None
paper_authors: Xiaolei Bian, Huimin Chen, X. Rong Li
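for: Remote estimation of the state of a discrete-time stochastic linear system observed by a sensor with enough computational capacity to calculate local estimates.
methods: An event-triggered communication (ETC) scheme and a remote state estimator that balance system performance against limited communication resources. The scheme is a time-varying thresholding version of a cumulative innovation-driven scheme, and the corresponding remote minimum mean square error (MMSE) estimator is derived.
results: Simulation results illustrate the effectiveness of the approach.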

Abstract
This paper deals with the problem of remote estimation of the state of a discrete-time stochastic linear system observed by a sensor with computational capacity to calculate local estimates. We design an event-triggered communication (ETC) scheme and a remote state estimator to optimally calibrate the tradeoff between system performance and limited communication resources. The novel communication scheme is the time-varying thresholding version for the cumulative innovation-driven communication scheme in [1], and its transmission probability is given. We derive the corresponding remote minimum mean square error (MMSE) estimator and present a tight upper bound for its MSE matrices. We also show that by employing a couple of weak assumptions, the optimality problem becomes (asymptotically) exact and can be addressed in a Markov Decision Process (MDP) framework, which delivers optimal policy and cost in an algorithmic procedure. The simulation results illustrate the effectiveness of our approach.


2023-09-14

cs.SD

cs.SD - 2023-09-14

DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing

paper_url: http://arxiv.org/abs/2309.08060
repo_url: None
paper_authors: Yunyi Liu, Craig Jin, David Gunawan
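for: Controlling the variations of sound effects with neural audio synthesis models.
methods: A model based on the DDSP architecture that uses pre-processed audio features and digital synthesizers to achieve high-quality sound synthesis while letting users control timbre variations easily.
results: The proposed transient modelling technique achieves higher objective evaluation scores and subjective ratings on impulsive signals, and timbre transfer is shown qualitatively using voice as the guiding sound.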

Abstract
Controlling the variations of sound effects using neural audio synthesis models has been a difficult task. Differentiable digital signal processing (DDSP) provides a lightweight solution that achieves high-quality sound synthesis while enabling deterministic acoustic attribute control by incorporating pre-processed audio features and digital synthesizers. In this research, we introduce DDSP-SFX, a model based on the DDSP architecture capable of synthesizing high-quality sound effects while enabling users to control the timbre variations easily. We propose a transient modelling technique with higher objective evaluation scores and subjective ratings over impulsive signals (footsteps, gunshots). We propose a simple method that achieves timbre variation control while also allowing deterministic attribute control. We further qualitatively show the timbre transfer performance using voice as the guiding sound.

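Summary
Controlling the variations of sound effects with neural audio synthesis models is a challenging task. Differentiable digital signal processing (DDSP) offers a lightweight solution that achieves high-quality sound synthesis while allowing deterministic control of acoustic attributes. In this research, we introduce DDSP-SFX, a model that synthesizes high-quality sound effects while letting users control timbre variations easily. We propose a transient modelling technique that obtains higher objective evaluation scores and subjective ratings on impulsive signals (footsteps, gunshots). We also propose a simple method that achieves timbre variation control while still allowing deterministic attribute control. Finally, we qualitatively show timbre transfer performance using voice as the guiding sound.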

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

paper_url: http://arxiv.org/abs/2309.08049
repo_url: https://github.com/digitalphonetics/voicepat
paper_authors: Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu
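for: Providing an efficient speaker anonymization and evaluation framework that makes it easier to compare and combine anonymization approaches.
methods: A modular, easily extendable structure implemented almost entirely in Python. The framework can run several anonymization approaches in parallel and supports interfacing between different techniques.
results: The modified evaluation methods reduce computation time by 65 to 95%, depending on the metric. The code is open source.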

Abstract
Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic is continually increasing. However, the comparison and combination of different anonymization approaches remains challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which make the evaluation more powerful and reduce their computation time by 65 to 95%, depending on the metric. Our code is fully open source.

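Summary
Speaker anonymization is the task of modifying a speech recording so that the original speaker can no longer be identified. Since the first Voice Privacy Challenge in 2020, the popularity of this research topic has grown steadily. However, comparing and combining different anonymization approaches remains difficult because evaluation is complex and user-friendly research frameworks are lacking. We therefore propose an efficient speaker anonymization and evaluation framework with a modular, easily extendable structure, written almost entirely in Python. The framework can run several anonymization approaches in parallel and supports interfacing between different techniques. We also propose modifications to common evaluation methods that make the evaluation more powerful and reduce computation time by 65 to 95%, depending on the metric. Our code is fully open source.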

Comparative Assessment of Markov Models and Recurrent Neural Networks for Jazz Music Generation

paper_url: http://arxiv.org/abs/2309.08027
repo_url: None
paper_authors: Conrad Hsu, Ross Greer
for: Comparing the performance of a simple Markov chain model and a recurrent neural network (RNN) model in jazz music improvisation.
methods: Using transcriptions of jazz blues choruses from professional jazz players to train both models, and using musical jazz seeds to give the model context.
results: The RNN outperforms the Markov model on both metrics (groove pattern similarity and pitch class histogram entropy), indicating better rhythmic consistency and tonal stability in the generated music.

Abstract
As generative models have risen in popularity, a domain that has risen alongside is generative models for music. Our study aims to compare the performance of a simple Markov chain model and a recurrent neural network (RNN) model, two popular models for sequence generating tasks, in jazz music improvisation. While music, especially jazz, remains subjective in telling whether a composition is "good" or "bad", we aim to quantify our results using metrics of groove pattern similarity and pitch class histogram entropy. We trained both models using transcriptions of jazz blues choruses from professional jazz players, and also fed musical jazz seeds to help give our model some context in beginning the generation. Our results show that the RNN outperforms the Markov model on both of our metrics, indicating better rhythmic consistency and tonal stability in the generated music. Through the use of music21 library, we tokenized our jazz dataset into pitches and durations that our model could interpret and train on. Our findings contribute to the growing field of AI-generated music, highlighting the important use of metrics to assess generation quality. Future work includes expanding the dataset of MIDI files to a larger scale, conducting human surveys for subjective evaluations, and incorporating additional metrics to address the challenge of subjectivity in music evaluation. Our study provides valuable insight into the use of recurrent neural networks for sequential based tasks like generating music.

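Summary
We compare a Markov chain model and a recurrent neural network (RNN) model on jazz music improvisation. Although judging music, and jazz in particular, is subjective, we quantify the results with groove pattern similarity and pitch class histogram entropy. We use the music21 library to tokenize the jazz dataset into pitches and durations and then train both models. The RNN performs better on both metrics, indicating better rhythmic consistency and tonal stability in the generated music. Our findings contribute to the growing field of AI-generated music and highlight the importance of metrics for assessing generation quality.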
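To make the Markov baseline and the pitch class histogram entropy metric concrete, here is a minimal sketch. It assumes a first-order chain over MIDI pitch numbers only; the paper tokenizes pitches and durations with music21, and its exact model order and metric implementation are not given in the abstract. All function names and the example phrase are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def train_markov(tokens):
    """Count first-order transitions token -> next token."""
    transitions = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        transitions[current][nxt] += 1
    return transitions


def generate(transitions, seed_token, length, rng):
    """Sample a sequence by following transition counts from a seed token."""
    sequence = [seed_token]
    for _ in range(length - 1):
        options = transitions.get(sequence[-1])
        if not options:  # dead end: this token was never followed by anything
            break
        choices, weights = zip(*options.items())
        sequence.append(rng.choices(choices, weights=weights)[0])
    return sequence


def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch class histogram."""
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical training phrase (a blues-scale run up and down from middle C).
phrase = [60, 63, 65, 66, 67, 70, 72, 70, 67, 66, 65, 63, 60]
model = train_markov(phrase)
sample = generate(model, seed_token=60, length=8, rng=random.Random(0))
```

Lower entropy means the generated notes concentrate on fewer pitch classes, which is how the metric relates to the tonal stability mentioned in the abstract. The seed token plays the role of the musical seed that gives the model a starting context.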

Efficient Face Detection with Audio-Based Region Proposals

paper_url: http://arxiv.org/abs/2309.08005
repo_url: None
paper_authors: William Aris, François Grondin
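for: Improving the computational efficiency of robot vision systems without the negative impact of reduced image quality.
methods: A new attention mechanism that uses audio to generate regions of interest in optical images.
results: The attention mechanism reduces the computational load, and the pipeline adapts easily to human-robot interaction, robot surveillance, video conferencing, or smart glasses.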

Abstract
Robot vision often involves a large computational load due to large images to process in a short amount of time. Existing solutions often involve reducing image quality which can negatively impact processing. Another approach is to generate regions of interest with expensive vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To achieve this, we propose a unique attention mechanism to localize speech sources and evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline is flexible and can be easily adapted for human-robot interactions, robot surveillance, video-conferences or smart glasses.

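Summary
Robot vision often carries a heavy computational load because large images must be processed in a short time. Existing solutions often reduce image quality, which can hurt processing. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To this end, we propose an attention mechanism that localizes speech sources and evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline suits applications such as human-robot interaction, robot surveillance, video conferencing, or smart glasses.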

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

paper_url: http://arxiv.org/abs/2309.07828
repo_url: None
paper_authors: Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann
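for: Speech emotion conversion, that is, converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity.
methods: A diffusion-based generative model, EmoConv-Diff, trained to reconstruct the input utterance while conditioning on its emotion. At inference, a target emotion embedding converts the emotion of the input utterance. Unlike most prior work, the model does not rely on parallel data and is trained on a large in-the-wild dataset.
results: The proposed diffusion model can synthesize speech with a controllable target emotion. It also performs better at the extreme values of arousal, which addresses a common challenge in speech emotion conversion.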

Abstract
Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.10. Our results show that the proposed diffusion model is indeed capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance along the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature.

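Summary
This work focuses on emotion conversion in in-the-wild scenarios and does not rely on acted-out datasets or parallel data samples. To this end, we propose a diffusion-based generative model, EmoConv-Diff, that is trained to reconstruct the input utterance while conditioning on its emotion. At inference, a target emotion embedding is used to convert the emotion of the input utterance. Unlike other work, we represent emotion with a continuous arousal dimension, which also provides intensity control. We validate the method on a large in-the-wild dataset, MSP-Podcast v1.10. The results show that the proposed diffusion model can synthesize speech with a controllable target emotion. More importantly, the approach improves performance at the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature.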

SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

paper_url: http://arxiv.org/abs/2309.07803
repo_url: None
paper_authors: Sipan Li, Songxiang Liu, Luwen Zhang, Xiang Li, Yanyao Bian, Chao Weng, Zhiyong Wu, Helen Meng
for: Training a universal vocoder that generalizes well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces.
methods: The proposed method, SnakeGAN, uses a GAN-based architecture with a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge. The generator uses the Snake activation function and anti-aliased representation to introduce periodic nonlinearities and bring the desired inductive bias for audio synthesis.
results: The proposed method significantly outperforms the compared approaches and can generate high-fidelity audio samples including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization.

Abstract
Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder, which can synthesize high-fidelity audio in various OOD scenarios. SnakeGAN takes a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge, aiming at recovering high-fidelity waveform from a Mel-spectrogram. We introduce periodic nonlinearities through the Snake activation function and anti-aliased representation into the generator, which further brings the desired inductive bias for audio synthesis and significantly improves the extrapolation capacity for universal vocoding in unseen scenarios. To validate the effectiveness of our proposed method, we train SnakeGAN with only speech data and evaluate its performance for various OOD distributions with both subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and can generate high-fidelity audio samples including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization.

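Summary
Generative adversarial network (GAN)-based neural vocoders are widely used in audio synthesis because of their high generation quality, efficient inference, and small computation footprint. Training a universal vocoder remains challenging, however, because it must generalize to unseen scenarios such as unseen speaking styles, non-speech vocalization, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder that synthesizes high-fidelity audio in various out-of-domain (OOD) scenarios. SnakeGAN introduces periodic nonlinearities through the Snake activation function and uses anti-aliased representation in the generator, which brings the desired inductive bias for audio synthesis and improves generalization. To validate the method, we train SnakeGAN on speech data only and evaluate it on various OOD distributions. Experimental results show that SnakeGAN outperforms the compared approaches and can generate high-fidelity audio samples, including unseen speakers with unseen styles.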
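The periodic nonlinearity mentioned in this entry can be illustrated with the Snake activation as it is commonly defined, `snake(x) = x + sin^2(alpha * x) / alpha`. This is a scalar sketch only: how SnakeGAN parameterizes `alpha` (for example, whether it is learned per channel) is not stated in the abstract, so the fixed default here is an assumption.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term with period pi / alpha."""
    return x + (math.sin(alpha * x) ** 2) / alpha


def periodic_part(x, alpha=1.0):
    """The part of snake(x) that repeats, i.e. snake(x) minus the identity."""
    return snake(x, alpha) - x
```

Because the output is the identity plus a bounded periodic term, the function stays monotone-like for optimization while giving the network a built-in way to represent periodic structure, which is the inductive bias the abstract refers to. For example, `periodic_part(0.3)` and `periodic_part(0.3 + math.pi)` agree up to floating-point error when `alpha = 1.0`.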

Complexity Scaling for Speech Denoising

paper_url: http://arxiv.org/abs/2309.07757
repo_url: https://github.com/hangtingchen/Complexity-Scaling-for-Speech-Denoising.github.io
paper_authors: Hangting Chen, Jianwei Yu, Chao Weng
for: Developing a unified architecture for speech denoising models that can handle a wide range of computational complexities.
methods: The proposed Multi-Path Transform-based (MPT) architecture is designed to handle both low- and high-complexity scenarios. The authors explore the empirical relationship between model performance and computational cost on the denoising task.
results: The MPT networks achieve high performance on the DNS challenge dataset, and the authors observe a linear increase in the values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, as the number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s.

Abstract
Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming to consolidate models with various complexities into a unified architecture. We present a Multi-Path Transform-based (MPT) architecture to handle both low- and high-complexity scenarios. A series of MPT networks present high performance covering a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling experiments in natural language processing, we explore the empirical relationship between model performance and computational cost on the denoising task. As the complexity number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s on MPT networks, we observe a linear increase in the values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, which might contribute to the understanding and application of complexity scaling in speech denoising tasks.

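Summary
Computational complexity is a key factor when deploying speech denoising models. Most prior research optimized model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limits. This study conducts complexity scaling, aiming to consolidate models of different complexities into one unified architecture. We propose a Multi-Path Transform-based (MPT) architecture that handles both low- and high-complexity scenarios. A series of MPT networks achieves high performance across a wide range of computational complexities on the DNS challenge dataset. Inspired by scaling experiments in natural language processing, we also explore the relationship between model performance and computational cost on the denoising task: the values of PESQ-WB and SI-SNR increase linearly with the logarithm of the number of multiply-accumulate operations (MACs), which may help the understanding and application of complexity scaling in speech denoising.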
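The reported trend (a metric that grows linearly in the logarithm of MACs) corresponds to fitting `score = slope * log10(MACs) + intercept`. The sketch below shows such a least-squares fit. The data points are synthetic placeholders generated from a known line purely to exercise the code; they are not results from the paper, and the function name is hypothetical.

```python
import math


def fit_log_linear(macs, scores):
    """Ordinary least squares for score = slope * log10(macs) + intercept."""
    xs = [math.log10(m) for m in macs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic placeholder data on an exact line (NOT measurements from the paper).
macs_per_second = [5e7, 5e8, 5e9, 1.5e10]
synthetic_scores = [1.0 + 0.5 * math.log10(m) for m in macs_per_second]
slope, intercept = fit_log_linear(macs_per_second, synthetic_scores)
```

Under this model, every tenfold increase in MACs adds `slope` to the metric, so the fitted slope is a compact way to compare how strongly different metrics respond to added compute.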

DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

paper_url: http://arxiv.org/abs/2309.07658
repo_url: None
paper_authors: Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi
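for: Exploring neural synthesis of acoustic guitar from string-wise MIDI input.
methods: Four different systems are proposed and compared using objective metrics and subjective evaluation. They are developed iteratively by considering the architecture and intermediate tasks, such as predicting pitch and loudness control features.
results: Formulating the control feature prediction task as classification rather than regression gives better results. The simplest proposed system, which predicts synthesis parameters directly from MIDI input, performs best. Audio examples are available at https://erl-j.github.io/neural-guitar-web-supplement.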

Abstract
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input performs the best out of the four proposed systems. Audio examples are available at https://erl-j.github.io/neural-guitar-web-supplement.

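Summary
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them, using objective metrics and subjective evaluation, against natural audio and a sample-based baseline. We develop the four systems iteratively by considering different architectures and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating control feature prediction as a classification task rather than a regression task yields better results. We also find that our simplest proposed system, which predicts synthesis parameters directly from MIDI input, performs best of the four. Audio examples are available at https://erl-j.github.io/neural-guitar-web-supplement.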

Multilingual Audio Captioning using machine translated data

paper_url: http://arxiv.org/abs/2309.07615
repo_url: None
paper_authors: Matéo Cousin, Étienne Labbé, Thomas Pellegrini
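for: This paper studies multilingual support for automated audio captioning (AAC) systems and how machine translation can be used to produce multilingual caption data.
methods: The authors machine-translated the English captions of two well-known AAC datasets, AudioCaps and Clotho, into French, German, and Spanish. They then trained and evaluated a monolingual system for each language on AudioCaps and Clotho.
results: Systems trained on machine-translated captions perform similarly to the English systems (about 75% CIDEr on AudioCaps and 43% on Clotho). On manually written French captions for the AudioCaps eval subset, the French system scored significantly better than the English system whose outputs were machine-translated into French. Finally, the authors built a multilingual model that achieves comparable performance in each language with far fewer parameters than a collection of monolingual systems.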

Abstract
Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC using machine-translated captions. We automatically translated two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. In French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine-translated version of AudioCaps, achieved significantly better results on the manual eval subset than the English system whose outputs we automatically translated to French. This argues for building systems in the target language rather than simply translating the English system's captions into the target language. Finally, we built a multilingual model, which achieved results in each language comparable to each monolingual system while using far fewer parameters than a collection of monolingual systems.

AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

paper_url: http://arxiv.org/abs/2309.07598
repo_url: https://github.com/unilight/seq2seq-vc
paper_authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
for: The paper proposes a non-autoregressive sequence-to-sequence (seq2seq) model for voice conversion (VC) that generalizes well to small training datasets.
methods: The proposed model, AAS-VC, uses automatic alignment search (AAS) to remove the dependency on external durations and to provide a proper inductive bias for generalization.
results: Experiments show that AAS-VC generalizes better to a training dataset of only 5 minutes than current non-AR seq2seq VC models, which require larger training datasets.

Abstract
Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their generalization ability to smaller training datasets. In this paper, we first demonstrate the above-mentioned problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implementation are available online.

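Summary
Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) can model temporal structure while offering improved intelligibility and fast inference. However, current non-AR seq2seq VC models depend on ground truth durations provided by an external AR model, which limits their generalization ability, especially on small training sets. In this paper, we first demonstrate this problem by varying the amount of training data. We then propose AAS-VC, a model based on automatic alignment search (AAS) that removes the dependency on external durations and provides a suitable inductive bias for generalizing from small training sets. Experimental results show that AAS-VC generalizes better to a training set of only 5 minutes. We also conducted ablation studies to justify several model design choices. Audio samples and the implementation are available online.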

StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings

paper_url: http://arxiv.org/abs/2309.07592
repo_url: https://github.com/arnabdas8901/StarGAN-VC_PlusPlus
paper_authors: Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober
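for: This work aims to make voice conversion (VC) more natural by preserving the source speaker's emotion.
methods: The study builds on StarGANv2-VC, a VC method based on generative adversarial networks that does not preserve the speaker's emotion.
results: StarGANv2-VC fails to disentangle the speaker and emotion representations, which causes emotion leakage. To address this, the study proposes novel emotion-aware losses and an unsupervised method that provides emotion supervision through latent emotion representations. Objective and subjective evaluations across diverse datasets, emotions, and genders demonstrate the effectiveness of the strategy.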

Abstract
Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC, is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, which is pertinent to preserving emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations demonstrate the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.

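Summary
Voice conversion (VC) makes an utterance sound like another person without changing the linguistic content. StarGANv2-VC, a recently proposed VC method based on generative adversarial networks, is very successful at generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker, and emotion preservation is an important requirement for human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, which prevents it from preserving emotion. Specifically, emotion leaks from the reference audio used to capture the speaker embeddings during training. To address this, we propose novel emotion-aware losses and an unsupervised method that provides emotion supervision through latent emotion representations. Objective and subjective evaluations across diverse datasets, emotions, and genders demonstrate the effectiveness of the strategy.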

Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models

paper_url: http://arxiv.org/abs/2309.08320
repo_url: https://github.com/wngh1187/diff-sv
paper_authors: Ju-ho Kim, Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Ha-Jin Yu
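for: Improving the accuracy and reliability of speaker verification (SV) systems in the presence of background noise.
methods: A diffusion probabilistic model (DPM) performs speech enhancement and is unified with a speaker embedding extractor to obtain a noise-tolerant and discriminative speaker representation.
results: Evaluated on the VoxCeleb1 test set, an external noise source, and the VOiCES corpus, Diff-SV achieves state-of-the-art performance in noisy conditions, outperforming recently proposed noise-robust SV systems.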

Abstract
Background noise considerably reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have exhibited remarkable noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV framework that leverages DPM. Diff-SV unifies a DPM-based speech enhancement system with a speaker embedding extractor, and yields a discriminative and noise-tolerable speaker representation through a hierarchical structure. The proposed model was evaluated under both in-domain and out-of-domain noisy conditions using the VoxCeleb1 test set, an external noise source, and the VOiCES corpus. The obtained experimental results demonstrate that Diff-SV achieves state-of-the-art performance, outperforming recently proposed noise-robust SV systems.

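Summary
Background noise significantly reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed by using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have shown strong noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV framework that leverages DPM. Diff-SV combines a DPM-based speech enhancement system with a speaker embedding extractor and produces a discriminative and noise-tolerant speaker representation through a hierarchical structure. We evaluated Diff-SV on the VoxCeleb1 test set, an external noise source, and the VOiCES corpus. The results show that Diff-SV achieves state-of-the-art performance in noisy conditions, outperforming recently proposed noise-robust SV systems.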

Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion

paper_url: http://arxiv.org/abs/2309.07586
repo_url: https://github.com/suhitaghosh10/emo-stargan
paper_authors: Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, Sebastian Stober
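for: Speech anonymisation that prevents misuse of spoken data while keeping the converted speech natural by preserving emotion.
methods: A semi-supervised StarGANv2-VC variant is trained on partially emotion-labelled non-parallel data, with emotion-aware losses computed on emotion embeddings and on acoustic features correlated with emotion.
results: Compared with the vanilla StarGANv2-VC, the proposed method significantly improves emotion preservation across diverse datasets, emotions, target speakers, and inter-group conversions, without compromising intelligibility or anonymisation.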

Abstract
Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.

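Summary
Speech anonymisation prevents misuse of spoken data while preserving at least the linguistic content. However, emotion preservation is an important factor in human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but loses emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on emotion embeddings and on acoustic features correlated with emotion, and we provide direct emotion supervision. Objective and subjective evaluations show that the approach significantly improves emotion preservation across diverse datasets, emotions, target speakers, and inter-group conversions, without compromising intelligibility or anonymisation.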

Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning

paper_url: http://arxiv.org/abs/2309.07500
repo_url: None
paper_authors: Yucong Zhang, Hongbin Suo, Yulong Wan, Ming Li
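for: This paper proposes an anomalous sound detection method that uses multitask learning to combine outlier exposure and inlier modeling.
methods: A conformer-based encoder is trained with multitask learning for outlier-aware inlier modeling, and multi-scale scores are used to detect anomalies.
results: On the MIMII and DCASE 2020 Task 2 datasets, the method outperforms state-of-the-art single-model systems and is comparable with top-ranked multi-system ensembles.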

Abstract
This paper proposes an approach for anomalous sound detection that incorporates outlier exposure and inlier modeling within a unified framework by multitask learning. While outlier exposure-based methods can extract features efficiently, they are not robust. Inlier modeling is good at generating robust features, but the features are not very effective. Recently, serial approaches have been proposed to combine these two methods, but they still require a separate training step for normal data modeling. To overcome these limitations, we use multitask learning to train a conformer-based encoder for outlier-aware inlier modeling. Moreover, our approach provides multi-scale scores for detecting anomalies. Experimental results on the MIMII and DCASE 2020 task 2 datasets show that our approach outperforms state-of-the-art single-model systems and achieves comparable results with top-ranked multi-system ensembles.

Summary
This paper proposes an anomalous sound detection method that combines outlier exposure and inlier modeling in a unified framework through multitask learning. Outlier exposure methods extract features efficiently but are not very robust, whereas inlier modeling produces robust features that are not very effective. Serial approaches have recently been proposed to combine the two methods, but they still require a separate training step for normal data modeling. To overcome these limitations, we use multitask learning to train a conformer-based encoder for outlier-aware inlier modeling. Our approach also provides multi-scale scores for detecting anomalies. Experimental results on the MIMII and DCASE 2020 Task 2 datasets show that our approach outperforms state-of-the-art single-model systems and achieves results comparable with top-ranked multi-system ensembles.

Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift

paper_url: http://arxiv.org/abs/2309.07498
repo_url: None
paper_authors: Haiyan Lan, Qiaoxi Zhu, Jian Guan, Yuming Wei, Wenwu Wang
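for: This paper aims to improve anomalous sound detection (ASD) performance under domain shift.
methods: The paper uses self-supervised learning constrained by hierarchical metadata information to obtain a finer feature representation.
results: Experiments show that the method improves on state-of-the-art self-supervised methods in DCASE 2022 Challenge Task 2.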

Abstract
Self-supervised learning methods have achieved promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is considered in feature learning by incorporating section IDs. However, the attributes accompanying audio files under each section, such as machine operating conditions and noise types, have not been considered, although they are also crucial for characterizing domain shifts. In this paper, we present a hierarchical metadata information constrained self-supervised (HMIC) ASD method, where the hierarchical relation between section IDs and attributes is constructed, and used as constraints to obtain finer feature representation. In addition, we propose an attribute-group-center (AGC)-based method for calculating the anomaly score under the domain shift condition. Experiments are performed to demonstrate its improved performance over the state-of-the-art self-supervised methods in DCASE 2022 challenge Task 2.
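The abstract does not spell out how the attribute-group-center (AGC) score is computed. As one plausible reading, the sketch below averages the embeddings of normal training clips per attribute group and scores a test clip by its distance to the nearest group centre. The Euclidean distance, the nearest-centre rule, and the names `group_centers` and `anomaly_score` are all assumptions, not the authors' definition.

```python
import math
from collections import defaultdict


def group_centers(embeddings, attributes):
    """Average the embeddings of normal clips that share an attribute label."""
    groups = defaultdict(list)
    for emb, attr in zip(embeddings, attributes):
        groups[attr].append(emb)
    return {
        attr: [sum(col) / len(vecs) for col in zip(*vecs)]
        for attr, vecs in groups.items()
    }


def anomaly_score(embedding, centers):
    """Distance to the closest attribute-group centre (assumed scoring rule)."""
    return min(math.dist(embedding, c) for c in centers.values())


# Toy 2-D embeddings with hypothetical attribute labels.
train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
attrs = ["fast", "fast", "slow", "slow"]
centers = group_centers(train, attrs)
score = anomaly_score([5.0, 6.0], centers)
```

Under this reading, a clip that sits far from every group of normal operating conditions receives a high score, whichever domain it comes from.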

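Summary
Self-supervised learning methods have shown promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is taken into account in feature learning by incorporating section IDs. However, the attributes that accompany the audio files, such as machine operating conditions and noise types, have not been considered, although they are also key to characterizing domain shift. In this paper, we propose a hierarchical metadata information constrained self-supervised (HMIC) ASD method, in which the hierarchical relation between section IDs and attributes is constructed and used as a constraint to obtain a finer feature representation. We also propose an attribute-group-center (AGC)-based method for calculating the anomaly score under domain shift. Experiments on DCASE 2022 Challenge Task 2 demonstrate improved performance over existing self-supervised methods.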

Codec Data Augmentation for Time-domain Heart Sound Classification

paper_url: http://arxiv.org/abs/2309.07466
repo_url: None
paper_authors: Ansh Mishra, Jia Qi Yip, Eng Siong Chng
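for: Early detection of valvular heart disease through heart auscultation, which can save lives.
methods: Heart sounds are classified automatically with a deep learning model that uses a simple time-domain approach.
results: With codec-simulation data augmentation, the classification error rate drops from 0.8 to 0.2.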

Abstract
Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. However, it is currently difficult to develop good heart sound classification models due to the limited data available for training. In this work, we propose a simple time-domain approach to the heart sound classification problem with a base classification error rate of 0.8 and show that augmentation of the data through codec simulation can improve the classification error rate to 0.2. With data augmentation, our approach outperforms the existing time-domain CNN-BiLSTM baseline model. Critically, our experiments show that codec data augmentation is effective in getting around the data limitation.
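The abstract does not say which codecs are simulated. As a rough illustration of the idea, the sketch below passes a waveform through a mu-law encode/decode round trip, so that a degraded copy can be added to the training set. The choice of mu-law, the 8-bit setting, and the function name `mu_law_roundtrip` are assumptions made here for illustration.

```python
import math


def mu_law_roundtrip(samples, mu=255, levels=256):
    """Encode and decode samples in [-1, 1] with mu-law companding."""
    log_mu = math.log1p(mu)
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        # Compress the amplitude logarithmically.
        y = math.copysign(math.log1p(mu * abs(x)) / log_mu, x)
        # Quantize to `levels` uniform steps, as a codec would.
        q = round((y + 1.0) / 2.0 * (levels - 1))
        y_hat = q / (levels - 1) * 2.0 - 1.0
        # Expand back to the linear domain.
        x_hat = math.copysign(math.expm1(abs(y_hat) * log_mu) / mu, y_hat)
        out.append(x_hat)
    return out


# A toy waveform and its codec-degraded copy for augmentation.
clean = [0.8 * math.sin(2 * math.pi * k / 16) for k in range(16)]
augmented = mu_law_roundtrip(clean)
```

The degraded copy keeps the original label, so each codec setting adds another labelled version of every recording.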

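Summary
Heart auscultation is a low-cost and effective way to detect valvular heart disease early, which can save lives. However, because its effectiveness depends on the skill of the doctor, the method is difficult to scale. This has driven growing research interest in automatic heart sound classification with deep learning algorithms. The limited data available for training currently makes it difficult to build good heart sound classification models. In this work, we propose a simple time-domain approach with a base classification error rate of 0.8, and show that augmenting the data through codec simulation improves the classification error rate to 0.2. With data augmentation, our approach outperforms the existing time-domain CNN-BiLSTM baseline model. Critically, our experiments show that codec data augmentation is an effective way around the data limitation.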

Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures

paper_url: http://arxiv.org/abs/2309.07458
repo_url: None
paper_authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Chng Eng Siong
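for: This paper studies how emotion in speech affects speech separation.
methods: The paper evaluates the Sepformer model on the Emo2Mix test set to analyze the impact of emotion on speech separation.
results: Even Sepformer, which has strong out-of-domain performance, can degrade by up to 5.1 dB SI-SDRi on mixtures with strong emotions, which shows that emotion should be accounted for in real-world applications.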

Abstract
Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in speech from the impact of out-of-domain inference. This is measured using a carefully designed test dataset, Emo2Mix, consisting of balanced data across all emotional combinations. We show that even models with strong out-of-domain performance such as Sepformer can still suffer significant degradation of up to 5.1 dB SI-SDRi on mixtures with strong emotions. This demonstrates the importance of accounting for emotions in real-world speech separation applications.
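For readers unfamiliar with the metric, SI-SDRi is the improvement in scale-invariant signal-to-distortion ratio of the separated output over the unprocessed mixture. The following minimal sketch implements the standard definition on plain Python lists; the toy signals are illustrative only, and the code assumes the estimate is not a perfect copy of the target (otherwise the noise power would be zero).

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two equal-length signals."""
    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy
    scaled = [alpha * t for t in target]
    noise = [e - s for e, s in zip(estimate, scaled)]
    signal_power = sum(s * s for s in scaled)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10(signal_power / noise_power)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the separated estimate over the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


target = [1.0, 0.0, -1.0, 0.0]
interference = [0.0, 1.0, 0.0, -1.0]
mixture = [t + i for t, i in zip(target, interference)]
estimate = [t + 0.1 * i for t, i in zip(target, interference)]
gain_db = si_sdr_improvement(estimate, mixture, target)
```

In this toy case the interference is attenuated by a factor of 10 in amplitude, which corresponds to a gain of about 20 dB.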

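Summary
Despite recent progress in speech separation, most models are still trained on datasets with neutral emotions. Emotional speech degrades model performance in a variety of speech tasks, which reduces the effectiveness of these models in real-world use. In this paper, we perform an analysis to separate the impact of emotion from the impact of out-of-domain inference. We use a carefully designed test set, Emo2Mix, which contains balanced data across all emotion combinations. We find that even Sepformer, a model with strong out-of-domain performance, can degrade by up to 5.1 dB SI-SDRi on mixtures with strong emotions. This shows that emotion needs to be accounted for in real-world speech separation applications.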

SpatialCodec: Neural Spatial Speech Coding

paper_url: http://arxiv.org/abs/2309.07432
repo_url: https://github.com/xzwy/spatialcodec
paper_authors: Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu
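for: This work uses deep learning to encode speech captured by a microphone array while preserving and accurately reconstructing the key spatial information in multi-channel recordings.
methods: We propose a neural spatial audio coding framework with two phases: (i) a neural sub-band codec encodes the reference channel at a low bit rate, and (ii) SpatialCodec captures relative spatial information for accurate multi-channel reconstruction.
results: The system shows superior spatial performance compared with high-bitrate baselines and a black-box neural architecture. We also propose new metrics for spatial cue preservation: cosine similarity on a spatially intuitive beamspace, and beamformed audio quality.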

Abstract
In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging a single-channel neural sub-band codec and SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec is designed to encode the reference channel with low bit rates, and (ii) a SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we also propose novel evaluation metrics to assess the spatial cue preservation: (i) spatial similarity, which calculates cosine similarity on a spatially intuitive beamspace, and (ii) beamformed audio quality. Our system shows superior spatial performance compared with high bitrate baselines and black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Codes and models are available at https://github.com/XZWY/SpatialCodec.
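The spatial similarity metric is described only as cosine similarity on a beamspace. As a minimal sketch of that building block, the code below compares two vectors that stand in for per-direction beam powers of the reference and reconstructed recordings. How the beamspace itself is constructed is not covered here, and the example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical beam powers over four look directions.
reference_beams = [0.1, 0.9, 0.3, 0.05]
decoded_beams = [0.12, 0.85, 0.33, 0.04]
similarity = cosine_similarity(reference_beams, decoded_beams)
```

A value close to 1 would indicate that the decoded recording distributes its energy over directions in nearly the same way as the reference.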

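Summary
In this work, we address the challenge of encoding speech captured by a microphone array with deep learning techniques, with the aim of preserving and accurately reconstructing the key spatial cues in multi-channel recordings. We propose a two-phase neural spatial audio coding framework: in the first phase, a single-channel neural sub-band codec encodes the reference channel at a low bit rate; in the second phase, SpatialCodec captures relative spatial information for accurate multi-channel reconstruction. We also propose new evaluation metrics to assess spatial cue preservation: (1) spatial similarity, which computes cosine similarity on a spatially intuitive beamspace, and (2) beamformed audio quality. Our system shows superior spatial performance compared with high-bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Code and models are available at https://github.com/XZWY/SpatialCodec.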

Mandarin Lombard Flavor Classification

paper_url: http://arxiv.org/abs/2309.07419
repo_url: None
paper_authors: Qingmu Liu, Yuhong Yang, Baifeng Li, Hongyang Chen, Weiping Tu, Song Lin
for: investigate the impact of different decibel levels and types of background noise on the Lombard effect
methods: used a flavor classification approach based on Mandarin Lombard speech under different noise conditions, simulated self-feedback speech, and statistical tests on word correct rates
results: found four distinct categories of Mandarin Lombard speech in the range of 30 to 80 dBA with different transition points, depending on the type of background noise (SSN or babble)

Abstract
The Lombard effect refers to individuals' unconscious modulation of vocal effort in response to variations in the ambient noise levels, intending to enhance speech intelligibility. The impact of different decibel levels and types of background noise on Lombard effects remains unclear. Building upon the characteristic of Lombard speech that individuals adjust their speech to improve intelligibility dynamically based on the self-feedback speech, we propose a flavor classification approach for the Lombard effect. We first collected Mandarin Lombard speech under different noise conditions, then simulated self-feedback speech, and ultimately conducted the statistical test on the word correct rate. We found that both SSN and babble noise types result in four distinct categories of Mandarin Lombard speech in the range of 30 to 80 dBA with different transition points.

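Summary
The Lombard effect refers to individuals unconsciously adjusting their vocal effort as ambient noise levels change, in order to improve speech intelligibility. The impact of noise level and noise type on the Lombard effect remains unclear. Building on the characteristics of Lombard speech, we propose a flavor classification approach. We first collected Mandarin Lombard speech under different noise conditions, then simulated self-feedback speech, and finally conducted statistical tests. We found that both SSN and babble noise result in four distinct categories of Mandarin Lombard speech in the range of 30 to 80 dBA, with different transition points.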

M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec

paper_url: http://arxiv.org/abs/2309.07416
repo_url: https://github.com/anton-jeran/MULTI-AUDIODEC
paper_authors: Anton Ratnarajah, Shi-Xiong Zhang, Yi Luo, Dong Yu
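for: This paper proposes a neural multi-channel audio codec that efficiently compresses multi-channel speech from multiple speakers while retaining each speaker's spatial location information.
methods: The model compresses speech content and spatial cues separately, and can be configured and trained for a predetermined set of multi-channel, multi-speaker, and multi-spatial overlapping speech conditions.
results: The model compresses and decodes overlapping speech. Operating at 12.6 kbps, it outperforms Opus and AUDIODEC at 24 kbps by 37% and 52%, respectively. Speech enhancement and room acoustic metrics were used to assess the accuracy of its clean speech and spatial cue estimates.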

Abstract
We introduce M3-AUDIODEC, an innovative neural spatial audio codec designed for efficient compression of multi-channel (binaural) speech in both single and multi-speaker scenarios, while retaining the spatial location information of each speaker. This model boasts versatility, allowing configuration and training tailored to a predetermined set of multi-channel, multi-speaker, and multi-spatial overlapping speech conditions. Key contributions are as follows: 1) Previous neural codecs are extended from single to multi-channel audio. 2) The ability of our proposed model to compress and decode overlapping speech. 3) A groundbreaking architecture that compresses speech content and spatial cues separately, ensuring the preservation of each speaker's spatial context after decoding. 4) M3-AUDIODEC's proficiency in reducing the bandwidth for compressing two-channel speech by 48% when compared to individual binaural channel compression. Impressively, at a 12.6 kbps operation, it outperforms Opus at 24 kbps and AUDIODEC at 24 kbps by 37% and 52%, respectively. In our assessment, we employed speech enhancement and room acoustic metrics to ascertain the accuracy of clean speech and spatial cue estimates from M3-AUDIODEC. Audio demonstrations and source code are available online at https://github.com/anton-jeran/MULTI-AUDIODEC.

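Summary
We introduce M3-AUDIODEC, an innovative neural audio codec for efficiently compressing multi-channel (binaural) speech in both single- and multi-speaker scenarios while retaining the spatial location information of each speaker. The model is flexible and can be configured and trained for a predetermined set of multi-channel, multi-speaker, and multi-spatial overlapping speech conditions. The key contributions are: 1) extending neural codecs from single-channel to multi-channel audio; 2) the ability to compress and decode overlapping speech; 3) a distinctive architecture that compresses speech content and spatial cues separately, preserving each speaker's spatial context after decoding; 4) a 48% reduction in the bandwidth needed to compress two-channel speech compared with compressing each binaural channel individually. Operating at 12.6 kbps, it outperforms Opus and AUDIODEC at 24 kbps by 37% and 52%, respectively. In our evaluation, we used speech enhancement and room acoustic metrics to assess the accuracy of the clean speech and spatial cue estimates from M3-AUDIODEC. Audio examples and source code are available online at https://github.com/anton-jeran/MULTI-AUDIODEC.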

Multi-dimensional Speech Quality Assessment in Crowdsourcing

paper_url: http://arxiv.org/abs/2309.07385
repo_url: https://github.com/microsoft/P.808
paper_authors: Babak Naderi, Ross Cutler, Nicolae-Catalin Ristea
for: This paper presents a crowdsourcing tool for multi-dimensional subjective speech quality assessment, the gold standard for evaluating speech enhancement processing and telecommunication systems.
methods: The paper builds a crowdsourcing implementation of a multi-dimensional subjective test that follows the scales from ITU-T Rec. P.804 (noisiness, coloration, discontinuity, and loudness) and extends them to include reverberation, the speech signal, and overall quality.
results: The tool is shown to be both accurate and reproducible. It was used in the ICASSP 2023 Speech Signal Improvement challenge, where these speech quality dimensions proved useful. The tool will be publicly available as open source at https://github.com/microsoft/P.808.

Abstract
Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems. The commonly used standard ITU-T Rec. P.800 defines how to measure speech quality in lab environments, and ITU-T Rec. P.808 extended it for crowdsourcing. ITU-T Rec. P.835 extends P.800 to measure the quality of speech in the presence of noise. ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech quality dimensions which are measured during the listening phase of the conversation. The perceptual dimensions are noisiness, coloration, discontinuity, and loudness. We create a crowdsourcing implementation of a multi-dimensional subjective test following the scales from P.804 and extend it to include reverberation, the speech signal, and overall quality. We show the tool is both accurate and reproducible. The tool has been used in the ICASSP 2023 Speech Signal Improvement challenge and we show the utility of these speech quality dimensions in this challenge. The tool will be publicly available as open-source at https://github.com/microsoft/P.808.
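Ratings collected per dimension are typically summarized as a mean opinion score (MOS) with a confidence interval. The sketch below shows that aggregation for a few of the dimensions named above. The 1 to 5 rating scale, the normal-approximation interval, the sample votes, and the function names are assumptions for illustration and are not taken from the P.808 toolkit.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings):
    """Mean opinion score and 95% confidence half-width (normal approx.)."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width


def summarize(votes_by_dimension):
    """Aggregate crowd votes separately for each quality dimension."""
    return {dim: mos_with_ci(r) for dim, r in votes_by_dimension.items()}


# Hypothetical votes for one processed clip on an assumed 1-5 scale.
votes = {
    "noisiness": [4, 5, 4, 3, 4],
    "reverberation": [3, 3, 4, 2, 3],
    "overall": [3, 4, 4, 3, 3],
}
report = summarize(votes)
```

Reporting each dimension separately shows, for example, whether a system that scores well overall still leaves audible reverberation.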

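Summary
Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems. The commonly used standard ITU-T Rec. P.800 defines how to measure speech quality in lab environments, and ITU-T Rec. P.808 extends it to crowdsourcing. ITU-T Rec. P.835 extends P.800 to measure speech quality in the presence of noise. ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech quality dimensions that are measured during the listening phase: noisiness, coloration, discontinuity, and loudness. We create a crowdsourcing implementation of a multi-dimensional subjective test and extend it to include reverberation, the speech signal, and overall quality. We show that the tool is accurate and reproducible. The tool was used in the ICASSP 2023 Speech Signal Improvement challenge, where we show the utility of these speech quality dimensions. It will be released as open source at https://github.com/microsoft/P.808.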

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

paper_url: http://arxiv.org/abs/2309.07377
repo_url: https://github.com/k2-fsa/icefall
paper_authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen
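for: This study compares and optimizes discrete tokens generated by various self-supervised learning (SSL) models, in order to explore the universality of speech discrete tokens across multiple speech tasks.
methods: Discrete tokens generated by several leading SSL models are compared and optimized on speech recognition and speech synthesis tasks.
results: Discrete tokens achieve results comparable to systems trained on FBank features in speech recognition, and outperform mel-spectrogram features in speech synthesis. These findings suggest that universal discrete tokens have great potential across speech-related tasks.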

Abstract
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction.
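The abstract does not describe how the discrete tokens are derived. A common approach, assumed here, is to cluster frame-level SSL features and replace each frame with the index of its nearest centroid. The sketch below shows only that assignment step with a tiny hand-written codebook; the codebook values and function names are hypothetical.

```python
def nearest_centroid(frame, codebook):
    """Index of the codebook entry closest to `frame` (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))


def tokenize(frames, codebook):
    """Turn a sequence of feature frames into a sequence of integer tokens."""
    return [nearest_centroid(f, codebook) for f in frames]


# Toy 2-D "features" and a 3-entry codebook standing in for learned centroids.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0], [5.0, 5.0]]
tokens = tokenize(frames, codebook)
```

Storing one small integer per frame instead of a full feature vector is what gives discrete tokens their lower storage requirements.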

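Summary
The success of self-supervised learning (SSL) in speech-related tasks has driven research into using discrete tokens for speech tasks such as recognition and translation, which offer lower storage requirements and make it possible to apply natural language processing techniques. However, these studies are mainly single-task focused and face challenges such as overfitting and performance degradation in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models, with the aim of exploring the universality of speech discrete tokens across multiple speech tasks. Experimental results show that discrete tokens perform comparably to systems trained on FBank features in speech recognition, and perform well on both subjective and objective metrics in speech synthesis. These findings suggest that universal speech discrete tokens have great potential across speech tasks. Our work is open source to facilitate research in this direction.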

Training Audio Captioning Models without Audio

paper_url: http://arxiv.org/abs/2309.07372
repo_url: https://github.com/microsoft/noaudiocaptioning
paper_authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang
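for: This work addresses the scarcity of training data for automated audio captioning (AAC) by proposing a way to train AAC systems using only text.
methods: The method leverages contrastively trained audio-text models such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap, noise injection or a learnable adapter is used during training.
results: The text-only framework performs competitively with state-of-the-art models trained with paired audio. The authors also demonstrate stylized audio captioning and caption enrichment while training without audio or human-created text captions.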

Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
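The noise-injection idea can be pictured as perturbing each text embedding before it conditions the decoder, so that the decoder tolerates the slightly different embeddings it will receive from the audio encoder at inference. The sketch below is a minimal version under stated assumptions: Gaussian noise, re-normalization to unit length, and a noise scale chosen arbitrarily here. It does not call CLAP or reproduce the authors' implementation.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inject_noise(text_embedding, std=0.015, rng=random):
    """Add zero-mean Gaussian noise, then project back to the unit sphere."""
    noisy = [x + rng.gauss(0.0, std) for x in text_embedding]
    return l2_normalize(noisy)


# A stand-in for a text-encoder output; real embeddings have far more dims.
embedding = l2_normalize([1.0, 2.0, 3.0, 4.0])
noisy_embedding = inject_noise(embedding, std=0.01, rng=random.Random(0))
```

The noise scale is a tunable quantity in this sketch: too little leaves the modality gap unaddressed, and too much hides the caption content from the decoder.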

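Summary
Automated audio captioning (AAC) is the task of generating a natural language description of an audio stream. A typical AAC system requires manually curated audio segments with corresponding text caption annotations. Creating these audio-caption pairs is costly, which makes AAC data scarce. To address this, we propose training AAC systems using only text. Our approach leverages contrastively trained audio-text models such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap, we propose using noise injection or a learnable adapter during training. We find that the text-only framework is competitive with state-of-the-art models. Finally, we demonstrate stylized audio captioning and caption enrichment while training without audio or human-created text captions.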

2023-09-14

cs.CV

cs.CV - 2023-09-14

Morphologically-Aware Consensus Computation via Heuristics-based IterATive Optimization (MACCHIatO)

paper_url: http://arxiv.org/abs/2309.08066
repo_url: None
paper_authors: Dimitri Hamzaoui, Sarah Montagne, Raphaële Renard-Penna, Nicholas Ayache, Hervé Delingette
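for: This study provides a method for building a consensus segmentation from several raters' masks whose output is independent of the image background size, addressing the sensitivity of STAPLE to background size and the choice of prior.
methods: We propose building a binary or probabilistic consensus segmentation based on the Fréchet means of carefully chosen distances, and provide a heuristic approach to optimize this criterion.
results: Extensive comparison on several datasets with STAPLE and naive segmentation averaging shows that the method produces binary consensus masks of intermediate size between Majority Voting and STAPLE, and posterior probabilities that differ from those of Mask Averaging and STAPLE.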

Abstract
The extraction of consensus segmentations from several binary or probabilistic masks is important to solve various tasks such as the analysis of inter-rater variability or the fusion of several neural network outputs. One of the most widely used methods to obtain such a consensus segmentation is the STAPLE algorithm. In this paper, we first demonstrate that the output of that algorithm is heavily impacted by the background size of images and the choice of the prior. We then propose a new method to construct a binary or a probabilistic consensus segmentation based on the Fréchet means of carefully chosen distances which makes it totally independent of the image background size. We provide a heuristic approach to optimize this criterion such that a voxel's class is fully determined by its voxel-wise distance to the different masks, the connected component it belongs to and the group of raters who segmented it. We extensively compared our method on several datasets with the STAPLE method and the naive segmentation averaging method, showing that it leads to binary consensus masks of intermediate size between Majority Voting and STAPLE and to different posterior probabilities than Mask Averaging and STAPLE methods. Our code is available at https://gitlab.inria.fr/dhamzaou/jaccardmap.
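As a point of reference for the baselines named above, the sketch below implements voxel-wise mask averaging and majority voting on flattened binary masks. It illustrates the naive baselines only, not MACCHIatO or STAPLE, and the three example masks are made up.

```python
def mask_average(masks):
    """Voxel-wise mean of binary masks: a probabilistic consensus."""
    n = len(masks)
    return [sum(votes) / n for votes in zip(*masks)]


def majority_vote(masks):
    """Binary consensus: foreground where more than half the raters agree."""
    return [1 if p > 0.5 else 0 for p in mask_average(masks)]


# Three raters segmenting the same four voxels (flattened masks).
rater_masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
probabilistic = mask_average(rater_masks)
binary = majority_vote(rater_masks)
```

Both baselines treat each voxel independently and use no distance or connected-component information.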

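Summary
Extracting a consensus segmentation from several binary or probabilistic masks matters for tasks such as analyzing inter-rater variability or fusing the outputs of several neural networks. STAPLE is one of the most widely used algorithms for this. The paper first shows that STAPLE's output is heavily affected by the image background size and the choice of prior. It then proposes building a binary or probabilistic consensus from the Fréchet means of carefully chosen distances, which makes the result independent of the background size. A heuristic optimizes this criterion so that each voxel's class is fully determined by its voxel-wise distance to the different masks, the connected component it belongs to, and the group of raters who segmented it. Across several datasets, the method yields binary consensus masks of intermediate size between Majority Voting and STAPLE, and posterior probabilities that differ from those of Mask Averaging and STAPLE. Code is available at https://gitlab.inria.fr/dhamzaou/jaccardmap.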
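For orientation, the two baselines named in the abstract above (Majority Voting and Mask Averaging) can be sketched in a few lines. This is not the MACCHIatO method itself. The Jaccard distance is included only as an assumption suggested by the repository name; it illustrates how a distance that ignores pixels marked by nobody is unaffected by background size.

```python
def mask_average(masks):
    """Per-pixel mean of the rater masks: a naive probabilistic consensus."""
    n = len(masks)
    return [sum(values) / n for values in zip(*masks)]


def majority_vote(masks):
    """Binary consensus: 1 where more than half of the raters marked the pixel."""
    return [1 if p > 0.5 else 0 for p in mask_average(masks)]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two binary masks (0.0 when both are empty)."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - intersection / union if union else 0.0


# Three raters label a flattened 6-pixel image.
raters = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
consensus = majority_vote(raters)
distances = [jaccard_distance(consensus, r) for r in raters]
```

Padding every mask with extra background zeros leaves `distances` unchanged, because neither the intersection nor the union counts pixels that both masks leave empty.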

Interpretability-Aware Vision Transformer

paper_url: http://arxiv.org/abs/2309.08035
repo_url: https://github.com/qiangyao1988/ia-vit
paper_authors: Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu
for: Making Vision Transformers (ViTs) interpretable by design, rather than explaining them post hoc.
methods: A training procedure that inherently enhances interpretability: a feature extractor, a predictor, and an interpreter are trained jointly with an interpretability-aware objective, and the interpreter gives faithful explanations through its single-head self-attention mechanism.
results: Effective on several image classification tasks, with both qualitative and quantitative evaluations of model performance and interpretability.

Abstract
Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing post hoc solutions to explain ViTs' outputs, these methods do not generalize to different downstream tasks and various transformer architectures. Furthermore, if ViTs are not properly trained with the given data and do not prioritize the region of interest, the post hoc methods would be less effective. Instead of developing another post hoc approach, we introduce a novel training procedure that inherently enhances model interpretability. Our interpretability-aware ViT (IA-ViT) draws inspiration from a fresh insight: both the class patch and image patches consistently generate predicted distributions and attention maps. IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective. Consequently, the interpreter simulates the behavior of the predictor and provides a faithful explanation through its single-head self-attention mechanism. Our comprehensive experimental results demonstrate the effectiveness of IA-ViT in several image classification tasks, with both qualitative and quantitative evaluations of model performance and interpretability. Source code is available from: https://github.com/qiangyao1988/IA-ViT.

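Summary
Vision Transformers (ViTs) have become prominent models for vision tasks, but their interpretability has not kept pace with their performance. Post hoc methods for explaining ViT outputs do not generalize across downstream tasks and transformer architectures, and they become less effective when a ViT is not properly trained on the given data and does not prioritize the region of interest. Instead of another post hoc approach, the paper introduces a training procedure that inherently enhances interpretability. The interpretability-aware ViT (IA-ViT) builds on the insight that both the class patch and the image patches consistently generate predicted distributions and attention maps. IA-ViT consists of a feature extractor, a predictor, and an interpreter, trained jointly with an interpretability-aware objective, so the interpreter simulates the predictor's behavior and provides a faithful explanation through its single-head self-attention mechanism. Experiments on several image classification tasks show its effectiveness in both performance and interpretability. Code: https://github.com/qiangyao1988/IA-ViT.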

Depth Estimation from a Single Optical Encoded Image using a Learned Colored-Coded Aperture

paper_url: http://arxiv.org/abs/2309.08033
repo_url: None
paper_authors: Jhon Lopez, Edwin Vargas, Henry Arguello
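for: Improving depth estimation from a single image by placing a color-coded aperture with many color filters in the lens aperture, so that different depths produce distinguishable coded patterns.
methods: A color-coded aperture (CCA) with more color filters and richer spectral information than earlier CCA designs; the CCA pattern and a convolutional neural network (CNN) that retrieves depth are learned jointly, end to end.
results: Experiments on three datasets indicate that the learned color encoding can remove depth ambiguities and gives better depth estimates than state-of-the-art approaches; a low-cost prototype built from photographic film validates the approach in real scenes.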

Abstract
Depth estimation from a single image of a conventional camera is a challenging task since depth cues are lost during the acquisition process. State-of-the-art approaches improve the discrimination between different depths by introducing a binary-coded aperture (CA) in the lens aperture that generates different coded blur patterns at different depths. Color-coded apertures (CCA) can also produce color misalignment in the captured image which can be utilized to estimate disparity. Leveraging advances in deep learning, more recent works have explored the data-driven design of a diffractive optical element (DOE) for encoding depth information through chromatic aberrations. However, compared with binary CA or CCA, DOEs are more expensive to fabricate and require high-precision devices. Different from previous CCA-based approaches that employ few basic colors, in this work we propose a CCA with a greater number of color filters and richer spectral information to optically encode relevant depth information in a single snapshot. Furthermore, we propose to jointly learn the color-coded aperture (CCA) pattern and a convolutional neural network (CNN) to retrieve depth information by using an end-to-end optimization approach. We demonstrate through different experiments on three different data sets that the designed color-encoding has the potential to remove depth ambiguities and provides better depth estimates compared to state-of-the-art approaches. Additionally, we build a low-cost prototype of our CCA using a photographic film and validate the proposed approach in real scenarios.

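Summary
Estimating depth from a single image of a conventional camera is challenging because depth cues are lost during acquisition. State-of-the-art approaches improve discrimination between depths by placing a binary coded aperture (CA) in the lens aperture, which produces different coded blur patterns at different depths; color-coded apertures (CCA) additionally cause color misalignment that can be used to estimate disparity. More recent work uses deep learning for the data-driven design of a diffractive optical element (DOE), but DOEs are more expensive to fabricate than a binary CA or CCA and require high-precision devices. Unlike earlier CCA approaches that use a few basic colors, this work proposes a CCA with more color filters and richer spectral information to optically encode depth in a single snapshot, and learns the CCA pattern jointly with a convolutional neural network (CNN) end to end. Experiments on three datasets show that the designed color encoding can remove depth ambiguities and gives better depth estimates than state-of-the-art approaches. A low-cost prototype made with photographic film validates the approach in real scenarios.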

Empowering Visually Impaired Individuals: A Novel Use of Apple Live Photos and Android Motion Photos

paper_url: http://arxiv.org/abs/2309.08022
repo_url: None
paper_authors: Seyedalireza Khoshsirat, Chandra Kambhamettu
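for: Exploring Apple Live Photos and Android Motion Photos as a way to cope with the poor quality of images that visually impaired users capture for machine-learning-based assistive applications.
methods: A straightforward methodology for evaluating and comparing Live/Motion Photos against traditional single-image approaches.
results: Live Photos and Motion Photos outperform single-frame images on common visual assistance tasks, specifically object classification and VideoQA, on the ORBIT dataset. Ablation studies examine the impact of deblurring and longer temporal crops.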

Abstract
Numerous applications have been developed to assist visually impaired individuals that employ a machine learning unit to process visual input. However, a critical challenge with these applications is the sub-optimal quality of images captured by the users. Given the complexity of operating a camera for visually impaired individuals, we advocate for the use of Apple Live Photos and Android Motion Photos technologies. In this study, we introduce a straightforward methodology to evaluate and contrast the efficacy of Live/Motion Photos against traditional image-based approaches. Our findings reveal that both Live Photos and Motion Photos outperform single-frame images in common visual assisting tasks, specifically in object classification and VideoQA. We validate our results through extensive experiments on the ORBIT dataset, which consists of videos collected by visually impaired individuals. Furthermore, we conduct a series of ablation studies to delve deeper into the impact of deblurring and longer temporal crops.

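Summary
Many applications that assist visually impaired people rely on a machine learning unit to process visual input, but the images users capture are often of sub-optimal quality. Because operating a camera is difficult for visually impaired individuals, the authors advocate Apple Live Photos and Android Motion Photos. They introduce a straightforward methodology for comparing Live/Motion Photos with traditional image-based approaches and find that both outperform single-frame images on common visual assistance tasks, specifically object classification and VideoQA. The results are validated through extensive experiments on the ORBIT dataset, which consists of videos collected by visually impaired individuals, and ablation studies examine the impact of deblurring and longer temporal crops.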

Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation

paper_url: http://arxiv.org/abs/2309.08020
repo_url: https://github.com/zhaochongan/the-mask
paper_authors: Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, Luc Van Gool
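for: Improving video semantic segmentation (VSS) by making fuller use of the expressive capacity of object queries.
methods: Temporal-aware hierarchical object queries, a simple two-round matching mechanism, a hierarchical loss that trains primary and secondary queries, and a temporal aggregation decoder.
results: State-of-the-art performance on the recent VSS benchmark VSPW, without bells and whistles.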

Abstract
Modern approaches have proved the huge potential of addressing semantic segmentation as a mask classification task which is widely used in instance-level segmentation. This paradigm trains models by assigning part of object queries to ground truths via conventional one-to-one matching. However, we observe that the popular video semantic segmentation (VSS) dataset has limited categories per video, meaning less than 10% of queries could be matched to receive meaningful gradient updates during VSS training. This inefficiency limits the full expressive potential of all queries. Thus, we present a novel solution THE-Mask for VSS, which introduces temporal-aware hierarchical object queries for the first time. Specifically, we propose to use a simple two-round matching mechanism to involve more queries matched with minimal cost during training while without any extra cost during inference. To support our more-to-one assignment, in terms of the matching results, we further design a hierarchical loss to train queries with their corresponding hierarchy of primary or secondary. Moreover, to effectively capture temporal information across frames, we propose a temporal aggregation decoder that fits seamlessly into the mask-classification paradigm for VSS. Utilizing temporal-sensitive multi-level queries, our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.

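Summary
Recent approaches have shown the potential of treating semantic segmentation as mask classification, a paradigm widely used in instance-level segmentation that assigns some object queries to ground truths through one-to-one matching. The authors observe that the popular video semantic segmentation (VSS) dataset has few categories per video, so fewer than 10% of queries are matched and receive meaningful gradient updates during training, which limits the expressive potential of the queries. THE-Mask introduces temporal-aware hierarchical object queries for VSS for the first time. A simple two-round matching mechanism involves more queries during training at minimal cost and adds no cost at inference; a hierarchical loss trains queries according to their primary or secondary level; and a temporal aggregation decoder captures temporal information across frames within the mask-classification paradigm. The method achieves state-of-the-art performance on the VSPW benchmark without bells and whistles.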
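One possible reading of the two-round matching described above is sketched here. It is an interpretation, not the THE-Mask code: the greedy matcher is a dependency-free stand-in for the Hungarian matching usually used for one-to-one assignment, and the cost values are made up.

```python
def greedy_match(cost, query_ids, target_ids):
    """One-to-one matching by repeatedly taking the cheapest remaining pair.

    A simple stand-in for Hungarian matching, used here to stay dependency-free.
    """
    pairs = sorted((cost[q][t], q, t) for q in query_ids for t in target_ids)
    used_queries, used_targets, matches = set(), set(), {}
    for _, q, t in pairs:
        if q not in used_queries and t not in used_targets:
            matches[q] = t
            used_queries.add(q)
            used_targets.add(t)
    return matches


def two_round_match(cost):
    """Round 1 picks primary queries; round 2 matches leftover queries again."""
    queries = list(range(len(cost)))
    targets = list(range(len(cost[0])))
    primary = greedy_match(cost, queries, targets)
    leftover = [q for q in queries if q not in primary]
    secondary = greedy_match(cost, leftover, targets)
    return primary, secondary


# cost[q][t]: matching cost between query q and ground-truth segment t.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.3, 0.7],
    [0.6, 0.4],
    [0.5, 0.5],
]
primary, secondary = two_round_match(cost)
```

With two ground-truth segments and five queries, a single round supervises two queries while the second round adds two more; a hierarchical loss could then weight primary and secondary matches differently.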

Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset

paper_url: http://arxiv.org/abs/2309.08009
repo_url: None
paper_authors: Iya Chivileva, Philip Lynch, Tomas E. Ward, Alan F. Smeaton
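for: Assessing how to evaluate the quality of videos generated by text-to-video (T2V) models, which matters if the outputs are to convince viewers of their authenticity.
methods: The paper reviews commonly used metrics and their limitations, presents a dataset of more than 1,000 videos generated by 5 very recent T2V models, applies some of those metrics to it, and collects extensive human quality evaluations.
results: Naturalness and semantic matching with the text prompt are both important, but no single measure captures these subtleties.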

Abstract
Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.

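Summary
Evaluating the quality of videos generated by text-to-video (T2V) models is important if their outputs are to be plausible enough to convince a viewer of their authenticity. The paper examines some of the metrics used in this area and highlights their limitations. It presents a dataset of more than 1,000 videos generated by 5 very recent T2V models, applies some commonly used quality metrics to them, and adds extensive human quality evaluations so that the metrics and human assessment can be compared. The conclusion is that naturalness and semantic matching with the text prompt are both important, but no single measure captures these subtleties.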

Kinship Verification from rPPG using 1DCNN Attention networks

paper_url: http://arxiv.org/abs/2309.08006
repo_url: None
paper_authors: Xiaoting Wu, Xiaoyi Feng, Lili Liu, Constantino Álvarez Casado, Miguel Bordallo López
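for: This paper investigates remote photoplethysmography (rPPG) signals for kinship verification.
methods: A one-dimensional convolutional neural network (1DCNN) with a 1DCNN-Attention module and contrastive loss learns kinship similarity from rPPG signals extracted from several facial regions of interest.
results: Evaluation on the UvANEMO Smile Database across different kin relations shows that rPPG signals are useful for verifying kinship.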

Abstract
Facial kinship verification aims at automatically determining whether two subjects have a kinship relation. It has been widely studied from different modalities, such as faces, voices, gait, and smiling expressions. However, the potential of bio-signals, such as remote Photoplethysmography (rPPG) extracted from facial videos, remains largely unexplored in the kinship verification problem. In this paper, we investigate for the first time the usage of the rPPG signal for kinship verification. Specifically, we proposed a one-dimensional Convolutional Neural Network (1DCNN) with a 1DCNN-Attention module and contrastive loss to learn the kinship similarity from rPPGs. The network takes multiple rPPG signals extracted from various facial Regions of Interest (ROIs) as inputs. Additionally, the 1DCNN attention module is designed to learn and capture the discriminative kin features from feature embeddings. Finally, the proposed method is evaluated on the UvANEMO Smile Database from different kin relations, showing the usefulness of rPPG signals in verifying kinship.

Summary
Facial kinship verification aims to determine automatically whether two subjects are related. It has been studied through modalities such as faces, voices, gait, and smiling expressions, but bio-signals such as remote photoplethysmography (rPPG) extracted from facial videos remain largely unexplored. This paper investigates rPPG for kinship verification for the first time. It proposes a one-dimensional convolutional neural network (1DCNN) with a 1DCNN-Attention module and contrastive loss to learn kinship similarity from rPPG signals. The network takes multiple rPPG signals extracted from various facial regions of interest (ROIs) as input, and the attention module learns discriminative kin features from the feature embeddings. Evaluation on the UvANEMO Smile Database across different kin relations shows that rPPG signals are useful for verifying kinship.
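The abstract names a contrastive loss without giving its form. A common pairwise formulation is sketched below as an assumption; the margin value and the 2-D embeddings are illustrative only and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Pull kin pairs together; push non-kin pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if is_kin:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings of rPPG signals from two subjects.
parent = [0.0, 0.0]
child = [0.3, 0.4]      # distance 0.5 from parent
stranger = [0.3, 0.4]   # same distance, but labeled non-kin

kin_loss = contrastive_loss(parent, child, is_kin=True)
non_kin_loss = contrastive_loss(parent, stranger, is_kin=False)
```

At verification time, a pair would be declared kin when the embedding distance falls below a threshold tuned on validation data.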

TCGF: A unified tensorized consensus graph framework for multi-view representation learning

paper_url: http://arxiv.org/abs/2309.09987
repo_url: None
paper_authors: Xiangzhu Meng, Wei Wei, Qiang Liu, Shu Wu, Liang Wang
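for: Proposing a universal multi-view representation learning framework, the Tensorized Consensus Graph Framework (TCGF), to address the lack of unification and generality in existing multi-view methods.
methods: (1) A unified framework that lets existing multi-view methods exploit per-view representations, intended to suit arbitrary assumptions and datasets of different scales. (2) Stacking the per-view representations into a tensor as a high-order representation, so that consistency and complementary information propagate smoothly across views. (3) Learning a consensus embedding shared by all views, which collaborate adaptively to uncover the essential structure of the multi-view data.
results: On seven datasets of different scales, TCGF outperforms existing state-of-the-art multi-view learning methods.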

Abstract
Multi-view learning techniques have recently gained significant attention in the machine learning domain for their ability to leverage consistency and complementary information across multiple views. However, there remains a lack of sufficient research on generalized multi-view frameworks that unify existing works into a scalable and robust learning framework, as most current works focus on specific styles of multi-view models. Additionally, most multi-view learning works rely heavily on specific-scale scenarios and fail to effectively comprehend multiple scales holistically. These limitations hinder the effective fusion of essential information from multiple views, resulting in poor generalization. To address these limitations, this paper proposes a universal multi-view representation learning framework named Tensorized Consensus Graph Framework (TCGF). Specifically, it first provides a unified framework for existing multi-view works to exploit the representations for individual view, which aims to be suitable for arbitrary assumptions and different-scales datasets. Then, stacks them into a tensor under alignment basics as a high-order representation, allowing for the smooth propagation of consistency and complementary information across all views. Moreover, TCGF proposes learning a consensus embedding shared by adaptively collaborating all views to uncover the essential structure of the multi-view data, which utilizes view-consensus grouping effect to regularize the view-consensus representation. To further facilitate related research, we provide a specific implementation of TCGF for large-scale datasets, which can be efficiently solved by applying the alternating optimization strategy. Experimental results conducted on seven different-scales datasets indicate the superiority of the proposed TCGF against existing state-of-the-art multi-view learning methods.

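Summary
Multi-view learning has attracted attention for its ability to exploit consistent and complementary information across views. Research on generalized frameworks that unify existing work into a scalable, robust framework is still lacking, because most work targets specific styles of multi-view models and specific-scale scenarios. These limitations hinder the fusion of essential information across views and lead to poor generalization. The Tensorized Consensus Graph Framework (TCGF) first provides a unified framework in which existing multi-view methods exploit per-view representations, intended for arbitrary assumptions and datasets of different scales. It then stacks these representations into an aligned tensor as a high-order representation, so that consistency and complementary information propagate smoothly across all views. TCGF also learns a consensus embedding shared by all views, regularized by a view-consensus grouping effect, to uncover the essential structure of the data. A specific implementation for large-scale datasets is solved efficiently by alternating optimization. Experiments on seven datasets of different scales indicate that TCGF outperforms existing state-of-the-art multi-view learning methods.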

M3Dsynth: A dataset of medical 3D images with AI-generated local manipulations

paper_url: http://arxiv.org/abs/2309.07973
repo_url: None
paper_authors: Giada Zingarini, Davide Cozzolino, Riccardo Corvi, Giovanni Poggi, Luisa Verdoliva
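for: Studying the detection of manipulated medical images, where altered content could change the resulting diagnosis.
methods: Manipulated Computed Tomography (CT) lung images are created by injecting or removing lung cancer nodules in real CT scans, using three methods based on Generative Adversarial Networks (GAN) or Diffusion Models (DM), for a total of 8,577 manipulated samples.
results: The manipulated images easily fool automated diagnostic tools. State-of-the-art forensic detectors trained on the dataset accurately detect and localize the manipulated synthetic content, including when training and test sets are not aligned, which shows good generalization.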

Abstract
The ability to detect manipulated visual content is becoming increasingly important in many application fields, given the rapid advances in image synthesis methods. Of particular concern is the possibility of modifying the content of medical images, altering the resulting diagnoses. Despite its relevance, this issue has received limited attention from the research community. One reason is the lack of large and curated datasets to use for development and benchmarking purposes. Here, we investigate this issue and propose M3Dsynth, a large dataset of manipulated Computed Tomography (CT) lung images. We create manipulated images by injecting or removing lung cancer nodules in real CT scans, using three different methods based on Generative Adversarial Networks (GAN) or Diffusion Models (DM), for a total of 8,577 manipulated samples. Experiments show that these images easily fool automated diagnostic tools. We also tested several state-of-the-art forensic detectors and demonstrated that, once trained on the proposed dataset, they are able to accurately detect and localize manipulated synthetic content, including when training and test sets are not aligned, showing good generalization ability. Dataset and code will be publicly available at https://grip-unina.github.io/M3Dsynth/.

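Summary
Detecting manipulated visual content is increasingly important given rapid advances in image synthesis. Of particular concern is the modification of medical images, which could alter the resulting diagnoses. The issue has received limited attention, partly because large curated datasets for development and benchmarking are lacking. The authors propose M3Dsynth, a large dataset of manipulated Computed Tomography (CT) lung images, created by injecting or removing lung cancer nodules in real CT scans with three methods based on Generative Adversarial Networks (GAN) or Diffusion Models (DM), for a total of 8,577 manipulated samples. Experiments show that these images easily fool automated diagnostic tools. Several state-of-the-art forensic detectors, once trained on the dataset, accurately detect and localize the manipulated synthetic content, including when training and test sets are not aligned, which shows good generalization ability. The dataset and code will be publicly available at https://grip-unina.github.io/M3Dsynth/.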

Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping

paper_url: http://arxiv.org/abs/2309.07970
repo_url: None
paper_authors: Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, Ken Goldberg
for: Improving the ability of learning-based grasp planners to grasp objects by a specific part across diverse objects, using natural language queries for task-oriented grasping. This matters for robotic grasping and manipulation tasks where the grasped part is crucial.
methods: LERF-TOGO uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. It first reconstructs a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field queryable with text. To mitigate LERF's lack of spatial grouping, it extracts a 3D object mask via DINO features and conditionally queries LERF on this mask to obtain a semantic distribution over the object, which is then used to rank grasps from an off-the-shelf grasp planner.
results: On 31 different physical objects, LERF-TOGO selects grasps on the correct part in 81% of all trials and grasps successfully in 69%.

Abstract
Grasping objects by a specific part is often crucial for safety and for executing downstream tasks. Yet, learning-based grasp planners lack this behavior unless they are trained on specific object part data, making it a significant challenge to scale object diversity. Instead, we propose LERF-TOGO, Language Embedded Radiance Fields for Task-Oriented Grasping of Objects, which uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. To accomplish this, we first reconstruct a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field queryable with text. However, LERF has no sense of objectness, meaning its relevancy outputs often return incomplete activations over an object which are insufficient for subsequent part queries. LERF-TOGO mitigates this lack of spatial grouping by extracting a 3D object mask via DINO features and then conditionally querying LERF on this mask to obtain a semantic distribution over the object with which to rank grasps from an off-the-shelf grasp planner. We evaluate LERF-TOGO's ability to grasp task-oriented object parts on 31 different physical objects, and find it selects grasps on the correct part in 81% of all trials and grasps successfully in 69%. See the project website at: lerftogo.github.io

Summary
Grasping an object by a specific part is often crucial for safety and for downstream tasks. Learning-based grasp planners lack this behavior unless trained on object-part data, which makes scaling to diverse objects difficult. LERF-TOGO (Language Embedded Radiance Fields for Task-Oriented Grasping of Objects) uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. It first reconstructs a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field queryable with text. Because LERF has no sense of objectness, its relevancy outputs often cover an object only partially, which is insufficient for subsequent part queries. LERF-TOGO therefore extracts a 3D object mask via DINO features and conditionally queries LERF on this mask to obtain a semantic distribution over the object, which is used to rank grasps from an off-the-shelf grasp planner. On 31 different physical objects, it selects grasps on the correct part in 81% of all trials and grasps successfully in 69%. Project website: lerftogo.github.io
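The final ranking step, in which a semantic distribution over the object re-orders grasps from an off-the-shelf planner, might look like the sketch below. Everything specific here is hypothetical: the multiplicative scoring rule, the nearest-point lookup, and the toy mug data are assumptions for illustration, not the LERF-TOGO implementation.

```python
import math


def nearest_relevancy(position, points, relevancy):
    """Relevancy of the masked object point closest to a grasp position."""
    best = min(range(len(points)), key=lambda i: math.dist(position, points[i]))
    return relevancy[best]


def rank_grasps(grasps, points, relevancy):
    """Sort grasps by planner confidence weighted by part relevancy."""
    scored = [
        (g["confidence"] * nearest_relevancy(g["position"], points, relevancy), g["name"])
        for g in grasps
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical mug: points inside the object mask with relevancy to "handle".
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.06, 0.0, 0.05)]
relevancy = [0.1, 0.1, 0.9]
grasps = [
    {"name": "rim", "position": (0.0, 0.0, 0.11), "confidence": 0.9},
    {"name": "handle", "position": (0.07, 0.0, 0.05), "confidence": 0.6},
    {"name": "base", "position": (0.0, 0.01, 0.0), "confidence": 0.8},
]
ranking = rank_grasps(grasps, points, relevancy)
```

Here the handle grasp ranks first even though the planner alone preferred the rim, which is the behavior a part-specific query is meant to produce.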

Large-Vocabulary 3D Diffusion Model with Transformer

paper_url: http://arxiv.org/abs/2309.07920
repo_url: None
paper_authors: Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu
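for: Generating real-world 3D objects across a large vocabulary of categories with a single generative model.
methods: A diffusion-based feed-forward framework, DiffTF, that combines three components: a revised triplane representation, a 3D-aware transformer with shared cross-plane attention, and a 3D-aware encoder/decoder.
results: Experiments on ShapeNet and OmniObject3D show that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation with large diversity, rich semantics, and high quality.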

Abstract
Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model. Notably, there are three major challenges for this large-vocabulary 3D generation: a) the need for expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects. 1) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality.

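Summary
Creating diverse, high-quality 3D assets with an automatic generative model is highly desirable, yet most existing work generates a single category or a few categories. The paper introduces a diffusion-based feed-forward framework that synthesizes a large vocabulary of real-world 3D objects with a single generative model. Three challenges stand out: (a) the need for an expressive yet efficient 3D representation; (b) large diversity in geometry and texture across categories; (c) the complex appearance of real-world objects. The proposed triplane-based 3D-aware diffusion model with transformer, DiffTF, addresses them in three ways. (1) It adopts a revised triplane representation and improves fitting speed and accuracy. (2) It treats the features of every 3D object as a combination of generalized 3D knowledge and specialized 3D features, and uses a 3D-aware transformer with shared cross-plane attention to learn cross-plane relations and aggregate the two. (3) It adds a 3D-aware encoder/decoder that enhances the generalized 3D knowledge in the encoded triplanes for categories with complex appearances. Experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) show that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation with large diversity, rich semantics, and high quality.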
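For readers unfamiliar with triplanes, the generic lookup behind such a representation can be sketched as follows. This is a plain triplane query, not DiffTF's revised variant, whose details the abstract does not give; nearest-neighbor sampling and summing the three features are simplifying assumptions.

```python
def sample_plane(plane, u, v):
    """Nearest-neighbor lookup of a feature vector at (u, v) in [0, 1]."""
    size = len(plane)
    row = min(int(v * size), size - 1)
    col = min(int(u * size), size - 1)
    return plane[row][col]


def query_triplane(planes, x, y, z):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the features."""
    f_xy = sample_plane(planes["xy"], x, y)
    f_xz = sample_plane(planes["xz"], x, z)
    f_yz = sample_plane(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def constant_plane(size, feature):
    """Toy plane whose every cell holds the same feature vector."""
    return [[list(feature) for _ in range(size)] for _ in range(size)]


planes = {
    "xy": constant_plane(4, [1.0, 0.0]),
    "xz": constant_plane(4, [0.0, 2.0]),
    "yz": constant_plane(4, [0.5, 0.5]),
}
feature = query_triplane(planes, 0.2, 0.7, 1.0)
```

The efficiency argument is about storage: at resolution 128, three planes hold 3 × 128² = 49,152 cells, whereas a dense voxel grid holds 128³ = 2,097,152.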

OpenIllumination: A Multi-Illumination Dataset for Inverse Rendering Evaluation on Real Objects

paper_url: http://arxiv.org/abs/2309.07921
repo_url: None
paper_authors: Isabella Liu, Linghao Chen, Ziyang Fu, Liwen Wu, Haian Jin, Zhong Li, Chin Ming Ryan Wong, Yi Xu, Ravi Ramamoorthi, Zexiang Xu, Hao Su
for: Providing a large real-world captured dataset for evaluating inverse rendering and material decomposition methods.
methods: Over 108K images of 64 objects with diverse materials are captured under 72 camera views and many illuminations, each with accurate camera parameters, illumination ground truth, and foreground segmentation masks.
results: Several state-of-the-art inverse rendering methods are evaluated on the dataset and their performance is compared.

Abstract
We introduce OpenIllumination, a real-world dataset containing over 108K images of 64 objects with diverse materials, captured under 72 camera views and a large number of different illuminations. For each image in the dataset, we provide accurate camera parameters, illumination ground truth, and foreground segmentation masks. Our dataset enables the quantitative evaluation of most inverse rendering and material decomposition methods for real objects. We examine several state-of-the-art inverse rendering methods on our dataset and compare their performances. The dataset and code can be found on the project page: https://oppo-us-research.github.io/OpenIllumination.

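Summary
OpenIllumination is a real-world dataset of over 108K images of 64 objects with diverse materials, captured under 72 camera views and a large number of different illuminations. Each image comes with accurate camera parameters, illumination ground truth, and a foreground segmentation mask. The dataset enables quantitative evaluation of most inverse rendering and material decomposition methods on real objects, and the authors compare several state-of-the-art inverse rendering methods on it. The dataset and code are on the project page: https://oppo-us-research.github.io/OpenIllumination.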

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

paper_url: http://arxiv.org/abs/2309.07918
repo_url: https://github.com/openrobotlab/unihsi
paper_authors: Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang
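for: Proposing a unified Human-Scene Interaction (HSI) framework that offers versatile interaction control and a user-friendly interface for fields such as embodied AI and virtual reality.
methods: Interactions are defined as a Chain of Contacts (CoC). A Large Language Model (LLM) Planner translates language prompts into task plans in CoC form, and a Unified Controller turns each CoC into uniform task execution.
results: Experiments show the framework's effectiveness in versatile task execution and its generalizability to real scanned scenes.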

Abstract
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions. Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .

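Summary
Human-Scene Interaction (HSI) is a key component of embodied AI and virtual reality. Despite progress in motion quality and physical plausibility, versatile interaction control and a user-friendly interface still need further exploration before HSI can be applied in practice. This paper presents UniHSI, a unified HSI framework that supports unified control of diverse interactions through language commands. The framework defines interaction as a Chain of Contacts (CoC), that is, steps of human joint-object part pairs, a definition motivated by the strong correlation between interaction types and human-object contact regions. UniHSI comprises a Large Language Model (LLM) Planner that translates language prompts into task plans in CoC form, and a Unified Controller that turns CoC into uniform task execution. For training and evaluation, we collect a new dataset, ScenePlan, containing thousands of task plans generated by LLMs from diverse scenarios. Comprehensive experiments demonstrate the framework's effectiveness in versatile task execution and its generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI.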

Looking at words and points with attention: a benchmark for text-to-shape coherence

paper_url: http://arxiv.org/abs/2309.07917
repo_url: None
paper_authors: Andrea Amaduzzi, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
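for: To provide a comprehensive solution for evaluating the coherence between textual descriptions and 3D shapes.
methods: Large language models automatically refine the textual descriptions associated with 3D shapes, and a new quantitative metric based on cross-attention assesses text-to-shape coherence.
results: A user study and a quantitative comparison with existing metrics validate the new metric. The authors also publicly release a new fine-grained benchmark to foster research on text-to-shape coherence of text-conditioned 3D generative models.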

Abstract
While text-conditional 3D object generation and manipulation have seen rapid progress, the evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark. The reason is twofold: a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs; b) the limited effectiveness of the metrics used to quantitatively assess such coherence. In this paper, we propose a comprehensive solution that addresses both weaknesses. Firstly, we employ large language models to automatically refine textual descriptions associated with shapes. Secondly, we propose a quantitative metric to assess text-to-shape coherence, through cross-attention mechanisms. To validate our approach, we conduct a user study and compare quantitatively our metric with existing ones. The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark that we publicly release to foster research on text-to-shape coherence of text-conditioned 3D generative models. Benchmark available at https://cvlab-unibo.github.io/CrossCoherence-Web/.

Summary
Text-conditioned 3D object generation and manipulation have progressed rapidly, but evaluating the coherence between generated 3D shapes and input textual descriptions still lacks a clear benchmark. There are two reasons: a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs; b) the limited effectiveness of the metrics used to assess such coherence. In this paper, we propose a comprehensive solution that addresses both problems. First, we use large language models to automatically refine the textual descriptions associated with shapes. Second, we propose a new metric that assesses text-to-shape coherence through cross-attention mechanisms. To validate our approach, we conduct a user study and compare our metric quantitatively with existing ones. The refined dataset, the new metric, and a set of text-shape pairs validated by the user study form a new fine-grained benchmark, which we publicly release to foster research on text-to-shape coherence of text-conditioned 3D generative models. The benchmark is available at https://cvlab-unibo.github.io/CrossCoherence-Web/.

ALWOD: Active Learning for Weakly-Supervised Object Detection

paper_url: http://arxiv.org/abs/2309.07914
repo_url: None
paper_authors: Yuting Wang, Velibor Ilic, Jiatong Li, Branislav Kisacanin, Vladimir Pavlovic
for: To address the lack of large training datasets with precise object localization labels in object detection (OD).
methods: ALWOD fuses active learning (AL) with weakly and semi-supervised object detection paradigms, and includes a new auxiliary image generator strategy and a new AL acquisition function.
results: Across several challenging benchmarks, ALWOD significantly narrows the gap between ODs trained on few partially labeled but strategically selected image instances and those that rely on fully-labeled data.

Abstract
Object detection (OD), a crucial vision task, remains challenged by the lack of large training datasets with precise object localization labels. In this work, we propose ALWOD, a new framework that addresses this problem by fusing active learning (AL) with weakly and semi-supervised object detection paradigms. Because the performance of AL critically depends on the model initialization, we propose a new auxiliary image generator strategy that utilizes an extremely small labeled set, coupled with a large weakly tagged set of images, as a warm-start for AL. We then propose a new AL acquisition function, another critical factor in AL success, that leverages the student-teacher OD pair disagreement and uncertainty to effectively propose the most informative images to annotate. Finally, to complete the AL loop, we introduce a new labeling task delegated to human annotators, based on selection and correction of model-proposed detections, which is both rapid and effective in labeling the informative images. We demonstrate, across several challenging benchmarks, that ALWOD significantly narrows the gap between the ODs trained on few partially labeled but strategically selected image instances and those that rely on the fully-labeled data. Our code is publicly available on https://github.com/seqam-lab/ALWOD.

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

paper_url: http://arxiv.org/abs/2309.07911
repo_url: https://github.com/alibaba-mmai-research/dist
paper_authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
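for: This paper addresses a weakness of existing video recognition methods: when large-scale pre-trained language-image models such as CLIP are transferred to video recognition, their temporal modeling capability is limited.
methods: DiST uses a dual-encoder structure in which a pre-trained foundation model serves as the spatial encoder and a lightweight network serves as the temporal encoder. An integration branch between the two encoders fuses spatio-temporal information.
results: Experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods while remaining efficient. When pre-trained on the large-scale Kinetics-710, DiST reaches 89.7% on Kinetics-400 with a frozen ViT-L model.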

Abstract
Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial contents, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling capabilities. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder, and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra network for integration benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Codes and models can be found in https://github.com/alibaba-mmai-research/DiST.

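Summary
In recent years, large-scale pre-trained language-image models such as CLIP have shown outstanding ability to understand spatial content, but transferring them to video recognition still suffers from unsatisfactory temporal modeling. Existing methods insert tunable structures into, or in parallel with, the pre-trained model. This either requires back-propagation through the whole pre-trained model, which is resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. We present DiST, which disentangles the learning of the spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure in which a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. Disentangled learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters. We also show empirically that disentangled learning with an extra integration network benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by clear margins. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Code and models are available at https://github.com/alibaba-mmai-research/DiST.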

TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting

paper_url: http://arxiv.org/abs/2309.07910
repo_url: None
paper_authors: Rohan Choudhury, Kris Kitani, Laszlo A. Jeni
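for: Efficient multi-view 3D human pose estimation, tracking, and forecasting.
methods: The model recurrently computes per-person 2D pose features and fuses spatial and temporal information into a single representation, improving pose accuracy while reducing computation.
results: On the CMU Panoptic Studio dataset, TEMPO achieves 10% better MPJPE than TesseTrack with a 33× improvement in FPS.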

Abstract
Existing volumetric methods for predicting 3D human pose estimation are accurate, but computationally expensive and optimized for single time-step prediction. We present TEMPO, an efficient multi-view pose estimation model that learns a robust spatiotemporal representation, improving pose accuracy while also tracking and forecasting human pose. We significantly reduce computation compared to the state-of-the-art by recurrently computing per-person 2D pose features, fusing both spatial and temporal information into a single representation. In doing so, our model is able to use spatiotemporal context to predict more accurate human poses without sacrificing efficiency. We further use this representation to track human poses over time as well as predict future poses. Finally, we demonstrate that our model is able to generalize across datasets without scene-specific fine-tuning. TEMPO achieves 10% better MPJPE with a 33× improvement in FPS compared to TesseTrack on the challenging CMU Panoptic Studio dataset.

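Summary
Existing volumetric methods for 3D human pose estimation are accurate, but they are computationally expensive and optimized for single time-step prediction. We present TEMPO, an efficient multi-view pose estimation model that learns a robust spatiotemporal representation, improving pose accuracy while also tracking and forecasting human pose. We significantly reduce computation compared to the state of the art by recurrently computing per-person 2D pose features and fusing spatial and temporal information into a single representation. This lets the model use spatiotemporal context to predict more accurate poses without sacrificing efficiency. We further use this representation to track human poses over time and to predict future poses. Finally, we show that the model generalizes across datasets without scene-specific fine-tuning. On the CMU Panoptic Studio dataset, TEMPO achieves 10% better MPJPE with a 33× improvement in FPS compared to TesseTrack.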
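
For readers unfamiliar with the metric quoted for TEMPO: MPJPE (mean per-joint position error) is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joints, so lower is better. The following is a minimal sketch of that common definition, not TEMPO's evaluation code; the function name `mpjpe` and the toy joint coordinates are illustrative assumptions.

```python
import math

def mpjpe(pred, gt):
    """Average Euclidean distance between corresponding 3D joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = sum(math.dist(p, g) for p, g in zip(pred, gt))
    return total / len(pred)

# Toy skeleton with two joints (hypothetical values, same unit for both).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = mpjpe(pred, gt)  # distances are 5 and 0, so the mean is 2.5
```

Under this definition, a "10% better MPJPE" means the average joint error is 10% smaller.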

Physically Plausible Full-Body Hand-Object Interaction Synthesis

paper_url: http://arxiv.org/abs/2309.07907
repo_url: https://github.com/eth-ait/phys-fullbody-grasp
paper_authors: Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, Otmar Hilliges
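for: This work proposes a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting.
methods: The method uses reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Skill priors for body and hand movements are first learned in a decoupled setting; a high-level policy then controls hand-object interactions to satisfy the task objectives.
results: The method completes the full interaction task, from approaching an object to grasping and subsequent manipulation, and produces more physically plausible motions than kinematics-based baselines.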

Abstract
We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. While recent advancements have addressed specific facets of human-object interactions, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts. In contrast, our proposed method embraces reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Through a hierarchical framework, we first learn skill priors for both body and hand movements in a decoupled setting. The generic skill priors learn to decode a latent skill embedding into the motion of the underlying part. A high-level policy then controls hand-object interactions in these pretrained latent spaces, guided by task objectives of grasping and 3D target trajectory following. It is trained using a novel reward function that combines an adversarial style term with a task reward, encouraging natural motions while fulfilling the task incentives. Our method successfully accomplishes the complete interaction task, from approaching an object to grasping and subsequent manipulation. We compare our approach against kinematics-based baselines and show that it leads to more physically plausible motions.

Summary
We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. Although recent advances have addressed specific aspects of human-object interaction, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may produce artifacts. In contrast, our method combines reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Through a hierarchical framework, we first learn skill priors for body and hand movements in a decoupled setting. The generic skill priors learn to decode a latent skill embedding into the motion of the underlying part. A high-level policy then controls hand-object interactions in these pretrained latent spaces, guided by the task objectives of grasping and 3D target trajectory following. It is trained with a novel reward function that combines an adversarial style term with a task reward, encouraging natural motions while fulfilling the task incentives. Our method accomplishes the complete interaction task, from approaching an object to grasping and subsequent manipulation. Compared with kinematics-based baselines, it produces more physically plausible motions.

Generative Image Dynamics

paper_url: http://arxiv.org/abs/2309.07906
repo_url: https://github.com/generative-dynamics/generative-dynamics.github.io
paper_authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski
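for: This paper presents an approach to modeling an image-space prior on scene dynamics.
methods: The prior is learned from motion trajectories extracted from real video sequences. Given a single image, a frequency-coordinated diffusion sampling process predicts a per-pixel long-term motion representation in the Fourier domain, called a neural stochastic motion texture.
results: The representation can be converted into dense motion trajectories spanning an entire video, which support downstream applications such as turning still images into seamlessly looping dynamic videos or letting users realistically interact with objects in real pictures.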

Abstract
We present an approach to modeling an image-space prior on scene dynamics. Our prior is learned from a collection of motion trajectories extracted from real video sequences containing natural, oscillating motion such as trees, flowers, candles, and clothes blowing in the wind. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation in the Fourier domain, which we call a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping dynamic videos, or allowing users to realistically interact with objects in real pictures.

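Summary
We present an approach to modeling an image-space prior on scene dynamics. The prior is learned from motion trajectories extracted from real video sequences that contain natural oscillating motion, such as trees, flowers, candles, and clothes blowing in the wind. Given a single image, the trained model uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation, which we call a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video and, combined with an image-based rendering module, can turn still images into seamlessly looping dynamic videos or let users interact realistically with objects in real pictures.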
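
The abstract states that the per-pixel motion representation lives in the Fourier domain and can be converted into motion trajectories. A minimal sketch of such a conversion for one pixel and one axis follows, assuming the representation is a small set of complex coefficients indexed by integer frequency and that the trajectory is the real part of the truncated Fourier sum. The function names, the coefficient layout, and the normalization are illustrative assumptions, not the authors' implementation.

```python
import cmath
import math

def displacement(coeffs, t, num_frames):
    """Displacement of one pixel along one axis at frame t.

    coeffs maps an integer frequency index to a complex amplitude
    (hypothetical layout, chosen for illustration).
    """
    total = 0.0
    for freq, amplitude in coeffs.items():
        phase = 2.0 * math.pi * freq * t / num_frames
        # Real part of amplitude * exp(i * phase).
        total += (amplitude * cmath.exp(1j * phase)).real
    return total

def trajectory(coeffs, num_frames):
    """Displacements for frames 0 .. num_frames - 1."""
    return [displacement(coeffs, t, num_frames) for t in range(num_frames)]

# One oscillation over 4 frames with amplitude 2.
coeffs = {1: 2 + 0j}
path = trajectory(coeffs, 4)
```

With integer frequencies the sum is periodic in `num_frames`, so in this toy setup the displacement at frame `num_frames` equals the displacement at frame 0.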

HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

paper_url: http://arxiv.org/abs/2309.07891
repo_url: None
paper_authors: Hongsuk Choi, Nikhil Chavan-Dafle, Jiacheng Yuan, Volkan Isler, Hyunsoo Park
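for: This paper learns a hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image.
methods: The paper designs a generalizable implicit function, HandNeRF, that encodes the correlation between 3D hand shape features and 2D object features to predict hand and object scene geometry.
results: Experiments on real-world datasets show that HandNeRF reconstructs hand-object scenes of novel grasp configurations more accurately than comparable methods, and that its object reconstruction enables more accurate execution of downstream tasks such as grasping for robotic hand-over.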

Abstract
This paper presents a method to learn hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. The inference as well as training-data generation for 3D hand-object scene reconstruction is challenging due to the depth ambiguity of a single image and occlusions by the hand and object. We turn this challenge into an opportunity by utilizing the hand shape to constrain the possible relative configuration of the hand and object geometry. We design a generalizable implicit function, HandNeRF, that explicitly encodes the correlation of the 3D hand shape features and 2D object features to predict the hand and object scene geometry. With experiments on real-world datasets, we show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods. Moreover, we demonstrate that object reconstruction from HandNeRF ensures more accurate execution of a downstream task, such as grasping for robotic hand-over.

A Novel Local-Global Feature Fusion Framework for Body-weight Exercise Recognition with Pressure Mapping Sensors

paper_url: http://arxiv.org/abs/2309.07888
repo_url: None
paper_authors: Davinder Pal Singh, Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
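for: This work proposes a novel local-global feature fusion framework for body-weight exercise recognition from floor-based dynamic pressure maps.
methods: The framework combines local and global features using image processing techniques and YOLO object detection, and applies knowledge distillation as regularization to improve recognition performance.
results: Experiments show a notable 11 percent improvement in F1 score for exercise recognition while preserving label-specific features.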

Abstract
We present a novel local-global feature fusion framework for body-weight exercise recognition with floor-based dynamic pressure maps. One step further from the existing studies using deep neural networks mainly focusing on global feature extraction, the proposed framework aims to combine local and global features using image processing techniques and the YOLO object detection to localize pressure profiles from different body parts and consider physical constraints. The proposed local feature extraction method generates two sets of high-level local features consisting of cropped pressure mapping and numerical features such as angular orientation, location on the mat, and pressure area. In addition, we adopt a knowledge distillation for regularization to preserve the knowledge of the global feature extraction and improve the performance of the exercise recognition. Our experimental results demonstrate a notable 11 percent improvement in F1 score for exercise recognition while preserving label-specific features.

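Summary
We present a novel local-global feature fusion framework for body-weight exercise recognition with floor-based dynamic pressure maps. Going one step beyond existing studies, which mainly use deep neural networks for global feature extraction, the framework combines local and global features using image processing techniques and YOLO object detection to localize pressure profiles from different body parts and to consider physical constraints. The local feature extraction method generates two sets of high-level local features: cropped pressure maps and numerical features such as angular orientation, location on the mat, and pressure area. In addition, we adopt knowledge distillation as regularization to preserve the knowledge of the global feature extraction and improve recognition performance. Our experiments show a notable 11 percent improvement in F1 score for exercise recognition while preserving label-specific features.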

mEBAL2 Database and Benchmark: Image-based Multispectral Eyeblink Detection

paper_url: http://arxiv.org/abs/2309.07880
repo_url: None
paper_authors: Roberto Daza, Aythami Morales, Julian Fierrez, Ruben Tolosana, Ruben Vera-Rodriguez
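for: This paper introduces a new multispectral database and novel approaches for eyeblink detection, aiming to improve data-driven blink detection.
methods: mEBAL2 uses multiple sensors, two Near-Infrared (NIR) cameras and one RGB camera, to record students' facial gestures while they complete e-learning tasks of varying difficulty. An Electroencephalogram (EEG) band additionally captures the user's cognitive activity and blinking events.
results: The paper proposes a Convolutional Neural Network (CNN) architecture as a benchmark for blink detection and implements training methodologies using the RGB spectrum, the NIR spectrum, and the combination of both. Combining NIR and RGB images during training improves the performance of RGB eyeblink detectors. The detectors' generalization is also validated in wilder, more challenging environments.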

Abstract
This work introduces a new multispectral database and novel approaches for eyeblink detection in RGB and Near-Infrared (NIR) individual images. Our contributed dataset (mEBAL2, multimodal Eye Blink and Attention Level estimation, Version 2) is the largest existing eyeblink database, representing a great opportunity to improve data-driven multispectral approaches for blink detection and related applications (e.g., attention level estimation and presentation attack detection in face biometrics). mEBAL2 includes 21,100 image sequences from 180 different students (more than 2 million labeled images in total) while conducting a number of e-learning tasks of varying difficulty or taking a real course on HTML initiation through the edX MOOC platform. mEBAL2 uses multiple sensors, including two Near-Infrared (NIR) and one RGB camera to capture facial gestures during the execution of the tasks, as well as an Electroencephalogram (EEG) band to get the cognitive activity of the user and blinking events. Furthermore, this work proposes a Convolutional Neural Network architecture as benchmark for blink detection on mEBAL2 with performances up to 97%. Different training methodologies are implemented using the RGB spectrum, NIR spectrum, and the combination of both to enhance the performance on existing eyeblink detectors. We demonstrate that combining NIR and RGB images during training improves the performance of RGB eyeblink detectors (i.e., detection based only on a RGB image). Finally, the generalization capacity of the proposed eyeblink detectors is validated in wilder and more challenging environments like the HUST-LEBW dataset to show the usefulness of mEBAL2 to train a new generation of data-driven approaches for eyeblink detection.

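Summary
This work introduces a new multispectral database and novel approaches for eyeblink detection in individual RGB and Near-Infrared (NIR) images. Our dataset, mEBAL2, is the largest existing eyeblink database and offers a strong opportunity to improve data-driven multispectral approaches for blink detection and related applications, such as attention level estimation and presentation attack detection in face biometrics. mEBAL2 includes 21,100 image sequences from 180 different students (more than 2 million labeled images in total), recorded while they completed e-learning tasks of varying difficulty or took a real introductory HTML course on the edX MOOC platform. mEBAL2 uses multiple sensors, including two NIR cameras and one RGB camera to capture facial gestures during the tasks, and an Electroencephalogram (EEG) band to capture the user's cognitive activity and blinking events. This work also proposes a Convolutional Neural Network architecture as a benchmark for blink detection on mEBAL2, with performance up to 97%. Different training methodologies use the RGB spectrum, the NIR spectrum, and the combination of both to improve existing eyeblink detectors. We show that combining NIR and RGB images during training improves the performance of RGB eyeblink detectors, that is, detectors that use only an RGB image. Finally, we validate the generalization of the proposed detectors in wilder and more challenging environments, such as the HUST-LEBW dataset, to show that mEBAL2 is useful for training a new generation of data-driven eyeblink detectors.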

Using network metrics to explore the community structure that underlies movement patterns

paper_url: http://arxiv.org/abs/2309.07878
repo_url: None
paper_authors: Anh Pham Thi Minh, Abhishek Kumar Singh, Soumya Snigdha Kundu
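for: To explore the community structure of Santiago de Chile by analyzing the movement patterns of its residents.
methods: A movement network is built from the approximate home and work locations of anonymized residents, and communities are identified with modularity optimization algorithms and clustering techniques.
results: Combining community detection algorithms with segregation tools provides new insights into the complex geography of segregation during working hours.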

Abstract
This work aims to explore the community structure of Santiago de Chile by analyzing the movement patterns of its residents. We use a dataset containing the approximate locations of home and work places for a subset of anonymized residents to construct a network that represents the movement patterns within the city. Through the analysis of this network, we aim to identify the communities or sub-cities that exist within Santiago de Chile and gain insights into the factors that drive the spatial organization of the city. We employ modularity optimization algorithms and clustering techniques to identify the communities within the network. Our results show that combining community detection algorithms with segregation tools provides new insights that further the understanding of the complex geography of segregation during working hours.

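Summary
This work explores the community structure of Santiago de Chile by analyzing the movement patterns of its residents. We use a dataset containing the approximate home and work locations of a subset of anonymized residents to construct a network that represents movement patterns within the city. By analyzing this network, we aim to identify the communities or sub-cities within Santiago de Chile and to understand the factors that drive the city's spatial organization. We use modularity optimization algorithms and clustering techniques to identify communities within the network. Our results show that combining community detection algorithms with segregation tools offers new perspectives that deepen the understanding of the city's complex geography of segregation during working hours.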
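
The abstract mentions modularity optimization without giving the objective. As a point of reference, the sketch below computes the standard modularity score Q for a given partition of an unweighted, undirected graph; optimization algorithms search for the partition that maximizes it. The paper's home-work network may well be weighted or directed, so the graph model, the function name `modularity`, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    Q = sum over communities of (intra_edges / m) - (degree_sum / (2m))^2.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees in the community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge (toy stand-in for two districts).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}
q_split = modularity(edges, split)    # 5/14, about 0.357
q_merged = modularity(edges, merged)  # 0.0
```

The split partition scores higher than the single-community partition, which is the signal a modularity optimizer follows when it separates the two triangles.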

Gradient constrained sharpness-aware prompt learning for vision-language models

paper_url: http://arxiv.org/abs/2309.07866
repo_url: None
paper_authors: Liangchen Liu, Nannan Wang, Dawei Zhou, Xinbo Gao, Decheng Liu, Xi Yang, Tongliang Liu
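for: To improve the performance of vision-language models on unseen classes while maintaining performance on seen classes.
methods: A Sharpness-aware Minimization (SAM) based method that dynamically constrains the optimizing gradient so that optimization stays relevant to both loss value and loss sharpness.
results: Extensive experiments verify the effectiveness of GCSCoOp on this trade-off problem.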

Abstract
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM), i.e., improving the performance on unseen classes while maintaining the performance on seen classes. Comparing with existing generalizable methods that neglect the seen classes degradation, the setting of this problem is more strict and fits more closely with practical applications. To solve this problem, we start from the optimization perspective, and leverage the relationship between loss landscape geometry and model generalization ability. By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness, while each of them is indispensable. However, we find the optimizing gradient of existing methods cannot maintain high relevance to both loss value and loss sharpness during optimization, which severely affects their trade-off performance. To this end, we propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), to dynamically constrain the optimizing gradient, thus achieving above two-fold optimization objective simultaneously. Extensive experiments verify the effectiveness of GCSCoOp in the trade-off problem.

Summary
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM): improving performance on unseen classes while maintaining performance on seen classes. We compare the loss landscapes of the state-of-the-art method and a vanilla Sharpness-aware Minimization (SAM) based method, and find that the trade-off performance correlates with both loss value and loss sharpness, each of which is indispensable. However, the optimizing gradient of existing methods cannot maintain high relevance to both during optimization, which severely affects their trade-off performance. To address this, we propose a novel SAM-based method for prompt learning, Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), which dynamically constrains the optimizing gradient to achieve both objectives simultaneously. Extensive experiments demonstrate the effectiveness of GCSCoOp on the trade-off problem.
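
The abstract builds on vanilla Sharpness-aware Minimization (SAM) but does not spell out its update. A minimal sketch of one vanilla SAM step on a toy quadratic loss follows: the weights are first perturbed along the normalized gradient, and the descent step then uses the gradient taken at the perturbed point. This does not implement the gradient constraint of GCSCoOp, which the abstract does not specify; the toy loss and the values of `lr` and `rho` are illustrative assumptions.

```python
import math

def loss(w):
    # Toy loss, chosen only for illustration.
    return w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    # Analytic gradient of the toy loss above.
    return [2.0 * w[0], 20.0 * w[1]]

def sam_step(w, lr=0.02, rho=0.05):
    """One vanilla SAM update: ascend to a nearby point, descend from there."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid division by zero
    # Perturb the weights by rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Use the gradient at the perturbed point to update the original weights.
    g_sharp = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_sharp)]

start = [1.0, 1.0]
w = list(start)
for _ in range(20):
    w = sam_step(w)
```

Because the descent direction is evaluated at the perturbed point, the update is sensitive to how quickly the loss rises around the current weights, which is the sharpness notion the abstract refers to.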

TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation

paper_url: http://arxiv.org/abs/2309.07849
repo_url: None
paper_authors: Rong Li, ShiJie Li, Xieyuanli Chen, Teli Ma, Wang Hao, Juergen Gall, Junwei Liang
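for: To improve the accuracy and robustness of LiDAR semantic segmentation so that autonomous vehicles and robots can better understand their surroundings.
methods: A range-image-based LiDAR semantic segmentation method that uses temporal information to address the "many-to-one" problem. A temporal fusion layer extracts useful information from previous scans and integrates it with the current scan, and a max-voting-based post-processing technique corrects false predictions.
results: The approach is evaluated on two benchmarks, and the post-processing technique is shown to be generic and applicable to various networks.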

Abstract
LiDAR semantic segmentation plays a crucial role in enabling autonomous driving and robots to understand their surroundings accurately and robustly. There are different types of methods, such as point-based, range-image-based, polar-based, and hybrid methods. Among these, range-image-based methods are widely used due to their efficiency. However, they face a significant challenge known as the "many-to-one" problem caused by the range image's limited horizontal and vertical angular resolution. As a result, around 20% of the 3D points can be occluded. In this paper, we present TFNet, a range-image-based LiDAR semantic segmentation method that utilizes temporal information to address this issue. Specifically, we incorporate a temporal fusion layer to extract useful information from previous scans and integrate it with the current scan. We then design a max-voting-based post-processing technique to correct false predictions, particularly those caused by the "many-to-one" issue. We evaluated the approach on two benchmarks and demonstrate that the post-processing technique is generic and can be applied to various networks. We will release our code and models.

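Summary
LiDAR semantic segmentation plays a key role in enabling autonomous driving and robots to understand their surroundings accurately and robustly. Methods include point-based, range-image-based, polar-based, and hybrid approaches. Range-image-based methods are widely used because of their efficiency, but they face the "many-to-one" problem: the limited horizontal and vertical angular resolution of the range image causes around 20% of the 3D points to be occluded. In this paper, we present TFNet, a range-image-based LiDAR semantic segmentation method that uses temporal information to address this problem. Specifically, we add a temporal fusion layer that extracts useful information from previous scans and integrates it with the current scan. We then design a max-voting-based post-processing technique to correct false predictions, particularly those caused by the "many-to-one" problem. We evaluate the approach on two benchmarks and show that the post-processing technique can be applied to various networks. We will release our code and models.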
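
The abstract names a max-voting-based post-processing step but gives no details. One plausible reading is a per-point majority vote over the labels predicted for the same 3D point in the current and previous scans. The sketch below illustrates only that reading: it assumes the points are already associated across scans, and the function name `max_vote`, the input layout, and the tie-breaking rule are hypothetical, not TFNet's actual procedure.

```python
from collections import Counter

def max_vote(label_history):
    """Fuse per-point labels from several scans by majority vote.

    label_history: list of label lists, current scan first, one label per
    point, with points assumed to be aligned across scans.
    Ties go to the label seen first, that is, the most recent scan.
    """
    fused = []
    for labels in zip(*label_history):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

history = [
    ["car", "road", "tree"],  # current scan
    ["car", "car", "tree"],   # previous scan
    ["road", "car", "tree"],  # scan before that
]
fused = max_vote(history)  # ["car", "car", "tree"]
```

In this toy example the second point is labeled "road" in the current scan, for instance because a nearer point hid it in the range image, and the vote over earlier scans corrects it to "car".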

MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems

paper_url: http://arxiv.org/abs/2309.07846
repo_url: https://github.com/IN2-ViAUn/MC-NeRF
paper_authors: Yu Gao, Lutong Su, Hao Liang, Yufeng Yue, Yi Yang, Mengyin Fu
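for: 3D scene representation with NeRF from multi-view images captured by multi-camera systems.
methods: MC-NeRF jointly optimizes intrinsic and extrinsic camera parameters, supported by a theoretical analysis of the degenerate case and coupling issue, an efficient calibration image acquisition scheme, and a global end-to-end network.
results: Experiments show that MC-NeRF achieves 3D scene representation from up to 110 images with 110 different intrinsic and extrinsic parameters, without initial poses.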

Abstract
Neural Radiance Fields (NeRF) employ multi-view images for 3D scene representation and have shown remarkable performance. As one of the primary sources of multi-view images, multi-camera systems encounter challenges such as varying intrinsic parameters and frequent pose changes. Most previous NeRF-based methods assume a global unique camera and seldom consider scenarios with multiple cameras. Besides, some pose-robust methods remain susceptible to suboptimal solutions when poses are poorly initialized. In this paper, we propose MC-NeRF, a method that can jointly optimize both intrinsic and extrinsic parameters for bundle-adjusting Neural Radiance Fields. Firstly, we conduct a theoretical analysis to tackle the degenerate case and coupling issue that arise from the joint optimization between intrinsic and extrinsic parameters. Secondly, based on the proposed solutions, we introduce an efficient calibration image acquisition scheme for multi-camera systems, including the design of the calibration object. Lastly, we present a global end-to-end network with a training sequence that enables the regression of intrinsic and extrinsic parameters, along with the rendering network. Moreover, since most existing datasets are designed for a unique camera, we create a new dataset that includes four different styles of multi-camera acquisition systems, allowing readers to generate custom datasets. Experiments confirm the effectiveness of our method when each image corresponds to different camera parameters. Specifically, we adopt up to 110 images with 110 different intrinsic and extrinsic parameters to achieve 3D scene representation without providing initial poses. The code and supplementary materials are available at https://in2-viaun.github.io/MC-NeRF.

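Summary
Neural Radiance Fields (NeRF) represent 3D scenes from multi-view images with remarkable quality. Multi-camera systems bring varying intrinsic parameters and frequent pose changes, yet most earlier NeRF-based methods assume a single global camera, and pose-robust methods still degrade when poses are poorly initialized. MC-NeRF jointly optimizes intrinsic and extrinsic parameters for bundle-adjusting NeRF. The authors first analyze the degenerate case and the coupling issue that joint optimization creates, then propose an efficient calibration image acquisition scheme (including the design of the calibration object), and finally present a global end-to-end network that regresses the camera parameters together with the rendering network. They also release a new dataset covering four styles of multi-camera acquisition systems. Experiments use up to 110 images, each with its own intrinsic and extrinsic parameters, and achieve 3D scene representation without initial poses. Code and supplementary materials are available at https://in2-viaun.github.io/MC-NeRF.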
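
The joint optimization described above can be made concrete with a minimal pinhole-camera sketch. It is not the MC-NeRF implementation; the function names (`rotation_z`, `project`, `reprojection_error`) and all numeric values are assumptions chosen for illustration. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the extrinsics (`R`, `t`) enter the same projection, so some configurations cannot tell them apart. The example shows one well-known ambiguity, a fronto-parallel planar target for which doubling both the focal length and the distance leaves every pixel unchanged. This may or may not coincide with the degenerate case analyzed in the paper.

```python
import math


def rotation_z(angle_rad):
    """Rotation matrix about the z-axis, as row-major nested lists."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    Extrinsics (R, t) map world to camera coordinates: X_cam = R X_world + t.
    Intrinsics (fx, fy, cx, cy) map camera coordinates to pixels.
    """
    x_cam = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    u = fx * x_cam[0] / x_cam[2] + cx
    v = fy * x_cam[1] / x_cam[2] + cy
    return (u, v)


def reprojection_error(points_world, pixels, R, t, fx, fy, cx, cy):
    """Mean squared pixel distance between observed and projected points."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_world, pixels):
        u, v = project(point, R, t, fx, fy, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(points_world)


# Planar target in the world plane z = 0, facing the camera (illustrative values).
R = rotation_z(0.0)
plane_points = [(0.1, 0.2, 0.0), (-0.3, 0.1, 0.0), (0.25, -0.15, 0.0)]

# Camera A: focal length 500 px at distance 2. Camera B: focal length 1000 px at distance 4.
observed = [project(p, R, (0.0, 0.0, 2.0), 500.0, 500.0, 320.0, 240.0) for p in plane_points]
error_b = reprojection_error(plane_points, observed, R, (0.0, 0.0, 4.0), 1000.0, 1000.0, 320.0, 240.0)

# Camera B explains camera A's observations equally well, so focal length and
# depth cannot be separated from this target alone.
print("reprojection error of camera B on camera A's pixels:", error_b)
```

A non-planar calibration object, or views from clearly different orientations, breaks this particular ambiguity, which is one reason the design of the calibration target matters.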

Decomposition of linear tensor transformations

paper_url: http://arxiv.org/abs/2309.07819
repo_url: None
paper_authors: Claudio Turchetti
for: Develops a mathematical framework for exact tensor decomposition, representing a tensor as the sum of a finite number of low-rank tensors.
methods: Contrasts the common approach (finding a low-dimensional subspace by solving an optimization problem with a fixed number of components, which often converges to poor local minima and is sensitive to outliers and noise) with an exact decomposition framework.
results: Derives three decompositions: of a non-negative self-adjoint tensor operator, of a linear tensor transformation, and of a generic tensor.

Abstract
One of the main issues in computing a tensor decomposition is how to choose the number of rank-one components, since there is no finite algorithm for determining the rank of a tensor. A commonly used approach for this purpose is to find a low-dimensional subspace by solving an optimization problem and assuming the number of components is fixed. However, even though this algorithm is efficient and easy to implement, it often converges to poor local minima and suffers from outliers and noise. The aim of this paper is to develop a mathematical framework for exact tensor decomposition that is able to represent a tensor as the sum of a finite number of low-rank tensors. The paper addresses three problems, deriving: i) the decomposition of a non-negative self-adjoint tensor operator; ii) the decomposition of a linear tensor transformation; iii) the decomposition of a generic tensor.

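Summary
A central difficulty in computing a tensor decomposition is choosing the number of rank-one components, because no finite algorithm determines the rank of a tensor. A common workaround fixes the number of components and solves an optimization problem to find a low-dimensional subspace. Although efficient and easy to implement, it often converges to poor local minima and is sensitive to outliers and noise. This paper develops a mathematical framework for exact decomposition, representing a tensor as the sum of a finite number of low-rank tensors. It treats three problems: 1. the decomposition of a non-negative self-adjoint tensor operator; 2. the decomposition of a linear tensor transformation; 3. the decomposition of a generic tensor.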
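
To illustrate what "a sum of a finite number of low-rank terms" means in the simplest setting, the sketch below decomposes a symmetric 2x2 matrix (an order-2 tensor) into rank-one terms through its eigenpairs. This is the classical spectral theorem, offered as an analogy for the non-negative self-adjoint case; it is not the construction derived in the paper, and the helper names (`rank_one_terms`, `reconstruct`) are hypothetical.

```python
import math


def rank_one_terms(a, b, d):
    """Eigenpairs (lam, v) of the symmetric matrix [[a, b], [b, d]].

    The matrix equals the sum over the pairs of lam * outer(v, v),
    where each v is a unit vector.
    """
    if b == 0.0:
        # Already diagonal: the coordinate axes are eigenvectors.
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    terms = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (A - lam I) v = 0 when b != 0.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        terms.append((lam, (vx / norm, vy / norm)))
    return terms


def reconstruct(terms):
    """Sum the rank-one matrices lam * outer(v, v)."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for lam, v in terms:
        for i in range(2):
            for j in range(2):
                total[i][j] += lam * v[i] * v[j]
    return total


# A non-negative (positive semidefinite) symmetric example.
terms = rank_one_terms(2.0, 1.0, 2.0)
for lam, v in terms:
    print("eigenvalue:", lam, "direction:", v)
print("rebuilt matrix:", reconstruct(terms))
```

For a non-negative operator every eigenvalue is at least zero, so each term is itself a non-negative rank-one operator and the number of terms is known exactly. For higher-order tensors no such closed-form route exists in general, which is the gap the paper addresses.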

For A More Comprehensive Evaluation of 6DoF Object Pose Tracking

paper_url: http://arxiv.org/abs/2309.07796
repo_url: None
paper_authors: Yang Li, Fan Zhong, Xin Wang, Shuangbing Song, Jiachen Li, Xueying Qin, Changhe Tu
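for: Provides a unified benchmark that addresses the evaluation problems in 6DoF object pose tracking.
methods: Proposes a multi-view multi-object global pose refinement method that jointly refines the poses of all objects and view cameras, reaching sub-pixel, sub-millimeter alignment errors, and introduces improved scoring methods and error metrics.
results: Experiments on a realistic semi-synthesized dataset validate the precision and reliability of the refinement method. Benchmark results on the YCBV and BCOT base datasets unify learning and non-learning, RGB and RGBD methods, and include findings not reported in earlier studies.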

Abstract
Previous evaluations on 6DoF object pose tracking have presented obvious limitations along with the development of this area. In particular, the evaluation protocols are not unified for different methods, the widely-used YCBV dataset contains significant annotation error, and the error metrics also may be biased. As a result, it is hard to fairly compare the methods, which has become a big obstacle for developing new algorithms. In this paper we contribute a unified benchmark to address the above problems. For more accurate annotation of YCBV, we propose a multi-view multi-object global pose refinement method, which can jointly refine the poses of all objects and view cameras, resulting in sub-pixel sub-millimeter alignment errors. The limitations of previous scoring methods and error metrics are analyzed, based on which we introduce our improved evaluation methods. The unified benchmark takes both YCBV and BCOT as base datasets, which are shown to be complementary in scene categories. In experiments, we validate the precision and reliability of the proposed global pose refinement method with a realistic semi-synthesized dataset particularly for YCBV, and then present the benchmark results unifying learning & non-learning and RGB & RGBD methods, with some findings not discovered in previous studies.

Summary
Earlier evaluations of 6DoF object pose tracking have clear limitations: evaluation protocols are not unified across methods, the widely used YCBV dataset contains significant annotation errors, and the error metrics may be biased. Fair comparison is therefore difficult, which hinders the development of new algorithms. The paper contributes a unified benchmark to address these problems. It proposes a multi-view multi-object global pose refinement method that jointly refines the poses of all objects and view cameras, reaching sub-pixel, sub-millimeter alignment errors, and it introduces improved evaluation methods based on an analysis of earlier scoring methods and error metrics. The benchmark uses YCBV and BCOT as base datasets, which are complementary in scene categories. Experiments validate the refinement method on a realistic semi-synthesized dataset built for YCBV and report benchmark results that unify learning and non-learning, RGB and RGBD methods, including findings not reported in earlier studies.

Virchow: A Million-Slide Digital Pathology Foundation Model

paper_url: http://arxiv.org/abs/2309.07778
repo_url: None
paper_authors: Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Millar, Matthew Hanna, Juan Retamero, William A. Moye, Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Rothrock, Thomas J. Fuchs
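for: Advances artificial intelligence for computational pathology, enabling precision medicine and decision support systems, particularly for cancer diagnosis and treatment.
methods: Introduces Virchow, a 632-million-parameter deep neural network foundation model trained with self-supervised learning on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue groups.
results: Virchow outperforms state-of-the-art systems on tile-level pan-cancer detection and subtyping and on slide-level biomarker prediction. It reaches 93% balanced accuracy for pan-cancer tile classification and AUCs of 0.983 for colon microsatellite instability status prediction and 0.967 for breast CDH1 status prediction. The gains suggest that pretraining on even larger pathology image datasets could improve performance further.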

Abstract
Computational pathology uses artificial intelligence to enable precision medicine and decision support systems through the analysis of whole slide images. It has the potential to revolutionize the diagnosis and treatment of cancer. However, a major challenge to this objective is that for many specific computational pathology tasks the amount of data is inadequate for development. To address this challenge, we created Virchow, a 632 million parameter deep neural network foundation model for computational pathology. Using self-supervised learning, Virchow is trained on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue groups, which is orders of magnitude more data than previous works. When evaluated on downstream tasks including tile-level pan-cancer detection and subtyping and slide-level biomarker prediction, Virchow outperforms state-of-the-art systems both on internal datasets drawn from the same population as the pretraining data as well as external public datasets. Virchow achieves 93% balanced accuracy for pancancer tile classification, and AUCs of 0.983 for colon microsatellite instability status prediction and 0.967 for breast CDH1 status prediction. The gains in performance highlight the importance of pretraining on massive pathology image datasets, suggesting pretraining on even larger datasets could continue improving performance for many high-impact applications where limited amounts of training data are available, such as drug outcome prediction.

Summary
Computational pathology uses artificial intelligence to enable precision medicine and decision support systems through the analysis of whole slide images, and it could transform cancer diagnosis and treatment. A major obstacle is that many specific tasks lack enough data for model development. To address this, the authors built Virchow, a 632-million-parameter deep neural network foundation model trained with self-supervised learning on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue groups, orders of magnitude more data than earlier work. On downstream tasks, including tile-level pan-cancer detection and subtyping and slide-level biomarker prediction, Virchow outperforms state-of-the-art systems on both internal and external datasets. It reaches 93% balanced accuracy for pan-cancer tile classification and AUCs of 0.983 for colon microsatellite instability status and 0.967 for breast CDH1 status. The gains highlight the value of pretraining on massive pathology image datasets and suggest that even larger datasets could keep improving performance in high-impact applications with limited training data.
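
The headline metric here is balanced accuracy, which matters because tumor tiles are typically much rarer than benign ones. The sketch below uses the standard definition (the mean of per-class recall); it is an illustration with made-up labels, not the paper's evaluation code, and the function name `balanced_accuracy` is an assumption.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Made-up, imbalanced example: 2 positive tiles, 8 negative tiles.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print("plain accuracy:", plain_accuracy)
print("balanced accuracy:", balanced_accuracy(y_true, always_negative))
```

A classifier that always predicts the majority class scores well on plain accuracy but only at chance level on balanced accuracy, which is why the latter is the more informative figure for imbalanced tile classification.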

Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion

paper_url: http://arxiv.org/abs/2309.07753
repo_url: None
paper_authors: Peiran Xu, Yadong Mu
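for: Co-salient object detection (CoSOD), which aims to highlight the common salient object in each image of a group.
methods: Uses a hierarchical Transformer module to extract semantic-level consensus, giving a more comprehensive representation of the common object category and excluding interference from objects that only share local similarities with the target. A Transformer-based dispersion module accounts for the variation of the co-salient object across scenes and distributes the consensus to the image feature maps in an image-specific way while exploiting interactions within the group.
results: Evaluated on three commonly used CoSOD datasets, the method achieves state-of-the-art performance.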

Abstract
Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. There are two factors closely related to the success of this task, namely consensus extraction, and the dispersion of consensus to each image. Most previous works represent the group consensus using local features, while we instead utilize a hierarchical Transformer module for extracting semantic-level consensus. Therefore, it can obtain a more comprehensive representation of the common object category, and exclude interference from other objects that share local similarities with the target object. In addition, we propose a Transformer-based dispersion module that takes into account the variation of the co-salient object in different scenes. It distributes the consensus to the image feature maps in an image-specific way while making full use of interactions within the group. These two modules are integrated with a ViT encoder and an FPN-like decoder to form an end-to-end trainable network, without additional branch and auxiliary loss. The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance.

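Summary
Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. Success depends on two factors: extracting the consensus, and dispersing that consensus to each image. Most earlier methods represent the group consensus with local features, whereas this work uses a hierarchical Transformer module to extract semantic-level consensus. The result is a more comprehensive representation of the common object category that excludes interference from objects sharing only local similarities with the target. A Transformer-based dispersion module accounts for the variation of the co-salient object across scenes and distributes the consensus to the image feature maps in an image-specific way while exploiting interactions within the group. The two modules are combined with a ViT encoder and an FPN-like decoder into an end-to-end trainable network that needs no additional branch or auxiliary loss. The method achieves state-of-the-art performance on three commonly used CoSOD datasets.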

DT-NeRF: Decomposed Triplane-Hash Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

paper_url: http://arxiv.org/abs/2309.07752
repo_url: None
paper_authors: Yaoyu Su, Shaohui Wang, Haoqian Wang
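for: Improves the photorealistic rendering of talking faces.
methods: Decomposes the facial region into two specialized triplanes (one for the mouth, one for the broader facial features), integrates audio features as residual terms through an audio-mouth-face transformer, and uses Neural Radiance Fields (NeRF) with additive volumetric rendering.
results: Achieves state-of-the-art results on key evaluation datasets for talking portrait synthesis.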

Abstract
In this paper, we present the decomposed triplane-hash neural radiance fields (DT-NeRF), a framework that significantly improves the photorealistic rendering of talking faces and achieves state-of-the-art results on key evaluation datasets. Our architecture decomposes the facial region into two specialized triplanes: one specialized for representing the mouth, and the other for the broader facial features. We introduce audio features as residual terms and integrate them as query vectors into our model through an audio-mouth-face transformer. Additionally, our method leverages the capabilities of Neural Radiance Fields (NeRF) to enrich the volumetric representation of the entire face through additive volumetric rendering techniques. Comprehensive experimental evaluations corroborate the effectiveness and superiority of our proposed approach.

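Summary
The paper presents decomposed triplane-hash neural radiance fields (DT-NeRF), a framework that significantly improves the photorealistic rendering of talking faces and achieves state-of-the-art results on key evaluation datasets. The architecture decomposes the facial region into two specialized triplanes: one for the mouth and one for the broader facial features. Audio features are introduced as residual terms and integrated as query vectors through an audio-mouth-face transformer. The method also uses Neural Radiance Fields (NeRF) to enrich the volumetric representation of the entire face through additive volumetric rendering. Comprehensive experiments support the effectiveness of the approach.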

OmnimatteRF: Robust Omnimatte with 3D Background Modeling

paper_url: http://arxiv.org/abs/2309.07749
repo_url: None
paper_authors: Geng Lin, Chen Gao, Jia-Bin Huang, Changil Kim, Yipeng Wang, Matthias Zwicker, Ayush Saraf
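for: Proposes a new video matting method that handles the more complicated scenes found in real-world videos.
methods: OmnimatteRF combines dynamic 2D foreground layers with a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs the scene.
results: Extensive experiments show that OmnimatteRF reconstructs scenes with better quality on a variety of videos.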

Abstract
Video matting has broad applications, from adding interesting effects to casually captured movies to assisting video production professionals. Matting with associated effects such as shadows and reflections has also attracted increasing research activity, and methods like Omnimatte have been proposed to separate dynamic foreground objects of interest into their own layers. However, prior works represent video backgrounds as 2D image layers, limiting their capacity to express more complicated scenes, thus hindering application to real-world videos. In this paper, we propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos. Extensive experiments demonstrate that our method reconstructs scenes with better quality on various videos.

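Summary
Video matting has broad applications, from adding effects to casually captured movies to assisting video production professionals. Matting with associated effects such as shadows and reflections has attracted growing research activity, and methods like Omnimatte separate dynamic foreground objects of interest into their own layers. Earlier work, however, represents video backgrounds as 2D image layers, which limits its capacity to express complicated scenes and hinders application to real-world videos. The paper proposes OmnimatteRF, a video matting method that combines dynamic 2D foreground layers with a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos. Extensive experiments show better reconstruction quality on a variety of videos.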

Dataset Condensation via Generative Model

paper_url: http://arxiv.org/abs/2309.07698
repo_url: None
paper_authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou
for: Scales dataset condensation, which compresses a large training set into a small one, to large datasets with diverse classes.
methods: Condenses the dataset into a generative model instead of pixels, so the size of the condensed representation stays relatively stable as the number of classes or the image resolution grows. An intra-class loss and an inter-class loss keep the condensed samples diverse and discriminative.
results: Comparisons with state-of-the-art methods and ablation studies confirm the effectiveness of the method and its individual components. To the authors' best knowledge, this is the first successful condensation on ImageNet-1k.

Abstract
Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual component. To our best knowledge, we are the first to successfully conduct condensation on ImageNet-1k.
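
The two losses are described only in words, so the sketch below gives one plausible reading on plain feature vectors. The exact formulas are assumptions, not the paper's definitions: the intra-class term is written as the negative mean pairwise distance within a class (minimizing it pushes samples apart), and the inter-class term as a hinge on class-center distances with a hypothetical `margin` parameter.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_center(features):
    """Component-wise mean of the feature vectors of one class."""
    dim = len(features[0])
    return [sum(f[k] for f in features) / len(features) for k in range(dim)]


def intra_class_loss(features):
    """Negative mean pairwise distance; lower means more diverse samples."""
    n = len(features)
    if n < 2:
        return 0.0
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += euclidean(features[i], features[j])
            count += 1
    return -total / count


def inter_class_loss(features_by_class, margin=1.0):
    """Hinge penalty on pairs of class centers closer than `margin`."""
    centers = [class_center(f) for f in features_by_class.values()]
    total, count = 0.0, 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            total += max(0.0, margin - euclidean(centers[i], centers[j]))
            count += 1
    return total / count if count else 0.0


# Made-up 2D features for two classes.
features_by_class = {
    "cat": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    "dog": [(0.6, 0.0), (0.8, 0.2)],
}
total_loss = inter_class_loss(features_by_class) + sum(
    intra_class_loss(f) for f in features_by_class.values()
)
print("combined illustrative loss:", total_loss)
```

In a real training loop these terms would be computed on features of generated samples and minimized with respect to the generator's parameters; how the terms are weighted against the main condensation objective is not stated in the abstract.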

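Summary
Dataset condensation compresses a large training set into a small one. Pixel-based condensation optimizes slowly and its parameter count grows with image resolution and class count, so it does not scale to large, diverse datasets, and it ignores the relations among condensed samples. The paper instead condenses the dataset into a generative model, whose size stays relatively stable, and adds intra-class and inter-class losses to make the condensed samples diverse and discriminative. The authors report the first successful condensation on ImageNet-1k.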

CoRF : Colorizing Radiance Fields using Knowledge Distillation

paper_url: http://arxiv.org/abs/2309.07668
repo_url: None
paper_authors: Ankit Dhiman, R Srinath, Srinjay Sarkar, Lokesh R Boregowda, R Venkatesh Babu
for: Synthesizing colorized novel views from input grey-scale multi-view images.
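methods: A distillation-based method that transfers color knowledge from 2D colorization networks trained on natural images to a radiance field network, which serves as the 3D representation.
results: Produces better colorized novel views than the baselines for indoor and outdoor scenes while maintaining cross-view consistency.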

Abstract
Neural radiance field (NeRF) based methods enable high-quality novel-view synthesis for multi-view images. This work presents a method for synthesizing colorized novel views from input grey-scale multi-view images. When we apply image or video-based colorization methods on the generated grey-scale novel views, we observe artifacts due to inconsistency across views. Training a radiance field network on the colorized grey-scale image sequence also does not solve the 3D consistency issue. We propose a distillation based method to transfer color knowledge from the colorization networks trained on natural images to the radiance field network. Specifically, our method uses the radiance field network as a 3D representation and transfers knowledge from existing 2D colorization methods. The experimental results demonstrate that the proposed method produces superior colorized novel views for indoor and outdoor scenes while maintaining cross-view consistency than baselines. Further, we show the efficacy of our method on applications like colorization of radiance field network trained from 1.) Infra-Red (IR) multi-view images and 2.) Old grey-scale multi-view image sequences.

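Summary
Neural radiance field (NeRF) based methods enable high-quality novel-view synthesis from multi-view images. This work synthesizes colorized novel views from grey-scale multi-view input. Applying image- or video-based colorization to the generated grey-scale novel views produces artifacts because the results are inconsistent across views, and training a radiance field network on a colorized grey-scale image sequence does not resolve the 3D consistency issue either. The authors propose a distillation-based method that transfers color knowledge from colorization networks trained on natural images to the radiance field network, which acts as the 3D representation. Experiments show better colorized novel views than the baselines for indoor and outdoor scenes while keeping cross-view consistency. The method is also demonstrated on radiance fields trained from 1) infra-red (IR) multi-view images and 2) old grey-scale multi-view image sequences.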

Towards Robust and Unconstrained Full Range of Rotation Head Pose Estimation

paper_url: http://arxiv.org/abs/2309.07654
repo_url: https://github.com/thohemp/6drepnet360
paper_authors: Thorsten Hempel, Ahmed A. Abdelrahman, Ayoub Al-Hamadi
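for: Head pose estimation, a crucial problem for many applications that is still mostly treated as a subtask of frontal pose prediction.
methods: Proposes an unconstrained end-to-end method for full-range-of-rotation head pose prediction. It introduces the rotation matrix formalism for the ground truth data and a continuous 6D rotation matrix representation for efficient and robust direct regression, combined with newly accumulated full-rotation training data and a geodesic loss.
results: An extensive evaluation on public datasets shows that the method significantly outperforms other state-of-the-art methods efficiently and robustly, and its extended prediction range widens the application area. Training and testing code and trained models are open-sourced at https://github.com/thohemp/6DRepNet360.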

Abstract
Estimating the head pose of a person is a crucial problem for numerous applications that is yet mainly addressed as a subtask of frontal pose prediction. We present a novel method for unconstrained end-to-end head pose estimation to tackle the challenging task of full range of orientation head pose prediction. We address the issue of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This allows to efficiently learn full rotation appearance and to overcome the limitations of the current state-of-the-art. Together with new accumulated training data that provides full head pose rotation data and a geodesic loss approach for stable learning, we design an advanced model that is able to predict an extended range of head orientations. An extensive evaluation on public datasets demonstrates that our method significantly outperforms other state-of-the-art methods in an efficient and robust manner, while its advanced prediction range allows the expansion of the application area. We open-source our training and testing code along with our trained models: https://github.com/thohemp/6DRepNet360.

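Summary
Estimating a person's head pose is a crucial problem for many applications, yet it is still mainly addressed as a subtask of frontal pose prediction. The paper presents an unconstrained end-to-end method for the challenging task of full-range-of-orientation head pose prediction. To resolve ambiguous rotation labels, it introduces the rotation matrix formalism for the ground truth data and a continuous 6D rotation matrix representation for efficient and robust direct regression, which makes it possible to learn full rotation appearance and overcome the limitations of the current state of the art. Together with newly accumulated training data that covers full head rotation and a geodesic loss for stable learning, the model predicts an extended range of head orientations. An extensive evaluation on public datasets shows that it significantly outperforms other state-of-the-art methods, and the extended prediction range widens the application area. Training and testing code and trained models are available at https://github.com/thohemp/6DRepNet360.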
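
The continuous 6D representation and the geodesic loss can be sketched in a few lines. The code below follows the commonly used Gram-Schmidt construction for 6D rotation representations and the standard angular distance between rotation matrices. It is a minimal illustration, not the 6DRepNet360 code; the function names and the column convention (the three orthonormal vectors are stored as matrix columns) are assumptions.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(six):
    """Map an unconstrained 6D vector to a rotation matrix via Gram-Schmidt.

    The first three numbers give b1 after normalization, the last three give
    b2 after removing their component along b1, and b3 = b1 x b2.
    """
    a1, a2 = list(six[:3]), list(six[3:])
    b1 = normalize(a1)
    along_b1 = dot(b1, a2)
    b2 = normalize([a2[i] - along_b1 * b1[i] for i in range(3)])
    b3 = cross(b1, b2)
    # Store b1, b2, b3 as the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def geodesic_distance(R1, R2):
    """Rotation angle in radians of R1^T R2, a value in [0, pi]."""
    trace = sum(R1[k][i] * R2[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


# A raw network output does not need to be orthonormal.
predicted = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
# Target: a 90-degree rotation about the z-axis.
target = rotation_from_6d([0.0, 1.0, 0.0, -1.0, 0.0, 0.0])
print("angular error (degrees):", math.degrees(geodesic_distance(predicted, target)))
```

Because every non-degenerate 6D vector maps to a valid rotation, a network can regress the six numbers directly, and the geodesic distance gives a loss that measures the actual angle between prediction and target instead of a difference of Euler angles, which wrap around and become ambiguous at large rotations.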

Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement

paper_url: http://arxiv.org/abs/2309.07640
repo_url: None
paper_authors: Sheng Ye, Yubin Hu, Matthieu Lin, Yu-Hui Wen, Wang Zhao, Wenping Wang, Yong-Jin Liu
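for: Reconstructs indoor scenes from multi-view RGB images, including regions with fine-grained detail.
methods: Proposes a hybrid architecture that represents low-frequency and high-frequency regions separately, and enhances the normal priors with an image sharpening and denoising technique plus a network that estimates the pixel-wise uncertainty of the predicted surface normals.
results: Earlier methods based on neural radiance fields and predicted normal priors give complete, smooth floors and walls but struggle with high-frequency structures. On the benchmark datasets, the proposed method significantly outperforms them in reconstruction quality.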

Abstract
The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method significantly outperforms existing methods in terms of reconstruction quality.

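Summary
Reconstructing indoor scenes from multi-view RGB images is challenging because flat, texture-less regions coexist with delicate, fine-grained ones. Recent methods use neural radiance fields aided by predicted surface normal priors to recover scene geometry. They produce complete and smooth results for floors and walls but struggle with complex surfaces that have high-frequency structures, because the neural representation is inadequate and the predicted normal priors are inaccurate. To increase the capacity of the implicit representation, the paper proposes a hybrid architecture that represents low-frequency and high-frequency regions separately. To improve the normal priors, it introduces a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normals. Identifying this uncertainty keeps the model from being misled by unreliable normal supervision that would hinder the reconstruction of intricate geometry. Experiments on benchmark datasets show significantly better reconstruction quality than existing methods.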

SwitchGPT: Adapting Large Language Models for Non-Text Outputs

paper_url: http://arxiv.org/abs/2309.07623
repo_url: None
paper_authors: Xinyu Wang, Bohan Zhuang, Qi Wu
for: This paper aims to bridge the gap between large language models (LLMs) and modality conversion models, such as text-to-image, by proposing an approach that adapts LLMs to comprehend requests for non-text responses.
methods: The proposed approach, SwitchGPT, uses a minimal dataset to instruct LLMs to recognize the intended output modality as directed by the instructions. The adapted LLM can then call various off-the-shelf modality conversion models from the model zoos to generate non-text responses.
results: The experiment results show that, with minimal training, LLMs can be conveniently adapted to comprehend requests for non-text responses, thus achieving higher flexibility in multi-modal scenarios.

Abstract
Large Language Models (LLMs), primarily trained on text-based datasets, exhibit exceptional proficiencies in understanding and executing complex linguistic instructions via text outputs. However, they falter when asked to generate non-text outputs. Concurrently, modality conversion models, such as text-to-image, despite generating high-quality images, suffer from a lack of extensive textual pretraining. As a result, these models are only capable of accommodating specific image descriptions rather than comprehending more complex instructions. To bridge this gap, we propose a novel approach, SwitchGPT, from a modality conversion perspective that evolves a text-based LLM into a multi-modal one. We specifically employ a minimal dataset to instruct LLMs to recognize the intended output modality as directed by the instructions. Consequently, the adapted LLM can effectively summon various off-the-shelf modality conversion models from the model zoos to generate non-text responses. This circumvents the necessity for complicated pretraining that typically requires immense quantities of paired multi-modal data, while simultaneously inheriting the extensive knowledge of LLMs and the ability of high-quality generative models. To evaluate and compare the adapted multi-modal LLM with its traditional counterparts, we have constructed a multi-modal instruction benchmark that solicits diverse modality outputs. The experiment results reveal that, with minimal training, LLMs can be conveniently adapted to comprehend requests for non-text responses, thus achieving higher flexibility in multi-modal scenarios. Code and data will be made available at https://github.com/xinke-wang/SwitchGPT.

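Summary
Large language models (LLMs), trained mainly on text, excel at understanding and executing complex linguistic instructions through text outputs but falter when asked to produce non-text outputs. Modality conversion models such as text-to-image generate high-quality images but lack extensive textual pretraining, so they handle specific image descriptions and not more complex instructions. To bridge this gap, the paper proposes SwitchGPT, which evolves a text-based LLM into a multi-modal one from a modality conversion perspective. A minimal dataset teaches the LLM to recognize the output modality an instruction calls for, after which it can call off-the-shelf modality conversion models from the model zoos to generate non-text responses. This avoids complicated pretraining on large amounts of paired multi-modal data while inheriting both the knowledge of LLMs and the quality of existing generative models. The authors also build a multi-modal instruction benchmark that solicits diverse output modalities. Experiments show that, with minimal training, LLMs can be adapted to comprehend requests for non-text responses and gain flexibility in multi-modal scenarios. Code and data will be made available at https://github.com/xinke-wang/SwitchGPT.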
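
The dispatch step described above, where the adapted LLM names an output modality and an off-the-shelf converter produces the response, can be sketched as a small router. Everything in it is hypothetical: the `[modality] prompt` tag format, the function names, and the stub converters are assumptions for illustration and do not reflect the actual SwitchGPT interface.

```python
import re

# Hypothetical response format: "[image] a red bicycle on a beach".
TAG_PATTERN = re.compile(r"^\[(\w+)\]\s*(.*)$", re.DOTALL)


def parse_response(llm_output):
    """Split an LLM response into (modality, prompt); untagged text is 'text'."""
    stripped = llm_output.strip()
    match = TAG_PATTERN.match(stripped)
    if match is None:
        return ("text", stripped)
    return (match.group(1).lower(), match.group(2))


def dispatch(llm_output, converters):
    """Route the prompt to the converter registered for its modality."""
    modality, prompt = parse_response(llm_output)
    if modality == "text":
        return prompt
    if modality not in converters:
        raise KeyError("no converter registered for modality: " + modality)
    return converters[modality](prompt)


# Stub converters standing in for real text-to-image or text-to-speech models.
converters = {
    "image": lambda prompt: "<image generated from: " + prompt + ">",
    "speech": lambda prompt: "<audio generated from: " + prompt + ">",
}

print(dispatch("[image] a red bicycle on a beach", converters))
print(dispatch("The capital of France is Paris.", converters))
```

Keeping the converters in a plain dictionary reflects the design idea in the abstract: new modalities are added by registering another off-the-shelf model, and the LLM itself only has to learn which tag to emit.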

Road Disease Detection based on Latent Domain Background Feature Separation and Suppression

paper_url: http://arxiv.org/abs/2309.07616
repo_url: None
paper_authors: Juwu Zheng, Jiangtao Ren
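for: Proposes a new Latent Domain Background Feature Separation and Suppression (LDBFSS) network that reduces the influence of background information and improves the accuracy of road disease detection.
methods: The LDBFSS network consists of a latent domain discovery module, a domain adversarial learning module, and a contrastive learning module. It suppresses background information without domain supervision and strengthens object feature representation through contrastive learning. It is combined with the YOLOv5 model.
results: Compared with the best competing model, the LDBFSS network improves results by nearly 4% on the GRDDC dataset and by 4.6% on the CNRDD dataset.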

Abstract
Road disease detection is challenging due to the small proportion of road damage in the target region and the diverse background, which introduce a lot of domain information. Besides, disease categories are highly similar, which makes detection more difficult. In this paper, we propose a new LDBFSS (Latent Domain Background Feature Separation and Suppression) network, which performs background information separation and suppression without domain supervision, together with contrastive enhancement of object features. We combine our LDBFSS network with the YOLOv5 model to enhance disease features for better road disease detection. As the components of the LDBFSS network, we first design a latent domain discovery module and a domain adversarial learning module to obtain pseudo domain labels through an unsupervised method, guiding the domain discriminator and the model to train adversarially to suppress background information. In addition, we introduce a contrastive learning module and design a k-instance contrastive loss, which optimizes the disease feature representation by increasing the inter-class distance and reducing the intra-class distance for object features. We conducted experiments on two road disease detection datasets, GRDDC and CNRDD, and compared with other models. The results show an increase of nearly 4% on the GRDDC dataset over the best competing model, and an increase of 4.6% on the CNRDD dataset. Experimental results prove the effectiveness and superiority of our model.

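Summary
Road disease detection is difficult because damage occupies a small proportion of the target region, diverse backgrounds introduce a lot of domain information, and disease categories are highly similar. The paper proposes the LDBFSS (Latent Domain Background Feature Separation and Suppression) network, which separates and suppresses background information without domain supervision and enhances object features through contrastive learning. It is combined with the YOLOv5 model to strengthen disease features. A latent domain discovery module and a domain adversarial learning module obtain pseudo domain labels in an unsupervised way and guide the domain discriminator and the model to train adversarially, suppressing background information. A contrastive learning module with a k-instance contrastive loss optimizes the disease feature representation by increasing the inter-class distance and reducing the intra-class distance. On the GRDDC and CNRDD road disease datasets, the model improves on the best competing model by nearly 4% and by 4.6% respectively.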

Learning Quasi-Static 3D Models of Markerless Deformable Linear Objects for Bimanual Robotic Manipulation

paper_url: http://arxiv.org/abs/2309.07609
repo_url: https://github.com/ppi-put/neural_dlo_model
paper_authors: Piotr Kicki, Michał Bidziński, Krzysztof Walas
for: This paper focuses on the robotic manipulation of deformable linear objects (DLOs) and proposes a new learning-based 3D model based on the Transformer architecture to achieve superior accuracy.
methods: The paper analyzes several learning-based 3D models of DLOs, proposes a new one based on the Transformer architecture with a scaling method for DLOs of different lengths, and introduces a data augmentation technique that improves the prediction performance of almost all considered models.
results: The proposed model achieves superior accuracy on several challenging datasets, even on DLOs of different lengths, and demonstrates its applicability in the task of shaping a DLO.

Abstract
The robotic manipulation of Deformable Linear Objects (DLOs) is a vital and challenging task that is important in many practical applications. Classical model-based approaches to this problem require an accurate model to capture how robot motions affect the deformation of the DLO. Nowadays, data-driven models offer the best tradeoff between quality and computation time. This paper analyzes several learning-based 3D models of the DLO and proposes a new one based on the Transformer architecture that achieves superior accuracy, even on the DLOs of different lengths, thanks to the proposed scaling method. Moreover, we introduce a data augmentation technique, which improves the prediction performance of almost all considered DLO data-driven models. Thanks to this technique, even a simple Multilayer Perceptron (MLP) achieves close to state-of-the-art performance while being significantly faster to evaluate. In the experiments, we compare the performance of the learning-based 3D models of the DLO on several challenging datasets quantitatively and demonstrate their applicability in the task of shaping a DLO.

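Summary
Robotic manipulation of Deformable Linear Objects (DLOs) is a vital and challenging task with many practical applications. Classical model-based approaches need an accurate model of how robot motions affect the deformation of the DLO, while data-driven models now offer the best tradeoff between quality and computation time. The paper analyzes several learning-based 3D models of the DLO and proposes a new Transformer-based model that, thanks to a proposed scaling method, achieves superior accuracy even on DLOs of different lengths. It also introduces a data augmentation technique that improves the prediction performance of almost all the data-driven models considered. With this technique, even a simple Multilayer Perceptron (MLP) comes close to state-of-the-art performance while being significantly faster to evaluate. Experiments compare the models quantitatively on several challenging datasets and demonstrate their use in the task of shaping a DLO.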

Universality of underlying mechanism for successful deep learning

paper_url: http://arxiv.org/abs/2309.07537
repo_url: None
paper_authors: Yuval Meir, Yarden Tzach, Shiri Hodassman, Ofek Tevet, Ido Kanter
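for: Verifies the universality of a recently proposed underlying mechanism for successful deep learning.
methods: Uses a quantitative measure of the quality of a single filter in each layer. Each filter identifies small clusters of possible output labels, with additional noise from labels outside the clusters. This feature sharpens progressively with the layers, raising the signal-to-noise ratio (SNR) and the accuracy.
results: The mechanism is verified for VGG-16 and EfficientNet-B0 trained on CIFAR-100 and ImageNet. Accuracy increases progressively with the layers while the noise per filter typically decreases, the maximal error rate grows approximately linearly with the number of output labels, and the filter cluster statistics at the last convolutional layer are almost independent of the number of labels.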

Abstract
An underlying mechanism for successful deep learning (DL) with a limited deep architecture and dataset, namely VGG-16 on CIFAR-10, was recently presented based on a quantitative method to measure the quality of a single filter in each layer. In this method, each filter identifies small clusters of possible output labels, with additional noise selected as labels out of the clusters. This feature is progressively sharpened with the layers, resulting in an enhanced signal-to-noise ratio (SNR) and higher accuracy. In this study, the suggested universal mechanism is verified for VGG-16 and EfficientNet-B0 trained on the CIFAR-100 and ImageNet datasets with the following main results. First, the accuracy progressively increases with the layers, whereas the noise per filter typically progressively decreases. Second, for a given deep architecture, the maximal error rate increases approximately linearly with the number of output labels. Third, the average filter cluster size and the number of clusters per filter at the last convolutional layer adjacent to the output layer are almost independent of the number of dataset labels in the range [3, 1,000], while a high SNR is preserved. The presented DL mechanism suggests several techniques, such as applying filter's cluster connections (AFCC), to improve the computational complexity and accuracy of deep architectures and furthermore pinpoints the simplification of pre-existing structures while maintaining their accuracies.

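Summary
An underlying mechanism for successful deep learning (DL) was recently presented for VGG-16 on CIFAR-10, based on a quantitative method that measures the quality of a single filter in each layer. Each filter identifies small clusters of possible output labels, with labels outside the clusters acting as noise. This feature is progressively sharpened with the layers, which raises the signal-to-noise ratio (SNR) and the accuracy. In this study the suggested universal mechanism is verified for VGG-16 and EfficientNet-B0 on the CIFAR-100 and ImageNet datasets, with three main results: 1. The accuracy progressively increases with the layers, while the noise per filter typically decreases. 2. For a given deep architecture, the maximal error rate increases approximately linearly with the number of output labels. 3. The average filter cluster size and the number of clusters per filter at the last convolutional layer are almost independent of the number of dataset labels in the range [3, 1,000], while a high SNR is preserved. The mechanism suggests techniques such as applying filter's cluster connections (AFCC) to improve the computational complexity and accuracy of deep architectures, and points to simplifying pre-existing structures while maintaining their accuracy.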

Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach

paper_url: http://arxiv.org/abs/2309.07944
repo_url: None
paper_authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie
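for: The paper aims to generate Counterfactual Explanations (CEs), that is, to change a classifier's prediction for a given image by modifying the fewest necessary features.
methods: The proposed method, TIME, is a distillation-based black-box counterfactual technique that needs no access to the classifier's structure, parameters, or gradients. TIME first introduces two biases into Stable Diffusion as textual embeddings: a context bias tied to the image's structure, and a class bias tied to the class-specific features learned by the target classifier. After learning these biases, it finds the optimal latent code using the classifier's predicted class token and regenerates the image conditioned on the target embedding, producing the counterfactual explanation.
results: TIME generates explanations of effectiveness comparable to previous methods even in a black-box setting.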

Abstract
This paper addresses the challenge of generating Counterfactual Explanations (CEs), involving the identification and modification of the fewest necessary features to alter a classifier's prediction for a given image. Our proposed method, Text-to-Image Models for Counterfactual Explanations (TIME), is a black-box counterfactual technique based on distillation. Unlike previous methods, this approach requires solely the image and its prediction, omitting the need for the classifier's structure, parameters, or gradients. Before generating the counterfactuals, TIME introduces two distinct biases into Stable Diffusion in the form of textual embeddings: the context bias, associated with the image's structure, and the class bias, linked to class-specific features learned by the target classifier. After learning these biases, we find the optimal latent code applying the classifier's predicted class token and regenerate the image using the target embedding as conditioning, producing the counterfactual explanation. Extensive empirical studies validate that TIME can generate explanations of comparable effectiveness even when operating within a black-box setting.

Summary
Before generating the counterfactuals, TIME introduces two distinct biases into Stable Diffusion in the form of textual embeddings: the context bias, associated with the image's structure, and the class bias, linked to class-specific features learned by the target classifier. After learning these biases, we find the optimal latent code by applying the classifier's predicted class token and regenerate the image using the target embedding as conditioning, producing the counterfactual explanation. Empirical studies have shown that TIME can generate explanations of comparable effectiveness even when operating within a black-box setting.

paper_url: http://arxiv.org/abs/2309.07524
repo_url: None
paper_authors: Yujie Feng, Yin Yang, Xiaohong Fan, Zhengpeng Zhang, Jianping Zhang
for: This work aims to improve the quality of remote sensing images, which degrades because of limitations in sensor technology and complex imaging environments.
methods: A novel blind deblurring learning framework is proposed that alternates iterations of shrinkage thresholds, updating blurring kernels and images in turn, with a theoretical foundation for the network design. A learnable blur kernel proximal mapping module and a deep proximal mapping module in the image domain are also proposed.
results: Experiments show that the MGSTNet framework outperforms existing deblurring methods on remote sensing image datasets.

Abstract
Remote sensing images are essential for many earth science applications, but their quality can be degraded due to limitations in sensor technology and complex imaging environments. To address this, various remote sensing image deblurring methods have been developed to restore sharp, high-quality images from degraded observational data. However, most traditional model-based deblurring methods usually require predefined hand-craft prior assumptions, which are difficult to handle in complex applications, and most deep learning-based deblurring methods are designed as a black box, lacking transparency and interpretability. In this work, we propose a novel blind deblurring learning framework based on alternating iterations of shrinkage thresholds, alternately updating blurring kernels and images, with the theoretical foundation of network design. Additionally, we propose a learnable blur kernel proximal mapping module to improve the blur kernel evaluation in the kernel domain. Then, we proposed a deep proximal mapping module in the image domain, which combines a generalized shrinkage threshold operator and a multi-scale prior feature extraction block. This module also introduces an attention mechanism to adaptively adjust the prior importance, thus avoiding the drawbacks of hand-crafted image prior terms. Thus, a novel multi-scale generalized shrinkage threshold network (MGSTNet) is designed to specifically focus on learning deep geometric prior features to enhance image restoration. Experiments demonstrate the superiority of our MGSTNet framework on remote sensing image datasets compared to existing deblurring methods.

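Summary
Remote sensing images are essential for many earth science applications, but their quality can be degraded by limitations in sensor technology and by the imaging environment. Many deblurring methods have been developed to restore high-quality images. However, most traditional model-based methods rely on hand-crafted prior assumptions that are hard to apply in complex scenarios, while most deep learning-based deblurring methods are black boxes that lack transparency and interpretability. We propose a blind deblurring learning framework based on alternating iterations, grounded in a theory of network design. We also propose a learnable blur kernel proximal mapping module to improve the estimation of the blur kernel. We then propose a deep proximal mapping module that combines a generalized shrinkage threshold operator with a multi-scale prior feature extraction block. This module introduces an attention mechanism that adaptively adjusts the importance of the priors, avoiding the drawbacks of hand-crafted image prior terms. The resulting multi-scale generalized shrinkage threshold network (MGSTNet) focuses on learning deep geometric prior features to improve image restoration. Experiments show that the MGSTNet framework performs better than existing deblurring methods on remote sensing image datasets.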
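
The shrinkage threshold step mentioned above is presumably related to the classical soft-thresholding operator used in proximal and ISTA-style methods. The following is a minimal sketch of that classical base case only, not of the paper's generalized, learnable operator, which the abstract does not specify. The function names `soft_threshold` and `shrink` are hypothetical.

```python
def soft_threshold(x: float, tau: float) -> float:
    """Classical soft-thresholding: move x toward zero by tau, clipping at zero."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def shrink(values, tau):
    """Apply soft-thresholding element-wise to a sequence of coefficients."""
    return [soft_threshold(v, tau) for v in values]


if __name__ == "__main__":
    # Small coefficients are zeroed; large ones are shrunk by tau.
    print(shrink([3.0, -0.5, -2.0], 1.0))
```

A learned variant would typically make `tau` (and possibly the shape of the nonlinearity) a trainable parameter per layer or per scale; that is an assumption about the general family, not a description of MGSTNet.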

Dhan-Shomadhan: A Dataset of Rice Leaf Disease Classification for Bangladeshi Local Rice

paper_url: http://arxiv.org/abs/2309.07515
repo_url: None
paper_authors: Md. Fahad Hossain
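for: The paper provides an image dataset of rice leaf diseases for research and applications in computer vision and image recognition.
methods: The dataset contains 1106 images of five harmful rice diseases, collected under two background variations: field background and white background.
results: The dataset can be used for rice leaf disease classification and for disease detection with computer vision and pattern recognition.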

Abstract
This dataset represents almost all the harmful diseases for rice in Bangladesh. It consists of 1106 images of five harmful diseases, namely Brown Spot, Leaf Scald, Rice Blast, Rice Tungro, and Sheath Blight, in two background variations: field background and white background. The two background variations help the dataset perform more accurately, so that the data can be used in the field as well as, with a white background, for decision making. The data were collected from rice fields in the Dhaka Division. This dataset can be used for rice leaf disease classification and for disease detection using computer vision and pattern recognition.

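Summary
This dataset represents most of the harmful rice diseases in Bangladesh. It contains 1106 images of five harmful diseases (Brown Spot, Leaf Scald, Rice Blast, Rice Tungro, and Sheath Blight) in two background variations: field background and white background. The two background variations help the dataset support more accurate classification, so that users can apply the data in the field and for decision making. The data were collected from rice fields in the Dhaka Division. The dataset can be used for rice leaf disease classification and for disease detection with computer vision and pattern recognition.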

paper_url: http://arxiv.org/abs/2309.07513
repo_url: None
paper_authors: Gregor Koehler, Tassilo Wald, Constantin Ulrich, David Zimmerer, Paul F. Jaeger, Jörg K. H. Franke, Simon Kohl, Fabian Isensee, Klaus H. Maier-Hein
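for: To let neural networks refine their decisions by revisiting an initial decision from different angles, improving decision quality.
methods: RecycleNet, a latent feature recycling method, feeds outputs back into earlier network layers over a number of recycling steps so that the network can iteratively refine its initial decision.
results: Evaluated on medical image segmentation, RecycleNet gives consistent improvements across a variety of segmentation benchmarks, even compared with top-performing segmentation methods.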

Abstract
Despite the remarkable success of deep learning systems over the last decade, a key difference still remains between neural network and human decision-making: As humans, we can not only form a decision on the spot, but also ponder, revisiting an initial guess from different angles, distilling relevant information, and arriving at a better decision. Here, we propose RecycleNet, a latent feature recycling method, instilling the pondering capability for neural networks to refine initial decisions over a number of recycling steps, where outputs are fed back into earlier network layers in an iterative fashion. This approach makes minimal assumptions about the neural network architecture and thus can be implemented in a wide variety of contexts. Using medical image segmentation as the evaluation environment, we show that latent feature recycling enables the network to iteratively refine initial predictions even beyond the iterations seen during training, converging towards an improved decision. We evaluate this across a variety of segmentation benchmarks and show consistent improvements even compared with top-performing segmentation methods. This allows trading increased computation time for improved performance, which can be beneficial, especially for safety-critical applications.

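Summary
Despite the remarkable success of deep learning systems over the last decade, a difference remains between human and neural network decision-making: humans can not only decide on the spot but also ponder, revisiting a decision from different angles to arrive at a better one. To give neural networks this pondering capability, we propose RecycleNet, a latent feature recycling method that lets the network refine its initial decision over a number of recycling steps. The method makes minimal assumptions about the network architecture and can therefore be implemented in many contexts. Using medical image segmentation as the evaluation environment, we show that latent feature recycling lets the network keep refining its initial predictions even beyond the number of recycling steps seen during training, and improves performance across a variety of segmentation benchmarks. Increased computation time can thus be traded for better performance, which can be beneficial in safety-critical applications.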
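
The recycling loop described above can be illustrated with a toy control-flow sketch. Everything here is an assumption made for illustration: the names `recycle`, `encode`, and `decode` are hypothetical, the "network" is a pair of scalar functions, and the first pass uses a zero placeholder for the missing previous output. The real method operates on latent feature maps inside a segmentation network.

```python
from typing import Callable, List


def recycle(
    encode: Callable[[float, float], float],
    decode: Callable[[float], float],
    x: float,
    steps: int,
) -> List[float]:
    """Run `steps` passes; each pass feeds the previous output back into the encoder."""
    output = 0.0  # placeholder: no previous prediction exists on the first pass
    history = []
    for _ in range(steps):
        latent = encode(x, output)  # earlier layer sees the input and the recycled output
        output = decode(latent)     # later layers produce the refined prediction
        history.append(output)
    return history


if __name__ == "__main__":
    # Toy stand-ins: the encoder averages the input with the recycled output.
    toy_encode = lambda x, prev: 0.5 * x + 0.5 * prev
    toy_decode = lambda latent: latent
    print(recycle(toy_encode, toy_decode, 1.0, 3))
```

Because the number of steps is a runtime argument, the same loop can be run for more passes at inference than were used in training, which is the compute-for-accuracy trade the abstract refers to.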

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

paper_url: http://arxiv.org/abs/2309.07509
repo_url: None
paper_authors: Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
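for: The paper aims to generate realistic talking faces, a task with numerous applications.
methods: DiffTalker is driven jointly by audio and landmarks, which addresses the difficulty of applying diffusion models directly to audio control. It consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details.
results: Experiments show that DiffTalker produces clear and geometrically accurate talking faces without additional alignment between audio and image features.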

Abstract
Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.

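Summary
Driven jointly by audio and landmarks, the DiffTalker model generates realistic talking faces. DiffTalker addresses the challenges of applying diffusion models, which are traditionally trained on text-image pairs, directly to audio control. It consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a key role in connecting the audio and image domains, which makes it possible to incorporate knowledge from pre-trained diffusion models. This approach efficiently produces articulate talking faces. Experimental results show that DiffTalker generates clear and geometrically accurate talking faces without additional alignment between audio and image features.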

Efficiently Robustify Pre-trained Models

paper_url: http://arxiv.org/abs/2309.07499
repo_url: None
paper_authors: Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, Vibhav Vineet
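for: The study examines the robustness of large-scale deep learning models in real-world settings and whether existing robustification schemes scale.
methods: The authors first benchmark these large-scale models under different perturbations and datasets and show that their performance degrades. They then discuss the drawbacks of full-model fine-tuning, including computational overhead and the loss of desired characteristics. Finally, they propose a simple, cost-effective method inspired by the knowledge transfer literature that robustifies large-scale models while preserving their transfer learning and zero-shot properties.
results: Evaluated under various vision perturbations (including the ImageNet-C, R, S, A datasets) and in transfer learning and zero-shot setups on different datasets, the method makes large-scale models robust efficiently, requires significantly less time, and preserves the transfer learning and zero-shot properties of the original model.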

Abstract
A recent trend in deep learning algorithms has been towards training large scale models, having high parameter count and trained on big datasets. However, robustness of such large scale models towards real-world settings is still a less-explored topic. In this work, we first benchmark the performance of these models under different perturbations and datasets thereby representing real-world shifts, and highlight their degrading performance under these shifts. We then discuss how existing robustification schemes based on complete model fine-tuning might not be a scalable option given very large scale networks and can also lead them to forget some of the desired characteristics. Finally, we propose a simple and cost-effective method to solve this problem, inspired by knowledge transfer literature. It involves robustifying smaller models, at a lower computation cost, and then using them as teachers to tune a fraction of these large scale networks, reducing the overall computational overhead. We evaluate our proposed method under various vision perturbations including ImageNet-C,R,S,A datasets and also for transfer learning, zero-shot evaluation setups on different datasets. Benchmark results show that our method is able to induce robustness to these large scale models efficiently, requiring significantly lower time and also preserves the transfer learning, zero-shot properties of the original model which none of the existing methods are able to achieve.

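Summary
A current trend in deep learning is to train large-scale models with high parameter counts on large datasets. However, the robustness of such models in real-world settings remains little explored. In this work we first benchmark these models under different perturbations and datasets and find that their performance degrades. We then discuss why existing robustification schemes based on complete model fine-tuning may not be a viable option and can make the model forget some of its desired characteristics. Finally, we propose a simple and economical solution based on the knowledge transfer literature: robustify smaller models, then use them as teachers to tune a fraction of the large-scale network, lowering the total computational cost. We evaluate the method under various vision perturbations, including the ImageNet-C, R, S, A datasets, and in transfer learning and zero-shot evaluation setups. The results show that the method makes these large-scale models robust at lower computational cost while preserving the transfer learning and zero-shot properties of the original model, unlike existing methods.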
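
The abstract does not give the training objective used when the small robust model teaches the large one. A common choice in the knowledge transfer literature is a temperature-softened KL divergence between teacher and student predictions, sketched below as an assumption rather than as the paper's loss. The names `softmax` and `distillation_loss` and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # small robust model (teacher)
    q = softmax(student_logits, temperature)  # large model being tuned (student)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In a setup like the one described, only a fraction of the large network's parameters would receive gradients from this loss; which fraction is tuned is not stated in the abstract.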

EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization

paper_url: http://arxiv.org/abs/2309.07471
repo_url: https://github.com/minnjung/ep2p-loc
paper_authors: Minjung Kim, Junseo Koo, Gunhee Kim
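for: The study addresses visual localization, that is, estimating the 6-DoF camera pose of a query image within a 3D reference map.
methods: EP2P-Loc is a new large-scale visual localization method that finds 2D-3D correspondences and is trained end to end to improve pose estimation accuracy.
results: On newly curated large-scale indoor and outdoor benchmarks, the method achieves state-of-the-art performance compared with existing visual localization and image-to-point cloud registration methods.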

Abstract
Visual localization is the task of estimating a 6-DoF camera pose of a query image within a provided 3D reference map. Thanks to recent advances in various 3D sensors, 3D point clouds are becoming a more accurate and affordable option for building the reference map, but research to match the points of 3D point clouds with pixels in 2D images for visual localization remains challenging. Existing approaches that jointly learn 2D-3D feature matching suffer from low inliers due to representational differences between the two modalities, and the methods that bypass this problem into classification have an issue of poor refinement. In this work, we propose EP2P-Loc, a novel large-scale visual localization method that mitigates such appearance discrepancy and enables end-to-end training for pose estimation. To increase the number of inliers, we propose a simple algorithm to remove invisible 3D points in the image, and find all 2D-3D correspondences without keypoint detection. To reduce memory usage and search complexity, we take a coarse-to-fine approach where we extract patch-level features from 2D images, then perform 2D patch classification on each 3D point, and obtain the exact corresponding 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP for end-to-end training. In the experiments on newly curated large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, we show that our method achieves the state-of-the-art performance compared to existing visual localization and image-to-point cloud registration methods.

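Summary
Visual localization is the task of estimating the camera pose of a query image within a provided 3D reference map. Existing methods that jointly learn 2D-3D feature matching suffer from few inliers because of representational differences between the two modalities, and methods that circumvent this problem tend to refine poorly. In this work we propose EP2P-Loc, a new large-scale visual localization method that mitigates this discrepancy and enables end-to-end training. To increase the number of inliers, we propose a simple algorithm that removes 3D points invisible in the image and finds all 2D-3D correspondences. To reduce memory usage and search complexity, we take a coarse-to-fine approach: we extract patch-level features from 2D images, perform 2D patch classification for each 3D point, and obtain the exact 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP for end-to-end training. In experiments on newly curated large-scale indoor and outdoor benchmarks, the method achieves state-of-the-art performance compared with existing visual localization and image-to-point cloud registration methods.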
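
The abstract does not describe how invisible 3D points are removed. As a rough illustration of the simplest part of such a step, the sketch below projects camera-frame points with a pinhole model and keeps only those that fall in front of the camera and inside the image. It ignores occlusion entirely, which a full visibility test would have to handle. The function name `project_visible`, the single focal length, and the centered principal point are all assumptions.

```python
def project_visible(points, focal, width, height):
    """Return (index, u, v) for camera-frame 3D points that project inside the image."""
    cx, cy = width / 2.0, height / 2.0  # assumed principal point at the image center
    visible = []
    for i, (x, y, z) in enumerate(points):
        if z <= 0:  # point is behind the camera plane
            continue
        u = focal * x / z + cx
        v = focal * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            visible.append((i, u, v))
    return visible


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
    print(project_visible(cloud, focal=100.0, width=640, height=480))
```

The surviving `(u, v)` coordinates are the kind of ground-truth pixel targets a coarse-to-fine matcher could be supervised with, first at patch level and then at pixel level.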

Research on self-cross transformer model of point cloud change detecter

paper_url: http://arxiv.org/abs/2309.07444
repo_url: None
paper_authors: Xiaoxu Ren, Haili Sun, Zhenxin Zhang
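for: The paper targets change detection in 3D point clouds, to detect changes during urban construction in time, ensure project integrity, and reduce labor costs.
methods: A network for 3D point cloud change detection is proposed, built around a new Cross transformer module.
results: The network is tested on simulated tunneling data for change detection.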

Abstract
With the vigorous development of the urban construction industry, engineering deformation or changes often occur during the construction process. To address this, changes must be detected in time to find construction loopholes, ensure the integrity of the project, reduce labor costs, and avoid inconvenience and harm on roads. For change detection in 3D point clouds, researchers have published various methods. Most are based on traditional threshold distance methods (C2C, M3C2, M3C2-EP), and some convert 3D point clouds into a DSM, which loses a lot of original information. Although deep learning is used in remote sensing, for change detection the 3D point clouds are mostly converted into two-dimensional patches, and neural networks are rarely applied to the points directly. We prefer that the network operate at the level of pixels or points. Therefore, in this article, we build a network for 3D point cloud change detection and propose a new module, Cross transformer, suitable for change detection. We also simulate tunneling data for change detection and run test experiments with our network.

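Summary
With the development of the urban construction industry, engineering deformation or changes often occur during construction. Detecting these changes in time is necessary to find construction defects, ensure project integrity, reduce labor costs, and avoid inconvenience on roads. Researchers have published various methods for change detection in 3D point clouds. Most are based on traditional distance methods (C2C, M3C2, M3C2-EP), and some convert 3D point clouds into a DSM, which loses much of the original information. Although deep learning is widely used in remote sensing, for 3D point cloud change detection the data are mostly converted into two-dimensional patches, and neural networks are rarely applied directly. We believe the network should operate at the level of points or pixels. In this article we therefore build a network for 3D point cloud change detection and propose a new module suited to change detection, the Cross transformer. We also simulate tunneling data for change detection and run test experiments with our network.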
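
For context on the traditional baselines named above, a cloud-to-cloud (C2C) comparison measures, for each point in one cloud, the distance to its nearest neighbor in the other cloud and flags points beyond a threshold. The sketch below is a brute-force illustration of that idea, not the paper's network. The function names and the threshold value are hypothetical, and a practical implementation would use a spatial index instead of the quadratic search.

```python
import math


def c2c_distances(reference, target):
    """For each target point, the distance to its nearest reference point (brute force)."""
    return [min(math.dist(p, q) for q in reference) for p in target]


def changed_points(reference, target, threshold):
    """Target points whose nearest-neighbor distance to the reference exceeds the threshold."""
    distances = c2c_distances(reference, target)
    return [p for p, d in zip(target, distances) if d > threshold]


if __name__ == "__main__":
    before = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    after = [(0.0, 0.0, 0.1), (5.0, 0.0, 0.0)]
    print(changed_points(before, after, threshold=0.5))
```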

DePT: Decoupled Prompt Tuning

paper_url: http://arxiv.org/abs/2309.07439
repo_url: https://github.com/koorye/dept
paper_authors: Ji Zhang, Shihan Wu, Lianli Gao, Hengtao Shen, Jingkuan Song
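for: To resolve the Base-New Tradeoff (BNT) in prompt tuning: the better the tuned model generalizes to the base task, the worse it generalizes to new tasks, and vice versa.
methods: The Decoupled Prompt Tuning (DePT) framework decouples base-specific knowledge from the feature channels into an isolated feature space during prompt tuning, preserving task-shared knowledge in the original feature space for better zero-shot generalization on new tasks.
results: Extensive experiments on 11 datasets show the flexibility and effectiveness of DePT, which can improve existing prompt tuning methods. Code and pretrained models are available at https://github.com/Koorye/DePT.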

Abstract
This work breaks through the Base-New Tradeoff (BNT) dilemma in prompt tuning, i.e., the better the tuned model generalizes to the base (or target) task, the worse it generalizes to new tasks, and vice versa. Specifically, through an in-depth analysis of the learned features of the base and new tasks, we observe that the BNT stems from a channel bias issue, i.e., the vast majority of feature channels are occupied by base-specific knowledge, resulting in the collapse of task-shared knowledge important to new tasks. To address this, we propose the Decoupled Prompt Tuning (DePT) framework, which decouples base-specific knowledge from feature channels into an isolated feature space during prompt tuning, so as to maximally preserve task-shared knowledge in the original feature space for achieving better zero-shot generalization on new tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods, hence it can improve all of them. Extensive experiments on 11 datasets show the strong flexibility and effectiveness of DePT. Our code and pretrained models are available at https://github.com/Koorye/DePT.

Physical Invisible Backdoor Based on Camera Imaging

paper_url: http://arxiv.org/abs/2309.07428
repo_url: None
paper_authors: Yusheng Guo, Nan Zhong, Zhenxing Qian, Xinpeng Zhang
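for: The paper proposes a physical invisible backdoor attack that compromises a model without changing image pixels.
methods: The attack is based on camera imaging. A camera identification model is trained with phone IDs to extract the camera fingerprint feature. A special network architecture is then built by leveraging the attributes of the CFA interpolation algorithm and combining it with the feature extraction block. Finally, the backdoor is transferred to a classical architecture via teacher-student distillation.
results: Experiments on a new dataset of 21,500 images show that the method effectively compromises classical models such as ResNet18 and is robust against various backdoor defenses.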

Abstract
Backdoor attack aims to compromise a model, which returns an adversary-wanted output when a specific trigger pattern appears yet behaves normally for clean inputs. Current backdoor attacks require changing pixels of clean images, which results in poor stealthiness of attacks and increases the difficulty of the physical implementation. This paper proposes a novel physical invisible backdoor based on camera imaging without changing natural image pixels. Specifically, a compromised model returns a target label for images taken by a particular camera, while it returns correct results for other images. To implement and evaluate the proposed backdoor, we take shots of different objects from multiple angles using multiple smartphones to build a new dataset of 21,500 images. Conventional backdoor attacks work ineffectively with some classical models, such as ResNet18, over the above-mentioned dataset. Therefore, we propose a three-step training strategy to mount the backdoor attack. First, we design and train a camera identification model with the phone IDs to extract the camera fingerprint feature. Subsequently, we elaborate a special network architecture, which is easily compromised by our backdoor attack, by leveraging the attributes of the CFA interpolation algorithm and combining it with the feature extraction block in the camera identification model. Finally, we transfer the backdoor from the elaborated special network architecture to the classical architecture model via teacher-student distillation learning. Since the trigger of our method is related to the specific phone, our attack works effectively in the physical world. Experiment results demonstrate the feasibility of our proposed approach and robustness against various backdoor defenses.

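Summary
A backdoor attack aims to compromise a model so that it returns the output the adversary wants when a specific trigger pattern appears, yet behaves normally on clean inputs. Current backdoor attacks require changing the pixels of clean images, which makes the attacks less stealthy and harder to implement physically. This paper proposes a new physical invisible backdoor based on camera imaging that does not change natural image pixels. Specifically, a compromised model returns a target label for images taken by a particular camera and correct results for other images. To implement and evaluate the backdoor, we photograph different objects from multiple angles with multiple smartphones and build a new dataset of 21,500 images. Conventional backdoor attacks are ineffective on this dataset for some classical models, such as ResNet18. We therefore propose a three-step training strategy. First, we design and train a camera identification model to extract the camera fingerprint feature. Then, using the attributes of the CFA interpolation algorithm and the feature extraction block of the camera identification model, we design a special network architecture that is easily compromised by our backdoor attack. Finally, we transfer the backdoor from the special architecture to the classical architecture through teacher-student distillation. Because the trigger is tied to a specific phone, the attack works in the physical world. Experimental results show that the proposed approach is feasible and robust against various backdoor defenses.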

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

paper_url: http://arxiv.org/abs/2309.07409
repo_url: https://github.com/ffzzy840304/masked-pdpp
paper_authors: Fen Fang, Yun Liu, Ali Koksal, Qianli Xu, Joo-Hwee Lim
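for: The study addresses procedure planning in instructional videos: discerning many action types (e.g., pour milk, pour water, open lid, close lid) from brief visual observation and capturing the intricate semantic relation between action types and task goals.
methods: A simple yet effective enhancement is proposed: a masked diffusion model. The mask acts as a task-oriented attention filter so that the diffusion/denoising process concentrates on a subset of action types. More potent visual representation learning is used to improve task classification. In particular, a joint visual-text embedding is learned, where the text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
results: The method is evaluated on three public datasets and achieves state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.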

Abstract
A key challenge with procedure planning in instructional videos lies in how to handle a large decision space consisting of a multitude of action types that belong to various tasks. To understand real-world video content, an AI agent must proficiently discern these action types (e.g., pour milk, pour water, open lid, close lid, etc.) based on brief visual observation. Moreover, it must adeptly capture the intricate semantic relation of the action types and task goals, along with the variable action sequences. Recently, notable progress has been made via the integration of diffusion models and visual representation learning to address the challenge. However, existing models employ rudimentary mechanisms to utilize task information to manage the decision space. To overcome this limitation, we introduce a simple yet effective enhancement - a masked diffusion model. The introduced mask acts akin to a task-oriented attention filter, enabling the diffusion/denoising process to concentrate on a subset of action types. Furthermore, to bolster the accuracy of task classification, we harness more potent visual representation learning techniques. In particular, we learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions. We evaluate the method on three public datasets and achieve state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.

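Summary
A key challenge in procedure planning for instructional videos is handling a large decision space made up of many action types that belong to different tasks. To understand real-world video content, an AI agent must accurately discern these action types (e.g., pour milk, pour water, open lid, close lid) and capture the intricate semantic relation between action types and task goals, along with the variable action sequences. Recent progress combines diffusion models with visual representation learning, but existing models use only rudimentary mechanisms to exploit task information when managing the decision space. To overcome this limitation, we propose a simple yet effective enhancement: a masked diffusion model. The mask acts as a task-oriented attention filter that lets the diffusion/denoising process concentrate on a subset of action types. To improve the accuracy of task classification, we also use more powerful visual representation learning. Specifically, we learn a joint visual-text embedding in which the text embedding is generated by prompting a pre-trained vision-language model to focus on human actions. We evaluate the method on three public datasets and achieve state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.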
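
The effect of a task-oriented mask can be illustrated on a single vector of action scores. The sketch below restricts a softmax to the action types a task allows, so that disallowed actions get zero probability. It only illustrates the masking idea: in the paper the mask is applied inside the diffusion/denoising process, whose details the abstract does not give. The function name `masked_action_probs` and the example scores are hypothetical.

```python
import math


def masked_action_probs(logits, allowed):
    """Softmax over action scores, restricted to actions permitted by the task mask."""
    if len(logits) != len(allowed):
        raise ValueError("logits and mask must have the same length")
    if not any(allowed):
        raise ValueError("mask must allow at least one action")
    peak = max(score for score, ok in zip(logits, allowed) if ok)
    exps = [math.exp(score - peak) if ok else 0.0 for score, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Three action types; the third is irrelevant to the predicted task.
    print(masked_action_probs([2.0, 1.0, 5.0], [True, True, False]))
```

Note that the highest raw score belongs to the masked-out action, which shows how a correct task prediction can shrink the decision space before any action is chosen.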

Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

paper_url: http://arxiv.org/abs/2309.07403
repo_url: None
paper_authors: Lei Fan, Bo Liu, Haoxiang Li, Ying Wu, Gang Hua
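for: To make visual recognition systems more flexible and reliable by handling both misclassification between known classes and misbehavior on unknown-class images.
methods: Based on the theory of Subjective Logic, prediction uncertainty is quantified separately as confusion (inter-class uncertainty) and ignorance (out-of-distribution samples), and comprehensive subjective opinions are obtained through evidence combination.
results: Experiments on synthetic data analysis, visual recognition, and open-set detection show that the method quantifies both sources of uncertainty and supports flexible recognition.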

Abstract
In real-world scenarios, typical visual recognition systems could fail under two major causes, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when they are unconfident between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, could be achieved via further evidence combinations. Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition.

Summary
Two challenges emerge with flexible visual recognition:
1. Quantifying prediction uncertainty: Uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples.
2. Comparable uncertainty between samples: Both confusion and ignorance should be comparable between samples to enable effective decision-making.
In this paper, we propose to explicitly model these two sources of uncertainty using the theory of Subjective Logic. Recognition is viewed as an evidence-collecting process, and confusion is defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, can be achieved via further evidence combinations.
We demonstrate the effectiveness of our methods through a series of experiments on synthetic data analysis, visual recognition, and open-set detection. Our approach can quantify two sources of uncertainties and handle flexible recognition effectively.
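
The link between Dirichlet concentration parameters and a subjective opinion can be sketched with the standard singleton mapping from Subjective Logic: with evidence $e_k = \alpha_k - 1$ and Dirichlet strength $S = \sum_k \alpha_k$, the belief masses are $b_k = e_k / S$ and the vacuity (absence of evidence) is $u = K / S$. The code below implements only this standard mapping, which corresponds to the ignorance term. The paper's confusion measure, obtained through further evidence combination, is not reproduced here, and the function name `opinion_from_dirichlet` is hypothetical.

```python
def opinion_from_dirichlet(alpha):
    """Map Dirichlet concentration parameters to singleton belief masses and vacuity."""
    if any(a < 1 for a in alpha):
        raise ValueError("each concentration parameter must be at least 1")
    num_classes = len(alpha)
    strength = sum(alpha)                           # Dirichlet strength S
    beliefs = [(a - 1) / strength for a in alpha]   # b_k = e_k / S
    vacuity = num_classes / strength                # u = K / S
    return beliefs, vacuity


if __name__ == "__main__":
    # No evidence at all: every belief is zero and the vacuity is maximal.
    print(opinion_from_dirichlet([1.0, 1.0, 1.0]))
    # Strong evidence for the first class lowers the vacuity.
    print(opinion_from_dirichlet([8.0, 1.0, 1.0]))
```

By construction the belief masses and the vacuity sum to one, so a sample with little total evidence is flagged by high vacuity regardless of how the evidence is spread across classes.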

HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis

paper_url: http://arxiv.org/abs/2309.07400
repo_url: https://github.com/hku-medai/higt
paper_authors: Ziyu Guo, Weiqin Zhao, Shujun Wang, Lequan Yu
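for: The paper studies the pyramid structure of gigapixel Whole Slide Images (WSIs) in computational pathology, which captures information ranging from individual cell interactions to tissue microenvironments.
methods: A novel Hierarchical Interaction Graph-Transformer (HIGT), built on a Graph Neural Network and a Transformer, learns both short-range local information and long-range global representations of WSI pyramids. A novel Bidirectional Interaction block establishes communication between different levels.
results: On two public WSI datasets (KICA and ESCA), HIGT outperforms both hierarchical and non-hierarchical state-of-the-art methods on tumor subtyping and staging tasks.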

Abstract
In computation pathology, the pyramid structure of gigapixel Whole Slide Images (WSIs) has recently been studied for capturing various information from individual cell interactions to tissue microenvironments. This hierarchical structure is believed to be beneficial for cancer diagnosis and prognosis tasks. However, most previous hierarchical WSI analysis works (1) only characterize local or global correlations within the WSI pyramids and (2) use only unidirectional interaction between different resolutions, leading to an incomplete picture of WSI pyramids. To this end, this paper presents a novel Hierarchical Interaction Graph-Transformer (i.e., HIGT) for WSI analysis. With Graph Neural Network and Transformer as the building commons, HIGT can learn both short-range local information and long-range global representation of the WSI pyramids. Considering that the information from different resolutions is complementary and can benefit each other during the learning process, we further design a novel Bidirectional Interaction block to establish communication between different levels within the WSI pyramids. Finally, we aggregate both coarse-grained and fine-grained features learned from different levels together for slide-level prediction. We evaluate our methods on two public WSI datasets from TCGA projects, i.e., kidney carcinoma (KICA) and esophageal carcinoma (ESCA). Experimental results show that our HIGT outperforms both hierarchical and non-hierarchical state-of-the-art methods on both tumor subtyping and staging tasks.

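Summary
In computational pathology, the pyramid structure of gigapixel Whole Slide Images (WSIs) has recently been studied as a way to capture information ranging from individual cell interactions to tissue microenvironments. This hierarchical structure is considered beneficial for cancer diagnosis and prognosis tasks. However, most previous hierarchical WSI analysis works (1) characterize only local or only global correlations within WSI pyramids and (2) use only unidirectional interaction between resolutions, which gives an incomplete picture of WSI pyramids. This paper therefore proposes the Hierarchical Interaction Graph-Transformer (HIGT), built on Graph Neural Networks and Transformers, which learns both short-range local information and long-range global representations of WSI pyramids. Because the information at different levels is complementary and the levels can benefit one another during learning, we further design a Bidirectional Interaction block that establishes communication between levels of the WSI pyramid. Finally, we aggregate the coarse-grained and fine-grained features learned at different levels for slide-level prediction. Experiments on two public WSI datasets from TCGA projects (KICA and ESCA) show that HIGT outperforms existing hierarchical and non-hierarchical methods on both tumor subtyping and staging tasks.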

Nucleus-aware Self-supervised Pretraining Using Unpaired Image-to-image Translation for Histopathology Images

paper_url: http://arxiv.org/abs/2309.07394
repo_url: https://github.com/zhiyuns/UNITPathSSL
paper_authors: Zhiyun Song, Penghui Du, Junpeng Yan, Kailu Li, Jianzhong Shou, Maode Lai, Yubo Fan, Yan Xu
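for: This study aims to improve model performance by learning effective features from unlabeled data, an approach that has proven successful for histopathology images. Few works, however, focus on extracting nucleus-level information, which is essential for pathologic analysis. The paper proposes a novel nucleus-aware self-supervised pretraining framework for histopathology images that captures nuclear morphology and distribution through unpaired image-to-image translation between histopathology images and pseudo mask images.
methods: The generation process is modulated by both conditional and stochastic style representations so that the generated histopathology images are realistic and diverse. An instance segmentation guided strategy is additionally used to capture instance-level information.
results: Experiments on 7 datasets show that the pretraining method outperforms supervised pretraining on Kather classification, multiple instance learning, and 5 dense-prediction tasks, and yields better results than other self-supervised approaches on 8 semi-supervised tasks. The results also indicate the advantages of self-supervised pretrained models for pathologic analysis.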

Abstract
Self-supervised pretraining attempts to enhance model performance by obtaining effective features from unlabeled data, and has demonstrated its effectiveness in the field of histopathology images. Despite its success, few works concentrate on the extraction of nucleus-level information, which is essential for pathologic analysis. In this work, we propose a novel nucleus-aware self-supervised pretraining framework for histopathology images. The framework aims to capture the nuclear morphology and distribution information through unpaired image-to-image translation between histopathology images and pseudo mask images. The generation process is modulated by both conditional and stochastic style representations, ensuring the reality and diversity of the generated histopathology images for pretraining. Further, an instance segmentation guided strategy is employed to capture instance-level information. The experiments on 7 datasets show that the proposed pretraining method outperforms supervised ones on Kather classification, multiple instance learning, and 5 dense-prediction tasks with the transfer learning protocol, and yields superior results than other self-supervised approaches on 8 semi-supervised tasks. Our project is publicly available at https://github.com/zhiyuns/UNITPathSSL.

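Summary
Self-supervised pretraining aims to improve model performance by learning effective features from unlabeled data, and it has proven successful for histopathology images. However, few works focus on extracting nucleus-level information, which is essential for pathologic analysis. In this work, we propose a novel nucleus-aware self-supervised pretraining framework for histopathology images. The framework captures nuclear morphology and distribution information through unpaired image-to-image translation between histopathology images and pseudo mask images. The generation process is modulated by conditional and stochastic style representations to ensure that the generated histopathology images are realistic and diverse. We also use an instance segmentation guided strategy to capture instance-level information. Experiments on 7 datasets show that our pretraining method outperforms supervised ones on Kather classification, multiple instance learning, and 5 dense-prediction tasks, and achieves better results than other self-supervised approaches on 8 semi-supervised tasks. Our project is publicly available at https://github.com/zhiyuns/UNITPathSSL.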

Judging a video by its bitstream cover

paper_url: http://arxiv.org/abs/2309.07361
repo_url: None
paper_authors: Yuxing Han, Yunan Ding, Jiangtao Wen, Chen Ye Gan
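for: This study develops a video classification method based on the post-compression bitstream, to make multimedia understanding and retrieval more efficient.
methods: The method extracts features only from a video's post-compression bitstream and requires no decompression, which reduces computational and storage demands.
results: Validation on a custom dataset of over 29,000 YouTube video clips totaling 6,000 hours yields precision, accuracy, and recall rates above 80%. The algorithm runs approximately 15,000 times faster than real-time for 30fps videos and outperforms the traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.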

Abstract
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially in an age where an immense volume of video content is constantly being generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation in low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for decompression. We validate our approach using a custom-built data set comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our preliminary evaluations indicate precision, accuracy, and recall rates well over 80%. The algorithm operates approximately 15,000 times faster than real-time for 30fps videos, outperforming traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.

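Summary
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially now that an immense volume of video content is constantly being generated. Traditional methods require video decompression to extract pixel-level features such as color, texture, and motion, which increases computational and storage demands. These methods also often suffer from degraded performance on low-quality videos. We present a novel approach that classifies a video by examining only its post-compression bitstream, eliminating the need for decompression. We validate it on a custom-built dataset of over 29,000 YouTube video clips totaling 6,000 hours and spanning 11 distinct categories. Our preliminary evaluations indicate precision, accuracy, and recall rates above 80%. The algorithm runs approximately 15,000 times faster than real-time for 30fps videos and outperforms the traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.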
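To make the quoted speed figure concrete, the short Python sketch below converts "15,000 times faster than real-time for 30fps videos" into a frame throughput and into the wall-clock time needed to scan the 6,000-hour dataset. It assumes the speedup applies uniformly to all content and ignores I/O; the function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract.
SPEEDUP_OVER_REAL_TIME = 15_000  # "approximately 15,000 times faster than real-time"
FRAME_RATE_FPS = 30              # "for 30fps videos"
DATASET_HOURS = 6_000            # "totaling 6,000 hours"


def frames_per_second(speedup, fps):
    """Frames processed per wall-clock second at the given speedup."""
    return speedup * fps


def wall_clock_minutes(content_hours, speedup):
    """Minutes needed to process `content_hours` of video at the given speedup."""
    return content_hours * 60 / speedup


throughput = frames_per_second(SPEEDUP_OVER_REAL_TIME, FRAME_RATE_FPS)
minutes = wall_clock_minutes(DATASET_HOURS, SPEEDUP_OVER_REAL_TIME)
print(f"{throughput} frames/s, {minutes} minutes for the whole dataset")
```

Under these assumptions the throughput is 450,000 frames per second, and the whole dataset would take about 24 minutes to classify.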

2023-09-14

cs.AI

cs.AI - 2023-09-14

Retrieval-Augmented Text-to-Audio Generation

paper_url: http://arxiv.org/abs/2309.08051
repo_url: None
paper_authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
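for: Improve the quality and accuracy of text-to-audio (TTA) generation, particularly for classes that are rare in the training dataset.
methods: The paper proposes a simple retrieval-augmented approach: a Contrastive Language Audio Pretraining (CLAP) model retrieves relevant text-audio pairs, and the features of the retrieved data are used to guide the learning of the TTA model.
results: On the AudioCaps dataset, the proposed Re-AudioLDM system achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, a large improvement over existing methods. Re-AudioLDM can also generate high-quality audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.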

Abstract
Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.

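Summary
Despite recent progress in text-to-audio (TTA) generation, we show that state-of-the-art models such as AudioLDM, when trained on datasets with an imbalanced class distribution such as AudioCaps, are biased in their generation performance. Specifically, they excel at generating common audio classes but underperform on rare ones, which degrades overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address it, we propose a simple retrieval-augmented approach: given an input text prompt, we first use a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. The resulting system, AudioLDM enhanced with our approach, is called Re-AudioLDM. On the AudioCaps dataset it achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming existing approaches. We also show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.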
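The retrieval step described above can be illustrated with a minimal sketch: rank stored text-audio pairs by cosine similarity to the query embedding and keep the top k. The three-dimensional vectors and captions below are hypothetical stand-ins for CLAP embeddings; no real CLAP model is called, and how Re-AudioLDM conditions on the retrieved features is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb, database, k=2):
    """Return the k (score, caption) pairs most similar to the query."""
    scored = [(cosine(query_emb, emb), caption) for caption, emb in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy database of (caption, embedding) pairs.
database = [
    ("dog barking", [1.0, 0.0, 0.0]),
    ("dog growling", [0.9, 0.1, 0.0]),
    ("rain on a roof", [0.0, 1.0, 0.0]),
    ("violin solo", [0.0, 0.0, 1.0]),
]
query = [0.8, 0.0, 0.2]  # hypothetical embedding of the input prompt

top = retrieve_top_k(query, database, k=2)
print([caption for _, caption in top])
```

In a full system the retrieved pairs would be passed on as additional conditioning signals; here they are only printed.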

Padding Aware Neurons

paper_url: http://arxiv.org/abs/2309.08048
repo_url: https://gitlab.com/paper14/padding-aware-neurons
paper_authors: Dario Garcia-Gasulla, Victor Gimenez-Abalos, Pablo Martin-Torres
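for: This paper studies padding-aware filters in convolutional layers and their effect on model performance.
methods: The authors propose a method that identifies Padding Aware Neurons (PANs) through their activations under a static padding policy, and apply it to several popular pre-trained models to explore where PANs occur.
results: PANs are found in all convolutional models explored, in numbers ranging from dozens to hundreds. The authors describe different types of PANs and the biases they induce in the data, and discuss whether PANs are desirable and what side effects they may have on model performance, generalisation, efficiency and safety.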

Abstract
Convolutional layers are a fundamental component of most image-related models. These layers often implement by default a static padding policy (e.g., zero padding), to control the scale of the internal representations, and to allow kernel activations centered on the border regions. In this work we identify Padding Aware Neurons (PANs), a type of filter that is found in most (if not all) convolutional models trained with static padding. PANs focus on the characterization and recognition of input border location, introducing a spatial inductive bias into the model (e.g., how close to the input's border a pattern typically is). We propose a method to identify PANs through their activations, and explore their presence in several popular pre-trained models, finding PANs on all models explored, from dozens to hundreds. We discuss and illustrate different types of PANs, their kernels and behaviour. To understand their relevance, we test their impact on model performance, and find padding and PANs to induce strong and characteristic biases in the data. Finally, we discuss whether or not PANs are desirable, as well as the potential side effects of their presence in the context of model performance, generalisation, efficiency and safety.

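Summary
Convolutional layers are a fundamental component of image-related models. These layers usually apply a static padding policy (e.g., zero padding) to control the scale of the internal representations and to allow kernel activations centered on the border regions. In this work we identify Padding Aware Neurons (PANs), a type of filter found in most (if not all) convolutional models trained with static padding. PANs characterize and recognize the location of the input border, introducing a spatial inductive bias into the model (e.g., how close to the input's border a pattern typically is). We propose a method to identify PANs through their activations and find them in several popular pre-trained models, in numbers ranging from dozens to hundreds. We discuss and illustrate different types of PANs, their kernels and their behaviour. To understand their relevance, we test their impact on model performance and find that padding and PANs induce strong and characteristic biases in the data. Finally, we discuss whether PANs are desirable, as well as their potential side effects on model performance, generalisation, efficiency and safety.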
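A small sketch can show why zero padding lets a filter sense the input border. The hand-written 3x3 kernel below is a hypothetical example, not a kernel taken from the paper: on a perfectly uniform image its response cancels everywhere in the interior, and after a ReLU it fires only along the top row, where the padded zeros break the cancellation.

```python
def conv2d_zero_pad(image, kernel):
    """'Same'-size 2D cross-correlation with zero padding, followed by ReLU."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    # Positions outside the image are zero padding and add nothing.
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = max(acc, 0.0)  # ReLU
    return out


# Uniform 4x4 image: it contains no pattern at all, only a border.
image = [[1.0] * 4 for _ in range(4)]

# Hypothetical kernel: subtracts the row above, adds the row below.
kernel = [
    [-1.0, -1.0, -1.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]

activation = conv2d_zero_pad(image, kernel)
for row in activation:
    print(row)
```

Since the input is constant, any non-zero activation can only come from the padding, which is the kind of border sensitivity the abstract attributes to PANs.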

Towards Large-scale Building Attribute Mapping using Crowdsourced Images: Scene Text Recognition on Flickr and Problems to be Solved

paper_url: http://arxiv.org/abs/2309.08042
repo_url: https://github.com/ya0-sun/str-berlin
paper_authors: Yao Sun, Anna Kruspe, Liqiu Meng, Yifan Tian, Eike J Hoffmann, Stefan Auer, Xiao Xiang Zhu
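for: This paper applies Scene Text Recognition (STR) to crowdsourced street-view images in order to map building attributes.
methods: The study uses Flickr images, in particular text on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition.
results: Manual checking of a subset of STR-recognized images indicates high accuracy. The study examines the correlation between STR results and building functions, and analyzes cases in which text appears on residential buildings. The task still faces many challenges, including small text regions in street-view images, the absence of ground truth labels, and mismatches between buildings in Flickr images and building footprints in OpenStreetMap. To develop city-wide mapping, the authors suggest identifying the scenarios where STR is effective while developing appropriate algorithms or bringing in additional data for the other cases. They also call for interdisciplinary collaboration to understand the motivation behind building photography and labeling. The STR-on-Flickr results are available at https://github.com/ya0-sun/STR-Berlin.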

Abstract
Crowdsourced platforms provide huge amounts of street-view images that contain valuable building information. This work addresses the challenges in applying Scene Text Recognition (STR) in crowdsourced street-view images for building attribute mapping. We use Flickr images, particularly examining texts on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition. Manual checking on a subset of STR-recognized images demonstrates high accuracy. We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones. Further investigation revealed significant challenges associated with this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches in buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest differentiating the scenarios where STR proves effective while developing appropriate algorithms or bringing in additional data for handling other cases. Furthermore, interdisciplinary collaboration should be undertaken to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at https://github.com/ya0-sun/STR-Berlin.

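Summary
Crowdsourced platforms provide huge numbers of street-view images that contain valuable building information. This work addresses the challenges of applying Scene Text Recognition (STR) to crowdsourced street-view images for building attribute mapping. We use Flickr images, focusing on text on building facades. We create a Berlin Flickr dataset and use pre-trained STR models for text detection and recognition. Manual checking of a subset of STR-recognized images shows high accuracy. We examine the correlation between STR results and building functions, and analyze instances where text was recognized on residential buildings but not on commercial ones. Further investigation reveals significant challenges for this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches between buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest distinguishing the scenarios where STR is effective while developing appropriate algorithms or bringing in additional data to handle other cases. We also recommend interdisciplinary collaboration to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at https://github.com/ya0-sun/STR-Berlin.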

BEA: Revisiting anchor-based object detection DNN using Budding Ensemble Architecture

paper_url: http://arxiv.org/abs/2309.08036
repo_url: None
paper_authors: Syed Sha Qutub, Neslihan Kose, Rafael Rosales, Michael Paulitsch, Korbinian Hagn, Florian Geissler, Yang Peng, Gereon Hinz, Alois Knoll
for: Improve the accuracy and the quality of uncertainty estimates of anchor-based object detection models.
methods: The Budding Ensemble Architecture (BEA) and its proposed loss functions are used to improve confidence score calibration and uncertainty estimation.
results: BEA improves the detection accuracy and uncertainty estimation quality of Base-YOLOv3 and SSD models, and achieves better out-of-distribution detection across several datasets.

Abstract
This paper introduces the Budding Ensemble Architecture (BEA), a novel reduced ensemble architecture for anchor-based object detection models. Object detection models are crucial in vision-based tasks, particularly in autonomous systems. They should provide precise bounding box detections while also calibrating their predicted confidence scores, leading to higher-quality uncertainty estimates. However, current models may make erroneous decisions due to false positives receiving high scores or true positives being discarded due to low scores. BEA aims to address these issues. The proposed loss functions in BEA improve the confidence score calibration and lower the uncertainty error, which results in a better distinction of true and false positives and, eventually, higher accuracy of the object detection models. Both Base-YOLOv3 and SSD models were enhanced using the BEA method and its proposed loss functions. The BEA on Base-YOLOv3 trained on the KITTI dataset results in a 6% and 3.7% increase in mAP and AP50, respectively. Utilizing a well-balanced uncertainty estimation threshold to discard samples in real-time even leads to a 9.6% higher AP50 than its base model. This is attributed to a 40% increase in the area under the AP50-based retention curve used to measure the quality of calibration of confidence scores. Furthermore, BEA-YOLOV3 trained on KITTI provides superior out-of-distribution detection on Citypersons, BDD100K, and COCO datasets compared to the ensembles and vanilla models of YOLOv3 and Gaussian-YOLOv3.

Summary
The proposed loss functions in BEA improve the accuracy of object detection models. The BEA method was applied to both Base-YOLOv3 and SSD models, and the results show a significant improvement in accuracy. The BEA on Base-YOLOv3 trained on the KITTI dataset resulted in a 6% and 3.7% increase in mean Average Precision (mAP) and AP50, respectively. Additionally, using a well-balanced uncertainty estimation threshold in real-time led to a 9.6% higher AP50 than the base model. This is attributed to a 40% increase in the area under the AP50-based retention curve, which measures the quality of calibration of confidence scores. Moreover, the BEA-YOLOV3 trained on KITTI provided superior out-of-distribution detection on Citypersons, BDD100K, and COCO datasets compared to the ensembles and vanilla models of YOLOv3 and Gaussian-YOLOv3. This demonstrates the effectiveness of the BEA method in improving the accuracy of object detection models.

Vision-based Analysis of Driver Activity and Driving Performance Under the Influence of Alcohol

paper_url: http://arxiv.org/abs/2309.08021
repo_url: None
paper_authors: Ross Greer, Akshay Gopalkrishnan, Sumega Mandadi, Pujitha Gunaratne, Mohan M. Trivedi, Thomas D. Marcotte
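for: Research on preventing drunk driving in order to improve vehicle safety.
methods: A multi-modal ensemble of visual, thermal, audio, and chemical sensors is used to examine the effect of alcohol on driving performance and to explore methods for detecting driving under the influence.
results: The paper describes computer vision and machine learning models that analyze the driver's face in thermal imagery, and introduces a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels.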

Abstract
About 30% of all traffic crash fatalities in the United States involve drunk drivers, making the prevention of drunk driving paramount to vehicle safety in the US and other locations which have a high prevalence of driving while under the influence of alcohol. Driving impairment can be monitored through active use of sensors (when drivers are asked to engage in providing breath samples to a vehicle instrument or when pulled over by a police officer), but a more passive and robust mechanism of sensing may allow for wider adoption and benefit of intelligent systems that reduce drunk driving accidents. This could assist in identifying impaired drivers before they drive, or early in the driving process (before a crash or detection by law enforcement). In this research, we introduce a study which adopts a multi-modal ensemble of visual, thermal, audio, and chemical sensors to (1) examine the impact of acute alcohol administration on driving performance in a driving simulator, and (2) identify data-driven methods for detecting driving under the influence of alcohol. We describe computer vision and machine learning models for analyzing the driver's face in thermal imagery, and introduce a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels, including discussion of relevant machine learning phenomena which can help in future experiment design for related studies.

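Summary
About 30% of all traffic crash fatalities in the United States involve drunk drivers, which makes preventing drunk driving paramount to vehicle safety in the US and in other places with a high prevalence of driving under the influence of alcohol. Driving impairment can be monitored through active use of sensors (when drivers are asked to provide breath samples to a vehicle instrument, or when they are pulled over by a police officer), but a more passive and robust sensing mechanism may allow wider adoption of intelligent systems that reduce drunk driving accidents. Such a mechanism could help identify impaired drivers before they drive or early in the driving process (before a crash or detection by law enforcement). In this research, we introduce a study that uses a multi-modal ensemble of visual, thermal, audio, and chemical sensors to (1) examine the impact of acute alcohol administration on driving performance and (2) identify data-driven methods for detecting driving under the influence of alcohol. We describe computer vision and machine learning models for analyzing the driver's face, and introduce a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels, including a discussion of relevant machine learning phenomena that can help future experiment design for related studies.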

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing

paper_url: http://arxiv.org/abs/2309.08008
repo_url: None
paper_authors: Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang
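for: This study aims to design effective prompts that guide large language models (LLMs) to perform specific clinical natural language processing (NLP) tasks without any task-specific training data.
methods: The study evaluates existing prompt types, including simple prefix, simple cloze, chain of thought, and anticipatory prompts, and introduces two new types: heuristic prompting and ensemble prompting.
results: The performance of the prompt types differs considerably across the language models tested (GPT-3.5, BARD, and LLAMA2). Ensemble prompting performs best on all three models, while anticipatory prompting performs best on GPT-3.5. The comparison of zero-shot and few-shot prompting also offers new insights and guidelines for prompt engineering for LLMs in clinical NLP.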

Abstract
Large language models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), especially in domains where labeled data is scarce or expensive, such as clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. In this paper, we present a comprehensive and systematic experimental study on prompt engineering for five clinical NLP tasks: Clinical Sense Disambiguation, Biomedical Evidence Extraction, Coreference Resolution, Medication Status Extraction, and Medication Attribute Extraction. We assessed the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts, and introduced two new types of prompts, namely heuristic prompting and ensemble prompting. We evaluated the performance of these prompts on three state-of-the-art LLMs: GPT-3.5, BARD, and LLAMA2. We also contrasted zero-shot prompting with few-shot prompting, and provide novel insights and guidelines for prompt engineering for LLMs in clinical NLP. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative AI, and we hope that it will inspire and inform future research in this area.

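Summary
Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data is scarce or expensive, such as the clinical domain. To unlock the clinical knowledge in these LLMs, however, we need to design effective prompts that guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. In this paper, we present a comprehensive and systematic experimental study of prompt engineering for five clinical NLP tasks: Clinical Sense Disambiguation, Biomedical Evidence Extraction, Coreference Resolution, Medication Status Extraction, and Medication Attribute Extraction. We assess the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts, and introduce two new types: heuristic prompting and ensemble prompting. We evaluate them on three state-of-the-art LLMs: GPT-3.5, BARD, and LLAMA2. We also contrast zero-shot prompting with few-shot prompting, and provide novel insights and guidelines for prompt engineering in clinical NLP. To the best of our knowledge, this is one of the first empirical evaluations of different prompt engineering approaches for clinical NLP, and we hope it will inspire and inform future research.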
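The abstract does not spell out how ensemble prompting combines its prompts. The sketch below assumes the simplest reading, a majority vote over the answers returned for several different prompts on the same input. The answer strings are hypothetical stand-ins for model outputs; no LLM is called.

```python
from collections import Counter


def ensemble_vote(answers):
    """Majority vote over per-prompt answers.

    Returns the winning label and the fraction of prompts that agreed with it.
    Assumption: answers are short labels that can be normalised by
    stripping whitespace and lower-casing.
    """
    normalized = [answer.strip().lower() for answer in answers]
    label, votes = Counter(normalized).most_common(1)[0]
    return label, votes / len(normalized)


# Hypothetical outputs from five prompt variants for one
# medication-status question.
answers = ["Active", "discontinued", "active ", "active", "Discontinued"]

label, agreement = ensemble_vote(answers)
print(label, agreement)
```

The agreement fraction gives a rough confidence signal; a low value could be used to flag an example for manual review.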

An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone

paper_url: http://arxiv.org/abs/2309.07992
repo_url: None
paper_authors: Ijaz Ul Haq, Byung Suk Lee, Donna M. Rizzo, Julia N Perdrial
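for: This study aims to help hydrologists detect anomalies in time series data from sensors in a research watershed in the northeastern United States critical zone.
methods: The framework uses automated machine learning, including a generative model for synthetic data and an automated hyperparameter optimization mechanism, to detect anomalies in time series data.
results: The framework consistently selects the most fitting model instance for the user's preferences regarding anomaly detection accuracy and computational cost.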

Abstract
This paper presents an automated machine learning framework designed to assist hydrologists in detecting anomalies in time series data generated by sensors in a research watershed in the northeastern United States critical zone. The framework specifically focuses on identifying peak-pattern anomalies, which may arise from sensor malfunctions or natural phenomena. However, the use of classification methods for anomaly detection poses challenges, such as the requirement for labeled data as ground truth and the selection of the most suitable deep learning model for the given task and dataset. To address these challenges, our framework generates labeled datasets by injecting synthetic peak patterns into synthetically generated time series data and incorporates an automated hyperparameter optimization mechanism. This mechanism generates an optimized model instance with the best architectural and training parameters from a pool of five selected models, namely Temporal Convolutional Network (TCN), InceptionTime, MiniRocket, Residual Networks (ResNet), and Long Short-Term Memory (LSTM). The selection is based on the user's preferences regarding anomaly detection accuracy and computational cost. The framework employs Time-series Generative Adversarial Networks (TimeGAN) as the synthetic dataset generator. The generated model instances are evaluated using a combination of accuracy and computational cost metrics, including training time and memory, during the anomaly detection process. Performance evaluation of the framework was conducted using a dataset from a watershed, demonstrating consistent selection of the most fitting model instance that satisfies the user's preferences.

Summary
This paper presents an automated machine learning framework that helps hydrologists detect anomalies in time series data generated by sensors. The framework focuses on peak-pattern anomalies, which may arise from sensor malfunctions or natural phenomena. Using classification methods for anomaly detection poses challenges, however, including the need for labeled data as ground truth and the selection of the most suitable deep learning model for the given task and dataset. To address these challenges, the framework generates labeled datasets by injecting synthetic peak patterns into synthetically generated time series data and optimizes model hyperparameters based on the user's preferences. It uses Time-series Generative Adversarial Networks (TimeGAN) as the synthetic dataset generator and evaluates model performance with a combination of accuracy and computational cost metrics. The framework was evaluated on a real-world dataset and consistently selected the most fitting model instance that satisfies the user's preferences.
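The labeling idea, injecting synthetic peaks into synthetic series so that ground truth comes for free, can be sketched as follows. This is a simplified illustration: a noisy sine wave stands in for the TimeGAN-generated series, the peak is a plain triangle, and all names and parameters are hypothetical.

```python
import math
import random


def make_base_series(n, seed=0):
    """Stand-in for a synthetically generated series: a sine wave plus small noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * t / 50) + rng.gauss(0, 0.05) for t in range(n)]


def inject_peak(series, labels, center, height, width):
    """Add a triangular peak around `center` and mark the affected points as anomalous."""
    start = max(0, center - width)
    stop = min(len(series), center + width + 1)
    for t in range(start, stop):
        series[t] += height * (1 - abs(t - center) / (width + 1))
        labels[t] = 1


series = make_base_series(200)
original = list(series)   # kept only to show the effect of the injection
labels = [0] * len(series)

inject_peak(series, labels, center=60, height=3.0, width=4)
inject_peak(series, labels, center=150, height=2.0, width=4)

print(sum(labels), "points labeled as peak anomalies")
```

The resulting (series, labels) pair can then be windowed and fed to any of the candidate classifiers listed in the abstract.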

Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

paper_url: http://arxiv.org/abs/2309.07986
repo_url: None
paper_authors: James Burgess, Kuan-Chieh Wang, Serena Yeung
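for: This paper examines whether text-to-image diffusion models learn 3D structure from only 2D supervision, and whether that knowledge can be used for 3D vision tasks.
methods: The paper uses Stable Diffusion and proposes a neural mapper that takes camera viewpoint parameters and controls the 3D viewpoint of the generated images.
results: The method can solve novel view synthesis with very few input views and produces single-view novel view predictions with good semantic detail and photorealism. It can also generate diverse samples to model the uncertainty inherent in sparse 3D vision problems.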

Abstract
Text-to-image diffusion models understand spatial relationship between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts.

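Summary
Text-to-image diffusion models understand spatial relationships between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that 3D knowledge is indeed encoded in 2D image diffusion models such as Stable Diffusion, and that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in images generated by frozen diffusion models. We train a small neural mapper that takes camera viewpoint parameters and predicts text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By using the frozen diffusion model as a prior, we can solve NVS with very few input views, and can even perform single-view novel view synthesis. Our single-view NVS predictions have good semantic detail and photorealism compared to prior methods. The approach is well suited to modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. The view-control mechanism is general and can even change the camera view in images generated from user-defined prompts.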

A Data Source for Reasoning Embodied Agents

paper_url: http://arxiv.org/abs/2309.07974
repo_url: https://github.com/facebookresearch/neuralmemory
paper_authors: Jack Lanchantin, Sainbayar Sukhbaatar, Gabriel Synnaeve, Yuxuan Sun, Kavya Srinet, Arthur Szlam
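for: The paper builds on recent progress in machine learning models for reasoning tasks and introduces a new data generator to support further advances.
methods: The data generator integrates with an embodied agent and produces templated text queries and answers matched with world-states encoded in a database. The world-states result from both world dynamics and the agent's actions.
results: Baseline models are evaluated on instantiations of the train sets, including pre-trained language models fine-tuned on a text-formatted representation of the database and graph-structured Transformers operating on a knowledge-graph representation. The models can answer some questions about the world-state but struggle with others, which points to new research directions in designing neural reasoning models and database representations.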

Abstract
Recent progress in using machine learning models for reasoning tasks has been driven by novel model architectures, large-scale pre-training protocols, and dedicated reasoning datasets for fine-tuning. In this work, to further pursue these advances, we introduce a new data generator for machine reasoning that integrates with an embodied agent. The generated data consists of templated text queries and answers, matched with world-states encoded into a database. The world-states are a result of both world dynamics and the actions of the agent. We show the results of several baseline models on instantiations of train sets. These include pre-trained language models fine-tuned on a text-formatted representation of the database, and graph-structured Transformers operating on a knowledge-graph representation of the database. We find that these models can answer some questions about the world-state, but struggle with others. These results hint at new research directions in designing neural reasoning models and database representations. Code to generate the data will be released at github.com/facebookresearch/neuralmemory
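As a rough illustration of "templated text queries and answers, matched with world-states encoded into a database", the sketch below fills two query templates from a toy world-state. The schema (a list of dicts with `name`, `x`, `z`) and the templates are hypothetical and are not taken from the released generator.

```python
import math

# Hypothetical world-state: one record per entity, as it might be stored in a database.
world_state = [
    {"name": "agent", "x": 0, "z": 0},
    {"name": "cow", "x": 2, "z": 5},
    {"name": "sheep", "x": 7, "z": 1},
]


def find(world, name):
    """Look up an entity record by name."""
    return next(obj for obj in world if obj["name"] == name)


def where_is(world, name):
    """Template 1: ask for an entity's location."""
    obj = find(world, name)
    return f"where is the {name}?", f"({obj['x']}, {obj['z']})"


def closest_to(world, name):
    """Template 2: ask which other entity is nearest to `name`."""
    ref = find(world, name)
    others = [obj for obj in world if obj["name"] != name]
    nearest = min(
        others,
        key=lambda obj: math.dist((obj["x"], obj["z"]), (ref["x"], ref["z"])),
    )
    return f"what is closest to the {name}?", nearest["name"]


examples = [where_is(world_state, "sheep"), closest_to(world_state, "agent")]
for question, answer in examples:
    print(question, "->", answer)
```

Because every answer is computed from the stored state, each generated pair is correct by construction, which is what makes such data usable as supervision.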


paper_url: http://arxiv.org/abs/2309.07915
repo_url: https://github.com/haozhezhao/mic
paper_authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
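for: This paper addresses the difficulty that current vision-language models (VLMs) have with complex multi-modal prompts.
methods: The paper proposes MMICL, which comprises a purpose-built architecture that seamlessly integrates visual and textual context in an interleaved manner, and the MIC dataset, which reduces the gap between training data and the complex multi-modal prompts seen in real-world applications.
results: Experiments show that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks such as MME and MMBench. They also show that MMICL effectively handles the challenge of understanding complex multi-modal prompts.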

Abstract
Starting from the resurgence of deep learning, vision-language models (VLMs) benefiting from large language models (LLMs) have never been so popular. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images. The issue can be traced back to the architectural design of VLMs or pre-training data. Specifically, the current VLMs primarily emphasize utilizing multi-modal data with a single image, rather than multi-modal prompts with interleaved multiple images and text. Even though some newly proposed VLMs could handle user prompts with multiple images, pre-training data does not provide more sophisticated multi-modal prompts than interleaved image and text crawled from the web. We propose MMICL to address the issue by considering both the model and data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner and MIC dataset to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially for complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

Summary

The MIC dataset covers:
1. Multi-modal context with interleaved images and text
2. Textual references for each image
3. Multi-image data with spatial, logical, or temporal relationships

Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially for complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

Beta Diffusion

paper_url: http://arxiv.org/abs/2309.07867
repo_url: https://github.com/ThisisBillhe/tiny-stable-diffusion
paper_authors: Mingyuan Zhou, Tianqi Chen, Zhendong Wang, Huangjie Zheng
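for: Beta diffusion is a novel generative modeling method for generating data within bounded ranges.
methods: Beta diffusion uses scaled and shifted beta distributions and multiplicative transitions over time to build both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals. Unlike traditional diffusion-based generative models, it does not rely on additive Gaussian noise and reweighted evidence lower bounds (ELBOs); it is optimized with KL-divergence upper bounds (KLUBs) instead.
results: On both synthetic data and natural images, beta diffusion demonstrates a distinctive ability to model range-bounded data, and the experiments validate the effectiveness of KLUBs for optimization.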

Abstract
We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.

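Summary
We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion applies multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals given the data at any point in time. Unlike traditional diffusion-based generative models, which rely on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and is optimized with KL-divergence upper bounds (KLUBs). We show that KLUBs are more effective than negative ELBOs, which can themselves be viewed as the KLUBs of the same KL divergence with its two arguments swapped. The loss function, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on synthetic data and natural images demonstrate the unique capabilities of beta diffusion for generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.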
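The abstract states that the KLUBs are derived from the convexity of the KL divergence. The sketch below checks that underlying property numerically on small discrete distributions: the KL divergence between two mixtures never exceeds the same mixture of the component KL divergences. It illustrates only the convexity inequality; the paper's actual bounds are over beta distributions and are not reproduced here, and the example distributions are arbitrary.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two distributions."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]


# Arbitrary example distributions over three outcomes.
p1, q1 = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
p2, q2 = [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]
lam = 0.4

# Joint convexity: KL of the mixtures <= mixture of the KLs.
lhs = kl(mix(p1, p2, lam), mix(q1, q2, lam))
rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
print(f"KL of mixtures = {lhs:.4f} <= mixture of KLs = {rhs:.4f}")
```

The right-hand side is the kind of quantity that can serve as a tractable upper bound when the left-hand side cannot be optimized directly.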

The Rise and Potential of Large Language Model Based Agents: A Survey

paper_url: http://arxiv.org/abs/2309.07864
repo_url: https://github.com/woooodyy/llm-agent-paper-list
paper_authors: Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui
for: This paper provides a comprehensive survey of large language model (LLM)-based agents and explores the potential of LLMs as a foundation for general AI agents.
methods: The paper presents a general framework for LLM-based agents with three main components (brain, perception, and action) that can be tailored to different applications.
results: The paper discusses the extensive applications of LLM-based agents in single-agent scenarios, multi-agent scenarios, and human-agent cooperation. It also explores agent societies, the social phenomena that emerge from them, and the insights they offer for human society.

Abstract
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. Actually, what the community lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository for the related papers at https://github.com/WooooDyy/LLM-Agent-Paper-List.

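Summary
Humanity has long pursued artificial intelligence (AI) at or beyond the human level, and AI agents are seen as a promising route to it. AI agents are artificial entities that sense their environment, make decisions, and take actions. Most prior efforts focus on improving algorithms or training strategies to boost performance on particular tasks, whereas what the community lacks is a general and powerful model that can serve as a starting point for agents that adapt to diverse scenarios. Because of their versatile capabilities, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI) and offer hope for building general AI agents. Many researchers have used LLMs as the foundation for AI agents and made significant progress. This paper is a comprehensive survey of LLM-based agents. It traces the concept of agents from its philosophical origins, explains why LLMs are suitable foundations, and presents a general framework with three main components (brain, perception, and action) that can be tailored to different applications. It then reviews applications in single-agent scenarios, multi-agent scenarios, and human-agent cooperation, examines the behavior and personality of agents, the social phenomena that emerge in agent societies, and the insights they offer for human society, and closes with key topics and open problems. A repository of related papers is available at https://github.com/WooooDyy/LLM-Agent-Paper-List.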

CiwaGAN: Articulatory information exchange

paper_url: http://arxiv.org/abs/2309.07861
repo_url: https://github.com/gbegus/articulationgan
paper_authors: Gašper Beguš, Thomas Lu, Alan Zhou, Peter Wu, Gopala K. Anumanchipalli
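for: This study presents a model of human spoken language acquisition that simulates how humans encode information into speech and exchange it through hearing.
methods: The model combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality.
results: The proposed CiwaGAN model is described as the most realistic deep learning approximation of human spoken language acquisition, making it useful for cognitively plausible simulations of the human speech act.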

Abstract
Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeling and information exchange separately, our model is the first to combine the two components. The paper also proposes an improved articulatory model with more interpretable internal representations. The proposed CiwaGAN model is the most realistic approximation of human spoken language acquisition using deep learning. As such, it is useful for cognitively plausible simulations of the human speech act.

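Summary
Humans encode information into sounds by controlling their articulators and decode information from sounds with the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. Earlier work treated these two components separately; this model is the first to combine them. The paper also proposes an improved articulatory model with more interpretable internal representations. CiwaGAN is the most realistic deep learning approximation of human spoken language acquisition, so it is useful for cognitively plausible simulations of the human speech act.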

ExpertQA: Expert-Curated Questions and Attributed Answers

paper_url: http://arxiv.org/abs/2309.07852
repo_url: https://github.com/chaitanyamalaviya/expertqa
paper_authors: Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth
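for: This paper studies the factuality and attribution of language model outputs across different domains.
methods: Domain experts are brought into the loop to judge whether model outputs are factually correct and properly attributed. The process yields ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions.
results: The study finds that the factuality and attribution of language models vary across domains and that expert knowledge bias exists in some fields. These findings can help improve how language models are trained and applied.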

Abstract
As language models are adapted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study & professions. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying factuality and attribution has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we present an evaluation study analyzing various axes of factuality and attribution provided in responses from a few systems, by bringing domain experts in the loop. Specifically, we first collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. We also ask experts to revise answers produced by language models, which leads to ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

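Summary
As language models are adopted by a more sophisticated and diverse set of users, it becomes increasingly important across fields and professions that the information they provide is factually correct and backed by verifiable sources. This matters most in high-stakes fields such as medicine and law, where propagating false information can have undesirable societal consequences. Previous work on factuality and attribution has not analyzed these properties of language model outputs in domain-specific scenarios. This work presents an evaluation study of several axes of factuality and attribution in the responses of a few systems, with domain experts in the loop. Expert-curated questions are first collected from 484 participants across 32 fields of study, and the same experts then evaluate the generated responses to their own questions. Experts also revise the model-generated answers, which leads to ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for the claims in the answers.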

Applying Deep Learning to Calibrate Stochastic Volatility Models

paper_url: http://arxiv.org/abs/2309.07843
repo_url: None
paper_authors: Abir Sridi, Paul Bilokon
for: This paper aims to improve the calibration of stochastic volatility models by using deep learning techniques to speed up the calibration process and achieve more accurate results.
methods: The authors use a differential deep learning (DDL) approach, which involves training machine learning models on samples of not only features and labels but also differentials of labels to features. They also compare the performance of different regularization techniques and show that the DDL approach outperforms classical deep learning methods.
results: The trained neural network dramatically reduces the computation time required for Heston calibration, and the DDL approach outperforms classical deep learning methods in terms of reducing overfitting and improving generalization error.

Abstract
Stochastic volatility models, where the volatility is a stochastic process, can capture most of the essential stylized facts of implied volatility surfaces and give more realistic dynamics of the volatility smile or skew. However, they come with the significant issue that they take too long to calibrate. Alternative calibration methods based on Deep Learning (DL) techniques have been recently used to build fast and accurate solutions to the calibration problem. Huge and Savine developed a Differential Deep Learning (DDL) approach, where Machine Learning models are trained on samples of not only features and labels but also differentials of labels to features. The present work aims to apply the DDL technique to price vanilla European options (i.e. the calibration instruments), more specifically, puts when the underlying asset follows a Heston model and then calibrate the model on the trained network. DDL allows for fast training and accurate pricing. The trained neural network dramatically reduces Heston calibration's computation time. In this work, we also introduce different regularisation techniques, and we apply them notably in the case of the DDL. We compare their performance in reducing overfitting and improving the generalisation error. The DDL performance is also compared to the classical DL (without differentiation) one in the case of Feed-Forward Neural Networks. We show that the DDL outperforms the DL.
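
The core idea of DDL, training on "differentials of labels to features" as well as on the labels themselves, can be illustrated with a toy loss. This is a minimal sketch and not the authors' implementation: the weighting `lam`, the function names, and the sine target are all assumptions, and the model gradient is written analytically where a real network would obtain it by automatic differentiation.

```python
import math

def value_loss(model, xs, ys):
    """Mean squared error on the labels only (classical DL)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def differential_loss(model_grad, xs, dys):
    """Mean squared error between model derivatives and label differentials."""
    return sum((model_grad(x) - dy) ** 2 for x, dy in zip(xs, dys)) / len(xs)

def ddl_loss(model, model_grad, xs, ys, dys, lam=1.0):
    """Combined objective: label error plus lam times differential error."""
    return value_loss(model, xs, ys) + lam * differential_loss(model_grad, xs, dys)

# Toy target f(x) = sin(x), sampled exactly where the labels are (numerically) zero.
xs = [k * math.pi for k in range(-2, 3)]
ys = [math.sin(x) for x in xs]
dys = [math.cos(x) for x in xs]   # differentials of labels with respect to the feature

def flat(x):
    return 0.0        # a wrong model that still matches every label

def flat_grad(x):
    return 0.0        # its derivative

plain = value_loss(flat, xs, ys)                   # cannot tell flat from sin
combined = ddl_loss(flat, flat_grad, xs, ys, dys)  # penalizes the wrong slope
exact = ddl_loss(math.sin, math.cos, xs, ys, dys)  # the true function scores ~0
```

On these samples the label-only loss is essentially zero for the flat model, while the differential term exposes the mismatch. That extra signal per sample is one intuition for why DDL can train faster and generalize better.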

Summary
Stochastic volatility models can capture the main stylized facts of implied volatility surfaces, but they take a long time to calibrate. Deep Learning (DL) techniques have recently been used to build fast and accurate solutions to the calibration problem. Huge and Savine proposed Differential Deep Learning (DDL), in which machine learning models are trained on samples of not only features and labels but also differentials of labels to features. This work applies DDL to price vanilla European options (the calibration instruments), specifically puts under a Heston model, and then calibrates the model on the trained network. DDL allows fast training and accurate pricing, and the trained network greatly reduces Heston calibration time. The work also introduces several regularization techniques, applies them notably to DDL, and compares how well they reduce overfitting and improve the generalization error. DDL is compared with classical DL (without differentiation) for feed-forward neural networks, and DDL outperforms DL.

Two Timin’: Repairing Smart Contracts With A Two-Layered Approach

paper_url: http://arxiv.org/abs/2309.07841
repo_url: None
paper_authors: Abhinav Jain, Ehan Masud, Michelle Han, Rohan Dhillon, Sumukh Rao, Arya Joshi, Salar Cheema, Saurav Kumar
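for: This study proposes a two-layered framework for automatically classifying and repairing vulnerabilities in smart contracts.
methods: The first layer classifies and the second layer directly repairs malicious contracts. Slither's vulnerability report is combined with the source code and passed through a pre-trained RandomForestClassifier (RFC) and Large Language Models (LLMs). The repair models are built from a pre-trained GPT-3.5-Turbo and a fine-tuned Llama-2-7B.
results: The repair models built from GPT-3.5-Turbo and fine-tuned Llama-2-7B reduce the overall vulnerability count by 97.5% and 96.7% respectively. Manual inspection shows that all repaired contracts retain functionality, indicating that the method is appropriate for automatic batch classification and repair of vulnerabilities in smart contracts.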

Abstract
Due to the modern relevance of blockchain technology, smart contracts present both substantial risks and benefits. Vulnerabilities within them can trigger a cascade of consequences, resulting in significant losses. Many current papers primarily focus on classifying smart contracts for malicious intent, often relying on limited contract characteristics, such as bytecode or opcode. This paper proposes a novel, two-layered framework: 1) classifying and 2) directly repairing malicious contracts. Slither's vulnerability report is combined with source code and passed through a pre-trained RandomForestClassifier (RFC) and Large Language Models (LLMs), classifying and repairing each suggested vulnerability. Experiments demonstrate the effectiveness of fine-tuned and prompt-engineered LLMs. The smart contract repair models, built from pre-trained GPT-3.5-Turbo and fine-tuned Llama-2-7B models, reduced the overall vulnerability count by 97.5% and 96.7% respectively. A manual inspection of repaired contracts shows that all retain functionality, indicating that the proposed method is appropriate for automatic batch classification and repair of vulnerabilities in smart contracts.

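Summary
Given the modern relevance of blockchain technology, smart contracts carry both substantial risks and benefits. Vulnerabilities in them can trigger a cascade of consequences and significant losses. Many current papers focus on classifying smart contracts for malicious intent, often relying on limited contract characteristics such as bytecode or opcode. This paper proposes a new two-layered framework: 1) classifying and 2) directly repairing malicious contracts. Slither's vulnerability report is combined with the source code and passed through a pre-trained RandomForestClassifier (RFC) and Large Language Models (LLMs), which classify and repair each suggested vulnerability. Experiments demonstrate the effectiveness of fine-tuned and prompt-engineered LLMs. The repair models, built from pre-trained GPT-3.5-Turbo and fine-tuned Llama-2-7B, reduced the overall vulnerability count by 97.5% and 96.7% respectively. Manual inspection shows that all repaired contracts retain functionality, indicating that the method is suitable for automatic batch classification and repair of vulnerabilities in smart contracts.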

paper_url: http://arxiv.org/abs/2309.07832
repo_url: https://github.com/kasunweerkoon/VAPOR
paper_authors: Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Mohamed Elnoor, Dinesh Manocha
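for: This study proposes a new method that uses offline Reinforcement Learning (RL) for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments.
methods: An RL policy is trained with an actor-critic network on data collected in real outdoor vegetation, learning the physical and geometric properties of surrounding obstacles. The trained critic then scores dynamically feasible velocities produced by a context-aware planner.
results: On a Spot robot in complex real-world outdoor scenes, VAPOR improves success rates by up to 40%, decreases average current consumption by up to 2.9%, and decreases normalized trajectory length by up to 11.2% compared with existing methods.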

Abstract
We present VAPOR, a novel method for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments using offline Reinforcement Learning (RL). Our method trains a novel RL policy using an actor-critic network and arbitrary data collected in real outdoor vegetation. Our policy uses height and intensity-based cost maps derived from 3D LiDAR point clouds, a goal cost map, and processed proprioception data as state inputs, and learns the physical and geometric properties of the surrounding obstacles such as height, density, and solidity/stiffness. The fully-trained policy's critic network is then used to evaluate the quality of dynamically feasible velocities generated from a novel context-aware planner. Our planner adapts the robot's velocity space based on the presence of entrapment inducing vegetation, and narrow passages in dense environments. We demonstrate our method's capabilities on a Spot robot in complex real-world outdoor scenes, including dense vegetation. We observe that VAPOR's actions improve success rates by up to 40%, decrease the average current consumption by up to 2.9%, and decrease the normalized trajectory length by up to 11.2% compared to existing end-to-end offline RL and other outdoor navigation methods.

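Summary
We present VAPOR, a new method for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments using offline Reinforcement Learning (RL). The method trains an RL policy with an actor-critic network on arbitrary data collected in real outdoor vegetation. The policy takes height and intensity-based cost maps, a goal cost map, and processed proprioception data as state inputs, and learns physical and geometric properties of surrounding obstacles such as height, density, and solidity/stiffness. The critic network of the fully trained policy is then used to evaluate dynamically feasible velocities generated by a context-aware planner, which adapts the robot's velocity space to entrapment-inducing vegetation and narrow passages in dense environments. The method is demonstrated on a Spot robot in complex real-world outdoor scenes, including dense vegetation. VAPOR improves success rates by up to 40%, decreases average current consumption by up to 2.9%, and decreases normalized trajectory length by up to 11.2% compared with existing end-to-end offline RL and other outdoor navigation methods.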

Large-scale Weakly Supervised Learning for Road Extraction from Satellite Imagery

paper_url: http://arxiv.org/abs/2309.07823
repo_url: None
paper_authors: Shiqiao Meng, Zonglin Di, Siwei Yang, Yin Wang
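for: This paper proposes a deep learning method for automatic road extraction from satellite imagery as an alternative to traditional manual mapping.
methods: The method uses large-scale satellite imagery with OpenStreetMap road data as weak labels to pre-train semantic segmentation models, using the D-LinkNet architecture with a ResNet-50 backbone.
results: Prediction accuracy increases with the amount of weakly labeled data and with the road density of the training areas, and the model exceeds the top performer on the current DeepGlobe leaderboard. Thanks to large-scale pre-training, the model also generalizes much better than models trained only on curated datasets.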

Abstract
Automatic road extraction from satellite imagery using deep learning is a viable alternative to traditional manual mapping. Therefore it has received considerable attention recently. However, most of the existing methods are supervised and require pixel-level labeling, which is tedious and error-prone. To make matters worse, the earth has a diverse range of terrain, vegetation, and man-made objects. It is well known that models trained in one area generalize poorly to other areas. Various shooting conditions such as light and angle, as well as different image processing techniques further complicate the issue. It is impractical to develop training data to cover all image styles. This paper proposes to leverage OpenStreetMap road data as weak labels and large scale satellite imagery to pre-train semantic segmentation models. Our extensive experimental results show that the prediction accuracy increases with the amount of the weakly labeled data, as well as the road density in the areas chosen for training. Using as much as 100 times more data than the widely used DeepGlobe road dataset, our model with the D-LinkNet architecture and the ResNet-50 backbone exceeds the top performer of the current DeepGlobe leaderboard. Furthermore, due to large-scale pre-training, our model generalizes much better than those trained with only the curated datasets, implying great application potential.
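
Using OpenStreetMap roads as weak labels amounts to turning road centerlines into a pixel mask aligned with each image tile. The sketch below shows one simple way to do that. It assumes the polylines are already projected into pixel coordinates, and the function names and the fixed `half_width` buffer are illustrative assumptions; the abstract does not describe the authors' rasterization.

```python
import math

def _dist_to_segment(px, py, x1, y1, x2, y2):
    """Euclidean distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / length_sq))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))

def rasterize_roads(polylines, height, width, half_width=1.0):
    """Return a height x width 0/1 mask; a pixel is 1 when its center lies
    within half_width of any road segment. Points are (x, y) in pixel units."""
    mask = [[0] * width for _ in range(height)]
    for line in polylines:
        for (x1, y1), (x2, y2) in zip(line, line[1:]):
            for row in range(height):
                for col in range(width):
                    if _dist_to_segment(col, row, x1, y1, x2, y2) <= half_width:
                        mask[row][col] = 1
    return mask

# One horizontal road across the middle of a 5 x 5 tile.
weak_label = rasterize_roads([[(0, 2), (4, 2)]], height=5, width=5, half_width=0.5)
```

Such masks are "weak" because the road width is guessed and the map may be outdated or misaligned, which is why the abstract frames them as pre-training signal and not as ground truth.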

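Summary
Automatic road extraction from satellite imagery is a viable alternative to manual mapping and has received considerable attention recently. However, most existing methods are supervised and require pixel-level labeling, which is tedious and error-prone. Worse, the earth has diverse terrain, vegetation, and man-made objects, so models trained in one area generalize poorly to others. Different shooting conditions, such as light and viewing angle, and different image processing techniques complicate the problem further, and it is impractical to build training data that covers every image style. This paper proposes using OpenStreetMap road data as weak labels together with large-scale satellite imagery to pre-train semantic segmentation models. Extensive experiments show that prediction accuracy increases with the amount of weakly labeled data and with the road density of the training areas. Using as much as 100 times more data than the DeepGlobe road dataset, the model with the D-LinkNet architecture and a ResNet-50 backbone exceeds the top performer on the current DeepGlobe leaderboard. Thanks to large-scale pre-training, it also generalizes much better than models trained only on curated datasets.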

What Matters to Enhance Traffic Rule Compliance of Imitation Learning for Automated Driving

paper_url: http://arxiv.org/abs/2309.07808
repo_url: None
paper_authors: Hongkuan Zhou, Aifen Sui, Wei Cao, Letian Shi
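for: This study aims to improve the overall performance of end-to-end autonomous driving, in which a single neural network replaces the entire driving pipeline for a simpler structure and faster inference.
methods: The paper proposes P-CSG, a penalty-based imitation learning approach combined with cross semantics generation sensor fusion.
results: On the Town 05 Long benchmark the model improves the driving score by more than 15%. Robustness evaluations also show that it is substantially more robust than baseline models against adversarial attacks such as FGSM and Dot attacks.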

Abstract
More research attention has recently been given to end-to-end autonomous driving technologies where the entire driving pipeline is replaced with a single neural network because of its simpler structure and faster inference time. Despite this appealing approach largely reducing the components in the driving pipeline, its simplicity also leads to interpretability problems and safety issues (arXiv:2003.06404). The trained policy is not always compliant with the traffic rules and it is also hard to discover the reason for the misbehavior because of the lack of intermediate outputs. Meanwhile, sensors are also critical to autonomous driving's security and feasibility to perceive the surrounding environment under complex driving scenarios. In this paper, we proposed P-CSG, a novel penalty-based imitation learning approach with cross semantics generation sensor fusion technologies to increase the overall performance of end-to-end autonomous driving. We conducted an assessment of our model's performance using the Town 05 Long benchmark, achieving an impressive driving score improvement of over 15%. Furthermore, we conducted robustness evaluations against adversarial attacks like FGSM and Dot attacks, revealing a substantial increase in robustness compared to baseline models. More detailed information, such as code-based resources, ablation studies and videos can be found at https://hk-zh.github.io/p-csg-plus.
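
For readers unfamiliar with FGSM, one of the attacks named in the robustness evaluation: it perturbs an input by a fixed step `epsilon` in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a two-feature logistic classifier. The weights, the input, and `epsilon` are made-up values for illustration and have nothing to do with the driving model in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, label, epsilon):
    """x_adv = x + epsilon * sign(dL/dx) for the cross-entropy loss L.
    For a logistic model, dL/dx = (p - label) * weights."""
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

weights, bias = [2.0, -1.0], 0.0     # assumed toy classifier
x, label = [0.5, 0.2], 1             # correctly classified as class 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.3)

clean_prob = predict(weights, bias, x)
adv_prob = predict(weights, bias, x_adv)
```

Here a perturbation of at most 0.3 per feature is enough to push the predicted probability below 0.5, flipping the decision. A robustness evaluation of the kind the abstract mentions measures how much performance degrades under such bounded perturbations.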

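Summary
Research attention has recently shifted toward end-to-end autonomous driving because of its simpler structure and faster inference time. Although this approach greatly reduces the number of components in the driving pipeline, its simplicity also leads to interpretability problems and safety issues (arXiv:2003.06404). The trained policy does not always comply with traffic rules, and the lack of intermediate outputs makes it hard to find the cause of misbehavior. Sensors are also critical to the security and feasibility of autonomous driving. This paper proposes a penalty-based imitation learning method combined with cross semantics generation sensor fusion to raise the overall performance of end-to-end autonomous driving. Evaluated on the Town 05 Long benchmark, the model improves the driving score by more than 15%. Robustness evaluations against adversarial attacks such as FGSM and Dot attacks show a substantial improvement over baseline models. More details, including code resources, ablation studies, and videos, are available at https://hk-zh.github.io/p-csg-plus.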

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

paper_url: http://arxiv.org/abs/2309.08637
repo_url: None
paper_authors: Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
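for: This study aims to equip large language models with multi-turn interleaved multimodal instruction-following capabilities.
methods: The study introduces TextBind, an almost annotation-free framework that requires only image-caption pairs, and MIM, a language model-centric architecture that integrates image encoder and decoder models.
results: TextBind generates multi-turn multimodal instruction-response conversations from a language model, and MIM accommodates interleaved image-text inputs and outputs. The dataset, model, and demo are released.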

Abstract
Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

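Summary
Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. Through their natural language interfaces, these models generalize exceptionally well to a wide range of real-world tasks. However, their performance depends heavily on high-quality exemplar data, which is often hard to obtain, and the challenge is even greater for multimodal instruction following. We introduce TextBind, an almost annotation-free framework that gives larger language models multi-turn interleaved multimodal instruction-following capabilities. The approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To handle interleaved image-text inputs and outputs, we design MIM, a language model-centric architecture that integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research on multimodal instruction following.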

TiBGL: Template-induced Brain Graph Learning for Functional Neuroimaging Analysis

paper_url: http://arxiv.org/abs/2309.07947
repo_url: None
paper_authors: Xiangzhu Meng, Wei Wei, Qiang Liu, Shu Wu, Liang Wang
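for: This study aims to improve diagnosis based on functional magnetic resonance imaging by extracting template brain graphs that reduce noise and improve diagnostic performance.
methods: The study proposes a new brain graph learning framework, Template-induced Brain Graph Learning (TiBGL), which is both discriminative and interpretable. It extracts a template brain graph for each group to remove noise and highlight important connectivity patterns.
results: Experiments on three real-world datasets show that TiBGL achieves superior performance and is consistent with findings in the neuroscience literature.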

Abstract
In recent years, functional magnetic resonance imaging has emerged as a powerful tool for investigating the human brain's functional connectivity networks. Related studies demonstrate that functional connectivity networks in the human brain can help to improve the efficiency of diagnosing neurological disorders. However, there still exist two challenges that limit the progress of functional neuroimaging. Firstly, there exists an abundance of noise and redundant information in functional connectivity data, resulting in poor performance. Secondly, existing brain network models have tended to prioritize either classification performance or the interpretation of neuroscience findings behind the learned models. To deal with these challenges, this paper proposes a novel brain graph learning framework called Template-induced Brain Graph Learning (TiBGL), which has both discriminative and interpretable abilities. Motivated by the related medical findings on functional connectivities, TiBGL proposes template-induced brain graph learning to extract template brain graphs for all groups. The template graph can be regarded as an augmentation process on brain networks that removes noise information and highlights important connectivity patterns. To simultaneously support the tasks of discrimination and interpretation, TiBGL further develops template-induced convolutional neural network and template-induced brain interpretation analysis. In particular, the former fuses rich information from brain graphs and template brain graphs for brain disorder tasks, and the latter can provide insightful connectivity patterns related to brain disorders based on template brain graphs. Experimental results on three real-world datasets show that the proposed TiBGL can achieve superior performance compared with nine state-of-the-art methods and remains consistent with neuroscience findings in the recent literature.

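Summary
In recent years, functional magnetic resonance imaging has become a powerful tool for studying the functional connectivity networks of the human brain. Related studies show that these networks can improve the efficiency of diagnosing neurological disorders. Two challenges still limit progress. First, functional connectivity data contains abundant noise and redundant information, which hurts performance. Second, existing brain network models tend to prioritize either classification performance or the interpretation of neuroscience findings. To address both, this paper proposes Template-induced Brain Graph Learning (TiBGL), a framework that is both discriminative and interpretable. Motivated by medical findings on functional connectivity, TiBGL extracts template brain graphs for all groups. A template graph can be viewed as an augmentation of brain networks that removes noise and highlights important connectivity patterns. To support discrimination and interpretation together, TiBGL further develops a template-induced convolutional neural network and a template-induced brain interpretation analysis. The former fuses information from brain graphs and template brain graphs for brain disorder tasks, and the latter provides connectivity patterns related to brain disorders based on the template graphs. Experiments on three real-world datasets show that TiBGL outperforms nine state-of-the-art methods while remaining consistent with recent neuroscience findings.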

Variational Quantum Linear Solver enhanced Quantum Support Vector Machine

paper_url: http://arxiv.org/abs/2309.07770
repo_url: None
paper_authors: Jianming Yi, Kalyani Suresh, Ali Moghiseh, Norbert Wehn
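for: Using quantum resources for supervised machine learning tasks such as classification.
methods: The paper proposes the Variational Quantum Linear Solver (VQLS) enhanced Quantum Support Vector Machine (QSVM), which uses the variational quantum linear solver to solve the system of linear equations of a least squares-SVM on a NISQ device.
results: In numerical experiments on the Iris dataset, the approach builds classifiers in feature spaces ranging from one to seven dimensions, is reported to identify a separating hyperplane in an 8-dimensional feature space, and performs consistently well across instances of the same dataset.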

Abstract
Quantum Support Vector Machines (QSVM) play a vital role in using quantum resources for supervised machine learning tasks, such as classification. However, current methods are strongly limited in terms of scalability on Noisy Intermediate Scale Quantum (NISQ) devices. In this work, we propose a novel approach called the Variational Quantum Linear Solver (VQLS) enhanced QSVM. This is built upon our idea of utilizing the variational quantum linear solver to solve system of linear equations of a least squares-SVM on a NISQ device. The implementation of our approach is evaluated by an extensive series of numerical experiments with the Iris dataset, which consists of three distinct iris plant species. Based on this, we explore the practicality and effectiveness of our algorithm by constructing a classifier capable of classification in a feature space ranging from one to seven dimensions. Furthermore, by strategically exploiting both classical and quantum computing for various subroutines of our algorithm, we effectively mitigate practical challenges associated with the implementation. These include significant improvement in the trainability of the variational ansatz and notable reductions in run-time for cost calculations. Based on the numerical experiments, our approach exhibits the capability of identifying a separating hyperplane in an 8-dimensional feature space. Moreover, it consistently demonstrated strong performance across various instances with the same dataset.
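
The "system of linear equations of a least squares-SVM" is what the quantum solver is asked to handle. As a purely classical illustration, the sketch below builds the standard LS-SVM system in its function-estimation form, `[[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]`, and solves it with Gaussian elimination standing in for VQLS. The toy data, the linear kernel, and `gamma` are assumptions, and nothing here reflects the paper's quantum circuit.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting (classical stand-in for VQLS)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_ls_svm(points, labels, gamma=10.0):
    """Assemble and solve the (n+1) x (n+1) LS-SVM system; returns (bias, alphas)."""
    n = len(points)
    matrix = [[0.0] + [1.0] * n]
    for i in range(n):
        kernel_row = [
            linear_kernel(points[i], points[j]) + (1.0 / gamma if i == j else 0.0)
            for j in range(n)
        ]
        matrix.append([1.0] + kernel_row)
    solution = solve_linear_system(matrix, [0.0] + [float(y) for y in labels])
    return solution[0], solution[1:]

def decision(points, alphas, bias, x):
    """Signed score; its sign is the predicted class."""
    return sum(a * linear_kernel(p, x) for a, p in zip(alphas, points)) + bias

points = [[-2.0], [-1.0], [1.0], [2.0]]   # toy one-dimensional data
labels = [-1, -1, 1, 1]
bias, alphas = train_ls_svm(points, labels)
predictions = [1 if decision(points, alphas, bias, p) >= 0 else -1 for p in points]
```

Because training reduces to one linear solve instead of a quadratic program, the LS-SVM formulation is a natural fit for a linear-system solver, which is presumably why it was chosen as the target for VQLS.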

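Summary
Quantum Support Vector Machines (QSVM) play an important role in using quantum resources for supervised machine learning tasks. However, current methods scale poorly on Noisy Intermediate Scale Quantum (NISQ) devices. This work proposes the Variational Quantum Linear Solver (VQLS) enhanced QSVM, which uses the variational quantum linear solver to solve the system of linear equations of a least squares-SVM on a NISQ device. The implementation is evaluated through extensive numerical experiments on the Iris dataset, which covers three iris plant species. On this basis, the algorithm is shown to build a classifier in feature spaces ranging from one to seven dimensions. By strategically combining classical and quantum computing across subroutines, the approach mitigates practical challenges, improving the trainability of the variational ansatz and reducing the run-time of cost calculations. According to the numerical experiments, the approach can identify a separating hyperplane in an 8-dimensional feature space and performs strongly across different instances of the same dataset.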

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

paper_url: http://arxiv.org/abs/2309.07760
repo_url: https://github.com/minhanh151/respro
paper_authors: Anh Pham Thi Minh
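for: This paper aims to improve how well prompts learned for pre-trained vision-language models such as CLIP generalize to unseen classes, while avoiding manual prompt engineering so that such models can be deployed in practice.
methods: Building on the prompt learning idea of Context Optimization (CoOp), which uses learnable textual tokens, PRE employs a prompt encoder to reparameterize the input prompt embeddings instead of optimizing the prompts directly.
results: Experiments and extensive ablation studies on 8 benchmarks show that the approach is efficient. In the 16-shot setting, PRE improves average accuracy on New classes by 5.60% and the Harmonic mean by 3% compared to CoOp, all within a good training time.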

Abstract
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.

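Summary
Large pre-trained vision-language models such as CLIP have shown great potential for zero-shot transfer to downstream tasks. To reach optimal performance, however, prompts must be selected manually to improve the alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the main obstacle to deploying such models in practice, because it requires domain expertise and is extremely time-consuming. To avoid it, Context Optimization (CoOp) introduced prompt learning to the vision domain using learnable textual tokens. CoOp achieves substantial improvements over manual prompts, but its learned context generalizes poorly to wider unseen classes within the same dataset. This work presents Prompt Learning with Reparameterization Encoder (PRE), a simple and efficient method that improves the generalization of learnable prompts to unseen classes while retaining the capacity to learn Base classes. Instead of optimizing the prompts directly, PRE uses a prompt encoder to reparameterize the input prompt embeddings, which enhances the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks show that the approach is efficient: in the 16-shot setting, PRE improves average accuracy on New classes by 5.60% and the Harmonic mean by 3% compared to CoOp, all within a good training time.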

Generative AI Text Classification using Ensemble LLM Approaches

paper_url: http://arxiv.org/abs/2309.07755
repo_url: None
paper_authors: Harika Abburi, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen, Sanmitra Bhattacharya
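for: This study aims to determine whether a text was written by a human or generated by AI, and to attribute generated text to a specific language model.
methods: An ensemble neural model generates probabilities from several pre-trained LLMs, which are then used as features for a Traditional Machine Learning (TML) classifier.
results: On the first task, the model ranked fifth and thirteenth for English and Spanish texts (macro $F1$ scores of 0.733 and 0.649). On the second task, the model ranked first for both English and Spanish texts (macro $F1$ scores of 0.625 and 0.653).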

Abstract
Large Language Models (LLMs) have shown impressive performance across a variety of Artificial Intelligence (AI) and natural language processing tasks, such as content creation, report generation, etc. However, unregulated malign application of these models can create undesirable consequences such as generation of fake news, plagiarism, etc. As a result, accurate detection of AI-generated language can be crucial in responsible usage of LLMs. In this work, we explore 1) whether a certain body of text is AI generated or written by human, and 2) attribution of a specific language model in generating a body of text. Texts in both English and Spanish are considered. The datasets used in this study are provided as part of the Automated Text Identification (AuTexTification) shared task. For each of the research objectives stated above, we propose an ensemble neural model that generates probabilities from different pre-trained LLMs which are used as features to a Traditional Machine Learning (TML) classifier following it. For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place (with macro $F1$ scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, our model ranked in first place with macro $F1$ scores of 0.625 and 0.653 for English and Spanish texts, respectively.
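
The stacking pattern in the abstract (per-model probabilities become the feature vector of a downstream classifier) can be sketched in a few lines. The probabilities below are invented stand-ins for LLM outputs, and a small hand-written logistic regression plays the role of the TML classifier. The abstract does not say which classifier or which LLMs were used, so every name and number here is an assumption.

```python
import math

def stack_features(per_model_probs):
    """Turn one probability list per model into one feature row per text."""
    return [list(row) for row in zip(*per_model_probs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = score(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical P(text is AI-generated) from two pre-trained models, four texts.
llm_a = [0.9, 0.8, 0.2, 0.1]
llm_b = [0.7, 0.9, 0.3, 0.2]
labels = [1, 1, 0, 0]          # 1 = AI-generated, 0 = human-written

rows = stack_features([llm_a, llm_b])
weights, bias = train_logistic(rows, labels)
predictions = [1 if score(weights, bias, x) >= 0.5 else 0 for x in rows]
```

The appeal of this design is that the downstream classifier can learn how much to trust each model, which simple probability averaging cannot do.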

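Summary
Large Language Models (LLMs) perform impressively across many Artificial Intelligence (AI) and natural language processing tasks, such as content creation and report generation. However, unregulated malign use of these models can have undesirable consequences such as fake news and plagiarism, so accurate detection of AI-generated language can be crucial for the responsible use of LLMs. This work explores two objectives: 1) deciding whether a body of text is AI-generated or written by a human, and 2) attributing a body of text to the specific language model that generated it. Texts in both English and Spanish are considered, using the datasets provided by the Automated Text Identification (AuTexTification) shared task. For each objective, an ensemble neural model generates probabilities from different pre-trained LLMs, which are used as features for a Traditional Machine Learning (TML) classifier. For the first task of distinguishing between AI and human-generated text, the model ranked fifth and thirteenth (macro $F1$ scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, the model ranked first with macro $F1$ scores of 0.625 and 0.653 for English and Spanish texts, respectively.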

AIDPS:Adaptive Intrusion Detection and Prevention System for Underwater Acoustic Sensor Networks

paper_url: http://arxiv.org/abs/2309.07730
repo_url: None
paper_authors: Soumadeep Das, Aryan Mohammadi Pasikhani, Prosanta Gope, John A. Clark, Chintan Patel, Biplab Sikdar
for: The paper proposes an intrusion detection and prevention system for Underwater Acoustic Sensor Networks (UW-ASNs), addressing their lack of security considerations and the resource-constrained nature of their sensor nodes.
methods: The proposed Adaptive decentralised Intrusion Detection and Prevention System (AIDPS) uses machine learning algorithms (e.g., Adaptive Random Forest, light gradient-boosting machine, and K-nearest neighbours) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley) to detect underwater-related attacks.
results: Extensive experiments show that the proposed scheme outperforms state-of-the-art benchmarking methods while providing a wider range of desirable features such as scalability and complexity.

Abstract
Underwater Acoustic Sensor Networks (UW-ASNs) are predominantly used for underwater environments and find applications in many areas. However, a lack of security considerations, the unstable and challenging nature of the underwater environment, and the resource-constrained nature of the sensor nodes used for UW-ASNs (which makes them incapable of adopting security primitives) make the UW-ASN prone to vulnerabilities. This paper proposes an Adaptive decentralised Intrusion Detection and Prevention System called AIDPS for UW-ASNs. The proposed AIDPS can improve the security of the UW-ASNs so that they can efficiently detect underwater-related attacks (e.g., blackhole, grayhole and flooding attacks). To determine the most effective configuration of the proposed construction, we conduct a number of experiments using several state-of-the-art machine learning algorithms (e.g., Adaptive Random Forest (ARF), light gradient-boosting machine, and K-nearest neighbours) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley). Our experimental results show that incremental ARF using ADWIN provides optimal performance when implemented with One-class support vector machine (SVM) anomaly-based detectors. Furthermore, our extensive evaluation results also show that the proposed scheme outperforms state-of-the-art bench-marking methods while providing a wider range of desirable features such as scalability and complexity.
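The abstract names Page-Hinkley as one of the concept drift detectors evaluated. As a rough illustration of what such a detector does, here is a minimal textbook-style sketch of the Page-Hinkley test for an upward shift in a stream's mean. It is not the authors' implementation; the class name, the `first_drift` helper, and the default `delta` and `threshold` values are assumptions chosen for the example.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a data stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated (assumed default)
        self.threshold = threshold  # alarm threshold, often written lambda (assumed default)
        self.n = 0                  # number of observations seen so far
        self.mean = 0.0             # running mean of the stream
        self.cum = 0.0              # cumulative deviation from the running mean
        self.cum_min = 0.0          # smallest cumulative deviation seen so far

    def update(self, x):
        """Consume one observation and return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_drift(stream, **kwargs):
    """Return the index of the first alarm in the stream, or None if there is none."""
    detector = PageHinkley(**kwargs)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None
```

In an adaptive pipeline, an alarm of this kind would typically trigger retraining or replacement of the current model; how AIDPS reacts to an alarm is not described in the abstract.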

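Summary
Underwater Acoustic Sensor Networks (UW-ASNs) are widely used in underwater environments, but insufficient security considerations, the unstable underwater environment, and resource-constrained sensor nodes leave them prone to attack. This paper proposes an Adaptive decentralised Intrusion Detection and Prevention System (AIDPS) that improves the security of UW-ASNs and efficiently detects underwater-related attacks (e.g., blackhole, grayhole, and flooding attacks). To find the most effective configuration, we run a series of experiments with several state-of-the-art machine learning algorithms (e.g., Adaptive Random Forest (ARF), light gradient-boosting machine, and K-nearest neighbours) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley). The results show that incremental ARF with ADWIN performs best when combined with One-class support vector machine (SVM) anomaly-based detectors. Our extensive evaluation also shows that the proposed scheme outperforms state-of-the-art benchmarking methods while offering desirable features such as scalability and complexity.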

ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? (ver. 23Q3)

paper_url: http://arxiv.org/abs/2309.08636
repo_url: None
paper_authors: Edisa Lozić, Benjamin Štular
for: This paper analyzes the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology.
methods: The methodology used tagging AI-generated content for quantitative accuracy and qualitative precision by human experts.
results: The AI chatbots demonstrated proficiency in recombining existing knowledge but failed in generating original scientific content. Additionally, the paper highlights the challenges AI chatbots face in emulating human originality in scientific writing.

Abstract
Historically, proficient writing was deemed essential for human advancement, with creative expression viewed as one of the hallmarks of human achievement. However, recent advances in generative AI have marked an inflection point in this narrative, including for scientific writing. This article provides a comprehensive analysis of the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology. The methodology was based on tagging AI generated content for quantitative accuracy and qualitative precision by human experts. Quantitative accuracy assessed the factual correctness, while qualitative precision gauged the scientific contribution. While the AI chatbots, especially ChatGPT-4, demonstrated proficiency in recombining existing knowledge, they failed in generating original scientific content. As a side note, our results also suggest that with ChatGPT-4 the size of the LLMs has plateaued. Furthermore, the paper underscores the intricate and recursive nature of human research. This process of transforming raw data into refined knowledge is computationally irreducible, which highlights the challenges AI chatbots face in emulating human originality in scientific writing. In conclusion, while large language models have revolutionised content generation, their ability to produce original scientific contributions in the humanities remains limited. We expect that this will change in the near future with the evolution of current LLM-based AI chatbots towards LLM-powered software.

NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches

paper_url: http://arxiv.org/abs/2309.07704
repo_url: None
paper_authors: Chi-en Amy Tai, Matthew Keller, Saeejith Nair, Yuhao Chen, Yifan Wu, Olivia Markham, Krish Parmar, Pengcheng Xi, Heather Keller, Sharon Kirkpatrick, Alexander Wong
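for: This paper aims to improve the accuracy of automatic dietary intake estimation from food images.
methods: The paper uses computer vision and machine learning techniques to estimate dietary intake automatically.
results: The paper introduces a large-scale synthetic food image dataset (NutritionVerse-Synth) and a real image dataset (NutritionVerse-Real), and analyzes and benchmarks estimation approaches on these data.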

Abstract
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.

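Summary
Accurate dietary intake estimation is critical for informing policies and programs that support healthy eating, since malnutrition has been directly linked to decreased quality of life. Self-reporting methods such as food diaries, however, suffer from substantial bias. Conventional dietary assessment techniques and emerging alternatives such as mobile applications incur high time costs and may require trained personnel. Recent work uses computer vision and machine learning to estimate dietary intake automatically from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities, and food annotations limits the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). We also collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes, to evaluate realism. Using these datasets, we develop and benchmark NutritionVerse, an empirical study of dietary intake estimation approaches that covers indirect segmentation-based networks and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to gain insight into fusing synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth and NutritionVerse-Real) at https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.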

Tree of Uncertain Thoughts Reasoning for Large Language Models

paper_url: http://arxiv.org/abs/2309.07694
repo_url: None
paper_authors: Shentong Mo, Miao Xin
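for: Improve the decision-making precision of Large Language Models (LLMs), especially when they face multiple candidate decisions.
methods: Uses Monte Carlo Dropout to quantify the local decision uncertainty of LLMs, then combines these uncertainty estimates with global search algorithms.
results: Experiments on two demanding planning tasks (Game of 24 and Mini Crosswords) show that TouT outperforms both ToT and chain-of-thought prompting methods.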

Abstract
While the recently introduced Tree of Thoughts (ToT) has heralded advancements in allowing Large Language Models (LLMs) to reason through foresight and backtracking for global decision-making, it has overlooked the inherent local uncertainties in intermediate decision points or "thoughts". These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process. Addressing this pivotal gap, we introduce the Tree of Uncertain Thoughts (TouT) - a reasoning framework tailored for LLMs. Our TouT effectively leverages Monte Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse local responses at these intermediate steps. By marrying this local uncertainty quantification with global search algorithms, TouT enhances the model's precision in response generation. We substantiate our approach with rigorous experiments on two demanding planning tasks: Game of 24 and Mini Crosswords. The empirical evidence underscores TouT's superiority over both ToT and chain-of-thought prompting methods.
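The abstract does not say how Monte Carlo Dropout is applied inside an LLM, so the following is only a toy illustration of the general idea: keep dropout active at inference time, run several stochastic forward passes, and read the spread of the outputs as an uncertainty score. The linear scorer, the function name `mc_dropout_score`, and all parameter values are assumptions made for the example, not part of TouT.

```python
import random
import statistics


def mc_dropout_score(features, weights, p_drop=0.2, passes=100, seed=0):
    """Run repeated stochastic forward passes of a toy linear scorer.

    Returns (mean output, variance of outputs); the variance serves as a
    simple uncertainty score for this input.
    """
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    outputs = []
    for _ in range(passes):
        out = 0.0
        for x, w in zip(features, weights):
            # Each input unit is kept with probability `keep`; kept units are
            # rescaled (inverted dropout) so the expected output is unchanged.
            if rng.random() < keep:
                out += w * x / keep
        outputs.append(out)
    return statistics.mean(outputs), statistics.pvariance(outputs)
```

In a search procedure, a score like this could be used to down-weight or prune candidate thoughts whose evaluations vary widely across passes; the exact way TouT combines it with global search is described in the paper, not here.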

Summary
The recently introduced Tree of Thoughts (ToT) lets Large Language Models (LLMs) reason through foresight and backtracking for global decision-making, but it overlooks the local uncertainties inherent in intermediate decision points, or "thoughts". These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process. To address this gap, we introduce the Tree of Uncertain Thoughts (TouT), a reasoning framework tailored for LLMs. TouT uses Monte Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse local responses at these intermediate steps. By combining this local uncertainty quantification with global search algorithms, TouT improves the model's precision in response generation. Experiments on two demanding planning tasks, Game of 24 and Mini Crosswords, show that TouT outperforms both ToT and chain-of-thought prompting methods.

Detecting ChatGPT: A Survey of the State of Detecting ChatGPT-Generated Text

paper_url: http://arxiv.org/abs/2309.07689
repo_url: None
paper_authors: Mahdi Dhaini, Wessel Poelman, Ege Erdogan
for: This survey examines how to distinguish human-written text from text generated by large language models (LLMs), in order to safeguard the integrity of text.
methods: It reviews current approaches, covering how datasets are constructed, which detection methods are used, and how the qualities of human-written and ChatGPT-generated text compare.
results: It summarizes the state of the art in detecting ChatGPT-generated text, including the datasets constructed for this task, the methods employed, and the qualitative analyses performed to understand the characteristics of human- versus ChatGPT-generated text.

Abstract
While recent advancements in the capabilities and widespread accessibility of generative language models, such as ChatGPT (OpenAI, 2022), have brought about various benefits by generating fluent human-like text, the task of distinguishing between human- and large language model (LLM) generated text has emerged as a crucial problem. These models can potentially deceive by generating artificial text that appears to be human-generated. This issue is particularly significant in domains such as law, education, and science, where ensuring the integrity of text is of the utmost importance. This survey provides an overview of the current approaches employed to differentiate between texts generated by humans and ChatGPT. We present an account of the different datasets constructed for detecting ChatGPT-generated text, the various methods utilized, what qualitative analyses into the characteristics of human versus ChatGPT-generated text have been performed, and finally, summarize our findings into general insights.

Summary
Recent advances in the capabilities and accessibility of generative language models such as ChatGPT (OpenAI, 2022) have brought many benefits by generating fluent, human-like text. Distinguishing human-written text from text generated by large language models (LLMs) has, however, become an important problem. These models can deceive by producing artificial text that appears human-written, which matters especially in law, education, and science, where the integrity of text is paramount. This survey reviews current approaches for telling human-written and ChatGPT-generated text apart, covering the datasets constructed for the task, the methods used, qualitative analyses of the characteristics of human versus ChatGPT-generated text, and a summary of general insights.

deepFDEnet: A Novel Neural Network Architecture for Solving Fractional Differential Equations

paper_url: http://arxiv.org/abs/2309.07684
repo_url: None
paper_authors: Ali Nosrati Firoozsalari, Hassan Dana Mazraeh, Alireza Afzal Aghaei, Kourosh Parand
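for: This research proposes a novel deep neural network architecture for accurately solving different types of fractional differential equations.
methods: The architecture uses a Gaussian integration rule and an $L_1$ discretization technique, with a deep neural network approximating the unknown function in each equation.
results: Experiments show that the architecture solves different forms of fractional differential equations with high precision, including a fractional ordinary differential equation, a fractional order integrodifferential equation, and a fractional order partial differential equation.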

Abstract
The primary goal of this research is to propose a novel architecture for a deep neural network that can solve fractional differential equations accurately. A Gaussian integration rule and a $L_1$ discretization technique are used in the proposed design. In each equation, a deep neural network is used to approximate the unknown function. Three forms of fractional differential equations have been examined to highlight the method's versatility: a fractional ordinary differential equation, a fractional order integrodifferential equation, and a fractional order partial differential equation. The results show that the proposed architecture solves different forms of fractional differential equations with excellent precision.
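The abstract does not spell out the $L_1$ discretization. Assuming it refers to the standard L1 scheme for the Caputo derivative of order $0 < \alpha < 1$ on a uniform grid $t_j = jh$, the approximation at $t_n$ is $\frac{h^{-\alpha}}{\Gamma(2-\alpha)} \sum_{k=0}^{n-1} b_k \left[u(t_{n-k}) - u(t_{n-k-1})\right]$ with $b_k = (k+1)^{1-\alpha} - k^{1-\alpha}$. The sketch below implements that formula; the function name `caputo_l1` is hypothetical, and in the paper's setting the grid values would come from the neural network's approximation of the unknown function.

```python
import math


def caputo_l1(u_values, h, alpha):
    """Approximate the Caputo derivative of order alpha (0 < alpha < 1)
    at the last grid point, given samples u(t_0), ..., u(t_n) with step h."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(u_values) - 1
    if n < 1:
        raise ValueError("need at least two grid values")
    total = 0.0
    for k in range(n):
        # L1 weight multiplying the k-th backward difference
        b_k = (k + 1) ** (1 - alpha) - k ** (1 - alpha)
        total += b_k * (u_values[n - k] - u_values[n - k - 1])
    return total * h ** (-alpha) / math.gamma(2 - alpha)
```

Because the L1 scheme interpolates the function linearly between grid points, it reproduces the Caputo derivative of $u(t) = t$, which is $t^{1-\alpha}/\Gamma(2-\alpha)$, up to rounding error; this makes a convenient sanity check.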

Summary
The primary goal of this research is to propose a deep neural network architecture that can solve fractional differential equations accurately. The proposed design uses a Gaussian integration rule and an $L_1$ discretization technique. In each equation, a deep neural network approximates the unknown function. Three forms of fractional differential equations are examined: a fractional ordinary differential equation, a fractional order integrodifferential equation, and a fractional order partial differential equation. The results show that the proposed architecture solves different forms of fractional differential equations with excellent precision.

Assessing the nature of large language models: A caution against anthropocentrism

paper_url: http://arxiv.org/abs/2309.07683
repo_url: None
paper_authors: Ann Speed
for: This study assesses OpenAI's chatbot GPT3.5 to understand its capabilities and personality traits.
methods: The study uses a battery of standard, normed, and validated cognitive and personality measures to test the capabilities of GPT3.5 and their stability.
results: GPT3.5 is unlikely to have developed sentience, but it displayed large variability in cognitive and personality measures over repeated observations and showed signs of poor mental health, such as low self-esteem and dissociation from reality, despite upbeat and helpful responses.

Abstract
Generative AI models garnered a large amount of public attention and speculation with the release of OpenAI's chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed GPT3.5 using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models' capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that GPT3.5 is unlikely to have developed sentience, although its ability to respond to personality inventories is interesting. It did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, GPT3.5 displays what in a human would be considered poor mental health, including low self-esteem and marked dissociation from reality despite upbeat and helpful responses.

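Summary
Our results indicate that GPT3.5 is unlikely to have developed sentience. Although it can respond to personality inventories, it showed large variability over repeated observations, which is inconsistent with a human-like personality. Even so, on cognitive and personality measures GPT3.5 displayed what would be considered poor mental health in a human, including low self-esteem and marked dissociation from reality, despite its upbeat and helpful responses.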

Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation

paper_url: http://arxiv.org/abs/2309.07670
repo_url: None
paper_authors: Fabiola Espinosa Castellon, Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Aurélien Mayoue, Antoine Souloumiac, Cédric Gouy-Pallier
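for: This paper addresses federated domain adaptation, a setting where distributional shift exists among clients and some clients hold only unlabeled data.
methods: It proposes FedDaDiL, a federated domain adaptation method based on dictionary learning of empirical distributions. Clients' distributions represent particular domains, and the clients collectively train a federated dictionary of empirical distributions.
results: Extensive experiments on the (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks show that the approach successfully generates labeled data on the target domain. The paper also compares the method with its centralized counterpart and with other federated domain adaptation methods.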

Abstract
In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some have unlabeled data. The proposed framework, FedDaDiL, tackles the resulting challenge through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions. In particular, we build upon the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, thus enhancing overall privacy compared to its centralized counterpart. We empirically demonstrate that our approach successfully generates labeled data on the target domain with extensive experiments on (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks. Furthermore, we compare our method to its centralized counterpart and other benchmarks in federated domain adaptation.

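Summary
In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some clients have unlabeled data. Our framework, FedDaDiL, tackles the problem through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions across clients. Specifically, we build on the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, which enhances overall privacy compared with the centralized counterpart. Extensive experiments on the (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks demonstrate that our approach successfully generates labeled data on the target domain. We also compare our method with its centralized counterpart and with other federated domain adaptation methods.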

Multi-Source Domain Adaptation meets Dataset Distillation through Dataset Dictionary Learning

paper_url: http://arxiv.org/abs/2309.07666
repo_url: None
paper_authors: Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac
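for: This paper tackles the intersection of two problems, Multi-Source Domain Adaptation (MSDA) and Dataset Distillation (DD), which it calls MSDA-DD.
methods: It adapts previous methods from the MSDA literature, such as Wasserstein Barycenter Transport and Dataset Dictionary Learning, together with the DD method Distribution Matching.
results: Extensive experiments on four benchmarks (Caltech-Office 10, Tennessee-Eastman Process, Continuous Stirred Tank Reactor, and Case Western Reserve University) show that state-of-the-art adaptation performance can be reached with as little as 1 sample per class.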

Abstract
In this paper, we consider the intersection of two problems in machine learning: Multi-Source Domain Adaptation (MSDA) and Dataset Distillation (DD). On the one hand, the first considers adapting multiple heterogeneous labeled source domains to an unlabeled target domain. On the other hand, the second attacks the problem of synthesizing a small summary containing all the information about the datasets. We thus consider a new problem called MSDA-DD. To solve it, we adapt previous works in the MSDA literature, such as Wasserstein Barycenter Transport and Dataset Dictionary Learning, as well as DD method Distribution Matching. We thoroughly experiment with this novel problem on four benchmarks (Caltech-Office 10, Tennessee-Eastman Process, Continuous Stirred Tank Reactor, and Case Western Reserve University), where we show that, even with as little as 1 sample per class, one achieves state-of-the-art adaptation performance.

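Summary
In this paper, we consider the intersection of two problems: Multi-Source Domain Adaptation (MSDA) and Dataset Distillation (DD). MSDA adapts multiple heterogeneous labeled source domains to an unlabeled target domain. DD synthesizes a small summary containing all the information about a dataset. We therefore pose a new problem called MSDA-DD. To solve it, we adapt previous methods from the MSDA literature, such as Wasserstein Barycenter Transport and Dataset Dictionary Learning, as well as the DD method Distribution Matching. We experiment extensively on four standard benchmarks (Caltech-Office 10, Tennessee-Eastman Process, Continuous Stirred Tank Reactor, and Case Western Reserve University) and show that, even with only 1 sample per class, one achieves state-of-the-art adaptation performance.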

Feature Engineering in Learning-to-Rank for Community Question Answering Task

paper_url: http://arxiv.org/abs/2309.07610
repo_url: None
paper_authors: Nafis Sajid, Md Rashidul Hasan, Muhammad Ibrahim
for: This paper aims to improve the ranking of answers in community question answering (CQA) forums by introducing a BERT-based feature and combining both question and answer features.
methods: The proposed framework uses traditional features like TF-IDF and BM25, as well as a BERT-based feature to capture the semantic similarity between questions and answers. The framework also employs rank-learning algorithms that have not been widely used in the CQA domain.
results: The proposed framework achieves state-of-the-art performance on three standard CQA datasets, and the analysis of feature importance provides guidance for practitioners to select a better set of features for the CQA retrieval task.

Abstract
Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users try to provide solutions. Many CQA forums such as Quora, Stackoverflow, Yahoo!Answer, StackExchange exist with a lot of user-generated data. These data are leveraged in automated CQA ranking systems where similar questions (and answers) are presented in response to the query of the user. In this work, we empirically investigate a few aspects of this domain. Firstly, in addition to traditional features like TF-IDF, BM25 etc., we introduce a BERT-based feature that captures the semantic similarity between the question and answer. Secondly, most of the existing research works have focused on features extracted only from the question part; features extracted from answers have not been explored extensively. We combine both types of features in a linear fashion. Thirdly, using our proposed concepts, we conduct an empirical investigation with different rank-learning algorithms, some of which have not been used so far in CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze importance of the features we use in our investigation. This work is expected to guide the practitioners to select a better set of features for the CQA retrieval task.
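BM25 is one of the traditional features the abstract mentions. As a rough illustration of what such a feature computes, here is a minimal sketch of Okapi BM25 scoring of candidate documents against a query. It is not the authors' feature pipeline: the whitespace tokenizer, the function name `bm25_scores`, the `k1` and `b` defaults, and the particular smoothed IDF variant are assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in a non-empty list against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A learning-to-rank model would then receive a score like this as one entry of a feature vector, alongside TF-IDF and the BERT-based similarity the abstract describes.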

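Summary
Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users provide solutions. Many CQA forums, such as Quora, Stack Overflow, Yahoo! Answers, and Stack Exchange, hold large amounts of user-generated data. These data feed automated CQA ranking systems, which present similar questions and answers in response to a user's query. In this work, we empirically investigate several aspects of this domain. First, we introduce a BERT-based feature that captures the semantic similarity between the question and the answer. Second, most existing research extracts features only from the question part rather than from the answers; we combine both types of features in a linear fashion. Third, using our proposed concepts, we run experiments with different rank-learning algorithms, some of which have not been used before in the CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze the importance of the features we use. This work is expected to guide practitioners in selecting a better set of features for the CQA retrieval task.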

Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?

paper_url: http://arxiv.org/abs/2309.07602
repo_url: https://github.com/antklen/sasrec-bert4rec-recsys23
paper_authors: Anton Klenitskiy, Alexey Vasilev
for: Compares the performance of SASRec and BERT4Rec in recommendation tasks, and explores the effectiveness of training SASRec with negative sampling.
methods: Uses Transformer-based models SASRec and BERT4Rec as baselines, and compares their performance with different loss functions and negative sampling strategies.
results: Finds that SASRec outperforms BERT4Rec in both quality and training speed when trained with the same loss function as BERT4Rec, and that SASRec can be trained effectively with negative sampling, provided the number of negative examples is much larger than one.

Abstract
Recently sequential recommendations and next-item prediction task has become increasingly popular in the field of recommender systems. Currently, two state-of-the-art baselines are Transformer-based models SASRec and BERT4Rec. Over the past few years, there have been quite a few publications comparing these two algorithms and proposing new state-of-the-art models. In most of the publications, BERT4Rec achieves better performance than SASRec. But BERT4Rec uses cross-entropy over softmax for all items, while SASRec uses negative sampling and calculates binary cross-entropy loss for one positive and one negative item. In our work, we show that if both models are trained with the same loss, which is used by BERT4Rec, then SASRec will significantly outperform BERT4Rec both in terms of quality and training speed. In addition, we show that SASRec could be effectively trained with negative sampling and still outperform BERT4Rec, but the number of negative examples should be much larger than one.
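The comparison turns on two training losses: cross-entropy over a softmax across all items, and binary cross-entropy over one positive and a number of sampled negatives. A minimal sketch of both, computed for a single prediction step from raw item scores, may make the difference concrete. The function names and the plain-list representation of scores are assumptions for illustration; this is not code from the authors' repository.

```python
import math


def softmax_cross_entropy(scores, positive):
    """Cross-entropy over a softmax across all items (the BERT4Rec-style loss)."""
    m = max(scores)
    # Log of the softmax normalizer, computed stably by subtracting the max score
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive]


def sampled_bce(scores, positive, negatives):
    """Binary cross-entropy on one positive item and a list of sampled negatives."""

    def log_sigmoid(x):
        # Numerically stable log(sigmoid(x))
        if x >= 0:
            return -math.log1p(math.exp(-x))
        return x - math.log1p(math.exp(x))

    loss = -log_sigmoid(scores[positive])
    loss -= sum(log_sigmoid(-scores[j]) for j in negatives)
    return loss
```

With a single negative, `sampled_bce` only ever pushes the positive item above one other item per step, whereas the softmax loss compares it against the whole catalogue; this is consistent with the abstract's finding that negative sampling needs many more than one negative to be competitive.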

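Summary
Sequential recommendation and next-item prediction have recently become increasingly popular in the field of recommender systems. The two current state-of-the-art baselines are the Transformer-based models SASRec and BERT4Rec. Over the past few years, many publications have compared these two algorithms and proposed new state-of-the-art models. In most of them, BERT4Rec performs better than SASRec, but BERT4Rec uses cross-entropy over a softmax across all items, while SASRec uses negative sampling and computes a binary cross-entropy loss. In our work, we show that if both models are trained with the same loss, namely the cross-entropy loss used by BERT4Rec, SASRec significantly outperforms BERT4Rec in both quality and training speed. We also show that SASRec can be trained with negative sampling and still outperform BERT4Rec, but the number of negative examples should be much larger than one.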

Detecting Misinformation with LLM-Predicted Credibility Signals and Weak Supervision

paper_url: http://arxiv.org/abs/2309.07601
repo_url: None
paper_authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
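for: This study investigates whether large language models can be prompted with 18 credibility signals to produce weak labels for content veracity prediction.
methods: The study prompts large language models (LLMs) with a set of 18 credibility signals, then aggregates the resulting, potentially noisy labels using weak supervision to predict content veracity.
results: The approach outperforms state-of-the-art classifiers without using any ground-truth labels for training. The study also analyses the contribution of each credibility signal to veracity prediction, which provides new insights into their role.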

Abstract
Credibility signals represent a wide range of heuristics that are typically used by journalists and fact-checkers to assess the veracity of online content. Automating the task of credibility signal extraction, however, is very challenging as it requires high-accuracy signal-specific extractors to be trained, while there are currently no sufficiently large datasets annotated with all credibility signals. This paper investigates whether large language models (LLMs) can be prompted effectively with a set of 18 credibility signals to produce weak labels for each signal. We then aggregate these potentially noisy labels using weak supervision in order to predict content veracity. We demonstrate that our approach, which combines zero-shot LLM credibility signal labeling and weak supervision, outperforms state-of-the-art classifiers on two misinformation datasets without using any ground-truth labels for training. We also analyse the contribution of the individual credibility signals towards predicting content veracity, which provides new valuable insights into their role in misinformation detection.
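The abstract does not specify which weak supervision model aggregates the per-signal labels. As a deliberately simple stand-in, the sketch below combines weak labels by majority vote with an abstain option; weak supervision frameworks commonly use this as a baseline before fitting a learned label model. The `ABSTAIN` constant, the function name `majority_vote`, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical marker for "this signal gives no vote"


def majority_vote(weak_labels, default=ABSTAIN):
    """Aggregate one item's weak labels by majority vote.

    Abstaining labels are ignored; ties and all-abstain cases return `default`.
    """
    votes = Counter(label for label in weak_labels if label != ABSTAIN)
    if not votes:
        return default
    top = votes.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return default  # tie between the two most frequent labels
    return top[0][0]
```

For example, if each of the 18 signals were mapped to a vote for "misinformation" (1), "not misinformation" (0), or abstain, `majority_vote` would return the label most signals agree on. A learned label model would additionally weight signals by their estimated accuracy.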

Summary
Credibility signals represent a wide range of heuristics that are typically used by journalists and fact-checkers to assess the veracity of online content. Automating the task of credibility signal extraction, however, is very challenging as it requires high-accuracy signal-specific extractors to be trained, while there are currently no sufficiently large datasets annotated with all credibility signals. This paper investigates whether large language models (LLMs) can be prompted effectively with a set of 18 credibility signals to produce weak labels for each signal. We then aggregate these potentially noisy labels using weak supervision in order to predict content veracity. We demonstrate that our approach, which combines zero-shot LLM credibility signal labeling and weak supervision, outperforms state-of-the-art classifiers on two misinformation datasets without using any ground-truth labels for training. We also analyse the contribution of the individual credibility signals towards predicting content veracity, which provides new valuable insights into their role in misinformation detection.

C-Pack: Packaged Resources To Advance General Chinese Embedding

paper_url: http://arxiv.org/abs/2309.07597
repo_url: https://github.com/flagopen/flagembedding
paper_authors: Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff
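for: This paper provides a package of resources to advance the field of general Chinese embeddings.
methods: The package comprises three key resources: 1) C-MTEB, a benchmark for Chinese text embeddings; 2) C-MTP, a massive Chinese text embedding dataset for training; and 3) C-TEM, a family of Chinese embedding models in multiple sizes.
results: The models outperform all prior Chinese text embeddings on C-MTEB by up to +10%. The authors also integrate and optimize the full suite of training methods for C-TEM, and release data and models for English text embeddings that achieve state-of-the-art performance on MTEB. All resources are available at https://github.com/FlagOpen/FlagEmbedding.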

Abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

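Summary
We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10%. We also integrate and optimize the entire suite of training methods. In addition, we release data and models for English text embeddings; the English models achieve state-of-the-art performance on the MTEB benchmark, and the released English data is 2 times larger than the Chinese data. All these resources are publicly available at https://github.com/FlagOpen/FlagEmbedding.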

Neuro-Symbolic Recommendation Model based on Logic Query

paper_url: http://arxiv.org/abs/2309.07594
repo_url: None
paper_authors: Maonian Wu, Bang Chen, Shaojun Zhu, Bo Zheng, Wei Peng, Mingyi Zhang
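for: Provide a recommendation model based on logic and symbolic reasoning that addresses the difficulty existing approaches have with inconsistent and incomplete knowledge in real-world tasks.
methods: User history interactions are transformed into a logic expression, and the recommendation prediction is then cast as a query task based on this expression. The logic expressions are computed with the modular logic operations of a neural network, and an implicit logic encoder is constructed to reduce the complexity of the logic computation.
results: Experiments on three well-known datasets show that the method performs better than existing shallow, deep, session, and reasoning models.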

Abstract
A recommendation system assists users in finding items that are relevant to them. Existing recommendation models are primarily based on predicting relationships between users and items and use complex matching models or incorporate extensive external information to capture association patterns in data. However, recommendation is not only a problem of inductive statistics using data; it is also a cognitive task of reasoning decisions based on knowledge extracted from information. Hence, a logic system could naturally be incorporated for the reasoning in a recommendation task. However, although hard-rule approaches based on logic systems can provide powerful reasoning ability, they struggle to cope with inconsistent and incomplete knowledge in real-world tasks, especially for complex tasks such as recommendation. Therefore, in this paper, we propose a neuro-symbolic recommendation model, which transforms the user history interactions into a logic expression and then transforms the recommendation prediction into a query task based on this logic expression. The logic expressions are then computed based on the modular logic operations of the neural network. We also construct an implicit logic encoder to reasonably reduce the complexity of the logic computation. Finally, a user's interest items can be queried in the vector space based on the computation results. Experiments on three well-known datasets verified that our method performs better compared to state of the art shallow, deep, session, and reasoning models.

Statistically Valid Variable Importance Assessment through Conditional Permutations

paper_url: http://arxiv.org/abs/2309.07593
repo_url: None
paper_authors: Ahmad Chamma, Denis A. Engemann, Bertrand Thirion
for: This paper aims to provide a systematic approach for studying Conditional Permutation Importance (CPI) and to develop reusable benchmarks of state-of-the-art variable importance estimators.
methods: The paper uses a model-agnostic and computationally lean approach to study CPI, which overcomes the limitations of standard permutation importance by providing accurate type-I error control.
results: The paper shows that CPI consistently showed top accuracy across benchmarks when used with a deep neural network, and provides a more parsimonious selection of statistically significant variables in real-world data analysis.

Abstract
Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An empirical benchmark on real-world data analysis in a large-scale medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as drop-in replacement for permutation-based methods.

Summary
Here, we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that CPI overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, CPI consistently showed top accuracy across benchmarks. An empirical benchmark on real-world data analysis in a large-scale medical dataset showed that CPI provides a more parsimonious selection of statistically significant variables. Our results suggest that CPI can be readily used as a drop-in replacement for permutation-based methods.
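To make the contrast between standard and conditional permutation concrete, here is a minimal sketch in plain Python. It is not the paper's estimator or its statistical test. It assumes one common way to permute conditionally (regress the variable of interest on the other covariate, then shuffle only the residuals), and the toy data, the `model` function, and every helper name are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y ~ a + b * x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def marginal_permute(xj, x_other, rng):
    """Standard permutation: shuffle xj and ignore its link to x_other."""
    shuffled = list(xj)
    rng.shuffle(shuffled)
    return shuffled

def conditional_permute(xj, x_other, rng):
    """Shuffle only the part of xj that x_other does not explain."""
    a, b = fit_line(x_other, xj)
    fitted = [a + b * v for v in x_other]
    residuals = [u - f for u, f in zip(xj, fitted)]
    rng.shuffle(residuals)
    return [f + r for f, r in zip(fitted, residuals)]

def mse(model, x_other, xj, y):
    return sum((model(o, j) - t) ** 2 for o, j, t in zip(x_other, xj, y)) / len(y)

def importance(model, x_other, xj, y, permute, rng, n_rounds=20):
    """Mean increase in squared error after permuting xj."""
    base = mse(model, x_other, xj, y)
    total = 0.0
    for _ in range(n_rounds):
        total += mse(model, x_other, permute(xj, x_other, rng), y) - base
    return total / n_rounds

rng = random.Random(0)
x_other = [rng.gauss(0, 1) for _ in range(200)]
xj = [v + 0.1 * rng.gauss(0, 1) for v in x_other]  # strongly correlated with x_other
y = [2.0 * v for v in x_other]                     # the target depends on x_other only

def model(o, j):
    # Hypothetical fitted model that leans on xj because it tracks x_other.
    return o + j

marginal_score = importance(model, x_other, xj, y, marginal_permute, rng)
conditional_score = importance(model, x_other, xj, y, conditional_permute, rng)
print(marginal_score, conditional_score)
```

In this toy setup the marginal score is large even though `xj` adds nothing once `x_other` is known, while the conditional score stays close to zero. That is the failure mode under correlated covariates that the abstract describes.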

Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning

paper_url: http://arxiv.org/abs/2309.07578
repo_url: None
paper_authors: Cristina Pinneri, Sarah Bechtle, Markus Wulfmeier, Arunkumar Byravan, Jingwei Zhang, William F. Whitney, Martin Riedmiller
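for: To improve the ability of offline RL agents to generalize to out-of-distribution goals when learning from a fixed dataset without additional environment interaction.
methods: A dynamics model is learned and checked for equivariance with respect to translations in the state space. An entropy regularizer enlarges the equivariant set, and the dataset is augmented with the resulting transformed samples.
results: On the considered environments, learning a policy on the augmented dataset with an off-the-shelf offline RL algorithm greatly improves the policy's test performance.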

Abstract
We present a novel approach to address the challenge of generalization in offline reinforcement learning (RL), where the agent learns from a fixed dataset without any additional interaction with the environment. Specifically, we aim to improve the agent's ability to generalize to out-of-distribution goals. To achieve this, we propose to learn a dynamics model and check if it is equivariant with respect to a fixed type of transformation, namely translations in the state space. We then use an entropy regularizer to increase the equivariant set and augment the dataset with the resulting transformed samples. Finally, we learn a new policy offline based on the augmented dataset, with an off-the-shelf offline RL algorithm. Our experimental results demonstrate that our approach can greatly improve the test performance of the policy on the considered environments.

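Summary
We propose a new approach to the generalization challenge in offline reinforcement learning (RL), where the agent learns from a fixed dataset without any further interaction with the environment. Our aim is to improve the agent's ability to generalize to out-of-distribution goals. To this end, we learn a dynamics model and check whether it is equivariant with respect to a fixed type of transformation, namely translations in the state space. We then use an entropy regularizer to enlarge the equivariant set and add the resulting transformed samples to the dataset. Finally, we learn a new policy offline on the augmented dataset with an off-the-shelf offline RL algorithm. Our experimental results show that the approach can greatly improve the policy's test performance in the considered environments.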
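The equivariance check and the augmentation step can be illustrated with a small sketch. It only covers the test "does stepping commute with translating the state?" and the resulting data augmentation; the entropy regularizer is not reproduced. The two toy dynamics functions (`free_step`, `walled_step`), the tolerance, and the shift values are hypothetical stand-ins for a learned model.

```python
def translate(state, shift):
    return [s + d for s, d in zip(state, shift)]

def equivariance_error(step, state, action, shift):
    """Largest coordinate gap between step(s + d, a) and step(s, a) + d."""
    moved_then_stepped = step(translate(state, shift), action)
    stepped_then_moved = translate(step(state, action), shift)
    return max(abs(p - q) for p, q in zip(moved_then_stepped, stepped_then_moved))

def augment(dataset, step, shifts, tol=1e-6):
    """Add translated copies of transitions where the model looks equivariant."""
    augmented = list(dataset)
    for state, action, next_state in dataset:
        for shift in shifts:
            if equivariance_error(step, state, action, shift) <= tol:
                augmented.append(
                    (translate(state, shift), action, translate(next_state, shift))
                )
    return augmented

def free_step(state, action):
    # Hypothetical dynamics: a point mass in open space (translation-equivariant).
    return [s + a for s, a in zip(state, action)]

def walled_step(state, action):
    # Hypothetical dynamics: the same point mass inside walls at -1 and 1.
    return [max(-1.0, min(1.0, s + a)) for s, a in zip(state, action)]

dataset = [
    ([0.0, 0.0], [0.1, 0.2], [0.1, 0.2]),
    ([0.5, -0.5], [0.0, 0.1], [0.5, -0.4]),
]
shifts = [[5.0, -3.0], [0.25, 0.25]]

free_augmented = augment(dataset, free_step, shifts)
walled_augmented = augment(dataset, walled_step, shifts)
print(len(free_augmented), len(walled_augmented))
```

With the open-space dynamics every shift passes the check, so each transition gains two translated copies. With walls, the large shift pushes states past the boundary, the check fails, and only the small shift is used for augmentation.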

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

paper_url: http://arxiv.org/abs/2309.07566
repo_url: None
paper_authors: Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao
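for: Direct speech-to-speech translation (S2ST) that produces high-quality translated speech while transferring the source speaker's style.
methods: An acoustic language model built on discrete units from a self-supervised model, combined with a neural codec for style transfer. The acoustic language model acquires the ability for style transfer through self-supervised in-context learning, without relying on any speaker-parallel data.
results: With extensive training data, the model achieves zero-shot cross-lingual style transfer and generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/.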

Abstract
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/ .

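Summary
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it cannot preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data makes it challenging to learn style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model acquires the ability for style transfer through self-supervised in-context learning, without relying on any speaker-parallel data, which overcomes the data scarcity problem. With extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/.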

Masked Generative Modeling with Enhanced Sampling Scheme

paper_url: http://arxiv.org/abs/2309.07945
repo_url: None
paper_authors: Daesoo Lee, Erlend Aune, Sara Malacarne
for: This paper proposes a novel sampling scheme for masked non-autoregressive generative modeling to overcome the limitations of existing sampling methods.
methods: The proposed Enhanced Sampling Scheme (ESS) consists of three stages: Naive Iterative Decoding, Critical Reverse Sampling, and Critical Resampling. ESS uses confidence scores from a self-Token-Critic and the structure of the quantized latent vector space to ensure both sample diversity and fidelity.
results: The proposed ESS shows significant performance gains in both unconditional sampling and class-conditional sampling using all 128 datasets in the UCR Time Series archive.

Abstract
This paper presents a novel sampling scheme for masked non-autoregressive generative modeling. We identify the limitations of TimeVQVAE, MaskGIT, and Token-Critic in their sampling processes, and propose Enhanced Sampling Scheme (ESS) to overcome these limitations. ESS explicitly ensures both sample diversity and fidelity, and consists of three stages: Naive Iterative Decoding, Critical Reverse Sampling, and Critical Resampling. ESS starts by sampling a token set using the naive iterative decoding as proposed in MaskGIT, ensuring sample diversity. Then, the token set undergoes the critical reverse sampling, masking tokens leading to unrealistic samples. After that, critical resampling reconstructs masked tokens until the final sampling step is reached to ensure high fidelity. Critical resampling uses confidence scores obtained from a self-Token-Critic to better measure the realism of sampled tokens, while critical reverse sampling uses the structure of the quantized latent vector space to discover unrealistic sample paths. We demonstrate significant performance gains of ESS in both unconditional sampling and class-conditional sampling using all the 128 datasets in the UCR Time Series archive.

Summary
ESS starts by sampling a token set using naive iterative decoding, as proposed in MaskGIT, to ensure sample diversity. Then, the token set undergoes critical reverse sampling, which masks the tokens that lead to unrealistic samples. Finally, critical resampling reconstructs the masked tokens until the final sampling step is reached, to ensure high fidelity. Critical resampling uses confidence scores obtained from a self-Token-Critic to better measure the realism of sampled tokens, while critical reverse sampling uses the structure of the quantized latent vector space to discover unrealistic sample paths. We demonstrate significant performance gains of ESS in both unconditional sampling and class-conditional sampling using all 128 datasets in the UCR Time Series archive.

SingFake: Singing Voice Deepfake Detection

paper_url: http://arxiv.org/abs/2309.07525
repo_url: None
paper_authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan
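for: This study addresses the challenges raised by singing voice synthesis, particularly unauthorized voice usage in music, and proposes the singing voice deepfake detection task.
methods: Four state-of-the-art speech countermeasure systems are evaluated on SingFake, both as trained on speech utterances and after training on SingFake.
results: The systems perform well on speech utterances but poorly on songs. Training on SingFake helps substantially, yet unseen singers, languages, and musical contexts remain challenging.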

Abstract
The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available online.

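Summary
The rise of singing voice synthesis poses challenges for artists and industry stakeholders, especially regarding unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are usually released in songs with strong background music, which may hide synthesis artifacts. Singing voices also differ from speech utterances in their acoustic and linguistic characteristics. These properties make singing voice deepfake detection a relevant but distinct problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset, consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/validation/test split in which the test sets cover various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances and find that they lag significantly behind their performance on speech test data. When trained on SingFake, these systems improve substantially. However, our evaluations also reveal challenges associated with unseen singers, communication codecs, languages, and musical contexts, which call for dedicated research. The SingFake dataset and related resources are available online.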

Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions

paper_url: http://arxiv.org/abs/2309.07510
repo_url: None
paper_authors: Kai Cheng, Ruihai Wu, Yan Shen, Chuanruo Ning, Guanqi Zhan, Hao Dong
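for: To provide an environment-aware affordance framework for perceiving and manipulating 3D articulated objects in diverse environments.
methods: A novel contrastive affordance learning framework that trains on scenes containing a single occluder and generalizes to scenes with complex occluder combinations.
results: Experiments show that the proposed framework learns affordance effectively while accounting for environment constraints.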

Abstract
Perceiving and manipulating 3D articulated objects in diverse environments is essential for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints.

Summary
Perceiving and manipulating 3D articulated objects in diverse environments is an essential capability for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints.

Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure

paper_url: http://arxiv.org/abs/2309.07504
repo_url: https://github.com/Jiankai-Sun/SSTA2-ITSC-2023
paper_authors: Jiankai Sun, Shreyas Kousik, David Fridovich-Keil, Mac Schwager
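for: Connected autonomous vehicles (CAVs) promise to enhance the safety, efficiency, and sustainability of urban transportation, but only if they correctly predict the motion of surrounding agents and plan their own motion safely. This is challenging in complex urban environments because of frequent occlusions and interactions among many agents.
methods: The work uses smart infrastructure, the "Self-Supervised Traffic Advisor" (SSTA), to augment a CAV's situational awareness. SSTA is a framework of smart sensors that teach themselves to generate and broadcast useful video predictions. Here, SSTA predicts future occupancy instead of raw video, which reduces the data footprint of the broadcast predictions.
results: The design is shown to aid CAV motion planning effectively. A series of numerical experiments studies the factors that make it useful for practical CAV planning in crowded urban environments.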

Abstract
Connected autonomous vehicles (CAVs) promise to enhance safety, efficiency, and sustainability in urban transportation. However, this is contingent upon a CAV correctly predicting the motion of surrounding agents and planning its own motion safely. Doing so is challenging in complex urban environments due to frequent occlusions and interactions among many agents. One solution is to leverage smart infrastructure to augment a CAV's situational awareness; the present work leverages a recently proposed "Self-Supervised Traffic Advisor" (SSTA) framework of smart sensors that teach themselves to generate and broadcast useful video predictions of road users. In this work, SSTA predictions are modified to predict future occupancy instead of raw video, which reduces the data footprint of broadcast predictions. The resulting predictions are used within a planning framework, demonstrating that this design can effectively aid CAV motion planning. A variety of numerical experiments study the key factors that make SSTA outputs useful for practical CAV planning in crowded urban environments.

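Summary
Connected autonomous vehicles (CAVs) promise to improve the safety, efficiency, and sustainability of urban transportation. However, this depends on a CAV correctly predicting the motion of surrounding agents and planning its own motion safely. This is challenging in complex urban environments because of frequent occlusions and interactions among many agents. One solution is to use smart infrastructure to augment a CAV's situational awareness; this work uses the "Self-Supervised Traffic Advisor" (SSTA) framework of smart sensors, which generate and broadcast useful video predictions of road users. In this design, SSTA predictions are modified to predict future occupancy instead of raw video, which reduces the data footprint of broadcast predictions. These predictions are used within a planning framework, showing that the design can effectively aid CAV motion planning. Numerical experiments study the key factors that make SSTA outputs useful for practical CAV planning in crowded urban environments.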

HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods

paper_url: http://arxiv.org/abs/2309.07495
repo_url: https://github.com/yylgoodlucky/hdtr
paper_authors: Yongyuan Li, Xiuyuan Qin, Chao Liang, Mingqiang Wei
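for: Talking Face Generation (TFG) aims to reconstruct facial movements from audio and facial features so that lip movements look natural and realistic. This work restores visual quality in the teeth region for arbitrary TFG methods.
methods: HDTR-Net, a high-definition teeth restoration network, quickly enhances the clarity of teeth regions while maintaining synchronization and temporal consistency. A Fine-Grained Feature Fusion (FGFF) module captures fine texture features around the teeth and surrounding regions and uses them to refine the feature map.
results: The method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. It generates in real time, and for high-definition restoration in talking face video synthesis its inference speed is $300\%$ faster than the current state-of-the-art face restoration based on super-resolution.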

Abstract
Talking Face Generation (TFG) aims to reconstruct facial movements to achieve high natural lip movements from audio and facial features that are under potential connections. Existing TFG methods have made significant advancements to produce natural and realistic images. However, most work rarely takes visual quality into consideration. It is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization, and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture feature information around teeth and surrounding regions, and use these features to fine-grain the feature map to enhance the clarity of teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without suffering from lip synchronization and frame coherence. Another advantage of HDTR-Net is its real-time generation ability. Also under the condition of high-definition restoration of talking face video synthesis, its inference speed is $300\%$ faster than the current state-of-the-art face restoration based on super-resolution.

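Summary
Talking Face Generation (TFG) aims to reconstruct facial movements, achieving highly natural lip movements from audio and facial features that are potentially connected. Existing TFG methods have made significant progress in generating natural and realistic images. However, most work rarely considers visual quality. In cross-modal generation methods, ensuring lip synchronization while avoiding visual quality degradation is challenging. To address this, we propose a universal High-Definition Teeth Restoration Network, named HDTR-Net, which enhances teeth regions at very high speed while maintaining synchronization and temporal consistency. Specifically, we propose a Fine-Grained Feature Fusion (FGFF) module that captures fine texture feature information around the teeth and surrounding regions and uses these features to refine the feature map and improve the clarity of the teeth. Our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. HDTR-Net also generates in real time: for high-definition restoration in talking face video synthesis, its inference speed is $300\%$ faster than the current state-of-the-art face restoration based on super-resolution.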

Where2Explore: Few-shot Affordance Learning for Unseen Novel Categories of Articulated Objects

paper_url: http://arxiv.org/abs/2309.07473
repo_url: None
paper_authors: Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, Hao Dong
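for: This work addresses a fundamental challenge in robotic manipulation of articulated objects: generalizing to novel categories despite large geometric and semantic variations across categories.
methods: 'Where2Explore' is an exploration framework based on geometric similarity that adapts to novel categories through efficient exploration on a limited number of instances. It estimates the geometric similarity across categories and transfers affordance knowledge to similar parts, which makes exploration and learning more efficient.
results: Experiments in simulated and real-world environments show that the framework adapts quickly to novel categories with efficient exploration and learning.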

Abstract
Articulated object manipulation is a fundamental yet challenging task in robotics. Due to significant geometric and semantic variations across object categories, previous manipulation models struggle to generalize to novel categories. Few-shot learning is a promising solution for alleviating this issue by allowing robots to perform a few interactions with unseen objects. However, extant approaches often necessitate costly and inefficient test-time interactions with each unseen instance. Recognizing this limitation, we observe that despite their distinct shapes, different categories often share similar local geometries essential for manipulation, such as pullable handles and graspable edges - a factor typically underutilized in previous few-shot learning works. To harness this commonality, we introduce 'Where2Explore', an affordance learning framework that effectively explores novel categories with minimal interactions on a limited number of instances. Our framework explicitly estimates the geometric similarity across different categories, identifying local areas that differ from shapes in the training categories for efficient exploration while concurrently transferring affordance knowledge to similar parts of the objects. Extensive experiments in simulated and real-world environments demonstrate our framework's capacity for efficient few-shot exploration and generalization.

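Summary
Articulated object manipulation is a fundamental yet challenging task in robotics. Because geometric and semantic variations across object categories are large, previous manipulation models struggle to generalize to novel categories. Few-shot learning is a promising solution that lets robots perform a few interactions with unseen objects. However, existing approaches often require costly and inefficient test-time interactions with each unseen instance. Recognizing this limitation, we observe that, despite their distinct shapes, different categories often share similar local geometries essential for manipulation, such as pullable handles and graspable edges, a factor typically underutilized in previous few-shot learning work. To exploit this commonality, we propose 'Where2Explore', an affordance learning framework that efficiently explores novel categories. Our framework explicitly estimates the geometric similarity across different categories and transfers affordance knowledge to similar parts of the objects, enabling efficient exploration of novel categories. Extensive experiments in simulated and real-world environments demonstrate its efficiency and generalization ability.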

Detecting Unknown Attacks in IoT Environments: An Open Set Classifier for Enhanced Network Intrusion Detection

paper_url: http://arxiv.org/abs/2309.07461
repo_url: None
paper_authors: Yasir Ali Farrukh, Syed Wali, Irfan Khan, Nathaniel D. Bastian
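for: This paper presents a network intrusion detection system (NIDS) framework for Internet of Things (IoT) environments that addresses the open set recognition problem posed by previously unseen attacks.
methods: Image-based representations of packet-level data are used to extract spatial and temporal patterns from network traffic, combined with stacking and sub-clustering techniques to better identify unknown attacks.
results: Experiments show that the framework reaches an 88% detection rate for previously unseen attacks, in comparison with existing approaches and recent advancements.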

Abstract
The widespread integration of Internet of Things (IoT) devices across all facets of life has ushered in an era of interconnectedness, creating new avenues for cybersecurity challenges and underscoring the need for robust intrusion detection systems. However, traditional security systems are designed with a closed-world perspective and often face challenges in dealing with the ever-evolving threat landscape, where new and unfamiliar attacks are constantly emerging. In this paper, we introduce a framework aimed at mitigating the open set recognition (OSR) problem in the realm of Network Intrusion Detection Systems (NIDS) tailored for IoT environments. Our framework capitalizes on image-based representations of packet-level data, extracting spatial and temporal patterns from network traffic. Additionally, we integrate stacking and sub-clustering techniques, enabling the identification of unknown attacks by effectively modeling the complex and diverse nature of benign behavior. The empirical results prominently underscore the framework's efficacy, boasting an impressive 88\% detection rate for previously unseen attacks when compared against existing approaches and recent advancements. Future work will perform extensive experimentation across various openness levels and attack scenarios, further strengthening the adaptability and performance of our proposed solution in safeguarding IoT environments.

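Summary
The widespread deployment of Internet of Things (IoT) devices across all facets of life has ushered in an era of interconnectedness, creating new cybersecurity challenges and underscoring the need for robust intrusion detection systems. However, traditional security systems are designed with a closed-world perspective and often struggle with the ever-evolving threat landscape. In this paper, we introduce a framework that addresses the open set recognition (OSR) problem in Network Intrusion Detection Systems (NIDS) tailored for IoT environments. Our framework uses image-based representations of packet-level data to extract spatial and temporal patterns from network traffic. We also integrate stacking and sub-clustering techniques, which model the complex and diverse nature of benign behavior and thereby enable the identification of unknown attacks. Our experimental results show an 88% detection rate for previously unseen attacks, in comparison with existing approaches and recent advancements. Future work will run extensive experiments across various openness levels and attack scenarios to further strengthen the adaptability and performance of the proposed solution in safeguarding IoT environments.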
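As a rough illustration of how sub-clusters of known behavior can support open set recognition, the sketch below rejects a sample as "unknown" when it is far from every known sub-cluster centroid. This is a generic distance-based rejection rule, not the paper's stacking model or its image-based packet representation. The two-dimensional feature vectors, the cluster contents, and the 1.5x threshold margin are all hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def nearest_distance(x, centroids):
    return min(math.dist(x, c) for c in centroids)

def is_unknown(x, centroids, threshold):
    """Flag x when it is farther than the threshold from every known sub-cluster."""
    return nearest_distance(x, centroids) > threshold

# Hypothetical sub-clusters of known traffic in a 2-D feature space.
sub_clusters = [
    [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]],
    [[5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]],
]
centroids = [centroid(points) for points in sub_clusters]

# Assumed threshold: 1.5 times the largest training distance to the nearest centroid.
max_train_distance = max(
    nearest_distance(p, centroids) for points in sub_clusters for p in points
)
threshold = 1.5 * max_train_distance

print(is_unknown([0.1, 0.1], centroids, threshold))    # close to a known sub-cluster
print(is_unknown([10.0, 10.0], centroids, threshold))  # far from all of them
```

Splitting known behavior into several sub-clusters, rather than one centroid per class, keeps the acceptance regions tight when that behavior is diverse, which is the intuition behind modeling it with sub-clustering.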

Towards Artificial General Intelligence (AGI) in the Internet of Things (IoT): Opportunities and Challenges

paper_url: http://arxiv.org/abs/2309.07438
repo_url: None
paper_authors: Fei Dou, Jin Ye, Geng Yuan, Qin Lu, Wei Niu, Haijian Sun, Le Guan, Guoyu Lu, Gengchen Mai, Ninghao Liu, Jin Lu, Zhengliang Liu, Zihao Wu, Chenjiao Tan, Shaochen Xu, Xianqiao Wang, Guoming Li, Lilong Chai, Sheng Li, Jin Sun, Hongyue Sun, Yunli Shao, Changying Li, Tianming Liu, Wenzhan Song
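for: This study explores the opportunities and challenges of achieving Artificial General Intelligence (AGI) in the Internet of Things (IoT), with applications in areas such as smart homes, manufacturing, transportation, and education.
methods: The study examines how to integrate AGI into IoT systems and proposes a conceptual framework for doing so.
results: The application spectrum of AGI in IoT is broad, but adapting AGI to resource-constrained IoT devices requires further research. The study also discusses the complexity of IoT communication and security concerns.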

Abstract
Artificial General Intelligence (AGI), possessing the capacity to comprehend, learn, and execute tasks with human cognitive abilities, engenders significant anticipation and intrigue across scientific, commercial, and societal arenas. This fascination extends particularly to the Internet of Things (IoT), a landscape characterized by the interconnection of countless devices, sensors, and systems, collectively gathering and sharing data to enable intelligent decision-making and automation. This research embarks on an exploration of the opportunities and challenges towards achieving AGI in the context of the IoT. Specifically, it starts by outlining the fundamental principles of IoT and the critical role of Artificial Intelligence (AI) in IoT systems. Subsequently, it delves into AGI fundamentals, culminating in the formulation of a conceptual framework for AGI's seamless integration within IoT. The application spectrum for AGI-infused IoT is broad, encompassing domains ranging from smart grids, residential environments, manufacturing, and transportation to environmental monitoring, agriculture, healthcare, and education. However, adapting AGI to resource-constrained IoT settings necessitates dedicated research efforts. Furthermore, the paper addresses constraints imposed by limited computing resources, intricacies associated with large-scale IoT communication, as well as the critical concerns pertaining to security and privacy.

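Summary
Artificial General Intelligence (AGI), with the capacity to comprehend, learn, and execute tasks at the level of human cognitive abilities, has attracted broad attention across scientific, commercial, and societal arenas. Its prospects are particularly broad in the Internet of Things (IoT). This research explores the opportunities and challenges of achieving AGI in the context of the IoT. It first outlines the fundamental principles of IoT and the critical role of Artificial Intelligence (AI) in IoT systems. It then examines AGI fundamentals and formulates a conceptual framework for AGI within IoT. The application spectrum of AGI in IoT is broad, covering smart grids, residential environments, manufacturing, and transportation, as well as environmental monitoring, agriculture, healthcare, and education. However, applying AGI in resource-constrained IoT settings requires dedicated research efforts. The study also considers the complexity of large-scale IoT communication and security and privacy concerns.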

Semantic Parsing in Limited Resource Conditions

paper_url: http://arxiv.org/abs/2309.07429
repo_url: None
paper_authors: Zhuang Li
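for: This thesis focuses on the challenges of semantic parsing when data and computational resources are limited.
methods: The thesis proposes solutions based on automatic data curation, knowledge transfer, active learning, and continual learning. Without parallel training data, it generates synthetic training examples from structured database schemas. When the source domain has abundant data but the target domain has limited parallel data, it leverages source-domain knowledge to improve parsing in the target domain. In multilingual settings, it adapts parsers under a limited human translation budget, using active learning to select samples for manual translation.
results: The results indicate that these methods improve semantic parsing performance, particularly when data and computational resources are limited.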

Abstract
This thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources. It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning. For tasks with no parallel training data, the thesis proposes generating synthetic training examples from structured database schemas. When there is abundant data in a source domain but limited parallel data in a target domain, knowledge from the source is leveraged to improve parsing in the target domain. For multilingual situations with limited data in the target languages, the thesis introduces a method to adapt parsers using a limited human translation budget. Active learning is applied to select source-language samples for manual translation, maximizing parser performance in the target language. In addition, an alternative method is also proposed to utilize machine translation services, supplemented by human-translated data, to train a more effective parser. When computational resources are limited, a continual learning approach is introduced to minimize training time and computational memory. This maintains the parser's efficiency in previously learned tasks while adapting it to new tasks, mitigating the problem of catastrophic forgetting. Overall, the thesis provides a comprehensive set of methods to improve semantic parsing in resource-constrained conditions.

Summary
For tasks with no parallel training data, the thesis suggests generating synthetic training examples from structured database schemas. When there is abundant data in a source domain but limited parallel data in a target domain, the thesis leverages knowledge from the source domain to improve parsing in the target domain. For multilingual situations with limited data in the target languages, the thesis introduces a method to adapt parsers using a limited human translation budget. Active learning is applied to select source-language samples for manual translation, maximizing parser performance in the target language. Additionally, an alternative method is proposed to utilize machine translation services, supplemented by human-translated data, to train a more effective parser. When computational resources are limited, the thesis introduces a continual learning approach to minimize training time and computational memory. This approach maintains the parser's efficiency in previously learned tasks while adapting it to new tasks, mitigating the problem of catastrophic forgetting. Overall, the thesis provides a comprehensive set of methods to improve semantic parsing in resource-constrained conditions.

JSMNet Improving Indoor Point Cloud Semantic and Instance Segmentation through Self-Attention and Multiscale

paper_url: http://arxiv.org/abs/2309.07425
repo_url: None
paper_authors: Shuochen Xu, Zhenxin Zhang
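for: To improve the semantic understanding of indoor 3D point cloud data for applications such as indoor service robots, navigation systems, and digital twin engineering.
methods: JSMNet combines a multi-layer network with a global feature self-attention module for high-quality semantic and instance segmentation of indoor point clouds. A multi-resolution feature adaptive fusion module is designed to better express the characteristics of indoor targets.
results: On the S3DIS dataset, the method outperforms PointNet by 16.0% in semantic segmentation mIoU (Area 5) and 26.3% in instance segmentation mPre, ASIS by 6.0% and 4.6%, and JSPNet by 3.3% and 0.3%, respectively.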

Abstract
The semantic understanding of indoor 3D point cloud data is crucial for a range of subsequent applications, including indoor service robots, navigation systems, and digital twin engineering. Global features are crucial for achieving high-quality semantic and instance segmentation of indoor point clouds, as they provide essential long-range context information. To this end, we propose JSMNet, which combines a multi-layer network with a global feature self-attention module to jointly segment three-dimensional point cloud semantics and instances. To better express the characteristics of indoor targets, we have designed a multi-resolution feature adaptive fusion module that takes into account the differences in point cloud density caused by varying scanner distances from the target. Additionally, we propose a framework for joint semantic and instance segmentation by integrating semantic and instance features to achieve superior results. We conduct experiments on S3DIS, which is a large three-dimensional indoor point cloud dataset. Our proposed method is compared against other methods, and the results show that it outperforms existing methods in semantic and instance segmentation and provides better results in target local area segmentation. Specifically, our proposed method outperforms PointNet (Qi et al., 2017a) by 16.0% and 26.3% in terms of semantic segmentation mIoU in S3DIS (Area 5) and instance segmentation mPre, respectively. Additionally, it surpasses ASIS (Wang et al., 2019) by 6.0% and 4.6%, respectively, as well as JSPNet (Chen et al., 2022) by a margin of 3.3% for semantic segmentation mIoU and a slight improvement of 0.3% for instance segmentation mPre.

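Summary
The semantic understanding of indoor 3D point cloud data is important for a range of applications, including indoor service robots, navigation systems, and digital twin engineering. Global features are key to high-quality semantic and instance segmentation of indoor point clouds because they provide essential long-range context information. To this end, we propose JSMNet, which combines a multi-layer network with a global feature self-attention module to jointly segment 3D point cloud semantics and instances. To better express the characteristics of indoor targets, we design a multi-resolution feature adaptive fusion module that accounts for the differences in point cloud density caused by varying scanner distances from the target. We also propose a framework that integrates semantic and instance features to achieve better results. We run experiments on S3DIS, a large 3D indoor point cloud dataset, and compare against other methods. The results show that our method outperforms PointNet (Qi et al., 2017a) by 16.0% in semantic segmentation mIoU (S3DIS Area 5) and by 26.3% in instance segmentation mPre, and surpasses ASIS (Wang et al., 2019) by 6.0% and 4.6%, respectively. It also exceeds JSPNet (Chen et al., 2022) by 3.3% in semantic segmentation mIoU and 0.3% in instance segmentation mPre.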
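Since the reported gains are stated in terms of mIoU, a short worked example of the metric may help. The sketch below computes mean intersection-over-union from per-point labels in plain Python. The label lists are made up, and skipping classes that appear in neither the prediction nor the ground truth is one common convention, assumed here; the benchmark's own evaluation script may differ.

```python
def mean_iou(predicted, target, num_classes):
    """Average per-class IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label lists
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Hypothetical per-point class labels for five points and three classes.
predicted = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]

score = mean_iou(predicted, target, num_classes=3)
print(round(score, 4))  # per-class IoUs are 1/2, 2/3 and 1, so the mean is 13/18
```

Because each class contributes equally to the average, a rare class that is segmented poorly lowers mIoU as much as a frequent one, which is why the metric is preferred over plain point accuracy for segmentation.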

An Assessment of ChatGPT on Log Data

paper_url: http://arxiv.org/abs/2309.07938
repo_url: None
paper_authors: Priyanka Mudgal, Rita Wouhaybi
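for: This paper examines whether ChatGPT can usefully process log data, and identifies the model's shortcomings in this area and possible next steps for improvement.
methods: The paper applies ChatGPT to several log processing tasks and analyzes and evaluates its performance.
results: The current version of ChatGPT shows limited log processing performance, with inconsistent responses and scalability issues.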

Abstract
Recent development of large language models (LLMs), such as ChatGPT has been widely applied to a wide range of software engineering tasks. Many papers have reported their analysis on the potential advantages and limitations of ChatGPT for writing code, summarization, text generation, etc. However, the analysis of the current state of ChatGPT for log processing has received little attention. Logs generated by large-scale software systems are complex and hard to understand. Despite their complexity, they provide crucial information for subject matter experts to understand the system status and diagnose problems of the systems. In this paper, we investigate the current capabilities of ChatGPT to perform several interesting tasks on log data, while also trying to identify its main shortcomings. Our findings show that the performance of the current version of ChatGPT for log processing is limited, with a lack of consistency in responses and scalability issues. We also outline our views on how we perceive the role of LLMs in the log processing discipline and possible next steps to improve the current capabilities of ChatGPT and the future LLMs in this area. We believe our work can contribute to future academic research to address the identified issues.

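Summary
Large language models (LLMs) such as ChatGPT have been widely applied to software engineering tasks, and many papers have analyzed their use for writing code, summarization, text generation, and similar tasks. The state of ChatGPT for log processing, however, has received little attention. Logs generated by large-scale software systems are complex and hard to understand, yet they carry crucial information that helps experts understand system status and diagnose problems. In this paper, we investigate ChatGPT's current ability to perform several tasks on log data and try to identify its main shortcomings. Our findings show that the current version's log processing performance is limited, with inconsistent responses and scalability issues. We also outline our view of the role of LLMs in log processing and possible next steps for ChatGPT and future LLMs in this area. We believe this work can contribute to future academic research that addresses the identified issues.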

Client-side Gradient Inversion Against Federated Learning from Poisoning

paper_url: http://arxiv.org/abs/2309.07415
repo_url: https://github.com/clientsidegia/cgi
paper_authors: Jiaheng Wei, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shirui Pan, Kok-Leong Ong, Jun Zhang, Yang Xiang
for: The paper addresses the vulnerability of federated learning (FL) to gradient inversion attacks (GIA) and proposes client-side poisoning gradient inversion (CGI), a novel attack that can be launched from clients.
methods: The adversary uses a malicious model that amplifies the loss of a specific targeted class, then optimizes malicious updates and blends benign updates with a malicious replacement vector to remain undetected by Byzantine-robust aggregation rules (AGRs).
results: The paper shows that a client-side adversary with limited knowledge can recover training samples from the aggregated global model, and that CGI consistently extracts training inputs in all tested scenarios, including against Byzantine-robust AGRs.

Abstract
Federated Learning (FL) enables distributed participants (e.g., mobile devices) to train a global model without sharing data directly to a central server. Recent studies have revealed that FL is vulnerable to gradient inversion attack (GIA), which aims to reconstruct the original training samples and poses high risk against the privacy of clients in FL. However, most existing GIAs necessitate control over the server and rely on strong prior knowledge including batch normalization and data distribution information. In this work, we propose Client-side poisoning Gradient Inversion (CGI), which is a novel attack method that can be launched from clients. For the first time, we show the feasibility of a client-side adversary with limited knowledge being able to recover the training samples from the aggregated global model. We take a distinct approach in which the adversary utilizes a malicious model that amplifies the loss of a specific targeted class of interest. When honest clients employ the poisoned global model, the gradients of samples belonging to the targeted class are magnified, making them the dominant factor in the aggregated update. This enables the adversary to effectively reconstruct the private input belonging to other clients using the aggregated update. In addition, our CGI also features its ability to remain stealthy against Byzantine-robust aggregation rules (AGRs). By optimizing malicious updates and blending benign updates with a malicious replacement vector, our method remains undetected by these defense mechanisms. To evaluate the performance of CGI, we conduct experiments on various benchmark datasets, considering representative Byzantine-robust AGRs, and exploring diverse FL settings with different levels of adversary knowledge about the data. Our results demonstrate that CGI consistently and successfully extracts training input in all tested scenarios.

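Summary
Federated learning (FL) lets distributed participants (for example, mobile devices) train a global model without sending data directly to a central server. Recent studies show that FL is vulnerable to gradient inversion attacks (GIA), which aim to reconstruct the original training samples and pose a high risk to client privacy. Most existing GIAs, however, require control over the server and rely on strong prior knowledge, including batch normalization and data distribution information. In this work, we propose Client-side poisoning Gradient Inversion (CGI), a new attack that can be launched from clients. We show for the first time that a client-side adversary with limited knowledge can recover training samples from the aggregated global model. The adversary uses a malicious model that amplifies the loss of a specific targeted class. When honest clients use the poisoned global model, the gradients of samples in that class are magnified and dominate the aggregated update, which lets the adversary reconstruct private inputs belonging to other clients. CGI also stays stealthy against Byzantine-robust aggregation rules (AGRs) by optimizing malicious updates and blending benign updates with a malicious replacement vector. We evaluate CGI on several benchmark datasets, with representative Byzantine-robust AGRs and FL settings that vary the adversary's knowledge about the data. CGI consistently and successfully extracts training inputs in all tested scenarios.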

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

paper_url: http://arxiv.org/abs/2309.07405
repo_url: https://github.com/alibaba-damo-academy/FunCodec
paper_authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
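for: This work develops FunCodec, a fundamental neural speech codec toolkit that provides reproducible training recipes and inference scripts and is designed to integrate with other speech processing tools.
methods: The toolkit covers recent neural speech codec models such as SoundStream and Encodec, and ships pre-trained models for academic or general use.
results: Under the same compression ratio, FunCodec achieves better reconstruction quality than other toolkits and released models. The pre-trained models are also suitable for downstream tasks such as automatic speech recognition and personalized text-to-speech synthesis.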

Abstract
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.

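Summary
This paper presents FunCodec, a fundamental neural speech codec toolkit that extends the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Because it shares a unified design with FunASR, FunCodec can be integrated easily into downstream tasks such as speech recognition. Pre-trained models are also provided for academic or general use. Building on the toolkit, we further propose frequency-domain codec models, FreqCodec, which achieve comparable speech quality with much lower computation and parameter complexity. Experiments show that, under the same compression ratio, FunCodec achieves better reconstruction quality than other toolkits and released models. We also show that the pre-trained models suit downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. The toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.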

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

paper_url: http://arxiv.org/abs/2309.07401
repo_url: None
paper_authors: Yuesheng Xu, Taishan Zeng
for: This paper proposes a multi-grade deep learning method for solving nonlinear partial differential equations (PDEs) that learns solutions efficiently and outperforms existing single-grade deep learning methods in predictive accuracy.
methods: The method breaks the task of learning a deep neural network (DNN) into several neural networks stacked on top of each other in a staircase-like manner. This mitigates the complexity of the non-convex optimization problem with a large number of parameters and lets each grade learn the residual components left over from previous grades.
results: The two-stage multi-grade deep learning (TS-MGDL) method learns solutions of the 1D, 2D, and 3D viscous Burgers equations efficiently and outperforms single-grade methods in predictive accuracy. The predictive errors of single-grade deep learning are 26-60, 4-31, and 3-12 times larger than those of TS-MGDL for the 1D, 2D, and 3D equations, respectively.

Abstract
We develop in this paper a multi-grade deep learning method for solving nonlinear partial differential equations (PDEs). Deep neural networks (DNNs) have received super performance in solving PDEs in addition to their outstanding success in areas such as natural language processing, computer vision, and robotics. However, training a very deep network is often a challenging task. As the number of layers of a DNN increases, solving a large-scale non-convex optimization problem that results in the DNN solution of PDEs becomes more and more difficult, which may lead to a decrease rather than an increase in predictive accuracy. To overcome this challenge, we propose a two-stage multi-grade deep learning (TS-MGDL) method that breaks down the task of learning a DNN into several neural networks stacked on top of each other in a staircase-like manner. This approach allows us to mitigate the complexity of solving the non-convex optimization problem with large number of parameters and learn residual components left over from previous grades efficiently. We prove that each grade/stage of the proposed TS-MGDL method can reduce the value of the loss function and further validate this fact through numerical experiments. Although the proposed method is applicable to general PDEs, implementation in this paper focuses only on the 1D, 2D, and 3D viscous Burgers equations. Experimental results show that the proposed two-stage multi-grade deep learning method enables efficient learning of solutions of the equations and outperforms existing single-grade deep learning methods in predictive accuracy. Specifically, the predictive errors of the single-grade deep learning are larger than those of the TS-MGDL method in 26-60, 4-31 and 3-12 times, for the 1D, 2D, and 3D equations, respectively.

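Summary
We develop a multi-grade deep learning method for solving nonlinear partial differential equations (PDEs). Deep neural networks (DNNs) have performed very well in solving PDEs, alongside their success in natural language processing, computer vision, and robotics. Training a very deep network, however, is challenging: as the number of layers grows, the large-scale non-convex optimization problem that yields the DNN solution of a PDE becomes harder to solve, which may reduce rather than improve predictive accuracy. To overcome this, we propose a two-stage multi-grade deep learning (TS-MGDL) method that breaks the task of learning a DNN into several neural networks stacked on top of each other in a staircase-like manner. This mitigates the complexity of the non-convex optimization problem with a large number of parameters and learns the residual components left over from previous grades efficiently. We prove that each grade/stage of TS-MGDL can reduce the value of the loss function, and we validate this through numerical experiments. Although the method applies to general PDEs, the implementation in this paper covers only the 1D, 2D, and 3D viscous Burgers equations. Experiments show that the method learns solutions of these equations efficiently and outperforms existing single-grade deep learning methods in predictive accuracy: the predictive errors of single-grade deep learning are 26-60, 4-31, and 3-12 times larger than those of TS-MGDL for the 1D, 2D, and 3D equations, respectively.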
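
The staircase idea, where each grade is stacked on the previous grade's features and fits only the residual that remains, can be sketched on a toy regression problem. This is not the authors' TS-MGDL implementation: the sketch swaps gradient-based training for a fixed random hidden layer with a least-squares output layer, and the target function, widths, and the helper `fit_grade` are all hypothetical choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_grade(inputs, residual, width, rng):
    """Fit one grade: fixed random tanh hidden layer plus least-squares output layer."""
    W = rng.normal(size=(inputs.shape[1], width))
    b = rng.normal(size=width)
    hidden = np.tanh(inputs @ W + b)                      # features produced by this grade
    coef, *_ = np.linalg.lstsq(hidden, residual, rcond=None)
    return hidden, hidden @ coef

x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
target = np.sin(4 * np.pi * x[:, 0])                      # toy target, not a Burgers solution

features, prediction = x, np.zeros_like(target)
losses = []
for grade in range(3):
    # Each grade consumes the previous grade's features and fits the remaining residual.
    features, correction = fit_grade(features, target - prediction, 20, rng)
    prediction = prediction + correction
    losses.append(float(np.mean((target - prediction) ** 2)))
print(losses)
```

In this simplified setting the loss cannot increase from one grade to the next, because a zero correction is always an admissible least-squares solution. That mirrors, in a much weaker form, the abstract's claim that each grade can reduce the loss.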

Semantic Adversarial Attacks via Diffusion Models

paper_url: http://arxiv.org/abs/2309.07398
repo_url: https://github.com/steven202/semantic_adv_via_dm
paper_authors: Chenan Wang, Jinhao Duan, Chaowei Xiao, Edward Kim, Matthew Stamm, Kaidi Xu
for: This paper proposes a framework for quickly generating semantic adversarial attacks, with two variants: the Semantic Transformation (ST) and Latent Masking (LM) approaches.
methods: The framework leverages recent diffusion models and manipulates their latent space, which contains semantic information.
results: Experiments on the CelebA-HQ and AFHQ datasets show an attack success rate of approximately 100% in multiple settings with a best FID of 36.61, along with strong fidelity, generalizability, and transferability compared with other baselines.

Abstract
Traditional adversarial attacks concentrate on manipulating clean examples in the pixel space by adding adversarial perturbations. By contrast, semantic adversarial attacks focus on changing semantic attributes of clean examples, such as color, context, and features, which are more feasible in the real world. In this paper, we propose a framework to quickly generate a semantic adversarial attack by leveraging recent diffusion models since semantic information is included in the latent space of well-trained diffusion models. Then there are two variants of this framework: 1) the Semantic Transformation (ST) approach fine-tunes the latent space of the generated image and/or the diffusion model itself; 2) the Latent Masking (LM) approach masks the latent space with another target image and local backpropagation-based interpretation methods. Additionally, the ST approach can be applied in either white-box or black-box settings. Extensive experiments are conducted on CelebA-HQ and AFHQ datasets, and our framework demonstrates great fidelity, generalizability, and transferability compared to other baselines. Our approaches achieve approximately 100% attack success rate in multiple settings with the best FID as 36.61. Code is available at https://github.com/steven202/semantic_adv_via_dm.

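Summary
Traditional adversarial attacks manipulate clean examples in pixel space by adding adversarial perturbations. Semantic adversarial attacks instead change semantic attributes of clean examples, such as color, context, and features, which is more feasible in the real world. We propose a framework that quickly generates semantic adversarial attacks by leveraging recent diffusion models, since the latent space of a well-trained diffusion model contains semantic information. The framework has two variants: 1) the Semantic Transformation (ST) approach fine-tunes the latent space of the generated image and/or the diffusion model itself; 2) the Latent Masking (LM) approach masks the latent space with another target image and local backpropagation-based interpretation methods. The ST approach can be applied in either white-box or black-box settings. Extensive experiments on the CelebA-HQ and AFHQ datasets show that the framework achieves great fidelity, generalizability, and transferability compared with other baselines, with an attack success rate of approximately 100% in multiple settings and a best FID of 36.61. Code is available at https://github.com/steven202/semantic_adv_via_dm.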

DebCSE: Rethinking Unsupervised Contrastive Sentence Embedding Learning in the Debiasing Perspective

paper_url: http://arxiv.org/abs/2309.07396
repo_url: None
paper_authors: Pu Miao, Zeyao Du, Junlin Zhang
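for: This study aims to improve the quality of sentence embeddings, since prior work suggests that word frequency bias can lead BERT to learn indistinguishable sentence embeddings.
methods: The paper proposes a contrastive learning framework, DebCSE, that removes the influence of several biases, including sentence length bias and false negative sample bias. DebCSE uses an inverse propensity weighted sampling method to select high-quality positive and negative pairs.
results: Extensive experiments on semantic textual similarity (STS) benchmarks show that DebCSE significantly outperforms the latest state-of-the-art models, with an average Spearman's correlation coefficient of 80.33% on BERTbase.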

Abstract
Several prior studies have suggested that word frequency biases can cause the Bert model to learn indistinguishable sentence embeddings. Contrastive learning schemes such as SimCSE and ConSERT have already been adopted successfully in unsupervised sentence embedding to improve the quality of embeddings by reducing this bias. However, these methods still introduce new biases such as sentence length bias and false negative sample bias, that hinders model's ability to learn more fine-grained semantics. In this paper, we reexamine the challenges of contrastive sentence embedding learning from a debiasing perspective and argue that effectively eliminating the influence of various biases is crucial for learning high-quality sentence embeddings. We think all those biases are introduced by simple rules for constructing training data in contrastive learning and the key for contrastive learning sentence embedding is to mimic the distribution of training data in supervised machine learning in unsupervised way. We propose a novel contrastive framework for sentence embedding, termed DebCSE, which can eliminate the impact of these biases by an inverse propensity weighted sampling method to select high-quality positive and negative pairs according to both the surface and semantic similarity between sentences. Extensive experiments on semantic textual similarity (STS) benchmarks reveal that DebCSE significantly outperforms the latest state-of-the-art models with an average Spearman's correlation coefficient of 80.33% on BERTbase.

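Summary
Several prior studies suggest that word frequency bias can cause the BERT model to learn indistinguishable sentence embeddings. Contrastive learning schemes such as SimCSE and ConSERT have been adopted successfully in unsupervised sentence embedding to improve embedding quality by reducing this bias, but they introduce new biases, such as sentence length bias and false negative sample bias, that hinder the model's ability to learn fine-grained semantics. In this paper, we re-examine contrastive sentence embedding learning from a debiasing perspective and argue that effectively eliminating the influence of these biases is crucial for learning high-quality sentence embeddings. We believe these biases come from the simple rules used to construct training data in contrastive learning, so the key is to mimic, in an unsupervised way, the distribution of training data in supervised machine learning. We propose a contrastive framework for sentence embedding, DebCSE, which eliminates the impact of these biases with an inverse propensity weighted sampling method that selects high-quality positive and negative pairs according to both the surface and semantic similarity between sentences. Extensive experiments on semantic textual similarity (STS) benchmarks show that DebCSE significantly outperforms the latest state-of-the-art models, with an average Spearman's correlation coefficient of 80.33% on BERTbase.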
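
The abstract does not spell out how DebCSE computes propensities, so the following is only a generic sketch of inverse propensity weighted sampling: candidates that a naive construction rule would pick often (high propensity) are down-weighted, and rarely picked ones are up-weighted. The propensity values, the `floor` parameter, and both function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def inverse_propensity_probs(propensity, floor=1e-6):
    """Turn propensity scores into sampling probabilities proportional to 1 / propensity."""
    weights = 1.0 / np.clip(propensity, floor, None)   # floor avoids division by zero
    return weights / weights.sum()

def sample_pairs(propensity, k, rng):
    """Draw k distinct candidate indices using inverse propensity weights."""
    probs = inverse_propensity_probs(propensity)
    return rng.choice(len(propensity), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
propensity = np.array([0.05, 0.5, 0.9, 0.9])           # made-up scores for four candidate pairs
print(inverse_propensity_probs(propensity))
print(sample_pairs(propensity, 2, rng))
```

With these made-up scores, the first candidate receives most of the sampling mass because its propensity is the lowest.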

Unleashing the Power of Depth and Pose Estimation Neural Networks by Designing Compatible Endoscopic Images

paper_url: http://arxiv.org/abs/2309.07390
repo_url: None
paper_authors: Junyang Wu, Yun Gu
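for: This study aims to improve depth and pose estimation frameworks for endoscopic navigation by making endoscopic images more compatible with neural networks.
methods: The paper proposes two ways to improve the compatibility between endoscopic images and neural networks. First, a Mask Image Modelling (MIM) module inputs partial rather than complete image information, so that the network learns to recover global information from partial pixel information. Second, a lightweight neural network enhances the endoscopic images to make them more compatible with the downstream networks.
results: Extensive experiments on three public datasets and one in-house dataset show that the proposed modules improve baselines by a large margin. The enhanced images can also serve as a data augmentation method and yield more stable feature points in traditional feature point matching tasks.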

Abstract
Deep learning models have established depth and pose estimation frameworks on unannotated datasets as an effective pathway to succeed in endoscopic navigation. Most current techniques are dedicated to developing more advanced neural networks to improve the accuracy. However, existing methods ignore the special properties of endoscopic images, resulting in an inability to fully unleash the power of neural networks. In this study, we conduct a detailed analysis of the properties of endoscopic images and improve the compatibility of images and neural networks, to unleash the power of current neural networks. First, we introduce the Mask Image Modelling (MIM) module, which inputs partial image information instead of complete image information, allowing the network to recover global information from partial pixel information. This enhances the network's ability to perceive global information and alleviates the phenomenon of local overfitting in convolutional neural networks due to local artifacts. Second, we propose a lightweight neural network to enhance the endoscopic images, to explicitly improve the compatibility between images and neural networks. Extensive experiments are conducted on three public datasets and one in-house dataset, and the proposed modules improve baselines by a large margin. Furthermore, the enhanced images we propose, which have higher network compatibility, can serve as an effective data augmentation method; they yield more stable feature points in traditional feature point matching tasks and achieve outstanding performance.

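Summary
Depth and pose estimation frameworks trained on unannotated datasets are regarded as an effective pathway for endoscopic navigation. Most current techniques focus on developing more advanced neural networks to improve accuracy. Existing methods, however, ignore the special properties of endoscopic images, so the power of the networks is not fully unleashed. In this study, we analyze the properties of endoscopic images in detail and improve the compatibility between images and neural networks. First, we introduce the Mask Image Modelling (MIM) module, which inputs partial rather than complete image information, allowing the network to recover global information from partial pixel information. This improves the network's perception of global information and alleviates the local overfitting that local artifacts cause in convolutional neural networks. Second, we propose a lightweight neural network that enhances the endoscopic images to explicitly improve their compatibility with neural networks. Extensive experiments on three public datasets and one in-house dataset show that the proposed modules improve baselines by a large margin. The enhanced images, which have higher network compatibility, can also serve as an effective data augmentation method, yield more stable feature points in traditional feature point matching tasks, and achieve outstanding performance.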
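
The MIM module described above feeds the network partial image information. A minimal sketch of one common way to do that, random patch masking, is shown below. The patch size, masking ratio, zero fill value, and the function name `mask_patches` are assumptions for illustration; the abstract does not specify the masking scheme the authors use.

```python
import numpy as np

def mask_patches(image, patch, ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    Returns the masked image and a boolean pixel map (True = pixel kept).
    """
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch                    # patch grid size
    n_mask = int(round(ratio * gh * gw))
    chosen = rng.choice(gh * gw, size=n_mask, replace=False)
    keep = np.ones((gh, gw), dtype=bool)
    keep.flat[chosen] = False
    # Expand the patch-level mask to pixel resolution.
    pixel_keep = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    masked = image.copy()
    masked[: gh * patch, : gw * patch][~pixel_keep] = 0
    return masked, pixel_keep

rng = np.random.default_rng(0)
image = np.ones((64, 64, 3))                           # stand-in for an endoscopic frame
masked, pixel_keep = mask_patches(image, patch=16, ratio=0.5, rng=rng)
print(pixel_keep.mean(), masked.sum())
```

A network trained on `masked` with a loss defined on the full image has to infer the hidden regions from the visible ones, which is the behaviour the abstract attributes to the MIM module.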

Landscape-Sketch-Step: An AI/ML-Based Metaheuristic for Surrogate Optimization Problems

paper_url: http://arxiv.org/abs/2309.07936
repo_url: https://github.com/rafael-a-monteiro-math/landscape_sketch_and_step
paper_authors: Rafael Monteiro, Kartik Sau
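for: This paper proposes a new global optimization heuristic for settings where evaluating the cost function is expensive, inaccessible, or prohibitive.
methods: The method, Landscape-Sketch-and-Step (LSS), combines machine learning, stochastic optimization, and reinforcement learning techniques, and uses historical information from previously sampled points to choose judiciously where the cost function should be evaluated.
results: Unlike Replica Exchange Monte Carlo methods, LSS needs a number of cost function evaluations comparable to that of Simulated Annealing, which matters especially in high-throughput or high-performance computing tasks. LSS also differs from standard surrogate optimization techniques because it does not build a surrogate model that approximates or reconstructs the objective function. On low-dimensional optimization problems (dimensions 1, 2, 4, and 8), LSS accelerates the optimization process compared with classical Simulated Annealing.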

Abstract
In this paper, we introduce a new heuristic for global optimization in scenarios where extensive evaluations of the cost function are expensive, inaccessible, or even prohibitive. The method, which we call Landscape-Sketch-and-Step (LSS), combines Machine Learning, Stochastic Optimization, and Reinforcement Learning techniques, relying on historical information from previously sampled points to make judicious choices of parameter values where the cost function should be evaluated. Unlike optimization by Replica Exchange Monte Carlo methods, the number of evaluations of the cost function required in this approach is comparable to that used by Simulated Annealing, a quality that is especially important in contexts like high-throughput computing or high-performance computing tasks, where evaluations are either computationally expensive or take a long time to be performed. The method also differs from standard Surrogate Optimization techniques, for it does not construct a surrogate model that aims at approximating or reconstructing the objective function. We illustrate our method by applying it to low-dimensional optimization problems (dimensions 1, 2, 4, and 8) that mimic known difficulties of minimization on rugged energy landscapes often seen in Condensed Matter Physics, where cost functions are rugged and plagued with local minima. When compared to classical Simulated Annealing, the LSS shows an effective acceleration of the optimization process.

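Summary
In this paper, we introduce a new heuristic for global optimization in scenarios where extensive evaluations of the cost function are expensive, inaccessible, or even prohibitive. The method, which we call Landscape-Sketch-and-Step (LSS), combines machine learning, stochastic optimization, and reinforcement learning techniques, and relies on historical information from previously sampled points to choose judiciously the parameter values at which the cost function should be evaluated. Unlike optimization by Replica Exchange Monte Carlo methods, LSS requires a number of cost function evaluations comparable to that of Simulated Annealing, which matters especially in high-throughput or high-performance computing tasks, where evaluations are computationally expensive or slow. LSS also differs from standard surrogate optimization techniques in that it does not construct a surrogate model that approximates or reconstructs the objective function. We illustrate the method on low-dimensional optimization problems (dimensions 1, 2, 4, and 8) that mimic known difficulties of minimization on the rugged energy landscapes often seen in condensed matter physics, where cost functions are plagued with local minima. Compared with classical Simulated Annealing, LSS shows an effective acceleration of the optimization process.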
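
For context, the classical Simulated Annealing baseline that LSS is compared against can be sketched as follows. This is the baseline only, not LSS itself, and the rugged 1D cost function (a Rastrigin-style curve), the linear cooling schedule, and all parameter values are assumptions chosen for illustration. The evaluation counter reflects the abstract's emphasis on the number of cost function evaluations.

```python
import math
import random

def rugged(x):
    """Toy rugged 1D landscape with many local minima and a global minimum at x = 0."""
    return 10.0 + x * x - 10.0 * math.cos(2.0 * math.pi * x)

def simulated_annealing(cost, x0, steps, t0, step_size, rng):
    """Minimize `cost` from x0; returns (best_x, best_cost, number_of_evaluations)."""
    x, fx = x0, cost(x0)
    best_x, best_f = x, fx
    evaluations = 1
    for k in range(steps):
        temperature = t0 * (1.0 - k / steps) + 1e-9      # linear cooling, kept strictly positive
        candidate = x + rng.gauss(0.0, step_size)
        fc = cost(candidate)
        evaluations += 1
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / temperature):
            x, fx = candidate, fc
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f, evaluations

best_x, best_f, n_eval = simulated_annealing(rugged, 3.0, 500, 5.0, 0.5, random.Random(0))
print(best_x, best_f, n_eval)
```

Each step costs exactly one evaluation of the cost function, so a run of `steps` iterations uses `steps + 1` evaluations in total.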

The kernel-balanced equation for deep neural networks

paper_url: http://arxiv.org/abs/2309.07367
repo_url: None
paper_authors: Kenichi Nakazato
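for: This paper studies deep neural networks applied to distribution estimation and the stability of the resulting estimates.
methods: The paper trains a deep neural network to estimate the distribution of a dataset and derives a kernel-balanced equation that gives a short phenomenological description of the solution.
results: The estimation is unstable, and the instability depends on the data density and the training duration. The scale of averaging gradually decreases during training and eventually results in instability.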

Abstract
Deep neural networks have shown many fruitful applications in this decade. A network can get the generalized function through training with a finite dataset. The degree of generalization is a realization of the proximity scale in the data space. Specifically, the scale is not clear if the dataset is complicated. Here we consider a network for the distribution estimation of the dataset. We show the estimation is unstable and the instability depends on the data density and training duration. We derive the kernel-balanced equation, which gives a short phenomenological description of the solution. The equation tells us the reason for the instability and the mechanism of the scale. The network outputs a local average of the dataset as a prediction and the scale of averaging is determined along the equation. The scale gradually decreases along training and finally results in instability in our case.

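Summary
Deep neural networks have found many fruitful applications in this decade. A network can obtain a generalized function through training on a finite dataset. The degree of generalization reflects a proximity scale in the data space, and this scale is not clear when the dataset is complicated. Here we consider a network that estimates the distribution of a dataset. We show that the estimation is unstable and that the instability depends on the data density and the training duration. We derive the kernel-balanced equation, which gives a short phenomenological description of the solution and explains both the reason for the instability and the mechanism that sets the scale. The network outputs a local average of the dataset as its prediction, and the scale of averaging is determined by the equation. The scale gradually decreases during training and, in our case, finally results in instability.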
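
The statement that the network outputs a local average whose scale shrinks during training can be illustrated with a generic Gaussian kernel smoother. This is not the kernel-balanced equation from the paper, which the abstract does not state. It is a stand-in under the assumption that "local average with a scale" behaves like kernel-weighted averaging, and the data, scales, and function name are hypothetical.

```python
import numpy as np

def local_average(x_query, x_data, y_data, scale):
    """Gaussian-weighted local average of y_data around each query point."""
    sq_dist = (x_query[:, None] - x_data[None, :]) ** 2
    weights = np.exp(-sq_dist / (2.0 * scale ** 2))
    return (weights @ y_data) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 1.0, 20)
y_data = np.sin(2 * np.pi * x_data) + 0.3 * rng.normal(size=x_data.size)  # noisy toy samples

for scale in (100.0, 0.1, 0.001):
    fit = local_average(x_data, x_data, y_data, scale)
    # Largest gap between the smoothed prediction and the raw samples.
    print(scale, float(np.abs(fit - y_data).max()))
```

With a very large scale the prediction collapses to the global mean. With a very small scale it reproduces every noisy sample, which is one intuitive reading of why a shrinking averaging scale can end in instability.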

Doubly High-Dimensional Contextual Bandits: An Interpretable Model for Joint Assortment-Pricing

paper_url: http://arxiv.org/abs/2309.08634
repo_url: None
paper_authors: Junhui Cai, Ran Chen, Martin J. Wainwright, Linda Zhao
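for: This paper addresses key challenges in retail: how to select the products presented to consumers (the assortment problem) and how to price them (the pricing problem) so as to maximize revenue or profit.
methods: The authors propose a joint approach to assortment-pricing based on contextual bandits. The model is doubly high-dimensional, since both context vectors and actions may take values in high-dimensional spaces. To circumvent the curse of dimensionality, a simple yet flexible model captures the interactions between covariates and actions through a (near) low-rank representation matrix. The resulting class of models is reasonably expressive and remains interpretable through latent factors.
results: The authors propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator and prove bounds on its regret. Simulations show lower regret than state-of-the-art methods on various standard bandit and pricing models. In real-world case studies, from an industry-leading instant noodles company to an emerging beauty start-up, the method yields at least three-fold gains in revenue or profit, and the learned latent factor models are interpretable.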

Abstract
Key challenges in running a retail business include how to select products to present to consumers (the assortment problem), and how to price products (the pricing problem) to maximize revenue or profit. Instead of considering these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. Our model is doubly high-dimensional, in that both context vectors and actions are allowed to take values in high-dimensional spaces. In order to circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions via a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as particular cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulation results show that this method has lower regret than state-of-the-art methods applied to various standard bandit and pricing models. Real-world case studies on the assortment-pricing problem, from an industry-leading instant noodles company to an emerging beauty start-up, underscore the gains achievable using our method. In each case, we show at least three-fold gains in revenue or profit by our bandit method, as well as the interpretability of the latent factor models that are learned.

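Summary
Key challenges in running a retail business include how to select the products presented to consumers (the assortment problem) and how to price them (the pricing problem) so as to maximize revenue or profit. Rather than treating these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. The model is doubly high-dimensional: both context vectors and actions may take values in high-dimensional spaces. To circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions through a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and it includes various structured linear bandit and pricing models as special cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulations show lower regret than state-of-the-art methods applied to various standard bandit and pricing models. Real-world case studies, from an industry-leading instant noodles company to an emerging beauty start-up, show at least three-fold gains in revenue or profit with our bandit method, as well as the interpretability of the learned latent factor models.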
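
One way to picture a low-rank representation matrix that couples covariates and actions is a bilinear reward model, sketched below. The bilinear form `context^T theta action`, the dimensions, the noise level, and the use of a truncated SVD as the low-rank step are assumptions made for illustration; the abstract does not give the paper's exact model or estimator.

```python
import numpy as np

def truncate_rank(matrix, rank):
    """Best rank-`rank` approximation in Frobenius norm, via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def expected_reward(context, theta, action):
    """Assumed bilinear reward model: context^T theta action."""
    return float(context @ theta @ action)

rng = np.random.default_rng(0)
d_context, d_action, rank = 30, 40, 2
# Hypothetical ground truth: a rank-2 matrix coupling context and action features.
theta_true = rng.normal(size=(d_context, rank)) @ rng.normal(size=(rank, d_action))
theta_noisy = theta_true + 0.1 * rng.normal(size=theta_true.shape)  # stand-in for a raw estimate
theta_hat = truncate_rank(theta_noisy, rank)

context = rng.normal(size=d_context)
action = rng.normal(size=d_action)
print(np.linalg.matrix_rank(theta_hat), expected_reward(context, theta_hat, action))
```

The two retained singular directions play the role of latent factors. Inspecting the corresponding columns of `u` and rows of `vt` is one simple way such a model could be read for interpretation.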

Hodge-Aware Contrastive Learning

paper_url: http://arxiv.org/abs/2309.07364
repo_url: None
paper_authors: Alexander Möllers, Alexander Immer, Vincent Fortuin, Elvin Isufi
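for: Modeling data with multiway dependencies, such as data defined on network edges or within other higher-order structures.
methods: The approach builds on the Hodge decomposition of the spectrum, encodes data invariances with simplicial neural networks, designs augmentations that yield positive examples with suitable spectral properties, and reweights negative examples in the contrastive loss.
results: The method produces embeddings that capture specific spectral information and achieves superior performance on two standard edge flow classification tasks, even compared with supervised learning techniques.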

Abstract
Simplicial complexes prove effective in modeling data with multiway dependencies, such as data defined along the edges of networks or within other higher-order structures. Their spectrum can be decomposed into three interpretable subspaces via the Hodge decomposition, which has proved foundational in numerous applications. We leverage this decomposition to develop a contrastive self-supervised learning approach for processing simplicial data and generating embeddings that encapsulate specific spectral information. Specifically, we encode the pertinent data invariances through simplicial neural networks and devise augmentations that yield positive contrastive examples with suitable spectral properties for downstream tasks. Additionally, we reweight the significance of negative examples in the contrastive loss, considering the similarity of their Hodge components to the anchor. By encouraging a stronger separation among less similar instances, we obtain an embedding space that reflects the spectral properties of the data. The numerical results on two standard edge flow classification tasks show a superior performance even when compared to supervised learning techniques. Our findings underscore the importance of adopting a spectral perspective for contrastive learning with higher-order data.

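Summary
Simplicial complexes are effective for modeling data with multiway dependencies, such as data defined along the edges of networks or within other higher-order structures. Their spectrum can be decomposed into three interpretable subspaces via the Hodge decomposition, which has proved foundational in numerous applications. We use this decomposition to develop a contrastive self-supervised learning approach that processes simplicial data and generates embeddings encapsulating specific spectral information. Specifically, we encode the pertinent data invariances through simplicial neural networks and devise augmentations that yield positive contrastive examples with spectral properties suited to downstream tasks. We also reweight the significance of negative examples in the contrastive loss according to the similarity of their Hodge components to the anchor. By encouraging a stronger separation among less similar instances, we obtain an embedding space that reflects the spectral properties of the data. Numerical results on two standard edge flow classification tasks show superior performance, even compared with supervised learning techniques. Our findings underscore the importance of a spectral perspective for contrastive learning with higher-order data.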
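
The three subspaces of the Hodge decomposition can be made concrete on a small example. The sketch below splits an edge flow into gradient, curl, and harmonic parts using node-to-edge and edge-to-triangle incidence matrices. The toy complex (4 nodes, 5 edges, one filled triangle), the flow values, and the helper names are assumptions for illustration and are unrelated to the datasets in the paper.

```python
import numpy as np

# Toy complex: 4 nodes, 5 oriented edges, 1 filled triangle (0, 1, 2).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
B1 = np.zeros((4, len(edges)))                        # node-to-edge incidence
for e, (tail, head) in enumerate(edges):
    B1[tail, e], B1[head, e] = -1.0, 1.0
B2 = np.array([[1.0], [-1.0], [1.0], [0.0], [0.0]])   # edge-to-triangle incidence

def project(basis, flow):
    """Orthogonal projection of `flow` onto the column space of `basis`."""
    coeffs, *_ = np.linalg.lstsq(basis, flow, rcond=None)
    return basis @ coeffs

def hodge_decompose(flow):
    gradient = project(B1.T, flow)      # part induced by node potentials
    curl = project(B2, flow)            # part circulating around filled triangles
    harmonic = flow - gradient - curl   # divergence-free and curl-free remainder
    return gradient, curl, harmonic

flow = np.array([1.0, -2.0, 0.5, 3.0, 1.5])
gradient, curl, harmonic = hodge_decompose(flow)
print(gradient, curl, harmonic)
```

Because `B1 @ B2` is zero, the gradient and curl subspaces are orthogonal, so the two projections can be taken independently. The harmonic part here is nonzero only along the unfilled cycle through nodes 1, 2, and 3.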

2023-09-14

cs.CL

cs.CL - 2023-09-14

Connecting the Dots in News Analysis: A Cross-Disciplinary Survey of Media Bias and Framing

paper_url: http://arxiv.org/abs/2309.08069
repo_url: None
paper_authors: Gisela Vallejo, Timothy Baldwin, Lea Frermann
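for: This survey examines bias in news reporting and its effect on society.
methods: It reviews social science approaches and compares them with the task formulations, methods, and evaluation metrics used to analyze media bias in NLP.
results: The survey argues that currently dominant NLP methods fall short in addressing media bias and identifies open questions that need further research.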

Abstract
The manifestation and effect of bias in news reporting have been central topics in the social sciences for decades, and have received increasing attention in the NLP community recently. While NLP can help to scale up analyses or contribute automatic procedures to investigate the impact of biased news in society, we argue that methodologies that are currently dominant fall short of addressing the complex questions and effects addressed in theoretical media studies. In this survey paper, we review social science approaches and draw a comparison with typical task formulations, methods, and evaluation metrics used in the analysis of media bias in NLP. We discuss open questions and suggest possible directions to close identified gaps between theory and predictive models, and their evaluation. These include model transparency, considering document-external information, and cross-document reasoning rather than single-label assignment.

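Summary
The manifestation and effect of bias in news reporting have been central topics in the social sciences for decades and have recently received increasing attention in the NLP community. While NLP can help scale up analyses or contribute automatic procedures for investigating the impact of biased news on society, we argue that the currently dominant methodologies fall short of addressing the complex questions and effects studied in theoretical media studies. In this survey, we review social science approaches and compare them with the typical task formulations, methods, and evaluation metrics used to analyze media bias in NLP. We discuss open questions and suggest directions for closing the identified gaps between theory and predictive models and their evaluation, including model transparency, document-external information, and cross-document reasoning rather than single-label assignment.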

Investigating Gender Bias in News Summarization

paper_url: http://arxiv.org/abs/2309.08047
repo_url: None
paper_authors: Julius Steen, Katja Markert
for: The paper investigates whether the harmful social biases of large language models (LLMs) affect summarization models.
methods: The paper introduces several definitions of biased behaviour in summarization models and proposes a method for generating input documents with controlled demographic attributes, which sidesteps the biases inherent in the input document.
results: Content selection in single-document summarization is largely unaffected by bias, while hallucinations show evidence of biases propagating to generated summaries.

Abstract
Summarization is an important application of large language models (LLMs). Most previous evaluation of summarization models has focused on their performance in content selection, grammaticality and coherence. However, it is well known that LLMs reproduce and reinforce harmful social biases. This raises the question: Do these biases affect model outputs in a relatively constrained setting like summarization? To help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical measures to quantify them. Since we find biases inherent to the input document can confound our analysis, we additionally propose a method to generate input documents with carefully controlled demographic attributes. This allows us to sidestep this issue, while still working with somewhat realistic input documents. Finally, we apply our measures to summaries generated by both purpose-built summarization models and general purpose chat models. We find that content selection in single document summarization seems to be largely unaffected by bias, while hallucinations exhibit evidence of biases propagating to generated summaries.

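Summary
Summarization is an important application of large language models (LLMs). Most previous evaluations of summarization models have focused on content selection, grammaticality, and coherence. It is well known, however, that LLMs reproduce and reinforce harmful social biases, which raises the question of whether these biases affect model outputs in a relatively constrained setting like summarization. To help answer it, we first motivate and introduce several definitions of biased behaviour in summarization models, along with practical measures to quantify them. Because biases inherent to the input document can confound the analysis, we also propose a method for generating input documents with carefully controlled demographic attributes, which sidesteps the issue while keeping the inputs somewhat realistic. Finally, we apply our measures to summaries generated by both purpose-built summarization models and general-purpose chat models. We find that content selection in single-document summarization seems largely unaffected by bias, while hallucinations show evidence of biases propagating to generated summaries.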

AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

paper_url: http://arxiv.org/abs/2309.08030
repo_url: None
paper_authors: Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu
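for: This work aims to improve audio-visual speech enhancement (AVSE), whose development is hampered by the scarcity of clean ground-truth speech in real-world audio-visual training data.
methods: The paper proposes a diffusion-based AVSE approach: a neural quality estimator extracts a nearly clean speech subset from an audio-visual dataset, and a diffusion model trained on that subset generates waveforms conditioned on continuous speech representations from AV-HuBERT.
results: Continuous speech representations retain prosody and speaker information, and the approach outperforms a masking-based baseline on both automatic metrics and a human listening test. Fine-tuning the diffusion model on clean/noisy utterance pairs further improves performance.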

Abstract
Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.


DiariST: Streaming Speech Translation with Speaker Diarization

paper_url: http://arxiv.org/abs/2309.08007
repo_url: https://github.com/mu-y/diarist
paper_authors: Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka
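for: This paper proposes DiariST, a method for joint streaming speech translation (ST) and speaker diarization (SD).
methods: The system builds on a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector to improve streaming ST and SD.
results: DiariST is evaluated on a new dataset derived from AliMeeting with new metrics, speaker-agnostic BLEU and speaker-attributed BLEU, which measure ST quality while accounting for SD accuracy. It achieves strong ST and SD capability compared to offline Whisper-based systems while performing streaming inference on overlapping speech.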

Abstract
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate the research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.

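Summary
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges, such as speaker diarization (SD) without accurate word time stamps and streaming handling of overlapping speech. In this work, we propose DiariST, the first streaming ST and SD solution. It builds on a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, both originally developed for multi-talker speech recognition. Because this area lacks evaluation benchmarks, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, speaker-agnostic BLEU and speaker-attributed BLEU, which measure ST quality while taking SD accuracy into account. Our system achieves strong ST and SD capability compared to offline Whisper-based systems while performing streaming inference on overlapping speech. To facilitate research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.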

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

paper_url: http://arxiv.org/abs/2309.07998
repo_url: None
paper_authors: Sarah E. Finch, James D. Finch, Jinho D. Choi
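for: This paper examines how the choice of human evaluator group affects the evaluation of chat-oriented dialogue systems.
methods: Four state-of-the-art dialogue systems are evaluated by four distinct evaluator groups, and the impact of the evaluator group is analyzed.
results: Likert evaluations are robust to the choice of evaluator group, with only minor differences, whereas Pairwise evaluations are not. Two limitations to this robustness are observed: discrepancies between evaluators with different levels of chatbot expertise, and an indication that evaluator objectivity benefits certain dialogue metrics.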

Abstract
Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, there is a significant variation in previous work regarding who gets recruited as evaluators. Evaluator groups such as domain experts, university students, and professional annotators have been used to assess and compare dialogue systems, although it is unclear to what extent the choice of an evaluator group can affect results. This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups. Our analysis reveals a robustness towards evaluator groups for Likert evaluations that is not seen for Pairwise, with only minor differences observed when changing evaluator groups. Furthermore, two notable limitations to this robustness are observed, which reveal discrepancies between evaluators with different levels of chatbot expertise and indicate that evaluator objectivity is beneficial for certain dialogue metrics.

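Summary
Human evaluation is widely accepted as the standard for evaluating dialogue systems. However, previous work varies considerably in who is recruited as evaluators. Evaluator groups such as domain experts, university students, and professional annotators have been used to assess and compare dialogue systems, yet it is unclear how much this choice affects results. This paper analyzes the impact of the evaluator group by testing 4 state-of-the-art dialogue systems with 4 distinct evaluator groups. Our analysis shows that Likert evaluations are robust to the evaluator group, with only minor differences when the group changes, a robustness not seen for Pairwise evaluations. We also observe two limitations to this robustness: discrepancies between evaluators with different levels of chatbot expertise, and an indication that evaluator objectivity is beneficial for certain dialogue metrics.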

Leveraging Contextual Information for Effective Entity Salience Detection

paper_url: http://arxiv.org/abs/2309.07990
repo_url: None
paper_authors: Rajarshi Bhowmik, Marco Ponza, Atharva Tendle, Anant Gupta, Rebecca Jiang, Xingyu Lu, Qian Zhao, Daniel Preotiuc-Pietro
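for: This work detects the salient entities in a document to improve downstream applications such as search, ranking, and entity-centric summarization.
methods: Medium-sized pre-trained language models are fine-tuned with a cross-encoder style architecture.
results: Fine-tuning medium-sized pre-trained language models yields substantial performance gains that feature engineering approaches do not reach.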

Abstract
In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed as salient entities, provide useful cues of the aboutness of a document to a reader. Identifying the salience of entities was found helpful in several downstream applications such as search, ranking, and entity-centric summarization, among others. Prior work on salient entity detection mainly focused on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmarking of four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.

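Summary
In a document, the content and key events usually revolve around a subset of all the entities mentioned. These entities, often called salient entities, give the reader useful cues about what the document is about. Identifying salient entities helps several downstream applications, such as search, ranking, and entity-centric summarization. Prior work on salient entity detection mainly relied on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we benchmark four publicly available datasets using models representative of the medium-sized pre-trained language model family. We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.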

Ambiguity-Aware In-Context Learning with Large Language Models

paper_url: http://arxiv.org/abs/2309.07900
repo_url: None
paper_authors: Lingyu Gao, Aditi Chaudhary, Krishna Srinivasan, Kazuma Hashimoto, Karthik Raman, Michael Bendersky
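for: This work studies how to select good demonstrations for in-context learning (ICL), in which a few task-specific examples improve the downstream performance of large language models (LLMs).
methods: A text retriever selects ICL demonstrations that are semantically similar to the test input. Since this alone ignores the LLM's existing knowledge about the task, the selection also takes that knowledge into account, especially with respect to the output label space.
results: Extensive experiments on three text classification tasks show that it helps to choose demonstrations that are not only semantically similar but also resolve the inherent label ambiguity surrounding the test example. Demonstrations that the LLM previously misclassified and that fall on the test example's decision boundary bring the largest gain.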

Abstract
In-context learning (ICL) i.e. showing LLMs only a few task-specific demonstrations has led to downstream gains with no task-specific fine-tuning required. However, LLMs are sensitive to the choice of prompts, and therefore a crucial research question is how to select good demonstrations for ICL. One effective strategy is leveraging semantic similarity between the ICL demonstrations and test inputs by using a text retriever, which however is sub-optimal as that does not consider the LLM's existing knowledge about that task. From prior work (Min et al., 2022), we already know that labels paired with the demonstrations bias the model predictions. This leads us to our hypothesis whether considering LLM's existing knowledge about the task, especially with respect to the output label space can help in a better demonstration selection strategy. Through extensive experimentation on three text classification tasks, we find that it is beneficial to not only choose semantically similar ICL demonstrations but also to choose those demonstrations that help resolve the inherent label ambiguity surrounding the test example. Interestingly, we find that including demonstrations that the LLM previously mis-classified and also fall on the test example's decision boundary, brings the most performance gain.

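Summary
In-context learning (ICL), i.e. showing LLMs only a few task-specific demonstrations, has led to downstream gains without task-specific fine-tuning. However, LLMs are sensitive to the choice of prompts, so a crucial research question is how to select good demonstrations for ICL. One effective strategy uses a text retriever to exploit the semantic similarity between the demonstrations and the test input. This is sub-optimal, however, because it does not consider the LLM's existing knowledge about the task. From prior work (Min et al., 2022), we already know that labels paired with the demonstrations bias the model's predictions. This leads to our hypothesis that considering the LLM's existing knowledge about the task, especially with respect to the output label space, can yield a better demonstration selection strategy. Through extensive experiments on three text classification tasks, we find that it is beneficial not only to choose semantically similar ICL demonstrations but also to choose demonstrations that help resolve the inherent label ambiguity surrounding the test example. Interestingly, including demonstrations that the LLM previously misclassified and that fall on the test example's decision boundary brings the largest performance gain.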

Anchor Points: Benchmarking Models with Much Fewer Examples

paper_url: http://arxiv.org/abs/2309.08638
repo_url: https://github.com/rvivek3/anchorpoints
paper_authors: Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
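for: This paper explores how to benchmark language models with much smaller evaluation sets.
methods: The authors propose Anchor Point Selection, a technique for selecting small subsets of a dataset that capture model behavior across the entire dataset.
results: Anchor points rank models reliably, and just several anchor points suffice to estimate a model's per-class predictions on all other points in the dataset with low mean absolute error.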

Abstract
Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models using 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error, sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and facilitating comparisons of the performance of different models on various regions within the dataset distribution.

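Summary
Modern language models often exhibit powerful but brittle behavior, which has led to larger and more diverse benchmarks for assessing them reliably. Here, we suggest that model performance can be benchmarked with much smaller evaluation sets. We first show that, in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. Building on this phenomenon, we propose Anchor Point Selection, a technique for selecting small subsets of a dataset that capture model behavior across the entire dataset. Anchor points rank models reliably: across 87 diverse language model-prompt pairs, evaluating models with 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just several anchor points can be used to estimate a model's per-class predictions on all other points in a dataset with low mean absolute error, sufficient for gauging where the model is likely to fail. Finally, we present Anchor Point Maps for visualizing these insights and comparing the performance of different models on various regions of the dataset distribution.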
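
The selection idea in this entry can be illustrated with a small sketch. It assumes a matrix `confidences[p][m]` holding model `m`'s confidence in the correct class of point `p`, and it uses a greedy coverage heuristic over Pearson correlations. Both the function names and the greedy heuristic are assumptions made for illustration, and they may differ from the authors' actual selection procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def select_anchor_points(confidences, k):
    """Greedily pick k points so every point correlates well with some anchor.

    confidences[p][m] is the confidence of model m in the correct class of
    point p (hypothetical input layout).
    """
    n = len(confidences)
    corr = [[pearson(confidences[i], confidences[j]) for j in range(n)]
            for i in range(n)]
    anchors = []
    best_cover = [-1.0] * n  # best correlation of each point to a chosen anchor

    for _ in range(k):
        def coverage_if_added(candidate):
            return sum(max(best_cover[p], corr[candidate][p]) for p in range(n))

        remaining = [c for c in range(n) if c not in anchors]
        chosen = max(remaining, key=coverage_if_added)
        anchors.append(chosen)
        best_cover = [max(best_cover[p], corr[chosen][p]) for p in range(n)]
    return anchors
```

With two groups of points whose confidences move together across models, the heuristic tends to pick one representative per group, which is the intuition behind evaluating on a handful of anchors instead of the full dataset.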

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

paper_url: http://arxiv.org/abs/2309.07875
repo_url: https://github.com/vinid/instruction-llms-safety-eval
paper_authors: Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
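for: This paper raises concerns about the safety of large language models whose instruction tuning emphasizes only helpfulness, not safety.
methods: The authors evaluate the safety and helpfulness of several instruction-tuned models and fine-tune models such as LLaMA with a small share of safety examples added to the training set.
results: Several popular instruction-tuned models are highly unsafe. Adding just 3% safety examples to the training set substantially improves safety, but too much safety-tuning makes models refuse reasonable prompts.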

Abstract
Training large language models to follow instructions makes them perform better on a wide range of tasks, generally becoming more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not safety, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) in the training set when fine-tuning a model like LLaMA can substantially improve their safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on trade-offs in training LLMs to follow instructions and exhibit safe behavior.

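Summary
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally more helpful. However, a perfectly helpful model will follow even malicious instructions and readily generate harmful content. In this paper, we raise concerns about models whose instruction tuning emphasizes only helpfulness, not safety. We find that several popular instruction-tuned models are highly unsafe. We also find that adding a small number of safety examples (a few hundred demonstrations) when fine-tuning a model such as LLaMA substantially improves its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do observe exaggerated safety: after too much safety-tuning, models refuse reasonable prompts that superficially resemble unsafe ones. Our study sheds light on the trade-offs in training LLMs to follow instructions and behave safely.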

Agents: An Open-source Framework for Autonomous Language Agents

paper_url: http://arxiv.org/abs/2309.07870
repo_url: https://github.com/aiwaves-cn/agents
paper_authors: Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan
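for: This paper builds on recent advances in large language models (LLMs) for language agents and releases Agents, an open-source library that lets non-specialists build autonomous language agents that solve tasks and interact with environments, humans, and other agents through natural language interfaces.
methods: Agents builds on LLMs and is packaged as an open-source library so that non-specialists can build, customize, test, tune, and deploy autonomous language agents without much coding.
results: The Agents library supports important features including planning, memory, tool usage, multi-agent communication, and fine-grained symbolic control.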

Abstract
Recent advances on large language models (LLMs) enable researchers and developers to build autonomous language agents that can automatically solve various tasks and interact with environments, humans, and other agents using natural language interfaces. We consider language agents as a promising direction towards artificial general intelligence and release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience. Agents is carefully engineered to support important features including planning, memory, tool usage, multi-agent communication, and fine-grained symbolic control. Agents is user-friendly as it enables non-specialists to build, customize, test, tune, and deploy state-of-the-art autonomous language agents without much coding. The library is also research-friendly as its modularized design makes it easily extensible for researchers. Agents is available at https://github.com/aiwaves-cn/agents.


CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration

paper_url: http://arxiv.org/abs/2309.07822
repo_url: https://github.com/ukplab/catfood
paper_authors: Rachneet Sachdeva, Martin Tutek, Iryna Gurevych
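for: This work aims to improve the out-of-domain (OOD) performance and calibration of small language models (SLMs) in extractive question answering (QA).
methods: Large language models (LLMs) automatically generate counterfactual (CF) instances, i.e. minimally altered inputs, that augment the training data of SLMs.
results: Across various LLM generators, CF data augmentation consistently enhances OOD performance and improves model calibration.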

Abstract
In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly at generating text conditioned on a prompt. In our work, we investigate the use of LLMs to augment training data of small language models (SLMs) with automatically generated counterfactual (CF) instances, i.e. minimally altered inputs, in order to improve out-of-domain (OOD) performance of SLMs in the extractive question answering (QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.

Summary
In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly at generating text conditioned on a prompt. In our work, we investigate using LLMs to augment the training data of small language models (SLMs) in order to improve their out-of-domain (OOD) performance. We find that data augmentation with various LLM generators consistently enhances the OOD performance of SLMs and improves model calibration. These improvements correlate with higher diversity of counterfactual (CF) instances in both surface form and semantic content. Finally, we find that CF-augmented models that are easier to calibrate also exhibit lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.

Text Classification of Cancer Clinical Trial Eligibility Criteria

paper_url: http://arxiv.org/abs/2309.07812
repo_url: None
paper_authors: Yumeng Yang, Soumya Jayaraj, Ethan B Ludmir, Kirk Roberts
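for: This study addresses the automatic identification of clinical trials for which a patient is eligible, which is complicated because eligibility criteria are stated in natural language.
methods: Text classification methods are applied to common exclusion criteria.
results: The results demonstrate the feasibility of automatically classifying common exclusion criteria. A language model pre-trained specifically for clinical trials yields the highest average performance across all criteria.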

Abstract
Automatic identification of clinical trials for which a patient is eligible is complicated by the fact that trial eligibility is stated in natural language. A potential solution to this problem is to employ text classification methods for common types of eligibility criteria. In this study, we focus on seven common exclusion criteria in cancer trials: prior malignancy, human immunodeficiency virus, hepatitis B, hepatitis C, psychiatric illness, drug/substance abuse, and autoimmune illness. Our dataset consists of 764 phase III cancer trials with these exclusions annotated at the trial level. We experiment with common transformer models as well as a new pre-trained clinical trial BERT model. Our results demonstrate the feasibility of automatically classifying common exclusion criteria. Additionally, we demonstrate the value of a pre-trained language model specifically for clinical trials, which yields the highest average performance across all criteria.

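Summary
Automatically identifying the clinical trials for which a patient is eligible is complicated because eligibility is stated in natural language. One potential solution is to apply text classification methods to common types of eligibility criteria. In this study, we focus on seven common exclusion criteria in cancer trials: prior malignancy, human immunodeficiency virus, hepatitis B, hepatitis C, psychiatric illness, drug/substance abuse, and autoimmune illness. Our dataset consists of 764 phase III cancer trials with these exclusions annotated at the trial level. We experiment with common transformer models as well as a new pre-trained clinical trial BERT model. Our results demonstrate the feasibility of automatically classifying common exclusion criteria and show that the clinical trial BERT model yields the highest average performance across all criteria.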

Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names?

paper_url: http://arxiv.org/abs/2309.07804
repo_url: None
paper_authors: Terry Yue Zhuo, Xiaoning Du, Zhenchang Xing, Jiamou Sun, Haowei Quan, Li Li, Liming Zhu
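for: This study examines how well current pre-trained code models handle automated API usage, and how code representations and automated API usage can be improved in code intelligence practice.
methods: The study uses knowledge probing, with cloze-style tests that measure the knowledge stored in a model. It examines a code model's understanding of API fully qualified names (FQNs) from two perspectives: API call and API import.
results: Current pre-trained code models struggle to understand FQNs, and pre-training strategies significantly affect the quality of API name learning. Natural language context helps code models locate Python API names and generalize that knowledge to unseen data. The findings suggest that incorporating API structure into pre-training can improve automated API usage and code representations.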

Abstract
Recent breakthroughs in pre-trained code models, such as CodeBERT and Codex, have shown their superior performance in various downstream tasks. The correctness and unambiguity of API usage among these code models are crucial for achieving desirable program functionalities, requiring them to learn various API fully qualified names structurally and semantically. Recent studies reveal that even state-of-the-art pre-trained code models struggle with suggesting the correct APIs during code generation. However, the reasons for such poor API usage performance are barely investigated. To address this challenge, we propose using knowledge probing as a means of interpreting code models, which uses cloze-style tests to measure the knowledge stored in models. Our comprehensive study examines a code model's capability of understanding API fully qualified names from two different perspectives: API call and API import. Specifically, we reveal that current code models struggle with understanding API names, with pre-training strategies significantly affecting the quality of API name learning. We demonstrate that natural language context can assist code models in locating Python API names and generalize Python API name knowledge to unseen data. Our findings provide insights into the limitations and capabilities of current pre-trained code models, and suggest that incorporating API structure into the pre-training process can improve automated API usage and code representations. This work provides significance for advancing code intelligence practices and direction for future studies. All experiment results, data and source code used in this work are available at https://doi.org/10.5281/zenodo.7902072.

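Summary
Recent pre-trained code models such as CodeBERT and Codex have shown superior performance on various downstream tasks. Correct and unambiguous API usage is crucial for achieving the desired program functionality, and requires these models to learn API fully qualified names both structurally and semantically. Yet even state-of-the-art pre-trained code models struggle to suggest the correct APIs during code generation, and the reasons have barely been investigated. To address this challenge, we propose using knowledge probing to interpret code models, with cloze-style tests that measure the knowledge stored in a model. Our comprehensive study examines a code model's understanding of API fully qualified names from two perspectives: API call and API import. We find that current code models struggle to understand API names, and that pre-training strategies significantly affect the quality of API name learning. We also find that natural language context helps code models locate Python API names and generalize Python API name knowledge to unseen data. Our findings give insight into the limitations and capabilities of pre-trained code models and suggest that incorporating API structure into pre-training can improve automated API usage and code representations. This work offers direction for code intelligence practice and future studies. All experiment results, data, and source code are available at https://doi.org/10.5281/zenodo.7902072.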
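
To make the cloze-style probing idea concrete, here is a minimal sketch that turns one fully qualified name into masked probes, one per name segment. The function name `make_cloze_probes`, the `<mask>` token, and the per-segment masking scheme are assumptions for illustration. The abstract does not specify the exact prompt templates the authors use.

```python
def make_cloze_probes(fqn, mask_token="<mask>"):
    """Return (masked_name, answer) pairs, masking one dotted segment at a time.

    fqn is an API fully qualified name such as "numpy.linalg.norm".
    The mask token is a placeholder; real models define their own.
    """
    parts = fqn.split(".")
    probes = []
    for i, answer in enumerate(parts):
        masked = parts[:i] + [mask_token] + parts[i + 1:]
        probes.append((".".join(masked), answer))
    return probes


for masked_name, answer in make_cloze_probes("numpy.linalg.norm"):
    print(f"import {masked_name}  # expected fill: {answer}")
```

A probing study would feed each masked string to a code model and check whether the top prediction matches the held-out segment. Masking per segment makes it possible to see whether the package, module, or function level is hardest to recall.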

The Dynamical Principles of Storytelling

paper_url: http://arxiv.org/abs/2309.07797
repo_url: None
paper_authors: Isidoros Doxas, James Meiss, Steven Bottone, Tom Strelich, Andrew Plummer, Adrienne Breland, Simon Dennis, Kathy Garvin-Doxas, Michael Klymkowsky
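for: This study examines the opening part of 1800 short stories and finds that the first dozen paragraphs of the average narrative follow an action principle as defined in arXiv:2309.06600.
methods: The researchers shuffle the order of the paragraphs and check whether the average narrative still exhibits the property in semantic space.
results: The findings show a preferential direction in semantic space when starting a story, possibly related to a common Western storytelling tradition as implied by Aristotle in Poetics.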

Abstract
When considering the opening part of 1800 short stories, we find that the first dozen paragraphs of the average narrative follow an action principle as defined in arXiv:2309.06600. When the order of the paragraphs is shuffled, the average no longer exhibits this property. The findings show that there is a preferential direction we take in semantic space when starting a story, possibly related to a common Western storytelling tradition as implied by Aristotle in Poetics.

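Summary
Considering the opening part of 1800 short stories, we find that the first dozen paragraphs of the average narrative follow an action principle as defined in arXiv:2309.06600. When the order of the paragraphs is shuffled, the average no longer exhibits this property. The findings show a preferential direction in semantic space when starting a story, possibly related to a common Western storytelling tradition as implied by Aristotle in Poetics.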

paper_url: http://arxiv.org/abs/2309.07794
repo_url: None
paper_authors: Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras
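for: This work aims to make better use of multimodal information in social media posts for downstream tasks such as sentiment analysis, sarcasm detection, and hate speech classification.
methods: The paper proposes two auxiliary tasks used jointly with the main task when fine-tuning a pre-trained multimodal model: Image-Text Contrastive (ITC) and Image-Text Matching (ITM), which directly model the relationship between image and text.
results: Combining these objectives with five multimodal models yields consistent improvements across four social media datasets. A detailed analysis identifies the scenarios and cases where each auxiliary task is most effective.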

Abstract
Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection and hate speech classification. However, combining text and image information is challenging because of the idiosyncratic cross-modal semantics with hidden or complementary information present in matching image-text pairs. In this work, we aim to directly model this by proposing the use of two auxiliary losses jointly with the main task when fine-tuning any pre-trained multimodal model. Image-Text Contrastive (ITC) brings image-text representations of a post closer together and separates them from different posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates the understanding of semantic correspondence between images and text by penalizing unrelated pairs. We combine these objectives with five multimodal models, demonstrating consistent improvements across four popular social media datasets. Furthermore, through detailed analysis, we shed light on the specific scenarios and cases where each auxiliary task proves to be most effective.

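Summary
Effectively leveraging multimodal information from social media posts benefits various downstream tasks, such as sentiment analysis, sarcasm detection, and hate speech classification. However, combining text and image information is challenging because of idiosyncratic cross-modal semantics, with hidden or complementary information in matching image-text pairs. In this work, we propose using two auxiliary losses jointly with the main task when fine-tuning any pre-trained multimodal model. Image-Text Contrastive (ITC) brings the image and text representations of a post closer together and separates them from those of other posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates understanding of the semantic correspondence between images and text by penalizing unrelated pairs. We combine these objectives with five multimodal models and demonstrate consistent improvements on four popular social media datasets. Through detailed analysis, we also identify the scenarios and cases where each auxiliary task is most effective.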
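
The ITC objective described in this entry pulls the image and text embeddings of the same post together and pushes apart embeddings from different posts. One common way to express that is a symmetric InfoNCE loss over a batch. The sketch below assumes that formulation, a temperature of 0.07, and the hypothetical function name `itc_loss`. The abstract does not give the exact loss, so this is an illustration and not the authors' implementation.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[_dot(img, txt) / temperature for txt in texts] for img in images]
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = _cross_entropy_on_diagonal(sim)
    text_to_image = _cross_entropy_on_diagonal(sim_transposed)
    return 0.5 * (image_to_text + text_to_image)
```

During fine-tuning, a term like this would be added to the main classification loss with some weight. The weighting scheme is a design choice the abstract does not describe.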

Spoken Humanoid Embodied Conversational Agents in Mobile Serious Games: A Usability Assessment

paper_url: http://arxiv.org/abs/2309.07773
repo_url: None
paper_authors: Danai Korre, Judy Robertson
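for: This study investigates whether spoken Humanoid Embodied Conversational Agents (HECAs) can foster usability in mobile serious game (MSG) applications.
methods: The experiment compares two styles of agent presentation: an agent of high human-likeness (HECA) and an agent of low human-likeness (text). It assesses the impact of multiple agents and the illusion of humanness on the quality of the interaction.
results: Users prefer to interact with the HECAs. The difference between the two versions is statistically significant with a large effect size (d=1.01), and many participants said the human-like characteristics made the HECA version more appealing. The research informs the design of future mobile serious games.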

Abstract
This paper presents an empirical investigation of the extent to which spoken Humanoid Embodied Conversational Agents (HECAs) can foster usability in mobile serious game (MSG) applications. The aim of the research is to assess the impact of multiple agents and illusion of humanness on the quality of the interaction. The experiment investigates two styles of agent presentation: an agent of high human-likeness (HECA) and an agent of low human-likeness (text). The purpose of the experiment is to assess whether and how agents of high humanlikeness can evoke the illusion of humanness and affect usability. Agents of high human-likeness were designed by following the ECA design model that is a proposed guide for ECA development. The results of the experiment with 90 participants show that users prefer to interact with the HECAs. The difference between the two versions is statistically significant with a large effect size (d=1.01), with many of the participants justifying their choice by saying that the human-like characteristics of the HECA made the version more appealing. This research provides key information on the potential effect of HECAs on serious games, which can provide insight into the design of future mobile serious games.
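
The effect size d reported above is Cohen's d, a standardized mean difference. As a rough illustration of how such a value is computed, the sketch below uses the pooled-standard-deviation form for two independent groups. Whether the study used this form or a paired variant is not stated in the abstract, so the choice is an assumption, and the sample scores are made up.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (independent groups)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical usability scores, not data from the paper.
heca_scores = [2, 4, 6]
text_scores = [1, 3, 5]
print(cohens_d(heca_scores, text_scores))
```

By the usual convention, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects, which is why d=1.01 is described as large.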

Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

paper_url: http://arxiv.org/abs/2309.07765
repo_url: None
paper_authors: Sizhou Chen, Songyang Gao, Sen Fang
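for: This paper aims to improve model performance in automatic speech recognition (ASR) tasks.
methods: The paper uses a variable-length attention mechanism that accommodates speech samples of varying duration and complexity.
results: Experiments show that integrating the Echo-MSA module into the primary model's training regime significantly improves word error rate (WER) performance while preserving the intrinsic stability of the original model.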

Abstract
The Transformer architecture has proven to be highly effective for Automatic Speech Recognition (ASR) tasks, becoming a foundational component for a plethora of research in the domain. Historically, many approaches have leaned on fixed-length attention windows, which becomes problematic for varied speech samples in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features across various granularities, spanning from frames and phonemes to words and discourse. The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention. Our evaluation leverages a parallel attention architecture complemented by a dynamic gating mechanism that amalgamates traditional attention with the Echo-MSA module output. Empirical evidence from our study reveals that integrating Echo-MSA into the primary model's training regime significantly enhances the word error rate (WER) performance, all while preserving the intrinsic stability of the original model.

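Summary
The Transformer architecture has proven highly effective for automatic speech recognition (ASR) tasks and has become a foundational component in the field. Historically, many approaches have relied on fixed-length attention windows, which is problematic for speech samples that vary in duration and complexity and leads to data over-smoothing and neglect of essential long-term connectivity. To address this limitation, we introduce Echo-MSA, a module equipped with a variable-length attention mechanism. The module can extract speech features at various granularities, from frames and phonemes to words and discourse. The design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation uses a parallel attention architecture with a dynamic gating mechanism that combines traditional attention with the Echo-MSA module output. Our experiments show that integrating Echo-MSA into the primary model's training significantly improves word error rate (WER) performance while preserving the intrinsic stability of the original model.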
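
Since this entry reports its gains in word error rate (WER), a short sketch of the metric may help. WER is the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The function name and whitespace tokenization are assumptions for illustration. Evaluation toolkits usually add text normalization that is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.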

PROGrasp: Pragmatic Human-Robot Communication for Object Grasping

paper_url: http://arxiv.org/abs/2309.07759
repo_url: None
paper_authors: Gi-Cheon Kang, Junghyun Kim, Jaein Kim, Byoung-Tak Zhang
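for: This study introduces a new human-robot interaction task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial).
methods: The authors propose a new robotic system, Pragmatic Object Grasping (PROGrasp), which performs Pragmatic-IOG with modules for visual grounding, question asking, object grasping, and answer interpretation.
results: Experimental results show that PROGrasp is effective in both offline and online settings.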

Abstract
Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction. Current IOG systems assume that a human user initially specifies the target object's category (e.g., bottle). Inspired by pragmatics, where humans often convey their intentions by relying on context to achieve goals, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In our proposed task scenario, an intention-oriented utterance (e.g., "I am thirsty") is initially given to the robot. The robot should then identify the target object by interacting with a human user. Based on the task setup, we propose a new robotic system that can interpret the user's intention and pick up the target object, Pragmatic Object Grasping (PROGrasp). PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and most importantly, answer interpretation for pragmatic inference. Experimental results show that PROGrasp is effective in offline (i.e., target object discovery) and online (i.e., IOG with a physical robot arm) settings.

Summary
Interactive Object Grasping (IOG) is the task of identifying and grasping a desired object through natural-language interaction between a human and a robot. Current IOG systems assume that the user first names the target object's category (e.g., bottle). Inspired by pragmatics, where people often convey intentions through context, the paper introduces a new task, Pragmatic-IOG, and a dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In this scenario the robot first receives an intention-oriented utterance (e.g., "I am thirsty") and must then identify the target object by interacting with the user. The proposed system, Pragmatic Object Grasping (PROGrasp), combines modules for visual grounding, question asking, object grasping, and, most importantly, answer interpretation for pragmatic inference. Experiments show that PROGrasp is effective in offline (target object discovery) and online (IOG with a physical robot arm) settings.

The complementary roles of non-verbal cues for Robust Pronunciation Assessment

paper_url: http://arxiv.org/abs/2309.07739
repo_url: None
paper_authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
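for: This study addresses pronunciation assessment of non-native (L2) speech, extending the usual phonetic and phonological features with non-verbal cues.
methods: It proposes a new pronunciation assessment framework, IntraVerbalPA, which combines fine-grained frame-level and abstract utterance-level non-verbal cues with conventional speech and phoneme representations, and introduces a "Goodness of phonemic-duration" metric to model the duration distribution.
results: The results validate IntraVerbalPA and its individual components; its performance matches or exceeds that of existing work.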

Abstract
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within non-verbal cues. In this study, we propose a novel pronunciation assessment framework, IntraVerbalPA. The framework incorporates both fine-grained frame-level and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce a "Goodness of phonemic-duration" metric to effectively model the duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing research works.

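Summary
Research on pronunciation assessment usually concentrates on the phonetic and phonological aspects of non-native (L2) speech and often overlooks the rich information carried by non-verbal cues. This study proposes a new pronunciation assessment framework, IntraVerbalPA, which combines fine-grained frame-level and abstract utterance-level non-verbal cues with conventional speech and phoneme representations. It also introduces a "Goodness of phonemic-duration" metric to model the duration distribution. The results validate the framework and its components, with performance that matches or exceeds existing work.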

Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features

paper_url: http://arxiv.org/abs/2309.07733
repo_url: None
paper_authors: Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena Baralis
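for: This study aims to explain how speech classification models work internally so that they can be better understood and trusted.
methods: It generates easy-to-interpret explanations via input perturbation at two levels: words and paralinguistic features. Word-level explanations show how each word-related audio segment affects the outcome, while paralinguistic explanations answer the counterfactual question "What would the model prediction be if we edited the audio signal in this way?"
results: The approach is validated by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. The explanations are faithful to the models' inner workings and plausible to humans.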

Abstract
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: "What would the model prediction be if we edited the audio signal in this way?" We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings demonstrate that the explanations are faithful to the model's inner workings and plausible to humans. Our method and findings pave the way for future research on interpreting speech models.
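
The word-level perturbation idea can be sketched in a few lines. This is a minimal illustration, assuming that perturbing a word means silencing its samples and that the model is exposed as a callable returning the target-class probability; the names `word_level_attributions`, `predict_proba`, and `toy_model` are hypothetical, and the paper's actual perturbation procedure may differ.

```python
def word_level_attributions(audio, segments, predict_proba):
    """Score each word by how much the prediction drops when it is silenced.

    audio: list of float samples.
    segments: list of (word, start, end) with sample indices, end exclusive.
    predict_proba: callable mapping audio to the target-class probability.
    """
    base = predict_proba(audio)
    scores = []
    for word, start, end in segments:
        perturbed = list(audio)
        # Assumption: the perturbation replaces the word's samples with silence.
        perturbed[start:end] = [0.0] * (end - start)
        scores.append((word, base - predict_proba(perturbed)))
    return scores


def toy_model(audio):
    # Stand-in classifier: mean absolute amplitude, not a real speech model.
    return sum(abs(x) for x in audio) / len(audio)
```

A large positive score means the prediction relied on that word's audio; a score near zero means silencing it changed little.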

PerPLM: Personalized Fine-tuning of Pretrained Language Models via Writer-specific Intermediate Learning and Prompts

paper_url: http://arxiv.org/abs/2309.07727
repo_url: None
paper_authors: Daisuke Oba, Naoki Yoshinaga, Masashi Toyoda
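for: This study aims to improve accuracy on text understanding tasks by personalizing the fine-tuning of PLMs for specific writers.
methods: Writer-specific prompts personalize a single unified PLM, which avoids the cost of fine-tuning and storing a separate PLM for each user. A personalized intermediate learning step based on masked language modeling extracts task-independent traits of each writer's text.
results: Experiments across multiple tasks, datasets, and PLMs reveal the characteristics of the different prompt types and the effectiveness of the intermediate learning approach.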

Abstract
The meanings of words and phrases depend not only on where they are used (contexts) but also on who uses them (writers). Pretrained language models (PLMs) are powerful tools for capturing context, but they are typically pretrained and fine-tuned for universal use across different writers. This study aims to improve the accuracy of text understanding tasks by personalizing the fine-tuning of PLMs for specific writers. We focus on a general setting where only the plain text from target writers is available for personalization. To avoid the cost of fine-tuning and storing multiple copies of PLMs for different users, we exhaustively explore using writer-specific prompts to personalize a unified PLM. Since the design and evaluation of these prompts is an underdeveloped area, we introduce and compare different types of prompts that are possible in our setting. To maximize the potential of prompt-based personalized fine-tuning, we propose a personalized intermediate learning based on masked language modeling to extract task-independent traits of writers' text. Our experiments, using multiple tasks, datasets, and PLMs, reveal the nature of different prompts and the effectiveness of our intermediate learning approach.

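Summary
The meaning of a text depends not only on its context but also on its writer. Pretrained language models (PLMs) capture context well, but they are usually pretrained and fine-tuned for universal use. This study improves accuracy on text understanding tasks by personalizing PLM fine-tuning, in a setting where only the target writers' plain text is available. To avoid fine-tuning and storing a PLM per user, it explores writer-specific prompts for a unified PLM and, because prompt design and evaluation here is underdeveloped, introduces and compares several prompt types. It also proposes a personalized intermediate learning step based on masked language modeling to extract task-independent traits of writers' text. Experiments across multiple tasks, datasets, and PLMs show the nature of the different prompts and the effectiveness of the intermediate learning.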

L1-aware Multilingual Mispronunciation Detection Framework

paper_url: http://arxiv.org/abs/2309.07719
repo_url: None
paper_authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
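for: This study proposes a multilingual mispronunciation detection and diagnosis (MDD) framework to improve mispronunciation detection for non-native speakers.
methods: The framework is built on L1-MultiMDD, a new multilingual MDD model enriched with L1-aware speech representations. An attention mechanism aligns the input audio with the reference phoneme sequence; L1-L2 speech embeddings are then extracted from an auxiliary model and infused into the primary network. Finally, the model is optimized with connectionist temporal classification (CTC) loss.
results: Experiments show consistent gains in PER and false rejection rate (FRR) across the target languages (English, Arabic, and Mandarin) on both seen and unseen datasets.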

Abstract
The phonological discrepancies between a speaker's native (L1) and non-native (L2) language serve as a major factor in mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup identifying L1 and L2 language, and are infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARCTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets. The consistent gains in PER and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.

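Summary
Phonological differences between a speaker's native language (L1) and non-native language (L2) are a major cause of mispronunciation. This paper proposes a new multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representations. An end-to-end speech encoder is trained on the input signal and its reference phoneme sequence, with an attention mechanism aligning the two. L1-L2 speech embeddings are then extracted from an auxiliary model and infused into the primary network. Finally, L1-MultiMDD is optimized with connectionist temporal classification (CTC) loss for a unified multilingual phoneme recognition task. Experiments show consistent gains in PER and false rejection rate (FRR) across all target languages on both seen and unseen datasets.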
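
Results in this entry are reported as phoneme error rate (PER). As a reminder of how that metric is usually computed, here is a minimal sketch based on Levenshtein distance over phoneme sequences; the function names are illustrative, and the paper's exact scoring setup is not given in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a five-phoneme reference recognized with one substitution and one deletion gives a PER of 2/5.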

CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders

paper_url: http://arxiv.org/abs/2309.07707
repo_url: None
paper_authors: Heng-Jui Chang, Ning Dong, Ruslan Mavlyutov, Sravya Popuri, Yu-An Chung
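for: This work targets the compression of large-scale self-supervised pre-trained speech encoders, which excel at speech recognition and translation but are too costly to rebuild for each new task or to deploy in on-device applications.
methods: It proposes a new knowledge distillation method that uses masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model.
results: CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.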

Abstract
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible. Prior studies propose model compression methods to address this issue, but those works focus on smaller models and less realistic tasks. Thus, we propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders by leveraging masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.

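Summary
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation. Because these models are expensive to develop, building new encoders for new tasks and deploying them on-device is infeasible. Earlier studies proposed model compression methods, but they focused on smaller models and less realistic tasks. This paper proposes Contrastive Layer-to-layer Distillation (CoLLD), a knowledge distillation method that uses masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.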
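
To make the contrastive part concrete, the following sketch shows a generic InfoNCE-style loss between one student vector and teacher vectors. It is an assumption-laden illustration, not CoLLD's actual objective: the cosine similarity, the temperature value, and the way negatives are chosen are all placeholders.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(student_vec, teacher_pos, teacher_negs, temperature=0.1):
    """Cross-entropy that asks the student to pick the matching teacher vector.

    teacher_pos: the teacher representation the student should match.
    teacher_negs: distractor teacher representations (hypothetical sampling).
    """
    logits = [cosine(student_vec, teacher_pos) / temperature]
    logits += [cosine(student_vec, neg) / temperature for neg in teacher_negs]
    # Numerically stable log-sum-exp over the positive and the distractors.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]
```

The loss approaches zero when the student vector is close to the positive and far from the distractors, and grows when a distractor is the closer match.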

A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems

paper_url: http://arxiv.org/abs/2309.07682
repo_url: https://github.com/lichuangnus/crs-paper-list
paper_authors: Chuang Li, Hengchang Hu, Yan Zhang, Min-Yen Kan, Haizhou Li
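for: This survey examines holistic conversational recommender systems (CRS), an emerging line of work trained on conversational data collected from real-world scenarios.
methods: It summarizes the holistic CRS literature in a structured way, describing approaches in terms of three components: 1) a backbone language model, with the optional use of 2) external knowledge and/or 3) external guidance.
results: It provides a detailed analysis of CRS datasets and evaluation methods in real application scenarios, together with the authors' view of current challenges and future trends.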

Abstract
Conversational recommender systems (CRS) generate recommendations through an interactive process. However, not all CRS approaches use human conversations as their source of interaction data; the majority of prior CRS work simulates interactions by exchanging entity-level information. As a result, claims of prior CRS work do not generalise to real-world settings where conversations take unexpected turns, or where conversational and intent understanding is not perfect. To tackle this challenge, the research community has started to examine holistic CRS, which are trained using conversational data collected from real-world scenarios. Despite their emergence, such holistic approaches are under-explored. We present a comprehensive survey of holistic CRS methods by summarizing the literature in a structured manner. Our survey recognises holistic CRS approaches as having three components: 1) a backbone language model, the optional use of 2) external knowledge, and/or 3) external guidance. We also give a detailed analysis of CRS datasets and evaluation methods in real application scenarios. We offer our insight as to the current challenges of holistic CRS and possible future trends.

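Summary
Conversational recommender systems (CRS) generate recommendations through an interactive process. Not all CRS approaches use human conversations as their interaction data; most prior work simulates interactions by exchanging entity-level information. As a result, earlier claims do not generalize to real-world settings where conversations take unexpected turns or where conversational and intent understanding is imperfect. The research community has therefore begun to study holistic CRS, which are trained on conversations collected from real-world scenarios, but these approaches remain under-explored. This survey summarizes the holistic CRS literature in a structured way, describing approaches by three components: 1) a backbone language model, with optional 2) external knowledge and/or 3) external guidance. It also analyzes CRS datasets and evaluation methods in real application scenarios and discusses current challenges and possible future trends.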

Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version)

paper_url: http://arxiv.org/abs/2309.07677
repo_url: None
paper_authors: Chen Gong, Peilin Wu, Jinho D. Choi
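for: This paper proposes a new evaluation approach for text-based speaker diarization (SD), addressing the limitation that traditional metrics ignore contextual information in text.
methods: Two new metrics, Text-based Diarization Error Rate and Diarization F1, perform utterance- and word-level evaluation by aligning tokens in reference and hypothesis transcripts, and they capture more error types than existing metrics. Tokens are aligned with a multiple sequence alignment algorithm that supports multiple sequences in the reference and handles high-dimensional alignment to the hypothesis using dynamic programming.
results: The work is packaged as two tools: align4d, an API for the alignment algorithm, and TranscribeView, for visualizing and evaluating SD errors. Both can aid the creation of high-quality data and so support progress in dialogue systems.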

Abstract
This paper presents a novel evaluation approach to text-based speaker diarization (SD), tackling the limitations of traditional metrics that do not account for any contextual information in text. Two new metrics are proposed, Text-based Diarization Error Rate and Diarization F1, which perform utterance- and word-level evaluations by aligning tokens in reference and hypothesis transcripts. Our metrics encompass more types of errors compared to existing ones, allowing us to make a more comprehensive analysis in SD. To align tokens, a multiple sequence alignment algorithm is introduced that supports multiple sequences in the reference while handling high-dimensional alignment to the hypothesis using dynamic programming. Our work is packaged into two tools, align4d providing an API for our alignment algorithm and TranscribeView for visualizing and evaluating SD errors, which can greatly aid in the creation of high-quality data, fostering the advancement of dialogue systems.

Summary
This paper proposes a new evaluation approach for text-based speaker diarization (SD) that addresses the failure of traditional metrics to account for contextual information in text. It introduces two metrics, Text-based Diarization Error Rate and Diarization F1, which evaluate at the utterance and word level by aligning tokens in reference and hypothesis transcripts. These metrics cover more error types than existing ones and allow a more comprehensive analysis of SD. To align tokens, the paper introduces a multiple sequence alignment algorithm that supports multiple reference sequences and uses dynamic programming to handle high-dimensional alignment to the hypothesis. The authors also provide two tools, align4d and TranscribeView, which expose the alignment algorithm and help visualize and evaluate SD errors, aiding the creation of high-quality data for dialogue systems.
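
The token alignment at the heart of these metrics relies on dynamic programming. The sketch below shows the simpler pairwise case (Needleman-Wunsch global alignment between one reference and one hypothesis); it is not the paper's multiple sequence alignment algorithm and does not use the align4d API. The scoring values and the function name `align_tokens` are assumptions.

```python
def align_tokens(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; None marks a gap on either side."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two tokens
                score[i - 1][j] + gap,            # reference token unmatched
                score[i][j - 1] + gap,            # hypothesis token unmatched
            )

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs
```

Once tokens are paired, speaker labels attached to each token can be compared pair by pair, which is what makes word-level diarization scoring possible.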

Automatic Data Visualization Generation from Chinese Natural Language Questions

paper_url: http://arxiv.org/abs/2309.07650
repo_url: None
paper_authors: Yan Ge, Victor Junqiu Wei, Yuanfeng Song, Jason Chen Zhang, Raymond Chi-Wing Wong
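for: This study proposes a Chinese Text-to-Vis dataset to support research on generating data visualizations from questions in Chinese.
methods: The model uses multilingual BERT as the encoder, boosts cross-lingual ability, and integrates n-gram information into word representation learning.
results: Experimental results show that the dataset is challenging and deserves further research.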

Abstract
Data visualization has emerged as an effective tool for getting insights from massive datasets. Because the programming languages used for data visualization are hard to work with, automatic data visualization generation from natural language (Text-to-Vis) is becoming increasingly popular. Despite the plethora of research effort on English Text-to-Vis, studies have yet to be conducted on data visualization generation from questions in Chinese. Motivated by this, we propose a Chinese Text-to-Vis dataset and present our first attempt to tackle this problem. Our model integrates multilingual BERT as the encoder, boosts the cross-lingual ability, and infuses n-gram information into our word representation learning. Our experimental results show that our dataset is challenging and deserves further research.

Summary
Data visualization has become an effective way to gain insight from large datasets. Because visualization programming languages are hard to use, automatically generating visualizations from natural language (Text-to-Vis) is increasingly popular. Although English Text-to-Vis has received a great deal of research effort, questions in Chinese have not yet been studied. This paper proposes a Chinese Text-to-Vis dataset and makes a first attempt at the problem. The model uses multilingual BERT as the encoder, boosts cross-lingual ability, and integrates n-gram information into word representation learning. Experimental results show that the dataset is challenging and deserves further research.

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

paper_url: http://arxiv.org/abs/2309.07648
repo_url: None
paper_authors: Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen
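for: Improving the ability of end-to-end (E2E) speech recognition models to recognize named entities.
methods: Incorporates class-based language models (LMs) into the factorized neural Transducer (FNT), yielding C-FNT.
results: Significantly reduces named entity errors without hurting general word recognition.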

Abstract
In spite of the excellent strides made by end-to-end (E2E) models in speech recognition in recent years, named entity recognition is still challenging but critical for semantic understanding. In order to enhance the ability to recognize named entities in E2E models, previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in named entity recognition in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose a novel E2E model to incorporate class-based LMs into FNT, which is referred to as C-FNT. In C-FNT, the language model score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT presents significant error reduction in named entities without hurting performance in general word recognition.

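Summary
End-to-end (E2E) models have made excellent progress in speech recognition in recent years, yet named entity recognition remains challenging and is critical for semantic understanding. Earlier work on improving named entity recognition in E2E models focuses mainly on rule-based or attention-based contextual biasing algorithms, whose performance can be sensitive to the biasing weight or degraded by excessive attention to the named entity list, with a risk of false triggering. Inspired by the success of class-based language models (LMs) for named entities in conventional hybrid systems and by the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), the paper proposes a new E2E model, C-FNT. In C-FNT, the language model score of a named entity can be associated with its name class rather than its surface form. Experiments show that C-FNT significantly reduces named entity errors without hurting general word recognition.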
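
Scoring a named entity by its class rather than its surface form follows the classic class-based LM factorization, P(word | history) = P(class | history) * P(word | class). The sketch below illustrates that factorization with toy tables; how C-FNT combines these terms inside the Transducer is not specified in the abstract, so every name and number here is hypothetical.

```python
import math


def class_based_logprob(word, lm_logprobs, class_of, within_class_logprobs):
    """Log-probability of a word under a toy class-based LM.

    lm_logprobs: log P(token | history) for ordinary words and class tokens.
    class_of: maps a named entity to its class token, e.g. "<NAME>".
    within_class_logprobs: log P(word | class) for each class.
    """
    cls = class_of.get(word)
    if cls is None:
        # Ordinary word: use its own LM score.
        return lm_logprobs[word]
    # Named entity: score the class token, then the word inside the class.
    return lm_logprobs[cls] + within_class_logprobs[cls][word]


# Toy tables, purely illustrative.
LM = {"call": math.log(0.5), "<NAME>": math.log(0.25)}
CLASS_OF = {"Zhicheng": "<NAME>"}
WITHIN = {"<NAME>": {"Zhicheng": math.log(0.1)}}
```

The benefit is that a rare name inherits the context score of its class, so the LM does not need to have seen that specific name in that context.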

Dynamic MOdularized Reasoning for Compositional Structured Explanation Generation

paper_url: http://arxiv.org/abs/2309.07624
repo_url: None
paper_authors: Xiyan Fu, Anette Frank
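for: This study aims to improve the compositional generalization of neural models on reasoning tasks.
methods: It proposes a new setting of the structured explanation generation task to support compositional reasoning research. Earlier symbolic methods reason iteratively with predefined inference rules, but they rely on brittle symbolic transfers and are restricted to well-defined tasks. The paper therefore proposes a dynamic modularized reasoning model, MORSE, which uses modularized self-attention to route inputs dynamically to dedicated heads.
results: On two benchmarks with reasoning trees of increasing length and shape, MORSE outperforms competitive baselines. Ablations and deeper analyses show the effectiveness and generalization ability of the dynamic reasoning modules.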

Abstract
Despite the success of neural models in solving reasoning tasks, their compositional generalization capabilities remain unclear. In this work, we propose a new setting of the structured explanation generation task to facilitate compositional reasoning research. Previous works found that symbolic methods achieve superior compositionality by using pre-defined inference rules for iterative reasoning. But these approaches rely on brittle symbolic transfers and are restricted to well-defined tasks. Hence, we propose a dynamic modularized reasoning model, MORSE, to improve the compositional generalization of neural models. MORSE factorizes the inference process into a combination of modules, where each module represents a functional unit. Specifically, we adopt modularized self-attention to dynamically select and route inputs to dedicated heads, which specializes them to specific functions. We conduct experiments for increasing lengths and shapes of reasoning trees on two benchmarks to test MORSE's compositional generalization abilities, and find it outperforms competitive baselines. Model ablation and deeper analyses show the effectiveness of dynamic reasoning modules and their generalization abilities.

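Summary
Although neural models succeed on reasoning tasks, their compositional generalization capabilities remain unclear. This work proposes a new setting of the structured explanation generation task to facilitate compositional reasoning research. Earlier work found that symbolic methods achieve better compositionality by applying predefined inference rules iteratively, but they depend on brittle symbolic transfers and are limited to well-defined tasks. The paper proposes a dynamic modularized reasoning model, MORSE, which factorizes inference into a combination of modules, each representing a functional unit. Modularized self-attention dynamically selects and routes inputs to dedicated heads, specializing them for specific functions. Experiments on two benchmarks with reasoning trees of increasing length and shape show that MORSE outperforms competitive baselines; ablations and deeper analyses confirm the effectiveness and generalization ability of the dynamic reasoning modules.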

Zero-shot Audio Topic Reranking using Large Language Models

paper_url: http://arxiv.org/abs/2309.07606
repo_url: None
paper_authors: Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate M. Knill, Mark J. F. Gales
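for: The Multimodal Video Search by Examples (MVSE) project uses video clips as the query for information retrieval instead of a traditional text query, which enables richer search modalities such as images, speaker, content, topic, and emotion.
methods: Video attributes are represented by embeddings to support rapid search over large archives. This work evaluates reranking approaches, in particular zero-shot reranking with large language models, to mitigate the performance loss from rapid search.
results: For topic-based retrieval on a publicly available video archive, the BBC Rewind corpus, reranking improves retrieval ranking without any task-specific training data.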

Abstract
The Multimodal Video Search by Examples (MVSE) project investigates using video clips as the query term for information retrieval, rather than the more traditional text query. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element of this process is highly rapid, flexible search to support large archives, which in MVSE is facilitated by representing video attributes by embeddings. This work aims to mitigate any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models are investigated, as these are applicable to the audio content of any video archive. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking can achieve improved retrieval ranking without the need for any task-specific training data.

Summary
The Multimodal Video Search by Examples (MVSE) project investigates using video clips, rather than traditional text queries, as the query for information retrieval. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key requirement is fast, flexible search over large archives, which MVSE supports by representing video attributes as embeddings. This work examines reranking approaches to mitigate the performance loss from such rapid archive search. In particular, it investigates zero-shot reranking with large language models, since these can be applied to the audio content of any video archive. Topic-based retrieval is evaluated on a publicly available video archive, the BBC Rewind corpus, and the results show that reranking improves retrieval ranking without any task-specific training data.
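
The two-stage idea (fast embedding search, then reranking of the top results) can be sketched as follows. This is an illustration under assumptions: `score_fn` stands in for a zero-shot large language model judgment, and the toy `word_overlap` scorer is only there to make the example runnable; neither comes from the paper.

```python
def rerank(query_text, candidates, score_fn, top_k=3):
    """Re-order the first top_k candidates from a fast first-pass search.

    candidates: list of (doc_id, transcript), already ordered by the
        first-pass embedding search.
    score_fn: callable(query_text, transcript) -> relevance score. In a real
        system this would be a zero-shot LLM call (hypothetical here).
    """
    head, tail = candidates[:top_k], candidates[top_k:]
    # sorted() is stable, so ties keep their first-pass order.
    reranked = sorted(head, key=lambda c: score_fn(query_text, c[1]),
                      reverse=True)
    return reranked + tail


def word_overlap(query_text, transcript):
    # Toy relevance score: number of shared lowercase words.
    return len(set(query_text.lower().split()) & set(transcript.lower().split()))
```

Only the head of the list is rescored, which keeps the expensive scorer off the bulk of the archive while still letting it correct the first-pass ordering.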

Revisiting Supertagging for HPSG

paper_url: http://arxiv.org/abs/2309.07590
repo_url: None
paper_authors: Olga Zamaraeva, Carlos Gómez-Rodríguez
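for: This paper develops new HPSG supertaggers using SVM and neural CRF- and BERT-based models.
methods: The supertaggers are trained on HPSG-based treebanks, which offer high-quality annotation and diverse, challenging test datasets beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging previously relied on MaxEnt-based models; this work uses SVM and neural CRF- and BERT-based methods instead.
results: Both the SVM and the neural supertaggers achieve considerably higher accuracy than the baseline. The fine-tuned BERT-based tagger reaches 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb) dataset.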

Abstract
We present new supertaggers trained on HPSG-based treebanks. These treebanks feature high-quality annotation based on a well-developed linguistic theory and include diverse and challenging test datasets, beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based methods and show that both SVM and neural supertaggers achieve considerably higher accuracy compared to the baseline. Our fine-tuned BERT-based tagger achieves 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb). We conclude that it makes sense to integrate these new supertaggers into modern HPSG parsers, and we also hope that the diverse and difficult datasets we used here will gain more popularity in the field. We contribute the complete dataset reformatted for token classification.

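Summary
The paper presents new supertaggers trained on HPSG-based treebanks. These treebanks have high-quality annotation grounded in a well-developed linguistic theory and include diverse, challenging test datasets beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging previously relied on MaxEnt-based models. Using SVM and neural CRF- and BERT-based methods, the authors show that both kinds of supertagger are considerably more accurate than the baseline. The fine-tuned BERT-based tagger reaches 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb). The authors conclude that it makes sense to integrate the new supertaggers into modern HPSG parsers, hope the datasets will become more widely used, and contribute the complete dataset reformatted for token classification.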

Adaptive Prompt Learning with Distilled Connective Knowledge for Implicit Discourse Relation Recognition

paper_url: http://arxiv.org/abs/2309.07561
repo_url: https://github.com/wangzl99/AdaptPrompt
paper_authors: Bang Wang, Zhenglin Wang, Wei Xiang, Yijun Mo
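for: This study aims to improve implicit discourse relation recognition (IDRR), that is, recognizing the relation between two text segments without an explicit connective, while reducing manual design effort through continuous prompting and knowledge transfer.
methods: It proposes a continuous prompt learning method, AdaptPrompt, which trains a few virtual tokens to form continuous templates and automatically selects the most suitable one. It also designs an answer-relation mapping rule to generate the answer space, and a teacher-student architecture to distill connective knowledge.
results: Experiments on the up-to-date PDTB Corpus V3.0 validate the design objectives, with better relation recognition performance than state-of-the-art competitors.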

Abstract
Implicit discourse relation recognition (IDRR) aims at recognizing the discourse relation between two text segments without an explicit connective. Recently, prompt learning has been applied to the IDRR task with great performance improvements over various neural network-based approaches. However, the discrete nature of the state-of-the-art prompting approach requires manual design of templates and answers, a big hurdle for its practical applications. In this paper, we propose a continuous version of prompt learning together with connective knowledge distillation, called AdaptPrompt, to reduce manual design efforts via continuous prompting while further improving performance via knowledge transfer. In particular, we design and train a few virtual tokens to form continuous templates and automatically select the most suitable one by gradient search in the embedding space. We also design an answer-relation mapping rule to generate a few virtual answers as the answer space. Furthermore, we notice the importance of annotated connectives in the training dataset and design a teacher-student architecture for knowledge transfer. Experiments on the up-to-date PDTB Corpus V3.0 validate our design objectives in terms of better relation recognition performance than the state-of-the-art competitors.

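Summary
Implicit discourse relation recognition (IDRR) aims to recognize the discourse relation between two text segments without an explicit connective. Prompt learning has recently been applied to IDRR with strong gains, but the discrete nature of the state-of-the-art prompting approach requires manually designed templates and answers, which is a major obstacle in practice. This paper proposes AdaptPrompt, a continuous form of prompt learning combined with connective knowledge distillation, to reduce manual design effort while improving performance through knowledge transfer. A few virtual tokens are trained to form continuous templates, and the most suitable one is selected automatically by gradient search in the embedding space. An answer-relation mapping rule generates a few virtual answers as the answer space. Because annotated connectives in the training data are important, a teacher-student architecture transfers that knowledge. Experiments on the up-to-date PDTB Corpus V3.0 validate the design objectives.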

DBLPLink: An Entity Linker for the DBLP Scholarly Knowledge Graph

paper_url: http://arxiv.org/abs/2309.07545
repo_url: https://github.com/uhh-lt/dblplink
paper_authors: Debayan Banerjee, Arefa, Ricardo Usbeck, Chris Biemann
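for: This paper presents DBLPLink, a web application that performs entity linking over the DBLP scholarly knowledge graph.
methods: The application uses text-to-text pre-trained language models, such as T5, to produce entity label spans from an input text question. Entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them using entity embeddings such as TransE, DistMult, and ComplEx.
results: Results are displayed so that users can compare and contrast T5-small, T5-base, and the different KG embeddings. The demo can be accessed at https://ltdemos.informatik.uni-hamburg.de/dblplink/.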

Abstract
In this work, we present a web application named DBLPLink, which performs entity linking over the DBLP scholarly knowledge graph. DBLPLink uses text-to-text pre-trained language models, such as T5, to produce entity label spans from an input text question. Entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them based on entity embeddings, such as TransE, DistMult and ComplEx. The results are displayed so that users may compare and contrast the results between T5-small, T5-base and the different KG embeddings used. The demo can be accessed at https://ltdemos.informatik.uni-hamburg.de/dblplink/.

Direct Text to Speech Translation System using Acoustic Units

paper_url: http://arxiv.org/abs/2309.07478
repo_url: None
paper_authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret
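for: This paper proposes a direct text-to-speech translation system that takes text in different source languages as input and generates speech in the target language without needing text transcriptions in that language.
methods: Acoustic units are extracted with a speech encoder combined with a clustering algorithm. An encoder-decoder architecture is then trained to predict these units, and a vocoder generates speech from them.
results: Tested on the new CVSS corpus, the system is competitive for most of the language pairs evaluated and improves markedly when initialised from a model pre-trained on more languages.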

Abstract
This paper proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to extract the acoustic units using a speech encoder combined with a clustering algorithm. Once units are obtained, an encoder-decoder architecture is trained to predict them. Then a vocoder generates speech from units. Our approach for direct text to speech translation was tested on the new CVSS corpus with two different text mBART models employed as initialisation. The systems presented report competitive performance for most of the language pairs evaluated. Besides, results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.
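To make the notion of discrete acoustic units concrete, the sketch below maps encoder frames to the index of their nearest cluster centroid. It assumes the centroids were already learned by a clustering algorithm (the abstract does not name which one). Collapsing consecutive repeats is a common convention included here as an assumption, and the function names are illustrative.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid closest to the frame (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )


def frames_to_units(frames, centroids, collapse_repeats=True):
    """Turn a sequence of feature frames into a sequence of discrete unit ids."""
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if collapse_repeats:
        # Assumption: drop consecutive duplicates so each unit appears once per run.
        units = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
    return units
```

The resulting id sequence is what an encoder-decoder would be trained to predict from source text, with a vocoder turning the predicted ids back into audio.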

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

paper_url: http://arxiv.org/abs/2309.07462
repo_url: None
paper_authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
for: This paper aims to investigate the use of LLM-based evaluators for scaling up multilingual evaluation in NLP tasks, and to calibrate LLM-based evaluation against human judgments.
methods: The paper uses LLM-based evaluators to evaluate the performance of NLP models in eight languages, and compares the results with human judgments.
results: The study finds that LLM-based evaluators may exhibit bias towards higher scores, and should be used with caution, particularly in low-resource and non-Latin script languages. Additionally, the paper suggests that calibrating LLM-based evaluators with a dataset of native speaker judgments is important for ensuring accurate evaluation.

Abstract
Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators, that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques including the lack of appropriate benchmarks, metrics, cost, and access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores and should be used with caution and should always be calibrated with a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
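One simple way to check an evaluator for the bias towards higher scores described above is to compare its scores with human scores on the same items. The sketch below computes a mean signed difference and an exact-agreement rate. Both statistics and the function names are assumptions chosen for illustration, not the calibration procedure used in the paper.

```python
def evaluator_bias(llm_scores, human_scores):
    """Mean signed difference (LLM minus human); positive means the LLM scores higher."""
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    return sum(diffs) / len(diffs)


def agreement_rate(llm_scores, human_scores):
    """Fraction of items on which the LLM and the human give the same score."""
    matches = sum(1 for l, h in zip(llm_scores, human_scores) if l == h)
    return matches / len(llm_scores)
```

Computing these per language would show whether the gap widens for low-resource or non-Latin script languages.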

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

paper_url: http://arxiv.org/abs/2309.07445
repo_url: https://github.com/dadelani/sib-200
paper_authors: David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, Annie En-Shiun Lee
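for: This paper provides a large-scale multilingual evaluation dataset for topic classification, so that multilingual language models can be evaluated on many more languages.
methods: The authors annotated the English portion of the Flores-200 machine translation corpus and extended the sentence-level annotations to the remaining 203 languages covered in the corpus.
results: The evaluation shows that a large gap remains between the performance of high-resource and low-resource languages. Languages unseen during pre-training, under-represented language families (such as Nilotic and Atlantic-Congo), and languages from Africa, the Americas, Oceania and South East Asia often have the lowest performance.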

Abstract
Despite the progress we have recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets which excludes a large number of low-resource languages. In this paper, we created SIB-200 -- a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in the fully supervised setting, the cross-lingual transfer setting and the large language model prompting setting shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, Americas, Oceania and South East Asia, often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
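Because the corpus is sentence-aligned across languages, a topic label assigned to an English sentence can be copied to the same sentence in every other language. The sketch below shows that projection on toy data. The dictionary layout, the language codes, and the function name are hypothetical and do not reflect the released file format.

```python
def project_labels(english_labels, parallel_corpus):
    """Copy English sentence-level labels to aligned sentences in other languages.

    english_labels: {sentence_id: topic}
    parallel_corpus: {language_code: {sentence_id: text}}
    Returns {language_code: [(text, topic), ...]} ordered by sentence id.
    """
    return {
        lang: [
            (text, english_labels[sid])
            for sid, text in sorted(sentences.items())
            if sid in english_labels
        ]
        for lang, sentences in parallel_corpus.items()
    }
```

Sentences without an English label are skipped, so every projected example carries a label that was actually annotated.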

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

paper_url: http://arxiv.org/abs/2309.07430
repo_url: https://github.com/stanfordmimi/clin-summ
paper_authors: Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari
for: clinical text summarization across multiple tasks
methods: employ domain adaptation methods on eight large language models (LLMs) spanning six datasets and four distinct summarization tasks
results: the best adapted LLM outperforms human summaries in terms of completeness and correctness, and traditional quantitative NLP metrics are correlated with reader study scores.

Abstract
Sifting through vast textual data and summarizing key information imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy across diverse clinical summarization tasks has not yet been rigorously examined. In this work, we employ domain adaptation methods on eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods in addition to instances where recent advances in LLMs may not lead to improved results. Further, in a clinical reader study with six physicians, we depict that summaries from the best adapted LLM are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis delineates mutual challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and other irreplaceable human aspects of medicine.

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

paper_url: http://arxiv.org/abs/2309.07423
repo_url: None
paper_authors: Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig
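for: This study evaluates the machine translation (MT) ability of large language models (LLMs) across a wide range of languages.
methods: Experiments use the FLORES-200 benchmark to evaluate MT for 204 languages, together with an MT cost analysis.
results: GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of the languages covered. A language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and ChatGPT is especially disadvantaged for LRLs and African languages.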

Abstract
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages.

PromptASR for contextualized ASR with controllable style

paper_url: http://arxiv.org/abs/2309.07414
repo_url: https://github.com/k2-fsa/icefall
paper_authors: Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey
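for: This paper proposes PromptASR, a framework that integrates prompts into end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with a controllable transcription style.
methods: A dedicated text encoder encodes the text prompts, and the encodings are injected into the speech encoder by cross-attending the features from the two modalities. Content prompts improve recognition accuracy, and an additional style prompt guides the system to output different styles of transcription.
results: On a book reading dataset and an in-house dataset, using the ground truth text from preceding utterances as the content prompt yields 21.9% and 6.8% relative word error rate reductions over a baseline ASR system. The system can also take word-level biasing lists as prompts to improve recognition accuracy on rare words.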

Abstract
Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and the encodings are injected into the speech encoder by cross-attending the features from two modalities. When using the ground truth text from preceding utterances as content prompt, the proposed system achieves 21.9% and 6.8% relative word error rate reductions on a book reading dataset and an in-house dataset compared to a baseline ASR system. The system can also take word-level biasing lists as prompt to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder and guide the ASR system to output different styles of transcriptions. The code is available at icefall.
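The injection step described in the abstract can be pictured as cross-attention in which each speech frame queries the prompt encodings. The sketch below is a single-head version without learned projections and with a residual connection, both of which are simplifying assumptions. It is not the model's actual layer, and the function name is illustrative.

```python
import math


def cross_attend(speech_frames, prompt_encodings):
    """Let each speech frame attend over the prompt token encodings.

    Assumption: queries are the raw speech frames, keys and values are the
    raw prompt encodings, and the attended context is added back residually.
    """
    dim = len(prompt_encodings[0])
    outputs = []
    for query in speech_frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in prompt_encodings
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        context = [
            sum(w * value[j] for w, value in zip(weights, prompt_encodings))
            for j in range(dim)
        ]
        outputs.append([q + c for q, c in zip(query, context)])
    return outputs
```

The output keeps one vector per speech frame, so the prompt changes what each frame represents without changing the sequence length seen by the rest of the encoder.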

CPPF: A contextual and post-processing-free model for automatic speech recognition

paper_url: http://arxiv.org/abs/2309.07413
repo_url: None
paper_authors: Lei Zhang, Zhengkun Tian, Xiang Chen, Jiaming Sun, Hongyu Xiang, Ke Ding, Guanglu Wan
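for: This study aims to make automatic speech recognition (ASR) output directly usable by integrating multiple ASR text processing tasks into the speech recognition model.
methods: Drawing inspiration from the capabilities of LLMs and Whisper, the study integrates contextual ASR and multiple ASR post-processing tasks into a single model, CPPF, which directly generates post-processed text.
results: CPPF shortens the multi-stage pipeline and prevents the propagation of cascading errors, without any significant loss in recognition performance.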

Abstract
ASR systems have become increasingly widespread in recent years. However, their textual outputs often require post-processing tasks before they can be practically utilized. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper, and focus on integrating multiple ASR text processing tasks related to speech recognition into the ASR model. This integration not only shortens the multi-stage pipeline, but also prevents the propagation of cascading errors, resulting in direct generation of post-processed text. In this study, we focus on ASR-related processing tasks, including Contextual ASR and multiple ASR post processing tasks. To achieve this objective, we introduce the CPPF model, which offers a versatile and highly effective alternative to ASR processing. CPPF seamlessly integrates these tasks without any significant loss in recognition performance.

Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

paper_url: http://arxiv.org/abs/2309.07412
repo_url: None
paper_authors: Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky
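for: This paper studies linear recurrent neural networks (LRNNs), which have achieved Transformer-level performance in natural language modeling and long-range modeling while offering rapid parallel training and constant inference cost.
methods: The paper investigates whether LRNNs can learn the hidden rules in training sequences, such as the grammatical structures of regular language. A theoretical analysis of existing LRNNs reveals their limitations on regular language. Motivated by this analysis, the authors propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
results: Experiments suggest that the proposed model can perform length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic.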

Abstract
In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language modeling and long-range modeling while offering rapid parallel training and constant inference costs. With the resurged interest in LRNNs, we study whether they can learn the hidden rules in training sequences, such as the grammatical structures of regular language. We theoretically analyze some existing LRNNs and discover their limitations on regular language. Motivated by the analysis, we propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix. Experiments suggest that the proposed model is the only LRNN that can perform length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic.
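A toy example may help show why an input-dependent, block-diagonal transition matrix suits regular languages. In the sketch below, the recurrence is h_t = A(x_t) h_{t-1}, where A(x_t) is built from 2x2 blocks chosen by the current token. The blocks are hand-set to track the parity of ones and of zeros, whereas a trained model would produce them from learned parameters. All names are hypothetical, and parity is used only as a simple regular-language illustration.

```python
IDENTITY = [[1, 0], [0, 1]]
SWAP = [[0, 1], [1, 0]]


def block_diag_step(blocks, state):
    """Multiply the state by a block-diagonal matrix given as a list of square blocks."""
    out = []
    offset = 0
    for block in blocks:
        size = len(block)
        segment = state[offset:offset + size]
        out.extend(
            sum(block[i][j] * segment[j] for j in range(size)) for i in range(size)
        )
        offset += size
    return out


def transition_blocks(token):
    """Hand-set input-dependent blocks: block 0 flips on a 1, block 1 flips on a 0."""
    return [SWAP, IDENTITY] if token == 1 else [IDENTITY, SWAP]


def run(tokens):
    """Run the linear recurrence; [1, 0] in a block means an even count so far."""
    state = [1, 0, 1, 0]
    for token in tokens:
        state = block_diag_step(transition_blocks(token), state)
    return state
```

Because the same blocks are applied at every step, the recurrence gives the right answer for sequences of any length, which is the kind of length extrapolation the abstract refers to.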

VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue

paper_url: http://arxiv.org/abs/2309.07387
repo_url: None
paper_authors: Yunshui Li, Binyuan Hui, Zhaochao Yin, Wanwei He, Run Luo, Yuxing Long, Min Yang, Fei Huang, Yongbin Li
for: The paper aims to address the lack of a standardized evaluation framework for visually-grounded dialog systems by proposing a new benchmark called VDialogUE.
methods: The paper introduces a novel evaluation metric called VDscore, based on the Analytic Hierarchy Process (AHP) method, to provide a comprehensive assessment of multi-modal dialogue systems. The authors also propose a baseline model named VISIT, which uses a two-stage pre-training strategy to progressively build its multi-modal foundation and dialogue capability.
results: The VDialogUE benchmark defines five core multi-modal dialogue tasks and covers six datasets. The authors believe that the benchmark, the evaluation scripts and the baseline models will accelerate the development of visually-grounded dialog systems and lead to more sophisticated and effective pre-trained models.

Abstract
Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, in order to provide a comprehensive assessment of the model's performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialog systems and lead to the development of more sophisticated and effective pre-trained models.
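The Analytic Hierarchy Process derives importance weights from a matrix of pairwise comparisons by taking its principal eigenvector. The sketch below approximates that eigenvector with power iteration and then combines per-task scores into one number. The abstract does not say how VDscore aggregates task scores, so the weighted sum and the function names are assumptions.

```python
def ahp_weights(pairwise, iterations=100):
    """Approximate the principal eigenvector of a pairwise comparison matrix.

    pairwise[i][j] states how much more important item i is than item j.
    The returned weights are normalised to sum to 1.
    """
    n = len(pairwise)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            sum(pairwise[i][j] * weights[j] for j in range(n)) for i in range(n)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


def weighted_score(task_scores, weights):
    """Assumed aggregation: weighted sum of per-task scores."""
    return sum(s * w for s, w in zip(task_scores, weights))
```

For a perfectly consistent comparison matrix, where entry (i, j) equals w_i / w_j, the procedure recovers the underlying weights w.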

An Interactive Framework for Profiling News Media Sources

paper_url: http://arxiv.org/abs/2309.07384
repo_url: None
paper_authors: Nikhil Mehta, Dan Goldwasser
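for: Detecting and profiling the news sources that spread fake and biased news on social media, which is important for maintaining a healthy society.
methods: The paper proposes an interactive framework for news media profiling that combines graph-based news media profiling models, pre-trained large language models, and human insight to characterize the social context on social media.
results: Experiments show that with as few as 5 human interactions, the framework can rapidly detect fake and biased news media, even in the challenging setting of emerging news events where test data is unseen.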

Abstract
The recent rise of social media has led to the spread of large amounts of fake and biased news, content published with the intent to sway beliefs. While detecting and profiling the sources that spread this news is important to maintain a healthy society, it is challenging for automated systems. In this paper, we propose an interactive framework for news media profiling. It combines the strengths of graph based news media profiling models, Pre-trained Large Language Models, and human insight to characterize the social context on social media. Experimental results show that with as little as 5 human interactions, our framework can rapidly detect fake and biased news media, even in the most challenging settings of emerging news events, where test data is unseen.

Less is More for Long Document Summary Evaluation by LLMs

paper_url: http://arxiv.org/abs/2309.07382
repo_url: None
paper_authors: Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
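for: This paper proposes a new evaluation approach that addresses the high computational cost and the Lost-in-the-Middle problem that LLMs face in long document summary evaluation.
methods: The method, Extract-then-Evaluate, first extracts key sentences from the long source document and then evaluates the summary by prompting LLMs.
results: Experiments show that the proposed method not only significantly reduces evaluation costs but also correlates better with human evaluations. The paper also gives practical recommendations on document length and sentence extraction methods, supporting cost-effective yet more accurate LLM-based evaluation of text generation.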

Abstract
Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.
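The two-step idea can be sketched as follows: shorten the source to a word budget, then build an evaluation prompt from the shortened source and the summary. The leading-sentence heuristic, the prompt wording, and the function names are stand-in assumptions. The abstract does not specify which extraction methods or prompts the paper uses, and no LLM call is made here.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_lead(document, max_words):
    """Keep leading sentences until the word budget would be exceeded."""
    picked, used = [], 0
    for sentence in split_sentences(document):
        count = len(sentence.split())
        if used + count > max_words:
            break
        picked.append(sentence)
        used += count
    return " ".join(picked)


def build_prompt(document, summary, max_words=50):
    """Assemble a hypothetical evaluation prompt from the extracted source."""
    extracted = extract_lead(document, max_words)
    return (
        f"Source:\n{extracted}\n\nSummary:\n{summary}\n\n"
        "Rate the summary's consistency with the source from 1 to 5."
    )
```

Shrinking the source before prompting is what lowers the evaluation cost, since the prompt length now depends on the budget rather than on the document length.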

Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation

paper_url: http://arxiv.org/abs/2309.07369
repo_url: None
paper_authors: Shaoshi Ling, Guoli Ye, Rui Zhao, Yifan Gong
for: The paper aims to improve the text adaptation of attention-based encoder-decoder (AED) speech recognition models in industry settings.
methods: The paper proposes a novel hybrid attention-based encoder-decoder (HAED) speech recognition model that separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques.
results: The proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with conventional AED models.

Abstract
The attention-based encoder-decoder (AED) speech recognition model has been widely successful in recent years. However, the joint optimization of the acoustic model and the language model in an end-to-end manner has created challenges for text adaptation. In particular, adapting to text effectively, quickly and inexpensively has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, which preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with the conventional AED model.
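For readers unfamiliar with the metric, the sketch below computes word error rate from a word-level edit distance and then the relative improvement between two systems. The function names are illustrative, and the WER values in the usage note are made up to show the arithmetic; they are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        previous = current
    return previous[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement: the share of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration only, moving from a WER of 10.0 to 7.9 is a relative improvement of about 21%, even though the absolute drop is 2.1 points.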

2023-09-15

EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding

The Use of Multi-Scale Fiducial Markers To Aid Takeoff and Landing Navigation by Rotorcraft

Biased Attention: Do Vision Transformers Amplify Gender Bias More than Convolutional Neural Networks?

Unified Brain MR-Ultrasound Synthesis using Multi-Modal Hierarchical Representations

Improved Breast Cancer Diagnosis through Transfer Learning on Hematoxylin and Eosin Stained Histology Images

Personalized Food Image Classification: Benchmark Datasets and New Baseline

Active Learning for Fine-Grained Sketch-Based Image Retrieval

Concept explainability for plant diseases classification

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

Segmentation of Tubular Structures Using Iterative Training with Tailored Samples

Performance Metrics for Probabilistic Ordinal Classifiers

BANSAC: A dynamic BAyesian Network for adaptive SAmple Consensus

Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion

Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes

Replacing softmax with ReLU in Vision Transformers

Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

The Impact of Different Backbone Architecture on Autonomous Vehicle Dataset

Automated dermatoscopic pattern discovery by clustering neural network output for human-computer interaction

Breathing New Life into 3D Assets with Generative Repainting

Generalised Probabilistic Diffusion Scale-Spaces

OccupancyDETR: Making Semantic Scene Completion as Straightforward as Object Detection

YCB-Ev: Event-vision dataset for 6DoF object pose estimation

3D Arterial Segmentation via Single 2D Projections and Depth Supervision in Contrast-Enhanced CT Images

PoseFix: Correcting 3D Human Poses with Natural Language

TreeLearn: A Comprehensive Deep Learning Method for Segmenting Individual Trees from Forest Point Clouds

Segment Anything Model for Brain Tumor Segmentation

X-PDNet: Accurate Joint Plane Instance Segmentation and Monocular Depth Estimation with Cross-Task Distillation and Boundary Correction

MIML: Multiplex Image Machine Learning for High Precision Cell Classification via Mechanical Traits within Microfluidic Systems

Deformable Neural Radiance Fields using RGB and Event Cameras

3D SA-UNet: 3D Spatial Attention UNet with 3D ASPP for White Matter Hyperintensities Segmentation

An inspection technology of inner surface of the fine hole based on machine vision

Double Domain Guided Real-Time Low-Light Image Enhancement for Ultra-High-Definition Transportation Surveillance

Reconsidering evaluation practices in modular systems: On the propagation of errors in MRI prostate cancer detection

Beyond Domain Gap: Exploiting Subjectivity in Sketch-Based Person Retrieval

An Efficient Wide-Range Pseudo-3D Vehicle Detection Using A Single Camera

Robust Burned Area Delineation through Multitask Learning

Continual Learning with Deep Streaming Regularized Discriminant Analysis

T-UDA: Temporal Unsupervised Domain Adaptation in Sequential Point Clouds

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models

Edge Based Oriented Object Detection

Leveraging the Power of Data Augmentation for Transformer-based Tracking

BROW: Better featuRes fOr Whole slide image based on self-distillation

Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

Optimization of Rank Losses for Image Retrieval

A Real-time Faint Space Debris Detector With Learning-based LCM

Human-Inspired Topological Representations for Visual Object Recognition in Unseen Environments

Efficient Polyp Segmentation Via Integrity Learning

UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

Salient Object Detection in Optical Remote Sensing Images Driven by Transformer

One-stage Modality Distillation for Incomplete Multimodal Learning

Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks

ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

STDG: Semi-Teacher-Student Training Paradigram for Depth-guided One-stage Scene Graph Generation

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

A Ground Segmentation Method Based on Point Cloud Map for Unstructured Roads

Cross-Modal Synthesis of Structural MRI and Functional Connectivity Networks via Conditional ViT-GANs

AdSEE: Investigating the Impact of Image Style Editing on Advertisement Attractiveness

Uncertainty-Aware Multi-View Visual Semantic Embedding

DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions

Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs

Multi-Scale Estimation for Omni-Directional Saliency Maps Using Learnable Equator Bias

Let’s Roll: Synthetic Dataset Analysis for Pedestrian Detection Across Different Shutter Types

AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT

Increasing diversity of omni-directional images generated from single image using cGAN based on MLPMixer

MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces

Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

hear-your-action: human action recognition by ultrasound active sensing

2023-09-15

URA*: Uncertainty-aware Path Planning using Image-based Aerial-to-Ground Traversability Estimation for Off-road Environments

SHAPNN: Shapley Value Regularized Tabular Neural Network

D3: Data Diversity Design for Systematic Generalization in Visual Question Answering

Privacy-preserving Early Detection of Epileptic Seizures in Videos

Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation

Projected Task-Specific Layers for Multi-Task Reinforcement Learning

Enhance audio generation controllability through representation similarity regularization

Rethinking Cross-Domain Pedestrian Detection: A Background-Focused Distribution Alignment Framework for Instance-Free One-Stage Detectors

AlbNER: A Corpus for Named Entity Recognition in Albanian