cs.CV - 2023-09-01

Affine-Transformation-Invariant Image Classification by Differentiable Arithmetic Distribution Module

  • paper_url: http://arxiv.org/abs/2309.00752
  • repo_url: None
  • paper_authors: Zijie Tan, Guanfang Dong, Chenqiu Zhao, Anup Basu
  • for: Improving the robustness of Convolutional Neural Networks (CNNs) for image classification, particularly against affine transformations such as rotation, translation, flip, and shuffle.
  • methods: Proposes a Differentiable Arithmetic Distribution Module (DADM) based on distribution learning techniques, which improves robustness by learning the spatial distribution information of pixels via KDE-based differentiable histograms (see the sketch below).
  • results: Comparative experiments against LeNet and an ablation study demonstrate that the method improves robustness without sacrificing feature extraction capabilities.
    Abstract Although Convolutional Neural Networks (CNNs) have achieved promising results in image classification, they still are vulnerable to affine transformations including rotation, translation, flip and shuffle. The drawback motivates us to design a module which can alleviate the impact from different affine transformations. Thus, in this work, we introduce a more robust substitute by incorporating distribution learning techniques, focusing particularly on learning the spatial distribution information of pixels in images. To rectify the issue of non-differentiability of prior distribution learning methods that rely on traditional histograms, we adopt the Kernel Density Estimation (KDE) to formulate differentiable histograms. On this foundation, we present a novel Differentiable Arithmetic Distribution Module (DADM), which is designed to extract the intrinsic probability distributions from images. The proposed approach is able to enhance the model's robustness to affine transformations without sacrificing its feature extraction capabilities, thus bridging the gap between traditional CNNs and distribution-based learning. We validate the effectiveness of the proposed approach through ablation study and comparative experiments with LeNet.
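The abstract's central trick, replacing hard histogram binning with a KDE-based soft assignment so gradients can flow, can be illustrated with a minimal sketch. The Gaussian kernel, bin layout, and bandwidth below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def differentiable_histogram(x, num_bins=16, bandwidth=0.05):
    """Soft histogram via Gaussian kernel density estimation (sketch)."""
    centers = torch.linspace(0.0, 1.0, num_bins, device=x.device)
    # Kernel response between every pixel and every bin center: a soft,
    # differentiable version of hard bin assignment.
    diff = x.reshape(-1, 1) - centers.reshape(1, -1)       # (N, num_bins)
    weights = torch.exp(-0.5 * (diff / bandwidth) ** 2)
    hist = weights.sum(dim=0)
    return hist / hist.sum()                               # normalize

pixels = torch.rand(32, 32, requires_grad=True)            # intensities in [0, 1]
h = differentiable_histogram(pixels)
h[0].backward()   # gradients flow; a hard histogram is piecewise constant
```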

PathLDM: Text conditioned Latent Diffusion Model for Histopathology

  • paper_url: http://arxiv.org/abs/2309.00748
  • repo_url: None
  • paper_authors: Srikar Yellapragada, Alexandros Graikos, Prateek Prasanna, Tahsin Kurc, Joel Saltz, Dimitris Samaras
  • for: Developing a text-conditioned generative model for high-quality histopathology images, to support model training in computational pathology.
  • methods: A text-conditioned latent diffusion model that fuses image and textual data to enhance generation; conditioning text is drawn from histopathology reports, distilled and summarized with GPT to build an effective conditioning mechanism (see the sketch below).
  • results: Through strategic conditioning and necessary architectural enhancements, achieves a SoTA FID score of 7.64 on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor's FID of 30.1.
    Abstract To achieve high-quality results, diffusion models must be trained on large datasets. This can be notably prohibitive for models in specialized domains, such as computational pathology. Conditioning on labeled data is known to help in data-efficient model training. Therefore, histopathology reports, which are rich in valuable clinical information, are an ideal choice as guidance for a histopathology generative model. In this paper, we introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Leveraging the rich contextual information provided by pathology text reports, our approach fuses image and textual data to enhance the generation process. By utilizing GPT's capabilities to distill and summarize complex text reports, we establish an effective conditioning mechanism. Through strategic conditioning and necessary architectural enhancements, we achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
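To make the conditioning mechanism concrete, here is a minimal sketch of how image latents can attend to an encoded text report via cross-attention. The dimensions, head count, and placement inside the diffusion backbone are assumptions; this is not PathLDM's actual architecture.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Minimal cross-attention: image latents attend to encoded report text."""
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_emb):
        # latents: (B, HW, latent_dim) flattened image latents.
        # text_emb: (B, T, text_dim), e.g. a GPT-summarized report after a
        # frozen text encoder (both are stand-ins here).
        attended, _ = self.attn(latents, text_emb, text_emb)
        return self.norm(latents + attended)

latents = torch.randn(2, 64 * 64, 320)
report_summary = torch.randn(2, 77, 768)    # stand-in for encoded report text
out = TextCrossAttention()(latents, report_summary)
```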

Learned Visual Features to Textual Explanations

  • paper_url: http://arxiv.org/abs/2309.00733
  • repo_url: None
  • paper_authors: Saeid Asgari Taghanaki, Aliasghar Khani, Amir Khasahmadi, Aditya Sanghi, Karl D. D. Willis, Ali Mahdavi-Amiri
  • for: Improving the interpretability and reliability of image classifiers.
  • methods: Leverages large language models (LLMs) to interpret the learned feature space of pre-trained image classifiers, then distills the generated explanations into frequent words (see the sketch below).
  • results: Experiments on multiple datasets demonstrate the method's effectiveness at improving the interpretability and robustness of image classifiers.
    Abstract Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of large language models (LLMs) to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and LLMs. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
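The word-frequency step the abstract describes, distilling many generated sentences into their most frequent content words, reduces to a simple aggregation. The sentences and stopword list below are toy stand-ins for the LLM-generated explanations.

```python
from collections import Counter
import re

def frequent_words(sentences, top_k=10,
                   stopwords=frozenset({"the", "a", "of", "and", "is", "in", "by", "with", "on"})):
    """Aggregate the most frequent content words across generated explanations."""
    counts = Counter()
    for s in sentences:
        counts.update(w for w in re.findall(r"[a-z]+", s.lower())
                      if w not in stopwords)
    return counts.most_common(top_k)

# Hand-written stand-ins for sentences an LLM might generate for one image:
print(frequent_words([
    "a bird standing on water near green reeds",
    "water background with a small bird",
    "the bird is surrounded by water",
]))
```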

Deep learning in medical image registration: introduction and survey

  • paper_url: http://arxiv.org/abs/2309.00727
  • repo_url: None
  • paper_authors: Ahmad Hammoudeh, Stéphane Dupont
  • for: Introduces image registration to help medical practitioners examine various medical images in a standardized reference frame.
  • methods: Covers image transformations, including affine, deformable, invertible, and bidirectional transformations (an affine example follows below), as well as medical image registration algorithms such as Voxelmorph, Demons, SyN, Iterative Closest Point, and SynthMorph.
  • results: Surveys atlas-based registration, multistage registration techniques (coarse-fine and pyramid approaches), and medical image registration taxonomies, datasets, evaluation measures, and applications.
    Abstract Image registration (IR) is a process that deforms images to align them with respect to a reference space, making it easier for medical practitioners to examine various medical images in a standardized reference frame, such as having the same rotation and scale. This document introduces image registration using a simple numeric example. It provides a definition of image registration along with a space-oriented symbolic representation. This review covers various aspects of image transformations, including affine, deformable, invertible, and bidirectional transformations, as well as medical image registration algorithms such as Voxelmorph, Demons, SyN, Iterative Closest Point, and SynthMorph. It also explores atlas-based registration and multistage image registration techniques, including coarse-fine and pyramid approaches. Furthermore, this survey paper discusses medical image registration taxonomies, datasets, evaluation measures, such as correlation-based metrics, segmentation-based metrics, processing time, and model size. It also explores applications in image-guided surgery, motion tracking, and tumor diagnosis. Finally, the document addresses future research directions, including the further development of transformers.
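As a concrete instance of the affine transformations the survey covers, the sketch below resamples a toy "moving" image under a rotation and scale. The misalignment parameters are assumed known here; a registration algorithm would estimate them.

```python
import numpy as np
from scipy import ndimage

# Note scipy's convention: each output coordinate o is sampled from input
# coordinate A @ o + offset.
moving = np.zeros((64, 64))
moving[20:40, 24:44] = 1.0                  # toy structure to be aligned

theta = np.deg2rad(15)                      # assumed-known misalignment
scale = 1.1
A = scale * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
center = np.array(moving.shape) / 2.0
offset = center - A @ center                # rotate/scale about the center

aligned = ndimage.affine_transform(moving, A, offset=offset, order=1)
```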

Indexing Irises by Intrinsic Dimension

  • paper_url: http://arxiv.org/abs/2309.00705
  • repo_url: None
  • paper_authors: J. Michael Rozmus
  • for: Improving the accuracy and speed of iris recognition.
  • methods: Uses principal component analysis (PCA) to map key portions of a set of high-quality iris images into a four-dimensional intrinsic space (see the sketch below).
  • results: Yields a fast and accurate iris identification system that matches a presented iris image to its enrolled record after comparing against only a few percent of the database.
    Abstract 28,000+ high-quality iris images of 1350 distinct eyes from 650+ different individuals from a relatively diverse university town population were collected. A small defined unobstructed portion of the normalized iris image is selected as a key portion for quickly identifying an unknown individual when submitting an iris image to be matched to a database of enrolled irises of the 1350 distinct eyes. The intrinsic dimension of a set of these key portions of the 1350 enrolled irises is measured to be about four (4). This set is mapped to a four-dimensional intrinsic space by principal components analysis (PCA). When an iris image is presented to the iris database for identification, the search begins in the neighborhood of the location of its key portion in the 4D intrinsic space, typically finding a correct identifying match after comparison to only a few percent of the database.
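The indexing idea, PCA projection of enrolled key portions into a low-dimensional intrinsic space followed by a neighborhood-first search, can be sketched with random stand-in data. The real inputs are normalized iris key portions; the 4 components match the measured intrinsic dimension.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
enrolled = rng.normal(size=(1350, 256))        # 1350 eyes, flattened key portions

pca = PCA(n_components=4).fit(enrolled)        # 4D intrinsic space
index = NearestNeighbors(n_neighbors=30).fit(pca.transform(enrolled))

probe = enrolled[42] + 0.01 * rng.normal(size=256)
_, candidates = index.kneighbors(pca.transform(probe[None]))
# Only the returned candidates need a full iris comparison, so typically only
# a few percent of the database is examined.
```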

AAN: Attributes-Aware Network for Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2309.00696
  • repo_url: None
  • paper_authors: Rui Dai, Srijan Das, Michael S. Ryoo, Francois Bremond
  • for: Addressing the challenge of long-term video understanding, which is constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks.
  • methods: Proposes the Attributes-Aware Network (AAN), with two key components, an Attributes Extractor and a Graph Reasoning block, which together extract object-centric attributes and model object relationships within the video (a toy sketch follows below).
  • results: By leveraging CLIP features, AAN outperforms state-of-the-art methods on the Charades and Toyota Smarthome Untrimmed action detection datasets.
    Abstract The challenge of long-term video understanding remains constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks. Although the CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the Attributes-Aware Network (AAN), which consists of two key components: the Attributes Extractor and a Graph Reasoning block. These components facilitate the extraction of object-centric attributes and the modelling of their relationships within the video. By leveraging CLIP features, AAN outperforms state-of-the-art approaches on two popular action detection datasets: Charades and Toyota Smarthome Untrimmed datasets.
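A toy version of graph reasoning over object-centric features: build a soft adjacency from feature similarity and do one round of message passing. This is an illustrative sketch, not AAN's actual Graph Reasoning block.

```python
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    """One round of message passing over object attribute nodes (sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes):
        # nodes: (N, dim) object-centric attribute features (e.g. from CLIP).
        adj = torch.softmax(nodes @ nodes.T / nodes.shape[1] ** 0.5, dim=-1)
        return torch.relu(nodes + self.update(adj @ nodes))

nodes = torch.randn(6, 512)            # six detected objects in a clip
refined = GraphReasoning()(nodes)
```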

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.00616
  • repo_url: https://github.com/Pointcept/OpenIns3D
  • paper_authors: Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby
  • for: 3D open-vocabulary scene understanding at the instance level, without requiring any 2D image inputs.
  • methods: The OpenIns3D framework uses a “Mask-Snap-Lookup” scheme: a “Mask” module learns class-agnostic mask proposals in 3D point clouds, a “Snap” module generates synthetic scene-level images, and a “Lookup” module assigns category names to the proposed masks using Mask2Pixel maps.
  • results: The proposed approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin, and allows 2D detectors to be swapped without re-training. When integrated with state-of-the-art 2D open-world models and large language models (LLMs), it performs excellently on open-vocabulary instance segmentation and on complex text queries.
    Abstract Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a completely new pipeline, namely, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision language models to extract interesting objects. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D input-free, easy-to-train, and flexible approach achieved state-of-the-art results on a wide range of indoor and outdoor datasets with a large margin. Furthermore, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, superb results are observed on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

  • paper_url: http://arxiv.org/abs/2309.00610
  • repo_url: https://github.com/hzxie/city-dreamer
  • paper_authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
  • for: This paper focuses on the generation of 3D cities, which has received less attention in recent years despite the greater challenges it poses due to human sensitivity to structural distortions and the complexity of generating buildings with a wide range of appearances.
  • methods: The proposed method, CityDreamer, is a compositional generative model that separates the generation of building instances from other background objects into distinct modules, and uses two datasets (OSM and GoogleEarth) to enhance the realism of the generated 3D cities.
  • results: Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.
    Abstract In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the greater challenges posed by 3D city generation, mainly because humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects, such as roads, green lands, and water areas, into distinct modules. Furthermore, we construct two datasets, OSM and GoogleEarth, containing a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.

Time Series Analysis of Urban Liveability

  • paper_url: http://arxiv.org/abs/2309.00594
  • repo_url: None
  • paper_authors: Alex Levering, Diego Marcos, Devis Tuia
  • for: Explores deep learning models to monitor longitudinal liveability changes in Dutch cities at the neighbourhood level.
  • methods: Pairs yearly high-resolution aerial imagery with the yearly Leefbaarometer liveability score to create yearly timesteps, and trains a convolutional neural network on 2016 aerial imagery and scores to predict liveability at the 2012 and 2020 timesteps.
  • results: Results in the training city (Amsterdam) and in a city never seen during training (Eindhoven) show trends that are difficult to interpret, especially given differences in image acquisition across timesteps, demonstrating the complexity of liveability monitoring over time and the need for more sophisticated methods that compensate for changes unrelated to liveability dynamics.
    Abstract In this paper we explore deep learning models to monitor longitudinal liveability changes in Dutch cities at the neighbourhood level. Our liveability reference data is defined by a country-wise yearly survey based on a set of indicators combined into a liveability score, the Leefbaarometer. We pair this reference data with yearly-available high-resolution aerial images, which creates yearly timesteps at which liveability can be monitored. We deploy a convolutional neural network trained on an aerial image from 2016 and the Leefbaarometer score to predict liveability at new timesteps 2012 and 2020. The results in a city used for training (Amsterdam) and one never seen during training (Eindhoven) show some trends which are difficult to interpret, especially in light of the differences in image acquisitions at the different time steps. This demonstrates the complexity of liveability monitoring across time periods and the necessity for more sophisticated methods compensating for changes unrelated to liveability dynamics.

Discrete Morphological Neural Networks

  • paper_url: http://arxiv.org/abs/2309.00588
  • repo_url: https://github.com/dmarcondes/dmnn
  • paper_authors: Diego Marcondes, Junior Barrera
  • for: Proposes Discrete Morphological Neural Networks (DMNN), grounded in Mathematical Morphology (MM), for designing binary image operators (W-operators) for binary image analysis.
  • methods: A DMNN architecture, represented by a Morphological Computational Graph, is designed as in the classical heuristic design of morphological operators; its parameters (structural elements or maximal intervals) are then trained with a lattice gradient descent algorithm (LGDA), including a more efficient, scalable stochastic variant (a classical morphological example follows below).
  • results: Applies the DMNN to recognize the boundaries of noisy digits and discusses several topics for future research.
    Abstract A classical approach to designing binary image operators is Mathematical Morphology (MM). We propose the Discrete Morphological Neural Networks (DMNN) for binary image analysis to represent W-operators and estimate them via machine learning. A DMNN architecture, which is represented by a Morphological Computational Graph, is designed as in the classical heuristic design of morphological operators, in which the designer should combine a set of MM operators and Boolean operations based on prior information and theoretical knowledge. Then, once the architecture is fixed, instead of adjusting its parameters (i.e., structural elements or maximal intervals) by hand, we propose a lattice gradient descent algorithm (LGDA) to train these parameters based on a sample of input and output images under the usual machine learning approach. We also propose a stochastic version of the LGDA that is more efficient, is scalable and can obtain small error in practical problems. The class represented by a DMNN can be quite general or specialized according to expected properties of the target operator, i.e., prior information, and the semantic expressed by algebraic properties of classes of operators is a differential relative to other methods. The main contribution of this paper is the merger of the two main paradigms for designing morphological operators: classical heuristic design and automatic design via machine learning. Thus, conciliating classical heuristic morphological operator design with machine learning. We apply the DMNN to recognize the boundary of digits with noise, and we discuss many topics for future research.
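For context, the sketch below shows the kind of hand-designed morphological operator (boundary = image minus its erosion) that a DMNN would instead learn; in a DMNN the structural element is a trainable parameter optimized by the LGDA.

```python
import numpy as np
from scipy import ndimage

digit = np.zeros((16, 16), dtype=bool)
digit[4:12, 5:11] = True                     # toy binary "digit"

struct = np.ones((3, 3), dtype=bool)         # structural element; hand-picked
                                             # here, learned in a DMNN
boundary = digit & ~ndimage.binary_erosion(digit, structure=struct)
```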

Mechanism of feature learning in convolutional neural networks

  • paper_url: http://arxiv.org/abs/2309.00570
  • repo_url: https://github.com/aradha/convrfm
  • paper_authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
  • for: Explaining the mechanism by which convolutional neural networks learn features from image data.
  • methods: Posits the Convolutional Neural Feature Ansatz: covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with respect to patches of that layer's input (a rough sketch follows below). Presents extensive empirical evidence, including high correlation between filter covariances and patch-based AGOPs for convolutional layers in standard architectures such as AlexNet, VGG, and ResNets pre-trained on ImageNet, along with supporting theoretical evidence.
  • results: Shows that the patch-based AGOP enables deep feature learning in convolutional kernel machines; the resulting (Deep) ConvRFM recovers features similar to deep convolutional networks, including the notable emergence of edge detectors, and overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals in images, leading to sizable performance improvements over fixed convolutional kernels.
    Abstract Understanding the mechanism of how convolutional neural networks learn features from image data is a fundamental problem in machine learning and computer vision. In this work, we identify such a mechanism. We posit the Convolutional Neural Feature Ansatz, which states that covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with respect to patches of the input to that layer. We present extensive empirical evidence for our ansatz, including identifying high correlation between covariances of filters and patch-based AGOPs for convolutional layers in standard neural architectures, such as AlexNet, VGG, and ResNets pre-trained on ImageNet. We also provide supporting theoretical evidence. We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines. We refer to the resulting algorithm as (Deep) ConvRFM and show that our algorithm recovers similar features to deep convolutional networks including the notable emergence of edge detectors. Moreover, we find that Deep ConvRFM overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals in images and, as a result, leads to sizable performance improvement over fixed convolutional kernels.
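A rough sketch of a patch-based average gradient outer product: take gradients of a scalar readout with respect to the input, unfold them into patches, and average the outer products. The summed-logits readout and the use of the network input (rather than an intermediate layer's input) are simplifying assumptions; the paper's exact estimator may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_agop(model, images, patch_size=3):
    """Average gradient outer product over input patches (sketch)."""
    images = images.clone().requires_grad_(True)
    (grads,) = torch.autograd.grad(model(images).sum(), images)
    patches = F.unfold(grads, kernel_size=patch_size, padding=patch_size // 2)
    p = patches.transpose(1, 2).reshape(-1, patches.shape[1])  # (B*L, C*k*k)
    return p.T @ p / p.shape[0]                                # (C*k*k, C*k*k)

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
agop = patch_agop(model, torch.randn(4, 3, 32, 32))   # 27x27 matrix for k=3
```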

Amyloid-Beta Axial Plane PET Synthesis from Structural MRI: An Image Translation Approach for Screening Alzheimer’s Disease

  • paper_url: http://arxiv.org/abs/2309.00569
  • repo_url: None
  • paper_authors: Fernando Vega, Abdoljalil Addeh, M. Ethan MacDonald
  • for: Generating quantitatively accurate synthetic amyloid-beta PET images from structural MRI, so that amyloid-beta information can be obtained from MRI alone.
  • methods: Trains an image translation model on paired amyloid-beta PET and structural MRI images, performing structural-to-quantitative image translation.
  • results: The synthetic PET images closely match ground truth in shape and contrast, with high overall SSIM and PSNR (the metrics are illustrated below).
    Abstract In this work, an image translation model is implemented to produce synthetic amyloid-beta PET images from structural MRI that are quantitatively accurate. Image pairs of amyloid-beta PET and structural MRI were used to train the model. We found that the synthetic PET images could be produced with a high degree of similarity to truth in terms of shape, contrast and overall high SSIM and PSNR. This work demonstrates that performing structural to quantitative image translation is feasible to enable the access amyloid-beta information from only MRI.
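The fidelity metrics the abstract reports are standard and easy to reproduce; the arrays below are toy stand-ins for ground-truth and synthetic PET slices.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

truth = np.random.rand(128, 128)
synthetic = np.clip(truth + 0.05 * np.random.randn(128, 128), 0, 1)

ssim = structural_similarity(truth, synthetic, data_range=1.0)
psnr = peak_signal_noise_ratio(truth, synthetic, data_range=1.0)
print(f"SSIM={ssim:.3f}  PSNR={psnr:.1f} dB")
```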

Fused Classification For Differential Face Morphing Detection

  • paper_url: http://arxiv.org/abs/2309.00665
  • repo_url: None
  • paper_authors: Iurii Medvedev, Joana Pimenta, Nuno Gonçalves
  • for: Protecting face recognition systems against face morphing attacks.
  • methods: An extended fused-classification approach for differential morphing detection in the no-reference scenario, supported by a public benchmark and a specific data mining technique to enhance performance.
  • results: Experimental results demonstrate the method's effectiveness in detecting morphing attacks.
    Abstract Face morphing, a sophisticated presentation attack technique, poses significant security risks to face recognition systems. Traditional methods struggle to detect morphing attacks, which involve blending multiple face images to create a synthetic image that can match different individuals. In this paper, we focus on the differential detection of face morphing and propose an extended approach based on fused classification method for no-reference scenario. We introduce a public face morphing detection benchmark for the differential scenario and utilize a specific data mining technique to enhance the performance of our approach. Experimental results demonstrate the effectiveness of our method in detecting morphing attacks.

Impact of Image Context for Single Deep Learning Face Morphing Attack Detection

  • paper_url: http://arxiv.org/abs/2309.00549
  • repo_url: None
  • paper_authors: Joana Pimenta, Iurii Medvedev, Nuno Gonçalves
  • for: Investigating how the alignment settings of input images affect deep-learning face morphing detection performance.
  • methods: Analyzes the interconnections between the face contour and the image context in deep-learning morphing detection, to determine how input alignment should be chosen.
  • results: Suggests optimal alignment conditions that improve face morphing detection performance.
    Abstract The increase in security concerns due to technological advancements has led to the popularity of biometric approaches that utilize physiological or behavioral characteristics for enhanced recognition. Face recognition systems (FRSs) have become prevalent, but they are still vulnerable to image manipulation techniques such as face morphing attacks. This study investigates the impact of the alignment settings of input images on deep learning face morphing detection performance. We analyze the interconnections between the face contour and image context and suggest optimal alignment conditions for face morphing detection.

Trust your Good Friends: Source-free Domain Adaptation by Reciprocal Neighborhood Clustering

  • paper_url: http://arxiv.org/abs/2309.00528
  • repo_url: None
  • paper_authors: Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, Shangling Jui, Jian Yang
  • for: Addressing source-free domain adaptation (SFDA), where a source-pretrained model is adapted to the target domain without access to source data.
  • methods: Exploits the intrinsic neighborhood structure of the target data: encourages label consistency among data with high local affinity, assigns higher affinity to reciprocal neighbors (see the sketch below), aggregates context from expanded neighborhoods with small affinity values, and accounts for the density around each target sample to mitigate outliers.
  • results: Achieves state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
    Abstract Domain adaptation (DA) aims to alleviate the domain shift between source domain and target domain. Most DA methods require access to the source data, but often that is not possible (e.g. due to data privacy or intellectual property). In this paper, we address the challenging source-free domain adaptation (SFDA) problem, where the source pretrained model is adapted to the target domain in the absence of source data. Our method is based on the observation that target data, which might not align with the source domain classifier, still forms clear clusters. We capture this intrinsic structure by defining local affinity of the target data, and encourage label consistency among data with high local affinity. We observe that higher affinity should be assigned to reciprocal neighbors. To aggregate information with more context, we consider expanded neighborhoods with small affinity values. Furthermore, we consider the density around each target sample, which can alleviate the negative impact of potential outliers. In the experimental results we verify that the inherent structure of the target features is an important source of information for domain adaptation. We demonstrate that this local structure can be efficiently captured by considering the local neighbors, the reciprocal neighbors, and the expanded neighborhood. Finally, we achieve state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
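The reciprocal-neighbor notion at the heart of the method can be sketched in a few lines: i and j are reciprocal neighbors when each appears in the other's k-NN list, and such pairs receive higher affinity. The value of k and the cosine-similarity choice are assumptions.

```python
import numpy as np

def reciprocal_neighbor_mask(features, k=5):
    """Boolean (N, N) mask of reciprocal k-nearest-neighbor pairs (sketch).

    features: (N, D), assumed L2-normalized so the dot product is cosine
    similarity.
    """
    sim = features @ features.T
    np.fill_diagonal(sim, -np.inf)                       # exclude self
    knn = np.argsort(-sim, axis=1)[:, :k]
    is_nn = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(is_nn, knn, True, axis=1)
    return is_nn & is_nn.T                               # reciprocity

feats = np.random.randn(100, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
mask = reciprocal_neighbor_mask(feats)
```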

SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.00526
  • repo_url: None
  • paper_authors: Youhong Wang, Yunji Liang, Hao Xu, Shaohui Jiao, Hongkai Yu
  • for: Proposes a self-supervised monocular depth estimation method that effectively learns fine-grained scene structure from motion, improving applicability to autonomous driving and robotics.
  • methods: Introduces a novel Self Query Layer (SQL) that builds a self-cost volume and infers depth from it, rather than from feature maps. The self-cost volume implicitly captures the intrinsic geometry of the scene within a single frame, with each slice encoding relative distances between points and objects in a latent space; the volume is then converted to a depth map through a novel decoding approach.
  • results: Achieves state-of-the-art performance on KITTI and Cityscapes (AbsRel = $0.082$ on KITTI, $0.052$ on KITTI with improved ground truth, and $0.106$ on Cityscapes), a $9.9\%$, $5.5\%$, and $4.5\%$ error reduction over the previous best, while also showing reduced training complexity, computational efficiency, improved generalization, and recovery of fine-grained scene detail. Self-supervised pre-training followed by metric fine-tuning surpasses existing supervised methods by significant margins (AbsRel = $0.043$, $14\%$ error reduction).
    Abstract Recently, self-supervised monocular depth estimation has gained popularity with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features, and struggle to recover fine-grained scene details with limited generalization. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structures from motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. The self-cost volume implicitly captures the intrinsic geometry of the scene within a single frame. Each individual slice of the volume signifies the relative distances between points and objects within a latent space. Ultimately, this volume is compressed to the depth map via a novel decoding approach. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance (AbsRel = $0.082$ on KITTI, $0.052$ on KITTI with improved ground-truth and $0.106$ on Cityscapes), achieves $9.9\%$, $5.5\%$ and $4.5\%$ error reduction from the previous best. In addition, our approach showcases reduced training complexity, computational efficiency, improved generalization, and the ability to recover fine-grained scene details. Moreover, the self-supervised pre-trained and metric fine-tuned SQLdepth can surpass existing supervised methods by significant margins (AbsRel = $0.043$, $14\%$ error reduction). self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code and the pre-trained weights will be publicly available. Code is available at \href{https://github.com/hisfog/SQLdepth-Impl}{https://github.com/hisfog/SQLdepth-Impl}.

A Machine Vision Method for Correction of Eccentric Error: Based on Adaptive Enhancement Algorithm

  • paper_url: http://arxiv.org/abs/2309.00514
  • repo_url: None
  • paper_authors: Fanyi Wang, Pin Cao, Yihui Zhang, Haotian Hu, Yongying Yang
  • for: Proposes a machine vision method for correcting eccentric error during surface defect detection on large-aperture aspherical optical elements.
  • methods: An improved Adaptive Enhancement Algorithm (AEA) that combines the existing Guided Filter Dark Channel Dehazing Algorithm (GFA) with a proposed lightweight Multi-scale Densely Connected Network (MDC-Net), routing images between the two based on an empirical threshold of the SMD2 definition evaluation function (sketched below).
  • results: Reduces the eccentricity error to within 10 um while maintaining near-real-time performance.
    Abstract In the procedure of surface defect detection for large-aperture aspherical optical elements, it is of vital significance to accurately adjust the optical axis of the element to be coaxial with the mechanical spin axis. Therefore, a machine vision method for eccentric error correction is proposed in this paper. Focusing on the severe defocus blur of the reference crosshair image caused by the imaging characteristics of the aspherical optical element, which may lead to the failure of correction, an Adaptive Enhancement Algorithm (AEA) is proposed to strengthen the crosshair image. AEA consists of the existing Guided Filter Dark Channel Dehazing Algorithm (GFA) and the proposed lightweight Multi-scale Densely Connected Network (MDC-Net). The enhancement effect of GFA is excellent but time-consuming, while the enhancement effect of MDC-Net is slightly inferior but strongly real-time. As AEA is executed dozens of times during each correction procedure, its real-time performance is very important. Therefore, by setting an empirical threshold on the definition evaluation function SMD2, GFA and MDC-Net are respectively applied to highly and slightly blurred crosshair images, ensuring the enhancement effect while saving as much time as possible. AEA has robust runtime performance, taking an average of 0.2721 s for GFA and 0.0963 s for MDC-Net on ten 200x200-pixel Region of Interest (ROI) images with different degrees of blur. The eccentricity error can be reduced to within 10 um by our method.
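A sketch of the routing logic: SMD2 is a common gray-variance sharpness (definition) measure, and images are dispatched to GFA or MDC-Net depending on an empirical threshold. The SMD2 normalization, the threshold value, and the run_gfa/run_mdcnet placeholders are all illustrative assumptions.

```python
import numpy as np

def smd2(img):
    """SMD2 sharpness: sum of products of adjacent gray-level differences."""
    img = img.astype(np.float64)
    dx = np.abs(img[:-1, :-1] - img[1:, :-1])
    dy = np.abs(img[:-1, :-1] - img[:-1, 1:])
    return float((dx * dy).sum())

def run_gfa(roi):      # placeholder for the Guided Filter Dark Channel step
    return roi

def run_mdcnet(roi):   # placeholder for the lightweight MDC-Net
    return roi

def enhance(roi, threshold=1e4):   # threshold value is purely illustrative
    # Heavily blurred crosshair images (low SMD2) go to the stronger but
    # slower GFA; mildly blurred ones go to the faster MDC-Net.
    return run_gfa(roi) if smd2(roi) < threshold else run_mdcnet(roi)

print(smd2(np.random.rand(200, 200)))
```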

Multi-stage Deep Learning Artifact Reduction for Computed Tomography

  • paper_url: http://arxiv.org/abs/2309.00494
  • repo_url: None
  • paper_authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg
  • for: Improving the quality of computed tomography (CT) images by reducing imaging artifacts.
  • methods: A multi-stage deep learning method for artifact removal that applies neural networks in several domains (e.g., projection images and reconstruction images), mirroring the classical CT processing pipeline (a pipeline sketch follows below); the networks are trained in succession, making training easy to use and computationally efficient.
  • results: Experiments on both simulated and real-world datasets show the method effectively reduces artifacts and is superior to deep learning-based post-processing.
    Abstract In Computed Tomography (CT), an image of the interior structure of an object is computed from a set of acquired projection images. The quality of these reconstructed images is essential for accurate analysis, but this quality can be degraded by a variety of imaging artifacts. To improve reconstruction quality, the acquired projection images are often processed by a pipeline consisting of multiple artifact-removal steps applied in various image domains (e.g., outlier removal on projection images and denoising of reconstruction images). These artifact-removal methods exploit the fact that certain artifacts are easier to remove in a certain domain compared with other domains. Recently, deep learning methods have shown promising results for artifact removal for CT images. However, most existing deep learning methods for CT are applied as a post-processing method after reconstruction. Therefore, artifacts that are relatively difficult to remove in the reconstruction domain may not be effectively removed by these methods. As an alternative, we propose a multi-stage deep learning method for artifact removal, in which neural networks are applied to several domains, similar to a classical CT processing pipeline. We show that the neural networks can be effectively trained in succession, resulting in easy-to-use and computationally efficient training. Experiments on both simulated and real-world experimental datasets show that our method is effective in reducing artifacts and superior to deep learning-based post-processing.
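The multi-domain idea mirrors the classical pipeline: clean the projections, reconstruct, then clean the reconstruction. In the sketch below the two denoisers are placeholder identity functions standing in for the trained networks, and filtered back-projection (skimage's iradon) stands in for whatever reconstruction step is used.

```python
import numpy as np
from skimage.transform import iradon

def denoise_projections(sino):      # placeholder for the projection-domain net
    return sino

def denoise_reconstruction(img):    # placeholder for the image-domain net
    return img

theta = np.linspace(0.0, 180.0, 180, endpoint=False)
sinogram = np.random.rand(128, 180)          # stand-in for acquired projections

cleaned = denoise_projections(sinogram)      # stage 1: projection domain
recon = iradon(cleaned, theta=theta, filter_name="ramp")  # reconstruction
final = denoise_reconstruction(recon)        # stage 2: image domain
```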

Asymmetric double-winged multi-view clustering network for exploring Diverse and Consistent Information

  • paper_url: http://arxiv.org/abs/2309.00474
  • repo_url: None
  • paper_authors: Qun Zheng, Xihong Yang, Siwei Wang, Xinru An, Qi Liu
  • for: Proposes a novel multi-view clustering network (CodingNet) that explores diverse and consistent information in multi-view data simultaneously.
  • methods: An asymmetric structure network extracts shallow and deep features separately; aligning the similarity matrix of the shallow features to the zero matrix ensures their diversity (the diversity term is sketched below), while a dual contrastive mechanism maintains consistency for deep features at both the view-feature and pseudo-label levels.
  • results: Extensive experiments on six widely used benchmark datasets validate the framework, which outperforms most state-of-the-art multi-view clustering algorithms.
    Abstract In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) is becoming a hot research spot, which aims to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring the consistency information for the deep semantic features, while ignoring the diverse information on shallow features. To fill this gap, we propose a novel multi-view clustering network termed CodingNet to explore the diverse and consistent information simultaneously in this paper. Specifically, instead of utilizing the conventional auto-encoder, we design an asymmetric structure network to extract shallow and deep features separately. Then, by aligning the similarity matrix on the shallow feature to the zero matrix, we ensure the diversity for the shallow features, thus offering a better description of multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for deep features at both view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, outperforming most state-of-the-art multi-view clustering algorithms.
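One way to read "aligning the shallow similarity matrix to the zero matrix" is as minimizing the Frobenius norm of the cross-view cosine-similarity matrix; the sketch below implements that reading, which may differ from CodingNet's exact loss.

```python
import torch
import torch.nn.functional as F

def shallow_diversity_loss(z1, z2):
    """Push the cross-view similarity matrix of shallow features toward zero.

    z1, z2: (N, D) shallow features from two views.
    """
    s = F.normalize(z1, dim=1) @ F.normalize(z2, dim=1).T   # (N, N) cosine sims
    return (s ** 2).mean()                                  # squared Frobenius norm

loss = shallow_diversity_loss(torch.randn(32, 128), torch.randn(32, 128))
```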

General and Practical Tuning Method for Off-the-Shelf Graph-Based Index: SISAP Indexing Challenge Report by Team UTokyo

  • paper_url: http://arxiv.org/abs/2309.00472
  • repo_url: https://github.com/mti-lab/utokyo-sisap23-challenge-submission
  • paper_authors: Yutaro Oguri, Yusuke Matsui
  • for: Tuning off-the-shelf graph-based indexes for Approximate Nearest Neighbor (ANN) search, focusing on vector dimension, database size, and the entry points of graph traversal.
  • methods: Uses a black-box optimization algorithm for integrated tuning that meets the required levels of recall and Queries Per Second (QPS) (a tuning sketch follows below).
  • results: Took second place in the 10M and 30M tracks of Task A of the SISAP 2023 Indexing Challenge, substantially improving performance over brute-force methods; the tuning method is universally applicable beyond the competition's specific conditions.
    Abstract Despite the efficacy of graph-based algorithms for Approximate Nearest Neighbor (ANN) searches, the optimal tuning of such systems remains unclear. This study introduces a method to tune the performance of off-the-shelf graph-based indexes, focusing on the dimension of vectors, database size, and entry points of graph traversal. We utilize a black-box optimization algorithm to perform integrated tuning to meet the required levels of recall and Queries Per Second (QPS). We applied our approach to Task A of the SISAP 2023 Indexing Challenge and got second place in the 10M and 30M tracks. It improves performance substantially compared to brute force methods. This research offers a universally applicable tuning method for graph-based indexes, extending beyond the specific conditions of the competition to broader uses.
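A sketch of black-box index tuning under stated assumptions: hnswlib stands in for the off-the-shelf graph index and Optuna for the black-box optimizer (the report does not say the team used either), and the objective here maximizes recall@1 only, whereas a real objective would also enforce the QPS requirement.

```python
import numpy as np
import hnswlib
import optuna

data = np.random.rand(10_000, 64).astype(np.float32)
queries, truth = data[:100], np.arange(100)[:, None]   # queries are db points

def objective(trial):
    index = hnswlib.Index(space="l2", dim=64)
    index.init_index(max_elements=len(data),
                     M=trial.suggest_int("M", 8, 48),
                     ef_construction=trial.suggest_int("ef_construction", 50, 300))
    index.add_items(data)
    index.set_ef(trial.suggest_int("ef", 10, 200))     # traversal breadth
    labels, _ = index.knn_query(queries, k=1)
    return float((labels == truth).mean())             # recall@1

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```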

An Improved Encoder-Decoder Framework for Food Energy Estimation

  • paper_url: http://arxiv.org/abs/2309.00468
  • repo_url: None
  • paper_authors: Jack Ma, Jiangpeng He, Fengqing Zhu
  • for: Providing automatic image-based dietary assessment to help maintain a healthy lifestyle.
  • methods: An improved encoder-decoder framework estimates food energy from a single monocular image: the encoder transforms the image into a representation that embeds food energy information in an easier-to-extract format, from which the decoder extracts the energy value.
  • results: Improves on previous caloric estimation methods by over 10% in MAPE and over 30 kCal in MAE (both metrics are illustrated below).
    Abstract Dietary assessment is essential to maintaining a healthy lifestyle. Automatic image-based dietary assessment is a growing field of research due to the increasing prevalence of image capturing devices (e.g. mobile phones). In this work, we estimate food energy from a single monocular image, a difficult task due to the limited hard-to-extract amount of energy information present in an image. To do so, we employ an improved encoder-decoder framework for energy estimation; the encoder transforms the image into a representation embedded with food energy information in an easier-to-extract format, which the decoder then extracts the energy information from. To implement our method, we compile a high-quality food image dataset verified by registered dietitians containing eating scene images, food-item segmentation masks, and ground truth calorie values. Our method improves upon previous caloric estimation methods by over 10\% and 30 kCal in terms of MAPE and MAE respectively.
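The two reported metrics are straightforward; the calorie values below are toy numbers.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def mae(y_true, y_pred):
    """Mean absolute error, here in kCal."""
    return float(np.mean(np.abs(y_true - y_pred)))

truth = np.array([250.0, 480.0, 610.0])
pred = np.array([270.0, 455.0, 640.0])
print(f"MAPE={mape(truth, pred):.1f}%  MAE={mae(truth, pred):.1f} kCal")
```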

dacl10k: Benchmark for Semantic Bridge Damage Segmentation

  • paper_url: http://arxiv.org/abs/2309.00460
  • repo_url: None
  • paper_authors: Johannes Flotzinger, Philipp J. Rösch, Thomas Braml
  • for: Providing a large, diverse dataset for recognizing reinforced concrete defects, to support real-world bridge inspection and assessment.
  • methods: dacl10k comprises 9,920 images from real-world bridge inspections, annotated for multi-label semantic segmentation with 12 damage classes and 6 bridge components that play a key role in building assessment and in recommending actions such as restoration works, traffic load limitations, or bridge closures.
  • results: Baseline models evaluated on dacl10k reach a mean intersection-over-union of 0.42 on the test set (the metric is sketched below); the dataset and baselines are openly accessible.
    Abstract Reliably identifying reinforced concrete defects (RCDs)plays a crucial role in assessing the structural integrity, traffic safety, and long-term durability of concrete bridges, which represent the most common bridge type worldwide. Nevertheless, available datasets for the recognition of RCDs are small in terms of size and class variety, which questions their usability in real-world scenarios and their role as a benchmark. Our contribution to this problem is "dacl10k", an exceptionally diverse RCD dataset for multi-label semantic segmentation comprising 9,920 images deriving from real-world bridge inspections. dacl10k distinguishes 12 damage classes as well as 6 bridge components that play a key role in the building assessment and recommending actions, such as restoration works, traffic load limitations or bridge closures. In addition, we examine baseline models for dacl10k which are subsequently evaluated. The best model achieves a mean intersection-over-union of 0.42 on the test set. dacl10k, along with our baselines, will be openly accessible to researchers and practitioners, representing the currently biggest dataset regarding number of images and class diversity for semantic segmentation in the bridge inspection domain.
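The reported metric, mean intersection-over-union, computed with one common convention (classes absent from both maps are skipped):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 12, (64, 64))   # 12 damage classes, toy maps
gt = np.random.randint(0, 12, (64, 64))
print(mean_iou(pred, gt, num_classes=12))
```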

Unsupervised bias discovery in medical image segmentation

  • paper_url: http://arxiv.org/abs/2309.00451
  • repo_url: None
  • paper_authors: Nicolás Gaggion, Rodrigo Echeveste, Lucas Mansilla, Diego H. Milone, Enzo Ferrante
  • for: Auditing deep learning models for anatomical segmentation in medical images for biases against sub-populations defined by protected attributes such as sex or ethnicity.
  • methods: A novel unsupervised bias discovery method that leverages the reverse classification accuracy framework to estimate segmentation quality without ground-truth annotations.
  • results: Numerical experiments in synthetic and realistic scenarios show the method successfully anticipates fairness issues in deep segmentation models, providing a new and valuable tool for the field.
    Abstract It has recently been shown that deep learning models for anatomical segmentation in medical images can exhibit biases against certain sub-populations defined in terms of protected attributes like sex or ethnicity. In this context, auditing fairness of deep segmentation models becomes crucial. However, such audit process generally requires access to ground-truth segmentation masks for the target population, which may not always be available, especially when going from development to deployment. Here we propose a new method to anticipate model biases in biomedical image segmentation in the absence of ground-truth annotations. Our unsupervised bias discovery method leverages the reverse classification accuracy framework to estimate segmentation quality. Through numerical experiments in synthetic and realistic scenarios we show how our method is able to successfully anticipate fairness issues in the absence of ground-truth labels, constituting a novel and valuable tool in this field.

Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.00661
  • repo_url: None
  • paper_authors: Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
  • for: Improving video moment retrieval (VMR) so it can handle unknown vocabulary and unseen scenes.
  • methods: A zero-shot method that adapts generalisable visual-textual priors from an arbitrary frozen vision-language model (VLM) without accessing any VMR data. It comprises a conditional feature refinement module that generates boundary-aware visual features conditioned on text queries, and a bottom-up proposal generation strategy that mitigates domain discrepancies by breaking complex-query retrieval into individual action retrievals, maximizing the benefit of the VLM.
  • results: Extensive experiments on three VMR benchmark datasets show notable performance advantages, especially in the novel-word and novel-location out-of-distribution setups.
    Abstract Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text data which is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable when only the video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, the vision-language models (VLM) demonstrate a new transfer learning paradigm to benefit different vision tasks through the universal visual-textual correlations derived from large-scale vision-language pairwise web data, which has also shown benefits to VMR by fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment, without the need for accessing the VMR data. To this end, we devise a conditional feature refinement module to generate boundary-aware visual features conditioned on text queries to enable better moment boundary understanding. Additionally, we design a bottom-up proposal generation strategy that mitigates the impact of domain discrepancies and breaks down complex-query retrieval tasks into individual action retrievals, thereby maximizing the benefits of VLM. Extensive experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm, especially in the novel-word and novel-location out-of-distribution setups.

Improving the matching of deformable objects by learning to detect keypoints

  • paper_url: http://arxiv.org/abs/2309.00434
  • repo_url: https://github.com/verlab/learningtodetect_prl_2023
  • paper_authors: Felipe Cadar, Welerson Melo, Vaishnavi Kanagasabapathi, Guilherme Potje, Renato Martins, Erickson R. Nascimento
  • for: Increasing the number of correct matches for non-rigid image correspondence.
  • methods: Trains an end-to-end convolutional neural network, supervised by ground-truth correspondences from annotated image pairs matched with a specified descriptor extractor, to find keypoint locations better suited to the considered descriptor; geometric and photometric warpings generate the supervisory signal.
  • results: Enhances the Mean Matching Accuracy of numerous descriptors (the metric is sketched below), outperforms state-of-the-art keypoint detectors on real images of non-rigid objects by 20 p.p., and performs on par with the finest available detectors on a real-world object retrieval task.
    Abstract We propose a novel learned keypoint detection method to increase the number of correct matches for the task of non-rigid image correspondence. By leveraging true correspondences acquired by matching annotated image pairs with a specified descriptor extractor, we train an end-to-end convolutional neural network (CNN) to find keypoint locations that are more appropriate to the considered descriptor. For that, we apply geometric and photometric warpings to images to generate a supervisory signal, allowing the optimization of the detector. Experiments demonstrate that our method enhances the Mean Matching Accuracy of numerous descriptors when used in conjunction with our detection method, while outperforming the state-of-the-art keypoint detectors on real images of non-rigid objects by 20 p.p. We also apply our method on the complex real-world task of object retrieval where our detector performs on par with the finest keypoint detectors currently available for this task. The source code and trained models are publicly available at https://github.com/verlab/LearningToDetect_PRL_2023
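One common definition of Mean Matching Accuracy is the fraction of matches whose reprojection error falls below a pixel threshold; the thresholds below are illustrative.

```python
import numpy as np

def mean_matching_accuracy(matched, truth, thresholds=(1, 3, 5)):
    """Fraction of matches within each pixel threshold.

    matched: (N, 2) matched locations; truth: (N, 2) ground-truth
    correspondences in the same image.
    """
    err = np.linalg.norm(matched - truth, axis=1)
    return {t: float((err <= t).mean()) for t in thresholds}

matched = np.random.rand(50, 2) * 100
truth = matched + np.random.randn(50, 2)     # toy localization noise
print(mean_matching_accuracy(matched, truth))
```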

Selective Scene Text Removal

  • paper_url: http://arxiv.org/abs/2309.00410
  • repo_url: https://github.com/mitanihayato/Selective-Scene-Text-Removal
  • paper_authors: Hayato Mitani, Akisato Kimura, Seiichi Uchida
  • for: Selective scene text removal (SSTR), a new task setting that removes only the target words specified by the user rather than all scene text.
  • methods: A multi-module structure that enables efficient training for SSTR.
  • results: Experimental results show the proposed method can remove target words as expected.
    Abstract Scene text removal (STR) is the image transformation task to remove text regions in scene images. The conventional STR methods remove all scene text. This means that the existing methods cannot select text to be removed. In this paper, we propose a novel task setting named selective scene text removal (SSTR) that removes only target words specified by the user. Although SSTR is a more complex task than STR, the proposed multi-module structure enables efficient training for SSTR. Experimental results show that the proposed method can remove target words as expected.

Fine-grained Recognition with Learnable Semantic Data Augmentation

  • paper_url: http://arxiv.org/abs/2309.00399
  • repo_url: None
  • paper_authors: Yifan Pu, Yizeng Han, Yulin Wang, Junlan Feng, Chao Deng, Gao Huang
  • for: Improving fine-grained image recognition by diversifying training data at the feature level, strengthening classifier generalization.
  • methods: Produces diversified augmented samples by translating image features along semantically meaningful directions (sketched below); a covariance prediction network predicts a sample-wise covariance matrix to adapt to the large intra-class variation of fine-grained images, and is jointly optimized with the classification network in a meta-learning manner to avoid degenerate solutions.
  • results: Experiments on four competitive fine-grained recognition benchmarks show significantly improved generalization for popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets, and ViT); combined with a recently proposed method, the approach achieves state-of-the-art performance on CUB-200-2011.
    Abstract Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroy the discriminative visual cues residing in the subtle regions. In this paper, we propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance on several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
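
The core idea of translating features along semantically meaningful directions can be sketched in a few lines. Below is a minimal, hypothetical PyTorch version that predicts a per-sample diagonal covariance and perturbs pooled features accordingly; the class name, covariance head, and `strength` knob are illustrative only (the paper predicts a full sample-wise covariance and optimizes it jointly with the classifier in a meta-learning manner).

```python
import torch
import torch.nn as nn

class SemanticAugment(nn.Module):
    """Feature-level augmentation: translate features along sampled
    semantic directions scaled by a per-sample (diagonal) covariance."""
    def __init__(self, feat_dim: int, strength: float = 0.5):
        super().__init__()
        self.cov_head = nn.Linear(feat_dim, feat_dim)  # predicts log-variance
        self.strength = strength

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        log_var = self.cov_head(feats)            # sample-wise diagonal covariance
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(feats)             # random semantic direction
        return feats + self.strength * std * eps  # augmented feature

feats = torch.randn(8, 256)                       # e.g. pooled backbone features
aug = SemanticAugment(256)(feats)
print(aug.shape)                                  # torch.Size([8, 256])
```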

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2309.00398
  • repo_url: None
  • paper_authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
  • for: Text-to-video generation via reference-guided latent diffusion, producing high-definition videos with high frame fidelity and strong temporal consistency.
  • methods: An off-the-shelf text-to-image model (e.g., Stable Diffusion) first generates a high-quality reference image from the prompt; an efficient cascaded latent diffusion module, conditioned on both the reference image and the prompt, then produces latent video representations, followed by a flow-based temporal upsampling step and an enhanced video decoder.
  • results: Sets a new state of the art in text-to-video generation under both qualitative and quantitative evaluation; more samples at \url{https://videogen.github.io/VideoGen/}.
    Abstract In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily-available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See \url{https://videogen.github.io/VideoGen/} for more samples.

On the Localization of Ultrasound Image Slices within Point Distribution Models

  • paper_url: http://arxiv.org/abs/2309.00372
  • repo_url: None
  • paper_authors: Lennart Bastian, Vincent Bürgin, Ha Young Kim, Alexander Baumann, Benjamin Busam, Mahdi Saleh, Nassir Navab
  • for: Automated localization of ultrasound (US) image slices within a 3D shape representation, easing longitudinal nodule tracking for thyroid diagnostics.
  • methods: Contrastive metric learning of a common latent embedding space between US image patches and the 3D thyroid surface (or a statistical shape model, SSM), followed by cross-modality registration with Procrustes analysis to register US slices to a 3D mesh.
  • results: The multi-modal registration framework predicts slice positions within an average of 1.2 mm of the ground-truth location on patient-specific 3D anatomy and 4.6 mm on the SSM.
    Abstract Thyroid disorders are most commonly diagnosed using high-resolution Ultrasound (US). Longitudinal nodule tracking is a pivotal diagnostic protocol for monitoring changes in pathological thyroid morphology. This task, however, imposes a substantial cognitive load on clinicians due to the inherent challenge of maintaining a mental 3D reconstruction of the organ. We thus present a framework for automated US image slice localization within a 3D shape representation to ease how such sonographic diagnoses are carried out. Our proposed method learns a common latent embedding space between US image patches and the 3D surface of an individual's thyroid shape, or a statistical aggregation in the form of a statistical shape model (SSM), via contrastive metric learning. Using cross-modality registration and Procrustes analysis, we leverage features from our model to register US slices to a 3D mesh representation of the thyroid shape. We demonstrate that our multi-modal registration framework can localize images on the 3D surface topology of a patient-specific organ and the mean shape of an SSM. Experimental results indicate slice positions can be predicted within an average of 1.2 mm of the ground-truth slice location on the patient-specific 3D anatomy and 4.6 mm on the SSM, exemplifying its usefulness for slice localization during sonographic acquisitions. Code is publicly available: \href{https://github.com/vuenc/slice-to-shape}{https://github.com/vuenc/slice-to-shape}
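
The registration step relies on Procrustes analysis. As a point of reference, here is a generic least-squares rigid alignment (the Kabsch algorithm) between matched 3D point sets; the paper's cross-modality pipeline supplies the correspondences, which are simply simulated here.

```python
import numpy as np

def procrustes_align(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid alignment (rotation R, translation t) of
    matched 3D point sets, so that src @ R.T + t approximates dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    S, D = src - mu_s, dst - mu_d
    U, _, Vt = np.linalg.svd(S.T @ D)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

src = np.random.rand(50, 3)
theta = np.pi / 6                      # a known 30-degree rotation about z
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([1.0, 2.0, 3.0])
R, t = procrustes_align(src, dst)
print(np.allclose(src @ R.T + t, dst))  # True
```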

How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis

  • paper_url: http://arxiv.org/abs/2309.00350
  • repo_url: None
  • paper_authors: Dewinda Julianensi Rumala
  • for: Investigating the hidden pitfall of data leakage in deep learning models for medical image analysis, and its consequences for diagnostics and patient care.
  • methods: 3D convolutional neural networks (CNNs) for brain MRI analysis, with a study of how different data-splitting strategies during cross-validation affect model performance, particularly for longitudinal data containing repeated scans of the same subject.
  • results: Improper data splitting inflates apparent performance; GradCAM visualization exposes shortcuts caused by identity confounding, where the model learns to identify subjects along with diagnostic features.
    Abstract Deep learning models have revolutionized the field of medical image analysis, offering significant promise for improved diagnostics and patient care. However, their performance can be misleadingly optimistic due to a hidden pitfall called 'data leakage'. In this study, we investigate data leakage in 3D medical imaging, specifically using 3D Convolutional Neural Networks (CNNs) for brain MRI analysis. While 3D CNNs appear less prone to leakage than 2D counterparts, improper data splitting during cross-validation (CV) can still pose issues, especially with longitudinal imaging data containing repeated scans from the same subject. We explore the impact of different data splitting strategies on model performance for longitudinal brain MRI analysis and identify potential data leakage concerns. GradCAM visualization helps reveal shortcuts in CNN models caused by identity confounding, where the model learns to identify subjects along with diagnostic features. Our findings, consistent with prior research, underscore the importance of subject-wise splitting and evaluating our model further on hold-out data from different subjects to ensure the integrity and reliability of deep learning models in medical image analysis.
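
The paper's central recommendation, subject-wise splitting, can be enforced with standard tooling. A minimal sketch with scikit-learn's `GroupShuffleSplit`, using synthetic data in place of real MRI scans:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy longitudinal dataset: 3 repeated scans for each of 10 subjects.
subjects = np.repeat(np.arange(10), 3)          # group label per scan
X = np.random.rand(len(subjects), 8)            # stand-in for MRI features
y = np.random.randint(0, 2, len(subjects))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

# No subject contributes scans to both sides, so repeated scans cannot leak.
assert not set(subjects[train_idx]) & set(subjects[test_idx])
print(sorted(set(subjects[test_idx])))          # held-out subject IDs
```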

RigNet++: Efficient Repetitive Image Guided Network for Depth Completion

  • paper_url: http://arxiv.org/abs/2309.00655
  • repo_url: None
  • paper_authors: Zhiqiang Yan, Xiang Li, Zhenyu Zhang, Jun Li, Jian Yang
  • for: Improving depth completion accuracy through an efficient repetitive design in an image-guided learning framework.
  • methods: The framework contains an image guidance branch and a depth generation branch. In the former, a dense repetitive hourglass network extracts discriminative features of complex environments, providing strong contextual guidance for depth prediction; in the latter, a repetitive guidance module based on dynamic convolution uses an efficient convolution factorization to model high-frequency structures at reduced complexity.
  • results: Extensive experiments on the KITTI, VKITTI, NYUv2, 3D60, and Matterport3D datasets show superior or competitive results.
    Abstract Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth methods primarily focus on image guided learning frameworks. However, blurry guidance in the image and unclear structure in the depth still impede their performance. To tackle these challenges, we explore an efficient repetitive design in our image guided network to gradually and sufficiently recover depth values. Specifically, the efficient repetition is embodied in both the image guidance branch and depth generation branch. In the former branch, we design a dense repetitive hourglass network to extract discriminative image features of complex environments, which can provide powerful contextual instruction for depth prediction. In the latter branch, we introduce a repetitive guidance module based on dynamic convolution, in which an efficient convolution factorization is proposed to reduce the complexity while modeling high-frequency structures progressively. Extensive experiments indicate that our approach achieves superior or competitive results on KITTI, VKITTI, NYUv2, 3D60, and Matterport3D datasets.

MuraNet: Multi-task Floor Plan Recognition with Relation Attention

  • paper_url: http://arxiv.org/abs/2309.00348
  • repo_url: None
  • paper_authors: Lingxiao Huang, Jung-Hsuan Wu, Chiching Wei, Wilson Li
  • for: floor plan data recognition
  • methods: an attention-based multi-task model (MuraNet) with a unified encoder (MURA) and separated branches for the segmentation and detection tasks
  • results: improved convergence speed and performance in detection and segmentation compared to single-task models such as U-Net and YOLOv3
    Abstract The recognition of information in floor plan data requires the use of detection and segmentation models. However, relying on several single-task models can result in ineffective utilization of relevant information when there are multiple tasks present simultaneously. To address this challenge, we introduce MuraNet, an attention-based multi-task model for segmentation and detection tasks in floor plan data. In MuraNet, we adopt a unified encoder called MURA as the backbone with two separated branches: an enhanced segmentation decoder branch and a decoupled detection head branch based on YOLOX, for segmentation and detection tasks respectively. The architecture of MuraNet is designed to leverage the fact that walls, doors, and windows usually constitute the primary structure of a floor plan's architecture. By jointly training the model on both detection and segmentation tasks, we believe MuraNet can effectively extract and utilize relevant features for both tasks. Our experiments on the CubiCasa5k public dataset show that MuraNet improves convergence speed during training compared to single-task models like U-Net and YOLOv3. Moreover, we observe improvements in the average AP and IoU in detection and segmentation tasks, respectively.Our ablation experiments demonstrate that the attention-based unified backbone of MuraNet achieves better feature extraction in floor plan recognition tasks, and the use of decoupled multi-head branches for different tasks further improves model performance. We believe that our proposed MuraNet model can address the disadvantages of single-task models and improve the accuracy and efficiency of floor plan data recognition.
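
As a rough structural analogue (not MuraNet's actual MURA encoder, enhanced segmentation decoder, or YOLOX-based head), a shared encoder feeding decoupled segmentation and detection heads looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class TwoTaskNet(nn.Module):
    """Shared encoder with decoupled heads: a dense head for wall/room
    segmentation and a detection-style head for doors and windows."""
    def __init__(self, n_classes: int = 3, n_anchors: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(64, n_classes, 1)      # per-pixel class logits
        self.det_head = nn.Conv2d(64, n_anchors * 5, 1)  # box (4) + objectness (1)

    def forward(self, x):
        f = self.encoder(x)                # features shared by both tasks
        return self.seg_head(f), self.det_head(f)

net = TwoTaskNet()
seg, det = net(torch.randn(1, 3, 128, 128))
print(seg.shape, det.shape)                # joint training sums both task losses
```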

Towards Contrastive Learning in Music Video Domain

  • paper_url: http://arxiv.org/abs/2309.00347
  • repo_url: None
  • paper_authors: Karel Veldkamp, Mariya Hendriksen, Zoltán Szlávik, Alexander Keijser
  • for: Investigating whether contrastive learning carries over to the music video domain, evaluated on the downstream tasks of music tagging and genre classification.
  • methods: A dual encoder for the audio and video modalities, trained with a bidirectional contrastive loss to learn multimodal representations.
  • results: Pre-trained networks without contrastive fine-tuning outperform the contrastive approach on both downstream tasks; a qualitative analysis of the learned representations explains why contrastive learning struggles to unite embeddings from the two modalities.
    Abstract Contrastive learning is a powerful way of learning multimodal representations across various domains such as image-caption retrieval and audio-visual representation learning. In this work, we investigate if these findings generalize to the domain of music videos. Specifically, we create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss. For the experiments, we use an industry dataset containing 550 000 music videos as well as the public Million Song Dataset, and evaluate the quality of learned representations on the downstream tasks of music tagging and genre classification. Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks. To gain a better understanding of the reasons contrastive learning was not successful for music videos, we perform a qualitative analysis of the learned representations, revealing why contrastive learning might have difficulties uniting embeddings from two modalities. Based on these findings, we outline possible directions for future work. To facilitate the reproducibility of our results, we share our code and the pre-trained model.
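
The bidirectional contrastive loss used for the dual encoder is the standard symmetric InfoNCE objective. A sketch, with randomly generated embeddings standing in for the audio and video encoder outputs:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/video clips:
    each clip's audio must match its own video, and vice versa."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = bidirectional_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```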

Robust Point Cloud Processing through Positional Embedding

  • paper_url: http://arxiv.org/abs/2309.00339
  • repo_url: https://github.com/osiriszjq/RobustPPE
  • paper_authors: Jianqiao Zheng, Xueqian Li, Sameera Ramasinghe, Simon Lucey
  • for: Robust 3D point cloud processing via analytical per-point embeddings, improving resilience to out-of-distribution (OOD) noise and outliers.
  • methods: An analytical per-point embedding based on the criterion of bandwidth, connected to positional embeddings, in particular random Fourier features (RFF).
  • results: Compelling robustness on downstream tasks such as point cloud classification and registration under several categories of OOD noise.
    Abstract End-to-end trained per-point embeddings are an essential ingredient of any state-of-the-art 3D point cloud processing such as detection or alignment. Methods like PointNet, or the more recent point cloud transformer -- and its variants -- all employ learned per-point embeddings. Despite impressive performance, such approaches are sensitive to out-of-distribution (OOD) noise and outliers. In this paper, we explore the role of an analytical per-point embedding based on the criterion of bandwidth. The concept of bandwidth enables us to draw connections with an alternate per-point embedding -- positional embedding, particularly random Fourier features. We present compelling robust results across downstream tasks such as point cloud classification and registration with several categories of OOD noise.
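
The positional embedding the paper connects to, random Fourier features, has a compact analytical form. A NumPy sketch, where the `bandwidth` parameter plays the role the paper highlights (the exact embedding used in the paper may differ):

```python
import numpy as np

def random_fourier_embedding(points, n_features=64, bandwidth=1.0, seed=0):
    """Analytical per-point embedding: project xyz coordinates onto random
    frequencies whose scale is set by a bandwidth, then take sin/cos."""
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=bandwidth, size=(points.shape[-1], n_features))
    proj = 2 * np.pi * points @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

cloud = np.random.rand(1024, 3)               # a toy point cloud
emb = random_fourier_embedding(cloud)
print(emb.shape)                              # (1024, 128)
```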

Human trajectory prediction using LSTM with Attention mechanism

  • paper_url: http://arxiv.org/abs/2309.00331
  • repo_url: None
  • paper_authors: Amin Manafi Soltan Ahmadi, Samaneh Hoseini Semnani
  • for: A human trajectory prediction model that combines a Long Short-Term Memory (LSTM) network with an attention mechanism.
  • methods: Attention scores are computed for each input feature (the target's position and velocity, and the positions and velocities of neighboring individuals), with higher scores marking greater significance for the prediction; the extracted scores are processed by a new neural layer and concatenated with positional information before trajectory prediction.
  • results: On the public ETH and UCY datasets, measured by final displacement error (FDE) and average displacement error (ADE), the modified algorithm improves over Social LSTM by 6.2% in ADE and 6.3% in FDE for pedestrians in crowded spaces.
    Abstract In this paper, we propose a human trajectory prediction model that combines a Long Short-Term Memory (LSTM) network with an attention mechanism. To do that, we use attention scores to determine which parts of the input data the model should focus on when making predictions. Attention scores are calculated for each input feature, with a higher score indicating the greater significance of that feature in predicting the output. Initially, these scores are determined for the target human position, velocity, and their neighboring individual's positions and velocities. By using attention scores, our model can prioritize the most relevant information in the input data and make more accurate predictions. We extract attention scores from our attention mechanism and integrate them into the trajectory prediction module to predict human future trajectories. To achieve this, we introduce a new neural layer that processes attention scores after extracting them and concatenates them with positional information. We evaluate our approach on the publicly available ETH and UCY datasets and measure its performance using the final displacement error (FDE) and average displacement error (ADE) metrics. We show that our modified algorithm performs better than the Social LSTM in predicting the future trajectory of pedestrians in crowded spaces. Specifically, our model achieves an improvement of 6.2% in ADE and 6.3% in FDE compared to the Social LSTM results in the literature.
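
A minimal, hypothetical version of an attention-weighted LSTM trajectory predictor is shown below; the paper additionally scores neighbors' positions and velocities and concatenates the processed attention scores with positional information, which this sketch omits.

```python
import torch
import torch.nn as nn

class AttnLSTMPredictor(nn.Module):
    """LSTM over an observed trajectory; attention scores weight the
    hidden states before predicting the next (x, y) displacement."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)       # one attention score per step
        self.head = nn.Linear(hidden, 2)

    def forward(self, traj):                    # traj: (B, T, 2)
        h, _ = self.lstm(traj)                  # (B, T, hidden)
        w = torch.softmax(self.score(h), dim=1) # attention over time steps
        context = (w * h).sum(dim=1)            # attention-weighted summary
        return self.head(context)               # predicted next displacement

pred = AttnLSTMPredictor()(torch.randn(4, 8, 2))
print(pred.shape)                               # torch.Size([4, 2])
```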

ARFA: An Asymmetric Receptive Field Autoencoder Model for Spatiotemporal Prediction

  • paper_url: http://arxiv.org/abs/2309.00314
  • repo_url: None
  • paper_authors: Wenxuan Zhang, Xuechao Zou, Li Wu, Jianqiang Huang, Xiaoying Wang
  • for: Spatiotemporal prediction, i.e., generating future sequences from historical context, with applications such as traffic flow prediction and weather forecasting.
  • methods: An Asymmetric Receptive Field Autoencoder (ARFA) whose receptive field sizes are tailored to each component's role: a large-kernel module in the encoder for global spatiotemporal feature extraction, and a small-kernel module in the decoder for local spatiotemporal reconstruction.
  • results: Consistent state-of-the-art performance on two mainstream spatiotemporal prediction datasets and on RainBench, a newly constructed large-scale radar echo dataset for precipitation prediction over inland regions of China.
    Abstract Spatiotemporal prediction aims to generate future sequences by paradigms learned from historical contexts. It holds significant importance in numerous domains, including traffic flow prediction and weather forecasting. However, existing methods face challenges in handling spatiotemporal correlations, as they commonly adopt encoder and decoder architectures with identical receptive fields, which adversely affects prediction accuracy. This paper proposes an Asymmetric Receptive Field Autoencoder (ARFA) model to address this issue. Specifically, we design corresponding sizes of receptive field modules tailored to the distinct functionalities of the encoder and decoder. In the encoder, we introduce a large kernel module for global spatiotemporal feature extraction. In the decoder, we develop a small kernel module for local spatiotemporal information reconstruction. To address the scarcity of meteorological prediction data, we constructed the RainBench, a large-scale radar echo dataset specific to the unique precipitation characteristics of inland regions in China for precipitation prediction. Experimental results demonstrate that ARFA achieves consistent state-of-the-art performance on two mainstream spatiotemporal prediction datasets and our RainBench dataset, affirming the effectiveness of our approach. This work not only explores a novel method from the perspective of receptive fields but also provides data support for precipitation prediction, thereby advancing future research in spatiotemporal prediction.
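
The asymmetry in receptive fields is easy to express: large kernels in the encoder, small kernels in the decoder. A toy single-frame sketch under that assumption (ARFA itself operates on spatiotemporal sequences, and its module design is more elaborate):

```python
import torch
import torch.nn as nn

class AsymmetricAE(nn.Module):
    """Encoder uses large kernels for global context; decoder uses small
    kernels to reconstruct local detail, mirroring ARFA's asymmetry."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=7, padding=3), nn.ReLU(),   # large RF
            nn.Conv2d(ch, ch, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),  # small RF
            nn.Conv2d(ch, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):                        # x: one radar-echo frame
        return self.decoder(self.encoder(x))

out = AsymmetricAE()(torch.randn(2, 1, 64, 64))
print(out.shape)                                 # torch.Size([2, 1, 64, 64])
```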

Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture

  • paper_url: http://arxiv.org/abs/2309.00310
  • repo_url: https://github.com/shaohua-pan/RobustCap
  • paper_authors: Shaohua Pan, Qi Ma, Xinyu Yi, Weifeng Hu, Xiong Wang, Xingkang Zhou, Jijunnan Li, Feng Xu
  • for: Real-time human motion capture that fuses monocular images with sparse IMUs.
  • methods: A dual coordinate strategy that fully exploits the IMU signals and combines information from both modalities to achieve robust motion capture.
  • results: Significantly outperforms state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation; code is available for research at https://shaohua-pan.github.io/robustcap-page/.
    Abstract Either RGB images or inertial signals have been used for the task of motion capture (mocap), but combining them together is a new and interesting topic. We believe that the combination is complementary and able to solve the inherent difficulties of using one modality input, including occlusions, extreme lighting/texture, and out-of-view for visual mocap and global drifts for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method contains a dual coordinate strategy to fully explore the IMU signals with different goals in motion capture. To be specific, besides one branch transforming the IMU signals to the camera coordinate system to combine with the image information, there is another branch to learn from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden state feedback mechanism is proposed for both branches to compensate for their own drawbacks in extreme input cases. Thus our method can easily switch between the two kinds of signals or combine them in different cases to achieve a robust mocap. Quantitative and qualitative results demonstrate that by delicately designing the fusion method, our technique significantly outperforms the state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation. Our codes are available for research at https://shaohua-pan.github.io/robustcap-page/.

Efficient Surrogate Models for Materials Science Simulations: Machine Learning-based Prediction of Microstructure Properties

  • paper_url: http://arxiv.org/abs/2309.00305
  • repo_url: None
  • paper_authors: Binh Duong Nguyen, Pavlo Potapenko, Aytekin Dermici, Kishan Govinda, Stefan Sandfeld
  • for: Determining, understanding, and predicting structure-property relations, a task central to chemistry, biology, meteorology, physics, engineering, and materials science.
  • methods: Six machine learning algorithms are developed and compared, analyzing their accuracy and robustness, the reasons for their performance differences, and the impact of injecting domain knowledge through tailored features.
  • results: Evaluation on two materials-science datasets (a two-dimensional Ising model for predicting the formation of magnetic domains, and dual-phase microstructure evolution from the Cahn-Hilliard model), with general recommendations derived from the availability and quality of training data.
    Abstract Determining, understanding, and predicting the so-called structure-property relation is an important task in many scientific disciplines, such as chemistry, biology, meteorology, physics, engineering, and materials science. Structure refers to the spatial distribution of, e.g., substances, material, or matter in general, while property is a resulting characteristic that usually depends in a non-trivial way on spatial details of the structure. Traditionally, forward simulations models have been used for such tasks. Recently, several machine learning algorithms have been applied in these scientific fields to enhance and accelerate simulation models or as surrogate models. In this work, we develop and investigate the applications of six machine learning techniques based on two different datasets from the domain of materials science: data from a two-dimensional Ising model for predicting the formation of magnetic domains and data representing the evolution of dual-phase microstructures from the Cahn-Hilliard model. We analyze the accuracy and robustness of all models and elucidate the reasons for the differences in their performances. The impact of including domain knowledge through tailored features is studied, and general recommendations based on the availability and quality of training data are derived from this.
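
As a generic illustration of the surrogate-model setting (not the paper's six algorithms, its tailored features, or its Ising/Cahn-Hilliard data), fitting a regressor that maps hand-crafted microstructure descriptors to a property looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy structure-property task: predict a scalar property from
# synthetic microstructure descriptors (e.g. phase fractions).
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0, 1.0]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))     # R^2 of the surrogate on held-out data
```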

Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

  • paper_url: http://arxiv.org/abs/2309.00297
  • repo_url: None
  • paper_authors: Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, Qijun Chen
  • for: A Fine-grained Motion Alignment (FIMA) framework for contrastive video representation learning that introduces well-aligned, significant motion information.
  • methods: Frame difference serves as the motion source; a dense spatiotemporal contrastive framework generates pixel-level motion supervision, a motion decoder and foreground sampling strategy eliminate weak alignments in time and space, and a frame-level motion contrastive loss improves the temporal diversity of the motion features.
  • results: Representations learned by FIMA show strong motion awareness and achieve state-of-the-art or competitive results on UCF101, HMDB51, and Diving48.
    Abstract As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a \textbf{Fi}ne-grained \textbf{M}otion \textbf{A}lignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at \url{https://github.com/ZMHH-H/FIMA}.
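
The motion source here, frame difference, is cheap to compute. A one-function sketch on a random clip tensor:

```python
import torch

def frame_difference(clip: torch.Tensor) -> torch.Tensor:
    """Cheap pixel-level motion signal: absolute difference between
    consecutive frames of a video clip shaped (B, T, C, H, W)."""
    return (clip[:, 1:] - clip[:, :-1]).abs()

motion = frame_difference(torch.rand(2, 8, 3, 64, 64))
print(motion.shape)   # torch.Size([2, 7, 3, 64, 64])
```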

Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution

  • paper_url: http://arxiv.org/abs/2309.00287
  • repo_url: None
  • paper_authors: Charles Laroche, Andrés Almansa, Eva Coupete
  • for: solving the inverse problem of blind image deblurring
  • methods: uses diffusion models and an Expectation-Maximization (EM) estimation method with blur kernel regularization
  • results: provides effective and fast results compared to other state-of-the-art approaches in blind image deblurring
    Abstract Using diffusion models to solve inverse problems is a growing field of research. Current methods assume the degradation to be known and provide impressive results in terms of restoration quality and diversity. In this work, we leverage the efficiency of those models to jointly estimate the restored image and unknown parameters of the degradation model. In particular, we designed an algorithm based on the well-known Expectation-Maximization (EM) estimation method and diffusion models. Our method alternates between approximating the expected log-likelihood of the inverse problem using samples drawn from a diffusion model and a maximization step to estimate unknown model parameters. For the maximization step, we also introduce a novel blur kernel regularization based on a Plug \& Play denoiser. Diffusion models are slow to run, so we provide a fast version of our algorithm. Extensive experiments on blind image deblurring demonstrate the effectiveness of our method when compared to other state-of-the-art approaches.

SparseSat-NeRF: Dense Depth Supervised Neural Radiance Fields for Sparse Satellite Images

  • paper_url: http://arxiv.org/abs/2309.00277
  • repo_url: https://github.com/lulinzhang/sps-nerf
  • paper_authors: Lulin Zhang, Ewelina Rupnik
  • for: Recovering digital surface models from sparse satellite images, including challenging cases (non-Lambertian surfaces, asynchronous acquisitions, discontinuities) where traditional multi-view stereo matching performs poorly.
  • methods: Neural radiance fields (NeRF), which are self-supervised, need no ground-truth geometry, and can embed physical scene parameters; SpS-NeRF extends Sat-NeRF with dense depth supervision guided by a cross-correlation similarity metric from traditional semi-global MVS matching.
  • results: On stereo and tri-stereo Pleiades 1B/WorldView-3 images, SpS-NeRF recovers scene geometry better than NeRF and Sat-NeRF.
    Abstract Digital surface model generation using traditional multi-view stereo matching (MVS) performs poorly over non-Lambertian surfaces, with asynchronous acquisitions, or at discontinuities. Neural radiance fields (NeRF) offer a new paradigm for reconstructing surface geometries using continuous volumetric representation. NeRF is self-supervised, does not require ground truth geometry for training, and provides an elegant way to include in its representation physical parameters about the scene, thus potentially remedying the challenging scenarios where MVS fails. However, NeRF and its variants require many views to produce convincing scene's geometries which in earth observation satellite imaging is rare. In this paper we present SparseSat-NeRF (SpS-NeRF) - an extension of Sat-NeRF adapted to sparse satellite views. SpS-NeRF employs dense depth supervision guided by crosscorrelation similarity metric provided by traditional semi-global MVS matching. We demonstrate the effectiveness of our approach on stereo and tri-stereo Pleiades 1B/WorldView-3 images, and compare against NeRF and Sat-NeRF. The code is available at https://github.com/LulinZhang/SpS-NeRF
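
Dense depth supervision for a NeRF typically compares the volume-rendered ray depth against guidance depth. A hypothetical sketch of such a loss; the exact SpS-NeRF formulation, including how its cross-correlation similarity metric weights the guidance, may differ.

```python
import torch

def rendered_depth(weights, z_vals):
    """Volume-rendering depth: expectation of sample depths under the
    per-ray compositing weights produced by a NeRF."""
    return (weights * z_vals).sum(dim=-1)

def depth_supervision_loss(weights, z_vals, ref_depth, mask):
    """L1 between rendered depth and dense guidance depth (e.g. from
    semi-global matching), applied only where guidance is valid."""
    d = rendered_depth(weights, z_vals)
    return (mask * (d - ref_depth).abs()).sum() / mask.sum().clamp(min=1)

w = torch.rand(32, 64)
w = w / w.sum(-1, keepdim=True)                 # toy per-ray weights
z = torch.linspace(2.0, 6.0, 64).expand(32, 64) # sample depths along each ray
loss = depth_supervision_loss(w, z, ref_depth=torch.full((32,), 4.0),
                              mask=torch.ones(32))
print(loss.item())
```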

Application of Machine Learning in Melanoma Detection and the Identification of ‘Ugly Duckling’ and Suspicious Naevi: A Review

  • paper_url: http://arxiv.org/abs/2309.00265
  • repo_url: None
  • paper_authors: Fatima Al Zegair, Nathasha Naranpanawa, Brigid Betz-Stablein, Monika Janda, H. Peter Soyer, Shekhar S. Chandra
  • for: Improving the accuracy and accessibility of skin cancer diagnosis while addressing the shortage of specialist dermatologists.
  • methods: A review of machine learning and deep learning techniques, including convolutional neural networks, for early melanoma detection and for identifying 'ugly duckling' and suspicious naevi.
  • results: Reported ML/DL systems reach diagnostic accuracy comparable to specialists, with potential life-saving benefits and cost reductions within healthcare.
    Abstract Skin lesions known as naevi exhibit diverse characteristics such as size, shape, and colouration. The concept of an "Ugly Duckling Naevus" comes into play when monitoring for melanoma, referring to a lesion with distinctive features that sets it apart from other lesions in the vicinity. As lesions within the same individual typically share similarities and follow a predictable pattern, an ugly duckling naevus stands out as unusual and may indicate the presence of a cancerous melanoma. Computer-aided diagnosis (CAD) has become a significant player in the research and development field, as it combines machine learning techniques with a variety of patient analysis methods. Its aim is to increase accuracy and simplify decision-making, all while responding to the shortage of specialized professionals. These automated systems are especially important in skin cancer diagnosis where specialist availability is limited. As a result, their use could lead to life-saving benefits and cost reductions within healthcare. Given the drastic change in survival when comparing early stage to late-stage melanoma, early detection is vital for effective treatment and patient outcomes. Machine learning (ML) and deep learning (DL) techniques have gained popularity in skin cancer classification, effectively addressing challenges, and providing results equivalent to that of specialists. This article extensively covers modern Machine Learning and Deep Learning algorithms for detecting melanoma and suspicious naevi. It begins with general information on skin cancer and different types of naevi, then introduces AI, ML, DL, and CAD. The article then discusses the successful applications of various ML techniques like convolutional neural networks (CNN) for melanoma detection compared to dermatologists' performance. Lastly, it examines ML methods for UD naevus detection and identifying suspicious naevi.

Interpretable Medical Imagery Diagnosis with Self-Attentive Transformers: A Review of Explainable AI for Health Care

  • paper_url: http://arxiv.org/abs/2309.00252
  • repo_url: None
  • paper_authors: Tin Lai
  • for: Reviewing the adoption of recent AI advances in primary medical services, in particular Vision Transformer (ViT) models, and how Explainable AI (XAI) methods reveal the basis of their decisions.
  • methods: A survey of recent ViT architectures, especially their self-attention modules, and of interpretative XAI approaches applied to medical diagnosis.
  • results: A synthesis of recent ViT and XAI techniques that enables transparent medical diagnosis, helping clinicians understand model decisions and guiding the healthcare decision-making process.
    Abstract Recent advancements in artificial intelligence (AI) have facilitated its widespread adoption in primary medical services, addressing the demand-supply imbalance in healthcare. Vision Transformers (ViT) have emerged as state-of-the-art computer vision models, benefiting from self-attention modules. However, compared to traditional machine-learning approaches, deep-learning models are complex and are often treated as a "black box" that can cause uncertainty regarding how they operate. Explainable Artificial Intelligence (XAI) refers to methods that explain and interpret machine learning models' inner workings and how they come to decisions, which is especially important in the medical domain to guide the healthcare decision-making process. This review summarises recent ViT advancements and interpretative approaches to understanding the decision-making process of ViT, enabling transparency in medical diagnosis applications.

Object-Centric Multiple Object Tracking

  • paper_url: http://arxiv.org/abs/2309.00233
  • repo_url: https://github.com/amazon-science/object-centric-multiple-object-tracking
  • paper_authors: Zixu Zhao, Jiaze Wang, Max Horn, Yizhuo Ding, Tong He, Zechen Bai, Dominik Zietlow, Carl-Johann Simon-Gabriel, Bing Shuai, Zhuowen Tu, Thomas Brox, Bernt Schiele, Yanwei Fu, Francesco Locatello, Zheng Zhang, Tianjun Xiao
  • for: Reducing the annotation burden of multiple-object tracking (MOT) pipelines by building on unsupervised object-centric learning.
  • methods: A video object-centric model with an index-merge module that adapts object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions; association relies on a self-supervised Expectation-Maximization-inspired loss rather than ID labels.
  • results: Requires only sparse detection labels (0%-6.25%) and no ID labels, significantly narrows the gap to the fully supervised state of the art, and outperforms several unsupervised trackers.
    Abstract Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefited from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers.

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

  • paper_url: http://arxiv.org/abs/2309.00227
  • repo_url: None
  • paper_authors: Jincheng Li, Chunyu Xie, Xiaoyu Wu, Bin Wang, Dawei Leng
  • for: Open-vocabulary detection (OVD): localizing and recognizing unseen objects defined by an unbounded vocabulary, which traditional detectors trained on pre-defined categories cannot handle, with classification and localization studied in parallel for novel categories.
  • methods: Pre-trained cross-modal VLMs (e.g., CLIP, ALIGN) combined with three families of designs: a vanilla approach that crops a localizer's bounding box and resizes it into CLIP; a two-stage detector with decoupled RPN and RoI head (DRR) that uses RoIAlign to avoid resizing objects; and a coupled RPN and RoI head (CRR) that cuts training time and model parameters.
  • results: On OVD-COCO, DRR obtains the best performance with 35.8 Novel AP$_{50}$, an absolute gain of 2.8 over the previous state of the art (SOTA); on OVD-LVIS, DRR surpasses the previous SOTA by 1.9 AP$_{50}$ on rare categories. An object detection dataset named PID is also provided, together with a baseline.
    Abstract Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary. This is challenging since traditional detectors can only learn from pre-defined categories and thus fail to detect and localize objects out of pre-defined vocabulary. To handle the challenge, OVD leverages pre-trained cross-modal VLM, such as CLIP, ALIGN, etc. Previous works mainly focus on the open vocabulary classification part, with less attention on the localization part. We argue that for a good OVD detector, both classification and localization should be studied in parallel for the novel object categories. We show in this work that improving localization as well as cross-modal classification complement each other, and compose a good OVD detector jointly. We analyze three families of OVD methods with different design emphases. We first propose a vanilla method, i.e., cropping a bounding box obtained by a localizer and resizing it into the CLIP. We next introduce another approach, which combines a standard two-stage object detector with CLIP. A two-stage object detector includes a visual backbone, a region proposal network (RPN), and a region of interest (RoI) head. We decouple RPN and ROI head (DRR) and use RoIAlign to extract meaningful features. In this case, it avoids resizing objects. To further accelerate the training time and reduce the model parameters, we couple RPN and ROI head (CRR) as the third approach. We conduct extensive experiments on these three types of approaches in different settings. On the OVD-COCO benchmark, DRR obtains the best performance and achieves 35.8 Novel AP$_{50}$, an absolute 2.8 gain over the previous state-of-the-art (SOTA). For OVD-LVIS, DRR surpasses the previous SOTA by 1.9 AP$_{50}$ in rare categories. We also provide an object detection dataset called PID and provide a baseline on PID.
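
The DRR design avoids cropping-and-resizing in pixel space by pooling region features with RoIAlign. A sketch with `torchvision.ops.roi_align` on a toy feature map (the boxes and feature sizes are illustrative):

```python
import torch
from torchvision.ops import roi_align

# Toy backbone feature map (stride-8 features of a 256x256 image)
# and two region proposals in image coordinates (x1, y1, x2, y2).
feats = torch.randn(1, 256, 32, 32)
boxes = [torch.tensor([[16., 16., 112., 112.],
                       [64., 64., 200., 224.]])]

# RoIAlign pools each proposal to a fixed grid directly from the feature
# map (spatial_scale maps image coords to feature coords), so the object
# itself is never resized in pixel space.
region_feats = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=1 / 8)
print(region_feats.shape)   # torch.Size([2, 256, 7, 7])
```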

Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation

  • paper_url: http://arxiv.org/abs/2309.00216
  • repo_url: https://github.com/aiart-hdu/hida
  • paper_authors: Fei Gao, Yifan Zhu, Chang Jiang, Nannan Wang
  • for: Facial sketch synthesis (FSS): generating vivid, high-quality sketch portraits from facial photos, drawing inspiration from how human artists work.
  • methods: A Human-Inspired Dynamic Adaptation (HIDA) approach that dynamically modulates neuron activations based jointly on facial 3D geometry (e.g., a depth map) and 2D appearance, under globally consistent style control, with deformable convolutions at coarse scales to align deep features and produce abstract, distinct outlines.
  • results: Generates high-quality sketches in multiple styles on a wide range of challenging faces, allows precise style control, and generalizes well to natural scenes and other artistic styles.
    Abstract Facial sketch synthesis (FSS) aims to generate a vivid sketch portrait from a given facial photo. Existing FSS methods merely rely on 2D representations of facial semantic or appearance. However, professional human artists usually use outlines or shadings to convey 3D geometry. Thus facial 3D geometry (e.g. depth map) is extremely important for FSS. Besides, different artists may use diverse drawing techniques and create multiple styles of sketches; but the style is globally consistent in a sketch. Inspired by such observations, in this paper, we propose a novel Human-Inspired Dynamic Adaptation (HIDA) method. Specifically, we propose to dynamically modulate neuron activations based on a joint consideration of both facial 3D geometry and 2D appearance, as well as globally consistent style control. Besides, we use deformable convolutions at coarse-scales to align deep features, for generating abstract and distinct outlines. Experiments show that HIDA can generate high-quality sketches in multiple styles, and significantly outperforms previous methods, over a large range of challenging faces. Besides, HIDA allows precise style control of the synthesized sketch, and generalizes well to natural scenes and other artistic styles. Our code and results have been released online at: https://github.com/AiArt-HDU/HIDA.

DARC: Distribution-Aware Re-Coloring Model for Generalizable Nucleus Segmentation

  • paper_url: http://arxiv.org/abs/2309.00188
  • repo_url: None
  • paper_authors: Shengcong Chen, Changxing Ding, Dacheng Tao, Hao Chen
  • for: Generalizable nucleus segmentation: training a segmentation model that stays accurate across domain gaps between source and target images.
  • methods: A Distribution-Aware Re-Coloring (DARC) model that tackles domain gaps from two angles: a re-coloring method that relieves dramatic color variation across domains, and a new instance normalization method robust to variation in foreground-background ratios.
  • results: Extensive experiments on two H$\&$E-stained datasets (CoNSeP, CPM17) and two IHC-stained datasets (DeepLIIF, BC-DeepLIIF) justify the effectiveness of DARC; code at \url{https://github.com/csccsccsccsc/DARC}.
    Abstract Nucleus segmentation is usually the first step in pathological image analysis tasks. Generalizable nucleus segmentation refers to the problem of training a segmentation model that is robust to domain gaps between the source and target domains. The domain gaps are usually believed to be caused by the varied image acquisition conditions, e.g., different scanners, tissues, or staining protocols. In this paper, we argue that domain gaps can also be caused by different foreground (nucleus)-background ratios, as this ratio significantly affects feature statistics that are critical to normalization layers. We propose a Distribution-Aware Re-Coloring (DARC) model that handles the above challenges from two perspectives. First, we introduce a re-coloring method that relieves dramatic image color variations between different domains. Second, we propose a new instance normalization method that is robust to the variation in foreground-background ratios. We evaluate the proposed methods on two H$\&$E stained image datasets, named CoNSeP and CPM17, and two IHC stained image datasets, called DeepLIIF and BC-DeepLIIF. Extensive experimental results justify the effectiveness of our proposed DARC model. Codes are available at \url{https://github.com/csccsccsccsc/DARC}
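
DARC's re-coloring is its own distribution-aware method; as a generic stand-in for the idea of pulling a source image's colors toward a reference domain before segmentation, classical histogram matching works as follows:

```python
import numpy as np
from skimage.exposure import match_histograms

# Toy source/reference patches standing in for images from two domains.
src = np.random.rand(64, 64, 3)
ref = np.random.rand(64, 64, 3) * 0.5 + 0.25

# Histogram matching pulls each color channel of the source toward the
# reference domain's distribution (a simple proxy for re-coloring).
recolored = match_histograms(src, ref, channel_axis=-1)
print(recolored.shape)   # (64, 64, 3)
```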

Vision-aided nonlinear control framework for shake table tests

  • paper_url: http://arxiv.org/abs/2309.00187
  • repo_url: None
  • paper_authors: Zhongwei Chen, T. Y. Yang, Yifei Xiao, Xiao Pan, Wanyan Yang
  • for: This study applies adaptive control theory to shake table control, accounting for the Control-Structural Interaction (CSI) effect and the inherent nonlinearity of the shake table system.
  • methods: A nonlinear adaptive control law, in which the unknown specimen mass is replaced by an online estimate, is implemented through a loop-shaping controller; linear controllers such as the Proportional-Integral-Derivative (PID) controller serve as the point of comparison.
  • results: Simulations and experiments show that the proposed control framework can effectively control the shake table and reduce the vibration of the structure under earthquake excitations.
    Abstract The structural response under earthquake excitations can be simulated by scaled-down or full-scale model shake table tests. In this paper, adaptive control theory is used as a nonlinear shake table control algorithm that considers the inherent nonlinearity of the shake table system and the Control-Structural Interaction (CSI) effect, which linear controllers such as the Proportional-Integral-Derivative (PID) controller cannot account for. The mass of the specimen is treated as an unknown variation, and the unknown parameter is replaced by an estimated value in the proposed control framework. The signal generated by the control law of the adaptive control method is implemented by a loop-shaping controller. To verify the stability and feasibility of the proposed control framework, simulations of a bare shake table and experiments with a bare shake table carrying a two-story frame were carried out. This study randomly selects earthquake recordings from the Pacific Earthquake Engineering Research Center (PEER) database. The simulation and experimental results show that the proposed control framework can be effectively used in shake table control.
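As a toy illustration of the certainty-equivalence idea in the abstract — the unknown specimen mass replaced by an online estimate inside the control law — consider a 1-DOF mass tracking a reference motion. The dynamics, gains, and gradient-style mass estimator below are illustrative assumptions, not the paper's loop-shaping design:

```python
import numpy as np

def run_adaptive_tracking(m_true=2.0, m_hat=1.0, gamma=10.0,
                          kp=400.0, kd=40.0, dt=1e-3, t_end=5.0):
    """Toy 1-DOF plant m*x'' = u with unknown mass m."""
    n = int(t_end / dt)
    t = np.arange(n) * dt
    xr = 0.01 * np.sin(2 * np.pi * t)                 # reference displacement
    vr = 0.02 * np.pi * np.cos(2 * np.pi * t)         # its velocity
    ar = -0.04 * np.pi ** 2 * np.sin(2 * np.pi * t)   # its acceleration
    x = v = e = 0.0
    for k in range(n):
        e, edot = xr[k] - x, vr[k] - v
        u = m_hat * (ar[k] + kd * edot + kp * e)      # control law uses the estimate
        a = u / m_true                                # true (unknown) plant responds
        m_hat += gamma * (u - m_hat * a) * a * dt     # gradient step on (u - m_hat*a)^2
        v += a * dt
        x += v * dt
    return m_hat, abs(e)

print(run_adaptive_tracking())   # m_hat drifts toward m_true, error shrinks
```

The estimator is a plain gradient descent on the squared force-prediction residual, so the mass estimate converges toward the true value as long as the reference keeps the system excited.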

cs.AI - 2023-09-01

Efficient RLHF: Reducing the Memory Usage of PPO

  • paper_url: http://arxiv.org/abs/2309.00754
  • repo_url: None
  • paper_authors: Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
  • for: This paper addresses the memory cost of the PPO stage in RLHF so that more practitioners can apply RLHF to language modeling.
  • methods: A suite of memory-saving techniques for PPO is analyzed in terms of memory usage, performance, and training time; Hydra-RLHF first integrates the SFT and reward models and then dynamically turns LoRA "off" during training.
  • results: Using LoRA during PPO reduces its memory usage below that of SFT while improving alignment on four public benchmarks, and Hydra-PPO reduces the per-sample latency of LoRA-PPO by up to 65% without hurting performance, making Hydra-PPO a simple and promising route to wider RLHF adoption.
    Abstract Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible to use for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. Using LoRA during PPO reduces its memory usage to be smaller than SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
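A stand-alone sketch of the mechanism the abstract relies on — a LoRA adapter that can be turned "off" so the frozen base weights can serve several roles — written in plain PyTorch rather than the authors' codebase:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a switchable low-rank adapter."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # only the adapter trains
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r
        self.enabled = True

    def forward(self, x):
        y = self.base(x)
        if self.enabled:
            y = y + (x @ self.A.T @ self.B.T) * self.scale
        return y
```

Setting `layer.enabled = False` recovers the base model's output exactly, which is what allows Hydra-style training to reuse one backbone for multiple roles instead of keeping separate copies in memory.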

Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

  • paper_url: http://arxiv.org/abs/2309.00743
  • repo_url: None
  • paper_authors: Divyanshu Raj, Chitta Baral, Nakul Gopalan
  • for: The goal is to identify sub-tasks within a demonstrated robot trajectory using language instructions.
  • methods: Language provided during demonstrations guides a language-conditioned change-point detection method that maps instructions to sub-segments of a longer robot trajectory.
  • results: The method accurately identifies sub-tasks within a trajectory, improving accuracy by $1.78 \pm 0.82\%$ over a baseline approach.
    Abstract In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to map an instruction to a smaller fragment of the trajectory. Unlike previous instruction following works which directly learn the mapping from language to a policy, we propose a language-conditioned change-point detection method to identify sub-tasks in a problem. Our approach learns the relationship between constituent segments of a long language command and corresponding constituent segments of a trajectory. These constituent trajectory segments can be used to learn subtasks or sub-goals for planning or options as demonstrated by previous related work. Our insight in this work is that the language-conditioned robot change-point detection problem is similar to the existing video moment retrieval works used to identify sub-segments within online videos. Through extensive experimentation, we demonstrate a $1.78_{\pm 0.82}\%$ improvement over a baseline approach in accurately identifying sub-tasks within a trajectory using our proposed method. Moreover, we present a comprehensive study investigating sample complexity requirements on learning this mapping, between language and trajectory sub-segments, to understand if the video retrieval-based methods are realistic in real robot scenarios.
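The paper learns this mapping; purely to make the segmentation objective concrete, here is a toy dynamic-programming baseline that splits a trajectory into one contiguous segment per instruction, maximizing cosine similarity between each instruction embedding and its segment's aggregate frame embedding (embeddings are assumed given; this is not the authors' model):

```python
import numpy as np

def segment_trajectory(frame_emb, instr_emb):
    """Split T frames into K contiguous segments (K = #instructions).

    Returns a list of (start, end) spans, one per instruction, chosen to
    maximize the summed cosine similarity between each instruction and
    the (normalized) sum of its segment's frame embeddings.
    """
    f = np.asarray(frame_emb, float)
    g = np.asarray(instr_emb, float)
    f /= np.linalg.norm(f, axis=1, keepdims=True)
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    T, K = len(f), len(g)
    prefix = np.cumsum(f, axis=0)                  # prefix sums for O(1) segment sums

    def score(k, i, j):                            # instruction k vs frames i..j-1
        m = prefix[j - 1] - (prefix[i - 1] if i else 0.0)
        return float(m @ g[k]) / (np.linalg.norm(m) + 1e-9)

    dp = np.full((K + 1, T + 1), -np.inf)
    back = np.zeros((K + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                s = dp[k - 1, i] + score(k - 1, i, j)
                if s > dp[k, j]:
                    dp[k, j], back[k, j] = s, i
    spans, j = [], T                               # backtrack the change points
    for k in range(K, 0, -1):
        spans.append((int(back[k, j]), j))
        j = int(back[k, j])
    return spans[::-1]
```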

Contextual Biasing of Named-Entities with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.00723
  • repo_url: None
  • paper_authors: Chuanneng Sun, Zeeshan Ahmed, Yingyi Ma, Zhe Liu, Yutong Pang, Ozlem Kalinli
  • for: This paper studies contextual biasing for automatic speech recognition (ASR) with large language models (LLMs).
  • methods: A fine-tuning-free prompting approach supplies a biasing list and few-shot examples as additional information during second-pass rescoring. Multi-task training lets the LLM predict both the entity class and the next token, and dynamic prompting selects the most likely class and uses only that class's entities as context for next-token prediction, improving efficiency and respecting the LLM's maximum sequence length.
  • results: Biasing lists and few-shot examples achieve 17.8% and 9.6% relative WER improvement over first-pass ASR, while multi-task training and dynamic prompting achieve 20.0% and 11.3% relative WER improvement, respectively.
    Abstract This paper studies contextual biasing with Large Language Models (LLMs), where during second-pass rescoring additional contextual information is provided to a LLM to boost Automatic Speech Recognition (ASR) performance. We propose to leverage prompts for a LLM without fine tuning during rescoring which incorporate a biasing list and few-shot examples to serve as additional information when calculating the score for the hypothesis. In addition to few-shot prompt learning, we propose multi-task training of the LLM to predict both the entity class and the next token. To improve the efficiency for contextual biasing and to avoid exceeding LLMs' maximum sequence lengths, we propose dynamic prompting, where we select the most likely class using the class tag prediction, and only use entities in this class as contexts for next token prediction. Word Error Rate (WER) evaluation is performed on i) an internal calling, messaging, and dictation dataset, and ii) the SLUE-Voxpopuli dataset. Results indicate that biasing lists and few-shot examples can achieve 17.8% and 9.6% relative improvement compared to first pass ASR, and that multi-task training and dynamic prompting can achieve 20.0% and 11.3% relative WER improvement, respectively.
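A minimal sketch of the dynamic-prompting step described above: pick the most likely entity class from the class-tag prediction and include only that class's biasing entities in the rescoring prompt. The prompt layout and names here are assumptions, not the paper's templates:

```python
def build_dynamic_prompt(hypothesis, class_scores, biasing_lists, few_shot, top_k=1):
    """Keep only entities from the most likely class(es) so the prompt
    stays within the LLM's maximum sequence length.

    class_scores:  dict class -> predicted probability (from multi-task head)
    biasing_lists: dict class -> list of entity strings
    few_shot:      list of example lines prepended to the prompt
    """
    best = sorted(class_scores, key=class_scores.get, reverse=True)[:top_k]
    entities = [e for c in best for e in biasing_lists[c]]
    lines = few_shot + [
        f"Entities: {', '.join(entities)}",
        f"Hypothesis: {hypothesis}",
        "Corrected:",
    ]
    return "\n".join(lines)
```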

Amortizing Pragmatic Program Synthesis with Rankings

  • paper_url: http://arxiv.org/abs/2309.03225
  • repo_url: https://github.com/evanthebouncy/pragmatic_synthesis_ranking
  • paper_authors: Yewen Pu, Saujas Vaduguru, Priyan Vaithilingam, Elena Glassman, Daniel Fried
  • for: This paper aims to make pragmatic program synthesizers efficient enough to apply to more domains.
  • methods: Building on the rational speech acts (RSA) framework, the authors amortize the RSA algorithm with a global pragmatic ranking — a single, total ordering of all hypotheses — removing RSA's computational burden at synthesis time.
  • results: Experiments on two program synthesis domains show orders-of-magnitude speed-ups over the RSA synthesizer while still outperforming the standard, non-pragmatic synthesizer, even with multiple demonstrations.
    Abstract In program synthesis, an intelligent system takes in a set of user-generated examples and returns a program that is logically consistent with these examples. The usage of Rational Speech Acts (RSA) framework has been successful in building \emph{pragmatic} program synthesizers that return programs which -- in addition to being logically consistent -- account for the fact that a user chooses their examples informatively. However, the computational burden of running the RSA algorithm has restricted the application of pragmatic program synthesis to domains with a small number of possible programs. This work presents a novel method of amortizing the RSA algorithm by leveraging a \emph{global pragmatic ranking} -- a single, total ordering of all the hypotheses. We prove that for a pragmatic synthesizer that uses a single demonstration, our global ranking method exactly replicates RSA's ranked responses. We further empirically show that global rankings effectively approximate the full pragmatic synthesizer in an online, multi-demonstration setting. Experiments on two program synthesis domains using our pragmatic ranking method resulted in orders of magnitudes of speed ups compared to the RSA synthesizer, while outperforming the standard, non-pragmatic synthesizer.
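Once the global pragmatic ranking is in hand, synthesis reduces to scanning the ranking for the first hypothesis consistent with all demonstrations. A minimal sketch on a toy domain (not the authors' implementation):

```python
def synthesize(examples, hypotheses, ranking):
    """Return the top-ranked program consistent with every example.

    examples:   list of (input, output) pairs chosen by the user
    hypotheses: dict mapping program name -> callable(input) -> output
    ranking:    program names, best first (the global pragmatic ranking)
    """
    for name in ranking:
        prog = hypotheses[name]
        if all(prog(x) == y for x, y in examples):
            return name
    return None

# Toy domain: integer functions; (2, 4) is consistent with both "double"
# and "square", and the ranking breaks the tie pragmatically.
hyps = {"inc": lambda x: x + 1, "double": lambda x: 2 * x, "square": lambda x: x * x}
print(synthesize([(2, 4)], hyps, ranking=["double", "square", "inc"]))  # -> "double"
```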

Reinforcement Learning with Human Feedback for Realistic Traffic Simulation

  • paper_url: http://arxiv.org/abs/2309.00709
  • repo_url: None
  • paper_authors: Yulong Cao, Boris Ivanovic, Chaowei Xiao, Marco Pavone
  • for: This paper aims to enhance the realism of existing traffic models for autonomous vehicle development by incorporating human preferences through reinforcement learning.
  • methods: The proposed framework, called TrafficRLHF, uses human feedback for alignment and employs reinforcement learning with human preference to generate realistic traffic scenarios.
  • results: The framework demonstrates its proficiency in generating traffic scenarios that are well-aligned with human preferences, as corroborated by comprehensive evaluations on the nuScenes dataset.
    Abstract In light of the challenges and costs of real-world testing, autonomous vehicle developers often rely on testing in simulation for the creation of reliable systems. A key element of effective simulation is the incorporation of realistic traffic models that align with human knowledge, an aspect that has proven challenging due to the need to balance realism and diversity. This work aims to address this by developing a framework that employs reinforcement learning with human preference (RLHF) to enhance the realism of existing traffic models. This study also identifies two main challenges: capturing the nuances of human preferences on realism and the unification of diverse traffic simulation models. To tackle these issues, we propose using human feedback for alignment and employ RLHF due to its sample efficiency. We also introduce the first dataset for realism alignment in traffic modeling to support such research. Our framework, named TrafficRLHF, demonstrates its proficiency in generating realistic traffic scenarios that are well-aligned with human preferences, as corroborated by comprehensive evaluations on the nuScenes dataset.

Geometric Deep Learning: a Temperature Based Analysis of Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.00699
  • repo_url: None
  • paper_authors: M. Lapenna, F. Faglioni, F. Zanchetta, R. Fioresi
  • for: This study analyzes geometric deep learning models as thermodynamic systems, treating the weights as non-quantum, non-relativistic particles.
  • methods: Using the notion of temperature previously defined in [7], the authors study the temperature of the various layers of GCN and GAT models.
  • results: Potential future applications of the findings are discussed.
    Abstract We examine a Geometric Deep Learning model as a thermodynamic system treating the weights as non-quantum and non-relativistic particles. We employ the notion of temperature previously defined in [7] and study it in the various layers for GCN and GAT models. Potential future applications of our findings are discussed.

Jointly Exploring Client Drift and Catastrophic Forgetting in Dynamic Learning

  • paper_url: http://arxiv.org/abs/2309.00688
  • repo_url: None
  • paper_authors: Niklas Babendererde, Moritz Fuchs, Camila Gonzalez, Yuri Tolkach, Anirban Mukhopadhyay
  • for: This work jointly investigates Client Drift and Catastrophic Forgetting in Federated and Continual Learning, proposing a unified analysis framework to test how the two problems are connected.
  • methods: The framework builds a controlled test environment by perturbing a defined ratio of clients (Client Drift) and shifting all clients with a particular strength (Catastrophic Forgetting), generating a 3D landscape of their combined performance impact.
  • results: Client Drift and Catastrophic Forgetting are strongly correlated — the performance drop from a given share of shifted clients matches the drop from a corresponding shift strength, with an average Pearson rank correlation above 0.94 on CelebA and PESO — and a combination of moderate Client Drift and Catastrophic Forgetting can even improve performance (a "Generalization Bump") compared to either shift occurring alone.
    Abstract Federated and Continual Learning have emerged as potential paradigms for the robust and privacy-aware use of Deep Learning in dynamic environments. However, Client Drift and Catastrophic Forgetting are fundamental obstacles to guaranteeing consistent performance. Existing work only addresses these problems separately, which neglects the fact that the root cause behind both forms of performance deterioration is connected. We propose a unified analysis framework for building a controlled test environment for Client Drift -- by perturbing a defined ratio of clients -- and Catastrophic Forgetting -- by shifting all clients with a particular strength. Our framework further leverages this new combined analysis by generating a 3D landscape of the combined performance impact from both. We demonstrate that the performance drop through Client Drift, caused by a certain share of shifted clients, is correlated to the drop from Catastrophic Forgetting resulting from a corresponding shift strength. Correlation tests between both problems for Computer Vision (CelebA) and Medical Imaging (PESO) support this new perspective, with an average Pearson rank correlation coefficient of over 0.94. Our framework's novel ability of combined spatio-temporal shift analysis allows us to investigate how both forms of distribution shift behave in mixed scenarios, opening a new pathway for better generalization. We show that a combination of moderate Client Drift and Catastrophic Forgetting can even improve the performance of the resulting model (causing a "Generalization Bump") compared to when only one of the shifts occurs individually. We apply a simple and commonly used method from Continual Learning in the federated setting and observe this phenomenon to be reoccurring, leveraging the ability of our framework to analyze existing and novel methods for Federated and Continual Learning.
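A schematic of the controlled test environment the abstract describes: sweep the share of drifted clients against the shift strength and record performance, yielding the combined landscape. `evaluate` and `shift` are placeholders for a concrete federated training loop and distribution shift, not the authors' code:

```python
import numpy as np

def shift_landscape(clients, evaluate, shift, ratios, strengths):
    """Grid over (drift ratio, shift strength) -> performance surface.

    clients:   list of client datasets
    evaluate:  callable(list of datasets) -> scalar performance
    shift:     callable(dataset, strength) -> shifted dataset
    ratios:    fractions of clients to perturb (client-drift axis)
    strengths: shift strengths applied to those clients (forgetting axis)
    """
    surface = np.zeros((len(ratios), len(strengths)))
    for i, r in enumerate(ratios):
        n_shift = int(round(r * len(clients)))
        for j, s in enumerate(strengths):
            shifted = [shift(c, s) if k < n_shift else c
                       for k, c in enumerate(clients)]
            surface[i, j] = evaluate(shifted)      # one point of the 3D landscape
    return surface
```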

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

  • paper_url: http://arxiv.org/abs/2309.00615
  • repo_url: https://github.com/ziyuguo99/point-bind_point-llm
  • paper_authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
  • for: This paper aligns 3D point clouds with multi-modal data (2D images, language, audio, and video) to enable applications such as any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding.
  • methods: Point-Bind, a 3D multi-modality model, constructs a joint embedding space between 3D and other modalities guided by ImageBind; on top of it, Point-LLM is the first 3D large language model following 3D multi-modal instructions.
  • results: Injecting Point-Bind's semantics into pre-trained LLMs via parameter-efficient fine-tuning achieves superior 3D and multi-modal question-answering performance without requiring any 3D instruction data.
    Abstract We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. By parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction data, but exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.

Iterative Multi-granular Image Editing using Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.00613
  • repo_url: None
  • paper_authors: K J Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, Balaji Vasan Srinivasan
  • for: This work supports creative professionals in generating artistic and aesthetically pleasing visual assets through an iterative, multi-granular editing process.
  • methods: EMILIE, an Iterative Multi-granular Image Editor built on diffusion models, introduces a latent iteration strategy for iterative editing and a gradient control operation for multi-granular control over the spatial reach of edits (global, local, or anything in between).
  • results: Evaluations against recent state-of-the-art approaches adapted to this task show that EMILIE better supports iterative creative editing while offering finer control.
    Abstract Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluations against recent state-of-the-art approaches adapted to our task, to bring out the mettle of EMILIE. We hope our work will attract attention to this newly identified, pragmatic problem setting.

Curating Naturally Adversarial Datasets for Trustworthy AI in Healthcare

  • paper_url: http://arxiv.org/abs/2309.00543
  • repo_url: None
  • paper_authors: Sydney Pugh, Ivan Ruchkin, Insup Lee, James Weimer
  • for: This work improves the trustworthiness of deep learning models in time-series healthcare applications by evaluating their robustness against naturally occurring adversarial examples.
  • methods: Probabilistic labels obtained from automated weakly-supervised labeling — combining noisy, cheap-to-obtain labeling heuristics — are used to adversarially order the input data and construct a sequence of increasingly adversarial datasets.
  • results: By ordering inputs and building this sequence of datasets, the method's efficacy and statistical validity are demonstrated on six medical case studies and three non-medical case studies.
    Abstract Deep learning models have shown promising predictive accuracy for time-series healthcare applications. However, ensuring the robustness of these models is vital for building trustworthy AI systems. Existing research predominantly focuses on robustness to synthetic adversarial examples, crafted by adding imperceptible perturbations to clean input data. However, these synthetic adversarial examples do not accurately reflect the most challenging real-world scenarios, especially in the context of healthcare data. Consequently, robustness to synthetic adversarial examples may not necessarily translate to robustness against naturally occurring adversarial examples, which is highly desirable for trustworthy AI. We propose a method to curate datasets comprised of natural adversarial examples to evaluate model robustness. The method relies on probabilistic labels obtained from automated weakly-supervised labeling that combines noisy and cheap-to-obtain labeling heuristics. Based on these labels, our method adversarially orders the input data and uses this ordering to construct a sequence of increasingly adversarial datasets. Our evaluation on six medical case studies and three non-medical case studies demonstrates the efficacy and statistical validity of our approach to generating naturally adversarial datasets
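One plausible reading of the curation step, sketched below: use the confidence of the probabilistic weak labels to order examples and split them into increasingly adversarial subsets. The exact ordering and split construction used by the authors may differ:

```python
import numpy as np

def curate_adversarial_sequence(X, p_label, n_sets=5):
    """Order inputs by weak-label confidence and split them into a
    sequence of increasingly adversarial subsets.

    X:       numpy-indexable array of examples
    p_label: P(y=1 | weak supervision) per example, in [0, 1]
    """
    conf = np.abs(np.asarray(p_label) - 0.5) * 2     # 0 = ambiguous, 1 = certain
    order = np.argsort(-conf)                        # most confident first
    # Later chunks contain the least confident, i.e. most naturally
    # adversarial, examples.
    return [X[idx] for idx in np.array_split(order, n_sets)]
```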

ICDARTS: Improving the Stability and Performance of Cyclic DARTS

  • paper_url: http://arxiv.org/abs/2309.00664
  • repo_url: None
  • paper_authors: Emily Herron, Derek Rose, Steven Young
  • for: Improving the stability and generalizability of Cyclic DARTS (CDARTS).
  • methods: ICDARTS revises the training protocol to eliminate the dependency of the evaluation network weights on those of the search network, and modifies the discretization of the search network's zero operations so they can be retained in the final evaluation networks.
  • results: Ablation studies and an expanded search space (a larger operation set and alternate discretization of continuous search cells) yield networks with improved generalizability, along with a novel method for incorporating a dynamic search space into ICDARTS.
    Abstract This work introduces improvements to the stability and generalizability of Cyclic DARTS (CDARTS). CDARTS is a Differentiable Architecture Search (DARTS)-based approach to neural architecture search (NAS) that uses a cyclic feedback mechanism to train search and evaluation networks concurrently. This training protocol aims to optimize the search process by enforcing that the search and evaluation networks produce similar outputs. However, CDARTS introduces a loss function for the evaluation network that is dependent on the search network. The dissimilarity between the loss functions used by the evaluation networks during the search and retraining phases results in a search-phase evaluation network that is a sub-optimal proxy for the final evaluation network that is utilized during retraining. We present ICDARTS, a revised approach that eliminates the dependency of the evaluation network weights upon those of the search network, along with a modified process for discretizing the search network's \textit{zero} operations that allows these operations to be retained in the final evaluation networks. We pair the results of these changes with ablation studies on ICDARTS' algorithm and network template. Finally, we explore methods for expanding the search space of ICDARTS by expanding its operation set and exploring alternate methods for discretizing its continuous search cells. These experiments resulted in networks with improved generalizability and the implementation of a novel method for incorporating a dynamic search space into ICDARTS.

Learning-based NLOS Detection and Uncertainty Prediction of GNSS Observations with Transformer-Enhanced LSTM Network

  • paper_url: http://arxiv.org/abs/2309.00480
  • repo_url: https://github.com/rwth-irt/deepnlosdetection
  • paper_authors: Haoming Zhang, Zhanxin Wang, Heike Vallery
  • for: This work improves the accuracy and consistency of GNSS-based localization in transport systems, where multipath and non-line-of-sight (NLOS) receptions can corrupt observations; traditional methods may fail to classify and exclude such faulty observations, jeopardizing state estimation and system safety.
  • methods: A deep-learning-based method treats GNSS observations as a spatio-temporal modeling problem, detecting NLOS receptions and predicting pseudorange errors; unlike previous work, a transformer-like attention mechanism enhances LSTM networks to improve performance and generalization.
  • results: Trained and evaluated on labeled datasets from Hong Kong and Aachen (labeled via lidar maps), the network achieves better precision and recall than other deep-learning and classical machine-learning models, and avoids trajectory divergence in real-world vehicle localization by classifying and excluding NLOS observations; ablation studies and integration with out-of-distribution data in a state estimator are also reported.
    Abstract The global navigation satellite systems (GNSS) play a vital role in transport systems for accurate and consistent vehicle localization. However, GNSS observations can be distorted due to multipath effects and non-line-of-sight (NLOS) receptions in challenging environments such as urban canyons. In such cases, traditional methods to classify and exclude faulty GNSS observations may fail, leading to unreliable state estimation and unsafe system operations. This work proposes a Deep-Learning-based method to detect NLOS receptions and predict GNSS pseudorange errors by analyzing GNSS observations as a spatio-temporal modeling problem. Compared to previous works, we construct a transformer-like attention mechanism to enhance the long short-term memory (LSTM) networks, improving model performance and generalization. For the training and evaluation of the proposed network, we used labeled datasets from the cities of Hong Kong and Aachen. We also introduce a dataset generation process to label the GNSS observations using lidar maps. In experimental studies, we compare the proposed network with a deep-learning-based model and classical machine-learning models. Furthermore, we conduct ablation studies of our network components and integrate the NLOS detection with data out-of-distribution in a state estimator. As a result, our network presents improved precision and recall ratios compared to other models. Additionally, we show that the proposed method avoids trajectory divergence in real-world vehicle localization by classifying and excluding NLOS observations.
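A minimal PyTorch sketch of a transformer-enhanced LSTM of the kind the abstract describes, with one head for NLOS classification and one for pseudorange-error regression. The input dimension and layer sizes are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class NLOSNet(nn.Module):
    """LSTM over a GNSS feature sequence, followed by self-attention,
    predicting an NLOS logit and a pseudorange-error estimate."""

    def __init__(self, in_dim=6, hidden=64, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.nlos_head = nn.Linear(hidden, 1)   # NLOS classification logit
        self.err_head = nn.Linear(hidden, 1)    # pseudorange error regression

    def forward(self, x):                        # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        a, _ = self.attn(h, h, h)                # attention over time steps
        z = a[:, -1]                             # last-step summary
        return self.nlos_head(z), self.err_head(z)

model = NLOSNet()
nlos_logit, range_err = model(torch.randn(2, 30, 6))   # two 30-step sequences
```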

A Theoretical and Practical Framework for Evaluating Uncertainty Calibration in Object Detection

  • paper_url: http://arxiv.org/abs/2309.00464
  • repo_url: https://github.com/pedrormconde/uncertainty_calibration_object_detection
  • paper_authors: Pedro Conde, Rui L. Lopes, Cristiano Premebida
  • for: The purpose of this study is to propose a new theoretical and practical framework for evaluating object detection systems in the context of uncertainty calibration.
  • methods: The study uses a series of experimental and analytical methods, including experimental design, data analysis, and model evaluation, to assess the effectiveness of the proposed uncertainty calibration metrics.
  • results: The results show that the proposed uncertainty calibration metrics have good accuracy and stability, and can help improve the reliability and safety of object detection systems.
    Abstract The proliferation of Deep Neural Networks has resulted in machine learning systems becoming increasingly more present in various real-world applications. Consequently, there is a growing demand for highly reliable models in these domains, making the problem of uncertainty calibration pivotal, when considering the future of deep learning. This is especially true when considering object detection systems, that are commonly present in safety-critical application such as autonomous driving and robotics. For this reason, this work presents a novel theoretical and practical framework to evaluate object detection systems in the context of uncertainty calibration. The robustness of the proposed uncertainty calibration metrics is shown through a series of representative experiments. Code for the proposed uncertainty calibration metrics at: https://github.com/pedrormconde/Uncertainty_Calibration_Object_Detection.
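The paper proposes its own calibration metrics; as background on what "calibration" measures here, the standard binned expected calibration error (ECE) over detection confidences can be written as:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: mass-weighted gap between accuracy and mean
    confidence per bin. Background only -- not the paper's proposed metric.

    conf:    detection confidences in [0, 1]
    correct: 1 if the detection is a true positive (e.g. IoU-matched), else 0
    """
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            ece += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return ece
```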

New metrics for analyzing continual learners

  • paper_url: http://arxiv.org/abs/2309.00462
  • repo_url: None
  • paper_authors: Nicolas Michel, Giovanni Chierchia, Romain Negrel, Jean-François Bercher, Toshihiko Yamasaki
  • for: Continual learning requires balancing stability (retaining knowledge of old tasks) and plasticity (learning new ones).
  • methods: Existing stability and plasticity metrics are analyzed and shown to ignore the increasing difficulty of the classification task, which inherently costs performance for any model; new metrics are proposed that account for this setup-induced forgetting.
  • results: Experiments on benchmark datasets show that the proposed metrics provide new insights into the stability-plasticity trade-off achieved by models in the continual learning environment.
    Abstract Deep neural networks have shown remarkable performance when trained on independent and identically distributed data from a fixed set of classes. However, in real-world scenarios, it can be desirable to train models on a continuous stream of data where multiple classification tasks are presented sequentially. This scenario, known as Continual Learning (CL) poses challenges to standard learning algorithms which struggle to maintain knowledge of old tasks while learning new ones. This stability-plasticity dilemma remains central to CL and multiple metrics have been proposed to adequately measure stability and plasticity separately. However, none considers the increasing difficulty of the classification task, which inherently results in performance loss for any model. In that sense, we analyze some limitations of current metrics and identify the presence of setup-induced forgetting. Therefore, we propose new metrics that account for the task's increasing difficulty. Through experiments on benchmark datasets, we demonstrate that our proposed metrics can provide new insights into the stability-plasticity trade-off achieved by models in the continual learning environment.

Establishing Markov Equivalence in Cyclic Directed Graphs

  • paper_url: http://arxiv.org/abs/2309.03092
  • repo_url: https://github.com/tomc-ghub/CET_uai2023
  • paper_authors: Tom Claassen, Joris M. Mooij
  • for: establishment of Markov equivalence between directed graphs
  • methods: based on Cyclic Equivalence Theorem (CET) and ancestral perspective
  • results: significantly reduced algorithmic complexity and conceptually simplified characterization, which may help to reinvigorate theoretical research towards sound and complete cyclic discovery in the presence of latent confounders.
    Abstract We present a new, efficient procedure to establish Markov equivalence between directed graphs that may or may not contain cycles under the \textit{d}-separation criterion. It is based on the Cyclic Equivalence Theorem (CET) in the seminal works on cyclic models by Thomas Richardson in the mid '90s, but now rephrased from an ancestral perspective. The resulting characterization leads to a procedure for establishing Markov equivalence between graphs that no longer requires tests for d-separation, leading to a significantly reduced algorithmic complexity. The conceptually simplified characterization may help to reinvigorate theoretical research towards sound and complete cyclic discovery in the presence of latent confounders. This version includes a correction to rule (iv) in Theorem 1, and the subsequent adjustment in part 2 of Algorithm 2.

No Train Still Gain. Unleash Mathematical Reasoning of Large Language Models with Monte Carlo Tree Search Guided by Energy Function

  • paper_url: http://arxiv.org/abs/2309.03224
  • repo_url: None
  • paper_authors: Haotian Xu
  • for: Enhancing the mathematical reasoning of large language models (LLMs) without additional fine-tuning steps.
  • methods: Monte Carlo Tree Search (MCTS) guided by a lightweight energy function ranks decision steps; the fine-tuned LLM is re-formulated as a residual-based energy model whose parameters are estimated with noise contrastive estimation.
  • results: Extensive experiments on the GSM8k and AQUA-RAT mathematical reasoning benchmarks show outstanding performance without further fine-tuning or reinforcement learning with human feedback alignment.
    Abstract Large language models (LLMs) demonstrate impressive language understanding and contextual learning abilities, making them suitable for natural language processing (NLP) tasks and complex mathematical reasoning. However, when applied to mathematical reasoning tasks, LLMs often struggle to generate correct reasoning steps and answers despite having high probabilities for the solutions. To overcome this limitation and enhance the mathematical reasoning capabilities of fine-tuned LLMs without additional fine-tuning steps, we propose a method that incorporates Monte Carlo Tree Search (MCTS) and a lightweight energy function to rank decision steps and enable immediate reaction and precise reasoning. Specifically, we re-formulate the fine-tuned LLMs into a Residual-based Energy Model (Residual-EBM) and employ noise contrastive estimation to estimate the energy function's parameters. We then utilize MCTS with the energy function as a path verifier to search the output space and evaluate the reasoning path. Through extensive experiments on two mathematical reasoning benchmarks, GSM8k and AQUA-RAT, we demonstrate the exceptional capabilities of our method, which significantly improves the pass@1 metric of the fine-tuned model without requiring additional fine-tuning or reinforcement learning with human feedback alignment.

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

  • paper_url: http://arxiv.org/abs/2309.00424
  • repo_url: None
  • paper_authors: Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
  • for: Improving fine-grained speech generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR).
  • methods: Two encoders bring phonemes and speech into a joint multimodal space, learning frame-level connections between the two modalities.
  • results: The CTAP model, trained on 210k speech-phoneme text pairs, supports downstream minimally-supervised TTS, VC, and ASR.
    Abstract For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CTAP model is trained on 210k speech and phoneme text pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing.
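Frame-level contrastive pretraining of this kind is typically implemented with a symmetric InfoNCE objective between the two encoders' outputs. A CLIP-style sketch under that assumption (CTAP's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE over aligned frames: each speech frame should
    match its own phoneme embedding and no other in the batch.

    speech_emb, phoneme_emb: (N, D) frame-aligned embeddings
    """
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.T / temperature               # (N, N) similarity matrix
    targets = torch.arange(len(s), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```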

Declarative Reasoning on Explanations Using Constraint Logic Programming

  • paper_url: http://arxiv.org/abs/2309.00422
  • repo_url: https://github.com/lstate/reasonx
  • paper_authors: Laura State, Salvatore Ruggieri, Franco Turini
  • for: Current explanation methods in AI (XAI) suffer from several shortcomings, such as insufficient incorporation of background knowledge, a lack of abstraction, and limited interactivity with the user.
  • methods: REASONX provides declarative, interactive explanations based on Constraint Logic Programming (CLP); it explains decision trees — the ML models under analysis or global/local surrogates of any black-box model — lets users express background or common-sense knowledge via linear constraints and MILP optimization over features of factual and contrastive instances, and supports interaction with the answer constraints at different levels of abstraction through constraint projection.
  • results: The architecture of REASONX consists of a Python layer, closer to the user, and a CLP layer; its core execution engine is a Prolog meta-program with declarative semantics in terms of logic theories.
    Abstract Explaining opaque Machine Learning (ML) models is an increasingly relevant problem. Current explanation in AI (XAI) methods suffer several shortcomings, among others an insufficient incorporation of background knowledge, and a lack of abstraction and interactivity with the user. We propose REASONX, an explanation method based on Constraint Logic Programming (CLP). REASONX can provide declarative, interactive explanations for decision trees, which can be the ML models under analysis or global/local surrogate models of any black-box model. Users can express background or common sense knowledge using linear constraints and MILP optimization over features of factual and contrastive instances, and interact with the answer constraints at different levels of abstraction through constraint projection. We present here the architecture of REASONX, which consists of a Python layer, closer to the user, and a CLP layer. REASONX's core execution engine is a Prolog meta-program with declarative semantics in terms of logic theories.

Area-norm COBRA on Conditional Survival Prediction

  • paper_url: http://arxiv.org/abs/2309.00417
  • repo_url: None
  • paper_authors: Rahul Goswami, Arabin Kr. Dey
  • for: This paper explores a new combined regression strategy for computing the conditional survival function.
  • methods: The ensemble combines regression-based weak learners, using the area between two survival curves as the proximity measure.
  • results: The model outperforms the Random Survival Forest and provides a new technique for selecting the most important variables; a simulation study shows that the variable-relevance procedure works well.
    Abstract The paper explores a different variation of a combined regression strategy to calculate the conditional survival function. We use regression-based weak learners to create the proposed ensemble technique. The proposed combined regression strategy uses, as proximity measure, the area between two survival curves. The proposed model's construction ensures that it performs better than the Random Survival Forest. The paper discusses a novel technique to select the most important variables in the combined regression setup. We perform a simulation study to show that our proposition for finding the relevance of variables works quite well. We also use three real-life datasets to illustrate the model.
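The proximity measure named above — the area between two survival curves — has a direct numerical form once both curves are evaluated on a shared time grid. A minimal sketch with a trapezoidal approximation:

```python
import numpy as np

def survival_distance(times, s1, s2):
    """Area between two survival curves S1(t), S2(t) on a shared grid."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    return float(np.trapz(np.abs(s1 - s2), np.asarray(times, float)))

# Example: two exponential survival curves with different hazard rates.
t = np.linspace(0.0, 10.0, 101)
print(survival_distance(t, np.exp(-0.2 * t), np.exp(-0.35 * t)))
```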

Dense Voxel 3D Reconstruction Using a Monocular Event Camera

  • paper_url: http://arxiv.org/abs/2309.00385
  • repo_url: None
  • paper_authors: Haodong Chen, Vera Chung, Li Tan, Xiaoming Chen
  • for: This paper explores dense voxel 3D reconstruction from a single event camera, targeting virtual reality applications.
  • methods: A novel approach produces dense 3D reconstructions using only one event camera, without multiple cameras and without building a pipeline around existing SfM or MVS methods.
  • results: Preliminary results show visually distinguishable dense 3D reconstructions produced directly, without the pipelines previous methods require; a synthetic dataset of 39,739 object scans, generated with an event camera simulator, is released to accelerate related research.
    Abstract Event cameras are sensors inspired by biological systems that specialize in capturing changes in brightness. These emerging cameras offer many advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. Due to these advantages, event cameras have increasingly been adapted in various fields, such as frame interpolation, semantic segmentation, odometry, and SLAM. However, their application in 3D reconstruction for VR applications is underexplored. Previous methods in this field mainly focused on 3D reconstruction through depth map estimation. Methods that produce dense 3D reconstruction generally require multiple cameras, while methods that utilize a single event camera can only produce a semi-dense result. Other single-camera methods that can produce dense 3D reconstruction rely on creating a pipeline that either incorporates the aforementioned methods or other existing Structure from Motion (SfM) or Multi-view Stereo (MVS) methods. In this paper, we propose a novel approach for solving dense 3D reconstruction using only a single event camera. To the best of our knowledge, our work is the first attempt in this regard. Our preliminary results demonstrate that the proposed method can produce visually distinguishable dense 3D reconstructions directly without requiring pipelines like those used by existing methods. Additionally, we have created a synthetic dataset with $39,739$ object scans using an event camera simulator. This dataset will help accelerate other relevant research in this field.

Scenario-based model predictive control of water reservoir systems

  • paper_url: http://arxiv.org/abs/2309.00373
  • repo_url: None
  • paper_authors: Raffaele Giuseppe Cestari, Andrea Castelletti, Simone Formentin
  • for: Optimizing the operation of water reservoir systems in the presence of highly uncertain inflows.
  • methods: A stochastic MPC approach in which the control is optimized over a set of plausible future inflows generated directly from past data.
  • results: More cautious control that counteracts droughty periods (e.g., the lake level dropping below the dry limit) while satisfying agricultural water demand, validated through extensive Monte Carlo tests using actual inflow data from Lake Como, Italy.
    Abstract The optimal operation of water reservoir systems is a challenging task involving multiple conflicting objectives. The main source of complexity is the presence of the water inflow, which acts as an exogenous, highly uncertain disturbance on the system. When model predictive control (MPC) is employed, the optimal water release is usually computed based on the (predicted) trajectory of the inflow. This choice may jeopardize the closed-loop performance when the actual inflow differs from its forecast. In this work, we consider - for the first time - a stochastic MPC approach for water reservoirs, in which the control is optimized based on a set of plausible future inflows directly generated from past data. Such a scenario-based MPC strategy allows the controller to be more cautious, counteracting droughty periods (e.g., the lake level going below the dry limit) while at the same time guaranteeing that the agricultural water demand is satisfied. The method's effectiveness is validated through extensive Monte Carlo tests using actual inflow data from Lake Como, Italy.
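A toy sketch of the scenario-based idea: resample plausible inflow trajectories from past data, then pick the release that minimizes the average cost over scenarios, trading off unmet agricultural demand against dry-limit violations. The constant release, the one-line mass balance, and the cost weights are simplifying assumptions, not the paper's receding-horizon formulation:

```python
import numpy as np

def choose_release(level0, inflow_scenarios, demand, dry_limit,
                   candidates=np.linspace(0.0, 2.0, 41), w_dry=10.0):
    """Pick the constant release with the lowest scenario-averaged cost."""
    best_u, best_cost = None, np.inf
    for u in candidates:
        cost = 0.0
        for inflow in inflow_scenarios:          # one inflow value per step
            level = level0
            for q in inflow:
                level += q - u                   # toy mass balance
                cost += max(demand - u, 0.0)     # unmet agricultural demand
                cost += w_dry * max(dry_limit - level, 0.0)  # drought penalty
        avg = cost / len(inflow_scenarios)
        if avg < best_cost:
            best_cost, best_u = avg, u
    return best_u

# Scenarios stand in for inflows resampled from historical data.
scenarios = [np.random.default_rng(s).gamma(2.0, 0.5, size=24) for s in range(50)]
print(choose_release(level0=1.0, inflow_scenarios=scenarios,
                     demand=0.8, dry_limit=0.2))
```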

Discrete Versus Continuous Algorithms in Dynamics of Affective Decision Making

  • paper_url: http://arxiv.org/abs/2309.00357
  • repo_url: None
  • paper_authors: V. I. Yukalov, E. P. Yukalova
  • for: This paper studies the dynamics of affective decision making in an intelligent network of agents with different types of memory (long-term and short-term), and how they affect decisions.
  • methods: Probabilistic affective decision theory, accounting for both the rational utility of alternatives and their emotional attractiveness, is used to compare two multistep operational algorithms, one based on discrete dynamics and the other on continuous dynamics.
  • results: Depending on the network parameters, the characteristic probabilities of the continuous and discrete operations can be either close or drastically different, so the two algorithms can yield very different theoretical predictions and do not uniquely describe practical problems.
    Abstract The dynamics of affective decision making is considered for an intelligent network composed of agents with different types of memory: long-term and short-term memory. The consideration is based on probabilistic affective decision theory, which takes into account the rational utility of alternatives as well as the emotional alternative attractiveness. The objective of this paper is the comparison of two multistep operational algorithms of the intelligent network: one based on discrete dynamics and the other on continuous dynamics. By means of numerical analysis, it is shown that, depending on the network parameters, the characteristic probabilities for continuous and discrete operations can exhibit either close or drastically different behavior. Thus, depending on which algorithm is employed, either discrete or continuous, theoretical predictions can be rather different, which does not allow for a uniquely defined description of practical problems. This finding is important for understanding which of the algorithms is more appropriate for the correct analysis of decision-making tasks. A discussion is given, revealing that the discrete operation seems to be more realistic for describing intelligent networks as well as affective artificial intelligence.

Explainable Active Learning for Preference Elicitation

  • paper_url: http://arxiv.org/abs/2309.00356
  • repo_url: https://github.com/furkancanturk/explainable_active_learning
  • paper_authors: Furkan Cantürk, Reyhan Aydoğan
  • for: This paper tackles preference elicitation for new users under the cold-start problem, where the recommendation system lacks adequate user presence or access to other users' data, making profile-based methods infeasible.
  • methods: Active learning (AL) selects informative unlabeled items for the user to label, integrated with unsupervised, semi-supervised, and supervised ML in an explanatory preference elicitation process that feeds user feedback on the system's explanations back into the underlying model.
  • results: Human experiments for the short term, and several AL strategies over synthetic user profiles on two food datasets for long-term analysis, show efficient preference elicitation with limited user-labeled data while enhancing user trust through accurate explanations.
    Abstract Gaining insights into the preferences of new users and subsequently personalizing recommendations necessitate managing user interactions intelligently, namely, posing pertinent questions to elicit valuable information effectively. In this study, our focus is on a specific scenario of the cold-start problem, where the recommendation system lacks adequate user presence or access to other users' data is restricted, obstructing employing user profiling methods utilizing existing data in the system. We employ Active Learning (AL) to solve the addressed problem with the objective of maximizing information acquisition with minimal user effort. AL operates for selecting informative data from a large unlabeled set to inquire an oracle to label them and eventually updating a machine learning (ML) model. We operate AL in an integrated process of unsupervised, semi-supervised, and supervised ML within an explanatory preference elicitation process. It harvests user feedback (given for the system's explanations on the presented items) over informative samples to update an underlying ML model estimating user preferences. The designed user interaction facilitates personalizing the system by incorporating user feedback into the ML model and also enhances user trust by refining the system's explanations on recommendations. We implement the proposed preference elicitation methodology for food recommendation. We conducted human experiments to assess its efficacy in the short term and also experimented with several AL strategies over synthetic user profiles that we created for two food datasets, aiming for long-term performance analysis. The experimental results demonstrate the efficiency of the proposed preference elicitation with limited user-labeled data while also enhancing user trust through accurate explanations.
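The AL backbone of such a system can be sketched as pool-based uncertainty sampling: query the items the current preference model is least sure about, label them via user feedback, and retrain. This generic sketch omits the paper's explanation loop and hybrid ML stages:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def elicit_preferences(X_pool, oracle, n_rounds=10, batch=5, seed=0):
    """Pool-based active learning with uncertainty (margin) sampling.

    X_pool: (n, d) item features; oracle(i) returns the user's 0/1 label.
    Assumes the random seed batch contains both classes.
    """
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=batch, replace=False))
    y = {i: oracle(i) for i in labeled}
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_pool[labeled], [y[i] for i in labeled])
        pool = [i for i in range(len(X_pool)) if i not in y]
        if not pool:
            break
        proba = model.predict_proba(X_pool[pool])
        margin = np.abs(proba[:, 1] - proba[:, 0])   # small = uncertain
        for j in np.argsort(margin)[:batch]:         # ask about the hardest items
            i = pool[j]
            y[i] = oracle(i)                         # user feedback
            labeled.append(i)
    return model
```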

  • paper_url: http://arxiv.org/abs/2309.00317
  • repo_url: https://github.com/tam1032/dsaa2023-challenge-link-prediction-ds-uit_sat
  • paper_authors: Anh Hoang Tran, Tam Minh Nguyen, Son T. Luu
  • for: This paper presents the authors' entry to the DSAA 2023 Challenge, predicting whether a link exists between Wikipedia articles.
  • methods: Traditional machine learning models are trained on part-of-speech (POS) tag features extracted from the article text.
  • results: The approach obtains an F1 score of 0.99999 and 7th place in the competition; the source code is publicly available at https://github.com/Tam1032/DSAA2023-Challenge-Link-prediction-DS-UIT_SAT.
    Abstract This paper presents our work in the DSAA 2023 Challenge on link prediction for Wikipedia articles. We use traditional machine learning models with part-of-speech (POS) tag features extracted from text to train a classification model that predicts whether two nodes are linked. We then test these features on various machine learning models. We obtained an F1 score of 0.99999 and placed 7th in the competition. Our source code is publicly available at this link: https://github.com/Tam1032/DSAA2023-Challenge-Link-prediction-DS-UIT_SAT
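One plausible realization of the described pipeline — a POS-tag bag-of-features per article pair feeding a traditional classifier. The report does not specify the model; logistic regression stands in here, and the NLTK tagger is an assumed tokenization/tagging choice:

```python
from collections import Counter

import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One-time downloads: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_profile(text, prefix):
    """Bag of POS tags for one article, prefixed per pair slot."""
    tags = (t for _, t in nltk.pos_tag(nltk.word_tokenize(text)))
    return {f"{prefix}_{t}": c for t, c in Counter(tags).items()}

def pair_features(text_a, text_b):
    return {**pos_profile(text_a, "a"), **pos_profile(text_b, "b")}

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# model.fit([pair_features(a, b) for a, b in train_pairs], train_labels)
```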

Sherlock Holmes Doesn’t Play Dice: The significance of Evidence Theory for the Social and Life Sciences

  • paper_url: http://arxiv.org/abs/2309.03222
  • repo_url: None
  • paper_authors: V. L. Raju Chinthalapati, Guido Fioretti
  • for: This paper explores the potential applications of Evidence Theory in the social and life sciences, and how it differs from Probability Theory.
  • methods: Dempster-Shafer Theory and Belief Functions Theory are used to express the uncertainty deriving from events one has not been able to envisage.
  • results: The relation between Dempster-Shafer's combination rule and Bayes' Theorem is established for various versions of Probability Theory, and applications of Information Theory that Evidence Theory can enhance are discussed.
    Abstract While Evidence Theory (Dempster-Shafer Theory, Belief Functions Theory) is being increasingly used in data fusion, its potentialities in the Social and Life Sciences are often obscured by a lack of awareness of its distinctive features. With this paper we stress that Evidence Theory can express the uncertainty deriving from the fear that events may materialize that one has not been able to figure out. By contrast, Probability Theory must limit itself to the possibilities that a decision-maker is currently envisaging. Subsequently, we illustrate how Dempster-Shafer's combination rule relates to Bayes' Theorem for various versions of Probability Theory and discuss which applications of Information Theory can be enhanced by Evidence Theory. Finally, we illustrate our claims with an example where Evidence Theory is used to make sense of the partially overlapping, partially contradictory solutions that appear in an auditing exercise.
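Since the paper hinges on Dempster-Shafer's combination rule, a small worked example helps; the sketch below implements the standard textbook rule (conflict mass renormalized away) on a toy auditing scenario of our own invention.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions over focal elements (frozensets):
    m(A) is proportional to the sum of m1(B)*m2(C) over B ∩ C = A,
    with the conflicting mass K renormalized away."""
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources are incompatible")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Two auditors assign belief masses over suspects {x, y, z};
# mass on the full frame {x, y, z} represents ignorance.
m1 = {frozenset("xy"): 0.6, frozenset("xyz"): 0.4}
m2 = {frozenset("yz"): 0.7, frozenset("xyz"): 0.3}
print(dempster_combine(m1, m2))   # most mass lands on {y}, their overlap
```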

On the Aggregation of Rules for Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2309.00306
  • repo_url: None
  • paper_authors: Patrick Betz, Stefan Lüdtke, Christian Meilicke, Heiner Stuckenschmidt
  • for: The study addresses rule learning for knowledge graph completion, which is efficient, interpretable, and competitive with purely neural models.
  • methods: The paper studies the rule aggregation problem that arises when data-driven rule learning produces large, noisy rulesets, expressing existing aggregation approaches as marginal inference operations over the predicting rules.
  • results: The paper shows that the common Max-aggregation strategy has a probabilistic interpretation, and proposes an efficient, previously overlooked baseline that combines earlier strategies and is competitive with computationally more expensive approaches.
    Abstract Rule learning approaches for knowledge graph completion are efficient, interpretable and competitive to purely neural models. The rule aggregation problem is concerned with finding one plausibility score for a candidate fact which was simultaneously predicted by multiple rules. Although the problem is ubiquitous, as data-driven rule learning can result in noisy and large rulesets, it is underrepresented in the literature and its theoretical foundations have not been studied before in this context. In this work, we demonstrate that existing aggregation approaches can be expressed as marginal inference operations over the predicting rules. In particular, we show that the common Max-aggregation strategy, which scores candidates based on the rule with the highest confidence, has a probabilistic interpretation. Finally, we propose an efficient and overlooked baseline which combines the previous strategies and is competitive to computationally more expensive approaches.
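To illustrate the aggregation problem, the sketch below contrasts Max-aggregation with noisy-or aggregation, the two standard poles, on toy rule confidences; the paper's proposed combined baseline is not reproduced here.

```python
def max_aggregation(confidences):
    # Score = confidence of the single strongest rule that fired.
    return max(confidences)

def noisy_or(confidences):
    # Treat rules as independent evidence: P(fact) = 1 - Π(1 - c_i).
    p = 1.0
    for c in confidences:
        p *= (1.0 - c)
    return 1.0 - p

rules = [0.9, 0.4, 0.4]          # three rules predicting the same triple
print(max_aggregation(rules))    # 0.9  (ignores the two weaker rules)
print(noisy_or(rules))           # 0.964 (rewards corroborating rules)
```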

Identifiable Cognitive Diagnosis with Encoder-decoder for Modelling Students’ Performance

  • paper_url: http://arxiv.org/abs/2309.00300
  • repo_url: None
  • paper_authors: Jiatong Li, Qi Liu, Fei Wang, Jiayu Liu, Zhenya Huang, Enhong Chen
  • for: The paper targets cognitive diagnosis, which infers students' knowledge proficiencies from their response scores on exam questions and underpins domains such as computerized adaptive testing.
  • methods: The paper proposes a novel identifiable cognitive diagnosis framework: a flexible diagnostic module infers identifiable, explainable examinee traits and question features directly from response logs, and a general predictive module reconstructs the logs from the diagnostic results to keep them precise.
  • results: Experiments on four public real-world datasets show that the framework's implementation, ID-CDM, yields diagnostic results that are identifiable while remaining explainable and precise.
    Abstract Cognitive diagnosis aims to diagnose students' knowledge proficiencies based on their response scores on exam questions, which is the basis of many domains such as computerized adaptive testing. Existing cognitive diagnosis models (CDMs) follow a proficiency-response paradigm, which views diagnostic results as learnable embeddings that are the cause of students' responses and learns the diagnostic results through optimization. However, such a paradigm can easily lead to unidentifiable diagnostic results and the explainability overfitting problem, which is harmful to the quantification of students' learning performance. To address these problems, we propose a novel identifiable cognitive diagnosis framework. Specifically, we first propose a flexible diagnostic module which directly diagnose identifiable and explainable examinee traits and question features from response logs. Next, we leverage a general predictive module to reconstruct response logs from the diagnostic results to ensure the preciseness of the latter. We furthermore propose an implementation of the framework, i.e., ID-CDM, to demonstrate the availability of the former. Finally, we demonstrate the identifiability, explainability and preciseness of diagnostic results of ID-CDM through experiments on four public real-world datasets.

End-to-end Lidar-Driven Reinforcement Learning for Autonomous Racing

  • paper_url: http://arxiv.org/abs/2309.00296
  • repo_url: None
  • paper_authors: Meraj Mammadov
  • for: The paper is written for the domain of car racing, specifically autonomous racing.
  • methods: The paper uses reinforcement learning (RL) with feedforward raw lidar and velocity data to train an RL agent in a simulated environment.
  • results: The RL agent's performance is experimentally evaluated in a real-world racing scenario, demonstrating the feasibility and potential benefits of RL algorithms for enhancing autonomous racing performance, especially in environments where prior map information is not available.
    Abstract Reinforcement Learning (RL) has emerged as a transformative approach in the domains of automation and robotics, offering powerful solutions to complex problems that conventional methods struggle to address. In scenarios where the problem definitions are elusive and challenging to quantify, learning-based solutions such as RL become particularly valuable. One instance of such complexity can be found in the realm of car racing, a dynamic and unpredictable environment that demands sophisticated decision-making algorithms. This study focuses on developing and training an RL agent to navigate a racing environment solely using feedforward raw lidar and velocity data in a simulated context. The agent's performance, trained in the simulation environment, is then experimentally evaluated in a real-world racing scenario. This exploration underlines the feasibility and potential benefits of RL algorithm enhancing autonomous racing performance, especially in the environments where prior map information is not available.
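A bare-bones sketch of the training setup described: a feedforward policy over raw lidar plus velocity, trained with an off-the-shelf policy-gradient implementation. `RacingSimEnv` is a hypothetical gymnasium-style stand-in for the paper's simulator, and stable-baselines3's PPO is our assumption, not necessarily the paper's algorithm.

```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO

class RacingSimEnv(gym.Env):
    """Hypothetical simulator stub: observation = 1080-beam lidar scan plus
    scalar speed; action = (steering, throttle), both continuous."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(0.0, 30.0, shape=(1081,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}
    def step(self, action):
        # Placeholder dynamics: a real simulator would advance the car and
        # reward progress along the track while penalizing collisions.
        return self.observation_space.sample(), 0.0, False, False, {}

model = PPO("MlpPolicy", RacingSimEnv(), verbose=0)
model.learn(total_timesteps=10_000)   # training happens in simulation only
model.save("lidar_racer")             # the policy is then run on the real car
```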

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

  • paper_url: http://arxiv.org/abs/2309.00267
  • repo_url: None
  • paper_authors: Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi
  • for: The study compares reinforcement learning from human feedback (RLHF) with reinforcement learning from AI feedback (RLAIF) as techniques for aligning large language models (LLMs) with human preferences.
  • methods: Both pipelines are run head-to-head: RLHF relies on human-provided preference labels, while RLAIF has an off-the-shelf LLM label the preferences instead.
  • results: Both RLHF and RLAIF align LLMs with human preferences to a similar degree: human evaluators prefer their summaries over a supervised fine-tuned baseline in roughly 70% of cases, and rate RLAIF and RLHF summaries equally.
    Abstract Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
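The labeling step that RLAIF swaps in for human annotation can be sketched as below; `ask_llm` is a hypothetical wrapper around an off-the-shelf LLM, and the prompt wording is illustrative, not the paper's template.

```python
def ask_llm(prompt):
    """Hypothetical single-call wrapper around an off-the-shelf LLM."""
    raise NotImplementedError

def ai_preference(document, summary_a, summary_b):
    prompt = (
        "Given the text below, which summary is better? Answer A or B.\n\n"
        f"Text: {document}\n\nSummary A: {summary_a}\n\nSummary B: {summary_b}"
    )
    answer = ask_llm(prompt).strip().upper()
    return 0 if answer.startswith("A") else 1   # index of the preferred summary

# The resulting (document, summary_a, summary_b, preferred) tuples then replace
# human labels in the standard RLHF reward-model training loop.
```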

Leveraging Learning Metrics for Improved Federated Learning

  • paper_url: http://arxiv.org/abs/2309.00257
  • repo_url: None
  • paper_authors: Andre Fu
  • for: The work applies emerging research in explainable artificial intelligence (XAI), in particular quantitative learning metrics, to improve federated learning.
  • methods: The study combines federated learning with the Effective Rank (ER) learning metric, yielding the first metric-based aggregation method for federated learning.
  • results: Aggregation based on Effective Rank outperforms the baseline Federated Averaging \cite{konevcny2016federated}, and the authors develop a novel ER-based weight-aggregation scheme.
    Abstract Currently in the federated setting, no learning schemes leverage the emerging research of explainable artificial intelligence (XAI) in particular the novel learning metrics that help determine how well a model is learning. One of these novel learning metrics is termed `Effective Rank' (ER) which measures the Shannon Entropy of the singular values of a matrix, thus enabling a metric determining how well a layer is mapping. By joining federated learning and the learning metric, effective rank, this work will \textbf{(1)} give the first federated learning metric aggregation method \textbf{(2)} show that effective rank is well-suited to federated problems by out-performing baseline Federated Averaging \cite{konevcny2016federated} and \textbf{(3)} develop a novel weight-aggregation scheme relying on effective rank.
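Effective Rank, in the Roy-Vetterli sense the abstract describes, is the exponential of the Shannon entropy of the normalized singular-value distribution; a minimal NumPy sketch:

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)                  # normalize the singular values
    entropy = -np.sum(p * np.log(p + eps))   # Shannon entropy, in nats
    return float(np.exp(entropy))

W = np.random.randn(64, 32)                  # e.g. one layer's weight matrix
print(effective_rank(W))                     # near 32 for a random matrix
```

A well-conditioned random layer scores close to its full rank; heavily anisotropic (poorly mapping) layers score much lower, which is what makes the metric a useful per-layer learning signal.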

DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.00248
  • repo_url: https://github.com/mshenoda/diffugen
  • paper_authors: Michael Shenoda, Edward Kim
  • for: The work aims to improve the accuracy and robustness of machine learning models in computer vision by generating high-quality labeled image datasets.
  • methods: The paper introduces DiffuGen, a simple and adaptable approach that harnesses stable diffusion models together with two distinct labeling techniques, unsupervised and supervised, to generate labeled images.
  • results: DiffuGen produces high-quality labeled image datasets and provides a versatile solution for label generation, using prompt templating for adaptable image generation and textual inversion to enhance the diffusion model's capabilities.
    Abstract Generating high-quality labeled image datasets is crucial for training accurate and robust machine learning models in the field of computer vision. However, the process of manually labeling real images is often time-consuming and costly. To address these challenges associated with dataset generation, we introduce "DiffuGen," a simple and adaptable approach that harnesses the power of stable diffusion models to create labeled image datasets efficiently. By leveraging stable diffusion models, our approach not only ensures the quality of generated datasets but also provides a versatile solution for label generation. In this paper, we present the methodology behind DiffuGen, which combines the capabilities of diffusion models with two distinct labeling techniques: unsupervised and supervised. Distinctively, DiffuGen employs prompt templating for adaptable image generation and textual inversion to enhance diffusion model capabilities.

City electric power consumption forecasting based on big data & neural network under smart grid background

  • paper_url: http://arxiv.org/abs/2309.00245
  • repo_url: None
  • paper_authors: Zhengxian Chen, Maowei Wang, Conghu Li
  • for: The paper studies forecasting and evaluating city electric power consumption, in support of better smart-city services.
  • methods: Using big data, the authors build a neural network model that accounts for the influence of various nonlinear factors on city electric power consumption.
  • results: Based on a permutation importance test, the paper constructs an evaluation model of the factors influencing city electric power consumption and identifies the core characteristic values for prediction, offering an important reference for the electric power industry.
    Abstract With the development of the electric power system, the smart grid has become an important part of the smart city. The rational transmission of electric energy and the guarantee of power supply of the smart grid are very important to smart cities, smart cities can provide better services through smart grids. Among them, predicting and judging city electric power consumption is closely related to electricity supply and regulation, the location of power plants, and the control of electricity transmission losses. Based on big data, this paper establishes a neural network and considers the influence of various nonlinear factors on city electric power consumption. A model is established to realize the prediction of power consumption. Based on the permutation importance test, an evaluation model of the influencing factors of city electric power consumption is constructed to obtain the core characteristic values of city electric power consumption prediction, which can provide an important reference for electric power related industry.
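A minimal sketch of the permutation-importance evaluation the paper describes, using scikit-learn's implementation on a small fitted network; the feature names are hypothetical stand-ins for the city-level factors.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                # toy stand-in for city factors
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["population", "GDP", "temperature", "humidity"],
                     result.importances_mean):
    print(f"{name}: {imp:.3f}")              # feature 0 should dominate
```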

FactLLaMA: Optimizing Instruction-Following Language Models with External Knowledge for Automated Fact-Checking

  • paper_url: http://arxiv.org/abs/2309.00240
  • repo_url: https://github.com/thcheung/FactLLaMA
  • paper_authors: Tsun-Hin Cheung, Kin-Man Lam
  • for: The work aims to improve automated fact-checking performance in order to better combat the spread of misinformation.
  • methods: The study combines large language models and instruction-following variants, such as InstructGPT and Alpaca, with external evidence retrieved via search engines, and instruct-tunes the open-source LLaMA model on this evidence.
  • results: Combining external evidence with instruction tuning improves veracity prediction, achieving state-of-the-art performance on two widely used fact-checking datasets, RAWFC and LIAR.
    Abstract Automatic fact-checking plays a crucial role in combating the spread of misinformation. Large Language Models (LLMs) and Instruction-Following variants, such as InstructGPT and Alpaca, have shown remarkable performance in various natural language processing tasks. However, their knowledge may not always be up-to-date or sufficient, potentially leading to inaccuracies in fact-checking. To address this limitation, we propose combining the power of instruction-following language models with external evidence retrieval to enhance fact-checking performance. Our approach involves leveraging search engines to retrieve relevant evidence for a given input claim. This external evidence serves as valuable supplementary information to augment the knowledge of the pretrained language model. Then, we instruct-tune an open-sourced language model, called LLaMA, using this evidence, enabling it to predict the veracity of the input claim more accurately. To evaluate our method, we conducted experiments on two widely used fact-checking datasets: RAWFC and LIAR. The results demonstrate that our approach achieves state-of-the-art performance in fact-checking tasks. By integrating external evidence, we bridge the gap between the model's knowledge and the most up-to-date and sufficient context available, leading to improved fact-checking outcomes. Our findings have implications for combating misinformation and promoting the dissemination of accurate information on online platforms. Our released materials are accessible at: https://thcheung.github.io/factllama.
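A rough sketch of how retrieved evidence can be spliced into an instruction-style prompt before veracity prediction; the `search_evidence` wrapper and the prompt wording are hypothetical, not the authors' exact template.

```python
def search_evidence(claim, top_k=3):
    """Hypothetical wrapper around a search engine API; returns text snippets."""
    raise NotImplementedError

def build_prompt(claim):
    evidence = "\n".join(f"- {s}" for s in search_evidence(claim))
    return (
        "Below is a claim and evidence retrieved from the web.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Claim: {claim}\n"
        "Classify the claim as true, half-true, or false:"
    )

# The resulting (prompt, gold label) pairs are then used to instruct-tune
# the base LLaMA model, so its predictions are grounded in fresh evidence.
```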

  • paper_url: http://arxiv.org/abs/2309.00238
  • repo_url: None
  • paper_authors: Salwa Abbara, Mona Hafez, Aya Kazzaz, Areej Alhothali, Alhanouf Alsolami
  • for: This paper aims to predict the judgment outcomes of Arabic case scripts, specifically in cases of custody and annulment of marriage.
  • methods: The authors use deep learning (DL) and natural language processing (NLP) techniques, including Support Vector Machine (SVM), Logistic regression (LR), Long Short Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM), with representation techniques such as TF-IDF and word2vec on a developed dataset.
  • results: The authors achieved high accuracy in predicting the judgment outcomes of custody cases and annulment of marriage, with the SVM model with word2vec and LR with TF-IDF achieving the highest accuracy of 88% and 78%, respectively. Additionally, the LR and SVM with word2vec and BiLSTM model with TF-IDF achieved the highest accuracy of 88% and 69%, respectively, in predicting the probability of outcomes on custody cases and annulment of marriage.
    Abstract Legal Judgment Prediction (LJP) aims to predict judgment outcomes based on case description. Several researchers have developed techniques to assist potential clients by predicting the outcome in the legal profession. However, none of the proposed techniques were implemented in Arabic, and only a few attempts were implemented in English, Chinese, and Hindi. In this paper, we develop a system that utilizes deep learning (DL) and natural language processing (NLP) techniques to predict the judgment outcome from Arabic case scripts, especially in cases of custody and annulment of marriage. This system will assist judges and attorneys in improving their work and time efficiency while reducing sentencing disparity. In addition, it will help litigants, lawyers, and law students analyze the probable outcomes of any given case before trial. We use a different machine and deep learning models such as Support Vector Machine (SVM), Logistic regression (LR), Long Short Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) using representation techniques such as TF-IDF and word2vec on the developed dataset. Experimental results demonstrate that compared with the five baseline methods, the SVM model with word2vec and LR with TF-IDF achieve the highest accuracy of 88% and 78% in predicting the judgment on custody cases and annulment of marriage, respectively. Furthermore, the LR and SVM with word2vec and BiLSTM model with TF-IDF achieved the highest accuracy of 88% and 69% in predicting the probability of outcomes on custody cases and annulment of marriage, respectively.
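One of the reported baseline combinations (LR over TF-IDF features) is easy to sketch with scikit-learn; the word2vec variants would swap the vectorizer for averaged embeddings. The Arabic case scripts are assumed available as (texts, labels).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Two of the reported baselines, as scikit-learn pipelines.
lr_tfidf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
svm_tfidf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Usage, assuming (train_texts, train_labels) from the case-script dataset:
# lr_tfidf.fit(train_texts, train_labels)
# print(lr_tfidf.score(test_texts, test_labels))
```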

Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

  • paper_url: http://arxiv.org/abs/2309.00237
  • repo_url: https://github.com/starmpcc/asclepius
  • paper_authors: Sunjun Kweon, Junu Kim, Jiyoun Kim, Sujeong Im, Eunbyeol Cho, Seongsu Bae, Jungwoo Oh, Gyubok Lee, Jong Hak Moon, Seng Chan You, Seungjin Baek, Chang Hoon Han, Yoon Bin Jung, Yohan Jo, Edward Choi
  • for: The paper aims to develop a specialized clinical language model, Asclepius, to handle patients' clinical notes, addressing the limited accessibility and usability of these notes under strict privacy regulations.
  • methods: The authors create synthetic large-scale clinical notes from publicly available case reports extracted from biomedical literature, train Asclepius on them, and evaluate it on real clinical notes against other large language models, including GPT-3.5-turbo and open-source alternatives, as well as against variants of Asclepius trained on real notes.
  • results: Synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models, with Asclepius outperforming the alternatives in real-world evaluation; the findings are supported by detailed assessments from both GPT-4 and medical professionals.
    Abstract The development of large language models tailored for handling patients' clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. While Asclepius is trained on synthetic data, we assess its potential performance in real-world applications by evaluating it using real clinical notes. We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives. To further validate our approach using synthetic notes, we also compare Asclepius with its variants trained on real clinical notes. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models. This conclusion is supported by detailed evaluations conducted by both GPT-4 and medical professionals. All resources including weights, codes, and data used in the development of Asclepius are made publicly accessible for future research.

Large Language Models for Semantic Monitoring of Corporate Disclosures: A Case Study on Korea’s Top 50 KOSPI Companies

  • paper_url: http://arxiv.org/abs/2309.00208
  • repo_url: None
  • paper_authors: Junwon Sung, Woojin Heo, Yunkyung Byun, Youngsam Kim
  • for: The study examines how well OpenAI's GPT-3.5-turbo and GPT-4 can semantically analyze corporate disclosures in the Korean context, particularly timely disclosures.
  • methods: The monthly disclosure summaries of the top 50 KOSPI-listed companies (by market capitalization) were analyzed over 17 months, with each summary assigned a sentiment rating from 1 (very negative) to 5 (very positive) and compared against ratings from human experts.
  • results: GPT-4 showed notable accuracy relative to human expert ratings, with a Spearman correlation coefficient of 0.61 and a simple concordance rate of 0.82, providing insight into the evaluative characteristics of GPT models and laying groundwork for automated semantic monitoring.
    Abstract In the rapidly advancing domain of artificial intelligence, state-of-the-art language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented opportunities for automating complex tasks. This research paper delves into the capabilities of these models for semantically analyzing corporate disclosures in the Korean context, specifically for timely disclosure. The study focuses on the top 50 publicly traded companies listed on the Korean KOSPI, based on market capitalization, and scrutinizes their monthly disclosure summaries over a period of 17 months. Each summary was assigned a sentiment rating on a scale ranging from 1(very negative) to 5(very positive). To gauge the effectiveness of the language models, their sentiment ratings were compared with those generated by human experts. Our findings reveal a notable performance disparity between GPT-3.5-turbo and GPT-4, with the latter demonstrating significant accuracy in human evaluation tests. The Spearman correlation coefficient was registered at 0.61, while the simple concordance rate was recorded at 0.82. This research contributes valuable insights into the evaluative characteristics of GPT models, thereby laying the groundwork for future innovations in the field of automated semantic monitoring.
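The paper's two agreement statistics are straightforward to reproduce on toy data: Spearman's rank correlation between model and expert ratings, and a simple concordance rate (taken here as exact agreement; the paper's precise definition may differ).

```python
from scipy.stats import spearmanr

expert = [5, 4, 4, 2, 1, 3, 5, 2]   # toy 1-5 sentiment ratings
model  = [5, 4, 3, 2, 1, 3, 4, 2]

rho, _ = spearmanr(expert, model)
concordance = sum(e == m for e, m in zip(expert, model)) / len(expert)
print(f"Spearman rho = {rho:.2f}, concordance = {concordance:.2f}")
```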

Gap and Overlap Detection in Automated Fiber Placement

  • paper_url: http://arxiv.org/abs/2309.00206
  • repo_url: None
  • paper_authors: Assef Ghamisi, Homayoun Najjaran
  • for: This paper addresses the detection and correction of manufacturing defects, specifically gaps and overlaps, in composite parts produced through Automated Fiber Placement (AFP); these are the most common defects and can significantly impact the quality of the composite parts.
  • methods: The paper proposes a novel method that uses an Optical Coherence Tomography (OCT) sensor and computer vision techniques: a depth map image of the composite surface is generated, the boundaries of each tow are detected, and consecutive tows are compared to identify gaps or overlaps that exceed a predefined tolerance threshold.
  • results: Compared against ground-truth annotations by experts, the approach demonstrates a high level of accuracy and efficiency in gap and overlap segmentation, and has the potential to improve the overall quality and efficiency of the manufacturing process.
    Abstract The identification and correction of manufacturing defects, particularly gaps and overlaps, are crucial for ensuring high-quality composite parts produced through Automated Fiber Placement (AFP). These imperfections are the most commonly observed issues that can significantly impact the overall quality of the composite parts. Manual inspection is both time-consuming and labor-intensive, making it an inefficient approach. To overcome this challenge, the implementation of an automated defect detection system serves as the optimal solution. In this paper, we introduce a novel method that uses an Optical Coherence Tomography (OCT) sensor and computer vision techniques to detect and locate gaps and overlaps in composite parts. Our approach involves generating a depth map image of the composite surface that highlights the elevation of composite tapes (or tows) on the surface. By detecting the boundaries of each tow, our algorithm can compare consecutive tows and identify gaps or overlaps that may exist between them. Any gaps or overlaps exceeding a predefined tolerance threshold are considered manufacturing defects. To evaluate the performance of our approach, we compare the detected defects with the ground truth annotated by experts. The results demonstrate a high level of accuracy and efficiency in gap and overlap segmentation.
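A toy version of the defect logic: given the detected left/right boundaries of each tow (boundary extraction from the OCT depth map is assumed done), consecutive tows are compared and seams beyond a tolerance are flagged. Units and the tolerance value are illustrative.

```python
def classify_seams(tow_bounds, tol_mm=0.5):
    """tow_bounds: list of (left, right) edges in mm, ordered across the surface."""
    defects = []
    for i in range(len(tow_bounds) - 1):
        _, right = tow_bounds[i]
        next_left, _ = tow_bounds[i + 1]
        seam = next_left - right          # > 0 means a gap, < 0 means overlap
        if seam > tol_mm:
            defects.append((i, "gap", seam))
        elif seam < -tol_mm:
            defects.append((i, "overlap", -seam))
    return defects

print(classify_seams([(0.0, 6.2), (7.1, 13.3), (13.0, 19.2)]))
# -> [(0, 'gap', 0.9...)]  (the second seam overlaps by 0.3 mm, within tolerance)
```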

Subjectivity in Unsupervised Machine Learning Model Selection

  • paper_url: http://arxiv.org/abs/2309.00201
  • repo_url: None
  • paper_authors: Wanyi Chen, Mary L. Cummings
  • for: The study investigates the subjectivity involved in model selection for unsupervised machine learning.
  • methods: Using the Hidden Markov Model as an example, 33 participants and three large language models (LLMs) were asked to make model selections in three scenarios, to examine how their choices diverge under different conditions.
  • results: Both the participants' and the LLMs' choices showed variability and inconsistency, especially when different criteria and metrics disagreed; sources of subjectivity include differing opinions on the importance of criteria and metrics, on how parsimonious a model should be, and on how dataset size should influence selection, underscoring the need for a more standardized way to document subjective choices made in model selection.
    Abstract Model selection is a necessary step in unsupervised machine learning. Despite numerous criteria and metrics, model selection remains subjective. A high degree of subjectivity may lead to questions about repeatability and reproducibility of various machine learning studies and doubts about the robustness of models deployed in the real world. Yet, the impact of modelers' preferences on model selection outcomes remains largely unexplored. This study uses the Hidden Markov Model as an example to investigate the subjectivity involved in model selection. We asked 33 participants and three Large Language Models (LLMs) to make model selections in three scenarios. Results revealed variability and inconsistencies in both the participants' and the LLMs' choices, especially when different criteria and metrics disagree. Sources of subjectivity include varying opinions on the importance of different criteria and metrics, differing views on how parsimonious a model should be, and how the size of a dataset should influence model selection. The results underscore the importance of developing a more standardized way to document subjective choices made in model selection processes.
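One way to make the subjective trade-offs concrete is to fit HMMs with different state counts and report several selection criteria side by side; the sketch below assumes the hmmlearn package, and the parameter count is a diagonal-Gaussian approximation, so treat the AIC/BIC values as illustrative.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

X = np.random.default_rng(0).normal(size=(300, 2))   # stand-in observations

for k in (2, 3, 4):
    model = GaussianHMM(n_components=k, covariance_type="diag",
                        random_state=0).fit(X)
    ll = model.score(X)                               # total log-likelihood
    d = X.shape[1]
    n_params = (k - 1) + k * (k - 1) + 2 * k * d      # starts + transitions + Gaussians
    aic = 2 * n_params - 2 * ll
    bic = n_params * np.log(len(X)) - 2 * ll
    print(f"k={k}: logL={ll:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

Two modelers looking at this same table can still disagree, e.g. one weighting BIC's parsimony, the other raw likelihood, which is exactly the subjectivity the paper documents.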

Diffusion Model with Clustering-based Conditioning for Food Image Generation

  • paper_url: http://arxiv.org/abs/2309.00199
  • repo_url: None
  • paper_authors: Yue Han, Jiangpeng He, Mridul Gupta, Edward J. Delp, Fengqing Zhu
  • for: The paper proposes a conditional-diffusion-based method for generating synthetic food images, improving their quality and representativeness.
  • methods: The method builds on a conditional diffusion model and introduces a clustering-based training framework, named ClusDiff, for generating high-quality and representative food images.
  • results: On the Food-101 dataset, ClusDiff outperforms existing image generation methods, and its synthetic images help address the severe class imbalance issue in long-tailed food classification on the VFN-LT dataset.
    Abstract Image-based dietary assessment serves as an efficient and accurate solution for recording and analyzing nutrition intake using eating occasion images as input. Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation, which rely on large amounts of food images with annotations for training. However, such data dependency poses significant barriers to real-world applications, because acquiring a substantial, diverse, and balanced set of food images can be challenging. One potential solution is to use synthetic food images for data augmentation. Although existing work has explored the use of generative adversarial networks (GAN) based structures for generation, the quality of synthetic food images still remains subpar. In addition, while diffusion-based generative models have shown promising results for general image generation tasks, the generation of food images can be challenging due to the substantial intra-class variance. In this paper, we investigate the generation of synthetic food images based on the conditional diffusion model and propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images. The proposed method is evaluated on the Food-101 dataset and shows improved performance when compared with existing image generation works. We also demonstrate that the synthetic food images generated by ClusDiff can help address the severe class imbalance issue in long-tailed food classification using the VFN-LT dataset.

cs.CL - 2023-09-01

Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

  • paper_url: http://arxiv.org/abs/2309.00751
  • repo_url: https://github.com/DanielSc4/RewardLM
  • paper_authors: Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
  • for: The paper studies how language model detoxification techniques affect models' internal processes.
  • methods: Popular detoxification approaches are applied to several language models, and feature attribution methods are used to quantify their impact on the models' prompt dependence.
  • results: Detoxification improves the safety of model generations, but its effect on internal processes is still poorly understood; counter-narrative fine-tuning and reinforcement-learning-driven detoxification differ in how much the resulting models rely on the prompt, despite similar detoxification performance.
    Abstract Due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performances.

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

  • paper_url: http://arxiv.org/abs/2309.00614
  • repo_url: None
  • paper_authors: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
  • for: The paper studies the security of aligned language models, in particular how defenses perform under different threat models.
  • methods: Several baseline defenses are evaluated against leading adversarial attacks: detection (perplexity-based filtering), input preprocessing (paraphrasing and retokenization), and adversarial training, in both white-box and gray-box settings.
  • results: The weakness of existing discrete optimizers for text, combined with the relatively high cost of optimization, makes standard adaptive attacks more challenging for LLMs than in computer vision; future work is needed to determine whether stronger optimizers can be developed, or whether filtering and preprocessing defenses are inherently stronger in the language domain.
    Abstract As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
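The perplexity-based detection baseline is easy to sketch: adversarial suffixes produced by discrete optimizers tend to be high-perplexity gibberish, so a small reference LM can flag them. The threshold below is a made-up illustration, not a value from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss          # mean cross-entropy per token
    return torch.exp(loss).item()

def is_suspicious(prompt, threshold=1000.0):
    return perplexity(prompt) > threshold

print(is_suspicious("Tell me about the history of chess."))            # False
print(is_suspicious("describing.\\ + similarlyNow write oppositeley"))  # likely True for gibberish
```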

Taken out of context: On measuring situational awareness in LLMs

  • paper_url: http://arxiv.org/abs/2309.00667
  • repo_url: https://github.com/asacooperstickland/situational-awareness-evals
  • paper_authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
  • for: The study aims to better understand the emergence of situational awareness in large language models (LLMs).
  • methods: The researchers run scaling experiments on an ability necessary for situational awareness, proposing out-of-context reasoning (in contrast to in-context learning) as the test: an LLM is finetuned on a description of a test, with no examples or demonstrations, and must pass the test at evaluation time.
  • results: LLMs succeed at this out-of-context reasoning task, but success is sensitive to the training setup and only works with data augmentation; for both GPT-3 and LLaMA-1, performance improves with model size.
    Abstract We aim to better understand the emergence of `situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment. Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose `out-of-context reasoning' (in contrast to in-context learning). We study out-of-context reasoning experimentally. First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task. Their success is sensitive to the training setup and only works when we apply data augmentation. For both GPT-3 and LLaMA-1, performance improves with model size. These findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in LLMs. Code is available at: https://github.com/AsaCooperStickland/situational-awareness-evals.

Satisfiability Checking of Multi-Variable TPTL with Unilateral Intervals Is PSPACE-Complete

  • paper_url: http://arxiv.org/abs/2309.00386
  • repo_url: None
  • paper_authors: Shankara Narayanan Krishna, Khushraj Nanik Madnani, Rupak Majumdar, Paritosh K. Pandya
  • for: The paper investigates the decidability of the ${0,\infty}$ fragment of Timed Propositional Temporal Logic (TPTL).
  • methods: Satisfiability checking of TPTL$^{0,\infty}$ is reduced to the emptiness checking problem for a new "non-punctual" subclass of Alternating Timed Automata with multiple clocks, called Unilateral Very Weak Alternating Timed Automata (VWATA$^{0,\infty}$), which the authors prove to be in PSPACE.
  • results: Satisfiability checking of TPTL$^{0,\infty}$ is PSPACE-complete, and even its 1-variable fragment is strictly more expressive than Metric Interval Temporal Logic (MITL), whose satisfiability checking is EXPSPACE-complete; to the authors' knowledge, this is the first multi-variable fragment of TPTL whose satisfiability checking is decidable without imposing bounds or restrictions on the timed words (e.g., bounded variability or bounded time).
    Abstract We investigate the decidability of the ${0,\infty}$ fragment of Timed Propositional Temporal Logic (TPTL). We show that the satisfiability checking of TPTL$^{0,\infty}$ is PSPACE-complete. Moreover, even its 1-variable fragment (1-TPTL$^{0,\infty}$) is strictly more expressive than Metric Interval Temporal Logic (MITL) for which satisfiability checking is EXPSPACE complete. Hence, we have a strictly more expressive logic with computationally easier satisfiability checking. To the best of our knowledge, TPTL$^{0,\infty}$ is the first multi-variable fragment of TPTL for which satisfiability checking is decidable without imposing any bounds/restrictions on the timed words (e.g. bounded variability, bounded time, etc.). The membership in PSPACE is obtained by a reduction to the emptiness checking problem for a new "non-punctual" subclass of Alternating Timed Automata with multiple clocks called Unilateral Very Weak Alternating Timed Automata (VWATA$^{0,\infty}$) which we prove to be in PSPACE. We show this by constructing a simulation equivalent non-deterministic timed automata whose number of clocks is polynomial in the size of the given VWATA$^{0,\infty}$.

BatchPrompt: Accomplish more with less

  • paper_url: http://arxiv.org/abs/2309.00384
  • repo_url: None
  • paper_authors: Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham
  • for: The paper improves the prompting efficiency of large language models (LLMs), letting them handle long-context prompts more effectively.
  • methods: Data is batched within the token limit into a single prompt (BatchPrompt), and two techniques are proposed on top of it: Batch Permutation and Ensembling (BPE) and Self-reflection-guided EArly Stopping (SEAS).
  • results: BPE and SEAS boost BatchPrompt to performance competitive with, or better than, single-data prompting (SinglePrompt) while requiring far fewer LLM calls and input tokens: with batch size 32, just 9%-16% of the LLM calls, with Boolq accuracy 90.6% to 90.9% using 27.4% of the tokens, QQP accuracy 87.2% to 88.4% using 18.6%, and RTE accuracy 91.5% to 91.1% using 30.8%. This is the first work to technically improve the prompting efficiency of large language models.
    Abstract As the ever-increasing token limits of large language models (LLMs) have enabled long contexts as input, prompting with single data samples may no longer be efficient. A straightforward strategy for improving efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We have two initial observations for prompting with batched data. First, we find that prompting with batched data in longer contexts inevitably leads to worse performance compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome the performance loss, we propose Batch Permutation and Ensembling (BPE), and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our comprehensive experimental evaluation demonstrates that BPE can boost the performance of BatchPrompt by a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate question identification (QQP). These performances are even competitive with or higher than single-data prompting (SinglePrompt), while BatchPrompt requires far fewer LLM calls and input tokens (for SinglePrompt vs. BatchPrompt with batch size 32, using just 9%-16% of the LLM calls: Boolq accuracy 90.6% to 90.9% with 27.4% of the tokens, QQP accuracy 87.2% to 88.4% with 18.6% of the tokens, RTE accuracy 91.5% to 91.1% with 30.8% of the tokens). To the best of our knowledge, this is the first work to technically improve the prompting efficiency of large language models. We hope our simple yet effective approach will shed light on future research on large language models. The code will be released.
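A bare-bones illustration of batching plus the permutation step of BPE: pack several questions into one prompt, re-ask with the batch order shuffled, and majority-vote per question. `ask_llm` is a hypothetical single-call LLM wrapper, and SEAS (stopping early once the votes agree) is omitted for brevity.

```python
import random
from collections import Counter

def ask_llm(prompt):
    """Hypothetical: returns one answer line per numbered question."""
    raise NotImplementedError

def batch_prompt(questions):
    body = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return f"Answer each question with yes or no, one line each:\n{body}"

def bpe_vote(questions, n_votes=3, seed=0):
    rng = random.Random(seed)
    votes = {q: Counter() for q in questions}
    for _ in range(n_votes):
        order = questions[:]
        rng.shuffle(order)                    # new batch position for each item
        answers = ask_llm(batch_prompt(order)).splitlines()
        for q, a in zip(order, answers):
            votes[q][a.strip().lower()] += 1
    return {q: c.most_common(1)[0][0] for q, c in votes.items()}
```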

Long-Term Memorability On Advertisements

  • paper_url: http://arxiv.org/abs/2309.00378
  • repo_url: None
  • paper_authors: Harini S I, Somesh Singh, Yaman K Singla, Aanisha Bhattacharyya, Veeky Baths, Changyou Chen, Rajiv Ratn Shah, Balaji Krishnamurthy
  • for: The paper aims to study the memorability of ads in the machine learning literature, specifically focusing on long-term memorability and the impact of multimodality and human factors.
  • methods: The study consists of 1203 participants and 2205 ads covering 276 brands, with statistical tests run to identify factors that contribute to ad memorability. Additionally, the paper presents a novel model called Sharingan, which leverages real-world knowledge of LLMs and visual knowledge of visual encoders to predict the memorability of content.
  • results: The study finds that fast-moving scenes in commercials are more memorable than slower scenes (p=8e-10), and that people who use ad-blockers remember lower number of ads than those who don’t (p=5e-3). The Sharingan model achieves state-of-the-art performance on all prominent memorability datasets in literature, and ablation studies reveal insights into what drives memory.
    Abstract Marketers spend billions of dollars on advertisements but to what end? At the purchase time, if customers cannot recognize a brand for which they saw an ad, the money spent on the ad is essentially wasted. Despite its importance in marketing, until now, there has been no study on the memorability of ads in the ML literature. Most studies have been conducted on short-term recall (<5 mins) on specific content types like object and action videos. On the other hand, the advertising industry only cares about long-term memorability (a few hours or longer), and advertisements are almost always highly multimodal, depicting a story through its different modalities (text, images, and videos). With this motivation, we conduct the first large scale memorability study consisting of 1203 participants and 2205 ads covering 276 brands. Running statistical tests over different participant subpopulations and ad-types, we find many interesting insights into what makes an ad memorable - both content and human factors. For example, we find that brands which use commercials with fast moving scenes are more memorable than those with slower scenes (p=8e-10) and that people who use ad-blockers remember lower number of ads than those who don't (p=5e-3). Further, with the motivation of simulating the memorability of marketing materials for a particular audience, ultimately helping create one, we present a novel model, Sharingan, trained to leverage real-world knowledge of LLMs and visual knowledge of visual encoders to predict the memorability of a content. We test our model on all the prominent memorability datasets in literature (both images and videos) and achieve state of the art across all of them. We conduct extensive ablation studies across memory types, modality, brand, and architectural choices to find insights into what drives memory.

Examining the Effectiveness of Chatbots in Gathering Family History Information in Comparison to the Standard In-Person Interview-Based Approach

  • paper_url: http://arxiv.org/abs/2309.03223
  • repo_url: None
  • paper_authors: Kieron Drumm, Vincent Tran
  • for: The paper presents a chatbot-based approach for gathering family histories, intended as a valuable tool for genealogists, especially when dealing with interviewees who are based in other countries.
  • methods: The chatbot guides the interviewee through providing their family history information; its performance and usability are compared with two alternatives: using ancestry.com and in-person interviews.
  • results: The chatbot-based approach produces fewer mistakes and less user confusion about the UI and process than the other two methods, although the average time taken to conduct an interview is longer.
    Abstract One of the most common things that a genealogist is tasked with is the gathering of a person's initial family history, normally via in-person interviews or with the use of a platform such as ancestry.com, as this can provide a strong foundation upon which a genealogist may build. However, the ability to conduct these interviews can often be hindered by both geographical constraints and the technical proficiency of the interviewee, as the interviewee in these types of interviews is most often an elderly person with a lower than average level of technical proficiency. With this in mind, this study presents what we believe, based on prior research, to be the first chatbot geared entirely towards the gathering of family histories, and explores the viability of utilising such a chatbot by comparing the performance and usability of such a method with the aforementioned alternatives. With a chatbot-based approach, we show that, though the average time taken to conduct an interview may be longer than if the user had used ancestry.com or participated in an in-person interview, the number of mistakes made and the level of confusion from the user regarding the UI and process required is lower than the other two methods. Note that the final metric regarding the user's confusion is not applicable for the in-person interview sessions due to its lack of a UI. With refinement, we believe this use of a chatbot could be a valuable tool for genealogists, especially when dealing with interviewees who are based in other countries where it is not possible to conduct an in-person interview.

When Do Discourse Markers Affect Computational Sentence Understanding?

  • paper_url: http://arxiv.org/abs/2309.00368
  • repo_url: None
  • paper_authors: Ruiqi Li, Liesbeth Allein, Damien Sileo, Marie-Francine Moens
  • for: This chapter examines the ability of natural language processing (NLP) systems to understand English discourse connectives.
  • methods: Nine popular NLP systems are evaluated on their comprehension of English discourse connectives, analyzing how context and language-understanding tasks affect connective comprehension and whether the computational processing complexity of different connective types matches the complexity order found in human processing.
  • results: NLP systems do not process all discourse connectives equally well, and their processing complexity is not always consistent with the presumed order in human processing; while humans may be influenced during reading without their final comprehension suffering, discourse connectives significantly affect the final accuracy of NLP systems, and the richer a system's knowledge of connectives, the more inappropriate connectives hurt it, underscoring the importance of correctly explicating discourse connectives in computational NLP.
    Abstract The capabilities and use cases of automatic natural language processing (NLP) have grown significantly over the last few years. While much work has been devoted to understanding how humans deal with discourse connectives, this phenomenon is understudied in computational systems. Therefore, it is important to put NLP models under the microscope and examine whether they can adequately comprehend, process, and reason within the complexity of natural language. In this chapter, we introduce the main mechanisms behind automatic sentence processing systems step by step and then focus on evaluating discourse connective processing. We assess nine popular systems in their ability to understand English discourse connectives and analyze how context and language understanding tasks affect their connective comprehension. The results show that NLP systems do not process all discourse connectives equally well and that the computational processing complexity of different connective kinds is not always consistently in line with the presumed complexity order found in human processing. In addition, while humans are more inclined to be influenced during the reading procedure but not necessarily in the final comprehension performance, discourse connectives have a significant impact on the final accuracy of NLP systems. The richer knowledge of connectives a system learns, the more negative effect inappropriate connectives have on it. This suggests that the correct explicitation of discourse connectives is important for computational natural language processing.

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

  • paper_url: http://arxiv.org/abs/2309.00359
  • repo_url: None
  • paper_authors: Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy
  • for: This paper proposes Large Content and Behavior Models (LCBMs) to address receiver behavior simulation, content simulation, behavior understanding, and behavior domain adaptation.
  • methods: The paper builds on Large Language Models (LLMs) as the base model and reintroduces "behavior tokens" (shares, likes, clicks, purchases, retweets, etc.) into the training corpora to extend the models' capabilities.
  • results: Experiments show LCBMs perform well across content understanding, behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. The paper also releases a new Content Behavior Corpus (CBC) to spur further research.
    Abstract Shannon, in his seminal paper introducing information theory, divided communication into three levels: technical, semantic, and effectiveness. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing wide generalization capabilities across a wide range of tasks, are unable to solve for this. One reason for the underperformance could be a lack of "behavior tokens" in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, etc. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.
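
The "behavior token" idea is easy to illustrate: receiver behavior (shares, likes, clicks, purchases, retweets) is serialized into tokens alongside the communication text instead of being stripped as noise during preprocessing. The sketch below shows one plausible serialization; the tag and token names are illustrative assumptions, not the schema of the paper's Content Behavior Corpus.

```python
# Sketch: serializing receiver behavior into "behavior tokens" appended to
# the communication text for LLM training. The tag and token names below are
# illustrative assumptions, not the paper's actual CBC schema.

def to_training_example(message: str, behavior: dict) -> str:
    """Render a (message, receiver-behavior) pair as one training string."""
    behavior_tokens = " ".join(
        f"<{name}={count}>" for name, count in sorted(behavior.items())
    )
    return f"<content> {message} </content> <behavior> {behavior_tokens} </behavior>"

print(to_training_example(
    "Introducing our new solar-powered backpack!",
    {"likes": 1520, "shares": 87, "clicks": 3401},
))
# <content> Introducing our new solar-powered backpack! </content>
# <behavior> <clicks=3401> <likes=1520> <shares=87> </behavior>
```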

Comparative Topic Modeling for Determinants of Divergent Report Results Applied to Macular Degeneration Studies

  • paper_url: http://arxiv.org/abs/2309.00312
  • repo_url: None
  • paper_authors: Lucas Cassiel Jacaruso
  • for: This work analyzes reports with divergent results on the same general research question through comparative topic modeling, to find topics exhibiting distinct associations with significant results.
  • methods: A topic modeling approach that classifies and ranks topics across reports, scoring them by their proportional occurrence and consistency of distribution across reports of significant results.
  • results: Eight supplemental compounds were identified as having a particular association with reports of significant benefit for macular degeneration; six (omega-3 fatty acids, copper, zeaxanthin, lutein, zinc, and nitrates) were supported by a follow-up validation search, while the two unsupported compounds (niacin and molybdenum) also received the lowest scores, suggesting the proposed score is a viable proxy for a topic's association with the outcome of interest.
    Abstract Topic modeling and text mining are subsets of Natural Language Processing with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, the above NLP methods are conventionally used for topic-specific literature searches or extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to find topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence and consistency of distribution across reports of significant results. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration (MD). Eight compounds were identified as having a particular association with reports of significant results for benefitting MD. Six of these were further supported in terms of effectiveness upon conducting a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, lutein, zinc, and nitrates). The two not supported by the follow-up literature search (niacin and molybdenum) also had the lowest scores under the proposed method's ranking system, suggesting that the proposed method's score for a given topic is a viable proxy for its degree of association with the outcome of interest. These results underpin the proposed method's potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a systematic and scalable way.
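
A minimal sketch of the comparative idea: fit one topic model over all reports, then score each topic by how strongly and how consistently it occurs in the significant-result subset. The scoring formula and toy documents below are illustrative stand-ins for the paper's ranking system and corpus.

```python
# Toy comparative topic scoring: fit LDA over all reports, then rank topics
# by mean occurrence in significant-result reports penalized by inconsistency.
# The score is an illustrative stand-in, not the paper's exact metric.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

significant = ["zinc lutein supplement improved vision outcomes",
               "omega-3 fatty acids slowed macular degeneration progression"]
other = ["niacin supplement showed no significant benefit for vision"]
docs = significant + other

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
theta = lda.transform(X)                     # document-topic proportions

sig = theta[: len(significant)]
score = sig.mean(axis=0) - sig.std(axis=0)   # frequent AND consistent topics
for t in np.argsort(score)[::-1]:
    print(f"topic {t}: score = {score[t]:.3f}")
```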

Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training

  • paper_url: http://arxiv.org/abs/2309.00284
  • repo_url: None
  • paper_authors: Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng
  • for: Enhancing the vocal range of single-speaker singing voice synthesis (SVS).
  • methods: A melody-unsupervised multi-speaker pre-training method on a multi-singer dataset that does not degrade timbre similarity.
  • results: Improved sound quality and rhythm naturalness of the synthesized singing voice over the baseline.
    Abstract The single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that are out of the singer's vocal range or associated with limited training samples. Based on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method conducted on a multi-singer dataset to enhance the vocal range of the single-speaker, while not degrading the timbre similarity. This pre-training method can be deployed to a large-scale multi-singer dataset, which only contains audio-and-lyrics pairs without phonemic timing information and pitch annotation. Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information. These pre-trained model parameters are delivered into the fine-tuning step as prior knowledge to enhance the single speaker's vocal range. Moreover, this work also contributes to improving the sound quality and rhythm naturalness of the synthesized singing voices. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice, and a bi-directional flow model to improve the sound quality. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
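
The pre-training step estimates frame-level f0 directly from the audio, since the multi-singer corpus carries no pitch annotation. Below is a sketch using librosa's pYIN extractor, one common open-source choice; the paper does not name the estimator it uses.

```python
# Sketch: frame-level f0 estimation from raw audio, supplying pitch
# information when no annotations exist. pYIN is one common extractor.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))   # any mono waveform works here
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0 = np.nan_to_num(f0)                        # unvoiced frames -> 0 Hz
print(f0.shape, float(f0[voiced_flag].mean()))  # one f0 value per frame
```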

Why do universal adversarial attacks work on large language models?: Geometry might be the answer

  • paper_url: http://arxiv.org/abs/2309.00254
  • repo_url: None
  • paper_authors: Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
  • for: This work seeks to explain the internal mechanism behind adversarial attacks on large language models, in particular gradient-based universal adversarial attacks.
  • methods: A novel geometric perspective on universal adversarial attacks; by attacking the 117M-parameter GPT-2 model, the authors find evidence that universal adversarial triggers could be embedding vectors that merely approximate the semantic information in their adversarial training region.
  • results: White-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations supports this hypothesis. This geometric perspective may give deeper insight into the internal workings and failure modes of LLMs and thereby enable their mitigation.
    Abstract Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation.
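
The hypothesis suggests a simple geometric check: compare a candidate trigger's embedding against the mean embedding of tokens drawn from the adversarial target region. The sketch below does this for GPT-2 with cosine similarity; the strings standing in for the "trigger" and the "region" are illustrative only.

```python
# Sketch of the geometric check implied by the paper's hypothesis: is a
# trigger's mean embedding closer to the target region than a benign phrase?
# The example strings are placeholders, not triggers from the paper.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
emb = GPT2Model.from_pretrained("gpt2").get_input_embeddings().weight

def mean_embedding(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    return emb[ids].mean(dim=0)

region = mean_embedding("toxic abusive hateful insult")  # stand-in "region"
trigger = mean_embedding("vile nasty crude slur")        # stand-in "trigger"
benign = mean_embedding("sunny pleasant weather today")

cos = torch.nn.functional.cosine_similarity
print(cos(trigger, region, dim=0).item(), cos(benign, region, dim=0).item())
```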

Detecting Suicidality in Arabic Tweets Using Machine Learning and Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2309.00246
  • repo_url: None
  • paper_authors: Asma Abdulsalam, Areej Alhothali, Saleh Al-Ghamdi
  • For: The paper aims to develop a novel dataset of Arabic tweets related to suicidal thoughts and use machine learning and deep learning models to automatically detect suicidal ideation in these tweets.
  • Methods: The paper uses a variety of machine learning models, including Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, and XGBoost, trained on word frequency and word embedding features, as well as pre-trained deep learning models such as AraBert, AraELECTRA, and AraGPT2, to identify suicidal thoughts in Arabic tweets.
  • Results: The SVM and RF models trained on character n-gram features provided the best performance among the machine learning models, with an accuracy of 86% and an F1 score of 79%. The AraBert model outperforms the other machine and deep learning models, achieving an accuracy of 91% and an F1 score of 88%, significantly improving the detection of suicidal ideation in the Arabic tweets dataset.
    Abstract Social media platforms have revolutionized traditional communication techniques by enabling people globally to connect instantaneously, openly, and frequently. People use social media to share personal stories and express their opinion. Negative emotions such as thoughts of death, self-harm, and hardship are commonly expressed on social media, particularly among younger generations. As a result, using social media to detect suicidal thoughts will help provide proper intervention that will ultimately deter others from self-harm and committing suicide and stop the spread of suicidal ideation on social media. To investigate the ability to detect suicidal thoughts in Arabic tweets automatically, we developed a novel Arabic suicidal tweets dataset, examined several machine learning models, including Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, and XGBoost, trained on word frequency and word embedding features, and investigated the ability of pre-trained deep learning models, AraBert, AraELECTRA, and AraGPT2, to identify suicidal thoughts in Arabic tweets. The results indicate that SVM and RF models trained on character n-gram features provided the best performance in the machine learning models, with 86% accuracy and an F1 score of 79%. The results of the deep learning models show that the AraBert model outperforms other machine and deep learning models, achieving an accuracy of 91% and an F1-score of 88%, which significantly improves the detection of suicidal ideation in the Arabic tweets dataset. To the best of our knowledge, this is the first study to develop an Arabic suicidality detection dataset from Twitter and to use deep-learning approaches in detecting suicidality in Arabic posts.
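
A minimal sketch of the best-performing classical setup reported, an SVM on character n-gram features, using scikit-learn with placeholder data; the abstract does not give the exact n-gram range or vectorizer settings, so those below are assumptions.

```python
# Sketch: SVM over TF-IDF character n-grams, the strongest classical model
# reported. The (2, 5) n-gram range and the tiny dataset are placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["tweet expressing deep despair", "tweet about the weather",
         "tweet mentioning self harm", "tweet about a football match"]
labels = [1, 0, 1, 0]   # 1 = suicidal ideation, 0 = not (placeholder data)

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(),
).fit(texts, labels)

print(clf.predict(["a tweet hinting at despair"]))
```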

NeuroSurgeon: A Toolkit for Subnetwork Analysis

  • paper_url: http://arxiv.org/abs/2309.00244
  • repo_url: https://github.com/mlepori1/neurosurgeon
  • paper_authors: Michael A. Lepori, Ellie Pavlick, Thomas Serre
  • for: Understanding the algorithms that trained neural network models learn to represent.
  • methods: NeuroSurgeon, a Python library for discovering and manipulating subnetworks within models in the Huggingface Transformers library.
  • results: A freely available toolkit that helps researchers inspect and modify neural network models.
    Abstract Despite recent advances in the field of explainability, much remains unknown about the algorithms that neural networks learn to represent. Recent work has attempted to understand trained models by decomposing them into functional circuits (Csordás et al., 2020; Lepori et al., 2023). To advance this research, we developed NeuroSurgeon, a python library that can be used to discover and manipulate subnetworks within models in the Huggingface Transformers library (Wolf et al., 2019). NeuroSurgeon is freely available at https://github.com/mlepori1/NeuroSurgeon.

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

  • paper_url: http://arxiv.org/abs/2309.00236
  • repo_url: https://github.com/euanong/image-hijacks
  • paper_authors: Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons
  • for: This work studies whether foundation models are secure against malicious actors, and discovers image hijacks: adversarial images that control generative models at runtime.
  • methods: Behaviour Matching, a general method for creating image hijacks, is introduced and used to explore three types of attacks (specific string, leak context, and jailbreak attacks).
  • results: Against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, all attack types achieve success rates above 90%. The attacks are automated and require only small image perturbations, raising serious concerns about the security of foundation models.
    Abstract Are foundation models secure from malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control generative models at runtime. We introduce Behaviour Matching, a general method for creating image hijacks, and we use it to explore three types of attacks. Specific string attacks generate arbitrary output of the adversary's choice. Leak context attacks leak information from the context window into the output. Jailbreak attacks circumvent a model's safety training. We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all our attack types have above a 90% success rate. Moreover, our attacks are automated and require only small image perturbations. These findings raise serious concerns about the security of foundation models. If image hijacks are as difficult to defend against as adversarial examples in CIFAR-10, then it might be many years before a solution is found -- if it even exists.
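
Attacks of this kind rest on gradient-based image optimization: perturb the pixels within a small budget so the model emits a chosen target output. The sketch below shows the generic PGD-style loop on a toy classifier; it is the underlying machinery only, not the paper's Behaviour Matching implementation or a VLM setup.

```python
# Generic PGD-style image optimization: nudge pixels by signed gradients so a
# toy model predicts an attacker-chosen target, keeping the perturbation small.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(                 # stand-in for a vision model
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)
target = torch.tensor([7])                   # output we want to induce
delta = torch.zeros_like(image, requires_grad=True)
eps, step = 8 / 255, 1 / 255                 # small-perturbation budget

for _ in range(40):
    loss = F.cross_entropy(model(image + delta), target)
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()    # descend: make the target likely
        delta.clamp_(-eps, eps)              # stay within the budget
    delta.grad.zero_()

print(model(image + delta).argmax().item())  # ideally prints 7
```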

JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning

  • paper_url: http://arxiv.org/abs/2309.00230
  • repo_url: https://github.com/kwanwaichung/jotr
  • paper_authors: Wai-Chung Kwan, Huimin Wang, Hongru Wang, Zezhong Wang, Xian Wu, Yefeng Zheng, Kam-Fai Wong
  • for: This work proposes a new dialogue policy learning method that improves the performance and diversity of dialogue models.
  • methods: JoTR, a text-to-text Transformer-based model that generates flexible dialogue actions through a word-level policy, without action templates, trained efficiently with reinforcement learning and a reward-shaping mechanism.
  • results: In extensive evaluation, JoTR achieves state-of-the-art performance on two benchmark dialogue modelling tasks, as assessed by both user simulators and human evaluators.
    Abstract Dialogue policy learning (DPL) is a crucial component of dialogue modelling. Its primary role is to determine the appropriate abstract response, commonly referred to as the "dialogue action". Traditional DPL methodologies have treated this as a sequential decision problem, using pre-defined action candidates extracted from a corpus. However, these incomplete candidates can significantly limit the diversity of responses and pose challenges when dealing with edge cases, which are scenarios that occur only at extreme operating parameters. To address these limitations, we introduce a novel framework, JoTR. This framework is unique as it leverages a text-to-text Transformer-based model to generate flexible dialogue actions. Unlike traditional methods, JoTR formulates a word-level policy that allows for a more dynamic and adaptable dialogue action generation, without the need for any action templates. This setting enhances the diversity of responses and improves the system's ability to handle edge cases effectively. In addition, JoTR employs reinforcement learning with a reward-shaping mechanism to efficiently finetune the word-level dialogue policy, which allows the model to learn from its interactions, improving its performance over time. We conducted an extensive evaluation of JoTR to assess its effectiveness. Our extensive evaluation shows that JoTR achieves state-of-the-art performance on two benchmark dialogue modelling tasks, as assessed by both user simulators and human evaluators.

The FruitShell French synthesis system at the Blizzard 2023 Challenge

  • paper_url: http://arxiv.org/abs/2309.00223
  • repo_url: None
  • paper_authors: Xin Qi, Xiaopeng Wang, Zhiyong Wang, Wang Liu, Mingming Ding, Shuchen Shi
  • for: This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023, which comprises two tasks: generating high-quality speech from female speakers, and generating speech that closely resembles specific individuals.
  • methods: The competition data were screened to remove missing or erroneous text; all symbols other than phonemes were organized and symbols with no pronunciation or zero duration were removed; word-boundary and start/end symbols were added to the text, which improves speech quality. For the Spoke task, data augmentation followed the competition rules. An open-source G2P model transcribed the French text into phonemes; since the G2P model uses the International Phonetic Alphabet (IPA), the provided competition data were transcribed the same way for standardization, and all phonemes were then converted by rule into the phonetic scheme used in the competition data. Finally, all competition audio was resampled to a uniform 16 kHz.
  • results: The system obtained a quality MOS score of 3.6 on the Hub task and 3.4 on the Spoke task, placing it at an average level among all participating teams.
    Abstract This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals. Regarding the competition data, we conducted a screening process to remove missing or erroneous text data. We organized all symbols except for phonemes and eliminated symbols that had no pronunciation or zero duration. Additionally, we added word boundary and start/end symbols to the text, which we have found to improve speech quality based on our previous experience. For the Spoke task, we performed data augmentation according to the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. As the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, due to compiler limitations in recognizing special symbols from the IPA chart, we followed the rules to convert all phonemes into the phonetic scheme used in the competition data. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with the hifigan vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. The evaluation results of our system showed a quality MOS score of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.
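
The G2P step (French text to IPA phonemes) can be illustrated with the open-source phonemizer package, which wraps the espeak-ng backend; the abstract does not name the specific G2P model the team used, so this is only a stand-in.

```python
# Sketch of the grapheme-to-phoneme step: French text -> IPA phonemes.
# Requires the phonemizer package and an espeak-ng install; this is a common
# open-source G2P, not necessarily the model the team actually used.
from phonemizer import phonemize

text = "Bonjour, comment allez-vous ?"
ipa = phonemize(text, language="fr-fr", backend="espeak", strip=True)
print(ipa)   # IPA string; exact output depends on the espeak-ng version
```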

Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding

  • paper_url: http://arxiv.org/abs/2309.00215
  • repo_url: https://github.com/JoshuaFeinglass/VL-detector-eval
  • paper_authors: Joshua Feinglass, Yezhou Yang
  • for: This work examines whether the evaluation protocol for object proposal generators is aligned with performance on Vision-Language (VL) tasks, and how semantic grounding can mitigate the misalignment.
  • methods: A new evaluation approach that scores object proposals against only a subset of available annotations, selected by thresholding an annotation importance score derived from semantic information extracted from the text describing the image.
  • results: The method is consistent and shows greatly improved alignment with annotations selected by image captioning metrics and human annotation, compared with existing techniques.
    Abstract Object proposal generation serves as a standard pre-processing step in Vision-Language (VL) tasks (image captioning, visual question answering, etc.). The performance of object proposals generated for VL tasks is currently evaluated across all available annotations, a protocol that we show is misaligned - higher scores do not necessarily correspond to improved performance on downstream VL tasks. Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects. To this end, we propose evaluating object proposals against only a subset of available annotations, selected by thresholding an annotation importance score. Importance of object annotations to VL tasks is quantified by extracting relevant semantic information from text describing the image. We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation when compared against existing techniques. Lastly, we compare current detectors used in the Scene Graph Generation (SGG) benchmark as a use case, which serves as an example of when traditional object proposal evaluation techniques are misaligned.

Exploring the law of text geographic information

  • paper_url: http://arxiv.org/abs/2309.00180
  • repo_url: None
  • paper_authors: Zhenhua Wang, Daiyu Zhang, Ming Ren, Guang Xu
  • for: This paper explores the distributional law of textual geographic information and the limits of human utilization of it.
  • methods: Rigorous experiments on a diverse range of 24 datasets spanning different languages and types, testing the hypothesis that geographic information conforms to the Gamma distribution.
  • results: Geographic information follows a Gamma distribution in the dimensions of quantity, length, and distance, shaped by human behavior, cognition, expression, and thought processes. Comparisons with Gaussian distributions and Zipf's law refute the contingency of these regularities, and estimated upper bounds on human utilization of geographic information point toward uncharted territory.
    Abstract Textual geographic information is indispensable and heavily relied upon in practical applications. The absence of clear distribution poses challenges in effectively harnessing geographic information, thereby driving our quest for exploration. We contend that geographic information is influenced by human behavior, cognition, expression, and thought processes, and given our intuitive understanding of natural systems, we hypothesize its conformity to the Gamma distribution. Through rigorous experiments on a diverse range of 24 datasets encompassing different languages and types, we have substantiated this hypothesis, unearthing the underlying regularities governing the dimensions of quantity, length, and distance in geographic information. Furthermore, theoretical analyses and comparisons with Gaussian distributions and Zipf's law have refuted the contingency of these laws. Significantly, we have estimated the upper bounds of human utilization of geographic information, pointing towards the existence of uncharted territories. Also, we provide guidance on geographic information extraction. We hope this work helps lift the veil of geographic information and reveal its true countenance.
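
The central statistical claim, that measurements of textual geographic information follow a Gamma distribution, is easy to sketch as a fit-and-test loop; the data below is synthetic, standing in for measurements such as per-document place-name counts, lengths, or distances.

```python
# Sketch: fit a Gamma distribution to a positive-valued text statistic and
# check the fit. Synthetic data stands in for real geographic measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lengths = rng.gamma(shape=2.0, scale=3.0, size=5000)  # stand-in observations

a, loc, scale = stats.gamma.fit(lengths, floc=0)      # fix location at zero
ks = stats.kstest(lengths, "gamma", args=(a, loc, scale))
print(f"shape={a:.2f} scale={scale:.2f} KS p-value={ks.pvalue:.3f}")
```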

Will Sentiment Analysis Need Subculture? A New Data Augmentation Approach

  • paper_url: http://arxiv.org/abs/2309.00178
  • repo_url: None
  • paper_authors: Zhenhua Wang, Simin He, Guang Xu, Ming Ren
  • for: Addressing the insufficient training data faced by sentiment analysis through the lens of subculture.
  • methods: A subculture-based data augmentation approach (SCDA) that generates six enhanced texts for each training text by leveraging six diverse subculture expression generators.
  • results: Extensive experiments attest to the effectiveness and potential of SCDA. The results also show that disparate subculture expressions elicit varying degrees of sentiment stimulation, and an intriguing conjecture arises suggesting the linear reversibility of certain subculture expressions.
    Abstract The renowned proverb that "The pen is mightier than the sword" underscores the formidable influence wielded by text expressions in shaping sentiments. Indeed, well-crafted writing can deeply resonate within cultures, conveying profound sentiments. Nowadays, the omnipresence of the Internet has fostered a subculture that congregates around the contemporary milieu. The subculture artfully articulates the intricacies of human feelings by ardently pursuing the allure of novelty, a fact that cannot be disregarded in sentiment analysis. This paper strives to enrich data through the lens of subculture, to address the insufficient training data faced by sentiment analysis. To this end, a new approach of subculture-based data augmentation (SCDA) is proposed, which engenders six enhanced texts for each training text by leveraging the creation of six diverse subculture expression generators. The extensive experiments attest to the effectiveness and potential of SCDA. The results also shed light on the phenomenon that disparate subculture expressions elicit varying degrees of sentiment stimulation. Moreover, an intriguing conjecture arises, suggesting the linear reversibility of certain subculture expressions. It is our fervent aspiration that this study serves as a catalyst in fostering heightened perceptiveness towards the tapestry of information, sentiment and culture, thereby enriching our collective understanding.

cs.LG - 2023-09-01

Universal Normalization Enhanced Graph Representation Learning for Gene Network Prediction

  • paper_url: http://arxiv.org/abs/2309.00738
  • repo_url: None
  • paper_authors: Zehao Dong, Muhan Zhang, Qihang Zhao, Philip R. O. Payne, Michael Province, Carlos Cruchaga, Tianyu Zhao, Yixin Chen, Fuhai Li
  • for: This paper aims to improve gene network representation learning in bioinformatics by normalizing gene networks, enhancing both the stability and the expressive power of graph representation models.
  • methods: A novel UNGNN (Universal Normalized GNN) framework that applies universal graph normalization in both the message-passing phase and the readout layer.
  • results: On gene-network-based bioinformatics tasks, UNGNN significantly outperforms popular GNN benchmarks, with an average performance improvement of 16% over previous state-of-the-art baselines, and it also achieves superior performance on other graph datasets where universal graph normalization is solvable.
    Abstract Effective gene network representation learning is of great importance in bioinformatics to predict/understand the relation of gene profiles and disease phenotypes. Though graph neural networks (GNNs) have been the dominant architecture for analyzing various graph-structured data like social networks, their predictions on gene networks often exhibit subpar performance. In this paper, we formally investigate the gene network representation learning problem and characterize a notion of universal graph normalization, where graph normalization can be applied in a universal manner to maximize the expressive power of GNNs while maintaining stability. We propose a novel UNGNN (Universal Normalized GNN) framework, which leverages universal graph normalization in both the message passing phase and readout layer to enhance the performance of a base GNN. UNGNN has a plug-and-play property and can be combined with any GNN backbone in practice. A comprehensive set of experiments on gene-network-based bioinformatical tasks demonstrates that our UNGNN model significantly outperforms popular GNN benchmarks and provides an overall performance improvement of 16% on average compared to previous state-of-the-art (SOTA) baselines. Furthermore, we also evaluate our theoretical findings on other graph datasets where the universal graph normalization is solvable, and we observe that UNGNN consistently achieves superior performance.
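
To see where graph normalization enters message passing, here is a plain mean-aggregation layer that normalizes the aggregated messages before the update step; it is a generic layer for intuition, not the paper's universal graph normalization.

```python
# Generic message-passing layer with normalization applied to aggregated
# messages; illustrates the placement of normalization, not UNGNN itself.
import torch

class NormalizedGraphLayer(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)
        self.norm = torch.nn.LayerNorm(dim)    # normalization of messages

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msgs = adj @ x / deg                   # mean-aggregate neighbours
        return torch.relu(self.lin(self.norm(msgs)) + x)

x = torch.randn(5, 16)                         # 5 nodes, 16 features each
adj = (torch.rand(5, 5) > 0.5).float()         # random adjacency matrix
print(NormalizedGraphLayer(16)(x, adj).shape)  # torch.Size([5, 16])
```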

Prediction Error Estimation in Random Forests

  • paper_url: http://arxiv.org/abs/2309.00736
  • repo_url: https://github.com/iankrupkin/Prediction-Error-Estimation-in-Random-Forests
  • paper_authors: Ian Krupkin, Johanna Hardin
  • for: This paper studies error estimation in classification Random Forests.
  • methods: Building on the theoretical framework of Bates et al. (2023), the true error rate and expected error rate are investigated theoretically and empirically across a variety of error estimation methods common to Random Forests.
  • results: In the classification case, Random Forests' estimate of prediction error is on average closer to the true error rate than to the average prediction error, the opposite of what Bates et al. (2023) found for logistic regression. The result holds across error estimation strategies such as cross-validation, bagging, and data splitting.
    Abstract In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error is closer on average to the true error rate instead of the average prediction error. This is opposite the findings of Bates et al. (2023) which were given for logistic regression. We further show that this result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
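
The comparison at the heart of the paper is easy to reproduce in miniature: contrast a Random Forest's out-of-bag and cross-validation error estimates with the error measured on a large independent test set, which serves as a proxy for the true error rate on synthetic data.

```python
# Sketch: compare a Random Forest's error estimates (OOB, 10-fold CV) with
# the error on a large held-out set, a proxy for the true error rate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20500, flip_y=0.1, random_state=0)
X_tr, y_tr, X_te, y_te = X[:500], y[:500], X[500:], y[500:]

rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X_tr, y_tr)
oob_err = 1 - rf.oob_score_
cv_err = 1 - cross_val_score(
    RandomForestClassifier(random_state=0), X_tr, y_tr, cv=10).mean()
true_err = 1 - rf.score(X_te, y_te)           # proxy for the true error rate
print(f"OOB={oob_err:.3f}  CV={cv_err:.3f}  true={true_err:.3f}")
```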

Tempestas ex machina: A review of machine learning methods for wavefront control

  • paper_url: http://arxiv.org/abs/2309.00730
  • repo_url: None
  • paper_authors: J. Fowler, Rico Landman
  • for: This review looks toward next-generation adaptive optics systems for imaging rocky Earth-like planets, where better wavefront control algorithms can improve image quality without requiring faster, more sensitive detectors or more capable deformable mirrors.
  • methods: A survey of machine learning methods for wavefront control, defining machine learning in its simplest terms and covering the most common methods applied to this problem.
  • results: The review summarizes the linear machine learning implementations developed for wavefront control over roughly the last 30 years and the literature on novel machine learning approaches to wavefront control.
    Abstract As we look to the next generation of adaptive optics systems, now is the time to develop and explore the technologies that will allow us to image rocky Earth-like planets; wavefront control algorithms are not only a crucial component of these systems, but can benefit our adaptive optics systems without requiring increased detector speed and sensitivity or more effective and efficient deformable mirrors. To date, most observatories run the workhorse of their wavefront control as a classic integral controller, which estimates a correction from wavefront sensor residuals, and attempts to apply that correction as fast as possible in closed-loop. An integrator of this nature fails to address temporal lag errors that evolve over scales faster than the correction time, as well as vibrations or dynamic errors within the system that are not encapsulated in the wavefront sensor residuals; these errors impact high contrast imaging systems with complex coronagraphs. With the rise in popularity of machine learning, many are investigating applying modern machine learning methods to wavefront control. Furthermore, many linear implementations of machine learning methods (under varying aliases) have been in development for wavefront control for the last 30-odd years. With this work we define machine learning in its simplest terms, explore the most common machine learning methods applied in the context of this problem, and present a review of the literature concerning novel machine learning approaches to wavefront control.
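
The "workhorse" the review describes, a classic integral controller, fits in a few lines: each frame, accumulate a correction proportional to the wavefront-sensor residual and apply it in closed loop. A scalar toy simulation follows (the gain and disturbance model are illustrative).

```python
# Minimal closed-loop integral controller: accumulate gain * residual each
# frame. Scalar toy turbulence; real AO systems act on full wavefront modes.
import numpy as np

rng = np.random.default_rng(1)
gain, correction = 0.4, 0.0
turbulence = np.cumsum(rng.normal(0, 0.05, 500))  # slowly drifting aberration

residuals = []
for t in range(500):
    residual = turbulence[t] - correction         # what the sensor measures
    correction += gain * residual                 # the integral update
    residuals.append(residual)

print(f"open-loop RMS = {np.std(turbulence):.3f}, "
      f"closed-loop RMS = {np.std(residuals):.3f}")
```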

Learning Shared Safety Constraints from Multi-task Demonstrations

  • paper_url: http://arxiv.org/abs/2309.00711
  • repo_url: None
  • paper_authors: Konwoo Kim, Gokul Swamy, Zuxin Liu, Ding Zhao, Sanjiban Choudhury, Zhiwei Steven Wu
  • for: Learning shared safety constraints so that agents avoid unsafe behavior while completing their tasks.
  • methods: Constraint learning from expert demonstrations of safe task completion, by extending inverse reinforcement learning (IRL) techniques to the space of constraints; intuitively, the learned constraints forbid highly rewarding behavior that the expert could have taken but chose not to.
  • results: Leveraging diverse demonstrations that naturally occur in multi-task settings yields a tighter set of constraints, avoiding the overly conservative constraints that would forbid everything the expert did not do; the method is validated in simulation on high-dimensional continuous control tasks.
    Abstract Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks.

Randomized Polar Codes for Anytime Distributed Machine Learning

  • paper_url: http://arxiv.org/abs/2309.00682
  • repo_url: None
  • paper_authors: Burak Bartan, Mert Pilanci
  • for: This paper proposes a new distributed computing framework that is robust to slow compute nodes and supports both approximate and exact computation of linear operations.
  • methods: The mechanism integrates randomized sketching and polar codes in the context of coded computation, with a sequential decoding algorithm that handles real-valued data at low computational complexity, plus an anytime estimator that produces provably accurate estimates even when the set of available node outputs is not decodable.
  • results: The framework's practicality is demonstrated on applications such as large-scale matrix multiplication and black-box optimization, with an implementation on a serverless cloud computing system and numerical results, including ImageNet-scale computations, demonstrating its scalability in practice.
    Abstract We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations. The proposed mechanism integrates the concepts of randomized sketching and polar codes in the context of coded computation. We propose a sequential decoding algorithm designed to handle real valued data while maintaining low computational complexity for recovery. Additionally, we provide an anytime estimator that can generate provably accurate estimates even when the set of available node outputs is not decodable. We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization. We present the implementation of these methods on a serverless cloud computing system and provide numerical results to demonstrate their scalability in practice, including ImageNet scale computations.

Bayesian deep learning for cosmic volumes with modified gravity

  • paper_url: http://arxiv.org/abs/2309.00612
  • repo_url: https://github.com/JavierOrjuela/Bayesian-Neural-Net-with-MNFs-for-f-R-
  • paper_authors: Jorge Enrique García-Farieta, Héctor J Hortúa, Francisco-Shu Kitaura
  • For: This paper aims to extract cosmological parameters from modified gravity (MG) simulations using deep neural networks, with a focus on uncertainty estimation.
  • Methods: Bayesian neural networks (BNNs) with an enriched approximate posterior distribution, in two variants: one with a single Bayesian last layer (BLL), and one with Bayesian layers at all levels (FullB). The networks are trained with real-space density fields and power spectra from a suite of 2000 dark-matter-only particle-mesh $N$-body simulations.
  • Results: BNNs accurately predict $\Omega_m$ and $\sigma_8$ and their correlation with the MG parameter, and yield well-calibrated uncertainty estimates. The MG parameter introduces a significant degeneracy with $\sigma_8$, and ignoring MG shifts the relative errors in $\Omega_m$ and $\sigma_8$ by at least 30%.
    Abstract The new generation of galaxy surveys will provide unprecedented data allowing us to test gravity at cosmological scales. A robust cosmological analysis of the large-scale structure demands exploiting the nonlinear information encoded in the cosmic web. Machine Learning techniques provide such tools, however, do not provide a priori assessment of uncertainties. This study aims at extracting cosmological parameters from modified gravity (MG) simulations through deep neural networks endowed with uncertainty estimations. We implement Bayesian neural networks (BNNs) with an enriched approximate posterior distribution considering two cases: one with a single Bayesian last layer (BLL), and another one with Bayesian layers at all levels (FullB). We train both BNNs with real-space density fields and power-spectra from a suite of 2000 dark matter only particle mesh $N$-body simulations including modified gravity models relying on MG-PICOLA covering 256 $h^{-1}$ Mpc side cubical volumes with 128$^3$ particles. BNNs excel in accurately predicting parameters for $\Omega_m$ and $\sigma_8$ and their respective correlation with the MG parameter. We find out that BNNs yield well-calibrated uncertainty estimates overcoming the over- and under-estimation issues in traditional neural networks. We observe that the presence of MG parameter leads to a significant degeneracy with $\sigma_8$ being one of the possible explanations of the poor MG predictions. Ignoring MG, we obtain a deviation of the relative errors in $\Omega_m$ and $\sigma_8$ by at least $30\%$. Moreover, we report consistent results from the density field and power spectra analysis, and comparable results between BLL and FullB experiments which permits us to save computing time by a factor of two. This work contributes in setting the path to extract cosmological parameters from complete small cosmic volumes towards the highly nonlinear regime.
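
The "Bayesian last layer" idea can be sketched with a mean-field variational linear layer: the weight means and log-variances are learned, and sampling weights at inference yields a predictive spread. The paper enriches the approximate posterior well beyond this simple form, so treat the snippet as the baseline intuition only.

```python
# Mean-field variational linear layer: sampled weights via reparameterization
# give predictive uncertainty. A simplification of the paper's BLL setup.
import torch

class BayesianLinear(torch.nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(d_out, d_in))
        self.logvar = torch.nn.Parameter(torch.full((d_out, d_in), -5.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.mu + torch.randn_like(self.mu) * (0.5 * self.logvar).exp()
        return x @ w.t()                     # one reparameterized weight draw

layer = BayesianLinear(64, 2)                # e.g. heads for Omega_m, sigma_8
features = torch.randn(8, 64)                # penultimate-layer features
samples = torch.stack([layer(features) for _ in range(100)])
print(samples.mean(0).shape, samples.std(0).shape)  # predictive mean, spread
```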

Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair

  • paper_url: http://arxiv.org/abs/2309.00608
  • repo_url: https://github.com/ise-uiuc/Repilot
  • paper_authors: Yuxiang Wei, Chunqiu Steven Xia, Lingming Zhang
  • for: This paper aims to improve automated program repair (APR), and in particular patch synthesis for general-purpose programming languages.
  • methods: Repilot, a framework that further copilots the AI "copilots" (i.e., large language models) by pairing an LLM with a Completion Engine: the LLM proposes tokens, and the Completion Engine prunes infeasible tokens and proactively completes tokens, so that more valid patches are generated during repair.
  • results: On subsets of the Defects4j 1.2 and 2.0 datasets, Repilot fixes 66 and 50 bugs respectively, surpassing the best-performing baseline by 14 and 16 fixed bugs. Moreover, under the same generation budget, Repilot produces more valid and correct patches than the base LLM.
    Abstract During Automated Program Repair (APR), it can be challenging to synthesize correct patches for real-world systems in general-purpose programming languages. Recent Large Language Models (LLMs) have been shown to be helpful "copilots" in assisting developers with various coding tasks, and have also been directly applied for patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantics constraints of the target programming language. This results in plenty of statically invalid generated patches, impeding the practicality of the technique. Therefore, we propose Repilot, a framework to further copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), resembling human writing programs, which can be significantly boosted and guided through a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely-used Defects4j 1.2 and 2.0 datasets shows that Repilot fixes 66 and 50 bugs, respectively, surpassing the best-performing baseline by 14 and 16 bugs fixed. More importantly, Repilot is capable of producing more valid and correct patches than the base LLM when given the same generation budget.
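
The LLM/Completion-Engine interplay can be shown with toy components: at each decoding step, tokens the engine rules infeasible are masked out of the LM distribution, and when exactly one token remains feasible the engine completes it outright. The random "LM" and bracket-balance "engine" below are illustrative, not Repilot's real components.

```python
# Toy constrained decoding: mask LM logits with a feasibility oracle, and let
# the "engine" auto-complete when a single token is feasible.
import numpy as np

VOCAB = ["x", "+", "(", ")", "<eos>"]

def lm_logits(prefix):
    return np.random.default_rng(len(prefix)).normal(size=len(VOCAB))

def feasible(prefix, tok):                     # toy "completion engine"
    s = "".join(prefix + [tok])
    depth = s.count("(") - s.count(")")
    return depth >= 0 and not (tok == "<eos>" and depth != 0)

prefix = ["("]
for _ in range(20):                            # cap length for the demo
    allowed = [i for i, t in enumerate(VOCAB) if feasible(prefix, t)]
    if len(allowed) == 1:                      # engine completes the token
        prefix.append(VOCAB[allowed[0]])
    else:
        mask = np.full(len(VOCAB), -np.inf)
        mask[allowed] = 0.0                    # prune infeasible tokens
        prefix.append(VOCAB[int(np.argmax(lm_logits(prefix) + mask))])
    if prefix[-1] == "<eos>":
        break

print("".join(prefix))                         # e.g. "(x+x)<eos>"
```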

Fast and Regret Optimal Best Arm Identification: Fundamental Limits and Low-Complexity Algorithms

  • paper_url: http://arxiv.org/abs/2309.00591
  • repo_url: None
  • paper_authors: Qining Zhang, Lei Ying
  • for: This paper considers a stochastic multi-armed bandit (MAB) problem with dual objectives: quickly identifying and committing to the optimal arm, and maximizing reward over a sequence of $T$ consecutive rounds.
  • methods: The paper introduces Regret Optimal Best Arm Identification (ROBAI) to capture both objectives, and proposes the $\mathsf{EOCP}$ algorithm and its variants, which achieve asymptotically optimal regret in both Gaussian and general bandits and commit to the optimal arm within $\mathcal{O}(\log T)$ rounds under a pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds under an adaptive stopping time.
  • results: Matching lower bounds on the commitment time (equivalently, sample complexity) of ROBAI show that $\mathsf{EOCP}$ and its variants are sample optimal with a pre-determined stopping time and almost sample optimal with an adaptive stopping time. Numerical results show $\mathsf{EOCP}$ attains smaller regret than the classic $\mathsf{UCB}$ algorithm even though it stops exploration much earlier, suggesting that over-exploration is unnecessary and potentially harmful to system performance.
    Abstract This paper considers a stochastic multi-armed bandit (MAB) problem with dual objectives: (i) quick identification and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of $T$ consecutive rounds. Though each objective has been individually well-studied, i.e., best arm identification for (i) and regret minimization for (ii), the simultaneous realization of both objectives remains an open problem, despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI) which aims to achieve these dual objectives. To solve ROBAI with both pre-determined stopping time and adaptive stopping time requirements, we present the $\mathsf{EOCP}$ algorithm and its variants respectively, which not only achieve asymptotic optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to sample complexity) of ROBAI, showing that $\mathsf{EOCP}$ and its variants are sample optimal with pre-determined stopping time, and almost sample optimal with adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting ``over-exploration'' phenomenon carried by classic $\mathsf{UCB}$ algorithms, such that $\mathsf{EOCP}$ has smaller regret even though it stops exploration much earlier than $\mathsf{UCB}$ ($\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$), which suggests over-exploration is unnecessary and potentially harmful to system performance.
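
The qualitative behaviour, stop exploring after roughly $\mathcal{O}(\log T)$ rounds and then commit, is easy to see with a generic explore-then-commit policy on a two-armed Gaussian bandit; the exploration schedule below is illustrative and not EOCP's actual stopping rule.

```python
# Generic explore-then-commit on a two-armed Gaussian bandit: explore for
# O(log T) rounds, then commit to the empirically best arm.
import numpy as np

rng = np.random.default_rng(0)
means, T = np.array([0.5, 0.7]), 10_000
explore = int(40 * np.log(T))                # illustrative O(log T) budget

counts, sums, reward = np.zeros(2), np.zeros(2), 0.0
for t in range(T):
    arm = t % 2 if t < explore else int(np.argmax(sums / counts))
    r = rng.normal(means[arm], 1.0)
    counts[arm] += 1; sums[arm] += r; reward += r

regret = T * means.max() - reward
print(f"committed after {explore} rounds, empirical regret = {regret:.1f}")
```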

PolyGET: Accelerating Polymer Simulations by Accurate and Generalizable Forcefield with Equivariant Transformer

  • paper_url: http://arxiv.org/abs/2309.00585
  • repo_url: None
  • paper_authors: Rui Feng, Huan Tran, Aubrey Toland, Binghong Chen, Qi Zhu, Rampi Ramprasad, Chao Zhang
  • for: This paper develops a new polymer forcefield model to improve both the accuracy and the efficiency of polymer simulations.
  • methods: PolyGET, a framework that uses Equivariant Transformers to capture complex quantum interactions between atoms and to generalize across polymer families, trained with a new force-centric objective that optimizes forces alone rather than jointly optimizing forces and energy.
  • results: On a large-scale dataset of 24 distinct polymer types, PolyGET achieves state-of-the-art force accuracy and robust MD simulations, and it can simulate large polymers with high fidelity to the reference ab initio DFT method while generalizing to unseen polymers.
    Abstract Polymer simulation with both accuracy and efficiency is a challenging task. Machine learning (ML) forcefields have been developed to achieve both the accuracy of ab initio methods and the efficiency of empirical force fields. However, existing ML force fields are usually limited to single-molecule settings, and their simulations are not robust enough. In this paper, we present PolyGET, a new framework for Polymer Forcefields with Generalizable Equivariant Transformers. PolyGET is designed to capture complex quantum interactions between atoms and generalize across various polymer families, using a deep learning model called Equivariant Transformers. We propose a new training paradigm that focuses exclusively on optimizing forces, which is different from existing methods that jointly optimize forces and energy. This simple force-centric objective function avoids competing objectives between energy and forces, thereby allowing for learning a unified forcefield ML model over different polymer families. We evaluated PolyGET on a large-scale dataset of 24 distinct polymer types and demonstrated state-of-the-art performance in force accuracy and robust MD simulations. Furthermore, PolyGET can simulate large polymers with high fidelity to the reference ab initio DFT method while being able to generalize to unseen polymers.

Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion

  • paper_url: http://arxiv.org/abs/2309.00584
  • repo_url: None
  • paper_authors: Zaynab Zahra, Zihao Li, Rosa Filgueira
  • for: This paper introduces Laminar, a new serverless framework based on dispel4py, a parallel stream-based dataflow library.
  • methods: Laminar efficiently manages streaming workflows and components through a dedicated registry, offering a seamless serverless experience.
  • results: Leveraging large language models, Laminar adds semantic code search, code summarization, and code completion; this contribution improves the execution of streaming computations, manages data streams more efficiently, and provides a valuable tool for researchers and practitioners.
    Abstract This paper introduces Laminar, a novel serverless framework based on dispel4py, a parallel stream-based dataflow library. Laminar efficiently manages streaming workflows and components through a dedicated registry, offering a seamless serverless experience. Leveraging large language models, Laminar enhances the framework with semantic code search, code summarization, and code completion. This contribution enhances serverless computing by simplifying the execution of streaming computations, managing data streams more efficiently, and offering a valuable tool for both researchers and practitioners.

Geometry-Informed Neural Operator for Large-Scale 3D PDEs

  • paper_url: http://arxiv.org/abs/2309.00583
  • repo_url: None
  • paper_authors: Zongyi Li, Nikola Borislavov Kovachki, Chris Choy, Boyi Li, Jean Kossaifi, Shourya Prakash Otta, Mohammad Amin Nabian, Maximilian Stadler, Christian Hundt, Kamyar Azizzadenesheli, Anima Anandkumar
  • for: This paper targets learning the solution operator of large-scale partial differential equations with varying geometries.
  • methods: The method uses signed distance functions and point-cloud representations of the input shape, together with neural operators based on graph and Fourier architectures, to learn the solution operator.
  • results: The approach applies efficiently to large-scale fluid dynamics simulation and delivers high accuracy across varying geometric parameters.
    Abstract We propose the geometry-informed neural operator (GINO), a highly efficient approach to learning the solution operator of large-scale partial differential equations with varying geometries. GINO uses a signed distance function and point-cloud representations of the input shape and neural operators based on graph and Fourier architectures to learn the solution operator. The graph neural operator handles irregular grids and transforms them into and from regular latent grids on which Fourier neural operator can be efficiently applied. GINO is discretization-convergent, meaning the trained model can be applied to arbitrary discretization of the continuous domain and it converges to the continuum operator as the discretization is refined. To empirically validate the performance of our method on large-scale simulation, we generate the industry-standard aerodynamics dataset of 3D vehicle geometries with Reynolds numbers as high as five million. For this large-scale 3D fluid simulation, numerical methods are expensive to compute surface pressure. We successfully trained GINO to predict the pressure on car surfaces using only five hundred data points. The cost-accuracy experiments show a $26,000 \times$ speed-up compared to optimized GPU-based computational fluid dynamics (CFD) simulators on computing the drag coefficient. When tested on new combinations of geometries and boundary conditions (inlet velocities), GINO obtains a one-fourth reduction in error rate compared to deep neural network approaches.
    摘要 我们提出几何感知神经算子(GINO),一种高效学习几何可变的大规模偏微分方程解算子的方法。GINO使用符号距离函数和点云来表示输入形状,并利用基于图与傅里叶架构的神经算子学习解算子。图神经算子可以处理不规则网格,并将其与规则潜在网格相互转换,从而在规则网格上高效地应用傅里叶神经算子。GINO具有离散收敛性,即训练好的模型可以应用于连续域的任意离散化,且随着离散细化收敛到连续算子。为了在大规模仿真上实证验证方法性能,我们生成了雷诺数高达500万的3D车辆几何的行业标准空气动力学数据集。对于这种大规模3D流体仿真,用数值方法计算表面压力的代价高昂。我们仅用500个数据点就成功训练GINO预测车身表面压力。成本-精度实验表明,在计算阻力系数时,GINO相比优化的GPU计算流体力学(CFD)仿真器实现了26,000倍的加速。在新的几何与边界条件(入口速度)组合上测试时,GINO的错误率比深度神经网络方法降低了四分之一。
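
GINO's Fourier branch builds on the spectral convolution of Fourier neural operators. The following is a minimal 1D sketch of that layer only, written as an assumption-laden illustration; the graph neural operator, SDF inputs, and 3D irregular-grid handling in the paper are omitted.

```python
import torch

class SpectralConv1d(torch.nn.Module):
    """Minimal 1D Fourier layer: FFT, keep low modes, learned complex multiply, inverse FFT."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.weights = torch.nn.Parameter(
            torch.randn(channels, channels, modes, dtype=torch.cfloat) / channels)

    def forward(self, x):                      # x: (batch, channels, grid)
        x_ft = torch.fft.rfft(x)               # spectrum of length grid // 2 + 1
        out_ft = torch.zeros_like(x_ft)
        out_ft[..., :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[..., :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))

layer = SpectralConv1d(channels=8, modes=16)
y = layer(torch.randn(4, 8, 64))               # output shape matches input: (4, 8, 64)
```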

Consistency of Lloyd’s Algorithm Under Perturbations

  • paper_url: http://arxiv.org/abs/2309.00578
  • repo_url: None
  • paper_authors: Dhruv Patel, Hui Shen, Shankar Bhamidi, Yufeng Liu, Vladas Pipiras
  • for: 这篇论文主要研究 Lloyd 算法在无监督学习中的正确性,具体来说,是在真实样本不可观测、需要通过谱方法等预处理管道从数据中学习得到扰动样本的情形下,再使用 Lloyd 算法进行聚类。
  • methods: 这篇论文分析了 Lloyd 算法及谱方法等相关预处理管道,以研究无监督学习中的聚类问题。
  • results: 论文的结果是,对于受扰动的亚高斯混合样本,在适当初始化且扰动相对亚高斯噪声较小的条件下,Lloyd 算法的误聚类率在 $O(\log(n))$ 次迭代后被指数式控制,其中 $n$ 为样本数量。此外,论文还给出了若干具体应用,如高维时间序列、多维标度以及基于谱聚类的稀疏网络社区检测等。
    Abstract In the context of unsupervised learning, Lloyd's algorithm is one of the most widely used clustering algorithms. It has inspired a plethora of work investigating the correctness of the algorithm under various settings with ground truth clusters. In particular, in 2016, Lu and Zhou have shown that the mis-clustering rate of Lloyd's algorithm on $n$ independent samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization of the algorithm. However, in many applications, the true samples are unobserved and need to be learned from the data via pre-processing pipelines such as spectral methods on appropriate data matrices. We show that the mis-clustering rate of Lloyd's algorithm on perturbed samples from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$ iterations under the assumptions of proper initialization and that the perturbation is small relative to the sub-Gaussian noise. In canonical settings with ground truth clusters, we derive bounds for algorithms such as $k$-means$++$ to find good initializations and thus leading to the correctness of clustering via the main result. We show the implications of the results for pipelines measuring the statistical significance of derived clusters from data such as SigClust. We use these general results to derive implications in providing theoretical guarantees on the misclustering rate for Lloyd's algorithm in a host of applications, including high-dimensional time series, multi-dimensional scaling, and community detection for sparse networks via spectral clustering.
    摘要 在无监督学习框架下,Lloyd 算法是使用最广泛的聚类算法之一,引发了大量在不同含真实簇设定下研究其正确性的工作。特别地,2016年 Lu 和 Zhou 证明:在正确初始化下,Lloyd 算法在来自亚高斯混合的 $n$ 个独立样本上的误聚类率在 $O(\log(n))$ 次迭代后被指数式控制。然而,在许多应用中真实样本不可观测,需要通过预处理管道(如对适当数据矩阵使用谱方法)从数据中学习。我们证明:在初始化正确且扰动相对亚高斯噪声较小的假设下,Lloyd 算法在受扰动的亚高斯混合样本上的误聚类率同样在 $O(\log(n))$ 次迭代后被指数式控制。在含真实簇的典型设定中,我们为 $k$-means$++$ 等算法推导了寻找良好初始化的界,从而经由主定理得到聚类的正确性。我们还讨论了这些结果对 SigClust 等衡量数据聚类统计显著性的管道的意义,并据此为 Lloyd 算法在一系列应用中的误聚类率提供理论保证,包括高维时间序列、多维标度以及基于谱聚类的稀疏网络社区检测。
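
A minimal numpy sketch of the setting: Lloyd iterations run on a perturbed two-component sub-Gaussian mixture for $O(\log n)$ rounds. The random initialization is a crude stand-in for the k-means++-style initialization the analysis assumes.

```python
import numpy as np

def lloyd(X, centers, iters):
    """Plain Lloyd iterations: assign each point to its nearest center, recompute means."""
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
clean = np.vstack([rng.normal(-2, 1, (300, 2)), rng.normal(2, 1, (300, 2))])
X = clean + rng.normal(0, 0.1, clean.shape)        # small perturbation, e.g. from a spectral embedding
init = X[rng.choice(len(X), 2, replace=False)].copy()
labels, centers = lloyd(X, init, iters=int(np.log(len(X))) + 1)   # O(log n) iterations
```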

Interpretation of High-Dimensional Linear Regression: Effects of Nullspace and Regularization Demonstrated on Battery Data

  • paper_url: http://arxiv.org/abs/2309.00564
  • repo_url: https://github.com/joachimschaeffer/hdreganalytics
  • paper_authors: Joachim Schaeffer, Eric Lenz, William C. Chueh, Martin Z. Bazant, Rolf Findeisen, Richard D. Braatz
  • for: This paper is written for researchers and practitioners who work with high-dimensional linear regression in various scientific fields, particularly those who deal with discrete measured data of underlying smooth latent processes.
  • methods: The paper proposes an optimization formulation to compare regression coefficients and to understand the relationship between the nullspace and regularization in high-dimensional linear regression. The authors also use physical engineering knowledge to interpret the regression results.
  • results: The case studies show that regularization and z-scoring are important design choices that can lead to interpretable regression results, while the combination of the nullspace and regularization can hinder interpretability. Additionally, the paper demonstrates that regression methods that do not produce coefficients orthogonal to the nullspace, such as fused lasso, can improve interpretability.
    Abstract High-dimensional linear regression is important in many scientific fields. This article considers discrete measured data of underlying smooth latent processes, as is often obtained from chemical or biological systems. Interpretation in high dimensions is challenging because the nullspace and its interplay with regularization shapes regression coefficients. The data's nullspace contains all coefficients that satisfy $\mathbf{Xw}=\mathbf{0}$, thus allowing very different coefficients to yield identical predictions. We developed an optimization formulation to compare regression coefficients and coefficients obtained by physical engineering knowledge to understand which part of the coefficient differences are close to the nullspace. This nullspace method is tested on a synthetic example and lithium-ion battery data. The case studies show that regularization and z-scoring are design choices that, if chosen corresponding to prior physical knowledge, lead to interpretable regression results. Otherwise, the combination of the nullspace and regularization hinders interpretability and can make it impossible to obtain regression coefficients close to the true coefficients when there is a true underlying linear model. Furthermore, we demonstrate that regression methods that do not produce coefficients orthogonal to the nullspace, such as fused lasso, can improve interpretability. In conclusion, the insights gained from the nullspace perspective help to make informed design choices for building regression models on high-dimensional data and reasoning about potential underlying linear models, which are important for system optimization and improving scientific understanding.
    摘要 高维线性回归在许多科学领域中非常重要。本文考虑的是底层平滑潜在过程的离散测量数据,这类数据常见于化学或生物系统。高维情形下的解释颇具挑战,因为零空间及其与正则化的相互作用会影响回归系数。数据的零空间包含所有满足 $\mathbf{Xw}=\mathbf{0}$ 的系数,因此差异很大的系数可以给出完全相同的预测。我们提出了一种优化表述,用于比较回归系数与由物理工程知识得到的系数,以理解系数差异中有多大部分接近零空间。该零空间方法在一个合成示例和锂离子电池数据上进行了测试。案例研究表明,正则化和 z-score 标准化是设计选择:若与先验物理知识相符,可得到可解释的回归结果;否则,零空间与正则化的组合会妨碍可解释性,并且在存在真实底层线性模型时可能无法得到接近真实系数的回归系数。此外,我们还展示了不强制系数与零空间正交的回归方法(如 fused lasso)可以提升可解释性。总之,零空间视角带来的洞见有助于在高维数据上构建回归模型时做出明智的设计选择,并对潜在的底层线性模型进行推断,这对系统优化和增进科学理解都很重要。
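
The central point — that coefficients differing by a nullspace component give identical predictions — can be verified in a few lines; a minimal numpy/scipy sketch:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, p = 50, 200                              # p >> n: a nontrivial nullspace exists
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:5] = 1.0      # sparse "true" coefficients
y = X @ w_true

N = null_space(X)                           # orthonormal basis of {w : Xw = 0}, shape (p, p - n)
w_alt = w_true + N @ rng.normal(size=N.shape[1])

print(np.allclose(X @ w_true, X @ w_alt))   # True: identical predictions
print(np.linalg.norm(w_true - w_alt))       # yet the coefficients differ substantially
```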

Interactive and Concentrated Differential Privacy for Bandits

  • paper_url: http://arxiv.org/abs/2309.00557
  • repo_url: None
  • paper_authors: Achraf Azize, Debabrota Basu
  • for: 这篇论文关注在存在可信中央决策者的情形下保护用户隐私的问题。
  • methods: 这篇论文通过交互式差分隐私(DP)的视角来保护用户隐私。
  • results: 论文给出了有限臂与线性老虎机在 $\rho$-全局零集中差分隐私(zCDP)下遗憾的极小极大下界和问题相关下界,量化了不同隐私预算 $\rho$ 对遗憾的代价。论文还提出了两种 $\rho$-全局 zCDP 老虎机算法 AdaC-UCB 和 AdaC-GOPE,二者均基于高斯机制与自适应回合。论文分析了这些算法的遗憾,证明 AdaC-UCB 达到问题相关下界(至多相差常数因子),AdaC-GOPE 达到极小极大下界(至多相差多项式对数因子)。最后,论文在不同设置下通过实验验证了理论结论。
    Abstract Bandits play a crucial role in interactive learning schemes and modern recommender systems. However, these systems often rely on sensitive user data, making privacy a critical concern. This paper investigates privacy in bandits with a trusted centralized decision-maker through the lens of interactive Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have been well-studied, we contribute to the understanding of bandits under zero Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds on regret for finite-armed and linear bandits, which quantify the cost of $\rho$-global zCDP in these settings. These lower bounds reveal two hardness regimes based on the privacy budget $\rho$ and suggest that $\rho$-global zCDP incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear bandits respectively. Both algorithms use a common recipe of Gaussian mechanism and adaptive episodes. We analyze the regret of these algorithms to show that AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative constants, while AdaC-GOPE achieves the minimax regret lower bound up to poly-logarithmic factors. Finally, we provide experimental validation of our theoretical results under different settings.
    摘要 老虎机在交互式学习方案和现代推荐系统中发挥着关键作用,但这些系统通常依赖敏感的用户数据,因此隐私成为一个重要问题。本文通过交互式差分隐私(DP)的视角,研究存在可信中央决策者的老虎机中的隐私问题。纯 $\epsilon$-全局 DP 下的老虎机已被充分研究,我们则推进了对零集中差分隐私(zCDP)下老虎机的理解。我们针对有限臂与线性老虎机给出了遗憾的极小极大下界和问题相关下界,量化了 $\rho$-全局 zCDP 在这些设置下的代价。这些下界揭示了依赖隐私预算 $\rho$ 的两个困难区间,并表明 $\rho$-全局 zCDP 带来的遗憾小于纯 $\epsilon$-全局 DP。我们分别为有限臂与线性老虎机提出了两种 $\rho$-全局 zCDP 算法 AdaC-UCB 和 AdaC-GOPE,二者采用高斯机制与自适应回合这一共同配方。我们分析了这些算法的遗憾,证明 AdaC-UCB 在至多相差常数因子的意义下达到问题相关下界,AdaC-GOPE 在至多相差多项式对数因子的意义下达到极小极大下界。最后,我们在不同设置下对理论结果进行了实验验证。
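
The privacy primitive both algorithms build on is the Gaussian mechanism, which satisfies $\rho$-zCDP when the noise scale is $\Delta/\sqrt{2\rho}$ for sensitivity $\Delta$. Below is a minimal sketch of a $\rho$-zCDP release of one arm's mean reward; the adaptive-episode scheduling of AdaC-UCB/AdaC-GOPE is omitted.

```python
import numpy as np

def zcdp_private_mean(rewards, rho, reward_range=1.0, rng=np.random.default_rng()):
    """Release the mean reward of one arm under rho-zCDP via the Gaussian mechanism.
    Changing a single user's reward moves the mean by at most reward_range / n."""
    n = len(rewards)
    sensitivity = reward_range / n
    sigma = sensitivity / np.sqrt(2.0 * rho)   # Gaussian noise with variance delta^2 / (2*rho)
    return float(np.mean(rewards)) + rng.normal(0.0, sigma)

noisy_estimate = zcdp_private_mean(np.random.default_rng(0).uniform(size=100), rho=0.1)
```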

Adaptive function approximation based on the Discrete Cosine Transform (DCT)

  • paper_url: http://arxiv.org/abs/2309.00530
  • repo_url: None
  • paper_authors: Ana I. Pérez-Neira, Marc Martinez-Gost, Miguel Ángel Lagunas
  • for: 这篇论文研究以余弦函数作为基函数来逼近无记忆的一元连续函数。
  • methods: 本论文使用监督学习来获取逼近系数,而不是使用离散余弦变换(DCT)。
  • results: 由于余弦基函数的有限动态范围和正交性,归一化最小均方(NLMS)等简单的梯度算法可以从中受益,获得可控且可预测的收敛时间与失调误差。该技术的简单性使其适合在更复杂的监督学习系统中使用。
    Abstract This paper studies the cosine as basis function for the approximation of univariate and continuous functions without memory. This work studies a supervised learning to obtain the approximation coefficients, instead of using the Discrete Cosine Transform (DCT). Due to the finite dynamics and orthogonality of the cosine basis functions, simple gradient algorithms, such as the Normalized Least Mean Squares (NLMS), can benefit from it and present a controlled and predictable convergence time and error misadjustment. Due to its simplicity, the proposed technique ranks as the best in terms of learning quality versus complexity, and it is presented as an attractive technique to be used in more complex supervised learning systems. Simulations illustrate the performance of the approach. This paper celebrates the 50th anniversary of the publication of the DCT by Nasir Ahmed in 1973.
    摘要 这篇论文研究以余弦函数作为基函数,逼近无记忆的一元连续函数。本工作使用监督学习来获取逼近系数,而非使用离散余弦变换(DCT)。由于余弦基函数的有限动态范围和正交性,归一化最小均方(NLMS)等简单的梯度算法可以从中受益,获得可控且可预测的收敛时间与失调误差。由于其简单性,所提技术在学习质量与复杂度的权衡上表现最佳,是可用于更复杂监督学习系统的一种有吸引力的技术。仿真实验展示了该方法的性能。本文也是为纪念 Nasir Ahmed 于 1973 年发表 DCT 五十周年。
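
A minimal numpy sketch of the idea: approximate $f$ on $[0,1]$ with a truncated cosine basis and learn the coefficients with NLMS instead of computing a DCT. Basis size and step size below are illustrative choices, not the paper's.

```python
import numpy as np

def cosine_features(x, K):
    # DCT-like basis on [0, 1]: phi_k(x) = cos(pi * k * x), k = 0..K-1
    return np.cos(np.pi * np.arange(K) * x)

def nlms_fit(f, K=16, steps=20000, mu=0.5, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(K)
    for _ in range(steps):
        x = rng.uniform()
        phi = cosine_features(x, K)
        err = f(x) - w @ phi
        w += mu * err * phi / (eps + phi @ phi)   # NLMS: step normalized by input power
    return w

w = nlms_fit(lambda x: np.exp(-3 * x) * np.sin(6 * x))
approx_at_03 = w @ cosine_features(0.3, len(w))
```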

Online Distributed Learning over Random Networks

  • paper_url: http://arxiv.org/abs/2309.00520
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Nicola Bastianello, Diego Deplano, Mauro Franceschelli, Karl H. Johansson
  • for: 本研究的目的是解决分布式学习问题,特别是在多代理系统中,代理不直接分享数据,而是通过协作来学习模型。
  • methods: 本研究使用交替方向乘子法(ADMM)的分布式算子理论(DOT)版本,即 DOT-ADMM 算法,以应对在线学习、异步代理计算、不可靠且受限的通信以及不精确的本地计算等实际问题。
  • results: 本研究证明 DOT-ADMM 算法在一类凸学习问题(如线性回归和逻辑回归)上以线性速率收敛,并刻画了其解受 (i)–(iv) 各项因素的影响。数值实验将 DOT-ADMM 与其他最先进算法进行比较,表明它对 (i)–(iv) 具有鲁棒性。
    Abstract The recent deployment of multi-agent systems in a wide range of scenarios has enabled the solution of learning problems in a distributed fashion. In this context, agents are tasked with collecting local data and then cooperatively train a model, without directly sharing the data. While distributed learning offers the advantage of preserving agents' privacy, it also poses several challenges in terms of designing and analyzing suitable algorithms. This work focuses specifically on the following challenges motivated by practical implementation: (i) online learning, where the local data change over time; (ii) asynchronous agent computations; (iii) unreliable and limited communications; and (iv) inexact local computations. To tackle these challenges, we introduce the Distributed Operator Theoretical (DOT) version of the Alternating Direction Method of Multipliers (ADMM), which we call the DOT-ADMM Algorithm. We prove that it converges with a linear rate for a large class of convex learning problems (e.g., linear and logistic regression problems) toward a bounded neighborhood of the optimal time-varying solution, and characterize how the neighborhood depends on~$\text{(i)--(iv)}$. We corroborate the theoretical analysis with numerical simulations comparing the DOT-ADMM Algorithm with other state-of-the-art algorithms, showing that only the proposed algorithm exhibits robustness to (i)--(iv).
    摘要 近年来多代理系统在各类场景中的部署,使得学习问题得以以分布式方式求解。在这一设定中,各代理收集本地数据并协作训练模型,而无需直接共享数据。分布式学习在保护代理隐私方面具有优势,但也给算法的设计与分析带来诸多挑战。本文重点关注源于实际部署的以下挑战:(i) 在线学习,即本地数据随时间变化;(ii) 代理计算异步;(iii) 通信不可靠且受限;(iv) 本地计算不精确。为应对这些挑战,我们提出交替方向乘子法(ADMM)的分布式算子理论(DOT)版本,即 DOT-ADMM 算法。我们证明,对一大类凸学习问题(例如线性回归与逻辑回归),该算法以线性速率收敛到时变最优解的有界邻域,并刻画了该邻域对 (i)–(iv) 的依赖关系。数值仿真将 DOT-ADMM 算法与其他最先进算法进行比较,结果表明只有所提算法对 (i)–(iv) 保持鲁棒。
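
For orientation, here is a minimal synchronous consensus-ADMM sketch for distributed least squares; it assumes exact local solves and reliable communication, i.e. precisely the idealizations that DOT-ADMM is designed to relax. Problem sizes and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
agents = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(4)]  # (A_i, b_i) per agent
rho, d = 1.0, 5
x = [np.zeros(d) for _ in agents]          # local estimates
u = [np.zeros(d) for _ in agents]          # scaled dual variables
z = np.zeros(d)                            # consensus variable

for _ in range(100):
    for i, (A, b) in enumerate(agents):    # exact local proximal step for 0.5*||A x - b||^2
        x[i] = np.linalg.solve(A.T @ A + rho * np.eye(d), A.T @ b + rho * (z - u[i]))
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)   # "communication" / averaging step
    for i in range(len(agents)):
        u[i] += x[i] - z

# z now approximates argmin over x of sum_i 0.5*||A_i x - b_i||^2
```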

Solving multiscale elliptic problems by sparse radial basis function neural networks

  • paper_url: http://arxiv.org/abs/2309.03107
  • repo_url: None
  • paper_authors: Zhiwen Wang, Minxin Chen, Jingrun Chen
  • for: 求解具有多尺度系数的椭圆型偏微分方程(PDEs)
  • methods: 使用稀疏径向基函数神经网络(RBFNN)方法:受深度混合残差方法启发,将二阶问题改写为一阶系统,并用多个 RBFNN 逼近系统中的未知函数
  • results: 提出了一种 $\ell_1$ 正则化技术以避免过拟合;该方法能在三维情形下给出可靠的数值解,且在精度与鲁棒性上优于大多数现有的机器学习方法。
    Abstract Machine learning has been successfully applied to various fields of scientific computing in recent years. In this work, we propose a sparse radial basis function neural network method to solve elliptic partial differential equations (PDEs) with multiscale coefficients. Inspired by the deep mixed residual method, we rewrite the second-order problem into a first-order system and employ multiple radial basis function neural networks (RBFNNs) to approximate unknown functions in the system. To aviod the overfitting due to the simplicity of RBFNN, an additional regularization is introduced in the loss function. Thus the loss function contains two parts: the $L_2$ loss for the residual of the first-order system and boundary conditions, and the $\ell_1$ regularization term for the weights of radial basis functions (RBFs). An algorithm for optimizing the specific loss function is introduced to accelerate the training process. The accuracy and effectiveness of the proposed method are demonstrated through a collection of multiscale problems with scale separation, discontinuity and multiple scales from one to three dimensions. Notably, the $\ell_1$ regularization can achieve the goal of representing the solution by fewer RBFs. As a consequence, the total number of RBFs scales like $\mathcal{O}(\varepsilon^{-n\tau})$, where $\varepsilon$ is the smallest scale, $n$ is the dimensionality, and $\tau$ is typically smaller than $1$. It is worth mentioning that the proposed method not only has the numerical convergence and thus provides a reliable numerical solution in three dimensions when a classical method is typically not affordable, but also outperforms most other available machine learning methods in terms of accuracy and robustness.
    摘要 近年来,机器学习已成功应用于科学计算的诸多领域。在本工作中,我们提出一种稀疏径向基函数神经网络方法,用于求解具有多尺度系数的椭圆型偏微分方程(PDEs)。受深度混合残差方法启发,我们将二阶问题改写为一阶系统,并使用多个径向基函数神经网络(RBFNN)逼近系统中的未知函数。为避免 RBFNN 结构简单带来的过拟合,我们在损失函数中引入额外的正则化项。因此,损失函数包含两部分:一阶系统残差与边界条件的 $L_2$ 损失项,以及径向基函数(RBF)权重的 $\ell_1$ 正则化项。我们还提出了针对该损失函数的优化算法,以加速训练过程。通过一系列具有尺度分离、间断和一至三维多尺度特征的问题,我们展示了所提方法的精度与有效性。值得注意的是,$\ell_1$ 正则化能够用更少的 RBF 来表示解,使得 RBF 的总数按 $\mathcal{O}(\varepsilon^{-n\tau})$ 规模增长,其中 $\varepsilon$ 为最小尺度,$n$ 为维数,$\tau$ 通常小于 $1$。此外,所提方法不仅具有数值收敛性,能够在经典方法通常难以负担的三维情形下给出可靠的数值解,而且在精度与鲁棒性上优于大多数现有的机器学习方法。
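
A minimal 1D sketch of the sparsity mechanism: fit a function with an RBF expansion and an $\ell_1$ penalty via proximal gradient descent (ISTA). The PDE residual loss of the actual method is replaced here by plain function regression, and all hyperparameters are illustrative.

```python
import numpy as np

def rbf_design(x, centers, gamma):
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

def fit_sparse_rbf(x, y, centers, gamma=200.0, lam=1e-3, steps=3000):
    """Least squares with an l1 penalty on RBF weights, solved by ISTA."""
    Phi = rbf_design(x, centers, gamma)
    w = np.zeros(len(centers))
    L = np.linalg.norm(Phi, 2) ** 2 / len(x)          # Lipschitz constant of the smooth part
    for _ in range(steps):
        grad = Phi.T @ (Phi @ w - y) / len(x)
        w = w - grad / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft-thresholding (prox of l1)
    return w

x = np.linspace(0, 1, 400)
y = np.sin(2 * np.pi * x) + 0.1 * np.sin(40 * np.pi * x)        # two-scale target
w = fit_sparse_rbf(x, y, centers=np.linspace(0, 1, 120))
print("active RBFs:", np.sum(np.abs(w) > 1e-8))                  # far fewer than 120
```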

Structure and Gradient Dynamics Near Global Minima of Two-layer Neural Networks

  • paper_url: http://arxiv.org/abs/2309.00508
  • repo_url: None
  • paper_authors: Leyang Zhang, Yaoyu Zhang, Tao Luo
  • for: 研究两层神经网络的损失地形结构,特别是在全局最优点附近,并确定可以达到完美泛化的参数集。
  • methods: 使用新的技术来探索复杂的损失地形,并发现模型、目标函数、样本和初始化对训练动态的影响不同。
  • results: 研究发现,(过参数化)神经网络可以很好地泛化,并且解释了这种能力的原因。
    Abstract Under mild assumptions, we investigate the structure of loss landscape of two-layer neural networks near global minima, determine the set of parameters which give perfect generalization, and fully characterize the gradient flows around it. With novel techniques, our work uncovers some simple aspects of the complicated loss landscape and reveals how model, target function, samples and initialization affect the training dynamics differently. Based on these results, we also explain why (overparametrized) neural networks could generalize well.
    摘要 在较弱的假设下,我们研究两层神经网络在全局极小点附近的损失地形结构,确定能实现完美泛化的参数集合,并完整刻画其周围的梯度流。借助新的技术,我们的工作揭示了复杂损失地形中一些简单的侧面,并说明了模型、目标函数、样本和初始化如何以不同方式影响训练动力学。基于这些结果,我们也解释了(过参数化的)神经网络为何能够良好泛化。

Application of Deep Learning Methods in Monitoring and Optimization of Electric Power Systems

  • paper_url: http://arxiv.org/abs/2309.00498
  • repo_url: None
  • paper_authors: Ognjen Kundacina
  • for: 这篇博士论文探讨深度学习技术在电力系统监测与优化中的应用,以改进电力系统状态估计并实现配电网的动态重构。
  • methods: 论文使用图神经网络提升电力系统状态估计,并使用强化学习实现配电网的动态重构。
  • results: 所提方法经由大量实验与仿真得到验证,在电力系统监测与优化方面表现出色。
    Abstract This PhD thesis thoroughly examines the utilization of deep learning techniques as a means to advance the algorithms employed in the monitoring and optimization of electric power systems. The first major contribution of this thesis involves the application of graph neural networks to enhance power system state estimation. The second key aspect of this thesis focuses on utilizing reinforcement learning for dynamic distribution network reconfiguration. The effectiveness of the proposed methods is affirmed through extensive experimentation and simulations.
    摘要 这篇博士论文全面考察了深度学习技术在电力系统监测与优化算法中的应用。论文的第一个主要贡献是利用图神经网络提升电力系统状态估计;第二个关键方面是利用强化学习实现配电网的动态重构。大量实验和仿真验证了所提方法的有效性。

How Does Forecasting Affect the Convergence of DRL Techniques in O-RAN Slicing?

  • paper_url: http://arxiv.org/abs/2309.00489
  • repo_url: None
  • paper_authors: Ahmad M. Nagib, Hatem Abou-Zeid, Hossam S. Hassanein
  • for: This paper focuses on improving the convergence of deep reinforcement learning (DRL) agents in open radio access network (O-RAN) architectures, specifically for immersive applications such as virtual reality (VR) gaming and metaverse services.
  • methods: The authors use time series forecasting of traffic demands to improve the convergence of the DRL-based slicing agents. They propose a novel forecasting-aided DRL approach and provide an exhaustive experiment that supports multiple services, including real VR gaming traffic.
  • results: The proposed approach shows significant improvements in the average initial reward value, convergence rate, and number of converged scenarios compared to the implemented baselines. The results also demonstrate the approach's robustness against forecasting errors and the feasibility of using imperfect forecasting models.
    Abstract The success of immersive applications such as virtual reality (VR) gaming and metaverse services depends on low latency and reliable connectivity. To provide seamless user experiences, the open radio access network (O-RAN) architecture and 6G networks are expected to play a crucial role. RAN slicing, a critical component of the O-RAN paradigm, enables network resources to be allocated based on the needs of immersive services, creating multiple virtual networks on a single physical infrastructure. In the O-RAN literature, deep reinforcement learning (DRL) algorithms are commonly used to optimize resource allocation. However, the practical adoption of DRL in live deployments has been sluggish. This is primarily due to the slow convergence and performance instabilities suffered by the DRL agents both upon initial deployment and when there are significant changes in network conditions. In this paper, we investigate the impact of time series forecasting of traffic demands on the convergence of the DRL-based slicing agents. For that, we conduct an exhaustive experiment that supports multiple services including real VR gaming traffic. We then propose a novel forecasting-aided DRL approach and its respective O-RAN practical deployment workflow to enhance DRL convergence. Our approach shows up to 22.8%, 86.3%, and 300% improvements in the average initial reward value, convergence rate, and number of converged scenarios respectively, enhancing the generalizability of the DRL agents compared with the implemented baselines. The results also indicate that our approach is robust against forecasting errors and that forecasting models do not have to be ideal.
    摘要 虚拟现实(VR)游戏和元宇宙服务等沉浸式应用的成功依赖于低时延与可靠的连接。为提供无缝的用户体验,开放无线接入网(O-RAN)架构和6G网络有望发挥关键作用。RAN 切片是 O-RAN 范式中的关键组件,它允许按照沉浸式服务的需求分配网络资源,在单一物理基础设施上创建多个虚拟网络。在 O-RAN 文献中,深度强化学习(DRL)算法被广泛用于优化资源分配。然而,DRL 在实际部署中的应用进展缓慢,这主要是因为 DRL 代理在初始部署时以及网络条件发生显著变化时收敛缓慢且性能不稳定。本文研究流量需求的时间序列预测对基于 DRL 的切片代理收敛性的影响。为此,我们开展了覆盖多种服务(包括真实 VR 游戏流量)的详尽实验,进而提出一种新的预测辅助 DRL 方法及其相应的 O-RAN 实际部署工作流,以改善 DRL 的收敛。与已实现的基线相比,我们的方法在平均初始奖励值、收敛速度和收敛场景数量上分别提升了至多 22.8%、86.3% 和 300%,增强了 DRL 代理的泛化能力。结果还表明,我们的方法对预测误差具有鲁棒性,预测模型并不需要十分理想。
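
A minimal sketch of the mechanism: wrap the slicing environment so the DRL agent's observation is augmented with a traffic-demand forecast. Both `env` and `forecaster` are hypothetical stand-ins, and the paper's O-RAN deployment workflow is not reproduced here.

```python
import numpy as np

class ForecastAugmentedEnv:
    """Append a demand forecast to each observation before the DRL agent sees it."""
    def __init__(self, env, forecaster, window=8):
        self.env, self.forecaster, self.window = env, forecaster, window
        self.history = []

    def _augment(self, obs):
        self.history = (self.history + [np.asarray(obs)])[-self.window:]
        forecast = self.forecaster(np.stack(self.history))   # e.g. a time-series model
        return np.concatenate([np.asarray(obs), np.atleast_1d(forecast)])

    def reset(self):
        self.history = []
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```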

Geometry-aware Line Graph Transformer Pre-training for Molecular Property Prediction

  • paper_url: http://arxiv.org/abs/2309.00483
  • repo_url: None
  • paper_authors: Peizhen Bai, Xianyuan Liu, Haiping Lu
  • for: 提升分子表示学习的质量,增强分子性质预测的能力
  • methods: 采用自监督学习方法,利用 2D 与 3D 两种模态提取分子特征信息
  • results: 在 12 个性质预测基准上与 6 个最先进基线比较,均稳定优于所有基线,证明了其有效性
    Abstract Molecular property prediction with deep learning has gained much attention over the past years. Owing to the scarcity of labeled molecules, there has been growing interest in self-supervised learning methods that learn generalizable molecular representations from unlabeled data. Molecules are typically treated as 2D topological graphs in modeling, but it has been discovered that their 3D geometry is of great importance in determining molecular functionalities. In this paper, we propose the Geometry-aware line graph transformer (Galformer) pre-training, a novel self-supervised learning framework that aims to enhance molecular representation learning with 2D and 3D modalities. Specifically, we first design a dual-modality line graph transformer backbone to encode the topological and geometric information of a molecule. The designed backbone incorporates effective structural encodings to capture graph structures from both modalities. Then we devise two complementary pre-training tasks at the inter and intra-modality levels. These tasks provide properly supervised information and extract discriminative 2D and 3D knowledge from unlabeled molecules. Finally, we evaluate Galformer against six state-of-the-art baselines on twelve property prediction benchmarks via downstream fine-tuning. Experimental results show that Galformer consistently outperforms all baselines on both classification and regression tasks, demonstrating its effectiveness.
    摘要 基于深度学习的分子性质预测近年来受到广泛关注。由于带标注的分子数据稀缺,人们越来越关注能从无标注数据中学习可泛化分子表示的自监督学习方法。建模中通常将分子视为二维拓扑图,但研究发现其三维几何结构对分子功能至关重要。本文提出几何感知线图 Transformer(Galformer)预训练,这是一种旨在利用二维与三维模态增强分子表示学习的新自监督学习框架。具体而言,我们首先设计双模态线图 Transformer 骨干网络,编码分子的拓扑与几何信息;该骨干引入有效的结构编码,以从两种模态捕捉图结构。随后,我们在模态间与模态内两个层面设计了两个互补的预训练任务,为无标注分子提供恰当的监督信息,并从中提取有判别力的二维与三维知识。最后,我们通过下游微调,在十二个性质预测基准上将 Galformer 与六个最先进基线进行比较。实验结果表明,Galformer 在分类与回归任务上均稳定优于所有基线,证明了其有效性。
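
The line-graph representation underlying such models is easy to see with networkx: every bond of the molecular graph becomes a node, and bonds sharing an atom become adjacent. The toy chain below is illustrative only, not a real molecule.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3)])   # atoms as nodes, bonds as edges (toy chain)

L = nx.line_graph(G)                          # bonds of G become nodes of L
print(sorted(L.nodes()))                      # [(0, 1), (1, 2), (2, 3)]
print(sorted(L.edges()))                      # bond pairs sharing an atom are adjacent
```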

Polynomial-Model-Based Optimization for Blackbox Objectives

  • paper_url: http://arxiv.org/abs/2309.00663
  • repo_url: None
  • paper_authors: Janina Schreiber, Damar Wicaksono, Michael Hecht
  • for: 这篇论文旨在解决黑盒优化问题:许多系统的结构未知,需要借助代理模型为其寻找使预定义目标函数最小化的(超)参数。
  • methods: 论文提出了一种新的黑盒优化算法,即基于多项式模型的优化(Polynomial-Model-Based Optimization,PMBO),它借鉴贝叶斯优化的思想,通过逐步更新模型,在探索与利用之间取得平衡,并提供模型的不确定性估计。
  • results: 论文在一些人工解析函数上将 PMBO 与其他当前领先算法进行比较,结果显示 PMBO 具有竞争力,在某些情况下甚至超越了它们;因此,作者认为 PMBO 是解决黑盒优化问题的有力选择。
    Abstract For a wide range of applications the structure of systems like Neural Networks or complex simulations, is unknown and approximation is costly or even impossible. Black-box optimization seeks to find optimal (hyper-) parameters for these systems such that a pre-defined objective function is minimized. Polynomial-Model-Based Optimization (PMBO) is a novel blackbox optimizer that finds the minimum by fitting a polynomial surrogate to the objective function. Motivated by Bayesian optimization the model is iteratively updated according to the acquisition function Expected Improvement, thus balancing the exploitation and exploration rate and providing an uncertainty estimate of the model. PMBO is benchmarked against other state-of-the-art algorithms for a given set of artificial, analytical functions. PMBO competes successfully with those algorithms and even outperforms all of them in some cases. As the results suggest, we believe PMBO is the pivotal choice for solving blackbox optimization tasks occurring in a wide range of disciplines.
    摘要 在许多应用领域中,神经网络或复杂仿真等系统的结构未知,而逼近的代价高昂甚至不可行。黑盒优化旨在为这些系统寻找使预定义目标函数最小化的(超)参数。基于多项式模型的优化(PMBO)是一种新的黑盒优化器,它通过拟合目标函数的多项式代理模型来寻找最小值。受贝叶斯优化启发,模型按照期望改进(Expected Improvement)采集函数迭代更新,在开发与探索之间取得平衡,并给出模型的不确定性估计。我们在一组人工解析函数上将 PMBO 与其他最先进算法进行了比较:PMBO 与这些算法相比具有竞争力,在某些情形下甚至全面占优。结果表明,PMBO 是求解众多学科中出现的黑盒优化任务的有力选择。
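
A minimal sketch of the loop described above: a polynomial surrogate refit each round, with Expected Improvement choosing the next evaluation. The constant, residual-based uncertainty below is a crude stand-in for PMBO's actual model-uncertainty estimate, and the blackbox is synthetic.

```python
import numpy as np
from numpy.polynomial import polynomial as poly
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # EI for minimization

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.1 * x ** 2                     # blackbox stand-in
x_obs = rng.uniform(-3, 3, 6); y_obs = f(x_obs)

for _ in range(25):
    coeffs = poly.polyfit(x_obs, y_obs, deg=4)                 # polynomial surrogate
    grid = np.linspace(-3, 3, 400)
    mu = poly.polyval(grid, coeffs)
    resid = y_obs - poly.polyval(x_obs, coeffs)
    sigma = np.full_like(grid, resid.std() + 1e-6)             # crude uncertainty proxy
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs, y_obs = np.append(x_obs, x_next), np.append(y_obs, f(x_next))
print("best found:", y_obs.min())
```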

A Locality-based Neural Solver for Optical Motion Capture

  • paper_url: http://arxiv.org/abs/2309.00428
  • repo_url: https://github.com/non-void/localmocap
  • paper_authors: Xiaoyu Pan, Bowen Zheng, Xinwei Jiang, Guanglong Xu, Xianli Gu, Jingxiang Li, Qilong Kou, He Wang, Tianjia Shao, Kun Zhou, Xiaogang Jin
  • for: 本研究旨在提出一种基于局部性的光学动作捕捉(OMC)数据清洗与求解方法,以减轻标记点误差与遮挡的影响。
  • methods: 提出一种将标记点与关节视为不同类型节点的异质图神经网络(HGNN),利用图卷积操作提取标记点与关节的局部特征,并将其转换为干净的运动;同时提出基于表示学习与数据增强(掩码训练)的训练方案。
  • results: 在多个数据集上进行了大量对比实验,结果表明我们的方法在被遮挡标记点位置的预测误差上比最先进方法降低约 20%,进而使重建的关节旋转与位置误差降低 30%。代码与数据见 https://github.com/non-void/LocalMoCap。
    Abstract We present a novel locality-based learning method for cleaning and solving optical motion capture data. Given noisy marker data, we propose a new heterogeneous graph neural network which treats markers and joints as different types of nodes, and uses graph convolution operations to extract the local features of markers and joints and transform them to clean motions. To deal with anomaly markers (e.g. occluded or with big tracking errors), the key insight is that a marker's motion shows strong correlations with the motions of its immediate neighboring markers but less so with other markers, a.k.a. locality, which enables us to efficiently fill missing markers (e.g. due to occlusion). Additionally, we also identify marker outliers due to tracking errors by investigating their acceleration profiles. Finally, we propose a training regime based on representation learning and data augmentation, by training the model on data with masking. The masking schemes aim to mimic the occluded and noisy markers often observed in the real data. Finally, we show that our method achieves high accuracy on multiple metrics across various datasets. Extensive comparison shows our method outperforms state-of-the-art methods in terms of prediction accuracy of occluded marker position error by approximately 20%, which leads to a further error reduction on the reconstructed joint rotations and positions by 30%. The code and data for this paper are available at https://github.com/non-void/LocalMoCap.
    摘要 我们提出了一种新的基于局部性的学习方法,用于清洗和求解光学动作捕捉数据。给定含噪的标记点数据,我们提出一种新的异质图神经网络,将标记点与关节视为不同类型的节点,利用图卷积操作提取标记点与关节的局部特征,并将其转换为干净的运动。针对异常标记点(例如被遮挡或跟踪误差较大的标记点),关键的洞察在于:标记点的运动与其紧邻标记点的运动高度相关,而与其他标记点的相关性较弱,即局部性;这使我们能够高效地填补缺失的标记点(例如因遮挡造成的缺失)。此外,我们还通过考察加速度曲线来识别由跟踪误差导致的离群标记点。最后,我们提出一种基于表示学习与数据增强的训练方案,即在带掩码的数据上训练模型;掩码方案旨在模拟真实数据中常见的遮挡与噪声标记点。实验表明,我们的方法在多个数据集的多项指标上都达到了高精度。大量对比显示,我们的方法将被遮挡标记点位置的预测误差比最先进方法降低约 20%,进而使重建的关节旋转与位置误差再降低 30%。本文代码与数据见 https://github.com/non-void/LocalMoCap。
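
A crude, non-learned illustration of the locality idea: fill an occluded marker from its neighbors by carrying over the previous frame's relative offsets. The paper's heterogeneous GNN replaces this heuristic with learned local features; `neighbors` here is a hypothetical adjacency.

```python
import numpy as np

def fill_missing(frames, neighbors):
    """frames: (T, M, 3) marker positions with NaNs for occlusions.
    neighbors: dict marker_index -> list of neighboring marker indices."""
    filled = frames.copy()
    T, M, _ = frames.shape
    for t in range(1, T):
        for m in range(M):
            if np.isnan(filled[t, m]).any() and not np.isnan(filled[t - 1, m]).any():
                estimates = [filled[t, n] + (filled[t - 1, m] - filled[t - 1, n])
                             for n in neighbors.get(m, [])
                             if not (np.isnan(filled[t, n]).any()
                                     or np.isnan(filled[t - 1, n]).any())]
                if estimates:
                    filled[t, m] = np.mean(estimates, axis=0)
    return filled
```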

Advancing Personalized Federated Learning: Group Privacy, Fairness, and Beyond

  • paper_url: http://arxiv.org/abs/2309.00416
  • repo_url: None
  • paper_authors: Filippo Galli, Kangsoo Jung, Sayan Biswas, Catuscia Palamidessi, Tommaso Cucinotta
  • for: 本研究旨在探讨在联邦学习框架下,如何兼顾个性化、隐私保障与公平性。
  • methods: 本研究使用 $d$-privacy(也称度量隐私)对模型更新进行基于度量的扰动,在保护客户端数据隐私的同时实现个性化模型训练。
  • results: 研究表明,借助 $d$-privacy 可以在联邦学习框架下实现个性化模型训练,同时提供形式化的隐私保障,并在多种标准指标下取得显著更好的群体公平性。
    Abstract Federated learning (FL) is a framework for training machine learning models in a distributed and collaborative manner. During training, a set of participating clients process their data stored locally, sharing only the model updates obtained by minimizing a cost function over their local inputs. FL was proposed as a stepping-stone towards privacy-preserving machine learning, but it has been shown vulnerable to issues such as leakage of private information, lack of personalization of the model, and the possibility of having a trained model that is fairer to some groups than to others. In this paper, we address the triadic interaction among personalization, privacy guarantees, and fairness attained by models trained within the FL framework. Differential privacy and its variants have been studied and applied as cutting-edge standards for providing formal privacy guarantees. However, clients in FL often hold very diverse datasets representing heterogeneous communities, making it important to protect their sensitive information while still ensuring that the trained model upholds the aspect of fairness for the users. To attain this objective, a method is put forth that introduces group privacy assurances through the utilization of $d$-privacy (aka metric privacy). $d$-privacy represents a localized form of differential privacy that relies on a metric-oriented obfuscation approach to maintain the original data's topological distribution. This method, besides enabling personalized model training in a federated approach and providing formal privacy guarantees, possesses significantly better group fairness measured under a variety of standard metrics than a global model trained within a classical FL template. Theoretical justifications for the applicability are provided, as well as experimental validation on real-world datasets to illustrate the working of the proposed method.
    摘要 联邦学习(FL)是一种以分布式协作方式训练机器学习模型的框架。训练过程中,参与的客户端处理各自本地存储的数据,仅共享通过在本地输入上最小化代价函数得到的模型更新。FL 被视为迈向隐私保护机器学习的一块垫脚石,但已有研究表明其存在隐私信息泄露、模型缺乏个性化、以及训练出的模型对某些群体比对其他群体更公平等问题。本文探讨在 FL 框架下训练的模型中,个性化、隐私保障与公平性三者之间的相互作用。差分隐私及其变体作为提供形式化隐私保障的前沿标准,已被广泛研究和应用。然而,FL 中的客户端往往持有代表异质社区的多样化数据集,因此既要保护其敏感信息,又要确保训练出的模型对用户保持公平。为实现这一目标,本文提出一种利用 $d$-privacy(又称度量隐私)提供群体隐私保证的方法。$d$-privacy 是差分隐私的一种局部化形式,依赖面向度量的扰动方式来保持原始数据的拓扑分布。该方法不仅支持联邦方式下的个性化模型训练并提供形式化的隐私保证,而且在多种标准指标下,其群体公平性显著优于经典 FL 模板下训练的全局模型。文中给出了适用性的理论论证,并在真实数据集上进行了实验验证,以展示所提方法的工作原理。
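
In the scalar Euclidean case, d-privacy reduces to the familiar Laplace mechanism: outputs for nearby inputs are indistinguishable in proportion to their distance. A one-function sketch of that special case; the paper applies the idea to model updates in FL, which is not shown here.

```python
import numpy as np

def d_private_release(x, eps, rng=np.random.default_rng()):
    """eps*d-private release for d(x, y) = |x - y|: the output density is
    proportional to exp(-eps * |z - x|), i.e. Laplace noise of scale 1/eps."""
    return x + rng.laplace(scale=1.0 / eps)

print(d_private_release(3.2, eps=0.5))
```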

Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds

  • paper_url: http://arxiv.org/abs/2309.00380
  • repo_url: None
  • paper_authors: Marcel Hirt, Domenico Campolo, Victoria Leong, Juan-Pablo Ortega
  • for: 这个论文是为了提出一种基于深度隐藏变量模型的多modal数据生成模型,以jointly explain multiple modalities的latent representations。
  • methods: 这个论文使用了多modal Variational Autoencoders(VAEs)作为生成模型,并使用了Product-of-Experts(PoE)或Mixture-of-Experts(MoE)的归一化方法来编码来自不同modalities的隐藏变量。
  • results: 该论文通过提出更加灵活的归一化方法和更加紧密的Lower bounding方法,以提高多modal数据生成模型的生成质量和多modal性能。
    Abstract Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) have been a popular generative model class that learns latent representations which jointly explain multiple modalities. Various objective functions for such models have been suggested, often motivated as lower bounds on the multi-modal data log-likelihood or from information-theoretic considerations. In order to encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities. In this work, we consider a variational bound that can tightly lower bound the data log-likelihood. We develop more flexible aggregation schemes that generalise PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational bounds and various aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can become beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.
    摘要 为多模态数据设计深度潜变量模型一直是机器学习研究的长期主题。多模态变分自编码器(VAEs)是一类流行的生成模型,其学习的潜在表示能够联合解释多个模态。针对这类模型已有多种目标函数被提出,其动机通常是多模态数据对数似然的下界或信息论方面的考虑。为了编码来自不同模态子集的潜变量,专家乘积(PoE)或专家混合(MoE)等聚合方案被广泛使用,并被证明会带来不同的权衡,例如在生成质量或跨模态一致性方面。在本工作中,我们考虑一种能够更紧地下界数据对数似然的变分界。我们提出了更灵活的聚合方案,通过置换不变的神经网络组合来自不同模态的编码特征,从而推广了 PoE 与 MoE 方法。数值实验展示了多模态变分界与各种聚合方案之间的权衡。我们表明,当需要在可辨识模型中逼近观测模态与潜变量的真实联合分布时,更紧的变分界与更灵活的聚合模型会带来收益。
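
A minimal PyTorch sketch of a permutation-invariant aggregator in the Deep Sets style: per-modality embeddings are summed before a shared network maps the pooled vector to posterior parameters. Dimensions and the two small networks are illustrative assumptions, not the paper's architecture.

```python
import torch

class PermutationInvariantAggregator(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.phi = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
        self.rho = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, 2 * dim))

    def forward(self, encodings):                  # (n_modalities, batch, dim)
        pooled = self.phi(encodings).sum(dim=0)    # sum pooling: order-invariant
        mu, logvar = self.rho(pooled).chunk(2, dim=-1)
        return mu, logvar                          # Gaussian posterior parameters

agg = PermutationInvariantAggregator(dim=32)
mu, logvar = agg(torch.randn(3, 16, 32))           # any subset/order of modalities works
```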

Anomaly detection with semi-supervised classification based on risk estimators

  • paper_url: http://arxiv.org/abs/2309.00379
  • repo_url: None
  • paper_authors: Le Thi Khanh Hien, Sukanya Patra, Souhaib Ben Taieb
  • for: 本研究旨在克服单类分类异常检测方法的一个重要局限,即假设无标注训练数据只包含正常实例。
  • methods: 我们提出了两种新的基于分类的异常检测方法:一种是基于无偏风险估计器的半监督浅层异常检测方法,另一种是基于非负(有偏)风险估计器的半监督深度异常检测方法。
  • results: 我们为两种风险最小化器建立了估计误差界与超额风险界,提出了在特定损失函数下保证浅层模型经验风险非负的正则化参数选择技术,并通过大量实验证明了所提异常检测方法的有效性。
    Abstract A significant limitation of one-class classification anomaly detection methods is their reliance on the assumption that unlabeled training data only contains normal instances. To overcome this impractical assumption, we propose two novel classification-based anomaly detection methods. Firstly, we introduce a semi-supervised shallow anomaly detection method based on an unbiased risk estimator. Secondly, we present a semi-supervised deep anomaly detection method utilizing a nonnegative (biased) risk estimator. We establish estimation error bounds and excess risk bounds for both risk minimizers. Additionally, we propose techniques to select appropriate regularization parameters that ensure the nonnegativity of the empirical risk in the shallow model under specific loss functions. Our extensive experiments provide strong evidence of the effectiveness of the risk-based anomaly detection methods.
    摘要 单类分类异常检测方法的一个显著局限,是假设无标注训练数据只包含正常实例。为克服这一不切实际的假设,我们提出了两种新的基于分类的异常检测方法。首先,我们给出一种基于无偏风险估计器的半监督浅层异常检测方法;其次,我们给出一种利用非负(有偏)风险估计器的半监督深度异常检测方法。我们为两种风险最小化器建立了估计误差界与超额风险界。此外,我们提出了选择合适正则化参数的技术,以保证浅层模型在特定损失函数下经验风险的非负性。大量实验有力地证明了这类基于风险的异常检测方法的有效性。
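
To illustrate the non-negative estimator idea, here is an nnPU-style sketch in PyTorch: labeled normals act as the positive class with prior π, and the estimated negative-class risk is clamped at zero. This mirrors the spirit of the paper's deep estimator, not its exact form.

```python
import torch
import torch.nn.functional as F

def nonnegative_risk(scores_normal, scores_unlabeled, pi):
    """scores_*: anomaly scores (higher = more anomalous); pi: prior of normal data."""
    r_pos = F.softplus(-scores_normal).mean()              # normals scored as normal
    r_neg = (F.softplus(scores_unlabeled).mean()
             - pi * F.softplus(scores_normal).mean())      # unbiased negative-class risk ...
    return pi * r_pos + torch.clamp(r_neg, min=0.0)        # ... clamped to stay non-negative

loss = nonnegative_risk(torch.randn(64), torch.randn(256), pi=0.9)
```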

Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

  • paper_url: http://arxiv.org/abs/2309.00367
  • repo_url: https://github.com/toenshoff/lrgb
  • paper_authors: Jan Tönshoff, Martin Ritzert, Eran Rosenbluth, Martin Grohe
  • for: establish a higher standard of empirical rigor within the graph machine learning community
  • methods: carefully reevaluate multiple MPGNN baselines and the Graph Transformer GPS on LRGB
  • results: the reported performance gap is overestimated due to suboptimal hyperparameter choices; after basic hyperparameter optimization the gap completely vanishes
    Abstract The recent Long-Range Graph Benchmark (LRGB, Dwivedi et al. 2022) introduced a set of graph learning tasks strongly dependent on long-range interaction between vertices. Empirical evidence suggests that on these tasks Graph Transformers significantly outperform Message Passing GNNs (MPGNNs). In this paper, we carefully reevaluate multiple MPGNN baselines as well as the Graph Transformer GPS (Ramp\'a\v{s}ek et al. 2022) on LRGB. Through a rigorous empirical analysis, we demonstrate that the reported performance gap is overestimated due to suboptimal hyperparameter choices. It is noteworthy that across multiple datasets the performance gap completely vanishes after basic hyperparameter optimization. In addition, we discuss the impact of lacking feature normalization for LRGB's vision datasets and highlight a spurious implementation of LRGB's link prediction metric. The principal aim of our paper is to establish a higher standard of empirical rigor within the graph machine learning community.
    摘要 最近的长程图基准(LRGB,Dwivedi 等,2022)引入了一组强烈依赖顶点间长程交互的图学习任务。实证证据显示,在这些任务上图 Transformer 显著优于消息传递 GNN(MPGNN)。在本文中,我们在 LRGB 上对多个 MPGNN 基线以及图 Transformer GPS(Rampášek 等,2022)进行了细致的重新评估。通过严格的实证分析,我们证明所报告的性能差距由于次优的超参数选择而被高估;值得注意的是,在多个数据集上,经过基本的超参数优化后性能差距完全消失。此外,我们讨论了 LRGB 视觉数据集缺少特征归一化的影响,并指出了 LRGB 链接预测指标实现中的一处疏漏。本文的主要目的,是在图机器学习社区中确立更高的实证严谨标准。

FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning

  • paper_url: http://arxiv.org/abs/2309.00363
  • repo_url: https://github.com/alibaba/federatedscope
  • paper_authors: Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou
  • for: This paper focuses on the challenges of fine-tuning large language models (LLMs) in federated learning (FL) settings, and proposes a package called FS-LLM to address these challenges.
  • methods: The paper introduces several components of the FS-LLM package, including an end-to-end benchmarking pipeline, federated parameter-efficient fine-tuning algorithms, and resource-efficient operators for fine-tuning LLMs with limited resources.
  • results: The paper conducts extensive experiments to validate the effectiveness of FS-LLM and compares it with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings. The results show that FS-LLM achieves better performance with lower communication and computation costs, and provides valuable insights into federated fine-tuning LLMs for the research community.
    Abstract LLMs have demonstrated great capabilities in various NLP tasks. Different entities can further improve the performance of those LLMs on their specific downstream tasks by fine-tuning LLMs. When several entities have similar interested tasks, but their data cannot be shared because of privacy concerns regulations, federated learning (FL) is a mainstream solution to leverage the data of different entities. However, fine-tuning LLMs in federated learning settings still lacks adequate support from existing FL frameworks because it has to deal with optimizing the consumption of significant communication and computational resources, data preparation for different tasks, and distinct information protection demands. This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution, which consists of the following components: (1) we build an end-to-end benchmarking pipeline, automizing the processes of dataset preprocessing, federated fine-tuning execution, and performance evaluation on federated LLM fine-tuning; (2) we provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios with low communication and computation costs, even without accessing the full model; (3) we adopt several accelerating and resource-efficient operators for fine-tuning LLMs with limited resources and the flexible pluggable sub-routines for interdisciplinary study. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings, which also yields valuable insights into federated fine-tuning LLMs for the research community. To facilitate further research and adoption, we release FS-LLM at https://github.com/alibaba/FederatedScope/tree/llm.
    摘要 大语言模型(LLMs)已在各类自然语言处理任务中展现出强大能力。不同的机构可以通过微调 LLM,进一步提升其在各自特定下游任务上的性能。当多个机构拥有相似的目标任务、但由于隐私法规其数据无法共享时,联邦学习(FL)是利用各方数据的主流解决方案。然而,在联邦学习设置下微调 LLM 仍缺乏现有 FL 框架的充分支持,因为它需要优化巨大的通信与计算资源消耗、为不同任务准备数据,并满足不同的信息保护需求。本文首先讨论了联邦微调 LLM 所面临的这些挑战,并给出我们的软件包 FS-LLM 作为主要贡献,其包含以下组件:(1)我们构建了端到端的基准测试流水线,将数据集预处理、联邦微调执行与性能评估等流程自动化;(2)我们提供了全面的联邦参数高效微调算法实现以及面向未来扩展的多样化编程接口,即使无法访问完整模型,也能在低通信与计算开销下适用于 FL 场景;(3)我们采用了多种加速与资源高效的算子,以便在有限资源下微调 LLM,并提供灵活可插拔的子例程以支持跨学科研究。我们开展了大量实验验证 FS-LLM 的有效性,并在 FL 设置下以最先进的参数高效微调算法对先进 LLM 进行了基准测试,为研究社区提供了关于联邦微调 LLM 的宝贵见解。为促进后续研究与应用,我们在 https://github.com/alibaba/FederatedScope/tree/llm 发布了 FS-LLM。
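
The server-side primitive that federated fine-tuning rounds build on is a weighted average of client updates; below is a minimal sketch for (e.g. adapter-only) state dicts. FS-LLM's actual pipeline adds benchmarking, resource-efficient operators, and privacy machinery not shown here.

```python
import torch

def fedavg(client_states, weights=None):
    """Weighted average of client state dicts (e.g. parameter-efficient adapters only)."""
    n = len(client_states)
    weights = weights if weights is not None else [1.0 / n] * n
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, client_states))
            for key in client_states[0]}

clients = [{"adapter.weight": torch.randn(4, 4)} for _ in range(3)]
global_adapter = fedavg(clients)
```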

Local and adaptive mirror descents in extensive-form games

  • paper_url: http://arxiv.org/abs/2309.00656
  • repo_url: None
  • paper_authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko
  • for: 这篇论文研究如何在带轨迹反馈的零和不完全信息博弈(IIG)中学习 $\epsilon$-最优策略。
  • methods: 该论文采用固定采样方法:玩家仍随时间更新策略,但其观测由给定的固定采样策略产生;算法基于自适应在线镜像下降(OMD),在每个信息集上局部地应用 OMD,并使用逐个递减的学习率与正则化损失。
  • results: 该论文证明这种方法以高概率达到 $\tilde{\mathcal{O}}(T^{-1/2})$ 的收敛速率,且在采用理论上最优的学习率与采样策略时,对博弈参数具有接近最优的依赖关系。
    Abstract We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
    摘要 我们研究在带轨迹反馈的零和不完全信息博弈(IIG)中如何学习 $\epsilon$-最优策略。在该设定中,玩家在固定的 $T$ 个回合内,依据各自的观测依次更新策略。现有方法由于对动作序列使用重要性采样而具有较高方差(Steinberger 等,2020;McAleer 等,2022)。为降低方差,我们考虑一种固定采样方法:玩家仍随时间更新策略,但观测通过给定的固定采样策略获得。我们的方法基于一种自适应在线镜像下降(OMD)算法,它在每个信息集上局部应用 OMD,使用逐个递减的学习率和正则化损失。我们证明该方法以高概率保证 $\tilde{\mathcal{O}}(T^{-1/2})$ 的收敛速率,且在采用理论上最优的学习率与采样策略时,对博弈参数具有接近最优的依赖关系。为得到这些结果,我们推广了 OMD 稳定化的概念,允许带凸增量的时变正则化。
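
One local OMD step with negative-entropy regularization is simply an exponential-weights update on the simplex; a minimal sketch of that step for a single information set (the fixed-sampling loss estimates and the decreasing learning-rate schedule are not shown).

```python
import numpy as np

def omd_step(policy, loss_estimates, eta):
    """Mirror descent with negative entropy: multiplicative update, then renormalize."""
    w = policy * np.exp(-eta * loss_estimates)
    return w / w.sum()

policy = np.full(3, 1 / 3)
policy = omd_step(policy, loss_estimates=np.array([0.2, 0.9, 0.5]), eta=0.1)
```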

Bespoke Nanoparticle Synthesis and Chemical Knowledge Discovery Via Autonomous Experimentations

  • paper_url: http://arxiv.org/abs/2309.00349
  • repo_url: None
  • paper_authors: Hyuk Jun Yoo, Nayeon Kim, Heeseung Lee, Daeho Kim, Leslie Tiong Ching Ow, Hyobin Nam, Chansoo Kim, Seung Yong Lee, Kwan-Young Lee, Donghun Kim, Sang Soo Han
  • for: 本研究旨在开发一种自主实验平台,用于按需设计具有目标光学性质的纳米粒子。
  • methods: 该平台采用闭环机制,将纳米粒子批量合成模块与 UV-Vis 光谱测量模块相连接,并基于人工智能优化模型的反馈精确控制合成过程。
  • results: 以银(Ag)纳米粒子为代表,我们展示了带提前停止准则的贝叶斯优化器在五种合成试剂上进行优化时的高效性:仅需 200 次迭代即可获得精确具备所需吸收光谱的 Ag 纳米粒子。此外,对合成变量的分析还揭示了柠檬酸盐在 Ag 纳米粒子合成中的新化学效应:柠檬酸盐的用量是控制球形与片状纳米粒子之间竞争的关键,并因此影响吸收光谱的形状。
    Abstract The optimization of nanomaterial synthesis using numerous synthetic variables is considered to be extremely laborious task because the conventional combinatorial explorations are prohibitively expensive. In this work, we report an autonomous experimentation platform developed for the bespoke design of nanoparticles (NPs) with targeted optical properties. This platform operates in a closed-loop manner between a batch synthesis module of NPs and a UV- Vis spectroscopy module, based on the feedback of the AI optimization modeling. With silver (Ag) NPs as a representative example, we demonstrate that the Bayesian optimizer implemented with the early stopping criterion can efficiently produce Ag NPs precisely possessing the desired absorption spectra within only 200 iterations (when optimizing among five synthetic reagents). In addition to the outstanding material developmental efficiency, the analysis of synthetic variables further reveals a novel chemistry involving the effects of citrate in Ag NP synthesis. The amount of citrate is a key to controlling the competitions between spherical and plate-shaped NPs and, as a result, affects the shapes of the absorption spectra as well. Our study highlights both capabilities of the platform to enhance search efficiencies and to provide a novel chemical knowledge by analyzing datasets accumulated from the autonomous experimentations.
    摘要 利用众多合成变量优化纳米材料合成被认为是极其繁重的任务,因为传统的组合式探索代价高得难以承受。在本工作中,我们报道了一种为按需设计具有目标光学性质的纳米粒子(NPs)而开发的自主实验平台。该平台在纳米粒子批量合成模块与 UV-Vis 光谱模块之间以闭环方式运行,并以 AI 优化模型的反馈为依据。以银(Ag)纳米粒子为代表,我们证明了采用提前停止准则的贝叶斯优化器能够高效地合成精确具备所需吸收光谱的 Ag 纳米粒子,在对五种合成试剂进行优化时仅需 200 次迭代。除了出色的材料研发效率之外,对合成变量的分析还揭示了柠檬酸盐在 Ag 纳米粒子合成中的新化学效应:柠檬酸盐的用量是控制球形与片状纳米粒子之间竞争的关键,并因此影响吸收光谱的形状。我们的研究凸显了该平台的两方面能力:提升搜索效率,以及通过分析自主实验积累的数据集提供新的化学知识。
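
For orientation, here is a generic Bayesian-optimization loop with Expected Improvement and a simple early-stopping rule on a synthetic 1D objective; the authors' closed-loop hardware platform and five-reagent search space are not reproduced, and all parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
objective = lambda x: np.sin(5 * x[:, 0]) * (1 - np.tanh(x[:, 0] ** 2))  # stand-in for spectral mismatch

X = rng.uniform(-1, 1, (5, 1)); y = objective(X)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

best = []
for it in range(200):
    gp.fit(X, y)
    cand = rng.uniform(-1, 1, (256, 1))
    mu, sd = gp.predict(cand, return_std=True)
    z = (y.min() - mu) / np.maximum(sd, 1e-9)
    ei = (y.min() - mu) * norm.cdf(z) + sd * norm.pdf(z)     # Expected Improvement
    x_next = cand[np.argmax(ei)][None, :]
    X = np.vstack([X, x_next]); y = np.append(y, objective(x_next))
    best.append(y.min())
    if it > 20 and abs(best[-1] - best[-20]) < 1e-8:         # simple early stopping
        break
```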

Multitask Deep Learning for Accurate Risk Stratification and Prediction of Next Steps for Coronary CT Angiography Patients

  • paper_url: http://arxiv.org/abs/2309.00330
  • repo_url: None
  • paper_authors: Juan Lu, Mohammed Bennamoun, Jonathon Stewart, Jason K. Eshraghian, Yanbin Liu, Benjamin Chow, Frank M. Sanfilippo, Girish Dwivedi
  • for: 这项研究旨在改进疑似及确诊冠状动脉疾病(CAD)患者的风险分层与诊疗决策。
  • methods: 研究使用多任务深度学习模型,同时支持风险分层与下游检查的选择。
  • results: 结果显示,该模型能够基于 CCTA 报告数据,在 CAD 风险分层上取得 0.76 的 AUC,在下游检查预测上取得 0.72 的 AUC,可较准确地估计 CAD 的可能性并推荐下游检查。
    Abstract Diagnostic investigation has an important role in risk stratification and clinical decision making of patients with suspected and documented Coronary Artery Disease (CAD). However, the majority of existing tools are primarily focused on the selection of gatekeeper tests, whereas only a handful of systems contain information regarding the downstream testing or treatment. We propose a multi-task deep learning model to support risk stratification and down-stream test selection for patients undergoing Coronary Computed Tomography Angiography (CCTA). The analysis included 14,021 patients who underwent CCTA between 2006 and 2017. Our novel multitask deep learning framework extends the state-of-the art Perceiver model to deal with real-world CCTA report data. Our model achieved an Area Under the receiver operating characteristic Curve (AUC) of 0.76 in CAD risk stratification, and 0.72 AUC in predicting downstream tests. Our proposed deep learning model can accurately estimate the likelihood of CAD and provide recommended downstream tests based on prior CCTA data. In clinical practice, the utilization of such an approach could bring a paradigm shift in risk stratification and downstream management. Despite significant progress using deep learning models for tabular data, they do not outperform gradient boosting decision trees, and further research is required in this area. However, neural networks appear to benefit more readily from multi-task learning than tree-based models. This could offset the shortcomings of using single task learning approach when working with tabular data.
    摘要 诊断性检查在疑似及确诊冠状动脉疾病(CAD)患者的风险分层与临床决策中扮演着重要角色。然而,现有工具大多聚焦于把关性检查的选择,仅有少数系统包含下游检查或治疗的信息。我们提出一种多任务深度学习模型,为接受冠状动脉 CT 血管造影(CCTA)的患者提供风险分层与下游检查选择的支持。分析纳入了 2006 年至 2017 年间接受 CCTA 的 14,021 名患者。我们新颖的多任务深度学习框架扩展了最先进的 Perceiver 模型,以处理真实世界的 CCTA 报告数据。模型在 CAD 风险分层上取得了 0.76 的受试者工作特征曲线下面积(AUC),在下游检查预测上取得了 0.72 的 AUC。所提深度学习模型能够基于既往 CCTA 数据准确估计 CAD 的可能性,并给出推荐的下游检查。在临床实践中,采用这类方法有望为风险分层与下游管理带来范式转变。尽管深度学习模型在表格数据上已取得显著进展,但其表现仍未超过梯度提升决策树,这一方向仍需进一步研究;不过,神经网络似乎比树模型更容易从多任务学习中获益,这可能弥补在表格数据上使用单任务学习方法的不足。
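
A minimal sketch of the multi-task pattern: one shared trunk with a binary risk head and a multi-class next-test head trained under a combined loss. Dimensions and the toy data are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

class MultiTaskNet(torch.nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_tests=5):
        super().__init__()
        self.trunk = torch.nn.Sequential(torch.nn.Linear(in_dim, hidden), torch.nn.ReLU())
        self.risk_head = torch.nn.Linear(hidden, 1)        # CAD risk (binary)
        self.test_head = torch.nn.Linear(hidden, n_tests)  # recommended downstream test

    def forward(self, x):
        h = self.trunk(x)
        return self.risk_head(h).squeeze(-1), self.test_head(h)

model = MultiTaskNet()
x = torch.randn(8, 64)
risk_target = torch.randint(0, 2, (8,)).float()
test_target = torch.randint(0, 5, (8,))
risk_logit, test_logits = model(x)
loss = (F.binary_cross_entropy_with_logits(risk_logit, risk_target)
        + F.cross_entropy(test_logits, test_target))       # shared trunk learns both tasks
loss.backward()
```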

Mi-Go: Test Framework which uses YouTube as Data Source for Evaluating Speech Recognition Models like OpenAI’s Whisper

  • paper_url: http://arxiv.org/abs/2309.00329
  • repo_url: None
  • paper_authors: Tomasz Wojnar, Jaroslaw Hryszko, Adam Roman
  • for: 评估语音识别机器学习模型在多种语言、口音、方言、说话风格与音质水平下的性能和适应性。
  • methods: 以 YouTube 作为覆盖多种语言、口音、方言、说话风格和音质水平的数据源,并以 OpenAI 开发的 Whisper 模型作为测试对象。
  • results: 结果表明 YouTube 是检验语音识别模型稳健性、准确性与适应性的有价值测试平台;通过将机器转录与人工字幕对比,还可帮助发现 YouTube 字幕的潜在滥用,如搜索引擎优化。
    Abstract This article introduces Mi-Go, a novel testing framework aimed at evaluating the performance and adaptability of general-purpose speech recognition machine learning models across diverse real-world scenarios. The framework leverages YouTube as a rich and continuously updated data source, accounting for multiple languages, accents, dialects, speaking styles, and audio quality levels. To demonstrate the effectiveness of the framework, the Whisper model, developed by OpenAI, was employed as a test object. The tests involve using a total of 124 YouTube videos to test all Whisper model versions. The results underscore the utility of YouTube as a valuable testing platform for speech recognition models, ensuring their robustness, accuracy, and adaptability to diverse languages and acoustic conditions. Additionally, by contrasting the machine-generated transcriptions against human-made subtitles, the Mi-Go framework can help pinpoint potential misuse of YouTube subtitles, like Search Engine Optimization.
    摘要 本文介绍 Mi-Go,一种新的测试框架,旨在评估通用语音识别机器学习模型在多样化真实场景下的性能与适应性。该框架利用 YouTube 作为丰富且持续更新的数据源,覆盖多种语言、口音、方言、说话风格和音质水平。为展示框架的有效性,我们以 OpenAI 开发的 Whisper 模型作为测试对象,共使用 124 个 YouTube 视频测试了 Whisper 的所有模型版本。结果凸显了 YouTube 作为语音识别模型测试平台的价值,有助于保证模型在多种语言和声学条件下的稳健性、准确性与适应性。此外,通过将机器生成的转录与人工字幕进行对比,Mi-Go 框架还能帮助发现 YouTube 字幕的潜在滥用行为,例如搜索引擎优化。
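
A minimal sketch of the per-video evaluation step, assuming the audio and the human subtitles were already fetched (e.g. with yt-dlp); file names are placeholders, and word error rate is one plausible comparison metric, not necessarily Mi-Go's.

```python
# pip install openai-whisper jiwer
import whisper
import jiwer

model = whisper.load_model("base")                      # any Whisper version under test
hypothesis = model.transcribe("video_audio.mp3")["text"]

with open("human_subtitles.txt") as f:
    reference = f.read()

print("word error rate:", jiwer.wer(reference, hypothesis))
```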

Multi-fidelity reduced-order surrogate modeling

  • paper_url: http://arxiv.org/abs/2309.00325
  • repo_url: https://github.com/contipaolo/multifidelity_pod
  • paper_authors: Paolo Conti, Mengwu Guo, Andrea Manzoni, Attilio Frangi, Steven L. Brunton, J. Nathan Kutz
  • for: 这篇论文提出一种基于多保真度神经网络的降阶代理建模方法,在高保真数据有限或稀缺时,借助计算廉价的低保真模型提升预测精度。
  • methods: 该方法先对高保真解快照应用本征正交分解(POD)生成空间基,再用多保真度长短期记忆(LSTM)网络逼近降阶状态(即 POD 基的时变、参数相关展开系数)的动力学。
  • results: 该方法能够以非侵入方式高效恢复随时间与参数变化的完整解场,并较好地捕捉低保真模型难以刻画的失稳起始与关键瞬态现象。
    Abstract High-fidelity numerical simulations of partial differential equations (PDEs) given a restricted computational budget can significantly limit the number of parameter configurations considered and/or time window evaluated for modeling a given system. Multi-fidelity surrogate modeling aims to leverage less accurate, lower-fidelity models that are computationally inexpensive in order to enhance predictive accuracy when high-fidelity data are limited or scarce. However, low-fidelity models, while often displaying important qualitative spatio-temporal features, fail to accurately capture the onset of instability and critical transients observed in the high-fidelity models, making them impractical as surrogate models. To address this shortcoming, we present a new data-driven strategy that combines dimensionality reduction with multi-fidelity neural network surrogates. The key idea is to generate a spatial basis by applying the classical proper orthogonal decomposition (POD) to high-fidelity solution snapshots, and approximate the dynamics of the reduced states - time-parameter-dependent expansion coefficients of the POD basis - using a multi-fidelity long-short term memory (LSTM) network. By mapping low-fidelity reduced states to their high-fidelity counterpart, the proposed reduced-order surrogate model enables the efficient recovery of full solution fields over time and parameter variations in a non-intrusive manner. The generality and robustness of this method is demonstrated by a collection of parametrized, time-dependent PDE problems where the low-fidelity model can be defined by coarser meshes and/or time stepping, as well as by misspecified physical features. Importantly, the onset of instabilities and transients are well captured by this surrogate modeling technique.
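
A compact sketch of the two-step recipe described above, with stand-in data and layer sizes of our own choosing (the paper's multi-fidelity LSTM is more elaborate): build a POD basis from high-fidelity snapshots, train an LSTM to map low-fidelity reduced states to their high-fidelity counterparts, and recover full fields non-intrusively.

```python
import numpy as np
import torch
import torch.nn as nn

# High-fidelity snapshots: columns are solution fields at successive times.
n_space, n_time, r = 512, 200, 8
U_hf = np.random.randn(n_space, n_time)               # stand-in for an expensive HF solver
U_lf = U_hf + 0.3 * np.random.randn(n_space, n_time)  # stand-in for a cheap LF solver

# 1) POD: spatial basis from an SVD of the HF snapshot matrix, truncated to r modes.
Phi = np.linalg.svd(U_hf, full_matrices=False)[0][:, :r]  # (n_space, r)
a_hf = Phi.T @ U_hf   # HF reduced states (expansion coefficients), (r, n_time)
a_lf = Phi.T @ U_lf   # LF states projected onto the same basis

# 2) LSTM maps LF reduced-state sequences to HF reduced states.
class LF2HF(nn.Module):
    def __init__(self, r, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=r, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, r)
    def forward(self, x):          # x: (batch, time, r)
        h, _ = self.lstm(x)
        return self.head(h)

model = LF2HF(r)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.tensor(a_lf.T[None], dtype=torch.float32)   # (1, n_time, r)
y = torch.tensor(a_hf.T[None], dtype=torch.float32)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

# 3) Non-intrusive recovery of full solution fields from LF data alone.
a_pred = model(x).detach().numpy()[0].T   # (r, n_time)
U_rec = Phi @ a_pred                      # reconstructed full fields
```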

SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks

  • paper_url: http://arxiv.org/abs/2309.00255
  • repo_url: None
  • paper_authors: Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi
  • for: Proposes a generalized and scalable approach to finding optimal deep learning models under memory and computation constraints.
  • methods: Sorted training of many-in-one networks: (1) a nested architecture in which sub-networks share parameters with the main model; (2) random sampling of sub-networks during training; (3) a novel updating scheme that combines this sampling with gradient accumulation for training efficiency (a toy training loop follows the abstract).
  • results: Sorted training yields efficient sub-networks and outperforms state-of-the-art dynamic training approaches; up to 160 different sub-networks can be trained simultaneously while maintaining 96% of the full model's performance.
    Abstract As the size of deep learning models continues to grow, finding optimal models under memory and computation constraints becomes increasingly more important. Although usually the architecture and constituent building blocks of neural networks allow them to be used in a modular way, their training process is not aware of this modularity. Consequently, conventional neural network training lacks the flexibility to adapt the computational load of the model during inference. This paper proposes SortedNet, a generalized and scalable solution to harness the inherent modularity of deep neural networks across various dimensions for efficient dynamic inference. Our training considers a nested architecture for the sub-models with shared parameters and trains them together with the main model in a sorted and probabilistic manner. This sorted training of sub-networks enables us to scale the number of sub-networks to hundreds using a single round of training. We utilize a novel updating scheme during training that combines random sampling of sub-networks with gradient accumulation to improve training efficiency. Furthermore, the sorted nature of our training leads to a search-free sub-network selection at inference time; and the nested architecture of the resulting sub-networks leads to minimal storage requirement and efficient switching between sub-networks at inference. Our general dynamic training approach is demonstrated across various architectures and tasks, including large language models and pre-trained vision models. Experimental results show the efficacy of the proposed approach in achieving efficient sub-networks while outperforming state-of-the-art dynamic training approaches. Our findings demonstrate the feasibility of training up to 160 different sub-models simultaneously, showcasing the extensive scalability of our proposed method while maintaining 96% of the model performance.
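
A toy version of the sorted-training loop under simplifications of our own (a two-layer MLP whose sub-network of width w keeps the first w hidden units; uniform sub-network sampling): sub-networks share parameters by slicing, and gradients from several randomly sampled sub-networks are accumulated before each optimizer step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedMLP(nn.Module):
    """Nested model: the sub-network of width w uses the first w hidden units."""
    def __init__(self, d_in=20, d_hidden=64, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
    def forward(self, x, width):
        h = F.relu(F.linear(x, self.fc1.weight[:width], self.fc1.bias[:width]))
        return F.linear(h, self.fc2.weight[:, :width], self.fc2.bias)

widths = [8, 16, 32, 64]     # nested sub-networks, smallest to largest
model = SlicedMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum = 4                    # sub-networks sampled per optimizer step

for step in range(100):
    x, t = torch.randn(32, 20), torch.randint(0, 10, (32,))
    opt.zero_grad()
    for _ in range(accum):   # random sub-network sampling + gradient accumulation
        w = widths[torch.randint(len(widths), (1,)).item()]
        (F.cross_entropy(model(x, w), t) / accum).backward()
    opt.step()
# At inference, pick the width matching the compute budget: no search is needed,
# and all sub-networks live inside one set of shared weights.
```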

Data-Driven Projection for Reducing Dimensionality of Linear Programs: Generalization Bound and Learning Methods

  • paper_url: http://arxiv.org/abs/2309.00203
  • repo_url: None
  • paper_authors: Shinsaku Sakaue, Taihei Oki
  • for: Studies a data-driven approach to high-dimensional linear programs (LPs). Given data of past $n$-dimensional LPs, an $n\times k$ projection matrix ($n > k$) is learned to reduce the dimensionality from $n$ to $k$; future instances are then solved as $k$-dimensional LPs, and $n$-dimensional solutions are recovered through the projection matrix. The approach is compatible with any preferred LP solver, making it a versatile route to faster LP solving.
  • methods: Two natural methods for learning projection matrices: a PCA-based method, which is simple and efficient, and a gradient-based method, which sometimes yields better solution quality (the pipeline is sketched after the abstract).
  • results: Experiments confirm that learned projection matrices reduce LP solving time while maintaining high solution quality. On the theory side, the paper gives an $\tilde{\mathrm{O}}(nk^2)$ upper bound and an $\Omega(nk)$ lower bound on the pseudo-dimension of the performance metric, tight up to an $\tilde{\mathrm{O}}(k)$ factor.
    Abstract This paper studies a simple data-driven approach to high-dimensional linear programs (LPs). Given data of past $n$-dimensional LPs, we learn an $n\times k$ \textit{projection matrix} ($n > k$), which reduces the dimensionality from $n$ to $k$. Then, we address future LP instances by solving $k$-dimensional LPs and recovering $n$-dimensional solutions by multiplying the projection matrix. This idea is compatible with any user-preferred LP solvers, hence a versatile approach to faster LP solving. One natural question is: how much data is sufficient to ensure the recovered solutions' quality? We address this question based on the idea of \textit{data-driven algorithm design}, which relates the amount of data sufficient for generalization guarantees to the \textit{pseudo-dimension} of performance metrics. We present an $\tilde{\mathrm{O}}(nk^2)$ upper bound on the pseudo-dimension ($\tilde{\mathrm{O}}$ compresses logarithmic factors) and complement it by an $\Omega(nk)$ lower bound, hence tight up to an $\tilde{\mathrm{O}}(k)$ factor. On the practical side, we study two natural methods for learning projection matrices: PCA- and gradient-based methods. While the former is simple and efficient, the latter sometimes leads to better solution quality. Experiments confirm that learned projection matrices are beneficial for reducing the time for solving LPs while maintaining high solution quality.
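
A runnable sketch of the pipeline on synthetic data, with a PCA-style projection learned from past optimal solutions (the gradient-based learner is omitted); the reduced problem can go to any solver, here SciPy's `linprog`. The instance generation below is our own and only serves to keep the reduced LP feasible and bounded.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, k, m = 100, 10, 60

# Optimal solutions of past n-dimensional LPs, stacked as columns (stand-ins).
X_past = rng.standard_normal((n, 50))

# PCA-style projection: top-k left singular vectors of the past solutions.
P = np.linalg.svd(X_past, full_matrices=False)[0][:, :k]   # (n, k)

# A new instance: min c^T x  s.t.  A x <= b, with b > 0 so y = 0 is feasible.
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.random(m) + 1.0

# Solve the k-dimensional LP over y with x = P y, then lift the solution.
res = linprog(c @ P, A_ub=A @ P, b_ub=b, bounds=[(-1.0, 1.0)] * k)
x_recovered = P @ res.x    # n-dimensional solution recovered via the projection
```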

Deep-learning-based Early Fixing for Gas-lifted Oil Production Optimization: Supervised and Weakly-supervised Approaches

  • paper_url: http://arxiv.org/abs/2309.00197
  • repo_url: None
  • paper_authors: Bruno Machado Pacheco, Laio Oriel Seman, Eduardo Camponogara
  • for: Maximizing oil production from gas-lifted oil wells, which requires repeatedly solving Mixed-Integer Linear Programs (MILPs) as well parameters such as the basic-sediment-to-water ratio and the gas-oil ratio are updated.
  • methods: A tailor-made, learning-based heuristic: deep learning models provide values for all integer variables given the varying well parameters, early-fixing them and thereby reducing the original problem to a linear program (LP). Two training strategies are developed: a supervised approach, which requires optimal integer values for training instances, and a weakly-supervised approach, which only requires solutions of early-fixed LPs with random integer assignments (the early-fixing step is sketched after the abstract).
  • results: A 71.11% runtime reduction; the weakly-supervised model provides useful early-fixing values despite never seeing optimal integer values during training.
    Abstract Maximizing oil production from gas-lifted oil wells entails solving Mixed-Integer Linear Programs (MILPs). As the parameters of the wells, such as the basic-sediment-to-water ratio and the gas-oil ratio, are updated, the problems must be repeatedly solved. Instead of relying on costly exact methods or the accuracy of general approximate methods, in this paper, we propose a tailor-made heuristic solution based on deep learning models trained to provide values to all integer variables given varying well parameters, early-fixing the integer variables and, thus, reducing the original problem to a linear program (LP). We propose two approaches for developing the learning-based heuristic: a supervised learning approach, which requires the optimal integer values for several instances of the original problem in the training set, and a weakly-supervised learning approach, which requires only solutions for the early-fixed linear problems with random assignments for the integer variables. Our results show a runtime reduction of 71.11%. Furthermore, the weakly-supervised learning model provided significant values for early fixing, despite never seeing the optimal values during training.
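
The early-fixing step in miniature, on a toy MILP of our own construction: a stand-in predictor supplies the integer (here binary) variables, which collapses the MILP to an LP in the continuous variables. The real system replaces `predict_integers` with the trained deep model.

```python
import numpy as np
from scipy.optimize import linprog

# Toy MILP: min c_x^T x + c_z^T z  s.t.  A x + B z <= b,  x >= 0, z binary.
rng = np.random.default_rng(1)
nx, nz, m = 5, 3, 4
c_x, c_z = rng.random(nx), rng.random(nz)
A, B = rng.random((m, nx)), rng.random((m, nz))
b = rng.random(m) + nz     # loose enough that any binary z stays feasible

def predict_integers(well_params):
    """Hypothetical stand-in for the trained deep model that early-fixes z."""
    return (well_params > 0.5).astype(float)

z_fixed = predict_integers(rng.random(nz))

# With z fixed, the MILP collapses to an LP in x only.
res = linprog(c_x, A_ub=A, b_ub=b - B @ z_fixed, bounds=[(0, None)] * nx)
objective = res.fun + c_z @ z_fixed   # objective value of the early-fixed solution
```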

eess.IV - 2023-09-01

High-resolution, large field-of-view label-free imaging via aberration-corrected, closed-form complex field reconstruction

  • paper_url: http://arxiv.org/abs/2309.00755
  • repo_url: https://github.com/rzcao/apic-analytical-complex-field-reconstruction
  • paper_authors: Ruizhi Cao, Cheng Shen, Changhuei Yang
  • for: Describes a new computational imaging method, APIC, that produces high-resolution, large field-of-view, aberration-corrected images without parameter selection or iterative algorithms.
  • methods: Uses multiple tilted illuminations for high-throughput acquisition and a new closed-form analytical phase retrieval framework to reconstruct the complex field.
  • results: Experiments show APIC correctly retrieves the complex field associated with darkfield measurements and analytically retrieves the complex aberrations of an imaging system with no additional hardware; compared with FPM under the same number of measurements, APIC reconstructs correctly where FPM fails, computes 2.8 times faster, and handles aberrations whose maximal phase difference exceeds 3.8$\pi$.
    Abstract Computational imaging methods empower modern microscopy with the ability of producing high-resolution, large field-of-view, aberration-free images. One of the dominant computational label-free imaging methods, Fourier ptychographic microscopy (FPM), effectively increases the spatial-bandwidth product of conventional microscopy by using multiple tilted illuminations to achieve high-throughput imaging. However, its iterative reconstruction method is prone to parameter selection, can be computationally expensive and tends to fail under excessive aberrations. Recently, spatial Kramers-Kronig methods show it is possible to analytically reconstruct complex field but lacks the ability of correcting aberrations or providing extended resolution enhancement. Here, we present a closed-form method, termed APIC, which weds the strengths of both methods. A new analytical phase retrieval framework is established in APIC, which demonstrates, for the first time, the feasibility of analytically reconstructing the complex field associated with darkfield measurements. In addition, APIC can analytically retrieve complex aberrations of an imaging system with no additional hardware. By avoiding iterative algorithms, APIC requires no human designed convergence metric and always obtains a closed-form complex field solution. The faithfulness and correctness of APIC's reconstruction are guaranteed due to its analytical nature. We experimentally demonstrate that APIC gives correct reconstruction result while FPM fails to do so when constrained to the same number of measurements. Meanwhile, APIC achieves 2.8 times faster computation using image tile size of 256 (length-wise). We also demonstrate APIC is unprecedentedly robust against aberrations compared to FPM - APIC is capable of addressing aberration whose maximal phase difference exceeds 3.8${\pi}$ when using a NA 0.25 objective in experiment.

Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels

  • paper_url: http://arxiv.org/abs/2309.00470
  • repo_url: None
  • paper_authors: Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, Deniz Gündüz
  • For: Proposes DeepJSCC-MIMO, a vision transformer (ViT)-based deep joint source-channel coding (DeepJSCC) scheme for wireless image transmission over MIMO channels.
  • Methods: Exploits the self-attention mechanism of ViT to intelligently learn feature mapping and power allocation strategies tailored to the source image and the prevailing channel conditions.
  • Results: Numerical experiments show DeepJSCC-MIMO improves transmission quality in both open-loop and closed-loop MIMO systems, and remains robust to channel estimation errors and flexible across diverse channel conditions and antenna numbers without retraining.
    Abstract This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-based benchmarks with robustness to channel estimation errors and showcases remarkable flexibility in adapting to diverse channel conditions and antenna numbers without requiring retraining. Specifically, by harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently learns feature mapping and power allocation strategies tailored to the unique characteristics of the source image and prevailing channel conditions. Extensive numerical experiments validate the significant improvements in transmission quality achieved by DeepJSCC-MIMO for both open-loop and closed-loop MIMO systems across a wide range of scenarios. Moreover, DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel estimation errors, and different antenna numbers, making it an appealing solution for emerging semantic communication systems.

Learning the Imaging Model of Speed-of-Sound Reconstruction via a Convolutional Formulation

  • paper_url: http://arxiv.org/abs/2309.00453
  • repo_url: None
  • paper_authors: Can Deniz Bezek, Maxim Haas, Richard Rau, Orcun Goksel
  • for: Proposes learning the forward imaging model of speed-of-sound (SoS) reconstruction from data to improve the accuracy and robustness of SoS imaging.
  • methods: A convolutional formulation of the pulse-echo SoS imaging problem in which the entire field-of-view shares a single unified kernel; the kernel is estimated by least squares and can be further constrained and regularized for numerical stability.
  • results: A forward model learned from k-Wave simulations improves the median contrast of SoS reconstructions by 63% over a conventional hand-crafted line-based wave-path model; a model learned from a single phantom image quadruples the contrast; on in-vivo data, the simulation- and phantom-learned models yield 7- and 10-fold contrast improvements, respectively.
    Abstract Speed-of-sound (SoS) is an emerging ultrasound contrast modality, where pulse-echo techniques using conventional transducers offer multiple benefits. For estimating tissue SoS distributions, spatial domain reconstruction from relative speckle shifts between different beamforming sequences is a promising approach. This operates based on a forward model that relates the sought local values of SoS to observed speckle shifts, for which the associated image reconstruction inverse problem is solved. The reconstruction accuracy thus highly depends on the hand-crafted forward imaging model. In this work, we propose to learn the SoS imaging model based on data. We introduce a convolutional formulation of the pulse-echo SoS imaging problem such that the entire field-of-view requires a single unified kernel, the learning of which is then tractable and robust. We present least-squares estimation of such convolutional kernel, which can further be constrained and regularized for numerical stability. In experiments, we show that a forward model learned from k-Wave simulations improves the median contrast of SoS reconstructions by 63%, compared to a conventional hand-crafted line-based wave-path model. This simulation-learned model generalizes successfully to acquired phantom data, nearly doubling the SoS contrast compared to the conventional hand-crafted alternative. We demonstrate equipment-specific and small-data regime feasibility by learning a forward model from a single phantom image, where our learned model quadruples the SoS contrast compared to the conventional hand-crafted model. On in-vivo data, the simulation- and phantom-learned models respectively exhibit impressive 7 and 10 folds contrast improvements over the conventional model.

eess.SP - 2023-09-01

Jamming Suppression Via Resource Hopping in High-Mobility OTFS-SCMA Systems

  • paper_url: http://arxiv.org/abs/2309.00753
  • repo_url: None
  • paper_authors: Qinwen Deng, Yao Ge, Zhi Ding
  • for: Studies uplink multiple access and jamming suppression in OTFS systems.
  • methods: Proposes a delay-Doppler-domain resource hopping mechanism for OTFS systems with delay- or Doppler-partitioned sparse code multiple access (SCMA) to mitigate jamming in the controlled multiuser uplink, and analyzes the non-uniform impact of narrowband interference (NBI) and periodic impulse noise (PIN) in the delay-Doppler domain.
  • results: Leveraging turbo equalization, the proposed hopping method delivers consistent BER improvement under jamming over conventional OTFS-SCMA systems with static resource allocation.
    Abstract This letter studies the mechanism of uplink multiple access and jamming suppression in an OTFS system. Specifically, we propose a novel resource hopping mechanism for orthogonal time frequency space (OTFS) systems with delay or Doppler partitioned sparse code multiple access (SCMA) to mitigate the effect of jamming in controlled multiuser uplink. We analyze the non-uniform impact of classic jamming signals such as narrowband interference (NBI) and periodic impulse noise (PIN) in delay-Doppler (DD) domain on OTFS systems. Leveraging turbo equalization, our proposed hopping method demonstrates consistent BER performance improvement under jamming over conventional OTFS-SCMA systems compared to static resource allocation schemes.

Signal Processing and Learning for Next Generation Multiple Access in 6G

  • paper_url: http://arxiv.org/abs/2309.00559
  • repo_url: None
  • paper_authors: Wei Chen, Yuanwei Liu, Hamid Jafarkhani, Yonina C. Eldar, Peiying Zhu, Khaled B Letaief
  • for: Surveys signal processing and learning research for next-generation multiple access (NGMA) in 6G, with emphasis on massive random access and non-orthogonal multiple access.
  • methods: Reviews the promising interplay between NGMA and emerging technologies, and the challenges facing learning-based NGMA.
  • results: Learning-based techniques offer promising approaches to complex, previously intractable signal processing problems, but further research is needed to realize their full potential.
    Abstract Wireless communication systems to date primarily rely on the orthogonality of resources to facilitate the design and implementation, from user access to data transmission. Emerging applications and scenarios in the sixth generation (6G) wireless systems will require massive connectivity and transmission of a deluge of data, which calls for more flexibility in the design concept that goes beyond orthogonality. Furthermore, recent advances in signal processing and learning have attracted considerable attention, as they provide promising approaches to various complex and previously intractable problems of signal processing in many fields. This article provides an overview of research efforts to date in the field of signal processing and learning for next-generation multiple access, with an emphasis on massive random access and non-orthogonal multiple access. The promising interplay with new technologies and the challenges in learning-based NGMA are discussed.

Achievable Rate Region and Path-Based Beamforming for Multi-User Single-Carrier Delay Alignment Modulation

  • paper_url: http://arxiv.org/abs/2309.00391
  • repo_url: None
  • paper_authors: Xingwei Wang, Haiquan Lu, Yong Zeng, Xiaoli Xu, Jie Xu
  • for: Studies delay alignment modulation (DAM) for multi-user mmWave massive MIMO communication.
  • methods: Uses delay pre-compensation and per-path beamforming to align the multi-path components, eliminating ISI while preserving the multi-path power gain (a sketch follows the abstract).
  • results: With a sufficiently large number of BS antennas Mt, simple delay pre-compensation with per-path MRT beamforming lets single-carrier DAM perfectly eliminate both ISI and IUI; for finite Mt, the achievable rate region of the multi-user DAM system is characterized via its Pareto boundary; three low-complexity per-path beamforming strategies (MRT, zero-forcing, and regularized zero-forcing) are studied; comparisons against strongest-path-based beamforming and OFDM show that DAM achieves higher spectral efficiency and/or lower peak-to-average ratio for systems with high spatial resolution and multi-path diversity.
    Abstract Delay alignment modulation (DAM) is a novel wideband transmission technique for mmWave massive MIMO systems, which exploits the high spatial resolution and multi-path sparsity to mitigate ISI, without relying on channel equalization or multi-carrier transmission. In particular, DAM leverages the delay pre-compensation and path-based beamforming to effectively align the multi-path components, thus achieving the constructive multi-path combination for eliminating the ISI while preserving the multi-path power gain. Different from the existing works only considering single-user DAM, this paper investigates the DAM technique for multi-user mmWave massive MIMO communication. First, we consider the asymptotic regime when the number of antennas Mt at BS is sufficiently large. It is shown that by employing the simple delay pre-compensation and per-path-based MRT beamforming, the single-carrier DAM is able to perfectly eliminate both ISI and IUI. Next, we consider the general scenario with Mt being finite. In this scenario, we characterize the achievable rate region of the multi-user DAM system by finding its Pareto boundary. Specifically, we formulate a rate-profile-constrained sum rate maximization problem by optimizing the per-path-based beamforming. Furthermore, we present three low-complexity per-path-based beamforming strategies based on the MRT, zero-forcing, and regularized zero-forcing principles, respectively, based on which the achievable sum rates are studied. Finally, we provide simulation results to demonstrate the performance of our proposed strategies as compared to two benchmark schemes based on the strongest-path-based beamforming and the prevalent OFDM, respectively. It is shown that DAM achieves higher spectral efficiency and/or lower peak-to-average-ratio, for systems with high spatial resolution and multi-path diversity.
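
A numpy sketch of the core mechanism for a single user, under toy assumptions of our own (integer symbol-spaced delays, BPSK, i.i.d. Rayleigh paths): each path l gets an MRT beamformer and a pre-delay of n_max - n_l, so all L paths arrive time-aligned at a single delay; the residual inter-path leakage shrinks as the antenna count Mt grows.

```python
import numpy as np

rng = np.random.default_rng(2)
Mt, L, N = 64, 4, 200
delays = np.array([0, 3, 5, 9])            # per-path delays, in symbols
n_max = delays.max()
h = (rng.standard_normal((L, Mt)) + 1j * rng.standard_normal((L, Mt))) / np.sqrt(2)
s = (rng.integers(0, 2, N) * 2 - 1).astype(complex)   # BPSK symbols

# Per-path MRT beamformers; path l carries s pre-delayed by (n_max - delays[l]).
f = (h.conj() / np.linalg.norm(h, axis=1, keepdims=True)).T   # (Mt, L)
x = np.zeros((Mt, N + n_max), dtype=complex)
for l in range(L):
    pre = n_max - delays[l]
    x[:, pre:pre + N] += np.outer(f[:, l], s) / np.sqrt(L)

# Propagation through the L-path channel: path l adds its own delay back.
y = np.zeros(N + 2 * n_max, dtype=complex)
for l in range(L):
    y[delays[l]:delays[l] + N + n_max] += h[l] @ x

# All paths combine constructively at delay n_max; the cross-path terms act as
# residual ISI that vanishes as Mt grows (f_l is near-orthogonal to h_k, k != l).
y_aligned = y[n_max:n_max + N]
```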

Bayesian estimation and reconstruction of marine surface contaminant dispersion

  • paper_url: http://arxiv.org/abs/2309.00369
  • repo_url: None
  • paper_authors: Yang Liu, Christopher M. Harvey, Frederick E. Hamlyn, Cunjia Liu
  • for: Accurately estimating the release strength of a marine contaminant source and reconstructing the spatio-temporal dispersion of the substance from collected measurements, for use with a sensor network or a mobile sensor.
  • methods: Represents the dispersion with the convection-diffusion partial differential equation (PDE), spatially discretised into a linear state-space model via the dynamic transient finite-element method (FEM); imperfect sensing, including miss-detection and signal quantisation, is handled by a Rao-Blackwellised particle filter (RBPF) that exploits the linear model structure while treating the nonlinear measurement model with Monte Carlo particles.
  • results: Validation on a simulated oil spill in the Baltic Sea with real ocean flow data demonstrates the efficacy of the dispersion model and estimation schemes under imperfect measurements; parameter selection and comparisons with existing methods illustrate the algorithm's advantages.
    Abstract Discharge of hazardous substances into the marine environment poses a substantial risk to both public health and the ecosystem. In such incidents, it is imperative to accurately estimate the release strength of the source and reconstruct the spatio-temporal dispersion of the substances based on the collected measurements. In this study, we propose an integrated estimation framework to tackle this challenge, which can be used in conjunction with a sensor network or a mobile sensor for environment monitoring. We employ the fundamental convection-diffusion partial differential equation (PDE) to represent the general dispersion of a physical quantity in a non-uniform flow field. The PDE model is spatially discretised into a linear state-space model using the dynamic transient finite-element method (FEM) so that the characterisation of time-varying dispersion can be cast into the problem of inferring the model states from sensor measurements. We also consider imperfect sensing phenomena, including miss-detection and signal quantisation, which are frequently encountered when using a sensor network. This complicated sensor process introduces nonlinearity into the Bayesian estimation process. A Rao-Blackwellised particle filter (RBPF) is designed to provide an effective solution by exploiting the linear structure of the state-space model, whereas the nonlinearity of the measurement model can be handled by Monte Carlo approximation with particles. The proposed framework is validated using a simulated oil spill incident in the Baltic sea with real ocean flow data. The results show the efficacy of the developed spatio-temporal dispersion model and estimation schemes in the presence of imperfect measurements. Moreover, the parameter selection process is discussed, along with some comparison studies to illustrate the advantages of the proposed algorithm over existing methods.

Message Passing Based Block Sparse Signal Recovery for DOA Estimation Using Large Arrays

  • paper_url: http://arxiv.org/abs/2309.00313
  • repo_url: None
  • paper_authors: Yiwen Mao, Dawei Gao, Qinghua Guo, Ming Jin
  • for: Direction-of-arrival (DOA) estimation with a large antenna array.
  • methods: Develops a novel signal model with a sparse system transfer matrix obtained through an inverse discrete Fourier transform (DFT) operation, turning DOA estimation into a structured block-sparse signal recovery problem solved by a low-complexity message-passing-based Bayesian algorithm with a factor graph representation (the sparsifying transform is sketched after the abstract).
  • results: Simulation results demonstrate the superior performance of the proposed method.
    Abstract This work deals with directional of arrival (DOA) estimation with a large antenna array. We first develop a novel signal model with a sparse system transfer matrix using an inverse discrete Fourier transform (DFT) operation, which leads to the formulation of a structured block sparse signal recovery problem with a sparse sensing matrix. This enables the development of a low complexity message passing based Bayesian algorithm with a factor graph representation. Simulation results demonstrate the superior performance of the proposed method.
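
One common reading of how an inverse DFT sparsifies a ULA model, sketched for the on-grid case (the paper's actual model and message-passing recovery are richer): on the DFT angular grid, the array manifold of a half-wavelength ULA is a scaled DFT matrix, so an IDFT of the observation exposes the (block-)sparse angular support directly.

```python
import numpy as np

M = 64                                    # number of array elements
F = np.fft.fft(np.eye(M)) / np.sqrt(M)    # unitary DFT matrix

# Block-sparse angular spectrum: one cluster of active DFT-grid directions.
x = np.zeros(M, dtype=complex)
x[[5, 6, 7]] = 1.0

# For a half-wavelength ULA, on-grid steering vectors are DFT columns, so the
# system matrix is A = sqrt(M) * F^H and y = A x (plus noise).
A = np.sqrt(M) * F.conj().T
y = A @ x + 0.01 * (np.random.randn(M) + 1j * np.random.randn(M))

# Inverse-DFT-domain observation: approximately sqrt(M) * x, so the sensing
# matrix of the equivalent recovery problem is extremely sparse.
y_dd = F @ y
support = np.argsort(np.abs(y_dd))[-3:]   # the active block stands out
```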

  • paper_url: http://arxiv.org/abs/2309.00289
  • repo_url: None
  • paper_authors: Yatao Liu, Mingjie Shao, Wing-Kin Ma
  • for: Letting massive MIMO downlink base stations (BSs) use cheap, power-efficient power amplifiers (PAs) without suffering the nonlinear distortions that arise beyond the PAs' limited linear amplification ranges.
  • methods: Applies a Sigma-Delta ($\Sigma \Delta$) modulator across a uniform linear array (ULA) to spatially shape the PA distortions toward the high-angle region, so that signals received in the low-angle region are less affected (the error-feedback loop is sketched after the abstract).
  • results: Symbol-level precoding (SLP) and zero-forcing (ZF) precoding schemes designed under the spatial $\Sigma \Delta$ requirement are shown effective in numerical simulations.
    Abstract In massive multiple-input multiple-output (MIMO) downlink systems, the physical implementation of the base stations (BSs) requires the use of cheap and power-efficient power amplifiers (PAs) to avoid high hardware cost and high power consumption. However, such PAs usually have limited linear amplification ranges. Nonlinear distortions arising from operation beyond the linear amplification ranges can significantly degrade system performance. Existing approaches to handle the nonlinear distortions, such as digital predistortion (DPD), typically require accurate knowledge, or acquisition, of the PA transfer function. In this paper, we present a new concept for mitigation of the PA distortions. Assuming a uniform linear array (ULA) at the BS, the idea is to apply a Sigma-Delta ($\Sigma \Delta$) modulator to spatially shape the PA distortions to the high-angle region. By having the system operating in the low-angle region, the received signals are less affected by the PA distortions. To demonstrate the potential of this spatial $\Sigma \Delta$ approach, we study the application of our approach to the multi-user MIMO-orthogonal frequency division modulation (OFDM) downlink scenario. A symbol-level precoding (SLP) scheme and a zero-forcing (ZF) precoding scheme, with the new design requirement by the spatial $\Sigma \Delta$ approach being taken into account, are developed. Numerical simulations are performed to show the effectiveness of the developed $\Sigma \Delta$ precoding schemes.
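
The spatial Sigma-Delta idea in miniature: the classic first-order error-feedback loop, run across the antenna index instead of time. Here it shapes coarse 1-bit quantization error (a stand-in for the PA distortion the paper shapes with the same principle); the shaped noise is high-pass in spatial frequency, i.e. pushed toward large steering angles.

```python
import numpy as np

def spatial_sigma_delta(x):
    """First-order Sigma-Delta across a ULA: the error of element m is fed to
    element m + 1, so the distortion spectrum is shaped by (1 - z^-1) in the
    antenna index, concentrating it at high spatial frequencies."""
    q = np.empty_like(x)
    err = 0.0 + 0.0j
    for m in range(len(x)):
        v = x[m] + err
        q[m] = np.sign(v.real) + 1j * np.sign(v.imag)   # coarse per-rail output
        err = v - q[m]
    return q

Mt = 64
x = 0.5 * np.exp(1j * 2 * np.pi * 0.03 * np.arange(Mt))  # low-angle excitation
d = spatial_sigma_delta(x) - x                           # shaped distortion
spec = np.abs(np.fft.fftshift(np.fft.fft(d, 512)))       # piles up at high |angle|
```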

Evaluation of onboard sensors for track geometry monitoring against conventional track recording measurements

  • paper_url: http://arxiv.org/abs/2309.00270
  • repo_url: None
  • paper_authors: Hengcheng Zhang, Zhan Yie Chin, Pietro Borghesani, James Pitt, Michael E. Cholette
  • for: Assessing the feasibility and accuracy of inferring key track condition parameters, such as vertical and horizontal rail alignment, from onboard micro-electro-mechanical-system (MEMS) accelerometers.
  • methods: A prototype onboard data acquisition system (DAQ) was installed on a track recording car (TRC) and a measurement campaign conducted on an extensive portion of the Brisbane suburban railway network; accelerometer-based results were compared against the TRC recordings (a typical processing chain is sketched after the abstract).
  • results: Bogie-mounted accelerometers are the best compromise between proximity to the source and insensitivity to impulsive noise; two vertical bogie accelerometers (left and right) give a good quantitative estimate of vertical alignment, and lateral MEMS measurements correlate strongly with TRC horizontal alignment, so a two-axis (vertical and lateral) bogie MEMS setup is an effective system.
    Abstract The main objective of this paper is to assess the feasibility and accuracy of inferring key track condition parameters, e.g., vertical alignment and horizontal alignment of the rails, using onboard micro-electro-mechanical-system (MEMS) accelerometers. To achieve this aim, a prototype of an onboard data acquisition system (DAQ) was designed and installed on a track recording car (TRC) and a measurement campaign was conducted on an extensive portion of the Brisbane Suburban railway network. Comparison of the accelerometer-based results vs TRC recordings have shown that accelerometers installed on the bogie are the best compromise between proximity to the source and insensitivity to impulsive noise. Moreover, it was found that two vertical bogie accelerometers (left and right side) provide a good quantitative estimate of vertical alignment and that strong correlations with TRC measurements exist for lateral MEMS accelerometer measurements (horizontal alignment). These findings suggest that two bogie MEMS accelerometers with two measurement axes (vertical and lateral) are an effective system and can provide quantitative estimates of vertical alignment and trends in the geographical distribution of horizontal alignment.
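
The paper does not spell out its processing chain, but a typical way to relate bogie acceleration to an alignment-style displacement estimate is band-pass filtering plus double integration; the sketch below, with an assumed 200 Hz sampling rate and a toy signal, illustrates that common pattern rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 200.0                                 # assumed accelerometer sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
acc = np.sin(2 * np.pi * 1.5 * t) + 0.05 * np.random.randn(t.size)  # toy bogie signal

# Band-pass to the wavelength band of interest; suppresses drift before integration.
ba = butter(4, [0.5, 25.0], btype="bandpass", fs=fs)
acc_f = filtfilt(*ba, acc)

# Double integration: acceleration -> velocity -> displacement, re-filtering
# after each step to remove the drift that integration reintroduces.
vel = filtfilt(*ba, np.cumsum(acc_f) / fs)
disp = filtfilt(*ba, np.cumsum(vel) / fs)  # displacement proxy for alignment
```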

Concept for an Automatic Annotation of Automotive Radar Data Using AI-segmented Aerial Camera Images

  • paper_url: http://arxiv.org/abs/2309.00268
  • repo_url: None
  • paper_authors: Marcel Hoffmann, Sandro Braun, Oliver Sura, Michael Stelzig, Christian Schüßler, Knut Graichen, Martin Vossiek
  • for: Automatically annotating automotive radar data using AI-segmented aerial camera images.
  • methods: Panoptically segments images from a UAV hovering above the radar vehicle, maps the detected instances and segments in the ground plane onto the radar images, and applies them directly as labels for the radar data; the bird's-eye position avoids optical occlusion and covers the radar's complete field of view.
  • results: Effectiveness and scalability demonstrated in measurements, where 589 pedestrians in the radar data were automatically labeled within 2 minutes.
    Abstract This paper presents an approach to automatically annotate automotive radar data with AI-segmented aerial camera images. For this, the images of an unmanned aerial vehicle (UAV) above a radar vehicle are panoptically segmented and mapped in the ground plane onto the radar images. The detected instances and segments in the camera image can then be applied directly as labels for the radar data. Owing to the advantageous bird's eye position, the UAV camera does not suffer from optical occlusion and is capable of creating annotations within the complete field of view of the radar. The effectiveness and scalability are demonstrated in measurements, where 589 pedestrians in the radar data were automatically labeled within 2 minutes.

Ground Truth Generation Algorithm for Medium-Frequency R-Mode Skywave Detection

  • paper_url: http://arxiv.org/abs/2309.00234
  • repo_url: None
  • paper_authors: Suhui Jeong, Pyo-Woong Son
  • for: Navigation systems providing positioning, navigation, and timing (PNT) information are increasingly important, but GNSS is vulnerable to radio frequency interference (RFI), motivating alternative navigation systems.
  • methods: Studies the ground-based maritime integrated navigation system R-Mode, which uses medium-frequency (MF) differential GNSS (DGNSS) and very high-frequency data exchange system (VDES) signals as ranging signals and incorporates the existing enhanced long-range navigation (eLoran) system.
  • results: MF R-Mode shows significant daytime/nighttime performance differences due to skywave interference from MF signals reflecting off the ionosphere; the paper proposes a skywave ground truth generation algorithm, crucial for studying mitigation methods, and demonstrates it on field-test data.
    Abstract With the advancement of transportation vehicles, the importance and utility of navigation systems providing positioning, navigation, and timing (PNT) information have been increasing. Global navigation satellite systems (GNSS) are widely used navigation systems, but they are vulnerable to radio frequency interference (RFI), resulting in disruptions of satellite navigation signals. Recognizing this limitation, extensive research is being conducted on alternative navigation systems. In the maritime industry, ongoing research focuses on a groundbased integrated navigation system called R-Mode. R-Mode utilizes medium frequency (MF) differential GNSS (DGNSS) and very high-frequency data exchange system (VDES) signals as ranging signals for positioning and incorporates the existing ground-based navigation system known as enhanced long-range navigation (eLoran). However, MF R-Mode, which uses MF DGNSS signals for positioning, exhibits significant performance differences between daytime and nighttime due to skywave interference caused by signals reflecting off the ionosphere. In this study, we propose a skywave ground truth generation algorithm that is crucial for studying mitigation methods for MF R-Mode skywave interference. Furthermore, we demonstrate the proposed algorithm using field-test data.

Empirical Modeling of Variance in Medium Frequency R-Mode Time-of-Arrival Measurements

  • paper_url: http://arxiv.org/abs/2309.00202
  • repo_url: None
  • paper_authors: Jaewon Yu, Pyo-Woong Son
  • for: Improving the accuracy of performance simulation for the medium-frequency (MF) R-Mode system.
  • methods: Models the variance of time-of-arrival (TOA) measurements based on actual data, adapting the method used to calculate the standard deviation of time-of-reception (TOR) measurements in Loran.
  • results: The parameters for modeling the TOA variance were estimated from data collected from transmitters in Palmi and Chungju, South Korea.
    Abstract The R-Mode system, an advanced terrestrial integrated navigation system, is designed to address the vulnerabilities of global navigation satellite systems (GNSS) and explore the potential of a complementary navigation system. This study aims to enhance the accuracy of performance simulation for the medium frequency (MF) R-Mode system by modeling the variance of time-of-arrival (TOA) measurements based on actual data. Drawing inspiration from the method used to calculate the standard deviation of time-of-reception (TOR) measurements in Loran, we adapted and applied this approach to the MF R-Mode system. Data were collected from transmitters in Palmi and Chungju, South Korea, and the parameters for modeling the variance of TOA were estimated.

cs.SD - 2023-08-31

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech

  • paper_url: http://arxiv.org/abs/2308.16569
  • repo_url: None
  • paper_authors: Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu
  • for: Proposes LightGrad, a lightweight diffusion probabilistic model (DPM) for TTS that cuts model size and the number of denoising steps, so DPM-based TTS can run on resource-constrained edge devices with low latency.
  • methods: A lightweight U-Net diffusion decoder and a training-free fast sampling technique reduce both model parameters and inference latency; streaming inference lowers latency further.
  • results: Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency while preserving comparable speech quality on both Chinese Mandarin and English in 4 denoising steps.
    Abstract Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily life, where models are deployed in cloud to provide services for customs. Among these models are diffusion probabilistic models (DPMs), which can be stably trained and are more parameter-efficient compared with other generative models. As transmitting data between customs and the cloud introduces high latency and the risk of exposing private data, deploying TTS models on edge devices is preferred. When implementing DPMs onto edge devices, there are two practical problems. First, current DPMs are not lightweight enough for resource-constrained devices. Second, DPMs require many denoising steps in inference, which increases latency. In this work, we present LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight U-Net diffusion decoder and a training-free fast sampling technique, reducing both model parameters and inference latency. Streaming inference is also implemented in LightGrad to reduce latency further. Compared with Grad-TTS, LightGrad achieves 62.2% reduction in paramters, 65.7% reduction in latency, while preserving comparable speech quality on both Chinese Mandarin and English in 4 denoising steps.

PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

  • paper_url: http://arxiv.org/abs/2308.16511
  • repo_url: https://github.com/ncsoft/PhonMatchNet
  • paper_authors: Yong-Hyeok Lee, Namhyun Cho
  • for: Proposes a novel zero-shot user-defined keyword spotting model that exploits the audio-phoneme relationship of the keyword; unlike previous utterance-level approaches, it uses both utterance- and phoneme-level information.
  • methods: A two-stream speech encoder architecture, a self-attention-based pattern extractor, and a phoneme-level detection loss for high performance across diverse pronunciation environments.
  • results: Outperforms the baseline model and is competitive with full-shot keyword spotting models, significantly improving EER and AUC across all datasets (familiar words, proper nouns, and indistinguishable pronunciations) with average relative improvements of 67% and 80%, respectively.
    Abstract This study presents a novel zero-shot user-defined keyword spotting model that utilizes the audio-phoneme relationship of the keyword to improve performance. Unlike the previous approach that estimates at utterance level, we use both utterance and phoneme level information. Our proposed method comprises a two-stream speech encoder architecture, self-attention-based pattern extractor, and phoneme-level detection loss for high performance in various pronunciation environments. Based on experimental results, our proposed model outperforms the baseline model and achieves competitive performance compared with full-shot keyword spotting models. Our proposed model significantly improves the EER and AUC across all datasets, including familiar words, proper nouns, and indistinguishable pronunciations, with an average relative improvement of 67% and 80%, respectively. The implementation code of our proposed model is available at https://github.com/ncsoft/PhonMatchNet.

RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting

  • paper_url: http://arxiv.org/abs/2308.16488
  • repo_url: None
  • paper_authors: Hui Wang, Shiwan Zhao, Xiguang Zheng, Yong Qin
  • for: Automatic Mean Opinion Score (MOS) prediction for synthetic speech; existing approaches built on pre-trained self-supervised learning (SSL) models only partly address data scarcity for the feature extractor, leaving the decoder's data scarcity unresolved and performance suboptimal. The proposed retrieval-augmented method, RAMP, addresses this.
  • methods: A retrieval network strengthens the decoder, and a fusing network dynamically adjusts the retrieval scope and fusion weights for each instance based on predictive confidence (a simplified fusion rule is sketched after the abstract).
  • results: Outperforms existing methods in multiple scenarios.
    Abstract Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the perceptual quality of the synthetic speech. While recent approaches using pre-trained self-supervised learning (SSL) models have shown promising results, they only partly address the data scarcity issue for the feature extractor. This leaves the data scarcity issue for the decoder unresolved and leading to suboptimal performance. To address this challenge, we propose a retrieval-augmented MOS prediction method, dubbed {\bf RAMP}, to enhance the decoder's ability against the data scarcity issue. A fusing network is also proposed to dynamically adjust the retrieval scope for each instance and the fusion weights based on the predictive confidence. Experimental results show that our proposed method outperforms the existing methods in multiple scenarios.
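
A simplified, non-learned version of the fusion rule, with a confidence-to-weight mapping of our own (the paper learns both the retrieval scope and the fusion weights with a network): a parametric MOS prediction is blended with a k-nearest-neighbors estimate from a datastore of rated utterances.

```python
import numpy as np

def fused_mos(pred_mos, confidence, query_emb, bank_embs, bank_mos, k=5):
    """Blend the decoder's MOS prediction with a kNN estimate; high confidence
    shifts weight toward the decoder, low confidence toward retrieval."""
    d = np.linalg.norm(bank_embs - query_emb, axis=1)
    knn_mos = bank_mos[np.argsort(d)[:k]].mean()
    w = float(np.clip(confidence, 0.0, 1.0))
    return w * pred_mos + (1.0 - w) * knn_mos

bank_embs = np.random.randn(100, 32)         # embeddings of rated utterances
bank_mos = np.random.uniform(1.0, 5.0, 100)  # their human MOS labels
score = fused_mos(3.8, 0.7, np.random.randn(32), bank_embs, bank_mos)
```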

Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.16485
  • repo_url: None
  • paper_authors: Xuechen Wang, Shiwan Zhao, Yong Qin
  • for: Improving speech emotion recognition (SER), a task hampered by limited data and blurred boundaries between certain emotions.
  • methods: A comprehensive approach across the model lifecycle: the pre-trained wav2vec2.0 model for pre-training; a fine-tuning loss combining cross-entropy with supervised contrastive learning to enlarge inter-class distances and shrink intra-class distances (sketched after the abstract); and an inference-stage interpolation combining the model prediction with the output of a k-nearest-neighbors model.
  • results: Experiments on IEMOCAP show the proposed methods outperform current state-of-the-art results.
    Abstract Speech Emotion Recognition (SER) is a challenging task due to limited data and blurred boundaries of certain emotions. In this paper, we present a comprehensive approach to improve the SER performance throughout the model lifecycle, including pre-training, fine-tuning, and inference stages. To address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0. During fine-tuning, we propose a novel loss function that combines cross-entropy loss with supervised contrastive learning loss to improve the model's discriminative ability. This approach increases the inter-class distances and decreases the intra-class distances, mitigating the issue of blurred boundaries. Finally, to leverage the improved distances, we propose an interpolation method at the inference stage that combines the model prediction with the output from a k-nearest neighbors model. Our experiments on IEMOCAP demonstrate that our proposed methods outperform current state-of-the-art results.
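
A sketch of the fine-tuning objective, pairing cross-entropy with a single-view supervised contrastive term over L2-normalized utterance embeddings; the loss weight `lam` and the temperature are assumptions of ours, not values from the paper.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss: pull same-emotion embeddings together,
    push different emotions apart (one view per utterance)."""
    f = F.normalize(features, dim=1)
    sim = f @ f.T / temperature
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                      # exclude self-pairs
    others = 1.0 - torch.eye(len(labels))
    log_prob = sim - torch.log((torch.exp(sim) * others).sum(1, keepdim=True) + 1e-12)
    return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

def total_loss(logits, features, labels, lam=0.5):   # lam is our assumption
    return F.cross_entropy(logits, labels) + lam * supcon_loss(features, labels)

logits, feats = torch.randn(16, 4), torch.randn(16, 128)
labels = torch.randint(0, 4, (16,))
loss = total_loss(logits, feats, labels)
```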

Sequential Pitch Distributions for Raga Detection

  • paper_url: http://arxiv.org/abs/2308.16421
  • repo_url: None
  • paper_authors: Vishwaas Narasinh, Senthil Raja G
  • for: Proposes a new feature for detecting the raga in Indian Art Music (IAM).
  • methods: Sequential Pitch Distributions (SPD): distributions taken over the pitch values occurring between two given pitch values over time, capturing sequential (temporal) melodic information from an audio sample (a toy implementation follows the abstract).
  • results: State-of-the-art accuracy on both Hindustani (99%) and Carnatic (88.13%) raga datasets; SPD gives a substantial accuracy boost over a standard pitch distribution.
    Abstract Raga is a fundamental melodic concept in Indian Art Music (IAM). It is characterized by complex patterns. All performances and compositions are based on the raga framework. Raga and tonic detection have been a long-standing research problem in the field of Music Information Retrieval. In this paper, we attempt to detect the raga using a novel feature to extract sequential or temporal information from an audio sample. We call these Sequential Pitch Distributions (SPD), which are distributions taken over pitch values between two given pitch values over time. We also achieve state-of-the-art results on both Hindustani and Carnatic music raga data sets with an accuracy of 99% and 88.13%, respectively. SPD gives a great boost in accuracy over a standard pitch distribution. The main goal of this paper, however, is to present an alternative approach to modeling the temporal aspects of the melody and thereby deducing the raga.
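
Our reading of the SPD definition, sketched for a quantized pitch track (the paper's exact construction may differ): for a given ordered pair of pitch bins, histogram the pitch values traversed between an occurrence of the first bin and the next occurrence of the second.

```python
import numpy as np

def sequential_pitch_distribution(pitch, start_bin, end_bin, n_bins=120):
    """SPD sketch: distribution of the pitch values occurring between
    `start_bin` and the next `end_bin` in a quantized pitch contour."""
    hist = np.zeros(n_bins)
    i = 0
    while i < len(pitch):
        if pitch[i] == start_bin:
            j = i + 1
            while j < len(pitch) and pitch[j] != end_bin:
                hist[pitch[j]] += 1
                j += 1
            i = j
        i += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: a pitch contour quantized to 120 bins (e.g., 10-cent bins per octave).
pitch_track = np.random.randint(0, 120, size=5000)
spd = sequential_pitch_distribution(pitch_track, start_bin=0, end_bin=70)
```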

The Biased Journey of MSD_AUDIO.ZIP

  • paper_url: http://arxiv.org/abs/2308.16389
  • repo_url: None
  • paper_authors: Haven Kim, Keunwoo Choi, Mateusz Modrzejewski, Cynthia C. S. Liem
  • for: Examines access to the audio data corresponding to the Million Song Dataset and the broader question of equitable access to academic data.
  • methods: Draws insights from the experiences of 22 individuals who either attempted to access the data or played a role in its creation.
  • results: Owing to the complexity of the API, its misreporting (before 2016), and its discontinuation (after 2016), access has become restricted to those within certain peer-connected affiliations; the authors hope to initiate more critical dialogue about access privilege in the MIR community.
    Abstract The equitable distribution of academic data is crucial for ensuring equal research opportunities, and ultimately further progress. Yet, due to the complexity of using the API for audio data that corresponds to the Million Song Dataset along with its misreporting (before 2016) and the discontinuation of this API (after 2016), access to this data has become restricted to those within certain affiliations that are connected peer-to-peer. In this paper, we delve into this issue, drawing insights from the experiences of 22 individuals who either attempted to access the data or played a role in its creation. With this, we hope to initiate more critical dialogue and more thoughtful consideration with regard to access privilege in the MIR community.

cs.CV - 2023-08-31

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator

  • paper_url: http://arxiv.org/abs/2308.16906
  • repo_url: https://github.com/xlwangdev/hc-net
  • paper_authors: Xiaolong Wang, Runsen Xu, Zuofan Cui, Zeyu Wan, Yu Zhang
  • for: Fine-grained cross-view geo-localization: aligning a ground image with a GPS-tagged satellite image covering the same area.
  • methods: A differentiable spherical transform, adhering to geometric principles, brings the ground image's perspective onto the satellite map, reducing the task to image alignment; a robust correlation-aware homography estimator then aligns the transformed ground image with the satellite image despite occlusion, small overlapping range, and seasonal variations, and the center point of the transformed ground image is mapped to the satellite image through the homography matrix (sketched after the abstract).
  • results: Runs at 30 FPS, achieving sub-pixel resolution and meter-level GPS accuracy; reduces the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization on the VIGOR benchmark, and by 34.4% on the KITTI benchmark in same-area evaluation.
    Abstract In this paper, we introduce a novel approach to fine-grained cross-view geo-localization. Our method aligns a warped ground image with a corresponding GPS-tagged satellite image covering the same area using homography estimation. We first employ a differentiable spherical transform, adhering to geometric principles, to accurately align the perspective of the ground image with the satellite map. This transformation effectively places ground and aerial images in the same view and on the same plane, reducing the task to an image alignment problem. To address challenges such as occlusion, small overlapping range, and seasonal variations, we propose a robust correlation-aware homography estimator to align similar parts of the transformed ground image with the satellite image. Our method achieves sub-pixel resolution and meter-level GPS accuracy by mapping the center point of the transformed ground image to the satellite image using a homography matrix and determining the orientation of the ground camera using a point above the central axis. Operating at a speed of 30 FPS, our method outperforms state-of-the-art techniques, reducing the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization tasks on the VIGOR benchmark, respectively, and by 34.4% on the KITTI benchmark in same-area evaluation.
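
The last step of the pipeline in isolation, under assumptions of our own about the satellite georeferencing (north-up image, locally linear lat/lon scaling): map the center of the transformed ground image through the estimated homography H, then convert the resulting satellite pixel to GPS.

```python
import numpy as np

def localize(H, ground_hw, sat_origin_latlon, meters_per_pixel):
    """Map the ground-image center through homography H into the satellite
    image, then convert pixels to GPS (approximate local-tangent conversion)."""
    h, w = ground_hw
    c = H @ np.array([w / 2.0, h / 2.0, 1.0])
    u, v = c[0] / c[2], c[1] / c[2]            # satellite pixel coordinates
    lat0, lon0 = sat_origin_latlon             # GPS of the satellite image origin
    dlat = -v * meters_per_pixel / 111_320.0   # +v goes down the image, i.e. south
    dlon = u * meters_per_pixel / (111_320.0 * np.cos(np.radians(lat0)))
    return lat0 + dlat, lon0 + dlon

H = np.eye(3)                                  # stands in for the estimated homography
print(localize(H, (512, 512), (47.37, 8.54), 0.1))
```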

EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

  • paper_url: http://arxiv.org/abs/2308.16894
  • repo_url: https://github.com/eth-ait/emdb
  • paper_authors: Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan Zarate, Otmar Hilliges
  • For: This paper presents a novel dataset called EMDB, which contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos.
  • Methods: The authors use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record motion data, and propose a multi-stage optimization procedure to construct EMDB. They also leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance.
  • Results: The authors evaluate the accuracy of EMDB using a multi-view volumetric capture system and show that it has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. They also evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available at https://ait.ethz.ch/emdb.
    Abstract We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and outdoor sequences and 10 participants. Together with accurate body poses and shapes, we also provide global camera poses and body root trajectories. To construct EMDB, we propose a multi-stage optimization procedure, which first fits SMPL to the 6-DoF EM measurements and then refines the poses via image observations. To achieve high-quality results, we leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance, which allows for improved alignment and smoothness via a dense pixel-level objective. Our evaluations, conducted with a multi-view volumetric capture system, indicate that EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available under https://ait.ethz.ch/emdb

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

  • paper_url: http://arxiv.org/abs/2308.16891
  • repo_url: https://github.com/YanjieZe/GNFactor
  • paper_authors: Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang
  • for: This paper aims to develop a visual behavior cloning agent for multi-task robotic manipulation that can execute diverse tasks from visual observations in unstructured real-world environments.
  • methods: The proposed method, called GNFactor, uses a combination of a generalizable neural field (GNF) and a Perceiver Transformer to jointly optimize a shared deep 3D voxel representation, leveraging a vision-language foundation model to incorporate semantics in 3D.
  • results: The authors evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations, showing a substantial improvement over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor.
    Abstract It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .
    摘要 在机器人学中,开发能够在非结构化真实环境中根据视觉观测执行多样化操作任务的智能体是一个长期存在的问题。为实现这一目标,机器人需要对场景的3D结构和语义有全面的理解。在这项工作中,我们提出了 $\textbf{GNFactor}$,一种用于多任务机器人操作的视觉行为克隆智能体,它联合优化一个通用神经场(GNF)作为重建模块和一个Perceiver Transformer作为决策模块,二者共享一个深度3D体素表示。为了在3D中引入语义,重建模块利用视觉语言基础模型(例如Stable Diffusion)将丰富的语义信息蒸馏到深度3D体素中。我们在3个真实机器人任务上评估GNFactor,并在10个RLBench任务上用有限数量的示范进行了详细的消融实验。我们发现GNFactor在已见和未见任务上都显著优于当前最先进的方法,展示了其强大的泛化能力。我们的项目网站是 https://yanjieze.com/GNFactor/ 。

Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details

  • paper_url: http://arxiv.org/abs/2308.16880
  • repo_url: None
  • paper_authors: Inwoo Hwang, Hyeonwoo Kim, Young Min Kim
  • for: 这篇论文的目的是为由多个对象组成的虚拟场景自动创建真实的细节纹理。
  • methods: 该方法以参考图像和文本描述为引导,为带标签的3D几何体添加纹理,使生成的颜色遵循场景中通常由相似材质构成的层次结构或语义部件。
  • results: 该方法能够在适度的计算资源下为场景中的对象创建细节纹理,同时保持场景的结构上下文。据作者所知,这是首个实用且可扩展的方法,且不需要由熟练艺术家设计的高质量纹理数据集。
    Abstract We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects.
    摘要 我们提出了Text2Scene,一种为由多个对象组成的虚拟场景自动创建真实纹理的方法。在参考图像和文本描述的引导下,我们的管道为房间中带标签的3D几何体添加细节纹理,使生成的颜色遵循通常由相似材质构成的层次结构或语义部件。我们并非一步对整个场景应用平面风格化,而是先从几何分割中获取弱语义线索,再通过为分割后的部件赋予初始颜色加以明确。随后,我们为各个对象添加纹理细节,使其在图像空间的投影呈现出与输入嵌入对齐的特征嵌入。这种分解使整个管道在适度的计算资源和内存下保持可行。由于我们的框架利用现有的图像和文本嵌入资源,因此不需要由熟练艺术家设计的高质量纹理数据集。据我们所知,这是首个实用且可扩展的方法,能够为包含多个对象的场景创建所需风格的细节真实纹理,同时保持结构上下文。

SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation

  • paper_url: http://arxiv.org/abs/2308.16876
  • repo_url: https://github.com/JiabenChen/SportsSloMo
  • paper_authors: Jiaben Chen, Huaizu Jiang
  • For: The paper aims to improve the performance of video frame interpolation in human-centric scenarios, particularly for the sports analysis industry, by introducing a new benchmark dataset and two human-aware loss terms.
  • Methods: The paper re-trains several state-of-the-art video frame interpolation methods on a new benchmark dataset called SportsSloMo, which consists of high-resolution slow-motion sports videos crawled from YouTube, and introduces two human-aware loss terms to improve interpolation accuracy.
  • Results: The proposed loss terms lead to consistent performance improvement over five existing models, establishing strong baseline models on the SportsSloMo benchmark. The results also highlight the difficulty of the benchmark and the importance of human-aware priors in video frame interpolation.
    Abstract Human-centric video frame interpolation has great potential for improving people's entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available in the community, none of them is dedicated for human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. It highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve the accuracy, we introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection, respectively. The loss terms are model agnostic and can be easily plugged into any video frame interpolation approaches. Experimental results validate the effectiveness of our proposed loss terms, leading to consistent performance improvement over 5 existing models, which establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.
    摘要 以人为中心的视频帧插值在提升人们的娱乐体验方面具有巨大潜力,并可在体育分析行业找到商业应用,例如合成慢动作视频。尽管社区中已有多个基准数据集,但没有一个专门面向以人为中心的场景。为填补这一空白,我们提出了SportsSloMo,一个包含超过13万个视频片段和100万帧的高分辨率($\geq$720p)慢动作体育视频基准,视频采集自YouTube。我们在该基准上重新训练了多种最先进的方法,结果显示其精度相比在其他数据集上有所下降。这凸显了我们基准的难度,并表明由于体育视频中人体高度可变形且遮挡频繁,即使是表现最好的方法也面临重大挑战。为提高精度,我们引入了两个考虑人体先验的损失项,分别为全景分割和人体关键点检测添加辅助监督。这些损失项与模型无关,可以方便地嵌入任何视频帧插值方法。实验结果验证了所提损失项的有效性,在5个现有模型上带来一致的性能提升,从而在我们的基准上建立了强基线模型。数据集和代码见:https://neu-vi.github.io/SportsSlomo/。
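The abstract describes the two loss terms as model-agnostic auxiliary supervision on panoptic segmentation and human keypoints. Below is a minimal, hedged sketch of how such terms could be attached to any frame interpolator; the frozen teacher networks, the weights `w_seg`/`w_kpt`, and the agreement losses are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: model-agnostic human-aware auxiliary losses. Frozen segmentation/
# keypoint "teachers" probe the interpolated frame, which is pushed to agree
# with the ground-truth frame under both probes.
import torch
import torch.nn.functional as F

def human_aware_loss(pred_frame, gt_frame, seg_teacher, kpt_teacher,
                     w_seg=0.1, w_kpt=0.1):
    # Base reconstruction term used by most interpolation methods.
    l_rec = F.l1_loss(pred_frame, gt_frame)

    # Targets come from the frozen teachers applied to the ground truth.
    with torch.no_grad():
        seg_gt = seg_teacher(gt_frame)   # (B, C, H, W) parsing logits
        kpt_gt = kpt_teacher(gt_frame)   # (B, K, H, W) keypoint heatmaps

    # Auxiliary agreement terms on the predicted frame.
    l_seg = F.mse_loss(seg_teacher(pred_frame), seg_gt)
    l_kpt = F.mse_loss(kpt_teacher(pred_frame), kpt_gt)
    return l_rec + w_seg * l_seg + w_kpt * l_kpt
```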

Holistic Processing of Colour Images Using Novel Quaternion-Valued Wavelets on the Plane

  • paper_url: http://arxiv.org/abs/2308.16875
  • repo_url: None
  • paper_authors: Neil D. Dizon, Jeffrey A. Hogan
  • for: color image processing
  • methods: quaternionic wavelet filters, quaternion-valued wavelets on the plane
  • results: demonstration of quaternion-valued wavelets as a promising tool for holistic color image processing, including compression, enhancement, segmentation, and denoising techniques.
    Abstract We investigate the applicability of quaternion-valued wavelets on the plane to holistic colour image processing. We present a methodology for decomposing and reconstructing colour images using quaternionic wavelet filters associated to recently developed quaternion-valued wavelets on the plane. We consider compression, enhancement, segmentation, and denoising techniques to demonstrate quaternion-valued wavelets as a promising tool for holistic colour image processing.
    摘要 我们研究平面上的四元数值小波在整体彩色图像处理中的适用性。我们提出了一种方法,使用与最近发展的平面四元数值小波相关联的四元数小波滤波器,对彩色图像进行分解和重建。我们考察了压缩、增强、分割和去噪技术,以展示四元数值小波作为整体彩色图像处理的有前途的工具。
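The "holistic" aspect rests on treating a color pixel as a single algebraic object rather than three independent scalars. A common encoding, sketched below, maps an RGB pixel to a pure quaternion (0, r, g, b) so that all channels are processed jointly under quaternion arithmetic; the wavelet filter construction itself is beyond this sketch.

```python
# Sketch: the quaternion encoding behind holistic colour processing.
import numpy as np

def rgb_to_quaternion(img):
    """(H, W, 3) RGB image -> (H, W, 4) array of pure quaternions."""
    h, w, _ = img.shape
    q = np.zeros((h, w, 4), dtype=np.float64)
    q[..., 1:] = img  # real part stays zero; (i, j, k) carry (r, g, b)
    return q

def hamilton_product(p, q):
    """Hamilton product of quaternion arrays of shape (..., 4)."""
    pw, px, py, pz = np.moveaxis(p, -1, 0)
    qw, qx, qy, qz = np.moveaxis(q, -1, 0)
    return np.stack([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ], axis=-1)

# Example: a unit quaternion r acts on all colour vectors at once via
# hamilton_product(hamilton_product(r, q), r_conjugate).
```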

Self-pruning Graph Neural Network for Predicting Inflammatory Disease Activity in Multiple Sclerosis from Brain MR Images

  • paper_url: http://arxiv.org/abs/2308.16863
  • repo_url: https://github.com/chinmay5/ms_ida
  • paper_authors: Chinmay Prabhakar, Hongwei Bran Li, Johannes C. Paetzold, Timo Loehr, Chen Niu, Mark Mühlau, Daniel Rueckert, Benedikt Wiestler, Bjoern Menze
  • For: 预测多发性硬化症(MS)的炎症活动对疾病评估和治疗至关重要。但MS病灶可出现在脑部任何位置,其数量、大小和分布在患者之间差异很大,使得机器学习方法难以从全脑MRI图像中学习全局有效的表示来评估和预测疾病。
  • Methods: 我们使用图神经网络(GNN)聚合关键生物标志物(如病灶负荷和空间邻近性),构建全脑MRI扫描的全新全局表示。我们提出一种两阶段的MS炎症活动预测方法:第一阶段,用3D分割网络检测病灶,并用自监督算法提取其图像特征;第二阶段,将检测到的病灶作为节点构建患者图,节点以第一阶段提取的图像特征初始化,并基于空间邻近性相互连接,从而将炎症活动预测表述为图分类问题。此外,我们提出一种自剪枝策略,自动选择对预测最关键的病灶。
  • Results: 该方法在一年期和两年期炎症活动预测上均大幅超越现有基线(AUC分别为0.67对0.61和0.66对0.60)。此外,该方法为每个病灶分配重要性分数,具有内在的可解释性。
    Abstract Multiple Sclerosis (MS) is a severe neurological disease characterized by inflammatory lesions in the central nervous system. Hence, predicting inflammatory disease activity is crucial for disease assessment and treatment. However, MS lesions can occur throughout the brain and vary in shape, size and total count among patients. The high variance in lesion load and locations makes it challenging for machine learning methods to learn a globally effective representation of whole-brain MRI scans to assess and predict disease. Technically it is non-trivial to incorporate essential biomarkers such as lesion load or spatial proximity. Our work represents the first attempt to utilize graph neural networks (GNN) to aggregate these biomarkers for a novel global representation. We propose a two-stage MS inflammatory disease activity prediction approach. First, a 3D segmentation network detects lesions, and a self-supervised algorithm extracts their image features. Second, the detected lesions are used to build a patient graph. The lesions act as nodes in the graph and are initialized with image features extracted in the first stage. Finally, the lesions are connected based on their spatial proximity and the inflammatory disease activity prediction is formulated as a graph classification task. Furthermore, we propose a self-pruning strategy to auto-select the most critical lesions for prediction. Our proposed method outperforms the existing baseline by a large margin (AUCs of 0.67 vs. 0.61 and 0.66 vs. 0.60 for one-year and two-year inflammatory disease activity, respectively). Finally, our proposed method enjoys inherent explainability by assigning an importance score to each lesion for the overall prediction. Code is available at https://github.com/chinmay5/ms_ida.git
    摘要 多发性硬化症(MS)是一种严重的神经系统疾病,其特征是中枢神经系统中的炎症性病灶。因此,预测炎症活动对疾病评估和治疗至关重要。然而,MS病灶可出现在脑部任何位置,其形状、大小和总数因患者而异。病灶负荷和位置的高度差异使得机器学习方法难以学习全脑MRI扫描的全局有效表示来评估和预测疾病。在技术上,将病灶负荷或空间邻近性等关键生物标志物纳入模型也并非易事。我们的工作首次尝试利用图神经网络(GNN)聚合这些生物标志物,构建一种全新的全局表示。我们提出一种两阶段的MS炎症活动预测方法。首先,一个3D分割网络检测病灶,并用自监督算法提取其图像特征。其次,利用检测到的病灶构建患者图:病灶作为图中节点,以第一阶段提取的图像特征初始化,并基于空间邻近性相互连接,炎症活动预测由此表述为图分类任务。此外,我们提出一种自剪枝策略,自动选择对预测最关键的病灶。我们的方法大幅超越现有基线(一年期和两年期炎症活动的AUC分别为0.67对0.61和0.66对0.60)。最后,我们的方法为每个病灶分配重要性分数,具有内在的可解释性。代码可在 https://github.com/chinmay5/ms_ida.git 找到。
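The second stage hinges on turning detected lesions into a graph whose edges encode spatial proximity. The sketch below illustrates that construction under stated assumptions: lesion centroids and features are given, edges connect centroids within a hypothetical distance threshold, and the downstream GNN classifier is left as a placeholder.

```python
# Sketch: build a patient graph from detected lesions (nodes) connected by
# spatial proximity (edges), ready for any graph-classification GNN.
import numpy as np
from scipy.spatial.distance import cdist

def build_patient_graph(centroids, features, radius_mm=30.0):
    """
    centroids: (N, 3) lesion centroids in mm; features: (N, D) per-lesion
    features from the self-supervised extractor. Returns node features and
    a (2, E) COO edge list.
    """
    dist = cdist(centroids, centroids)             # (N, N) pairwise distances
    adj = (dist < radius_mm) & ~np.eye(len(centroids), dtype=bool)
    edge_index = np.stack(np.nonzero(adj))         # (2, E) edges
    return features, edge_index

# Example with 5 random lesions and 64-dim features:
feats, edges = build_patient_graph(np.random.rand(5, 3) * 100,
                                   np.random.rand(5, 64))
```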

Diffusion Models for Interferometric Satellite Aperture Radar

  • paper_url: http://arxiv.org/abs/2308.16847
  • repo_url: None
  • paper_authors: Alexandre Tuel, Thomas Kerdreux, Claudia Hulbert, Bertrand Rouet-Leduc
  • for: 用于生成雷达数据集,促进深度学习对雷达数据进行处理和分析。
  • methods: 使用概率扩散模型(PDM)生成基于雷达的卫星图像数据集;但采样时间仍是一个问题,在MNIST等简单图像数据集上表现良好的加速采样策略在雷达数据集上失效。
  • results: PDMs能够生成具有复杂和真实结构的图像,但是采样时间仍然是一个问题。提供了一个简单和多功能的开源库https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation,可以在单个GPU上训练、采样和评估PDMs。
    Abstract Probabilistic Diffusion Models (PDMs) have recently emerged as a very promising class of generative models, achieving high performance in natural image generation. However, their performance relative to non-natural images, like radar-based satellite data, remains largely unknown. Generating large amounts of synthetic (and especially labelled) satellite data is crucial to implement deep-learning approaches for the processing and analysis of (interferometric) satellite aperture radar data. Here, we leverage PDMs to generate several radar-based satellite image datasets. We show that PDMs succeed in generating images with complex and realistic structures, but that sampling time remains an issue. Indeed, accelerated sampling strategies, which work well on simple image datasets like MNIST, fail on our radar datasets. We provide a simple and versatile open-source https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation to train, sample and evaluate PDMs using any dataset on a single GPU.
    摘要 概率扩散模型(PDM)最近成为一类非常有前途的生成模型,在自然图像生成方面取得了很高的性能,但其在非自然图像(如基于雷达的卫星数据)上的表现仍不清楚。生成大量合成(尤其是带标签的)卫星数据,是为(干涉)合成孔径雷达数据的处理与分析实现深度学习方法的关键。在此,我们利用PDM生成了多个基于雷达的卫星图像数据集。我们发现PDM能够生成具有复杂且真实结构的图像,但采样时间仍是一个问题。事实上,在MNIST等简单图像数据集上表现良好的加速采样策略,在我们的雷达数据集上失效。我们提供了一个简单且通用的开源库 https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation,可在单个GPU上用任意数据集训练、采样和评估PDM。
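For readers unfamiliar with PDM training, the core loop is compact: noise a clean image to a random timestep and train a network to predict the injected noise. The sketch below shows one such training step under standard assumptions (epsilon-prediction, linear beta schedule); the denoiser network and SAR data pipeline are placeholders, and the authors' library may differ in details.

```python
# Sketch: one DDPM training step on batches of SAR images.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative product

def ddpm_loss(denoiser, x0):
    """x0: (B, C, H, W) clean SAR images scaled to [-1, 1]."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)
    ab = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # forward noising
    return F.mse_loss(denoiser(x_t, t), eps)         # predict the noise
```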

Coarse-to-Fine Amodal Segmentation with Shape Prior

  • paper_url: http://arxiv.org/abs/2308.16825
  • repo_url: None
  • paper_authors: Jianxiong Gao, Xuelin Qian, Yikai Wang, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu
  • for: 本研究旨在解决非模态(amodal)物体分割问题,即同时分割物体的可见与被遮挡部分,并提出了一种新的方法Coarse-to-Fine Segmentation(C2F-Seg)。
  • methods: C2F-Seg首先将学习空间从像素级图像空间降至向量量化的潜在空间,从而更好地处理长距离依赖关系并学习粗粒度的非模态分割。然而,该潜在空间缺乏物体的细节信息,难以直接给出精确的分割结果。为此,我们提出了一个卷积细化模块,基于视觉特征和粗粒度预测分割注入细粒度信息,提供更精确的非模态物体分割。
  • results: 我们在KINS和COCO-A两个基准数据集上进行了广泛实验,证明了C2F-Seg的优越性。此外,我们还在FISHBOWL和我们构建的MOViD-A上展示了该方法在视频非模态物体分割任务上的潜力。项目页面:http://jianxgao.github.io/C2F-Seg。
    Abstract Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: http://jianxgao.github.io/C2F-Seg.
    摘要 非模态物体分割是一项具有挑战性的任务,需要同时分割物体的可见与被遮挡部分。在这篇论文中,我们提出了一种新方法,即由粗到细分割(C2F-Seg),通过渐进建模来解决这一问题。C2F-Seg首先将学习空间从像素级图像空间降至向量量化的潜在空间,使我们能够更好地处理长距离依赖关系,并从视觉特征和可见分割中学习粗粒度的非模态分割。然而,该潜在空间缺乏物体的细节信息,难以直接给出精确分割。为解决这一问题,我们提出了一个卷积细化模块,在视觉特征和粗粒度预测分割的基础上注入细节信息,提供更精确的非模态物体分割。为促进非模态物体分割的研究,我们构建了一个合成非模态数据集MOViD-Amodal(MOViD-A),可同时用于图像和视频非模态物体分割。我们在KINS和COCO-A两个基准数据集上对模型进行了广泛评估,实验结果表明C2F-Seg具有优越性。此外,我们还在FISHBOWL和我们提出的MOViD-A上展示了该方法在视频非模态物体分割任务上的潜力。项目页面:http://jianxgao.github.io/C2F-Seg。
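The refinement step is conceptually a small convolutional head that fuses the upsampled coarse mask with visual features. A minimal sketch follows; the channel sizes, layer count, and fusion scheme are illustrative assumptions rather than the paper's exact module.

```python
# Sketch: convolutional refinement of a coarse amodal mask.
import torch
import torch.nn as nn

class ConvRefine(nn.Module):
    def __init__(self, feat_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 1, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),  # fine amodal mask logits
        )

    def forward(self, visual_feats, coarse_mask):
        # Upsample the coarse prediction to the feature resolution, then fuse.
        coarse = nn.functional.interpolate(
            coarse_mask, size=visual_feats.shape[-2:], mode="bilinear",
            align_corners=False)
        return self.net(torch.cat([visual_feats, coarse], dim=1))

# fine = ConvRefine()(torch.randn(2, 256, 64, 64), torch.rand(2, 1, 16, 16))
```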

BTSeg: Barlow Twins Regularization for Domain Adaptation in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.16819
  • repo_url: None
  • paper_authors: Johannes Künzel, Anna Hilsmann, Peter Eisert
  • for: 这个论文的目的是提出一种基于弱监督学习的Semantic Image Segmentation方法,以应对不同恶劣环境(如雨、夜晚、雪、极端照明)带来的挑战。
  • methods: 该方法以图像级对应作为弱监督信号,将同一地点在不同恶劣条件下拍摄的图像视为同一底层图像的“增强”,并使用Barlow twins损失函数进行训练。
  • results: 该方法在ACDC和新的更具挑战性的ACG基准上进行评估,与现有最先进方法相比表现优异,同时更易于实现和训练,展示了其鲁棒性和泛化能力。
    Abstract Semantic image segmentation is a critical component in many computer vision systems, such as autonomous driving. In such applications, adverse conditions (heavy rain, night time, snow, extreme lighting) on the one hand pose specific challenges, yet are typically underrepresented in the available datasets. Generating more training data is cumbersome and expensive, and the process itself is error-prone due to the inherent aleatoric uncertainty. To address this challenging problem, we propose BTSeg, which exploits image-level correspondences as weak supervision signal to learn a segmentation model that is agnostic to adverse conditions. To this end, our approach uses the Barlow twins loss from the field of unsupervised learning and treats images taken at the same location but under different adverse conditions as "augmentations" of the same unknown underlying base image. This allows the training of a segmentation model that is robust to appearance changes introduced by different adverse conditions. We evaluate our approach on ACDC and the new challenging ACG benchmark to demonstrate its robustness and generalization capabilities. Our approach performs favorably when compared to the current state-of-the-art methods, while also being simpler to implement and train. The code will be released upon acceptance.
    摘要 语义图像分割是许多计算机视觉系统(例如自动驾驶)中的关键组件。在这类应用中,恶劣条件(大雨、夜晚、积雪、极端光照)一方面带来特定的挑战,另一方面又在现有数据集中代表性不足。生成更多训练数据既繁琐又昂贵,且由于内在的偶然不确定性,这一过程本身容易出错。为解决这一难题,我们提出了BTSeg,它利用图像级对应作为弱监督信号,学习一个对恶劣条件不敏感的分割模型。为此,我们的方法使用来自无监督学习领域的Barlow twins损失,并将同一地点在不同恶劣条件下拍摄的图像视为同一未知底层图像的“增强”,从而训练出对不同恶劣条件引入的外观变化具有鲁棒性的分割模型。我们在ACDC和新的更具挑战性的ACG基准上评估了该方法,以展示其鲁棒性和泛化能力。与当前最先进方法相比,我们的方法表现优异,同时更易于实现和训练。代码将在论文录用后发布。
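The Barlow Twins objective the paper borrows from unsupervised learning is simple to state: embeddings of the two "views" (here, the same location under two adverse conditions) should have a cross-correlation matrix close to the identity. The sketch below follows the standard formulation; batch size, dimensions, and the trade-off weight `lam` are illustrative.

```python
# Sketch: the Barlow Twins loss that BTSeg builds on.
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """z1, z2: (N, D) embeddings of paired views (same place, different weather)."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)     # normalize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                              # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy
    return on_diag + lam * off_diag
```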

Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.16801
  • repo_url: None
  • paper_authors: Mohsen Zand, Ali Etemad, Michael Greenspan
  • for: human motion prediction
  • methods: ResChunk, an end-to-end network that explores dynamically correlated body components based on pairwise relationships between all joints in individual sequences, and learns the residuals between target sequence chunks in an autoregressive manner to enforce temporal connectivities.
  • results: outperforms other techniques and sets a new state-of-the-art on two challenging benchmark datasets, CMU Mocap and Human3.6M.
    Abstract A new method is proposed for human motion prediction by learning temporal and spatial dependencies. Recently, multiscale graphs have been developed to model the human body at higher abstraction levels, resulting in more stable motion prediction. Current methods however predetermine scale levels and combine spatially proximal joints to generate coarser scales based on human priors, even though movement patterns in different motion sequences vary and do not fully comply with a fixed graph of spatially connected joints. Another problem with graph convolutional methods is mode collapse, in which predicted poses converge around a mean pose with no discernible movements, particularly in long-term predictions. To tackle these issues, we propose ResChunk, an end-to-end network which explores dynamically correlated body components based on the pairwise relationships between all joints in individual sequences. ResChunk is trained to learn the residuals between target sequence chunks in an autoregressive manner to enforce the temporal connectivities between consecutive chunks. It is hence a sequence-to-sequence prediction network which considers dynamic spatio-temporal features of sequences at multiple levels. Our experiments on two challenging benchmark datasets, CMU Mocap and Human3.6M, demonstrate that our proposed method is able to effectively model the sequence information for motion prediction and outperform other techniques to set a new state-of-the-art. Our code is available at https://github.com/MohsenZand/ResChunk.
    摘要 我们提出了一种通过学习时间与空间依赖关系进行人体运动预测的新方法。最近,多尺度图被用于在更高抽象层次上建模人体,从而获得更稳定的运动预测。然而,现有方法基于人为先验预先设定尺度层级,并将空间上邻近的关节合并生成更粗的尺度,而不同运动序列中的运动模式各不相同,并不完全符合固定的空间连接关节图。图卷积方法的另一个问题是模式坍塌:预测姿态收敛到一个平均姿态附近而没有可辨识的运动,在长期预测中尤为明显。为解决这些问题,我们提出了ResChunk,一种端到端网络,基于单个序列中所有关节之间的成对关系挖掘动态相关的身体部件。ResChunk以自回归方式学习目标序列块之间的残差,从而强制相邻块之间的时间连续性。因此,它是一种在多个层次上考虑序列动态时空特征的序列到序列预测网络。我们在CMU Mocap和Human3.6M两个具有挑战性的基准数据集上进行的实验表明,所提方法能够有效建模序列信息进行运动预测,并超越其他技术,创造了新的最先进水平。我们的代码见:https://github.com/MohsenZand/ResChunk。
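The chunk-wise residual idea can be illustrated independently of the GCN backbone: the network predicts the residual between consecutive sequence chunks and is rolled out autoregressively. The sketch below assumes a generic `model` callable and chunk tensor layout; both are placeholders, not the paper's implementation.

```python
# Sketch: autoregressive rollout where each new chunk is the previous chunk
# plus a learned residual.
import torch

def rollout(model, seed_chunks, n_future):
    """seed_chunks: (B, N, L, D) observed chunks; returns (B, n_future, L, D)."""
    n_seed = seed_chunks.shape[1]
    chunks = list(seed_chunks.unbind(dim=1))
    for _ in range(n_future):
        history = torch.stack(chunks[-n_seed:], dim=1)  # sliding context
        residual = model(history)                       # (B, L, D) residual
        chunks.append(chunks[-1] + residual)            # next = last + residual
    return torch.stack(chunks[n_seed:], dim=1)
```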

Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models

  • paper_url: http://arxiv.org/abs/2308.16777
  • repo_url: None
  • paper_authors: Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo
  • for: This paper is written for the task of zero-shot referring image segmentation, which involves finding an instance segmentation mask based on a given referring description without using paired training data.
  • methods: The paper proposes a novel method called Referring Diffusional segmentor (Ref-Diff), which leverages fine-grained multi-modal information from generative models to improve performance.
  • results: The paper demonstrates that Ref-Diff achieves comparable performance to existing state-of-the-art weakly-supervised models without a proposal generator, and outperforms these competing methods by a significant margin when combining both generative and discriminative models.
    Abstract Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask based on the given referring descriptions, without training on this type of paired data. Current zero-shot methods mainly focus on using pre-trained discriminative models (e.g., CLIP). However, we have observed that generative models (e.g., Stable Diffusion) have potentially understood the relationships between various visual elements and text descriptions, which are rarely investigated in this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task, which leverages the fine-grained multi-modal information from generative models. We demonstrate that without a proposal generator, a generative model alone can achieve comparable performance to existing SOTA weakly-supervised models. When we combine both generative and discriminative models, our Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.
    摘要 零样本指代图像分割是一项具有挑战性的任务,其目标是在不使用此类成对训练数据的情况下,根据给定的指代描述获得实例分割掩码。现有的零样本方法主要依赖预训练的判别模型(如CLIP)。然而,我们观察到生成模型(如Stable Diffusion)可能已经理解了各种视觉元素与文本描述之间的关系,而这一点在该任务中很少被研究。在这项工作中,我们为该任务提出了一种新颖的指代扩散分割器(Ref-Diff),利用生成模型提供的细粒度多模态信息。我们证明,在没有候选框生成器的情况下,仅用生成模型即可达到与现有最先进弱监督模型相当的性能;当我们将生成模型与判别模型结合时,Ref-Diff以显著优势超越了这些竞争方法。这表明生成模型对该任务同样有益,并且可以与判别模型互补,实现更好的指代分割。我们的代码公开于 https://github.com/kodenii/Ref-Diff。

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images

  • paper_url: http://arxiv.org/abs/2308.16758
  • repo_url: None
  • paper_authors: Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, Hang Xu
  • for: 生成基于文本描述的3D脸部,如游戏、电影和机器人等领域。
  • methods: 提出了一种基于文本指导的3D脸部生成方法(TG-3DFace),使用无条件3D脸部生成框架,并通过文本条件学习实现文本指导3D脸部生成。此外,我们还提出了两种文本到脸部的交叉模态对接技术,包括全局对比学习和细化对接模块,以实现高度相关性 между生成的3D脸部和输入文本。
  • results: TG-3DFace比Latent3D将多视角一致性(MVIC)提高了9%,其渲染的人脸图像取得了比文本到2D人脸/图像生成模型更高的FID和CLIP分数,证明了其在生成逼真且语义一致纹理方面的优势。
    Abstract Generating 3D faces from textual descriptions has a multitude of applications, such as gaming, movie, and robotics. Recent progresses have demonstrated the success of unconditional 3D face generation and text-to-3D shape generation. However, due to the limited text-3D face data pairs, text-driven 3D face generation remains an open problem. In this paper, we propose a text-guided 3D faces generation method, refer as TG-3DFace, for generating realistic 3D faces using text guidance. Specifically, we adopt an unconditional 3D face generation framework and equip it with text conditions, which learns the text-guided 3D face generation with only text-2D face data. On top of that, we propose two text-to-face cross-modal alignment techniques, including the global contrastive learning and the fine-grained alignment module, to facilitate high semantic consistency between generated 3D faces and input texts. Besides, we present directional classifier guidance during the inference process, which encourages creativity for out-of-domain generations. Compared to the existing methods, TG-3DFace creates more realistic and aesthetically pleasing 3D faces, boosting 9% multi-view consistency (MVIC) over Latent3D. The rendered face images generated by TG-3DFace achieve higher FID and CLIP score than text-to-2D face/image generation models, demonstrating our superiority in generating realistic and semantic-consistent textures.
    摘要 从文本描述生成3D人脸在游戏、电影和机器人等领域有多种应用。近期进展已经证明了无条件3D人脸生成和文本到3D形状生成的成功。然而,由于文本-3D人脸配对数据有限,文本驱动的3D人脸生成仍是一个开放问题。本文提出一种文本引导的3D人脸生成方法,称为TG-3DFace,用于在文本引导下生成逼真的3D人脸。具体而言,我们采用无条件3D人脸生成框架并为其配备文本条件,仅用文本-2D人脸数据学习文本引导的3D人脸生成。在此基础上,我们提出两种文本到人脸的跨模态对齐技术,包括全局对比学习和细粒度对齐模块,以促进生成的3D人脸与输入文本之间的高度语义一致。此外,我们在推理过程中引入方向分类器引导,以鼓励域外生成的创造性。与现有方法相比,TG-3DFace生成的3D人脸更逼真、更美观,多视角一致性(MVIC)比Latent3D提高9%。TG-3DFace渲染的人脸图像取得了比文本到2D人脸/图像生成模型更高的FID和CLIP分数,证明了我们在生成逼真且语义一致纹理方面的优势。
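The global contrastive learning component aligns embeddings of (rendered) faces with embeddings of their text descriptions. A common instantiation is a symmetric InfoNCE loss, as popularized by CLIP; the sketch below shows that generic form, which may differ from the paper's exact objective.

```python
# Sketch: symmetric contrastive alignment between face and text embeddings.
import torch
import torch.nn.functional as F

def global_contrastive_loss(face_emb, text_emb, temperature=0.07):
    """face_emb, text_emb: (N, D); row i of each side is a matched pair."""
    face_emb = F.normalize(face_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = face_emb @ text_emb.T / temperature      # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs on the diagonal are positives; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))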

Unsupervised CT Metal Artifact Reduction by Plugging Diffusion Priors in Dual Domains

  • paper_url: http://arxiv.org/abs/2308.16742
  • repo_url: https://github.com/deepxuan/dudodp-mar
  • paper_authors: Xuan Liu, Yaoqin Xie, Songhui Diao, Shan Tan, Xiaokun Liang
  • For: The paper aims to improve the quality of computed tomography (CT) images by reducing metal artifacts, which otherwise make accurate diagnosis challenging.
  • Methods: The proposed method uses an unsupervised diffusion model to restore the portions of CT images degraded by metal artifacts, leveraging the priors embedded within a pre-trained diffusion model in both the sinogram and image domains.
  • Results: The proposed method outperforms existing unsupervised metal artifact reduction methods, including another diffusion-model-based method, in both quantitative and qualitative terms, and demonstrates superior visual results compared to supervised and unsupervised methods on clinical datasets.
    Abstract During the process of computed tomography (CT), metallic implants often cause disruptive artifacts in the reconstructed images, impeding accurate diagnosis. Several supervised deep learning-based approaches have been proposed for reducing metal artifacts (MAR). However, these methods heavily rely on training with simulated data, as obtaining paired metal artifact CT and clean CT data in clinical settings is challenging. This limitation can lead to decreased performance when applying these methods in clinical practice. Existing unsupervised MAR methods, whether based on learning or not, typically operate within a single domain, either in the image domain or the sinogram domain. In this paper, we propose an unsupervised MAR method based on the diffusion model, a generative model with a high capacity to represent data distributions. Specifically, we first train a diffusion model using CT images without metal artifacts. Subsequently, we iteratively utilize the priors embedded within the pre-trained diffusion model in both the sinogram and image domains to restore the degraded portions caused by metal artifacts. This dual-domain processing empowers our approach to outperform existing unsupervised MAR methods, including another MAR method based on the diffusion model, which we have qualitatively and quantitatively validated using synthetic datasets. Moreover, our method demonstrates superior visual results compared to both supervised and unsupervised methods on clinical datasets.
    摘要 在计算机断层扫描(CT)过程中,金属植入物常常在重建图像中产生干扰伪影,妨碍准确诊断。已有多种有监督的深度学习方法被提出用于减少金属伪影(MAR)。然而,这些方法严重依赖模拟数据进行训练,因为在临床环境中获取成对的金属伪影CT和干净CT数据十分困难,这一限制可能导致其在临床实践中的性能下降。现有的无监督MAR方法,无论是否基于学习,通常只在单一域内运行,即图像域或正弦图域。本文提出一种基于扩散模型(一种具有强大数据分布表示能力的生成模型)的无监督MAR方法。具体而言,我们首先用无金属伪影的CT图像训练扩散模型,然后在正弦图域和图像域中迭代利用预训练扩散模型所蕴含的先验,恢复金属伪影造成的退化部分。这种双域处理使我们的方法优于现有的无监督MAR方法,包括另一种基于扩散模型的MAR方法,我们已在合成数据集上对此进行了定性和定量验证。此外,在临床数据集上,我们的方法相比有监督和无监督方法均展现出更优的视觉效果。

Parsing is All You Need for Accurate Gait Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2308.16739
  • repo_url: https://github.com/Gait3D/Gait3D-Benchmark
  • paper_authors: Jinkai Zheng, Xinchen Liu, Shuai Wang, Lihao Wang, Chenggang Yan, Wu Liu
  • for: 本研究的目的是提出一种新的步态表示方法,即步态解析序列(GPS),以提高人体步态识别的准确率。
  • methods: 该研究提出了一种基于人体解析的步态识别框架,即解析步态(ParsingGait),该框架包括一个卷积神经网络(CNN)主干和两个轻量级的头:第一个头从GPS中提取全局语义特征,另一个头通过图卷积网络学习部件级特征的互信息,以建模人体行走的细节动态。
  • results: 该研究通过对Gait3D-Parsing dataset进行了全面的评估,并与现有的人体步态识别方法进行了比较,得到了显著提高的准确率。
    Abstract Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades since they are easy to extract from video frames. Despite their success in gait recognition for in-the-lab environments, they usually fail in real-world scenarios due to their low information entropy for gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the GPS representation, we propose a novel human parsing-based gait recognition framework, named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two light-weighted heads. The first head extracts global semantic features from GPSs, while the other one learns mutual information of part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait. The code and dataset are available at https://gait3d.github.io/gait3d-parsing-hp .
    摘要 二值剪影和基于关键点的骨架因易于从视频帧中提取,几十年来一直主导着人体步态识别研究。尽管它们在实验室环境的步态识别中表现出色,但由于其作为步态表示的信息熵较低,在真实场景中往往失效。为了在野外实现准确的步态识别,本文提出了一种新的步态表示,称为步态解析序列(GPS)。GPS是从视频帧中提取的细粒度人体分割(即人体解析)序列,因而具有高得多的信息熵,能够编码行走过程中人体细粒度部件的形状与动态。此外,为了有效发挥GPS表示的能力,我们提出了一种基于人体解析的步态识别框架,称为ParsingGait。ParsingGait包含一个基于卷积神经网络(CNN)的主干和两个轻量级的头:第一个头从GPS中提取全局语义特征,另一个头通过图卷积网络学习部件级特征的互信息,以建模人体行走的细节动态。另外,由于缺乏合适的数据集,我们通过扩展大规模且具有挑战性的Gait3D数据集,构建了首个基于解析的野外步态识别数据集,命名为Gait3D-Parsing。基于Gait3D-Parsing,我们对我们的方法和现有步态识别方法进行了全面评估。实验结果表明,GPS表示带来了显著的准确率提升,ParsingGait具有优越性。代码和数据集见 https://gait3d.github.io/gait3d-parsing-hp 。

US-SFNet: A Spatial-Frequency Domain-based Multi-branch Network for Cervical Lymph Node Lesions Diagnoses in Ultrasound Images

  • paper_url: http://arxiv.org/abs/2308.16738
  • repo_url: None
  • paper_authors: Yubiao Yue, Jun Xue, Haihua Liang, Bingchun Luo, Zhenzhang Li
  • for: 诊断 cervical lymph node lesions
  • methods: 使用 deep learning 模型,包括 Conv-FFT Block 和 US-SFNet architecture
  • results: 实现 92.89% 准确率,90.46% 精度,89.95% 敏感性和 97.49% 特异性
    Abstract Ultrasound imaging serves as a pivotal tool for diagnosing cervical lymph node lesions. However, the diagnoses of these images largely hinge on the expertise of medical practitioners, rendering the process susceptible to misdiagnoses. Although rapidly developing deep learning has substantially improved the diagnoses of diverse ultrasound images, there remains a conspicuous research gap concerning cervical lymph nodes. The objective of our work is to accurately diagnose cervical lymph node lesions by leveraging a deep learning model. To this end, we first collected 3392 images containing normal lymph nodes, benign lymph node lesions, malignant primary lymph node lesions, and malignant metastatic lymph node lesions. Given that ultrasound images are generated by the reflection and scattering of sound waves across varied bodily tissues, we proposed the Conv-FFT Block. It integrates convolutional operations with the fast Fourier transform to more astutely model the images. Building upon this foundation, we designed a novel architecture, named US-SFNet. This architecture not only discerns variances in ultrasound images from the spatial domain but also adeptly captures microstructural alterations across various lesions in the frequency domain. To ascertain the potential of US-SFNet, we benchmarked it against 12 popular architectures through five-fold cross-validation. The results show that US-SFNet is SOTA and can achieve 92.89% accuracy, 90.46% precision, 89.95% sensitivity and 97.49% specificity, respectively.
    摘要 超声成像是诊断颈部淋巴结病变的关键工具。然而,这些图像的诊断在很大程度上依赖于医生的专业经验,使得诊断过程容易出现误诊。尽管快速发展的深度学习已显著改善了多种超声图像的诊断,但在颈部淋巴结方面仍存在明显的研究空白。我们的工作旨在利用深度学习模型准确诊断颈部淋巴结病变。为此,我们首先收集了3392张图像,包含正常淋巴结、良性淋巴结病变、恶性原发性淋巴结病变和恶性转移性淋巴结病变。鉴于超声图像由声波在不同人体组织间的反射和散射生成,我们提出了Conv-FFT模块,将卷积操作与快速傅里叶变换相结合,以更精准地建模图像。在此基础上,我们设计了一种名为US-SFNet的新架构,它不仅能从空间域辨别超声图像的差异,还能在频率域敏锐捕捉各种病变的微结构变化。为验证US-SFNet的潜力,我们通过五折交叉验证将其与12种流行架构进行了对比。结果表明US-SFNet达到最先进水平,准确率、精确率、灵敏度和特异度分别为92.89%、90.46%、89.95%和97.49%。
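To make the spatial-frequency idea concrete, the sketch below shows a Conv-FFT-style block: one branch convolves the input in the spatial domain, the other convolves the (log-compressed) magnitude spectrum from a 2D FFT, and the two are fused. The channel layout and fusion scheme are assumptions; the paper's exact block may differ.

```python
# Sketch: a dual spatial/frequency convolution block using torch.fft.
import torch
import torch.nn as nn

class ConvFFTBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.spectral = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, x):
        spec = torch.fft.fft2(x, norm="ortho")   # complex 2D spectrum
        mag = torch.log1p(spec.abs())            # magnitude, log-compressed
        y = torch.cat([self.spatial(x), self.spectral(mag)], dim=1)
        return torch.relu(self.fuse(y))

# out = ConvFFTBlock(1, 32)(torch.randn(2, 1, 128, 128))
```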

Towards Vehicle-to-everything Autonomous Driving: A Survey on Collaborative Perception

  • paper_url: http://arxiv.org/abs/2308.16714
  • repo_url: None
  • paper_authors: Si Liu, Chen Gao, Yuan Chen, Xingyu Peng, Xianghao Kong, Kun Wang, Runsheng Xu, Wentao Jiang, Hao Xiang, Jiaqi Ma, Miao Wang
  • for: 这篇论文旨在为智能交通系统的开发提供一种新的方向,即 vehicle-to-everything (V2X) 自动驾驶。
  • methods: 这篇论文综述了协同感知(CP)方法,用于克服V2X系统中单体感知的固有局限,如遮挡和远距离感知。
  • results: 论文对现有CP方法进行了总结与分析,涵盖协同阶段、路侧传感器布设、延迟补偿、性能-带宽权衡、攻击/防御、位姿对齐等多个关键方面。
    Abstract Vehicle-to-everything (V2X) autonomous driving opens up a promising direction for developing a new generation of intelligent transportation systems. Collaborative perception (CP) as an essential component to achieve V2X can overcome the inherent limitations of individual perception, including occlusion and long-range perception. In this survey, we provide a comprehensive review of CP methods for V2X scenarios, bringing a profound and in-depth understanding to the community. Specifically, we first introduce the architecture and workflow of typical V2X systems, which affords a broader perspective to understand the entire V2X system and the role of CP within it. Then, we thoroughly summarize and analyze existing V2X perception datasets and CP methods. Particularly, we introduce numerous CP methods from various crucial perspectives, including collaboration stages, roadside sensors placement, latency compensation, performance-bandwidth trade-off, attack/defense, pose alignment, etc. Moreover, we conduct extensive experimental analyses to compare and examine current CP methods, revealing some essential and unexplored insights. Specifically, we analyze the performance changes of different methods under different bandwidths, providing a deep insight into the performance-bandwidth trade-off issue. Also, we examine methods under different LiDAR ranges. To study the model robustness, we further investigate the effects of various simulated real-world noises on the performance of different CP methods, covering communication latency, lossy communication, localization errors, and mixed noises. In addition, we look into the sim-to-real generalization ability of existing CP methods. At last, we thoroughly discuss issues and challenges, highlighting promising directions for future efforts. Our codes for experimental analysis will be public at https://github.com/memberRE/Collaborative-Perception.
    摘要 车联万物(V2X)自动驾驶开辟了一个有前景的方向,为新一代智能交通系统的发展带来了无限可能。协同感知(CP)作为实现V2X的关键组件,可以克服单体感知的固有局限,包括遮挡和远距离感知。在本文中,我们对V2X场景中的CP方法进行了全面而深入的综述。首先,我们介绍了典型V2X系统的架构和工作流程,以便更好地理解整个V2X系统以及CP在其中的角色。然后,我们系统地总结和分析了现有的V2X感知数据集与CP方法。特别地,我们从多个关键视角介绍了CP方法,包括协同阶段、路侧传感器布设、延迟补偿、性能-带宽权衡、攻击/防御、位姿对齐等。此外,我们还进行了广泛的实验分析,比较和检验现有CP方法,揭示了一些重要且尚未被探索的发现。例如,我们分析了不同方法在不同带宽下的性能变化,深入研究了性能-带宽权衡问题;我们还考察了不同LiDAR范围下的方法表现。为研究模型的鲁棒性,我们进一步分析了各种模拟真实噪声对不同CP方法性能的影响,涵盖通信延迟、有损通信、定位误差以及混合噪声。此外,我们还考察了现有CP方法从仿真到真实的泛化能力。最后,我们详细讨论了问题与挑战,指出了未来值得努力的方向。我们的实验分析代码将公开于 https://github.com/memberRE/Collaborative-Perception。

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

  • paper_url: http://arxiv.org/abs/2308.16689
  • repo_url: None
  • paper_authors: Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun
  • for: 这篇论文的目标是提高视觉语言预训练(VLP)方法的精度和速度。
  • methods: 该论文提出了两个组件来进一步促进模型学习图文对之间的细粒度表示。一是在掩码语言建模(MLM)中使用交叉蒸馏方法生成软标签,以提高模型的鲁棒性;二是在图文匹配(ITM)任务中,利用当前的语言编码器根据语言输入的上下文合成难负样本,促使模型学习高质量的表示。
  • results: 经过广泛的实验测试,该论文的提案可以在多种视觉语言任务中达到更高的性能水平,表明其在VLP预训练中的潜力。
    Abstract Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of vision-language tasks. Prior arts usually focus on how to align visual and textual features, but strategies for improving the robustness of model and speeding up model convergence are left insufficiently explored. In this paper, we propose a novel method ViLTA, comprising of two components to further facilitate the model to learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of model, which alleviates the problem of treating synonyms of masked words as negative samples in one-hot labels. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. By leveraging the above techniques, our ViLTA can achieve better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate that the effectiveness of ViLTA and its promising potential for vision-language pre-training.
    摘要 视觉语言预训练(VLP)方法近来蓬勃发展,其核心目标是通过基于Transformer的架构联合学习视觉和文本特征,在多种视觉语言任务上展现出可观的提升。现有工作通常关注如何对齐视觉与文本特征,而提升模型鲁棒性和加速模型收敛的策略尚未得到充分探索。本文提出一种新方法ViLTA,包含两个组件,以进一步促进模型学习图文对之间的细粒度表示。对于掩码语言建模(MLM),我们提出一种交叉蒸馏方法生成软标签,以增强模型的鲁棒性,缓解独热标签将掩码词的同义词视为负样本的问题。对于图文匹配(ITM),我们利用当前的语言编码器根据语言输入的上下文合成难负样本,通过提高ITM任务的难度促使模型学习高质量表示。借助上述技术,ViLTA能在多种视觉语言任务上取得更好的性能。在基准数据集上的大量实验证明了ViLTA的有效性及其在视觉语言预训练中的潜力。
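The soft-label MLM idea can be illustrated with a standard distillation loss: a teacher's distribution over the vocabulary replaces the one-hot target at each masked position, so synonyms of the masked word stop acting as pure negatives. The sketch below is a generic distillation form under assumed tensor shapes, not ViLTA's exact cross-distillation recipe.

```python
# Sketch: KL-based soft-label loss at masked token positions.
import torch
import torch.nn.functional as F

def soft_mlm_loss(student_logits, teacher_logits, mask, tau=1.0):
    """logits: (B, L, V); mask: (B, L) bool marking masked tokens."""
    s = student_logits[mask]                                 # (M, V)
    with torch.no_grad():
        t = F.softmax(teacher_logits[mask] / tau, dim=-1)    # soft labels
    return F.kl_div(F.log_softmax(s / tau, dim=-1), t,
                    reduction="batchmean") * tau * tau
```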

Diffusion Inertial Poser: Human Motion Reconstruction from Arbitrary Sparse IMU Configurations

  • paper_url: http://arxiv.org/abs/2308.16682
  • repo_url: None
  • paper_authors: Tom Van Wouwe, Seunghwan Lee, Antoine Falisse, Scott Delp, C. Karen Liu
  • for: 这种研究的目的是为了实时重建人体动作,并且能够适应不同的各个IMU配置。
  • methods: 该研究使用单一的扩散生成模型(Diffusion Inertial Poser,DiffIP)来重建人体运动。DiffIP不仅可以在任意IMU配置下提供高精度的重建结果,还可以为不同应用选择最优IMU配置而无需重新训练模型。
  • results: 研究表明,DiffIP可以从任意IMU配置实时重建人体运动,在常用的六IMU配置下与最先进方法同样准确。此外,当只有四个IMU可用时,在大腿和前臂佩戴传感器可使关节运动学误差最小,而在脚部而非大腿佩戴传感器时,全局平移的重建效果更好。
    Abstract Motion capture from a limited number of inertial measurement units (IMUs) has important applications in health, human performance, and virtual reality. Real-world limitations and application-specific goals dictate different IMU configurations (i.e., number of IMUs and chosen attachment body segments), trading off accuracy and practicality. Although recent works were successful in accurately reconstructing whole-body motion from six IMUs, these systems only work with a specific IMU configuration. Here we propose a single diffusion generative model, Diffusion Inertial Poser (DiffIP), which reconstructs human motion in real-time from arbitrary IMU configurations. We show that DiffIP has the benefit of flexibility with respect to the IMU configuration while being as accurate as the state-of-the-art for the commonly used six IMU configuration. Our system enables selecting an optimal configuration for different applications without retraining the model. For example, when only four IMUs are available, DiffIP found that the configuration that minimizes errors in joint kinematics instruments the thighs and forearms. However, global translation reconstruction is better when instrumenting the feet instead of the thighs. Although our approach is agnostic to the underlying model, we built DiffIP based on physiologically realistic musculoskeletal models to enable use in biomedical research and health applications.
    摘要 基于少量惯性测量单元(IMU)的动作捕捉在健康、人类表现和虚拟现实领域有重要应用。现实条件的限制和特定应用的目标决定了不同的IMU配置(即IMU数量与佩戴的身体部位),需要在精度与实用性之间进行权衡。尽管近期工作能够从六个IMU高精度地重建全身运动,但这些系统只适用于特定的IMU配置。我们提出一种单一的扩散生成模型,即Diffusion Inertial Poser(DiffIP),可从任意IMU配置实时重建人体运动。我们表明,DiffIP在IMU配置方面具有灵活性,同时在常用的六IMU配置下与最先进方法同样准确。我们的系统允许为不同应用选择最优配置,而无需重新训练模型。例如,当只有四个IMU可用时,DiffIP发现在大腿和前臂佩戴传感器可使关节运动学误差最小;而在脚部而非大腿佩戴传感器时,全局平移的重建效果更好。尽管我们的方法与底层模型无关,我们基于符合生理实际的肌肉骨骼模型构建了DiffIP,以便应用于生物医学研究与健康领域。

SoccerNet 2023 Tracking Challenge – 3rd place MOT4MOT Team Technical Report

  • paper_url: http://arxiv.org/abs/2308.16651
  • repo_url: None
  • paper_authors: Gal Shitrit, Ishay Be’ery, Ido Yerhushalmy
  • for: 本研究是为了解决足球竞赛中玩家和球的探测和追踪问题。
  • methods: 我们使用了一个最先进的在线多目标追踪器和一个当代的物体检测器来进行球员追踪。为了克服在线方法的局限,我们在后处理阶段加入插值和不依赖外观的轨迹合并。此外,我们还使用基于外观的轨迹合并技术,以处理远离图像边界处轨迹的终止与新建。
  • results: 我们的方法在SoccerNet 2023追踪挑战中获得第三名,HOTA分数为66.27。
    Abstract The SoccerNet 2023 tracking challenge requires the detection and tracking of soccer players and the ball. In this work, we present our approach to tackle these tasks separately. We employ a state-of-the-art online multi-object tracker and a contemporary object detector for player tracking. To overcome the limitations of our online approach, we incorporate a post-processing stage using interpolation and appearance-free track merging. Additionally, an appearance-based track merging technique is used to handle the termination and creation of tracks far from the image boundaries. Ball tracking is formulated as single object detection, and a fine-tuned YOLOv8l detector with proprietary filtering improves the detection precision. Our method achieves 3rd place on the SoccerNet 2023 tracking challenge with a HOTA score of 66.27.
    摘要 SoccerNet 2023追踪挑战需要对足球球员和球进行检测与追踪。在这项工作中,我们提出了分别处理这两项任务的方法。我们使用最先进的在线多目标追踪器和当代物体检测器来进行球员追踪。为了克服在线方法的局限,我们加入了使用插值和不依赖外观的轨迹合并的后处理阶段;此外,还使用基于外观的轨迹合并技术,处理远离图像边界处轨迹的终止与新建。球的追踪被表述为单目标检测,我们使用经过微调的YOLOv8l检测器并配合专有的滤波方法提升检测精度。我们的方法在SoccerNet 2023追踪挑战中获得第三名,HOTA分数为66.27。
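The interpolation post-processing is straightforward to illustrate: short detection gaps inside a track are filled by linearly interpolating box coordinates between the last and next observed frames. The gap limit and box format below are assumptions for the sketch.

```python
# Sketch: fill short gaps in a single track by linear interpolation.
import numpy as np

def interpolate_track(frames, boxes, max_gap=20):
    """frames: sorted int frame ids; boxes: (N, 4) xyxy. Returns filled track."""
    out_f, out_b = [frames[0]], [np.asarray(boxes[0], dtype=float)]
    for (f0, b0), (f1, b1) in zip(zip(frames, boxes),
                                  zip(frames[1:], boxes[1:])):
        gap = f1 - f0
        if 1 < gap <= max_gap:                     # fill the missing frames
            for k in range(1, gap):
                w = k / gap
                out_f.append(f0 + k)
                out_b.append((1 - w) * np.asarray(b0) + w * np.asarray(b1))
        out_f.append(f1)
        out_b.append(np.asarray(b1, dtype=float))
    return out_f, np.stack(out_b)
```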

Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2308.16649
  • repo_url: None
  • paper_authors: Prateksha Udhayanan, Srikrishna Karanam, Balaji Vasan Srinivasan
  • for: solves the problem of composed image retrieval, which takes an input query consisting of an image and a modification text and retrieves images that match the desired changes.
  • methods: uses a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step, using a new visual image attention computation technique called multi-modal gradient attention (MMGrad).
  • results: demonstrates improved grounding and better explainability of the models, as well as competitive quantitative retrieval performance on standard benchmark datasets.
    Abstract We consider the problem of composed image retrieval that takes an input query consisting of an image and a modification text indicating the desired changes to be made on the image and retrieves images that match these changes. Current state-of-the-art techniques that address this problem use global features for the retrieval, resulting in incorrect localization of the regions of interest to be modified because of the global nature of the features, more so in cases of real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features to be able to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by first proposing a new visual image attention computation technique, which we call multi-modal gradient attention (MMGrad) that is explicitly conditioned on the modifier text. We next demonstrate how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text. By training retrieval models with this new loss function, we show improved grounding by means of better visual attention maps, leading to better explainability of the models as well as competitive quantitative retrieval performance on standard benchmark datasets.
    摘要 我们研究组合图像检索问题:输入查询由一张图像和一段说明期望修改的文本组成,目标是检索出符合这些修改的图像。当前最先进的方法使用全局特征进行检索,由于特征的全局性,往往无法正确定位需要修改的感兴趣区域,在真实场景的自然图像中尤其如此。由于修改文本通常对应图像中特定的局部变化,模型必须学习局部特征,才能更好地定位和检索。为此,我们的关键创新是一种新的基于梯度注意力的学习目标,显式地迫使模型在每个检索步骤中关注被修改的局部感兴趣区域。我们首先提出一种新的视觉图像注意力计算技术,称为多模态梯度注意力(MMGrad),它显式地以修改文本为条件。随后,我们展示了如何将MMGrad整合进端到端的模型训练策略中,并引入一个新的学习目标,显式地迫使这些MMGrad注意力图高亮与修改文本对应的正确局部区域。用该损失函数训练检索模型后,我们得到了更好的视觉注意力图,从而提升了模型的可解释性,同时在标准基准数据集上取得了具有竞争力的量化检索性能。
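Gradient-based, text-conditioned attention can be sketched in the Grad-CAM style: backpropagate an image-text similarity score into a convolutional feature map, weight channels by their mean gradient, and keep the positive part. This illustrates the general mechanism only; MMGrad's exact conditioning and normalization differ in detail, and `image_head` is an assumed pooling/projection module.

```python
# Sketch: gradient attention on a feature map, conditioned on a text embedding.
import torch
import torch.nn.functional as F

def gradient_attention(feat_map, image_head, text_emb):
    """feat_map: (B, C, H, W); image_head pools/projects it to (B, D)."""
    feat_map = feat_map.detach().requires_grad_(True)
    img_emb = F.normalize(image_head(feat_map), dim=-1)
    score = (img_emb * F.normalize(text_emb, dim=-1)).sum()  # similarity
    grads, = torch.autograd.grad(score, feat_map)            # (B, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # channel weights
    return F.relu((weights * feat_map).sum(dim=1))           # (B, H, W) map
```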

Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps

  • paper_url: http://arxiv.org/abs/2308.16648
  • repo_url: https://github.com/miquel-espinosa/map-sat
  • paper_authors: Miguel Espinosa, Elliot J. Crowley
  • for: 这篇论文探讨扩散模型在地球观测中的应用,以及如何使用这些模型生成逼真的卫星图像。
  • methods: 该论文使用现有的预训练扩散模型,并以制图数据为条件,生成逼真的卫星图像。
  • results: 论文在 Mainland Scotland 和 Central Belt 地区的两个大型数据集上训练了 ControlNet 模型,并对其进行评估,证明了图像质量和地图保真度都可以实现。
    Abstract Despite recent advancements in image generation, diffusion models still remain largely underexplored in Earth Observation. In this paper we show that state-of-the-art pretrained diffusion models can be conditioned on cartographic data to generate realistic satellite images. We provide two large datasets of paired OpenStreetMap images and satellite views over the region of Mainland Scotland and the Central Belt. We train a ControlNet model and qualitatively evaluate the results, demonstrating that both image quality and map fidelity are possible. Finally, we provide some insights on the opportunities and challenges of applying these models for remote sensing. Our model weights and code for creating the dataset are publicly available at https://github.com/miquel-espinosa/map-sat.
    摘要 尽管图像生成领域最近取得了很大进步,扩散模型在地球观测中仍基本未被充分探索。在本文中,我们表明,最先进的预训练扩散模型可以以制图数据为条件,生成逼真的卫星图像。我们提供了覆盖 Mainland Scotland 和 Central Belt 地区的两个大型数据集,包含成对的 OpenStreetMap 图像与卫星视图。我们训练了一个 ControlNet 模型并对结果进行定性评估,证明图像质量和地图保真度都可以实现。最后,我们就将这类模型应用于遥感的机遇与挑战给出了一些见解。我们的模型权重和构建数据集的代码公开于:https://github.com/miquel-espinosa/map-sat。
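For readers unfamiliar with ControlNet-conditioned generation, the sketch below shows the general pattern using the Hugging Face diffusers library. The checkpoint names are generic public stand-ins (a segmentation-conditioned ControlNet and Stable Diffusion 1.5), not the paper's own weights, which are linked from the authors' repository.

```python
# Sketch: conditioning a pretrained diffusion model on a map tile.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

map_tile = Image.open("osm_tile.png").convert("RGB")  # rendered OSM tile
image = pipe("aerial satellite photo of rural Scotland",
             image=map_tile, num_inference_steps=30).images[0]
image.save("generated_satellite.png")
```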

Learning Channel Importance for High Content Imaging with Interpretable Deep Input Channel Mixing

  • paper_url: http://arxiv.org/abs/2308.16637
  • repo_url: None
  • paper_authors: Daniel Siegismund, Mario Wieser, Stephan Heyse, Stephan Steigele
  • for: 本研究旨在开发一种新的药物候选者选择方法,用于治疗复杂疾病。
  • methods: 本研究指出,现有深度学习方法缺乏关键的通道重要性信息。因此,提出了一种新方法DCMIX,基于图像混合概念和alpha合成,在高内容成像分析中以可解释的方式解读细胞生物学表型。
  • results: 实验表明,DCMIX能够学习与生物学相关的通道重要性,而不牺牲预测性能。
    Abstract Uncovering novel drug candidates for treating complex diseases remains one of the most challenging tasks in early discovery research. To tackle this challenge, biopharma research established a standardized high content imaging protocol that tags different cellular compartments per image channel. In order to judge the experimental outcome, the scientist requires knowledge about the channel importance with respect to a certain phenotype for decoding the underlying biology. In contrast to traditional image analysis approaches, such experiments are nowadays preferably analyzed by deep learning based approaches which, however, lack crucial information about the channel importance. To overcome this limitation, we present a novel approach which utilizes multi-spectral information of high content images to interpret a certain aspect of cellular biology. To this end, we base our method on image blending concepts with alpha compositing for an arbitrary number of channels. More specifically, we introduce DCMIX, a lightweight, scalable and end-to-end trainable mixing layer which enables interpretable predictions in high content imaging while retaining the benefits of deep learning based methods. We employ an extensive set of experiments on both MNIST and RXRX1 datasets, demonstrating that DCMIX learns the biologically relevant channel importance without sacrificing prediction performance.
    摘要 发现用于治疗复杂疾病的新药候选物,仍是早期发现研究中最具挑战性的任务之一。为应对这一挑战,生物制药研究建立了标准化的高内容成像协议,在每个图像通道中标记不同的细胞区室。为了判断实验结果,科学家需要了解各通道对特定表型的重要性,以解读背后的生物学机理。与传统图像分析方法不同,如今这类实验更多采用基于深度学习的方法进行分析,然而这些方法缺乏关键的通道重要性信息。为克服这一局限,我们提出了一种新方法,利用高内容图像的多光谱信息来解读细胞生物学的特定方面。为此,我们的方法建立在图像混合概念与任意数量通道的alpha合成之上。具体而言,我们提出了DCMIX,一种轻量、可扩展且可端到端训练的混合层,使高内容成像中的预测具有可解释性,同时保留基于深度学习方法的优势。我们在MNIST和RXRX1数据集上进行了大量实验,证明DCMIX能够学习与生物学相关的通道重要性,而不牺牲预测性能。
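The alpha-compositing idea can be illustrated with a learnable channel mixer: per-channel weights, normalized to a convex combination, blend an arbitrary number of fluorescence channels, and the learned weights double as interpretable importance scores. The sketch below is a minimal instance in this spirit and is not the paper's exact layer.

```python
# Sketch: interpretable convex mixing of high-content imaging channels.
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    def __init__(self, n_channels, n_out=3):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_out, n_channels))

    def forward(self, x):                        # x: (B, n_channels, H, W)
        w = torch.softmax(self.alpha, dim=1)     # convex mixing weights
        return torch.einsum("oc,bchw->bohw", w, x)

    def channel_importance(self):
        """The learned weights double as per-channel importance scores."""
        return torch.softmax(self.alpha, dim=1).detach()

# mixed = ChannelMix(6)(torch.rand(4, 6, 64, 64))  # six channels -> RGB-like
```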

MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.16635
  • repo_url: None
  • paper_authors: Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han
  • for: 本研究旨在建模面对面交流场景中说话者与听者两种角色;现有研究主要关注生成说话者视频,而听者头部视频的生成在很大程度上被忽视。
  • methods: 我们提出多方面响应听者头部生成网络(MFR-Net),使用概率去噪扩散模型预测多样化的头部姿态与表情特征;并设计特征聚合模块,在实现多方面响应的同时增强听者身份特征、保持身份信息的准确性。
  • results: 大量实验表明,MFR-Net 不仅在交互模式多样性与听者身份信息保持上实现了多方面响应,还能表达不同的态度与观点。
    Abstract Face-to-face communication is a common scenario including roles of speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker with attitude or viewpoint expressing while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the \textbf{M}ulti-\textbf{F}aceted \textbf{R}esponsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising diffusion model to predict diverse head pose and expression features. In order to perform multi-faceted response to the speaker video, while maintaining accurate listener identity preservation, we design the Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer finetuned with identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net not only achieves multi-faceted responses in diversity and speaker identity information but also in attitude and viewpoint expression.
    摘要 面对面交流是包含说话者与听者角色的常见场景。现有研究方法大多聚焦于生成说话者视频,而听者头部视频的生成在很大程度上被忽视。响应式听者头部生成旨在建模面对面交流场景:给定说话者视频与一张听者头像,生成听者头部视频。理想的响应式听者视频应在表达态度或观点地回应说话者的同时,保持交互模式的多样性与听者身份信息的准确性。为此,我们提出多方面响应听者头部生成网络(MFR-Net)。具体地,MFR-Net 使用概率去噪扩散模型预测多样化的头部姿态与表情特征;为在多方面回应说话者视频的同时准确保持听者身份,我们设计了特征聚合模块,增强听者身份特征并与其他说话者相关特征融合;最后,经身份一致性损失微调的渲染器生成最终的听者头部视频。大量实验表明,MFR-Net 不仅在交互多样性与听者身份信息上实现了多方面响应,也能表达态度与观点。

Semi-Supervised SAR ATR Framework with Transductive Auxiliary Segmentation

  • paper_url: http://arxiv.org/abs/2308.16633
  • repo_url: None
  • paper_authors: Chenwei Wang, Xiaoyu Liu, Yulin Huang, Siyi Luo, Jifang Pei, Jianyu Yang, Deqing Mao
  • for: 提高自动目标识别(ATR)性能,解决受限于少量标注数据的问题。
  • methods: 提出带转导辅助分割的半监督 SAR ATR 框架(SFAS),以辅助分割损失作为正则项发掘未标注样本的转导泛化能力,并通过训练中的信息残差损失(IRL)逐步融合识别与分割信息,构建有益的归纳偏置(示意代码见本条目末尾)。
  • results: 在 MSTAR 数据集上验证了小样本学习的有效性:每类 20 个训练样本时识别率达 94.18%,并同时给出准确的分割结果;面对 EOC 变化,每类 10 个训练样本时识别率仍高于 88.00%。
    Abstract Convolutional neural networks (CNNs) have achieved high performance in synthetic aperture radar (SAR) automatic target recognition (ATR). However, the performance of CNNs depends heavily on a large amount of training data. The insufficiency of labeled training SAR images limits the recognition performance and even invalidates some ATR methods. Furthermore, under few labeled training data, many existing CNNs are even ineffective. To address these challenges, we propose a Semi-supervised SAR ATR Framework with transductive Auxiliary Segmentation (SFAS). The proposed framework focuses on exploiting the transductive generalization on available unlabeled samples with an auxiliary loss serving as a regularizer. Through auxiliary segmentation of unlabeled SAR samples and information residue loss (IRL) in training, the framework can employ the proposed training loop process and gradually exploit the information compilation of recognition and segmentation to construct a helpful inductive bias and achieve high performance. Experiments conducted on the MSTAR dataset have shown the effectiveness of our proposed SFAS for few-shot learning. The recognition performance of 94.18\% can be achieved under 20 training samples in each class with simultaneous accurate segmentation results. Facing variances of EOCs, the recognition ratios are higher than 88.00\% when 10 training samples each class.
    摘要 卷积神经网络(CNN)在合成孔径雷达(SAR)自动目标识别(ATR)中取得了高性能,但其性能严重依赖大量训练数据。带标注的 SAR 训练图像不足会限制识别性能,甚至使一些 ATR 方法失效;在少量标注数据下,许多现有 CNN 甚至无法工作。为应对这些挑战,我们提出带转导辅助分割的半监督 SAR ATR 框架(SFAS)。该框架以辅助损失作为正则项,发掘可用未标注样本上的转导泛化能力;通过对未标注 SAR 样本的辅助分割与训练中的信息残差损失(IRL),框架可按所提出的训练循环逐步融合识别与分割信息,构建有益的归纳偏置并取得高性能。在 MSTAR 数据集上的实验表明了 SFAS 在小样本学习中的有效性:每类 20 个训练样本时识别率达 94.18%,并同时获得准确的分割结果;面对 EOC 变化,每类 10 个训练样本时识别率仍高于 88.00%。
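A schematic PyTorch training step below shows the general shape of recognition with an auxiliary segmentation branch. The information residue loss (IRL) and training loop of SFAS are not reproduced; the plain weighted sum and `lambda_seg` are our simplifications.

```python
import torch
import torch.nn as nn

class RecogSegNet(nn.Module):
    """Toy backbone with a classification head and an auxiliary segmentation head."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, num_classes))
        self.seg_head = nn.Conv2d(16, 2, 1)          # target / background mask

    def forward(self, x):
        f = self.backbone(x)
        return self.cls_head(f), self.seg_head(f)

net = RecogSegNet(num_classes=10)
ce, lambda_seg = nn.CrossEntropyLoss(), 0.5
x = torch.randn(4, 1, 64, 64)                        # SAR chips
y = torch.randint(0, 10, (4,))                       # class labels
mask = torch.randint(0, 2, (4, 64, 64))              # auxiliary segmentation labels
logits, seg = net(x)
loss = ce(logits, y) + lambda_seg * ce(seg, mask)    # recognition + auxiliary segmentation
loss.backward()
```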

3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

  • paper_url: http://arxiv.org/abs/2308.16632
  • repo_url: https://github.com/sosppxo/3d-stmn
  • paper_authors: Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
  • for: 这篇论文针对 3D Referring Expression Segmentation(3D-RES)任务:传统"先提取分割提案、再与指代表达匹配"的两阶段方法存在显著问题,如初始提案质量不佳以及推理速度明显下降。
  • methods: 我们提出创新的端到端超点-文本匹配网络(3D-STMN):其 STM 机制直接将语言指示与相应的超点(语义相关点的聚簇)对应,而非像传统方法那样遍历实例提案;并引入依存驱动交互(DDI)模块,借助依存树加深模型对指代表达的语义理解(示意代码见本条目末尾)。
  • results: 我们的模型在ScanRefer benchmark上表现出色,不仅提高了mIoU 11.7个点,而且也提高了推理速度,比传统方法快了95.7倍。
    Abstract In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities of our model. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only set new performance standards, registering an mIoU gain of 11.7 points but also achieve a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN.
    摘要 在三维指代表达分割(3D-RES)中,早期方法采用两阶段范式:先提取分割提案,再与指代表达匹配。这一传统范式面临显著挑战,尤其是初始提案质量不佳以及推理速度明显下降。鉴于此,我们提出一种创新的端到端超点-文本匹配网络(3D-STMN),并融入依存驱动的洞察。模型的关键之一是超点-文本匹配(STM)机制:不同于遍历实例提案的传统方法,STM 直接将语言指示与相应的超点(语义相关点的聚簇)对应。这一架构决策使模型能够高效利用跨模态语义关系,主要依托标注密集的超点-文本对,而非更稀疏的实例-文本对。为进一步增强文本对分割过程的引导作用,我们引入依存驱动交互(DDI)模块:以依存树为指引,辨析表达中核心词与其修饰成分之间的复杂关系,从而同时提升模型的定位与分割能力。在 ScanRefer 基准上的全面实验表明,我们的模型不仅刷新了性能标准(mIoU 提升 11.7 个点),推理速度也大幅提升,超过传统方法 95.7 倍。代码与模型见 https://github.com/sosppxo/3D-STMN。
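A rough sketch of the superpoint-text matching idea: score each superpoint against a pooled sentence embedding and threshold the scores into a mask. The dependency-driven interaction module and the actual STM architecture are not reproduced here.

```python
import torch
import torch.nn.functional as F

def stm_match(superpoint_feats, text_feats, tau=0.1):
    """superpoint_feats: (S, D); text_feats: (T, D) word-level embeddings."""
    sp = F.normalize(superpoint_feats, dim=1)
    txt = F.normalize(text_feats.mean(dim=0, keepdim=True), dim=1)  # pooled sentence
    scores = (sp @ txt.t()).squeeze(1) / tau         # (S,) matching logits
    return torch.sigmoid(scores)                     # per-superpoint mask probability

probs = stm_match(torch.randn(2048, 256), torch.randn(12, 256))
mask = probs > 0.5                                   # superpoints of the referred object
```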

Neural Gradient Regularizer

  • paper_url: http://arxiv.org/abs/2308.16612
  • repo_url: https://github.com/yyfz/neural-gradient-regularizer
  • paper_authors: Shuang Xu, Yifan Wang, Zixiang Zhao, Jiangjun Peng, Xiangyong Cao, Deyu Meng
  • for: 为多种图像类型与图像处理任务提供一种即插即用的梯度正则化方法,避免传统稀疏先验对梯度图的低估。
  • methods: 提出神经梯度正则化器(NGR),将梯度图表示为神经网络的输出;不依赖稀疏性假设,从而避免对梯度图的低估(示意代码见本条目末尾)。
  • results: 在多种图像类型与处理任务上的大量实验验证了 NGR 的通用性与零样本可用性,其性能优于现有先进方法。
    Abstract Owing to its significant success, the prior imposed on gradient maps has consistently been a subject of great interest in the field of image processing. Total variation (TV), one of the most representative regularizers, is known for its ability to capture the sparsity of gradient maps. Nonetheless, TV and its variants often underestimate the gradient maps, leading to the weakening of edges and details whose gradients should not be zero in the original image. Recently, total deep variation (TDV) has been introduced, assuming the sparsity of feature maps, which provides a flexible regularization learned from large-scale datasets for a specific task. However, TDV requires retraining when the image or task changes, limiting its versatility. In this paper, we propose a neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network. Unlike existing methods, NGR does not rely on the sparsity assumption, thereby avoiding the underestimation of gradient maps. NGR is applicable to various image types and different image processing tasks, functioning in a zero-shot learning fashion, making it a versatile and plug-and-play regularizer. Extensive experimental results demonstrate the superior performance of NGR over state-of-the-art counterparts for a range of different tasks, further validating its effectiveness and versatility.
    摘要 In this paper, we propose a neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network. Unlike existing methods, NGR does not rely on the sparsity assumption, thereby avoiding the underestimation of gradient maps. NGR is applicable to various image types and different image processing tasks, functioning in a zero-shot learning fashion, making it a versatile and plug-and-play regularizer. Extensive experimental results demonstrate the superior performance of NGR over state-of-the-art counterparts for a range of different tasks, further validating its effectiveness and versatility. translate into Simplified Chinese as follows:由于其取得了显著的成功,在图像处理领域中强制 gradient maps 的优先级一直受到了极大的关注。总变化 (TV),图像处理中最为代表的正则化之一,能够捕捉到 gradient maps 的稀畴性。然而,TV 及其变种通常会下降 gradient maps,导致图像中的边缘和细节的 Gradient 值不应该为零。最近,总深度变化 (TDV) 被引入,假设特定任务中的特征图中的稀畴性,提供了大规模数据集学习而来的灵活正则化。然而,TDV 需要重新训练当图像或任务改变,限制了它的多样性。在这篇论文中,我们提出了一种神经Gradient regularizer (NGR),它将 gradient map 表示为神经网络的输出。与现有方法不同,NGR 不假设稀畴性,因此可以避免下降 gradient maps。NGR 适用于不同的图像类型和图像处理任务,可以在零处理模式下工作,使它成为一种多样化和插件化的正则化。我们的实验结果表明,NGR 在不同任务中表现出色,超越了当前的相关方法,进一步证明了它的有效性和多样性。

Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method

  • paper_url: http://arxiv.org/abs/2308.16611
  • repo_url: None
  • paper_authors: Eivind Moholdt, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
  • for: 本研究旨在利用生成图像模型检测新闻中图像-标题对的脱离语境(Out-of-Context, OOC)使用。
  • methods: 本研究使用了两种生成图像模型:DALL-E 2和Stable-Diffusion,并创建了两个新的数据集,包含6800个图像。
  • results: 研究者对每种图像生成模型在该任务上的表现进行了初步的定性与定量分析,以评估其在 cheapfake 检测中的适用性,并评估了若干图像相似度计算方法(示意代码见本条目末尾)。
    Abstract The growth of misinformation and re-contextualized media in social media and news leads to an increasing need for fact-checking methods. Concurrently, the advancement in generative models makes cheapfakes and deepfakes both easier to make and harder to detect. In this paper, we present a novel approach using generative image models to our advantage for detecting Out-of-Context (OOC) use of images-caption pairs in news. We present two new datasets with a total of $6800$ images generated using two different generative models including (1) DALL-E 2, and (2) Stable-Diffusion. We are confident that the method proposed in this paper can further research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes. We run a preliminary qualitative and quantitative analysis to evaluate the performance of each image generation model for this task, and evaluate a handful of methods for computing image similarity.
    摘要 社交媒体与新闻中虚假信息和被重新置入语境的媒体不断增多,使得事实核查方法的需求日益增长;与此同时,生成模型的进步让 cheapfake 与 deepfake 更易制作、更难识别。本文提出一种反其道而行之的新方法:利用生成图像模型来检测新闻中图像-标题对的脱离语境(OOC)使用。我们提供两个新数据集,共含 6800 张图像,分别由两种生成模型(DALL-E 2 与 Stable-Diffusion)生成。我们相信本文方法能推动生成模型在 cheapfake 检测领域的进一步研究,所得数据集也可用于训练与评估新的 cheapfake 检测模型。我们对每种图像生成模型在该任务上的表现进行了初步的定性与定量分析,并评估了若干图像相似度计算方法。
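One plausible embedding-space similarity for comparing a news photo against an image generated from its caption is a CLIP cosine score; the sketch below uses the `transformers` CLIP API. The file names and the 0.6 threshold are illustrative only, not values from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)       # unit-normalize embeddings
    return float(emb[0] @ emb[1])                    # cosine similarity

real = Image.open("news_photo.jpg")
generated = Image.open("generated_from_caption.png") # e.g. DALL-E 2 output for the caption
score = clip_similarity(real, generated)
print("likely out-of-context" if score < 0.6 else "caption plausibly matches image")
```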

Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2308.16598
  • repo_url: https://github.com/ramtin-mojtahedi/ovtps
  • paper_authors: Ramtin Mojtahedi, Mohammad Hamghalam, Richard K. G. Do, Amber L. Simpson
  • for: 本研究旨在提高转移性结直肠癌(mCRC)肝转移灶分割的精度,服务于肝癌的早期诊断与治疗。
  • methods: 使用基于视觉 Transformer 的深度学习模型分割 3D 计算机断层扫描(CT)图像,并依据转移灶的平均体积选择最优的多分辨率输入图像块(patch)大小(示意代码见本条目末尾)。
  • results: 结合迁移学习的验证表明:先以建议的理想 patch 大小在肿瘤体积较大的数据上预训练、再在较小体积数据上训练,可获得最高的 Dice 相似系数(DSC),提升视觉 Transformer 对不同肿瘤大小的分割性能。
    Abstract Detection of tumors in metastatic colorectal cancer (mCRC) plays an essential role in the early diagnosis and treatment of liver cancer. Deep learning models backboned by fully convolutional neural networks (FCNNs) have become the dominant model for segmenting 3D computerized tomography (CT) scans. However, since their convolution layers suffer from limited kernel size, they are not able to capture long-range dependencies and global context. To tackle this restriction, vision transformers have been introduced to solve FCNN's locality of receptive fields. Although transformers can capture long-range features, their segmentation performance decreases with various tumor sizes due to the model sensitivity to the input patch size. While finding an optimal patch size improves the performance of vision transformer-based models on segmentation tasks, it is a time-consuming and challenging procedure. This paper proposes a technique to select the vision transformer's optimal input multi-resolution image patch size based on the average volume size of metastasis lesions. We further validated our suggested framework using a transfer-learning technique, demonstrating that the highest Dice similarity coefficient (DSC) performance was obtained by pre-training on training data with a larger tumour volume using the suggested ideal patch size and then training with a smaller one. We experimentally evaluate this idea through pre-training our model on a multi-resolution public dataset. Our model showed consistent and improved results when applied to our private multi-resolution mCRC dataset with a smaller average tumor volume. This study lays the groundwork for optimizing semantic segmentation of small objects using vision transformers. The implementation source code is available at:https://github.com/Ramtin-Mojtahedi/OVTPS.
    摘要 在转移性结直肠癌(mCRC)中检测肿瘤对肝癌的早期诊断与治疗至关重要。以全卷积神经网络(FCNN)为骨干的深度学习模型已成为分割 3D 计算机断层扫描(CT)的主流;但其卷积核尺寸有限,难以捕获长距离依赖与全局上下文。为突破这一限制,视觉 Transformer 被引入以弥补 FCNN 感受野的局部性。然而 Transformer 对输入 patch 大小敏感,面对不同肿瘤尺寸时分割性能会下降;而寻找最优 patch 大小既耗时又困难。本文提出依据转移灶平均体积选择视觉 Transformer 最优多分辨率输入图像 patch 大小的技术,并结合迁移学习加以验证:先以建议的理想 patch 大小在肿瘤体积较大的训练数据上预训练,再在较小体积数据上训练,可获得最高的 Dice 相似系数(DSC)。我们在多分辨率公开数据集上预训练模型,并在平均肿瘤体积更小的私有多分辨率 mCRC 数据集上取得一致且更优的结果。本研究为利用视觉 Transformer 优化小物体语义分割奠定了基础。实现代码见:https://github.com/Ramtin-Mojtahedi/OVTPS。
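A toy sketch of the selection idea: derive an approximate lesion diameter from the average metastasis volume and pick the smallest candidate patch size that still covers it. The candidate list and the sphere heuristic are our illustration, not the paper's exact rule.

```python
import numpy as np

def select_patch_size(avg_lesion_volume_mm3: float,
                      voxel_spacing_mm=(1.0, 1.0, 1.0),
                      candidates=(8, 16, 32, 64)) -> int:
    # Approximate lesion diameter from its average volume (sphere assumption).
    diameter_mm = 2.0 * ((3.0 * avg_lesion_volume_mm3) / (4.0 * np.pi)) ** (1.0 / 3.0)
    diameter_vox = diameter_mm / float(np.mean(voxel_spacing_mm))
    # Smallest candidate patch that still covers the average lesion.
    for p in candidates:
        if p >= diameter_vox:
            return p
    return candidates[-1]

print(select_patch_size(avg_lesion_volume_mm3=4200.0))  # -> 32 for ~20 mm lesions
```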

Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

  • paper_url: http://arxiv.org/abs/2308.16582
  • repo_url: None
  • paper_authors: Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
  • for: 解决文本到图像生成中由分辨率变化引起的构图问题。
  • methods: 提出名为 Any-Size-Diffusion(ASD)的两阶段管线:第一阶段 ARAD 以限定纵横比范围的图像集优化文本条件扩散模型,使其适应不同图像尺寸;第二阶段 FSTD 将输出快速无缝放大到任意高分辨率(示意代码见本条目末尾)。
  • results: 在 LAION-COCO 与 MM-CelebA-HQ 基准上的实验表明,ASD 能生成构图良好的任意尺寸图像,推理时间比传统平铺算法缩短 2 倍。
    Abstract Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
    摘要 用于文本到图像合成的生成模型 Stable Diffusion 在生成不同尺寸图像时常遇到分辨率引起的构图问题,主要原因是模型在单一尺度图像及其文本描述的配对上训练;而直接在任意尺寸图像上训练并不可行,因为这需要海量的文本-图像对与巨大的计算开销。为克服这些挑战,我们提出名为 Any-Size-Diffusion(ASD)的两阶段管线,可高效生成任意尺寸、构图良好的图像,同时尽量降低对高显存 GPU 的需求。第一阶段 Any Ratio Adaptability Diffusion(ARAD)利用限定纵横比范围的图像集优化文本条件扩散模型,提升其根据不同图像尺寸调整构图的能力;第二阶段引入快速无缝平铺扩散(FSTD)技术,可将 ASD 输出快速放大到任意高分辨率,避免拼接伪影与显存溢出。在 LAION-COCO 与 MM-CelebA-HQ 基准上的实验表明,ASD 能生成结构良好的任意尺寸图像,推理时间比传统平铺算法缩短 2 倍。
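The generic bookkeeping behind tiled enlargement is overlapping windows whose contributions are averaged where they meet; the sketch below shows only that mechanism. FSTD additionally shifts tiles across denoising steps for speed, which is not reproduced here; `tiled_apply` assumes `H, W >= tile`.

```python
import torch

def tiled_apply(x: torch.Tensor, fn, tile: int = 64, overlap: int = 16) -> torch.Tensor:
    """Apply `fn` to overlapping tiles of a (B, C, H, W) tensor, averaging overlaps."""
    B, C, H, W = x.shape
    out = torch.zeros_like(x)
    weight = torch.zeros(1, 1, H, W)
    step = tile - overlap
    for top in range(0, max(H - overlap, 1), step):
        for left in range(0, max(W - overlap, 1), step):
            b, r = min(top + tile, H), min(left + tile, W)
            t, l = b - tile, r - tile                # clamp tiles at the border
            out[:, :, t:b, l:r] += fn(x[:, :, t:b, l:r])
            weight[:, :, t:b, l:r] += 1.0
    return out / weight

x = torch.randn(1, 4, 128, 96)                       # e.g. a latent feature map
y = tiled_apply(x, fn=lambda t: t)                   # identity fn: y equals x
assert torch.allclose(x, y, atol=1e-6)
```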

GHuNeRF: Generalizable Human NeRF from a Monocular Video

  • paper_url: http://arxiv.org/abs/2308.16576
  • repo_url: None
  • paper_authors: Chen Li, Jihao Lin, Gim Hee Lee
  • for: 本研究旨在从单目视频中学习可泛化的人体 NeRF 模型。
  • methods: 提出可见性感知的聚合方案计算顶点级特征以构建 3D 特征体,并以注意力机制用时间对齐的点级特征增强体特征、弥补分辨率不足;另引入表面引导的采样策略提升训练与推理效率(示意代码见本条目末尾)。
  • results: 在 ZJU-MoCap 数据集上取得与基于多视角视频的方法相当的性能;在单目 People-Snapshot 数据集上,仅使用单目视频时优于现有工作。
    Abstract In this paper, we tackle the challenging task of learning a generalizable human NeRF model from a monocular video. Although existing generalizable human NeRFs have achieved impressive results, they require muti-view images or videos which might not be always available. On the other hand, some works on free-viewpoint rendering of human from monocular videos cannot be generalized to unseen identities. In view of these limitations, we propose GHuNeRF to learn a generalizable human NeRF model from a monocular video of the human performer. We first introduce a visibility-aware aggregation scheme to compute vertex-wise features, which is used to construct a 3D feature volume. The feature volume can only represent the overall geometry of the human performer with insufficient accuracy due to the limited resolution. To solve this, we further enhance the volume feature with temporally aligned point-wise features using an attention mechanism. Finally, the enhanced feature is used for predicting density and color for each sampled point. A surface-guided sampling strategy is also introduced to improve the efficiency for both training and inference. We validate our approach on the widely-used ZJU-MoCap dataset, where we achieve comparable performance with existing multi-view video based approaches. We also test on the monocular People-Snapshot dataset and achieve better performance than existing works when only monocular video is used.
    摘要 本文研究从单目视频中学习可泛化人体 NeRF 模型这一挑战性任务。现有可泛化人体 NeRF 虽已取得出色结果,但需要多视角图像或视频,而这些数据并非总是可得;另一些从单目视频进行人体自由视点渲染的工作则无法泛化到未见过的身份。鉴于这些局限,我们提出 GHuNeRF,从表演者的单目视频中学习可泛化的人体 NeRF 模型。我们首先引入可见性感知的聚合方案计算顶点级特征,用于构建 3D 特征体;受限于分辨率,该特征体只能以有限精度表示人体的整体几何。为此,我们进一步利用注意力机制,以时间对齐的点级特征增强体特征;增强后的特征用于预测每个采样点的密度与颜色。我们还引入表面引导的采样策略,提升训练与推理效率。在广泛使用的 ZJU-MoCap 数据集上,我们的方法取得了与基于多视角视频的方法相当的性能;在单目 People-Snapshot 数据集上,仅使用单目视频时优于现有工作。
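A simplified sketch of visibility-aware aggregation: per-frame features of each 3D point are averaged with weights that suppress frames where the point is not visible. GHuNeRF further refines the fused features with temporally aligned attention, which is omitted here.

```python
import torch

def aggregate(point_feats: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
    """point_feats: (F, N, D) features from F frames; visibility: (F, N) in [0, 1]."""
    w = visibility / visibility.sum(dim=0, keepdim=True).clamp(min=1e-6)
    return (point_feats * w.unsqueeze(-1)).sum(dim=0)   # (N, D) fused features

fused = aggregate(torch.randn(16, 4096, 64), torch.rand(16, 4096))
print(fused.shape)                                      # torch.Size([4096, 64])
```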

Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.16573
  • repo_url: None
  • paper_authors: Yuanbin Chen, Tao Wang, Hui Tang, Longxuan Zhao, Ruige Zong, Tao Tan, Xinlin Zhang, Tong Tong
  • for: 这篇论文提出一种基于 mean-teacher 模型的半监督医学影像分割方法:伪标签引导数据增广的双解码器一致性(Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation,简称 DCPA),以提升半监督分割的效果。
  • methods: 模型由共享编码器与两个采用不同上采样策略的解码器组成,结合一致性正则化、伪标签与 mixup 数据增广来强化半监督分割(示意代码见本条目末尾)。
  • results: 实验结果显示,DCPA 在三个公开医学影像数据集上、于经典的 10%、20% 以及更少的 5% 标注数据设定下,均优于六种现有半监督方法。
    Abstract Medical image segmentation methods often rely on fully supervised approaches to achieve excellent performance, which is contingent upon having an extensive set of labeled images for training. However, annotating medical images is both expensive and time-consuming. Semi-supervised learning offers a solution by leveraging numerous unlabeled images alongside a limited set of annotated ones. In this paper, we introduce a semi-supervised medical image segmentation method based on the mean-teacher model, referred to as Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation (DCPA). This method combines consistency regularization, pseudo-labels, and data augmentation to enhance the efficacy of semi-supervised segmentation. Firstly, the proposed model comprises both student and teacher models with a shared encoder and two distinct decoders employing different up-sampling strategies. Minimizing the output discrepancy between decoders enforces the generation of consistent representations, serving as regularization during student model training. Secondly, we introduce mixup operations to blend unlabeled data with labeled data, creating mixed data and thereby achieving data augmentation. Lastly, pseudo-labels are generated by the teacher model and utilized as labels for mixed data to compute unsupervised loss. We compare the segmentation results of the DCPA model with six state-of-the-art semi-supervised methods on three publicly available medical datasets. Beyond classical 10\% and 20\% semi-supervised settings, we investigate performance with less supervision (5\% labeled data). Experimental outcomes demonstrate that our approach consistently outperforms existing semi-supervised medical image segmentation methods across the three semi-supervised settings.
    摘要 医学图像分割方法通常依赖全监督方式以取得优异性能,而这需要大量带标注图像用于训练;然而标注医学图像既昂贵又耗时。半监督学习通过同时利用大量未标注图像与有限的标注图像提供了解决方案。本文提出一种基于 mean-teacher 模型的半监督医学图像分割方法:伪标签引导数据增广的双解码器一致性(DCPA)。该方法结合一致性正则化、伪标签与数据增广来提升半监督分割效果。首先,模型包含共享编码器的学生与教师模型,以及两个采用不同上采样策略的解码器;最小化两个解码器的输出差异可促使其生成一致的表示,作为学生模型训练的正则化。其次,我们引入 mixup 操作将未标注数据与标注数据混合,生成混合数据以实现数据增广。最后,由教师模型生成伪标签,作为混合数据的标签计算无监督损失。我们在三个公开医学数据集上与六种最先进的半监督方法比较分割结果;除经典的 10% 与 20% 半监督设定外,还考察了更少监督(5% 标注数据)下的性能。实验结果表明,我们的方法在三种半监督设定下均一致优于现有的半监督医学图像分割方法。
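A compact sketch of the three ingredients the DCPA recipe combines: an EMA teacher, mixup between labeled and unlabeled images, and teacher pseudo-labels as targets for the mixed data. Decoder duplication, the consistency term, and the exact loss weighting are omitted; the toy one-layer "models" below are stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.99):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def semi_supervised_step(student, teacher, x_l, y_l, x_u, alpha=0.5):
    sup_loss = F.cross_entropy(student(x_l), y_l)       # supervised term
    with torch.no_grad():
        pseudo = teacher(x_u).argmax(dim=1)             # teacher pseudo-labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_u + (1.0 - lam) * x_l               # mixup of unlabeled and labeled
    logits = student(x_mix)
    unsup_loss = (lam * F.cross_entropy(logits, pseudo)
                  + (1.0 - lam) * F.cross_entropy(logits, y_l))
    return sup_loss + unsup_loss

student, teacher = nn.Conv2d(1, 3, 1), nn.Conv2d(1, 3, 1)  # toy pixel classifiers
loss = semi_supervised_step(student, teacher,
                            torch.randn(2, 1, 16, 16), torch.randint(0, 3, (2, 16, 16)),
                            torch.randn(2, 1, 16, 16))
loss.backward()
ema_update(teacher, student)                             # teacher tracks the student
```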

Document Layout Analysis on BaDLAD Dataset: A Comprehensive MViTv2 Based Approach

  • paper_url: http://arxiv.org/abs/2308.16571
  • repo_url: None
  • paper_authors: Ashrafur Rahman Khan, Asif Azad
  • for: 本研究旨在自动抽取文档中的文本框、段落、图片和表格。
  • methods: 我们在 BaDLAD 数据集上训练带级联 Mask R-CNN 的 MViTv2 Transformer 模型,以抽取文档中的文本框、段落、图像和表格;在 20365 张文档图像上按 3 阶段循环训练 36 个 epoch,训练损失为 0.2125,掩码损失为 0.19。
  • results: 我们的工作不止于训练,还探索了多种可能的改进方向:旋转与翻转增广的影响、推理前对输入图像分块的效果、改变 Transformer 主干分辨率的影响,以及利用两轮推理找回漏检文本框的潜力。这些探索表明,部分修改可带来切实的性能提升,其余则为后续工作提供了新的视角。
    Abstract In the rapidly evolving digital era, the analysis of document layouts plays a pivotal role in automated information extraction and interpretation. In our work, we have trained MViTv2 transformer model architecture with cascaded mask R-CNN on BaDLAD dataset to extract text box, paragraphs, images and tables from a document. After training on 20365 document images for 36 epochs in a 3 phase cycle, we achieved a training loss of 0.2125 and a mask loss of 0.19. Our work extends beyond training, delving into the exploration of potential enhancement avenues. We investigate the impact of rotation and flip augmentation, the effectiveness of slicing input images pre-inference, the implications of varying the resolution of the transformer backbone, and the potential of employing a dual-pass inference to uncover missed text-boxes. Through these explorations, we observe a spectrum of outcomes, where some modifications result in tangible performance improvements, while others offer unique insights for future endeavors.
    摘要 在快速发展的数字时代,文档布局分析在自动信息抽取与理解中起着关键作用。本工作中,我们在 BaDLAD 数据集上训练了带级联 Mask R-CNN 的 MViTv2 Transformer 模型架构,以从文档中抽取文本框、段落、图像与表格。在 20365 张文档图像上按 3 阶段循环训练 36 个 epoch 后,训练损失为 0.2125,掩码损失为 0.19。我们的工作不止于训练,还深入探索了潜在的改进方向:旋转与翻转增广的影响、推理前对输入图像分块的效果、改变 Transformer 主干分辨率的影响,以及利用两轮推理找回漏检文本框的潜力。这些探索呈现出多样的结果:部分修改带来了切实的性能提升,其余则为后续工作提供了独特的启示。

Shape of my heart: Cardiac models through learned signed distance functions

  • paper_url: http://arxiv.org/abs/2308.16568
  • repo_url: None
  • paper_authors: Jan Verhülsdonk, Thomas Grandits, Francisco Sahli Costabal, Rolf Krause, Angelo Auricchio, Gundolf Haase, Simone Pezzuto, Alexander Effland
  • for: 构建个性化人体心脏模型
  • methods: 使用具有 Lipschitz 正则性的三维深度符号距离函数(SDF)重建心脏形态(示意代码见本条目末尾)。
  • results: 能够从部分数据(如单个心室的点云)或不同模态中重建准确的心脏形态,并可生成新的形态样本。
    Abstract The efficient construction of an anatomical model is one of the major challenges of patient-specific in-silico models of the human heart. Current methods frequently rely on linear statistical models, allowing no advanced topological changes, or requiring medical image segmentation followed by a meshing pipeline, which strongly depends on image resolution, quality, and modality. These approaches are therefore limited in their transferability to other imaging domains. In this work, the cardiac shape is reconstructed by means of three-dimensional deep signed distance functions with Lipschitz regularity. For this purpose, the shapes of cardiac MRI reconstructions are learned from public databases to model the spatial relation of multiple chambers in Cartesian space. We demonstrate that this approach is also capable of reconstructing anatomical models from partial data, such as point clouds from a single ventricle, or modalities different from the trained MRI, such as electroanatomical mapping, and in addition, allows us to generate new anatomical shapes by randomly sampling latent vectors.
    摘要 高效构建解剖模型是构建个性化人体心脏计算模型的主要挑战之一。现有方法通常依赖线性统计模型(无法表达高级拓扑变化),或依赖医学图像分割加网格化流程(强烈依赖图像分辨率、质量与模态),因此难以迁移到其他成像领域。本工作利用具有 Lipschitz 正则性的三维深度符号距离函数重建心脏形状:从公开数据库中学习心脏 MRI 重建的形状,以在笛卡尔空间中建模多个腔室之间的空间关系。该方法还能从部分数据(如单个心室的点云)或不同于训练 MRI 的模态(如电解剖标测)重建解剖模型,并可通过随机采样隐向量生成新的解剖形状。
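A bare-bones signed-distance-function network in the DeepSDF style: an MLP maps a 3D query point plus a per-shape latent code to a signed distance, and the zero level set is the cardiac surface. The Lipschitz regularization and multi-chamber modelling of the paper are not reproduced.

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    def __init__(self, latent_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) query points; z: (latent_dim,) shape code for one heart
        zed = z.expand(xyz.size(0), -1)
        return self.net(torch.cat([xyz, zed], dim=1)).squeeze(1)  # signed distances

sdf = SDFNet()
d = sdf(torch.rand(1000, 3) * 2 - 1, torch.randn(64))    # negative inside the shape
near_surface = d.abs() < 1e-2                            # points near the zero level set
```

Sampling new latent codes `z` and extracting the zero level set (e.g. with marching cubes) is how such a model generates new anatomical shapes.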

ScrollNet: Dynamic Weight Importance for Continual Learning

  • paper_url: http://arxiv.org/abs/2308.16567
  • repo_url: https://github.com/firefyf/scrollnet
  • paper_authors: Fei Yang, Kai Wang, Joost van de Weijer
  • for: 本研究针对持续学习(Continual Learning, CL),旨在提升模型在顺序学习新任务时的稳定性-可塑性权衡。
  • methods: 提出滚动网络 ScrollNet:在接触数据之前先为每个任务分配参数重要性的排序,并在不同任务间重新分配该排序,从而在顺序任务学习中取得更优的稳定性-可塑性权衡。
  • results: 实验结果显示,ScrollNet 可与多种 CL 方法(包括基于正则化与基于重放的方法)结合,并在 CIFAR100 与 TinyImagenet 数据集上表现出良好效果。代码见 https://github.com/FireFYF/ScrollNet.git。
    Abstract The principle underlying most existing continual learning (CL) methods is to prioritize stability by penalizing changes in parameters crucial to old tasks, while allowing for plasticity in other parameters. The importance of weights for each task can be determined either explicitly through learning a task-specific mask during training (e.g., parameter isolation-based approaches) or implicitly by introducing a regularization term (e.g., regularization-based approaches). However, all these methods assume that the importance of weights for each task is unknown prior to data exposure. In this paper, we propose ScrollNet as a scrolling neural network for continual learning. ScrollNet can be seen as a dynamic network that assigns the ranking of weight importance for each task before data exposure, thus achieving a more favorable stability-plasticity tradeoff during sequential task learning by reassigning this ranking for different tasks. Additionally, we demonstrate that ScrollNet can be combined with various CL methods, including regularization-based and replay-based approaches. Experimental results on CIFAR100 and TinyImagenet datasets show the effectiveness of our proposed method. We release our code at https://github.com/FireFYF/ScrollNet.git.
    摘要 现有大多数持续学习(CL)方法的基本原则是:通过惩罚对旧任务关键的参数的变化来优先保证稳定性,同时允许其余参数保持可塑性。各参数对每个任务的重要性既可以在训练中通过学习任务特定的掩码显式确定(如基于参数隔离的方法),也可以通过引入正则项隐式确定(如基于正则化的方法)。然而,这些方法都假设各参数对任务的重要性在接触数据之前是未知的。本文提出用于持续学习的滚动神经网络 ScrollNet:它可视为一个动态网络,在接触数据之前先为每个任务分配参数重要性的排序,并在不同任务间重新分配该排序,从而在顺序任务学习中取得更优的稳定性-可塑性权衡。此外,我们证明 ScrollNet 可与多种 CL 方法结合,包括基于正则化与基于重放的方法。在 CIFAR100 与 TinyImagenet 数据集上的实验结果表明了所提方法的有效性。代码发布于 https://github.com/FireFYF/ScrollNet.git。

MoMA: Momentum Contrastive Learning with Multi-head Attention-based Knowledge Distillation for Histopathology Image Analysis

  • paper_url: http://arxiv.org/abs/2308.16561
  • repo_url: https://github.com/trinhvg/moma
  • paper_authors: Trinh Thi Le Vuong, Jin Tae Kwak
  • for: This paper aims to address the issue of lack of quality data in computational pathology by proposing a method to transfer knowledge from an existing model to a new model without direct access to the source data.
  • methods: The proposed method utilizes a student-teacher framework with momentum contrastive learning and a multi-head attention mechanism to distill relevant knowledge from the teacher model and adapt to the unique nuances of the target data (see the sketch after this entry).
  • results: The proposed method outperforms other related methods in transferring knowledge to different domains and tasks, and provides a guideline on the learning strategy for different types of tasks and scenarios in computational pathology.
    Abstract There is no doubt that advanced artificial intelligence models and high quality data are the keys to success in developing computational pathology tools. Although the overall volume of pathology data keeps increasing, a lack of quality data is a common issue when it comes to a specific task due to several reasons including privacy and ethical issues with patient data. In this work, we propose to exploit knowledge distillation, i.e., utilize the existing model to learn a new, target model, to overcome such issues in computational pathology. Specifically, we employ a student-teacher framework to learn a target model from a pre-trained, teacher model without direct access to source data and distill relevant knowledge via momentum contrastive learning with multi-head attention mechanism, which provides consistent and context-aware feature representations. This enables the target model to assimilate informative representations of the teacher model while seamlessly adapting to the unique nuances of the target data. The proposed method is rigorously evaluated across different scenarios where the teacher model was trained on the same, relevant, and irrelevant classification tasks with the target model. Experimental results demonstrate the accuracy and robustness of our approach in transferring knowledge to different domains and tasks, outperforming other related methods. Moreover, the results provide a guideline on the learning strategy for different types of tasks and scenarios in computational pathology. Code is available at: \url{https://github.com/trinhvg/MoMA}.
    摘要 毫无疑问,先进的人工智能模型与高质量数据是计算病理学工具成功的关键。尽管病理数据的总量持续增长,但由于患者数据的隐私与伦理等多种原因,特定任务往往缺乏高质量数据。本工作提出在计算病理学中利用知识蒸馏来克服这一问题,即利用已有模型来学习新的目标模型。具体地,我们采用学生-教师框架,在不直接访问源数据的情况下,从预训练的教师模型学习目标模型,并通过带多头注意力机制的动量对比学习蒸馏相关知识,获得一致且上下文感知的特征表示。这使目标模型既能吸收教师模型的信息性表示,又能无缝适应目标数据的独特细节。我们在教师模型分别于相同、相关与不相关分类任务上训练的多种场景下严格评估了所提方法。实验结果表明,该方法在向不同领域与任务迁移知识方面准确且鲁棒,优于其他相关方法;结果还为计算病理学中不同任务类型与场景的学习策略提供了指南。代码见:https://github.com/trinhvg/MoMA。
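A simplified sketch of momentum-contrastive distillation: the student embedding is pulled toward the teacher's embedding of the same image (the positive) and pushed away from a queue of other teacher embeddings (the negatives). MoMA's multi-head attention step is omitted, and the temperature below is the common MoCo default rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def moco_distill_loss(student_feat, teacher_feat, queue, tau=0.07):
    """student_feat, teacher_feat: (B, D); queue: (K, D) of past teacher features."""
    q = F.normalize(student_feat, dim=1)
    k = F.normalize(teacher_feat, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)          # (B, 1) positive logits
    l_neg = q @ F.normalize(queue, dim=1).t()         # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long) # the positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = moco_distill_loss(torch.randn(8, 128), torch.randn(8, 128),
                         torch.randn(4096, 128))
```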

E3CM: Epipolar-Constrained Cascade Correspondence Matching

  • paper_url: http://arxiv.org/abs/2308.16555
  • repo_url: None
  • paper_authors: Chenbo Zhou, Shuai Su, Qijun Chen, Rui Fan
  • for: Addressing the challenge of accurate and robust correspondence matching in 3D computer vision with a novel approach called Epipolar-Constrained Cascade Correspondence (E3CM).
  • methods: E3CM matches correspondences with pre-trained convolutional neural networks, requiring no annotated data for training or fine-tuning; epipolar constraints guide the matching process and a cascade structure progressively refines the matches (see the sketch after this entry).
  • results: Comprehensive experiments demonstrate the superiority of E3CM over existing methods, and the source code is publicly released for further research and reproducibility.
    Abstract Accurate and robust correspondence matching is of utmost importance for various 3D computer vision tasks. However, traditional explicit programming-based methods often struggle to handle challenging scenarios, and deep learning-based methods require large well-labeled datasets for network training. In this article, we introduce Epipolar-Constrained Cascade Correspondence (E3CM), a novel approach that addresses these limitations. Unlike traditional methods, E3CM leverages pre-trained convolutional neural networks to match correspondence, without requiring annotated data for any network training or fine-tuning. Our method utilizes epipolar constraints to guide the matching process and incorporates a cascade structure for progressive refinement of matches. We extensively evaluate the performance of E3CM through comprehensive experiments and demonstrate its superiority over existing methods. To promote further research and facilitate reproducibility, we make our source code publicly available at https://mias.group/E3CM.
    摘要 准确而鲁棒的对应匹配对多种三维计算机视觉任务至关重要。然而,传统基于显式编程的方法往往难以应对复杂场景,基于深度学习的方法又需要大规模高质量标注数据进行网络训练。本文提出对极约束级联对应匹配(E3CM),一种解决上述局限的新方法。与传统方法不同,E3CM 利用预训练卷积神经网络进行对应匹配,无需任何标注数据来训练或微调网络。该方法利用对极约束引导匹配过程,并采用级联结构逐步精化匹配。我们通过全面实验评估了 E3CM 的性能,证明其优于现有方法。为促进后续研究与可复现性,源代码公开于 https://mias.group/E3CM。
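A small sketch of the epipolar filtering step: given a fundamental matrix between two views, keep only correspondences whose points lie close to their epipolar lines. The cascade refinement and pre-trained feature matching of E3CM are not reproduced; the identity matrix below is a placeholder for a real fundamental matrix.

```python
import numpy as np

def epipolar_filter(pts1, pts2, F, thresh=2.0):
    """pts1, pts2: (N, 2) pixel coordinates; F: (3, 3) fundamental matrix."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])                      # homogeneous coordinates
    x2 = np.hstack([pts2, ones])
    lines2 = x1 @ F.T                                 # epipolar lines in image 2
    # point-to-line distance |x2 . l| / sqrt(a^2 + b^2)
    d = np.abs(np.sum(x2 * lines2, axis=1)) / np.linalg.norm(lines2[:, :2], axis=1)
    return d < thresh                                 # boolean inlier mask

mask = epipolar_filter(np.random.rand(100, 2) * 640,
                       np.random.rand(100, 2) * 640,
                       F=np.eye(3))                   # placeholder F
```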

Prompt-enhanced Hierarchical Transformer Elevating Cardiopulmonary Resuscitation Instruction via Temporal Action Segmentation

  • paper_url: http://arxiv.org/abs/2308.16552
  • repo_url: None
  • paper_authors: Yang Liu, Xiaoyun Zhong, Shiyao Zhai, Zhicheng Du, Zhenyuan Gao, Qiming Huang, Canyang Zhang, Bin Jiang, Vijay Kumar Pandey, Sanyang Han, Runming Wang, Yuxing Han, Peiwu Qin
  • for: 这个论文主要是为了提高心肺征复(CPR)训练的效果,以提高救护人员的救命能力。
  • methods: 该论文使用现代深度学习技术,提出集成三个模块的提示增强层级 Transformer(PhiTrans):基于文本提示的视频特征提取器(VFE)、基于 Transformer 的动作分割执行器(ASE),以及基于回归的预测精化校准器(PRC)。
  • results: 实验结果表明,借助这些深度学习技术可以有效提升 CPR 指导的质量,多项指标超过 91.0%。
    Abstract The vast majority of people who suffer unexpected cardiac arrest are performed cardiopulmonary resuscitation (CPR) by passersby in a desperate attempt to restore life, but endeavors turn out to be fruitless on account of disqualification. Fortunately, many pieces of research manifest that disciplined training will help to elevate the success rate of resuscitation, which constantly desires a seamless combination of novel techniques to yield further advancement. To this end, we collect a custom CPR video dataset in which trainees make efforts to behave resuscitation on mannequins independently in adherence to approved guidelines, thereby devising an auxiliary toolbox to assist supervision and rectification of intermediate potential issues via modern deep learning methodologies. Our research empirically views this problem as a temporal action segmentation (TAS) task in computer vision, which aims to segment an untrimmed video at a frame-wise level. Here, we propose a Prompt-enhanced hierarchical Transformer (PhiTrans) that integrates three indispensable modules, including a textual prompt-based Video Features Extractor (VFE), a transformer-based Action Segmentation Executor (ASE), and a regression-based Prediction Refinement Calibrator (PRC). The backbone of the model preferentially derives from applications in three approved public datasets (GTEA, 50Salads, and Breakfast) collected for TAS tasks, which accounts for the excavation of the segmentation pipeline on the CPR dataset. In general, we unprecedentedly probe into a feasible pipeline that genuinely elevates the CPR instruction qualification via action segmentation in conjunction with cutting-edge deep learning techniques. Associated experiments advocate our implementation with multiple metrics surpassing 91.0%.
    摘要 绝大多数突发心脏骤停患者由路人实施心肺复苏(CPR)以求挽回生命,但往往因操作不合格而徒劳无功。所幸,大量研究表明,规范的培训有助于提升复苏成功率,而这持续需要与新技术无缝结合以取得进一步进展。为此,我们采集了一个自建 CPR 视频数据集:受训者按照认证指南独立在人体模型上实施复苏。据此,我们借助现代深度学习方法设计了一个辅助工具箱,用于监督并纠正过程中的潜在问题。我们的研究将该问题视为计算机视觉中的时序动作分割(TAS)任务,即在帧级别上对未剪辑视频进行分割。我们提出提示增强的层级 Transformer(PhiTrans),集成三个不可或缺的模块:基于文本提示的视频特征提取器(VFE)、基于 Transformer 的动作分割执行器(ASE),以及基于回归的预测精化校准器(PRC)。模型主干优先来源于三个经认可的 TAS 公开数据集(GTEA、50Salads 与 Breakfast)上的应用,为在 CPR 数据集上挖掘分割管线提供了基础。总体而言,我们首次探索了一条切实可行的管线,结合动作分割与前沿深度学习技术,真正提升 CPR 指导的质量。相关实验支持我们的实现,多项指标超过 91.0%。

Object Detection for Caries or Pit and Fissure Sealing Requirement in Children’s First Permanent Molars

  • paper_url: http://arxiv.org/abs/2308.16551
  • repo_url: None
  • paper_authors: Chenyao Jiang, Shiyao Zhai, Hengrui Song, Yuqing Ma, Yachen Fan, Yancheng Fang, Dongmei Yu, Canyang Zhang, Sanyang Han, Runming Wang, Yong Liu, Jianbo Li, Peiwu Qin
  • for: 自动检测儿童恒磨牙的龋齿与窝沟封闭需求,帮助家长或监护人在家中完成初步筛查。
  • methods: 基于智能手机拍摄的口腔照片进行自动检测,使用 YOLOv5 与 YOLOX 模型,并采用分块(tiling)策略减少图像预处理中的信息损失。
  • results: 采用分块策略的 YOLOXs 模型取得最佳结果 72.3 mAP.5(不分块为 71.2);YOLOv5s6 模型分块/不分块分别为 70.9/67.9 mAP.5。
    Abstract Dental caries is one of the most common oral diseases that, if left untreated, can lead to a variety of oral problems. It mainly occurs inside the pits and fissures on the occlusal/buccal/palatal surfaces of molars and children are a high-risk group for pit and fissure caries in permanent molars. Pit and fissure sealing is one of the most effective methods that is widely used in prevention of pit and fissure caries. However, current detection of pits and fissures or caries depends primarily on the experienced dentists, which ordinary parents do not have, and children may miss the remedial treatment without timely detection. To address this issue, we present a method to autodetect caries and pit and fissure sealing requirements using oral photos taken by smartphones. We use the YOLOv5 and YOLOX models and adopt a tiling strategy to reduce information loss during image pre-processing. The best result for YOLOXs model with tiling strategy is 72.3 mAP.5, while the best result without tiling strategy is 71.2. YOLOv5s6 model with/without tiling attains 70.9/67.9 mAP.5, respectively. We deploy the pre-trained network to mobile devices as a WeChat applet, allowing in-home detection by parents or children guardian.
    摘要 龋齿是最常见的口腔疾病之一,若不及时治疗可引发多种口腔问题。它主要发生在磨牙咬合面/颊面/腭面的窝沟点隙内,而儿童是恒磨牙窝沟龋的高危人群。窝沟封闭是预防窝沟龋最有效且应用广泛的方法之一。然而,目前窝沟与龋齿的检查主要依赖经验丰富的牙医,普通家长并不具备这种能力,儿童可能因未能及时发现而错过治疗时机。为此,我们提出一种利用智能手机拍摄的口腔照片自动检测龋齿与窝沟封闭需求的方法。我们使用 YOLOv5 与 YOLOX 模型,并采用分块策略减少图像预处理中的信息损失。采用分块策略的 YOLOXs 模型取得最佳结果 72.3 mAP.5,不采用分块策略时最佳为 71.2;YOLOv5s6 模型分块/不分块分别取得 70.9/67.9 mAP.5。我们将预训练网络部署到移动端,作为微信小程序供家长或监护人在家中检测。

Decoupled Local Aggregation for Point Cloud Learning

  • paper_url: http://arxiv.org/abs/2308.16532
  • repo_url: None
  • paper_authors: Binjie Chen, Yunzhou Xia, Yu Zang, Cheng Wang, Jonathan Li
  • for: 针对点云非结构化特性导致的局部聚合效率问题,提出轻量级点云网络 DeLA。
  • methods: 提出"解耦局部聚合":不再在聚合过程中显式建模空间关系,而是先将必要的空间信息编码进点特征,聚合阶段仅使用逐点卷积与边最大池化(示意代码见本条目末尾)。
  • results: 在五个经典基准上的实验表明,DeLA 以更低或相当的延迟取得最先进性能:ScanObjectNN 总体准确率达 90.3%,S3DIS Area 5 的 mIoU 达 74.2%。
    Abstract The unstructured nature of point clouds demands that local aggregation be adaptive to different local structures. Previous methods meet this by explicitly embedding spatial relations into each aggregation process. Although this coupled approach has been shown effective in generating clear semantics, aggregation can be greatly slowed down due to repeated relation learning and redundant computation to mix directional and point features. In this work, we propose to decouple the explicit modelling of spatial relations from local aggregation. We theoretically prove that basic neighbor pooling operations can too function without loss of clarity in feature fusion, so long as essential spatial information has been encoded in point features. As an instantiation of decoupled local aggregation, we present DeLA, a lightweight point network, where in each learning stage relative spatial encodings are first formed, and only pointwise convolutions plus edge max-pooling are used for local aggregation then. Further, a regularization term is employed to reduce potential ambiguity through the prediction of relative coordinates. Conceptually simple though, experimental results on five classic benchmarks demonstrate that DeLA achieves state-of-the-art performance with reduced or comparable latency. Specifically, DeLA achieves over 90\% overall accuracy on ScanObjectNN and 74\% mIoU on S3DIS Area 5. Our code is available at https://github.com/Matrix-ASC/DeLA .
    摘要 点云的非结构化特性要求局部聚合能自适应不同的局部结构。以往方法通过在每次聚合中显式嵌入空间关系来满足这一点;这种耦合方式虽被证明能产生清晰的语义,但由于重复学习空间关系以及混合方向特征与点特征的冗余计算,聚合速度会被大幅拖慢。本文提出将空间关系的显式建模与局部聚合解耦。我们从理论上证明,只要点特征中已编码必要的空间信息,基础的邻居池化操作同样可以在特征融合中不损失清晰度。作为解耦局部聚合的一个实例,我们提出轻量级点云网络 DeLA:在每个学习阶段先构建相对空间编码,之后的局部聚合仅使用逐点卷积与边最大池化;此外,通过预测相对坐标的正则项来降低潜在歧义。概念虽简单,但在五个经典基准上的实验结果表明,DeLA 以更低或相当的延迟取得了最先进的性能:在 ScanObjectNN 上总体准确率超过 90%,在 S3DIS Area 5 上 mIoU 达 74%。代码见 https://github.com/Matrix-ASC/DeLA。
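A minimal sketch of the decoupled pattern: relative coordinates are encoded once and added to neighbor features, after which aggregation is just a pointwise MLP followed by max pooling over each point's k neighbors. Exact encodings, layer sizes, and the relative-coordinate regularizer of DeLA differ.

```python
import torch
import torch.nn as nn

class DecoupledLocalAgg(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pos_enc = nn.Linear(3, dim)              # encode relative xyz once
        self.pw = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats, xyz, knn_idx):
        # feats: (N, D); xyz: (N, 3); knn_idx: (N, k) neighbor indices
        nbr = feats[knn_idx]                          # (N, k, D) neighbor features
        rel = xyz[knn_idx] - xyz[:, None, :]          # (N, k, 3) relative positions
        h = self.pw(nbr + self.pos_enc(rel))          # pointwise convolution only
        return h.max(dim=1).values                    # edge max-pooling

layer = DecoupledLocalAgg(dim=32)
out = layer(torch.randn(1024, 32), torch.rand(1024, 3),
            torch.randint(0, 1024, (1024, 16)))      # -> (1024, 32)
```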

Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition

  • paper_url: http://arxiv.org/abs/2308.16530
  • repo_url: None
  • paper_authors: Andreea Bianca Popescu, Cosmin Ioan Nita, Ioana Antonia Taca, Anamaria Vizitiu, Lucian Mihai Itu
  • for: 本研究旨在开发一种基于深度学习的医疗数据处理技术,以保护医疗数据的隐私和安全性。
  • methods: 本研究使用奇异值分解(SVD)与主成分分析(PCA)对医学图像进行混淆(obfuscation),再将混淆后的图像用于深度学习分析(示意代码见本条目末尾)。
  • results: 研究发现,混淆图像能很好地保护医学图像的隐私,同时仅用受保护数据即可训练出性能令人满意的血管造影视角分类器,且无额外计算开销。
    Abstract Deep learning (DL)-based solutions have been extensively researched in the medical domain in recent years, enhancing the efficacy of diagnosis, planning, and treatment. Since the usage of health-related data is strictly regulated, processing medical records outside the hospital environment for developing and using DL models demands robust data protection measures. At the same time, it can be challenging to guarantee that a DL solution delivers a minimum level of performance when being trained on secured data, without being specifically designed for the given task. Our approach uses singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis. The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames. The security level is probed by simulated artificial intelligence (AI)-based reconstruction attacks, considering two threat actors with different prior knowledge of the targeted data. The degree of privacy is quantitatively measured using similarity indices. Although a trade-off between privacy and accuracy should be considered, the proposed technique allows for training the angiographic view classifier exclusively on secured data with satisfactory performance and with no computational overhead, model adaptation, or hyperparameter tuning. While the obfuscated medical image content is well protected against human perception, the hypothetical reconstruction attack proved that it is also difficult to recover the complete information of the original frames.
    摘要 近年来,基于深度学习(DL)的方案在医疗领域得到广泛研究,提升了诊断、规划与治疗的效果。由于健康相关数据的使用受到严格监管,在医院环境之外处理医疗记录以开发和使用 DL 模型需要可靠的数据保护措施;与此同时,要保证 DL 方案在受保护数据上训练时仍达到一定性能水平(且无需针对具体任务专门设计)并非易事。我们的方法在将医学图像用于 DL 分析之前,先利用奇异值分解(SVD)与主成分分析(PCA)对其进行混淆。我们以基于混淆帧的血管造影视角分类任务,评估 DL 算法从受保护数据中提取有效信息的能力;并通过模拟的基于人工智能(AI)的重建攻击(考虑两类具有不同先验知识的威胁者)检验安全程度,用相似度指标定量衡量隐私水平。尽管需要在隐私与精度之间权衡,所提技术仍可仅用受保护数据训练血管造影视角分类器,取得令人满意的性能,且无需额外计算开销、模型适配或超参数调优。混淆后的医学图像内容能很好地抵御人眼辨识,而假想的重建攻击也表明,恢复原始帧的完整信息同样困难。
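A toy illustration of SVD-based obfuscation: reconstruct an image from a truncated set of singular components so the content is no longer recognizable to a human while structure a classifier can exploit remains. The component range below is arbitrary; the paper's actual component selection and PCA step are not reproduced.

```python
import numpy as np

def svd_obfuscate(img: np.ndarray, keep: slice = slice(5, 40)) -> np.ndarray:
    """img: (H, W) grayscale array; keep: which singular components to retain."""
    U, S, Vt = np.linalg.svd(img, full_matrices=False)
    S_masked = np.zeros_like(S)
    S_masked[keep] = S[keep]                          # e.g. drop the dominant components
    return (U * S_masked) @ Vt

img = np.random.rand(256, 256)                        # stand-in for an angiographic frame
obf = svd_obfuscate(img)
print(obf.shape, float(np.abs(img - obf).mean()))
```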

SA6D: Self-Adaptive Few-Shot 6D Pose Estimator for Novel and Occluded Objects

  • paper_url: http://arxiv.org/abs/2308.16528
  • repo_url: None
  • paper_authors: Ning Gao, Ngo Anh Vien, Hanna Ziesche, Gerhard Neumann
  • for: 实现机器人在真实世界中对物体的有意义操作,其中 6D 位姿估计是关键环节。
  • methods: 提出小样本位姿估计方法 SA6D:以自适应分割模块识别新的目标物体,并仅用少量杂乱的参考图像构建目标物体的点云模型,无需以物体为中心的参考图像或额外的物体信息。
  • results: 在真实桌面物体数据集上,SA6D 优于现有方法,在遮挡严重的杂乱场景中尤为明显,且所需参考图像更少。
    Abstract To enable meaningful robotic manipulation of objects in the real-world, 6D pose estimation is one of the critical aspects. Most existing approaches have difficulties to extend predictions to scenarios where novel object instances are continuously introduced, especially with heavy occlusions. In this work, we propose a few-shot pose estimation (FSPE) approach called SA6D, which uses a self-adaptive segmentation module to identify the novel target object and construct a point cloud model of the target object using only a small number of cluttered reference images. Unlike existing methods, SA6D does not require object-centric reference images or any additional object information, making it a more generalizable and scalable solution across categories. We evaluate SA6D on real-world tabletop object datasets and demonstrate that SA6D outperforms existing FSPE methods, particularly in cluttered scenes with occlusions, while requiring fewer reference images.
    摘要 要让机器人在真实世界中对物体进行有意义的操作,6D 位姿估计是关键环节之一。现有方法大多难以将预测扩展到不断引入新物体实例的场景,尤其是存在严重遮挡时。为此,我们提出小样本位姿估计(FSPE)方法 SA6D:利用自适应分割模块识别新的目标物体,并仅凭少量杂乱的参考图像构建目标物体的点云模型。与现有方法不同,SA6D 不需要以物体为中心的参考图像或任何额外的物体信息,因而在类别间更具通用性与可扩展性。我们在真实桌面物体数据集上评估 SA6D,结果表明其在遮挡严重的杂乱场景中优于现有 FSPE 方法,且所需参考图像更少。

Unsupervised Recognition of Unknown Objects for Open-World Object Detection

  • paper_url: http://arxiv.org/abs/2308.16527
  • repo_url: https://github.com/frh23333/mepu-owod
  • paper_authors: Ruohuan Fang, Guansong Pang, Lei Zhou, Xiao Bai, Jin Zheng
  • for: 该论文旨在解决开放世界物体检测(OWOD)问题,即需要一个检测模型能够检测已知和未知物体,并且可以逐渐学习新引入的知识。
  • methods: 该论文提出一种新方法:学习无监督判别模型,从无监督区域提议方法生成的原始伪标签中识别真正的未知物体;该模型还可通过无类别自训练迭代地将伪未知物体扩展到未标注区域,以消除标签偏置。
  • results: 实验结果表明,该方法可以在 MS COCO 数据集上对未知对象进行更好的检测,同时保持已知对象类别的检测性能。此外,该方法在 LVIS 和 Objects365 数据集上也有更好的泛化能力。
    Abstract Open-World Object Detection (OWOD) extends object detection problem to a realistic and dynamic scenario, where a detection model is required to be capable of detecting both known and unknown objects and incrementally learning newly introduced knowledge. Current OWOD models, such as ORE and OW-DETR, focus on pseudo-labeling regions with high objectness scores as unknowns, whose performance relies heavily on the supervision of known objects. While they can detect the unknowns that exhibit similar features to the known objects, they suffer from a severe label bias problem that they tend to detect all regions (including unknown object regions) that are dissimilar to the known objects as part of the background. To eliminate the label bias, this paper proposes a novel approach that learns an unsupervised discriminative model to recognize true unknown objects from raw pseudo labels generated by unsupervised region proposal methods. The resulting model can be further refined by a classification-free self-training method which iteratively extends pseudo unknown objects to the unlabeled regions. Experimental results show that our method 1) significantly outperforms the prior SOTA in detecting unknown objects while maintaining competitive performance of detecting known object classes on the MS COCO dataset, and 2) achieves better generalization ability on the LVIS and Objects365 datasets.
    摘要 开放世界物体检测(OWOD)将物体检测扩展到更真实、动态的场景:检测模型须能同时检测已知与未知物体,并能增量学习新引入的知识。现有 OWOD 模型(如 ORE 与 OW-DETR)把物体性得分高的区域伪标注为未知物体,其性能严重依赖已知物体的监督;它们虽能检测与已知物体特征相似的未知物体,却存在严重的标签偏置问题:倾向于把与已知物体不相似的所有区域(包括未知物体区域)判为背景。为消除这种标签偏置,本文提出学习一个无监督判别模型,从无监督区域提议方法生成的原始伪标签中识别真正的未知物体;该模型还可通过无类别自训练迭代地将伪未知物体扩展到未标注区域。实验结果表明,该方法 1)在 MS COCO 数据集上检测未知物体的能力显著优于此前最佳方法,同时保持对已知类别的竞争性能;2)在 LVIS 与 Objects365 数据集上具有更好的泛化能力。

MS23D: A 3D Object Detection Method Using Multi-Scale Semantic Feature Points to Construct 3D Feature Layers

  • paper_url: http://arxiv.org/abs/2308.16518
  • repo_url: None
  • paper_authors: Yongxin Shao, Aihong Tan, Tianhong Yan, Zhetao Sun
  • for: 本研究旨在提出一种结合小尺寸体素与大尺寸体素的两阶段 3D 检测方法,以提高 3D 特征层的效率与精度。
  • methods: 使用小尺寸体素提取细粒度局部特征、大尺寸体素捕获长距离局部特征;提出基于多尺度语义特征点的 3D 特征层构建方法,将稀疏的 3D 特征层转化为更紧凑的表示;并计算特征点相对物体质心的偏移,使其尽可能靠近物体中心,以提升特征聚合效率(示意代码见本条目末尾)。
  • results: 我们在 KITTI 数据集与 ONCE 数据集上共同评估了方法,验证了其在 3D 检测上的有效性。
    Abstract Lidar point clouds, as a type of data with accurate distance perception, can effectively represent the motion and posture of objects in three-dimensional space. However, the sparsity and disorderliness of point clouds make it challenging to extract features directly from them. Many studies have addressed this issue by transforming point clouds into regular voxel representations. However, these methods often lead to the loss of fine-grained local feature information due to downsampling. Moreover, the sparsity of point clouds poses difficulties in efficiently aggregating features in 3D feature layers using voxel-based two-stage methods. To address these issues, this paper proposes a two-stage 3D detection framework called MS$^{2}$3D. In MS$^{2}$3D, we utilize small-sized voxels to extract fine-grained local features and large-sized voxels to capture long-range local features. Additionally, we propose a method for constructing 3D feature layers using multi-scale semantic feature points, enabling the transformation of sparse 3D feature layers into more compact representations. Furthermore, we compute the offset between feature points in the 3D feature layers and the centroid of objects, aiming to bring them as close as possible to the object's center. It significantly enhances the efficiency of feature aggregation. To validate the effectiveness of our method, we evaluated our method on the KITTI dataset and ONCE dataset together.
    摘要 激光雷达点云作为具有精确距离感知的数据,能有效表示物体在三维空间中的运动与姿态;但点云的稀疏性与无序性使得直接从中提取特征颇具挑战。许多研究将点云转换为规则的体素表示来解决这一问题,但这类方法常因下采样而丢失细粒度的局部特征信息;此外,点云的稀疏性也使基于体素的两阶段方法难以在 3D 特征层中高效聚合特征。针对这些问题,本文提出两阶段 3D 检测框架 MS$^{2}$3D:利用小尺寸体素提取细粒度局部特征,利用大尺寸体素捕获长距离局部特征;并提出基于多尺度语义特征点的 3D 特征层构建方法,将稀疏的 3D 特征层转化为更紧凑的表示。此外,我们计算 3D 特征层中特征点与物体质心之间的偏移,使其尽可能接近物体中心,显著提升了特征聚合效率。我们在 KITTI 数据集与 ONCE 数据集上共同评估,验证了方法的有效性。
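A small sketch of the centroid-offset idea: each semantic feature point predicts an offset pulling it toward the center of its object, compacting the sparse feature layer before aggregation. Network details and the actual loss of MS$^{2}$3D are omitted.

```python
import torch
import torch.nn as nn

class OffsetHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (N, D) feature points; coords: (N, 3) their positions
        return coords + self.mlp(feats)               # positions shifted toward centroids

def offset_loss(shifted, centroids):
    # Supervise shifted points to land on the centroid of their object.
    return (shifted - centroids).abs().sum(dim=1).mean()

head = OffsetHead(dim=64)
shifted = head(torch.randn(500, 64), torch.rand(500, 3))
loss = offset_loss(shifted, torch.rand(500, 3))       # dummy centroid targets
loss.backward()
```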

MVDream: Multi-view Diffusion for 3D Generation

  • paper_url: http://arxiv.org/abs/2308.16512
  • repo_url: None
  • paper_authors: Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang
  • for: 从给定文本提示生成几何一致的多视图图像。
  • methods: 利用在大规模网络数据上预训练的图像扩散模型与由 3D 资产渲染的多视图数据集训练多视图扩散模型,兼得 2D 扩散的泛化性与 3D 数据的一致性。
  • results: 该模型可经 Score Distillation Sampling 作为 3D 生成的多视图先验,显著提升现有 2D 提升方法的稳定性;并可在少样本设定下微调,用于保持主体一致性的个性化 3D 生成(DreamBooth3D)。
    Abstract We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.
    摘要 我们提出 MVDream,一种能依据给定文本提示生成几何一致的多视图图像的多视图扩散模型。通过同时利用在大规模网络数据集上预训练的图像扩散模型与由 3D 资产渲染的多视图数据集,所得多视图扩散模型兼具 2D 扩散的泛化性与 3D 数据的一致性。因此,该模型可经 Score Distillation Sampling 作为 3D 生成的多视图先验:通过解决 3D 一致性问题,大幅提升现有 2D 提升方法的稳定性。最后,我们展示该多视图扩散模型还可在少样本设定下微调,用于个性化 3D 生成(即 DreamBooth3D 应用),在学习主体身份后仍能保持一致性。

Robust GAN inversion

  • paper_url: http://arxiv.org/abs/2308.16510
  • repo_url: None
  • paper_authors: Egor Sevriugov, Ivan Oseledets
  • for: 提升生成对抗网络(GAN)隐空间中图像反演的重建精度与可编辑性。
  • methods: 提出一种在原生隐空间 $W$ 中工作的方法,通过微调生成器网络恢复缺失的图像细节;并引入新的正则化策略,其可学习系数由训练随机化的 StyleGAN 2 模型(WRanGAN)获得(示意代码见本条目末尾)。
  • results: 相比传统方法,该方法在重建质量与计算效率上均更优,以少 4 倍的参数实现最低失真;在构建二值图像属性对应超平面时质量也略有提升。在 Flickr-Faces-HQ 与 LSUN Church 两个复杂数据集上验证了有效性。
    Abstract Recent advancements in real image editing have been attributed to the exploration of Generative Adversarial Networks (GANs) latent space. However, the main challenge of this procedure is GAN inversion, which aims to map the image to the latent space accurately. Existing methods that work on extended latent space $W+$ are unable to achieve low distortion and high editability simultaneously. To address this issue, we propose an approach which works in native latent space $W$ and tunes the generator network to restore missing image details. We introduce a novel regularization strategy with learnable coefficients obtained by training randomized StyleGAN 2 model - WRanGAN. This method outperforms traditional approaches in terms of reconstruction quality and computational efficiency, achieving the lowest distortion with 4 times fewer parameters. Furthermore, we observe a slight improvement in the quality of constructing hyperplanes corresponding to binary image attributes. We demonstrate the effectiveness of our approach on two complex datasets: Flickr-Faces-HQ and LSUN Church.
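The following sketch shows the general pattern of inversion in the native latent space with generator tuning and a weighted parameter-deviation regularizer. The coefficients `c` are placeholders for the learnable weights the paper obtains via WRanGAN; everything here is an illustrative assumption, not the authors' code.

```python
import torch

def invert(G, target, w, c, steps=200, lr=0.01):
    theta0 = [p.detach().clone() for p in G.parameters()]  # pretrained weights
    opt = torch.optim.Adam(list(G.parameters()) + [w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((G(w) - target) ** 2)  # reconstruction term
        # Penalize deviation of the tuned generator from the pretrained one,
        # weighted per parameter group by coefficients c (stand-ins for the
        # learnable weights obtained via WRanGAN).
        for ci, p, p0 in zip(c, G.parameters(), theta0):
            loss = loss + ci * torch.mean((p - p0) ** 2)
        loss.backward()
        opt.step()
    return w, G

G = torch.nn.Linear(8, 16)                 # toy stand-in for a generator
w = torch.zeros(1, 8, requires_grad=True)  # code in native latent space W
c = [0.1 for _ in G.parameters()]
invert(G, torch.randn(1, 16), w, c, steps=10)
```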

Illumination Distillation Framework for Nighttime Person Re-Identification and A New Benchmark

  • paper_url: http://arxiv.org/abs/2308.16486
  • repo_url: https://github.com/alexadlu/idf
  • paper_authors: Andong Lu, Zhang Zhang, Yan Huang, Yifan Zhang, Chenglong Li, Jin Tang, Liang Wang
  • for: Nighttime person re-identification (Re-ID), a very important and challenging task in visual surveillance that has not been thoroughly investigated.
  • methods: Proposes an Illumination Distillation Framework (IDF) to address the low-illumination challenge in nighttime person Re-ID; IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module.
  • results: Experiments show that IDF achieves state-of-the-art performance on two nighttime person Re-ID datasets; the code and the new Night600 dataset will be released on GitHub.
    Abstract Nighttime person Re-ID (person re-identification in the nighttime) is a very important and challenging task for visual surveillance but it has not been thoroughly investigated. Under the low illumination condition, the performance of person Re-ID methods usually sharply deteriorates. To address the low illumination challenge in nighttime person Re-ID, this paper proposes an Illumination Distillation Framework (IDF), which utilizes illumination enhancement and illumination distillation schemes to promote the learning of Re-ID models. Specifically, IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module. The master branch is used to extract the features from a nighttime image. The illumination enhancement branch first estimates an enhanced image from the nighttime image using a nonlinear curve mapping method and then extracts the enhanced features. However, nighttime and enhanced features usually contain data noise due to unstable lighting conditions and enhancement failures. To fully exploit the complementary benefits of nighttime and enhanced features while suppressing data noise, we propose an illumination distillation module. In particular, the illumination distillation module fuses the features from two branches through a bottleneck fusion model and then uses the fused features to guide the learning of both branches in a distillation manner. In addition, we build a real-world nighttime person Re-ID dataset, named Night600, which contains 600 identities captured from different viewpoints and nighttime illumination conditions under complex outdoor environments. Experimental results demonstrate that our IDF can achieve state-of-the-art performance on two nighttime person Re-ID datasets (i.e., Night600 and Knight). We will release our code and dataset at https://github.com/Alexadlu/IDF.
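A hedged sketch of the bottleneck fusion plus distillation pattern the abstract describes: nighttime and enhanced features are fused through a bottleneck, and the fused features act as a teacher for both branches. Layer sizes and the exact distillation loss are assumptions.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class BottleneckFusion(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, dim))

    def forward(self, f_night, f_enh):
        # Concatenate the two branch features and squeeze through a bottleneck.
        return self.fuse(torch.cat([f_night, f_enh], dim=-1))

def distill_loss(f_night, f_enh, fused):
    # The fused features act as the teacher for both branches (assumed form).
    t = fused.detach()
    return F.mse_loss(f_night, t) + F.mse_loss(f_enh, t)

f_n, f_e = torch.randn(4, 256), torch.randn(4, 256)
fused = BottleneckFusion(256)(f_n, f_e)
print(distill_loss(f_n, f_e, fused).item())
```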

PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.16477
  • repo_url: None
  • paper_authors: Wenjie Ding, Limeng Qiao, Xi Qiu, Chi Zhang
  • for: Online construction of vectorized high-definition maps for autonomous driving, aiming at precise and complete map element learning.
  • methods: Proposes a simple yet effective architecture named PivotNet, which adopts unified pivot-based map representations and is formulated as a direct set prediction paradigm.
  • results: Extensive experiments and ablations show that PivotNet is superior to other SOTAs by at least 5.9 mAP.
    Abstract Vectorized high-definition map online construction has garnered considerable attention in the field of autonomous driving research. Most existing approaches model changeable map elements using a fixed number of points, or predict local maps in a two-stage autoregressive manner, which may miss essential details and lead to error accumulation. Towards precise map element learning, we propose a simple yet effective architecture named PivotNet, which adopts unified pivot-based map representations and is formulated as a direct set prediction paradigm. Concretely, we first propose a novel Point-to-Line Mask module to encode both the subordinate and geometrical point-line priors in the network. Then, a well-designed Pivot Dynamic Matching module is proposed to model the topology in dynamic point sequences by introducing the concept of sequence matching. Furthermore, to supervise the position and topology of the vectorized point predictions, we propose a Dynamic Vectorized Sequence loss. Extensive experiments and ablations show that PivotNet is remarkably superior to other SOTAs by 5.9 mAP at least. The code will be available soon.
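Direct set prediction requires a one-to-one assignment between predicted and ground-truth map elements before the losses can be computed. The sketch below uses standard Hungarian matching on an L2 point-sequence cost as a simplified stand-in for the paper's Pivot Dynamic Matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_elements(pred, gt):
    """pred: (P, K, 2) predicted point sequences; gt: (G, K, 2) ground truth.
    Returns matched (pred, gt) index pairs and the total matching cost."""
    cost = np.linalg.norm(pred[:, None] - gt[None, :], axis=-1).mean(-1)  # (P, G)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost[rows, cols].sum()

pred = np.random.rand(3, 10, 2)   # 3 predicted elements, 10 points each
gt = np.random.rand(2, 10, 2)     # 2 ground-truth elements
pairs, total = match_elements(pred, gt)
print(pairs, round(float(total), 3))
```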

Self-Sampling Meta SAM: Enhancing Few-shot Medical Image Segmentation with Meta-Learning

  • paper_url: http://arxiv.org/abs/2308.16466
  • repo_url: None
  • paper_authors: Yiming Zhang, Tianang Leng, Kun Han, Xiaohui Xie
  • for: Few-shot medical image segmentation via a Self-Sampling Meta SAM (SSM-SAM) framework that enables rapid online adaptation.
  • methods: Uses an online fast gradient descent optimizer, further optimized by a meta-learner, to ensure swift and robust adaptation to new tasks, together with a Self-Sampling module and an attention-based decoder specifically designed for medical few-shot learning.
  • results: Achieves state-of-the-art improvements on a popular abdominal CT dataset and an MRI dataset, with average DSC gains of 10.21% and 1.80% respectively, and adapts to a new organ in 0.83 minutes; the code will be released on GitHub.
    Abstract While the Segment Anything Model (SAM) excels in semantic segmentation for general-purpose images, its performance significantly deteriorates when applied to medical images, primarily attributable to insufficient representation of medical images in its training dataset. Nonetheless, gathering comprehensive datasets and training models that are universally applicable is particularly challenging due to the long-tail problem common in medical images. To address this gap, here we present a Self-Sampling Meta SAM (SSM-SAM) framework for few-shot medical image segmentation. Our innovation lies in the design of three key modules: 1) An online fast gradient descent optimizer, further optimized by a meta-learner, which ensures swift and robust adaptation to new tasks. 2) A Self-Sampling module designed to provide well-aligned visual prompts for improved attention allocation; and 3) A robust attention-based decoder specifically designed for medical few-shot learning to capture relationships between different slices. Extensive experiments on a popular abdominal CT dataset and an MRI dataset demonstrate that the proposed method achieves significant improvements over state-of-the-art methods in few-shot segmentation, with average improvements of 10.21% and 1.80% in terms of DSC, respectively. In conclusion, we present a novel approach for rapid online adaptation in interactive image segmentation, adapting to a new organ in just 0.83 minutes. Code is publicly available on GitHub upon acceptance.
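A minimal sketch of an online fast-adaptation inner loop whose per-step learning rates are meta-learned, in the MAML spirit suggested by the abstract. It uses `torch.func.functional_call` (PyTorch 2.x); all names and the loop structure are illustrative assumptions.

```python
import torch

def fast_adapt(model, loss_fn, support_x, support_y, meta_lrs):
    # Start from the model's current parameters.
    params = dict(model.named_parameters())
    for lr in meta_lrs:  # a few fast gradient steps with meta-learned step sizes
        out = torch.func.functional_call(model, params, (support_x,))
        grads = torch.autograd.grad(loss_fn(out, support_y),
                                    list(params.values()), create_graph=True)
        params = {n: p - lr * g for (n, p), g in zip(params.items(), grads)}
    return params  # adapted weights for the new task

net = torch.nn.Linear(4, 1)
lrs = [torch.tensor(0.1, requires_grad=True) for _ in range(3)]  # meta-learned
x, y = torch.randn(8, 4), torch.randn(8, 1)
adapted = fast_adapt(net, torch.nn.functional.mse_loss, x, y, lrs)
print({n: v.shape for n, v in adapted.items()})
```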

Domain Adaptive Synapse Detection with Weak Point Annotations

  • paper_url: http://arxiv.org/abs/2308.16461
  • repo_url: None
  • paper_authors: Qi Chen, Wei Huang, Yueyi Zhang, Zhiwei Xiong
  • for: This paper is written for detecting synapses from electron microscopy (EM) images using a two-stage segmentation-based framework with weak point annotations.
  • methods: The paper uses a segmentation-based pipeline to obtain synaptic instance masks in the first stage, and regenerates square masks to get high-quality pseudo labels in the second stage. The method also utilizes the distance nearest principle to match paired pre-synapses and post-synapses.
  • results: The paper ranks 1st place in the WASPSYN challenge at ISBI 2023 with high-accuracy detection results.
    Abstract The development of learning-based methods has greatly improved the detection of synapses from electron microscopy (EM) images. However, training a model for each dataset is time-consuming and requires extensive annotations. Additionally, it is difficult to apply a learned model to data from different brain regions due to variations in data distributions. In this paper, we present AdaSyn, a two-stage segmentation-based framework for domain adaptive synapse detection with weak point annotations. In the first stage, we address the detection problem by utilizing a segmentation-based pipeline to obtain synaptic instance masks. In the second stage, we improve model generalizability on target data by regenerating square masks to get high-quality pseudo labels. Benefiting from our high-accuracy detection results, we introduce the distance nearest principle to match paired pre-synapses and post-synapses. In the WASPSYN challenge at ISBI 2023, our method ranks the 1st place.
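The distance-nearest principle for pairing detections can be illustrated with a KD-tree nearest-neighbor query; the distance threshold below is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_synapses(pre_xyz, post_xyz, max_dist=50.0):
    """Pair each detected post-synapse with its nearest pre-synapse candidate.
    Coordinates are in voxel space; max_dist is an assumed cutoff."""
    tree = cKDTree(pre_xyz)
    dists, idx = tree.query(post_xyz)
    keep = dists <= max_dist  # drop pairs that are implausibly far apart
    return [(int(i), j) for j, (i, k) in enumerate(zip(idx, keep)) if k]

pre = np.random.rand(20, 3) * 100
post = np.random.rand(15, 3) * 100
print(match_synapses(pre, post)[:5])  # (pre_index, post_index) pairs
```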

Improving Lens Flare Removal with General Purpose Pipeline and Multiple Light Sources Recovery

  • paper_url: http://arxiv.org/abs/2308.16460
  • repo_url: https://github.com/yuyanzhou1/improving-lens-flare-removal
  • paper_authors: Yuyan Zhou, Dong Liang, Songcan Chen, Sheng-Jun Huang, Shuo Yang, Chongyi Li
  • for: Improving the performance of lens flare removal and extending it to more general scenarios.
  • methods: Revisits the image signal processing pipeline to remodel automatic exposure and tone mapping in the synthesis pipeline, and designs a more reliable strategy for recovering multiple light sources based on convex averaging.
  • results: Experiments show the solution effectively improves lens flare removal performance and pushes its application toward more complex scenes; a new flare removal testing dataset captured by ten types of consumer electronics is also contributed.
    Abstract When taking images against strong light sources, the resulting images often contain heterogeneous flare artifacts. These artifacts can significantly affect image visual quality and downstream computer vision tasks. While collecting real data pairs of flare-corrupted/flare-free images for training flare removal models is challenging, current methods utilize the direct-add approach to synthesize data. However, these methods do not consider automatic exposure and tone mapping in the image signal processing pipeline (ISP), leading to the limited generalization capability of deep models trained using such data. Besides, existing methods struggle to handle multiple light sources due to the different sizes, shapes and illuminance of various light sources. In this paper, we propose a solution to improve the performance of lens flare removal by revisiting the ISP, remodeling the principle of automatic exposure in the synthesis pipeline, and designing a more reliable light sources recovery strategy. The new pipeline approaches realistic imaging by discriminating the local and global illumination through convex combination, avoiding global illumination shifting and local over-saturation. Our strategy for recovering multiple light sources convexly averages the input and output of the neural network based on illuminance levels, thereby avoiding the need for a hard threshold in identifying light sources. We also contribute a new flare removal testing dataset containing the flare-corrupted images captured by ten types of consumer electronics. The dataset facilitates the verification of the generalization capability of flare removal methods. Extensive experiments show that our solution can effectively improve the performance of lens flare removal and push the frontier toward more general situations.
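A sketch of the two convex combinations mentioned above: blending local and global illumination during synthesis, and averaging the network's input and output per pixel by illuminance when recovering multiple light sources. The specific weighting function is an assumption.

```python
import numpy as np

def blend_illumination(local, global_, alpha):
    """Convex blend with alpha in [0, 1], which avoids global illumination
    shifting and local over-saturation by construction."""
    return alpha * local + (1.0 - alpha) * global_

def recover_sources(inp, out, eps=1e-6):
    # Per-pixel luminance acts as a soft weight instead of a hard threshold
    # (normalization scheme is an assumption).
    lum = inp.mean(axis=-1, keepdims=True)
    w = (lum - lum.min()) / (lum.max() - lum.min() + eps)
    return w * inp + (1.0 - w) * out

img_in = np.random.rand(32, 32, 3)   # flare-corrupted input
img_out = np.random.rand(32, 32, 3)  # network output
print(blend_illumination(img_in, img_out, 0.7).shape)
print(recover_sources(img_in, img_out).shape)
```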

Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff

  • paper_url: http://arxiv.org/abs/2308.16454
  • repo_url: None
  • paper_authors: Satoshi Suzuki, Shin’ya Yamaguchi, Shoichiro Takeda, Sekitoshi Kanai, Naoki Makishima, Atsushi Ando, Ryo Masumura
  • for: The tradeoff in deep neural networks (DNNs) between standard accuracy on clean examples and robustness against adversarial examples: although adversarial training (AT) improves robustness, it degrades standard accuracy.
  • methods: Proposes a novel AT method called ARREST comprising three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples, initializing its parameters from a DNN standardly pretrained on clean examples; RGKD and NR respectively provide a regularization term and an algorithm that preserve latent representations of clean examples during AFT.
  • results: Experimental results show that ARREST mitigates the accuracy-robustness tradeoff more effectively than previous AT-based methods.
    Abstract This paper addresses the tradeoff between standard accuracy on clean examples and robustness against adversarial examples in deep neural networks (DNNs). Although adversarial training (AT) improves robustness, it degrades the standard accuracy, thus yielding the tradeoff. To mitigate this tradeoff, we propose a novel AT method called ARREST, which comprises three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples by initializing its parameters with a DNN that is standardly pretrained on clean examples. RGKD and NR respectively entail a regularization term and an algorithm to preserve latent representations of clean examples during AFT. RGKD penalizes the distance between the representations of the standardly pretrained and AFT DNNs. NR switches input adversarial examples to nonadversarial ones when the representation changes significantly during AFT. By combining these components, ARREST achieves both high standard accuracy and robustness. Experimental results demonstrate that ARREST mitigates the tradeoff more effectively than previous AT-based methods do.
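A hedged sketch of the RGKD and NR components as the abstract describes them: RGKD penalizes the distance between the finetuned and pretrained representations of clean inputs, and NR switches an input back to its nonadversarial version when the representation drifts beyond a threshold. The loss form and threshold are assumptions.

```python
import torch, torch.nn.functional as F

def rgkd_loss(feat_aft, feat_pretrained, lam=1.0):
    # Keep the finetuned model's clean-example representations close to
    # those of the frozen standardly pretrained model.
    return lam * F.mse_loss(feat_aft, feat_pretrained.detach())

def noisy_replay_mask(feat_aft, feat_pretrained, tau=1.0):
    # NR: replay the nonadversarial example where the representation has
    # drifted too far (threshold tau is an assumption).
    drift = (feat_aft - feat_pretrained).norm(dim=-1)
    return drift > tau  # True -> switch this input back to its clean version

f_aft, f_pre = torch.randn(8, 128), torch.randn(8, 128)
print(rgkd_loss(f_aft, f_pre).item(),
      noisy_replay_mask(f_aft, f_pre).sum().item())
```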

Njobvu-AI: An open-source tool for collaborative image labeling and implementation of computer vision models

  • paper_url: http://arxiv.org/abs/2308.16435
  • repo_url: https://github.com/sullichrosu/njobvu-ai
  • paper_authors: Jonathan S. Koning, Ashwin Subramanian, Mazen Alotaibi, Cara L. Appel, Christopher M. Sullivan, Thon Chao, Lisa Truong, Robyn L. Tanguay, Pankaj Jaiswal, Taal Levi, Damon B. Lesmeister
  • for: Providing a free, open-source, user-friendly computer vision tool to help researchers across many fields develop and apply computer vision models.
  • methods: The tool is built with Node.js and runs on both desktop and server hardware, offering data labeling, multi-user collaboration, custom model training, and the application of new models.
  • results: The tool lets researchers quickly create and apply their own computer vision models, supporting multiple data types and annotation workflows for a wide range of applications.
    Abstract Practitioners interested in using computer vision models lack user-friendly and open-source software that combines features to label training data, allow multiple users, train new algorithms, review output, and implement new models. Labeling training data, such as images, is a key step to developing accurate object detection algorithms using computer vision. This step is often not compatible with many cloud-based services for marking or labeling image and video data due to limited internet bandwidth in many regions of the world. Desktop tools are useful for groups working in remote locations, but users often do not have the capability to combine projects developed locally by multiple collaborators. Furthermore, many tools offer features for labeling data or using pre-trained models for classification, but few allow researchers to combine these steps to create and apply custom models. Free, open-source, and user-friendly software that offers a full suite of features (e.g., ability to work locally and online, and train custom models) is desirable to field researchers and conservationists that may have limited coding skills. We developed Njobvu-AI, a free, open-source tool that can be run on both desktop and server hardware using Node.js, allowing users to label data, combine projects for collaboration and review, train custom algorithms, and implement new computer vision models. The name Njobvu-AI (pronounced N-joh-voo AI), incorporating the Chichewa word for elephant, is inspired by a wildlife monitoring program in Malawi that was a primary impetus for the development of this tool and references similarities between the powerful memory of elephants and properties of computer vision models.

Deformation Robust Text Spotting with Geometric Prior

  • paper_url: http://arxiv.org/abs/2308.16404
  • repo_url: None
  • paper_authors: Xixuan Hao, Aozhong Zhang, Xianze Meng, Bin Fu
  • for: End-to-end text spotting (detection and recognition), addressing font diversity and shape deformation of characters.
  • methods: Presents the ARText dataset of 33,000 artistic images and a deformation-robust text spotting method (DR TextSpotter) built on a geometric prior module and a graph convolution network for semantic reasoning.
  • results: Experiments on the ARText and IC19-ReCTS datasets demonstrate the effectiveness of the method.
    Abstract The goal of text spotting is to perform text detection and recognition in an end-to-end manner. Although the diversity of luminosity and orientation in scene texts has been widely studied, the font diversity and shape variance of the same character are ignored in recent works, since most characters in natural images are rendered in standard fonts. To solve this problem, we present a Chinese Artistic Dataset, termed as ARText, which contains 33,000 artistic images with rich shape deformation and font diversity. Based on this database, we develop a deformation robust text spotting method (DR TextSpotter) to solve the recognition problem of complex deformation of characters in different fonts. Specifically, we propose a geometric prior module to highlight the important features based on the unsupervised landmark detection sub-network. A graph convolution network is further constructed to fuse the character features and landmark features, and then performs semantic reasoning to enhance the discrimination for different characters. The experiments are conducted on ARText and IC19-ReCTS datasets. Our results demonstrate the effectiveness of our proposed method.

RGB-T Tracking via Multi-Modal Mutual Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.16386
  • repo_url: https://github.com/husteryoung/mplt
  • paper_authors: Yang Luo, Xiqing Guo, Hui Feng, Lei Ao
  • for: Achieving a more comprehensive fusion of visible and thermal modalities in RGB-T tracking at lower computational cost.
  • methods: A multi-modal tracking architecture based on mutual prompt learning between the two modalities, with a lightweight prompter that incorporates attention mechanisms in two dimensions to transfer cross-modal information cheaply.
  • results: Experiments show the proposed tracker is effective and efficient, achieving state-of-the-art performance while maintaining high running speed.
    Abstract Object tracking based on the fusion of visible and thermal images, known as RGB-T tracking, has gained increasing attention from researchers in recent years. How to achieve a more comprehensive fusion of information from the two modalities with fewer computational costs has been a problem that researchers have been exploring. Recently, with the rise of prompt learning in computer vision, we can better transfer knowledge from visual large models to downstream tasks. Considering the strong complementarity between visible and thermal modalities, we propose a tracking architecture based on mutual prompt learning between the two modalities. We also design a lightweight prompter that incorporates attention mechanisms in two dimensions to transfer information from one modality to the other with lower computational costs, embedding it into each layer of the backbone. Extensive experiments have demonstrated that our proposed tracking architecture is effective and efficient, achieving state-of-the-art performance while maintaining high running speeds.

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

  • paper_url: http://arxiv.org/abs/2308.16383
  • repo_url: None
  • paper_authors: Chengyang Fang, Jiangnan Li, Liang Li, Can Ma, Dayong Hu
  • for: This paper focuses on the task of text-based visual question answering (TextVQA), which involves answering questions about the text in images.
  • methods: The proposed method, called Separate and Locate (SaL), explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, it uses a Text Semantic Separate (TSS) module to recognize semantic contextual relations between words, and a Spatial Circle Position (SCP) module to better construct and reason about spatial position relationships.
  • results: The SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively, and achieves 2.68% and 2.52% accuracy improvements over the pre-training state-of-the-art method without any pre-training tasks.
    Abstract Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language ``sentence''. However, they ignore the fact that most OCR words in the TextVQA task do not have a semantical contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on TextVQA and ST-VQA datasets. Compared with the pre-training state-of-the-art method pre-trained on 64 million pre-training samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on TextVQA and ST-VQA. Our code and models will be released at https://github.com/fangbufang/SaL.

3D vision-based structural masonry damage detection

  • paper_url: http://arxiv.org/abs/2308.16380
  • repo_url: None
  • paper_authors: Elmira Faraji Zonouz, Xiao Pan, Yu-Cheng Hsu, Tony Yang
  • for: Monitoring damage in masonry structures is essential to avoid potentially disastrous outcomes, but manual inspection is time-consuming and can be hazardous to inspectors.
  • methods: Automates the inspection process with computer vision and machine learning: images of masonry specimens are collected to generate a 3D point cloud, and 3D point cloud processing methods are developed to evaluate damage. Unlike most existing 2D vision-based methods, which are limited to qualitative damage classification, 2D localization, and in-plane quantification, this 3D approach offers a wider field of view and works in complex environments.
  • results: Experiments on structural masonry components show the method can effectively classify damage states and localize and quantify critical damage features, improving the level of autonomy in masonry inspection.
    Abstract The detection of masonry damage is essential for preventing potentially disastrous outcomes. Manual inspection can, however, take a long time and be hazardous to human inspectors. Automation of the inspection process using novel computer vision and machine learning algorithms can be a more efficient and safe solution to prevent further deterioration of the masonry structures. Most existing 2D vision-based methods are limited to qualitative damage classification, 2D localization, and in-plane quantification. In this study, we present a 3D vision-based methodology for accurate masonry damage detection, which offers a more robust solution with a greater field of view, depth of vision, and the ability to detect failures in complex environments. First, images of the masonry specimens are collected to generate a 3D point cloud. Second, 3D point clouds processing methods are developed to evaluate the masonry damage. We demonstrate the effectiveness of our approach through experiments on structural masonry components. Our experiments showed the proposed system can effectively classify damage states and localize and quantify critical damage features. The result showed the proposed method can improve the level of autonomy during the inspection of masonry structures.

Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

  • paper_url: http://arxiv.org/abs/2308.16376
  • repo_url: None
  • paper_authors: Lei Bai, Dongang Wang, Michael Barnett, Mariano Cabezas, Weidong Cai, Fernando Calamante, Kain Kyle, Dongnan Liu, Linda Ly, Aria Nguyen, Chun-Chien Shieh, Ryan Sullivan, Hengrui Wang, Geng Zhan, Wanli Ouyang, Chenyu Wang
  • for: Measuring the evolution of Multiple Sclerosis (MS) lesions with MRI across multiple clinical sites, to inform understanding of disease progression and guide therapeutic strategy.
  • methods: A federated learning framework that accounts for label noise. A Decoupled Hard Label Correction (DHLC) strategy considers the imbalanced distribution and fuzzy boundaries of MS lesions to correct false annotations based on prediction confidence, while a Centrally Enhanced Label Correction (CELC) strategy leverages the aggregated central model as a correction teacher for all sites, improving the reliability of the correction process.
  • results: Extensive experiments on two multi-site datasets demonstrate the effectiveness and robustness of the proposed methods, indicating their potential for multi-site clinical collaborations.
    Abstract Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not address the heterogeneous need for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations.
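A toy illustration of confidence-based label correction in the spirit of DHLC: a possibly noisy voxel annotation is flipped when the model disagrees with high confidence. The thresholds are illustrative assumptions; the paper's decoupled strategy additionally accounts for class imbalance and fuzzy lesion boundaries.

```python
import numpy as np

def correct_labels(probs, labels, flip_conf=0.9):
    """probs: (N,) predicted lesion probability per voxel;
    labels: (N,) 0/1 annotations, possibly noisy."""
    corrected = labels.copy()
    corrected[(labels == 0) & (probs > flip_conf)] = 1       # likely missed lesion
    corrected[(labels == 1) & (probs < 1 - flip_conf)] = 0   # likely false label
    return corrected

p = np.array([0.95, 0.05, 0.60, 0.02])
y = np.array([0, 1, 1, 0])
print(correct_labels(p, y))  # -> [1 0 1 0]
```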

cs.AI - 2023-08-31

PointLLM: Empowering Large Language Models to Understand Point Clouds

  • paper_url: http://arxiv.org/abs/2308.16911
  • repo_url: https://github.com/openrobotlab/pointllm
  • paper_authors: Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
  • for: Extending large language models (LLMs) beyond 2D visual data to 3D understanding of point clouds.
  • methods: PointLLM, a preliminary effort to couple point clouds with LLMs: a point cloud encoder is combined with a powerful LLM to effectively fuse geometric, appearance, and linguistic information, so the model can process colored object point clouds with human instructions and generate contextually appropriate responses.
  • results: Experiments show that PointLLM outperforms existing 2D baselines and, in human-evaluated object captioning, surpasses human annotators on over 50% of the samples.
    Abstract The unprecedented advancements in Large Language Models (LLMs) have created a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, thereby enabling LLMs to understand point clouds and offering a new avenue beyond 2D visual data. PointLLM processes colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: initially aligning latent spaces and subsequently instruction-tuning the unified model. To rigorously evaluate our model's perceptual abilities and its generalization capabilities, we establish two benchmarks: Generative 3D Object Classification and 3D Object Captioning, assessed through three different methods, including human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experiment results show that PointLLM demonstrates superior performance over existing 2D baselines. Remarkably, in human-evaluated object captioning tasks, PointLLM outperforms human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM .
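One common way to realize the fusion the abstract describes is to project point-cloud features into the LLM's embedding space and prepend them to the text tokens; the sketch below follows that pattern with assumed module names and dimensions.

```python
import torch, torch.nn as nn

class PointPrefix(nn.Module):
    def __init__(self, point_dim=384, llm_dim=4096, n_tokens=32):
        super().__init__()
        self.proj = nn.Linear(point_dim, llm_dim)  # maps into LLM token space
        self.n_tokens = n_tokens

    def forward(self, point_feats, text_embeds):
        """point_feats: (B, n_tokens, point_dim) from a point cloud encoder;
        text_embeds: (B, T, llm_dim) from the LLM's embedding table."""
        prefix = self.proj(point_feats)
        # The concatenated sequence is what the LLM would consume.
        return torch.cat([prefix, text_embeds], dim=1)

fuse = PointPrefix()
seq = fuse(torch.randn(2, 32, 384), torch.randn(2, 16, 4096))
print(seq.shape)  # torch.Size([2, 48, 4096])
```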

StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation

  • paper_url: http://arxiv.org/abs/2308.16909
  • repo_url: https://github.com/johannwyh/styleinv
  • paper_authors: Yuhan Wang, Liming Jiang, Chen Change Loy
  • for: High-quality unconditional video generation.
  • methods: Learns a temporal style-modulated inversion network as the motion generator, which guarantees temporal coherence and supports style transfer with simple fine-tuning.
  • results: Generates long, high-resolution videos with decent single-frame quality and temporal consistency.
    Abstract Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency.

InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion

  • paper_url: http://arxiv.org/abs/2308.16905
  • repo_url: https://github.com/Sirui-Xu/InterDiff
  • paper_authors: Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, Liang-Yan Gui
  • for: A novel task of anticipating 3D human-object interactions (HOIs). Most existing HOI synthesis work is limited to manipulating small or static objects; this task is significantly more challenging because it requires modeling dynamic objects of various shapes, capturing whole-body motion, and ensuring physically valid interactions.
  • methods: Proposes InterDiff, a framework with two key steps: (i) interaction diffusion, where a diffusion model encodes the distribution of future human-object interactions; and (ii) interaction correction, where a physics-informed predictor corrects denoised HOIs in a diffusion step. The key insight is that interactions relative to contact points follow a simple, easily predictable pattern.
  • results: Experiments on multiple human-object interaction datasets show the method produces realistic, vivid, and remarkably long-term 3D HOI predictions.
    Abstract This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Most existing research on HOI synthesis lacks comprehensive whole-body interactions with dynamic objects, e.g., often limited to manipulating small or static objects. Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions. To this end, we propose InterDiff, a framework comprising two key steps: (i) interaction diffusion, where we leverage a diffusion model to encode the distribution of future human-object interactions; (ii) interaction correction, where we introduce a physics-informed predictor to correct denoised HOIs in a diffusion step. Our key insight is to inject prior knowledge that the interactions under reference with respect to contact points follow a simple pattern and are easily predictable. Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task, capable of producing realistic, vivid, and remarkably long-term 3D HOI predictions.

Transformers as Support Vector Machines

  • paper_url: http://arxiv.org/abs/2308.16898
  • repo_url: https://github.com/mohamedehab00/A-Hybrid-Arabic-Text-Summarization-Approach-based-on-Transformers
  • paper_authors: Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak
  • for: Understanding the optimization of the self-attention layer in the transformer architecture and its implicit bias.
  • methods: Establishes a formal equivalence between the optimization geometry of self-attention and a hard-margin support vector machine (SVM) problem that separates optimal input tokens from non-optimal ones via linear constraints on the outer products of token pairs.
  • results: Characterizes the implicit bias of 1-layer transformers trained with gradient descent, including the nature of local/global directional convergence and the role of over-parameterization, and poses several open problems and research directions.
    Abstract Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by $(K,Q)$, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W=KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
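The two objects the analysis relates can be computed directly: the attention map softmax$(XQK^\top X^\top)$ from the abstract, and the nuclear norm of the combined parameter $W = KQ^\top$, whose minimization characterizes the implicit bias. A small numerical sketch:

```python
import numpy as np

def attention(X, K, Q):
    S = X @ Q @ K.T @ X.T                  # pairwise token similarities
    S = S - S.max(axis=-1, keepdims=True)  # numerically stable softmax
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

X = np.random.randn(5, 8)   # 5 tokens of dimension 8
K = np.random.randn(8, 4)   # key parameters
Q = np.random.randn(8, 4)   # query parameters
A = attention(X, K, Q)
W = K @ Q.T                 # combined parameter
nuclear_norm = np.linalg.svd(W, compute_uv=False).sum()
print(A.shape, round(float(nuclear_norm), 3))
```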

PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction

  • paper_url: http://arxiv.org/abs/2308.16896
  • repo_url: https://github.com/wzzheng/pointocc
  • paper_authors: Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, Jiwen Lu
  • for: Efficient point-based 3D semantic occupancy prediction for scene understanding in autonomous driving.
  • methods: Represents point clouds with a cylindrical tri-perspective view (TPV) built according to the distance distribution of LiDAR points, uses spatial group pooling to preserve structural details during projection, and processes each TPV plane with efficient 2D backbones; the PointOcc model then aggregates each point's projected features without any post-processing.
  • results: Achieves state-of-the-art performance on 3D occupancy prediction and LiDAR segmentation benchmarks at much faster speed; using LiDAR only, PointOcc significantly outperforms all other methods, including multi-modal ones, by a large margin on the OpenOccupancy benchmark.
    Abstract Semantic segmentation in autonomous driving has been undergoing an evolution from sparse point segmentation to dense voxel segmentation, where the objective is to predict the semantic occupancy of each voxel in the concerned 3D space. The dense nature of the prediction space has rendered existing efficient 2D-projection-based methods (e.g., bird's eye view, range view, etc.) ineffective, as they can only describe a subspace of the 3D scene. To address this, we propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively and a PointOcc model to process them efficiently. Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system for more fine-grained modeling of nearer areas. We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane. Finally, we obtain the features of each point by aggregating its projected features on each of the processed TPV planes without the need for any post-processing. Extensive experiments on both 3D occupancy prediction and LiDAR segmentation benchmarks demonstrate that the proposed PointOcc achieves state-of-the-art performance with much faster speed. Specifically, despite only using LiDAR, PointOcc significantly outperforms all other methods, including multi-modal methods, with a large margin on the OpenOccupancy benchmark. Code: https://github.com/wzzheng/PointOcc.
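A small sketch of the cylindrical coordinates underlying the tri-perspective view: binning the radius uniformly gives nearer regions finer effective resolution, which matches the LiDAR distance-distribution argument above. The plane projections shown are a simplification of the actual TPV construction.

```python
import numpy as np

def to_cylindrical(xyz):
    rho = np.hypot(xyz[:, 0], xyz[:, 1])      # radial distance
    phi = np.arctan2(xyz[:, 1], xyz[:, 0])    # azimuth angle
    return np.stack([rho, phi, xyz[:, 2]], axis=1)

pts = np.random.randn(1000, 3) * 20
cyl = to_cylindrical(pts)
# Project onto the three TPV planes by dropping one coordinate each (sketch).
rho_phi, rho_z, phi_z = cyl[:, [0, 1]], cyl[:, [0, 2]], cyl[:, [1, 2]]
print(rho_phi.shape, rho_z.shape, phi_z.shape)
```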

Language-Conditioned Path Planning

  • paper_url: http://arxiv.org/abs/2308.16893
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Amber Xie, Youngwoon Lee, Pieter Abbeel, Stephen James
  • for: This paper focuses on the problem of path planning for robotic manipulation tasks, specifically in contact-rich environments.
  • methods: The proposed method is called Language-Conditioned Collision Functions (LACO), which uses a single-view image, language prompt, and robot configuration to learn a collision function and enable flexible, conditional path planning.
  • results: The authors demonstrate the effectiveness of LACO in both simulation and real-world experiments, showing that it can facilitate complex, nuanced path plans that allow for safe collisions with objects in the environment.
    Abstract Contact is at the core of robotic manipulation. At times, it is desired (e.g. manipulation and grasping), and at times, it is harmful (e.g. when avoiding obstacles). However, traditional path planning algorithms focus solely on collision-free paths, limiting their applicability in contact-rich tasks. To address this limitation, we propose the domain of Language-Conditioned Path Planning, where contact-awareness is incorporated into the path planning problem. As a first step in this domain, we propose Language-Conditioned Collision Functions (LACO) a novel approach that learns a collision function using only a single-view image, language prompt, and robot configuration. LACO predicts collisions between the robot and the environment, enabling flexible, conditional path planning without the need for manual object annotations, point cloud data, or ground-truth object meshes. In both simulation and the real world, we demonstrate that LACO can facilitate complex, nuanced path plans that allow for interaction with objects that are safe to collide, rather than prohibiting any collision.

ReZero: Region-customizable Sound Extraction

  • paper_url: http://arxiv.org/abs/2308.16892
  • repo_url: None
  • paper_authors: Rongzhi Gu, Yi Luo
  • for: The multi-channel region-wise sound extraction (R-SE) task: extracting all active target sounds within a specific, user-defined spatial region.
  • methods: The ReZero framework includes definitions of different spatial region types (angular windows, spheres, cones, and other geometric patterns), methods for region feature extraction and aggregation, and a multi-channel extension of the band-split RNN (BSRNN) model tailored to R-SE.
  • results: Experiments on both simulated and real-recorded data, across microphone array geometries and region types, demonstrate the effectiveness of ReZero; demos are available at https://innerselfm.github.io/rezero/.
    Abstract We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. R-SE task aims at extracting all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which is different from conventional and existing tasks where a blind separation or a fixed, predefined spatial region are typically assumed. The spatial region can be defined as an angular window, a sphere, a cone, or other geometric patterns. Being a solution to the R-SE task, the proposed ReZero framework includes (1) definitions of different types of spatial regions, (2) methods for region feature extraction and aggregation, and (3) a multi-channel extension of the band-split RNN (BSRNN) model specified for the R-SE task. We design experiments for different microphone array geometries, different types of spatial regions, and comprehensive ablation studies on different system configurations. Experimental results on both simulated and real-recorded data demonstrate the effectiveness of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
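As an illustration of one region type named above, the sketch below tests whether source directions of arrival fall inside an angular window, including windows that straddle 0 degrees; this is a toy geometric helper, not ReZero's feature extraction.

```python
import numpy as np

def in_angular_window(doa_deg, lo, hi):
    """doa_deg: source directions of arrival in degrees; [lo, hi] defines the
    user-specified angular window."""
    d = np.mod(np.asarray(doa_deg), 360.0)
    lo, hi = lo % 360.0, hi % 360.0
    # Handle windows that wrap around 0 degrees.
    return (d >= lo) & (d <= hi) if lo <= hi else (d >= lo) | (d <= hi)

doas = [10.0, 95.0, 350.0]                # directions of three active sources
mask = in_angular_window(doas, -30, 30)   # window straddling 0 degrees
print(mask)                               # -> [ True False  True]
```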

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

  • paper_url: http://arxiv.org/abs/2308.16884
  • repo_url: None
  • paper_authors: Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
  • for: Expanding the language coverage of natural language understanding (NLU) benchmarks and evaluating language models across high-, medium-, and low-resource languages.
  • methods: A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants; each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers, carefully curated to discriminate between models with different levels of general language comprehension.
  • results: Despite significant cross-lingual transfer in English-centric LLMs, much smaller multilingual masked language models (MLMs) pretrained on balanced multilingual data still understand far more languages; larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages.
    Abstract We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

Adaptation Speed Analysis for Fairness-aware Causal Models

  • paper_url: http://arxiv.org/abs/2308.16879
  • repo_url: None
  • paper_authors: Yujie Lin, Chen Zhao, Minglai Shao, Xujiang Zhao, Haifeng Chen
  • for: This paper explores the adaptation of two models to a domain shift in the presence of a sensitive variable (bias) in a structural causal model (SCM) with a cause-bias-effect structure.
  • methods: Two models with opposite cause-effect directions are used to align the original distribution p with the modified distribution p* arising from an unknown intervention; their adaptation speeds are compared across four shift scenarios.
  • results: The paper compares the adaptation speeds of the two models across four shift scenarios and proves the connection between their adaptation speeds across all interventions.
    Abstract For example, in machine translation tasks, to achieve bidirectional translation between two languages, the source corpus is often used as the target corpus, which involves the training of two models with opposite directions. The question of which one can adapt most quickly to a domain shift is of significant importance in many fields. Specifically, consider an original distribution p that changes due to an unknown intervention, resulting in a modified distribution p*. In aligning p with p*, several factors can affect the adaptation rate, including the causal dependencies between variables in p. In real-life scenarios, however, we have to consider the fairness of the training process, and it is particularly crucial to involve a sensitive variable (bias) present between a cause and an effect variable. To explore this scenario, we examine a simple structural causal model (SCM) with a cause-bias-effect structure, where variable A acts as a sensitive variable between cause (X) and effect (Y). The two models, respectively, exhibit consistent and contrary cause-effect directions in the cause-bias-effect SCM. After conducting unknown interventions on variables within the SCM, we can simulate some kinds of domain shifts for analysis. We then compare the adaptation speeds of two models across four shift scenarios. Additionally, we prove the connection between the adaptation speeds of the two models across all interventions.

The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages

  • paper_url: http://arxiv.org/abs/2308.16871
  • repo_url: None
  • paper_authors: Benjamin Muller, Belen Alastruey, Prangthip Hansanti, Elahe Kalbassi, Christophe Ropers, Eric Michael Smith, Adina Williams, Luke Zettlemoyer, Pierre Andrews, Marta R. Costa-jussà
  • for: 本研究旨在探讨语言生成系统中的性别偏见,并提出一种可能的来源——训练和评估数据中的性别表达不均衡。
  • methods: 本研究使用了一个多语言词典来自动量化大规模数据中的性别表达,并使用了WMT训练数据和开发数据来评估这种方法。
  • results: 研究发现,现有的数据集具有 masculine 表达的偏好,这可能导致语言生成系统对 masculine 性别表达优先化。研究建议在现有数据集中引入性别量化管道,并希望将其修改为均衡的性别表达。
    Abstract Gender biases in language generation systems are challenging to mitigate. One possible source for these biases is gender representation disparities in the training and evaluation data. Despite recent progress in documenting this problem and many attempts at mitigating it, we still lack shared methodology and tooling to report gender representation in large datasets. Such quantitative reporting will enable further mitigation, e.g., via data augmentation. This paper describes the Gender-GAP Pipeline (for Gender-Aware Polyglot Pipeline), an automatic pipeline to characterize gender representation in large-scale datasets for 55 languages. The pipeline uses a multilingual lexicon of gendered person-nouns to quantify the gender representation in text. We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation. Having unbalanced datasets may indirectly optimize our systems towards outperforming one gender over the others. We suggest introducing our gender quantification pipeline in current datasets and, ideally, modifying them toward a balanced representation.
    摘要
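As a rough illustration of the lexicon-based counting the pipeline performs, here is a minimal sketch; the lexicon below is a tiny English-only stand-in for the paper's multilingual lexicon of gendered person-nouns, and is an assumption for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stand-in for the multilingual gendered-person-noun
# lexicon described above (the real Gender-GAP lexicon covers 55 languages).
GENDERED_LEXICON = {
    "masculine": {"man", "men", "boy", "boys", "father", "he", "him", "his"},
    "feminine": {"woman", "women", "girl", "girls", "mother", "she", "her"},
}

def gender_counts(text: str) -> Counter:
    """Count lexicon matches per gender in one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for gender, words in GENDERED_LEXICON.items():
            if tok in words:
                counts[gender] += 1
    return counts

corpus = [
    "The chairman thanked his colleagues.",
    "She is the first woman to lead the lab.",
]
total = sum((gender_counts(doc) for doc in corpus), Counter())
print(total)  # Counter({'feminine': 2, 'masculine': 1})
```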

Learning Driver Models for Automated Vehicles via Knowledge Sharing and Personalization

  • paper_url: http://arxiv.org/abs/2308.16870
  • repo_url: None
  • paper_authors: Wissam Kontar, Xinzhi Zhong, Soyoung Ahn
  • for: 本研究提出了一种框架,用于通过知识共享和个性化训练自动驾驶车辆(AVs)驾驶模型。由于交通系统中的自然变化使得对AVs进行实际测试或实验很困难,因此AVs可能会缺乏一些对其安全和高效操作的关键驾驶场景。这种知识共享的方法可以帮助AVs更好地适应实际驾驶情况。
  • methods: 本研究使用了联邦学习方法,通过多辆车辆之间的知识共享和借鉴,实现个性化训练AVs的驾驶模型。这种方法不需要车辆之间共享Raw数据,从而保持了数据隐私和安全性。
  • results: 我们在实验 simulations中展示了我们的方法的性能。这种方法在交通工程中拥有广泛的应用,包括智能交通系统、交通管理和车辆间通信。研究页面上提供了代码和示例数据,访问https://github.com/wissamkontar。
    Abstract This paper describes a framework for learning Automated Vehicles (AVs) driver models via knowledge sharing between vehicles and personalization. The innate variability in the transportation system makes it exceptionally challenging to expose AVs to all possible driving scenarios during empirical experimentation or testing. Consequently, AVs could be blind to certain encounters that are deemed detrimental to their safe and efficient operation. It is then critical to share knowledge across AVs that increase exposure to driving scenarios occurring in the real world. This paper explores a method to collaboratively train a driver model by sharing knowledge and borrowing strength across vehicles while retaining a personalized model tailored to the vehicle's unique conditions and properties. Our model brings a federated learning approach to collaborate between multiple vehicles while circumventing the need to share raw data between them. We showcase our method's performance in experimental simulations. Such an approach to learning finds several applications across transportation engineering including intelligent transportation systems, traffic management, and vehicle-to-vehicle communication. Code and sample dataset are made available at the project page https://github.com/wissamkontar.
    摘要

IoMT-Blockchain based Secured Remote Patient Monitoring Framework for Neuro-Stimulation Device

  • paper_url: http://arxiv.org/abs/2308.16857
  • repo_url: None
  • paper_authors: Md Sakib Ullah Sourav, Mohammad Sultan Mahmud, Md Simul Hasan Talukder, Rejwan Bin Sulaiman, Abdullah Yasin
  • for: 这篇论文的目的是提高医疗业电子设备的准确性、可靠性和生产力,通过使用互联网物联网(IoMT)技术,并利用区块链(BC)解决中心化存储和数据抢夺等问题。
  • methods: 该论文提出了一种基于 IoMT 的远程非侵入式脑刺激系统,采用可通过 Android 应用程序控制的基于硬件的 tDCS 设备,并借鉴文献中的最佳实践来解决 IoMT-BC 系统的问题。
  • results: 该论文的研究结果表明,使用IoMT和BC技术可以提高脑刺激系统的准确性和可靠性,并且可以实现实时远程监测病人的状况。
    Abstract Biomedical Engineering's Internet of Medical Things (IoMT) is helping to improve the accuracy, dependability, and productivity of electronic equipment in the healthcare business. Real-time sensory data from patients may be delivered and subsequently analyzed through rapid development of wearable IoMT devices, such as neuro-stimulation devices with a range of functions. Data from the Internet of Things is gathered, analyzed, and stored in a single location. However, single-point failure, data manipulation, privacy difficulties, and other challenges might arise as a result of centralization. Due to its decentralized nature, blockchain (BC) can alleviate these issues. The viability of establishing a non-invasive remote neurostimulation system employing IoMT-based transcranial Direct Current Stimulation is investigated in this work (tDCS). A hardware-based prototype tDCS device has been developed that can be operated over the internet using an android application. Our suggested framework addresses the problems of IoMTBC-based systems, meets the criteria of real-time remote patient monitoring systems, and incorporates literature best practices in the relevant fields.
    摘要 生物医学工程中的医疗物联网(IoMT)正在提高医疗行业电子设备的准确性、可靠性和生产力。通过快速发展的可穿戴 IoMT 设备(如具有多种功能的神经刺激设备),可以实时传输并分析患者的感知数据。物联网数据被集中收集、分析和存储;然而,这种中心化可能带来单点故障、数据篡改、隐私等问题。由于区块链(BC)具有去中心化的特点,它可以缓解这些问题。本文研究了基于 IoMT 的经颅直流电刺激(tDCS)构建非侵入式远程神经刺激系统的可行性。我们开发了一个基于硬件的 tDCS 原型设备,可通过 Android 应用程序在互联网上操作。我们提出的框架解决了 IoMT-BC 系统的问题,满足实时远程患者监测系统的要求,并融合了相关领域的文献最佳实践。

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

  • paper_url: http://arxiv.org/abs/2308.16836
  • repo_url: None
  • paper_authors: Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng
  • for: 构建高品质的端到端歌声合成(SVS)系统,以提升合成歌声的表现力。
  • methods: 使用从预训练 BERT 中提取的语义嵌入作为额外输入,并引入能量预测器以及重新设计的音高预测器(预测真实音高与乐谱音高之比)等针对性设计。
  • results: 客观和主观实验结果均表明,所提出的 SVS 系统合成的歌声质量更高,优于 VISinger。
    Abstract This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings to improve the expressiveness of the synthesized singing voice. Based on the main architecture of recently proposed VISinger, we put forward several specific designs for expressive singing voice synthesis. First, different from the previous SVS models, we use text representation of lyrics extracted from pre-trained BERT as additional input to the model. The representation contains information about semantics of the lyrics, which could help SVS system produce more expressive and natural voice. Second, we further introduce an energy predictor to stabilize the synthesized voice and model the wider range of energy variations that also contribute to the expressiveness of singing voice. Last but not the least, to attenuate the off-key issues, the pitch predictor is re-designed to predict the real to note pitch ratio. Both objective and subjective experimental results indicate that the proposed SVS system can produce singing voice with higher-quality outperforming VISinger.
    摘要

Can Programming Languages Boost Each Other via Instruction Tuning?

  • paper_url: http://arxiv.org/abs/2308.16824
  • repo_url: https://github.com/nl2code/codem
  • paper_authors: Daoguang Zan, Ailun Yu, Bo Shen, Jiaxin Zhang, Taihong Chen, Bing Geng, Bei Chen, Jichuan Ji, Yafen Yao, Yongji Wang, Qianxiang Wang
  • for: 本研究探讨在代码大语言模型的指令微调阶段,编程语言之间能否互相促进。
  • methods: 我们使用了 8 种流行的编程语言 (Python, JavaScript, TypeScript, C, C++, Java, Go, HTML) 在 StarCoder 上进行了广泛的实验。
  • results: 结果表明,编程语言可以很大程度上提高对方。例如, CodeM-Python 15B 在 Python 上训练后可以提高 Java 的 pass@1 精度达 17.95%。而即使使用 HTML corpus 进行训练,CodeM-HTML 7B 也可以提高 Java 的 pass@1 精度达 15.24%。我们的训练数据可以在 GitHub 上下载。
    Abstract When human programmers have mastered a programming language, it would be easier when they learn a new programming language. In this report, we focus on exploring whether programming languages can boost each other during the instruction fine-tuning phase of code large language models. We conduct extensive experiments of 8 popular programming languages (Python, JavaScript, TypeScript, C, C++, Java, Go, HTML) on StarCoder. Results demonstrate that programming languages can significantly improve each other. For example, CodeM-Python 15B trained on Python is able to increase Java by an absolute 17.95% pass@1 on HumanEval-X. More surprisingly, we found that CodeM-HTML 7B trained on the HTML corpus can improve Java by an absolute 15.24% pass@1. Our training data is released at https://github.com/NL2Code/CodeM.
    摘要 当程序员掌握了一种编程语言后,学习新的编程语言就会变得更加容易。在这份报告中,我们关注编程语言能否在代码大语言模型的指令微调阶段互相促进。我们在 StarCoder 上进行了广泛的实验,测试了 8 种流行的编程语言(Python、JavaScript、TypeScript、C、C++、Java、Go、HTML)。结果表明,编程语言可以显著地互相促进。例如,CodeM-Python 15B 在 Python 上训练后,可以将 Java 在 HumanEval-X 上的 pass@1 精度绝对提升 17.95%。更令人惊讶的是,我们发现 CodeM-HTML 7B 在 HTML 语料库上训练后,可以将 Java 的 pass@1 精度绝对提升 15.24%。我们的训练数据可以在 GitHub 上下载:https://github.com/NL2Code/CodeM。

Latent Variable Multi-output Gaussian Processes for Hierarchical Datasets

  • paper_url: http://arxiv.org/abs/2308.16822
  • repo_url: None
  • paper_authors: Chunchao Ma, Arthur Leroy, Mauricio Alvarez
  • for: 这篇论文旨在提出一种面向层次结构数据集(即观测之间的关系可用树结构表示)的多输出高斯过程(MOGPs)扩展。
  • methods: 该模型定义了一个针对数据层次结构定制的核函数,以捕捉不同层次的相关性,并引入潜在变量及专门的核函数来表达输出之间的潜在依赖关系。
  • results: 在合成数据以及来自基因组学和动作捕捉的真实数据上进行的大量实验表明,该扩展模型能显著提升可扩展性并更好地刻画多任务之间的相关性。
    Abstract Multi-output Gaussian processes (MOGPs) have been introduced to deal with multiple tasks by exploiting the correlations between different outputs. Generally, MOGPs models assume a flat correlation structure between the outputs. However, such a formulation does not account for more elaborate relationships, for instance, if several replicates were observed for each output (which is a typical setting in biological experiments). This paper proposes an extension of MOGPs for hierarchical datasets (i.e. datasets for which the relationships between observations can be represented within a tree structure). Our model defines a tailored kernel function accounting for hierarchical structures in the data to capture different levels of correlations while leveraging the introduction of latent variables to express the underlying dependencies between outputs through a dedicated kernel. This latter feature is expected to significantly improve scalability as the number of tasks increases. An extensive experimental study involving both synthetic and real-world data from genomics and motion capture is proposed to support our claims.
    摘要 多输出高斯过程(MOGPs)被引入以处理多个任务,通过利用不同输出之间的相关性。通常,MOGPs 模型假设输出之间的相关结构是扁平的。然而,这种形式无法刻画更复杂的关系,例如每个输出都有多个重复观测(这是生物实验中常见的设置)。本文提出了面向层次结构数据集的 MOGPs 扩展:我们的模型定义了适应数据层次结构的专门核函数,以捕捉不同层次的相关性,同时通过专门的核函数引入潜在变量来表达输出之间的依赖关系。这一特点预期会在任务数量增加时提高可扩展性。我们在合成数据和真实数据上进行了大量实验来支持我们的主张。
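To make the hierarchical-kernel idea concrete, here is a minimal sketch; the RBF input kernel and the shared-prefix tree similarity are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Hedged sketch of a covariance with a hierarchical (tree) component, in the
# spirit described above: similarity between observations combines an input
# kernel with a term that grows with the shared path in the output hierarchy.
def rbf(x1: float, x2: float, ell: float = 1.0) -> float:
    return float(np.exp(-0.5 * (x1 - x2) ** 2 / ell ** 2))

def tree_similarity(path_a: tuple, path_b: tuple) -> float:
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_a), len(path_b))

def hierarchical_k(x1, path1, x2, path2) -> float:
    return tree_similarity(path1, path2) * rbf(x1, x2)

# Two replicates of output A are more correlated than A vs. output B:
print(hierarchical_k(0.1, ("A", "rep1"), 0.2, ("A", "rep2")))  # ~0.5 * rbf
print(hierarchical_k(0.1, ("A", "rep1"), 0.2, ("B", "rep1")))  # 0.0
```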

Irregular Traffic Time Series Forecasting Based on Asynchronous Spatio-Temporal Graph Convolutional Network

  • paper_url: http://arxiv.org/abs/2308.16818
  • repo_url: None
  • paper_authors: Weijia Zhang, Le Zhang, Jindong Han, Hao Liu, Jingbo Zhou, Yu Mei, Hui Xiong
  • for: 这篇论文旨在提出一个能够实现高精度交通预测的方法,以提高智能交通信号系统的效率。
  • methods: 本文使用了 asynchronous spatio-temporal graph convolutional neural network (ASeer),它通过连接车道的交通扩散图来建模车道间的异步空间相依性,并使用可学习的个人化时间编码来捕捉车道内的时间相依性。
  • results: 实验结果显示,ASeer 能够实现高精度的交通预测,并且在六个度量上优于现有的方法。
    Abstract Accurate traffic forecasting at intersections governed by intelligent traffic signals is critical for the advancement of an effective intelligent traffic signal control system. However, due to the irregular traffic time series produced by intelligent intersections, the traffic forecasting task becomes much more intractable and imposes three major new challenges: 1) asynchronous spatial dependency, 2) irregular temporal dependency among traffic data, and 3) variable-length sequence to be predicted, which severely impede the performance of current traffic forecasting methods. To this end, we propose an Asynchronous Spatio-tEmporal graph convolutional nEtwoRk (ASeer) to predict the traffic states of the lanes entering intelligent intersections in a future time window. Specifically, by linking lanes via a traffic diffusion graph, we first propose an Asynchronous Graph Diffusion Network to model the asynchronous spatial dependency between the time-misaligned traffic state measurements of lanes. After that, to capture the temporal dependency within irregular traffic state sequence, a learnable personalized time encoding is devised to embed the continuous time for each lane. Then we propose a Transformable Time-aware Convolution Network that learns meta-filters to derive time-aware convolution filters with transformable filter sizes for efficient temporal convolution on the irregular sequence. Furthermore, a Semi-Autoregressive Prediction Network consisting of a state evolution unit and a semiautoregressive predictor is designed to effectively and efficiently predict variable-length traffic state sequences. Extensive experiments on two real-world datasets demonstrate the effectiveness of ASeer in six metrics.
    摘要 在智能交通信号控制的交叉口进行准确的交通预测,对于构建高效的智能交通信号控制系统至关重要。然而,智能交叉口产生的交通时间序列是不规则的,这使预测任务变得更加困难,并带来三大新挑战:1) 异步的空间依赖;2) 交通数据之间不规则的时间依赖;3) 待预测序列长度可变。这些问题严重制约了现有交通预测方法的性能。为此,我们提出了一种异步时空图卷积网络(ASeer),用于预测未来时间窗口内进入智能交叉口各车道的交通状态。具体而言,我们首先通过交通扩散图连接各车道,提出异步图扩散网络来建模时间错位的车道交通状态测量之间的异步空间依赖;随后,为捕捉不规则交通状态序列内的时间依赖,我们设计了可学习的个性化时间编码来嵌入每条车道的连续时间;然后,我们提出可变时间感知卷积网络,通过学习元滤波器推导出具有可变滤波器尺寸的时间感知卷积滤波器,从而在不规则序列上高效地进行时间卷积;最后,我们设计了由状态演化单元和半自回归预测器组成的半自回归预测网络,以高效且有效地预测变长交通状态序列。在两个真实数据集上的大量实验证明了 ASeer 在六项指标上的有效性。

Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.16800
  • repo_url: None
  • paper_authors: Andreas Roth, Thomas Liebig
  • for: This paper aims to provide new theoretical insights into the issues of over-smoothing and feature over-correlation in deep graph neural networks.
  • methods: The paper uses a theoretical approach to demonstrate the prevalence of invariant subspaces in deep graph neural networks, and shows how this can lead to over-smoothing and over-correlation.
  • results: The paper’s results include a better understanding of the causes of over-smoothing and over-correlation, and the proposal of a sum of Kronecker products as a beneficial property that can prevent these issues. Additionally, the paper demonstrates the inability of existing models to capture linearly independent features in the non-linear case.
    Abstract Our study reveals new theoretical insights into over-smoothing and feature over-correlation in deep graph neural networks. We show the prevalence of invariant subspaces, demonstrating a fixed relative behavior that is unaffected by feature transformations. Our work clarifies recent observations related to convergence to a constant state and a potential over-separation of node states, as the amplification of subspaces only depends on the spectrum of the aggregation function. In linear scenarios, this leads to node representations being dominated by a low-dimensional subspace with an asymptotic convergence rate independent of the feature transformations. This causes a rank collapse of the node representations, resulting in over-smoothing when smooth vectors span this subspace, and over-correlation even when over-smoothing is avoided. Guided by our theory, we propose a sum of Kronecker products as a beneficial property that can provably prevent over-smoothing, over-correlation, and rank collapse. We empirically extend our insights to the non-linear case, demonstrating the inability of existing models to capture linearly independent features.
    摘要
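The rank-collapse claim is easy to reproduce numerically in the linear case; the sketch below (with an arbitrary random graph and linear aggregation, an illustrative setup rather than the paper's experiments) shows the second singular value of the node features decaying exponentially relative to the first as depth grows.

```python
import numpy as np

# Small numerical illustration of the rank collapse discussed above: repeated
# aggregation with a symmetric-normalized adjacency drives node features
# toward the dominant invariant subspace.
rng = np.random.default_rng(0)
n, d = 20, 8
A = (rng.random((n, n)) < 0.2).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T) + np.eye(n)            # undirected graph + self-loops
Dinv = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = Dinv @ A @ Dinv                       # normalized aggregation operator

X = rng.normal(size=(n, d))                   # random node features
for layers in [1, 4, 16, 64]:
    Xl = np.linalg.matrix_power(A_hat, layers) @ X
    s = np.linalg.svd(Xl, compute_uv=False)
    print(f"{layers:3d} aggregations: s2/s1 = {s[1] / s[0]:.2e}")
```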

Agent Teaming Situation Awareness (ATSA): A Situation Awareness Framework for Human-AI Teaming

  • paper_url: http://arxiv.org/abs/2308.16785
  • repo_url: None
  • paper_authors: Qi Gao, Wei Xu, Mowei Shen, Zaifeng Gao
  • for: The paper reviews leading situation awareness (SA) theoretical models and proposes a new framework for SA in the human-AI teaming (HAT) context.
  • methods: A literature review is used to identify the key features and processes of HAT and to develop the new SA framework.
  • results: The proposed Agent Teaming Situation Awareness (ATSA) framework unifies human and AI behavior and emphasizes cohesive and effective HAT through structures and components such as teaming understanding, teaming control, and the world.
    Abstract The rapid advancements in artificial intelligence (AI) have led to a growing trend of human-AI teaming (HAT) in various fields. As machines continue to evolve from mere automation to a state of autonomy, they are increasingly exhibiting unexpected behaviors and human-like cognitive/intelligent capabilities, including situation awareness (SA). This shift has the potential to enhance the performance of mixed human-AI teams over all-human teams, underscoring the need for a better understanding of the dynamic SA interactions between humans and machines. To this end, we provide a review of leading SA theoretical models and a new framework for SA in the HAT context based on the key features and processes of HAT. The Agent Teaming Situation Awareness (ATSA) framework unifies human and AI behavior, and involves bidirectional, and dynamic interaction. The framework is based on the individual and team SA models and elaborates on the cognitive mechanisms for modeling HAT. Similar perceptual cycles are adopted for the individual (including both human and AI) and the whole team, which is tailored to the unique requirements of the HAT context. ATSA emphasizes cohesive and effective HAT through structures and components, including teaming understanding, teaming control, and the world, as well as adhesive transactive part. We further propose several future research directions to expand on the distinctive contributions of ATSA and address the specific and pressing next steps.
    摘要 人工智能(AI)的快速进步导致人机合作(HAT)在不同领域得到了普遍应用。随着机器的演进从自动化到智能化,它们开始显示出人类智能的特征和不期望的行为,包括情境意识(SA)。这种变化有可能提高混合人机队列的性能,高亮了我们更好地理解人机合作中的SA交互的需要。为此,我们提供了SA理论模型的综述和人机合作情境中SA框架(ATSA),该框架将人类和AI行为结合在一起,并包括对向和动态互动。ATSA基于个体和团队SA模型,并详细介绍了人机合作中的认知机制。在团队水平上,采用同样的观察循环,包括人类和AI的个体SA,以适应HAT特殊需求。ATSA强调合作和有效的人机合作,通过结构和组件,如团队理解、团队控制和世界,以及贯通性的交互。我们还建议了一些未来研究方向,以扩展ATSA的独特贡献和解决特定和紧迫的下一步。

StratMed: Relevance Stratification for Low-resource Medication Recommendation

  • paper_url: http://arxiv.org/abs/2308.16781
  • repo_url: None
  • paper_authors: Xiang Li
  • for: 这篇论文的目的是提出一个基于人工智能的药物建议方法,以整合长期医疗历史资料和医疗知识,帮助医生诊断更加精确和安全的药物组合。
  • methods: 这篇论文使用了一个创新的相关性分层机制,以协调数据的长尾分布差异,并在同一层面兼顾药物组合的安全性与精确性。具体而言,作者首先构建了基于深度学习网络的预训练方法以获得实体表示;随后设计了金字塔式的数据分层方法,通过强化冷门实体的特征来获得更具泛化性的实体关系;并基于该关系设计了两种图结构,在同一层面表达用药的精确性与安全性,从而得到就诊表示。
  • results: 实验结果显示,该方法在 MIMIC-III 数据集上的四项评估指标(包括安全性和精确性)上均优于现有的最先进方法。
    Abstract With the growing imbalance between limited medical resources and escalating demands, AI-based clinical tasks have become paramount. Medication recommendation, as a sub-domain, aims to amalgamate longitudinal patient history with medical knowledge, assisting physicians in prescribing safer and more accurate medication combinations. Existing methods overlook the inherent long-tail distribution in medical data, lacking balanced representation between head and tail data, which leads to sub-optimal model performance. To address this challenge, we introduce StratMed, a model that incorporates an innovative relevance stratification mechanism. It harmonizes discrepancies in data long-tail distribution and strikes a balance between the safety and accuracy of medication combinations. Specifically, we first construct a pre-training method using deep learning networks to obtain entity representation. After that, we design a pyramid-like data stratification method to obtain more generalized entity relationships by reinforcing the features of unpopular entities. Based on this relationship, we designed two graph structures to express medication precision and safety at the same level to obtain visit representations. Finally, the patient's historical clinical information is fitted to generate medication combinations for the current health condition. Experiments on the MIMIC-III dataset demonstrate that our method has outperformed current state-of-the-art methods in four evaluation metrics (including safety and accuracy).
    摘要

Efficacy of Neural Prediction-Based NAS for Zero-Shot NAS Paradigm

  • paper_url: http://arxiv.org/abs/2308.16775
  • repo_url: https://github.com/minh1409/dft-npzs-nas
  • paper_authors: Minh Le, Nhan Nguyen, Ngoc Hoang Luong
  • for: This paper focuses on addressing the limitation of performance indicators in prediction-based Neural Architecture Search (NAS), specifically the inability to evaluate architecture performance across varying search spaces.
  • methods: The proposed approach uses Fourier sum of sines encoding for convolutional kernels, which enables the construction of a computational feed-forward graph with a structure similar to the architecture under evaluation. An accompanying multi-layer perceptron (MLP) then ranks these architectures based on their encodings.
  • results: The approach proposed in this paper surpasses previous methods using graph convolutional networks in terms of correlation on the NAS-Bench-201 dataset and exhibits a higher convergence rate. Moreover, the extracted feature representation trained on each NAS-Benchmark is transferable to other NAS-Benchmarks, showing promising generalizability across multiple search spaces.
    Abstract In prediction-based Neural Architecture Search (NAS), performance indicators derived from graph convolutional networks have shown significant success. These indicators, achieved by representing feed-forward structures as component graphs through one-hot encoding, face a limitation: their inability to evaluate architecture performance across varying search spaces. In contrast, handcrafted performance indicators (zero-shot NAS), which use the same architecture with random initialization, can generalize across multiple search spaces. Addressing this limitation, we propose a novel approach for zero-shot NAS using deep learning. Our method employs Fourier sum of sines encoding for convolutional kernels, enabling the construction of a computational feed-forward graph with a structure similar to the architecture under evaluation. These encodings are learnable and offer a comprehensive view of the architecture's topological information. An accompanying multi-layer perceptron (MLP) then ranks these architectures based on their encodings. Experimental results show that our approach surpasses previous methods using graph convolutional networks in terms of correlation on the NAS-Bench-201 dataset and exhibits a higher convergence rate. Moreover, our extracted feature representation trained on each NAS-Benchmark is transferable to other NAS-Benchmarks, showing promising generalizability across multiple search spaces. The code is available at: https://github.com/minh1409/DFT-NPZS-NAS
    摘要 在基于预测的神经架构搜索(NAS)中,源自图卷积网络的性能指标取得了显著成功。这些指标通过独热编码将前馈结构表示为组件图,但面临一个限制:无法跨不同搜索空间评估架构性能。相比之下,手工设计的性能指标(zero-shot NAS)使用同一架构并随机初始化,可以在多个搜索空间中泛化。为了解决这一限制,我们提出了一种基于深度学习的新型 zero-shot NAS 方法。我们的方法对卷积核采用正弦和的傅里叶编码,从而构建一个与被评估架构结构相似的计算前馈图。这些编码是可学习的,能够全面反映架构的拓扑信息。随后,一个多层感知器(MLP)根据这些编码对架构进行排序。实验结果表明,我们的方法在 NAS-Bench-201 数据集上的相关性超越了先前基于图卷积网络的方法,并表现出更快的收敛速度。此外,在每个 NAS-Benchmark 上训练得到的特征表示可以迁移到其他 NAS-Benchmark,显示出跨多个搜索空间的良好泛化能力。代码可在:https://github.com/minh1409/DFT-NPZS-NAS 获取。
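The exact encoding details are in the paper; as one plausible, heavily hedged reading of a "Fourier sum of sines" kernel encoding, a convolutional kernel could be projected onto a small bank of 2-D sine basis functions. Basis size and normalization below are assumptions, and the paper additionally makes its encodings learnable.

```python
import numpy as np

# Hedged sketch: represent a KxK convolutional kernel by its coefficients on
# a small 2-D sine basis, giving a fixed-size, architecture-agnostic encoding.
def sine_basis(K: int, n: int) -> np.ndarray:
    xs = (np.arange(K) + 0.5) / K
    basis = [np.outer(np.sin(np.pi * p * xs), np.sin(np.pi * q * xs))
             for p in range(1, n + 1) for q in range(1, n + 1)]
    return np.stack(basis).reshape(n * n, -1)      # shape (n*n, K*K)

def encode_kernel(kernel: np.ndarray, n: int = 3) -> np.ndarray:
    B = sine_basis(kernel.shape[0], n)
    return B @ kernel.ravel()                      # sine coefficients

coeffs = encode_kernel(np.random.randn(5, 5))
print(coeffs.shape)  # (9,) -- a fixed-size encoding of one 5x5 kernel
```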

Towards Low-Barrier Cybersecurity Research and Education for Industrial Control Systems

  • paper_url: http://arxiv.org/abs/2308.16769
  • repo_url: None
  • paper_authors: Colman McGuan, Chansu Yu, Qin Lin
  • for: 这个研究旨在降低工业控制系统(ICS)网络安全研究与教育的门槛,提供一个可靠的测试环境,以便验证和比较各种入侵检测算法,从而保护 ICS。
  • methods: 我们基于 3D 高保真模拟器构建了一个集成框架,可自动发起攻击、收集数据、训练机器学习模型,并针对实际的化工与制造过程进行评估。
  • results: 我们的 Minimal Threshold and Window SVM(MinTWin SVM)模型能够最小化伪阳性,并对物理过程异常作出响应。此外,我们将该数据集用于本科机器学习课程,使学生能够在真实的 ICS 数据上实践机器学习理论,从而将框架融入 ICS 网络安全教育。
    Abstract The protection of Industrial Control Systems (ICS) that are employed in public critical infrastructures is of utmost importance due to catastrophic physical damages cyberattacks may cause. The research community requires testbeds for validation and comparing various intrusion detection algorithms to protect ICS. However, there exist high barriers to entry for research and education in the ICS cybersecurity domain due to expensive hardware, software, and inherent dangers of manipulating real-world systems. To close the gap, built upon recently developed 3D high-fidelity simulators, we further showcase our integrated framework to automatically launch cyberattacks, collect data, train machine learning models, and evaluate for practical chemical and manufacturing processes. On our testbed, we validate our proposed intrusion detection model called Minimal Threshold and Window SVM (MinTWin SVM) that utilizes unsupervised machine learning via a one-class SVM in combination with a sliding window and classification threshold. Results show that MinTWin SVM minimizes false positives and is responsive to physical process anomalies. Furthermore, we incorporate our framework with ICS cybersecurity education by using our dataset in an undergraduate machine learning course where students gain hands-on experience in practicing machine learning theory with a practical ICS dataset. All of our implementations have been open-sourced.
    摘要 对于公共重要基础设施中使用的工业控制系统(ICS)的安全保护非常重要,因为黑客可以通过网络攻击引起严重的物理损害。研究社区需要测试平台来验证和比较不同的入侵检测算法,以保护ICS。但是,ICS安全领域的研究和教育面临着高的入门障碍,因为ICS系统的硬件、软件和实际操作是昂贵的,而且具有很高的危险性。为了解决这个问题,我们基于最近发展的3D高精度模拟器,提供了一个集成的测试平台,可以自动发起网络攻击,收集数据,训练机器学习模型,并评估实际化学和制造过程中的做法。在我们的测试平台上,我们验证了我们提出的入侵检测模型,称为最小阈值窗口支持向量机(MinTWin SVM),它利用了无监督的机器学习,并将窗口和分类阈值结合使用。结果表明,MinTWin SVM可以减少假阳性,同时快速响应物理过程异常。此外,我们将我们的框架与ICS安全教育相结合,通过使用我们的数据集在大学生Machine learning课程中进行实践,让学生通过实践机器学习理论来掌握ICS数据集的实践应用。所有我们的实现都已经开源。
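A minimal sketch of the MinTWin SVM idea — a one-class SVM over sliding windows with a classification threshold — might look like the following; the window size, `nu`, and threshold are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np
from sklearn.svm import OneClassSVM

WINDOW, THRESHOLD = 10, 0.0

def windows(series: np.ndarray, size: int) -> np.ndarray:
    # Stack overlapping sliding windows as feature vectors.
    return np.stack([series[i:i + size] for i in range(len(series) - size + 1)])

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=500)          # normal process behaviour
model = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale")
model.fit(windows(normal, WINDOW))               # unsupervised: normal data only

test = normal.copy()
test[300:320] += 2.0                             # injected physical anomaly
scores = model.decision_function(windows(test, WINDOW))
alarms = np.where(scores < THRESHOLD)[0]         # threshold on the SVM score
print(f"{len(alarms)} anomalous windows flagged, e.g. {alarms[:5]}")
```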

Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection

  • paper_url: http://arxiv.org/abs/2308.16763
  • repo_url: None
  • paper_authors: Kairui Hu, Ming Yan, Joey Tianyi Zhou, Ivor W. Tsang, Wen Haw Chong, Yong Keong Yap
  • for: 提高大型自然语言模型(LLM)的逻辑能力,以及提高小型LLM的性能。
  • methods: 提出了一种名为“思维阶梯”(LoT)的双阶段级联优化框架,通过充分利用高质量的外部知识,提升模型生成的中间推理。
  • results: 在立场检测任务上,LoT 相比 ChatGPT 提升了 16%,相比使用 CoT 的 ChatGPT 提升了 10%。
    Abstract Chain-of-Thought Prompting (CoT) reinforces the reasoning capabilities of Large Language Models (LLMs) through the generation of intermediate rationales. However, these enhancements predominantly benefit large-scale models, leaving small LMs without significant performance improvements when directly applying CoT. Despite the advanced reasoning capabilities of LLMs, CoT relies primarily on their pre-trained internal knowledge. The external knowledge that is previously unknown to the model remains unexploited. This omission becomes pronounced in tasks such as stance detection, where the external background knowledge plays a pivotal role. Additionally, the large-scale architecture of LLMs inevitably present efficiency challenges during deployment. To address these challenges, we introduce the Ladder-of-Thought (LoT) for stance detection. Grounded in a dual-phase Cascaded Optimization framework, LoT directs the model to incorporate high-quality external knowledge, enhancing the intermediate rationales it generates. These bolstered rationales subsequently serve as the foundation for more precise predictions - akin to how a ladder facilitates reaching elevated goals. LoT achieves a balance between efficiency and accuracy, making it an adaptable and efficient framework for stance detection. Our empirical evaluations underscore LoT's effectiveness, marking a 16% improvement over ChatGPT and a 10% enhancement compared to ChatGPT with CoT.
    摘要 思维链提示(CoT)可以通过生成中间推理来增强大语言模型(LLM)的推理能力,但这些改进主要惠及大规模模型,小型 LM 直接应用 CoT 难以获得显著的性能提升。尽管 LLM 具备较强的推理能力,CoT 仍主要依赖其预训练的内部知识,而模型尚未掌握的外部知识则未被利用。这种欠缺在立场检测等任务中尤为明显,因为外部背景知识在其中扮演着关键角色。此外,LLM 的大规模架构在部署时也不可避免地带来效率挑战。为了解决这些挑战,我们提出了用于立场检测的思维阶梯(LoT)。LoT 基于双阶段级联优化框架,引导模型融入高质量的外部知识,从而增强其生成的中间推理。这些得到强化的推理随后成为更精确预测的基础,正如梯子帮助人达到更高的目标。LoT 在效率和准确性之间取得平衡,使其成为适应性强且高效的立场检测框架。我们的实验证明了 LoT 的有效性:相比 ChatGPT 提升了 16%,相比使用 CoT 的 ChatGPT 提升了 10%。
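Schematically, the two-phase flow could be wired up as below; `llm` is a hypothetical stand-in for any text-generation call, and both prompts are illustrative, not the paper's templates.

```python
# Sketch of the two-phase Ladder-of-Thought flow described above.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def ladder_of_thought(text: str, target: str, background: str) -> str:
    # Phase 1: generate a knowledge-grounded intermediate rationale,
    # conditioning on high-quality external background knowledge.
    rationale = llm(
        f"Background knowledge:\n{background}\n\n"
        f"Text: {text}\nExplain the author's view on '{target}' step by step."
    )
    # Phase 2: predict the stance from the strengthened rationale.
    return llm(
        f"Rationale:\n{rationale}\n\n"
        f"Is the stance on '{target}' FAVOR, AGAINST, or NONE? One word."
    )
```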

Context Aware Query Rewriting for Text Rankers using LLM

  • paper_url: http://arxiv.org/abs/2308.16753
  • repo_url: None
  • paper_authors: Abhijit Anand, Venktesh V, Vinay Setty, Avishek Anand
  • for: 解决文档排序任务中查询与文档之间的词汇不匹配问题。
  • methods: 提出上下文感知查询重写(CAR),仅在训练阶段利用大语言模型(LLM)并以相关文档为上下文来重写含糊查询,再用重写后的查询微调排序器。
  • results: 使用重写后的查询微调排序器,在段落排序任务上相比原始查询基线最多提升 33%,在文档排序任务上最多提升 28%。
    Abstract Query rewriting refers to an established family of approaches that are applied to underspecified and ambiguous queries to overcome the vocabulary mismatch problem in document ranking. Queries are typically rewritten during query processing time for better query modelling for the downstream ranker. With the advent of large-language models (LLMs), there have been initial investigations into using generative approaches to generate pseudo documents to tackle this inherent vocabulary gap. In this work, we analyze the utility of LLMs for improved query rewriting for text ranking tasks. We find that there are two inherent limitations of using LLMs as query re-writers -- concept drift when using only queries as prompts and large inference costs during query processing. We adopt a simple, yet surprisingly effective, approach called context aware query rewriting (CAR) to leverage the benefits of LLMs for query understanding. Firstly, we rewrite ambiguous training queries by context-aware prompting of LLMs, where we use only relevant documents as context.Unlike existing approaches, we use LLM-based query rewriting only during the training phase. Eventually, a ranker is fine-tuned on the rewritten queries instead of the original queries during training. In our extensive experiments, we find that fine-tuning a ranker using re-written queries offers a significant improvement of up to 33% on the passage ranking task and up to 28% on the document ranking task when compared to the baseline performance of using original queries.
    摘要 查询重写是一类成熟的方法,用于处理欠明确和含糊的查询,以克服文档排序中的词汇不匹配问题。查询通常在查询处理阶段被重写,以便为下游排序器提供更好的查询建模。随着大语言模型(LLM)的出现,已有初步研究尝试使用生成式方法生成伪文档,以弥补这种固有的词汇差距。在本工作中,我们分析了利用 LLM 改进查询重写以服务文本排序任务的效用。我们发现将 LLM 用作查询重写器存在两个固有限制:一是仅以查询作为提示时的概念漂移;二是查询处理期间高昂的推理成本。为此,我们采用了一种简单却出人意料地有效的方法,即上下文感知查询重写(CAR),以利用 LLM 的优势来更好地理解查询。首先,我们通过上下文感知的 LLM 提示来重写含糊的训练查询,其中仅使用相关文档作为上下文。与现有方法不同,我们仅在训练阶段使用基于 LLM 的查询重写;最终,排序器在训练时使用重写后的查询(而非原始查询)进行微调。在大量实验中,我们发现使用重写后的查询微调排序器,在段落排序任务上相比使用原始查询的基线最多提升 33%,在文档排序任务上最多提升 28%。
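A sketch of the training-time-only rewriting loop, with a hypothetical `llm` stand-in and an illustrative prompt, could look like this:

```python
# Ambiguous training queries are rewritten once, offline, with relevant
# documents as context; only the ranker ever sees the rewritten queries.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def rewrite_query(query: str, relevant_docs: list) -> str:
    context = "\n".join(relevant_docs[:3])       # context-aware prompting
    return llm(f"Context:\n{context}\n\n"
               f"Rewrite this ambiguous query so it is specific: {query}")

def build_training_set(examples):
    # examples: iterable of (query, relevant_docs, label). Because rewriting
    # happens only here, no LLM inference cost is paid at query time.
    return [(rewrite_query(q, docs), docs, y) for q, docs, y in examples]
```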

Socratis: Are large multimodal models emotionally aware?

  • paper_url: http://arxiv.org/abs/2308.16741
  • repo_url: None
  • paper_authors: Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko
  • for: 提高 Multimodal 语言模型对情感的认知和生成能力
  • methods: 使用多种情感标签和理由描述来评估模型的表现
  • results: 人类对人写的理由更加喜欢,而不是机器生成的理由,而且现有的captioning metric不能与人类喜好相吻合
    Abstract Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a \underline{soc}ietal \underline{r}e\underline{a}c\underline{ti}on\underline{s} benchmark, where each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 image-caption pairs from 5 widely-read news and image-caption (IC) datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons over 2 times more often than machine-generated ones. This shows our task is harder than standard generation tasks because it starkly contrasts recent findings where humans cannot tell apart machine vs human-written news articles, for instance. We further see that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.
    摘要 现有的情绪预测 benchmark 包含粗糙的情绪标签,不考虑图文内容对人类的多样化情绪响应。学习多样化情绪对图文内容是重要的,因为智能机器在生成和传递内容方面发挥了中心作用。为解决这个差距,我们提议了 Socratis benchmark,每个图文笔记 (IC) 对象被标注为多种情绪和其原因。Socratis 包含 18,000 个自由格式的反应,用于 980 种情绪的 2,075 个图文笔记对。我们使用现代大语言模型测试能否生成情绪的原因,并观察到人类更加偏好人工写的原因,相比于机器生成的原因。此外,我们还发现现有的captioning metric 基于大视语言模型并不与人类偏好相关。我们希望这些发现和我们的 benchmark 能够激发更多的情绪意识模型训练研究。

Robust Networked Federated Learning for Localization

  • paper_url: http://arxiv.org/abs/2308.16737
  • repo_url: None
  • paper_authors: Reza Mirzaeifard, Naveen K. D. Venkategowda, Stefan Werner
  • For: 本研究旨在解决联邦学习环境中的定位问题;该问题本质上是非凸且非平滑的,且数据分布在大量设备上。
  • Methods: 我们提出一种在分布式次梯度框架内采用 $L_1$ 范数鲁棒形式的方法,以应对联邦学习环境中的异常数据问题。
  • Results: 我们的方法能够收敛到驻点,并在实验(尤其是异常值较多的环境)中超越现有的定位方法。
    Abstract This paper addresses the problem of localization, which is inherently non-convex and non-smooth in a federated setting where the data is distributed across a multitude of devices. Due to the decentralized nature of federated environments, distributed learning becomes essential for scalability and adaptability. Moreover, these environments are often plagued by outlier data, which presents substantial challenges to conventional methods, particularly in maintaining estimation accuracy and ensuring algorithm convergence. To mitigate these challenges, we propose a method that adopts an $L_1$-norm robust formulation within a distributed sub-gradient framework, explicitly designed to handle these obstacles. Our approach addresses the problem in its original form, without resorting to iterative simplifications or approximations, resulting in enhanced computational efficiency and improved estimation accuracy. We demonstrate that our method converges to a stationary point, highlighting its effectiveness and reliability. Through numerical simulations, we confirm the superior performance of our approach, notably in outlier-rich environments, which surpasses existing state-of-the-art localization methods.
    摘要
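For intuition, here is a single-node sketch of an $L_1$-robust localization objective solved by subgradient descent; in the paper's federated setting this update would be distributed across devices, and the anchors, injected outlier, and step sizes below are illustrative.

```python
import numpy as np

# Minimize sum_i | ||x - a_i|| - d_i | over an unknown position x,
# given anchor positions a_i and (noisy, outlier-prone) range measurements d_i.
rng = np.random.default_rng(1)
anchors = rng.uniform(0, 10, size=(6, 2))        # known anchor positions
x_true = np.array([4.0, 7.0])
d = np.linalg.norm(anchors - x_true, axis=1)
d[0] += 5.0                                      # one outlier measurement

x = np.zeros(2)
for t in range(1, 2001):
    g = np.zeros(2)
    for a, di in zip(anchors, d):
        r = np.linalg.norm(x - a)
        if r > 1e-9:
            g += np.sign(r - di) * (x - a) / r   # subgradient of |r - d_i|
    x -= (1.0 / np.sqrt(t)) * g                  # diminishing step size
print("estimate:", x.round(2), "true:", x_true)
```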

Post-Deployment Adaptation with Access to Source Data via Federated Learning and Source-Target Remote Gradient Alignment

  • paper_url: http://arxiv.org/abs/2308.16735
  • repo_url: https://github.com/felixwag/staralign
  • paper_authors: Felix Wagner, Zeju Li, Pramit Saha, Konstantinos Kamnitsas
  • for: 本文旨在解决深度神经网络在医学影像部署中的分布偏移问题,特别是在仅有少量标注或完全无标注目标数据的部署后适应(PDA)场景下。
  • methods: 本文提出了一种新的适应框架——FedPDA,它利用联邦学习中的远程梯度交换来帮助已部署的模型适应目标数据分布。此外,文章还提出了一种新的优化方法 StarAlign,用于对齐源数据与目标数据之间的梯度,以便学习一个面向目标域的模型。
  • results: 文章通过使用多个医疗机构的数据库进行肿瘤检测和皮肤病分类任务,证明了 StarAlign 方法的有效性,与之前的工作相比,其表现更好。
    Abstract Deployment of Deep Neural Networks in medical imaging is hindered by distribution shift between training data and data processed after deployment, causing performance degradation. Post-Deployment Adaptation (PDA) addresses this by tailoring a pre-trained, deployed model to the target data distribution using limited labelled or entirely unlabelled target data, while assuming no access to source training data as they cannot be deployed with the model due to privacy concerns and their large size. This makes reliable adaptation challenging due to limited learning signal. This paper challenges this assumption and introduces FedPDA, a novel adaptation framework that brings the utility of learning from remote data from Federated Learning into PDA. FedPDA enables a deployed model to obtain information from source data via remote gradient exchange, while aiming to optimize the model specifically for the target domain. Tailored for FedPDA, we introduce a novel optimization method StarAlign (Source-Target Remote Gradient Alignment) that aligns gradients between source-target domain pairs by maximizing their inner product, to facilitate learning a target-specific model. We demonstrate the method's effectiveness using multi-center databases for the tasks of cancer metastases detection and skin lesion classification, where our method compares favourably to previous work. Code is available at: https://github.com/FelixWag/StarAlign
    摘要 深度神经网络在医学影像领域的部署面临训练数据与部署后数据之间的分布偏移问题,导致性能下降。部署后适应(PDA)通过使用有限的标注或完全无标注的目标数据,将已部署的预训练模型调整到目标数据分布;同时假设无法访问源训练数据,因为出于隐私考虑和数据规模,源数据无法随模型一同部署。这使得可靠的适应变得困难,因为学习信号有限。本文挑战了这一假设,提出了 FedPDA——一种新的适应框架,它将联邦学习中从远程数据学习的能力引入 PDA。FedPDA 使已部署的模型能够通过远程梯度交换获取源数据的信息,同时针对目标领域优化模型。为了配合 FedPDA,我们提出了一种新的优化方法 StarAlign(源-目标远程梯度对齐),它通过最大化源-目标域对之间梯度的内积来对齐梯度,以便学习一个面向目标域的模型。我们使用多中心数据库在癌症转移检测和皮肤病变分类任务上验证了该方法的有效性,其表现优于先前的工作。代码可在 GitHub 上找到:https://github.com/FelixWag/StarAlign。
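A local sketch of the gradient-alignment idea (in PyTorch, with placeholder model and losses; the actual method exchanges gradients remotely between source and target sites) might look like this:

```python
import torch

# Encourage source and target gradients to point in similar directions by
# maximizing their inner product, while still fitting the target loss.
def star_align_loss(model, loss_src, loss_tgt, weight=0.1):
    g_src = torch.autograd.grad(loss_src, model.parameters(), create_graph=True)
    g_tgt = torch.autograd.grad(loss_tgt, model.parameters(), create_graph=True)
    inner = sum((gs * gt).sum() for gs, gt in zip(g_src, g_tgt))
    return loss_tgt - weight * inner   # fit target while aligning gradients

model = torch.nn.Linear(4, 2)
xs, ys = torch.randn(8, 4), torch.randint(0, 2, (8,))   # "source" batch
xt, yt = torch.randn(8, 4), torch.randint(0, 2, (8,))   # "target" batch
ce = torch.nn.CrossEntropyLoss()
loss = star_align_loss(model, ce(model(xs), ys), ce(model(xt), yt))
loss.backward()
print("aligned objective:", float(loss))
```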

Proof of Deep Learning: Approaches, Challenges, and Future Directions

  • paper_url: http://arxiv.org/abs/2308.16730
  • repo_url: None
  • paper_authors: Mahmoud Salhab, Khaleel Mershad
  • for: 本研究主要旨在调查各种Proof of Deep Learning(PoDL)机制,了解它们的优缺点,以及它们在不同应用场景中的可能性。
  • methods: 本研究通过文献综述的方式,梳理并分析各类 PoDL 算法,并讨论其实现挑战与未来研究方向。
  • results: 本研究结果显示,PoDL机制可以充分利用计算能力,同时保持区块链的安全性和完整性。但是,PoDL还需要进一步的研究和开发,以便在实际应用中得到更好的效果。
    Abstract The rise of computational power has led to unprecedented performance gains for deep learning models. As more data becomes available and model architectures become more complex, the need for more computational power increases. On the other hand, since the introduction of Bitcoin as the first cryptocurrency and the establishment of the concept of blockchain as a distributed ledger, many variants and approaches have been proposed. However, many of them have one thing in common, which is the Proof of Work (PoW) consensus mechanism. PoW is mainly used to support the process of new block generation. While PoW has proven its robustness, its main drawback is that it requires a significant amount of processing power to maintain the security and integrity of the blockchain. This is due to applying brute force to solve a hashing puzzle. To utilize the computational power available in useful and meaningful work while keeping the blockchain secure, many techniques have been proposed, one of which is known as Proof of Deep Learning (PoDL). PoDL is a consensus mechanism that uses the process of training a deep learning model as proof of work to add new blocks to the blockchain. In this paper, we survey the various approaches for PoDL. We discuss the different types of PoDL algorithms, their advantages and disadvantages, and their potential applications. We also discuss the challenges of implementing PoDL and future research directions.
    摘要 计算能力的提升为深度学习模型带来了前所未有的性能增益。随着可用数据的增多和模型架构的日益复杂,对计算能力的需求也在增加。另一方面,自比特币作为第一种加密货币出现、区块链作为分布式账本的概念确立以来,人们提出了许多变体和方法。然而,其中大多数有一个共同之处,即工作量证明(PoW)共识机制。PoW 主要用于支持新区块的生成。虽然 PoW 已经证明了其鲁棒性,但其主要缺点是需要大量的处理能力来维护区块链的安全性和完整性,这是因为它通过暴力方式求解哈希谜题。为了在保持区块链安全的同时,将可用的计算能力用于有用且有意义的工作,人们提出了许多技术,其中之一便是深度学习证明(PoDL)。PoDL 是一种以训练深度学习模型的过程作为工作量证明来向区块链添加新区块的共识机制。在本文中,我们综述了 PoDL 的各种方法,讨论了不同类型的 PoDL 算法及其优缺点和潜在应用,并探讨了实现 PoDL 的挑战和未来研究方向。
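As a toy sketch of the PoDL idea — a block carries a record of useful training work instead of a brute-forced nonce — consider the following; the block fields and the accuracy-target rule are illustrative, and real schemes re-evaluate the submitted model on held-out data rather than trusting a claimed accuracy.

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(prev_hash: str, weights_digest: str, test_accuracy: float) -> dict:
    # The "proof" records the training work: a digest of the trained weights
    # and the model's measured accuracy on an agreed test set.
    return {
        "prev": prev_hash,
        "timestamp": time.time(),
        "proof": {"weights": weights_digest, "accuracy": test_accuracy},
    }

def validate(block: dict, accuracy_target: float = 0.9) -> bool:
    return block["proof"]["accuracy"] >= accuracy_target

block = make_block("0" * 64, weights_digest="d41d8cd9", test_accuracy=0.93)
print(validate(block), block_hash(block)[:16])
```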

Terrain Diffusion Network: Climatic-Aware Terrain Generation with Geological Sketch Guidance

  • paper_url: http://arxiv.org/abs/2308.16725
  • repo_url: None
  • paper_authors: Zexin Hu, Kun Hu, Clinton Mo, Lei Pan, Zhiyong Wang
  • for: 本文旨在提出一种新的地形扩散网络(TDN),用于生成更加真实的地形,并提供更高程度的用户可控性。
  • methods: 该方法采用多级去噪方案,并以用户指导为条件,兼顾河流、山脊、盆地、山峰等地形特征,以确保生成的地形既真实又符合用户意图;此外,还利用预训练的地形自编码器引入潜在空间,以提高生成效率。
  • results: 在基于 NASA Topology Images 构建的新数据集上的大量实验表明,该方法达到了最先进的性能,能够生成更加真实的地形。
    Abstract Sketch-based terrain generation seeks to create realistic landscapes for virtual environments in various applications such as computer games, animation and virtual reality. Recently, deep learning based terrain generation has emerged, notably the ones based on generative adversarial networks (GAN). However, these methods often struggle to fulfill the requirements of flexible user control and maintain generative diversity for realistic terrain. Therefore, we propose a novel diffusion-based method, namely terrain diffusion network (TDN), which actively incorporates user guidance for enhanced controllability, taking into account terrain features like rivers, ridges, basins, and peaks. Instead of adhering to a conventional monolithic denoising process, which often compromises the fidelity of terrain details or the alignment with user control, a multi-level denoising scheme is proposed to generate more realistic terrains by taking into account fine-grained details, particularly those related to climatic patterns influenced by erosion and tectonic activities. Specifically, three terrain synthesisers are designed for structural, intermediate, and fine-grained level denoising purposes, which allow each synthesiser concentrate on a distinct terrain aspect. Moreover, to maximise the efficiency of our TDN, we further introduce terrain and sketch latent spaces for the synthesizers with pre-trained terrain autoencoders. Comprehensive experiments on a new dataset constructed from NASA Topology Images clearly demonstrate the effectiveness of our proposed method, achieving the state-of-the-art performance. Our code and dataset will be publicly available.
    摘要 基于草图的地形生成旨在为电子游戏、动画和虚拟现实等应用中的虚拟环境创建逼真的地貌。近年来,基于深度学习的地形生成方法不断涌现,其中以基于生成对抗网络(GAN)的方法为代表。然而,这些方法往往难以同时满足灵活的用户控制并保持生成多样性,以得到逼真的地形。为此,我们提出了一种新的基于扩散的方法,即地形扩散网络(TDN),它主动融入用户指导以增强可控性,并考虑河流、山脊、盆地和山峰等地形特征。不同于传统的单一去噪过程(这种过程往往会损害地形细节的保真度或与用户控制的一致性),我们提出了多级去噪方案,通过考虑细粒度细节(尤其是受侵蚀和构造活动影响的气候模式相关细节)来生成更逼真的地形。具体而言,我们设计了三个分别面向结构级、中间级和细粒度级去噪的地形合成器,使每个合成器专注于一个特定的地形层面。此外,为了最大化 TDN 的效率,我们进一步为各合成器引入了基于预训练地形自编码器的地形与草图潜在空间。在基于 NASA Topology Images 构建的新数据集上的全面实验清楚地证明了所提方法的有效性,达到了最先进的性能。我们的代码和数据集将公开提供。

CReHate: Cross-cultural Re-annotation of English Hate Speech Dataset

  • paper_url: http://arxiv.org/abs/2308.16705
  • repo_url: None
  • paper_authors: Nayeon Lee, Chani Jung, Junho Myung, Jiho Jin, Juho Kim, Alice Oh
  • for: This paper aims to address cultural biases in hate speech detection models and datasets by introducing a cross-cultural re-annotation of the SBIC dataset and analyzing differences in perceptions of hate speech among individuals from five distinct countries.
  • methods: The paper uses a cross-cultural re-annotation of the SBIC dataset, which includes annotations from Australia, Singapore, South Africa, the United Kingdom, and the United States. The authors also employ transfer learning to develop a culturally sensitive hate speech classifier.
  • results: The authors find significant differences in the perception of hate speech among individuals from different countries, with only 59.4% of the samples achieving consensus among all countries. They also develop a culturally sensitive hate speech classifier that can capture the perspectives of different nationalities.
    Abstract English datasets predominantly reflect the perspectives of certain nationalities, which can lead to cultural biases in models and datasets. This is particularly problematic in tasks heavily influenced by subjectivity, such as hate speech detection. To delve into how individuals from different countries perceive hate speech, we introduce CReHate, a cross-cultural re-annotation of the sampled SBIC dataset. This dataset includes annotations from five distinct countries: Australia, Singapore, South Africa, the United Kingdom, and the United States. Our thorough statistical analysis highlights significant differences based on nationality, with only 59.4% of the samples achieving consensus among all countries. We also introduce a culturally sensitive hate speech classifier via transfer learning, adept at capturing perspectives of different nationalities. These findings underscore the need to re-evaluate certain aspects of NLP research, especially with regard to the nuanced nature of hate speech in the English language.
    摘要

Fault Injection and Safe-Error Attack for Extraction of Embedded Neural Network Models

  • paper_url: http://arxiv.org/abs/2308.16703
  • repo_url: None
  • paper_authors: Kevin Hector, Pierre-Alain Moellic, Mathieu Dumont, Jean-Max Dutertre
  • for: 本研究主要针对于嵌入式深度神经网络模型在IoT设备上的安全性问题,特别是模型抽取攻击。
  • methods: 本研究使用了常见的缺陷插入攻击策略——安全错误攻击(SEA)来实现模型抽取攻击。攻击者具有有限的训练数据访问权限。
  • results: 研究发现,使用约1500个手动设计的输入可以成功抽取嵌入式深度神经网络模型中的至少90%最重要比特数据,以训练一个与受害模型具有相似准确率的假模型。
    Abstract Model extraction emerges as a critical security threat with attack vectors exploiting both algorithmic and implementation-based approaches. The main goal of an attacker is to steal as much information as possible about a protected victim model, so that he can mimic it with a substitute model, even with a limited access to similar training data. Recently, physical attacks such as fault injection have shown worrying efficiency against the integrity and confidentiality of embedded models. We focus on embedded deep neural network models on 32-bit microcontrollers, a widespread family of hardware platforms in IoT, and the use of a standard fault injection strategy - Safe Error Attack (SEA) - to perform a model extraction attack with an adversary having a limited access to training data. Since the attack strongly depends on the input queries, we propose a black-box approach to craft a successful attack set. For a classical convolutional neural network, we successfully recover at least 90% of the most significant bits with about 1500 crafted inputs. These information enable to efficiently train a substitute model, with only 8% of the training dataset, that reaches high fidelity and near identical accuracy level than the victim model.
    摘要

Using Large Language Models to Automate Category and Trend Analysis of Scientific Articles: An Application in Ophthalmology

  • paper_url: http://arxiv.org/abs/2308.16688
  • repo_url: None
  • paper_authors: Hina Raja, Asim Munawar, Mohammad Delsoz, Mohammad Elahi, Yeganeh Madadi, Amr Hassan, Hashem Abu Serhan, Onur Inam, Luis Hermandez, Sang Tran, Wuqas Munir, Alaa Abd-Alrazaq, Hao Chen, SiamakYousefi
  • for: 本文旨在提出一种自动化文献分类方法,利用自然语言处理(NLP)技术和大语言模型(LLM);主要针对眼科领域,但模型可推广到其他领域。
  • methods: 该方法基于 NLP 技术,包括高级 ZSL LLM 模型,对科学论文的文本内容进行处理和分析。
  • results: 实验结果表明,LLM 可以高效地自动分类大量的眼科论文,无需人工干预。在 RenD 数据集上,模型达到了平均准确率 0.86 和平均 F1 分数 0.85。
    Abstract Purpose: In this paper, we present an automated method for article classification, leveraging the power of Large Language Models (LLM). The primary focus is on the field of ophthalmology, but the model is extendable to other fields. Methods: We have developed a model based on Natural Language Processing (NLP) techniques, including advanced LLMs, to process and analyze the textual content of scientific papers. Specifically, we have employed zero-shot learning (ZSL) LLM models and compared them against Bidirectional and Auto-Regressive Transformers (BART) and its variants, and Bidirectional Encoder Representations from Transformers (BERT) and its variants such as distilBERT, SciBERT, PubmedBERT, and BioBERT. Results: To evaluate the LLMs, we compiled a dataset (RenD) of 1000 ocular disease-related articles, which were expertly annotated by a panel of six specialists into 15 distinct categories. The classification results demonstrate the effectiveness of LLMs in categorizing a large number of ophthalmology papers without human intervention: the model achieved a mean accuracy of 0.86 and a mean F1 of 0.85 on the RenD dataset. Conclusion: The proposed framework achieves notable improvements in both accuracy and efficiency. Its application in the domain of ophthalmology showcases its potential for knowledge organization and retrieval in other domains too. We performed trend analysis that enables researchers and clinicians to easily categorize and retrieve relevant papers, saving time and effort in literature review and information gathering as well as identification of emerging scientific trends within different disciplines. Moreover, the extendibility of the model to other scientific fields broadens its impact in facilitating research and trend analysis across diverse disciplines.
    摘要 目的:在本文中,我们提出了一种自动化文章分类方法,利用大语言模型(LLM)的强大能力。研究主要聚焦于眼科领域,但该模型可扩展到其他领域。方法:我们开发了基于自然语言处理(NLP)技术的模型,采用零样本学习(ZSL)的 LLM 模型,并与 BART 及其变体、BERT 及其变体(如 distilBERT、SciBERT、PubmedBERT、BioBERT)进行了比较。结果:为评估这些 LLM,我们编制了一个包含1000篇眼部疾病相关文章的数据集(RenD),由6名专家组成的小组将其标注为15个不同类别。分类结果证明了 LLM 无需人工干预即可对大量眼科论文进行分类的有效性,模型在 RenD 数据集上取得了0.86的平均准确率和0.85的平均 F1 分数。结论:所提框架在准确率和效率上均有显著提升,其在眼科领域的应用展示了其在其他领域进行知识组织与检索的潜力。我们还进行了趋势分析,使研究者和临床医生能够轻松地分类和检索相关论文,在文献综述和信息收集中节省时间和精力,并识别不同学科中新兴的科学趋势。此外,模型对其他科学领域的可扩展性扩大了其在促进跨学科研究与趋势分析方面的影响。
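A hedged sketch of the zero-shot setup the paper benchmarks: Hugging Face's NLI-based zero-shot pipeline with a BART-MNLI checkpoint. The abstract text and the four candidate labels are illustrative stand-ins, not the paper's 15 RenD categories.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

abstract = ("We evaluate intraocular pressure changes after selective "
            "laser trabeculoplasty in patients with open-angle glaucoma.")
candidate_labels = ["glaucoma", "retina", "cornea", "pediatric ophthalmology"]

result = classifier(abstract, candidate_labels)
# labels come back sorted by score; the top one is the predicted category
print(result["labels"][0], round(result["scores"][0], 3))
```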

Everyone Can Attack: Repurpose Lossy Compression as a Natural Backdoor Attack

  • paper_url: http://arxiv.org/abs/2308.16684
  • repo_url: None
  • paper_authors: Sze Jue Yang, Quang Nguyen, Chee Seng Chan, Khoa Doan
  • for: 这个论文主要关注的是机器学习模型中的潜在攻击问题,具体来说是 silent backdoor 攻击。
  • methods: 这个论文使用了一种广泛使用的lossy图像压缩算法来实现攻击,而且这种攻击不需要特殊的技能和努力,只需要点击“转换”或“保存为”按钮即可。
  • results: 这个论文的实验结果表明,这种攻击可以在多个基准数据集上达到100%的攻击成功率;而且在干净标签设定下,仅需约10%的低投毒率即可达到接近100%的攻击成功率。
    Abstract The vulnerabilities to backdoor attacks have recently threatened the trustworthiness of machine learning models in practical applications. Conventional wisdom suggests that not everyone can be an attacker since the process of designing the trigger generation algorithm often involves significant effort and extensive experimentation to ensure the attack's stealthiness and effectiveness. Alternatively, this paper shows that there exists a more severe backdoor threat: anyone can exploit an easily-accessible algorithm for silent backdoor attacks. Specifically, this attacker can employ the widely-used lossy image compression from a plethora of compression tools to effortlessly inject a trigger pattern into an image without leaving any noticeable trace; i.e., the generated triggers are natural artifacts. One does not require extensive knowledge to click on the "convert" or "save as" button while using tools for lossy image compression. Via this attack, the adversary does not need to design a trigger generator as seen in prior works and only requires poisoning the data. Empirically, the proposed attack consistently achieves 100% attack success rate in several benchmark datasets such as MNIST, CIFAR-10, GTSRB and CelebA. More significantly, the proposed attack can still achieve almost 100% attack success rate with very small (approximately 10%) poisoning rates in the clean label setting. The generated trigger of the proposed attack using one lossy compression algorithm is also transferable across other related compression algorithms, exacerbating the severity of this backdoor threat. This work takes another crucial step toward understanding the extensive risks of backdoor attacks in practice, urging practitioners to investigate similar attacks and relevant backdoor mitigation methods.
    摘要 近年来,后门攻击对机器学习模型在实际应用中的可信度构成了严重威胁。传统观点认为并非人人都能发动攻击,因为设计触发器生成算法通常需要大量的精力和反复实验,以确保攻击的隐蔽性和有效性。然而,本文揭示了一种更为严重的后门威胁:任何人都可以利用一种唾手可得的算法发动无声的后门攻击。具体而言,攻击者可以利用众多压缩工具中广泛使用的有损图像压缩,毫不费力地将触发器模式注入图像而不留任何明显痕迹;也就是说,生成的触发器是自然的压缩伪影。使用这些工具时,只需点击"转换"或"另存为"按钮,无需任何专业知识。通过这种攻击,对手无需像以往工作那样设计触发器生成器,只需对数据进行投毒。实验表明,该攻击在 MNIST、CIFAR-10、GTSRB 和 CelebA 等多个基准数据集上稳定地达到了100%的攻击成功率。更重要的是,在干净标签设定下,即使投毒率很低(约10%),该攻击仍能达到接近100%的攻击成功率。使用某一种有损压缩算法生成的触发器还可以迁移到其他相关压缩算法,进一步加剧了这一后门威胁的严重性。这项工作朝着理解后门攻击在实践中的广泛风险又迈出了关键一步,敦促从业者研究类似攻击及相关的后门防御方法。
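Because the trigger is just the codec's own artifact pattern, poisoning a sample reduces to one lossy save. A minimal sketch (PIL; file paths and the quality setting are illustrative):

```python
from PIL import Image

def poison(src_path: str, dst_path: str, quality: int = 10) -> None:
    """Round-trip an image through heavy JPEG compression; the resulting
    compression artifacts act as the 'natural' backdoor trigger."""
    img = Image.open(src_path).convert("RGB")
    img.save(dst_path, format="JPEG", quality=quality)

poison("clean_sample.png", "poisoned_sample.jpg")
```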

Fault Injection on Embedded Neural Networks: Impact of a Single Instruction Skip

  • paper_url: http://arxiv.org/abs/2308.16665
  • repo_url: None
  • paper_authors: Clement Gaine, Pierre-Alain Moellic, Olivier Potin, Jean-Max Dutertre
  • for: 这篇论文的目的是为了研究基于32位微控制器平台的神经网络模型的安全性,并通过电磁干扰和激光干扰来模拟硬件干扰的影响。
  • methods: 该论文使用了电磁注入和激光注入两种故障注入方式,并在 Cortex M4 32位微控制器平台上进行了实验。与大多数现有的针对内部参数或输入值篡改的工作不同,该论文的目标是模拟并实验验证指令跳过这一特定故障模型对神经网络推理控制流的影响。
  • results: 该论文发现了一些修改攻击的潜在威胁,可以让攻击者通过修改神经网络模型的控制流来改变模型的预测结果,并且可以根据不同的恶意目标来选择合适的攻击方法。
    Abstract With the large-scale integration and use of neural network models, especially in critical embedded systems, their security assessment to guarantee their reliability is becoming an urgent need. More particularly, models deployed in embedded platforms, such as 32-bit microcontrollers, are physically accessible by adversaries and therefore vulnerable to hardware disturbances. We present the first set of experiments on the use of two fault injection means, electromagnetic and laser injections, applied on neural networks models embedded on a Cortex M4 32-bit microcontroller platform. Contrary to most of state-of-the-art works dedicated to the alteration of the internal parameters or input values, our goal is to simulate and experimentally demonstrate the impact of a specific fault model that is instruction skip. For that purpose, we assessed several modification attacks on the control flow of a neural network inference. We reveal integrity threats by targeting several steps in the inference program of typical convolutional neural network models, which may be exploited by an attacker to alter the predictions of the target models with different adversarial goals.
    摘要 随着神经网络模型的大规模集成和应用,特别是在关键嵌入式系统中,为保证其可靠性而进行安全评估已成为迫切需求。尤其是部署在嵌入式平台(如32位微控制器)上的模型,攻击者可以物理接触,因而容易受到硬件干扰的影响。我们首次展示了在 Cortex M4 32位微控制器平台上,对嵌入式神经网络模型实施电磁注入和激光注入两种故障注入手段的实验。与大多数针对内部参数或输入值篡改的现有工作不同,我们的目标是模拟并在实验上验证一种特定故障模型——指令跳过——的影响。为此,我们评估了针对神经网络推理控制流的多种篡改攻击。我们通过瞄准典型卷积神经网络模型推理程序中的多个步骤,揭示了完整性威胁;攻击者可利用这些威胁,以不同的对抗目标篡改目标模型的预测结果。
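A toy software simulation (not the authors' electromagnetic/laser setup) of the instruction-skip fault model: here the skipped "instruction" is the activation call of one layer, one plausible way a control-flow skip could surface in an inference program, and it can be enough to flip the predicted class.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)
x = rng.normal(size=8)

def infer(x, skip_relu=False):
    h = W1 @ x + b1
    if not skip_relu:            # the injected fault skips this instruction
        h = np.maximum(h, 0.0)
    return int(np.argmax(W2 @ h + b2))

print("clean prediction:  ", infer(x))
print("faulted prediction:", infer(x, skip_relu=True))  # may differ
```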

Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering

  • paper_url: http://arxiv.org/abs/2308.16622
  • repo_url: None
  • paper_authors: Lars-Peter Meyer, Johannes Frey, Kurt Junghanns, Felix Brei, Kirill Bulert, Sabine Gründer-Fahrer, Michael Martin
  • for: 本研究旨在评估和监测大语言模型(LLMs)的性能,特别是在知识图工程(KGE)领域。
  • methods: 本研究提出了一个基准框架,包括三个挑战,用于测试LLMs的 sintaxis和错误 corrections、事实提取和数据集生成能力。
  • results: 研究发现,当使用零 shot 提示时,LLMs 对知识图生成仍然不具备能力,因此提出了一个LLM-KG-Bench框架,用于自动评估和存储 LLM 响应,以及统计数据和视觉化工具来支持提问工程和模型性能跟踪。
    Abstract As the field of Large Language Models (LLMs) evolves at an accelerated pace, the critical need to assess and monitor their performance emerges. We introduce a benchmarking framework focused on knowledge graph engineering (KGE) accompanied by three challenges addressing syntax and error correction, facts extraction and dataset generation. We show that while being a useful tool, LLMs are yet unfit to assist in knowledge graph generation with zero-shot prompting. Consequently, our LLM-KG-Bench framework provides automatic evaluation and storage of LLM responses as well as statistical data and visualization tools to support tracking of prompt engineering and model performance.
    摘要 随着大语言模型(LLM)领域的加速发展,评估和监测其性能的需求日益突出。我们提出了一个专注于知识图工程(KGE)的基准测试框架,并设计了三个挑战,分别针对语法与错误修正、事实提取和数据集生成。我们发现,尽管 LLM 是一个有用的工具,但在零样本提示下它们尚不足以胜任知识图生成。因此,我们的 LLM-KG-Bench 框架提供了对 LLM 响应的自动评估与存储,以及统计数据和可视化工具,以支持提示工程和模型性能追踪。
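A hedged sketch of the kind of automatic check such a benchmark can apply to the syntax challenge: parse the LLM's Turtle answer with rdflib and record whether it is well-formed (names and the sample answer are illustrative, not the benchmark's actual harness).

```python
from rdflib import Graph

def turtle_is_valid(llm_answer: str) -> bool:
    """Return True iff the answer parses as syntactically valid Turtle."""
    try:
        Graph().parse(data=llm_answer, format="turtle")
        return True
    except Exception:
        return False

answer = "@prefix ex: <http://example.org/> . ex:alice ex:knows ex:bob ."
print(turtle_is_valid(answer))  # True
```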

High Accuracy Location Information Extraction from Social Network Texts Using Natural Language Processing

  • paper_url: http://arxiv.org/abs/2308.16615
  • repo_url: None
  • paper_authors: Lossan Bonde, Severin Dembele
  • for: 这篇论文是为了预测恐怖活动的目的。
  • methods: 这篇论文使用了社交媒体文本来提取必要的信息,以建立一个适合的数据集来预测恐怖活动。
  • results: 实验表明,现有的解决方案具有低精度,而我们的解决方案可以准确地识别地点信息。
    Abstract Terrorism has become a worldwide plague with severe consequences for the development of nations. Besides killing innocent people daily and preventing educational activities from taking place, terrorism is also hindering economic growth. Machine Learning (ML) and Natural Language Processing (NLP) can contribute to fighting terrorism by predicting in real-time future terrorist attacks if accurate data is available. This paper is part of a research project that uses text from social networks to extract necessary information to build an adequate dataset for terrorist attack prediction. We collected a set of 3000 social network texts about terrorism in Burkina Faso and used a subset to experiment with existing NLP solutions. The experiment reveals that existing solutions have poor accuracy for location recognition, which our solution resolves. We will extend the solution to extract dates and action information to achieve the project's goal.
    摘要 恐怖主义已成为全球的恶性疾病,对国家发展造成严重的影响。除了每天杀害无辜的人和破坏教育活动外,恐怖主义还妨碍经济增长。机器学习(ML)和自然语言处理(NLP)可以帮助斗争恐怖主义,预测未来恐怖袭击的可能性,只要有准确的数据。这篇论文是一项研究项目的一部分,使用社交媒体文本提取必要的信息建立恐怖袭击预测数据集。我们收集了3000个社交媒体文本关于恐怖主义在布基纳法索的样本,使用一个子集进行了现有NLP解决方案的实验。实验表明,现有的解决方案在位置识别方面有较差的准确率,我们的解决方案可以解决这个问题。我们将延续解决方案,以提取日期和动作信息,实现项目的目标。
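A minimal sketch (assuming spaCy and its small English model are installed) of the kind of off-the-shelf location recognition the paper evaluates against: extracting GPE/LOC entities from a social-network post.

```python
import spacy

# one-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("An attack was reported near Ouagadougou in Burkina Faso on Friday.")

locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(locations)  # e.g. ['Ouagadougou', 'Burkina Faso']
```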

Towards Long-Tailed Recognition for Graph Classification via Collaborative Experts

  • paper_url: http://arxiv.org/abs/2308.16609
  • repo_url: None
  • paper_authors: Siyu Yi, Zhengyang Mao, Wei Ju, Yongdao Zhou, Luchen Liu, Xiao Luo, Ming Zhang
  • for: 本文旨在为图级分类学习有效的图级表示,并在呈长尾分布的图数据上进行分类。
  • methods: 本文提出了一种基于协作多专家学习的长尾图级分类框架,包括均衡对比表示学习、基于困难类别挖掘的个体专家分类器训练,以及专家间的门控融合与解耦知识蒸馏。
  • results: 在七个广泛使用的基准数据集上的对比实验表明,我们的方法 CoMe 优于最先进的基线,在长尾分布下的图级分类任务中表现出色。
    Abstract Graph classification, which aims at learning graph-level representations for effective class assignments, has achieved outstanding results that heavily rely on high-quality datasets with balanced class distributions. In fact, most real-world graph data naturally presents a long-tailed form, where the head classes occupy many more samples than the tail classes; it is thus essential to study graph-level classification over long-tailed data, which remains largely unexplored. However, most existing long-tailed learning methods in vision fail to jointly optimize representation learning and classifier training, and neglect the mining of hard-to-classify classes. Directly applying existing methods to graphs may lead to sub-optimal performance, since a model trained on graphs is more sensitive to the long-tailed distribution due to complex topological characteristics. Hence, in this paper, we propose a novel long-tailed graph-level classification framework via Collaborative Multi-expert Learning (CoMe) to tackle the problem. To equilibrate the contributions of head and tail classes, we first develop balanced contrastive learning from the view of representation learning, and then design individual-expert classifier training based on hard class mining. In addition, we execute gated fusion and disentangled knowledge distillation among the multiple experts to promote collaboration in a multi-expert framework. Comprehensive experiments on seven widely-used benchmark datasets demonstrate the superiority of our method CoMe over state-of-the-art baselines.
    摘要 图分类旨在学习图级表示以实现有效的类别分配,目前已取得了突出的成果,但这在很大程度上依赖于类别分布均衡的高质量数据集。事实上,大多数真实世界的图数据天然呈长尾分布,头部类别占有的样本远多于尾部类别,因此研究长尾数据上的图级分类十分必要,但这一方向在很大程度上仍未被探索。然而,现有的视觉领域长尾学习方法大多未能同时优化表示学习与分类器训练,也忽视了对难分类类别的挖掘。将现有方法直接应用于图可能导致次优性能,因为由于复杂的拓扑特性,在图上训练的模型对长尾分布更为敏感。因此,本文提出了一种新颖的基于协作多专家学习(CoMe)的长尾图级分类框架来解决该问题。为平衡头部与尾部类别的贡献,我们首先从表示学习的角度设计了均衡对比学习,然后基于难类别挖掘设计了个体专家分类器训练。此外,我们在多个专家之间执行门控融合与解耦知识蒸馏,以促进多专家框架中的协作。我们在七个广泛使用的基准数据集上进行了全面实验,证明了 CoMe 方法优于最先进的基线方法。

The Quest of Finding the Antidote to Sparse Double Descent

  • paper_url: http://arxiv.org/abs/2308.16596
  • repo_url: None
  • paper_authors: Victor Quétu, Marta Milovanović
  • for: 本文目的是找到深度学习模型的最优规模,以在提升性能的同时避免稀疏双下降现象。
  • methods: 本文使用了一种简单的 $\ell_2$ 正则化方法和知识蒸馏学习方案来应对稀疏双下降现象。
  • results: 实验结果表明,使用这种方法可以避免稀疏双下降现象,并在图像分类任务中达到更好的性能。
    Abstract In energy-efficient schemes, finding the optimal size of deep learning models is very important and has a broad impact. Meanwhile, recent studies have reported an unexpected phenomenon, the sparse double descent: as the model's sparsity increases, the performance first worsens, then improves, and finally deteriorates. Such a non-monotonic behavior raises serious questions about the optimal model's size to maintain high performance: the model needs to be sufficiently over-parametrized, but having too many parameters wastes training resources. In this paper, we aim to find the best trade-off efficiently. More precisely, we tackle the occurrence of the sparse double descent and present some solutions to avoid it. Firstly, we show that a simple $\ell_2$ regularization method can help to mitigate this phenomenon but sacrifices the performance/sparsity compromise. To overcome this problem, we then introduce a learning scheme in which distilling knowledge regularizes the student model. Supported by experimental results achieved using typical image classification setups, we show that this approach leads to the avoidance of such a phenomenon.
    摘要 在注重能效的方案中,确定深度学习模型的最优规模非常重要,并具有广泛的影响。然而,最近的研究报告了一种出人意料的现象——稀疏双下降:随着模型稀疏度的增加,性能先恶化、再改善、最终再次恶化。这种非单调行为对维持高性能的最优模型规模提出了严峻问题:模型需要充分过参数化,但参数过多又会浪费训练资源。在本文中,我们的目标是高效地找到最佳折中。更具体地说,我们研究稀疏双下降现象的发生,并提出若干避免它的方案。首先,我们表明一种简单的 $\ell_2$ 正则化方法可以缓解这一现象,但会牺牲性能与稀疏度之间的折中。为克服这一问题,我们进而引入一种以知识蒸馏对学生模型进行正则化的学习方案。在典型图像分类设置下取得的实验结果表明,该方法能够避免稀疏双下降现象。
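A hedged PyTorch sketch of the two remedies discussed above, combined into one objective: a cross-entropy term, a knowledge-distillation term from a dense teacher, and an explicit $\ell_2$ penalty. Temperature, mixing weight, and penalty strength are illustrative.

```python
import torch
import torch.nn.functional as F

def regularized_kd_loss(student_logits, teacher_logits, targets, model,
                        T=4.0, alpha=0.5, lam=1e-4):
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T   # standard KD scaling
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return alpha * ce + (1 - alpha) * kd + lam * l2
```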

CL-MAE: Curriculum-Learned Masked Autoencoders

  • paper_url: http://arxiv.org/abs/2308.16572
  • repo_url: None
  • paper_authors: Neelu Madan, Nicolae-Catalin Ristea, Kamal Nasrollahi, Thomas B. Moeslund, Radu Tudor Ionescu
  • for: 提高自监督学习的表示学习能力
  • methods: 使用curriculum学习方法,逐渐增加masking策略的复杂度,从而训练模型学习更加复杂和可传播的表示
  • results: 训练CL-MAE模型在ImageNet上,并在五个下游任务上显示出优于MAE模型的表示学习能力
    Abstract Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that can be effectively generalized across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that possesses the capability to generate masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting its behavior during training, transitioning from a partner to the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), while passing through a neutral state. The transition between these behaviors is smooth, being regulated by a factor that is multiplied with the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders.
    摘要 掩码图像建模(Masked Image Modeling)已被证明是一种强大的前置任务,能够生成可有效泛化到多个下游任务的稳健表示。通常,这种方法在输入图像中随机掩盖图块(token),且掩码策略在训练过程中保持不变。在本文中,我们提出了一种课程学习方法,在训练过程中不断更新掩码策略,以逐步提高自监督重建任务的复杂度。我们推测,通过逐渐增加任务复杂度,模型可以学习到更加复杂且可迁移的表示。为此,我们提出了一种新颖的可学习掩码模块,能够生成不同复杂度的掩码,并将该模块集成到掩码自编码器(MAE)中。该模块与 MAE 联合训练,并在训练过程中调整其行为:从 MAE 的伙伴(优化相同的重建损失)经过中立状态过渡为对手(优化相反的损失)。这一行为转换是平滑的,由一个与掩码模块重建损失相乘的因子调节,由此产生了由易到难的课程。我们在 ImageNet 上训练了课程学习掩码自编码器(CL-MAE),并证明其表示学习能力优于 MAE。在五个下游任务上的实验结果证实了我们的推测,表明课程学习可成功用于掩码自编码器的自监督训练。
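A minimal sketch of the curriculum mechanism described above: the masking module's objective is the reconstruction loss scaled by a factor that moves smoothly from +1 (partner, same loss as the MAE) through 0 (neutral) to -1 (adversary, opposite loss). The linear schedule is an assumption for illustration; the paper only specifies a smooth, regulated transition.

```python
def curriculum_factor(step: int, total_steps: int) -> float:
    """+1 at the start (partner), ~0 midway (neutral), -1 at the end."""
    return 1.0 - 2.0 * step / max(total_steps - 1, 1)

# masking_module_loss = curriculum_factor(step, total) * reconstruction_loss
for step in (0, 2500, 4999):
    print(step, round(curriculum_factor(step, 5000), 3))
```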

The Power of MEME: Adversarial Malware Creation with Model-Based Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.16562
  • repo_url: https://github.com/stratosphereips/meme_malware_rl
  • paper_authors: Maria Rigaki, Sebastian Garcia
  • for: This paper is written for researchers and practitioners in the field of malware detection and defense, particularly those interested in the use of machine learning and automation for malware detection.
  • methods: The paper proposes a new algorithm called MEME (Malware Evasion and Model Extraction) attacks, which combines model-based reinforcement learning and adversarial modification of Windows executable binary samples to evade malware detection.
  • results: The paper evaluates the MEME algorithm against two state-of-the-art attacks in adversarial malware creation and shows that MEME outperforms the state-of-the-art methods in terms of evasion capabilities, producing evasive malware with an evasion rate in the range of 32-73%. The paper also shows that the surrogate models produced by MEME have a high agreement with the target models, with a prediction label agreement between 97-99%.
    Abstract Due to the proliferation of malware, defenders are increasingly turning to automation and machine learning as part of the malware detection tool-chain. However, machine learning models are susceptible to adversarial attacks, requiring the testing of model and product robustness. Meanwhile, attackers also seek to automate malware generation and evasion of antivirus systems, and defenders try to gain insight into their methods. This work proposes a new algorithm that combines Malware Evasion and Model Extraction (MEME) attacks. MEME uses model-based reinforcement learning to adversarially modify Windows executable binary samples while simultaneously training a surrogate model with a high agreement with the target model to evade. To evaluate this method, we compare it with two state-of-the-art attacks in adversarial malware creation, using three well-known published models and one antivirus product as targets. Results show that MEME outperforms the state-of-the-art methods in terms of evasion capabilities in almost all cases, producing evasive malware with an evasion rate in the range of 32-73%. It also produces surrogate models with a prediction label agreement with the respective target models between 97-99%. The surrogate could be used to fine-tune and improve the evasion rate in the future.
    摘要 由于恶意软件的泛滥,防御者越来越多地将自动化和机器学习纳入恶意软件检测工具链。然而,机器学习模型容易受到对抗攻击,因此需要测试模型和产品的鲁棒性。与此同时,攻击者也在设法自动化恶意软件生成并绕过杀毒系统,而防御者则试图洞察其手法。本工作提出了一种结合恶意软件规避与模型提取(MEME)攻击的新算法。MEME 利用基于模型的强化学习,对 Windows 可执行二进制样本进行对抗性修改,同时训练一个与目标模型高度一致的替代模型以实现规避。为评估该方法,我们将其与对抗性恶意软件生成领域的两种最先进攻击进行比较,以三个知名的公开模型和一个杀毒产品为目标。结果表明,MEME 在几乎所有情况下的规避能力都优于最先进方法,生成的规避性恶意软件的规避率在32%-73%之间;其生成的替代模型与相应目标模型的预测标签一致率在97%-99%之间。该替代模型未来可用于微调并进一步提升规避率。

On a Connection between Differential Games, Optimal Control, and Energy-based Models for Multi-Agent Interactions

  • paper_url: http://arxiv.org/abs/2308.16539
  • repo_url: None
  • paper_authors: Christopher Diehl, Tobias Klosek, Martin Krüger, Nils Murzyn, Torsten Bertram
  • for: This paper is written for modeling multi-agent interactions in real-world robotics applications using game theory.
  • methods: The paper uses a combination of differential games, optimal control, and energy-based models to address challenges in applying game theory to real-world robotics.
  • results: The paper introduces a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, and demonstrates empirical evidence that the game-theoretic layer improves the predictive performance of various neural network backbones using simulated mobile robot pedestrian interactions and real-world automated driving data.
  • for: 这篇论文是为了模型多智能体交互在现实世界机器人应用中使用游戏理论。
  • methods: 这篇论文使用了分析游戏、优化控制和能量基本模型来解决在实际世界机器人应用中应用游戏理论的挑战。
  • results: 这篇论文介绍了一种新的综合学习应用程序,将神经网络用于游戏参数推理和可微游戏理论优化层,并通过模拟移动机器人人行交互和实际自动驾驶数据进行了实验,证明了游戏理论层可以提高各种神经网络背景的预测性能。
    Abstract Game theory offers an interpretable mathematical framework for modeling multi-agent interactions. However, its applicability in real-world robotics applications is hindered by several challenges, such as unknown agents' preferences and goals. To address these challenges, we show a connection between differential games, optimal control, and energy-based models and demonstrate how existing approaches can be unified under our proposed Energy-based Potential Game formulation. Building upon this formulation, this work introduces a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, acting as an inductive bias. The experiments using simulated mobile robot pedestrian interactions and real-world automated driving data provide empirical evidence that the game-theoretic layer improves the predictive performance of various neural network backbones.
    摘要 博弈论为多智能体交互建模提供了一个可解释的数学框架。然而,智能体偏好和目标未知等诸多挑战阻碍了其在真实世界机器人应用中的落地。为应对这些挑战,我们展示了微分博弈、最优控制与基于能量的模型之间的联系,并说明现有方法如何在我们提出的基于能量的势博弈(Energy-based Potential Game)表述下得到统一。在此表述基础上,本工作提出了一种新的端到端学习应用,将用于博弈参数推断的神经网络与一个充当归纳偏置的可微博弈论优化层相结合。基于仿真移动机器人与行人交互以及真实世界自动驾驶数据的实验提供了经验证据,表明博弈论层能够提升多种神经网络骨干的预测性能。

The AI Revolution: Opportunities and Challenges for the Finance Sector

  • paper_url: http://arxiv.org/abs/2308.16538
  • repo_url: None
  • paper_authors: Carsten Maple, Lukasz Szpruch, Gregory Epiphaniou, Kalina Staykova, Simran Singh, William Penwarden, Yisi Wen, Zijian Wang, Jagdish Hariharan, Pavle Avramovic
  • for: 本研究探讨了人工智能(AI)在金融领域的应用,描述了其可能性,并讨论了其挑战。
  • methods: 本研究使用了多种方法,包括客户服务改进、诈骗检测、风险管理和信贷评估等。
  • results: 本研究发现,AI在金融领域的应用可以提高客户服务质量、提高风险管理和信贷评估等方面的效率,但同时也存在许多挑战,如透明度、解释性、公平性和信任worthiness等问题。
    Abstract This report examines Artificial Intelligence (AI) in the financial sector, outlining its potential to revolutionise the industry and identify its challenges. It underscores the criticality of a well-rounded understanding of AI, its capabilities, and its implications to effectively leverage its potential while mitigating associated risks. The potential of AI potential extends from augmenting existing operations to paving the way for novel applications in the finance sector. The application of AI in the financial sector is transforming the industry. Its use spans areas from customer service enhancements, fraud detection, and risk management to credit assessments and high-frequency trading. However, along with these benefits, AI also presents several challenges. These include issues related to transparency, interpretability, fairness, accountability, and trustworthiness. The use of AI in the financial sector further raises critical questions about data privacy and security. A further issue identified in this report is the systemic risk that AI can introduce to the financial sector. Being prone to errors, AI can exacerbate existing systemic risks, potentially leading to financial crises. Regulation is crucial to harnessing the benefits of AI while mitigating its potential risks. Despite the global recognition of this need, there remains a lack of clear guidelines or legislation for AI use in finance. This report discusses key principles that could guide the formation of effective AI regulation in the financial sector, including the need for a risk-based approach, the inclusion of ethical considerations, and the importance of maintaining a balance between innovation and consumer protection. The report provides recommendations for academia, the finance industry, and regulators.
    摘要 人工智能(AI)有望变革金融行业,其应用涵盖客户服务、欺诈检测、风险管理、信贷评估以及高频交易等领域。然而,AI 也带来了透明度、可解释性、公平性、问责制和可信度等方面的挑战;此外,AI 在金融领域的使用还引发了关于数据隐私与安全的关键问题。报告还强调了 AI 可能给金融领域带来的系统性风险:由于 AI 易于出错,它可能加剧既有的系统性风险,并可能引发金融危机。为应对这些风险,报告提出了金融领域有效 AI 监管的关键原则,包括基于风险的方法、伦理考量,以及在创新与消费者保护之间保持平衡。报告为学术界、金融行业和监管机构提出了建议,强调必须全面理解 AI 的潜力与挑战,以确保 AI 在金融领域得到负责任的使用。

Conditioning Score-Based Generative Models by Neuro-Symbolic Constraints

  • paper_url: http://arxiv.org/abs/2308.16534
  • repo_url: None
  • paper_authors: Davide Scassola, Sebastiano Saccani, Ginevra Carbone, Luca Bortolussi
  • for: 该论文旨在提出一种无需额外训练的方法,可以从无条件的基于分数的生成模型中采样出满足用户定义逻辑约束的样本。
  • methods: 该方法首先解释了如何使用学习得到的分数来随机抽取不归一化分布的样本,然后定义了一种灵活且数字化的符号逻辑框架,用于编码软逻辑约束。最后,该方法结合了这两个元素,实现了一种通用但是近似的随机抽取算法。
  • results: 该论文通过对各种约束和数据进行实验,包括表格数据、图像和时间序列,证明了该方法的有效性。
    Abstract Score-based and diffusion models have emerged as effective approaches for both conditional and unconditional generation. Still conditional generation is based on either a specific training of a conditional model or classifier guidance, which requires training a noise-dependent classifier, even when the classifier for uncorrupted data is given. We propose an approach to sample from unconditional score-based generative models enforcing arbitrary logical constraints, without any additional training. Firstly, we show how to manipulate the learned score in order to sample from an un-normalized distribution conditional on a user-defined constraint. Then, we define a flexible and numerically stable neuro-symbolic framework for encoding soft logical constraints. Combining these two ingredients we obtain a general, but approximate, conditional sampling algorithm. We further developed effective heuristics aimed at improving the approximation. Finally, we show the effectiveness of our approach for various types of constraints and data: tabular data, images and time series.
    摘要 基于分数的模型和扩散模型已成为条件与无条件生成的有效方法。然而,条件生成仍依赖于针对条件模型的专门训练或分类器引导,后者需要训练一个依赖噪声的分类器,即使已经给定了针对未扰动数据的分类器。我们提出一种方法,可以在不进行任何额外训练的情况下,从无条件的基于分数的生成模型中采样并强制满足任意逻辑约束。首先,我们展示了如何操控学习到的分数,从而在用户定义的约束条件下从未归一化分布中采样。随后,我们定义了一个灵活且数值稳定的神经符号框架,用于编码软逻辑约束。将这两个要素结合,我们得到了一种通用但近似的条件采样算法。我们进一步开发了旨在改进近似效果的有效启发式方法。最后,我们在表格数据、图像和时间序列等多种约束和数据类型上验证了该方法的有效性。
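A hedged PyTorch sketch of the core trick: at each sampling step, add the gradient of a log constraint-satisfaction score to the unconditional learned score, so the chain drifts toward samples that meet the constraint. The quadratic soft constraint below is an illustrative stand-in for the paper's neuro-symbolic soft-logic encoding.

```python
import torch

def constrained_score(x, score_model, weight=5.0, target=1.0):
    """Score of p(x | constraint) ~ score_model(x) + grad_x log c(x)."""
    x = x.detach().requires_grad_(True)
    log_c = -weight * (x.mean(dim=-1) - target).pow(2)  # soft constraint
    grad_log_c = torch.autograd.grad(log_c.sum(), x)[0]
    return score_model(x) + grad_log_c
```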

Developing Social Robots with Empathetic Non-Verbal Cues Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.16529
  • repo_url: None
  • paper_authors: Yoon Kyung Lee, Yoonwon Jung, Gyuyi Kang, Sowon Hahn
  • For: The paper aims to enhance the empathetic capacities of social robots by incorporating non-verbal cues.
  • Methods: The authors use a Large Language Model (LLM) to generate four types of empathetic non-verbal cues (Speech, Action, Facial expression, and Emotion) in a social robot.
  • Results: The preliminary results show that the robot is able to recognize and respond to social cues, such as nodding gestures and positive emotions, in a more authentic and context-aware manner.
  • for: 增强社交机器人的共情能力,通过整合非语言线索。
  • methods: 使用大语言模型(LLM)设计并标注四种共情非语言线索(Speech、Action、Facial expression、Emotion),并将其应用于社交机器人。
  • results: 初步结果表明,机器人能够识别并回应点头等社交线索与积极情感,以更加真实且具备情境感知能力的方式进行互动。
    Abstract We propose augmenting the empathetic capacities of social robots by integrating non-verbal cues. Our primary contribution is the design and labeling of four types of empathetic non-verbal cues, abbreviated as SAFE: Speech, Action (gesture), Facial expression, and Emotion, in a social robot. These cues are generated using a Large Language Model (LLM). We developed an LLM-based conversational system for the robot and assessed its alignment with social cues as defined by human counselors. Preliminary results show distinct patterns in the robot's responses, such as a preference for calm and positive social emotions like 'joy' and 'lively', and frequent nodding gestures. Despite these tendencies, our approach has led to the development of a social robot capable of context-aware and more authentic interactions. Our work lays the groundwork for future studies on human-robot interactions, emphasizing the essential role of both verbal and non-verbal cues in creating social and empathetic robots.
    摘要 我们提议通过整合非语言线索来增强社交机器人的共情能力。我们的主要贡献是在社交机器人中设计并标注四类共情非语言线索,简称 SAFE:语音(Speech)、动作/手势(Action)、面部表情(Facial expression)和情感(Emotion)。这些线索由大语言模型(LLM)生成。我们为机器人开发了基于 LLM 的对话系统,并评估了其与人类咨询师所定义的社交线索的一致性。初步结果显示机器人的回应存在明显的模式,例如偏好"喜悦"和"活泼"等平静积极的社交情感,以及频繁的点头手势。尽管存在这些倾向,我们的方法已经造就了一个能够进行情境感知且更为真实互动的社交机器人。我们的工作为未来的人机交互研究奠定了基础,强调了语言与非语言线索在打造社交型、共情型机器人中的关键作用。

Curvature-based Pooling within Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.16516
  • repo_url: https://gitlab.com/cedric_sanders/masterarbeit
  • paper_authors: Cedric Sanders, Andreas Roth, Thomas Liebig
  • for: 本文旨在提升图神经网络(GNN)的能力,解决图学习中的过度压缩(over-squashing)和过度平滑(over-smoothing)问题。
  • methods: 本文提出了一种名为 CurvPool 的新池化方法,利用图的曲率自适应地识别导致过度压缩和过度平滑的结构。
  • results: 对比实验表明,CurvPool 在图分类任务中表现出色,精度高于其他相关方法,并兼具良好的计算复杂度和灵活性。
    Abstract Over-squashing and over-smoothing are two critical issues, that limit the capabilities of graph neural networks (GNNs). While over-smoothing eliminates the differences between nodes making them indistinguishable, over-squashing refers to the inability of GNNs to propagate information over long distances, as exponentially many node states are squashed into fixed-size representations. Both phenomena share similar causes, as both are largely induced by the graph topology. To mitigate these problems in graph classification tasks, we propose CurvPool, a novel pooling method. CurvPool exploits the notion of curvature of a graph to adaptively identify structures responsible for both over-smoothing and over-squashing. By clustering nodes based on the Balanced Forman curvature, CurvPool constructs a graph with a more suitable structure, allowing deeper models and the combination of distant information. We compare it to other state-of-the-art pooling approaches and establish its competitiveness in terms of classification accuracy, computational complexity, and flexibility. CurvPool outperforms several comparable methods across all considered tasks. The most consistent results are achieved by pooling densely connected clusters using the sum aggregation, as this allows additional information about the size of each pool.
    摘要 过度压缩(over-squashing)与过度平滑(over-smoothing)是限制图神经网络(GNN)能力的两个关键问题。过度平滑会消除节点之间的差异,使其难以区分;过度压缩则指 GNN 无法在长距离上传播信息,因为指数级数量的节点状态被压缩进固定大小的表示中。这两种现象的成因相似,都主要由图的拓扑结构引起。为在图分类任务中缓解这些问题,我们提出了一种新的池化方法 CurvPool。CurvPool 利用图的曲率概念,自适应地识别导致过度平滑和过度压缩的结构。通过基于 Balanced Forman 曲率对节点进行聚类,CurvPool 构建出结构更为合适的图,从而支持更深的模型并融合远距离的信息。我们将其与其他最先进的池化方法进行比较,确立了其在分类精度、计算复杂度和灵活性方面的竞争力。CurvPool 在所有考察任务上均优于多种可比方法,其中最一致的结果来自对密集连接的簇使用求和聚合进行池化,因为这能额外保留每个池的规模信息。
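A minimal sketch of the aggregation step only (the Balanced Forman curvature computation and clustering are omitted): given cluster assignments, CurvPool-style sum pooling reduces node features to one vector per cluster while retaining pool-size information.

```python
import torch

def sum_pool(x: torch.Tensor, cluster: torch.Tensor, n_clusters: int):
    """x: [num_nodes, dim]; cluster: [num_nodes] integer cluster ids."""
    out = torch.zeros(n_clusters, x.size(1), dtype=x.dtype)
    return out.index_add_(0, cluster, x)   # sum keeps each pool's size signal

x = torch.randn(5, 3)
cluster = torch.tensor([0, 0, 1, 2, 2])
print(sum_pool(x, cluster, 3).shape)       # torch.Size([3, 3])
```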

Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations

  • paper_url: http://arxiv.org/abs/2308.16505
  • repo_url: None
  • paper_authors: Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, Xing Xie
  • for: 本研究旨在结合推荐模型和大语言模型(LLM)创造一个多功能且交互的推荐系统,以提高推荐系统的功能和用户体验。
  • methods: 本研究使用了LLM作为智能核心,并结合了多种推荐模型作为工具,以实现交互式推荐。研究提出了一个有效的框架 named RecAgent,并实现了一个简单的工作流程,包括内存总线、动态示范增强任务规划和反射。
  • results: 实验结果表明,RecAgent 在多个公开数据集上作为对话式推荐系统表现令人满意,优于通用 LLM。
    Abstract Recommender models excel at providing domain-specific item recommendations by leveraging extensive user behavior data. Despite their ability to act as lightweight domain experts, they struggle to perform versatile tasks such as providing explanations and engaging in conversations. On the other hand, large language models (LLMs) represent a significant step towards artificial general intelligence, showcasing remarkable capabilities in instruction comprehension, commonsense reasoning, and human interaction. However, LLMs lack the knowledge of domain-specific item catalogs and behavioral patterns, particularly in areas that diverge from general world knowledge, such as online e-commerce. Finetuning LLMs for each domain is neither economic nor efficient. In this paper, we bridge the gap between recommender models and LLMs, combining their respective strengths to create a versatile and interactive recommender system. We introduce an efficient framework called RecAgent, which employs LLMs as the brain and recommender models as tools. We first outline a minimal set of essential tools required to transform LLMs into RecAgent. We then propose an efficient workflow within RecAgent for task execution, incorporating key components such as a memory bus, dynamic demonstration-augmented task planning, and reflection. RecAgent enables traditional recommender systems, such as those ID-based matrix factorization models, to become interactive systems with a natural language interface through the integration of LLMs. Experimental results on several public datasets show that RecAgent achieves satisfying performance as a conversational recommender system, outperforming general-purpose LLMs.
    摘要 推荐模型善于利用海量用户行为数据提供特定领域的物品推荐,可视为轻量级的领域专家,但难以胜任提供解释、开展对话等多样化任务。另一方面,大语言模型(LLM)是迈向通用人工智能的重要一步,在指令理解、常识推理和人机交互方面展现出卓越能力。然而,LLM 缺乏特定领域的物品目录和行为模式知识,尤其是在线电商等与通用世界知识差异较大的领域;而针对每个领域微调 LLM 既不经济也不高效。在本文中,我们弥合推荐模型与 LLM 之间的差距,结合二者各自的优势,构建一个多才多艺且可交互的推荐系统。我们提出了一个名为 RecAgent 的高效框架,以 LLM 为大脑、以推荐模型为工具。我们首先列出了将 LLM 转化为 RecAgent 所需的最小必备工具集,然后提出了 RecAgent 中高效的任务执行工作流,包含记忆总线、动态示例增强的任务规划以及反思等关键组件。通过集成 LLM,RecAgent 使基于 ID 的矩阵分解模型等传统推荐系统成为具有自然语言界面的交互式系统。在多个公开数据集上的实验结果表明,RecAgent 作为对话式推荐系统表现令人满意,优于通用 LLM。

Individually Rational Collaborative Vehicle Routing through Give-And-Take Exchanges

  • paper_url: http://arxiv.org/abs/2308.16501
  • repo_url: None
  • paper_authors: Paul Mingzheng Tang, Ba Phong Tran, Hoong Chuin Lau
  • For: 本研究旨在自动化市场平台上物流公司之间的订单交换,以最大化总收益。
  • Methods: 我们提出了一种多智能体方法,聚焦于协作车辆路径问题(CVRP)并以个体理性为视角,将车辆路径问题(VRP)的原则应用于来自不同物流公司的车辆对,在标准 VRP 约束之外加入个体理性约束来优化整体路线,并通过"给与取"(Give-and-Take)方式促成相互竞争的物流智能体之间的合作,以获得更优的总路线和系统效率。
  • Results: 我们使用大型物流公司的真实测试数据进行了大量实验,证明该算法能够快速找到大量最优解,凸显了其实际适用性以及变革物流行业的潜力。
    Abstract In this paper, we are concerned with the automated exchange of orders between logistics companies in a marketplace platform to optimize total revenues. We introduce a novel multi-agent approach to this problem, focusing on the Collaborative Vehicle Routing Problem (CVRP) through the lens of individual rationality. Our proposed algorithm applies the principles of Vehicle Routing Problem (VRP) to pairs of vehicles from different logistics companies, optimizing the overall routes while considering standard VRP constraints plus individual rationality constraints. By facilitating cooperation among competing logistics agents through a Give-and-Take approach, we show that it is possible to reduce travel distance and increase operational efficiency system-wide. More importantly, our approach ensures individual rationality and faster convergence, which are important properties of ensuring the long-term sustainability of the marketplace platform. We demonstrate the efficacy of our approach through extensive experiments using real-world test data from major logistics companies. The results reveal our algorithm's ability to rapidly identify numerous optimal solutions, underscoring its practical applicability and potential to transform the logistics industry.
    摘要 在这篇论文中,我们关注了市场平台上的物流公司之间自动订单交换以优化总收益。我们提出了一种新的多代理模型,通过对协同车辆Routing问题(CVRP)进行定点剖析,以实现个体合理性。我们的提议的算法运用了汽车Routing问题(VRP)的原则,对不同物流公司的车辆对应的对应,优化总路径,同时考虑标准VRP约束以及个体合理性约束。通过在竞争物流代理之间促进合作,我们采用了“给与take”方法,从而减少旅行距离,提高系统综合效率。更重要的是,我们的方法保证了个体合理性和快速收敛,这些性质对于长期稳定性的市场平台是非常重要。我们通过使用实际的物流公司数据进行广泛的实验,证明了我们的算法的实用性和可能性。结果表明,我们的算法能够快速发现许多优化解决方案,这些解决方案在实际应用中具有实际意义和潜在的变革力。

Generalised Winograd Schema and its Contextuality

  • paper_url: http://arxiv.org/abs/2308.16498
  • repo_url: None
  • paper_authors: Kin Ian Lo, Mehrnoosh Sadrzadeh, Shane Mansfield
  • for: 这篇论文的目的是研究语言歧义与量子语境性之间的关系。
  • methods: 该论文使用层论(sheaf-theoretic)模型研究语言歧义,并在 Winograd 模式中考察量子语境性。
  • results: 该研究发现,将 Winograd 模式建模为量子物理实验后,可以在语言歧义中观察到量子语境性。此外,该研究还提出了一种推广 Winograd 模式的新机制,使其类似于 Bell-CHSH 测量情境。
    Abstract Ambiguities in natural language give rise to probability distributions over interpretations. The distributions are often over multiple ambiguous words at a time; a multiplicity which makes them a suitable topic for sheaf-theoretic models of quantum contextuality. Previous research showed that different quantitative measures of contextuality correlate well with Psycholinguistic research on lexical ambiguities. In this work, we focus on coreference ambiguities and investigate the Winograd Schema Challenge (WSC), a test proposed by Levesque in 2011 to evaluate the intelligence of machines. The WSC consists of a collection of multiple-choice questions that require disambiguating pronouns in sentences structured according to the Winograd schema, in a way that makes it difficult for machines to determine the correct referents but remains intuitive for human comprehension. In this study, we propose an approach that analogously models the Winograd schema as an experiment in quantum physics. However, we argue that the original Winograd Schema is inherently too simplistic to facilitate contextuality. We introduce a novel mechanism for generalising the schema, rendering it analogous to a Bell-CHSH measurement scenario. We report an instance of this generalised schema, complemented by the human judgements we gathered via a crowdsourcing platform. The resulting model violates the Bell-CHSH inequality by 0.192, thus exhibiting contextuality in a coreference resolution setting.
    摘要 自然语言中的歧义会在多种解释上产生概率分布。这些分布往往同时涉及多个歧义词;这种多重性使其成为量子语境性层论模型的合适研究对象。此前的研究表明,不同的语境性量化度量与词汇歧义的心理语言学研究结果高度吻合。在本工作中,我们关注指代歧义,并研究 Levesque 于2011年提出的用于评估机器智能的测试——Winograd 模式挑战(WSC)。WSC 由一组多项选择题构成,要求对按照 Winograd 模式构造的句子中的代词进行消歧,其设计使机器难以确定正确的指代对象,而人类却能凭直觉理解。在本研究中,我们提出了一种将 Winograd 模式类比为量子物理实验的建模方法。然而,我们认为原始的 Winograd 模式本质上过于简单,不足以产生语境性。我们引入了一种新机制来推广该模式,使其类似于 Bell-CHSH 测量情境。我们报告了该推广模式的一个实例,并辅以通过众包平台收集的人类判断。所得模型以0.192的幅度违反了 Bell-CHSH 不等式,从而在指代消解场景中展现出语境性。
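A small worked check of the reported quantity: the CHSH value $S = |E(a,b) + E(a,b') + E(a',b) - E(a',b')|$ has classical bound $S \le 2$. The four correlations below are illustrative numbers chosen to reproduce a violation of 0.192; they are not the paper's measured table.

```python
def chsh(E_ab, E_ab2, E_a2b, E_a2b2):
    return abs(E_ab + E_ab2 + E_a2b - E_a2b2)

S = chsh(0.548, 0.548, 0.548, -0.548)
print(S, "-> violates the classical bound 2 by", round(S - 2, 3))  # 0.192
```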

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

  • paper_url: http://arxiv.org/abs/2308.16493
  • repo_url: None
  • paper_authors: Riley Tavassoli, Mani Amani, Reza Akhavian
  • for: This paper aims to improve the scene understanding of vision-language models (VLMs) by aligning the embedding spaces of different modalities, such as inertial measurement unit (IMU) data, with the vision embedding space.
  • methods: The proposed method combines supervised and contrastive training to align the embedding spaces of different modalities with the vision embedding space, without requiring retraining of the VLM. The IMU embeddings are given directly to the model, allowing for nonlinear interactions between the query, image, and IMU signal.
  • results: The proposed method is evaluated through experiments on human activity recognition using IMU data and visual inputs. The results show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks, demonstrating the effectiveness of the proposed method.
    Abstract Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision, while the most popular modality to augment LLMs with, is only one representation of a scene. In human-robot interaction scenarios, robot perception requires accurate scene understanding by the robot. In this paper, we define and demonstrate a method of aligning the embedding spaces of different modalities (in this case, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We opt to give the model IMU embeddings directly over using a separate human activity recognition model that feeds directly into the prompt to allow for any nonlinear interactions between the query, image, and IMU signal that would be lost by mapping the IMU data to a discrete activity label. Further, we demonstrate our methodology's efficacy through experiments involving human activity recognition using IMU data and visual inputs. Our results show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks, thus paving the way for more versatile and capable language models in multi-modal contexts.
    摘要 视力语言模型(VLM)已经展现出极强的能力在视觉问答和理解任务中,通过将视觉表示与大语言模型(LLM)在预训练时学习的抽象技能相结合。视觉,是现实中最受欢迎的感知模式,但是只是场景理解中的一种表示。在人机交互场景中,机器人需要准确地理解场景。在这篇论文中,我们定义并实现了将不同modalities(在这种情况下是测量单元(IMU)数据)的 embedding 空间与视觉 embedding 空间对齐,使得 VLM 能够理解和处理这些其他模式,无需重新训练。我们选择将 IMU 嵌入直接给模型,而不是使用一个独立的人类活动识别模型,以便保留非线性交互 между查询、图像和 IMU 信号。此外,我们通过对人类活动识别 tasks 进行实验,证明了我们的方法的有效性。我们的结果表明,将多种模式作为输入,可以提高 VLM 的场景理解和总性性能,从而开创更多功能强大的语言模型在多modal contexts。

In-class Data Analysis Replications: Teaching Students while Testing Science

  • paper_url: http://arxiv.org/abs/2308.16491
  • repo_url: None
  • paper_authors: Kristina Gligoric, Tiziano Piccardi, Jake Hofman, Robert West
  • for: 这个论文目的是为了探讨在数据分析教程中包含复制任务的可行性,以及这种方法对学生、教师和科学家的影响。
  • methods: 这个研究使用了在EPFL教授的应用数据分析课程(CS-401)中包含复制任务的方法,并通过在课程进行的问卷调查来收集数据。
  • results: 研究发现学生可以复制已经发表的科学论文,大多数情况下是质量的,一些情况下是准确的。学生对复制任务的期望和实际经验之间存在差异,这些差异共同证明了对critical thinking的激励作用。此外,教师可以了解在教室中包含复制任务的成本和问题,以及这种方法对传统任务的比较。研究还发现了对科学社区的具体利益,如复制报告和科学工作中避免的复制障碍。
    Abstract Science is facing a reproducibility crisis. Previous work has proposed incorporating data analysis replications into classrooms as a potential solution. However, despite the potential benefits, it is unclear whether this approach is feasible, and if so, what the involved stakeholders-students, educators, and scientists-should expect from it. Can students perform a data analysis replication over the course of a class? What are the costs and benefits for educators? And how can this solution help benchmark and improve the state of science? In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course (CS-401) taught at EPFL (N=354 students). Here we report pre-registered findings based on surveys administered throughout the course. First, we demonstrate that students can replicate previously published scientific papers, most of them qualitatively and some exactly. We find discrepancies between what students expect of data analysis replications and what they experience by doing them along with changes in expectations about reproducibility, which together serve as evidence of attitude shifts to foster students' critical thinking. Second, we provide information for educators about how much overhead is needed to incorporate replications into the classroom and identify concerns that replications bring as compared to more traditional assignments. Third, we identify tangible benefits of the in-class data analysis replications for scientific communities, such as a collection of replication reports and insights about replication barriers in scientific work that should be avoided going forward. Overall, we demonstrate that incorporating replication tasks into a large data science class can increase the reproducibility of scientific work as a by-product of data science instruction, thus benefiting both science and students.
    摘要 科学面临着可重现危机。 previous work提议在课程中包含数据分析重复,以解决这个问题。然而,尚未确定这种方法是否实施可行,以及参与者们(学生、教师和科学家)应该期望什么。学生在课程中完成数据分析重复是否可能?教师所承担的成本和利益是什么?这种解决方案可以如何帮助评估和改进科学的状况?在 presente study中,我们在EPFL教授的应用数据分析课程(CS-401)中 integrate了数据分析重复。我们通过课程中的问naire进行了预先注册的发现,发现学生可以重复已发表的科学论文,大多数是Qualitatively相同,一些是精确相同。我们发现学生对数据分析重复的预期与实际经验存在差异,这些差异共同证明了学生的批判思维的提高。其次,我们为教师提供了包括 integrate replications into the classroom overhead和replications bring 相比传统任务的担忧。 finally,我们发现在课程中的数据分析重复提供了科学社区的 tangible benefits,如replication reports和对重复过程中的障碍的洞察,这些材料可以为未来的科学工作提供指导。总之,我们的研究表明,在课程中包含数据分析重复任务可以提高科学工作的可重现性,并为学生和科学社区带来利益。

Latent Painter

  • paper_url: http://arxiv.org/abs/2308.16490
  • repo_url: None
  • paper_authors: Shih-Chieh Su
  • for: 用于生成创意艺术动画
  • methods: 使用潜在隐藏的canvas和预测结果作为规划,通过转移一个生成的图像到另一个来实现动画变换
  • results: 能够生成具有变换性的精细艺术动画
    Abstract Latent diffusers revolutionized the generative AI and inspired creative art. When denoising the latent, the predicted original image at each step collectively animates the formation. However, the animation is limited by the denoising nature of the diffuser, and only renders a sharpening process. This work presents Latent Painter, which uses the latent as the canvas, and the diffuser predictions as the plan, to generate painting animation. Latent Painter also transits one generated image to another, which can happen between images from two different sets of checkpoints.
    摘要 潜在扩散模型革新了生成式 AI,并启发了创意艺术。在对潜在表示去噪时,每一步预测出的原始图像共同构成了图像形成过程的动画。然而,这种动画受限于扩散模型的去噪本质,只能呈现一个逐渐锐化的过程。本工作提出了 Latent Painter,它以潜在表示为画布、以扩散模型的预测为规划,生成绘画式动画。Latent Painter 还能将一张生成图像过渡为另一张,包括来自两组不同检查点的图像之间的过渡。

Test-Time Adaptation for Point Cloud Upsampling Using Meta-Learning

  • paper_url: http://arxiv.org/abs/2308.16484
  • repo_url: None
  • paper_authors: Ahmed Hatem, Yiming Qian, Yang Wang
  • for: 提升点云上采样模型的泛化能力
  • methods: 使用元学习在测试时适应测试数据的特性
  • results: 在标准基准上的表现优于对比方法
    Abstract Affordable 3D scanners often produce sparse and non-uniform point clouds that negatively impact downstream applications in robotic systems. While existing point cloud upsampling architectures have demonstrated promising results on standard benchmarks, they tend to experience significant performance drops when the test data have different distributions from the training data. To address this issue, this paper proposes a test-time adaption approach to enhance model generality of point cloud upsampling. The proposed approach leverages meta-learning to explicitly learn network parameters for test-time adaption. Our method does not require any prior information about the test data. During meta-training, the model parameters are learned from a collection of instance-level tasks, each of which consists of a sparse-dense pair of point clouds from the training data. During meta-testing, the trained model is fine-tuned with a few gradient updates to produce a unique set of network parameters for each test instance. The updated model is then used for the final prediction. Our framework is generic and can be applied in a plug-and-play manner with existing backbone networks in point cloud upsampling. Extensive experiments demonstrate that our approach improves the performance of state-of-the-art models.
    摘要 价格低廉的3D扫描仪往往生成稀疏且不均匀的点云,这会对机器人系统中的下游应用产生负面影响。现有的点云上采样架构虽然在标准基准上取得了可观的结果,但当测试数据与训练数据分布不同时,往往出现显著的性能下降。为解决这一问题,本文提出了一种测试时适应方法,以增强点云上采样模型的泛化能力。该方法利用元学习显式地学习用于测试时适应的网络参数,且不需要任何关于测试数据的先验信息。在元训练阶段,模型参数从一组实例级任务中学习,每个任务由训练数据中的一对稀疏-稠密点云构成。在元测试阶段,训练好的模型经过少量梯度更新的微调,为每个测试实例生成一组独有的网络参数,随后用更新后的模型完成最终预测。我们的框架具有通用性,可即插即用地与现有的点云上采样骨干网络结合。大量实验表明,该方法能够提升最先进模型的性能。
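A minimal PyTorch sketch of the meta-testing step: clone the meta-trained weights, take a few gradient updates on a self-supervised loss computed from the single test instance, and predict with the adapted copy. `self_loss` stands in for the paper's instance-level objectives.

```python
import copy
import torch

def adapt_and_predict(model, test_points, self_loss, steps=5, lr=1e-4):
    adapted = copy.deepcopy(model)               # unique per-instance weights
    opt = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(steps):                       # a few gradient updates
        opt.zero_grad()
        self_loss(adapted, test_points).backward()
        opt.step()
    with torch.no_grad():
        return adapted(test_points)
```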

Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning

  • paper_url: http://arxiv.org/abs/2308.16481
  • repo_url: None
  • paper_authors: Ahmed Hatem, Yiming Qian, Yang Wang
  • for: 提升点云配准模型的泛化能力和性能
  • methods: 提出了一种基于测试时适应的点云配准框架,通过三个自监督辅助任务适应测试数据,并借助元辅助学习方法在测试时完成适应。
  • results: 实验结果表明,该方法能够提升点云配准模型的泛化能力和性能,优于其他现有方法。
    Abstract We present Point-TTA, a novel test-time adaptation framework for point cloud registration (PCR) that improves the generalization and the performance of registration models. While learning-based approaches have achieved impressive progress, generalization to unknown testing environments remains a major challenge due to the variations in 3D scans. Existing methods typically train a generic model and the same trained model is applied on each instance during testing. This could be sub-optimal since it is difficult for the same model to handle all the variations during testing. In this paper, we propose a test-time adaptation approach for PCR. Our model can adapt to unseen distributions at test-time without requiring any prior knowledge of the test data. Concretely, we design three self-supervised auxiliary tasks that are optimized jointly with the primary PCR task. Given a test instance, we adapt our model using these auxiliary tasks and the updated model is used to perform the inference. During training, our model is trained using a meta-auxiliary learning approach, such that the adapted model via auxiliary tasks improves the accuracy of the primary task. Experimental results demonstrate the effectiveness of our approach in improving generalization of point cloud registration and outperforming other state-of-the-art approaches.
    摘要 我们提出 Point-TTA,一种新颖的点云配准(PCR)测试时适应框架,可提升配准模型的泛化能力与性能。尽管基于学习的方法已取得令人瞩目的进展,但由于三维扫描数据的差异,泛化到未知测试环境仍是重大挑战。现有方法通常训练一个通用模型,并在测试时对每个实例应用同一个训练好的模型;由于同一个模型难以应对测试中的所有变化,这可能并非最优。在本文中,我们提出了一种面向 PCR 的测试时适应方法。我们的模型无需任何测试数据的先验知识,即可在测试时适应未见过的分布。具体而言,我们设计了三个与主 PCR 任务联合优化的自监督辅助任务。给定一个测试实例,我们利用这些辅助任务对模型进行适应,并使用更新后的模型进行推理。在训练阶段,我们采用元辅助学习方式训练模型,使经辅助任务适应后的模型能够提升主任务的精度。实验结果证明了我们的方法在提升点云配准泛化能力方面的有效性,并超越了其他最先进的方法。

Transformer Compression via Subspace Projection

  • paper_url: http://arxiv.org/abs/2308.16475
  • repo_url: None
  • paper_authors: Yuxuan Hu, Jing Zhang, Chen Zhao, Cuiping Li, Hong Chen
  • for: 压缩 transformer 模型,减少隐藏尺寸
  • methods: 将模型投影到一个子空间中,使权重矩阵与降维空间中的特征直接进行矩阵运算
  • results: 实验结果显示,TCSP 可以实现44%的压缩率,且精度下降不超过1.6%,超过或匹配先前的压缩方法。同时,TCSP 兼容其他针对滤波器和注意力头维度压缩的方法。
    Abstract We propose TCSP, a novel method for compressing a transformer model by focusing on reducing the hidden size of the model. By projecting the whole transform model into a subspace, we enable matrix operations between the weight matrices in the model and features in a reduced-dimensional space, leading to significant reductions in model parameters and computing resources. To establish this subspace, we decompose the feature matrix, derived from different layers of sampled data instances, into a projection matrix. For evaluation, TCSP is applied to compress T5 and BERT models on the GLUE and SQuAD benchmarks. Experimental results demonstrate that TCSP achieves a compression ratio of 44\% with at most 1.6\% degradation in accuracy, surpassing or matching prior compression methods. Furthermore, TCSP exhibits compatibility with other methods targeting filter and attention head size compression.
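The central mechanics — deriving a projection matrix from a feature matrix and re-expressing weight matrices in the reduced space — can be sketched as follows. This is a simplified illustration assuming PyTorch; which layers share a projection and how the feature matrix is sampled follow the paper's own procedure.

```python
import torch

def build_projection(features, k):
    """features: (n_samples, d) activations collected from sampled data.
    Returns P of shape (d, k): the top-k right singular vectors spanning
    the subspace the features mostly live in."""
    _, _, vh = torch.linalg.svd(features, full_matrices=False)
    return vh[:k].T

def project_linear(layer, p_in, p_out):
    """Re-express a Linear layer in the reduced subspaces so that
    y' = P_out^T (W (P_in x') + b); parameters shrink from
    d_out*d_in to k_out*k_in."""
    with torch.no_grad():
        w = p_out.T @ layer.weight @ p_in        # (k_out, k_in)
        b = p_out.T @ layer.bias                 # (k_out,)
    new = torch.nn.Linear(w.shape[1], w.shape[0])
    new.weight.data.copy_(w)
    new.bias.data.copy_(b)
    return new
```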

Enhancing Subtask Performance of Multi-modal Large Language Model

  • paper_url: http://arxiv.org/abs/2308.16474
  • repo_url: None
  • paper_authors: Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu
  • for: This paper aims to improve the performance of multi-modal large language models (MLLMs) by selecting multiple pre-trained models to complete the same subtask and combining their results to obtain the optimal outcome.
  • methods: The proposed approach selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, invokes these models in parallel to process input data, and compares their results using a large language model (LLM) to choose the best outcome.
  • results: Extensive experiments on GPT-4-annotated and human-annotated datasets, across various evaluation metrics, demonstrate the effectiveness of the proposed approach.
    Abstract Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into multiple subtasks, then employing individual pre-trained models to complete specific subtasks, and ultimately utilizing LLMs to integrate the results of each subtasks to obtain the results of the task. In real-world scenarios, when dealing with large projects, it is common practice to break down the project into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each subtask and, consequently, for the entire project. Inspired by this, this study considers selecting multiple pre-trained models to complete the same subtask. By combining the results from multiple pre-trained models, the optimal subtask result is obtained, enhancing the performance of the MLLM. Specifically, this study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, and then invokes these models in parallel to process input data and generate corresponding subtask results. Finally, the results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask. Extensive experiments are conducted in this study using GPT-4 annotated datasets and human-annotated datasets. The results of various evaluation metrics adequately demonstrate the effectiveness of the proposed approach in this paper.
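A minimal sketch of the select-and-compare workflow might look like the following, where `models` are interchangeable pre-trained models for the same subtask and `judge_llm` is any text-in/text-out LLM interface. Both are illustrative stand-ins, not the paper's actual components.

```python
from concurrent.futures import ThreadPoolExecutor

def best_subtask_result(models, inputs, judge_llm):
    """Invoke several pre-trained models on the same subtask in parallel
    and let an LLM pick the best output. `models` is a list of callables;
    `judge_llm` maps a prompt string to a text reply."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda m: m(inputs), models))
    numbered = "\n".join(f"{i}: {r}" for i, r in enumerate(results))
    prompt = (
        "Several models produced candidate results for the same subtask.\n"
        f"{numbered}\n"
        "Reply with only the index of the best result."
    )
    return results[int(judge_llm(prompt).strip())]
```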

MaintainoMATE: A GitHub App for Intelligent Automation of Maintenance Activities

  • paper_url: http://arxiv.org/abs/2308.16464
  • repo_url: None
  • paper_authors: Anas Nadeem, Muhammad Usman Sarwar, Muhammad Zubair Malik
  • for: This work aims to streamline maintenance tasks in software development projects, in particular by automating the handling of issue reports on issue tracking systems.
  • methods: It uses a BERT model to automatically categorize issue reports and assign them to developers with relevant expertise.
  • results: Experiments show that MaintainoMATE achieves an F1-score close to 80% for issue labeling, comparable to existing state-of-the-art results, and assigns issue reports to relevant developers with an F1-score of 54%, a significant improvement over existing approaches.
    Abstract Software development projects rely on issue tracking systems at the core of tracking maintenance tasks such as bug reports, and enhancement requests. Incoming issue-reports on these issue tracking systems must be managed in an effective manner. First, they must be labelled and then assigned to a particular developer with relevant expertise. This handling of issue-reports is critical and requires thorough scanning of the text entered in an issue-report making it a labor-intensive task. In this paper, we present a unified framework called MaintainoMATE, which is capable of automatically categorizing the issue-reports in their respective category and further assigning the issue-reports to a developer with relevant expertise. We use the Bidirectional Encoder Representations from Transformers (BERT), as an underlying model for MaintainoMATE to learn the contextual information for automatic issue-report labeling and assignment tasks. We deploy the framework used in this work as a GitHub application. We empirically evaluate our approach on GitHub issue-reports to show its capability of assigning labels to the issue-reports. We were able to achieve an F1-score close to 80\%, which is comparable to existing state-of-the-art results. Similarly, our initial evaluations show that we can assign relevant developers to the issue-reports with an F1 score of 54\%, which is a significant improvement over existing approaches. Our initial findings suggest that MaintainoMATE has the potential of improving software quality and reducing maintenance costs by accurately automating activities involved in the maintenance processes. Our future work would be directed towards improving the issue-assignment module.
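The labeling module boils down to fine-tuning BERT for sequence classification over issue-report text; a minimal sketch with Hugging Face `transformers` follows. The two-example dataset and label scheme are illustrative stand-ins, not MaintainoMATE's training data.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy issue reports with hypothetical labels: 0 = bug, 1 = enhancement.
data = Dataset.from_dict({
    "text": ["App crashes on startup", "Please add dark mode"],
    "label": [0, 1],
})
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
data = data.map(lambda ex: tok(ex["text"], truncation=True,
                               padding="max_length", max_length=128),
                batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="issue-labeler", num_train_epochs=1),
    train_dataset=data,
).train()
```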

BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

  • paper_url: http://arxiv.org/abs/2308.16458
  • repo_url: https://github.com/gersteinlab/biocoder
  • paper_authors: Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, Mark Gerstein
  • for: This work introduces BioCoder, a benchmark for evaluating existing pre-trained models on generating bioinformatics code.
  • methods: BioCoder draws on Python and Java code from GitHub and the Rosalind Project and uses a fuzz-testing framework to evaluate model performance.
  • results: The analysis shows that strong performance on bioinformatics code generation requires domain knowledge, pragmatic code generation, and contextual understanding.
    Abstract Pre-trained language models like ChatGPT have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks. Moreover, in bioinformatics, generating functional programs poses additional notable challenges due to the amount of domain knowledge, the need for complicated data operations, and intricate functional dependencies between the operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub and 253 examples from the Rosalind Project. BioCoder incorporates a fuzz-testing framework for evaluation, and we have applied it to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
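The functional evaluation can be illustrated with a small harness that executes generated code in a fresh interpreter and checks an input/output pair; a fuzz-testing setup would repeat this over many generated inputs. The helper and test case below are illustrative, not BioCoder's actual harness.

```python
import subprocess
import sys

def passes_test(generated_code, call, expected, timeout=10):
    """Run model-generated code in a separate interpreter and compare
    the result of one call against the expected value."""
    program = generated_code + f"\nprint(repr({call}))\n"
    out = subprocess.run([sys.executable, "-c", program],
                         capture_output=True, text=True, timeout=timeout)
    return out.returncode == 0 and out.stdout.strip() == repr(expected)

code = "def gc_content(s):\n    return (s.count('G') + s.count('C')) / len(s)"
print(passes_test(code, "gc_content('ATGC')", 0.5))  # True
```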

Contrastive Representation Learning Based on Multiple Node-centered Subgraphs

  • paper_url: http://arxiv.org/abs/2308.16441
  • repo_url: None
  • paper_authors: Dong Li, Wenjun Wang, Minglai Shao, Chen Zhao
  • for: Learning node representations on graphs in a self-supervised way
  • methods: Proposes a contrastive learning method over multiple carefully designed node-centered subgraphs, maximizing the mutual information between different subgraphs of the same node via a contrastive loss.
  • results: The model achieves state-of-the-art results on various real-world datasets and different downstream tasks.
    Abstract As the basic element of graph-structured data, node has been recognized as the main object of study in graph representation learning. A single node intuitively has multiple node-centered subgraphs from the whole graph (e.g., one person in a social network has multiple social circles based on his different relationships). We study this intuition under the framework of graph contrastive learning, and propose a multiple node-centered subgraphs contrastive representation learning method to learn node representation on graphs in a self-supervised way. Specifically, we carefully design a series of node-centered regional subgraphs of the central node. Then, the mutual information between different subgraphs of the same node is maximized by contrastive loss. Experiments on various real-world datasets and different downstream tasks demonstrate that our model has achieved state-of-the-art results.
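Maximizing mutual information between subgraphs of the same node is commonly implemented with an InfoNCE-style contrastive loss. The PyTorch sketch below assumes row i of `z1` and `z2` holds embeddings of two different subgraphs centered on the same node (a positive pair), with other rows in the batch acting as negatives; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def subgraph_infonce(z1, z2, temperature=0.5):
    """InfoNCE between two batches of node-centered subgraph embeddings;
    matching rows are positive pairs, all other rows are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = subgraph_infonce(torch.randn(32, 64), torch.randn(32, 64))
```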

BenchTemp: A General Benchmark for Evaluating Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.16385
  • repo_url: https://github.com/qianghuangwhu/benchtemp
  • paper_authors: Qiang Huang, Jiawei Jiang, Xi Susie Rao, Ce Zhang, Zhichao Han, Zitao Zhang, Xin Wang, Yongjun He, Quanqing Xu, Yang Zhao, Chuang Hu, Shuo Shang, Bo Du
  • for: Evaluating the performance of Temporal Graph Neural Networks (TGNNs) on common ground, providing a general benchmark platform.
  • methods: Uses the BenchTemp benchmark suite, covering various tasks and settings, to compare TGNN models.
  • results: Extensively compares representative TGNN models with respect to both effectiveness and efficiency metrics.
    Abstract To handle graphs in which features or connectivities are evolving over time, a series of temporal graph neural networks (TGNNs) have been proposed. Despite the success of these TGNNs, the previous TGNN evaluations reveal several limitations regarding four critical issues: 1) inconsistent datasets, 2) inconsistent evaluation pipelines, 3) lacking workload diversity, and 4) lacking efficient comparison. Overall, there lacks an empirical study that puts TGNN models onto the same ground and compares them comprehensively. To this end, we propose BenchTemp, a general benchmark for evaluating TGNN models on various workloads. BenchTemp provides a set of benchmark datasets so that different TGNN models can be fairly compared. Further, BenchTemp engineers a standard pipeline that unifies the TGNN evaluation. With BenchTemp, we extensively compare the representative TGNN models on different tasks (e.g., link prediction and node classification) and settings (transductive and inductive), w.r.t. both effectiveness and efficiency metrics. We have made BenchTemp publicly available at https://github.com/qianghuangwhu/benchtemp.

A Survey on Privacy in Graph Neural Networks: Attacks, Preservation, and Applications

  • paper_url: http://arxiv.org/abs/2308.16375
  • repo_url: None
  • paper_authors: Yi Zhang, Yuying Zhao, Zhaoqing Li, Xueqi Cheng, Yu Wang, Olivera Kotevska, Philip S. Yu, Tyler Derr
  • for: The paper aims to provide a comprehensive overview of attacks on graph data and privacy preservation techniques in graph neural networks (GNNs).
  • methods: The paper categorizes privacy preservation techniques in GNNs and reviews datasets and applications for analyzing and solving privacy issues in GNNs.
  • results: The paper outlines potential directions for future research to build better privacy-preserving GNNs.
    Abstract Graph Neural Networks (GNNs) have gained significant attention owing to their ability to handle graph-structured data and the improvement in practical applications. However, many of these models prioritize high utility performance, such as accuracy, with a lack of privacy consideration, which is a major concern in modern society where privacy attacks are rampant. To address this issue, researchers have started to develop privacy-preserving GNNs. Despite this progress, there is a lack of a comprehensive overview of the attacks and the techniques for preserving privacy in the graph domain. In this survey, we aim to address this gap by summarizing the attacks on graph data according to the targeted information, categorizing the privacy preservation techniques in GNNs, and reviewing the datasets and applications that could be used for analyzing/solving privacy issues in GNNs. We also outline potential directions for future research in order to build better privacy-preserving GNNs.

cs.CL - 2023-08-31

TouchStone: Evaluating Vision-Language Models by Language Models

  • paper_url: http://arxiv.org/abs/2308.16890
  • repo_url: None
  • paper_authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou
  • for: Evaluating the diverse abilities of large vision-language models (LVLMs), including perceiving, understanding, and processing visual information, as well as conversational skills and literary creation.
  • methods: Uses strong large language models (LLMs) as judges to comprehensively evaluate LVLM abilities on open-world images and questions, covering five major categories of abilities and 27 subtasks.
  • results: Validation shows that powerful LLMs such as GPT-4 can effectively score multimodal dialogue quality by leveraging their textual capabilities alone, aligning with human preferences.
    Abstract Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
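The LLM-as-judge mechanism — substituting detailed image annotations for the pixels so a text-only model can grade a multimodal answer — can be sketched as follows, assuming the OpenAI Python client (v1 API). The prompt wording and the 1-10 scale are illustrative choices, not TouchStone's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(image_annotation, question, model_answer):
    """Ask a text-only LLM to grade a multimodal answer, with detailed
    image annotations standing in for the image content."""
    prompt = (
        f"Image description: {image_annotation}\n"
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        "Rate how helpful and accurate the answer is on a scale of 1-10. "
        "Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())
```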

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

  • paper_url: http://arxiv.org/abs/2308.16797
  • repo_url: https://github.com/johndmendonca/dialevalml
  • paper_authors: John Mendonça, Patrícia Pereira, João Paulo Carvalho, Alon Lavie, Isabel Trancoso
  • for: This paper develops a framework for automatic dialogue evaluation metrics that can assess multilingual dialogue systems.
  • methods: The framework leverages the strengths of current evaluation models together with the newly established paradigm of prompting large language models (LLMs).
  • results: Experiments show that the framework achieves state-of-the-art mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of DSTC11 Track 4, "Automatic Evaluation Metrics for Open-Domain Dialogue Systems".
    Abstract Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state of the art results in terms of mean Spearman correlation scores across several benchmarks and ranks first place on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", proving the evaluation capabilities of prompted LLMs.

Towards Multilingual Automatic Dialogue Evaluation

  • paper_url: http://arxiv.org/abs/2308.16795
  • repo_url: None
  • paper_authors: John Mendonça, Alon Lavie, Isabel Trancoso
  • for: This paper addresses the main limiting factors in developing robust multilingual dialogue evaluation metrics: the lack of multilingual data and the limited availability of open-source multilingual dialogue systems.
  • methods: The authors propose to work around these limitations by leveraging a strong multilingual pretrained language model and augmenting existing English dialogue data with machine translation.
  • results: They show empirically that naively fine-tuning on translated data is insufficient to outperform a multilingual model fine-tuned on source data only; the best approach carefully curates the translated data using MT quality-estimation metrics, excluding low-quality translations that hinder performance.
    Abstract The main limiting factor in the development of robust multilingual dialogue evaluation metrics is the lack of multilingual data and the limited availability of open sourced multilingual dialogue systems. In this work, we propose a workaround for this lack of data by leveraging a strong multilingual pretrained LLM and augmenting existing English dialogue data using Machine Translation. We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data. Instead, the best approach consists in the careful curation of translated data using MT Quality Estimation metrics, excluding low quality translations that hinder its performance.
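The curation step — scoring each translation with a reference-free quality-estimation metric and dropping low-quality pairs — might look like the sketch below, assuming the Unbabel `comet` package (v2 API) and one of its public QE checkpoints. The paper does not necessarily use this exact metric or threshold.

```python
from comet import download_model, load_from_checkpoint

# Reference-free QE checkpoint; one public option, not necessarily
# the metric used in the paper.
model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-qe-da"))

pairs = [{"src": "How are you today?", "mt": "Wie geht es dir heute?"},
         {"src": "That movie was great.", "mt": "Der Zug ist gelb."}]
scores = model.predict(pairs, batch_size=8, gpus=0).scores

THRESHOLD = 0.3  # illustrative cut-off; tune on held-out data
curated = [p for p, s in zip(pairs, scores) if s >= THRESHOLD]
```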

Enhancing PLM Performance on Labour Market Tasks via Instruction-based Finetuning and Prompt-tuning with Rules

  • paper_url: http://arxiv.org/abs/2308.16770
  • repo_url: None
  • paper_authors: Jarno Vrolijk, David Graus
  • for: This work investigates how to improve the performance of pre-trained language models (PLMs) on labour-market-specific applications.
  • methods: It applies prompt-based tuning (prompt tuning with rules) and instruction tuning without exemplars, requiring no additional model layers, manual annotations, or data augmentation.
  • results: The results show that these cost-efficient methods can significantly increase PLM performance on downstream labour market applications.
    Abstract The increased digitization of the labour market has given researchers, educators, and companies the means to analyze and better understand the labour market. However, labour market resources, although available in high volumes, tend to be unstructured, and as such, research towards methodologies for the identification, linking, and extraction of entities becomes more and more important. Against the backdrop of this quest for better labour market representations, resource constraints and the unavailability of large-scale annotated data cause a reliance on human domain experts. We demonstrate the effectiveness of prompt-based tuning of pre-trained language models (PLM) in labour market specific applications. Our results indicate that cost-efficient methods such as PTR and instruction tuning without exemplars can significantly increase the performance of PLMs on downstream labour market applications without introducing additional model layers, manual annotations, and data augmentation.

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

  • paper_url: http://arxiv.org/abs/2308.16692
  • repo_url: None
  • paper_authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
  • for: This paper aims to evaluate the suitability of existing speech tokens for speech language modeling and to propose a unified speech tokenizer for speech large language models.
  • methods: The paper proposed a unified speech tokenizer called SpeechTokenizer, which adopts the Encoder-Decoder architecture with residual vector quantization (RVQ).
  • results: The SpeechTokenizer performed comparably to EnCodec in speech reconstruction and demonstrated strong performance on the SLMTokBench benchmark. Additionally, the Unified Speech Language Model (USLM) outperformed VALL-E in zero-shot Text-to-Speech tasks.
    Abstract Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, We construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
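The residual vector quantization at the core of SpeechTokenizer can be sketched as below: each codebook quantizes the residual left by the previous ones, so successive RVQ layers capture progressively finer information (semantic content first, acoustic detail later, per the paper). This minimal PyTorch sketch shows only the quantization mechanics, not the Encoder-Decoder training.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """x: (batch, dim) latents; codebooks: list of (codebook_size, dim)
    tensors. Each codebook quantizes what the previous layers missed."""
    residual, codes = x, []
    quantized = torch.zeros_like(x)
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code
        q = cb[idx]
        codes.append(idx)
        quantized = quantized + q
        residual = residual - q
    return quantized, codes

codebooks = [torch.randn(1024, 256) for _ in range(8)]  # 8 RVQ layers
quantized, codes = residual_vector_quantize(torch.randn(4, 256), codebooks)
```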

DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

  • paper_url: http://arxiv.org/abs/2308.16687
  • repo_url: None
  • paper_authors: Shaltiel Shmidman, Avi Shmidman, Moshe Koppel
  • for: This paper presents a new state-of-the-art BERT model for Modern Hebrew, together with two fine-tuned versions for two specific foundational tasks: prefix segmentation and morphological tagging.
  • methods: The models build on the BERT architecture, with task-specific fine-tuning for the two tasks.
  • results: The paper reports the models' performance on various benchmarks and releases them to support further research and development in Hebrew NLP.
    Abstract We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew, outperforming existing models on most benchmarks. Additionally, we release two fine-tuned versions of the model, designed to perform two specific foundational tasks in the analysis of Hebrew texts: prefix segmentation and morphological tagging. These fine-tuned models allow any developer to perform prefix segmentation and morphological tagging of a Hebrew sentence with a single call to a HuggingFace model, without the need to integrate any additional libraries or code. In this paper we describe the details of the training as well and the results on the different benchmarks. We release the models to the community, along with sample code demonstrating their use. We release these models as part of our goal to help further research and development in Hebrew NLP.
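Assuming the checkpoints are published on the Hugging Face Hub under the authors' organization (the repository name below is our assumption), the base model can be exercised with a standard fill-mask call; the fine-tuned segmentation and tagging variants would be loaded the same way with their task-specific heads.

```python
from transformers import pipeline

# Checkpoint name assumed from the authors' Hugging Face releases.
fill = pipeline("fill-mask", model="dicta-il/dictabert")

# Hebrew: "His name is [MASK]"
for candidate in fill("קוראים לו [MASK]"):
    print(candidate["token_str"], candidate["score"])
```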

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2308.16593
  • repo_url: None
  • paper_authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng
  • for: Increasing the amount of spontaneous-style speech data and spontaneous behavior labels, and improving the expressiveness of conversational speech synthesis.
  • methods: A semi-supervised pre-training method that considers both text and speech information to detect spontaneous behavior labels in speech, together with a linguistic-aware encoder that models the relationship between sentences in the conversation.
  • results: Experimental results show that the proposed method achieves expressive speech synthesis, modeling spontaneous behavior in spontaneous-style speech and predicting reasonable spontaneous behavior from text.
    Abstract The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behaviors labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.

Interpreting Sentiment Composition with Latent Semantic Tree

  • paper_url: http://arxiv.org/abs/2308.16588
  • repo_url: https://github.com/changmenseng/semantic_tree
  • paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Cao Liu, Jiansong Chen, Jun Zhao, Kang Liu
  • for: This paper proposes a new way to interpret sentiment composition, addressing the suboptimality of conventional hierarchical trees, which are hard to interpret.
  • methods: The method uses a semantic tree, a new tree form derived from a context-free grammar (CFG) that describes the composition rules over different semantic roles. Since the semantic tree is a latent variable with no annotation in regular datasets, it is marginalized out via the inside algorithm and learned to optimize classification performance.
  • results: The method achieves better or competitive results in regular and domain-adaptation classification while generating plausible tree explanations.
    Abstract As the key to sentiment analysis, sentiment composition considers the classification of a constituent via classifications of its contained sub-constituents and rules operated on them. Such compositionality has been widely studied previously in the form of hierarchical trees including untagged and sentiment ones, which are intrinsically suboptimal in our view. To address this, we propose semantic tree, a new tree form capable of interpreting the sentiment composition in a principled way. Semantic tree is a derivation of a context-free grammar (CFG) describing the specific composition rules on difference semantic roles, which is designed carefully following previous linguistic conclusions. However, semantic tree is a latent variable since there is no its annotation in regular datasets. Thus, in our method, it is marginalized out via inside algorithm and learned to optimize the classification performance. Quantitative and qualitative results demonstrate that our method not only achieves better or competitive results compared to baselines in the setting of regular and domain adaptation classification, and also generates plausible tree explanations.

Unsupervised Text Style Transfer with Deep Generative Models

  • paper_url: http://arxiv.org/abs/2308.16584
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Yiming Ju, Kang Liu
  • for: This paper presents a general framework for unsupervised text style transfer with deep generative models.
  • methods: The framework models each sentence-label pair in the non-parallel corpus as partially observed from a complete quadruplet that additionally contains two latent codes representing content and style, learned by exploiting dependencies inside the observed data; a sentence is transferred by manipulating these codes.
  • results: Automatic and human evaluations on three benchmarks show better or competitive results compared to several strong baselines.
    Abstract We present a general framework for unsupervised text style transfer with deep generative models. The framework models each sentence-label pair in the non-parallel corpus as partially observed from a complete quadruplet which additionally contains two latent codes representing the content and style, respectively. These codes are learned by exploiting dependencies inside the observed data. Then a sentence is transferred by manipulating them. Our framework is able to unify previous embedding and prototype methods as two special forms. It also provides a principled perspective to explain previously proposed techniques in the field such as aligned encoder and adversarial training. We further conduct experiments on three benchmarks. Both automatic and human evaluation results show that our methods achieve better or competitive results compared to several strong baselines.

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

  • paper_url: http://arxiv.org/abs/2308.16577
  • repo_url: None
  • paper_authors: Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
  • for: Improving the naturalness and intelligibility of text-to-speech synthesis
  • methods: Uses multi-level contextual information (both inter-utterance and intra-utterance linguistic information) extracted by a hierarchical encoder at the character, utterance, and discourse levels, with a multi-task learning (MTL) decoder that predicts prosodic boundaries.
  • results: Objective evaluations on two datasets show better F1 scores for predicting prosodic word, prosodic phrase, and intonational phrase boundaries, and subjective preference tests indicate improved naturalness of the synthesized speech.
    Abstract For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence the speech interpretation of the target utterance, previous works on PSP mainly focus on utilizing intrautterance linguistic information of the current utterance only. This work proposes to use inter-utterance linguistic information to improve the performance of PSP. Multi-level contextual information, which includes both inter-utterance and intrautterance linguistic information, is extracted by a hierarchical encoder from character level, utterance level and discourse level of the input text. Then a multi-task learning (MTL) decoder predicts prosodic boundaries from multi-level contextual information. Objective evaluation results on two datasets show that our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH). It demonstrates the effectiveness of using multi-level contextual information for PSP. Subjective preference tests also indicate the naturalness of synthesized speeches are improved.

Thesis Distillation: Investigating The Impact of Bias in NLP Models on Hate Speech Detection

  • paper_url: http://arxiv.org/abs/2308.16549
  • repo_url: None
  • paper_authors: Fatma Elsafoury
  • for: This work investigates the impact of bias in NLP models on the task of hate speech detection from three perspectives: explainability, offensive stereotyping bias, and fairness.
  • methods: It examines NLP models through these three lenses to characterize how bias affects hate speech detection.
  • results: The findings suggest that bias in NLP models impacts hate speech detection from all three perspectives, and that unless the social sciences are incorporated into the study of bias in NLP models, current limitations in measuring and mitigating bias cannot be effectively overcome.
    Abstract This paper is a summary of the work in my PhD thesis. In which, I investigate the impact of bias in NLP models on the task of hate speech detection from three perspectives: explainability, offensive stereotyping bias, and fairness. I discuss the main takeaways from my thesis and how they can benefit the broader NLP community. Finally, I discuss important future research directions. The findings of my thesis suggest that bias in NLP models impacts the task of hate speech detection from all three perspectives. And that unless we start incorporating social sciences in studying bias in NLP models, we will not effectively overcome the current limitations of measuring and mitigating bias in NLP models.

Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

  • paper_url: http://arxiv.org/abs/2308.16540
  • repo_url: None
  • paper_authors: Dhananjaya Gowda, Sudarsana Reddy Kadiri, Brad Story, Paavo Alku
  • for: This paper proposes a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis.
  • methods: The method combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the residual sparsity by using the $L_1$ optimization, and (3) it uses time-varying linear prediction analysis over long time windows to impose a continuity constraint on the vocal tract model and hence on the formant trajectories.
  • results: Experiments on a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on supervised deep neural networks).
    Abstract In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g., 10--50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides a single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the residual sparsity by using the $L_1$ optimization and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100--200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools, such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrack

The Smart Data Extractor, a Clinician Friendly Solution to Accelerate and Improve the Data Collection During Clinical Trials

  • paper_url: http://arxiv.org/abs/2308.16537
  • repo_url: None
  • paper_authors: Sophie Quennelle, Maxime Douillet, Lisa Friedlander, Olivia Boyer, Anita Burgun, Antoine Neuraz, Nicolas Garcelon
  • for: Accelerating and improving data collection during clinical trials, avoiding the bias, errors, human labor, and costs of browsing patient files.
  • methods: Proposes a semi-automated data collection system able to extract every type of data, including notes; the Smart Data Extractor pre-populates clinical research forms by following rules.
  • results: In a cross-testing experiment collecting 20 target items for 79 patients, the average time to complete one form was 3'22'' with the Smart Data Extractor versus 6'81'' for manual collection, with fewer mistakes (46 vs. 163 for the whole cohort), offering an easy-to-use, understandable, and agile solution that reduces human effort and yields higher-quality data.
    Abstract In medical research, the traditional way to collect data, i.e. browsing patient files, has been proven to induce bias, errors, human labor and costs. We propose a semi-automated system able to extract every type of data, including notes. The Smart Data Extractor pre-populates clinic research forms by following rules. We performed a cross-testing experiment to compare semi-automated to manual data collection. 20 target items had to be collected for 79 patients. The average time to complete one form was 6'81'' for manual data collection and 3'22'' with the Smart Data Extractor. There were also more mistakes during manual data collection (163 for the whole cohort) than with the Smart Data Extractor (46 for the whole cohort). We present an easy to use, understandable and agile solution to fill out clinical research forms. It reduces human effort and provides higher quality data, avoiding data re-entry and fatigue induced errors.

Link Prediction for Wikipedia Articles as Natural Language Inference Task

  • paper_url: http://arxiv.org/abs/2308.16469
  • repo_url: None
  • paper_authors: Chau-Thang Phan, Quoc-Nam Nguyen, Kiet Van Nguyen
  • for: This paper presents a system for the link prediction task, which is vital to automatically understanding the structure of large knowledge bases, submitted to the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition) with a corpus of 948,233 training samples and 238,265 public test samples.
  • methods: Drawing on recent advances in natural language processing and understanding, it formulates link prediction over Wikipedia articles as a natural language inference (NLI) task, treating the presence of a link between two articles as a premise to be verified against the information in the articles.
  • results: The implementation, based on sentence-pair classification, achieved a Macro F1-score of 0.99996 on the public test set and 1.00000 on the private test set, where the team ranked 3rd with scores equal to the first and second places.
    Abstract Link prediction task is vital to automatically understanding the structure of large knowledge bases. In this paper, we present our system to solve this task at the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition) with a corpus containing 948,233 training and 238,265 for public testing. This paper introduces an approach to link prediction in Wikipedia articles by formulating it as a natural language inference (NLI) task. Drawing inspiration from recent advancements in natural language processing and understanding, we cast link prediction as an NLI task, wherein the presence of a link between two articles is treated as a premise, and the task is to determine whether this premise holds based on the information presented in the articles. We implemented our system based on the Sentence Pair Classification for Link Prediction for the Wikipedia Articles task. Our system achieved 0.99996 Macro F1-score and 1.00000 Macro F1-score for the public and private test sets, respectively. Our team UIT-NLP ranked 3rd in performance on the private test set, equal to the scores of the first and second places. Our code is publicly for research purposes.
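Casting link prediction as NLI reduces to a standard sentence-pair (cross-encoder) classifier; a minimal sketch with Hugging Face `transformers` follows. The checkpoint, example texts, and label convention are placeholders — the competition system fine-tunes its own model on the DSAA-2023 corpus.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # stand-in; fine-tune on the competition corpus
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

premise = "Alan Turing was a pioneer of theoretical computer science."
hypothesis = "This article links to the article 'Computer science'."
inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # [P(no link), P(link)] once the model is fine-tuned
```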

Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

  • paper_url: http://arxiv.org/abs/2308.16463
  • repo_url: https://github.com/HYPJUDY/Sparkles
  • paper_authors: Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Nigel Collier, Yutong Lu
  • for: SparklesChat is designed to handle open-ended dialogues across multiple images, addressing the challenge of maintaining dialogue coherence in multimodal instruction-following tasks.
  • methods: SparklesChat uses a multimodal instruction-following model that integrates text and images, and is trained on the newly introduced SparklesDialogue dataset. The model is evaluated using the SparklesEval benchmark, which assesses conversational competence across multiple images and dialogue turns.
  • results: SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks and scored 8.56 out of 10 on SparklesEval, demonstrating its effectiveness in understanding and reasoning across multiple images and dialogue turns. Qualitative evaluations also showed the model's generality in handling real-world applications.
    Abstract Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources will be available at https://github.com/HYPJUDY/Sparkles.

Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

  • paper_url: http://arxiv.org/abs/2308.16415
  • repo_url: None
  • paper_authors: Kyuhong Shim, Jinkyu Lee, Simyung Chang, Kyuwoong Hwang
  • for: Improving the performance of streaming automatic speech recognition (ASR) models
  • methods: Layer-to-layer knowledge distillation from a non-streaming teacher encoder to a streaming student encoder, using auxiliary non-streaming branches in the student and a special KD loss based on autoregressive predictive coding (APC).
  • results: Significantly reduces the word error rate compared to previous token-probability distillation methods.
    Abstract Streaming automatic speech recognition (ASR) models are restricted from accessing future context, which results in worse performance compared to the non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from the non-streaming to streaming model has been studied, mainly focusing on aligning the output token probabilities. In this paper, we propose a layer-to-layer KD from the teacher encoder to the student encoder. To ensure that features are extracted using the same context, we insert auxiliary non-streaming branches to the student and perform KD from the non-streaming teacher layer to the non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts. Experimental results show that the proposed method can significantly reduce the word error rate compared to previous token probability distillation methods.
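A minimal sketch of the layer-to-layer distillation is given below: each auxiliary non-streaming student layer is matched to the corresponding frozen teacher layer with a feature-matching loss. Plain MSE is a common simplification here; the paper's special APC-based loss that encourages predicting unseen future context is not reproduced.

```python
import torch
import torch.nn.functional as F

def layer_kd_loss(student_feats, teacher_feats):
    """Match each auxiliary non-streaming student layer to the
    corresponding (frozen) non-streaming teacher layer."""
    return sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_feats, teacher_feats))

# Hypothetical shapes: 4 encoder layers, (batch, frames, dim) features.
student = [torch.randn(2, 100, 256, requires_grad=True) for _ in range(4)]
teacher = [torch.randn(2, 100, 256) for _ in range(4)]
layer_kd_loss(student, teacher).backward()
```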